# Freakonometrics

## Tag - Hill

Tuesday, September 18 2012

## Open question on tail index

A short post today on a rather surprising result. I can't figure out how to prove it (and I do not know if the result is valid, or known, so I claim it is an open question). Consider an i.i.d. sequence $\{X_1,\cdots,X_n\}$ with common survival function $\overline{F}$. Assume further that $\overline{F}$ is regularly varying at $\infty$, written $\overline{F}\in RV_{\alpha}$, i.e.

$\lim_{t\rightarrow\infty}\frac{\overline{F}(tx)}{\overline{F}(t)}=x^{\alpha}$

with $\alpha\in(-\infty,0)$. Then the tail of $X$ is of Pareto type, and we can use Hill's estimator to estimate $\alpha$. An interesting result is that if $\overline{F}\in RV_{\alpha}$ and $\overline{G}\in RV_{\beta}$, then $\overline{F}\star\overline{G}$ is also regularly varying, where $\star$ denotes the convolution operator. More precisely, when $\alpha=\beta$ (see Feller (), section VIII.8), $\overline{F}\star\overline{G}$ is regularly varying with the same index.

This can be visualized below, on simulations of Pareto variables, where the tail index is estimated using Hill's estimator. First, we start with one sample, of size either 20 (on the left) or 100 (on the right),

> library(evir)
> n=20
> set.seed(1)
> alpha=1.5
> X=runif(n)^(-1/alpha)
> hill(X)
> abline(h=1.5,col="blue")
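As a quick sanity check on the simulation above (my own addition, not in the original post): since $U^{-1/\alpha}$ has survival function $x^{-\alpha}$ on $(1,\infty)$ when $U$ is uniform, Hill's estimator computed by hand on a large sample should recover $\alpha$. A minimal sketch (variable names are mine):

```r
# Hand-rolled Hill estimate on a large simulated Pareto sample:
# with survival function x^(-alpha), we should recover alpha = 1.5
set.seed(1)
alpha <- 1.5
n <- 1e5
X <- runif(n)^(-1/alpha)            # inverse-transform simulation
Xs <- rev(sort(X))                  # decreasing order statistics
k <- 1000                           # number of upper order statistics
xi.hat <- mean(log(Xs[1:k])) - log(Xs[k+1])
alpha.hat <- 1/xi.hat               # should be close to 1.5
```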

If we generate two (independent) samples and then look at their sum, Hill's estimator does not perform very well (the sum of two independent Pareto variables is no longer Pareto, but only of Pareto type),

> set.seed(1)
> alpha=1.5
> X=runif(n)^(-1/alpha)
> Y=runif(n)^(-1/alpha)
> hill(X+Y)
> abline(h=1.5,col="blue")

The idea is then to use a jackknife-type strategy in order to (artificially) increase the size of our sample. Thus, consider the sums over all pairs of $X_i$'s, i.e. the sample

$\{X_1+X_2,\cdots,X_1+X_{n},X_2+X_3,\cdots,X_{n-1}+X_n\}$

Let us use Hill's estimator on this (much larger) sample.

> XC=NA
> for(i in 1:(n-1)){
+ for(j in (i+1):n){
+ XC=c(XC,X[i]+X[j])
+ }}
> XC=XC[-1]
> hill(XC)
> abline(h=1.5,col="blue")
> abline(h=1.5*2,col="blue",lty=2)
> abline(h=1.5^2,col="blue",lty=3)
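As an aside (a rewriting of my own, not in the original code), the double loop above can be replaced by `combn`, which enumerates all pairs directly; `combn(X, 3, sum)` gives the triplet sums used further below in the same way:

```r
# All pairwise sums X_i + X_j (i < j) in one call, instead of the loops
set.seed(1)
alpha <- 1.5
n <- 20
X <- runif(n)^(-1/alpha)
XC2 <- combn(X, 2, sum)    # choose(20, 2) = 190 pairwise sums
XC3 <- combn(X, 3, sum)    # choose(20, 3) = 1140 triplet sums
```

The result is the same sample as the one built by the loops, without growing a vector one element at a time.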

Here, with 20 observations in the initial sample, it looks like the tail index is $2\alpha$ (on the left). With 100 observations, it looks like it is $\alpha^2$ (on the right). On the graphs above, I have plotted those two horizontal lines. It's odd, isn't it? Of course, I did not really expect $\alpha$, since we do not have an i.i.d. sample: identically distributed, yes, with a regularly varying survival function from what we've mentioned before, but clearly not independent. So Hill's estimator should not be a good estimator of $\alpha$, but it might be a good estimator of some function of $\alpha$.

If we go one step further, and consider all triplets,

$\{X_1+X_2+X_3,\cdots,X_1+X_2+X_{n},X_2+X_3+X_4,\cdots,X_{n-2}+X_{n-1}+X_n\}$

(observe that now the sample size is huge, even if we start with only 20 points). Then it looks like the tail index should be $\alpha^3$ (at least, we can observe a step in Hill's plot around $\alpha^3$).

> XC=NA
> for(i in 1:(n-2)){
+ for(j in (i+1):(n-1)){
+ for(k in (j+1):n){
+ XC=c(XC,X[i]+X[j]+X[k])
+ }}}
> XC=XC[-1]
> hill(XC)
> abline(h=1.5,col="blue")
> abline(h=1.5*3,col="blue",lty=2)
> abline(h=1.5^3,col="blue",lty=3)

My open question is whether there is a general result behind this, or whether it was just a coincidence that those values appear so clearly.

Thursday, February 9 2012

## MAT8886 from tail estimation to risk measure(s) estimation

This week, we conclude the part on extremes with an application of extreme value theory to risk measures. We saw last week that, if we assume that above a threshold $u$ a Generalized Pareto Distribution will fit nicely, then we can use it to derive an estimator of the quantile function (for probability levels $p$ such that the quantile exceeds the threshold)

$\widehat{Q}(p)=u+\frac{\widehat{\beta}}{\widehat{\xi}}\left(\left(\frac{n}{N_u}(1-p)\right)^{-\widehat{\xi}}-1\right)$

where $N_u$ is the number of exceedances above $u$. If the threshold is $u=X_{n-k:n}$, i.e. we keep the $k$ largest observations to fit a GPD, then this estimator can be written

$\widehat{Q}(p)=X_{n-k:n}+\frac{\widehat{\beta}}{\widehat{\xi}}\left(\left(\frac{n}{k}(1-p)\right)^{-\widehat{\xi}}-1\right)$

The code we wrote last week was the following (here based on log-returns of the S&P 500 index; we focus on large losses, i.e. large values of the opposite of the log-returns, plotted below)

> library(tseries)
> X=get.hist.quote("^GSPC")
> T=time(X)
> D=as.POSIXlt(T)
> Y=X$Close
> R=diff(log(Y))
> D=D[-1]
> X=-R
> plot(D,X)
> library(evir)
> GPD=gpd(X,quantile(X,.975))
> xi=GPD$par.ests[1]
> beta=GPD$par.ests[2]
> u=GPD$threshold
> QpGPD=function(p){
+ u+beta/xi*((100/2.5*(1-p))^(-xi)-1)
+ }
> QpGPD(1-1/250)
97.5%
0.04557386
> QpGPD(1-1/2500)
97.5%
0.08925095 
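Note that the constant 100/2.5 in the function above is just $n/N_u$, the inverse of the fraction of observations above the threshold (2.5% here). A slightly more generic version (a sketch of my own, with hypothetical argument names) makes this explicit:

```r
# GPD quantile estimator with the sample size n and the number of
# exceedances Nu passed explicitly, instead of the hard-coded 100/2.5
QpGPD.generic <- function(p, u, xi, beta, n, Nu){
  u + beta/xi * ((n/Nu * (1 - p))^(-xi) - 1)
}
# With 2.5% of the observations above the threshold, n/Nu = 100/2.5 = 40,
# so this reproduces the hard-coded version (arbitrary parameter values)
q.generic   <- QpGPD.generic(1 - 1/250, u = 0.03, xi = 0.3, beta = 0.01,
                             n = 4000, Nu = 100)
q.hardcoded <- 0.03 + 0.01/0.3 * ((100/2.5 * (1/250))^(-0.3) - 1)
```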

This is consistent with the following output, with the return period of a yearly event (one observation out of 250 trading days),

> gpd.q(tailplot(gpd(X,quantile(X,.975))), 1-1/250, ci.type =
+ "likelihood", ci.p = 0.95,like.num = 50)
Lower CI   Estimate   Upper CI
0.04172534 0.04557655 0.05086785 

or the decennial one

> gpd.q(tailplot(gpd(X,quantile(X,.975))), 1-1/2500, ci.type =
+ "likelihood", ci.p = 0.95,like.num = 50)
Lower CI   Estimate   Upper CI
0.07165395 0.08925558 0.13636620 

Note that it is also possible to derive an estimator for another popular risk measure (the quantile is simply the so-called Value-at-Risk), the expected shortfall (or Tail Value-at-Risk), i.e.

$ES(p)=\mathbb{E}\left[X\mid X>Q(p)\right]$

The idea is to write that expression as

$\mathbb{E}\left[X\mid X>Q(p)\right]=Q(p)+\mathbb{E}\left[X-Q(p)\mid X>Q(p)\right]$

so that we recognize the mean excess function (discussed earlier). Thus, assuming again that above $u$ (and therefore above that high quantile) a GPD will fit, the mean excess function is linear, and we can write

$\widehat{ES}(p)=\widehat{Q}(p)+\frac{\widehat{\beta}+\widehat{\xi}\left(\widehat{Q}(p)-u\right)}{1-\widehat{\xi}}$

or equivalently

$\widehat{ES}(p)=\frac{\widehat{Q}(p)}{1-\widehat{\xi}}+\frac{\widehat{\beta}-\widehat{\xi}u}{1-\widehat{\xi}}$

If we substitute the estimator of the quantile into that expression, we get

$\widehat{ES}(p)=u-\frac{\widehat{\beta}}{\widehat{\xi}}+\frac{\widehat{\beta}}{\widehat{\xi}\left(1-\widehat{\xi}\right)}\left(\frac{n}{k}(1-p)\right)^{-\widehat{\xi}}$

The code is here

> EpGPD=function(p){
+ u-beta/xi+beta/xi/(1-xi)*(100/2.5*(1-p))^(-xi)
+ }
> EpGPD(1-1/250)
97.5%
0.06426508
> EpGPD(1-1/2500)
97.5%
0.1215077 
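One can check numerically that the two expressions for the expected shortfall, the mean-excess form and the substituted form used in the code, agree (a quick sanity check of my own, with arbitrary parameter values):

```r
# Numerical check that the two expected shortfall expressions coincide:
# ES = Q/(1-xi) + (beta - xi*u)/(1-xi) equals the substituted form
u <- 0.03; xi <- 0.3; beta <- 0.01; p <- 1 - 1/250; ratio <- 100/2.5
Qp  <- u + beta/xi * ((ratio * (1 - p))^(-xi) - 1)   # GPD quantile estimator
ES1 <- Qp/(1 - xi) + (beta - xi*u)/(1 - xi)          # mean-excess form
ES2 <- u - beta/xi + beta/xi/(1 - xi) * (ratio * (1 - p))^(-xi)
```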

An alternative is to use Hill's approach (used to derive Hill's estimator). Assume here that $\overline{F}(x)=x^{-1/\xi}\ell(x)$, where $\ell$ is a slowly varying function. Then, for all $x>t$,

$\frac{\overline{F}(x)}{\overline{F}(t)}=\left(\frac{x}{t}\right)^{-1/\xi}\frac{\ell(x)}{\ell(t)}$

Since $\ell$ is a slowly varying function, it seems natural to assume that the ratio $\ell(x)/\ell(t)$ is almost 1 (which is true asymptotically). Thus

$\overline{F}(x)\approx\overline{F}(t)\left(\frac{x}{t}\right)^{-1/\xi}$

i.e. if we invert that function, taking $t=X_{n-k:n}$ and $\overline{F}(t)\approx k/n$, we derive an estimator for the quantile function

$\widehat{Q}(p)=X_{n-k:n}\left(\frac{n}{k}(1-p)\right)^{-\widehat{\xi}}$

which can also be written

$\widehat{Q}(p)=X_{n-k:n}+X_{n-k:n}\left(\left(\frac{n}{k}(1-p)\right)^{-\widehat{\xi}}-1\right)$

(which is close to the relation we derived using a GPD model, with $\widehat{\beta}$ replaced by $\widehat{\xi}X_{n-k:n}$). Here the code is

> k=trunc(length(X)*.025)
> Xs=rev(sort(as.numeric(X)))
> xiHill=mean(log(Xs[1:k]))-log(Xs[k+1])
> u=Xs[k+1]
> QpHill=function(p){
+ u+u*((100/2.5*(1-p))^(-xiHill)-1)
+ }

with the following Hill plot

For yearly and decennial events, we have here

> QpHill(1-1/250)
[1] 0.04580548
> QpHill(1-1/2500)
[1] 0.1010204
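To see that this Hill-based quantile estimator behaves reasonably, here is a quick simulation check of my own on exact Pareto data, where the true quantile is known in closed form:

```r
# On Pareto data with survival x^(-2), the true quantile is (1-p)^(-1/2);
# the Hill-based estimator should be close on a large sample
set.seed(123)
n <- 1e5
X <- runif(n)^(-1/2)               # Pareto, alpha = 2, xi = 1/2
k <- trunc(n * 0.025)
Xs <- rev(sort(X))
xiHill <- mean(log(Xs[1:k])) - log(Xs[k+1])
u <- Xs[k+1]
p <- 1 - 1/250
Qhat  <- u * (n/k * (1 - p))^(-xiHill)   # Hill-based quantile estimate
Qtrue <- (1 - p)^(-1/2)                  # exact quantile, about 15.8
```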

Those quantities seem consistent (they are quite close to each other), but they differ from the empirical quantiles,

> quantile(X,1-1/250)
99.6%
0.04743929
> quantile(X,1-1/2500)
99.96%
0.09054039 

Note that it is also possible to use built-in functions to derive estimators of those quantities,

> riskmeasures(gpd(X,quantile(X,.975)),1-1/250)
p   quantile      sfall
[1,] 0.996 0.04557655 0.06426859
> riskmeasures(gpd(X,quantile(X,.975)),1-1/2500)
p   quantile     sfall
[1,] 0.9996 0.08925558 0.1215137

(in this application, we have assumed that log-returns were independent and identically distributed... which might be a rather strong assumption).

And to conclude this post (and the session on extremes), here is the schedule for the coming weeks,

Thursday, January 26 2012

## MAT8886 extremes and R

These data were collected at Copenhagen Reinsurance and comprise 2167 fire losses over the period 1980 to 1990. They have been adjusted for inflation to reflect 1985 values and are expressed in millions of Danish Kroner. Note that it is possible to work with a disaggregated version of the same data, where the total claim has been divided into a building loss, a loss of contents and a loss of profits.

> base1=read.table(
+ "http://freakonometrics.free.fr/danish-univariate.txt",
+ header=TRUE)

Consider here the first dataset (we deal - so far - with univariate extremes),

> X=base1$Loss.in.DKM
> D=as.Date(as.character(base1$Date),"%m/%d/%Y")
> plot(D,X,type="h")

The graph is the following,

A natural idea is then to plot the empirical survival function against the observations, on a log-log scale, i.e. the points

$\left(\log X_{i:n},\log\frac{n-i+1}{n+1}\right)$

i.e.
> Xs=sort(X)
> logXs=rev(log(Xs))
> n=length(X)
> plot(log(Xs),log((n:1)/(n+1)))

Points are on a straight line here. The slope can be obtained using a linear regression,

> B=data.frame(X=log(Xs),Y=log((n:1)/(n+1)))
> reg=lm(Y~X,data=B)
> summary(reg)

Call:
lm(formula = Y ~ X, data = B)

Residuals:
Min       1Q   Median       3Q      Max
-0.59999 -0.00777  0.00878  0.02461  0.20309

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.089442   0.001572   56.88   <2e-16 ***
X           -1.382181   0.001477 -935.55   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04928 on 2165 degrees of freedom
Multiple R-squared: 0.9975,	Adjusted R-squared: 0.9975
F-statistic: 8.753e+05 on 1 and 2165 DF,  p-value: < 2.2e-16

> reg=lm(Y~X,data=B[(n-500):n,])
> summary(reg)

Call:
lm(formula = Y ~ X, data = B[(n - 500):n, ])

Residuals:
Min       1Q   Median       3Q      Max
-0.48502 -0.02148 -0.00900  0.01626  0.35798

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.186188   0.010033   18.56   <2e-16 ***
X           -1.432767   0.005105 -280.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07751 on 499 degrees of freedom
Multiple R-squared: 0.9937,	Adjusted R-squared: 0.9937
F-statistic: 7.878e+04 on 1 and 499 DF,  p-value: < 2.2e-16

> reg=lm(Y~X,data=B[(n-100):n,])
> summary(reg)

Call:
lm(formula = Y ~ X, data = B[(n - 100):n, ])

Residuals:
Min       1Q   Median       3Q      Max
-0.33396 -0.03743  0.02279  0.04754  0.62946

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.67377    0.06777   9.942   <2e-16 ***
X           -1.58536    0.02240 -70.772   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1299 on 99 degrees of freedom
Multiple R-squared: 0.9806,	Adjusted R-squared: 0.9804
F-statistic:  5009 on 1 and 99 DF,  p-value: < 2.2e-16 
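On simulated Pareto data, this regression-based slope can be checked against a known tail index (a small self-contained sketch of my own, in the spirit of the regressions above):

```r
# For a Pareto sample with survival x^(-alpha), the log-log survival plot
# has slope close to -alpha (here alpha = 1.5)
set.seed(1)
alpha <- 1.5
n <- 2000
X <- runif(n)^(-1/alpha)
Xs <- sort(X)
B <- data.frame(X = log(Xs), Y = log((n:1)/(n+1)))
slope <- coef(lm(Y ~ X, data = B))[2]   # should be close to -1.5
```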

The slope here is related to the tail index of the distribution. Consider some heavy-tailed distribution, i.e. $\overline{F}\in RV_{-\alpha}$, so that $\overline{F}(x)=x^{-\alpha}\ell(x)$, where $\ell$ is some slowly varying function. Equivalently, there exists a slowly varying function $\ell^{\star}$ such that $Q(1-u)=u^{-1/\alpha}\ell^{\star}(u)$. Then

$\log\overline{F}(x)=-\alpha\log x+\log\ell(x)$

i.e. since a natural estimator for $\overline{F}(X_{i:n})$ is $\frac{n-i+1}{n+1}$, the slope of the straight line is the opposite of the tail index $\alpha$. Hill's estimator of $\xi=1/\alpha$ is such a slope estimator, based on the $k$ largest observations and on the assumption that the slowly varying term is almost constant (i.e. $\ell(tx)/\ell(t)\rightarrow1$ as $t\rightarrow\infty$), namely

$\widehat{\xi}_k=\frac{1}{k}\sum_{i=1}^{k}\log X_{n-i+1:n}-\log X_{n-k:n}$

Note that if $k\rightarrow\infty$, but not too fast, i.e. $k/n\rightarrow0$ as $n\rightarrow\infty$, then $\widehat{\xi}_k\rightarrow\xi$ in probability (one can even get almost sure convergence with stronger convergence assumptions). Further,

$\sqrt{k}\left(\widehat{\xi}_k-\xi\right)\xrightarrow{d}\mathcal{N}(0,\xi^2)$

Based on that (asymptotic) distribution, it is possible to get an (asymptotic) confidence interval for $\xi$,

> xi=1/(1:n)*cumsum(logXs)-logXs
> xise=1.96/sqrt(1:n)*xi
> plot(1:n,xi,type="l",ylim=range(c(xi+xise,xi-xise)),
+ xlab="",ylab="",)
> polygon(c(1:n,n:1),c(xi+xise,rev(xi-xise)),
+ border=NA,col="lightblue")
> lines(1:n,xi+xise,col="red",lwd=1.5)
> lines(1:n,xi-xise,col="red",lwd=1.5)
> lines(1:n,xi,lwd=1.5)
> abline(h=0,col="grey")

It is also possible to work with $\widehat{\alpha}_k=1/\widehat{\xi}_k$; then $\widehat{\alpha}_k\rightarrow\alpha$ as $k\rightarrow\infty$ (and again with additional assumptions on the rate of convergence), and

$\sqrt{k}\left(\widehat{\alpha}_k-\alpha\right)\xrightarrow{d}\mathcal{N}(0,\alpha^2)$

(obtained using the delta-method). Again, we can use that result to derive (asymptotic) confidence intervals,

> alpha=1/xi
> alphase=1.96/sqrt(1:n)/xi
> YL=c(0,3)
> plot(1:n,alpha,type="l",ylim=YL,xlab="",ylab="",)
> polygon(c(1:n,n:1),c(alpha+alphase,rev(alpha-alphase)),
+ border=NA,col="lightblue")
> lines(1:n,alpha+alphase,col="red",lwd=1.5)
> lines(1:n,alpha-alphase,col="red",lwd=1.5)
> lines(1:n,alpha,lwd=1.5)
> abline(h=0,col="grey")

The Dekkers-Einmahl-de Haan (moment) estimator is

$\widehat{\xi}_k=H_k^{(1)}+1-\frac{1}{2}\left(1-\frac{\left(H_k^{(1)}\right)^2}{H_k^{(2)}}\right)^{-1}$

where, for $j=1,2$,

$H_k^{(j)}=\frac{1}{k}\sum_{i=1}^{k}\left(\log X_{n-i+1:n}-\log X_{n-k:n}\right)^j$

Then (given again conditions on the speed of convergence, i.e. $k\rightarrow\infty$ with $k/n\rightarrow0$ as $n\rightarrow\infty$), the estimator is consistent and asymptotically Gaussian.
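This moment estimator takes only a few lines of code; here is a quick sketch of my own (variable names are mine), checked on simulated Pareto data where the true index is $\xi=1/\alpha=2/3$:

```r
# Dekkers-Einmahl-de Haan (moment) estimator for a given number k of
# upper order statistics, on simulated Pareto data with alpha = 1.5
set.seed(1)
alpha <- 1.5
X <- runif(1e5)^(-1/alpha)
Xs <- rev(sort(X))
k <- 1000
L <- log(Xs[1:k]) - log(Xs[k+1])   # log-excesses over the (k+1)-th largest
H1 <- mean(L)                      # first moment H^(1)
H2 <- mean(L^2)                    # second moment H^(2)
xi.mom <- H1 + 1 - 0.5/(1 - H1^2/H2)   # should be close to 2/3
```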

Finally, Pickands' estimator is

$\widehat{\xi}_k^{P}=\frac{1}{\log 2}\log\frac{X_{n-k+1:n}-X_{n-2k+1:n}}{X_{n-2k+1:n}-X_{n-4k+1:n}}$

and it is possible to prove that, as $k\rightarrow\infty$,

$\sqrt{k}\left(\widehat{\xi}_k^{P}-\xi\right)\xrightarrow{d}\mathcal{N}\left(0,\frac{\xi^2\left(2^{2\xi+1}+1\right)}{\left(2\left(2^{\xi}-1\right)\log 2\right)^2}\right)$

Here the code is
> Xs=rev(sort(X))
> xi=1/log(2)*log( (Xs[seq(1,length=trunc(n/4),by=1)]-
+ Xs[seq(2,length=trunc(n/4),by=2)])/
+ (Xs[seq(2,length=trunc(n/4),by=2)]-Xs[seq(4,
+ length=trunc(n/4),by=4)]) )
> xise=1.96/sqrt(seq(1,length=trunc(n/4),by=1))*
+ sqrt( xi^2*(2^(2*xi+1)+1)/((2*(2^xi-1)*log(2))^2))
> plot(seq(1,length=trunc(n/4),by=1),xi,type="l",
+ ylim=c(0,3),xlab="",ylab="",)
> polygon(c(seq(1,length=trunc(n/4),by=1),rev(seq(1,
+ length=trunc(n/4),by=1))),c(xi+xise,rev(xi-xise)),
+ border=NA,col="lightblue")
> lines(seq(1,length=trunc(n/4),by=1),
+ xi+xise,col="red",lwd=1.5)
> lines(seq(1,length=trunc(n/4),by=1),
+ xi-xise,col="red",lwd=1.5)
> lines(seq(1,length=trunc(n/4),by=1),xi,lwd=1.5)
> abline(h=0,col="grey")

It is also possible to use maximum likelihood techniques to fit a GPD distribution over a high threshold.
> library(evd)
> library(evir)
> gpd(X,5)
$n
[1] 2167

$threshold
[1] 5

$p.less.thresh
[1] 0.8827873

$n.exceed
[1] 254

$method
[1] "ml"

$par.ests
       xi      beta
0.6320499 3.8074817

$par.ses
       xi      beta
0.1117143 0.4637270

$varcov
            [,1]        [,2]
[1,]  0.01248007 -0.03203283
[2,] -0.03203283  0.21504269

$information
[1] "observed"

$converged
[1] 0

$nllh.final
[1] 754.1115

attr(,"class")
[1] "gpd"

or equivalently (or almost), with gpd.fit from the ismev package,

> library(ismev)
> gpd.fit(X,5)
$threshold
[1] 5

$nexc
[1] 254

$conv
[1] 0

$nllh
[1] 754.1115

$mle
[1] 3.8078632 0.6315749

$rate
[1] 0.1172127

$se
[1] 0.4636270 0.1116136

The interest of the latter function is that it is possible to visualize the profile likelihood of the tail index,

> gpd.profxi(gpd.fit(X,5),xlow=0,xup=3)

or
> gpd.profxi(gpd.fit(X,20),xlow=0,xup=3)

Hence, it is possible to plot the maximum likelihood estimator of the tail index, as a function of the threshold (including a confidence interval),
> GPDE=Vectorize(function(u){gpd(X,u)$par.ests[1]})
> GPDS=Vectorize(function(u){gpd(X,u)$par.ses[1]})
> u=c(seq(2,10,by=.5),seq(11,25))
> XI=GPDE(u)
> XIS=GPDS(u)
> plot(u,XI,ylim=c(0,2))
> segments(u,XI-1.96*XIS,u,XI+
+ 1.96*XIS,lwd=2,col="red")

Finally, it is possible to use block-maxima techniques.
> gev.fit(X)
$conv
[1] 0

$nllh
[1] 3392.418

$mle
[1] 1.4833484 0.5930190 0.9168128

$se
[1] 0.01507776 0.01866719 0.03035380

The estimator of the tail index is here the last coefficient, on the right.
Since it is rather difficult to install packages in classrooms, here is the source of the R codes used here (to fit a GPD on exceedances)
> source("http://freakonometrics.blog.free.fr/public/code/gpd.R")
Next time, we will discuss how to use those estimators.

Wednesday, September 29 2010

## Detecting distributions with infinite mean

In a post I published a few months ago (in French, here, based on some old paper, there), I mentioned a statistical procedure to test whether the underlying distribution of an i.i.d. sample has a finite mean (based on extreme value results). Since I just used it on a small dataset (yes, with real data), I decided to post the R code, since it is rather simple to use. But instead of working on that dataset, let us see what happens on simulated samples. Consider $n=200$ observations generated from a Pareto distribution (here $\alpha=2$, as a start),
> a=2
> X=runif(200)^(-1/a)
Here, we will use the package developed by Mathieu Ribatet,
> library(RFA)
Loading required package: tcltk
Loading Tcl/Tk... done
Warning message:
package 'RFA' was compiled under R version 2.10.1
• Using the Generalized Pareto Distribution (and an LR test)

A first idea is to fit a GPD distribution to the observations exceeding some threshold $u$; recall that a GPD has a finite mean if and only if $\xi<1$. Since we would like to test $H_0:\xi\geq1$ (against the assumption that the expected value is finite, i.e. $H_1:\xi<1$), it is natural to consider the likelihood ratio

$\Lambda=\frac{\sup_{\xi\geq1,\beta}\mathcal{L}(\xi,\beta)}{\sup_{\xi,\beta}\mathcal{L}(\xi,\beta)}$

Under the null hypothesis, the distribution of $-2\log\Lambda$ should be a chi-square distribution with one degree of freedom. As mentioned here, the significance level is attained with a higher accuracy by employing a Bartlett correction (there). But let us make it as simple as possible for the blog, and use the chi-square distribution to derive the p-value. Since it is rather difficult to select an appropriate threshold, it can be natural (as with Hill's estimator) to consider $u=X_{n-k:n}$, and thus to fit a GPD on the $k$ largest values, and then to plot everything on a graph (like the Hill plot),
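Before running the package-based loop, here is a self-contained sketch of that likelihood-ratio test for a single threshold (my own addition: RFA's fitgpd is replaced by optim on a hand-written GPD negative log-likelihood, and all names are mine):

```r
# Likelihood-ratio test of xi = 1 written from scratch: fit a GPD to the
# exceedances by maximum likelihood, once freely and once with xi fixed at 1
set.seed(1)
a <- 2
X <- runif(200)^(-1/a)
u <- sort(X, decreasing = TRUE)[51]   # threshold = 51st largest value
Z <- X[X > u] - u                     # the 50 exceedances
nll <- function(par, xi = NULL){      # negative GPD log-likelihood
  beta <- exp(par[1])                 # log-parametrization keeps beta > 0
  if (is.null(xi)) xi <- par[2]
  if (any(1 + xi*Z/beta <= 0)) return(1e10)
  length(Z)*log(beta) + (1 + 1/xi)*sum(log(1 + xi*Z/beta))
}
free <- optim(c(0, 0.5), nll)                               # (log beta, xi) free
cons <- optimize(function(lb) nll(lb, xi = 1), c(-10, 10))  # xi fixed at 1
LR   <- 2*(cons$objective - free$value)   # likelihood ratio statistic
pval <- 1 - pchisq(LR, df = 1)            # chi-square approximation
```

With $\alpha=2$ (so $\xi=0.5$), a small p-value is expected, i.e. the infinite-mean hypothesis should be rejected.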
> Xs=rev(sort(X))
> s=0;G=rep(NA,length(Xs)-14);Gsd=G;LR=G;pLR=G
> for(i in length(X):15){
+ s=s+1
+ FG=fitgpd(X,Xs[i],method="mle")
+ FGc=fitgpd(X,Xs[i],method="mle",shape=1)
+ G[s]=FG$estimate[2]
+ Gsd[s]=FG$std.err[2]
+ LR[s]=FGc$deviance-FG$deviance
+ pLR[s]=1-pchisq(LR[s],df=1)
+ }

Here we keep the estimated value of the tail index, and the associated standard deviation of the estimator, to draw a confidence interval (assuming that the maximum likelihood estimator is Gaussian, which is correct only when $n$ is extremely large). We also calculate the deviance of the model, the deviance of the constrained model (with shape $\xi=1$), and the difference, which is the likelihood ratio statistic. Then we calculate the p-value (since, under $H_0$, the likelihood ratio statistic has a chi-square distribution). If $\alpha=2$, we have the following graph, with on top the p-value (which is almost null here), and below the estimation of the tail index based on the $k$ largest values (with a confidence interval for the estimator). If $\alpha=1.5$ (finite mean, but infinite variance), we get a similar picture, i.e. for those two models, we clearly reject the assumption of an infinite mean (even if, when $\alpha$ gets too close to 1, we should consider thresholds large enough). On the other hand, if $\alpha=1$ (i.e. infinite mean), we clearly accept the assumption of an infinite mean (whatever the threshold).

• Using Hill's estimator

An alternative could be to use Hill's estimator (with Alexander McNeil's package). See here for more details on that estimator. The test is simply based on the confidence interval derived from the (asymptotic) normal distribution of Hill's estimator,

> library(evir)
> Xs=rev(sort(X))
> HILL=hill(X)
> G=rev(HILL$y)
> Gsd=rev(G/sqrt(HILL$x))
> pLR=1-pnorm(rep(1,length(G)),mean=G,sd=Gsd)

Again, if $\alpha=2$, we clearly reject the assumption of an infinite mean, and similarly if $\alpha=1.5$ (finite mean, but infinite variance). Here, the test is more robust than the one based on the GPD. And if $\alpha=1$ (i.e. infinite mean), again we accept $H_0$. Note that if $\alpha=0.7$, it is still possible to run the procedure, and hopefully it suggests that the underlying distribution has an infinite mean (here without any doubt). Now you need to wait a few days to see some practical applications of the idea (there was one in the paper mentioned above actually, on business interruption insurance losses).

Friday, June 18 2010

## Econometrica and Ouest France: same fight?

A short post today to present two or three little graphs (unfortunately, I have not yet found how to push the analogy further...).

• Research and the impact factor of journals

In the system used to evaluate researchers (I could even say teacher-researchers), one is evaluated simply according to the journals in which one manages to place one's papers. Journals are indeed rated and ranked, the best-known ranking being the impact factor (I mentioned it last year, when bonuses were paid to some researchers in Lyon, here). A few journals have a very high impact factor, and then come hundreds or even thousands of specialized journals. The distribution of this impact factor looks as follows.

As several authors have noted (here or there), the so-called Zipf law could well reproduce this distribution (I discussed it more than a year ago, in a post on the law of George Kingsley Zipf (here), written shortly after playing with Benford's law (here and there)).
• The daily, weekly and monthly press, and circulation figures

The research world regularly tries to give the illusion of a brilliant self-regulation, with recognition by one's peers, anonymous referees for the papers submitted to journals, etc. But the research world is not that far from shopkeepers' quarrels over press circulation figures... In fact, if we look at what is called the circulation of the paid technical and professional press (here) and of the paid general-public press (there), we find almost exactly the same law... In short, publishing in a journal with a large circulation means you are a better researcher... or something like that. The code is here, for those who have doubts,

> presse=read.table(
+ "http://perso.univ-rennes1.fr/arthur.charpentier/presses.csv",
+ sep=";",header=TRUE)
# presse=read.table("http://perso.univ-rennes1.fr/arthur.charpentier/Science_Journal_Ranking_Version_2003.csv",sep=";",header=TRUE)
> X=as.numeric(as.character(presse$Diffusion))
> X=X[is.na(X)==FALSE]
> Xs=sort(X)
> plot(length(X):1,Xs,col="blue",xlab="Ranking",ylab="Tirage (diffusion France payéé)")
> plot(log(length(X):1),Xs,col="blue",xlab="Ranking (log)",ylab="Tirage (diffusion France payéé)")

• Links with extreme value theory

In fact, Zipf's law is simply a Pareto distribution in the discrete case (yes, researchers have a gift for giving different names to identical objects, which makes it possible to write several papers without anyone suspecting anything... for those who doubt it, just cross-search on Google with Zipf, Pareto, power law: you will find plenty of people telling exactly the same story). In short, the Pareto distribution gives very similar results,

including the exponential distribution, which is a limiting case of the Pareto distribution.

We can actually go a bit further, by estimating the power exponent with Hill's estimator (discussed there),

which confirms that the distributions are close (to power laws), even if, for the French press, the index is close to 1.5 (on the right), while for the academic press it is of the order of 2 (on the left), with, however, an explosion in the extremes (the 30 best journals visibly having much too large an impact).
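The rank-size relation behind these plots can be checked on simulated data (a sketch of my own, not in the original post): for a power-law sample with survival function $x^{-\alpha}$, the observation of rank $r$ behaves like $(r/n)^{-1/\alpha}$, so log-size against log-rank is again linear, with slope $-1/\alpha$.

```r
# Rank-size (Zipf) plot for simulated Pareto data: regressing log(size)
# on log(rank) gives a slope close to -1/alpha (here alpha = 2)
set.seed(1)
alpha <- 2
n <- 5000
X <- runif(n)^(-1/alpha)
Xs <- sort(X, decreasing = TRUE)      # sizes, largest first
slope <- coef(lm(log(Xs) ~ log(seq_len(n))))[2]   # about -0.5
```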