Freakonometrics

To content | To menu | To search

Tuesday, September 18 2012

Open question on tail index

A short post today on a rather surprising result. I can't figure out how to prove it (I don not know if the result is valid, or known, so I claim it is an open question). Consider an i.i.d. sequence http://latex.codecogs.com/gif.latex?\{X_1,\cdots,X_n\} with common survival function http://latex.codecogs.com/gif.latex?\overline{F}. Assume further that http://latex.codecogs.com/gif.latex?\overline{F} is regularly varying at http://latex.codecogs.com/gif.latex?\infty, written http://latex.codecogs.com/gif.latex?\overline{F}\in%20RV_{\alpha} , i.e.

http://latex.codecogs.com/gif.latex?\lim_{t\rightarrow\infty}\frac{\overline{F}(tx)}{\overline{F}(t)}=x^{\alpha}

with http://latex.codecogs.com/gif.latex?\alpha\in(-\infty,0). Then the tail of http://latex.codecogs.com/gif.latex?X is Pareto type, and we can use Hill's estimator to estimate http://latex.codecogs.com/gif.latex?\alpha. An interesting result is that if http://latex.codecogs.com/gif.latex?\overline{F}\in%20RV_{\alpha} and http://latex.codecogs.com/gif.latex?\overline{G}\in%20RV_{\beta}, then http://latex.codecogs.com/gif.latex?\overline{F}\star\overline{G} is also regularly varying, where http://latex.codecogs.com/gif.latex?\star denote the convolution operator. More precisely, when http://latex.codecogs.com/gif.latex?\alpha=\beta, then (see Feller (), section VIII.8), then http://latex.codecogs.com/gif.latex?\overline{F}\star\overline{G} is regularly varying with the same index.

This can be visualized below, on simulation of Pareto variables, where the tail index is estimated using Hill's estimator. First, we start we one sample, either of size 20 (on the left) or 100 (on the right),

> library(evir)
> n=20
> set.seed(1)
> alpha=1.5
> X=runif(n)^(-1/alpha)
> hill(X)
> abline(h=1.5,col="blue")

If we generate two (independent) samples, and then look at the sum, Hill's estimator does not perform very well (the sum of two independent Pareto variates is no longer Pareto, but only Pareto type) we have

> set.seed(1)
> alpha=1.5
> X=runif(n)^(-1/alpha)
> Y=runif(n)^(-1/alpha)
> hill(X+Y)
> abline(h=1.5,col="blue")

The idea is then to use a Jackknife strategy in order to (artificially) increase the size of our sample. Thus, consider sums on all pairs of all http://latex.codecogs.com/gif.latex?X_i's, i.e. sample

http://latex.codecogs.com/gif.latex?\{X_1+X_2,\cdots,X_1+X_{n},X_2+X_3,\cdots,X_{n-1}+X_n\}

Let us use Hill's estimator on this (much larger) sample.

> XC=NA
> for(i in 1:(n-1)){
+ for(j in (i+1):n){
+ XC=c(XC,X[i]+X[j])
+ }}
> XC=XC[-1]
> hill(XC)
> abline(h=1.5,col="blue")
> abline(h=1.5*2,col="blue",lty=2)
> abline(h=1.5^2,col="blue",lty=3)

Here, with 20 observations from the initial sample, it looks like the tail index is http://latex.codecogs.com/gif.latex?2\alpha (on the left). With 100 observations, it looks like it is http://latex.codecogs.com/gif.latex?\alpha^2 (on the right). On the graphs above, I have plotted those two horizontal lines. It's odd, isn't it ? Of course, I did not really expect http://latex.codecogs.com/gif.latex?\alpha since we do not have an i.i.d. sample. Identically distributed yes, with a regularly varying survival function from what we've mentioned before. But clearly not independent. So Hill's estimator should not be a good estimator of http://latex.codecogs.com/gif.latex?\alpha, but it might be a good estimator of some function of http://latex.codecogs.com/gif.latex?\alpha

If we go one step further, an consider all triplets,

http://latex.codecogs.com/gif.latex?\{X_1+X_2+X_3,\cdots,X_1+X_2+X_{n},X_2+X_3+X_4,\cdots,X_{n-2}+X_{n-1}+X_n\}

(observe that now, the sample size is huge, even if we start with only 20 points). Then, it looks like the tail index should be http://latex.codecogs.com/gif.latex?\alpha^3 (at least, we can observe a step of Hill's plot around http://latex.codecogs.com/gif.latex?\alpha^3). 

> XC=NA
> for(i in 1:(n-2)){
+ for(j in (i+1):(n-1)){
+ for(k in (j+1):n){
+ XC=c(XC,X[i]+X[j]+X[k])
+ }}}
> XC=XC[-1]
> hill(XC)
> abline(h=1.5,col="blue")
> abline(h=1.5*3,col="blue",lty=2)
> abline(h=1.5^3,col="blue",lty=3)

My open question is whether there is general result behind, or if it was just a coincidence that those values appear so clearly.

Thursday, February 9 2012

MAT8886 from tail estimation to risk measure(s) estimation

This week, we conclude the part on extremes with an application of extreme value theory to risk measures. We have seen last week that, if we assume that above a threshold http://freakonometrics.blog.free.fr/public/perso5/qt01.gif, a Generalized Pareto Distribution will fit nicely, then we can use it to derive an estimator of the quantile function (for percentages such that the quantile is larger than the threshold)

http://freakonometrics.blog.free.fr/public/perso5/qt03.gif

It the threshold is http://freakonometrics.blog.free.fr/public/perso5/qt02.gif, i.e. we keep the http://freakonometrics.blog.free.fr/public/perso5/qt04.gif largest observations to fit a GPD, then this estimator can be written

http://freakonometrics.blog.free.fr/public/perso5/qt06.gif

The code we wrote last week was the following (here based on log-returns of the SP500 index, and we focus on large losses, i.e. large values of the opposite of log returns, plotted below)

> library(tseries)
> X=get.hist.quote("^GSPC")
> T=time(X)
> D=as.POSIXlt(T)
> Y=X$Close
> R=diff(log(Y))
> D=D[-1]
> X=-R
> plot(D,X)
> library(evir)
> GPD=gpd(X,quantile(X,.975))
> xi=GPD$par.ests[1]
> beta=GPD$par.ests[2]
> u=GPD$threshold
> QpGPD=function(p){
+ u+beta/xi*((100/2.5*(1-p))^(-xi)-1)
+ }
> QpGPD(1-1/250)
97.5%
0.04557386
> QpGPD(1-1/2500)
97.5%
0.08925095 

This is similar with the following outputs, with the return period of a yearly event (one observation out of 250 trading days)

> gpd.q(tailplot(gpd(X,quantile(X,.975))), 1-1/250, ci.type =
+ "likelihood", ci.p = 0.95,like.num = 50)
Lower CI   Estimate   Upper CI
0.04172534 0.04557655 0.05086785 

or the decennial one

> gpd.q(tailplot(gpd(X,quantile(X,.975))), 1-1/2500, ci.type =
+ "likelihood", ci.p = 0.95,like.num = 50)
Lower CI   Estimate   Upper CI
0.07165395 0.08925558 0.13636620 

Note that it is also possible to derive an estimator for another population risk measure (the quantile is simply the so-called Value-at-Risk), the expected shortfall (or Tail Value-at-Risk), i.e.

http://freakonometrics.blog.free.fr/public/perso5/qt10.gif

The idea is to write that expression

http://freakonometrics.blog.free.fr/public/perso5/qt11.gif

so that we recognize the mean excess function (discussed earlier). Thus, assuming again that above http://freakonometrics.blog.free.fr/public/perso5/qt01.gif (and therefore above that high quantile) a GPD will fit, we can write

http://freakonometrics.blog.free.fr/public/perso5/qt12.gif

or equivalently

http://freakonometrics.blog.free.fr/public/perso5/qt13.gif

If we substitute estimators to unknown quantities on that expression, we get

http://freakonometrics.blog.free.fr/public/perso5/qt09.gif

The code is here

> EpGPD=function(p){
+ u-beta/xi+beta/xi/(1-xi)*(100/2.5*(1-p))^(-xi)
+ }
> EpGPD(1-1/250)
97.5%
0.06426508
> EpGPD(1-1/2500)
97.5%
0.1215077 

An alternative is to use Hill's approach (used to derive Hill's estimator). Assume here that http://freakonometrics.blog.free.fr/public/perso5/qt20.gif, where http://freakonometrics.blog.free.fr/public/perso5/qt21.gif is a slowly varying function. Then, for all http://freakonometrics.blog.free.fr/public/perso5/qt23.gif,

http://freakonometrics.blog.free.fr/public/perso5/qt24.gif

Since http://freakonometrics.blog.free.fr/public/perso5/qt21.gif is a slowly varying function, it seem natural to assume that this ratio is almost 1 (which is true asymptotically). Thus

http://freakonometrics.blog.free.fr/public/perso5/qt25.gif

i.e. if we invert that function, we derive an estimator for the quantile function

http://freakonometrics.blog.free.fr/public/perso5/qt26.gif

which can also be written

http://freakonometrics.blog.free.fr/public/perso5/qt07.gif

(which is close to the relation we derived using a GPD model). Here the code is

> k=trunc(length(X)*.025)
> Xs=rev(sort(as.numeric(X)))
> xiHill=mean(log(Xs[1:k]))-log(Xs[k+1])
> u=Xs[k+1]
> QpHill=function(p){
+ u+u*((100/2.5*(1-p))^(-xiHill)-1)
+ }

with the following Hill plot

For yearly and decennial events, we have here

> QpHill(1-1/250)
[1] 0.04580548
> QpHill(1-1/2500)
[1] 0.1010204

Those quantities seem consistent since they are quite close, but they are different compared with empirical quantiles,

> quantile(X,1-1/250)
99.6%
0.04743929
> quantile(X,1-1/2500)
99.96%
0.09054039 

Note that it is also possible to use some functions to derive estimators of those quantities,

> riskmeasures(gpd(X,quantile(X,.975)),1-1/250)
p   quantile      sfall
[1,] 0.996 0.04557655 0.06426859
> riskmeasures(gpd(X,quantile(X,.975)),1-1/2500)
p   quantile     sfall
[1,] 0.9996 0.08925558 0.1215137

(in this application, we have assumed that log-returns were independent and identically distributed... which might be a rather strong assumption).

An to conclude that post (and the session on extremes), here is the schedule for next weeks,

Thursday, January 26 2012

MAT8886 extremes and R

These data were collected at Copenhagen Reinsurance and comprise 2167 fire losses over the period 1980 to 1990, They have been adjusted for inflation to reflect 1985 values and are expressed in millions of Danish Kron. Note that it is possible to work with the same data as above but the total claim has been divided into a building loss, a loss of contents and a loss of profits.

> base1=read.table(
+ "http://freakonometrics.free.fr/danish-univariate.txt", + header=TRUE) > base2=read.table(
+ "http://freakonometrics.free.fr/danish-multivariate.txt", + header=TRUE)
Consider here the first dataset (we deal - so far - with univariate extremes),
> X=base1$Loss.in.DKM
> D=as.Date(as.character(base1$Date),"%m/%d/%Y")
> plot(D,X,type="h")

The graph is the following,

A natural idea is then to plot
http://freakonometrics.blog.free.fr/public/perso5/hill01.gif

i.e.
> Xs=sort(X)
> logXs=rev(log(Xs))
> n=length(X)
> plot(log(Xs),log((n:1)/(n+1)))

Points are on a straight line here. The slope can be obtained using a linear regression,

> B=data.frame(X=log(Xs),Y=log((n:1)/(n+1)))
> reg=lm(Y~X,data=B)
> summary(reg)
 
Call:
lm(formula = Y ~ X, data = B)
 
Residuals:
Min       1Q   Median       3Q      Max
-0.59999 -0.00777  0.00878  0.02461  0.20309
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.089442   0.001572   56.88   <2e-16 ***
X           -1.382181   0.001477 -935.55   <2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.04928 on 2165 degrees of freedom
Multiple R-squared: 0.9975,	Adjusted R-squared: 0.9975
F-statistic: 8.753e+05 on 1 and 2165 DF,  p-value: < 2.2e-16
 
> reg=lm(Y~X,data=B[(n-500):n,])
> summary(reg)
 
Call:
lm(formula = Y ~ X, data = B[(n - 500):n, ])
 
Residuals:
Min       1Q   Median       3Q      Max
-0.48502 -0.02148 -0.00900  0.01626  0.35798
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.186188   0.010033   18.56   <2e-16 ***
X           -1.432767   0.005105 -280.68   <2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.07751 on 499 degrees of freedom
Multiple R-squared: 0.9937,	Adjusted R-squared: 0.9937
F-statistic: 7.878e+04 on 1 and 499 DF,  p-value: < 2.2e-16
 
> reg=lm(Y~X,data=B[(n-100):n,])
> summary(reg)
 
Call:
lm(formula = Y ~ X, data = B[(n - 100):n, ])
 
Residuals:
Min       1Q   Median       3Q      Max
-0.33396 -0.03743  0.02279  0.04754  0.62946
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.67377    0.06777   9.942   <2e-16 ***
X           -1.58536    0.02240 -70.772   <2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.1299 on 99 degrees of freedom
Multiple R-squared: 0.9806,	Adjusted R-squared: 0.9804
F-statistic:  5009 on 1 and 99 DF,  p-value: < 2.2e-16 

The slope here is somehow related to the tail index of the distribution. Consider some heavy tailed distribution, i.e. http://freakonometrics.blog.free.fr/public/perso5/hill03.gif, so that http://freakonometrics.blog.free.fr/public/perso5/hill27.gif, where http://freakonometrics.blog.free.fr/public/perso5/hill28.gif is some slowly varying function. Equivalently, the exists a slowly varying function http://freakonometrics.blog.free.fr/public/perso5/hill29.gif such that http://freakonometrics.blog.free.fr/public/perso5/hill30.gif. Then

http://freakonometrics.blog.free.fr/public/perso5/hill33.gif

i.e. since a natural estimator for http://freakonometrics.blog.free.fr/public/perso5/hill35.gif is the order statistic http://freakonometrics.blog.free.fr/public/perso5/hill36.gif, the slope of the straight line is the opposite of tail index http://freakonometrics.blog.free.fr/public/perso5/hill98.gif. The estimator of the slope is (considering only the http://freakonometrics.blog.free.fr/public/perso5/hill99.gif largest observations)

http://freakonometrics.blog.free.fr/public/perso5/hill39.gif

Hill's estimator is based on the assumption that the denominator above is almost 1 (which means that  http://freakonometrics.blog.free.fr/public/perso5/hill15.gif, as http://freakonometrics.blog.free.fr/public/perso5/hill16.gif), i.e.

http://freakonometrics.blog.free.fr/public/perso5/hill02.gif

Note that, if http://freakonometrics.blog.free.fr/public/perso5/hill14.gif, but not two fast, i.e. http://freakonometrics.blog.free.fr/public/perso5/hill15.gif as http://freakonometrics.blog.free.fr/public/perso5/hill16.gif, then http://freakonometrics.blog.free.fr/public/perso5/hill12.gif (one can even get http://freakonometrics.blog.free.fr/public/perso5/hill11.gif  with stronger convergence assumptions). Further

http://freakonometrics.blog.free.fr/public/perso5/hill04.gif

Based on that (asymptotic) distribution, it is possible to get a (asymptotic) confidence interval for http://freakonometrics.blog.free.fr/public/perso5/hill98.gif

> xi=1/(1:n)*cumsum(logXs)-logXs
> xise=1.96/sqrt(1:n)*xi
> plot(1:n,xi,type="l",ylim=range(c(xi+xise,xi-xise)),
+ xlab="",ylab="",)
> polygon(c(1:n,n:1),c(xi+xise,rev(xi-xise)),
+ border=NA,col="lightblue")
> lines(1:n,xi+xise,col="red",lwd=1.5)
> lines(1:n,xi-xise,col="red",lwd=1.5)
> lines(1:n,xi,lwd=1.5)
> abline(h=0,col="grey")

It is also possible to work with http://freakonometrics.blog.free.fr/public/perso5/hill06.gif, then http://freakonometrics.blog.free.fr/public/perso5/hill05.gif. And similarly http://freakonometrics.blog.free.fr/public/perso5/hill13.gif as http://freakonometrics.blog.free.fr/public/perso5/hill14.gif (and again http://freakonometrics.blog.free.fr/public/perso5/hill10.gif with additional assumptions on the rate of convergence), and

http://freakonometrics.blog.free.fr/public/perso5/hill09.gif

(obtained using the delta-method). Again, we can use that result to derive (asymptotic) confidence intervals

> alpha=1/xi
> alphase=1.96/sqrt(1:n)/xi
> YL=c(0,3)
> plot(1:n,alpha,type="l",ylim=YL,xlab="",ylab="",)
> polygon(c(1:n,n:1),c(alpha+alphase,rev(alpha-alphase)),
+ border=NA,col="lightblue")
> lines(1:n,alpha+alphase,col="red",lwd=1.5)
> lines(1:n,alpha-alphase,col="red",lwd=1.5)
> lines(1:n,alpha,lwd=1.5)
> abline(h=0,col="grey")

The Deckers-Einmahl-de Haan estimator is

http://freakonometrics.blog.free.fr/public/perso5/hill25.gif

where for http://freakonometrics.blog.free.fr/public/perso5/hill23.gif

http://freakonometrics.blog.free.fr/public/perso5/hill21.gif

Then (given again conditions on the speed of convergence i.e. http://freakonometrics.blog.free.fr/public/perso5/hill14.gif, with http://freakonometrics.blog.free.fr/public/perso5/hill15.gif as http://freakonometrics.blog.free.fr/public/perso5/hill16.gif),

http://freakonometrics.blog.free.fr/public/perso5/hill42.gif

Finally, Pickands' estimator

http://freakonometrics.blog.free.fr/public/perso5/hill26.gif

it is possible to prove that, as http://freakonometrics.blog.free.fr/public/perso5/hill14.gif,

http://freakonometrics.blog.free.fr/public/perso5/hill41.gif
Here the code is
> Xs=rev(sort(X))
> xi=1/log(2)*log( (Xs[seq(1,length=trunc(n/4),by=1)]-
+ Xs[seq(2,length=trunc(n/4),by=2)])/
+ (Xs[seq(2,length=trunc(n/4),by=2)]-Xs[seq(4,
+ length=trunc(n/4),by=4)]) )
> xise=1.96/sqrt(seq(1,length=trunc(n/4),by=1))*
+sqrt( xi^2*(2^(xi+1)+1)/((2*(2^xi-1)*log(2))^2))
> plot(seq(1,length=trunc(n/4),by=1),xi,type="l",
+ ylim=c(0,3),xlab="",ylab="",)
> polygon(c(seq(1,length=trunc(n/4),by=1),rev(seq(1,
+ length=trunc(n/4),by=1))),c(xi+xise,rev(xi-xise)),
+ border=NA,col="lightblue")
> lines(seq(1,length=trunc(n/4),by=1),
+ xi+xise,col="red",lwd=1.5)
> lines(seq(1,length=trunc(n/4),by=1),
+ xi-xise,col="red",lwd=1.5)
> lines(seq(1,length=trunc(n/4),by=1),xi,lwd=1.5)
> abline(h=0,col="grey")

It is also possible to use maximum likelihood techniques to fit a GPD distribution over a high threshold.
> library(evd)
> library(evir)
> gpd(X,5)
$n
[1] 2167
 
$threshold
[1] 5
 
$p.less.thresh
[1] 0.8827873
 
$n.exceed
[1] 254
 
$method
[1] "ml"
 
$par.ests
xi      beta
0.6320499 3.8074817
 
$par.ses
xi      beta
0.1117143 0.4637270
 
$varcov
[,1]        [,2]
[1,]  0.01248007 -0.03203283
[2,] -0.03203283  0.21504269
 
$information
[1] "observed"
 
$converged
[1] 0
 
$nllh.final
[1] 754.1115
 
attr(,"class")
[1] "gpd"
or equivalently (or almost)
> gpd.fit(X,5)
$threshold
[1] 5
 
$nexc
[1] 254
 
$conv
[1] 0
 
$nllh
[1] 754.1115
 
$mle
[1] 3.8078632 0.6315749
 
$rate
[1] 0.1172127
 
$se
[1] 0.4636270 0.1116136
The interest of the latest function is that it is possible to visualize the profile likelihood of the tail index,
> gpd.profxi(gpd.fit(X,5),xlow=0,xup=3)

or
> gpd.profxi(gpd.fit(X,20),xlow=0,xup=3)

Hence, it is possible to plot the maximum likelihood estimator of the tail index, as a function of the threshold (including a confidence interval),
> GPDE=Vectorize(function(u){gpd(X,u)$par.ests[1]})
> GPDS=Vectorize(function(u){
+ gpd(X,u)$par.ses[1]})
> u=c(seq(2,10,by=.5),seq(11,25))
> XI=GPDE(u)
> XIS=GPDS(u)
> plot(u,XI,ylim=c(0,2))
> segments(u,XI-1.96*XIS,u,XI+
+ 1.96*XIS,lwd=2,col="red")

Finally, it is possible to use block-maxima techniques.
> gev.fit(X)
$conv
[1] 0
 
$nllh
[1] 3392.418
 
$mle
[1] 1.4833484 0.5930190 0.9168128
 
$se
[1] 0.01507776 0.01866719 0.03035380
The estimator of the tail index is here the last coefficient, on the right.
Since it is rather difficult to install a package in class rooms, here is the source of rcodes used here (to fit a GPD for exceedances)
> source("http://freakonometrics.blog.free.fr/public/code/gpd.R")
Next time, we will discuss how to use those estimators.

Wednesday, September 29 2010

Detecting distributions with infinite mean

In a post I published a few month ago (in French, here, based on some old paper, there), I mentioned a statistical procedure to test if the underlying distribution http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO02.png of an i.i.d. sample http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO01.png had a finite mean (based on extreme value results). Since I just used it on a small dataset (yes, with real data), I decided to post the R code, since it is rather simple to use. But instead of working on that dataset, let us see what happens on simulated samples. Consider http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO03.png=200 observations generated from a Pareto distribution
http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO04.png
(here http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=2, as a start)
> a=2
> X=runif(200)^(-1/a)
Here, we will use the package developped by Mathieu Ribatet,
> library(RFA)
Le chargement a nécessité le package : tcltk
Chargement de Tcl/Tk... terminé
Message d'avis :
le package 'RFA' a été compilé avec la version R 2.10.1
  • Using Generalized Pareto Distribution (and LR test)
A first idea is to fit a GPD distribution on observations that exceed a threshold http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO11.png >1.
Since we would like to test http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO06.png (against the assumption that the expected value is finite, i.e. http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO07.png), it is natural to consider the likelihood ratio
http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO08.png
Under the null hypothesis, the distribution of http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO09.png should be chi square distribution with one degree of freedom. As mentioned here, the significance level is attained with a higher accuracy by employing Bartlett correction (there). But let  us make it as simple as possible for the blog, and use the chi-square distribution to derive the p-value.
Since it is rather difficult to select an appropriate threshold, it can be natural (as in Hill's estimator) to consider http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO10.png, and thus, to fit a GPD on the http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO13.png largest values. And then to plot all that on a graph (like the Hill plot)
> Xs=rev(sort(X))
> s=0;G=rep(NA,length(Xs)-14);Gsd=G;LR=G;pLR=G
> for(i in length(X):15){
+ s=s+1
+ FG=fitgpd(X,Xs[i],method="mle")
+ FGc=fitgpd(X,Xs[i],method="mle",shape=1)
+ G[s]=FG$estimate[2]
+ Gsd[s]=FG$std.err[2]
+ FGc$fixed
+ LR[s]=FGc$deviance-FG$deviance
+ pLR[s]=1-pchisq(LR[s],df=1)
+ }
Here we keep the estimated value of the tail index, and the associated standard deviation of the estimator, to draw some confidence interval (assuming that the maximum likelihood estimate is Gaussian, which is correct only when n is extremely large). We also calculate the deviance of the model, the deviance of the constrained model (http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO06.png), and the difference, which is the log likelihood ratio. Then we calculate the p-value (since under http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO12.png the likelihood ratio statistics has a chi-square distribution).
If http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=2, we have the following graph, with on top the p-value (which is almost null here), the estimation of the tail index he http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO13.png largest values (and a confidence interval for the estimator),
If http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=1.5 (finite mean, but infinite variance), we have
i.e. for those two models, we clearly reject the assumption of infinite mean (even if http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png gets too close from 1, we should consider thresholds large enough). On the other hand, if http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=1 (i.e. infinite mean) we clearly accept the assumption of infinite mean (whatever the threshold),
  • Using Hill's estimator
An alternative could be to use Hill's estimator (with Alexander McNeil's package). See here for more details on that estimator. The test is simply based on the confidence interval derived from the (asymptotic) normal distribution of Hill's estimator,
> library(evir)
> Xs=rev(sort(X))
> HILL=hill(X)
> G=rev(HILL$y)
> Gsd=rev(G/sqrt(HILL$x))
> pLR=1-pnorm(rep(1,length(G)),mean=G,sd=Gsd)
Again, if http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=2, we clearly rejct the assumption of infinite mean,
and similarly, if http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=1.5 (finite mean, but infinite variance)
Here the test is more robust than the one based on the GPD. And if http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=1 (i.e. infinite mean), again we accept http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO12.png,
Note that if http://perso.univ-rennes1.fr/arthur.charpentier/latex/paretoRO05.png=0.7, it is still possible to run the procedure, and hopefully, it suggests that the underlying distribution has infinite mean,
(here without any doubt). Now you need to wait a few days to see some practical applications of the idea (there was on in the paper mentioned above actually, on business interruption insurance losses).

Friday, June 18 2010

Econometrica et Ouest France: même combat ?

Un court billet aujourd'hui pour présenter deux ou trois petits graphiques (je n'ai malheureusement pas encore trouvé comment aller plus loin sur l'analogie...). 

  • La recherche et l'impact factor des revues
Dans le système d'évaluation des chercheurs (je pourrais même dire des enseignants chercheurs) on est évalué tout simplement en fonction des revues dans lesquelles on arrive à placer nos papiers. Les revues sont en effet notées, classées, le classement le plus connu étant l'impact factor (j'en avais parlé l'an dernier lors du versement de primes à des chercheurs à Lyon, ici). Quelques revues ont un facteur d'impact très élevé, et ensuite se trouve des centaines voire des milliers de revues spécialisées. La distribution de ce facteur d'impact a l'allure suivante,

Comme l'ont noté plusieurs auteurs (ici ou ), la loi de dite de Zipf pourrait bien reproduire cette distribution (j'en avais parlé il y a plus d'un an dans un billet sur la loi de George Kingsley Zipf (ici), que j'avais fait un peu après avoir joué avec la loi de Benford (ici et )). 
  • La presse quotidienne, hebdomadaire ou mensuelle, et les nombres de tirages
Le monde de la recherche essaye régulièrement de donner l'illusion d'une brillante autorégulation, avec une reconnaissance par ses pairs, des relecteurs anonymes des articles soumis dans les revues, etc. Mais le monde de la recherche n'est pas si éloigné que ça des querelles d'épiciers sur le tirages de la presse.... En fait, si on regarde ce qu'on appelle la diffusion de la presse payante technique et professionnelle (ici) et la presse payante grand public (), on retrouve à très peu de choses prêt la même loi...


Bref, plus publier dans une revue à gros tirage veut dire qu'on est un meilleur chercheur... ou quelque chose du genre. Le code est ici pour ceux qui doutent,
> presse=read.table("http://perso.univ-rennes1.fr/arthur.charpentier/presses.csv",sep=";",header=TRUE)
 # presse=read.table("http://perso.univ-rennes1.fr/arthur.charpentier/Science_Journal_Ranking_Version_2003.csv",sep=";",header=TRUE)
> X=as.numeric(as.character(presse$Diffusion))
> X=X[is.na(X)==FALSE]
> Xs=sort(X)
> plot(length(X):1,Xs,col="blue",xlab="Ranking",ylab="Tirage (diffusion France payéé)")
> plot(log(length(X):1),Xs,col="blue",xlab="Ranking (log)",ylab="Tirage (diffusion France payéé)")

  • Liens avec la théorie des valeurs extrêmes
En fait, la loi de Zipf est tout simplement une loi de Pareto dans le cas discret (oui, les chercheurs ont un don pour donner des noms différents à des objets identiques, ça permet de faire plusieurs papiers sans que personne ne se doute de rien... pour ceux qui en doute, il suffit de croiser sous Google finance avec Zipf, Pareto, Power law, on va retrouver plein de monde qui raconte exactement la même chose). Bref, la loi de Pareto donne des choses très proches,

y compris la loi exponentielle qui est un cas limite de la loi de Pareto


On peut d'ailleurs aller un peu plus loin, en estimant le paramètre de la puissance, avec l'estimateur de Hill (discuté ),

ce qui confirme que les lois sont proches (des lois puissance), même si dans le cas de la presse française, on est sur un indice proche de 1.5 (à droite), alors que la presse académique est de l'ordre de 2 (à gauche), avec toutefois une explosion du côté des extrêmes (les 30 meilleures revues ayant visiblement un impact beaucoup trop important).