Freakonometrics

To content | To menu | To search

Tag - normal

Entries feed - Comments feed

Tuesday, October 30 2012

Le passage au log dans les modèles linéaires

Un billet rapide pour compléter et illustrer le passage au log dans un modèle linéaire (que l'on abordera cette semaine en cours). Le point de départ est le modèle linéaire, où on suppose que, conditionnellement à  suit une loi normale. Pour rappel, si on a une loi normale, , alors  et . Les intervalles de confiance à 90% et 95% sont symétriques par rapport à la moyenne (qui est aussi la médiane, soit dit en passant),

Dans un modèle Gaussien avec homoscédasiticité,  i.e.  alors que . On a alors les bandes de confiance suivantes, pour un modèle de régression linéaire,

Bon, maintenant, que se passe-t-il si on prend l'exponentiel ? Pour la loi normale, rappelons que l'on obtient une loi lognormale, i.e. , les deux paramètres étant liés à la loi normale sous jacente, car désormais

alors que

Graphiquement, on a la loi suivante, avec les intervalles de confiance à 90% et 95% représentés ci-dessous. Le point noir est   alors que le point bleu est l'espérance de la loi lognormale.

On notera que le quantile de la loi log-normale est l'exponentiel du quantile de la loi normale. En effet, si http://latex.codecogs.com/gif.latex?\mathbb{P}(Y\leq%20q)=\alpha alors http://latex.codecogs.com/gif.latex?\mathbb{P}(\exp(Y)\leq%20\exp(q))=\alpha. En particulier,  n'est pas la moyenne de , mais la médiane (puisque  était la médiane de ).

Mais il n'est pas rare de voir utilisé un intervalle de confiance de la forme

qui est la forme classique de l'intervalle de confiance Gaussien (symétrique autour de la moyenne). Ici, on aurait les niveaux suivants

Notons qu'il n'y a aucune raison ici d'avoir une probabilité http://latex.codecogs.com/gif.latex?1-\alpha d'être dans l'intervalle de confiance obtenu avec les quantiles http://latex.codecogs.com/gif.latex?q_{1-\alpha/2} de la loi normale.

Maintenant, si on prend l'exponentiel d'un modèle linéaire (i.e. le logarithme de la variable d’intérêt est modélisé par un modèle linéaire) on a

avec une variance (conditionnelle) qui dépend de la variable explicative

Là encore, le plus naturel est d'utiliser comme bornes de l'intervalle de confiance des quantiles associés à la loi lognormale,

mais il n'est pas rare de voir utilisé des intervalles de type Gaussiens,

On perd là encore en interprétation car les bornes n'ont plus rien à voir avec les quantiles.

Thursday, September 27 2012

Bounding sums of random variables, part 1

For the last course MAT8886 of this (long) winter session, on copulas (and extremes), we will discuss risk aggregation. The course will be mainly on the problem of bounding  the distribution (or some risk measure, say the Value-at-Risk) for two random variables with given marginal distribution. For instance, we have two Gaussian risks. What could be be worst-case scenario for the 99% quantile of the sum ? Note that I mention implications in terms of risk management, but of course, those questions are extremely important in terms of statistical inference, see e.g. Fan & Park (2006).

This problem, is sometimes related to some question asked by Kolmogorov almost one hundred years ago, as mentioned in Makarov (1981). One year after, Rüschendorf (1982) also suggested a proof of bounds calculation. Here, we focus in dimension 2. As usual, it is the simple case. But as mentioned recently, in Kreinovich & Ferson (2005), in dimension 3 (or higher), "computing the best-possible bounds for arbitrary n is an NP-hard (computationally intractable) problem". So let us focus on the case where we sum (only) two random variable (for those interested in higher dimension, Puccetti & Rüschendorf (2012) provided interesting results for a dual version of those optimal bounds).

Let http://latex.codecogs.com/gif.latex?\Delta denote the set of univariate continuous distribution function, left-continuous, on http://latex.codecogs.com/gif.latex?\mathbb{R}. And http://latex.codecogs.com/gif.latex?\Delta^+ the set of distributions on http://latex.codecogs.com/gif.latex?\mathbb{R}^+. Thus, http://latex.codecogs.com/gif.latex?F\in\Delta^+ if http://latex.codecogs.com/gif.latex?F\in\Delta and http://latex.codecogs.com/gif.latex?F(0)=0. Consider now two distributions http://latex.codecogs.com/gif.latex?F,G\in\Delta^+. In a very general setting, it is possible to consider operators on http://latex.codecogs.com/gif.latex?\Delta^+\times%20\Delta^+. Thus, let http://latex.codecogs.com/gif.latex?T:[0,1]\times[0,1]\rightarrow[0,1] denote an operator, increasing in each component, thus that http://latex.codecogs.com/gif.latex?T(1,1)=1. And consider some function http://latex.codecogs.com/gif.latex?L:\mathbb{R}^+\times\mathbb{R}^+\rightarrow\mathbb{R}^+ assumed to be also increasing in each component (and continuous). For such functions http://latex.codecogs.com/gif.latex?T and http://latex.codecogs.com/gif.latex?L, define the following (general) operator, http://latex.codecogs.com/gif.latex?\tau_{T,L}(F,G) as

http://latex.codecogs.com/gif.latex?\tau_{T,L}(F,G)(x)=\sup_{L(u,v)=x}\{T(F(u),G(v))\}

One interesting case can be obtained when http://latex.codecogs.com/gif.latex?Tis a copula, http://latex.codecogs.com/gif.latex?C. In that case,

http://latex.codecogs.com/gif.latex?\tau_{C,L}(F,G):\Delta^+\times\Delta^+\rightarrow\Delta^+

and further, it is possible to write

http://latex.codecogs.com/gif.latex?\tau_{C,L}(F,G)(x)=\sup_{(u,v)\in%20L^{-1}(x)}\{C(F(u),G(v))\}

It is also possible to consider other (general) operators, e.g. based on the sum

http://latex.codecogs.com/gif.latex?\sigma_{C,L}(F,G)(x)=\int_{(u,v)\in%20L^{-1}(x)}%20dC(F(u),G(v))

or on the minimum,

http://latex.codecogs.com/gif.latex?\rho_{C,L}(F,G)(x)=\inf_{(u,v)\in%20L^{-1}(x)}\{C^\star(F(u),G(v))\}

where http://latex.codecogs.com/gif.latex?C^\star is the survival copula associated with http://latex.codecogs.com/gif.latex?C, i.e. http://latex.codecogs.com/gif.latex?C^\star(u,v)=u+v-C(u,v). Note that those operators can be used to define distribution functions, i.e.

http://latex.codecogs.com/gif.latex?\sigma_{C,L}(F,G):\Delta^+\times\Delta^+\rightarrow\Delta^+

and similarly

http://latex.codecogs.com/gif.latex?\rho_{C,L}(F,G):\Delta^+\times\Delta^+\rightarrow\Delta^+

All that seems too theoretical ? An application can be the case of the sum, i.e. http://latex.codecogs.com/gif.latex?L(x,y)=x+y, in that case http://latex.codecogs.com/gif.latex?\sigma_{C,+}(F,G) is the distribution of sum of two random variables with marginal distributions http://latex.codecogs.com/gif.latex?F and http://latex.codecogs.com/gif.latex?G, and copula http://latex.codecogs.com/gif.latex?C. Thus, http://latex.codecogs.com/gif.latex?\sigma_{C^\perp,+}(F,G) is simply the convolution of two distributions,

http://latex.codecogs.com/gif.latex?\sigma_{C^\perp,+}(F,G)(x)=\int_{u+v=x}%20dC^\perp(F(u),G(v))

The important result (that can be found in Chapter 7, in Schweizer and Sklar (1983)) is that given an operator http://latex.codecogs.com/gif.latex?L, then, for any copula http://latex.codecogs.com/gif.latex?C, one can find a lower bound for http://latex.codecogs.com/gif.latex?\sigma_{C,L}(F,G)

http://latex.codecogs.com/gif.latex?\tau_{C^-,L}(F,G)\leq%20\tau_{C,L}(F,G)\leq\sigma_{C,L}(F,G)

as well as an upper bound

http://latex.codecogs.com/gif.latex?\sigma_{C,L}(F,G)\leq%20\rho_{C,L}(F,G)\leq\rho_{C^-,L}(F,G)

Those inequalities come from the fact that for all copula http://latex.codecogs.com/gif.latex?C, http://latex.codecogs.com/gif.latex?C\geq%20C^-, where http://latex.codecogs.com/gif.latex?C^- is a copula. Since this function is not copula in higher dimension, one can easily imagine that get those bounds in higher dimension will be much more complicated...

In the case of the sum of two random variables, with marginal distributions http://latex.codecogs.com/gif.latex?F and http://latex.codecogs.com/gif.latex?G, bounds for the distribution of the sum http://latex.codecogs.com/gif.latex?H(x)=\mathbb{P}(X+Y\leq%20x), where http://latex.codecogs.com/gif.latex?X\sim%20F and http://latex.codecogs.com/gif.latex?Y\sim%20G, can be written

http://latex.codecogs.com/gif.latex?H^-(x)=\tau_{C^-%20,+}(F,G)(x)=\sup_{u+v=x}\{%20\max\{F(u)+G(v)-1,0\}%20\}

for the lower bound, and

http://latex.codecogs.com/gif.latex?H^+(x)=\rho_{C^-%20,+}(F,G)(x)=\inf_{u+v=x}\{%20\min\{F(u)+G(v),1\}%20\}

for the upper bound. And those bounds are sharp, in the sense that, for all http://latex.codecogs.com/gif.latex?t\in(0,1), there is a copula http://latex.codecogs.com/gif.latex?C_t such that

http://latex.codecogs.com/gif.latex?\tau_{C_t,+}(F,G)(x)=\tau_{C^-%20,+}(F,G)(x)=t

and there is (another) copula http://latex.codecogs.com/gif.latex?C_t such that

http://latex.codecogs.com/gif.latex?\sigma_{C_t,+}(F,G)(x)=\tau_{C^-%20,+}(F,G)(x)=t

Thus, using those results, it is possible to bound cumulative distribution function. But actually, all that can be done also on quantiles (see Frank, Nelsen & Schweizer (1987)). For all http://latex.codecogs.com/gif.latex?F\in\Delta^+ let http://latex.codecogs.com/gif.latex?F^{-1} denotes its generalized inverse, left continuous, and let http://latex.codecogs.com/gif.latex?\nabla^+ denote the set of those quantile functions. Define then the dual versions of our operators,

http://latex.codecogs.com/gif.latex?\tau^{-1}_{T,L}(F^{-1},G^{-1})(x)=\inf_{(u,v)\in%20T^{-1}(x)}\{L(F^{-1}(u),G^{-1}(v))\}

and

http://latex.codecogs.com/gif.latex?\rho^{-1}_{T,L}(F^{-1},G^{-1})(x)=\sup_{(u,v)\in%20T^\star^{-1}(x)}\{L(F^{-1}(u),G^{-1}(v))\}

Those definitions are really dual versions of the previous ones, in the sense that http://latex.codecogs.com/gif.latex?\tau^{-1}_{T,L}(F^{-1},G^{-1})=[\tau_{T,L}(F,G)]^{-1} and http://latex.codecogs.com/gif.latex?\rho^{-1}_{T,L}(F^{-1},G^{-1})=[\rho_{T,L}(F,G)]^{-1}.

Note that if we focus on sums of bivariate distributions, the lower bound for the quantile of the sum is

http://latex.codecogs.com/gif.latex?\tau^{-1}_{C^{-},+}(F^{-1},G^{-1})(x)=\inf_{\max\{u+v-1,0\}=x}\{F^{-1}(u)+G^{-1}(v)\}

while the upper bound is

http://latex.codecogs.com/gif.latex?\rho^{-1}_{C^{-},+}(F^{-1},G^{-1})(x)=\sup_{\min\{u+v,1\}=x}\{F^{-1}(u)+G^{-1}(v)\}

A great thing is that it should not be too difficult to compute numerically those quantities. Perhaps a little bit more for cumulative distribution functions, since they are not defined on a bounded support. But still, if the goal is to plot those bounds on , for instance. The code is the following, for the sum of two lognormal distributions 

> F=function(x) plnorm(x,0,1)
> G=function(x) plnorm(x,0,1)
> n=100
> X=seq(0,10,by=.05)
> Hinf=Hsup=rep(NA,length(X))
> for(i in 1:length(X)){
+ x=X[i]
+ U=seq(0,x,by=1/n); V=x-U
+ Hinf[i]=max(pmax(F(U)+G(V)-1,0))
+ Hsup[i]=min(pmin(F(U)+G(V),1))}

If we plot those bounds, we obtain

> plot(X,Hinf,ylim=c(0,1),type="s",col="red")
> lines(X,Hsup,type="s",col="red")

But somehow, it is even more simple to work with quantiles since they are defined on a finite support. Quantiles are here

> Finv=function(u) qlnorm(u,0,1)
> Ginv=function(u) qlnorm(u,0,1)

The idea will be to consider a discretized version of the unit interval as discussed in Williamson (1989), in a much more general setting. Again the idea is to compute, for instance

http://latex.codecogs.com/gif.latex?\sup_{u\in[0,x]}\{F^{-1}(u)+G^{-1}(x-u)\}

The idea is to consider http://latex.codecogs.com/gif.latex?x=i/n and http://latex.codecogs.com/gif.latex?u=j/n, and the bound for the quantile function at point http://latex.codecogs.com/gif.latex?i/n is then

http://latex.codecogs.com/gif.latex?\sup_{j\in\{0,1,\cdots,i\}}\left\{F^{-1}\left(\frac{j}{n}\right)+G^{-1}\left(\frac{i-j}{n}\right)\right\}

The code to compute those bounds, for a given http://latex.codecogs.com/gif.latex?n is here

> n=1000
> Qinf=Qsup=rep(NA,n-1)
> for(i in 1:(n-1)){
+ J=0:i
+ Qinf[i]=max(Finv(J/n)+Ginv((i-J)/n))
+ J=(i-1):(n-1)
+ Qsup[i]=min(Finv((J+1)/n)+Ginv((i-1-J+n)/n))
+ }

Here we have (several http://latex.codecogs.com/gif.latex?ns were considered, so that we can visualize the convergence of that numerical algorithm),

Here, we have a simple code to visualize bounds for quantiles for the sum of two risks. But it is possible to go further...

Thursday, February 23 2012

Visualization in regression analysis

Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),
> library(gdata)
> XLS1=read.xls("http://api.worldbank.org/datafiles
/NY.GDP.PCAP.PP.CD_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")]
> names(data1)[3]="GDP"
> XLS2=read.xls("http://api.worldbank.org/datafiles
/SH.DYN.MORT_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data2=XLS2[-(1:28),c("Country.Code","X2010")]
> names(data2)[2]="MORTALITY"
> data=merge(data1,data2)
> head(data)
Country.Code         Country.Name       GDP MORTALITY
1          ABW                Aruba        NA        NA
2          AFG          Afghanistan  1207.278     149.2
3          AGO               Angola  6119.930     160.5
4          ALB              Albania  8817.009      18.4
5          AND              Andorra        NA       3.8
6          ARE United Arab Emirates 47215.315       7.1

If we estimate a simple linear regression - http://freakonometrics.blog.free.fr/public/perso5/logormal01.gif  - we get
> regBB=lm(MORTALITY~GDP,data=data)
> summary(regBB)
 
Call:
lm(formula = MORTALITY ~ GDP, data = data)
 
Residuals:
Min     1Q Median     3Q    Max
-45.24 -29.58 -12.12  16.19 115.83
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.1008781  4.1577411  16.139  < 2e-16 ***
GDP         -0.0017887  0.0002161  -8.278 3.83e-14 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 39.99 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2909,	Adjusted R-squared: 0.2867
F-statistic: 68.53 on 1 and 167 DF,  p-value: 3.834e-14 
We can look at the scatter plot, including the linear regression line, and some confidence bounds,
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> text(data$GDP,data$MORTALITY,data$Country.Name,pos=3)
> x=seq(-10000,100000,length=101)
> y=predict(regBB,newdata=data.frame(GDP=x),
+ interval="prediction",level = 0.9)
> lines(x,y[,1],col="red")
> lines(x,y[,2],col="red",lty=2)
> lines(x,y[,3],col="red",lty=2)

We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood,

> boxcox(regBB)

it looks like taking the logarithm of the mortality rate should be better, i.e. http://freakonometrics.blog.free.fr/public/perso5/lognormal02.gif or http://freakonometrics.blog.free.fr/public/perso5/lognormal05.gif:

> regLB=lm(log(MORTALITY)~GDP,data=data)
> summary(regLB)
 
Call:
lm(formula = log(MORTALITY) ~ GDP, data = data)
 
Residuals:
Min      1Q  Median      3Q     Max
-1.3035 -0.5837 -0.1138  0.5597  3.0583
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.989e+00  7.970e-02   50.05   <2e-16 ***
GDP         -6.487e-05  4.142e-06  -15.66   <2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.7666 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.5949,	Adjusted R-squared: 0.5925
F-statistic: 245.3 on 1 and 167 DF,  p-value: < 2.2e-16
 
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5,log="y")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=seq(300,100000,length=101)
> y=exp(predict(regLB,newdata=data.frame(GDP=x)))*
+ exp(summary(regLB)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma^2)
> lines(x,y,col="red",lty=2)

on the log scale or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5)

on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals.

But why shouldn't we take also the logarithm of the GDP ? We can fit a model http://freakonometrics.blog.free.fr/public/perso5/lognormal03.gif or equivalently http://freakonometrics.blog.free.fr/public/perso5/lognormal04.gif.


> regLL=lm(log(MORTALITY)~log(GDP),data=data)
> summary(regLL)
 
Call:
lm(formula = log(MORTALITY) ~ log(GDP), data = data)
 
Residuals:
Min       1Q   Median       3Q      Max
-1.13200 -0.38326 -0.07127  0.26610  3.02212
 
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.50192    0.31556   33.28   <2e-16 ***
log(GDP)    -0.83496    0.03548  -23.54   <2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.5797 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.7684,	Adjusted R-squared: 0.767
F-statistic:   554 on 1 and 167 DF,  p-value: < 2.2e-16
 
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5,log="xy")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=exp(seq(1,12,by=.1))
> y=exp(predict(regLL,newdata=data.frame(GDP=x)))*
+ exp(summary(regLL)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma^2)
> lines(x,y,col="red",lty=2)

on the log scales or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5)

on the standard scale. If we compare the last two predictions, we have

with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).

Thursday, January 12 2012

MAT8886 Fisher-Tippett theorem and limiting distribution for the maximum

Tomorrow, we will discuss Fisher-Tippett theorem. The idea is that there are only three possible limiting distributions for normalized versions of the maxima of i.i.d. samples http://freakonometrics.blog.free.fr/public/perso5/max-00.gif. For bounded distribution, consider e.g. the uniform distribution on the unit interval, i.e. http://freakonometrics.blog.free.fr/public/perso5/max-09.gif on the unit interval. Let http://freakonometrics.blog.free.fr/public/perso5/max-10.gif and http://freakonometrics.blog.free.fr/public/perso5/max-11.gif. Then, for all http://freakonometrics.blog.free.fr/public/perso5/max-12.gif and http://freakonometrics.blog.free.fr/public/perso5/max-13.gif,

http://freakonometrics.blog.free.fr/public/perso5/max-14.gif

i.e. the limiting distribution of the maximum is Weibull's.

set.seed(1)
s=1000000
n=100
M=matrix(runif(s),n,s/n)
V=apply(M,2,max)
bn=1
an=1/n
U=(V-bn)/an
hist(U,probability=TRUE,,col="light green",
xlim=c(-7,1),main="",breaks=seq(-20,10,by=.25))
u=seq(-10,0,by=.1)
v=exp(u)
lines(u,v,lwd=3,col="red")

For heavy tailed distribution, or Pareto-type tails, consider Pareto samples, with distribution function http://freakonometrics.blog.free.fr/public/perso5/max-05.gif. Let http://freakonometrics.blog.free.fr/public/perso5/max-06.gif and http://freakonometrics.blog.free.fr/public/perso5/max-07.gif, then

http://freakonometrics.blog.free.fr/public/perso5/max-08.gif

which means that the limiting distribution is Fréchet's.

set.seed(1)
s=1000000
n=100
M=matrix((runif(s))^(-1/2),n,s/n)
V=apply(M,2,max)
bn=0
an=n^(1/2)
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(0,7),main="",breaks=seq(0,max(U)+1,by=.25))
u=seq(0,10,by=.1)
v=dfrechet(u,shape=2)
lines(u,v,lwd=3,col="red")

For light tailed distribution, or exponential tails, consider e.g. a sample of exponentially distribution variates, with common distribution function http://freakonometrics.blog.free.fr/public/perso5/max-01.gif. Let http://freakonometrics.blog.free.fr/public/perso5/max-02.gif and http://freakonometrics.blog.free.fr/public/perso5/max-03.gif, then

http://freakonometrics.blog.free.fr/public/perso5/max-04.gif

i.e. the limiting distribution for the maximum is Gumbel's distribution.

library(evd)
set.seed(1)
s=1000000
n=100
M=matrix(rexp(s,1),n,s/n)
V=apply(M,2,max)
(bn=qexp(1-1/n))
log(n)
an=1
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(-2,7),ylim=c(0,.39),main="",breaks=seq(-5,15,by=.25))
u=seq(-5,15,by=.1)
v=dgumbel(u)
lines(u,v,lwd=3,col="red")

Consider now a Gaussian http://freakonometrics.blog.free.fr/public/perso5/max-17.gif sample. We can use the following approximation of the cumulative distribution function (based on l'Hopital's rule)

http://freakonometrics.blog.free.fr/public/perso5/max-15.gif

as http://freakonometrics.blog.free.fr/public/perso5/max-16.gif. Let http://freakonometrics.blog.free.fr/public/perso5/max-18.gif and http://freakonometrics.blog.free.fr/public/perso5/max-19.gif. Then we can get

http://freakonometrics.blog.free.fr/public/perso5/max-20.gif

as http://freakonometrics.blog.free.fr/public/perso5/max-21.gif. I.e. the limiting distribution of the maximum of a Gaussian sample is Gumbel's. But what we do not see here is that for a Gaussian sample, the convergence is extremely slow, i.e., with 100 observations, we are still far away from Gumbel distribution,

and it is only slightly better with 1,000 observations,

set.seed(1)
s=10000000
n=1000
M=matrix(rnorm(s,0,1),n,s/n)
V=apply(M,2,max)
(bn=qnorm(1-1/n,0,1))
an=1/bn
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(-2,7),ylim=c(0,.39),main="",breaks=seq(-5,15,by=.25))
u=seq(-5,15,by=.1)
v=dgumbel(u)
lines(u,v,lwd=3,col="red")

Even worst, consider lognormal observations. In that case, recall that if we consider (increasing) transformation of variates, we are in the same domain of attraction. Hence, since http://freakonometrics.blog.free.fr/public/perso5/max-22.gif, if

http://freakonometrics.blog.free.fr/public/perso5/max-23.gif

then

http://freakonometrics.blog.free.fr/public/perso5/max-24.gif

i.e. using Taylor's approximation on the right term,

http://freakonometrics.blog.free.fr/public/perso5/max-25.gif

This gives us normalizing coefficients we should use here.

set.seed(1)
s=10000000
n=1000
M=matrix(rlnorm(s,0,1),n,s/n)
V=apply(M,2,max)
bn=exp(qnorm(1-1/n,0,1))
an=exp(qnorm(1-1/n,0,1))/(qnorm(1-1/n,0,1))
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(-2,7),ylim=c(0,.39),main="",breaks=seq(-5,40,by=.25))
u=seq(-5,15,by=.1)
v=dgumbel(u)
lines(u,v,lwd=3,col="red")

Credit: illustration is from Maurice Sendak's popular book where the wild things are, translated in French as Max et les Maximonstres.

Tuesday, February 15 2011

In statistics, having too much information might not be a good thing

A common idea in statistics is that if we don't know something, and we use an estimator of that something (instead of the true value) then there will be some additional uncertainty.

For instance, consider a random sample, i.i.d., from a Gaussian distribution. Then, a confidence interval for the mean is
http://freakonometrics.blog.free.fr/public/perso2/IC-cout-06.gif
where http://freakonometrics.blog.free.fr/public/perso2/inc-out-8.gif is the quantile of probability level http://freakonometrics.blog.free.fr/public/perso2/IC-cout-05.gif of the standard normal distribution http://freakonometrics.blog.free.fr/public/perso2/inc-out-09.gif. But usually, standard deviation http://freakonometrics.blog.free.fr/public/perso2/inc-cout-10.gif (the something is was talking about earlier) is usually unknown. So we substitute an estimation of the standard deviation, e.g.
http://freakonometrics.blog.free.fr/public/perso2/IC-cout-02.gif
and the cost we have to pay is that the new confidence interval is
http://freakonometrics.blog.free.fr/public/perso2/IC-cout-01.gif
where now http://freakonometrics.blog.free.fr/public/perso2/IC-cout-03.gif is the quantile of the Student distribution, of probability level http://freakonometrics.blog.free.fr/public/perso2/IC-cout-05.gif, with http://freakonometrics.blog.free.fr/public/perso2/IC-cout-04.gif degrees of freedom.
We call it a cost since the new confidence interval is now larger (the Student distribution has higher upper-quantiles than the Gaussian distribution).
So usually, if we substitute an estimation to the true value, there is a price to pay.
A few years ago, with Jean David Fermanian and Olivier Scaillet, we were writing a survey on copula density estimation (using kernels,  here). At the end, we wanted to add a small paragraph on the fact that we assumed that we wanted to fit a copula on a sample http://freakonometrics.blog.free.fr/public/perso2/ic-cout_11.gif i.i.d. with distribution http://freakonometrics.blog.free.fr/public/perso2/ic-cout_13.gif, a copula, but in practice, we start from a samplehttp://freakonometrics.blog.free.fr/public/perso2/ic-cout_12.gif with joint distribution http://freakonometrics.blog.free.fr/public/perso2/ic-cour_14.gif (assumed to have continuous margins, and - unique - copula http://freakonometrics.blog.free.fr/public/perso2/ic-cout_13.gif). But since margins are usually unknown, there should be a price for not observing them.
To be more formal, in a perfect wold, we would consider
http://freakonometrics.blog.free.fr/public/perso2/ic-cout-15.gif
but in the real world, we have to consider
http://freakonometrics.blog.free.fr/public/perso2/ic-cout-16.gif
where it is standard to consider ranks, i.e. http://freakonometrics.blog.free.fr/public/perso2/ic-cout_109.gif are empirical cumulative distribution functions.
My point is that when I ran simulations for the survey (the idea was more to give illustrations of several techniques of estimation, rather than proofs of technical theorems) we observed that the price to pay... was negative ! I.e. the variance of the estimator of the density (wherever on the unit square) was smaller on the pseudo sample http://freakonometrics.blog.free.fr/public/perso2/ic-cout-17.gif than on perfect sample http://freakonometrics.blog.free.fr/public/perso2/ic-cout_18.gif.
By that time, we could not understand why we got that counter-intuitive result: even if we do know the true distribution, it is better not to use it, and to use instead a nonparametric estimator. Our interpretation was based on the discrepancy concept and was related to the latin hypercube construction:

With ranks, the data are more regular, and marginal distributions are exactly uniform on the unit interval. So there is less variance.
This was our heuristic interpretation.
A couple of weeks ago, Christian Genest and Johan Segers proved that intuition in an article published in JMVA,

Well, we observed something for finite http://freakonometrics.blog.free.fr/public/maths/mariage01.png, but Christian and Johan obtained an analytical result. Hence, if we denote
http://freakonometrics.blog.free.fr/public/perso2/JSCG-1.gif
the empirical copula in the perfect world (with known margins) and
http://freakonometrics.blog.free.fr/public/perso2/JSCG-2.gif
the one constructed from the pseudo sample, they obtained that, everywhere
http://freakonometrics.blog.free.fr/public/perso2/JSCG-6.gif
with nice graphs of http://freakonometrics.blog.free.fr/public/perso2/JSCG-7.gif,

So I was very happy last week when Christian show me their results, to learn that our intuition was correct. Nevertheless, it is still a very counter-intuitive result.... If anyone has seen similar things, I'd be glad to hear about it !