# Freakonometrics

Tuesday, October 30 2012

## Taking logarithms in linear models

A quick post to complement and illustrate the use of the logarithm in a linear model (which we will discuss in class this week). The starting point is the linear model, where we assume that, conditionally on $X$, $Y$ follows a normal distribution. As a reminder, for a normal distribution $Y\sim \mathcal{N}(\mu,\sigma^2)$, we have $\mathbb{E}(Y)=\mu$ and $\text{Var}(Y)=\sigma^2$. The 90% and 95% confidence intervals are symmetric around the mean (which is also the median, by the way),

In a Gaussian model with homoscedasticity, $Y|X\sim \mathcal{N}(\beta_0+\beta_1X,\sigma^2)$, i.e. $\mathbb{E}(Y|X)=\beta_0+\beta_1X$ while $\text{Var}(Y|X)=\sigma^2$. We then get the following confidence bands, for a linear regression model,

Now, what happens if we take the exponential? For the normal distribution, recall that we obtain a lognormal distribution, i.e. $\exp(Y)\sim LN(\mu,\sigma^2)$, the two parameters being those of the underlying normal distribution, since now

$\mathbb{E}(\exp(Y))=\exp\left(\mu+\frac{\sigma^2}{2}\right)$

while

$\text{Var}(\exp(Y))=(\exp[\sigma^2]-1)\cdot\exp[2\mu+\sigma^2]$

Graphically, we get the following distribution, with the 90% and 95% confidence intervals shown below. The black dot is $\exp(\mu)$, while the blue dot is the expected value of the lognormal distribution.

Note that the quantile of the lognormal distribution is the exponential of the quantile of the normal distribution. Indeed, if $\mathbb{P}(Y\leq q)=\alpha$ then $\mathbb{P}(\exp(Y)\leq \exp(q))=\alpha$. In particular, $\exp(\mu)$ is not the mean of $\exp(Y)$, but its median (since $\mu$ was the median of $Y$).
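To make this concrete, here is a minimal R illustration (with arbitrary values $\mu=1$ and $\sigma=2$, chosen only for the example), checking that lognormal quantiles are the exponential of Gaussian quantiles, that $\exp(\mu)$ is the median, and that the mean is $\exp(\mu+\sigma^2/2)$,

mu=1; sigma=2
# quantiles of the lognormal are the exponential of the Gaussian quantiles
qlnorm(c(.05,.5,.95),meanlog=mu,sdlog=sigma)
exp(qnorm(c(.05,.5,.95),mean=mu,sd=sigma))
# exp(mu) is the median, not the mean
exp(mu)
qlnorm(.5,meanlog=mu,sdlog=sigma)
# the mean is exp(mu+sigma^2/2)
exp(mu+sigma^2/2)
mean(rlnorm(1e6,meanlog=mu,sdlog=sigma))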

But it is not uncommon to see confidence intervals of the form

$\mathbb{E}(\exp(Y))\pm q_{1-\alpha/2}\sqrt{\text{Var}(\exp(Y))}$

which is the classical form of the Gaussian confidence interval (symmetric around the mean). Here, we would get the following levels

Note that there is no reason here for the probability of lying in the confidence interval obtained with the quantiles $q_{1-\alpha/2}$ of the normal distribution to be $1-\alpha$.
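As a quick numerical check (a minimal sketch, with arbitrary parameters $\mu=0$ and $\sigma=1$), we can compute the actual probability of falling into that symmetric interval, which need not equal the nominal level,

mu=0; sigma=1
m=exp(mu+sigma^2/2)                        # mean of the lognormal
s=sqrt((exp(sigma^2)-1)*exp(2*mu+sigma^2)) # standard deviation of the lognormal
alpha=.05
q=qnorm(1-alpha/2)
# actual coverage of the symmetric, Gaussian-style interval
plnorm(m+q*s,meanlog=mu,sdlog=sigma)-plnorm(m-q*s,meanlog=mu,sdlog=sigma)
# compared with the nominal level
1-alpha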

Now, if we take the exponential of a linear model (i.e. the logarithm of the variable of interest is modelled with a linear model), we have

$\mathbb{E}(\exp(Y)|X)=\exp\left(\beta_0+\beta_1X+\frac{\sigma^2}{2}\right)$

with a (conditional) variance that depends on the explanatory variable

$\text{Var}(\exp(Y)|X)=(\exp[\sigma^2]-1)\cdot\exp[2(\beta_0+\beta_1X)+\sigma^2]$

Here again, the most natural choice is to use, as bounds of the confidence interval, quantiles of the associated lognormal distribution,

but it is not uncommon to see Gaussian-type intervals being used,

Here again we lose the interpretation, since the bounds no longer have anything to do with quantiles.
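For the record, here is a minimal simulated illustration (the data-generating process and its coefficients are arbitrary, chosen only for the example) comparing, for $\exp(Y)$ given $X$, the quantile-based bands obtained with qlnorm and the symmetric, Gaussian-style bands,

set.seed(1)
n=200
x=runif(n,0,10)
y=exp(1+.2*x+rnorm(n,0,.5))   # log(y) follows a linear model in x
reg=lm(log(y)~x)
s=summary(reg)$sigma
u=seq(0,10,by=.1)
m=predict(reg,newdata=data.frame(x=u))
plot(x,y,cex=.5)
# quantile-based bands (lognormal quantiles)
lines(u,qlnorm(.05,meanlog=m,sdlog=s),col="red",lty=2)
lines(u,qlnorm(.95,meanlog=m,sdlog=s),col="red",lty=2)
# symmetric, Gaussian-style bands around the conditional mean
E=exp(m+s^2/2)
V=(exp(s^2)-1)*exp(2*m+s^2)
lines(u,E-qnorm(.95)*sqrt(V),col="blue",lty=2)
lines(u,E+qnorm(.95)*sqrt(V),col="blue",lty=2)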

Thursday, September 27 2012

## Bounding sums of random variables, part 1

For the last course MAT8886 of this (long) winter session, on copulas (and extremes), we will discuss risk aggregation. The course will mainly be about the problem of bounding the distribution (or some risk measure, say the Value-at-Risk) of the sum of two random variables with given marginal distributions. For instance, suppose we have two Gaussian risks. What could be the worst-case scenario for the 99% quantile of the sum? Note that I mention implications in terms of risk management, but of course, those questions are extremely important in terms of statistical inference, see e.g. Fan & Park (2006).

This problem is sometimes related to a question asked by Kolmogorov almost one hundred years ago, as mentioned in Makarov (1981). One year later, Rüschendorf (1982) also suggested a proof of the bounds calculation. Here, we focus on dimension 2. As usual, it is the simple case. But as mentioned recently in Kreinovich & Ferson (2005), in dimension 3 (or higher), "computing the best-possible bounds for arbitrary n is an NP-hard (computationally intractable) problem". So let us focus on the case where we sum (only) two random variables (for those interested in higher dimensions, Puccetti & Rüschendorf (2012) provide interesting results on a dual version of those optimal bounds).

Let $\Delta$ denote the set of univariate continuous distribution functions, left-continuous, on $\mathbb{R}$, and $\Delta^+$ the set of distributions on $\mathbb{R}^+$. Thus, $F\in\Delta^+$ if $F\in\Delta$ and $F(0)=0$. Consider now two distributions $F,G\in\Delta^+$. In a very general setting, it is possible to consider operators on $\Delta^+\times \Delta^+$. Thus, let $T:[0,1]\times[0,1]\rightarrow[0,1]$ denote an operator, increasing in each component, such that $T(1,1)=1$. And consider some function $L:\mathbb{R}^+\times\mathbb{R}^+\rightarrow\mathbb{R}^+$ assumed to be also increasing in each component (and continuous). For such functions $T$ and $L$, define the following (general) operator, $\tau_{T,L}(F,G)$, as

$\tau_{T,L}(F,G)(x)=\sup_{L(u,v)=x}\{T(F(u),G(v))\}$

One interesting case is obtained when $T$ is a copula, $C$. In that case,

$\tau_{C,L}(F,G):\Delta^+\times\Delta^+\rightarrow\Delta^+$

and further, it is possible to write

$\tau_{C,L}(F,G)(x)=\sup_{(u,v)\in L^{-1}(x)}\{C(F(u),G(v))\}$

It is also possible to consider other (general) operators, e.g. based on the sum

$\sigma_{C,L}(F,G)(x)=\int_{(u,v)\in L^{-1}(x)} dC(F(u),G(v))$

or on the minimum,

$\rho_{C,L}(F,G)(x)=\inf_{(u,v)\in L^{-1}(x)}\{C^\star(F(u),G(v))\}$

where $C^\star$ is the survival copula associated with $C$, i.e. $C^\star(u,v)=u+v-C(u,v)$. Note that those operators can be used to define distribution functions, i.e.

$\sigma_{C,L}(F,G):\Delta^+\times\Delta^+\rightarrow\Delta^+$

and similarly

$\rho_{C,L}(F,G):\Delta^+\times\Delta^+\rightarrow\Delta^+$

Does all that seem too theoretical? An application is the case of the sum, i.e. $L(x,y)=x+y$: then $\sigma_{C,+}(F,G)$ is the distribution of the sum of two random variables with marginal distributions $F$ and $G$, and copula $C$. Thus, $\sigma_{C^\perp,+}(F,G)$ is simply the convolution of two distributions,

$\sigma_{C^\perp,+}(F,G)(x)=\int_{u+v=x} dC^\perp(F(u),G(v))$

The important result (which can be found in Chapter 7 of Schweizer and Sklar (1983)) is that, given an operator $L$, for any copula $C$ one can find a lower bound for $\sigma_{C,L}(F,G)$,

$\tau_{C^-,L}(F,G)\leq \tau_{C,L}(F,G)\leq\sigma_{C,L}(F,G)$

as well as an upper bound

$\sigma_{C,L}(F,G)\leq \rho_{C,L}(F,G)\leq\rho_{C^-,L}(F,G)$

Those inequalities come from the fact that, for any copula $C$, $C\geq C^-$, where $C^-(u,v)=\max\{u+v-1,0\}$ is the lower Fréchet-Hoeffding bound, which is itself a copula in dimension 2. Since this function is no longer a copula in higher dimension, one can easily imagine that obtaining those bounds in higher dimension will be much more complicated...

In the case of the sum of two random variables, with marginal distributions $F$ and $G$, bounds for the distribution of the sum $H(x)=\mathbb{P}(X+Y\leq x)$, where $X\sim F$ and $Y\sim G$, can be written

$H^-(x)=\tau_{C^- ,+}(F,G)(x)=\sup_{u+v=x}\{ \max\{F(u)+G(v)-1,0\} \}$

for the lower bound, and

$H^+(x)=\rho_{C^- ,+}(F,G)(x)=\inf_{u+v=x}\{ \min\{F(u)+G(v),1\} \}$

for the upper bound. And those bounds are sharp, in the sense that, for all $t\in(0,1)$, there is a copula $C_t$ such that

$\tau_{C_t,+}(F,G)(x)=\tau_{C^- ,+}(F,G)(x)=t$

and there is (another) copula $C_t$ such that

$\sigma_{C_t,+}(F,G)(x)=\tau_{C^- ,+}(F,G)(x)=t$

Thus, using those results, it is possible to bound cumulative distribution functions. But actually, all that can also be done for quantiles (see Frank, Nelsen & Schweizer (1987)). For all $F\in\Delta^+$, let $F^{-1}$ denote its generalized inverse, left-continuous, and let $\nabla^+$ denote the set of those quantile functions. Then define the dual versions of our operators,

$\tau^{-1}_{T,L}(F^{-1},G^{-1})(x)=\inf_{(u,v)\in T^{-1}(x)}\{L(F^{-1}(u),G^{-1}(v))\}$

and

$\rho^{-1}_{T,L}(F^{-1},G^{-1})(x)=\sup_{(u,v)\in (T^\star)^{-1}(x)}\{L(F^{-1}(u),G^{-1}(v))\}$

Those definitions are really dual versions of the previous ones, in the sense that $\tau^{-1}_{T,L}(F^{-1},G^{-1})=[\tau_{T,L}(F,G)]^{-1}$ and $\rho^{-1}_{T,L}(F^{-1},G^{-1})=[\rho_{T,L}(F,G)]^{-1}$.

Note that if we focus on sums of bivariate distributions, the lower bound for the quantile of the sum is

$\tau^{-1}_{C^{-},+}(F^{-1},G^{-1})(x)=\inf_{\max\{u+v-1,0\}=x}\{F^{-1}(u)+G^{-1}(v)\}$

while the upper bound is

$\rho^{-1}_{C^{-},+}(F^{-1},G^{-1})(x)=\sup_{\min\{u+v,1\}=x}\{F^{-1}(u)+G^{-1}(v)\}$

A great thing is that it should not be too difficult to compute those quantities numerically. It is perhaps a bit more difficult for cumulative distribution functions, since they are not defined on a bounded support; but it can still be done if the goal is simply to plot those bounds on $[0,10]$, for instance. The code is the following, for the sum of two lognormal distributions $LN(0,1)$,

> F=function(x) plnorm(x,0,1)
> G=function(x) plnorm(x,0,1)
> n=100
> X=seq(0,10,by=.05)
> Hinf=Hsup=rep(NA,length(X))
> for(i in 1:length(X)){
+ x=X[i]
+ U=seq(0,x,by=1/n); V=x-U
+ Hinf[i]=max(pmax(F(U)+G(V)-1,0))
+ Hsup[i]=min(pmin(F(U)+G(V),1))}

If we plot those bounds, we obtain

> plot(X,Hinf,ylim=c(0,1),type="s",col="red")
> lines(X,Hsup,type="s",col="red")
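As a quick sanity check (a minimal sketch, not part of the original code), the distribution of the sum under the independent copula $C^\perp$ should lie between those two bounds; we can add it to the graph using simulations,

> set.seed(1)
> S=rlnorm(1e6,0,1)+rlnorm(1e6,0,1)   # sum of two independent lognormal risks
> lines(X,ecdf(S)(X),col="blue")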

But somehow, it is even simpler to work with quantiles, since they are defined on a bounded support. Quantile functions are here

> Finv=function(u) qlnorm(u,0,1)
> Ginv=function(u) qlnorm(u,0,1)

The idea is to consider a discretized version of the unit interval, as discussed in Williamson (1989) in a much more general setting. Again, we want to compute, for instance,

$\sup_{u\in[0,x]}\{F^{-1}(u)+G^{-1}(x-u)\}$

The idea is to consider $x=i/n$ and $u=j/n$, and the bound for the quantile function at point $i/n$ is then

$\sup_{j\in\{0,1,\cdots,i\}}\left\{F^{-1}\left(\frac{j}{n}\right)+G^{-1}\left(\frac{i-j}{n}\right)\right\}$

The code to compute those bounds, for a given $n$ is here

> n=1000
> Qinf=Qsup=rep(NA,n-1)
> for(i in 1:(n-1)){
+ J=0:i
+ Qinf[i]=max(Finv(J/n)+Ginv((i-J)/n))
+ J=(i-1):(n-1)
+ Qsup[i]=min(Finv((J+1)/n)+Ginv((i-1-J+n)/n))
+ }

Here is what we get (several values of $n$ were considered, so that we can visualize the convergence of that numerical algorithm),
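The plotting commands are not reproduced above; a minimal sketch to visualize those quantile bounds, for one value of $n$ (the ylim value is arbitrary), could be

> plot((1:(n-1))/n,Qinf,type="s",col="red",ylim=c(0,40),
+ xlab="probability level",ylab="quantile of the sum")
> lines((1:(n-1))/n,Qsup,type="s",col="red",lty=2)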

So here we have a simple piece of code to visualize bounds on the quantiles of the sum of two risks. But it is possible to go further...

Thursday, February 23 2012

## Visualization in regression analysis

Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),
> library(gdata)
> XLS1=read.xls(".../NY.GDP.PCAP.PP.CD_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")]
> names(data1)[3]="GDP"
> XLS2=read.xls(".../SH.DYN.MORT_Indicator_MetaData_en_EXCEL.xls", sheet = 1)
> data2=XLS2[-(1:28),c("Country.Code","X2010")]
> names(data2)[2]="MORTALITY"
> data=merge(data1,data2)
> head(data)
Country.Code         Country.Name       GDP MORTALITY
1          ABW                Aruba        NA        NA
2          AFG          Afghanistan  1207.278     149.2
3          AGO               Angola  6119.930     160.5
4          ALB              Albania  8817.009      18.4
5          AND              Andorra        NA       3.8
6          ARE United Arab Emirates 47215.315       7.1

If we estimate a simple linear regression, $\text{MORTALITY}_i=\beta_0+\beta_1\,\text{GDP}_i+\varepsilon_i$, we get
> regBB=lm(MORTALITY~GDP,data=data)
> summary(regBB)

Call:
lm(formula = MORTALITY ~ GDP, data = data)

Residuals:
Min     1Q Median     3Q    Max
-45.24 -29.58 -12.12  16.19 115.83

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.1008781  4.1577411  16.139  < 2e-16 ***
GDP         -0.0017887  0.0002161  -8.278 3.83e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.99 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2909,	Adjusted R-squared: 0.2867
F-statistic: 68.53 on 1 and 167 DF,  p-value: 3.834e-14

We can look at the scatter plot, including the linear regression line, and some confidence bounds,
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> text(data$GDP,data$MORTALITY,data$Country.Name,pos=3)
> x=seq(-10000,100000,length=101)
> y=predict(regBB,newdata=data.frame(GDP=x),
+ interval="prediction",level = 0.9)
> lines(x,y[,1],col="red")
> lines(x,y[,2],col="red",lty=2)
> lines(x,y[,3],col="red",lty=2)

We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood,

> library(MASS)
> boxcox(regBB)

it looks like taking the logarithm of the mortality rate should be better, i.e. $\log(\text{MORTALITY}_i)=\beta_0+\beta_1\,\text{GDP}_i+\varepsilon_i$, or equivalently $\text{MORTALITY}_i=\exp(\beta_0+\beta_1\,\text{GDP}_i)\cdot\exp(\varepsilon_i)$:

> regLB=lm(log(MORTALITY)~GDP,data=data)
> summary(regLB)

Call:
lm(formula = log(MORTALITY) ~ GDP, data = data)

Residuals:
Min      1Q  Median      3Q     Max
-1.3035 -0.5837 -0.1138  0.5597  3.0583

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.989e+00  7.970e-02   50.05   <2e-16 ***
GDP         -6.487e-05  4.142e-06  -15.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7666 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.5949,	Adjusted R-squared: 0.5925
F-statistic: 245.3 on 1 and 167 DF,  p-value: < 2.2e-16

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5) log scale",cex=.5,log="y")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=seq(300,100000,length=101)
> y=exp(predict(regLB,newdata=data.frame(GDP=x)))*
+ exp(summary(regLB)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLB)$sigma)
> lines(x,y,col="red",lty=2)

on the log scale, or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)

on the standard scale. Here we use quantiles of the lognormal distribution to derive confidence intervals. But why shouldn't we also take the logarithm of the GDP? We can fit a model $\log(\text{MORTALITY}_i)=\beta_0+\beta_1\log(\text{GDP}_i)+\varepsilon_i$, or equivalently $\text{MORTALITY}_i=\text{GDP}_i^{\beta_1}\cdot\exp(\beta_0)\cdot\exp(\varepsilon_i)$.

> regLL=lm(log(MORTALITY)~log(GDP),data=data)
> summary(regLL)

Call:
lm(formula = log(MORTALITY) ~ log(GDP), data = data)

Residuals:
Min       1Q   Median       3Q      Max
-1.13200 -0.38326 -0.07127  0.26610  3.02212

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.50192    0.31556   33.28   <2e-16 ***
log(GDP)    -0.83496    0.03548  -23.54   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5797 on 167 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.7684,	Adjusted R-squared: 0.767
F-statistic: 554 on 1 and 167 DF,  p-value: < 2.2e-16

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5,log="xy")
> text(data$GDP,data$MORTALITY,data$Country.Name)
> x=exp(seq(1,12,by=.1))
> y=exp(predict(regLL,newdata=data.frame(GDP=x)))*
+ exp(summary(regLL)$sigma^2/2)
> lines(x,y,col="red")
> y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma)
> lines(x,y,col="red",lty=2)
> y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)),
+ sdlog=summary(regLL)$sigma)
> lines(x,y,col="red",lty=2)

on the log scales, or

> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ",
+ ylab="Mortality rate (under 5)",cex=.5)

on the standard scale. If we compare the last two predictions, we have

with the log model in blue and the log-log model in red (I did not include the first model, for obvious reasons).
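For completeness, a minimal sketch of that comparison (the plotting commands are not in the original code) is to draw both fitted curves on the same graph,

> x=seq(300,100000,length=101)
> yLB=exp(predict(regLB,newdata=data.frame(GDP=x)))*exp(summary(regLB)$sigma^2/2)
> yLL=exp(predict(regLL,newdata=data.frame(GDP=x)))*exp(summary(regLL)$sigma^2/2)
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita",
+ ylab="Mortality rate (under 5)",cex=.5)
> lines(x,yLB,col="blue")   # log model
> lines(x,yLL,col="red")    # log-log model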

Thursday, January 12 2012

## MAT8886 Fisher-Tippett theorem and limiting distribution for the maximum

Tomorrow, we will discuss the Fisher-Tippett theorem. The idea is that there are only three possible limiting distributions for normalized versions of the maxima of i.i.d. samples, $X_{n:n}=\max\{X_1,\dots,X_n\}$. For bounded distributions, consider e.g. the uniform distribution on the unit interval, i.e. $F(x)=x$ on the unit interval. Let $b_n=1$ and $a_n=1/n$. Then, for all $x\leq 0$, as $n\rightarrow\infty$,

$\mathbb{P}\left(\frac{X_{n:n}-b_n}{a_n}\leq x\right)=\mathbb{P}\left(X_{n:n}\leq 1+\frac{x}{n}\right)=\left(1+\frac{x}{n}\right)^n\rightarrow \exp(x)$

i.e. the limiting distribution of the maximum is Weibull's.

set.seed(1)
s=1000000
n=100
M=matrix(runif(s),n,s/n)
V=apply(M,2,max)
bn=1
an=1/n
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(-7,1),main="",breaks=seq(-20,10,by=.25))
u=seq(-10,0,by=.1)
v=exp(u)
lines(u,v,lwd=3,col="red")

For heavy tailed distributions, or Pareto-type tails, consider Pareto samples, with distribution function $F(x)=1-x^{-2}$, for $x\geq 1$. Let $b_n=0$ and $a_n=n^{1/2}$; then, for all $x>0$,

$\mathbb{P}\left(\frac{X_{n:n}-b_n}{a_n}\leq x\right)=\left(1-\frac{x^{-2}}{n}\right)^n\rightarrow \exp(-x^{-2})$

which means that the limiting distribution is Fréchet's.

library(evd)   # for the dfrechet density used below
set.seed(1)
s=1000000
n=100
M=matrix((runif(s))^(-1/2),n,s/n)
V=apply(M,2,max)
bn=0
an=n^(1/2)
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(0,7),main="",breaks=seq(0,max(U)+1,by=.25))
u=seq(0,10,by=.1)
v=dfrechet(u,shape=2)
lines(u,v,lwd=3,col="red")

For light tailed distributions, or exponential tails, consider e.g. a sample of exponentially distributed variates, with common distribution function $F(x)=1-e^{-x}$, for $x\geq 0$. Let $b_n=\log(n)$ and $a_n=1$; then, for all $x$,

$\mathbb{P}\left(\frac{X_{n:n}-b_n}{a_n}\leq x\right)=\left(1-\frac{e^{-x}}{n}\right)^n\rightarrow \exp(-e^{-x})$

i.e. the limiting distribution for the maximum is Gumbel's distribution.

library(evd)
set.seed(1)
s=1000000
n=100
M=matrix(rexp(s,1),n,s/n)
V=apply(M,2,max)
(bn=qexp(1-1/n))
log(n)
an=1
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(-2,7),ylim=c(0,.39),main="",breaks=seq(-5,15,by=.25))
u=seq(-5,15,by=.1)
v=dgumbel(u)
lines(u,v,lwd=3,col="red")

Consider now a Gaussian sample. We can use the following approximation of the cumulative distribution function (based on l'Hôpital's rule),

$1-\Phi(x)\sim\frac{\varphi(x)}{x}$

as $x\rightarrow\infty$. Let $b_n=\Phi^{-1}(1-1/n)$ and $a_n=1/b_n$. Then we can get

$\mathbb{P}\left(\frac{X_{n:n}-b_n}{a_n}\leq x\right)\rightarrow \exp(-e^{-x})$

as $n\rightarrow\infty$, i.e. the limiting distribution of the maximum of a Gaussian sample is Gumbel's. But what we do not see here is that, for a Gaussian sample, the convergence is extremely slow: with 100 observations, we are still far away from the Gumbel distribution,

and it is only slightly better with 1,000 observations,

set.seed(1)
s=10000000
n=1000
M=matrix(rnorm(s,0,1),n,s/n)
V=apply(M,2,max)
(bn=qnorm(1-1/n,0,1))
an=1/bn
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(-2,7),ylim=c(0,.39),main="",breaks=seq(-5,15,by=.25))
u=seq(-5,15,by=.1)
v=dgumbel(u)
lines(u,v,lwd=3,col="red")

Even worse, consider lognormal observations. In that case, recall that if we apply an (increasing) transformation to the variates, we remain in the same domain of attraction. Hence, since $Y_{n:n}=\exp(X_{n:n})$ when $Y_i=\exp(X_i)$ for a Gaussian sample $(X_i)$, if

$\mathbb{P}\left(\frac{X_{n:n}-b_n}{a_n}\leq x\right)\rightarrow \exp(-e^{-x})$

then

$\mathbb{P}\left(Y_{n:n}\leq \exp(a_nx+b_n)\right)\rightarrow \exp(-e^{-x})$

i.e. using Taylor's approximation on the term on the right, $\exp(a_nx+b_n)\approx e^{b_n}+a_ne^{b_n}x$,

$\mathbb{P}\left(\frac{Y_{n:n}-e^{b_n}}{a_ne^{b_n}}\leq x\right)\rightarrow \exp(-e^{-x})$

This gives us the normalizing coefficients we should use here: $\tilde{b}_n=e^{b_n}$ and $\tilde{a}_n=a_ne^{b_n}$.

set.seed(1)
s=10000000
n=1000
M=matrix(rlnorm(s,0,1),n,s/n)
V=apply(M,2,max)
bn=exp(qnorm(1-1/n,0,1))
an=exp(qnorm(1-1/n,0,1))/(qnorm(1-1/n,0,1))
U=(V-bn)/an
hist(U,probability=TRUE,col="light green",
xlim=c(-2,7),ylim=c(0,.39),main="",breaks=seq(-5,40,by=.25))
u=seq(-5,15,by=.1)
v=dgumbel(u)
lines(u,v,lwd=3,col="red")

Credit: the illustration is from Maurice Sendak's popular book Where the Wild Things Are, translated into French as Max et les Maximonstres.

Tuesday, February 15 2011

## In statistics, having too much information might not be a good thing

A common idea in statistics is that if we don't know something, and we use an estimator of that something (instead of the true value), then there will be some additional uncertainty.

For instance, consider an i.i.d. random sample $\{X_1,\dots,X_n\}$ from a Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$. Then, a confidence interval for the mean is

$\left[\bar{X}\pm z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$

where $z_{1-\alpha/2}$ is the quantile of probability level $1-\alpha/2$ of the standard normal distribution $\mathcal{N}(0,1)$. But the standard deviation $\sigma$ (the "something" I was talking about earlier) is usually unknown. So we substitute an estimate of the standard deviation, e.g.

$s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2}$

and the cost we have to pay is that the new confidence interval is

$\left[\bar{X}\pm t_{1-\alpha/2}(n-1)\frac{s}{\sqrt{n}}\right]$

where now $t_{1-\alpha/2}(n-1)$ is the quantile of the Student $t$ distribution, of probability level $1-\alpha/2$, with $n-1$ degrees of freedom.
We call it a cost since the new confidence interval is wider (the Student distribution has larger upper quantiles than the Gaussian distribution).
So usually, if we substitute an estimate for the true value, there is a price to pay.
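As a small numerical illustration (a sketch, with an arbitrary sample size and level), the Student quantile is indeed larger than the Gaussian one, so the interval based on the estimated standard deviation is wider,

n=20
alpha=.05
qnorm(1-alpha/2)       # Gaussian quantile
qt(1-alpha/2,df=n-1)   # Student quantile, larger
# half-widths of the two confidence intervals, for sigma (or its estimate s) equal to 1
qnorm(1-alpha/2)*1/sqrt(n)
qt(1-alpha/2,df=n-1)*1/sqrt(n)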
A few years ago, with Jean-David Fermanian and Olivier Scaillet, we were writing a survey on copula density estimation (using kernels, here). At the end, we wanted to add a small paragraph on the fact that we assumed we wanted to fit a copula on an i.i.d. sample with distribution $C$, a copula, but that in practice, we start from a sample $\{(X_1,Y_1),\dots,(X_n,Y_n)\}$ with joint distribution $F_{XY}$ (assumed to have continuous margins $F_X$ and $F_Y$, and a - unique - copula $C$). But since the margins are usually unknown, there should be a price to pay for not observing them.
To be more formal, in a perfect world, we would consider the sample
$\{(F_X(X_1),F_Y(Y_1)),\dots,(F_X(X_n),F_Y(Y_n))\}$
but in the real world, we have to consider the pseudo-sample
$\{(\hat{F}_X(X_1),\hat{F}_Y(Y_1)),\dots,(\hat{F}_X(X_n),\hat{F}_Y(Y_n))\}$
where it is standard to consider ranks, i.e. $\hat{F}_X$ and $\hat{F}_Y$ are the empirical cumulative distribution functions.
My point is that when I ran simulations for the survey (the idea was more to give illustrations of several estimation techniques than proofs of technical theorems), we observed that the price to pay... was negative! I.e. the variance of the estimator of the density (anywhere on the unit square) was smaller on the pseudo-sample $\{(\hat{F}_X(X_i),\hat{F}_Y(Y_i))\}$ than on the perfect sample $\{(F_X(X_i),F_Y(Y_i))\}$.
At the time, we could not understand why we got that counter-intuitive result: even if we do know the true distribution, it is better not to use it, and to use a nonparametric estimator instead. Our interpretation was based on the concept of discrepancy, and was related to the latin hypercube construction:

With ranks, the data are more regular, and marginal distributions are exactly uniform on the unit interval. So there is less variance.
This was our heuristic interpretation.
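To give a flavour of that counter-intuitive result, here is a small simulation sketch (with an arbitrary bivariate Gaussian model, and the empirical copula evaluated at one arbitrary point): over many samples, we compare the variability of the empirical copula computed with the true margins and with ranks,

set.seed(1)
n=100; nsim=5000; rho=.5
u=v=.5
C1=C2=rep(NA,nsim)
for(i in 1:nsim){
X=rnorm(n)
Y=rho*X+sqrt(1-rho^2)*rnorm(n)   # bivariate Gaussian pair, correlation rho
# empirical copula at (u,v), perfect world: the margins are known
C1[i]=mean((pnorm(X)<=u)&(pnorm(Y)<=v))
# empirical copula at (u,v), real world: the margins are replaced by ranks
C2[i]=mean((rank(X)/n<=u)&(rank(Y)/n<=v))
}
var(C1)   # with known margins
var(C2)   # with ranks, typically smaller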
A couple of weeks ago, Christian Genest and Johan Segers proved that intuition in an article published in JMVA,

Well, we observed something for finite $n$, but Christian and Johan obtained an analytical result. Hence, if we denote by
$C_n(u,v)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}\left(F_X(X_i)\leq u,\,F_Y(Y_i)\leq v\right)$
the empirical copula in the perfect world (with known margins) and by
$\hat{C}_n(u,v)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}\left(\hat{F}_X(X_i)\leq u,\,\hat{F}_Y(Y_i)\leq v\right)$
the one constructed from the pseudo-sample, they obtained that the asymptotic variance of $\hat{C}_n$ is smaller, everywhere, than the asymptotic variance of $C_n$, with nice graphs of those asymptotic variances,
So I was very happy last week when Christian showed me their results, and to learn that our intuition was correct. Nevertheless, it is still a very counter-intuitive result... If anyone has seen similar things, I'd be glad to hear about it!