# Freakonometrics

## Tag - R-english

Thursday, November 29 2012

## Save R objects, and other stuff

Yesterday, Christopher asked me how to store an R object, to save some time, when working on the project.

First, download the csv file for searches related to some keyword, via http://www.google.com/trends/, for instance "sunglasses". Recall that csv files store tabular data (numbers and text) in plain-text form, with comma-separated values (where csv term comes from). Even if it is not a single and well-defined format, it is a text file, not an excel file!

Recherche sur le Web : intérêt pour sunglasses
Dans tous les pays; De 2004 à ce jour

Intérêt dans le temps
Semaine,sunglasses
2004-01-04 - 2004-01-10,48
2004-01-11 - 2004-01-17,47
2004-01-18 - 2004-01-24,51
2004-01-25 - 2004-01-31,50
2004-02-01 - 2004-02-07,52

> report=read.table(
+ "http://freakonometrics.blog.free.fr/public/data/glasses.csv",
+ skip=4,header=TRUE,sep=",",nrows=464)

Then, we just run a function to transform that data frame. It can be found from a source file

> source("http://freakonometrics.blog.free.fr/public/code/H2M.R")

In this source file, there is function that transforms a weekly series into a monthly one. The output is either a time series, or a numeric vector,

> sunglasses=H2M(report,lang="FR",type="ts")

Here, we asked for a time series,

> sunglasses
Jan      Feb      Mar      Apr
2004 49.00000 54.27586 66.38710 80.10000
2005 48.45161 58.25000 69.93548 80.06667
2006 49.70968 57.21429 67.41935 82.10000
2007 47.32258 55.92857 70.87097 84.36667
2008 47.19355 54.20690 64.03226 79.36667
2009 45.16129 50.75000 63.58065 76.90000
2010 32.67742 44.35714 58.19355 70.00000
2011 44.38710 49.75000 59.16129 71.60000
2012 43.64516 48.75862 64.06452 70.13333
May      Jun      Jul      Aug
2004 83.77419 89.10000 84.67742 73.51613
2005 83.06452 91.36667 89.16129 76.32258
2006 86.00000 92.90000 93.00000 72.29032
2007 86.83871 88.63333 84.61290 72.93548
2008 80.70968 80.30000 78.29032 64.58065
2009 77.93548 70.40000 62.22581 51.58065
2010 71.06452 73.66667 76.90323 61.77419
2011 74.00000 79.66667 79.12903 66.29032
2012 79.74194 82.90000 79.96774 69.80645
Sep      Oct      Nov      Dec
2004 56.20000 46.25806 44.63333 53.96774
2005 56.53333 47.54839 47.60000 54.38710
2006 51.23333 46.70968 45.23333 54.22581
2007 56.33333 46.38710 44.40000 51.12903
2008 51.50000 44.61290 40.93333 47.74194
2009 37.90000 30.38710 28.43333 31.67742
2010 50.16667 46.54839 42.36667 45.90323
2011 52.23333 45.32258 42.60000 47.35484
2012 54.03333 46.09677 43.45833  

that we can plot using

> plot(sunglasses)

Now we would like to store this time series. This is done easily using

> save(sunglasses,file="sunglasses.RData")

Next time we open R, we just have to use

> load("/Users/UQAM/sunglasses.Rdata")

to load the time series in R memory. So saving objects is not difficult.

Last be not least, for the part on seasonal models, we will be using some functions from an old package. Unfortunately, on the CRAN website, we see that

but nicely, files can still be found on some archive page. On Linux, one can easily install the package using (in R)

> install.packages(
+ "/Users/UQAM/uroot_1.4-1.tar.gz",
+ type="source")

With a Mac, it is slightly more complicated (see e.g. Jon's blog): one has to open a Terminal and to type

R CMD INSTALL /Users/UQAM/uroot_1.4-1.tar.gz

On Windows, it is possible to install a package from a zipped file: one has to download the file from archive page, and then to spot it from R.

The package is now installed, we just have to load it to play with it, and use functions it contains to tests for cycles and seasonal behavior,

> library(uroot)
> CH.test
function (wts, frec = NULL, f0 = 1, DetTr = FALSE, ltrunc = NULL)
{
s <- frequency(wts)
t0 <- start(wts)
N <- length(wts)
if (class(frec) == "NULL")
frec <- rep(1, s/2)
if (class(ltrunc) == "NULL")
ltrunc <- round(s * (N/100)^0.25)
R1 <- SeasDummy(wts, "trg")
VFEalg <- SeasDummy(wts, "alg")

(etc)

Tuesday, November 13 2012

## On Box-Cox transform in regression models

A few days ago, a former student of mine, David, contacted me about Box-Cox tests in linear models. It made me look more carefully at the test, and I do not understand what is computed, to be honest. Let us start with something simple, like a linear simple regression, i.e.

$Y_i=\beta_0+\beta_1 X_i+\varepsilon_i$

Let us introduced - as suggested in Box & Cox (1964) - the following family of (power) transformations

$Y_i^{(\lambda)} = \begin{cases} \dfrac{Y_i^\lambda-1}{\lambda} &\text{ } (\lambda \neq 0)\\[8pt] \log{(Y_i)} &\text{ } (\lambda = 0) \end{cases}$

on the variable of interest. Then assume that

$Y_i^{(\lambda)}=\beta_0+\beta_1 X_i+\varepsilon_i$

As mentioned in Chapter 14 of Davidson & MacKinnon (1993) - in French - the log-likelihood of this model (assuming that observations are independent, with distribution $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$) can be written

$\log \mathcal{L}=-\frac{n}{2}\log(2\pi)-n\log\sigma \\-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\beta_1 X_i)\right]^2+(\lambda-1)\sum_{i=1}^n\log Y_i$

We can then use profile-likelihood techniques (see here) to derive the optimal transformation.

This can be done in R extremely simply,

> library(MASS)
> boxcox(lm(dist~speed,data=cars),lambda=seq(0,1,by=.1))

we then get the following graph,

If we look at the code of the function, it is based on the QR decomposition of the $\boldsymbol{X}$matrix (since we assume that $\boldsymbol{X}$ is a full-rank matrix). More precisely, $\boldsymbol{X}=QR$ where $\boldsymbol{X}$ is a $n\times 2$ matrix, $Q$ is a $n\times 2$ orthonornal matrix, and $R$ is a $2\times2$ upper triangle matrix. It might be convenient to use this matrix since, for instance, $R\widehat{\boldsymbol{\beta}}=Q'Y$.  Thus, we do have an upper triangle system of equations.

> X=lm(dist~speed,data=cars)$qr The code used to get the previous graph is (more or less) the following, > g=function(x,lambda){ + y=NA + if(lambda!=0){y=(x^lambda-1)/lambda} + if(lambda==0){y=log(x)} + return(y)} > n=nrow(cars) > X=lm(dist~speed,data=cars)$qr
> Y=cars$dist > logv=function(lambda){ + -n/2*log(sum(qr.resid(X, g(Y,lambda)/ + exp(mean(log(Y)))^(lambda-1))^2))} > L=seq(0,1,by=.05) > LV=Vectorize(logv)(L) > points(L,LV,pch=19,cex=.85,col="red") As we can see (with those red dots) we can reproduce the R graph. But it might not be consistent with other techniques (and functions described above). For instance, we can plot the profile likelihood function, $\lambda\mapsto\log\mathcal{L}$ > logv=function(lambda){ + s=summary(lm(g(dist,lambda)~speed, + data=cars))$sigma
+ e=lm(g(dist,lambda)~speed,data=cars)$residuals + -n/2*log(2 * pi)-n*log(s)-.5/s^2*(sum(e^2))+ + (lambda-1)*sum(log(Y)) + } > L=seq(0,1,by=.01) > LV=Vectorize(logv)(L) > plot(L,LV,type="l",ylab="") > (maxf=optimize(logv,0:1,maximum=TRUE))$maximum
[1] 0.430591

$objective [1] -197.6966 > abline(v=maxf$maximum,lty=2)

The good point is that the optimal value of $\lambda$ is the same as the one we got before. The only problem is that the $y$-axis has a different scale. And using profile likelihood techniques to derive a confidence interval will give us different results (with a larger confidence interval than the one given by the standard function),

> ic=maxf$objective-qchisq(.95,1) > #install.packages("rootSolve") > library(rootSolve) > f=function(x)(logv(x)-ic) > (lower=uniroot(f, c(0,maxf$maximum))$root) [1] 0.1383507 > (upper=uniroot(f, c(maxf$maximum,1))$root) [1] 0.780573 > segments(lower,ic,upper,ic,lwd=2,col="red") Actually, it possible to rewrite the log-likelihood as $\mathcal{L}=\star-\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\beta_1 X_i)}{\dot{Y}^\lambda}\right)^2\right]$ (let us just get rid of the constant), where $\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]$ Here, it becomes > logv=function(lambda){ + e=lm(g(dist,lambda)~speed,data=cars)$residuals
+ elY=(exp(mean(log(Y))))
+ -n/2*log(sum((e/elY^lambda)^2))
+ }
>
> L=seq(0,1,by=.01)
> LV=Vectorize(logv)(L)
> plot(L,LV,type="l",ylab="")
> optimize(logv,0:1,maximum=TRUE)
$maximum [1] 0.430591$objective
[1] -47.73436

with again the same optimal value for $\lambda$, and the same confidence interval, since the function is the same, up to some additive constant.

So we have been able to derive the optimal transformation according to Box-Cox transformation, but so far, the confidence interval is not the same (it might come from the fact that here we substituted an estimator to the unknown parameter $\sigma$.

Wednesday, October 31 2012

## Why pictures are so important when modeling data?

(bis repetita) Consider the following regression summary,
Call:
lm(formula = y1 ~ x1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,	Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217 
obtained from
> data(anscombe)
> reg1=lm(y1~x1,data=anscombe)
Can we say something if we look (only) at that output ? The intercept is significatively non-null, as well as the slope, the $R^2$ is large (66%). It looks like we do have a nice model here. And in a perfect world, we might hope that data are coming from this kind of dataset,
But it might be possible to have completely different kinds of patterns. Actually, four differents sets of data are coming from Anscombe (1973). And that all those datasets are somehow equivalent: the $X$'s have the same mean, and the same variance
> apply(anscombe[,1:4],2,mean)
x1 x2 x3 x4
9  9  9  9
> apply(anscombe[,1:4],2,var)
x1 x2 x3 x4
11 11 11 11 
and so are the $Y$'s
> apply(anscombe[,5:8],2,mean)
y1       y2       y3       y4
7.500909 7.500909 7.500000 7.500909
> apply(anscombe[,5:8],2,var)
y1       y2       y3       y4
4.127269 4.127629 4.122620 4.123249 
Further, observe also that the correlation between the $X$'s and the $Y$'s is the same
> cor(anscombe)[1:4,5:8]
y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
> diag(cor(anscombe)[1:4,5:8])
[1] 0.8164205 0.8162365 0.8162867 0.8165214
which yields the same regression line (intercept and slope)
> cbind(coef(reg1),coef(reg2),coef(reg3),coef(reg4))
[,1]     [,2]      [,3]      [,4]
(Intercept) 3.0000909 3.000909 3.0024545 3.0017273
x1          0.5000909 0.500000 0.4997273 0.4999091
But there is more. Much more. For instance, we always have the standard deviation for residuals
> c(summary(reg1)$sigma,summary(reg2)$sigma,
+ summary(reg3)$sigma,summary(reg4)$sigma)
[1] 1.236603 1.237214 1.236311 1.235695
Thus, all regressions here have the same R2
> c(summary(reg1)$r.squared,summary(reg2)$r.squared,
+ summary(reg3)$r.squared,summary(reg4)$r.squared)
[1] 0.6665425 0.6662420 0.6663240 0.6667073
Finally, Fisher's F statistics is also (almost) the same.
+ c(summary(reg1)$fstatistic[1],summary(reg2)$fstatistic[1],
+ summary(reg3)$fstatistic[1],summary(reg4)$fstatistic[1])
value    value    value    value
17.98994 17.96565 17.97228 18.00329 
Thus, with the following datasets, we have the same prediction (and the same confidence intervals). Consider for instance the second dataset (the first one being mentioned above),
> reg2=lm(y2~x2,data=anscombe)
The output is here exactly the same as the one we had above
> summary(reg2)

Call:
lm(formula = y2 ~ x2, data = anscombe)

Residuals:
Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x2             0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179 
Here, the perfect model is the one obtained with a quadratic regression.
> reg2b=lm(y2~x2+I(x2^2),data=anscombe)
> summary(reg2b)

Call:
lm(formula = y2 ~ x2 + I(x2^2), data = anscombe)

Residuals:
Min         1Q     Median         3Q        Max
-0.0013287 -0.0011888 -0.0006294  0.0008741  0.0023776

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.9957343  0.0043299   -1385   <2e-16 ***
x2           2.7808392  0.0010401    2674   <2e-16 ***
I(x2^2)     -0.1267133  0.0000571   -2219   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.001672 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 7.378e+06 on 2 and 8 DF,  p-value: < 2.2e-16 
Consider now the third one
> reg3=lm(y3~x3,data=anscombe)
i.e.
> summary(reg3)

Call:
lm(formula = y3 ~ x3, data = anscombe)

Residuals:
Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176 
This time, the linear model could have been perfect. The problem is one outlier. If we remove it, we have
> reg3b=lm(y3~x3,data=anscombe[-3,])
> summary(reg3b)

Call:
lm(formula = y3 ~ x3, data = anscombe[-3, ])

Residuals:
Min         1Q     Median         3Q        Max
-0.0041558 -0.0022240  0.0000649  0.0018182  0.0050649

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0056494  0.0029242    1370   <2e-16 ***
x3          0.3453896  0.0003206    1077   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003082 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 1.161e+06 on 1 and 8 DF,  p-value: < 2.2e-16 
Finally consider
> reg4=lm(y4~x4,data=anscombe)
This time, there is an other kind of outlier, in $X$'s, but again, the regression is exactly the same,
> summary(reg4)

Call:
lm(formula = y4 ~ x4, data = anscombe)

Residuals:
Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667,	Adjusted R-squared: 0.6297
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165 
The graph is here
So clearly, looking at the summary of a regression does not tell us anything... This is why we do spend some time on diagnostic, looking at graphs with the errors (the graphs above could be obtained only with one explanatory variable, while errors can be studied in any dimension): everything can be seen on thise graphs. E.g. for the first dataset,
or the second one
the third one
or the fourth one,

Wednesday, October 17 2012

## Fractals and Kronecker product

A few years ago, I went to listen to Roger Nelsen who was giving a talk about copulas with fractal support. Roger is amazing when he gives a talk (I am also a huge fan of his books, and articles), and I really wanted to play with that concept (that he did publish later on, with Gregory Fredricks and José Antonio Rodriguez-Lallena). I did mention that idea in a paper, writen with Alessandro Juri, just to mention some cases where deriving fixed point theorems is not that simple (since the limit may not exist).

The idea in the initial article was to start with something quite simple, a the so-called transformation matrix, e.g.

$T=\frac{1}{8}\left(\begin{matrix}1& 0 & 1 \\ 0 & 4 & 0 \\ 1 & 0&1\end{matrix}\right)$

Here, in all areas with mass, we spread it uniformly (say), i.e. the support of $T(C^\perp)$ is the one below, i.e. $1/8$th of the mass is located in each corner, and $1/2$ is in the center. So if we spread the mass to have a copula (with uniform margin,)we have to consider squares on intervals $[0,1/4]$, $[1/4,3/4]$ and $[3/4,1]$,

Then the idea, then, is to consider $T^2=\otimes^2T$, where  $\otimes^2T$ is the tensor product (also called Kronecker product) of $T$ with itself. Here, the support of $T^2(C^\perp)$ is

Then, consider $T^3=\otimes^3T$, where $\otimes^3T$ is the tensor product of $T$ with itself, three times. And the support of $T^3(C^\perp)$ is

Etc. Here, it is computationally extremely simple to do it, using this Kronecker product. Recall that if $\mathbf{A}=(a_{i,j})$, then

$\mathbf{A}\otimes\mathbf{B} = \begin{pmatrix} a_{11} \mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1} \mathbf{B} & \cdots & a_{mn} \mathbf{B} \end{pmatrix}$

So, we need a transformation matrix: consider the following $4\times4$ matrix,
> k=4
> M=matrix(c(1,0,0,1,
+            0,1,1,0,
+            0,1,1,0,
+            1,0,0,1),k,k)
> M
[,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    0    1    1    0
[3,]    0    1    1    0
[4,]    1    0    0    1
Once we have it, we just consider the Kronecker product of this matrix with itself, which yields a $4^2\times4^2$ matrix,
> N=kronecker(M,M)
> N[,1:4]
[,1]  [,2] [,3] [,4]
[1,]     1    0    0    1
[2,]     0    1    1    0
[3,]     0    1    1    0
[4,]     1    0    0    1
[5,]     0    0    0    0
[6,]     0    0    0    0
[7,]     0    0    0    0
[8,]     0    0    0    0
[9,]     0    0    0    0
[10,]    0    0    0    0
[11,]    0    0    0    0
[12,]    0    0    0    0
[13,]    1    0    0    1
[14,]    0    1    1    0
[15,]    0    1    1    0
[16,]    1    0    0    1
And then, we continue,
> for(s in 1:3){N=kronecker(N,M)}
After only a couple of loops, we have a $4^5\times4^5$ matrix. And we can plot it simply to visualize the support,
> image(N,col=c("white","blue"))
As we zoom in, we can visualize this fractal property,

Saturday, October 13 2012

## Compound Poisson and vectorized computations

Yesterday, I was asked how to write a code to generate a compound Poisson variables, i.e. a series of random variables $S=X_1+\cdots+X_N$ where $N$ is a counting random variable (here Poisson disributed) and where the $X_i$'s are i.i.d (and independent of $N$), with the convention $S=0$ when $N=0$. I came up with the following algorithm, but I was wondering if it was possible to get a better one...

>  rcpd=function(n,rN,rX){
+  N=rN(n)
+  X=rX(sum(N))
+  I=as.factor(rep(1:n,N))
+  S=tapply(X,I,sum)
+  V=as.numeric(S[as.character(1:n)])
+  V[is.na(V)]=0
+  return(V)}

Here, consider - to illustrate - the case where $N\sim\mathcal{P}(5)$ and $X_i\sim\mathcal{E}(2)$,

>  rN.P=function(n) rpois(n,5)
>  rX.E=function(n) rexp(n,2)

We can generate a sample

>  S=rcpd(1000,rN=rN.P,rX=rX.E)

and check (using simulation) than $\mathbb{E}(S)=\mathbb{E}(N)\mathbb{E}(X_i)$

> mean(S)
[1] 2.547033
> mean(rN.P(1000))*mean(rX.E(1000))
[1] 2.548309

and that $\text{Var}(S)=\mathbb{E}(N)\text{Var}(X_i)+\mathbb{E}(X_i)^2\text{Var}(N)$

> var(S)
[1] 2.60393
> mean(rN.P(1000))*var(rX.E(1000))+
+ mean(rX.E(1000))^2*var(rN.P(1000))
[1] 2.621376

If anyone might think of a faster algorithm, I'd be glad to hear about it...

Sunday, September 23 2012

## Maximum likelihood estimates for multivariate distributions

Consider our loss-ALAE dataset, and - as in Frees & Valdez (1998) - let us fit a parametric model, in order to price a reinsurance treaty. The dataset is the following,

> library(evd)
> data(lossalae)
> Z=lossalae
> X=Z[,1];Y=Z[,2]

The first step can be to estimate marginal distributions, independently. Here, we consider lognormal distributions for both components,

> Fempx=function(x) mean(X<=x)
> Fx=Vectorize(Fempx)
> u=exp(seq(2,15,by=.05))
> plot(u,Fx(u),log="x",type="l",
+ xlab="loss (log scale)")
> Lx=function(px) -sum(log(Vectorize(dlnorm)(
+ X,px[1],px[2])))
> opx=optim(c(1,5),fn=Lx)
> opx$par [1] 9.373679 1.637499 > lines(u,Vectorize(plnorm)(u,opx$par[1],
+ opx$par[2]),col="red") The fit here is quite good, For the second component, we do the same, > Fempy=function(x) mean(Y<=x) > Fy=Vectorize(Fempy) > u=exp(seq(2,15,by=.05)) > plot(u,Fy(u),log="x",type="l", + xlab="ALAE (log scale)") > Ly=function(px) -sum(log(Vectorize(dlnorm)( + Y,px[1],px[2]))) > opy=optim(c(1.5,10),fn=Ly) > opy$par
[1] 8.522452 1.429645
> lines(u,Vectorize(plnorm)(u,opy$par[1], + opy$par[2]),col="blue")

It is not as good as the fit obtained on losses, but it is not that bad,

Now, consider a multivariate model, with Gumbel copula. We've seen before that it worked well. But this time, consider the maximum likelihood estimator globally.

> Cop=function(u,v,a) exp(-((-log(u))^a+
+ (-log(v))^a)^(1/a))
> phi=function(t,a) (-log(t))^a
> cop=function(u,v,a) Cop(u,v,a)*(phi(u,a)+
+ phi(v,a))^(1/a-2)*(
+ a-1+(phi(u,a)+phi(v,a))^(1/a))*(phi(u,a-1)*
+ phi(v,a-1))/(u*v)
> L=function(p) {-sum(log(Vectorize(dlnorm)(
+ X,p[1],p[2])))-
+ sum(log(Vectorize(dlnorm)(Y,p[3],p[4])))-
+ sum(log(Vectorize(cop)(plnorm(X,p[1],p[2]),
+ plnorm(Y,p[3],p[4]),p[5])))}
> opz=optim(c(1.5,10,1.5,10,1.5),fn=L)
> opz$par [1] 9.377219 1.671410 8.524221 1.428552 1.468238 Marginal parameters are (slightly) different from the one obtained independently, > c(opx$par,opy$par) [1] 9.373679 1.637499 8.522452 1.429645 > opz$par[1:4]
[1] 9.377219 1.671410 8.524221 1.428552

And the parameter of Gumbel copula is close to the one obtained with heuristic methods in class.

Now that we have a model, let us play with it, to price a reinsurance treaty. But first, let us see how to generate Gumbel copula... One idea can be to use the frailty approach, based on a stable frailty. And we can use Chambers et al (1976) to generate a stable distribution. So here is the algorithm to generate samples from Gumbel copula

> alpha=opz$par[5] > invphi=function(t,a) exp(-t^(1/a)) > n=500 > x=matrix(rexp(2*n),n,2) > angle=runif(n,0,pi) > E=rexp(n) > beta=1/alpha > stable=sin((1-beta)*angle)^((1-beta)/beta)* + (sin(beta*angle))/(sin(angle))^(1/beta)/ + (E^(alpha-1)) > U=invphi(x/stable,alpha) > plot(U) Here, we consider only 500 simulations, Based on that copula simulation, we can then use marginal transformations to generate a pair, losses and allocated expenses, > Xloss=qlnorm(U[,1],opz$par[1],opz$par[2]) > Xalae=qlnorm(U[,2],opz$par[3],opz$par[4]) In standard reinsurance treaties - see e.g. Clarke (1996) - allocated expenses are splited prorata capita between the insurance company, and the reinsurer. If $X$ denotes losses, and $Y$ the allocated expenses, a standard excess treaty can be has payoff $\begin{cases} 0\cdot \boldsymbol{1}(X\leq R)\\ \displaystyle{\left(X-R+ \frac{X-R}{X}\cdot Y\right) \right )}\cdot\boldsymbol{1}(R where $L$ denotes the (upper) limit, and $R$ the insurer's retention. Using monte carlo simulation, it is then possible to estimate the pure premium of such a reinsurance treaty. > L=100000 > R=50000 > Z=((Xloss-R)+(Xloss-R)/Xloss*Xalae)* + (R<=Xloss)*(Xloss<L)+ + ((L-R)+(L-R)/R*Xalae)*(L<=Xloss) > mean(Z) [1] 12596.45 Now, play with it... it is possible to find a better fit, I guess... Thursday, September 20 2012 ## Interactive 3d plot, in R Following the course of this afternoon, I will just upload some codes to make interactive 3d plots, in R. > library(rgl) > library(evd); > data(lossalae) > U=rank(lossalae[,1]+rnorm(nrow(lossalae), + mean=0,sd=.001))/(nrow(lossalae)+1) > V=rank(lossalae[,2])/(nrow(lossalae)+1) > M=kde2d(qnorm(U),qnorm(V),n=35) > library(rgl) > persp3d(M$x,M$y,M$z,col='green',
+ xlab="loss",ylab="alae",zlab="")


## (nonparametric) Copula density estimation

Today, we will go further on the inference of copula functions. Some codes (and references) can be found on a previous post, on nonparametric estimators of copula densities (among other related things).  Consider (as before) the loss-ALAE dataset (since we've been working a lot on that dataset)

> library(MASS)
> library(evd)
> X=lossalae
> U=cbind(rank(X[,1])/(nrow(X)+1),rank(X[,2])/(nrow(X)+1))

The standard tool to plot nonparametric estimators of densities is to use multivariate kernels. We can look at the density using

> mat1=kde2d(U[,1],U[,2],n=35)
> persp(mat1$x,mat1$y,mat1$z,col="green", + shade=TRUE,theta=s*5, + xlab="",ylab="",zlab="",zlim=c(0,7)) or level curves (isodensity curves) with more detailed estimators (on grids with shorter steps) > mat1=kde2d(U[,1],U[,2],n=101) > image(mat1$x,mat1$y,mat1$z,col=
+ rev(heat.colors(100)),xlab="",ylab="")
> contour(mat1$x,mat1$y,mat1$z,add= + TRUE,levels = pretty(c(0,4), 11)) Kernels are nice, but we clearly observe some border bias, extremely strong in corners (the estimator is 1/4th of what it should be, see another post for more details). Instead of working on sample $(U_i,V_i)$ on the unit square, consider some transformed sample $(Q(U_i),Q(V_i))$, where $Q:(0,1)\rightarrow\mathbb{R}$ is a given function. E.g. a quantile function of an unbounded distribution, for instance the quantile function of the $\mathcal{N}(0,1)$ distribution. Then, we can estimate the density of the transformed sample, and using the inversion technique, derive an estimator of the density of the initial sample. Since the inverse of a (general) function is not that simple to compute, the code might be a bit slow. But it does work, > gaussian.kernel.copula.surface <- function (u,v,n) { + s=seq(1/(n+1), length=n, by=1/(n+1)) + mat=matrix(NA,nrow = n, ncol = n) + sur=kde2d(qnorm(u),qnorm(v),n=1000, + lims = c(-4, 4, -4, 4)) + su<-sur$z
+ for (i in 1:n) {
+     for (j in 1:n) {
+ 	Xi<-round((qnorm(s[i])+4)*1000/8)+1;
+ 	Yj<-round((qnorm(s[j])+4)*1000/8)+1
+ 	mat[i,j]<-su[Xi,Yj]/(dnorm(qnorm(s[i]))*
+ 	dnorm(qnorm(s[j])))
+     }
+ }
+ return(list(x=s,y=s,z=data.matrix(mat)))
+ }

Here, we get

Note that it is possible to consider another transformation, e.g. the quantile function of a Student-t distribution.

> student.kernel.copula.surface =
+  function (u,v,n,d=4) {
+  s <- seq(1/(n+1), length=n, by=1/(n+1))
+  mat <- matrix(NA,nrow = n, ncol = n)
+ sur<-kde2d(qt(u,df=d),qt(v,df=d),n=5000,
+ lims = c(-8, 8, -8, 8))
+ su<-sur$z + for (i in 1:n) { + for (j in 1:n) { + Xi<-round((qt(s[i],df=d)+8)*5000/16)+1; + Yj<-round((qt(s[j],df=d)+8)*5000/16)+1 + mat[i,j]<-su[Xi,Yj]/(dt(qt(s[i],df=d),df=d)* + dt(qt(s[j],df=d),df=d)) + } + } + return(list(x=s,y=s,z=data.matrix(mat))) + } Another strategy is to consider kernel that have precisely the unit interval as support. The idea is here to consider the product of Beta kernels, where parameters depend on the location > beta.kernel.copula.surface= + function (u,v,bx=.025,by=.025,n) { + s <- seq(1/(n+1), length=n, by=1/(n+1)) + mat <- matrix(0,nrow = n, ncol = n) + for (i in 1:n) { + a <- s[i] + for (j in 1:n) { + b <- s[j] + mat[i,j] <- sum(dbeta(a,u/bx,(1-u)/bx) * + dbeta(b,v/by,(1-v)/by)) / length(u) + } + } + return(list(x=s,y=s,z=data.matrix(mat))) + } On those two graphs, we can clearly observe strong tail dependence in the upper (right) corner, that cannot be intuited using a standard kernel estimator... Tuesday, September 18 2012 ## Copulas and tail dependence, part 3 We have seen extreme value copulas in the section where we did consider general families of copulas. In the bivariate case, an extreme value can be written where $A(\cdot)$ is Pickands dependence function, which is a convex function satisfying Observe that in this case, where is Kendall'tau, and can be written For instance, if then, we obtain Gumbel copula. This is what we've seen in the section where we introduced this family. Now, let us talk about (nonparametric) inference, and more precisely the estimation of the dependence function. The starting point of the most standard estimator is to observe that if has copula $C$, then has distribution function And conversely, Pickands dependence function can be written Thus, a natural estimator for Pickands function is where $\widehat{H}_n$ is the empirical cumulative distribution function of This is the estimator proposed in Capéràa, Fougères & Genest (1997). Here, we can compute everything here using > library(evd) > X=lossalae > U=cbind(rank(X[,1])/(nrow(X)+1),rank(X[,2])/(nrow(X)+1)) > Z=log(U[,1])/log(U[,1]*U[,2]) > h=function(t) mean(Z<=t) > H=Vectorize(h) > a=function(t){ + f=function(t) (H(t)-t)/(t*(1-t)) + return(exp(integrate(f,lower=0,upper=t, + subdivisions=10000)$value))
+ }
> A=Vectorize(a)
> u=seq(.01,.99,by=.01)
> plot(c(0,u,1),c(1,A(u),1),type="l",col="red",
+ ylim=c(.5,1))
Even integrate to get an estimator of Pickands' dependence function. Note that an interesting point is that the upper tail dependence index can be visualized on the graph, above,

> A(.5)/2
[1] 0.4055346

## Copulas and tail dependence, part 2

An alternative to describe tail dependence can be found in the Ledford & Tawn (1996) for instance. The intuition behind can be found in Fischer & Klein (2007)). Assume that and have the same distribution. Now, if we assume that those variables are (strictly) independent,

But if we assume that those variables are (strictly) comonotonic (i.e. equal here since they have the same distribution), then
So assume that there is a such that
Then =2 can be interpreted as independence while =1 means strong (perfect) positive dependence. Thus, consider the following transformation to get a parameter in [0,1], with a strength of dependence increasing with the index, e.g.
In order to derive a tail dependence index, assume that there exists a limit to
which will be interpreted as a (weak) tail dependence index. Thus define concentration functions

for the lower tail (on the left) and

for the upper tail (on the right). The R code to compute those functions is quite simple,
> library(evd);
> data(lossalae)
> X=lossalae
> U=rank(X[,1])/(nrow(X)+1)
> V=rank(X[,2])/(nrow(X)+1
> fL2emp=function(z) 2*log(mean(U<=z))/
+ log(mean((U<=z)&(V<=z)))-1
> fR2emp=function(z) 2*log(mean(U>=1-z))/
+ log(mean((U>=1-z)&(V>=1-z)))-1
> u=seq(.001,.5,by=.001)
> L=Vectorize(fL2emp)(u)
> R=Vectorize(fR2emp)(rev(u))
> plot(c(u,u+.5-u[1]),c(L,R),type="l",ylim=0:1,
+ xlab="LOWER TAIL      UPPER TAIL")
> abline(v=.5,col="grey")
and again, it is possible to plot those empirical functions against some parametric ones, e.g. the one obtained from a Gaussian copula (with the same Kendall's tau)
> tau=cor(lossalae,method="kendall")[1,2]
> library(copula)
> paramgauss=sin(tau*pi/2)
> copgauss=normalCopula(paramgauss)
> Lgaussian=function(z) 2*log(z)/log(pCopula(c(z,z),
+ copgauss))-1
> Rgaussian=function(z) 2*log(1-z)/log(1-2*z+
+ pCopula(c(z,z),copgauss))-1
> u=seq(.001,.5,by=.001)
> Lgs=Vectorize(Lgaussian)(u)
> Rgs=Vectorize(Rgaussian)(1-rev(u))
> lines(c(u,u+.5-u[1]),c(Lgs,Rgs),col="red")

or Gumbel copula,
> paramgumbel=1/(1-tau)
> copgumbel=gumbelCopula(paramgumbel, dim = 2)
> Lgumbel=function(z) 2*log(z)/log(pCopula(c(z,z),
+ copgumbel))-1
> Rgumbel=function(z) 2*log(1-z)/log(1-2*z+
+ pCopula(c(z,z),copgumbel))-1
> Lgl=Vectorize(Lgumbel)(u)
> Rgl=Vectorize(Rgumbel)(1-rev(u))
> lines(c(u,u+.5-u[1]),c(Lgl,Rgl),col="blue")

Again, one should look more carefully at confidence bands, but is looks like Gumbel copula provides a good fit here.

Monday, September 17 2012

## Copulas and tail dependence, part 1

As mentioned in the course last week Venter (2003) suggested nice functions to illustrate tail dependence (see also some slides used in Berlin a few years ago).
• Joe (1990)'s lambda
Joe (1990) suggested a (strong) tail dependence index. For lower tails, for instance, consider
i.e

• Upper and lower strong tail (empirical) dependence functions

The idea is to plot the function above, in order to visualize limiting behavior. Define

for the lower tail, and
for the upper tail, where is the survival copula associated with , in the sense that
while

Now, one can easily derive empirical conterparts of those function, i.e.

and
Thus, for upper tail, on the right, we have the following graph

and for the lower tail, on the left, we have

For the code, consider some real data, like the loss-ALAE dataset.

> library(evd)
> X=lossalae

The idea is to plot, on the left, the lower tail concentration function, and on the right, the upper tail function.

> U=rank(X[,1])/(nrow(X)+1)
> V=rank(X[,2])/(nrow(X)+1)
> Lemp=function(z) sum((U<=z)&(V<=z))/sum(U<=z)
> Remp=function(z) sum((U>=1-z)&(V>=1-z))/sum(U>=1-z)
> u=seq(.001,.5,by=.001)
> L=Vectorize(Lemp)(u)
> R=Vectorize(Remp)(rev(u))
> plot(c(u,u+.5-u[1]),c(L,R),type="l",ylim=0:1,
+ xlab="LOWER TAIL          UPPER TAIL")
> abline(v=.5,col="grey")

Now, we can compare this graph, with what should be obtained for some parametric copulas that have the same Kendall's tau (e.g.). For instance, if we consider a Gaussian copula,

> tau=cor(lossalae,method="kendall")[1,2]
> library(copula)
> paramgauss=sin(tau*pi/2)
> copgauss=normalCopula(paramgauss)
> Lgaussian=function(z) pCopula(c(z,z),copgauss)/z
> Rgaussian=function(z) (1-2*z+pCopula(c(z,z),copgauss))/(1-z)
> u=seq(.001,.5,by=.001)
> Lgs=Vectorize(Lgaussian)(u)
> Rgs=Vectorize(Rgaussian)(1-rev(u))> lines(c(u,u+.5-u[1]),c(Lgs,Rgs),col="red")

or Gumbel's copula,

> paramgumbel=1/(1-tau)
> copgumbel=gumbelCopula(paramgumbel, dim = 2)
> Lgumbel=function(z) pCopula(c(z,z),copgumbel)/z
> Rgumbel=function(z) (1-2*z+pCopula(c(z,z),copgumbel))/(1-z)
> u=seq(.001,.5,by=.001)
> Lgl=Vectorize(Lgumbel)(u)
> Rgl=Vectorize(Rgumbel)(1-rev(u))
> lines(c(u,u+.5-u[1]),c(Lgl,Rgl),col="blue")

That's nice (isn't it?), but since we do not have any confidence interval, it is still hard to conclude (even if it looks like Gumbel copula has a much better fit than the Gaussian one). A strategy can be to generate samples from those copulas, and to visualize what we had. With a Gaussian copula, the graph looks like
> u=seq(.0025,.5,by=.0025); nu=length(u)
> nsimul=500
> MGS=matrix(NA,nsimul,2*nu)
> for(s in 1:nsimul){
+ Xs=rCopula(nrow(X),copgauss)
+ Us=rank(Xs[,1])/(nrow(Xs)+1)
+ Vs=rank(Xs[,2])/(nrow(Xs)+1)
+ Lemp=function(z) sum((Us<=z)&(Vs<=z))/sum(Us<=z)
+ Remp=function(z) sum((Us>=1-z)&(Vs>=1-z))/sum(Us>=1-z)
+ MGS[s,1:nu]=Vectorize(Lemp)(u)
+ MGS[s,(nu+1):(2*nu)]=Vectorize(Remp)(rev(u))
+ lines(c(u,u+.5-u[1]),MGS[s,],col="red")
+ }

(including - pointwise - 90% confidence bands)

> Q95=function(x) quantile(x,.95)
> V95=apply(MGS,2,Q95)
> lines(c(u,u+.5-u[1]),V95,col="red",lwd=2)
> Q05=function(x) quantile(x,.05)
> V05=apply(MGS,2,Q05)
> lines(c(u,u+.5-u[1]),V05,col="red",lwd=2)

while it is

with Gumbel copula. Isn't it a nice (graphical) tool ?

But as mentioned in the course, the statistical convergence can be slow. Extremely slow. So assessing if the underlying copula has tail dependence, or not, it now that simple. Especially if the copula exhibits tail independence. Like the Gaussian copula. Consider a sample of size 1,000. This is what we obtain if we generate random scenarios,

or we look at the left tail (with a log-scale)

Now, consider a 10,000 sample,

or with a log-scale

We can even consider a 100,000 sample,

or with a log-scale

On those graphs, it is rather difficult to conclude if the limit is 0, or some strictly positive value (again, it is a classical statistical problem when the value of interest is at the border of the support of the parameter). So, a natural idea is to consider a weaker tail dependence index. Unless you have something like 100,000 observations...

Wednesday, September 12 2012

## Kendall's function for copulas

As mentioned in the course on copulas, a nice tool to describe dependence it Kendall's cumulative function. Given a random pair with distribution  , define random variable . Then Kendall's cumulative function is

Genest and Rivest (1993) introduced that function to choose among Archimedean copulas (we'll get back to this point below).

From a computational point of view, computing such a function can be done as follows,

• for all , compute as the proportion of observation in the lower quadrant, with upper corner , i.e.

• then compute the cumulative distribution function of 's.

To visualize the construction of that cumulative distribution function, consider the following animation

Thus, here the code to compute simply that cumulative distribution function is

n=nrow(X)
i=rep(1:n,each=n)
j=rep(1:n,n)
S=((X[i,1]>X[j,1])&(X[i,2]>X[j,2]))
Z=tapply(S,i,sum)/(n-1)

The graph can be obtain either using

plot(ecdf(Z))

or

plot(sort(Z),(1:n)/n,type="s",col="red")

The interesting point is that for an Archimedean copula with generator , then Kendall's function is simply

If we're too lazy to do the maths, at least, it is possible to compute those functions numerically. For instance, for Clayton copula,

h=.001
phi=function(t){(t^(-alpha)-1)}
dphi=function(t){(phi(t+h)-phi(t-h))/2/h}
k=function(t){t-phi(t)/dphi(t)}
Kc=Vectorize(k)

Similarly, let us consider Gumbel copula,

phi=function(t){(-log(t))^(theta)}
dphi=function(t){(phi(t+h)-phi(t-h))/2/h}
k=function(t){t-phi(t)/dphi(t)}
Kg=Vectorize(k)

If we plot the empirical Kendall's function (obtained from the sample), with different theoretical ones, derived from Clayton copulas (on the left, in blue) or Gumbel copula (on the right, in purple), we have the following,

Note that the different curves were obtained when Clayton copula has Kendall's tau equal to 0, .1, .2, .3, ..., .9, 1, and similarly for Gumbel copula (so that Figures can be compared). The following table gives a correspondence, from Kendall's tau to the underlying parameter of a copula (for different families)

as well as Spearman's rho,

To conclude, observe that there are two important particular cases that can be identified here: the case of perfect dependent, on the first diagonal when , and the case of independence, the upper green curve, . It should also be mentioned that it is also common to plot not function , but function , defined as ,

## Association and concordance measures

Following the course, in order to define assocation measures (from Kruskal (1958)) or concordance measures (from Scarsini (1984)), define a concordance function as follows: let be a random pair with copula , and with copula . Then define

the so-called concordance function. Thus

As proved last week in class,

Based on that function, several concordance measures can be derived. A popular measure is Kendall's tau, from Kendall (1938), defined as i.e.

which is simply . Here, computation can be tricky. Consider the following sample,

> set.seed(1)
> n=40
> library(mnormt)
> X=rmnorm(n,c(0,0),
+ matrix(c(1,.4,.4,1),2,2))
> U=cbind(rank(X[,1]),rank(X[,2]))/(n+1)

Then, using R function, we can obtain Kendall's tau easily,

> cor(X,method="kendall")[1,2]
[1] 0.3794872

To get our own code (and to understand a bit more how to get that coefficient), we can use

> i=rep(1:(n-1),(n-1):1)
> j=2:n
> for(k in 3:n){j=c(j,k:n)}
> M=cbind(X[i,],X[j,])
> concordant=sum((M[,1]-M[,3])*(M[,2]-M[,4])>0)
> discordant=sum((M[,1]-M[,3])*(M[,2]-M[,4])<0)
> total=n*(n-1)/2
> (K=(concordant-discordant)/total)
[1] 0.3794872

or the following (we'll use random variable quite frequently),

> i=rep(1:n,each=n)
> j=rep(1:n,n)
> Z=((X[i,1]>X[j,1])&(X[i,2]>X[j,2]))
> (K=4*mean(Z)*n/(n-1)-1)
[1] 0.3794872

Another measure is Spearman's rank correlation, from Spearman (1904),

where has distribution .

Here, which leads to the following expressions

Numerically, we have the following

> cor(X,method="spearman")[1,2]
[1] 0.5388368
> cor(rank(X[,1]),rank(X[,2]))
[1] 0.5388368

Note that it is also possible to write

Another measure is the cograduation index, from Gini (1914), obtained by sybstituting an L1 norm instead of a L2 one in the previous expression,

Note that this index can also be defined as . Here,

> Rx=rank(X[,1]);Ry=rank(X[,2]);
> (G=2/(n^2) *(sum(abs(Rx+Ry-n-1))-
+ sum(abs(Rx-Ry))))
[1] 0.41

Finally, another measure is the one from Blomqvist (1950). Let denote the median of , i.e.

Then define

or equivalently

> Mx=median(X[,1]);My=median(X[,2])
> (B=4*sum((X[,1]<=Mx)*((X[,2]<=My)))/n-1)
[1] 0.4

Monday, September 10 2012

## Unit root, or not ? is it a big deal ?

Consider a time series, generated using

set.seed(1)
E=rnorm(240)
X=rep(NA,240)
rho=0.8
X[1]=0
for(t in 2:240){X[t]=rho*X[t-1]+E[t]}

The idea is to assume that an autoregressive model can be considered, but we don't know the value of the parameter. More precisely, we can't choose if the parameter is either one (and the series is integrated), or some value strictly smaller than 1 (and the series is stationary). Based on past observations, the higher the autocorrelation, the lower the variance of the noise.

rhoest=0.9; H=260
u=241:(240+H)
P=X[240]*rhoest^(1:H)
s=sqrt(1/(sum((rhoest^(2*(1:300))))))*sd(X)
Now that we have a model, consider the following forecast, including a confidence interval,

plot(1:240,X,xlab="",xlim=c(0,240+H),
ylim=c(-9.25,9),ylab="",type="l")
V=cumsum(rhoest^(2*(1:H)))*s
polygon(c(u,rev(u)),c(P+1.96*sqrt(V),
rev(P-1.96*sqrt(V))),col="yellow",border=NA)
polygon(c(u,rev(u)),c(P+1.64*sqrt(V),
rev(P-1.64*sqrt(V))),col="orange",border=NA)
lines(u,P,col="red")
Here, forecasts can be derived, with any kind of possible autoregressive coefficient, from 0.7 to 1. I.e. we can chose to model the time series either with a stationary, or an integrated series,
.
As we can see above, assuming either that the series is stationary (parameter lower - strictly - than 1) or integrated (parameter equal to 1), the shape of the prediction can be quite different. So yes, assuming an integrated model is a big deal, since it has a strong impact on predictions.

Friday, September 7 2012

## That damn R-squared !

Another post about the R-squared coefficient, and about why, after some years teaching econometrics, I still hate when students ask questions about it. Usually, it starts with "I have a _____ R-squared... isn't it too low ?" Please, feel free to fill in the blanks with your favorite (low) number. Say 0.2. To make it simple, there are different answers to that question:

1. if you don't want to waste time understanding econometrics, I would say something like "Forget about the R-squared, it is useless" (perhaps also "please, think twice about taking that econometrics course")
2. if you're ready to spend some time to get a better understanding on subtle concepts, I would say "I don't like the R-squared. I might be interesting in some rare cases (you can probably count them on the fingers of one finger), like comparing two models on the same dataset (even so, I would recommend the adjusted one). But usually, its values has no meaning. You can compare 0.2 and 0.3 (and prefer the 0.3 R-squared model, rather than the 0.2 R-squared one), but 0.2 means nothing". Well, not exactly, since it means something, but it is not a measure tjat tells you if you deal with a good or a bad model. Well, again, not exactly, but it is rather difficult to say where bad ends, and where good starts. Actually, it is exactly like the correlation coefficient (well, there is nothing mysterious here since the R-squared can be related to some correlation coefficient, as mentioned in class)
3. if you want some more advanced advice, I would say "It's complicated..." (and perhaps also "Look in a textbook write by someone more clever than me, you can find hundreds of them in the library !")
4. if you want me to act like people we've seen recently on TV (during electoral debate), "It's extremely interesting, but before answering your question, let me tell you a story..."
Perhaps that last strategy is the best one, and I should focus on the story. I mean, this is exactly why I have my blog: to tell (nice) stories. With graphs, and math formulas inside. First of all, consider a regression model

so that the R-squared is defined as

Let us generate datasets, and then run regressions, to see what's going on...
For instance, consider 20 observations, with one variable of interest, one explanatory variable, and some low variance noise (to start with)
> set.seed(1)
> n=20
> X=runif(n)
> E=rnorm(n)
> Y=2+5*X+E*.5
> base=data.frame(X,Y)
> reg=lm(Y~X,data=base)
> summary(reg)

Call:
lm(formula = Y ~ X, data = base)

Residuals:
Min       1Q   Median       3Q      Max
-1.15961 -0.17470  0.08719  0.29409  0.52719

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.4706     0.2297   10.76 2.87e-09 ***
X             4.2042     0.3697   11.37 1.19e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.461 on 18 degrees of freedom
Multiple R-squared: 0.8778,	Adjusted R-squared: 0.871
F-statistic: 129.3 on 1 and 18 DF,  p-value: 1.192e-09 
The R-squared is high (almost 0.9). What if the underlying model is exactly the same, but now, the noise has a much higher variance ?
> Y=2+5*X+E*4
> base=data.frame(X,Y)
> reg=lm(Y~X,data=base)
> summary(reg)

Call:
lm(formula = Y ~ X, data = base)

Residuals:
Min      1Q  Median      3Q     Max
-9.2769 -1.3976  0.6976  2.3527  4.2175

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.765      1.837   3.138  0.00569 **
X             -1.367      2.957  -0.462  0.64953
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.688 on 18 degrees of freedom
Multiple R-squared: 0.01173,	Adjusted R-squared: -0.04318
F-statistic: 0.2136 on 1 and 18 DF,  p-value: 0.6495 
Now, the R-squared is rather low (around 0.01). Thus, the quality of the regression depends clearly on the variance of the noise. The higher the variance, the lower the R-squared. And usually, there is not much you can do about it ! On the graph below, the noise is changing, from no-noise, to extremely noisy, with the least square regression in blue (and a confidence interval on the prediction)
If we compare with the graph below, one can observe that the quality of the fit depends on the sample size, with now 100 observations (instead of 20),
So far, nothing new (if you remember what was said in class). In those two cases, here is the evolution of the R-squared, as a function of the variance of the noise (more precisely, here, the standard deviation of the noise)
> S=seq(0,4,by=.2)
> R2=rep(NA,length(S))
> for(s in 1:length(S)){
+ Y=2+5*X+E*S[s]
+ base=data.frame(X,Y)
+ reg=lm(Y~X,data=base)
+ R2[s]=summary(reg)$r.squared} with 20 obervations in blue, 100 in black. One important point is that in econometrics, we rarely choose the number of observations. If we have only 100 observations, we have to deal with it. Similarly, if observations are quite noisy there is (usually) not much we can do about it. All the more if you don't have any explanatory variable left. Perhaps you migh play (or try to play) with nonlinear effect... Nevertheless, it looks like some econometricians really care about the R-squared, and cannot imagine looking at a model if the R-squared is lower than - say - 0.4. It is always possible to reach that level ! you just have to add more covariates ! If you have some... And if you don't, it is always possible to use polynomials of a continuous variate. For instance, on the previous example, > S=seq(1,25,by=1) > R2=rep(NA,length(S)) > for(s in 1:length(S)){ + reg=lm(Y~poly(X,degree=s),data=base) + R2[s]=summary(reg)$r.squared}
If we plot the R-squared as a function of the degree of the polynomial regression, we have the following graph. Once again, the higher the degree, the more covariates, and the more covariates, the higher the R-squared,
I.e. with 22 degrees, it is possible to reach a 0.4 R-squared. But it might be interesting to the prediction we have with that model,
So, was it worth adding so much polynomial parts ? I mean, 22 is quite a large power... Here, the linear regression was significant, but not great. So what ? The R-squared was small ? so what ? sometimes, there's not much you can do about it... When dealing with individual observations (so called micro-econometrics), the variable of interest might be extremely noisy, and there is not much you can do. So your R-squared can be small, but your regression is perhaps still better than doing nothing... The good thing with a low R-squared is perhaps that it will recall us that we have to remain modest when we build a model. And always be cautious about conclusion. At least, provide a confidence interval...

- page 1 of 7