
Friday, December 21 2012

The Maya prediction was correct... it is the end of a World. This blog is now officially dead. From now on, see you on my new blog,

Sunday, December 9 2012

Blog migration

The blog is currently migrating, from

to the platform,

The IT team is working on it. While I will keep publishing on both blogs until the end of this session, from 2013 on there will be no more posts on this blog. Please, update your bookmarks...

Thursday, November 29 2012

Save R objects, and other stuff

Yesterday, Christopher asked me how to store an R object, to save some time when working on the project.

First, download the csv file of searches related to some keyword, for instance "sunglasses". Recall that csv files store tabular data (numbers and text) in plain text, with comma-separated values (which is where the csv name comes from). Even if it is not a single, well-defined format, it is a text file, not an Excel file!

Recherche sur le Web : intérêt pour sunglasses
Dans tous les pays; De 2004 à ce jour
Intérêt dans le temps
2004-01-04 - 2004-01-10,48
2004-01-11 - 2004-01-17,47
2004-01-18 - 2004-01-24,51
2004-01-25 - 2004-01-31,50
2004-02-01 - 2004-02-07,52

(etc) The file can be downloaded from the blog,

> report=read.table(
+ "",
+ skip=4,header=TRUE,sep=",",nrows=464)

Then, we just run a function to transform that data frame. It can be found in a source file

> source("")

In this source file, there is a function that transforms a weekly series into a monthly one. The output is either a time series or a numeric vector,
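Since the source URL is elided here, the actual function cannot be shown. As an illustration only, a minimal weekly-to-monthly converter could look like the following sketch (the name W2M, the column layout, and the simple monthly averaging are all assumptions, not the actual H2M code):

```r
# Hypothetical sketch (NOT the actual H2M function): average weekly
# Google-Trends-style values by month. Assumes the first column holds
# "YYYY-MM-DD - YYYY-MM-DD" ranges, the second the index values, and
# that the weeks cover consecutive months.
W2M <- function(report, type = "ts") {
  start <- as.Date(substr(as.character(report[, 1]), 1, 10))
  month <- format(start, "%Y-%m")
  m <- tapply(report[, 2], month, mean)   # monthly averages
  if (type == "ts") {
    y0 <- as.numeric(substr(names(m)[1], 1, 4))
    m0 <- as.numeric(substr(names(m)[1], 6, 7))
    return(ts(as.numeric(m), start = c(y0, m0), frequency = 12))
  }
  as.numeric(m)
}

weekly <- data.frame(week = c("2004-01-04 - 2004-01-10",
                              "2004-01-11 - 2004-01-17",
                              "2004-02-01 - 2004-02-07"),
                     hits = c(48, 47, 52))
W2M(weekly)   # monthly ts: 47.5 (Jan 2004), 52 (Feb 2004)
```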

> sunglasses=H2M(report,lang="FR",type="ts")

Here, we asked for a time series,

> sunglasses
Jan      Feb      Mar      Apr
2004 49.00000 54.27586 66.38710 80.10000
2005 48.45161 58.25000 69.93548 80.06667
2006 49.70968 57.21429 67.41935 82.10000
2007 47.32258 55.92857 70.87097 84.36667
2008 47.19355 54.20690 64.03226 79.36667
2009 45.16129 50.75000 63.58065 76.90000
2010 32.67742 44.35714 58.19355 70.00000
2011 44.38710 49.75000 59.16129 71.60000
2012 43.64516 48.75862 64.06452 70.13333
May      Jun      Jul      Aug
2004 83.77419 89.10000 84.67742 73.51613
2005 83.06452 91.36667 89.16129 76.32258
2006 86.00000 92.90000 93.00000 72.29032
2007 86.83871 88.63333 84.61290 72.93548
2008 80.70968 80.30000 78.29032 64.58065
2009 77.93548 70.40000 62.22581 51.58065
2010 71.06452 73.66667 76.90323 61.77419
2011 74.00000 79.66667 79.12903 66.29032
2012 79.74194 82.90000 79.96774 69.80645
Sep      Oct      Nov      Dec
2004 56.20000 46.25806 44.63333 53.96774
2005 56.53333 47.54839 47.60000 54.38710
2006 51.23333 46.70968 45.23333 54.22581
2007 56.33333 46.38710 44.40000 51.12903
2008 51.50000 44.61290 40.93333 47.74194
2009 37.90000 30.38710 28.43333 31.67742
2010 50.16667 46.54839 42.36667 45.90323
2011 52.23333 45.32258 42.60000 47.35484
2012 54.03333 46.09677 43.45833  

that we can plot using

> plot(sunglasses)

Now we would like to store this time series. This is done easily using

> save(sunglasses,file="sunglasses.RData")

Next time we open R, we just have to use

> load("/Users/UQAM/sunglasses.RData")

to load the time series in R memory. So saving objects is not difficult.
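As a side note (not in the original post), base R also provides saveRDS()/readRDS(), an alternative to save()/load() that stores a single object and lets you assign it to any name on loading. A quick round-trip check:

```r
# save()/load() restore objects under their original names; the base R
# alternative saveRDS()/readRDS() stores one object and returns it on load.
x <- ts(1:12, start = c(2004, 1), frequency = 12)
f <- tempfile(fileext = ".rds")
saveRDS(x, f)
y <- readRDS(f)
identical(x, y)   # TRUE
```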

Last but not least, for the part on seasonal models, we will be using some functions from an old package. Unfortunately, the package is no longer available on the CRAN website, but nicely, files can still be found on some archive page. On Linux, one can easily install the package using (in R)

> install.packages(
+ "/Users/UQAM/uroot_1.4-1.tar.gz",
+ type="source")

With a Mac, it is slightly more complicated (see e.g. Jon's blog): one has to open a Terminal and to type

R CMD INSTALL /Users/UQAM/uroot_1.4-1.tar.gz

On Windows, it is possible to install a package from a zipped file: one has to download the file from the archive page, and then to select it from R.

The package is now installed; we just have to load it to play with it, and use the functions it contains to test for cycles and seasonal behavior,

> library(uroot)
> CH.test
function (wts, frec = NULL, f0 = 1, DetTr = FALSE, ltrunc = NULL)
s <- frequency(wts)
t0 <- start(wts)
N <- length(wts)
if (class(frec) == "NULL")
frec <- rep(1, s/2)
if (class(ltrunc) == "NULL")
ltrunc <- round(s * (N/100)^0.25)
R1 <- SeasDummy(wts, "trg")
VFEalg <- SeasDummy(wts, "alg")


Wednesday, November 28 2012

D.I.Y. strategy, and why academics should blog!

Last week, I went to the Econometrics seminar of Montréal, at UdM, where Alfred Galichon was giving a great talk on the marriage market. Alfred is a former colleague (from France), a co-author, an amazing researcher and, above all, a friend of mine. And he has always been supportive of my blogging activities. So while we were having lunch after the seminar, Alfred mentioned my blogging activity to the other researchers. I should say researchers in Econometrics (yes, with a capital E, since it is a Science, as mentioned in an old paper by David Hendry at the end of the 70's). Usually, when I am involved in this kind of meeting, I start with some apologies, explaining that I do like theoretical econometrics (if not, I would not come to the seminar), but I also like my freakonometrics activity. I like to use econometrics (or statistical techniques) to figure out (or at least to try) why some things work the way they do. I try to find data, and then analyze them briefly to answer some simple questions. Or sometimes, I just run simulations to answer more theoretical questions (or at least to give clues).

But above all, I like the fact that blogging gives me the opportunity to interact with people I would never meet without this activity. For instance, last May, I was discussing (on Twitter) with @coulmont, @joelgombin and @imparibus about elections in France. Then @coulmont asked me "yes, everyone knows that there should be some ecological fallacies behind my interpretation, but I am not so sure, since I have data with a small granularity. As an econometrician, what do you think?" Usually, I hate having a label, like "... I ask you since you're a mathematician", or "as an economist, what do you think of...". Usually, when people ask me economic questions, I just claim to be a mathematician, and vice versa. But here, I even put the word "econometrics" (more or less) on the front of my blog. So here, I could not escape... And the truth is that, while I was a student, I never heard anything about this "ecological fallacy". Neither did I as a researcher (even if I have read hundreds of econometric articles, theoretical and applied). Neither did I as a professor (even if I have been teaching econometrics for almost ten years, and have read dozens of textbooks to write notes and handouts). How come researchers in sociology and in political science know things in econometrics that I have never heard about?

The main reason - from my understanding - is the following: if everyone talks about "interdisciplinarity", no one (well, perhaps a few) is really willing to pay the price of working in different (not to say many) areas. I tried, and trust me, I found it difficult. It is difficult to publish a paper in a climate journal when you're not a specialist in climate (and you just want to give your opinion as a statistician). It is difficult to accept that you might waste weeks (not to say months) reading articles in geophysics, going to seminars, etc., if you want to know more about earthquake risks. Research is clearly a club story ("club" as defined in Buchanan (1965)).

This week, I planned to go to some journal clubs in biology and physics, at McGill (kindly, a colleague there invited me, but we had a misunderstanding about the time)... this has nothing to do with my teaching, nor with my research activities. But I might learn something! Yes, I do claim that I am paid just to have fun, to read stuff that I find interesting, trying to understand the details of a proof, trying to understand how data were obtained. In most cases, it might (and should) be a complete waste of time, since I will not publish anything (anything serious, published in some peer-reviewed journal) on that topic... but should I really care? As I explained earlier (in French), I also claim that I have a moral obligation to give back everything I have seen, heard, or read. And since I am not a big fan of lectures (and I do not think I have the skills for that), I cannot give my opinion either on economic facts (as @adelaigue or @obouba might do on their blogs) or on science results (as @tomroud does). But I think I can help on modeling and computational issues. My point being: never trust what you read (even on my blog) but please, try to do it yourself! You read that "90% of French executives think about expatriation" (as mentioned here)? Then try to find some data to confront that statement. And see if you come up with the same conclusion... And since it might be a bit technical sometimes, here are some lines of code, to do it on your own... Academics have a legitimacy when they give their opinions on technical issues. At least they can provide a list of references everyone should read to get an overview of the topic. But academics can also help people read graphs, or data. To give them "numeracy" (or a culture of numbers) necessary to understand complex issues.

To conclude, I should mention that I understood what this "ecological fallacy" was from Thomsen (1987), and many more documents can be found on Søren Thomsen's page. But I got most of the information I was looking for from a great statistician, who also happens to be an amazing blogger: Andrew Gelman. I will probably write a post someday about this, since I found the question extremely interesting, and important.

Sunday, November 18 2012

In less than 50 days: 2013, year of statistics

A couple of days ago, Jeffrey sent me an e-mail, encouraging me to write about the international year of statistics, hoping that I would participate in the event. The goals of Statistics2013 are (as far as I understood) to increase public awareness of the power and impact of statistics on all aspects of society, to nurture Statistics as a profession, and to promote creativity and development in Statistics.

So I will try to write more posts (yes, I surely can try) on statistical methodology, with sexy applications... So see you in a few days on this blog, to celebrate statistics! I can even invite guests to write here... all contributors are welcome!


Saturday, November 17 2012

Somewhere else, part 21

 A very interesting article, in Scientific American,

and a nice discussion in Andrew Gelman's blog,
Still, a lot of interesting posts and articles, on many different topics,
with still a few posts on the recent U.S. election

Did I miss something?

Wednesday, November 14 2012

I think I've already heard that tune...

Maybe I just wanted to talk about Vince Guaraldi during Movember (let's try this challenge for the month: mention only people wearing a moustache). This week-end, the CD of A Charlie Brown Christmas (see e.g. on youtube) was playing at home (I know, it is a bit early). And during the second track, I said to myself "I know that tune". For some reason, 5 or 6 notes reminded me of a song by Jacques Brel... What are the odds, I thought (at first)?

The difficult point is that calculating the probability that a given sequence appears over some runs might be complicated (I am not that good when it comes to computing probabilities). But it is still possible to run some code to estimate the probability of getting a sequence over a fixed number of runs. For instance, consider the case where there are only two notes (call them "H" and "T", just to illustrate).

> runappear=function(seqA,nrun){
+ win=FALSE
+ seq=sample(c("H","T"),size=
+ length(seqA)-1,replace=TRUE)
+ i=length(seqA)-1
+ while((win==FALSE)&(i<nrun)){
+ i=i+1
+ seq=c(seq,sample(c("H","T"),size=1))
+ n=length(seq)
+ if(all(seq[(n-length(seqA)+1):n]==seqA)){
+ win=TRUE}}
+ return(win)}

For instance, we can generate 100,000 samples to see how often the tune "THT" appears over 10 consecutive notes

> runappear(c("T","H","T"),10)
[1] TRUE
> simruns=function(seqA,nrun,sim){
+ s=0
+ for(i in 1:sim){s=s+runappear(seqA,nrun)};
+ return(s/sim)}
> simruns(c("T","H","T"),10,100000)
[1] 0.65684

But the most interesting point (from my point of view...) about those probabilities is that they yield a nontransitive game. Consider two players - call them (A) and (B) - who each select a tune (it has to be three notes). If (A)'s tune appears before (B)'s, then (A) wins. The funny story is that there is no "optimal" strategy. More specifically, if (B) knows which tune (A) has chosen, then it is always possible for (B) to choose a tune that will appear before (A)'s with more than 50% chance. Odd, isn't it? This is Penney's game, as introduced by Walter Penney; see e.g. Nishiyama and Humble (2010).

It is possible to write a code where we run scenarios until one player wins,

> winner=function(seqA,seqB){
+ win="N"
+ seq=sample(c("H","T"),size=2,replace=TRUE)
+ while(win=="N"){
+ seq=c(seq,sample(c("H","T"),size=1))
+ n=length(seq)
+ if(all(seq[(n-2):n]==seqA)){win="A"}
+ if(all(seq[(n-2):n]==seqB)){win="B"}};
+ return(win)}
> A=c("H","H","H");B=c("H","T","H")
> winner(A,B)
[1] "A"

If we run one thousand games, we see here that (B) wins more frequently (with those tunes)

> game=function(seqA,seqB,n=1000){
+ win=rep(NA,n)
+ for(i in 1:n){win[i]=winner(seqA,seqB)}
+ return(table(win)/n)
+ }
> game(A,B)
A     B
0.403 0.597 

Let us look now at all possible tunes (8 if we consider a sequence of 3 notes), for players (A) and (B),

> A1=c("T","T","T")
> A2=c("T","T","H")
> A3=c("T","H","T")
> A4=c("H","T","T")
> A5=c("T","H","H")
> A6=c("H","T","H")
> A7=c("H","H","T")
> A8=c("H","H","H")
> B=data.frame(A1,A2,A3,A4,A5,A6,A7,A8)
> PROB=matrix(NA,8,8)
> for(i in 1:8){
+ for(j in 1:8){
+ PROB[i,j]=game(B[,i],B[,j],1000)[1]
+ }}

We have the following probabilities (that (A) wins over (B))

[,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
[1,] 1.000 0.529 0.414 0.137 0.411 0.437 0.317 0.496
[2,] 0.493 1.000 0.674 0.242 0.664 0.625 0.480 0.701
[3,] 0.615 0.331 1.000 0.516 0.492 0.503 0.365 0.583
[4,] 0.885 0.735 0.518 1.000 0.494 0.513 0.339 0.580
[5,] 0.603 0.342 0.487 0.500 1.000 0.465 0.762 0.862
[6,] 0.602 0.360 0.527 0.506 0.507 1.000 0.329 0.567
[7,] 0.695 0.511 0.642 0.664 0.288 0.635 1.000 0.510
[8,] 0.502 0.279 0.428 0.382 0.117 0.406 0.487 1.000

It is possible to visualize the probabilities on the graph below

with below, the optimal strategy, that (B) should choose, given the tune chosen by (A),

(actually, it is possible to compute those probabilities explicitly). Here, it is possible for (B) to choose a tune such that the probability that (A) wins is less than 50% (even less than 35%).
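The explicit computation can be done with Conway's "leading numbers" algorithm (a sketch added here, not part of the original post): the odds that (B)'s tune appears before (A)'s are (AA-AB) : (BB-BA), where XY counts how the tail of sequence X overlaps the head of sequence Y.

```r
# Conway's leading number L(X,Y): add 2^(j-1) whenever the last j notes
# of X coincide with the first j notes of Y (sequences of equal length).
leading <- function(X, Y) {
  k <- length(X); L <- 0
  for (j in 1:k) if (all(X[(k - j + 1):k] == Y[1:j])) L <- L + 2^(j - 1)
  L
}
# exact probability that B's tune appears before A's
penney <- function(A, B) {
  num <- leading(A, A) - leading(A, B)
  den <- leading(B, B) - leading(B, A)
  num / (num + den)
}
penney(c("H","H","H"), c("T","H","H"))   # 7/8 = 0.875
```

This matches the simulated matrix above: against "HHH", the tune "THH" wins about 87.5% of the time.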

But we are a bit far away from the problem mentioned earlier... Furthermore, of course, assuming independence and randomness (in music) is extremely simplistic (see a short discussion here, a few weeks ago, in French). We might expect some Markov chain behind it, in order to have something that will not hurt our ears, as stated in Brooks et al. (1957) for instance (quite a long time ago, actually).

To conclude, about Jacques Brel's song: actually, Amsterdam is (somehow) inspired by Greensleeves, the traditional English folk song. And Vince Guaraldi played it on that CD...

So yes, there is a connection between all those songs. And no need to do maths to get it...

Tuesday, November 13 2012

On Box-Cox transform in regression models

A few days ago, a former student of mine, David, contacted me about Box-Cox tests in linear models. It made me look more carefully at the test, and, to be honest, I do not understand what is computed. Let us start with something simple, like a simple linear regression, i.e. $Y_i=\beta_0+\beta_1 X_i+\varepsilon_i$

Let us introduce - as suggested in Box & Cox (1964) - the following family of (power) transformations
$$Y_i^{(\lambda)} = \begin{cases} \dfrac{Y_i^\lambda-1}{\lambda} & (\lambda \neq 0)\\[8pt] \log(Y_i) & (\lambda = 0) \end{cases}$$

on the variable of interest. Then assume that $Y_i^{(\lambda)}=\beta_0+\beta_1 X_i+\varepsilon_i$

As mentioned in Chapter 14 of Davidson & MacKinnon (1993) - in French - the log-likelihood of this model (assuming that observations are independent, with $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$) can be written
$$\log \mathcal{L}=-\frac{n}{2}\log(2\pi)-n\log\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\beta_1 X_i)\right]^2+(\lambda-1)\sum_{i=1}^n\log Y_i$$

We can then use profile-likelihood techniques (see here) to derive the optimal transformation.
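To spell out the intermediate step (a short derivation added here, consistent with the log-likelihood above): for a fixed $\lambda$, maximizing over $\beta_0$, $\beta_1$ and $\sigma$ gives

```latex
% concentrate the log-likelihood: for fixed lambda, the MLE of sigma^2 is
\widehat{\sigma}^2(\lambda)=\frac{1}{n}\sum_{i=1}^n
  \left[Y_i^{(\lambda)}-(\widehat{\beta}_0+\widehat{\beta}_1 X_i)\right]^2
% and plugging it back yields the profile log-likelihood
\log\mathcal{L}(\lambda)=\mathrm{cst}
  -\frac{n}{2}\log\widehat{\sigma}^2(\lambda)
  +(\lambda-1)\sum_{i=1}^n\log Y_i
```

which is the function maximized (over a grid of $\lambda$'s) in the code below.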

This can be done in R extremely simply,

> library(MASS)
> boxcox(lm(dist~speed,data=cars),lambda=seq(0,1,by=.1))

we then get the following graph,

If we look at the code of the function, it is based on the QR decomposition of the $\boldsymbol{X}$ matrix (since we assume that $\boldsymbol{X}$ is a full-rank matrix). More precisely, $\boldsymbol{X}=QR$ where $\boldsymbol{X}$ is an $n\times 2$ matrix, $Q$ is an $n\times 2$ orthonormal matrix, and $R$ is a $2\times 2$ upper triangular matrix. It might be convenient to use this decomposition since, for instance, $R\widehat{\boldsymbol{\beta}}=Q'Y$. Thus, we have an upper triangular system of equations.

> X=lm(dist~speed,data=cars)$qr

The code used to get the previous graph is (more or less) the following,

> g=function(x,lambda){
+ y=NA
+ if(lambda!=0){y=(x^lambda-1)/lambda}
+ if(lambda==0){y=log(x)}
+ return(y)} 
> n=nrow(cars)
> X=lm(dist~speed,data=cars)$qr
> Y=cars$dist
> logv=function(lambda){
+ -n/2*log(sum(qr.resid(X, g(Y,lambda)/
+ exp(mean(log(Y)))^(lambda-1))^2))}
> L=seq(0,1,by=.05)
> LV=Vectorize(logv)(L)
> points(L,LV,pch=19,cex=.85,col="red")

As we can see (with those red dots), we can reproduce the R graph. But it might not be consistent with other techniques (and functions described above). For instance, we can plot the profile likelihood function, $\lambda\mapsto\log\mathcal{L}(\lambda)$

> logv=function(lambda){
+ s=summary(lm(g(dist,lambda)~speed,
+ data=cars))$sigma
+ e=lm(g(dist,lambda)~speed,data=cars)$residuals
+ -n/2*log(2 * pi)-n*log(s)-.5/s^2*(sum(e^2))+
+ (lambda-1)*sum(log(Y))
+ }
> L=seq(0,1,by=.01)
> LV=Vectorize(logv)(L)
> plot(L,LV,type="l",ylab="")
> (maxf=optimize(logv,0:1,maximum=TRUE))
$maximum
[1] 0.430591

$objective
[1] -197.6966
> abline(v=maxf$maximum,lty=2)

The good point is that the optimal value of $\lambda$ is the same as the one we got before. The only problem is that the likelihood has a different scale. And using profile likelihood techniques to derive a confidence interval will give us different results (with a larger confidence interval than the one given by the standard function),

> ic=maxf$objective-qchisq(.95,1)
> #install.packages("rootSolve")
> library(rootSolve)
> f=function(x)(logv(x)-ic)
> (lower=uniroot(f, c(0,maxf$maximum))$root)
[1] 0.1383507
> (upper=uniroot(f, c(maxf$maximum,1))$root)
[1] 0.780573
> segments(lower,ic,upper,ic,lwd=2,col="red")
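For comparison (this snippet is mine, not from the original post), the interval produced by the standard function can be read off the grid that MASS::boxcox returns when plotit=FALSE, using the usual chi-square cutoff:

```r
# Extract the 95% interval implicit in MASS::boxcox from its grid output:
# keep the lambda values whose log-likelihood lies within
# qchisq(.95, 1)/2 of the maximum.
library(MASS)
b <- boxcox(lm(dist ~ speed, data = cars),
            lambda = seq(0, 1, by = .01), plotit = FALSE)
keep <- b$y > max(b$y) - qchisq(.95, 1) / 2
range(b$x[keep])
```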

Actually, it is possible to rewrite the log-likelihood as
$$\log\mathcal{L}=\star-\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\beta_1 X_i)}{\dot{Y}^{\lambda}}\right)^2\right]$$

(let us just get rid of the constant), where $\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]$

Here, it becomes

> logv=function(lambda){
+ e=lm(g(dist,lambda)~speed,data=cars)$residuals
+ elY=(exp(mean(log(Y))))
+ -n/2*log(sum((e/elY^lambda)^2))
+ }
> L=seq(0,1,by=.01)
> LV=Vectorize(logv)(L)
> plot(L,LV,type="l",ylab="")
> optimize(logv,0:1,maximum=TRUE)
$maximum
[1] 0.430591

$objective
[1] -47.73436

with again the same optimal value for $\lambda$, and the same confidence interval, since the function is the same, up to some additive constant.

So we have been able to derive the optimal transformation according to the Box-Cox technique; but so far, the confidence interval is not the same (it might come from the fact that here we substituted an estimator for the unknown parameter $\sigma$).

Saturday, November 10 2012

Somewhere else, part 20

So, Movember finally arrived. So far, not a lot of articles about moustaches, but I should find some by the end of the month! Nevertheless, I discovered a great post for those who used to be addicted players

Also a lot of posts, this week, about the elections in the United States (and statistics-pooling-forecasting issues)
And as usual, several posts on different topics that I found interesting,
and as always, a few fascinating posts for French speakers,

Did I miss something?

Friday, November 9 2012

On p-values


Saturday, November 3 2012

Normality versus goodness-of-fit tests

In many cases, in statistical modeling, we would like to test whether the underlying distribution of an i.i.d. sample lies in a given (parametric) family, e.g. the Gaussian family


Consider a sample

> library(nortest)
> n=200
> X=rnorm(n)

Then a natural idea is to use goodness-of-fit tests (natural is not necessarily correct, we'll get back to that later on), i.e. test $H_0:F=\mathcal{N}(\mu,\sigma^2)$ for some $\mu$ and $\sigma$. But since those two parameters are unknown, it is not uncommon to see people substituting estimators for those two unknown parameters, i.e. testing $H_0:F=\mathcal{N}(\widehat{\mu},\widehat{\sigma}^2)$.

Using Kolmogorov-Smirnov test, we get

> pn=function(x){pnorm(x,mean(X),sd(X))};
> P.KS.Norm.estimated.param=
+ ks.test(X,pn)$p.value

But since we chose the parameters based on the same sample we use to run the goodness-of-fit test, we should expect trouble somewhere. So another natural idea is to split the sample: the first half will be used to estimate the parameters, and then we use the second half to run a goodness-of-fit test (e.g. using the Kolmogorov-Smirnov test)

> pn=function(x){pnorm(x,mean(X[1:(n/2)]),
+ sd(X[1:(n/2)]))}
> P.KS.Norm.out.of.sample=
+ ks.test(X[(n/2+1):n],pn)$p.value

As a benchmark, we can use Lilliefors test, where the distribution of the Kolmogorov-Smirnov statistic is corrected to take into account the fact that we use estimators of the parameters

> P.Lilliefors.Norm=
+ lillie.test(X)$p.value 

Here, let us consider i.i.d. samples of size 200 (100,000 samples were generated here). The distribution of the p-value of the test is shown below,

In red, the Lilliefors test, where we see that the correction works well: the p-value is uniformly distributed on the unit interval. There is a 95% chance of accepting the normality assumption if we accept it when the p-value exceeds 5%. On the other hand,

  • with the Kolmogorov-Smirnov test on the overall sample, we (almost) always accept the normality assumption, with a lot of extremely large p-values;
  • with the Kolmogorov-Smirnov test with out-of-sample estimation, we actually observe the opposite: in a lot of simulations, the p-value is lower than 5% (even though the sample was drawn from a Gaussian distribution).

The cumulative distribution function of the p-value is

I.e., the proportion of samples with a p-value exceeding 5% is 95% for Lilliefors test (as expected), while it is 85% for the out-of-sample estimator, and 99.99% for Kolmogorov-Smirnov with estimated parameters,

> mean(P.KS.Norm.out.of.sample>.05)
[1] 0.85563
> mean(P.KS.Norm.estimated.param>.05)
[1] 0.99984
> mean(P.Lilliefors.Norm>.05)
[1] 0.9489

So using Kolmogorov-Smirnov with estimated parameters is not good, since we might accept $H_0$ too often. On the other hand, if we use this technique with two samples (one to estimate the parameters, one to run the goodness-of-fit test), it looks much better! Even if we reject $H_0$ too often. For one test, the rate of type I error is rather large, but for the other, it is the rate of type II error...
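The simulation described above can be sketched as follows (1,000 samples instead of 100,000, and only the two Kolmogorov-Smirnov variants, to keep it short; the Lilliefors benchmark would come from nortest::lillie.test):

```r
# Compare the two (ab)uses of the KS test over many Gaussian samples
set.seed(1)
n <- 200; nsim <- 1000
p.est <- p.oos <- numeric(nsim)
for (s in 1:nsim) {
  X <- rnorm(n)
  # parameters estimated on the full sample, test run on the same sample
  p.est[s] <- ks.test(X, function(x) pnorm(x, mean(X), sd(X)))$p.value
  # parameters estimated on the first half, test run on the second half
  pn <- function(x) pnorm(x, mean(X[1:(n/2)]), sd(X[1:(n/2)]))
  p.oos[s] <- ks.test(X[(n/2 + 1):n], pn)$p.value
}
# acceptance rates at the 5% level: close to 1 in the first case,
# noticeably below 95% in the second
c(mean(p.est > .05), mean(p.oos > .05))
```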

Wednesday, October 31 2012

Why pictures are so important when modeling data?

(bis repetita) Consider the following regression summary,
lm(formula = y1 ~ x1)
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,	Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217 
obtained from
> data(anscombe)
> reg1=lm(y1~x1,data=anscombe)
Can we say something if we look (only) at that output? The intercept is significantly non-null, as well as the slope, and the $R^2$ is large (66%). It looks like we have a nice model here. And in a perfect world, we might hope that data are coming from this kind of dataset,
But it might be possible to have completely different kinds of patterns. Actually, four different sets of data are coming from Anscombe (1973). And all those datasets are somehow equivalent: the $x$'s have the same mean, and the same variance
> apply(anscombe[,1:4],2,mean)
x1 x2 x3 x4
9  9  9  9
> apply(anscombe[,1:4],2,var)
x1 x2 x3 x4
11 11 11 11 
and so are the $y$'s
> apply(anscombe[,5:8],2,mean)
y1       y2       y3       y4
7.500909 7.500909 7.500000 7.500909
> apply(anscombe[,5:8],2,var)
y1       y2       y3       y4
4.127269 4.127629 4.122620 4.123249 
Further, observe also that the correlation between the $x$'s and the $y$'s is the same
> cor(anscombe)[1:4,5:8]
y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
> diag(cor(anscombe)[1:4,5:8])
[1] 0.8164205 0.8162365 0.8162867 0.8165214
which yields the same regression line (intercept and slope)
> cbind(coef(reg1),coef(reg2),coef(reg3),coef(reg4))
[,1]     [,2]      [,3]      [,4]
(Intercept) 3.0000909 3.000909 3.0024545 3.0017273
x1          0.5000909 0.500000 0.4997273 0.4999091
But there is more. Much more. For instance, we always have the same standard deviation for the residuals
> c(summary(reg1)$sigma,summary(reg2)$sigma,
+ summary(reg3)$sigma,summary(reg4)$sigma)
[1] 1.236603 1.237214 1.236311 1.235695
Thus, all regressions here have the same $R^2$
> c(summary(reg1)$r.squared,summary(reg2)$r.squared,
+ summary(reg3)$r.squared,summary(reg4)$r.squared)
[1] 0.6665425 0.6662420 0.6663240 0.6667073
Finally, Fisher's F statistic is also (almost) the same.
> c(summary(reg1)$fstatistic[1],summary(reg2)$fstatistic[1],
+ summary(reg3)$fstatistic[1],summary(reg4)$fstatistic[1])
value    value    value    value
17.98994 17.96565 17.97228 18.00329 
Thus, with the following datasets, we have the same prediction (and the same confidence intervals). Consider for instance the second dataset (the first one being mentioned above),
> reg2=lm(y2~x2,data=anscombe)
The output is here exactly the same as the one we had above
> summary(reg2)
lm(formula = y2 ~ x2, data = anscombe)
Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691
Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x2             0.500      0.118   4.239  0.00218 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179 
Here, the perfect model is the one obtained with a quadratic regression.
> reg2b=lm(y2~x2+I(x2^2),data=anscombe)
> summary(reg2b)
lm(formula = y2 ~ x2 + I(x2^2), data = anscombe)
Min         1Q     Median         3Q        Max
-0.0013287 -0.0011888 -0.0006294  0.0008741  0.0023776
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.9957343  0.0043299   -1385   <2e-16 ***
x2           2.7808392  0.0010401    2674   <2e-16 ***
I(x2^2)     -0.1267133  0.0000571   -2219   <2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.001672 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 7.378e+06 on 2 and 8 DF,  p-value: < 2.2e-16 
Consider now the third one
> reg3=lm(y3~x3,data=anscombe)
> summary(reg3)
lm(formula = y3 ~ x3, data = anscombe)
Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663,	Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176 
This time, the linear model could have been perfect. The problem is one outlier. If we remove it, we have
> reg3b=lm(y3~x3,data=anscombe[-3,])
> summary(reg3b)
lm(formula = y3 ~ x3, data = anscombe[-3, ])
Min         1Q     Median         3Q        Max
-0.0041558 -0.0022240  0.0000649  0.0018182  0.0050649
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0056494  0.0029242    1370   <2e-16 ***
x3          0.3453896  0.0003206    1077   <2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.003082 on 8 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1
F-statistic: 1.161e+06 on 1 and 8 DF,  p-value: < 2.2e-16 
Finally consider
> reg4=lm(y4~x4,data=anscombe)
This time, there is another kind of outlier, in the $x$'s, but again, the regression is exactly the same,
> summary(reg4)
lm(formula = y4 ~ x4, data = anscombe)
Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x4            0.4999     0.1178   4.243  0.00216 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667,	Adjusted R-squared: 0.6297
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165 
The graph is here
So clearly, looking at the summary of a regression does not tell us anything... This is why we do spend some time on diagnostics, looking at graphs of the errors (the graphs above could be obtained only with one explanatory variable, while errors can be studied in any dimension): everything can be seen in those graphs. E.g. for the first dataset,
or the second one
the third one
or the fourth one,
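The four graphs shown above (as images in the original post) can be reproduced with a few lines:

```r
# Anscombe's quartet: (almost) identical fitted lines, four very
# different scatterplots
data(anscombe)
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 19, xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(y ~ x), col = "red")   # intercept ~3, slope ~0.5 in all panels
}
par(op)
```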

Saturday, October 27 2012

Somewhere else, part 18

A nice post (found a couple of days ago) on an interesting topic I should perhaps discuss more often on this blog

Also, this week, a series of posts and articles about L'Aquila trial,
Still, a lot of very interesting posts found here, and there,
Also several documentaries, found, online

And as always, a few articles in French,

Did I miss something?

Tuesday, October 23 2012

Predictions and errors

There have been a lot of interesting papers about the "manslaughter trial" of six seismologists and a government official in Italy, where the court ruled that there was a failure to warn the population before the deadly earthquake in 2009; see e.g. "Trial Over Earthquake in Italy Puts Focus on Probability and Panic", "Italian scientists convicted of manslaughter for earthquake risk report", "Italian court ruling sends chill through science community", "Scientists on trial: At fault?" or (probably the most interesting one) "The Verdict of the l'Aquila Earthquake Trial Sends the Wrong Message".

First of all, I started less than 15 months ago to work on earthquake series and models, and I am still working on the second paper; but so far, what I've seen is that those series are very noisy. When working on a large scale (say $\pm$500 km), it is still very difficult to estimate the probability that there will be a large earthquake over a long period of time (from one year to a decade), even including covariates such as foreshocks. So I can imagine that it is almost impossible to predict something accurate on a smaller scale, and on a short time frame. A second point is that I did not have time to look carefully at what was said during the trial: I just went through what can be found in the articles mentioned above.

But as a statistician, I really believe, as claimed by Niels Bohr (among many others), that "prediction is very difficult, especially about the future". Especially with a 0/1 model (warning versus no warning). In that case, you have the usual type I and type II errors,

  • type I is the "false positive" error: issuing a warning for nothing, a "false alarm". With standard "test" words, it is like when a pregnancy test predicts that someone is pregnant, but she's not.
  • type II is the "false negative" error: failing to assert something that is present. Here, it is like when a pregnancy test predicts that someone is not pregnant, while she actually is.

The main problem is that statisticians wish to design a test with both errors as small as possible. But usually, you can't: you have to make a trade-off. The more you want to protect yourself against type I errors (by choosing a low significance level), the greater the chance of a type II error. This is actually the most important message in all Statistics 101 courses.
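This trade-off can be illustrated with a toy simulation (entirely made up, not related to actual earthquake data): warnings are issued when a noisy precursor signal exceeds a threshold, and moving the threshold trades one error for the other.

```r
# Toy threshold-warning model: a lower threshold means more false alarms
# (type I errors) but fewer missed events (type II errors)
set.seed(123)
n <- 10000
event  <- rbinom(n, 1, 0.1)            # 1 = an event actually occurs
signal <- rnorm(n, mean = 2 * event)   # noisy precursor signal
errors <- function(threshold) {
  warn <- signal > threshold
  c(type.I  = mean(warn[event == 0]),  # false alarm rate
    type.II = mean(!warn[event == 1])) # missed event rate
}
rbind(threshold.1 = errors(1), threshold.0 = errors(0))
```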

Another illustration can be taken from the course I am currently teaching this semester, precisely on prediction and forecasting techniques. Consider e.g. the following series

Here, we wish to make a forecast on this time series (involving a confidence interval, or region). Something like

The aim of the course is to be able to build up that kind of graph, to analyze it, and to know exactly what assumptions were used to derive those confidence bands. But if you might go to jail for missing something, you can still make the following forecast

From this trial, we know that researchers can go to jail for making a type II error. So, if you do not want to go to jail, make frequent type I errors (given this necessary trade-off). Because, so far, you're more likely not to go to jail for that kind of error (the boy-who-cried-wolf kind). Then, you're a shyster, a charlatan, but you shouldn't spend six years in jail! As mentioned on Twitter, that might be a reason why economists keep announcing crises! That might actually be a coherent strategy...

Saturday, October 20 2012

Somewhere else, part 17

This week, the most important news is undoubtedly

I also read a very interesting post
 among many others
with several articles early this week on the Nobel prize (in Economics)
with several articles in French
but also

Did I miss something?
