Freakonometrics

To content | To menu | To search

Wednesday, May 16 2012

Finding Waldo, a flag on the moon and multiple choice tests, with R

I have to admit, first, that finding Waldo has been a difficult task. And I did not succeed. Neither could I correctly spot his shirt (because actually, it was what I was looking for). You know, that red-and-white striped shirt. I guess it should have been possible to look for Waldo's face (assuming that his face does not change) but I still have problems with size factor (and resolution issues too). The problem is not that simple. At the http://mlsp2009.conwiz.dk/ conference, a price was offered for writing an algorithm in Matlab. And one can even find Mathematica codes online. But most of the those algorithms are based on the idea that we look for similarities with Waldo's face, as described in problem 3 on http://www1.cs.columbia.edu/~blake/'s webpage. You can find papers on that problem, e.g. Friencly & Kwan (2009) (based on statistical techniques, but Waldo is here a pretext to discuss other issues actually), or more recently (but more complex) Garg et al. (2011) on matching people in images of crowds.

What about codes in R ? On http://stackoverflow.com/, some ideas can be found (and thank Robert Hijmans for his help on his package). So let us try here to do something, on our own. Consider the following picture,

With the following code (based on the following file) it is possible to import the picture, and to extract the colors (based on an RGB decomposition),
> library(raster)
> waldo=brick(system.file("DepartmentStoreW.grd",
+ package="raster"))
> waldo
class       : RasterBrick
dimensions  : 768, 1024, 786432, 3 (nrow,ncol,ncell,nlayer)
resolution  : 1, 1  (x, y)
extent      : 0, 1024, 0, 768  (xmin, xmax, ymin, ymax)
coord. ref. : NA
values      : C:\R\win-library\raster\DepartmentStoreW.grd
min values  : 0 0 0
max values  : 255 255 255
My strategy is simple: try to spot areas with white and red stripes (horizontal stripes). Note that here, I ran the code on a Windows machine, the package is not working well on Mac. In order to get a better understanding of what could be done, let us start with something much more simple. Like the picture below, with Waldo (and Waldo only). Here, it is possible to extract the three colors (red, green and blue),
> plot(waldo,useRaster=FALSE)

     

It is possible to extract the red zones (already on the graph above, since red is a primary color), as well as the white ones (green means white, on the left)
# white component
white = min(waldo[[1]] , waldo[[2]] , waldo[[3]])>220
focalswhite = focal(white, w=3, fun=mean)
plot(focalswhite,useRaster=FALSE)
 
# red component
red = (waldo[[1]]>150)&(max(  waldo[[2]] , waldo[[3]])<90)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE)
i.e. here we have the graphs below, with the white regions, and the red ones,

From those two parts, it has been possible to extract the red-and-white stripes from the picture, i.e. some regions that were red above, and white below (or the reverse),
# striped component
striped = red; n=length(values(striped)); h=5
values(striped)=0
values(striped)[(h+1):(n-h)]=(values(red)[1:(n-2*h)]==
TRUE)&(values(red)[(2*h+1):n]==TRUE)
focalsstriped = focal(striped, w=3, fun=mean)
plot(focalsstriped,useRaster=FALSE)
So here, we can easily spot Waldo, i.e. the guy with the red-white stripes (with two different sets of thresholds for the RGB decomposition)

 Let us try somthing slightly more complicated, with a zoom on the large picture of the department store (since, to be honest, I know where Waldo is...).

Here again, we can spot the white part (on the left) and the red one (on the right), with some thresholds for the RGB decomposition

Note that we can try to be (much) more selective, playing with threshold. Here, it is not very convincing: I cannot clearly identify the region where Waldo might be (the two graphs below were obtained playing with thresholds)

 And if we look at the overall pictures, it is worst. Here are the white zones, and the red ones,

and again, playing with RGB thresholds, I cannot spot Waldo,

Maybe I was a bit optimistic, or ambitious. Let us try something more simple, like finding a flag on the moon. Consider the picture below on the left, and let us see if we can spot an American flag,

Again, on the left, let us identify white areas, and on the left, red ones

Then as before, let us look for horizontal stripes

Waouh, I did it ! That's one small step for man, one giant leap for R-coders ! Or least for me... So, why might it be interesting to identify areas on pictures ? I mean, I am not Chloe O'Brian, I don't have to spot flags in a crowd, neither Waldo, nor some terrorists (that might wear striped shirts). This might fun if you want to give grades for your exams automatically. Consider the two following scans, the template, and a filled copy,

A first step is to identify regions where we expect to find some "red" part (I assume here that students have to use a red pencil). Let us start to check on the template and the filled form if we can identify red areas,
exam = stack("C:\\Users\\exam-blank.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE) 
exam = stack("C:\\Users\\exam-filled.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE)


First, we have to identify areas where students have to fill the blanks. So in the template, identify black boxes, and get the coordinates (here manually)
exam = stack("C:\\Users\\exam-blank.png")
black = max(  exam[[1]] ,exam[[2]] , exam[[3]])<50
focalsblack = focal(black, w=3, fun=mean)
plot(focalsblack,useRaster=FALSE)
correct=locator(20)
coordinates=locator(20)
X1=c(73,115,156,199,239)
X2=c(386,428.9,471,510,554)
Y=c(601,536,470,405,341,276,210,145,79,15)
LISTX=c(rep(X1,each=10),rep(X2,each=10))
LISTY=rep(Y,10)
points(LISTX,LISTY,pch=16,col="blue")

The blue points above are where we look for students' answers. Then, we have to define the vector of correct answers,
CORRECTX=c(X1[c(2,4,1,3,1,1,4,5,2,2)],
X2[c(2,3,4,2,1,1,1,2,5,5)])
CORRECTY=c(Y,Y)
points(CORRECTX, CORRECTY,pch=16,col="red",cex=1.3)
UNCORRECTX=c(X1[rep(1:5,10)[-(c(2,4,1,3,1,1,4,5,2,2)
+seq(0,length=10,by=5))]],
X2[rep(1:5,10)[-(c(2,3,4,2,1,1,1,2,5,5)
+seq(0,length=10,by=5))]])
UNCORRECTY=c(rep(Y,each=4),rep(Y,each=4))
Now, let us get back on red areas in the form filled by the student, identified earlier,
exam = stack("C:\\Users\\exam-filled.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=5, fun=mean)

Here, we simply have to compare what the student answered with areas where we expect to find some red in,
ind=which(values(focalsred)>.3)
yind=750-trunc(ind/610)
xind=ind-trunc(ind/610)*610
points(xind,yind,pch=19,cex=.4,col="blue")
points(CORRECTX, CORRECTY,pch=1,
col="red",cex=1.5,lwd=1.5)
Crosses on the graph on the right below are the answers identified as correct (here 13),
> icorrect=values(red)[(750-CORRECTY)*
+ 610+(CORRECTX)]
> points(CORRECTX[icorrect], CORRECTY[icorrect],
+ pch=4,col="black",cex=1.5,lwd=1.5)
> sum(icorrect)
[1] 13

In the case there are negative points for non-correct answer, we can count how many incorrect answers we had. Here 4.
> iuncorrect=values(red)[(750-UNCORRECTY)*610+
+ (UNCORRECTX)]
> sum(iuncorrect)
[1] 4
So I have not been able to find Waldo, but I least, that will probably save me hours next time I have to mark exams...

Tuesday, May 15 2012

Notes de cours sur les séries temporelles

La session d'hiver n'étant pas terminée, je vais poster mes notes de cours sur la dernière section (sur la modélisation de séries temporelles) pour le cours ACT2040. Il s'agit - comme je l'avais dit en cours - d'une remise au goût du jour de notes tapées il y a une dizaine d'années. J'ai également rajouté du code R, mais il doit resté un certain nombre de coquilles et de fautes de frappe. Je profiterais des jours qui viennent pour réviser cette version.

Thursday, May 10 2012

Basketball: score dynamics and game theory

Tomorrow morning, I will be giving a talk at Mont Tremblant, for the Journées de la Société Canadienne de Sciences Economiques. I will present a joint work - in progress - with Nathalie Colombier and Romuald Elie. Since the working paper is not online yet, I will wait a little bit before uploading the slides. But they will be online, someday (hopefully soon)...

"An important aspect of the strategy of most organizations is the provision of incentives to the employees to meet the organization’s objectives. Typically this implies tying pay to performance (see Prendergast, 1999). In order to reward employees for their effort, firms spend considerable resources on performance evaluations. In many cases, evaluation consists of comparing actual performance to a pre-defined individual target. Another frequently used format is relative performance evaluation. Relative performance evaluation may motivate employees to work harder.But it may also be demoralizing and create an excessively competitive workplace, which may hinder overall performance; see Lazear (1989). Determining the overall impact of relative performance evaluation is crucial for companies. Economic research on relative performance evaluation has mainly focused on the comparison of final performances between competitors,like in tournament theory, and on quantitative and subjective performance ratings (Lazear and Gibbs, 2009). In contrast, what happens during a competition and the impact of feedback frequency on effort have so far received little attention. Following Berger and Pope (2011), we decided to use a basketball application to get a better understanding of the role of the feedback information. Sports datasets allow to observe score and team behavior continuously (during a game but also during the season) which can be use as a proxy of the effort. Berger an Pope (2010) asked ”can loosing lead to winning ?” looking at the impact of the halftime score difference on winning probability in NCAA (college) and NBA(pro) games. More precisely, they studied whether a team loosing at halftime is more likely to win than expected using a logit model. They find that usually the higher the score difference the more likely the are to win. But if the halftime score difference is around 0 they observe a discontinuity: loosing with a small difference (e.g. down by 1 point) can lead to increase the effort and win the game. In this paper we try answer the question when loosing lead to winning ?."

Monday, May 7 2012

Bayes is playing Russian roulette

There was (once again) a nice puzzle in http://www.futilitycloset.com/. Bayes and a good friend are playing Russian roulette. The revolver has six chambers. He puts two bullets in two adjacent chambers, spin the cylinder, hold the gun to his friend's head, and pull the trigger. It clicks. So it is now Bayes's turn: he can choose either to spin the cylinder again or leave it as it is. Which is better? Hopefully, Bayes knows his theorem: if he does spin it, the probability of getting killed is 2 out of 6 (four empty chambers out of six), but if he does not, since his friend is still alive, then the hammer should be next to one of the four cylinders in red, below
So here, there is 3 chance out of 4 to survive, i.e. the probability of getting killed is 1 out of 4 (while it was 1 out of 3 when spinning). So Bayes should not spin. And as always, it is possible to see it is a more general result: more generally, in a revolver with http://freakonometrics.blog.free.fr/public/perso5/bullet01.gif chambers, it there are http://freakonometrics.blog.free.fr/public/perso5/bullet02.gif bullets in http://freakonometrics.blog.free.fr/public/perso5/bullet02.gif adjacent chambers,  if the first player survives, the probability of getting killed is k over http://freakonometrics.blog.free.fr/public/perso5/bullet01.gif, when spinning, while it would be 1 over http://freakonometrics.blog.free.fr/public/perso5/bullet03.gif if we don't. Not spinning is better if and only if

http://freakonometrics.blog.free.fr/public/perso5/bullet04.gif
i.e.
http://freakonometrics.blog.free.fr/public/perso5/bullet05.gif
So you'd better not spin, unless there was one bullet in the revolver, i.e. http://freakonometrics.blog.free.fr/public/perso5/bullet06.gif... or http://freakonometrics.blog.free.fr/public/perso5/bullet07.gif (in that case, it might not be a good idea actually to play the game).

Friday, May 4 2012

Correlations, dimension, and risk measure

Yesterday, while I was attending the IFM2 conference, at HEC Montreal, I heard a nice talk about credit risk, and a comparison between contagion (or at least default correlation), for corporate and retail companies (in the US). And it was mentioned that default correlation was much lower for retail companies than it could be for corporate risk. In a discussion that followed those slides, it was mentioned that banks in the US should actually have been working more with those small firms, since contagion risk was much lower.

A problem here is that the link between correlation, risk and dimension is rather complicated:

  • corporate means a small number of firms, high correlation (and possible large individual losses)
  • retail means a large number of firms (even perhaps extremely large), lower correlation (and small individual losses)

A simple model for default models is based on the assumption that we deal with an exchangeable portfolio (as in a previous post). With the following code, given an (individual) default probability, a default correlation, and a number of firms, it is possible to calculate the probability to have more than a given number of defaults.

 proba=function(s,a,m,n){
 b=a/m-a
 choose(n,s)*integrate(function(t){t^s*(1-t)^(n-s)*
 dbeta(t,a,b)},lower=0,upper=1,subdivisions=1000,
 stop.on.error =  FALSE)$value}
 
CDF=function(x=10,r=.4,m=.1,n=50){
a=m*(1-r)/r ;
V=rep(NA,n+1)
 for(i in 0:n){
 V[i+1]=proba(i,a,m,n)}
 V=V/sum(V);
 return(sum(V[1:(x+1)])) }

It is possible to calculate, for a large range of correlations, the probability to have - at least - 20% of default in the portfolio (in order to compare things that are comparable).

R=seq(.01,.99,by=.01)
VQ=matrix(NA,length(A),2)
for(i in 1:length(A)){
VQ[i,1]=1-CDF(r=A[i],x=4,n=20);  
VQ[i,2]=1-CDF(r=A[i],x=200,n=1000)}

With 20 firms (corporate) we want to have at least 4 defaults, while with 1000 firms (retail) there should be 200 defaults. As mentioned in the previous post, the relationship between correlation and quantiles of sums is not simple. Hence, it might not be monotone. The dotted line is the probability to have at least 4 defaults when default correlation is 50% (around 10%). The plain line is the probability to have at least 200 defaults, as a function of the correlation,

plot(A,1-VQ[,2],type="l",col="red",ylim=c(0,.22))
abline(h=1-VQ[50,1],lty=2,col="red")

In that case, with only a correlation of 10% among retail firms, the probability of having 20% defaults is the same as the same probability for corporate, but with 50% correlation... One should remember that in portfolio analysis, the links between correlation, dimension and risk measure is a sensitive issue...

Talk on bivariate count times series in finance and risk management

I will be giving a talk on May 4th, at the Mathematical Finance Days, at HEC Montréal, on multivariate dynamic models for counts. The conference is organized by IFM2 (Institut de Finance Mathématique de Montréal). I will be chairing some session and I will give a talk based on the joint paper with Mathieu Boudreault.

The slides can be downloaded from the blog,

"In various situations in the insurance industry, in finance, in epidemiology, etc., one needs to represent the joint evolution of the number of occurrences of an event. In this paper, we present a multivariate integer‐valued autoregressive (MINAR) model, derive its properties and apply the model to earthquake occurrences across various pairs of tectonic plates. The model is an extension of Pedelis & Karlis (2011) where cross autocorrelation (spatial contagion in a seismic context) is considered. We fit various bivariate count models and find that for many contiguous tectonic plates, spatial contagion is significant in both directions. Furthermore, ignoring cross autocorrelation can underestimate the potential for high numbers of occurrences over the short‐term. An application to risk management and cat‐bond pricing will be discussed."

http://freakonometrics.free.fr/ringfire.gif

Thursday, May 3 2012

Local utility and multivariate risk aversion

Marc will give a talk today at the European Center for Advanced Research in Economics and Statistics (ECARES) today, at ULB in Brussels, based on some joint work with also Alfred (the paper can be found online on http://papers.ssrn.com/).

Saturday, April 28 2012

Open data and ecological fallacy

A couple of days ago, on Twitter, @alung mentioned an old post I did publish on this blog about open-data, explaining how difficult it was to get access to data in France (the post, published almost 18 months ago can be found here, in French). And  @alung was wondering if it was still that hard to access nice datasets. My first answer was that actually, people were more receptive, and I now have more people willing to share their data. And on the internet, amazing datasets can be found now very easily. For instance in France, some detailed informations can be found about qualitifications, houses and jobs, by small geographical areas, on http://www.recensement.insee.fr (thanks @coulmont for the link). And that is great for researchers (and anyone actually willing to check things by himself).

But one should be aware that those aggregate data might not be sufficient to build up econometric models, and to infere individual behaviors. Thinking that relationships observed for groups necessarily hold for individuals is a common fallacy (the so-called " ecological fallacy"). 

In a popular paper, Robinson (1950) discussed "ecological inference", stressing the difference between ecological correlations (on groups) and individual correlations (see also Thorndike (1937)) He considered two aggregated quantities, per american state: the percent of the population that was foreign-born, and the percent that was literate. One dataset used in the paper was the following

> library(eco)
> data(forgnlit30)
> tail(forgnlit30)
Y          X         W1          W2 ICPSR
43 0.076931986 0.03097168 0.06834300 0.077206504    66
44 0.006617641 0.11479052 0.03568792 0.002847920    67
45 0.006991899 0.11459207 0.04151310 0.002524065    68
46 0.012793782 0.18491515 0.05690731 0.002785916    71
47 0.007322475 0.13196654 0.03589512 0.002978594    72
48 0.007917342 0.18816461 0.02949187 0.002916866    73

The correlation between  foreign-born and literacy was

> cor(forgnlit30$X,1-forgnlit30$Y)
[1] 0.2069447

So it seems that there is a positive correlation, so a quick interpretation could be that in the 30's, amercians were iliterate, but hopefully, literate immigrants got the idea to come in the US. But here, it is like in Simpson's paradox, because actually, the sign should be negative, as obtained on individual studies. In the state-based-data study, correlation was positive mainly because foreign-born people tend to live in states where the native-born are relatively literate...

Hence, the problem is clearly how individuals were grouped. Consider the following set of individual observations,

> n=1000
> r=-.5
> Z=rmnorm(n,c(0,0),matrix(c(1,r,r,1),2,2))
> X=Z[,1]
> E=Z[,2]
> Y=3+2*X+E
> cor(X,Y)
[1] 0.8636764

Consider now some regrouping, e.g.

> I=cut(Z[,2],qnorm(seq(0,1,by=.05)))
> Yg=tapply(Y,I,mean)
> Xg=tapply(X,I,mean)

Then the correlation is rather different,

>  cor(Xg,Yg)
[1] 0.1476422

Here we have a strong positive individual correlation, and a small (positive correlation) on grouped data, but almost anything is possible.

Models with random coefficients have been used to make ecological inferences. But that is a long story, andI will probably come back with a more detailed post on that topic, since I am still working on this with @coulmont (following some comments by @frbonnet on his post on recent French elections on http://coulmont.com/blog/).

Friday, April 27 2012

Proving tautological versus trivial results in mathematics

There is something that might be fun in mathematics, which is the connexion between trivial, tautological and difficult questions. Sometimes, things are so intuitive, that they seem to be obvious. But mathematicians aren't jedis, and they should not trust too much their intuition... I mean intuition is fine, but it is not a proof. It is like those standard results we learn in topology courses, e.g. "the closure of an open ball is not necessarily the closed ball". The other thing is that after a while, you try to prove something, until someone makes you realize that it is the definition... 
And this morning, while I was trying to make a coffee, @renaudjf came with a simple question (yes, it always starts like that). Consider the standard algorithm to generate a conditional random variable. Assume that  has a priori distribution , and that , given , has (conditional) distribution .
The standard idea is monte carlo simulation, to generate values of , is
  •  step 1: generate 
  •  step 2: given that generation of , generate 
"Can we prove that we actually generate from the (true, maybe hard to characterize) non-conditional distribution of  ? Or is it just trivial ?". After having those previous philosophical questions, we came to the point that if it was trivial, then we should be able to prove it. A standard way of writing the algorithm is to use the quantile based technique
  •   with ,
  •   with ,
For instance, to generate negative binomial distribution
n=1
theta=rgama(n,3,3)
X=rpois(n,lambda=theta)
Thus, let  where  and  are two independent random variables with a uniform distribution on the unit interval. Let us try to derive its distribution, i.e.
so
if we consider the following change of variate 
which is exactly the non-conditional distribution of .
And then, you're quite happy because you've been able to prove a trivial result ! But next time, I promise, we'll try to derive an amazing theorem that will change humanity... but next time only, first, let us prove trivial results.

Wednesday, April 25 2012

Short versus long papers, in academic journals

This Monday, during my talk on quantile regressions (at the Montreal R-meeting), we've seen how those nice graphs could be interpreted, with the evolution of the slope of the linear regression, as a function of the probability level. One illustration was on large hurricanes, from Elsner, Kossin & Jagger (2008). The other one was on birthweight, from Abrevaya (2001).

It is also to illustrate that technique to academic publication, e.g. the length of papers, over time. Actually, the data we can extract from Scopus are quite similar to the ones uses on hurricanes. For several journals, it is possible to look at the length of articles. Since Scopus is quite expensive ($60,000 per year for the campus, as far as remember, so I can imagine the penalty I might have to pay for sharing such a dataset)

base=read.table("/home/scopus.csv",
header=TRUE,sep=",")
pages=base$Page.end-base$Page.start
year=base$Year

Again, a first idea can be to look at boxplots, and regression on (nonparametric) quantiles, here for Econometrica,

boxplot(pages~as.factor(year),col="light blue")
Q=function(p=.9) as.vector(by(pages,as.factor(year),
function(x) quantile(x,p)))
u=1:16
points(u,Q(p),pch=19,col="blue")
abline(lm(Q(p)~u,weights=table(year)),lwd=2,col="blue")

Consider now (as in the slides in the previous post) a quantile regression (instead of a regression on quantiles), for instance in the Annals of Probability,

library(quantreg)
u=seq(.05,.95,by=.01)
coefstd=function(u) summary(rq(pages~year,
tau=u))$coefficients[,2]
coefest=function(u) summary(rq(pages~year,
tau=u))$coefficients[,1]
CS=Vectorize(coefstd)(u)
CE=Vectorize(coefest)(u)
k=2
plot(u,CE[k,],ylim=c(min(CE[k,]-2*CS[k,]),
max(CE[k,]+2*CS[k,])))
polygon(c(u,rev(u)),c(CE[k,]+1.96*CS[k,],
rev(CE[k,]-1.96*CS[k,])),
col="light green",border=NA)
lines(u,CE[k,],lwd=2,col="red")
abline(h=0)

We have the following slope, for the year, as a function of the probability level,

The slope is always positive, so size of papers is increasing with time, short and long papers. But the influence of time is much larger for long paper than short one: for short papers (lower decile) every year, the size keeps increasing, with one more page every three years. For long paper (upper decile), it is two more pages every three years.

If we look now at the Annals of Statistics, we have

and for the evolution of the slope of the quantile regression,

Again the impact is positive: papers are longer in 2010 than 15 years ago. But the trend is the reverse: short papers (lower decile) are much longer, almost one more page every year, with long paper increase only by one more page every two years... Initially, I want to run such a study on a much longer term, with quantile regressions and splines to see when there might have been a change, both in lower and upper tails. Unfortunately, as suggested by some colleagues, there might have been some changes in the format of the journal (columns, margins, fonts, etc). That's a shame, because I rediscover nice short papers of 5-10 pages published 20 or 30 years ago. They are nice to read (and also potentially interesting for a post on the blog). 5 pages, that's perfect, but 40 pages, that's way too long. I wonder if I am the only one having this feeling, missing those short but extremely interesting papers....

Tuesday, April 24 2012

Régression médiane et géométrie

Suite à mon exposé d'hier, Pierre me demandait d'argumenter un point que j'avais évoqué oralement sans le justifier: " une régression médiane passe forcément par deux points du nuage ". Par exemple,

x=c(1,2,3)
y=c(3,7,8)
plot(x,y,pch=19,cex=1.5,xlim=c(0,4),ylim=c(0,10))
library(quantreg)
abline(rq(y~x,tau=.5),col="red")

Essayons de le justifier... de manière un peu heuristique. Commençons un peu au hasard, avec une droite qui passe "entre" les points (on admettra que la régression médiane sépare l'espace en deux, avec la moitié des points en dessous, la moitié au dessus, à un près pour des histoires de parité).

plot(x,y,pch=19,cex=1.5,xlim=c(0,4),ylim=c(0,10))
abline(a=-1,b=3.2,col="blue")

c'est joli, mais on peut sûrement faire mieux. En particulier, on va chercher ici à minimiser la somme des valeurs absolue des erreurs (c'est le principe de la régression médiane). Essayons de translater la courbe, vers le haut, ou vers le bas,

plot(x,y,pch=19,cex=1.5,xlim=c(0,4),ylim=c(0,10))
abline(a=-1,b=3.2,col="blue",lwd=2)
for(i in seq(-2,3,by=.5)){
abline(a=i,b=3.2,col="blue",lty=2)}

Si on regarde ce que vaut la somme des valeurs absolues des erreurs, on a

d=function(h) sum(abs(y-(h+3.2*x)))
H=seq(-4,6,by=.01)
D=Vectorize(d)(H)
plot(H,D,type="l")

Formellement, l'optimum est ici
> optimize(d,lower=-5,upper=5)
$minimum
[1] -0.2000091
 
$objective
[1] 2.200009

Bref, retenons cette courbe, et notons qu'elle passe par un des points,

Et c'est assez normal. On commençait avec deux points au dessus, et un en dessous. Soit la somme des valeurs absolues des erreurs . Si on translate de , on passe de à (tant que l'on a toujours deux points au dessus, car celui en dessous sera toujours en dessous). Si on translate de , (là encore tant qu'il n'y a qu'un point en dessous). Bref, on a intérêt à translater vers le haut. Si on dépasse le premier point rencontré, on se retrouve dans la situation inverse, avec un point au dessus et deux au dessous, et on a intérêt à redescendre. Bref, la translation optimale revient a s'arrêter dès qu'on croise un point. Autrement dit, la régression passe forcément par un point. Au moins.

Maintenant, essayons de faire pivoter la courbe autour de ce point, là encore afin de minimiser la somme des valeurs absolues des erreurs,

plot(x,y,pch=19,cex=1.5,xlim=c(0,4),ylim=c(0,10))
abline(a=optimize(d,lower=-5,upper=5)$minimum,b=3.2,
col="blue",lwd=2) points(x[1],y[1],cex=1.8,lwd=2,col="blue") for(i in seq(-1,5,by=.25)){ abline(a=(y[1]-i*x[1]),b=i,col="blue",lty=2)}

La distance est alors, en fonction de la pente de la droite

d2=function(h) sum(abs(y-((y[1]-h*x[1])+h*x)))
H=seq(-4,6,by=.01)
D=Vectorize(d2)(H)
plot(H,D,type="l")

Là encore on peut formaliser un peu

> optimize(d2,lower=-5,upper=5)
$minimum
[1] 2.500018
 
$objective
[1] 1.500018

Et si on regarde cette dernière droite, on passe par deux points,

plot(x,y,pch=19,cex=1.5,xlim=c(0,4),ylim=c(0,10))
h=optimize(d2,lower=-5,upper=5)$minimum
abline(a=(y[1]-h*x[1]),
b=h,col="purple",lwd=2)

Pour comprendre pourquoi, comparons les deux cas,

  • en faisant un pivot vers le bas, pour un des points, la valeur absolue des erreurs augmente alors que pour l'autre, elle diminue

  • en faisant un pivot vers le haut, là encore pour un des points, la valeur absolue des erreurs augmente alors que pour l'autre, elle diminue, mais en sens inverse par rapport au cas précédent,

Et pour faire simple, dans le premier cas, le gain (sur la baisse) compense la perte, alors que dans le second cas, c'est le contraire. Bref, on a intérêt à pivoter vers le bas, ou vers le point à droite. Jusqu'à l'atteindre. Et optimalement, on passera par ce point. Bref, on passera par deux points. Et je laisse les plus courageux regarder avec plus de trois points, mais c'est toujours pareil...

Licornes, philosophes, vieux cons, peer-review et instituts de sondages

oui, vaste programme... Lorsque les enfants étaient petits, je me souviens avoir été maintes fois étonné par la profondeur des questions qu'ils pouvaient poser,

- dis papa, comment tu sais tout ça ?

ou

- dis papa, si j'étais pas né comment ça serait ?

Toutes ces questions sont embarrassantes, car elles sont assez fondamentales quand on y pense. Car le problème est qu'après, les enfants grandissent, et pensent avoir des réponses à ces mêmes questions,

- t'es sûre, tu ne veux pas une licorne pour ton anniversaire ?

- mais non papa, ça n'existe pas les licornes

- bien sûr que si que ça existe, n'oublie pas que les papas ça sait tout ! Et comment tu sais que ça existe pas d'abord ?

- ben j'en ai jamais vu...

- et alors... tu n'as jamais vu de crocodiles, et pourtant ça existe...

- oui mais j'en ai vu à la télé

- et moi à la télé j'ai vu des wookies et des ewoks dans Star Wars, et des dragons dans Harry Potter, et..

- oui mais à la télé ça existe pas ! Les licornes personne n'en a jamais vu d'abord

- et ça suffit pour dire que ça n'existe pas, tu penses ?

(en fait, on peut avoir ce même genre de dialogue avec dieu, ou mieux, le père Noël). Bref, suite à cette discussion je me demandais ce qui faisait qu'on y croyait ou qu'on n'y croyait plus. Par exemple en sciences: l'opposition classique entre sciences et religion est basée sur le fait que la religion repose sur la foi, et alors que la science non, on doit avoir une preuve de ce qu'on avance. Un peu comme le personnage de Gorgias (de Platon), le précurseur de tous nos hommes (et femmes) politiques: on a raison parce qu'on a eu le dernier mot. Et non parce qu'on l'a prouvé. Mais il faut être lucide. La science devient de plus en plus une histoire de foi et de croyance. Par exemple en cours, la très grande majorité de mes élèves me croient si je leur dit qu'une méthode ne marche pas, rares sont ceux qui penseraient à me demander de leur justifier, ou au mois de trouver des éléments étayant mon propos (car la tendance est de faire de moins de moins de démonstration en cours, surtout quand on fait de la statistique appliquée). Par exemple en séries temporelles: lorsque je faisais cours il y a 10 ans à l'université Paris Dauphine, on passait des heures sur les choix des ordres d'autorégression (faut-il un retard à l'ordre 4, ou 12 ?). Maintenant, tous les logiciels font ça automatiquement. Un peu comme le choix de la fenêtre quand on fait de l'estimation à noyau... sous R, on a un choix optimal de fenêtre... et souvent, quand on est pressé, on y croit.

Oui, en sciences, on utilise beaucoup cet argument de foi. Le plus classique étant de dire que ça a été publié dans une grande revue... Et c'est le principe du peer-review: on délègue à d'autres chercheurs la responsabilité de regarder en détails un papier, qui souvent dépasse nos propres compétences, afin de se reposer sur leur jugement les yeux fermés. C'est bien entendu stupide, et c'est pour ça qu'on demande aux étudiants de maîtrise de creuser les papiers en détails, de relire et comprendre les démonstrations, et de faire des simulations pour vérifier si ça marche ! Et je serais bien le premier à dire qu'il ne faut jamais croire ce qui est écrit dans mes livres ou mes papiers ! ou sur le blog ! Quoi que... sur le blog, je mets autant que possible l'intégralité des codes, afin de permettre aux personnes de vérifier plus facilement ce que j'ai fait. Et c'est le gros intérêt du blog pour les chercheurs, par rapport aux articles publiés: on n'est pas là pour épater la galerie (je renvoie d'ailleurs à un très bon article expliquant comment écrire un article scientifique). Sur le blog, on peut davantage lancer des discussions, et être beaucoup plus transparents, et modestes !

Soit dit en passant, cette histoire de foi se retrouve largement dans cette mise en abyme des sondages, évoquée dans un précédant billet, sur le fait que plus personne ne croit aux sondages. Pourtant des grands sociologues et des grands statisticiens ont travaillé depuis des dizaines d'années sur la théorie des sondages. Des centaines d'articles (parus dans des articles relus par d'autres grands scientifiques) justifient certaines méthodes d'analyse. Et pourtant la foi n'est pas là. Peut-être serait-il temps que les instituts soient plus transparents, qu'ils donnent accès aux données brutes. Car je suis convaincu que la transparence est la clé de tout. Enfin presque... pour l'existence du père Noël, je vais me battre autant que possible pour maintenir le doute !

Monday, April 23 2012

Talk on quantiles at the R Montreal group

This afternoon, I will be giving a two-hour talk at McGill on quantiles, quantile regressions, confidence regions, bagplots and outliers. Before defining (properly) quantile regressions, we will mention regression on (local) quantiles, as on the graph below, on hurricanes,

In order to illustrate quantile regression, consider the following natality database,

base=read.table(
"http://freakonometrics.free.fr/natality2005.txt",
header=TRUE,sep=";")

We can use it produce those nice graphs we can find in several papers, modeling weight of newborns,

u=seq(.05,.95,by=.01)
coefstd=function(u) summary(rq(WEIGHT~SEX+
SMOKER+WEIGHTGAIN+BIRTHRECORD+AGE+ BLACKM+
BLACKF+COLLEGE,data=base,tau=u))$coefficients[,2]
coefest=function(u) summary(rq(WEIGHT~SEX+
SMOKER+WEIGHTGAIN+BIRTHRECORD+AGE+ BLACKM+
BLACKF+COLLEGE,data=base,tau=u))$coefficients[,1]
CS=Vectorize(coefstd)(u)
CE=Vectorize(coefest)(u)

The slides can be downloaded on the blog, as well as the R-code.

Saturday, April 21 2012

Quotas et sondages

Apres @Christian il y a quelques semaines (suite aux premiers billets ou je faisais un parallèle entre l'estimation d'un paramètre sur le simplexe et les sondages électoraux), un ancien collègue de Rennes m'a demande si je pouvais faire un billet sur les sondages par quotas (au lieu de ne parler que de sondages aléatoires, ou de faire comme si les sondages politiques l’étaient). La question de Benoit était sur une phrase dans les transparents en ligne a Lyon (je n'ai pas trouve l'auteur)

"Dans les échantillonnages non probabilistes, comme la méthode des quotas utilisée par les instituts de sondage, on ne peut pas estimer la marge d’erreur due à l’élaboration de l’échantillon et généraliser les conclusions à l’ensemble de la population. En toute rigueur, les instituts de sondage ne peuvent pas calculer des intervalles de confiance. Dans la pratique, on sait que la marge d’erreur dans un échantillonnage par quotas est d’environ 3% pour un échantillon de 1000 individus (comme pour un échantillonnage aléatoire)"

N'ayant pas forcement ☐ les compétences  le temps  le courage (coche la bonne réponse), je vais me contente de rappeler qu'il y a quelques documents en ligne qui évoquent les méthodes de quotas. Par exemple le document de Philippe Cibois, ou celui en ligne a l’université de Pau. Historiquement Jerzy Neyman evoquait la methode des quotas (representative survey) en 1934 dans un article très intéressant. Dans Les sondages, publie a l'issu des journées de statistiques organisées en 1986, un chapitre explique ce qu'est la methode des quotas. 

La référence en question étant

Une référence passionnante semblerait etre l'article de Jean Claude Deville A Theory of Quota Surveys paru en 1991 dans Survey Methodology (mais non accessible en ligne). Si quelqu'un a le temps de regarder tout ça, éventuellement en faisant des simulations, je suis preneur ! Sinon merci a @wulab pour la référence des Simpsons pour l'illustration.

Wednesday, April 18 2012

25% ? sérieusemement, vous y croyez ?

(pour reprendre les statistiques postées par @guybirenbaum l'autre jour)

Réfléchissons deux minutes (ou un peu plus, c'est ce qu'on a fait ce midi avec ). On vous interroge sur un sondage: si vous croyez aux sondages, vous le dites. Mais si vous n'y croyez pas... vous répondez n'importe quoi (sinon c'est que vous y croyez, non ?). Bon essayons de formaliser ça afin de mieux comprendre... Soit la réponse donnée, et la vérité (ce que pense vraiment la personne interrogée). On sait que

Mais ce qu'on veut calculer, c'est , la probabilité qu'une personne y croit vraiment aux sondages. Pour ça, on sait que

Pour le premier terme, une personne qui y croit le dira (sinon c'est qu'elle n'y croit pas), donc . Par contre, une personne qui n'y croit pas peut dire n'importe quoi. Disons que vaut . Dans ce cas

ou encore

Supposons par exemple qu'une petite proportion des personnes qui n'y croient pas disent y croire, disons 10%. Dans ce cas, la vraie probabilité qu'une personne croit au sondage serait plutôt 16%. Si était un peu plus grand (25%), on serait juste tombé sur la proportion des rigolos qui répondent n'importe quoi aux sondages, car personne n'y croit !

- page 1 of 72