Freakonometrics

To content | To menu | To search

Sunday, July 1 2012

Visualizing uncertainty using Jackknife

Once again, I (re)discovered last week at the Rmetrics conference that old tools can be extremely interesting to illustrate complex ideas, like uncertainty in fnancial markets, and stock prices. For instance a 99.5% quantile: we look for the scenario that occur with a probability of 1 out of 200. Are there nice ways to illustrate that quantity ?

Consider the monthly evolution of the SP500 index over the last 22 years,

> library(quantmod)
> getSymbols('^GSPC', from='1990-01-01')
[1] "GSPC"
> GSPC = adjustOHLC(GSPC,
+ symbol.name='^GSPC') > MGSPC = to.monthly(GSPC) > CLOSE = MGSPC$GSPC.Close > plot(CLOSE)

It is possible to use Jackknife technique to illustrate uncertainty. The idea, in Jackknife, it to remove one of the observations, and to do that for all observations. More formally, from a sample , we define a (sub)sample where observation  as been removed, i.e. . Then, we can study all samples when one observation was removed.

Here, in the context of financial time series, over 270 months, we can wonder what might have been the final value of the index if one observation (i.e. one month) had been removed. It is actually the idea of Jackknife,

> R=diff(log(CLOSE)); R=R[-1]
> n=length(R)
> X=rnorm(n,mean(R),sd(R))
> X=R
> MX=t(matrix(X,n,n))
> MX=exp(MX)
> diag(MX)=1
> SMX=MX
> for(k in 2:n){SMX[,k]=SMX[,k-1]*(MX[,k])}

We can plot the different trajectories of the index, when we remove one month,

> init=as.numeric(CLOSE[1])
> plot(1:n,init*cumprod(exp(X)),type="l",
+ xlab="",ylab="",col="white")
> for(k in 1:n){lines(0:n,init*c(1,SMX[k,]),
+ col="light blue")}
> lines(0:n,init*c(1,cumprod(exp(X))),lwd=2,
+ col="blue")

This can be used to understand sensitivity, or unccertainty, of financial time series,

We can then look closer at the final value of the index, over those 270 scenarios,

or we also use a Box-Plot,

Here we can clearly see the impact: if we remove one good month, the index ends around 1250, while it reaches 1650 if we remove a bad month. The difference is huge. So instead of talking about volatility (which is actually a complex concept), that Jackknife idea of remove observations might be more intuitive, and much easier to get a first understanding of uncertainty. But those ideas of resampling are great. I will post a nice application soon (but first, I will discuss with some colleagues in Lyon).

Wednesday, June 27 2012

Qui se ressemble se suit (sur Twitter au moins)

Un nouveau billet pour reprendre une analyse marrante faite par @3wen  (Ewen Gallic, qui travaille à Montréal alors que je profite de la Suisse). Suite à l'analyse amusante des trolls de Twitter, j'avais lancé l'idée qu'il pourrait être amusant de regarder parmi les députés français (que j'avais un peu suivis l'autre jour), qui tweete avec qui. Ewen a eu la bonne idée de regarder sur  http://lelab.europe1.fr/ ce qui lui a permis de récupérer la liste des comptes Twitter des députés, en France. L'idée est simple: parmi les députés français, on regarde qui suit qui. Quelqu'un de très suivi sera au centre du nuage, alors que quelqu'un qui se contente de suivre sera sur le bord (intensément connecté au reste du nuage). Pour les personnes peu familières, Twitter n'est pas Facebook: on n'a pas des "amis", il y a des gens que l'on suit parce qu'ils racontent des choses qui peuvent nous intéresser.

En utilisant gephi Ewen a ensuite pu visualiser le graphique des interconnexions entre les députés.

Sans grande surprise, les députés de gauche suivent essentiellement les députés de gauche, et inversement. Quelques gros comptes font la passerelle entre les deux groupes parlementaires.

En fait, si on regarde en détail (voire sur le fichier image complet), on peut observer un peu mieux ce qui se passe,

Bon, la grande difficulté est de lire ces interactions correctement. En particulier, on ne peut pas conclure (à la vue seule du graphique) que Cécile Duflot est de gauche ! Ce que cela nous dit est que ce qu'elle raconte intéresse les gens de gauche (ou en tous cas les députés du Parti Socialiste), et pas du tout les gens de droite (les députés de l'UMP) ! On notera aussi que le centre n'existe pas. Sur Twitter en tous cas. Et si on regarde tout en bas, à droite, on retrouve le Front National, et on a la confirmation que ce que raconte le Front National n'intéresse personne !

Continue reading...

Tuesday, May 15 2012

Finding Waldo, a flag on the moon and multiple choice tests, with R

I have to admit, first, that finding Waldo has been a difficult task. And I did not succeed. Neither could I correctly spot his shirt (because actually, it was what I was looking for). You know, that red-and-white striped shirt. I guess it should have been possible to look for Waldo's face (assuming that his face does not change) but I still have problems with size factor (and resolution issues too). The problem is not that simple. At the http://mlsp2009.conwiz.dk/ conference, a price was offered for writing an algorithm in Matlab. And one can even find Mathematica codes online. But most of the those algorithms are based on the idea that we look for similarities with Waldo's face, as described in problem 3 on http://www1.cs.columbia.edu/~blake/'s webpage. You can find papers on that problem, e.g. Friencly & Kwan (2009) (based on statistical techniques, but Waldo is here a pretext to discuss other issues actually), or more recently (but more complex) Garg et al. (2011) on matching people in images of crowds.

What about codes in R ? On http://stackoverflow.com/, some ideas can be found (and thank Robert Hijmans for his help on his package). So let us try here to do something, on our own. Consider the following picture,

With the following code (based on the following file) it is possible to import the picture, and to extract the colors (based on an RGB decomposition),
> library(raster)
> waldo=brick(system.file("DepartmentStoreW.grd",
+ package="raster"))
> waldo
class       : RasterBrick
dimensions  : 768, 1024, 786432, 3 (nrow,ncol,ncell,nlayer)
resolution  : 1, 1  (x, y)
extent      : 0, 1024, 0, 768  (xmin, xmax, ymin, ymax)
coord. ref. : NA
values      : C:\R\win-library\raster\DepartmentStoreW.grd
min values  : 0 0 0
max values  : 255 255 255
My strategy is simple: try to spot areas with white and red stripes (horizontal stripes). Note that here, I ran the code on a Windows machine, the package is not working well on Mac. In order to get a better understanding of what could be done, let us start with something much more simple. Like the picture below, with Waldo (and Waldo only). Here, it is possible to extract the three colors (red, green and blue),
> plot(waldo,useRaster=FALSE)

     

It is possible to extract the red zones (already on the graph above, since red is a primary color), as well as the white ones (green zones on the graphs means a white region on the picture, on the left)
# white component
white = min(waldo[[1]] , waldo[[2]] , waldo[[3]])>220
focalswhite = focal(white, w=3, fun=mean)
plot(focalswhite,useRaster=FALSE)
 
# red component
red = (waldo[[1]]>150)&(max(  waldo[[2]] , waldo[[3]])<90)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE)
i.e. here we have the graphs below, with the white regions, and the red ones,

From those two parts, it has been possible to extract the red-and-white stripes from the picture, i.e. some regions that were red above, and white below (or the reverse),
# striped component
striped = red; n=length(values(striped)); h=5
values(striped)=0
values(striped)[(h+1):(n-h)]=(values(red)[1:(n-2*h)]==
TRUE)&(values(red)[(2*h+1):n]==TRUE)
focalsstriped = focal(striped, w=3, fun=mean)
plot(focalsstriped,useRaster=FALSE)
So here, we can easily spot Waldo, i.e. the guy with the red-white stripes (with two different sets of thresholds for the RGB decomposition)

 Let us try somthing slightly more complicated, with a zoom on the large picture of the department store (since, to be honest, I know where Waldo is...).

Here again, we can spot the white part (on the left) and the red one (on the right), with some thresholds for the RGB decomposition

Note that we can try to be (much) more selective, playing with threshold. Here, it is not very convincing: I cannot clearly identify the region where Waldo might be (the two graphs below were obtained playing with thresholds)

 And if we look at the overall pictures, it is worst. Here are the white zones, and the red ones,

and again, playing with RGB thresholds, I cannot spot Waldo,

Maybe I was a bit optimistic, or ambitious. Let us try something more simple, like finding a flag on the moon. Consider the picture below on the left, and let us see if we can spot an American flag,

Again, on the left, let us identify white areas, and on the right, red ones

Then as before, let us look for horizontal stripes

Waouh, I did it ! That's one small step for man, one giant leap for R-coders ! Or least for me... So, why might it be interesting to identify areas on pictures ? I mean, I am not Chloe O'Brian, I don't have to spot flags in a crowd, neither Waldo, nor some terrorists (that might wear striped shirts). This might be fun if you want to give grades for your exams automatically. Consider the two following scans, the template, and a filled copy,

A first step is to identify regions where we expect to find some "red" part (I assume here that students have to use a red pencil). Let us start to check on the template and the filled form if we can identify red areas,
exam = stack("C:\\Users\\exam-blank.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE) 
exam = stack("C:\\Users\\exam-filled.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=3, fun=mean)
plot(focalsred,useRaster=FALSE)


First, we have to identify areas where students have to fill the blanks. So in the template, identify black boxes, and get the coordinates (here manually)
exam = stack("C:\\Users\\exam-blank.png")
black = max(  exam[[1]] ,exam[[2]] , exam[[3]])<50
focalsblack = focal(black, w=3, fun=mean)
plot(focalsblack,useRaster=FALSE)
correct=locator(20)
coordinates=locator(20)
X1=c(73,115,156,199,239)
X2=c(386,428.9,471,510,554)
Y=c(601,536,470,405,341,276,210,145,79,15)
LISTX=c(rep(X1,each=10),rep(X2,each=10))
LISTY=rep(Y,10)
points(LISTX,LISTY,pch=16,col="blue")

The blue points above are where we look for students' answers. Then, we have to define the vector of correct answers,
CORRECTX=c(X1[c(2,4,1,3,1,1,4,5,2,2)],
X2[c(2,3,4,2,1,1,1,2,5,5)])
CORRECTY=c(Y,Y)
points(CORRECTX, CORRECTY,pch=16,col="red",cex=1.3)
UNCORRECTX=c(X1[rep(1:5,10)[-(c(2,4,1,3,1,1,4,5,2,2)
+seq(0,length=10,by=5))]],
X2[rep(1:5,10)[-(c(2,3,4,2,1,1,1,2,5,5)
+seq(0,length=10,by=5))]])
UNCORRECTY=c(rep(Y,each=4),rep(Y,each=4))
Now, let us get back on red areas in the form filled by the student, identified earlier,
exam = stack("C:\\Users\\exam-filled.png")
red = (exam[[1]]>150)&(max(  exam[[2]] , exam[[3]])<150)
focalsred = focal(red, w=5, fun=mean)

Here, we simply have to compare what the student answered with areas where we expect to find some red in,
ind=which(values(focalsred)>.3)
yind=750-trunc(ind/610)
xind=ind-trunc(ind/610)*610
points(xind,yind,pch=19,cex=.4,col="blue")
points(CORRECTX, CORRECTY,pch=1,
col="red",cex=1.5,lwd=1.5)
Crosses on the graph on the right below are the answers identified as correct (here 13),
> icorrect=values(red)[(750-CORRECTY)*
+ 610+(CORRECTX)]
> points(CORRECTX[icorrect], CORRECTY[icorrect],
+ pch=4,col="black",cex=1.5,lwd=1.5)
> sum(icorrect)
[1] 13

In the case there are negative points for non-correct answer, we can count how many incorrect answers we had. Here 4.
> iuncorrect=values(red)[(750-UNCORRECTY)*610+
+ (UNCORRECTX)]
> sum(iuncorrect)
[1] 4
So I have not been able to find Waldo, but I least, that will probably save me hours next time I have to mark exams...