Freakonometrics

To content | To menu | To search

Friday, June 22 2012

Les députés sont-ils à l'image de la population

Beaucoup de choses ont été écrites sur le fait que les députés ne sont pas vraiment le reflet de la population, que ce soit en terme de profession, de sexe, d'origine, d'age, etc. La liste pourrait être longue. Il y a plusieurs mois, j'avais commencé à regarder le profil des députés, par age. En effet, le site http://www.assemblee-nationale.fr/ permet d'accéder à des données sur tous les députés, depuis la Révolution. Y compris leur date de naissance. En croisant ces données avec des données de population, par exemple via http://www.mortality.org/, on peut comparer la répartition des ages des députés, avec la répartition des ages de la population.

Pour les amateurs, le code pour récupérer les données (ou au moins les dates de naissance des députés) ressemble à

N=2002
URL=paste("http://www.assemblee-nationale.fr/
sycomore/result.asp?radio_dept=tous_departements&
regle_nom=est&Nom=&departement=&
choixdate=intervalle&D%C3%A9butMin=01%2F01%2F",
N,"&FinMin=31%2F12%2F",N,"&Dateau=&legislature=",
s,"&choixordre=chrono&Rechercher=
Lancer+la+recherche",sep="")
HTML=scan(URL,what="character")
 
k=which(HTML=="class=\"titre\">Né")
vHTML=HTML[k:length(HTML)]
vk=which(substr(vHTML,1,7)=="> ")
liste=vHTML[vk]
naissance=liste[seq(1,length(liste),by=2)]
NAISSANCE=as.Date(substr(naissance,8,17),
"%d/%m/%Y")

Maintenant, pour être tout à fait honnête, je ne suis pas certain de ce qui est vraiment renvoyé, et j'ai des doutes que cela correspondent réellement à la requête faite. En effet, même si je demande à avoir la liste des députés après l'élection, j'ai trop de monde... mais peut-être est-ce du aux décès éventuels, et il est possible que l’ensemble des députés qui ont siégé pendant la mandature apparaissent dans le résultat de la requête.

Sur la figure suivante on voit, sur plusieurs élections depuis plus de 100 ans, comment les deux distributions se déforment, avec en rouge la distribution de l'age des députés, et en bleu, la distribution de la population française, dans son ensemble (population de plus de 18 ans)

Si on veut tout suivre sur un graphique, au lieu de se regarder une animation, on peut représenter les différents quantiles (10%, 25%, 75% et 90%, retenus sur la population de plus de 18 ans, et l'age médian, au centre), avec la population française l'année de l'élection, 

et l'ensemble des élus au parlement,

Si on veut faciliter la comparaison, on peut se contenter de visualiser l'évolution des ages moyens, 

ou encore, du ratio (en % de différence) entre l'age moyen des députés, et celui de l'ensemble de la population.

Sur ce graphique, on voit que depuis 30 ans, l’age moyen des députés croit plus vite que celui de la population: la population français vieilli, mais moins que ses députés... La gérontocratie perdure donc en France. En espérant que cela ne débouche pas sur le clash générationnel que l'on semble observer ces temps-ci au Québec...

Tuesday, January 17 2012

Infidelity and econometrics

On http://www.bakadesuyo.com, there was recently an interesting discussion about infidelity, the key question being "at what ages are men and women most likely to have affairs?" The discussion is based on some graphs, e.g.

The source is a paper by Donald Cox. Based on a sample of 36 men and 22 women 3,432 respondent (NHSLS dataset) . And to be honest, I have been surprised by the shape of the curves. Especially for men... In order to compare, it is possible to use another dataset that can be found in R,

> library(Ecdat)
> data(Fair)
> tail(Fair)
sex age   ym child religious education occupation
596   male  47 15.0   yes         3        16          4
597   male  22  1.5   yes         1        12          2
598 female  32 10.0   yes         2        18          5
599   male  32 10.0   yes         2        17          6
600   male  22  7.0   yes         3        18          6
601 female  32 15.0   yes         3        14          1
rate nbaffairs
596    2         7
597    5         1
598    4         7
599    5         2
600    2         2
601    5         1
with 601 observations (from Fair (1977)). It is possible to run a Poisson regression to describe the number of affairs in the past year. E.g for men
> library(splines)
> regM=glm(nbaffairs~bs(age),family=poisson,
+ data=Fair[Fair$sex=="male",])
> a=seq(20,60)
> N=predict(regM,newdata=data.frame(age=a),type="response")
> plot(a,N,type="l",lwd=2,col="red")

or for women,
> regF=glm(nbaffairs~bs(age),family=poisson,
+ data=Fair[Fair$sex=="female",])
> N=predict(regF,newdata=data.frame(age=a),type="response")
> plot(a,N,type="l",lwd=2,col="red",lty=2)

On that (larger) dataset, we obtain curves that are more intuitive... But maybe the Poisson distribution is not an appropriate model. For instance, having no affairs do not mean that the person did not want to... So perhaps, a more interesting model would be a Poisson model with a zero-inflation, i.e. some people are honest and do not want to have affairs (and appear as 0), while some do want to have some affairs, and the number of affairs is Poisson distributed (and can take the value 0). If we focus on people wo do not want to have affairs, the model (and the prediction) is the following, where we plot the probability of not being interested in having an affair,
> library(pscl)
> regM0=zeroinfl(nbaffairs~bs(age)|bs(age),family=poisson,
+ link="logit",data=Fair[Fair$sex=="male",])
> N0=predict(regM0,newdata=data.frame(age=a),type="zero")
> plot(a,N0,type="l",lwd=2,col="blue")

For those willing to have an affair, here is the parameter of the Poisson distribution of the number of affairs,
> Nc=predict(regM0,newdata=data.frame(age=a),type="count")
> plot(a,Nc,type="l",lwd=2,col="purple")

The same can be done for women, with the probability of no-willing to have an affair,

and to Poisson rate for women willing to have an affair,

If we focus on people willing to have an affair, the curves are the following,

i.e. men below 40 have more interested, but after 40, the probability drops, while women are still more and more likely to be willing to have an affair. On the other hand, young women having affairs might be less, but they usually have much more affairs than men...

Monday, May 2 2011

Oscar awards: good actor versus good actress

I am not a big fan of those ceremonies, where some actors pretend that they are extremely happy to be there, and then some win a trophy, some don't, and those who win start to cry, and those who did not get a trophy try to pretend that they are not affected, etc. The other reason is that, since I have several kids, I do not go to see the movies that often (I mean apart from Shrek, Toy Story... Harry Potter is probably the only movie I've seen with real actors - or at least human actors).

But I remember being surprised when I looked at the nominees in newspapers,

Actresses are beautiful and look young, while actors are more experienced. So I have try to see how old were those who win an Oscar, as best actor (here) or best supporting actor (there), and best actress (here) and best supporting actress (there).
OSCAR=read.table("http://freakonometrics.blog.free.fr/public/data/OSCAR.csv",
sep=",",header=TRUE,dec=".")
actor=OSCAR[,1]
suppactor=OSCAR[,2]
actress=OSCAR[,3]
suppactress=OSCAR[,4]
actor=actor[is.na(actor)==FALSE]
actor=actor[actor>0]
actress=actress[is.na(actress)==FALSE]
actress=actress[actress>0]
suppactor=suppactor[is.na(suppactor)==FALSE]
suppactor=suppactor[suppactor>0]
suppactress=suppactress[is.na(suppactress)==FALSE]
suppactress=suppactress[suppactress>0]
 
boxplot(actor,suppactor,actress,suppactress,col=c("blue","blue","red","red"),
names=c("actor","supp. actor","actress","supp. actress"))

On average, a best actress is 36 years old, while a best actor is 44 years old.  Which is quite a difference... Perhaps because it takes more time to an actor to be a good one ? Assuming that they start acting at 18, it takes 18 more years for an actress to be recognized as a good one (here the best one), and 26 for an actor. Or perhaps it is simply because leading actresses have to look young...
The oldest actor who won an Oscar was Henry Fonda (at the age of 76) and the oldest actress was Jessica Tendy (nearing 81). Tatum O'Neal became the youngest person to win the best supporting actress award at the age of 10 (she was 8 when she was acting). The youngest best actress was Marlee Matlin, 21. The distribution was be seen below, with actors in blue, and actresses in red, best supporting actors in dotted lines, and best actors in plain lines,
plot(density(actor),xlim=c(10,80),axes=FALSE,
col="blue",names="",ylab="",xlab="",ylim=c(0,.051))
lines(density(suppactor),col="blue",lty=2)
lines(density(actress),col="red")
lines(density(suppactress),col="red",lty=2)
axis(1)

Note that the age of supporting actors is older that leading ones. E.g. the average age for supporting actors winning an Oscar is 50, while it is  44 for actors. Similarly, it is 40 for supporting actresses, and 36 for actresses.
> mean(suppactor)
[1] 50.23762
 
> mean(actor)
[1] 44.29982
 
> mean(suppactress)
[1] 40.55766
 
> mean(actress)
[1] 36.39733

Here, I have to admit that I was surprised. I always thought that being a supporting actor was a first step before being a leading one. So winners of supporting awards should have been younger that winners of leading ones. But this is not the case.

And the dynamic here is rather stable, with actors, 

and actresses,

except that the age difference between supporting roles and leading roles have increased in the 80's for actors, while it decreased in the 80's for actresses.

Monday, February 7 2011

Open data might be a false good opportunity...

I am always surprised to see many people on Twitter tweeting about , e.g. @, @, @, @ or @ among so many others... Initially, I was also very enthousiastic, but I have to admit that open data are rarely raw data. Which is what I am usually looking for, as a statistician...
Consider the following example: I was wondering (Valentine's day is approaching) when will a man born in 1975 (say) get married - if he ever gets married ? More technically, I was looking for a distribution of the age of first marriage (given the year of birth), including the proportion of men that will never get married, for that specific cohort.
The only data I found on the internet is the following, on statistics.gov.uk/

Note that we can also focus on women (e.g. here). Is it possible to use that open data to get an estimation of the distribution of first marriage for some specific cohort ? (and to answer the question I asked).
Here, we have two dimensions: on line http://freakonometrics.free.fr/blog/latex/marriage01.gif, the year (of the marriage), and on column http://freakonometrics.free.fr/blog/latex/marriage02.gif, the age of the man when he gets married. Assume that those were raw data, i.e. that we have the number of marriages of men of age http://freakonometrics.free.fr/blog/latex/marriage02.gif during the year http://freakonometrics.free.fr/blog/latex/marriage01.gif.
We are interested at a longitudinal lecture of the table, i.e. consider some man born year http://freakonometrics.free.fr/blog/latex/marriage03.gif, we want to estimate (or predict) the age he will get married, if he gets married. With raw data, we can do it... The first step is to build up triangles (to have a cohort vs. age lecture of the data), and then to consider a model, e.g.
http://freakonometrics.free.fr/blog/latex/marriage04.gif
where http://freakonometrics.free.fr/blog/latex/marriage05.gif is a year effect, and http://freakonometrics.free.fr/blog/latex/marriage06.gif is a cohort effect.
base=read.table("http://freakonometrics.free.fr/mariage-age-uk.csv",
sep=";",header=TRUE)
m=base[1:16,]
m=m[,3:10]
m=as.matrix(m)
triangle=matrix(NA,nrow(m),ncol(m))
n=ncol(m)
for(i in 1:16){
triangle[i,]=diag(m[i-1+(1:n),])
}
triangle[nrow(m),1]=m[nrow(m),1]
 
triangle
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 12 104 222 247 198 132 51 34
[2,] 8 89 228 257 202 102 75 49
[3,] 4 80 209 247 168 129 92 50
[4,] 4 73 196 236 181 140 88 45
[5,] 3 78 242 206 161 114 68 47
[6,] 11 150 223 199 157 105 73 39
[7,] 12 117 194 183 136 96 61 36
[8,] 11 118 202 175 122 92 62 40
[9,] 15 147 218 162 127 98 72 48
[10,] 20 185 204 171 138 112 82 NA
[11,] 31 197 240 209 172 138 NA NA
[12,] 34 196 233 202 169 NA NA NA
[13,] 35 166 210 199 NA NA NA NA
[14,] 26 139 210 NA NA NA NA NA
[15,] 18 104 NA NA NA NA NA NA
[16,] 10 NA NA NA NA NA NA NA
 
Y=as.vector(triangle)
YEARS=seq(1918,1993,by=5)
AGES=seq(22,57,by=5)
X1=rep(YEARS,length(AGES))
X2=rep(AGES,each=length(YEARS))
reg=glm(Y~as.factor(X1)+as.factor(X2),family="poisson")
summary(reg)
 
Call:
glm(formula = Y ~ as.factor(X1) + as.factor(X2), family = "poisson")
 
Deviance Residuals:
Min 1Q Median 3Q Max
-5.4502 -1.1611 -0.0603 1.0471 4.6214
 
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8300461 0.0712160 39.739 < 2e-16 ***
as.factor(X1)1923 0.0099503 0.0446105 0.223 0.823497
as.factor(X1)1928 -0.0212236 0.0449605 -0.472 0.636891
as.factor(X1)1933 -0.0377019 0.0451489 -0.835 0.403686
as.factor(X1)1938 -0.0844692 0.0456962 -1.848 0.064531 .
as.factor(X1)1943 -0.0439519 0.0452209 -0.972 0.331082
as.factor(X1)1948 -0.1803236 0.0468786 -3.847 0.000120 ***
as.factor(X1)1953 -0.1960149 0.0470802 -4.163 3.14e-05 ***
as.factor(X1)1958 -0.1199103 0.0461237 -2.600 0.009329 **
as.factor(X1)1963 -0.0446620 0.0458508 -0.974 0.330020
as.factor(X1)1968 0.1192561 0.0450437 2.648 0.008107 **
as.factor(X1)1973 0.0985671 0.0472460 2.086 0.036956 *
as.factor(X1)1978 0.0356199 0.0520094 0.685 0.493423
as.factor(X1)1983 0.0004365 0.0617191 0.007 0.994357
as.factor(X1)1988 -0.2191428 0.0981189 -2.233 0.025520 *
as.factor(X1)1993 -0.5274610 0.3241477 -1.627 0.103689
as.factor(X2)27 2.0748202 0.0679193 30.548 < 2e-16 ***
as.factor(X2)32 2.5768802 0.0667480 38.606 < 2e-16 ***
as.factor(X2)37 2.5350787 0.0671736 37.739 < 2e-16 ***
as.factor(X2)42 2.2883203 0.0683441 33.482 < 2e-16 ***
as.factor(X2)47 1.9601540 0.0704276 27.832 < 2e-16 ***
as.factor(X2)52 1.5216903 0.0745623 20.408 < 2e-16 ***
as.factor(X2)57 1.0060665 0.0822708 12.229 < 2e-16 ***
---
Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for poisson family taken to be 1)
 
Null deviance: 5299.30 on 99 degrees of freedom
Residual deviance: 375.53 on 77 degrees of freedom
(28 observations deleted due to missingness)
AIC: 1052.1
 
Number of Fisher Scoring iterations: 5
Here, we have been able to derive http://freakonometrics.free.fr/blog/latex/marriage12.gif and http://freakonometrics.free.fr/blog/latex/marriage13.gif, where now http://freakonometrics.free.fr/blog/latex/marriage14.gif denotes the cohort.
We can now predict the number of marriages per year, and per cohort
http://freakonometrics.free.fr/blog/latex/marriage15.gif
Here, given the cohort http://freakonometrics.free.fr/blog/latex/marriage03.gif, the shape of http://freakonometrics.free.fr/blog/latex/marriage16.gif is the following
Yp=predict(reg,type="response")
tYp=matrix(Yp,nrow(m),ncol(m))
tYp[16,]
tYp[16,]
[1] 10.00000 222.94525 209.32773 159.87855 115.06971 42.59102
[7] 18.70168 148.92360

The errors (Pearson error) look like that
Ep=residuals(reg,type="pearson")
 
(where the darker the blue, the smaller the residuals, and the darker the red, the higher the residuals). Obviously, we are missing something here, like a diagonal effect. But this is not the main problem here...
I guess that study here is not valid. The problem is that we deal with open data, and numbers of marriages are not given here: what is given is a he proportion of marriage of men of age http://freakonometrics.free.fr/blog/latex/marriage02.gif during the year http://freakonometrics.free.fr/blog/latex/marriage01.gif, with a yearly normalization. There is a constraint on lines, i.e. we observe
http://freakonometrics.free.fr/blog/latex/marriage08.gif
so that
http://freakonometrics.free.fr/blog/latex/marriage09.gif
This is mentioned in the title
It is still possible to consider a Poisson regression on the http://freakonometrics.free.fr/blog/latex/marriage10.gif, but unfortunately, I do not think any interpretation is valid (unless demography did not change last century). For instance, the following sum
http://freakonometrics.free.fr/blog/latex/marriage17.gif
looks like that
apply(tYp,1,sum)
[1] 919.948 838.762 846.301 816.552 943.559 930.280 857.871 896.113
[9] 905.086 948.087 895.862 853.738 826.003 816.192 813.974 927.437
i.e. if we look at the graph

But I do not think we can interpret that sum as the probability (if we divide by 1,000) that a man in that cohort gets married.... And more basically, I cannot do anything with that dataset...

So open data might be interesting. The problem is that most of the time, the data are somehow normalized (or aggregated). And then, it becomes difficult to use them...

So I will have to work further to be able to write something (mathematically valid) on marriage strategy before Valentine's day.... to be continued.

Thursday, December 2 2010

Statistique de l'assurance STT6705V, partie 12

The final course (since courses end this week in Montréal) can be watched here and there. The drawings from the course can be downloaded here (including last week's). First, to come back on last week's course , we considered Lee-carter model, i.e.

Note that it is possible to go a bit further, and to consider something that can be seen as a second order development

or even more generally

This idea was implicitly in the initial paper published in the 80's. Hence, I mentioned in the course that (weighted) least square techniques can be considered to estimate parameters,

and the the first order condition is simply

(which was the interpretation that it is simply the average (over years) of mortality rate). Given that estimator, it is possible to write

and then to use the first components of the singular value decomposition to derive estimators.About the singular value decomposition (SVD, with a nice graph stolen from wikipedia, here, on the right - actually the French article on SVD is a bit more punchy, from Снайперская винтовка Драгунова, there). Note that SVD is closely related to PCA...
Thus, instead of focusing only on the first principal components, it is possible to go up to order 2.
But it is also possible to add, instead of a time component, a cohort-based component, i.e.

The estimation is rather simple, as shown below
D=as.vector(BASEB)
E=as.vector(BASEC)
A=rep(AGE,each=length(ANNEE))
Y=rep(ANNEE,length(AGE))
C=Y-A
base=data.frame(D,E,A,Y,C,a=as.factor(A),
y=as.factor(Y),c=as.factor(C))
LC4=gnm(D~a+Mult(a,y)+Mult(a,c),offset=log(E),
family=poisson,data=base)
Here, parameters have the following shape.
plot(AGE[-1],LC2$coefficients[2:90])
plot(AGE,LC2$coefficients[91:180])
plot(ANNEE,LC2$coefficients[181:279])
plot(AGE,LC2$coefficients[280:369])
plot(1808:1997,LC2$coefficients[370:559])
The first component is rather similar to what we had before
On the other hand, the Lee-Carter (age-year) component is

and the cohort effect (age-cohort) is

Note that the year-component captures wars and chocks that can affect anyone some given years, and that gains on life expectancy is more a cohort effect.
Finally, we discussed actuarial applications. But first, recall that in actuarial literature (without any dynamics), it is standard to defined

and

The mortality rate is then

i.e.

Thus,

Complete expectation of life is then

or its discrete version

Hence, it is possible to write

which can be extended in the dynamic framework as follows

A natural estimator of that quantity is

i.e., with Lee-Carter model

All those quantities can be computed quite simply.

But first, I have to work a little bit on the dataset, at least to be able to predict mortality up to 100 years old (with the demography package, we have to stop at 90). One idea can be to mix two estimation techniques: the nonlinear Poisson regression to get and up to 99 years old, and then to use Rob Hyndman's package to estimate and predict the http://freakonometrics.blog.free.fr/public/maths/viekt.png component. But first, we have to check that  's and 's with the two techniques are not too different. The code for the first model is

BASEB=BASEB[,1:99]
BASEC=BASEC[,1:99]
AGE=AGE[1:99]
library(gnm)
D=as.vector(BASEB)
E=as.vector(BASEC)
A=rep(AGE,each=length(ANNEE))
Y=rep(ANNEE,length(AGE))
base=data.frame(D,E,A,Y,a=as.factor(A),
y=as.factor(Y))
LC2=gnm(D~a+Mult(a,y)+offset=log(E),
family=poisson,data=base)
while it is
BASEB=BASEB[,1:90]
BASEC=BASEC[,1:90]
AGE=AGE[1:90]
library(demography)
MU=as.matrix(BASEB/BASEC)
base=demogdata(data=t(MU),pop=t(BASEC),
ages=AGE,years=ANNEE,type="mortality",
label="France",name="Total",lambda=0)
LC3=lca(base,interpolate=TRUE)
LC3F=forecast(LC3,150)

for the second one. Then, we can compare predictions for  's e.g. (with output from econometric regression in blue, and Rob's package in red)

plot(AGE,LC3$ax,lwd=2,col="red",xlab="",ylab="")
lines(1:98,LC2$coefficients[1]+LC2$coefficients[2:99],
lwd=2,col="blue")

The second problem is that we have to use a linear transformation, since in the econometric package, we do not use the same constraints as in Rob's package. Here, it was

plot(ANNEE,-LC2$coefficients[199:297]*50)
lines(ANNEE,LC3$kt,col="red")

Then we can compute predicted for all ages and years,

A=LC2$coefficients[1]+LC2$coefficients[2:99]
B=LC2$coefficients[100:190]
K1=LC3$kt
K2=LC3$kt[99]+LC3F$kt.f$mean
K=-c(K1,K2)/50
MU=matrix(NA,length(A),length(K))
for(i in 1:length(A)){
for(j in 1:length(K)){
MU[i,j]=exp(A[i]+B[i]*K[j])
}}
Once we have that matrix, we simply have to work on diagonal (i.e. cohorts) to calculate anything. For instance, to get the dynamic version of http://freakonometrics.blog.free.fr/public/maths/viehpx.png, the code is (here for someone of age 40 in 2000)
t=2000
x=40
s=seq(0,99-x-1)
MUd=MU[x+1+s,t+s-1898]
Pxt=cumprod(exp(-diag(MUd)))
It is also possible to compute residual life expectancy http://freakonometrics.blog.free.fr/public/maths/viex.png, including dynamics
x=40
E=rep(NA,150)
for(t in 1900:2040){
s=seq(0,90-x-1)
MUd=MU[x+1+s,t+s-1898]
Pxt=cumprod(exp(-diag(MUd)))
ext=sum(Pxt)
E[t-1899]=ext}
plot(1900:2049,E)
Note that this has been computed given the maximum lifetime we had in the dataset (here 99 years old). Hence, we assume here that no one can live more than 100 year (which will impact life expectancy and is a rather strong assumption, especially in 2050 - which might explain the concave shape of the curve on the right part).
It is also possible to compute annuities. For instance, consider a 40 years old insured, in 2000, willing to get 1 every year after age 70, until he dies, i.e.
The annuity for such a contract is
x=40
t=2000
r=.035
m=70
h=seq(0,21)
V=1/(1+r)^(m-x+h)*Pxt[m-x+h]
sum(V)
VV=rep(NA,150)

for(t in 1900:2040){
s=seq(0,90-x-1)
MUd=MU[x+1+s,t+s-1898]
Pxt=cumprod(exp(-diag(MUd)))
h=seq(0,30)
V=1/(1+r)^(m-x+h)*Pxt[m-x+h]
VV[t-1899]=sum(V,na.rm=TRUE)}
plot(1900:2049,VV)
(in order to compare the value of such a contract, from 1900 up to 2050).