Multivariate probit regression using (direct) maximum likelihood estimators
By arthur charpentier on Wednesday, May 11 2011, 13:46 - econometrics - Permalink
of binary responses, i.e.
with
taking values 1 or 2. Assume that probability
can be function of some covariates
.- The Gaussian vector latent structure
such that
if
is lower than a given threshold, and 1 otherwise.As in standard probit models, assume that

is a Gaussian
random vector. This assumption can be used to derive the
likelihood of a sample
.
> logV=function(parameter){(the code is a bit long since I had trouble working properly with matrices - or more precisely to vectorize my functions - so I used loops... I am sure it is possible to write a better code).
+ CORRELATION=parameter[1]
+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))
+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))
+ sigma=matrix(c(1,CORRELATION,CORRELATION,1),2,2)
+ a11=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
+ for(i in 2:nrow(z)){a11=c(a11,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
+ a10=pnorm(z[1,1],sd=sqrt(sigma[1,1]))-pmnorm(z[1,],varcov=sigma)
+ for(i in
+ 2:nrow(z)){a10=c(a10,pnorm(z[i,1],sd=sqrt(sigma[1,1]))-pmnorm(z[i,],varcov=sigma))}
+ a01=pnorm(z[1,2],sd=sqrt(sigma[2,2]))-pmnorm(z[1,],varcov=sigma)
+ for(i in
+ 2:nrow(z)){a01=c(a01,pnorm(z[i,2],sd=sqrt(sigma[2,2]))-pmnorm(z[i,],varcov=sigma))}
+ a00=1-a10-a01-a11
+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) +
+ ((Y[,1]==0)&(Y[,2]==1))*log(a01) +
+ ((Y[,1]==1)&(Y[,2]==0))*log(a10) +
+ ((Y[,1]==0)&(Y[,2]==0))*log(a00) )
+ }
> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")$par
It is possible to generate samples (based on that specific model) to check that we can actually derive proper maximum likelihood estimators,
> library(mnormt)
> set.seed(1)
> n=1000
> r=0.5
> X1=runif(n)
> X2=rnorm(n)
> Y1S=1+5*X1
> Y2S=8-5*X1
> RES=rmnorm(n,mean=c(0,0),varcov=matrix(c(1,r,r,1),2,2))
> YS=cbind(Y1S,Y2S)+RES
> Y1=(YS[,1]>quantile(YS[,1],.5))*1
> Y2=(YS[,2]>quantile(YS[,2],.5))*1
> base=data.frame(i,Y1,Y2,X1,X2,YS)
> head(base)
i Y1 Y2 X1 X2 Y1S Y2S
1 1 0 0 0.2655087 0.07730312 3.177587 5.533884
2 2 0 0 0.3721239 -0.29686864 1.935307 5.089524
3 3 1 0 0.5728534 -1.18324224 4.757848 5.172584
4 4 1 0 0.9082078 0.01129269 4.600029 3.878225
5 5 0 1 0.2016819 0.99160104 2.547362 6.743714
6 6 1 0 0.8983897 1.59396745 5.309974 4.421523
(the two columns on the right are latent observations, that cannot be used since theoretically they are unobservable). Note that it is a simple regression, one of the component is here only to bring some noise. First of all, let us look at marginal probit regression
> reg1=glm(Y1~X1+X2,data=base,family=binomial)
> reg2=glm(Y2~X1+X2,data=base,family=binomial)
> summary(reg1)
Call:
glm(formula = Y1 ~ X1 + X2, family = binomial, data = base)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.90570 -0.50126 -0.00266 0.49162 2.78256
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.291725 0.267149 -16.065 <2e-16
X1 8.656836 0.510153 16.969 <2e-16 ***
X2 0.007375 0.090530 0.081 0.935
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1386.29 on 999 degrees of freedom
Residual deviance: 726.48 on 997 degrees of freedom
AIC: 732.48
Number of Fisher Scoring iterations: 5
> summary(reg2)
Call:
glm(formula = Y2 ~ X1 + X2, family = binomial, data = base)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.74682 -0.51814 -0.00001 0.57969 2.58565
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.91709 0.24399 16.054 <2e-16 ***
X1 -7.89703 0.46277 -17.065 <2e-16 ***
X2 0.18360 0.08758 2.096 0.036 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1386.29 on 999 degrees of freedom
Residual deviance: 777.61 on 997 degrees of freedom
AIC: 783.61
Number of Fisher Scoring iterations: 5
Here, the optimization yields,
> OPT=optim(fn=logV,par=c(0,1,1,1,1,1,1),method="BFGS")$par
> OPT[1]
[1] 0.5261382
> matrix(OPT[2:7],2,3)
[,1] [,2] [,3]
[1,] -2.451721 4.908633 0.01600769
[2,] 2.241962 -4.539946 0.10614807
Note that the coefficients we have obtained are almost identical to the ones obtained with R standard procedure,
> library(Zelig)
> REG= zelig(list(mu1=Y1~X1+X2,
+ mu2=Y2~X1+X2,
+ rho=~1),
+ model="bprobit",data=base)
> summary(REG)
Call:
zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,
rho = ~1), model = "bprobit", data = base)
Pearson Residuals:
Min 1Q Median 3Q Max
probit(mu1) -10.5442 -0.377243 0.0041803 0.36709 8.60398
probit(mu2) -7.8547 -0.376888 0.0083715 0.42923 5.88264
rhobit(rho) -13.8322 -0.091502 -0.0080544 0.37218 0.85101
Coefficients:
Value Std. Error t value
(Intercept):1 -2.451699 0.135369 -18.11116
(Intercept):2 2.241964 0.125072 17.92536
(Intercept):3 1.169461 0.189771 6.16249
X1:1 4.908617 0.252683 19.42602
X1:2 -4.539951 0.233632 -19.43203
X2:1 0.015992 0.050443 0.31703
X2:2 0.106154 0.049092 2.16235
Number of linear predictors: 3
Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho)
Dispersion Parameter for binom2.rho family: 1
Residual Deviance: 1460.355 on 2993 degrees of freedom
Log-likelihood: -730.1774 on 2993 degrees of freedom
Number of Iterations: 3
> matrix(coefficients(REG)[c(1:2,4:7)],2,3)
[,1] [,2] [,3]
[1,] -2.451699 4.908617 0.01599183
[2,] 2.241964 -4.539951 0.10615443
That procedure works well an can be extended to ordinal responses (not only binary ones, or to three dimensional problems,
logV=function(beta){
BETA=matrix(beta[4:(3+ncol(Y)*ncol(X))],ncol(Y),ncol(X))
z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]),X%*%(BETA[3,]))
r12=beta[1]
r23=beta[2]
r31=beta[3]
s1=s2=s3=1
sigma=matrix(c(s1^2,r12*s1*s2,r31*s1*s3,
r12*s1*s2,s2^2,r23*s2*s3,
r31*s1*s3,r23*s2*s3,s3^2),3,3)
sigma1=matrix(c(s2^2,r23*s2*s3,
r23*s2*s3,s3^2),2,2)
sigma2=matrix(c(s1^2,r31*s1*s3,
r31*s1*s3,s3^2),2,2)
sigma3=matrix(c(s1^2,r12*s1*s2,
r12*s1*s2,s2^2),2,2)
a111=pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a111=c(a111,pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
a011=pmnorm(z[1,2:3],varcov=sigma1)-pmnorm(z[1,],varcov=sigma)
for(i in 2:nrow(z)){a011=c(a011,pmnorm(z[i,2:3],varcov=sigma1)-pmnorm(z[i,],varcov=sigma))}
a101=pmnorm(z[1,c(1,3)],varcov=sigma2)-pmnorm(z[1,],varcov=sigma)
for(i in 2:nrow(z)){a101=c(a101,pmnorm(z[i,c(1,3)],varcov=sigma2)-pmnorm(z[i,],varcov=sigma))}
a110=pmnorm(z[1,1:2],varcov=sigma3)-pmnorm(z[1,],varcov=sigma)
for(i in 2:nrow(z)){a110=c(a110,pmnorm(z[i,1:2],varcov=sigma3)-pmnorm(z[i,],varcov=sigma))}
a100=pnorm(z[1,1],sd=s1)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a100=c(a100,pnorm(z[i,1],sd=s1)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
a010=pnorm(z[1,2],sd=s2)-pmnorm(z[1,c(1,2)],varcov=sigma3)-pmnorm(z[1,c(2,3)],varcov=sigma1)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a010=c(a010,pnorm(z[i,2],sd=s2)-pmnorm(z[i,c(1,2)],varcov=sigma3)-pmnorm(z[i,c(2,3)],varcov=sigma1)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
a001=pnorm(z[1,3],sd=s3)-pmnorm(z[1,c(2,3)],varcov=sigma1)-pmnorm(z[1,c(1,3)],varcov=sigma2)+pmnorm(z[1,],rep(0,ncol(Y)),varcov=sigma)
for(i in 2:nrow(z)){a001=c(a001,pnorm(z[i,3],sd=s3)-pmnorm(z[i,c(2,3)],varcov=sigma1)-pmnorm(z[i,c(1,3)],varcov=sigma2)+pmnorm(z[i,],rep(0,ncol(Y)),varcov=sigma))}
a000=1-a111-a011-a101-a110-a001-a010-a100
a111[a111<=0]=1e-50
a110[a110<=0]=1e-50
a101[a101<=0]=1e-50
a011[a011<=0]=1e-50
a100[a100<=0]=1e-50
a010[a010<=0]=1e-50
a001[a001<=0]=1e-50
a000[a000<=0]=1e-50
-sum(((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==0))*log(a111) +
((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==0))*log(a011) +
((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==0))*log(a101) +
((Y[,1]==0)&(Y[,2]==0)&(Y[,3]==1))*log(a110) +
((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==0))*log(a001) +
((Y[,1]==1)&(Y[,2]==0)&(Y[,3]==1))*log(a010) +
((Y[,1]==0)&(Y[,2]==1)&(Y[,3]==1))*log(a100) +
((Y[,1]==1)&(Y[,2]==1)&(Y[,3]==1))*log(a000) )
}
A strong assumption in that bivariate model is that residuals have a Gaussian structure. It is possible to change that assumption
- marginally: for instance if we use a logistic cumulative distribution function, then we will have a bivariate logit regression
- in terms of dependence structure: it is possible to consider another copula than the gaussian one, e.g. Gumbel's copula (also called the bivariate logistic copula), or Clayton's
> F=function(x,r){pmnorm(x,rep(0,length(x)),
+ varcov=matrix(c(1,r,r,1),2,2))}
> Fx=function(x1){F(c(x1,1e40),0)}
> Fy=function(x2){Fx(x2)}
>
> logVgen=function(parameter){
+ CORRELATION=parameter[1]
+ BETA=matrix(parameter[2:length(parameter)],ncol(Y),ncol(X))
+ z=cbind(X%*%(BETA[1,]),X%*%(BETA[2,]))
+ a11=F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a11=c(a11,F(z[i,],r=CORRELATION))}
+ a10=Fx(z[1,1])-F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a10=c(a10,Fx(z[i,1])-F(z[i,],r=CORRELATION))}
+ a01=Fy(z[1,2])-F(z[1,],r=CORRELATION)
+ for(i in 2:nrow(z)){a01=c(a01,Fy(z[i,2])-F(z[i,],r=CORRELATION))}
+ a00=1-a10-a01-a11
+ -sum(((Y[,1]==1)&(Y[,2]==1))*log(a11) +
+ ((Y[,1]==0)&(Y[,2]==1))*log(a01) +
+ ((Y[,1]==1)&(Y[,2]==0))*log(a10) +
+ ((Y[,1]==0)&(Y[,2]==0))*log(a00) )
+ }
>
> beta0=c(0,1,1,1,1,1,1)
> (OPT=optim(fn=logVgen,par=beta0,method="BFGS")$par)
[1] 0.52613820 -2.45172059 2.24196154 4.90863292 -4.53994592 0.01600769
[7] 0.10614807
There were 23 warnings (use warnings() to see them)
> library(copula)
> F=function(x,r){pcopula(pnorm(x),
claytonCopula(2, r))}
> Fx=function(x1){F(c(x1,1e40),0)}
> Fy=function(x2){Fx(x2)}
- An application to school tests
Consider the following dataset,
hsb2=read.table("http://freakonometrics.free.fr/hsb2.csv",
header=TRUE, sep=",")
math_male=hsb2$math[female==0]
write_male=hsb2$write[female==0]
math_female=hsb2$math[female==1]
write_female=hsb2$write[female==1]
plot(math_female, write_female, type="p",
pch=19,col="red",xlab="maths",ylab="writing",cex=.8)
points(math_male, write_male, cex=1.2, col="blue")

female :We can try to understand correlation between math and writing skills. Covariates can be the sex of the child, and his reading skills. The question will then be: are good students in maths and writing simply students that can read well ?
0: male
1: female
race :
1: hispanic
2: asian
3: african-amer
4: white
ses :
1: low
2: middle
3: high
schtyp : type of school
1: public
2: private
prog : type of program
1: general
2: academic
3: vocation
read : reading score
write : writing score
math : math score
science : science score
socst : social studies score

> W=hsb2$write>=50
> M=hsb2$math>=50
> base=data.frame(Y1=W,Y2=M,
+ X1=hsb2$female,X2=hsb2$read)
>
> library(Zelig)
> REG= zelig(list(mu1=Y1~X1+X2,
+ mu2=Y2~X1+X2,
+ rho=~1),
+ model="bprobit",data=base)
> summary(REG)
Call:
zelig(formula = list(mu1 = Y1 ~ X1 + X2, mu2 = Y2 ~ X1 + X2,
rho = ~1), model = "bprobit", data = base)
Pearson Residuals:
Min 1Q Median 3Q Max
probit(mu1) -4.7518 -0.502594 0.15038 0.53038 1.8592
probit(mu2) -3.4243 -0.653537 0.23673 0.67011 2.6072
rhobit(rho) -4.9821 0.010481 0.13500 0.40776 2.9171
Coefficients:
Value Std. Error t value
(Intercept):1 -5.484711 0.787101 -6.96825
(Intercept):2 -4.061384 0.633781 -6.40818
(Intercept):3 1.332187 0.322175 4.13497
X1:1 1.125924 0.233550 4.82092
X1:2 0.167258 0.202498 0.82598
X2:1 0.103997 0.014662 7.09286
X2:2 0.082739 0.012026 6.88017
Number of linear predictors: 3
Names of linear predictors: probit(mu1), probit(mu2), rhobit(rho)
Dispersion Parameter for binom2.rho family: 1
Residual Deviance: 364.51 on 593 degrees of freedom
Log-likelihood: -182.255 on 593 degrees of freedom
Number of Iterations: 3
> (exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1)
[1] 0.5824045
> beta0=c((exp(summary(REG)@coef3[3])-1)/(exp(summary(REG)@coef3[3])+1),
+ summary(REG)@coef3[c(1:2,4:7),1])
> beta0
(Intercept):1 (Intercept):2 X1:1 X1:2
0.58240446 -5.48471133 -4.06138412 1.12592427 0.16725842
X2:1 X2:2
0.10399668 0.08273879
> (OPT=optim(fn=logV,par=beta0,method="BFGS")$par)
(Intercept):1 (Intercept):2 X1:1 X1:2
0.5824045 -5.4847113 -4.0613841 1.1259243 0.1672584
X2:1 X2:2
0.1039967 0.0827388
i.e. we obtain (almost) exactly the same estimators. But here I have used as starting values for the optimization procedure the estimators given by R. If we change them, hopefully we have a robust maximum likelihood estimator,
So once again, it is possible to optimize numerically a likelihood function, and it works.






Comments
Bonjour ,
comment vous avez utilise $fit valeurs dans l'estimation non parametrique , pourriez vous explique , car il y a bcp des gens qui lisent votre cours ))) pourquoi vous prenez is.na(vu)==FALSE
voila :
>regloess=loess(dist~speed,cars,span = 0.9)
> u=seq(3,30,by=.1)
>v=predict(regloess,newdata=data.frame(speed=u),se=TRUE)
> vu=c(v$fit+1.96*v$se.fit,rev(v$fit-1.96*v$se.fit))
> polygon(c(u,rev(u))[is.na(vu)==FALSE],vu[is.na(vu)==FALSE],
+ col="light pink",border=FALSE)
> lines(u,v$fit,col="purple",lwd=3)
> lines(u,v$fit+1.96*v$se.fit,col="purple",lty=2)
> lines(u,v$fit-1.96*v$se.fit,col="purple",lty=2)
lien :
http://blogperso.univ-rennes1.fr/ar...
just what I was looking for. thanks