Freakonometrics

To content | To menu | To search

Tag - optimization

Entries feed - Comments feed

Saturday, June 30 2012

Simple and heuristic optimization

This week, at the Rmetrics conference, there has been an interesting discussion about heuristic optimization. The starting point was simple: in complex optimization problems (here we mean with a lot of local maxima, for instance), we do not necessarily need extremely advanced algorithms that do converge extremly fast, if we cannot ensure that they reach the optimum. Converging extremely fast, with a great numerical precision to some point (that is not the point we're looking for) is useless. And some algorithms might be much slower, but at least, it is much more likely to converge to the optimum. Wherever we start from.
We have experienced that with Mathieu, while we were looking for maximum likelihood of our MINAR process: genetic algorithm have performed extremely well. The idea is extremly simple, and natural. Let us consider as a starting point the following algorithm,

  1. Start from some 
  2. At step , draw a point  in a neighborhood of  
  • either  then  
  • or  then 
This is simple (if you do not enter into details about what such a neighborhood should be). But using that kind of algorithm, you might get trapped and attracted to some local optima if the neighborhood is not large enough. An alternative to this technique is the following: it might be interesting to change a bit more, and instead of changing when we have a maximum, we change if we have almost a maximum. Namely at step ,
  • either then  
  • or  then 
for some . To illustrate the idea, consider the following function
> f=function(x,y) { r <- sqrt(x^2+y^2);
+ 1.1^(x+y)*10 * sin(r)/r }
(on some bounded support). Here, by picking noise and  values arbitrary, we have obtained the following scenarios
> x0=15
> MX=matrix(NA,501,2)
> MX[1,]=runif(2,-x0,x0)
> k=.5
> for(s in 2:501){
+  bruit=rnorm(2)
+  X=MX[s-1,]+bruit*3
+  if(X[1]>x0){X[1]=x0}
+  if(X[1]<(-x0)){X[1]=-x0}
+  if(X[2]>x0){X[2]=x0}
+  if(X[2]<(-x0)){X[2]=-x0}
+  if(f(X[1],X[2])+k>f(MX[s-1,1],
+    MX[s-1,2])){MX[s,]=X}
+  if(f(X[1],X[2])+k<=f(MX[s-1,1],
+    MX[s-1,2])){MX[s,]=MX[s-1,]}
+}

 

It does not always converge towards the optimum,

and sometimes, we just missed it after being extremely unlucky

Note that if we run 10,000 scenarios (with different random noises and starting point), in 50% scenarios, we reach the maxima. Or at least, we are next to it, on top.
What if we compare with a standard optimization routine, like Nelder-Mead, or quasi gradient ?Since we look for the maxima on a restricted domain, we can use the following function,

> g=function(x) f(x[1],x[2])
> optim(X0, g,method="L-BFGS-B",
+ lower=-c(x0,x0),upper=c(x0,x0))$par
In that case, if we run the algorithm with 10,000 random starting point, this is where we end, below on the right (while the heuristic technique is on the left),

In only 15% of the scenarios, we have been able to reach the region where the maximum is.
So here, it looks like an heuristic method works extremelly well, if do not need to reach the maxima with a great precision. Which is usually the case actually.

Friday, September 24 2010

EM and mixture estimation, with R (part 2)

Following my previous post on optimization and mixtures (here), Nicolas told me that my idea was probably not the most clever one (there).
So, we get back to our simple mixture model,

http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM01.png
In order to describe how EM algorithm works, assume first that both http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM02.png and  http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM03.png are perfectly known, and the mixture parameter is the only one we care about.
  • The simple model, with only one parameter that is unknown
Here, the likelihood is
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM04.png
so that we write the log likelihood as
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM05.png
which might not be simple to maximize. Recall that the mixture model can interpreted through a latent variate http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM06.png (that cannot be observed), taking value when http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM07.png is drawn from http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM02.png, and 0 if it is drawn from http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM03.png. More generally (especially in the case we want to extend our model to 3, 4, ... mixtures), http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM08.png and http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM09.png.
With that notation, the likelihood becomes
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM10.png
and the log likelihood
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM11.png
the term on the right is useless since we only care about p, here. From here, consider the following iterative procedure,
Assume that the mixture probability http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM13.png is known, denoted http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM12.png. Then I can predict the value of http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM06.png (i.e. http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM08.png and http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM09.png) for all observations,
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM14.png
So I can inject those values into my log likelihood, i.e. in
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM15.png
having maximum (no need to run numerical tools here)
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM16.png
that will be denoted http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM17.png. And I can iterate from here.
Formally, the first step is where we calculate an expected (E) value, where http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM18.png is the best predictor of http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM19.png given my observations (as well as my belief in http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM13.png). Then comes a maximization (M) step, where using http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM06.png, I can estimate probability http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM13.png.
  • A more general framework, all parameters are now unkown
So far, it was simple, since we assumed that http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM02.png and  http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM03.png were perfectly known. Which is not reallistic. An there is not much to change to get a complete algorithm, to estimate http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM30.png. Recall that we had http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM18.png which was the expected value of Z_{1,i}, i.e. it is a probability that observation i has been drawn from http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM02.png.
If http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM18.png, instead of being in the segment http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM31.png was in http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM32.png, then we could have considered mean and standard deviations of observations such that http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM18.png=0, and similarly on the subset of observations such that http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM18.png=1.
But we can't. So what can be done is to consider http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM18.png as the weight we should give to observation i when estimating parameters of http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM02.png, and similarly, 1-http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM18.png would be weights given to observation i when estimating parameters of http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM03.png.
So we set, as before
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM33.png
and then
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM34.png
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM35.png
and for the variance, well, it is a weighted mean again,
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM36.png
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM37.png

and this is it.
  • Let us run the code on the same data as before
Here, the code is rather simple: let us start generating a sample
> X1 = rnorm(n,0,1)
> X20 = rnorm(n,0,1)
> Z  = sample(c(1,2,2),size=n,replace=TRUE)
> X2=4+X20
> X = c(X1[Z==1],X2[Z==2])
then, given a vector of initial values (that I called http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM12.png and then http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM99.png before),
>  s = c(0.5, mean(X)-1, var(X), mean(X)+1, var(X))
I define my function as,
>  em = function(X0,s) {
+  Ep = s[1]*dnorm(X0, s[2], sqrt(s[4]))/(s[1]*dnorm(X0, s[2], sqrt(s[4])) +
+  (1-s[1])*dnorm(X0, s[3], sqrt(s[5])))
+  s[1] = mean(Ep)
+  s[2] = sum(Ep*X0) / sum(Ep)
+  s[3] = sum((1-Ep)*X0) / sum(1-Ep)
+  s[4] = sum(Ep*(X0-s[2])^2) / sum(Ep)
+  s[5] = sum((1-Ep)*(X0-s[3])^2) / sum(1-Ep)
+  return(s)
+  }

Then I get http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM12.png, or http://perso.univ-rennes1.fr/arthur.charpentier/latex/mixEM99.png. So this is it ! We just need to iterate (here I stop after 200 iterations) since we can see that, actually, our algorithm converges quite fast,
> for(i in 2:200){
+ s=em(X,s)
+ }

Let us run the same procedure as before, i.e. I generate samples of size 200, where difference between means can be small (0) or large (4),

Ok, Nicolas, you were right, we're doing much better ! Maybe we should also go for a Gibbs sampling procedure ?... next time, maybe....

Thursday, September 23 2010

Optimization and mixture estimation, with R

Recently, one of my students asked me about optimization routines in R. He told me he that R performed well on the estimation of a time series model with different regimes, while he had trouble with a (simple) GARCH process, and he was wondering if R was good in optimization routines. Actually, I always thought that mixtures (and regimes) was something difficult to estimate, so I was a bit surprised...

Indeed, it reminded me some trouble I experienced once, while I was talking about maximum likelihooh estimation, for non standard distribution, i.e. when optimization had to be done on the log likelihood function. And even when generating nice samples, giving appropriate initial values (actually the true value used in random generation), each time I tried to optimize my log likelihood, it failed. So I decided to play a little bit with standard optimization functions, to see which one performed better when trying to estimate mixture parameter (from a mixture based sample). Here, I generate a mixture of two gaussian distributions, and I would like to see how different the mean should be to have a high probability to estimate properly the parameters of the mixture.
The density is here http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-01.png proportional to
http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-02.png
The true model is http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-03.png, and http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-04.png being a parameter that will change, from 0 to 4.
The log likelihood (actually, I add a minus since most of the optimization functions actually minimize functions) is
> logvraineg <- function(param, obs) {
+ p <- param[1]
+ m1 <- param[2]
+ sd1 <- param[3]
+ m2 <- param[4]
sd2 <- param[5]
+  -sum(log(p * dnorm(x = obs, mean = m1, sd = sd1) + (1 - p) *
+ dnorm(x = obs, mean = m2, sd = sd2)))
+  }
The code to generate my samples is the following,
>X1 = rnorm(n,0,1)
> X20 = rnorm(n,0,1)
> Z  = sample(c(1,2,2),size=n,replace=TRUE)
> X2=m+X20
> X = c(X1[Z==1],X2[Z==2])

Then I use two functions to optimize my log likelihood, with identical intial values,
> O1=nlm(f = logvraineg, p = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)), obs = X)
> logvrainegX <- function(param) {logvraineg(param,X)}
> O2=optim( par = c(.5, mean(X)-sd(X)/5, sd(X), mean(X)+sd(X)/5, sd(X)),
+   fn = logvrainegX)

Actually, since I might have identification problems, I take either http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-05.png or http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-06.png, depending whether http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-07.png or http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-08.png is the smallest parameter.

On the graph above, the x-axis is the difference between means of the mixture (as on the animated grap above). Then, the red point is the median of estimated parameter I have (here http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-05.png), and I have included something that can be interpreted as a confidence interval, i.e. where I have been in 90% of my scenarios: the black vertical segments. Obviously, when the sample is not enough heterogeneous (i.e. http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-09.png and http://perso.univ-rennes1.fr/arthur.charpentier/latex/mix-ml-04.png rather different), I cannot estimate properly my parameters, I might even have a probability that exceed 1 (I did not add any constraint). The blue plain horizontal line is the true value of the parameter, while the blue dotted horizontal line is the initial value of the parameter in the optimization algorithm (I started assuming that the mixture probability was around 0.2).
The graph below is based on the second optimization routine (with identical  starting values, and of course on the same generated samples),

(just to be honest, in many cases, it did not converge, so the loop stopped, and I had to run it again... so finally, my study is based on a bit less than 500 samples (times 15 since I considered several values for the mean of my second underlying distribution), with 200 generated observations from a mixture).
The graph below compares the two (empty circles are the first algorithm, while plain circles the second one),

On average, it is not so bad.... but the probability to be far away from the tru value is not small at all... except when the difference between the two means exceeds 3...
If I change starting values for the optimization algorithm (previously, I assumed that the mixture probability was 1/5, here I start from 1/2), we have the following graph

which look like the previous one, except for small differences between the two underlying distributions (just as if initial values had not impact on the optimization, but it might come from the fact that the surface is nice, and we are not trapped in regions of local minimum).
Thus, I am far from being an expert in optimization routines in R (see here for further information), but so far, it looks like R is not doing so bad... and the two algorithm perform similarly (maybe the first one being a bit closer to the true parameter).