# Freakonometrics

## Tag - Renglish

Monday, April 23 2012

## Talk on quantiles at the R Montreal group

This afternoon, I will be giving a two-hour talk at McGill on quantiles, quantile regressions, confidence regions, bagplots and outliers. Before defining (properly) quantile regressions, we will mention regression on (local) quantiles, as on the graph below, on hurricanes,

In order to illustrate quantile regression, consider the following natality database,

```
base=read.table(
"http://freakonometrics.free.fr/natality2005.txt",
header=TRUE) # header assumed; adjust to the file's actual format
```

We can use it to produce those nice graphs we can find in several papers, modeling the weight of newborns,

```
library(quantreg)
u=seq(.05,.95,by=.01)
coefstd=function(u) summary(rq(WEIGHT~SEX+
SMOKER+WEIGHTGAIN+BIRTHRECORD+AGE+BLACKM+
BLACKF+COLLEGE,data=base,tau=u))$coefficients[,2]
coefest=function(u) summary(rq(WEIGHT~SEX+
SMOKER+WEIGHTGAIN+BIRTHRECORD+AGE+BLACKM+
BLACKF+COLLEGE,data=base,tau=u))$coefficients[,1]
CS=Vectorize(coefstd)(u)
CE=Vectorize(coefest)(u)
```
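To see what `rq` is actually computing, recall that the $\tau$-quantile is the minimizer of the pinball (check) loss. A minimal base-R sketch on simulated data (no extra package; the names here are mine, for illustration only):

```r
# the tau-quantile as the minimizer of the pinball (check) loss, base R only
pinball=function(q,z,tau){ sum(pmax(tau*(z-q),(tau-1)*(z-q))) }
set.seed(1)
z=rnorm(1000)
tau=.75
qhat=optimize(pinball,range(z),z=z,tau=tau)$minimum
c(qhat,quantile(z,tau))  # the two values should be very close
```

Running `rq(z~1,tau=tau)` from the quantreg package should return (essentially) the same intercept, since a quantile regression with only a constant is just an empirical quantile.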
The slides can be downloaded on the blog, as well as the R-code.

Tuesday, March 27 2012

## Ruin probability and infinite time

A couple of weeks ago, I had a discussion with a practitioner, working in some financial company, about ruin, and infinite time. And it reminded me of a weird result. Well, not a weird result, but a result I found disturbing, at first, when I was a student (and that I rediscovered with the eyes of someone dealing with computational issues, seeing here a difficult theoretical question). Consider a simple ruin problem. A player has wealth $x$. Then he flips a coin: tails he has a gain of 1, heads he experiences a loss of 1. At time $n$, his wealth is $S_n=x+X_1+\cdots+X_n$ where $X_i$ is associated to the $i$th coin: $X_i$ is equal to $+1$ with probability $p$ (tails), and $-1$ with probability $1-p$ (heads). It is also possible to write $S_n=x+M_n$

where $M_n=X_1+\cdots+X_n$ can be interpreted as the net gain of the player. In order to get a good understanding of the results that can be obtained, assume $S_n=y$ to be given. Let $T$ denote the number of tails and $H$ the number of heads. Then $T+H=n$, while $T-H=y-x$. Let $N_n(x,y)$ denote the number of paths to go from point A (wealth $x$ at time $0$) to point B (wealth $y$ at time $n$). Note that this is a Markovian problem, that can be modeled using Markov chains

But here, we will focus on combinatorial results. Hence, since $T=(n+y-x)/2$,

$$N_n(x,y)=\binom{n}{(n+y-x)/2}$$

In order to derive probabilities to reach $y$, let $N_n(x,y)$ denote the number of paths going from $x$ to $y$. And let $N_n^0(x,y)$ denote the number of paths going from $x$ to $y$ that do reach $0$ at some point between $0$ and $n$. Using a simple reflection property, then if $x$ and $y$ are positive,

$$N_n^0(x,y)=N_n(-x,y)$$

Based on those reflections, two results can be derived (focusing on probability, instead of counting paths). First, we can obtain that

$$\mathbb{P}(S_n=0)=\binom{n}{(n+x)/2}(1-p)^{(n+x)/2}p^{(n-x)/2}$$

(given that $n$ and $x$ have the same parity). The second result we can obtain is that, among the paths going from $x$ to $0$ in $n$ steps, the proportion of those hitting $0$ for the first time at time $n$ is $x/n$ (a ballot-type result).

Based on those two expressions, if $\tau_x=\inf\{n\geq 1: S_n=0\}$ denotes the first time the wealth becomes null, given $S_0=x$,

then

$$\mathbb{P}(\tau_x=n)=\frac{x}{n}\binom{n}{(n+x)/2}(1-p)^{(n+x)/2}p^{(n-x)/2}$$

This can be computed easily,

```
> x=10
> p=.55
> ProbN=function(n){
+ pb=0
+ if(abs(n-x) %% 2 == 0)
+ pb=x/n*choose(n,(n+x)/2)*(1-p)^((n+x)/2)*(p)^((n-x)/2)
+ return(pb)}
> plot(Vectorize(ProbN)(1:1000),type="s")
```

That looks nice... But if we look closer, we can wonder what

$$\sum_{n\geq 1}\mathbb{P}(\tau_x=n)$$

would be? Since we have the distribution of a probability measure, we might expect to get one. But here

```
> sum(Vectorize(ProbN)(1:1000))
[1] 0.134385
```

And it is not due to calculation mistakes that we do not get 1 here. Actually, we should write

$$\sum_{n\geq 1}\mathbb{P}(\tau_x=n)=\mathbb{P}(\tau_x<\infty)$$

which might be interpreted as the probability of ruin, starting from $x$, that we denote $\psi(x)$ from now on. The term on the left can be approximated using Monte Carlo simulations

```
> p=.55
> x=10
> m=1000
> simul=10000
> S=sample(c(-1,1),size=m*simul,replace=TRUE,prob=c(1-p,p))
> MS=matrix(S,simul,m)
> for(k in 2:m) MS[,k]=MS[,k]+MS[,k-1]
> T0=function(vm) which(vm<=(-x))[1]
> MTmin=apply(MS,1,T0)
> mean(is.na(MTmin)==FALSE)
[1] 0.1328
```

To check the validity of the relationship above, a simple (theoretical) recursive formula can be derived for the term on the right (the ruin probability), namely

$$\psi(x)=p\,\psi(x+1)+(1-p)\,\psi(x-1)$$

with boundary conditions $\psi(0)=1$ and $\lim_{x\to\infty}\psi(x)=0$. Then it comes that

$$\psi(x)=\left(\frac{1-p}{p}\right)^x$$

Note that it might be tricky to check using Monte Carlo simulations... since we cannot have an infinite number of runs. And we are dealing precisely with things that do occur when time is infinite. Actually, we can still check convergence, considering an upper limit $m$ for the number of runs, and then letting $m$ go to infinity. Note that an explicit formula can then be derived (using the additional border condition $\psi(m)=0$)

$$\psi(x)=\frac{\left(\frac{1-p}{p}\right)^x-\left(\frac{1-p}{p}\right)^m}{1-\left(\frac{1-p}{p}\right)^m}$$
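Before turning to simulations, here is a quick numerical sketch: we can check both that $\psi(x)=((1-p)/p)^x$ satisfies the recursion, and that the truncated sum of $\mathbb{P}(\tau_x=n)$ converges to it (working on the log scale to push the sum safely beyond $n=1000$, where `choose` starts to overflow):

```r
p <- .55; x <- 10
r <- (1 - p) / p
psi <- r^x                               # candidate ruin probability

# the recursion psi(x) = p*psi(x+1) + (1-p)*psi(x-1) holds
p * r^(x + 1) + (1 - p) * r^(x - 1) - psi   # numerically zero

# log-scale version of P(tau_x = n), to avoid overflow for large n
lprob <- function(n) ifelse((n - x) %% 2 == 0,
  log(x / n) + lchoose(n, (n + x) / 2) +
    ((n + x) / 2) * log(1 - p) + ((n - x) / 2) * log(p), -Inf)
sum(exp(lprob(1:5000)))                  # close to psi
```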

Using the following code, it is possible to calculate the ruin probability, in order to estimate $\psi(x)$.

```
> MSmin=apply(MS,1,min)
> mean(MSmin<=(-x))
[1] 0.1328
> (((1-p)/p)^x-((1-p)/p)^m)/(1-((1-p)/p)^m)
[1] 0.1344306
```

The following graph shows the evolution of ruin probability as a function of initial wealth (with monte carlo simulation, with a fixed horizon - including a confidence interval - versus the analytical expression)

Hence, with stopping times, one should remember that

$$\sum_{n\geq 1}\mathbb{P}(\tau_x=n)=\mathbb{P}(\tau_x<\infty)\neq 1$$

and that those two terms can be approximated simply using simulations or standard approximations.

Thursday, March 22 2012

## Do we appreciate sunbathing in Spring ?

We are currently experiencing an extremely hot month in Montréal (and more generally in North America). Looking at people having a beer, and starting the first barbecue of the year, I was wondering: if we asked people whether global warming was a good or a bad thing, what do you think they would answer? Wearing a T-shirt in Montréal in March is nice, trust me! So how can we study, from a quantitative point of view, what people think of global warming, depending on the time of the year?

A few months ago, I went quickly through the following sentiment score function,

```
score.sentiment = function(sentences, pos.words,
neg.words, .progress='none')
{
require(plyr)
require(stringr)
scores = laply(sentences, function(sentence, pos.words, neg.words) {
# clean the sentence: remove punctuation, control characters and digits
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
sentence = tolower(sentence)
# split into words
word.list = strsplit(sentence, '\\s+')
words = unlist(word.list)
# score = number of positive words minus number of negative words
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}
```

```
hu.liu.pos = scan("positive-words.txt", what="character",
comment.char=';')
hu.liu.neg = scan("negative-words.txt", what="character",
comment.char=';')
```

```
> score.sentiment("It's awesome I am so happy,
thank you all",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3
```

```
> score.sentiment("I'm desperate, life is a nightmare,
I want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3
```

But one can easily see a big problem with this methodology. What if the sentence includes negations? E.g.

```
> score.sentiment("I'm no longer desperate, life is
not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3
```

Here the sentence is negative, extremely negative, if we look only at the score. But it should be the opposite. A simple idea is to change (slightly) the function, so that once a negation is found in the sentence, we take the opposite of the score. Hence, we just add at the end of the function

`if("not"%in%words){score=-score}`

Here we obtain

```
> score.sentiment.neg("I'm no longer desperate,
life is not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3
```
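As a side note, the negation-aware scorer can be written as a small self-contained function (a sketch, base R only, no plyr or stringr; the word lists below are tiny stand-ins of my own for Hu & Liu's opinion lexicon, just for illustration):

```r
# minimal negation-aware scorer: count positive minus negative words,
# and flip the sign when "not" appears in the sentence
score.neg=function(sentence,pos.words,neg.words){
  sentence=tolower(gsub("[[:punct:][:cntrl:]]|[0-9]+","",sentence))
  words=unlist(strsplit(sentence,"\\s+"))
  score=sum(words %in% pos.words)-sum(words %in% neg.words)
  if("not" %in% words) score=-score   # the crude negation flip
  score}
pos=c("happy","awesome","nice")       # stand-in positive list
neg=c("desperate","nightmare","die")  # stand-in negative list
score.neg("life is not a nightmare",pos,neg)  # 1, not -1
```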

But does it really work? Let us focus on tweets,

`library(twitteR)`

Consider the following tweet-extractions, based on two words, a negative word, and the negation of a positive word,

```
> tweets=searchTwitter('not happy',n=1000)
> NH.text= lapply(tweets, function(t) t$getText() )
> NH.scores = score.sentiment(NH.text,
+ hu.liu.pos,hu.liu.neg)

> tweets=searchTwitter('unhappy',n=1000)
> UH.text= lapply(tweets, function(t) t$getText() )
> UH.scores = score.sentiment(UH.text,
+ hu.liu.pos,hu.liu.neg)
```

```
> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")
```

```
> UH.scores = score.sentiment.neg(UH.text,
+ hu.liu.pos,hu.liu.neg)
> NH.scores = score.sentiment.neg(NH.text,
+ hu.liu.pos,hu.liu.neg)
> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")
```

Now, consider tweets mentioning "snow" over successive periods of time (`LISTEDATE` being a vector of dates), in North America,

```
> w.tweets=searchTwitter("snow",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
> W.text= lapply(w.tweets, function(t) t$getText() )
> W.scores = score.sentiment.neg(W.text,
+ hu.liu.pos,hu.liu.neg, .progress='text')
> M[k]=mean(W.scores$score)
```
We obtain here the following score function, over three years, on Twitter,

Well, we have to admit that the pattern is not that obvious. There might be a small (local) bump in the winter, but it is not obvious...

Let us get back to the point used to introduce this post. If we study what people "feel" when they mention global warming, let us run the same code, again in North America
```
> w.tweets=searchTwitter("global warming",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
```
Actually, I was expecting a nice cycle, with positive scores in Spring, and perhaps negative scores during heat waves, in the middle of the Summer...

What we simply observe is that global warming was related to "negative words" on Twitter a few years ago, but we have reached a neutral position nowadays.

And to be honest, I do not really know how to interpret that: is there a problem with the technique I use (obviously, I use here a very simple scoring function, even after integrating a minor correction to take negations into consideration), or is there something going on that can be interpreted?

Tuesday, December 20 2011

## Basics on Markov Chain (for parents)

Markov chains are a very interesting and powerful tool. Especially for parents. Because if you think about it quickly, most of the games our kids play are Markovian. For instance, snakes and ladders...

It is extremely easy to write down the transition matrix, one just needs to define all the snakes and ladders. For the board above, we have,
```
n=100
M=matrix(0,n+1,n+1+6)
rownames(M)=0:n
colnames(M)=0:(n+6)
# from each cell, the die moves the pawn 1 to 6 cells forward, each with probability 1/6
for(i in 1:6){diag(M[,(i+1):(i+1+n)])=1/6}
# any move beyond cell 100 is collapsed into cell 100
M[,n+1]=apply(M[,(n+1):(n+1+6)],1,sum)
M=M[,1:(n+1)]
# snakes and ladders: landing on a 'starting' cell sends the pawn to the 'ending' one
starting=c(4,9,17,20,28,40,51,54,62,
64,63,71,93,95,92)
ending  =c(14,31,7,38,84,59,67,34,19,
60,81,91,73,75,78)
for(i in 1:length(starting)){
v=M[,starting[i]+1]
ind=which(v>0)
M[ind,starting[i]+1]=0
M[ind,ending[i]+1]=M[ind,ending[i]+1]+v[ind]}
```
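A quick sanity check on the construction: every row of a transition matrix must sum to one. A sketch on the bare board (rebuilt here as `M0`, without the snakes-and-ladders step, so as not to overwrite `M`; the relocation step only moves probability mass around, so it preserves row sums anyway):

```r
# rebuild the bare-board transition matrix and check it is row-stochastic
n=100
M0=matrix(0,n+1,n+1+6)
for(i in 1:6){ diag(M0[,(i+1):(i+1+n)])=1/6 }   # die outcome i, prob 1/6
M0[,n+1]=rowSums(M0[,(n+1):(n+1+6)])            # collapse overshoots into cell 100
M0=M0[,1:(n+1)]
range(rowSums(M0))   # both values should be (numerically) equal to 1
```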
So, why is it important to have a Markov chain? Because, once you've noticed that you had a Markov chain game, you can derive anything you want. For instance, you can get the distribution after some turns,
```
powermat=function(P,h){
Ph=P
if(h>1){
for(k in 2:h){
Ph=Ph%*%P}}
return(Ph)}
initial=c(1,rep(0,n))
COLOR=rev(heat.colors(101))
u=1:sqrt(n)
boxes=data.frame(
index=1:n,
ord=rep(u,each=sqrt(n)),
abs=rep(c(u,rev(u)),sqrt(n)/2))
position=function(h=1){
D=initial%*%powermat(M,h)
plot(0:10,0:10,col="white",axes=FALSE,
xlab="",ylab="",main=paste("Position after",h,"turns"))
segments(0:10,rep(0,11),0:10,rep(10,11))
segments(rep(0,11),0:10,rep(10,11),0:10)
for(i in 1:n){
polygon(boxes$abs[i]-c(0,0,1,1),
boxes$ord[i]-c(0,1,1,0),
col=COLOR[min(1+trunc(500*D[i+1]),101)],
border=NA)}
text(boxes$abs-.5,boxes$ord-.5,
boxes$index,cex=.7)
segments(c(0,10),rep(0,2),c(0,10),rep(10,2))
segments(rep(0,2),c(0,10),rep(10,2),c(0,10))}
```
Here, we have the following (note that I assume that once 100 is reached, the game is over)

Assume for instance, that after 10 turns, your daughter accidentally drops her pawn out of the game. Here is the theoretical (unconditional) position of her pawn after 10 turns,

so, if she claims she was either on 58, 59 or 60, here are the theoretical probabilities to be in each cell after 10 turns,
```
> h=10
> (initial%*%powermat(M,h))[59:61]/
+ sum((initial%*%powermat(M,h))[59:61])
[1] 0.1597003 0.5168209 0.3234788
```
i.e. it is more likely she was on 59 (the 60th entry of the vector, since counting starts at 0). You can also look at the distribution of the number of turns (at first, with only one player).
```
distrib=initial%*%M
game=rep(NA,1000)
for(h in 1:length(game)){
game[h]=distrib[n+1]
distrib=distrib%*%M}
plot(1-game[1:200],type="l",lwd=2,col="red",
ylab="Probability to be still playing")
```

Once you have that survival distribution, you have the expected number of turns to finish the game,
```
> sum(1-game)
[1] 32.16499
```
i.e. in 33 turns, on average, your daughter reaches the 100 cell. But in 50% of the games, it takes less than 29 turns,
```
> max(which(1-game>.5))
[1] 29
```
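Those expected durations can also be obtained without truncating the series: for an absorbing Markov chain, the expected absorption times solve $(I-Q)\,t=\mathbf{1}$, where $Q$ is the transition matrix restricted to the transient states. A sketch on the bare board (no snakes or ladders here, so the number differs from the one above):

```r
# expected number of turns via the fundamental matrix (I-Q)^{-1},
# on the bare board (no snakes, no ladders), as a sketch
n=100
M0=matrix(0,n+1,n+1+6)
for(i in 1:6){ diag(M0[,(i+1):(i+1+n)])=1/6 }
M0[,n+1]=rowSums(M0[,(n+1):(n+1+6)])  # overshoots collapsed into cell 100
M0=M0[,1:(n+1)]
Q=M0[1:n,1:n]                 # transitions between transient states 0..99
ET=solve(diag(n)-Q,rep(1,n))  # expected absorption times, one per state
ET[1]                         # expected number of turns from the start
```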
But assuming that you are playing with your daughter, and that the game is over once one player reaches the 100 cell, it is possible to get the survival distribution of the first time one of you reaches that cell,
```
plot((1-game[1:200])^2,type="l",lwd=2,col="blue",
ylab="Probability to be still playing (2 players)")
```

Here, the expected number of turns before ending the game is
```
> sum((1-game)^2)
[1] 23.40439
```
And if you ask your son to join the game, the survival distribution function is
```
plot((1-game[1:200])^3,type="l",lwd=2,col="purple",
ylab="Probability to be still playing (3 players)")
```

i.e. the expected number of turns before the end is now
```
> sum((1-game)^3)
[1] 20.02098
```

Tuesday, October 18 2011

## Short selling, volatility and bubbles

Yesterday, I wrote a post (in French) about short-selling in financial markets, since some journalists claimed that it was well known that short-selling does increase volatility on financial markets. Not only in French-speaking newspapers actually, since we can read on http://www.forbes.com that « in a market with restrictions on short-selling, volatility is reduced ». But things are not that simple. For instance, http://www.optionsatoz.com/ explains it from a theoretical point of view. But we can also look at the data. For instance, we can compare the stock price of Air China, traded in Shanghai, in blue (where short-selling is forbidden) and in Hong Kong, in red (where short-selling is allowed), since @Igor gave me the tickers of those stocks

```
library(tseries)
X<-get.hist.quote("0753.HK")
Y<-get.hist.quote("601111.SS")
plot(Y[,4],col="blue",ylim=c(0,30))
lines(X[,4],col="red")
```

But as @alea_ pointed out, one asset is quoted here in yuan renminbi, and the other one in HK dollars. So I downloaded the exchange rate from http://www.oanda.com/

```
Z=read.table("http://freakonometrics.blog.free.fr/public/data/change-cny-hkd.csv",
header=TRUE,sep=";",dec=",")
D=as.Date(as.character(Z$date),"%d/%m/%y")
z=as.numeric(Z$CNY.HKD)
plot(D,z,type="l")
X2=X[,4]
for(t in 1:length(X2)){
X2[t]=X2[t]*z[D==time(X2[t])]}
plot(Y[,4],col="blue",ylim=c(0,30))
lines(X2,col="red")
```

Now both stocks are expressed in the same currency. To compare returns volatility, a first idea can be to use GARCH models,

```
RX=diff(log(X2))
RY=diff(log(Y[,4]))
Xgarch = garch(as.numeric(RX))
SIGMAX=predict(Xgarch)
Ygarch = garch(as.numeric(RY))
SIGMAY=predict(Ygarch)
plot(time(Y)[-1],SIGMAY[,1],col="blue",type="l")
lines(time(X2)[-1],SIGMAX[,1],col="red")
```

But volatility is here too erratic. So an alternative can be to use exponentially-weighted moving averages, where simple recursive relationships are considered,

$$\sigma_t^2=(1-\lambda)(x_t-\bar{x}_t)^2+\lambda\,\sigma_{t-1}^2$$

or equivalently

$$\sigma_t^2=(1-\lambda)\sum_{k\geq 0}\lambda^k(x_{t-k}-\bar{x}_t)^2$$

The code is not great, but it is easy to understand,
```
moy.ew=function(x,r){
m=rep(NA,length(x))
for(i in 1:length(x)){
m[i]=weighted.mean(x[1:i],
rev(r^(0:(i-1))))}
return(m)}

sd.ew=function(x,r,m){
sd=rep(NA,length(x))
for(i in 1:length(x)){
sd[i]=weighted.mean((x[1:i]-m[i])^2,
rev(r^(0:(i-1))))}
return(sd)}

q=.97
MX=moy.ew(RX,q)
SX=sd.ew(RX,q,MX)
MY=moy.ew(RY,q)
SY=sd.ew(RY,q,MY)
plot(time(Y)[-1],SY,col="blue",type="l")
lines(time(X2)[-1],SX,col="red")
```
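As a side note, the explicit weighted sums above can also be computed in a single pass, through the recursive form. A sketch for the exponentially-weighted mean (the function name is mine; it keeps a running weighted sum and a running total weight, with weights $r^{i-j}$ as in `moy.ew`):

```r
# one-pass recursion for the exponentially-weighted mean
ew.mean.rec=function(x,r){
  m=numeric(length(x))
  s=0; w=0
  for(i in seq_along(x)){
    s=r*s+x[i]   # running weighted sum of observations
    w=r*w+1      # running sum of weights
    m[i]=s/w}
  return(m)}
set.seed(1)
x=rnorm(50)
ew.mean.rec(x,.97)[50]  # matches the weighted.mean-based version
```

This runs in linear time, instead of the quadratic cost of recomputing `weighted.mean` at every date.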

And now we have something less erratic, so we can focus now on the interpretation.
It is also possible to look at the difference between those two series of volatility: areas in blue are periods where, in Shanghai (again, where short-selling is forbidden), returns are more volatile than in Hong Kong, and areas in red are periods where returns are more volatile in Hong Kong,
```
a=time(X2)[which(time(X2)%in%time(Y))]
b=SY[which(time(Y)%in%time(X2))]-
SX[which(time(X2)%in%time(Y))]
n=length(a)
a=a[-n];b=b[-n]
plot(a,b,col="black",type="l")
polygon(c(a,rev(a)),c(pmax(b,0),rep(0,length(a))),
col="blue",border=NA)
polygon(c(a,rev(a)),c(pmin(b,0),rep(0,length(a))),
col="red",border=NA)
```

So, clearly, nothing definitive can be said... Sometimes, volatility is higher in Hong Kong, and sometimes, it is higher in Shanghai. But if we look at prices, instead of looking at volatility,
```
a=time(X2)[which(time(X2)%in%time(Y))]
b=as.numeric(Y[which(time(Y)%in%time(X2)),4])-
as.numeric(X2[which(time(X2)%in%time(Y))])
n=length(a)
a=a[-n];b=b[-n]
plot(a,b,col="black",type="l")
polygon(c(a,rev(a)),c(pmax(b,0),rep(0,length(a))),
col="blue",border=NA)
polygon(c(a,rev(a)),c(pmin(b,0),rep(0,length(a))),
col="red",border=NA)
```

Here, it looks like bans on short-selling create bubbles. Which might not be a good thing.