We are currently experiencing an extremely hot month in Montréal (and more generally in North America). Looking at people having a beer, and starting the first barbecue of the year, I was wondering: if we asked people whether global warming is a good or a bad thing, what would they answer? Wearing a T-shirt in Montréal in March is nice, trust me! So how can we study, from a quantitative point of view and depending on the time of the year, what people think of global warming?
A few months ago, I went quickly through Jeffrey Breen's tutorial (online on his blog) on sentiment analysis, where he studied the perception of airline companies in the US based on Twitter messages. His idea was to define a score function, counting the number of positive and negative words in a short sentence mentioning an airline company (Twitter is perfect for such a study). Consider the following score function (the one proposed by Jeffrey),
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  # one score per sentence
  scores = laply(sentences,
    function(sentence, pos.words, neg.words) {
      # remove punctuation, control characters and digits
      sentence = gsub('[[:punct:]]', '', sentence)
      sentence = gsub('[[:cntrl:]]', '', sentence)
      sentence = gsub('\\d+', '', sentence)
      sentence = tolower(sentence)
      # split the sentence into words
      word.list = strsplit(sentence, '\\s+')
      words = unlist(word.list)
      # TRUE where a word belongs to the positive (resp. negative) list
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)
      # score = number of positive words minus number of negative words
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
    }, pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
A list of positive and negative words (in English) can be found on http://www.cs.uic.edu/ (see e.g. the chapter on sentiment analysis).
hu.liu.pos = scan("positive-words.txt", what="character", comment.char=';')
hu.liu.neg = scan('negative-words.txt', what='character', comment.char=';')
The score here is simply the difference between the number of positive words and the number of negative words. For instance, the following sentence has a score of +3,
> score.sentiment("It's awesome I am so happy, thank you all",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3
while the following is -3.
> score.sentiment("I'm desperate, life is a nightmare, I want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3
But one can easily see a big problem with this methodology. What if the sentence includes negations? E.g.
> score.sentiment("I'm no longer desperate, life is not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3
Here the sentence is negative, extremely negative, if we look only at the score. But it should be the opposite. A simple idea is to change (slightly) the function, so that once a negation is found in the sentence, we take the opposite of the score. Hence, we just add, at the end of the inner function (right before return(score)),
if("not"%in%words){score=-score}
Here we obtain
> score.sentiment.neg("I'm no longer desperate, life is not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3
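For completeness, here is a minimal sketch of what the modified function score.sentiment.neg could look like. It is not necessarily the exact code used here: it relies on base R only (vapply instead of plyr's laply), keeps the .progress argument only so that the calls look the same, and is illustrated with two tiny made-up word lists (toy.pos and toy.neg) rather than the actual lexicon.

```r
# Sketch of the modified score function (base R only, .progress is ignored)
score.sentiment.neg = function(sentences, pos.words, neg.words, .progress = 'none') {
  # accept a list of strings as well as a character vector
  sentences = unlist(sentences)
  scores = vapply(sentences, function(sentence) {
    # same cleaning steps as in score.sentiment
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    sentence = tolower(sentence)
    words = unlist(strsplit(sentence, '\\s+'))
    # number of positive words minus number of negative words
    score = sum(words %in% pos.words) - sum(words %in% neg.words)
    # crude negation rule: flip the sign if "not" appears anywhere
    if ("not" %in% words) score = -score
    score
  }, numeric(1), USE.NAMES = FALSE)
  data.frame(score = scores, text = sentences, stringsAsFactors = FALSE)
}

# Tiny hypothetical word lists, for illustration only
toy.pos = c("happy", "awesome")
toy.neg = c("desperate", "nightmare", "die")
score.sentiment.neg("life is not a nightmare", toy.pos, toy.neg)$score
```

Note that the rule only looks for the word "not": since punctuation is removed first, "don't" becomes "dont" and is not detected as a negation, and two negations in the same sentence do not cancel out.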
But does it really work? Let us focus on tweets,
library(twitteR)
Consider the following tweet-extractions, based on two words, a negative word, and the negation of a positive word,
> tweets=searchTwitter('not happy',n=1000)
> NH.text= lapply(tweets, function(t) t$getText() )
> NH.scores = score.sentiment(NH.text,
+ hu.liu.pos,hu.liu.neg)
> tweets=searchTwitter('unhappy',n=1000)
> UH.text= lapply(tweets, function(t) t$getText() )
> UH.scores = score.sentiment(UH.text,
+ hu.liu.pos,hu.liu.neg)
If we draw score density in tweets, we see that scores are quite different,
> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

If now we use the second function, taking the opposite of the score when the sentence contains a negation, we obtain
> UH.scores = score.sentiment.neg(UH.text,
+ hu.liu.pos,hu.liu.neg)
> NH.scores = score.sentiment.neg(NH.text,
+ hu.liu.pos,hu.liu.neg)
> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

So we can admit that our modified function is not that bad.
Let us now focus on tweets in North America. More specifically, we focus on tweets sent within the following region (in blue on the right), in order to avoid problems of hemispheres. Then, we restrict ourselves to specific days, or periods of time. For instance, we can wonder whether snow is associated with positive or negative words: do we enjoy snow when it arrives, by the end of November, and do we start to find it boring by the end of March?
Consider the following search
> w.tweets=searchTwitter("snow",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
> W.text= lapply(w.tweets, function(t) t$getText() )
> W.scores = score.sentiment.neg(W.text,
+ hu.liu.pos,hu.liu.neg, .progress='text')
> M[k]=mean(W.scores$score)
We obtain here the following evolution of the average score, over three years, on Twitter.
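The code above handles a single period k. A minimal sketch of the full loop could look as follows. The actual content of LISTEDATE is not shown here, so the monthly grid below is an assumption, and mean.score.by.period, fetch and scorer are hypothetical names: fetch stands for any function returning the text of the tweets sent between two dates, and scorer for any function returning a data frame with a score column.

```r
# Assumed date grid: monthly breakpoints over three years (not the original LISTEDATE)
LISTEDATE = format(seq(as.Date("2009-01-01"), as.Date("2012-01-01"), by = "month"))

# Average score per period; periods without any tweet are left as NA
mean.score.by.period = function(dates, fetch, scorer) {
  M = rep(NA_real_, length(dates) - 1)
  for (k in seq_len(length(dates) - 1)) {
    texts = fetch(dates[k], dates[k + 1])
    if (length(texts) > 0) M[k] = mean(scorer(texts)$score)
  }
  M
}

# With twitteR, `fetch` could wrap the search used above, e.g.
# fetch = function(from, to) sapply(
#   searchTwitter("snow", since = from, until = to, geocode = "40,-100,2000mi"),
#   function(t) t$getText())
# M = mean.score.by.period(LISTEDATE, fetch,
#   function(x) score.sentiment.neg(x, hu.liu.pos, hu.liu.neg))
```

Passing the search and the scoring function as arguments makes it easy to rerun the same loop with another keyword.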

Let us get back to the point used to introduce this post. To study what people "feel" when they mention global warming, let us run the same code, again in North America,
> w.tweets=searchTwitter("global warming",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
Actually, I was expecting a nice cycle, with positive scores in Spring, and perhaps negative scores during heat waves, in the middle of the Summer...

What we simply observe is that global warming was related to "negative words" on Twitter a few years ago, but we have reached a neutral position nowadays.
And to be honest, I do not really know how to interpret that: is there a problem with the technique I use (obviously, I use here a very simple scoring function, even after adding a minor correction to take negations into account), or is there something going on that can be interpreted?
Yesterday morning, as I was leaving the house, I glanced at the weather forecast (because in Montréal the weather is very changeable, not only from one day to the next, as I mentioned when I arrived in September, [...]
[...] the indicator variable telling whether it is raining or not (1 if it is raining, 0 otherwise) at time [...].
Suppose that there is an underlying latent process, [...] such that [...]
[...] the (joint) cumulative distribution function of the random vector [...] this probability can be written [...], [...] is a white noise, independent of the latent process.
The point is that this specifies the structure of the variance-covariance matrix. In particular, the correlation matrix of the vector [...] is then [...], [...] (which is related to the probability of having rain during one hour, on average) and the correlation [...] (which is related to the dynamics of the process). To simplify, if we take [...]
A few weeks ago (
Nature published last week a series of interesting papers on natural catastrophes (and their relation to human factors).
As people say in Montréal, "
defined as
is the air temperature (in °C), and
the wind speed (in km/h). Please don't ask me how to interpret this power 0.16 (I already find it difficult to explain a square root in an econometric equation). If we look at the past few days, we observe the following,
where the points on top are the temperature, while below we have the felt temperature. So, basically, winters are even colder than what you might think.
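Assuming the index in question is the standard wind chill formula used by Environment Canada (it has the same units and the same power 0.16), it can be sketched in R as follows. The names wind.chill, temp and wind are only illustrative.

```r
# Wind chill index, assuming the standard Environment Canada formula
# temp: air temperature in degrees Celsius, wind: wind speed in km/h
wind.chill = function(temp, wind) {
  13.12 + 0.6215 * temp - 11.37 * wind^0.16 + 0.3965 * temp * wind^0.16
}

# felt temperature for -10 degrees Celsius and a 20 km/h wind
wind.chill(-10, 20)
```

With -10°C and a wind of 20 km/h, the felt temperature is about -18°C.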
, defined here as
denotes a dewpoint (see
Recently, @










Recently, I received comments (

warming of minima is stronger than that of the average temperature and, on the other hand, for maxima (high probabilities on the right), the slope is smaller, but positive, so summers are getting warmer, but not as much as winters.
In the paper on the heat wave in Paris (mentioned 



is the following, with the minima on the left and the maxima on the right,
is
is
For a disaster to be entered into the so-called EM-database, at least one of the following criteria must be fulfilled:
Actually, it is a collective book written by the Copenhagen Business School. Martin L. Weitzman (
As Mark Twain said "













denotes the date the prediction was made and
the date the horizon, or















In Les cafards, by Jo Nesbø, Harry Hole (who comes from Oslo, for those who have not devoured the series) is told at some point that in Bangkok, everyday conversations are rarely about "rain and shine", for the good reason that the climate is rather predictable in Bangkok (on the other hand, people talk about road traffic every day). I had mentioned



















per day.
But in the long run, from one week to the next, the weather is on the contrary highly correlated, unlike in the other regions. In Paris, Marseille or Strasbourg, whether the previous week was sunny or rainy gives no information about the probability of having rain during the week you come on holiday... But not in Brittany: clearly, there are rotten summers, when it may rain every week, and superb summers when it never rains...








