Freakonometrics

Tag - Twitter

Thursday, July 12 2012

Somewhere else, part 2

A series of posts (on blogs) elsewhere,


or in French,

Wednesday, June 27 2012

Birds of a feather follow each other (on Twitter at least)

A new post, to pick up a fun analysis by @3wen (Ewen Gallic, who works in Montréal while I am enjoying Switzerland). Following the amusing analysis of Twitter trolls, I had floated the idea that it could be fun to look, among French members of parliament (whom I had followed a bit the other day), at who tweets with whom. Ewen had the good idea of looking at http://lelab.europe1.fr/, which allowed him to retrieve the list of the Twitter accounts of MPs in France. The idea is simple: among French MPs, we look at who follows whom. Someone who is widely followed will be at the centre of the cloud, while someone who merely follows others will be on the edge (intensely connected to the rest of the cloud). For those unfamiliar with it, Twitter is not Facebook: we do not have "friends", there are people we follow because they say things that may interest us.

Using gephi, Ewen was then able to visualize the graph of interconnections between MPs.

Unsurprisingly, left-wing MPs mostly follow left-wing MPs, and vice versa. A few big accounts bridge the two parliamentary groups.
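To make the "who follows whom" idea concrete, here is a minimal sketch in base R. The accounts, the groups and the edge list are all made up for illustration (they are not Ewen's data): each row of the edge list says that `from` follows `to`, the in-degree counts followers inside the network, and the last line measures how often a follow link stays within the same group.

```r
# Hypothetical edge list: each row means that `from` follows `to`
edges <- data.frame(
  from = c("A1", "A2", "A3", "B1", "B2", "B3", "A1", "B1"),
  to   = c("A2", "A1", "A1", "B2", "B1", "B1", "B1", "A1"),
  stringsAsFactors = FALSE
)

# Hypothetical group of each account
group <- c(A1 = "left", A2 = "left", A3 = "left",
           B1 = "right", B2 = "right", B3 = "right")

# In-degree: how many accounts of the network follow each account
in_degree <- sort(table(edges$to), decreasing = TRUE)

# Out-degree: how many accounts of the network each account follows
out_degree <- table(edges$from)

# Share of follow links that stay within the same group
same_group <- group[edges$from] == group[edges$to]
share_within <- mean(same_group)

print(in_degree)
print(share_within)
```

Accounts with a large in-degree are the ones that end up in the middle of the cloud, and a share close to 1 corresponds to the two well separated blocks seen on the graph.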

In fact, if we look in detail (or even at the full image file), we can see a bit better what is going on,

Now, the main difficulty is to read these interactions correctly. In particular, we cannot conclude (from the graph alone) that Cécile Duflot is left-wing! What it tells us is that what she says interests people on the left (or at least the MPs of the Parti Socialiste), and not at all people on the right (the MPs of the UMP)! Note also that the centre does not exist. On Twitter at least. And if we look at the very bottom, on the right, we find the Front National, and we get confirmation that what the Front National says interests no one!

Monday, June 11 2012

Do you still have time to sleep ?

Last week, @3wen (Ewen) helped me write some nice R functions to extract tweets in R and build datasets containing a lot of information. I had tried a couple of times on my own: once on tweet contents, which was not convincing, and once on Twitter activity following an event (e.g. the death of someone famous). I have to admit that I am not a big fan of the databases that can be generated with the standard functions for studying tweets. For instance, we can only extract tweets, not re-tweets (which are also an important indicator of tweet activity). @3wen suggested using
require("RJSONIO") 
The first step is to extract some information from a tweet, and store it in a dataset (details can be found on https://dev.twitter.com/)
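obtenir_ligne <- function(unTweet){
  # Fields of the tweet itself and of its author
  date_courante=unTweet$created_at
  id_courant=unTweet$id_str
  text=unTweet$text
  nb_followers=unTweet$user$followers_count
  nb_amis=unTweet$user$friends_count
  utc_offset=unTweet$user$utc_offset
  # If the offset is missing (NULL), keep a placeholder
  # so that every row has six fields
  if(is.null(utc_offset)) utc_offset=NA
  # Accounts mentioned in the tweet (possibly none)
  listeMentions=unTweet$entities$user_mentions
  # Element 1: one row of tweet data
  # Element 2: one (id, screen_name) row per mention
  return(c(list(c(id_courant,date_courante,text,
    nb_followers,nb_amis,utc_offset)),
    list(do.call("rbind",lapply(listeMentions,
      function(x,id_courant) c(id_courant,
        x$screen_name),unTweet$id_str)))))
}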
Now that we have the code to extract information from one tweet, let us find several tweets from one user, say my account,
nom="Freakonometrics"
The (small) problem here is a limitation: we can only get 100 tweets per call of the function,
n=100
# Build the URL from separate pieces, so that no line break
# ends up inside the address
url_courante=paste(
"http://api.twitter.com/1/statuses/user_timeline.json?",
"include_entities=true&include_rts=true&screen_name=",
nom,"&count=",n,sep="")
tweets_courants=scan(url_courante,what = "character",
encoding="latin1")
# scan() returns the answer in pieces: glue them back together
tweets_courants=paste(tweets_courants,collapse=" ")
tweets_courants=fromJSON(tweets_courants,
method = "C")
Then, we use our function to build a database with 100 lines,
extracTweets <- lapply(tweets_courants,
obtenir_ligne)
mentions=do.call("rbind",lapply(extracTweets,
function(x) x[[2]]))
colnames(mentions)=list("id","screen_name")
res=t(sapply(extracTweets,function(x) x[[1]]))
colnames(res) <- list("id","date","text",
"nb_followers","nb_amis","utc_offset")
The idea then is simply to use a loop, based on the latest id observed
dernier_id=tweets_courants[[length(
tweets_courants)]]$id_str
So, here we go,
compteurLimite=100
 
while(compteurLimite<4100){
  # Same request as before, now bounded above by the last id seen
  url_courante=paste(
  "http://api.twitter.com/1/statuses/user_timeline.json?",
  "include_entities=true&include_rts=true&screen_name=",
  nom,"&count=",n,"&max_id=",dernier_id,sep="")
  tweets_courants=scan(url_courante,
  what = "character", encoding="latin1")
  tweets_courants=paste(tweets_courants,collapse=" ")
  tweets_courants=fromJSON(tweets_courants,
  method = "C")

  # max_id is inclusive: the first tweet of the page is the one
  # we already have, so we start at the second one
  extracTweets <- lapply(tweets_courants[
  2:length(tweets_courants)],obtenir_ligne)
  mentions=rbind(mentions,do.call("rbind",
  lapply(extracTweets,function(x) x[[2]])))
  res=rbind(res,t(sapply(extracTweets,function(x) x[[1]])))
  dernier_id=tweets_courants[[length(
  tweets_courants)]]$id_str
  compteurLimite=compteurLimite+100
}
 
resFreakonometrics=res=
data.frame(res,stringsAsFactors=FALSE)
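The paging logic of the loop above can be tried offline. The sketch below is only an illustration: `fetch_page` is a hypothetical stand-in for the call to the Twitter API, and the "timeline" is just a vector of made-up ids, newest first. It shows why each new page is requested with the last id seen, and why its first element is dropped.

```r
# Made-up timeline: 250 tweet ids, newest first
all_ids <- 250:1

# Hypothetical stand-in for the API call: returns at most `count` ids,
# all of them smaller than or equal to `max_id` (inclusive bound)
fetch_page <- function(max_id = NULL, count = 100) {
  ids <- all_ids
  if (!is.null(max_id)) ids <- ids[ids <= max_id]
  head(ids, count)
}

# First page, without any bound
collected <- fetch_page()
last_id <- tail(collected, 1)

repeat {
  page <- fetch_page(max_id = last_id)
  # The first element is the tweet we already have: drop it
  page <- page[-1]
  if (length(page) == 0) break
  collected <- c(collected, page)
  last_id <- tail(page, 1)
}

length(collected)
```

Stopping when a page brings nothing new is an alternative to the fixed counter used in the post, and avoids requesting pages that no longer exist.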
All the information about my own tweets (and re-tweets) is stored in a nice dataset. Actually, we have even more, since we have also extracted the names of people mentioned in tweets,
mentionsFreakonometrics=
data.frame(mentions)
We can look at people I mention in my tweets
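# Number of mentions per account (mentions is a matrix, so the
# screen names have to be taken from the data frame)
gazouillis=sapply(split(mentionsFreakonometrics,
mentionsFreakonometrics$screen_name),nrow)
gazouillis=gazouillis[order(gazouillis,
decreasing=TRUE)]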
 
plot(gazouillis)
plot(gazouillis,log="xy")
> gazouillis[1:20]
        tomroud freakonometrics       adelaigue       dmonniaux
            155              84              77              56
    J_P_Boucher         embruns      SkyZeLimit        coulmont
             42              39              35              31
     Fabrice_BM            3wen          obouba          msotod
             31              30              29              27
         StatFr     nholzschuch        renaudjf        squintar
             26              25              23              23
        Vicnent        pareto35        romainqc        valatini
             23              22              22              22
If we plot those frequencies, we can clearly observe a standard Pareto distribution,
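One rough way to look at the Pareto claim is a rank-frequency regression on the log-log scale. The sketch below uses only the twenty counts printed above (the variable names are mine): a roughly straight line with a negative slope is what a power law would give, although twenty points are far from a formal test.

```r
# The twenty largest mention counts, as printed above
counts <- c(155, 84, 77, 56, 42, 39, 35, 31, 31, 30,
            29, 27, 26, 25, 23, 23, 23, 22, 22, 22)
rang <- seq_along(counts)

# Linear fit of log(frequency) on log(rank)
fit <- lm(log(counts) ~ log(rang))
pente <- unname(coef(fit)[2])

# Estimated slope and quality of the linear fit
print(pente)
print(summary(fit)$r.squared)
```

On the full vector, `plot(gazouillis, log="xy")` followed by `abline` on such a fit would give the visual version of the same check.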

Now, let us spend some time with the dates and times of tweets (it was the initial goal of this post)... One more time, there is a (small) technical problem that we have to deal with: language. We need a function to convert dates in English (on Twitter) to dates in French (since I have a French version of R),
changer_date_anglais <- function(date_courante){
  # French abbreviations: they have to match what the
  # local (French) version of R expects for %b and %a
  mois <- c("Jan","Fév", "Mar", "Avr", "Mai",
  "Jui", "Jul", "Aoû", "Sep", "Oct", "Nov", "Déc")
  months <- c("Jan", "Feb", "Mar", "Apr", "May",
  "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  jours <- c("Lun","Mar","Mer","Jeu",
  "Ven","Sam","Dim")
  days <- c("Mon","Tue","Wed","Thu",
  "Fri","Sat","Sun")
  # Characters 1-3 are the day, characters 5-7 the month
  leJour <- substr(date_courante,1,3)
  leMois <- substr(date_courante,5,7)
  # Translated day and month, followed by the rest of the date
  return(paste(jours[match(leJour,days)]," ",
  mois[match(leMois,months)],substr(
  date_courante,8,nchar(date_courante)),sep=""))
}
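An alternative that avoids translating the dates, sketched here under the assumption that the date strings look like "Mon Jun 11 14:30:00 +0000 2012", is to switch the time locale to "C" while parsing, and to restore it afterwards. The function name `parse_twitter_date` is mine, not part of any package.

```r
# Parse an English Twitter-style date whatever the local language is
parse_twitter_date <- function(x) {
  # Remember the current time locale and restore it on exit
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old))
  # In the "C" locale, %a and %b expect English abbreviations
  Sys.setlocale("LC_TIME", "C")
  as.POSIXct(strptime(x, "%a %b %d %H:%M:%S %z %Y", tz = "UTC"))
}

# Example date string (made up)
d <- parse_twitter_date("Mon Jun 11 14:30:00 +0000 2012")
format(d, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Hour of the day as a decimal number, in UTC
lt <- as.POSIXlt(d, tz = "UTC")
heure_utc <- lt$hour + lt$min / 60
```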
So now, it is possible to plot the times when I am online, tweeting,
DATE=Vectorize(changer_date_anglais)(res$date)
# or, equivalently,
# DATE=sapply(res$date,changer_date_anglais,simplify=TRUE)
 
DATE2=strptime(as.character(DATE),
"%a %b %d %H:%M:%S %z %Y")
lt= as.POSIXlt(DATE2, origin="1970-01-01")
heure=lt$hour+lt$min/60
plot(DATE2,heure)

On this graph, we can see that I am clearly offline for almost 6 hours a day (or at least not on Twitter). It is possible to visualize more precisely the periods of the day when I might be on Twitter,

hist(heure,breaks=0:24,col="light green",proba=TRUE)
X=c(heure-24,heure,heure+24)
d=density(X,n = 512, from=0, to=24,bw=1)
lines(d$x,d$y*3,lwd=3,col="red")
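The trick in the code above is that hours live on a circle: 23:59 and 00:01 are neighbours. Copying the sample at -24 and +24 hours lets the kernel "wrap around" midnight, and multiplying by 3 compensates for the tripled sample size. Here is a small sketch on simulated hours (not my actual tweets) comparing this with a naive estimate on [0,24].

```r
set.seed(1)
# Simulated tweeting hours, concentrated around midnight (made-up data)
heure <- rnorm(500, mean = 23.5, sd = 1.5) %% 24

# Wrapped estimate: one copy of the sample on each side
X <- c(heure - 24, heure, heure + 24)
d_wrap <- density(X, n = 512, from = 0, to = 24, bw = 1)

# Naive estimate: the kernel mass that falls outside [0,24] is lost
d_naive <- density(heure, n = 512, from = 0, to = 24, bw = 1)

# Approximate area under each curve on [0,24]
dx <- diff(d_wrap$x)[1]
area_wrap <- sum(d_wrap$y * 3) * dx
area_naive <- sum(d_naive$y) * dx

c(wrapped = area_wrap, naive = area_naive)
```

The wrapped curve integrates to (about) one and takes nearly the same value at 0 and at 24, while the naive one loses mass near midnight.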

or, if we want to illustrate with some kind of heat plot,

Note that we did it for my Twitter account, but we can also run the code on (almost) anyone on Twitter. Consider e.g. @adelaigue. Since Alexandre is tweeting in France, we have to play with time-zones,
res=extractR("adelaigue")
DATE=Vectorize(changer_date_anglais)(res$date)
DATE2=strptime(as.character(DATE),
"%a %b %d %H:%M:%S %z %Y",tz = "GMT")+2*60*60
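Adding two hours by hand works in the summer, but the offset between GMT and France changes with daylight saving time. A possible alternative, sketched below on two made-up instants, is to keep the times in UTC and let R convert them with a named time zone (here "Europe/Paris", which assumes the time zone database is available on the machine).

```r
# Two made-up instants, stored in UTC
t_ete   <- as.POSIXct("2012-06-11 12:00:00", tz = "UTC")
t_hiver <- as.POSIXct("2012-01-11 12:00:00", tz = "UTC")

# Local time in France: the offset depends on the season
heure_ete   <- format(t_ete,   "%H:%M", tz = "Europe/Paris")
heure_hiver <- format(t_hiver, "%H:%M", tz = "Europe/Paris")

# Decimal hour of the day, in local time
lt <- as.POSIXlt(t_ete, tz = "Europe/Paris")
heure_locale <- lt$hour + lt$min / 60

c(heure_ete, heure_hiver)
```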

or I can also look at @skythelimit, who usually tweets from Singapore (I am in Montréal). I can clearly see when we might overlap,

res=extractR("skythelimit")

Nice, isn't it? But it is possible to do much better... For instance, for those who have not explicitly opted out of geolocation, we can see where they tweet from during the day, and during the night... I am quite sure a dozen posts could be written with those functions...

Thursday, March 22 2012

Do we appreciate sunbathing in Spring ?

We are currently experiencing an extremely hot month in Montréal (and more generally in North America). Looking at people having a beer, and starting the first barbecue of the year, I was wondering: if we asked people whether global warming is a good or a bad thing, what do you think they would answer? Wearing a T-shirt in Montréal in March is nice, trust me! So how can we study, from a quantitative point of view, what people think of global warming, depending on the time of the year?

A few months ago, I went quickly through

score.sentiment = function(sentences, pos.words,
neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores = laply(sentences,
    function(sentence, pos.words, neg.words) {
      # remove punctuation, control characters and digits
      sentence = gsub('[[:punct:]]', '', sentence)
      sentence = gsub('[[:cntrl:]]', '', sentence)
      sentence = gsub('\\d+', '', sentence)
      # lower case, then split into words
      sentence = tolower(sentence)
      word.list = strsplit(sentence, '\\s+')
      words = unlist(word.list)
      # which words are in the positive / negative lists
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)
      # score: positive words minus negative words
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
    }, pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

hu.liu.pos = scan("positive-words.txt", what="character",
comment.char=';')
hu.liu.neg = scan('negative-words.txt', what='character',
comment.char=';')

> score.sentiment("It's awesome I am so happy, thank you all",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3

> score.sentiment("I'm desperate, life is a nightmare, I want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3

But one can easily see a big problem with this methodology. What if the sentence includes negations? E.g.

> score.sentiment("I'm no longer desperate, life is not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3

Here the sentence is negative, extremely negative, if we look only at the score. But it should be the opposite. A simple idea is to change the function (slightly), so that once a negation is found in the sentence, we take the opposite of the score. Hence, we just add, at the end of the inner function (just before the score is returned),

if("not"%in%words){score=-score}

Here we obtain

> score.sentiment.neg("I'm no longer desperate, life is not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3
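To see the negation rule at work without the word lists, here is a self-contained sketch in base R. It is not the function used in this post: `score_sentence_neg` is a simplified, single-sentence version written for illustration, and the two tiny lexicons are made up (the real lists are much longer).

```r
# Simplified scorer for one sentence, with the negation flip
score_sentence_neg <- function(sentence, pos.words, neg.words) {
  # same cleaning steps as in the function above
  sentence <- gsub("[[:punct:]]", "", sentence)
  sentence <- gsub("[[:cntrl:]]", "", sentence)
  sentence <- gsub("\\d+", "", sentence)
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  # positive words minus negative words
  score <- sum(words %in% pos.words) - sum(words %in% neg.words)
  # a single "not" anywhere in the sentence flips the sign
  if ("not" %in% words) score <- -score
  score
}

# Toy lexicons (assumptions, for illustration only)
toy.pos <- c("awesome", "happy", "thank")
toy.neg <- c("desperate", "nightmare", "die")

s1 <- score_sentence_neg("It's awesome I am so happy, thank you all",
                         toy.pos, toy.neg)
s2 <- score_sentence_neg("I'm no longer desperate, life is not a nightmare anymore I don't want to die",
                         toy.pos, toy.neg)
s3 <- score_sentence_neg("I am not happy", toy.pos, toy.neg)
c(s1, s2, s3)
```

Note that the rule is crude: since punctuation is removed, "don't" becomes "dont" and is not seen as a negation, and two negations in the same sentence still flip the sign only once.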

But does it really work ? Let us focus on Tweets,

library(twitteR)

Consider the following tweet-extractions, based on two words, a negative word, and the negation of a positive word,

> tweets=searchTwitter('not happy',n=1000)
> NH.text= lapply(tweets, function(t) t$getText() )
> NH.scores = score.sentiment(NH.text,
+ hu.liu.pos,hu.liu.neg)
 
> tweets=searchTwitter('unhappy',n=1000)
> UH.text= lapply(tweets, function(t) t$getText() )
> UH.scores = score.sentiment(UH.text,
+ hu.liu.pos,hu.liu.neg)

> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

> UH.scores = score.sentiment.neg(UH.text,
+ hu.liu.pos,hu.liu.neg)
> NH.scores = score.sentiment.neg(NH.text,
+ hu.liu.pos,hu.liu.neg)
> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

> w.tweets=searchTwitter("snow",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
> W.text= lapply(w.tweets, function(t) t$getText() )
> W.scores = score.sentiment.neg(W.text,
+ hu.liu.pos,hu.liu.neg, .progress='text')
> M[k]=mean(W.scores$score)
We obtain here the following score function, over three years, on Twitter,

Well, we have to admit that the pattern is not that obvious. There might be a small (local) bump in the winter, but it is not obvious...
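The loop over periods used above (one average score per time window) can be sketched as follows. Everything here is a placeholder: the monthly dates in `LISTEDATE` are an assumption, `fetch_texts` is a hypothetical stand-in for the Twitter search (it always returns the same two made-up tweets), and `toy_score` is a toy scorer with a two-word lexicon.

```r
# Hypothetical window boundaries: one per month, over three years
LISTEDATE <- seq(as.Date("2009-03-01"), as.Date("2012-03-01"), by = "month")

# Hypothetical stand-in for the search between two dates
fetch_texts <- function(since, until) {
  c("snow is lovely", "awful snow again")
}

# Toy scorer: +1 for "lovely", -1 for "awful"
toy_score <- function(txt) {
  words <- unlist(strsplit(tolower(txt), "\\s+"))
  sum(words == "lovely") - sum(words == "awful")
}

# One average score per window [LISTEDATE[k], LISTEDATE[k+1])
M <- rep(NA_real_, length(LISTEDATE) - 1)
for (k in seq_len(length(LISTEDATE) - 1)) {
  texts <- fetch_texts(LISTEDATE[k], LISTEDATE[k + 1])
  M[k] <- mean(sapply(texts, toy_score))
}

length(M)
```

With real searches, `plot(LISTEDATE[-1], M, type="l")` would then give the kind of curve discussed here.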

Let us get back to the point used to introduce this post. To study what people "feel" when they mention global warming, let us run the same code, again in North America,
> w.tweets=searchTwitter("global warming",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
Actually, I was expecting a nice cycle, with positive scores in Spring, and perhaps negative scores during heat waves, in the middle of the Summer...

What we simply observe is that global warming was related to "negative words" on Twitter a few years ago, but we have reached a neutral position nowadays.

And to be honest, I do not really know how to interpret that: is there a problem with the technique I use (obviously, I use here a very simple scoring function, even after integrating a minor correction to take negations into consideration), or is there something going on that can be interpreted?