## That damn R-squared !

By Arthur Charpentier on Friday, September 7 2012, 03:41 - ACT6420-H2012

Another post about the R-squared coefficient, and about why, after some years teaching econometrics, I still hate when students ask questions about it. Usually, it starts with "I have a _____ R-squared... isn't it too low ?" Please, feel free to fill in the blanks with your favorite (low) number. Say 0.2. To make it simple, there are different answers to that question:

- if you don't want to waste time understanding econometrics, I would say something like "Forget about the R-squared, it is useless" (perhaps also "please, think twice about taking that econometrics course")
- if you're ready to spend some time to get a better understanding of subtle concepts, I would say "I don't like the R-squared. It might be interesting in some rare cases (you can probably count them on the fingers of one finger), like comparing two models on the same dataset (even so, I would recommend the adjusted one). But usually, its value has no meaning. You can compare 0.2 and 0.3 (and prefer the 0.3 R-squared model, rather than the 0.2 R-squared one), but 0.2 means nothing". Well, not exactly, since it means *something*, but it is not a measure that tells you if you deal with a *good* or a *bad* model. Well, again, not exactly, but it is rather difficult to say where *bad* ends, and where *good* starts. Actually, it is exactly like the correlation coefficient (well, there is nothing mysterious here since the R-squared can be related to some correlation coefficient, as mentioned in class)
- if you want some more advanced advice, I would say "It's complicated..." (and perhaps also "Look in a textbook written by someone more clever than me, you can find hundreds of them in the library !")
- if you want me to act like people we've seen recently on TV (during the electoral debates), "It's extremely interesting, but before answering your question, let me tell you a story..."

    > set.seed(1)
    > n=20
    > X=runif(n)
    > E=rnorm(n)
    > Y=2+5*X+E*.5
    > base=data.frame(X,Y)
    > reg=lm(Y~X,data=base)
    > summary(reg)

    Call:
    lm(formula = Y ~ X, data = base)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1.15961 -0.17470  0.08719  0.29409  0.52719

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   2.4706     0.2297   10.76 2.87e-09 ***
    X             4.2042     0.3697   11.37 1.19e-09 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.461 on 18 degrees of freedom
    Multiple R-squared: 0.8778,     Adjusted R-squared: 0.871
    F-statistic: 129.3 on 1 and 18 DF,  p-value: 1.192e-09
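The link with the correlation coefficient mentioned in the list above can be checked on this first sample. The lines below are only a minimal sketch, assuming the same seed and simulation as in the session above: with a single covariate, the R-squared reported by `summary()` should coincide with the squared sample correlation between `X` and `Y`. The names `r2_lm` and `r2_cor` are introduced here for illustration.

```r
# Sketch: in a simple linear regression (one covariate plus intercept),
# the R-squared equals the squared sample correlation between X and Y.
set.seed(1)
n <- 20
X <- runif(n)
E <- rnorm(n)
Y <- 2 + 5 * X + E * 0.5

reg <- lm(Y ~ X)

r2_lm  <- summary(reg)$r.squared   # R-squared as reported by summary()
r2_cor <- cor(X, Y)^2              # squared Pearson correlation

# The two values should agree up to floating-point error
print(c(lm = r2_lm, cor_squared = r2_cor))
```

So asking whether an R-squared of 0.2 is "too low" is much the same as asking whether a correlation of about 0.45 is too low.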

    > Y=2+5*X+E*4
    > base=data.frame(X,Y)
    > reg=lm(Y~X,data=base)
    > summary(reg)

    Call:
    lm(formula = Y ~ X, data = base)

    Residuals:
        Min      1Q  Median      3Q     Max
    -9.2769 -1.3976  0.6976  2.3527  4.2175

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    5.765      1.837   3.138  0.00569 **
    X             -1.367      2.957  -0.462  0.64953
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.688 on 18 degrees of freedom
    Multiple R-squared: 0.01173,    Adjusted R-squared: -0.04318
    F-statistic: 0.2136 on 1 and 18 DF,  p-value: 0.6495

    > S=seq(0,4,by=.2)
    > R2=rep(NA,length(S))
    > for(s in 1:length(S)){
    + Y=2+5*X+E*S[s]
    + base=data.frame(X,Y)
    + reg=lm(Y~X,data=base)
    + R2[s]=summary(reg)$r.squared}
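One way to look at the vector `R2` computed in this loop is to plot it against the noise level. The sketch below is not the original figure, only a possible reconstruction under the same simulation. The dashed curve is an added assumption: the population counterpart of the R-squared for this model, Var(5X) / (Var(5X) + s^2), using Var(X) = 1/12 for a uniform variable on (0,1). The name `R2_pop` is hypothetical.

```r
# Sketch: empirical R-squared as the noise standard deviation grows,
# compared with its population counterpart for Y = 2 + 5 X + s E.
set.seed(1)
n <- 20
X <- runif(n)
E <- rnorm(n)

S <- seq(0, 4, by = 0.2)   # noise standard deviations, as in the loop above

# Same computation as the for-loop, written with sapply.
# (s = 0 gives a perfect fit, for which summary() emits a warning.)
R2 <- sapply(S, function(s) {
  Y <- 2 + 5 * X + E * s
  summary(lm(Y ~ X))$r.squared
})

# Assumption: X uniform on (0,1), so Var(5 X) = 25/12
R2_pop <- (25 / 12) / (25 / 12 + S^2)

plot(S, R2, type = "b", ylim = c(0, 1),
     xlab = "standard deviation of the noise", ylab = "R-squared")
lines(S, R2_pop, lty = 2)  # dashed line: population value
```

The slope is the same (5) in every fitted model; only the noise changes, and that alone is enough to move the R-squared from 1 down to almost nothing.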

Nevertheless, it looks like some econometricians really care about the R-squared, and cannot imagine looking at a model if the R-squared is lower than, say, 0.4. It is always possible to reach that level ! You just have to add more covariates ! If you have some... And if you don't, it is always possible to use polynomials of a continuous covariate. For instance, on the previous example,

    > # poly() requires a degree lower than the number of distinct X values (here n=20)
    > S=seq(1,19,by=1)
    > R2=rep(NA,length(S))
    > for(s in 1:length(S)){
    + reg=lm(Y~poly(X,degree=s),data=base)
    + R2[s]=summary(reg)$r.squared}
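Since the adjusted R-squared was recommended earlier for comparing models on the same dataset, here is a minimal sketch of how the two measures might be put side by side as the polynomial degree increases. It assumes the noisy sample (`E*4`) simulated with the same seed as above, and stops at degree 10 to stay away from numerical trouble. The names `degrees`, `fits` and `R2_adj` are introduced for illustration only.

```r
# Sketch: R-squared versus adjusted R-squared for polynomial fits
# of increasing degree, on the noisy sample Y = 2 + 5 X + 4 E.
set.seed(1)
n <- 20
X <- runif(n)
E <- rnorm(n)
Y <- 2 + 5 * X + E * 4
base <- data.frame(X, Y)

degrees <- 1:10

# One model summary per polynomial degree
fits <- lapply(degrees, function(d) {
  summary(lm(Y ~ poly(X, degree = d), data = base))
})

R2     <- sapply(fits, function(f) f$r.squared)      # never decreases with the degree
R2_adj <- sapply(fits, function(f) f$adj.r.squared)  # penalised for extra terms

print(cbind(degree = degrees, R2 = round(R2, 3), adjusted = round(R2_adj, 3)))
```

Because the models are nested, the plain R-squared can only go up when a term is added, whatever that term is worth; the adjusted version carries a penalty for each extra coefficient and is free to go down.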

## Comments

Hi!

Nice animations!

And how about using cross-validation techniques such as leave-one-out? On your very last example, the cross-validated R2 score would certainly not increase indefinitely with the power of your polynomial model...

You suggest that in some cases the R-squared is low because the dependent variable is very noisy. But in that case, isn't it possible to build a test based on the ratio R2/(standard deviation of the dependent variable), in order to have a simple criterion to judge the quality of the fit ?