Tuesday, 20 May 2014

Don’t use a t-test, use a regression

What?

The t-test is such a mainstay of basic statistics that you might think I can’t be serious when I say that I have never used Stata’s ttest command. But it’s true. I’ve used Stata since version 3 (we’re on 13 now – I’m older too) and I’ve never once used the t-test command built into Stata.

And the reason is that a t-test isn’t as versatile as a regression. But let me reassure you: a t-test is a regression.

Promise.

Watch.

Doing a t-test

Using the auto dataset that is built into Stata, I’ll run a t-test to see if foreign cars have better fuel consumption than US cars:

. sysuse auto
(1978 Automobile Data)

. ttest mpg, by(foreign)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Domestic |      52    19.82692     .657777    4.743297    18.50638    21.14747
 Foreign |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
    diff = mean(Domestic) - mean(Foreign)                         t =  -3.6308
Ho: diff = 0                                     degrees of freedom =       72

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0003         Pr(|T| > |t|) = 0.0005          Pr(T > t) = 0.9997

The difference in fuel consumption is 4·9 miles per gallon, in favour of the foreign cars, and it’s statistically significant. (The difference is shown as negative because Stata subtracts the foreign mean from the domestic one.) Stata writes, very correctly, that if there were really no difference, the probability of a value of t larger in absolute value than the one we observed in the data is 0·0005. And note the value of t: –3·6308.

A t-test is the same thing as a regression

What happens when I use a regression?

. regress mpg foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   13.18
       Model |  378.153515     1  378.153515           Prob > F      =  0.0005
    Residual |  2065.30594    72  28.6848048           R-squared     =  0.1548
-------------+------------------------------           Adj R-squared =  0.1430
       Total |  2443.45946    73  33.4720474           Root MSE      =  5.3558

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   4.945804   1.362162     3.63   0.001     2.230384    7.661225
       _cons |   19.82692   .7427186    26.70   0.000     18.34634    21.30751
------------------------------------------------------------------------------

The coefficient for the foreign variable is 4·946. Apart from the sign, that’s exactly the same as the difference between the means that ttest reported. Why?

The coefficient for a regression variable is the effect of a one-unit increase in that variable. For every one-unit increase in the foreign variable, the expected mileage rises by 4·946 miles per gallon (that is, fuel consumption drops). But the foreign variable is a binary variable: it’s coded zero (domestic) and one (foreign). So the effect of a one-unit increase is to change from a domestic car to a foreign one.

When a variable is coded 0/1, the regression coefficient is the difference between the means of the two categories. In other words, it’s a test of equality of means. 
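
To see this in the numbers, here is a minimal check, assuming regress mpg foreign is still the most recent estimation command. The constant is the mean for domestic cars (foreign equal to zero), so adding the coefficient for foreign to it should give back the mean for foreign cars that ttest reported, 24·77:

. lincom _cons + foreign

lincom also reports a standard error and confidence interval for the sum. Those are based on the pooled residual variance, so they need not match the per-group figures in the ttest table.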

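And, indeed, the t-value you see in the regression is the t-value of the t-test: 3·63, with the sign flipped because ttest subtracts foreign from domestic and the regression does the opposite. The P-value of 0·001 is the same 0·0005, rounded to three decimal places. The t-test isn’t a separate test – it’s the simplest kind of regression: a regression with just one predictor that’s coded 0/1.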
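
A quick way to confirm the match, sketched on the assumption that regress mpg foreign has just been run. After regress, _b[] and _se[] hold the stored coefficient and its standard error, and e(F) holds the model F statistic:

. display _b[foreign]/_se[foreign]

. display (_b[foreign]/_se[foreign])^2

. display e(F)

The first line should reproduce the t of 3·6308 from the t-test (with a positive sign). With a single predictor, the square of t is the model F, which is the 13·18 in the regression output.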

Taking the analysis further

But the real reason I use regress is the power I can bring to bear on the analysis. So far we know that foreign cars consume less fuel. But is that because they have smaller engines? If I do a t-test, I can’t answer the question, because a t-test only compares two means. But a regression can:

. regress mpg foreign displacement

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   35.57
       Model |  1222.85283     2  611.426414           Prob > F      =  0.0000
    Residual |  1220.60663    71  17.1916427           R-squared     =  0.5005
-------------+------------------------------           Adj R-squared =  0.4864
       Total |  2443.45946    73  33.4720474           Root MSE      =  4.1463

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -.8006817   1.335711    -0.60   0.551    -3.464015    1.862651
displacement |  -.0469161   .0066931    -7.01   0.000    -.0602618   -.0335704
       _cons |   30.79176   1.666592    18.48   0.000     27.46867    34.11485
------------------------------------------------------------------------------

When we take into account the effect of engine size on fuel consumption, there is no longer a significant difference between the foreign and domestic cars. The coefficient for displacement is rather silly, because it’s the effect of a one cubic inch difference in engine size. I don’t know anyone who talks about engines in cubic inches, so please excuse me while I do some work:

Making a sensible predictor variable

. gen cc= displacement*16.387064

. lab var cc "Engine size in cc"


First off, I looked up the conversion factor. Now I have a variable called cc that records engine size in cubic centimetres. 

But that’s still going to give us a silly regression coefficient, because the effect of a one-cc change in engine size will be trivial. So I’m generating a variable that represents engine size in multiples of 100 cc.


. gen cc100=cc/100

. lab var cc100 "Engine size in multiples of 100 cc"
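
As an aside, the two steps could be collapsed into one, with the result stored in double precision. This is only a sketch: the name cc100d is made up here so that it doesn’t clash with the variables above. Stata’s generate stores new variables as float unless told otherwise, so asking for double avoids a little rounding in the converted values:

. gen double cc100d = displacement*16.387064/100

. lab var cc100d "Engine size in multiples of 100 cc"

Either version should give the same regression to any number of digits worth reporting.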


Now let’s see the regression:


. regress mpg foreign cc100

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   35.57
       Model |  1222.85287     2  611.426434           Prob > F      =  0.0000
    Residual |  1220.60659    71  17.1916421           R-squared     =  0.5005
-------------+------------------------------           Adj R-squared =  0.4864
       Total |  2443.45946    73  33.4720474           Root MSE      =  4.1463

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -.8006821   1.335711    -0.60   0.551    -3.464015    1.862651
       cc100 |  -.2862997    .040844    -7.01   0.000    -.3677404    -.204859
       _cons |   30.79176   1.666592    18.48   0.000     27.46867    34.11485
------------------------------------------------------------------------------

Aha. A 100-cc increase in engine size is associated with a reduction of 0·286 mpg. Adjusted for this, the difference between foreign and domestic cars is no longer significant (P=0·551) nor is it important (a difference of 0·8 mpg).
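
The cc100 coefficient is just the displacement coefficient rescaled, and that can be checked by hand. One cubic inch is 16·387064 cc, so 100 cc is 100/16·387064 cubic inches:

. display -.0469161*100/16.387064

This should come out at about –0·2863, the coefficient for cc100 in the table. Rescaling a predictor changes its coefficient and its standard error by the same factor, which is why the t-value (–7·01) and the rest of the model are unchanged apart from rounding in the last digits.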

But does that mean that smaller engines are more fuel efficient? Or is it that smaller engines are found in, well, smaller cars?

. regress mpg foreign cc100  weight

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   45.88
       Model |  1619.71934     3  539.906448           Prob > F      =  0.0000
    Residual |  823.740115    70  11.7677159           R-squared     =  0.6629
-------------+------------------------------           Adj R-squared =  0.6484
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4304

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -1.600631   1.113648    -1.44   0.155    -3.821732    .6204699
       cc100 |   .0117693   .0614517     0.19   0.849    -.1107922    .1343308
      weight |  -.0067745   .0011665    -5.81   0.000    -.0091011   -.0044479
       _cons |   41.84795   2.350704    17.80   0.000     37.15962    46.53628
------------------------------------------------------------------------------

Now we can see that it is the weight of the car, not the engine size, that predicts fuel consumption. Adjusted for weight, engine size no longer predicts fuel consumption, and there is no significant difference between the domestic and foreign cars.

Advantages of regression

Using a regression model allowed us to examine factors that might have caused the apparent difference in fuel consumption between foreign and domestic cars. At each step, we came up with an explanation that we could then test by adding a new variable to the model. 
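
One way to lay the three steps side by side, offered as a sketch rather than as part of the analysis above, is to store each model and then tabulate them. The names m1, m2 and m3 are arbitrary, and cc100 is assumed to exist already:

. quietly regress mpg foreign

. estimates store m1

. quietly regress mpg foreign cc100

. estimates store m2

. quietly regress mpg foreign cc100 weight

. estimates store m3

. estimates table m1 m2 m3, b(%9.3f) se stats(N r2)

Reading across the row for foreign shows how its coefficient changes as each new variable enters the model.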

And don’t forget robust standard errors

Actually, in real life, I would have used Stata’s robust standard errors in all of these models. They play the same role as the unequal variances option in the t-test, but they are far more general, because they work in any regression model and not just a two-group comparison:

. regress mpg foreign cc100  weight, robust

Linear regression                                      Number of obs =      74
                                                       F(  3,    70) =   50.13
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.6629
                                                       Root MSE      =  3.4304

------------------------------------------------------------------------------
             |               Robust
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -1.600631   1.185019    -1.35   0.181    -3.964077    .7628141
       cc100 |   .0117693   .0449128     0.26   0.794    -.0778065    .1013451
      weight |  -.0067745   .0008357    -8.11   0.000    -.0084413   -.0051077
       _cons |   41.84795   1.811437    23.10   0.000     38.23515    45.46075
------------------------------------------------------------------------------

That’s what it looks like. 
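
For the simple two-group comparison, the two approaches can be put next to each other. A sketch, using only variables already in the auto data:

. ttest mpg, by(foreign) unequal

. regress mpg foreign, vce(robust)

vce(robust) is the fuller spelling of the robust option. The two standard errors for the difference should be close but need not agree exactly, because the two commands use different small-sample adjustments and different degrees of freedom.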


So make a new year’s resolution (never too late!) to get beyond the t-test and use the power of regression to construct and test more thoughtful models.