What?
The t-test is such a mainstay of basic statistics that you may think I can’t be serious when I say that I have never used Stata’s ttest command. But it’s true. I’ve used Stata since version 3 (we’re on 13 now – I’m older too) and I’ve never once used the t-test command built into Stata.
And the reason is that a t-test isn’t as versatile as a regression. But let me reassure you: a t-test is a regression.
Promise.
Watch
Doing a t-test
Using the auto dataset that is built into Stata, I’ll run a t-test to see if foreign cars have better fuel consumption than US cars:
. sysuse auto
(1978 Automobile Data)
. ttest mpg, by(foreign)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Domestic |      52    19.82692     .657777    4.743297    18.50638    21.14747
 Foreign |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
    diff = mean(Domestic) - mean(Foreign)                         t =  -3.6308
Ho: diff = 0                                     degrees of freedom =       72
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0003         Pr(|T| > |t|) = 0.0005          Pr(T > t) = 0.9997
The difference in fuel consumption is 4·9 miles per gallon, in favour of the foreign cars, and it’s statistically significant. Stata writes, very correctly, that if there were really no difference, the probability of a t-value larger in absolute size than the one we observed in the data is 0·0005. And note the value of t: –3·6308.
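The numbers in that table don't have to be read off the screen: ttest leaves them behind as stored results in r(). Here is a minimal sketch of pulling them out, assuming a fresh session with the auto data. The scalar names are made up for illustration.

```stata
sysuse auto, clear
quietly ttest mpg, by(foreign)

* ttest leaves its results in r(); copy them before another command overwrites them
scalar diff_means = r(mu_1) - r(mu_2)
scalar t_ttest    = r(t)
scalar p_twosided = r(p)

* mu_1 is the first group (Domestic), mu_2 the second (Foreign)
display "difference  = " %9.4f diff_means
display "t           = " %9.4f t_ttest
display "two-sided P = " %9.4f p_twosided
```

Typing return list straight after ttest shows everything that is available.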
A t-test is the same thing as a regression
What happens when I use a regression?
. regress mpg foreign
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   13.18
       Model |  378.153515     1  378.153515           Prob > F      =  0.0005
    Residual |  2065.30594    72  28.6848048           R-squared     =  0.1548
-------------+------------------------------           Adj R-squared =  0.1430
       Total |  2443.45946    73  33.4720474           Root MSE      =  5.3558
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   4.945804   1.362162     3.63   0.001     2.230384    7.661225
       _cons |   19.82692   .7427186    26.70   0.000     18.34634    21.30751
------------------------------------------------------------------------------
The coefficient for the foreign variable is 4·946. Apart from the sign, that’s exactly the same as the difference between the means: ttest reports domestic minus foreign, while the regression reports foreign minus domestic. Why?
The coefficient for a regression variable is the effect of a one-unit increase in that variable. For every one-unit increase in the foreign variable, the expected fuel consumption improves by 4·946 miles per gallon. But the foreign variable is a binary variable: it’s coded zero (domestic) and one (foreign). So the effect of a one-unit increase is to change from a domestic car to a foreign one.
When a variable is coded 0/1, the regression coefficient is the difference between the means of the two categories. In other words, it’s a test of equality of means.
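For anyone who wants to see why this holds, here is a short derivation. The notation is not from the analysis above and is introduced only for this sketch: y is mpg, x is the 0/1 foreign indicator, n_0 and n_1 are the numbers of domestic and foreign cars, and \bar y_0 and \bar y_1 are the two group means.

```latex
\begin{aligned}
\hat\beta_1 &= \frac{\sum_i (x_i-\bar x)\,y_i}{\sum_i (x_i-\bar x)^2},
\qquad \bar x = \frac{n_1}{n},\quad n = n_0+n_1,\\[4pt]
\sum_i (x_i-\bar x)^2 &= n_1(1-\bar x)^2 + n_0\,\bar x^2 = \frac{n_0 n_1}{n},\\[4pt]
\sum_i (x_i-\bar x)\,y_i &= n_1\bar y_1 - \frac{n_1}{n}\,(n_0\bar y_0+n_1\bar y_1)
 = \frac{n_0 n_1}{n}\,(\bar y_1-\bar y_0),\\[4pt]
\hat\beta_1 &= \bar y_1-\bar y_0,
\qquad \hat\beta_0 = \bar y-\hat\beta_1\bar x = \bar y_0 .
\end{aligned}
```

This matches the output: _cons is 19·82692, the domestic mean, and the foreign coefficient is 24·77273 minus 19·82692.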
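And, indeed, the t-value you see in the regression (3·63) is the t-value of the t-test (–3·6308), apart from the sign, which flips because the regression subtracts the domestic mean from the foreign one rather than the other way round. The t-test isn’t a separate test – it’s the simplest kind of regression: a regression with just one predictor that’s coded 0/1.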
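The match can also be checked to more decimal places than the regression table prints. This is only a sketch: the scalar names are hypothetical, and the P-value for the regression is recomputed by hand from the t distribution with ttail().

```stata
sysuse auto, clear

* Equal-variance t-test: keep t and the two-sided P before r() is overwritten
quietly ttest mpg, by(foreign)
scalar t_ttest = r(t)
scalar p_ttest = r(p)

* The same comparison as a regression on the 0/1 variable
quietly regress mpg foreign
scalar t_reg = _b[foreign]/_se[foreign]
scalar p_reg = 2*ttail(e(df_r), abs(t_reg))

display "ttest:   t = " %8.4f t_ttest "   P = " %6.4f p_ttest
display "regress: t = " %8.4f t_reg "   P = " %6.4f p_reg

* Same size, opposite sign: ttest takes domestic minus foreign
assert reldif(t_ttest, -t_reg) < 1e-6
```

The assert line stops with an error if the two t-values differ by more than rounding noise, so a silent run means they agree.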
Taking the analysis further
But the real reason I use regress is the power I can bring to bear on the analysis. So far we know that foreign cars consume less fuel. But is that because they have smaller engines? If I do a t-test, I can’t answer the question, because a t-test only compares two means. But a regression can:
. regress mpg foreign displacement
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   35.57
       Model |  1222.85283     2  611.426414           Prob > F      =  0.0000
    Residual |  1220.60663    71  17.1916427           R-squared     =  0.5005
-------------+------------------------------           Adj R-squared =  0.4864
       Total |  2443.45946    73  33.4720474           Root MSE      =  4.1463
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -.8006817   1.335711    -0.60   0.551    -3.464015    1.862651
displacement |  -.0469161   .0066931    -7.01   0.000    -.0602618   -.0335704
       _cons |   30.79176   1.666592    18.48   0.000     27.46867    34.11485
------------------------------------------------------------------------------
When we take into account the effect of engine size on fuel consumption, there is no longer a significant difference between the foreign and domestic cars. The coefficient for displacement is rather silly, because it’s the effect of a one cubic inch difference in engine size. I don’t know anyone who talks about engines in cubic inches, so please excuse me while I do some work:
Making a sensible predictor variable
. gen cc= displacement*16.387064
. lab var cc "Engine size in cc"
First off, I looked up the conversion factor. Now I have a variable called cc that records engine size in cubic centimetres.
But that’s still going to give us a silly regression coefficient, because the effect of a one-cc change in engine size will be trivial. So I’m generating a variable that represents engine size in multiples of 100 cc.
. gen cc100=cc/100
. lab var cc100 "Engine size in multiples of 100 cc"
Now let’s see the regression:
. regress mpg foreign cc100
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   35.57
       Model |  1222.85287     2  611.426434           Prob > F      =  0.0000
    Residual |  1220.60659    71  17.1916421           R-squared     =  0.5005
-------------+------------------------------           Adj R-squared =  0.4864
       Total |  2443.45946    73  33.4720474           Root MSE      =  4.1463
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -.8006821   1.335711    -0.60   0.551    -3.464015    1.862651
       cc100 |  -.2862997    .040844    -7.01   0.000    -.3677404    -.204859
       _cons |   30.79176   1.666592    18.48   0.000     27.46867    34.11485
------------------------------------------------------------------------------
Aha. A 100-cc increase in engine size is associated with a reduction of 0·286 mpg. Adjusted for this, the difference between foreign and domestic cars is no longer significant (P=0·551) nor is it important (a difference of 0·8 mpg).
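Notice that changing the units changed nothing except the scale. Rescaling a predictor multiplies its coefficient and standard error by the same constant, so t, P and every other row of the table stay put. A rough check, assuming the original displacement variable and using a hypothetical scalar name, is to rescale the cubic-inch results by hand:

```stata
sysuse auto, clear
quietly regress mpg foreign displacement

* 100 cc is 100/16.387064 cubic inches, so multiply the per-cubic-inch results by that
scalar per100cc = 100/16.387064

display "coefficient per 100 cc = " %9.6f _b[displacement]*per100cc
display "std. err. per 100 cc   = " %9.6f _se[displacement]*per100cc
display "t (unchanged)          = " %9.2f _b[displacement]/_se[displacement]
```

The first two lines should agree with the cc100 row above, give or take the last decimal place, since the generated variables are stored with limited precision.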
But does that mean that smaller engines are more fuel efficient? Or is it that smaller engines are found in, well, smaller cars?
. regress mpg foreign cc100 weight
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   45.88
       Model |  1619.71934     3  539.906448           Prob > F      =  0.0000
    Residual |  823.740115    70  11.7677159           R-squared     =  0.6629
-------------+------------------------------           Adj R-squared =  0.6484
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4304
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -1.600631   1.113648    -1.44   0.155    -3.821732    .6204699
       cc100 |   .0117693   .0614517     0.19   0.849    -.1107922    .1343308
      weight |  -.0067745   .0011665    -5.81   0.000    -.0091011   -.0044479
       _cons |   41.84795   2.350704    17.80   0.000     37.15962    46.53628
------------------------------------------------------------------------------
Now we can see that the weight of the car, not the engine size, drives the fuel consumption. Adjusted for weight, engine size no longer predicts fuel consumption, and there is no significant difference between the domestic and foreign cars.
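It is easier to watch the foreign coefficient shrink when the three models sit side by side. One way to do that, sketched here with made-up model names (crude, engine, engine_wt), is to store each set of estimates and tabulate them together:

```stata
sysuse auto, clear
generate cc100 = displacement*16.387064/100

quietly regress mpg foreign
estimates store crude

quietly regress mpg foreign cc100
estimates store engine

quietly regress mpg foreign cc100 weight
estimates store engine_wt

* Coefficients with standard errors beneath, one column per model
estimates table crude engine engine_wt, b(%9.3f) se(%9.3f) stats(N r2)
```

Reading across the foreign row gives the unadjusted difference, then the difference adjusted for engine size, then for engine size and weight.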
Advantages of regression
Using a regression model allowed us to examine factors that might have caused the apparent difference in fuel consumption between foreign and domestic cars. At each step, we came up with an explanation that we could then test by adding a new variable to the model.
And don’t forget robust standard errors
Actually, in real life, I would have used Stata’s robust standard errors in all of these models. They are the regression counterpart of the unequal variances option in the t-test, but far more versatile, because they work with any number of predictors:
. regress mpg foreign cc100 weight, robust
Linear regression                                      Number of obs =      74
                                                       F(  3,    70) =   50.13
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.6629
                                                       Root MSE      =  3.4304
------------------------------------------------------------------------------
             |               Robust
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -1.600631   1.185019    -1.35   0.181    -3.964077    .7628141
       cc100 |   .0117693   .0449128     0.26   0.794    -.0778065    .1013451
      weight |  -.0067745   .0008357    -8.11   0.000    -.0084413   -.0051077
       _cons |   41.84795   1.811437    23.10   0.000     38.23515    45.46075
------------------------------------------------------------------------------
That’s what it looks like.
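The link to the unequal-variances t-test can be made concrete in the simple two-group case. With a single 0/1 predictor, the HC2 flavour of the robust estimator, requested with vce(hc2), should reproduce the standard error from ttest with the unequal option. The default robust estimator should come close without matching exactly. The degrees of freedom are not the same, so the P-values will differ a little. A sketch for checking this, with hypothetical scalar names:

```stata
sysuse auto, clear

* Unequal-variances t-test (Satterthwaite degrees of freedom)
quietly ttest mpg, by(foreign) unequal
scalar se_unequal = r(se)

* Default robust standard error
quietly regress mpg foreign, vce(robust)
scalar se_robust = _se[foreign]

* HC2 robust standard error
quietly regress mpg foreign, vce(hc2)
scalar se_hc2 = _se[foreign]

display "ttest, unequal : " %9.6f se_unequal
display "vce(robust)    : " %9.6f se_robust
display "vce(hc2)       : " %9.6f se_hc2
```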
So make a new year’s resolution (never too late!) to get beyond the t-test and use the power of regression to construct and test more thoughtful models.