Tuesday 17 June 2014

Chi-squared: A measure of surprise

Introduction

Rashid, my opposite number in Penang Medical College, has the bad habit of asking statistical questions when I am trying to concentrate on pulling and crawling my way up steeply-sloping rainforest. And it was in such a situation that he came up with a rather good question:

Just what is Chi-squared anyway?

And I said

Chi-squared is a measure of surprise.

Think of it this way. Here we have a table of data, and we are wondering whether there is a pattern in it. The data show the relationship between loneliness and the prevalence of depressed mood in people aged 65 and over in a community survey.


. tab  depressed loneliness, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

                      |     Feels lonely
 Depressed mood (Q24) | Not lonely    Lonely |     Total
----------------------+----------------------+----------
No depression last mo |     1,310        621 |     1,931 
                      |     95.48      81.28 |     90.40 
----------------------+----------------------+----------
Depressed last month  |        62        143 |       205 
                      |      4.52      18.72 |      9.60 
----------------------+----------------------+----------
                Total |     1,372        764 |     2,136 
                      |    100.00     100.00 |    100.00 


Is there anything surprising here? Well, we seem to have proportionally more depressed people among the lonely (19% depression prevalence) than among the non-lonely (5% prevalence). But how surprising is that difference? Well, that depends on what we were expecting.

If loneliness had no impact on depression, then we would expect to see more or less the same prevalence of depression in the lonely and the non-lonely. Of course, the prevalence probably wouldn’t be exactly the same, but a few people one way or another wouldn’t cause much surprise. What we are looking for is a surprising number of people too many or too few.

So how do we judge when the number of people too many or too few is surprising? Well, that depends on how many people we were expecting, doesn’t it? If ten more people than I expected turn up, I am not surprised if I am running a conference, but I am pretty taken aback if I thought I was going out for an intimate dinner for two.

So the amount of surprise depends on the number of people I saw, but also on the number I was expecting. 

What are we expecting?

Let’s have a look at that table again. Notice that 9·6% of people were depressed. So if loneliness has nothing to do with depression, we expect 9·6% of the lonely and 9·6% of the non-lonely to be depressed. 

We have 1,372 non-lonely people; 9·6% of that is 

. di 1372*0.096
131.712

I let Stata’s display command do the work for me. Essentially, we are expecting 132 of the non-lonely people to be depressed, and so we are expecting 

. di 764*0.096
73.344

73 of the lonely people to be depressed. 
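
By the way, the rounding in that 9·6% figure is avoidable: the expected count for any cell is just its row total multiplied by its column total, divided by the grand total. For the depressed, non-lonely cell that is 205 x 1,372 divided by 2,136:

. di 205*1372/2136
131.67603

which agrees with the 131·7 that Stata prints in the expected-frequency table below.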

We could work out the remaining numbers in the same way, but Stata will display them for us – now that we know how Stata calculates those expected frequencies, that is.

. tab  depressed loneliness, exp

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

                      |     Feels lonely
 Depressed mood (Q24) | Not lonely    Lonely |     Total
----------------------+----------------------+----------
No depression last mo |     1,310        621 |     1,931 
                      |   1,240.3      690.7 |   1,931.0 
----------------------+----------------------+----------
Depressed last month  |        62        143 |       205 
                      |     131.7       73.3 |     205.0 
----------------------+----------------------+----------
                Total |     1,372        764 |     2,136 
                      |   1,372.0      764.0 |   2,136.0 

We have four cells in the table, and in each cell we can compare the number we saw with the number we expected. And remember, the expectation is based on the idea that loneliness has nothing to do with depression. From that idea follows the reasoning that if 9·6% of the whole sample is depressed, then we are expecting 9·6% of the lonely, and 9·6% of the non-lonely, to be depressed. And, of course, 90·4% of the whole sample is not depressed, so we are expecting 90·4% of the lonely, and 90·4% of the non-lonely, to be not depressed.

The Chi-squared test visits each cell of the table, calculating how far apart the expected and observed frequencies are, as a measure of how surprising that cell is, and totalling up all those surprises as a total surprise score.
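
Take the depressed-and-lonely cell as an example. We saw 143 people there but expected only 73·3, and that cell’s contribution to Pearson’s Chi-squared is the squared gap between observed and expected, divided by the expected count:

. di (143-73.3)^2/73.3

which comes to about 66. The same gap of roughly 70 people in the big non-depressed, non-lonely cell – where we were expecting over 1,240 – contributes only about 4. A discrepancy is far more surprising when the expected count is small, which is exactly the conference-versus-dinner point.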

Of course, we need to relate that total surprise score to the number of cells in the table. A big surprise score could be the result of adding together a lot of small surprises coming from the cells of a large table – in which case we aren’t really surprised – or it could come from a very small number of cells in a small table, in which case we are surprised. 

Degrees of freedom: the capacity for surprise

So a Chi-squared value has to be interpreted in the light of the number of cells in the table, which is where the degrees of freedom come in. Degrees of freedom measure the capacity of a table to generate surprising results. Looking at our table, once we had worked out one expected frequency, we could have filled in the other three by simple subtraction. If we expect 131·7 people to be depressed but not lonely, then the rest of the depressed people must be depressed and lonely, and the rest of the non-lonely people must be non-depressed. See – once I know one number in that table, I can work out the rest by subtraction.
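
You can check that with Stata’s display command, working the other three expected frequencies out from the margins by subtraction:

. di 205 - 131.7
73.3

. di 1372 - 131.7
1240.3

. di 764 - 73.3
690.7

which are exactly the figures in the expected-frequency table above.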

So that 2 x 2 table has only one potential for surprise. 

And by extension, once I know all but one of the numbers in a row of a table, I can fill in the last one by subtraction. The same goes for the columns. So the capacity of a table for surprises is one less than the number of rows multiplied by one less than the number of columns – for our 2 x 2 table, 1 x 1 = 1 degree of freedom.

And what is the Chi-squared value for the table anyway?

. tab  depressed loneliness, col chi

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

                      |     Feels lonely
 Depressed mood (Q24) | Not lonely    Lonely |     Total
----------------------+----------------------+----------
No depression last mo |     1,310        621 |     1,931 
                      |     95.48      81.28 |     90.40 
----------------------+----------------------+----------
Depressed last month  |        62        143 |       205 
                      |      4.52      18.72 |      9.60 
----------------------+----------------------+----------
                Total |     1,372        764 |     2,136 
                      |    100.00     100.00 |    100.00 

          Pearson chi2(1) = 114.0215   Pr = 0.000

A very large index of surprise: a Chi-squared of 114, with one degree of freedom. And a P-value that is so small that the first three decimal places are all zero. We would write P<0·001.
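
In fact you can reproduce that Chi-squared value by hand from the observed and expected tables above – square the gap in each cell, divide by the expected count, and add up the four results:

. di (1310-1240.3)^2/1240.3 + (621-690.7)^2/690.7 + (62-131.7)^2/131.7 + (143-73.3)^2/73.3

which comes to about 114·1; the small difference from Stata’s 114·02 is only because we typed in expected frequencies rounded to one decimal place. And if you are curious just how small that P-value really is, Stata’s chi2tail() function gives the upper tail of the Chi-squared distribution:

. di chi2tail(1, 114.0215)

The answer is of the order of 10^-26, so P<0·001 is, if anything, an understatement.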

So that’s it. Chi-squared is a measure of surprise, and degrees of freedom measure the capacity of a table to come up with surprises.

Those aren’t mathematical definitions, but they are useful ways of thinking about statistical testing, and the difficult idea of degrees of freedom. 


Monday 16 June 2014

FIFA World Cup 2014: Probabilities of Winning!

The Norwegian Computing Center in Oslo has calculated the chances of winning for every team taking part in the championship, based on a probability model, and the probabilities are updated daily as the tournament progresses.


Monday 9 June 2014

Analysis of Variance: Almost always a bad idea


Ask a vague question, get a vague answer

Analysis of variance is a dangerous tool. It allows researchers to avoid asking precise questions. And if you don’t formulate your question precisely, then the answer won’t tell you anything. It’s like the gag in the Hitch Hiker’s Guide where they build a supercomputer to figure out “the answer to life, the universe and everything”. When the computer finally figures it out, the answer is 42. The problem, you see, was that they hadn’t actually figured out what the question was. 

Let’s have a look at a simple case. Here are data on cardiac output late in pregnancy in three groups of mothers: those with normal blood pressure, those with pre-eclampsia (PET) and those with gestational hypertension (GH).

. table outcome, c(mean  co5) format(%2.1f)

----------------------
Hypertens |
ion       |
outcome   |  mean(co5)
----------+-----------
   Normal |        6.4
    P E T |        5.6
  Gest HT |        9.0
----------------------


I used the format option to limit us to one place of decimals. 

What does an anova tell us?

. anova  co5 outcome

                           Number of obs =     256     R-squared     =  0.4984
                           Root MSE      = .790256     Adj R-squared =  0.4944

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |   156.97329     2  78.4866448     125.68     0.0000
                         |
                 outcome |   156.97329     2  78.4866448     125.68     0.0000
                         |
                Residual |  157.999502   253  .624503962   
              -----------+----------------------------------------------------
                   Total |  314.972792   255  1.23518742  

Now, if you find that edifying you’re a better person than I. What the anova tells us is that there is evidence of some difference in cardiac output somewhere among the three groups. But that’s not the answer to any particular question. It certainly doesn’t answer any useful clinical question.

“Oh, but now we can do post-hoc comparisons”, you say. To which I might reply “But don’t you have any ideas you want to test?” The trouble with post-hoc comparisons is that you can rapidly end up with a bunch of comparisons, some of which are of interest and some of which are meaningless.

First ask a question

To analyse data, you need to articulate the underlying hypothesis. And here, the hypothesis wasn’t “there’s some kinda difference in cardiac output between the three groups”. This is what I call the Empty Brain hypothesis: it assumes that no previous research has been done, and that we know nothing about physiology, about hæmodynamics, about anything. And it’s not good enough. Science progresses by building on our understanding, not by wandering around hypothesising “some kinda difference”.

In fact, we expect that late in pregnancy, cardiac output in gestational hypertension will be higher than normal, causing high blood pressure because it exceeds the normal carrying capacity of the mother’s circulatory system. On the other hand, we expect that it will be below normal in pre-eclampsia because pre-eclampsia is characterised by a lot of clinical indicators of inadequate blood supply. The high blood pressure in pre-eclampsia is the result of very high peripheral resistance (the circulatory system basically closed tight) and the heart desperately trying to get the blood supply through.

Then use regression to answer it

We can use regression to ask this question:


. regress  co5 i.outcome

      Source |       SS       df       MS              Number of obs =     256
-------------+------------------------------           F(  2,   253) =  125.68
       Model |   156.97329     2  78.4866448           Prob > F      =  0.0000
    Residual |  157.999502   253  .624503962           R-squared     =  0.4984
-------------+------------------------------           Adj R-squared =  0.4944
       Total |  314.972792   255  1.23518742           Root MSE      =  .79026

------------------------------------------------------------------------------
         co5 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     outcome |
      P E T  |  -.8146901    .217851    -3.74   0.000    -1.243722   -.3856577
    Gest HT  |   2.612142   .1732165    15.08   0.000     2.271012    2.953272
             |
       _cons |   6.376119   .0534005   119.40   0.000     6.270953    6.481285
------------------------------------------------------------------------------

I have used Stata’s i. factor-variable notation to tell Stata that outcome is to be treated as a set of separate categories rather than as a continuous variable. In this case, Stata will use the normotensive group (coded as zero) as the reference group.
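
If a different comparison made more clinical sense, the same factor-variable notation lets you pick the reference group explicitly. As a minimal sketch – assuming the pre-eclamptic group is coded 1, which you should check against your own coding with codebook outcome – the ib prefix sets the base level:

. regress co5 ib1.outcome

The coefficients would then compare the normotensive and gestational hypertension groups with the pre-eclamptic group instead.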

Not just hypothesis tests – effect sizes too

The regression has tested our two hypotheses:
1. Compared with women with normal BP, those with PET will have lower cardiac outputs
2. Compared with women with normal BP, those with GH will have higher cardiac outputs
Furthermore, it has quantified the effects. Cardiac output is 0·8 litres a minute lower in pre-eclampsia, and 2·6 litres a minute higher in gestational hypertension. 

Robust variance estimation as well!

Another advantage of regression is that we can use Stata’s robust variance estimates. This topic is so important that it will be the subject of its own blog post. But note what happens when I invoke robust variance estimation:

. regress  co5 i.outcome, robust

Linear regression                                      Number of obs =     256
                                                       F(  2,   253) =  207.37
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4984
                                                       Root MSE      =  .79026

------------------------------------------------------------------------------
             |               Robust
         co5 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     outcome |
      P E T  |  -.8146901   .1947096    -4.18   0.000    -1.198148   -.4312319
    Gest HT  |   2.612142   .1352127    19.32   0.000     2.345856    2.878428
             |
       _cons |   6.376119   .0549773   115.98   0.000     6.267847     6.48439
------------------------------------------------------------------------------


Our estimates of the effect sizes haven’t changed, but the standard errors and confidence intervals have. Robust variance estimation frees us from the assumption that the residual variance is the same in every group, and – with the cluster option – it can also take into account clustering within the data due to factors beyond our control. Always a good idea!
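
For genuinely clustered data, you name the grouping variable explicitly. As a minimal sketch – assuming the dataset contained a variable identifying, say, the recruiting centre, here given the hypothetical name centre – the clustered version would be:

. regress co5 i.outcome, vce(cluster centre)

The coefficients stay the same again; only the standard errors and confidence intervals change, this time allowing for the fact that women from the same centre may resemble one another.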