Thursday, 29 October 2015

Which formula for the confidence interval for a proportion?

Stata presents a bewildering array of options for the confidence interval for a proportion. Which one should you use? 

By default, Stata uses the "exact" confidence interval. This name is a bit misleading (this interval is also called the Clopper-Pearson confidence interval, which makes fewer implied claims!). The exact confidence interval is exact only in the sense that it is never too narrow. In other words, the probability of the true proportion lying within the "exact" confidence interval is at least 95%. However, this means that in most cases the interval is wider than it needs to be. 

For an apparently simple problem, finding a formula that will give 95% confidence intervals for a proportion has turned out to be surprisingly difficult to crack! The problem is that events are whole numbers, while proportions are continuous. Imagine you have a 25% real prevalence of smoking in your population, and you have a sample size of 107. Your sample cannot have a 25% prevalence of smoking, because, well, that would be 26·75 people. So some sample sizes are "lucky" because they can actually show lots of different proportions, and some proportions are "lucky" because they can turn up in lots of sample sizes. You begin to see the problem?
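A quick sketch of that arithmetic, using Stata as a calculator. The numbers are the ones from the smoking example; nothing else is assumed. The closest a sample of 107 can get to 25% is 26 or 27 smokers, which is roughly 24.3% or 25.2%.

```stata
* 25% of a sample of 107 is not a whole number of people
display 0.25 * 107

* the two nearest proportions the sample can actually show
display 26/107
display 27/107
```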

Solutions from research

There have been quite a few studies that have used computer simulation to examine the performance of different confidence interval formulas. The recommended alternatives are Wilson or Jeffreys confidence intervals for samples of less than 100 and the Agresti-Coull interval for samples of 100 or more. This gives the best trade-off between intervals whose coverage falls below 95% and intervals that are too wide. 
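As a rough sketch, this is how you might ask Stata for those alternatives. I am assuming Stata 14 or later here, where the immediate command is written `cii proportions`; older versions drop the word `proportions` (so `cii 23 2, jeffreys`). The 30 events out of 150 are made-up figures, purely to show a larger sample.

```stata
* small sample (under 100): Wilson or Jeffreys
cii proportions 23 2, jeffreys

* larger sample (100 or more): Agresti-Coull
* 150 and 30 are invented numbers for illustration
cii proportions 150 30, agresti
```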

What about the textbook formula that SPSS uses?

One option that Stata does not offer you is the formula you find in textbooks, which simply uses the standard error of the proportion to create a confidence interval. This is known as the normal approximation interval, and it is used by SPSS. If you calculate the confidence interval for 2 events out of a sample of 23 using the normal approximation, the confidence interval is -4% to 21%. That's right: SPSS is suggesting that the true event rate could be minus four percent. Quite clearly this is wrong, as there is no such thing as minus four percent. But the interval also includes zero, which is just as wrong: if we have observed two cases, then the true value cannot be zero percent either. Less obviously, the upper end of the confidence interval is also very wrong. Using Wilson's formula gives a confidence interval of 

. cii 23 2, wil

                                                         ------ Wilson ------
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |         23    .0869565    .0587534          .02418    .2679598

2.4% to 26.8%. The "exact" method gives an interval that is slightly wider:

. cii 23 2, exact

                                                         -- Binomial Exact --
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |         23    .0869565    .0587534          .01071    .2803793

at 1.1% to 28.0%. 

So never calculate a binomial confidence interval by hand or using SPSS!
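If you want to see for yourself where the impossible lower limit comes from (to look at it, not to use it!), here is a rough sketch of the textbook calculation done by hand in Stata. The first pair of limits uses the normal multiplier and comes out at about -2.8% and 20.2%. The second pair uses a t multiplier and n - 1 in the standard error, which is my assumption about how a general-purpose procedure for means would treat a 0/1 variable; it comes out at about -3.8% and 21.2%, close to the figures quoted above.

```stata
* 2 events in a sample of 23, as in the example above
scalar phat = 2/23
scalar se_z = sqrt(phat*(1 - phat)/23)

* textbook interval with the normal multiplier (about 1.96)
display phat - invnormal(0.975)*se_z
display phat + invnormal(0.975)*se_z

* variant (an assumption): t multiplier on 22 df, n - 1 in the SE
scalar se_t = sqrt(phat*(1 - phat)/22)
display phat - invttail(22, 0.025)*se_t
display phat + invttail(22, 0.025)*se_t
```

Either way the lower limit is below zero, which is the whole problem.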

Skip to this bit for the answer


For such an apparently simple problem, the issue of the confidence interval for a proportion is mathematically pretty complex. Mercifully, a Stata user just has to remember three things: 
  1. the "exact" interval is conservative, but has at least a 95% chance of including the true value; 
  2. for N < 100, Wilson or Jeffreys is less conservative and closest to an average chance of 95% coverage; 
  3. and for N of 100 or more, Agresti-Coull is the best bet. 
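The three points above could be turned into a small do-file along these lines. This is only a sketch: it uses the `auto` dataset that ships with Stata and its 0/1 variable `foreign` as a stand-in for your own data, and it assumes the Stata 14 or later syntax (`ci proportions`). In older versions the equivalent is `ci foreign, binomial wilson`.

```stata
* auto.dta ships with Stata; foreign is a 0/1 variable
sysuse auto, clear

* count the non-missing observations, then pick the interval by sample size
quietly count if !missing(foreign)
if r(N) < 100 {
    ci proportions foreign, wilson
}
else {
    ci proportions foreign, agresti
}
```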

Tuesday, 20 October 2015

Missing data: never, never use 99!

The myth of the 99

My heart sinks every time I come across data with 9, 99, 999, 9999 and other real numbers used to indicate missing data. There are still supervisors (though they are getting pretty old by now) who advise students to do this. It's a myth particularly prevalent among SPSS users. 

Origins

The use of actual numbers as missing values takes us way back to the seventies, when computers were run by punched card. They looked like this:


And yes, that's SPSS on those cards!
When you were building a dataset, you had to tell the computer what kind of variable each variable was, and how much storage space it needed. Numeric variables could only contain numbers, and so researchers had a problem: what happened when the information was missing?
The SPSS solution was to declare one of the numbers to be a missing value. This overcame the problem of storing missing values in a column of numeric data, but it also opened the floodgates to a lot of really risky calculations. Because if you forgot to tell SPSS about your missing value, then the information would be treated as real.
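The same trap exists in any package. Here is a tiny made-up example in Stata (the variable `cigs`, cigarettes per day, and its four values are invented for illustration). The three real answers average 10 a day, but because nobody has told the software that 99 means "missing", the reported mean is 32.25.

```stata
clear
input id cigs
1 10
2 20
3 99
4 0
end

* 99 was meant as "missing", but nothing tells Stata so
summarize cigs
```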

Myth: you must have a numeric missing value

Every modern statistics package since Bob Dylan was alive in a meaningful sense has been able to handle missing values. By this I mean that if you leave a blank, it is correctly interpreted. It will automatically be assigned a missing value by the package. 
Let me repeat: you do not have to have special numeric missing values.

Missing – just leave it blank

So if you have missing data, just leave it blank. Your stats package knows what to do. If you need to know why the data were missing, then create a separate variable that codes the reasons. If the reasons are worth analysing, then they are worth coding properly. 
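If you inherit a dataset that already has 99s in it, one way to tidy it up in Stata might look like this. It is a sketch on invented data: `cigs` and `cigs_reason` are hypothetical variable names, and the reason labels are placeholders for whatever your study actually recorded.

```stata
clear
input id cigs
1 10
2 20
3 99
4 0
end

* keep the reason in its own variable before blanking the code
generate byte cigs_reason = 0
replace cigs_reason = 1 if cigs == 99
label define reason 0 "Answered" 1 "Coded 99 in source file"
label values cigs_reason reason

* turn the 99s into Stata's ordinary missing value (.)
mvdecode cigs, mv(99)

summarize cigs
tabulate cigs_reason
```

After `mvdecode`, the summary uses only the three real answers, and the reason for the gap is still there in `cigs_reason` if you want to analyse it.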

Friday, 16 October 2015

Stata tips : get a display of Stata's graph colours

Stata graphs allow you to specify the colo(u)rs used for various graph items. Stata has about 50 named colours, so it's hard to remember exactly what each colour looks like. In addition, Stata uses the names gs0 to gs16 for a scale of greys, starting with gs0 (black) and ending with gs16 (white). These are useful for producing more-or-less evenly-spaced gradations of grey. 
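As a small sketch of how the names get used, the do-file below lists the colour names Stata knows about and then draws a scatterplot with two greys from the gs scale. It uses the `auto` dataset that ships with Stata; the choice of gs4 and gs11 is arbitrary. Run it from a do-file, since the `///` line continuation does not work when typed at the command line.

```stata
sysuse auto, clear

* list the colour names Stata recognises
graph query colorstyle

* two shades from the gs scale to tell two groups apart
twoway (scatter mpg weight if foreign == 0, mcolor(gs4))  ///
       (scatter mpg weight if foreign == 1, mcolor(gs11)), ///
       legend(order(1 "Domestic" 2 "Foreign"))
```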
To see a plot of all the available Stata colours, you need to install the vgsg package. This package contains useful resources that accompany Mitchell's excellent book A visual guide to Stata graphics. Amongst these resources there is a simple command that makes a colour chart. 

You install the package like this:

. net from http://stata-press.com/data/vgsg

Click the link to install the package. Once you have installed it, you can issue the command 

. vgcolormap


to print a palette of the available colours. I printed one in colour and pasted it inside my copy of Mitchell's book. Here is the graph:


Notice the shades of grey, named gs0 to gs16 (biostats people can't cope with fifty). Of course, gs0 looks black, but if you look carefully, Dougal, you'll see it's actually a very, very, very, very dark grey*. And gs16 is simply white. 

And there is a second user-written command, by Seth T. Lirette, that I rather like for its elegant output:

. ssc install hue
. hue



*For non-Irish people, this is a Father Ted joke. Don't worry about it.

10 Steps to successful research : 1 – Know the current state of knowledge


The research process begins by identifying a gap in our knowledge or our understanding. It doesn't matter whether we're talking about scientific research or real life. In real life, for example, you might be going to Lisbon and you need to find a good, cheap hotel near the city centre. Or you might need to figure out how to make carrot soup. But hold onto this idea: research fills a gap in our knowledge. If you don't have a knowledge gap, you're not doing research, you're just noodling around on the internet. 

Scientific research involves adding to knowledge. In order to do this, you must know the current state of knowledge, the current theoretical approaches and current best practice in terms of measurement. 

All biostats people have the experience of the person who comes in with a great research idea that looks like this:

The person: I have sixteen patients with rapid cycling mood disorder
Me : So what are you going to research?
The person: The patients with rapid cycling mood disorder
Me : No, I meant what question are you going to research. What do we not know about rapid cycling mood disorder?
The person : Oh…

Of course, those sixteen patients are a research opportunity. But they aren't a research project until we can find a question that will add to our knowledge, and that can be answered with sixteen patients. Often our job supporting student research is to help the student identify the research opportunities in their environment and then to see if any of these opportunities can be used to study a question that we need answered. 

The introduction to your research paper should do three things:
1. It should outline the current state of knowledge.
2. It should identify a gap in that knowledge and
3. It should state the research question in clear, simple language.

Being able to write the first two sections is critical. There will be no step 3 – no research question – without the first two steps. 

But what about a great research question that just sort of pops into your head, I hear you ask? 

Two things: first, this question may have a well-known answer. You need to know the literature to avoid duplicating work already done.
The second is to do with connectedness. Research is like a jigsaw. The best contributions are made by people who find the edge of the work in progress and join up with it. Science advances because each piece of research links into the existing body of knowledge like a jigsaw piece. 

So find out where the edge of our knowledge is. That's where you need to go to work.