Tuesday 20 October 2015

Missing data: never, never use 99!

The myth of the 99

My heart sinks every time I come across data with 9, 99, 999, 9999 and other real numbers used to indicate missing data. There are still supervisors (though they are getting pretty old by now) who advise students to do this. It's a myth particularly prevalent among SPSS users.

Origins

The use of actual numbers as missing values takes us way back to the seventies, when computers were run by punched card. They looked like this:


And yes, that's SPSS on those cards!
When you were building a dataset, you had to tell the computer what type each variable was, and how much storage space it needed. Numeric variables could only contain numbers, and so researchers had a problem: what happened when the information was missing?
The SPSS solution was to declare one of the numbers to be a missing value. This overcame the problem of storing missing values in a column of numeric data, but it also opened the floodgates to a lot of really risky calculations: if you forgot to tell SPSS about your missing value, the number would be treated as real data.
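To see how much damage a forgotten missing-value code can do, here is a minimal sketch in Python. The ages are made up for illustration, and 99 stands in for "not recorded".

```python
from statistics import mean

# Hypothetical ages; 99 was typed in wherever the age was not recorded.
ages_coded = [34, 41, 99, 29, 99, 38]

# Forgetting to declare 99 as missing: it is averaged as if it were a real age.
naive_mean = mean(ages_coded)

# Declaring 99 as missing: exclude it before calculating.
MISSING_CODE = 99
valid_ages = [age for age in ages_coded if age != MISSING_CODE]
correct_mean = mean(valid_ages)

print(naive_mean)    # inflated by the two 99s
print(correct_mean)  # based on the four real ages only
```

Two stray 99s in six cases are enough to push the average age up by more than twenty years, and nothing in the output warns you that it happened.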

Myth: you must have a numeric missing value

Every modern statistics package since Bob Dylan was alive in a meaningful sense has been able to handle missing values. By this I mean that if you leave a blank, it is correctly interpreted. It will automatically be assigned a missing value by the package. 
Let me repeat: you do not have to have special numeric missing values.

Missing – just leave it blank

So if you have missing data, just leave it blank. Your stats package knows what to do. If you need to know why the data were missing, then create a separate variable that codes the reasons. If the reasons are worth analysing, then they are worth coding properly. 
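As a rough sketch of what this looks like in practice, the Python below reads a small made-up data file in which blank cells mean missing and a second column records the reason. The column names (`weight_kg`, `weight_missing_reason`) and the reason labels are invented for illustration; a real stats package does the blank-to-missing conversion for you.

```python
import csv
import io

# Hypothetical survey extract: a blank weight means the value is missing,
# and weight_missing_reason (an illustrative name) records why.
raw = """id,weight_kg,weight_missing_reason
1,72.5,
2,,refused
3,64.0,
4,,dont_know
"""

def to_number(text):
    """Convert a cell to float, treating a blank cell as missing (None)."""
    text = text.strip()
    return float(text) if text else None

rows = list(csv.DictReader(io.StringIO(raw)))
weights = [to_number(row["weight_kg"]) for row in rows]
reasons = [row["weight_missing_reason"] or None for row in rows]

# Only observed values enter the calculation; no magic number can leak in.
observed = [w for w in weights if w is not None]
mean_weight = sum(observed) / len(observed)

print(weights)
print(reasons)
print(mean_weight)
```

The measurement column holds only real measurements or blanks, while the reasons sit in their own column where they can be tabulated like any other variable.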

1 comment:

  1. When doing clinical studies, it is essential to distinguish between a missed question and one that doesn't need, doesn't have, or isn't given an answer. Usually, there are several "codes" for these types of data; "don't know" and "refused to answer" are the most common.
