BCSS: Statistical Software

Thursday, 5 November 2015

RStudio – an interface for R for people who hate the R interface

No-one would call the R interface pretty. In fact, it's the sort of interface that strikes terror into the heart of new users. The trouble is that many people find themselves trying to use R because you can do something in R that you cannot do in any other package. These users find that R is just so different to anything else they have used that they can spend days – weeks, even – just trying to figure out how to get their data into the blasted thing.

There have been a few attempts to improve the user experience, though the feeling in the R community seems to be that R does statistics, and that a nice user interface is a low priority.

I've been experimenting with RStudio recently. As an interface, I've found it much easier to work with than anything I tried previously. Here's what it looks like:

The bottom left shows you your output. On the top left, you can see that I'm browsing a small table, and on the top right you can see the contents of my R workspace. I like this, because R's ability to have multiple datasets available at once is a strength. Being able to browse them and inspect them is pretty useful. The bottom right shows a very useful pane that you can use to manage files, plots and packages. Clicking a package name opens the help file in the help tab.

Command tips appear as you type a command – no, it doesn't give you dialogues for commands, but the tips are very useful.

Will this make R as easy as Stata? No, clearly. But it makes it a lot easier. And for that, you may well be grateful.

RStudio is also under pretty active development, and has improved noticeably over the couple of months I've been using it. Worth a try, then.

Thursday, 22 January 2015

Making graphs of tables

Seeing tables as graphs

We often put tables into papers by reflex. Making them is a dull activity because, I suspect, there is the sense that no-one reads them. And there’s a very good reason for this: while tables are a very good resource, they are lousy communicators.

Tables: lousy communicators

Here is a table of hair and eye colour:

. use "Hair and eye colour.dta"
(Hair and Eye Colour, Caithness, from Tocher (1908))

. tabulate eye_colour hair_colour [fweight = freq]

           |                      Hair colour
Eye colour |      Fair        Red     Medium       Dark      Black |     Total
-----------+-------------------------------------------------------+----------
      Blue |       326         38        241        110          3 |       718
     Light |       688        116        584        188          4 |     1,580
    Medium |       343         84        909        412         26 |     1,774
      Dark |        98         48        403        681         85 |     1,315
-----------+-------------------------------------------------------+----------
     Total |     1,455        286      2,137      1,391        118 |     5,387

You have to be pretty determined to make any sense of the table. Indeed, to do so requires somehow digesting the information from 20 numbers, most of which are three-digit numbers. This is pretty much guaranteed to be beyond the working memory capacity of the average human.

And no, percentages don’t help much:

. tabulate eye_colour hair_colour [fweight = freq], column nofreq

           |                      Hair colour
Eye colour |      Fair        Red     Medium       Dark      Black |     Total
-----------+-------------------------------------------------------+----------
      Blue |     22.41      13.29      11.28       7.91       2.54 |     13.33
     Light |     47.29      40.56      27.33      13.52       3.39 |     29.33
    Medium |     23.57      29.37      42.54      29.62      22.03 |     32.93
      Dark |      6.74      16.78      18.86      48.96      72.03 |     24.41
-----------+-------------------------------------------------------+----------
     Total |    100.00     100.00     100.00     100.00     100.00 |    100.00
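If you don't have the data file to hand, the counts can be typed straight in from the first table. What follows is only a sketch: the numeric codes and the label names (eye_lbl, hair_lbl) are my own assumptions for illustration, and the original file may well be coded differently.

```stata
* Sketch: re-enter the Caithness counts from the table above.
* The numeric codes and the label names eye_lbl / hair_lbl are
* illustrative assumptions, not the coding of the original .dta file.
clear
input byte eye_colour byte hair_colour int freq
1 1 326
1 2  38
1 3 241
1 4 110
1 5   3
2 1 688
2 2 116
2 3 584
2 4 188
2 5   4
3 1 343
3 2  84
3 3 909
3 4 412
3 5  26
4 1  98
4 2  48
4 3 403
4 4 681
4 5  85
end

* Attach value labels so the tables print names rather than codes
label define eye_lbl  1 "Blue" 2 "Light" 3 "Medium" 4 "Dark"
label define hair_lbl 1 "Fair" 2 "Red" 3 "Medium" 4 "Dark" 5 "Black"
label values eye_colour eye_lbl
label values hair_colour hair_lbl
label variable eye_colour "Eye colour"
label variable hair_colour "Hair colour"

* Counts, then column percentages, as in the two tables above
tabulate eye_colour hair_colour [fweight = freq]
tabulate eye_colour hair_colour [fweight = freq], column nofreq
```

Each row of the data is one cell of the table, which is why every command here carries the frequency weight.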

Stacked bar charts

Here, instead, is what happens when we graph the data:

catplot eye_colour hair_colour [fw=freq], name(catplot,replace) ///
        asyvars stack percent(hair) legend(rows(1) stack)

The stacked bar chart shows the trend of dark-to-light running from top left to bottom right. This shows the breakdown of eye colour within each hair colour, but tells us nothing about the distribution of hair colour.

This is done with Nick Cox’s command catplot. Download it from the SSC archive:

. ssc install catplot
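If you would rather not install anything, official Stata's graph bar should get you something very similar. This is a sketch of an alternative, not the command used for the graph above, and it assumes the same three variables (eye_colour, hair_colour, freq) are in memory; the graph name stackedbar is just an illustrative choice.

```stata
* Alternative sketch using official graph bar instead of catplot.
* (sum) freq totals the cell counts; asyvars turns the first over()
* variable (eye colour) into the stacked segments; percentages rescales
* each hair-colour bar so that its segments add up to 100.
graph bar (sum) freq, over(eye_colour) over(hair_colour) ///
        asyvars stack percentages legend(rows(1)) name(stackedbar, replace)
```

The bars will be vertical unless you swap graph bar for graph hbar, so the layout may not match the catplot version exactly.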

Spineplots (mosaic plots)

Spine plots (also called mosaic plots) are a very effective way of visualising tables. You will have met stacked bar charts, but you may not have heard of spine plots.

A spineplot will show both the distribution of hair colour, and the distribution of eye colour within hair colour:

spineplot eye_colour hair_colour [fw=freq], percent

The hair colours are shown as columns, and we can see that red hair and black hair are much rarer in this population (Scotland, early 20th century) than fair, medium and dark. And the relationship with eye colour is now very evident – the colour changes from bottom left (fair hair, light or blue eyes) to the top right (dark or black hair, dark eyes).
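What makes the rarity of red and black hair visible is that a spineplot draws each column with a width proportional to its share of the sample. A quick way to see the numbers behind those widths, assuming the same variables as above, is a one-way table of hair colour:

```stata
* The percentages in this one-way table are the shares of the sample
* in each hair colour, i.e. the relative column widths of the spineplot.
tabulate hair_colour [fweight = freq]
```

From the totals row of the first table these shares work out at roughly 27% fair, 5% red, 40% medium, 26% dark and 2% black, which is why the red and black columns are so narrow.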

Do you need a graph rather than a table?

The tables above contain the relationship but they don’t show it. And even if you are determined to find it, there are simply too many numbers in the table for any normal person to hold them all in working memory and make sense of the pattern.

The spine plot, on the other hand, shows the relationship with little work needed on the part of the reader. It doesn’t record the exact percentages. If you simply need to record the exact percentages for reference, then a table is better, but if you want to communicate a pattern, then there’s no question: the graph wins hands-down.

This is done with Nick Cox’s command spineplot. Download it from the SSC archive:

. ssc install spineplot

Monday, 28 April 2014

What stats packages should we be investing in?

Statistical software is in a phase of rapid change. The graph on the left shows the citation rates for the major statistical packages on Google Scholar. It's taken from an interesting post here.

It is clear that the traditional packages – SAS and SPSS – are in sharp decline (though the decline in SPSS seems to have slowed).

But what is also clear is that both Stata and R are increasing solidly. Both of these packages have a key advantage over the competition: users can write commands that add to the package's functionality. This means that you are likely to be able to access cutting edge procedures in these packages before they are implemented (if they are implemented at all) in other packages.

From the teaching point of view, both are attractive too. While SPSS and SAS have annual licenses, Stata student licenses are permanent, and R is, of course, free. With SAS or SPSS, a graduating student rapidly ends up unable to afford to renew the license for the software and so finds they have learned a package they cannot now use. Not a problem with R – it's open source – and Stata users will be able to keep whatever version they have indefinitely.

However, a major consideration is research output. While there is considerable interest in Stata among PIs in College, a package that emerged with a solid and enthusiastic fan base was GraphPad Prism. BCSS gets a small but steady trickle of requests for support with Prism, which is a package ideally suited to the workflow (and thought flow!) of lab-based researchers. And both we and the users agree that the manuals are remarkably well written and informative. So we plan on supporting Prism as long as this enthusiasm continues.

College will have to consider seriously how we invest money in software, taking into account the teaching potential, the research productivity and the accumulated expertise. We will be canvassing opinions, both by survey and by contacting PIs, but in the meantime we would be interested to hear from people with views or interests.