Friday, January 20, 2012

The Adventures of DataThief

DataThief by Jed Pascoe: Reproduced with the artist's permission

This post was chosen as an Editor's Selection for ResearchBlogging.org

Having spent much of the past week struggling to make sense of my data, it’s good to come home, pour a glass of wine, put on some Sharon Jones, and, er… play with somebody else’s data!

Recently, I’ve discovered DataThief - an application that allows you to scan in a graph from a paper and extract the data points. Sometimes, this provides insights that really aren’t obvious from the original paper.

The other week, for example, I came across an intriguing neuroimaging study reported on the SFARI website. In the study, Judith Verhoeven and colleagues used diffusion tensor MRI to examine the superior longitudinal fasciculus, a bundle of nerve fibres that is assumed (although see this paper) to connect two brain regions involved in language production and comprehension - Broca’s area (left front-ish) with Wernicke’s area (left and back a bit).

Verhoeven et al. reported that integrity of the superior longitudinal fasciculus was compromised in kids with specific language impairment (SLI) – that is, kids who have language difficulties for no obvious reason. However, the same was not true of kids with autism, even though they had poorer language skills than those with SLI.

Taken at face value, this is a pretty major blow to the idea that autism and SLI have anything more than a superficial resemblance [pdf].

DataThief, however, suggests otherwise.

The figure below is a scatterplot with each coloured shape representing a single child. On the x-axis is performance on a language test. On the y-axis is fractional anisotropy (FA) – the imaging measure used to assess the integrity of the left superior longitudinal fasciculus.

Figure 3a from Verhoeven et al. 2011, showing integrity of the left superior longitudinal fasciculus plotted against the child's language scores (z-scores). Children with SLI are shown in red; autistic kids are the blue squares. Control children are the green and blue circles.


The purpose of the graph was to show the significant correlation between these two measures in the SLI group. But if we can read off the y-coordinates of each shape, we can show the distribution of fractional anisotropy scores for all three groups.

Cue DataThief.

It’s really just a case of clicking on three reference points for which you know the coordinates and then clicking on each of the data points in turn. Then you simply export the coordinates of the data points as a text file. The only thing I had to remember was to do the three groups separately so I knew which point belonged in which group.
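
For anyone curious about what those three reference clicks are for, here is a minimal sketch in Python. It is not DataThief's actual code, and I don't know how DataThief does it internally. It assumes both axes are linear, in which case three points that don't lie on one line are enough to pin down the mapping from pixel positions in the scanned image to graph coordinates. The function names and all the pixel and axis values below are made up for illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Solve the 3x3 linear system m * v = rhs using Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("reference points must not lie on one line")
    solution = []
    for col in range(3):
        # Swap column `col` of m for the right-hand side.
        replaced = [row[:col] + [rhs[r]] + row[col + 1:] for r, row in enumerate(m)]
        solution.append(det3(replaced) / d)
    return solution


def fit_calibration(pixel_refs, data_refs):
    """Fit x = a*px + b*py + c and y = d*px + e*py + f from three reference points."""
    m = [[float(px), float(py), 1.0] for px, py in pixel_refs]
    x_coeffs = solve3(m, [x for x, _ in data_refs])
    y_coeffs = solve3(m, [y for _, y in data_refs])
    return x_coeffs, y_coeffs


def pixel_to_data(calibration, point):
    """Convert one clicked pixel position into graph coordinates."""
    (a, b, c), (d, e, f) = calibration
    px, py = point
    return (a * px + b * py + c, d * px + e * py + f)


# Hypothetical reference clicks: pixel positions and the axis values read off there.
pixel_refs = [(100, 400), (500, 400), (100, 100)]
data_refs = [(-3.0, 0.3), (1.0, 0.3), (-3.0, 0.5)]

calibration = fit_calibration(pixel_refs, data_refs)
x, y = pixel_to_data(calibration, (300, 250))
print(x, y)
```

Because the fit uses three points rather than two, it also copes with a scan that is slightly rotated or skewed. A graph with a logarithmic axis would need a different mapping.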

Here’s the fractional anisotropy data replotted to show the distribution for each group. What we can now see is that there is a small subgroup of control kids who have really high FAs. There is also one autistic kid and one kid (arguably two) with SLI who have low FAs. Everyone else is pretty much in the middle.

Verhoeven et al.'s data replotted to show the distribution of fractional anisotropy for each group


On average, kids with SLI have lower than ‘normal’ fractional anisotropy [1], but looking at the spread of scores, you’d be hard pressed to conclude that this was a characteristic of SLI. Likewise, the overlap between the distributions of the autism and SLI groups is almost complete. Hardly evidence for fundamentally different neural mechanisms in the two disorders. 

At the risk of sounding like a broken record, this once again highlights the importance of looking at individual variation within diagnostic groups such as autism and SLI, rather than (or as well as) looking at group averages.

But it also emphasizes a more general point (and this I have to stress is no criticism of the authors of this particular paper).

The data reported in a journal article are really just a snapshot of the actual data recorded, filtered through the authors’ preconceptions about which questions are interesting to ask and how to go about answering them. There’s an imperative to present the data in a neat, sanitized package, with all the rough edges and anomalies smoothed out; to tell a coherent story that will convince reviewers and editors that it’s worthy of publication in a reputable journal. Years of work and terabytes of data may be compressed into just two or three pages.

DataThief only takes us so far. It allows us to extract the information presented visually in the published article, but no further.

Most of the past week has been spent convincing myself that it doesn’t really matter how I analyse my data because the results come out the same regardless. This is reassuring for me, but it doesn’t mean that somebody else, looking at my data with fresh eyes and a different perspective, would not come to an entirely different set of conclusions.

In an ideal world, when a paper is published, researchers should also be able (and encouraged) to publish the data on which the paper is based, as well as the script showing exactly how those data were analysed.

There are, of course, many obstacles in the way and questions to be answered before this becomes standard practice. Who would host and maintain the data? Just how raw should the raw data be? What if the authors are writing multiple papers based on the same data set? Who gets credit for reanalyses of the data set? What happens if a reanalysis shows up an error in the original paper? If the research involves human participants, how do we reassure them that their anonymity will be maintained?

Undoubtedly, there are many more problems that I haven't thought of. But, as scientists, we need to work through these issues and find ways to set our data free.


Footnotes:

[1] The analyses involved an ANOVA with left and right hemisphere as a within-subjects factor. This showed a main effect of group, but no group by hemisphere interaction.


Update:

Originally, I linked to the wrong SFARI article in the third paragraph. That's now fixed. The one I mistakenly linked to reports a conference presentation that does indicate atypical connectivity between language regions in the brains of nonverbal autistic kids (although not the same pathway as examined by Verhoeven et al.).


Reference:

Verhoeven, J., Rommel, N., Prodi, E., Leemans, A., Zink, I., Vandewalle, E., Noens, I., Wagemans, J., Steyaert, J., Boets, B., Van de Winckel, A., De Cock, P., Lagae, L., & Sunaert, S. (2011). Is There a Common Neuroanatomical Substrate of Language Deficit between Autism Spectrum Disorder and Specific Language Impairment? Cerebral Cortex DOI: 10.1093/cercor/bhr292

2 comments:

  1. Jon;
    Excellent analysis on the apparent widespread problem of data theft. The paper you referenced used diffusion tensor imaging that measures the flow of water and tracks the pathway of white matter in the brain (the paper has a pay wall and I can only refer to the abstract). Another SFARI article referred to newly recognized problems with fMRI, which directly measures the blood flow, providing information on brain activity.

    The SFARI fMRI article states:

    ‘In a study published 14 October, researchers reanalyzed data from several of their own functional connectivity studies after correcting for head motion and found that this maturation pattern usually disappears once head motion is taken into account’.

    “It really, really, really sucks. My favorite result of the last five years is an artifact,” says the lead investigator, a professor of cognitive neuroscience at Washington University in St. Louis.

    Head movement in fMRI can produce false positives and spurious results unless appropriate controls are used to correct for it. Even when a subject's head does not move during an fMRI session, background noise can still produce spurious results and false positives, as was demonstrated in a study posted a few years ago. The subject was shown a series of photographs depicting human individuals in social situations and was asked to determine what emotion the individual in the photo must have been experiencing.

    The subject of the study was an 18-inch-long, 3.8-pound dead salmon:


    http://prefrontal.org/files/posters/Bennett-Salmon-2009.jpg

    In the discussion the authors asked ‘Can we conclude from this data that the salmon is engaging in the perspective-taking task? Certainly not. What we can determine is that random noise in the EPI timeseries may yield spurious results if multiple comparisons are not controlled for’.

    Your comment that ‘Taken at face value, this is a pretty major blow to the idea that autism and SLI have anything more than a superficial resemblance’ is one I would agree with. One of the features that may distinguish language difficulties between autism and SLI kids is the presence of echolalia in verbal autistic children. Damaged neuronal circuitry is also associated with echolalia in brain tumor, stroke and Alzheimer’s patients (Blair et al. 2007; Yang et al. 1989; Endo et al. 2001).

    http://www.ncbi.nlm.nih.gov/pubmed/11296406

    http://www.ncbi.nlm.nih.gov/pubmed/2473823

    http://www.ncbi.nlm.nih.gov/pubmed/20964503

    1. Thanks Raj. I'm a little concerned that I wasn't clear in the post. I don't think data theft is a problem - I'm the one doing the stealing!!! The rambling at the end was all about trying to make scientific publications more open so we don't have to steal data.

      I'll email you the paper shortly. I really like the paper, and I'm still a little confused about how they ended up with their significant group differences. I didn't look at the right hemisphere so it may be that it's actually the right hemisphere driving things - although as I footnoted, there wasn't a group by hemisphere interaction.

      I love the SFARI blog but I do wince every time they talk about water "flowing" through the brain. Essentially, DTI relies on the fact that water molecules jiggle around (I don't know if you remember doing Brownian motion in science at school - it's the same thing). But the nerve fibres that make up the white matter in the brain constrain the motion of the water molecules, so they tend to end up jiggling in a direction parallel to the fibres - but they're not actually "flowing" anywhere. Fractional anisotropy is a measure of how much the water molecules in a voxel of the brain are constrained.

      The early studies (e.g. Barnea-Goraly 2004) just looked at fractional anisotropy voxel by voxel. Verhoeven et al. did something quite a lot more sophisticated (from what I understand), where they first traced the superior longitudinal fasciculus, then took a slice through it and worked out the fractional anisotropy for that cross-section.

      Like all of these methods, it's not perfect, which is why the best evidence is converging evidence from different techniques.
