New Study on Thimerosal and Neurodevelopmental Disorders: I. Scientific Fraud or Just Playing with Data?

Published May 22nd, 2008 in Autism, Child Health, Infant Health, Medical & Epidemiological Studies
Two days ago I came across the paper, “Thimerosal exposure in infants and neurodevelopmental disorders: an assessment of computerized medical records in the Vaccine Safety Datalink,” by Heather Young, David Geier, and Mark Geier, which is listed as an Article in Press in the Journal of the Neurological Sciences. This study has a lot of problems, and I predict that it will take me at least five posts to go through the article point by point and explain all the flaws. However, there’s one trick the authors play that’s so glaring I have to point it out immediately. In fact, I’ve spent two days in shock that the journal editor and reviewers let them get away with it.
Since extremely dubious “imputing” or imputation lies at the bottom of the authors’ little trick, let’s back up a moment and define imputation. My mother always told me not to trust everything I read in Wikipedia, but Wikipedia does give a pretty good definition of imputation: “the substitution of some value for a missing data point.” “To impute” is just the verb form of imputation. In economics and sociology, imputation is often used for income values. For example, a sociologist may be trying to collect data on social class. People are willing to provide information on their education, their job, and other social characteristics, but often they will refuse to provide information on their income. So when analyzing her data, the sociological researcher may use the person’s education, occupation, and other characteristics to impute the person’s income value. (Methods for imputation have become very complex and sophisticated. The most common are hot-deck imputation and multiple imputation.)
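For the curious, the hot-deck idea can be sketched in a few lines of Python. Everything here — the records, the groups, the incomes — is invented purely for illustration; it is not anyone’s actual survey data or method:

```python
import random

random.seed(0)  # reproducible donor selection for this illustration

# Toy survey records: income is sometimes missing (None).
records = [
    {"education": "college", "income": 62000},
    {"education": "college", "income": 71000},
    {"education": "high school", "income": 38000},
    {"education": "high school", "income": 41000},
    {"education": "college", "income": None},      # refused to answer
    {"education": "high school", "income": None},  # refused to answer
]

def hot_deck_impute(records, group_key, target):
    """Fill each missing target value with a randomly chosen donor
    value from the same group -- a simple hot-deck imputation."""
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(r[group_key], []).append(r[target])
    for r in records:
        if r[target] is None:
            r[target] = random.choice(donors[r[group_key]])
    return records

hot_deck_impute(records, "education", "income")
print([r["income"] for r in records])
```

Note what the procedure does and does not do: it fills in a missing *value* on an *existing* record. It never adds a new record to the data set.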
You may be wondering why I’ve been emphasizing the word “value.” Note that our definition of imputation is “the substitution of some value for a missing data point.” So let’s say a researcher has a file of data on children and 8% are missing data values on parents’ household income, 4% are missing data values on gestational age at birth, and 1% are missing data values on birth weight. She decides to use an imputation procedure to impute values for parental income, gestational age, and birth weight where they were missing. Perfectly fine, legitimate, and scientifically valid under most circumstances. However, let’s say the outcome she’s interested in is autism. She examines the data and sees that in certain cohorts in her study population the distribution of autism isn’t quite what she would like. So she “imputes” autism cases into the data set. Except that she’s not imputing a value on a variable for an existing study participant. She’s adding imaginary autism cases into the analysis. This isn’t imputation — it’s cooking the data. Sorry folks, but when you have a data set that comes from the real world with a certain number of cases and non-cases, the data are what they are.
Let’s quote directly from the Young, Geier & Geier paper to make sure I have this right. “Because of concern that the cohorts from 1995-1996 had only 4-6 years of follow-up, frequency distributions of age at diagnosis were examined for all years. This revealed that for some of the disorders a sizable proportion of children were diagnosed after 4.5 years. Adjustments were made for counts of cases as needed for birth cohorts depending upon the disorder examined to correct for under ascertainment that occurred due to shorter follow-up times. These adjustments were made for all disorders including the control disorders as appropriate based on the age distribution….”
“For example, 37% of autism cases in the study were diagnosed after 5 years old with about 50% diagnosed after 4.5 years old. This is a conservative estimate since it includes the 2 years (1995-1996) that had shorter follow-up times. Examination of the distribution of age of diagnosis by birth year for autism revealed that only about 15% of cases were diagnosed after 5 years of age in the 1995 birth cohort while the 1996 birth cohort had no cases diagnosed after 5 years of age and only 3.5% of cases diagnosed between 4.5 and 5 years of age. Based on the average age at diagnosis for all cohorts the 1995 count of autism cases was increased by 45 cases with the assumption that all of these would have been added in the 5 year+ age group (bringing this percentage close to the overall average of 37% diagnosed after 5 years of age.) The same was done for 1996, but the number of cases was augmented by 80 because it was assumed that these would be diagnosed in the 4.5 to 5 and 5+ groups essentially bringing the percentage after age 4.5 close to the overall average of 50% diagnosed after 4.5 years of age. The new augmented frequency counts of cases in 1995 and 1996 birth cohorts were then used as new case counts in the analysis.”
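To see the kind of arithmetic the quoted passage describes, here’s a toy sketch in Python. The cohort size and the 15% late-diagnosis figure are hypothetical stand-ins (the excerpt doesn’t give the raw counts), but the sketch shows the mechanics: imaginary cases are added to the oldest age bin until the “diagnosed after age 5” percentage approaches the overall 37% target.

```python
# Hypothetical 1995 cohort: 100 autism cases, 15 of them (15%)
# diagnosed after age 5 -- numbers invented for illustration only.
cases_1995 = 100
late_1995 = 15

# The paper's adjustment, as described: add imaginary cases, all
# assumed to fall in the 5+ age group, until the late-diagnosis
# fraction nears the overall average of 37%.
added = 0
while (late_1995 + added) / (cases_1995 + added) < 0.37:
    added += 1

pct_late = (late_1995 + added) / (cases_1995 + added)
print(added, round(pct_late, 2))
```

With these toy numbers, 35 phantom cases get appended to a cohort of 100 real ones — more than a third of the final count never existed in the data.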
This is just not done. It’s not valid. It’s not ethical. Adding imaginary cases into a data set borders on scientific fraud. I’ve been trying to wrap my mind around some sort of rationale for the authors “imputing” extra cases, and to me it’s just fudging the data. What they’ve done bears some relationship to a procedure called “direct age standardization,” but age standardization might be useful in a situation where investigators were comparing birth cohorts — not where the birth cohorts are the units of analysis (more on this “units of analysis” concept later). I don’t think this is outright scientific fraud, for two reasons. First, they carried out this procedure of “imputing” imaginary cases for the control disorders as well as for autism and five other neurodevelopmental disorders. (I’ll explain this in more detail in upcoming posts.) Second, they come right out and admit that they cooked the data by adding imaginary cases — it’s not as if they’re trying to hide anything.
Anyway, this isn’t even the biggest flaw in the paper. It just gives me the creeps that they would do such a thing. In fact, I doubt that this little trick had much effect on the rate ratios reported in Table 3. The authors do state that “sensitivity analyses revealed that point estimates were similar even when imputing [sic] 50% fewer cases than would be expected using the average age distributions as noted above.” However, as I said above, this study has a lot of problems. The “rate ratios” reported in Table 3 are surely invalid for two reasons (among others):
- The “ecological” study design is strange, weird, and downright bizarre. It’s true that the authors could not link the separate data files, but this “ecological” design was not necessary. Instead of using a total “population at risk” of 278,624 children, the authors should have used person-time (e.g., person-months) in the denominator to calculate true rates. This is the standard approach in epidemiological studies with “right-censored” data, i.e., in this case, children who might eventually be diagnosed with a neurodevelopmental condition, but who had not been followed up long enough.
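Here is a quick sketch of what the person-time approach looks like, with made-up follow-up data. For simplicity, each child’s follow-up time is taken as the months until diagnosis or censoring — a real analysis would track these separately:

```python
# Hypothetical children: (months of follow-up, diagnosed?).
# Later birth cohorts are censored earlier, so they contribute
# less person-time -- exactly the problem with a head-count denominator.
children = [
    (72, True), (72, False), (54, False),
    (48, True), (48, False), (36, False),
]

cases = sum(1 for _, dx in children if dx)
person_months = sum(months for months, _ in children)

# Naive "prevalence": cases / head count. This treats a child
# followed for 36 months the same as one followed for 72 months.
naive = cases / len(children)

# Incidence rate: cases per 1,000 person-months at risk. Children
# with short follow-up correctly contribute less to the denominator.
rate = cases / person_months * 1000

print(round(naive, 3), round(rate, 3))
```

The head-count denominator systematically understates the rate in the short-follow-up cohorts; the person-time denominator handles the censoring without inventing any cases.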
- Despite appearances, from a statistical point of view this is not an analysis of 278,624 children. The “ecological” analysis actually comes down to a regression analysis of a sample size of SEVEN (7) units — the seven birth cohorts. Picture a scatter plot of 7 points where the X axis is Hg dose, the Y axis is the prevalence of a given disorder, and the 7 points are where the mean Hg dose for each birth cohort intersects the prevalence for that birth cohort. Aside from the fact that a regression analysis based on an N of 7 is unstable and not robust at all, it has been known in the social sciences since 1950 and in epidemiology since about 1973 that in general, regression estimates from ecological analyses tend to be hugely magnified compared to individual-level analyses. (By individual-level analysis I simply mean the type of study where individual exposure data and individual-level outcome data are used in the analysis for every study participant.)
I will be discussing these issues in much more detail in my next few posts. If you have any questions, don’t hesitate to comment.