The analysis of 168 randomised controlled trials to test data integrity
Corrections added after online publication. 11 April 2012. 169 studies has been changed to 168 studies throughout; 142 studies has been changed to 141 studies on pps 521, 523, 527, 528, 531; 97/135 human studies has been changed to 96/134 on pps 521,528.
Summary
The purpose of this study was to use some statistical methods to assess if randomised controlled trials (RCTs) published by one particular author (Fujii) contained data of unusual consistency. I searched seven electronic databases, retrieving 168 RCTs published by this author between 1991 and July 2011. I extracted rates for categorical variables and means (SDs) for continuous variables, and compared these published distributions with distributions that would be expected by chance. The published distributions of 28/33 variables (85%) were inconsistent with the expected distributions, such that the likelihood of their occurring ranged from 1 in 25 to less than 1 in 1 000 000 000 000 000 000 000 000 000 000 000 (1 in 10^{33}), equivalent to p values of 0.04 to < 1 × 10^{−33}, respectively. In 141 human studies, 13/13 published continuous variable distributions were inconsistent with expected, their likelihoods being: weight < 1 in 10^{33}; age < 1 in 10^{33}; height < 1 in 10^{33}; last menstrual period 1 in 4.5 × 10^{15}; baseline blood pressure 1 in 4.2 × 10^{5}; gestational age 1 in 28; operation time < 1 in 10^{33}; anaesthetic time < 1 in 10^{33}; fentanyl dose 1 in 6.3 × 10^{8}; operative blood loss 1 in 5.6 × 10^{9}; propofol dose 1 in 7.7 × 10^{7}; paracetamol dose 1 in 4.4 × 10^{2}; uterus extrusion time 1 in 33. The published distributions of 7/11 categorical variables in these 141 studies were inconsistent with the expected, their likelihoods being: previous postoperative nausea and vomiting 1 in 2.5 × 10^{6}; motion sickness 1 in 1.0 × 10^{4}; male or female 1 in 140; antihypertensive drug 1 in 25; postoperative headache 1 in 7.1 × 10^{10}; postoperative dizziness 1 in 1.6 × 10^{6}; postoperative drowsiness 1 in 3.8 × 10^{4}. Distributions for individual RCTs were inconsistent with the expected in 96/134 human studies by Fujii et al. that reported more than two continuous variables, their likelihood ranging from 1 in 22 to 1 in 140 000 000 000 (1 in 1.4 × 10^{11}), compared with 12/139 RCTs by other authors. In 26 canine studies, the distributions of 8/9 continuous variables were inconsistent with the expected, their likelihoods being: right atrial pressure < 1 in 10^{33}; diaphragmatic stimulation (100 Hz) < 1 in 10^{33}; pulmonary artery occlusion pressure < 1 in 10^{33}; diaphragmatic stimulation (20 Hz) < 1 in 10^{33}; heart rate 1 in 6.3 × 10^{10}; mean pulmonary artery pressure 1 in 2.2 × 10^{14}; mean arterial pressure 1 in 6.3 × 10^{7}; cardiac output 1 in 110. Distributions were inconsistent with the expected in 21/24 individual canine studies that reported more than two continuous variables, their likelihood ranging from 1 in 345 to 1 in 51 000 000 000 000 (1 in 5.1 × 10^{13}).
Patterns are useful because they tell us something about the processes that create them. When patterns deviate from the expected, we know that something unusual has happened. Random allocation of individuals from a population into different groups distributes both categorical and continuous variables in predictable patterns, the centre, spread and shape of which are necessary consequences of the interaction between the sampled population and the sampling method.
For example, the variation in the means of continuous variables, such as age, depends upon: (i) the mean age of the sampled population; (ii) the population’s age distribution; and (iii) the size of the sample. The distribution of mean values for each continuous variable is normal (Gaussian), unless the population variable is both very asymmetric (‘skewed’) and the samples have been small (often quoted as < 30 individuals). The distribution of means in such cases will be slightly skewed and may cluster more or less tightly around the population mean.
Similarly, the variation in the proportions of binomial characteristics, such as sex, depends upon: (i) the proportions of each sex in the sampled population; and (ii) the size of the sample. The shape and asymmetry of binomial distributions change with these two variables.
Significant deviation from the expected occurrences of one binomial characteristic in the outcomes reported by one particular anaesthetic researcher was publicised by Kranke et al., commenting that: ‘Reported data on granisetron and postoperative nausea and vomiting by Fujii et al. are incredibly nice!’ [1]. Kranke et al. concluded by observing: ‘…we have to conclude that there must be an underlying influence causing such incredibly nice data reported by Fujii et al.’
Kranke et al. had looked at 47 randomised controlled trials (RCTs) of antiemetics to prevent postoperative nausea and vomiting (PONV), published between 1994 and 1999 by Dr Yoshitaka Fujii and colleagues (references 1–46; Appendix S1; available online, please see details at the end of the paper). Eighteen of these RCTs had reported postoperative rates of headache. Ten had reported the same rate of headache in every group; for instance, in one paper, Fujii et al. reported that they had randomly allocated 270 women to one of six groups (reference 1; Appendix S1). Eighteen of the 270 women had postoperative headaches: 3/45 in each of the six groups. Table 1 shows the reported and expected (i.e. by chance) rates of headache in such patients.
Women with a headache in a group of 45 | Groups reported with this incidence of headache in this study | Groups expected with this incidence if headaches were distributed randomly across groups |
---|---|---|
0 | 0 | 0.3 |
1 | 0 | 0.9 |
2 | 0 | 1.4 |
3 | 6 | 1.4 |
4 | 0 | 1.0 |
5 | 0 | 0.6 |
6 | 0 | 0.3 |
7 | 0 | 0.1 |
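The expected column of Table 1 can be reproduced with a short calculation. The sketch below is illustrative Python, not part of the original analysis; it assumes that each group of 45 is a binomial sample at the pooled rate of 18/270, and the function and variable names are hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent trials at rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


GROUPS = 6        # groups in the study
GROUP_SIZE = 45   # women per group
HEADACHES = 18    # headaches among all 270 women

# Pooled headache rate, used as the best estimate of the population rate
rate = HEADACHES / (GROUPS * GROUP_SIZE)

# Expected number of groups (out of six) reporting k headaches, k = 0..7
expected_groups = {
    k: GROUPS * binomial_pmf(k, GROUP_SIZE, rate) for k in range(8)
}

for k, value in expected_groups.items():
    print(f"{k} headaches: {value:.1f} groups expected")
```

Rounded to one decimal place, the printed values match the right-hand column of Table 1.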
Kranke et al. proceeded to reject the null hypothesis that 10/18 RCTs would report homogenous rates of headache by chance, calculating a probability of 6.8 × 10^{−9}, or 1 in 147 million. My slight concern is that this indirect calculation confused the probability of a particular incidence occurring with the probability that this incidence is consistent with the expected binomial distribution. Kranke et al. calculated the first probability, but it is the second that interests me more. This turns out to be ∼1 in 5600 for the distribution of headache reported in all 18 RCTs that Kranke et al. analysed: larger than Kranke et al.’s estimate, but still 280 times smaller than the p < 0.05 threshold conventionally regarded as statistically significant.
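The two odds quoted above can be converted to p values with simple arithmetic. The following lines are an illustrative Python check only; the variable names are hypothetical.

```python
# Kranke et al.'s estimate, as quoted: 1 in 147 million
p_kranke = 1 / 147e6

# The estimate for all 18 RCTs, as quoted: about 1 in 5600
p_author = 1 / 5600

print(f"Kranke et al.: p = {p_kranke:.1e}")
print(f"All 18 RCTs:   p = {p_author:.1e}")

# How many times smaller than the conventional 0.05 threshold
print(f"0.05 / p = {0.05 / p_author:.0f}")
```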
Kranke et al. had concluded that it was more likely that an ‘unnatural mechanism’ had obliterated the expected binomial distribution. Moore et al. mention these ‘suspect’ data in their editorial on scientific fraud [2].
My purpose in this study was to extend the statistical analysis of papers, begun by Kranke et al., to all RCTs published by Fujii. Identification of unnatural patterns of categorical and continuous variables would support the conclusion that these data depart from those that would be expected from random sampling to a sufficient degree that they should not contribute to the evidence base.
Methods
I searched the following databases (author ‘Fujii’) between 1991 and July 2011: the Cochrane Central Register of Controlled Trials (CENTRAL); MEDLINE; EMBASE; CINAHL; ISI WOS; LILAC; and INGENTA. I included RCTs authored by Dr Yoshitaka Fujii, identified as working at the University of Tsukuba Institute of Clinical Medicine, the Tokyo University Medical and Dental School, the Toride Kyodo General Hospital or the Toho University School of Medicine (with Fujii in any position in the list of authors). I analysed the integrity of the data in these RCTs and compared them with 366 RCTs by other authors (Appendices S2–S6; available online) [3].
From the retrieved studies, I extracted the following data: the number of participants or animals in each group; all continuous variables reported as mean (SD), for example age; and all categorical variables, such as the number of women. I included variables measured before or after exposure to the allocated intervention, as long as they had been unaffected by the exposure.
Appendix A details the generation of expected categorical and continuous distributions and their subsequent statistical analysis, whilst an example illustrates the method at the beginning of the Results section, below. Broadly, the analysis is based on two principles focused on the spread of values around the most common value (rather than simply comparing two averages). The first principle is for categorical distributions, which can best be understood by considering the results of tossing two coins: the most likely outcome (50%) is head and tail while two heads or two tails is less likely (25% each). The probability of departures from these frequencies (e.g. finding that two tails occur 75% of the time in a dataset) can be calculated. In this example, one knows that for a single coin, there is an equal chance a head and a tail will be tossed, so one knows the expected shape of the binomial distribution for two coins. For the categorical variables analysed in this paper, one does not know the expected rate for each outcome – for instance, how many men or women one might expect in a sample of patients having cholecystectomy. Fortunately, the answer is given for each study by each study itself – if in total 80/100 patients randomly allocated to five groups are women, then one would expect a binomial distribution that peaks at a female distribution of 16/20 women per group. In this important way, my analysis also looks at the spread of proportions in each group, not just the peak value.
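Both examples in the paragraph above, the two coins and the 80 women among 100 patients, follow directly from the binomial formula. The Python sketch below is illustrative rather than a reproduction of Appendix A; the names are hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent trials at rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Two fair coins: probability of 0, 1 or 2 heads
two_coins = [binomial_pmf(k, 2, 0.5) for k in range(3)]
print(two_coins)

# 80 women among 100 patients randomly allocated to five groups of 20:
# the pooled rate defines the expected distribution of women per group
pooled_rate = 80 / 100
group_size = 20
women_pmf = [
    binomial_pmf(k, group_size, pooled_rate) for k in range(group_size + 1)
]

# The peak of the expected distribution
most_likely = max(range(group_size + 1), key=lambda k: women_pmf[k])
print(f"Most likely number of women per group: {most_likely}/{group_size}")
```

The whole list `women_pmf`, not only its peak, is what a reported set of group counts would be compared against.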
The second principle of analysis, for continuous variables, is known as the ‘central limit theorem’. If we calculated the mean height of people in a sample, performed repeated sampling of other groups of people, and then plotted a series of these means, we would obtain a normal curve, even if the distribution of heights within the sampled population was not normal. The standard deviation of this curve can be estimated from the sample standard deviation and is called the standard error of the mean (SEM). The SEM is a measure of the extent to which the sampled means vary from the (true) population mean. The SEM is rather like a standard deviation of the sample means, illustrating their variation. Just as for binomial characteristics, the focus of the analysis in this paper is the spread of means, not the peak value. Authors occasionally mislabel standard deviations (SDs) as standard errors of the mean (SEMs), so I checked whether substitution of one by the other resolved apparently abnormal distributions.
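The relationship between the spread of sample means and the SEM can be illustrated by simulation. The sketch below is illustrative Python under stated assumptions: the population is taken to be exponential (deliberately skewed, with mean 1 and SD 1), and the sample size of 20 and the number of repetitions are arbitrary choices, none of which come from the paper.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 20   # individuals per sample (assumed)
N_SAMPLES = 5000   # number of repeated samples (assumed)


def draw_sample(n: int) -> list[float]:
    """Draw n values from a skewed population (exponential, mean 1, SD 1)."""
    return [random.expovariate(1.0) for _ in range(n)]


samples = [draw_sample(SAMPLE_SIZE) for _ in range(N_SAMPLES)]
sample_means = [statistics.fmean(s) for s in samples]

# Observed spread of the sample means
sd_of_means = statistics.stdev(sample_means)

# SEM predicted from the population SD, and estimated from one sample alone
theoretical_sem = 1.0 / SAMPLE_SIZE ** 0.5
sem_from_one_sample = statistics.stdev(samples[0]) / SAMPLE_SIZE ** 0.5

print(f"SD of sample means:     {sd_of_means:.3f}")
print(f"Population SD / sqrt(n): {theoretical_sem:.3f}")
print(f"SEM from one sample:     {sem_from_one_sample:.3f}")
```

The first two numbers should agree closely, whereas the estimate from a single small sample is noisier.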
In this paper, each RCT is its own standard: the expected distributions of categorical rates and continuous means for each RCT were generated from within the same RCT. For categorical variables, this mathematical coupling of expected-to-reported measurement actually reduces the power of my analysis to identify aberrant distributions, because the expected result is dependent upon the observed result. In other words, my method is ‘conservative’ and any finding of aberrant distributions using this technique suggests extremely aberrant data distribution. In contrast, my method of analysis of continuous variables can overestimate clustering of means in small samples due to imprecision of the reported means and SDs or a population distribution that might be skewed. I therefore applied a somewhat arbitrary (but again intentionally conservative) adjustment to reduce any clustering (Appendix A). Any finding of clustering after adjustment indicated that the data were strikingly clustered. Furthermore, I subjected to the same analysis 366 RCTs by authors other than Fujii et al.
Results
I identified and retrieved 168 RCTs published by Fujii and colleagues between 1991 and July 2011: 141 in humans (13 734 participants); 26 in dogs (688 mongrels); and one in guinea pigs (14 animals) (references 1–168; Appendix S1). This alone is a remarkable research output – over 600 patients per year – and offers much useful material for analysis (Fig. 1). The focus of the human RCTs was prevention of PONV in 91 [correction added after online publication. 11 April 2012; 92 has been changed to 91] (54%), pain on injection of propofol in 14 (8%), treatment of PONV in 13 (8%), neuromuscular blockade in 11 (7%), the cardiovascular response to airway manipulation in 9 (5%), and epidural analgesia, middle cerebral artery perfusion and postoperative hypoxaemia in one each (1%). Drug effects on diaphragmatic contractility were the focus of the guinea pig study and 23 canine RCTs (14%), whereas the remaining 3 (2%) canine studies focussed on haemodynamic effects of drugs.
In addition, I analysed 366 other RCTs: 126 RCTs by other authors that reported postoperative rates of headache following prophylactic antiemesis (Appendix S2; available online); 31 RCTs by other authors of PONV prophylaxis with granisetron (Appendix S3; available online); 100 RCTs by other authors of rescue rates for droperidol and metoclopramide (Appendix S4; available online); 65 additional RCTs by other authors that reported rates of side effects after PONV prophylaxis (Appendix S5; available online); 145 RCTs by other authors of PONV prophylaxis reporting age, height or weight of participants (Appendices S2–S6). Some RCTs contributed to more than one analysis.
An example
The following single study serves as an example to illustrate the analyses of categorical and continuous variables. In one paper, Fujii et al. reported the response of 100 adults with PONV (20 per group) to placebo or one of four intravenous doses of granisetron (10, 20, 40 or 80 μg.kg^{−1}) (reference 89; Appendix S1). Table 2 lists the seven reported continuous variables and Table 3 lists the standardised differences between the mean of each group and the estimated population mean for these variables (see Appendix A for how to obtain the data in Table 3 from the original data in Table 2). Figure 2 presents histograms of the 35 standardised differences from Table 3. Each bar is 0.25 standardised differences wide with a height determined by the number of differences within each bar. The superimposed red curve (left graph) represents the expected distribution of standardised differences, the ‘standard curve’, with a mean of zero, a standard deviation of one and a peak probability density at the mean of 0.40. The superimposed black curve (middle and right graphs) is the probability density curve generated by the data: the statistical test is between the variances of red and black curves. The result was that there was a significantly greater clustering of the actual data around zero than would be expected by chance (p = 0.0017), as demonstrated by comparison with the standard curve. This was the case even with the variable ‘last menstrual period’ (LMP) removed (p = 0.016; please see the Discussion section, below, concerning the invariance of LMP). However, in this example, adjustment of the variance (see Methods, Appendix A) made the distribution consistent with the expected (p = 0.06). Subsequent figures are illustrated with the comparative red standard normal curve.
Variable | Placebo (n = 20) | Granisetron 10 μg.kg^{−1} (n = 20) | Granisetron 20 μg.kg^{−1} (n = 20) | Granisetron 40 μg.kg^{−1} (n = 20) | Granisetron 80 μg.kg^{−1} (n = 20) | Population mean (μ) |
---|---|---|---|---|---|---|
Age; years | 46 (8) | 47 (7) | 45 (11) | 47 (10) | 50 (11) | 47 (9) |
Height; cm | 159 (10) | 158 (9) | 155 (11) | 157 (10) | 159 (9) | 158 (10) |
Weight; kg | 57 (7) | 57 (9) | 54 (7) | 54 (8) | 58 (8) | 56 (8) |
Surgical time; min | 86 (35) | 92 (31) | 83 (34) | 87 (36) | 92 (27) | 88 (33) |
Anaesthetic time; min | 106 (35) | 117 (33) | 106 (36) | 112 (37) | 118 (29) | 112 (34) |
Fentanyl dose; μg | 103 (79) | 98 (73) | 93 (73) | 98 (79) | 105 (86) | 99 (78) |
LMP; days | 16 (3) | 16 (3) | 16 (3) | 16 (3) | 16 (3) | 16 (3) |
- LMP, last menstrual period.
Variable | SEM | Placebo (n = 20) | Granisetron 10 μg.kg^{−1} (n = 20) | Granisetron 20 μg.kg^{−1} (n = 20) | Granisetron 40 μg.kg^{−1} (n = 20) | Granisetron 80 μg.kg^{−1} (n = 20) |
---|---|---|---|---|---|---|
Age; years | 2.10 | −0.48 | 0.00 | −0.95 | 0.00 | 1.43 |
Height; cm | 2.19 | 0.64 | 0.18 | −1.19 | −0.27 | 0.64 |
Weight; kg | 1.74 | 0.57 | 0.57 | −1.15 | −1.15 | 1.15 |
Surgical time; min | 7.29 | −0.27 | 0.55 | −0.69 | −0.14 | 0.55 |
Anaesthetic time; min | 7.60 | −0.76 | 0.68 | −0.76 | 0.03 | 0.82 |
Fentanyl dose; μg | 17.44 | 0.21 | −0.08 | −0.37 | −0.08 | 0.32 |
LMP; days | 0.67 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
- LMP, last menstrual period.
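The standardised differences in Table 3 are each group mean minus the estimated population mean, divided by the SEM. The sketch below (illustrative Python, hypothetical names) reproduces the age row from the rounded values in Tables 2 and 3, then summarises the spread of all 35 tabulated values. Other rows cannot be reproduced exactly from Table 2 because the published means are rounded. The root-mean-square figure is only a rough descriptive summary, not the variance test described in Appendix A.

```python
from math import sqrt

# Age row: group means from Table 2, SEM from Table 3
age_means = [46, 47, 45, 47, 50]
age_population_mean = 47
age_sem = 2.10

age_z = [(m - age_population_mean) / age_sem for m in age_means]
print([f"{z:.2f}" for z in age_z])

# All 35 standardised differences, copied from Table 3
table3 = {
    "Age": [-0.48, 0.00, -0.95, 0.00, 1.43],
    "Height": [0.64, 0.18, -1.19, -0.27, 0.64],
    "Weight": [0.57, 0.57, -1.15, -1.15, 1.15],
    "Surgical time": [-0.27, 0.55, -0.69, -0.14, 0.55],
    "Anaesthetic time": [-0.76, 0.68, -0.76, 0.03, 0.82],
    "Fentanyl dose": [0.21, -0.08, -0.37, -0.08, 0.32],
    "LMP": [0.00, 0.00, 0.00, 0.00, 0.00],
}
values = [z for row in table3.values() for z in row]

# Root-mean-square spread around zero; the standard curve has a spread of 1
rms_spread = sqrt(sum(z * z for z in values) / len(values))
print(f"{len(values)} values, spread around zero = {rms_spread:.2f}")
```

A spread well below 1 is what the histograms in Fig. 2 show as clustering around zero.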
Table 4 lists the number of participants in each group for the six reported binomial variables in the same study, and Fig. 3 presents histograms of reported and expected rates for these binomial variables. For instance, 40/100 participants were women, which is the best estimate of the proportion (0.40) of women in the population from which 100 participants were drawn. The expected binomial distribution of this variable is therefore determined by the rate of 0.40 and the sample size. Summation of the binomial distributions for the 6 variables in Table 4 results in the expected distribution depicted by white bars in Fig. 3. The reported distribution was no different from that expected (p = 0.13). For this published study taken in isolation, therefore, my conclusion was that it did not contain data of unusual consistency.
Variable | Rate | Placebo (n = 20) | Granisetron 10 μg.kg^{−1} (n = 20) | Granisetron 20 μg.kg^{−1} (n = 20) | Granisetron 40 μg.kg^{−1} (n = 20) | Granisetron 80 μg.kg^{−1} (n = 20) |
---|---|---|---|---|---|---|
Women | 0.4 | 8 | 7 | 8 | 9 | 8 |
Motion sickness | 0.09 | 2 | 1 | 2 | 2 | 2 |
Previous PONV | 0.02 | 0 | 1 | 0 | 1 | 0 |
Operation two* | 0.12 | 2 | 3 | 2 | 2 | 3 |
Operation three* | 0.11 | 2 | 2 | 3 | 2 | 2 |
No analgesia | 0.37 | 7 | 8 | 7 | 7 | 8 |
- *Referring to different operations.
- PONV, postoperative nausea and vomiting.
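One way to read the summation of binomial distributions across the six variables of Table 4 is sketched below in Python. This is an assumption about the general approach, not a reproduction of Appendix A: for each variable, the pooled rate defines a binomial distribution for a group of 20, the expected number of groups at each count is accumulated across variables, and the reported counts are tallied alongside. All names are hypothetical.

```python
from collections import Counter
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent trials at rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


GROUP_SIZE = 20

# Reported numbers per group, copied from Table 4
table4 = {
    "Women": [8, 7, 8, 9, 8],
    "Motion sickness": [2, 1, 2, 2, 2],
    "Previous PONV": [0, 1, 0, 1, 0],
    "Operation two": [2, 3, 2, 2, 3],
    "Operation three": [2, 2, 3, 2, 2],
    "No analgesia": [7, 8, 7, 7, 8],
}

expected = [0.0] * (GROUP_SIZE + 1)  # expected groups at each count
observed = Counter()                 # reported groups at each count
rates = {}

for name, counts in table4.items():
    # Pooled rate for this variable, e.g. 40/100 = 0.40 for women
    rate = sum(counts) / (GROUP_SIZE * len(counts))
    rates[name] = rate
    for k in range(GROUP_SIZE + 1):
        expected[k] += len(counts) * binomial_pmf(k, GROUP_SIZE, rate)
    observed.update(counts)

for k in range(GROUP_SIZE + 1):
    if observed[k] or expected[k] >= 0.05:
        print(f"count {k:2d}: reported {observed[k]:2d}, expected {expected[k]:.1f}")
```

The pooled rates computed in the loop match the Rate column of Table 4, and both tallies cover all 30 group-variable pairs.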
Summary of human studies
The approach explained above was applied to peri-operative continuous and categorical variables reported in 141 human RCTs (studies specified in Tables 5–9). Figures 4–7 show the histograms of standardised mean differences for some of the reported continuous variables, illustrating that the reported clustering around zero was more extreme than expected by chance. Recall that for continuous variables, the SDs should be ∼1.0 and the p values in Tables 5 and 7 indicate the significance of departures from this expected value. Figures 8 and 9 show the histograms of two categorical variables. Tables 6, 8 and 9 list the p values for the distributions of these and nine other categorical variables.
Variable | Groups reporting means with SDs | Standardised SD of means | p value | Studies |
---|---|---|---|---|
Weight | 438 | 0.551 | < 10^{−33} | 1–58, 60–65, 67–72, 74–96, 100–22, 150–68 |
Age | 414 | 0.567 | < 10^{−33} | 1–16, 18, 20, 23–4, 26–35, 37, 39–45, 47–58, 60–5, 67–72, 74–9, 79–96, 98, 100–2, 104–22, 150–68 |
Height | 447 | 0.620 | < 10^{−33} | 1–58, 60–65, 67–72, 74–9, 79–96, 98, 100–22, 150–68 |
LMP | 56 | 0.261 | 2.22 × 10^{−16} | 36, 41, 46, 49–50, 54–6, 67, 71–2, 80–2, 85, 87, 89, 164 |
Baseline BP | 27 | 0.363 | 2.4 × 10^{−6} | 43, 94–6, 104, 106, 114, 150, 152 |
Gestation | 26 | 0.707 | 0.036 | 24, 34, 43, 64, 74, 94–6, 152 |
- BP, blood pressure; LMP, last menstrual period.
Variable | Groups reporting rates | p value | Studies |
---|---|---|---|
Previous PONV | 83 | 4.0 × 10^{−7} | 12, 14–7, 21, 26–8, 32, 37–8, 49, 53–4, 56–7, 62, 66–7, 69–70, 79, 89, 91, 94 |
Motion sickness | 88 | 1.0 × 10^{−4} | 12–7, 21, 26–8, 31–2, 37–8, 49, 53–4, 56–7, 59, 66–7, 69–70, 79, 89, 91, 94 |
Gender | 217 | 7.1 × 10^{−3} | 1, 7, 9–10, 16–8, 20–1, 30, 33, 39–42, 44–5, 48, 51–3, 56–8, 60–3, 65–69, 71–2, 76–7, 79, 81, 84, 86–7, 89–90, 98, 100–11, 113–20, 122, 168 |
Antihypertensive drugs | 73 | 0.04 | 104–6, 108, 114, 118 |
Previous caesarean section | 18 | 0.08 | 24, 34, 45, 94–6 |
- PONV, postoperative nausea and vomiting.
Variable | Groups reporting means with SDs | Standardised SD of means | p value | Studies |
---|---|---|---|---|
Operation time | 320 | 0.447 | < 10^{−33} | 1–29, 31–2, 34–44, 46, 48–58, 60–4, 67–72, 74–7, 79–96, 100, 104–5, 107, 110, 112–3, 118, 151–2, 157, 160–1, 164, 166–7 |
Anaesthetic time | 293 | 0.466 | < 10^{−33} | 1–18, 20–3, 25–9, 31–2, 35–42, 44, 46, 48–58, 60–4, 67–71, 74–7, 79–93, 104–5, 107, 110, 112–3, 118, 151, 157, 160–1, 164, 166–7 |
Fentanyl | 62 | 0.398 | 1.6 × 10^{−9} | 56, 64, 70, 72, 74, 79–84, 89, 95–6, 152, 157, 160–1, 164, 166–7 |
Operative blood loss | 32 | 0.252 | 1.8 × 10^{−10} | 11, 15, 19, 28, 35, 68, 75, 91, 104, 118 |
Propofol | 63 | 0.511 | 1.3 × 10^{−8} | 63–4, 94, 119–120, 122, 153–6, 158–59, 162–3, 165, 168 |
Paracetamol | 19 | 0.392 | 2.3 × 10^{−3} | 1, 7, 9–11, 18, 30 |
Time uterus out | 18 | 0.875 | 0.03 | 24, 34, 43, 94–6, 152 |
Variable | Groups reporting rates | p value | Studies |
---|---|---|---|
Uterus exteriorised | 18 | 0.23 | 24, 34, 43, 94–6 |
Tubal ligation | 18 | 0.79 | 24, 34, 43, 94–6 |
Variable | Groups reporting rates | p value | Studies |
---|---|---|---|
Headache | 170 | 1.4 × 10^{−11} | 1, 5–15, 20, 23–8, 31, 35,39–45, 47, 50–1, 53–4, 57–8, 67, 69, 72, 87, 90 |
Dizziness | 117 | 6.2 × 10^{−7} | 1, 5–6, 8, 11–5, 23–8, 31, 35, 41, 43, 47, 50, 53–4, 57, 67, 69, 75 |
Drowsiness | 201 | 2.6 × 10^{−5} | 1, 5–11, 15, 20, 24, 28, 35, 39–42, 45, 47, 50, 57, 63, 69, 75 |
Constipation | 26 | 0.08 | 39–40, 42, 46, 51, 58, 75 |
Figure 10 shows the expected and reported rates of headache in RCT groups authored by Fujii et al. and by others. The former studies had a distribution of headache that was strikingly different from the expected binomial distribution (p = 1.4 × 10^{−11}), whereas the distribution of headache in studies by authors other than Fujii et al. was not different from the expected (p = 0.81). The distributions of weight, age and height in RCTs by other authors were consistent with the expected, in contrast to the distributions in RCTs authored by Fujii et al. (Table 10).
Variable | Authors | Groups reporting means with SDs | Standardised SD of means | p value | Studies |
---|---|---|---|---|---|
Weight | Fujii et al. | 438 | 0.551 | < 10^{−33} | See Table 5 |
Weight | Other authors | 359 | 1.06 | 0.13 | See Appendix S6 |
Age | Fujii et al. | 414 | 0.567 | < 10^{−33} | See Table 5 |
Age | Other authors | 556 | 1.04 | 0.14 | See Appendix S6 |
Height | Fujii et al. | 447 | 0.620 | < 10^{−33} | See Table 5 |
Height | Other authors | 146 | 0.93 | 0.24 | See Appendix S6 |
Summary of animal studies
Table 11 lists nine continuous variables reported in 26 canine RCTs. The adjusted p values indicate that the distributions are significantly different from that expected for all but one variable, pulmonary capillary wedge pressure; the histograms of the expected and reported distributions for the latter and a representative significantly different variable, transdiaphragmatic pressure at 100 Hz, are shown in Fig. 11.
Variable | Groups reporting means with SDs | Standardised SD of means (expected 1) | p value | Studies |
---|---|---|---|---|
Right atrial pressure | 87 | 0 | < 10^{−33} | 123, 125–7, 129–131, 139–140, 146–5 |
Transdiaphragmatic pressure at 100 Hz stimulation | 118 | 0.203 | < 10^{−33} | 123–136, 139–140, 143–5, 147–8 |
Pulmonary artery occlusion pressure | 56 | 0.351 | < 10^{−33} | 126–8, 130–1, 142, 144–5 |
Heart rate | 108 | 0.330 | 1.6 × 10^{−11} | 123–136, 139–140, 142–7 |
Transdiaphragmatic pressure at 20 Hz stimulation | 77 | 0.178 | < 10^{−33} | 123–136, 139–140, 143–5, 147–8 |
Mean pulmonary arterial pressure | 69 | 0.389 | 4.6 × 10^{−15} | 125–7, 129–131, 139–140, 142, 144–6 |
Mean arterial pressure | 107 | 0.482 | 1.6 × 10^{−8} | 123–140, 142–7 |
Cardiac output | 62 | 0.511 | 9.5 × 10^{−3} | 125–7, 129–131, 137, 142, 144–5 |
Pulmonary capillary wedge pressure | 18 | 0.775 | 0.42 | 125, 139–140 |
Individual studies
The preceding analyses combined results from RCTs. I also assessed each RCT in isolation, combining the standardised mean differences for different variables within a study, for instance age, height and weight. Of 141 human RCTs by Fujii et al., 134 [correction added after online publication. 11 April 2012; 135 has been changed to 134] (95%) reported mean and SD for at least two continuous variables. Figure 12a shows the p values that the reported distributions of group means were consistent with the expected distributions in human and animal studies by Fujii et al. The results for other authors are shown in Fig 12b. In the studies by Fujii et al., the distributions were abnormal in 96/134 human RCTs and 21/24 animal RCTs, while distributions were abnormal in 12/139 human RCTs by other authors. A summary of the distribution of adjusted p values for RCTs by Fujii et al. is shown in Table 12. There were insufficient binomial data in any study, human or animal, to generate expected distributions reliably and compare them with reported binomial distributions.
Studies | p < 0.00001 | p < 0.001 | p < 0.01 | p < 0.05 | p > 0.049 |
---|---|---|---|---|---|
Human studies (n = 134) | 9 (7%) | 29 (22%) | 27 (20%) | 31 (23%) | 38 (28%) |
Animal studies (n = 24) | 15 (62%) | 3 (13%) | 2 (8%) | 1 (4%) | 3 (13%) |
- p values from individual studies are shown in Fig. 12a. [corrections added after online publication. 11 April 2012; n=135 has been changed to n=134 and 30 (22%) has been changed to 29 (22%)]
All the continuous variables reported above are combined in Fig. 13, with Fig. 13a displaying 2556 values from RCTs by Fujii et al. (graphs on the left) and 2015 values from other RCTs (graphs on the right). The striking feature is the greater clustering of the data from Fujii, which applies to some extent even to Fujii’s trials that do not themselves show distributions different from the expected (Fig. 13b) as well as to those that do show distributions different from expected (Fig. 13c).
The random sampling of eight values from studies by Fujii et al. generates a statistically abnormal distribution whilst an abnormal distribution is only generated after 500 values have been sampled from other RCTs. If these data are combined sequentially, from least to most different from the expected distributions, values from Fujii et al. generate an abnormal distribution after 50 values are analysed and other RCTs generate an abnormal distribution after 333 values are analysed (thick black and red lines, respectively, in Fig. 14). Artificially increasing the variance by 9% of the 1520 values from RCTs of authors other than Fujii resulted in a cumulative distribution that was as expected (making the line in Fig. 14 horizontal; not shown). This suggested that a degree of clustering was likely when combining different trials (see Discussion); however, adjustment little affected Fujii’s data.
Discussion
There is no ‘correct’ or singular statistical method to detect if data follow highly unusual distributions. Using the methods I have employed, my main conclusion is that the distribution of variables reported by Fujii et al. in the trials analysed varied less than expected by chance. Scientific notation might not convey how unlikely it is that natural processes could account for these distributions. In Table 7, for example, a p value of < 10^{−33} is a probability of fewer than one in a decillion (or 1 in 1 000 000 000 000 000 000 000 000 000 000 000), the chance of selecting one particular atom from all the human bodies on earth. It is also striking that when the results of several trials are combined, the reported distributions for Fujii increasingly depart from the expected, whereas those for other authors do so relatively little (and can be corrected by modest adjustment; Fig. 14).
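The odds quoted throughout this paper are the reciprocals of the tabulated p values. The helper below is an illustrative Python sketch (the function name is hypothetical) that converts three p values taken from Tables 5 and 9 into '1 in N' form.

```python
def one_in(p: float) -> float:
    """Convert a p value into the N of a '1 in N' statement."""
    return 1.0 / p


# p values copied from Tables 5 and 9
examples = {
    "Headache": 1.4e-11,
    "Gestation": 0.036,
    "LMP": 2.22e-16,
}

for name, p in examples.items():
    print(f"{name}: p = {p:g} is about 1 in {one_in(p):.2g}")

# A decillion, the scale of the smallest p values reported
decillion = 10**33
print(f"1 in a decillion corresponds to p = {1 / decillion:.0e}")
```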
In my analysis, I did not impose any external theoretical distribution upon the data that was not already a necessary consequence of the data embedded within these RCTs. The single assumption common to calculating the expected variation, for both continuous and categorical variables, was that the study groups represented subjects sampled randomly from the same population. A well-designed RCT should ensure random distribution of variables measured before intervention, including age, weight, height, sex, chronic medications, a history of PONV or motion sickness, LMP and gestation. However, after the groups have been exposed to allocated interventions – placebo, drugs and so on – one cannot assume that the distribution of the data from the groups will be the same. One might therefore conclude that comparison of reported vs theoretical distributions is invalid for some of the variables reported here, including side effects such as drowsiness, dizziness and headache, the last of which aroused Kranke et al.’s suspicion [1]. For instance, a group given granisetron might report more headaches than a group given saline. One does not know whether these different rates represent chance variation (sampling from a single population rate) or an effect of granisetron (sampling from two different population rates). One therefore does not know what distribution to expect. However, not only were the rates of these side effects reported by Fujii et al. consistent with sampling from a single population, they were so invariant that they were inconsistent with the variation one would expect to arise by chance.
There are two types of probabilities that one can calculate for the distributions of binomial characteristics: (i) the probability of observing a particular rate of occurrence of an event; and (ii) the probability that a particular rate is consistent with the expected rate. Kranke et al. calculated the first type of probability, whereas I have calculated the second type. The problem with calculating the first, the probability of a specific rate, is interpreting what it means, because any single rate is very unlikely. For example, when rolling two dice, the chance of throwing a 5 and a 2 is exactly the same as the chance of throwing any other combination of two numbers; but the chance of throwing a total of 7 is higher (17%) than for other sums (because it can be attained by more combinations than other totals), hence the need to study ‘distributions’ (which, in this analogy, would be the respective incidences of the sum of two dice-throws from 2 to 12) rather than just ‘chance of occurrence’. As with rolling dice, the single most likely rate of headache in one group of 45 women, given 18/270 headaches overall, is 3/45, with a probability of 0.23, or 1 in 4. Fujii et al. reported this rate in 6/6 groups (reference 1; Appendix S1); the probability of obtaining this precise distribution in this study is 0.23^{6}, or 1 in 6800. Kranke et al. combined such probabilities across 18 RCTs to estimate a 1 in 147 million chance that 10 of them would report the same headache rate in all groups. However, the starting point for this calculation was that there is something special about equal rates of headache in all groups. A homogenous distribution (such as 3/45 people having headache in all six groups) is not the most likely distribution but it is also not the least likely, which would be all 18 headaches in one group. Indeed, six groups each with 3/45 headaches is a distribution that has borderline probability, given the expected distribution for this study (p is around 0.05, or 1 in 20). 
A likelihood of 1 in 147 million probably underestimates the chance of 10/18 RCTs' reporting the same rates in all groups, whereas the value I calculated for all the distributions in these 18 RCTs, of 1 in 5650, might well overestimate the chance of this distribution. I calculated the expected rate of a binomial variable from the observed rate. This mathematically couples the expected distribution to the observed distribution, making the comparison less likely to identify differences. In such a conservative analysis, any significant result therefore more robustly implies substantial disparity between the expected and reported distributions. Fujii et al. continued to publish RCTs that included headache rates after Kranke et al.’s letter was published [1]. In addition, I calculated the probability of any reported distribution occurring, rather than limiting the analysis to homogenous distributions. The final probability that the difference between the reported and expected distributions arose by chance was 1 in 71 billion (Table 9).
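The dice analogy and the headache probabilities discussed above can be checked numerically. The Python sketch below is illustrative only, with hypothetical names. It enumerates the 36 outcomes of two dice, then evaluates the binomial probability of 3/45 headaches at the pooled rate of 18/270 and raises it to the sixth power, both with the rounded value of 0.23 used in the text and with the unrounded value.

```python
from collections import Counter
from fractions import Fraction
from math import comb

# Totals of two dice: count how many of the 36 outcomes give each sum
totals = Counter(a + b for a in range(1, 7) for b in range(1, 7))
p_seven = Fraction(totals[7], 36)
print(f"P(total of 7) = {p_seven} = {float(p_seven):.3f}")

# Probability of exactly 3 headaches in a group of 45 at the pooled rate
rate = Fraction(18, 270)
p_three = comb(45, 3) * rate**3 * (1 - rate) ** 42
print(f"P(3/45) = {float(p_three):.3f}")

# Probability that all six groups report 3/45
odds_rounded = 1 / 0.23**6            # using the rounded 0.23
odds_exact = 1 / float(p_three) ** 6  # using the unrounded probability
print(f"1 in {odds_rounded:.0f} (rounded), 1 in {odds_exact:.0f} (unrounded)")
```

The rounded input gives roughly the 1 in 6800 quoted in the text; the unrounded input gives a somewhat smaller figure of the same order.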
I have previously mentioned that a graph of sample means will cluster around the population mean, shaped in a normal curve. Two characteristics of the samples determine the width or spread of this curve: the sample size and the variability of the measurements. Less variable measurements (smaller standard deviation) and more measurements will result in less variable means that, in turn, populate a narrower graph of means. The standard deviation in a sample and the sample size should make the calculated width of this graph, or SEM, inextricably linked with the spread of the reported means. Just as Fujii et al.’s group rates for binomial variables clustered around the population rate, so too did their group means cluster abnormally tightly around the population mean. However, it is important to recognise that if the mean values are reported insufficiently precisely, an artefact may be introduced that itself causes the reported distribution of means to cluster more tightly around the population mean than expected. This is most clearly demonstrated when the SEM calculated for each group is smaller than the precision to which the mean is reported, for instance, if mean height is reported to the nearest centimetre and the calculated SEM is 0.3 cm. In Fujii et al.’s data, the only human continuous variable for which this problem increased clustering during analysis was LMP (Fig. 5). This problem was of particular concern for 46 of 56 groups in which the mean LMP was reported to the nearest day (mean (SEM) 16 (0.52) days in all groups). If means between 15.1 and 16.9 in these 46 groups were all reported as 16, one would have expected 2/46 means to have been ≤ 15 days and 2/46 to have been ≥ 17 days; if mean LMPs > 16.5 were rounded up to 17 and those < 15.5 were rounded down to 15, one would have expected about 8/46 means to have been ≤ 15 days and 8/46 to have been ≥ 17 days.
In the first scenario, reporting 46 out of 46 means as 16 days has a probability of 0.17 of occurring by chance and in the second scenario, a probability of 2.5 × 10^{−6}, 1 in 400 000.
Reported LMP might also cluster because of women’s preference to report particular values. However, this cannot explain the results in the animal studies. In RCTs of dog diaphragmatic function, the mean value for right atrial pressure (RAP) in all 87 groups was 5 mmHg. With an average SEM of 0.56 mmHg, one would have expected some RAPs to be 4 or 6. If any RAP mean between 4.1 and 5.9 mmHg was reported as 5 mmHg, one would have expected 9/87 means to have been either 4 or 6 mmHg. If mean RAPs > 5.5 mmHg were rounded up to 6 mmHg, and those < 4.5 mmHg were rounded down to 4 mmHg, one would have expected 32/87 means to have been 4 or 6 mmHg. The probabilities that 87 of 87 means would be 5 mmHg are 1 in 300 in the first scenario and 1 in 72 billion in the second scenario.
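The expected counts in the two rounding scenarios can be approximated from the normal curve. The sketch below is a simplification rather than the analysis itself: it assumes every group mean is normally distributed around the common reported value with the same average SEM, and the function name `expected_outside` is introduced here for illustration. It does not attempt to reproduce the probabilities quoted for all means being identical, which depend on details not given in this section.

```python
from statistics import NormalDist

def expected_outside(n_groups, sem, half_width):
    """Expected number of group means lying more than half_width away from
    the common mean, assuming each mean is normal with the given SEM."""
    one_tail = 1 - NormalDist().cdf(half_width / sem)
    return n_groups * 2 * one_tail  # both tails combined

# LMP: 46 groups, mean 16 days, SEM 0.52 days.
lmp_loose = expected_outside(46, 0.52, 0.9)   # 15.1 to 16.9 all reported as 16
lmp_strict = expected_outside(46, 0.52, 0.5)  # conventional rounding at .5

# Right atrial pressure: 87 groups, mean 5 mmHg, average SEM 0.56 mmHg.
rap_loose = expected_outside(87, 0.56, 0.9)   # 4.1 to 5.9 all reported as 5
rap_strict = expected_outside(87, 0.56, 0.5)  # conventional rounding at .5

print(f"LMP, loose rounding:  {lmp_loose:.1f} of 46 means outside")
print(f"LMP, strict rounding: {lmp_strict:.1f} of 46 means outside")
print(f"RAP, loose rounding:  {rap_loose:.1f} of 87 means outside")
print(f"RAP, strict rounding: {rap_strict:.1f} of 87 means outside")
```

The two-sided totals correspond to the counts given in the text: about 2 + 2 and 8 + 8 of 46 for LMP, and about 9 and 32 of 87 for RAP.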
The combination of values from RCTs that individually had distributions close to the expected led to abnormal clustering after 50 Fujii values and after 333 values from other RCTs (Fig. 14). This finding might be because there was something wrong with the values or something wrong with the analysis, or both. Multiplication by 1.09 of the 1520 values from 117 other RCTs with the least aberrant distributions increased their variance and prevented this clustering, whilst multiplication by 1.30 was needed to normalise the 285 values from the 19 Fujii RCTs with the least aberrant distributions. Summation of values (both from Fujii’s and other authors’ trials) might be revealing aberrant distributions that remain undetected when RCTs are analysed individually, in much the same way that a meta-analysis can identify an effect that is undetected by single underpowered RCTs. By far the majority of RCTs by Fujii et al. remain aberrant despite attempts to correct them, and the overall steepness of the lines in Fig. 14 provides a statistical index of suspicion that the data are aberrant.
It is usual to modify or correct analyses of means, medians or rates when tests are not independent. In this paper, there are two sources of correlated variables: those that are biologically associated, such as age, sex, height and weight; and those that are constrained by another analysed variable, such as surgical time being necessarily less than anaesthetic time. I have adjusted the analyses for continuous variables in this paper, but not as a consequence of this concern. The analyses of categorical data were conservative for the reasons I have stated; therefore, I did not adjust these further.
In conclusion, I have shown that the distributions of continuous and categorical variables reported in Fujii’s papers, both human and animal, are extremely unlikely to have arisen by chance; in many cases, the likelihoods are infinitesimally small. Whether the raw data from any of these studies can be analysed, and whether this might provide an innocent explanation of such results [4], is beyond the scope of this paper. Until such time as these results can be explained, it is essential that all Fujii et al.’s data are excluded from meta-analyses or reviews of the relevant fields. The techniques explored in this paper offer a method of assessing data integrity in RCTs published by other authors, for instance within systematic reviews by the Cochrane Collaboration.
Competing interests
No external funding and no competing interests declared.
Acknowledgements
I am indebted to Professor Jaideep J Pandit and Dr Steve Yentis for their encouragement, support and help. This paper has benefitted from their advice. Much of the clarity this paper possesses has been generated by their hard work, whilst the obscurity that remains is mine alone.
Appendix
Appendix A Method of generation of expected categorical and continuous distributions and their subsequent statistical analysis
Division of (x̄ − μ), the difference between a reported group mean x̄ and the population mean μ, by the SEM will standardise the normal curve so that the distribution of reported mean differences will remain centred on zero, but the SD of the curve should equal one. For example, Table 13 shows mean (SD) height (in cm) reported in three groups (Appendix 1, reference 2), with calculations of the SEM and standardised mean differences, (x̄ − μ)/SEM.
Variable | Group 1 (n = 24) | Group 2 (n = 23) | Group 3 (n = 23) | Mean |
---|---|---|---|---|
Mean | 121.1 | 119.9 | 117.8 | 119.6 (μ) |
SD | 11.5 | 11.1 | 10.1 | 10.9 |
SEM | 2.35 | 2.31 | 2.11 | 2.26 |
x̄ − μ | 1.5 | 0.3 | −1.8 | 0 |
(x̄ − μ)/SEM | 0.66 | 0.13 | −0.79 | 0 |
Adjusted* | 0.72 | 0.14 | −0.86 | 0 |
- *For the analysis of means in individual RCTs, I increased the variance of the standardised mean differences by an arbitrary factor determined by the standard deviation of the calculated SEMs (SD_{SEM}), divided by the square root of the mean SEM: (x̄ − μ)/SEM × (1 + (SD_{SEM}/√(mean SEM))). In the Table, the standard deviation of the three SEMs (2.35, 2.31, 2.11) is 0.129. The square root of the mean SEM (2.26) is 1.5. Therefore, the adjustment factor is 1 + (0.129/1.5), or 1.086.
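The arithmetic of the adjustment in Table 13 can be retraced as follows. This is a sketch under stated assumptions, not the original Stata code: the SEMs are rounded to two decimal places as printed in the table, the sample standard deviation (n − 1 denominator) is assumed for SD_{SEM}, and the standardised differences are taken directly from the table row rather than recomputed. The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Reported SD and group size for height in the three groups of Table 13.
sds = [11.5, 11.1, 10.1]
ns = [24, 23, 23]

# SEM = SD / sqrt(n), rounded to two decimals as printed in the table.
sems = [round(sd / sqrt(n), 2) for sd, n in zip(sds, ns)]

# Adjustment factor: 1 + SD of the SEMs / square root of the mean SEM.
# stdev() is the sample standard deviation (assumed here).
sd_sem = stdev(sems)
factor = 1 + sd_sem / sqrt(mean(sems))

# Standardised mean differences as given in the table row above.
standardised = [0.66, 0.13, -0.79]
adjusted = [round(z * factor, 2) for z in standardised]

print("SEMs:", sems)
print(f"SD of SEMs: {sd_sem:.3f}")
print(f"adjustment factor: {factor:.3f}")
print("adjusted differences:", adjusted)
```

With these inputs, the factor comes to 1.086 and the adjusted values match the final row of the table.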
Unlike Kranke et al., I calculated a separate population rate probability ‘p’ for each RCT rather than assuming that there was a common underlying rate across RCTs. I combined the expected distributions from different studies and compared the summed distribution with that reported.
I used Intercooled Stata^{®} 12 (StataCorp LP, College Station, TX, USA) to test whether the reported-to-expected categorical distributions (Fisher’s exact test), and the variances of reported-to-expected standardised distributions (sdtest), significantly departed from those which would arise from chance.