Volume 72, Issue 8, pp. 944–952
Original Article

Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals

J. B. Carlisle

Consultant, Department of Anaesthesia, Peri-operative Medicine and Intensive Care, Torbay Hospital, UK

Correspondence to: J. B. Carlisle

Email: [email protected]

First published: 04 June 2017

This article is accompanied by an editorial by Loadsman and McCulloch, Anaesthesia 2017; 72: 931–5.

You can respond to this article at http://www.anaesthesiacorrespondence.com

Summary

Randomised, controlled trials have been retracted after publication because of data fabrication and inadequate ethical approval. Fabricated data have included baseline variables, for instance, age, height or weight. Statistical tests can determine the probability of the distribution of means, given their standard deviation and the number of participants in each group. Randomised, controlled trials have been retracted after their data distributions were calculated to be improbable. Most retracted trials were written by anaesthetists and published by specialist anaesthetic journals. I wanted to explore whether the distribution of baseline data in trials was consistent with the expected distribution. I wanted to determine whether trials retracted after publication had distributions different from those of trials that have not been retracted. I wanted to determine whether data distributions in trials published in specialist anaesthetic journals differed from distributions in non-specialist medical journals. I analysed the distribution of 72,261 means of 29,789 variables in 5087 randomised, controlled trials published in eight journals between January 2000 and December 2015: Anaesthesia (399); Anesthesia and Analgesia (1288); Anesthesiology (541); British Journal of Anaesthesia (618); Canadian Journal of Anesthesia (384); European Journal of Anaesthesiology (404); Journal of the American Medical Association (518); and New England Journal of Medicine (935). I chose these journals because I had electronic access to their full text. Trial p values were distorted by an excess of baseline means that were similar and an excess that were dissimilar: 763/5015 (15.2%) trials that had not been retracted from publication had p values that were within 0.05 of 0 or 1 (expected 10%), that is, a 5.2% excess, p = 1.2 × 10⁻⁷. The p values of 31/72 (43%) trials that had been retracted after publication were within 0.05 of 0 or 1, a rate different from that for unretracted trials, p = 1.03 × 10⁻¹⁰. The difference between the distributions of these two subgroups was confirmed by comparison of their overall distributions, p = 5.3 × 10⁻¹⁵. Each journal exhibited the same abnormal distribution of baseline means. There was no difference between the distributions of baseline means for 1453 trials in non-anaesthetic journals and 3634 trials in anaesthetic journals, p = 0.30. The rate of retraction from JAMA and NEJM, 6/1453 or 1 in 242, was one-quarter the rate from the six anaesthetic journals, 66/3634 or 1 in 55, relative risk (99%CI) 0.23 (0.08–0.68), p = 0.00022. A probability threshold of 1 in 10,000 identified 8/72 (11%) retracted trials (7 by Fujii et al.) and 82/5015 (1.6%) unretracted trials. Some p values were so extreme that the baseline data could not be correct: for instance, for 43/5015 unretracted trials the probability was less than 1 in 10¹⁵ (equivalent to one drop of water in 20,000 Olympic-sized swimming pools). A probability threshold of 1 in 100 for two or more trials by the same author identified three authors of retracted trials (Boldt, Fujii and Reuben) and 21 first or corresponding authors of 65 unretracted trials. Fraud, unintentional error, correlation, stratified allocation and poor methodology might have contributed to the excess of randomised, controlled trials with similar or dissimilar means, a pattern that was common to all the surveyed journals. It is likely that this work will lead to the identification, correction and retraction of hitherto unretracted randomised, controlled trials.

Introduction

Techniques have been developed to analyse baseline variables, particularly the mean (SD) of continuous variables, and these have helped to identify fabricated data in randomised, controlled trials by Fujii et al. 1. The general principles of these methods have been explained elsewhere 2, 3. The same approach has been used to investigate trials published by Yuhji Saitoh, a co-author of Dr Fujii, and formed a component of the investigation of his work 4. The technique has recently identified systematic problems with data in 33 randomised trials by Yoshihiro Sato, who is not an anaesthetist 5.

Fujii tops a list of biomedical authors with the most retractions and, for reasons quite separate from statistical data analysis, two other anaesthetists appear in this list: second (Boldt) and fifteenth (Reuben) 6. Several questions arise. Are trials published by anaesthetists more likely to be retracted than trials by other specialists? Are anaesthetists more likely to fabricate data in trials? Would the statistical methods used to discover problems with data published by Fujii and Saitoh 1, 3 also retrospectively find aberrations in the baseline data of trials published by authors such as Boldt and Reuben?

The purpose of this survey was to assess whether: (1) the distribution of baseline means corresponded to the expected distribution, and whether any discrepancies were shared by leading non-anaesthetic and anaesthetic journals; (2) the rate of retraction differed between leading non-anaesthetic and anaesthetic journals; and (3) data corruption was discoverable by the new statistical techniques in papers that had been retracted. I used the method to detect anomalies in the distributions of baseline variable mean (SD) in randomised, controlled trials published during 15 years in six specialist anaesthetic journals (Anaesthesia, Anesthesia and Analgesia, Anesthesiology, the British Journal of Anaesthesia, the Canadian Journal of Anesthesia and the European Journal of Anaesthesiology) and two general medical journals (the Journal of the American Medical Association (JAMA) and the New England Journal of Medicine (NEJM)).

Methods

I searched eight journals (to which I had electronic access) for randomised, controlled trials published between January 2000 and December 2015: Anaesthesia; Anesthesia and Analgesia; Anesthesiology; British Journal of Anaesthesia; Canadian Journal of Anesthesia; European Journal of Anaesthesiology (2002–2012); JAMA; and NEJM. I extracted baseline summary data for continuous variables, reported as mean (SD) or mean (SEM). I did not study trials in which participant allocation was not described as random, trials that did not report baseline continuous variables, or trials that reported a different summary measure, such as median (IQR or range). I defined 'baseline' as a variable measured before groups were exposed to the allocated intervention: variables such as age, height, 'baseline' blood pressure or serum sodium concentration. I excluded variables that had been stratified. I recorded whether the allocation sequence had been generated in blocks, permuted or otherwise, which could narrow the distribution of means for time-varying measurements.

The primary outcome was the distribution of p values, calculated for differences between means, for individual variables and when combined within trials. I used three methods to generate p values for individual variables: independent t-test; ANOVA; and Monte Carlo simulations 5, adjusted for the precision to which mean (SD) were reported. The p value generated by these three methods that was closest to 0.5 was combined with the p values for other variables from a trial. I used the sum of z values (Stouffer's method) as the primary method to combine p values for different variables within a randomised, controlled trial. I also calculated the results of five other methods used to combine p values into a single probability for each trial: logit; mean; Wilkinson's method; sum of log (Fisher's method); and sum. I used the Anderson–Darling test, which interrogates the extremes of a distribution, to compare the distribution of p values with the expected uniform distribution; the Kolmogorov–Smirnov test assesses the central distribution. I checked the mean (SD) of trials in which one or more p values were < 0.01 or > 0.99, as these would indicate excessively narrow or excessively wide distributions. I checked whether substitution of SEM for SD, and vice versa, resulted in a less extreme p value, in case the authors had incorrectly labelled one for the other.
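These per-variable and per-trial steps can be illustrated in R with the packages listed below (a minimal sketch with invented summary data, not the survey's actual script; the inversion of the two-sided p value is one plausible reading of the convention described above):

library(BSDA)    # tsum.test: two-sample t-test from summary statistics
library(metap)   # sumz: Stouffer's sum-of-z method for combining p values

# Invented mean (SD) and group sizes for two baseline variables in one trial
age    <- tsum.test(mean.x = 54.2, s.x = 9.1,  n.x = 40,
                    mean.y = 53.1, s.y = 8.8,  n.y = 41, var.equal = TRUE)
weight <- tsum.test(mean.x = 78.1, s.x = 11.6, n.x = 40,
                    mean.y = 75.9, s.y = 12.0, n.y = 41, var.equal = TRUE)

# Invert the two-sided p values so that similar means fall near 0 and
# dissimilar means fall near 1, then combine into a single trial p value
p_vars <- 1 - c(age$p.value, weight$p.value)
sumz(p_vars)    # combined trial p value by the sum of z values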

A secondary analysis included comparison of p values for randomised, controlled trials that had been retracted vs. trials that had not been retracted. All analyses were conducted in R (R Foundation for Statistical Computing, Vienna, Austria), packages (function): Anderson–Darling test, ‘goftest’ (ad.test) and ‘kSamples’ (ad.test, Steel.test); ANOVA, ‘rpsychi’ (ind.oneway.second) and ‘CarletonStats’ (anovaSummarized); t-test, ‘BSDA’ (tsum.test); the quantile–quantile plot, ‘qqtest’ (qqtest); the combination of p values, ‘metap’ (logitp, meanp, minimump, sumlog, sump, sumz). All p values were one-sided and inverted, such that dissimilar means generated p values near 1.
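As an illustration of the distribution-level comparison, the Anderson–Darling test of trial p values against the uniform distribution can be run with goftest (a sketch with simulated p values; under genuine random sampling the test should be non-significant):

library(goftest)   # ad.test: Anderson–Darling goodness-of-fit test

set.seed(1)
trial_p <- runif(5000)              # stand-in for the 5087 trial p values
ad.test(trial_p, null = "punif")    # compare with the uniform distribution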

Results

I scanned 9673 clinical trials for the random allocation of baseline variables reported as mean (SD or SEM): 4586 were not randomised, controlled trials or did not present unstratified baseline mean (SD or SEM) data. I therefore analysed 5087 trials, which included 72,261 means of 29,789 variables. The supplementary appendices list the trials and the results of the analyses (Appendix S1) and the data that I analysed (Appendix S2).

The distribution of 72,261 baseline means was largely consistent with random sampling: trial p values between 0.15 and 0.95 were contained within the 99% confidence interval of the cumulative uniform distribution (Fig. 1). However, there were more trials than expected with baseline means that were similar (p value near 0) or dissimilar (p value near 1): 794/5087 (15.6%) trial p values were within 0.05 of 0 or 1, that is, 5.6% more than expected, or an excess of 1 in 18 trials (Fig. 1 and Tables 1 and 2). Consequently, the distribution of trial p values deviated from the expected distribution, p = 1.2 × 10⁻⁷. Each journal had a similar proportion of trials with extreme p values (Fig. 2). Although the distribution of p values was not the same in all journals, p = 0.007, there were no significant differences when any one journal was tested against another. There was no difference between the distributions of baseline variables of 1453 trials published in non-anaesthetic journals and 3634 trials published in anaesthetic journals, p = 0.30 (Fig. 3).
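The excess in the tails can be checked roughly from the counts alone (a sketch; the p value above comes from the Anderson–Darling comparison of the whole distribution, whereas this binomial test only examines the proportion in the extreme 10%):

# 794 of 5087 trial p values fell within 0.05 of 0 or 1; 10% were expected
binom.test(794, 5087, p = 0.1)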

Figure 1. A quantile–quantile plot of p values calculated for 5087 randomised, controlled trials, compared with the 99%CI for the expected uniform distribution. The point of interest is the deviation of the trial p values from the expected distribution at values less than 0.15 and more than 0.95.
Table 1. The numbers of randomised, controlled trials analysed from eight journals and the number (proportion) with p values for baseline means within 0.05 of 0 or 1
Columns: journal; retraction status; number of means; number of variables; number of trials; then the number (proportion) of trials with p values in each of six bands, with the expected proportion in parentheses: p < 0.001 (0.1%); 0.001 < p < 0.01 (0.9%); 0.01 < p < 0.05 (4%); 0.95 < p < 0.99 (4%); 0.99 < p < 0.999 (0.9%); p > 0.999 (0.1%)
Anaesthesia Not retracted 4169 1799 393 0 5 (1.3%) 24 (6.0%) 26 (6.5%) 8 (2.0%) 8 (2.0%)
Retracted 49 33 6 0 0 0 0 0 0
Total 4218 1832 399 0 5 (1.3%) 24 (6.0%) 26 (6.5%) 8 (2.0%) 8 (2.0%)
Anesthesia and Analgesia Not retracted 14,556 5720 1254 3 (0.2%) 26 (2.1%) 48 (3.9%) 68 (5.5%) 22 (1.8%) 25 (2.0%)
Retracted 612 238 34 5 (16%) 3 (10%) 2 (6%) 1 (3%) 2 (6%) 1 (3%)
Total 15,168 5958 1288 8 (0.6%) 29 (2.3%) 50 (3.9%) 69 (5.4%) 24 (1.9%) 26 (2.0%)
Anesthesiology Not retracted 9399 3617 537 6 (1.1%) 8 (1.5%) 23 (4.3%) 23 (4.3%) 9 (1.7%) 14 (2.6%)
Retracted 82 28 4 1 (20%) 0 0 0 0 0
Total 9481 3645 541 7 (1.3%) 8 (1.5%) 23 (4.3%) 23 (4.3%) 9 (1.7%) 14 (2.6%)
British Journal of Anaesthesia Not retracted 6492 2759 614 1 (0.2%) 6 (1.0%) 39 (6.3%) 28 (4.5%) 3 (0.5%) 11 (1.8%)
Retracted 77 30 4 1 (13%) 0 0 1 (13%) 0 0
Total 6569 2789 618 2 (0.3%) 6 (1.0%) 39 (6.3%) 29 (4.7%) 3 (0.5%) 11 (1.8%)
Canadian Journal of Anesthesia Not retracted 3907 1632 373 3 (0.8%) 7 (1.8%) 15 (3.9%) 16 (4.2%) 4 (1.0%) 9 (2.3%)
Retracted 290 88 11 3 (27%) 1 (9%) 2 (18%) 0 0 0
Total 4197 1720 384 6 (1.6%) 8 (2.1%) 17 (4.4%) 16 (4.2%) 4 (1.0%) 9 (2.3%)
European Journal of Anaesthesiology Not retracted 4226 1811 397 1 (0.2%) 2 (0.5%) 14 (3.5%) 30 (7.6%) 9 (2.2%) 12 (3.0%)
Retracted 94 36 7 0 0 2 (29%) 1 (14%) 0 0
Total 4320 1847 404 1 (0.2%) 2 (0.5%) 16 (4.0%) 31 (7.7%) 9 (2.2%) 12 (3.0%)
Journal of the American Medical Association Not retracted 10,717 4481 513 2 (0.4%) 8 (1.6%) 27 (5.2%) 23 (4.4%) 11 (2.1%) 10 (1.9%)
Retracted 163 76 5 1 (20%) 1 (20%) 0 0 0 0
Total 10,880 4557 518 3 (0.6%) 9 (1.7%) 27 (5.2%) 23 (4.4%) 11 (2.1%) 10 (1.9%)
New England Journal of Medicine Not retracted 17,404 7429 934 5 (0.5%) 12 (1.3%) 50 (5.3%) 34 (3.6%) 13 (1.4%) 10 (1.1%)
Retracted 24 12 1 0 0 0 0 1 (100%) 0
Total 17,428 7441 935 5 (0.5%) 12 (1.3%) 50 (5.3%) 34 (3.6%) 13 (1.4%) 10 (1.1%)
Total Not retracted 70,873 29,250 5015 22 (0.4%) 72 (1.5%) 244 (4.8%) 247 (4.9%) 77 (1.6%) 101 (2.0%)
Retracted 1388 539 72 11 (15%) 6 (8%) 7 (10%) 3 (4%) 3 (4%) 1 (1%)
Total 72,261 29,789 5087 33 (0.6%) 78 (1.5%) 251 (4.9%) 250 (4.9%) 80 (1.6%) 102 (2.0%)
Table 2. Conversion of single-sided p values within 0.005 of 0 or 1 in Table 1 to two-sided p values < 0.01 for 5015 unretracted trials. Randomised, controlled trials with the least likely distributions of baseline variables might benefit from further investigation. Values are number (proportion)
Columns: journal; total number of trials; then the number (proportion) of trials in each band, with the expected proportion in parentheses: p < 0.00001 (< 0.001%); 0.00001 < p < 0.0001 (0.009%); 0.0001 < p < 0.001 (0.09%); 0.001 < p < 0.01 (0.9%); total p < 0.01* (1%)
Anaesthesia 393 7 (1.8%) 1 (0.3%) 0 8 (2.0%) 16 (4.1%)
Anesthesia and Analgesia 1254 17 (1.4%) 3 (0.2%) 4 (0.3%) 28 (2.2%) 52 (4.1%)
Anesthesiology 537 9 (1.7%) 3 (0.6%) 7 (1.3%) 10 (1.9%) 29 (5.4%)
British Journal of Anaesthesia 614 5 (0.8%) 0 5 (0.8%) 7 (1.1%) 17 (2.8%)
Canadian Journal of Anesthesia 373 8 (2.1%) 1 (0.3%) 3 (0.8%) 9 (2.4%) 21 (5.6%)
European Journal of Anaesthesiology 397 5 (1.3%) 0 7 (1.8%) 7 (1.8%) 19 (4.8%)
Journal of the American Medical Association 513 10 (1.9%) 1 (0.2%) 1 (0.2%) 10 (1.9%) 22 (4.3%)
New England Journal of Medicine 934 6 (0.6%) 2 (0.2%) 3 (0.3%) 18 (1.9%) 29 (3.1%)
Total 5015 67 (1.3%) 11 (0.2%) 30 (0.6%) 97 (1.9%) 204 (4.1%)
*p = 0.17 for chi-squared comparison of totals between journals.
Figure 2. A cumulative plot of ordered p values (0 to 1) for 5087 randomised, controlled trials, grouped by the journal in which they were published. Each journal had more trials with baseline means that were similar (p value near 0) or dissimilar (p value near 1) than expected, resulting in cumulative distributions that were inconsistent with the cumulative uniform distribution: Anaesthesia, p = 1.5 × 10⁻⁶; Anesthesia and Analgesia, p = 4.7 × 10⁻⁷; Anesthesiology, p = 1.1 × 10⁻⁶; British Journal of Anaesthesia, p = 9.7 × 10⁻⁷; Canadian Journal of Anesthesia, p = 1.6 × 10⁻⁶; European Journal of Anaesthesiology, p = 1.5 × 10⁻⁶; Journal of the American Medical Association, p = 1.2 × 10⁻⁶; New England Journal of Medicine, p = 6.4 × 10⁻⁷.
Figure 3. A quantile–quantile plot of p values calculated for 1453 randomised, controlled trials published in two non-specialist medical journals vs. 3634 randomised, controlled trials published in six specialist anaesthetic journals. The distribution is consistent with the reference line of identity, p = 0.30.

Baseline means from the 72 retracted trials were more often very similar or very dissimilar than those from the 5015 trials that have not been retracted, p = 5.3 × 10⁻¹⁵ (Fig. 4 and Table 1). Excluding the 72 retracted trials did not resolve the discrepancy between the observed and expected distributions of baseline means in the remaining 5015 trials (the p value remained 1.2 × 10⁻⁷). The rate of retraction from the two general medical journals (6/1453) was one-quarter the rate from the six specialist anaesthetic journals (66/3634), relative risk (99%CI) 0.23 (0.08–0.68), p = 0.0002. The p values of the six trials retracted from JAMA or NEJM were 6.3 × 10⁻⁸, 0.0097, 0.057, 0.21, 0.37 and 0.9988.
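The retraction-rate comparison can be reproduced from the counts alone (a sketch using the standard normal approximation on the log relative risk; the paper does not state how its interval was computed):

# Retractions: 6/1453 trials (JAMA, NEJM) vs. 66/3634 (anaesthetic journals)
rr     <- (6 / 1453) / (66 / 3634)                # relative risk, ~0.23
se_log <- sqrt(1/6 - 1/1453 + 1/66 - 1/3634)      # SE of log(RR)
exp(log(rr) + c(-1, 1) * qnorm(0.995) * se_log)   # 99%CI, ~0.08 to 0.68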

Figure 4. The expected cumulative uniform distribution was not followed by the p values of the 5015 unretracted trials, p = 1.2 × 10⁻⁷, or by the p values of the 72 retracted trials, p = 8.6 × 10⁻⁶. The cumulative distributions of unretracted and retracted trials were different, p = 5.3 × 10⁻¹⁵.

To assess whether probability might be used to determine which trials and authors to investigate, and which data to correct or retract, I applied different investigative probability thresholds. A threshold of 1 in 10,000 (0.5 in 5000) would have captured 8/72 (11%) retracted trials (7 by Yoshitaka Fujii) and 82/5015 (1.6%) trials that have not yet been corrected or retracted. I supplemented this approach by investigating authors of more than one trial for which the probability was, arbitrarily, less than 1 in 100. This identified trials (number) by Yoshitaka Fujii (13), Joachim Boldt (3) and Scott Reuben (2). I searched all the trials by first author and by corresponding author and identified 21 other authors of more than one trial (65 trials in total) with a probability less than 1 in 100. A more thorough but laborious method would be to combine the probabilities calculated for all the trials published by an individual. For instance, five trials published in JAMA by one individual (four as corresponding author) generated p values of 0.012, 0.030, 0.047, 0.20 and 0.48; that is, they would not have been identified by the first two methods described in this paragraph. The composite distribution of all p values from these five trials generated p = 0.0011 with the Anderson–Darling test and p = 0.00045 with the sum of z values. Explanations other than chance include corrupted data that are only revealed on pooling data from multiple trials by the same author.
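The screening step amounts to folding the two tails of the distribution together and counting trials beyond a threshold, as in Table 2 (a sketch; the runif call is a stand-in for reading the one-sided, inverted trial p values from Appendix S1):

set.seed(1)
trial_p <- runif(5087)                    # stand-in for the trial p values

# Convert one-sided, inverted p values to two-sided p values, as in Table 2
p_two_sided <- 2 * pmin(trial_p, 1 - trial_p)
sum(p_two_sided < 1e-4)                   # trials beyond 1 in 10,000
sum(p_two_sided < 0.01)                   # trials beyond 1 in 100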

The examination of some trials might suggest reasons for p values near 0 and 1. None of the following trials has been corrected or retracted. The trial with the p value closest to 0 (p = 3.6 × 10⁻³⁰) – that is, generated by similar means – was JAMA 2002; 288: 2421. The authors or editors probably labelled SD incorrectly as SE: the SD calculated from the 'SE' was too large to be plausible, and analyses treating the 'SE' as SD increased the p value to 0.90. The same explanation might apply to the second smallest p value, 7.1 × 10⁻²¹, generated by NEJM 2008; 359: 119, and to JAMA 2001; 285: 1856, p = 3.3 × 10⁻¹¹. However, a single solution does not explain extreme p values in other papers. For instance, NEJM 2007; 356: 911 reported mean (SE) tissue plasminogen activator concentrations of 4.5 (0.6) and 3.2 (0.4) in groups of 59 and 61, respectively: conversion of the SE to SD (4.6 and 3.1) resulted in a p value of 0.92, which is not particularly near 1 (or 0). However, conversion of the 'SE' for the 19 other variables resulted in p values averaging 0.02 and a composite trial p value of 4.1 × 10⁻¹³. One could construct a p value for this trial that does not suggest data corruption if one posited that the SDs of 19 variables were incorrectly labelled SE, whereas the SE for tissue plasminogen activator concentration was correct. Conversely, standard errors incorrectly labelled as SD generate p values close to 1 (using the methodology in this paper), which might explain the p values of 1 generated for 4/11 variables in NEJM 2006; 355: 549. Some journals publish p values for baseline data, which can help determine sources of error. For instance, the authors of NEJM 2010; 362: 790 calculated p = 0.07 for mean (SD) intelligence scores of 99.1 (16.6), 92.0 (14.5) and 100 (14.8) in groups of 155, 149 and 147, respectively. The correct p value is 0.00000718. Although 7 is the correct leading numeral, it is unclear why the authors' p value was out by four orders of magnitude, or why the groups were so different for a baseline variable. Similarly, NEJM 2004; 370: 2265 reported p = 0.03 for mean (SD) central venous pressures of 9.0 (4.7) and 8.6 (4.6) in groups of 3,428 and 3,423, respectively. The calculated p value is between 0.00037 and 0.0015, depending upon the method used.
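The intelligence-score p value can be checked from the published summary statistics alone (a minimal sketch computing a one-way ANOVA from group means, SDs and sizes; the helper function is my own, whereas the survey used the rpsychi and CarletonStats packages for this step):

# One-way ANOVA from summary statistics: between- and within-group sums of
# squares, the F statistic and its upper-tail p value
anova_from_summary <- function(m, sd, n) {
  k  <- length(m)                     # number of groups
  N  <- sum(n)
  gm <- sum(n * m) / N                # grand mean
  ss_between <- sum(n * (m - gm)^2)
  ss_within  <- sum((n - 1) * sd^2)
  f <- (ss_between / (k - 1)) / (ss_within / (N - k))
  pf(f, k - 1, N - k, lower.tail = FALSE)
}

# Baseline intelligence scores from NEJM 2010; 362: 790
anova_from_summary(m = c(99.1, 92.0, 100), sd = c(16.6, 14.5, 14.8),
                   n = c(155, 149, 147))   # ~7.2e-06, not the reported 0.07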

I analysed variables as independent, that is, not correlated. Correlated variables might explain the p value of 9.6 × 10⁻⁴ for JAMA 2004; 291: 309, which reported nine baseline variables (p values averaging 0.24), six of which one would expect to be correlated, as they were derived from the same exercise tests. The correlation of three histologic scores in NEJM 2002; 346: 1706 probably accounts for the p value of 0.99998, as might the correlation of three osteoarthritic scores in NEJM 2010; 363: 1521, which generated a p value of 0.99997. Supplementary data are not always exposed to the same rigour, by authors or editors, as data in the main paper, which might have contributed to the p value indistinguishable from 1 generated by NEJM 2013; 368: 1279, a p value that cannot be explained by substitution of SD for SE or by correlation.

Values of p very near 0 and 1 may be generated by incorrect means, incorrect SDs or incorrect participant numbers. Two trials in JAMA with p values near zero illustrate the probable unwitting replacement of a correct numeral with another. JAMA 2008; 299: 39 reported 31 baseline variables, 30 of which generated p values in the normal range, whereas the 31st, mean (SD) subcutaneous fat depths of 2.6 (0.8) cm and 3.5 (0.8) cm in groups of 113 and 110, respectively, generated p = 5.5 × 10⁻¹⁵. Other data in the paper suggest that the correct means were 2.6 cm and 2.5 cm, for which p = 0.35. Amidst 14 p values in JAMA 2003; 289: 2215, the mean (SD) dietary fat intakes of 37.2 (0.09) kcal and 38.2 (0.19) kcal in groups of 230 and 220, respectively, generated p < 10⁻¹⁶ (though the authors reported 0.60). I expect that the correct means were identical to one decimal place, probably both 37.2 or both 38.2, but there might be a second error, as standard deviations in such large groups should be similar.
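The dietary-fat example can be checked the same way (a sketch using tsum.test; with SDs of 0.09 and 0.19 in groups of 230 and 220, the t statistic is roughly 70, so the two-sided p value underflows towards zero):

library(BSDA)
tsum.test(mean.x = 37.2, s.x = 0.09, n.x = 230,
          mean.y = 38.2, s.y = 0.19, n.y = 220)$p.value   # effectively 0 (< 1e-16)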

Trials are retracted for various reasons. The retraction of two trials was triggered after I noticed duplication of baseline data: EJA 2004; 21: 60, which was a complete republication of EJA 2003; 20: 668; and Anaesthesia 2010; 65: 595, which presented secondary data from a previously published cohort without reference. Eighteen trials in this survey were authored by Yoshitaka Fujii, all of which were retracted for data fabrication, as were eight trials by Scott Reuben. Figure 5 compares the 26 trials retracted for fabrication with the 44 trials retracted for inadequate ethical approval or for unclear reasons, such as 'misconduct'. The distributions of baseline data were different in the two subgroups, p = 3.3 × 10⁻⁵, but neither was consistent with the expected distribution, p = 1.9 × 10⁻⁵ and p = 2.3 × 10⁻⁵, respectively. Trials retracted for unethical practice or for ill-defined reasons might therefore also contain corrupt data, whether through error or fabrication.

Figure 5. The expected cumulative uniform distribution was not followed by the 26 trials retracted for fabrication (trials by Fujii and Reuben), p = 1.9 × 10⁻⁵, or by the 44 trials retracted for other reasons (trials by Boldt and others), p = 2.3 × 10⁻⁵. The cumulative distributions of these two categories of retracted trials were different, p = 3.3 × 10⁻⁵.

Discussion

I analysed the proximity of means for baseline variables in 5087 randomised, controlled trials. In 15.6% of trials the trial p value lay within 0.05 of 0 or 1, a region expected to contain 10% of trials. Retracted trials had a higher proportion of p values in this extreme 10% of the expected distribution than trials that have not been retracted (43% vs. 15%). There was evidence that trials retracted for reasons other than data integrity may have contained corrupt, and possibly fabricated, data. Trials with extreme distributions of means were more likely to contain incorrect or fabricated data than other trials, as has been independently verified.

The discrepancy between the observed and expected distributions of p values could be because the expected distribution was wrong. I was aware that stratified allocation could make group means more similar, which is why I did not analyse the means of stratified variables. However, the effect could have 'carried over' into non-stratified variables: for instance, the mean weights of groups might have been made more similar by stratification for sex or height. This mechanism would not explain the excess of dissimilar means, although correlations between variables could account for excess trials with p values at either extreme (near 0 or 1). Simulations could generate credible intervals for the contributions of stratification and correlation to extreme p values. Investigators can manipulate the distribution of participants into groups if the allocation sequence is inadequately masked or is predictable. The manipulation of participant allocation could result in baseline means that are similar or dissimilar, depending upon the motivation and the efficacy of the method used to distort random allocation. The 'observed' distribution may also have been distorted by my own mistakes: it is likely that I incorrectly transcribed some of the 72,261 means, 72,261 standard deviations, 72,261 participant numbers and 72,261 precisions for mean and SD.
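A minimal simulation sketch of the correlation mechanism (my own, assuming bivariate-normal baseline variables; it is not a model of stratification): combining p values from correlated variables as if they were independent pushes trial p values towards both extremes even under genuine randomisation:

library(MASS)    # mvrnorm: correlated normal variables
library(metap)   # sumz: Stouffer's method, which assumes independence

one_trial <- function(rho, n = 50) {
  sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  a <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)   # group A, two variables
  b <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)   # group B, two variables
  p <- sapply(1:2, function(j) t.test(a[, j], b[, j])$p.value)
  sumz(p)$p                                      # combined trial p value
}

set.seed(1)
p_indep <- replicate(2000, one_trial(rho = 0))
p_corr  <- replicate(2000, one_trial(rho = 0.9))
mean(p_indep < 0.05 | p_indep > 0.95)   # ~0.10, as expected
mean(p_corr  < 0.05 | p_corr  > 0.95)   # > 0.10: excess at both extremes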

Some trials with extreme p values probably contained unintentional typographical errors, such as the description of standard error as standard deviation and vice versa. The more extreme the p value, the more likely there is to be an error (mine or the authors'), whether an unintentional mistake or fabrication. For instance, for 43/5015 unretracted and uncorrected trials, the probability that random allocation would result in the observed distributions of baseline means was less than 1 in 10¹⁵ (one drop of water in 20,000 Olympic-sized swimming pools). In a sample of just over 5000 trials it seems reasonable to conclude that these trials – and others with more probable distributions – almost certainly contain some sort of error. The association of extreme distributions with trial retraction suggests that further investigation of uncorrected, unretracted trials and their authors will result in most of these trials being corrected and some retracted. The evidence for this association in this survey comes mainly from the specialist anaesthetic journals. It is unclear whether trials in the anaesthetic journals have been more deserving of retraction, or whether there is a deficit of retractions from JAMA and NEJM.

In summary, the distribution of means for baseline variables in randomised, controlled trials was inconsistent with random sampling, due to an excess of very similar means and an excess of very dissimilar means. Fraud, unintentional error, correlation, stratified allocation and poor methodology might have contributed to this distortion. The distortion in two non-specialist medical journals was indistinguishable from that found in six specialist anaesthetic journals. Future work might determine whether this finding is general to all randomised, controlled trials. Journal editors could use Table 2 and online Appendix S1 to determine which trials to correct and if necessary retract.

Acknowledgements

JC is an editor of Anaesthesia. No external funding or other competing interests declared.