Doi:10.1016/j.jclinepi.2006.01.012

Journal of Clinical Epidemiology 59 (2006) 964–969

Testing multiple statistical hypotheses resulted in spurious associations:

Peter C. Austin, Muhammad M. Mamdani, David N., Janet E.

a Institute for Clinical Evaluative Sciences, G1 06, 2075 Bayview Avenue, Toronto, Ontario, M4N 3M5 Canada
b Department of Public Health Sciences, University of Toronto, Toronto, Ontario, Canada
c Department of Health Policy, Management and Evaluation, University of Toronto, Canada
d Faculty of Pharmacy, University of Toronto, Canada
e Clinical Epidemiology and Health Care Research Program (Sunnybrook & Women’s College Site), Canada
f Division of General Internal Medicine, Sunnybrook & Women’s College Health Sciences Centre and the University of Toronto, Canada

Objectives: To illustrate how multiple hypotheses testing can produce associations with no clinical plausibility.
Study Design and Setting: We conducted a study of all 10,674,945 residents of Ontario aged between 18 and 100 years in 2000. Residents were randomly assigned to equally sized derivation and validation cohorts and classified according to their astrological sign. Using the derivation cohort, we searched through 223 of the most common diagnoses for hospitalization until we identified two for which subjects born under one astrological sign had a significantly higher probability of hospitalization compared to subjects born under the remaining signs combined (P < 0.05).
Results: We tested these 24 associations in the independent validation cohort. Residents born under Leo had a higher probability of gastrointestinal hemorrhage (P = 0.0447), while Sagittarians had a higher probability of humerus fracture (P = 0.0123) compared to all other signs combined. After adjusting the significance level to account for multiple comparisons, none of the identified associations remained significant in either the derivation or validation cohort.
Conclusions: Our analyses illustrate how the testing of multiple, non-prespecified hypotheses increases the likelihood of detecting implausible associations. Our findings have important implications for the analysis and interpretation of clinical studies. © 2006 Elsevier Inc. All rights reserved.
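The adjustment for multiple comparisons mentioned in the Results can be illustrated with a short calculation. The sketch below assumes that the 24 tests are independent and that each is run at the 0.05 level; the function names are illustrative and are not taken from the study. The exact per-test level under independence is often called the Šidák correction.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def exact_per_test_level(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the overall type I error at alpha.

    Assumes the tests are independent (the Sidak correction).
    """
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def bonferroni_level(alpha: float, n_tests: int) -> float:
    """Bonferroni per-test level; slightly more conservative than the exact one."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha, n_tests = 0.05, 24  # values quoted in the abstract
    print(f"P(at least one false positive): {familywise_error(alpha, n_tests):.3f}")
    print(f"Exact per-test level:           {exact_per_test_level(alpha, n_tests):.5f}")
    print(f"Bonferroni per-test level:      {bonferroni_level(alpha, n_tests):.5f}")
```

With these inputs, neither of the two validated P-values quoted above (0.0447 and 0.0123) falls below either per-test level.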
Keywords: Subgroup analyses; Multiple comparisons; Hypothesis testing; Astrology; Data mining; Statistical methods

1. Introduction

The second International Study of Infarct Survival (ISIS-2) demonstrated that the use of aspirin during the acute phase of acute myocardial infarction reduced mortality in a group of more than 17,000 patients. A subgroup analysis demonstrated that aspirin increased mortality of patients born under the astrological sign of Gemini or Libra. This biologically implausible finding reinforced the authors’ contention that frivolous subgroup analyses should [...]

Although the subgroup analysis in the ISIS-2 trial was intended as an amusing illustration of a fundamental statistical construct, other investigators have examined the effect of astrologic signs more rigorously. For example, Gurm and Lauer conducted a study to examine the belief that those born under the sign of Leo are "big-hearted" and at increased risk for heart disease. They examined 32,386 patients who underwent exercise stress testing at the Cleveland Clinic between 1990 and 1999 and found a slight excess of deaths among Leos (9.6% vs. 8.7%). This effect disappeared in a matched propensity score analysis (P = 0.3). Furthermore, they found no correlation between astrological signs and [...]

While an undue reliance on astrologic phenomena as a guide to health and healthcare may put subjects at risk for adverse outcomes, we examined the relationship between birth sign and health outcomes with a different intent. The purpose of the current study was to demonstrate the pitfalls of multiple hypothesis testing and of conducting analyses without prespecified hypotheses. We hypothesized that we could generate numerous statistically significant associations, but that these would be neither reproducible nor biologically plausible. For illustrative purposes, we studied the association between astrological signs and health.

* Corresponding author. Tel.: +1-416-480-6131; fax: +1-416-480-[...]
0895-4356/06/$ – see front matter © 2006 Elsevier Inc. All rights reserved.

2. Methods

We conducted a population-based retrospective cohort study using administrative databases covering 10,674,945 residents of Ontario aged 18–100 years. The Registered Person’s Database (RPDB) contains basic demographic data on all residents of Ontario, Canada. We extracted information on all residents of Ontario who were between the ages of 18 and 100 in 2000 and who were alive on their birthday in 2000. We then randomly assigned these individuals to equally sized derivation and validation cohorts. From the birth date, we determined the astrological sign under which [...]

The Canadian Institute of Health Information (CIHI) hospital discharge abstract database contains data on all hospital separations in the province of Ontario. We examined all admissions to Ontario hospitals among subjects aged 18 to 100 years during a 2-year period (January 1, 2000 to December 31, 2001) that were classified as either urgent or emergent admissions (i.e., elective or planned admissions were excluded). Each admission was classified according to the most responsible diagnosis, using the first three digits of the ICD-9 coding scheme. Diagnoses were then ranked from most frequent to least frequent. Both the CIHI discharge abstract database and the RPDB database contain encrypted versions of residents’ health card numbers, permitting the two databases to be deterministically linked in an anonymous fashion.

Beginning with the most frequently occurring urgent or emergent diagnosis for hospitalization, we determined whether persons in the derivation cohort were hospitalized with that diagnosis in the 365 days following their birthday in 2000. We then determined the proportion of subjects born under each astrological sign who were hospitalized with that same diagnosis in the year subsequent to their birthday in 2000. We then identified the astrological sign with the highest hospitalization rate for that diagnosis. We then determined whether the probability of admission for that diagnosis was statistically significantly different for residents born under this astrological sign than for residents born under all other astrological signs combined (i.e., we compared the probability of admission between residents born under one astrological sign and residents born under all other signs, a two-sample comparison of binomial proportions). Statistical significance was assessed using Fisher’s exact test, and a two-tailed significance level of 0.05 was used to denote statistical significance. This process was repeated for all diagnoses, beginning with the most frequent, until two diagnoses were identified for each astrological sign. This phase of the study served as the hy-[...]

In the validation cohort, we explicitly tested the 24 hypotheses associating astrological sign and illness that were [...]

3. Results

The number of Ontario residents who were aged between 18 and 100 years and who were alive on their birthday in 2000 was 10,674,945. The derivation cohort included 5,337,472 residents and the validation cohort included 5,337,473 residents. There were 895 diagnoses for which patients had emergent and urgent hospitalizations between January 1, 2000 and December 31, 2001.

In the derivation cohort, it was necessary to search sequentially through admissions for the 223 most common causes for hospitalization to identify two diagnoses for which the probability of hospitalization was statistically significantly greater for residents born under each astrological sign compared to residents born under the remaining 11 astrological signs. These 223 diagnoses accounted for 91.8% of all urgent and emergent hospitalizations in Ontario in 2000 and 2001. Of these 223 diagnoses, there were 72 (32.3%) for which residents born under one astrological sign had a significantly higher probability of hospitalization compared to residents born under the other astrological signs combined (P < 0.05). The number of diagnoses for which residents born under a given astrological sign had a significantly higher probability of hospitalization compared to residents born under the 11 other astrological signs combined ranged from a low of 2 (Scorpio) to a high of 10 (Taurus), with a mean of 6 diagnoses for each astrological sign. The P-values for the 72 significant associations ranged from 0.0003 to 0.0488. The two most frequently occurring diagnoses for which each astrological sign had a higher probability of hospitalization compared to the other astrological signs combined are described in Table 1. The P-values for testing the significance of the association between a particular astrological sign and the probability of the diagnosis-specific admission ranged from 0.0006 to 0.0475 among these 24 potential associations. In Table 1, we also report the relative risk comparing the probability of hospital admission for residents born under the given astrological sign with the probability of hospital admission for residents born under all other astrological signs combined. The relative risks ranged from a low of 1.10 to a high of 1.80. For example, the probability of hospitalization for lymphoid leukemia was 80% greater for Scorpios than it was for residents born under the 11 other astrological signs [...]

Table 1
Diagnoses for which residents with given astrological sign had a higher probability of hospitalization compared to residents born under the remaining astrological signs combined: results from derivation cohort
Intestinal infections due to other organisms
Intestinal obstruction without mention of hernia
Encounter for other and unspecified procedure and aftercare
Other ill-defined and unknown causes of morbidity and mortality
Other acute and subacute forms of ischemic heart disease
Abbreviation: NEC = not elsewhere classified.

We tested the associations identified in Table 1 in the validation cohort. Of the 24 associations identified in the derivation cohort, only 2 remained statistically significant in the validation cohort. In the validation cohort, residents born under the sign of Leo had a significantly higher probability of hospitalization due to gastrointestinal hemorrhage
compared to other residents of Ontario, with a relative risk of 1.15 (P = 0.0483). Similarly, residents born under the sign of Sagittarius had a significantly higher probability of hospitalization for fractures of the humerus compared to residents born under the remaining 11 astrological signs, with a relative risk of 1.38 (P = 0.0125). The remaining 22 associations were no longer significant in the validation cohort.

4. Discussion

We identified at least two diagnoses for which Ontario residents born under each astrological sign had a significantly higher probability of hospitalization compared to residents born under the remaining astrological signs combined. Two of these 24 associations remained statistically significant when tested in an independent validation cohort. These observations yield several important lessons about hypothesis testing, study design, and the interpretation of [...]

4.1. The pitfalls of multiple significance tests

First, it was relatively simple to generate numerous statistically significant associations when we examined a large number of potential associations. We began the study with no prespecified hypotheses. Rather, we searched sequentially through a list of diagnosis codes until at least two diagnoses had been found for each astrological sign, for which residents born under that sign were significantly more likely to be hospitalized compared to residents born under the remaining astrological signs combined. This exercise implicitly involved multiple comparisons for each diagnosis. For each astrological sign, we computed the proportion of persons born under that sign who were hospitalized for that diagnosis in the year subsequent to their birthday in 2000. We then selected the astrological sign for which persons born under that sign had the highest probability of hospitalization. This implicitly involved 66 pairwise comparisons, because there are $\binom{12}{2} = 66$ ways of selecting distinct pairs from a set of 12 objects. The finding that 22 of 24 statistically significant findings generated in the derivation cohort were not confirmed in the validation cohort illustrates the dangers inherent in studies involving multiple, non-prespecified hypotheses.

4.2. Adjusting P-values for multiple comparisons

Second, our observation that two of the associations identified in the derivation set were confirmed in the validation set does not necessarily provide evidence that those born under the sign of Leo have a significantly higher probability of hospitalization for gastrointestinal hemorrhage, or that those born under the sign of Sagittarius have a higher probability of hospitalization for fractures of the humerus. Under the null hypothesis, P-values are uniformly distributed between 0 and 1. The likelihood of a type I error (identifying a statistically significant association where none exists) is 5% when using a 0.05 significance level. When testing 24 hypotheses in which the null hypothesis is true, the likelihood that at least one will be found to be significant simply by chance is 70.8%. Thus, by not making appropriate adjustments for the testing of multiple hypotheses, we greatly increased our risk of falsely "uncovering" an association between astrological sign and illness. Had we instead endeavored to preserve an overall type I error rate of 0.05, we would have had to use a significance level of 0.00213 for each of the 24 individual hypothesis tests (this is marginally less conservative than a Bonferroni correction, which would have used a significance level of 0.05/24 = 0.00208; the exact calculation requires that the multiple comparisons be independent of one another, whereas the Bonferroni bound does not). Using this significance level, none of the 24 hypothesized associations would have been significant in the validation cohort. Sankoh et al. discuss the relative merits of different methods in adjusting for the testing of multiple endpoints in clinical trials. In particular, they note that the Bonferroni adjustment (which is an approximation to our exact method) ignores most of the information from the data and is too conservative when there are many outcomes. Bender and Lange provide an overview of methods to adjust for multiple testing in medical and epidemiologi-[...]

Similarly, in the derivation cohort, there were implicitly 14,718 comparisons (223 diagnoses × 66 pairwise comparisons per diagnosis). To retain an overall 5% type I error rate, one would need to use a significance level of 0.000003485 for an individual hypothesis test. Using this significance level, none of the 72 associations identified in the derivation cohort would have been identified as statistically significant. We should note that there were 72 diagnoses for which the astrological sign with the highest probability of hospitalization had a significantly higher probability of hospitalization compared to that for the remaining astrological signs combined. It is highly likely that there were other astrological signs (but not the one with the highest probability of hospitalization) that had a significantly higher probability of hospitalization compared to residents born under the remaining 11 astrological signs combined. While these comparisons were implicitly considered in our design, they were not reported on in the current study. Our study illustrates that in a trial with multiple hypothesis tests (either secondary outcomes or subgroup analyses), the significance level used should be adjusted to preserve an overall type I error of a desired level. It is common in randomized clinical trials to examine one primary outcome or endpoint and multiple secondary endpoints. However, as the number of secondary endpoints or subgroup analyses increases, the risk of erroneously identifying a significant association also increases. To quantify the prevalence of subgroup analyses and the number of endpoints in clinical trials, we examined all 131 randomized clinical trials published in the Journal of the American Medical Association, the New England Journal of Medicine, the Lancet, and the British Medical Journal between January 1 and June 30 of 2004. The mean and median number of subgroups in which endpoints were compared between treatment arms were 5.1 and 2, respectively (IQR = 0–6), while the mean and median number of significance tests of efficacy and safety endpoints were 26.5 and 19, respectively (IQR = 9–32). The maximum number of distinct subgroups in which endpoints were compared between treatment arms was 68, while the maximum num-[...]

4.3. The importance of biologic plausibility

Third, none of the hypotheses generated using the derivation cohort had any apparent biologic plausibility. Despite confirming 2 of the 24 prespecified hypotheses in the validation cohort, there is no currently apparent mechanism by which Leos might be predisposed to gastrointestinal hemorrhage or Sagittarians to humeral fractures. In interpreting the subgroup analyses from the ISIS-2 trial, the authors argued that the results were not biologically plausible, and should be ignored. Caution is required in interpreting results that do not have apparent biological plausibility. In particular, it is important that biologically plausible associations be specified during the design of the study, because it is tempting to construct biologically plausible reasons for observed subgroup effects after having observed them. Our study demonstrates that data-driven statistical methods may result in conclusions that are neither reproducible nor biologically plausible.

4.4. Subgroup analyses in clinical trials

Subgroup analyses are common in randomized controlled trials. Indeed, the subgroup analysis reported by the ISIS-2 investigators motivated the current study. Many investigators have cautioned against subgroup analyses in randomized controlled trials. It has been argued that such analyses should be prespecified, and that there should be a prespecified biologically plausible explanation for the proposed subgroup analysis. Furthermore, it has been suggested that one should not be guided by statistical significance, but rather by trends and consistency, because such analyses are frequently underpowered. Similarly, Sleight cautions against subgroup analyses in randomized clinical trials, suggesting that plausible explanations for specific findings can often be found for conclusions that were, in reality, spurious. If our categorization of residents had been based upon clinical criteria or demographic characteristics rather than astrological sign, it is likely that post hoc plausible explanations could have been constructed for many of the associations identified. Both Yusuf et al. and Oxman and Guyatt provide guidelines for interpreting the results of subgroup analyses. Freemantle suggested that a purist approach would be to examine subgroup analyses and secondary endpoints only if the primary endpoint is statistically significant. Recently, Rothwell discussed arguments for and against subgroup analyses and provided guidelines for designing and interpreting subgroup analyses. There are increasing calls for the registration of trial protocols prior to the start of randomized clinical trials, an initiative that could reduce the number of frivolous subgroup analyses. The current study adds a cautionary note concerning the practice of conducting numerous significance tests, such as those often performed in the setting of a randomized trial.

The current study used both derivation and validation datasets. Only 2 of the 24 significant associations that were identified using the derivation cohort remained statistically significant in the validation cohort. The use of derivation and validation datasets has been frequently advocated in the statistical literature. The use of a validation dataset allows one to assess the reproducibility of findings obtained in the derivation cohort, and serves to protect oneself from identifying spurious findings in a single dataset. We suggest that when surprising associations are obtained, either as a result of subgroup analyses or analysis of secondary outcomes in clinical trials, researchers seek to reproduce these [...]

This concept is nicely illustrated by two major clinical trials. The Prospective Randomized Amlodipine Survival Evaluation (PRAISE) study examined the effect of amlodipine in patients with congestive heart failure and found no benefit in the primary analysis. In a prespecified subgroup analysis, amlodipine reduced the risk of fatal and nonfatal events in patients with severe nonischemic heart failure (P = 0.04). Furthermore, amlodipine seemed to prevent a secondary outcome (mortality) in the same patients (P < 0.001). The PRAISE-2 trial, which was explicitly designed to examine the effect of amlodipine in nonischemic heart failure patients, found no effect on mortality or cardiac events. This trial was never reported in detail. Similarly, the Evaluation of Losartan in the Elderly (ELITE) trial suggested a survival benefit in elderly heart failure patients treated with the angiotensin II antagonist losartan compared to the ACE inhibitor captopril. This finding was not replicated in the ELITE II trial. The results of the PRAISE/PRAISE-2 and ELITE/ELITE II trials illustrate that subgroup analyses, even when specified, can result in findings that are not subsequently [...]

Finally, there is an increasing interest in "data mining" as a means of hypothesis generation, particularly in commercial endeavors. Data mining has been variously described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and as a "semi-automatic extraction of patterns, changes, associations, anomalies, and other statistically significant structures from large data sets". Data mining is often conducted in large datasets and often does not involve prespecified hypotheses. In the current study, we began with no prespecified hypotheses, and used automated methods to detect apparently significant associations. Despite the addition of a validation cohort, two unanticipated associations remained significant. Our study therefore serves as a cautionary note regarding the interpretation of findings generated by data mining, and suggests that conclusions obtained from data mining should be viewed with a healthy degree of skepticism.

In conclusion, we were able to identify multiple significant associations, all of them clinically implausible, between astrological sign and the probability of hospitalization for specific diagnoses. Two of these associations remained statistically significant when tested in an independent validation cohort. Our study emphasizes the hazards of testing multiple, non-prespecified hypotheses.

Acknowledgments

The Institute for Clinical Evaluative Sciences (ICES) is supported in part by a grant from the Ontario Ministry of Health and Long-Term Care. The opinions, results, and conclusions are those of the authors and no endorsement by the Ministry of Health and Long-Term Care or the Institute for Clinical Evaluative Sciences is intended or should be inferred. Drs. Austin, Mamdani, and Juurlink are supported by New Investigator awards from the Canadian Insti-[...]

References

[1] ISIS-2 Collaborative Group. Randomized trial of intravenous streptokinase, oral aspirin, both, or neither among 17187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 1988;2(8607):349–60.
[2] Gurm HS, Lauer MS. Predicting incidence of some critical events by sun signs – the PISCES Study. ACC Curr J Rev 2003;Jan/Feb:22–4.
[3] Philips DP, Ruth TE, Wagner LM. Psychology and survival. Lancet [...]
[4] Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat [...]
[5] Bender R, Lange S. Adjusting for multiple testing – when and how? J [...]
[6] Topol EJ, Califf RM, Van de Werf F, Simoons M, Hampton J, Lee KL, et al. Perspectives on large-scale cardiovascular clinical trials for the new millennium. Circulation 1997;95:1072–82.
[7] Sleight P. Debate: subgroup analyses in clinical trials – fun to look at, but don’t believe them? Curr Control Trials Cardiovasc Med [...]
[8] Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. J Am Med Assoc 1991;266:93–8.
[9] Oxman AD, Guyatt GH. A consumer’s guide to subgroup analysis. [...]
[10] Freemantle N. Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt [...]
[11] Rothwell PM. Subgroup analysis in randomized controlled trials: importance, indications, and interpretation. Lancet 2005;365:176–86.
[12] DeAngelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, et al. International Committee of Medical Journal Editors. Clinical trial registration: a statement from the International Committee of Medical Journal editors. J Am Med Assoc 2004;292:1363–4.
[13] Picard RR, Berk KN. Data splitting. Am Stat 1990;44:140–7.
[14] Packer M, O’Connor CM, Ghali JK, Pressler ML, Carson PE, Belkin RN, et al. Effect of amlodipine on morbidity and mortality in severe chronic heart failure. N Engl J Med 1996;335:1107–14.
[15] Thackray S, Witte K, Clark AL, Cleland JGF. Clinical trials update: OPTIME-CHF, PRAISE-2, ALL-HAT. Eur J Heart Fail 2000;2:209–12.
[16] Pitt B, Segal R, Martinez FA, Meurers G, Cowley AJ, Thomas I, et al. Randomized trial of losartan versus captopril in patients with heart failure (Evaluation of Losartan in the Elderly Study, ELITE). Lancet [...]
[17] Pitt B, Poole-Wilson PA, Segal R, Martinez FA, Dickstein K, Camm AJ, et al. Effect of losartan compared with captopril on mortality in patients with symptomatic heart failure: randomized trial – the Losartan Heart Failure Survival Study ELITE II. Lancet [...]
[18] Everitt BS. The Cambridge dictionary of statistics, 2nd edition. Cambridge: Cambridge University Press; 1998.
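Supplementary sketch. The Methods describe comparing the probability of admission for one astrological sign against all other signs combined using a two-tailed Fisher’s exact test. A minimal version of that test for a 2×2 table might look like the following. The function name and the counts in the usage example are hypothetical and are not study data. The two-sided rule used here (summing all tables no more probable than the observed one) is one common convention and is assumed rather than taken from the paper.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Rows: sign of interest / all other signs combined.
    Columns: hospitalized / not hospitalized.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x hospitalized subjects in row 1,
        # with all table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small tolerance guards against float rounding.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical counts: 30 of 1,000 hospitalized under one sign,
    # 240 of 11,000 hospitalized under all other signs combined.
    p_value = fisher_exact_two_sided(30, 970, 240, 10760)
    print(f"two-sided P = {p_value:.4f}")
```

Exact integer arithmetic keeps the sketch dependency-free but becomes slow for cohorts of millions; at that scale a library routine such as `scipy.stats.fisher_exact` would be the more practical choice.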

Source: http://personalpages.to.infn.it/~bagnasco/Austin2006.pdf
