The Philosophical Basis of Statistical Misuse
Over the past decade, researchers have paid increasing attention to the problems that arise when statistical methods are misused across a variety of applications. An examination of the widespread misapplication of statistics across academic disciplines provides valuable information for the emerging class of data analysts. By examining these common hurdles, data analysts not only perfect their analytic craft through the conscious and appropriate application of statistical techniques across a variety of problems; they also contribute to increased discernment among non-technical consumers of oft-misinterpreted statistical insights. Especially given the abundance of statistical technologies, this increased discernment is vital to reducing the various harms that emanate from misapplied methods and misinterpreted outcomes.
The Problem of Statistics?
Across surgical research, psychotherapy research, and even litigation, the elementary application of statistical techniques falls prey to a ubiquitous set of misinterpretations. According to Dar, Serlin & Omer in Misuse of Statistical Tests in Three Decades of Psychotherapy Research (1994), much of the current use of statistical tests is flawed. Several researchers attribute this misuse to a deficiency in statistical literacy. Statistical methods are semi-automated algorithms, which are highly error-prone without extremely sophisticated human input. The primary problem inherent in the application of traditional statistics is that its techniques do not clearly separate the causal and the associational aspects of inference; this abstraction may be difficult for decision-makers to grasp [2, 3].
The problem, too, is partly philosophical. The decades-long trend toward pragmatic, applied research has led to an increased demand for clinically meaningful findings. Simultaneously, academia’s rigid treatment of methodologies as overarching philosophies, rather than as technologies or tools, aggravates controversies over the roles of formal methods. The decline in theoretical guidance, together with advances in statistical computation, has led to an increased use of exploratory procedures. Misuse of Statistics in Surgical Literature (2016) found a growing volume of advanced multivariate techniques throughout clinical-research practice. Despite these advances, a large proportion of research fails to adopt and appropriately apply statistical testing methodologies that produce interpretable results.
The result of this widespread misuse of statistical models is an oversimplification of analyses, leading to a misinterpretation of the data. The global application of hypothesis testing, and the rigidly incorrect interpretation of the p-value as an assertion of a particular position’s significance (or insignificance), reflect a pervasive ignorance of the subtleties of statistical inference across fields. Further, the increasing application of more advanced statistical techniques compounds the already general problem of model interpretability, while also producing an increased sense of confidence. As best articulated by Sander Greenland, the extensive treatment of models as “black boxes” for synthesizing input into formal inferences encourages a dangerous overconfidence in formal results.
A Battle of Frameworks
The most substantial issue that should attract the attention of the burgeoning data analyst is that statistics is hardly a unified approach. Its practitioners divide into Frequentist and Bayesian factions; and among Frequentist proponents, dissonance endures between the Fisher and Neyman-Pearson approaches to statistical testing. This diverse history is generally neglected in introductory statistics training, and yet its implications nevertheless influence the prevalent misapplications of statistical methods today. Although taught as a unified framework, what we call “statistics” speaks less to a unified methodology, evolved linearly over time, and more to a phylogenetic tree of potential approaches, each of which may be more or less suitable depending on the data at hand. Knowledge of these competing frameworks liberates researchers from the long-standing pattern of misuse and misinterpretation of one particular methodology, to pursue truly valuable insights.
Bayesian vs. Frequentist
In An Essay Towards Solving a Problem in the Doctrine of Chances, published posthumously in 1763, Reverend Thomas Bayes proposed a solution to the problem of computing a probability distribution by considering existing degrees of belief. The resulting Bayesian approach defines probability as the measure of a person’s degree of belief in an event, and applies this knowledge as a priori subjective probabilities against real-world observations to determine the probability of an event. The Frequentist approach to statistical inference arose from the philosophical discourse between Fisher, Neyman and Pearson in the early twentieth century. Instead of considering real-world events, and the existing knowledge of the likelihood of those events, frequentists define probability as the limiting relative frequency of an event over an infinite series of trials. The important distinction between the Bayesian and Frequentist approaches is that in the Bayesian approach, the data is fixed while the parameters are adjusted to reflect a state of knowledge, whereas in the Frequentist approach, the parameters are fixed while the data is repeatedly sampled.
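The contrast can be made concrete with a small sketch. The coin-flip counts, the uniform Beta(1, 1) prior, and the function names below are illustrative assumptions rather than anything taken from the cited sources; the Beta-binomial update is one standard textbook way of expressing “prior belief adjusted by fixed data”, not the only one.

```python
from fractions import Fraction

def relative_frequency(successes: int, trials: int) -> Fraction:
    """Frequentist-style estimate: observed share of successes in the trials."""
    return Fraction(successes, trials)

def beta_posterior(prior_a: int, prior_b: int, successes: int, trials: int):
    """Bayesian-style update: a Beta(a, b) prior plus binomial data
    gives a Beta(a + successes, b + failures) posterior."""
    return prior_a + successes, prior_b + (trials - successes)

def beta_mean(a: int, b: int) -> Fraction:
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)

# Made-up illustrative data: 7 heads in 10 flips.
heads, flips = 7, 10

# The frequentist treats the parameter as fixed and the data as one sample
# from many hypothetical repetitions of the experiment.
freq_estimate = relative_frequency(heads, flips)

# The Bayesian treats the data as fixed and updates a belief about the
# parameter, here starting from an assumed uniform Beta(1, 1) prior.
post_a, post_b = beta_posterior(1, 1, heads, flips)
bayes_estimate = beta_mean(post_a, post_b)

print(freq_estimate, (post_a, post_b), bayes_estimate)
```

With so little data the two estimates differ noticeably, because the assumed prior still carries weight; with more flips they would converge.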
Of the statistical misuses studied across research fields, most examples seem to relate to the application of frequency-based statistical models to what are essentially Bayesian questions. The dominance of the Frequentist framework across scientific-research and litigation domains acts as a positive-feedback mechanism that reinforces the misuse and misinterpretation of statistical models – one size cannot fit all! In litigation and in scientific domains, there is often a given set of “real-world” data against which a particular event may be assessed for its likelihood. The difficulty in interpreting the probability of a hypothesized event arises from the frequentist premise of a repeated experiment applied to a single set of (sample) data, when a larger population is unavailable. For a Bayesian, there is just the data, which is used as evidence for a particular hypothesis.
Fisher vs. Neyman-Pearson
The Frequentist theory of hypothesis testing began with Student’s discovery of the t-distribution in 1908. Following this discovery, R. A. Fisher, a geneticist and statistician, saw great potential in applying “significance testing” to the problem of statistical inference in science. His interpretation of the p-value in Statistical Methods for Research Workers (1925) advocated a 5% standard level of significance, a measure still canonized throughout research statistics. Where results are “non-significant”, the Fisherian approach simply advocates for more data and more testing. Simultaneously, J. Neyman and E. S. Pearson advanced the application of “hypothesis testing” as a decision-oriented process for weighing and eliminating one of two alternatives. Unlike Fisher, Neyman and Pearson considered the implications not only of accepting or rejecting the null hypothesis, but also of falsely accepting a false hypothesis or falsely rejecting a true one. Whereas Fisher’s approach to probability is one of inductive inference, the Neyman-Pearson approach is one of deductive argument, which seeks to minimize the probability of false conclusions.
The concept of the hypothesis test and the interpretation of the p-value account for the bulk of the statistical misinterpretation surveyed. In seeking to present a more intuitive approach to the calculation of probability, the Neyman-Pearson approach oversimplifies both the problem and the meaning of the p-value. Perhaps in reaching for greater simplicity, research statistics has hybridized the contrasting philosophies of significance testing and hypothesis testing into “null-hypothesis significance-testing”, which today dominates as a faulty tradition. Further, ignoring the original context of Fisher’s 5% rejection threshold as a personal preference, research statistics approaches “statistical significance” as a test for the truth of a proposition, rather than as a measure of the model’s fit to the data. Pervasively, significance testing is misinterpreted to imply causality, that is, that one of the hypotheses is true. According to Dar et al., “we can only estimate the probability of obtaining the data given the truth of the null hypothesis, not the probability of the truth of the null hypothesis given the data”.
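The gap between the probability of the data given the null and the probability of the null given the data can be illustrated with Bayes’ theorem. The sketch below is a simplification under stated assumptions: it treats “a significant result” as the observed event, and the prior probabilities, the 5% significance level, and the 80% power are hypothetical numbers chosen only for illustration, as is the function name.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(null is true | result is significant), by Bayes' theorem.

    alpha is P(significant | null true); power is P(significant | null false).
    All three inputs are assumed values, not estimates from any study.
    """
    p_sig_and_null = alpha * prior_null
    p_sig_and_alt = power * (1.0 - prior_null)
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)

# Same test (alpha = 0.05, power = 0.8), two different assumed priors.
even_odds = prob_null_given_significant(prior_null=0.5, alpha=0.05, power=0.8)
sceptical = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)

print(f"P(null | significant), prior 0.5: {even_odds:.3f}")
print(f"P(null | significant), prior 0.9: {sceptical:.3f}")
```

Under these assumptions, the probability that the null is true after a significant result depends heavily on the prior and is not equal to the 5% significance level, which is the distinction the quotation draws.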
Unlike Bayesian probability, which seeks to discern the probability of an event given a state of incomplete knowledge (a thoroughly non-intuitive mental exercise), Frequentist probability is seemingly more direct, and yet its interpretation is falsely intuitive. Much of the misinterpretation predominating in current statistical inference is an attempt to impart real-world causality to the outcomes of an idealized, frequency-based test that assumes no prior knowledge but rather seeks to approach a Platonic ideal of truth. The data analyst cannot afford to ignore the philosophical differences between these mathematical approaches. And of the problems that justify the application of frequentist philosophies, only a few very specific ones warrant Neyman-Pearson hypothesis testing. Unfortunately, the widespread misapplication of the hybridized Fisher/Neyman-Pearson approach as the only approach is incentivized by several factors.
Most significantly, research journals both respond to and perpetuate the bandwagon pressure to accept significance levels as a graduated scale for determining causal relevance. It is simply incorrect to interpret a p-value as an indicator that a hypothesis has “more” or “less” significance. Because research that fails to show significance tends not to be published, researchers cite “marginally significant” or “borderline significant” findings. Further, there is no objective basis for believing that a proposition is true because the evidence for it is “statistically significant” at Fisher’s arbitrary 5% threshold. The overemphasis of research journals on the meaning of this threshold places significant pressure on researchers to unwittingly misapply both hypothesis tests and significance tests in pursuit of publication [3, 4].
What often fails to be understood by researchers and publishers alike is that in rigorously controlling for Type II error – the error of choosing a false null – the Neyman-Pearson approach places the burden of proof on the alternative proposition. The hybrid-Frequentist approach perpetuates a general bias towards null conclusions that is fed by beliefs that this bias is the hallmark of the scientific method. Further, the size of the dataset matters when it comes to interpreting the significance of significance levels. Small datasets must demonstrate more extreme differences between the hypothesized alternatives, whereas the p-values of large datasets are more sensitive to subtle differences. Repeated testing of datasets also increases this sensitivity to “noise”. As observed by Dar et al., there has been a proliferation of falsely rejected null hypotheses in research. In relying so heavily on Fisher’s 5% significance threshold for scientific induction while applying the Neyman-Pearson decision-oriented approach, researchers are, in fact, increasing Type I error and incorrectly rejecting the true null.
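Both effects, the dependence of p-values on sample size and the inflation of false rejections under repeated testing, can be sketched numerically. The example below assumes a two-sided z-test with a known standard deviation and mutually independent tests; the effect size, sample sizes, and function names are hypothetical choices made for illustration only.

```python
import math

def two_sided_p_value(effect: float, sigma: float, n: int) -> float:
    """Two-sided p-value of a z-test for an observed mean difference `effect`,
    assuming a known standard deviation `sigma` and sample size `n`."""
    z = effect / (sigma / math.sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

def familywise_error_rate(alpha: float, k: int) -> float:
    """Chance of at least one false rejection across k independent tests
    of true nulls, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** k

# The same small observed difference (0.1 standard deviations) at two sample sizes.
p_small = two_sided_p_value(effect=0.1, sigma=1.0, n=25)
p_large = two_sided_p_value(effect=0.1, sigma=1.0, n=10_000)

# Twenty independent tests, each at the conventional 5% level.
fwer_20 = familywise_error_rate(alpha=0.05, k=20)

print(f"p-value, n=25:     {p_small:.3f}")
print(f"p-value, n=10000:  {p_large:.3g}")
print(f"chance of >=1 false rejection in 20 tests: {fwer_20:.3f}")
```

Under these assumptions, an identical observed difference is “non-significant” in the small sample and overwhelmingly “significant” in the large one, and running twenty tests at the 5% level makes at least one false rejection more likely than not.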
Finally, the data analyst must consider the human preference for simplicity, clarity and causality as an incentive driving both misuse and misinterpretation of statistical inference. Greenland observes many of these incentives in For and Against Methodologies: Some Perspectives on Recent Causal and Statistical Inference Debates (2017). On one hand, psycho-social obstacles to change place pressure on researchers to draw unambiguous conclusions in efforts to appear important and to advance their careers. On the other hand, the overconfidence in formal results may derive from a need to feel that one’s labor was justified. Add to this a pervasive human aversion to uncertainty, shared by researchers, publishers, jurors, judges, and business-area managers alike. How else can we explain this historically rich fascination with the determination of probability in mathematics?
References
Ambaum, M. H. (2012, July). Frequentist vs. Bayesian statistics: A non-statisticians view (Rep.). Retrieved from http://www.met.reading.ac.uk/~sws97mha/Publications/Bayesvsfreq.pdf
Dar, R., Serlin, R. C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy research. Journal of Consulting and Clinical Psychology, 62(1), 75-82. doi:10.1037//0022-006x.62.1.75
Greenland, S. (2017). For and against methodologies: Some perspectives on recent causal and statistical inference debates. European Journal of Epidemiology, 32(1), 3-20. doi:10.1007/s10654-017-0230-6
Kaye, D. H. (1986). Is proof of statistical significance relevant? Washington Law Review, 61(1333), 1333-1365.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88(424), 1242-1249. doi:10.1080/01621459.1993.10476404
Thiese, M. S., Ronna, B., & Robbins, R. B. (2016). Misuse of statistics in surgical literature. Journal of Thoracic Disease, 8(8). doi:10.21037/jtd.2016.06.46
Vallverdú, J. (2008). The false dilemma: Bayesian vs. Frequentist. Electronic Journal for Philosophy, ArXiv:0804.0486. Retrieved from https://arxiv.org/pdf/0804.0486.pdf