A Short Course in Statistical Terminology


How much a society values something corresponds with the number of words its culture allocates to describing it. Likewise, the words available to a person shape his or her ways of perceiving. Within the discourse surrounding data analytics, the terms analytics, advanced analytics, data mining and machine learning are often used interchangeably. Several texts illuminate the subtle differences among them. Examining these terms and sharpening their distinctions helps to standardize a common lexicon across domains. This lexicon does more than satisfy a shared need for precision and consistency in language; it is an essential enabler of the growing discourse, research, discovery and application of data analysis techniques for solving real-world problems.

In Essentials of Business Analytics (2016), the authors define business analytics as a scientific process of transforming data into insight for making better decisions (Camm, Cochran, Fry, Ohlmann, Anderson, Sweeney, & Williams, 2016). Analytics may be (1) descriptive, (2) predictive or (3) prescriptive. Descriptive analytics uses techniques that describe what has happened, based on past data. Predictive analytics uses models constructed from past data to predict the future, or to ascertain the impact of one variable on another. Prescriptive analytics uses input data to yield a best course of action. Predictive and prescriptive analytics are considered advanced analytic methods (Figure 1). Finally, data mining refers to the use of analytic techniques for understanding patterns and relationships in large data sets.

Figure 1: Decision-support Analytics
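
The three kinds of analytics can be contrasted in a few lines of code. What follows is only a minimal sketch under stated assumptions: the advertising and sales figures are made up, and the function names (describe, fit_line, prescribe) are hypothetical illustrations rather than anything taken from Camm et al.

```python
from statistics import mean

# Hypothetical monthly data: advertising spend (in $1,000s) and units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]


def describe(values):
    """Descriptive: summarize what has already happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def fit_line(x, y):
    """Predictive: fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def prescribe(intercept, slope, candidate_spends, unit_margin):
    """Prescriptive: pick the candidate spend with the highest projected profit.

    unit_margin is an assumed profit per unit, in the same $1,000s as spend.
    """
    def projected_profit(spend):
        return (intercept + slope * spend) * unit_margin - spend

    return max(candidate_spends, key=projected_profit)
```

Here describe summarizes past sales, the slope returned by fit_line estimates the impact of one variable (spend) on another (units sold), and prescribe turns that model into a recommended action. Whether spending more is recommended depends entirely on the assumed margin.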

Introduction to Data Mining defines data mining variously as a technology and as a process (Tan, Steinbach, & Kumar, 2006). As a technology, data mining blends traditional (statistical) analytic methods with sophisticated (machine-learning) algorithms to process large volumes of data. Here, data mining is a tool that can be employed to support a wide range of business-intelligence applications, such as customer profiling and targeted marketing. As a process, data mining is applied as an integral part of the knowledge-discovery process to automate the discovery of useful information in large data repositories (Figure 2). Here, data-mining techniques enhance information-retrieval systems by scouring large databases of indexed data elements to find useful patterns. In this way, data mining sits as a subprocess between the pre-processing and post-processing of raw data.

Figure 2: Data Mining Sub-Process
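
To make the position of data mining between pre-processing and post-processing concrete, the process can be sketched as three small functions. This is an illustrative assumption, not the procedure given by Tan et al.: the function names are hypothetical, and simple pair counting stands in for a real pattern-mining algorithm.

```python
from collections import Counter
from itertools import combinations


def preprocess(raw_records):
    """Pre-processing: normalize item names and drop empty records."""
    cleaned = []
    for record in raw_records:
        items = {item.strip().lower() for item in record if item and item.strip()}
        if items:
            cleaned.append(items)
    return cleaned


def mine_pairs(transactions):
    """Data mining: count how often each pair of items appears together."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts


def postprocess(pair_counts, n_transactions, min_support):
    """Post-processing: keep only pairs whose support meets the threshold."""
    patterns = {}
    for pair, count in pair_counts.items():
        support = count / n_transactions
        if support >= min_support:
            patterns[pair] = support
    return patterns


# Hypothetical raw input with inconsistent casing, stray spaces and empty rows.
raw = [["Bread", " milk "], ["bread", "Milk", "eggs"], ["eggs", ""], []]
transactions = preprocess(raw)
patterns = postprocess(mine_pairs(transactions), len(transactions), min_support=0.5)
```

With this made-up input, only the bread-and-milk pair appears in at least half of the cleaned transactions, so it is the only pattern that survives post-processing.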

Learning from Data: A Short Course further clarifies the relationships among the fields of statistics, data mining and machine learning (Abu-Mostafa, Magdon-Ismail, & Lin, 2012). Both statistics and data mining share with machine learning the basic premise of “learning from data” (Figure 3). As a mathematical field, statistics answers most questions with proofs, and yields somewhat idealized models that can be analyzed in great detail. Machine-learning techniques make less restrictive assumptions about the data and use more general models; their results are generally weaker, though more broadly applicable. Data mining focuses on finding patterns, correlations and anomalies in large relational databases; this field tends to be less focused on prediction and more focused on (descriptive) analysis.

Figure 3: Learning from Data

In Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms (2013), programmer Jeff Heaton uses the terms machine-learning algorithm and artificial intelligence (AI) algorithm almost synonymously. His examples of each, however, illuminate how machine learning is a subset of AI, and how AI algorithms make up only part of what machine learning encompasses. Neural networks, support vector machines, Bayesian networks and hidden Markov models are all examples of AI algorithms. Machine-learning algorithms, however, may be grouped into four classes that include both AI algorithms and classical statistics: (1) data classification; (2) regression analysis; (3) clustering; and (4) time series. These lists capture a subtle difference that warrants further specification.

Generally, the interdisciplinary field of AI cuts across the fields of cognitive science, computer science, and linguistics. Machine learning is a growing subset of AI that is primarily concerned with a computer system’s ability to learn. AI algorithms variously attempt to codify new input from the external environment into data, to compare it against stored data, and to produce actionable, relevant insights and/or updates. Within a decision-support system, this push/pull interaction is facilitated by supervised, unsupervised or reinforcement learning loops (Figure 4). Statistical methods are generally push techniques, which convert existing data into information. Data-mining techniques work to pull pre-existing knowledge (like text) into data supporting analytics, which can then be used to identify patterns and describe relationships. Machine learning automates aspects of both the push and the pull to create prescriptive knowledge informing decisions.

Figure 4: Systems-embedded Learning
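
Two of the four classes of machine-learning algorithm named above, classification and clustering, also illustrate the difference between supervised and unsupervised learning. The sketch below is a simplified illustration on one-dimensional data, not an example drawn from Heaton; the function names and the choice of nearest-neighbor and k-means methods are assumptions made for brevity.

```python
def classify_1nn(train_points, train_labels, query):
    """Supervised classification: return the label of the closest training point."""
    nearest = min(range(len(train_points)),
                  key=lambda i: abs(train_points[i] - query))
    return train_labels[nearest]


def kmeans_1d(points, initial_centroids, iterations=10):
    """Unsupervised clustering: k-means (Lloyd's algorithm) on 1-D data."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids
```

The classifier needs labels supplied in advance (for example "low" and "high"), whereas the clustering routine receives only the points and discovers the groups on its own. Regression, a third class, amounts to fitting a line or curve to numeric data, and time-series methods add an ordering in time to any of these.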

Both data mining and machine learning present unique challenges to business users in the way that each interacts with system architectures. Whereas traditional top-down analytics are performed on specific data sets collected from servers sitting on top of a data warehouse (such as OLAP servers), the data-mining engine may pull directly from the data warehouse and from other external repositories. Because the data reside within large repositories, data mining faces particular challenges of scalability, dimensionality, complexity, and quality. The purpose of the laborious pre-processing step is to transform the raw input data into a format appropriate for subsequent analysis by fusing, cleansing and imputing values (Tan et al., 2006). Machine learning likewise presents a problem of complexity to business-intelligence workflows, only here it is the algorithms themselves that are complex. Often dubbed “black-box” algorithms, neural networks are a frequently cited example of this complexity in discussions of algorithmic bias and the trustworthiness of prescribed outputs. In the case of data mining, statistical methods may sometimes be employed to verify the recommended outputs. In learning engines, however, the algorithm adapts itself to new input over time, potentially obscuring how it estimated and prioritized an output.
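
The fusing, cleansing and imputing performed during pre-processing can be sketched as follows. This is a hedged illustration only: the record layout (id, region, spend), the rule that a negative spend is invalid, and the choice of mean imputation are all assumptions made for the example rather than steps prescribed by Tan et al.

```python
from statistics import mean


def fuse(crm_rows, sales_rows):
    """Fuse: join two hypothetical sources on a shared customer id."""
    sales_by_id = {row["id"]: row for row in sales_rows}
    return [{**row, **sales_by_id.get(row["id"], {})} for row in crm_rows]


def cleanse(rows):
    """Cleanse: drop duplicate ids and treat negative spend as missing."""
    seen, cleaned = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        spend = row.get("spend")
        if spend is not None and spend < 0:
            spend = None  # assumed rule: a negative spend is a data error
        cleaned.append({**row, "spend": spend})
    return cleaned


def impute(rows, field):
    """Impute: fill missing values with the mean of the observed values.

    Assumes at least one observed value exists for the field.
    """
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


# Hypothetical inputs from two systems, with a duplicate and an invalid value.
crm = [{"id": 1, "region": "east"}, {"id": 2, "region": "west"},
       {"id": 2, "region": "west"}, {"id": 3, "region": "east"}]
sales = [{"id": 1, "spend": 100.0}, {"id": 2, "spend": -5.0},
         {"id": 3, "spend": 300.0}]
prepared = impute(cleanse(fuse(crm, sales)), "spend")
```

In this made-up case the duplicate record for customer 2 is dropped, its invalid spend is treated as missing, and the gap is filled with the mean of the two valid values. Mean imputation is only one of several options, and the right choice depends on the data.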

References

Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012). Learning from data: A short course. Seattle, WA: AMLbook.com.

Camm, J. D., Cochran, J. J., Fry, M. J., Ohlmann, J. W., Anderson, D. R., Sweeney, D. J., & Williams, T. A. (2016). Essentials of business analytics. Boston, MA: Cengage Learning.

Heaton, J. (2013). Artificial intelligence for humans: Fundamental algorithms (Vol. 1). St. Louis, MO: Heaton Research.

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston, MA: Pearson.