Statistical Comparison of Machine Learning Algorithms (Part 1)
This is the first of a two-part series dealing with the application of statistical tests for the formal comparison of several Machine Learning (ML) algorithms in order to determine whether one generally outperforms the rest or not. In this first chapter, we explain the fundamentals of statistical tests, while in the second part, we examine how they are applied to ML algorithm performance data with the aim of comparing them from a statistical point of view.
In industry, when a practitioner (often a Data Scientist) uses a machine learning algorithm to build a predictive model to solve a real-world problem, they are interested in the performance when the model is deployed into a production environment. In other words, the Data Scientist needs to provide the customer with an estimate of how good the model will perform in a production environment when predicting the outcome for new examples never seen before by the model. In order to obtain an estimate of the performance (most often the test error), it is enough to separate a subset of the examples (the test subset) that are not used to train the model, but are fed into the final trained model in order to make predictions on them. Those predictions are compared with the (known) actual output of each of the test examples. Some measure of performance (there are many out there) is calculated taking into account the differences between the predicted and actual outputs, and then the exact same model (with no changes) is shipped into production and deployed. Ideally the test error reassures the customer, who knows in advance the approximate model performance when facing new examples.
In academia, a researcher has quite a different goal. As it happens in many other research areas -not just in Computer Science-, when a new ML algorithm is to be published, the authors must prove in their paper that it represents an advancement with respect to other state-of-the-art algorithms. Otherwise, it is difficult for a renowned research journal to publish yet another algorithm to solve the same task – although a few journals exist which favour originality and fresh ideas over useful advances in the state of the art. In general, what the scientific community (starting with the journal reviewers) expects to see is how an algorithm compares to others – and the authors must be very careful in designing the comparison procedure. It has to be fair and rigorous (that’s what scientific research is all about!) – and nothing is fairer and more rigorous than inferential statistics. The first publication that formally defined how statistical tests should be used to compare ML algorithms was Demsar’s 2006 seminal paper . Many papers followed, with  and  among the most influential.
Inferential statistics and statistical tests
In a broad sense, inferential statistics can be defined as the part of statistics that studies how to draw conclusions from a whole population (that is large enough to make it infeasible to study every individual) by only studying a finite sample of individuals of the population. Remember that a sample is a set of individuals (or examples) taken (be it with a randomized procedure or not) from a population. The conclusions we might be interested in about the population are things like (i) the (unknown) value of some parameters that characterize the population, (ii) the relationships existing between some aspects (or variables) that can be measured on the population, (iii) determining whether two samples come from a single population or two different populations, etc.
Statistical tests constitute a very important and widely used tool within inferential statistics, as it has lots of applications (as it happens with statistics in general) in many disciplines, from engineering to biology, sociology, economy and almost anything that we study from finite samples – which is basically everything in the world around us. A statistical test is aimed at accepting or rejecting some hypothesis we make about a population, by just studying our finite sample. In simple terms, it is able to judge whether our hypothesis about the whole population is compatible with our sample (i.e. with what our data shows) or, on the contrary, whether our data constitutes strong evidence against our hypothesis and whether it should be rejected.
When applying a statistical test, two issues arise: it is crucial to choose (i) the correct hypothesis for checking the phenomenon of interest, and (ii) the right statistical test according to the circumstances under which our sample data was obtained. If we fail in any of those choices, then the conclusion we draw about our population (according to the test result) is fake. In other words, we must not trust the result of the test, no matter what it says.
One of the main aspects to consider when choosing a statistical test is whether to apply a parametric or a non-parametric test. Without going into much detail, parametric tests can be used when the data follows a certain probability distribution (more precisely, the data in each of the groups being compared) – most often, the normal distribution. This requires using a previous, specific test to check the normality assumption prior to the application of the parametric test itself. (Kolmogorov-Smirnov and Shapiro-Wilks are the names of some common normality tests which perform roughly the same). If the normality test rejects our hypothesis about the data (which states “our data is normally distributed”), we have to use a non-parametric test. This is a very common situation in real life, especially when the sample size (number of elements in our sample) is small. Parametric tests are known to be more powerful than non-parametric ones, which means that they can more easily draw a final conclusion about the population. However, they can only be applied when normality is met. Non-parametric tests can be applied in any situation (although parametric tests are preferred if we are lucky and the normality assumption is met).
Repeated-measures ANOVA and Friedman’s test
In the case of ML, the kind of tests we need are those that judge whether multiple samples (two, or more) come from the same population or from multiple populations – we will further elaborate on this in the next paragraph. Therefore, our choice of the most appropriate statistical test must take into account the number of samples we are comparing – two vs more than two samples. In both cases, there are parametric and non-parametric versions for each situation (two samples, more than two samples, paired samples -each datum in one sample is related to one datum in the other samples-, non-paired).
As a simple example, imagine we want to check whether three medicines for insomnia actually make a difference or not by assessing the time needed by the patients to nod off. So we take a random sample of patients, give them the first medicine and measure the time until they nod off (these times constitute our data), then take another random sample of patients and do the same with the second medicine, and so on with the third, with no relation between the patients selected in each sample. (The patients who take the medicines are called experimental units in Design of Experiments terminology… yes, the name sounds like laboratory rats :-]). If the data of each medicine is normally distributed -this should be formally checked with a normality test- we could use the so-called one-way ANOVA test (one-way refers to the fact that we are considering only one factor, namely the medicine employed). In abstract terms, we want to assess whether the three samples come from the same population (which means all medicines exhibit the same performance) or from different ones (which means at least one of them performs differently from the others).
Now reconsider the samples. If one medicine really outperforms the rest, then all patients should nod off faster, no matter the exact patients chosen. However, beyond the actual effect of the medicine, it is clear that medicines do not affect everyone equally, but there is some component of the time to nod off that depends on each particular person. In order to properly understand which part of the results is due to the medicine and which is caused by each person, the correct way to design this experiment is to select only one sample of people, and administer each of them with the three medicines on separate days (with enough separation in time so as to assume no interaction between medicines). This way, the effects caused by the person affect the three measurements, because all the levels (medicine type) of the variable of interest (the medicine) are applied to exactly the same experimental unit (the same person). This is called a repeated-measures design (no matter whether data is normal or not).
If our data does not meet the normality assumption, there exists an analogous non-parametric test called Friedman’s test (caution: several variants exist; more details in the next section), which also considers repeated measures. Either of them gives a yes/no answer to the question: do the three samples come from the same population? The null hypothesis that the test tries to reject is “all the samples come from the same population”, but it does not say which of the samples is significantly different from the rest. Stating that Friedman’s test is analogous to repeated-measures ANOVA is not exactly true because Friedman discards a lot of information (as it transforms data into ranks so it loses exact differences), but from an intuitive point of view they can be considered similar.
How does this relate to the statistical comparison of ML algorithms? We will explain it in the next chapter … more to come!
 Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1-30 (2006).
 García, S., and Herrera, F. An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. Journal of Machine Learning Research 9:2677-2694 (2008).
 García, S., Fernández, A., Luengo, J., and Herrera, F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180:2044–2064 (2010).
 Daniel, Wayne W. “Friedman two-way analysis of variance by ranks”. Applied Nonparametric Statistics (2nd ed.). Boston: PWS-Kent. pp. 262–74. ISBN 0-534-91976-6 (1990).
Pablo Villacorta Iglesias obtained his MSc. in Computer Engineering in 2009, his MSc. in Statistics in 2012 and his PhD in Computer Science and AI (adversarial decision making) in 2015, all from University of Granada, Spain. He currently works as a Data scientist at Stratio Big Data. Pablo enjoys developing mathematical software in R and Scala, and has authored six open-source R packages so far, all on CRAN. He also works in Spark-based implementations of Big data science techniques for Stratio Intelligence, a core module of the Stratio Data Centric platform. Since October 2018, Pablo is a Databricks Certified Developer in Spark 2.x for Scala.