Primer on Binary Classification, or Answering Yes/No Questions with Machine Learning

Binary classification

The goal of binary classification is to put objects into one of two categories based on a set of attributes. For example, a credit card company may wish to categorize a transaction as legitimate or fraudulent based on the amount and location of the transaction. This is binary classification because there are exactly two categories, legitimate and fraudulent, into which a transaction can fall. Two other popular examples of binary classification are spam detection and medical fraud detection.

Types of errors that can occur

In binary classification, there are two types of errors, type 1 and type 2. A type 1 error, or false positive, is incorrectly rejecting a true null hypothesis. In our credit card example, a type 1 error occurs when a transaction is legitimate but gets classified as fraudulent. A type 2 error, or false negative, is incorrectly accepting a false null hypothesis. In our credit card example, a type 2 error occurs when a transaction is fraudulent but gets classified as legitimate. Here, a type 2 error is clearly worse, since a fraudulent transaction goes unnoticed. In general, however, whether a type 1 or type 2 error is worse depends on the situation, and, more precisely, on how the null and alternative hypotheses are stated. In our credit card example, the null hypothesis is that a transaction is legitimate; the alternative hypothesis is that a transaction is not legitimate and is, therefore, fraudulent. Had we flipped the null and alternative hypotheses, a type 1 error would have been the worse one.
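
To make this concrete, here is a minimal Python sketch, using made-up labels (1 for fraudulent, 0 for legitimate), that picks out the type 1 and type 2 errors from a list of predictions:

    # Made-up example labels: 1 = fraudulent, 0 = legitimate.
    actual    = [0, 0, 1, 0, 1, 0, 0, 1]
    predicted = [0, 1, 1, 0, 0, 0, 1, 1]

    # Type 1 errors (false positives): legitimate transactions flagged as fraud.
    type_1 = [i for i, (a, p) in enumerate(zip(actual, predicted))
              if a == 0 and p == 1]

    # Type 2 errors (false negatives): fraudulent transactions that slip through.
    type_2 = [i for i, (a, p) in enumerate(zip(actual, predicted))
              if a == 1 and p == 0]

    print(type_1)  # [1, 6]
    print(type_2)  # [4]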

Evaluating classifiers

In binary classification, the so-called confusion matrix is a two-by-two matrix that tabulates the number of true positive (upper left), false negative (upper right), false positive (bottom left), and true negative (bottom right) classifications; in this layout, rows correspond to the actual class and columns to the predicted class. The confusion matrix describes the performance of a classifier. Diagonal terms count correct classifications and off-diagonal terms count incorrect classifications. In general, the best classifiers have relatively large numbers along the diagonal and relatively small numbers everywhere else.
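
Continuing with the made-up labels from the sketch above, the four cells of the confusion matrix can be tallied by hand:

    # Count each cell of the confusion matrix.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

    # Lay it out as described above: rows are actual, columns are predicted.
    confusion = [[tp, fn],
                 [fp, tn]]
    print(confusion)  # [[2, 1], [2, 3]]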

Accuracy answers the question: how often is a classifier correct? Accuracy is calculated as the number of correct classifications divided by the total number of classifications. This number may, however, be misleading when the classes are imbalanced, that is, when one category vastly outnumbers the other. For instance, in our credit card example, we would hope that relatively few transactions are fraudulent and relatively many are legitimate. In that case, a classifier that labels every transaction legitimate would have high accuracy, but be worthless at detecting fraud. Better measures of classifier performance are precision and recall.
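
A small sketch, with hypothetical imbalanced counts, shows how a do-nothing classifier can still score high accuracy:

    # Hypothetical imbalanced data: 990 legitimate, 10 fraudulent transactions.
    labels  = [0] * 990 + [1] * 10
    # A worthless classifier that calls every transaction legitimate.
    guesses = [0] * 1000

    correct  = sum(1 for a, p in zip(labels, guesses) if a == p)
    accuracy = correct / len(labels)
    print(accuracy)  # 0.99 -- high accuracy, yet zero fraud detected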

Precision, or positive predictive value, answers the question: when a classifier predicts yes, how often is it correct? Precision is calculated as the number of true positives divided by the sum of the number of true positives and the number of false positives. In terms of our credit card example, precision answers the question: when a classifier predicts fraud, how often is it correct?
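
Using the counts from the confusion matrix sketch above:

    # Precision: of the transactions flagged as fraud, how many really were?
    precision = tp / (tp + fp)
    print(precision)  # 2 / (2 + 2) = 0.5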

Recall, or sensitivity, answers the question: when the answer is actually yes, how often does a classifier predict yes? Recall is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. In terms of our credit card example, recall answers the question: when a transaction is actually fraudulent, how often does a classifier predict fraud?
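
Again using the counts from the confusion matrix sketch above:

    # Recall: of the truly fraudulent transactions, how many were flagged?
    recall = tp / (tp + fn)
    print(recall)  # 2 / (2 + 1) ≈ 0.67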

It is often the case that increasing precision lowers recall and increasing recall lowers precision. Deciding on a trade-off between precision and recall is best left to a judgement call based on the situation. If a single measure of classifier performance is required, the F-score (or F1 score) is a popular choice. The F-score is the harmonic mean of the precision and recall, calculated as one over the average of one over the precision and one over the recall; equivalently, F1 = 2 * precision * recall / (precision + recall). Unlike the arithmetic mean, the harmonic mean puts more emphasis on the smaller value, so to score well, a classifier must balance the two.
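
Continuing the running example, the F-score works out as follows:

    # F-score: harmonic mean of precision (0.5) and recall (~0.67).
    f_score = 1 / ((1 / precision + 1 / recall) / 2)
    # Equivalently: 2 * precision * recall / (precision + recall).
    print(f_score)  # ≈ 0.57

    # The harmonic mean punishes imbalance: precision 0.9 with recall 0.1
    # yields 2 * 0.9 * 0.1 / (0.9 + 0.1) = 0.18, far below their average of 0.5.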

Techniques for binary classification

Techniques for binary classification include statistical methods, like hypothesis testing, and non-statistical methods, like machine learning. Popular machine learning algorithms for binary classification include support vector machines, the perceptron, and logistic regression. We leave a detailed discussion of specific statistical and non-statistical methods for future posts. For now, know that statistical methods can provide more information about our confidence in an answer, but at the cost of stronger assumptions on our data, less flexibility, and difficult computational questions. Conversely, non-statistical methods require few assumptions on our data, are more flexible, and avoid difficult computational questions, but they also provide less information about our confidence in an answer. Deciding which method to use, therefore, depends on the situation.
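
As a teaser for those future posts, here is a minimal sketch of one of the machine learning methods named above, logistic regression, using the scikit-learn library on synthetic data; the data and settings are invented purely for illustration:

    # Logistic regression on synthetic, imbalanced data with scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    # Roughly 5% of the synthetic examples carry the positive ("fraud") label.
    X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                               n_redundant=0, weights=[0.95], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("precision:", precision_score(y_test, y_pred))
    print("recall:   ", recall_score(y_test, y_pred))
    print("f1:       ", f1_score(y_test, y_pred))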
