Evaluating tests: a basic metrics guide.

November 4, 2020

We often hear about the reliability of certain methods, tests, or algorithms, and that reliability is usually presented to us through metrics such as accuracy, precision, or recall. But what are the differences between these metrics, and why are we shown several of them? Is it better to use one or another depending on the context? Is there a single metric that works for everything and beats the rest? Read on, and you will find the answer to all these questions.

Confusion matrix

First of all, we must understand a basic concept for any test that predicts positives and negatives: the confusion matrix. A confusion matrix summarizes the performance of our algorithm or test, where the rows represent the actual classes and the columns the predictions (or vice versa).

The confusion matrix tells us, for the real positives and negatives, how many have been predicted as positive and how many as negative. Each of the values within the matrix has a name (illustrated here with a Covid test, and counted in the short sketch after the definitions):

TP = True Positive. Real: Covid | Prediction: Covid

FN = False Negative. Real: Covid | Prediction: No Covid

FP = False Positive. Real: No Covid | Prediction: Covid

TN = True Negative. Real: No Covid | Prediction: No Covid
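
To make these four values concrete, here is a minimal sketch in plain Python (the real and predicted labels below are made up purely for illustration) that counts TP, FN, FP, and TN directly from two lists of labels:

```python
# Hypothetical labels: what each patient really has vs. what the test predicted.
real      = ["Covid", "Covid", "No Covid", "No Covid", "Covid", "No Covid"]
predicted = ["Covid", "No Covid", "No Covid", "Covid", "Covid", "No Covid"]

# Count each cell of the confusion matrix by comparing real vs. predicted labels.
tp = sum(1 for r, p in zip(real, predicted) if r == "Covid" and p == "Covid")
fn = sum(1 for r, p in zip(real, predicted) if r == "Covid" and p == "No Covid")
fp = sum(1 for r, p in zip(real, predicted) if r == "No Covid" and p == "Covid")
tn = sum(1 for r, p in zip(real, predicted) if r == "No Covid" and p == "No Covid")

print(tp, fn, fp, tn)  # 2 1 1 2
```

In practice, a library such as scikit-learn can build the same matrix for us (for example with sklearn.metrics.confusion_matrix), but counting the four cells by hand makes the definitions easier to remember.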

Once this table is understood, we can review the most important metrics.

Accuracy

Accuracy tells you what percentage of the predictions are correct.

$$ Accuracy = \frac{TP + TN}{\text{All the predictions}} = \frac{TP + TN}{TP + TN + FP + FN} $$

This metric may seem very good at first glance, but it has a weak point: when your data set is unbalanced (many more real positives than negatives, or vice versa), it loses reliability. Suppose we have a data set of 100 patients, of whom only 1 has Covid-19, and suppose also that the test tells us that none of them have it.
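
A quick sketch with these made-up numbers (100 patients, 1 real positive, a test that always predicts negative) shows the problem:

```python
# Hypothetical imbalanced scenario: 100 patients, only 1 real Covid case,
# and a test that answers "No Covid" for everyone.
tp, fn = 0, 1    # the single real positive is missed
fp, tn = 0, 99   # the 99 real negatives are all "correctly" predicted negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99 -> 99% accuracy without detecting a single positive
```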

We would be talking about an accuracy of 99%, even though our test did not detect a single positive. We need metrics that are sensitive to false positive and false negative errors: we need precision, recall, and, ultimately, the F1-score.

Precision

Precision is a metric that answers the following question: of the positive predictions I have made, what percentage are actually positive?

$$ Precision = \frac{TP}{\text{Positive predictions}} = \frac{TP}{TP + FP} $$

This metric is useful when the cost of a false positive is high. For example, if your system detects whether an email is spam, a false positive sends a legitimate email to the spam folder, an error that can lead to the loss of important information.
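
As a small illustration (the counts below are made up), precision for such a spam filter would be computed like this:

```python
# Hypothetical spam-filter counts: of 50 emails flagged as spam,
# 45 really are spam (TP) and 5 are legitimate emails (FP).
tp, fp = 45, 5

precision = tp / (tp + fp)
print(precision)  # 0.9 -> 90% of the flagged emails were actually spam
```

Every additional false positive lowers this value, which is exactly the error we want to keep rare in this context.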

Recall

Recall, as opposed to precision, is useful when the cost of a false negative is high. It answers the question: of all the real positive cases, what percentage did the test correctly identify?

$$ Recall = \frac{TP}{\text{Real positives}} = \frac{TP}{TP + FN} $$

For example, in a test to detect Covid-19, a false negative (saying that the patient does not have Covid when they really do) has a very high cost, potentially the life of one or more people.
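
With made-up numbers for such a Covid test, recall would be computed like this:

```python
# Hypothetical Covid-test counts: 20 patients really have Covid,
# the test detects 15 of them (TP) and misses 5 (FN).
tp, fn = 15, 5

recall = tp / (tp + fn)
print(recall)  # 0.75 -> 25% of the real positives went undetected
```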

F1-score

After reading about precision and recall, it is natural for the following question to arise: what if I want a metric that strikes a balance between precision and recall and penalizes both false positives and false negatives?

Suppose we have several tests, all with different precision and recall values, and we want to find the best of them, taking both metrics into account equally (that is, penalizing false positives and false negatives equally). For this case we have the F1-score, whose formula is:

$$ F1 = 2\times\frac{Precision \times Recall}{Precision + Recall}$$

As we can see from the formula, the F1-score (the harmonic mean of precision and recall) seeks a balance between the two. If one of the two metrics has a very low value, the F1-score drops drastically; in the extreme case where precision or recall is 0, the result is 0 regardless of the other metric.
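
A small sketch (with made-up precision and recall values) illustrates this behavior:

```python
# F1-score as defined above; the precision/recall values are made up
# to show how a single low metric sinks the combined score.
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9, 0.9))  # 0.9  -> both metrics high, high F1
print(f1_score(0.9, 0.1))  # 0.18 -> one low metric drags the F1 down
print(f1_score(0.9, 0.0))  # 0.0  -> if either metric is 0, the F1 is 0
```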

Conclusion

After reading about all the metrics, the answer to “What metric do I use?” is clear: the F1-score is the most complete for evaluating our test, since it takes both false negatives and false positives into account. Only when the context makes one of the two errors clearly more costly should recall or precision alone be used to evaluate our test. Even when evaluating with the F1-score, it is worth reporting both precision and recall as well, to see in more detail which type of error is made more often.