Exploring the relations between machine learning metrics
The terminology of a new domain is often difficult to pick up. Coming from a software engineering background, I find machine learning full of such terms that I need to remember in order to use the tools and read the articles.
Some basic terms are Precision, Recall, and F1-Score. These relate to getting a finer-grained idea of how well a classifier is doing, as opposed to just looking at overall accuracy. Writing an explanation forces me to think it through, and helps me remember the topic myself. That’s why I like to write these articles.
I am looking at a binary classifier in this article. The same concepts apply more broadly, but require a bit more consideration for multi-class problems. That is something to consider another time.
Before going into the details, an overview figure is always nice:
At first look, it is a bit of a messy web. No need to worry about the details for now; we can look back at this figure in the following sections when explaining the details from the bottom up. The metrics form a hierarchy starting with the true/false negatives/positives (at the bottom), and building up all the way to the F1-score that binds them all together. Let's build up from there.
True/False Positives and Negatives
A binary classifier can be viewed as classifying instances as positive or negative:
- Positive: The instance is classified as a member of the class the classifier is trying to identify. For example, a classifier looking for cat photos would classify photos with cats as positive (when correct).
- Negative: The instance is classified as not being a member of the class we are trying to identify. For example, a classifier looking for cat photos should classify photos with dogs (and no cats) as negative.
The basis of precision, recall, and F1-Score comes from the concepts of True Positive, True Negative, False Positive, and False Negative. The following table illustrates these (consider value 1 to be a positive prediction):
True Positive (TP)
The following table shows 3 examples of a True Positive (TP). The first row is a generic example, where 1 represents the Positive prediction. The following two rows are examples with labels. Internally, the algorithms would use the 1/0 representation, but I used labels here for a more intuitive understanding.
False Positive (FP)
These False Positive (FP) examples illustrate wrong predictions: predicting Positive for samples that are actually Negative. Such a failed prediction is called a False Positive.
True Negative (TN)
For the True Negative (TN) example, the cat classifier correctly identifies a photo as not having a cat in it, and the medical image as the patient having no cancer. So the prediction is Negative and correct (True).
False Negative (FN)
In the False Negative (FN) case, the classifier has predicted a Negative result, while the actual result was positive. Like no cat when there is a cat. So the prediction was Negative and wrong (False). Thus it is a False Negative.
A confusion matrix is sometimes used to illustrate classifier performance based on the above four values (TP, FP, TN, FN), plotting them against each other:
Using the cancer prediction example, a confusion matrix for 100 patients might look something like this:
This example has:
- TP: 45 positive cases correctly predicted
- TN: 25 negative cases correctly predicted
- FP: 18 negative cases misclassified (wrong positive predictions)
- FN: 12 positive cases misclassified (wrong negative predictions)
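These counts can be reproduced with a small Python sketch; the label lists below are synthetic, built only to match the example counts (1 = cancer, 0 = no cancer):

```python
# Synthetic labels built to match the example: 45 TP, 25 TN, 18 FP, 12 FN.
y_true = [1] * 45 + [0] * 25 + [0] * 18 + [1] * 12
y_pred = [1] * 45 + [0] * 25 + [1] * 18 + [0] * 12

# Count each of the four outcomes by comparing actual vs predicted labels.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # -> 45 25 18 12
```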
Thinking about this for a while, there are different severities to the different errors here. Classifying someone who has cancer as not having it (false negative, denying treatment), is likely more severe than classifying someone who does not have it as having it (false positive, consider treatment, do further tests).
As the severity of different kinds of mistakes varies across use cases, the metrics such as Accuracy, Precision, Recall, and F1-score can be used to balance the classifier estimates as preferred.
Accuracy
The base metric used for model evaluation is often Accuracy, describing the number of correct predictions over all predictions:
These three show the same formula for calculating accuracy, but in different wording, from more formalized to more intuitive (my opinion). In the above cancer example, the accuracy would be:
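In code, using the confusion matrix values from above:

```python
tp, tn, fp, fn = 45, 25, 18, 12

# Accuracy: correct predictions over all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # -> 0.7, i.e. 70%
```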
This is perhaps the most intuitive of the model evaluation metrics, and thus commonly used. But often it is useful to also look a bit deeper.
Precision
Precision is a measure of how many of the positive predictions made are correct (true positives). The formula for it is:
All three above are again just different wordings of the same formula, with the last one using the cancer case as a concrete example. In this cancer example, using the values from the above example confusion matrix, the precision would be:
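As a quick calculation with the same values:

```python
tp, fp = 45, 18

# Precision: correct positive predictions over all positive predictions.
precision = tp / (tp + fp)
print(round(precision, 3))  # -> 0.714
```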
Recall / Sensitivity
Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as Sensitivity. The formula for it is:
Once again, this is just the same formula worded three different ways. For the cancer example, using the confusion matrix data, the recall would be:
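And in code form:

```python
tp, fn = 45, 12

# Recall: correctly predicted positives over all actual positives.
recall = tp / (tp + fn)
print(round(recall, 3))  # -> 0.789
```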
Specificity
Specificity is a measure of how many negative predictions made are correct (true negatives). The formula for it is:
In the above medical example, the specificity would be:
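Calculated from the same confusion matrix values:

```python
tn, fp = 25, 18

# Specificity: correctly predicted negatives over all actual negatives.
specificity = tn / (tn + fp)
print(round(specificity, 3))  # -> 0.581
```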
F1-Score
F1-Score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. Harmonic mean is just another way to calculate an “average” of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for F1-score in this case is:
The idea is to provide a single metric that weights the two ratios (precision and recall) in a balanced way, requiring both to have a higher value for the F1-score value to rise. For example, a Precision of 0.01 and Recall of 1.0 would give:
- an arithmetic mean of (0.01+1.0)/2=0.505,
- an F1-score (formula above) of 2*(0.01*1.0)/(0.01+1.0)=~0.02.
This is because the F1-score is much more sensitive to one of the two inputs having a low value (0.01 here). Which makes it great if you want to balance the two.
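The sensitivity to a low input is easy to verify:

```python
precision, recall = 0.01, 1.0

# Arithmetic mean vs the harmonic mean used by the F1-score.
arithmetic_mean = (precision + recall) / 2
f1 = 2 * (precision * recall) / (precision + recall)

print(round(arithmetic_mean, 3))  # -> 0.505
print(round(f1, 3))               # -> 0.02
```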
Some advantages of F1-score:
- Very small precision or recall will result in lower overall score. Thus it helps balance the two metrics.
- If you choose your positive class as the one with fewer samples, F1-score can help balance the metric across positive/negative samples.
- As illustrated by the first figure in this article, it combines many of the other metrics into a single one, capturing many aspects at once.
In the cancer example further above, the F1-score would be:
- 2 * (0.714*0.789)/(0.714+0.789)=0.75 = 75%
I find it easiest to understand concepts by looking at some examples. First a function in Python to calculate F1-score:
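A minimal sketch of such a function could look like this (returning 0 when both inputs are 0, since the formula would otherwise divide by zero):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Both inputs zero: F1-score is undefined, score it as 0.
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

print(round(f1_score(0.714, 0.789), 2))  # -> 0.75, the cancer example
```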
To compare different combinations of precision and recall, I generate example values for precision and recall in range of 0 to 1 with steps of 0.01 (100 values of 0.01, 0.02, 0.03, … , 1.0):
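One way to generate these values in plain Python:

```python
# 100 evenly spaced values: 0.01, 0.02, 0.03, ..., 1.0.
values = [round(i / 100, 2) for i in range(1, 101)]
print(values[:3], values[-1])  # -> [0.01, 0.02, 0.03] 1.0
```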
This produces a list for both precision and recall to experiment with:
F1-score when precision=recall
To see what is the F1-score if precision equals recall, we can calculate F1-scores for each point 0.01 to 1.0, with precision = recall at each point:
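This can be verified with a quick sketch (same harmonic-mean formula as before):

```python
def f1_score(precision, recall):
    return 2 * (precision * recall) / (precision + recall)

values = [round(i / 100, 2) for i in range(1, 101)]

# When precision == recall == v, F1 collapses to v itself:
# 2 * (v * v) / (v + v) = v.
assert all(abs(f1_score(v, v) - v) < 1e-9 for v in values)
```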
F1-score equals precision and recall if the two input metrics (P&R) are equal. The Difference column in the table shows the difference between the smaller input (Precision/Recall) and the F1-score. Here they are equal, so there is no difference; in the following examples they start to vary.
F1-score when Recall = 1.0, Precision = 0.01 to 1.0
So, the F1-score should handle reasonably well cases where one of the inputs (P/R) is low, even if the other is very high.
Let's try setting Recall to the maximum of 1.0 and varying Precision from 0.01 to 1.0:
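A few sample points from such a sweep (not the full 100 values):

```python
def f1_score(precision, recall):
    return 2 * (precision * recall) / (precision + recall)

# Fix recall at 1.0 and vary precision; print the F1-score and its
# distance from the smaller input (precision here).
for p in (0.01, 0.25, 0.5, 0.75, 1.0):
    f1 = f1_score(p, 1.0)
    print(p, round(f1, 3), round(f1 - p, 3))
```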
As expected, the F1-score stays low when one of the two inputs (Precision / Recall) is low. The difference column shows how the F1-score in this case rises a bit faster than the smaller input (Precision here), gaining more towards the middle of the chart, weighted up a bit by the bigger value (Recall here). However, it never goes very far from the smaller input, balancing the overall score based on both inputs. These differences can also be visualized on the figure (difference is biggest at the vertical red line):
F1-score when Precision = 1.0 and Recall = 0.01 to 1.0
If we swap the roles of Precision and Recall in the above example, we get the same result (due to F1-score formula):
This is to say, regardless of which one is higher or lower, the overall F1-score is impacted in the exact same way (which seems quite obvious in the formula but easy to forget).
F1-score when Precision=0.8 and Recall = 0.01 to 1.0
Besides fixing one input at maximum, let's try a bit lower. Here Precision is fixed at 0.8, while Recall varies from 0.01 to 1.0 as before:
The top score with inputs (0.8, 1.0) is 0.89. The shape of the rising curve is similar as the Recall value rises. At the maximum of Recall = 1.0, the F1-score reaches a value about 0.09 higher than the smaller input (0.89 vs 0.8).
F1-score when Precision=0.1 and Recall=0.01 to 1.0
And if we fix one value near minimum at 0.1?
Because one of the two inputs is always low (0.1), the F1-score never rises very high. However, interestingly it again rises at its maximum to a value about 0.08 larger than the smaller input (Precision = 0.1, F1-score = 0.18). This is quite similar to the fixed Precision = 0.8 above, where the maximum value reached was 0.09 higher than the smaller input.
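Both fixed-input maxima can be checked directly:

```python
def f1_score(precision, recall):
    return 2 * (precision * recall) / (precision + recall)

# Maximum F1 with one input fixed and the other at its maximum of 1.0.
print(round(f1_score(0.8, 1.0), 2))  # -> 0.89, about 0.09 above 0.8
print(round(f1_score(0.1, 1.0), 2))  # -> 0.18, about 0.08 above 0.1
```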
Focusing F1-score on precision or recall
Besides the plain F1-score, there is a more generic version, called Fbeta-score. F1-score is a special instance of Fbeta-score, where beta=1. It allows one to weight the precision or recall more, by adding a weighting factor. I will not go deeper into that in this post, however, it is something to keep in mind.
F1-score vs Accuracy
Accuracy is commonly described as a more intuitive metric, with F1-score better addressing a more imbalanced dataset. So how does the F1-score (F1) vs Accuracy (ACC) compare across different types of data distributions (ratios of positive/negative)?
Imbalance: Few Positive Cases
In this example, there is an imbalance of 10 positive cases, and 90 negative cases, with different TN, TP, FN, and FP values for a classifier to calculate F1 and ACC:
The maximum accuracy with this class imbalance comes from TN=90 and TP=10 (all predictions correct), as shown on row 2.
In each case where TP = 0, Precision and Recall both become 0, and the F1-score cannot be calculated (division by zero). Such cases can be scored as F1-score = 0, or the classifier simply marked as useless, because it cannot predict a single correct positive result. This is rows 0, 4, and 8 in the above table. These also illustrate cases of high Accuracy for a broken classifier (e.g., row 0 with 90% Accuracy while always predicting only negative).
The remaining rows illustrate how the F1-score reacts much better to the classifier making more balanced predictions. For example, F1-score=0.18 vs Accuracy=0.91 on row 5, against F1-score=0.46 vs Accuracy=0.93 on row 7. This is a change of only 2 positive predictions, but as it is out of 10 possible, the change is actually quite large, and the F1-score emphasizes it while the Accuracy barely moves.
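The rows discussed above can be recomputed from their counts; the exact TP/FN splits here are my reconstruction from the quoted scores, not the original table:

```python
def acc_and_f1(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    if tp == 0:
        return acc, 0.0  # F1 undefined (division by zero), scored as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return acc, 2 * (precision * recall) / (precision + recall)

# 10 positive and 90 negative cases in each scenario.
print(acc_and_f1(tp=0, tn=90, fp=0, fn=10))  # high accuracy, useless classifier
print(acc_and_f1(tp=1, tn=90, fp=0, fn=9))   # acc 0.91, F1 ~0.18
print(acc_and_f1(tp=3, tn=90, fp=0, fn=7))   # acc 0.93, F1 ~0.46
```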
Balance 50/50 Positive and Negative cases:
How about when the datasets are more balanced? Here are similar values for a balanced dataset with 50 negative and 50 positive items:
F1-score is still a slightly better metric here, when there are only very few (or none) of the positive predictions. But the difference is not as huge as with imbalanced classes. In general, it is still always useful to look a bit deeper into the results, although in balanced datasets, a high accuracy is usually a good indicator of a decent classifier performance.
Imbalance: Few Negative Cases
Finally, what happens if the minority class is measured as the negative and not positive? F1-score no longer balances it but rather the opposite. Here is an example with 10 negative cases and 90 positive cases:
For example, row 5 has only 1 correct prediction out of 10 negative cases. But the F1-score is still around 95%, so very good, and even higher than the accuracy. When the same ratio applied to the positive cases as the minority, the F1-score was 0.18, vs 0.95 now. That 0.18 was a much better indicator of classifier quality than the 0.95 is here.
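Recomputing this row from counts consistent with the text (1 of 10 negatives correct, all positives found; again my reconstruction, not the original table):

```python
def acc_and_f1(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return acc, 2 * (precision * recall) / (precision + recall)

# 90 positive and 10 negative cases; only 1 negative predicted correctly.
acc, f1 = acc_and_f1(tp=90, tn=1, fp=9, fn=0)
print(round(acc, 2), round(f1, 2))  # -> 0.91 0.95
```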
This result with minority negative cases is because of how the formula to calculate F1-score is defined over precision and recall (emphasizing positive cases). If you look back at the figure illustrating the metrics hierarchy at the beginning of this article, you will see how True Positives feed into both Precision and Recall, and from there to F1-score. The same figure also shows how True Negatives do not contribute to F1-score at all. This is visible here when the ratios are reversed and there are fewer true negatives.
So, as usual, I believe it is good to keep in mind how to represent your data, and do your own data exploration, not blindly trusting any single metric.
So what are these metrics good for?
The traditional Accuracy is a good measure if you have quite balanced datasets and are interested in all types of outputs equally. I like to start with it in any case, as it is intuitive, and dig deeper from there as needed.
Precision is great to focus on if you want to minimize false positives. For example, say you build a spam email classifier. You want to see as little spam as possible, but you also do not want to lose any important, non-spam emails to the spam folder (false positives). In such cases, you may wish to aim to maximize precision.
Recall is very important in domains such as medical (e.g., identifying cancer), where you really want to minimize the chance of missing positive cases (predicting false negatives). These are typically cases where missing a positive case has a much bigger cost than wrongly classifying something as positive.
Neither precision nor recall is necessarily useful alone, since we are generally interested in the overall picture. Accuracy is always good to check as one option, and F1-score is another.
F1-score combines precision and recall, and works also for cases where the datasets are imbalanced as it requires both precision and recall to have a reasonable value, as demonstrated by the experiments I showed in this post. Even if you have a small number of positive cases vs negative cases, the formula will weight the metric value down if the precision or recall of the positive class is low.
Besides these, there are various other metrics and ways to explore your results. A popular and very useful approach is the use of ROC and precision-recall curves. These allow fine-tuning the evaluation thresholds according to what type of error you want to minimize. But that is a different topic to explore.
That's all for today. 🙂