## Exploring the relations between machine learning metrics

The terminology of a specific domain is often difficult to get started with. Coming from a software engineering background, I find machine learning has many such terms that I need to remember in order to use the tools and read the articles.

Some basic terms are Precision, Recall, and F1-Score. These relate to getting a finer-grained idea of how well a classifier is doing, as opposed to just looking at overall accuracy. Writing an explanation forces me to think it through, and helps me remember the topic myself. That’s why I like to write these articles.

In this article, I am looking at a binary classifier. The same concepts apply more broadly, but require a bit more consideration for multi-class problems. That is something to consider another time.

Before going into the details, an overview figure is always nice:

At first look, it is a bit of a messy web. No need to worry about the details for now, but we can look back at this during the following sections when explaining the details from the bottom up. The metrics form a hierarchy starting with the *true/false negatives/positives* (at the bottom), and building up all the way to the *F1-score* to bind them all together. Let's build up from there.

## True/False Positives and Negatives

A binary classifier can be viewed as classifying instances as *positive* or *negative:*

- **Positive**: The instance is classified as a member of the class the classifier is trying to identify. For example, a classifier looking for cat photos would classify photos with cats as positive (when correct).
- **Negative**: The instance is classified as not being a member of the class we are trying to identify. For example, a classifier looking for cat photos should classify photos with dogs (and no cats) as negative.

The basis of precision, recall, and F1-Score comes from the concepts of *True Positive*, *True Negative*, *False Positive*, and *False Negative*. The following table illustrates these (consider value 1 to be a positive prediction):

## True Positive (TP)

The following table shows 3 examples of a True Positive (TP). The first row is a generic example, where 1 represents the Positive prediction. The following two rows are examples with labels. Internally, the algorithms would use the 1/0 representation, but I used labels here for a more intuitive understanding.

## False Positive (FP)

These False Positive (FP) examples illustrate making wrong predictions: predicting Positive for samples that are actually Negative. Such a failed prediction is called a False Positive.

## True Negative (TN)

For the True Negative (TN) example, the cat classifier correctly identifies a photo as not having a cat in it, and the medical image as the patient having no cancer. So the prediction is Negative and correct (True).

## False Negative (FN)

In the False Negative (FN) case, the classifier has predicted a Negative result, while the actual result was Positive. For example, predicting no cat when there is a cat. The prediction was Negative and wrong (False), thus it is a False Negative.
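The four outcome types above can be counted directly from prediction/label pairs. Here is a minimal sketch in plain Python; the example values are made up for illustration (1 = positive):

```python
# Made-up example predictions and ground-truth labels (1 = positive class).
predictions = [1, 1, 0, 0, 1, 0]
labels      = [1, 0, 0, 1, 1, 0]

# Count each of the four outcome types by comparing prediction to label.
tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)  # correct positives
fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)  # wrong positives
tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)  # correct negatives
fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)  # wrong negatives

print(tp, fp, tn, fn)  # prints 2 1 2 1 for the example data above
```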

## Confusion Matrix

A confusion matrix is sometimes used to illustrate classifier performance based on the above four values (TP, FP, TN, FN). These are plotted against each other to show a confusion matrix:

Using the cancer prediction example, a confusion matrix for 100 patients might look something like this:

This example has:

- TP: 45 positive cases correctly predicted
- TN: 25 negative cases correctly predicted
- FP: 18 negative cases are misclassified (wrong positive predictions)
- FN: 12 positive cases are misclassified (wrong negative predictions)
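The same example can be laid out as a 2x2 table in code. The row/column layout below is one common convention (rows = actual class, columns = predicted class); some libraries transpose it:

```python
# The cancer example's counts from the text.
TP, TN, FP, FN = 45, 25, 18, 12

matrix = [
    [TP, FN],  # actual positive: predicted positive, predicted negative
    [FP, TN],  # actual negative: predicted positive, predicted negative
]

total = sum(sum(row) for row in matrix)
print(total)  # 100 patients in the example
```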

Thinking about this for a while, there are different severities to the different errors here. Classifying someone who has cancer as not having it (false negative, denying treatment), is likely more severe than classifying someone who does not have it as having it (false positive, consider treatment, do further tests).

As the severity of different kinds of mistakes varies across use cases, the metrics such as *Accuracy*, *Precision*, *Recall*, and *F1-score* can be used to balance the classifier estimates as preferred.

## Accuracy

The base metric used for model evaluation is often *Accuracy*, describing the number of correct predictions over all predictions:

These three are the same formula for calculating accuracy, just worded differently, from more formal to more intuitive (in my opinion). In the above cancer example, the accuracy would be:

- (TP+TN)/DatasetSize=(45+25)/100=0.7=70%.

This is perhaps the most intuitive of the model evaluation metrics, and thus commonly used. But often it is useful to also look a bit deeper.

## Precision

*Precision* is a measure of how many of the positive predictions made are correct (true positives). The formula for it is:

All three above are again just different wordings of the same, with the last one using the cancer case as a concrete example. In this cancer example, using the values from the above example confusion matrix, the precision would be:

- 45/(45+18)=45/63=0.714=71.4%.

## Recall / Sensitivity

*Recall* is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as *Sensitivity*. The formula for it is:

Once again, this is just the same formula worded three different ways. For the cancer example, using the confusion matrix data, the recall would be:

- 45/(45+12)=45/57=0.789=78.9%.

## Specificity

Specificity is a measure of how many of the actual negative cases are correctly predicted as negative (true negatives). The formula for it is:

In the above medical example, the specificity would be:

- 25/(25+18)=25/43=0.581=58.1%.
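All four metrics so far follow directly from the TP/TN/FP/FN counts. A minimal sketch computing them for the cancer example values used above:

```python
# Values from the example confusion matrix in the text.
TP, TN, FP, FN = 45, 25, 18, 12

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # correct predictions over all predictions
precision   = TP / (TP + FP)                   # how many positive predictions were correct
recall      = TP / (TP + FN)                   # how many actual positives were found
specificity = TN / (TN + FP)                   # how many actual negatives were identified

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(specificity, 3))
# prints: 0.7 0.714 0.789 0.581
```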

## F1-Score

F1-Score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. Harmonic mean is just another way to calculate an “average” of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for F1-score in this case is:

The idea is to provide a single metric that weights the two ratios (precision and recall) in a balanced way, requiring both to have a higher value for the F1-score value to rise. For example, a Precision of 0.01 and Recall of 1.0 would give:

- an arithmetic mean of (0.01+1.0)/2=0.505,
- an F1-score (formula above) of 2*(0.01*1.0)/(0.01+1.0)≈0.02.

This is because the harmonic mean is much more sensitive to one of the two inputs having a low value (0.01 here), which makes it great if you want to balance the two.

Some advantages of F1-score:

- A very small precision or recall will result in a lower overall score, thus helping to balance the two metrics.
- If you choose your positive class as the one with fewer samples, F1-score can help balance the metric across positive/negative samples.
- As illustrated by the first figure in this article, it combines many of the other metrics into a single one, capturing many aspects at once.

In the cancer example further above, the F1-score would be:

- 2*(0.714*0.789)/(0.714+0.789)=0.75=75%.

### Exploring F1-score

I find it easiest to understand concepts by looking at some examples. First a function in Python to calculate F1-score:
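The original listing is not shown here, but a minimal version of such a function might look like this (my sketch, following the harmonic mean formula above):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# The imbalanced example from above: a tiny precision drags the score down.
print(round(f1_score(0.01, 1.0), 2))  # prints 0.02
```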

To compare different combinations of precision and recall, I generate example values for precision and recall in the range 0 to 1, in steps of 0.01 (100 values: 0.01, 0.02, 0.03, …, 1.0):
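A minimal sketch of this value generation in plain Python (the variable names are mine, for illustration):

```python
# 100 evenly spaced values: 0.01, 0.02, ..., 1.0.
# round() keeps the printed values tidy (avoids float noise like 0.060000000000000005).
values = [round(i / 100, 2) for i in range(1, 101)]

# Separate copies to use as example precision and recall axes.
precision_values = list(values)
recall_values = list(values)

print(values[:3], values[-1])  # prints [0.01, 0.02, 0.03] 1.0
```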

This produces a list for both precision and recall to experiment with:

### F1-score when precision=recall

To see what the F1-score is when precision equals recall, we can calculate F1-scores for each point from 0.01 to 1.0, with precision = recall at each point:
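A quick sketch of this check in code (with an inline `f1` helper matching the formula above):

```python
def f1(p, r):
    """Harmonic mean of precision (p) and recall (r); 0 if both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

# When precision == recall, the harmonic mean equals them both.
for v in [i / 100 for i in range(1, 101)]:
    assert abs(f1(v, v) - v) < 1e-9  # F1 == precision == recall at every point
```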

F1-score equals precision and recall if the two input metrics (P&R) are equal. The *Difference* column in the table shows the difference between the smaller value (Precision/Recall) and F1-score. Here they are equal, so no difference, in following examples they start to vary.

### F1-score when Recall = 1.0, Precision = 0.01 to 1.0

So, the F1-score should handle reasonably well cases where one of the inputs (P/R) is low, even if the other is very high.

Let's try setting Recall to the maximum of 1.0 and varying Precision from 0.01 to 1.0:
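This sweep can be sketched as follows; the code also tracks where the F1-score rises furthest above the smaller input (the `f1` helper is my own, matching the formula above):

```python
def f1(p, r):
    """Harmonic mean of precision (p) and recall (r); 0 if both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Fix recall at 1.0 and sweep precision from 0.01 to 1.0, tracking the
# largest gap between F1 and the smaller input (precision here).
best_diff, best_p = 0.0, 0.0
for i in range(1, 101):
    p = i / 100
    diff = f1(p, 1.0) - p
    if diff > best_diff:
        best_diff, best_p = diff, p

print(best_p, round(best_diff, 2))  # prints 0.41 0.17
```

On this grid, the gap peaks around precision ≈ 0.41, where F1 sits roughly 0.17 above the smaller input, then shrinks again toward both ends.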

As expected, the F1-score stays low when one of the two inputs (*Precision*/*Recall*) is low. The *difference* column shows how the F1-score in this case rises a bit faster than the smaller input (*Precision* here), gaining more towards the middle of the chart, weighted up a bit by the bigger value (*Recall* here). However, it never strays very far from the smaller input, balancing the overall score based on both inputs. These differences can also be visualized in the figure (the *difference* is biggest at the vertical red line):

### F1-score when Precision = 1.0 and Recall = 0.01 to 1.0

If we swap the roles of *Precision* and *Recall* in the above example, we get the same result (due to *F1-score* formula):

This is to say, regardless of which one is higher or lower, the overall *F1-score* is impacted in the exact same way (which seems quite obvious in the formula but easy to forget).

### F1-score when Precision=0.8 and Recall = 0.01 to 1.0

Besides fixing one input at maximum, let's try a bit lower. Here Precision is fixed at 0.8, while Recall varies from 0.01 to 1.0 as before:
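The top of each such curve is easy to check directly. A small self-contained sketch for the fixed values used in this and the next section:

```python
def f1(p, r):
    """Harmonic mean of precision (p) and recall (r); 0 if both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

# With one input fixed, the best F1 (reached at recall = 1.0)
# sits only a little above the fixed (smaller) value.
print(round(f1(0.8, 1.0), 2))  # prints 0.89
print(round(f1(0.1, 1.0), 2))  # prints 0.18
```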

The top score with inputs (0.8, 1.0) is 0.89. The rising curve shape is similar as the *Recall* value rises. At the maximum of *Recall* = 1.0, the F1-score reaches a value about 0.09 higher than the smaller input (0.89 vs 0.8).

### F1-score when Precision=0.1 and Recall=0.01 to 1.0

And if we fix one value near minimum at 0.1?

Because one of the two inputs is always low (0.1), the *F1-score* never rises very high. However, interestingly it again rises at maximum to a value about 0.08 larger than the smaller input (*Precision* = 0.1, *F1-score* = 0.18). This is quite similar to the fixed value of *Precision* = 0.8 above, where the maximum value reached was 0.09 higher than the smaller input.

### Focusing F1-score on precision or recall

Besides the plain *F1-score*, there is a more generic version, called *Fbeta-score*. *F1-score* is a special instance of *Fbeta-score*, where *beta*=1. It allows one to weight the precision or recall more, by adding a weighting factor. I will not go deeper into that in this post, however, it is something to keep in mind.
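For reference, the general formula can be sketched as below; this is the standard Fbeta definition, where beta > 1 weights recall more, beta < 1 weights precision more, and beta = 1 recovers the F1-score:

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; 0 if both are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# beta = 1 reproduces the F1-score of the cancer example (~0.75).
print(round(fbeta(0.714, 0.789, 1.0), 2))  # prints 0.75
```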

## F1-score vs Accuracy

Accuracy is commonly described as a more intuitive metric, with *F1-score* better addressing a more imbalanced dataset. So how does the *F1-score* (**F1**) vs *Accuracy* (**ACC**) compare across different types of data distributions (ratios of positive/negative)?

### Imbalance: Few Positive Cases

In this example, there is an imbalance of 10 positive cases, and 90 negative cases, with different TN, TP, FN, and FP values for a classifier to calculate F1 and ACC:

The maximum accuracy with the class imbalance is with a result of TN=90 and TP=10, as shown on row 2.

In each case where TP = 0, *Precision* and *Recall* both become 0, and the *F1-score* cannot be calculated (division by 0). Such cases can be scored as F1-score = 0, or the classifier can simply be marked as useless, because it cannot predict a single correct positive result. This is rows 0, 4, and 8 in the above table. These also illustrate some cases of high *Accuracy* for a broken classifier (e.g., row 0 with 90% *Accuracy* while always predicting only negative).

The remaining rows illustrate how the *F1-score* reacts much better to the classifier making more balanced predictions. For example, *F1-score* = 0.18 vs *Accuracy* = 0.91 on row 5, compared to *F1-score* = 0.46 vs *Accuracy* = 0.93 on row 7. This is a change of only 2 positive predictions, but as it is out of 10 possible, the change is actually quite large, and the *F1-score* emphasizes it (while *Accuracy* barely moves).
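A few of these rows can be reconstructed in code. Note that the exact TP/FP splits below are my assumption, chosen to reproduce the F1/accuracy values discussed above (10 positive, 90 negative cases, no false positives):

```python
def scores(tp, tn, fp, fn):
    """Return (F1, accuracy); F1 is scored as 0 when TP == 0."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    if tp == 0:
        return 0.0, acc  # precision/recall undefined, mark F1 as 0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r), acc

print(scores(0, 90, 0, 10))  # always predict negative: F1 = 0, ACC = 0.90
print(scores(1, 90, 0, 9))   # F1 ~0.18, ACC = 0.91
print(scores(3, 90, 0, 7))   # F1 ~0.46, ACC = 0.93
```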

### Balance 50/50 Positive and Negative cases:

How about when the datasets are more balanced? Here are similar values for a balanced dataset with 50 negative and 50 positive items:

*F1-score* is still a slightly better metric here, when there are only very few (or no) positive predictions. But the difference is not as large as with imbalanced classes. In general, it is still useful to look a bit deeper into the results, although in balanced datasets a high accuracy is usually a good indicator of decent classifier performance.

### Imbalance: Few Negative Cases

Finally, what happens if the minority class is the negative one rather than the positive? *F1-score* no longer balances it, but rather the opposite. Here is an example with 10 negative cases and 90 positive cases:

For example, row 5 has only 1 correct prediction out of 10 negative cases. But the *F1-score* is still around 95%, very good and even higher than the accuracy. When the same ratio applied to the positive cases being the minority, the *F1-score* was 0.18; now it is 0.95. The former was a much better indicator of the actual quality.

This result with minority negative cases is because of how the formula to calculate *F1-score* is defined over *precision* and *recall* (emphasizing positive cases). If you look back at the figure illustrating the metrics hierarchy at the beginning of this article, you will see how *True Positives* feed into both *Precision* and *Recall*, and from there into *F1-score*. The same figure also shows how *True Negatives* do not contribute to *F1-score* at all. This becomes visible here when you reverse the ratios and have fewer *true negatives*.
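This is easy to verify in code: the F1-score can be written entirely in terms of TP, FP, and FN, and TN never appears. The row-5 counts below are my reconstruction of the example (90 positive, 10 negative, 9 of 10 negatives misclassified):

```python
def f1_from_counts(tp, fp, fn):
    """F1-score needs only TP, FP, and FN; true negatives never enter it."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Even with 9 of 10 negatives misclassified (FP = 9), F1 stays high
# because the many true positives dominate precision and recall.
print(round(f1_from_counts(90, 9, 0), 2))  # prints 0.95
```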

So, as usual, I believe it is good to keep in mind how to represent your data, and do your own data exploration, not blindly trusting any single metric.

## Conclusions

So what are these metrics good for?

The traditional **Accuracy** is a good measure if you have quite balanced datasets and are interested in all types of outputs equally. I like to start with it in any case, as it is intuitive, and dig deeper from there as needed.

**Precision** is great to focus on if you want to minimize false positives. For example, if you build a spam email classifier, you want to see as little spam as possible, but above all you do not want any important, non-spam emails flagged as spam. In such cases, you may wish to maximize precision.

**Recall** is very important in domains such as medical (e.g., identifying cancer), where you really want to minimize the chance of missing positive cases (predicting false negatives). These are typically cases where missing a positive case has a much bigger cost than wrongly classifying something as positive.

Neither *precision* nor *recall* is necessarily useful alone, since we are generally interested in the overall picture. *Accuracy* is always good to check as one option. *F1-score* is another.

**F1-score** combines precision and recall, and works also for cases where the datasets are imbalanced as it requires both precision and recall to have a reasonable value, as demonstrated by the experiments I showed in this post. Even if you have a small number of positive cases vs negative cases, the formula will weight the metric value down if the precision or recall of the positive class is low.

Besides these, there are various other metrics and ways to explore your results. A popular and very useful approach is the use of **ROC and precision-recall curves**. These allow fine-tuning the evaluation thresholds according to what type of error we want to minimize. But that is a different topic to explore.

That's all for today. 🙂