# A Look at Precision, Recall, and F1-Score

## Exploring the relations between machine learning metrics

The terminology of a specific domain is often difficult to pick up at first. Coming from a software engineering background, I find machine learning full of such terms that I need to remember in order to use the tools and read the articles.

Some basic terms are Precision, Recall, and F1-Score. These relate to getting a finer-grained idea of how well a classifier is doing, as opposed to just looking at overall accuracy. Writing an explanation forces me to think it through, and helps me remember the topic myself. That’s why I like to write these articles.

I am looking at a binary classifier in this article. The same concepts apply more broadly, but require a bit more consideration for multi-class problems. That is something to consider another time.

Before going into the details, an overview figure is always nice:

At first glance, it is a bit of a messy web. No need to worry about the details for now, but we can look back at this figure in the following sections, when explaining the details from the bottom up. The metrics form a hierarchy starting with the true/false negatives/positives (at the bottom), and building up all the way to the F1-score to bind them all together. Let's build up from there.

## True/False Positives and Negatives

A binary classifier can be viewed as classifying instances as positive or negative:

• Positive: The instance is classified as a member of the class the classifier is trying to identify. For example, a classifier looking for cat photos would classify photos with cats as positive (when correct).
• Negative: The instance is classified as not being a member of the class we are trying to identify. For example, a classifier looking for cat photos should classify photos with dogs (and no cats) as negative.

The basis of precision, recall, and F1-Score comes from the concepts of True Positive, True Negative, False Positive, and False Negative. The following table illustrates these (consider value 1 to be a positive prediction):

## True Positive (TP)

The following table shows 3 examples of a True Positive (TP). The first row is a generic example, where 1 represents the Positive prediction. The following two rows are examples with labels. Internally, the algorithms would use the 1/0 representation, but I used labels here for a more intuitive understanding.

## False Positive (FP)

These False Positive (FP) examples illustrate making wrong predictions: predicting Positive for actual Negative samples. Such a failed prediction is called a False Positive.

## True Negative (TN)

For the True Negative (TN) example, the cat classifier correctly identifies a photo as not having a cat in it, and the medical classifier correctly identifies the image as a patient having no cancer. So the prediction is Negative and correct (True).

## False Negative (FN)

In the False Negative (FN) case, the classifier has predicted a Negative result, while the actual result was positive. Like no cat when there is a cat. So the prediction was Negative and wrong (False). Thus it is a False Negative.
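To make the four cases concrete, they can be counted from paired actual/predicted labels. Here is a small illustration of my own (1 for positive, 0 for negative), not code from any specific library:

```python
def confusion_counts(actuals, predictions):
    """Count TP, FP, TN, FN for a binary classifier (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(actuals, predictions):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, was positive
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, was negative
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted negative, was negative
        else:
            fn += 1  # predicted negative, was positive
    return tp, fp, tn, fn

# One of each case: TP, FP, FN, TN
print(confusion_counts([1, 0, 1, 0], [1, 1, 0, 0]))  # (1, 1, 1, 1)
```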

## Confusion Matrix

A confusion matrix is sometimes used to illustrate classifier performance based on the above four values (TP, FP, TN, FN). These are plotted against each other to show a confusion matrix:

Using the cancer prediction example, a confusion matrix for 100 patients might look something like this:

This example has:

• TP: 45 positive cases correctly predicted
• TN: 25 negative cases correctly predicted
• FP: 18 negative cases are misclassified (wrong positive predictions)
• FN: 12 positive cases are misclassified (wrong negative predictions)

Thinking about this for a while, there are different severities to the different errors here. Classifying someone who has cancer as not having it (false negative, denying treatment), is likely more severe than classifying someone who does not have it as having it (false positive, consider treatment, do further tests).

As the severity of different kinds of mistakes varies across use cases, metrics such as Accuracy, Precision, Recall, and F1-score can be used to evaluate the classifier with the preferred trade-off in mind.

## Accuracy

The base metric used for model evaluation is often Accuracy, describing the number of correct predictions over all predictions:

These three show the same formula for calculating accuracy, just in different wordings: from more formalized to more intuitive (in my opinion). In the above cancer example, the accuracy would be:

• (TP+TN)/DatasetSize=(45+25)/100=0.7=70%.

This is perhaps the most intuitive of the model evaluation metrics, and thus commonly used. But often it is useful to also look a bit deeper.
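The same calculation in code, using the example confusion-matrix values from above:

```python
# Values from the cancer example confusion matrix above
tp, tn, fp, fn = 45, 25, 18, 12

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.7, i.e. 70%
```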

## Precision

Precision is a measure of how many of the positive predictions made are correct (true positives). The formula for it is:

All three above are again just different wordings of the same, with the last one using the cancer case as a concrete example. In this cancer example, using the values from the above example confusion matrix, the precision would be:

• 45/(45+18)=45/63=0.714=71.4%.

## Recall / Sensitivity

Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as Sensitivity. The formula for it is:

Once again, this is just the same formula worded three different ways. For the cancer example, using the confusion matrix data, the recall would be:

• 45/(45+12)=45/57=0.789=78.9%.

## Specificity

Specificity is a measure of how many negative predictions made are correct (true negatives). The formula for it is:

In the above medical example, the specificity would be:

• 25/(25+18)=0.581=58,1%.

## F1-Score

F1-Score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. Harmonic mean is just another way to calculate an “average” of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for F1-score in this case is:

The idea is to provide a single metric that weights the two ratios (precision and recall) in a balanced way, requiring both to have a higher value for the F1-score value to rise. For example, a Precision of 0.01 and Recall of 1.0 would give:

• an arithmetic mean of (0.01+1.0)/2=0.505,
• an F1-score (formula above) of 2*(0.01*1.0)/(0.01+1.0)≈0.02.

This is because the F1-score is much more sensitive to one of the two inputs having a low value (0.01 here), which makes it great if you want to balance the two.

• Very small precision or recall will result in lower overall score. Thus it helps balance the two metrics.
• If you choose your positive class as the one with fewer samples, F1-score can help balance the metric across positive/negative samples.
• As illustrated by the first figure in this article, it combines many of the other metrics into a single one, capturing many aspects at once.

In the cancer example further above, the F1-score would be:

• 2 * (0.714*0.789)/(0.714+0.789)=0.75 = 75%

### Exploring F1-score

I find it easiest to understand concepts by looking at some examples. First a function in Python to calculate F1-score:
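The notebook version is not reproduced here, but a minimal sketch of such a function could be:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # undefined when both inputs are 0; score as 0
    return 2 * (precision * recall) / (precision + recall)

# The cancer example values from earlier
print(round(f1_score(0.714, 0.789), 2))  # 0.75
```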

To compare different combinations of precision and recall, I generate example values for precision and recall in range of 0 to 1 with steps of 0.01 (100 values of 0.01, 0.02, 0.03, … , 1.0):
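A simple way to generate these values (a sketch; the notebook code may differ):

```python
# 100 values: 0.01, 0.02, ..., 1.0
steps = [round(i * 0.01, 2) for i in range(1, 101)]
precision_values = list(steps)
recall_values = list(steps)
print(steps[:3], steps[-1])  # [0.01, 0.02, 0.03] 1.0
```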

This produces a list for both precision and recall to experiment with:

### F1-score when precision=recall

To see what is the F1-score if precision equals recall, we can calculate F1-scores for each point 0.01 to 1.0, with precision = recall at each point:
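In code, this is easy to verify, since the harmonic mean of two equal values collapses to that same value:

```python
# When precision equals recall: F1 = 2 * (p * p) / (p + p) = p
for p in [0.01, 0.25, 0.5, 0.75, 1.0]:
    f1 = 2 * (p * p) / (p + p)
    print(f"precision=recall={p}: F1={f1:.2f}, difference={abs(f1 - p):.2f}")
```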

F1-score equals precision and recall if the two input metrics (P&R) are equal. The Difference column in the table shows the difference between the smaller value (Precision/Recall) and the F1-score. Here they are equal, so there is no difference; in the following examples they start to vary.

### F1-score when Recall = 1.0, Precision = 0.01 to 1.0

So, the F1-score should handle reasonably well cases where one of the inputs (P/R) is low, even if the other is very high.

Let's try setting Recall to the maximum of 1.0 and varying Precision from 0.01 to 1.0:
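A sketch of this experiment in code:

```python
# Recall fixed at the maximum, Precision varied from low to high
recall = 1.0
for precision in [0.01, 0.2, 0.5, 0.8, 1.0]:
    f1 = 2 * (precision * recall) / (precision + recall)
    diff = f1 - min(precision, recall)  # how far F1 rises above the smaller input
    print(f"precision={precision}: F1={f1:.3f}, diff={diff:.3f}")
```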

As expected, the F1-score stays low when one of the two inputs (Precision / Recall) is low. The difference column shows how the F1-score in this case rises a bit faster than the smaller input (Precision here), gaining more towards the middle of the chart, weighted up a bit by the bigger value (Recall here). However, it never goes very far from the smaller input, balancing the overall score based on both inputs. These differences can also be visualized on the figure (difference is biggest at the vertical red line):

### F1-score when Precision = 1.0 and Recall = 0.01 to 1.0

If we swap the roles of Precision and Recall in the above example, we get the same result (due to F1-score formula):

This is to say, regardless of which one is higher or lower, the overall F1-score is impacted in exactly the same way (which seems quite obvious from the formula, but is easy to forget).

### F1-score when Precision=0.8 and Recall = 0.01 to 1.0

Besides fixing one input at the maximum, let's try a bit lower. Here Precision is fixed at 0.8, while Recall varies from 0.01 to 1.0 as before:

The top score with inputs (0.8, 1.0) is 0.89. The rising curve shape is similar as the Recall value rises. At the maximum Recall of 1.0, the F1-score reaches about 0.09 above the smaller input (0.89 vs 0.8).

### F1-score when Precision=0.1 and Recall=0.01 to 1.0

And if we fix one value near minimum at 0.1?

Because one of the two inputs is always low (0.1), the F1-score never rises very high. Interestingly, however, it again rises at its maximum to about 0.08 above the smaller input (Precision = 0.1, F1-score = 0.18). This is quite similar to the fixed value of Precision = 0.8 above, where the maximum value reached was 0.09 higher than the smaller input.

### Focusing F1-score on precision or recall

Besides the plain F1-score, there is a more generic version, called Fbeta-score. F1-score is a special instance of Fbeta-score, where beta=1. It allows one to weight the precision or recall more, by adding a weighting factor. I will not go deeper into that in this post, however, it is something to keep in mind.
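As a sketch of the idea, the general F-beta formula (with F1 as the special case beta=1) could be implemented like this:

```python
def fbeta_score(precision, recall, beta):
    """Weighted harmonic mean: beta > 1 weights recall higher, beta < 1 weights precision higher."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# beta=1 reduces to the plain F1-score
print(round(fbeta_score(0.714, 0.789, beta=1), 2))  # 0.75
```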

## F1-score vs Accuracy

Accuracy is commonly described as a more intuitive metric, with F1-score better addressing a more imbalanced dataset. So how does the F1-score (F1) vs Accuracy (ACC) compare across different types of data distributions (ratios of positive/negative)?

### Imbalance: Few Positive Cases

In this example, there is an imbalance of 10 positive cases, and 90 negative cases, with different TN, TP, FN, and FP values for a classifier to calculate F1 and ACC:

The maximum accuracy with the class imbalance is with a result of TN=90 and TP=10, as shown on row 2.

In each case where TP = 0, the Precision and Recall both become 0, and the F1-score cannot be calculated (division by zero). Such cases can be scored as F1-score = 0, or the classifier can simply be marked as useless, because it cannot predict any correct positive result. This is rows 0, 4, and 8 in the above table. These also illustrate some cases of high Accuracy for a broken classifier (e.g., row 0 with 90% Accuracy while always predicting only negative).

The remaining rows illustrate how the F1-score reacts much better to the classifier making more balanced predictions. For example, F1-score = 0.18 vs Accuracy = 0.91 on row 5, to F1-score = 0.46 vs Accuracy = 0.93 on row 7. This is only a change of 2 positive predictions, but as it is out of 10 possible, the change is actually quite large, and the F1-score emphasizes it, while the Accuracy barely moves.

### Balance 50/50 Positive and Negative cases:

How about when the datasets are more balanced? Here are similar values for a balanced dataset with 50 negative and 50 positive items:

F1-score is still a slightly better metric here, when there are only very few (or none) of the positive predictions. But the difference is not as huge as with imbalanced classes. In general, it is still always useful to look a bit deeper into the results, although in balanced datasets, a high accuracy is usually a good indicator of a decent classifier performance.

### Imbalance: Few Negative Cases

Finally, what happens if the minority class is measured as the negative and not positive? F1-score no longer balances it but rather the opposite. Here is an example with 10 negative cases and 90 positive cases:

For example, row 5 has only 1 correct prediction out of 10 negative cases. But the F1-score is still at around 95%, so very good, and even higher than the accuracy. When the same ratio applied to the positive cases being the minority, the F1-score was 0.18, versus 0.95 now. The former was a much better indicator of quality than the latter.

This result with minority negative cases is because of how the formula to calculate F1-score is defined over precision and recall (emphasizing positive cases). If you look back at the figure illustrating the metrics hierarchy at the beginning of this article, you will see how True Positives feed into both Precision and Recall, and from there into the F1-score. The same figure also shows how True Negatives do not contribute to the F1-score at all. This becomes visible here when you reverse the ratios and have fewer true negatives.

So, as usual, I believe it is good to keep in mind how to represent your data, and do your own data exploration, not blindly trusting any single metric.

## Conclusions

So what are these metrics good for?

The traditional Accuracy is a good measure if you have quite balanced datasets and are interested in all types of outputs equally. I like to start with it in any case, as it is intuitive, and dig deeper from there as needed.

Precision is great to focus on if you want to minimize false positives. For example, say you build a spam email classifier. You want to see as little spam as possible, but you really do not want important, non-spam emails flagged as spam. In such cases, you may wish to aim for maximizing precision.

Recall is very important in domains such as medical (e.g., identifying cancer), where you really want to minimize the chance of missing positive cases (predicting false negatives). These are typically cases where missing a positive case has a much bigger cost than wrongly classifying something as positive.

Neither precision nor recall is necessarily useful alone, since we are generally interested in the overall picture. Accuracy is always good to check as one option. F1-score is another.

F1-score combines precision and recall, and works also for cases where the datasets are imbalanced as it requires both precision and recall to have a reasonable value, as demonstrated by the experiments I showed in this post. Even if you have a small number of positive cases vs negative cases, the formula will weight the metric value down if the precision or recall of the positive class is low.

Besides these, there are various other metrics and ways to explore your results. A popular and very useful approach is the use of ROC and precision-recall curves. These allow fine-tuning the evaluation thresholds according to the type of error we want to minimize. But that is a different topic to explore.

That's all for today. 🙂

# Building an NLP classifier: Example with Firefox Issue Reports

## DistilBERT vs LSTM, with data exploration

Machine learning (ML) techniques for Natural Language Processing (NLP) offer impressive results these days. Libraries such as Keras, PyTorch, and HuggingFace NLP make the application of the latest research and models in the area a (relatively) easy task. In this article, I implement and compare two different NLP based classifier model architectures using the Firefox browser issue tracker data.

I originally posted this article on Medium, where you can still find it.

I previously built a similar issue report classifier. Back then, I found the deep-learning based LSTM (Long Short-Term Memory) model architecture performed very well for the task. I later added the Attention mechanism on top of the LSTM, improving the results. This LSTM with Attention is the first model type I chose for this article.

The second model type is Transformers. Transformers have become very popular in NLP over the past few years, and their popularity has spawned many variants. In this article, I wanted to get a feel for how they compare to the LSTM approach I applied before.

I use the HuggingFace (HF) DistilBERT as the Transformer model. For the LSTM, I use Keras with its LSTM and open-source Attention layers. I train them both to predict what component an incoming Firefox bug report should be assigned to. The idea would be to assist in bug report triaging tasks.

### Getting the Data

As said, I am using the issue tracker for the Firefox browser as the source of my data. Since I am in no way affiliated with Mozilla or the Firefox Browser, I just used their Bugzilla REST API to download it. In a scenario where you actually work within your company or with a partner organization, you could likely just ask for the data as a direct dump from the database. But sometimes you just have to find a way. In this case it was quite simple, with a public API to support downloads. You can find the downloader code on my Github. It is just a simple script to download the data in chunks.
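As a rough sketch of the idea behind the script (the `product`, `limit`, and `offset` query parameters follow the standard Bugzilla REST API; treat the exact query as illustrative):

```python
import json
import urllib.request

BASE_URL = "https://bugzilla.mozilla.org/rest/bug"  # public Bugzilla REST endpoint

def build_chunk_url(offset, limit=500, product="Firefox"):
    # Page through results with limit/offset, filtered to the Firefox product
    return f"{BASE_URL}?product={product}&limit={limit}&offset={offset}"

def download_chunk(offset, limit=500):
    # One GET per chunk; the API wraps results in a "bugs" list
    with urllib.request.urlopen(build_chunk_url(offset, limit)) as response:
        return json.load(response)["bugs"]
```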

### Exploring and Cleaning the Data

The overall data I use contains the following attributes for each issue:

• ID: A unique issue number, apparently across all Firefox/Mozilla product issues (not just the Browser)
• Summary: A summary description of the issue. This is one of the text fields I will use as features for the classifier.
• Description: A longer description of the issue. This is the other text field I will use as features for the classifier.
• Component: The software component in the Firefox Browser architecture that the issue was assigned to. This will be my prediction target, and thus the label to train the classifier on.
• Duplicate issues number, if any
• Creator: email, name, user id, etc.
• Severity: trivial to critical/blocker, or S1 to S4. Or some random values. Seems like a mess when I looked at it :).
• Last time modified (change time)
• Keywords: apparently, you can pick any keywords you like. I found 1680 unique values.
• Status: One of resolved, verified, new, unconfirmed, reopened, or assigned
• Resolution: One of Duplicate, fixed, worksforme, incomplete, invalid, wontfix, expired, inactive, or moved
• Open/Closed status: 18211 open, 172552 closed at the time of my download.

#### Selecting Features for the Predictor

My goal was to build a classifier assigning bug reports to components, based on their natural language descriptions. The first field I considered for this was naturally the description field. However, another similarly interesting field for this purpose is the summary field.

By looking at the values, I can quite simply see there are some issue reports with no (or empty) description:

I considered an option of using the summary field as another extra feature set. Almost all issue reports have a summary:

I would need a way to build a classifier so that each issue report could be classified, meaning they all should have the required features. Clearly the description alone is not sufficient here. One option I considered was to use the summary if the description is missing, or if the description-based classifier is not very confident in its classification.

A related paper I found used a combination of summary + description as the feature set. Combining these two gives a set where all items at least have some features to predict on:

The strange-looking apply function above simply adds the summary and description texts together to produce text_feature. With this, all issue reports have a non-empty text_feature as a result. This is what I used as the features for the classifier (tokenized words from text_feature).
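A pandas sketch of this combination step (the rows here are illustrative stand-ins for the real fields):

```python
import pandas as pd

# Illustrative stand-ins for the real summary/description fields
df = pd.DataFrame({
    "summary": ["Crash on startup", "Tab bar rendering glitch"],
    "description": ["Firefox crashes when opening...", None],  # second report has no description
})

# Fill missing descriptions, then concatenate summary and description into one feature
df["description"] = df["description"].fillna("")
df["text_feature"] = (df["summary"] + " " + df["description"]).str.strip()
```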

#### Selecting the Components to Predict

To predict a component to assign an issue report to, I need the list of components. Getting this list is simple enough:
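With pandas this is essentially a one-liner (illustrative data; the real dataset has 52 components):

```python
import pandas as pd

# Illustrative subset of the issue data
issues = pd.DataFrame({"component": ["General", "Theme", "General", "Sync", "General"]})
component_counts = issues["component"].value_counts()
print(component_counts)
```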

There are 52 components that have something assigned to them. The simple thing to do would be to just train a classifier to predict any of these 52 based on the text features. But as software evolves, some components may become obsolete, and trying to assign something to them at the current time might be completely pointless if the component is no longer relevant.

Let's see the details:

This gives the following list:

General                                 64153
Untriaged 18207
Bookmarks & History 13972
Tabbed Browser 9786
Preferences 6882
Toolbars and Customization 6701
Theme 6442
Sync 5930
Session Restore 4205
Search 4171
New Tab Page 4132
File Handling 3471
Extension Compatibility 3371
Security 3277
Shell Integration 2610
Installer 2504
PDF Viewer 2193
Messaging System 1561
Private Browsing 1494
Disability Access 998
Protections UI 984
Migration 964
Site Identity 809
Page Info Window 731
Site Permissions 604
Enterprise Policies 551
Pocket 502
Firefox Accounts 451
Tours 436
WebPayments UI 433
Normandy Client 409
Screenshots 318
Translation 212
Remote Settings Client 174
Top Sites 124
Activity Streams: General 113
Normandy Server 91
Nimbus Desktop Client 83
Pioneer 73
Launcher Process 67
Firefox Monitor 61
Distributions 56
Activity Streams: Timeline 25
Foxfooding 22
Activity Streams: Server Operations 5

The number of issues per component is spread quite widely, and the distribution is highly skewed (General has far more reports assigned than any other component). Specifically, some components have very few reports, and likely any classifier would do quite poorly with them.

First, let's see if I can find any components that might be obsolete and could be removed. This might happen over time as the software under test evolves, and some features (and their components) are dropped. One way to look at this is to find components that have had no activity for a long time. The following should show the latest date when a component had an issue created for it:
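A pandas sketch of such a query, producing a listing like the one below (column names here are illustrative):

```python
import pandas as pd

# Illustrative rows; the real data has an ISO-format creation time per issue
issues = pd.DataFrame({
    "component": ["Theme", "Theme", "Sync"],
    "creation_time": ["2020-01-01T00:00:00Z", "2021-04-17T17:45:09Z", "2021-04-16T18:53:49Z"],
})

# ISO timestamps sort correctly as strings, so max() gives the latest issue per component
latest_per_component = issues.groupby("component")["creation_time"].max().sort_values()
print(latest_per_component)
```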

component
Activity Streams: Timeline 2016-09-14T14:05:46Z
Activity Streams: Server Operations 2017-03-17T18:22:07Z
Activity Streams: General 2017-07-18T18:09:40Z
WebPayments UI 2020-03-24T13:09:16Z
Normandy Server 2020-09-21T17:28:10Z
Extension Compatibility 2021-02-05T16:18:34Z
Distributions 2021-02-23T21:12:01Z
Disability Access 2021-02-24T17:35:33Z
Remote Settings Client 2021-02-25T17:25:41Z
Migration 2021-03-28T16:36:08Z
Normandy Client 2021-04-01T21:14:52Z
Firefox Monitor 2021-04-05T16:47:26Z
Translation 2021-04-06T17:39:27Z
Tours 2021-04-08T23:26:27Z
Firefox Accounts 2021-04-10T14:17:25Z
Enterprise Policies 2021-04-13T02:38:53Z
Shell Integration 2021-04-13T10:01:39Z
Pocket 2021-04-15T02:55:12Z
Launcher Process 2021-04-15T03:10:09Z
PDF Viewer 2021-04-15T08:13:57Z
Site Identity 2021-04-15T09:20:25Z
Nimbus Desktop Client 2021-04-15T11:16:11Z
Installer 2021-04-15T15:06:46Z
Page Info Window 2021-04-15T19:24:28Z
Site Permissions 2021-04-15T21:33:40Z
Bookmarks & History 2021-04-16T09:43:36Z
Protections UI 2021-04-16T13:25:27Z
File Handling 2021-04-16T13:40:56Z
Foxfooding 2021-04-16T14:08:35Z
Security 2021-04-16T15:31:09Z
Screenshots 2021-04-16T15:35:25Z
Top Sites 2021-04-16T15:56:26Z
General 2021-04-16T16:36:27Z
Private Browsing 2021-04-16T17:17:21Z
Tabbed Browser 2021-04-16T17:37:16Z
Sync 2021-04-16T18:53:49Z
Pioneer 2021-04-16T20:58:44Z
New Tab Page 2021-04-17T02:50:46Z
Messaging System 2021-04-17T14:22:36Z
Preferences 2021-04-17T14:41:46Z
Theme 2021-04-17T17:45:09Z
Session Restore 2021-04-17T19:22:53Z
Toolbars and Customization 2021-04-18T08:16:27Z
Search 2021-04-18T09:04:40Z
Untriaged 2021-04-18T10:36:19Z

The list above is sorted by time, and all three components related to “Activity Streams” had their last issues created 4–5 years ago. With this, I added them to a list of components to remove from the dataset. It seems pointless to assign any new issues to them with this timeline.

The Activity Streams: Timeline component was also one of the components in the earlier list with the fewest issues assigned to it. The other two components with very few issues created were Foxfooding and System Add-ons: Off-train Deployment. Since the issues for a component are listed in chronological order, looking at the last few in both of these should give some insight on their recent activity.

The above figure/table shows how the actual last reported issue is from 2019, and the last few after that are actually some kind of clones of old issues, made for some other purpose than reporting actual issues. So I dropped System Add-ons: Off-train Deployment from the dataset as well.

Foxfooding is described in the Firefox issue tracker as collecting issues for later triaging. Looking into it, it only shows recent issues; I guess older ones might have been triaged. Without further knowledge, I left it in the dataset. With better access to domain experts, I might have removed it, as it sounds like the actual issues in it could belong to many other components (and be moved after triaging). But I expect it is not a big deal, as it only has a few issues in it.

A few other components also had a somewhat longer period since their last issue report. To get a better idea of how active these components have been over time, I plotted their count of issues over months. For example, WebPayments UI:

WebPayments UI seems to have started quietly, gotten some attention, and quieted down again. The last of this attention was a bit over a year before this writing, in March 2020. I don't know if it is still relevant, so I just left it in.

Finally, the list of components I removed for training as a result of this short analysis was the following:

• Activity Streams: Timeline
• Activity Streams: Server Operations
• Activity Streams: General
• System Add-ons: Off-train Deployment

The rest of the components seemed more likely to be relevant still. This left me with 48 target components from the original 52. I trained the initial set of models with these 48 target components. After a little bit of looking at the results, I removed one more component. Did you figure out which one it is already?

It is Untriaged, because Untriaged is just a list of issues that are not yet assigned to other components. Thus, from a machine learning perspective, these issues are unlabeled. As far as I can see, keeping them in the training set can only confuse the trained classifier. So for further training iterations I also removed issues assigned to Untriaged, leaving 47 target components (labels).
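Filtering the dropped components out of the dataset is then straightforward (a pandas sketch with illustrative data):

```python
import pandas as pd

# Components dropped based on the analysis above
DROPPED_COMPONENTS = {
    "Activity Streams: Timeline",
    "Activity Streams: Server Operations",
    "Activity Streams: General",
    "System Add-ons: Off-train Deployment",
    "Untriaged",
}

# Illustrative rows; the real dataframe holds the full issue dump
issues = pd.DataFrame({"component": ["General", "Untriaged", "Theme", "Activity Streams: General"]})
issues = issues[~issues["component"].isin(DROPPED_COMPONENTS)]
```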

In data analysis, it is easy to get sidetracked by the next shiny thing. A bit like Little Red Riding Hood in the forest, I guess. In that vein, some interesting facts can also be found by looking at the oldest reports with the Untriaged component/tag:

The above list shows the oldest open and untriaged issue is from over 20 years ago (when writing this). It discusses the correct way to abbreviate “seconds”. In my experience, this is exactly how issue trackers in projects tend to evolve over time. No-one wants to say this issue does not matter and close it, yet no-one wants to make the effort or decide what to do with it. Or take the heat for the slightly irrelevant decision. Or maybe it is just forgotten.

A bunch of others in that list have also been waiting for a few years, and if I remove the is_open requirement from the query, there are many very old issues in untriaged status. Issue trackers in general evolve this way, I guess. At least that is what I have seen, and it sort of makes sense. Like my storage closet: junk accumulates, and it is easier to just leave it than do something about it.

Finally, just one more query to show the oldest issue created per component:

The above list actually seems to give a kind of history of the development of the project. As I said, it is easy enough to get lost in data exploration like this, but in a practical setting I should be focusing more on the task at hand. So let's get back to building the issue report classifier.

### Training the Models

In this section I will briefly describe the models I used, and the general process I applied. For some assurance on training stability, I repeated the training 10 times for both models with randomly shuffled data. The overall dataset I used has 172556 rows of data (issue reports) after removing the five components discussed above.

#### A Look at the Models

First, the models.

Keras LSTM

A notebook setting up the Keras LSTM model, and running it can be found on my Github. The Keras summary function shows the structure:

The input layer takes in a maximum of 512 word tokens. These feed into a GloVe-based word-embedding layer, which converts the input token sequence into 300-dimensional embedding vectors. This is followed by a bi-directional LSTM layer with 128 nodes. A self-attention layer follows, feeding its output into another bi-directional LSTM layer with 64 nodes. The output from there goes into a weighted attention layer, which passes it to the final Dense output layer. Many fancy words if you are not familiar with them, but worry not. It is simple to use in the end, and practical results will be presented.

HuggingFace DistilBERT

The HuggingFace DistilBERT is a bit more of a black box. A notebook for the training and testing is also on my Github. The Keras summary for it gives:

I guess it is some kind of custom Tensorflow implementation in a single layer from the Keras viewpoint. This matches my previous attempts to read about the various transformer architectures, where every explanation quickly diverges from the general high-level name into all the in-depth details, and I am left wondering why no one is able to provide an understandable and intuitive intermediate view. Anyway. The visual representation of it in terms of boxes and arrows in Keras is even better:

It’s all just a single box. I guess this is what you would call a black-box model (just color the box black to finish it..) :).

#### Sampling the Data

I sampled the dataset in each case into 3 parts. The model was trained on 70% of the data (training set), or 124126 issue reports. 20%, or 31032 issue reports (validation set), were used for evaluating model performance during training. The remaining 10%, or 17240 issue reports (test set), were used to evaluate the final model after training finished. The sampling in each case was stratified, producing an equal proportion of different target components in each of the train, validation, and test sets.
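A sketch of such a stratified 70/20/10 split using pandas (not the exact notebook code):

```python
import pandas as pd

def stratified_split(df, label_col, seed=42):
    """Split into 70/20/10 train/validation/test sets, stratified by label_col."""
    train = df.groupby(label_col, group_keys=False).sample(frac=0.7, random_state=seed)
    rest = df.drop(train.index)
    # The remaining 30% is split 2:1 into validation (20%) and test (10%)
    val = rest.groupby(label_col, group_keys=False).sample(frac=2 / 3, random_state=seed)
    test = rest.drop(val.index)
    return train, val, test
```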

I repeated the sampling into these 3 sets 10 times, with different randomization in selecting items for each set. Another option would have been to do 10-fold splits and validations, which would be more systematic. But the sampling I used worked for my purposes. Luckily I am not writing a research paper for Reviewer 2 today, so let's pretend that is fine.

#### Training Accuracy over Epochs

The following figures illustrate the training loss and accuracy, for training the models on one set of the sampled data for both models. First the HuggingFace training for DistilBERT:

I set the HuggingFace trainer to evaluate the model every 500 steps, providing the high-granularity graph above. With the amount of data I used, and a batch size of 16, the number of steps in HF training over 3 epochs was 22881. At every 500th of these, the evaluation was performed, showing as a point on the above graph. As shown in the figure, training was quite consistent, but leveled off at around epoch 2.

The Keras LSTM training metrics per epoch are shown below:

In this figure, epoch 0.0 is epoch 1, 1.0 is epoch 2, and so on. This is simply due to how I used a 0-indexed array for it. The name test in the figure also actually refers to my validation set; I always get the terms confused. Sorry about that. More importantly, I trained this model for 4 epochs, and each point in this figure shows the evaluation result after an epoch. The validation always peaked at epoch 2 for this model.

### Results

Results are perhaps the most interesting part. The following table shows 10 different runs where the LSTM and Transformer classifiers were trained on different dataset variants as described before:

Each row in the table is a separate training on different random shuffles of the data. The table has the following columns:

• hf_best_epoch: the epoch where the lowest loss (for validation set) was recorded for HF. In Keras this was always epoch 2, so I did not include a column for it.
• hf_val_loss: the validation set loss at hf_best_epoch as given by HF.
• hf_val_acc: the validation set accuracy at same point as hf_val_loss.
• k_val_loss: the validation set loss at end of best epoch as given by Keras.
• k_val_acc: the validation set accuracy at same point as k_val_loss.
• hf1_test_acc: HF accuracy of using the best model to predict the target component, and only taking the top prediction.
• k1_test_acc: same as hf1_test_acc but for the Keras model.
• hf5_test_acc: same as hf1_test_acc, but considering if any of the top 5 predictions match the correct label. Think of it as providing the triaging user with 5 top component suggestions to assign the issue to.
• k5_test_acc: same as hf5_test_acc but for the Keras model.

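The top-1 and top-5 accuracies above can be computed directly from the predicted class probabilities. A minimal sketch, with made-up probabilities for 3 samples and 6 classes:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """Fraction of rows where the true label is among the k highest-probability classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]   # indices of the k largest per row
    hits = [label in row for label, row in zip(labels, top_k)]
    return float(np.mean(hits))

# Tiny made-up example: 3 samples, 6 classes.
probs = np.array([
    [0.1, 0.5, 0.1, 0.1, 0.1, 0.1],   # true class 1 -> top-1 hit
    [0.3, 0.2, 0.1, 0.2, 0.1, 0.1],   # true class 3 -> top-1 miss, top-5 hit
    [0.4, 0.1, 0.1, 0.1, 0.2, 0.1],   # true class 5 -> top-1 miss, top-5 hit
])
labels = [1, 3, 5]
print(top_k_accuracy(probs, labels, k=1))  # 1 of 3 correct
print(top_k_accuracy(probs, labels, k=5))  # all 3 within the top 5
```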
#### Comparing Accuracy in Transformers vs LSTM

The results in the above table are quite clear: the Transformer version outperforms the LSTM version in every case. The difference in top-1 prediction accuracy is about 0.72 vs 0.75, or 72% vs 75%, in favor of the Transformer architecture. For top-5 the difference is about 96% vs 97% accuracy. The difference in loss is about 0.91 vs 0.84, again favoring the Transformer.

In a research study promoting its results, these differences would be a big deal. However, in a practical situation their significance depends on the target domain. Here, my aim was to build a classifier to help the triaging process by suggesting components to assign new issue reports to. In this case, a few misses, or a difference of 96% vs 97% in top-5, may not be that big a deal.

Additionally, besides this classification performance, other considerations may also be relevant. For example, the LSTM in general trains faster and requires fewer resources (such as GPU memory). This and similar issues might also be important tradeoffs in practice.

#### A Deeper Look at Misclassifications

Beyond blindly looking at accuracy values, or even loss values, it is often quite useful to look a bit deeper at what the classifier got right and what it got wrong. That is, what is getting misclassified. Let's see.

In the following, I will present multiple tables across the models and their misclassifications. These tables hold the following columns:

• total: total number of issue reports for this component in the entire dataset
• test_total: number of issue reports for this component in the test set
• fails_act: number of issues for this component that were misclassified as something else. For example, there were 1000 issue reports that were actually for component General but classified as something else.
• fails_pred: number of issues predicted for this component, but were actually for another component. For example, there were 1801 issues predicted as General but their correct label was some other component.
• total_pct: the total column value divided by the total number of issues (172556). The percentage this component represents of all the issues.
• test_pct: same as total_pct but for the test set.
• act_pct: fails_act as a percentage of test_total.
• pred_pct: fails_pred as a percentage of test_total.

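As a sketch, the fails_act and fails_pred counts could be computed from a table of actual vs predicted labels. The rows below are made up for illustration:

```python
import pandas as pd

# Hypothetical per-issue predictions: actual vs predicted component.
df = pd.DataFrame({
    "actual":    ["General", "General", "Tabbed Browser", "Installer", "General"],
    "predicted": ["General", "Tabbed Browser", "General", "Installer", "General"],
})

wrong = df[df["actual"] != df["predicted"]]
# fails_act: issues of this component misclassified as something else.
fails_act = wrong.groupby("actual").size()
# fails_pred: issues predicted as this component but actually another one.
fails_pred = wrong.groupby("predicted").size()
print(fails_act.to_dict())   # {'General': 1, 'Tabbed Browser': 1}
print(fails_pred.to_dict())  # {'General': 1, 'Tabbed Browser': 1}
```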
#### Keras LSTM Top-1 Prediction Misclassifications

First, failure statistics for the Keras LSTM classifier, and its top-1 predictions:

Predicting only the most common label is often used as the bare baseline reference; in this case we could expect an accuracy of 37% if we always predicted each issue report to be assignable to the General component, since the table above shows it holds 37% of all issues.

Something that I find slightly unexpected here is that even though General is so much more common in the set of issues, it does not have an overly large proportion of misclassified issues attributed to it (act_pct + pred_pct). Often such a dominant label in the training set also dominates the predictions, but it is nice to see it is not too bad in this case.

Instead, there are others on that list that stand out more. For example, Shell Integration looks quite bad, with 83% (217 of 261) of its actual test set issues being misclassified as some other component (act_pct). One might consider this to be due to it having such a small number of issues in the training set, but many components with even fewer issues do much better. For example, the one visible in the table above, Installer, has a fail rate of only 32% (act_pct).

To analyze the causes deeper, I would look in a bit more detail at the strongest misclassifications (in terms of component probabilities) for Shell Integration, and try to determine the cause of the mixup. Perhaps some feature engineering would be in order, to preprocess certain words or tokens differently. But this article is long enough as it is, so I am not going there.

Something a bit more generic that I looked into further in the statistics is pairs of misclassifications. The following list shows the most common misclassified pairs in the test set. For example, the top one shows 204 issues for Tabbed Browser being predicted as General. And similarly, 134 General issues predicted as Tabbed Browser. Clearly, the two seem to be mixed up often. Our friend Shell Integration also seems to be commonly mixed with General. And so on.

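Counting such pairs is straightforward once the misclassified (actual, predicted) labels are collected. A minimal sketch with made-up pairs:

```python
from collections import Counter

# Hypothetical (actual, predicted) pairs for the misclassified test-set issues.
misses = [
    ("Tabbed Browser", "General"),
    ("General", "Tabbed Browser"),
    ("Tabbed Browser", "General"),
    ("Shell Integration", "General"),
]

# Count each directed misclassification pair and list the most common ones.
pair_counts = Counter(misses)
print(pair_counts.most_common(2))
```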
Overall, the biggest General component also dominates the above list, as one might have expected. Maybe because it is general, large, and thus randomly holds a bit of everything..

#### Keras LSTM Top-5 Prediction Misclassifications

Besides top-1 predictions, I also collected top-5 predictions. Top-5 means taking the 5 predicted components for an issue, and considering the prediction correct if any of these 5 is the expected (correct) label. The following shows similar stats for the top-5 as before for top-1:

This is otherwise similar to top-1, but the fails_pred column has much larger values, because each issue report had 5 predictions counted for it. So even if one of the 5 was correct, the other 4 would still be counted in fails_pred here.

For the rest of the values, the numbers are clearly better for top-5 than for the top-1 results. For example, General has a fails_act value of only 3, while in top-1 it was 1000. Likely because of its dominant size it gets into many top-5's. This drop from 1000 to 3 is a big improvement, but the overall top-5 accuracy is still only 97% and not 99.9%, as the components with fewer instances are still getting a larger number of misclassifications. For example, Shell Integration still has about 17% act_pct, even with top-1 relaxed to top-5. Much better, but also not nearly 0% fails.

#### HuggingFace DistilBERT Top-1 Prediction Misclassifications

Let’s move to the DistilBERT results for the top-1 prediction. Again, for details, see my Github notebook and the two scripts that go with it. The overall values are clearly better for this model vs the LSTM, as illustrated by the general statistics earlier. However, I see a slightly different trend here vs the LSTM. This model seems to balance better.

The number of failed predictions for the dominating General component is higher than for the LSTM model. For example, fails_act here is 1063, while in the LSTM top-1 results it was 1000. However, as the overall accuracy was quite a bit better for this model (72% LSTM vs 75% for this), this gets balanced out by the other components having better scores. This is what I mean by more balanced: the errors are less focused on a few components.

For example, the fails_act for the nasty Shell Integration is down to 185 for DistilBERT here, while it was 217 in the LSTM results. Still not great, but much better than the LSTM. Most of the other components are similarly lower, with a few exceptions. So this model seems to be overall more accurate, but also more balanced.

To further compare, here are the most commonly misclassified pairs for this model:

This shows a similar trend, where the total number of misclassifications is lower, but also no single pair dominates the list as strongly as in the LSTM model results. And again, the pair of Tabbed Browser and General is at the top, getting mixed with each other, along with General being a part of almost every pair on that list. Looking into these associations in more detail would definitely be on my list if taking this classifier further.

#### HuggingFace DistilBERT Top-5 Prediction Misclassifications

And the results for the DistilBERT top-5:

Similar to top-1, this one has a higher fails_act for General but mostly lower values for the others, leading to what seems to be a more balanced result, along with the higher overall accuracy (96% for the LSTM vs 97% for DistilBERT).

#### Short Summary on Model Differences

The results I presented could likely be optimized quite a bit. I did not do any hyperparameter optimization or model architecture tuning, so my guess is that both the LSTM and Transformer models could have been optimized further. It might also be useful to try the other HF transformer variants. However, my experience is that the optimization gains are not necessarily huge. My goal for this article was to build a practical classifier from scratch, and compare the trendy Transformers architecture to my previous experiments with Attention LSTM’s. For this, I find the current results are fine.

Overall, I find the results show that the Transformer in general gives better classification performance, and is also more balanced. The LSTM still produces good results, but if resources were available I would go with the Transformer.

One point I find interesting in the model differences is that the LSTM seems to do slightly better for the dominant General component, while the Transformer seems to do better on the other ones. This analysis was based on one pair of the 10 variants I trained, so with more time and resources, looking at different training variants would likely give more confidence still.

One way to apply such differences in model performance would be to build an ensemble learner. In such a combined model, both the LSTM and Transformer models would contribute to the prediction. These models would seem to be good candidates, since they have mutual diversity, meaning one produces better results on some parts of the data, and the other on a different part. In practical production systems, ensembles can be overly complex for relatively small gains. However, in something like Kaggle competitions, where fractions of a percentage in the results matter, these would be good insights to look into.

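As a minimal sketch of the soft-voting ensemble idea (the probability values here are made up), the two models' predicted class probabilities could simply be averaged:

```python
import numpy as np

# Hypothetical class probabilities from the two models: 2 issues, 3 components.
lstm_probs = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3]])
transformer_probs = np.array([[0.4, 0.5, 0.1],
                              [0.1, 0.7, 0.2]])

# Simple soft-voting ensemble: average the probabilities, then take the argmax.
ensemble_probs = (lstm_probs + transformer_probs) / 2
predictions = ensemble_probs.argmax(axis=1)
print(predictions)  # predicted component index per issue
```

A weighted average, with weights tuned on the validation set, would be a natural next step.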
### Predictor in Action

The article so far is very much text and data, and would make for a very boring PowerPoint slide set to sell the idea. More concrete, live, and practical demonstrations are often more interesting.

Back when I presented my Qt company classifier based on their issue tracker data, I had similar data with some older classifiers (RF, SVM, TF-IDF, …) and the LSTM. Back then, I found the LSTM classifier produced surprisingly good results (similar to here). I made a presentation of this, showed some good results, and people naturally had the question: How does it actually work on issue reports it has not seen before?

One way to address this question is to explain that the training data did not include the validation or test sets, and thus these already measure exactly how well it would do on data it has not seen before. But not everyone is so familiar with machine learning terminology or concepts. To better address this question, I opened the issue tracker for that day, took one of the newest, still untriaged issues (no component assigned), copied its text, and ran the classifier live on that text. Then I asked if people thought it was a good result. We can simulate that here as well.

I took an issue report filed on May 13th (today, in 2021), as my training dataset was downloaded on the 21st of April. As opposed to a live demo, you just have to trust me on that :).

For this experiment, I picked issue number 1710955. The component label it has been assigned by the developers is Messaging System. The following shows the top-5 predictions for the issue using the HF DistilBERT model. First using only the summary field as the features, followed by using both summary + description as features.

The top line above shows the Messaging System as the top prediction, which is correct in this case. Second place goes to New Tab Page. As shown by the middle line, with just the summary field, the Messaging System is correctly predicted as the top component at 98.6% probability, followed by New Tab Page at 84.7%. The third and final line above shows how, when adding the description field to the features, the predicted top-5 remains the same but the classifier becomes more confident, with Messaging System given 99.1% probability and New Tab Page dropping to 62.5% in second place.

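As a sketch of how such a ranked top-5 listing could be produced from a model's output logits (the logits and component names below are made up; I use a per-class sigmoid score here, since scores for different components need not sum to one):

```python
import numpy as np

# Hypothetical logits for one issue over 6 components, e.g. from model(**inputs).logits.
logits = np.array([5.2, 1.1, 0.3, 3.8, 2.0, 0.7])
components = ["Messaging System", "General", "Installer",
              "New Tab Page", "Tabbed Browser", "Shell Integration"]

# Per-class sigmoid turns each logit into an independent score in (0, 1);
# argsort in descending order then picks the top-5 components.
probs = 1 / (1 + np.exp(-logits))
top5 = np.argsort(probs)[::-1][:5]
for i in top5:
    print(f"{components[i]}: {probs[i]:.3f}")
```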
The summary field text for this issue report is simply “Change MR1 upgrade onboarding to Pin then Default then Theme screens”. Check the issue report for more details and the description. I think it’s quite impressive to predict the component from such a short summary text, although I guess some words in it might be quite specific.

While this was a very good result for a random issue I picked, it is maybe not as believable as picking one in a live demonstration. In a live demonstration, I could also ask people to write a real or imaginary bug report right there, run the classifier on it, give them a classification, and ask their opinion on its correctness. But, as usual, the last time I tried that it was a bit difficult to get anyone to volunteer, and I had to pick one myself. But at least you can do it live.

### Conclusions

Well, that’s it. I downloaded the bug report data, explored it, trained the classifiers, compared the results, and dug a bit deeper. I found a winner in Transformers, and built myself an increased understanding of, and insights into, the benefits of each model.

In the end, writing this article was an interesting exercise in trying out the Transformer architecture on a real-world dataset, and gaining insight in comparison to the LSTM architecture I used before. Would someone use something like this in the real world? I don’t know. I think such applications depend on the scale and the real needs. In very large companies with very large products and development departments, I can see it providing useful assistance. Or integrated as part of an issue tracker platform for added functionality. Think about the benefit of being able to advertise that your product uses the latest deep-learning models and AI to analyze your issues.. 🙂

Besides comparing to my previous work on building a similar bug classifier for the Qt company issue tracker, I also found another paper from around the same time when I wrote my previous study, on a similar issue tracker classification, called DeepTriage. I considered using it for comparison, but their target was to predict the developer to assign the issue to, so I left that out. It was still useful for providing some insight on where to find accessible issue trackers, and how they used the data for features.

However, when I went searching the internet for the word DeepTriage after that paper, I found another paper with the same name: DeepTriage for cloud incident triaging. Yes, in my previous articles on metamorphic testing of machine-learning based systems I noted how many DeepXXX articles there are. Apparently it is now so popular to name your research DeepXXX that names even get re-used. In any case, this was a paper from Microsoft, talking about the cost of and need for fast triaging in datacenters, imbalanced datasets, and all that. And how they have used this type of technique in Azure since 2017, with thousands of teams. So, as I said, this likely becomes useful once your scale and real-world requirements catch up.

Well, my goal was to play with Transformers and maybe provide an example of building a classifier from real-world NLP data. I think I did ok. Cheers.