
Algorithm Test Engineering: Exploratory Job Analysis

What does it mean to test an algorithm? Let’s play

Plot twist: the article has an RNA algorithm “example”. Photo by Braňo on Unsplash

Recently I had a discussion about what it means to test an algorithm, or what it means to do algorithm test engineering. I couldn’t quite come up with a convincing definition for myself. At a basic level, maybe you figure out some basic rules of the algorithm, give it some input, and check the output against expectations. But I believe there is more to it, including the algorithm’s fit for purpose, fit to data, evaluating alternatives, adapting the algorithm to the problem and system at hand, designing and running experiments, learning, and so on.

To get a better idea about all this, I explore the topic in this article. I analyze and test two algorithms, and see where it takes me. I start here with more classical algorithms, where the inputs, outputs, and their relations are clearly defined. In a follow-up article, I hope to look at a machine-learning-based one, where the definition of correct results is not so clear.

This story was originally published on Medium, where it can still be found.

Choice of Algorithm

Since my goal was to analyze the testing and analysis of algorithms, I needed to pick an algorithm to work with. Luckily, Wikipedia provides an extensive list of algorithms. I picked two classics, binary search and Levenshtein edit distance. I will warm up first with a basic test and analysis of binary search. This is followed with a bit broader look at Levenshtein.

In general, I find testing an algorithm without a broader context to miss some key points from the overall process. So I set up an application for the Levenshtein algorithm to give my testing and analysis exploration some context. This context is an application for comparing edit distances over RNA strings, more specifically motivated by the COVID-19 virus structure.

First, binary search. It is an algorithm designed for efficient search over a list of items, in logarithmic execution time. For my experiments, I use the IMDB names dataset, and a binary search Python implementation (adapted from a set of Stack Overflow answers).

Understanding the Algorithm

When testing an algorithm, I first need to understand what I am working with. To gain some intuition for this, I start with a look at linear search vs binary search. Linear search is a very basic search that simply loops the list in order to find a match, and provides a base reference here. To illustrate the difference, and what the binary search algorithm is about, a look at a set of 10 names from the IMDB dataset:

Example set of 10 names from the IMDB dataset. Image by author.

I picked the last 10 in the dataset, because they looked like a good random mix. Binary search requires its input to be sorted, so let’s do that. Sorted and re-indexed, these 10 become:

The IMDB names dataset slice, sorted by string order (first name). Image by author.

To build a better understanding of linear and binary search, I created illustrations on applying them to the above list to find “Lu Bevins”:

Linear search (left) vs Binary search (right) on the IMDB name slice. Image by author.

Linear search loops through the list from the beginning until it finds a match; the list can be in any order. Binary search repeatedly splits the (sorted) list in two and checks the middle value for a match, repeating this on smaller and smaller chunks until it finds a match or runs out of elements to check.
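The two strategies can be sketched in Python along these lines (a minimal sketch; the implementation I actually used was adapted from Stack Overflow answers and may differ in detail):

```python
def linear_search(items, target):
    # Scan from the start until a match; works on unsorted lists too.
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(items, target):
    # Repeatedly halve the sorted list around the middle element.
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

names = sorted(["Lu Bevins", "Harrison Ford", "Robert Ellis", "David Smith"])
print(binary_search(names, "Lu Bevins"))  # 2 (index in the sorted list)
```

Both return the index of the match (or -1 if not found); the difference is purely in how many comparisons they need to get there.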

Binary search is a well-known and simple algorithm, but the above illustrates how I would go about trying to understand an algorithm. I find that concrete examples and diagrams help me build an understanding. Access to a (domain) expert to help build and check these really helps, especially with lesser-known algorithms.

The above example list of 10 is very small, and does not really illustrate the benefit of binary search. With larger size, the comparison between linear and binary search should look like this (expecting logarithmic complexity):

Illustrating the theoretical logarithmic execution time vs linear. Image by author.

Such assumptions are good to check, so I will do that next.

Testing Computational Complexity

A common property of interest when testing and analyzing algorithms is computational complexity. This refers to the resources (processing time, memory, etc) the algorithm uses at different scales. Here I focus on processing time, although similar analysis could be performed for other resources.

To explore the scalability of binary search, I sampled the IMDB names dataset for 10 to 1000 names, in increments of 10, and executed both linear and binary search on random items in these sampled lists.

To gain more statistically confident results, I ran the experiments 100 times for each of these lists. I used my codeprofile library for execution time profiling. The following figures illustrate the measured average execution times over the 100 runs at sample sizes vs the theoretical, expected time:
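A minimal version of such a timing experiment might look like the sketch below (the names and sample sizes are illustrative stand-ins for the IMDB data; I used my codeprofile library rather than manual timing, and `bisect_left` here stands in for the binary search implementation):

```python
import random
import string
import time
from bisect import bisect_left

def binary_search(items, target):
    # Locate target in a sorted list via bisection.
    i = bisect_left(items, target)
    return i if i < len(items) and items[i] == target else -1

def linear_search(items, target):
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def time_search(search, items, targets):
    # Total wall-clock time for a batch of lookups.
    start = time.perf_counter()
    for t in targets:
        search(items, t)
    return time.perf_counter() - start

random.seed(1)
for size in (10, 100, 1000):
    names = sorted("".join(random.choices(string.ascii_lowercase, k=8))
                   for _ in range(size))
    targets = [random.choice(names) for _ in range(100)]  # 100 runs per size
    print(size,
          round(time_search(linear_search, names, targets), 5),
          round(time_search(binary_search, names, targets), 5))
```

Plotting the two timing columns against the list size is what produces the measured curves below.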

Measured performance (left) vs theoretical (right). Image by author.

The above shapes show the curves follow the expected growth: logarithmic execution time for binary search, linear for linear search. So the measurements match the theory (the scales differ between the figures due to parameters; I was just interested in the shapes).

1000 items is still a very short list for real performance measurements. Running the same experiment for 10 to 1 million samples (in 10x increments) shows the effect and benefit of binary search more clearly:

Binary search vs linear search. Left side = 10–1000 list size, right side = 10–1M list size. Image by author.

In the above figure, with list size 1 million, the difference is so big the binary search line looks flat compared to linear search. Based on the above simple experiments, I would say the assumptions hold for computational complexity.

Linear and binary search are well-known and studied algorithms, and their complexity is well established. As such, they demonstrate how we can evaluate complexity in practice. I have found that evaluating the practical computational complexity of actual implementations can produce surprises, especially with custom or third-party algorithms applied to new use cases.

Testing Input-Output Relations

Knowing the computational complexity is good, but we also need assurance that the implementation works as intended. For binary search, basic tests for finding selected names, boundary values, category partitioning, etc. are good examples of this. I call these base assurance tests.

To scale this up, I used the test generator from the above 10 to 1 million computational complexity analysis, and added the following check to it:

  • for every item, linear and binary search find the same index.

I call this an expected/assumed invariant over the data/algorithm. I expected the two algorithms to give the same results for the same input, and the assertion would simply be there for extra assurance. However, the assertion failed: the indices given by linear and binary search were not always the same, even for the same input.
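The failure is easy to reproduce on a small hypothetical list (a sketch; the real experiment used the IMDB names): when the list contains duplicates, both searches return *a* correct match, just not necessarily the same one.

```python
def linear_search(items, target):
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(items, target):
    # The midpoint probe may land on any one of several equal items.
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

names = ["Ann", "Bob", "Bob", "Bob", "Eve"]  # sorted, with duplicates
li = linear_search(names, "Bob")   # always the first match: index 1
bi = binary_search(names, "Bob")   # the midpoint hit: index 2 here
print(li, bi)  # 1 2 -- both point at a "Bob", but not the same index
```

Both answers are valid matches, so the invariant "same index" was too strict for data with duplicate keys.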

After some thought on what could cause this difference, and how the algorithms work, I figured there might be duplicate names in the dataset. Here is a quick check to see if this is the case (concatenated to two rows):

Listing items that have duplicate names in the dataset (slice). Image by author.

Above figure shows “Robert Ellis” and “Harrison Ford” as having duplicates. To check a bit deeper, a look at the name “Harrison Ford”:

Check whether Harrison Ford exists in multiple instances as primary name. Image by author.

There are four people listed with the same name “Harrison Ford” in this dataset. Having established there are duplicate names, some summary metrics would be of interest to see how many there are:

Counting the number of duplicated primary names in the dataset. Image by author.

The above figures show that a total of 853911 distinct names (the row count) have duplicates. The name with the most duplicates is “David Smith”, repeated 340 times. What does this mean for the mismatching search results? Due to how the two algorithms work, searching for “David Smith” would likely result in both linear and binary search returning a “David Smith”, but a different one. Linear search always returns the first one in the list, while binary search can return any one of the duplicates.
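The duplicate counting itself can be done with a simple frequency count (a sketch on toy data standing in for the full IMDB dataset, where I used pandas):

```python
from collections import Counter

names = ["Harrison Ford", "David Smith", "Harrison Ford", "Lu Bevins",
         "David Smith", "David Smith", "Robert Ellis", "Robert Ellis"]

counts = Counter(names)
duplicates = {name: n for name, n in counts.items() if n > 1}

print(len(duplicates))        # distinct names that have duplicates: 3
print(counts.most_common(1))  # the most repeated name: [('David Smith', 3)]
```

On the real dataset the same two numbers are the 853911 duplicated names and the 340 occurrences of “David Smith”.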

To me, this illustrates how testing and analysis help you better understand both the algorithm and the data it is applied to. And how it is good to assert and test your assumptions about the data and the algorithm. Having a clear goal, and performing this process systematically, should help.

Besides invariants over the algorithm output, one can also consider them over input. For example, binary search expects its input to be sorted. The choice is whether to limit the scope of testing to expect the environment (the overall system) to enforce this, or to expect the algorithm implementation to handle it. I would refer to this as defining the algorithm (test) scope.
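One way to frame that scope decision in code (a hypothetical sketch; `binary_search_checked` is my illustrative name, not from any library) is to let the implementation guard its own input precondition:

```python
from bisect import bisect_left

def is_sorted(items):
    return all(a <= b for a, b in zip(items, items[1:]))

def binary_search_checked(items, target):
    # Scope choice: the implementation guards its own precondition here,
    # instead of trusting the surrounding system to pass sorted input.
    if not is_sorted(items):
        raise ValueError("binary search requires sorted input")
    i = bisect_left(items, target)
    return i if i < len(items) and items[i] == target else -1

print(binary_search_checked(["Ann", "Bob", "Eve"], "Bob"))  # 1
```

The alternative scope is to drop the check, document the precondition, and test sortedness at the system boundary instead; the check costs O(n) per call, which defeats the O(log n) search if left in a hot path.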

As a second example, and to gain some broader insight into the topic, I look at the Levenshtein edit distance algorithm here. Edit distance is also sometimes referred to as approximate string matching. It refers to the number of edits (character changes) required on a string to convert it to a specific (target) string. Known applications include providing spelling suggestions, approximate string search queries, and DNA/RNA sequence comparison.

Compared to the single search operation of binary search, the classic Levenshtein algorithm supports three operations: insert, delete, and substitute (a character in a string). The edit distance it gives is the minimum number of these operations required. There is also a variant called Damerau-Levenshtein adding a transposition operation, but to keep this example simpler, I will use the classic Levenshtein here.

Some examples:

  • test -> tent: one substitution of s->n = Levenshtein score 1
  • Levenstein -> Levenshtein: one insertion of h = score 1
  • Levenshtein -> Levenstein: one deletion of h = score 1
  • bad -> base: one substitution of d -> s + one insertion of e = score 2
  • bad -> based: two insertions of s + e = score 2
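These examples can be checked against a textbook dynamic-programming implementation (a minimal sketch; the experiments below use the weighted-levenshtein library instead):

```python
def levenshtein(a, b):
    # Classic DP edit distance: prev/cur hold rows of the distance table,
    # where entry j is the cost of turning a[:i] into b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("test", "tent"))               # 1
print(levenshtein("Levenstein", "Levenshtein"))  # 1
print(levenshtein("bad", "based"))               # 2
```

Each printed score matches the hand-worked examples above.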

Defining the Example Application

As I noted before, I will look at Levenshtein from the viewpoint of an example application, to give it some context that my binary search example lacked. As Levenshtein has been applied in RNA analysis, I use RNA search as the application. Specifically, the COVID-19 virus, which is RNA based. There is plenty of information available on it, and it gives me a chance to play with longer RNA strings on a timely topic. As with binary search, I start with research to understand the application domain as well as the algorithm applied.

The COVID-19 RNA sequence is described as having a length of 29881 “characters”, consisting of a four letter alphabet representing its chemical base. One part of the COVID-19 structure that is specifically considered interesting is the spike protein. This interest comes from the role of the spike protein in infecting the human cells.

The spike protein is described as a 3831-“character”-long sub-sequence of the overall virus RNA. Wikipedia describes the Omicron variant as having 60 mutations from the original Wuhan variant, 32 of which are in the spike protein.

So in this example I look to apply Levenshtein on sequences of length 3831 (possibly up to 29881). Following the Omicron mutations, I will use up to 60 edits as the maximum in my testing. As with binary search, I am defining the algorithm (test) parameters based on domain knowledge and expected use.

DISCLAIMER: I have not done in-depth research on RNA search, I am sure it is far more advanced, but this works as a basic example for this article.

Experiment Setup

I used the Python weighted-levenshtein library to run my experiments. My base assurance tests include handling empty strings, strings of different lengths, invalid characters, a set of selected string mutations, and so on.

To scale up the testing, I again built a simple input generator. It generated random strings of a given length, applied the selected count of the algorithm’s supported operations (insert, delete, substitute) on each string, and calculated the distance score for the modified vs the original (non-modified) string, while at the same time measuring execution time and checking a set of output invariants:

  • The score given by the algorithm should always match the number of operations applied.
  • Flipping the compared strings, e.g., car->cars to cars->car, should always give the same score, since the operations are effectively reversible.

My test executor checked these invariants for every generated (test) execution of the algorithm.
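The generator-plus-invariants loop might be sketched like this (a hypothetical sketch, with a plain DP distance standing in for the weighted-levenshtein library; the first invariant is shown already relaxed to the bound that actually holds):

```python
import random

def levenshtein(a, b):
    # Plain DP edit distance, standing in for weighted-levenshtein here.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def substitute(s, count, alphabet="ABCD"):
    # Substitute `count` characters at distinct random indices.
    chars = list(s)
    for i in random.sample(range(len(chars)), count):
        chars[i] = random.choice([c for c in alphabet if c != chars[i]])
    return "".join(chars)

random.seed(0)
for _ in range(100):
    source = "".join(random.choices("ABCD", k=8))
    ops = random.randint(0, 3)
    target = substitute(source, ops)
    score = levenshtein(source, target)
    # Invariant 2: flipping the strings gives the same score.
    assert levenshtein(target, source) == score
    # Invariant 1, relaxed: the applied edits bound the minimal distance
    # from above, but the minimum can be lower (the surprise found below).
    assert score <= ops
```

The strict form of invariant 1 (`score == ops`) is what I originally checked, and what the next section shows failing.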

To support the base assurance tests, I first used the input generator to build a large number of relatively short strings (length 5–10), with a varying number (0–9) of operations applied. This gave me increased confidence, higher position coverage, and results that were easier to debug (thanks to the limited length) for score calculations, string boundaries, grouped operations, etc. It also showed me that my assumption behind the first of the two invariants above was wrong.

What I found is that, given enough iterations, random chance will find cases where a target string built from multiple substitution operations can be reached with a smaller combination of insert, delete, and substitute operations. Here is an example case:

Applying 3 substitutions on a source to produce a target, expecting Levenshtein score of 3. Image by author.

The above shows the source string, and the target string after making 3 substitutions at randomly selected indices, with randomly selected characters. Remember, my target was COVID-19 RNA search, with a 4-character alphabet (ABCD). The following substitutions have been made: index 3: A->B, index 4: C->A, index 5: D->C. Expecting the Levenshtein score to match the number of edit operations (3 substitutions here), this should give a score of 3.

However, instead of giving a score of 3, the above gives a score of 2. After some analysis, I realized the same target can be reached with one insert and one delete, giving the minimal score of 2:

Achieving the same target as above with only 2 operations (insert + delete). Image by author.
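The failing case is easy to reproduce concretely (a sketch using a plain dynamic-programming implementation rather than the weighted-levenshtein library): three substitutions turn the core “ACD” into “BAC”, yet the minimal edit script is one insert plus one delete.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

source, target = "ACD", "BAC"
# A->B, C->A, D->C: three substitutions produce the target, but it is
# also reachable by inserting "B" at the front and deleting the trailing
# "D" -- only two edits, so the *minimal* distance is 2.
print(levenshtein(source, target))  # 2, not 3
```

Since Levenshtein is defined as the minimum over all edit scripts, the applied operation count is only an upper bound on the score.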

With this knowledge, I had to disable the failing invariant. Instead, I opted to collect statistics on the scores achieved with different numbers of operations applied. I could then check that the overall score distribution was not too far off from the number of operations applied, even if not exactly the same. The second invariant (flipping strings) held for all generated tests.

I find this an example of what one might call exploratory algorithm testing: take any assumptions on how the algorithm works, encode them as invariants (or whatever works), and generate further tests to see if they hold, learning about the algorithm and its domain along the way.

After successfully running these smaller tests, and feeling confident I had sufficiently covered my assumptions, I increased the generator string sizes higher for testing computational complexity.

Computational Complexity: Setup

For this evaluation, I used the input generator to produce strings of length 100 up to 4850, in intervals of 250 (100, 350, 600, 850, …, 4850). The target string length I set earlier was 3831 characters, which gets covered here with some margin. For statistical confidence, I repeated each test size (100 to 4850) 100 times.

For each generated string, I applied the following operations 1, 3, 9, and 60 times (60 being the omicron limit defined above):

  • Substitute characters in the string at random (non-overlapping) locations
  • Insert random characters at a random location in the test string
  • Delete random characters in the test string
  • Combinations of above, all at the same time

My goal was to see if the different algorithm operations, their number, or the string length would affect the computation time. Or, more generally, to test the algorithm’s operations and parameters, to see how they affect its execution time.

Computational Complexity: Measurements

The results for the above described experiments were very similar in terms of performance. The following figure illustrates the substitution tests, and their overall execution time:

Levenshtein execution time vs string size. Image by author.

Substitute_x refers to substituting x characters in the source string and calculating the Levenshtein score. The x-axis in the above is the string size, from 100 to 4850. The y-axis is the time it took to run the 100 experiments.

The four lines in the above graph practically overlap, as the execution time was so close in each case. Since this seemed a bit suspicious, I ran multiple experiments separately with varying parameters to see if this holds true, which it did. A bit strange, but OK. Having a domain expert and a team to bounce thoughts off would have been really nice here.

I omitted showing all the rest of the operations, their size variants, and combinations here. They were all very close, indicating that the execution time had little to no dependence on the type or number of operations.

The above curve clearly grows faster than linear. The classic dynamic-programming Levenshtein algorithm has O(n·m) time complexity, which is quadratic when both strings grow together, as they do here. To check this, I experimented with a few parameters to visualize a matching superlinear curve. Here is the final measured execution time plot vs a theoretical plot:

Execution time (left) vs theoretical superlinear growth plot (right). Image by author.

From this, I would say the implementation shows the expected quadratic complexity. Wikipedia actually notes that the edit distance cannot be computed in strongly subquadratic time (under standard complexity assumptions), so this cost is inherent. My verdict is that these results match the theoretical expectation: this algorithm scales poorly for longer inputs.

Testing an Adaptation to the Algorithm

Sometimes we test and analyze an algorithm, and realize it is not a great fit for the need. But we may wish to try an alternative, or an adaptation to this algorithm to address the issues. Let’s look at one example here.

Quadratic execution time growth is generally considered bad for scalability. For shorter strings (e.g., spell checking a word or a shell command) this is probably not an issue, as the cost is negligible for short inputs. However, in my RNA story I wanted to search and analyze sequences of length 3831+. For this, investigating possible speedups seems useful.

Let’s say we decide to try to make this faster by running the algorithm on smaller pieces of the input. This should make the execution time grow linearly instead of quadratically, since each fixed-size slice costs a constant amount. In the end, we just need to rank the search results relative to each other, not necessarily compute exact scores; once the top results are known, one could calculate exact scores for that subset. Since this is a hypothetical example, I call this a hypothetical development idea :).
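The slicing adaptation can be sketched as follows (a hypothetical sketch with a plain DP distance standing in for the weighted-levenshtein library; `sliced_distance` is my illustrative name). It also demonstrates, on a small scale, both the result of this section and the problem found later: local substitutions score the same sliced or not, while a single insert shifts every following slice and inflates the summed score.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sliced_distance(a, b, slice_size=100):
    # Score each aligned slice separately and sum the scores. Each slice
    # costs at most O(slice_size^2), so total time is linear in length.
    return sum(levenshtein(a[i:i + slice_size], b[i:i + slice_size])
               for i in range(0, max(len(a), len(b)), slice_size))

src = "ABCD" * 50                  # a 200-character "RNA" string
sub = src[:10] + "A" + src[11:]    # one local substitution at index 10
ins = "A" + src                    # one insert at the front shifts all slices

print(levenshtein(src, sub), sliced_distance(src, sub))  # 1 1: scores agree
print(levenshtein(src, ins), sliced_distance(src, ins))  # 1 5: shift inflates sum
```

The timing benefit and the scoring distortion shown by the last line are exactly what the measurements below confirm at full scale.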

Here is an execution time graph for splitting the generated input strings to slices of length 100 characters, and summing up their scores:

Execution time after splitting to slices of 100. Image by author.

From the execution time perspective, the above graph looks much better: the quadratic growth is gone, turned linear. But how about the results? How large is the difference in score between calculating it for the full string vs summing up the scores for slices of 100? The following tables illustrate this:

Substitution operator, slices of 100 vs full string. Image by author.

In the above table, s_x refers to substituting x characters in the string. Size is the total string length. The /full postfix refers to running the Levenshtein algorithm on the full string; the /100 version uses slices of 100 characters. Each configuration was repeated 100 times for statistical coverage, which is why, for example, s_1/full has a score of 100 (100 runs, each scoring 1).

For this substitution example, the sliced version’s scores are very close to the full version’s, perhaps because a substitution is a local change and rarely affects multiple slices. There are a few rows in the table where the summed score differs by 1–2 edits (6000 vs 5998 and 5999). This is because of the issue I noted above, where insert and delete can work together to find a smaller minimum. If the substitution operation were all we needed, this would seem a plausible adaptation. But the goal was to support all operations.

In the tables below, the insert operation uses a prefix of i, and delete a prefix of d. So i_x refers to inserting x characters, and d_x to deleting x characters:

Insert operator, slices of 100 vs full string. Image by author.

The above insert table shows how the i_x/full score always matches the number of operations (×100 runs), as expected. With the i_x/100 slices, the score grows larger as the string length increases, and as the number of operations applied increases (from 1 to 3, 9, and 60). After some analysis, I concluded this is because inserting a character near the beginning of the string shifts all the following slices to the right, causing each slice to require multiple edits and inflating the summed score.

The following figures illustrate this issue:

Above string of length 10, Inserting B, Levenshtein score calculated as full string. Image by author.
Same as above string, Levenshtein score calculated in slices of length 3. Image by author.

The above example uses a slice size of 3 instead of 100, but the concept is the same. Depending on the location of the insert, the shift propagates towards the end and increases the sum by a large amount: for each slice of 3 to match, it has to remove its first character and add a new last character. The same applies to the delete operation:

Delete operator, slices of 100 vs full string. Image by author.

The above delete table shows very similar behaviour to the insert table, and for the same reason: a delete shifts all following slices left, where an insert shifted them right.

So, while the slicing approach succeeds in cutting down the algorithm’s processing time, it would definitely not work for properly ranking the search results under inserts and deletes. From the application perspective, I would consider this adaptation a failed experiment, except for the learning.

However, for the purposes of this article, I find this a good experiment. It shows how one might run tests on an algorithm; analyze its results, fit for purpose, and other properties; build a hypothesis; implement experiments; evaluate them; and iterate. Optimally, this is how I would see algorithm testing and analysis contributing to development more broadly: helping to design and evaluate experiments for improved algorithms and adaptations.

Concluding the Levenshtein Search Experiment

In a more realistic scenario, I hope I would be working with a team of experts, and have resources to research the available approaches, the state of the art, and everything else on the subject. In fact, when the algorithm and its application domain are complex, and the opportunity is there, I would consider this an essential part of the testing and analysis process: working with domain experts and an R&D team, and researching the domain knowledge. Here I limit the scope, as my resources are limited.


My goal with this article was to explore the idea of what algorithm testing (and analysis) could be. A summarizing list makes it simpler to remember:

  • Traditional testing techniques can define base assurance tests with expert defined inputs and expected outputs
  • Building a good, in-depth understanding of the algorithm helps understand how to test it, and how to adapt it to a domain
  • This includes building a similarly good understanding of the data it is applied to, and how this relates to the algorithm
  • Ideally, above works iteratively in interaction with research, development, and testing
  • Besides verification, algorithm testing can contribute to understanding its limits, potential optimizations, and compare alternatives
  • Identifying assumptions about the data and algorithm input describes what the algorithm is expected to work with
  • Exploratory data analysis can use these assumptions as input, check if they hold, and refine them
  • Identifying assumptions about the data and the algorithm output gives a basis to write invariants to check in every test
  • An automated test generator helps scale testing with these invariants, and check whether the assumptions hold
  • Scope of testing relates to defining the scope of the algorithm vs the overall system using it, responsibilities for input and output handling
  • Knowing the theoretical computational complexity is good, but it is also worth verifying in practice that it holds for the actual implementation and the available data
  • Algorithm testing can form a tool for exploring the algorithm, by formulating hypotheses about it and using testing to evaluate them
  • Tools and techniques, such as metamorphic testing can help evaluate robustness of the algorithm to different types of valid and invalid inputs
  • Testing and analysis is optimally an iterative process, where the end result is the final generated tests and checks, and all the learning on the way


I started this article with the idea to explore what it means to test an algorithm, or what it could mean to do “algorithm test engineering”. I like to think I made some progress, although I am sure the definition can be subjective, much like the “goodness” of a (machine learning) algorithm result.

Both my examples in this article, binary search and Levenshtein edit distance, are quite simple and basic algorithms in the end. As noted in this article, the basic testing of such algorithms is not too complicated. However, considering algorithm testing and analysis as part of a broader research and development process, I believe the interactions, collaborations, research and development contributions can make it more diverse and interesting.

In this article, I looked at two “classic” algorithms, where the input, output, and their relations are quite straightforward to define. In a follow-up article, I will look at machine-learning-based algorithms, as an example where the input-output relations and the correctness of the output are harder to define: subjective, or “undefinable” in the traditional sense. Until then.

That’s all for now. Cheers.

Metamorphic Testing of Machine-Learning Based Systems

Techniques for Testing Autonomous Cars and other ML-Based Systems

Testing machine learning (ML)-based systems requires different approaches compared to traditional software. With traditional software, the specification and its relation to the implementation are typically quite explicit: “When the user types a valid username and matching password, they are successfully logged in”. Very simple to understand, deterministic, and easy to write a test case for.
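That kind of specification maps directly onto a deterministic test (a sketch with a hypothetical `login` stub standing in for the system under test):

```python
# Hypothetical credential store and login function, for illustration only.
USERS = {"alice": "s3cret"}

def login(username, password):
    # Succeeds only for a known user with the matching password.
    return USERS.get(username) == password

# Spec-based tests: valid credentials succeed, anything else fails.
assert login("alice", "s3cret") is True
assert login("alice", "wrong") is False
assert login("bob", "s3cret") is False
print("login tests passed")
```

One input, one expected outcome, a crisp pass/fail verdict. The contrast with ML-based systems is the subject of the rest of this article.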

ML-based systems are quite different. Instead of clearly defined inputs and logical flows based on explicit programming statements, an ML-based system works over a potentially huge input space, with probabilistic outcomes produced by largely black-box components (models). In this article, I take a look at metamorphic testing, a technique that has become increasingly popular for addressing some of these ML-based system testing challenges. I will go through some of the latest research and present examples from different application domains.

These two images can be used to represent some metamorphic tests for a ML-system. Read on to see how.. 🙂 Image by author.

Metamorphic Testing

Metamorphic Testing (MMT) was originally proposed quite a while back, at least as early as (Chen1998). Having worked a long time in software testing research, I always viewed MMT as a curiosity with few real use cases. With ML-based systems, however, it seems to have found its niche nicely.

The general idea of MMT is to describe the system’s functionality in terms of generic relations that should hold between inputs and outputs under generic transformations of those inputs, rather than as mappings of specific inputs to specific outputs.

One typical example used for metamorphic testing in the past has been from testing search engines (e.g., Zhou2016). As search engines are these days practically natural language processing (NLP)/ML-based systems, they also fit the topic of this article well. To illustrate the concept, I ran two queries on Google (in October 2020):

A Google query for “car” and its refined version “autonomous car”. Image by author.

The first query is just one word, “car”. The second query adds another word to it, “autonomous”, so the query becomes “autonomous car”. This addition of a restrictive search keyword is an example of an input transformation (or “morphing”, in the spirit of metamorphic testing):

Example input transformation: changing the original input (query) to be more restricted. Image by author.

And to perform a check on the test results (a test oracle), we define a matching relation that should hold when the input transformation is applied:

Example output (metamorphic) relation: More restricted query returns fewer results. Image by author.

In this case, adding the restrictive search term (“autonomous”) to the previous query (“car”) restricts the result set to a smaller one: from 8.3 billion results down to 159 million. The metamorphic test would not specify these exact values, but rather the relation “restricting the query leads to fewer search results”. One could then generate several (seed) inputs (search queries) and associated restrictive keywords for transformations, run the queries, and check that the metamorphic relation holds (a restricted query produces fewer results). For more details on MMT with search engines, see (Zhou2016).

The above is an example of what metamorphic testing refers to: you transform (morph) your inputs in some way, while defining a relation that should hold between the previous input (and its output) and the morphed input (and its output). The key concepts/terms are:

  • morph/transform: modify a seed input in a way that your defined metamorphic relations should hold
  • metamorphic relation: the defined transformation of the input should have a known/measurable effect on the output. Checking that this relation holds after the transformation is the test oracle of metamorphic testing. (Test oracle is a general term for a mechanism to give a verdict on test result)
  • seed inputs: the inputs that are used as initial inputs for the tests, to be transformed. If you know the output of the seed input, you may use it to define a stricter relation (output should be correct). But even without the seed output, you can still define a relation check, but it might be a bit more relaxed (output should be similar but you don’t know if it is correct).

More generally, metamorphic testing refers to defining such transformations and observing their impact (via metamorphic relations) on the results. The effectiveness and applicability then depend on how well and extensively these can be defined. I will present more concrete examples in the following sections.
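Putting the three concepts together, a metamorphic test can be sketched as below (a hypothetical sketch: `result_count` is a toy stand-in for a real search engine API, and the relation is the query-restriction one from the Google example):

```python
# Hypothetical document store standing in for a search engine index.
DOCS = ["autonomous car", "car repair", "car dealer", "autonomous drone"]

def result_count(query):
    # Toy search backend: number of documents containing every query term.
    terms = query.split()
    return sum(all(t in doc for t in terms) for doc in DOCS)

def metamorphic_restriction_test(seed_query, extra_term):
    # Transformation: append a restrictive term to the seed query.
    # Metamorphic relation (the test oracle): the result count must not grow.
    before = result_count(seed_query)
    after = result_count(seed_query + " " + extra_term)
    assert after <= before, (seed_query, extra_term, before, after)

metamorphic_restriction_test("car", "autonomous")
print(result_count("car"), result_count("car autonomous"))  # 3 1
```

Note that the oracle never needs to know the *correct* result count for any query; it only checks the relation between the seed run and the morphed run, which is exactly what makes the approach attractive when exact expected outputs are unavailable.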

Problem Space

Why would you want to use metamorphic testing? I will illustrate this with an example from autonomous cars. Autonomous cars are currently under heavy development, getting a lot of funding, have safety-critical requirements, and are highly dependent on machine learning. Which is maybe also why they have received so much attention in MMT research. They make great examples.

For example, the Tesla Autopilot collects data (or did when I was writing this..) from several front-, rear-, and side-cameras, a radar, and 12 ultrasonic sensors. At each moment in time, it must be able to process all this data, along with previous measurements, and come up with decisions fulfilling the highest safety standards. Such real-world input spaces are incredibly large. Consider the two pictures I took just a few days apart, near my previous office:

Same place, few days apart. Image by author.

Just in these two pictures there are many visible variations: snow/no snow, shadows/no shadows, road markers/no markers, connecting roads visible, parking lots visible, other cars, and so on. Yet the car would be expected to navigate safely in all such conditions. To illustrate the problem a bit more, here are some example variants in this domain that quickly come to mind:

A few examples of items a ML-algorithm in an autonomous vehicle needs to consider. Image by author.

Besides these, one can easily expand this to different locations, road shapes, object types, bridges, trains, … Other sensors have other considerations, every location is different, and so on.

In other domains of ML-based systems, one would need to identify similar problem scenarios, and their relevant combinations, to be able to test them. Manually building test sets to cover all this is (for me) an unrealistic effort.

Metamorphic Testing with Autonomous Cars

Metamorphic testing can help in better covering domains such as the above autonomous cars problem space. As the interest is high, many approaches for this have also been presented, and I will describe a few of those here.

Covering Image Variations

The DeepTest work in (Tian2018) uses transformations on real images captured from driving cars to produce new images. In this case, the metamorphic attributes are:

  • Seed inputs: Real images from car cameras.
  • Metamorphic transformations: moving, tilting, blurring, scaling, zooming, adding fog, adding rain, etc., applied to the original images
  • Metamorphic relation: the autonomous driving decisions should show minimal divergence on the same input images after the transformations.

The following illustrates this with some simple examples using the road image from outside my previous office. In the following, I “transformed” the image by simply rotating the camera a bit at the location. I then added the arrows to illustrate how a system should “predict” a path that should be taken. The arrow here is manually added, and intended to be only illustrative:

Viewed from a different angle, the real-world path should still be the same (or close). Image by author.

And the same, but with the snowy ground (two transformations in the following compared to the above; added snow + rotation):

With and without snow, the path should be the same (or close). Image by author.

Of course, no-one would expect to manually create any large number of such images (or transformations). Instead, automated transformation tools can be used. For example, there are several libraries for image augmentation, originally created to help increase training dataset sizes in machine learning. The following illustrates a few such augmentations run on the original non-snow image from above:

Automatically generated augmentations / transformations of the earlier road image. Image by author.

All these augmented / transformed images were generated from the same original source image shown before, using the Python imgaug image augmentation library. Some could maybe be improved with more advanced augmentation methods, but most are already quite useful.

Once those transformations are generated, the metamorphic relations on the generated images can be checked. For example, the system should propose a very similar driving path, with minimal differences across all transformations on acceleration, steering, etc. Or more complex checks if such can be defined, such as defining a known reference path (if such exists).
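This relation check can be sketched as follows. This is not the actual DeepTest implementation; `predict_steering` and `rotate_slightly` below are trivial stand-ins, and a real test would plug in the actual model and image augmentations instead.

```python
# Sketch of checking a metamorphic relation over image transformations.
# Images are plain numpy arrays; the predictor is a made-up stand-in.
import numpy as np

def predict_steering(image):
    # Stand-in for the real model: derive an "angle" from image brightness.
    return float(image.mean()) % 10

def rotate_slightly(image):
    # Stand-in transformation; a real test would use e.g. an augmentation
    # library to add fog, rain, tilt, etc.
    return np.roll(image, shift=3, axis=1)

def check_steering_relation(image, transform, max_divergence=2.0):
    original = predict_steering(image)
    transformed = predict_steering(transform(image))
    # Metamorphic relation: steering should diverge only minimally.
    return abs(original - transformed) <= max_divergence

seed_image = np.ones((64, 64, 3)) * 100.0
print(check_steering_relation(seed_image, rotate_slightly))  # True
```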

Again, this process of transformation and checking metamorphic relations is what MMT is about. It helps achieve higher coverage and thus confidence by automating some of the testing process for complex systems, where scaling to the large input spaces is otherwise difficult.

GAN-based Transformations with MMT

A more advanced approach to generate input transformations is to apply different ML-based techniques to build the transformations themselves. In image augmentation, one such method is Generative Adversarial Networks (GANs). An application of GANs to autonomous cars is presented in (Zhang2018). In their work, GANs are trained to transform images with different weather conditions. For example, taking a sunny image of a road, and transforming this into a rainy or foggy image.

The argument is that GAN generated weather effects and manipulations are more realistic than more traditional synthetic transformations. (Zhang2018) uses the NVidia UNIT (Liu2017) toolkit to train and apply the GAN models, using input such as YouTube videos for training.

Images illustrating the GAN results are available on the UNIT website, as well as in higher resolution in their Google Photos album. I recommend having a look, it is quite interesting. The smaller images on the UNIT website look very convincing, but looking more closely in the bigger images in the photo albums reveals some limitations. However, the results are quite impressive, and this was a few years ago. I expect the techniques to improve further over time. In general, using machine learning to produce transformations appears to be a very promising area in MMT.

LIDAR Transformation

Besides cameras, there are many other sensors a system can use. In autonomous cars, one such sensor is LIDAR, measuring distances to objects using laser-based sensors. A study applying metamorphic testing to LIDAR data in the Baidu Apollo autonomous car system is presented in (Zhou2019).

The system first identifies a region of interest (ROI), the “drivable” area. It then identifies and tracks objects in this area. The system consists of multiple components:

  • Object segmentation and bounds identification: Find and identify obstacles in ROI
  • Object tracking: Tracking the obstacles (movement)
  • Sequential type fusion: smoothing the obstacle types over time (making classifications more consistent by also using time-related data)

The (Zhou2019) study focuses on metamorphic testing of the object identification component, specifically on the robustness of classification vs. misclassification under minor variations of the LIDAR point cloud. The LIDAR point cloud in this case is simply a collection of measurement points the LIDAR system reports seeing. These clouds can be very detailed, and the number of measured points very large (Zhou2019).

The following figure illustrates this scenario (see (Zhou2019) for realistic LIDAR images from actual cars; I use my own drawings here to illustrate the general idea). I marked the ROI in a darker color, and added some dots in a circular fashion to illustrate the LIDAR scan. The green box illustrates a bigger obstacle (e.g., a car), and the smaller red box illustrates a smaller obstacle (e.g., a pedestrian):

My doodle of a LIDAR image :). Image by author, using a car image by OpenClipArt from Pixabay.

The metamorphic relations and transformations in this case are:

  • Metamorphic relation: same obstacles (objects) should be identified both before and after adding small amounts of noise to the LIDAR point cloud.
  • Transformation: add noise (points to the LIDAR point cloud)
  • Seed inputs: actual LIDAR measurements from cars

The following figure illustrates this type of metamorphic transformation, with the added points marked in red. I simply added them in a random location, outside the ROI in this case, as this was the example also in (Zhou2019):

My doodle of a transformed LIDAR image with added points in red. Image by author, using a car image by OpenClipArt from Pixabay.

The above is a very simple transformation and metamorphic relation to check, but I find often the simple ones work the best.

In summary, the MMT approach here takes existing LIDAR data and adds some noise to it, in the form of added LIDAR data points. In relation to the real world, such noise is described in (Zhou2019) as potentially insects, dust, or sensor noise. The amount of added noise is also kept to a very small percentage of the overall points, to make it more realistic.
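A minimal sketch of this idea follows. The classifier and the point cloud here are made up for illustration; the real system and data are of course far more complex.

```python
# Sketch of the LIDAR noise transformation: add a small percentage of
# random points to a point cloud and check the classification is unchanged.
import random

def classify(points):
    # Stand-in classifier: report "car" if the cloud has a dense cluster
    # near the origin (the ROI in this toy example).
    near = [p for p in points if abs(p[0]) < 5 and abs(p[1]) < 5]
    return "car" if len(near) >= 10 else "unknown"

def add_noise(points, fraction=0.01, seed=42):
    # Transformation: add a small fraction of points outside the ROI.
    rng = random.Random(seed)
    n_new = max(1, int(len(points) * fraction))
    noise = [(rng.uniform(50, 100), rng.uniform(50, 100), 0.0)
             for _ in range(n_new)]
    return points + noise

rng = random.Random(1)
seed_cloud = [(rng.uniform(-4, 4), rng.uniform(-4, 4), 0.0)
              for _ in range(200)]

before = classify(seed_cloud)
after = classify(add_noise(seed_cloud))
# Metamorphic relation: the classification should not change.
print(before, after)  # car car
```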

The metamorphic experiments in (Zhou2019) show how adding a small number of points outside the ROI area in the point cloud was enough to cause the classifier (metamorphic relation check) to fail.

As a result, (Zhou2019) report discussing their findings with the Baidu Apollo team, getting acknowledgement for the issues, and that the Baidu team incorporated some of the test data into their training dataset. This can be a useful approach, since metamorphic testing can be seen as a way to generate new data that could also be used for training. However, I think one should not simply discard the tests in either case, even when re-using some of the data for further ML-model training. More on this later.

Metamorphic Testing of Machine Translation

Not everyone works on autonomous cars, so examples from other domains are important for broader insight. Outside autonomous vehicles, testing of automated language translations with ML-based NLP techniques has received some attention in recent years (for example, He2020, Sun2020). I found the (He2020) paper to be especially clear and sensible, so I use their approach as an example of the metamorphic properties for translation testing here:

  • Seed inputs: Sentences to be translated
  • Transformation: replace words with a specific part of speech (POS) tag in the input sentence with another word that has the same POS tag. For example, a verb with another verb. Another NLP model (Google’s BERT in this case) is used to “predict” a suitable replacement candidate word.
  • Metamorphic relation: the structure of the transformed output should match the original translation output sentence structure for the original input. Large deviations indicate potential errors. The test oracle metric is the difference in output sentence structures for the automated translation on the original input vs. the transformed input.
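This relation can be illustrated with a toy sketch, where both the translator and the POS tagger are hard-coded stand-ins of my own making; a real test would call an actual translation service and an actual tagger.

```python
# Sketch of the translation relation check. The oracle compares the coarse
# sentence structure (here just the POS tag sequence) of the two outputs.

def translate(sentence):
    # Stand-in for a real translation service.
    fake_translations = {
        "the car stops quickly": "auto pysähtyy nopeasti",
        "the car turns quickly": "auto kääntyy nopeasti",
    }
    return fake_translations[sentence]

def pos_tags(sentence):
    # Stand-in tagger: a real test would use an NLP library for tagging.
    fake_tags = {"auto": "NOUN", "pysähtyy": "VERB",
                 "kääntyy": "VERB", "nopeasti": "ADV"}
    return [fake_tags[word] for word in sentence.split()]

seed = "the car stops quickly"
morphed = "the car turns quickly"   # "stops" replaced by same-POS "turns"

# Metamorphic relation: the output sentence structures should match.
print(pos_tags(translate(seed)) == pos_tags(translate(morphed)))  # True
```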

Here is an illustrative example using Google Translate, and a sentence I picked (at the time) from this article. Translating that sentence from English to Finnish:

Translate a sentence from English to Finnish. Image by author.
Mask a word (noun) for a metamorphic transformation. Image by author.
Use BERT to predict a word to replace the masked word (one that “fits” in the mask). Check the metamorphic relation holds. Here it is fine. You just have to trust me on that, I am Finnish :). Image by author.

The above shows the metamorphic transformation and how the check for the defined metamorphic relation should hold. In this case the sentence structure holds fine (in my opinion as a native Finnish speaker) and the result is good. I performed these experiments manually to illustrate the concept, but the test process is the same whether automated or not. Overall, trying a few different sentences, Google Translate actually worked very well. Great for them.

To be honest, I did not really use BERT in the above example, since it was just one example I needed to illustrate the concept. I just picked a word that makes sense (to me). However, HuggingFace has really nice and easy to use implementations available of BERT and many other similar models if needed. I have used them myself for many other tasks. Much like the image augmentation libraries in the car example, the NLP libraries have come a long way, and many basic applications are quite simple and easy these days.

For more details on MMT for machine translation, I recommend checking the papers, especially the (He2020) is quite readable. An extra interesting point here is again the use of another ML-based approach to help in building the transformations, similar to the GAN-based approaches for autonomous cars.

Metamorphic Testing of Medical Images

As an example of a third application domain, applying metamorphic testing to ML-based systems in the medical domain is presented in (Ding2017). This uses MMT to test variants of existing high-resolution biological cell images.

In (Ding2017), a number of metamorphic relations are defined related to various aspects of the biological cells (mitochondria etc.) in the images, and the manipulations done to the image. I lack the medical domain expertise to analyze the transformations or metamorphic relations in more detail, and the paper does not very clearly describe these for me. But I believe my lack of understanding is actually a useful point here.

Metamorphic testing related elements in this case (as far as I understood):

  • Seed inputs: existing medical images (the paper is actually very unclear on this, along with many other aspects, but it serves as a domain example)
  • Transformations: adding, removing, transforming, etc. of mitochondria in the images
  • Metamorphic relations: the relations between the elements (mitochondria) in the algorithm outputs for the transformed images should match the defined relation (e.g., linking some elements after adding new ones)

This example highlights, for me, how in many cases, the nuances, the metamorphic relations and transformations require an in-depth domain understanding. This requires extensive collaboration between different parties, which is quite common (in my experience) in applying ML-based approaches. Cars, driving, and language translation are everyday tasks we are all familiar with. Many expert domains, such as in this example, less so. This is why I think this is a useful example in highlighting my lack of domain expertise.

Interestingly, (Ding2017) also mentions using traditional testing techniques such as combinatorial testing, randomization, and category-partitioning, to enhance the initial input seed set. This is also the case in the following example on drones.

Metamorphic Testing of Drones

As a final example domain, an approach of combining model-based testing, simulation, and metamorphic testing for testing an ML-based flight guidance systems of autonomous drones is presented in (Lindvall2017).

The drone is defined as having a set of sensors, including barometer, GPS, cameras, LIDAR, and ultrasonic. Many sensors, quite similar to the autonomous cars example. The metamorphic relations defined for the drone control:

  • behaviour should be similar across similar runs
  • rotation of world coordinates should have no effect
  • coordinate translation: same scenario in different coordinates should have no effect
  • obstacle location: same obstacle in different locations should have same route
  • obstacle formation: similar to location but multiple obstacles together
  • obstacle proximity: always within defined bounds
  • drone velocity: velocity should stay inside defined bounds
  • drone altitude: altitude should stay inside defined bounds
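The coordinate rotation relation above, for example, can be sketched with a made-up route and a stand-in "route property" (path length); a real test would run the actual planner on the rotated environment and compare its outputs.

```python
# Sketch of the world-rotation relation: rotating the whole scenario
# should leave route properties (here, path length) unchanged.
import math

def rotate(point, angle):
    # Rotate a 2D point around the origin.
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def route_length(waypoints):
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))

waypoints = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
rotated = [rotate(p, math.pi / 3) for p in waypoints]

# Metamorphic relation: rotation of world coordinates has no effect.
print(abs(route_length(waypoints) - route_length(rotated)) < 1e-9)  # True
```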

The following are properties of such a system’s metamorphic testing environment:

  • Seed inputs: generated using the model-based approaches based on an environment model for the simulation
  • Transformations: See above; rotation and coordinate changes of drone vs environment and obstacles or obstacle groups, etc
  • Checks: See also above

A test environment generator is used to define (simulated) test environments for the drone, effectively generating the seeds of the metamorphic tests. The metamorphic transformations can be seen as modifications of this environment, and finally checks test that the above defined metamorphic relations hold. Various scenarios are defined to hold these together, including lift-off, returning home, landing, etc.

Perhaps the most interesting part here is the use of model-based testing approaches to build the seed inputs themselves, including the test environment. This seems like a very useful approach for gaining further coverage in domains where this is possible.

Another relevant observation in this is the use of scenarios to group elements together to form a larger test scenario, spanning also time. This is important, since a drone or a car, or many other systems, cannot consider a single input in isolation, but rather must consider a sequence of events, and use it as a context. This time aspect also needs to be taken into account in metamorphic testing.

Adversarial Inputs and Relations Across Time

Adversarial Inputs

A specific type of transformation that is often separately discussed in machine learning is that of adversarial inputs, which are extensively described in (Goodfellow2018). In general, an adversarial input aims to trick the machine learning algorithm into making a wrong classification. An example from (Goodfellow2018) is to fool an autonomous car (surprise) into misclassifying a stop sign, potentially leading to an accident or other issues.

Generating such adversarial inputs can be seen as one example of a metamorphic transformation, with an associated relation that the output should not change, or change should be minimal, due to adversarial inputs.
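As a toy illustration of the idea (not a real attack on a neural network), consider a linear classifier, where the gradient of the score with respect to the input is just the weight vector. Nudging the input in the gradient's sign direction, in the spirit of the fast gradient sign method, can flip the classification with only a small perturbation. All values below are made up.

```python
# Toy sketch of crafting an adversarial input against a linear classifier.
import numpy as np

w = np.array([1.0, -2.0, 0.5])   # toy model weights
b = 0.1

def score(x):
    return float(w @ x + b)

x = np.array([0.2, 0.3, 0.4])    # original input, classified negative
x_adv = x + 0.3 * np.sign(w)     # small step in the gradient direction

# Metamorphic relation for robustness: the classification (sign of the
# score) should not flip under such a small perturbation. Here it does.
print(score(x) > 0, score(x_adv) > 0)  # False True
```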

Typically such adversarial testing requires specifically tailored data to trigger such misclassification. In a real-world driving scenario, where the car sensors are not tampered with, it might be harder to produce such purely adversarial effects. However, there are some studies and approaches, such as (Zhou2020) considering this for real-world cases. More on this in a bit.

Beyond autonomous cars, digitally altered or tailored adversarial inputs might be a bigger issue. For example, in domains such as cyber-security log analysis, or natural language processing, where providing customized input data could be easier. I have not seen practical examples of this from the real world, but I expect once the techniques mature and become more easily available, more practical sightings would surface.

Much of the work on adversarial inputs, such as (Goodfellow2018), presents examples of adversarially modified single inputs (images). Real systems are often not so simple. For example, as a car drives, the images (as well as other sensor data), and the decisions that need to be made based on that data, change continuously. This is what the (Zhou2020) paper discusses for autonomous cars.

Relations Across Time

In many cases, besides singular inputs, sequences of inputs over time are more relevant for the ML-based system. Driving past a sign (or a digital billboard..), the system has to cope with all the sensor data at all the different positions in relation to the environment over time; in this case, the changing camera viewing angle. For other sensors (LIDAR etc.), the input and thus output data would change in a similar manner over time.

Following is an example of what might be two frames a short time apart. In a real video stream there would be numerous changes and images (and other inputs) per second:

Change of (same) input concept over time. Image by author.

Not only does the angle change, but time as a context should be more generally considered in this type of testing (and implementation). Are we moving past the sign? Towards it? Passing it? Did we stop already? What else is in the scene? And so on.

This topic is studied in (Zhou2020), which considers it from the viewpoint of adversarial input generation. In a real-world setting, you are less likely to have your image data directly manipulated, but may be susceptible to adversarial inputs on modified traffic signs, digital billboards, or similar. This is what they (Zhou2020) focus on.

The following example illustrates how any such modification would also need to change along with the images, over time (compared to calculating a single, specific altered input vs real-world physical data moving across time):

If testing with generated adversarial elements, they should change over time in relation to the rest of the image. Here the angle and size of the blue box. Image by author.

This temporal aspect is important in more ways than just for adversarial inputs. For example, all the image augmentations (weather effects, etc) I discussed earlier would benefit from being applied in a realistic driving scenario (sequences of images) vs just a single image. This is what the cars have to deal with in the real world after all.

The test oracle in (Zhou2020) also considers the effect of the adversarial input from two different viewpoints: strength and probability. That is, how large deviations can you cause in the steering of the car with the adversarial changes, and how likely it is that you can cause these deviations with the adversarial input.

Beyond cars and video streams, time series sequences are common in other domains as well. The drone scenarios discussed are one example. Other examples include processing linked paragraphs of text, longer periods of signal in a stock market, or basic sensor signals such as temperature and wind speed.

Minimizing the Test Set

While automating metamorphic testing can be quite straightforward (once you figure your domain relations and build working transformations…), the potential input space from which to choose, and the number of transformations and their combinations can quickly grow huge. For this reason, test selection in MMT is important, just as with other types of testing.

One approach to address this is presented in (Tian2018), which applies a greedy search strategy. Starting with a seed set of images and transformations, the transformations and their combinations are applied on the input (images), and the achieved neuron activation coverage is measured. The “good” combinations, those that increase coverage by at least some threshold, are added back to the seed set for the following rounds, along with other inputs and transformations. This iterates until defined ending thresholds are reached (or a set number of experiments has been run). Quite similar to more traditional testing approaches.
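The greedy loop can be sketched as follows. The `coverage` function here is a made-up stand-in (mapping inputs to sets of "activated neuron" ids), not the actual DeepTest coverage measurement.

```python
# Sketch of greedy, coverage-guided test selection.

def coverage(inputs):
    # Stand-in: each input "activates" a set of neuron ids.
    fake_activations = {"img1": {1, 2}, "img1+fog": {1, 2, 3},
                        "img2": {2}, "img2+rain": {2, 4}}
    covered = set()
    for i in inputs:
        covered |= fake_activations[i]
    return covered

def greedy_select(seeds, candidates, min_gain=1):
    selected = list(seeds)
    covered = coverage(selected)
    for c in candidates:
        gain = len(coverage(selected + [c])) - len(covered)
        if gain >= min_gain:      # keep only combinations that add coverage
            selected.append(c)
            covered = coverage(selected)
    return selected, covered

print(greedy_select(["img1"], ["img1+fog", "img2", "img2+rain"]))
```

Here "img2" is dropped since it adds no new coverage over "img1", while the fog and rain variants are kept.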

Coverage in (Tian2018) is measured in terms of activations of different neurons in the ML model. They build coverage criteria for different neural network architectures, such as convolutional neural nets, recurrent neural nets, and dense neural nets. Various other coverage criteria also have been proposed, that could be used, such as one in (Gerasimou2020) on evaluating the importance of different neurons in classification.

Once more easily applicable tools become available for this type of ML-model coverage measurement, it would seem a very useful approach. However, I do not see people generally writing their own neural net coverage measurement tools.

Relation to Traditional Software Testing

Besides test suite optimization, it is important to consider MMT more broadly in relation to overall software and system testing. MMT excels at testing and verifying many aspects of ML-based systems, which are more probabilistic and black-box in nature, or at least at gaining higher confidence and assurance in them.

However, even in ML-based systems, the ML-part is not generally an isolated component working alone. Rather it consumes inputs, produces outputs, and uses ML models for processing complex datasets. The combinatorial, equivalence partitioning, and model-based methods I mentioned earlier are some examples of how the MMT based approaches can be applied together with the overall, more traditional, system testing.

As I mentioned with the Baidu Apollo case and its LIDAR test data generation, one outcome of that work was to use the metamorphic test data for further ML training. This in general seems like a useful idea, and it is always nice to get more training data. In my experience with building ML-based systems, and training the related ML models, everyone always wants more training data.

However, I believe one should not simply dump all MMT test data into the training dataset. A trained model will learn from the given data, and can be tested for general accuracy on a split test set. This is the typical approach to test a specific ML-model in isolation. However, in practice, the classifications will not be 100% accurate, and some items will end up misclassified, or with low confidence scores. These further feed into the overall system, which may have unexpected reactions in combination with other inputs or processes. Running specific (MMT based or not) tests with specific inputs helps highlight exactly which data is causing issues, how this behaviour changes over time, and so on. If you just throw your MMT tests into the training set and forget it, you lose the benefit of this visibility.

Besides MMT, and complementary to it, other interesting approaches of tailoring traditional testing techniques for ML-based system testing exist. One specific approach is A/B testing (evaluating the benefits of different options). In ML-based systems, this can also be a feedback loop from the human user, or the operational system, back to testing and training. The Tesla Shadow Mode is one interesting example, where the autonomous ML-based system makes continuous driving decisions, but these decisions are never actually executed. Rather, they are compared with the actual human driver choices in those situations, and this is used to refine the models. Similar approaches, where the system can learn from human corrections, are quite common, such as tuning search-engine results and machine translations based on human interactions with the system. You are changing / morphing the system here as well, but in a different way. This would also make an interesting seed input source for MMT, along with oracle data (e.g., the driving path taken by the human driver) for the metamorphic relation.


Conclusions

Testing machine learning based systems is a different challenge from testing more traditional systems. The algorithms and models do not come with explicit specifications of inputs and outputs that can be simply tested and verified. The potential space for both is often huge and noisy. Metamorphic testing is one useful technique for gaining confidence in their operation with a reasonable effort. Compared to traditional testing techniques, it is not a replacement but rather a complementary approach.

I presented several examples of applying MMT to different domains in this article. While applications in different domains require different considerations, I believe some generally useful guidelines can be derived to help perform MMT over ML-based systems:

  • metamorphic transformations: these do not have to be hugely complex, but rather simple ones can bring good benefits, such as the addition of a few random points to the LIDAR cloud. Consider how the same input could change in its intended usage environment, and how such change can be implemented with least (or reasonable) effort as a transformation.
  • metamorphic relations: to build these relations, we need to ask how can we change the ML input, and what effect should it have on the output? Sometimes this requires deep domain expertise to identify most relevant changes, as in the medical domain example.
  • test oracles: these check that the performed transformation results in an acceptable (rather than a single exactly correct) output. This requires considerations such as how to represent the change (e.g., steering angle change, sentence structural change), possibly defining the probability of some error, the severity of the error, and a distance metric between the potential outputs after transformation (e.g., steering angle calculation). That is, the values are likely not fixed but in a continuous range.
  • time relation: in many systems, the inputs and outputs are not singular but the overall system performance over time is important. This may also require asking the question of how time might be impacting the system, and how it should be considered in sequences of metamorphic relations. The idea of overall test scenarios as providers of a broader context, time related and otherwise, is useful to consider here.
  • test data: can you use the user interactions with the system as an automated source of test inputs for transformations and metamorphic relations? Think Tesla Shadow mode, Google search results, and the inputs from the environment and the user, and use reactions to these inputs.

As discussed with some of the examples, an interesting trend I see is the move towards using ML-based algorithms to produce or enhance the (MMT-based) tests for ML-based systems. In the NLP domain this is shown by the use of BERT as a tool to build metamorphic transformations for testing natural language translations. In the autonomous cars domain by the use of GAN-based networks to create transformations between image properties, such as different weather elements and time of day.

Overall, the ML field still seems to be advancing quite fast, with useful approaches already available for MMT, and hopefully much more mature tooling in the next few years. Without good tool support (test data generation, model coverage measurement, etc.), finding people with all the required expertise (testing, machine learning, domain specifics, …) and implementing it all over again for every system seems likely to be quite a challenge, and sometimes a needlessly high effort.

That’s all for today, this got way too long, so if someone managed to read all this far, I am impressed :). If you have experiences in testing ML-based systems and willing to share, I am interested to hear and learn in the comments 🙂

Building an NLP classifier: Example with Firefox Issue Reports

DistilBERT vs LSTM, with data exploration

This is what the article is about. Robot photo by Phillip Glickman on Unsplash. Developer persona, beetle, and fox images by Open Clipart, tortoise by creozavr, ladybug by Prawny, bee by gustavorezende, other beetle by Clker; these are all on Pixabay. Composition by author.

Machine learning (ML) techniques for Natural Language Processing (NLP) offer impressive results these days. Libraries such as Keras, PyTorch, and HuggingFace NLP make the application of the latest research and models in the area a (relatively) easy task. In this article, I implement and compare two different NLP based classifier model architectures using the Firefox browser issue tracker data.

I originally posted this article on Medium, where you can still find it.

I previously built a similar issue report classifier. Back then, I found the deep-learning based LSTM (Long Short-Term Memory) model architecture performed very well for the task. I later added the Attention mechanism on top of the LSTM, improving the results. This LSTM with Attention is the first model type I chose for this article.

The second model type is Transformers. Transformers have become very popular in NLP over the past few years, and their popularity has spawned many variants. In this article, I wanted to get a feel for how they compare to the LSTM approach I applied before.

I use the HuggingFace (HF) DistilBERT as the Transformer model. For the LSTM, I use Keras with its LSTM and open-source Attention layers. I train them both to predict what component an incoming Firefox bug report should be assigned to. The idea would be to assist in bug report triaging tasks.

Getting the Data

As said, I am using the issue tracker for the Firefox browser as the source of my data. Since I am in no way affiliated with Mozilla or the Firefox Browser, I just used their Bugzilla REST API to download the data. In a scenario where you actually work within your company or with a partner organization, you could likely just ask for the data as a direct dump from the database. But sometimes you have to find another way. In this case it was quite simple, with a public API supporting downloads. You can find the downloader code on my Github. It is just a simple script to download the data in chunks.
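A minimal sketch of such chunked downloading, using only the Python standard library, might look as follows. The page size and field list here are my own illustrative choices, not necessarily those of the actual downloader script.

```python
# Sketch of chunked (paginated) downloading from the Bugzilla REST API.
import json
import urllib.parse
import urllib.request

BASE = "https://bugzilla.mozilla.org/rest/bug"

def chunk_url(offset, limit=500):
    # Build the URL for one page of results.
    params = {"product": "Firefox", "limit": limit, "offset": offset,
              "include_fields": "id,summary,description,component"}
    return BASE + "?" + urllib.parse.urlencode(params)

def download_chunk(offset, limit=500):
    # Fetch and parse one page of issue reports.
    with urllib.request.urlopen(chunk_url(offset, limit)) as resp:
        return json.load(resp)["bugs"]

# Example: the first three page URLs of the download.
for offset in range(0, 1500, 500):
    print(chunk_url(offset))
```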

Exploring and Cleaning the Data

The overall data I use contains the following attributes for each issue:

  • ID: A unique issue number, apparently across all Firefox/Mozilla product issues (not just the Browser)
  • Summary: A summary description of the issue. This is one of the text fields I will use as features for the classifier.
  • Description: A longer description of the issue. This is the other text field I will use as features for the classifier.
  • Component: The software component in the Firefox Browser architecture that the issue was assigned to. This will be my prediction target, and thus the label to train the classifier on.
  • Duplicate issues number, if any
  • Creator: email, name, user id, etc.
  • Severity: trivial to critical/blocker, or S1 to S4. Plus some seemingly random values; it looked like a bit of a mess when I examined it :).
  • Last time modified (change time)
  • Keywords: apparently, you can pick any keywords you like. I found 1680 unique values.
  • Status: One of resolved, verified, new, unconfirmed, reopened, or assigned
  • Resolution: One of duplicate, fixed, worksforme, incomplete, invalid, wontfix, expired, inactive, or moved
  • Open/Closed status: 18211 open, 172552 closed at the time of my download.

Selecting Features for the Predictor

My goal was to build a classifier assigning bug reports to components, based on their natural language descriptions. The first field I considered for this was naturally the description field. However, another similarly interesting field for this purpose is the summary field.

By looking at the values, I can quite simply see there are some issue reports with no (or empty) description:

There are 2973 issue reports with missing description. Image by author.

I considered an option of using the summary field as another extra feature set. Almost all issue reports have a summary:

There are 4 issue reports with missing summary. Image by author.

I would need a way to build a classifier so that each issue report could be classified, meaning they all should have the required features. Clearly, neither description nor summary alone is sufficient here. One option I considered was to use the summary if the description is missing, or if the description classifier is not very confident in its classification.

A related paper I found used a combination of summary + description as the feature set. Combining these two gives a set where all items at least have some features to predict on:

Joining summary + description as a single feature set to have no issue with missing features. Image by author.

The strange-looking apply function above simply adds the summary and description texts together to produce text_feature. With this, all issue reports have a non-empty text_feature as a result. This is what I used as the features for the classifier (tokenized words from text_feature).
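The combination step itself could be sketched like this; the example rows are made up, and the column names assume the dataframe fields match the attribute names listed earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "summary": ["Crash on startup", "Tab bar glitch"],
    "description": ["Firefox crashes when...", None],  # second report has no description
})

# Join summary and description into one text feature,
# treating a missing field as empty text.
df["text_feature"] = df.apply(
    lambda row: ((row["summary"] or "") + " " + (row["description"] or "")).strip(),
    axis=1,
)
```

After this, every row has a non-empty text_feature even when the description is missing.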

Selecting the Components to Predict

To predict a component to assign an issue report to, I need the list of components. Getting this list is simple enough:

Getting the number of target components (52) for the issue reports. Image by author.

There are 52 components that have something assigned to them. The simple thing to do would be to just train a classifier to predict any of these 52 based on the text features. But as software evolves, some components may become obsolete, and trying to assign something to them at the current time might be completely pointless if the component is no longer relevant.
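The component list and per-component counts can be pulled from a pandas dataframe in one line each; a minimal sketch with made-up data (my real query is in the notebook):

```python
import pandas as pd

df = pd.DataFrame({"component": ["General", "Theme", "General", "Sync", "General"]})

# Number of distinct target components.
n_components = df["component"].nunique()

# Issue report count per component, most common first.
component_counts = df["component"].value_counts()
```

On the full dataset, `n_components` would be the 52 mentioned above, and `component_counts` would produce the listing below.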

Let's see the details:

Getting the count of issue reports for all components. Image by author.

This gives the following list:

General 64153
Untriaged 18207
Bookmarks & History 13972
Tabbed Browser 9786
Address Bar 8120
Preferences 6882
Toolbars and Customization 6701
Theme 6442
Sync 5930
Menus 4440
Session Restore 4205
Search 4171
New Tab Page 4132
File Handling 3471
Extension Compatibility 3371
Security 3277
Shell Integration 2610
Installer 2504
PDF Viewer 2193
Keyboard Navigation 1808
Messaging System 1561
Private Browsing 1494
Downloads Panel 1264
Disability Access 998
Protections UI 984
Migration 964
Site Identity 809
Page Info Window 731
about:logins 639
Site Permissions 604
Enterprise Policies 551
Pocket 502
Firefox Accounts 451
Tours 436
WebPayments UI 433
Normandy Client 409
Screenshots 318
Translation 212
Remote Settings Client 174
Top Sites 124
Headless 119
Activity Streams: General 113
Normandy Server 91
Nimbus Desktop Client 83
Pioneer 73
Launcher Process 67
Firefox Monitor 61
Distributions 56
Activity Streams: Timeline 25
Foxfooding 22
System Add-ons: Off-train Deployment 15
Activity Streams: Server Operations 5

The number of issues per component spans a wide range, and the distribution is highly skewed (General alone holds over a third of all the reports). Specifically, some components have very few reports, and any classifier would likely do quite poorly with them.

First, let's see if I can find any components that might be obsolete and could be removed. This can happen over time as the software evolves, and some features (and their components) are dropped. One way to look at this is to find components that have had no activity for a long time. The following should show the latest date when a component had an issue created for it:

Getting a sorted list of latest issue creation time per component. Image by author.
Activity Streams: Timeline 2016-09-14T14:05:46Z
Activity Streams: Server Operations 2017-03-17T18:22:07Z
Activity Streams: General 2017-07-18T18:09:40Z
WebPayments UI 2020-03-24T13:09:16Z
Normandy Server 2020-09-21T17:28:10Z
Extension Compatibility 2021-02-05T16:18:34Z
Distributions 2021-02-23T21:12:01Z
Disability Access 2021-02-24T17:35:33Z
Remote Settings Client 2021-02-25T17:25:41Z
System Add-ons: Off-train Deployment 2021-03-23T13:58:13Z
Headless 2021-03-27T21:15:22Z
Migration 2021-03-28T16:36:08Z
Normandy Client 2021-04-01T21:14:52Z
Firefox Monitor 2021-04-05T16:47:26Z
Translation 2021-04-06T17:39:27Z
Tours 2021-04-08T23:26:27Z
Firefox Accounts 2021-04-10T14:17:25Z
Enterprise Policies 2021-04-13T02:38:53Z
Shell Integration 2021-04-13T10:01:39Z
Pocket 2021-04-15T02:55:12Z
Launcher Process 2021-04-15T03:10:09Z
PDF Viewer 2021-04-15T08:13:57Z
Site Identity 2021-04-15T09:20:25Z
Nimbus Desktop Client 2021-04-15T11:16:11Z
Keyboard Navigation 2021-04-15T14:40:13Z
Installer 2021-04-15T15:06:46Z
Page Info Window 2021-04-15T19:24:28Z
about:logins 2021-04-15T19:59:31Z
Site Permissions 2021-04-15T21:33:40Z
Bookmarks & History 2021-04-16T09:43:36Z
Downloads Panel 2021-04-16T11:39:07Z
Protections UI 2021-04-16T13:25:27Z
File Handling 2021-04-16T13:40:56Z
Foxfooding 2021-04-16T14:08:35Z
Security 2021-04-16T15:31:09Z
Screenshots 2021-04-16T15:35:25Z
Top Sites 2021-04-16T15:56:26Z
General 2021-04-16T16:36:27Z
Private Browsing 2021-04-16T17:17:21Z
Tabbed Browser 2021-04-16T17:37:16Z
Sync 2021-04-16T18:53:49Z
Menus 2021-04-16T19:42:42Z
Pioneer 2021-04-16T20:58:44Z
New Tab Page 2021-04-17T02:50:46Z
Messaging System 2021-04-17T14:22:36Z
Preferences 2021-04-17T14:41:46Z
Theme 2021-04-17T17:45:09Z
Session Restore 2021-04-17T19:22:53Z
Address Bar 2021-04-18T03:10:06Z
Toolbars and Customization 2021-04-18T08:16:27Z
Search 2021-04-18T09:04:40Z
Untriaged 2021-04-18T10:36:19Z
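A listing like the one above boils down to a groupby with a max aggregate; a minimal sketch with made-up rows (the column names are assumptions matching the attributes listed earlier):

```python
import pandas as pd

df = pd.DataFrame({
    "component": ["Theme", "Theme", "Sync"],
    "creation_time": ["2021-04-17T17:45:09Z",
                      "2020-01-01T00:00:00Z",
                      "2021-04-16T18:53:49Z"],
})
df["creation_time"] = pd.to_datetime(df["creation_time"])

# Latest issue creation time per component, least recently active first.
latest = df.groupby("component")["creation_time"].max().sort_values()
```

Components at the top of `latest` are candidates for being obsolete.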

The list above is sorted by time, and all three components related to “Activity Streams” had their last issues created 4–5 years ago. With this, I added them to the list of components to remove from the dataset. It seems pointless to assign any new issues to them with this timeline.

The Activity Streams: Timeline component was also one of the components in the earlier list with the fewest issues assigned to it. The other two components with very few issues created were Foxfooding and System Add-ons: Off-train Deployment. Since the issues for a component are listed in chronological order, looking at the last few in both of these should give some insight on their recent activity.

First System Add-ons: Off-train Deployment:

Last 5 issue reports for “System Add-ons: Off-train Deployment”. Image by author.

The above figure/table shows that the actual last reported issue is from 2019, and the few after that are some kind of clones of old issues, made for some other purpose than reporting actual issues. So I dropped System Add-ons: Off-train Deployment from the dataset as well.

Foxfooding is described in the Firefox issue tracker as collecting issues for later triaging. Looking into it, it only shows recent issues; I guess older ones might have been triaged. Without further knowledge, I left it in the dataset. With better access to domain experts, I might have removed it, as it sounds like the actual issues in it could belong to many other components (and be moved there after triaging). But I expect it is not a big deal, as it only has a few issues in it.

A few other components also had a longer period since their last issue report. To get a better idea of how active these components have been over time, I plotted their issue counts per month. For example, WebPayments UI:

Plotting number of issues report over a month for “Webpayments UI”. Image by author.

WebPayments UI seems to have started quietly, gotten some attention, and quieted down again. The last of this attention was a bit over a year before the time of writing, in March 2020. I don’t know if it is still relevant, so I just left it in.
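The monthly counts behind such a plot can be computed by resampling the creation timestamps; a sketch with made-up dates, assuming a `creation_time` column as before:

```python
import pandas as pd

df = pd.DataFrame({
    "creation_time": pd.to_datetime([
        "2019-01-05", "2019-01-20", "2019-02-11", "2019-04-02",
    ]),
})

# Count issues per calendar month ("MS" = month-start frequency);
# months with no issues show up as zero.
monthly = df.set_index("creation_time").resample("MS").size()

# monthly.plot(kind="bar")  # would render the bar chart (requires matplotlib)
```

This is the kind of series plotted in the figure above, one bar per month.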

Finally, the list of components I removed for training as a result of this short analysis were the following:

  • System Add-ons: Off-train Deployment
  • Activity Streams: Timeline
  • Activity Streams: Server Operations
  • Activity Streams: General

The rest of the components seemed more likely to be relevant still. This left me with 48 target components from the original 52. I trained the initial set of models with these 48 target components. After a little bit of looking at the results, I removed one more component. Did you figure out which one it is already?

It is Untriaged, because Untriaged is just a list of issues that have not yet been assigned to other components. Thus, from a machine learning perspective, these issues are unlabeled. As far as I can see, keeping them in the training set can only confuse the trained classifier. So for further training iterations I also removed issues assigned to Untriaged, leaving 47 target components (labels).

In data analysis, it is easy to get sidetracked by the next shiny thing. A bit like Little Red Riding Hood in the forest, I guess. In that vein, some interesting facts can also be found by looking at the oldest reports with the Untriaged component/tag:

Oldest open issues in Untriaged state. Image by author.

The above list shows the oldest open and untriaged issue is from over 20 years ago (when writing this). It discusses the correct way to abbreviate “seconds”. In my experience, this is exactly how issue trackers in projects tend to evolve over time. No-one wants to say this issue does not matter and close it, yet no-one wants to make the effort or decide what to do with it. Or take the heat for the slightly irrelevant decision. Or maybe it is just forgotten.

A bunch of others in that list have also been waiting for a few years, and if I remove the is_open requirement from the query, there are many very old issues in untriaged status. Issue trackers in general evolve this way, I guess. At least that is what I have seen, and it sort of makes sense. Like my storage closet: junk accumulates, and it is easier to just leave it than do something about it.

Finally, just one more query to show the oldest issue created per component:

Date of oldest issue created per component. Cut in 2014 to save space. Image by author.

The above list actually seems to give a kind of history of the development of the project. As I said, it is easy enough to get lost in data exploration like this, but in a practical setting I should be focusing more on the task at hand. So let's get back to building the issue report classifier.

Training the Models

In this section I will briefly describe the models I used, and the general process I applied. For some assurance on training stability, I repeated the training 10 times for both models with randomly shuffled data. The overall dataset I used has 172556 rows of data (issue reports) after removing the five components discussed above.

A Look at the Models

First, the models.

Keras LSTM

A notebook setting up the Keras LSTM model, and running it can be found on my Github. The Keras summary function shows the structure:

Keras model with bi-directional LSTM layers and Attention layers. Image by author.

The input layer takes in a maximum of 512 word tokens. These feed into a GloVe-based word-embedding layer, which converts the input token sequence into 300-dimensional embedding vectors. This is followed by a bi-directional LSTM layer with 128 nodes. A self-attention layer follows, feeding its output into another bi-directional LSTM layer with 64 nodes. The output from there goes into a weighted attention layer, which passes it to the final Dense output layer. Many fancy words if you are not familiar with them, but worry not: it is simple to use in the end, and practical results will be presented.

Keras LSTM model, but with boxes and arrows. Image generated by Keras as instructed by author.

I recommend checking the layer docs for more information if interested.
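As a rough sketch of the architecture described above, built with plain tf.keras: the vocabulary size here is my assumption, the embedding is randomly initialized rather than GloVe-loaded, and the built-in `Attention` and `GlobalAveragePooling1D` layers are stand-ins for the open-source self-attention and weighted-attention layers the actual notebook uses:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # assumption: size of the tokenizer vocabulary
MAX_LEN = 512        # maximum input length in word tokens
NUM_CLASSES = 47     # target components after cleanup

inputs = layers.Input(shape=(MAX_LEN,))
# Embedding layer; in the article this is initialized with 300-dim GloVe vectors.
x = layers.Embedding(VOCAB_SIZE, 300)(inputs)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
# Built-in Keras attention as a stand-in for the open-source self-attention layer.
x = layers.Attention()([x, x])
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# Average pooling as a stand-in for the weighted-attention layer before the output.
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The exact layer choices differ from the notebook, but the overall shape (embedding, two bi-LSTM blocks with attention in between, softmax output over the components) is the same.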

HuggingFace DistilBERT

The HuggingFace DistilBERT is a bit more of a black box. A notebook for the training and testing is also on my GitHub. The Keras summary for it gives:

Model summary given by Keras for the HF DistilBERT model. Image by author.

I guess it is some kind of custom Tensorflow implementation that shows up as a single layer from the Keras viewpoint. This matches my previous attempts at reading about the various transformer architectures, where everyone quickly diverges from the general high-level name into all the in-depth details, and I am left wondering why no one is able to provide an understandable and intuitive intermediate view. Anyway. The visual representation of it in terms of boxes and arrows in Keras is even better:

Model visualization by Keras for the HF DistilBERT. Image generated by Keras as instructed by author.

It’s all just a single box. I guess this is what you would call a black-box model (just color the box black to finish it..) :).

Sampling the Data

I sampled the dataset in each case into 3 parts. The model was trained on 70% of the data (training set), or 124126 issue reports. 20%, or 31032 issue reports (validation set), were used for evaluating model performance during training. The remaining 10%, or 17240 issue reports, were used to evaluate the final model after training finished (test set). The sampling in each case was stratified, producing an equal proportion of the different target components in each of the train, validation, and test sets.

I repeated the sampling into these 3 sets 10 times, with different randomization in selecting the items for each set. Another option would have been to do 10-fold splits and validations, which would be more systematic. But the sampling I used worked for my purposes. Luckily I am not writing a research paper for Reviewer 2 today, so let's pretend that is fine.
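The stratified 70/20/10 split can be done with two chained scikit-learn `train_test_split` calls; a sketch on tiny made-up data (in the real case, `df` is the full issue dataframe and the `random_state` would vary per repetition):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text_feature": [f"issue {i}" for i in range(30)],
    "component": ["General", "Theme", "Sync"] * 10,
})

# First split off 30%, then split that 30% into 20% validation + 10% test,
# stratifying on the component label each time.
train_df, rest_df = train_test_split(
    df, test_size=0.3, stratify=df["component"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=1/3, stratify=rest_df["component"], random_state=42)
```

Stratification keeps the component proportions equal across the three sets.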

Training Accuracy over Epochs

The following figures illustrate the training loss and accuracy for one set of the sampled data, for both models. First, the HuggingFace training for DistilBERT:

DistilBERT transformer evaluation loss and accuracy over training epochs. Image by author.

I set the HuggingFace trainer to evaluate the model every 500 steps, providing the high-granularity graph above. With the amount of data I used, and a batch size of 16, the number of steps in HF training over 3 epochs was 22881. At every 500th of these, the evaluation was performed and shows as a point on the above graph. As the figure shows, training was quite consistent but leveled off at around epoch 2.

The Keras LSTM training metrics per epoch are shown below:

LSTM training and validation set loss and accuracy over epochs. Epoch 2 = 1.0 due to 0-index. Image by author.

In this figure, epoch 0.0 is epoch 1, 1.0 is epoch 2, and so on. This is simply due to how I used a 0-indexed array for it. The name test is also actually referring to my validation set in this case; I always get the terms confused, sorry about that. More importantly, I trained this model for 4 epochs, and each point in this figure shows the evaluation result after an epoch. The validation always peaked at epoch 2 for this model.


Results

Results are perhaps the most interesting part. The following table shows 10 different runs where the LSTM and Transformer classifiers were trained on different dataset variants as described before:

10 different runs of train + predict with top 1 and top 5 accuracy. Image by author.

Each row in the table is a separate training on different random shuffles of the data. The table has the following columns:

  • hf_best_epoch: the epoch where the lowest loss (for validation set) was recorded for HF. In Keras this was always epoch 2, so I did not include a column for it.
  • hf_val_loss: the validation set loss at hf_best_epoch as given by HF.
  • hf_val_acc: the validation set accuracy at same point as hf_val_loss.
  • k_val_loss: the validation set loss at end of best epoch as given by Keras.
  • k_val_acc: the validation set accuracy at same point as k_val_loss.
  • hf1_test_acc: HF accuracy of using the best model to predict the target component, and only taking the top prediction.
  • k1_test_acc: same as hf1_test_acc but for the Keras model.
  • hf5_test_acc: same as hf1_test_acc, but considering if any of the top 5 predictions match the correct label. Think of it as providing the triaging user with 5 top component suggestions to assign the issue to.
  • k5_test_acc: same as h5_test_acc but for the Keras model.
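The top-1 and top-5 accuracies in the table boil down to the same computation with a different k; a minimal sketch of how such a top-k accuracy could be computed from a matrix of predicted class probabilities (the function name is mine, not from the notebooks):

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """Fraction of rows where the true label is among the k highest-probability classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest entries per row
    hits = [label in row for label, row in zip(labels, topk)]
    return float(np.mean(hits))
```

With k=1 this is the usual accuracy; with k=5 a prediction counts as correct if the true component is anywhere in the top five suggestions.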

Comparing Accuracy in Transformers vs LSTM

The results from the above table are quite clear, and the Transformer version outperforms the LSTM version in each case. The difference for top-1 prediction accuracy is about 0.72 vs 0.75, or 72% vs 75% in favor of the Transformer architecture. For the top-5 the difference is about 96% vs 97% accuracy. The difference in loss is about 0.91 vs 0.84, again favoring the Transformer.

In a research paper promoting new results, these differences would be a big deal. However, in a practical situation their significance depends on the target domain. Here, my aim was to build a classifier to help the triaging process by suggesting components to assign new issue reports to. In this case, a few misses, or a difference of 96% vs 97% in top-5 accuracy, may not be that big a deal.

Additionally, besides this classification performance, other considerations may also be relevant. For example, the LSTM in general trains faster and requires fewer resources (such as GPU memory). This and similar issues might also be important tradeoffs in practice.

A Deeper Look at Misclassifications

Beyond blindly looking at accuracy values, or even loss values, it is often quite useful to look a bit deeper at what the classifier got right and what it got wrong. That is, what is getting misclassified. Let's see.

In the following, I will present multiple tables across the models and their misclassifications. These tables hold the following columns:

  • total: total number of issue reports for this component in the entire dataset
  • test_total: number of issue reports for this component in the test set
  • fails_act: number of issues for this component that were misclassified as something else. For example, there were 1000 issue reports that were actually for component General but classified as something else.
  • fails_pred: number of issues predicted for this component, but were actually for another component. For example, there were 1801 issues predicted as General but their correct label was some other component.
  • total_pct: the total column value divided by the total number of issues (172556). The percentage this components represents from all the issues.
  • test_pct: same as total_pct but for the test set.
  • act_pct: how many percent of test_total is fails_act.
  • pred_pct: how many percent of test_total is fails_pred.
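The two failure counts above come straight from the rows where the prediction differs from the label; a sketch with made-up results (the column names `actual`/`predicted` are my assumptions for the results frame):

```python
import pandas as pd

results = pd.DataFrame({
    "actual":    ["General", "General", "Theme",   "Sync"],
    "predicted": ["General", "Theme",   "General", "Sync"],
})

wrong = results[results["actual"] != results["predicted"]]

# fails_act: issues of this component predicted as something else.
fails_act = wrong["actual"].value_counts()
# fails_pred: issues predicted as this component but actually something else.
fails_pred = wrong["predicted"].value_counts()
```

The percentage columns are then just these counts divided by the per-component totals.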

Keras LSTM Top-1 Prediction Misclassifications

First, failure statistics for the Keras LSTM classifier, and its top-1 predictions:

Keras LSTM classifier failure stats for top-1 prediction. Image by author.

Predicting only the most common label is often used as the baseline reference. In this case we could expect an accuracy of 37% if we always predicted each issue report to be assignable to the General component, because the table above shows it holds 37% of all issues.

Something that I find slightly unexpected here is that even though General is so much more common in the set of issues, it does not have an overly large proportion of misclassified issues attributed to it (act_pct + pred_pct). Often such a dominant label in the training set also dominates the predictions, but it is nice to see it is not too bad in this case.

Instead, others in that list stand out more. For example, Shell Integration looks quite bad, with 83% (217 of 261) of its actual test set issues being misclassified as some other component (act_pct). One might assume this is due to its smaller number of issues in the training set, but many components with even fewer issues do much better. For example, the one visible in the table above, Installer, has only a 32% fail rate (act_pct).

To analyze these causes deeper, I would look in a bit more detail at the strongest misclassifications (in terms of component probabilities) for Shell Integration, and try to determine the cause of mixup. Perhaps some feature engineering would be in order, to preprocess certain words or tokens differently. But this article is long enough as it is, so not going there.

Something a bit more generic, that I looked further in the statistics, is pairs of misclassifications. The following list shows the most common misclassified pairs in the test set. For example, the top one shows 204 issues for Tabbed Browser being predicted as General. And similarly, 134 General issues predicted as Tabbed Browser. Clearly, the two seem to be mixed often. Our friend, Shell Integration also seems to be commonly mixed with General. And so on.

Pairs of most common misclassifications for LSTM top-1. Image by author.

Overall, the biggest component, General, also dominates the above list, as one might have expected. Maybe because it is general and large, and thus holds a bit of everything.
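A pair listing like the one above is a groupby over the misclassified rows; a sketch on made-up data (again assuming `actual`/`predicted` columns in the results frame):

```python
import pandas as pd

results = pd.DataFrame({
    "actual":    ["Tabbed Browser", "Tabbed Browser", "General",        "Theme", "Theme"],
    "predicted": ["General",        "General",        "Tabbed Browser", "Theme", "General"],
})
wrong = results[results["actual"] != results["predicted"]]

# Count each (actual, predicted) misclassification pair, most common first.
pairs = (wrong.groupby(["actual", "predicted"])
              .size()
              .sort_values(ascending=False))
```

Pairs that appear high in both directions, like Tabbed Browser and General here, are the components the model mixes up most.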

Keras LSTM Top-5 Predictions Misclassifications

Besides top-1 predictions, I also collected top-5 predictions. Top-5 means taking the 5 predicted components for an issue, and considering the prediction correct if any of these 5 is the expected (correct) label. The following shows similar stats for the top-5 as before for top-1:

Keras LSTM classifier failure stats for top-5 prediction. Image by author.

This is otherwise similar to top-1, but the fails_pred column has large values because each issue report had 5 predictions counted for it. So even if one of the 5 was correct, the other 4 would be counted in fails_pred here.

For the rest of the values, the numbers are clearly better for top-5 than for the top-1 results. For example, General has a fails_act value of only 3, while in top-1 it was 1000. Likely because of its dominant size, it gets into many top-5s. This drop from 1000 to 3 is a big improvement, but the overall top-5 accuracy is still only 97% and not 99.9%, as the components with fewer instances still get a larger number of misclassifications. For example, Shell Integration still has about 17% act_pct, even with top-1 relaxed to top-5. Much better, but also not nearly 0% fails.

HuggingFace DistilBERT Top-1 Prediction Misclassifications

Let’s move to the DistilBERT results for the top-1 prediction. Again, for details, see my GitHub notebook and the two scripts that go with it. The overall values are clearly better for this model vs the LSTM, as illustrated by the general statistics earlier. However, I see a slightly different trend here vs the LSTM: this model seems to balance better.

HF DistilBERT classifier failure stats for top-1 prediction. Image by author.

The number of failed predictions for the dominating General component is higher than for the LSTM model. For example, fails_act here is 1063, while in the LSTM top-1 results it was 1000. However, as the overall accuracy was quite a bit better for this model (72% for the LSTM vs 75% for this one), this is balanced out by the other components having better scores. This is what I mean by more balanced: the errors are less concentrated on a few components.

For example, the fails_act for the nasty Shell Integration is down to 185 for DistilBERT here, while it was 217 in the LSTM results. Still not great, but much better than the LSTM. Most of the other components are similarly lower, with a few exceptions. So this model seems to be overall more accurate, but also more balanced.

To further compare, here are the most commonly misclassified pairs for this model:

Pairs of most common misclassifications for DistilBERT top-1. Image by author.

This also shows the similar trend, where the total number of misclassifications is lower, but also no single pair dominates the list as strongly as in the LSTM model results. And again, the pair of Tabbed Browser and General are at the top with getting mixed with each other. Along with General being a part of almost every pair in that list. Looking into these associations in more detail would definitely be on my list if taking this classifier further.

HuggingFace DistilBERT Top-5 Prediction Misclassifications

And the results for the DistilBERT top-5:

HF DistilBERT classifier failure stats for top-5 prediction. Image by author.

Similar to top-1, this one has a higher fails_act for General but mostly lower for the others, leading to what seems to be a more balanced result, along with the higher overall accuracy (96% LSTM vs 97% this model/DistilBERT).

Short Summary on Model Differences

The results I presented could likely be optimized quite a bit. I did not do any hyperparameter optimization or model architecture tuning, so my guess is that both the LSTM and Transformer models could have been optimized further. It might also be useful to try the other HF transformer variants. However, my experience is that the optimization gains are not necessarily huge. My goal for this article was to build a practical classifier from scratch, and to compare the trendy Transformer architecture to my previous experiments with Attention LSTMs. For this, I find the current results are fine.

Overall, I find the results show that the Transformer in general gives better classification performance, but also is more balanced. The LSTM still produces good results, but if resources were available I would go with the Transformer.

One point I find interesting in the model differences is that the LSTM seems to do slightly better for the dominant General component, while the Transformer seems to do better on the other ones. This analysis was based on one pair of the 10 variants I trained, so with more time and resources, looking at different training variants would likely give more confidence still.

One way to exploit such differences in model performance would be to build an ensemble learner. In such a combined model, both the LSTM and Transformer models would contribute to the prediction. These models would seem to be good candidates, since they have mutual diversity: one produces better results in some parts of the data, the other in a different part. In practical production systems, ensembles can be overly complex for relatively small gains. However, in something like Kaggle competitions, where fractions of a percentage in the results matter, these would be good insights to look into.
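The simplest form of such an ensemble is a weighted average of the two models' class-probability outputs; a minimal sketch (the function and weight are mine, not something built in the article):

```python
import numpy as np

def ensemble_probs(p_lstm, p_transformer, w=0.5):
    """Weighted average of two models' class-probability matrices.

    w is the weight given to the LSTM; 1 - w goes to the Transformer.
    """
    return w * p_lstm + (1.0 - w) * p_transformer
```

The combined probabilities can then be fed to the same top-1/top-5 evaluation as the individual models, and `w` could be tuned on the validation set.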

Predictor in Action

The article so far is very much text and data, and would make for a very boring PowerPoint slide set to sell the idea. More concrete, live, and practical demonstrations are often more interesting.

Back when I presented my Qt company classifier based on their issue tracker data, I had similar data with some older classifiers (RF, SVM, TF-IDF, …) and the LSTM. Back then, I found the LSTM classifier produced surprisingly good results (similar to here). I made a presentation of this, showed some good results, and people naturally had the question: How does it actually work on issue reports it has not seen before?

One way to address this question is to try to explain that the training data did not include the validation or the test set, and thus the evaluation already measures exactly how well it would do on data it has not seen before. But not everyone is so familiar with machine learning terminology or concepts. To better address the question, I then opened the issue tracker for that day, took one of the newest, still untriaged issues (no component assigned), and copied its text. Then I ran the classifier live on that text and asked whether people thought the result was good. We can simulate that here too.

I took an issue report filed on May 13th (today, in 2021), as my training dataset was downloaded on the 21st of April. As opposed to a live demo, you just have to trust me on that :).

For this experiment, I picked issue number 1710955. The component label it has been assigned by the developers is Messaging System. The following shows the top-5 predictions for the issue using the HF DistilBERT model. First using only the summary field as the features, followed by using both summary + description as features.

HF DistilBERT predicted components for issue 1710955. Image by author.

The top line above shows the Messaging System as the top prediction, which is correct in this case. Second place goes to New Tab Page. As shown by the middle line, with just the summary field, the Messaging System is correctly predicted as the top component at 98.6% probability, followed by New Tab Page at 84.7%. The third and final line above shows how adding the description field to the features, the predicted top-5 remains the same but the classifier becomes more confident with Messaging System given 99.1% probability, and New Tab Page dropping to 62.5% on second place.

The summary field text for this issue report is simply “Change MR1 upgrade onboarding to Pin then Default then Theme screens”. Check the issue report for more details and the description. I think it's quite impressive to predict the component from such a short summary text, although I guess some words in it might be quite specific.

While this was a very good result for a random issue I picked, it is maybe not as believable as picking one in a live demonstration. In a live demonstration, I could also ask people to write a real or imaginary bug report right there, run the classifier on it, and ask their opinion on the correctness of the classification. But, as usual, the last time I tried that it was a bit difficult to get anyone to volunteer, and I had to pick one myself. But at least you can do it live.


Well, that's it. I downloaded the bug report data, explored it, trained the classifiers, compared the results, and dug a bit deeper. I found a winner in Transformers, while building myself an increased understanding of, and insights into, the benefits of each model.

In the end, writing this article was an interesting exercise in trying out the Transformer architecture on a real-world dataset, and gaining insight in comparison to the LSTM architecture I used before. Would someone use something like this in the real world? I don’t know. I think such applications depend on the scale and the real needs. In very large companies with very large products and development departments, I can see it providing useful assistance. Or it could be integrated as part of an issue tracker platform for added functionality. Think about the benefit of being able to advertise that your product uses the latest deep-learning models and AI to analyze your issues.. 🙂

Besides comparing to my previous work on building a similar bug classifier for the Qt company issue tracker, I also found a paper from around the time of my previous study, on a similar issue tracker classification task, called DeepTriage. I considered using it for comparison, but their target was to predict the developer to assign the issue to, so I left it out. It was still useful for providing some insight on where to find accessible issue trackers, and how they used the data for features.

However, when I went searching the internet for the word DeepTriage after that paper, I found another paper with the same name: DeepTriage for cloud incident triaging. Yes, in my previous articles on metamorphic testing of machine-learning-based systems I noted how many DeepXXX articles there are. Apparently it is now so popular to name your research DeepXXX that names even get re-used. In any case, this was a paper from Microsoft, discussing the cost of and need for fast triaging in datacenters, imbalanced datasets, and all that. And how they have used this type of technique in Azure since 2017, with thousands of teams. So, as I said, this likely becomes useful once your scale and real-world requirements catch up.

Well, my goal was to play with Transformers and perhaps provide an example of building a classifier from real-world NLP data. I think I did OK. Cheers.