Category Archives: testing

Algorithm Test Engineering Part 2: Machine Learning

If we don’t exactly know what we are testing, how can we test?

Let’s probe the machine to learn what it has learned. Photo by Zoltan Tasi on Unsplash

In my previous algorithm test engineering article, I discussed the testing and analysis of more classical algorithms, such as binary search. While the overall testing of algorithms can be complicated, most classical algorithms can be described as having quite clearly defined inputs, outputs, and their relations.

In this follow-up article I discuss testing and analysis in relation to a different type of an algorithm, where the input-output relations are more complex, harder to define, and the results sometimes subjective. Typically these are based on machine learning (ML). For example, the same set of Internet search-results can be good for one person at one time, but less so for (another) person, or at a different time. Same might apply to other properties, such as an e-nose trying to distinguish wines vs trying to distinguish whiskeys based on the same sensor inputs, or personal stress level based on biometrics. The context and sometimes the personal “feeling” can make the difference.

In this article, I explore the idea of what it means to test these types of systems. I start with testing as part of the ML model training process, extending into post-training testing. To make it all a bit more concrete, I look at a few industry examples, and reflect on my experiences on a search-engine project I built a few years back using customized Natural Language Processing (NLP) algorithms.

The Train, Validate, and Test Process

The basic test process in relation to machine learning is the testing and validation during the ML model training process. The following figure illustrates this:

ML train and test loop. Image by author.

This uses three datasets:

  • Training set: The ML model is trained on this, the algorithm tuning the model to better generalize over all the items in this set.
  • Validation set: the trained model performance is evaluated on this separate dataset. Training is repeated with the training set as long as validation score improves, or other end criteria is met.
  • Test set: a final, separate dataset used to perform a final test of the results, independent of the whole training loop and its validation.

The above process aims to build a model that generalizes as well as possible in relation to given criteria (e.g., accuracy or prediction error) over the training data. Typically this optimizes the model over the given dataset, but does not differentiate on which specific data items it still performs poorly.

Following this, post-training testing is a term referring to testing related to the fully trained model after the above process has fully finished. Let’s look at this type of testing in a bit more detail.

Post-Training Testing

I ran into this term during writing this article, from an article by Jeremy Jordan, and found it very fitting. He described it as investigating the logic behind the “final” algorithm (trained model). I like the term investigating, as I think of this as exploring the model behaviour with different inputs and input transformations. This is very much like the metamorphic testing (MMT) approach I wrote about earlier, with more focus on the exploration part. As a reminder, MMT makes modifications to inputs, and observes the effect on output. The modifications are called metamorphic transformations.

The Jordan article splits post-training testing to three types:

  • Invariance tests: Testing the ML algorithm with similar inputs that one would expect to give a similar output. The example given is two different (red) apples. Think metamorphic transformations.
  • Directional tests: Testing small input modifications in a given direction to see if the algorithm output reacts in an expected way. For example, does the house price go up with increase in house size? Another form of metamorphic transformation.
  • Minimum functionality tests: Testing the ML algorithm with specific instances of inputs, where we know the result we expect. For example, does the same apple always give the same result (e.g., classification percentages).

I discussed many similar aspects related to testing traditional algorithms in my earlier article, and I think the above test types can be applied to testing in general. However, in testing ML algorithms, I find the exploration has a bigger role due to the model (algorithm) being more of a black box, with complex internals, such as large number of weights and connections.

I believe how well the above test types can be applied depends on the data, model, and domain at hand. For example, when classifying images of apples, I find it quite intuitive to define variants of an apple, or how to transform such images. Other times it gets more complicated. Let’s look at some examples.

Examples of Evaluation Complexity

For example, in my metamorphic testing article I discussed the use case of self-driving cars. Now imagine that the ML algorithm feeds steering angles to a driving system. If this angle drifts by fractions of a percentage over time (with different model versions) for the same scenarios, when is it a relevant change in output? Or if it uses multiple camera inputs from the real-world, and we consider all the complexities of the real-world living environment as inputs, which change in input is relevant? The data space can be huge. In time-series data (such as driving), the data relations and change over time also needs to be considered.

The Jordan article gives another interesting example of defining invariants and using them for investigation/testing: real-estate price estimation. Specifically expecting a higher bathroom count to not lower house price, or decreasing house size to not increase the prize. These are in a way inverse invariants, describing the opposite of what one might expect. Maybe I expect the price to stay the same or rise, but not to drop. An interesting angle to look at the data and expectations.

In Jordan’s example, they noticed their dataset for smaller apartments was mostly from a big city where the apartments were more expensive. This had biased the model with regards to the smaller apartment sizes in general. I find this an interesting example of findings that would be easy to miss without in-depth exploration (or investigation) of the post-training results.

Operational vs Data Domain

Thinking of the metamorphic relations at a high level, one might consider what kind of changes in the operational domain (e.g., weather, driving environment, angles, for cars) one might observe. These are high-level changes describing human understandable concepts. Additionally, there are changes in the input data domain itself (e.g., sensor corruptions, interference, signal disturbance, environmental impacts).

These two domains are nicely described in an article by Daniel Angilov. Although not using the metamorphic testing terms, it nicely summarizes the difference of how one might build input transformations for testing ML at a higher level (operational level), and how they might map to the lower level representation (data level). Naturally, the data domain is much larger in scope, as it includes all possible input values. For the operational domain many of these may not be relevant (e.g., images of random data). Illustration:

Operational domain vs Data Domain, adapted from here.

The Operational Domain deals with more intuitive, high-level and human understandable, concepts, and transformations. For example, in an autonomous car scenario, foggy scenes, camera angles, or rainy weather vs sunny weather. These are mapped to the data domain, where the actual data transformations happen. We can apply operational domain transformations to produce high-level modifications for identified test scenarios. And lower-level data transformations (e.g., adding noise) for lower level changes.

Here is an example from my metamorphic testing article that I believe illustrates a (domain) transformation in the operational domain:

Operational domain transformation example. Image by author.

In this case, the camera angle of the imaginary car has been turned, as if the car itself was tilted. Well, the other car in front has also moved further away :). I took these pictures while walking across the street for my metamorphic testing paper, which explains it. A more data domain oriented transformation could be, for example, to add static noise to the image.

It is unlikely that we could cover all possible values in the data domain, or figure out a generic test oracle for all of them. Similarly for the operational domain, expecting to build exhaustive test sets for all possible input combinations is not feasible. Instead, we need to try to figure out the most relevant ones for both domains. And pick a suitable test set, explore its changes, and evolution over time. Unfortunately I do not have a universal recipe for this (at this time).

Testing in the Overall ML Process

Considering the overall ML train, test, and operations process, perhaps the most succinct and full of ML testing detail paper I have seen is the Google paper from the 2016 Reliable Machine Learning in the Wild workshop. Since the entire paper is just a list test types and their very concise descriptions, I list the ones that I found most interesting. These all relate to continuously testing:

  • Feature assumptions. Such as value ranges and most common values, as these may drift over time. Check assumptions stay valid.
  • Feature relations to target variable and each other. To understand your data better, and check the relations hold.
  • Computational cost vs benefit per each feature. Is it worth including all features for their computational cost? Does it evolve?
  • Leak of unwanted features into the model due to copy-paste type errors. Constant monitoring should alert if unwanted features are used.
  • Change of model evaluation score over time if consistently re-trained. For example, the effect of daily training with new data.
  • Model bias with regards to specific types of data, does some new data or data type bring new biases that should be considered.
  • Reproducibility of training, how big is the drift across multiple trainings on the same data.
  • Constantly monitoring how invariants seen in training data should hold for operational data over time.

I see the overarching theme is to constantly monitor and observe change and impact.

This is just a short list of ones I found most related to this article. There are many more related to the overall process from ML code review to operational monitoring. I recommend reading the Google paper for the very insightful list.

A Few Real-World Examples

Lots of words are nice, but real-world examples help make it concrete. Especially real ones from the industry. Recently I listened to some ACM ByteCast episodes, a few of which provided interesting industry insights.


One of these was discussing research on recommendation algorithms at Spotify. More specifically, Spotify’s recommendation systems and everything around that. The term used (in this talk) was evaluation of ML results.

This differentiated the evaluation in two forms: offline and online. Offline is the traditional ML evaluation approach of using a train/test split and metrics such as accuracy, precision, and recall. Online refers to evaluating the algorithm performance from user actions, such as clicks on a webpage, or in Spotify case I guess choices on music playlists. They discuss the concept of a proxy of user engagement, and whether some specific interaction in an app is a good measure of achieved goal, or if a returning user is an indication of satisfaction and a success. An interesting viewpoint on defining an ML test oracle, and constantly investigating the ML algorithm behaviour.

Another aspect they discussed with Spotify is to enable users to discover new content. Instead of simply relying on algorithm recommendations based on the current user profile, it can be useful to enable people to discover new content. In the Spotify case, this was discussed as selecting about 10% of the options presented to the user as these “new” types of choices. This is a good example of considering the overall user experience and goals when testing / investigating the ML algorithms as part of a system. It ensures the user is not locked into an algorithmic sandbox, supporting them in finding new ideas (music, search results, ..) to explore.


Another ByteCast episode I found interesting regarding this topics was the one on DuoLingo. DuoLingo is an app and a service designed to help people learn languages. They use Machine Learning for many parts of their system, including creating tasks for students tailored by their learning history, customizing content based on how well the user does on certain words and language structures, and what has worked overall for similar users, and many other ways I am sure I miss here. But to summarize, a heavily ML applying and successful application.

DuoLingo generates tailored quizzes for users, based on their learned profiles, together with expert teams. In this way it gamifies the evaluation of the algorithms and has users provide feedback by answering the tailored quizzes, and using approaches such as A/B testing. I find this a very interesting idea and approach for “online” algorithm testing and evolution with algorithm feedback from users. By enticing the user as part of the service use process to work on tasks that help the algorithm learn to be better for them.

An Example of My Own: ValtuustoPilvi

A few years back I build a search-engine called Valtuustopilvi (Finnish for CouncilCloud), as part of an open-data based service competition hosted by the city of Oulu in Finland. It was ranked first in this (small) competition, and so I ended up also hosting it as a service for the city for the following two years. As a result, I got familiar with various NLP algorithms and built my first service using them, including my first experiences in testing ML and NLP.

Service Overview

ValtuustoPilvi was designed for people to interactively search the city council meeting documents. I used various existing Natural Language Processing (NLP) libraries for the search algorithms, built some custom algorithms of my own, and adapted some well known and widely used ones.

Since it was just me working on this project, I did the development, testing, analysis, and everything else. Not sure if that is good or bad for this case, but that’s how it was. I packaged the code up a few years back, so no concrete code measurements and execution this time, but examples and reflection on how I tested, analyzed, and designed it. Hope it makes for something more concrete and interesting.

To concretize the service a bit, here is a screenshot of the search UI:

Valtuustopilvi interactive search UI. Image by author.

The user could interact with the word-cloud by selecting words in it to refine their search, or by typing query terms in the box shown at the bottom of above figure. The system built (in real-time) a new word-cloud matching the search results for the modified search query, and the user could iteratively continue to refine their search using the continuously updated UI / word-cloud.

I designed several algorithms to build the word-cloud in different hierarchies. The overall document set was not changing very rapidly, so for that I used a pre-computed data model, updated every night. On first arrival at the main search-page for all the documents, or for a specific pre-defined sub-category of council meetings (e.g., building permits), this type of data model was applied. It was based on a topic-model. Based this topic model, a set of words was chosen, each one weighted by how high the topic model ranked them in the topics its discovered. This is visible as the word size in the word-cloud above. More detail on this shortly. First a few words on pre-processing.

Computational Complexity Example: Preprocessing with Voikko

In my previous algorithm testing article, I discussed evaluation of algorithm computational complexity. In the case of ValtuustoPilvi, a related example is from using a third-party library, and how its computational complexity became important as my use was different from its previous use cases.

A very common task in ML, and in NLP especially, is to preprocess the data being fed to the algorithms. In case of the word-cloud building algorithms, one of the basic needs is to unify different forms of words into a single representation (word). In NLP this is called lemmatization (i.e., base form conversion, e.g., cars->car).

For this , I used the Voikko Finnish NLP library. Voikko had earlier been used primarily for shorter pieces of text, such as sentences and words in spell checking. However, I used Voikko to process whole documents, longest of which were hundreds of pages long. Using such larger inputs, the processing time was increasing exponentially. After reporting the issue, it was fixed in version 4.0.1 (visible in Voikko release notes).

In relation to the topic of this and my previous article, this illustrates how different requirements for an algorithm can be relevant in different scenarios, and how this may evolve over time. The Voikko web-page now actually lists the library commonly used as a tool in machine learning pipelines for Finnish texts, illustrating its more general use case drift in this direction.

Continous Data Analysis: Testing Preprocessing

Lemmatization tools such as Voikko are great for processing well written and structured text. However, real-world text often contains misspellings, domain specific words, abbreviations, and other similar anomalies. For these, tools such as Voikko will fail to process them as they are not in the standard dictionary.

Reading all the documents and manually fixing all text is not a feasible task for any reasonable sized input set. Instead, I collected all words / tokens that Voikko did not recognize. By ranking these unrecognized words and tokens in the order of their frequency, I could easily find words that occurred often but were not recognized, create custom rules to lemmatize them as needed, and thus continuously monitor, test, and improve the preprocessing. Even as the set of documents, and the topics they covered would evolve over time. This process ran each time the document set and topic models were updated (nightly). I checked the results once a week or so, updating the lemmatization rules when new and frequent misspellings or new words appeared.

Algorithm Customization: LDA for Word Clouds

Another topic I previously discussed is the importance of understanding the algorithms used. In this case, the topic model algorithm I used produces internal representations that are typically not used as part of the output. However, learning the algorithm in detail, I was able to use those internal properties as a basis for service features. Of course, understanding what you are testing does not hurt in testing either.

The topic modelling algorithm I used for building the word-cloud was Latent Dirichlet Allocation (LDA). This is an NLP algorithm that maps documents and texts to a set of “topics”. The discovered topics are often used as input to further steps in an application. First, a short introduction to LDA.

LDA maps words together in clusters that are called topics. To concretize, here is an example of four topics (word clusters) I built using LDA on one of the Kaggle COVID-19 research paper datasets (code and notebook here):

LDA based topic model with four topics, with word-weights shown. Image by author.

With LDA, interpreting these topics/clusters is left to the user. In this case, I would say the topics above seem to relate to patient treatments (topic 1), virus analysis (topic 2), structure of the virus (topic 3), and possibly infection mechanisms (topic 4).

In the case of the Valtuustopilvi, the exact meaning of the topics was irrelevant, only that a good set of representative words, and their weights was captured, and they represented a set of high-level and diverse concepts in the document set. The idea in the end was to help the user explore different and interesting high-level concepts in the document set. This end user goal is always good to keep in mind also when testing a system, ML based or not.

The topics themselves were never shown to the user, only the backend code used them as a basis to build the high-level word-clouds. Summing the weights in different ways provided a basis to choose the words for the cloud, and weight their size.

As I was testing and using the service myself, I also realized that giving fully deterministic results to the user, in form of highest weighted words by the algorithm was boring and did not serve the user well. Because doing exactly that would always give the exact same word-cloud for the document set. It would not help them explore and find new information over time, and felt boring after a few times seeing the exact same word cloud I already had interacted with. So similar to the Spotify example of trying to avoid the algorithmic sandbox, I added some randomization to reduce the weights of some words, or add some randomness to which of the top words in each topic were included in the word-cloud calculations, and how much weight was given to each.

Testing, Analysis, and Development of ValtuustoPilvi

The LDA example above was only used for the top level searches, where people were searching over all the documents, or over a specific pre-defined category of meetings. It requires pre-computation and is most useful when summarizing larger sets. For deeper search queries with dynamic document sets, I used an algorithm that was faster and adaptive to changing search results. For further details on this algorithm, and the others I used, refer to the paper draft I wrote on the topic (…) back in the day.

An example of base tests in this was to check that the correct algorithm is used. These base tests are often quite simple and intuitive to write, with some basic understanding on how the system works. But how to test the word cloud is a good representation of the actual topics for the user? This is much more subjective and harder to test.

Looking back at this, I see different levels of testing I did on this:

Test sizes and oracle levels in the Valtuustopilvi project. Image by author.

First, I looked at a set of documents for a category, and whether the algorithm provided results made sense to me. This is the Me at the bottom of the above pyramid. The test oracle was myself, so very limited view but also very much to the point, and easy to check (I know exactly what I am thinking, at least usually :)).

Next came test groups, such as the competition reviewers and the feedback they gave. This feedback was more generic, without detailed knowledge of the algorithms, but still specifically giving feedback on how they felt about different parts of the service. Which is practically the algorithm output transformed into a user presented form.

The third level is the general user population. It has a large sample size in potentially covering all users of the service. However, the test oracle is very generic, as it can only rely on some indirect clues that can be collected from the user interaction with the service. This is similar to the Spotify (e.g., evaluation and search) and DuoLingo examples on analyzing user interactions, and performing experiments for tuning algorithms. The same also applies for search engines in general, such as Google search results optimization based on aggregated and anonymized interactions (linked in Feb/2022).

In the Valtuustopilvi case, the service did not have internet scale number of users, so there was not that much to analyze at the top level of all users. I also included a feedback mechanism in the UI, but beyond few basic comments not much came of it. However, once a service reaches a large enough scale, I believe the general user population is a very useful opportunity to pursue for evaluation data and ideas. As illustrated by the Spotify, DuoLingo, and Google examples. Of course, keeping in mind all the privacy and similar aspects.


Looking back at what became of this article, the big difference in testing ML based algorithms compared to traditional algorithms, as I see it, is the black box nature of the ML models. They embed complex internals, such as a large number of weights, nodes, and connections. And each of these are different across applications, model configurations, training sessions, and the data used. Thus there is no single way to test a specific ML algorithm, but the process involves much investigation and exploration. And this is a constant process, as the data, models, and out understanding based on it often evolve over time.

Some of the specific points I collected from this:

  • Data evolution, and constantly monitoring it,
  • Trained model evolution, and constantly monitoring it,
  • Identifying important operational and data invariants to cover,
  • Post-training testing types and metamorphic testing,
  • Different viewpoints, such as inverse assumptions/invariants,
  • Investigation and exploration as a central concept,
  • User (system) feedback and its proxies in evaluation at different scales,
  • Keeping the entire ML operations pipeline and related quality in mind,
  • Breadth in end (user) goals, such as avoiding algorithmic sandboxes,

Looking at the above, another common theme is the focus on continous test, monitoring, and analysis rather than point-in-time as often in classical testing.

Overall, I find there are many aspects to consider when testing ML algorithms, and the systems built around them. Much of the same concepts apply as for traditional testing, but hopefully this article helps provide some insights into the specifics of ML testing.

That’s all for now. Cheers 🙂

Algorithm Test Engineering: Exploratory Job Analysis

What does it mean to test an algorithm? Let’s play

Plot twist. The article has an RNA algorithm “example”.. Photo by Braňo on Unsplash

Recently I had a discussion about what it means to test an algorithm, or what it means to do algorithm test engineering. I couldn’t quite come up with a convincing definition for myself. At a basic level, maybe you figure out some basic rules of the algorithm, give it some input and output. Check the results. But I believe there is more to it, including algorithm fit for purpose, fit to data, evaluating alternatives, adapting the algorithm to the problem and system at hand, designing and running experiments, learning, etc.

To get a better idea about all this, I will explore the topic in this article. I try analyzing and testing two algorithms, and see where it takes me. I start in this article with more classical algorithms, where the inputs, outputs, and their relations are clearly defined. In a follow-up article, I hope to look at a machine learning-based one where the definition of the correct results is not so clear.

This story was originally published on Medium, where it can still be found.

Choice of Algorithm

Since my goal was to analyze the testing and analysis of algorithms, I needed to pick an algorithm to work with. Luckily, Wikipedia provides an extensive list of algorithms. I picked two classics, binary search and Levenshtein edit distance. I will warm up first with a basic test and analysis of binary search. This is followed with a bit broader look at Levenshtein.

In general, I find testing an algorithm without a broader context to miss some key points from the overall process. So I set up an application for the Levenshtein algorithm to give my testing and analysis exploration some context. This context is an application for comparing edit distances over RNA strings, more specifically motivated by the COVID-19 virus structure.

First, binary search. It is an algorithm designed for efficient search over a list of items, in logarithmic execution time. For my experiments, I use the IMDB names dataset, and a binary search Python implementation (adapted from a set of Stack Overflow answers).

Understanding the Algorithm

When testing an algorithm, I first need to understand what I am working with. To gain some intuition for this, I start with a look at linear search vs binary search. Linear search is a very basic search that simply loops the list in order to find a match, and provides a base reference here. To illustrate the difference, and what the binary search algorithm is about, a look at a set of 10 names from the IMDB dataset:

Example set of 10 names from the IMDB dataset. Image by author.

I picked the last 10 in the dataset, because it looked like a good random mix. Binary search requires the input to be sorted, so lets do that. Sorted and re-indexed these 10 become like this:

The IMDB names dataset slice, sorted by string order (first name). Image by author.

To build a better understanding of linear and binary search, I created illustrations on applying them to the above list to find “Lu Bevins”:

Linear search (left) vs Binary search (right) on the IMDB name slice. Image by author.

Linear search always loops through the entire list from the beginning, until it finds a match. The list can be in any order. Binary search constantly splits the list in two and checks the middle value for match, repeating this in smaller chunks, until it finds a match, or runs out of elements to check.

Binary search is a well known and simple algorithm, but above illustrates how I would go about trying to understand an algorithm. I find concrete examples and diagrams help me build an understanding. Access to a (domain) expert to help build and check these really helps. Especially with less known algorithms.

The above example list of 10 is very small, and does not really illustrate the benefit of binary search. With larger size, the comparison between linear and binary search should look like this (expecting logarithmic complexity):

Illustrating the theoretical logarithmic execution time vs linear. Image by author.

Such assumptions are good to check, so I will do that next.

Testing Computational Complexity

A common property of interest when testing and analyzing algorithms is computational complexity. This refers to the resources (processing time, memory, etc) the algorithm uses at different scales. Here I focus on processing time, although similar analysis could be performed for other resources.

To explore the scalability of binary search, I sampled the IMDB names dataset for 10 to 1000 names, in increments of 10, and executed both linear and binary search on random items in these sampled lists.

To gain more statistically confident results, I ran the experiments 100 times for each of these lists. I used my codeprofile library for execution time profiling. The following figures illustrate the measured average execution times over the 100 runs at sample sizes vs the theoretical, expected time:

Measured performance (left) vs theoretical (right). Image by author.

The above shapes show the curves follow the expected curve on logarithmic execution time growth (binary search) and linear growth (linear search). So the measurements matches the theory (scales are different in the figures due to parameters, I was just interested in the shapes).

1000 items is still a very short list for real performance measurements. Running the same experiment for 10 to 1 million samples (in 10x increments) shows the effect and benefit of binary search more clearly:

Binary search vs linear search. Left side = 10–1000 list size, right side = 10–1M list size. Image by author.

In the above figure, with list size 1 million, the difference is so big the binary search line looks flat compared to linear search. Based on the above simple experiments, I would say the assumptions hold for computational complexity.

The linear and binary search are well known and studied algorithms, and their complexity would be well known. As such, these demonstrate how we can evaluate this in practice. I have found evaluating practical computational complexity of actual implementations can produce surprises, especially different use cases for custom algorithms from 3rd parties.

Testing Input-Output Relations

Knowing the computational complexity is good, but we also need assurance that the implementation works as intended. For binary search, basic tests for finding selected names, boundaries, category-partitioning, etc are good examples for this. I call these base assurance tests.

To scale this up, I used the test generator from the above 10 to 1 million computational complexity analysis, and added the following check to it:

  • for every item, linear and binary search find the same index.

I call this an expected/assumed invariant over the data/algorithm. I was expecting the two algorithms to give the same results for the same input, and that the assertion would simply be there for extra assurance. However, the assertion was failing because the indices given by the linear and binary search were not always the same even for the same input.

After some thought on what could cause this difference, and how the algorithms work, I figured there might be duplicate names in the dataset. Here is a quick check to see if this is the case (concatenated to two rows):

Listing items that have duplicate names in the dataset (slice). Image by author.

Above figure shows “Robert Ellis” and “Harrison Ford” as having duplicates. To check a bit deeper, a look at the name “Harrison Ford”:

Check whether Harrison Ford exists in multiple instances as primary name. Image by author.

There are four people listed with the same name “Harrison Ford” in this dataset. Having established there are duplicate names, some summary metrics would be of interest to see how many there are:

Counting the number of duplicated primary names in the dataset. Image by author.

The above figures show a total of 853911 (row count) distinct names have duplicates. The one with the most duplicates is “David Smith”, repeating 340 times. What does the above mean for the mismatching search results? Due to how the two algorithms work, searching for “David Smith” would likely result in both linear and binary search returning a “David Smith”, but different ones. Linear search always returns the first one in list, binary search can give any result from the list of duplicates.

To me, this illustrates, how testing and analysis of the algorithm helps understand both the algorithm, and the data it is being applied to better. And how it is good to assert and test your assumptions about the data and algorithm. Having a clear goal, and performing this process systematically should help.

Besides invariants over the algorithm output, one can also consider them over input. For example, binary search expects its input to be sorted. The choice is whether to limit the scope of testing to expect the environment (the overall system) to enforce this, or to expect the algorithm implementation to handle it. I would refer to this as defining the algorithm (test) scope.

As a second example, and to gain some broader insight into the topic, I look at the Levenshtein edit distance algorithm here. Edit distance is also sometimes referred to as approximate string matching. It refers to the number of edits (character changes) required on a string to convert it to a specific (target) string. Known applications include providing spelling suggestions, approximate string search queries, and DNA/RNA sequence comparison.

Compared to the single search operation of binary search, the classic Levenshtein algorithm supports three operations: insert, delete, and substitute (a character in a string). The edit distance it gives is the minimum number of these operations required. There is also a variant called Damerau-Levenshtein adding a transposition operation, but to keep this example simpler, I will use the classic Levenshtein here.

Some examples:

  • test -> tent: one substitution of s->n = Levenshtein score 1
  • Levenstein -> Levenshtein: one insertion of h = score 1
  • Levenshtein -> Levenstein: one deletion of h = score 1
  • bad -> base: one substitution of d -> s + one insertion of e = score 2
  • bad -> based: two insertions of s + e = score 2

Defining the Example Application

As I noted before, I will look at Levenshtein from the viewpoint of an example application, to give it some context that my binary search example lacked. As Levenshtein has been applied in RNA analysis, I use RNA search as the application. Specifically, the COVID-19 virus, which is RNA based. There is plenty of information available on it, and it gives me a chance to play with longer RNA strings on a timely topic. As with binary search, I start with research to understand the application domain as well as the algorithm applied.

The COVID-19 RNA sequence is described as having a length of 29881 “characters”, consisting of a four letter alphabet representing its chemical base. One part of the COVID-19 structure that is specifically considered interesting is the spike protein. This interest comes from the role of the spike protein in infecting the human cells.

The spike protein is described as a 3831 “character” long sub-sequence of the overall virus RNA. Wikipedia describes the Omicron variant as having 60 mutations from the original Wuhan variant, 32 which are in the spike protein.

So in this example I look to apply Levenshtein on sequences of length 3831 (possibly up to 29881). Following the Omicron mutations, I will use up to 60 edits as the maximum in my testing. As with binary search, I am defining the algorithm (test) parameters based on domain knowledge and expected use.

DISCLAIMER: I have not done in-depth research on RNA search, I am sure it is far more advanced, but this works as a basic example for this article.

Experiment Setup

I used the Python weighted-levenshtein library to run my experiments. My base assurance tests include handling empty string, string of different lengths, invalid characters, and a selected string mutations. And so on.

To scale up the testing, I again built a simple input generator to generate input strings, and apply the algorithms supported operations (insert, delete, substitute) on them. This generated random strings of given length, applied the selected count of operation(s) on the string, and calculated the distance score for the modified vs the original (non-modified) string. At the same time measuring execution time, and checking a set of output invariants:

  • The score given by the algorithm should always match the number of operations applied.
  • Flipping the compared strings, e.g, car->cars to cars->car, should always give the same score. Since the operations are effectively reversible.

My test executor checked these invariants for every generated (test) execution of the algorithm.

To support the base assurance tests, I first used the input generator to build large number of strings of relatively short length (5–10), with varying number (0–9) operations applied. This gave me increased confidence, higher position coverage, and easier to debug results (with limited length). For score calculations, string boundaries, grouped operations, etc. It also showed me how my assumptions about the above two invariants was wrong for the first one.

What I found is, running the algorithm with enough iterations, the random chance will find cases where the target string built from multiple substitution operations can be achieved with fewer combinations of insert, delete, and substitute operations together. Here is an example case:

Applying 3 substitutions on a source to produce a target, expecting Levenshtein score of 3. Image by author.

The above shows the source string, and the target string after making 3 substitutions at randomly selected indices, with randomly selected characters. Remember, my target was COVID-19 RNA search, with 4 characters (ABCD). The following substitutions have been made; index 3: A->B, index 4: C->A, index 5: D->C. Expecting Levenshtein score to match the number of edit operations (3 substitutions here), this should give a score of 3.

However, instead of giving a score of 3, the above gives a score of 2. After some analysis, I realized the same target can be reached with one insert and one delete, giving the minimal score of 2:

Achieving the same target as above with only 2 operations (insert + delete). Image by author.

With this knowledge, I had to disable the failing invariant from the checks. Instead, I opted to collect statistics on the score achieved with different number of operations applied. I could then check that the overall score distribution was not too far off from the number of operations applied, if not exactly the same. The second invariant (flipping strings) was fine for all generated tests.

I find this is an example of what one might call exploratory algorithm testing. Take any assumptions on how the algorithm works, encode them as invariants (or whatever works), generate further tests to see if they hold. Learning about the algorithm and its domain along the way.

After successfully running these smaller tests, and feeling confident I had sufficiently covered my assumptions, I increased the generator string sizes higher for testing computational complexity.

Computational Complexity: Setup

For this evaluation, I used the input generator to produce strings of length from 100 to 5000 in intervals of 250 (100, 350, 600, 850, … until 4850). The target string length I set earlier was 3981 character length, which gets covered here, with some wide margin. For statistical confidence, I repeated each test size (100 to 4850) 100 times.

For each generated string, I applied the following operations 1, 3, 9, and 60 times (60 being the omicron limit defined above):

  • Substitute characters in the string at random (non-overlapping) locations
  • Insert random characters at a random location in the test string
  • Delete random characters in the test string
  • Combinations of above, all at the same time

My goal was to see if a different algorithm operations, their number, or the string length would affect the computation time. Or more generally, to test the algorithm operations and parameters, to see how they affect its execution time.

Computational Complexity: Measurements

The results for the above described experiments were very similar in terms of performance. The following figure illustrates the substitution tests, and their overall execution time:

Levenshtein execution time vs string size. Image by author.

Substitute_x refers to substituting X character in the source string, and calculating the Levenshtein score. The x-axis in above is the string size from 100 to 4850. The y-axis is the time it took to run the 100 experiments.

The four lines on the above graph are practically overlapping, as the execution time was so close in each case. Since this seemed a bit suspicious, I ran multiple experiments separately with varying parameters to see if this holds true, which it did. A bit strange, but ok. Having a domain expert and a team to bounce thoughts on this would have been real nice.

I omitted showing all the rest of the operations, their size variants, and combinations here. They were all very close, indicating that the execution time had little to no dependence on the type or number of operations.

The above curve in general resembles an exponentially growing curve. To check this, I experimented with a few parameters to try to visualize a matching form of an exponential curve. Here is the final measured execution time plot vs a theoretical plot for exponential time:

Execution time (left) vs exponential plot (right). Image by author.

From this, I would say the algorithm has exponential complexity. Wikipedia actually has a few fancy words on the Levenshtein computational complexity being high. From this, my verdict is that these results match the theoretical expectation, this algorithm seems to scale less optimally for longer inputs.

Testing an Adaption to the Algorithm:

Sometimes we test and analyze an algorithm, and realize it is not a great fit for the need. But we may wish to try an alternative, or an adaptation to this algorithm to address the issues. Let’s look at one example here.

Exponential execution time growth is generally considered bad for scalability. For shorter string (e.g., spell checking a word or a shell command) this is probably not an issue, as the exponential effect is so small on shorter input. However, in my RNA story I wanted to search and analyze sequences of length 3831+. For this, investigating possible speedups would seem useful.

Let’s say we decide to try to make this faster by trying to run the algorithm on smaller pieces of the input. This should make the execution time grow more linearly, instead of exponential. In the end, we just need to rank the search results relatively, not necessarily an exact score. And once the top results are known, one could calculate exact scores for that subset. Since this is a hypothetical example, I call this a hypothetical development idea :).

Here is an execution time graph for splitting the generated input strings to slices of length 100 characters, and summing up their scores:

Execution time after splitting to slices of 100. Image by author.

From the execution time perspective, the above graph looks much better. Exponential growth is gone, turned linear. But how about the results, how large is the difference in score between calculating it for the full string vs summing up the scores for slices of 100? The following tables illustrate this:

Substitution operator, slices of 100 vs full string. Image by author.

In the above table, s_x refers to substituting X character in the string. Size is the total string length. The postfix of /full refers to running the Levenshtein algorithm on the full string. The /100 version uses slices of 100 characters. Each configuration was repeated 100 times to get statistical coverage. Which is why, for example, s_1/full has a score of 100 (100 runs, each scoring 1).

For this substitution example, sliced version scores are very close to the full version. Perhaps because substitution is a local change, and does not so much affect multiple slices. There are a few rows in the table where the summed score has a difference of 1–2 edits (6000 vs 5998 and 5999). This is because of the issue I noted above, where insert and delete can work together to find a smaller minimum. If the substitution operation was enough, this would seem a plausible adaptation. But the goal was to support all operations.

In the below tables, the insert operation uses prefix of i, and delete a prefix of d. So i_x refers to inserting X characters, and d_x deleting X characters:

Insert operator, slices of 100 vs full string. Image by author.

The above insert table shows how the i_x/full score always matches the number of operations (*100) as expected. With i_x/100 slices, the score starts to get larger as the string length increases, and the number of operations applied increases (from 1 to 3, 9, and 60). After some analysis, I concluded this is because inserting a character at the beginning of the string shifts all the rest of the slices forward to the right, and thus causes each slice to require multiple edits, and increase the summed score.

The following figures illustrate this issue:

Above string of length 10, Inserting B, Levenshtein score calculated as full string. Image by author.
Same as above string, Levenshtein score calculated in slices of length 3. Image by author.

The above is an example of slice size 3 instead of 100 but the same concept. Depending on the location of the insert, the edit count propagates towards the end and increases the sum by a large amount. For each slice of 3 to match, it has to remove the first and add the last char. Same for delete operation:

Delete operator, slices of 100 vs full string. Image by author.

The above delete table shows very similar behaviour to the insert table. And for the same reasons, a delete shifts all slices left when insert shifted them right.

So, while the slicing approach would be successful in cutting down the algorithm processing time, it would definitely not work to properly rank the search results for insert and delete. From the application perspective, I would consider this adaptation a failed experiment, except for learning.

However, for the purposes of this article, I find this is a good experiment. It shows how one might run tests on an algorithm, analyze its results, fit for purpose, and other properties, build hypothesis, implement experiments, evaluate it, and iterate. Optimally, this is how I would see algorithm testing and analysis contributing to the overall development more broadly, helping to design and evaluate experiments for improved algorithms and adaptations.

Concluding the Levenshtein Search Experiment

In a more realistic scenario, I hope I would be working with a team of experts. And have resources to perform research on what are all the approaches, state of the art, and everything else on the subject. In fact, when the algorithm and its application domain are complex, and the opportunity is there, I would consider this an essential part of the testing and analysis process. Working with domain experts, R&D team, and researching the domain knowledge. Here I will limit the scope as my resources are limited.


My goal with this article was to explore the idea of what algorithm testing (and analysis) could be. A summarizing list makes it simpler to remember:

  • Traditional testing techniques can define base assurance tests with expert defined inputs and expected outputs
  • Building a good, in-depth understanding of the algorithm helps understand how to test it, and how to adapt it to a domain
  • This includes building a similarly good understanding of the data it is applied to, and how this relates to the algorithm
  • Ideally, above works iteratively in interaction with research, development, and testing
  • Besides verification, algorithm testing can contribute to understanding its limits, potential optimizations, and compare alternatives
  • Identifying assumptions about the data and algorithm input describes what the algorithm is expected to work with
  • Exploratory data analysis can use these assumptions as input, check if they hold, and refine them
  • Identifying assumptions about the data and the algorithm output gives a basis to write invariants to check in every test
  • An automated test generator helps scale testing with these invariants, and check whether the assumptions hold
  • Scope of testing relates to defining the scope of the algorithm vs the overall system using it, responsibilities for input and output handling
  • Theoretical computational complexity is good, but practical evaluation if it holds for the implementation and available data is good to ensure
  • Algorithm testing can form a tool for exploring the algorithm, by formulating hypothesis about it and using testing to evaluate them
  • Tools and techniques, such as metamorphic testing can help evaluate robustness of the algorithm to different types of valid and invalid inputs
  • Testing and analysis is optimally an iterative process, where the end result is the final generated tests and checks, and all the learning on the way


I started this article with the idea to explore what it means to test an algorithm, or what it could mean to be do “algorithm test engineering”. I like to think I made some progress, although I am sure the definition can be subjective, much like “goodness” of a (machine learning) algorithm result.

Both my examples in this article, binary search and Levenshtein edit distance, are quite simple and basic algorithms in the end. As noted in this article, the basic testing of such algorithms is not too complicated. However, considering algorithm testing and analysis as part of a broader research and development process, I believe the interactions, collaborations, research and development contributions can make it more diverse and interesting.

In this article, I looked at two “classic” algorithms, where the input, output, and their relations are quite straightforward to define. In a followup article I will look at machine learning based algorithm(s), and an example of an algorithm where the input-output relations, and the correctness of the output is harder to define, subjective, or “undefinable”, in a traditional sense. Until then.

That’s all for now. Cheers.

Metamorphic Testing of Machine-Learning Based Systems

Techniques for Testing Autonomous Cars and other ML-Based Systems

Testing machine learning (ML)-based systems requires different approaches compared to traditional software. With traditional software, the specification and its relation to the implementation is typically quite explicit: “When the user types a valid username and matching password, they are successfully logged in”. Very simple to understand, deterministic, and easy to write a test case for.

ML-based systems are quite different. Instead of clearly defined inputs and logical flows based on explicit programming statements, a ML-based system is based on potentially huge input spaces with probabilistic outcomes from largely black-box components (models). In this article, I take a look at metamorphic testing, which is a technique that has become increasingly popular to address some of the ML-based systems testing challenge. I will go through some of the latest research, and present examples from different application domains.

These two images can be used to represent some metamorphic tests for a ML-system. Read on to see how.. 🙂 Image by author.

Metamorphic Testing

Metamorphic Testing (MMT) was originally proposed quite a while back, at least up to (Chen1998). Having worked a long time with software testing research, I always viewed MMT as a curiosity with few real use cases. With ML-based systems, however, it seems to have found its niche nicely.

The general idea of MMT is to describe the system functionality in terms of generic relations between inputs, the generic transformations of those inputs and their outputs, rather than as mappings of specific inputs to specific outputs.

One typical example used for metamorphic testing in the past has been from testing search engines (e.g., Zhou2016). As search engines are these days practically natural language processing (NLP)/ML-based systems, they also fit the topic of this article well. To illustrate the concept, I ran two queries on Google (in October 2020):

A Google query for “car” and its refined version “autonomous car”. Image by author.

The first query is just one word “car”. The second query adds another word to the first query, “autonomous”. So the query now becomes “autonomous car”. This addition of a restrictive search keyword is an example of an input transformation (or “morphing”, in the spirit or metamorphic testing):

Example input transformation: changing the original input (query) to be more restricted. Image by author.

And to perform a check on the test results (a test oracle), we define a matching relation that should hold when the input transformation is applied:

Example output (metamorphic) relation: More restricted query returns fewer results. Image by author.

In this case, adding the restrictive search term (“autonomous”) to the previous query (“car”) changes the result set, restricting it to a smaller set. From 8.3 billion results to 159 million. The metamorphic test would not specify these exact values, but rather the relation “restricting the query leads to fewer search results”. And one could generate several (seed) inputs (search queries), associated restrictive keywords for transformations, and run the query and check the metamorphic relation holds (restricting the query produces fewer results). For more details on MMT with search engines, see (Zhou2016).

The above is an example of what metamorphic testing refers to. You transform (morph) your inputs in some way, while at the same time defining a relation that should hold from the previous input (and its output) to the new morphed input (and its output). The key concepts / terms are:

  • morph/transform: modify a seed input in a way that your defined metamorphic relations should hold
  • metamorphic relation: the defined transformation of the input should have a known/measurable effect on the output. Checking that this relation holds after the transformation is the test oracle of metamorphic testing. (Test oracle is a general term for a mechanism to give a verdict on test result)
  • seed inputs: the inputs that are used as initial inputs for the tests, to be transformed. If you know the output of the seed input, you may use it to define a stricter relation (output should be correct). But even without the seed output, you can still define a relation check, but it might be a bit more relaxed (output should be similar but you don’t know if it is correct).

More generally metamorphic testing refers to defining such transformations, and observing their impact (metamorphic relations) on the result. The effectiveness and applicability then depends on how well and extensively these can be defined. I will present more concrete examples in the following sections.

Problem Space

Why would you want to use metamorphic testing? I will illustrate this with an example for autonomous cars. Autonomous cars are recently going through a lot of development, getting a lot of funding, have safety-critical requirements, and are highly dependent on machine-learning. Which is maybe also why they have received so much attention in MMT research. Makes great examples.

For example, the Tesla Autopilot collects data (or did when I was writing this..) from several front-, rear-, and side-cameras, a radar, and 12 ultrasonic sensors. At each moment in time, it must be able to process all this data, along with previous measurements, and come up with reasoning fulfilling highest safety-standards. Such real-world input-spaces are incredibly large. Consider the two pictures I took just a few days apart, near my previous office:

Same place, few days apart. Image by author.

Just in these two pictures there are many variations visible. Snow/no snow, shadows/no shadows, road markers / no markers, connecting roads visible, parking lots visible, other cars, and so on. Yet in all such conditions one would be expected to be able to navigate, safely. To illustrate the problem a bit more, here are some example variants in that domain that quickly come to mind:

A few examples of items a ML-algorithm in an autonomous vehicle needs to consider. Image by author.

Besides these, one can easily expand this to different locations, road shapes, object types, bridges, trains, … Other sensors have other considerations, every location is different, and so on.

In different domains of ML-based system applications, one would need to be able to identify similar problem scenarios, and their relevant combinations, to be able to test them. Manually building test sets to cover all this is (for me) an unrealistic effort.

Metamorphic Testing with Autonomous Cars

Metamorphic testing can help in better covering domains such as the above autonomous cars problem space. As the interest is high, many approaches for this have also been presented, and I will describe a few of those here.

Covering Image Variations

The DeepTest work in (Tian2018) uses transformations on real images captured from driving cars to produce new images. In this case, the metamorphic attributes are:

  • Seed inputs: Real images from car cameras.
  • Metamorphic transformations: moving, tilting, blurring, scaling, zooming, adding fog, adding rain, etc. on he original images
  • Metamorphic relation: the autonomous driving decisions should show minimal divergence on the same input images after the transformations.

The following illustrates this with some simple examples using the road image from outside my previous office. In the following, I “transformed” the image by simply rotating the camera a bit at the location. I then added the arrows to illustrate how a system should “predict” a path that should be taken. The arrow here is manually added, and intended to be only illustrative:

Viewed from a different angle, the real-world path should still be the same (or close). Image by author.

And the same, but with the snowy ground (two transformations in the following compared to the above; added snow + rotation):

With and without snow, the path should be the same (or close). Image by author.

Of course, no-one would expect to manually create any large number of such images (or transformations). Instead, automated transformation tools can be used. For example, there are several libraries for image augmentation, originally created to help increase training dataset sizes in machine learning. The following illustrates a few such augmentations run on the original non-snow image from above:

Automatically generated augmentations / transformations of the earlier road image. Image by author.

All these augmented / transformed images were generated from the same original source image shown before, using the Python imgaug image augmentation library. Some could maybe be improved with more advanced augmentation methods, but most are already quite useful.

Once those transformations are generated, the metamorphic relations on the generated images can be checked. For example, the system should propose a very similar driving path, with minimal differences across all transformations on acceleration, steering, etc. Or more complex checks if such can be defined, such as defining a known reference path (if such exists).

Again, this process of transformation and checking metamorphic relations is what MMT is about. It helps achieve higher coverage and thus confidence by automating some of the testing process for complex systems, where scaling to the large input spaces is otherwise difficult.

GAN-based Transformations with MMT

A more advanced approach to generate input transformations is to apply different ML-based techniques to build the transformations themselves. In image augmentation, one such method is Generative Adversarial Networks (GANs). An application of GANs to autonomous cars is presented in (Zhang2018). In their work, GANs are trained to transform images with different weather conditions. For example, taking a sunny image of a road, and transforming this into a rainy or foggy image.

The argument is that GAN generated weather effects and manipulations are more realistic than more traditional synthetic transformations. (Zhang2018) uses the NVidia UNIT (Liu2017) toolkit to train and apply the GAN models, using input such as YouTube videos for training.

Images illustrating the GAN results are available on the UNIT website, as well as in higher resolution in their Google Photos album. I recommend having a look, it is quite interesting. The smaller images on the UNIT website look very convincing, but looking more closely in the bigger images in the photo albums reveals some limitations. However, the results are quite impressive, and this was a few years ago. I expect the techniques to improve further over time. In general, using machine learning to produce transformations appears to be a very promising area in MMT.

LIDAR Transformation

Besides cameras, there are many possible sensors a system can also use. In autonomous cars, one such system is LIDAR, measuring distances to objects using laser-based sensors. A study of applying metamorphic testing on LIDAR data in the Baidu Apollo autonomous car system is presented in (Zhou2019).

The system first identifies a region of interest (ROI), the “drivable” area. It then identifies and tracks objects in this area. The system consists of multiple components:

  • Object segmentation and bounds identification: Find and identify obstacles in ROI
  • Object tracking: Tracking the obstacles (movement)
  • Sequential type fusion: To smooth the obstacle types over time (make more consistent classifications over time by using also time related data)

The (Zhou2019) study focuses on metamorphic testing of the object identification component, specifically on robustness of classification vs misclassification in minor variations of the LIDAR point cloud. The LIDAR point cloud in this case is simply collection of measurement points the LIDAR system reports seeing. These clouds can be very detailed, and the number of measured points very large (Zhou2019).

The following figures illustrates this scenario (see (Zhou2019) for the realistic LIDAR images from actual cars, I just use my own drawings here to illustrate the general idea. I marked the ROI in a darker color, and added some dots in circular fashion to illustrate the LIDAR scan. The green box illustrates a bigger obstacle (e.g., a car), and the smaller red box illustrates a smaller obstacle (e.g., a pedestrian):

My doodle of a LIDAR image :). Image by author, using a car image by OpenClipArt from Pixabay.

The metamorphic relations and transformations in this case are:

  • Metamorphic relation: same obstacles (objects) should be identified both before and after adding small amounts of noise to the LIDAR point cloud.
  • Transformation: add noise (points to the LIDAR point cloud)
  • Seed inputs: actual LIDAR measurements from cars

The following figure illustrates this type of metamorphic transformation, with the added points marked in red. I simply added them in a random location, outside the ROI in this case, as this was the example also in (Zhou2019):

My doodle of a transformed LIDAR image with added points in red. Image by author, using a car image by OpenClipArt from Pixabay.

The above is a very simple transformation and metamorphic relation to check, but I find often the simple ones work the best.

In summary, the MMT approach here takes existing LIDAR data, and adds some noise to it, in form of added LIDAR data points. In relation to the real world, such noise is described in (Zhou2019) as potentially insects, dust, or sensor noise. The amount of added noise is also described as a very small percentage of the overall points, to make it more realistic.

The metamorphic experiments in (Zhou2019) show how adding a small number of points outside the ROI area in the point cloud was enough to cause the classifier (metamorphic relation check) to fail.

As a result, (Zhou2019) report discussing with the Baidu Apollo team about their findings, getting acknowledgement for the issues, and how the Baidu team incorporated some of the test data into their training dataset. This can be a useful approach, since metamorphic testing can be seen as a way to generate new data that could be used for training. However, I think one should not simply discard the tests in either case, even if re-using some of the data for further ML-model training. More on this later.

Metamorphic Testing of Machine Translation

Not everyone works on autonomous cars, so examples from other domains are important for broader insight. Outside autonomous vehicles, testing of automated language translations with ML-based NLP techniques has received some attention in recent years (for example, He2020, Sun2020). I found the (He2020) paper to be especially clear and sensible, so I use their approach as an example of the metamorphic properties for translation testing here:

  • Seed inputs: Sentences to be translated
  • transformation: replace words with specific part of speech (POS) tag in the input sentence, with another word that has the same POS tag. For example, a verb with another verb. Finally, use another NLP model (Google’s BERT in this case) to “predict” a suitable replacement candidate word.
  • metamorphic relation: the structure of the transformed output should match the original translation output sentence structure for the original input. Large deviations indicate potential errors. The test oracle metric is the difference in output sentence structures for the automated translation on the original input vs the transformed input.

Here is an illustrative example using Google Translate, and a sentence I picked (at the time) from this article. Translating that sentence from English to Finnish:

Translate a sentence from English to Finnish. Image by author.
Mask a word (noun) for a metamorphic transformation. Image by author.
Use BERT to predict a word to replace the masked word (one that “fits” in the mask). Check the metamorphic relation holds. Here it is fine. You just have to trust me on that, I am Finnish :). Image by author.

The above shows the metamorphic transformation and how the check for the defined metamorphic relation should hold. In this case the sentence structure holds fine (in my opinion as a native Finnish speaker) and the result is good. I performed these experiments manually to illustrate the concept, but the test process is the same whether automated or not. Overall, trying a few different sentences, Google Translate actually worked very well. Great for them.

To be honest, I did not really use BERT in the above example, since it was just one example I needed to illustrate the concept. I just picked a word that makes sense (to me). However, HuggingFace has really nice and easy to use implementations available of BERT and many other similar models if needed. I have used them myself for many other tasks. Much like the image augmentation libraries in the car example, the NLP libraries have come a long way, and many basic applications are quite simple and easy these days.

For more details on MMT for machine translation, I recommend checking the papers, especially the (He2020) is quite readable. An extra interesting point here is again the use of another ML-based approach to help in building the transformations, similar to the GAN-based approaches for autonomous cars.

Metamorphic Testing of Medical Images

As an example of a third application domain, applying metamorphic testing to ML-based systems in the medical domain is presented in (Ding2017). This uses MMT to test variants of existing high-resolution biological cell images.

In (Ding2017), a number of metamorphic relations are defined related to various aspects of the biological cells (mitochondria etc.) in the images, and the manipulations done to the image. I lack the medical domain expertise to analyze the transformations or metamorphic relations in more detail, and the paper does not very clearly describe these for me. But I believe my lack of understanding is actually a useful point here.

Metamorphic testing related elements in this case (as far as I understood):

Seed inputs: existing medical images (actually, the paper is very unclear on this along with many other aspects, but it serves as a domain example)

Transformations: Adding, removing, transforming, etc. of mitochondria in the images.

Metamorphic relations: The relations between the elements (mitochondria) in the algorithm outputs for the transformed images should match the defined relation (e.g., linking some elements after adding new ones).

This example highlights, for me, how in many cases, the nuances, the metamorphic relations and transformations require an in-depth domain understanding. This requires extensive collaboration between different parties, which is quite common (in my experience) in applying ML-based approaches. Cars, driving, and language translation are everyday tasks we are all familiar with. Many expert domains, such as in this example, less so. This is why I think this is a useful example in highlighting my lack of domain expertise.

Interestingly, (Ding2017) also mentions using traditional testing techniques such as combinatorial testing, randomization, and category-partitioning, to enhance the initial input seed set. This is also the case in the following example on drones.

Metamorphic Testing of Drones

As a final example domain, an approach of combining model-based testing, simulation, and metamorphic testing for testing an ML-based flight guidance systems of autonomous drones is presented in (Lindvall2017).

The drone is defined as having a set of sensors, including barometer, GPS, cameras, LIDAR, and ultrasonic. Many sensors, quite similar the autonomous cars example. The metamorphic relations defined for the drone control:

  • behaviour should be similar across similar runs
  • rotation of world coordinates should have no effect
  • coordinate translation: same scenario in different coordinates should have no effect
  • obstacle location: same obstacle in different locations should have same route
  • obstacle formation: similar to location but multiple obstacles together
  • obstacle proximity: always within defined bounds
  • drone velocity: velocity should stay inside defined bounds
  • drone altitude: altitude should stay inside defined bounds

Following are properties of such systems metamorphic testing environment:

  • Seed inputs: generated using the model-based approaches based on an environment model for the simulation
  • Transformations: See above; rotation and coordinate changes of drone vs environment and obstacles or obstacle groups, etc
  • Checks: See also above

A test environment generator is used to define (simulated) test environments for the drone, effectively generating the seeds of the metamorphic tests. The metamorphic transformations can be seen as modifications of this environment, and finally checks test that the above defined metamorphic relations hold. Various scenarios are defined to hold these together, including lift-off, returning home, landing, etc.

Perhaps the most interesting part here is the use of model-based testing approaches to build the seed inputs themselves, including the test environment. This seems like a very useful approach for gaining further coverage in domains where this is possible.

Another relevant observation in this is the use of scenarios to group elements together to form a larger test scenario, spanning also time. This is important, since a drone or a car, or many other systems, cannot consider a single input in isolation, but rather must consider a sequence of events, and use it as a context. This time aspect also needs to be taken into account in metamorphic testing.

Adversarial Inputs and Relations Across Time

Adversarial Inputs

A specific type of transformation that is often separately discussed in machine learning is that of adversarial inputs, which is extensively described in (Goodfellow2018). In general, an adversarial input aims to trick the machine learning algorithm to make a wrong classification. An example from (Goodfellow2018) is to fool an autonomous car (surprise) to misclassify a stop sign and potentially lead to an accident or other issues.

Generating such adversarial inputs can be seen as one example of a metamorphic transformation, with an associated relation that the output should not change, or change should be minimal, due to adversarial inputs.

Typically such adversarial testing requires specifically tailored data to trigger such misclassification. In a real-world driving scenario, where the car sensors are not tampered with, it might be harder to produce such purely adversarial effects. However, there are some studies and approaches, such as (Zhou2020) considering this for real-world cases. More on this in a bit.

Beyond autonomous cars, digitally altered or tailored adversarial inputs might be a bigger issue. For example, in domains such as cyber-security log analysis, or natural language processing, where providing customized input data could be easier. I have not seen practical examples of this from the real world, but I expect once the techniques mature and become more easily available, more practical sightings would surface.

Much of the work on adversarial elements, such as (Goodfellow2018), have examples of adversarially modified single inputs (images). Real systems are often not so simple. For example, as a car drives, the images (as well as other sensor data), and the decisions that need to be made based on that data, change continuously. This is what the (Zhou2020) paper discusses for autonomous cars.

Relations Across Time

In many cases, besides singular inputs, sequences of input over time are more relevant for the ML-based system. Driving past a sign (or a digital billboard..), the system has to cope with all the sensor data at all the different positions in relation to the environment over time. In this case, the camera viewing angle. For other sensors (LIDAR etc), the input and thus output data would change in a similar manner over time.

Following is an example of what might be two frames a short time apart. In a real video stream there would be numerous changes and images (and other inputs) per second:

Change of (same) input concept over time. Image by author.

Not only does the angle change, but time as a context should be more generally considered in this type of testing (and implementation). Are we moving past the sign? Towards it? Passing it? Did we stop already? What else is in the scene? And so on.

This topic is studied in (Zhou2020), which considers it from the viewpoint of adversarial input generation. In a real-world setting, you are less likely to have your image data directly manipulated, but may be susceptible to adversarial inputs on modified traffic signs, digital billboards, or similar. This is what they (Zhou2020) focus on.

The following example illustrates how any such modification would also need to change along with the images, over time (compared to calculating a single, specific altered input vs real-world physical data moving across time):

If testing with generated adversarial elements, they should change over time in relation to the rest of the image. Here the angle and size of the blue box. Image by author.

This temporal aspect is important in more ways than just for adversarial inputs. For example, all the image augmentations (weather effects, etc) I discussed earlier would benefit from being applied in a realistic driving scenario (sequences of images) vs just a single image. This is what the cars have to deal with in the real world after all.

The test oracle in (Zhou2020) also considers the effect of the adversarial input from two different viewpoints: strength and probability. That is, how large deviations can you cause in the steering of the car with the adversarial changes, and how likely it is that you can cause these deviations with the adversarial input.

Beyond cars and video streams, time series sequences are common in other domains as well. The drone scenarios discussed are one example. Other examples include processing linked paragraphs of text, longer periods of signal in a stock market, or basic sensor signals such as temperature and wind speed.

Minimizing the Test Set

While automating metamorphic testing can be quite straightforward (once you figure your domain relations and build working transformations…), the potential input space from which to choose, and the number of transformations and their combinations can quickly grow huge. For this reason, test selection in MMT is important, just as with other types of testing.

One approach to address this is presented in (Tian2018), which applies a greedy search strategy. Starting with a seed set of images and transformations, the transformations and their combinations are applied on the input (images), and the achieved neuron activation coverage is measured. If they increase coverage, the “good” combinations are added back to the seed set for following rounds, along with other inputs and transformations, as long as they provide some threshold of increased coverage. This iterates until defined ending thresholds (or number of experiments). Quite similar to more traditional testing approaches.

Coverage in (Tian2018) is measured in terms of activations of different neurons in the ML model. They build coverage criteria for different neural network architectures, such as convolutional neural nets, recurrent neural nets, and dense neural nets. Various other coverage criteria also have been proposed, that could be used, such as one in (Gerasimou2020) on evaluating the importance of different neurons in classification.

When more and easily applicable tools become available for this type of ML-model coverage measurement, it would seem a very useful approach. However, I do not see people generally writing their own neural net coverage measurement tools.

Relation to Traditional Software Testing

Besides test suite optimization, it is important to consider MMT more broadly in relation to overall software and system testing. MMT excels in testing and verifying many aspects of ML-based systems, which are more probabilistic and black-box in nature. At least to gain higher confidence / assurance in them.

However, even in ML-based systems, the ML-part is not generally an isolated component working alone. Rather it consumes inputs, produces outputs, and uses ML models for processing complex datasets. The combinatorial, equivalence partitioning, and model-based methods I mentioned earlier are some examples of how the MMT based approaches can be applied together with the overall, more traditional, system testing.

As I mentioned with the Baidu Apollo case and its LIDAR test data generation, one of the feedbacks was to use the metamorphic test data for further ML training. This in general seems like a useful idea, and it is always nice to get more training data. In my experience with building ML-based systems, and training related ML-models, everyone always wants more training data.

However, I believe one should not simply dump all MMT test data into the training dataset. A trained model will learn from the given data, and can be tested for general accuracy on a split test set. This is the typical approach to test a specific ML-model in isolation. However, in practice, the classifications will not be 100% accurate, and some items will end up misclassified, or with low confidence scores. These further feed into the overall system, which may have unexpected reactions in combination with other inputs or processes. Running specific (MMT based or not) tests with specific inputs helps highlight exactly which data is causing issues, how this behaviour changes over time, and so on. If you just throw your MMT tests into the training set and forget it, you lose the benefit of this visibility.

Besides MMT, and complimentary to it, other interesting approaches of tailoring traditional testing techniques for ML-based system testing exist. One specific approach is A/B testing (evaluating benefits of different options). In ML-based systems, this can also be a feedback loop from the human user, or operational system, back to testing and training. The Tesla Shadow Mode is one interesting example, where the autonomous ML-based system makes continuous driving decisions, but these decisions are never actually executed. Rather they are compared with the actual human driver choices in those situations, and this is used to refine the models. Similar approaches, where the system can learn from human corrections are quite common, such as tuning search-engine results and machine translations, based on human interactions with the system. You are changing / morphing the system here as well, but in a different way. This would also make an interesting seed input source for MMT, along with oracle data (e.g., driving path taken by human user) for the metamorphic relation.


Testing machine learning based systems is a different challenge from more traditional systems. The algorithms and models do not come with explicit specifications of inputs and outputs that can be simply tested and verified. The potential space for both is often quite huge and noisy. Metamorphic testing is one useful technique to gain confidence in their operation with a reasonable effort. Compared to traditional testing techniques, it is not a replacement but rather a complimentary approach.

I presented several examples of applying MMT to different domains in this article. While applications in different domains require different considerations, I believe some generally useful guidelines can be derived to help perform MMT over ML-based systems:

  • metamorphic transformations: these do not have to be hugely complex, but rather simple ones can bring good benefits, such as the addition of a few random points to the LIDAR cloud. Consider how the same input could change in its intended usage environment, and how such change can be implemented with least (or reasonable) effort as a transformation.
  • metamorphic relations: to build these relations, we need to ask how can we change the ML input, and what effect should it have on the output? Sometimes this requires deep domain expertise to identify most relevant changes, as in the medical domain example.
  • test oracles: These check that the performed transformation results in a acceptable (vs valid) output. Requires considerations such as how to represent the change (e.g., steering angle change, sentence structural change), possibly defining the probability of some error, the severity of the error, and a distance metric between the potential outputs after transformation (e.g., steering angle calculation). That is, the values are likely not fixed but in a continuous range.
  • time relation: in many systems, the inputs and outputs are not singular but the overall system performance over time is important. This may also require asking the question of how time might be impacting the system, and how it should be considered in sequences of metamorphic relations. The idea of overall test scenarios as providers of a broader context, time related and otherwise, is useful to consider here.
  • test data: can you use the user interactions with the system as an automated source of test inputs for transformations and metamorphic relations? Think Tesla Shadow mode, Google search results, and the inputs from the environment and the user, and use reactions to these inputs.

As discussed with some of the examples, an interesting trend I see is the move towards using ML-based algorithms to produce or enhance the (MMT-based) tests for ML-based systems. In the NLP domain this is shown by the use of BERT as a tool to build metamorphic transformations for testing natural language translations. In the autonomous cars domain by the use of GAN-based networks to create transformations between image properties, such as different weather elements and time of day.

Overall the ML field still seems to be advancing quite fast, with useful approaches already available also for MMT, and hopefully much more mature tooling in the next few years. Without good tool support for testing (data generation, model coverage measurement, etc), finding people with all this expertise (testing, machine learning, domain specifics, …), and implementing it all over again for every system, seems likely to be quite a challenge and sometimes a needlessly high effort without good support in tools and methods.

That’s all for today, this got way too long, so if someone managed to read all this far, I am impressed :). If you have experiences in testing ML-based systems and willing to share, I am interested to hear and learn in the comments 🙂

Testing Machine Learning Intensive Systems (or Self-Driving Cars) – A Look at the Uber Accident

Previously I looked at what it means to test machine learning systems, and how one might use machine learning in software testing. Most of the materials I found on testing machine learning systems was academic in nature, and as such a bit lacking in practical views. Various documents on the Uber incident (fatally hitting a pedestrian/cyclist) have been published, and I had a look at those documents to find a bit more insights into what it might mean to be testing overall systems that rely heavily on machine learning components. I call them machine-learning intensive systems. Because.

Accident Overview

There are several articles published on the accident, and the information released for it. I leave the further details and other views for those articles, while just trying to find some insights related to the testing (and development) related aspects here. However, a brief overview is always in order to set the context.

This accident was reported to have happened on 18th of March, 2018. It involved the Uber test car (a modified Volvo X90) hitting a person walking with a bicycle. The person died as a result of the impact. This was on a specific test route that Uber used for their self-driving car experiments. There was a vehicle operator (VO) person in the Uber car, whose job was to oversee the autonomous car performance, mark any events of interest (e.g., road debris, accidents, interesting objects), label objects in the recorded data, and take over control of the vehicle in case of emergency or other need (system unable to handle some situation).

The records indicate there used to be two VO’s per car, one focusing more on the driving, and one more on recording events. They also indicate that before the accident, the number of VO’s had been reduced to just one. The roles were combined and an update of the car operating console was designed to address the single VO being able to perform both roles. The lone VO was then expected to keep a constant eye on the road, monitor everything, label the data, and perform any other tasks as needed. Use of mobile phone was prohibited during driving by the VO, but the documents for the accident indicate the VO had been eyeing the spot where their mobile device was located, inside a slot on the dashboard. The documents also indicate the VO had several video streaming applications installed, and records from the Hulu streaming service showed video streaming occurring on the VO account at the time of the accident.

The accident itself was a result of many combined factors, where the human VO seems to have put their attention elsewhere at just the time, and the automation system has failed to react properly to the pedestrian / cyclist. Some points on the automation system in relation to potential failures:

  • The system kept records of each moving object / actor and their movement history, using their previous movements and position as an aid to predict their future movements. The system was further designed to discard all previous movement (position) history information when it changed the classification of an object. So no history was available to predict movement of an object / actor, if its classification changed.
  • The classification of the pedestrian that was hit kept changing multiple times before the crash. As a result, the system constantly discarded all information related to it, severely inhibiting the system from predicting the pedestrians movement.
  • The system had an expectation to not classify anything outside a road crossing as a pedestrian. As such, before the crash, the system continously changed the pedestrian classification between vehicle, bicycle, or other. This was the cause of losing the movement history. The system was not designed for the possibility of someone walking on the road outside a crossing area.
  • The system had safeguards in place to stop it from reacting too aggressively. A delay of 1 second was in place to delay braking when a likely issue was identified. This delayed automatic braking even at a point where a likely crash was identified (as in the accident). The reasoning was to avoid too aggressive reactions to false positives. I guess they expected the VO to react, and to log issues for improvement.
  • Even when the danger was identified and automated braking started, it was limited to reasonable force to avoid too much impact on the VO. If this maximum braking was calculated as insufficient to avoid impact, the system would brake even less and emit an audio signal for the VO to take over. So if maximum is not enough, slow down less(?). And the maximum emergency braking force was not set very high (as far as I understand..).

Before the crash, the system thus took an overly long time to identify the danger due to bad assumptions (no pedestrian outside crossing). It lost pedestrian movement history due to dropping data on classification change. It waited for 1 second from finally identifying the danger to do anything, and then initiated a slowdown rather than emergency braking. And the VO seemed to be distracted from observing the situation. After the accident, Uber has moved to address all these issues.

There are several other documents on various aspects of the VO, the automation system, and the environment available on the National Transportation Safety Board website for those interested. Including nice illustrations of all aspects.

This was a look at the accident and its possible causes to give some context. Next a look at the system architecture to also give some context of potential testing approaches.

Uber System Architecture

Looking at testing in any domain, understanding the system architecture is important. A look.

Software Modules

The Uber document on the topic lists the following main software modules:

  • Perception: Collects data from different sensors around the car
  • Localization: Combines detailed map data with sensor data for accurate positioning
  • Prediction: Takes Perception output as input, predicts actions for actors and objects in the environment.
  • Routing and Navigation: Uses map data, vehicle status, operational activity to determine long term routes for a given goal.
  • Motion Planning: Generates shorter term motion plans to control the vehicle in the now. Based on Perception and Prediction inputs.
  • Vehicle Control: Executes the motion plan using vehicle communication interfaces.


The same Uber document also describes the self-driving car hardware.

The current components at the time of writing the document:

  • Light Detection and Ranging (LIDAR): Measuring distance to actors and objects, 100m+ range.
  • Cameras: multiple cameras for different distances, covering 360 degrees around the vehicle. Both near- and far-range. To identify people and objects.
  • Radar: Object detection, ranging, relative velocity of objects. Forward-, backward-, and side-facing.
  • Global Positioning System (GPS): Coarse position to support vehicle localization (positioning it), vehicle command (to use location / position for control), map data collection, satellite measurements.
  • Self-Driving Computer: A liquid-cooled local computer in the car to run all the SW modules (Perception, Prediction, Motion Planning, …)
  • Telematics: Communication with backend systems, cellular operator redundancy, etc.

Planned components (not installed back then, but in future plans..):

  • Ultrasonic Sensors: Uses echolocation to range objects. Front, back, and sides.
  • Vehicle Interface Module: Seems to be an independent backup module to safely control and stop the vehicle in case of autonomous system faults.


Now that we established a list of the SW and HW components, a look at their functionality.


The system is described as using very detailed maps, including:

  • Geometry of the road and curbs
  • Drivable surface boundaries and driveways
  • Lane boundaries, including paint lines of various types
  • Bike and bus lanes, parking regions, stop lines, crosswalks
  • Traffic control signals, light sets, and lane and conflict associations (whatever that is? :))
  • Railroad crossings and trolley or railcar tracks
  • Speed limits, constraint zones, restrictions, speed bumps
  • Traffic control signage

Combined with precise location information, the system uses these detailed maps to beforehand "predict" what type of environment lies ahead, even before the Perception module has observed it. This is used to prepare for the expected road changes, anticipate speed changes, and optimize for expected motion plans. For example, when anticipating a tight turn in the road.

Perception and Prediction

The main tasks of the Perception module are described as detecting the environment, actors and objects. It uses sensor data to continously estimate the speed, position, orientation, and other variables of the objects and actors, as a basis to make better predictions and plans about their future movement, velocity, and position.

An example given is the turn signals of other cars, which is used to predict the their actions. At the same time, all the other data is also recorded and used to predict other, alternative courses for the same car, in case it does not turn even though using a turning signal.

While the Perception module observes the environment (collects sensor data), the Prediction component uses this, and other available, data as a basis for predicting the movement of the other actors, and changes in the environment.

The observed environment can have different types of objects and actors in it. Some are classified as fixed structures, and are expected not to move: buildings, ground, vegetation. Others are classified as more dynamic actors, and expected to move: vehicles, pedestrians, cyclists, animals.

The Prediction module makes predictions on where each of these objects is likely to move in the next 10 seconds. The predictions include multiple properties for each object and actor, such as movement, velocity, future position, and intended goal. The intended goal (or intention) is mentioned in the document, but I did not find a clear description of how this would be used. In any case, it seems plausible that the system would assign "intents" to objects, such as pedestrian crossing a street, a car turning, overtaking another car, going straight, and so on. At least these would seem useful abstractions and input to the next processing module (Motion Planning).

The Prediction module makes predictions multiple times a second to keep an updated representation available. The predictions are provided as input to the Route and Motion Planning module, including the "certainty" of those predictions. This (un)certainty is another factor that the Motion Planning module can use as input to apply more caution to any control actions.

Route and Motion Planning

Motion Planning (as far as I understand) refers to short-term movements, translating to concrete control instructions for the car. Route planning on the other hand refers to long term planning on where to go, and gives goals for the Motion Planning to guide the car to the planned route.

Motion Planning combines information from generated route (Route Planning), perceived objects and actors (Perception), and predicted movements (Prediction). Mapping data is used for the "rules of the road", as well as any active constraints. I guess also combined with sensor data for more up-to-date views in the local environment (the public docs are naturally not super-detailed on everything). Using these, it creates a motion plan for vehicle. Data from Perception and Prediction modules is also used as input, to define the anticipated movements of other objects and actors.

A spatial buffer is defined to be kept between the vehicle and other objects in the environment. My understanding is that this refers to keeping some amount of open space between the car and environmental elements. The size of this buffer varies with variables such as autonomous vehicle speed (and properties and labels of other objects and actors I assume). To preserve the required buffer, the system may take action such as changing lanes, brake, or stop and wait for situation to clear.

The system is also described as being able to identify and track occlusions in the environment. These would be environmental elements, such as buildings or other cars, blocking a view to certain other parts of the environment. These are constantly reasoned about, and the system becomes more concervative in decision when occlusions are observed. It aims to be able to avoid actors coming out of occlusions with reasonable speed.

Vehicle Control

The Vehicle Control module executes trajectories provided by the Motion Planning module. It controls the vehicle through communication interfaces. Example controls include steering, braking, turn signals, throttle, and switching gears.

It also tracks any limits set for the system (or environment?), and communicates back to the operation center as needed.

Data Collection and Test Scenarios

Since my point with this "article" was to look into what it might mean to test a machine learning intensive system, I find it important to also look at what type of data is used to train the machine learning systems, and how is all the used data collected. And how these are used as part of test case (In the Uber documents they seems to call them test scenarios). Of course, such complex systems use this type of data for many different purposes besides just the machine learning part, so they are generally interesting as well.

The Uber document describes data uses including system performance analysis, quality assurance, machine teaching and testing, simulated environment creation and validation, software development, human operator training and assessment, and map building and validation.

Data Collection

Summarizing the various parts related to data collection and synthesis from the Uber descriptions, at the heart of all this are the real-world training data collected by the VO’s driving around, the car and automated sensors collecting detailed data, and the VO’s tagging the data. This tagging also helps further identify new scenarios, objects, and actors. The sensor data is based on the sensors I listed above in the HW section.

Additionally, the system is listed as recording:

  • telemetry (maybe refers to metrics about network? or just generally to transferring data?)
  • control signals (commands for vehicle control?)
  • Control Area Network (CAN) messages
  • system health, such as
    • hard drive speeds
    • internal network performance
    • computer temperatures

The larger datasets are recorded in onboard (car) storage. Smaller amounts of data are transmitted in near real-time using over-the-air (OTA) interfaces over cellular networks to the Uber control center. These use multiple cellular network for cybersecurity and resiliency purposes. The OTA data includes insights on how the vehicles are performing, where they are, and their current state.

Scenario Development

In the documents (Uber and another from the RAND corporation), the operational environment of the autonomous vehicle is referred to as the operational design domain (ODD). Defining the ODD is quite central to the development (as well as testing) of the system and training the ML algorithms, as well as the controlling logic based on those. It defines the world in which the car operates, and all the actors and objects, and their relations.

The Uber document describes using something called scenarios as test cases. Well, it mostly does not mention the word "test case", but for practical purposes this seems to be similar. Of course, this is quite a bit more complex than a traditional software test case with simple inputs and outputs, requiring description of complex real-world environments as inputs, and boundaries and profiles of accepted behaviour as outputs, rather than specific data values. These complex real-world inputs and outputs are also varying over time, different from the typical static input values as often is with traditional software tests. Thus, also a time-series aspect is relevant to the inputs and outputs.

Uber describes a unified schema being used to describe the scenarios and data. Besides the collected data and learned models, other data inputs are also used, such as operational policies. Various success criteria are defined for each scenario, such as speed, distance, and description of safe behaviour.

When new actors, environmental elements, or other similar items are encountered, they are recorded and tagged for further training of the autonomous system. The resulting definitions and characterization of the ODD is then used as input to improve the test scenarios and create new ones. This includes improving the test simulations, and test tracks for coverage.

Events such as large deviations between consequtive planned trajectories are recorded and automatically tagged for investigation. Simulations are used to evaluate they are fixed, and the new scenarios are added to ML training datasets, or as "hard test cases". This seems a bit similar to the Tesla "shadow mode" I discussed earlier, just a bit more limited.

Test Coverage

Besides a general overview of the scenario development, the Uber documents do not really discuss how they handle test coverage, or what types of tests they run. There are some minor references but nothing very concrete. It is more focused on describing the overall system, and some related processes. I tried to collect here some points that I figured seemed relevant.

A key difference to more traditional software systems seems to be how these types of systems do not have a clearly defined input or output space. The interaction interfaces (API/GUI) of traditional software systems naturally defines some contract for what type of input is allowed, and expected. With these it is possible to apply the traditional techniques such as category partitioning, boundary analysis, etc. When your input space is the real world and everything that can happen in it, and output space is all the possible actions in relation to all the possible environmental configurations, it gets a bit more complex. In a similar comment, Uber describes their system as requiring more testing with different variations.

Potential Test Scenarios from Uber Docs

These are just points I collected that I though would illustrate something related to test scenarios and test coverage.

Uber describes evaluating their system performance in different common and rare scenarios, using measurements such as traffic rule violations, and vehicle dynamic attributes. This means having very few crash and unsafe scenarios available, but a large number of safe scenarios. That is, when the scenarios are based on real-world use and data, commonly there are much more "safe" scenarios available than "un-safe", due to rarity of crashes, accidents and other problem cases vs normal operations.

With only this type of highly biased data-set available, I expect there is a need to synthesize more extensive test sets, or other methods to test and develop such systems more extensively. The definition of safety also does not seem to be a binary decision but rather there can be different scales of "safe", depending on the safety attribute. For example, a safety margin of how far from other vehicles should the autonomous vehicle keep distance, is a continous variable, not a binary value. Some variables might of course have binary representations, such as avoiding hitting a pedestrian, or ramming a red light. But even the pedestrian metric may have similar distance measures, impact measures, etc. So I guess its a bit more complicated than just safe or not safe.

Dataset augmentation and imbalanced datasets are common issues in developing and training ML models. However, those techniques are (to my understanding) based on a single clear goal such as classification of an object, not on complex output such as overall driving control and its relation to the real world. Thus, I would expect to use overall scenario augmentation type of approaches, more holistic than a simple classifier (which in its own might be part of the system).

Some properties I found in the Uber documents (as I discussed above), referring to potential examples of test requirement:

  • Movement of objects in relation to vehicle.

  • Inability of the system to classify a pedestrian correctly if not near a crosswalk.

  • Inability of the system to predict pedestrian path correctly when not classified as pedestrian.

  • Overly strict assumptions made, such as cyclist not moving across lanes.

  • Losing location history of tracked objects and actors if their classification changed

  • Uber defines test coverage requirements based on collected map data and tags.

  • Map data predictin that upcoming environment would be of specific type (e.g., left curve), but it has changed and observations differ

  • Another car signals turning left but other predictors do not predict that, and the other car may not actually turn left.

  • Certainty of classifications.

  • Occlusions in the environment.


Looking at the above examples, trying to abstract some more generic concepts that would serve as a potentially useful basis:

  • Listing of known objects / actors

  • Listing of labels for different types of objects / actors

  • Assumptions made about specific types of objects / actors

  • Properties of objects / actors

  • Interaction constraints of objects and actors

  • Probabilities of classifications for different objects / actors and labels

  • Functionality when faced with unknown objects / actors

The above list may be lacking in general details that would cover the different types of systems, or even the Uber example, but I find it provides an insight into how this is heavily about probabilities, uncertainty, and preparing for, and handling, that uncertainty.

For different types of systems, the actual objects, actors, labels and properties would likely change. To illustrate these a bit more concretely with the autonomous car example:

  • Objects / Actors, and Properties their Labels

    • Our car
      • Speed, Position, Orientation,
      • Accelerating, Slowing down,
      • Intended goal (turn left, drive forward, change lane, stop, …)
      • Predicted location in 1s, 2s, 5s, …
      • Distance to all other actors / objects
      • Right of way
    • Other car, moving
      • Same as "Our car"
    • Other car, parked
      • Probability of leaving parking mode
    • Pedestrian, moving or stopped (parked)
      • Same as "other car"
      • Crossing street
      • On pedestrian path
    • Cyclist
      • Same as "other car"
      • Crossing street
      • On bicycle path
    • Other object
      • Moving or static
      • Same as others above
    • Traffic light
      • Current light (green, yellow, red)
      • On / off / blinking
    • Traffic sign
      • Type / Meaning
        • Set speed, stop, yield, no parking, …
        • Long term / local effect
    • Building
      • Size, shape, location
    • Occlusion
      • Predicted time of object / actor coming from occlusion
    • Unknown object, moving or parked
      • Much like the other car etc but maybe with unknown goals
  • Interaction constraints

    • Safety margin (distance to our car and other actors) before triggering some action
    • Actions triggered in different constraint states / boundaries

Something that seems important is also the ability to reason about previously unknown objects and actors to an extent possible. For example, a moving object that does not seem to fit any known category, but has known movement history, speed, and other variables. Perhaps there would be a more abstract category of a moving object, or some hierarchy of such categories. As well as the any of these objects or actors changing their classifications and goals, and how their long-term history should be taken into account overall to make future predictions.

In a different "machine learning intensive" system (not autonomous cars), one might use different set of properties, actors, object, etc. But it seems some similar consideration could be useful.

Possible Test Strategies

Once the domain (the "ODD") is properly defined, as above, it seems many traditional testing techniques could be applied. In the Uber documents, they describe performing architecture analysis to identify all potential failure points. They divided faults into three levels: faults in the self driving system on its own, faults in relation to the environment (e.g., at intersections), and faults related to the operational design domain, such as unknown obstacles the system does not recognize (or misclassifies?). This could be another way to categorize a more specific system, or inspiration for other similar systems.

Another part of this type of system could be related to the human aspect. This is somewhat discussed also in the Uber docs, in relation to operational situations for the system: a distracted operator, and a fatigued operator. They have some functionality in place (especially after the accident) to monitor operator alertness via in-car dashcam and attached analysis. However, I will not go into these here.

Testing ML Components

For testing the ML components, I discussed various techniques in a previous blog post. This includes approaches such as metamorphic testing, adversarial testing, and testing with reference inputs. In autonomous cars, this might be visual classifiers (e.g., convolutional networks), or path prediction models (recurrent neural nets etc.), or something else.

Testing ML Intensive Systems

As for the set of properties I listed above, it seems once these have been defined, using traditional testing techniques should be quite useful:

  • combinatorial testing: combine different objects / actors, with different properties, labels, etc. observe the system behaviour in relation to the set constraints (e.g., safety limits).
  • boundary analysis: apply to the combinations and constraints from the previous bullet. for example, probabilities at different values. might require some work to define interesting sets of probability boundaries, or ways to explore the (combined) probability spaces. but not that different in the end from more traditional testing.
  • model-based testing: use the above type of variables to express the system state, use a test generator to build extensive test sets that can be used to cover combinations, but also transitions between states and their combinations over time.
  • fault-injection testing: the system likely uses data from multiple different data sources, including numerous different types of sensors. different types of faults in these may have different types of impact on the ML classifier outputs, overall system state, etc. fault-injection testing in all these elements can help surface such cases. think Boeing Max from recent history, where a single sensor failure caused multiple crashes with hundreds of lives lost.

The real trick may be in combining these into actual, complete, test scenarios for unit tests, integration tests, simulators, test tracks, and real-world tests.

Regarding the last bullet above (fault-injection testing), the Uber documents discuss this from the angle of fault-injection training – injecting faults into the system and seeing how the vehicle operator reacts to them. Training them how they should react. This sounds similar to fault-injection testing, and I would expect that they would have also applied the same scenarios more broadly. However, I could not find mention of this.

Regarding general failures, and when they happen in real use, the same fault models can also be used to prepare and mitigate actual operational faults. The Uber docs also discuss this viewpoints as the system having a set of identified fault conditions and mitigations when these happen. These are identified by redundant systems and overall monitoring across the system. Example faults:

  • Primary compute power failure
  • Loss of primary compute or motion planning timeout
  • Sensor data delay
  • Door open during driving

General Safety Proceduress

Volvo Safety Features

Besides the Uber self-driving technology, the documents show Volvo cars having safety features in themselves, an Advanced Driver Assistance Systems (ADAS), including an automated emergency braking system named "City Safety". It contains a forward collision warning system, alerting the driver about imminent collision and automatically applying brakes when it observes a potentially dangerous situation. This also includes pedestrian, cyclist, and large animal detection components. However, these were turned off during autonomous driving mode, and only active in manual mode. Simulation tests conducted by the Volvo Group showed how the ADAS features would have been able to avoid the collision (17 times out of 20) or significantly reduce collision speed and impact (remaining 3 times). In post-crash changes, the ADAS system has been activated at all times (along with many other fixes to all the issues discussed here).

Information Sharing and Other Domains

The documents on reviews and investigations after the accident include comparisons to safety cultures in many other (safety-critical) domains: Nuclear Power, Transportation (Rail), Aviation, Oil and Gas, and Maritime. While some are quite specific to the domains, and related to higher level process and cultural aspects, there seem to be many quite interesting points one could build on also for the autonomous driving domain. Or other similar ones. Safety has many higher level shared aspects across domains. Regarding my look for testing related aspects, in many cases replacing "safety" with "QA" would also seem to provide useful insights.

One practical example is how (at least) avionics and transportation (rail) domains have processes in place to collect, analyze, and share information on unsafe conditions and events observed. This would seem like a useful way to identify also relevant test scenarios for testing products in the autonomous driving domain. Given how much effort is required for extensive collection of such data, how expensive and dangerous it can be, the benefits seem quite obvious to everyone.

Related to this, Uber discusses shared metrics for evaluating progress of their development. These include disengagements and self-driving miles travelled. While they have used these to signal progress both internally and externally, they also note that such metrics can easily lead to "gaming the system" at the expense of safety or working system. For example, in becoming overly conservative in avoiding disentanglements, or in using inconsistent definitions of the metrics across developers / systems.

Uber discusses need for work in creating more broadly usable safety performance metrics with academic and industry partners. They list how these metrics should be:

  • Specific to different development stages (development, testing, deployment)
  • Specific to different operational design domains, scenarios and capabilities
  • Have comparable metrics for human drivers
  • Applied in validation environments and scenarios for autonomous cars with other autonomous cars from different companies

The Uber safety approach document refers also a more general work towards automotive safety framework by the RAND corporation. This includes topics such as building a shared taxonomy to form a basis for discussion and sharing across vendors. It also discusses safety metrics, their use across vendors, and the possible issues in use and possible gaming of such metrics. And many other related aspects of cross-vendor safety program. Interesting. Seems like lots of work to do there as well.


This was an overly long look of the documents from the Uber accident. I was thinking of just looking at the testing aspect briefly, but I guess it is hard to discuss them properly without setting the whole background and overall context. Overall, the summary is not that complicated. I just get carried away with writing too much details.

However, I found writing this down helped me reason better about what is the difference between more traditional software intensive systems, and these types of new machine-learning intensive systems. I would summarize it as the need to consider everything in terms of probabilities, the unknown elements in the input and output, constraints over everything, complexity of identifying all the objects and actors, and their possible intents, and all the relations between all possibilities. With probabilities (or un-certainty). But once the domain analysis is well done, and understanding the inputs and outputs, I find the traditional testing techniques such as combinatorial testing, model-based testing, category partitioning, boundary analysis, fault-injection testing would give a good basis. But it might take a bit broader insight to be able to apply them efficiently.

As for the Uber approach, it is interesting. I previously discussed the Tesla approach of collecting data from fleets of deployed consumer vehicles. And features such as the Tesla shadow mode, continuously running in the background as the human driver drives, always evaluating whether each autonomous decision the system would have made would have been similar to what action the human took, or how it differs from that taken by the actual human driver. Not specifically trained VO’s as in the Uber case, but usual consumer drivers (so Tesla customers at work helping to improve the product).

The Tesla approach seems much more scalable in general. It might also generalize better as opposed to Uber aiming for very specific routes and building super detailed maps of just those areas. Creating and maintaining such super-detailed maps seems like a challenging task. Perhaps if the companies have very good automated tools to take care of it, it can be easier to manage and scale. I don’t know if Tesla does some similar mapping with the help of their consumer fleet, but would be interesting to see similar documents and compare.

As for other types of machine learning (intensive) systems, there are many variations, such as those using IoT sensors and data to provide a service. Those are maybe not as open-worlded in all possible input spaces. However, it would seem to me that many of the considerations and approaches I discussed here could be applied. Probabilities, (un-)certainties, domain characterizations, relations, etc. Remains interesting to see, perhaps I will find a chance to try someday.. 🙂

Machine Learning in Software Testing

Early 2019 Edition


Software testing has not really changed all that much in the past decades. Machine learning on the other hand is a very rapidly evolving technology being adopted all over the place. So what can it bring to software testing?

Back in 2018 (so about a year ago from now) I did a review of machine learning (ML) (and deep learning (DL)) applications in Network Analysis and Software Testing. After spending some more time learning and trying ML/DL in practice, this is an update on the ML for testing part, reflecting on my own learnings and some developments over this past year. Another interesting part would be testing ML system. I will get to that in another post.

In my last years review, I focused on several topics. A recent academic study (Durelli2019) in this area also lists a number of topics. This includes topics such as "learning test oracles", which basically translates to learning a model of a system behaviour based on some observations or other data about the software behaviour. Last years I included this under the name of specification mining. In practice, I have found such learned behavioral models to be of limited use, and have not seen general uptake anywhere in practice. In this review I focus on fewer topics I find more convincing for practical use.

I illustrate these techniques with this nice pic I made:


In this pic, the "Magic ML Oracle" is just a ML model, or a system using such a model. It learns from a set of inputs during the training phase. In the figure above this could be some bug reports linked to components (file handling, user interface, network, …). In the prediction phase it runs as a classifier, predicting something such as which component a reported issue should be assigned to, how fault-prone an analysis is (e.g., how to focus testing), or how tests and specs are linked (in case of missing links).

The topics I cover mainly relate to using machine learning to analyze various test related artefacts as in the figure above. One example of this is the bug report classifier I built previously. Since most of these ML techniques are quite general, just applied to software testing, ideas from broader ML applications could be useful here as well.

Specifically, software testing is not necessarily that different from other software engineering activities. For example, Microsoft performed an extensive study (Kim2017) on their data scientists and their work in software engineering teams. This work includes bug and performance analysis and prioritization, as well as customer feedback analysis, and various other quality (assurance) related topics.

As an example of concrete ML application to broader SW engineering, (Gu2018) maps natural language queries to source code to enable code search. To train a DL model for this, one recurrent neural network (RNN) based model is built for the code description (from source comments), and another one for the source code. The output of these two is a numerical feature vector. Cosine similarity is a measure used to compare how far apart two such vectors are, and here it is used as the training loss function. This is a nice trick to train a model to map source code constructs to natural language "constructs", enabling mapping short queries to new code in similar ways. It is also nicely described in the morning paper. I see no reason why such queries and mappings would not work for test related searches and code/documents as well. In fact, many of the techniques I list in following sections use quite similar approaches.

Like I said, I am focusing on a smaller set of more practical topics than last year in this still-overly-long post. The overall idea of how to apply these types of techniques in testing in general should not be too different though. This time, I categorize topics to test prioritization bug report localization, defect prediction, and traceability analysis. One section to go over each, one to rule over them all.

Test Prioritization

As (software) organizations and projects grow over time, their codebase tends to grow with them. This would lead to also having more tests to cover that codebase (hopefully…). Running all possible test cases all the time in such a scenario is not always possible or cost-efficient, as it starts to take more and more time and resources. The approach to selecting a subset of the tests to execute has different names: test prioritization, test suite optimization, test minimization, …

The goal with this being to cover as much of the fault-prone areas with fewer tests, such as in this my completely made up image to illustrate the topic:

Prioritized coverage

Consider the coverage % in the above to reflect, for example, covering changes since last time tests were run. Aiming to cover changes did not break anything as fast and efficient as possible.

An industrial test prioritization system used at Google (in 2017) is described in (Memon2017). This does not directly discuss using machine learning (although mentions it as future plan for the data). However, I find it interesting for general data-analysis of testing related data as a basis for test prioritization. It also works to provide a basis for a set of features for ML algorithms, as understanding and tuning your data is really the basis for all ML applications.

The goal in this (Memon2017) case is two-fold: better utilizing test resources (focus on potentially failing tests) and provide feedback to the developers about their commits. The aim is not 100% accurate predictions but rather focusing automated test execution and providing the developer with feedback such as "this commit is 95% more likely to cause breakage due to the code being touched by 5 developers in the past 10 days, and being written in Java". The developers can use this feedback to seek further assurance in additional reviews, more testing, static analysis, and so on.

Some of the interesting features/findings described in (Memon2017):

  • Only about 1% of the tests ever failed during their lifetime.
  • Thus about 99% speedup would be possible if right tests could be identified.
  • Use of dependency distance as a feature: What other component depends on the changed component, and through how many other components
  • Test targets further away from the change are much less likely to fail. So dependency distance seems like a useful prediction (feature) metric. They used a threshold of 10 for their codebase, which might vary by project but the idea likely holds.
  • Files/targets modified more often are more likely to cause breakages.
  • File type affects likelihood of breakage.
  • User/tool id affects likelihood of breakage.
  • Number of developers having worked on single file in a short time affects likelihood of breakage. More developers means higher likelihood to break.
  • The number of test targets affected by a change varies greatly, maybe requiring different treatment.

A similar set of features is presented in (Bhagwan2018):

  • Developer experience: Developer time in the organization and project
  • Code ownership: More developers changing files/components cause more bugs
  • Code hotspots: Specific parts of code that cause issues when changed
  • Commit complexity: Number of changes, changed files, review comments in a single commit. More equals more bugs.

A test prioritization approach taken at is described in (Busjaeger2016). They use five types of features:

  • code coverage
  • text path similarity
  • text content similarity
  • test failure history
  • test age

In (Bhagwan2018), the similarity scores are based on TF-IDF scores and their cosine similarity calculation. TF-IDF simply weights frequency of words in a document against the frequency of the same word in all other documents, to identify most specific terms for document types. The features are fed into a support-vector model to rank tests to execute first. In their (Bhagwan2018) experience, about 3% of overall tests is required to reach about 75% coverage. From the predicted tests, about every 5th is found to be causing a failure.

I find the above provide good examples of data analysis, and basis for defining ML features.

In the (Durelli2019) review, several studies are listed under test prioritization, but these mostly do not strike me as very realistic examples of ML applications. However, one interesting approach is (Spieker2017), which investigates using reinforcement learning for test prioritization. It uses only three features: execution time (duration), last execution (whatever that means..), and failure history. These seem a rather simple set of features to build a complex model, and it seems likely to me that a simple model would also work here. The results in (Spieker2017) are presented as good but not investigated in depth so hard to say from just that. However, I did find the approach to present some interesting ideas in relation to this:

  • Continuous integration systems constantly execute the test suites so you will have a lot of constantly updating data about test suites, execution, results available
  • Continuously updating the model over time based on a last N test runs from past
  • Using a higher exploration rate over full suite to bootstrap the model, lowering over time when it has learned but not setting to zero
  • Using "test case" as model state, and assigning it a priority as an action
  • Listing real "open-data" industry-based datasets to evaluate prioritization ML models on

I would be interested to see how well a simple model, such as Naive Bayes, weighting the previous pass/fails and some pattern over their probability would work. But from the paper it is hard to tell. In any case, the points above would be interesting to explore further.

A Thought (maybe Two)

I assume ML has been applied to test prioritization, just not so much documented. For example, I expect Google would have taken their studies further and used ML for this as discussed in their report (Memon2017). Test prioritization seems like a suitably complex problem, with lots of easily accessible data, and with a clear payoff in sight, to apply ML. The more tests you have, the more you need to execute, the more data you get, the more potential benefit.

In this as in many advanced research topics, I guess the "killer app" might come by integrating all this into a test system / product as a black-box. This would enable everyone to make use of it without requiring to learn all the "ML in test" details outside their core business. Same I guess applies to the other topics I cover in the following sections.

Bug Report Localization

Bug report localization (in this case anyway) involves taking a bug report and finding the component or other part of the software that the report is most likely to concern. Various approaches aim to automate this process by using machine learning algorithms. My previous example is one example of building one.

I made a pretty picture to illustrate this:

Localization Oracle

Typically a bug report is expressed in natural language (at least partially, with code snippets embedded). These are fed to the machine learning classifier (magic oracle in the pic above), which assigns it to 1-N potential components. Component granularity and other details may wary but this is the general idea.

For this, code structural elements used as features include (Tufano2018):

  • sequences of abstract syntax tree (AST) nodes
  • sequences of call-flow-graphs (CFG) nodes
  • bytecode representations. This seems interesting in mapping the code to fewer shared elements (opcodes)

Other features include (Lam2017):

  • camel-case splitting source code (n-grams would seem a natural fit too)
  • time since a file was previously changed when fixing a bug
  • how many bugs overall have been fixed in a file
  • similarity between a bug report and previous bug reports (and what were they assigned to)

Besides using such specific code structures as inputs, also specific pre-processing steps are taken. These include (Tufano2018, Li2018):

  • replacing constant value with their types,
  • splitting camel-case,
  • removing low-level detailed abstract syntax tree (AST) nodes,
  • filtering out methods less than 10 lines long.
  • regular expressions to remove code format characters, and to identify code snippets embedded into the bug report.

An industrial case study on bug localization from Ericsson is presented in (Johnson2016). Topic models built with Latent Dirichlet Allocation (LDA) are learned from the set of bug reports. These are used to assign topic weights to bug reports based on the bug report text. The assigned weights are compared to the learned topic distribution for components, and the higher the match of distributions in the report vs learned component model, the higher the probability to assign the bug report to that component.

Vector Space Model (VSM) was used as a baseline comparison in many cases I found. This is based on TF-IDF scores (vectors) calculated for a document. Similarity between a bug report and source code files in VSM is calculated as a cosine similarity between their TF-IDF vectors. Revised Vector Space Model (rVSM) (Zhou2012) is a refinement of VSM that weights larger documents more, reasoning that bugs are more often found in larger source files. (Zhou2012) also adds weighting from similarity with previous bug reports.

Building on rVSM, (Lam2017) uses an auto-encoder neural network on TF-IDF weighted document terms to map different terms with similar meaning together for more accurate bug localization. Similarly, the "DeepSum" work (Li2018) uses an auto-encoder to summarize bug reports, and to compare their TF-IDF distance with cosine similarity. To me this use of auto-encoders seems like trying to re-invent word-embeddings for no obvious gain, but probably I miss something. After auto-encoding, (Lam2017) combines a set of features using a deep neural network (multi-layer perceptron (MLP) it seems) for final probability evaluation. In any case, word-embedding style mapping of words together in a smaller dimension is found in these works as others.

A Thought or Two

I am a bit surprised not to see much work in applying RNN type networks such as LSTM and GRU into these topics, since they are a great fit for processing textual documents. In my experience they are also quite powerful compared to traditional machine learning methods.

I think this type of bug report localization has practical relevance mainly for big companies with large product teams and customer bases, and complex processes (support levels, etc). This is in domains like telcos, from which the only clear industry report I listed here is from (Ericsson). Something I have found limiting these types of applications in practice is also the need for cross-domain vision to combine these topics and expertise. People seem often quite narrowly focused on specific areas. Black-box integration with common tools might help, again.

Defect Prediction

Software defect prediction refers to predicting which parts of the software are most likely to contain faults. Sometimes this is also referred to as fault proneness analysis. Aim is to provide additional information to help focus testing efforts. This is actually very similar to the bug report localization I discussed above, but with the goal of predicting where currently unknown bugs might appear (vs localizing existing issue reports).

An overall review of this area is presented in (Malhotra2015), showing an extensive use of traditional ML approaches (random forest, decision trees, support vector machines, etc) and traditional source code metrics (lines of code, cyclomatic complexity, etc.) as features. These studies show reasonably good accuracies up from 75% to 93%.

However, another broad review on these approaches and their effectiveness is presented in (Zhou2018). It shows how simply using larger module size to predict higher fault proneness would give equal or better accuracy in many cases. This is my experience from many contexts, keeping it simple is often very effective. But on the other hand, finding that simplicity can be the real challenge, and you can learn a lot by trying different approaches.

More recently, deep learning based approaches have also been applied to this domain. Deep Belief Nets (DBN) are applied in (Wang2018) to generate features from source code AST, and combined with more traditional source code metrics. The presentation on DBNs in (Zhou2018) is a bit unclear to me, but it seems very similar to a MLP. The output of this layer is then termed (as far as I understand) as "semantic feature vector". I looked a bit into the difference of DBN vs MLP, and found some practical discussion at a Keras issue. Make what you will of it. Do let me know if you figure it out better than I did (what is the difference in using a MLP style fully connected dense layer here vs DBN).

An earlier version of the (Wang2018) work is refined and further explored using convolutional neural networks (CNNs) in (Li2017). In this case, a word2vec word-embedding layer is used as the first layer, and trained on the source and AST vocabulary. This is fed into a 1-dimensional CNN, which is one of the popular deep learning network types for text processing. After passing through this part of the network, the output feature vector is merged with a set of the more traditional source metrics (lines of code, etc). These are together merged for the final network layers to do the prediction, which are fed into the final single-node output layer for the probability prediction.

Illustration of this network:

Metrics based model

To address class imbalance (more "clean" than "buggy" files), (Li2017) uses duplication of the minority class instances. They also compare to traditional metrics as well as the DBN from (Wang2018) and DBN+ whichs combines the traditional features with the DBN "semantic" features. As usual for research papers, they report getting better results with the CNN+ version. Everyone seems to do that, yet perfection seems never to be achieved, or even nearly. Strange.

A Thought

The evolution in defect prediction seems to be from traditional classifiers with traditional "hand-crafted" (source metrics) features to deep-learning generated and AST-based features in combination with traditional metrics. Again, I did not see any RNN based deep-learning classifier applications, although I would expect they should be quite well suited for this type of analysis. Perhaps next time.

Traceability Analysis

Despite everyone being all Agile now, heavier processes such as requirements traceability can still be needed. Especially for complex enough systems, and ones with heavy regulatory- or standards-based compliance requirements. Medical, telco, automotive, … In the real world, such traces may not always be documented, and sometimes it is of interest to find them.

A line of work exploring the use of deep learning for automating the generation of traceability links between software artefacts is in (Guo2017, Rath2018). These are from the same major software engineering conference (ICSE) over two following years, with some author overlap. So there is some traceability in this work too, heh-heh (such joke, much non-academic). The first one (Guo2017) aims to link requirements to design and test artefacts in the train control domain. The second one aims to link code submissions to issues/bug reports on Github.

Requirements documents

Using recurrent neural networks (RNN) to link requirements documents to other documents is investigated in (Guo2017). I covered this work to some extend already last year, but lets see if I can add something with what I learned since.

Use cases fot this as mentioned in (Guo2017):

  • Finding new, missing (undocumented) links between artefacts.
  • Train on a set of existing data for existing projects, apply to find links within a new project. This seems like a form of transfer learning, and is not explored in the paper. It focuses on the first bullet.

I find the approach used in (Guo2017) interesting, linking together two recurrent neural network (RNN) layers from parallel input branches for natural language processing (NLP):

Requirements linking NN

There are two identical input branches (top of figure above). One for the requirements documents, and one for the target document for which the link is assessed. Let’s pretend the target is a test document to stay relevant. A pair of documents is fed to different input branches of the network, and the network outputs a probability of these two documents being linked.

In ML you typically try different model configurations and hyperparameters to find what works best. In (Guo2017) they tried different types of layers and parameters. The figure above shows what they found best for their task. See Guo2017) for the experiment details for other parameter values. Here, a bi-directional gated recurrent unit (bi-GRU) layer is used to process each document into a feature vector.

When the requirements document and the target document have been transformed by this to a vector representation, they are fed into a pointwise multiplication layer (mul) and to a vector substraction layer (sub). In Keras this would seem to be a Merge layer with type "mul" or "sub". These merge layers are intended to describe the vector difference direction (mul) and distance (sub) across dimensions. A dense layer with sigmoid activation is used to integrate these two merges, and the final output is given by a 2-neuron softmax layer (linked/not linked probability).

For word-embeddings they try both a domain specific (train-control systems in this case) embedding with 50-dimensions, and a 300-dimensional one combining the domain-specific data with a Wikipedia text dump. They found the domain specific one works better, speculating it to be due to domain-specific use of words.

Since this prediction can produce many different possibilities in a ranked order, simple accuracy of the top choice is not "accurate" itself as an evaluation metric. For evaluating the results, (Guo2017) uses mean average prediction (MAP) as a metric. The MAP achieved in (Guo2017) is reported up to 83.4%. The numbers seem relatively good, although I would need to play with the results to really understand everything in relation to this metric.

An interesting point from (Guo2017) is a way to address class imbalances. The set of requirements and other documents that have valid links they have is a small fraction of the overall set. So the imbalance between the true and false labels is big. They address this by selecting an equal set of true and false labels for an epoch, and switching the set of false label items at the start of each epoch. So all the training data is processes, while a balance is held in each epoch. Nice.

Github Issues

Traceability for linking code commits to bug tracker issues and improvement tickets ("bugs" and "improvement" in project Jira) is presented in (Rath2018). The studied projects are 6 open-source projects written in Java. Unlike the previous study on requirements linking, this study does not use deep-learning based approaches but rather manual feature engineering and more traditional ML classifiers (decision trees, naive bayes, random forest).

This is about mapping issue reports to commits that fix those issues:

Github Issue Linking

Besides more traditional features, (Rath2018) also makes use of time related aspects as extra filtering features. A training set is built by finding commit messages that reference affected issue IDs. The features used include:

  • Timestamp of commit. Has to be later than creation timestamp for potential issue it could be linked to. Has to be inside given timeframe since issue was marked resolved.
  • Closest commit before analyzed commit, and its linked issues.
  • Closest commit after analyzed commit, and its linked issues.
  • Committer id
  • Reporter id
  • Textual similarity of commit source/message and issue text. TF-IDF weighted word- and ngram-counts.

The study in (Rath2018) looks at two different types of analysis for the accuracy of the ML classifier trained. In the first case they try to "reconstruct known links", and in the second case "construct unknown links". They further consider two different scenarios: recommending links for commits, and fully automated link generation. For assistance, their goal is to have the correct link tag in the top 3 suggestions. The automated tagging scenario requires the first predicted tag to be correct.

Not surprisingly, the top 3 approach has better results as it gives the classifier more freedom and leeway. Their results are reported with up to 95%+ recall but with a precision of around 30%. This seems to be in line with what I saw when I tried to build my issue categorization classifier. The first result may not always be correct but many good ones are in the top (and with too many possibilities, even the "correct" one might be a bit ambiguous).

The second use case of constructing previously unknown links sounds to me like it should be very similar in implementation to the first one, but it appears not. The main difference comes from there being large numbers of commits that do not actually address a specific Jira issue or ticket. The canonical example given is a refactoring commit. The obvious (in hindsight) result seems to state you are more likely to find a link if one is known to exist (case 1) vs finding one if it might not exist at all (case 2) :).

A Thought or Two

The point of the requirements linking approach finding the domain-specific word-embeddings better is interesting. In my previous LSTM bug predictor, I found domain specific training helps in similar way, although in that case also combining with the pre-trained word-embeddings worked nicely as well. Of course, I used a large pre-trained Glove embedding for that and did not train on Wikipedia myself. And used Glove vs Word2Vec but I would not expect a big difference.

However, the domain-specific embeddings performance sounds similar to ELMo, Bert, and other recent developments in context-sensitive embeddings. By training only on the domain-specific corpus, you likely get more context-sensitive embeddings for the domain. Maybe the train-control domain in (Guo2017) has more specific vocabulary, or some other attributes that make the smaller domain-specific embedding alone better? Or maybe the type of embedding and its training data makes the difference? No idea. Here’s hoping Elmo style contextual embeddings are made easy to add to Keras models soon, so I can more broadly experiment with those as well. In my obvious summary, I guess it is always better to try different options for different data and models..

Parting Notes

I tried to cover some different aspects of ML applications in software testing. The ones I covered seem to have quite a lot in common. In some sense, they are all mapping documents together. The set of features are also quite common, "traditional" source code metrics along with NLP features. Many specific metrics have also been developed as I listed above, such as modification and modifier (commit author) counts. Deep learning approaches are used to some extent, but it still seems to be making its way in this domain.

Besides what I covered, there are of course other approaches to apply ML to SW testing. I covered some last year, and (Durelli2019) covers much more from an academic perspective. But I found the ones I covered here to be a rather representative set of the ones I consider closest to practical today. If you have further ideas, happy to hear.

In general, I have not seen much of ML applied in meaningful ways to software testing. One approach I have used in past is to use ML as a tool for learning about a test network and its services (Kanstren2017). I am not sure if that really qualifies for a ML application to software testing, since it investigated properties of the test network itself and its services, not the process of testing. Perhaps the generalization of that is in "using machine learning with testing technologies". This would be different from applying ML to testing itself, as well as different from testing ML applications. Have to think about that.

Next I guess I will see if/when I have some time to look at the testing ML applications part. With all the hype on self-driving cars and everything else, that should be interesting.

See, I made this nice but too small text picture of the tree facets of ML and SW Testing I listed above:

Test vs ML facets


R. Baghwan et al., "Orca: Differential Bug Localization in Large-Scale Services", 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.

B. Busjaeger, T. Xie, "Learning for Test Prioritization: An Industrial Case Study", 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2016.

N. DiGiuseppe, J.A. Jones, "Concept-Based Failure Clustering", ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE), 2012.

V. H. S. Durelli et al., "Machine Learning Applied to Software Testing: A Systematic Mapping Study", IEEE Transactions on Reliability, 2019.

X. Gu, H. Zhang, S. Kim, "Deep code search", 40th International Conference on Software Engineering (ICSE), 2018.

J. Guo, J. Cheng, J. Cleland-Huang, "Semantically Enhanced Software Traceability Using Deep Learning Techniques", 39th IEEE/ACM International Conference on Software Engineering (ICSE), 2017.

L. Johnson, et al., "Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems Using Bayesian Classification", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

T. Kanstren, "Experiences in Testing and Analysing Data Intensive Systems", IEEE International Conference on Software Quality, Reliability and Security (QRS, industry track), 2017

M. Kim, et al., "Data Scientists in Software Teams: State of the Art and Challenges", IEEE Transactions on Software Engineering, vol. 44, no. 11, 2018.

A. N. Lam, A. T. Nguyen, H. A. Nguyen, T. N. Nguyen, "Bug Localization with Combination of Deep Learning and Information Retrieval", IEEE International Conference on Program Comprehension, 2017.

J. Li, P. He, J. Zhu, M. R. Lye, "Software Defect Prediction via Convolutional Neural Network", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

X. Li et al., "Unsupervised deep bug report summarization", 26th Conference on Program Comprehension (ICPC), 2018.

R. Malhotra, "A systematic review of machine learning techniques for software fault prediction, Applied Soft Computing, 27,2015.

A. Memon et al., "Taming Google-scale continuous testing", 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017.

M. Rath, J. Rendall, J.L.C Guo, J. Cleland-Huang, P. Mäder, "Traceability in the wild: Automatically Augmenting Incomplete Trace Links", 40th IEEE/ACM International Conference on Software Engineering (ICSE), 2018.

M. Tufano et al., "Deep learning similarities from different representations of source code", 15th International Conference on Mining Software Repositories (MSR), 2018.

S. Wang, T. Liu, J. Nam, L. Tan, "Deep Semantic Feature Learning for Software Defect Prediction", IEEE Transactions on Software Engineering, 2018.

J. Zhou, H. Zhang, D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports", International Conference on Software Engineering (ICSE), 2012.

Y. Zhou et al, "How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction", ACM Transactions on Software Engineering and Methodology, no. 1, vol. 27, 2018.

Predicting issue categories on Github

Practical examples of applying machine learning seem to be a bit difficult to find. So I tried to create one for a presentation I was doing on testing and data analytics. I made a review of works in the area, and just chose one for illustrate. This one tries to predict a target category to assign for an issue report. I used ARM mBed OS as a test target since it has issues available on Github and there were some people who work with it attending the presentation.

This demo “service” I created works by first training a predictive model based on a set of previous issue reports. I downloaded the reports from the issue repository. The amount of data available there was so small, I just downloaded the issues manually using the Github API that let me download the data for 100 issues at once. Automating the download should be pretty easy if needed. The amount of data is small, and there are a large number of categories to predict, so not the best for results, but serves as an example to illustrate the concept.

And no, there is no deep learning involved here, so not quite that trendy. I don’t think it is all that necessary for this purpose or this size of data. But could work better of course, if you do try, post the code so we can play as well.

The Github issues API allows me to download the issues in batches. For example, to download page 12 of closed issues, with 100 issues per page, the URL to request is The API seems to cut it down to 100 even if using bigger values than 100. Or I just didn’t quite use it right, whichever. The API docs describe the parameters quite clearly, I downloaded open and closed issues separately, even if I did not use the separation in any meaningful way in the end.

The code here is all in Python. The final classifier/prediction services code is available on my Github repository.

First build a set of stopwords to do some cleaning on the issue descriptions:

	stop_words = set(stopwords.words('english'))
	stop_words = stop_words.union(set(punctuation))
	stop_words.update(["''", "--", "**"])

The above code uses the common NLTK stopwords, a set of punctuation symbols, and a few commonly occurring symbol combinations I found in the data. Since later on I clean it up with another regular expression, probably just the NLTK stopwords would suffice here as well..

To preprocess the issue descriptions before applying machine learning algorightms:

def preprocess_report(body_o):
	#clean issue body text. leave only alphabetical and numerical characters and some specials such as +.,:/\
	body = re.sub('[^A-Za-z0-9 /\\\_+.,:\n]+', '', body_o)
	# replace URL separators with space so the parts of the url become separate words
	body = re.sub('[/\\\]', ' ', body)
	# finally lemmatize all words for the analysis
	lemmatizer = WordNetLemmatizer()
	# text tokens are basis for the features
	text_tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(body.lower()) if word not in stop_words]
	return text_tokens

Above code is intended to remove all but standard alphanumeric characters from the text, remove stop words, and tokenize the remaining text into separate words. It also splits URL’s into parts as separate words. The lemmatization changes known words into their baseforms (e.g., “car” and “cars” become “car”). This just makes it easier for the machine learning algorithm to match words together. Another option is stemming, but lemmatization produces more human-friendly words so I use that.

I stored the downloaded issues as JSON files (as Github API gives) in the data directory. To read all these filenames for processing:

#names of files containing closed and open issues (at time of download)
closed_files = glob.glob("data/*-closed-*")
open_files = glob.glob("data/*-closed-*")

To process those files, I need to pick only the ones with an assigned “component” value. This is what is the training target label. The algorithm is trained to predict this “component” value from the issue description, so without this label, the piece of data is not useful for training.

def process_files(files):
	process the given set of files by collecting issue body text and labels.
	also cleans and lemmatizes the body text

	:param files: names of files to process
	:return: nothing
	global total

	for filename in files:
		with open(filename, encoding="utf-8") as json_data:
			all_issues = json.load(json_data)
			for issue in all_issues:
				labels = issue["labels"]
				for label in labels:
					if label["name"].startswith("component:"):
						name = label["name"]
						body_o = issue["body"]
						text_tokens = preprocess_report(body_o)
						#component_labels are prediction targets
						total += 1

print("total: ", total)

There is a limited number of such labeled data items, as many of the downloaded issues do not have this label assigned. The print at the end of the above code shows the total number of items with the “component” label given, and the number in this dataset is 1078.

Besides removing stop-words and otherwise cleaning up the documents for NLP, combining words sometimes makes sense. Pairs, triplets, and so on are sometimes meaningful. Typical example is words “new” and “york” in a document, versus “new york”. This would be an example of a bi-gram, combining two words into “new_york”. To do this, I use the gensim package:

import gensim

# Build the bigram and trigram models
bigram = gensim.models.Phrases(all_text_tokens, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[all_text_tokens], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#just to see it works

#transform identified word pairs and triplets into bigrams and trigrams
text_tokens = [trigram_mod[bigram_mod[text]] for text in all_text_tokens]

#build whole documents from text tokens. some algorithms work on documents not tokens.
texts = [" ".join(tokens) for tokens in text_tokens]

The above code uses thresholds and minimum co-occurrence counts to avoid combining every possible word with every other possible word. So only word-pairs and triplets that commonly are found to occur together are used (replaced) in the document.

Use the Python data processing library Pandas to turn it into suitable format for the machine learning algorithms:

from pandas import DataFrame

df = DataFrame()

df["texts"] = texts
df["text_tokens"] = text_tokens
df["component"] = component_labels


First to have a look at the data:

#how many issues are there in our data for all the target labels, assigned component counts
value_counts = df["component"].value_counts()
#print how many times each component/target label exists in the training data
#remove all targets for which we have less than 10 training samples.
#K-fold validation with 5 splits requires min 5 to have 1 in each split. This makes it 2, which is still tiny but at least it sorta runs
indices = df["component"].isin(value_counts[value_counts > 9].index)
#this is the weird syntax i never remember, them python tricks. i think it slices the dataframe to remove the items not in "indices" list
df = df.loc[indices, :]

The above code actually already does a bit more. It also filters the dataset to remove the rows with component values that only have less than 10 items. So this is the unfiltered list:

component: tools              162
component: hal                128
component: export             124
component: networking         118
component: drivers            110
component: rtos                88
component: filesystem          80
component: tls                 78
component: docs                60
component: usb                 54
component: ble                 38
component: events              14
component: cmsis               10
component: stdlib               4
component: rpc                  4
component: uvisor               2
component: greentea-client      2
component: compiler             2

And after filtering, the last four rows will have been removed. So in the end, the dataset will not have any rows with labelsl “rpc”, “uvisor”, “greentea-client”, or “compiler”. This is because I will later use stratified 5-fold cross-validation and this requires a minimum of 5 items of each. Filtering with minimum of 10 instances for a label, it is at least possible to have 2 of the least common “component” in each fold.

In a more realistic case, much more data would be needed to cover all categories, and I would also look at possibly combining some of the different categories. And rebuilding the model every now and then, depending on how much effort it is, how much new data comes in, etc.

To use the “component” values as target labels for machine learning, they need to be numerical (integers). This does the transformation:

from sklearn.preprocessing import LabelEncoder

# encode class values as integers
encoder = LabelEncoder()
encoded_label = encoder.fit_transform(df.component)

Just to see how the mapping of integer id’s to labels after label encoding looks:

unique, counts = np.unique(encoded_label, return_counts=True)
print(unique) #the set of unique encoded labels
print(counts) #the number of instances for each label

The result (first line = id, second line = number of items):

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 38  10  60 110  14 124  80 128 118  88  78 162  54]

Mapping the labels to integers:

#which textual label/component name matches which integer label
le_name_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

#which integer matches which textual label/component name
le_id_mapping = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))

So the first is to print “label: id” pairs, and the second to print “id: label” pairs. The first one looks like this:

'component: ble': 0, 
'component: cmsis': 1, 
'component: docs': 2, 
'component: drivers': 3, 
'component: events': 4, 
'component: export': 5, 
'component: filesystem': 6, 
'component: hal': 7, 
'component: networking': 8, 
'component: rtos': 9, 
'component: tls': 10, 
'component: tools': 11, 
'component: usb': 12

Now, to turn the text into suitable input for a machine learning algorithm, I transform the documents into their TF-IDF presentation. Well, if you go all deep learning with LSTM and the like, this may not be necessary. But don’t take my word for it, I am still trying to figure some of that out.

TF-IDF stands for term frequency (TF) – inverse document frequency (IDF). For example, if the word “bob” appears often in a document, it has a high term frequency for that document. Generally, one might consider such a word to describe that document well (or the concepts in the document). However, if the same word also appears commonly in all the documents (in the “corpus”), it is not really specific to that document, and not very representative of that document vs all other documents in the corpus. So IDF is used to modify the TF so that words that appear often in a document but less often in others in the corpus get a higher weight. And if the word appears often across many documents, it gets a lower weight. This is TF-IDF.

Traditional machine learning approaches also require a more fixed size set of input features. Since documents are of varying length, this can be a bit of an issue. Well, I believe some deep learning models also require this (e.g., CNN), while others less so (e.g., sequential models such as LSTM). Digressing. TF-IDF also (as far as I understand) results in a fixed length feature vector for all documents. Or read this on Stack Overflow and make your own mind up.

Anyway, to my understanding, the feature space (set of all features) after TF-IDF processing becomes the set of all unique words across all documents. Each of these is given a TF-IDF score for each document. For the words that do not exist in a document, the score is 0. And most documents don’t have all words in them, so this results in a very “sparse matrix”, where the zeroes are not really stored. That’s how you can actually process some reasonable sized set of documents in memory.

So, in any case, to convert the documents to TF-IDF presentation:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)

#transfor all documents into TFIDF vectors.
#TF-IDF vectors are all same length, same word at same index, value is its TFIDF for that word in that document
features_transformed = vectorizer.fit_transform(features)

Above code fits the vectorizer to the corpus and then transforms all the documents to their TF-IDF representations. To my understanding (from SO), the fit part counts the word occurrences in the corpus, and the transform part uses these overall counts to transform each document into TF-IDF.

It is possible also to print out all the words the TF-IDF found in the corpus:

#the TFIDF feature names is a long list of all unique words found
feature_names = np.array(vectorizer.get_feature_names())

Now to train a classifier to predict the component based on a given document:

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

#the classifier to use, the parameters are selected based on a set i tried before
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, min_samples_split=5)

results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

#fit the classifier on the TFIDF transformed word features, train it to predict the component, encoded_label)
probabilities = clf.predict_proba(features_transformed[0])

In the above I am using RandomForest classifier, with a set of parameters previously tuned. I am also using 5-fold cross validation, meaning the data is split into 5 different parts. The parts are “stratified”, meaning each fold has about the same percentage of each target label as the original set. This is why I removed the labels with less that 10 instances in the beginning, to have at least 2 for each class. Which is till super-tiny but thats what this example is about.

The last part of the code above also runs a prediction on one of the transformed documents just to try it out.

Now, to run predictions on previously unseen documents:

import requests

def predict_component(issue):
	use this to get a set of predictions for a given issue.

	:param issue: issue id from github.
	:return: list of tuples (component name, probability)
	#first load text for the given issue from github
	url = "" + str(issue)
	r = requests.get(url)
	url_json = json.loads(r.content)
	#process the loaded issue data to format matching what the classifier is trained on
	issue_tokens = preprocess_report(url_json["body"])
	issue_tokens = trigram_mod[bigram_mod[issue_tokens]]
	issue_text = " ".join(issue_tokens)
	features_transformed = vectorizer.transform([issue_text])
	#and predict the probability of each component type
	probabilities = clf.predict_proba(features_transformed)
	result = []
	for idx in range(probabilities.shape[1]):
		name = le_id_mapping[idx]
		prob = (probabilities[0, idx]*100)
		prob_str = "%.2f%%" % prob
		print(name, ":", prob_str)
		result.append((name, prob_str))
	return result

This code takes as parameter an issue number for the ARM mBed Github repo. Downloads the issue data, preprocesses it similar to the training data (clean, tokenize, lemmatize, TF-IDF). This is then used as a set of features to predict the component, based on the model trained earlier.

The “predict_component” method/function can then be called from elsewhere. In my case, I made a simple web page to call it. As noted in the beginning of this post, you can find that webserver code, as well as all the code above on my Github repository.

That’s pretty much it. Not very complicated to put some Python lines one after another, but knowing which lines and in which order is perhaps what takes the time to learn. Having someone else around to do it for you if you are a domain expert (e.g., testing, software engineering or similar in this case) is handy, but it can also be useful to have some idea of what happens, or how the algorithms in general work.

Something I left out in all the above was the code to try out different classifiers and their parameters. So I will just put it below for reference.

First a few helper methods:

def top_tfidf_feats(row, features, top_n=25):
	''' Get top n tfidf values in row and return them with their corresponding feature names.'''
	topn_ids = np.argsort(row)[::-1][:top_n]
	top_feats = [(features[i], row[i]) for i in topn_ids]
	df = pd.DataFrame(top_feats)
	df.columns = ['feature', 'tfidf']
	return df

#this prints it for the first document in the set
arr = features_test_transformed[0].toarray()
top_tfidf_feats(arr[0], feature_names)

def show_most_informative_features(vectorizer, clf, n=20):
	feature_names = vectorizer.get_feature_names()
	coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
	top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
	for (coef_1, fn_1), (coef_2, fn_2) in top:
		print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

In above code, “top_tfidf_feats” prints the top words with highest TF-IDF score for a document. So in a sense, it prints the words that TF-IDF has determined to be most uniquely representing that document.

The “show_most_informative_features” prints the top features that a given classifier has determined to be most descriptive/informative for distinguishing target labels. This only works for certain classifiers, which have such simple co-efficients (feature weights). Such as multinomial naive-bayes (MultinomialNB below).

Here is the code to actually try out the classifiers then:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(), labels_train)

from sklearn.metrics import accuracy_score

y_pred = clf.predict(features_test_transformed)
y_true = labels_test
acc_score = accuracy_score(y_true, y_pred)
print("MNB accuracy:"+str(acc_score))

show_most_informative_features(vectorizer, clf)

#try it out on a single document
probabilities = clf.predict_proba(features_test_transformed[0])

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

#set of parameters to try
estimators = [10, 20, 30, 40, 50]
min_splits = [5, 10, 20, 30, 40, 50]
min_leafs = [1, 2, 5, 10, 20, 30]

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

best_acc = 0.0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)

			results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

			print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

			if results.mean() > best_acc:
				best_acc = results.mean()
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

print("best classifier:")

best_acc = 0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split), labels_train)

			pred = clf.predict(features_test_transformed)

			accuracy = accuracy_score(labels_test, pred)


			if accuracy > best_acc:
				best_acc = accuracy
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

In the code above, I use loops to run through the parameters. There is also something called GridSearch in the Python libraries, as well as RandomSearch (for cases where trying all combos is expensive). But I prefer the ability to control the loops, print out whatever I like and all that.

The above code also shows two ways I tried to train/evaluate the RandomForest parameters. First is with k-fold, latter with single test-train split. I picked MultinomialNB and RandomForest because some internet searching gave me the impression they might work reasonably well for unbalanced class sets such as this one. Of course the final idea is always to try and see what works.. This worked quite fine for me. Or so it seems, machine learning seems to be always about iterating stuffs and learning and updating as you go. More data could change this all, or maybe finding some mistake, or having more domain or analytics knowledge, finding mismatching results, or anything really.

What the unbalanced refers to is the number of instances of different components in this dataset, some “components” have many bug repots, while others much less. For many learning algorithms this seems to be an issue. Some searches indicated RandomForest should be fairly robust for this type so this is also one reason I used it.

Running the above code to experiment with the parameters also produced some slightly concerning results. The accuracy for the classifier ranged from 30% to 95% with smallish parameters changes. I would guess that also speaks for the small dataset causing potential issues. Also re-running the same code would give different classifications for new (unseen) instances. Which is what you might expect when I am not setting the randomization seed. But then I would also expect the accuracy to vary somewhat, which it didn’t. So just don’t take this as more than an example of how you might apply ML for some SW testing related tasks. Take it to highlight the need to always learn more, try more things, and get a decent amount of data, evolve models constantly, etc. And post some comments on all the things you think are wrong in this post/code so we can verify the approach of learning and updating all the time :).

In any case, I hope the example is useful for giving an idea of one way how machine learning could be applied in software testing related aspects. Now write me some nice LSTM or whatever is the latest trend in deep learning models, figure out any issues in my code, or whatever, and post some comments. Cheers.