Algorithm Test Engineering Part 2: Machine Learning

If we don’t exactly know what we are testing, how can we test?

Let’s probe the machine to learn what it has learned. Photo by Zoltan Tasi on Unsplash

In my previous algorithm test engineering article, I discussed the testing and analysis of more classical algorithms, such as binary search. While the overall testing of algorithms can be complicated, most classical algorithms can be described as having quite clearly defined inputs, outputs, and their relations.

In this follow-up article I discuss testing and analysis of a different type of algorithm, where the input-output relations are more complex, harder to define, and the results sometimes subjective. Typically these algorithms are based on machine learning (ML). For example, the same set of Internet search results can be good for one person at one time, but less so for another person, or at a different time. The same might apply to other properties, such as an e-nose trying to distinguish wines vs trying to distinguish whiskeys based on the same sensor inputs, or personal stress levels based on biometrics. The context, and sometimes the personal “feeling”, can make the difference.

In this article, I explore the idea of what it means to test these types of systems. I start with testing as part of the ML model training process, extending into post-training testing. To make it all a bit more concrete, I look at a few industry examples, and reflect on my experiences on a search-engine project I built a few years back using customized Natural Language Processing (NLP) algorithms.

The Train, Validate, and Test Process

The basic test process in relation to machine learning is the testing and validation during the ML model training process. The following figure illustrates this:

ML train and test loop. Image by author.

This uses three datasets:

  • Training set: the ML model is trained on this set, with the algorithm tuning the model to better generalize over all of its items.
  • Validation set: the trained model’s performance is evaluated on this separate dataset. Training is repeated with the training set as long as the validation score improves, or until other end criteria are met.
  • Test set: a final, separate dataset used to perform a final test of the results, independent of the whole training loop and its validation.
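As a minimal sketch, the three-way split above could look like this with scikit-learn. The dataset, model, and split ratios are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative dataset; in practice this is your real data.
X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the final test set, held out from the whole training loop.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)                # train on the training set
val_score = model.score(X_val, y_val)      # tune / stop based on validation score
test_score = model.score(X_test, y_test)   # final, one-off evaluation
```

In a real training loop the validation score would be checked repeatedly (per epoch or per candidate model), and the test set touched only once at the very end.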

The above process aims to build a model that generalizes as well as possible in relation to given criteria (e.g., accuracy or prediction error) over the training data. Typically this optimizes the model over the given dataset, but does not tell us on which specific data items the model still performs poorly.

Following this, post-training testing refers to testing of the fully trained model, after the above process has finished. Let’s look at this type of testing in a bit more detail.

Post-Training Testing

I ran into this term while writing this article, in an article by Jeremy Jordan, and found it very fitting. He described it as investigating the logic behind the “final” algorithm (trained model). I like the term investigating, as I think of this as exploring the model behaviour with different inputs and input transformations. This is very much like the metamorphic testing (MMT) approach I wrote about earlier, with more focus on the exploration part. As a reminder, MMT makes modifications to inputs and observes the effect on outputs. The modifications are called metamorphic transformations.

The Jordan article splits post-training testing into three types:

  • Invariance tests: Testing the ML algorithm with similar inputs that one would expect to give a similar output. The example given is two different (red) apples. Think metamorphic transformations.
  • Directional tests: Testing small input modifications in a given direction to see if the algorithm output reacts in an expected way. For example, does the house price go up with increase in house size? Another form of metamorphic transformation.
  • Minimum functionality tests: Testing the ML algorithm with specific instances of inputs, where we know the result we expect. For example, does the same apple always give the same result (e.g., classification percentages).
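As a rough sketch, these three test types can be expressed as simple assertions against a trained model. Here the “model” is a hypothetical stand-in function for house-price prediction, and the tolerances are made up for illustration:

```python
# Hypothetical stand-in for a trained model:
# price = 1000 * size_m2 + 5000 * bathrooms.
def predict_price(size_m2, bathrooms):
    return 1000.0 * size_m2 + 5000.0 * bathrooms

# Invariance test: near-identical inputs should give near-identical outputs.
assert abs(predict_price(80.0, 2) - predict_price(80.1, 2)) < 500

# Directional test: increasing house size should not lower the price.
assert predict_price(90.0, 2) >= predict_price(80.0, 2)

# Minimum functionality test: a known input gives the expected output.
assert predict_price(100.0, 2) == 110000.0
```

With a real model, the equality in the last test would typically be replaced by a tolerance, since retraining rarely reproduces outputs exactly.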

I discussed many similar aspects related to testing traditional algorithms in my earlier article, and I think the above test types can be applied to testing in general. However, in testing ML algorithms, I find the exploration has a bigger role due to the model (algorithm) being more of a black box, with complex internals, such as large number of weights and connections.

I believe how well the above test types can be applied depends on the data, model, and domain at hand. For example, when classifying images of apples, I find it quite intuitive to define variants of an apple, or how to transform such images. Other times it gets more complicated. Let’s look at some examples.

Examples of Evaluation Complexity

For example, in my metamorphic testing article I discussed the use case of self-driving cars. Now imagine that the ML algorithm feeds steering angles to a driving system. If this angle drifts by fractions of a percentage over time (with different model versions) for the same scenarios, when is it a relevant change in output? Or if it uses multiple camera inputs from the real world, and we consider all the complexities of the real-world living environment as inputs, which change in input is relevant? The data space can be huge. In time-series data (such as driving), the data relations and changes over time also need to be considered.
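To make the drift question concrete, here is a small hypothetical sketch: given steering angles produced by two model versions for the same scenarios, flag those whose drift exceeds a chosen tolerance. The angles and the tolerance are invented; choosing a meaningful tolerance is exactly the hard, domain-specific part:

```python
# Hypothetical steering-angle outputs (degrees) for the same test scenarios
# from two versions of the model.
angles_v1 = [1.20, -0.50, 3.10, 0.00]
angles_v2 = [1.21, -0.52, 3.45, 0.01]

# A chosen, domain-specific threshold for what counts as "relevant" drift.
TOLERANCE = 0.25

# Indices of scenarios where the output drifted more than the tolerance.
drifted = [i for i, (a, b) in enumerate(zip(angles_v1, angles_v2))
           if abs(a - b) > TOLERANCE]
drifted  # → [2]
```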

The Jordan article gives another interesting example of defining invariants and using them for investigation/testing: real-estate price estimation. Specifically, expecting a higher bathroom count not to lower the house price, or a decrease in house size not to increase the price. These are in a way inverse invariants, describing the opposite of what one might expect. Maybe I expect the price to stay the same or rise, but not to drop. An interesting angle to look at the data and expectations.

In Jordan’s example, they noticed that their dataset for smaller apartments was mostly from a big city where apartments were more expensive. This had biased the model with regard to smaller apartment sizes in general. I find this an interesting example of a finding that would be easy to miss without in-depth exploration (or investigation) of the post-training results.

Operational vs Data Domain

Thinking of the metamorphic relations at a high level, one might consider what kind of changes in the operational domain (e.g., weather, driving environment, angles, for cars) one might observe. These are high-level changes describing human understandable concepts. Additionally, there are changes in the input data domain itself (e.g., sensor corruptions, interference, signal disturbance, environmental impacts).

These two domains are nicely described in an article by Daniel Angilov. Although not using the metamorphic testing terms, it nicely summarizes the difference of how one might build input transformations for testing ML at a higher level (operational level), and how they might map to the lower level representation (data level). Naturally, the data domain is much larger in scope, as it includes all possible input values. For the operational domain many of these may not be relevant (e.g., images of random data). Illustration:

Operational domain vs Data Domain, adapted from here.

The Operational Domain deals with more intuitive, high-level, human-understandable concepts and transformations. For example, in an autonomous car scenario: foggy scenes, camera angles, or rainy vs sunny weather. These are mapped to the data domain, where the actual data transformations happen. We can apply operational-domain transformations to produce high-level modifications for identified test scenarios, and lower-level data transformations (e.g., adding noise) for lower-level changes.
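A small sketch of the two levels, using a random NumPy array as a stand-in camera frame. The fog and noise models here are deliberately simplistic assumptions, not how a real simulator would implement them:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in camera frame, values in [0, 1]

# Operational-domain transformation: simulate fog by blending toward white.
def add_fog(img, intensity=0.5):
    return (1 - intensity) * img + intensity * np.ones_like(img)

# Data-domain transformation: add low-level sensor noise.
def add_noise(img, sigma=0.05):
    return np.clip(img + rng.normal(0, sigma, img.shape), 0.0, 1.0)

foggy = add_fog(image)   # a high-level, human-understandable scenario change
noisy = add_noise(image) # a low-level change in the raw input data
```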

Here is an example from my metamorphic testing article that I believe illustrates a (domain) transformation in the operational domain:

Operational domain transformation example. Image by author.

In this case, the camera angle of the imaginary car has been turned, as if the car itself was tilted. Well, the other car in front has also moved further away :). I took these pictures while walking across the street for my metamorphic testing paper, which explains it. A more data domain oriented transformation could be, for example, to add static noise to the image.

It is unlikely that we could cover all possible values in the data domain, or figure out a generic test oracle for all of them. Similarly for the operational domain, expecting to build exhaustive test sets for all possible input combinations is not feasible. Instead, we need to try to figure out the most relevant ones for both domains, pick a suitable test set, and explore its changes and evolution over time. Unfortunately I do not have a universal recipe for this (at this time).

Testing in the Overall ML Process

Considering the overall ML train, test, and operations process, perhaps the most succinct yet detail-packed paper on ML testing I have seen is the Google paper from the 2016 Reliable Machine Learning in the Wild workshop. Since the entire paper is just a list of test types and their very concise descriptions, I list the ones that I found most interesting. These all relate to continuous testing:

  • Feature assumptions. Such as value ranges and most common values, as these may drift over time. Check assumptions stay valid.
  • Feature relations to target variable and each other. To understand your data better, and check the relations hold.
  • Computational cost vs benefit per each feature. Is it worth including all features for their computational cost? Does it evolve?
  • Leak of unwanted features into the model due to copy-paste type errors. Constant monitoring should alert if unwanted features are used.
  • Change of model evaluation score over time if consistently re-trained. For example, the effect of daily training with new data.
  • Model bias with regards to specific types of data, does some new data or data type bring new biases that should be considered.
  • Reproducibility of training, how big is the drift across multiple trainings on the same data.
  • Constantly monitoring how invariants seen in training data should hold for operational data over time.
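As one hedged example of the first point, feature assumptions recorded at training time could be checked against incoming operational data roughly like this (the feature names and ranges are hypothetical):

```python
# Assumptions about feature value ranges, recorded at training time.
TRAINING_ASSUMPTIONS = {
    "age": {"min": 18, "max": 100},
    "income": {"min": 0, "max": 500_000},
}

def check_feature_assumptions(batch, assumptions):
    """Return the names of features whose incoming values violate the
    value ranges observed at training time."""
    violations = []
    for name, bounds in assumptions.items():
        values = batch.get(name, [])
        if any(v < bounds["min"] or v > bounds["max"] for v in values):
            violations.append(name)
    return violations

# New operational data drifting outside the training-time range.
incoming = {"age": [25, 40, 130], "income": [30_000, 80_000]}
check_feature_assumptions(incoming, TRAINING_ASSUMPTIONS)  # flags "age"
```

In a real pipeline this check would run continuously on incoming batches, and a non-empty result would raise a monitoring alert.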

I see the overarching theme is to constantly monitor and observe change and impact.

This is just a short list of ones I found most related to this article. There are many more related to the overall process from ML code review to operational monitoring. I recommend reading the Google paper for the very insightful list.

A Few Real-World Examples

Lots of words are nice, but real-world examples help make it concrete. Especially real ones from the industry. Recently I listened to some ACM ByteCast episodes, a few of which provided interesting industry insights.


One of these discussed research on recommendation algorithms at Spotify; more specifically, Spotify’s recommendation systems and everything around them. The term used in this talk was evaluation of ML results.

This differentiated evaluation into two forms: offline and online. Offline is the traditional ML evaluation approach of using a train/test split and metrics such as accuracy, precision, and recall. Online refers to evaluating the algorithm’s performance from user actions, such as clicks on a webpage, or, in Spotify’s case, I guess, choices on music playlists. They discuss the concept of a proxy of user engagement: whether some specific interaction in an app is a good measure of an achieved goal, or whether a returning user is an indication of satisfaction and success. An interesting viewpoint on defining an ML test oracle, and on constantly investigating the ML algorithm’s behaviour.
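The offline side of this is the familiar metric computation. A minimal sketch with scikit-learn, using made-up labels and predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical held-out labels and model predictions for offline evaluation.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # fraction of correct predictions
precision = precision_score(y_true, y_pred)  # of predicted 1s, how many were 1
recall = recall_score(y_true, y_pred)        # of actual 1s, how many were found
```

Online evaluation has no such ready-made formula; it is about choosing and validating proxies (clicks, returns, playlist choices) for these numbers.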

Another aspect discussed in the Spotify episode was enabling users to discover new content, instead of simply relying on algorithm recommendations based on the current user profile. In the Spotify case, about 10% of the options presented to the user were selected as these “new” types of choices. This is a good example of considering the overall user experience and goals when testing / investigating ML algorithms as part of a system. It ensures the user is not locked into an algorithmic sandbox, supporting them in finding new ideas (music, search results, ..) to explore.


Another ByteCast episode I found interesting on this topic was the one on DuoLingo. DuoLingo is an app and service designed to help people learn languages. They use machine learning in many parts of their system: creating tasks for students tailored to their learning history, customizing content based on how well the user does on certain words and language structures and on what has worked for similar users, and in many other ways I am sure I miss here. To summarize: a successful application that applies ML heavily.

DuoLingo generates tailored quizzes for users, based on their learned profiles, together with expert teams. In this way it gamifies the evaluation of the algorithms: users provide feedback by answering the tailored quizzes, combined with approaches such as A/B testing. I find this a very interesting idea and approach for “online” algorithm testing and evolution with algorithm feedback from users, enticing the user, as part of using the service, to work on tasks that help the algorithm learn to serve them better.

An Example of My Own: ValtuustoPilvi

A few years back I built a search engine called Valtuustopilvi (Finnish for CouncilCloud), as part of an open-data-based service competition hosted by the city of Oulu in Finland. It ranked first in this (small) competition, and so I ended up also hosting it as a service for the city for the following two years. As a result, I got familiar with various NLP algorithms and built my first service using them, including my first experiences in testing ML and NLP.

Service Overview

ValtuustoPilvi was designed for people to interactively search the city council meeting documents. I used various existing Natural Language Processing (NLP) libraries for the search algorithms, built some custom algorithms of my own, and adapted some well known and widely used ones.

Since it was just me working on this project, I did the development, testing, analysis, and everything else. Not sure if that is good or bad for this case, but that’s how it was. I packaged the code up a few years back, so no concrete code measurements and executions this time, but rather examples of, and reflections on, how I tested, analyzed, and designed it. I hope it makes for something more concrete and interesting.

To concretize the service a bit, here is a screenshot of the search UI:

Valtuustopilvi interactive search UI. Image by author.

The user could interact with the word-cloud by selecting words in it to refine their search, or by typing query terms in the box shown at the bottom of the above figure. The system built (in real time) a new word-cloud matching the results for the modified search query, and the user could iteratively continue to refine their search using the continuously updated UI / word-cloud.

I designed several algorithms to build the word-cloud at different hierarchy levels. The overall document set was not changing very rapidly, so for it I used a pre-computed data model, updated every night. On first arrival at the main search page for all the documents, or for a specific pre-defined sub-category of council meetings (e.g., building permits), this type of data model was applied. It was based on a topic model. Based on this topic model, a set of words was chosen, each weighted by how highly the topic model ranked it in the topics it discovered. This is visible as the word size in the word-cloud above. More detail on this shortly; first, a few words on pre-processing.

Computational Complexity Example: Preprocessing with Voikko

In my previous algorithm testing article, I discussed evaluating an algorithm’s computational complexity. In the case of ValtuustoPilvi, a related example comes from using a third-party library, whose computational complexity became important as my use differed from its previous use cases.

A very common task in ML, and in NLP especially, is to preprocess the data being fed to the algorithms. In the case of the word-cloud building algorithms, one of the basic needs is to unify the different forms of a word into a single representation. In NLP this is called lemmatization (i.e., base-form conversion, e.g., cars -> car).

For this, I used the Voikko Finnish NLP library. Voikko had earlier been used primarily for shorter pieces of text, such as sentences and words in spell checking. However, I used Voikko to process whole documents, the longest of which were hundreds of pages long. With such larger inputs, the processing time grew exponentially. After I reported the issue, it was fixed in version 4.0.1 (visible in the Voikko release notes).

In relation to the topic of this and my previous article, this illustrates how different requirements for an algorithm can be relevant in different scenarios, and how this may evolve over time. The Voikko web page now actually lists the library as commonly used in machine learning pipelines for Finnish texts, illustrating its general use-case drift in this direction.

Continuous Data Analysis: Testing Preprocessing

Lemmatization tools such as Voikko are great for processing well-written and structured text. However, real-world text often contains misspellings, domain-specific words, abbreviations, and other similar anomalies. Tools such as Voikko fail to process these, as they are not in the standard dictionary.

Reading all the documents and manually fixing all the text is not feasible for any reasonably sized input set. Instead, I collected all the words/tokens that Voikko did not recognize. By ranking these unrecognized tokens by frequency, I could easily find words that occurred often but were not recognized, create custom rules to lemmatize them as needed, and thus continuously monitor, test, and improve the preprocessing, even as the set of documents, and the topics they covered, evolved over time. This process ran each time the document set and topic models were updated (nightly). I checked the results about once a week, updating the lemmatization rules when new, frequent misspellings or new words appeared.
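The unknown-token collection could be sketched roughly as below. The lemmatizer here is a tiny stand-in dictionary, not the real Voikko API, and the documents are invented:

```python
from collections import Counter

def lemmatize(token):
    """Stand-in lemmatizer: returns the base form, or None for tokens it
    does not recognize (as Voikko does for out-of-dictionary words)."""
    dictionary = {"cars": "car", "cities": "city"}
    return dictionary.get(token)

def collect_unknown_tokens(documents):
    """Count the tokens the lemmatizer failed on, most frequent first."""
    unknown = Counter()
    for doc in documents:
        for token in doc.split():
            if lemmatize(token) is None:
                unknown[token] += 1
    return unknown.most_common()

docs = ["cars in cities", "oulu cars oulu", "oulu cities"]
collect_unknown_tokens(docs)  # → [('oulu', 3), ('in', 1)]
```

The top of this list is then the highest-value place to add custom lemmatization rules, exactly because those tokens occur most often.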

Algorithm Customization: LDA for Word Clouds

Another topic I previously discussed is the importance of understanding the algorithms used. In this case, the topic model algorithm I used produces internal representations that are typically not used as part of the output. However, learning the algorithm in detail, I was able to use those internal properties as a basis for service features. Of course, understanding what you are testing does not hurt in testing either.

The topic modelling algorithm I used for building the word-cloud was Latent Dirichlet Allocation (LDA). This is an NLP algorithm that maps documents and texts to a set of “topics”. The discovered topics are often used as input to further steps in an application. First, a short introduction to LDA.

LDA maps words together in clusters that are called topics. To concretize, here is an example of four topics (word clusters) I built using LDA on one of the Kaggle COVID-19 research paper datasets (code and notebook here):

LDA based topic model with four topics, with word-weights shown. Image by author.

With LDA, interpreting these topics/clusters is left to the user. In this case, I would say the topics above seem to relate to patient treatments (topic 1), virus analysis (topic 2), structure of the virus (topic 3), and possibly infection mechanisms (topic 4).

In the case of Valtuustopilvi, the exact meaning of the topics was irrelevant; what mattered was that a good set of representative words and their weights was captured, and that they represented a set of high-level and diverse concepts in the document set. The idea in the end was to help the user explore different and interesting high-level concepts in the document set. This end-user goal is always good to keep in mind when testing a system, ML-based or not.

The topics themselves were never shown to the user; only the backend code used them as a basis to build the high-level word-clouds. Summing the weights in different ways provided a basis to choose the words for the cloud and to weight their sizes.

As I was testing and using the service myself, I also realized that giving the user fully deterministic results, in the form of the highest-weighted words from the algorithm, did not serve the user well: it would always give the exact same word-cloud for the same document set. It would not help them explore and find new information over time, and it felt boring after seeing, a few times, the exact same word cloud I had already interacted with. So, similar to the Spotify example of trying to avoid the algorithmic sandbox, I added some randomization: reducing the weights of some words, and varying which of the top words in each topic were included in the word-cloud calculations and how much weight each was given.
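The randomized selection could be sketched as below. The words, weights, and jitter amount are invented for illustration, not the actual Valtuustopilvi code:

```python
import random

# Hypothetical summed word weights from the topic model.
word_weights = {"permit": 9.0, "building": 8.5, "zoning": 7.0,
                "budget": 6.5, "school": 6.0, "park": 5.5, "road": 5.0}

def pick_cloud_words(weights, n=5, jitter=0.3, seed=None):
    """Pick n words for the cloud, with randomly jittered weights so that
    the same document set does not always produce the same cloud."""
    rng = random.Random(seed)
    jittered = {w: v * (1 + rng.uniform(-jitter, jitter))
                for w, v in weights.items()}
    return sorted(jittered, key=jittered.get, reverse=True)[:n]

pick_cloud_words(word_weights, seed=1)
```

With `jitter=0` this degenerates to the deterministic top-n selection; the jitter is what occasionally lets lower-ranked words surface into the cloud.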

Testing, Analysis, and Development of ValtuustoPilvi

The LDA example above was used only for the top-level searches, where people were searching over all the documents, or over a specific pre-defined category of meetings. It requires pre-computation and is most useful for summarizing larger sets. For deeper search queries with dynamic document sets, I used an algorithm that was faster and adapted to changing search results. For further details on this algorithm, and the others I used, refer to the paper draft I wrote on the topic (…) back in the day.

An example of a base test here was to check that the correct algorithm was used. Such base tests are often quite simple and intuitive to write, given some basic understanding of how the system works. But how to test that the word cloud is a good representation of the actual topics for the user? That is much more subjective and harder to test.

Looking back at this, I see different levels of testing I did on this:

Test sizes and oracle levels in the Valtuustopilvi project. Image by author.

First, I looked at a set of documents for a category, and whether the results the algorithm provided made sense to me. This is the Me at the bottom of the above pyramid. The test oracle was myself: a very limited view, but also very much to the point, and easy to check (I know exactly what I am thinking, at least usually :)).

Next came test groups, such as the competition reviewers and the feedback they gave. This feedback was more generic, without detailed knowledge of the algorithms, but it still specifically addressed how they felt about different parts of the service, which is practically the algorithm output transformed into a user-facing form.

The third level is the general user population. It has a large sample size, potentially covering all users of the service. However, the test oracle is very generic, as it can only rely on indirect clues collected from user interactions with the service. This is similar to the Spotify (e.g., evaluation and search) and DuoLingo examples of analyzing user interactions and performing experiments to tune algorithms. The same also applies to search engines in general, such as Google search-result optimization based on aggregated and anonymized interactions (linked in Feb/2022).

In the Valtuustopilvi case, the service did not have an internet-scale number of users, so there was not that much to analyze at the top level of all users. I also included a feedback mechanism in the UI, but beyond a few basic comments not much came of it. However, once a service reaches a large enough scale, I believe the general user population is a very useful opportunity to pursue for evaluation data and ideas, as illustrated by the Spotify, DuoLingo, and Google examples. Of course, keeping in mind all the privacy and similar aspects.


Looking back at what became of this article, the big difference in testing ML-based algorithms compared to traditional algorithms, as I see it, is the black-box nature of the ML models. They embed complex internals, such as a large number of weights, nodes, and connections. And each of these differs across applications, model configurations, training sessions, and the data used. Thus there is no single way to test a specific ML algorithm; the process involves much investigation and exploration. And it is a constant process, as the data, the models, and our understanding based on them often evolve over time.

Some of the specific points I collected from this:

  • Data evolution, and constantly monitoring it,
  • Trained model evolution, and constantly monitoring it,
  • Identifying important operational and data invariants to cover,
  • Post-training testing types and metamorphic testing,
  • Different viewpoints, such as inverse assumptions/invariants,
  • Investigation and exploration as a central concept,
  • User (system) feedback and its proxies in evaluation at different scales,
  • Keeping the entire ML operations pipeline and related quality in mind,
  • Breadth in end (user) goals, such as avoiding algorithmic sandboxes,

Looking at the above, another common theme is the focus on continuous testing, monitoring, and analysis, rather than the point-in-time testing common in classical testing.

Overall, I find there are many aspects to consider when testing ML algorithms, and the systems built around them. Much of the same concepts apply as for traditional testing, but hopefully this article helps provide some insights into the specifics of ML testing.

That’s all for now. Cheers 🙂

