All posts by Teemu

About Teemu

I am interested in technology research, data engineering, and applied machine learning. Advanced test automation architectures and systems are also in my park. If you have a great offer to make, let go.. :)

Algorithm Test Engineering Part 2: Machine Learning

If we don’t exactly know what we are testing, how can we test?

Let’s probe the machine to learn what it has learned. Photo by Zoltan Tasi on Unsplash

In my previous algorithm test engineering article, I discussed the testing and analysis of more classical algorithms, such as binary search. While the overall testing of algorithms can be complicated, most classical algorithms can be described as having quite clearly defined inputs, outputs, and their relations.

In this follow-up article I discuss testing and analysis in relation to a different type of an algorithm, where the input-output relations are more complex, harder to define, and the results sometimes subjective. Typically these are based on machine learning (ML). For example, the same set of Internet search-results can be good for one person at one time, but less so for (another) person, or at a different time. Same might apply to other properties, such as an e-nose trying to distinguish wines vs trying to distinguish whiskeys based on the same sensor inputs, or personal stress level based on biometrics. The context and sometimes the personal “feeling” can make the difference.

In this article, I explore the idea of what it means to test these types of systems. I start with testing as part of the ML model training process, extending into post-training testing. To make it all a bit more concrete, I look at a few industry examples, and reflect on my experiences on a search-engine project I built a few years back using customized Natural Language Processing (NLP) algorithms.

The Train, Validate, and Test Process

The basic test process in relation to machine learning is the testing and validation during the ML model training process. The following figure illustrates this:

ML train and test loop. Image by author.

This uses three datasets:

  • Training set: The ML model is trained on this, the algorithm tuning the model to better generalize over all the items in this set.
  • Validation set: the trained model performance is evaluated on this separate dataset. Training is repeated with the training set as long as validation score improves, or other end criteria is met.
  • Test set: a final, separate dataset used to perform a final test of the results, independent of the whole training loop and its validation.

The above process aims to build a model that generalizes as well as possible in relation to given criteria (e.g., accuracy or prediction error) over the training data. Typically this optimizes the model over the given dataset, but does not differentiate on which specific data items it still performs poorly.

Following this, post-training testing is a term referring to testing related to the fully trained model after the above process has fully finished. Let’s look at this type of testing in a bit more detail.

Post-Training Testing

I ran into this term during writing this article, from an article by Jeremy Jordan, and found it very fitting. He described it as investigating the logic behind the “final” algorithm (trained model). I like the term investigating, as I think of this as exploring the model behaviour with different inputs and input transformations. This is very much like the metamorphic testing (MMT) approach I wrote about earlier, with more focus on the exploration part. As a reminder, MMT makes modifications to inputs, and observes the effect on output. The modifications are called metamorphic transformations.

The Jordan article splits post-training testing to three types:

  • Invariance tests: Testing the ML algorithm with similar inputs that one would expect to give a similar output. The example given is two different (red) apples. Think metamorphic transformations.
  • Directional tests: Testing small input modifications in a given direction to see if the algorithm output reacts in an expected way. For example, does the house price go up with increase in house size? Another form of metamorphic transformation.
  • Minimum functionality tests: Testing the ML algorithm with specific instances of inputs, where we know the result we expect. For example, does the same apple always give the same result (e.g., classification percentages).

I discussed many similar aspects related to testing traditional algorithms in my earlier article, and I think the above test types can be applied to testing in general. However, in testing ML algorithms, I find the exploration has a bigger role due to the model (algorithm) being more of a black box, with complex internals, such as large number of weights and connections.

I believe how well the above test types can be applied depends on the data, model, and domain at hand. For example, when classifying images of apples, I find it quite intuitive to define variants of an apple, or how to transform such images. Other times it gets more complicated. Let’s look at some examples.

Examples of Evaluation Complexity

For example, in my metamorphic testing article I discussed the use case of self-driving cars. Now imagine that the ML algorithm feeds steering angles to a driving system. If this angle drifts by fractions of a percentage over time (with different model versions) for the same scenarios, when is it a relevant change in output? Or if it uses multiple camera inputs from the real-world, and we consider all the complexities of the real-world living environment as inputs, which change in input is relevant? The data space can be huge. In time-series data (such as driving), the data relations and change over time also needs to be considered.

The Jordan article gives another interesting example of defining invariants and using them for investigation/testing: real-estate price estimation. Specifically expecting a higher bathroom count to not lower house price, or decreasing house size to not increase the prize. These are in a way inverse invariants, describing the opposite of what one might expect. Maybe I expect the price to stay the same or rise, but not to drop. An interesting angle to look at the data and expectations.

In Jordan’s example, they noticed their dataset for smaller apartments was mostly from a big city where the apartments were more expensive. This had biased the model with regards to the smaller apartment sizes in general. I find this an interesting example of findings that would be easy to miss without in-depth exploration (or investigation) of the post-training results.

Operational vs Data Domain

Thinking of the metamorphic relations at a high level, one might consider what kind of changes in the operational domain (e.g., weather, driving environment, angles, for cars) one might observe. These are high-level changes describing human understandable concepts. Additionally, there are changes in the input data domain itself (e.g., sensor corruptions, interference, signal disturbance, environmental impacts).

These two domains are nicely described in an article by Daniel Angilov. Although not using the metamorphic testing terms, it nicely summarizes the difference of how one might build input transformations for testing ML at a higher level (operational level), and how they might map to the lower level representation (data level). Naturally, the data domain is much larger in scope, as it includes all possible input values. For the operational domain many of these may not be relevant (e.g., images of random data). Illustration:

Operational domain vs Data Domain, adapted from here.

The Operational Domain deals with more intuitive, high-level and human understandable, concepts, and transformations. For example, in an autonomous car scenario, foggy scenes, camera angles, or rainy weather vs sunny weather. These are mapped to the data domain, where the actual data transformations happen. We can apply operational domain transformations to produce high-level modifications for identified test scenarios. And lower-level data transformations (e.g., adding noise) for lower level changes.

Here is an example from my metamorphic testing article that I believe illustrates a (domain) transformation in the operational domain:

Operational domain transformation example. Image by author.

In this case, the camera angle of the imaginary car has been turned, as if the car itself was tilted. Well, the other car in front has also moved further away :). I took these pictures while walking across the street for my metamorphic testing paper, which explains it. A more data domain oriented transformation could be, for example, to add static noise to the image.

It is unlikely that we could cover all possible values in the data domain, or figure out a generic test oracle for all of them. Similarly for the operational domain, expecting to build exhaustive test sets for all possible input combinations is not feasible. Instead, we need to try to figure out the most relevant ones for both domains. And pick a suitable test set, explore its changes, and evolution over time. Unfortunately I do not have a universal recipe for this (at this time).

Testing in the Overall ML Process

Considering the overall ML train, test, and operations process, perhaps the most succinct and full of ML testing detail paper I have seen is the Google paper from the 2016 Reliable Machine Learning in the Wild workshop. Since the entire paper is just a list test types and their very concise descriptions, I list the ones that I found most interesting. These all relate to continuously testing:

  • Feature assumptions. Such as value ranges and most common values, as these may drift over time. Check assumptions stay valid.
  • Feature relations to target variable and each other. To understand your data better, and check the relations hold.
  • Computational cost vs benefit per each feature. Is it worth including all features for their computational cost? Does it evolve?
  • Leak of unwanted features into the model due to copy-paste type errors. Constant monitoring should alert if unwanted features are used.
  • Change of model evaluation score over time if consistently re-trained. For example, the effect of daily training with new data.
  • Model bias with regards to specific types of data, does some new data or data type bring new biases that should be considered.
  • Reproducibility of training, how big is the drift across multiple trainings on the same data.
  • Constantly monitoring how invariants seen in training data should hold for operational data over time.

I see the overarching theme is to constantly monitor and observe change and impact.

This is just a short list of ones I found most related to this article. There are many more related to the overall process from ML code review to operational monitoring. I recommend reading the Google paper for the very insightful list.

A Few Real-World Examples

Lots of words are nice, but real-world examples help make it concrete. Especially real ones from the industry. Recently I listened to some ACM ByteCast episodes, a few of which provided interesting industry insights.


One of these was discussing research on recommendation algorithms at Spotify. More specifically, Spotify’s recommendation systems and everything around that. The term used (in this talk) was evaluation of ML results.

This differentiated the evaluation in two forms: offline and online. Offline is the traditional ML evaluation approach of using a train/test split and metrics such as accuracy, precision, and recall. Online refers to evaluating the algorithm performance from user actions, such as clicks on a webpage, or in Spotify case I guess choices on music playlists. They discuss the concept of a proxy of user engagement, and whether some specific interaction in an app is a good measure of achieved goal, or if a returning user is an indication of satisfaction and a success. An interesting viewpoint on defining an ML test oracle, and constantly investigating the ML algorithm behaviour.

Another aspect they discussed with Spotify is to enable users to discover new content. Instead of simply relying on algorithm recommendations based on the current user profile, it can be useful to enable people to discover new content. In the Spotify case, this was discussed as selecting about 10% of the options presented to the user as these “new” types of choices. This is a good example of considering the overall user experience and goals when testing / investigating the ML algorithms as part of a system. It ensures the user is not locked into an algorithmic sandbox, supporting them in finding new ideas (music, search results, ..) to explore.


Another ByteCast episode I found interesting regarding this topics was the one on DuoLingo. DuoLingo is an app and a service designed to help people learn languages. They use Machine Learning for many parts of their system, including creating tasks for students tailored by their learning history, customizing content based on how well the user does on certain words and language structures, and what has worked overall for similar users, and many other ways I am sure I miss here. But to summarize, a heavily ML applying and successful application.

DuoLingo generates tailored quizzes for users, based on their learned profiles, together with expert teams. In this way it gamifies the evaluation of the algorithms and has users provide feedback by answering the tailored quizzes, and using approaches such as A/B testing. I find this a very interesting idea and approach for “online” algorithm testing and evolution with algorithm feedback from users. By enticing the user as part of the service use process to work on tasks that help the algorithm learn to be better for them.

An Example of My Own: ValtuustoPilvi

A few years back I build a search-engine called Valtuustopilvi (Finnish for CouncilCloud), as part of an open-data based service competition hosted by the city of Oulu in Finland. It was ranked first in this (small) competition, and so I ended up also hosting it as a service for the city for the following two years. As a result, I got familiar with various NLP algorithms and built my first service using them, including my first experiences in testing ML and NLP.

Service Overview

ValtuustoPilvi was designed for people to interactively search the city council meeting documents. I used various existing Natural Language Processing (NLP) libraries for the search algorithms, built some custom algorithms of my own, and adapted some well known and widely used ones.

Since it was just me working on this project, I did the development, testing, analysis, and everything else. Not sure if that is good or bad for this case, but that’s how it was. I packaged the code up a few years back, so no concrete code measurements and execution this time, but examples and reflection on how I tested, analyzed, and designed it. Hope it makes for something more concrete and interesting.

To concretize the service a bit, here is a screenshot of the search UI:

Valtuustopilvi interactive search UI. Image by author.

The user could interact with the word-cloud by selecting words in it to refine their search, or by typing query terms in the box shown at the bottom of above figure. The system built (in real-time) a new word-cloud matching the search results for the modified search query, and the user could iteratively continue to refine their search using the continuously updated UI / word-cloud.

I designed several algorithms to build the word-cloud in different hierarchies. The overall document set was not changing very rapidly, so for that I used a pre-computed data model, updated every night. On first arrival at the main search-page for all the documents, or for a specific pre-defined sub-category of council meetings (e.g., building permits), this type of data model was applied. It was based on a topic-model. Based this topic model, a set of words was chosen, each one weighted by how high the topic model ranked them in the topics its discovered. This is visible as the word size in the word-cloud above. More detail on this shortly. First a few words on pre-processing.

Computational Complexity Example: Preprocessing with Voikko

In my previous algorithm testing article, I discussed evaluation of algorithm computational complexity. In the case of ValtuustoPilvi, a related example is from using a third-party library, and how its computational complexity became important as my use was different from its previous use cases.

A very common task in ML, and in NLP especially, is to preprocess the data being fed to the algorithms. In case of the word-cloud building algorithms, one of the basic needs is to unify different forms of words into a single representation (word). In NLP this is called lemmatization (i.e., base form conversion, e.g., cars->car).

For this , I used the Voikko Finnish NLP library. Voikko had earlier been used primarily for shorter pieces of text, such as sentences and words in spell checking. However, I used Voikko to process whole documents, longest of which were hundreds of pages long. Using such larger inputs, the processing time was increasing exponentially. After reporting the issue, it was fixed in version 4.0.1 (visible in Voikko release notes).

In relation to the topic of this and my previous article, this illustrates how different requirements for an algorithm can be relevant in different scenarios, and how this may evolve over time. The Voikko web-page now actually lists the library commonly used as a tool in machine learning pipelines for Finnish texts, illustrating its more general use case drift in this direction.

Continous Data Analysis: Testing Preprocessing

Lemmatization tools such as Voikko are great for processing well written and structured text. However, real-world text often contains misspellings, domain specific words, abbreviations, and other similar anomalies. For these, tools such as Voikko will fail to process them as they are not in the standard dictionary.

Reading all the documents and manually fixing all text is not a feasible task for any reasonable sized input set. Instead, I collected all words / tokens that Voikko did not recognize. By ranking these unrecognized words and tokens in the order of their frequency, I could easily find words that occurred often but were not recognized, create custom rules to lemmatize them as needed, and thus continuously monitor, test, and improve the preprocessing. Even as the set of documents, and the topics they covered would evolve over time. This process ran each time the document set and topic models were updated (nightly). I checked the results once a week or so, updating the lemmatization rules when new and frequent misspellings or new words appeared.

Algorithm Customization: LDA for Word Clouds

Another topic I previously discussed is the importance of understanding the algorithms used. In this case, the topic model algorithm I used produces internal representations that are typically not used as part of the output. However, learning the algorithm in detail, I was able to use those internal properties as a basis for service features. Of course, understanding what you are testing does not hurt in testing either.

The topic modelling algorithm I used for building the word-cloud was Latent Dirichlet Allocation (LDA). This is an NLP algorithm that maps documents and texts to a set of “topics”. The discovered topics are often used as input to further steps in an application. First, a short introduction to LDA.

LDA maps words together in clusters that are called topics. To concretize, here is an example of four topics (word clusters) I built using LDA on one of the Kaggle COVID-19 research paper datasets (code and notebook here):

LDA based topic model with four topics, with word-weights shown. Image by author.

With LDA, interpreting these topics/clusters is left to the user. In this case, I would say the topics above seem to relate to patient treatments (topic 1), virus analysis (topic 2), structure of the virus (topic 3), and possibly infection mechanisms (topic 4).

In the case of the Valtuustopilvi, the exact meaning of the topics was irrelevant, only that a good set of representative words, and their weights was captured, and they represented a set of high-level and diverse concepts in the document set. The idea in the end was to help the user explore different and interesting high-level concepts in the document set. This end user goal is always good to keep in mind also when testing a system, ML based or not.

The topics themselves were never shown to the user, only the backend code used them as a basis to build the high-level word-clouds. Summing the weights in different ways provided a basis to choose the words for the cloud, and weight their size.

As I was testing and using the service myself, I also realized that giving fully deterministic results to the user, in form of highest weighted words by the algorithm was boring and did not serve the user well. Because doing exactly that would always give the exact same word-cloud for the document set. It would not help them explore and find new information over time, and felt boring after a few times seeing the exact same word cloud I already had interacted with. So similar to the Spotify example of trying to avoid the algorithmic sandbox, I added some randomization to reduce the weights of some words, or add some randomness to which of the top words in each topic were included in the word-cloud calculations, and how much weight was given to each.

Testing, Analysis, and Development of ValtuustoPilvi

The LDA example above was only used for the top level searches, where people were searching over all the documents, or over a specific pre-defined category of meetings. It requires pre-computation and is most useful when summarizing larger sets. For deeper search queries with dynamic document sets, I used an algorithm that was faster and adaptive to changing search results. For further details on this algorithm, and the others I used, refer to the paper draft I wrote on the topic (…) back in the day.

An example of base tests in this was to check that the correct algorithm is used. These base tests are often quite simple and intuitive to write, with some basic understanding on how the system works. But how to test the word cloud is a good representation of the actual topics for the user? This is much more subjective and harder to test.

Looking back at this, I see different levels of testing I did on this:

Test sizes and oracle levels in the Valtuustopilvi project. Image by author.

First, I looked at a set of documents for a category, and whether the algorithm provided results made sense to me. This is the Me at the bottom of the above pyramid. The test oracle was myself, so very limited view but also very much to the point, and easy to check (I know exactly what I am thinking, at least usually :)).

Next came test groups, such as the competition reviewers and the feedback they gave. This feedback was more generic, without detailed knowledge of the algorithms, but still specifically giving feedback on how they felt about different parts of the service. Which is practically the algorithm output transformed into a user presented form.

The third level is the general user population. It has a large sample size in potentially covering all users of the service. However, the test oracle is very generic, as it can only rely on some indirect clues that can be collected from the user interaction with the service. This is similar to the Spotify (e.g., evaluation and search) and DuoLingo examples on analyzing user interactions, and performing experiments for tuning algorithms. The same also applies for search engines in general, such as Google search results optimization based on aggregated and anonymized interactions (linked in Feb/2022).

In the Valtuustopilvi case, the service did not have internet scale number of users, so there was not that much to analyze at the top level of all users. I also included a feedback mechanism in the UI, but beyond few basic comments not much came of it. However, once a service reaches a large enough scale, I believe the general user population is a very useful opportunity to pursue for evaluation data and ideas. As illustrated by the Spotify, DuoLingo, and Google examples. Of course, keeping in mind all the privacy and similar aspects.


Looking back at what became of this article, the big difference in testing ML based algorithms compared to traditional algorithms, as I see it, is the black box nature of the ML models. They embed complex internals, such as a large number of weights, nodes, and connections. And each of these are different across applications, model configurations, training sessions, and the data used. Thus there is no single way to test a specific ML algorithm, but the process involves much investigation and exploration. And this is a constant process, as the data, models, and out understanding based on it often evolve over time.

Some of the specific points I collected from this:

  • Data evolution, and constantly monitoring it,
  • Trained model evolution, and constantly monitoring it,
  • Identifying important operational and data invariants to cover,
  • Post-training testing types and metamorphic testing,
  • Different viewpoints, such as inverse assumptions/invariants,
  • Investigation and exploration as a central concept,
  • User (system) feedback and its proxies in evaluation at different scales,
  • Keeping the entire ML operations pipeline and related quality in mind,
  • Breadth in end (user) goals, such as avoiding algorithmic sandboxes,

Looking at the above, another common theme is the focus on continous test, monitoring, and analysis rather than point-in-time as often in classical testing.

Overall, I find there are many aspects to consider when testing ML algorithms, and the systems built around them. Much of the same concepts apply as for traditional testing, but hopefully this article helps provide some insights into the specifics of ML testing.

That’s all for now. Cheers 🙂

Algorithm Test Engineering: Exploratory Job Analysis

What does it mean to test an algorithm? Let’s play

Plot twist. The article has an RNA algorithm “example”.. Photo by Braňo on Unsplash

Recently I had a discussion about what it means to test an algorithm, or what it means to do algorithm test engineering. I couldn’t quite come up with a convincing definition for myself. At a basic level, maybe you figure out some basic rules of the algorithm, give it some input and output. Check the results. But I believe there is more to it, including algorithm fit for purpose, fit to data, evaluating alternatives, adapting the algorithm to the problem and system at hand, designing and running experiments, learning, etc.

To get a better idea about all this, I will explore the topic in this article. I try analyzing and testing two algorithms, and see where it takes me. I start in this article with more classical algorithms, where the inputs, outputs, and their relations are clearly defined. In a follow-up article, I hope to look at a machine learning-based one where the definition of the correct results is not so clear.

This story was originally published on Medium, where it can still be found.

Choice of Algorithm

Since my goal was to analyze the testing and analysis of algorithms, I needed to pick an algorithm to work with. Luckily, Wikipedia provides an extensive list of algorithms. I picked two classics, binary search and Levenshtein edit distance. I will warm up first with a basic test and analysis of binary search. This is followed with a bit broader look at Levenshtein.

In general, I find testing an algorithm without a broader context to miss some key points from the overall process. So I set up an application for the Levenshtein algorithm to give my testing and analysis exploration some context. This context is an application for comparing edit distances over RNA strings, more specifically motivated by the COVID-19 virus structure.

First, binary search. It is an algorithm designed for efficient search over a list of items, in logarithmic execution time. For my experiments, I use the IMDB names dataset, and a binary search Python implementation (adapted from a set of Stack Overflow answers).

Understanding the Algorithm

When testing an algorithm, I first need to understand what I am working with. To gain some intuition for this, I start with a look at linear search vs binary search. Linear search is a very basic search that simply loops the list in order to find a match, and provides a base reference here. To illustrate the difference, and what the binary search algorithm is about, a look at a set of 10 names from the IMDB dataset:

Example set of 10 names from the IMDB dataset. Image by author.

I picked the last 10 in the dataset, because it looked like a good random mix. Binary search requires the input to be sorted, so lets do that. Sorted and re-indexed these 10 become like this:

The IMDB names dataset slice, sorted by string order (first name). Image by author.

To build a better understanding of linear and binary search, I created illustrations on applying them to the above list to find “Lu Bevins”:

Linear search (left) vs Binary search (right) on the IMDB name slice. Image by author.

Linear search always loops through the entire list from the beginning, until it finds a match. The list can be in any order. Binary search constantly splits the list in two and checks the middle value for match, repeating this in smaller chunks, until it finds a match, or runs out of elements to check.

Binary search is a well known and simple algorithm, but above illustrates how I would go about trying to understand an algorithm. I find concrete examples and diagrams help me build an understanding. Access to a (domain) expert to help build and check these really helps. Especially with less known algorithms.

The above example list of 10 is very small, and does not really illustrate the benefit of binary search. With larger size, the comparison between linear and binary search should look like this (expecting logarithmic complexity):

Illustrating the theoretical logarithmic execution time vs linear. Image by author.

Such assumptions are good to check, so I will do that next.

Testing Computational Complexity

A common property of interest when testing and analyzing algorithms is computational complexity. This refers to the resources (processing time, memory, etc) the algorithm uses at different scales. Here I focus on processing time, although similar analysis could be performed for other resources.

To explore the scalability of binary search, I sampled the IMDB names dataset for 10 to 1000 names, in increments of 10, and executed both linear and binary search on random items in these sampled lists.

To gain more statistically confident results, I ran the experiments 100 times for each of these lists. I used my codeprofile library for execution time profiling. The following figures illustrate the measured average execution times over the 100 runs at sample sizes vs the theoretical, expected time:

Measured performance (left) vs theoretical (right). Image by author.

The above shapes show the curves follow the expected curve on logarithmic execution time growth (binary search) and linear growth (linear search). So the measurements matches the theory (scales are different in the figures due to parameters, I was just interested in the shapes).

1000 items is still a very short list for real performance measurements. Running the same experiment for 10 to 1 million samples (in 10x increments) shows the effect and benefit of binary search more clearly:

Binary search vs linear search. Left side = 10–1000 list size, right side = 10–1M list size. Image by author.

In the above figure, with list size 1 million, the difference is so big the binary search line looks flat compared to linear search. Based on the above simple experiments, I would say the assumptions hold for computational complexity.

The linear and binary search are well known and studied algorithms, and their complexity would be well known. As such, these demonstrate how we can evaluate this in practice. I have found evaluating practical computational complexity of actual implementations can produce surprises, especially different use cases for custom algorithms from 3rd parties.

Testing Input-Output Relations

Knowing the computational complexity is good, but we also need assurance that the implementation works as intended. For binary search, basic tests for finding selected names, boundaries, category-partitioning, etc are good examples for this. I call these base assurance tests.

To scale this up, I used the test generator from the above 10 to 1 million computational complexity analysis, and added the following check to it:

  • for every item, linear and binary search find the same index.

I call this an expected/assumed invariant over the data/algorithm. I was expecting the two algorithms to give the same results for the same input, and that the assertion would simply be there for extra assurance. However, the assertion was failing because the indices given by the linear and binary search were not always the same even for the same input.

After some thought on what could cause this difference, and how the algorithms work, I figured there might be duplicate names in the dataset. Here is a quick check to see if this is the case (concatenated to two rows):

Listing items that have duplicate names in the dataset (slice). Image by author.

Above figure shows “Robert Ellis” and “Harrison Ford” as having duplicates. To check a bit deeper, a look at the name “Harrison Ford”:

Check whether Harrison Ford exists in multiple instances as primary name. Image by author.

There are four people listed with the same name “Harrison Ford” in this dataset. Having established there are duplicate names, some summary metrics would be of interest to see how many there are:

Counting the number of duplicated primary names in the dataset. Image by author.

The above figures show a total of 853911 (row count) distinct names have duplicates. The one with the most duplicates is “David Smith”, repeating 340 times. What does the above mean for the mismatching search results? Due to how the two algorithms work, searching for “David Smith” would likely result in both linear and binary search returning a “David Smith”, but different ones. Linear search always returns the first one in list, binary search can give any result from the list of duplicates.

To me, this illustrates, how testing and analysis of the algorithm helps understand both the algorithm, and the data it is being applied to better. And how it is good to assert and test your assumptions about the data and algorithm. Having a clear goal, and performing this process systematically should help.

Besides invariants over the algorithm output, one can also consider them over input. For example, binary search expects its input to be sorted. The choice is whether to limit the scope of testing to expect the environment (the overall system) to enforce this, or to expect the algorithm implementation to handle it. I would refer to this as defining the algorithm (test) scope.

As a second example, and to gain some broader insight into the topic, I look at the Levenshtein edit distance algorithm here. Edit distance is also sometimes referred to as approximate string matching. It refers to the number of edits (character changes) required on a string to convert it to a specific (target) string. Known applications include providing spelling suggestions, approximate string search queries, and DNA/RNA sequence comparison.

Compared to the single search operation of binary search, the classic Levenshtein algorithm supports three operations: insert, delete, and substitute (a character in a string). The edit distance it gives is the minimum number of these operations required. There is also a variant called Damerau-Levenshtein adding a transposition operation, but to keep this example simpler, I will use the classic Levenshtein here.

Some examples:

  • test -> tent: one substitution of s->n = Levenshtein score 1
  • Levenstein -> Levenshtein: one insertion of h = score 1
  • Levenshtein -> Levenstein: one deletion of h = score 1
  • bad -> base: one substitution of d -> s + one insertion of e = score 2
  • bad -> based: two insertions of s + e = score 2

Defining the Example Application

As I noted before, I will look at Levenshtein from the viewpoint of an example application, to give it some context that my binary search example lacked. As Levenshtein has been applied in RNA analysis, I use RNA search as the application. Specifically, the COVID-19 virus, which is RNA based. There is plenty of information available on it, and it gives me a chance to play with longer RNA strings on a timely topic. As with binary search, I start with research to understand the application domain as well as the algorithm applied.

The COVID-19 RNA sequence is described as having a length of 29881 “characters”, consisting of a four letter alphabet representing its chemical base. One part of the COVID-19 structure that is specifically considered interesting is the spike protein. This interest comes from the role of the spike protein in infecting the human cells.

The spike protein is described as a 3831 “character” long sub-sequence of the overall virus RNA. Wikipedia describes the Omicron variant as having 60 mutations from the original Wuhan variant, 32 which are in the spike protein.

So in this example I look to apply Levenshtein on sequences of length 3831 (possibly up to 29881). Following the Omicron mutations, I will use up to 60 edits as the maximum in my testing. As with binary search, I am defining the algorithm (test) parameters based on domain knowledge and expected use.

DISCLAIMER: I have not done in-depth research on RNA search, I am sure it is far more advanced, but this works as a basic example for this article.

Experiment Setup

I used the Python weighted-levenshtein library to run my experiments. My base assurance tests include handling empty string, string of different lengths, invalid characters, and a selected string mutations. And so on.

To scale up the testing, I again built a simple input generator to generate input strings, and apply the algorithms supported operations (insert, delete, substitute) on them. This generated random strings of given length, applied the selected count of operation(s) on the string, and calculated the distance score for the modified vs the original (non-modified) string. At the same time measuring execution time, and checking a set of output invariants:

  • The score given by the algorithm should always match the number of operations applied.
  • Flipping the compared strings, e.g, car->cars to cars->car, should always give the same score. Since the operations are effectively reversible.

My test executor checked these invariants for every generated (test) execution of the algorithm.

To support the base assurance tests, I first used the input generator to build large number of strings of relatively short length (5–10), with varying number (0–9) operations applied. This gave me increased confidence, higher position coverage, and easier to debug results (with limited length). For score calculations, string boundaries, grouped operations, etc. It also showed me how my assumptions about the above two invariants was wrong for the first one.

What I found is, running the algorithm with enough iterations, the random chance will find cases where the target string built from multiple substitution operations can be achieved with fewer combinations of insert, delete, and substitute operations together. Here is an example case:

Applying 3 substitutions on a source to produce a target, expecting Levenshtein score of 3. Image by author.

The above shows the source string, and the target string after making 3 substitutions at randomly selected indices, with randomly selected characters. Remember, my target was COVID-19 RNA search, with 4 characters (ABCD). The following substitutions have been made; index 3: A->B, index 4: C->A, index 5: D->C. Expecting Levenshtein score to match the number of edit operations (3 substitutions here), this should give a score of 3.

However, instead of giving a score of 3, the above gives a score of 2. After some analysis, I realized the same target can be reached with one insert and one delete, giving the minimal score of 2:

Achieving the same target as above with only 2 operations (insert + delete). Image by author.

With this knowledge, I had to disable the failing invariant from the checks. Instead, I opted to collect statistics on the score achieved with different number of operations applied. I could then check that the overall score distribution was not too far off from the number of operations applied, if not exactly the same. The second invariant (flipping strings) was fine for all generated tests.

I find this is an example of what one might call exploratory algorithm testing. Take any assumptions on how the algorithm works, encode them as invariants (or whatever works), generate further tests to see if they hold. Learning about the algorithm and its domain along the way.

After successfully running these smaller tests, and feeling confident I had sufficiently covered my assumptions, I increased the generator string sizes higher for testing computational complexity.

Computational Complexity: Setup

For this evaluation, I used the input generator to produce strings of length from 100 to 5000 in intervals of 250 (100, 350, 600, 850, … until 4850). The target string length I set earlier was 3981 character length, which gets covered here, with some wide margin. For statistical confidence, I repeated each test size (100 to 4850) 100 times.

For each generated string, I applied the following operations 1, 3, 9, and 60 times (60 being the omicron limit defined above):

  • Substitute characters in the string at random (non-overlapping) locations
  • Insert random characters at a random location in the test string
  • Delete random characters in the test string
  • Combinations of above, all at the same time

My goal was to see if a different algorithm operations, their number, or the string length would affect the computation time. Or more generally, to test the algorithm operations and parameters, to see how they affect its execution time.

Computational Complexity: Measurements

The results for the above described experiments were very similar in terms of performance. The following figure illustrates the substitution tests, and their overall execution time:

Levenshtein execution time vs string size. Image by author.

Substitute_x refers to substituting X character in the source string, and calculating the Levenshtein score. The x-axis in above is the string size from 100 to 4850. The y-axis is the time it took to run the 100 experiments.

The four lines on the above graph are practically overlapping, as the execution time was so close in each case. Since this seemed a bit suspicious, I ran multiple experiments separately with varying parameters to see if this holds true, which it did. A bit strange, but ok. Having a domain expert and a team to bounce thoughts on this would have been real nice.

I omitted showing all the rest of the operations, their size variants, and combinations here. They were all very close, indicating that the execution time had little to no dependence on the type or number of operations.

The above curve in general resembles an exponentially growing curve. To check this, I experimented with a few parameters to try to visualize a matching form of an exponential curve. Here is the final measured execution time plot vs a theoretical plot for exponential time:

Execution time (left) vs exponential plot (right). Image by author.

From this, I would say the algorithm has exponential complexity. Wikipedia actually has a few fancy words on the Levenshtein computational complexity being high. From this, my verdict is that these results match the theoretical expectation, this algorithm seems to scale less optimally for longer inputs.

Testing an Adaption to the Algorithm:

Sometimes we test and analyze an algorithm, and realize it is not a great fit for the need. But we may wish to try an alternative, or an adaptation to this algorithm to address the issues. Let’s look at one example here.

Exponential execution time growth is generally considered bad for scalability. For shorter string (e.g., spell checking a word or a shell command) this is probably not an issue, as the exponential effect is so small on shorter input. However, in my RNA story I wanted to search and analyze sequences of length 3831+. For this, investigating possible speedups would seem useful.

Let’s say we decide to try to make this faster by trying to run the algorithm on smaller pieces of the input. This should make the execution time grow more linearly, instead of exponential. In the end, we just need to rank the search results relatively, not necessarily an exact score. And once the top results are known, one could calculate exact scores for that subset. Since this is a hypothetical example, I call this a hypothetical development idea :).

Here is an execution time graph for splitting the generated input strings to slices of length 100 characters, and summing up their scores:

Execution time after splitting to slices of 100. Image by author.

From the execution time perspective, the above graph looks much better. Exponential growth is gone, turned linear. But how about the results, how large is the difference in score between calculating it for the full string vs summing up the scores for slices of 100? The following tables illustrate this:

Substitution operator, slices of 100 vs full string. Image by author.

In the above table, s_x refers to substituting X character in the string. Size is the total string length. The postfix of /full refers to running the Levenshtein algorithm on the full string. The /100 version uses slices of 100 characters. Each configuration was repeated 100 times to get statistical coverage. Which is why, for example, s_1/full has a score of 100 (100 runs, each scoring 1).

For this substitution example, sliced version scores are very close to the full version. Perhaps because substitution is a local change, and does not so much affect multiple slices. There are a few rows in the table where the summed score has a difference of 1–2 edits (6000 vs 5998 and 5999). This is because of the issue I noted above, where insert and delete can work together to find a smaller minimum. If the substitution operation was enough, this would seem a plausible adaptation. But the goal was to support all operations.

In the below tables, the insert operation uses prefix of i, and delete a prefix of d. So i_x refers to inserting X characters, and d_x deleting X characters:

Insert operator, slices of 100 vs full string. Image by author.

The above insert table shows how the i_x/full score always matches the number of operations (*100) as expected. With i_x/100 slices, the score starts to get larger as the string length increases, and the number of operations applied increases (from 1 to 3, 9, and 60). After some analysis, I concluded this is because inserting a character at the beginning of the string shifts all the rest of the slices forward to the right, and thus causes each slice to require multiple edits, and increase the summed score.

The following figures illustrate this issue:

Above string of length 10, Inserting B, Levenshtein score calculated as full string. Image by author.
Same as above string, Levenshtein score calculated in slices of length 3. Image by author.

The above is an example of slice size 3 instead of 100 but the same concept. Depending on the location of the insert, the edit count propagates towards the end and increases the sum by a large amount. For each slice of 3 to match, it has to remove the first and add the last char. Same for delete operation:

Delete operator, slices of 100 vs full string. Image by author.

The above delete table shows very similar behaviour to the insert table. And for the same reasons, a delete shifts all slices left when insert shifted them right.

So, while the slicing approach would be successful in cutting down the algorithm processing time, it would definitely not work to properly rank the search results for insert and delete. From the application perspective, I would consider this adaptation a failed experiment, except for learning.

However, for the purposes of this article, I find this is a good experiment. It shows how one might run tests on an algorithm, analyze its results, fit for purpose, and other properties, build hypothesis, implement experiments, evaluate it, and iterate. Optimally, this is how I would see algorithm testing and analysis contributing to the overall development more broadly, helping to design and evaluate experiments for improved algorithms and adaptations.

Concluding the Levenshtein Search Experiment

In a more realistic scenario, I hope I would be working with a team of experts. And have resources to perform research on what are all the approaches, state of the art, and everything else on the subject. In fact, when the algorithm and its application domain are complex, and the opportunity is there, I would consider this an essential part of the testing and analysis process. Working with domain experts, R&D team, and researching the domain knowledge. Here I will limit the scope as my resources are limited.


My goal with this article was to explore the idea of what algorithm testing (and analysis) could be. A summarizing list makes it simpler to remember:

  • Traditional testing techniques can define base assurance tests with expert defined inputs and expected outputs
  • Building a good, in-depth understanding of the algorithm helps understand how to test it, and how to adapt it to a domain
  • This includes building a similarly good understanding of the data it is applied to, and how this relates to the algorithm
  • Ideally, above works iteratively in interaction with research, development, and testing
  • Besides verification, algorithm testing can contribute to understanding its limits, potential optimizations, and compare alternatives
  • Identifying assumptions about the data and algorithm input describes what the algorithm is expected to work with
  • Exploratory data analysis can use these assumptions as input, check if they hold, and refine them
  • Identifying assumptions about the data and the algorithm output gives a basis to write invariants to check in every test
  • An automated test generator helps scale testing with these invariants, and check whether the assumptions hold
  • Scope of testing relates to defining the scope of the algorithm vs the overall system using it, responsibilities for input and output handling
  • Theoretical computational complexity is good, but practical evaluation if it holds for the implementation and available data is good to ensure
  • Algorithm testing can form a tool for exploring the algorithm, by formulating hypothesis about it and using testing to evaluate them
  • Tools and techniques, such as metamorphic testing can help evaluate robustness of the algorithm to different types of valid and invalid inputs
  • Testing and analysis is optimally an iterative process, where the end result is the final generated tests and checks, and all the learning on the way


I started this article with the idea to explore what it means to test an algorithm, or what it could mean to be do “algorithm test engineering”. I like to think I made some progress, although I am sure the definition can be subjective, much like “goodness” of a (machine learning) algorithm result.

Both my examples in this article, binary search and Levenshtein edit distance, are quite simple and basic algorithms in the end. As noted in this article, the basic testing of such algorithms is not too complicated. However, considering algorithm testing and analysis as part of a broader research and development process, I believe the interactions, collaborations, research and development contributions can make it more diverse and interesting.

In this article, I looked at two “classic” algorithms, where the input, output, and their relations are quite straightforward to define. In a followup article I will look at machine learning based algorithm(s), and an example of an algorithm where the input-output relations, and the correctness of the output is harder to define, subjective, or “undefinable”, in a traditional sense. Until then.

That’s all for now. Cheers.

Mapping Ring Signatures and Stealth Addresses in Monero

With an example from a recent XHV hack (of June 2021)


I originally published this article on Medium, where it is still available.

Since Monero has been forked and used as a basis for multiple other blockchains, the same concepts of Ring Signatures and Stealth Addresses also apply more broadly to those forked blockchains as well. One of these forked projects is the Haven Protocol (XHV) private stablecoin project. The example I will use is from the June 2021 XHV hack, highlighting an interplay of the Stealth Addresses and Ring Signatures, and how they were relevant in how the XHV developers tried to address the hack.

Since the exploits were not fully rolled back on the XHV blockchain, the data should still be there to check in practice for anyone interested. As long as the blockchain exists, anyway. In any case, lets start with a brief look at the relevant Monero privacy features to set the stage.

Monero Privacy Features

Monero uses multiple privacy enhancing mechanisms. These include Pedersen Commitments, Ring Signatures, Ring Confidential Transactions (RingCT), and Stealth Addresses. These can be divided into techniques for hiding the amount of Monero transferred in a transaction, hiding the receiver of a transaction, and hiding the sender. That’s a lot of hiding there, so lets see about each one.

Transaction Amounts

One core aspect of transaction anonymity is the transaction amount, i.e., how much funds is the transaction spending / sending. This is relevant to hide the amount itself, but also for the sake of correlating the inputs of one transaction with outputs of another transaction. For example, if you could look at a specific sum that is sent in one transaction, and find the same sum spent later in another transaction, you could deduce that the recipient of the first transaction is likely the spender of the second one, or that it matches the price of some other asset that you think was acquired.

Monero originally did not hide the amounts, and the transaction input and output sums were visible on the blockchain. You can still see this if you look at the earlier Monero transactions on the blockchain, for example this one from year 2014 in block 100000.

These days Monero uses a cryptographic scheme called Pedersen Commitment to hide the amounts. I described the Pedersen Commitment in detail in my article on Zero Knowledge Proofs. Related to this, later in this article I also discuss Ring Signatures, which also used to need to take the amounts into account. However, since the introduction of the RingCT protocol, the Ring Signatures now also hide the amounts along with the Pedersen Commitment.

So, in general, with regards to transaction amounts, the current Monero-based blockchains hide them well, and this is very transparent to the user. That is, the user does not need to really concern about it, as it happens always and automatically when using the protocol.

Transaction Receiver

The approach to hiding the recipient of a transaction in Monero is called Stealth Addresses. There appears to be some ambiquity with regards to what a Stealth Address means in the Monero community, but here I refer to its use as the transaction receiver addresses visible on the blockchain.

Stealth addresses are different from the wallet addresses. The Monero wallet address, as I refer to it here, is the one you get if you start the official Monero Command Line (CLI) wallet, and type the “address” command. Or the same in GUI or whatever wallet you use. This wallet address actually never appears on a Monero-based blockchain, only the Stealth Addresses do.

Given a target (receiver) wallet address, the sender wallet generates a one-time Stealth Address. This generated Stealth Address is used in the blockchain as the transaction receiver address. Elliptic Curve mathematics are used to create a Stealth Address from public keys encoded in the given receiver wallet address, along with using a large randomized factor. This makes each Stealth Address unique. Even if the same sender sends Monero to the same receiver, the Stealth Address is always different and unique for each transaction. This is the mechanism to hide the transaction recipient.

For the details of building Stealth Addresses, I recommend this excellent Stack Exchange post, and similarly excellent article by Luigi1111. When looking at the blockchain, without matching private keys, a Stealth Address cannot be matched to any wallet address. Only the receiver has access to the private keys matching the public keys used to generate the Stealth Address, and thus only their wallet can identify transactions intended for them.

Monkey See, Monkey Do (a Doodle)

To make this a bit more visual, I made the following most beautiful doodle about the mail system as a metaphora for how a Stealth Address works.

My most impressive doodle on using mail as a metaphora for a Stealth Address. Pics from Pixabay: monkey, envelope 1, envelope 2, envelope 3, heads 1, heads 2, heads 3, heads 4, trashcan, piggymailman.

Consider the process in the figure in 7 steps listed in it:

  1. George (the Monkey on left) wants to send 10 Monero to Bob (balding guy on right). So he starts to build a Monero transaction, and putting the 10 Monero in it.
  2. George has Bob’s wallet address. Maybe he found it from Bob’s Github project, asking for donations. George wants to send a few bucks. Using the public keys encoded in Bob’s public wallet address, George’s wallet generates a unique Stealth Address to send funds to Bob.
  3. George puts this newly generated Stealth Address on the 10 Monero transaction as the receiver address.
  4. George gives the transaction to the postman for delivery. The postman here represents the Monero peer-to-peer blockchain network. Consisting of nodes running the Monero Daemon software.
  5. The postman, or in this case the Monero network, delivers a copy of the transaction to every single user of the blockchain. Or their wallets.
  6. Everyone tries to read and open the transaction by attempting to fit their private keys to the Stealth Address. Or their wallet automatically does this.
  7. Everyone else discards the transaction, since their keys did not match, except Bob, whose private keys match the public keys George used from his wallet address. So Bob gets the 10 Monero and adds them to his wallet (piggybank) for later spending.

For the purposes of this article, the key takeaway is that for each transaction a unique receiver address is generated, and this is the one you see on the blockchain as the transaction receiver address. And this is (what I call) the Stealth Address. This is an important element for the next step of hiding the transaction sender using Ring Signatures, and how we can later use this to look for links between the Ring Signatures and Stealth Addresses. The XHV hack I discuss later will illustrate this link more concretely.

Transaction Sender

As I noted already, the approach to hiding the sender of a transaction in Monero is called Ring Signatures. A Ring Signature is so named due to the ring-like structure in how the algorithm is applied. The general idea of this algorithm is to create an anonymity set of possible signers, and hide the true sender/signer in that anonymity set. A signer in this case refers to a transaction output (Stealth Address) that could be spent in the transaction the Ring Signature is attached to. The following illustrates a Ring Signature (yes, its just a doodle based on my understanding):

My Ring Signature doodle. Green is the real tx, hiding in yellow decoys.

In the above figure, I colored T9 (as Transaction Output 9 in the ring) in green to illustrate it being the real input in this case. This is just for illustration, as the real input / signer is randomly placed on the ring. Because if it was always at the same position, it would be obvious which one it is. The Ring Signature algorithm processes these in a ring-like sequence, hence the name.

Monero currently enforces a ring size of 11, meaning the ring contains the real transaction input and 10 fake decoy transaction outputs (sometimes called mixins). Ring size used to be open to user change, but the forced count of 11 is currently used to avoid distinct ring sizes giving away additional metadata about specific users and thus weakening the overall anonymity set.

The signatures in a ring are generated using a set of Stealth Addresses from real transactions on the blockchain. Anyone can prove that one of the signatures is real, but not which one. This gives every ring participant plausible deniability. The real signer is of course the one who sends the real funds, or spends the output they received at that Stealth Address. But you cannot tell which of the 11 it is. Further, a key image is included in each transaction spend to make sure each output is only spent once.

For details on Ring Signatures implementation, I recommend the excellent Stack Exchange answer for Monero and RingCT specifics, or the Wikipedia article with example code for the original Ring Signature scheme.

For the purposes of this article, it is enough to understand that Stealth Addresses define the transaction recipient, and when a transaction output is spent, the Stealth Address of that spent output appears in the transactions Ring Signature. But the Stealth Address may also appear as fake decoys in other transactions.

Visualize it: Stealth Address and a Ring Signature from XHV Explorer

To illustrate, here is a screenshot of the Haven protocol (XHV) explorer showing one of the hacked coinbase outputs (normally amounts are hidden but coinbase / miner transactions show them):

Screenshot of a coinbase output on XHV chain. This link should take you to it, if the explorer exists as used to.

Note that the above screenshot shows how each transaction output has a visible Stealth Address for the receiver. And it is always unique even if going to the same receiver wallet. All the outputs in the above screenshot are actually paid to the same (hacker) wallet, but have different Stealth Address each. In the above screenshot I highlighted one row with the 101k xUSD output. Let’s map it to a Ring Signature.

Here is a screenshot of the explorer, showing how the above highlighted Stealth Address is used in a Ring Signature:

Screenshot of explorer with a ring signature using the previous screenshots Stealth Address. Link.

In this screenshot I marked it as the real output. Generally we could not tell if it was a real input or a decoy input. In this case, I believe it is the real spend, as I will explain soon, due to some telling ways the hackers spent it. However, in general this should illustrate how the Stealth Address appears in Ring Signatures as the real or decoy one. From the public data on the blockchain we cannot really tell which one it is.

Finally after all that lengthy explaining, lets look at the promised XHV hack, and how the Stealth Addresses and Ring Signatures play out in that.

Example by Looking at the XHV Hack

First, some brief background to understand the XHV exploit descriptions, and their relation to Stealth Addresses and Ring Signatures a bit better.

What is XHV

XHV, or Haven Protocol, is called an “algorithmic stablecoin”. The basic XHV token is, surprisingly, called XHV. It can be traded on exchanges, and sent between wallet, similar to XMR for Monero. To create the algorithmic stablecoin functionality, XHV implements also other token types on the blockchain, such as xUSD to track US dollar value, xBTC to track Bitcoin value, and xJPY to track the Japanese Yen value. Special transactions can be created on the XHV blockchain to change one of these token types to another, and these tokens can be sent between wallets just like the base XHV itself.

The tracked prices for USD, BTC, etc., are determined by Chainlink oracles, such as the one for XHV-USD. Without going into too much detail, these concepts are only relevant for this article at a higher level with the multiple asset types, as they come up discussing the following examples with the XHV exploits.

The Exploits

At the time of the example used here, there were multiple different exploits applied in the course of a few days on the XHV blockhain. I will use the first exploit on that list as an example. This exploit abused some missing checks on how the miner (coinbase) transaction is created. The output of the other exploits could be analyzed similarly if interested, as they are all listed on the XHV teams Medium post.

The first exploited miner transaction on that list is in the XHV block 882877. Specifically, the attackers used an exploit to take advantage of missing checks for validating mining rewards / fees in the miner reward (a.k.a. coinbase) transaction for the additional asset tokens. In the block 882877 mining transaction (also the screenshots above in theRing Signatures section), you can see 101k xUSD and 6.73 xBTC being created out of thin air. The miner transaction is exploited again in block 883040 for the same amounts of 6.73 xBTC and 101k xUSD.

So how does this relate to Stealth Addresses and Ring Signatures? After the XHV developers noticed / were notified of the abnormally large miner rewards, they attempted to stop the hackers from using those hacked 101k xUSD and 6.73 xBTC outputs. This is where the Stealth Addresses and Ring Signatures of those transaction outputs come to play.

Attempting to Block Spending of the Exploited Funds

As I described earlier, Monero does not include the actual receiver address on the blockchain. Instead, it creates the one-time Stealth Addresses for each transaction, based on the receivers cryptographic public keys encoded in their wallet address. And while you cannot map a single Stealth Address to any specific recipient/wallet, you can map it to Ring Signatures of specific transactions, where the output identified by the Stealth Address is possibly used as input / spent (or as a decoy). The screenshots earlier illustrated this.

In this case, the XHV developers would have taken the Stealth Addresses used as target addresses for payment of the exploited transactions, and scanned the blockchain to find any Ring Signatures using these Stealth Addresses as inputs, and thus any potential spends of those funds. With no matching Ring Signatures / transaction spends found, one could assume that the funds were not yet spent / moved from the initial target Stealth Address. Reading their report, this would have been the case they assumed.

To block the attacker from using their exploited funds, they provided the biggest XHV mining pools with a modified software that discarded any new transactions using those Stealth Addresses as part of their Ring Signatures. These biggest mining pools would give them control of well over 51% of the network hash power. This 51%+ hash power gives anyone power to discard any blocks mined by someone else, in this case those that would have contained those Stealth Addresses from the exploited outputs as inputs.

However, reading their report and the Medium post, the developers seem to have made some mistake in their scanning, and the idea that the funds were not spent was not the real case. We can verify this simply by using the Monero Daemon RPC API that is shared by the XHV Daemon. We can query a network node with this API for blocks and transactions, and look for any with a specific Stealth Address as an input in the Ring Signatures. Let’s see how it looks.

Mapping Stealth Addresses to Ring Signatures

To build this example, I queried the XHV network for transactions that have the Stealth Addresses of the exploited miner transactions in their Ring Signatures. Once I found any, I repeated this scan one more time, to find transactions using those potential spends as inputs in their ring signatures. The following figure shows the results of this process, and how the exploited miner transactions spread into the blockchain in the blocks coming after it:

Hacker TX fan-out into the blockchain and its ring signatures.

The above figure illustrates how the two exploited transaction outputs fan out from the original exploited miner transaction from block 882887. The two exploited miner transactions in the above are the ones on left side of the figure, labeled 882877 xUSD 100k and 882877 6.73 xBTC. If you look at the miner transaction on the block explorer, the xUSD transaction is paid to Stealth Address 3b8e80a279000d3d32826010b665e2c78aaf25d1744c49cac8a5da154d293875, and the xBTC to Stealth Address 7dec1f53dd30416458926bd35e9c1fff98d0ac86fc185a8fedfad546ff7d4546.

Scanning for these Stealth Addresses as part of a Ring Signature, the xUSD output is used as input in Ring Signatures of three transactions, in blocks 882939, 883064, 883410. The xBTC output is similarly used as input in two Ring Signatures, in blocks 882939, and 883041.

I highlighted 882939 in the above figure in green, since it seems quite clear to me that this is the actual transaction to spend the exploited funds from block 882877. I cannot prove it 100% because the Ring Signature obfuscates the sender in 10 other decoys for both the xUSD and xBTC transaction. However, this is a good illustration of where these privacy mechanisms don’t necessarily alone hide you, if your actions betray you through other heuristics.

I say this because block 882877 contains transactions using both the xUSD and xBTC exploited outputs as ring members. Both outputs also appear in any Ring Signature for the first time in this same block. With the way Monero chooses its decoys, having two transactions in the same block using decoys from the exact same previous transaction is extremely unlikely. Even more, Monero has an unlock time of 60 blocks for miner (coinbase) transactions such as the exploit in block 882877, and the difference between 882939 and 882877 is 62 blocks. So 882939 is one of the first blocks where this hacked output could ever be spent. What a coincidence to have all this happen together. This illustrates how the technology can work perfectly fine, but your actions can still betray you. Probably makes no difference to the hacker, but a good example.

Why Only 2–3 Rings Use the Exploited Outputs

The above figure shows how there are only 3 and 2 blocks where the xUSD and xBTC outputs appear as ring inputs but for the other transactions in the figure there are many more. This is the effect of the XHV developers efforts to stop their use, which was just a bit late as they already were used as real inputs/decoys in those 3 and 2 blocks. But if not, the stopping likely would have worked fine as the cutoff shows.

Fanning Out Into the Blockchain

In the figure above, I expanded the two transaction outputs from the xUSD payment in the block 882939 as examples of what a more realistic fan-out from an output would look like into multiple Ring Signatures in the following blocks. The first (marked output1 in the figure) is paid to stealth address 71d224f0043fdf8c5b6e8bc292f1c1e07bf820d707c2d9b04b6335d49bf37e4b. The second fanned-out output (output2) is paid to stealth address afa169a74b1581cafedf215219ca7b7708802ec13945540d3fb9ba30e188d842.

As the figure shows, the output 1 of block 882939 can be found as (input) member of Ring Signatures in transactions in blocks 883000, 883105, 883155, 883247, 883302, 883987, 883993, 884002, and 884003. Similarly, output 2 can be found in 882970, 883903, 883925, 883986, 883991, 883996, 884440 and 885705. If you want to check for yourself, just open these links and search for the stealth address I copy-pasted above.

Relevance to User Privacy

Beyond all this technological aspect, reading about hacks and all can be interesting. Or maybe everyone already fell asleep. But is any of this relevant to a normal user?

If you only care about having some basic privacy, or if you don’t care about privacy and anonymity at all, you are likely fine just using the Monero based protocols as is. If you do care more deeply about it, there are a few points we can gather from this that are good to keep in mind:

  • When you receive any transaction, consider re-sending it a short time after to yourself, or in general churning it a few times. This will obfuscate what stealth address the funds are in, beyond just inclusion in other Ring Signatures. But beware of obvious patterns such as the one above for the miner exploit.
  • Consider other metadata that could be used to identify and link your transaction. In the above XHV hack example, the metadata was the use of multiple outputs from the same miner transaction at the same time, right after the original transaction unlocked.

However, to make this more complicated, churning and other techniques are not so simple to do perfectly either. Some related discussion in this Stack Exchange post. To make it perfect, your churning should look like real decoy patterns look. But it can get a little complicated to do such perfect optimizations, so mostly some basic consideration for above points is likely good enough. It is especially tricky as I am now aware of any specific support for having granular control over which transaction outputs are spent, or even visibility to the specific inputs and outputs in a Monero wallet.

In general, timing related analysis seems to have been one of the most common ways to try and analyze Monero transactions. To look more deeply into the decoy selection algorithm, I wrote a small study on that here but it got a bit lenghty. It seems to happen to me a lot. So to avoid even more overly long writing, and too much digression, I will save my details on decoy selection and timing analysis for another day.

Other Metadata

Besides time-based analysis, various other issues related to transaction analysis, Ring Signatures, decoys, and related topics have been identified and addressed over Monero lifetime. A very nice and extensive discussion on this topic is available in the Breaking Monero YouTube series by the Monero team. I won’t go into any of those details here, but if interested in the topic, it is definitely a great series to watch.

Future Views on Ring Signatures

While the current Monero ring size is 11, recent Monero Labs Research topics include Triptych and related ideas that aim to enable much larger ring sizes in practice. This should make any analysis of the transaction sender, and the relations between them even more difficult. It should also provide broader decoy patterns, making it harder to fail to match them for your own transactions.

Obligatory Promotion of Software Testing

Having worked over many years in software testing (mainly the engineering part of it), I will take the chance for a few words about these cryptocurrency hacks and what I think about the topic, even if it is not really about Ring Signatures and Stealth Addresses. Sorry about that.

Regarding the XHV hack, the reports they released seem to point to quite lacking software development practices (search for Lessons Learned). There appears to have been little to no tests, lack of code reviews and audits, and getting any (helpful) community involvement in development seems to have had little to no support. Pretty dire reading for an open-source financial project running openly decentralized over the internet, and at times valued in hundreds of millions of dollars or more.

Reading those lessons learned, at least the XHV team seem to recognize this now and work to improve. I guess it remains to be seen how well one can change fundamental processes and peoples working practices in one go.

Of course, the XHV team are also not alone in this as, for example, about one month later (July 2021), another cryptocurrency project called Thorchain had similarly multiple hacks amounting to millions of dollars in damages. While not as bad, some of the reports I read pointed at similar limits in development processes, testing, etc. In general such hacks seem to occur almost monthly in cryptocurrency projects, especially for the newer projects with a larger new codebase (and thus untested attack surface). Maybe it goes with the territory, but maybe it wouldn’t hurt to put some thought into testing and security.

Of course, everyone makes mistakes, and I guess it is most important to learn from those mistakes. But that should not excuse ignoring even basic levels of quality assurance.


I hope this article provides some added understanding on what are Ring Signatures and Stealth Addresses, and how they work. I find it quite impressive how Monero applies much of the latest cryptographic research into practice, combines different cryptographic techniques to create a system of private and decentralized transactions, and has not had any big hacks like the one I partially described here.

The Monero technology can be quite complex, but in the end I find a good explanation helps make even complex topics understandable. Much like Bitcoin is similarly clever, but not too hard to understand once you find the right resources to explain it. Luigi1111’s Stealth Address article is great, too bad he never seems to have gotten around to his Ring Signatures article hinted there.

For anyone interested on how using the Monero API to analyze the blockchain might work, at the time I wrote my earlier article on Merkle Trees, I built a small tool to crawl the Monero blockchain using the Monero Daemon / node API. This let me locally analyze the data, which I also applied for this article. Maybe one of these days I will also finish the additions I made for this article and update the Github, even though it should already work as an example of the general process.

Personally, I feel like I got maybe halfway there in trying to get a good understanding of how to implement the Stealth Addresses and Ring Signatures in practice. Maybe for understanding their relations on the blockchain and tobasic Monero tracing, that could be enough already.

That’s all for today. Cheers.

Merkle Trees: Concepts and Use Cases

The data structure within Bitcoin, Amazon Dynamo DB, ZFS, …

This article explores what are Merkle trees, and how they are used in practice in different systems including Bitcoin, Amazon’s Dynamo DB, and the ZFS filesystem. The basic concept is quite simple, but some of the clever applications are not so obvious.

First, lets start with the concept of Merkle trees. As I said, it is not too complicated in its basic form.

What is a Merkle Tree

A Merkle tree is fundamentally just a hierarchical set of hash values, building from a set of actual data (Merkle leaf) to intermediate hashes (Merkle braches) and up to the Merkle root that summarizes all the data in one hash value.

Example: A Basic Merkle Tree

The following figure illustrates a very small Merkle tree:

A very small Merkle tree. Image by author.

In this figure, the bottom nodes (Data1Data4) are the actual data processed by the application. Each of these is summarized by their respective hash value (Hash1Hash4), as a Merkle leaf. From these, the Merkle tree builds a hierarchy, combining hashes together until only one is left. The nodes combining other hash nodes are called Merkle branches (here Hash12 and Hash34). When there is only one left (here Hash1234), this is called the Merkle root. There can be multiple levels of branches and hashing as following examples will demonstrate.

Handling An Unbalanced Merkle Tree

The above example illustrates the very basic case of a Merkle tree. It is a convenient example, as at every level there are just the right number of nodes to form exact pairs. What happens if you have an uneven (odd) number of leaf (data) nodes? For example, what happens to the above example if you have 5 Data nodes? You can hash Data1+Data2 together to form a Merkle branch, and same for Data3+Data4. But Data 5 is left without a pair to hash into a new branch.

Different approaches can be taken to address this. For example, in this situation Bitcoin simply copies the un-pairable (odd) hash, and uses the duplicate as a pair (the odd hash is paired with itself). The following figure illustrates this:

Handling odd (uneven) numbers of hashes Bitcoin style. Image by author.

In this example, Hash 5 has no immediate pair, and is duplicated to pair it with itself. Same happens for Hash55. The above is just one option, there are different ways to handle this pairing issue. Of the ones I have seen, I like the Monero (or Cryptonote?) one most. I will present it shortly. First, something slightly different.

A Very Simple Merkle Tree in Python

Theories and explanations are great. But for the technical person, some code helps formalize it. The following code from my Github repo illustrates a naive but simple example of implementing a Merkle tree in Python (or just skip it for more conceptual explanation after): very simple and naive Merkle tree implementation in Python. Code by author.

The last four lines in the above code run the example to create a Merkle tree with data nodes Data1Data5. The calculation of the tree with this algorithm looks like this (cropping the hashes to the first 5 characters):

Merkle tree calculation example with actual values. Image and animation by author.

I used the keccak hash function in this example, as this is what the Monero cryptocurrency/blockchain uses, and I was looking into Monero code lately. The 5th (odd) data element here (using my Python code above) is handled slightly different from the Bitcoin example, as this simply re-uses the hash if one is left over (un-pairable, cf54b above). Practically this re-use should have the same effect as the duplication in the Bitcoin algorithm.

Optimizing Merkle Calculation: Monero Style

As I was looking into the Monero blockchain to understand it better, I found it has a different but clever approach to hashing the Merkle tree. The code for it is available in the tree-hash.c file in the Monero Github repository. And a Python version I made to implement the same in my Github.

The Monero approach could be described as converting the hash tree to a perfect binary tree. It hashes enough leaf nodes in the first iteration, so that the following iterations will always have some variant of 2ˣ (a power of 2) nodes. The following figure illustrates this:

Monero approach to Merkle tree building, with 5 nodes (transactions in Monero). Image by author.

This works as follows: First, calculate x so that is greater than number of transactions (data nodes). In this example, it is 2³=8, because 8>5 (there are 5 transactions/Data nodes). The value of x before that, 2²=4, would not fit (4 > 5 is not true). From this, the number of transactions is subtracted. In this case 8–5=3. This 3 is the index of transaction to start iteration 1 from. With 0-based indexing, the start index is Data4 in this example.

Above explanation is maybe a bit abstract. Below is a table with concrete examples of transaction counts and how these are converted into a “perfect binary tree” shape (size is always 2ˣ, eliminating any “odd” counts or leftover un-pairable hashes from following iterations) for iteration 2 and following iterations:

Example calculations of Merkle size with the Monero (Cryptonote) algorithm. Image/Table by author.

The columns:

  • Transactions: Number of transactions in the block being calculated
  • Closest 2^x: First 2ˣ that is bigger than transactions
  • Iteration 1 start index: start hashing pairs from this index in transaction list
  • Iteration 1 values: Number of transactions in transaction list starting from iteration 1 start index to the end of transaction list
  • Iteration 1 pairs: iteration 1 values divided by 2 (because hashed in pairs)
  • Iteration 2 formula: How the above values lead to number of items to hash in the following iteration
  • Iteration 2 size: The number of transactions to hash in iteration 2. As you can see, it is always 2ˣ, and leads into a “perfect binary tree”.

Too many numbers and formulas I guess. Did you fall asleep yet? Anyway, if you pick the row from above table with 23 transactions, you can follow this nice 8-bit animation I made on how the Merkle root calculation progresses for that row:

Calculating Merkle root in Monero for block 1407480. Image/animation by author.

In this case (animation above), the start index is calculated as 2⁶=32–23=9. This is transaction number 10 (zero index adds 1). The following iteration then has 16 nodes, which is 2⁵, next one 8 or 2⁴, and so on. The animation above actually represents how the hash array is manipulated in memory by the Monero code.

Example Use Cases

Concepts are nice, while concrete, real-world, use cases make the point. Here I will discuss the use of Merkle trees in blockchains/cryptocurrencies, and Amazon’s AWS DynamoDB distributed database. I will also briefly touch on the ZFS filesystem, and the Git version control system, as these are sometimes also mentioned as examples of Merkle tree use.

Cryptocurrencies and blockchains: Bitcoin, Monero, et al.

As already briefly discussed above, Bitcoin and similar cryptocurrencies make use of Merkle trees to summarize and validate the transactions in a block, and embed the Merkle root into their block header as a summary of it all. The following figure illustrates this:

Blockchain and Merkle trees. Image by author.

Each block has an ID value, which is the hash of its header fields. A part of this is the Merkle root. Another part is the previous block ID (Parent in above figure). By linking with the previous block ID, and including that as part of the next block ID, the blocks form a blockchain. By embedding the Merkle root in it, they make an immutable record of transactions in the block.

For example, the Bitcoin block header contains:

  • Difficulty target value (called bits in Bitcoin)
  • Merkle root of all transactions in the block
  • Nonce; a value changed in mining to find accepted blocks
  • Previous block hash (block ID); linking this block to the previous block in chain (this previous hash is named parent in figure above)
  • Timestamp; Time of mining (creating the hash for) the block
  • Block version; identifies support features and formats (also used to identify hash function in some blockchains such as Monero)

These header fields are all hashed together to form the block ID. Making the block ID a hash of all header data makes it practically impossible to modify a block header. Including the Merkle root in the block ID further makes it practically impossible to modify the transactions in a block.

But let’s look in a bit more detail into how these types of blockchains actually use the Merkle tree in different ways. It’s all about data integrity, but in different and sometimes ingenious ways.

Blockchain Use Case 1: Transaction Data (Block) Immutability

As I noted above, the base use case of the Merkle tree in Bitcoin style blockchains is to build the Merkle root into the block header, and using it to verify that no transaction has been changed.

As an example, let’s change Data4 to Data6 in the earlier example:

Changing one node value invalidates (changes the root of) the Merkle tree. Image by author.

Comparing this to the earlier example, with Data4, the Merkle root was aa8d3. With Data6, it is now f8932. This way, any change in any transaction data changes the Merkle root, and it no longer matches the one stored in the blockchain for that block (in block header). And through the block header, this issue would propagate to the whole blockchain following this block.

However, if you think about it for a while, you may ask a question:

Do you really need Merkle trees to verify the block data?

No. Instead, one could just concatenate all the transaction data together and build a single root hash. Consider the above example with Data1Data5, but just hashing them all together:

Concatenate+hash all data at once. Image by author.

Now, changing just one data value will have the same effect of invalidating the summary hash:

Hash change detection in concat+hash all at once. Image by author.

This single hash could also be used in place of the Merkle root in the block header and block hash. Same effect so far.

So what is the real benefit of using a Merkle tree here? To summarize, I would say it gives you more granularity in the hash verification, and enables other clever tricks to process the blockchain more efficiently. While also providing the transaction integrity assurance / verification.

The original Bitcoin whitepaper provides two additional examples for use of the Merkle tree: blockchain pruning and simplified payment verification (SPV). Beyond these, the one I really like is the Stratum mining pool protocol. Let’s look at these next.

Blockchain pruning

Over time, a blockchain will grow larger and larger as it accumulates more and more blocks. For example, today (Feb/2021) the size of the Bitcoin blockchain is about 380GB. Blockchain pruning is an approach to reduce this used space with the help of Merkle trees. By removing used transactions from local storage where no longer needed.

We want to have all data around for full scale verification and history, but not all nodes in the peer to peer network need all the data. The blockchain pruning approach proposed in the original Bitcoin whitepaper suggests addressing this by using the Merkle tree to prune (remove) spent (used) transactions from the blocks.

The following figure illustrates this:

Three out of five transactions used (spent) in the block (and Merkle tree). Image by author.

In this example, we have used (spent) transactions TX1, TX2, and TX4. We still have unused transactions TX3 and TX5 left in this block. Assume we are not interested in a full node with full archival, just keeping a list of spendable transactions. Satoshi’s proposal was to prune the block from the used transaction data, and simply leave the Merkle tree branches needed to verify the unspent transaction data:

The Merkle tree pruned of the used transactions. Image by author.

This is perhaps more clear if we also spend TX3:

One more transaction used, leaving just one unused in the Merkle tree (and the block). Image by author.

The block could now be pruned to only contain the TX5 data, and the hash of the other Merkle tree branch (96b8d) in the block:

Pruned Merkle tree with only one transaction left. Image by author.

We would just have saved the space require to store 4 our of 5 transactions, and 6 out of 9 Merkle tree nodes. About 80% space savings on that example. The longer the blockchain gets, and the more it is used, the more spent transactions it would have. And thus more space could be saved. It’s a very clever idea, like most ideas in the original Bitcoin whitepaper.

But much as with block verification, we can always ask the question:

Do you really need Merkle trees to prune the blockchain?

The Bitcoin StackExchange is full of insightful discussion on the actual ways pruning is applied in the actual implementations. While the Merkle tree pruning is a clever approach, it is not applied in this way in (at least) the Bitcoin core software.

Instead, the unspent transactions are stored in a separate database for fast lookup. This database is initialized on first startup by scanning the blockchain and updating the database as new and spent transactions are found. The database is continously updated as new blocks are broadcasted for the Bitcoin network. A pruned node could then just rely on this database.

Running a pruned node in general is sufficient for the basic operations, but does not fully support all features of a blockchain, so some nodes will still need to hold the full data. But that goes beyond the scope of this article.

I think the basic takeaway here is that Merkle trees are cool but sometimes the basic and even simpler approach is fine, and even better. Of course, the hard part is identifying when this is true and when the cooler (Merkle) approach is really best. And don’t take this in any way as a suggestion to not use Merkle trees. Just to think about the overall picture. Also with Bitcoin, and related blockchains, I believe the Merkle tree enables many other things and thus makes a lot of sense. As following sections will show.

Simplified Payment Verification (SPV)

The simplified payment verification (SPV) algorithm was also suggested in the original Bitcoin whitepaper. In SPV, a light-weight blockchain client only stores the block headers, but also wishes to verify payments it receives as valid transactions in the blockchain. Lacking full transaction details, the SPV client uses Merkle trees to effectively verify the transaction details in collaboration with full nodes.

I will re-use the example from blockchain pruning above. The SPV client wants to verify TX5 in the given block. The following figure illustrates this:

SPV client node verifying a single transaction is part of a given Merkle tree. Image by author.

Here, the SPV client node asks a full node for the Merkle branches required to build the Merkle root with the TX5 data. By rebuilding the Merkle root (aa8d3) from the transaction of interest (TX5), and the Merkle branches (96b8d) provided by the full node, the SPV client can have confidence in having received a valid transaction. By checking this re-built Merkle root against the one stored in the block header, it can verify that the whole tree (and thus TX5) is in fact valid, and is a part of the blockchain.

I find SPV to be an interesting example of how Merkle trees can be used, together with (block) data filtering (Bitcoin uses Bloom filters but that is again another topic), to synchronize and verify existence and correctness of selected data in a distributed system.

Pool Mining: The Stratum Protocol

The traditional way to create cryptocurrencies has been mining via proof of work (PoW) hashing. Mining pools are a way for smaller miners to join together and get some rewards for mining, based on how much compute (hash) power they contribute to the pool. This leads to the requirement for a central entity, the mining pool, to be able to coordinate the miners. It needs a way to effectively distribute and track the overall mining process across all clients. With some clever tricks, the Stratum protocol uses Merkle trees to enable this.

Especially Stratum version 1 makes use of Merkle trees to distribute the work efficiently. The pool server provides the miner nodes with the block header elements needed to build the block. This includes the partial Merkle tree, with the branches calculated for all other transactions except the Coinbase transaction. The coinbase transaction is a special transaction that pays out the mining reward. This is illustrated by the following figure:

Merkle templating with the Stratum mining pool protocol (v1). Image by author.

This figure contains three elements: the Merkle tree template, the Coinbase template, and the search for the nonce(s). The pool provides the Merkle tree template to the miner, containing the pre-calculated Merkle branches for the transactions in the block (here 96b8d, thus highly reducing bandwidth and calculation needs for miners). The miner is then expected to fill the Merkle tree template with a suitable coinbase transaction, built from the server provided coinbase template. The winner is one who fills the template with values providing a hash that matches the blockchain network difficulty level. Meaning just a hash with suitable small value.

The coinbase template is provided by the pool, and is filled by the miner using their choice of nonce and extranonce fields. When combined with the Merkle tree template, each of these provides a different block hash, and the only way to find a winning block is to try values until one produces a hash matching the network difficult target. If the miner finds a nonce that matches the network hash difficulty for this Merkle template, it submits it to the mining pool. The coinbase template contains the mining pool address, ensuring that the pool receives the block reward, and can distribute it to all the miners.

In a broader context, the Merkle tree here is used to distribute a partial solution (the pre-calculated Merkle tree branches and coinbase template), while allowing different distributed nodes to work independently to try to find the solution to the missing part (nonces for the coinbase transaction to build an acceptable PoW hash). By embedding the pool address as part of the template, it ensures that all distributed nodes contribute to the common goal and can share the rewards.

AWS Dynamo DB

Dynamo DB is a distributed database provided as part of the Amazon Web Services (AWS) platform. It was originally designed to handle much of Amazon’s retail global infrastructure needs. As such, a lot of consideration has been put into making it scalable and optimizing the efficiency in all parts of its architecture. The architecture of Dynamo DB is described in the AWS Dynamo DB paper from 2007, including its use of Merkle trees to efficiently synchronize diverged nodes (section 4.7 in the paper).

Dynamo DB is what is called a key-value store. The actual data (values) are stored in data nodes, and identified and indexed by their keys. Dynamo DB hosts this data in what it calls virtual nodes. Each virtual node hosts a key-range. To handle high-availability and scalability requirements, Dynamo DB distributes the data across these key-ranges, and hosts each key-range on multiple virtual nodes (across multiple physical nodes, or partitions).

The following figure tries to illustrate this, with 3 (virtual) Dynamo DB nodes, each holding 2 key ranges, each key range duplicated across two nodes. I am sure this is not exact to all the details, and the system had evolved since the paper was published, but I believe it depics the general concept well enough:

DynamoDB virtual nodes and distributed key ranges. Image by author.

In this type of a distributed system, there will eventually always come a time when some of these virtual nodes are out of sync with other virtual nodes holding the same key-range. Dynamo DB uses Merkle trees to perform effective comparison and synchronization of the key-ranges in these nodes. In the above figure I made a small 3-node Merkle tree inside each key-range to illustrate this The KR3 key-range tree is shown partially in red to illustrate how the nodes would have diverged and need resolving to find correct values.

A Merkle tree is built for each key-range, where the leaf nodes of the tree are the key-range data values. The Merkle root summarizes the data in each node. By comparing the Merkle roots of each virtual node hosting the same key-range, divergence in nodes is immediately visible. This full comparison requires only the communication and comparison of a single value, the Merkle root. If there is a difference, one can keep comparing the branches of the tree to efficiently find the exact divergent spots, and the data values to synchronize.

The following figure illustrates this process:

DynamoDB using Merkle trees to find data nodes out of sync. Image by author.

As you can see, I just copied the blockchain example into the above figure. Because the fundamel structure, and the concepts it relies on, are exactly the same! The difference is only in the (clever) application of the same fundamentals in a different context.

Yes, some details would likely differ. For example , DynamoDB likely uses a different hash function than Keccak. But, as illustrated here, the fundamental approach and concepts are the same, and as such applicable to more than just the common blockchain example of Merkle trees.

ZFS File System

ZFS is a filesystem supporting data spread over multiple volumes, a form of distributed filesystem. Multiple ZFS volumes can be grouped into storage pools, which can also host redundant copies of the data. ZFS is advertised as using Merkle trees to ensure data integrity.

ZFS uses Merkle trees to checksum data, to identify issues where some part of the data written to, or read from, disk is corrupted (or misread etc.). As in most filesystems, the data is stored as blocks of specific sizes. These blocks form a natural leaf node for building a Merkle tree.

The main rationale I found online for ZFS to use Merkle trees always discusses how ZFS stores the Merkle tree as a checksum outside the block itself, providing higher resiliency for identifying errors. This leaves it quite unclear to me why you need a Merkle tree for this, and not just store a general checksum outside the block. Sometimes I feel people should ask more questions.

On the other hand, I could see the benefit of verifying data across multiple storage pools and their replicas by using hierarchical Merkle trees. This case would make it very similar to how DynamoDB operates and uses Merkle trees to sync nodes (vs ZFS volumes). However, I could not find any description for using Merkle trees for this. If someone knows better, I would be happy to hear.

In any case, this seems like a very similar case to Dynamo DB, as being used to verify integrity of a distributed data storage system more efficiently. With the added benefit of separating the consistency properties (checksum hashes) from the data itself, for increased resiliency.

Git Version Control System

Another example that I found referenced as an example of Merkle trees is the Git version control system. Git does use a lot of hashing, and its basic functionality is heavily based on hashing file contents and various metadata. Git forms a directed acyclic graph (DAG) of commits identified by these hashes. Which seems to be considered by some to be a Merkle tree. I disagree, so I will use this as an example of some nuances in the concept.

A DAG is a graph where connections go one way (directed), and there are no cycles (acyclic). Unfortunately, I do not understand how the Git DAG would be a Merkle tree. The following figure represents a Git DAG with a few of the most common Git graph actions – branches and merges:

A made up DAG representing Git commits. Arrows point to parent commits. Image by author.

Here you have nodes that are identified by their hashes, and a tree-type structure where each node points to a parent, and has one or more children. However, there are multiple issues I see in calling this a Merkle tree:

  • Nodes in a Merkle tree do not have multiple parents. The DAG does. The figure above has two such nodes: 8e290 and 184da.
  • Merkle tree only hashes actual data in its leaf nodes, the Merkle branches after the leaf nodes are hashes of hashes. The Git DAG does not have branches based on hashes of hashes, just nodes with hashes of raw data.
  • The Merkle tree starts with the leaf nodes, and proceeds up the tree, constantly halving the number of branches in each iteration. The Git DAG contracts and expands all the time with branching and merging. Because it is a graph and not a tree.

Each node in the Git DAG hashes its parent hash ID as part of its own hash identifier, and it is ordered in time (directed). As such it actually very closely resembles a blockchain. Unlike cryptocurrencies, it allows forking (branching), and integrates those as a central part of its structure. However, I see no problem to relax this as a requirement for a general blockchain definition. And I do not see a blockchain requiring a Merkle tree, just as I do not see Git DAG as one. It is just useful for cryptocurrencies to have one, but you could have a different form of one, with other mechanisms, such as Git.

In summary, not everything needs to be a Merkle tree, even if it is cool. Although you should consider Merkle trees where it makes sense, because they are cool. However, I do not see the Merkle tree in the Git DAG (or elsewhere in Git). If you know better, happy to hear and learn :).


I find Merkle trees are a simple but clever data structure. And I like my algorithms and systems best when they are simple but clever. The examples I covered in this article were

  • blockchain integrity verification,
  • blockchain pruning
  • simplified payment verification,
  • mining pool protocol for work distribution,
  • DynamoDB data synchronization
  • ZFS checksumming (and possibly node synchronization)

What is common in all of these? All of them use the Merkle tree as a tool to take a potentially large set of data elements, and hierarchically verify their integrity in different steps. This allows for efficient data communication, and efficiently processing the verification of the overall data, going into more details only as needed. This, I belive sums up what I see as the best use case for Merkle trees: Enabling efficient verification and synchronization of data over distributed nodes.

Of course, then there is the stratum mining protocol, that takes it a bit further and uses Merkle trees as a means to distribute work, while controlling specific properties of the output, in a decentralized and untrusted setup. Cool.

Well, in most day-to-day software engineering tasks there is not often much chance to use, and build on, these features. But it certainly is interesting to learn them, and thus better understand what some of the latest technologies built on them are based on. And to make those informated decisions when the time is right. Right?

A Look at Precision, Recall, and F1-Score

Exploring the relations between machine learning metrics

Terminology of a specific domain is often difficult to start with. With a software engineering background, machine learning has many such terms that I find I need to remember to use the tools and read the articles.

Some basic terms are Precision, Recall, and F1-Score. These relate to getting a finer-grained idea of how well a classifier is doing, as opposed to just looking at overall accuracy. Writing an explanation forces me to think it through, and helps me remember the topic myself. That’s why I like to write these articles.

I am looking at a binary classifier in this article. The same concepts do apply more broadly, just require a bit more consideration on multi-class problems. But that is something to consider another time.

Before going into the details, an overview figure is always nice:

Hierarchy of Metrics from raw measurements / labeled data to F1-Score. Image by Author.

On the first look, it is a bit of a messy web. No need to worry about the details for now, but we can look back at this during the following sections when explaining the details from the bottom up. The metrics form a hierarchy starting with the the true/false negatives/positives (at the bottom), and building up all the way to the F1-score to bind them all together. Lets build up from there.

True/False Positives and Negatives

A binary classifier can be viewed as classifying instances as positive or negative:

  • Positive: The instance is classified as a member of the class the classifier is trying to identify. For example, a classifier looking for cat photos would classify photos with cats as positive (when correct).
  • Negative: The instance is classified as not being a member of the class we are trying to identify. For example, a classifier looking for cat photos should classify photos with dogs (and no cats) as negative.

The basis of precision, recall, and F1-Score comes from the concepts of True Positive, True Negative, False Positive, and False Negative. The following table illustrates these (consider value 1 to be a positive prediction):

Examples of True/False Positive and Negative

True Positive (TP)

The following table shows 3 examples of a True Positive (TP). The first row is a generic example, where 1 represents the Positive prediction. The following two rows are examples with labels. Internally, the algorithms would use the 1/0 representation, but I used labels here for a more intuitive understanding.

Examples of True Positive (TP) relations.

False Positive (FP)

These False Positives (FP) examples illustrate making wrong predictions, predicting Positive samples for a actual Negative samples. Such failed prediction is called False Positive.

Examples of False Positive (FP) relations.

True Negative (TN)

For the True Negative (TN) example, the cat classifier correctly identifies a photo as not having a cat in it, and the medical image as the patient having no cancer. So the prediction is Negative and correct (True).

Examples of True Negative (TN) relations.

False Negative (FN)

In the False Negative (FN) case, the classifier has predicted a Negative result, while the actual result was positive. Like no cat when there is a cat. So the prediction was Negative and wrong (False). Thus it is a False Negative.

Examples of False Negative (FN) relations.

Confusion Matrix

A confusion matrix is sometimes used to illustrate classifier performance based on the above four values (TP, FP, TN, FN). These are plotted against each other to show a confusion matrix:

Confusion Matrix. Image by Author.

Using the cancer prediction example, a confusion matrix for 100 patients might look something like this:

Confusion matrix for the cancer example. Image by Author.

This example has:

  • TP: 45 positive cases correctly predicted
  • TN: 25 negative cases correctly predicted
  • FP: 18 negative cases are misclassified (wrong positive predictions)
  • FN: 12 positive cases are misclassified (wrong negative predictions)

Thinking about this for a while, there are different severities to the different errors here. Classifying someone who has cancer as not having it (false negative, denying treatment), is likely more severe than classifying someone who does not have it as having it (false positive, consider treatment, do further tests).

As the severity of different kinds of mistakes varies across use cases, the metrics such as Accuracy, Precision, Recall, and F1-score can be used to balance the classifier estimates as preferred.


The base metric used for model evaluation is often Accuracy, describing the number of correct predictions over all predictions:

Accuracy formulas. Image by author.

These three show the same formula for calculating accuracy, but in different wording. From more formalized to more intuitive (my opinion). In the above cancer example, the accuracy would be:

  • (TP+TN)/DatasetSize=(45+25)/100=0.7=70%.

This is perhaps the most intuitive of the model evaluation metrics, and thus commonly used. But often it is useful to also look a bit deeper.


Precision is a measure of how many of the positive predictions made are correct (true positives). The formula for it is:

Precision formulas. Image by Author.

All three above are again just different wordings of the same, with the last one using the cancer case as a concrete example. In this cancer example, using the values from the above example confusion matrix, the precision would be:

  • 45/(45+18)=45/63=0.714=71.4%.

Recall / Sensitivity

Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as Sensitivity. The formula for it is:

Recall formulas. Image by Author.

Once again, this is just the same formula worded three different ways. For the cancer example, using the confusion matrix data, the recall would be:

  • 45/(45+12)=45/57=0.789=78.9%.


Specificity is a measure of how many negative predictions made are correct (true negatives). The formula for it is:

Specificity formulas. Image by Author.

In the above medical example, the specificity would be:

  • 25/(25+18)=0.581=58,1%.


F1-Score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. Harmonic mean is just another way to calculate an “average” of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for F1-score in this case is:

F1-Score formula. Image by Author.

The idea is to provide a single metric that weights the two ratios (precision and recall) in a balanced way, requiring both to have a higher value for the F1-score value to rise. For example, a Precision of 0.01 and Recall of 1.0 would give :

  • an arithmetic mean of (0.01+1.0)/2=0.505,
  • F1-score score (formula above) of 2*(0.01*1.0)/(0.01+1.0)=~0.02.

This is because the F1-score is much more sensitive to one of the two inputs having a low value (0.01 here). Which makes it great if you want to balance the two.

Some advantages of F1-score:

  • Very small precision or recall will result in lower overall score. Thus it helps balance the two metrics.
  • If you choose your positive class as the one with fewer samples, F1-score can help balance the metric across positive/negative samples.
  • As illustrated by the first figure in this article, it combines many of the other metrics into a single one, capturing many aspects at once.

In the cancer example further above, the F1-score would be

  • 2 * (0.714*0.789)/(0.714+0.789)=0.75 = 75%

Exploring F1-score

I find it easiest to understand concepts by looking at some examples. First a function in Python to calculate F1-score:

Python implementation of the F1-score formula. Image by Author.

To compare different combinations of precision and recall, I generate example values for precision and recall in range of 0 to 1 with steps of 0.01 (100 values of 0.01, 0.02, 0.03, … , 1.0):

Generating example values for precision and recall. Image by Author.

This produces a list for both precision and recall to experiment with:

Generated precision and recall values. Image by Author.

F1-score when precision=recall

To see what is the F1-score if precision equals recall, we can calculate F1-scores for each point 0.01 to 1.0, with precision = recall at each point:

Calculating F1-Score for the example values, where precision = recall at each 100 points. Image by Author.
F1-score when precision = recall. F1-score equals precision and recall at each point when p=r. Image by Author.

F1-score equals precision and recall if the two input metrics (P&R) are equal. The Difference column in the table shows the difference between the smaller value (Precision/Recall) and F1-score. Here they are equal, so no difference, in following examples they start to vary.

F1-score when Recall = 1.0, Precision = 0.01 to 1.0

So, the F1-score should handle reasonably well cases where one of the inputs (P/R) is low, even if the other is very high.

Lets try setting Recall to the maximum of 1.0 and varying Precision from 0.01 to 1.0:

Calculating F1-Score when recall is always 1.0 and precision varies from 0.01 to 1.0. Image by Author.
F1-score when recall = 1.0 and precision varies from 0.1 to 1.0. Image by Author.

As expected, the F1-score stays low when one of the two inputs (Precision / Recall) is low. The difference column shows how the F1-score in this case rises a bit faster than the smaller input (Precision here), gaining more towards the middle of the chart, weighted up a bit by the bigger value (Recall here). However, it never goes very far from the smaller input, balancing the overall score based on both inputs. These differences can also be visualized on the figure (difference is biggest at the vertical red line):

F1-Score with precision = 1.0, recall = 0–1.0 with highlighted posts. Image by Author.

F1-score when Precision = 1.0 and Recall = 0.01 to 1.0

If we swap the roles of Precision and Recall in the above example, we get the same result (due to F1-score formula):

Calculating F1-Score when precision is always 1.0 and recall varies from 0.0 to 1.0. Image by Author.
F1-score when precision = 1.0 and recall varies from 0.01 to 1.0. Image by Author.

This is to say, regardless of which one is higher or lower, the overall F1-score is impacted in the exact same way (which seems quite obvious in the formula but easy to forget).

F1-score when Precision=0.8 and Recall = 0.01 to 1.0

Besides fixing one input at maximum, lets try a bit lower. Here precision is fixed at 0.8, while Recall varies from 0.01 to 1.0 as before:

Calculating F1-Score when precision is always 0.8 and recall varies from 0.0 to 1.0. Image by Author.
F1-score when precision = 0.8 and recall varies from 0.01 to 1.0. Image by Author.

The top score with inputs (0.8, 1.0) is 0.89. The rising curve shape is similar as Recall value rises. At maximum of Precision = 1.0, it achieves a value of about 0.1 (or 0.09) higher than the smaller value (0.89 vs 0.8).

F1-score when Precision=0.1 and Recall=0.01 to 1.0

And if we fix one value near minimum at 0.1?

Calculating F1-Score when precision is always 0.1 and recall varies from 0.0 to 1.0. Image by Author.
F1-score when precision = 0.1 and recall varies from 0.01 to 1.0. Image by Author.

Because one of the two inputs is always low (0.1), the F1-score never rises very high. However, interestingly it again rises at maximum to about 0.08 value larger than the smaller input (Precision = 0.1, F1-score=0.18). This is quite similar to the fixed value of Precision = 0.8 above, where the maximum value reached was 0.09 higher than the smaller input.

Focusing F1-score on precision or recall

Besides the plain F1-score, there is a more generic version, called Fbeta-score. F1-score is a special instance of Fbeta-score, where beta=1. It allows one to weight the precision or recall more, by adding a weighting factor. I will not go deeper into that in this post, however, it is something to keep in mind.

F1-score vs Accuracy

Accuracy is commonly described as a more intuitive metric, with F1-score better addressing a more imbalanced dataset. So how does the F1-score (F1) vs Accuracy (ACC) compare across different types of data distributions (ratios of positive/negative)?

Imbalance: Few Positive Cases

In this example, there is an imbalance of 10 positive cases, and 90 negative cases, with different TN, TP, FN, and FP values for a classifier to calculate F1 and ACC:

F1-score vs accuracy with varying prediction rates and imbalanced data. Image by Author.

The maximum accuracy with the class imbalance is with a result of TN=90 and TP=10, as shown on row 2.

In each case where TP =0, the Precision and Recall both become 0, and F1-score cannot be calculated (division by 0). Such cases can be scored as F1-score = 0, or generally marking the classifier as useless. Because the classifier cannot predict any correct positive result. This is rows 0, 4, and 8 in the above table. These also illustrate some cases of high Accuracy for a broken classifier (e.g., row 0 with 90% Accuracy while always predicting only negative).

The remaining rows illustrate how the F1-score is reacting much better to the classifier making more balanced predictions. For example, F1-score=0.18 vs Accuracy = 0.91 on row 5, to F1-score=0.46 vs Accuracy = 0.93 on row 7. This is only a change of 2 positive predictions, but as it is out of 10 possible, the change is actually quite large, and the F1-score emphasizes this (and Accuracy sees no difference to any other values).

Balance 50/50 Positive and Negative cases:

How about when the datasets are more balanced? Here are similar values for a balanced dataset with 50 negative and 50 positive items:

F1-score vs accuracy with varying prediction rates and balanced data. Image by Author.

F1-score is still a slightly better metric here, when there are only very few (or none) of the positive predictions. But the difference is not as huge as with imbalanced classes. In general, it is still always useful to look a bit deeper into the results, although in balanced datasets, a high accuracy is usually a good indicator of a decent classifier performance.

Imbalance: Few Negative Cases

Finally, what happens if the minority class is measured as the negative and not positive? F1-score no longer balances it but rather the opposite. Here is an example with 10 negative cases and 90 positive cases:

F1-score vs Accuracy when the positive class is the majority class. Image by Author.

For example, row 5 has only 1 correct prediction out of 10 negative cases. But the F1-score is still at around 95%, so very good and even higher than accuracy. In the case where the same ratio applied to the positive cases being the minority, the F1-score for this was 0.18 vs now it is 0.95. Which was a much better indicator of quality rather than in this case.

This result with minority negative cases is because of how the formula to calculate F1-score is defined over precision and recall (emphasizing positive cases). If you look back at the figure illustrating the metrics hierarchy at the beginning of this article, you will see how True Positives feed into both Precision and Recall, and from there to F1-score. The same figure also shows how True Negatives do not contribute to F1-score at all. This seems to be viisble here if you reverse the ratios and have fewer true negatives.

So, as usual, I believe it is good to keep in mind how to represent your data, and do your own data exploration, not blindly trusting any single metric.


So what are these metrics good for?

The traditional Accuracy is a good measure if you have quite balanced datasets and are interested in all types of outputs equally. I like to start with it in any case, as it is intuitive, and dig deeper from there as needed.

Precision is great to focus on if you want to minimize false positives. For example, you build a spam email classifier. You want to see as little spam as possible. But you do not want to miss any important, non-spam emails. In such cases, you may wish to aim for maximizing precision.

Recall is very important in domains such as medical (e.g., identifying cancer), where you really want to minimize the chance of missing positive cases (predicting false negatives). These are typically cases where missing a positive case has a much bigger cost than wrongly classifying something as positive.

Neither precision nor recall is necessarily useful alone, since we rather generally are interested in the overall picture. Accuracy is always good to check as one option. F1-score is another.

F1-score combines precision and recall, and works also for cases where the datasets are imbalanced as it requires both precision and recall to have a reasonable value, as demonstrated by the experiments I showed in this post. Even if you have a small number of positive cases vs negative cases, the formula will weight the metric value down if the precision or recall of the positive class is low.

Besides these, there are various other metrics and ways to explore your results. A popular and very useful approach is also use of ROC- and precision-recall curves. These allow fine-tuning the evaluation thresholds according to what type of error we want to minimize. But that is a different topic to explore.

Thats all for today.. 🙂

Metamorphic Testing of Machine-Learning Based Systems

Techniques for Testing Autonomous Cars and other ML-Based Systems

Testing machine learning (ML)-based systems requires different approaches compared to traditional software. With traditional software, the specification and its relation to the implementation is typically quite explicit: “When the user types a valid username and matching password, they are successfully logged in”. Very simple to understand, deterministic, and easy to write a test case for.

ML-based systems are quite different. Instead of clearly defined inputs and logical flows based on explicit programming statements, a ML-based system is based on potentially huge input spaces with probabilistic outcomes from largely black-box components (models). In this article, I take a look at metamorphic testing, which is a technique that has become increasingly popular to address some of the ML-based systems testing challenge. I will go through some of the latest research, and present examples from different application domains.

These two images can be used to represent some metamorphic tests for a ML-system. Read on to see how.. 🙂 Image by author.

Metamorphic Testing

Metamorphic Testing (MMT) was originally proposed quite a while back, at least up to (Chen1998). Having worked a long time with software testing research, I always viewed MMT as a curiosity with few real use cases. With ML-based systems, however, it seems to have found its niche nicely.

The general idea of MMT is to describe the system functionality in terms of generic relations between inputs, the generic transformations of those inputs and their outputs, rather than as mappings of specific inputs to specific outputs.

One typical example used for metamorphic testing in the past has been from testing search engines (e.g., Zhou2016). As search engines are these days practically natural language processing (NLP)/ML-based systems, they also fit the topic of this article well. To illustrate the concept, I ran two queries on Google (in October 2020):

A Google query for “car” and its refined version “autonomous car”. Image by author.

The first query is just one word “car”. The second query adds another word to the first query, “autonomous”. So the query now becomes “autonomous car”. This addition of a restrictive search keyword is an example of an input transformation (or “morphing”, in the spirit or metamorphic testing):

Example input transformation: changing the original input (query) to be more restricted. Image by author.

And to perform a check on the test results (a test oracle), we define a matching relation that should hold when the input transformation is applied:

Example output (metamorphic) relation: More restricted query returns fewer results. Image by author.

In this case, adding the restrictive search term (“autonomous”) to the previous query (“car”) changes the result set, restricting it to a smaller set. From 8.3 billion results to 159 million. The metamorphic test would not specify these exact values, but rather the relation “restricting the query leads to fewer search results”. And one could generate several (seed) inputs (search queries), associated restrictive keywords for transformations, and run the query and check the metamorphic relation holds (restricting the query produces fewer results). For more details on MMT with search engines, see (Zhou2016).

The above is an example of what metamorphic testing refers to. You transform (morph) your inputs in some way, while at the same time defining a relation that should hold from the previous input (and its output) to the new morphed input (and its output). The key concepts / terms are:

  • morph/transform: modify a seed input in a way that your defined metamorphic relations should hold
  • metamorphic relation: the defined transformation of the input should have a known/measurable effect on the output. Checking that this relation holds after the transformation is the test oracle of metamorphic testing. (Test oracle is a general term for a mechanism to give a verdict on test result)
  • seed inputs: the inputs that are used as initial inputs for the tests, to be transformed. If you know the output of the seed input, you may use it to define a stricter relation (output should be correct). But even without the seed output, you can still define a relation check, but it might be a bit more relaxed (output should be similar but you don’t know if it is correct).

More generally metamorphic testing refers to defining such transformations, and observing their impact (metamorphic relations) on the result. The effectiveness and applicability then depends on how well and extensively these can be defined. I will present more concrete examples in the following sections.

Problem Space

Why would you want to use metamorphic testing? I will illustrate this with an example for autonomous cars. Autonomous cars are recently going through a lot of development, getting a lot of funding, have safety-critical requirements, and are highly dependent on machine-learning. Which is maybe also why they have received so much attention in MMT research. Makes great examples.

For example, the Tesla Autopilot collects data (or did when I was writing this..) from several front-, rear-, and side-cameras, a radar, and 12 ultrasonic sensors. At each moment in time, it must be able to process all this data, along with previous measurements, and come up with reasoning fulfilling highest safety-standards. Such real-world input-spaces are incredibly large. Consider the two pictures I took just a few days apart, near my previous office:

Same place, few days apart. Image by author.

Just in these two pictures there are many variations visible. Snow/no snow, shadows/no shadows, road markers / no markers, connecting roads visible, parking lots visible, other cars, and so on. Yet in all such conditions one would be expected to be able to navigate, safely. To illustrate the problem a bit more, here are some example variants in that domain that quickly come to mind:

A few examples of items a ML-algorithm in an autonomous vehicle needs to consider. Image by author.

Besides these, one can easily expand this to different locations, road shapes, object types, bridges, trains, … Other sensors have other considerations, every location is different, and so on.

In different domains of ML-based system applications, one would need to be able to identify similar problem scenarios, and their relevant combinations, to be able to test them. Manually building test sets to cover all this is (for me) an unrealistic effort.

Metamorphic Testing with Autonomous Cars

Metamorphic testing can help in better covering domains such as the above autonomous cars problem space. As the interest is high, many approaches for this have also been presented, and I will describe a few of those here.

Covering Image Variations

The DeepTest work in (Tian2018) uses transformations on real images captured from driving cars to produce new images. In this case, the metamorphic attributes are:

  • Seed inputs: Real images from car cameras.
  • Metamorphic transformations: moving, tilting, blurring, scaling, zooming, adding fog, adding rain, etc. on he original images
  • Metamorphic relation: the autonomous driving decisions should show minimal divergence on the same input images after the transformations.

The following illustrates this with some simple examples using the road image from outside my previous office. In the following, I “transformed” the image by simply rotating the camera a bit at the location. I then added the arrows to illustrate how a system should “predict” a path that should be taken. The arrow here is manually added, and intended to be only illustrative:

Viewed from a different angle, the real-world path should still be the same (or close). Image by author.

And the same, but with the snowy ground (two transformations in the following compared to the above; added snow + rotation):

With and without snow, the path should be the same (or close). Image by author.

Of course, no-one would expect to manually create any large number of such images (or transformations). Instead, automated transformation tools can be used. For example, there are several libraries for image augmentation, originally created to help increase training dataset sizes in machine learning. The following illustrates a few such augmentations run on the original non-snow image from above:

Automatically generated augmentations / transformations of the earlier road image. Image by author.

All these augmented / transformed images were generated from the same original source image shown before, using the Python imgaug image augmentation library. Some could maybe be improved with more advanced augmentation methods, but most are already quite useful.

Once those transformations are generated, the metamorphic relations on the generated images can be checked. For example, the system should propose a very similar driving path, with minimal differences across all transformations on acceleration, steering, etc. Or more complex checks if such can be defined, such as defining a known reference path (if such exists).

Again, this process of transformation and checking metamorphic relations is what MMT is about. It helps achieve higher coverage and thus confidence by automating some of the testing process for complex systems, where scaling to the large input spaces is otherwise difficult.

GAN-based Transformations with MMT

A more advanced approach to generate input transformations is to apply different ML-based techniques to build the transformations themselves. In image augmentation, one such method is Generative Adversarial Networks (GANs). An application of GANs to autonomous cars is presented in (Zhang2018). In their work, GANs are trained to transform images with different weather conditions. For example, taking a sunny image of a road, and transforming this into a rainy or foggy image.

The argument is that GAN generated weather effects and manipulations are more realistic than more traditional synthetic transformations. (Zhang2018) uses the NVidia UNIT (Liu2017) toolkit to train and apply the GAN models, using input such as YouTube videos for training.

Images illustrating the GAN results are available on the UNIT website, as well as in higher resolution in their Google Photos album. I recommend having a look, it is quite interesting. The smaller images on the UNIT website look very convincing, but looking more closely in the bigger images in the photo albums reveals some limitations. However, the results are quite impressive, and this was a few years ago. I expect the techniques to improve further over time. In general, using machine learning to produce transformations appears to be a very promising area in MMT.

LIDAR Transformation

Besides cameras, there are many possible sensors a system can also use. In autonomous cars, one such system is LIDAR, measuring distances to objects using laser-based sensors. A study of applying metamorphic testing on LIDAR data in the Baidu Apollo autonomous car system is presented in (Zhou2019).

The system first identifies a region of interest (ROI), the “drivable” area. It then identifies and tracks objects in this area. The system consists of multiple components:

  • Object segmentation and bounds identification: Find and identify obstacles in ROI
  • Object tracking: Tracking the obstacles (movement)
  • Sequential type fusion: To smooth the obstacle types over time (make more consistent classifications over time by using also time related data)

The (Zhou2019) study focuses on metamorphic testing of the object identification component, specifically on robustness of classification vs misclassification in minor variations of the LIDAR point cloud. The LIDAR point cloud in this case is simply collection of measurement points the LIDAR system reports seeing. These clouds can be very detailed, and the number of measured points very large (Zhou2019).

The following figures illustrates this scenario (see (Zhou2019) for the realistic LIDAR images from actual cars, I just use my own drawings here to illustrate the general idea. I marked the ROI in a darker color, and added some dots in circular fashion to illustrate the LIDAR scan. The green box illustrates a bigger obstacle (e.g., a car), and the smaller red box illustrates a smaller obstacle (e.g., a pedestrian):

My doodle of a LIDAR image :). Image by author, using a car image by OpenClipArt from Pixabay.

The metamorphic relations and transformations in this case are:

  • Metamorphic relation: same obstacles (objects) should be identified both before and after adding small amounts of noise to the LIDAR point cloud.
  • Transformation: add noise (points to the LIDAR point cloud)
  • Seed inputs: actual LIDAR measurements from cars

The following figure illustrates this type of metamorphic transformation, with the added points marked in red. I simply added them in a random location, outside the ROI in this case, as this was the example also in (Zhou2019):

My doodle of a transformed LIDAR image with added points in red. Image by author, using a car image by OpenClipArt from Pixabay.

The above is a very simple transformation and metamorphic relation to check, but I find often the simple ones work the best.

In summary, the MMT approach here takes existing LIDAR data, and adds some noise to it, in form of added LIDAR data points. In relation to the real world, such noise is described in (Zhou2019) as potentially insects, dust, or sensor noise. The amount of added noise is also described as a very small percentage of the overall points, to make it more realistic.

The metamorphic experiments in (Zhou2019) show how adding a small number of points outside the ROI area in the point cloud was enough to cause the classifier (metamorphic relation check) to fail.

As a result, (Zhou2019) report discussing with the Baidu Apollo team about their findings, getting acknowledgement for the issues, and how the Baidu team incorporated some of the test data into their training dataset. This can be a useful approach, since metamorphic testing can be seen as a way to generate new data that could be used for training. However, I think one should not simply discard the tests in either case, even if re-using some of the data for further ML-model training. More on this later.

Metamorphic Testing of Machine Translation

Not everyone works on autonomous cars, so examples from other domains are important for broader insight. Outside autonomous vehicles, testing of automated language translations with ML-based NLP techniques has received some attention in recent years (for example, He2020, Sun2020). I found the (He2020) paper to be especially clear and sensible, so I use their approach as an example of the metamorphic properties for translation testing here:

  • Seed inputs: Sentences to be translated
  • transformation: replace words with specific part of speech (POS) tag in the input sentence, with another word that has the same POS tag. For example, a verb with another verb. Finally, use another NLP model (Google’s BERT in this case) to “predict” a suitable replacement candidate word.
  • metamorphic relation: the structure of the transformed output should match the original translation output sentence structure for the original input. Large deviations indicate potential errors. The test oracle metric is the difference in output sentence structures for the automated translation on the original input vs the transformed input.

Here is an illustrative example using Google Translate, and a sentence I picked (at the time) from this article. Translating that sentence from English to Finnish:

Translate a sentence from English to Finnish. Image by author.
Mask a word (noun) for a metamorphic transformation. Image by author.
Use BERT to predict a word to replace the masked word (one that “fits” in the mask). Check the metamorphic relation holds. Here it is fine. You just have to trust me on that, I am Finnish :). Image by author.

The above shows the metamorphic transformation and how the check for the defined metamorphic relation should hold. In this case the sentence structure holds fine (in my opinion as a native Finnish speaker) and the result is good. I performed these experiments manually to illustrate the concept, but the test process is the same whether automated or not. Overall, trying a few different sentences, Google Translate actually worked very well. Great for them.

To be honest, I did not really use BERT in the above example, since it was just one example I needed to illustrate the concept. I just picked a word that makes sense (to me). However, HuggingFace has really nice and easy to use implementations available of BERT and many other similar models if needed. I have used them myself for many other tasks. Much like the image augmentation libraries in the car example, the NLP libraries have come a long way, and many basic applications are quite simple and easy these days.

For more details on MMT for machine translation, I recommend checking the papers, especially the (He2020) is quite readable. An extra interesting point here is again the use of another ML-based approach to help in building the transformations, similar to the GAN-based approaches for autonomous cars.

Metamorphic Testing of Medical Images

As an example of a third application domain, applying metamorphic testing to ML-based systems in the medical domain is presented in (Ding2017). This uses MMT to test variants of existing high-resolution biological cell images.

In (Ding2017), a number of metamorphic relations are defined related to various aspects of the biological cells (mitochondria etc.) in the images, and the manipulations done to the image. I lack the medical domain expertise to analyze the transformations or metamorphic relations in more detail, and the paper does not very clearly describe these for me. But I believe my lack of understanding is actually a useful point here.

Metamorphic testing related elements in this case (as far as I understood):

Seed inputs: existing medical images (actually, the paper is very unclear on this along with many other aspects, but it serves as a domain example)

Transformations: Adding, removing, transforming, etc. of mitochondria in the images.

Metamorphic relations: The relations between the elements (mitochondria) in the algorithm outputs for the transformed images should match the defined relation (e.g., linking some elements after adding new ones).

This example highlights, for me, how in many cases, the nuances, the metamorphic relations and transformations require an in-depth domain understanding. This requires extensive collaboration between different parties, which is quite common (in my experience) in applying ML-based approaches. Cars, driving, and language translation are everyday tasks we are all familiar with. Many expert domains, such as in this example, less so. This is why I think this is a useful example in highlighting my lack of domain expertise.

Interestingly, (Ding2017) also mentions using traditional testing techniques such as combinatorial testing, randomization, and category-partitioning, to enhance the initial input seed set. This is also the case in the following example on drones.

Metamorphic Testing of Drones

As a final example domain, an approach of combining model-based testing, simulation, and metamorphic testing for testing an ML-based flight guidance systems of autonomous drones is presented in (Lindvall2017).

The drone is defined as having a set of sensors, including barometer, GPS, cameras, LIDAR, and ultrasonic. Many sensors, quite similar the autonomous cars example. The metamorphic relations defined for the drone control:

  • behaviour should be similar across similar runs
  • rotation of world coordinates should have no effect
  • coordinate translation: same scenario in different coordinates should have no effect
  • obstacle location: same obstacle in different locations should have same route
  • obstacle formation: similar to location but multiple obstacles together
  • obstacle proximity: always within defined bounds
  • drone velocity: velocity should stay inside defined bounds
  • drone altitude: altitude should stay inside defined bounds

Following are properties of such systems metamorphic testing environment:

  • Seed inputs: generated using the model-based approaches based on an environment model for the simulation
  • Transformations: See above; rotation and coordinate changes of drone vs environment and obstacles or obstacle groups, etc
  • Checks: See also above

A test environment generator is used to define (simulated) test environments for the drone, effectively generating the seeds of the metamorphic tests. The metamorphic transformations can be seen as modifications of this environment, and finally checks test that the above defined metamorphic relations hold. Various scenarios are defined to hold these together, including lift-off, returning home, landing, etc.

Perhaps the most interesting part here is the use of model-based testing approaches to build the seed inputs themselves, including the test environment. This seems like a very useful approach for gaining further coverage in domains where this is possible.

Another relevant observation in this is the use of scenarios to group elements together to form a larger test scenario, spanning also time. This is important, since a drone or a car, or many other systems, cannot consider a single input in isolation, but rather must consider a sequence of events, and use it as a context. This time aspect also needs to be taken into account in metamorphic testing.

Adversarial Inputs and Relations Across Time

Adversarial Inputs

A specific type of transformation that is often separately discussed in machine learning is that of adversarial inputs, which is extensively described in (Goodfellow2018). In general, an adversarial input aims to trick the machine learning algorithm to make a wrong classification. An example from (Goodfellow2018) is to fool an autonomous car (surprise) to misclassify a stop sign and potentially lead to an accident or other issues.

Generating such adversarial inputs can be seen as one example of a metamorphic transformation, with an associated relation that the output should not change, or change should be minimal, due to adversarial inputs.

Typically such adversarial testing requires specifically tailored data to trigger such misclassification. In a real-world driving scenario, where the car sensors are not tampered with, it might be harder to produce such purely adversarial effects. However, there are some studies and approaches, such as (Zhou2020) considering this for real-world cases. More on this in a bit.

Beyond autonomous cars, digitally altered or tailored adversarial inputs might be a bigger issue. For example, in domains such as cyber-security log analysis, or natural language processing, where providing customized input data could be easier. I have not seen practical examples of this from the real world, but I expect once the techniques mature and become more easily available, more practical sightings would surface.

Much of the work on adversarial elements, such as (Goodfellow2018), have examples of adversarially modified single inputs (images). Real systems are often not so simple. For example, as a car drives, the images (as well as other sensor data), and the decisions that need to be made based on that data, change continuously. This is what the (Zhou2020) paper discusses for autonomous cars.

Relations Across Time

In many cases, besides singular inputs, sequences of input over time are more relevant for the ML-based system. Driving past a sign (or a digital billboard..), the system has to cope with all the sensor data at all the different positions in relation to the environment over time. In this case, the camera viewing angle. For other sensors (LIDAR etc), the input and thus output data would change in a similar manner over time.

Following is an example of what might be two frames a short time apart. In a real video stream there would be numerous changes and images (and other inputs) per second:

Change of (same) input concept over time. Image by author.

Not only does the angle change, but time as a context should be more generally considered in this type of testing (and implementation). Are we moving past the sign? Towards it? Passing it? Did we stop already? What else is in the scene? And so on.

This topic is studied in (Zhou2020), which considers it from the viewpoint of adversarial input generation. In a real-world setting, you are less likely to have your image data directly manipulated, but may be susceptible to adversarial inputs on modified traffic signs, digital billboards, or similar. This is what they (Zhou2020) focus on.

The following example illustrates how any such modification would also need to change along with the images, over time (compared to calculating a single, specific altered input vs real-world physical data moving across time):

If testing with generated adversarial elements, they should change over time in relation to the rest of the image. Here the angle and size of the blue box. Image by author.

This temporal aspect is important in more ways than just for adversarial inputs. For example, all the image augmentations (weather effects, etc) I discussed earlier would benefit from being applied in a realistic driving scenario (sequences of images) vs just a single image. This is what the cars have to deal with in the real world after all.

The test oracle in (Zhou2020) also considers the effect of the adversarial input from two different viewpoints: strength and probability. That is, how large deviations can you cause in the steering of the car with the adversarial changes, and how likely it is that you can cause these deviations with the adversarial input.

Beyond cars and video streams, time series sequences are common in other domains as well. The drone scenarios discussed are one example. Other examples include processing linked paragraphs of text, longer periods of signal in a stock market, or basic sensor signals such as temperature and wind speed.

Minimizing the Test Set

While automating metamorphic testing can be quite straightforward (once you figure your domain relations and build working transformations…), the potential input space from which to choose, and the number of transformations and their combinations can quickly grow huge. For this reason, test selection in MMT is important, just as with other types of testing.

One approach to address this is presented in (Tian2018), which applies a greedy search strategy. Starting with a seed set of images and transformations, the transformations and their combinations are applied on the input (images), and the achieved neuron activation coverage is measured. If they increase coverage, the “good” combinations are added back to the seed set for following rounds, along with other inputs and transformations, as long as they provide some threshold of increased coverage. This iterates until defined ending thresholds (or number of experiments). Quite similar to more traditional testing approaches.

Coverage in (Tian2018) is measured in terms of activations of different neurons in the ML model. They build coverage criteria for different neural network architectures, such as convolutional neural nets, recurrent neural nets, and dense neural nets. Various other coverage criteria also have been proposed, that could be used, such as one in (Gerasimou2020) on evaluating the importance of different neurons in classification.

When more and easily applicable tools become available for this type of ML-model coverage measurement, it would seem a very useful approach. However, I do not see people generally writing their own neural net coverage measurement tools.

Relation to Traditional Software Testing

Besides test suite optimization, it is important to consider MMT more broadly in relation to overall software and system testing. MMT excels in testing and verifying many aspects of ML-based systems, which are more probabilistic and black-box in nature. At least to gain higher confidence / assurance in them.

However, even in ML-based systems, the ML-part is not generally an isolated component working alone. Rather it consumes inputs, produces outputs, and uses ML models for processing complex datasets. The combinatorial, equivalence partitioning, and model-based methods I mentioned earlier are some examples of how the MMT based approaches can be applied together with the overall, more traditional, system testing.

As I mentioned with the Baidu Apollo case and its LIDAR test data generation, one of the feedbacks was to use the metamorphic test data for further ML training. This in general seems like a useful idea, and it is always nice to get more training data. In my experience with building ML-based systems, and training related ML-models, everyone always wants more training data.

However, I believe one should not simply dump all MMT test data into the training dataset. A trained model will learn from the given data, and can be tested for general accuracy on a split test set. This is the typical approach to test a specific ML-model in isolation. However, in practice, the classifications will not be 100% accurate, and some items will end up misclassified, or with low confidence scores. These further feed into the overall system, which may have unexpected reactions in combination with other inputs or processes. Running specific (MMT based or not) tests with specific inputs helps highlight exactly which data is causing issues, how this behaviour changes over time, and so on. If you just throw your MMT tests into the training set and forget it, you lose the benefit of this visibility.

Besides MMT, and complimentary to it, other interesting approaches of tailoring traditional testing techniques for ML-based system testing exist. One specific approach is A/B testing (evaluating benefits of different options). In ML-based systems, this can also be a feedback loop from the human user, or operational system, back to testing and training. The Tesla Shadow Mode is one interesting example, where the autonomous ML-based system makes continuous driving decisions, but these decisions are never actually executed. Rather they are compared with the actual human driver choices in those situations, and this is used to refine the models. Similar approaches, where the system can learn from human corrections are quite common, such as tuning search-engine results and machine translations, based on human interactions with the system. You are changing / morphing the system here as well, but in a different way. This would also make an interesting seed input source for MMT, along with oracle data (e.g., driving path taken by human user) for the metamorphic relation.


Testing machine learning based systems is a different challenge from more traditional systems. The algorithms and models do not come with explicit specifications of inputs and outputs that can be simply tested and verified. The potential space for both is often quite huge and noisy. Metamorphic testing is one useful technique to gain confidence in their operation with a reasonable effort. Compared to traditional testing techniques, it is not a replacement but rather a complimentary approach.

I presented several examples of applying MMT to different domains in this article. While applications in different domains require different considerations, I believe some generally useful guidelines can be derived to help perform MMT over ML-based systems:

  • metamorphic transformations: these do not have to be hugely complex, but rather simple ones can bring good benefits, such as the addition of a few random points to the LIDAR cloud. Consider how the same input could change in its intended usage environment, and how such change can be implemented with least (or reasonable) effort as a transformation.
  • metamorphic relations: to build these relations, we need to ask how can we change the ML input, and what effect should it have on the output? Sometimes this requires deep domain expertise to identify most relevant changes, as in the medical domain example.
  • test oracles: These check that the performed transformation results in a acceptable (vs valid) output. Requires considerations such as how to represent the change (e.g., steering angle change, sentence structural change), possibly defining the probability of some error, the severity of the error, and a distance metric between the potential outputs after transformation (e.g., steering angle calculation). That is, the values are likely not fixed but in a continuous range.
  • time relation: in many systems, the inputs and outputs are not singular but the overall system performance over time is important. This may also require asking the question of how time might be impacting the system, and how it should be considered in sequences of metamorphic relations. The idea of overall test scenarios as providers of a broader context, time related and otherwise, is useful to consider here.
  • test data: can you use the user interactions with the system as an automated source of test inputs for transformations and metamorphic relations? Think Tesla Shadow mode, Google search results, and the inputs from the environment and the user, and use reactions to these inputs.

As discussed with some of the examples, an interesting trend I see is the move towards using ML-based algorithms to produce or enhance the (MMT-based) tests for ML-based systems. In the NLP domain this is shown by the use of BERT as a tool to build metamorphic transformations for testing natural language translations. In the autonomous cars domain by the use of GAN-based networks to create transformations between image properties, such as different weather elements and time of day.

Overall the ML field still seems to be advancing quite fast, with useful approaches already available also for MMT, and hopefully much more mature tooling in the next few years. Without good tool support for testing (data generation, model coverage measurement, etc), finding people with all this expertise (testing, machine learning, domain specifics, …), and implementing it all over again for every system, seems likely to be quite a challenge and sometimes a needlessly high effort without good support in tools and methods.

That’s all for today, this got way too long, so if someone managed to read all this far, I am impressed :). If you have experiences in testing ML-based systems and willing to share, I am interested to hear and learn in the comments 🙂

Zero Knowledge Proofs: Example with Pedersen Commitments in Monero

An Explanation Between Cartoons and Greek Symbols

If you spend a bit more time reading about cryptocurrencies, blockchains, or many other related technologies, you likely run into the term zero knowledge proofs (ZKP). To me, the term sounds like a paradox. How can you prove anything with zero knowledge? How do you event know what to prove, if you have zero knowledge?

So I tried to build myself some understanding on this. In this article, I try to share this understanding of what zero knowledge means in ZKP, and what is the proof really about. And how do the two relate to each other.

I originally posted this articke on Medium, where it can still be found.

I start with the simple example of the ZKP and Ali Baba cave, which seems to be often used as a simplified ZKP example. I then proceed with an example of what ZKP might look like as a cryptographic approach, and how Monero uses Pedersen Commitments to prove input and output amounts match without revealing the actual amounts. I look at it as a zero knowledge proof of the matching sums of input and output amounts, without revealing the amounts.

My goal in writing this article was to build a more intuitive understanding that fits between the Ali Baba cave, and the cryptographic theories. So no formulas with Greek symbols, but not just a Alice and Bob in a cave either.

The Basic Example: Ali Baba Cave

The usual example given for Zero Knowledge Proofs is the Ali Baba cave. This is a circular cave containing a door that unlocks only with a specific passphrase. Alice claims to know the passphrase, and wants to prove this to Bob, without revealing anything about the passphrase. The following image shows the layout of the imaginary cave:

Layout of the Ali Baba cave in this story. Bob image by David Rock Design on Pixabay, Alice image by Graphic Mama Team on Pixabay. Rest of the image by author.

The overall story is that Alice enters the cave, and walks up to Point 3 or Point 4. From there, she can return to the entrance via two paths, Path A and Path B. Once she is at one of the points, Bob comes to the entrance (Point 2) and calls for her to arrive there via one of the paths, which he chooses on random. Depending on which side of the door Alice is, she always has a 50% chance of being able to return to Bob without passing through the door. Otherwise, she needs to know the passphrase. Bob does not know if she used the door or not, he only sees her arrive through the correct path he called out.

The points of interest, as I marked then in the figure, are:

  • Point 1: Alice and Bob meet outside the cave. Alice enters the cave and walks to Point 3 or Point 4.
  • Point 2: A fork of two paths leading deeper into the cave. Once Alice reaches Point 3 or 4, Bob comes here, picks Path A or Path B,and calls for Alice to arrive via the chosen path.
  • Point 3: If Alice waits here, and Bob calls for her to come out through Path A, she can walk there without knowing the passphrase to the door. If Bob calls for Path B, she has to open the door to arrive through Path B.
  • Point 4: Same as Point 3, but here she has to know the passphrase if Bob calls for her to arrive via Path A.

The following animation illustrates the different scenarios that can play out:

A silly animation on Bob and Alice in Ali Baba cave. Bob image by David Rock Design on Pixabay, Alice image by Graphic Mama Team on Pixabay. Rest of the images and animation by author.

I initially found the Ali Baba cave to be a quite confusing example of Zero Knowledge Proof. Because we obviously know a lot, including:

  • There is a door in the cave
  • The door is locked, and there is a magic passphrase to open it
  • Alice is in the cave, (at the door)
  • Bob is outside
  • The cave is circular, and there are two paths in/out of it, with no shortcuts between them
  • Alice claims to know the passphrase
  • After the “proof” is presented, we somehow trust Alice knows the passphrase

So we know a lot about this problem, and the term zero knowledge seemed confusing to me. I eventually figured the following about this problem:

  • The protected secret is the passphrase
  • The Zero Knowledge here refers to not knowing anything that helps reveal the passphrase (or more generally, the protected secret)
  • The proof is in the action of Alice arriving through the correct path every time. Repeated so many times that, statistically, starting on the called side (thus no passphrase required) every time is incredibly unlikely
  • In the end Bob gains trust that Alice knows the secret, but gains zero knowledge of the secret itself (the passphrase). He may gain other knowledge, such as Alice arriving through correct path.

If Alice does not know the passphrase for the door, she has a 50% chance to pick the right path on the first run, 25% to get it right twice in a row, and so on, until the joint probability of always getting it right gets statistically incredibly small.

There are various other properties that are often mentioned as required for a good ZKP. Stack Exchange provides nice and concise definitions for two main ones, soundness and completeness:

  • Soundness: the proof system is truthful, so whatever it proves, it is true. If the proof shows Alice knows the passphrase, she should really know it.
  • Completeness: the proof system is comprehensive, so it can prove all true statements. If Alice knows the passphrase, the system should always show it as true. You cannot know, and fail to prove it.

I recommend the Stack Exchange post for more details, if interested.

Cryptographic Example: Matching Sums in Monero

The Ali Baba cave is an interesting ZKP example, but for me it was a bit hard to figure out how this relates to ZKP in blockchains, cryptocurrencies, and similar constructs. For this, I had a look at how Monero uses the Pedersen commitment to hide the amounts in transactions, while at the same time verifying that the output sums match input sums.

Before going into Monero and its use of the Pedersen Commitment, I will present some base concepts: Elliptic Curves (EC), and their use for basic commitments. Followed finally by Pedersen Commitments on Monero.

Elliptic Curves

Elliptic curves (EC) are a data structure that is commonly used in public-key cryptography. Their mathematical properties make them suitable for creating secure keys and for other cryptographic operations. They are also used in building the Pedersen Commitment in Monero, so we need some basic understanding about them for this article.

EC use in cryptography is referred to as Elliptic Curve Cryptography (ECC). In relation to cryptocurrencies, two commonly used (and standardized) curves are curve 25519 and Secp256k1. Secp256k1 is used by Bitcoin, and 25519 by Monero and many other cryptocurrencies (and other applications). The curves are commonly visualized like this:

Elliptic curves Secp256k1 and 25519 visualized. Image by author.

Generally, the actual elliptic curve is described as a more abstract structure, but this visualization serves as a way to get a bit more concrete idea of what is being discussed. The elliptic curve’s curve points are used as a basis for many operations in ECC. The red dots in the following figures illustrate such points:

Line and curve points over Secp256k1 and 25519. Image by author.

The red lines in these figures also illustrates an important concept related to ECC operations such as adding curve points together. This is the concept of finding lines that connect two dots on the curve, intersecting the curve at a third point. So, a line and three points it passes through.

The code I used to plot these base curves (and some of the following examples) is available on my Github.

Basic Commitment

So what is a commitment? We could look at the dictionary for a general definition, but here it refers to defining (binding) a value in a way that it cannot be changed, while hiding it. Since the value cannot be changed, we can say we are committed to it. At a later time, we can reveal the value by providing a secret to open its container. This leads to the two defining factors commonly used for (cryptographic) commitments:

  • hiding: until the committed value is revealed, it cannot be discovered
  • binding: once a commitment is made, its value cannot be changed

There can be different levels of hiding and binding, typically the tradeoff is between being perfectly hiding/binding and computationally hiding/binding. But that is more details than needed for this article at this point.

A typical example of a simple commitment is putting a piece of paper (e.g., a voting ballot), with a number, in a locked box, and providing the key later:

A basic commitment scheme example: Big chef, Little chef, and a ballot commitment scheme. Chefs, chest, and key from Pixabay. Thanks for the art! Combined story image by author.

In the above example, the big chef puts a voting ballot (with number 5 written on it) into the locked chest. The chest is then given to the little chef, who has no key. The big chef has now committed to a vote. It is inside a box, with no visibility, thus it is hidden. It can no longer be changed as the box is locked and given to someone (little chef) who has no key. Thus, the commitment has become binding. But little chef does not yet know what the vote is. At a later time, the key is given to little chef, and the vote (commitment) is revealed.

Basic Commitment with ECC

Like the above example with the chefs, Elliptic Curves can be used to build cryptographic commitment schemes. I will try to illustrate this here, starting from a simple example using a single curve point, and leading to the Pedersen Commitment. I borrow some basic ideas from the example in the Grim documentation. Thanks for the ideas!

As for notation, I will use a to represent the committed value. Mainly because if you read about Monero, this seems to be the general notation used. I guess referring to the committed amount of Monero.

Using a Single Curve Point

A curve point named H is defined and made public. We call this the public base point. This base point is actually so public that it is generally defined in the curve’s standard specification, such as for curve 25519 and for secp256k1. With a as the committed value (amount), the commitment c based on H becomes c = a * H. This c defines another point on the same Elliptic Curve, based on the base point H. The following figure illustrates this process:

Basic commitment with Elliptic Curves, and a simplified visualization. Image by author.

Here, the blue point is the base point H on the elliptic curve (not the real one for 25519, I just picked something for illustration). The commitment value would be the red point, which would equal to a*H, in this case with value a of 5, c=5*H.

Disclaimer: The below figure illustrates the simplified (fake) notation of Elliptic Curve math I use in this article, with values increasingly moving on the curve in the direction of the dotted red line. For example, 5*H in this figure is 5 line segments from H into the direction of the arrow. I use such increases for more intuitive illustration, real Elliptic Curve math is more complex and behaves different. I will describe the actual math briefly at the end of this article for anyone interested, with some link to find out more. My simplification here is just for illustrative purposes.

Illustration of the simplified visualization I use, points moving in direction of red line. Image by author.

Curves vs Chefs

Lets consider the above simple curve commitment as the chefs example. The big chef calculates c=5*H, and publishes c. The little chef does not see the value a (5) being committed, he only sees curve point c. To reveal the commitment, the big chef then reveals a=5, and the little chef can verify that a*H=c. Due to properties of Elliptic Curves and related mathematics, finding any other value x where x*H=c ,other than a, is considered very difficult, and computationally unfeasible (outside Quantum computing). Thus, this commitment is considered computationally binding.

Once a is revealed, the verification of c is trivial. Because H is a public value, the little chef knows it. Because c was published earlier , he can check a*H=c when a is revealed. If it matches, the commitment and the given a is verified.

If we compare this to the chefs example with the chest earlier, the following concepts should match:

  • chest = published (commitment) curve point c
  • ballot = committed value a
  • key = curve point equation a*H. Or just a, since H is known.

The Problem Here

The problem with the above example is that this process is good at binding the commitment value a, but very poor at hiding it. Lets assume we have 10 choices of values we could put on the chefs ballot, from 1 to 10. It would be trivial to brute force all these point values. Just calculate a*H for all a=1–10, and compare the result to the commitment c.

To address this, something called the blinding factor is used.

The Blinding Factor

The blinding factor is commonly referred to as r. This r is just a (secure) random number. Generally in range 1-2²⁵⁶, so very large. A slightly naive approach is to add this to a as in (r+a)*H. The following figure illustrates this with r=11 and a=5, resulting in published c=(11+5)*H=16*H:

Commitment to 5, with a blinding factor 11. Image by author.

Here, r is shown as the green dot. With this, there is no way to tell what a is, without knowing what r is. However, this has an added problem where any combination of a+r could be revealed as the commitment. Thus this approach would lose the binding property, as the commitment could be changed after publishing it. For example, for the above example we could publish any of (a=4, r=12), (a=5, r=11), (a=6, r=10), or many others as the commitment where the sum of r and a is 16. The commitment needs to be binding.

Pedersen Commitment

And solving this problem finally moves on to the Pedersen Commitment. For this, another base point G on the same curve can be used. This leads to the final commitment form of r*G+a*H=c. This is finally what is called the Pedersen Commitment (with EC). Using my made-up EC notation, we could visualize this like so:

Pedersen commitment on Elliptic Curve 25519, to value 5, and blinding factor 11*G. Image by author.

Now we are back to computationally binding, and at the same time hiding commitment. As I mentioned earlier, due to properties of elliptic curves, and the associated mathematics, it is considered extremely difficult to forge a commitment with different values of r and a, so that it would still match the published c=r*G+a*H. And since both r and a now use a different base point, they cannot be changed as when just summing (a+r)*H above. That’s what they tell me, and it’s enough for my goal in this article.

Homomorphic sums: Hiding the Values in Monero

Monero applies the Pedersen Commitment to hide the actual amounts in transaction inputs and outputs, while at the same time verifying that the sum of amounts in inputs and outputs match. Sounds like sorcery, so lets see how this works.

The homomorphic property of the Pedersen Commitment means that we can actually perform mathematical operations on the encrypted values, and it works just as if the values were not encrypted. In this case, this is based on the homomorphic nature of the Elliptic Curve math. We can simply add curve points together, and compare the results, even without revealing the actual transaction Monero amounts. This is what the magic in the matching Monero sums without knowing the values is based on. But I need concrete examples, so lets make one.

First, a brief description of how cryptocurrency transactions generally works. At a high level, a typical transaction looks something like this:

A cryptocurrency transaction at a high level. Image by author.

Each transaction takes a set of inputs that are spent (TxINs). Spending here means the cryptocurrency in those inputs are transferred to some output addresses on the blockchain, as transaction outputs (TxOUTs). An associated transaction fee is taken by the network, and treated as one of the TxOUTs. In Bitcoin, the transaction inputs and outputs, and their values are visible for anyone to see. In that case, the transaction looks pretty much just like the figure above, for anyone observing the blockchain.

Monero is a cryptocurrency especially focusing on privacy, and does many things to hide various parts of these transactions from the public view. One of those is to hide the amounts being transferred. Monero uses the Pedersen Commitment to hide the transaction amounts (values). In this case, the process now looks something like this:

A (Monero) transaction with amounts hidden. Image by author.

In this case, the input and output values are hidden from the public view. Only the fee is made visible. The only way to see the hidden values is if you have access to the required keys to decrypt the transaction. Thus the values are only visible to the sender and receiver. No other blockchain network nodes, or any other outsider, can see the concrete values that are in the transaction.

However, to maintain a valid ledger of transactions, the system must be able to validate that the total sum of the transaction outputs match its inputs. That is, no bigger amount of Monero is spent than is put in. Otherwise people could just spend some small amount of coins (Monero) in the transaction inputs, and create transaction outputs for much larger sums. This would effectively create Monero out of thin air, and break the entire system. Kind of like the central banks, and the associated monetary system in the “real world”, except we are still waiting for that pyramid to fall ;).

Matching input and output sums

As described earlier, EC Pedersen Commitments used by Monero use two public curve points, G and H. Each input and output Tx (TxIN and TxOUT) has its own Pedersen Commitment, so each one has its own a and r defined, like this:

Example transaction with all commitment values except fee blinding factor. Image by author.

In the above, a always matches the amount of Monero used in the TxIN or TxOUT. As I discussed earlier, the r is basically a large random number. Here I just picked some (small) numbers to illustrate the calculations.

The r in the above figure now becomes the G multiplier, and a the H multiplier. Generally, the numbers (especially r) would be bigger, but these work for an example.

TxIn Commitment

For the TxIN inputs in the above, the commitments would now be (using r and a values from image above):

  • 10$: c = 14*G + 10*H
  • 30$: c = 85*G + 30*H
  • 10$: c = 45*G + 10*H

Using the homomorphic addition property, the total commitment of all the TxIN’s is then:

  • Cin: 14*G + 10*H + 85*G + 30*H + 45*G + 10*H = (14*G + 85*G + 43*G) + (10*H + 30*H + 10*H)

TxOut Commitment

For the TxOUTs, the commitments would be:

  • 8$: c = 33*G + 8*H
  • 40$: c = 28*G + 40*H
  • (fee) 2$: x*G + 2*H

The blinding factor for the fee is still undefined (x) for now. To calculate x, let’s sum the other parts first.

Using homomorphic addition, the total commitment of all the TxOUT’s is:

  • 33*G + 8*H + 28*G + 40*H + x*G + 2*H = (33*G+28*G+x*G) + (8*H + 40*H + 2*H)

Visualizing the Amount on the Elliptic Curve

It is perhaps simpler to look at the a and r calculations here separately (for points H and G respectively). First the a and H:

  • TXin: 10*H+30*H+10*H=50*H
  • TXout: 8*H + 40*H + 2*H=50*H
Matching the sum of input vs output amounts, via their commitments. Image by author.

Both of these end up at the final curve point of 50*H. The final curve point is the part that is published in the transaction (as part of the commitment, combined with the blinding factors*G). The homomorphic nature of it allows comparing the final points to see that the sums match in input and output. This gives the binding property, as finding other matching a and r values would be computationally unfeasible, while allowing to match the amounts.

Next, we need to add all the blinding factors to gain the hiding property.

Visualizing the Blinding Factor on the Elliptic Curve

As for amount H above, for the blinding factor G we can do the same:

  • TxIN: 14*G+85*G+45*G=144*G
  • TxOUT: 33*G+28*G+x*G=61*G+x*G

As noted above, the x in these equations is the r for fee. Using my simplified EC visualization approach, it looks like this without the fee:

Initial setup of the random blinding factors for the Pedersen Commitment. Image by author.

Now we need to make also the TxIN and TxOUT curve points match here. TxIN final curve point is 144*G, so we need to make TxOUT final curve point match that.

Matching the TxIN curve to the TxOUT Curve Using Fee Blinding Factor

As you may have guessed, we can match the TxIN and TxOUT curve points for G like this:

  • r for fee (x above) = (sum of TxIN multipliers for G) – (sum of TxOUT multipliers for G, without fee) = (14+85+45)–(33+28)=83

The commitment for the fee then becomes:

  • 2$: c = 83*G + 2*H

Filling in the last missing spot (x) in the transaction, we get:

Using the fee blinding factor to balance the curve point sums. Image by author.

And when we plug this into the total output G calculation and figure above, we get the following:

  • TxIN: 14*G+85*G+45*G=144*G
  • TxOUT: 33*G+28*G+x*G=61*G+(144–61)*G=61*G+83*G=144*G

And we can now see that both input and output end up at the same curve point for G just like for H:

Using the fee to match the sums (and EC end points) of Pedersen Commitment blinding factors. Image by author.

The total summed commitments are now:

  • TxIN: 144*G + 50*H
  • TxOUT: 144*G + 50*H

We can now subtract the summed input and output commitments, and check if they match the commitment to zero.

Commitment to Zero

To prove that the total transaction inputs match the outputs, we can compare the sum of their commitments to a commitment to zero. Besides my bank account balance, commitment to zero here simply means:

  • z = 0 * H + 0 * G

And if we calculate the difference of the input and output commitments from the above, we get:

  • TxIN commitment (c_in): 144*G + 50*H
  • TxOut commitment (c_out): 144*G + 50*H
  • c_in-c_out: (144*G+50*H) – (144*G+50*H) = 0*G + 0*H
  • This is the same as commitment to zero above (z = 0*G + 0*H)

And so we have successfully proven that our total sum of inputs matches the total sum of outputs. Without revealing anything about the actual transaction amounts. This is true, since only the single, final curve points are included in the transaction data. There is no r or a in any form included in the public commitment. Only thing included is the final curve point that is the sum of 144*G+50*H.

This should reveal no information about the amounts, and, I believe, provides a practical example of zero knowledge proof with more cryptographical constructs. It proves the input and output amounts match, but tells you nothing (you gain zero knowledge) about the actual amounts.

The trialing code I used to make the above calculations and the verify that the input and outputs commitments and their differences comes to match the commitment to zero is available on my Github.

I am still no in-depth expert on cryptography, but for anyone interested to read more on homomorphic encryption and its relation to ZKP, a good start seems to be this post on Stack Exchange (as usual..). However, a brief look into actual EC math is in order after all the simplified examples I did above.

EC Addition and Multiplication, Achtually

All the above examples of elliptic curve addition, or multiplication, I showed the points simply move along the curve by x line segments, where x was the base point multiplier (x*G or x*H). For example, for 5*G I simply moved the point 5 line segments forward on the curve starting from G. This is not how real EC math works.

First, a simple addition of two points. Adding two points on an Elliptic Curve gives you a third point. The following figure illustrates this in four steps:

Adding two points to get a third (or fourth..) point on an Elliptic Curve. Image by author.

In the above figure the four steps are the following:

  • Step 1: Define the points you want to add (P1+P2).
  • Step 2: Find a line connecting these two points.
  • Step 3: The line should always intersect the curve at a third point. Find that third point (P3).
  • Step 4: Reflect P3 across the x-axis to find P4. P4 is now the result of adding P1+P2. So P1+P2=P4 in the above figure.

Why is the Elliptic Curve math done like that? I have no idea, none of the materials I read explained that part. But for me it is enough to have a good enough idea of what is happening and discussed for this topic.

Elliptic Curve Cryptography takes a bit specialized approach to this. Instead of adding two different points together, a public base point is defined, and added to itself N times. Using G to describe a base point, the following figure again illustrates this process in another 4 steps:

Adding a base point to itself on an Elliptic Curve. Image by author.

Again, G here is just a random point I picked for illustration, not any official point. In the above figure, the four steps are now:

  • Step 1: Find the defined public base point G. Since we are adding G to itself, it is used as both P1 and P2 from the previous example on adding P1+P2. Instead of P1+P2, we now do G+G.
  • Step 2: Find the line that passes through P1 and P2, or G and G in this case. For a single point, you could draw any line that passes through it (i.e., infinite possible lines). Someone smart has then decided that the tangent line is to be used. It is the line that just touches the curve at point G.
  • Step 3: Find the point where the tangent line crosses the curve, this matches P3 from the previous example.
  • Step 4: Reflect P3 across the x-axis to get P4, similar to the previous example. This results in G+G, or 2*G. Continue adding G to this to get 3*G, or add 2*G to itself to get 4*G. Repeat until you get the point you desired. For example, in the Monero example above I used values such as 14*G, 85*G, and so on. Just iterate this process to get to the x in x*G.

Additionally, there are other related points, such as use of prime numbers and modulos, for cases such as where the results would not fit in the N bits used. This gets called EC over finite fields, and other such fancy terms. However, the level I presented in this article is enough for me to have a basic understanding to read the blockchain articles, understand the topics, and possibly experiment a bit with the topic.

While all this sounds a bit complicated, luckily other people have already implemented all this in different libraries. For prototyping the topics in this article, I used an Elliptic Curve library for Golang, where the operations and calculations are implemented and verified for me. But I do find it very useful to understand what and why am I doing in those operations.

As I noted before, you can find some example code where I use a library for these calculations on my Github.

Computational vs Information Theoretical

Earlier I discussed how a commitment should be both hiding and binding. Until the committing party reveals the committed value, it should remain hidden. And it should not be possible to change the committed value after committing to it. In other words, the commitment should be binding. I believe I already mentioned the terms information theoretically hiding, and computationally binding.

The Pedersen Commitment is considered to be information theoretically hiding, meaning that no matter how much computing power you have, it is not possible to 100% determine the original committed value, if it is not revealed. As far as I understand, this is because there are always multiple possible values that could fulfill the commitment.

Because of this same reason, the Pedersen Commitment is considered to be computationally binding. Because there are multiple possible values that could fulfill the EC equations, with sufficient computing power you could find a different (and thus a “false”) commitment that would satisfy the verifier. In practice, for the Pedersen Commitment, finding such a point is known as the EC discrete logarithm problem, which is considered too hard to solve without Quantum computing.

This appears to be a common tradeoff between the hiding and binding properties. Perfect (information theoretical) hiding results in computational binding, and perfect binding in computationally hiding.

Back to the Roots: What is Zero Knowledge Proof?

After the lengthy detours, back to the original question. What is Zero Knowledge Proof? Looking back at the Ali Baba example, and the use of Pedersen commitment in Monero, I will try to summarize my take.

  • We know a lot about the problem, and are happy share a lot of that knowledge.
  • We have a specific piece of information we want to keep secret.
  • We wish to share zero knowledge of that secret, but prove we know it, or that something related to it is true.
  • The proof comes from applying a specific process according to the domain in question. This results in (statistically) sufficiently strong proof that we hold the proof as verified.
  • There are likely some properties that need to be assumed true, such as having true sources of randomness, or lack of Quantum computing.

In the Ali Baba cave, the process is that of Alice always arriving via the correct path to Bob, repeated so many times that the combined probability of Alice always picking the right path by chance is incredibly small. Assumptions including no cheating (no path to bypass door in cave), and no broken randomness (you cannot predict which path Bob calls).

In the Monero proof of matching input and output amounts via Pedersen Commitments, the process is in the Homomorphic calculations holding true. While this case is only computationally binding, the computing power and the associated discrete logarithm problem is considered so hard that the chance of “faking” the proof is also considered incredibly small. Assumptions here include no solution to the discrete logarithm problem (i.e., no Quantum computing), and again true sources of randomness (cannot guess r).


There are plenty of examples on Ali Baba cave, and resources such as Zero to Monero that are filled with Greek symbols and formulas. I wrote this piece in order to find myself the middle ground on what ZKP means, and how the conceptual examples like Ali Baba translate to the cryptographic world, and what they mean at a higher level. I hope this will be useful to someone else. At least I feel I got a better understanding to my questions.

This article also allowed me to get familiar with the Pedersen Commitment, which I had often seen mentioned, but wondered about the real meaning. In the end, I find ZKP, as most topics, is not too complicated once you get past the terminology. Just a lot of new concepts for me, which takes time.

In the case of Monero and the way it verifies and anonymizes its transactions, there are also various other features I have not touched here. These include range proofs (a.k.a. bulletproofs in current implementation) to ensure the Pedersen Commitments are always using positive numbers, ring signatures to hide the sender addresses, stealth addresses for hiding recipients, key images to identify transactions, and (Diffie-Helman) key-exchanges to communicate the r and a from sender to receiver in the Pedersen Commitment. But these are out of the scope of this article.

Just to repeat the general disclaimer, I am not a cryptographer. So do your own research if you really need to understand the details. Hope this was useful for someone anyway. Cheers.

Just to remind once more, the codes I used to trial this article are available on my Monero Scraper Github project. At the time of writing, under src/golang/pedersen and src/python/experiments/ellipticcurves.

Building an NLP classifier: Example with Firefox Issue Reports

DistilBERT vs LSTM, with data exploration

This is what the article is about. Robot photo by Phillip Glickman on Unsplash. Developer persona, Beetle, and Fox images by Open Clipart, tortoise by creozavr, ladybug by Prawny, bee by gustavorezende, other beetle by Clker. On Pixabay for the last N after the Unsplash. Composition by author.

Machine learning (ML) techniques for Natural Language Processing (NLP) offer impressive results these days. Libraries such as Keras, PyTorch, and HuggingFace NLP make the application of the latest research and models in the area a (relatively) easy task. In this article, I implement and compare two different NLP based classifier model architectures using the Firefox browser issue tracker data.

I originally posted this article on Medium, where you can still find it.

I previously built a similar issue report classifier. Back then, I found the deep-learning based LSTM (Long-Short-Term-Memory) model architecture performed very well for the task. I later added the Attention mechanism on top of the LSTM, improving the results. This LSTM with Attention is the first model type I chose for this article.

The second model type is Transformers. Transformers have become very popular in NLP over the past few years, and their popularity has spawned many variants. In this article, I wanted to get a feel for how they compare to the LSTM approach I applied before.

I use the HuggingFace (HF) DistilBERT as the Transformer model. For the LSTM, I use Keras with its LSTM and open-source Attention layers. I train them both to predict what component an incoming Firefox bug report should be assigned to. The idea would be to assist in bug report triaging tasks.

Getting the Data

As said, I am using the issue tracker for the Firefox browser as the source of my data. Since I am in no way affiliated with Mozilla of the Firefox Browser, I just used their Bugzilla REST API to download it. In a scenario where you would actually work within your company or with a partner organization, you likely could also just internally ask to get the data as a direct dump from the database. But sometimes you just have to find a way. In this case it was quite simple, with a public API to support downloads. You can find the downloader code on my Github. It is just a simple script to download the data in chunks.

Exploring and Cleaning the Data

The overall data I use contains the following attributes for each issue:

  • ID: A unique issue number, apparently across all Firefox/Mozilla product issues (not just the Browser)
  • Summary: A summary description of the issue. This is one of the text fields I will use as features for the classifier.
  • Description: A longer description of the issue. This is the other text field I will use as features for the classifier.
  • Component: The software component in the Firefox Browser architecture that the issue was assigned to. This will be my prediction target, and thus the label to train the classifier on.
  • Duplicate issues number, if any
  • Creator: email, name, user id, etc.
  • Severity: trivial to critical/blocker, or S1 to S4. Or some random values. Seems like a mess when I looked at it :).
  • Last time modified (change time)
  • Keywords: apparently, you can pick any keywords you like. I found 1680 unique values.
  • Status: One of resolved, verified, new, unconfirmed, reopened, or assigned
  • Resolution: One of Duplicate, fixed, worksforme, incomplete, invalid, wontfix, expired, inactive, or moved
  • Open/Closed status: 18211 open, 172552 closed at the time of my download.

Selecting Features for the Predictor

My goal was to build a classifier assigning bug reports to components, based its natural language description. The first field I considered for this was naturally the description field. However, another similarly interesting field for this purpose is the summary field.

By looking at the values, I can quite simply see there are some issue reports with no (or empty) description:

There are 2973 issue reports with missing description. Image by author.

I considered an option of using the summary field as another extra feature set. Almost all issue reports have a summary:

There are 4 issue reports with missing summary. Image by author.

I would need a way to build a classifier so each issue report could be classified, meaning they all should have the required features. Clearly description or summary alone is not sufficient here. One option I considered was to use the summary if the description is missing, or if the description classifier is not very confident in its classification.

A related paper I found used a combination of summary + description as the feature set. Combining these two gives a set where all items at least have some features to predict on:

Joining summary + description as a single feature set to have no issue with missing features. Image by author.

The strange looking apply function in the above simple adds the summary and description texts together to produce text_feature. With this, all issue reports have a non-empty text_feature as a result. This is what I used as the features for the classifier (tokenized words from text_feature).

Selecting the Components to Predict

To predict a component to assign an issue report to, I need the list of components. Getting this list is simple enough:

Getting the number of target components (52) for the issue reports. Image by author.

There are 52 components that have something assigned to them. The simple thing to do would be to just train a classifier to predict any of these 52 based on the text features. But as software evolves, some components may become obsolete, and trying to assign something to them at the current time might be completely pointless if the component is no longer relevant.

Lets see the details:

Getting the count of issue reports for all components. Image by author.

This gives the following list:

General                                 64153
Untriaged 18207
Bookmarks & History 13972
Tabbed Browser 9786
Address Bar 8120
Preferences 6882
Toolbars and Customization 6701
Theme 6442
Sync 5930
Menus 4440
Session Restore 4205
Search 4171
New Tab Page 4132
File Handling 3471
Extension Compatibility 3371
Security 3277
Shell Integration 2610
Installer 2504
PDF Viewer 2193
Keyboard Navigation 1808
Messaging System 1561
Private Browsing 1494
Downloads Panel 1264
Disability Access 998
Protections UI 984
Migration 964
Site Identity 809
Page Info Window 731
about:logins 639
Site Permissions 604
Enterprise Policies 551
Pocket 502
Firefox Accounts 451
Tours 436
WebPayments UI 433
Normandy Client 409
Screenshots 318
Translation 212
Remote Settings Client 174
Top Sites 124
Headless 119
Activity Streams: General 113
Normandy Server 91
Nimbus Desktop Client 83
Pioneer 73
Launcher Process 67
Firefox Monitor 61
Distributions 56
Activity Streams: Timeline 25
Foxfooding 22
System Add-ons: Off-train Deployment 15
Activity Streams: Server Operations 5

The number of issues per component is quite highly spread, and the distribution highly skewed (General almost has more reports assigned than others together). Specifically, some components have very few reports, and likely any classifier would do quite poorly with them.

First, lets see if I can find any components that might be obsolete and could be removed. This might happen over time as the software under test evolves, and some features (and their components) are dropped. One way to look at this is to find components that have had no activity for a long time. The following should show the latest date when a component had an issue created for it:

Getting a sorted list of latest issue creation time per component. Image by author.
Activity Streams: Timeline 2016-09-14T14:05:46Z
Activity Streams: Server Operations 2017-03-17T18:22:07Z
Activity Streams: General 2017-07-18T18:09:40Z
WebPayments UI 2020-03-24T13:09:16Z
Normandy Server 2020-09-21T17:28:10Z
Extension Compatibility 2021-02-05T16:18:34Z
Distributions 2021-02-23T21:12:01Z
Disability Access 2021-02-24T17:35:33Z
Remote Settings Client 2021-02-25T17:25:41Z
System Add-ons: Off-train Deployment 2021-03-23T13:58:13Z
Headless 2021-03-27T21:15:22Z
Migration 2021-03-28T16:36:08Z
Normandy Client 2021-04-01T21:14:52Z
Firefox Monitor 2021-04-05T16:47:26Z
Translation 2021-04-06T17:39:27Z
Tours 2021-04-08T23:26:27Z
Firefox Accounts 2021-04-10T14:17:25Z
Enterprise Policies 2021-04-13T02:38:53Z
Shell Integration 2021-04-13T10:01:39Z
Pocket 2021-04-15T02:55:12Z
Launcher Process 2021-04-15T03:10:09Z
PDF Viewer 2021-04-15T08:13:57Z
Site Identity 2021-04-15T09:20:25Z
Nimbus Desktop Client 2021-04-15T11:16:11Z
Keyboard Navigation 2021-04-15T14:40:13Z
Installer 2021-04-15T15:06:46Z
Page Info Window 2021-04-15T19:24:28Z
about:logins 2021-04-15T19:59:31Z
Site Permissions 2021-04-15T21:33:40Z
Bookmarks & History 2021-04-16T09:43:36Z
Downloads Panel 2021-04-16T11:39:07Z
Protections UI 2021-04-16T13:25:27Z
File Handling 2021-04-16T13:40:56Z
Foxfooding 2021-04-16T14:08:35Z
Security 2021-04-16T15:31:09Z
Screenshots 2021-04-16T15:35:25Z
Top Sites 2021-04-16T15:56:26Z
General 2021-04-16T16:36:27Z
Private Browsing 2021-04-16T17:17:21Z
Tabbed Browser 2021-04-16T17:37:16Z
Sync 2021-04-16T18:53:49Z
Menus 2021-04-16T19:42:42Z
Pioneer 2021-04-16T20:58:44Z
New Tab Page 2021-04-17T02:50:46Z
Messaging System 2021-04-17T14:22:36Z
Preferences 2021-04-17T14:41:46Z
Theme 2021-04-17T17:45:09Z
Session Restore 2021-04-17T19:22:53Z
Address Bar 2021-04-18T03:10:06Z
Toolbars and Customization 2021-04-18T08:16:27Z
Search 2021-04-18T09:04:40Z
Untriaged 2021-04-18T10:36:19Z

The list above is sorted by time, and all three components related to “Activity Streams” have last issues assigned to them 4–5 years ago. With this, I added them to a list of components to remove from the dataset. Seems pointless to assign any new issues to them with this timeline.

The Activity Streams: Timeline component was also one of the components in the earlier list with the fewest issues assigned to it. The other two components with very few issues created were Foxfooding and System Add-ons: Off-train Deployment. Since the issues for a component are listed in chronological order, looking at the last few in both of these should give some insight on their recent activity.

First System Add-ons: Off-train Deployment:

Last 5 issue reports for “System Add-ons: Off-train Deployment”. Image by author.

The above figure/table shows how the actual last reported issue is from 2019, and the last few after that are actually some kind of clones of old issues, made for some other purpose than reporting actual issues. So I dropped System Add-ons: Off-train Deployment from the dataset as well.

Foxfooding is described in the Firefox issue tracker as collecting issues for later triaging. Looking into it it only shows recent issues. I guess older ones might have been triaged. Without further knowledge, I left it in the dataset. With better access to domain experts, I might have removed it as it sounds like the actual issues in it could belong to many other components (and moved after triaging). But I expect it is not a big deal as it only has a few issues in it.

Few other components also had a bit longer period since last issue report. To get a better idea about how active these components have been over time, I plotted their count of issues over months. For example, Webpayments UI:

Plotting number of issues report over a month for “Webpayments UI”. Image by author.

WebPayments UI seems to have started quietly, gotten some attention, and quieted down again. The last of this attention was a bit over a year ago from today, on March 2020. I don’t know if it is still relevant, so I just left it in.

Finally, the list of components I removed for training as a result of this short analysis were the following:

  • System Add-ons: Off-train Deployment
  • Activity Streams: Timeline
  • Activity Streams: Server Operations
  • Activity Streams: General

The rest of the components seemed more likely to be relevant still. This left me with 48 target components from the original 52. I trained the initial set of models with these 48 target components. After a little bit of looking at the results, I removed one more component. Did you figure out which one it is already?

It is Untriaged. Because untriaged is just a list of issues that are not yet assigned to other components. Thus from machine learning perspective these issues are unlabeled. As far as I can see, keeping these in the training set can only confuse the trained classifier. So for further training iterations I removed also issues assigned to Untriaged, leaving 47 target components (labels).

In data analysis, it is easy to get sidetracked because of the next shiny. A bit like the Little Red Riding Hood in the forest I guess. In that line, some interesting facts can also be found by looking at the oldest reports with the Untriaged component/tag:

Oldest open issues in Untriaged state. Image by author.

The above list shows the oldest open and untriaged issue is from over 20 years ago (when writing this). It discusses the correct way to abbreviate “seconds”. In my experience, this is exactly how issue trackers in projects tend to evolve over time. No-one wants to say this issue does not matter and close it, yet no-one wants to make the effort or decide what to do with it. Or take the heat for the slightly irrelevant decision. Or maybe it is just forgotten.

A bunch of others in that list are also waiting for a few years, and if I remove the is_open requirement from the query, there are many very old issues in untriaged status. Issue trackers in general evolve this way, I guess. At least it is what I have seen, and it sort of makes sense. Like my storage closet, junk accumulates and easier to just leave it than do something about it.

Finally, just one more query to show the oldest issue created per component:

Date of oldest issue created per component. Cut in 2014 to save space. Image by author.

The above list actually seems to give a kind of a history on the development of the project. As I said, it is easy enough to get lost in data exploration like this, but I guess in a practical setting I should be focusing more on the task at hand. So lets get back to building the issue report classifier..

Training the Models

In this section I will briefly describe the models I used, and the general process I applied. For some assurance on training stability, I repeated the training 10 times for both models with randomly shuffled data. The overall dataset I used has 172556 rows of data (issue reports) after removing the five components discussed above.

A Look at the Models

First, the models.

Keras LSTM

A notebook setting up the Keras LSTM model, and running it can be found on my Github. The Keras summary function shows the structure:

Keras model with bi-directional LSTM layers and Attention layers. Image by author.

The input layer takes in a maximum of 512 word tokens. These feed into a Glove-based word-embedding layer, which converts the input token sequence into a 300-dimensional embedding vector. This is followed by a bi-directional LSTM layer that has 128 nodes. A self-attention layer follows, and feeds the attention output into another bi-directional LSTM layer that has 64 nodes. The output from here goes into a weighted attention layer, that passes it to the final Dense output layer. Many fancy words if not familiar with it, but worry not. It is simple to use in the end, and practical results will be presented.

Keras LSTM model, but with boxes and arrows. Image generated by Keras as instructed by author.

I recommend checking the layer docs for more information if interested.

HuggingFace DistilBERT

The HuggingFace DistilBERT is a bit more of a black box. A notebook for the training and testing is also on my Github. The Keras summary for it gives:

Model summary given by Keras for the HF DistilBERT model. Image by author.

I guess it is some kind of a custom Tensorflow implementation in a single layer from Keras viewpoint. Matches my previous attempts at trying to read about all the transformer architectures, where everyone quickly diverges into all the in-depth details from the general high level name, and I am just left wondering why is no-one able to provide an understandable and intuitive intermediate view. Anyway. The visual representation of it in terms of boxes and arrows in Keras is even better:

Model visualization by Keras for the HF DistilBERT. Image generated by Keras as instructed by author.

It’s all just a single box. I guess this is what you would call a black-box model (just color the box black to finish it..) :).

Sampling the Data

I sampled the dataset in each case into 3 parts. The model was trained on 70% of the data (training set), or 124126 issue reports. 20%, or 31032 issue reports (validation set), were used for evaluating model performance during training. 10%, or 17240, issue reports to evaluate the final model after training finished (test set). The sampling in each case was stratified, producing an equal proportion of different target components in each dataset of train, validation, and test.

I repeated the sampling to these 3 different sets 10 times, with different randomization in selecting items to each set. Another option would have been to do 10-fold splits and validations, which would be more systematic. But the sampling I used worked for my purposes. Luckily I am not writing a research paper for Reviewer 2 today, so lets pretend that is fine.

Training Accuracy over Epochs

The following figures illustrate the training loss and accuracy, for training the models on one set of the sampled data for both models. First the HuggingFace training for DistilBERT:

DistilBERT transformer evaluation loss and accuracy over training epochs. Image by author.

I set the HuggingFace trainer to evaluate the model at every 500 steps, providing the high granularity graph above. With the amount of data I used, and a batch size of 16, the number of steps in HF training over 3 epochs was 22881. At each 500 of these the evaluation was performed, and shows as a point on the above graph. As shown in the figure, training was quite consistent but leveled at around epoch 2.

The Keras LSTM training metrics per epoch are shown below:

LSTM training and validation set loss and accuracy over epochs. Epoch 2 = 1.0 due to 0-index. Image by author.

In this figure, epoch 0.0 is epoch 1, and 1.0 epoch 2, and so on. It is simply due to how I used a 0-indexed array for it. The name test is also actually referring my validation set in this case, I always get the terms confused. Sorry about that. More importantly, I trained this model for 4 epochs, and each point in this figure shows the evaluation result after an epoch. The validation always peaked on epoch 2 for this model.


Results are perhaps the most interesting part. The following table shows 10 different runs where the LSTM and Transformer classifiers were trained on different dataset variants as described before:

10 different runs of train + predict with top 1 and top 5 accuracy. Image by author.

Each row in the table is a separate training on different random shuffles of the data. The table has the following columns:

  • hf_best_epoch: the epoch where the lowest loss (for validation set) was recorded for HF. In Keras this was always epoch 2, so I did not include a column for it.
  • hf_val_loss: the validation set loss at hf_best_epoch as given by HF.
  • hf_val_acc: the validation set accuracy at same point as hf_val_loss.
  • k_val_loss: the validation set loss at end of best epoch as given by Keras.
  • k_val_acc: the validation set accuracy at same point as k_val_loss.
  • hf1_test_acc: HF accuracy of using the best model to predict the target component, and only taking the top prediction.
  • k1_test_acc: same as hf1_test_acc but for the Keras model.
  • hf5_test_acc: same as hf1_test_acc, but considering if any of the top 5 predictions match the correct label. Think of it as providing the triaging user with 5 top component suggestions to assign the issue to.
  • k5_test_acc: same as h5_test_acc but for the Keras model.

Comparing Accuracy in Transformers vs LSTM

The results from the above table are quite clear, and the Transformer version outperforms the LSTM version in each case. The difference for top-1 prediction accuracy is about 0.72 vs 0.75, or 72% vs 75% in favor of the Transformer architecture. For the top-5 the difference is about 96% vs 97% accuracy. The difference in loss is about 0.91 vs 0.84, again favoring the Transformer.

In a research study promoting research results these would be a big deal. However, in a practical situation the significance of this depends on the target domain. Here, my aim was to build a classifier to help the triaging process by suggesting components to assign new issue reports to. In this case, a few misses, or a difference of 96% vs 97% in top 5 may not be that big a deal.

Additionally, besides this classification performance, other considerations may also be relevant. For example, the LSTM in general trains faster and requires fewer resources (such as GPU memory). This and similar issues might also be important tradeoffs in practice.

A Deeper Look at Misclassifications

Beyond blindly looking at accuracy values, or even loss values, it is often quite useful to look a bit deeper at what did the classifier get right, and what did it get wrong. That is, what is getting misclassified. Let’s see.

In the following, I will present multiple tables across the models and their misclassifications. These tables hold the following columns:

  • total: total number of issue reports for this component in the entire dataset
  • test_total: number of issue reports for this component in the test set
  • fails_act: number of issues for this component that were misclassified as something else. For example, there were 1000 issue reports that were actually for component General but classified as something else.
  • fails_pred: number of issues predicted for this component, but were actually for another component. For example, there were 1801 issues predicted as General but their correct label was some other component.
  • total_pct: the total column value divided by the total number of issues (172556). The percentage this components represents from all the issues.
  • test_pct: same as total_pct but for the test set.
  • act_pct: how many percent of test_total is fails_act.
  • pred_pct: how many percent of test_total is fails_pred.

Keras LSTM Top-1 Prediction Misclassifications

First, failure statistics for the Keras LSTM classifier, and its top-1 predictions:

Keras LSTM classifier failure stats for top-1 prediction. Image by author.

Predicting only the most common label is often used as the very baseline reference, and in this case we could then expect an accuracy of 37% if we always predicted each issue report to be assignable to the General component. Because the table above shows it holds 37% of all issues.

Something that I find slightly unexpected here is that even though General is so much more common in the set of issues, it does not have an overly large proportion of misclassified issues attributed to it (act_pct + pred_pct). Often such a dominant label in the training set also dominates the predictions, but it is nice to see it is not too bad in this case.

Instead, there are others in that list that more stand out. For example, Shell Integration looks quite bad with 83% (217 of 261) of its actual test set issues being misclassified for some other component (act_pct). One might consider this to be due to it having such a smaller number of issues in the training set, but many components with even fewer issues are much better. For example, the one visible in the table above, Installer, has a 32% (act_pct) fail rate only.

To analyze these causes deeper, I would look in a bit more detail at the strongest misclassifications (in terms of component probabilities) for Shell Integration, and try to determine the cause of mixup. Perhaps some feature engineering would be in order, to preprocess certain words or tokens differently. But this article is long enough as it is, so not going there.

Something a bit more generic, that I looked further in the statistics, is pairs of misclassifications. The following list shows the most common misclassified pairs in the test set. For example, the top one shows 204 issues for Tabbed Browser being predicted as General. And similarly, 134 General issues predicted as Tabbed Browser. Clearly, the two seem to be mixed often. Our friend, Shell Integration also seems to be commonly mixed with General. And so on.

Pairs of most common misclassifications for LSTM top-1. Image by author.

Overall, the biggest General component also dominates the above list, as one might have expected. Maybe because it is general, large, and thus randomly holds a bit of everything..

Keras LSTM Top-5 Predictions Misclassifications

Besides top-1 predictions, I also collected top-5 predictions. Top-5 means taking the 5 predicted components for an issue, and considering the prediction correct if any of these 5 is the expected (correct) label. The following shows similar stats for the top-5 as before for top-1:

Keras LSTM classifier failure stats for top-5 prediction. Image by author.

This is otherwise similar to top-1, but the fails_pred column has large values, because each issue report had 5 predictions counted for it. So even if one of the 5 was correct, the other 4 would be calculated in fails_pred here.

For the rest of the values, the numbers are clearly better for top-5 than for the top-1 results. For example, General has fails_act value of only 3, while in top-1 it was 1000. Likely because of its dominant size it gets into many top-5’s. This drop from 1000 to 3 is a big improvement, but the overall top-5 accuracy is only 97% and not 99.9% as the components with fewer instances are still getting larger number of misclassifications. For example, Shell Integration still has about 17% act_pct, even with top-1 relaxed to top-5. Much better, but also not nearly 0% fails.

HuggingFace DistilBERT Top-1 Prediction Misclassifications

Let’s move to DistilBERT results for the top-1 prediction. Again for details, see my Github notebook, and two scripts to go with it. The overall values are clearly better for this model vs the LSTM as was illustrated by the general statistics earlier. However, I see a slightly different trend in this vs the LSTM. This model seems to balance better.

HF DistilBERT classifier failure stats for top-1 prediction. Image by author.

The number of failed predictions for the dominating General component are higher than for the LSTM model. For example, fails_act here is 1063, while in the LSTM model top-1 it was 1000. However, as the overall accuracy was quite a bit better for this model (72% LSTM vs 75% for this), this gets balanced by the other components having better scores. This is what I mean with more balanced. The errors are less focused on few components.

For example, the fails_act for the nasty Shell Integration is down to 185 for DistilBERT here, while it was 217 in the LSTM results. Still not great, but much better than the LSTM. Most of the other components are similarly lower, with a few exceptions. So this model seems to be overall more accurate, but also more balanced.

To further compare, here are the most commonly misclassified pairs for this model:

Pairs of most common misclassifications for DistilBERT top-1. Image by author.

This also shows the similar trend, where the total number of misclassifications is lower, but also no single pair dominates the list as strongly as in the LSTM model results. And again, the pair of Tabbed Browser and General are at the top with getting mixed with each other. Along with General being a part of almost every pair in that list. Looking into these associations in more detail would definitely be on my list if taking this classifier further.

HuggingFace DistilBERT Top-5 Prediction Misclassifications

And the results for the DistilBERT top-5:

HF DistilBERT classifier failure stats for top-5 prediction. Image by author.

Similar to top-1, this one has a higher fails_act for General but mostly lower for the others, leading to what seems to be a more balanced result, along with the higher overall accuracy (96% LSTM vs 97% this model/DistilBERT).

Short Summary on Model Differences

The results I presented could likely be optimized quite a bit. I did not do any hyperparameter optimization, or model architecture tuning, so my guess is that both the LSTM and Transformer model could have been optimized further. Might also be useful to try the other HF transformer variants. However, my experience is that the optimization gains are not necessarily huge. My goal for this article was to build a practical classifier from scratch, and compare the trendy Transformers architecture to my previous experiments with Attention LSTM’s. For this, I find the currect results are fine.

Overall, I find the results show that the Transformer in general gives better classification performance, but also is more balanced. The LSTM still produces good results, but if resources were available I would go with the Transformer.

One point I find interesting in the model differences is that the LSTM seems to do slightly better for the dominant General component, while the Transformer seems to do better on the other ones. This analysis was based on one pair of the 10 variants I trained, so with more time and resources, looking at different training variants would likely give more confidence still.

One way to apply such differences in model performance would be to build an ensemble learner. In such a combined model, both the LSTM and Transformer models would contribute to the prediction. These models would seem to be good candidates, since they have mutual diversity. Meaning one produces better results in some parts of the data, while the other in a different part. In practical production systems, ensembles can be overly complex for relatively small gains. However, in something like Kaggle competitions, where fractions of percentage in the results matter, this would be good insights to look into.

Predictor in Action

The article so far is very much text and data, and would contribute for a very boring powerpoint slideset to sell the idea. More concrete, live, and practical demonstrations are often more interesting.

Back when I presented my Qt company classifier based on their issue tracker data, I had similar data with some older classifiers (RF, SVM, TF-IDF, …) and the LSTM. Back then, I found the LSTM classifier produced surprisingly good results (similar to here). I made a presentation of this, showed some good results, and people naturally had the question: How does it actually work on issue reports it has not seen before?

One way to address this question is to try to explain that the training data did not include the validation or the test set, and thus it already measures exactly how well it would do on data it has not seen before. But not everyone is so familiar with machine learning terminology or concepts. To better address this question, I then opened the issue tracker for that day, took one of the newest, still untriaged issues (no component assigned), and copied its text. And then ran the classifier live on that text. Asked if people thought it was a good result. We can simulate that also here.

I took an issue report filed on May 13th (today, in 2021), as my training dataset was downloaded on the 21st of April. As opposed to a live demo, you just have to trust me on that :).

For this experiment, I picked issue number 1710955. The component label it has been assigned by the developers is Messaging System. The following shows the top-5 predictions for the issue using the HF DistilBERT model. First using only the summary field as the features, followed by using both summary + description as features.

HF DistilBERT predicted components for issue 1710955. Image by author.

The top line above shows the Messaging System as the top prediction, which is correct in this case. Second place goes to New Tab Page. As shown by the middle line, with just the summary field, the Messaging System is correctly predicted as the top component at 98.6% probability, followed by New Tab Page at 84.7%. The third and final line above shows how adding the description field to the features, the predicted top-5 remains the same but the classifier becomes more confident with Messaging System given 99.1% probability, and New Tab Page dropping to 62.5% on second place.

The summary field text for this issue report is simply “Change MR1 upgrade onboarding to Pin then Default then Theme screens”. Check the issue report for more details and the description. I think its quite impressive to predict the component from such short summary text, although I guess some words in it might be quite specific.

While this was a very good result for a random issue I picked, it is maybe not as believable as picking one in a live demonstration. In a more of a live demonstration, I could also ask people to write a real or imaginary bug report right there, run the classifier on it, and give them a classification for it, asking their opinion on its correctness. But, as usual, last time I tried that it was a bit difficult to get anyone to volunteer and I had to pick one myself. But at least you can do it live.


Well that’s it. I downloaded the bugreport data, explored it, trained the classifiers, compared the results, and dug a bit deeper. And found a winner in Transformers, along with building myself and increased understanding and insights into the benefits of each model.

In the end, writing this article was an interesting exercise in trying out the Transformer architecture for a real-world dataset, and gaining insight in comparison the LSTM architecture I used before. Would someone use something like this in real world? I don’t know. I think such applications depend on the scale and real needs. In very large companies with very large products and development departments, I can see it could provide useful assistance. Or integrated as part of an issue tracker platform for added functionality. Think about the benefit of being to advertise your product uses the latest deep-learning models and AI to analyze your issues.. 🙂

Besides comparing to my previous work on building a similar bug classifier for the Qt company issue tracker, I also found another paper from around the same time when I wrote my previous study, on a similar issue tracker classification, called DeepTriage. I considered using it for comparison, but their target was to predict the developer to assign the issue to, so I left that out. It was still useful for providing some insight on where to find accessible issue trackers, and how they used the data for features.

However, when I went searching the internet for the word DeepTriage after that paper, I found another paper with the same name. DeepTriage for cloud incident triaging. Yes, in my previous articles on metamorphic-testing of machine-learning based systems I noted how many DeepXXX articles there are. Apparently it is now so popular to name your research DeepXXX that you even re-use names. In any case, this was a paper from Microsoft, talking about the cost and need for fast triaging in datacenters, imbalanced datasets and all that. And how they use this type of technique in Azure since 2017 with thousands of teams. So, as I said, this likely becomes useful once your scale and real-world requirements catch up.

Well, my goal was to play with Transformers and maybe provide some example of building a classifier from real-world NLP data. I think I did ok. Cheers.

Understanding the Poisson Distribution

I find probability distributions would often be useful tools to know and understand, but the explanations are not always very intuitive. The Poisson distribution is one of the probability distributions that I have run into quite often. Most recently I ran into it when preparing for some AWS Machine Learning certification questions. Since this is not the first time I run into it, I figured it would be nice to understand it better. In this article, I explore questions on when, where, and how to apply it. And I try to keep it more intuitive by using some concrete example cases.

Example Uses of the Poisson Distribution

Before looking into details of something, it is nice to have an idea of what that something is useful for. The Poisson Distribution is typically described as a discrete probability distribution. It tells you the probability of discrete number of events in a timeframe, such as 1 event, or 2 events, or 3 events, … Not 1.1 events, or 2.7 events, or any other fractional events. Whole events only. This is why it is called discrete.

Some example uses for the Poisson distribution include estimating the number of:

  • calls to a call center
  • cars passing a on a highway
  • fatal airplane crashes in a year
  • particles emitted in radioactive decay
  • faults in physical hardware
  • people visiting a restaurant during lunch time

One typical use of these estimates would be as input to capacity planning (e.g., call center staffing or hardware provisioning). Besides events over time, the Poisson distribution can also be used to estimate instances in an area, volume, or over a distance. I will present some examples of these different application types in the following sections. For further ideas and examples of its application, there is a nice Quora question/answers.

Let’s start by looking at what the Poisson distribution is, and how to calculate it.

Properties of the Poisson Distribution

A process that produces values conforming to a Poisson Distribution is called a Poisson Process. Such a process, and thus the resulting Poisson Distribution, is expected to have the following properties:

  • Discrete values. An event either occurs or not, there is no partial occurrence here.
  • Independent events. Occurrence of one event does not affect the occurrence of other events.
  • Multiple events do not occur at the same time. However, the sub-intervals in which they occur may be very small.
  • Average (a.k.a. the mean) number of events is constant through time.

The following figure/line illustrates a Poisson Process:

Poisson process example, with average of 5 units between events.

The dots in the above figure illustrate events occurring. The x-axis illustrates time passing. The units of time (on the x-axis here) could be any units of time, such as seconds, minutes, hours, or days. They could even be other types of units, such as distance. The Poisson distribution / process does not really care about the unit of measurement, only that the properties of the distribution are met. I generated the above figure using a random process that would generate values between 1-10, with an average value of 5. In practice, I used the Python random module to generate the numbers.

As I said, the time unit could be anything, but for clarity of example I will use seconds. To calculate the Poisson distribution, we really only need to know the average number of events in the timeframe of interest. If we calculate the number of dots in the above figure, we get 18 events (dots). Now assume we want to know the probability of getting exactly 18 events in a timeframe of 100 seconds (as in the figure).

In this process, the average number of events would be 100/5 = 20. We have a timeframe of 100 seconds, and a process that produces events on average every 5 seconds. Thus on average we have 100/5 = 20 events in 100 seconds.

To answer the question on the probability of having exactly 18 events in a timeframe of 100 seconds, we can plug these numbers into the Poisson distribution formula. This gives a chance of about 8.4% for having 18 events occur in this timeframe (100s). In the Poisson Distribution formula this would translate to the parameters k = 18, λ (lambda) = 20.

Let’s look at what these means, or what is the formula:

The Poisson Distribution Formula

To calculate the Poisson distribution, we can use the formula defined by the French mathematician Siméon Denis Poisson, about 200 years ago:

Poisson distribution formula.

It looks a bit scary with those fancy symbols. To use this, you really just need to know the values to plug in:

  • e: Euler’s number. A constant value of about 2.71828. Most programming languages provide this as a constant, such as math.e in Python.
  • λ (lambda): The average number of events in an interval. Sometimes μ (Mu) is used, but that is just syntax, the formula stays the same.
  • k: The number of events to calculate the probability for.

For the more programming oriented, here is the same, in Python:

The Poisson distribution formula in Python.

This code, and the code to generate all the images in this article is easiest to access on my Kaggle notebook. Its a bit of a mess, but if you want to look at the code, it’s there.

If we plug the values for the imaginary Poisson Process from the previous section (18 dots on a line, with an average of 20) into this formula, we get the following:

(e^-20)*(20^18)/18! = 0.08439

Which translates to about 8.4% chance of 18 events (k=18) in the timeframe, when the average is 20 events (λ=20). That’s how I got the number in the previous section.

Example: Poisson Distribution for Reliability Analysis

I find concrete examples make concepts much easier to understand. One area that I find is a fitting example here, and where I have experience in seeing the Poisson distribution applied, is system reliability analysis. So I will illustrate the concept here with an example of how to use it to provide assurance of “five nines”, or 99.999%, reliability in relation to potential component failures.

There are many possible components in different systems that would fit this example quite nicely. I use storage nodes (disks in good old times) as an example. Say we have a system that needs to have 100 storage nodes running at all times to provide the required level of capacity. We need to provide this capacity with a reliability of 99.999%. We have some metric (maybe from past observations, from manufacturer testing, or something else) to say the storage nodes we use have an average of 1 failure per 100 units in a year.

Knowing that 1 in 100 storage nodes fails on average during a year does not really give much information on how to achieve 99.999% assurance of having a minimum of 100 nodes up. However, with this information, we can use the Poisson distribution to help find an answer.

To build this distribution, we just plug in these values into the Poisson formula:

  • λ = 1
  • k = 0-7

This results in the following distribution:

Poisson distribution for λ (avg)=1, k (events) = 0-7.

Here λ (avg in the table) is 1, since we have the average number of events at 1. For k (events in the table above), I have simply started at 0, and increased by 1 until I reached the target probability of 99.999%. The Poisson formula with these values is also in the table for each row. Adding up all the probabilities gives us the value of 7 as the number of potential failures until we reach the probability of 99.999% (cumulative in the table).

So how did we reach 7? We have a probability of 36.7879% for 0 failures, and the same 36.7879% for 1 failure. The cumulative probability for 0 or 1 failures is 36.7879 + 36.7879 = 73.5759%. The probability of exactly 2 failures is 18.394%. The cumulative probability of 0, 1, or 2 failures is thus 73.5759 + 18.394 = 0.919699%. And so on, as visible in the table (cumulative column). Since these are exclusive numbers (we cannot have both 0 and 1 failures at the same time), we need to sum them up to get the cumulative chance of avoiding all scenarios of 0-7 failures. This is the cumulative 99.999% value in the table row 7.

So the answer to the original question in this example is to start the year with 107 nodes running to have 99.999% assurance of keeping a minimum of 100 nodes running at all times through the year. Or you could also do 108 depending on how you interpret the numbers (does the last failure count as reaching 99.999%, or do you need one more).

Of course, this does not consider external issues such as meteor hits, datacenter fires, tripping janitors, etc. but rather is to address the “natural” failure rate of the nodes themselves. It also assumes that you wouldn’t be exchanging broken nodes as they fail. But this is intended as an analysis example, and one can always finetune the process based on the results and exact requirements.

For those interested, some further examples of such applications in reliability analysis are available on the internet. Just search for something like poisson distribution reliability engineering.

Poisson Distributions for average of 0-5 events

Now that we have the basic definitions, and a concrete example, let’s look at what the distribution generally looks like. With the same parameters, the Poisson distribution is always the same, regardless of the unit of time or space used. Using an average number of events 1/minute or 1/hour, the distribution itself is the same. Just the time-frame changes. Similarly, looking at an average of instances per area of 10/m2 or 10/cm2 will have the same distribution, just over a different area.

Here I will plot how the distribution evolves as the λ (average number of events in timeframe or area) is changed. I use a varying number of events k in each case up until the probability given by the Poisson distribution is very small. I use discrete (whole) number for the λ, such as 0, 1, 2, 3, 4, and 5, although the average can have continuous values (0.1 would be fine for λ). Let’s start with λ of 0:

Poisson distribution for 0-5 events (k) with an average of 0 events (λ) in time interval.

The above figure shows the Poisson distribution for an average of 0 events (λ). If you think about it for a moment, the average number of events can only be 0 if the the number of events is always 0. This is what the above figure and table show. With an average of 0 events, the number of events is always 0.

Poisson distribution for 0-5 events (k) with an average of 1 events (λ) in time interval.

A bit more interesting, the above figure with an average of 1 events (λ) in the time interval shows a bigger spread of probable number of events. This distribution is already familiar from the reliability estimation example I showed, as it had λ of 1. Here the number of events (k) can be both 0 or 1 equally often (about 36.8%), after which the probability quickly goes down for larger number of events (k). The average of 1 is balanced by the large probability of 0’s and smaller probability of the larger numbers for k.

Poisson distribution for 0-7 events (k) with an average of 2 events (λ) in time interval.

Here, with figure with an average of 2 events (λ) in the interval, we see some trends with the increase of the average number of events (λ) as compared to the distributions with λ of 0 and 1. As λ increases, the center of the distribution shifts right, the distribution spreads wider apart (x-axis becomes broader), and the percentage of single values becomes smaller (y-axis is shorter). So there are more values for k, each one individually having smaller probabilities but summing to 1 in the end.

Poisson distribution for 0-8 events (k) with an average of 3 events (λ) in time interval.

Compared to the smaller averages (0 ,1, 2), the above figure with average of 3 events (λ) shows further the trend where the average value itself (here k=3) has the highest probability, but with the value right below it (here k=2) closely (or equally) matching in probability. And the center shifting right, with the spread getting broader.

Poisson distribution for 0-10 events (k) with an average of 4 events (λ) in time interval.

Both the number of 4 (figure above) and 5 (figure below) for the average number of events (λ) show the same trends as with 0, 1, 2, and 3 for λ continue further.

Poisson distribution for 0-11 events (k) with an average of 5 events (λ) in time interval.

To save some space, and to illustrate the Poisson distribution evolution as the average number of events (λ) rises, I animated it over different λ values from 0 to 50:

Poisson distribution from 0 to 50 (discrete) averages (λ) and scaled number of events (k).

As I noted before, the average (λ) does not need to be discrete, even if these examples I use here are. You cannot have 0.1 failures, but you can have on average 0.1 failures.

With the above animated Poisson distribution going from 0 to 50 values for λ, the same trends as before are again more pronounced – As the average number of events (λ) rises,

  • The overall number of events predicted goes up (x-axis center moves right).
  • The distribution becomes wider (x-axis becomes broader) with more values for k (number of events).
  • The probability of any single number of events (k) get smaller (y-axis is shorter).
  • The probability always peaks at the average (highest probability where λ=k).
  • The summer probability always stays at 1 (=100%) over the x- and y-axis.

That sums up my basic exploration of the Poisson distribution.

Example: Snowflakes, Call Centers, and a Poisson Distribution

As I noted, I find concrete examples tend to make things easier to understand. I already presented the example from reliability engineering, which I think makes a great example of the Poisson distribution as it seems like such an obvious fit. And useful too.

However, sometimes the real world is not so clear on all the assumptions on applying the Poisson distribution. With this in mind, I will try something a bit more ambiguous. Because it is winter now (where I live it is), I use falling snowflakes as an example to illustrate this problem. When I say snowflakes falling, I mean when the flakes hit the ground, on a specific area.

Assume we know the average rate of snowflakes falling in an hour (on the selected area) is 100, and we want to know the probability of 90 snowflakes falling. Using the Poisson distribution formula defined earlier in this article, we get the following:

  • e = 2.71828: as defined above (or math.e)
  • λ = 100: The average number of snowflakes falling in a given timeframe (1 hour).
  • k = 90: The number of snowflakes we want to find the probability for.

Putting these into the Poisson formula, we get:

(e^-100)*(100^90)/90! = 0.025.

Which translates to about 2.5% probability of 90 snowflakes falling in an hour (the given time-frame), on the given area.

To calculate the broader Poisson probability distribution for different number of snowflakes, we can, again, simply run the Poisson formula using different values for k. Remember, k here stands for the number of events to find the probability for. In this case the events are the snowflakes falling. Looping k with values from 75 to 125 gives the following distribution:

Poisson distribution for k=75-125, λ=100.

Sorry for the small text in the image. The bottom shows the ticks for x-axis as the numbers from 75 to 125. These are the values k that I looped for this example. The bars in the figure denote the probability of each number of snowflakes (running the Poisson formula with the value k). The numbers on top of the bars are the probabilities for each number of snowflakes falling, given the average of 100 snowflakes in the interval (λ). As usual, the Poisson distribution always peaks at the average point (λ, here 100). The probability at the highest point (k=100) here is only about 4% due to wide spread of the probabilities in this distribution.

So as long as we meet the earlier defined requirements for applying the Poisson distribution:

  • Discrete values
  • Independent events
  • Multiple events not occurring simultaneously
  • The average number of events is constant over time

As long as these hold, we can say that there is a 4% chance of 100 snowflakes, when the average number of observed snowflakes (λ) in the timeframe is 100. And about 2.5% chance of observing 90 snowflakes (k=90) in a similar case. And so on, according to the distribution visualized above.

Let’s look at the Poisson distribution requirements here:

  1. Snowflakes are discrete events. There cannot be half a snowflake in this story. Either the snowflake falls or it doesn’t.
  2. The snowflakes can be considered independent, as one would not expect one to affect another.
  3. Multiple snowflakes are often falling at the same time, but do they fall to the ground at the exact same moment?
  4. The average number of snowflakes over the hours is very unlikely to be constant for a very long time.

I considered a few times whether to use snowflakes as an example, since this list actually goes from the first item obviously being true, to the next one slightly debatable, the third more so, and the final one quite obviously not going to hold always. But everyone likes to talk about the weather, so why not.

So if we take the first argument above as true, and the following ones at different degrees of debatable, lets look at the points 2-4 from the above list.

2: Are falling snowflakes independent? I am not an in-depth expert on the physics of snowflakes or their falling patterns. However, 100 snowflakes on an area in an hour is not much, and I believe the independence of the snowflakes is a safe assumption here. But in different cases, it is always good to consider this more deeply as well (e.g., people tending to cluster in groups).

3: How likely is it that two snowflakes hit the ground at the exact same moment? Depends. If we reduce the time frame of what you consider “the same time” enough, there are very few real-world events that would occur at the exact same moment. And if this was a problem, there would be practically very few, if any, cases where the Poisson distribution would apply. There is almost always the theoretical possibility of simultaneous events, even if incredibly small. But it is very unlikely here. So I would classify this as not a problem to make this assumption. I believe this kind of relaxed assumption is quite common to make (e.g., we could say in the reliability example, there is an incredibly small chance of simultaneous failure).

4: The average number of falling snowflakes being consistent is of course not true over a longer period of time. It is not always snowing, and sometimes there is a bigger storm, other times just a minor dribble. To address this, we can consider the problem differently, as the number of snowflakes falling on a specific area in a smaller interval, such as 1 minute. The number of snowflakes should be more constant at least for the duration of several minutes (or as long as the snowstorm maintains a reasonably steady state). Maybe it would be more meaningful to discuss the distributions of different levels of storms in different stages?

The fourth point, the change over time, highlights an issue that is relevant to many example applications of the Poisson distribution. The rate of something is not always constant, and thus the time factor may need special consideration. I even found a term for it when I was looking for information on this on the internet: time-varying poisson distribution. And as highlighted by the third point above, some other considerations may also need to be slightly relaxed, while the result can still be useful, even if not perfect.

Snowflakes vs Support Calls

The above snow example was a largely made up problem, but weather is always a nice topic to discuss. Perhaps a more realistic, but very similar, case would be to predict the number of calls you could get in a call center. Just replace snowflakes with calls into the call center.

As with the snow example, with calls to a call center, over different time periods (depending also on time-zones), the average might vary, producing different distributions. Your distributions over days might also vary. For example, weekends vs holidays vs business days. But if you had, for example, an average of 100 calls during the business hours on business days, your Poisson distribution would look the exact same as the average 100 snowflakes in the snowflake example.

You could then use this distribution for things like choosing how many people you should hire into the call center to maintain a minimum service level to match your service level agreement. For example, by finding a spot where the cumulative probability of receiving that many calls in an hour is less than 90%, and using it as evidence of being able to provide minimum of 90% service level at all times.

In such a call center example, you would have other considerations compared to the weather scenario. Including the timezones you serve, business hours, business days, special events, the probability of multiple simultaneous calls, and so on. However, if you think about it for a moment, at the conceptual level these are exactly the same considerations as for the weather example.

So far, my examples have only described events over timeframes. As I noted in the beginning, sometimes the Poisson distribution can also be used to model number of instances in volumes, or areas, of space.

Example: Poisson Distribution Over an Area – Trees in a Forest, WW2 Bombs in London

Besides events over time, the Poisson distribution can also be used to estimate other types of data. One other example is typically distributions of instances (events) over an area. Imagine the following box to describe a forest, where every dot is a tree:

Imaginary forest, dots representing trees.

A Poisson distribution can similarly represent occurrences of such instances in space as it can represent occurrence of events in time. Imagine now splitting this forest area into a grid of smaller boxes (see my Kaggle kernel for the image generation. This image is from run 31):

The forest, divided into smaller squares.

In the above figure, each tree in this is now inside a single smaller box, although the rendering makes it look like the dots on the border of two boxes might belong to two boxes. Now, we can calculate how many trees are in each smaller square:

Calculating the number of trees inside the smaller boxes.

This calculation for the image above, gives the following distribution:

Number of trees in smaller box in the imaginary forest.

From the calculation, the number of smaller boxes with 0 trees (or dots) is 316, or 79%, of the total 400. Further 17% of the smaller boxes in the grid contain 1 tree (dot), and 4% contain 2 trees (dots).

Now, we can calculate the Poisson probability distribution for the same data. We have 400 smaller boxes, and 100 trees. This makes the average number of trees in a box 100/400=0.25. This is the average number of instances in the area (similar to timeframe in earlier examples), or λ, for the Poisson formula. The number of events k in this case is the number of instances (of trees). Putting these values into the Poisson formula (k=tree_count, λ=0.25, we get:

Number of trees estimated by the Poisson distribution.

Comparing this Poisson distribution to the actual distribution calculated from the tree grid image above, it is nearly identical. Due to random noise, there is always some deviation. The more we would increase the number of “trees” in the actual example (randomly generate more), and the number of mini-boxes (smaller area), the closer the actual observations should come to the theoretical numbers calculated by the Poisson formula. This is how randomness generally seems to work, the bigger sets tend to converge to their theoretical target better. But even with the values here it is quite close.

Poisson Distribution and the London Bombings in WW2

Besides the forest and the trees, one could use this type of analysis on whatever data is available, when it makes sense. As I read about Poisson distribution, its application to the London bombings in the second world-war (WW2) often comes up. There is a good summary about this on Stack Exchange. And if you read it, you will note that this is exactly the type of analysis that my forest example above showed.

In this story, the Germans were bombing the British in London, and the British suspected they were targeting some specific locations (bases, ports, etc). To figure if the Germans were really targeting specific targets with insider knowledge, the British divided London into squares and calculated the number of bomb hits in each square. They then calculated the probability of each number of hits in each square, and compared the results to the Poisson distribution. As the result closely matched the Poisson distribution, the British then concluded that the bombings were, in fact, random, and not based on inside-information. Again, this is what would happen if we just replace “trees” with “bombs” and “forest” with “London” in my forest example above.

Of course, a cunning adversary could account for this by targeting a few high-interest targets, and hiding them in a large number of randomly dropped bombs. But the idea is there, on how the Poisson distribution could be used for this type of analysis as well. Something to note, of course, is that the Stack Exchange post also discusses this as actually being a more suitable problem for a Binomial distribution. Which is closely related to the Poisson distribution (often quite indistinguishable). Let’s see about this in more detail.

Poisson vs Binomial Distribution

When I read about what is the Poisson distribution and where does it derive from, I often run into the Binomial distribution. The point being, that the binomial distribution comes from the more intuitive notion of single events, and their probabilities. Then some smart people like mr. Siméon Denis Poisson used this to derive the generalized formula of the Poisson distribution. And now, 200 years later, I can just plug in the numbers. Nice. But I digress. The Binomial distribution formula:

The Binomial distribution formula.

The above figure shows the general Binomial distribution formula. It has only a few variables:

  • n: number of trials
  • k: number of positive outcomes (events we calculating probability for)
  • p: probability of the positive outcome

We can further split the Binomial formula into different parts:

Different parts of the Binomial formula.

I found a good description of the binomial formula, and these parts, on the Math is Fun website. To briefly summarize, the three parts in the above figure correspond to:

  • part1: number of possible trial output combinations that can produce the wanted result
  • part2: probability for positive outcomes with expected k positive outcomes
  • part3: probability for negative outcomes with expected k positive outcomes

To summarize, it uses the number of possible trial combinations, the probability of a positive outcome, and the probability of a negative outcome to calculate the probability of k positive outcomes. As before, calculating this formula for the different numbers of k, we can get the Binomial distribution.

What does any of this have to do with the Poisson distribution? As an example, let’s look at the reliability example from before as a Binomial Distribution instead. We have the following numbers for the reliability example:

  • average number of failures (λ, in one year): 1 in 100
  • number of events investigated (k): 0-7

We can convert these into the values needed for the Binomial Distribution:

  • n: number of trials = 100. Because we have an average over 100 trials, and want to know the result for 100 storage nodes.
  • p: probability of failure = 1/100 = 0.01. Because we have a 1 in 100 chance of failure.
  • k: 0-7 as in the Poisson example.

Comparing the results, we get the following table:

Poisson vs Binomial

In this table, poisson refers to the values calculated with the Poisson formula. binomial100 refers to the values calculated with Binomial formula using parameters n=100 and p=0.01. Similarly, binomial1000 uses n=1000 and p=0.001. As you can see, these are all very close.

It is generally said, that when p nears 0 and n nears infinity for the Binomial Distribution, you will approximate the Poisson Distribution. Or the other way around. In any case, the idea being that as your interval of sub-trials gets smaller in Binomial, you get closer to the theoretical value calculated by the Poisson Distribution. This is visible in the table as binomial100 is already close to poisson, but binomial1000 is still noticeably closer with a larger n and smaller p.

You can further experiment with this yourself, by increasing n, and decreasing p in this example, and using a Binomial Distribution calculator. For example, where p = 0.01, 0.001, 0.0001, while n = 100, 1000, 10000, and so on.

The difference between the Binomial and Poisson distribution is a bit elusive. As usual, there is a very nice Stack Exchange question on the topic. I would summarize it as using Poisson when we know the average rate over a timeframe, and the timeframe (or area/volume) can be split into smaller and smaller pieces. Thus the domain of the events is continuous and bordering on infinite trials (as you make the interval smaller and smaller). The Binomial being a good fit if you have a specific, discrete, number of known events and a probability. Like a thousand coin tosses. I recommend the Stack Exchange post for more insights.

Other Examples of Poisson Distribution

As I was looking into the Poisson distribution I found many examples trying to illustrate its use. Let’s look at a few more briefly.

Number of chocolate chips in a cookie dough. I though this was an interesting one. Instead of events over time, or instances in an area, we are talking about instances in a volume of mass. Kind of like 3-dimensional area. I guess if we looked at time (and distance) as 1-dimensional, and area as 2-dimensional, this would be the continuation to 3-dimensional. The applicability I think would mainly depend on how well the chips are distributed in the dough, and whether they could be considered independent (not clustering) and on average constant. But I find the application of the Poisson distribution to volumes of mass is an interesting perspective.

Estimating traffic counts. How many cars pass a spot on a highway could be suitable for a Poisson distribution. I think there can be some dependencies between cars, since people tend to drive in clusters for various reasons. And this behaviour likely changes over specific time periods (of days, weeks, etc similar to the call center example).

Estimating bus arrivals. This one does not seem very meaningful, since the buses are on a schedule. Maybe the average rate would be quite constant over some time periods, but not very independent. For estimating how much they will deviate from their schedule might work a bit better, although weather, traffic, and other aspects might have a big impact.

People arriving in a location. This is a bit tricky. I think people often tend to travel in groups, and if they arrive to some events, such as school classes, workdays, meetings, football games, concerts, or anything else like that, they tend to cluster a lot. So I do not think this would work very well for a Poisson distribution. But maybe in some cases.

Number of fatal airline crashes in a year quite a classic example. If we consider back to the past few years (today) and, for example, the issues Boeing had with their MCAS system, this would certainly not fit. But then this (MCAS issue) would show up as an anomaly, similar to what the British were looking for in the London Bombing example. And this would be correct, and a useful find it in itself, if otherwise not observed. Other than these types of clustered events due to a specific cause, I believe airplane crashes makes a reasonable example of applying the Poisson distribution.

A bit like people, animal behaviour could also be of interest. One example I saw online (sorry, could not find the link anymore) was about cows being distributed across a field. And how they tend to group (or cluster), meaning the distribution would not be truly independent, and likely not very good for a Poisson distribution. This made me think about the minimum distance between events / instances as a parameter. For example, repeating my forest example, but setting a minimum distance from one tree to the next. The longer this minimum distance would be, the further the actual distribution should differ from the predicted Poisson distribution. This type of “personal space” seems likely in many real-world scenarios.

Most of these extra examples I listed here actually seem to have some properties similar to my snowflakes / call center example. You might not have the perfect case for the Poisson distribution, but if you figure out your rules and limitations, and find a suitable angle, you might still benefit from it.


I presented a bit deeper look at three cases of applying the Poisson distribution in this article. Hopefully they are helpful in understanding about its behaviour and potential use. I find the reliability example quite concise for the most “basic” (or standard-like) application of the Poisson distribution. The call center / snowflake example gives it a bit more complex real-wold context, and finally the trees in a forest / London bombing example illustrates the expansion of the application into the two-dimensional space from the time-domain. The additional cases I briefly discussed further highlighted a few interesting points, such as the expansion into 3-dimensional volumes, and finding different angles where the Poisson might be useful, even if not directly matching all its criteria.

Mostly my day to day job, or generally daily tasks don’t really involve many uses for the Poisson (or Binomial) distribution. However, I believe the distribution is very useful to understand and keep in mind when the situation arises. And when it does, I find some good pointers are useful to remember. Some points I find useful:

  • The Poisson distribution is always the same if we have the same parameters λ and k, regardless of the timeframe or area size. Just have to be able to scale the idea.
  • The probability and rate of events generally seems to be scalable across events, so if I have a rate of 1 in 100, I can also use a rate of 0.1 in 10, or 10 in 1000.
  • The Binomial distribution is a good alternative to keep in mind. I find it a good mental note to consider the Binomial if I have a given probability, and a specific (limited) number of events.
  • The Poisson on the other hand is a clearer fit if I have the average rate, and potentially continuous number of events of a timeframe

In the end, I think the most important thing is to remember what these distributions are good for. For Poisson, this would be estimating the probability of a discrete number of events, given their average rate in a timeframe/area/volume, and when the timeframe is continuous (e.g., you can divide an hour into smaller and smaller time units for more trials) but the events are discrete (no partial events). Binomial on the other hand if you have the probability for an event, and the number of trials is discrete (specific number of trials). Keeping the basic applications in mind, you can then look up the details as needed, and figure out the best way to apply them.

That’s all for today. If you think I missed something, or anything to improve, let me know 🙂

Understanding Golang Project Structure

Go is an interesting programming language, with a nice focus on keeping it minimal. In that sense, the classic Go workspace setup seems like a bit of an anomaly, requiring seemingly complex project structure. It sort of makes sense, once you understand it. But first you need to understand it. That is what this post aims at.

I start with defining the elements of a classic Go workspace setup. In recent Golang versions, the Go Modules are a new way to set up a workspace, which I find can be used to simplify it a bit. However, even with the modules, the classic structure applies. Although it is not strictly necessary to use the classic structure, it is useful to understand.

Go Packages

The base of Go project structure is the concept of a Go package. Coming from years of Java indoctrination, with Python sprinkled on top, I found the Go package nuances quite confusing.

Package Path vs Name

A Go package reference consists of two parts. A package path and a package name. The package path is used to import the code, the package name to refer to it. The package name is typically the last part of the import but that is not a strict requirement. Example:

package main

import ""

func main(){

The above code declares code that has a package name main. Lets say this code is in a file in directory Go only allows files from a single package in a single directory, and with this definition, all Go files in the directory would need to define themselves as package main. The package name main is a special in Go, as it is used as the binary program entry point.

The above code also imports a package in the path This is the package path being imported. In the classic Go project structure, the package path practically refers to a directory structure. The typical package name for this import would be hellopkg, matching the last part of the package path used for the import.

For example, consider that the path contains a file named somefile.go. Full path It contains the following code:

package hellopkg

import "fmt"

func Hello()  {
    fmt.Println("hello from hellopkg.Hello")

The code from this package is referenced in this case (in the main.go file above) as hellopkg.Hello(). Or more generally as packagename.functionname(). In this example it is quite understandable, since the package name matches the final elements of the package path (hellopkg). However, it is possible to make this much more confusing. Consider if everything else was as before, but the code in somefile.go were the following:

package anotherpkg

import "fmt"

func Hello()  {
    fmt.Println("hello from anotherpkg.Hello")

To run code from this package, now named anotherpkg, we would instead need to have the main.go contain the following:

package main

import ""

func main(){

This is how you write yourself some job security. There is no way to know where the anotherpkg comes from in the above code. Which is why the strong recommendation for keeping that last package path element the same as the package name. Of course, the standard main package needed to run a Go program is an immediate anomaly, but lets not go there.

Finally, you can make the package name explicit when importing, regardless of what the package name is defined inside the file:

package main

import hlo ""

func main(){

In the above code, hlo is the alias given to the package imported from the path After this aliased import, it is possible to refer to the code imported from this path as hlo.Hello() regardless of whether the package name given inside the files in the path is hellopkg, anotherpkg, or anything else. This is similar to how you might write import pandas as pd in Python.


GOROOT refers to the directory path where the Go compiler and related tools are installed. It is a bit like JAVA_HOME for Java Virtual Machines.

I generally prefer to set this up myself, even though each system likely comes with some default. For example, last time I was setting up Go on a Linux machine, the Go toolkit was installable with the Snap package manager. However, Snap warns about all the system changes the Go installer would do. Something about Snap classic mode, and some other scary sounding stuff the install might do. To avoid this, I just downloaded the official Go package, extracted it, and linked it to the path.

Extraction, and symbolic linking (on Ubuntu 20.04):

cd ~
mkdir exampledir
cd exampledir
mkdir golang_1_15
cd golang_1_15
#assume the following go install package was downloaded from the official site already:
tar -xzvf ~/Downloads/go1.15.3.linux-amd64.tar.gz
cd ..
ln -s golang_1_15/go go

The above shell script would extract the Go binaries into the directory ~/exampledir/golang_1_15/go (the last go part comes from the archive). It also creates a symbolic link from ~/exampledir/go to ~/exampledir/golang_1_15/go. This is just so I can point the GOROOT to ~/exampledir/go, and just change the symbolic link if I want to try a different Go version and/or upgrade the Go binaries.

The final step is to point GOROOT to the symbolic link, by adding the following to ~/.profile file, or whatever startup script your shell uses:

export GOROOT=/home/myusername/exampledir/go
export PATH=$PATH:$GOROOT/bin

After this, the go binaries are available in the shell, the libraries are found by the Go toolchain, and my Goland IDE found the Go toolchain on the path as well. Brilliant.


While GOROOT specifies the toolkit path, GOPATH specifies the Go workspace directory. GOPATH defaults to ~/go, and it is not strictly required to define it. When using the new Go modules, it is also easier to skip this definition, as the package path can be defined in the module definition. More on that later.

The part I find confusing for GOPATH is that the workspace directory it defines is not for a specific project, but intended for all Go projects. The actual project files are then to be put in a specific location in the GOPATH. It is possible to set up multiple workspaces, but this is generally not adviced. Rather it is suggested to use one, and put the different projects under specific subdirectories. Which seems all the same, since even if you use multiple workspaces, you still have to put your projects in the same subdirectories. I will illustrate why I find it confusing with an example.

The workspace defined by GOPATH includes all the source code files, compiled binaries, etc. for all the projects in the workspace. The directory structure is generally described as:


These directions are:

  • bin: compiled and executable binaries go here
  • pkg: precompiled binary components, used to build the actual binaries for bin. Also as some sort of intermediate cache by the go get tool, and likely other similar purposes. Generally I just ignore it since it is mainly for Go tools.
  • src: source code goes here

The above directory layout does not seem too confusing as such. However, if I look at what it means if I checkout code from multiple projects on github (or anywhere else) into the same workspace, the result is something like this:


And in the above is my confusion. Instead of checking out my projects into the root of the workspace, I first need to create their package path directories under the workspace. Then clone the git repo in there (or copy project files if no git). In the above example, this workspace would have two projects under it, with the following package paths:


To set this up, I would need to go to github, check the projects, figure out their package paths, create these paths as a folder structure under the workspace, and then, finally clone the project under that directory. After this, Go will find it. There is algo the go get tool to download and install specific packages from github (and likely elsewhere) into the correct place on the GOPATH. However, this does not clone the git repo for me, nor explain to me how my own code and repository should be placed there along with other projects. For that, I need to write posts like this so I get around to figuring it out 🙂

This workspace structure is especially confusing for me, since it seems to force all Go projects on Github to hold their main.go at the project root level along with everything else you put in your repo, including your documentation and whatever else. I find many Go projects also end up hosting pretty much all their code at the root level of the repo. This easily makes the repository a complete mess when I try to look at the code and it is just one big mess in the top directory.

Again, this is what I find to be really confusing about it all. With Go modules it is at least a bit more clear. But still, there is much legacy Go out there, and one has to be able to use and understand those as needed. And even for using Go modules I find I am much better off if I understand this general Go workspace setup and structure.

Go Modules

Go modules are really simple to set up. You just create a go.mod file in the root of your project directory. This file is also very simple in itself. Here is a basic example:


go 1.15

Or if that is too much to type, the go mod init command can do it for you:

go mod init

That creates the go.mod file for you, all the 3 lines in it..

The above go.mod example defines the module path on the first line. The code itself can be in any directory, it no longer matters if it is on the GOPATH when using the Go modules. The line go 1.15 defines the language version used for the project.

The official documentation still recommends using the whole GOPATH workspace definition as before. However, even with GOPATH undefined everything works if go.mod is there:

$ go build -v -o hellobin
$ ./hellobin
hellopkg from hellopkg.Hello

In the above, I am specifying the output file with the -o option. If GOPATH is not set, it will default to ~/go. Thus if I have the above project with the go.mod definition, and run the standard Go binary build command go install on it, it will generate the output binary file in ~/go/bin/helloexample. The -o option just defines a different output path in this case.

Seems a bit overly complicated, when I am just used to slapping my projects into some random workdir as I please. But I guess having some standardized layout has its benefits. Just took me a moment to figure it out. Hope this helps someone. It certainly helped me look all of this up properly by writing this post.


This post describes the general layout of the classic Go project. While I would use the new module structure for projects whenever possible, I sometimes download projects from Github to play with, and I am sure many corporations have various legacy policies that require this classic architecture. There are a lot more information and nuances, but I encourage everyone to look it up themselves when they come to that bridge.

The package path vs package name system is something that really confused me badly coming from other programming languages. It is not too bad once you understand it. But generally most things get easy once you master them, eh. Achieving the mastery is the hard part. It is just the difficulty in achieving the understanding. I cannot say if the Go project structure is confusing, or if I am just loaded with too much legacy from other environments (Java, Python, ..).

There is much good to Go in my opinion, and the modules system helps fix many of the general issues already. I have written a few small projects in Go, and look forward to trying some more. Sometimes the aim for simplicity can require some bloat in Go, not sure if the project structure qualifies in that category. In any case, good luck working with Go, even if you don’t need it.. I do 🙂