
Algorithm Test Engineering Part 2: Machine Learning

If we don’t exactly know what we are testing, how can we test?

Let’s probe the machine to learn what it has learned. Photo by Zoltan Tasi on Unsplash

In my previous algorithm test engineering article, I discussed the testing and analysis of more classical algorithms, such as binary search. While the overall testing of algorithms can be complicated, most classical algorithms can be described as having quite clearly defined inputs, outputs, and their relations.

In this follow-up article I discuss testing and analysis in relation to a different type of algorithm, where the input-output relations are more complex, harder to define, and the results sometimes subjective. Typically these are based on machine learning (ML). For example, the same set of Internet search results can be good for one person at one time, but less so for another person, or at a different time. The same might apply to other properties, such as an e-nose trying to distinguish wines vs whiskeys based on the same sensor inputs, or personal stress levels based on biometrics. The context, and sometimes the personal "feeling", can make the difference.

In this article, I explore the idea of what it means to test these types of systems. I start with testing as part of the ML model training process, extending into post-training testing. To make it all a bit more concrete, I look at a few industry examples, and reflect on my experiences on a search-engine project I built a few years back using customized Natural Language Processing (NLP) algorithms.

The Train, Validate, and Test Process

The basic test process in relation to machine learning is the testing and validation during the ML model training process. The following figure illustrates this:

ML train and test loop. Image by author.

This uses three datasets:

  • Training set: The ML model is trained on this, the algorithm tuning the model to better generalize over all the items in this set.
  • Validation set: The trained model's performance is evaluated on this separate dataset. Training is repeated with the training set as long as the validation score improves, or until some other end criterion is met.
  • Test set: a final, separate dataset used to perform a final test of the results, independent of the whole training loop and its validation.
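As a sketch, such a three-way split might be produced like this (the 80/10/10 ratios and function name are invented for illustration; in practice library helpers such as scikit-learn's `train_test_split` are often used):

```python
import random

def train_val_test_split(items, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle a dataset and split it into train/validation/test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_test = int(n * test_ratio)
    n_val = int(n * val_ratio)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

The key point is that the three sets are disjoint: the test set never influences training or the stop criteria, so it gives an independent final check.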

The above process aims to build a model that generalizes as well as possible in relation to given criteria (e.g., accuracy or prediction error) over the training data. Typically this optimizes the model over the given dataset as a whole, but does not reveal which specific data items it still performs poorly on.

Following this, post-training testing refers to testing the fully trained model after the above process has finished. Let's look at this type of testing in a bit more detail.

Post-Training Testing

I ran into this term while writing this article, in an article by Jeremy Jordan, and found it very fitting. He described it as investigating the logic behind the "final" algorithm (trained model). I like the term investigating, as I think of this as exploring the model behaviour with different inputs and input transformations. This is very much like the metamorphic testing (MMT) approach I wrote about earlier, with more focus on the exploration part. As a reminder, MMT makes modifications to inputs and observes the effect on outputs. The modifications are called metamorphic transformations.

The Jordan article splits post-training testing into three types:

  • Invariance tests: Testing the ML algorithm with similar inputs that one would expect to give a similar output. The example given is two different (red) apples. Think metamorphic transformations.
  • Directional tests: Testing small input modifications in a given direction to see if the algorithm output reacts in an expected way. For example, does the house price go up with increase in house size? Another form of metamorphic transformation.
  • Minimum functionality tests: Testing the ML algorithm with specific instances of inputs, where we know the result we expect. For example, does the same apple always give the same result (e.g., classification percentages).
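As a sketch of how these three test types might look in code, here is a toy example against a hypothetical house-price model (the model and all its numbers are invented for illustration, not from any real system):

```python
def price_model(size_m2, bathrooms):
    """Toy stand-in for a trained price-prediction model (hypothetical)."""
    return 2000 * size_m2 + 5000 * bathrooms

# Invariance test: near-identical inputs should give near-identical outputs.
assert abs(price_model(80.0, 2) - price_model(80.1, 2)) < 1000

# Directional test: increasing house size should not lower the price.
assert price_model(90, 2) >= price_model(80, 2)

# Minimum functionality test: a specific known input gives the expected result.
assert price_model(100, 2) == 210000
```

With a real trained model the same structure applies, but the tolerances (here 1000) and expected values have to be chosen from domain knowledge rather than exact arithmetic.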

I discussed many similar aspects related to testing traditional algorithms in my earlier article, and I think the above test types can be applied to testing in general. However, in testing ML algorithms, I find exploration has a bigger role due to the model (algorithm) being more of a black box, with complex internals such as a large number of weights and connections.

I believe how well the above test types can be applied depends on the data, model, and domain at hand. For example, when classifying images of apples, I find it quite intuitive to define variants of an apple, or how to transform such images. Other times it gets more complicated. Let’s look at some examples.

Examples of Evaluation Complexity

For example, in my metamorphic testing article I discussed the use case of self-driving cars. Now imagine that the ML algorithm feeds steering angles to a driving system. If this angle drifts by fractions of a percentage over time (with different model versions) for the same scenarios, when is it a relevant change in output? Or if it uses multiple camera inputs from the real world, and we consider all the complexities of the living environment as inputs, which change in input is relevant? The data space can be huge. In time-series data (such as driving), the data relations and changes over time also need to be considered.

The Jordan article gives another interesting example of defining invariants and using them for investigation/testing: real-estate price estimation. Specifically, expecting a higher bathroom count not to lower the house price, or a decreased house size not to increase the price. These are in a way inverse invariants, describing the opposite of what one might expect. Maybe I expect the price to stay the same or rise, but not to drop. An interesting angle to look at the data and expectations.

In Jordan’s example, they noticed their dataset for smaller apartments was mostly from a big city where the apartments were more expensive. This had biased the model with regards to the smaller apartment sizes in general. I find this an interesting example of findings that would be easy to miss without in-depth exploration (or investigation) of the post-training results.

Operational vs Data Domain

Thinking of the metamorphic relations at a high level, one might consider what kind of changes in the operational domain (e.g., weather, driving environment, angles, for cars) one might observe. These are high-level changes describing human understandable concepts. Additionally, there are changes in the input data domain itself (e.g., sensor corruptions, interference, signal disturbance, environmental impacts).

These two domains are nicely described in an article by Daniel Angilov. Although it does not use metamorphic testing terms, it nicely summarizes how one might build input transformations for testing ML at a higher level (the operational level), and how they map to the lower-level representation (the data level). Naturally, the data domain is much larger in scope, as it includes all possible input values; for the operational domain, many of these may not be relevant (e.g., images of random data). Illustration:

Operational domain vs Data Domain, adapted from here.

The Operational Domain deals with more intuitive, high-level, human-understandable concepts and transformations. For example, in an autonomous-car scenario: foggy scenes, camera angles, or rainy vs sunny weather. These are mapped to the data domain, where the actual data transformations happen. We can apply operational-domain transformations to produce high-level modifications for identified test scenarios, and lower-level data transformations (e.g., adding noise) for lower-level changes.

Here is an example from my metamorphic testing article that I believe illustrates a (domain) transformation in the operational domain:

Operational domain transformation example. Image by author.

In this case, the camera angle of the imaginary car has been turned, as if the car itself was tilted. Well, the other car in front has also moved further away :). I took these pictures while walking across the street for my metamorphic testing paper, which explains it. A more data domain oriented transformation could be, for example, to add static noise to the image.
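A data-domain transformation such as added noise might be sketched like this (a toy grayscale "image" as a flat pixel list; real pipelines would operate on image arrays, and the noise level here is an arbitrary illustration):

```python
import random

def add_noise(pixels, noise_level=10, seed=1):
    """Add bounded random noise to grayscale pixel values, clamped to 0-255."""
    rng = random.Random(seed)
    return [min(255, max(0, p + rng.randint(-noise_level, noise_level)))
            for p in pixels]

image = [0, 64, 128, 192, 255]  # a tiny "image" as a flat pixel list
noisy = add_noise(image)
# Each pixel moved by at most the noise level, and stays in valid range.
assert all(abs(a - b) <= 10 for a, b in zip(image, noisy))
assert all(0 <= p <= 255 for p in noisy)
```

A metamorphic test would then feed both `image` and `noisy` to the model and check that the output (e.g., steering angle) changes less than some tolerance.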

It is unlikely that we could cover all possible values in the data domain, or figure out a generic test oracle for all of them. Similarly, for the operational domain, building exhaustive test sets for all possible input combinations is not feasible. Instead, we need to figure out the most relevant ones for both domains, pick a suitable test set, and explore its changes and evolution over time. Unfortunately, I do not have a universal recipe for this (at this time).

Testing in the Overall ML Process

Considering the overall ML train, test, and operations process, perhaps the most succinct yet detailed paper on ML testing I have seen is the Google paper from the 2016 Reliable Machine Learning in the Wild workshop. Since the entire paper is just a list of test types with very concise descriptions, I list the ones I found most interesting. These all relate to continuous testing:

  • Feature assumptions. Such as value ranges and most common values, as these may drift over time. Check assumptions stay valid.
  • Feature relations to target variable and each other. To understand your data better, and check the relations hold.
  • Computational cost vs benefit per each feature. Is it worth including all features for their computational cost? Does it evolve?
  • Leak of unwanted features into the model due to copy-paste type errors. Constant monitoring should alert if unwanted features are used.
  • Change of model evaluation score over time if consistently re-trained. For example, the effect of daily training with new data.
  • Model bias with regards to specific types of data: does some new data or data type introduce new biases that should be considered?
  • Reproducibility of training, how big is the drift across multiple trainings on the same data.
  • Constantly monitoring how invariants seen in training data should hold for operational data over time.
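For example, the feature-assumption checks from the list above might be sketched as a simple range monitor (the thresholds, data, and function name here are invented for illustration):

```python
def check_feature_assumptions(values, expected_min, expected_max,
                              max_violation_ratio=0.01):
    """Flag a feature if too many observed values fall outside the
    range assumed at training time (possible data drift)."""
    violations = sum(1 for v in values if not expected_min <= v <= expected_max)
    return violations / len(values) <= max_violation_ratio

# A feature within its assumed range passes the check.
ages = [23, 35, 41, 29, 67, 52]
assert check_feature_assumptions(ages, 0, 120)

# Half the values out of range: the monitor should raise an alert.
assert not check_feature_assumptions([23, 35, -5, 250], 0, 120)
```

Run continuously over production inputs, such a check catches the kind of silent drift the Google paper warns about, before it degrades the model in ways accuracy metrics alone would not explain.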

The overarching theme I see is to constantly monitor and observe change and its impact.

This is just a short list of ones I found most related to this article. There are many more related to the overall process from ML code review to operational monitoring. I recommend reading the Google paper for the very insightful list.

A Few Real-World Examples

Lots of words are nice, but real-world examples help make them concrete, especially real ones from the industry. Recently I listened to some ACM ByteCast episodes, a few of which provided interesting industry insights.


One of these discussed research on recommendation algorithms at Spotify: their recommendation systems and everything around them. The term used in this talk was evaluation of ML results.

The discussion differentiated evaluation into two forms: offline and online. Offline is the traditional ML evaluation approach of using a train/test split and metrics such as accuracy, precision, and recall. Online refers to evaluating the algorithm's performance from user actions, such as clicks on a webpage, or in Spotify's case, I would guess, choices on music playlists. They discuss the concept of a proxy of user engagement: whether some specific interaction in an app is a good measure of an achieved goal, or whether a returning user is an indication of satisfaction and success. An interesting viewpoint on defining an ML test oracle, and on constantly investigating the ML algorithm's behaviour.
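The offline metrics mentioned above can be computed directly from labels and predictions; a minimal sketch for binary classification (toy data for illustration):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    return precision, recall

p, r = precision_recall([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
print(p, r)
```

The online form has no such closed formula: it replaces `y_true` with proxies of user engagement, which is exactly why choosing a good proxy matters so much.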

Another aspect discussed in the Spotify episode was enabling users to discover new content. Instead of relying only on algorithm recommendations based on the current user profile, it can be useful to help people discover something new. In Spotify's case, about 10% of the options presented to the user were selected as these "new" types of choices. This is a good example of considering the overall user experience and goals when testing / investigating ML algorithms as part of a system. It ensures the user is not locked into an algorithmic sandbox, supporting them in finding new ideas (music, search results, ..) to explore.


Another ByteCast episode I found interesting regarding this topic was the one on DuoLingo. DuoLingo is an app and service designed to help people learn languages. They use machine learning in many parts of their system: creating tasks for students tailored to their learning history, customizing content based on how well the user does on certain words and language structures and on what has worked for similar users, and in many other ways I am sure I miss here. To summarize: a successful application that applies ML heavily.

DuoLingo generates tailored quizzes for users, based on their learned profiles, together with expert teams. In this way it gamifies the evaluation of the algorithms: users provide feedback by answering the tailored quizzes, combined with approaches such as A/B testing. I find this a very interesting approach to "online" algorithm testing and evolution, enticing the user, as part of using the service, to work on tasks that help the algorithm learn to serve them better.

An Example of My Own: ValtuustoPilvi

A few years back I built a search engine called Valtuustopilvi (Finnish for CouncilCloud), as part of an open-data-based service competition hosted by the city of Oulu in Finland. It ranked first in this (small) competition, and so I ended up hosting it as a service for the city for the following two years. As a result, I became familiar with various NLP algorithms and built my first service using them, gaining my first experiences in testing ML and NLP.

Service Overview

ValtuustoPilvi was designed for people to interactively search the city council meeting documents. I used various existing Natural Language Processing (NLP) libraries for the search algorithms, built some custom algorithms of my own, and adapted some well known and widely used ones.

Since it was just me working on this project, I did the development, testing, analysis, and everything else. Not sure if that is good or bad for this case, but that's how it was. I packaged the code up a few years back, so there are no concrete code measurements or executions this time, but rather examples and reflection on how I tested, analyzed, and designed it. Hopefully it makes for something concrete and interesting.

To concretize the service a bit, here is a screenshot of the search UI:

Valtuustopilvi interactive search UI. Image by author.

The user could interact with the word-cloud by selecting words in it to refine their search, or by typing query terms in the box shown at the bottom of above figure. The system built (in real-time) a new word-cloud matching the search results for the modified search query, and the user could iteratively continue to refine their search using the continuously updated UI / word-cloud.

I designed several algorithms to build the word-cloud at different hierarchy levels. The overall document set was not changing very rapidly, so for that I used a pre-computed data model, updated every night. On first arrival at the main search page for all the documents, or for a specific pre-defined sub-category of council meetings (e.g., building permits), this type of data model was applied. It was based on a topic model. Based on this topic model, a set of words was chosen, each one weighted by how high the topic model ranked them in the topics it discovered. This is visible as the word size in the word-cloud above. More detail on this shortly. First, a few words on pre-processing.

Computational Complexity Example: Preprocessing with Voikko

In my previous algorithm testing article, I discussed evaluation of algorithm computational complexity. In the case of ValtuustoPilvi, a related example is from using a third-party library, and how its computational complexity became important as my use was different from its previous use cases.

A very common task in ML, and in NLP especially, is to preprocess the data fed to the algorithms. In the case of the word-cloud building algorithms, one of the basic needs is to unify different forms of a word into a single representation. In NLP this is called lemmatization (i.e., base-form conversion, e.g., cars -> car).

For this, I used the Voikko Finnish NLP library. Voikko had earlier been used primarily for shorter pieces of text, such as sentences and words in spell checking. However, I used Voikko to process whole documents, the longest of which were hundreds of pages long. With such larger inputs, the processing time increased exponentially. After I reported the issue, it was fixed in version 4.0.1 (visible in the Voikko release notes).

In relation to the topic of this and my previous article, this illustrates how different requirements for an algorithm can be relevant in different scenarios, and how they may evolve over time. The Voikko web page now lists the library as commonly used in machine learning pipelines for Finnish texts, illustrating its general use-case drift in this direction.

Continuous Data Analysis: Testing Preprocessing

Lemmatization tools such as Voikko are great for processing well-written and structured text. However, real-world text often contains misspellings, domain-specific words, abbreviations, and similar anomalies. Tools such as Voikko will fail to process these, as they are not in the standard dictionary.

Reading all the documents and manually fixing the text is not feasible for any reasonably sized input set. Instead, I collected all the words/tokens that Voikko did not recognize. By ranking these unrecognized tokens by frequency, I could easily find words that occurred often but were not recognized, create custom rules to lemmatize them as needed, and thus continuously monitor, test, and improve the preprocessing, even as the set of documents and the topics they covered evolved over time. This process ran each time the document set and topic models were updated (nightly). I checked the results about once a week, updating the lemmatization rules when new, frequent misspellings or new words appeared.
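The unrecognized-token collection can be sketched roughly as follows (the lemmatizer here is a hypothetical stand-in that returns None for unknown tokens; the real service used Voikko's analysis output):

```python
from collections import Counter

# Hypothetical stand-in for a lemmatizer such as Voikko: maps known
# word forms to their base form, or returns None when unrecognized.
KNOWN = {"cars": "car", "car": "car", "houses": "house"}

def lemmatize(token):
    return KNOWN.get(token.lower())

def rank_unrecognized(tokens):
    """Count tokens the lemmatizer fails on, most frequent first."""
    unknown = Counter(t for t in tokens if lemmatize(t) is None)
    return unknown.most_common()

tokens = "cars houses xyzzy xyzzy foo cars xyzzy".split()
print(rank_unrecognized(tokens))  # [('xyzzy', 3), ('foo', 1)]
```

The frequency ranking is what makes the manual step tractable: a human only looks at the top of the list, where a custom rule pays off the most.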

Algorithm Customization: LDA for Word Clouds

Another topic I previously discussed is the importance of understanding the algorithms used. In this case, the topic-model algorithm I used produces internal representations that are typically not used as part of the output. However, by learning the algorithm in detail, I was able to use those internal properties as a basis for service features. Of course, understanding what you are testing does not hurt in testing either.

The topic modelling algorithm I used for building the word-cloud was Latent Dirichlet Allocation (LDA). This is an NLP algorithm that maps documents and texts to a set of “topics”. The discovered topics are often used as input to further steps in an application. First, a short introduction to LDA.

LDA maps words together in clusters that are called topics. To concretize, here is an example of four topics (word clusters) I built using LDA on one of the Kaggle COVID-19 research paper datasets (code and notebook here):

LDA based topic model with four topics, with word-weights shown. Image by author.

With LDA, interpreting these topics/clusters is left to the user. In this case, I would say the topics above seem to relate to patient treatments (topic 1), virus analysis (topic 2), structure of the virus (topic 3), and possibly infection mechanisms (topic 4).

In the case of Valtuustopilvi, the exact meaning of the topics was irrelevant; what mattered was that a good set of representative words and their weights was captured, and that they represented a set of high-level and diverse concepts in the document set. The idea, in the end, was to help the user explore different and interesting high-level concepts in the documents. This end-user goal is always good to keep in mind when testing a system, ML-based or not.

The topics themselves were never shown to the user, only the backend code used them as a basis to build the high-level word-clouds. Summing the weights in different ways provided a basis to choose the words for the cloud, and weight their size.

As I was testing and using the service myself, I also realized that giving fully deterministic results to the user, in the form of the highest-weighted words from the algorithm, did not serve the user well. Doing exactly that would always produce the exact same word-cloud for a given document set. It would not help users explore and find new information over time, and it felt boring after seeing the exact same word-cloud I had already interacted with a few times. So, similar to the Spotify example of avoiding the algorithmic sandbox, I added some randomization: reducing the weights of some words, and varying which of the top words in each topic were included in the word-cloud calculations and how much weight each was given.
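A minimal sketch of this kind of weight aggregation with randomization (the topic data, function name, and jitter scheme here are invented for illustration; the real service aggregated LDA topic outputs):

```python
import random

def word_cloud_weights(topics, top_n=5, jitter=0.3, seed=None):
    """Sum per-topic word weights, then apply random dampening so the
    resulting word cloud varies between visits."""
    rng = random.Random(seed)
    totals = {}
    for topic in topics:  # each topic: {word: weight}, e.g. from an LDA model
        for word, weight in topic.items():
            totals[word] = totals.get(word, 0.0) + weight
    # Randomly dampen each weight by up to `jitter` so top words rotate.
    jittered = {w: v * (1 - jitter * rng.random()) for w, v in totals.items()}
    return dict(sorted(jittered.items(), key=lambda kv: -kv[1])[:top_n])

topics = [{"permit": 0.4, "building": 0.3}, {"budget": 0.5, "permit": 0.2}]
print(word_cloud_weights(topics, seed=1))
```

With `jitter=0` the result is fully deterministic, which is exactly the "boring" behaviour described above; a small nonzero jitter keeps the cloud representative while letting lower-ranked words surface occasionally.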

Testing, Analysis, and Development of ValtuustoPilvi

The LDA example above was only used for the top-level searches, where people were searching over all the documents or over a specific pre-defined category of meetings. It requires pre-computation and is most useful when summarizing larger sets. For deeper search queries with dynamic document sets, I used an algorithm that was faster and adapted to changing search results. For further details on this algorithm, and the others I used, refer to the paper draft I wrote on the topic (…) back in the day.

An example of base tests here was to check that the correct algorithm was used in each case. These base tests are often quite simple and intuitive to write, with some basic understanding of how the system works. But how to test that the word cloud is a good representation of the actual topics for the user? This is much more subjective and harder to test.

Looking back at this, I see different levels of testing I did on this:

Test sizes and oracle levels in the Valtuustopilvi project. Image by author.

First, I looked at a set of documents for a category, and checked whether the algorithm-provided results made sense to me. This is the "Me" at the bottom of the above pyramid. The test oracle was myself: a very limited view, but also very much to the point, and easy to check (I know exactly what I am thinking, at least usually :)).

Next came test groups, such as the competition reviewers and the feedback they gave. This feedback was more generic, without detailed knowledge of the algorithms, but still specifically addressing how they felt about different parts of the service, which is practically the algorithm output transformed into a user-presented form.

The third level is the general user population. It has a large sample size, potentially covering all users of the service. However, the test oracle is very generic, as it can rely only on indirect clues collected from user interaction with the service. This is similar to the Spotify (e.g., evaluation and search) and DuoLingo examples of analyzing user interactions and performing experiments to tune algorithms. The same applies to search engines in general, such as Google search-results optimization based on aggregated and anonymized interactions (linked in Feb/2022).

In the Valtuustopilvi case, the service did not have an internet-scale number of users, so there was not much to analyze at the top level of all users. I also included a feedback mechanism in the UI, but beyond a few basic comments not much came of it. However, once a service reaches a large enough scale, I believe the general user population is a very useful source of evaluation data and ideas, as illustrated by the Spotify, DuoLingo, and Google examples. Of course, keeping in mind privacy and similar aspects.


Looking back at what became of this article, the big difference in testing ML-based algorithms compared to traditional algorithms, as I see it, is the black-box nature of the ML models. They embed complex internals, such as a large number of weights, nodes, and connections. And each of these is different across applications, model configurations, training sessions, and the data used. Thus there is no single way to test a specific ML algorithm; the process involves much investigation and exploration. And it is a constant process, as the data, the models, and our understanding based on them often evolve over time.

Some of the specific points I collected from this:

  • Data evolution, and constantly monitoring it,
  • Trained model evolution, and constantly monitoring it,
  • Identifying important operational and data invariants to cover,
  • Post-training testing types and metamorphic testing,
  • Different viewpoints, such as inverse assumptions/invariants,
  • Investigation and exploration as a central concept,
  • User (system) feedback and its proxies in evaluation at different scales,
  • Keeping the entire ML operations pipeline and related quality in mind,
  • Breadth in end (user) goals, such as avoiding algorithmic sandboxes,

Looking at the above, another common theme is the focus on continuous testing, monitoring, and analysis, rather than the point-in-time focus common in classical testing.

Overall, I find there are many aspects to consider when testing ML algorithms, and the systems built around them. Much of the same concepts apply as for traditional testing, but hopefully this article helps provide some insights into the specifics of ML testing.

That’s all for now. Cheers 🙂

Algorithm Test Engineering: Exploratory Job Analysis

What does it mean to test an algorithm? Let’s play

Plot twist: the article has an RNA algorithm "example". Photo by Braňo on Unsplash

Recently I had a discussion about what it means to test an algorithm, or to do algorithm test engineering. I couldn't quite come up with a convincing definition for myself. At a basic level, maybe you figure out some basic rules of the algorithm, give it some input, and check the output. But I believe there is more to it, including the algorithm's fit for purpose, fit to data, evaluating alternatives, adapting the algorithm to the problem and system at hand, designing and running experiments, learning, and so on.

To get a better idea about all this, I will explore the topic in this article. I try analyzing and testing two algorithms, and see where it takes me. I start in this article with more classical algorithms, where the inputs, outputs, and their relations are clearly defined. In a follow-up article, I hope to look at a machine learning-based one where the definition of the correct results is not so clear.

This story was originally published on Medium, where it can still be found.

Choice of Algorithm

Since my goal was to analyze the testing and analysis of algorithms, I needed to pick an algorithm to work with. Luckily, Wikipedia provides an extensive list of algorithms. I picked two classics: binary search and Levenshtein edit distance. I will warm up with a basic test and analysis of binary search, followed by a somewhat broader look at Levenshtein.

In general, I find that testing an algorithm without a broader context misses some key points of the overall process. So I set up an application for the Levenshtein algorithm to give my testing and analysis exploration some context: an application for comparing edit distances over RNA strings, motivated by the COVID-19 virus structure.

First, binary search. It is an algorithm designed for efficient search over a sorted list of items, in logarithmic execution time. For my experiments, I use the IMDB names dataset and a binary search Python implementation (adapted from a set of Stack Overflow answers).

Understanding the Algorithm

When testing an algorithm, I first need to understand what I am working with. To gain some intuition, I start with a look at linear search vs binary search. Linear search is a very basic search that simply loops over the list in order to find a match, and provides a baseline reference here. To illustrate the difference, and what the binary search algorithm is about, here is a set of 10 names from the IMDB dataset:

Example set of 10 names from the IMDB dataset. Image by author.

I picked the last 10 in the dataset, because they looked like a good random mix. Binary search requires the input to be sorted, so let's do that. Sorted and re-indexed, these 10 look like this:

The IMDB names dataset slice, sorted by string order (first name). Image by author.

To build a better understanding of linear and binary search, I created illustrations on applying them to the above list to find “Lu Bevins”:

Linear search (left) vs Binary search (right) on the IMDB name slice. Image by author.

Linear search loops through the list from the beginning until it finds a match; the list can be in any order. Binary search repeatedly splits the (sorted) list in two and checks the middle value for a match, repeating this on smaller and smaller chunks until it finds a match or runs out of elements to check.
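As a reference, a typical iterative binary search looks roughly like this (a sketch, not the exact Stack Overflow adaptation used in the experiments; the name list is invented):

```python
def binary_search(items, target):
    """Return an index of target in sorted items, or -1 if not found."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2      # check the middle of the remaining range
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            low = mid + 1            # target is in the upper half
        else:
            high = mid - 1           # target is in the lower half
    return -1

names = sorted(["Lu Bevins", "Ana Gray", "Zed Ohm", "Kim Lee"])
assert names[binary_search(names, "Lu Bevins")] == "Lu Bevins"
assert binary_search(names, "Nobody") == -1
```

Each iteration halves the remaining range, which is where the logarithmic execution time comes from.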

Binary search is a well-known and simple algorithm, but the above illustrates how I would go about trying to understand an algorithm. I find that concrete examples and diagrams help me build an understanding. Access to a (domain) expert to help build and check these really helps, especially with less-known algorithms.

The above example list of 10 is very small, and does not really illustrate the benefit of binary search. With larger size, the comparison between linear and binary search should look like this (expecting logarithmic complexity):

Illustrating the theoretical logarithmic execution time vs linear. Image by author.

Such assumptions are good to check, so I will do that next.

Testing Computational Complexity

A common property of interest when testing and analyzing algorithms is computational complexity. This refers to the resources (processing time, memory, etc) the algorithm uses at different scales. Here I focus on processing time, although similar analysis could be performed for other resources.

To explore the scalability of binary search, I sampled the IMDB names dataset for 10 to 1000 names, in increments of 10, and executed both linear and binary search on random items in these sampled lists.

To gain more statistically reliable results, I ran the experiments 100 times for each of these lists, using my codeprofile library for execution-time profiling. The following figures show the measured average execution times over the 100 runs at each sample size vs the theoretical, expected time:

Measured performance (left) vs theoretical (right). Image by author.

The above shapes show the curves follow the expected logarithmic execution-time growth (binary search) and linear growth (linear search). So the measurements match the theory (the scales differ between figures due to parameters; I was just interested in the shapes).

1000 items is still a very short list for real performance measurements. Running the same experiment for 10 to 1 million samples (in 10x increments) shows the effect and benefit of binary search more clearly:

Binary search vs linear search. Left side = 10–1000 list size, right side = 10–1M list size. Image by author.

In the above figure, with list size 1 million, the difference is so big the binary search line looks flat compared to linear search. Based on the above simple experiments, I would say the assumptions hold for computational complexity.

Linear and binary search are well-known and well-studied algorithms, so their complexity is no surprise. As such, they simply demonstrate how we can evaluate complexity in practice. I have found that evaluating the practical computational complexity of actual implementations can produce surprises, especially with custom algorithms from third parties applied to new use cases.

Testing Input-Output Relations

Knowing the computational complexity is good, but we also need assurance that the implementation works as intended. For binary search, basic tests for finding selected names, boundary values, category partitioning, and so on are good examples of this. I call these base assurance tests.

To scale this up, I used the test generator from the above 10 to 1 million computational complexity analysis, and added the following check to it:

  • for every item, linear and binary search find the same index.

I call this an expected/assumed invariant over the data/algorithm. I was expecting the two algorithms to give the same results for the same input, and that the assertion would simply be there for extra assurance. However, the assertion was failing because the indices given by the linear and binary search were not always the same even for the same input.
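A minimal sketch of this cross-check invariant, using a classic midpoint binary search (which, unlike a leftmost-biased search such as Python's `bisect_left`, may return any matching index when duplicates are present):

```python
def linear_search(items, target):
    # Always returns the FIRST matching index.
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(items, target):
    # Classic midpoint binary search on a sorted list: with duplicates
    # it may return ANY matching index, not necessarily the first.
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def check_same_index(items):
    """Check the invariant: both searches return the same index.
    Returns the items for which the two searches disagree."""
    return [item for item in items
            if linear_search(items, item) != binary_search(items, item)]
```

With a unique, sorted list `check_same_index` returns an empty list, but a run of duplicates (e.g. `["a", "b", "b", "b", "c"]`) makes the two searches disagree, just as the duplicate names in the dataset did.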

After some thought on what could cause this difference, and how the algorithms work, I figured there might be duplicate names in the dataset. Here is a quick check to see if this is the case (concatenated to two rows):

Listing items that have duplicate names in the dataset (slice). Image by author.

The above figure shows “Robert Ellis” and “Harrison Ford” as having duplicates. To dig a bit deeper, let's look at the name “Harrison Ford”:

Check whether Harrison Ford exists in multiple instances as primary name. Image by author.

There are four people listed with the same name “Harrison Ford” in this dataset. Having established there are duplicate names, some summary metrics would be of interest to see how many there are:

Counting the number of duplicated primary names in the dataset. Image by author.

The above figures show that a total of 853911 (row count) distinct names have duplicates. The name with the most duplicates is “David Smith”, repeated 340 times. What does this mean for the mismatching search results? Due to how the two algorithms work, searching for “David Smith” would likely result in both linear and binary search returning a “David Smith”, but a different one. Linear search always returns the first one in the list, while binary search can return any item from the run of duplicates.
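The duplicate check itself needs nothing more than a counter. Here is a sketch over a small hypothetical name list (the real check ran with pandas over the full dataset):

```python
from collections import Counter

def duplicate_stats(names):
    """Count how many distinct names occur more than once,
    and which name has the most duplicates."""
    counts = Counter(names)
    dupes = {name: n for name, n in counts.items() if n > 1}
    most_common = max(dupes, key=dupes.get) if dupes else None
    return len(dupes), most_common

# Hypothetical slice of the dataset; the real data had 853911
# duplicated names, with "David Smith" topping at 340 occurrences.
names = ["Harrison Ford"] * 4 + ["Robert Ellis"] * 2 + ["Ana de Armas"]
n_dupes, top = duplicate_stats(names)
```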

To me, this illustrates how testing and analysis of the algorithm helps to better understand both the algorithm and the data it is applied to. And how it is good to assert and test your assumptions about the data and the algorithm. Having a clear goal, and performing this process systematically, should help.

Besides invariants over the algorithm output, one can also consider them over input. For example, binary search expects its input to be sorted. The choice is whether to limit the scope of testing to expect the environment (the overall system) to enforce this, or to expect the algorithm implementation to handle it. I would refer to this as defining the algorithm (test) scope.

As a second example, and to gain some broader insight into the topic, I look at the Levenshtein edit-distance algorithm here. Edit distance is the basis of what is sometimes referred to as approximate string matching. It refers to the number of edits (character changes) required to convert one string into a specific (target) string. Known applications include spelling suggestions, approximate string search queries, and DNA/RNA sequence comparison.

Compared to the single search operation of binary search, the classic Levenshtein algorithm supports three operations: insert, delete, and substitute (a character in a string). The edit distance it gives is the minimum number of these operations required. There is also a variant called Damerau-Levenshtein adding a transposition operation, but to keep this example simpler, I will use the classic Levenshtein here.

Some examples:

  • test -> tent: one substitution of s->n = Levenshtein score 1
  • Levenstein -> Levenshtein: one insertion of h = score 1
  • Levenshtein -> Levenstein: one deletion of h = score 1
  • bad -> base: one substitution of d -> s + one insertion of e = score 2
  • bad -> based: two insertions of s + e = score 2
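The example scores above can be checked against the classic Wagner-Fischer dynamic-programming implementation of Levenshtein distance. Here is a compact sketch (the experiments in this article used the weighted-levenshtein library instead):

```python
def levenshtein(a, b):
    """Classic Levenshtein distance via the Wagner-Fischer
    dynamic-programming algorithm (insert, delete, substitute),
    keeping only two rows of the DP table in memory."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute (or match)
            ))
        prev = cur
    return prev[-1]

# The examples from the text:
assert levenshtein("test", "tent") == 1
assert levenshtein("Levenstein", "Levenshtein") == 1
assert levenshtein("Levenshtein", "Levenstein") == 1
assert levenshtein("bad", "base") == 2
assert levenshtein("bad", "based") == 2
```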

Defining the Example Application

As I noted before, I will look at Levenshtein from the viewpoint of an example application, to give it some context that my binary search example lacked. As Levenshtein has been applied in RNA analysis, I use RNA search as the application. Specifically, the COVID-19 virus, which is RNA based. There is plenty of information available on it, and it gives me a chance to play with longer RNA strings on a timely topic. As with binary search, I start with research to understand the application domain as well as the algorithm applied.

The COVID-19 RNA sequence is described as having a length of 29881 “characters”, with a four-letter alphabet representing its chemical bases. One part of the COVID-19 structure that is considered especially interesting is the spike protein, due to its role in infecting human cells.

The spike protein is described as a 3831 “character” long sub-sequence of the overall virus RNA. Wikipedia describes the Omicron variant as having 60 mutations from the original Wuhan variant, 32 which are in the spike protein.

So in this example I look to apply Levenshtein on sequences of length 3831 (possibly up to 29881). Following the Omicron mutations, I will use up to 60 edits as the maximum in my testing. As with binary search, I am defining the algorithm (test) parameters based on domain knowledge and expected use.

DISCLAIMER: I have not done in-depth research on RNA search, I am sure it is far more advanced, but this works as a basic example for this article.

Experiment Setup

I used the Python weighted-levenshtein library to run my experiments. My base assurance tests include handling an empty string, strings of different lengths, invalid characters, selected string mutations, and so on.

To scale up the testing, I again built a simple input generator to generate input strings, and apply the algorithm's supported operations (insert, delete, substitute) on them. This generated random strings of a given length, applied the selected count of operations on the string, and calculated the distance score for the modified vs the original (non-modified) string, at the same time measuring execution time and checking a set of output invariants:

  • The score given by the algorithm should always match the number of operations applied.
  • Flipping the compared strings, e.g., car->cars to cars->car, should always give the same score, since the operations are effectively reversible.

My test executor checked these invariants for every generated (test) execution of the algorithm.
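A simplified sketch of such a generator and its invariant checks (hypothetical names and parameters; the score invariant is recorded rather than asserted, for the reason the next paragraphs explain):

```python
import random

def levenshtein(a, b):
    # Classic Wagner-Fischer DP, as sketched earlier.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

ALPHABET = "ABCD"  # the four-letter RNA-style alphabet

def substitute(rng, s, count):
    """Substitute `count` characters at distinct random indices,
    always changing the character (no no-op substitutions)."""
    chars = list(s)
    for i in rng.sample(range(len(s)), count):
        chars[i] = rng.choice([c for c in ALPHABET if c != chars[i]])
    return "".join(chars)

def run_generated_tests(n_tests=200, length=8, ops=3, seed=42):
    """Generate inputs, apply edits, and check the two invariants.
    Returns the cases where score != number of operations applied."""
    rng = random.Random(seed)
    score_mismatches = []
    for _ in range(n_tests):
        source = "".join(rng.choices(ALPHABET, k=length))
        target = substitute(rng, source, ops)
        score = levenshtein(source, target)
        # Invariant 2 always holds: flipping the strings keeps the score.
        assert score == levenshtein(target, source)
        # Invariant 1 can fail: a cheaper edit sequence may exist.
        if score != ops:
            score_mismatches.append((source, target, score))
    return score_mismatches
```

Since the applied substitutions themselves form a valid edit sequence, the score can never exceed the operation count; any mismatch is therefore always a smaller score.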

To support the base assurance tests, I first used the input generator to build a large number of strings of relatively short length (5–10), with a varying number (0–9) of operations applied. This gave me increased confidence, higher position coverage, and easier-to-debug results (thanks to the limited length) for score calculations, string boundaries, grouped operations, and so on. It also showed me that my assumption behind the first of the two invariants above was wrong.

What I found is that, given enough iterations, random chance produces cases where a target string built from multiple substitution operations can be reached with a smaller combination of insert, delete, and substitute operations. Here is an example case:

Applying 3 substitutions on a source to produce a target, expecting Levenshtein score of 3. Image by author.

The above shows the source string, and the target string after making 3 substitutions at randomly selected indices, with randomly selected characters. Remember, my target was COVID-19 RNA search, with 4 characters (ABCD). The following substitutions have been made; index 3: A->B, index 4: C->A, index 5: D->C. Expecting Levenshtein score to match the number of edit operations (3 substitutions here), this should give a score of 3.

However, instead of giving a score of 3, the above gives a score of 2. After some analysis, I realized the same target can be reached with one insert and one delete, giving the minimal score of 2:

Achieving the same target as above with only 2 operations (insert + delete). Image by author.

With this knowledge, I had to disable the failing invariant from the checks. Instead, I opted to collect statistics on the score achieved with different number of operations applied. I could then check that the overall score distribution was not too far off from the number of operations applied, if not exactly the same. The second invariant (flipping strings) was fine for all generated tests.
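The effect can be reproduced with a small hand-constructed example in the four-letter alphabet (hypothetical data; `levenshtein` is the classic dynamic-programming algorithm, standing in for the weighted-levenshtein library used in the experiments):

```python
def levenshtein(a, b):
    # Classic Wagner-Fischer DP, as sketched earlier.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

source = "BBBACDBB"
# Three substitutions at indices 3, 4, 5: A->B, C->A, D->C.
target = "BBBBACBB"

# The same target via only 2 operations:
# insert "B" at index 3, then delete the "D".
via_two_ops = source[:3] + "B" + source[3:5] + source[6:]
assert via_two_ops == target

# So the minimum edit distance is 2, not the 3 substitutions applied.
assert levenshtein(source, target) == 2
```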

I find this is an example of what one might call exploratory algorithm testing. Take any assumptions on how the algorithm works, encode them as invariants (or whatever works), generate further tests to see if they hold. Learning about the algorithm and its domain along the way.

After successfully running these smaller tests, and feeling confident I had sufficiently covered my assumptions, I increased the generator string sizes higher for testing computational complexity.

Computational Complexity: Setup

For this evaluation, I used the input generator to produce strings of length from 100 to 5000 in intervals of 250 (100, 350, 600, 850, … until 4850). The target string length I set earlier was 3831 characters, which gets covered here with some margin. For statistical confidence, I repeated each test size (100 to 4850) 100 times.

For each generated string, I applied the following operations 1, 3, 9, and 60 times (60 being the omicron limit defined above):

  • Substitute characters in the string at random (non-overlapping) locations
  • Insert random characters at a random location in the test string
  • Delete random characters in the test string
  • Combinations of above, all at the same time

My goal was to see whether the different algorithm operations, their count, or the string length would affect the computation time. Or more generally, to test the algorithm operations and parameters, to see how they affect its execution time.

Computational Complexity: Measurements

The results for the above described experiments were very similar in terms of performance. The following figure illustrates the substitution tests, and their overall execution time:

Levenshtein execution time vs string size. Image by author.

Substitute_x refers to substituting X characters in the source string, and calculating the Levenshtein score. The x-axis above is the string size from 100 to 4850. The y-axis is the time it took to run the 100 experiments.

The four lines on the above graph are practically overlapping, as the execution time was so close in each case. Since this seemed a bit suspicious, I ran multiple separate experiments with varying parameters to see if this held true, which it did. A bit strange, but OK. Having a domain expert and a team to bounce thoughts off would have been really nice.

I omitted showing all the rest of the operations, their size variants, and combinations here. They were all very close, indicating that the execution time had little to no dependence on the type or number of operations.

The above curve in general resembles a superlinear growth curve. To check the growth class, I experimented with a few parameters to visualize a matching theoretical curve form. Here is the final measured execution time plot vs the theoretical plot:

Execution time (left) vs fitted theoretical plot (right). Image by author.

The classic dynamic-programming implementation of Levenshtein has O(n·m) complexity, which is quadratic when both strings grow together, and the measured superlinear growth is consistent with that. Wikipedia actually has a few fancy words on the Levenshtein computational complexity being high (it cannot be computed in strongly subquadratic time unless a common complexity assumption fails). My verdict is that these results match the theoretical expectation: this algorithm scales poorly for longer inputs.

Testing an Adaptation to the Algorithm

Sometimes we test and analyze an algorithm, and realize it is not a great fit for the need. But we may wish to try an alternative, or an adaptation to this algorithm to address the issues. Let’s look at one example here.

Quadratic execution-time growth is generally considered bad for scalability. For shorter strings (e.g., spell-checking a word or a shell command) this is probably not an issue, as the effect is small on short inputs. However, in my RNA story I wanted to search and analyze sequences of length 3831+. For this, investigating possible speedups seemed useful.

Let’s say we decide to make this faster by running the algorithm on smaller pieces of the input. This should make the execution time grow roughly linearly instead of quadratically. In the end, we just need to rank the search results relatively, not necessarily compute an exact score. And once the top results are known, one could calculate exact scores for that subset. Since this is a hypothetical example, I call this a hypothetical development idea :).
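A sketch of this hypothetical slicing adaptation (assuming a plain dynamic-programming `levenshtein`; the actual experiments used the weighted-levenshtein library):

```python
def levenshtein(a, b):
    # Classic Wagner-Fischer DP, as sketched earlier.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sliced_levenshtein(a, b, slice_size=100):
    """Approximate the distance by slicing both strings into fixed-size
    chunks and summing the per-slice distances. Each slice is bounded
    in size, so the total time grows roughly linearly with the string
    length, at the cost of exactness."""
    total = 0
    for i in range(0, max(len(a), len(b)), slice_size):
        total += levenshtein(a[i:i + slice_size], b[i:i + slice_size])
    return total
```

The per-slice sums behave well for local substitutions, but as the tables below show, inserts and deletes shift the slice boundaries and inflate the sum.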

Here is an execution time graph for splitting the generated input strings to slices of length 100 characters, and summing up their scores:

Execution time after splitting to slices of 100. Image by author.

From the execution time perspective, the above graph looks much better: the superlinear growth is gone, turned linear. But how about the results? How large is the difference in score between calculating it for the full string vs summing up the scores for slices of 100? The following tables illustrate this:

Substitution operator, slices of 100 vs full string. Image by author.

In the above table, s_x refers to substituting X characters in the string. Size is the total string length. The /full postfix refers to running the Levenshtein algorithm on the full string, while the /100 version uses slices of 100 characters. Each configuration was repeated 100 times for statistical coverage, which is why, for example, s_1/full has a score of 100 (100 runs, each scoring 1).

For this substitution example, the sliced version's scores are very close to the full version's, perhaps because a substitution is a local change that rarely affects multiple slices. There are a few rows in the table where the summed score differs by 1–2 edits (6000 vs 5998 and 5999). This is because of the issue I noted above, where an insert and a delete can work together to find a smaller minimum. If the substitution operation were all we needed, this would seem a plausible adaptation. But the goal was to support all operations.

In the below tables, the insert operation uses prefix of i, and delete a prefix of d. So i_x refers to inserting X characters, and d_x deleting X characters:

Insert operator, slices of 100 vs full string. Image by author.

The above insert table shows how the i_x/full score always matches the number of operations (*100) as expected. With i_x/100 slices, the score starts to get larger as the string length increases, and the number of operations applied increases (from 1 to 3, 9, and 60). After some analysis, I concluded this is because inserting a character at the beginning of the string shifts all the rest of the slices forward to the right, and thus causes each slice to require multiple edits, and increase the summed score.

The following figures illustrate this issue:

Above string of length 10, Inserting B, Levenshtein score calculated as full string. Image by author.
Same as above string, Levenshtein score calculated in slices of length 3. Image by author.

The above uses a slice size of 3 instead of 100, but the concept is the same. Depending on the location of the insert, the edit count propagates towards the end and increases the sum by a large amount: for each slice of 3 to match, it has to remove its first character and add a new last character. The same holds for the delete operation:

Delete operator, slices of 100 vs full string. Image by author.

The above delete table shows very similar behaviour to the insert table, and for the same reason: a delete shifts all slices left, where an insert shifted them right.

So, while the slicing approach would be successful in cutting down the algorithm processing time, it would definitely not work to properly rank the search results for insert and delete. From the application perspective, I would consider this adaptation a failed experiment, except for learning.

However, for the purposes of this article, I find this is a good experiment. It shows how one might run tests on an algorithm; analyze its results, fitness for purpose, and other properties; build hypotheses; implement experiments; evaluate them; and iterate. Optimally, this is how I would see algorithm testing and analysis contributing to the overall development more broadly, helping to design and evaluate experiments for improved algorithms and adaptations.

Concluding the Levenshtein Search Experiment

In a more realistic scenario, I hope I would be working with a team of experts, and have the resources to research all the approaches, the state of the art, and everything else on the subject. In fact, when the algorithm and its application domain are complex, and the opportunity is there, I would consider this an essential part of the testing and analysis process: working with domain experts and an R&D team, and researching the domain knowledge. Here I had to limit the scope, as my resources were limited.


My goal with this article was to explore the idea of what algorithm testing (and analysis) could be. A summarizing list makes it simpler to remember:

  • Traditional testing techniques can define base assurance tests with expert defined inputs and expected outputs
  • Building a good, in-depth understanding of the algorithm helps understand how to test it, and how to adapt it to a domain
  • This includes building a similarly good understanding of the data it is applied to, and how this relates to the algorithm
  • Ideally, above works iteratively in interaction with research, development, and testing
  • Besides verification, algorithm testing can contribute to understanding its limits, potential optimizations, and compare alternatives
  • Identifying assumptions about the data and algorithm input describes what the algorithm is expected to work with
  • Exploratory data analysis can use these assumptions as input, check if they hold, and refine them
  • Identifying assumptions about the data and the algorithm output gives a basis to write invariants to check in every test
  • An automated test generator helps scale testing with these invariants, and check whether the assumptions hold
  • Scope of testing relates to defining the scope of the algorithm vs the overall system using it, responsibilities for input and output handling
  • Theoretical computational complexity is good to know, but it is worth evaluating in practice whether it holds for the implementation and the available data
  • Algorithm testing can form a tool for exploring the algorithm, by formulating hypotheses about it and using testing to evaluate them
  • Tools and techniques, such as metamorphic testing, can help evaluate the robustness of the algorithm to different types of valid and invalid inputs
  • Testing and analysis is optimally an iterative process, where the end result is the final generated tests and checks, and all the learning on the way


I started this article with the idea to explore what it means to test an algorithm, or what it could mean to do “algorithm test engineering”. I like to think I made some progress, although I am sure the definition can be subjective, much like the “goodness” of a (machine learning) algorithm result.

Both my examples in this article, binary search and Levenshtein edit distance, are quite simple and basic algorithms in the end. As noted in this article, the basic testing of such algorithms is not too complicated. However, considering algorithm testing and analysis as part of a broader research and development process, I believe the interactions, collaborations, research and development contributions can make it more diverse and interesting.

In this article, I looked at two “classic” algorithms, where the input, output, and their relations are quite straightforward to define. In a follow-up article, I will look at machine learning based algorithms, and an example of an algorithm where the input-output relations and the correctness of the output are harder to define, subjective, or “undefinable” in a traditional sense. Until then.

That’s all for now. Cheers.

Merkle Trees: Concepts and Use Cases

The data structure within Bitcoin, Amazon Dynamo DB, ZFS, …

This article explores what are Merkle trees, and how they are used in practice in different systems including Bitcoin, Amazon’s Dynamo DB, and the ZFS filesystem. The basic concept is quite simple, but some of the clever applications are not so obvious.

First, let's start with the concept of Merkle trees. As I said, it is not too complicated in its basic form.

What is a Merkle Tree

A Merkle tree is fundamentally just a hierarchical set of hash values, building from a set of actual data (Merkle leaves) to intermediate hashes (Merkle branches), and up to the Merkle root that summarizes all the data in one hash value.

Example: A Basic Merkle Tree

The following figure illustrates a very small Merkle tree:

A very small Merkle tree. Image by author.

In this figure, the bottom nodes (Data1–Data4) are the actual data processed by the application. Each of these is summarized by its respective hash value (Hash1–Hash4), as a Merkle leaf. From these, the Merkle tree builds a hierarchy, combining hashes together until only one is left. The nodes combining other hash nodes are called Merkle branches (here Hash12 and Hash34). When there is only one left (here Hash1234), it is called the Merkle root. There can be multiple levels of branches and hashing, as the following examples will demonstrate.

Handling An Unbalanced Merkle Tree

The above example illustrates the very basic case of a Merkle tree. It is a convenient example, as at every level there are just the right number of nodes to form exact pairs. What happens if you have an uneven (odd) number of leaf (data) nodes? For example, what happens to the above example if you have 5 Data nodes? You can hash Data1+Data2 together to form a Merkle branch, and same for Data3+Data4. But Data 5 is left without a pair to hash into a new branch.

Different approaches can be taken to address this. For example, in this situation Bitcoin simply copies the un-pairable (odd) hash, and uses the duplicate as a pair (the odd hash is paired with itself). The following figure illustrates this:

Handling odd (uneven) numbers of hashes Bitcoin style. Image by author.

In this example, Hash 5 has no immediate pair, and is duplicated to pair it with itself. Same happens for Hash55. The above is just one option, there are different ways to handle this pairing issue. Of the ones I have seen, I like the Monero (or Cryptonote?) one most. I will present it shortly. First, something slightly different.

A Very Simple Merkle Tree in Python

Theories and explanations are great, but for the technical person some code helps formalize it. The following code from my Github repo illustrates a naive but simple example of implementing a Merkle tree in Python (or just skip it for the more conceptual explanation after): very simple and naive Merkle tree implementation in Python. Code by author.
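In the same spirit, here is a naive standard-library sketch of the idea (sha256 standing in for the keccak function used in the repo version, since hashlib has no keccak under that name):

```python
import hashlib

def hash_value(data):
    """Hash bytes to a hex string (sha256 as a stand-in for keccak)."""
    return hashlib.sha256(data).hexdigest()

def merkle_root(data_items):
    """Compute a Merkle root, carrying an un-pairable odd hash up to
    the next level unchanged (the naive re-use approach in the text)."""
    level = [hash_value(item) for item in data_items]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            # Merkle branch: hash the concatenation of the pair.
            next_level.append(hash_value((level[i] + level[i + 1]).encode()))
        if len(level) % 2 == 1:
            # Odd node left over: re-use its hash on the next level.
            next_level.append(level[-1])
        level = next_level
    return level[0]

root = merkle_root([b"Data1", b"Data2", b"Data3", b"Data4", b"Data5"])
```

Running `merkle_root` over Data1–Data5 reproduces the odd-node re-use behaviour described below: the fifth hash is carried up a level unchanged until it can be paired.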

The last four lines in the above code run the example to create a Merkle tree with data nodes Data1–Data5. The calculation of the tree with this algorithm looks like this (cropping the hashes to the first 5 characters):

Merkle tree calculation example with actual values. Image and animation by author.

I used the keccak hash function in this example, as this is what the Monero cryptocurrency/blockchain uses, and I was looking into the Monero code lately. The 5th (odd) data element here (using my Python code above) is handled slightly differently from the Bitcoin example, as this simply re-uses the hash if one is left over (un-pairable, cf54b above). Practically, this re-use should have the same effect as the duplication in the Bitcoin algorithm.

Optimizing Merkle Calculation: Monero Style

As I was looking into the Monero blockchain to understand it better, I found it has a different but clever approach to hashing the Merkle tree. The code for it is available in the tree-hash.c file in the Monero Github repository. And a Python version I made to implement the same in my Github.

The Monero approach could be described as converting the hash tree to a perfect binary tree. It hashes just enough leaf nodes in the first iteration that the following iterations will always have exactly 2ˣ (a power of 2) nodes. The following figure illustrates this:

Monero approach to Merkle tree building, with 5 nodes (transactions in Monero). Image by author.

This works as follows: First, find the smallest 2ˣ that is at least as big as the number of transactions (data nodes). In this example, it is 2³=8, because 8>5 (there are 5 transactions/Data nodes). The power of two before that, 2²=4, would not fit (4>5 is not true). From this, the number of transactions is subtracted, in this case 8-5=3. This 3 is the index of the transaction to start iteration 1 from. With 0-based indexing, the start index points to Data4 in this example.

The above explanation is maybe a bit abstract. Below is a table with concrete examples of transaction counts, and how these are converted into a “perfect binary tree” shape for iteration 2 and the following iterations (the size is always 2ˣ, eliminating any “odd” counts or leftover un-pairable hashes from those iterations):

Example calculations of Merkle size with the Monero (Cryptonote) algorithm. Image/Table by author.

The columns:

  • Transactions: Number of transactions in the block being calculated
  • Closest 2ˣ: The first 2ˣ that is at least as big as the number of transactions
  • Iteration 1 start index: start hashing pairs from this index in transaction list
  • Iteration 1 values: Number of transactions in transaction list starting from iteration 1 start index to the end of transaction list
  • Iteration 1 pairs: iteration 1 values divided by 2 (because hashed in pairs)
  • Iteration 2 formula: How the above values lead to number of items to hash in the following iteration
  • Iteration 2 size: The number of transactions to hash in iteration 2. As you can see, it is always 2ˣ, and leads into a “perfect binary tree”.
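The table's arithmetic can be condensed into a few lines (a hypothetical Python sketch, not the actual Monero C code; the real tree-hash also special-cases transaction counts of 1 and 2):

```python
def merkle_tree_sizes(tx_count):
    """Compute the iteration-1 start index and iteration-2 size for
    the Monero/Cryptonote tree-hash, following the table above."""
    # Smallest power of two that is >= tx_count.
    power = 1
    while power < tx_count:
        power *= 2
    start = power - tx_count           # iteration 1 start index
    iter1_values = tx_count - start    # hashed in pairs from `start`
    iter1_pairs = iter1_values // 2
    iter2_size = start + iter1_pairs   # always a power of two
    return start, iter2_size
```

For 5 transactions this gives start index 3 and an iteration-2 size of 4, matching the figure above; for 23 transactions it gives 9 and 16, matching the table row used in the animation below.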

Too many numbers and formulas I guess. Did you fall asleep yet? Anyway, if you pick the row from above table with 23 transactions, you can follow this nice 8-bit animation I made on how the Merkle root calculation progresses for that row:

Calculating Merkle root in Monero for block 1407480. Image/animation by author.

In this case (animation above), the start index is calculated as 2⁵=32, and 32-23=9. This is transaction number 10 (the zero-based index plus 1). The following iteration then has 16 nodes (2⁴), the next one 8 (2³), and so on. The animation above actually represents how the hash array is manipulated in memory by the Monero code.

Example Use Cases

Concepts are nice, but concrete, real-world use cases make the point. Here I will discuss the use of Merkle trees in blockchains/cryptocurrencies and in Amazon's AWS DynamoDB distributed database. I will also briefly touch on the ZFS filesystem and the Git version control system, as these are sometimes also mentioned as examples of Merkle tree use.

Cryptocurrencies and blockchains: Bitcoin, Monero, et al.

As already briefly discussed above, Bitcoin and similar cryptocurrencies make use of Merkle trees to summarize and validate the transactions in a block, and embed the Merkle root into their block header as a summary of it all. The following figure illustrates this:

Blockchain and Merkle trees. Image by author.

Each block has an ID value, which is the hash of its header fields. A part of this is the Merkle root. Another part is the previous block ID (Parent in above figure). By linking with the previous block ID, and including that as part of the next block ID, the blocks form a blockchain. By embedding the Merkle root in it, they make an immutable record of transactions in the block.

For example, the Bitcoin block header contains:

  • Difficulty target value (called bits in Bitcoin)
  • Merkle root of all transactions in the block
  • Nonce; a value changed in mining to find accepted blocks
  • Previous block hash (block ID); linking this block to the previous block in chain (this previous hash is named parent in figure above)
  • Timestamp; Time of mining (creating the hash for) the block
  • Block version; identifies support features and formats (also used to identify hash function in some blockchains such as Monero)

These header fields are all hashed together to form the block ID. Making the block ID a hash of all header data makes it practically impossible to modify a block header. Including the Merkle root in the block ID further makes it practically impossible to modify the transactions in a block.

But let’s look in a bit more detail into how these types of blockchains actually use the Merkle tree in different ways. It’s all about data integrity, but in different and sometimes ingenious ways.

Blockchain Use Case 1: Transaction Data (Block) Immutability

As I noted above, the base use case of the Merkle tree in Bitcoin-style blockchains is to build the Merkle root into the block header, and use it to verify that no transaction has been changed.

As an example, let’s change Data4 to Data6 in the earlier example:

Changing one node value invalidates (changes the root of) the Merkle tree. Image by author.

Comparing this to the earlier example, with Data4, the Merkle root was aa8d3. With Data6, it is now f8932. This way, any change in any transaction data changes the Merkle root, and it no longer matches the one stored in the blockchain for that block (in block header). And through the block header, this issue would propagate to the whole blockchain following this block.

However, if you think about it for a while, you may ask a question:

Do you really need Merkle trees to verify the block data?

No. Instead, one could just concatenate all the transaction data together and build a single root hash. Consider the above example with Data1–Data5, but just hashing them all together:

Concatenate+hash all data at once. Image by author.

Now, changing just one data value will have the same effect of invalidating the summary hash:

Hash change detection in concat+hash all at once. Image by author.

This single hash could also be used in place of the Merkle root in the block header and block hash. Same effect so far.
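A sketch of this concatenate-and-hash alternative (sha256 standing in for the blockchain's actual hash function):

```python
import hashlib

def concat_hash(data_items):
    """Summarize all items with a single hash over their concatenation,
    the simpler alternative to a Merkle root discussed above."""
    h = hashlib.sha256()
    for item in data_items:
        h.update(item)
    return h.hexdigest()
```

Note that a real implementation should delimit or length-prefix the items, since plain concatenation is ambiguous (e.g. `[b"ab", b"c"]` and `[b"a", b"bc"]` hash the same here).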

So what is the real benefit of using a Merkle tree here? To summarize, I would say it gives you more granularity in the hash verification, and enables other clever tricks for processing the blockchain more efficiently, while still providing the transaction integrity assurance/verification.

The original Bitcoin whitepaper provides two additional examples for use of the Merkle tree: blockchain pruning and simplified payment verification (SPV). Beyond these, the one I really like is the Stratum mining pool protocol. Let’s look at these next.

Blockchain pruning

Over time, a blockchain grows larger and larger as it accumulates more and more blocks. For example, today (Feb/2021) the size of the Bitcoin blockchain is about 380GB. Blockchain pruning is an approach to reduce this used space with the help of Merkle trees, by removing used transactions from local storage where they are no longer needed.

We want to have all the data around for full-scale verification and history, but not all nodes in the peer-to-peer network need all the data. The blockchain pruning approach proposed in the original Bitcoin whitepaper addresses this by using the Merkle tree to prune (remove) spent (used) transactions from the blocks.

The following figure illustrates this:

Three out of five transactions used (spent) in the block (and Merkle tree). Image by author.

In this example, we have used (spent) transactions TX1, TX2, and TX4. We still have unused transactions TX3 and TX5 left in this block. Assume we are not interested in a full node with full archival, just keeping a list of spendable transactions. Satoshi’s proposal was to prune the block from the used transaction data, and simply leave the Merkle tree branches needed to verify the unspent transaction data:

The Merkle tree pruned of the used transactions. Image by author.

This is perhaps more clear if we also spend TX3:

One more transaction used, leaving just one unused in the Merkle tree (and the block). Image by author.

The block could now be pruned to only contain the TX5 data, and the hash of the other Merkle tree branch (96b8d) in the block:

Pruned Merkle tree with only one transaction left. Image by author.

We would just have saved the space required to store 4 out of 5 transactions, and 6 out of 9 Merkle tree nodes. About 80% space savings in that example. The longer the blockchain gets, and the more it is used, the more spent transactions it holds, and thus the more space could be saved. It's a very clever idea, like most ideas in the original Bitcoin whitepaper.

But much as with block verification, we can always ask the question:

Do you really need Merkle trees to prune the blockchain?

The Bitcoin StackExchange is full of insightful discussion on how pruning is actually applied in real implementations. While Merkle tree pruning is a clever approach, it is not applied in this way in (at least) the Bitcoin Core software.

Instead, the unspent transactions are stored in a separate database for fast lookup. This database is initialized on first startup by scanning the blockchain, and is continuously updated as new blocks are broadcast on the Bitcoin network. A pruned node can then just rely on this database.

Running a pruned node in general is sufficient for the basic operations, but does not fully support all features of a blockchain, so some nodes will still need to hold the full data. But that goes beyond the scope of this article.

I think the basic takeaway here is that Merkle trees are cool, but sometimes the basic and even simpler approach is fine, or even better. Of course, the hard part is identifying when this is true and when the cooler (Merkle) approach really is best. Don't take this in any way as a suggestion not to use Merkle trees, just to think about the overall picture. With Bitcoin, and related blockchains, I believe the Merkle tree enables many other things and thus makes a lot of sense, as the following sections will show.

Simplified Payment Verification (SPV)

The simplified payment verification (SPV) algorithm was also suggested in the original Bitcoin whitepaper. In SPV, a lightweight blockchain client stores only the block headers, but still wishes to verify payments it receives as valid transactions in the blockchain. Lacking full transaction details, the SPV client uses Merkle trees to efficiently verify the transaction details in collaboration with full nodes.

I will re-use the example from blockchain pruning above. The SPV client wants to verify TX5 in the given block. The following figure illustrates this:

SPV client node verifying a single transaction is part of a given Merkle tree. Image by author.

Here, the SPV client node asks a full node for the Merkle branches required to build the Merkle root with the TX5 data. By rebuilding the Merkle root (aa8d3) from the transaction of interest (TX5), and the Merkle branches (96b8d) provided by the full node, the SPV client can have confidence in having received a valid transaction. By checking this re-built Merkle root against the one stored in the block header, it can verify that the whole tree (and thus TX5) is in fact valid, and is a part of the blockchain.
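The verification step the SPV client performs can be sketched roughly like this. SHA-256 stands in for the real hash, and the proof format here (a sibling hash plus which side it sits on) is my own simplification of what a full node would send:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_branch(leaf: bytes, proof: list, root: bytes) -> bool:
    """Rebuild the Merkle root from a transaction and the sibling hashes
    provided by a full node. Each proof step is (sibling_hash, side)."""
    node = h(leaf)
    for sibling, side in proof:
        node = h(sibling + node) if side == "left" else h(node + sibling)
    return node == root

# A tiny 4-transaction block, as built by a full node:
hs = [h(tx) for tx in [b"TX1", b"TX2", b"TX3", b"TX4"]]
n12, n34 = h(hs[0] + hs[1]), h(hs[2] + hs[3])
root = h(n12 + n34)

# The full node sends the SPV client only the branches needed for TX3:
proof = [(hs[3], "right"), (n12, "left")]
print(verify_branch(b"TX3", proof, root))   # True: TX3 is part of this block
```

Note how the client only needs one sibling hash per tree level, so the proof size grows logarithmically with the number of transactions in the block.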

I find SPV to be an interesting example of how Merkle trees can be used, together with (block) data filtering (Bitcoin uses Bloom filters but that is again another topic), to synchronize and verify existence and correctness of selected data in a distributed system.

Pool Mining: The Stratum Protocol

The traditional way to create cryptocurrencies has been mining via proof of work (PoW) hashing. Mining pools are a way for smaller miners to join together and get some rewards for mining, based on how much compute (hash) power they contribute to the pool. This leads to the requirement for a central entity, the mining pool, to be able to coordinate the miners. It needs a way to effectively distribute and track the overall mining process across all clients. With some clever tricks, the Stratum protocol uses Merkle trees to enable this.

Stratum version 1 in particular makes use of Merkle trees to distribute the work efficiently. The pool server provides the miner nodes with the block header elements needed to build the block. This includes a partial Merkle tree, with the branches pre-calculated for all transactions except the coinbase transaction: the special transaction that pays out the mining reward. This is illustrated by the following figure:

Merkle templating with the Stratum mining pool protocol (v1). Image by author.

This figure contains three elements: the Merkle tree template, the coinbase template, and the search for the nonce(s). The pool provides the Merkle tree template to the miner, containing the pre-calculated Merkle branches for the transactions in the block (here 96b8d, greatly reducing the bandwidth and calculation needs for miners). The miner is then expected to fill the Merkle tree template with a suitable coinbase transaction, built from the server-provided coinbase template. The winner is the one who fills the template with values producing a hash that matches the blockchain network's difficulty level, meaning simply a hash with a suitably small value.

The coinbase template is provided by the pool, and is filled by the miner using their choice of nonce and extranonce fields. When combined with the Merkle tree template, each choice produces a different block hash, and the only way to find a winning block is to try values until one produces a hash matching the network difficulty target. If the miner finds a nonce that matches the network hash difficulty for this Merkle template, it submits it to the mining pool. The coinbase template contains the mining pool address, ensuring that the pool receives the block reward and can distribute it to all the miners.

In a broader context, the Merkle tree here is used to distribute a partial solution (the pre-calculated Merkle tree branches and coinbase template), while allowing different distributed nodes to work independently to try to find the solution to the missing part (nonces for the coinbase transaction to build an acceptable PoW hash). By embedding the pool address as part of the template, it ensures that all distributed nodes contribute to the common goal and can share the rewards.
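As a rough sketch of this idea (not the real Stratum protocol messages), a miner's search loop could look like the following. The template contents, the extranonce value, and the one-leading-zero-byte "difficulty target" are all made up for illustration:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Values the pool might send; all made up for illustration:
merkle_branch = h(b"precomputed sibling hashes")   # pre-calculated branch (like 96b8d)
coinbase_template = b"pool-address|extranonce=%d|nonce=%d"

def mine(difficulty_prefix: bytes, max_tries: int = 1_000_000):
    """Fill the coinbase template with nonces until the resulting
    block hash meets the (toy) difficulty target."""
    for nonce in range(max_tries):
        coinbase = coinbase_template % (42, nonce)    # miner picks its own nonces
        root = h(merkle_branch + h(coinbase))         # complete the Merkle template
        block_hash = h(root)                          # stand-in for hashing the header
        if block_hash.startswith(difficulty_prefix):  # "suitably small" hash
            return nonce, block_hash
    return None

result = mine(b"\x00")   # require one leading zero byte (~1/256 chance per try)
print(result is not None)
```

The key point the sketch shows: the pool ships the fixed parts once, and each nonce choice flows through the coinbase into the Merkle root and thus into the block hash, so every try is a fresh lottery ticket.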

AWS Dynamo DB

Dynamo DB is a distributed database provided as part of the Amazon Web Services (AWS) platform. It was originally designed to handle much of Amazon’s retail global infrastructure needs. As such, a lot of consideration has been put into making it scalable and optimizing the efficiency in all parts of its architecture. The architecture of Dynamo DB is described in the AWS Dynamo DB paper from 2007, including its use of Merkle trees to efficiently synchronize diverged nodes (section 4.7 in the paper).

Dynamo DB is what is called a key-value store. The actual data (values) are stored in data nodes, and identified and indexed by their keys. Dynamo DB hosts this data in what it calls virtual nodes. Each virtual node hosts a key-range. To handle high-availability and scalability requirements, Dynamo DB distributes the data across these key-ranges, and hosts each key-range on multiple virtual nodes (across multiple physical nodes, or partitions).

The following figure tries to illustrate this, with 3 (virtual) Dynamo DB nodes, each holding 2 key ranges, each key range duplicated across two nodes. I am sure this is not exact in all the details, and the system has evolved since the paper was published, but I believe it depicts the general concept well enough:

DynamoDB virtual nodes and distributed key ranges. Image by author.

In this type of a distributed system, there will eventually always come a time when some of these virtual nodes are out of sync with other virtual nodes holding the same key-range. Dynamo DB uses Merkle trees to perform effective comparison and synchronization of the key-ranges in these nodes. In the above figure I made a small 3-node Merkle tree inside each key-range to illustrate this. The KR3 key-range tree is shown partially in red to illustrate how the nodes have diverged and need resolving to find the correct values.

A Merkle tree is built for each key-range, where the leaf nodes of the tree are the key-range data values. The Merkle root summarizes the data in each node. By comparing the Merkle roots of each virtual node hosting the same key-range, divergence in nodes is immediately visible. This full comparison requires only the communication and comparison of a single value, the Merkle root. If there is a difference, one can keep comparing the branches of the tree to efficiently find the exact divergent spots, and the data values to synchronize.
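A rough sketch of this top-down comparison (my own simplification, not DynamoDB's actual implementation): compare the roots first, and only descend into branches whose hashes differ.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(values: list) -> list:
    """Return all levels of a Merkle tree, leaves first, root level last."""
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                # odd count: duplicate the last node
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diverged(tree_a, tree_b, level=None, index=0):
    """Descend from the roots, following only branches whose hashes differ;
    returns the indices of leaf values that need re-syncing."""
    if level is None:
        level = len(tree_a) - 1
    if index >= len(tree_a[level]) or tree_a[level][index] == tree_b[level][index]:
        return []                         # subtrees match: nothing to sync here
    if level == 0:
        return [index]                    # a diverged leaf value
    return (diverged(tree_a, tree_b, level - 1, 2 * index) +
            diverged(tree_a, tree_b, level - 1, 2 * index + 1))

node_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
node_b = [b"k1=v1", b"k2=v2", b"k3=OLD", b"k4=v4"]  # replica drifted on k3
print(diverged(build_tree(node_a), build_tree(node_b)))  # [2]
```

If the roots match, the comparison costs a single hash exchange; if not, only the differing branches are followed down to the exact keys that need syncing.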

The following figure illustrates this process:

DynamoDB using Merkle trees to find data nodes out of sync. Image by author.

As you can see, I just copied the blockchain example into the above figure. Because the fundamental structure, and the concepts it relies on, are exactly the same! The difference is only in the (clever) application of the same fundamentals in a different context.

Yes, some details would likely differ. For example, DynamoDB likely uses a different hash function than Keccak. But, as illustrated here, the fundamental approach and concepts are the same, and as such applicable to more than just the common blockchain example of Merkle trees.

ZFS File System

ZFS is a filesystem supporting data spread over multiple volumes, a form of distributed filesystem. Multiple ZFS volumes can be grouped into storage pools, which can also host redundant copies of the data. ZFS is advertised as using Merkle trees to ensure data integrity.

ZFS uses Merkle trees to checksum data, to identify issues where some part of the data written to, or read from, disk is corrupted (or misread etc.). As in most filesystems, the data is stored as blocks of specific sizes. These blocks form a natural leaf node for building a Merkle tree.

The main rationale I found online for ZFS using Merkle trees always discusses how ZFS stores the Merkle tree checksums outside the block itself, providing higher resiliency for identifying errors. This leaves it quite unclear to me why you need a Merkle tree for this, rather than just storing a flat checksum outside the block. Sometimes I feel people should ask more questions.

On the other hand, I could see the benefit of verifying data across multiple storage pools and their replicas by using hierarchical Merkle trees. This case would make it very similar to how DynamoDB operates and uses Merkle trees to sync nodes (vs ZFS volumes). However, I could not find any description for using Merkle trees for this. If someone knows better, I would be happy to hear.

In any case, this seems like a very similar case to Dynamo DB: Merkle trees being used to verify the integrity of a distributed data storage system more efficiently, with the added benefit of separating the consistency properties (checksum hashes) from the data itself, for increased resiliency.

Git Version Control System

Another example that I found referenced as a use of Merkle trees is the Git version control system. Git does use a lot of hashing, and its basic functionality is heavily based on hashing file contents and various metadata. Git forms a directed acyclic graph (DAG) of commits identified by these hashes, which some seem to consider a Merkle tree. I disagree, so I will use this as an example of some nuances in the concept.

A DAG is a graph where connections go one way (directed), and there are no cycles (acyclic). Unfortunately, I do not understand how the Git DAG would be a Merkle tree. The following figure represents a Git DAG with a few of the most common Git graph actions – branches and merges:

A made up DAG representing Git commits. Arrows point to parent commits. Image by author.

Here you have nodes that are identified by their hashes, and a tree-type structure where each node points to a parent, and has one or more children. However, there are multiple issues I see in calling this a Merkle tree:

  • Nodes in a Merkle tree do not have multiple parents; nodes in the DAG do. The figure above has two such nodes: 8e290 and 184da.
  • A Merkle tree only hashes actual data in its leaf nodes; the Merkle branches above the leaf nodes are hashes of hashes. The Git DAG has no branches based on hashes of hashes, just nodes with hashes of raw data.
  • A Merkle tree starts with the leaf nodes and proceeds up the tree, constantly halving the number of branches in each iteration. The Git DAG contracts and expands all the time with branching and merging, because it is a graph and not a tree.

Each node in the Git DAG hashes its parent's hash ID as part of its own hash identifier, and the graph is ordered in time (directed). As such, it actually very closely resembles a blockchain. Unlike cryptocurrencies, it allows forking (branching), and integrates those as a central part of its structure. However, I see no problem with relaxing this as a requirement for a general blockchain definition. And I do not see a blockchain as requiring a Merkle tree, just as I do not see the Git DAG as one. It is just useful for cryptocurrencies to have one, but you could have a different form of blockchain, with other mechanisms, such as Git.

In summary, not everything needs to be a Merkle tree, even if it is cool. Although you should consider Merkle trees where it makes sense, because they are cool. However, I do not see the Merkle tree in the Git DAG (or elsewhere in Git). If you know better, happy to hear and learn :).


I find Merkle trees to be a simple but clever data structure. And I like my algorithms and systems best when they are simple but clever. The examples I covered in this article were:

  • blockchain integrity verification,
  • blockchain pruning,
  • simplified payment verification,
  • mining pool work distribution,
  • DynamoDB data synchronization,
  • ZFS checksumming (and possibly node synchronization).

What is common in all of these? All of them use the Merkle tree as a tool to take a potentially large set of data elements, and hierarchically verify their integrity in steps. This allows for efficient data communication, and efficient verification of the overall data, going into more detail only as needed. This, I believe, sums up what I see as the best use case for Merkle trees: enabling efficient verification and synchronization of data over distributed nodes.

Of course, then there is the stratum mining protocol, that takes it a bit further and uses Merkle trees as a means to distribute work, while controlling specific properties of the output, in a decentralized and untrusted setup. Cool.

Well, in most day-to-day software engineering tasks there is not often a chance to use, and build on, these features. But it certainly is interesting to learn them, and thus better understand what some of the latest technologies built on them are based on. And to make those informed decisions when the time is right. Right?

Zero Knowledge Proofs: Example with Pedersen Commitments in Monero

An Explanation Between Cartoons and Greek Symbols

If you spend a bit more time reading about cryptocurrencies, blockchains, or many other related technologies, you will likely run into the term zero knowledge proofs (ZKP). To me, the term sounds like a paradox. How can you prove anything with zero knowledge? How do you even know what to prove, if you have zero knowledge?

So I tried to build myself some understanding of this. In this article, I try to share that understanding: what zero knowledge means in ZKP, what the proof is really about, and how the two relate to each other.

I originally posted this article on Medium, where it can still be found.

I start with the simple example of the Ali Baba cave, which seems to be an often used simplified ZKP illustration. I then proceed with an example of what ZKP can look like as a cryptographic approach, and how Monero uses Pedersen Commitments to prove that input and output amounts match without revealing the actual amounts. I look at this as a zero knowledge proof of the matching sums of input and output amounts, without revealing the amounts themselves.

My goal in writing this article was to build a more intuitive understanding that fits between the Ali Baba cave and the cryptographic theories. So no formulas with Greek symbols, but not just Alice and Bob in a cave either.

The Basic Example: Ali Baba Cave

The usual example given for Zero Knowledge Proofs is the Ali Baba cave. This is a circular cave containing a door that unlocks only with a specific passphrase. Alice claims to know the passphrase, and wants to prove this to Bob, without revealing anything about the passphrase. The following image shows the layout of the imaginary cave:

Layout of the Ali Baba cave in this story. Bob image by David Rock Design on Pixabay, Alice image by Graphic Mama Team on Pixabay. Rest of the image by author.

The overall story is that Alice enters the cave, and walks up to Point 3 or Point 4. From there, she can return to the entrance via two paths, Path A and Path B. Once she is at one of the points, Bob comes to the entrance (Point 2) and calls for her to arrive there via one of the paths, which he chooses at random. Depending on which side of the door Alice is, she has a 50% chance of being able to return to Bob without passing through the door. Otherwise, she needs to know the passphrase. Bob does not know if she used the door or not; he only sees her arrive through the correct path he called out.

The points of interest, as I marked them in the figure, are:

  • Point 1: Alice and Bob meet outside the cave. Alice enters the cave and walks to Point 3 or Point 4.
  • Point 2: A fork of two paths leading deeper into the cave. Once Alice reaches Point 3 or 4, Bob comes here, picks Path A or Path B, and calls for Alice to arrive via the chosen path.
  • Point 3: If Alice waits here, and Bob calls for her to come out through Path A, she can walk there without knowing the passphrase to the door. If Bob calls for Path B, she has to open the door to arrive through Path B.
  • Point 4: Same as Point 3, but here she has to know the passphrase if Bob calls for her to arrive via Path A.

The following animation illustrates the different scenarios that can play out:

A silly animation on Bob and Alice in Ali Baba cave. Bob image by David Rock Design on Pixabay, Alice image by Graphic Mama Team on Pixabay. Rest of the images and animation by author.

I initially found the Ali Baba cave to be a quite confusing example of a Zero Knowledge Proof, because we obviously know a lot, including:

  • There is a door in the cave
  • The door is locked, and there is a magic passphrase to open it
  • Alice is in the cave, (at the door)
  • Bob is outside
  • The cave is circular, and there are two paths in/out of it, with no shortcuts between them
  • Alice claims to know the passphrase
  • After the “proof” is presented, we somehow trust Alice knows the passphrase

So we know a lot about this problem, and the term zero knowledge seemed confusing to me. I eventually figured out the following:

  • The protected secret is the passphrase
  • The Zero Knowledge here refers to not knowing anything that helps reveal the passphrase (or more generally, the protected secret)
  • The proof is in the action of Alice arriving through the correct path every time. Repeated so many times that, statistically, starting on the called side (thus no passphrase required) every time is incredibly unlikely
  • In the end Bob gains trust that Alice knows the secret, but gains zero knowledge of the secret itself (the passphrase). He may gain other knowledge, such as Alice arriving through correct path.

If Alice does not know the passphrase for the door, she has a 50% chance to pick the right path on the first run, 25% to get it right twice in a row, and so on, until the joint probability of always getting it right gets statistically incredibly small.
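This diminishing probability is easy to check with a small simulation. The helper name, the round count, and the number of trials here are my own choices for illustration:

```python
import random

def cheating_alice_survives(rounds: int) -> bool:
    """Alice without the passphrase succeeds only if, every round, Bob happens
    to call the path on the side she already started from: a 50/50 guess."""
    return all(random.choice("AB") == random.choice("AB") for _ in range(rounds))

trials = 100_000
wins = sum(cheating_alice_survives(20) for _ in range(trials))
print(wins, "successful cheats out of", trials)   # expect ~0.5**20 per trial
```

With 20 rounds, a cheating Alice has about a one-in-a-million chance per attempt, so even 100,000 attempts will almost certainly all fail. Bob can simply repeat rounds until he is satisfied.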

There are various other properties that are often mentioned as required for a good ZKP. Stack Exchange provides nice and concise definitions for two main ones, soundness and completeness:

  • Soundness: the proof system is truthful, so whatever it proves, it is true. If the proof shows Alice knows the passphrase, she should really know it.
  • Completeness: the proof system is comprehensive, so it can prove all true statements. If Alice knows the passphrase, the system should always show it as true. She cannot both know the passphrase and fail to prove it.

I recommend the Stack Exchange post for more details, if interested.

Cryptographic Example: Matching Sums in Monero

The Ali Baba cave is an interesting ZKP example, but for me it was a bit hard to figure out how this relates to ZKP in blockchains, cryptocurrencies, and similar constructs. For this, I had a look at how Monero uses the Pedersen commitment to hide the amounts in transactions, while at the same time verifying that the output sums match input sums.

Before going into Monero and its use of the Pedersen Commitment, I will present some base concepts: Elliptic Curves (EC), and their use for basic commitments. Followed finally by Pedersen Commitments on Monero.

Elliptic Curves

Elliptic curves (EC) are a mathematical structure commonly used in public-key cryptography. Their mathematical properties make them suitable for creating secure keys and for other cryptographic operations. They are also used in building the Pedersen Commitment in Monero, so we need some basic understanding of them for this article.

EC use in cryptography is referred to as Elliptic Curve Cryptography (ECC). In relation to cryptocurrencies, two commonly used (and standardized) curves are curve 25519 and Secp256k1. Secp256k1 is used by Bitcoin, and 25519 by Monero and many other cryptocurrencies (and other applications). The curves are commonly visualized like this:

Elliptic curves Secp256k1 and 25519 visualized. Image by author.

Generally, the actual elliptic curve is described as a more abstract structure, but this visualization serves as a way to get a bit more concrete idea of what is being discussed. The elliptic curve’s curve points are used as a basis for many operations in ECC. The red dots in the following figures illustrate such points:

Line and curve points over Secp256k1 and 25519. Image by author.

The red lines in these figures also illustrate an important concept behind ECC operations such as adding curve points together: finding a line that connects two points on the curve and intersects the curve at a third point. So, a line and the three points it passes through.

The code I used to plot these base curves (and some of the following examples) is available on my Github.

Basic Commitment

So what is a commitment? We could look at the dictionary for a general definition, but here it refers to defining (binding) a value in a way that it cannot be changed, while hiding it. Since the value cannot be changed, we can say we are committed to it. At a later time, we can reveal the value by providing a secret to open its container. This leads to the two defining factors commonly used for (cryptographic) commitments:

  • hiding: until the committed value is revealed, it cannot be discovered
  • binding: once a commitment is made, its value cannot be changed

There can be different levels of hiding and binding, typically the tradeoff is between being perfectly hiding/binding and computationally hiding/binding. But that is more details than needed for this article at this point.

A typical example of a simple commitment is putting a piece of paper (e.g., a voting ballot), with a number, in a locked box, and providing the key later:

A basic commitment scheme example: Big chef, Little chef, and a ballot commitment scheme. Chefs, chest, and key from Pixabay. Thanks for the art! Combined story image by author.

In the above example, the big chef puts a voting ballot (with number 5 written on it) into the locked chest. The chest is then given to the little chef, who has no key. The big chef has now committed to a vote. It is inside a box, with no visibility, thus it is hidden. It can no longer be changed as the box is locked and given to someone (little chef) who has no key. Thus, the commitment has become binding. But little chef does not yet know what the vote is. At a later time, the key is given to little chef, and the vote (commitment) is revealed.

Basic Commitment with ECC

Like the above example with the chefs, Elliptic Curves can be used to build cryptographic commitment schemes. I will try to illustrate this here, starting from a simple example using a single curve point, and leading to the Pedersen Commitment. I borrow some basic ideas from the example in the Grim documentation. Thanks for the ideas!

As for notation, I will use a to represent the committed value. Mainly because if you read about Monero, this seems to be the general notation used. I guess referring to the committed amount of Monero.

Using a Single Curve Point

A curve point named H is defined and made public. We call this the public base point. This base point is actually so public that it is generally defined in the curve’s standard specification, such as for curve 25519 and for secp256k1. With a as the committed value (amount), the commitment c based on H becomes c = a * H. This c defines another point on the same Elliptic Curve, based on the base point H. The following figure illustrates this process:

Basic commitment with Elliptic Curves, and a simplified visualization. Image by author.

Here, the blue point is the base point H on the elliptic curve (not the real one for 25519, I just picked something for illustration). The commitment value would be the red point, which would equal to a*H, in this case with value a of 5, c=5*H.

Disclaimer: The below figure illustrates the simplified (fake) notation of Elliptic Curve math I use in this article, with values increasingly moving along the curve in the direction of the dotted red line. For example, 5*H in this figure is 5 line segments from H in the direction of the arrow. I use such increments for a more intuitive illustration; real Elliptic Curve math is more complex and behaves differently. I will describe the actual math briefly at the end of this article for anyone interested, with some links to find out more. My simplification here is just for illustrative purposes.

Illustration of the simplified visualization I use, points moving in direction of red line. Image by author.

Curves vs Chefs

Let's consider the above simple curve commitment as the chefs example. The big chef calculates c=5*H, and publishes c. The little chef does not see the value a (5) being committed; he only sees curve point c. To reveal the commitment, the big chef then reveals a=5, and the little chef can verify that a*H=c. Due to the properties of Elliptic Curves and the related mathematics, finding any other value x, other than a, where x*H=c is considered very difficult and computationally infeasible (outside quantum computing). Thus, this commitment is considered computationally binding.

Once a is revealed, the verification of c is trivial. Because H is a public value, the little chef knows it. Because c was published earlier, he can check a*H=c when a is revealed. If it matches, the commitment and the given a are verified.

If we compare this to the chefs example with the chest earlier, the following concepts should match:

  • chest = published (commitment) curve point c
  • ballot = committed value a
  • key = curve point equation a*H. Or just a, since H is known.

The Problem Here

The problem with the above example is that this process is good at binding the commitment value a, but very poor at hiding it. Let's assume we have 10 choices of values we could put on the chefs' ballot, from 1 to 10. It would be trivial to brute-force all these point values: just calculate a*H for every a=1–10, and compare each result to the commitment c.
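The brute force is easy to sketch. To avoid dragging in a real EC library, the following toy uses modular exponentiation in a multiplicative group as a stand-in for the point multiplication a*H; the parameters p and H are made up and far too small for real use:

```python
# Toy stand-in for EC scalar multiplication: exponentiation modulo a prime.
p = 2**32 - 5            # a prime modulus (made up, tiny by crypto standards)
H = 7                    # "base point" stand-in

def commit(a: int) -> int:
    return pow(H, a, p)  # plays the role of a*H on the curve

c = commit(5)            # big chef publishes c, committing to a=5
# With only 10 possible ballot values, little chef can brute-force them all:
recovered = next(a for a in range(1, 11) if commit(a) == c)
print(recovered)         # 5: the unblinded commitment binds, but does not hide
```

The attacker never inverts anything; with a small value space, simply re-running the commitment for every candidate is enough.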

To address this, something called the blinding factor is used.

The Blinding Factor

The blinding factor is commonly referred to as r. This r is just a (secure) random number, generally in the range 1–2²⁵⁶, so very large. A slightly naive approach is to add it to a, as in (r+a)*H. The following figure illustrates this with r=11 and a=5, resulting in the published c=(11+5)*H=16*H:

Commitment to 5, with a blinding factor 11. Image by author.

Here, r is shown as the green dot. With this, there is no way to tell what a is without knowing r. However, this has an added problem: any combination summing to a+r could be revealed as the commitment. Thus this approach loses the binding property, as the commitment could be changed after publishing it. For example, we could reveal any of (a=4, r=12), (a=5, r=11), (a=6, r=10), or many other pairs where the sum of r and a is 16. The commitment needs to be binding.

Pedersen Commitment

Solving this problem finally brings us to the Pedersen Commitment. For this, another base point G on the same curve is used. This leads to the final commitment form of r*G+a*H=c. This is what is called the Pedersen Commitment (with EC). Using my made-up EC notation, we could visualize it like so:

Pedersen commitment on Elliptic Curve 25519, to value 5, and blinding factor 11*G. Image by author.

Now we are back to a computationally binding, and at the same time hiding, commitment. As I mentioned earlier, due to the properties of elliptic curves and the associated mathematics, it is considered extremely difficult to forge a commitment with different values of r and a that would still match the published c=r*G+a*H. And since r and a now use different base points, they cannot be traded against each other as when just summing (a+r)*H above. That's what they tell me, and it's enough for my goal in this article.

Homomorphic sums: Hiding the Values in Monero

Monero applies the Pedersen Commitment to hide the actual amounts in transaction inputs and outputs, while at the same time verifying that the sums of the input and output amounts match. Sounds like sorcery, so let's see how this works.

The homomorphic property of the Pedersen Commitment means that we can actually perform mathematical operations on the encrypted values, and it works just as if the values were not encrypted. In this case, this is based on the homomorphic nature of the Elliptic Curve math. We can simply add curve points together, and compare the results, without ever revealing the actual Monero transaction amounts. This is what the magic of matching Monero sums without knowing the values is based on. But I need concrete examples, so let's make one.
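To sketch the homomorphic matching without a real EC library, the following toy uses a multiplicative group modulo a small prime, where pedersen(r, a) = G^r * H^a (mod p) stands in for r*G + a*H on the curve. All parameters, amounts, and blinding factors are made up; in real Monero the blinding factors are deliberately chosen so that they cancel out between inputs and outputs (and the fee is handled explicitly):

```python
# Toy multiplicative-group analogue of the EC Pedersen Commitment.
# p, G, H, and all amounts/blinding factors are made up for illustration;
# they are nothing like Monero's real parameters.
p = 2**32 - 5
G, H = 5, 7

def pedersen(r: int, a: int) -> int:
    return (pow(G, r, p) * pow(H, a, p)) % p   # stands in for r*G + a*H

# One input of 10 (hidden), split into outputs of 7 and 3 (hidden):
c_in  = pedersen(54, 10)
c_out = (pedersen(30, 7) * pedersen(24, 3)) % p   # "adding" the output commitments

# Because 30 + 24 = 54 and 7 + 3 = 10, the commitments match without
# anyone seeing the amounts themselves:
print(c_in == c_out)   # True
```

A verifier only compares c_in against the combined c_out; the individual amounts never appear anywhere in that check.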

First, a brief description of how cryptocurrency transactions generally work. At a high level, a typical transaction looks something like this:

A cryptocurrency transaction at a high level. Image by author.

Each transaction takes a set of inputs that are spent (TxINs). Spending here means the cryptocurrency in those inputs are transferred to some output addresses on the blockchain, as transaction outputs (TxOUTs). An associated transaction fee is taken by the network, and treated as one of the TxOUTs. In Bitcoin, the transaction inputs and outputs, and their values are visible for anyone to see. In that case, the transaction looks pretty much just like the figure above, for anyone observing the blockchain.

Monero is a cryptocurrency especially focusing on privacy, and does many things to hide various parts of these transactions from the public view. One of those is to hide the amounts being transferred. Monero uses the Pedersen Commitment to hide the transaction amounts (values). In this case, the process now looks something like this:

A (Monero) transaction with amounts hidden. Image by author.

In this case, the input and output values are hidden from the public view. Only the fee is made visible. The only way to see the hidden values is if you have access to the required keys to decrypt the transaction. Thus the values are only visible to the sender and receiver. No other blockchain network nodes, or any other outsider, can see the concrete values that are in the transaction.

However, to maintain a valid ledger of transactions, the system must be able to validate that the total sum of the transaction outputs match its inputs. That is, no bigger amount of Monero is spent than is put in. Otherwise people could just spend some small amount of coins (Monero) in the transaction inputs, and create transaction outputs for much larger sums. This would effectively create Monero out of thin air, and break the entire system. Kind of like the central banks, and the associated monetary system in the “real world”, except we are still waiting for that pyramid to fall ;).

Matching input and output sums

As described earlier, the EC Pedersen Commitments used by Monero rely on two public curve points, G and H. Each transaction input and output (TxIN and TxOUT) has its own Pedersen Commitment, so each one has its own a and r defined, like this:

Example transaction with all commitment values except fee blinding factor. Image by author.

In the above, a always matches the amount of Monero used in the TxIN or TxOUT. As I discussed earlier, the r is basically a large random number. Here I just picked some (small) numbers to illustrate the calculations.

The r in the above figure now becomes the G multiplier, and a the H multiplier. Generally, the numbers (especially r) would be bigger, but these work for an example.

TxIn Commitment

For the TxIN inputs in the above, the commitments would now be (using r and a values from image above):

  • 10$: c = 14*G + 10*H
  • 30$: c = 85*G + 30*H
  • 10$: c = 45*G + 10*H

Using the homomorphic addition property, the total commitment of all the TxIN’s is then:

  • Cin: 14*G + 10*H + 85*G + 30*H + 45*G + 10*H = (14*G + 85*G + 45*G) + (10*H + 30*H + 10*H)
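As a quick sanity check of the bookkeeping, we can sum the multipliers with plain integers (this is just arithmetic on the multipliers, not actual curve math):

```python
# (r, a) pairs for the three TxINs in the figure above
txin = [(14, 10), (85, 30), (45, 10)]

r_sum = sum(r for r, _ in txin)   # total G multiplier
a_sum = sum(a for _, a in txin)   # total H multiplier
print(f"C_in = {r_sum}*G + {a_sum}*H")   # -> C_in = 144*G + 50*H
```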

TxOut Commitment

For the TxOUTs, the commitments would be:

  • 8$: c = 33*G + 8*H
  • 40$: c = 28*G + 40*H
  • (fee) 2$: c = x*G + 2*H

The blinding factor for the fee is still undefined (x) for now. To calculate x, let’s sum the other parts first.

Using homomorphic addition, the total commitment of all the TxOUT’s is:

  • 33*G + 8*H + 28*G + 40*H + x*G + 2*H = (33*G+28*G+x*G) + (8*H + 40*H + 2*H)

Visualizing the Amount on the Elliptic Curve

It is perhaps simpler to look at the a and r calculations here separately (for points H and G respectively). First the a and H:

  • TxIN: 10*H + 30*H + 10*H = 50*H
  • TxOUT: 8*H + 40*H + 2*H = 50*H

Matching the sum of input vs output amounts, via their commitments. Image by author.

Both of these end up at the same final curve point, 50*H. The final curve point is the part that is published in the transaction (as part of the commitment, combined with the blinding factor part r*G). The homomorphic nature allows comparing the final points to see that the input and output sums match. This gives the binding property, as finding other matching a and r values would be computationally infeasible, while still allowing the amounts to be matched.

Next, we need to add all the blinding factors to gain the hiding property.

Visualizing the Blinding Factor on the Elliptic Curve

Just as for the amounts and H above, we can do the same for the blinding factors and G:

  • TxIN: 14*G+85*G+45*G=144*G
  • TxOUT: 33*G+28*G+x*G=61*G+x*G

As noted above, the x in these equations is the r for the fee. Using my simplified EC visualization approach, it looks like this without the fee:

Initial setup of the random blinding factors for the Pedersen Commitment. Image by author.

Now we also need to make the TxIN and TxOUT curve points match here. The TxIN final curve point is 144*G, so we need to make the TxOUT final curve point match that.

Matching the TxIN curve to the TxOUT Curve Using Fee Blinding Factor

As you may have guessed, we can match the TxIN and TxOUT curve points for G like this:

  • r for fee (x above) = (sum of TxIN multipliers for G) – (sum of TxOUT multipliers for G, without fee) = (14+85+45)–(33+28)=83

The commitment for the fee then becomes:

  • 2$: c = 83*G + 2*H
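The same integer bookkeeping confirms the fee blinding factor:

```python
txin_r = [14, 85, 45]    # TxIN blinding factors (G multipliers)
txout_r = [33, 28]       # TxOUT blinding factors, without the fee

fee_r = sum(txin_r) - sum(txout_r)           # x = 144 - 61
assert fee_r == 83
assert sum(txout_r) + fee_r == sum(txin_r)   # both G sums are now 144
```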

Filling in the last missing spot (x) in the transaction, we get:

Using the fee blinding factor to balance the curve point sums. Image by author.

And when we plug this into the total output G calculation and figure above, we get the following:

  • TxIN: 14*G+85*G+45*G=144*G
  • TxOUT: 33*G+28*G+x*G=61*G+(144–61)*G=61*G+83*G=144*G

And we can now see that both input and output end up at the same curve point for G just like for H:

Using the fee to match the sums (and EC end points) of Pedersen Commitment blinding factors. Image by author.

The total summed commitments are now:

  • TxIN: 144*G + 50*H
  • TxOUT: 144*G + 50*H

We can now subtract the summed input and output commitments, and check if they match the commitment to zero.

Commitment to Zero

To prove that the total transaction inputs match the outputs, we can compare the sum of their commitments to a commitment to zero. Besides my bank account balance, a commitment to zero here simply means:

  • z = 0*G + 0*H

And if we calculate the difference of the input and output commitments from the above, we get:

  • TxIN commitment (c_in): 144*G + 50*H
  • TxOut commitment (c_out): 144*G + 50*H
  • c_in-c_out: (144*G+50*H) – (144*G+50*H) = 0*G + 0*H
  • This is the same as commitment to zero above (z = 0*G + 0*H)
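The whole check can be sketched end to end in Python on a tiny textbook curve (y^2 = x^3 + 2x + 2 mod 17, generator order 19; the curve, the choice of H, and the small numbers are toy assumptions for illustration only): build every commitment from the worked example, sum the inputs and outputs, and verify that the difference is the point at infinity, i.e. the commitment to zero.

```python
# Toy curve y^2 = x^3 + 2x + 2 (mod 17); NOT secure, illustration only.
P, A, N, G = 17, 2, 19, (5, 1)

def ec_add(p1, p2):
    """Add two curve points (None represents the point at infinity)."""
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (s * s - x1 - x2) % P
    return (x3, (s * (x1 - x3) - y1) % P)

def ec_mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    acc = None
    k %= N
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc

def ec_neg(pt):
    """Negate a point by reflecting it across the x-axis."""
    return None if pt is None else (pt[0], (-pt[1]) % P)

H = ec_mul(3, G)   # toy second base point (its relation to G must be unknown in real use)

def commit(a, r):
    """Pedersen Commitment c = r*G + a*H."""
    return ec_add(ec_mul(r, G), ec_mul(a, H))

# (amount, blinding factor) pairs from the worked example
txin = [(10, 14), (30, 85), (10, 45)]
txout = [(8, 33), (40, 28), (2, 83)]   # fee blinding factor x = 83

c_in = c_out = None
for a, r in txin:
    c_in = ec_add(c_in, commit(a, r))
for a, r in txout:
    c_out = ec_add(c_out, commit(a, r))

# c_in - c_out is the identity: the commitment to zero.
assert ec_add(c_in, ec_neg(c_out)) is None
```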

And so we have successfully proven that our total sum of inputs matches the total sum of outputs, without revealing anything about the actual transaction amounts. This holds because only the single, final curve points are included in the transaction data. There is no r or a in any form included in the public commitment. The only thing included is the final curve point, which is the sum 144*G + 50*H.

This should reveal no information about the amounts and, I believe, provides a practical example of a zero knowledge proof built from more cryptographic constructs. It proves the input and output amounts match, but tells you nothing (you gain zero knowledge) about the actual amounts.

The code I used to make the above calculations, and to verify that the difference of the input and output commitments matches the commitment to zero, is available on my Github.

I am still no in-depth expert on cryptography, but for anyone interested in reading more on homomorphic encryption and its relation to ZKP, a good start seems to be this post on Stack Exchange (as usual..). However, a brief look into the actual EC math is in order after all the simplified examples above.

EC Addition and Multiplication, Achtually

In all the above examples of elliptic curve addition and multiplication, I showed the points simply moving along the curve by x line segments, where x was the base point multiplier (x*G or x*H). For example, for 5*G I simply moved the point 5 line segments forward on the curve, starting from G. This is not how real EC math works.

First, a simple addition of two points. Adding two points on an Elliptic Curve gives you a third point. The following figure illustrates this in four steps:

Adding two points to get a third (or fourth..) point on an Elliptic Curve. Image by author.

In the above figure the four steps are the following:

  • Step 1: Define the points you want to add (P1+P2).
  • Step 2: Find a line connecting these two points.
  • Step 3: The line should always intersect the curve at a third point. Find that third point (P3).
  • Step 4: Reflect P3 across the x-axis to find P4. P4 is now the result of adding P1+P2. So P1+P2=P4 in the above figure.
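The four steps above translate into the usual slope-and-intersection formulas. Here is a sketch over a tiny prime field (the mod-17 textbook curve y^2 = x^3 + 2x + 2; a toy choice for illustration, not a real cryptographic curve):

```python
P, A = 17, 2   # toy curve y^2 = x^3 + 2x + 2 (mod 17)

def ec_add(p1, p2):
    """Chord rule: line through P1 and P2, third intersection, reflection."""
    if p1 is None: return p2              # None = point at infinity
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                       # vertical line, no third point
    if p1 == p2:                          # tangent case (see next section)
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P
    else:                                 # step 2: slope of the connecting line
        s = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (s * s - x1 - x2) % P            # step 3: third intersection point
    y3 = (s * (x1 - x3) - y1) % P         # step 4: reflect across the x-axis
    return (x3, y3)

# Two points on the toy curve, added together:
print(ec_add((5, 1), (6, 3)))   # -> (10, 6)
```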

Why is the elliptic curve math defined like that? I have no idea; none of the materials I read explained that part. But for this topic, it is enough for me to have a good idea of what is happening.

Elliptic Curve Cryptography takes a slightly specialized approach to this. Instead of adding two different points together, a public base point is defined and added to itself N times. Using G to describe a base point, the following figure again illustrates this process in another four steps:

Adding a base point to itself on an Elliptic Curve. Image by author.

Again, G here is just a random point I picked for illustration, not any official point. In the above figure, the four steps are now:

  • Step 1: Find the defined public base point G. Since we are adding G to itself, it is used as both P1 and P2 from the previous example on adding P1+P2. Instead of P1+P2, we now do G+G.
  • Step 2: Find the line that passes through P1 and P2, or G and G in this case. For a single point alone, you could draw any line that passes through it (infinitely many possible lines). Someone smart has decided that the tangent line is to be used: the line that just touches the curve at point G.
  • Step 3: Find the point where the tangent line crosses the curve, this matches P3 from the previous example.
  • Step 4: Reflect P3 across the x-axis to get P4, similar to the previous example. This results in G+G, or 2*G. Continue adding G to this to get 3*G, or add 2*G to itself to get 4*G. Repeat until you get the point you desired. For example, in the Monero example above I used values such as 14*G, 85*G, and so on. Just iterate this process to get to the x in x*G.
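Repeating that addition is exactly what x*G means. A sketch, reusing the same toy mod-17 curve from before (a real implementation would use double-and-add instead of this naive loop):

```python
P, A = 17, 2   # toy curve y^2 = x^3 + 2x + 2 (mod 17)
G = (5, 1)

def ec_add(p1, p2):
    """Add two curve points (None represents the point at infinity)."""
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (s * s - x1 - x2) % P
    return (x3, (s * (x1 - x3) - y1) % P)

def ec_mul(k, pt):
    """Naive k*pt: add pt to itself k times, as in the steps above."""
    acc = None
    for _ in range(k):
        acc = ec_add(acc, pt)
    return acc

assert ec_mul(2, G) == ec_add(G, G)                        # 2*G = G + G
assert ec_mul(5, G) == ec_add(ec_mul(2, G), ec_mul(3, G))  # 2*G + 3*G = 5*G
```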

Additionally, there are other details, such as the use of prime numbers and modular arithmetic, for cases where the results would not fit in the N bits used. This gets called EC over finite fields, and other such fancy terms. However, the level I presented in this article is enough for a basic understanding: to read the blockchain articles, understand the topics, and possibly experiment a bit with them.

While all this sounds a bit complicated, luckily other people have already implemented it in different libraries. For prototyping the topics in this article, I used an elliptic curve library for Golang, where the operations and calculations are implemented and verified for me. But I do find it very useful to understand what I am doing in those operations, and why.

As I noted before, you can find some example code where I use a library for these calculations on my Github.

Computational vs Information Theoretical

Earlier I discussed how a commitment should be both hiding and binding. Until the committing party reveals the committed value, it should remain hidden. And it should not be possible to change the committed value after committing to it. In other words, the commitment should be binding. I believe I already mentioned the terms information theoretically hiding, and computationally binding.

The Pedersen Commitment is considered to be information-theoretically hiding, meaning that no matter how much computing power you have, it is not possible to determine the original committed value with certainty if it is not revealed. As far as I understand, this is because there are always multiple possible (a, r) pairs that could fulfill the commitment.

For this same reason, the Pedersen Commitment is considered to be only computationally binding. Because there are multiple possible values that could fulfill the EC equations, with sufficient computing power you could find a different (and thus "false") opening that would satisfy the verifier. In practice, for the Pedersen Commitment, finding such an opening requires solving the EC discrete logarithm problem, which is considered infeasible without quantum computing.
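This tradeoff is easy to demonstrate in a toy setting. Below I deliberately pick H = 3*G, so the discrete log relating H and G is known, and the same commitment point can then be opened to two different amounts. The existence of such alternative openings is what makes the commitment information-theoretically hiding; the fact that finding them requires the discrete log of H is what makes it only computationally binding. A sketch on the same tiny mod-17 textbook curve (toy parameters throughout):

```python
# Toy curve y^2 = x^3 + 2x + 2 (mod 17); generator G has order 19.
P, A, N, G = 17, 2, 19, (5, 1)

def ec_add(p1, p2):
    """Add two curve points (None represents the point at infinity)."""
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (s * s - x1 - x2) % P
    return (x3, (s * (x1 - x3) - y1) % P)

def ec_mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    acc = None
    k %= N
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc

K = 3              # toy "trapdoor": H = K*G. Must be unknown in real systems!
H = ec_mul(K, G)

def commit(a, r):
    """Pedersen Commitment c = r*G + a*H."""
    return ec_add(ec_mul(r, G), ec_mul(a, H))

a1, r1 = 10, 14                   # the honest opening
a2 = 5                            # a different amount...
r2 = (r1 + K * (a1 - a2)) % N     # ...compensated through the blinding factor

# Same curve point, two "valid" openings: perfect hiding, but broken binding.
assert commit(a1, r1) == commit(a2, r2)
```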

This appears to be a common tradeoff between the hiding and binding properties: perfect (information-theoretical) hiding results in computational binding, and perfect binding results in computational hiding.

Back to the Roots: What is Zero Knowledge Proof?

After the lengthy detours, back to the original question. What is Zero Knowledge Proof? Looking back at the Ali Baba example, and the use of Pedersen commitment in Monero, I will try to summarize my take.

  • We know a lot about the problem, and are happy to share much of that knowledge.
  • We have a specific piece of information we want to keep secret.
  • We wish to share zero knowledge of that secret, but prove we know it, or that something related to it is true.
  • The proof comes from applying a specific process according to the domain in question. This results in a (statistically) strong enough proof that we consider it verified.
  • There are likely some properties that need to be assumed true, such as having true sources of randomness, or lack of Quantum computing.

In the Ali Baba cave, the process is that of Alice always arriving via the correct path to Bob, repeated so many times that the combined probability of Alice always picking the right path by chance is incredibly small. The assumptions include no cheating (no path to bypass the door in the cave), and no broken randomness (you cannot predict which path Bob will call).

In the Monero proof of matching input and output amounts via Pedersen Commitments, the process relies on the homomorphic calculations holding true. While this case is only computationally binding, the associated discrete logarithm problem is considered so hard that the chance of "faking" the proof is also incredibly small. The assumptions here include no solution to the discrete logarithm problem (i.e., no quantum computing), and again true sources of randomness (so you cannot guess r).


There are plenty of examples of the Ali Baba cave, and resources such as Zero to Monero that are filled with Greek symbols and formulas. I wrote this piece to find for myself the middle ground: what ZKP means, how conceptual examples like Ali Baba translate to the cryptographic world, and what they mean at a higher level. I hope this will be useful to someone else. At least I feel I got better answers to my questions.

This article also allowed me to get familiar with the Pedersen Commitment, which I had often seen mentioned but wondered about the real meaning of. In the end, I find that ZKP, like most topics, is not too complicated once you get past the terminology. It is just a lot of new concepts, which takes time.

In the case of Monero and the way it verifies and anonymizes its transactions, there are also various other features I have not touched on here. These include range proofs (bulletproofs in the current implementation) to ensure the Pedersen Commitments always use positive numbers, ring signatures to hide the sender addresses, stealth addresses to hide the recipients, key images to prevent double spending, and (Diffie-Hellman) key exchanges to communicate the r and a of the Pedersen Commitment from sender to receiver. But these are out of the scope of this article.

Just to repeat the general disclaimer: I am not a cryptographer, so do your own research if you really need to understand the details. I hope this was useful for someone anyway. Cheers.

Just to remind once more, the code I used to experiment for this article is available in my Monero Scraper Github project, at the time of writing under src/golang/pedersen and src/python/experiments/ellipticcurves.