What does it mean to test an algorithm? Let’s play

Recently I had a discussion about what it means to test an algorithm, or what it means to do algorithm test engineering. I couldn’t quite come up with a convincing definition for myself. At a basic level, maybe you figure out the basic rules of the algorithm, give it some inputs, and check the outputs. But I believe there is more to it, including algorithm fit for purpose, fit to data, evaluating alternatives, adapting the algorithm to the problem and system at hand, designing and running experiments, learning, and so on.
To get a better idea about all this, I explore the topic in this article by analyzing and testing two algorithms and seeing where it takes me. I start with more classical algorithms, where the inputs, outputs, and their relations are clearly defined. In a follow-up article, I hope to look at a machine learning based one, where the definition of correct results is not so clear.
This story was originally published on Medium, where it can still be found.
Choice of Algorithm
Since my goal was to explore the testing and analysis of algorithms, I needed to pick some algorithms to work with. Luckily, Wikipedia provides an extensive list of algorithms. I picked two classics: binary search and Levenshtein edit distance. I warm up with a basic test and analysis of binary search, followed by a somewhat broader look at Levenshtein.
In general, I find that testing an algorithm without a broader context misses some key points of the overall process. So I set up an application for the Levenshtein algorithm to give my testing and analysis exploration some context: comparing edit distances over RNA strings, more specifically motivated by the structure of the COVID-19 virus.
Binary Search
First, binary search. It is an algorithm designed for efficient search over a sorted list of items, in logarithmic execution time. For my experiments, I use the IMDB names dataset, and a binary search Python implementation (adapted from a set of Stack Overflow answers).
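The exact implementation is not essential here, but to make the discussion concrete, here is a minimal sketch of what the two search functions could look like (my own simplified version, not the exact code adapted from Stack Overflow):

```python
def linear_search(items, target):
    """Scan the list from the start; return the index of the first match, or -1."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1


def binary_search(items, target):
    """Search a *sorted* list by repeatedly halving the search range; return an index, or -1."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```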
Understanding the Algorithm
When testing an algorithm, I first need to understand what I am working with. To gain some intuition for this, I start with a look at linear search vs binary search. Linear search is a very basic search that simply loops over the list in order until it finds a match, and it provides a baseline reference here. To illustrate the difference, and what the binary search algorithm is about, here is a look at a set of 10 names from the IMDB dataset:

I picked the last 10 in the dataset, because they looked like a good random mix. Binary search requires the input to be sorted, so let’s do that. Sorted and re-indexed, these 10 look like this:

To build a better understanding of linear and binary search, I created illustrations on applying them to the above list to find “Lu Bevins”:

Linear search always starts from the beginning of the list and loops through it until it finds a match. The list can be in any order. Binary search repeatedly splits the list in two and checks the middle value for a match, repeating this on smaller and smaller chunks until it finds a match or runs out of elements to check.
Binary search is a well-known and simple algorithm, but the above illustrates how I would go about trying to understand an algorithm. I find that concrete examples and diagrams help me build an understanding. Access to a (domain) expert to help build and check these really helps, especially with lesser-known algorithms.
The above example list of 10 is very small, and does not really illustrate the benefit of binary search. With larger sizes, the comparison between linear and binary search should look like this (expecting logarithmic growth for binary search and linear growth for linear search):

Such assumptions are good to check, so I will do that next.
Testing Computational Complexity
A common property of interest when testing and analyzing algorithms is computational complexity. This refers to the resources (processing time, memory, etc.) the algorithm uses at different scales. Here I focus on processing time, although a similar analysis could be performed for other resources.
To explore the scalability of binary search, I sampled the IMDB names dataset for 10 to 1000 names, in increments of 10, and executed both linear and binary search on random items in these sampled lists.
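As a rough sketch of how such a measurement loop could be set up (a hypothetical helper, reusing the search functions sketched above; the actual experiments used my codeprofile library for the profiling):

```python
import random
import time


def time_searches(names, sample_sizes, repeats=100):
    """Time linear vs binary search on sorted random samples of different sizes."""
    results = {}
    for size in sample_sizes:
        sample = sorted(random.sample(names, size))  # binary search needs sorted input
        linear_time = binary_time = 0.0
        for _ in range(repeats):
            target = random.choice(sample)
            start = time.perf_counter()
            linear_search(sample, target)
            linear_time += time.perf_counter() - start
            start = time.perf_counter()
            binary_search(sample, target)
            binary_time += time.perf_counter() - start
        results[size] = (linear_time, binary_time)
    return results


# Example: sample sizes 10, 20, ..., 1000, as described above
# (all_names here is a hypothetical list holding the dataset's names).
# timings = time_searches(all_names, list(range(10, 1001, 10)))
```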
To gain more statistically confident results, I ran the experiments 100 times for each of these lists. I used my codeprofile library for execution time profiling. The following figures illustrate the measured average execution times over the 100 runs at each sample size vs the theoretical, expected time:

The above shapes show that the measured curves follow the expected logarithmic execution time growth (binary search) and linear growth (linear search). So the measurements match the theory (the scales differ between the figures due to parameters; I was only interested in the shapes).
1000 items is still a very short list for real performance measurements. Running the same experiment for 10 to 1 million samples (in 10x increments) shows the effect and benefit of binary search more clearly:

In the above figure, with a list size of 1 million, the difference is so big that the binary search line looks flat compared to linear search. Based on these simple experiments, I would say the assumptions about computational complexity hold.
Linear and binary search are well-known and well-studied algorithms, so their complexity is no surprise. As such, these experiments simply demonstrate how to evaluate it in practice. I have found that evaluating the practical computational complexity of actual implementations can produce surprises, especially in different use cases for custom algorithms from 3rd parties.
Testing Input-Output Relations
Knowing the computational complexity is good, but we also need assurance that the implementation works as intended. For binary search, basic tests for finding selected names, boundary values, category partitioning, etc. are good examples of this. I call these base assurance tests.
To scale this up, I used the test generator from the above 10 to 1 million computational complexity analysis, and added the following check to it:
- for every item, linear and binary search find the same index.
I call this an expected/assumed invariant over the data/algorithm. I was expecting the two algorithms to give the same results for the same input, and that the assertion would simply be there for extra assurance. However, the assertion failed: the indices given by linear and binary search were not always the same, even for the same input.
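As a sketch, the added invariant check could look something like this (hypothetical, again reusing the search functions sketched earlier):

```python
def check_same_index_invariant(names):
    """Assumed invariant: linear and binary search return the same index for every item."""
    sorted_names = sorted(names)
    for name in sorted_names:
        linear_index = linear_search(sorted_names, name)
        binary_index = binary_search(sorted_names, name)
        # This assumption turns out to be wrong when the data contains duplicate names.
        assert linear_index == binary_index, (
            f"{name!r}: linear={linear_index}, binary={binary_index}"
        )
```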
After some thought on what could cause this difference, and how the algorithms work, I figured there might be duplicate names in the dataset. Here is a quick check to see if this is the case (concatenated to two rows):

The above figure shows “Robert Ellis” and “Harrison Ford” as having duplicates. To check a bit deeper, here is a look at the name “Harrison Ford”:

There are four people listed with the same name “Harrison Ford” in this dataset. Having established there are duplicate names, some summary metrics would be of interest to see how many there are:

The above figures show that a total of 853911 (row count) distinct names have duplicates. The one with the most duplicates is “David Smith”, repeating 340 times. What does this mean for the mismatching search results? Due to how the two algorithms work, searching for “David Smith” would likely result in both linear and binary search returning a “David Smith”, but different ones. Linear search always returns the first one in the list, while binary search can return any one of the duplicates.
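As a sketch, this kind of duplicate analysis can be done with pandas along the following lines (the file and column names here are assumptions about the dataset layout, not the exact code I used):

```python
import pandas as pd

# File and column names are assumptions about the dataset layout.
names = pd.read_csv("imdb_names.csv")["name"]

# Count occurrences of each name and keep only the duplicated ones.
counts = names.value_counts()
duplicates = counts[counts > 1]

print(f"{len(duplicates)} distinct names have duplicates")
print(duplicates.head(10))  # most duplicated names first, e.g. "David Smith"
```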
To me, this illustrates how testing and analysis of an algorithm helps to better understand both the algorithm and the data it is applied to, and how it is good to assert and test your assumptions about the data and the algorithm. Having a clear goal and performing this process systematically should help.
Besides invariants over the algorithm output, one can also consider them over the input. For example, binary search expects its input to be sorted. The choice is whether to limit the scope of testing and expect the environment (the overall system) to enforce this, or to expect the algorithm implementation to handle it. I would refer to this as defining the algorithm (test) scope.
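As a small illustration of that scope choice, assuming the binary_search sketch from earlier: the question is whether a guard like the one below belongs inside the implementation, or whether the surrounding system is trusted to always provide sorted input:

```python
def binary_search_checked(items, target):
    """Variant that enforces the sorted-input invariant itself instead of trusting the caller."""
    if any(items[i] > items[i + 1] for i in range(len(items) - 1)):
        raise ValueError("binary search requires sorted input")
    return binary_search(items, target)
```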
Levenshtein Edit Distance (with RNA search)
As a second example, and to gain some broader insight into the topic, I look at the Levenshtein edit distance algorithm here. Edit distance is also sometimes referred to as approximate string matching. It refers to the number of edits (character changes) required on a string to convert it to a specific (target) string. Known applications include providing spelling suggestions, approximate string search queries, and DNA/RNA sequence comparison.
Compared to the single search operation of binary search, the classic Levenshtein algorithm supports three operations: insert, delete, and substitute (a character in a string). The edit distance it gives is the minimum number of these operations required. There is also a variant called Damerau-Levenshtein adding a transposition operation, but to keep this example simpler, I will use the classic Levenshtein here.
Some examples:
- test -> tent: one substitution of s->n = Levenshtein score 1
- Levenstein -> Levenshtein: one insertion of h = score 1
- Levenshtein -> Levenstein: one deletion of h = score 1
- bad -> base: one substitution of d -> s + one insertion of e = score 2
- bad -> based: two insertions of s + e = score 2
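To make these examples concrete, here is a minimal sketch of the classic dynamic programming (Wagner-Fischer style) solution, used here only to check the example scores; the actual experiments below use the weighted-levenshtein library instead:

```python
def levenshtein(source, target):
    """Minimum number of insert, delete, and substitute operations to turn source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # delete
                dist[i][j - 1] + 1,                      # insert
                dist[i - 1][j - 1] + substitution_cost,  # substitute (or match)
            )
    return dist[-1][-1]


assert levenshtein("test", "tent") == 1
assert levenshtein("Levenstein", "Levenshtein") == 1
assert levenshtein("Levenshtein", "Levenstein") == 1
assert levenshtein("bad", "base") == 2
assert levenshtein("bad", "based") == 2
```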
Defining the Example Application
As I noted before, I will look at Levenshtein from the viewpoint of an example application, to give it some context that my binary search example lacked. As Levenshtein has been applied in RNA analysis, I use RNA search as the application. Specifically, the COVID-19 virus, which is RNA based. There is plenty of information available on it, and it gives me a chance to play with longer RNA strings on a timely topic. As with binary search, I start with research to understand the application domain as well as the algorithm applied.
The COVID-19 RNA sequence is described as having a length of 29881 “characters”, consisting of a four-letter alphabet representing its chemical bases. One part of the COVID-19 structure that is considered specifically interesting is the spike protein. This interest comes from the role of the spike protein in infecting human cells.
The spike protein is described as a 3831 “character” long sub-sequence of the overall virus RNA. Wikipedia describes the Omicron variant as having 60 mutations from the original Wuhan variant, 32 of which are in the spike protein.
So in this example I look to apply Levenshtein on sequences of length 3831 (possibly up to 29881). Following the Omicron mutations, I will use up to 60 edits as the maximum in my testing. As with binary search, I am defining the algorithm (test) parameters based on domain knowledge and expected use.
DISCLAIMER: I have not done in-depth research on RNA search, I am sure it is far more advanced, but this works as a basic example for this article.
Experiment Setup
I used the Python weighted-levenshtein library to run my experiments. My base assurance tests include handling an empty string, strings of different lengths, invalid characters, a set of selected string mutations, and so on.
To scale up the testing, I again built a simple input generator to generate input strings and apply the algorithm’s supported operations (insert, delete, substitute) on them. This generated random strings of a given length, applied the selected number of operations to the string, and calculated the distance score for the modified vs the original (unmodified) string, at the same time measuring execution time and checking a set of output invariants:
- The score given by the algorithm should always match the number of operations applied.
- Flipping the compared strings, e.g., car->cars to cars->car, should always give the same score, since the operations are effectively reversible.
My test executor checked these invariants for every generated (test) execution of the algorithm.
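A sketch of what such a generator and its invariant checks could look like (hypothetical helper names, reusing the levenshtein sketch from above in place of the weighted-levenshtein library call used in the real experiments):

```python
import random
import time

ALPHABET = "ABCD"  # stand-in for the four-letter RNA alphabet


def random_string(length):
    return "".join(random.choice(ALPHABET) for _ in range(length))


def substitute(s, count):
    """Substitute characters at 'count' distinct random positions."""
    chars = list(s)
    for pos in random.sample(range(len(s)), count):
        chars[pos] = random.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)


def generate_and_check(length, edits):
    """Generate a string, apply edits, time the score calculation, and check the invariants."""
    source = random_string(length)
    target = substitute(source, edits)
    start = time.perf_counter()
    score = levenshtein(source, target)  # weighted-levenshtein's distance in the real setup
    elapsed = time.perf_counter() - start
    # Invariant 1: the score matches the number of applied edits (this turns out to be wrong, see below).
    assert score == edits, f"expected score {edits}, got {score}"
    # Invariant 2: flipping the compared strings gives the same score.
    assert levenshtein(target, source) == score
    return elapsed
```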
To support the base assurance tests, I first used the input generator to build a large number of strings of relatively short length (5–10), with a varying number (0–9) of operations applied. This gave me increased confidence, higher position coverage, and easier-to-debug results (thanks to the limited length) for score calculations, string boundaries, grouped operations, etc. It also showed me that my assumption behind the first of the two invariants above was wrong.
What I found is that, when running the algorithm over enough iterations, random chance will find cases where a target string built from multiple substitution operations can be reached with a smaller combination of insert, delete, and substitute operations. Here is an example case:

The above shows the source string, and the target string after making 3 substitutions at randomly selected indices, with randomly selected characters. Remember, my target was COVID-19 RNA search, with an alphabet of 4 characters (ABCD). The following substitutions have been made: index 3: A->B, index 4: C->A, index 5: D->C. Expecting the Levenshtein score to match the number of edit operations (3 substitutions here), this should give a score of 3.
However, instead of giving a score of 3, the above gives a score of 2. After some analysis, I realized the same target can be reached with one insert and one delete, giving the minimal score of 2:

With this knowledge, I had to disable the failing invariant in the checks. Instead, I opted to collect statistics on the scores achieved with different numbers of operations applied. I could then check that the overall score distribution was not too far off from the number of operations applied, even if not exactly the same. The second invariant (flipping strings) held for all generated tests.
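The generated strings themselves are not important, but the effect is easy to reproduce with a small hand-made example (hypothetical strings chosen to mirror the substitution pattern described above, reusing the levenshtein sketch from earlier):

```python
# Source string, and the target after three substitutions at indices 3, 4, and 5:
# A->B, C->A, D->C (mirroring the pattern described above).
source = "BBBACD"
target = "BBBBAC"

# The same target is also reachable with only two operations:
# insert 'B' at index 3 and delete the trailing 'D'.
print(levenshtein(source, target))  # prints 2, not 3
```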
I find this is an example of what one might call exploratory algorithm testing. Take any assumptions on how the algorithm works, encode them as invariants (or whatever works), and generate further tests to see if they hold, learning about the algorithm and its domain along the way.
After successfully running these smaller tests, and feeling confident I had sufficiently covered my assumptions, I increased the generator string sizes to test computational complexity.
Computational Complexity: Setup
For this evaluation, I used the input generator to produce strings of length 100 to 5000 in intervals of 250 (100, 350, 600, 850, … up to 4850). The target string length I set earlier was 3831 characters, which gets covered here with a wide margin. For statistical confidence, I repeated each test size (100 to 4850) 100 times.
For each generated string, I applied the following operations 1, 3, 9, and 60 times (60 being the Omicron limit defined above):
- Substitute characters in the string at random (non-overlapping) locations
- Insert random characters at a random location in the test string
- Delete random characters in the test string
- Combinations of the above, all at the same time
My goal was to see whether the different algorithm operations, their number, or the string length would affect the computation time. Or more generally, to test the algorithm’s operations and parameters and see how they affect its execution time.
Computational Complexity: Measurements
The results for the experiments described above were very similar in terms of performance. The following figure illustrates the substitution tests and their overall execution time:

Substitute_x refers to substituting X characters in the source string and calculating the Levenshtein score. The x-axis in the above is the string size, from 100 to 4850. The y-axis is the time it took to run the 100 experiments.
The four lines in the above graph are practically overlapping, as the execution time was so close in each case. Since this seemed a bit suspicious, I ran multiple experiments separately with varying parameters to see if this held true, which it did. A bit strange, but OK. Having a domain expert and a team to bounce thoughts off would have been really nice.
I omit the figures for the rest of the operations, their size variants, and combinations here. They were all very close, indicating that the execution time had little to no dependence on the type or number of operations.
The above curve clearly grows much faster than linearly, resembling an exponentially growing curve. To check this, I experimented with a few parameters to try to visualize a matching form of an exponential curve. Here is the final measured execution time plot vs a theoretical plot for exponential time:

From this, the growth is clearly much worse than linear. To be precise, the classic Levenshtein algorithm runs in O(n*m) time, i.e., quadratic for strings of similar length rather than truly exponential, and over a bounded input range a fitted exponential curve can look very similar to a quadratic one. Wikipedia actually has a few fancy words on the Levenshtein computational complexity being high. My verdict is that these results match the theoretical expectation: this algorithm scales poorly for longer inputs.
Testing an Adaptation of the Algorithm
Sometimes we test and analyze an algorithm, and realize it is not a great fit for the need. But we may wish to try an alternative, or an adaptation of the algorithm, to address the issues. Let’s look at one example here.
This kind of superlinear execution time growth is generally considered bad for scalability. For shorter strings (e.g., spell checking a word or a shell command) this is probably not an issue, as the effect is negligible on short inputs. However, in my RNA story I wanted to search and analyze sequences of length 3831+. For this, investigating possible speedups would seem useful.
Let’s say we decide to try to make this faster by running the algorithm on smaller slices of the input. This should make the execution time grow roughly linearly with input length instead, since each fixed-size slice has a bounded cost. In the end, we just need to rank the search results relative to each other, not necessarily compute an exact score. And once the top results are known, one could calculate exact scores for that subset. Since this is a hypothetical example, I call this a hypothetical development idea :).
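A sketch of this sliced scoring could look as follows (hypothetical helper, reusing the levenshtein sketch from earlier; the real experiments again used the weighted-levenshtein library):

```python
def sliced_score(source, target, slice_size=100):
    """Approximate the edit distance by summing scores over fixed-size slices."""
    total = 0
    for start in range(0, max(len(source), len(target)), slice_size):
        total += levenshtein(source[start:start + slice_size],
                             target[start:start + slice_size])
    return total
```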
Here is an execution time graph for splitting the generated input strings into slices of 100 characters and summing up their scores:

From the execution time perspective, the above graph looks much better. The steep growth is gone, turned linear. But how about the results: how large is the difference in score between calculating it for the full string vs summing up the scores for slices of 100? The following tables illustrate this:

In the above table, s_x refers to substituting X characters in the string. Size is the total string length. The /full postfix refers to running the Levenshtein algorithm on the full string. The /100 version uses slices of 100 characters. Each configuration was repeated 100 times to get statistical coverage, which is why, for example, s_1/full has a score of 100 (100 runs, each scoring 1).
For this substitution example, the sliced version’s scores are very close to the full version, perhaps because a substitution is a local change and rarely affects multiple slices. There are a few rows in the table where the summed score has a difference of 1–2 edits (6000 vs 5998 and 5999). This is because of the issue I noted above, where insert and delete can work together to find a smaller minimum. If the substitution operation were enough, this would seem a plausible adaptation. But the goal was to support all operations.
In the tables below, the insert operation uses a prefix of i, and delete a prefix of d. So i_x refers to inserting X characters, and d_x to deleting X characters:

The above insert table shows how the i_x/full score always matches the number of operations (*100), as expected. With the i_x/100 slices, the score grows as the string length increases and as the number of operations applied increases (from 1 to 3, 9, and 60). After some analysis, I concluded this is because inserting a character near the beginning of the string shifts the content of all following slices to the right, causing each of those slices to require additional edits and inflating the summed score.
The following figures illustrate this issue:


The above is an example with a slice size of 3 instead of 100, but the concept is the same. Depending on the location of the insert, the edit propagates towards the end and increases the sum by a large amount: for each slice of 3 to match, it has to remove the first character and add the last one. The same holds for the delete operation:

The above delete table shows very similar behaviour to the insert table, and for the same reasons: a delete shifts all slices left, where an insert shifted them right.
So, while the slicing approach would be successful in cutting down the algorithm processing time, it would definitely not work to properly rank the search results for insert and delete. From the application perspective, I would consider this adaptation a failed experiment, except for learning.
However, for the purposes of this article, I find this is a good experiment. It shows how one might run tests on an algorithm, analyze its results, its fit for purpose, and other properties, build hypotheses, implement experiments, evaluate them, and iterate. Optimally, this is how I would see algorithm testing and analysis contributing to the overall development more broadly, helping to design and evaluate experiments for improved algorithms and adaptations.
Concluding the Levenshtein Search Experiment
In a more realistic scenario, I hope I would be working with a team of experts, and have the resources to research all the possible approaches, the state of the art, and everything else on the subject. In fact, when the algorithm and its application domain are complex, and the opportunity is there, I would consider this an essential part of the testing and analysis process: working with domain experts and an R&D team, and researching the domain knowledge. Here I limit the scope, as my resources are limited.
Summary
My goal with this article was to explore the idea of what algorithm testing (and analysis) could be. A summarizing list makes it simpler to remember:
- Traditional testing techniques can define base assurance tests with expert defined inputs and expected outputs
- Building a good, in-depth understanding of the algorithm helps understand how to test it, and how to adapt it to a domain
- This includes building a similarly good understanding of the data it is applied to, and how this relates to the algorithm
- Ideally, the above works iteratively, in interaction with research, development, and testing
- Besides verification, algorithm testing can contribute to understanding an algorithm's limits, finding potential optimizations, and comparing alternatives
- Identifying assumptions about the data and algorithm input describes what the algorithm is expected to work with
- Exploratory data analysis can use these assumptions as input, check if they hold, and refine them
- Identifying assumptions about the data and the algorithm output gives a basis to write invariants to check in every test
- An automated test generator helps scale testing with these invariants, and check whether the assumptions hold
- The scope of testing relates to defining the scope of the algorithm vs the overall system using it, and the responsibilities for input and output handling
- Knowing the theoretical computational complexity is good, but it is worth evaluating in practice whether it holds for the implementation and the available data
- Algorithm testing can be a tool for exploring the algorithm, by formulating hypotheses about it and using tests to evaluate them
- Tools and techniques, such as metamorphic testing can help evaluate robustness of the algorithm to different types of valid and invalid inputs
- Testing and analysis is optimally an iterative process, where the end result is the final generated tests and checks, and all the learning on the way
Conclusions
I started this article with the idea to explore what it means to test an algorithm, or what it could mean to do “algorithm test engineering”. I like to think I made some progress, although I am sure the definition can be subjective, much like the “goodness” of a (machine learning) algorithm result.
Both my examples in this article, binary search and Levenshtein edit distance, are quite simple and basic algorithms in the end. As noted in this article, the basic testing of such algorithms is not too complicated. However, considering algorithm testing and analysis as part of a broader research and development process, I believe the interactions, collaborations, research and development contributions can make it more diverse and interesting.
In this article, I looked at two “classic” algorithms, where the input, output, and their relations are quite straightforward to define. In a follow-up article, I will look at machine learning based algorithm(s): an example where the input-output relations and the correctness of the output are harder to define, subjective, or “undefinable” in the traditional sense. Until then.
That’s all for now. Cheers.