Techniques for Testing Autonomous Cars and other ML-Based Systems
Testing machine learning (ML)-based systems requires different approaches compared to traditional software. With traditional software, the specification and its relation to the implementation is typically quite explicit: “When the user types a valid username and matching password, they are successfully logged in”. Very simple to understand, deterministic, and easy to write a test case for.
ML-based systems are quite different. Instead of clearly defined inputs and logical flows based on explicit programming statements, a ML-based system is based on potentially huge input spaces with probabilistic outcomes from largely black-box components (models). In this article, I take a look at metamorphic testing, which is a technique that has become increasingly popular to address some of the ML-based systems testing challenge. I will go through some of the latest research, and present examples from different application domains.
Metamorphic Testing (MMT) was originally proposed quite a while back, at least up to (Chen1998). Having worked a long time with software testing research, I always viewed MMT as a curiosity with few real use cases. With ML-based systems, however, it seems to have found its niche nicely.
The general idea of MMT is to describe the system functionality in terms of generic relations between inputs, the generic transformations of those inputs and their outputs, rather than as mappings of specific inputs to specific outputs.
One typical example used for metamorphic testing in the past has been from testing search engines (e.g., Zhou2016). As search engines are these days practically natural language processing (NLP)/ML-based systems, they also fit the topic of this article well. To illustrate the concept, I ran two queries on Google (in October 2020):
The first query is just one word “car”. The second query adds another word to the first query, “autonomous”. So the query now becomes “autonomous car”. This addition of a restrictive search keyword is an example of an input transformation (or “morphing”, in the spirit or metamorphic testing):
And to perform a check on the test results (a test oracle), we define a matching relation that should hold when the input transformation is applied:
In this case, adding the restrictive search term (“autonomous”) to the previous query (“car”) changes the result set, restricting it to a smaller set. From 8.3 billion results to 159 million. The metamorphic test would not specify these exact values, but rather the relation “restricting the query leads to fewer search results”. And one could generate several (seed) inputs (search queries), associated restrictive keywords for transformations, and run the query and check the metamorphic relation holds (restricting the query produces fewer results). For more details on MMT with search engines, see (Zhou2016).
The above is an example of what metamorphic testing refers to. You transform (morph) your inputs in some way, while at the same time defining a relation that should hold from the previous input (and its output) to the new morphed input (and its output). The key concepts / terms are:
- morph/transform: modify a seed input in a way that your defined metamorphic relations should hold
- metamorphic relation: the defined transformation of the input should have a known/measurable effect on the output. Checking that this relation holds after the transformation is the test oracle of metamorphic testing. (Test oracle is a general term for a mechanism to give a verdict on test result)
- seed inputs: the inputs that are used as initial inputs for the tests, to be transformed. If you know the output of the seed input, you may use it to define a stricter relation (output should be correct). But even without the seed output, you can still define a relation check, but it might be a bit more relaxed (output should be similar but you don’t know if it is correct).
More generally metamorphic testing refers to defining such transformations, and observing their impact (metamorphic relations) on the result. The effectiveness and applicability then depends on how well and extensively these can be defined. I will present more concrete examples in the following sections.
Why would you want to use metamorphic testing? I will illustrate this with an example for autonomous cars. Autonomous cars are recently going through a lot of development, getting a lot of funding, have safety-critical requirements, and are highly dependent on machine-learning. Which is maybe also why they have received so much attention in MMT research. Makes great examples.
For example, the Tesla Autopilot collects data (or did when I was writing this..) from several front-, rear-, and side-cameras, a radar, and 12 ultrasonic sensors. At each moment in time, it must be able to process all this data, along with previous measurements, and come up with reasoning fulfilling highest safety-standards. Such real-world input-spaces are incredibly large. Consider the two pictures I took just a few days apart, near my previous office:
Just in these two pictures there are many variations visible. Snow/no snow, shadows/no shadows, road markers / no markers, connecting roads visible, parking lots visible, other cars, and so on. Yet in all such conditions one would be expected to be able to navigate, safely. To illustrate the problem a bit more, here are some example variants in that domain that quickly come to mind:
Besides these, one can easily expand this to different locations, road shapes, object types, bridges, trains, … Other sensors have other considerations, every location is different, and so on.
In different domains of ML-based system applications, one would need to be able to identify similar problem scenarios, and their relevant combinations, to be able to test them. Manually building test sets to cover all this is (for me) an unrealistic effort.
Metamorphic Testing with Autonomous Cars
Metamorphic testing can help in better covering domains such as the above autonomous cars problem space. As the interest is high, many approaches for this have also been presented, and I will describe a few of those here.
Covering Image Variations
The DeepTest work in (Tian2018) uses transformations on real images captured from driving cars to produce new images. In this case, the metamorphic attributes are:
- Seed inputs: Real images from car cameras.
- Metamorphic transformations: moving, tilting, blurring, scaling, zooming, adding fog, adding rain, etc. on he original images
- Metamorphic relation: the autonomous driving decisions should show minimal divergence on the same input images after the transformations.
The following illustrates this with some simple examples using the road image from outside my previous office. In the following, I “transformed” the image by simply rotating the camera a bit at the location. I then added the arrows to illustrate how a system should “predict” a path that should be taken. The arrow here is manually added, and intended to be only illustrative:
And the same, but with the snowy ground (two transformations in the following compared to the above; added snow + rotation):
Of course, no-one would expect to manually create any large number of such images (or transformations). Instead, automated transformation tools can be used. For example, there are several libraries for image augmentation, originally created to help increase training dataset sizes in machine learning. The following illustrates a few such augmentations run on the original non-snow image from above:
All these augmented / transformed images were generated from the same original source image shown before, using the Python imgaug image augmentation library. Some could maybe be improved with more advanced augmentation methods, but most are already quite useful.
Once those transformations are generated, the metamorphic relations on the generated images can be checked. For example, the system should propose a very similar driving path, with minimal differences across all transformations on acceleration, steering, etc. Or more complex checks if such can be defined, such as defining a known reference path (if such exists).
Again, this process of transformation and checking metamorphic relations is what MMT is about. It helps achieve higher coverage and thus confidence by automating some of the testing process for complex systems, where scaling to the large input spaces is otherwise difficult.
GAN-based Transformations with MMT
A more advanced approach to generate input transformations is to apply different ML-based techniques to build the transformations themselves. In image augmentation, one such method is Generative Adversarial Networks (GANs). An application of GANs to autonomous cars is presented in (Zhang2018). In their work, GANs are trained to transform images with different weather conditions. For example, taking a sunny image of a road, and transforming this into a rainy or foggy image.
The argument is that GAN generated weather effects and manipulations are more realistic than more traditional synthetic transformations. (Zhang2018) uses the NVidia UNIT (Liu2017) toolkit to train and apply the GAN models, using input such as YouTube videos for training.
Images illustrating the GAN results are available on the UNIT website, as well as in higher resolution in their Google Photos album. I recommend having a look, it is quite interesting. The smaller images on the UNIT website look very convincing, but looking more closely in the bigger images in the photo albums reveals some limitations. However, the results are quite impressive, and this was a few years ago. I expect the techniques to improve further over time. In general, using machine learning to produce transformations appears to be a very promising area in MMT.
Besides cameras, there are many possible sensors a system can also use. In autonomous cars, one such system is LIDAR, measuring distances to objects using laser-based sensors. A study of applying metamorphic testing on LIDAR data in the Baidu Apollo autonomous car system is presented in (Zhou2019).
The system first identifies a region of interest (ROI), the “drivable” area. It then identifies and tracks objects in this area. The system consists of multiple components:
- Object segmentation and bounds identification: Find and identify obstacles in ROI
- Object tracking: Tracking the obstacles (movement)
- Sequential type fusion: To smooth the obstacle types over time (make more consistent classifications over time by using also time related data)
The (Zhou2019) study focuses on metamorphic testing of the object identification component, specifically on robustness of classification vs misclassification in minor variations of the LIDAR point cloud. The LIDAR point cloud in this case is simply collection of measurement points the LIDAR system reports seeing. These clouds can be very detailed, and the number of measured points very large (Zhou2019).
The following figures illustrates this scenario (see (Zhou2019) for the realistic LIDAR images from actual cars, I just use my own drawings here to illustrate the general idea. I marked the ROI in a darker color, and added some dots in circular fashion to illustrate the LIDAR scan. The green box illustrates a bigger obstacle (e.g., a car), and the smaller red box illustrates a smaller obstacle (e.g., a pedestrian):
The metamorphic relations and transformations in this case are:
- Metamorphic relation: same obstacles (objects) should be identified both before and after adding small amounts of noise to the LIDAR point cloud.
- Transformation: add noise (points to the LIDAR point cloud)
- Seed inputs: actual LIDAR measurements from cars
The following figure illustrates this type of metamorphic transformation, with the added points marked in red. I simply added them in a random location, outside the ROI in this case, as this was the example also in (Zhou2019):
The above is a very simple transformation and metamorphic relation to check, but I find often the simple ones work the best.
In summary, the MMT approach here takes existing LIDAR data, and adds some noise to it, in form of added LIDAR data points. In relation to the real world, such noise is described in (Zhou2019) as potentially insects, dust, or sensor noise. The amount of added noise is also described as a very small percentage of the overall points, to make it more realistic.
The metamorphic experiments in (Zhou2019) show how adding a small number of points outside the ROI area in the point cloud was enough to cause the classifier (metamorphic relation check) to fail.
As a result, (Zhou2019) report discussing with the Baidu Apollo team about their findings, getting acknowledgement for the issues, and how the Baidu team incorporated some of the test data into their training dataset. This can be a useful approach, since metamorphic testing can be seen as a way to generate new data that could be used for training. However, I think one should not simply discard the tests in either case, even if re-using some of the data for further ML-model training. More on this later.
Metamorphic Testing of Machine Translation
Not everyone works on autonomous cars, so examples from other domains are important for broader insight. Outside autonomous vehicles, testing of automated language translations with ML-based NLP techniques has received some attention in recent years (for example, He2020, Sun2020). I found the (He2020) paper to be especially clear and sensible, so I use their approach as an example of the metamorphic properties for translation testing here:
- Seed inputs: Sentences to be translated
- transformation: replace words with specific part of speech (POS) tag in the input sentence, with another word that has the same POS tag. For example, a verb with another verb. Finally, use another NLP model (Google’s BERT in this case) to “predict” a suitable replacement candidate word.
- metamorphic relation: the structure of the transformed output should match the original translation output sentence structure for the original input. Large deviations indicate potential errors. The test oracle metric is the difference in output sentence structures for the automated translation on the original input vs the transformed input.
Here is an illustrative example using Google Translate, and a sentence I picked (at the time) from this article. Translating that sentence from English to Finnish:
The above shows the metamorphic transformation and how the check for the defined metamorphic relation should hold. In this case the sentence structure holds fine (in my opinion as a native Finnish speaker) and the result is good. I performed these experiments manually to illustrate the concept, but the test process is the same whether automated or not. Overall, trying a few different sentences, Google Translate actually worked very well. Great for them.
To be honest, I did not really use BERT in the above example, since it was just one example I needed to illustrate the concept. I just picked a word that makes sense (to me). However, HuggingFace has really nice and easy to use implementations available of BERT and many other similar models if needed. I have used them myself for many other tasks. Much like the image augmentation libraries in the car example, the NLP libraries have come a long way, and many basic applications are quite simple and easy these days.
For more details on MMT for machine translation, I recommend checking the papers, especially the (He2020) is quite readable. An extra interesting point here is again the use of another ML-based approach to help in building the transformations, similar to the GAN-based approaches for autonomous cars.
Metamorphic Testing of Medical Images
As an example of a third application domain, applying metamorphic testing to ML-based systems in the medical domain is presented in (Ding2017). This uses MMT to test variants of existing high-resolution biological cell images.
In (Ding2017), a number of metamorphic relations are defined related to various aspects of the biological cells (mitochondria etc.) in the images, and the manipulations done to the image. I lack the medical domain expertise to analyze the transformations or metamorphic relations in more detail, and the paper does not very clearly describe these for me. But I believe my lack of understanding is actually a useful point here.
Metamorphic testing related elements in this case (as far as I understood):
Seed inputs: existing medical images (actually, the paper is very unclear on this along with many other aspects, but it serves as a domain example)
Transformations: Adding, removing, transforming, etc. of mitochondria in the images.
Metamorphic relations: The relations between the elements (mitochondria) in the algorithm outputs for the transformed images should match the defined relation (e.g., linking some elements after adding new ones).
This example highlights, for me, how in many cases, the nuances, the metamorphic relations and transformations require an in-depth domain understanding. This requires extensive collaboration between different parties, which is quite common (in my experience) in applying ML-based approaches. Cars, driving, and language translation are everyday tasks we are all familiar with. Many expert domains, such as in this example, less so. This is why I think this is a useful example in highlighting my lack of domain expertise.
Interestingly, (Ding2017) also mentions using traditional testing techniques such as combinatorial testing, randomization, and category-partitioning, to enhance the initial input seed set. This is also the case in the following example on drones.
Metamorphic Testing of Drones
As a final example domain, an approach of combining model-based testing, simulation, and metamorphic testing for testing an ML-based flight guidance systems of autonomous drones is presented in (Lindvall2017).
The drone is defined as having a set of sensors, including barometer, GPS, cameras, LIDAR, and ultrasonic. Many sensors, quite similar the autonomous cars example. The metamorphic relations defined for the drone control:
- behaviour should be similar across similar runs
- rotation of world coordinates should have no effect
- coordinate translation: same scenario in different coordinates should have no effect
- obstacle location: same obstacle in different locations should have same route
- obstacle formation: similar to location but multiple obstacles together
- obstacle proximity: always within defined bounds
- drone velocity: velocity should stay inside defined bounds
- drone altitude: altitude should stay inside defined bounds
Following are properties of such systems metamorphic testing environment:
- Seed inputs: generated using the model-based approaches based on an environment model for the simulation
- Transformations: See above; rotation and coordinate changes of drone vs environment and obstacles or obstacle groups, etc
- Checks: See also above
A test environment generator is used to define (simulated) test environments for the drone, effectively generating the seeds of the metamorphic tests. The metamorphic transformations can be seen as modifications of this environment, and finally checks test that the above defined metamorphic relations hold. Various scenarios are defined to hold these together, including lift-off, returning home, landing, etc.
Perhaps the most interesting part here is the use of model-based testing approaches to build the seed inputs themselves, including the test environment. This seems like a very useful approach for gaining further coverage in domains where this is possible.
Another relevant observation in this is the use of scenarios to group elements together to form a larger test scenario, spanning also time. This is important, since a drone or a car, or many other systems, cannot consider a single input in isolation, but rather must consider a sequence of events, and use it as a context. This time aspect also needs to be taken into account in metamorphic testing.
Adversarial Inputs and Relations Across Time
A specific type of transformation that is often separately discussed in machine learning is that of adversarial inputs, which is extensively described in (Goodfellow2018). In general, an adversarial input aims to trick the machine learning algorithm to make a wrong classification. An example from (Goodfellow2018) is to fool an autonomous car (surprise) to misclassify a stop sign and potentially lead to an accident or other issues.
Generating such adversarial inputs can be seen as one example of a metamorphic transformation, with an associated relation that the output should not change, or change should be minimal, due to adversarial inputs.
Typically such adversarial testing requires specifically tailored data to trigger such misclassification. In a real-world driving scenario, where the car sensors are not tampered with, it might be harder to produce such purely adversarial effects. However, there are some studies and approaches, such as (Zhou2020) considering this for real-world cases. More on this in a bit.
Beyond autonomous cars, digitally altered or tailored adversarial inputs might be a bigger issue. For example, in domains such as cyber-security log analysis, or natural language processing, where providing customized input data could be easier. I have not seen practical examples of this from the real world, but I expect once the techniques mature and become more easily available, more practical sightings would surface.
Much of the work on adversarial elements, such as (Goodfellow2018), have examples of adversarially modified single inputs (images). Real systems are often not so simple. For example, as a car drives, the images (as well as other sensor data), and the decisions that need to be made based on that data, change continuously. This is what the (Zhou2020) paper discusses for autonomous cars.
Relations Across Time
In many cases, besides singular inputs, sequences of input over time are more relevant for the ML-based system. Driving past a sign (or a digital billboard..), the system has to cope with all the sensor data at all the different positions in relation to the environment over time. In this case, the camera viewing angle. For other sensors (LIDAR etc), the input and thus output data would change in a similar manner over time.
Following is an example of what might be two frames a short time apart. In a real video stream there would be numerous changes and images (and other inputs) per second:
Not only does the angle change, but time as a context should be more generally considered in this type of testing (and implementation). Are we moving past the sign? Towards it? Passing it? Did we stop already? What else is in the scene? And so on.
This topic is studied in (Zhou2020), which considers it from the viewpoint of adversarial input generation. In a real-world setting, you are less likely to have your image data directly manipulated, but may be susceptible to adversarial inputs on modified traffic signs, digital billboards, or similar. This is what they (Zhou2020) focus on.
The following example illustrates how any such modification would also need to change along with the images, over time (compared to calculating a single, specific altered input vs real-world physical data moving across time):
This temporal aspect is important in more ways than just for adversarial inputs. For example, all the image augmentations (weather effects, etc) I discussed earlier would benefit from being applied in a realistic driving scenario (sequences of images) vs just a single image. This is what the cars have to deal with in the real world after all.
The test oracle in (Zhou2020) also considers the effect of the adversarial input from two different viewpoints: strength and probability. That is, how large deviations can you cause in the steering of the car with the adversarial changes, and how likely it is that you can cause these deviations with the adversarial input.
Beyond cars and video streams, time series sequences are common in other domains as well. The drone scenarios discussed are one example. Other examples include processing linked paragraphs of text, longer periods of signal in a stock market, or basic sensor signals such as temperature and wind speed.
Minimizing the Test Set
While automating metamorphic testing can be quite straightforward (once you figure your domain relations and build working transformations…), the potential input space from which to choose, and the number of transformations and their combinations can quickly grow huge. For this reason, test selection in MMT is important, just as with other types of testing.
One approach to address this is presented in (Tian2018), which applies a greedy search strategy. Starting with a seed set of images and transformations, the transformations and their combinations are applied on the input (images), and the achieved neuron activation coverage is measured. If they increase coverage, the “good” combinations are added back to the seed set for following rounds, along with other inputs and transformations, as long as they provide some threshold of increased coverage. This iterates until defined ending thresholds (or number of experiments). Quite similar to more traditional testing approaches.
Coverage in (Tian2018) is measured in terms of activations of different neurons in the ML model. They build coverage criteria for different neural network architectures, such as convolutional neural nets, recurrent neural nets, and dense neural nets. Various other coverage criteria also have been proposed, that could be used, such as one in (Gerasimou2020) on evaluating the importance of different neurons in classification.
When more and easily applicable tools become available for this type of ML-model coverage measurement, it would seem a very useful approach. However, I do not see people generally writing their own neural net coverage measurement tools.
Relation to Traditional Software Testing
Besides test suite optimization, it is important to consider MMT more broadly in relation to overall software and system testing. MMT excels in testing and verifying many aspects of ML-based systems, which are more probabilistic and black-box in nature. At least to gain higher confidence / assurance in them.
However, even in ML-based systems, the ML-part is not generally an isolated component working alone. Rather it consumes inputs, produces outputs, and uses ML models for processing complex datasets. The combinatorial, equivalence partitioning, and model-based methods I mentioned earlier are some examples of how the MMT based approaches can be applied together with the overall, more traditional, system testing.
As I mentioned with the Baidu Apollo case and its LIDAR test data generation, one of the feedbacks was to use the metamorphic test data for further ML training. This in general seems like a useful idea, and it is always nice to get more training data. In my experience with building ML-based systems, and training related ML-models, everyone always wants more training data.
However, I believe one should not simply dump all MMT test data into the training dataset. A trained model will learn from the given data, and can be tested for general accuracy on a split test set. This is the typical approach to test a specific ML-model in isolation. However, in practice, the classifications will not be 100% accurate, and some items will end up misclassified, or with low confidence scores. These further feed into the overall system, which may have unexpected reactions in combination with other inputs or processes. Running specific (MMT based or not) tests with specific inputs helps highlight exactly which data is causing issues, how this behaviour changes over time, and so on. If you just throw your MMT tests into the training set and forget it, you lose the benefit of this visibility.
Besides MMT, and complimentary to it, other interesting approaches of tailoring traditional testing techniques for ML-based system testing exist. One specific approach is A/B testing (evaluating benefits of different options). In ML-based systems, this can also be a feedback loop from the human user, or operational system, back to testing and training. The Tesla Shadow Mode is one interesting example, where the autonomous ML-based system makes continuous driving decisions, but these decisions are never actually executed. Rather they are compared with the actual human driver choices in those situations, and this is used to refine the models. Similar approaches, where the system can learn from human corrections are quite common, such as tuning search-engine results and machine translations, based on human interactions with the system. You are changing / morphing the system here as well, but in a different way. This would also make an interesting seed input source for MMT, along with oracle data (e.g., driving path taken by human user) for the metamorphic relation.
Testing machine learning based systems is a different challenge from more traditional systems. The algorithms and models do not come with explicit specifications of inputs and outputs that can be simply tested and verified. The potential space for both is often quite huge and noisy. Metamorphic testing is one useful technique to gain confidence in their operation with a reasonable effort. Compared to traditional testing techniques, it is not a replacement but rather a complimentary approach.
I presented several examples of applying MMT to different domains in this article. While applications in different domains require different considerations, I believe some generally useful guidelines can be derived to help perform MMT over ML-based systems:
- metamorphic transformations: these do not have to be hugely complex, but rather simple ones can bring good benefits, such as the addition of a few random points to the LIDAR cloud. Consider how the same input could change in its intended usage environment, and how such change can be implemented with least (or reasonable) effort as a transformation.
- metamorphic relations: to build these relations, we need to ask how can we change the ML input, and what effect should it have on the output? Sometimes this requires deep domain expertise to identify most relevant changes, as in the medical domain example.
- test oracles: These check that the performed transformation results in a acceptable (vs valid) output. Requires considerations such as how to represent the change (e.g., steering angle change, sentence structural change), possibly defining the probability of some error, the severity of the error, and a distance metric between the potential outputs after transformation (e.g., steering angle calculation). That is, the values are likely not fixed but in a continuous range.
- time relation: in many systems, the inputs and outputs are not singular but the overall system performance over time is important. This may also require asking the question of how time might be impacting the system, and how it should be considered in sequences of metamorphic relations. The idea of overall test scenarios as providers of a broader context, time related and otherwise, is useful to consider here.
- test data: can you use the user interactions with the system as an automated source of test inputs for transformations and metamorphic relations? Think Tesla Shadow mode, Google search results, and the inputs from the environment and the user, and use reactions to these inputs.
As discussed with some of the examples, an interesting trend I see is the move towards using ML-based algorithms to produce or enhance the (MMT-based) tests for ML-based systems. In the NLP domain this is shown by the use of BERT as a tool to build metamorphic transformations for testing natural language translations. In the autonomous cars domain by the use of GAN-based networks to create transformations between image properties, such as different weather elements and time of day.
Overall the ML field still seems to be advancing quite fast, with useful approaches already available also for MMT, and hopefully much more mature tooling in the next few years. Without good tool support for testing (data generation, model coverage measurement, etc), finding people with all this expertise (testing, machine learning, domain specifics, …), and implementing it all over again for every system, seems likely to be quite a challenge and sometimes a needlessly high effort without good support in tools and methods.
That’s all for today, this got way too long, so if someone managed to read all this far, I am impressed :). If you have experiences in testing ML-based systems and willing to share, I am interested to hear and learn in the comments 🙂