Testing Machine Learning Applications (or self-driving cars, apparently)


Testing machine learning (ML) based applications, or ML models, requires a different approach from more "traditional" software testing. Traditional software is controlled by manually defined, explicit, control-flow logic. Or to put it differently, someone wrote the code, it can be looked at, read, and sometimes even understood (including what and how to test). ML models, and especially deep learning (DL) models, implement complex functions that are automatically learned from large amounts of data. The exact workings are largely a black box, the potential input space huge, and knowing what and how to exactly test with them is a challenge. In this post I look at what kind of approaches have been taken to test such complex models, and consider what it might mean in the context of broader systems that use such models.

As I was going through the different approaches, and looking for ways in which they address testing ML models in general, I found that most (recent) major works in the area are addressing self-driving / autonomous cars. I guess it makes sense, as the potential market for autonomous cars is very large, and the safety requirements very strict. And it is a hot market. Anyway. So most of the topics I cover will be heavily in the autonomous cars domain, but I will also try to find some generalization where possible.

The Problem Space

Using autonomous cars as examples, most autonomous cars use external cameras as (one of) their inputs. In fact, Tesla seems to use mainly cameras, and built a special chip just to be super-effective at processing camera data with (DL) neural nets. Consider the input space of such cameras and systems as an example of what to test. Here are two figures from near the office where I work:

Dry vs Snow

The figure-pair above is from the same spot on two days, about a week apart. Yes, it is in May (almost summer, eh?), and yes, I am waiting for all your epic offers for a nice job to work on all this in a nice(r) weather. Hah. In any case, just in these two pictures there are many variations visible. Snow/no snow, shadows/no shadows, road markers or not, connecting roads, parking lots, other cars, and so on.

More generally, the above figure-pair is just two different options from a huge number of different possibilities. Just some example variants that come to mind:

environment related:

  • snowing
  • raining
  • snow on ground
  • puddles on the ground
  • sunny
  • cloudy
  • foggy
  • shadows
  • road blocked by construction
  • pedestrians
  • babies crawling
  • cyclists
  • wheelchairs
  • other cars
  • trucks
  • traffic signs
  • regular road markings
  • poor/worn markings
  • connecting roads
  • parking lots
  • strange markings (constructions, bad jokes, …)
  • holes in the road
  • random objects dropped on the road
  • debris flying (plastic bags etc)
  • animals running, flying, standing, walking, …

car related

  • position
  • rotation
  • camera type
  • camera quality
  • camera position
  • other sensors combined with the camera

And so on. Then all the other different places, road shapes, objects, rails, bridges, whatever. This all on top of the general difficulty of reliably identifying even the basic roads, pavements, paths, buildings, etc. in the variety of the real world. Now imagine all the possible combinations of all of the above, and more. And combining those with all the different places, roads, markings, road-signs, movements, etc. Testing all that to some reasonable level of assurance so you would be willing to sign it off to "autonomously drive". Somewhat complicated.

Of course, there are other ML problems in multitude of domains, such as natural language processing, cybersecurity, medical, and everything else. The input space might be slightly more limited in some cases, but typically it grows very big, and has numerous elements and combinations, and their variants.

Neural Nets

In some of the topics later I refer to different types of neural network architectures. Here is just a very brief summary of a few main ones, as a basis, if not familiar.

Fully Connected / Dense / MLP

Fully connected (dense) networks are simply neural nets where every node in each layer is connected to every node in the previous layer. This is a very basic type of network, and illustrated below:


I took this image off the Internet somewhere, trying to search for it, there seem to be a huge number of hits on it. So No idea of the real origins, sorry. The "dense" part just refers to all neurons in the network being connected to all the other neurons in their neighbouring layers. I guess this is also called a form of multi-layer perceptron (MLP). But I have gotten more used to calling it Dense, so I call it Dense.

The details of this model are not important for this post, so for more info, try the keywords on internet searchs. In this post, it is mainly just important to understand the basic structure, where neuron coverage, or similar references are later mentioned.


Convolutional neural networks (CNNs) aim to automatically identify local features in data, typically in images. These features start from higher level patterns (e.g., "corners" but can be any abstract for the CNN finds useful) to more compressed ones as the network progresses towards classification from the raw image. The CNNs work by sliding a window (called "kernel" or "filter") across the image, to learn ways to identify those features. The filters (or their weight numbers) are also learned during training. The following animation from Stanford Deeplearning Wiki illustrates a filter window rolling over an image:


In the above animation, the filter is sliding across the image, which is represented in this case by some numbers of the pixels in the image. CNN networks actually consist of multiple such layers among with other layers. But I will not go into all the details here. For more info, try the Internet.

CNNs can be used for anything that can have similar data expressivity and local features. For example, they are commonly applied to text using 1-dimensional filters. The idea of filters comes up with some coverage criteria leter, which is why this is relevant here.


Another type of common network type is recurrent neural networks (RNNs). Whereas a common neural network does not consider time aspect, recurrent neural nets loop over data in steps, and use some of the previous steps data as input also as input to following steps. In this sense, it has some "memory" of what it has seen. They typically work better when there is a relevant time-dimension in the data, such as analyzing objects in a driving path over time, or stock market trends. The following figure illustrates a very simplified view of an unrolled RNN with 5 time-steps.


This example has 5 time-steps, meaning it takes 5 values as input, each following one another in a series (hence, "time-series"). The X1-X5 in the figure are the inputs over time, the H1-H5 are intermediate outputs for each time-step. And as visible in the network, each time-step feeds to the next as well as producing the intermediate output.

Common, more advanced variants of this are Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Again, try some tutorial searches for details.

As with CNN, the relevance here is when some of these concepts (e.g., unrolling) come up with coverage criteria later, and maybe as some background to understand how they work. For example, how a car system might try to predict driving angles based on recent history of visual images it observes over time, not just a single frame.


A generative adversarial network (GAN) is a two-part network, where one part generates "fake" images, and another part tries to classify them as correct or fake. The idea is simply to first build a classifier that does a good job of identifying certain types (classes) of images. For example, rainy driving scenes. Like so:

Classifier base

Once this classifier works well, you train another network that tries to generate images to fool this classifier. The network that generates the "fake" images is called the generator, and the one that tries to tell the fakes from real is called the discriminator. In a sense the discriminator network provides the loss function for the generator. Meaning it just tell how well the generator is doing, and how to do better. Like so:

Generator NW

So the generator learns to generate realistic fakes by learning to fool the discriminator. In a way, one could describe such a model to "style" another image to a different style. But GANs are used for a variety of image generation and manipulation tasks. When the fake-generator becomes good enough to fool the good classifier, you have a working GAN model. The "adversarial" part refers to the fake-generator being an adversary for the classifier (discriminator) part.

The following image illustrates a real GAN transformation performed by UNIT from summery image (real) to wintery image (fake):


It is a nice looking transformation, very realistic. In the context of this post, this pair provides very useful test data, as we will see later. Here is what it might look like if it was able to generate the snowy road I put in the beginning of this post vs the dry road on different days:


So the real world is still a bit more diverse, but the generated figures from UNIT are already very impressive. I have no doubt they can be of use in testing ML system, as I will discuss later in this post. The same approach also works beyond images, as long as you can model the data and domain as a GAN problem. I believe it has been applied to generate sounds, music, text, .. probably much more too.

This GAN information is mainly relevant, for this post, in terms of having some basic understanding of what a GAN is as this comes up in some of the works later in this post. On generating different types of driving scenarios for testing. For more details on the topic, I refer to the mighty Internet again.

Getting (Realistic) Data

A fundamental aspect in applying ML, and especially DL, is getting as much high quality data as you can. Basically, the more the merrier. With more data, you can train bigger and better models, get higher accuracy, run more experiments and tests, and keep improving. Commonly this data also needs "labels" to tell when something is happening in the data, to enable the ML algorithms to learn something useful from it. Such as steering a car in response to the environment.

How to get such data? Autonomous car development is, as usual, a good example here.

Car companies have huge fleets of vehicles on the road, making for great data collection tools. Consider the Tesla Autopilot as an example. At the time of writing this (May 2019), the Tesla Autopilot website describes the system as having:

  • 8 surroung cameras with up to 250M visibility
  • 12 ultrasonic sensors
  • forward facing radar (advertised as "able to see through heavy rain, fog, dust and even the car ahead")

Imagine all this deployed in all their customers cars all the time, recording "free" training data for you. Free as in you as the car manufacturer (or I guess "developer" these days) does not pay people to drive and collect data, but rather the customers pay the manufacturer for the privelege to do so. Potentially millions of customers.

In the Tesla Autonomy day, Andrej Karpathy (famous ML guy, now director of AI at Tesla), describes the Tesla approach as first starting with a manually collected and labeled training set. Followed by using their fleet of cars to identify new scenarios, objects, anything of interest that the system should learn to handle better. Dispatching requests to all their cars out there to provide images and data on such scenarios if/when identified.

They also describe how they use the sensor data across different sensors to train the system better. That is, using the radar measurements to give labels for how far identified objects are, to train the vision system to recognize the distance as well.

In a more traditional software testing terminology, this might be called a form "online testing", or more generally just continous monitoring, learning, and improvement. As I noted above, Tesla even introduced their specialized DL processing chips for their cars called "Tesla HW3", to enable much more efficient and continous processing of such ML models. Such systems will also be able to continously process the data locally to provide experiences on the usefulness of the ML model (compared to user input), and to help build even better "autonomy".

Taking this further, there is something called Tesla Shadow Mode, a system that is enabled all the time when human driver is driving the car themselves or with the help of the autopilot. The (ML) system is constantly trained on all the data collected from all the cars, and runs in parallel when the car is driving. It does not make actual decisions but rather "simulated" ones, and compares them to how the human driver actually performed in the situation. Tesla compares the differences and uses the data to refine and analyze the system to make it better.

This makes for an even more explicit "online testing" approach. The human driver is providing the expected result for the test oracle, while the current autopilot version provides the actual version (output). So this test compares the human driver decisions to the autonomous guidance system decisions all the time, records differences and trains itself to do better from the differences. Test input is all the sensor data from the car.

Beyond Tesla, there seems to be little information to be found on how other Car companies do this type of data collection. I guess maybe because it can be one of the biggest competitive edges to actually have access to the data. However, I did find a description from few years back on the Waymo system for training their autonomous cars. This includes both an advanced simulation system (called CarCraft), as well as a physical site for building various test environments. Data is collected from actual cars driving the roads, new and problematic scenarios are collected and analyzed, modelled and experimented with extensively in the simulations and the physical test environment. Collected data is used to further train the systems, and the loop continues. In this case, the testing includes real-world driving, physical test environments, and simulations. Sounds very similar to the Tesla case, as far as I can tell.


What about the rest of the ML world besides autonomous cars? One example that I can provide for broader context is from the realm of cybersecurity. Previously I did some work also in this area, and the cybersecurity companies are also leveraging their information and data collection infrastructure in a similar way. In the sense of using it to build better data analysis and ML models as a basis for their products and services.

This includes various end-point agents such as RSA NetWitness, TrendMicro, Carbon Black, Sophos Intercept, and McAfee. Car companies such as Tesla, Waymo, Baidu, etc. might use actual cars as endpoints to collect data. In a similar way, pretty much all cybersecurity companies are utilizing their deployed infrastructure to collect data, and use that data as inputs to improve their analytics and ML algorithms.

These are further combined with their Threat Intelligence Platforms, such as Carbon Black Threat Analysis, Brightcloud Webroot, AlienValue Open Threat Exchange, and Internet Storm Center. These use the endpoint data as well as the cybersecurity vendors own global monitoring points and network to collect data and feed them to their ML algorithms for training and analysis.

The above relates the cybersecurity aspects to the autonomous car aspects of data collection and machine learning algorithms. What about simulation to augment to augment the data in cybersecurity vs autonomous cars? I guess the input space is a bit different in nature here, as no form of cybersecurity simulation directly comes to mind. I suppose you could also just arrange cyber-excersises and hire hackers / penetration testers etc to provide you with the reference data in attacks against the system. Or maybe my mind is just limited, and there are better approaches to be had ? 🙂

Anyway, in my experience, all you need to do is to deploy a simple server or services on the internet and within minutes (or seconds?) it will start getting hammered with attacks. And it is not like the world is not full of Chinese hackers, Russian hackers, and all the random hackers to give you real data.

I guess this illustrates some difference in domains, where cybersecurity looks more for anomalies, which the adversary might try to hide in different ways in more (semi-)structured data. Autonomous cars mostly need to learn to deal with the complexity of real world, not just anomalies. Of course, you can always "simulate" the non-anomaly case by just observing your cybersecurity infrastructure in general. This is maybe a more realistic scenario (to observe the "normal" state of a deployed information system) to provide a reference than observing a car standing in a parking lot.

In summary, different domains, different considerations, even if the fundamentals remain.

MetaMorphic Testing (MT)

Something that comes up all the time in ML testing is MetaMorphic Testing, sometimes also called property-based testing. The general idea is to describe the software functionality in terms of relations between inputs and outputs, rather than exact specifications of input to output. You take one input which produces an observed output, modify this input, and instead of specifying an exact expected output, you specify a relationship between the original input and related output and the modified input and its related output.

An example of a traditional test oracle might be to input a login username and password, check that it passes with correct inputs, and fails with wrong password. Very exact. Easy and clear to specify and implement. The general examples for metamorphic testing are not very exciting, such as something about sin function. A more intuitive example I have seen is about search engines (Zhou2016). Search engines are typically backed by various natural language processing (NLP) based machine learning approaches, so it is fitting in that regard as well.

As a search-engine example, input the search string "testing" to Google, and observe some set of results. I got 1.3 billion hits (matching documents) as a result when I tried "testing" when writing this. Change the query string to be more specific "metamorphic testing" and the number of results should be fewer than the original query ("testing" vs "metamorphic testing"), since we just made the query more specific.

In my case I got 25k results for "metamorphic testing" query. This is an example of the metamorphic relation: a more specific search term should less or equal number of results as the original query. So if "metamorphic testing" would have resulted in 1.5 billion hits instead of 25k, the test would have failed, since it broke the metamorphic relation we defined. With 25k, it passes as the results are fewer than for the original search quer, and the metamorphic relation holds.

In the following examples, metamorphic testing is applied to many views of testing ML applications. In fact, people working on metamorphic testing must be really excited since this seems like the golden age of applying MT. It seems to be a great match to the type of testing needed for ML, which is all the rage right now.

Test Generation for ML applications

This section is a look at test generation for machine learning applications, focusing on testing ML models. I think the systems using these models as parth of their functionality need to be looked at in a broader context. In the end, they use the models and their predictions for something, which is the reason the system exists. However, I believe looking at testing the ML models is a good start, and complex enough already. Pretty much all of the works I look at next use variants of metamorphic testing.

Autonomous cars

DeepTest in (Tian2018) uses transformations on real images captured from driving cars to produce new images. These are given as input to a metamorphic testing process. The system prediction/model output (e.g., steering angle). The metamorphic relation is to check that the model prediction does not significantly change across transformation of the same basic image. For example, the driving path in sun and rain should be about the same for the same road, in the same environments, with same surrounding traffic (cars etc.). The relation is given some leeway (the outputs do not have to be 100% match, just close), to avoid minimal changes causing unnecessary failures.

The transformations they use in include:

  • brightness change (add/substract constant from pixels)
  • contrast change (multiply pixels by constant)
  • translation (moving/displacing the image by n pixels)
  • scaling the image bigger or smaller
  • shear (tilting the image)
  • rotation (around its center)
  • blur effect
  • fog effect
  • rain effect
  • combinations of all above

The following illustrates the idea of one such transformation on the two road images (from near my work office) from the introduction:


For this, I simply rotated the phone a bit, while taking the pics. The arrows are there just as some made-up examples of driving paths that a ML algorithm could predict. Not based on any real algorithm, just an illustrative drawing in this case. In metamorphic testing, you apply such transformations (rotation in this case) on the images, run the ML model and the system using its output to produce driving (or whatever else) instructions, compare to get a similar path before and after the transformation. For real examples, check (Tian2018). I just like to use my own when I can.

My example from above also illustrates a problem I have with the basic approach. This type of transformation would also change the path the car should take, since if the camera is rotated, the car would be rotated too, right? I expect the camera(s) to be fixed in the car, and not changing positions by themselves.

For the other transformations listed above this would not necessarily be an issue, since they do not affect the camera or car position. This is visible in a later GAN based example below. Maybe I missed something, but that is what I think. I believe a more complex definition of the metamorphic relation would be needed than just "path stays the same".

Since there are typically large numbers of input images, applying all transformations above and their combinations to all of images would produce very large input sets with potentially diminishing returns. To address this, (Tian2018) applies a greedy search strategy. Transformations and their combinations are tried on images, and when an image or a transformation pair is found to increase neuron activation coverage, they are selected for further rounds. This iterates until defined ending thresholds (number of experiments).

Coverage in (Tian2018) is measured in terms of activations of different neuros with the input images. For dense networks this is simply the activation value for each neuron. For CNNs they use averages over the convolutional filters. For RNNs (e.g., LSTM/GRU), they unroll the network and use the values for the unrolled neurons.

A threshold of 0.2 is used in their case for the activation measure (neuron activation values range from 0 to 1), even if no rationale is give for this value. I suppose some empirical estimates are in order, in whatever case, and this is what they chose. How these are summed up across the images is a bit unclear from the paper but I guess it would be some kind of combinatorial coverage thing. This is described as defining a set of equivalence partitions based on the similar neuron activations, so maybe there is more to it but I did not quite find it in the paper. An interesting thought in any case, to apply equivalence partitioning based on some metrics over the activation thresholds.

The results (Tian2018) get in terms of coverage increases, issues found etc. are shown as good in the paper. Of course, I always take that with a bit of salt since that is how academic publishing works, but it does sound like a good approach overall.

Extending this with metamorphic test generation using generative adversarial networks (GANs) is studied in (Zhang2018(1)). GANs are trained to generate images with different weather conditions. For example, taking a sunny image of a road, and transforming this into a rainy or foggy image. Metamorphic relations are again defined as the driving guidance should remain the same (or very close) as before this transformation. The argument is that GAN generated weather effects and manipulations are more realistic than more traditional synthetic manipulations for the same (such as in DeepTest from above, and cousins). They use UNIT (Liu2017) toolkit to train and apply the GAN models, using input such as YouTube videos to train.

The following images illustrate this with two images from the UNIT website:


On the left in both pairs is the original image, on the right is the GAN (UNIT) transformed one. The pair on the left transformed into night scene, the pair on the right transformed into a rain scene. Again, I drew the arrows myself, and this time they would also make more sense. The only thing changed is the weather, the shape of the road, other cars, and other environmental factors are be the same. So the metamorphic test relation where the path should not change should hold true.

The test oracle in this work is similar to other works (e.g., (Tian2018 above). The predictions for the original and transformed image are compared and should closely match.

Input Validation

A specific step of input validation is also presented in (Zhang2018(1)). Each image analyzed by the automated driving system is first verified in relation to known input data (images it has seen before and was trained on). If the image is classified as significantly different from all the training data previously fed into the network, the system is expected to advise the human driver to take over. The approach for input validation in (Zhang2018(1)) is to project the high-dimensional input data into a smaller dimension, and measure the distance of this projection from the projections of the used training data. If the distance crosses a set threshold, the system alerts the human driver about potentially unknown/unreliable behaviour. Various other approaches to input validation could probably be taken, in general the approach seems to make a lot of sense to me.

LIDAR analysis

Beyond image-based approaches only, a study of applying metamorphic testing on Baidu Apollo autonomous car system is presented in (Zhou2019). In this case, the car has a LIDAR system in place to map its surroundings in detail. The system first identifies a region of interest (ROI), the "drivable" area. Which sounds like a reasonable goal, although I found no process of identifying it in (Zhou2019). This system consists of multiple components:

  • Object segmentation and bounds identification: Find and identify obstacles in ROI
  • Object tracking: Tracking the obstacles (perhaps their movement? hard to say from paper)
  • Sequential type fusion: To smooth the obstacles types over time (I guess to reduce misclassification or lost objects during movement over time)

Their study focuses on the object identification component. The example of a metamorphic relation given in (Zhou2019) is to add new LIDAR points outside the ROI area to an existing point cloud, and run the classifier on both the before- and after-state. The metamorphic relation is again that same obstacles should be identified both before and after adding small amounts of "noise". In relation to the real world, such noise is described as potentially insects, dust, or sensor noise. The LIDAR point cloud data they use is described as having over 100000 samples per point cloud, and over a million per second. The ratio of noise injected is varied with values of 10, 100, and 1000 points. These are considered small numbers compared to the total of over 100k points in each point-cloud (thus "reasonable" noise). Each experiment produces errors or violations of the metamorphic relation in percentages of 2.7%, 12.1%, and 33.5%.

The point cloud is illustrated by this image from the paper:


In the figure above, the green boxes represent detected cars, and the small pink one a pedestrian. The object detection errors found were related to missing some of these types of objects when metamorphic tests were applied. The results they show also classify the problems to different numbers of mis-classifications per object type, which seems like a useful way to report these types of results for further investigation.

(Zhou2019) report discussing with the Baidu Apollo team about their findings, getting acknowledgement for the issues, and considered the MT approach could be useful approach to augment their train and test data. This is in line with what I would expect: ML algorithms are trained and evaluated on the training dataset. If testing finds misclassifications, it makes sense to use the generated test data as a way to improve the training dataset.

This is also visible in the data acquisition part I discussed higher above, in using different means to augment data collected from real-word experiments, real-word test environments, and simulations. MT can function as another approach to this data augmentation, but with an integrated test oracle and data generator.

Other Domains

Besides image classifiers for autonomous cars, ML classifiers have been applied broadly to image classification in other domains as well. A metamorphic testing related approach to these types of more generic images is presented in (Dwarakanath2018). The image transformations applied in this case are:

  • rotate by 90, 180, 270
  • flip images vertically or horizontally
  • switching RGB channels. e.g., RGB becomes GRB
  • normalizing the (test) data
  • scaling the (test) data

The main difference I see in this case is the domain specific limitations of autonomous cars vs general images. If you look to classify random images on the Internet or on Facebook, Instagram, or similar places, you may expect them to arrive in any short of form or shape. You might not expect the car to be driving upside down or to have strangely manipulated color channels. You might expect the car system to use some form of input validation, but that would be before running the input images through the classifier/DL model. So maybe rotation by 90 degrees is not so relevant to test for the autonomous car classifier (expect as part of input validation before the classifier), but it can be of interest to identify a rotated boat in Facebook posts.

Another example of similar ideas is in (Hosseini2017) (probably many others too, …). In this case they apply transformations as negations of the images. Which, I guess, again could be useful to make general classifiers more robust.

An example of applying metamorphic testing to machine learning in the medical domain is presented in (Ding2017). This uses MT to generate variants of existing high-resolution biological cell images. A number of metamorphic relations are defined to check, related to various aspects of the cells (mitochondria etc.) in the images, and the manipulations done to the image. Unlike the autonomous car-related examples, these seem to require in-depth domain expertise and as such I will not go very deep into those (I am no expert in biological cell structures). Although I suppose all domains would need some form of expertise, the need just might be a bit more emphasized in the medical domain.

Generating Test Scenarios

An approach of combining model-based testing, simulation, and metamorphic testing for testing an "AI" (so machine learning..) based flight guidance systems of autonomous drones is presented in (Lindvall2017).

The drone is defined as having a set of sensors (or subset of these):

  • inertial measurement unit
  • barometer
  • GPS
  • cameras
  • ultrasonic range finder

Metamorphic relations:

  • behaviour should be similar across similar runs
  • rotation of world coordinates should have no effect
  • coordinate translation: same scenario in different coordinates should have no effect
  • obstacle location: same obstacle in different locations should have same route
  • obstacle formation: similar to location but multiple obstacles together

The above are related modifying the environment. Relations that are used as general test oracle checks where values should always stay within certain bounds:

  • obstacle proximity
  • velocity
  • altitude

MBT is typically used to represent the behaviour of a system as a state-machine. The test generator traverses the state-machine, generating tests as it goes. In this case, the model generates test steps that create different elements (landing pads, obstacler, etc) in the test environemnt. The model state contains the generated environment. After this initial phase, the model generates tests steps related to flying. Lifting off, returning to home base, landing on a landing pad, etc.

I guess the usefulness of this depends on the accuracy of the simulation, as well as how the sensors are used. Overall, it does seem like a good approach to generate environments and scenarios, with various invariant based checks for generic test oracles. For guiding a car based on camera input, maybe not this is not so good. For some sensor type where the input space is well known, and realistic models of the environment and real sensor data are less complex, and easier to create?

Testing ML Algorithms

A large part of the functioniality of ML applications depends on ML frameworks and their implementation of algorithms. Thus their correctness is quite related to the applications, and there are a lot of similar aspects to consider. As noted many times above, it is difficult to exactly know what is the expected result of a ML model, when the "prediction" is correct and so on. This is exactly the same scenario with the implementation of those learning algorithms, as their output is just those models, their training, the predictions they produce, and so on. How do you know what it produced is correct? That the implementation of those models and their training is "correct"?

A generic approach to test anything when you have multiple implementations is to pit those implementations against each other. Meaning to run them all with the same input and configurations, and compare the outputs. In (Pham2019) such an approach is investigated by using Keras as the application programming interface (API), and various backends for it. The following figure illustrates this:

Keras Layers

The basic approach here is to run the different backends for the same models and configurations, and compare results. There are often some differences in the results of those algorithms, even if the implementation is "correct", due to some randomness in the process, the GPU operations used, or some other factor. Thus, (Pham2019) rather compares the distributions and "distances" of the results to see if there are large deviations in one of the multiple implementations compared to others.

This topic is explored a bit further in (Zhang2018(2)), where issues (bug reports etc) against Tensorflow are explored. The issues are accessed via the project Github issue tracker, and related questions on StackOverflow. Some of these remind me of the topics listed by Karpathy’s recent post on model training "gotchas".

One from (Zhang2018) that I have found especially troublesome myself is "stochastic execution", referring to different runs with same configurations, models, and data producing differing results. In my experience, even with fixing all the seed values, exact reproducibility can be an challenge. At least in Kaggle competitions where aiming for every slightest gain possible :). In any case, when testing the results of trained ML algorithms across iterations, I have found exact reproducability can be an issue in general, which this refers to.


In the above sections, I have discussed the general testing of ML applications (models). This is the viewpoint of using techniques such as metamorphic testing to generate numerous variants of real scenarios, and the metamorphic relations as a test oracle. This does not address the functionality of the system from a viewpoint of a malicious adversary. Or generally of "stranger" inputs. In machine learning research, this is usually called "adversary input" (Goodfellow2018).

An example from (Goodfellow2018) is to fool an autonomous car (surprise) to misclassify a stop sign and potentially lead to an accident or other issues. The following example illutrates this from the web-version of the (Goodfellow2018) article:

Adversarial example

The stop sign on the left is the original, the one on the right is the adversarial one. I don’t see much of a difference, but a ML classifier can be confused by embedding specific tricky patterns into the image.

Such inputs are designed to look the same to a human observer, but modified just slightly in ways that fool the ML classifier. Typically this is done by embedding some patterns invisible to the human eye into the data, that the classifier picks up. These are types of patterns not present in the regular training data, which is why this is not visible during normal training and test. Many such attacks have been found, and ways to design them developed. A taxonomy in (Goodfellow2018) explores the possible ways, difficulty levels vs adversarial goals. The basic goal starting from reducing confidence (sometimes enough to satisfy adversarial needs) to full misclassification (can’t go more wrong than that).

The "bullet-proof" way to create such inputs as desfribed in (Goodfellow2018) is to have full knowledge of the model, its (activation) weights, and input space. To create the descired misclassification, the desired output (misclassification) is first defined, followed by solving the required constraints in the model backwards to the input. This reminds me of the formal methods vs testing discussions that have been ongoing for decades in the "traditional" software engineering side. I have never seen anyone apply format methods in practice (I do know for safety-critical it is used and very important), but can see their benefit in really safety critical applications. I guess self-driving (autonomous) cars would qualify as one. And as such this type of approach to provide further assurance on the systems would be very benefical.

I would expect that the wide-spread practical adoption of such a approach would, however, require this type of analysis to be provided as a "black-box" solution for most to use. I don’t see myself writing a solver that solve a complex DL network backwards. But I do see the benefits, and with reasonable expertise required could use it. I had trouble following the explanation and calculations in (Goodfellow2018) and believe many could be willing to pay for the tools and services where available (and if having the funding, which I guess should be there if developing something like autonomous cars). As long as I could trust and believe in the tool/service quality.

This topic is quite extensively studied and just searching for adversarial machine learning on the Internets gives plenty of hits. One recent study on the topic is (Elsayed2018) (also with Goodfellow..). It shows some interesting numbers on how many of the adversarial inputs fool the classifier and how many fool a human. The results do not show 100% accurate fail so perhaps that and the countermeasures discussed in (Goodfellow2018) and similar papers, such as verifying expected properties of input, could make for a plausible way to implement useful input validation strategies against this, similar to ones I discussed earlier in this post.

Test Coverage for ML applications

Test coverage is one of the basic building blocks of software testing. If you ever want to measure something about your test efforts, test coverage is likely to be at the very top of the list. It is then quite natural to start thinking about measuring test coverage in terms of ML/DL models and their applications. However, similar to ML testing, test coverage in ML is a bit more complicated topic.

Traditional software consists of lines of code, written to explicitly implement a specific logic. It makes sense to set a goal of having high test coverage to cover (all) the lines of code you write. The code is written for a specific reason, and you should be able to write a test for that functionality. Of course, real-world is not that simple, and personally, I have never seen 100% code coverage achieved. Of course, there are other criteria, such as requirements coverage as well, that can also be mapped in different ways to code and other artefacts.

As discussed throughout this post, the potential input space for ML models is huge. How do you evaluate the coverage of any tests over such models? For example, two common models referenced in (Ma2018) are VGG-19 and Resnet-50, with 16000 and 94000 neurons over 25 and 176 layers, respectively. Other models such as recent NLP models from OpenAI have much more. Combined with the potential input space, that is quite a few combinations to cover everything.

Coverage criteria for neural nets are explored in (Ma2018), (Sun2018), (Li2019) (and I am sure other places..). As far as I can see, they all take quite a similar approach to coverage measurement and analysis. They take a set of the valid input data and use it to train and test the model. Then measure various properties of the neural nets, the neuron activations, layer activations, etc. as a basis for coverage. Some examples of coverage metrics applied:


  • k-multisection: splits the neuron activation value range to k-sections and measures how many are covered
  • boundary: measures explicitly boundary (top/bottom) coverage of the activation function values
  • strong boundary: same as boundary but only regarding the top value
  • top-k: how many neurons have been the top-k active ones in a layer
  • top-k patterns: pair-wise combination patterns over top-k


  • neurons: neurons that have achieved their activation threshold
  • modified condition/decision coverage: different inputs (neurons) causing activation value of this neuron to change
  • boundary: activation values below/above set top/bottom boundary values

The approaches typically use the results to guide where to generate more inputs. Quite often this just seems to be another way to generate further adversarial inputs, or to identify where the adversarial inputs could be improved (increased coverage).

The results in (Ma2018) also indicate how different ML models for the same data require different adversarial inputs to trigger the desired output (coverage). This seems like a potentially useful way to guard against such adversarial inputs as well, by using such "coverage" measures to help select diverse sets of models to run and compare.

The following figure is a partial snippet from (Ma2018), showing some MNIST data:


MNIST is a dataset with images of written letters and digits. In this case, a number of adversarial input generators have been applied on the original input. As visible, the adversarial inputs are perturbed from the original, and in ways which do actually not seem too strange, simply somewhat "dirty" in this case. I can see how robustness could be improved by including these, as in the real world the inputs are not perfect and clean either.

Another viewpoint is presented in (Kim2019), using the term surprise adequacy, to describe how "surprised" the DL system would be based on specific inputs. This surprise is measured in terms of new added coverage over the model. The goal is to produce as "surprising input" as possible, to increase coverage. I assume withing limits of realistic inputs, of course. In practice this seems to translate to traversing new "paths" in the model (activations), or distance of activation values from previous activation values.

Some criticism on these coverage criteria is presented in (Li2019). Mainly with regards to whether such coverage criteria can effectively capture the relevant adversary inputs due to the extensive input and search space. However, I will gladly take any gain I can get.

It seems to me the DNN coverage analysis as well as its use are still work in progress, as the field is finding its way. Similary to metamorphics testing, this is also probably a nice time to publish papers, as shown by all the ones I listed from just the past year here. Hopefully in the following years we will see nice practical tools and methods to make this easier from application perspective.

Final Thoughts

Many of the techniques I discussed in the above sections are not too difficult to implement. The machine learning and deep learning frameworks and tools have come a long way and provide good support for building classifiers with combinations of different models (e.g., CNN, RNN, etc). I have played with these using Keras and Tensorflow as a backend, and there are some great online reasources available to learn from. Of course, the bigger datasets and problems do require extensive efforts and resources to collect and fine-tune.

As a basis for considering the ML/DL testing aspects more broadly, some form of a machine learning/deep learning defect model would seem like a good start. There was some discussion on different defect types in (Zhang2018(2)) from the viewpoint of classifying existing defects. From the viewpoint of this post, it would be interesting to see some related to

  • testing and verification needs,
  • defect types in actual use,
  • how the different types such as adversarial inputs relate to this,
  • misclassifications,
  • metamorphic relations
  • domain adaptations

This type of model would help more broadly analyze and apply testing techniques to real systems and their needs.

Since I have personally spent time working on model-based testing, the idea of applying MBT with MT for ML as in (Lindvall2017) to create new scenarios in combination with MT techniques seems interesting to me. Besides testing some ML algorithms on their own, this could help build more complex overall scenarios.

Related to this, the approaches I described in this post, seem to be mostly from a relatively static viewpoint. I did not see consideration for the changing sensor (data collection) environments, which seems potentially important. For example:

  • What happens when you change the cameras in your self-driving car? Add/Remove?
  • The resolution of other sensors? Radar? LIDAR? Add new sensors?
  • How can you leverage your existing test data to provide assurance with a changed overall "data environment"?

The third one would be a more general question, the first two would change with domains. I expect testing such changes would benefit from metamorphic testing, where the change being tested is from the old to the new configuration, and relations measure the change in the ultimate goal (e.g., driving guidance).

Data augmentation is a term used in machine learning to describe new and possibly synthetic data added to existing model training data to improve model performance. As I already discussed in some parts of this post above, it can be tempting to just consider the data generated by metamorphic testing as more training data. I think this can make sense, but I would not just blindly add all possible data, but rather focus the testing part of evaluating the robustness and overall performance of the model. Keeping that in mind as adding more data, to be able to still independently (outside the training data) perform as much verification as possible. And not add all possible data if it shows no added benefit in testing.

Overall, I would look forward to getting more advanced tools to apply the techniques discussed here. Especially in areas of:

  • metamorphic testing and related application to ML models and related checks
  • GAN application frameworks, on top of other frameworks like Keras or otherwise easily integratable in other tools
  • adversarial input generation tools using different techniques
  • job offers 🙂

Its an interesting field to see how ML applications develop, autonomous cars and everything else. Now I should get back to Kaggle and elsewhere to build some mad skills in this constantly evolving domain, eh.. 🙂


A. Dwarakanath et al., "Identifying implementation bugs in machine learning based image classifiers using metamorphic testing", 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2018

J. Ding, X-H. Hu, V. Gudivada, "A Machine Learning Based Framework for Verification and Validation of Massive Scale Image Data", IEEE Transactions on Big Data, 2017.

G. F. Elsayed et al., "Adversarial Examples that Fool both Computer Vision and Time-Limited Humans", 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.

I. Goodfellow, P. McDaniel, N. Papernot, "Making Machine Learning Robust Against Adversarial Inputs", Communications of the ACM, vol. 61, no. 7, 2018.

H. Hosseini et al., "On the Limitation of Convolutional Neural Networks in Recognizing Negative Images", 6th IEEE International Conference on Machine Learning and Applications, 2017.

J. Kim, R. Feldt, S. Yoo, "Guiding Deep Learning System Testing using Surprise Adequacy", International Conference on Software Engineering (ICSE), 2019.

Z. Li et al., "Structural Coverage Criteria for Neural Networks Could Be Misleading", International Conference on Software Engineering (ICSE), 2019.

M. Lindvall et al., "Metamorphic Model-based Testing of Autonomous Systems", IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET), 2017.

M-Y. Liu, T. Breuel, J. Kautz, "Unsupervised Image-to-Image Translation Networks", Advances in Neural Information Processing Systems (NIPS), 2017.

L. Ma et al., "DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.

H.V. Pham, T. Lutellier, W. Qi, L. Tan, "CRADLE : Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries", International Conference on Software Engineering (ICSE), 2019.

Y. Sun et al., "Concolic testing for deep neural networks", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018

Y. Tian et al., "DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars", 40th International Conference on Software Engineering (ICSE), 2018.

M. Zhang (1) et al., "DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.

Y. Zhang (2) et al., "An Empirical Study on TensorFlow Program Bugs", 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2018.

Z. Q. Zhou, S. Xiang, T. Y. Chen, "Metamorphic Testing for Software Quality Assessment : A Study of Search Engines", IEEE Transactions on Software Engineering, vol. 42, no. 3, 2016.

Z. Q. Zhou, L. Sun, "Metamorphic Testing of Driverless Cars", Communications of the ACM, vol. 62, no. 3, 2019.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s