Understanding the Poisson Distribution

I find probability distributions are often useful tools to know and understand, but their explanations are not always very intuitive. The Poisson distribution is one of the probability distributions that I have run into quite often. Most recently I ran into it when preparing for some AWS Machine Learning certification questions. Since this is not the first time I have run into it, I figured it would be nice to understand it better. In this article, I explore questions on when, where, and how to apply it. And I try to keep it more intuitive by using some concrete example cases.

Example Uses of the Poisson Distribution

Before looking into the details of something, it is nice to have an idea of what that something is useful for. The Poisson distribution is typically described as a discrete probability distribution. It tells you the probability of a discrete number of events in a timeframe, such as 1 event, 2 events, or 3 events. Not 1.1 events, 2.7 events, or any other fractional number of events. Whole events only. This is why it is called discrete.

Some example uses for the Poisson distribution include estimating the number of:

  • calls to a call center
  • cars passing a point on a highway
  • fatal airplane crashes in a year
  • particles emitted in radioactive decay
  • faults in physical hardware
  • people visiting a restaurant during lunch time

One typical use of these estimates would be as input to capacity planning (e.g., call center staffing or hardware provisioning). Besides events over time, the Poisson distribution can also be used to estimate instances in an area, volume, or over a distance. I will present some examples of these different application types in the following sections. For further ideas and examples of its application, there is a nice Quora question with answers.

Let’s start by looking at what the Poisson distribution is, and how to calculate it.

Properties of the Poisson Distribution

A process that produces values conforming to a Poisson Distribution is called a Poisson Process. Such a process, and thus the resulting Poisson Distribution, is expected to have the following properties:

  • Discrete values. An event either occurs or not, there is no partial occurrence here.
  • Independent events. Occurrence of one event does not affect the occurrence of other events.
  • Multiple events do not occur at exactly the same time. Put another way, in a sufficiently small sub-interval, at most one event occurs.
  • Average (a.k.a. the mean) number of events is constant through time.

The following figure/line illustrates a Poisson Process:

Poisson process example, with average of 5 units between events.

The dots in the above figure illustrate events occurring. The x-axis illustrates time passing. The units of time (on the x-axis here) could be any units of time, such as seconds, minutes, hours, or days. They could even be other types of units, such as distance. The Poisson distribution / process does not really care about the unit of measurement, only that the properties of the distribution are met. I generated the above figure using a random process that generates gaps of 1-10 units between events, with an average gap of 5 units. In practice, I used the Python random module to generate the numbers.
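A simplified sketch of such a generator (the notebook has the exact code; here I draw the gaps uniformly from 1 to 9, so the mean gap is 5):

```python
import random

random.seed(42)  # fixed seed just for reproducibility of this sketch

# Draw random gaps between events until 100 time units are filled.
# Gaps are uniform on 1-9, giving a mean gap of 5 units between events.
events = []
t = 0
while True:
    t += random.randint(1, 9)
    if t > 100:
        break
    events.append(t)

print(f"{len(events)} events: {events}")
```

On average this produces about 100 / 5 = 20 events per run, matching the process described above.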

As I said, the time unit could be anything, but for clarity of example I will use seconds. To calculate the Poisson distribution, we really only need to know the average number of events in the timeframe of interest. If we count the dots in the above figure, we get 18 events. Now assume we want to know the probability of getting exactly 18 events in a timeframe of 100 seconds (as in the figure).

In this process, the average number of events would be 100/5 = 20. We have a timeframe of 100 seconds, and a process that produces events on average every 5 seconds. Thus on average we have 100/5 = 20 events in 100 seconds.

To answer the question on the probability of having exactly 18 events in a timeframe of 100 seconds, we can plug these numbers into the Poisson distribution formula. This gives a chance of about 8.4% for having 18 events occur in this timeframe (100s). In the Poisson Distribution formula this would translate to the parameters k = 18, λ (lambda) = 20.

Let’s look at what these mean, and what the formula is:

The Poisson Distribution Formula

To calculate the Poisson distribution, we can use the formula defined by the French mathematician Siméon Denis Poisson, about 200 years ago:

Poisson distribution formula.

It looks a bit scary with those fancy symbols. To use this, you really just need to know the values to plug in:

  • e: Euler’s number. A constant value of about 2.71828. Most programming languages provide this as a constant, such as math.e in Python.
  • λ (lambda): The average number of events in an interval. Sometimes μ (Mu) is used, but that is just syntax, the formula stays the same.
  • k: The number of events to calculate the probability for.

For the more programming oriented, here is the same, in Python:

The Poisson distribution formula in Python.

This code, and the code to generate all the images in this article, is easiest to access on my Kaggle notebook. It’s a bit of a mess, but if you want to look at the code, it’s there.
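In case the image above is hard to read, a minimal version of the formula using only the Python standard library might look like this (a sketch; the notebook has the exact code):

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# The example from the previous section: 18 events when the average is 20.
print(poisson_pmf(18, 20))  # ~0.0844, or about 8.4%
```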

If we plug the values for the imaginary Poisson Process from the previous section (18 dots on a line, with an average of 20) into this formula, we get the following:

(e^-20)*(20^18)/18! = 0.08439

Which translates to about 8.4% chance of 18 events (k=18) in the timeframe, when the average is 20 events (λ=20). That’s how I got the number in the previous section.

Example: Poisson Distribution for Reliability Analysis

I find concrete examples make concepts much easier to understand. One area that I find is a fitting example here, and where I have experience in seeing the Poisson distribution applied, is system reliability analysis. So I will illustrate the concept here with an example of how to use it to provide assurance of “five nines”, or 99.999%, reliability in relation to potential component failures.

There are many possible components in different systems that would fit this example quite nicely. I use storage nodes (disks in good old times) as an example. Say we have a system that needs to have 100 storage nodes running at all times to provide the required level of capacity. We need to provide this capacity with a reliability of 99.999%. We have some metric (maybe from past observations, from manufacturer testing, or something else) to say the storage nodes we use have an average of 1 failure per 100 units in a year.

Knowing that 1 in 100 storage nodes fails on average during a year does not really give much information on how to achieve 99.999% assurance of having a minimum of 100 nodes up. However, with this information, we can use the Poisson distribution to help find an answer.

To build this distribution, we just plug in these values into the Poisson formula:

  • λ = 1
  • k = 0-7

This results in the following distribution:

Poisson distribution for λ (avg)=1, k (events) = 0-7.

Here λ (avg in the table) is 1, since we have the average number of events at 1. For k (events in the table above), I have simply started at 0, and increased by 1 until I reached the target probability of 99.999%. The Poisson formula with these values is also in the table for each row. Adding up all the probabilities gives us the value of 7 as the number of potential failures until we reach the probability of 99.999% (cumulative in the table).

So how did we reach 7? We have a probability of 36.7879% for 0 failures, and the same 36.7879% for 1 failure. The cumulative probability for 0 or 1 failures is 36.7879 + 36.7879 = 73.5759%. The probability of exactly 2 failures is 18.394%, so the cumulative probability of 0, 1, or 2 failures is 73.5759 + 18.394 = 91.9699%. And so on, as visible in the table (cumulative column). Since these outcomes are mutually exclusive (we cannot have both exactly 0 and exactly 1 failures at the same time), we can sum their probabilities to get the cumulative chance of having at most 7 failures. This is the cumulative 99.999% value in table row 7.
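The table’s numbers can be reproduced with a short loop (a sketch using the formula from earlier):

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 1.0  # average of 1 failure per 100 nodes per year
cumulative = 0.0
for k in range(8):  # 0-7 failures, as in the table
    p = poisson_pmf(k, lam)
    cumulative += p
    print(f"k={k}: p={p:.6f}, cumulative={cumulative:.6f}")

# cumulative is now ~0.99999, i.e. the 99.999% target is reached at k=7
```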

So the answer to the original question in this example is to start the year with 107 nodes running to have 99.999% assurance of keeping a minimum of 100 nodes running at all times through the year. Or you could also do 108 depending on how you interpret the numbers (does the last failure count as reaching 99.999%, or do you need one more).

Of course, this does not consider external issues such as meteor hits, datacenter fires, tripping janitors, etc. but rather is to address the “natural” failure rate of the nodes themselves. It also assumes that you wouldn’t be exchanging broken nodes as they fail. But this is intended as an analysis example, and one can always finetune the process based on the results and exact requirements.

For those interested, some further examples of such applications in reliability analysis are available on the internet. Just search for something like poisson distribution reliability engineering.

Poisson Distributions for average of 0-5 events

Now that we have the basic definitions, and a concrete example, let’s look at what the distribution generally looks like. With the same parameters, the Poisson distribution is always the same, regardless of the unit of time or space used. Using an average number of events 1/minute or 1/hour, the distribution itself is the same. Just the time-frame changes. Similarly, looking at an average of instances per area of 10/m2 or 10/cm2 will have the same distribution, just over a different area.

Here I will plot how the distribution evolves as the λ (average number of events in timeframe or area) is changed. I use a varying number of events k in each case up until the probability given by the Poisson distribution is very small. I use discrete (whole) number for the λ, such as 0, 1, 2, 3, 4, and 5, although the average can have continuous values (0.1 would be fine for λ). Let’s start with λ of 0:

Poisson distribution for 0-5 events (k) with an average of 0 events (λ) in time interval.

The above figure shows the Poisson distribution for an average of 0 events (λ). If you think about it for a moment, the average number of events can only be 0 if the number of events is always 0. This is what the above figure and table show. With an average of 0 events, the number of events is always 0.

Poisson distribution for 0-5 events (k) with an average of 1 events (λ) in time interval.

A bit more interesting, the above figure with an average of 1 events (λ) in the time interval shows a bigger spread of probable number of events. This distribution is already familiar from the reliability estimation example I showed, as it had λ of 1. Here the number of events (k) can be both 0 or 1 equally often (about 36.8%), after which the probability quickly goes down for larger number of events (k). The average of 1 is balanced by the large probability of 0’s and smaller probability of the larger numbers for k.

Poisson distribution for 0-7 events (k) with an average of 2 events (λ) in time interval.

Here, with the figure showing an average of 2 events (λ) in the interval, we see some trends emerge as the average number of events (λ) increases, compared to the distributions with λ of 0 and 1. As λ increases, the center of the distribution shifts right, the distribution spreads wider (the x-axis becomes broader), and the probability of each single value becomes smaller (the y-axis is shorter). So there are more values for k, each one individually having a smaller probability, but summing to 1 in the end.

Poisson distribution for 0-8 events (k) with an average of 3 events (λ) in time interval.

Compared to the smaller averages (0, 1, 2), the above figure with an average of 3 events (λ) further shows the trend where the average value itself (here k=3) has the highest probability, with the value right below it (here k=2) closely (or equally) matching it in probability. The center also keeps shifting right, and the spread keeps getting broader.

Poisson distribution for 0-10 events (k) with an average of 4 events (λ) in time interval.

Averages of 4 (figure above) and 5 (figure below) events (λ) show the same trends as λ of 0, 1, 2, and 3, continued further.

Poisson distribution for 0-11 events (k) with an average of 5 events (λ) in time interval.

To save some space, and to illustrate the Poisson distribution evolution as the average number of events (λ) rises, I animated it over different λ values from 0 to 50:

Poisson distribution from 0 to 50 (discrete) averages (λ) and scaled number of events (k).

As I noted before, the average (λ) does not need to be discrete, even if these examples I use here are. You cannot have 0.1 failures, but you can have on average 0.1 failures.

With the above animated Poisson distribution going from 0 to 50 values for λ, the same trends as before become even more pronounced. As the average number of events (λ) rises,

  • The overall number of events predicted goes up (the x-axis center moves right).
  • The distribution becomes wider (the x-axis becomes broader), with more values for k (number of events).
  • The probability of any single number of events (k) gets smaller (the y-axis is shorter).
  • The probability always peaks at the average (highest probability where k=λ; for whole-number λ, k=λ-1 ties with it).
  • The summed probability always stays at 1 (=100%) over the x- and y-axes.
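These properties are easy to sanity-check numerically. A small sketch, using the same formula as before:

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

for lam in (1, 5, 20, 50):
    probs = [poisson_pmf(k, lam) for k in range(120)]
    total = sum(probs)              # always ~1.0 (100%) for any lam
    peak = probs.index(max(probs))  # for whole-number lam, k=lam-1 and k=lam tie
    print(f"lam={lam}: sum={total:.6f}, peak near k={peak}")
```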

That sums up my basic exploration of the Poisson distribution.

Example: Snowflakes, Call Centers, and a Poisson Distribution

As I noted, I find concrete examples tend to make things easier to understand. I already presented the example from reliability engineering, which I think makes a great example of the Poisson distribution as it seems like such an obvious fit. And useful too.

However, sometimes the real world is not so clear on all the assumptions on applying the Poisson distribution. With this in mind, I will try something a bit more ambiguous. Because it is winter now (where I live it is), I use falling snowflakes as an example to illustrate this problem. When I say snowflakes falling, I mean when the flakes hit the ground, on a specific area.

Assume we know the average rate of snowflakes falling in an hour (on the selected area) is 100, and we want to know the probability of 90 snowflakes falling. Using the Poisson distribution formula defined earlier in this article, we get the following:

  • e = 2.71828: as defined above (or math.e)
  • λ = 100: The average number of snowflakes falling in a given timeframe (1 hour).
  • k = 90: The number of snowflakes we want to find the probability for.

Putting these into the Poisson formula, we get:

(e^-100)*(100^90)/90! = 0.025.

Which translates to about 2.5% probability of 90 snowflakes falling in an hour (the given time-frame), on the given area.

To calculate the broader Poisson probability distribution for different numbers of snowflakes, we can, again, simply run the Poisson formula using different values for k. Remember, k here stands for the number of events to find the probability for. In this case the events are the snowflakes falling. Looping k with values from 75 to 125 gives the following distribution:

Poisson distribution for k=75-125, λ=100.

Sorry for the small text in the image. The bottom shows the x-axis ticks as the numbers from 75 to 125. These are the values of k that I looped over for this example. The bars in the figure denote the probability of each number of snowflakes (running the Poisson formula with that value of k). The numbers on top of the bars are the probabilities for each number of snowflakes falling, given the average of 100 snowflakes in the interval (λ). As usual, the Poisson distribution peaks at the average point (λ, here 100). The probability at the highest point (k=100) here is only about 4% due to the wide spread of the probabilities in this distribution.
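The distribution in the figure can be reproduced with a loop like this (a sketch using the same formula as before):

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 100  # average number of snowflakes per hour on the chosen area
dist = {k: poisson_pmf(k, lam) for k in range(75, 126)}

print(f"P(90 snowflakes)  = {dist[90]:.4f}")   # ~0.025, as calculated earlier
print(f"P(100 snowflakes) = {dist[100]:.4f}")  # ~0.040, the peak of the distribution
```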

Recall the requirements defined earlier for applying the Poisson distribution:

  • Discrete values
  • Independent events
  • Multiple events not occurring simultaneously
  • The average number of events is constant over time

As long as these hold, we can say that there is a 4% chance of 100 snowflakes, when the average number of observed snowflakes (λ) in the timeframe is 100. And about 2.5% chance of observing 90 snowflakes (k=90) in a similar case. And so on, according to the distribution visualized above.

Let’s look at the Poisson distribution requirements here:

  1. Snowflakes are discrete events. There cannot be half a snowflake in this story. Either the snowflake falls or it doesn’t.
  2. The snowflakes can be considered independent, as one would not expect one to affect another.
  3. Multiple snowflakes are often falling at the same time, but do they fall to the ground at the exact same moment?
  4. The average number of snowflakes over the hours is very unlikely to be constant for a very long time.

I considered a few times whether to use snowflakes as an example, since this list actually goes from the first item obviously being true, to the next one slightly debatable, the third more so, and the final one quite obviously not going to hold always. But everyone likes to talk about the weather, so why not.

So, taking the first point above as true, and the following ones as debatable to different degrees, let’s look at points 2-4 from the above list.

2: Are falling snowflakes independent? I am not an in-depth expert on the physics of snowflakes or their falling patterns. However, 100 snowflakes on an area in an hour is not much, and I believe the independence of the snowflakes is a safe assumption here. But in different cases, it is always good to consider this more deeply as well (e.g., people tending to cluster in groups).

3: How likely is it that two snowflakes hit the ground at the exact same moment? Depends. If we reduce the time frame of what you consider “the same time” enough, there are very few real-world events that would occur at the exact same moment. And if this was a problem, there would be practically very few, if any, cases where the Poisson distribution would apply. There is almost always the theoretical possibility of simultaneous events, even if incredibly small. But it is very unlikely here. So I would classify this as not a problem to make this assumption. I believe this kind of relaxed assumption is quite common to make (e.g., we could say in the reliability example, there is an incredibly small chance of simultaneous failure).

4: The average number of falling snowflakes being consistent is of course not true over a longer period of time. It is not always snowing, and sometimes there is a bigger storm, other times just a minor dribble. To address this, we can consider the problem differently, as the number of snowflakes falling on a specific area in a smaller interval, such as 1 minute. The number of snowflakes should be more constant at least for the duration of several minutes (or as long as the snowstorm maintains a reasonably steady state). Maybe it would be more meaningful to discuss the distributions of different levels of storms in different stages?

The fourth point, the change over time, highlights an issue that is relevant to many example applications of the Poisson distribution. The rate of something is not always constant, and thus the time factor may need special consideration. I even found a term for it when I was looking for information on this on the internet: the time-varying Poisson distribution. And as highlighted by the third point above, some other considerations may also need to be slightly relaxed, while the result can still be useful, even if not perfect.

Snowflakes vs Support Calls

The above snow example was a largely made up problem, but weather is always a nice topic to discuss. Perhaps a more realistic, but very similar, case would be to predict the number of calls you could get in a call center. Just replace snowflakes with calls into the call center.

As with the snow example, with calls to a call center the average might vary over different time periods (depending also on time-zones), producing different distributions. Your distributions over days might also vary, for example weekends vs holidays vs business days. But if you had, for example, an average of 100 calls during business hours on business days, your Poisson distribution would look exactly the same as for the average of 100 snowflakes in the snowflake example.

You could then use this distribution for things like choosing how many people you should hire into the call center to maintain a minimum service level matching your service level agreement. For example, by finding the point where the cumulative probability of receiving at most that many calls in an hour is at least 90%, and using it as evidence of being able to provide a minimum 90% service level.
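A sketch of that kind of capacity calculation, assuming an average of 100 calls per hour and a 90% target (the numbers here are illustrative):

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 100  # assumed average number of calls per hour
target = 0.90

# Find the smallest call volume whose cumulative probability reaches the target.
cumulative = 0.0
k = 0
while cumulative < target:
    cumulative += poisson_pmf(k, lam)
    k += 1

print(f"Staffing for up to {k - 1} calls/hour covers {cumulative:.1%} of hours")
```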

In such a call center example, you would have other considerations compared to the weather scenario. Including the timezones you serve, business hours, business days, special events, the probability of multiple simultaneous calls, and so on. However, if you think about it for a moment, at the conceptual level these are exactly the same considerations as for the weather example.

So far, my examples have only described events over timeframes. As I noted in the beginning, sometimes the Poisson distribution can also be used to model number of instances in volumes, or areas, of space.

Example: Poisson Distribution Over an Area – Trees in a Forest, WW2 Bombs in London

Besides events over time, the Poisson distribution can also be used to estimate other types of data. One other example is typically distributions of instances (events) over an area. Imagine the following box to describe a forest, where every dot is a tree:

Imaginary forest, dots representing trees.

A Poisson distribution can similarly represent occurrences of such instances in space as it can represent occurrence of events in time. Imagine now splitting this forest area into a grid of smaller boxes (see my Kaggle kernel for the image generation. This image is from run 31):

The forest, divided into smaller squares.

In the above figure, each tree is now inside a single smaller box, although the rendering makes it look like the dots on the border of two boxes might belong to both. Now, we can calculate how many trees are in each smaller square:

Calculating the number of trees inside the smaller boxes.

This calculation for the image above, gives the following distribution:

Number of trees in smaller box in the imaginary forest.

From the calculation, the number of smaller boxes with 0 trees (or dots) is 316, or 79%, of the total 400. A further 17% of the smaller boxes in the grid contain 1 tree (dot), and 4% contain 2 trees (dots).

Now, we can calculate the Poisson probability distribution for the same data. We have 400 smaller boxes, and 100 trees. This makes the average number of trees in a box 100/400 = 0.25. This is the average number of instances in the area (similar to the timeframe in earlier examples), or λ, for the Poisson formula. The number of events k in this case is the number of instances (of trees). Putting these values into the Poisson formula (k = tree_count, λ = 0.25), we get:

Number of trees estimated by the Poisson distribution.

Comparing this Poisson distribution to the actual distribution calculated from the tree grid image above, it is nearly identical. Due to random noise, there is always some deviation. The more we would increase the number of “trees” in the actual example (randomly generate more), and the number of mini-boxes (smaller areas), the closer the actual observations should come to the theoretical numbers calculated by the Poisson formula. This is how randomness generally seems to work: bigger sets tend to converge closer to their theoretical target. But even with the values here it is quite close.
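The experiment can be repeated with random points (a sketch; the notebook uses its own generator, so the exact counts will differ from the figures above):

```python
import math
import random
from collections import Counter

random.seed(31)  # any seed works; the article's figures came from one particular run

# Drop 100 "trees" uniformly into a 20x20 grid of 400 boxes.
counts = Counter()
for _ in range(100):
    box = (random.randrange(20), random.randrange(20))
    counts[box] += 1

observed = Counter(counts.values())
observed[0] = 400 - len(counts)  # boxes that got no tree at all

lam = 100 / 400  # 0.25 trees per box on average
for k in range(4):
    expected = 400 * math.exp(-lam) * lam ** k / math.factorial(k)
    print(f"{k} trees: observed {observed.get(k, 0):3d} boxes, Poisson predicts {expected:5.1f}")
```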

Poisson Distribution and the London Bombings in WW2

Besides the forest and the trees, one could use this type of analysis on whatever data is available, when it makes sense. As I read about the Poisson distribution, its application to the London bombings in the Second World War (WW2) often comes up. There is a good summary about this on Stack Exchange. And if you read it, you will note that this is exactly the type of analysis that my forest example above showed.

In this story, the Germans were bombing the British in London, and the British suspected they were targeting some specific locations (bases, ports, etc.). To figure out whether the Germans were really targeting specific locations with insider knowledge, the British divided London into squares and calculated the number of bomb hits in each square. They then calculated the probability of each number of hits in a square, and compared the results to the Poisson distribution. As the result closely matched the Poisson distribution, the British concluded that the bombings were, in fact, random, and not based on inside information. Again, this is what would happen if we just replace “trees” with “bombs” and “forest” with “London” in my forest example above.

Of course, a cunning adversary could account for this by targeting a few high-interest targets, and hiding them in a large number of randomly dropped bombs. But the idea is there, on how the Poisson distribution could be used for this type of analysis as well. Something to note, of course, is that the Stack Exchange post also discusses this as actually being a more suitable problem for a Binomial distribution. Which is closely related to the Poisson distribution (often quite indistinguishable). Let’s see about this in more detail.

Poisson vs Binomial Distribution

When I read about what the Poisson distribution is and where it derives from, I often run into the Binomial distribution. The point being that the Binomial distribution comes from the more intuitive notion of single events and their probabilities. Then some smart people, like Siméon Denis Poisson, used this to derive the generalized formula of the Poisson distribution. And now, 200 years later, I can just plug in the numbers. Nice. But I digress. The Binomial distribution formula:

The Binomial distribution formula.

The above figure shows the general Binomial distribution formula. It has only a few variables:

  • n: number of trials
  • k: number of positive outcomes (the events we are calculating the probability for)
  • p: probability of the positive outcome

We can further split the Binomial formula into different parts:

Different parts of the Binomial formula.

I found a good description of the binomial formula, and these parts, on the Math is Fun website. To briefly summarize, the three parts in the above figure correspond to:

  • part1: the number of possible trial combinations that can produce the wanted result
  • part2: the probability of the k positive outcomes occurring
  • part3: the probability of the remaining n-k outcomes being negative

To summarize, it uses the number of possible trial combinations, the probability of a positive outcome, and the probability of a negative outcome to calculate the probability of k positive outcomes. As before, calculating this formula for different values of k gives us the Binomial distribution.

What does any of this have to do with the Poisson distribution? As an example, let’s look at the reliability example from before as a Binomial Distribution instead. We have the following numbers for the reliability example:

  • average number of failures (λ, in one year): 1 in 100
  • number of events investigated (k): 0-7

We can convert these into the values needed for the Binomial Distribution:

  • n: number of trials = 100. Because we have an average over 100 trials, and want to know the result for 100 storage nodes.
  • p: probability of failure = 1/100 = 0.01. Because we have a 1 in 100 chance of failure.
  • k: 0-7 as in the Poisson example.

Comparing the results, we get the following table:

Poisson vs Binomial

In this table, poisson refers to the values calculated with the Poisson formula. binomial100 refers to the values calculated with Binomial formula using parameters n=100 and p=0.01. Similarly, binomial1000 uses n=1000 and p=0.001. As you can see, these are all very close.
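The comparison in the table can be reproduced with the standard library (a sketch of both formulas side by side):

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# The reliability example: an average of 1 failure per 100 nodes per year.
for k in range(8):
    print(f"k={k}: poisson={poisson_pmf(k, 1):.6f}  "
          f"binomial100={binomial_pmf(k, 100, 0.01):.6f}  "
          f"binomial1000={binomial_pmf(k, 1000, 0.001):.6f}")
```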

It is generally said that as p approaches 0 and n approaches infinity in the Binomial Distribution, it approximates the Poisson Distribution. Or the other way around. In any case, the idea is that as your interval of sub-trials gets smaller in the Binomial, you get closer to the theoretical value calculated by the Poisson Distribution. This is visible in the table: binomial100 is already close to poisson, but binomial1000 is noticeably closer still, with its larger n and smaller p.

You can further experiment with this yourself, by increasing n, and decreasing p in this example, and using a Binomial Distribution calculator. For example, where p = 0.01, 0.001, 0.0001, while n = 100, 1000, 10000, and so on.

The difference between the Binomial and Poisson distribution is a bit elusive. As usual, there is a very nice Stack Exchange question on the topic. I would summarize it as using Poisson when we know the average rate over a timeframe, and the timeframe (or area/volume) can be split into smaller and smaller pieces. Thus the domain of the events is continuous and bordering on infinite trials (as you make the interval smaller and smaller). The Binomial being a good fit if you have a specific, discrete, number of known events and a probability. Like a thousand coin tosses. I recommend the Stack Exchange post for more insights.

Other Examples of Poisson Distribution

As I was looking into the Poisson distribution I found many examples trying to illustrate its use. Let’s look at a few more briefly.

Number of chocolate chips in a cookie dough. I thought this was an interesting one. Instead of events over time, or instances in an area, we are talking about instances in a volume of mass. Kind of like a 3-dimensional area. If we look at time (and distance) as 1-dimensional, and area as 2-dimensional, this would be the continuation to 3-dimensional. The applicability would mainly depend on how well the chips are distributed in the dough, and whether they can be considered independent (not clustering) and on average constant. But I find the application of the Poisson distribution to volumes of mass an interesting perspective.

Estimating traffic counts. How many cars pass a spot on a highway could be suitable for a Poisson distribution. I think there can be some dependencies between cars, since people tend to drive in clusters for various reasons. And this behaviour likely changes over specific time periods (of days, weeks, etc similar to the call center example).

Estimating bus arrivals. This one does not seem very meaningful, since the buses are on a schedule. Maybe the average rate would be quite constant over some time periods, but the arrivals are not very independent. Estimating how much the buses deviate from their schedule might work a bit better, although weather, traffic, and other aspects might have a big impact.

People arriving in a location. This is a bit tricky. I think people often tend to travel in groups, and if they arrive at some event, such as school classes, workdays, meetings, football games, concerts, or anything else like that, they tend to cluster a lot. So I do not think this would work very well for a Poisson distribution. But maybe in some cases.

Number of fatal airline crashes in a year is quite a classic example. If we consider the past few years (as of writing) and, for example, the issues Boeing had with their MCAS system, this would certainly not fit. But then this (the MCAS issue) would show up as an anomaly, similar to what the British were looking for in the London bombing example. And that would be correct, and a useful find in itself, if not otherwise observed. Other than these types of clustered events due to a specific cause, I believe airplane crashes make a reasonable example of applying the Poisson distribution.

A bit like people, animal behaviour could also be of interest. One example I saw online (sorry, could not find the link anymore) was about cows being distributed across a field, and how they tend to group (or cluster), meaning the distribution would not be truly independent, and likely not a very good fit for a Poisson distribution. This made me think about the minimum distance between events / instances as a parameter. For example, repeating my forest example, but setting a minimum distance from one tree to the next: the longer this minimum distance, the further the actual distribution should drift from the predicted Poisson distribution. This type of “personal space” seems likely in many real-world scenarios.

Most of the extra examples I listed here actually seem to share some properties with my snowflakes / call center example. You might not have the perfect case for the Poisson distribution, but if you figure out your rules and limitations, and find a suitable angle, you might still benefit from it.


I presented a deeper look at three cases of applying the Poisson distribution in this article. Hopefully they are helpful in understanding its behaviour and potential use. I find the reliability example quite concise for the most “basic” (or standard) application of the Poisson distribution. The call center / snowflake example gives it a bit more complex real-world context, and finally the trees in a forest / London bombing example illustrates the expansion of the application from the time domain into two-dimensional space. The additional cases I briefly discussed further highlighted a few interesting points, such as the expansion into 3-dimensional volumes, and finding different angles where the Poisson might be useful, even if not all of its criteria are directly matched.

My day-to-day job, and daily tasks in general, don’t really involve many uses for the Poisson (or Binomial) distribution. However, I believe the distribution is very useful to understand and keep in mind for when the situation arises. And when it does, some good pointers are useful to remember:

  • The Poisson distribution is always the same for the same parameters λ and k, regardless of the timeframe or area size. You just have to be able to scale the rate to the timeframe.
  • The probability and rate of events generally seem to be scalable, so if I have a rate of 1 in 100, I can also use a rate of 0.1 in 10, or 10 in 1000.
  • The Binomial distribution is a good alternative to keep in mind. I find it a good mental note to consider the Binomial if I have a given probability, and a specific (limited) number of events.
  • The Poisson distribution, on the other hand, is a clearer fit if I have the average rate, and a potentially unbounded number of events in a timeframe.
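The first two points are easy to sanity-check numerically. Here is a small sketch with the Poisson PMF written out in plain Python (scipy.stats.poisson.pmf computes the same thing); the numbers are just illustrative. The distribution depends only on λ and k, so a rate of 1 per 100 minutes over a 10-minute window gives the same λ, and thus the same probabilities, as 10 per 1000 minutes over the same window:

```python
import math

# Poisson PMF: P(k events) = lambda^k * e^(-lambda) / k!
def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Rate of 1 event per 100 minutes, over a 10-minute window: lambda = 0.1
lam_a = (1 / 100) * 10
# Rate of 10 events per 1000 minutes, same window: also lambda = 0.1
lam_b = (10 / 1000) * 10

# Same lambda and k -> same distribution, regardless of how the rate was framed
assert math.isclose(poisson_pmf(0, lam_a), poisson_pmf(0, lam_b))
```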

In the end, I think the most important thing is to remember what these distributions are good for. Poisson is for estimating the probability of a discrete number of events, given their average rate in a timeframe/area/volume, when the timeframe is continuous (e.g., you can divide an hour into smaller and smaller time units for more trials) but the events are discrete (no partial events). Binomial, on the other hand, is for when you have the probability of an event, and the number of trials is discrete (a specific number of trials). Keeping the basic applications in mind, you can then look up the details as needed, and figure out the best way to apply them.
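That point about dividing the timeframe into ever smaller units can be made concrete: a Binomial with many trials and a small per-trial probability converges to a Poisson with λ = n·p. A small sketch of this (the rate and counts are just example values):

```python
import math

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

lam = 2.0  # e.g., an average of 2 calls per hour
k = 3      # probability of exactly 3 calls in that hour

# Divide the hour into n slices, each an independent "trial" with p = lam / n.
# The finer the slicing, the closer the Binomial gets to the Poisson value.
for n in (60, 3600):
    print(n, binom_pmf(k, n, lam / n))
print("poisson", poisson_pmf(k, lam))
```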

That’s all for today. If you think I missed something, or anything to improve, let me know 🙂

Understanding Golang Project Structure

Go is an interesting programming language, with a nice focus on keeping it minimal. In that sense, the classic Go workspace setup seems like a bit of an anomaly, requiring a seemingly complex project structure. It sort of makes sense once you understand it. But first you need to understand it. That is what this post aims at.

I start by defining the elements of a classic Go workspace setup. In recent Go versions, Go Modules are a new way to set up a workspace, which I find can simplify it a bit. However, even with modules, the classic structure applies. Although it is not strictly necessary to use the classic structure, it is useful to understand.

Go Packages

The base of Go project structure is the concept of a Go package. Coming from years of Java indoctrination, with Python sprinkled on top, I found the Go package nuances quite confusing.

Package Path vs Name

A Go package reference consists of two parts: a package path and a package name. The package path is used to import the code, the package name to refer to it. The package name is typically the last part of the import, but that is not a strict requirement. Example:

package main

import "github.com/mukatee/helloexample/hellopkg"

func main() {
    hellopkg.Hello()
}

The above code declares a file with the package name main. Let's say this code is in a file in the directory github.com/mukatee/helloexample. Go only allows files from a single package in a single directory, so with this definition, all Go files in the directory github.com/mukatee/helloexample would need to declare themselves as package main. The package name main is special in Go, as it is used as the binary program entry point.

The above code also imports a package in the path github.com/mukatee/helloexample/hellopkg. This is the package path being imported. In the classic Go project structure, the package path practically refers to a directory structure. The typical package name for this import would be hellopkg, matching the last part of the package path used for the import.

For example, consider that the path github.com/mukatee/helloexample/hellopkg contains a file named somefile.go. Full path github.com/mukatee/helloexample/hellopkg/somefile.go. It contains the following code:

package hellopkg

import "fmt"

func Hello()  {
    fmt.Println("hello from hellopkg.Hello")
}

The code from this package is referenced (in the main.go file above) as hellopkg.Hello(), or more generally as packagename.functionname(). In this example it is quite understandable, since the package name matches the final element of the package path (hellopkg). However, it is possible to make this much more confusing. Consider if everything else was as before, but the code in somefile.go were the following:

package anotherpkg

import "fmt"

func Hello()  {
    fmt.Println("hello from anotherpkg.Hello")
}

To run code from this package, now named anotherpkg, we would instead need to have the main.go contain the following:

package main

import "github.com/mukatee/helloexample/hellopkg"

func main() {
    anotherpkg.Hello()
}

This is how you write yourself some job security. There is no way to know where anotherpkg comes from in the above code. Which is why there is a strong recommendation to keep that last package path element the same as the package name. Of course, the standard main package needed to run a Go program is an immediate anomaly, but let's not go there.

Finally, you can make the package name explicit when importing, regardless of what the package name is defined inside the file:

package main

import hlo "github.com/mukatee/helloexample/hellopkg"

func main() {
    hlo.Hello()
}

In the above code, hlo is the alias given to the package imported from the path github.com/mukatee/helloexample/hellopkg. After this aliased import, it is possible to refer to the code imported from this path as hlo.Hello() regardless of whether the package name given inside the files in the path is hellopkg, anotherpkg, or anything else. This is similar to how you might write import pandas as pd in Python.


GOROOT

GOROOT refers to the directory path where the Go compiler and related tools are installed. It is a bit like JAVA_HOME for Java Virtual Machines.

I generally prefer to set this up myself, even though each system likely comes with some default. For example, the last time I was setting up Go on a Linux machine, the Go toolkit was installable with the Snap package manager. However, Snap warns about all the system changes the Go installer would make: something about Snap classic mode, and some other scary-sounding stuff. To avoid this, I just downloaded the official Go package, extracted it, and linked it to the path.

Extraction, and symbolic linking (on Ubuntu 20.04):

cd ~
mkdir exampledir
cd exampledir
mkdir golang_1_15
cd golang_1_15
#assume the following go install package was downloaded from the official site already:
tar -xzvf ~/Downloads/go1.15.3.linux-amd64.tar.gz
cd ..
ln -s golang_1_15/go go

The above shell script would extract the Go binaries into the directory ~/exampledir/golang_1_15/go (the last go part comes from the archive). It also creates a symbolic link from ~/exampledir/go to ~/exampledir/golang_1_15/go. This is just so I can point the GOROOT to ~/exampledir/go, and just change the symbolic link if I want to try a different Go version and/or upgrade the Go binaries.

The final step is to point GOROOT to the symbolic link, by adding the following to ~/.profile file, or whatever startup script your shell uses:

export GOROOT=/home/myusername/exampledir/go
export PATH=$PATH:$GOROOT/bin

After this, the go binaries are available in the shell, the libraries are found by the Go toolchain, and my Goland IDE found the Go toolchain on the path as well. Brilliant.


GOPATH

While GOROOT specifies the toolkit path, GOPATH specifies the Go workspace directory. GOPATH defaults to ~/go, and it is not strictly required to define it. When using the new Go modules, it is also easier to skip this definition, as the package path can be defined in the module definition. More on that later.

The part I find confusing about GOPATH is that the workspace directory it defines is not for a specific project, but intended for all Go projects. The actual project files are then to be put in a specific location under the GOPATH. It is possible to set up multiple workspaces, but this is generally not advised. Rather, it is suggested to use one, and put the different projects under specific subdirectories. Which seems all the same, since even with multiple workspaces, you still have to put your projects in the same subdirectory structure. I will illustrate why I find it confusing with an example.

The workspace defined by GOPATH includes all the source code files, compiled binaries, etc. for all the projects in the workspace. The directory structure is generally described as:
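$GOPATH/
  bin/
  pkg/
  src/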


These directories are:

  • bin: compiled and executable binaries go here
  • pkg: precompiled binary components, used to build the actual binaries for bin. It is also used as a sort of intermediate cache by the go get tool, and likely for other similar purposes. Generally I just ignore it, since it is mainly for the Go tools.
  • src: source code goes here

The above directory layout does not seem too confusing as such. However, if I look at what it means when I check out code from multiple projects on Github (or anywhere else) into the same workspace, the result is something like this:
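$GOPATH/
  bin/
  pkg/
  src/
    github.com/
      mukatee/
        helloexample/
          main.go
          ...
      randomuser/
        randompeertopeer/
          main.go
          ...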


And the above is where my confusion lies. Instead of checking out my projects into the root of the workspace, I first need to create their package path directories under the workspace, and then clone the git repo in there (or copy the project files if there is no git). In the above example, this workspace would have two projects under it, with the following package paths:

  • github.com/mukatee/helloexample
  • github.com/randomuser/randompeertopeer

To set this up, I would need to go to Github, check the projects, figure out their package paths, create these paths as a folder structure under the workspace, and then finally clone the project under that directory. After this, Go will find it. There is also the go get tool to download and install specific packages from Github (and likely elsewhere) into the correct place on the GOPATH. However, this does not clone the git repo for me, nor explain how my own code and repository should be placed there along with other projects. For that, I need to write posts like this so I get around to figuring it out 🙂

This workspace structure is especially confusing for me, since it seems to force all Go projects on Github to hold their main.go at the project root level, along with everything else you put in your repo, including your documentation and whatever else. I find many Go projects also end up hosting pretty much all their code at the root level of the repo. This easily makes the repository hard to navigate when I try to look at the code and it is all in one big pile in the top directory.

Again, this is what I find really confusing about it all. With Go modules it is at least a bit clearer. But still, there is much legacy Go out there, and one has to be able to use and understand it as needed. And even when using Go modules, I find I am much better off understanding this general Go workspace setup and structure.

Go Modules

Go modules are really simple to set up. You just create a go.mod file in the root of your project directory. This file is also very simple in itself. Here is a basic example:

module github.com/mukatee/helloexample

go 1.15

Or if that is too much to type, the go mod init command can do it for you:

go mod init github.com/mukatee/helloexample

That creates the go.mod file for you, all three lines of it.

The above go.mod example defines the module path on the first line. The code itself can be in any directory; it no longer matters whether it is on the GOPATH when using Go modules. The line go 1.15 defines the language version used for the project.

The official documentation still recommends using the whole GOPATH workspace definition as before. However, even with GOPATH undefined everything works if go.mod is there:

$ go build -v -o hellobin
$ ./hellobin
hello from hellopkg.Hello

In the above, I am specifying the output file with the -o option. If GOPATH is not set, it defaults to ~/go. Thus, if I have the above project with the go.mod definition, and run the standard build-and-install command go install on it, it will generate the output binary in ~/go/bin/helloexample. The -o option just defines a different output path in this case.

Seems a bit overly complicated, when I am just used to slapping my projects into some random workdir as I please. But I guess having some standardized layout has its benefits. Just took me a moment to figure it out. Hope this helps someone. It certainly helped me look all of this up properly by writing this post.


This post describes the general layout of the classic Go project. While I would use the new module structure for projects whenever possible, I sometimes download projects from Github to play with, and I am sure many corporations have various legacy policies that require this classic structure. There is a lot more information and nuance out there, but I encourage everyone to look it up themselves when they come to that bridge.

The package path vs package name system is something that really confused me coming from other programming languages. It is not too bad once you understand it. But then, most things get easy once you master them, and achieving the mastery is the hard part. I cannot say if the Go project structure is confusing, or if I am just loaded with too much legacy from other environments (Java, Python, ..).

There is much good to Go in my opinion, and the modules system helps fix many of the general issues already. I have written a few small projects in Go, and look forward to trying some more. Sometimes the aim for simplicity can require some bloat in Go, not sure if the project structure qualifies in that category. In any case, good luck working with Go, even if you don’t need it.. I do 🙂

Explaining Machine Learning Classifiers with LIME

Machine learning algorithms can produce impressive results in classification, prediction, anomaly detection, and many other hard problems. Understanding what the results are based on is often complicated, since many algorithms are black boxes with little visibility into their inner working. Explainable AI is a term referring to techniques for providing human-understandable explanations of the ML algorithm outputs.

Explainable AI is interesting for many reasons, including being able to reason about the algorithms used, the data we have to train them, and to understand better how to test the system using such algorithms.

LIME, or Local Interpretable Model-Agnostic Explanations, is one technique that seems to have gotten attention lately in this area. The idea of LIME is to give it a single datapoint, and the ML algorithm to use, and it will try to build an understandable explanation for the output of the ML algorithm for that specific datapoint. Such as "because this person was found to be sneezing and coughing (datapoint features), there is a high probability they have a flu (ML output)".
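To make the core idea concrete, here is a rough sketch of the LIME procedure on a toy problem (this is a hypothetical minimal version, not the actual lime library): sample perturbed points around the datapoint, query the black-box model on them, weight the samples by proximity, and fit a simple weighted linear model whose coefficients serve as the explanation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "black box": its output depends strongly on feature 0, weakly on feature 1.
# This stands in for any trained classifier's predicted probability.
def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] + 0.3 * X[:, 1])))

def lime_like_weights(x, n_samples=2000, kernel_width=0.75, ridge=1e-3):
    # 1. Sample perturbations in the neighborhood of the datapoint x
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    y = predict_proba(Z)
    # 2. Weight samples by proximity to x (RBF kernel)
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / kernel_width ** 2)
    # 3. Fit a weighted ridge regression as the local surrogate model
    Zb = np.hstack([Z, np.ones((n_samples, 1))])  # add intercept column
    A = Zb.T @ (w[:, None] * Zb) + ridge * np.eye(Zb.shape[1])
    b = Zb.T @ (w * y)
    coef = np.linalg.solve(A, b)
    return coef[:-1]  # per-feature explanation weights, intercept dropped

weights = lime_like_weights(np.array([0.2, -0.3]))
# feature 0 should get a much larger local weight than feature 1
```

The real lime library wraps this idea with feature discretization, kernel choices, and visualizations, which is also where some of the interpretation subtleties discussed below come from.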

There are plenty of introductory articles around for LIME, but I felt I needed something more concrete. So I tried it out on a few classifiers and datasets / datapoints to see.

For the impatient, I can summarize: LIME seems interesting and a step in the right direction, but I still found the details confusing to interpret. It didn't really make me very confident in the explanations. There seems to be still a way to go for easy-to-understand, high-confidence explanations.

Experiment Setups


There are three sections to my experiments in the following. First, I try explaining output from three different ML algorithms specifically designed for tabular data. Second, I try explaining the output of a generic neural network architecture. Third, I try a regression problem, as opposed to the first two, which examine a classification problem. Each of the three sections uses LIME to explain a few datapoints, each from different datasets for variety.

Inverted Values

As a little experiment, I took a single feature that was ranked as having a high contribution to the explanation for a datapoint by LIME, for each ML algorithm in my experiments, and inverted their value. I then re-ran the ML algorithm and LIME on this same datapoint, with the single value changed, and compared the explanation.

The inverted feature was in each case a binary categorical feature, making the inversion process obvious (e.g., change gender from male to female or the other way around). The point with this was just to see if changing the value of a feature that LIME weights highly results in large changes in the ML algorithm outputs and the associated LIME weights themselves.

Datasets and Features

The datasets used in different sections:

  • Titanic: What features contribute to a specific person classified as survivor or not?
  • Heart disease UCI: What features contribute to a specific person being classified at risk of heart disease?
  • Ames housing dataset: What features contribute positively to predicted house price, and what negatively?

Algorithms applied:

  • Titanic: classifiers from LGBM, CatBoost, XGBoost
  • Heart disease UCI: Keras multi-layer perceptron NN architecture
  • Ames housing dataset: regressor from XGBoost

Tree Boosting Classifiers

Some of the most popular classifiers I see used with tabular data are the gradient boosted decision tree based ones: LGBM, Catboost, and XGBoost. Many others exist that I also use at times, such as Naive Bayes, Random Forest, and Logistic Regression. However, LGBM, Catboost, and XGBoost are the ones I often try first these days for tabular data. So I try using LIME to explain a few datapoints for each of these ML algorithms in this section. I expect a similar evaluation for other ML algorithms would follow a quite similar process.

For this section, I use the Titanic dataset. The goal with this dataset is to predict who would survive the shipwreck and who would not. Its features:

  1. survival: 0 = No, 1 = Yes
  2. pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  3. sex: Sex
  4. Age: Age in years
  5. sibsp: number of siblings / spouses aboard the Titanic
  6. parch: number of parents / children aboard the Titanic
  7. ticket: Ticket number
  8. fare: Passenger fare
  9. cabin: Cabin number
  10. embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

The actual notebook code is available on my Github as well as in a Kaggle notebook.

Each of the three boosting models (LGBM, Catboost, XGBoost) provides access to their internal statistics as a form of feature weights. For details, check some articles and documentation. These types of model feature weights provide a more holistic view of the model workings, over all datapoints as opposed to the single datapoint that LIME tries to explain. So in the following, I will show these feature weights for comparison where available.

However, there is also some very good criticism of using these types of classifier-internal statistics for feature importances, noting it might also be meaningful to compare with other techniques such as permutation importance and drop-column importance. As such, I also calculate permutation importance for each of the three boosters here, as well as later for the Keras NN classifier.
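Permutation importance itself is simple enough to sketch. The following is a hypothetical minimal version on toy data (for real estimators, sklearn's permutation_importance function in sklearn.inspection does this properly): shuffle one column at a time and measure how much the model's score drops.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends only on feature 0; feature 1 is pure noise
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)

# A trivial stand-in "model" that thresholds feature 0
def predict(X):
    return (X[:, 0] > 0).astype(int)

def permutation_importance(predict, X, y, n_repeats=10):
    baseline = (predict(X) == y).mean()
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            # break the relationship between column j and the labels
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops.append(baseline - (predict(Xp) == y).mean())
        importances[j] = np.mean(drops)
    return importances

imp = permutation_importance(predict, X, y)
# feature 0 should show a large accuracy drop, feature 1 none at all
```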


LGBM

Feature Weights from Classifier / Permutations

The following figure illustrates the weights given by the model itself, when I trained it on the Titanic dataset, via the classifier feature_importances_ attribute.

LGBM Feature Weights

And the ones illustrated by the following figure are the ones given by the SKLearn’s permutation importance function for the same classifier.

LGBM Permutation Weights

Comparing the two above, the model statistics based weights, and the permutation based weights, there is quite a difference in what they rank higher. Something interesting to keep in mind for LIME next.

Datapoint 1

The following figure illustrates the LIME explanations (figures are from LIME itself) for the first item in the test set for Titanic data:


The figure shows two versions of the same datapoint. The one on the left is the original data from the dataset. The one on the right has the sex attribute changed to the opposite gender. This is the inversion of the highly ranked LIME feature I mentioned before.

Now, compare these LIME visualizations/explanations for these two datapoint variants, to the global feature importances above (from model internal statistics and permutation score). The top features presented by LIME closely match those given by the global permutation importance as top features. In fact, it is almost an exact match.

Beyond that, the left side of the figure illustrates one of my main confusions about LIME in general. The prediction of the classifier for this datapoint is:

  • Not survived: 71% probability
  • Survived: 29% probability

I would expect the LIME feature weights to then show the highest contributions for the not survived classification. But it shows much higher weights for survived. By far, "Sex=male" seems to be the heaviest weight for any variable given by LIME here, and it is shown as pointing towards survived. Similarly, the overall LIME feature weights in the left-hand figure are

  • Not survived: 0.17+0.09+0.03+0.00=0.29
  • Survived: 0.31+0.15+0.07+0.03+0.02+0.01=0.59

Funny how the not survived weights sum up to exactly the predicted probability for survived (0.29). I might think I am looking at it the wrong way, but further explanations I tried with other datapoints seem to indicate otherwise. Starting with the right part of the above figure.

The right side of the above figure, with the gender inverted, also shows the sex attribute as the highest contributor. But now, the title has risen much higher. So perhaps it is telling that a female master has a higher chance of survival? I don’t know, but certainly the predictions of the classifier changed to:

  • Not survived: 43%
  • Survived: 57%

Similarly, passenger class (Pclass) value has jumped from weighting on survival to weighting on non-survival. The sums of LIME feature weights in the inverted case do not seem too different overall, but the prediction has changed by quite a bit. It seems complicated.

Datapoint 2

LIME explanation for the second datapoint in the test set:


For this one, the ML prediction for the left side datapoint variant seems to indicate even more strongly that the predicted survival chance is low, but the LIME feature weights point even more strongly in the opposite direction (survived).

The right side figure here illustrates a bit how silly my changes are (inverting only gender). The combination of female with mr should never happen in real data. But regardless of the sanity of some of the value combinations, I would expect the explanation to reflect the prediction equally well. After all, LIME is designed to explain a given prediction with given features, however crazy those features might be. On the right-hand side the feature weights at least seem to match the prediction a bit better than on the left side, but then why is it not always matching in the same way?

An interesting point is also how the gender seems to always weight heavily towards survival in both cases here. Perhaps it is due to the combinatorics of the other feature values, but given how the LIME weights vs predictions seem to vary across datapoints, I wouldn’t be so sure.


Catboost

Feature Weights from Classifier / Permutations

Model feature weights based on model internals:

Catboost Feature Weights

Based on permutations:

Catboost Permutation Weights

Interestingly, parch shows negative contribution.

Datapoint 1

First datapoint using Catboost:

LIME Catboost 1

In this case, both the LIME weights for the left (original datapoint) and right (inverted gender) sides seem to be more in line with the predictions. Which sort of shows that I cannot only blame myself for interpreting the figures wrong, since they sometimes seem to match the intuition, and other times not..

As opposed to the LGBM case above, in this case (for Catboost) the top LIME features actually seem to follow almost exactly the feature weights from the model internal statistics. For LGBM it was the other way around: the top LIME features were not following the internal weights but rather the permutation weights. Confusing, as everything else about these weights, yes.

Datapoint 2

The second datapoint using Catboost:

LIME Catboost 2

In this case, LIME is giving very high weights for variables on the side of survived, while the actual classifier is almost fully predicting non-survival. Uh oh..


XGBoost

Feature Weights from Classifier / Permutations

Model feature weights based on model internal statistics:

XGB Feature Weights

Based on permutations:

XGB Permutation Weights

Datapoint 1

First datapoint explained for XGBoost:


In this case, the left one seems to indicate not-survived on the weights quite heavily, but the actual predictions are quite even between survived and not survived. On the right side, the LIME feature weights seem to match the prediction better.

As for LIME weights vs the global weights from model internals and permutations, in this case they seem to be mixed. Some LIME top features are shared with the top feature weights for model internals, some are shared with permutations. Compared to the previous sections, the LIME weights vs model and permutation weights seem to be all over the place. Which might be some attribute of the algorithms in the case of the internal feature weights, but I would expect LIME to be more consistent with regards to the permutation weights, as that algorithm never changes.

Datapoint 2

Second datapoint:


Here, the left one's weights seem to indicate survival much more, while the actual prediction is non-survival. On the right side, the weights and predictions seem more in line again.

Explaining a Keras NN Classifier

This section uses a different dataset of the Cleveland Heart Disease risk. The inverted variable in this case is not gender but the cp variable, since it seemed to be the highest scoring categorical variable for LIME on the datapoints I looked at. It also has 4 values, not 2, but in any case, I expect changing a high scoring variable to show some impact.


The features in this dataset:

  1. age: age in years
  2. sex: (1 = male; 0 = female)
  3. cp: chest pain type (4 values)
  4. trestbps: resting blood pressure in mm Hg on admission to the hospital
  5. chol: serum cholestoral in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl
  7. restecg: resting electrocardiographic results (values 0,1,2)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
  12. ca: number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Feature Weights from Permutations

Keras does not provide feature weights based on model internal statistics, being a generic neural networks framework, as opposed to specific algorithms such as the boosters above. But permutation based feature weighting is always an option:

Keras Permutation Weights

Training Curves

Training curves are always nice, so here you go:

Keras Training

Datapoint 1

First datapoint explained by LIME for Keras:

LIME Keras 1

This one is predicting almost fully no risk for both datapoints. Yet the weights seem to be pointing almost fully to the risk of heart disease side.

The LIME weights compared to the global permutation weights share the same top 1-2 features, with some changes after.

Datapoint 2

Second datapoint explained by LIME for Keras:

LIME Keras 2

In this case, the predictions and weights are more mixed on both sides. The right side seems to have the weights much more on the no risk side than the left side, yet the change between the two is that the prediction has shifted more towards the risk of heart disease side.

In this case, the features are quite different from the first datapoint, and also from the global weights given by permutation importance. Since LIME aims to explain single datapoints and not the global model, I don’t see an issue with this. However, I do see an issue in not being able to map the LIME weights to the predictions in any reasonable way. Not consistently at least.

Explaining an XGBoost Regressor

Features in the Ames Housing Dataset used in this section:

  • SalePrice – the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • Utilities: Type of utilities available
  • OverallQual: Overall material and finish quality
  • GrLivArea: Above grade (ground) living area square feet
  • ExterQual: Exterior material quality
  • Functional: Home functionality rating
  • KitchenQual: Kitchen quality
  • FireplaceQu: Fireplace quality
  • GarageCars: Size of garage in car capacity
  • YearRemodAdd: Remodel date
  • GarageArea: Size of garage in square feet

Datapoint 1


As discussed here, LIME results seem more intuitive to reason about for classification than for regression. For regression, it should show some relative value of how the feature values contribute to the predicted regression value. In this case, how the specific feature values are predicted to impact the house price.

But as mentioned, the meaning of this is a bit unclear. For example, what does it mean for something to be positively weighted? Or negatively? With regards to what? This would require more investigation, but I will stick to the details of classification in this post.

Data Distribution

Just out of interest, here is a description of the data distribution for the features shown above.

XGBReg Distribution

One could perhaps analyze how the feature value distributions relate to the LIME weights for those variables, and use that as a means to analyze the LIME results further in relation to the predicted price. Maybe someday someone will.. 🙂


Compared to the global feature weights given by the model internal statistics and the permutations, the LIME results often share some of the top features. And comparing explanations for different datapoints using the same algorithm, there appear to be some changes in which features LIME ranks highest per datapoint. Overall, this all makes sense considering what LIME is supposed to be: explaining individual datapoints, where the globally important features likely often (and on average should) rank high, but where single points can vary.

LIME in general seems like a good way to visualize feature importances for a datapoint. I like how the features are presented as weighting in one direction vs the other. The idea of trying values close to a point to come up with an explanation also seems to make sense. However, many of the results I saw in the above experiments do not quite seem to make sense. The weights presented often seem to be opposed to the actual predictions.
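The core idea of trying values close to a point can be illustrated with a small NumPy sketch (my own toy version, not the actual lime library): sample points around the datapoint, weight them by proximity, and fit a weighted linear surrogate whose coefficients act as the local explanation. The black-box model here is made up for illustration.

```python
import numpy as np

def lime_style_explanation(predict_fn, x, num_samples=5000, kernel_width=0.75, seed=0):
    """Toy version of LIME's local surrogate idea: sample near x, weight
    samples by proximity, fit a weighted linear model, and return its
    coefficients as the per-feature local explanation."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(scale=0.5, size=(num_samples, x.size))  # perturb around x
    y = predict_fn(X)
    # Proximity kernel: closer samples get higher weight.
    d2 = ((X - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    A = np.hstack([np.ones((num_samples, 1)), X])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # per-feature local weights (intercept dropped)

# A made-up black-box model: feature 0 pushes the prediction up,
# feature 1 pushes it down, feature 2 is irrelevant.
black_box = lambda X: 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.0 * X[:, 2]
weights = lime_style_explanation(black_box, np.array([1.0, 1.0, 1.0]))
print(weights.round(2))  # ≈ [ 3. -2.  0.]
```

For a linear black box, the surrogate recovers the true coefficients; for a real non-linear model, the weights only hold locally, which is where interpreting them gets tricky.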

This book chapter hosts some good discussion on the limitations of LIME, and maybe it explains some of this. The chapter ends up advising great care in applying LIME, noting how the LIME parameters impact the results and explanations given. Which seems in line with what I see above.

Also, many of the articles I linked in the beginning simply gloss over the interpretation of the results and whether they make sense, or make seemingly strange assumptions. Such as this one, which gives me the impression that the explanation weights would change depending on which class the classifier predicts with higher probability. To me, this does not seem to be what the visualizations show.

Maybe it would be more useful to understand the limitations and not expect too much, even if I feel like I don’t necessarily get all the details. Either it is poorly explained, or I did get the details and it is just very limited. This is perhaps coming from the background of LIME itself, where the academics must sell their results as the greatest and best in every case, and put aside the limitations. This is how you get your papers accepted and cited, leading to more grants and better tenure positions..

I would not really use LIME myself. Mostly because I cannot see myself trusting the results very much, no matter what the sales arguments. But overall, it seems like interesting work, and perhaps something simpler (to use) will be available someday, where I feel like having more trust in the results. Or maybe the problem is just complicated. But as I said, these all seem like useful steps in the direction of improving the approaches and making them more usable. Along these lines, it is also nice to see these being integrated as part of ML platform offerings and services.

There are other interesting methods for similar approaches as well. SHAP is one that seems very popular, and Eli5 is another. Some even say LIME is a subset of SHAP, which should be more complete than the sampling approach taken by LIME. Perhaps it would be worth the effort to make a comparison some day..

That’s all for this time. Cheers.

AWS EC2 / RDS Pricing and Performance

AWS EC2 pricing seems complicated, and many times I have tried to figure it out. Recently I was there again, looking at it from the RDS perspective (same pricing model for the underlying EC2 instances), so here we go.

Amazon/AWS calls their virtual machine service Elastic Compute Cloud (EC2). I used to think about the "elastic" part in terms of being able to scale your infrastructure by adding and removing VMs as needed. But I guess scaling is also relevant in terms of allocating more or less compute on the same VM, or considering how many compute tasks a single hardware host in AWS can handle at the same time. Let’s see why…

Basic Terminology

What is compute in EC2? AWS used to use a term called Elastic Compute Unit (ECU). As far as I can tell, this has been largely phased out. Now they measure their VM performance in terms of virtual CPU (vCPU) units.

So what is a vCPU? At the time of writing this, it is defined as "a thread of either an Intel Xeon core or an AMD EPYC core, except for M6g instances, A1 instances, T2 instances, and m3.medium.". A1 and M6g use AWS Graviton and Graviton 2 (ARM) processors, which I guess is a different architecture (no hyperthreading?). T2 is not described in as much detail, except as an Intel 3.0 GHz or 3.3 GHz processor (older, no hyperthreading?). Anyway, I go with vCPU meaning a (hyper)thread allocated on an actual CPU host. Usually this would not even be a full core but a hyperthread.

There are different types of vCPUs on the instances, as they use different physical CPUs. But that is a minor detail. The instance types are more relevant here:

  • burstable standard,
  • burstable unlimited, and
  • fixed performance.

OK then, what are they?

Fixed Performance

The fixed performance instance type is the simplest. It is always allocated its vCPUs in full. A fixed performance instance with 2 vCPUs can run those 2 vCPUs (hyperthreads) at up to 100% CPU load at all times, with no extra charge. The price is always fixed. If you don’t need the full 100% CPU power at all times, a burstable instance can be cheaper. But only if you don’t "burst" too much, in which case the burstable type becomes more expensive.

Burstable Standard

The concept of a burstable instance is what I find a bit complex. There is something called the baseline performance. This is what you always get, and is included in the price.

On top of the baseline performance, burstable instances have something called CPU credits. Different instance types get different numbers of credits. Here are a few examples (at the time of writing this..):

Instance type   Credits/h   Max creds   vCPUs   Mem.   Baseline perf.
T2.micro        6           144         1       1GB    10%
T2.small        12          288         1       2GB    20%
T2.large        36          864         2       8GB    20% * 2
T3.micro        12          288         2       1GB    10% * 2
T3.small        24          576         2       2GB    20% * 2
T3.large        36          864         2       8GB    30% * 2
M4.large        -           -           2       8GB    200%

Baseline performance

I will use the T2.micro from above table as an example. The same concepts apply to other instance types as well, just change the numbers.

T2.micro baseline performance is 10%, and there is a single vCPU instance allocated, referring to a single hyperthread. The 10% baseline refers to being able to use 10% of the maximum performance of this hyperthread (vCPU).

CPU credits

Every hour, a T2.micro gets 6 CPU credits. If the instance runs at or below the baseline performance (10% here), it saves these credits for later use, up to a maximum of 144 saved credits for a T2.micro. These credits are always awarded, but if your application load is such that the instance can use more than the 10% baseline performance, it will spike to that higher load as soon as a CPU credit is allocated, and consume the credit immediately.

A credit is used up in full if the instance runs at 100%, and in part if it runs higher than the baseline but lower than the maximum 100% performance. If multiple vCPUs are allocated to an instance, and they all run higher than the baseline, they will use correspondingly multiple amounts of CPU credits.

Well, that is what the sites I linked above say. But here is an example, where I ran a task on a T2.micro instance after it had been practically idle for more than 24 hours. So it should have had the full 144 CPU credits at this point.

T2 load chart

In the above chart, the initial spike around midnight lasts about 144 minutes, although the chart timeline is too coarse to show it. It is from an RDS T2.micro instance under heavy write load (I was writing as much as I could all the time, from another EC2 T2.micro instance). So the timeline of 144 minutes seems consistent with the credit numbers. But the CPU percentage shown here is not, since 10% should be the baseline.. uh. It could also be that the EC2 instance responsible for loading the data into the above RDS instance has the same CPU credit limit, and thus the amount of data injected for writing is also limited. I will have to investigate more later, but the shape illustrates the performance throttling and CPU credit concepts.

Considering the baseline, the T2.micro is practically an instance running at 10% of the single-thread performance of a modern server processor. Does not seem much. To me, the 1 vCPU definition actually seems rather misleading, as you don’t really get a vCPU but rather 10% of one. Given 60 minutes in an hour, and 6 CPU credits awarded to a T2.micro per hour, you get about one credit every 60/6 = 10 minutes. If you save up and run at low load for 24 hours (144*10 = 1440 minutes = 24 hours), you can then run for 144 minutes (2 hours 24 minutes) at 100% CPU load. In spikes of about 10 minutes, you can run for about one minute’s equivalent of 100% load.
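The credit arithmetic can be written out explicitly. This assumes AWS’s definition that one CPU credit equals one vCPU running at 100% for one minute:

```python
# CPU credit arithmetic for a T2.micro, assuming one CPU credit equals
# one vCPU at 100% load for one minute.
CREDITS_PER_HOUR = 6
MAX_CREDITS = 144

# Minutes between earned credits: 60 / 6 = 10 minutes per credit.
minutes_per_credit = 60 / CREDITS_PER_HOUR

# Hours of (near-)idle running needed to fill the credit balance.
hours_to_fill = MAX_CREDITS / CREDITS_PER_HOUR  # 24 hours

# With a full balance, minutes of 100% load on the single vCPU.
burst_minutes = MAX_CREDITS  # 144 minutes = 2 h 24 min

print(minutes_per_credit, hours_to_fill, burst_minutes)
```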

T2.micro instances are described as "High frequency Intel Xeon processors", with "up to 3.3 GHz Intel Scalable Processor". So the EC2 T2.micro instance is actually 10% of a single hyperthread on a 3.3 GHz processor. About equal to a 330 MHz single hyperthread.

The bigger instances can have multiple vCPUs allocated, as shown in the table above. They also get a bit more credits, and have a higher baseline performance percentage. The performance percentage is per vCPU, so an instance with 2 vCPUs and a baseline performance of 20% actually has a baseline performance of 2*20%. In this case, you are getting two hyperthreads at 20% of the CPU’s max capacity.

I still have various questions about this type, such as: do you actually use a fraction of an instance CPU credit, or do you use it in full when going over the baseline? Can the different threads (over multiple vCPUs) share the total of 2*20% = 40%, or is it just 20% per vCPU, with anything above that counting as over the baseline regardless of whether the other thread is idling? But I guess I have to settle for: burstable is complicated, fixed is simpler to use. Moving on.

Burstable Unlimited

The burstable instances can also be set to unlimited burstable mode.

In this mode, the instance can run (burst) at full performance all the time, not just limited by accumulated CPU credits. However, you still gain CPU credits as with burstable instances. In comparison to standard bursting type, if you use more CPU credits than you have, with unlimited mode you will just be billed extra for those. You will not be throttled by available credits, rather you can rack up nice extra bills.

If the average utilization rate is higher than the baseline plus available CPU credits, over a rolling 24-hour window (or the instance lifetime, if less than 24h), you will be billed for each vCPU-hour used over that measure (baseline average + CPU credits).

Each vCPU-hour above the extra billing threshold costs $0.05 (5 US cents). Considering the cost difference, this seems potentially quite expensive. Let’s see why.
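As a rough sketch of what the surplus billing can amount to, using the $0.05 figure from above (the workload numbers are made up for illustration, and the baseline fraction is ignored for simplicity):

```python
# Surplus-credit billing sketch for T2/T3 unlimited mode.
SURPLUS_PRICE = 0.05  # USD per vCPU-hour above baseline + earned credits

def surplus_cost(vcpu_hours_over):
    """Extra bill for vCPU-hours used above baseline + CPU credits."""
    return vcpu_hours_over * SURPLUS_PRICE

# Hypothetical example: a 2-vCPU burstable instance running both vCPUs
# flat out (credits long exhausted) for a 30-day month.
hours = 24 * 30                   # 720 hours
extra = surplus_cost(2 * hours)   # 1440 surplus vCPU-hours
print(round(extra, 2))            # 72.0 USD on top of the instance price
```

That extra charge alone is in the same ballpark as simply paying for a fixed-performance M5.large for the month, which is what the price comparison below also suggests.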

Comparing Prices

What do you actually get for the different instances? I used the following as basis for calculations:

  • T2: 3.0/3.3 GHz Xeon. AWS describes the T2 instances as: T2.small and T2.medium being "Intel Scalable (Xeon) Processor running at 3.3 GHz", and T2.large at 3.0 GHz. Somewhat strange numbers, but I guess there is some legacy there (more cores at lower GHz?).
  • T3: 3.1 GHz Xeon. AWS describes this as "1st or 2nd generation Intel Xeon Platinum 8000", and "sustained all core Turbo CPU clock speed of up to 3.1 GHz". My interpretation of 3.1 GHz might be a bit high, as the description says "boost" and "up to", but I don’t have anything better to go with.
  • M5: 3.1 GHz Xeon. Described the same as T3, "1st or 2nd generation Intel Xeon Platinum 8000", and "up to 3.1 GHz"..

Instance type   CPU GHz   Base perf   Instance MHz    vCPUs   Mem.   Price/h
T2.micro        3.3       10%         330 MHz         1       1GB    $0.0116
T2.small        3.3       20%         660 MHz         1       2GB    $0.0230
T2.large        3.0       20% * 2     600 MHz * 2     2       8GB    $0.0928
T2.large.unl    3.0       200%        3000 MHz * 2    2       8GB    $0.1428
T3.micro        3.1       10% * 2     310 MHz * 2     2       1GB    $0.0104
T3.small        3.1       20% * 2     620 MHz * 2     2       2GB    $0.0208
T3.large        3.1       30% * 2     930 MHz * 2     2       8GB    $0.0832
T3.large.unl    3.1       200%        3100 MHz * 2    2       8GB    $0.1332
M5.large        3.1       200%        3100 MHz * 2    2       8GB    $0.0960

I took the above prices from the AWS EC2 pricing page at the time of writing this. Interestingly, the AWS pricing seems so complicated that they cannot keep track of it themselves. For example, T3 has one price on the above page, and another on the T3 instance page. The latter lists the T3.micro price at $0.0209/hour, as opposed to the $0.0208 above. Yes, it is a minimal difference, but it just shows how complicated this gets.

The table above represents the worst-case scenario, where you run your instance at 100% performance as much as possible. It also does not account for the burstable instances being able to run at up to 100% CPU load for short periods as they accumulate CPU credits. And with the unlimited burstable types, you can get by with less if you run at or under the baseline. But, as the AWS docs note, unlimited burstable is about 1.5 times more expensive than the fixed-size instance (T3 vs M5).

Strangely, T2 is more expensive than T3, while T3 is more powerful. So I guess that other than free tier use, there should be absolutely no reason to use T2, ever. Unless maybe for some legacy dependency, or limited availability.


I always thought it was so nice of AWS to offer a free tier, and wondered how they could afford giving everyone a CPU to play with. Well, it turns out they don’t. They just give you one tenth of a single thread on a hyperthreaded CPU. This is what a T2.micro is in practice. I guess it can be useful for playing around and getting familiar with AWS, but yeah, the marketing is a bit of.. marketing? Cute.

Still, the price difference per hour from T2.large ($0.0928) or T3.large ($0.0832) to M5.large ($0.0960) seems small. Especially the difference between the T2 and M5 is so small it seems to make no sense. So why go bursty, ever? With the T3 you are saving about 15%. If you have bursty workloads and need to be able to handle large spikes, on a really large fleet of servers, maybe it makes sense. Or if your load is very low, you can get smaller (fractions of a CPU) instances using the bursty mode. But it seems to me it requires a lot of effort to profile your loads, make predictions, and monitor and manage it all.
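One way to put numbers on "why go bursty" is sustained baseline capacity per dollar. This uses the prices from the table above, with per-vCPU baseline MHz from my own arithmetic (e.g. 30% of 3.1 GHz ≈ 930 MHz for T3.large); treat it as a rough worst-case sustained-load comparison, not an official figure:

```python
# Sustained baseline MHz per dollar-hour, rough worst-case comparison.
instances = {
    # name: (total baseline MHz across vCPUs, price per hour in USD)
    "T2.large": (600 * 2, 0.0928),    # 20% of 3.0 GHz per vCPU
    "T3.large": (930 * 2, 0.0832),    # 30% of 3.1 GHz per vCPU
    "M5.large": (3100 * 2, 0.0960),   # fixed: 100% of 3.1 GHz per vCPU
}
for name, (mhz, price) in instances.items():
    print(name, round(mhz / price), "baseline MHz per $/h")
```

Under sustained load, the fixed M5.large gives roughly five times the capacity per dollar of T2.large, which matches the feeling above that the burstable types only pay off for genuinely low or spiky loads.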

In most cases I would actually expect something like Lambda functions to be the really best fit for those types of cases. Scaling according to the need, clear pricing (which seems like a miracle in AWS), and a simple operational model. Sounds just great to me.

In the end, comparing the burstable vs fixed performance instances, it just seems silly to me to be paying almost the same price for such a complicated burstable model, with seemingly much worse performance. But like I said, for big shops and big projects, maybe it makes more sense. I would be really interested to hear some concrete and practical experiences and examples of why to use one over the other (especially the bursty instances).

Python Class vs Instance variables

Recently I had the pleasure of learning about Python class vs instance variables. Coming from other programming languages, such as Java, this was quite different for me. So what are they?

I was working on my Monero scraper, so I will just use that as the example, since that is where I had the fun as well..

Class variables

Monero is a blockchain. A blockchain consists of linked blocks, which contain transactions. Each transaction further contains various attributes, the most relevant here being the tx_in and tx_out type elements. These simply describe actual Monero coins being moved in / out of a wallet in a transaction.

So I made a Transaction class to contain this information. Like this:

from typing import List

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height

I figured this should match a traditional Java class like this:

public class Transaction {
    int fee;
    int blockHeight;
    List<TxIn> txIns = new ArrayList<>();
    List<TxOut> txOuts = new ArrayList<>();

    public Transaction(int fee, int blockHeight) {
        this.fee = fee;
        this.blockHeight = blockHeight;
    }
}
Of course, it turned out I was wrong. A class variable in Python is actually more like a static variable in Java. So, in the above Python code, all the variables in the Transaction class are shared by all Transaction objects. Well, actually only the lists are shared in this case. But more on that in a bit.

Here is an example to illustrate the case:

t1 = Transaction(1, 100)
t1.tx_ins.append(TxIn(1, 1, 1, 1))
t2 = Transaction(1, 101)
t2.tx_ins.append(TxIn(1, 1, 1, 1))
print(t1.tx_ins)
print(t2.tx_ins)


I was expecting the above to print out a list with a single item for each transaction, since I only added one to each. But it actually prints two items for both:

[<monero.transaction.TxIn object at 0x109ceee10>, <monero.transaction.TxIn object at 0x11141abd0>]
[<monero.transaction.TxIn object at 0x109ceee10>, <monero.transaction.TxIn object at 0x11141abd0>]
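The sharing can also be verified directly. Here is a minimal standalone version of the first Transaction class above (with plain strings standing in for TxIn objects, since they are not needed for the point), showing that both instances and the class itself reference the very same list object:

```python
from typing import List

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[str] = []  # class-level mutable value, shared by all instances

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height

t1 = Transaction(1, 100)
t2 = Transaction(2, 200)
# All three names resolve to the one list object on the class:
print(t1.tx_ins is t2.tx_ins is Transaction.tx_ins)  # True
t1.tx_ins.append("txin-1")
print(len(t2.tx_ins))  # 1 - t2 "sees" t1's append via the shared list
```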

There was something missing for me here, which was understanding the instance variables in Python.

Instance variables

So what makes an instance variable an instance variable?

My understanding is that the difference is setting it in the constructor (the __init__() method):

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_ins = []
        self.tx_outs = []

Compared to the previous example, the only difference in the above is that the list values are assigned (again) in the __init__ method. Here is the result of the previous test with this setup:

[<monero.transaction.TxIn object at 0x107447e50>]
[<monero.transaction.TxIn object at 0x108ea2d10>]

So now it works as I intended, each transaction holding its own tx_ins and tx_outs lists, since they became instance variables.

I used the above Transaction structure when scraping the Monero blockchain. Because I originally had tx_ins and tx_outs initialized as lists at the class variable level, adding new values to these lists actually just kept growing the shared (class variable) lists forever. Which was not the intent.

I expected each Transaction object to have new, empty lists. Of course, they didn’t; the values just accumulated in the shared (class variable) lists. As I inserted the transactions one at a time into a database, the number of tx_ins and tx_outs for later transactions in the blockchain kept growing and growing, as they now also contained all the values of previous transactions. Hundreds of millions of inserted rows later..

After fixing the variables to be instance variables, the results and counts make sense again.


Even with the above fix to use instance variables for the lists, I still ran into an issue. I typoed the variable name in the constructor:

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_inx = []
        self.tx_outs = []

In the above, I typoed self.tx_inx instead of self.tx_ins. Because the class-level tx_ins is already initialized as an empty list, this gave no errors, but the objects kept accumulating in the shared list as before for the tx_ins part. Brilliant.

So I ended up with the following approach (for now):

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = None
    tx_outs: List[TxOut] = None

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_inx = []
        self.tx_outs = []

This way, if I typo the instance variable in the __init__ method, the class variable stays uninitialized, and I will get a runtime error when trying to use it (as the class variable value is None).
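As a quick check of that failure mode, here is a stripped-down sketch with the same kind of typo, showing it now surfacing as an AttributeError on first use instead of silently accumulating into a shared list:

```python
from typing import List

class Transaction:
    tx_ins: List[str] = None  # class-level placeholder, real list belongs in __init__

    def __init__(self):
        self.tx_inx = []  # oops: typo, should be self.tx_ins

t = Transaction()
caught = False
try:
    # Attribute lookup falls back to the class-level None, so this fails loudly:
    t.tx_ins.append("txin-1")
except AttributeError as err:
    caught = True
    print("caught:", err)  # 'NoneType' object has no attribute 'append'
```

A static type checker would flag the `List[str] = None` annotation mismatch, but at runtime this None-default approach is what turns the silent bug into a visible error.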

The main reason I am doing the variable initialization like this is to get the variables defined for IDE autocompletion, and to be able to add type hints to them for further IDE assistance and checking. There might be other ways to do it, but this is what I have figured out so far..

When I was looking into this, I also found this Stack Overflow post on the topic. It points to other optional ways to specify type hints for instance variables (e.g., typing the parameters to the constructor). There is also a pointer to Python Enhancement Proposal PEP 526. Which references other PEPs, but let’s not go into all of those..

I cannot say I have 100% assurance of all the possibilities related to these annotations, and instance vs class variables, but I think I got a pretty good idea.. If you have any pointers to what I missed or misinterpreted, please leave a comment 🙂

DevOps – What does it mean?

These days people talk a lot about "DevOps". Well, in IT they do. But the definition of DevOps seems quite elusive to me. Originally I thought about it as just Development and Operations working closer together. As in DEVelopment and OPerationS. Of course, this is a quite vague definition, as I have noticed. Over time, I have seen many quite different definitions.

The most popular definition I seem to come across is one where people call a tools (or platform?) team a "DevOps" team. This is a team running the Continuous Integration (CI) systems, and providing some common development (and test) infrastructure. Or any other type of development platform service, such as software-based infrastructure setup (containers, clouds, …).

I have also seen roles where people call themselves "DevOps Engineers", and they are really writing the product code, building it, testing it, operating it, … So pretty much doing everything you can do. Must be nice for the employer. Well, as long as you enjoy your job..

In some places there is the development team, and then there is a "DevOps" team working on infrastructure setup scripts using tools like Terraform. Which, BTW, is a nice tool like most from Hashicorp. At the same time, the dev team might be running all their code on whatever setup they have available, typically on their laptops using some combinations of Docker and whatnot.

This seems like a large disconnect to me. But on a second thought, maybe it makes sense (to some extent). You have to develop it somewhere, so you need some setup locally. This makes me think, it might be useful to have some model of a DevOps lifecycle, like what points in a project lifecycle do the Dev and Ops benefit most from collaboration, and what kind of collaboration is best suited at different times? Where do the QA and security best fit in?

Overall, it seems there is no single right definition for DevOps. Whatever works for you, I guess. Except that many could learn a lot from others and improve, if they were open-minded enough to realize it. And if they supported people in initiatives to learn and improve related processes, tools, techniques, …

Regarding such different approaches, and benefits and problems, I find the DevOps Topologies website has some good definitions for both "good" DevOps and anti-DevOps teams. Their definition of anti-DevOps seems to be largely about keeping the Dev and Ops separate but still calling it "DevOps". Because it’s trendy I guess. Somehow this does not surprise me..

The types listed as working better on DevOps Topologies on the other hand seem to be focused on building more overlap between the Dev and Ops. I guess it depends on the type of work and organization that is in question. There is something that seems related in Team Topologies but the website is a bit vague. Maybe I should get the book, but then I have an overly long reading list already. And somehow manage to distract myself from reading.. I used to have more chance for reading when I had some business trips, with less distractions on a plane, hotel, etc. But I digress.

I find I got the best idea of what DevOps is from reading the book The Phoenix Project. It is from 2013, so already about 7 years old at the time of this writing, but every description in it seems perfectly fine for today. Much like the 20+ year old Office Space movie, where you could just change the monitors to be flatter, and the software UIs a bit flashier. The main part of office politics would not change, much like it seems to stay the same for DevOps-related development processes. But I digress.

The Phoenix Project felt like a long story to get started, but in the end it gave me a great perspective on what the term might originally have been intended to mean. I interpret it as development working closely with operations (and testing + security) to make sure they share the exact same infrastructure (Terraform etc. today), with QA sharing it as well.

Overall, I also interpret it as dev not creating their own setups and throwing stuff over the wall to Ops and QA. Rather, all working closely together to figure out issues, build extensive monitoring and logging to make all their lives easier, and improve everything overall. Make it all work better and more reliably for the customer and for yourself (less need to get up at 4am for this..). And so on.

For me the Phoenix Project story delivered the message much better than all the websites and powerpoints with their diagrams and abstract descriptions. I guess I prefer stories that make things concrete with realistic examples. And yet, as I discussed above, there still seem to be many quite different definitions as well. I guess with something becoming popular this happens, and maybe for different systems and organizations a different approach with the same higher level goal works. I am sure there are plenty expensive consultants for all this with better answers than me :).

So to summarize my brief lamentations here, a few points:

  • DevOps seems to vary quite a bit across organizations, both in how they do it, and what might be a suitable model for them.
  • There seem to be many ways to do DevOps "wrong". Which I guess just means not getting the optimal benefit from it.
  • I would be interested to understand better how all this relates to the different phases of the software engineering lifecycle: early development, adding new features, maintenance, …
  • Stories on how Dev, Ops, QA, Security, and anything else have successfully worked together in different companies and software projects would be great to hear.

That’s all for today.

Leave some comments now. Like what do you think DevOps is? And what did I say all wrong? 🙂

Testing Machine Learning Intensive Systems (or Self-Driving Cars) – A Look at the Uber Accident

Previously I looked at what it means to test machine learning systems, and how one might use machine learning in software testing. Most of the materials I found on testing machine learning systems were academic in nature, and as such a bit lacking in practical views. Various documents on the Uber incident (fatally hitting a pedestrian/cyclist) have been published, and I had a look at those documents to find some more insights into what it might mean to test overall systems that rely heavily on machine learning components. I call them machine-learning intensive systems. Because.

Accident Overview

There are several articles published on the accident, and the information released for it. I leave the further details and other views for those articles, while just trying to find some insights related to the testing (and development) related aspects here. However, a brief overview is always in order to set the context.

This accident was reported to have happened on the 18th of March, 2018. It involved the Uber test car (a modified Volvo XC90) hitting a person walking with a bicycle. The person died as a result of the impact. This happened on a specific test route that Uber used for their self-driving car experiments. There was a vehicle operator (VO) in the Uber car, whose job was to oversee the autonomous car’s performance, mark any events of interest (e.g., road debris, accidents, interesting objects), label objects in the recorded data, and take over control of the vehicle in case of an emergency or other need (the system being unable to handle some situation).

The records indicate there used to be two VOs per car, one focusing more on the driving, and one more on recording events. They also indicate that before the accident, the number of VOs had been reduced to just one. The roles were combined, and an update of the car operating console was designed so that a single VO could perform both roles. The lone VO was then expected to keep a constant eye on the road, monitor everything, label the data, and perform any other tasks as needed. Use of a mobile phone while driving was prohibited for the VO, but the documents for the accident indicate the VO had been eyeing the spot where their mobile device was located, in a slot on the dashboard. The documents also indicate the VO had several video streaming applications installed, and records from the Hulu streaming service showed video streaming occurring on the VO’s account at the time of the accident.

The accident itself was a result of many combined factors, where the human VO seems to have put their attention elsewhere at just the wrong time, and the automation system failed to react properly to the pedestrian / cyclist. Some points on the automation system in relation to potential failures:

  • The system kept records of each moving object / actor and their movement history, using their previous movements and position as an aid to predict their future movements. The system was further designed to discard all previous movement (position) history information when it changed the classification of an object. So no history was available to predict the movement of an object / actor if its classification changed.
  • The classification of the pedestrian that was hit kept changing multiple times before the crash. As a result, the system constantly discarded all information related to her, severely inhibiting the system from predicting the pedestrian’s movement.
  • The system had an expectation to not classify anything outside a road crossing as a pedestrian. As such, before the crash, the system continuously changed the pedestrian classification between vehicle, bicycle, or other. This was the cause of losing the movement history. The system was not designed for the possibility of someone walking on the road outside a crossing area.
  • The system had safeguards in place to stop it from reacting too aggressively. A delay of 1 second was in place to delay braking when a likely issue was identified. This delayed automatic braking even at a point where a likely crash was identified (as in the accident). The reasoning was to avoid too aggressive reactions to false positives. I guess they expected the VO to react, and to log issues for improvement.
  • Even when the danger was identified and automated braking started, it was limited to a reasonable force to avoid too much impact on the VO. If this maximum braking was calculated as insufficient to avoid impact, the system would brake even less and emit an audio signal for the VO to take over. So if the maximum is not enough, slow down less(?). And the maximum emergency braking force was not set very high (as far as I understand..).
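To make the history-dropping design in the first points concrete, here is a toy sketch (my own illustration, in no way Uber’s actual code) of a tracker that discards an object’s movement history whenever its classification changes:

```python
# Toy illustration of the tracking design described above: movement
# history is kept per tracked object, but discarded on reclassification.
class TrackedObject:
    def __init__(self, classification):
        self.classification = classification
        self.history = []  # past positions, used to predict future movement

    def observe(self, position, classification):
        if classification != self.classification:
            # Reclassification wipes the history, so the predictor loses
            # everything it knew about this actor's movement so far.
            self.history = []
            self.classification = classification
        self.history.append(position)

obj = TrackedObject("vehicle")
obj.observe((0, 0), "vehicle")
obj.observe((1, 0), "vehicle")
print(len(obj.history))  # 2 positions of movement history available
obj.observe((2, 0), "bicycle")  # classification flips
print(len(obj.history))  # 1 - all prior movement history is gone
```

With the classification flipping repeatedly, as described for the pedestrian, such a tracker effectively never accumulates enough history to predict a movement path.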

Before the crash, the system thus took an overly long time to identify the danger, due to the bad assumption (no pedestrians outside crossings). It lost the pedestrian’s movement history by dropping data on classification changes. It waited for 1 second after finally identifying the danger before doing anything, and then initiated a slowdown rather than emergency braking. And the VO seemed to be distracted from observing the situation. After the accident, Uber has moved to address all of these issues.

There are several other documents on various aspects of the VO, the automation system, and the environment available on the National Transportation Safety Board website for those interested. Including nice illustrations of all aspects.

This was a look at the accident and its possible causes to give some context. Next a look at the system architecture to also give some context of potential testing approaches.

Uber System Architecture

When looking at testing in any domain, understanding the system architecture is important. So let’s have a look.

Software Modules

The Uber document on the topic lists the following main software modules:

  • Perception: Collects data from different sensors around the car
  • Localization: Combines detailed map data with sensor data for accurate positioning
  • Prediction: Takes Perception output as input, predicts actions for actors and objects in the environment.
  • Routing and Navigation: Uses map data, vehicle status, and operational activity to determine long-term routes for a given goal.
  • Motion Planning: Generates shorter term motion plans to control the vehicle in the now. Based on Perception and Prediction inputs.
  • Vehicle Control: Executes the motion plan using vehicle communication interfaces.
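To make the data flow between these modules more concrete, here is a minimal sketch of one processing cycle. The module names follow the list above, but all function bodies, data shapes, and values are my own hypothetical stand-ins, not Uber code.

```python
# Hypothetical sketch of the module pipeline described above; module names
# follow the Uber document, but all logic here is my own trivial illustration.

def perceive(sensors):
    # Perception: turn raw sensor data into detected objects/actors.
    return {"objects": sensors.get("detections", [])}

def localize(map_data, sensors):
    # Localization: combine map data with sensor data for a precise position.
    return {"position": sensors.get("gps", (0.0, 0.0))}

def predict(world):
    # Prediction: estimate future movement for each perceived object.
    return [{"object": o, "horizon_s": 10} for o in world["objects"]]

def plan_route(map_data, pose, goal):
    # Routing and Navigation: a long-term route towards the goal.
    return {"waypoints": [pose["position"], goal]}

def plan_motion(route, world, predictions):
    # Motion Planning: short-term plan; slow down if anything was perceived.
    speed = 5.0 if predictions else 15.0
    return {"target_speed": speed, "waypoints": route["waypoints"]}

def control(plan):
    # Vehicle Control: translate the plan into actuator commands.
    return {"throttle": plan["target_speed"] / 15.0}

def autonomy_tick(sensors, map_data, goal):
    """One processing cycle of the (sketched) self-driving stack."""
    world = perceive(sensors)
    pose = localize(map_data, sensors)
    futures = predict(world)
    route = plan_route(map_data, pose, goal)
    plan = plan_motion(route, world, futures)
    return control(plan)
```

The point is only the shape of the pipeline: Perception and Prediction feed Motion Planning, which feeds Vehicle Control, with map data and routing as further inputs.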


Hardware Components

The same Uber document also describes the self-driving car hardware.

The current components at the time of writing the document:

  • Light Detection and Ranging (LIDAR): Measuring distance to actors and objects, 100m+ range.
  • Cameras: multiple cameras for different distances, covering 360 degrees around the vehicle. Both near- and far-range. To identify people and objects.
  • Radar: Object detection, ranging, relative velocity of objects. Forward-, backward-, and side-facing.
  • Global Positioning System (GPS): Coarse position to support vehicle localization (positioning it), vehicle command (to use location / position for control), map data collection, satellite measurements.
  • Self-Driving Computer: A liquid-cooled local computer in the car to run all the SW modules (Perception, Prediction, Motion Planning, …)
  • Telematics: Communication with backend systems, cellular operator redundancy, etc.

Planned components (not installed back then, but in future plans..):

  • Ultrasonic Sensors: Uses echolocation to range objects. Front, back, and sides.
  • Vehicle Interface Module: Seems to be an independent backup module to safely control and stop the vehicle in case of autonomous system faults.


Now that we have established a list of the SW and HW components, let's look at their functionality.


Maps

The system is described as using very detailed maps, including:

  • Geometry of the road and curbs
  • Drivable surface boundaries and driveways
  • Lane boundaries, including paint lines of various types
  • Bike and bus lanes, parking regions, stop lines, crosswalks
  • Traffic control signals, light sets, and lane and conflict associations (whatever that is? :))
  • Railroad crossings and trolley or railcar tracks
  • Speed limits, constraint zones, restrictions, speed bumps
  • Traffic control signage

Combined with precise location information, the system uses these detailed maps to "predict" in advance what type of environment lies ahead, even before the Perception module has observed it. This is used to prepare for expected road changes, anticipate speed changes, and optimize the expected motion plans. For example, when anticipating a tight turn in the road.

Perception and Prediction

The main tasks of the Perception module are described as detecting the environment, actors, and objects. It uses sensor data to continuously estimate the speed, position, orientation, and other variables of the objects and actors, as a basis for making better predictions and plans about their future movement, velocity, and position.

An example given is the turn signals of other cars, which are used to predict their actions. At the same time, all the other data is also recorded and used to predict other, alternative courses for the same car, in case it does not turn even though it is using a turn signal.

While the Perception module observes the environment (collects sensor data), the Prediction component uses this and other available data as a basis for predicting the movement of the other actors and changes in the environment.

The observed environment can have different types of objects and actors in it. Some are classified as fixed structures, and are expected not to move: buildings, ground, vegetation. Others are classified as more dynamic actors, and expected to move: vehicles, pedestrians, cyclists, animals.

The Prediction module makes predictions on where each of these objects is likely to move in the next 10 seconds. The predictions include multiple properties for each object and actor, such as movement, velocity, future position, and intended goal. The intended goal (or intention) is mentioned in the document, but I did not find a clear description of how this would be used. In any case, it seems plausible that the system would assign "intents" to objects, such as pedestrian crossing a street, a car turning, overtaking another car, going straight, and so on. At least these would seem useful abstractions and input to the next processing module (Motion Planning).

The Prediction module makes predictions multiple times a second to keep an updated representation available. The predictions are provided as input to the Route and Motion Planning module, including the "certainty" of those predictions. This (un)certainty is another factor that the Motion Planning module can use as input to apply more caution to any control actions.
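As a small illustration of this idea (my own sketch, not from the Uber docs), a prediction could carry an explicit certainty value that the downstream planning module uses to apply more caution:

```python
# Illustrative sketch (my own, not from the Uber docs) of predictions carrying
# a certainty value that downstream motion planning can act on.
from dataclasses import dataclass

@dataclass
class Prediction:
    actor_id: str
    label: str           # e.g. "pedestrian", "vehicle", "cyclist"
    future_xy: tuple     # predicted position a few seconds ahead
    certainty: float     # 0.0 (no confidence) .. 1.0 (full confidence)

def caution_factor(predictions, threshold=0.7):
    """Lower factor = more caution. Uncertain predictions reduce allowed speed."""
    if any(p.certainty < threshold for p in predictions):
        return 0.5
    return 1.0
```

Here the threshold and the 0.5 scaling are arbitrary placeholders; the point is only that (un)certainty is a first-class input to planning, not an afterthought.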

Route and Motion Planning

Motion Planning (as far as I understand) refers to short-term movements, translating into concrete control instructions for the car. Route Planning, on the other hand, refers to long-term planning of where to go, and gives goals for Motion Planning to guide the car along the planned route.

Motion Planning combines information from the generated route (Route Planning), perceived objects and actors (Perception), and predicted movements (Prediction). Mapping data is used for the "rules of the road", as well as any active constraints. I guess this is also combined with sensor data for a more up-to-date view of the local environment (the public docs are naturally not super detailed on everything). Using these, it creates a motion plan for the vehicle, with the Perception and Prediction inputs defining the anticipated movements of other objects and actors.

A spatial buffer is defined to be kept between the vehicle and other objects in the environment. My understanding is that this refers to keeping some amount of open space between the car and environmental elements. The size of this buffer varies with variables such as the autonomous vehicle's speed (and the properties and labels of other objects and actors, I assume). To preserve the required buffer, the system may take actions such as changing lanes, braking, or stopping to wait for the situation to clear.
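A toy model of such a speed-dependent buffer might look like the following. The constants and the extra margin for vulnerable road users are entirely my assumptions, not Uber's actual formula:

```python
# Toy model (my assumption, not Uber's formula) of a speed-dependent spatial
# buffer: faster driving requires more open space around the vehicle.
def required_buffer_m(speed_mps, actor_label="vehicle"):
    base = 2.0                    # assumed minimum clearance in metres
    speed_term = 0.5 * speed_mps  # buffer grows with speed
    # Assumed extra margin for vulnerable road users.
    vulnerable_margin = {"pedestrian": 2.0, "cyclist": 1.5}
    return base + speed_term + vulnerable_margin.get(actor_label, 0.0)
```

At 10 m/s this sketch would demand 7 metres of clearance from another vehicle and 9 metres from a pedestrian; the real system presumably uses something far more elaborate.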

The system is also described as being able to identify and track occlusions in the environment. These are environmental elements, such as buildings or other cars, blocking the view to certain other parts of the environment. These are constantly reasoned about, and the system becomes more conservative in its decisions when occlusions are observed. It aims to be able to avoid actors coming out of occlusions at reasonable speed.

Vehicle Control

The Vehicle Control module executes trajectories provided by the Motion Planning module. It controls the vehicle through communication interfaces. Example controls include steering, braking, turn signals, throttle, and switching gears.

It also tracks any limits set for the system (or environment?), and communicates back to the operation center as needed.

Data Collection and Test Scenarios

Since my point with this "article" was to look into what it might mean to test a machine-learning intensive system, I find it important to also look at what type of data is used to train the machine learning systems, and how all the used data is collected. And how these are used as part of test cases (in the Uber documents they seem to call them test scenarios). Of course, such complex systems use this type of data for many different purposes besides just the machine learning part, so it is generally interesting as well.

The Uber document describes data uses including system performance analysis, quality assurance, machine teaching and testing, simulated environment creation and validation, software development, human operator training and assessment, and map building and validation.

Data Collection

Summarizing the various parts related to data collection and synthesis from the Uber descriptions: at the heart of all this is the real-world training data, collected as the VOs drive around, with the car and automated sensors collecting detailed data and the VOs tagging it. This tagging also helps further identify new scenarios, objects, and actors. The sensor data is based on the sensors I listed above in the HW section.

Additionally, the system is listed as recording:

  • telemetry (maybe refers to metrics about network? or just generally to transferring data?)
  • control signals (commands for vehicle control?)
  • Controller Area Network (CAN) messages
  • system health, such as
    • hard drive speeds
    • internal network performance
    • computer temperatures

The larger datasets are recorded in onboard (car) storage. Smaller amounts of data are transmitted in near real-time using over-the-air (OTA) interfaces over cellular networks to the Uber control center. These use multiple cellular networks for cybersecurity and resiliency purposes. The OTA data includes insights on how the vehicles are performing, where they are, and their current state.

Scenario Development

In the documents (Uber and another from the RAND Corporation), the operational environment of the autonomous vehicle is referred to as the operational design domain (ODD). Defining the ODD is quite central to the development (as well as testing) of the system, to training the ML algorithms, and to the controlling logic based on those. It defines the world in which the car operates, all the actors and objects, and their relations.

The Uber document describes using something called scenarios as test cases. Well, it mostly does not mention the word "test case", but for practical purposes this seems to be similar. Of course, this is quite a bit more complex than a traditional software test case with simple inputs and outputs: it requires descriptions of complex real-world environments as inputs, and boundaries and profiles of accepted behaviour as outputs, rather than specific data values. These complex real-world inputs and outputs also vary over time, unlike the typically static input values of traditional software tests. Thus a time-series aspect is also relevant to the inputs and outputs.

Uber describes a unified schema being used to describe the scenarios and data. Besides the collected data and learned models, other data inputs are also used, such as operational policies. Various success criteria are defined for each scenario, such as speed, distance, and description of safe behaviour.

When new actors, environmental elements, or other similar items are encountered, they are recorded and tagged for further training of the autonomous system. The resulting definitions and characterization of the ODD is then used as input to improve the test scenarios and create new ones. This includes improving the test simulations, and test tracks for coverage.

Events such as large deviations between consecutive planned trajectories are recorded and automatically tagged for investigation. Simulations are used to evaluate whether such issues are fixed, and the new scenarios are added to ML training datasets, or used as "hard test cases". This seems a bit similar to the Tesla "shadow mode" I discussed earlier, just a bit more limited.
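The deviation-tagging idea can be sketched in a few lines. Representing trajectories as simple sequences of 1-D positions and picking an arbitrary threshold are my simplifications for illustration:

```python
# Sketch (my own illustration) of auto-tagging large deviations between
# consecutive planned trajectories, as the document describes. Trajectories
# are simplified to sequences of 1-D positions.
def max_deviation(plan_a, plan_b):
    """Largest pointwise distance between two planned trajectories."""
    return max(abs(a - b) for a, b in zip(plan_a, plan_b))

def tag_for_review(plan_a, plan_b, threshold_m=1.0):
    """Flag the event for human investigation if plans diverge too much."""
    return max_deviation(plan_a, plan_b) > threshold_m
```

In a real pipeline, tagged events like this would presumably become candidates for new training data or "hard test cases", as described above.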

Test Coverage

Besides a general overview of the scenario development, the Uber documents do not really discuss how they handle test coverage, or what types of tests they run. There are some minor references but nothing very concrete. It is more focused on describing the overall system, and some related processes. I tried to collect here some points that I figured seemed relevant.

A key difference from more traditional software systems seems to be that these types of systems do not have a clearly defined input or output space. The interaction interfaces (API/GUI) of traditional software systems naturally define some contract for what type of input is allowed and expected. With these it is possible to apply traditional techniques such as category partitioning, boundary analysis, etc. When your input space is the real world and everything that can happen in it, and your output space is all the possible actions in relation to all the possible environmental configurations, it gets a bit more complex. In a similar comment, Uber describes their system as requiring more testing with different variations.

Potential Test Scenarios from Uber Docs

These are just points I collected that I thought would illustrate something related to test scenarios and test coverage.

Uber describes evaluating their system performance in different common and rare scenarios, using measurements such as traffic rule violations and vehicle dynamics attributes. This means having very few crash and unsafe scenarios available, but a large number of safe scenarios. That is, when the scenarios are based on real-world use and data, there are commonly many more "safe" scenarios available than "unsafe" ones, due to the rarity of crashes, accidents, and other problem cases versus normal operations.

With only this type of highly biased dataset available, I expect there is a need to synthesize more extensive test sets, or to use other methods to test and develop such systems more extensively. The definition of safety also does not seem to be a binary decision; rather, there can be different scales of "safe", depending on the safety attribute. For example, the safety margin of how much distance the autonomous vehicle should keep from other vehicles is a continuous variable, not a binary value. Some variables might of course have binary representations, such as avoiding hitting a pedestrian, or running a red light. But even the pedestrian metric may have similar distance measures, impact measures, etc. So I guess it's a bit more complicated than just safe or not safe.
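One way to picture this graded-rather-than-binary view of safety is a scenario score where hard violations still collapse to "unsafe", but otherwise the score reflects continuous margins. The weighting here is purely my own illustration:

```python
# Illustration (my own) of safety as a graded score rather than a binary flag:
# hard violations dominate, otherwise the kept distance margin is graded.
def safety_score(min_distance_m, required_m, hit_pedestrian, ran_red_light):
    if hit_pedestrian or ran_red_light:
        return 0.0  # binary violations collapse the score to "unsafe"
    # Grade by how much of the required margin was kept, capped at 1.0.
    return min(min_distance_m / required_m, 1.0)
```

A scenario that kept only half its required distance margin would score 0.5 here, rather than being labelled simply "safe" or "unsafe".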

Dataset augmentation and imbalanced datasets are common issues in developing and training ML models. However, those techniques are (to my understanding) based on a single clear goal such as classification of an object, not on complex output such as overall driving control and its relation to the real world. Thus, I would expect to use overall scenario-augmentation type approaches, more holistic than a simple classifier (which on its own might be part of the system).

Some properties I found in the Uber documents (as I discussed above), referring to potential examples of test requirements:

  • Movement of objects in relation to vehicle.

  • Inability of the system to classify a pedestrian correctly if not near a crosswalk.

  • Inability of the system to predict pedestrian path correctly when not classified as pedestrian.

  • Overly strict assumptions made, such as cyclist not moving across lanes.

  • Losing location history of tracked objects and actors if their classification changed.

  • Uber defines test coverage requirements based on collected map data and tags.

  • Map data predicting that the upcoming environment would be of a specific type (e.g., left curve), but it has changed and observations differ.

  • Another car signals turning left but other predictors do not predict that, and the other car may not actually turn left.

  • Certainty of classifications.

  • Occlusions in the environment.


Looking at the above examples, trying to abstract some more generic concepts that would serve as a potentially useful basis:

  • Listing of known objects / actors

  • Listing of labels for different types of objects / actors

  • Assumptions made about specific types of objects / actors

  • Properties of objects / actors

  • Interaction constraints of objects and actors

  • Probabilities of classifications for different objects / actors and labels

  • Functionality when faced with unknown objects / actors

The above list may be lacking in general details that would cover different types of systems, or even the Uber example, but I find it provides insight into how this is heavily about probabilities, uncertainty, and preparing for and handling that uncertainty.

For different types of systems, the actual objects, actors, labels and properties would likely change. To illustrate these a bit more concretely with the autonomous car example:

  • Objects / Actors, their Properties and Labels

    • Our car
      • Speed, Position, Orientation,
      • Accelerating, Slowing down,
      • Intended goal (turn left, drive forward, change lane, stop, …)
      • Predicted location in 1s, 2s, 5s, …
      • Distance to all other actors / objects
      • Right of way
    • Other car, moving
      • Same as "Our car"
    • Other car, parked
      • Probability of leaving parking mode
    • Pedestrian, moving or stopped (parked)
      • Same as "other car"
      • Crossing street
      • On pedestrian path
    • Cyclist
      • Same as "other car"
      • Crossing street
      • On bicycle path
    • Other object
      • Moving or static
      • Same as others above
    • Traffic light
      • Current light (green, yellow, red)
      • On / off / blinking
    • Traffic sign
      • Type / Meaning
        • Set speed, stop, yield, no parking, …
        • Long term / local effect
    • Building
      • Size, shape, location
    • Occlusion
      • Predicted time of object / actor coming from occlusion
    • Unknown object, moving or parked
      • Much like the other car etc but maybe with unknown goals
  • Interaction constraints

    • Safety margin (distance to our car and other actors) before triggering some action
    • Actions triggered in different constraint states / boundaries
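The list above could be turned into a minimal domain model. This sketch is my own abstraction, not anything from the Uber docs: actors carry a label probability distribution (classification uncertainty), and an interaction constraint is just a margin check.

```python
# Minimal sketch (my own abstraction, not from the Uber docs) of the ODD
# elements above: actors with uncertain labels, plus a margin constraint.
from dataclasses import dataclass, field

@dataclass
class Actor:
    kind: str        # nominal kind: "pedestrian", "vehicle", "unknown", ...
    position: tuple  # (x, y) in metres
    speed: float     # m/s
    label_probs: dict = field(default_factory=dict)  # classification uncertainty

    def best_label(self):
        # Fall back to "unknown" when no classification is available.
        if not self.label_probs:
            return "unknown"
        return max(self.label_probs, key=self.label_probs.get)

def violates_margin(our_xy, actor, margin_m):
    """Interaction constraint: is the actor inside our required safety margin?"""
    dx = our_xy[0] - actor.position[0]
    dy = our_xy[1] - actor.position[1]
    return (dx * dx + dy * dy) ** 0.5 < margin_m
```

A test generator could then enumerate actors, labels, probabilities, and margins over structures like these, which is roughly what the properties listed above seem to call for.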

Something that seems important is also the ability to reason about previously unknown objects and actors, to the extent possible. For example, a moving object that does not seem to fit any known category, but has a known movement history, speed, and other variables. Perhaps there would be a more abstract category of a moving object, or some hierarchy of such categories. The same goes for any of these objects or actors changing their classifications and goals, and how their long-term history should be taken into account overall to make future predictions.

In a different "machine learning intensive" system (not autonomous cars), one might use a different set of properties, actors, objects, etc. But it seems similar considerations could be useful.

Possible Test Strategies

Once the domain (the "ODD") is properly defined, as above, it seems many traditional testing techniques could be applied. In the Uber documents, they describe performing architecture analysis to identify all potential failure points. They divided faults into three levels: faults in the self driving system on its own, faults in relation to the environment (e.g., at intersections), and faults related to the operational design domain, such as unknown obstacles the system does not recognize (or misclassifies?). This could be another way to categorize a more specific system, or inspiration for other similar systems.

Another part of this type of system could be related to the human aspect. This is somewhat discussed also in the Uber docs, in relation to operational situations for the system: a distracted operator, and a fatigued operator. They have some functionality in place (especially after the accident) to monitor operator alertness via in-car dashcam and attached analysis. However, I will not go into these here.

Testing ML Components

For testing the ML components, I discussed various techniques in a previous blog post. This includes approaches such as metamorphic testing, adversarial testing, and testing with reference inputs. In autonomous cars, this might be visual classifiers (e.g., convolutional networks), or path prediction models (recurrent neural nets etc.), or something else.

Testing ML Intensive Systems

As for the set of properties I listed above, it seems once these have been defined, using traditional testing techniques should be quite useful:

  • combinatorial testing: combine different objects / actors, with different properties, labels, etc. observe the system behaviour in relation to the set constraints (e.g., safety limits).
  • boundary analysis: apply to the combinations and constraints from the previous bullet. for example, probabilities at different values. might require some work to define interesting sets of probability boundaries, or ways to explore the (combined) probability spaces. but not that different in the end from more traditional testing.
  • model-based testing: use the above type of variables to express the system state, use a test generator to build extensive test sets that can be used to cover combinations, but also transitions between states and their combinations over time.
  • fault-injection testing: the system likely uses data from multiple different data sources, including numerous different types of sensors. different types of faults in these may have different types of impact on the ML classifier outputs, overall system state, etc. fault-injection testing in all these elements can help surface such cases. think Boeing Max from recent history, where a single sensor failure caused multiple crashes with hundreds of lives lost.

The real trick may be in combining these into actual, complete, test scenarios for unit tests, integration tests, simulators, test tracks, and real-world tests.

Regarding the last bullet above (fault-injection testing), the Uber documents discuss this from the angle of fault-injection training: injecting faults into the system, seeing how the vehicle operator reacts to them, and training them in how they should react. This sounds similar to fault-injection testing, and I would expect them to have also applied the same scenarios more broadly. However, I could not find mention of this.

Regarding general failures, and when they happen in real use, the same fault models can also be used to prepare for and mitigate actual operational faults. The Uber docs also discuss this viewpoint, with the system having a set of identified fault conditions and mitigations for when these happen. These are identified via redundant systems and overall monitoring across the system. Example faults:

  • Primary compute power failure
  • Loss of primary compute or motion planning timeout
  • Sensor data delay
  • Door open during driving

General Safety Procedures

Volvo Safety Features

Besides the Uber self-driving technology, the documents show the Volvo cars having safety features of their own, an Advanced Driver Assistance System (ADAS), including an automated emergency braking system named "City Safety". It contains a forward collision warning system, alerting the driver about an imminent collision and automatically applying the brakes when it observes a potentially dangerous situation. This also includes pedestrian, cyclist, and large animal detection components. However, these were turned off during autonomous driving mode, and only active in manual mode. Simulation tests conducted by the Volvo Group showed that the ADAS features would have been able to avoid the collision (17 times out of 20) or significantly reduce collision speed and impact (the remaining 3 times). In post-crash changes, the ADAS system is now activated at all times (along with many other fixes to the issues discussed here).

Information Sharing and Other Domains

The documents on reviews and investigations after the accident include comparisons to safety cultures in many other (safety-critical) domains: Nuclear Power, Transportation (Rail), Aviation, Oil and Gas, and Maritime. While some are quite specific to the domains, and related to higher level process and cultural aspects, there seem to be many quite interesting points one could build on also for the autonomous driving domain. Or other similar ones. Safety has many higher level shared aspects across domains. Regarding my look for testing related aspects, in many cases replacing "safety" with "QA" would also seem to provide useful insights.

One practical example is how (at least) the aviation and transportation (rail) domains have processes in place to collect, analyze, and share information on observed unsafe conditions and events. This would seem like a useful way to also identify relevant test scenarios for testing products in the autonomous driving domain. Given how much effort extensive collection of such data requires, and how expensive and dangerous it can be, the benefits seem quite obvious to everyone.

Related to this, Uber discusses shared metrics for evaluating the progress of their development. These include disengagements and self-driving miles travelled. While they have used these to signal progress both internally and externally, they also note that such metrics can easily lead to "gaming the system" at the expense of safety or a working system. For example, in becoming overly conservative to avoid disengagements, or in using inconsistent definitions of the metrics across developers / systems.

Uber discusses the need for work on creating more broadly usable safety performance metrics with academic and industry partners. They list how these metrics should be:

  • Specific to different development stages (development, testing, deployment)
  • Specific to different operational design domains, scenarios and capabilities
  • Have comparable metrics for human drivers
  • Applied in validation environments and scenarios for autonomous cars with other autonomous cars from different companies

The Uber safety approach document also refers to more general work towards an automotive safety framework by the RAND Corporation. This includes topics such as building a shared taxonomy to form a basis for discussion and sharing across vendors. It also discusses safety metrics, their use across vendors, and the possible issues in the use and gaming of such metrics. And many other related aspects of a cross-vendor safety program. Interesting. Seems like there is lots of work to do there as well.


This was an overly long look at the documents from the Uber accident. I was thinking of just looking at the testing aspect briefly, but I guess it is hard to discuss it properly without setting the whole background and overall context. Overall, the summary is not that complicated. I just get carried away with writing too many details.

However, I found writing this down helped me reason better about the difference between more traditional software-intensive systems and these types of new machine-learning intensive systems. I would summarize it as the need to consider everything in terms of probabilities: the unknown elements in the input and output spaces, constraints over everything, the complexity of identifying all the objects and actors, their possible intents, and all the relations between all possibilities, all with probabilities (or uncertainty) attached. But once the domain analysis is well done, and the inputs and outputs are understood, I find traditional testing techniques such as combinatorial testing, model-based testing, category partitioning, boundary analysis, and fault-injection testing would give a good basis. It might take a bit broader insight to apply them efficiently, though.

As for the Uber approach, it is interesting. I previously discussed the Tesla approach of collecting data from fleets of deployed consumer vehicles, and features such as the Tesla shadow mode, which continuously runs in the background as the human drives, evaluating whether each decision the autonomous system would have made matches the action the human driver actually took. Not specifically trained VOs as in the Uber case, but regular consumer drivers (so Tesla customers at work helping to improve the product).

The Tesla approach seems much more scalable in general. It might also generalize better as opposed to Uber aiming for very specific routes and building super detailed maps of just those areas. Creating and maintaining such super-detailed maps seems like a challenging task. Perhaps if the companies have very good automated tools to take care of it, it can be easier to manage and scale. I don’t know if Tesla does some similar mapping with the help of their consumer fleet, but would be interesting to see similar documents and compare.

As for other types of machine learning (intensive) systems, there are many variations, such as those using IoT sensors and data to provide a service. Those are maybe not as open-worlded in all possible input spaces. However, it would seem to me that many of the considerations and approaches I discussed here could be applied. Probabilities, (un-)certainties, domain characterizations, relations, etc. Remains interesting to see, perhaps I will find a chance to try someday.. 🙂

Remote Execution in PyCharm

Editing and Running Python Code on a Remote Server in PyCharm

Recently I was looking for an option to run some code on a remote server while editing it locally. This time on AWS, but generally the ability to do so on any remote server would be nice. I found that PyCharm has a nice option for this: a Python SSH interpreter. Give it some SSH credentials, point it to the Python interpreter on the remote machine, and you should be ready to go. Nice pic about it:


Sounds cool, and actually works really well. Even supports debugging. A related issue I ran into for pipenv also mentions profiling, pip package management, etc. Great. No, I haven’t tried all the advanced stuff yet, but at least the basics worked great.

Basic Remote Use

I made this simple program to test this feature:

print("hello world")
with open("bob.txt", "w") as bob:
    bob.write("hello world")
print("oops")


The point is to print text to the console and create a file. I am looking to see that running this remotely will show me the prints locally, and create the file remotely. This would confirm to me that the execution happens remotely, while I edit, control execution, and see the results locally.

Running this locally prints "hello world" followed by "oops", and a file named "bob.txt" appears. Great.

To try remotely, I need to set up a remote Python interpreter in PyCharm. This can be done via project preferences:

Add interpreter

Or by clicking the interpreter in the status bar:

Statusbar interpreter

On a local configuration this shows the Python interpreter (or pipenv etc.) on my computer. In remote configuration it asks for many options such as remote server IP and credentials. All the run/debugging traffic between local and remote machines is then automatically transferred over SSH tunnels by PyCharm. To start, select SSH interpreter as type when adding new interpreter:

SSH interpreter

Just enter the remote IP/URL address, and username. Click next to enter also password/keyfile. PyCharm will try to connect and see this all works. On the final page of the remote interpreter dialog, it asks for the interpreter path:

Remote Python config

This is referring to the python executable on the remote machine. A simple which python3 does the trick. This works to run the code using the system python on the remote machine.

To run this remote configuration, I just press the run button as usual in PyCharm. With this, PyCharm uploads my project files to the remote server over SSH, starts the interpreter there for the given configuration, and transports back to my local host the console output of the execution. For me it looks exactly the same as running it locally. This is the output of running the above configuration:

ssh://ec2-user@ -u /tmp/pycharm_project_411/hello_world.py
hello world

The first line shows some useful information. It shows that it is using the SSH interpreter with the given IP and username, with the configured Python path. It also shows the directory where it has uploaded my project files. In this case it is "/tmp/pycharm_project_411". This is the path defined in the Project Interpreter settings, in the Path Mappings part, as illustrated in the image (with too many red arrows) higher up in this post. OK, the attached image further above has a different number due to playing with different projects, but anyway. To see the files and output:

[ec2-user@ip-172-31-3-125 ~]$ cd /tmp/pycharm_project_411/
[ec2-user@ip-172-31-3-125 pycharm_project_411]$ ls
bob.txt  hello_world.py

This is the file listing from the remote server. PyCharm has uploaded the "hello_world.py" file, since this was the only file I had in my project (under project root as configured for synch in path mappings). There is a separate tab on PyCharm to see these uploads:

Remote synch

After syncing the files, PyCharm has executed the configuration on the remote host, which was defined to run the hello_world.py file. And this execution has created the file "bob.txt" as it should (on the remote host). The output files go into this remote target directory, as it is the working directory for the running Python program.

The synchronization also works in the other direction, from the remote host to local. Since PyCharm provides intelligent coding assistance and navigation on the local system, it needs to know about the libraries used by the executed code. For this reason, it installs locally all the packages installed in the remote host's Python environment. Something to keep in mind. I suppose it must set up some type of a local virtual environment for this. I haven't needed to look deeper into that yet.

Using a Remote Pipenv

The above discusses the usage of the standard Python run configuration and interpreter. Something I have found useful for Python environments is pipenv.

So can we also do a remote execution of a remote pipenv configuration? The issue I linked earlier contains solutions and discussion on this. Basically, the answer is yes, we can. We just have to find the pipenv files on the remote host and configure the right one as the remote interpreter.

For more complex environments, such as those set up with pipenv, a bit more is required. The issue I linked before had some actual instructions on how to do this:

Remote pipenv config

I made a directory "t" on the remote host, initialized pipenv there, and installed a dependency. So:

  • mkdir t
  • cd t
  • pipenv install pandas

And there we have the basic pipenv setup on the remote host. To find the pipenv dir on remote host (t is the dir where pipenv was created above):

[ec2-user@ip-172-31-3-125 t]$ pipenv --venv

To see what it contains:

[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c
bin  include  lib  lib64  src
[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin
activate       activate.ps1      chardetect        pip     python     python-config
activate.csh   activate_this.py  easy_install      pip3    python3    wheel
activate.fish  activate.xsh      easy_install-3.7  pip3.7  python3.7

To get python interpreter name:

[ec2-user@ip-172-31-3-125 t]$ pipenv --py

This is just a link to python3:

[ec2-user@ip-172-31-3-125 t]$ ls -l /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python
lrwxrwxrwx 1 ec2-user ec2-user 7 Nov  7 20:55 /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python -> python3

Use that to configure this pipenv as remote executor, as shown above already:

Remote pipenv config


Besides the automated sync, I found the PyCharm IDE has features for manual upload to / download from the remote server. They seem quite useful.

First of all, the root of the remote deployment dir is defined in Deployment Configuration / Root Path. Under Deployment / Options, you can also disable the automated remote sync. Just set "Update changed files automatically to the default server" to "never". Here I have set the root dir to "/home/ec2-user". Which means the temp directory I discussed above actually is created under /home/ec2-user/tmp/pycharm_project_703/…

Deployment config

With the remote configuration defined, you can now view files on the remote server. First, enable View -> Tool Windows -> Remote Host. This opens the Remote Host view on the right-hand side of the IDE window. The following shows a screenshot of the PyCharm IDE with this window open. The popup window (as also shown) lets you download/upload files between the remote host and the localhost:

Deployment view

In a similar way, we can also upload local files to the remote host using the context menu for the files:

Upload to remote

One can also select entire folders for upload / download. The root path on the remote host used for all this is the one I discussed above (e.g., /home/ec2-user as defined higher above).


I haven’t used this feature on a large scale yet, but it seems very useful. The issue I keep linking discusses one option of using it to run data processing on a large desktop system from a laptop. I also find it interesting for just running experiments in parallel on a separate machine, or for using cloud infrastructure while developing.

The issue also has some discussion with potential pipenv management from PyCharm coming in 2020.1 or 2020.2 version. Just speculation, of course. But until then one can set up the virtualenv using pipenv on remote host and just use the interpreter path above to set up the SSH Interpreter. This works to run the code inside the pipenv environment.

Some issues I ran into: PyCharm apparently keeps only a single state mapping in memory for the remote and local file diffs. PyCharm synchronizes files very well and identifies changed files to upload. But if I change the remote host address, it seems to still think it has the same delta. Not a big issue, but something to keep in mind, as always.

UPDATE: The manual sync I described above is actually quite a nice way to bypass the issues with automated sync. Of course it is manual, and using it to upload everything all the time in a big project is not practical. But for me and my projects it has been nice so far..

That’s all.

Robot Framework by Examples


Robot Framework (RF) is a popular keyword-driven test framework (at least in Finland it seems to be..). I recently had to look into it again for some potential work-related opportunities. Have to say open source is great, but the docs could use improvements..

I made a few examples for the next time I come looking:


To install RF, pip does the job. Installing RF itself, along with the Selenium keyword library, and Selenium WebDriver for those keywords:

pip3 install robotframework
pip3 install selenium
pip3 install robotframework-seleniumlibrary

Using Selenium WebDriver as an example here, a Selenium driver for the selected browser is needed. For Chrome, one can be downloaded from the Chrome website itself, and similarly for other browsers on their respective sites. The installed driver needs to be on the search path for the operating system. On macOS, this is as simple as adding its location to the PATH. Assuming the driver is in the current directory:

export PATH=$PATH:.

So just the dot, which works as long as the driver file is in the working directory when running the tests.

In PyCharm, the PATH can also be similarly added to run configuration environment variables.

General RF Script Structure

RF script elements are separated by a minimum of 2 spaces: both for indenting test steps under a test, and for separating keywords and parameters. There is also a pipe-separated format, which might look a bit fancier, if you like. Sections are identified by three stars *** and a pre-defined name for the section.

The following examples illustrate.


Built-in Keywords / Logging to console

The built-in keywords are available without needing to import a specific library; rather, they are part of the BuiltIn library. A simple example of logging a statement to the console:

The .robot script (hello.robot in this case):

*** Test Cases ***
Say Hello
    Log To Console    Hello Donkey
    No Operation
    Comment           Hello ${bob}

The built-in keyword "Log To Console" writes the given parameter to the console. A hello world equivalent. To run the test, we can either write code to invoke the RF runner from Python or use the RF command line tools. Python runner example:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

suite = TestSuiteBuilder().build("./hello.robot")
result = suite.run(output="test_output.xml")
#ResultWriter(result).write_results(report='report.html', log="log.html")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The "hello.robot" in the above is the name of the test script file listed earlier.

The strangest thing (for me) here is the writing of the log file. The docs suggest the first approach, which I commented out above: the ResultWriter with the results object as a parameter. This generates the report.html and the log.html.

The problem is that the log.html is then lacking all the prints, keywords, and test execution logs. Later on, the same docs state that to get the actual logs, you have to pass in the name of the XML file created by the suite.run() method. This is the uncommented approach in the above code. Since the results object is also generated from this call, why does it not give the proper log? Oh dear. I don’t understand.

Commandline runner example:

robot hello.robot

This seems to automatically generate an appropriate log file (including the execution and keyword trace). There are also a number of command line options available for all the properties I discuss next using the Python API. Maybe this is the general / preferred approach? But somehow I always end up needing to write my own executors to customize and integrate with everything, so..

Finally, on logging: Robot Framework actually captures the whole stdout and stderr, so statements like print() get written to the RF log and not to the actual console. I found this quite annoying, as it results in overly verbose logs with all the RF boilerplate/overhead. There is a StackOverflow answer on how to circumvent this though, from the RF author himself. I guess I could write my own keyword based on that if I needed more log customization, but it seems a bit complicated.
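
As a minimal sketch of that idea (my own version, not the exact code from the answer): a custom keyword can write directly to sys.__stdout__, which is the original console stream that RF leaves untouched when it captures sys.stdout.

```python
import sys

def log_to_real_console(msg):
    """Custom keyword: write a message to the real console,
    bypassing Robot Framework's stdout capture."""
    sys.__stdout__.write(msg + "\n")
    sys.__stdout__.flush()
```

Dropping this into a library file would make it available as the keyword Log To Real Console.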

Tags and Critical Tests

RF tags are something that can be used to filter and group tests. One use is to define some tests as "critical". If a critical test fails, the suite is considered failed.

Example of non-critical test filtering. First, defining two tests:

*** Test Cases ***
Say Hello Critical
	[Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
	[Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Running them, while filtering with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

suite = TestSuiteBuilder().build("./noncritical.robot")
result = suite.run(output="test_output.xml", noncritical="*crit")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The above classifies all tests with tags matching the pattern "*crit" as non-critical. In this case, that matches both the tags "crit" and "non-crit", which is likely a bit wrong. So the report for this actually shows 2 non-critical tests.

The same execution with a non-existent non-critical tag:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

suite = TestSuiteBuilder().build("./noncritical.robot")
#this tag does not exist in the given suite, so no critical tests should be listed in report
result = suite.run(noncritical="non")
ResultWriter(result).write_results(report='report.html', log="log.html")

This runs all tests as critical, since no test has a tag matching "non". To finally fix it, the filter should be exactly "non-crit". This does not match "crit" but matches exactly "non-crit".
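
The matching here is glob-style, not full regular expressions. Python's own fnmatch (just an illustration, not RF's internal matcher) shows why "*crit" catches both tags while the exact "non-crit" catches only one:

```python
from fnmatch import fnmatch

tags = ["crit", "non-crit"]

# "*" matches any prefix, including the empty string, so both tags match
print([t for t in tags if fnmatch(t, "*crit")])     # ['crit', 'non-crit']
# an exact pattern matches only the identical tag
print([t for t in tags if fnmatch(t, "non-crit")])  # ['non-crit']
```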

Filtering / Selecting Tests

There are also the options include and exclude, to include or exclude (surprise) tests with matching tags from execution.

A couple of tests with two different tags (as before):

*** Test Cases ***
Say Hello Critical
	[Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
	[Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Run tests, include with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter
from io import StringIO

suite = TestSuiteBuilder().build("./include.robot")
stdout = StringIO()
result = suite.run(include="*crit", stdout=stdout)
ResultWriter(result).write_results(report='report.html', log="log.html")
output = stdout.getvalue()

This includes both of the two tests defined above, since both tags match. If the filter were "non", nothing would match, and an error is produced for having no tests to run.

Creating new Keywords from Existing Keywords

Besides somebody else's keywords, custom keywords can be composed from existing keywords. Example test file:

*** Settings ***
Resource    simple_keywords.robot

*** Test Cases ***
Run A Google Search
    Search for      chrome    emoji wars
    Sleep           10s
    Close All Browsers

The included (by the Resource keyword above) file simple_keywords.robot:

*** Settings ***
Library  SeleniumLibrary

*** Keywords ***
Search for
    [Arguments]    ${browser_type}    ${search_string}
    Open browser    http://google.com/   ${browser_type}
    Press Keys      name:q    ${search_string}+ENTER

So the keyword is defined above in a separate file, with arguments declared using the [Arguments] notation, followed by the argument names. These are then referenced in the following keywords, Open Browser and Press Keys, imported from SeleniumLibrary. Simple enough.

Selenium Basics on RF

Due to the popularity of Selenium WebDriver for testing web applications, there is a specific RF library with keywords built for it. This was installed back in the Installing section.

Basic example:

*** Settings ***
Library  SeleniumLibrary

*** Test Cases ***
Run A Google Search
    Open browser    http://google.com/   Chrome
    Press Keys      name:q    emoji wars+ENTER
    Sleep           10s
    Close All Browsers

Run it as always before:

from robot.running import TestSuiteBuilder
import robot

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

This should open up the Chrome browser, load Google in it, do a basic search, and close the browser windows. Assuming it finds the Chrome driver, also discussed in the Installing section.

Creating New Keywords in Python

Besides building keywords as composites of existing ones, building new ones with Python code is an option.

Example test file:

*** Settings ***
Library         google_search_lib.py    chrome

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s

The above references google_search_lib.py, where the implementation is:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class google_search_lib(object):
    driver = None

    @classmethod
    def get_driver(cls, browser):
        # re-use a single shared driver across library instantiations
        if cls.driver is not None:
            return cls.driver
        if browser.lower() == "chrome":
            cls.driver = webdriver.Chrome("../chromedriver")
        return cls.driver

    def __init__(self, browser):
        driver = google_search_lib.get_driver(browser)
        self.driver = driver
        self.wait = WebDriverWait(driver, 10)

    def search_for(self, term):
        # maps to the "Search For" keyword
        search_box = self.driver.find_element_by_name("q")
        search_box.send_keys(term + Keys.ENTER)

    def close(self):
        # maps to the "Close" keyword
        self.driver.quit()
        google_search_lib.driver = None

Defining the library import name is a bit tricky. Here the module file name and the class name are the same (google_search_lib), in which case just the one name is needed.

Again, running it as before:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

If you think about this for a moment, there is some strange magic here. Why is the classmethod there? How is state managed within tests / suites? I borrowed the initial code for this example from a fine tutorial. It does not discuss the use of this decorator, but it seems to me that it is used to share the driver object during test execution.

Mapping Python Functions to Keywords

The mapping is simple: take the function name and replace underscores with spaces. So in the above google_search_lib.py example, the Search For keyword maps to the search_for() function, and the Close keyword maps to the close() function. Much complex, eh?
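
A hedged sketch of that mapping in the keyword-to-function direction (RF does this internally, and also matches case-insensitively; keyword_to_method is just an illustrative name):

```python
def keyword_to_method(keyword):
    # lowercase the keyword and turn spaces into underscores
    # to get the Python method name it resolves to
    return keyword.lower().replace(" ", "_")

print(keyword_to_method("Search For"))  # search_for
print(keyword_to_method("Close"))       # close
```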

Test Setup and Teardown

Test setup and teardown are basic test framework functionality. In RF they are supported by specific settings in the Settings section.

Example test file:

*** Settings ***
Library         google_search_lib.py    chrome
Test Setup      Log To Console    Starting a test...
Test Teardown   Close

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s

The referenced google_search_lib.py file is the same as above. This includes defining the close function / keyword used in Test Teardown.

Run it as usual:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

Only a single keyword can be given for each of the setup and teardown. The RF docs suggest writing your own custom keyword that composes multiple actions as needed.

The way the library class is defined and created also impacts how the scope of the library is defined. It can get a bit tricky to manage the resources, since the instances may be different in the setup, teardown, and tests, or shared across all tests. I think this is one of the reasons for using the classmethod decorator in the tutorial example I cited.
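
The sharing pattern itself boils down to a class attribute that is created once and handed to every instance. A minimal sketch, with a plain object standing in for the webdriver (SharedDriver is a hypothetical name, not RF API):

```python
class SharedDriver:
    _driver = None

    @classmethod
    def get(cls):
        # create the resource once, then return that same instance
        # to every library instantiation (setup, tests, teardown)
        if cls._driver is None:
            cls._driver = object()  # stand-in for webdriver.Chrome(...)
        return cls._driver

a = SharedDriver.get()
b = SharedDriver.get()
print(a is b)  # True
```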

There would be much more to cover, such as variables in tests. RF also supports the BDD (Gherkin) syntax in addition to the keyword style I showed here, but the underlying framework is much the same in both cases.

Anyway, that’s all I am writing on today. I find RF quite straightforward once you get the idea, and not too complex to use, even with the docs not being so straightforward. Overall, a very simple concept, and I guess one that the author(s) have managed to build a reasonable community around. Which I guess is what makes it useful and potentially successful.

I personally prefer writing software over putting keywords after one another, but for writing tests I guess this is one useful method. And maybe there is an art in itself to writing good, suitably abstracted, reusable yet concrete keywords?

That’s all, folks,…

A Look into AWS Elastic Container Service


Recently, I got myself the AWS Certified Solutions Architect Associate certificate. To prepare, I did the excellent Cloud Guru class on the certificate preparation on Udemy. One part that was completely missing from that preparatory course was ECS. Yet questions related to it came up in the exam. Questions on ECS and Fargate, among others. I thought maybe Fargate was something from Star Trek. Enter the Q continuum? But no.

Later on, I also went through the Backspace preparatory course on Udemy, which briefly touches on ECS, but does not really give any in-depth understanding. Maybe the certificate does not require it, but I wanted to learn it to understand the practical options on working with AWS. So I went on to explore.. and here it is.

Elastic Container Service (ECS) is an AWS service for hosting and running Docker images and containers.

ECS Architecture

The following image illustrates the high-level architecture, components, and their relations in ECS (as I see it):

ECS High-Level Architecture

The main components in this:

  • Elastic Container Service (ECS): The overarching service name that is composed of the other (following) elements.
  • Elastic Container Registry (ECR): basically handles the role of private Docker Hub. Hosts Docker images (=templates for what a container runs).
  • Docker Hub: The general Docker Hub on the internet. You can of course use standard Docker images and templates on AWS ECS as well.
  • Docker/task runner: The hosts running the Docker containers. Fargate or EC2 runner.
  • Docker image builder: Docker images are built from specifications given in a Dockerfile. The images can then be run in a Docker container. So if you want to use your own images, you need to build them first, using either AWS EC2 instances or your own computers. Upload the built images to ECR or Docker Hub. I call the machine used to do the build here "Docker Image Builder", even if it is not an official term.
  • Event Sources: Triggers to start some task running in an ECS Docker container. ELB, Cloudwatch and S3 are just some examples here, I have not gone too deep into all the possibilities.
  • Elastic Load Balancer (ELB): To route incoming traffic to different container instances/tasks in your ECS configuration. So while ELB can "start tasks", it can also direct traffic to running tasks.
  • Scheduled tasks: Besides CloudWatch events, ECS tasks may be manually started or scheduled over time.

Above is of course a simplified description. But it should capture the high level idea.
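
As a sketch of the "Docker image builder" step above, the build-and-upload flow looks roughly like this (the account ID 123456789012, region eu-west-1, and repository name my-repo are placeholders, not values from any real setup; the commands need Docker and AWS credentials to run):

```shell
# Log docker in to the private ECR registry
aws ecr get-login-password --region eu-west-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com

# Build the image from a Dockerfile in the current directory,
# tag it for the ECR repository, and push it
docker build -t my-repo .
docker tag my-repo:latest 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-repo:latest
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-repo:latest
```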

Fargate: Serverless ECS

Fargate is the "serverless" ECS version. This just means the Docker containers are deployed on hosts fully managed by AWS. It reduces the maintenance overhead on the developer/AWS customer, as the EC2 management for the containers is automated. The main difference is that there is no need to define the exact EC2 (host) instance types to run the container(s). This seems like a simply positive thing to me: otherwise I would need to match my task resource definitions against the allocated containers, etc. So without Fargate, I need to manage the allocated vs required resources for the Docker containers manually. Seems complicated.

Elastic Container Registry / ECR

ECR is the AWS integrated, hosted, and managed container registry for Docker images. You build your images, upload them to ECR, and these are then available to ECS. Of course, you can also use Docker Hub or any other Docker registry (that you can connect to), but if you run your service on AWS and want to use private container images, ECR just makes sense.

When a new Docker container is needed to perform a task, the AWS ECS infrastructure can then pull the associated "container images" from this registry and deploy them in ECS host instances. The hosts being EC2 instances with the ECS-agent running. The EC2 instances managed either by you (EC2 ECS host type) or by AWS (Fargate).

Since hosted custom images with your own code likely include some IPR you don’t want to share with everyone, ECR is encrypted, as is all communication with it. There are also ECR VPC Endpoints available to further secure the access and to reduce the communication latencies with the ECR by removing public Internet roundtrips.

As for availability and reliability, I did not directly find good comments on this, except that the container images and ECR instances are region-specific. While AWS advertises ECR as reliable and scalable and all that, I guess this means they must simply be replicated within the region.

Besides being region-specific, there are also some limits on the ECR service. But these are in the order of a maximum of 10000 repositories per region, each with a maximum of 10000 images, and up to 20 docker pull type requests per second, bursting up to 200 per second. I don’t see myself going over those limits, pretty much ever. With some proper architecting, I do not see these limits generally becoming a problem. But I am not running Netflix on it, so maybe someone else has it bigger.

ECS Docker Hosting Components

The following image, inspired by a Medium post (thanks!), illustrates the actual Docker related components in ECS:

ECS Docker Components

  • Cluster: A group of ECS container instances (for EC2 mode), or a "logical grouping of tasks" (Fargate).
  • Container instance: An EC2 instance running the ECS-agent (a Go program, similar to Docker daemon/agent).
  • Service: This defines what your Docker tasks are supposed to do. It defines the configuration, such as the Task Defition to run the service, the number of task instances to create from the definition, and the scheduling policy. I see this as a service per task, but defining also how multiple instances of the tasks work together to implement a "service", and their related overall configuration.
  • Task Definition: Defines the Docker image, resources (CPU, memory), instance type (micro, nano, …), IAM roles, image boot command, …
  • Task Instance: An instantiation of a task definition. Like docker run on your own host, but for the ECS.

Elastic Load Balancer / ELB with ECS

The basic function of a load balancer is to spread the load for an ECS service across its multiple tasks running on different host instances. Similar to "traditional" EC2 scaling based on monitored ELB target health and status metrics, scaling can also be triggered on ECS, just based on ECS tasks instead of pure EC2 instances in a traditional setting.

As noted higher above, an Elastic Load Balancer (ELB) can be used to manage the "dynamics" of the containers coming and going. Unlike in a traditional AWS load balancer setting, with ECS, I do not register the containers to the ELB as targets myself. Instead, the ECS system registers the deployed containers as targets to the ELB target group as the container instances are created. The following image illustrates the process:

ELB with ECS

The following points illustrate this process:

  • ELB performs healthchecks on the containers with a given configuration (e.g., an HTTP request on a given path). If the health check fails (e.g., the HTTP server does not respond), it terminates the associated ECS task and starts another one (according to the defined ECS scaling policy)
  • Additionally there are also ECS internal healthchecks for similar purposes, but configured directly on the (ECS) containers.
  • Metrics such as Cloudwatch monitoring ECS service/task CPU loads can be used to trigger autoscaling, to deploy new tasks for a service (up-scaling) or remove excess tasks (down-scaling).
  • As requests come in, they are forwarded to the associated ECS tasks, and the set of tasks may be scaled according to the defined service scaling policy.
  • When a new task / container instance is spawned, it registers itself to the ELB target group. The ELB configuration is given in the service definition to enable this.
  • Additionally, there can be other tasks not associated to the ELB, such as scheduled tasks, constantly running tasks, tasks triggered by Cloudwatch events or other sources (e.g., your own code on AWS), …

Few points that are still unclear for me:

  • An ELB target group can be set to either instance or IP target type. I experimented with simple configurations but had the instance type set. Yet the documentation states that with the awsvpc network type I should use the IP-based ELB configuration. But it still seemed to work when I used the instance type. Perhaps I would see more effect with larger configurations..
  • How the ECS tasks, container instances, and ELBs actually relate to each other. Does the ELB actually monitor the tasks or the container instances? Does the ELB instance vs port type impact this? Should it monitor tasks but I set it to monitor instances, and it worked simply because I was just running a single task on a single instance? No idea..

Security Groups

As with the other services, such as Lambda in my previous post, to be able to route the traffic from the ELB to the actual Docker containers running your code, the security groups need to be configured to allow this. This would look something like this:


Here, the ELB is allowed to accept connections from the internet, and to make connections to the security group for the containers. The security groups are:

  • SG1: Assigned to the ELB. Allows traffic in from the internet. Because it is a security group (not a network access control list), traffic is also allowed back out if allowed in.
  • SG2: Assigned to the ECS EC2 instances. Allows traffic in from SG1. And back out, as usual..

Final Thoughts

I found ECS to be reasonably simple, providing useful services to simplify the management of Docker images and containers. However, Lambda functions seem a lot simpler still, and I would generally use those (as the trend seems to be..). Still, I guess there are plenty of use cases for ECS as well: for those with investments into, or otherwise preferences for, containers, and for longer-running tasks, or tasks otherwise less suited for short-invocation Lambdas.

As the AWS ECS-agent is just an open-source program written in Go and hosted on GitHub, it seems to me that it should be possible to host ECS agents anywhere I like, as long as they can connect to the ECS services. How well that would work from outside the core AWS infrastructure, I am not sure. But why not? I have not tried it, but perhaps..

Looking at ECS and Docker in general, Lambda functions seem like a clear evolution path from this. Docker images are composed from Dockerfiles, which build the images from stacked layers (https://blog.risingstack.com/operating-system-containers-vs-application-containers/), which are sort of build commands. The final layer is the built "product" of those layered commands. Each layer in a Dockerfile builds on top of the previous layer.

Lambda functions similarly have a feature called Lambda Layers, which can be used to provide a base stack for the Lambda function to execute on. They seem a bit different, defining sets of underlying libraries and how those stack on top of each other. But the core concept seems very similar to me. Finally, the core of a Lambda function is the function that executes when a triggering event fires, similar to the docker run command in ECS. Much similar, such function.

The main difference for Lambda vs ECS perhaps seems to be in requiring even less infrastructure management from me when using Lambda (vs ECS). The use of Lambda was illustrated in my earlier post.