Previously I played around with ElasticSearch + Kibana, which was nice for visualizing some activities on my Minecraft server. I then tried the same for some server performance metrics, such as CPU and memory use (collected using some Python scripts I created with the psutil library). ES + Kibana are absolutely great tools for analyzing textual data such as logs, but not so effective for numerical time-series data, I would say. Maybe this is why they are so often combined with tools like Logstash.
Anyway, after looking around, I found InfluxDB and Grafana to be more suitable alternatives for my performance data. What is even nicer is that there is now a shiny tool from the InfluxDB folks called Telegraf that collects all those system metrics for me. And since shiny pictures are nice, I thought I would post some of those here.
On this occasion, I am testing a simple backend server for an Android app. Clients register with the server, which then sends updates for subscribed sensors to each client at one-second intervals. The client is the Android app in this case. My tests run from a single laptop and test how much load the server can handle by continuously registering new clients to receive more updates.
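The server code itself is not part of this post, but conceptually the update push boils down to a scheduled loop over the registered sessions. A minimal sketch of that idea, with ClientSession and sendSensorUpdates() as hypothetical placeholders:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: pushes the subscribed sensor values to every registered client once per second.
public class UpdateScheduler {
    private final List<ClientSession> sessions = new CopyOnWriteArrayList<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void register(ClientSession session) {
        sessions.add(session);
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            for (ClientSession session : sessions) {
                session.sendSensorUpdates();   // send the latest values for this client's subscriptions
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    // Placeholder for whatever the real server uses to represent a connected client.
    interface ClientSession {
        void sendSensorUpdates();
    }
}
```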
Off we go with the initial tests. I first implemented my test client using plain Java SSL sockets. The app uses a custom binary protocol, and it is simply more efficient to keep the connection open and push the data through, which is why I went with the plain socket approach. I used two types of probes to collect the data: the Telegraf plugins for basic system resources such as CPU, memory, and network traffic, and a few custom probes I inserted into the application and into the tester to collect additional data.
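A blocking SSL test client along these lines could look roughly like the sketch below; the host, port, and register message are assumptions, since the actual binary protocol is not shown here:

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.io.DataInputStream;
import java.io.DataOutputStream;

// Sketch of a blocking SSL test client. In the tester, one of these runs per simulated
// client, each on its own thread, which is what eventually runs into the OS thread limit.
public class PlainSslClient implements Runnable {
    @Override
    public void run() {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket("localhost", 5566)) {
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            DataInputStream in = new DataInputStream(socket.getInputStream());

            out.write(new byte[]{0x01});        // placeholder register message (custom binary protocol)
            out.flush();

            byte[] buffer = new byte[4096];
            while (in.read(buffer) >= 0) {      // block until the next 1-second update arrives
                // ...decode the update and record the receive timestamp for the custom probes...
            }
        } catch (Exception e) {
            // connection closed or test stopped
        }
    }
}
```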
Anyway, pics you say:
This one shows the custom metrics I collected from the server/tester. The memory chart contains four different values. The “mem_used” metric measures the amount of reserved RAM in the JVM. The “mem_total” metric measures the total amount of RAM the JVM has allocated so far, out of all the RAM it has been given. In this case, I gave the JVM a maximum of 1GB of RAM to allocate. The chart shows the JVM never allocates more than 250MB of this, so 1GB is plenty. This is reflected in the “mem_free_max” metric, which shows how much more, at most, the server could still allocate beyond what the JVM has already allocated.
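The agent code itself is not shown in this post, but these values map quite naturally onto the standard java.lang.Runtime API; a minimal sketch, assuming that is roughly how they are derived:

```java
// Assumed mapping of the custom memory metrics onto the JVM's own view of its heap.
public class MemoryProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long memTotal = rt.totalMemory();               // heap the JVM has allocated so far
        long memUsed = memTotal - rt.freeMemory();      // portion of that allocation actually in use
        long memFreeMax = rt.maxMemory() - memUsed;     // headroom left before the 1GB (-Xmx) cap
        System.out.printf("mem_total=%d mem_used=%d mem_free_max=%d%n", memTotal, memUsed, memFreeMax);
    }
}
```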
The session count metric measures how many clients are connected to the server at a time (active sessions). The “frequency” metric here measures how much time, on average, it takes at each point for the server to send the requested data updates to the clients. This chart shows the deviation from the expected 1000 milliseconds is at most 4 milliseconds, so well within range.
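One simple way to produce such a frequency measure is to timestamp consecutive updates and compare the gap to the expected interval, for example like this (a sketch, not the actual agent code):

```java
// Sketch: measures how far apart consecutive updates arrive, relative to the expected 1000 ms.
public class FrequencyProbe {
    private static final long EXPECTED_INTERVAL_MS = 1000;
    private long lastUpdateNanos = -1;

    // Called whenever an update is sent (server end) or received (tester end).
    public long onUpdate() {
        long now = System.nanoTime();
        long deviationMs = 0;
        if (lastUpdateNanos >= 0) {
            long intervalMs = (now - lastUpdateNanos) / 1_000_000;
            deviationMs = intervalMs - EXPECTED_INTERVAL_MS;   // e.g. +4 means the update was 4 ms late
        }
        lastUpdateNanos = now;
        return deviationMs;
    }
}
```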
These “custom” metrics are all collected using my own measurement agents, which I deployed both at the server and at the tester end. With the server, I can also use these during production deployment to keep monitoring performance, so they are useful beyond performance testing as well.
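How the agents ship their numbers to InfluxDB is not detailed here; one straightforward way is a plain HTTP POST to InfluxDB's line protocol write endpoint, sketched below (the database name, measurement, and tag are made up for the example):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: writes one data point to InfluxDB over its HTTP API using the line protocol.
public class InfluxReporter {
    public static void report(String influxHost, long sessionCount) throws Exception {
        URL url = new URL("http://" + influxHost + ":8086/write?db=perftest");
        String line = "sessions,source=server count=" + sessionCount;   // measurement,tags fields

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(line.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() / 100 != 2) {                        // InfluxDB answers 204 on success
            throw new IllegalStateException("InfluxDB write failed: " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}
```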
In this initial try with the plain sockets approach, I got up to around 2000 client connections. At this point, I ran into the operating system thread limit, since with blocking sockets each connection ties up a thread of its own. I tried to get around that with another approach, but more on that later.
For metrics on the system resources, I used the Telegraf tool, which can directly collect operating system (and various other) metrics and push them into InfluxDB. I forgot to turn Telegraf on at first when doing this example, so the data is a bit short. Anyway, visualization:
Here, the top row shows CPU use percentage for each of the four cores on the system. The middle row shows the average of these for total system CPU load. The bottom row is for network traffic as bytes in/out. Unfortunately this did not seem to work on OSX so it is just a flatline for now.
In any case, 2000 clients seems like plenty to test for my app, which has never seen more than a fraction of that in installs. But since I am always too curious, I wanted to try to implement this properly. So I went with Netty (as I needed both Java NIO and SSL support, which is horrible to implement by yourself, thank you very much Java engineers..).
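Again, the real server code is not shown in this post, but the core of a Netty-based NIO + SSL setup looks roughly like the bootstrap below (Netty 4.x; the self-signed certificate, port, and UpdateHandler are placeholders for illustration):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.util.SelfSignedCertificate;

// Sketch of an NIO + SSL server bootstrap with Netty 4.x: a couple of event loop groups
// handle all connections instead of one thread per connection.
public class NettyUpdateServer {
    public static void main(String[] args) throws Exception {
        SelfSignedCertificate cert = new SelfSignedCertificate();   // placeholder certificate for the sketch
        SslContext sslCtx = SslContextBuilder.forServer(cert.certificate(), cert.privateKey()).build();

        EventLoopGroup bossGroup = new NioEventLoopGroup(1);        // accepts incoming connections
        EventLoopGroup workerGroup = new NioEventLoopGroup();       // handles I/O for all accepted connections
        try {
            ServerBootstrap bootstrap = new ServerBootstrap();
            bootstrap.group(bossGroup, workerGroup)
                     .channel(NioServerSocketChannel.class)
                     .childHandler(new ChannelInitializer<SocketChannel>() {
                         @Override
                         protected void initChannel(SocketChannel ch) {
                             ch.pipeline().addLast(sslCtx.newHandler(ch.alloc()));   // TLS on top of NIO
                             ch.pipeline().addLast(new UpdateHandler());             // custom binary protocol
                         }
                     });
            bootstrap.bind(5566).sync().channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }

    // Placeholder for the handler that registers sessions and pushes the sensor updates.
    static class UpdateHandler extends ChannelInboundHandlerAdapter { }
}
```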
So, using the same visualizations, here are some graphs.
This one shows the start of the test execution. Notice the flatline part on the left when the test is started? That flatline is the measurement agents showing the server JVM statistics before the test clients are started up. Interestingly, the JVM starts by allocating about 256MB of memory out of the possible 1GB. Shortly after more load is put on the JVM, the allocation goes down much closer to the actual usage.
As for the frequency measure, it is a bit off at the start, which I would guess is quite normal for the first measurement as it calibrates itself. What is more noticeable here is that I added a second frequency measure. Now there are two overlapping lines: one for the delay observed at the server end and one for the tester end. This gives me a way to see whether the delay is due to too much load on the server or too much load on the network (and it avoids the need for time synchronization such as NTP or PTP when I distribute this, since each end only compares against its own clock). It also has the benefit that I can keep observing the server delay using the same agent in production.
After a longer run, this is where I end up:
At around 4000 client connections I hit another limit. This time I have passed the thread count limit, but I hit the limit on the number of open file descriptors the operating system allows. File descriptors are also used for sockets, and the OS has a limit on how many can be open at once. Apparently this can be modified using various tricks involving ulimit and kernel parameter tuning on OSX. However, I do not wish to mess up my personal laptop, so I give up on this and take the 4000 as enough for now. Otherwise I would take a virtual machine or several, install Ubuntu, and mess with the file limits there, as a VM is easy to redo. Then I would make several of them and bombard my server with all of them. But for now this is fine.
For the Telegraf metrics we get this:
So the CPU load goes up steadily towards the end and then suddenly drops. Which is kind of weird, as I would expect it to really drop after the test finishes, not 30 seconds before it. Either there is some issue with how the different agents take their timestamps, or there are some delays in processing some of the metrics. I should probably investigate further. However, this does not really affect my performance test, so I am fine with it for now.
Anything else? Well, there are a few interesting bits to show. Let's see.
Here is a frequency spike that hides all other variation due to being so large:
We can drill down on the flatline to see the variation there much like we did in Kibana (click and drag to select on the chart):
Of course, it is a valid question whether the variation at this very fine granularity makes any difference, or if the large spike is the only thing of real interest here. But it illustrates how these fancy tools can be nice.
Another thing of interest. During some test runs, I managed to get some weird large spikes in the frequency (as well as session registration):
These spikes are the same for both the tester and the server frequency, and the number of new sessions that get through also flatlines at this time. So something must hang the system for close to 30 seconds here. I would suspect some network lag, but I believe I was running both the tester and the server on localhost, so it should not be that (although InfluxDB is on another host). Also, the fact that the server spike is identical to the tester spike (the lines overlap) speaks against this. So, a look at the CPU load for this period:
There does seem to be some correlation between the highest CPU spikes and the lag spikes in update frequencies. So I tried to think what I had changed or what could be causing this. I figured it was probably the fact that I had added some pretty verbose logging and enabled log rotation + compression, so the logs were filling up fast and getting rotated + compressed at those points. I reduced logging to sane levels and the spikes never came back. Maybe it was that, or maybe it was the sun spots. But it's gone, so fine with me..
So that is a basic example of how I used this toolset for performance testing. Now, in practice I would not run the server on localhost, nor would the client apps be running on localhost (or on the same host as the server). So I should actually test it in the real server environment. About that next…