This time running the test with the fixed SNMP based router monitoring and with the client-server monitoring as well. Notice in the following graphs how I learned to also put two scales on a single chart, combine line and bar charts, and add units to all the charts. OOoooo, prettyy….
First up, the number of sessions, the update frequency delays, and the connection (SSL handshake) errors from the test start:
Here the “errors” graph shows the overall number of connection errors so far in the test (the orange line) and the number of errors for each second (the green bars). The number of sessions is growing at relatively steady pace, with slight drop in rate over time. At around 16:46:45 we start getting errors at a steady rate, which is at around 1300 sessions and at the same time the tester starts seeing bigger delays.
The router CPU load goes up pretty high quite fast:
At the problem start point of 16:46:45 the router CPU load starts to go over 80% quite often. This is likely a good part of the cause, although I would expected more properties to be involved since this is still not at 100% (although looking at the router internal graph below, maybe it is at 100%, the forced 5s averaging just hides it).
Notice another interesting part on the left. The test starts from about 20% CPU load. Why is this? I speculate it is due to the other traffic over the same network. You know, “other users” streaming YouTube and playing online games, etc., while I run my tests. So this is actually the “background noise” of the system. It is also probably why in this test run I only got to around 3800 clients when I have previously gotten close to 4200 with less background noise.
Looking at the router internal CPU graph:
It is actually quite high but in peaks. With the finest granularity of 5 seconds, we are seeing much of this averaged much lower in the Grafana chart above. So maybe the CPU load on the router is actually the reason to see some of those connection errors. The minimum of 5 second averaging the SNMP agent on the router gives us is just partly hiding it.
A couple of other, slightly less relevant, but still quite interesting charts I collected during the run:
This is the “br0”, or interface number 7, from the router (that I speculate is the wireless part). That is, the amount of traffic in/out on this interface. Notice how the “out” part (the orange line) actually has a much bigger scale on the left, while the “in” part (green) has a much smaller scale on the right. At first I wondered how can this be. I would expect much more downloading (in) to happen over this interface than uploading (out).
With my tinfoil hat on I immediately thought, maybe the NSA or the chinese are downloading all my secrets and streaming video of me picking my nose 24/7 from my compewters? But then I realized this is from the router viewpoint, so “out” probably means the router is sending data to the WLAN clients and “in” means it is receiving data from them. So this would be opposite of my assumption, and “in” is actually uploads and “out” downloads. Which seems more correct and I can remove the hat for now.
Anyway, slightly interestingly the upload part starts growing at a faster rate compared to the download part when I run the test. Most likely due to the difference in scale but shows how the traffic for upload increases proportionally more. Although in absolute terms I believe the increase is the same, as it is the application server and client sending data to each other over the wireless, so it is up/download over the same channel for the same data.
Just for the kicks, the JVM memory use I collected:
Nothing special here, but notice again how having the actual correct units on the y-axis is nice.
Now to take a look at the end part of the test. The errors, update frequency delays and router CPU load:
The router CPU in the above is actually hitting 100% in a couple of spots, which is quite heavy since these CPU load measures are 5 second averages as explained before. This is why Grafana is upping the graph top boundary to 125%. I should probably have set the graph maximum upper bound in Grafana at 100% but whatever. At the end ,when the test stops creating new connections, and old ones die out, the router load also drops back to the level of the background noise.
Which we can also see in the router internal chart:
We can do from this an interesting comparison using the direct measures (the ones directly queried over from specific SNMP OIDs) vs derived ones (derived from the several raw measures):
The the scale different due to auto-scaling changing the raw chart to 125% top while the direct never goes near 100% and is thus topped at 100%. Anyway, this comparison again shows how the longer averaging period hides much of the important information. This is especially visible in the last part of the period, although the whole period is pretty obvious. And if I forgot before the “CPU busy” value (blue line in the “Router CPU raw”) metric is the sum of all the other metrics in that chart (user + system + nice + sirq). Anyway, in any case, the “direct” chart is averaged over longer term (likely over a minute as I speculated in my previous posts) and thus more information loss is evident.
The router RAM part is pretty constant and was this during all my tests. Thus I haven’t put much of it in these posts but there it is for once in the above chart as well..
Then we take the server CPU load just to see if was still low enough not to cause any of these issues:
And no problem, it never goes over 50%. Interesting spikes though.
We can also look at the whole picture at once for the whole test run. The test runs for about 10 minutes and gets up to about 3800 concurrent sessions:
and the network errors, router CPU load and update frequency delays in client and server:
Cute. Are we missing anything? Yes, we do not measure the load on the client side where the tester is running. This could also have a very significant impact. Here is an example of the top running processes at the client machine during the test:
What are these? Well, Firefox is running Grafana, which seems to really suck on the CPU when set up with complex graphs and continous refreshes over 5 second intervals. IntelliJ is the IDE I am using to run the tests. The “java” process is likely related to the same, being a part of the performance tests. Why are they showing up as two, both sucking on the CPU as a separate process at relatively high intensity? No idea (arrr..), but maybe “java” is for the actual tester and IntelliJ is where it is run from, and it needs to track the tester process, capture prints, etc. Finally, “Python” is the SNMP poller that I run to capture the router metrics.
So what do I think I should have learned from this:
- I should architect the system to run monitoring from one node, logging database on one node, graphing UI from another node, …
- I should track all elements of the system, not just the client and the server
- Long averaging periods for measurements lose important information for detailed analysis of the performance tests vs functionality (e.g., reasons for network errors as sessions start piling).
- However, I should maybe not spend excessive time building shiny monitoring systems when the potential cause has become clear. Perhaps if I have one ready as I now do..
- I should architect my test system to become a distributed system, easily deployed on several nodes. This should allow avoiding process resource limits (e.g., threads per process), kernel limits (e.g., file descriptors), and provide more realistic loads over several distributed nodes
- I should design in monitoring features that are useful in both production and testing
What next? Well there are a few missing parts. One is that I should monitor the tester clients as noted above. Then another not-so-obvious point is that I should synchronize the clocks of all systems as close as possible. Otherwise I can never link properties such as network errors to other metrics from other nodes. Something like NTP should be used. But is it a bit complex to deploy effectively. In fact, distributed computing in general is a bit complex so I should probably look into building a nice shiny distributed performance test framework and NTP synchronization sometime in the future..
Another interesting aspect here would be to see how much the longer averaging period makes it harder to find actual causes, in a scenario where this could be realistically controlled with other factors.