I have previously tried to monitor my ASUS RT-N66U router to see what traffic passes through it. You know, just your regular paranoid stuff. This time, however, I was trying to monitor the router CPU load and bandwidth use to figure out whether delays in my performance tests were somehow related to router performance (as illustrated in some previous posts). This did not turn out to be quite so simple. I will document my experiments in setting up the CPU monitoring here, for whatever that’s worth.
First off, I needed to enable SNMP on the router. Of course, I had done this a long time ago, but anyway, on this particular router this is done in the admin panel under Administration->SNMP. The SNMP communities set on this panel are by default “public” for read operations and “private” for write operations, so I will use these in my examples.
Secondly, I needed to find the OIDs to monitor. An OID is an Object ID that uniquely identifies some property we want to monitor or manage with SNMP. These are listed in MIB (Management Information Base) files, of which there are plenty, and browsing them to find what you are interested in gets complicated. So I just Googled for “SNMP CPU OID” etc. Some potential ones I found:
Average CPU load
over last 1 minute: .1.3.6.1.4.1.2021.10.1.3.1
over last 5 minutes: .1.3.6.1.4.1.2021.10.1.3.2
over last 15 minutes: .1.3.6.1.4.1.2021.10.1.3.3
Percentages:
user CPU time: .1.3.6.1.4.1.2021.11.9.0
system CPU time: .1.3.6.1.4.1.2021.11.10.0
idle CPU time: .1.3.6.1.4.1.2021.11.11.0
Raw values:
raw user cpu time: .1.3.6.1.4.1.2021.11.50.0
raw system cpu time: .1.3.6.1.4.1.2021.11.52.0
raw idle cpu time: .1.3.6.1.4.1.2021.11.53.0
raw nice cpu time: .1.3.6.1.4.1.2021.11.51.0
So, looking at these, I figured that if I wanted to find out why my client-server performance takes a hit when running the performance tests over the router with high numbers of concurrent clients, I would want very fine-grained information on the CPU load. Averages over 1-15 minutes are not very useful for that. The percentage values seemed much more interesting. However, even though these are listed the same way all over the internet, nowhere does it really say what their granularity is. That is, percentage over what time? Or I am sure it is said somewhere but I did not stumble onto it. Anyway, I optimistically assumed it was a “real-time” percentage, that is, the load percentage at the time of measurement.
So I set up my system to query these from the router:
I was expecting system+user CPU time to equal the actual CPU load. Obviously it does not: the idle time goes down by about 25% at the right end, while the system+user loads sum up to less than 2%. So I am missing a large chunk of the actual CPU load somewhere (about 23% at the right end here). Also, the idle graph might be correct but it is definitely not “real-time”. When I look at the CPU load graph shown in the router admin interface itself, this is quite obvious:
The admin interface shows a shorter timeframe, so the right hand side of the SNMP idle graph matches generally the slope of CPU load shown in the router admin interface. However, it is much less “bumpy” than the router graph, meaning it is an average over a longer period than the one shown in the router interface. Even though my “system+user” measure is obviously broken, I could get the CPU load from this by using a formula of “100 – idle”. However, obviously I would prefer something more fine-grained.
Another hopeful approach I found is the OID for something called ProcessorLoad (.1.3.6.1.2.1.25.3.3.1.2.x). This one is a bit odd, though. I believe the “x” part identifies the processor I want the statistics for (my router only has one core, more modern ones have several). To find this out I do an snmpwalk on the command line:
snmpwalk -v2c -c public 192.168.2.1 .1.3.6.1.2.1.25.3.3.1.2
Which on my N66U gives me:
iso.3.6.1.2.1.25.3.3.1.2.196608 = INTEGER: 1
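A minimal sketch of turning walk output like the line above into per-processor values, assuming the plain text format printed by snmpwalk here. The function name parse_processor_load and the regular expression are made up for this illustration and are not taken from my polling code.

```python
import re

# Matches lines such as "iso.3.6.1.2.1.25.3.3.1.2.196608 = INTEGER: 1".
# The last OID component is the processor index, the integer is the load.
LINE_RE = re.compile(r"\.(\d+)\s*=\s*INTEGER:\s*(\d+)\s*$")


def parse_processor_load(walk_output):
    """Return {processor_index: load_percent} parsed from snmpwalk text output."""
    loads = {}
    for line in walk_output.splitlines():
        match = LINE_RE.search(line.strip())
        if match:
            loads[int(match.group(1))] = int(match.group(2))
    return loads


sample = "iso.3.6.1.2.1.25.3.3.1.2.196608 = INTEGER: 1"
print(parse_processor_load(sample))  # {196608: 1}
```

Lines that do not look like an integer result are simply skipped, so a walk that returns several processors yields one dictionary entry per processor.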
And on an AC68U that I also had access to, it also lists another core with id 196609 (the AC68U is a dual-core router). The value (INTEGER: 1) is the load of the processor. Using this, I get a load graph that is the opposite of the idle graph, so obviously it is now correctly measuring the load, averaged over the same period as the idle graph:
However, as before, this is obviously not as fine grained as I was hoping. For comparison, the admin graph again:
So is there another way I can get a finer granularity of information somehow? Looking and asking around a bit I did not find much information on this. However, somewhere on the mighty internet I did find information that the average value OIDs (1-15min) would be deprecated because the same information can be derived using the “raw” values. I already listed most of these “raw” measure OIDs in the beginning, but what is a raw value and how do I use them?
After some Google-fu (or flu) it turns out the “raw” values are something called CPU ticks. So what is such a tick? My understanding is that these are counters used internally to provide fine-grained time intervals that map to how much time the CPU spends in each state (system, user, idle, …). Anyway, they seem to be at quite a granular level (much less than a second, and I was looking for one-second precision), so maybe I can use these. And how do you calculate the percentage of CPU load from the raw values?
Again, not a huge amount of information on this. But I figure it can be done by adding all the different “raw” values together and calculating the percentage of busy values vs idle values in the observed time period. Here I use the diff of the values so if at second 1 the value is 5 and at second 2 it is 10, the actual number of ticks for the last period (second 2) is 10-5=5.
To calculate the percentage, if the raw values for “user”+“system”+“nice” sum up to 10 ticks in a second, and the raw value for “idle” is 90 ticks in the same second, I get a 10% CPU load for that second (with 90% idle). Of course, I cannot assume these values are in the range 0-100 or anything like that; I have to assume there can be any number of ticks in a second. I also believe the number of ticks per second can vary from system to system, so I need a more dynamic calculation.
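The diff-and-ratio idea described above can be sketched roughly like this. It is an illustration only, not the code from my repository: the function names, the dictionary keys and the sample counter values are all made up, and counter wrap-around is not handled.

```python
def cpu_percentages(previous, current):
    """Percent of ticks spent in each state between two polls.

    previous/current: dicts mapping a state name to its raw tick counter.
    """
    # Counters are cumulative, so the ticks for the period are the differences.
    deltas = {name: current[name] - previous[name] for name in current}
    total = sum(deltas.values())
    if total <= 0:
        # Counters did not advance between the polls (or were reset).
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


def cpu_load(previous, current, idle_key="idle"):
    """Busy percentage: all non-idle states as a share of all states."""
    shares = cpu_percentages(previous, current)
    return sum(share for name, share in shares.items() if name != idle_key)


# Hypothetical counter readings from two consecutive polls.
first = {"user": 100, "system": 200, "nice": 0, "idle": 5000}
second = {"user": 104, "system": 206, "nice": 0, "idle": 5090}
print(cpu_load(first, second))  # 10.0
```

Because the result is a ratio of tick differences, it does not matter how many ticks there are per second. Any further state counter the device exposes can be added to both dictionaries without changing the functions.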
The formula I tried is (user+system+nice)/(user+system+nice+idle)*100 to get the percentage. As I mentioned, I did not find much information on this. However, I found some post(s) suggesting this is actually the way some commonly used SNMP monitoring tools also do it. So what does it look like if I do it like this? It looks like this:
As before, the idle line seems about correct (the router graph is the same one above), and in this case it is obviously at much finer granularity than before. Which is nice. However, when I compare the “system”, “nice” and “user” lines to the “idle” line they are obviously still nowhere near the actual load. They (“system”+”nice”+”user”) sum up to about 2% at most, while the idle line shows that something is consuming up to 65% or so of CPU load (right end of the graph, “100-idle”).
Why these broken results? After going through my SNMP polling code too many times to look for any bugs, trying various fixes, etc. I figure either the router SNMP implementation is broken or I am still doing it wrong. More likely the latter.
So finally I got the idea to SSH into the router (ASUS Merlin firmware at least has the option to enable an internal SSH server in the router). Then I ran the “top” command to see if the load reported by “top” is as far away from my SNMP graphs as the router admin panel charts are.
Surprisingly, “top” shows values very close to what I was getting from my raw percentage calculations for “user”, “system”, and “nice”. So what gives? Well, the “idle” is also close to the same as my graph, so after looking around some more, I notice there is another value called “sirq” that I have not included in my formula. And it is big. Googling around for “sirq” just got me a bunch of questions on “top” and “sirq” and why it is sometimes high. But no explanation for what “sirq” actually stands for. After some time I finally figure out it must be related to “software interrupts”.
So after some more Googling, I find there is an OID that seems relevant: 1.3.6.1.4.1.2021.11.61 (ssCpuRawSoftIRQ). Performing an SNMP walk over 1.3.6.1.4.1.2021.11 actually shows this (the raw values are all under the same hierarchy branch). I did the snmpwalk before but missed this, as it was never mentioned elsewhere I looked and it appeared later in the list of walk results. That’s my excuse.
So I add the soft interrupt load to my formulas and now I finally get:
So summing up the user+system+nice+sirq should do the trick. However, even without this I see it is in my case mostly just the “sirq” value that makes up the load here. Which seems much closer to the router admin panel chart:
However, it is still not quite the same. Why is this? The tick counters actually seem to be updated only every 5 seconds, so I guess this is the finest granularity that I can monitor on this router. The admin panel seems finer, probably a one-second interval. But at least 5 seconds is better than one minute.
As a related note, I was interested in more metrics than just this: the bandwidth consumed on the router (maybe my performance test taking too much bandwidth would cause problems) as well as other resource use on the router, such as memory. The bandwidth is a bit tricky. The router has a number of network interfaces: the internet (WAN) connection, the wired ports, and the wireless. I could get a list of all these by performing another snmpwalk of the related OID:
snmpwalk -v2c -c public 192.168.2.1 .1.3.6.1.2.1.2.2.1.2
IF-MIB::ifDescr.1 = STRING: lo
IF-MIB::ifDescr.2 = STRING: eth0
IF-MIB::ifDescr.3 = STRING: eth1
IF-MIB::ifDescr.4 = STRING: eth2
IF-MIB::ifDescr.5 = STRING: vlan1
IF-MIB::ifDescr.6 = STRING: vlan2
IF-MIB::ifDescr.7 = STRING: br0
What is what here? I don’t really know. But doing an snmpwalk on the bytes in/out OID values (1.3.6.1.2.1.31.1.1.1.6, 1.3.6.1.2.1.31.1.1.1.10) showed me that eth0 has a much higher download count than any other, so I am assuming that eth0 is the internet (WAN) connection shared by all devices in the LAN. From some Googling I assume that “br0” is probably the wireless interface. The others I am less sure about, although I guess eth1 and eth2 are two more of the wired ports, and vlan1 and vlan2 are probably the remaining two. Maybe they are just in a special mode (IPTV?). So I tried monitoring the OID for “br0” for traffic, which produced nice graphs going up. So maybe I was right.
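The bytes in/out values are cumulative counters as well (these two OIDs are the 64-bit ifHCInOctets and ifHCOutOctets), so a graph of them only ever goes up. To get bandwidth, the difference between two polls has to be divided by the polling interval. A rough sketch, with a made-up function name and made-up sample numbers:

```python
COUNTER64_MODULUS = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


def octet_rate_bps(prev_octets, curr_octets, interval_seconds):
    """Average bits per second between two polls of an octet counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Modulo arithmetic keeps the delta correct if the counter wrapped once.
    delta = (curr_octets - prev_octets) % COUNTER64_MODULUS
    return delta * 8 / interval_seconds


# Hypothetical readings taken 5 seconds apart.
print(octet_rate_bps(1_000_000, 1_625_000, 5))  # 1000000.0
```

A counter reset (for example after a router reboot) would still show up as one bogus spike, so such a sample is probably best dropped.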
The rest (memory, etc.) was quite simple. Just query the OIDs for those over SNMP; they seem to give the actual “real-time” values. I will not repeat that here. However, for anyone who needs more details, I have the code on GitHub. It is commonly in flux (until I find another thing to play with), the docs are usually not fully up to date, blaabllaa, but the general idea can probably be found by browsing the source code if nothing else: https://github.com/mukatee/pypro
Finally, an interesting experiment would be to iteratively increase the length of the polling interval for the raw metrics to see how long it has to be to match the ProcessorLoad or CPUPercentage SNMP OID metrics. This would give me the actual averaging interval for those. Arrrrr…
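That experiment would not even need re-polling: already collected raw samples can be re-averaged over longer and longer windows and compared against the other graphs. A sketch of the idea, assuming the samples are stored as cumulative (busy ticks, total ticks) pairs. The function name and the sample data are hypothetical.

```python
def windowed_load(samples, window):
    """Load percentage over the last `window` polling intervals.

    samples: list of (busy_ticks, total_ticks) cumulative counters, oldest first.
    Returns one value for each sample that has `window` earlier intervals.
    """
    loads = []
    for i in range(window, len(samples)):
        busy = samples[i][0] - samples[i - window][0]
        total = samples[i][1] - samples[i - window][1]
        loads.append(100.0 * busy / total if total > 0 else 0.0)
    return loads


# Hypothetical cumulative counters, one entry per poll.
samples = [(0, 0), (10, 100), (60, 200), (70, 300), (80, 400)]
print(windowed_load(samples, 1))  # [10.0, 50.0, 10.0, 10.0]
print(windowed_load(samples, 2))  # [30.0, 30.0, 10.0]
```

The window size at which the smoothed curve starts to look like the ProcessorLoad graph, multiplied by the polling interval, would then be an estimate of the averaging period.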
Anyway, that is that for how far I got in setting up my router monitoring at a fine granularity for my performance test analysis. Next I should see if I can correlate this with any delays or errors in the actual performance test…
Hey, great write-up. I myself am in the “paranoid” stage of wanting to monitor everything. Mostly for fun actually.
So I have flashed the latest Merlin build and I am checking out the OIDs. As I was reading your post I noticed your graphs.
What tool are you using? Right now I have a rasp-pi with cacti on it and I’d like to use something a bit lighter. Just have not seen that graph that you have used before.
How has your graphing come along since you posted?
Hi Eric,
Thanks for the nice comments. The tool I used to build the graphs is Grafana, and the backend database I use for this is InfluxDB.
I doubt it is lighter for you, but it suits my use case nicely as I collect metrics from SNMP, my own application servers, etc. Currently I am not running this full-time in my home environment, but thinking about it, maybe I will give it a try for a while, as it would be very interesting.
Another topic would be to do some more detailed monitoring within the router itself, as in my post some year(s) ago on running tcpdump in the router and looking at the raw results. But I haven’t graphed that. I remember also reading about a remote pcap component that might allow doing “real-time” wiresharking on the router. That would also be cool for all us paranoids.. 🙂
Graphing the “raw” results would also be cool, not sure what would work well for that.
Being a fellow Grafana fan and Influx user, seeing those graphs put a smile on my face 🙂
Would you be willing to share what you’ve done thus far so we can collaborate via GitHub? I’d like to get this stood up into a container or VM that I’d drive via Ansible so I can run it via my home machine. I’d potentially fold in Sensu alerting for things like when there’s excessive usage or unexpected traffic (e.g.: high BW util overnight).
Thanks in advance!
Hi,
Glad you liked it :).
The Python code I used for the SNMP polling is on github at https://github.com/mukatee/pypro. It has a bunch of other experiments with it so you might find it easier to pick some or so. But that could be helpful to get you started.
Cheers,
Teemu
Cool write up. Here’s some info:
* eth0 should be the wan.
* eth1-2 are two of the physical ports (do you have only 2 plugged in? I imagine the others would show up if you plug something into them)
* vlan1-2 are virtual interfaces used to segregate traffic so you can have two separate networks that can not share data in the same router. This is most likely used for separating your home wifi from guest wifi. They act like virtual switches. Anything on vlan1 can not talk to anything on vlan2 and vice-versa.
* br0 is your physical wifi interface. It’s likely a Broadcom chipset, thus the ‘br’. Vlan1 and vlan2 piggyback on top of the physical interface.
Cool! Thanks for the great information David. All that makes sense now that you said it … The problem always seems to be to find the right people who have the information, and then all is suddenly clear 🙂
The virtual switch part actually sounds a lot like all the network slicing stuff I have been hearing lately in the telco domain, as well as all the virtualization going on at network layers. Interesting to see this in my home router already for many years.
VLANs have been around in most switches and routers for years and years. They allow you to make computers and other devices believe they’re on a completely separate network from other devices that are all plugged into the same physical wiring. A single port (or ssid for wifi) is assigned to a specific VLAN. The switches and routers are programmed in the operating system to prevent packets from hopping VLANs, essentially creating any number of separate virtual networks within the same physical one.
This is useful for security and compartmentalization of information. You could put 2 different departments on two separate vlans and they’d not be able to talk to each other, and you’d only have to run one set of wiring. Routers can enable cross-talking between VLANs when set up properly.
In the case of these home routers they only need a couple VLANs because there is no need for more than 2 networks. Commercial switches and routers like Cisco can handle hundreds of VLANs simultaneously.