
Collecting java.util.logging to log4j2

Everybody wants to write a log. And in Java everybody wants to write their own logging framework, or at least use one of the many different ones. Then someone comes up with a logging-framework framework such as SLF4J.

OK, but back to what I was about to say. As so many times before, I had a piece of Java software writing a log file using Log4J2. I was also using some libraries (someone else's code) that use java.util.logging to write their logs. I wanted to capture those logs and include them in my Log4J2 log file for debugging, error resolution, or whatever.

In this case I was trying to log errors from the InfluxDB Java driver. The driver uses java.util.logging to keep its external dependencies minimal, or something like that. I used Log4J2 in my app.

So, the usual question: how do you merge logging from java.util.logging code that you do not control with your own Log4J2 code to produce a single unified log file?

Most of my Googling told me all about SLF4J and friends. I did not want yet another framework on top of existing frameworks, with yet more (transitive) dependencies and all sorts of weird stuff. Because I am old and naughty and don't like piling on abstractions just because.

So, here is the code to do this with zero external dependencies.

First a log Handler object for java.util.logging to write to Log4J2:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

/**
* @author Daddy Bigbelly.
*/
public class JekkuHandler extends Handler {
  //notice that this is the Log4J2 logger here, inside a java.util.logging Handler object
  private static final Logger log = LogManager.getLogger();

  @Override
  public void publish(LogRecord record) {
    Level level = record.getLevel();
    if (level.intValue() == Level.SEVERE.intValue()) {
      log.error(record.getMessage(), record.getThrown());
    } else if (level.intValue() >= Level.INFO.intValue()) {
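      //note: java.util.logging WARNING also falls into this branch, so warnings end up at Log4J2 info level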
      log.info(record.getMessage(), record.getThrown());
    } else {
      log.debug(record.getMessage(), record.getThrown());
    }
  }

  @Override
  public void flush() {}

  @Override
  public void close() throws SecurityException {}
}

Next setting it up and using it, with the InfluxDB Java driver as an example:

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.BatchPoints;
import org.influxdb.dto.Point;
import org.influxdb.dto.Query;
import org.influxdb.impl.BatchProcessor;

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.logging.ConsoleHandler;
import java.util.logging.FileHandler;
import java.util.logging.Formatter;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

/**
* @author Daddy Bigbelly.
*/

public class LogCaptureExample {
  public static void main(String[] args) throws Exception {
    //oh no the root password is there
    InfluxDB db = InfluxDBFactory.connect("http://myinfluxdbhost:8086", "root", "root");
    String dbName = "aTimeSeries";
    db.createDatabase(dbName);
    db.enableBatch(2000, 1, TimeUnit.SECONDS);

    //if you look at the InfluxDB driver code for BatchProcessor (the class whose log we want to capture),
    //you can see it uses its own class name to set up the logger.
    //so we use the same class name here to hijack the writes for that logger
    System.out.println(BatchProcessor.class.getName());
    Logger logger = Logger.getLogger(BatchProcessor.class.getName());
    Handler handler = new JekkuHandler();
    logger.addHandler(handler);

    //this runs forever, but the batch mode can throw an error if the network drops,
    //so disconnect the network in the middle of execution to test this
    while (true) {
      Point point1 = Point.measurement("cpu")
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("idle", 90L)
        .addField("user", 9L)
        .addField("system", 1L)
        .build();
      db.write(dbName, "autogen", point1);
    }
  }
}

You could probably quite easily configure a global java.util.logging handler that captures all logging written with java.util.logging this way. I did not need it myself, but the rough idea would be to attach the custom handler to the root logger.
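
If you do need it, something like this might be a starting point (just a sketch, I have not tested it):

import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

//attach the custom handler to the java.util.logging root logger so everything flows through it
Logger rootLogger = Logger.getLogger("");
rootLogger.addHandler(new JekkuHandler());
//make sure the root logger does not filter out lower levels before the handler sees them
rootLogger.setLevel(Level.ALL);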

In a similar way, you should be able to capture java.util.logging into any other log framework just by changing where the custom Handler writes the logs.

Well there you go. Was that as exciting for you as it was for me?

Building a (Finnish) Part of Speech Tagger

I wanted to try a part-of-speech (POS) tagger to see if it could help me with some of the natural language processing (NLP) problems I had. This was in Finnish, although support for other languages would be nice to have for the future. So off I went, (naively) hoping that there would be some nicely documented, black-box, open-source, free packages available. Preferably I was looking for one in Java, as I wanted to try using it as part of some other Java code. But other (programming) languages might have worked as well, if usable as a service or something. Summary: there are a bunch of cool libs out there, you just need to learn POS tagging and some more NLP terms to train them first…

I remembered all the stuffs on Parsey McParseface, Syntaxnet and all those hyped Google things. It even advertises achieving 95% accuracy on Finnish POS tagging. How cool would that be. And it's all about deep learning, Tensorflow, Google engineers and all the other greatest and coolest stuff out there, right? OK, so all I need to do is go to their GitHub site, run some 10 steps of installing various random-sounding packages, mess up my OS configs with various Python versions, settings, and all the other stuff that makes Python so great (OK, let's not get upset, it's a great programming language for stuffs :)). Then I just need to check out the Syntaxnet git repo, run a build script for an hour or so, set up all sorts of weird stuff, and forget about a clean/clear API. OK, I pass, after messing with it for too long.

So. After trying that mess, I Googled, Googled, DuckDuckGo'd, and searched some more for alternatives better suited for me. OpenNLP seemed nice as it is an Apache project, and those have generally worked fine for me. There are a number of different models for it at SourceForge. Some of them are even POS tagger models. Many nice languages there. But no Finnish. Now, there is an option to train your own model. Which seems to require some oddly formatted, pre-tagged text sets to train on. I guess that just means POS tagging is generally seen as a supervised learning problem. Which is fine, it's just that if you are not deep in the NLP/POS tagging community, these syntaxes do look a bit odd. And I just wanted a working POS tagger, not a problem of trying to figure out what all these weird syntaxes are, or a problem of setting up a project on Mechanical Turk or whatever to get some tagged sentences in various languages.

What else? There is a nice-looking POS tagger from the Stanford NLP group. It also comes with out-of-the-box models for a few languages. Again, no Finnish there either, but a few European ones. Promising. After downloading it, I managed to get it to POS tag some English sentences and even do lemmatization for me (finding the dictionary base form of the word, if I interpret that term correctly). Cool, certainly useful for any future parsing and other NLP tasks for English. They also provide some instructions for training it for new languages.

This training again requires the same pre-annotated set of training data with POS tags. Seeing a pattern here.. See, even I can figure it out sometimes. So there is actually a post on the internets where someone describes building a Swedish POS tagger using the Stanford tagger. And another one instructing people (in the comments) to download the tagger code and read it to understand how to configure it. OK, not going to do that. I just wanted a POS tagger, not an excursion into some large code base to figure out some random-looking parameters that require a degree in NLP to understand. But hey, Sweden is right next to Finland, maybe I can use the configuration from the Swedish tagger to train my own Finnish POS tagger? What a leap of logic I have there..

I downloaded the Swedish .props file for the Stanford tagger, and now just needed the data. Which, BTW, I also needed for all the others, so I might as well have gone with OpenNLP and tried that too, but who would remember that anymore at this point.. The Swedish tagger post mentioned using some form of Swedish Treebank data. So is there a similar Finnish TreeBank? I remembered hearing that term. Sure there is. So I downloaded that. Unpack the 600MB zip to get a 3.8GB text file for training. The ftb3.1.conllx file. Too large to open in most text editors. More/less to the rescue.

But hey, this is sort of like big data, which this should be all about, right? Maybe the Swedish .props file just works with it, after all, both are Treebanks (whatever that means)? The Swedish Treebank site mentions having a specific version for the Stanford parser built by some Swedish treebank visitor at Googleplex. Not so for Finnish.

Just try it. Of course the Swedish .props file won't work with the Finnish TreeBank data. So I built a Python script to parse it and format it more like the Swedish version. Words one per line, sentences separated by linefeeds. The tag sets seem to differ across the various files around, but I have no idea how to map them to each other, so I just leave them as they are and hope the Stanford people have it covered. (Looking at it later, I believe they all treat it as a supervised learning problem with whatever target tags you give.)

I tried the transformed file with the Stanford POS tagger. My Python script tells me the file has about 4.4 million sentences, with about 76 million words or something like that. I give the tagger JVM 32GB of memory and see if it can handle it. No. Out of memory error. Oh dear. It's all I had. After a few minor modifications to the .props file, I make the training data set smaller until finally, at 1M sentences, the tagger finishes training.

Meaning the program runs through and prints nothing (no errors, but nothing else either). There is a model file generated that I can use for tagging. But I have no idea if it is any good or not, or how badly I just trained it. Most of the training parameters have a one-line description in the Javadoc, which isn't hugely helpful (for me). Somehow I am not too confident I managed to do it very well. Later, as I did various splits on the FinnTreeBank data for my customized Java tagger and the OpenNLP tagger, I also tried this one with the 1.4M sentence test set. I got about 82% accuracy, which seems pretty poor considering everything else I talk about in the following. So I am guessing my configuration must have been really off, since otherwise people have reported very good results with it. Oh well, maybe someone can throw me a better config file?

This is what running the Stanford tagger on the 1M sentence set looked like on my resource graphs:

[Figure stanford1m: resource usage graphs while training the Stanford tagger on the 1M sentence set]

So it mostly runs on a single core and uses about 20GB of RAM for the 1M sentence file. But obviously I did not get it to give me good results, so what other options do I have?

During my Googling and stuff I also ran into a post describing writing a custom POS tagger in 200 lines of Python. Sounds great, even I should be able to get 200 lines of Python, right? I translated that to Java to try it out on my data. Maybe I will call my port “LittlePOS”. Make of that what you will :). At least now I can finally figure out what the input to it should be and how to provide it, since I wrote (or translated) the code, eh?

Just to quickly recap what (I think) this does.

  • Normalize all words: lowercase them, change year numbers to “!YEAR” and other numbers to “!DIGIT”.
  • Collect statistics for each word: how often different POS tags appear for it. A threshold of 97% is used to mark a word as “unambiguous”, meaning it is always tagged with a specific tag if it has that tag 97% or more of the time in the training data. The word also needs to occur some minimum number of times (here it was 20).
  • Build a set of features for each word being tagged. These are used by the “machine learning” part to learn to identify the POS tag for a word. In this case the features used were:
    • Suffix of word being tagged. So its last 3 letters in this case.
    • Prefix of word being tagged. Its first letter in this case.
    • Previous tag. The tag assigned to previous word in sentence.
    • 2nd previous tag. The tag assigned to the word before the previous word :).
    • Combination of the previous and previous-previous tags. So previous tag-pair.
    • The word being tagged itself.
    • Previous tag and current-word pair.
    • Previous word in sentence.
    • Suffix of previous word, its 3 last letters.
    • Previous-previous word. So back two spots in the sentence where we are tagging.
    • Next word in sentence.
    • Suffix of next word. Its 3 last letters.
    • Next-next word in sentence. So the next word after the next word. To account for the start and end of a sentence, the sentence word array is always initialized with START1, START2 and END1, END2 “synthetic words”. So these features also work even if there is no real previous or next word in the sentence. Also, word can be anything, including punctuation marks.
  • Each combination of feature and tag is given a weight. These weights are used to predict what POS tag a word should get, based on its features in the sentence.
  • If, in training, a word is predicted a wrong tag based on its features, the weights of those features for the wrong tag are reduced by 1 each, and the weights of those features for the correct tag are increased by 1 each (see the sketch after this list).
  • If the tag was correctly predicted, the weights stay the same.
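
To make the weight idea concrete, here is a minimal sketch of the predict-and-update core in Java. This is not the actual LittlePOS code, just an illustration; feature extraction is assumed to produce plain strings such as "suffix=nen" or "prevtag=NOUN".

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TinyPerceptronTagger {
  //weights.get(feature).get(tag) = how strongly that feature votes for that tag
  private final Map<String, Map<String, Double>> weights = new HashMap<>();

  //predict the tag whose feature weights sum up to the highest score
  public String predict(List<String> features, Set<String> tags) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String tag : tags) {
      double score = 0;
      for (String feature : features) {
        score += weights.getOrDefault(feature, Collections.emptyMap()).getOrDefault(tag, 0.0);
      }
      if (score > bestScore) {
        bestScore = score;
        best = tag;
      }
    }
    return best;
  }

  //wrong prediction: reward the features for the correct tag, punish them for the predicted tag
  public void update(List<String> features, String predicted, String correct) {
    if (correct.equals(predicted)) return;
    for (String feature : features) {
      Map<String, Double> tagWeights = weights.computeIfAbsent(feature, f -> new HashMap<>());
      tagWeights.merge(correct, 1.0, Double::sum);
      if (predicted != null) tagWeights.merge(predicted, -1.0, Double::sum);
    }
  }
}

On top of this sit the unambiguous-word shortcut and the feature extraction described in the list above.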

Getting this basic idea also helps me understand the other taggers and their parameters a bit better. I think this is roughly what the “arch” parameter in the Stanford tagger props file defines, and maybe where my configuration needs a better fix? I believe this feature setup must be one of the parts of POS tagging with the most diverse set of possibilities as well.. Back to the Stanford tagger: it also seemed a bit slow, at 50ms average tagging time per sentence, compared to the other ones I discuss in the following. Not sure what I did wrong there. But back to my Python-to-Java porting.

I updated my Python parser for the FinnTreeBank to produce just a file with the word and POS tag extracted, and fed that to LittlePOS. This still ran out of memory on the 4.4M sentences with a 32GB JVM heap. But not in the training phase, only when I finally tried to save the model as a Protocol Buffers binary file. The model in memory seems to get pretty big, so I guess the protobuf generator also ran out of resources when trying to build a 600MB file with most of the memory already allocated for the tagger training data.

In the resources graph this is what it looks like for the full 4.4M sentences:

[Figure protobuf5m_lowuse: resource usage graphs for training on the full 4.4M sentence set]

The part on the right, where the “system load” is higher and the “CPU” part bounces wildly, is where the protobuf is being generated. The part on the left before that is where the actual POS tagger training takes place. So the protobuf generation was actually running for quite a long time; my guess is the JVM memory was low and way too much garbage collection etc. was happening. Maybe it would have finished after a few more hours, but I called it a no-go and stopped it.

3M sentences finishes training fine. I use the remaining 1.4M for testing the accuracy. Meaning I use the trained tagger to predict tags for those 1.4M sentences and count how many words it tagged right in all of them. This gives me about 96.1% accuracy with the trained tagger. Awesome, now I have a working tagger??

The resulting model for the 3M sentence training set, when saved as a protobuf binary, is about 600MB. Seems rather large. Probably why it was failing to write it with the full 4.4M sentences. A smaller model might be useful to make it more usable in a smaller cloud VM or something (I am poor, and cloud is expensive for bigger resources..). So I also tried training it on sets of 100k to 1M sentences in 100k increments, as well as on 2M sentences; the 3M model I already had from before. Results for LittlePOS are shown in the table below:

Sentences   Words correct   Accuracy   PB Size   Time/1
100k        21988662        88.7%      90MB      4.5ms
200k        22490881        90.7%      153MB     4.1ms
300k        22608641        91.2%      195MB     3.9ms
400k        22779163        91.9%      233MB     3.8ms
500k        22911452        92.4%      268MB     3.7ms
600k        23033403        92.9%      304MB     3.5ms
700k        23095784        93.1%      337MB     3.7ms
800k        23149286        93.4%      366MB     3.5ms
900k        23169125        93.4%      390MB     3.2ms
1M          23167721        93.4%      378MB     3.3ms
2M          23520297        94.8%      651MB     3.0ms
3M          23843609        96.2%      890MB     2.0ms
1M_2        23105112        93.2%      467MB     ms
3M_0a       20859104        84.1%      651MB     1.7ms
3M_0b       22493702        90.7%      651MB     1.7ms

Here

  • Sentences is the number of sentences in the dataset.
  • Correct is the number of words correctly predicted. The total number of words is always 24798043 as all tests were run against the last 1.4M sentences (ones left over after taking the 3M training set).
  • Accuracy is the percentage of all predictions that it got right.
  • PB Size is the size of the model as a Protocol Buffers binary after saving to disk.
  • Time/1 is the time the tagger took on average to tag a sentence.

The line with 1M_2 shows an updated case where I changed the training algorithm to run for 50 iterations instead of the 10 it had been set to in the Python script. Why 50? Because the Stanford and OpenNLP taggers seem to use a default of 100 iterations, and I wanted to see what difference increasing the iteration count makes. Why not 100? Because I started by training the 3M model for 100 iterations and, looking at it, calculated it would take a few days to run. The others were much faster, so plenty of room for optimization there. I then just ran it for 1M sentences and 50 iterations, as that gives an indication of the improvement just as well.

So, the improvement seems pretty much zero. In fact, the accuracy seems to have gone slightly down. Oh well. I am sure I did something wrong again. It is also possible to track the number of correctly predicted tags over the added iterations during training. The figure below illustrates this:

[Figure: training-set tagging accuracy over the training iterations]

This figure shows how much of the training set the tagger got right during the training iterations. So maybe the improvement in later iterations is not that big due to the scale but it is still improving. Unfortunately, in this case, this did not seem to have a positive impact on the test set. There are also a few other points of interest in the table.

Back to the results table. The line with 3M_0a shows a case where all the features were ignored. That is, only the “unambiguous” words were tagged there. This alone already gives a result of 84.1%. The most frequent tag among the remaining untagged words is “noun”. So tagging all the remaining 15.9% as nouns gives the score in 3M_0b. In other words, if you take all the words that seem to clearly have only one tag, give them that tag, and tag all the remaining ones as nouns, you get about 90.7% accuracy. I guess that would be the baseline to compare against.. This score is without any fancy machine learning stuffs. Looking at this, the low score I got for training the Stanford POS tagger was really bad, and I really need that “for dummies” guide to properly configure it.

But wait, now that I have some tagged input data and Python scripts to transform it into different formats, maybe I could just modify these scripts to give me OpenNLP-compliant input data? Brilliant, let's try that. At least OpenNLP has default parameters and seems more suited for dummies like me. So, on to transforming my FinnTreeBank data into the OpenNLP input format with another Python script and running my experiments. Results below.

Sentences   Words correct   Accuracy   PB Size   Time/1
100k        22247182        89.7%      4.5MB     7.5ms
200k        22680369        91.5%      7.8MB     7.6ms
300k        22861728        92.2%      10.4MB    7.7ms
400k        22994242        92.7%      12.8MB    7.8ms
500k        23114140        93.2%      14.8MB    7.8ms
600k        23199457        93.6%      17.1MB    7.9ms
700k        23235264        93.7%      19.2MB    7.9ms
800k        23298257        94.0%      21.1MB    7.9ms
900k        23324804        94.1%      22.8MB    7.9ms
1M          23398837        94.4%      24.5MB    8.0ms
2M          23764711        95.8%      39.9MB    8.0ms
3M          24337552        98.1%      55.9MB    8.1ms
(4M)        24528432        98.9%      69MB      9.6ms
4M_2        6959169         98.5%      69MB      9.7ms
(4.4M)      24567908        99.1%      73.5MB    9.6ms

There are some special cases here:

  • (4M): This one mixed training and test data, training on the first 4M of the 4.4M sentences and then testing on the last 1.4M of the same 4.4M. I believe in machine learning you are not supposed to test with the training data, or the results will look too good and not indicate any real-world performance. Had to do it anyway, didn’t I 🙂
  • (4.4M): This one used the full 4.4M sentences to train and then tested on a 1.4M subset of the same set. So it's a broken test again, mixing training data and test data.
  • 4M_2: For the evaluation, this one used the sentences remaining after taking out the 4M training sentences. Since the total is about 4.4M (actually more like 4.36M), the test set here was only about 360k sentences, as opposed to the others where it was 1.4M (or 1.36M to be more accurate). But it is not mixing training and test data any more, which is probably why it is slightly lower. But still an improvement, so might as well train on the whole set at the end. The number of test tags here is 7066894, as opposed to the 24798043 in the 1.4M sentence test set.

And the resource use for training at 4M file size:

[Figure opennlp4m: resource usage graphs for OpenNLP training on the 4M sentence set]

So my 32GB of RAM is plenty, and as usual it is a single core implementation..

Next I should maybe look at putting this up as some service to call over the network. Some of these taggers actually already have support for it but anyway..

A few more points I collected on the way:

For the bigger datasets it is obviously easy to run out of memory. Looking at the code for the custom tagger trainer and the full 4.4M sentence training data, I figure I could scale this pretty high in terms of sentences processed just by storing the sentences in a document database instead of keeping them all in memory at once. ElasticSearch would probably do just fine, as I have been using it for other stuff as well. Then read the sentences from the database into memory as needed. The main reason the algorithm seems to need to keep the sentences in memory is to shuffle them randomly for new training iterations. I could just shuffle the index numbers of the sentences stored in the DB and read smaller batches into memory for training. But I guess I am fine with my tagger for now. Similarly, the algorithm uses just a single core in training for now, but it could quite easily be parallelized to process each sentence separately, making it “trivially parallel”. I would need to test the impact on accuracy though. Memory use could probably go lower with various optimizations, such as hashing the keys. Plenty of optimizations are probably possible for both CPU and memory, but maybe I will just use OpenNLP and let someone else worry about it :).

From the results of the different runs, there seems to be some consistency in LittlePOS running faster on bigger datasets and OpenNLP getting slightly slower. The Stanford tagger seems to be quite a bit slower at 50ms, but that could again be due to configuration or some other issues. OpenNLP seems to get better accuracy than my LittlePOS, and the model files are smaller. So the tradeoff in this case would be model size vs tagging speed. The tagging speed being faster with bigger datasets seems a bit odd, but maybe more of the words become “unambiguous” and thus can be handled with a simple map lookup?

Finally, in the hopes of trying the stuff out on a completely different dataset, I tried to download the Finnish datasets for Universal Dependencies and test against those. I got this idea as the Syntaxnet stats showed using these as the test and training sets. Figured maybe it would give better results across sets taken from different sources. Unfortunately Universal Dependencies had different tag sets from the FinnTreeBank I used for training, and I ran out of motivation trying to map them together. Oh well, I just needed a POS tagger and I believe I now know enough on the topic and have a good enough starting point to look at the next steps..

But enough about that. Next, I think I will look at some more items in my NLP pipeline. Get back to that later…

Porting An Elasticsearch Java App to 5.X

Recently I was upgrading some stuff on my search app, which makes use of Elasticsearch. The good people at Elastic had been promoting ES 5.0 for a long time while it was in beta, and now that it was out of beta I figured I might as well upgrade to it. It did not turn out quite so simple. Here are some pointers from along the way.

There is a large set of breaking changes listed on their website. I only had to mess with a few, but there were some points not clearly explained there. My experiences:

A few basic notes:

  • “string” mapping is now “keyword” or “text”. This is rather straightforward, although it might take a few re-indexes.
  • “index” property in type mappings. The breaking changes list this as only supporting “true” or “false”, as opposed to “analyzed” etc. from before. But the old style of “analyzed” still seems to work (at least there is no error). Not sure if I should investigate more, but it keeps working for me.

A bit more complicated:

Accessing fields in the Java API. I used to be able to query specific fields with something like client.prepareSearch().addFields("a", "b") and get the results from a SearchHit object with hit.getFields(). The addFields() methods are not completely gone, but there is something called addStoredFields(). Which did not work on my old mappings; it just returned null for the fields.

So now I need to mark my fields either as “stored” in the mapping or use source filtering to get the values. I guess in 2.X it was implicitly using source filtering. And if I mark the fields as “stored” then the addStoredFields() methods start to work.

So what is the difference between using stored fields and source filtering? The ES docs seem to discourage setting “store” to true, but the tradeoff does not always seem so clear. My understanding is that stored fields require separate reads from disk per field, whereas source filtering loads the whole document source in one go and filters the requested fields from that. This can be good or bad: for example, if you have some very large content fields, loading the whole source just to get some metadata may cause high overhead. If not, using stored fields might add more overhead instead. So it depends, I guess.
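
Roughly, the two options look something like this in the Java API. This is a sketch; addStoredField / setFetchSource are my best recollection of the 5.x builder methods, so double-check them against the Javadocs of your exact client version:

//option 1: set "store": true on the field in the mapping, then ask for it explicitly
SearchResponse stored = client.prepareSearch("my_index")
    .addStoredField("doc_date")
    .get();
Object storedDate = stored.getHits().getAt(0).getFields().get("doc_date").getValue();

//option 2: leave the mapping alone and use source filtering
SearchResponse filtered = client.prepareSearch("my_index")
    .setFetchSource(new String[]{"doc_date"}, null)
    .get();
Object sourceDate = filtered.getHits().getAt(0).getSource().get("doc_date");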

I also guess this might be a good change as it makes the rationale for schema design more explicit.

Accessing dates in the Java API. Using the old approach of addFields(), I could access dates stored as epoch millisecond long values with just long time = fields.get("doc_date").value(). That does not work anymore, as apparently ES uses a different format on disk, and the source filtering just gives me the value as stored. I thought epoch long was how ES stored it on disk. Not sure if it ever was so, or just my assumption. Well, the docs say something in that direction, but it is a bit up to interpretation.

So to access the date as an epoch long, some conversion is needed now.
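
For example, something like this, continuing from a SearchHit as above (a sketch; it assumes the value comes back either as a number or as an ISO-8601 style date string, so check what your own mapping and fetch method actually return):

Object raw = hit.getSource().get("doc_date");
long epochMillis;
if (raw instanceof Number) {
  //the document source still contained the raw epoch value
  epochMillis = ((Number) raw).longValue();
} else {
  //otherwise assume an ISO-8601 style date string; adjust the parsing to your actual date format
  epochMillis = java.time.Instant.parse(raw.toString()).toEpochMilli();
}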

The plugin API has changed a lot. So if you depend on some custom plugin, you might be out of luck, or you have to port the plugin yourself. I ended up porting one myself. I found it helpful to be able to look at some examples on GitHub. The ES source tree has several, even if that direct link is just the Polish analyzer.

The security manager cannot be disabled. In ES 1.x it was not used. In 2.x there was an option to disable it. In 5.x that option is removed. So if you use a plugin that needs to access JNA or some other lib that is already loaded by ES, you have to do tricks. At least for the security policy, you either have to unpack the ES jar file, modify the policy in it, and repack it, or you have to modify the policy file of the JRE you use to run ES. That is, if your plugin needs special permissions, such as loading some specific native library.
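
The policy addition itself is just a standard Java policy grant, something along these lines. The permission shown here is only an example; the actual one depends on what your plugin needs:

grant {
  // example only: allow loading a specific native library
  permission java.lang.RuntimePermission "loadLibrary.mynativelib";
};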

That is all I remember for now.. In a few weeks I might not remember even this much, which is why I usually write these things down 🙂

Automating Deployment of my (Micro)Services

So I read somewhere on the internets that microservices are great and we should all be doing them. Being an overly enthusiastic geek, keen to try all sorts of new fads and see how they work, I just had to go and give it a try, of course. So I proceeded to split my project into various smallish parts, connect these using gRPC, and see how it all runs.

I used gRPC because I like the efficiency, documentation and simplicity of protobufs. And Google has too much of a reputation anyway. Unfortunately, the gRPC-generated Java code just feels weird and oddly bloated. I also had some concurrency issues, although this might be due to my lack of understanding, as it seems the docs are not that great outside of Google (where you can just ask the authors or friends..).

I split my service into 10 smaller ones, did some trials, and settled on merging down to 5 services. But how do I actually deploy this sensibly, versus previously just uploading a single directory? Then I remembered the next buzzword I keep hearing about, “Continuous Delivery”. Sweet, that must solve it for me, right?

Um, no. I must be missing something, as the CD terminology seems to come with mostly conceptual-level descriptions but few concrete examples of how to do it. Maybe Docker Hub or some other hype term. But I am still not on that boat, despite using various Docker images and building some myself. So what then? The most concrete reference I found seemed to be along the lines of “I has some scripts” etc. OK, whatever, so I start cooking up some scriptz. In Python.

Python ConfigParser seemed suitable. So I created a configuration file like this:

[service1]
ip=192.168.56.101
dst_dir=/home/randomguy/s/service1
src_dir=../service1
jar_prefix=s-service1
properties=s-service1.properties

[service2]
ip=192.168.56.102
dst_dir=/home/randomguy/s/service2
src_dir=../service2
jar_prefix=s-service2
properties=s-service2.properties

Read it with Python:

import configparser

config = configparser.ConfigParser()
config.read(filename)

service1_ip = config['service1']['ip']
service1_dst_dir = config['service1']['dst_dir']
service1_src_dir = config['service1']['src_dir']
service1_jar_prefix = config['service1']['jar_prefix']
service1_properties = config['service1']['properties']

Doing this for all services gives the information on how to upload everything.

With the paramiko Python package installed from pip, next we are off to create the target directory if it does not exist:

import paramiko

def mkdir_p(ssh, remote_directory):
    with paramiko.SFTPClient.from_transport(ssh.get_transport()) as sftp:
        dir_path = str()
        for dir_folder in remote_directory.split("/"):
            if dir_folder == "":
                continue
            dir_path += r"/{0}".format(dir_folder)
            try:
                sftp.listdir(dir_path)
            except IOError:
                sftp.mkdir(dir_path)

To upload a directory recursively:

import os
from scp import SCPClient

def upload_dir(ssh, localpath, remotepath, name):
    local_dirpath = os.path.join(localpath, name)
    mkdir_p(ssh, remotepath)
    with SCPClient(ssh.get_transport()) as scp:
        scp.put(local_dirpath, remotepath, recursive=True)

To upload a specific file:

def upload_file(ssh, localpath, remotepath):
    mkdir_p(ssh, remotepath)
    with SCPClient(ssh.get_transport()) as scp:
        scp.put(localpath, remotepath)

Using this information and these code snippets, it is quite easy to build custom scripts for uploading specific service data to specific services. Instead of posting too much code here, I turned it into something a bit more generic and put it on GitHub:
https://github.com/mukatee/scp-uploader

It is the best thing since sliced bread. Of course it is…

And now you can tell me how it is really supposed to be done, thank you 🙂

7 bytes of HTTPS on CloudFlare

I don’t do much front-end software engineering. If I can avoid it, I don’t really do it at all. I prefer the backends etc. But for my personal projects, no one else is going to do it. I figured I should move to HTTPS for some of the content. Using my lack of front-end work over the past years as an excuse, I then proceeded to make the silliest of mistakes.

I spent some 1-2+ hours searching the internetz for the reason why my website would only load 7 bytes of seemingly random data in the browser when loaded over HTTPS, and why most of the time those bytes would get downloaded as a file instead of anything showing in the browser. I was using a self-signed certificate for testing, so given that my first few attempts gave me a certificate warning and the later ones gave no warnings, just garbage data, I was feeling rather confused.

In the end it turned out to be because I was typing “localhost:8443/mysite” in the browser address bar, which of course causes the browser to load it as an HTTP page. 8443 is the port I was hosting the web server on, as opposed to the regular HTTPS port 443. Initially I must have put the https:// at the start, since I got the certificate warnings; later I somehow managed to leave it out.

So I guess the browser tried to load an HTTP page, got handed some part of an SSL handshake instead, failed the handshake, and was left with 7 bytes of seemingly random data that it downloaded for me. That’s my theory; I couldn’t quite figure out the handshake byte content with a quick internet search. Using https://localhost:8443 as the URL worked great. Maybe next time I forget this and need to search again, I will hit this blog entry and save myself some hours of frustration.

But what does CloudFlare have to do with any of this? Nothing much, really, for the part above. But I put my service behind CloudFlare and also needed to change it to serve HTTPS connections for part of the service.

I also needed a certificate that the browser would not complain about. Let’s Encrypt seems to be a nice service for doing this for free. But it also seems to be quite geared towards Apache and Nginx. Having neither of those, and running the service behind CloudFlare, made it a bit hard. With embedded Jetty, there was also no way I was going to enable directory listings (who does that, for security, anyway…) and it was unclear what I should put where to get it to work. Also, CloudFlare being in the way seemed to be an issue.

So CloudFlare has some nice options even at its “free” tier. “Full” SSL lets me use a self-signed certificate between the CloudFlare server and my own server, while having CloudFlare provide the browser client with a valid certificate from an authorized party. Nice for a poor little guy’s project like mine.

But it only works for sites such as “bob.sitename.com” and not, for example, “admin.bob.sitename.com”. So that has to be “adminbob.sitename.com”. It also seems to want to push the SSL connection on me even if I explicitly try to go to the address “http://adminbob.sitename.com:80”. So I am not sure if mixing HTTP and HTTPS for a single site is an option, at least on the free plan. Not a real problem, as I can just put things in different services and run some of them fully on HTTPS. But for redirects etc…

So sometimes, for me, the SSL never happens, and other times it happens when I try to avoid it. Maybe I should just learn this stuff better, but this security stuff could certainly be a bit simpler to get right.. These are not really that complex requirements (maybe the browser could try to identify an HTTPS connection and inform me, large hosting providers/fronts could make the basic stuff just work and give clear errors, and Let’s Encrypt could make it easy to run their clients regardless of the web server used, etc.). Well, it’s free stuff, so what am I complaining about.. 🙂

Locally Debugging Spigot Minecraft Server Plugins in IntelliJ

This time I had to debug some code for a Spigot server plugin. Spigot is a custom Minecraft server. You download the Spigot jar file, run it and it sets up a Minecraft server for you. You can then write custom plugins for the server, using the Spigot API, and drop them in the “plugins” folder that Spigot generates for you. These can be used to customize the game experience (assuming you know Minecraft is a game etc. here.. :))

From my Googling experience (and from asking some people who have developed plugins..), the typical way to develop these plugins seems to be to write the code in Eclipse (the plugins are Java .jar files), compile it, package it into a .jar file, copy it over to the server plugin folder, reload the server (causing the updated plugins to reload), and see what went wrong. Debugging is done with printouts from the plugin or using some weird remote debug setup in Eclipse. Seems like a slow and lengthy process to me.

Well, I am lazy, so I wanted to run/debug my plugin code straight from IntelliJ, which is the IDE I use. Here is how I set it up. I had trouble finding anything on this on the internetz, so maybe this helps someone, if anyone happens to look for the same information. Or maybe it just helps me the next time I have wiped my environment and come here looking..

First we set up the pom.xml file for Maven:

<!--Spigot API-->
<dependency>
  <groupId>org.spigotmc</groupId>
  <artifactId>spigot-api</artifactId>
  <version>1.9-R0.1-SNAPSHOT</version>
  <scope>provided</scope>
</dependency>
<!--Bukkit API-->
<dependency>
  <groupId>org.bukkit</groupId>
  <artifactId>bukkit</artifactId>
  <version>1.9-R0.1-SNAPSHOT</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.bukkit</groupId>
  <artifactId>craftbukkit</artifactId>
  <version>1.9-R0.1-SNAPSHOT</version>
  <!--version>${bukkit.version} </version-->
  <type>jar</type>
  <scope>compile</scope>
  <exclusions>
    <exclusion>
      <groupId>org.spigotmc</groupId>
      <artifactId>minecraft-server</artifactId>
    </exclusion>
  </exclusions>
</dependency>

Using this configuration, we use Maven to package a jar file for us. This seems well documented for Spigot, so let's just pretend this is done and all set up for our plugin project.
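
Just for illustration, the plugin main class itself can be as simple as the following. The package and class names here are made up, and the plugin.yml packaged into the jar needs to point to whatever main class you actually use:

package com.example.myplugin;

import org.bukkit.plugin.java.JavaPlugin;

public class MyPlugin extends JavaPlugin {
  @Override
  public void onEnable() {
    //called by Spigot when the plugin is loaded/enabled
    getLogger().info("MyPlugin enabled");
  }

  @Override
  public void onDisable() {
    //called by Spigot when the server shuts down or the plugin is reloaded
    getLogger().info("MyPlugin disabled");
  }
}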

We then copy the Spigot jar to our working dir. This is the Spigot jar you can download from the Spigot website to run the Spigot server. I created a folder “spigot-run” under my IntelliJ plugin project folder, put the spigot.jar file under it, and ran it from there, first from the command line to check that it generates the directory structure correctly (when you first run it, it creates the Minecraft server directory structure for you). Following this, add the spigot.jar file from “spigot-run” (or wherever you put it) to the module classpath in the IntelliJ project.

Now, to set up the run configuration in IntelliJ IDEA, we can look at the META-INF/MANIFEST.MF in the spigot.jar file. The main class (at the time of writing this) is:

org.bukkit.craftbukkit.Main

So set that up as the main class for the run configuration. Now set up the working dir for the same run configuration as “spigot-run” or whatever is the place where you installed the Spigot jar and had it create the server directory structures. Try running this and see that Spigot starts right and the IntelliJ console window shows the Spigot console. Sweet.

So, then, our plugin. Let's assume here that we have packaged it into a nice little jar file using Maven as described before. Now copy this plugin jar file to the “plugins” folder under the “spigot-run” folder (or whatever name you used). Start Spigot again from IntelliJ using the previously created run configuration. Spigot tries to load your plugin from the “plugins” folder and gives a ClassCastException or some similar error about PluginClassLoader not being used to load the class.

Now this is where you go Google it and find suggestions to use the remote debugger. But no, we can't be that pragmatic. So what is the cause of the error? The cause is that when we run Spigot from IntelliJ, where we also write the code for the plugin, IntelliJ loads the plugin code into memory using a different classloader than what Spigot needs. This happens before Spigot has a chance to load it from the jar file we put in the “plugins” folder. Spigot wants to use its own classloader, the PluginClassLoader, but IntelliJ already loaded the class, so Spigot is not happy. We can get around this by creating a run configuration for IntelliJ that does not load the plugin classes.

So to get this done, we create a new module in our IntelliJ plugin project. We call it “dummy” (or whatever you like). I put this module under the main plugin project working directory (the project top dir) in its own folder. I then let IntelliJ create an empty “src” folder for it. Basically this means I just clicked “next” until all was nice and shiny. This new module now has no code in it. So we just add the same spigot.jar (actually called spigot1.9.jar for me) file from under “spigot-run” to the “dummy” module classpath. We then take our old run configuration we already set up (that failed on PluginClassLoader error) and change this run configuration to use the “dummy” module classpath.

This causes IntelliJ not to load the plugin classes, since they are not in the “dummy” module classpath. However, Spigot finds them and correctly loads them from the jar we put in the “plugins” directory. Now you can run your plugin directly from IntelliJ. Just set up Maven to create the jar for you, and when the jar has been created, drag and drop it into the “spigot-run/plugins” directory and hit “run”. If you are a good boy/girl, you can even automate copying the jar with Maven (see the sketch below), so you just start the Maven build and hit run. Two buttons, which are generally needed in software engineering anyway (compile + run), so I am OK with that.
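
For the jar copying automation, something like the maven-antrun-plugin in the plugin pom should do it. This is a sketch, so check the plugin version and adjust the target path to wherever your spigot-run folder actually lives:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-antrun-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>run</goal>
      </goals>
      <configuration>
        <target>
          <!--copy the freshly built plugin jar into the local Spigot plugins folder-->
          <copy file="${project.build.directory}/${project.build.finalName}.jar"
                todir="${project.basedir}/spigot-run/plugins"/>
        </target>
      </configuration>
    </execution>
  </executions>
</plugin>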

But wait!? What about debugging the code? Well, it turns out you can now add debug breakpoints to your plugin code directly in IntelliJ. As you now start Spigot with the “dummy” run configuration, Spigot happily loads your plugin code from its “plugins” folder with its PluginClassLoader. However, as you started Spigot from IntelliJ, IntelliJ can still recognize that your plugin classes are loaded and executed, and lets you debug your code as normal. So when a breakpoint in your plugin is hit, you get your regular debug view in IntelliJ for the breakpoint. No remote debugger setups, no weird hassles (OK, actually several with dummy modules etc., but let's pretend we did good, OK?), no external copy or other complex procedures. Awesooomeeee!!! 🙂

As a final note, if you need some Craftbukkit classes (I did), run the buildtools for Spigot and it will install them in the local Maven repository. Something to do with developer arguments, DMCA requests and whatever. Hey, I am new to all this. 😮

That’s all, folks. It’s really simple: just set up a dummy module to trick the classloaders, copy a jar, and hit run/debug.

Harharhar said the Santa when capturing browser test metrics with Webdriver and LittleMobProxy

In the modern age of big data and all that, it is trendy to capture as much data as we can. So this is an attempt at capturing data on a set of web browsing tests I run with Selenium WebDriver. This is in Java, using Selenium WebDriver and LittleMobProxy.

What I do here is configure an instance of LittleMobProxy to capture the traffic to/from a webserver in our tests. The captured data is written to a HAR file (HTTP archive file), the HAR file is parsed and the contents are used for whatever you like. In this case I dump them to InfluxDB and show some graphs on the generated sessions using Grafana.

This can be useful to see how much bandwidth your website uses, how many requests end up being generated, what elements are slowest to load, how your server caching configurations affect everything, and so on. I have used it to provide data for overall network performance analysis by simulating a set of browsers and capturing the data on their sessions.

First up, start the LittleMobProxy instance, create a WebDriver instance, and configure the WebDriver instance to use the LittleMobProxy instance:

    // start the proxy
    BrowserMobProxy proxy = new BrowserMobProxyServer();
    proxy.start(0);
    // get the Selenium proxy object
    Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);

    // configure it as a desired capability
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability(CapabilityType.PROXY, seleniumProxy);

    driver = new ChromeDriver(capabilities);
    proxy.newHar("my_website_data");

Then we run some scripts on our website. The following is just a simple example, as I do not wish to post bots for browsing common websites here. I have prototyped some for browsing various videos on YouTube and for browsing different news portals. Generally this might be against the terms of service of public websites, so either use your own service that you are testing (with a test service instance) or download a Wikipedia dump or something similar to run your tests on. Example code:

  public List<WebElement> listArticles() {
    List<WebElement> elements = driver.findElements(By.className("news"));
    List<WebElement> invisibles = new ArrayList<>();
    for (WebElement element : elements) {
      if (!element.isDisplayed()) {
        System.out.println("Not displayed:"+element);
        invisibles.add(element);
      }
    }
    elements.removeAll(invisibles);
    List<WebElement> articles = new ArrayList<>();
    List<String> descs = new ArrayList<>();
    for (WebElement element : elements) {
      List<WebElement> links = element.findElements(By.tagName("a"));
      for (WebElement link : links) {
        //we only take long links as this website has some "features" in the content portal causing pointless short links.
        //This also removes the "share" buttons for facebook, twitter, etc. which we do not want to hit
        //A better alternative might be to avoid links leading out of the domain 
        //(if you can figure them out..)
        //this is likely true for the ads as well..
        if (link.getText().length() > 20) {
          descs.add(link.getText());
          articles.add(link);
        }
      }
    }
    return articles;
  }

  public void openRandomArticle() throws Exception {
    List<WebElement> articles = listArticles();
    //sometimes our randomized user might hit a seemingly dead end on the article tree,
    //in which case we just go to the news portal main page ("base" here)
    if (articles.size() == 0) {
      driver.get(base);
      Thread.sleep(2000);
      articles = listArticles();
    }
    //this is a random choice of the previously filtered article list
    WebElement link = TestUtils.oneOf(articles);
    Actions actions = new Actions(driver);
    actions.moveToElement(link);
    actions.click();
    actions.perform();

    Har har = proxy.getHar();
    har.writeTo(new File("my_website.har"));
    //if we just want to print it, we can do this..
    WebTestUtils.printHar(har);
    //or to drop stats in a database, do something like this
    WebTestUtils.influxHar(har);
  }

The code to access the data in the HAR file:

  public static void printHar(Har har) {
    HarLog log = har.getLog();
    List<HarPage> pages = log.getPages();
    for (HarPage page : pages) {
      String id = page.getId();
      String title = page.getTitle();
      Date started = page.getStartedDateTime();
      System.out.println("page: id=" + id + ", title=" + title + ", started=" + started);
    }
    List<HarEntry> entries = log.getEntries();
    for (HarEntry entry : entries) {
      String pageref = entry.getPageref();
      long time = entry.getTime();

      HarRequest request = entry.getRequest();
      long requestBodySize = request.getBodySize();
      long requestHeadersSize = request.getHeadersSize();
      String url = request.getUrl();

      HarResponse response = entry.getResponse();
      long responseBodySize = response.getBodySize();
      long responseHeadersSize = response.getHeadersSize();
      int status = response.getStatus();

      System.out.println("entry: pageref=" + pageref + ", time=" + time + ", reqBS=" + requestBodySize + ", reqHS=" + requestHeadersSize +
              ", resBS=" + responseBodySize + ", resHS=" + responseHeadersSize + ", status=" + status + ", url="+url);
    }
  }

So we can use this in many ways. Above I have just printed out some of the basic stats. Some example information available can be found on the internet, e.g. https://confluence.atlassian.com/display/KB/Generating+HAR+files+and+Analysing+Web+Requests. In the following I show some simple data from browsing a local newssite, visualized in Grafana using InfluxDB as a backend:

Here is some example code to write some of the HAR stats to InfluxDB:

  public static void influxHar(Har har) {
    HarLog harLog = har.getLog();
    List<HarPage> pages = harLog.getPages();
    for (HarPage page : pages) {
      String id = page.getId();
      String title = page.getTitle();
      Date started = page.getStartedDateTime();
      System.out.println("page: id=" + id + ", title=" + title + ", started=" + started);
    }
    List<HarEntry> entries = harLog.getEntries();
    long now = System.currentTimeMillis();
    int counter = 0;
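    //"index" is a field on this class (not shown in this snippet); it tracks how many HAR entries
    //have already been stored, so repeated calls only store the entries added since the last call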
    for (int i = index ; i < entries.size() ; i++) {
      HarEntry entry = entries.get(i);
      String pageref = entry.getPageref();
      long loadTime = entry.getTime();

      HarRequest request = entry.getRequest();
      if (request == null) {
        log.debug("Null request, skipping HAR entry");
        continue;
      }
      HarResponse response = entry.getResponse();
      if (response == null) {
        log.debug("Null response, skipping HAR entry");
        continue;
      }

      Map<String, Long> data = new HashMap<>();
      data.put("loadtime", loadTime);
      data.put("req_head", request.getHeadersSize());
      data.put("req_body", request.getBodySize());
      data.put("resp_head", response.getHeadersSize());
      data.put("resp_body", response.getBodySize());
      InFlux.store("browser_stat", now, data);
      counter++;
    }
    index += counter;
    System.out.println("count:"+counter);
  }

And the code to write the data into InfluxDB:

  private static InfluxDB db;

  static {
    if (Config.INFLUX_ENABLED) {
      db = InfluxDBFactory.connect(Config.INFLUX_URL, Config.INFLUX_USER, Config.INFLUX_PW);
      db.enableBatch(2000, 1, TimeUnit.SECONDS);
      db.createDatabase(Config.INFLUX_DB);
    }
  }

  public static void store(String name, long time, Map<String, Long> data) {
    if (!Config.INFLUX_ENABLED) return;
    Point.Builder builder = Point.measurement(name)
            .time(time, TimeUnit.MILLISECONDS)
            .tag("tom", name);
    for (String key : data.keySet()) {
      builder.field(key, data.get(key));
    }
    Point point = builder.build();
    log.debug("storing:"+point);
    //you should have enabled batch mode (as shown above) on the driver or this will bottleneck
    db.write(Config.INFLUX_DB, "default", point);
  }

And here are some example data visualized with Grafana for some metrics I collected this way:

[Figure: Grafana charts of HTTP requests generated per click and min/max/average response load times]

In the lower line chart, the number of elements loaded per click is shown. This refers to how many HTTP requests are generated per WebDriver click on the website. The upper chart shows the minimum, maximum and average load times for the different requests/responses. That is, how much time it took for the server to send back the HTTP responses for the clicks.

This shows how the first few page loads generated a high number of HTTP requests/responses. After this, the count goes down and stays quite steady at a lower level. I assume this is due to the browser having cached much of the static content and not needing to request it all the time. Occasionally our simulated random browser user enters a slightly less explored path on the webpage, causing a small short-term spike.

This nicely shows how modern websites end up generating surprisingly large numbers of requests. Also this shows how some requests are pretty slow to respond, so this might be a useful point to investigate for optimizing overall response time.

That’s it. Not too complicated, but I find it rather interesting. It also does not require too many modifications to existing WebDriver tests: just take the proxy component into use, parse the HAR file, and write the results to the database.

My Little TCP Tunnel

Do you ever wonder what happens when you try to use some driver or client library to access a networked service? Like sending queries to a database using a client driver, or using a client library provided by some PaaS provider such as AWS, and getting back a cryptic error message (or nothing), with no idea what really happened? Or how much data it is really generating (bandwidth is usually not unlimited and free..)?

Anyway, I do. Most of the time, when I try to look at what happens at the protocol level to debug an issue, I end up on answers where people say to use Fiddler, Wireshark, or some other overly complex solution. Not to mention all the issues with Fiddler requiring a .NET install (Linux/OSX/Mono and all their problems, no thanks) and being overly complex for my simple needs, or Wireshark having trouble capturing localhost traffic, or me having problems finding the right filters etc. for my stuff.

I did find a nice TCP tunnel tool on SourceForge, but it was abandoned long ago and the source code is gone. As most small but useful open source projects seem to go: people grow up and figure out there is more to life than writing free software for others while still needing to pay their own bills (which is hard with “free software”). So anyway, after looking around too much and having to debug these issues too many times, I finally wrote my own. It can be found at http://www.github.com/mukatee/java-tcp-tunnel.

How does it work? Instead of sending my network requests directly to the service, I start up my TCP tunnel and configure it to forward everything received on a given port to another TCP address. Data received in response is of course written back to the party that initiated the connection. So it is practically a man-in-the-middle attack on myself. Along the way, the tunnel writes the data it sees to the console or to a file, which gives the debugging information we want.
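
Under the hood the idea is nothing more than a server socket and two threads copying bytes between two connections. Something like this minimal sketch (not the actual project code, which also handles the logging, configuration and cleanup):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class MiniTunnel {
  public static void main(String[] args) throws Exception {
    int listenPort = 5566;
    String remoteHost = "www.github.com";
    int remotePort = 80;
    try (ServerSocket server = new ServerSocket(listenPort)) {
      while (true) {
        Socket client = server.accept();
        Socket remote = new Socket(remoteHost, remotePort);
        //one thread copies client->remote ("up"), another copies remote->client ("down")
        new Thread(() -> copy(client, remote)).start();
        new Thread(() -> copy(remote, client)).start();
      }
    }
  }

  private static void copy(Socket from, Socket to) {
    try (InputStream in = from.getInputStream(); OutputStream out = to.getOutputStream()) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) {
        //this is where the real tunnel would also log the bytes it sees
        out.write(buffer, 0, n);
        out.flush();
      }
    } catch (Exception e) {
      //connection closed, nothing more to copy
    }
  }
}

The real tool adds the logging/observer setup and command line parsing on top of this, but the data path is just this byte copying.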

So, to give a few examples (some also listed on the project page at https://github.com/mukatee/java-tcp-tunnel):

Start the tunnel, listening for connections on port 5566 and forwarding any requests on that port to port 80 at www.github.com:

java -jar tcptunnel-1.0.0.jar 5566 www.github.com 80

To request the contents of http://www.github.com (port 80) using curl and get the full HTTP request-response pair logged, we can do:

curl localhost:5566 --header 'Host: www.github.com'

We are now sending the HTTP request to port 5566 on localhost, where the TCP tunnel captures it, logs it, and forwards it to port 80 on www.github.com. The tunnel then waits for any response from the remote end (here the server at www.github.com). When a response is received, the tunnel logs it and forwards the data as-is back to the curl client that sent the request to port 5566. From the client (curl) viewpoint nothing changed, so the client is not modified in any way. Except that we had to explicitly specify the HTTP Host header in this case with the given parameter.

The output from running the above commands is:

TCP Forwarding 0:0:0:0:0:0:0:1:62020 <--> 192.30.252.128:80
up:
GET / HTTP/1.1
Host: www.github.com
User-Agent: curl/7.43.0
Accept: */*

down:
HTTP/1.1 301 Moved Permanently
Content-length: 0
Location: https://www.github.com/
Connection: close

TCP Forwarding 0:0:0:0:0:0:0:1:62020 <--> 192.30.252.128:80 stopped.

This is the printout from the TCP tunnel. I started it with the console logger so this is what gets logged (see the github project page for more on configuration).

The “up” part refers to the data sent upstream (in this case towards www.github.com). This shows all the actual headers and all of the HTTP protocol that was really generated by curl.

The “down” part refers to the response data sent back by the server at www.github.com. In this case it is a redirect to the HTTPS version of the site. And no, I have not implemented support for SSL, certificates, etc., because I have not needed it so far in debugging the REST etc. services I use in development. If I need it, I will implement it. If you implement it, I am happy to get contributions.

So for something different, example number 2. Now I want to send some commands (over HTTP) to an ElasticSearch instance in my local development environment:

I have an index (ElasticSearch term for a database) named ‘my_index1’ created at the ElasticSearch instance running at 192.168.1.100. If I run the standard HTTP DELETE request on it using curl I get only a brief acknowledgement:

curl -XDELETE 'http://192.168.1.100:9200/my_index1/'

gives back

{"acknowledged":true}

What really happened?

I start the TCP tunnel as before:

java -jar tcptunnel-1.0.0.jar 5566 192.168.1.100 9200

Then I send the same command over the tunnel:

curl -XDELETE 'http://localhost:5566/my_index1/'

The tunnel output is similar to before:

TCP Forwarding 0:0:0:0:0:0:0:1:62646 <--> 192.168.1.100:9200
up:
DELETE /my_index1/ HTTP/1.1
Host: localhost:5566
User-Agent: curl/7.43.0
Accept: */*

down:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 21

{"acknowledged":true}
TCP Forwarding 0:0:0:0:0:0:0:1:62646 <--> 192.168.1.100:9200 stopped.

In the console where I ran curl, I get the same response as without the tunnel:

{"acknowledged":true}

As we can see here, curl just prints the response body, while with the tunnel we see all the data that was passed. Not really super-exciting, as this is pretty basic HTTP getting passed around. But at least we do not need to wonder whether that was really what happened, as we see it directly in the tunnel log. For some other types of communication this can be much more interesting.

For example, using the ElasticSearch Java library we can run commands and queries through the ElasticSearch Java API. In this case it is much less clear what really gets sent over the protocol. I find it especially annoying to debug, as the whole ElasticSearch API is well documented in terms of the HTTP requests that are sent, but rather poorly documented in terms of the language-specific APIs. Capturing this is done the same way as before (and works for any programming language, as the tunnel is just TCP). Some examples:

This time I start the tunnel programmatically:

public static void main(String[] args) {
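  //listen for connections on port 5566 and forward them to the ElasticSearch Java (binary) transport port at 192.168.1.100:9300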
  Params params = new Params(5566, "192.168.1.100", 9300);
  params.enableStringConsoleLogger();
  Main main = new Main(params);
  main.start();
}

I then use the ElasticSearch Java API to perform commands/queries on it. First to index a few documents:

    Client client = TransportClient.builder().build()
                     .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 5566));
    Map<String, Object> map = new HashMap<>();
    map.put("message", "hello");
    client.prepareIndex("my_index1", "msg_doc", "1").setSource(map).get();

    map.put("message", "world");
    client.prepareIndex("my_index1", "msg_doc", "2").setSource(map).get();

Then to perform a boolean query:

    Client client = TransportClient.builder().build()
            .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 5566));
    BoolQueryBuilder qb = QueryBuilders.boolQuery();
    SearchResponse scrollResp = client.prepareSearch()
            .setIndices("my_index1")
            .setTypes("msg_doc")
            .setSearchType(SearchType.DFS_QUERY_AND_FETCH)
            .setScroll(new TimeValue(60000))
            .setQuery(qb)
            .setSize(10).execute().actionGet();

Note that above, I have configured the ElasticSearch server address as “localhost:5566”, which is not where ElasticSearch actually runs; it is where the tunnel is listening. The tunnel captures the connections made to it and forwards the data to the actual ElasticSearch at 192.168.1.100:9300.

Unfortunately, these will only print “gibberish” to the console when the tunnel logs them. Well, it is not entirely gibberish, as the data can be seen in between the binary characters. But it is not the HTTP request data I was looking for. This is because ElasticSearch is written in Java, and the official Java clients use the same codebase and the more optimized internal binary protocol. For debugging an ES client on any other platform this works as-is, since those use the HTTP protocol. However, we can achieve the same on the Java side by using an “unofficial” client such as JEST from https://github.com/searchbox-io/Jest.

To try this:

First, change the tunnel to connect to port 9200, which is where ElasticSearch is listening for the HTTP requests:

  public static void main(String[] args) {
    Params params = new Params(5566, "192.168.1.100", 9200);
    params.enableStringConsoleLogger();
    Main main = new Main(params);
    main.start();
  }

Then run some commands using JEST over the tunnel:

  public static void main(String[] args) throws Exception {
    // Construct a new Jest client according to configuration via factory
    JestClientFactory factory = new JestClientFactory();
    factory.setHttpClientConfig(new HttpClientConfig
            .Builder("http://localhost:5566")
            .multiThreaded(true)
            .build());
    JestClient client = factory.getObject();
    client.execute(new CreateIndex.Builder("my_index1").build());
  }

The log from the tunnel is:

TCP Forwarding 127.0.0.1:64562 <--> 192.168.1.100:9200
up:
PUT /my_index1 HTTP/1.1
Content-Length: 2
Content-Type: application/json; charset=UTF-8
Host: localhost:5566
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.1 (Java/1.8.0_60)
Accept-Encoding: gzip,deflate

{}
down:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 21

{"acknowledged":true}
TCP Forwarding 127.0.0.1:64562 <--> 192.168.1.100:9200 stopped.

Which, as we see is now using the HTTP API. To index a few docs:

    Map<String, String> source = new HashMap<>();
    source.put("msg", "hello");
    Index index = new Index.Builder(source).index("my_index1").type("msg_doc").build();
    client.execute(index);

    source.put("msg", "world");
    index = new Index.Builder(source).index("my_index1").type("msg_doc").build();
    client.execute(index);

And the log again:

TCP Forwarding 127.0.0.1:64701 <--> 192.168.1.100:9200
up:
POST /my_index1/msg_doc HTTP/1.1
Content-Length: 15
Content-Type: application/json; charset=UTF-8
Host: localhost:5566
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.1 (Java/1.8.0_60)
Accept-Encoding: gzip,deflate

{"msg":"hello"}

down:
HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Content-Length: 145

{"_index":"my_index1","_type":"msg_doc","_id":"AVHgELHqpIrLtHSr6-gg","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}

up:
POST /my_index1/msg_doc HTTP/1.1
Content-Length: 15
Content-Type: application/json; charset=UTF-8
Host: localhost:5566
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.1 (Java/1.8.0_60)
Accept-Encoding: gzip,deflate

{"msg":"world"}

down:
HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Content-Length: 145

{"_index":"my_index1","_type":"msg_doc","_id":"AVHgELIMpIrLtHSr6-gh","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}

TCP Forwarding 127.0.0.1:64701 <--> 192.168.1.100:9200 stopped.

Then to do some queries. First I did a “broken” query in a hurry. Here I accidentally put the doc type into the index list as well (there is no index named “msg_doc”, hence the error):

    BoolQueryBuilder qb = QueryBuilders.boolQuery();
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(qb);

    Search search = new Search.Builder(searchSourceBuilder.toString())
            // multiple index or types can be added.
            .addIndex("my_index1")
            .addIndex("msg_doc")
            .build();
    SearchResult result = client.execute(search);

We can see the error in the HTTP response below. Running the code above actually passes with no error, exception or anything. Which is fine, I am sure, as we are probably expected to check the result object properties to get the error. Nothing wrong with that. But if we want to see what ElasticSearch actually returned, we can. The log from the tunnel:

TCP Forwarding 127.0.0.1:64916 <--> 192.168.1.100:9200
up:
POST /my_index1%2Cmsg_doc/_search HTTP/1.1
Content-Length: 38
Content-Type: application/json; charset=UTF-8
Host: localhost:5566
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.1 (Java/1.8.0_60)
Accept-Encoding: gzip,deflate

{
  "query" : {
    "bool" : { }
  }
}

down:
HTTP/1.1 404 Not Found
es.resource.type: index_or_alias
es.resource.id: msg_doc
es.index: msg_doc
Content-Type: application/json; charset=UTF-8
Content-Length: 311

{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"msg_doc","index":"msg_doc"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"msg_doc","index":"msg_doc"},"status":404}

TCP Forwarding 127.0.0.1:64916 <--> 192.168.1.100:9200 stopped.

Here, the error is nicely visible. Too many libraries do a poor job of reporting the actual error details, so I have found this to be very helpful.
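
As a side note, the error could of course also be checked from the result object in code. Something like the following should do it, if I remember the JEST result API (isSucceeded/getErrorMessage) correctly; treat it as a sketch rather than gospel:

    if (!result.isSucceeded()) {
      //prints the error string JEST extracted from the response, e.g. the index_not_found_exception above
      System.err.println("search failed: " + result.getErrorMessage());
    }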

Fix the error and re-run the query:

    BoolQueryBuilder qb = QueryBuilders.boolQuery();
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(qb);

    Search search = new Search.Builder(searchSourceBuilder.toString())
            // multiple index or types can be added.
            .addIndex("my_index1")
            .build();
    SearchResult result = client.execute(search);

Now we get the actual results:

TCP Forwarding 127.0.0.1:64839 <--> 192.168.1.100:9200
up:
POST /my_index1/_search HTTP/1.1
Content-Length: 38
Content-Type: application/json; charset=UTF-8
Host: localhost:5566
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.1 (Java/1.8.0_60)
Accept-Encoding: gzip,deflate

{
  "query" : {
    "bool" : { }
  }
}

down:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 339

{"took":25,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index1","_type":"msg_doc","_id":"AVHgELIMpIrLtHSr6-gh","_score":1.0,"_source":{"msg":"world"}},{"_index":"my_index1","_type":"msg_doc","_id":"AVHgELHqpIrLtHSr6-gg","_score":1.0,"_source":{"msg":"hello"}}]}}

TCP Forwarding 127.0.0.1:64839 <--> 192.168.1.100:9200 stopped.

Notice that so far the upstream “query” has been pretty much empty. Not so informative, as I wanted to use this to see how the queries from the Java API map to the better documented HTTP requests. Note that I am using the actual query builder objects from the ElasticSearch API here with JEST, so the information should map directly to the “official” ElasticSearch API as well. So, let's give the query some content to get more information:

    BoolQueryBuilder qb = QueryBuilders.boolQuery();
    qb.must(QueryBuilders.termQuery("msg_doc", "hello"));
    qb.should(QueryBuilders.termQuery("msg_doc", "world"));
    qb.mustNot(QueryBuilders.termQuery("msg_doc", "bob"));
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(qb);

    Search search = new Search.Builder(searchSourceBuilder.toString())
            // multiple index or types can be added.
            .addIndex("my_index1")
            .build();
    SearchResult result = client.execute(search);

And now we can look at the TCP tunnel log to see how this part of the Java API actually maps to the better documented HTTP API:

TCP Forwarding 127.0.0.1:65189 <--> 192.168.1.100:9200
up:
POST /my_index1/_search HTTP/1.1
Content-Length: 300
Content-Type: application/json; charset=UTF-8
Host: localhost:5566
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.1 (Java/1.8.0_60)
Accept-Encoding: gzip,deflate

{
  "query" : {
    "bool" : {
      "must" : {
        "term" : {
          "msg_doc" : "hello"
        }
      },
      "must_not" : {
        "term" : {
          "msg_doc" : "bob"
        }
      },
      "should" : {
        "term" : {
          "msg_doc" : "world"
        }
      }
    }
  }
}

down:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 122

{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

TCP Forwarding 127.0.0.1:65189 <--> 192.168.1.100:9200 stopped.

Nice. But wait, something went wrong with that. What? Well, I put the doc type into the term field name, so nothing can be matched. Which we can also see from the log above.

Fix this and try again:

    BoolQueryBuilder qb = QueryBuilders.boolQuery();
    qb.must(QueryBuilders.termQuery("msg", "hello"));
    qb.should(QueryBuilders.termQuery("msg", "world"));
    qb.mustNot(QueryBuilders.termQuery("msg", "bob"));
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(qb);

    Search search = new Search.Builder(searchSourceBuilder.toString())
      .addIndex("my_index1")
      .build();
    SearchResult result = client.execute(search);

And the tunnel log:

TCP Forwarding 127.0.0.1:65120 <--> 192.168.1.100:9200
up:
POST /my_index1/_search HTTP/1.1
Content-Length: 288
Content-Type: application/json; charset=UTF-8
Host: localhost:5566
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.1 (Java/1.8.0_60)
Accept-Encoding: gzip,deflate

{
  "query" : {
    "bool" : {
      "must" : {
        "term" : {
          "msg" : "hello"
        }
      },
      "must_not" : {
        "term" : {
          "msg" : "bob"
        }
      },
      "should" : {
        "term" : {
          "msg" : "world"
        }
      }
    }
  }
}

down:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 243

{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"my_index1","_type":"msg_doc","_id":"AVHgELHqpIrLtHSr6-gg","_score":0.04500804,"_source":{"msg":"hello"}}]}}

TCP Forwarding 127.0.0.1:65120 <--> 192.168.1.100:9200 stopped.

Now we got the results we were looking for. Whoopee.

As a final example, the tunnel can also be used for Java unit/integration testing. There is an example on the project website (https://github.com/mukatee/java-tcp-tunnel). To illustrate it with the ElasticSearch example above:

  @Test
  public void sendRequestMITM() throws Exception {
    //assume we have ElasticSearch set up for testing at the same address as before
    //configure the tunnel to accept connections on port 5566 and forward them to 192.168.1.100:9200
    Params params = new Params(5566, "192.168.1.100", 9200);
    //we want to use the captured data in testing, so enable logging the tunnel data in memory with a buffer size of 8092 bytes
    params.enableInMemoryLogging(8092);
    //this gives us programmatic access to the data passed from the client connected to port 5566 -> 192.168.1.100:9200 (test client to ElasticSearch)
    InMemoryLogger upLogger = params.getUpMemoryLogger();
    //this gives us access to the data passed from 192.168.1.100:9200 -> the client connected to port 5566 (ElasticSearch to test client)
    InMemoryLogger downLogger = params.getDownMemoryLogger();
    //this is how we actually start the tunnel
    Main main = new Main(params);
    main.start();

    //perform the previously defined query over the tunnel
    JestClientFactory factory = new JestClientFactory();
    factory.setHttpClientConfig(new HttpClientConfig
            .Builder("http://localhost:5566")
            .multiThreaded(true)
            .build());
    JestClient client = factory.getObject();

    BoolQueryBuilder qb = QueryBuilders.boolQuery();
    qb.must(QueryBuilders.termQuery("msg", "hello"));
    qb.should(QueryBuilders.termQuery("msg", "world"));
    qb.mustNot(QueryBuilders.termQuery("msg", "bob"));
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(qb);

    Search search = new Search.Builder(searchSourceBuilder.toString())
            // multiple index or types can be added.
            .addIndex("my_index1")
            .build();
    SearchResult result = client.execute(search);

    //we could get any attributes from "result" here and assert them, but I focus on the tunnel
    //the requests may need a moment to pass through the tunnel, so a sleep (or a more advanced observer) before asserting can be in order
    Thread.sleep(1000);
    //assert the HTTP protocol data passed through the tunnel both ways
    assertTcpStream(upLogger, "expected_up1.txt");
    assertTcpStream(downLogger, "expected_down1.txt");
  }

  private void assertTcpStream(InMemoryLogger logger, String filename) throws Exception {
    //here we get the actual data that was passed through the tunnel in one direction (depending on whether we were passed the upstream or downstream memory logger)
    String actual = logger.getString("UTF8");
    //the rest of this just makes sure the test runs the same across platforms and with varying date-times in the HTTP headers
    actual = TestUtils.unifyLineSeparators(actual, "\n");
    String expected = TestUtils.getResource(CaptureTests.class, filename);

    String[] replaced = TestUtils.replace("##", expected, actual);
    actual = replaced[0];
    expected = replaced[1];

    expected = TestUtils.unifyLineSeparators(expected, "\n");
    assertEquals(actual, expected, "Request full content");
  }

I have found this to be helpful in documenting my understanding of how a protocol actually works, in making sure I notice if the protocol changes when something is updated, and so on. For example, in another project (https://github.com/mukatee/kafka-consumer) I implemented parsers for the InfluxDB line protocol to capture data from a Kafka stream sent by their Telegraf tool. If the line protocol used by Influx/Telegraf changes, I want to catch it right away (happened once already..). We could also do this type of test for our own REST interfaces, so that it covers all the places where they are used, and if some refactoring or other change produces unwanted side effects we catch it immediately at the protocol level (instead of wondering why some call is not giving the correct response).

So that was the ElasticSearch example. I have also used the tunnel to debug binary protocols such as Avro and Protocol Buffers over Kafka. Those are slightly different in that they do not produce readable data on the console, being binary protocols. However, the tunnel has options to write the bytes to the console or to a file. From a file, the captured bytes would then need to be parsed using the Avro/ProtoBuf/Thrift schemas etc. I have not needed to go there yet, so I have not done that. It might also require splitting the stream into parts to separate the individual messages from the binary stream for this type of protocol. This can be done, for example, by extending the tunnel code with a specific Observer type.
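
If I ever do go there, decoding a single captured Avro message from a dump would probably look roughly like the following. This is only a sketch: the file names are made up, and it assumes the bytes of one message have already been separated from the captured stream and that the writer schema is at hand.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import java.io.File;
import java.nio.file.Files;

public class AvroDumpSketch {
  public static void main(String[] args) throws Exception {
    //"msg.avsc" and "msg.bin" are hypothetical files: the writer schema and the raw bytes of one captured message
    Schema schema = new Schema.Parser().parse(new File("msg.avsc"));
    byte[] data = Files.readAllBytes(new File("msg.bin").toPath());
    //decode the bytes into a generic record using the schema
    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(data, null);
    GenericRecord record = reader.read(null, decoder);
    System.out.println(record);
  }
}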

So some potential improvements would be SSL support, GZIP support, and maybe specific support for some binary protocols. If I ever need them, maybe I will do it.. Until then.

And you can also go to github and search for “tcp tunnel”, which gives a long list of slightly similar projects..

 

Performance testing with InfluxDB + Grafana + Telegraf, Part 3

This time I ran the test with the fixed SNMP-based router monitoring and with the client-server monitoring as well. Notice in the following graphs how I learned to also put two scales on a single chart, combine line and bar charts, and add units to all the charts. OOoooo, prettyy….

First up, the number of sessions, the update frequency delays, and the connection (SSL handshake) errors from the test start:

[Figure: errors_start]

Here the “errors” graph shows the overall number of connection errors so far in the test (the orange line) and the number of errors for each second (the green bars). The number of sessions grows at a relatively steady pace, with a slight drop in rate over time. At around 16:46:45 we start getting errors at a steady rate, which is at around 1300 sessions, and at the same time the tester starts seeing bigger delays.

The router CPU load goes up pretty high quite fast:

[Figure: nw_cpu_raw_start]

At the problem start point of 16:46:45 the router CPU load starts to go over 80% quite often. This is likely a good part of the cause, although I would have expected more factors to be involved since this is still not at 100% (although looking at the router's internal graph below, maybe it is at 100% and the forced 5-second averaging just hides it).

Notice another interesting part on the left. The test starts from about 20% CPU load. Why is this? I speculate it is due to the other traffic over the same network. You know, “other users” streaming YouTube and playing online games, etc., while I run my tests. So this is actually the “background noise” of the system. It is also probably why in this test run I only got to around 3800 clients when I have previously gotten close to 4200 with less background noise.

Looking at the router internal CPU graph:

[Figure: router_start]

It is actually quite high, but in peaks. With the finest granularity being 5 seconds, much of this gets averaged to much lower values in the Grafana chart above. So maybe the CPU load on the router really is the reason for some of those connection errors, and the 5-second averaging, which is the minimum the SNMP agent on the router gives us, is just partly hiding it.

A couple of other, slightly less relevant, but still quite interesting charts I collected during the run:

[Figure: bandwidth_overload]

This is “br0”, or interface number 7, from the router (which I speculate is the wireless part). That is, the amount of traffic in/out on this interface. Notice how the “out” part (the orange line) actually has a much bigger scale on the left, while the “in” part (green) has a much smaller scale on the right. At first I wondered how this can be. I would expect much more downloading (in) to happen over this interface than uploading (out).

With my tinfoil hat on I immediately thought: maybe the NSA or the Chinese are downloading all my secrets and streaming video of me picking my nose 24/7 from my compewters? But then I realized this is from the router viewpoint, so “out” probably means the router is sending data to the WLAN clients and “in” means it is receiving data from them. So this would be the opposite of my assumption: “in” is actually uploads and “out” downloads. Which seems more correct, so I can remove the hat for now.

Anyway, slightly interestingly, the upload part starts growing at a faster rate compared to the download part when I run the test. This is most likely due to the difference in scale, but it shows how the upload traffic increases proportionally more. In absolute terms I believe the increase is about the same, as it is the application server and client sending data to each other over the wireless, so it is up/download over the same channel for the same data.

Just for the kicks, the JVM memory use I collected:

[Figure: jvm_memory]

Nothing special here, but notice again how having the actual correct units on the y-axis is nice.

Now to take a look at the end part of the test. The errors, update frequency delays and router CPU load:

[Figure: network_cpu_end]

The router CPU in the above is actually hitting 100% in a couple of spots, which is quite heavy since these CPU load measures are 5-second averages as explained before. This is why Grafana is upping the graph's top boundary to 125%. I should probably have set the graph's upper bound in Grafana to 100%, but whatever. At the end, when the test stops creating new connections and the old ones die out, the router load also drops back to the level of the background noise.

Which we can also see in the router internal chart:

[Figure: router_cpu_end]

From this we can make an interesting comparison between the direct measures (the ones queried directly from specific SNMP OIDs) and the derived ones (calculated from the several raw measures):

[Figure: nw_direct_end]

The scales differ because auto-scaling pushes the raw chart up to 125%, while the direct one never goes near 100% and is thus topped at 100%. Anyway, this comparison again shows how the longer averaging period hides much of the important information. This is especially visible in the last part of the period, although it is pretty obvious throughout. In case I did not mention it before, the “CPU busy” value (the blue line in the “Router CPU raw” chart) is the sum of all the other metrics in that chart (user + system + nice + sirq). In any case, the “direct” chart is averaged over a longer period (likely over a minute, as I speculated in my previous posts), so more information loss is evident.

The router RAM usage is pretty constant and stayed so during all my tests. Thus I haven't put much of it in these posts, but there it is for once in the above chart as well..

Then we take the server CPU load, just to see if it was still low enough not to cause any of these issues:

[Figure: server_cup_end]

And no problem, it never goes over 50%. Interesting spikes though.

We can also look at the whole picture at once for the whole test run. The test runs for about 10 minutes and gets up to about 3800 concurrent sessions:

[Figure: whole_sessions]

and the network errors, router CPU load and update frequency delays in client and server:

[Figure: whole_cpu_raw]

Cute. Are we missing anything? Yes: we do not measure the load on the client side where the tester is running. This could also have a very significant impact. Here is an example of the top processes running on the client machine during the test:

[Figure: tasks]

What are these? Well, Firefox is running Grafana, which seems to really suck on the CPU when set up with complex graphs and continuous refreshes at 5-second intervals. IntelliJ is the IDE I am using to run the tests. The “java” process is likely related to the same, being a part of the performance tests. Why are they showing up as two separate processes, both sucking on the CPU at relatively high intensity? No idea (arrr..), but maybe “java” is the actual tester and IntelliJ is where it is run from, and it needs to track the tester process, capture prints, etc. Finally, “Python” is the SNMP poller that I run to capture the router metrics.

So what do I think I should have learned from this:

  • I should architect the system to run monitoring from one node, logging database on one node, graphing UI from another node, …
  • I should track all elements of the system, not just the client and the server
  • Long averaging periods for measurements lose important information needed for detailed analysis of the performance tests vs. functionality (e.g., the reasons for network errors as sessions start piling up).
  • However, I should maybe not spend excessive time building shiny monitoring systems when the potential cause has become clear. Perhaps it is fine if I already have one ready, as I now do..
  • I should architect my test system to become a distributed system, easily deployed on several nodes. This should allow avoiding process resource limits (e.g., threads per process), kernel limits (e.g., file descriptors), and provide more realistic loads over several distributed nodes
  • I should design in monitoring features that are useful in both production and testing

What next? Well, there are a few missing parts. One is that I should monitor the tester clients, as noted above. Another, not so obvious, point is that I should synchronize the clocks of all systems as closely as possible. Otherwise I can never link properties such as network errors to metrics from other nodes. Something like NTP should be used, but it is a bit complex to deploy effectively. In fact, distributed computing in general is a bit complex, so I should probably look into building a nice shiny distributed performance test framework with NTP synchronization sometime in the future..

Another interesting aspect here would be to see how much the longer averaging period makes it harder to find actual causes, in a scenario where this could be realistically controlled with other factors.

SNMP monitoring my ASUS RT-N66U for performance testing

I noticed I have previously also tried to monitor my ASUS RT-N66U router to see what traffic passes through it. You know, just your regular paranoid stuff. This time, however, I was trying to monitor the router CPU load and bandwidth use to figure out if the delays in my performance tests were somehow related to the router performance (as illustrated in some previous posts). Somehow this did not turn out to be quite so simple. I will just document my experiments with setting up the CPU monitoring here, for whatever that's worth..

First off, I needed to enable SNMP on the router. Of course, I had done this a long time ago, but anyway, on this particular router it is done in the admin panel under Administration->SNMP. The SNMP communities set on this panel are by default “public” for read operations and “private” for write operations, so I will use these in my examples..

Secondly, I needed to find the OIDs to monitor. An OID is an Object ID that uniquely identifies some property we want to monitor/manage with SNMP. These are listed in MIB (Management Information Base) files, of which there are plenty, and browsing them to find what you are interested in gets complicated. So I just Googled for “SNMP CPU OID” etc. Some potential ones I found:

Average CPU load
over last 1 minute: .1.3.6.1.4.1.2021.10.1.3.1
over last 5 minutes: .1.3.6.1.4.1.2021.10.1.3.2
over last 15 minutes: .1.3.6.1.4.1.2021.10.1.3.3

Percentages:
user CPU time: .1.3.6.1.4.1.2021.11.9.0
system CPU time: .1.3.6.1.4.1.2021.11.10.0
idle CPU time: .1.3.6.1.4.1.2021.11.11.0

raw user cpu time: .1.3.6.1.4.1.2021.11.50.0
raw system cpu time: .1.3.6.1.4.1.2021.11.52.0
raw idle cpu time: .1.3.6.1.4.1.2021.11.53.0
raw nice cpu time: .1.3.6.1.4.1.2021.11.51.0

So, looking at these, I figured that if I wanted to know why my client-server performance takes a hit when running the performance tests over the router with high numbers of concurrent clients, I would want very fine-grained information on the CPU load. Averages over 1-15 minutes are then not very useful for me. The percentage values seemed much more interesting. However, even though these are listed the same way all over the internet, nowhere does it really say what their granularity is. That is, a percentage over what time period? Or I am sure it is said somewhere, but I did not stumble onto it. Anyway, I hopefully assumed it was a “real-time” percentage, that is, the load percentage at the time of measurement.

So I set up my system to query these from the router:

[Figure: direct]

I was expecting system+user CPU time to add up to the actual CPU load. Obviously it does not, as the idle time goes down by about 25% at the right end, while the system+user loads sum up to less than 2%. So I am missing a large chunk of actual CPU load somewhere (about 23% on the right side here). Also, the idle graph might be correct, but it is definitely not “real-time”. When I look at the CPU load graph shown in the router admin interface itself, this is quite obvious:

[Figure: direct_router]

The admin interface shows a shorter timeframe, so the right-hand side of the SNMP idle graph generally matches the slope of the CPU load shown in the router admin interface. However, it is much less “bumpy” than the router graph, meaning it is averaged over a longer period than the one shown in the router interface. Even though my “system+user” measure is obviously broken, I could get the CPU load from this using the formula “100 – idle”. However, I would obviously prefer something more fine-grained.

Another hopeful approach I found is the OID for something called ProcessorLoad (.1.3.6.1.2.1.25.3.3.1.2.x). This one is a bit odd though: I believe the “x” part identifies the processor for which I want the statistics (in this case my router only has one core; more modern ones have multiple). To find this out I do an snmpwalk on the command line:

snmpwalk -v2c -c public 192.168.2.1 .1.3.6.1.2.1.25.3.3.1.2

Which on my N66U gives me:

iso.3.6.1.2.1.25.3.3.1.2.196608 = INTEGER: 1

Another AC68U that I also had access to also gives a second core with id 196609 (the AC68U is a dual-core router). The value (INTEGER: 1) is the load of the processor. Using this, I get a load graph that is the mirror image of the idle graph, so it is obviously measuring the load averaged over the same period as the idle graph:

[Figure: direct2]

However, as before, this is obviously not as fine grained as I was hoping. For comparison, the admin graph again:

[Figure: direct_router2]

So is there another way to get finer-grained information? Looking and asking around a bit, I did not find much on this. However, somewhere on the mighty internet I did find mentions that the average-value OIDs (1-15 min) would be deprecated because the same information can be derived from the “raw” values. I already listed most of these “raw” measure OIDs in the beginning, but what is a raw value and how do I use them?

After some Google-fu (or flu) it turns out the “raw” values are something called CPU ticks. So what is such a tick? My understanding is that these are values used internally to provide fine-grained time intervals that map to how much time the CPU spends in each state (system, user, idle, …). Anyway, they seem to be quite granular (much less than a second, and I was looking for one-second precision), so maybe I can use these. And how do you calculate the percentage of CPU load from the raw values?

Again, there is not a huge amount of information on this. But I figure it can be done by adding all the different “raw” values together and calculating the percentage of busy ticks vs. idle ticks in the observed time period. Here I use the diff of the values, so if at second 1 the value is 5 and at second 2 it is 10, the actual number of ticks for the last period (second 2) is 10-5=5.

To calculate the percentage: if the raw values for “user”+”system”+”nice” sum up to 10 ticks in a second, and the raw value for “idle” is 90 ticks in the same second, I get a 10% CPU load for that second (with 90% idle). Of course, I cannot assume these values are in a 0-100 range or anything like that; I have to assume there can be any number of ticks in a second. I also believe the number of ticks per second can vary across systems, so I need a more dynamic calculation.

The formula I tried is (user+system+nice)/(user+system+nice+idle)*100 to get the percentage. As I mentioned, there is not much information on this that I could find. However, I found some posts suggesting this is actually how some commonly used SNMP monitoring tools do it. So what does it look like if I do it like this? It looks like this:

[Figure: raw1]

As before, the idle line seems about correct (the router graph is the same one as above), and in this case it is obviously at a much finer granularity than before. Which is nice. However, when I compare the “system”, “nice” and “user” lines to the “idle” line, they are obviously still nowhere near the actual load. They (“system”+”nice”+”user”) sum up to about 2% at most, while the idle line shows that something is consuming up to 65% or so of the CPU (right end of the graph, “100-idle”).

Why these broken results? After going through my SNMP polling code too many times looking for bugs, trying various fixes, etc., I figured either the router SNMP implementation is broken or I am still doing it wrong. More likely the latter.

So finally I got the idea to SSH into the router (the ASUS Merlin firmware at least has the option to enable an internal SSH server on the router). Then I ran the “top” command to see if the load reported by “top” was as far away from my SNMP graphs as the router admin panel charts were.

[Figure: top]

Surprisingly, “top” shows values very close to what I was getting from my raw percentage calculations for “user”, “system”, and “nice”. So what gives? Well, the “idle” value is also close to my graph, so after looking around some more, I noticed there is another value called “sirq” that I had not included in my formula. And it is big. Googling around for “sirq” just got me a bunch of questions about “top” and “sirq” and why it is sometimes high, but no explanation of what “sirq” actually stands for. After some time I finally figured out it must be related to “software interrupts”.

So after some more Googling, I found there is an OID that seems relevant: 1.3.6.1.4.1.2021.11.61 (ssCpuRawSoftIRQ). Performing an snmpwalk over .1.3.6.1.4.1.2021.11 actually shows this (the raw values are all under the same hierarchy branch). I did the snmpwalk before but missed this, as it was never mentioned elsewhere I looked and it appeared later in the list of walk results.. That's my excuse.

So I add the soft interrupt load to my formulas and now I finally get:

[Figure: raw2]

So summing up user+system+nice+sirq should do the trick. However, even without the others I can see that in my case it is mostly just the “sirq” value that makes up the load here. This seems much closer to the router admin panel chart:

[Figure: direct_router2]

However, it is still not quite the same. Why is this? The tick counters seem to be updated only every 5 seconds, so I guess this is the finest granularity I can monitor on this router. The admin panel seems finer, probably a one-second interval. But at least 5 seconds is better than one minute..
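
To sum up the calculation in code form, here is a simplified sketch of roughly what my poller ends up doing with the raw tick values (not the actual pypro code, which is in Python, just the same formula in Java with the sirq part included):

public class CpuTicks {
  long user, system, nice, idle, sirq; //values read from the raw OIDs at one poll

  public static double loadPercentage(CpuTicks previous, CpuTicks current) {
    //diff each counter to get the ticks spent in each state during this polling interval
    long user = current.user - previous.user;
    long system = current.system - previous.system;
    long nice = current.nice - previous.nice;
    long idle = current.idle - previous.idle;
    long sirq = current.sirq - previous.sirq;

    long busy = user + system + nice + sirq;
    long total = busy + idle;
    if (total == 0) return 0; //counters did not move, e.g. polled faster than the 5 second update interval
    return 100.0 * busy / total;
  }
}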

As a related note, I was interested in more metrics than just this: the bandwidth consumed on the router (maybe if my performance test was taking too much bandwidth, that would cause problems) as well as other resource use on the router, such as memory. The bandwidth is a bit tricky. The router has a number of network interfaces: the internet (WAN) connection, the wired ports, as well as the wireless. I could get a list of all these by performing another snmpwalk on the related OID:

snmpwalk -v2c -c public 192.168.2.1 .1.3.6.1.2.1.2.2.1.2
IF-MIB::ifDescr.1 = STRING: lo
IF-MIB::ifDescr.2 = STRING: eth0
IF-MIB::ifDescr.3 = STRING: eth1
IF-MIB::ifDescr.4 = STRING: eth2
IF-MIB::ifDescr.5 = STRING: vlan1
IF-MIB::ifDescr.6 = STRING: vlan2
IF-MIB::ifDescr.7 = STRING: br0

What is what here? I don't really know. But doing an snmpwalk on the bytes in/out OID values (1.3.6.1.2.1.31.1.1.1.6, 1.3.6.1.2.1.31.1.1.1.10) showed me that eth0 has a much higher download count than any other interface, so I am assuming that eth0 is the internet (WAN) connection shared by all devices on the LAN. From some Googling I assume that “br0” is probably the wireless interface. The others I am less sure about, although I guess eth1 and eth2 are two more of the wired ports, and vlan1 and vlan2 are probably the remaining two. Maybe they are just in a special mode (IPTV?). So I tried monitoring the OID for “br0” traffic, which produced nice graphs going up. So maybe I was right.

The rest (memory, etc.) was quite simple: just SNMP query the OIDs for those, and they seem to give the actual “real-time” values. I will not repeat that here. However, for anyone who needs more details, I have the code on GitHub. It is commonly in flux (until I find another thing to play with), the docs are usually not fully up to date, blaablaa.. but the general idea can probably be found by browsing the source code if nothing else..: https://github.com/mukatee/pypro

Finally, an interesting experiment would be to iteratively increase the polling interval for the raw metrics to see how long it needs to be to match the ProcessorLoad or CPU percentage SNMP OID metrics. That would give me the actual averaging interval for those.. Arrrrr…

Anyway, that is how far I got in setting up my router monitoring at a fine granularity for my performance test analysis. Next I should see if I can correlate this with any delays or errors in the actual performance tests…