I previously did a review on applications of machine learning in software testing and network analysis. I was looking at updating that, maybe with some extra focus. As usual, I got distracted, this time into building an actual system to do some of the tasks discussed in those reviews. This post discusses how I built a bug report classifier based on bug report descriptions. Or more generally, they are issues listed in a public Jira, but never mind..
The classifier I built here is based on bi-directional LSTM (long short-term memory) networks using Keras (with Tensorflow). So deep learning, recurrent neural networks, word embeddings. Plenty of trendy things to see here.
Getting some data
The natural place to go looking for this type of data is open source projects and their bug databases. I used the Qt project bug tracker (see, even the address has the word "bug" in it, not "issue"). It seems to be based on the commonly used Jira platform. You can go to the website, select a project, fill in filters, click export, and directly get a CSV formatted output file that can be imported straight into Pandas and thus into the Python ML and data analytics libraries. This is what I did.
Since the export interface only allows downloading data for 1000 reports at a time, I scripted it. Using Selenium WebDriver I automated filling in the download filters one month at a time. This script is stored in my ML-experiments Github, along with a script that combines all the separate downloads into one CSV file. Hopefully I don't move these around the repo too much and keep breaking the links.
Some points in building such a downloader (a minimal sketch follows the list):
- Disable the save dialog in the browser via Selenium settings
- Turn autosave on, so exports go straight to a known download directory
- Wait for the download to complete, e.g. by scanning for partial files
- Rename the latest created file according to the filtered time period
- Check file sizes, bug creation dates and index continuity to see if something went missing
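To make the list above concrete, here is a minimal sketch of the kind of browser setup and download-wait logic involved. It is not the exact script from my repo; the download directory, MIME types, and timeout are placeholder assumptions.

import os
import time
from selenium import webdriver

DOWNLOAD_DIR = "/tmp/jira-exports"  # placeholder download directory

# Configure Firefox to save CSV exports automatically, without a save dialog.
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)      # use a custom dir
profile.set_preference("browser.download.dir", DOWNLOAD_DIR)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "text/csv,application/csv")
driver = webdriver.Firefox(firefox_profile=profile)

def wait_for_download(timeout=300):
    # Firefox writes in-progress downloads as .part files; wait until none remain.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not any(f.endswith(".part") for f in os.listdir(DOWNLOAD_DIR)):
            return
        time.sleep(1)
    raise TimeoutError("CSV download did not finish in time")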
Exploring the data
Before rushing full speed into building a classifier, it is generally good to explore the data a bit and see what it looks like, what can be learned, and so on. Commonly this is called exploratory data analysis (EDA).
First, read the data and set dates to date types to enable date filters:
import pandas as pd

df_bugs = pd.read_csv("bugs/reduced.csv",
                      parse_dates=["Created", "Due Date", "Resolved"])
print(df_bugs.columns)
This gives 493 columns:
Index(['Unnamed: 0', 'Affects Version/s', 'Affects Version/s.1',
'Affects Version/s.10', 'Affects Version/s.11',
'Affects Version/s.12',
'Affects Version/s.13', 'Affects Version/s.14',
'Affects Version/s.15', 'Affects Version/s.16',
...
'Status', 'Summary', 'Time Spent', 'Updated', 'Votes',
'Work Ratio', 'Σ Original Estimate', 'Σ Remaining Estimate',
'Σ Time Spent'], dtype='object', length=493)
This is a lot of fields for a bug report. The large count is because Jira seems to dump multi-valued items into multiple columns. The above snippet shows an example of "Affects version" being split into multiple columns: if some bug has 20 affected versions, all exported rows get 20 "Affects version" columns, one value per column. A simple way I used to combine them was to count the non-null values:
# This dataset has many very sparse columns, where each comment,
# affected version, etc. is a separate field. This collapses all the
# "Comment", "Comment.1", "Comment.2", ... style columns into a single
# count of such items.
def sum_columns(old_col_name, new_col_id):
    all_cols = df_bugs.columns
    old_cols = [col for col in all_cols if old_col_name in col]
    olds_df = df_bugs[old_cols]
    olds_non_null = olds_df.notnull().sum(axis=1)
    df_bugs.drop(old_cols, axis=1, inplace=True)
    df_bugs[new_col_id] = olds_non_null

# Just showing two here as an example:
sum_columns("Affects Version", "affects_count")
sum_columns("Comment", "comment_count")
...
print(df_bugs.columns)
Index(['Unnamed: 0', 'Assignee', 'Created', 'Creator',
'Description',
'Due Date', 'Environment', 'Issue Type', 'Issue id',
'Issue key',
'Last Viewed', 'Original Estimate', 'Parent id', 'Priority',
'Project description', 'Project key', 'Project lead',
'Project name', 'Project type', 'Project url',
'Remaining Estimate', 'Reporter',
'Resolution', 'Resolved', 'Security Level', 'Sprint',
'Sprint.1',
'Status', 'Summary', 'Time Spent', 'Updated', 'Votes',
'Work Ratio', 'Σ Original Estimate',
'Σ Remaining Estimate', 'Σ Time Spent',
'outward_count', 'custom_count', 'comment_count',
'component_count',
'labels_count', 'affects_count', 'attachment_count',
'fix_version_count', 'log_work_count'],
dtype='object')
So, that is 45 columns after combining the multi-valued ones into counts. Down from 493, and it makes it much easier to find the bugs with the most votes, comments, etc. This enables views such as:
df_bugs.sort_values(by="Votes", ascending=False)[
    ["Issue key", "Summary", "Issue Type", "Status", "Votes"]].head(10)
In a similar way, the bug priority counts:
order = ['P0: Blocker', 'P1: Critical', 'P2: Important',
         'P3: Somewhat important', 'P4: Low', 'P5: Not important',
         'Not Evaluated']
# df_2019 = the issues filtered down to those created in 2019 (filtering not shown)
df_2019["Priority"].value_counts().loc[order].plot(kind='bar', figsize=(10, 5))
In addition, I ran various other summaries and visualizations on it to get a bit more familiar with the data.
The final point was to build a classifier and see how well it does. A classifier needs a classification target; I went with the assigned component. So my classifier tries to predict which component a bug report should be assigned to, using only the report's natural-language description.
To start with, a look at the components. A bug report in this dataset can be assigned to multiple components. Similar to the "Affects version" above.
The distribution looks like this:
df_2019["component_count"].value_counts().sort_index()
1 64499
2 5616
3 596
4 64
5 10
6 5
7 1
8 3
9 1
10 3
11 1
This shows that having an issue assigned to more than 2 components is rare, and more than 3 very rare. For this experiment, I only collected the first two components the bugs were assigned to (if any). Of those, I simply used the first assigned component as the training target label. Further training data could be had by also adding items labeled with the second and third components, or all of them if feeling like it. But the first component served well enough for this experiment.
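For reference, a sketch of how those first two components could be captured, before the multi-valued component columns are collapsed into component_count. The exported column names "Component/s" and "Component/s.1" are my assumption of how the Jira export splits the field; the real code is in the repo.

# Capture the first two assigned components (assumed Jira export column names).
df_2019["comp1"] = df_2019["Component/s"]
df_2019["comp2"] = df_2019["Component/s.1"]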
How many unique ones are in those first two?
values = set(df_2019["comp1"].unique())
values.update(df_2019["comp2"].unique())
len(values)
172
So there would be 172 components to predict. And what does their issue count distribution look like?
counts = df_2019["comp1"].value_counts()
counts
1. QML: Declarative and Javascript Engine 5260
2. Widgets: Widgets and Dialogs 4547
3. Documentation 3352
4. Quick: Core Declarative QML 2037
5. Qt3D 1928
6. QPA: Other 1898
7. Build tools: qmake 1862
8. WebEngine 1842
9. Packaging & Installer 1803
10. Build System 1801
11. Widgets: Itemviews 1534
12. GUI: Painting 1480
13. Multimedia 1478
14. GUI: OpenGL 1462
15. Quick: Controls 1414
16. GUI: Text handling 1378
17. Core: I/O 1265
18. Device Creation 1255
19. Quick: Controls 2 1173
20. GUI: Font handling 1141
...
155. GamePad 14
156. KNX 12
157. QPA: Direct2D 11
158. ODF Writer 9
157. Network: SPDY 8
158. GUI: Vulkan 8
159. Tools: Qt Configuration Tool 7
160. QPA: KMS 6
161. Extras: X11 6
162. PIM: Versit 5
163. Cloud Messaging 5
164. Testing: QtUITest 5
165. Learning/Course Material 4
166. PIM: Organizer 4
167. SerialBus: Other 3
168. Feedback 3
169. Systems: Publish & Subscribe 2
170. Lottie 2
171. CoAP 1
172. Device Creation: Device Utilities 1
The above shows that the issue count distribution is very imbalanced.
To summarize: there are 172 components with a heavily skewed distribution. Imagine trying to predict the correct component from 172 options, when for some of them there is very little data available. It would seem very difficult to learn to distinguish the ones with only a handful of training examples. I guess this skewed distribution might partly be due to new components having little data on them yet. In a more realistic scenario this would merit some additional consideration, for example collecting these all into a catch-all category like "Other, manually check", and updating the training data constantly, re-training the model as new issues/bugs are added. Well, that is probably a good idea anyway.
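As a sketch of that catch-all idea (not something I actually did here), the rare components could be grouped into one label before training:

# Group components with fewer than 10 issues into a catch-all class (illustrative only).
rare = counts[counts < 10].index
df_2019["comp1_grouped"] = df_2019["comp1"].where(
    ~df_2019["comp1"].isin(rare), "Other, manually check")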
Besides these components with very few linked issues, there are several in the dataset marked as "Inactive". These would likely also be beneficial to remove from the training set, since we would not expect any new issues for them. I did not do that; for this experiment it is fine without. In any case, this is what it looks like:
df_2019[df_2019["comp1"].str.contains("Inactive")]["comp1"].unique()
array(['(Inactive) Porting from Qt 3 to Qt 4',
'(Inactive) GUI: QWS Integration (Qt4)', '(Inactive) Phonon',
'(Inactive) mmfphonon', '(Inactive) Maemo 5',
'(Inactive) OpenVG',
'(Inactive) EGL/Symbian', '(Inactive) QtQuick (version 1)',
'(Inactive) Smart Installer ', '(Inactive) JsonDB',
'(Inactive) QtPorts: BB10', '(Inactive) Enginio'], dtype=object)
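If one did want to drop these from the training data (again, not done in this experiment), it would be a one-liner along these lines:

# Keep only rows whose first component is not marked inactive (illustrative).
df_2019 = df_2019[~df_2019["comp1"].str.contains("Inactive", na=False)]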
I will use the "Description" column for the features (the words in the description), and the filtered "comp1" column shown above for the target.
Creating an LSTM Model for Classification
This classifier code is available on Github. As shown above, some of the components have very few issues to train on. To make a longish story shorter, I cut out the targets with fewer than 10 issues:
min_count = 10
df_2019 = df_2019[df_2019['comp1']
                  .isin(counts[counts >= min_count].index)]
This enabled me to do a 3-way train-validation-test split (a stratified one) and still have some data in each of the 3 splits for each target component. Code:
from sklearn.model_selection import train_test_split

def train_val_test_split(X, y):
    X_train, X_test_val, y_train, y_test_val = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_test_val, y_test_val, test_size=0.25,
        random_state=42, stratify=y_test_val)
    return X_train, y_train, X_val, y_val, X_test, y_test
Before using that, I need to get the data to train, that is the X (features) and y (target).
To get the features, tokenize the text. For an RNN the input needs to be a fixed-length vector (of tokens), so cut the document at seq_length if longer, or pad it to that length if shorter. This uses the Keras tokenizer, which I would expect to produce suitable input for Keras..
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def tokenize_text(vocab_size, texts, seq_length):
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    X = pad_sequences(sequences, maxlen=seq_length)
    print('Shape of data tensor:', X.shape)
    return texts, X, tokenizer
That produces the X. To produce the y:
from sklearn.preprocessing import LabelEncoder
df_2019.dropna(subset=['comp1', "Description"], inplace=True)
# encode class values as integers
# so they work as targets for the prediction algorithm
encoder = LabelEncoder()
df_2019["comp1_label"] = encoder.fit_transform(df_2019["comp1"])
The "comp1_label" in above now has the values for the target y variable.
To put these together:
data = df_2019["Description"]
vocab_size = 20000
seq_length = 1000
data, X, tokenizer = tokenize_text(vocab_size, data, seq_length)
y = df_2019["comp1_label"]
X_train, y_train, X_val, y_val, X_test, y_test = train_val_test_split(X, y)
The 3 sets of y_xxxx variables still need to be converted to Keras format, which is a one-hot encoded 2D-matrix. To do this after the split:
from keras.utils import to_categorical

y_train = to_categorical(y_train)
y_val = to_categorical(y_val)
y_test = to_categorical(y_test)
Word Embeddings
I am using Glove word vectors; in this case the relatively small set trained on 6 billion tokens (words), with 300 dimensions. The vectors are stored in a text file, one word per line: the first item on a line is the word itself, followed by its 300-dimensional vector values. The following loads this into the embeddings_index dictionary, keys being words and values their vectors.
import os
import numpy as np

def load_word_vectors(glove_dir):
    print('Indexing word vectors.')
    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.300d.txt'), encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    print('Found %s word vectors.' % len(embeddings_index))
    return embeddings_index
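Using it is then just a matter of pointing it at the directory holding the downloaded GloVe files (the path here is a placeholder):

glove_dir = "./glove.6B"          # wherever the GloVe files were unpacked
embedding_dim = 300               # matches glove.6B.300d.txt
embeddings_index = load_word_vectors(glove_dir)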
With these loaded, convert the embedding index into matrix form that the Keras Embedding layer uses. This simply puts the embedding vector for each word at a specific index in the matrix. So if word "bob" is in index 10 in word_index, the embedding vector for "bob" will be in embedding_matrix[10].
def embedding_index_to_matrix(embeddings_index, vocab_size,
                              embedding_dim, word_index):
    print('Preparing embedding matrix.')
    # prepare embedding matrix
    num_words = min(vocab_size, len(word_index))
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, i in word_index.items():
        if i >= vocab_size:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
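And then wiring it together with the tokenizer vocabulary built earlier (this is the matrix the model below expects):

embedding_matrix = embedding_index_to_matrix(embeddings_index, vocab_size,
                                             embedding_dim, tokenizer.word_index)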
Build the bi-LSTM model
I use two model versions here. The first one uses the basic LSTM layer from Keras, the second one the CUDA-optimized CuDNNLSTM layer. I used the CuDNNLSTM version to train the model on a GPU, saved the weights after training, and then loaded the weights into the plain LSTM version. The plain LSTM version is what I used to make predictions on my laptop when developing and demoing this.
Plain LSTM version:
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout, Dense
from keras.models import Model

def build_model_lstm(vocab_size, embedding_dim,
                     embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size,
                          weights=[embedding_matrix],
                          output_dim=embedding_dim,
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(LSTM(128, return_sequences=True, name='lstm1'),
                              name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(LSTM(64, return_sequences=False, name='lstm2'),
                              name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
CuDNNLSTM version:
from keras.layers import CuDNNLSTM

def build_model_lstm_cuda(vocab_size, embedding_dim,
                          embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size,
                          output_dim=embedding_dim,
                          weights=[embedding_matrix],
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(CuDNNLSTM(128, return_sequences=True, name='lstm1'),
                              name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(CuDNNLSTM(64, return_sequences=False, name='lstm2'),
                              name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
The structure of the above models visualizes to this:
The first layer is an embedding layer, which uses the embedding matrix from the pre-trained Glove vectors. This is followed by two bi-LSTM layers, each with a Dropout layer behind it. The bi-LSTM layers look at each word along with its context (as I discussed previously). The dropout layers help avoid overfitting. Finally, a dense layer is used to make the prediction among "cat_count" categories, where cat_count is the number of categories to predict. It is actually categories and not cats, sorry about that.
The "weights=[embedding_matrix]" parameter given to the Embedding layer is what can be used to initialize the pre-trained word-vectors. In this case, those would be the Glove word-vectors. The current Keras Embedding docs say nothing about this parameter, which is a bit weird. Searching for this on the internet also seems to indicate it would be deprecated etc. but it also seems difficult to find a simple replacement. But it works, so I go with that..
In a bit more detail, the model summarizes to this:
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Input (InputLayer) (None, 1000) 0
_________________________________________________________________
embedding (Embedding) (None, 1000, 300) 6000000
_________________________________________________________________
lstm-bi1 (Bidirectional) (None, 1000, 256) 440320
_________________________________________________________________
drop1 (Dropout) (None, 1000, 256) 0
_________________________________________________________________
lstm-bi2 (Bidirectional) (None, 128) 164864
_________________________________________________________________
drop2 (Dropout) (None, 128) 0
_________________________________________________________________
sigmoid (Dense) (None, 165) 21285
=================================================================
Total params: 6,626,469
Trainable params: 626,469
Non-trainable params: 6,000,000
_________________________________________________________________
This shows how the embedding layer turns the input into a suitable shape for LSTM input, as I discussed in my previous post. That is, 1000 timesteps, each with 300 features: the 1000 tokens of each document (issue report description), and the 300-dimensional word vector for each token.
Another interesting point is the text at the end of the summary: "Non-trainable params: 6,000,000". This matches the parameter count of the embedding layer (vocab_size × embedding_dim = 20,000 × 300 = 6,000,000). When the embedding layer is given the parameter "trainable=False", all of its parameters are fixed. If it is set to True, these parameters become trainable as well.
Training it
Training the model is simple now that everything is set up:
from keras.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    filepath="./model-weights-issue-pred.{epoch:02d}-{val_loss:.6f}.hdf5",
    monitor='val_loss', verbose=0, save_best_only=True)

model = build_model_lstm_cuda(vocab_size=vocab_size,
                              embedding_dim=embedding_dim,
                              sequence_length=seq_length,
                              embedding_matrix=embedding_matrix,
                              cat_count=len(encoder.classes_))

history = model.fit(X_train, y_train,
                    batch_size=128,
                    epochs=15,
                    validation_data=(X_val, y_val),
                    callbacks=[checkpoint_callback])

model.save("issue_model_word_embedding.h5")

score, acc = model.evaluate(x=X_test,
                            y=y_test,
                            batch_size=128)
print('Test loss:', score)
print('Test accuracy:', acc)
Notice that I use the build_model_lstm_cuda() version here. That is to train in the GPU environment, to keep the training time sensible.
The callback saves the model weights whenever the validation score improves. In this case it monitors the validation loss getting smaller (so no mode="max" as in my previous version).
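For completeness, loading the trained weights back into the plain-LSTM version for CPU-only prediction looks roughly like this (the checkpoint filename is just an example of what the callback produces):

# Build the plain LSTM model with the same dimensions and load the GPU-trained weights.
model = build_model_lstm(vocab_size=vocab_size,
                         embedding_dim=embedding_dim,
                         sequence_length=seq_length,
                         embedding_matrix=embedding_matrix,
                         cat_count=len(encoder.classes_))
model.load_weights("model-weights-issue-pred.08-2.154000.hdf5")  # example filename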
Predicting
Prediction with a trained model:
def predict(bug_description, seq_length):
    #texts_to_sequences vs text_to_word_sequence?
    sequences = tokenizer.texts_to_sequences([bug_description])
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    X = pad_sequences(sequences, maxlen=seq_length)
    probs = model.predict(X)
    result = []
    for idx in range(probs.shape[1]):
        # le_id_mapping: encoded label id -> component name (built from the LabelEncoder above)
        name = le_id_mapping[idx]
        prob = (probs[0, idx] * 100)
        prob_str = "%.2f%%" % prob
        #print(name, ":", prob_str)
        result.append((name, prob))
    return result
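Calling it and sorting the output looks like this (the description text here is a made-up placeholder, not the real report):

preds = predict("QML application crashes when the ListView model is reset", seq_length)
# Show the five components with the highest predicted probability.
for name, prob in sorted(preds, key=lambda p: p[1])[-5:]:
    print("%s: %.4f" % (name, prob))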
Running this on the bug QTBUG-74496 gives the following top predictions:
Quick: Other: 0.3271
(Inactive) QtQuick (version 1): 0.4968
GUI: Drag and Drop: 0.7292
QML: Declarative and Javascript Engine: 2.6450
Quick: Core Declarative QML : 5.8533
The bigger the number, the higher the likelihood the classifier gives to that component. This highlights many of the topics I mentioned above. There is one inactive component in the list, which again suggests it might be better to remove all inactive ones from the training set. The top prediction (Quick: Core Declarative QML) is not the one assigned to the report at this time, but the second highest (QML: Declarative and Javascript Engine) is. Both seem to be associated with the same top-level component (QML), and I do not have the expertise to say why one might be better than the other in this case.
On most of the issue reports I tried this on, it got the "correct" component as marked on the issue tracker. In the ones that did not match, the suggestion always seemed to make sense (sometimes more so than whatever had been set by whoever assigns the value), and in case of a mismatch the "correct" one was commonly still among the top suggestions. Overall, component granularity might be useful to consider as well when building these types of classifiers and their applications.
Usefulness of pre-trained word-embeddings
When doing the training, I started out using the Glove embeddings, and sometimes just accidentally left them out and trained without them. This reminded me to run an experiment to see how much effect the pre-trained embeddings actually have, and how the accuracy and loss are affected with and without them. So I trained the model with different options for the Embedding layer (the sketch after the list shows exactly what changes):
- fixed Glove vectors (trainable=False)
- trainable Glove vectors (trainable=True)
- trainable, non-initialized vectors (trainable=True, no Glove)
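To be explicit, only the Embedding layer arguments differ between the three variants (a sketch):

# The three Embedding variants compared in this experiment (sketch):
emb_fixed_glove   = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
                              input_length=seq_length, trainable=False)
emb_train_glove   = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
                              input_length=seq_length, trainable=True)
emb_train_no_init = Embedding(vocab_size, embedding_dim,
                              input_length=seq_length, trainable=True)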
The training accuracy/loss curves look like this:
The figure is a bit small but you can hopefully expand it. At least I uploaded it bigger :).
These results can be summarized as:
- Non-trainable Glove alone improves all the way for the 15 iterations (epochs) and reaches validation accuracy of about 0.4 and loss around 2.55. The improvements got really small there so I did not try further epochs.
- Trainable, uninitialized (no Glove) got best validation accuracy/loss at epoch 11 for 0.488 accuracy and 2.246 loss. After this it overfits.
- Trainable with Glove initialization reaches best validation accuracy/loss at epoch 8 for 0.497 accuracy and 2.154 loss. After this it overfits.
Some interesting points I gathered from this:
- Overall, non-trainable Glove gives quite poor results (though I guess still quite usable) compared to the trainable embeddings in the two other options.
- Glove initialized but further trained embeddings converge much faster and get better scores.
- I guess further trained Glove embeddings would be a form of "transfer" learning. Cool, can I put that in my CV now?
My guess is that the bug descriptions contain many terms not commonly used in general domains, which makes further training of the general Glove embeddings more effective. When I did some exploratory analysis of the data (e.g., TF-IDF across components), these types of terms were actually quite visible. However, they are intermixed with general terms, which benefit from Glove, and mixing the two gives the best results. Just my "educated" guess. Another guess is that this result would be quite similar in other domains with domain-specific terminology.
General Notes (or "Discussion" in academic terms..)
The results from training and validation show about 50% accuracy. In a binary classification problem this would not be great, more like random guessing. But with 160+ targets to choose from, this seems very good to me. The loss is maybe a better metric, but I am not that good at interpreting the loss over 160+ categories. Simply put, smaller is better, and overall it signifies how far off the predicted category probabilities are. (But how to measure and interpret the true distance over all categories, when you only give one as the correct target label? You tell me.)
Also, as noted earlier, an issue can be linked to several components, and from my tries of running this with new data and comparing results, the mapping is not always clear-cut; there can be multiple "correct" answers, as my prediction example above shows. The results given by Keras predict() actually list the probabilities for each of the 160+ potential targets. So if accuracy only measures getting the top prediction exactly right, it misses the ones predicted at position 2, 3, and so on. One way I could see this being used is to assist an expert analyzing incoming bug reports with suggested component choices. In such a case, having the "correct" answer even among the few top predictions would be useful, again as shown in my prediction example above.
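One way to put a number on that "correct answer somewhere in the top suggestions" idea would be a top-k accuracy measure, something like this (a sketch, not something I measured here):

import numpy as np

# Fraction of test issues whose true component is among the k highest-scoring predictions.
def top_k_accuracy(model, X_test, y_test, k=5):
    probs = model.predict(X_test)
    top_k = np.argsort(probs, axis=1)[:, -k:]        # indices of the k highest scores
    true_labels = np.argmax(y_test, axis=1)
    return np.mean([t in row for t, row in zip(true_labels, top_k)])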
With that in mind, the 50% chance of getting the top prediction exactly right actually seems very good, considering there are over 160 possible targets to predict. This is nowhere near as simple as a binary classification, where 50% would match random guessing; here it is far better than that.
Besides the LSTM, I also tried more traditional classifiers on this same task. I tried several versions, including multinomial naive Bayes, random forest, and LightGBM, with heavily pre-processed word tokens in their TF-IDF format as features. Those classifiers performed poorly, perhaps even uselessly so. This is largely perhaps due to my lack of skill in feature engineering for any domain, and the lack of hyperparameter optimization, but it was still the case. With that background, I was surprised to see such good performance from the LSTM version.
Overall, the predictions given by this classifier are not perfect, but they are much better than I expected to get. Picking the correct one out of 160+ categories most of the time, and often coming close when not, based only on the natural-language description, was a very nice result for me.
In cases where the match is not perfect, I believe it still provides valuable input. Either the correct match shows up among the other top suggestions, or there may be something to learn from considering why the others were suggested and whether there is some real meaning behind them. All the mismatches I found made sense when considering the reported issue from different angles. I would guess the same might hold for other domains and similar classifiers as well.
Many other topics to investigate would include:
- different types of model architectures (layers, neuron counts, GRU, 1D CNN, …)
- Attention layers (Keras still does not include support but they are very popular now in NLP)
- Different dimensions of embeddings
- Different embeddings initializers (Word2Vec)
- effects of more preprocessing
- N-way cross-validation
- training the final classifier on the whole training data at once, after finishing the model tuning etc.