Previously I wrote about building an issue category predictor using LSTM networks on Keras. This was a two-layer bi-directional LSTM network. A neural network architecture that has been gaining some "attention" recently in NLP is Attention. This is simply an approach to have the network pay some more "attention" to specific parts of the input. That’s what I think anyway. So the way to use Attention layers is to add them to other existing layers.
In this post, I look at adding Attention to the network architecture of my previous post, and how this impacts the resulting accuracy and training of the network. Since Keras still does not have an official Attention layer at this time (or I cannot find one anyway), I am using one from CyberZHG’s Github. Thanks for the free code!
I tried a few different (neural) network architectures with Attention, including the ones from my previous post, with and without Glove word embeddings. In addition to these, I tried adding a dense layer before the final output layer, after the last attention layer. Just because I heard bigger is better :). The maximum model configuration of this network looks like this:
With a model summary as:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
Input (InputLayer)           (None, 1000)              0
_________________________________________________________________
embedding (Embedding)        (None, 1000, 300)         6000000
_________________________________________________________________
lstm-bi1 (Bidirectional)     (None, 1000, 256)         440320
_________________________________________________________________
drop1 (Dropout)              (None, 1000, 256)         0
_________________________________________________________________
seq_self_attention_3 (SeqSel (None, 1000, 256)         16449
_________________________________________________________________
lstm-bi2 (Bidirectional)     (None, 1000, 128)         164864
_________________________________________________________________
drop2 (Dropout)              (None, 1000, 128)         0
_________________________________________________________________
seq_weighted_attention_3 (Se (None, 128)               129
_________________________________________________________________
mid_dense (Dense)            (None, 128)               16512
_________________________________________________________________
drop3 (Dropout)              (None, 128)               0
_________________________________________________________________
output (Dense)               (None, 165)               21285
=================================================================
Total params: 6,659,559
Trainable params: 659,559
Non-trainable params: 6,000,000
_________________________________________________________________
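The parameter counts in this summary can be reproduced by hand, which helps to see where the weights actually sit. The sketch below is plain Python arithmetic, not Keras code. It assumes a vocabulary of 20,000 words (6,000,000 / 300), two bias vectors per gate in CuDNNLSTM, and an additive self-attention layer with 32 hidden units. These are my assumptions, picked because they match the numbers printed in the summary, and the function names are made up for the example.

```python
def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    # (the two-bias layout is an assumption that matches the summary).
    return 4 * (input_dim * units + units * units + 2 * units)


def bidirectional_params(input_dim, units):
    # Forward and backward LSTM have separate weights.
    return 2 * cudnn_lstm_params(input_dim, units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


def additive_self_attention_params(feature_dim, hidden_units=32):
    # Assumed layout: two (feature_dim x hidden) projections, a hidden bias,
    # a hidden-to-score vector and a scalar score bias.
    return 2 * feature_dim * hidden_units + hidden_units + hidden_units + 1


def weighted_attention_params(feature_dim):
    # Assumed layout: one weight per feature plus a scalar bias.
    return feature_dim + 1


layer_params = {
    "embedding": 20000 * 300,
    "lstm-bi1": bidirectional_params(300, 128),
    "seq_self_attention": additive_self_attention_params(256),
    "lstm-bi2": bidirectional_params(256, 64),
    "seq_weighted_attention": weighted_attention_params(128),
    "mid_dense": dense_params(128, 128),
    "output": dense_params(128, 165),
}
total = sum(layer_params.values())
trainable = total - layer_params["embedding"]  # frozen Glove embeddings

for name, count in layer_params.items():
    print(f"{name:24s}{count:>10,d}")
print(f"total={total:,d} trainable={trainable:,d}")
```

The point to take away is how small the attention layers are in terms of weights compared to the LSTM layers and the embedding.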
I then cut this down into a few different configurations to see what the impact would be on the loss and accuracy of the predictions.
I used 6 variants in total. Each had the 2 bi-directional LSTM layers. The variants were:
- 2-att-d: each bi-lstm followed by attention. dropout of 0.5 and 0.3 after each bi-lstm
- 2-att: each bi-lstm followed by attention, no dropout
- 2-att-d2: each bi-lstm followed by attention, dropout of 0.2 and 0.1 after each bi-lstm
- 2-att-dd: 2-att-d2 with dense in the end, dropouts 0.2, 0.1, 0.3
- 1-att-d: 2 bi-directional layers, followed by single attention. dropout 0.2 and 0.1 after each bi-lstm.
- 1-att: 2 bi-directional layers, followed by single attention. no dropout.
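The six variants above can also be written down as a small configuration table. This is just a sketch in plain Python to make the differences explicit; the field names and the `layer_stack` helper are my own labels, not Keras arguments. It assumes that dropout comes before the attention layer (as in the full model) and that the single attention in the 1-att variants is the one that collapses the sequence before the output layer.

```python
from collections import namedtuple

# Field names are hypothetical labels for this sketch.
Variant = namedtuple("Variant", "attention_layers dropouts final_dense dense_dropout")

VARIANTS = {
    "2-att-d":  Variant(2, (0.5, 0.3), False, None),
    "2-att":    Variant(2, (None, None), False, None),
    "2-att-d2": Variant(2, (0.2, 0.1), False, None),
    "2-att-dd": Variant(2, (0.2, 0.1), True, 0.3),
    "1-att-d":  Variant(1, (0.2, 0.1), False, None),
    "1-att":    Variant(1, (None, None), False, None),
}


def layer_stack(name):
    """List the layers of a variant in order, as readable strings."""
    v = VARIANTS[name]
    stack = ["embedding"]
    for i, rate in enumerate(v.dropouts, start=1):
        stack.append(f"bi-lstm{i}")
        if rate is not None:
            stack.append(f"dropout({rate})")
        is_last = i == len(v.dropouts)
        # Only the 2-attention variants put attention between the bi-LSTMs.
        if v.attention_layers == 2 and not is_last:
            stack.append("self-attention")
        # Assumption: every variant ends with the sequence-collapsing attention.
        if is_last:
            stack.append("weighted-attention")
    if v.final_dense:
        stack += ["dense", f"dropout({v.dense_dropout})"]
    stack.append("output")
    return stack


for name in VARIANTS:
    print(f"{name:9s}", " -> ".join(layer_stack(name)))
```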
The code for the model definition with all the layers enabled:
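from keras.layers import Input, Embedding, Bidirectional, CuDNNLSTM, Dropout, Dense
from keras.models import Model
# Attention layers from CyberZHG's Github (assumed here to be the keras-self-attention package)
from keras_self_attention import SeqSelfAttention, SeqWeightedAttention

# sequence_length, vocab_size, embedding_dim, embedding_matrix, embeddings_trainable,
# attention_width and cat_count are defined elsewhere in the notebook
input = Input(shape=(sequence_length,), name="Input")
embedding = Embedding(input_dim=vocab_size,
                      weights=[embedding_matrix],
                      output_dim=embedding_dim,
                      input_length=sequence_length,
                      trainable=embeddings_trainable,
                      name="embedding")(input)
# first bi-LSTM, followed by self-attention (3D in, 3D out)
lstm1_bi1 = Bidirectional(CuDNNLSTM(128, return_sequences=True, name='lstm1'), name="lstm-bi1")(embedding)
drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
attention1 = SeqSelfAttention(attention_width=attention_width)(drop1)
# second bi-LSTM, followed by weighted attention (3D in, 2D out)
lstm2_bi2 = Bidirectional(CuDNNLSTM(64, return_sequences=True, name='lstm2'), name="lstm-bi2")(attention1)
drop2 = Dropout(0.1, name="drop2")(lstm2_bi2)
attention2 = SeqWeightedAttention()(drop2)
# optional dense layer before the output
mid_dense = Dense(128, activation='sigmoid', name='mid_dense')(attention2)
drop3 = Dropout(0.2, name="drop3")(mid_dense)
output = Dense(cat_count, activation='sigmoid', name='output')(drop3)
model = Model(inputs=input, outputs=output)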
The 3 tables below summarize the results for the different model configurations, using the different embeddings versions of:
- Glove initialized embeddings, non-trainable
- Glove initialized embeddings, trainable
- Uninitialized embeddings, trainable
The training curves in general look similar to this (picked from one of the best results):
So not too different from my previous results.
At least with this type of results, it is nice to see realistic-looking training and validation accuracy and loss curves, with training going up and crossing validation at some point close to where overfitting starts. I have recently done a lot of Kaggle with the intent to learn how to use all these things in practice. And I think Kaggle is a really great place for this. However, the competitions seem to be geared to trick you somehow, the given training vs test sets are usually really weirdly skewed, and the competition over those tiny fractions of accuracy is intense. So compared to that, this is a nice, fresh view of the real world being more sane for ML applications. But I digress..
Summary insights from the above:
- The tables above show that adding Attention increases the accuracy by about 10% in relative terms, from a bit below 50% to about 54% in the best case.
- Besides the impact from adding Attention, the rest of the configurations seem to be merely fiddling with it, without much impact on accuracy or loss.
- 2 Attention layers are slightly better than one
- Dropout quite consistently has a small negative impact on performance. As I did not try that many configurations, maybe it could be improved by tuning.
- Final dense layer here mainly has just a negative impact
- The Kaggle kernels I ran this with have this nasty habit of sometimes cutting the output for some parts of the code. In this case it consistently cut the un-trainable Glove version output at around the 9th epoch, which is why all those in the above tables are listed as best around the 8th or 9th epoch. It might have shown small gains for one or two more epochs still. However, it was plateauing so strongly already that I do not think it is a big issue for these results.
- Due to training with smaller batch size taking longer, I had to limit epochs to 10 from previous post’s 15. On the other hand, Attention seems to converge faster, so not that bad a tradeoff.
Attention: My Final Notes
I used the Attention layer from the Github I linked. Very nice work in that Github in many related accounts BTW. Definitely worth checking out. This layer seems very useful. However, it seems to be tailored to the Github owner's specific needs and is not documented in much detail. There seem to be some different variants of Attention layers I found around the internets, some only working on previous versions of Keras, others requiring 3D input, others only 2D.
For example, the above Attention layer works on 3D inputs. SeqSelfAttention takes 3D sequences as input and outputs 3D sequences. SeqWeightedAttention takes 3D input and outputs 2D. There is at least one implementation being copy-pasted around in Kaggle kernels that uses 2D inputs and outputs. Some other custom Keras layers seem to have gone stale. Another I found on Github seems promising but has not been updated. One of the issues links to a patched version though. But in any case, my goal was not to compare different custom implementations, so I will just wait for the official one and play with this for now.
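To make the 3D-in, 2D-out behaviour concrete, here is a minimal sketch of attention-weighted pooling over a single sequence in plain Python. It is my own illustration of the general idea, not the code of the layer I used. Each timestep gets a score from a weight vector and a bias (which would account for 129 parameters on a 128-wide input), the scores are softmaxed over time, and the timesteps are summed with those weights. The function and variable names are made up for the example.

```python
import math


def weighted_attention_pool(sequence, weights, bias=0.0):
    """Collapse a (timesteps x features) sequence into one feature vector."""
    # One scalar score per timestep: dot(weights, step) + bias.
    scores = [sum(w * x for w, x in zip(weights, step)) + bias for step in sequence]
    # Softmax over the time axis (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    attention = [e / norm for e in exps]
    # Weighted sum of the timesteps: the time axis disappears.
    features = len(sequence[0])
    pooled = [
        sum(a * step[f] for a, step in zip(attention, sequence))
        for f in range(features)
    ]
    return pooled, attention


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 timesteps, 2 features
pooled, attention = weighted_attention_pool(sequence, weights=[0.0, 0.0])
print(attention)  # all-zero weights give every timestep the same attention
print(pooled)
```

Applied to every sample in a batch, this turns a (batch, timesteps, features) tensor into (batch, features), which is the shape the final dense layers need.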
As noted, I ran these experiments on Kaggle kernels. At the time they were running on NVIDIA P100 GPUs, which are intended to be datacenter scale products. These have 16GB of GPU memory, which at this time is a lot. Using the two attention layers I described above, I managed to exhaust this memory quite easily. This is maybe because I used a rather large sequence length of 1000 timesteps (words). The model summary I printed above shows the Attention layers having only 16449 and 129 parameters to train, so the memory must be going to something other than the weights. Not that I understand the details at such depth, but something to consider.
Some of the errors I got when setting up these Attention layers also seemed to indicate it was building a 4D representation by adding another dimension (of size 1000) on top of the layer it was paying attention to (the bi-LSTM in this case). This sort of makes sense, considering it takes a 3D input (the LSTM sequence output) and pays attention to it. The attention window is just one parameter that could be tuned in the Attention implementation I used, so a better understanding of this implementation and its tuning parameters/options/impacts would likely be useful and maybe help with many of my issues.
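As a rough illustration of what an attention window means, the sketch below builds a mask where each timestep may only attend to its neighbours within a fixed width. This is one common way to define a centered window, written in plain Python; I am not claiming it is exactly how the layer I used implements its attention_width parameter, and `window_mask` is a made-up helper name.

```python
def window_mask(seq_len, width):
    """Return mask[i][j] == True where query step i may attend to key step j."""
    mask = []
    for i in range(seq_len):
        lower = i - width // 2  # first key step inside the window
        upper = lower + width   # one past the last key step
        mask.append([lower <= j < upper for j in range(seq_len)])
    return mask


# Print a small mask: 'x' marks the allowed (query, key) pairs.
for row in window_mask(seq_len=6, width=3):
    print("".join("x" if allowed else "." for allowed in row))
```

Note that a mask like this still covers the full timesteps x timesteps grid, so a windowed attention may not use any less memory unless the implementation is specifically written to take advantage of it.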
Overall, as far as I understand, using a smaller number of timesteps is quite common. Likely using fewer would still give very good results, but allow for more freedom to experiment with other parts of the model architecture without running out of memory. The memory issue required me to run with a much smaller batch size of 48, down from the 128 and higher I used before. This in turn slows down training, as a smaller batch size takes longer to process the whole dataset.
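A back-of-the-envelope estimate shows why the sequence length matters so much here. The sketch below assumes that the self-attention layer materializes one 4D float32 tensor of shape (batch, timesteps, timesteps, hidden units) with 32 hidden units. That is my guess based on the error messages and the parameter count, not something I have verified in the layer's code, and it ignores gradients and all the other tensors in the model.

```python
def attention_tensor_gib(batch_size, timesteps, hidden_units=32, bytes_per_value=4):
    """Size in GiB of one (batch, T, T, hidden) float32 tensor."""
    # One value per (sample, query step, key step, hidden unit).
    values = batch_size * timesteps * timesteps * hidden_units
    return values * bytes_per_value / 2**30


for batch_size in (48, 128):
    for timesteps in (1000, 250):
        gib = attention_tensor_gib(batch_size, timesteps)
        print(f"batch={batch_size:3d} timesteps={timesteps:4d} -> {gib:6.2f} GiB")
```

Under these assumptions a batch of 128 at 1000 timesteps needs roughly 15 GiB for that single tensor, which is about the whole card, while a batch of 48 needs about 5.7 GiB. Because the size grows with the square of the sequence length, cutting the timesteps to a quarter shrinks it by a factor of 16.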
"Official" support for Attention has been a long time coming (well, in terms of DL frameworks anyway..), and seems to be quite awaited feature (so the search-engines tell me). The comments I link above (in the Keras issue tracker on Github) also seem to contain various proposals for implementation. Perhaps the biggest issue still being the need to figure out how the Keras team wants to represent Attention to users, and how to make it as easy to use (and I suppose effective) as possible. Still, over these years of people waiting, maybe it would be nice to have something and build on that? Of course, as a typical OSS customer, I expect to have all this for free, so that is my saltmine.. 🙂
Some best practices / easy to understand documentation I would like to see:
- Tradeoffs in using different types of Attention: 3D, 2D, attention windows, etc.
- Attention in multi-layer architectures, where does it make the most sense and why (intuitively too)
- Parameter explanations and tuning experiences / insights (e.g., attention window size)
- Other general use types in different types of networks