Word2Vec with some Finnish NLP

To get a better view of the popular Word2Vec algorithm and its applications in different contexts, I ran experiments on Finnish language and Word2vec. Let’s see.

I used two datasets. The first one is the traditional Wikipedia dump: I took the dump of the Finnish Wikipedia from October 20th, because I ran the first experiments around that time. The second dataset was the Board minutes of the City of Oulu for the past few years.

After running my cleaning code on the Wikipedia dump, it reported 600783 sentences and 6778245 words for the cleaned dump. Cleaning here refers to removing all the extra formatting, HTML tagging, etc. Sentences were tokenized using Voikko. For the Board minutes the corresponding metrics were 4582 documents, 358711 sentences, and 986523 words. Most interesting, yes?
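To give an idea of what this kind of cleaning and counting looks like, here is a minimal Python sketch. It is not the cleaning code behind the numbers above: the regular expressions, the naive punctuation-based sentence splitter (standing in for Voikko) and the function names are all assumptions made for illustration.

```python
import re

# Assumed patterns for illustration; a real Wikipedia dump needs far more rules.
TAG_RE = re.compile(r"<[^>]+>")                                # HTML/XML-style tags
WIKI_LINK_RE = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")   # [[target|label]] -> label
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")                 # naive sentence boundary
WORD_RE = re.compile(r"\w+")


def clean(text):
    """Strip wiki links and tags, then collapse whitespace."""
    text = WIKI_LINK_RE.sub(r"\1", text)
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (a stand-in for a real tokenizer)."""
    return [s for s in SENTENCE_END_RE.split(text) if s]


def count(text):
    """Return (number of sentences, number of words) for the cleaned text."""
    sentences = split_sentences(clean(text))
    words = sum(len(WORD_RE.findall(s)) for s in sentences)
    return len(sentences), words


raw = "<p>[[Auto|Auto]] on ajoneuvo. Se kulkee tiellä!</p>"
print(clean(raw))
print(count(raw))
```

A proper tokenizer handles abbreviations, dates and similar cases that this splitter would cut in the wrong place.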

For running Word2vec I used the Deeplearning4J implementation. You can find the example code I used on Github.

Again I have this question of whether to use lemmatization or not. Do I run the algorithm on baseformed words or just unprocessed words in different forms?

Some prefer to run it after lemmatization, while generally the articles on word2vec say nothing on the topic but rather seem to run it on raw text. This description of a similar algorithm actually shows an example of mapping “frog” to “frogs”, further indicating use of raw text. I guess that makes more sense if you have really lots of data and a language that does not have a huge number of forms for different words. Or if you find relations between forms of words more interesting.

For me, Finnish has so many forms of words (morphologies or whatever they should be called?) and generally I don’t expect to run with hundreds of billions of words of data, so I tried both ways (with and without lemmatization) to see. With my limited data and the properties of the Finnish language I would just go with lemmatization really, but it is always interesting to try and see.

Some results for my experiments:

Wikipedia without lemmatization, looking for the closest words to “auto”, which is Finnish for “car”. Top 10 results along with similarity score:

auto vs kuorma = 0.6297630071640015
auto vs akselin = 0.5929439067840576
auto vs auton = 0.5811734199523926
auto vs bussi = 0.5807990431785583
auto vs rekka = 0.578578531742096
auto vs linja = 0.5748337507247925
auto vs työ = 0.562477171421051
auto vs autonkuljettaja = 0.5613142848014832
auto vs rekkajono = 0.5595266222953796
auto vs moottorin = 0.5471497774124146

Words from above translated:

kuorma = load
akselin = axle’s
auton = car’s
bussi = bus
rekka = truck
linja = line
työ = work
autonkuljettaja = car driver
rekkajono = truck queue
moottorin = engine’s

A similarity score of 1 would mean a perfect match, and a score around 0 would mean no relation at all (assuming the score is the usual cosine similarity, it can also go negative). Word2vec builds a model representing the position of words in “vector-space”. These positions are the “word-embeddings”. This sounds fancy, and as usual, it is difficult to find a simple explanation of what is done. I view it as taking typically 100-300 numbers to represent each word’s position in the “word-space”. These get adjusted by the algorithm as it goes through all the sentences and records each word’s relation to other words in those sentences. Probably all wrong in that explanation but until someone gives a better one..
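As a rough illustration of how such a score can be computed, here is a small Python sketch of cosine similarity over word vectors. It assumes the scores above are cosine similarities. The three-dimensional vectors are made up for the example and do not come from any trained model, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors; real models typically use 100-300 dimensions.
vectors = {
    "auto": [0.9, 0.1, 0.3],
    "rekka": [0.8, 0.2, 0.4],
    "seinä": [0.1, 0.9, 0.2],
}

for other, score in most_similar("auto", vectors):
    print(f"auto vs {other} = {score}")
```

Vectors pointing in the same direction score 1, unrelated (orthogonal) ones score 0, and opposite ones score -1.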

To preprocess the documents for word2vec, I split the documents into sentences to give the words a more meaningful context (a sentence vs just any surrounding words). There are other similar techniques, such as GloVe, that may work better with more global “context” than a sentence. But anyway this time I was playing with Word2vec, which I think is also interesting for many things. It also has lots of implementations and popularity.
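To show why the sentence split matters, here is a simplified Python sketch of how (word, context word) training pairs can be collected within a window. It only illustrates the idea and is not what Deeplearning4J does internally: real implementations add things like subsampling and varying window sizes, and `context_pairs` is a hypothetical name.

```python
def context_pairs(sentences, window=2):
    """Collect (target, context) pairs, never crossing a sentence boundary."""
    pairs = []
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, tokens[j]))
    return pairs


sentences = [["auto", "ajaa", "tie"], ["juna", "kulkea", "rata"]]
pairs = context_pairs(sentences, window=1)
print(pairs)
```

Because each sentence is handled separately, “tie” at the end of the first sentence is never paired with “juna” at the start of the second one.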

Looking at the results above, there is the word “auton”, translating to “car’s”. The Finnish language has a large number of forms that different words can take. So, sometimes, it may be good to lemmatize to see what the meaning of the word maps to, rather than matching different forms of the same word. So I lemmatized with Voikko, the Finnish language lemmatizer, again. Re-run of the above, top 10:

auto vs ajoneuvo = 0.7123048901557922
auto vs juna = 0.6993820667266846
auto vs rekka = 0.6949941515922546
auto vs ajaa = 0.6905277967453003
auto vs matkustaja = 0.6886627674102783
auto vs tarkoitettu = 0.66249680519104
auto vs rakennettu = 0.6570218801498413
auto vs kuljetus = 0.6499230861663818
auto vs rakennus = 0.6315782070159912
auto vs alus = 0.6273047924041748

Meanings of the words in English:

ajoneuvo = vehicle
juna = train
rekka = truck
ajaa = drive
matkustaja = passenger
tarkoitettu = meant
rakennettu = built
kuljetus = transport
rakennus = building
alus = ship

So generally these mappings make some sense. Not sure about those building words. Some deeper exploration would probably help..

Some people have also come up with the idea of POS tagging before running word2vec, calling it Sense2Vec. The point is to better differentiate how different meanings of a word map differently. So I tried POS tagging with the tagger I implemented before. Results:
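The tagging step itself can be sketched in a few lines of Python. The lookup table below is a made-up stand-in for a real POS tagger, and `tag_tokens` plus the "X" tag for unknown words are assumptions for the example, not part of the tagger used here.

```python
def tag_tokens(tokens, pos_lookup, unknown="X"):
    """Append a POS tag to each token, e.g. 'auto' -> 'auto_N'."""
    return [f"{token}_{pos_lookup.get(token, unknown)}" for token in tokens]


# Hypothetical lookup; a real tagger decides the tag from the sentence context.
pos_lookup = {"auto": "N", "ajaa": "V", "tie": "N"}

print(tag_tokens(["auto", "ajaa", "tie", "nopeasti"], pos_lookup))
```

Word2vec then treats “auto_N” as its own vocabulary item, separate from any other tagged form of the same string.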

auto_N vs juna_N = 0.7195479869842529
auto_N vs ajoneuvo_N = 0.6762610077857971
auto_N vs alus_N = 0.6689988970756531
auto_N vs kone_N = 0.6615594029426575
auto_N vs kuorma_N = 0.6477057933807373
auto_N vs tie_N = 0.6470917463302612
auto_N vs seinä_N = 0.6453390717506409
auto_N vs kuljettaja_N = 0.6449363827705383
auto_N vs matka_N = 0.6337422728538513
auto_N vs pää_N = 0.6313328146934509

Meanings of the words in English:

juna = train
ajoneuvo = vehicle
alus = ship
kone = machine
kuorma = load
tie = road
seinä = wall
kuljettaja = driver
matka = trip
pää = head

Soo… The weirdest ones here are the wall and head parts. Perhaps again a deeper exploration would tell more. The rest seem to make some sense just by looking.

Next, the same for the City of Oulu Board minutes, now looking for a word specific to the domain. The word is “serviisi”, which is the city office responsible for food production for different facilities and schools. This time lemmatization was applied for all results. Results:

serviisi vs tietotekniikka = 0.7979459762573242
serviisi vs työterveys = 0.7201094031333923
serviisi vs pelastusliikelaitos = 0.6803742051124573
serviisi vs kehittämisvisio = 0.678106427192688
serviisi vs liikel = 0.6737961769104004
serviisi vs jätehuolto = 0.6682301163673401
serviisi vs serviisin = 0.6641604900360107
serviisi vs konttori = 0.6479293704032898
serviisi vs efekto = 0.6455909013748169
serviisi vs atksla = 0.6436249017715454

Because “serviisi” is a very domain-specific word/name here, the general-purpose Finnish lemmatization does not work for it. This is why “serviisin” is there again. To fix this, I added this and some other forms of the word to the list of custom spellings recognized by my lemmatizer tool. That is, it uses Voikko first, and if the word is not found there, it tries a lookup in a custom list. If the word is still not found, it goes into a list of all unrecognized words, written out sorted by highest frequency first (to allow augmenting the custom list more effectively).
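The lookup order described above can be sketched roughly like this in Python. It is not the actual tool: the Voikko call is replaced by a small stub dictionary so the example runs on its own, and the class and method names are hypothetical.

```python
from collections import Counter


class FallbackLemmatizer:
    """Try a base lemmatizer first, then a custom list, and record the misses."""

    def __init__(self, base_lemmatize, custom_forms):
        self.base_lemmatize = base_lemmatize  # callable: word -> list of baseforms
        self.custom_forms = custom_forms      # dict: word form -> baseform
        self.unrecognized = Counter()

    def lemmatize(self, word):
        candidates = self.base_lemmatize(word)
        if candidates:
            return candidates[0]              # take the first suggested baseform
        if word in self.custom_forms:
            return self.custom_forms[word]
        self.unrecognized[word] += 1          # remember it for later review
        return word

    def unrecognized_by_frequency(self):
        """Unrecognized words, most frequent first."""
        return self.unrecognized.most_common()


# Stub standing in for Voikko; the entries are made up for the example.
stub = {"auton": ["auto"], "autot": ["auto"]}
lemmatizer = FallbackLemmatizer(
    base_lemmatize=lambda word: stub.get(word, []),
    custom_forms={"serviisin": "serviisi", "serviisille": "serviisi"},
)

words = ["auton", "serviisin", "atksla", "liikel", "atksla"]
lemmas = [lemmatizer.lemmatize(w) for w in words]
print(lemmas)
print(lemmatizer.unrecognized_by_frequency())
```

Sorting the misses by frequency means the custom list can be extended starting from the words that affect the most text.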

Results after change:

serviisi vs tietotekniikka = 0.8719592094421387
serviisi vs työterveys = 0.7782909870147705
serviisi vs johtokunta = 0.695137619972229
serviisi vs liikelaitos = 0.6921887397766113
serviisi vs 19.6.213 = 0.6853622794151306
serviisi vs tilakeskus = 0.673351526260376
serviisi vs jätehuolto = 0.6718368530273438
serviisi vs pelastusliikelaitos = 0.6589146852493286
serviisi vs oulu-koilismaan = 0.6495324969291687
serviisi vs bid=2300 = 0.6414187550544739

Or another run:

serviisi vs tietotekniikka = 0.864517867565155
serviisi vs työterveys = 0.7482070326805115
serviisi vs pelastusliikelaitos = 0.7050554156303406
serviisi vs liikelaitos = 0.6591876149177551
serviisi vs oulu-koillismaa = 0.6580390334129333
serviisi vs bid=2300 = 0.6545186638832092
serviisi vs bid=2379 = 0.6458192467689514
serviisi vs johtokunta = 0.6431671380996704
serviisi vs rakennusomaisuus = 0.6401894092559814
serviisi vs tilakeskus = 0.6375274062156677

So what are all these?

tietotekniikka = city office for ICT
työterveys = occupational health services
liikelaitos = company
johtokunta = board (of directors)
konttori = office
tilakeskus = space center
pelastusliikelaitos = emergency office
energia = energy
oulu-koilismaan = name of area surrounding the city
bid=2300 is an identifier for one of the Serviisi board meeting minutes main pages.
19.6.213 seems to be a mistyped date and could be found in at least one of the documents listing decisions by different city boards.

So almost all of the words that “serviisi” is found to be closest to are other city offices/companies responsible for different aspects of the city, such as ICT, energy, office space, emergency response, or occupational health. Makes sense.

OK, so much for the experimental runs. I should summarize something about this.

The Wikipedia model seems to give slightly better results in terms of the suggested words being valid words. For the city board minutes I should probably filter more based on the presence of special characters and numbers. Maybe this is a difference between larger and smaller datasets, where the “garbage” more easily drowns in the larger sea of data. Don’t know.
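A filter of this kind could look something like the following Python sketch. The rule used here (letters only, with optional hyphens between parts) is just one assumption about what counts as a valid word, and `keep_word_tokens` is a hypothetical helper.

```python
import re

# Letters only (any alphabet), optionally joined by single hyphens.
WORD_ONLY_RE = re.compile(r"^[^\W\d_]+(?:-[^\W\d_]+)*$")


def keep_word_tokens(tokens):
    """Drop tokens containing digits or special characters."""
    return [t for t in tokens if WORD_ONLY_RE.match(t)]


tokens = ["serviisi", "19.6.213", "bid=2300", "oulu-koillismaa", "työterveys"]
print(keep_word_tokens(tokens))
```

This keeps hyphenated names such as “oulu-koillismaa” while dropping tokens like “bid=2300”. It would also drop legitimate tokens with digits, so the rule should be adjusted to the dataset.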

The word2vec algorithm also has a set of parameters to tune, which probably would be worth more investigation to get more optimized results for these different types of datasets. I simply used the same settings for both the city minutes and Wikipedia. Yet due to size differences, likely it would be interesting to play at least with the size of the vector space. For example, bigger datasets might benefit more from having a bigger vector space, which should enable them to express richer relations between different words. For smaller sets, a smaller space might be better. Similarly, number of processing iterations, minimum word frequencies etc should be tried a bit more. For me the goal here was to get a general idea on how this works and how to use it with Finnish datasets. For this, these experiments are enough.

If you read any articles on Word2Vec you will likely also see the hype about the ability to do equations such as “king – man + woman” = “queen”. These come from training on large English corpora. It simply says that the relation of the word “queen” to the word “woman” in sentences is typically the same as the relation of the word “king” to “man”. But then this is often the only example, or one of very few examples, ever given. Looking at the city minutes example here, since “serviisi” seems to map closest to all the other offices/companies of the city, what do we get if we run the arithmetic on “serviisi-liikelaitos” (so liikelaitos would be the common concept of the office/company)? I got things like “city traffic”, “reduce”, “children home”, “citizen specific”, “greenhouse gas”. Not really useful. So this seems most useful as a potential tool for exploration, but I cannot really say which part gives useful results when. But of course, it is nice to report on the interesting abstractions it finds, not on boring fails.
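The arithmetic itself is simple to sketch in Python. The vectors below are made up and deliberately constructed so that the analogy works out; they show the mechanics, not how a real model behaves, and as the “serviisi” attempt shows, real models are much noisier. The `analogy` helper is hypothetical, and it leaves the input words out of the candidates, which is an assumption made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(positive, negative, vectors, topn=3):
    """Add the `positive` vectors, subtract the `negative` ones, rank the rest."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine_similarity(query, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors: roughly (royalty, gender, fruitiness).
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man": [0.1, 0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

result = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(result)
```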

I think lemmatization in these cases I showed here makes sense. I have no interest in just knowing that a singular form of a word is related to a plural form of the same word. But I guess in some use cases that could be valid. Of course, for proper lemmatization you might also wish to first do POS tagging to be able to choose the correct baseforms from all the options presented. In this case I just took the first baseform from the list Voikko gives for each word.

Tokenization could also be of more interest. The Finnish language has a lot of compound words, some of which are visible in the above examples. For example, “kuorma-auto” and “linja-auto” for the Wikipedia example, or the different “liikelaitos” combinations for the City of Oulu version. N-grams (combinations of words) would also be useful to investigate further. For example, “energia” in the city example could easily be related to the city power company called “Oulun Energia”. Many similar examples can likely be found all over any language and domain vocabulary.
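One simple way to experiment with such word combinations is to merge frequent adjacent word pairs into single tokens before training. The Python sketch below uses a plain count threshold, which is an assumption for illustration; more careful phrase detection scores pairs relative to how common the individual words are. The function names and example sentences are made up.

```python
from collections import Counter


def frequent_bigrams(sentences, min_count=2):
    """Return the set of adjacent word pairs seen at least `min_count` times."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams, sep="_"):
    """Join known bigrams into single tokens, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentences = [
    ["oulun", "energia", "myy", "sähköä"],
    ["oulun", "energia", "investoi"],
    ["energia", "on", "kallista"],
]
bigrams = frequent_bigrams(sentences)
print([merge_bigrams(s, bigrams) for s in sentences])
```

After merging, “oulun_energia” gets its own vector, separate from plain “energia”.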

Further custom spelling would also be useful. For example, “oulu-koilismaan” above could be spelled as “oulu-koillismaan”. And it could further be baseformed with other forms of itself as “oulu-koillismaa”. Collecting these from the list of unrecognized words, and filtering out the low-frequency ones, should make this relatively easy.

So perhaps the most interesting question, What is this good for?

Not synonym search. Somehow over time I got the idea word2vec could give you some kind of synonyms and such. Clearly it is not for that, but rather for identifying words around similar concepts and the like.

So generally I can see it could be useful for exploring related concepts in documents. Or generally exploring datasets and building concept maps, search definitions, etc. More as an input to human expert work than as something fully automated, as the results vary quite a bit.

Some interesting applications I found while looking at this:

Word2vec in Google type search, as well as search in general.
Exploring associations between medical terms. Perhaps helpful to identify new links you did not think of before? Likely would match other similar domains as well.
Mapping words in different languages together.
Spotify mapping similar songs together via treating songs as words and playlists as sentences.
Someone tried it on sentiment analysis. Not really sure how useful that was as I just skimmed the article but in general I can see how it could be useful to find different types of words related to sentiments. As before, not necessarily as automated input but rather as input to an expert to build more detailed models.
Using the similarity score weights as a means to find different topics. Maybe you could combine this with topic modelling and then look for diversity of topics?
Product recommendations by using products as words and sequences of purchases as sentences. Not sure how much the order of purchases means, but an interesting idea.
Bet recommendations by modelling bets made by users, with bet targets as words and sequences of bets as sentences, finding similarities with other bets to recommend.
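The “items as words, sequences as sentences” trick in several of these applications is mostly a data preparation step. As a rough Python sketch under assumed inputs (the event tuples and the `purchase_sentences` name are made up), per-user purchase sequences can be built like this and then fed to any word2vec implementation in place of tokenized sentences:

```python
from collections import defaultdict


def purchase_sentences(events):
    """Group (user_id, timestamp, product_id) events into one product sequence per user."""
    by_user = defaultdict(list)
    for user, timestamp, product in events:
        by_user[user].append((timestamp, product))
    # Sort each user's purchases by time so the "sentence" follows purchase order.
    return [
        [product for _, product in sorted(items)]
        for _, items in sorted(by_user.items())
    ]


events = [
    ("u1", 2, "milk"),
    ("u2", 1, "tea"),
    ("u1", 1, "bread"),
    ("u2", 3, "cup"),
]
print(purchase_sentences(events))
```

The same shape works for playlists (songs as words) or bets (bet targets as words).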

So that was mostly that. Similar tools exist for many platforms, whatever gives you the kicks. For example, Voikko has a Python module on GitHub, and Gensim is a nice tool for many NLP tasks, including Word2Vec, on Python.

There are also lots of pretrained word2vec-style models available, especially for the English language. For example, Facebook’s FastText, Stanford’s GloVe datasets, and the Google News corpus. Anyway, some simple internet searches should turn up many of these, which I think are useful for general-purpose results. For more detailed domain-specific results, training your own is good, as I did here for the city minutes.

Many tools can also take in word vector models built with some other tool. For example, deeplearning4j mentions import of GloVe models and Gensim lists support for FastText, VarEmbed and WordRank. So once you have some good idea of what such models can do and how to use them, building combinations of these is probably not too hard.

Published by Teemu
