To get a better view of the popular Word2vec algorithm and its applications in different contexts, I ran some experiments with Word2vec on Finnish-language text. Let’s see.
I used two datasets. The first one is the traditional Wikipedia dump. I got the dump of the Finnish Wikipedia from October 20th, because I ran the first experiments around that time. The second dataset was the board minutes of the City of Oulu for the past few years.
After running my cleaning code on the Wikipedia dump, it reported 600783 sentences and 6778245 words for the cleaned dump. Cleaning here refers to removing all the extra formatting, HTML tagging, etc. Sentences were tokenized using Voikko. For the board minutes the corresponding metrics were 4582 documents, 358711 sentences, and 986523 words. Most interesting, yes?
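As a rough illustration of what such a cleaning and counting step can look like, here is a minimal sketch. It is not the actual cleaning code: the regular expressions are a naive stand-in for Voikko's tokenizer, and the sample text is made up.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML/XML tags
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary (assumption)
WORD_RE = re.compile(r"\w+(?:-\w+)*")           # words, allowing internal hyphens


def clean(raw):
    """Strip tags, unescape HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def to_sentences(text):
    """Split cleaned text into sentences, each a list of lowercased words."""
    sentences = []
    for chunk in SENTENCE_END_RE.split(text):
        words = [w.lower() for w in WORD_RE.findall(chunk)]
        if words:
            sentences.append(words)
    return sentences


# Made-up sample input, only for illustration.
raw = "<p>Auto on ajoneuvo.</p> <p>Linja-auto kuljettaa matkustajia!</p>"
sentences = to_sentences(clean(raw))
n_words = sum(len(s) for s in sentences)
print(len(sentences), "sentences,", n_words, "words")
```

The output of `to_sentences` (a list of word lists) is also the shape that most word2vec implementations expect as training input.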
Again I have this question of whether to use lemmatization or not. Do I run the algorithm on baseformed words or just unprocessed words in different forms?
Some prefer to run it after lemmatization, while generally the articles on word2vec say nothing on the topic but rather seem to run it on raw text. This description of a similar algorithm actually shows an example of mapping “frog” to “frogs”, further indicating use of raw text. I guess if you have really a lot of data and a language that does not have a huge number of forms for different words, that makes more sense. Or if you find relations between forms of words more interesting.
For me, Finnish has so many forms of words (morphologies or whatever they should be called?) and generally I don’t expect to run with hundreds of billions of words of data, so I tried both ways (with and without lemmatization) to see. With my limited data and the properties of the Finnish language I would just go with lemmatization really, but it is always interesting to try and see.
Some results for my experiments:
Wikipedia without lemmatization, looking for the closest words to “auto”, which is Finnish for “car”. Top 10 results along with similarity score:
- auto vs kuorma = 0.6297630071640015
- auto vs akselin = 0.5929439067840576
- auto vs auton = 0.5811734199523926
- auto vs bussi = 0.5807990431785583
- auto vs rekka = 0.578578531742096
- auto vs linja = 0.5748337507247925
- auto vs työ = 0.562477171421051
- auto vs autonkuljettaja = 0.5613142848014832
- auto vs rekkajono = 0.5595266222953796
- auto vs moottorin = 0.5471497774124146
Words from above translated:
- kuorma = load
- akselin = axle’s
- auton = car’s
- bussi = bus
- rekka = truck
- linja = line
- työ = work
- autonkuljettaja = car driver
- rekkajono = truck queue
- moottorin = engine’s
A similarity score of 1 would mean a perfect match, and 0 no similarity at all (these are typically cosine similarities, so strictly speaking the score can also go negative, down to -1 for vectors pointing in opposite directions). Word2vec builds a model representing the position of words in “vector-space”. These positions are the “word-embeddings”. This sounds fancy, and as usual, it is difficult to find a simple explanation of what is done. I view it as taking typically 100-300 numbers to represent each word’s relation in the “word-space”. These get adjusted by the algorithm as it goes through all the sentences and records each word’s relation to other words in those sentences. Probably all wrong in that explanation but until someone gives a better one..
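To make the similarity score a bit more concrete, here is a minimal sketch, assuming the scores above are cosine similarities between word vectors (which is what common word2vec implementations report). The 3-number vectors are made up for illustration; a real model would have the 100-300 numbers per word mentioned above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings", not values from any trained model.
auto = [0.9, 0.1, 0.3]
rekka = [0.8, 0.2, 0.4]
rakennus = [0.1, 0.9, 0.2]

print(cosine_similarity(auto, rekka))     # vectors pointing the same way: close to 1
print(cosine_similarity(auto, rakennus))  # vectors pointing elsewhere: much lower
```

Only the direction of the vectors matters here, not their length, which is why the score stays between -1 and 1.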
To preprocess the documents for word2vec, I split the documents into sentences to give the words a more meaningful context (a sentence vs just any surrounding words). There are other similar techniques, such as GloVe, that may work better with more global “context” than a sentence. But anyway this time I was playing with Word2vec, which I think is also interesting for many things. It also has lots of implementations and popularity.
Looking at the results above, there is the word “auton”, translating to “car’s”. The Finnish language has a large number of forms that different words can take. So, sometimes, it may be good to lemmatize to see what the meaning of the word maps to, rather than just matching different forms of words. So I lemmatized with Voikko, the Finnish language lemmatizer, again. Re-run of the above, top 10:
- auto vs ajoneuvo = 0.7123048901557922
- auto vs juna = 0.6993820667266846
- auto vs rekka = 0.6949941515922546
- auto vs ajaa = 0.6905277967453003
- auto vs matkustaja = 0.6886627674102783
- auto vs tarkoitettu = 0.66249680519104
- auto vs rakennettu = 0.6570218801498413
- auto vs kuljetus = 0.6499230861663818
- auto vs rakennus = 0.6315782070159912
- auto vs alus = 0.6273047924041748
Meanings of the words in English:
- ajoneuvo = vehicle
- juna = train
- rekka = truck
- ajaa = drive
- matkustaja = passenger
- tarkoitettu = meant
- rakennettu = built
- kuljetus = transport
- rakennus = building
- alus = ship
So generally these mappings make some sense. Not sure about those building words. Some deeper exploration would probably help..
Some people also came up with the idea of POS tagging before running word2vec, and called it Sense2Vec. The idea is that you can better differentiate how different meanings of a word map differently. So I tried POS tagging with the tagger I implemented before. Results:
- auto_N vs juna_N = 0.7195479869842529
- auto_N vs ajoneuvo_N = 0.6762610077857971
- auto_N vs alus_N = 0.6689988970756531
- auto_N vs kone_N = 0.6615594029426575
- auto_N vs kuorma_N = 0.6477057933807373
- auto_N vs tie_N = 0.6470917463302612
- auto_N vs seinä_N = 0.6453390717506409
- auto_N vs kuljettaja_N = 0.6449363827705383
- auto_N vs matka_N = 0.6337422728538513
- auto_N vs pää_N = 0.6313328146934509
Meanings of the words in English:
- juna = train
- ajoneuvo = vehicle
- alus = ship
- kone = machine
- kuorma = load
- tie = road
- seinä = wall
- kuljettaja = driver
- matka = trip
- pää = head
Soo… The weirdest ones here are the wall and head parts. Perhaps again a deeper exploration would tell more. The rest seem to make some sense just by looking.
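Tokens like “auto_N” in the POS-tagged results come from gluing the tag onto each word before training, so that the same spelling with a different tag becomes a different “word” for the algorithm. A minimal sketch of that step follows. The lookup table is a hypothetical stand-in for a real tagger, which would look at the whole sentence rather than one word at a time.

```python
# Hypothetical tag lookup standing in for a real POS tagger (assumption).
TOY_TAGS = {"auto": "N", "juna": "N", "ajaa": "V", "nopea": "A"}


def tag_tokens(tokens, tagger=TOY_TAGS.get, unknown="X"):
    """Append a POS tag to each token, e.g. 'auto' -> 'auto_N'."""
    return [f"{token}_{tagger(token) or unknown}" for token in tokens]


tagged = tag_tokens(["auto", "ajaa", "nopea", "serviisi"])
print(tagged)
```

The tagged sentences are then fed to word2vec exactly like plain sentences.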
Then the same for the City of Oulu board minutes, this time looking for a word specific to the domain. The word is “serviisi”, which is the city office responsible for food production for different facilities and schools. This time lemmatization was applied for all results. Results:
- serviisi vs tietotekniikka = 0.7979459762573242
- serviisi vs työterveys = 0.7201094031333923
- serviisi vs pelastusliikelaitos = 0.6803742051124573
- serviisi vs kehittämisvisio = 0.678106427192688
- serviisi vs liikel = 0.6737961769104004
- serviisi vs jätehuolto = 0.6682301163673401
- serviisi vs serviisin = 0.6641604900360107
- serviisi vs konttori = 0.6479293704032898
- serviisi vs efekto = 0.6455909013748169
- serviisi vs atksla = 0.6436249017715454
Because “serviisi” is a very domain-specific word/name here, the general-purpose Finnish lemmatization does not work for it. This is why “serviisin” is there again. To fix this, I added this and some other forms of the word to the list of custom spellings recognized by my lemmatizer tool. That is, the tool first tries Voikko, and if the word is not found there, it tries a lookup in a custom list. If the word is still not found, it goes into a list of all unrecognized words, which is written out sorted by highest frequency first (to allow augmenting the custom list more effectively).
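The fallback chain described above can be sketched roughly like this. This is not the actual tool: `toy_analyze` is a made-up stand-in for a morphological analyzer, the list-of-dicts shape with a `BASEFORM` key is an assumption modelled on Voikko's Python bindings, and the custom forms are examples.

```python
from collections import Counter


def make_lemmatizer(analyze, custom_forms):
    """Build a lemmatize() function plus a counter of unrecognized words."""
    unrecognized = Counter()

    def lemmatize(word):
        analyses = analyze(word)
        if analyses:
            # Take the first baseform offered, without disambiguation.
            return analyses[0]["BASEFORM"].lower()
        if word in custom_forms:
            return custom_forms[word]
        unrecognized[word] += 1  # remember it for extending the custom list
        return word

    return lemmatize, unrecognized


def toy_analyze(word):
    """Made-up analyzer that only knows two forms of 'auto'."""
    known = {"auton": "auto", "autot": "auto"}
    return [{"BASEFORM": known[word]}] if word in known else []


# Example custom spellings (hypothetical list).
custom = {"serviisin": "serviisi", "serviisille": "serviisi"}
lemmatize, unrecognized = make_lemmatizer(toy_analyze, custom)

words = ["auton", "serviisin", "atksla", "atksla", "liikel"]
lemmas = [lemmatize(w) for w in words]
print(lemmas)
print(unrecognized.most_common())  # highest frequency first
```

Going through `unrecognized.most_common()` from the top is what makes extending the custom list effective: the most frequent misses get fixed first.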
Results after change:
- serviisi vs tietotekniikka = 0.8719592094421387
- serviisi vs työterveys = 0.7782909870147705
- serviisi vs johtokunta = 0.695137619972229
- serviisi vs liikelaitos = 0.6921887397766113
- serviisi vs 19.6.213 = 0.6853622794151306
- serviisi vs tilakeskus = 0.673351526260376
- serviisi vs jätehuolto = 0.6718368530273438
- serviisi vs pelastusliikelaitos = 0.6589146852493286
- serviisi vs oulu-koilismaan = 0.6495324969291687
- serviisi vs bid=2300 = 0.6414187550544739
Or another run:
- serviisi vs tietotekniikka = 0.864517867565155
- serviisi vs työterveys = 0.7482070326805115
- serviisi vs pelastusliikelaitos = 0.7050554156303406
- serviisi vs liikelaitos = 0.6591876149177551
- serviisi vs oulu-koillismaa = 0.6580390334129333
- serviisi vs bid=2300 = 0.6545186638832092
- serviisi vs bid=2379 = 0.6458192467689514
- serviisi vs johtokunta = 0.6431671380996704
- serviisi vs rakennusomaisuus = 0.6401894092559814
- serviisi vs tilakeskus = 0.6375274062156677
So what are all these?
- tietotekniikka = city office for ICT
- työterveys = occupational health services
- liikelaitos = company
- johtokunta = board (of directors)
- konttori = office
- tilakeskus = space center
- pelastusliikelaitos = emergency office
- energia = energy
- oulu-koilismaan = name of area surrounding the city
- bid=2300 is an identifier for one of the main pages of the Serviisi board meeting minutes.
- 19.6.213 seems to be a typoed date and could at least be found in one of the documents listing decisions by different city boards.
So almost all of the words that “serviisi” is found to be closest to are other city offices/companies responsible for different aspects of the city, such as ICT, energy, office space, emergency response, or occupational health. Makes sense.
OK, so much for the experimental runs. I should summarize something about this.
The Wikipedia model seems to give slightly better results, in the sense that the words it suggests are valid words. For the city board minutes I should probably filter more based on the presence of special characters and numbers. Maybe this is the difference between larger and smaller datasets, where the “garbage” more easily drowns in the larger sea of data. Don’t know.
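Filtering on special characters and numbers could be as simple as the following sketch. The rule used here (letters only, with hyphens allowed between parts) is just one possible assumption and would also throw away legitimate tokens such as years or product codes.

```python
def looks_like_word(token):
    """True if the token is made of letters, optionally joined by hyphens."""
    return all(part.isalpha() for part in token.split("-"))


tokens = ["serviisi", "työterveys", "bid=2300", "19.6.213", "oulu-koillismaa"]
kept = [t for t in tokens if looks_like_word(t)]
print(kept)
```

Since `str.isalpha()` works on Unicode letters, words with “ä” and “ö” pass the filter as they should.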
The word2vec algorithm also has a set of parameters to tune, which would probably be worth more investigation to get better results for these different types of datasets. I simply used the same settings for both the city minutes and Wikipedia. Yet due to the size difference, it would likely be interesting to play at least with the size of the vector space. For example, bigger datasets might benefit more from having a bigger vector space, which should enable them to express richer relations between different words. For smaller sets, a smaller space might be better. Similarly, the number of processing iterations, minimum word frequencies, etc. should be tried a bit more. For me the goal here was to get a general idea of how this works and how to use it with Finnish datasets. For this, these experiments are enough.
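As a sketch of what such tuning could look like with Gensim, the parameters can be picked based on corpus size. The threshold and the values below are arbitrary starting points for experimentation, not settings used in these experiments or recommended anywhere. The keyword names follow the Gensim 4.x `Word2Vec` constructor.

```python
def word2vec_params(n_words):
    """Pick hypothetical starting parameters based on corpus size in words."""
    if n_words < 2_000_000:
        # Smaller corpus: smaller vector space, keep rarer words, more passes.
        return {"vector_size": 100, "window": 5, "min_count": 2, "epochs": 20}
    # Larger corpus: room for a bigger vector space, drop rare words.
    return {"vector_size": 300, "window": 5, "min_count": 5, "epochs": 5}


def train(sentences, n_words):
    """Train a model on a list of tokenized sentences (requires Gensim 4.x)."""
    from gensim.models import Word2Vec  # third-party, imported only when training

    return Word2Vec(sentences=sentences, workers=4, **word2vec_params(n_words))


print(word2vec_params(986523))   # size of the city minutes corpus
print(word2vec_params(6778245))  # size of the Wikipedia corpus
```

With a trained model, something like `model.wv.most_similar("auto", topn=10)` gives top-10 lists of the kind shown earlier.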
If you read any articles on Word2vec you will likely also see the hype on the ability to do equations such as “king – man + woman” = “queen”. These come from training on large English corpora. It simply says that the relation of the word “queen” to the word “woman” in sentences is typically the same as the relation of the word “king” to “man”. But then this is often the only example, or one of very few examples, ever given. Looking at the city minutes example here, since “serviisi” seems to map closest to all the other offices/companies of the city, what do we get if we run the arithmetic on “serviisi-liikelaitos” (so liikelaitos would be the common concept of the office/company)? I got things like “city traffic”, “reduce”, “children home”, “citizen specific”, “greenhouse gas”. Not really useful. So this seems most useful as a potential tool for exploration, but I cannot really say which part gives useful results when. But of course, it is nicer to report on the interesting abstractions it finds than on the boring fails.
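The word arithmetic itself is nothing more than adding and subtracting the vectors and then looking for the nearest remaining word. A minimal sketch with hand-made toy vectors, constructed so that the example works out (a trained model gives no such guarantee, as the “serviisi” attempt shows):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # The input words themselves are excluded from the answer.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hand-made toy vectors with dimensions (royal, male, female); not trained.
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "car": [0.1, 0.1, 0.1],
}
result = analogy(vectors, positive=["king", "woman"], negative=["man"])
print(result)
```

In Gensim the corresponding query is `model.wv.most_similar(positive=[...], negative=[...])`, so “serviisi-liikelaitos” would be `positive=["serviisi"], negative=["liikelaitos"]`.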
I think lemmatization in these cases I showed here makes sense. I have no interest in just knowing that a singular form of a word is related to a plural form of the same word. But I guess in some use cases that could be valid. Of course, for proper lemmatization you might also wish to first do POS tagging to be able to choose the correct baseforms from all the options presented. In this case I just took the first baseform from the list Voikko gives for each word.
Tokenization could also be of more interest. The Finnish language has a lot of compound words, some of which are visible in the above examples. For example, “kuorma-auto” and “linja-auto” in the Wikipedia example. Or the different “liikelaitos” combinations in the City of Oulu version. N-grams (combinations of words) would also be useful to investigate further. For example, “energia” in the city example could easily be related to the city power company called “Oulun Energia”. Many similar examples can likely be found all over any language and domain vocabulary.
Further custom spelling would also be useful. For example, “oulu-koilismaan” above could be spelled as “oulu-koillismaan”. And it could further be baseformed with other forms of itself as “oulu-koillismaa”. Collecting these from the list of unrecognized words, after filtering out the low-frequency ones, should make this relatively easy.
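One simple way to experiment with n-grams is to merge word pairs that occur together often into a single token before training. The sketch below uses a plain count threshold on made-up lemmatized sentences. Tools such as Gensim's `Phrases` do a more careful, score-based version of the same idea.

```python
from collections import Counter


def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent word pairs seen at least min_count times into one token."""
    counts = Counter()
    for sentence in sentences:
        counts.update(zip(sentence, sentence[1:]))
    frequent = {pair for pair, count in counts.items() if count >= min_count}

    merged = []
    for sentence in sentences:
        out, i = [], 0
        while i < len(sentence):
            if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in frequent:
                out.append(sentence[i] + "_" + sentence[i + 1])
                i += 2  # skip both halves of the merged pair
            else:
                out.append(sentence[i])
                i += 1
        merged.append(out)
    return merged


# Made-up lemmatized sentences, only for illustration.
sentences = [
    ["oulun", "energia", "tuottaa", "lämpö"],
    ["johtokunta", "käsitellä", "oulun", "energia"],
    ["energia", "olla", "kallis"],
]
merged = merge_frequent_bigrams(sentences)
print(merged)
```

After merging, “oulun_energia” (the company) and plain “energia” get their own separate vectors.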
So perhaps the most interesting question: what is this good for?
Not synonym search. Somehow over time I got the idea that word2vec could give you some kind of synonyms and such. Clearly it is not for that, but rather for identifying words related to similar concepts and the like.
So generally I can see it could be useful for exploring related concepts in documents. Or generally exploring datasets and building concept maps, search definitions, etc. More as an input to human expert work than as something fully automated, as the results vary quite a bit.
Some interesting applications I found while looking at this:
- Word2vec in Google type search, as well as search in general.
- Exploring associations between medical terms. Perhaps helpful to identify new links you did not think of before? Likely would match other similar domains as well.
- Mapping words in different languages together.
- Spotify mapping similar songs together via treating songs as words and playlists as sentences.
- Someone tried it on sentiment analysis. Not really sure how useful that was as I just skimmed the article but in general I can see how it could be useful to find different types of words related to sentiments. As before, not necessarily as automated input but rather as input to an expert to build more detailed models.
- Using the similarity score weights as a means to find different topics. Maybe you could combine this with topic modelling and then look for diversity of topics?
- Product recommendations by using products as words and sequences of purchases as sentences. Not sure how much the order of purchases matters, but an interesting idea.
- Bet recommendations by modelling the bets made by users, with bet targets as words and sequences of bets as sentences, and finding similar bets to recommend.
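The songs, products and bets examples above all share the same trick: turn each user's (or playlist's) sequence of items into a “sentence” of item identifiers and train on those. A minimal sketch of that preprocessing, with made-up event data and field layout:

```python
from collections import defaultdict


def events_to_sentences(events):
    """Group (user_id, timestamp, item_id) events into per-user item sequences."""
    by_user = defaultdict(list)
    for user, _timestamp, item in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append(item)
    # A single-item sequence gives no context to learn from, so drop those.
    return [items for items in by_user.values() if len(items) > 1]


# Made-up purchase events: (user, time, product).
events = [
    ("u1", 3, "tent"), ("u1", 1, "boots"), ("u2", 1, "boots"),
    ("u1", 2, "socks"), ("u2", 2, "map"), ("u3", 1, "tent"),
]
item_sentences = events_to_sentences(events)
print(item_sentences)
```

The resulting lists can be given to a word2vec implementation as if they were tokenized sentences, and “most similar word” then means “most similar product”.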
So that was mostly that. Similar tools exist for many platforms, whatever gives you the kicks. For example, Voikko has a Python module on GitHub, and Gensim is a nice tool for many NLP tasks, including Word2vec, on Python.
There are also lots of pretrained word2vec-style models to use, especially for the English language. For example, Facebook’s FastText, Stanford’s GloVe, and the Google News corpus vectors. Anyway, some simple internet searches should turn up many such models, which I think are useful for general-purpose results. For more detailed domain-specific ones, training your own is good, as I did here for the city minutes.
Many tools can also take in word vector models built with some other tool. For example, deeplearning4j mentions import of GloVe models and Gensim lists support for FastText, VarEmbed and WordRank. So once you have some good idea of what such models can do and how to use them, building combinations of these is probably not too hard.
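Part of why this kind of exchange is easy is that the plain-text vector formats are very simple: one word per line followed by its numbers. The word2vec text format adds a header line with the vocabulary size and the number of dimensions, while GloVe text files have no header. A minimal reader sketch that handles both, with made-up sample data (in practice a library loader, such as Gensim's `KeyedVectors.load_word2vec_format`, is the better choice):

```python
import io


def read_text_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if i == 0 and len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            continue  # word2vec-style header: "<vocabulary size> <dimensions>"
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Made-up sample in word2vec text format (with a header line).
sample = io.StringIO("2 3\nauto 0.1 0.2 0.3\njuna 0.4 0.5 0.6\n")
loaded = read_text_vectors(sample)
print(loaded)
```

The same function works on an open file object, since it only iterates over lines.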