Previously I wrote about a few experiments I ran with topic modelling. I only briefly mentioned some results for a set of Finnish text as an example of a smaller dataset. This is a slightly deeper look into that.
I use two datasets, the Finnish Wikipedia dump and the city of Oulu board minutes, the same ones I used before. Previously I covered topic modelling more generally, so I won’t go into too much detail here. To summarize, topic modelling algorithms (of which LDA, or Latent Dirichlet Allocation, is used here) find sets of words with different distributions over sets of documents. These are then called the “topics” discussed in those documents.
This post looks at how to use topic models for a language other than English, and what one could do with the results.
Lemmatize (turn words into baseforms before use) or not? I choose to lemmatize for topic modelling. This seems to be the general consensus when looking up info on topic modelling, and in my experience it just gives better results as the same word appears only once. I covered POS tagging previously, and I believe it would be useful to apply here as well, but I don’t. Mostly because it is not needed to test these concepts, and I find the results are good enough without adding POS tagging to the mix (which has its issues as I discussed before). Simplicity is nice.
I used the Python Gensim package for building the topic models. As input, I used the Finnish Wikipedia text and the city of Oulu board minutes texts. I used my existing text extractor and lemmatizer for these (to get the raw text out of the HTML pages and PDF docs, and to baseform them, as discussed in my previous posts). I dumped the lemmatized raw text into files using slight modifications of my previous Java code, and then read the docs from those files as input to Gensim in a Python script.
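A minimal sketch of what such a Python script could look like. This is not the exact script behind the results in this post. It assumes one lemmatized document per line in a UTF-8 text file and plain whitespace tokenization, and the function names are made up for illustration. The Gensim calls used are `corpora.Dictionary`, `doc2bow`, `models.LdaModel`, and `show_topic`.

```python
from pathlib import Path


def read_docs(path):
    """Read one lemmatized document per line and split it into tokens."""
    docs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        tokens = line.lower().split()
        if tokens:  # skip empty lines
            docs.append(tokens)
    return docs


def train_lda(docs, num_topics=50, passes=1):
    """Train an LDA model on tokenized documents with Gensim."""
    # Imported here so that read_docs() also works without Gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(docs)               # token <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors
    return models.LdaModel(corpus, id2word=dictionary,
                           num_topics=num_topics, passes=passes)


def print_topics(lda, topn=10):
    """Print the top words of every topic, one topic per line."""
    for topic_id in range(lda.num_topics):
        words = [word for word, _ in lda.show_topic(topic_id, topn=topn)]
        print("topic%d=%s" % (topic_id, " ".join(words)))
```

With a hypothetical input file `docs.txt`, something like `print_topics(train_lda(read_docs("docs.txt"), num_topics=50, passes=1))` would print one line per topic. Expect the topics to differ between runs unless a random seed is fixed.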
I started with the Finnish Wikipedia dump, using Gensim to provide 50 topics, with 1 pass over the corpus. The first topics that I got (topic0 to topic10):
- topic0=focus var liivi luku html murre verkkoversio alku joten http
- topic1=viro substantiivi gen part taivutus tyyppi täysi taivutustyyppi liite rakenne
- topic2=isku pieni tms aine väri raha suuri helppo saattaa heprea
- topic3=suomi suku substantiivi pudottaa kasvi käännös luokka sana kieli taivutusmuoto
- topic4=ohjaus white off black red sotilas fraasi yellow perinteinen flycatcher
- topic5=lati eesti www http keele eki lähde dict sõnaraamat tallinn
- topic6=suomi käännös substantiivi aihe muualla sana liittyvä etymologi viite kieli
- topic7=italia substantiivi japani inarinsaame kohta yhteys vaatekappale rinnakkaismuoto taas voimakas
- topic8=sana liittyvä substantiivi ruotsi synonyymi alas etymologi liikuttaa johdos yhdyssana
- topic9=juuri des jumala tadžikki tuntea tekijä tulo mitta jatkuva levy
- topic10=törmätä user sur self hallita voittaa piste data harjoittaa jstak
The format of the topic list I used here is “topicX=word1[count] word2[count]”, where X is the number of the topic, word1 is the first word in the topic, word2 the second, and so on. The [count] is how many times the word was associated with the topic in different documents. Consider it the strength, weight, or whatever of the word in the topic.
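As a small illustration of that format, here is a sketch of a formatter that turns a list of (word, weight) pairs into one such line. The function name and the weights in the usage note are hypothetical, made up only to show the shape of the output.

```python
def format_topic(topic_id, weighted_words):
    """Render one topic as 'topicX=word1[weight] word2[weight] ...'."""
    parts = ["%s[%s]" % (word, weight) for word, weight in weighted_words]
    return "topic%d=%s" % (topic_id, " ".join(parts))
```

For example, `format_topic(3, [("tontti", 120), ("kaupunki", 95)])` gives `topic3=tontti[120] kaupunki[95]`.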
So just a few notes on the above topic list:
- topic0 = mostly website related terms, interleaved with a few odd ones. Examples of odd ones: “liivi” = vest, “luku” = number/chapter (POS tagging would help differentiate), “murre” = dialect.
- topic1 = mostly Finnish language related terms. “viro” = Estonia, which is slightly odd to have here. Estonian is the closest related language to Finnish, but still.
- topic3 = another Finnish language related topic. The odd one here is “kasvi” = plant. Generally this seems to be more related to words and their forms, whereas topic1 is maybe more about structure and relations.
- topic5 = Estonia related
Overall, I think this would improve given more passes over the corpus to train the model. This would give the algorithm more time and data to refine the model. I only ran it with one pass here since the training for more topics and with more passes started taking days and I did not have the resources to go there.
My guess is also that with more data and broader concepts (Wikipedia covering pretty much every topic there is) you would also need more topics than the 50 I used here. However, I had to limit the size due to time and resource constraints. Gensim probably also has more advanced tuning options (e.g., parallel runs) that would benefit the speed. So I tried a few more sizes and passes with the smaller Oulu city board dataset, as it was faster to run.
Some topics for the city of Oulu board minutes, run for 20 topics and 20 passes over the training data:
- topic0=oulu kaupunki kaupunginhallitus 2013 päivämäärä vuosi päätösesitys jäsen hallitus tieto
- topic1=kunta palvelu asiakaspalvelu yhteinen viranomainen laki valtio myös asiakaspalvelupiste kaupallinen
- topic2=oulu palvelu kaupunki koulu tukea edistää vuosi osa nuori toiminta
- topic3=tontti kaupunki oulu asemakaava rakennus kaupunginhallitus päivämäärä yhdyskuntalautakunta muutos alue
- topic5=kaupunginhallitus päätös jäsen oulu kaupunki pöytäkirja klo päivämäärä oikaisuvaatimus matti
- topic6=000 2012 oulu muu tilikausi vuosi yhde kunta 2011 00000
- topic8=alue asemakaava rakentaa tulla oleva rakennus merkittävä kortteli oulunsalo nykyinen
- topic10=asiakirjat.ouka.fi ktwebbin 2016 eet pk_asil_tweb.htm? ktwebscr dbisa.dll url=http doctype =3&docid
- topic11=yhtiö osake osakas energia hallitus 18.11.2013 liite lomautus sähkö osakassopimus
- topic12=13.05.2013 perlacon kuntatalousfoorumi =1418 meeting_date=21.3.2013 =2070 meeting_date=28.5.2013 =11358 meeting_date=3.10.2016 -31.8.2015
- topic13=001 oulu 002 kaupunki sivu ��� palvelu the asua and
Some notes on the topics above:
- The word “oulu” repeats in most of the topics. This is quite natural as all the documents are from the board of the city of Oulu. Depending on the use case for the topics, it might be useful to add this word to the list of words to be removed in the pre-cleaning phase for the documents before running the topic modelling algorithm. Or it might be useful information, along with the weight of the word inside the topic. Depends.
- topic0 = generally about the board structure. For example, “kaupunki”=city, “kaupunginhallitus”=city board, “päivämäärä”=date, “päätösesitys”=proposal for decision.
- topic1 = Mostly city service related words. For example, “kunta” = municipality, “palvelu” = service, “asiakaspalvelu” = customer service. Also “myös” = also, which is another word to add to the cleaning list.
- topic2 = School related. For example, “koulu” = school, “tukea” = support, … Sharing again common words such as “kaupunki” = city, which may also be considered for removal or not depending on the case.
- topic3 = City area planning related. For example, “tontti” = plot of land, “asemakaava” = zoning plan, …
- In general the topics here are quite good and focused, so I think this is quite a good result overall. Some exceptions to consider:
- topic10 = mostly garbage related to HTML formatting and website link structures. Still a real topic of course, so nicely identified. I think it is something to consider adding to the cleaning list for pre-processing.
- topic12 = Seems related to some city finance related consultation (perlacon seems to be such a company) and an associated event (the forum), with a bunch of meeting dates.
- topic13 = unclear garbage
- So in general, I guess reasonably good results but in real applications, several iterations of fine-tuning the words, the topic modelling algorithm parameters, etc. based on the results would be very useful.
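The pre-cleaning suggested in the notes above could be sketched as follows. This is a rough sketch under assumptions, not the cleaning code used for these results: the extra stopword set, the junk pattern, and the 90% document-frequency limit are all hypothetical choices that would need tuning against the actual data.

```python
import re
from collections import Counter

# Hypothetical, dataset-specific additions to a general Finnish stopword list.
EXTRA_STOPWORDS = frozenset({"oulu", "kaupunki", "myös"})

# Tokens that look like bare numbers/dates, URL or query fragments, or file names.
JUNK_PATTERN = re.compile(r"^[\d.\-=]+$|[=?/]|\.(fi|htm|html|dll)$")


def clean_docs(docs, stopwords=EXTRA_STOPWORDS, max_doc_ratio=0.9):
    """Drop stopwords, junk tokens, and words present in almost every document."""
    # Document frequency: in how many documents each word appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    limit = max_doc_ratio * len(docs)
    cleaned = []
    for doc in docs:
        cleaned.append([
            word for word in doc
            if word not in stopwords
            and not JUNK_PATTERN.search(word)
            and doc_freq[word] <= limit
        ])
    return cleaned
```

Whether a word like “oulu” belongs in the stopword set depends on the use case, as noted above, so the set is best kept as an explicit parameter that gets refined between runs.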
So that was the city minutes topics for a smaller set of topics and more passes. What does it look like for 100 topics, and how does the number of passes over the corpus affect the larger size? More passes should give the algorithm more time to refine the topics, but smaller datasets might not have so many good topics.
For 100 topics, 1 pass, the first topics (topic0 to topic10):
- topic0=oulu kaupunki 000 sivu palvelu alue vuosi muu uusi tavoite
- topic1=kaupunki oulu jäsen 000 kaupunginhallitus kaupunginjohtaja klo muu vuosi takaus
- topic2=hallitus oulu 25.03.2013 kaupunginhallitus jäsen varsinainen tilintarkastaja kaupunki valita yhtiökokousedustaja
- topic3=kuntalisä oulu palkkatuki kaupunki tervahovi henkilö tukea yritys kaupunginhallitus työtön
- topic4=koulu oulu sahantie 000 äänestyspaikka maikkulan kaupunki kirjasto monitoimitalo kello
- topic5=oulu kaupunki euro kaupunginhallitus 2013 vuosi milj palvelu kunta uusi
- topic6=000 oulu kaupunki vuosi 2012 muu kunta muutos 2013 sivu
- topic7=000 26.03.2013 oulu 2012 kunta vuosi kirjastojärjestelmä muu kaupunki muutos
- topic8=oulu kaupunki kaupunginhallitus 2013 päivämäärä päätös vuosi tieto 000 päätösesitys
- topic9=oulu lomautus 000 kaupunki säästötoimenpidevapaa vuosi kunta kaupunginhallitus sivu henkilöstö
- topic10=oulu kaupunki alue sivu rakennus asemakaava vuosi tontti 2013 osa
Without going too much into translating every word, I would say these results are too spread out, so from this, for this dataset, it seems a smaller set of topics would do better. This also seems to be visible in the word counts/strengths in the [square brackets]. The topics with small weights also seem pretty poor topics, while the ones with bigger weights look better (just my opinion of course :)). Maybe something to consider when trying to explore the number of topics etc.
And the same run, this time with 20 passes over the corpus (100 topics and 10 first ones shown):
- topic0=oulu kaupunki palvelu toiminta kehittää myös tavoite osa vuosi toteuttaa
- topic1=-seurantatieto 2008-2010 =30065 =170189 =257121 =38760 =13408 oulu 000 kaupunki
- topic2=harmaa tilaajavastuulaki tilaajavastuu.fi torjunta -palvelu talous harmaantalous -30.4.2014 hankintayksikkö kilpailu
- topic3=juhlavuosi 15.45 perussopimus reilu kauppa juhlatoimikunta työpaja 24.2.2014 18.48 tapahtumatuki
- topic4=kokous kaupunginhallitus päätös pöytäkirja työjärjestys hyväksyä tarkastaja esityslista valin päätösvaltaisuus
- topic5=koulu sivistys- suuralue perusopetus tilakeskus kulttuurilautakunta järjestää korvensuora päiväkota päiväkoti
- topic6=piste hanke toimittaja hankesuunnitelma tila toteuttaa hiukkavaara hyvinvointikeskus tilakeskus monitoimitalo
- topic7=tiedekeskus museo- prosenttipohjainen taidehankinta uudisrakennushanke hankintamääräraha prosenttitaide hankintaprosessi toteutusajankohta ulosvuokrattava
- topic8=euro milj vuosi oulu talousarvio tilinpäätös kaupunginhallitus kaupunki 2012 2013
- topic9=päätös oikaisuvaatimus oulu kaupunki päivä voi kaupunginhallitus posti pöytäkirja viimeinen
Even the smaller topics here seem much better now with the increase in passes over the corpus. So perhaps the real difference just comes from having enough passes over the data, giving the algorithms more time and data to refine the models. At least I would not try without multiple passes based on comparing the results here of 1 vs 20 passes.
For example, topic2 here has small numbers but still all items seem related to grey market economy. Similarly, topic7 has small numbers but the words are mostly related to arts and culture.
So to summarize, it seems that lemmatizing your words, exploring your parameters, and making sure you have a decent amount of data and a decent number of passes for the algorithm are all good points. The same goes for properly cleaning your data, and iterating over the process many times to get these right (well, as “right” as you can).
To answer my “research questions” from the beginning: topic modelling for different languages and use cases for topic modelling.
First, lemmatize all your data (I prefer it over stemming, but it can be more resource intensive). Clean your data of the typical stopwords for your language, but also of those specific to your dataset and domain. Run the models and analysis several times, and keep refining your list of removed words based on your use case, your dataset, and your domain. You likely also need to consider domain specific lemmatization rules, as I already discussed with POS tagging.
Secondly, what use cases did I find when looking online? Actually, it seems really hard to find concrete reports of actual uses for topic models. Quora has usually been promising but not so much this time. So I looked at reports in published research papers instead, trying to see if any companies were involved as well.
Some potential use cases from research papers:
Bug localization, as in finding the locations of bugs in source code, is investigated here. Source code (comments, source code identifiers, etc.) is modelled as topics, which are mapped to a query created from a bug report.
Matching duplicates of documents is investigated here. Topic distributions over bug reports are used to suggest duplicate bug reports: not exact duplicates, but reports describing the same bug. If the topic distributions are close, flag them as potentially discussing the same “topic” (bug).
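To make “close” a bit more concrete, here is a minimal sketch that compares topic distributions with the Hellinger distance. The choice of distance, the threshold of 0.2, and the report ids are my assumptions for illustration, not something taken from the paper.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two topic distributions (0 = identical, 1 = disjoint)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def find_duplicate_candidates(reports, threshold=0.2):
    """Return pairs of report ids whose topic distributions are close."""
    ids = sorted(reports)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if hellinger(reports[first], reports[second]) < threshold:
                pairs.append((first, second))
    return pairs
```

Here `reports` maps a report id to its topic distribution, given as a list of probabilities with one entry per topic. Comparing all pairs is quadratic in the number of reports, so a large bug database would need some form of indexing or pre-filtering first.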
Ericsson has used topic models to map incoming bug reports to specific components, to make resolving bugs easier and faster by automatically assigning them to the (correct) teams for resolution. Large historical datasets of bug reports and their assignments to components are used to learn the topic models. The topic distribution of an incoming bug report is compared to the topic distributions of previous bug reports for each component, giving a probability ranking of which component the report describes. Topic distributions are also used as explanatory data to present to the expert looking at the classification results. Later, different approaches are reported at Ericsson as well. So this is just a reminder that topic models are not the answer to everything, even if they are useful components and worth a try in places.
In cyber security, this uses topic models to describe user activity as distributions over the different topics. Learn topic models from user activity logs, and describe each user’s typical activity as a topic distribution. If a log entry (e.g., session?) diverges too much from this topic distribution for the user, flag it as an anomaly to investigate. I would expect simpler things could work for this as well, but as input for anomaly detection, it is an interesting thought.
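The “diverges too much” check could be sketched like this, using the Jensen-Shannon divergence between a session and the mean of the user's earlier sessions. The divergence measure, the averaging, and the threshold of 0.3 are assumptions of mine and not necessarily what the paper does.

```python
import math


def mean_distribution(distributions):
    """Average a list of topic distributions into one typical profile."""
    count = len(distributions)
    return [sum(values) / count for values in zip(*distributions)]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = no overlap)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return (kl_divergence(p, mid) + kl_divergence(q, mid)) / 2


def is_anomalous(session, history, threshold=0.3):
    """Flag a session whose topics diverge too much from the user's history."""
    return js_divergence(session, mean_distribution(history)) > threshold
```

Here `history` is a list of topic distributions from the user's earlier sessions and `session` is the distribution of the new one, all with one probability per topic.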
Tweet analysis is popular in NLP. This is an example of high-level tweet topic classification: Politics, sports, science, … Useful input for recommendations etc., I am sure. A more targeted domain specific example is of using topics in Typhoon related tweet analysis and classification: Worried, damage, food, rescue operations, flood, … useful input for situation awareness, I would expect. As far as I understood, topic models were generated, labeled, and then users (or tweets) assigned to the (high-level) topics by topic distributions. Tweets are very small documents, so that is something to consider, as discussed in those papers.
Use of topic models in biomedicine for text analysis, for example to find patterns (topic distributions) in papers discussing specific genes. This could work more broadly as one tool to explore research in an area, to find clusters of concepts in broad sets of research papers on a specific “topic” (here, research on a specific gene). Of course, there likely exist a number of other techniques to investigate for that as well, but topic models could have potential.
Generally labelling and categorizing large numbers of historical/archival documents to assist users in search. Build topic models, have experts review them, and give the topics labels. Then label your documents based on their topic distributions.
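The last step, labelling documents from their topic distributions, could look roughly like the sketch below. The labels, the topic ids, and the minimum weight of 0.2 are hypothetical. The document's topic distribution is assumed to be a dict from topic id to probability, which could for example be built from the (topic id, probability) pairs returned by Gensim's `get_document_topics`.

```python
# Hypothetical labels that an expert assigned to topic ids after reviewing them.
TOPIC_LABELS = {0: "board procedure", 3: "zoning and land use", 8: "city finances"}


def label_document(topic_distribution, labels=TOPIC_LABELS, min_weight=0.2):
    """Return labels of all topics carrying at least min_weight of a document."""
    ranked = sorted(topic_distribution.items(),
                    key=lambda item: item[1], reverse=True)
    # Topics without an expert label fall back to a generic 'topicN' name.
    return [labels.get(topic_id, "topic%d" % topic_id)
            for topic_id, weight in ranked if weight >= min_weight]
```

For example, `label_document({3: 0.6, 8: 0.3, 0: 0.1})` gives `["zoning and land use", "city finances"]`, strongest topic first.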
Bit further outside the box, split songs into segments based on their acoustic properties, and use topic modelling to identify different categories/types of music in large song databases. Then explore the popularity of such categories/types over time based on topic distributions over time. So the segments are your words, and the songs are your documents.
Finding duplicate images in large datasets. Use image features as words, and images as documents. Build topic models from all the images, and find similar types of images by their topic distributions. Features could be edges, or even abstract ones such as those learned by something like a convolutional neural net. This assists in image search, I guess.
Most of these uses seem to be various types of search assistance, with a few odd ones thinking outside the box. With a decent understanding, and some exploration, I think topic models can be useful in many places. The academics would say “dude, XYZ would work just as well”. Sure, but if it does the job for me, and is simple and easy to apply...