Calculating Croatian Word Frequency from Wikipedia
I like learning foreign languages and I am currently developing an application to support that endeavour. When you learn a foreign language with a textbook, the textbook gives you a guideline on what grammar and which words to learn. However, in my experience textbooks become rarer after level A2 or B1, and when you want to practice the language with real texts you’re on your own.
While not strictly necessary, I think it’s useful to know what the most common words in a foreign language are and whether you have already memorized them. It gives you a feeling for what you already know, which texts you might want to read and where you can improve. For example, I was quite successful using NHK Easy to practice Japanese once I had reached about 80% of Kanji coverage on their site.
Previously, I had already searched a bit for lists with the most important Croatian vocabulary, but didn’t find anything useful. I found some websites showing about 10 or 100 common words, but I’m interested in lists of 1000, 2000 or 5000 words. There’s also a GitHub project for Wikipedia Word Frequencies, but it does not include Croatian. Moreover, it does not lemmatize the words, which means that for example in the English list you’ll find run, runs, running and so on. That doesn’t help me if I want to compare it to my Anki list of vocabulary. It might detect run as already known, but still list runs and running as unknown.
Thus, I started a simple project to create the lists myself. This article shows how it’s done.
Wikipedia Dumps
Since Wikipedia is quite a big corpus of text data and provides XML dumps, it’s a very good starting point. There are a few mirrors you can choose from to download the XML files.
The -pages-articles.xml.bz2 dumps contain all articles. They consist of some overall site information and a list of pages, each of them with a list of revisions. A revision is an update of the contents of a page, so the revisions are basically the history of the page.
According to my analysis, the -pages-articles.xml.bz2 dump seems to contain exactly one revision per page, which makes it quite simple to use.
I used mwxml to read the dump. mwxml expects a file pointer to a Wikipedia dump, and then you can iterate over the pages and the revisions. In their example they load a plain XML file (dump.xml), but Wikipedia dumps are of course compressed. You might now be tempted to unpack them first, but that’s unnecessary: with Python’s bz2 module we can transparently decompress the dump file on the fly.
import bz2
import mwxml

# Open the compressed dump; bz2.open decompresses transparently while reading
with bz2.open('wiki_raw/hrwiki-20231101-pages-articles.xml.bz2') as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:
        # Each page yields its revisions; the articles dump has exactly one per page
        revisions = list(page)
        assert len(revisions) == 1
        print(f'{page.id} - {page.title} - Revision {revisions[0].id}')
The text content (not printed in this example) can be read from the text attribute of a revision.
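For example, the loop above could print a short excerpt of each article instead (a minimal sketch building on the dump variable from the previous snippet; note that revision.text can be None for some pages):
for page in dump:
    for revision in page:
        if revision.text is not None:
            # Print the first 200 characters of the raw wikitext as a sanity check
            print(revision.text[:200])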
Parsing Wikipedia Pages
Wikipedia pages are not raw text: they contain markup for templates, links and so on. Such “code” elements would pollute our word frequency data set. They might not occur often enough to make it into the top 5000 list, but for very frequent tags that doesn’t seem impossible.
To parse Wikipedia pages we can use mwparserfromhell, another Python library.
It expects the Wikipedia page text as input and returns a Wikicode object (a class from mwparserfromhell). This object has a method strip_code() which removes all unprintable code, exactly what we need.
import mwparserfromhell

parsed_page = mwparserfromhell.parse(revision.text)
text = parsed_page.strip_code()  # plain text without templates, links and other markup
Standardizing Words
As mentioned, I don’t want inflections of words to appear in my vocabulary list: if a text contains the term running, it should only be listed in the vocabulary list as run.
Standardization of words into their base form is called lemmatization.
During my work with foreign language texts I have had very good experiences with stanza, an NLP library with support for many languages that is based on machine learning models.
For Croatian (and some other languages) there is a dedicated fork called classla with some improvements. Classla says it achieves an F1 score of 99.17% on lemmatization for Slovenian.
They don’t seem to keep up to date with the upstream project, though. At the time of writing this article, classla is 2563 commits behind stanza. That’s why I decided to use stanza in the first version of my script: it supports many languages with a single API. I might switch to classla in the future to benefit from the improved accuracy, but at the moment I don’t want to run into possible API inconsistencies between classla and stanza.
Enough said, let’s see how we can lemmatize our texts and count words using stanza:
from collections import Counter
import stanza

c = Counter()
pipeline = stanza.Pipeline('hr', processors='tokenize,pos,lemma')

doc = pipeline(text)
for sentence in doc.sentences:
    for token in sentence.tokens:
        token_data = token.to_dict()[0]
        if (
            'lemma' in token_data
            and 'upos' in token_data
            and token_data['upos'].lower() not in ['punct', 'sym']
        ):
            c[token_data['lemma']] += 1
Now c holds the word counts for all words of the text.
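To sanity-check the result, we can look at the most frequent lemmas with Counter’s most_common() method:
# Print the ten most frequent lemmas and their counts
for lemma, count in c.most_common(10):
    print(f'{lemma}: {count}')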
Finalizing
That’s basically it. Now you only have to stitch together the different parts and aggregate the counts across all texts. Since parsing the whole dump takes some time, it makes sense to write intermediate results so that you can continue in case of an application crash.
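As a rough sketch of how that stitching could look (count_words is a hypothetical helper wrapping the stanza snippet above, and the checkpoint file name and interval are arbitrary choices):
import bz2
import json
from collections import Counter

import mwparserfromhell
import mwxml

total = Counter()

with bz2.open('wiki_raw/hrwiki-20231101-pages-articles.xml.bz2') as f:
    dump = mwxml.Dump.from_file(f)
    for i, page in enumerate(dump):
        revision = next(iter(page))
        if revision.text is None:
            continue
        text = mwparserfromhell.parse(revision.text).strip_code()
        total.update(count_words(text))  # hypothetical helper: lemmatize and count with stanza

        # Write a checkpoint every 1000 pages so a crash doesn't lose all progress
        if i > 0 and i % 1000 == 0:
            with open('word_counts_checkpoint.json', 'w', encoding='utf-8') as out:
                json.dump(dict(total), out, ensure_ascii=False)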
You can find the published lists up to the 5000 most common words in my Git repository polyglotstats.
There are still some issues with the word frequency lists. For example, they list numbers, which I’d rather not include, because they’re not words. Even if you come from a language without Arabic numerals, they’re single characters, not whole words, and you’ll quickly learn them.
Moreover, the lists also contain special symbols (for example = or }), and even some HTML tags like <small> made it onto the list. I’ll have to debug why these were not filtered out by the punct and sym filters.
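A possible additional filter, which I haven’t verified against the real data yet, would be to require that a lemma consists of letters only; str.isalpha() is Unicode-aware, so Croatian letters like č or đ still pass:
def is_word(lemma: str) -> bool:
    # Rejects digits, '=', '}' and anything containing '<' or '>',
    # while accepting accented letters such as č, ć, đ, š and ž
    return lemma.isalpha()

# In the counting loop, additionally require is_word(token_data['lemma'])
# before incrementing the counter.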
I’m not sure yet how useful the list of most common words on Wikipedia actually is for language learners. I guess that different lists from different data sources are required. For example, everyday conversation (taught by beginner textbooks), music lyrics or an encyclopedia like Wikipedia use very different vocabulary. In everyday conversation we need many words for things like groceries, songs usually talk about feelings, and encyclopedias are very history-driven (the Croatian word for war, rat, is among the 100 most common words on Wikipedia).