Calculating Croatian Word Frequency from Wikipedia
I like learning foreign languages and I am currently developing an application to support that endeavour. When you learn a foreign language with a textbook, the textbook gives you a guideline on what grammar and which words to learn. However, in my experience textbooks become rarer after level A2 or B1, and when you want to practice the language with real texts you’re on your own.
While not strictly necessary, I think it’s useful to know what the most common words in a foreign language are and whether you have already memorized them. It gives you a feeling for what you already know, which texts you might want to read and where you can improve. For example, I was quite successful using NHK Easy to practice Japanese once I had reached about 80% of Kanji coverage on their site.
Previously, I had already searched a bit for lists with the most important Croatian vocabulary, but didn’t find anything useful. I found some websites showing about 10 or 100 common words, but I’m interested in lists of 1000, 2000 or 5000 words. There’s also a GitHub project for Wikipedia Word Frequencies, but it does not include Croatian. Moreover, it does not lemmatize the words, which means that for example in the English list you’ll find run, runs, running and so on. That doesn’t help me if I want to compare it to my Anki list of vocabulary. It might detect run as already known, but still list runs and running as unknown.
Thus, I started a simple project to create the lists myself. This article shows how it’s done.
Wikipedia Dumps
Since Wikipedia is quite a big corpus of text data and provides XML dumps, it’s a very good starting point. There are a few mirrors you can choose from to download the XML files.
The -pages-articles.xml.bz2 dumps contain all articles. They consist of some overall site information and a list of pages, each of them with a list of revisions. A revision is an update of the contents of a page, so the revisions are basically the history of the page.
According to my analysis, the -pages-articles.xml.bz2 dump seems to contain exactly one revision per page, which makes it quite simple to use.
I used mwxml to read the dump. mwxml expects a file pointer to a Wikipedia dump, and then you can iterate over the pages and the revisions. In their example they load a plain XML file (dump.xml), but Wikipedia dumps are of course compressed. You might now be tempted to unpack them first, but that’s unnecessary: with Python’s bz2 module we can transparently decompress the dump file on the fly.
import bz2
import mwxml

# Open the compressed dump; bz2.open decompresses transparently while reading
with bz2.open('wiki_raw/hrwiki-20231101-pages-articles.xml.bz2') as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:
        # Each page yields its revisions; the articles dump has exactly one per page
        revisions = list(page)
        assert len(revisions) == 1
        print(f'{page.id} - {page.title} - Revision {revisions[0].id}')
The text content (not printed in this example) can be read from the text attribute of a revision.
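For example, the loop above could print a short excerpt of each article instead (a minimal sketch building on the dump variable from the previous snippet; note that revision.text can be None for some pages):
for page in dump:
    for revision in page:
        if revision.text is not None:
            # Print the first 200 characters of the raw wikitext as a sanity check
            print(revision.text[:200])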
Parsing Wikipedia Pages
Wikipedia pages are not raw text: they contain markup for templates, links and so on. Such “code” elements would pollute our word frequency data set. They might not occur often enough to make it into the top 5000 list, but for very frequent tags that doesn’t seem impossible.
To parse Wikipedia pages we can use mwparserfromhell, another Python library.
It expects the Wikipedia page text as input and returns a Wikicode object (a class from mwparserfromhell). This object has a method strip_code() which removes all unprintable code, exactly what we need.
import mwparserfromhell

parsed_page = mwparserfromhell.parse(revision.text)
text = parsed_page.strip_code()  # plain text without templates, links and other markup
Standardizing Words
As mentioned, I don’t want inflections of words to appear in my vocabulary list: if a text contains the term running, it should only be listed in the vocabulary list as run.
Standardization of words into their base form is called lemmatization.
During my work with foreign language texts I have had very good experiences with stanza, an NLP library with support for many languages that is based on machine learning models.
For Croatian (and some other languages) there is a dedicated fork called classla with some improvements. Classla says it achieves an F1 score of 99.17% on lemmatization for Slovenian.
They don’t seem to keep up to date with the upstream project, though. At the time of writing this article, classla is 2563 commits behind stanza. That’s why I decided to use stanza in the first version of my script: it supports many languages with a single API. I might switch to classla in the future to benefit from the improved accuracy, but at the moment I don’t want to run into possible API inconsistencies between classla and stanza.
Enough said, let’s see how we can lemmatize our texts and count words using stanza:
from collections import Counter
import stanza

c = Counter()
pipeline = stanza.Pipeline('hr', processors='tokenize,pos,lemma')

doc = pipeline(text)
for sentence in doc.sentences:
    for token in sentence.tokens:
        token_data = token.to_dict()[0]
        if (
            'lemma' in token_data
            and 'upos' in token_data
            and token_data['upos'].lower() not in ['punct', 'sym']
        ):
            c[token_data['lemma']] += 1
Now c holds the word counts for all words of the text.
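To sanity-check the result, we can look at the most frequent lemmas with Counter’s most_common() method:
# Print the ten most frequent lemmas and their counts
for lemma, count in c.most_common(10):
    print(f'{lemma}: {count}')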
Finalizing
That’s basically it. Now you only have to stitch together the different parts and aggregate the counts across all texts. Since parsing the whole dump takes some time, it makes sense to write intermediate results so that you can continue in case of an application crash.
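As a rough sketch of how that stitching could look (count_words is a hypothetical helper wrapping the stanza snippet above, and the checkpoint file name and interval are arbitrary choices):
import bz2
import json
from collections import Counter

import mwparserfromhell
import mwxml

total = Counter()

with bz2.open('wiki_raw/hrwiki-20231101-pages-articles.xml.bz2') as f:
    dump = mwxml.Dump.from_file(f)
    for i, page in enumerate(dump):
        revision = next(iter(page))
        if revision.text is None:
            continue
        text = mwparserfromhell.parse(revision.text).strip_code()
        total.update(count_words(text))  # hypothetical helper: lemmatize and count with stanza

        # Write a checkpoint every 1000 pages so a crash doesn't lose all progress
        if i > 0 and i % 1000 == 0:
            with open('word_counts_checkpoint.json', 'w', encoding='utf-8') as out:
                json.dump(dict(total), out, ensure_ascii=False)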
You can find the published lists up to the 5000 most common words in my Git repository polyglotstats.
There are still some issues with the word frequency lists. For example, they list numbers, which I’d rather not include, because they’re not words. Even if you come from a language without Arabic numerals, they’re single characters, not whole words, and you’ll quickly learn them.
Moreover, the lists also contain special symbols (for example = or }), and even some HTML tags like <small> made it onto the list. I’ll have to debug why these were not filtered out by the punct and sym filters.
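A possible additional filter, which I haven’t verified against the real data yet, would be to require that a lemma consists of letters only; str.isalpha() is Unicode-aware, so Croatian letters like č or đ still pass:
def is_word(lemma: str) -> bool:
    # Rejects digits, '=', '}' and anything containing '<' or '>',
    # while accepting accented letters such as č, ć, đ, š and ž
    return lemma.isalpha()

# In the counting loop, additionally require is_word(token_data['lemma'])
# before incrementing the counter.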
I’m not sure yet how useful the list of most common words on Wikipedia actually is for language learners. I guess that different lists from different data sources are required. For example, everyday conversation (taught by beginner textbooks), music lyrics or an encyclopedia like Wikipedia use very different vocabulary. In everyday conversation we need many words for things like groceries, songs usually talk about feelings, and encyclopedias are very history-driven (the Croatian word for war, rat, is among the 100 most common words on Wikipedia).