In 2018, I first blogged about the idea of subtitling Croatian TV shows using machine learning. Back then I used a Python program called autosub, which splits the audio data and sends it to Google’s public speech recognition engine. It was a cool idea, but the quality was not good enough to use in practice.

The goal was to support me in learning the Croatian language. Croatia is a small country and not many people learn the language (for more than touristic reasons), so it is hard to find good and fun material above the A1/A2 level. My skills were quite limited, but I still wanted to watch TV shows, so I had the idea of adding subtitles automatically.

My plans have evolved a bit since I first started this project. I want to go one step further: I’d like to connect the transcript to my Anki profile to exclude vocabulary I already know. This will allow me to list all words that I have to learn before watching a show.

Recently, OpenAI released a new speech recognition system called Whisper. Whisper supports Croatian with a word error rate (WER) of 16.7% (according to OpenAI) when using the best model; a WER of about 4-5% is said to be on par with a human listener. In a closer analysis, Microsoft observed slightly higher numbers, as far as I understand. For reference: for Spanish, Italian, English, Portuguese and German, OpenAI reports a WER of less than 6% with the best model. Japanese - another language I learn - follows closely with 6.4%. So, this model could be useful for learning endeavours in other languages, too.

But let’s look at how it works. After some more tests I might put it all into a git repository.

Whisper

Whisper is extremely simple to install. You need ffmpeg and Python. I always install software from PyPI into a virtual environment. Somebody has also already taken the time to create an AUR package for Arch Linux, but let’s stick to the manual approach. If you’re using another distribution, just replace the pacman command with your package manager’s equivalent.

pacman -S ffmpeg python
python -m venv venv
source venv/bin/activate
pip install wheel
pip install git+https://github.com/openai/whisper.git

This only installs the Whisper code; the model will be downloaded on demand.
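If you want to trigger the model download ahead of time, you can load the model once from Python. The project README shows a small Python API that can also transcribe directly; a minimal sketch:

import whisper

# downloads the model into a local cache on the first run
model = whisper.load_model("large")

# transcribe any audio file that ffmpeg can decode
result = model.transcribe("audio.mp3", language="hr")
print(result["text"])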

Using Whisper to subtitle a Croatian audio file is now simple enough:

whisper audio.mp3 --model large --language Croatian > audio.txt

This will write the detections to a file called audio.txt. There seems to be IO buffering involved, so don’t be surprised if you do not see any output in the file for a long time. There are ways to disable buffering on shell redirection, but I haven’t tried them. I am running the code on a CPU-only machine, where a single minute of audio requires a full hour of model runtime.
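One untested option for the buffering: the whisper command is a Python entry point, so Python’s own output buffering can be disabled through an environment variable:

PYTHONUNBUFFERED=1 whisper audio.mp3 --model large --language Croatian > audio.txt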

Lemmatization

For lemmatization of Croatian I found the ReLDI project, which also provides a web API, but Nikola Ljubešić recommended classla to me; according to him, classla performs better on lemmatization. Another advantage: classla can also be installed easily from PyPI.

We can continue in the same virtual environment:

pip install classla

For classla we need a tiny bit of Python code:

import classla
import sys

# download the Croatian models (cached after the first run)
classla.download('hr')
nlp = classla.Pipeline('hr')

# run the full pipeline on everything read from stdin
doc = nlp(sys.stdin.read())

# print one lemma per line
for word in doc.iter_words():
    print(word.lemma)

I am using the standard mode here, but I will have to run some experiments to see whether the non-standard mode performs better. Non-standard mode can be used with:

classla.download('hr', type='nonstandard')
nlp = classla.Pipeline('hr', type='nonstandard')

Non-standard mode is required, for example, if your text uses s instead of š and so on. This is common in short messages, but I’ve also seen it in song lyrics.

Anki

Anki uses an SQLite database. The schema is - to put it mildly - not according to best practices and not officially documented. Luckily, the developers of AnkiDroid, the Android app for Anki, took the time to document it.

To get a list of all vocabulary that has already been added to a specific deck, we need the tables cards, notes and decks. The deck name is stored in decks.name. The contents of a card, called its fields, are stored in notes.flds. This is a single string that can contain multiple fields, separated by the character 0x1f (31). We can use cards.ivl, the interval for the SRS algorithm, to find out whether a card has already been learnt: it is nonzero if a repetition interval has been set, which means the card is no longer completely new.

We can put this into a fairly simple SQL query that returns the first field of all cards in a deck that have already been learnt:

SELECT substr(n.flds, 0, instr(n.flds, char(31)))
FROM cards c
JOIN notes n ON c.nid = n.id
JOIN decks d ON c.did = d.id
WHERE d.name = ? COLLATE NOCASE AND c.ivl != 0

The SQLite file is stored in Anki’s data directory, in my case at $HOME/.local/share/Anki2/User\ 1/collection.anki2.
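The ? placeholder in the query is meant for parameter binding. If you prefer to pull the known words from Python instead of the sqlite3 shell, a minimal sketch (deck name and profile path are from my setup):

import sqlite3
from pathlib import Path

# adjust the profile name if yours is not "User 1"
db_path = Path.home() / ".local/share/Anki2/User 1/collection.anki2"

query = """
SELECT substr(n.flds, 0, instr(n.flds, char(31)))
FROM cards c
JOIN notes n ON c.nid = n.id
JOIN decks d ON c.did = d.id
WHERE d.name = ? COLLATE NOCASE AND c.ivl != 0
"""

con = sqlite3.connect(db_path)
# the deck name is bound to the ? placeholder
words = sorted({row[0] for row in con.execute(query, ("Kroatisch",))})
con.close()

for word in words:
    print(word)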

Combining Everything

Let’s put this all together. First, we run Whisper and get a text file with timestamps. We do not need the timestamps, so we can remove them with sed:

sed 's/\[[0-9.:>< -]*\] *//' audio.txt > text.txt

The resulting text file can then be fed into classla:

cat text.txt | python lemmatize.py | sort | uniq > words.txt

From the Anki database we can retrieve all learnt words:

sqlite3 "$HOME/.local/share/Anki2/User 1/collection.anki2" > /tmp/words_known.txt <<EOF
SELECT substr(n.flds, 0, instr(n.flds, char(31)))
FROM cards c
JOIN notes n ON c.nid = n.id
JOIN decks d ON c.did = d.id
WHERE d.name = 'Kroatisch' COLLATE NOCASE AND c.ivl != 0
EOF
cat /tmp/words_known.txt | sort | uniq > words_known.txt

Finally, we can compare both files with comm to list the lines that are missing from words_known.txt. comm requires sorted input, which is why we sorted earlier. comm outputs three columns: lines unique to file 1, lines unique to file 2, and lines that appear in both files. We only want the lines unique to file 1 (words from the video that are not in Anki yet), so we have to suppress columns two and three with -23.

comm -23 words.txt words_known.txt

This gives us all words that we still need to learn to understand the video.
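Put together, the whole pipeline could look roughly like this as a single shell script (file names and the deck name are taken from the steps above):

#!/bin/sh
# 1. speech recognition (very slow on CPU)
whisper audio.mp3 --model large --language Croatian > audio.txt

# 2. strip the timestamps
sed 's/\[[0-9.:>< -]*\] *//' audio.txt > text.txt

# 3. lemmatize and deduplicate
python lemmatize.py < text.txt | sort | uniq > words.txt

# 4. export the already learnt vocabulary from Anki
sqlite3 "$HOME/.local/share/Anki2/User 1/collection.anki2" <<SQL | sort | uniq > words_known.txt
SELECT substr(n.flds, 0, instr(n.flds, char(31)))
FROM cards c
JOIN notes n ON c.nid = n.id
JOIN decks d ON c.did = d.id
WHERE d.name = 'Kroatisch' COLLATE NOCASE AND c.ivl != 0
SQL

# 5. words in the video that are not in Anki yet
comm -23 words.txt words_known.txt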

The scripts in this article are still a bit rough. For example, not all words I know are listed in Anki, so I will need a second list of words too simple for Anki. I also noticed that not all my Anki data is clean, e.g. Anki sometimes kept a hidden <a></a> tag on a card. I will experiment with this idea, and if it works well, I might create a git repository with some scripts so that other people learning Croatian can benefit from it, too.
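As a first rough fix for both issues, the HTML remnants could be stripped with sed and an additional hand-maintained list merged in before the comparison. A sketch, where words_simple.txt is a hypothetical list of words I know but do not keep in Anki:

# strip leftover HTML tags such as <a></a>, then re-sort
sed 's/<[^>]*>//g' words_known.txt | sort | uniq > words_known_clean.txt

# merge in the hand-maintained list of too-simple words
cat words_known_clean.txt words_simple.txt | sort | uniq > words_known_all.txt

comm -23 words.txt words_known_all.txt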

Future

There is a C/C++ implementation of Whisper which also runs on CPU (in fact, it runs only on CPU) and requires less memory. This would allow me to run it on cheaper servers. Plus, I’m curious whether it is faster.
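Going by the whisper.cpp README - I have not tried this yet - usage would look roughly like this; note that whisper.cpp expects 16 kHz WAV input and the exact model file name may differ:

# convert to the 16 kHz mono WAV format whisper.cpp expects
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# run the C/C++ implementation with Croatian as the language
./main -m models/ggml-large.bin -l hr -f audio.wav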

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send an e-mail to blog@stefan-koch.name.