In 2018, I first blogged about the idea of subtitling Croatian TV shows using machine learning. Back then I used a Python program called autosub, which splits the audio and sends it to Google’s public speech recognition engine. It was a cool idea, but the quality was not good enough for practical use.
The goal was to support my learning of the Croatian language. Croatia is a small country and few people learn the language (for more than touristic reasons), so it is hard to find good and fun material above the A1/A2 level. My skills were quite limited, but I still wanted to watch TV shows, so I had the idea of adding subtitles automatically.
My plans have evolved a bit since I first started this project. I want to go one step further: I’d like to connect the transcript to my Anki profile to exclude vocabulary I already know. This will allow me to list all words that I have to learn before watching the show.
Recently, OpenAI released a new speech recognition system called Whisper. Whisper supports Croatian with a word error rate (WER) of 16.7% (according to OpenAI) when using the best model. A WER of about 4-5% is said to be on par with a human listener; in a closer analysis, Microsoft observed slightly higher numbers, as far as I understand. For reference, OpenAI reports a WER below 6% for Spanish, Italian, English, Portuguese and German with the best model. Japanese, another language I am learning, follows closely at 6.4%. So this model could be useful for learning other languages, too.
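For intuition: WER is the word-level edit distance between the recognizer’s output and a reference transcript, divided by the number of reference words. Here is a minimal sketch of my own (not OpenAI’s evaluation code, and the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# one substituted word out of three reference words -> WER of 1/3
print(wer("idem u grad", "idem u graad"))
```

A WER of 16.7% thus roughly means one in six words is wrong, which is noticeable but often still readable in context.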
But let’s look at how it works. After some more tests I might put it all into a git repository.
Whisper is extremely simple to install. You need ffmpeg and Python. I always install software from PyPI into a virtual environment. Somebody has already taken the time to create an AUR package for Arch Linux, but let’s stick to the manual approach. If you’re using another distribution, just replace the pacman commands with your package manager’s equivalent.
pacman -S ffmpeg python-venv
python -m venv venv
source venv/bin/activate
pip install wheel
pip install git+https://github.com/openai/whisper.git
This only installs the Whisper code; the model is downloaded on demand.
Using Whisper to subtitle a Croatian audio file is now simple enough:
whisper audio.mp3 --model large --language Croatian > audio.txt
This will write the detections to a file called audio.txt. I think there is I/O buffering involved, so don’t be surprised if you do not see any output in the file for a long time. There seem to be ways to disable buffering on shell redirection, but I haven’t tried them. I am running the code on a CPU-only machine, where a single minute of audio requires a full hour of model runtime.
For lemmatization in Croatian I found the ReLDI project, which also provides a web API, but Nikola Ljubešić recommended classla to me. According to him, classla performs better on lemmatization. Another advantage: classla can also be installed easily from PyPI.
We can continue in the same virtual environment:
pip install classla
For classla we need a tiny bit of Python code:
import classla
import sys

classla.download('hr')
nlp = classla.Pipeline('hr')
doc = nlp(sys.stdin.read())
for word in doc.iter_words():
    print(word.lemma)
I am using the standard mode here, but I will have to do some experiments on whether the non-standard mode is better or not. Non-standard mode can be used with:
classla.download('hr', type='nonstandard')
nlp = classla.Pipeline('hr', type='nonstandard')
Non-standard mode is, for example, required if your text uses s instead of š and so on. This is common in short messages, but I’ve also seen it in
Anki uses an SQLite database. The schema is, to put it humbly, not according to best practices and not officially documented. Luckily, the Android app for Anki took the time to document it.
To get a list of all vocabulary that has already been added to a specific deck we need the tables cards, notes and decks. The deck name is specified in decks.name. The contents of a card, called its fields, are stored in notes.flds. This is a single string that can contain multiple fields, separated by the character 0x1f (31). We can use cards.ivl, the interval for the SRS algorithm, to find out whether a card has already been learnt. It is non-zero if a repetition interval is specified, which means the card is not completely new anymore.
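In Python, splitting such a flds string is a one-liner. A small sketch (the two-field note below is my own example, not real Anki data):

```python
FIELD_SEP = "\x1f"  # 0x1f separates the fields of a note in notes.flds

def first_field(flds: str) -> str:
    """Return the first field of a note's raw flds string."""
    return flds.split(FIELD_SEP)[0]

# a hypothetical two-field note: front (Croatian word) and back (translation)
raw = "kuća\x1fhouse"
print(first_field(raw))  # kuća
```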
We can put this into a simple enough SQL query to get the first field of all cards in a deck that have already been learnt:
SELECT substr(n.flds, 0, instr(n.flds, char(31)))
FROM cards c
JOIN notes n ON c.nid = n.id
JOIN decks d ON c.did = d.id
WHERE d.name = ? COLLATE NOCASE AND c.ivl != 0
The SQLite file is stored in Anki’s data directory, in my case at .local/share/Anki2/User 1/collection.anki2.
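To see the query in action without touching a real collection, we can run it against an in-memory stand-in. The schema below is reduced to just the columns the query uses (the real Anki schema has many more), and the sample rows are invented:

```python
import sqlite3

# In-memory stand-in for collection.anki2, reduced to the columns we need.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE decks (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE notes (id INTEGER PRIMARY KEY, flds TEXT);
CREATE TABLE cards (id INTEGER PRIMARY KEY, nid INTEGER, did INTEGER, ivl INTEGER);
""")
db.execute("INSERT INTO decks VALUES (1, 'Kroatisch')")
db.execute("INSERT INTO notes VALUES (10, 'kuća' || char(31) || 'house')")
db.execute("INSERT INTO notes VALUES (11, 'pas' || char(31) || 'dog')")
db.execute("INSERT INTO cards VALUES (100, 10, 1, 5)")  # learnt: ivl != 0
db.execute("INSERT INTO cards VALUES (101, 11, 1, 0)")  # still new: ivl == 0

rows = db.execute("""
    SELECT substr(n.flds, 0, instr(n.flds, char(31)))
    FROM cards c
    JOIN notes n ON c.nid = n.id
    JOIN decks d ON c.did = d.id
    WHERE d.name = ? COLLATE NOCASE AND c.ivl != 0
""", ("kroatisch",)).fetchall()
print([r[0] for r in rows])  # only the learnt card's first field: ['kuća']
```

Note the substr trick: instr returns the 1-based position of the 0x1f separator, and substr with start 0 and that length yields exactly the characters before it.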
Let’s put this all together. First, we run Whisper and get a text file with time stamps. We do not need the time stamps, so we can remove them with sed:
sed 's/\[[0-9.:>< -]*\] *//' audio.txt > text.txt
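If sed is not at hand, the same stripping works in Python with the identical pattern. A sketch, with a made-up sample line in Whisper’s timestamp format:

```python
import re

# Matches a timestamp prefix such as "[00:01.000 --> 00:07.000]  "
TIMESTAMP = re.compile(r"\[[0-9.:>< -]*\] *")

def strip_timestamps(line: str) -> str:
    """Remove a leading Whisper-style timestamp from a transcript line."""
    return TIMESTAMP.sub("", line)

print(strip_timestamps("[00:01.000 --> 00:07.000]  Dobar dan"))  # Dobar dan
```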
The resulting text file can then be fed into classla:
cat text.txt | python lemmatize.py | sort | uniq > words.txt
From the Anki database we can retrieve all learnt words:
sqlite3 .local/share/Anki2/User\ 1/collection.anki2 > /tmp/words_known.txt <<EOF
SELECT substr(n.flds, 0, instr(n.flds, char(31)))
FROM cards c
JOIN notes n ON c.nid = n.id
JOIN decks d ON c.did = d.id
WHERE d.name = 'Kroatisch' COLLATE NOCASE AND c.ivl != 0
EOF
cat /tmp/words_known.txt | sort | uniq > words_known.txt
Finally, we can compare both files with comm to list the lines that are not present in words_known.txt. comm requires sorted files, which is why we sorted previously. comm outputs three columns: lines unique to file 1, lines unique to file 2, and lines that appear in both files. We only want to show lines unique to file 1 (words from the video that are not in Anki yet). This means we have to suppress columns two and three:
comm -23 words.txt words_known.txt
This gives us all words that we still need to learn to understand the video.
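The same comparison can also be expressed as a set difference in Python, which does not care about sort order. A sketch with invented word lists standing in for the two files:

```python
def words_to_learn(video_words, known_words):
    """Lemmas occurring in the video that are not yet in Anki."""
    return sorted(set(video_words) - set(known_words))

# hypothetical stand-ins for words.txt and words_known.txt
video = ["kuća", "pas", "mačka"]
known = ["pas"]
print(words_to_learn(video, known))  # ['kuća', 'mačka']
```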
The scripts in this article are still a bit rough. For example, not all words I know are listed in Anki, so I will need a second list of words too simple for Anki. I also spotted that not all my Anki data is clean, e.g. Anki sometimes kept a hidden <a></a> tag on a card. I will experiment with this idea, and if it works well, I might create a git repository with some scripts. Then other people learning Croatian can profit from it, too.
There is a C/C++ implementation of Whisper which also runs on the CPU (in fact, it runs only on the CPU) and requires less memory. This would allow me to run it on cheaper servers. Plus, I’m curious whether it is faster.