I’ve always found it hard to keep learning a language after finishing the introductory textbooks. There are many textbooks for levels A1, A2 and B1, but my impression is that from B2 onwards there is not much left. Besides, I usually learn foreign languages to understand original content, so wouldn’t it be perfect to jump into original content as quickly as possible?

One hindrance to this is a lack of vocabulary. I find it extremely frustrating when I just don’t know enough words to understand whatever I want to read or listen to. Switching back and forth between text and dictionary also becomes tiresome quite quickly.

That’s why I’m currently developing an application to connect vocabulary knowledge with texts. It allows you to paste a foreign language text and then highlights unknown words. It also supports an in-program vocabulary lookup. This workflow seems to work really well for me.

In this short tutorial I want to explain how I use it to improve foreign language skills with news shows (to be honest, watching the news is still quite a bit too difficult, but thanks to my application I learn new words each time).

Small Rant

Due to recent court cases in Germany, I’m not willing to take any risk and won’t name any applications that simplify downloading videos from websites. In case you’re not from here and are wondering what’s wrong: in the past few months we’ve had two court decisions that the IT community considers very, very wrong. Both are related to linking and a concept called Störerhaftung: one against a Swiss DNS provider and one against a website that linked to a GitHub repository for a program that simplifies downloading videos.

Subtitles and Script with Whisper

So, let’s assume we have a video of whatever news show we’re interested in, because the news show provided a download button on their website. To prepare for watching the show, we should get a text script for it. Watching might also be a bit easier with subtitles.

To achieve this we can use whisper.cpp.

The installation is straightforward. I usually use it with the largest model and run it on a large cloud server at Hetzner which I shut down afterwards. While writing this tutorial, for example, I’m using a CCX33 server with dedicated CPU, costing about 10 cents per hour.

apt-get install build-essential make
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
make
bash ./models/download-ggml-model.sh large

Next, we need to encode the audio in the right format for whisper.cpp. Fortunately, the README gives us the required command for this. The input file can be audio or video.

ffmpeg -i <input-file> -ar 16000 -ac 1 -c:a pcm_s16le audio-for-whisper.wav

If we have several news shows that we want to watch, we can run the following bash loop over all files:

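for file in *; do
    # Skip WAV files so a second run does not try to convert its own output.
    [ "${file##*.}" = "wav" ] && continue
    # Quote the output name so file names with spaces work.
    ffmpeg -i "$file" -ar 16000 -ac 1 -c:a pcm_s16le "${file%.*}.wav"
done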

We can then create the text script for the news shows as well as subtitles in one go with whisper! I’m setting the number of threads to 8 because my server has 8 cores, and the language to hr because I’m learning Croatian. -otxt instructs whisper to generate a text script, -osrt writes subtitles in SRT format.

./main -otxt -osrt -l hr -t 8 -m models/ggml-large.bin -f audio-for-whisper.wav
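Since the conversion loop above produces one WAV file per show, the transcription step can be looped in the same way. The following is only a minimal sketch under a few assumptions: the function name transcribe_all and the variables WHISPER_BIN and WHISPER_MODEL are hypothetical names chosen for this example, not part of whisper.cpp, and the flags are simply the ones from the command above.

```shell
#!/usr/bin/env bash
# Sketch: run whisper.cpp over every WAV file in a directory.
# WHISPER_BIN, WHISPER_MODEL and transcribe_all are hypothetical names
# for this example; adjust the paths to your own checkout.
WHISPER_BIN="${WHISPER_BIN:-./main}"
WHISPER_MODEL="${WHISPER_MODEL:-models/ggml-large.bin}"

transcribe_all() {
    local dir="${1:-.}" wav
    for wav in "$dir"/*.wav; do
        # Skip the literal pattern when no WAV file matches.
        [ -e "$wav" ] || continue
        # Same flags as above: text script, SRT subtitles, Croatian, 8 threads.
        "$WHISPER_BIN" -otxt -osrt -l hr -t 8 -m "$WHISPER_MODEL" -f "$wav"
    done
}
```

Calling transcribe_all without an argument processes the current directory; transcribe_all /path/to/shows processes another one. The files are handled one after another, so each run still gets all 8 threads.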

Vocabulary Lookup with Balalingo

Then I use my own application Balalingo to look up unknown words from the text script. Balalingo connects to my Anki database and reads all words I already know. It then highlights the words I do not know yet. The tool also uses natural language processing to detect the lemmas of words, so that, for example, the word running would not be marked if my Anki deck contained the verb to run.

Screenshot of Balalingo with the script of a news show

A quite recent addition to Balalingo is batch lookup of vocabulary. I can click a single button and Balalingo generates a table with all unknown words and their translations.

I can then go through this table and put most words into Anki. There are a few reasons why I still copy them over to Anki manually (technically it could be automated). First of all, it allows me to learn the word when I add it. Anki does not really have a learning mode for when you see a word for the first time; it starts directly with the first review. Additionally, I just do not want to add all words to my Anki deck. Some words might not be important enough. And finally, the vocabulary lookup is just not perfect yet. It contains annotations from the dictionary API which I’d have to remove, and sometimes I’m just not happy with the result the dictionary gives me.
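The basic idea of comparing a text script against a list of known words can be roughly approximated with standard command-line tools. This is not how Balalingo works, just a sketch under stated assumptions: the function name unknown_words is hypothetical, the known words are assumed to sit in a plain text file with one lowercase word per line (how you export that from Anki is left open), and GNU grep and GNU sed in a UTF-8 locale are assumed so that letters like č or š are handled. There is no lemma detection here, so inflected forms of known words are still reported.

```shell
#!/usr/bin/env bash
# Sketch: list words from a text script that are missing from a known-words list.
# unknown_words and both file arguments are hypothetical names for this example.
# No lemmatization: inflected forms of known words are still reported.
unknown_words() {
    local script="$1" known="$2"
    # Extract one word per line, lowercase it (GNU sed \L), remove duplicates,
    # then drop every line that exactly matches a known word.
    grep -oE '[[:alpha:]]+' "$script" \
        | sed 's/.*/\L&/' \
        | sort -u \
        | grep -vxFf "$known"
}
```

A call like unknown_words news.txt known-words.txt (both file names are placeholders) prints the remaining words one per line, which is a crude stand-in for the highlighted words in the text.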

Watching the News

Once you know the words, you can watch the foreign news with or without auto-generated subtitles. And even if you do not understand most of the things (which unfortunately is still true for me) at least you have learned some words along the way!

By the way, whisper can also automatically translate to English if you prefer English subtitles (in whisper.cpp this is the -tr flag). I prefer foreign language subtitles when learning a foreign language.

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.