I started learning Finnish in 2024 and wanted to combine this with some programming ideas for foreign language studies. One of these ideas is searching for specific grammatical constructions in song lyrics. I’m trying this with Finnish first, because it’s probably easiest to do alongside a textbook, starting with extremely simple grammar.

Why do I want to search specifically in song lyrics? I often get interested in a foreign language because of music I’ve heard, so it feels quite natural to look for newly learnt stuff in these songs.

Scraping the Lyrics

Of course, this means we first have to collect the lyrics. I like the band Uniklubi, so I’ll fetch their lyrics. At the time of writing, they have several albums with 92 songs listed on lyricstranslate.com. This should be enough to find a few different grammatical constructions.

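Since I only need the lyrics for this one band, I’ll write a dirty one-shot script in Python. We can use the requests library to fetch the pages and BeautifulSoup (bs4) to parse the HTML. The lxml parser passed to BeautifulSoup below is a separate package and has to be installed as well.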

We can find the links to all lyrics in the div with the ID artistsonglist. This div contains links to untranslated lyrics and to the translations. Since I do not need the translations, I’ll restrict the search to elements with the class songName. From there we can then go on to fetch the lyrics:

from pathlib import Path
import time
import urllib.parse

import bs4
import requests

OVERVIEW_URL = 'https://lyricstranslate.com/en/uniklubi-lyrics.html'
HTTP_TIMEOUT = 10

print(f'Fetching {OVERVIEW_URL}')
r = requests.get(OVERVIEW_URL, timeout=HTTP_TIMEOUT)

soup = bs4.BeautifulSoup(r.text, features='lxml')
songlist = soup.find(id='artistsonglist')
for song in songlist.find_all('td', class_='songName'):
    for anchor in song.find_all('a', href=True):
        title = anchor.text
        url = urllib.parse.urljoin(OVERVIEW_URL, anchor['href'])

        print(f'Fetching {url}')
        # TODO: Download the lyrics for this song from the URL

Now that we have all the titles and URLs, we need to visit each URL individually and fetch the lyrics. As I said, this is a dirty one-time script. I’m not cleaning any characters in the following script, because I checked that the titles are all clean. In the wild, you’ll have to make sure that title is a valid file name and does not contain characters that can lead to attacks on your file system, such as path separators or a leading ../. In general: never trust any input from outside systems, e.g. from the web or from user input.

I’m fetching each paragraph (given by the class name par) individually, because I noticed that calling get_text() on the full song does not produce correct newlines in the raw text file.

When crawling websites you should always make sure to act politely, i.e. not overload the site. I’m not in a hurry, so I can wait a second between calls.

        r = requests.get(url, timeout=HTTP_TIMEOUT)
        soup = bs4.BeautifulSoup(r.text, features='lxml')
        lyrics = soup.find(id='song-body')

        text = '\n\n'.join([
            paragraph.get_text().strip()
            for paragraph in lyrics.find_all('div', class_='par')
        ])

        # Note: Since I only use this script one time, I'm sure that
        #       every title is a valid file name. For general scripts
        #       you must validate/sanitize title.
        # Create the output folder if it does not exist yet.
        Path('Uniklubi').mkdir(exist_ok=True)
        with open(Path('Uniklubi') / title, 'wt', encoding='utf-8') as f:
            f.write(text)

        time.sleep(1)
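
For a script that is not a one-off, the title would need cleaning before it is used as a file name. A minimal sketch, assuming a hypothetical helper safe_filename (not part of the script above) that keeps letters, digits, spaces, dots and hyphens and replaces everything else with an underscore:

```python
import re
import unicodedata


def safe_filename(title: str, fallback: str = 'untitled') -> str:
    """Turn an untrusted song title into a harmless file name.

    Hypothetical helper, not part of the original script.
    """
    # Normalize so that e.g. 'ä' is always stored as one code point.
    name = unicodedata.normalize('NFC', title)
    # Replace everything except word characters, spaces, dots and hyphens.
    # This also removes path separators such as '/' and '\'.
    name = re.sub(r'[^\w .-]', '_', name)
    # Leading or trailing dots would allow names like '..' or hidden files.
    name = name.strip(' .')
    return name or fallback


print(safe_filename('Mitä vittua?'))      # Mitä vittua_
print(safe_filename('../../etc/passwd'))  # _.._etc_passwd
```

This is deliberately strict: it may mangle harmless titles a little, but the result can never point outside the target folder.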

Using Stanza to Find the Grammatical Construction

The first grammatical construction explained by my textbook is the suffix -ko/-kö. The NLP library stanza calls this a clitic. The Wikipedia page on clitics agrees and even lists -ko as an example:

Finnish has seven clitics, which change according to the vowel harmony: -kO (-ko ~ -kö) […]

-kO attached to a verb makes it a question. It is used in yes/no questions […]

So, let’s perform a first experiment with stanza on example phrases from my textbook to see what we can expect:

import stanza
nlp = stanza.Pipeline('fi')

sentences = [
    'Oletko saksalainen?',
    'Ymmärrättekö suomea?',
]

for sentence in sentences:
    print(sentence)
    print(list(nlp(sentence).iter_words())[0].feats)

This gives us:

Oletko saksalainen?
Clitic=Ko|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act
Ymmärrättekö suomea?
Clitic=Ko|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act

It’s listed as Clitic=Ko in both cases.

I was not able to find out how to retrieve the features from stanza in a structured way (e.g. a dictionary), so we’ll have to parse the string. Again, I am really lazy, so I’ll only split on | and then search for the exact string Clitic=Ko.
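
Should a structured form become useful later, turning the string into a dictionary takes only a few lines. A minimal sketch, assuming the Key=Value|Key=Value layout seen in the output above; parse_feats is a hypothetical helper name, not something stanza provides:

```python
def parse_feats(feats):
    """Split a feature string like 'Clitic=Ko|Mood=Ind' into a dict."""
    if not feats:
        # Words without features have feats set to None.
        return {}
    # Ignore items without '=' instead of failing on them.
    pairs = (item.split('=', 1) for item in feats.split('|') if '=' in item)
    return dict(pairs)


features = parse_feats('Clitic=Ko|Mood=Ind|Number=Sing|Person=2')
print(features['Clitic'])  # Ko
print(parse_feats(None))   # {}
```

For the simple search in this post the exact string match is enough, so the dictionary is optional.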

Searching through the Lyrics

So, let’s write an ugly, hacky script again to search through the collected lyrics. Since they’re all in one folder, we just have to iterate over all files in the folder, analyze the texts with stanza and search for words with the feature Clitic=Ko.

from pathlib import Path

import stanza

nlp = stanza.Pipeline('fi')

for path in Path('Uniklubi').glob('*'):
    with open(path, 'rt', encoding='utf-8') as f:
        text = f.read()

    document = nlp(text)
    found = False
    for word in document.iter_words():
        if word.feats is None:
            continue

        features = word.feats.split('|')
        if 'Clitic=Ko' in features:
            found = True
            break

    if found:
        print(f'Found clitic ko/kö in {path}')

This lists quite a lot of occurrences, hooray!

Found clitic ko/kö in Uniklubi/Sulla on lupa
Found clitic ko/kö in Uniklubi/Koko talvi kesämökillä
Found clitic ko/kö in Uniklubi/Kiinni jään
Found clitic ko/kö in Uniklubi/Bailaten koko elämä
Found clitic ko/kö in Uniklubi/Huojuva Silta
Found clitic ko/kö in Uniklubi/Vnus
Found clitic ko/kö in Uniklubi/Laavaa
Found clitic ko/kö in Uniklubi/Menneisyys
Found clitic ko/kö in Uniklubi/Se ei lähde pois
Found clitic ko/kö in Uniklubi/Synti ja enkeli
Found clitic ko/kö in Uniklubi/Hetki hiljaisuutta
Found clitic ko/kö in Uniklubi/Jäämaisema
Found clitic ko/kö in Uniklubi/Sinut tahtoisin vielä
Found clitic ko/kö in Uniklubi/Kuivilla
Found clitic ko/kö in Uniklubi/Sinun nimesi on Morrison
Found clitic ko/kö in Uniklubi/Mitä vittua?
Found clitic ko/kö in Uniklubi/Rakkaudesta hulluuteen
Found clitic ko/kö in Uniklubi/Helios
Found clitic ko/kö in Uniklubi/Polje
Found clitic ko/kö in Uniklubi/Hipin sydän on rikki
Found clitic ko/kö in Uniklubi/Tartu sisko kiinni
Found clitic ko/kö in Uniklubi/Tulennielijä
Found clitic ko/kö in Uniklubi/Poispäin minusta
Found clitic ko/kö in Uniklubi/Ikuinen
Found clitic ko/kö in Uniklubi/Hei Hei Tähdenlento
Found clitic ko/kö in Uniklubi/Jos tähän jään
Found clitic ko/kö in Uniklubi/Kiertää Kehää

The first one looks right with the lyrics Kaipaatko kosketusta, kun olet lohduton? / Odotatko mahdotonta, miksi olet onneton?. The second one already seems a bit weird. Is it a false positive from koko (which seems to mean full)? Let’s see where stanza finds the -ko clitic.

Nope, it’s found in the word onko in the sentence Vai onko avanto ja järvi jäässä?. According to Google Translate this means Or are the opening and the lake frozen?, so definitely a yes/no question.
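
To get from a file name to the actual word, the search can report each matching word together with its sentence instead of stopping at the first hit. A rough sketch: find_feature is a hypothetical helper that relies on the sentences, words, text and feats attributes of a stanza document. The stand-in objects at the bottom only mimic those attributes so that the example runs without the Finnish model, and their feature strings are made up for illustration.

```python
from types import SimpleNamespace


def find_feature(document, feature='Clitic=Ko'):
    """Yield (word, sentence) pairs for every word carrying the feature."""
    for sentence in document.sentences:
        for word in sentence.words:
            if word.feats and feature in word.feats.split('|'):
                yield word.text, sentence.text


# Stand-in for nlp(text); a real run would pass the stanza document.
fake_document = SimpleNamespace(sentences=[
    SimpleNamespace(
        text='Vai onko avanto ja järvi jäässä?',
        words=[
            SimpleNamespace(text='Vai', feats=None),
            SimpleNamespace(text='onko', feats='Clitic=Ko|Mood=Ind'),
            SimpleNamespace(text='avanto', feats='Case=Nom|Number=Sing'),
        ],
    ),
])

for word, sentence in find_feature(fake_document):
    print(f'{word!r} in: {sentence}')
```

Passing another feature string, e.g. 'Case=Par', reuses the same loop for other lookups.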

The lyrics are of course still too difficult for me to understand, but it’s nonetheless nice to see this used in native-speaker content. The song Bailaten koko elämä might contain an occurrence that I can already understand at a basic level. Let’s see: the lyrics are Onko silloin ongelmaa, and according to Pons silloin means then and ongelma means problem. However, I do not yet understand why a second -a is appended, so let’s see what stanza has to say.

According to stanza it is Case=Par, which probably refers to the partitive case. I sort of remember that some languages express existence as a part-of relationship (or something like that, don’t judge me, it’s just a memory somewhere deep in my mind). So my guess is that the sentence means Is it then a problem?. Slightly off: according to both Google Translate and Pons it’s Is there a problem then?. That also fits my existence guess better. But it’s still fun to detect and translate -ko.

Future Ideas

Of course this could (and maybe will) be extended. We could cache the NLP results so that later searches over the texts are quicker. Currently a full run takes about 90 seconds. Going one step further, we could store the detected features or grammatical constructions in a database to provide instant search results.
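
A first version of such a store could use SQLite from the Python standard library. This is only a sketch under assumptions: the table layout, the column names and the sample rows are invented for illustration. In a real run the rows would be collected once while iterating over the stanza output.

```python
import sqlite3

# ':memory:' keeps the example self-contained; a file name would persist it.
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE word_features (song TEXT, word TEXT, feature TEXT)'
)

# Invented sample rows: one row per (song, word, feature).
rows = [
    ('Sulla on lupa', 'Kaipaatko', 'Clitic=Ko'),
    ('Sulla on lupa', 'Kaipaatko', 'Mood=Ind'),
    ('Bailaten koko elämä', 'Onko', 'Clitic=Ko'),
    ('Bailaten koko elämä', 'ongelmaa', 'Case=Par'),
]
connection.executemany('INSERT INTO word_features VALUES (?, ?, ?)', rows)
# The index is what makes lookups by feature fast on larger collections.
connection.execute('CREATE INDEX idx_feature ON word_features (feature)')


def songs_with(feature):
    """Return the sorted titles of all songs containing the feature."""
    cursor = connection.execute(
        'SELECT DISTINCT song FROM word_features WHERE feature = ? '
        'ORDER BY song',
        (feature,),
    )
    return [song for (song,) in cursor]


print(songs_with('Clitic=Ko'))  # ['Bailaten koko elämä', 'Sulla on lupa']
```

With the data in a table like this, a search no longer needs to run the NLP pipeline at all.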

Of course, searching for -ko was a very simple example. I’m not sure about Finnish, but in other languages detecting advanced grammar requires more effort, because a construction might be a combination of multiple features detected by stanza (e.g. a participle and an auxiliary verb in a specific tense). I might revisit this from time to time as I progress through my textbook.
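
Such a combined search could look for two neighbouring words that each carry a given set of features. The following is a sketch under assumptions: find_pairs and has_all are hypothetical helpers, words are given as plain (text, feats) tuples instead of stanza objects, and the feature strings in the example are placeholders rather than verified stanza output.

```python
def has_all(feats, required):
    """Check that every required feature occurs in the feature string."""
    present = set(feats.split('|')) if feats else set()
    return set(required) <= present


def find_pairs(words, first_required, second_required):
    """Return adjacent word pairs matching the two feature sets."""
    matches = []
    for (text_a, feats_a), (text_b, feats_b) in zip(words, words[1:]):
        first_ok = has_all(feats_a, first_required)
        second_ok = has_all(feats_b, second_required)
        if first_ok and second_ok:
            matches.append((text_a, text_b))
    return matches


# Placeholder input: an auxiliary followed by a participle.
words = [
    ('olen', 'Mood=Ind|Tense=Pres|VerbForm=Fin'),
    ('puhunut', 'Tense=Past|VerbForm=Part'),
    ('suomea', 'Case=Par'),
]
pairs = find_pairs(words, ['VerbForm=Fin', 'Tense=Pres'], ['VerbForm=Part'])
print(pairs)  # [('olen', 'puhunut')]
```

This only handles directly adjacent words. Constructions with words in between would need the dependency information instead.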

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.