Language Identification in Mixed-Language Texts using Python

When I work on hobby programming projects these days, they usually involve foreign languages. They often start with identifying the language of the text at hand. Python has a lot of libraries for language identification and they work quite well. I’m currently using the language identification from Stanford’s Stanza.

However, most libraries only output a single language for the whole text. That’s fine in many cases, but sometimes we have mixed-language texts, that is, texts containing multiple languages.

For example, the metal band Ill Niño sing most of their songs in English, some in Spanish, and sometimes they mix both. J-Pop often mixes Japanese and English as well, and so does the Japanese rock band ONE OK ROCK. If I want to do vocabulary extraction on such a mixed-language song, I need to know which parts belong to which language.

The Python library lingua says that it has experimental support for mixed-language texts, so let’s try it on a few lyrics. We can install the library from PyPI:

pip install lingua-language-detector

First, let’s make it a bit easier for the library by restricting the possible languages to the two languages contained in the text:

from lingua import Language, LanguageDetectorBuilder

with open('example', encoding='utf-8') as f:
    text = f.read()

languages = [Language.ENGLISH, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

print('Trying to detect languages in this text:')
print(text)

for result in detector.detect_multiple_languages_of(text):
    print(f"{result.language.name}: {text[result.start_index:result.end_index]}")

This is basically the sample code from their README with some adjustments.
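For the vocabulary extraction use case mentioned above, the detected sections still have to be collected per language. The following is only a minimal sketch of how that could look. The helper name `group_segments_by_language` is made up for this example, and it assumes nothing more than the `language.name`, `start_index` and `end_index` attributes already used in the loop above.

```python
from collections import defaultdict


def group_segments_by_language(text, results):
    """Collect the detected text sections per language name.

    `results` is expected to look like the return value of
    detector.detect_multiple_languages_of(text): every item needs a
    `language.name`, a `start_index` and an `end_index`.
    (Hypothetical helper, not part of lingua.)
    """
    sections = defaultdict(list)
    for result in results:
        section = text[result.start_index:result.end_index].strip()
        if section:  # skip sections that consist only of whitespace
            sections[result.language.name].append(section)
    return dict(sections)
```

With the detector from above it would be called as `group_segments_by_language(text, detector.detect_multiple_languages_of(text))`, giving one list of sections per language that can be passed on to language-specific processing.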

Spanish and English

For English and Spanish I’ll use Ill Niño’s song March against me. I’m not gonna print the whole lyrics for you, just some interesting sections of the output:

ENGLISH: Everything's fine
...
And what is the cost
Me
SPANISH: amas no más...
Siempre todos tan conscientes...
ENGLISH: I'm down for the ride
...

SPANISH: Ya cuando te vas
Mi vida es la paz.
Ya que cuentas tantos cuentos...
ENGLISH: It's clear in your eyes

We can see that it made a minor mistake labeling the Me at the beginning of a verse as English instead of Spanish. Apart from that it worked really well.

Let’s retry it using all languages. For that we have to change how the detector is built:

detector = LanguageDetectorBuilder.from_all_languages().build()

The Spanish detection got worse: several parts of the Spanish text were no longer detected as Spanish. Additionally, it detected some parts as Sotho (never heard of that language) and Irish:

SOTHO: feel so
...
IRISH: again. (again...)
Ya cuando te vas
...

So, detection from all available languages does not work so well. Maybe we could try to automatically detect candidate languages from the confidence values of single-language detection?

from lingua import Language, LanguageDetectorBuilder

with open('example', encoding='utf-8') as f:
    text = f.read()

detector = LanguageDetectorBuilder.from_all_languages().build()
confidence_values = detector.compute_language_confidence_values(text)
for confidence in confidence_values:
    print(f"{confidence.language.name}: {confidence.value:.5f}")

Unfortunately no. The model is absolutely sure that the text is written in English:

ENGLISH: 1.00000
TAGALOG: 0.00000
SPANISH: 0.00000
ESPERANTO: 0.00000
LATIN: 0.00000
PORTUGUESE: 0.00000
YORUBA: 0.00000
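To make the problem concrete, here is a rough sketch of what such an automatic candidate selection could have looked like. The function name `candidate_languages` and the threshold of 0.05 are arbitrary assumptions for illustration. The function only relies on the `language.name` and `value` attributes printed above.

```python
def candidate_languages(confidence_values, threshold=0.05):
    """Return the names of all languages whose confidence reaches `threshold`.

    `confidence_values` is expected to look like the return value of
    detector.compute_language_confidence_values(text).
    The default threshold is an arbitrary value chosen for this sketch.
    """
    return [
        confidence.language.name
        for confidence in confidence_values
        if confidence.value >= threshold
    ]
```

With the values shown above, any positive threshold leaves only ENGLISH as a candidate, so a second run would never look for the Spanish sections.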

Japanese and English

Let’s continue with the Japanese lyrics. I’m using First Love by Utada Hikaru. As a first attempt I’m again restricting the model to only English and Japanese. This song is quite interesting, because Utada Hikaru even uses single English words within a Japanese line. On the other hand, it should be much easier to detect the language just from the change in characters.

By the way, the confidence of single-language detection on the whole text is 100% for Japanese.

JAPANESE: 最後のキスはタバコの flavor がした
ニガくてせつない香り
...
ENGLISH: You are always gonna be my love
JAPANESE: いつか誰かとまた恋に落ちても
ENGLISH: I'll remember to love
...
JAPANESE: 今はまだ悲しい
ENGLISH: love song
...

The word flavor in the first verse was not detected as English, but other situations like the language change in the verse 今はまだ悲しい love song were correctly detected. Thus, in general lingua seems to be able to detect multiple languages in the same sentence on the same line.
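The remark about the change in characters can be illustrated without any language model. The following sketch splits a text into runs of Japanese script and Latin letters using Unicode ranges. It is not how lingua works internally, the names `split_by_script`, `JAPANESE` and `LATIN` are made up for this example, and the character ranges are a simplification (Hiragana, Katakana, common CJK ideographs and unaccented Latin letters only).

```python
import re

# Hiragana and Katakana (U+3040-U+30FF) plus CJK unified ideographs.
JAPANESE = r"[\u3040-\u30ff\u4e00-\u9fff]+"
# Unaccented Latin letters, with apostrophes allowed inside words ("I'll").
LATIN = r"[A-Za-z]+(?:'[A-Za-z]+)*"

TOKEN = re.compile(f"(?P<japanese>{JAPANESE})|(?P<latin>{LATIN})")


def split_by_script(text):
    """Return (script, run) pairs, merging neighbouring tokens of one script."""
    spans = []  # each entry is [script, start, end]
    for match in TOKEN.finditer(text):
        script = match.lastgroup  # "japanese" or "latin"
        if spans and spans[-1][0] == script:
            # Same script as before: extend the previous run up to this
            # token, which keeps the spaces and punctuation in between.
            spans[-1][2] = match.end()
        else:
            spans.append([script, match.start(), match.end()])
    return [(script, text[start:end]) for script, start, end in spans]
```

For the first line of the song, `split_by_script("最後のキスはタバコの flavor がした")` returns `[("japanese", "最後のキスはタバコの"), ("latin", "flavor"), ("japanese", "がした")]`, so the single word flavor is separated. The obvious limitation is that script alone cannot tell English from Spanish, which is why this would not have helped with the first example.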

When I tried to detect languages from all available languages again, the detection quality became worse. It did not introduce wrong languages this time, but several English sections were detected as Japanese.

Two-Phase approach

Let’s test one more thing. For the Japanese song, even when I worked with all languages, it only detected Japanese and English sections. Nonetheless, when we told the model to only search for Japanese and English, the results were much better.

This means, for this sample a two-phase approach would have worked really well:

1. Run with all available languages and see which languages are detected.
2. Perform a second run with only those languages.
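The two steps above could be automated roughly as follows. This is only a sketch: `detect_two_phase` and `build_lingua_detector` are names invented for this example, and the detector is passed in through a builder function so that the two-phase logic does not depend on lingua directly. The only lingua calls used are the ones already shown in this post. The early return rests on the assumption that lingua refuses to build a restricted detector from fewer than two languages.

```python
def detect_two_phase(text, build_detector):
    """Detect mixed-language sections in two runs (hypothetical helper).

    `build_detector(languages)` must return an object that offers
    detect_multiple_languages_of(text); `languages=None` stands for
    "use all available languages".
    """
    # Phase 1: run with all languages and remember which ones show up.
    first_pass = build_detector(None).detect_multiple_languages_of(text)
    candidates = []
    for result in first_pass:
        if result.language not in candidates:
            candidates.append(result.language)

    # Assumption: a restricted detector needs at least two languages,
    # so with fewer candidates the first pass is kept as the result.
    if len(candidates) < 2:
        return first_pass

    # Phase 2: run again, restricted to the languages found in phase 1.
    return build_detector(candidates).detect_multiple_languages_of(text)


def build_lingua_detector(languages=None):
    """Build a lingua detector for the given languages (None = all)."""
    # Imported here so that detect_two_phase works without lingua installed.
    from lingua import LanguageDetectorBuilder

    if languages is None:
        return LanguageDetectorBuilder.from_all_languages().build()
    return LanguageDetectorBuilder.from_languages(*languages).build()
```

It would be called as `detect_two_phase(text, build_lingua_detector)`. Note that the second run inherits every language from the first one, including any wrongly detected ones.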

Let’s try the same approach on the English and Spanish text. There, the run with all languages also produced some detections for Sotho and Irish. This means that an automated second run would use English, Spanish, Irish and Sotho:

languages = [Language.ENGLISH, Language.SPANISH, Language.IRISH, Language.SOTHO]

Unfortunately, it still detects the same lines as Sotho and Irish. However, the (wrongly detected) Irish section became a bit shorter and the (correctly detected) Spanish section following it longer.

Summary

Overall, language identification in mixed-language texts seems to work quite well if the languages used are already known. This makes it quite useful for situations where humans choose the texts and have at least a basic understanding of the text contents.

Automated detection of any language in unknown texts gives mixed results. When running with all available languages, the detection of language boundaries becomes much less reliable, and sometimes wrong languages are detected.

A two-phase approach can improve the detection quality a bit, but might still include wrongly detected languages.

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.