Using Python to find Japanese Texts with 〜ようとする
My Japanese course recently taught the grammar 〜ようとする with the meaning “trying to do” or “about to do”. Let’s try to auto-detect this piece of grammar in a collection of texts using the Python library Stanza.
I’ve already done some experiments. Building a grammar tree using the head
relations won’t be of use for this use case, because the head relations vary
too much between examples. We’re basically looking for a sequence of words:
verb in 〜よう form (volitional form) followed by と followed by some
conjugation of する.
The experiments also showed that we’ll have some manual work to do. At least for godan verbs from what I can see we basically only get that it’s a godan verb, but not that it’s in volitional form. Thus, we’ll first create a simple helper function to find out whether a verb looks like a volitional form. I guess it won’t be perfect, but I want to get something running. We can always improve it later if needed.
def is_volitional_verb(word):
endings = ['よう', 'おう', 'こう', 'ごう', 'そう', 'とう', 'のう', 'ぼう', 'もう', 'ろう']
for ending in endings:
if word.endswith(ending):
return True
return False
With this little helper we can iterate over the text and search for a sequence of three words:
- first word is a verb in volitional form
- second word is と
- third word is any conjugation of する (listed as lemma 為る in stanza)
def is_youtosuru(sentence, pos):
if pos + 2 >= len(sentence):
# not enough words in remainder of sentence
return False
is_volitional = is_volitional_verb(sentence[pos].text)
is_to = sentence[pos + 1].text == 'と'
is_suru = sentence[pos + 2].lemma == '為る'
Now we only have to iterate over the sentences of a text and check whether we can find a match somewhere in the text.
with open(sys.argv[1], encoding='utf_8') as f:
text = f.read()
nlp = stanza.Pipeline('ja')
for sentence in nlp(text).sentences:
words = sentence.words
for pos in range(0, len(words)):
if is_youtosuru(words, pos):
print('Found 〜ようとする')
I have a few song lyrics of Kobukuro and MY FIRST STORY on my computer. To
cross check the findings I just searched for うと
in the files and manually
checked which of these were 〜ようとする variants.
It did not detect 君の瞳に鍵をかけようとしても in Kobukuro’s song BEST FRIEND, because stanza split して into two separate words し and て.
On the MY FIRST STORY lyrics it worked better. For example it found 忘れようとしたけど (Even though I tried to forget) in the song あいことば. It also correctly detected 僕を試そうとしたんでしょ? (Were you trying to test me?) in the song “Plastic” and 誰もがずっと 何も見ようとしないから in the song “Drive Me”. The last one gets translated to “Because no one wants to see anything for a long time” by DeepL. Taking the sentence apart more literally the second part is probably “because trying not to see anything” and the first part then adds the info “everyone / noone” with the time “the whole time” to it.
Overall I have to say I’m positively surprised. This simple approach did not find all occurrences, because stanza sometimes split して into two words. But it did find a few occurrences and there were no false positives at all. It has served me well - I’ve now seen a few examples for 〜ようとする in real song lyrics, both with ichidan and godan verbs.
I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.