Using Hiragana, Katakana and Kanji for tokenizing Japanese
People working in the international web-development (and maybe also in other sectors) often suffer from the problem of tokenization. This usually happens, when you want to create a search engine. As already mentioned, I am currently working on a library to analyze texts with PHP. This library shall help all people having to implement a search field.
The complex method of tokenizing in Japanese
So let’s look at how we can tokenize Japanese. If you searched about this topic, you probably have found out that it’s not an easy topic, nothing you can write down yourself quickly. If you want to implement it really well, you need probability heuristics and use advanced topics like Hidden-Markov-Models or similar.
A simpler approach: Japanese knows several writing systems
So let’s think about an easier approach, somebody without a M.Sc. can implement or create. Somebody who cannot rely on C++ code (e.g. KyTea) or similar.
If you think about the structure of Japanese, you wil recognize that it has three writing systems, sometimes even four (which occurs on the web pretty often):
- Romaji (latin characters)
We can use these different writing systems for tokenizing Japanese texts. If you have read Japanese text, you might know that the switch between Hiragana and Kanji fairly often happens at word borders. Of course verbs mostly consist of both kanji and hiragana, but as said, we want to keep it all very simple. We can still make it more difficult and accurate later. The point is: Even if we have a verb consisting of kanji and hiragana, the kanji will carry the meaning. On the other hand, nouns are very often delimited by hiragana because particles are written in hiragana.
Let’s look at this easy sentence (I just remembered from some song or anime):
There we have 約束 (promise, meeting) and then a hiragana, because this is being talked about. Then we have a form of 忘れる where the kanji carries the meaning of “to forget”. And then we have the whole hiragana rest carrying the meaning of “please not”.
Now, why should we not just tokenize the words when the writing system changes? That’s exactly what I thought and what I will also do in my current project. It would give us these tokens:
Of course, it is not perfect, but it is better than not splitting the words at all or splitting them into whole sentences (marked by 。 in Japanese). As a small improvement, we can still implement 。 as sort of its own writing system (like non-saved-writing-system, so that it will be considered a border too.
My current class
That’s the code I have read up to know. It still lacks recognition of Romaji (which I did not think of before, which can be seen often on the web) and I do not handle 。 、 or other special characters yet (but you do not do that in a pure whitespace tokenizer either).
Feel free to work with it. I will also improve it while working on my tokenizing-stemming-searching-PHP-library.I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to firstname.lastname@example.org.