Croatian POS Tagging
POS tagging for less common languages usually takes a bit of effort to set up. Luckily, for Croatian, Željko Agić has created a very good POS tagger licensed under CC-BY-SA-3.0. It is based on the hunpos package, which was originally created for Hungarian and is licensed under the New BSD License.
According to my research, Agić is the most important POS tagging researcher for the Croatian language. Another very important person in the field of Croatian NLP is Marko Tadić, but he seems to be more involved in corpus creation as a whole.
Create hunpos binary
To use the Croatian POS tagger model with hunpos, you need to compile the latest version of hunpos from source. It does not seem to work with the precompiled hunpos Linux binaries. Compiling it is quite simple: download the package, unpack it and then call the build script.
You now have the hunpos binaries in the current directory as tagger.native and trainer.native.
Download and use the model
The next step is using the Croatian model. Just download the model from Agić’s website and rename it to something more convenient like croatian-ffzg.hunpos.
The input to the POS tagger is one token per line. Empty lines are used as sentence separators. So a simple test file from a Wikipedia article about Lučka kapetanija (it was on the Croatian Wikipedia’s main page on 2016-05-26) might look like this:
Lučka
kapetanija
(engl.
Harbour
Master's
Office)
je
glavna
institucija
s
provedbenom
funkcijom
u
području
pomorstva
u
Republici
Hrvatskoj

Na
čelu
lučke
kapetanije
nalazi
se
lučki
kapetan
po
kome
je
institucija
kroz
povijest
i
dobila
ime
According to my tests, you should make sure to remove punctuation marks, because they lead to wrong classifications. For example, when I had “Hrvatskoj.” and “ime.” (with the trailing dots) in the file, it classified “Hrvatskoj.” as a masculine noun (but it is feminine) and “ime.” as a number (but it is a noun). Without the dots, classification was correct. As far as I can tell, these two words (Hrvatskoj and ime) are then classified entirely correctly as Npfsl (noun, proper name, feminine, singular, locative) and Ncnsa (noun, common noun, neuter, singular, accusative). This degree of detail is really impressive.
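Preparing such an input file by hand gets tedious quickly. A minimal Python sketch of the idea follows. It assumes that the sentences are already split, that tokens are separated by whitespace, and that only trailing sentence punctuation should be dropped. The function name to_hunpos_input and the punctuation set are illustrative choices, not part of hunpos.

```python
# Characters stripped from the end of each token (an assumption; extend as needed).
TRAILING_PUNCTUATION = ".,;:!?"


def to_hunpos_input(sentences):
    """Return one token per line, with an empty line between sentences."""
    blocks = []
    for sentence in sentences:
        # Split on whitespace and drop trailing punctuation from every token.
        tokens = [tok.rstrip(TRAILING_PUNCTUATION) for tok in sentence.split()]
        # Tokens that consisted only of punctuation are now empty; skip them.
        tokens = [tok for tok in tokens if tok]
        blocks.append("\n".join(tokens))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_hunpos_input(["u Republici Hrvatskoj.", "dobila ime."]), end="")
```

Parentheses are deliberately left alone here, which matches the test file above where tokens such as Office) keep them.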
You can run the POS tagging on this test file yourself with:
Classification for this whole file is:
Lučka Agpfsn
kapetanija Ncfsn
(engl Ncmsn
Harbour Npmsn
Master's Npmsn
Office) Npmsn
je Vcr3s
glavna Agpfsn
institucija Ncfsn
s Si
provedbenom Agpfsi
funkcijom Ncfsi
u Sl
području Ncnsl
pomorstva Ncnsg
u Sl
Republici Ncfsl
Hrvatskoj Npfsl
Vmr3s
Na Sl
čelu Ncnsl
lučke Agpfsg
kapetanije Ncfsg
nalazi Vmr3s
se Px--sa--ypn
lučki Agpmsn
kapetan Agpmsn
po Sl
kome Pp3fsi--n-n
je Vcr3s
institucija Ncfsn
kroz Sa
povijest Ncfsa
i Cc
dobila Vmp-sf
ime Ncnsa
There seems to be a bug in hunpos with the input format, because even though the documentation states “Empty lines are sentence separators.”, my empty lines are classified as verbs (Vmr3s). However, in my test, this does not influence the second sentence.
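Because the output is simply one token and one tag per line, it is easy to post-process. The following Python sketch splits it back into sentences and treats any line that carries only a tag (such as the stray Vmr3s for an empty input line) as a sentence boundary. The helper tag_file rests on assumptions: that the tagger takes the model path as its only argument, reads tokens from standard input, and writes UTF-8 to standard output. Both function names are hypothetical.

```python
import subprocess


def parse_tagged(output):
    """Split tagger output into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            current.append((parts[0], parts[1]))
        elif current:
            # Empty line, or a line with only a tag: close the sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def tag_file(path, tagger="./tagger.native", model="croatian-ffzg.hunpos"):
    """Run the tagger on a token file (assumed invocation: tagger MODEL < FILE)."""
    with open(path, "rb") as handle:
        result = subprocess.run(
            [tagger, model], stdin=handle, stdout=subprocess.PIPE, check=True
        )
    # Assumption: the tagger emits UTF-8.
    return parse_tagged(result.stdout.decode("utf-8"))
```

With this, tag_file("test.txt") would return a list of sentences, each a list of (token, tag) tuples, provided the invocation assumption holds for your build.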
The POS tags are given in the form of the revised MULTEXT-East version 4 morphosyntactic descriptions (MSD).
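These tags are positional: each character encodes one attribute. As an illustration, here is a small Python sketch that decodes noun tags only. The position layout (type, gender, number, case) is taken from the two examples above, Npfsl and Ncnsa. The function name decode_noun_msd is hypothetical, and other word classes use different layouts that are not covered here.

```python
NOUN_TYPE = {"c": "common", "p": "proper"}
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter"}
NUMBER = {"s": "singular", "p": "plural"}
CASE = {
    "n": "nominative",
    "g": "genitive",
    "d": "dative",
    "a": "accusative",
    "v": "vocative",
    "l": "locative",
    "i": "instrumental",
}


def decode_noun_msd(tag):
    """Decode the first five positions of a noun MSD tag such as 'Npfsl'."""
    if len(tag) < 5 or tag[0] != "N":
        raise ValueError("not a noun tag: %r" % tag)
    return {
        "category": "noun",
        "type": NOUN_TYPE.get(tag[1], "unknown"),
        "gender": GENDER.get(tag[2], "unknown"),
        "number": NUMBER.get(tag[3], "unknown"),
        "case": CASE.get(tag[4], "unknown"),
    }


if __name__ == "__main__":
    print(decode_noun_msd("Npfsl"))
```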
According to Agić, the model achieves an accuracy of 87% on full MSD-HR tags and a “POS-only accuracy” of 97%. The worst numbers in Agić’s paper are 80% on full MSD-HR and 94% for POS-only. It is also worth mentioning that this POS tagger can be applied to both Croatian and Serbian (in Latin characters). If you feed it Cyrillic characters, everything is tagged as a noun:
Први Npmsn
потпредседник Npmsn
Владе Npmsn
Републике Npmsn
Србије Npmsn
However, you can just transliterate the characters to Latin and then it works:
Prvi Agpmsn
potpredsednik Ncmsn
Vlade Npfsg
Republike Ncfsg
Srbije Npfsg
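The transliteration itself is mechanical, because Serbian Cyrillic maps letter by letter to the Latin alphabet. A minimal Python sketch, assuming plain Serbian text and capitalising digraphs in title case (Љ becomes Lj); the function name transliterate is an illustrative choice:

```python
# Lowercase Serbian Cyrillic letters and their Latin counterparts.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add uppercase letters; digraphs become title case (e.g. "Lj", "Dž").
_TABLE = dict(_LOWER)
for cyrillic, latin in _LOWER.items():
    _TABLE[cyrillic.upper()] = latin.capitalize()


def transliterate(text):
    """Replace Serbian Cyrillic letters with Latin ones; keep everything else."""
    return "".join(_TABLE.get(char, char) for char in text)


if __name__ == "__main__":
    print(transliterate("Први потпредседник Владе Републике Србије"))
```

Running it on the Cyrillic example above yields exactly the Latin tokens that were tagged in the second listing.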
Using the latest model
Agić also maintains a repository on GitHub for the corpora used to train his POS tagger (SETimes.HR). All of them are licensed under CC, but at the time of writing the news and web corpora include the NC (non-commercial) requirement.
Building the model is quite simple, because the corpus has a similar format to the hunpos input/output format (except for more columns). So we just have to strip off a few columns:
Now you have the latest version of the model.
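The column stripping mentioned above can be sketched in Python as follows. The column positions are not fixed here, because they depend on the corpus file: the caller has to pass the index of the token column and of the tag column. The function name strip_columns and the sample lines in the example are hypothetical, and tab-separated columns are an assumption.

```python
def strip_columns(lines, token_col, tag_col):
    """Keep only the token and tag columns; preserve empty separator lines."""
    result = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line separates sentences and is kept as is.
            result.append("")
            continue
        fields = line.split("\t")
        result.append(fields[token_col] + "\t" + fields[tag_col])
    return result


if __name__ == "__main__":
    # Hypothetical corpus lines: id, token, lemma, coarse tag, full tag.
    sample = ["1\tNa\tna\tS\tSl\n", "\n", "1\tčelu\tčelo\tN\tNcnsl\n"]
    print("\n".join(strip_columns(sample, token_col=1, tag_col=4)))
```

The resulting token and tag pairs, one per line, are the kind of file that trainer.native would then be trained on.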