Croatian POS Tagging - s.koch blog

POS tagging for less common languages usually takes some effort to set up. Luckily, for Croatian, Željko Agić has created a very good POS tagger model licensed under CC-BY-SA-3.0. It is based on the hunpos package, which was originally created for Hungarian and is licensed under the New BSD License.

According to my research, Agić is the most important POS tagging researcher for the Croatian language. Another very important person in Croatian NLP is Marko Tadić, but he seems to be more involved in corpus creation.

Create hunpos binary

To use the Croatian POS tagger model with hunpos, you need to compile the latest version of hunpos from source; the model does not seem to work with the precompiled hunpos Linux binaries. Compiling is quite simple, provided an OCaml compiler is installed (hunpos is written in OCaml): download the package, unpack it and call the build script.

wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/hunpos/source-archive.zip
unzip source-archive.zip
cd hunpos/trunk
bash build.sh build

You now have the hunpos binaries in the current directory as tagger.native and trainer.native.

Download and use the model

The next step is using the Croatian model. Just download the model from Agić’s website and rename it to something more convenient like croatian-ffzg.hunpos.

wget http://nlp.ffzg.hr/data/tagging/cc-by-sa.hunpos
mv cc-by-sa.hunpos croatian-ffzg.hunpos

The input to the POS tagger is one token per line, with empty lines used as sentence separators. A simple test file built from the Wikipedia article about Lučka kapetanija (it was on Croatian Wikipedia’s main page on 2016-05-26) might look like this:

Lučka
kapetanija
(engl.
Harbour
Master's
Office)
je
glavna
institucija
s
provedbenom
funkcijom
u
području
pomorstva
u
Republici
Hrvatskoj
 
Na
čelu
lučke
kapetanije
nalazi
se
lučki
kapetan
po
kome
je
institucija
kroz
povijest
i
dobila
ime

According to my tests, you should remove punctuation marks attached to tokens, because they lead to wrong classifications. For example, when the file contained Hrvatskoj. and ime., the tagger classified Hrvatskoj. as a masculine noun (it is feminine) and ime. as a numeral (it is a noun). Without the dots, the classification was correct: as far as I can tell, Hrvatskoj and ime are tagged correctly as Npfsl (noun, proper, feminine, singular, locative) and Ncnsa (noun, common, neuter, singular, accusative). This degree of detail is really impressive.
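The preparation described above (one token per line, an empty line between sentences, no punctuation attached to tokens) can be sketched in a few lines of Python. This is only a rough sketch under simple assumptions: the function name `to_hunpos_input` is made up for this post, sentences are assumed to end at `.`, `!` or `?` followed by whitespace, and punctuation is only stripped from the edges of a token. Abbreviations such as "engl." will therefore be treated as sentence ends, so real text needs a better sentence splitter.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Punctuation at the start or end of a token (anything that is neither
# a word character nor whitespace). Inner characters such as the
# apostrophe in "Master's" are left alone.
EDGE_PUNCT = re.compile(r"^[^\w\s]+|[^\w\s]+$")


def to_hunpos_input(text):
    """Return text as one token per line, sentences separated by empty lines."""
    blocks = []
    for sentence in SENTENCE_END.split(text.strip()):
        tokens = [EDGE_PUNCT.sub("", raw) for raw in sentence.split()]
        tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
        if tokens:
            blocks.append("\n".join(tokens))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = "U Republici Hrvatskoj. Institucija je dobila ime."
    print(to_hunpos_input(sample), end="")
```

Writing the result to a file gives something that can be piped into the tagger in the same way as the hand-made test file.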

You can run the POS tagging on this test file yourself with:

./tagger.native croatian-ffzg.hunpos < testtext

Classification for this whole file is:

Lučka	Agpfsn	
kapetanija	Ncfsn	
(engl	Ncmsn	
Harbour	Npmsn	
Master's	Npmsn	
Office)	Npmsn	
je	Vcr3s	
glavna	Agpfsn	
institucija	Ncfsn	
s	Si	
provedbenom	Agpfsi	
funkcijom	Ncfsi	
u	Sl	
području	Ncnsl	
pomorstva	Ncnsg	
u	Sl	
Republici	Ncfsl	
Hrvatskoj	Npfsl	
   Vmr3s	
Na	Sl	
čelu	Ncnsl	
lučke	Agpfsg	
kapetanije	Ncfsg	
nalazi	Vmr3s	
se	Px--sa--ypn	
lučki	Agpmsn	
kapetan	Agpmsn	
po	Sl	
kome	Pp3fsi--n-n	
je	Vcr3s	
institucija	Ncfsn	
kroz	Sa	
povijest	Ncfsa	
i	Cc	
dobila	Vmp-sf	
ime	Ncnsa

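There seems to be a bug in hunpos's handling of the input format: even though the documentation states “Empty lines are sentence separators.”, the separator line between my two sentences is tagged as a verb (Vmr3s). One possible explanation is that the separator line in my test file contained whitespace instead of being truly empty, so it was read as a token. In my test, this did not affect the tagging of the second sentence.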
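If you want to process the tagger output further, a small parser helps. The following is a sketch, not part of hunpos: the function name `parse_hunpos_output` is made up, and it assumes the output format shown above (token, tab, tag, optionally a trailing tab). It treats every line whose token is empty or whitespace-only as a sentence separator, so the spurious Vmr3s entry is dropped.

```python
def parse_hunpos_output(output):
    """Split tagger output into sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in output.splitlines():
        fields = line.split("\t")
        token = fields[0]
        if not token.strip():
            # Empty line or whitespace-only "token": close the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) < 2:
            continue  # no tag on this line, skip it
        current.append((token, fields[1]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = "Republici\tNcfsl\t\nHrvatskoj\tNpfsl\t\n \tVmr3s\t\nNa\tSl\t\n"
    for sentence in parse_hunpos_output(sample):
        print(sentence)
```

The output string can be produced by redirecting the tagger output to a file (`./tagger.native croatian-ffzg.hunpos < testtext > tagged.txt`) and reading that file as UTF-8.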

The POS tags are given in the format of the revised MULTEXT-East version 4 morphosyntactic descriptions (MSD).
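These tags are positional: the first character is the part of speech, and every following character encodes one attribute. To make the noun tags above easier to read, here is a small decoder sketch. The function names are made up, and the tables are my own transcription of the category letters and the first four noun positions (type, gender, number, case), so they should be checked against the MULTEXT-East specification before relying on them. Positions after the case are ignored.

```python
# First character of an MSD tag: the part-of-speech category.
CATEGORIES = {
    "N": "noun", "V": "verb", "A": "adjective", "P": "pronoun",
    "R": "adverb", "S": "adposition", "C": "conjunction", "M": "numeral",
    "Q": "particle", "I": "interjection", "Y": "abbreviation", "X": "residual",
}

# Positions 2 to 5 of a noun tag (transcribed by hand, verify against the spec).
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
]


def pos_only(tag):
    """Reduce a full MSD tag to its coarse part of speech."""
    return CATEGORIES.get(tag[:1], "unknown")


def describe_noun(tag):
    """Spell out a noun tag such as 'Npfsl'."""
    if not tag.startswith("N"):
        raise ValueError("not a noun tag: " + tag)
    parts = ["noun"]
    for (name, values), code in zip(NOUN_ATTRIBUTES, tag[1:]):
        if code != "-":  # '-' marks an attribute that is not set
            parts.append(values.get(code, name + "=" + code))
    return ", ".join(parts)


if __name__ == "__main__":
    print(describe_noun("Npfsl"))  # noun, proper, feminine, singular, locative
    print(describe_noun("Ncnsa"))  # noun, common, neuter, singular, accusative
    print(pos_only("Vmr3s"))       # verb
```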

According to Agić, the model achieves an accuracy of 87% on full MSD-HR tags and a “POS-only accuracy” of 97%. The worst numbers in Agić’s paper are 80% on full MSD-HR and 94% for POS-only. It is also worth mentioning that this POS tagger can be used for both Croatian and Serbian (in Latin script). If you feed it Cyrillic characters, everything is tagged as a noun:

Први   Npmsn
потпредседник   Npmsn
Владе   Npmsn
Републике   Npmsn
Србије   Npmsn

However, you can simply transliterate the text to Latin script, and then it works:

Prvi   Agpmsn
potpredsednik   Ncmsn
Vlade   Npfsg
Republike   Ncfsg
Srbije   Npfsg
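The transliteration itself is a plain character mapping, because Serbian Cyrillic and Latin correspond letter by letter (with three Latin digraphs). A minimal sketch could look like this; the function name `cyrillic_to_latin` is made up, and an uppercase Cyrillic letter that maps to a digraph is written in title case (Љ becomes Lj), which is not what you want for words written entirely in capitals.

```python
# Serbian Cyrillic to Latin, lowercase letters only; uppercase is derived below.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic to Latin, leaving other characters as-is."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = CYRILLIC_TO_LATIN.get(lower)
        if latin is None:
            out.append(ch)  # not a Serbian Cyrillic letter
        elif ch != lower:
            out.append(latin.capitalize())  # uppercase input, e.g. Љ -> Lj
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(cyrillic_to_latin("Први потпредседник Владе Републике Србије"))
```

Run over the token file before tagging, this turns the Cyrillic example above into the Latin tokens that the model handles.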

Using the latest model

Agić also maintains a repository on GitHub with the corpora used to train his POS tagger (SETimes.HR). All of them are licensed under Creative Commons, but at the time of writing the news and web corpora include the NC (non-commercial) requirement.

Building the model is quite simple, because the corpus has a similar format to the hunpos input/output format (except for more columns). So we just have to strip off a few columns:

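git clone https://github.com/ffnlp/sethr.git
# keep token (column 2) and tag (column 5); pass empty lines through as sentence separators
awk 'NF { print $2 "\t" $5; next } { print "" }' sethr/set.hr.conll > croatian-ffzg.train
./trainer.native croatian-ffzg.hunpos < croatian-ffzg.train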

Now you have the latest version of the model.
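If you would rather do the column stripping in Python, for example to filter or inspect the corpus along the way, the same step can be sketched as follows. The function name `conll_to_hunpos` is made up, and the column positions (token in the second column, tag in the fifth) are taken from the awk call above rather than from any corpus documentation. Empty lines are kept as empty lines so that sentence boundaries survive.

```python
def conll_to_hunpos(lines, token_col=1, tag_col=4):
    """Reduce CoNLL-style lines to 'token<TAB>tag' lines for the hunpos trainer.

    Column indexes are zero-based; the defaults mirror $2 and $5 in awk.
    """
    out = []
    for line in lines:
        fields = line.split()
        if not fields:
            out.append("")  # sentence separator stays an empty line
        elif len(fields) > max(token_col, tag_col):
            out.append(fields[token_col] + "\t" + fields[tag_col])
        # lines with too few columns are skipped
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Paths as used in the shell commands above.
    with open("sethr/set.hr.conll", encoding="utf-8") as src:
        converted = conll_to_hunpos(src)
    with open("croatian-ffzg.train", "w", encoding="utf-8") as dst:
        dst.write(converted)
```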
