POS tagging for less common languages usually takes some effort to set up. Luckily, for Croatian, Željko Agić has created a very good POS tagger licensed under CC-BY-SA-3.0. It is based on the hunpos package, which was originally created for Hungarian and is licensed under the New BSD License.
According to my research, Agić is the most important POS tagging researcher for the Croatian language. Another very important person in the field of Croatian NLP is Marko Tadić, but he seems to be more involved in the whole field of corpus creation.
Create hunpos binary
To use the Croatian POS tagger model with hunpos, you need to compile the latest version of hunpos from source. It does not seem to work with the precompiled hunpos Linux binaries. Compiling it is quite simple. Download the package, unpack it, and then call the build script.
You now have the hunpos binaries in the current directory as
Download and use the model
The next step is using the Croatian model. Just download
the model from Agić’s website and rename it to something
more convenient like
The input to the POS tagger is one token per line. Empty lines are used as sentence separators. So a simple test file from a Wikipedia article about Lučka kapetanija (it was on the Croatian Wikipedia’s main page on 2016-05-26) might look like this:
Lučka
kapetanija
(engl.
Harbour
Master's
Office)
je
glavna
institucija
s
provedbenom
funkcijom
u
području
pomorstva
u
Republici
Hrvatskoj

Na
čelu
lučke
kapetanije
nalazi
se
lučki
kapetan
po
kome
je
institucija
kroz
povijest
i
dobila
ime
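A file in this format can also be generated rather than typed by hand. The following is a minimal sketch, assuming the text has already been split into sentences and tokens; the function name `to_hunpos_input` is hypothetical and not part of hunpos.

```python
def to_hunpos_input(sentences):
    """Render tokenized sentences as one token per line.

    `sentences` is a list of token lists. Sentences are separated by a
    single empty line, as described above.
    """
    blocks = ["\n".join(tokens) for tokens in sentences if tokens]
    return "\n\n".join(blocks) + "\n"


sentences = [
    "Lučka kapetanija je glavna institucija".split(),
    "Na čelu lučke kapetanije nalazi se lučki kapetan".split(),
]
print(to_hunpos_input(sentences), end="")
```

Splitting on whitespace is only good enough for a quick test; real text needs a proper tokenizer.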
According to my tests, you should make sure to remove punctuation marks, because they lead to wrong classifications. For example, when I had Hrvatskoj. and ime. in the file, it classified Hrvatskoj. as a masculine noun (but it is feminine) and ime. as a number (but it is a noun). Without the dots, the classification was correct. As far as I can tell, these two words (Hrvatskoj and ime) are then classified entirely correctly as Npfsl (noun, proper name, feminine, singular, locative) and Ncnsa (noun, common noun, neuter, singular, accusative). This degree of detail is really impressive.
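Removing the punctuation can be automated. This is a rough sketch, not what was used for the test above: it strips every leading and trailing punctuation character from a token, so it is stricter than just deleting the dots (it would also remove the parentheses around the English name). The helper names are hypothetical.

```python
import unicodedata


def is_punct(ch):
    """True for any character in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def strip_punctuation(token):
    """Remove punctuation at both ends of a token, keep it inside."""
    start, end = 0, len(token)
    while start < end and is_punct(token[start]):
        start += 1
    while end > start and is_punct(token[end - 1]):
        end -= 1
    return token[start:end]


def clean_tokens(tokens):
    """Strip punctuation from every token and drop tokens that become empty."""
    cleaned = (strip_punctuation(t) for t in tokens)
    return [t for t in cleaned if t]


print(clean_tokens(["Republici", "Hrvatskoj.", "Master's", "ime."]))
```

Punctuation inside a token, such as the apostrophe in Master's, is left alone.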
You can run the POS tagging on this test file yourself with:
Classification for this whole file is:
Lučka Agpfsn
kapetanija Ncfsn
(engl Ncmsn
Harbour Npmsn
Master's Npmsn
Office) Npmsn
je Vcr3s
glavna Agpfsn
institucija Ncfsn
s Si
provedbenom Agpfsi
funkcijom Ncfsi
u Sl
području Ncnsl
pomorstva Ncnsg
u Sl
Republici Ncfsl
Hrvatskoj Npfsl
Vmr3s
Na Sl
čelu Ncnsl
lučke Agpfsg
kapetanije Ncfsg
nalazi Vmr3s
se Px--sa--ypn
lučki Agpmsn
kapetan Agpmsn
po Sl
kome Pp3fsi--n-n
je Vcr3s
institucija Ncfsn
kroz Sa
povijest Ncfsa
i Cc
dobila Vmp-sf
ime Ncnsa
There seems to be a bug in hunpos with the input format, because even though the documentation states “Empty lines are sentence separators”, my empty lines are classified as verbs (Vmr3s). However, in my test, this does not influence the second sentence.
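If the tagger is called from a script, the stray tag on the empty line is easy to filter out. The sketch below makes several assumptions: the binary is `./hunpos-tag`, it takes the model file as its only argument and reads tokens from standard input, it prints token and tag separated by whitespace, and the model was saved as `model.hunpos`. Both file names are hypothetical.

```python
import subprocess


def parse_tagged(output):
    """Split tagger output into sentences of (token, tag) pairs.

    A line with fewer than two fields (an empty line, or a lone tag
    attached to an empty token) is treated as a sentence boundary.
    """
    sentences, current = [], []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            current.append((fields[0], fields[1]))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def tag_file(input_path, model_path="model.hunpos", binary="./hunpos-tag"):
    """Run the tagger on a one-token-per-line file (paths are assumptions)."""
    with open(input_path, "rb") as infile:
        result = subprocess.run(
            [binary, model_path],
            stdin=infile,
            stdout=subprocess.PIPE,
            check=True,
        )
    return parse_tagged(result.stdout.decode("utf-8"))
```

`tag_file("test.txt")` would then return one list of pairs per sentence, with the empty-line row dropped.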
The POS tags are given in the revised MULTEXT-East version 4 format.
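These tags are positional: the first letter is the part of speech and each following letter encodes one attribute. As an illustration, here is a small decoder for noun tags only. The value tables are an assumption written from the two noun tags explained above and from the usual MULTEXT-East letter codes, so they should be checked against the specification before being relied on.

```python
# Attribute positions for noun tags (assumed, nouns only).
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "v": "vocative", "l": "locative",
        "i": "instrumental",
    }),
]


def decode_noun_msd(tag):
    """Expand a noun tag such as 'Npfsl' into named features."""
    if not tag.startswith("N"):
        raise ValueError("only noun tags are handled in this sketch: " + tag)
    features = {"category": "noun"}
    for (name, values), code in zip(NOUN_POSITIONS, tag[1:]):
        if code != "-":  # '-' marks an attribute that does not apply
            features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Npfsl"))
print(decode_noun_msd("Ncnsa"))
```

Unknown letters are passed through unchanged instead of raising, which keeps the sketch usable on tags it does not fully cover.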
According to Agić, the model achieves an accuracy of 87% on full MSD-HR and a “POS-only accuracy” of 97%. The worst numbers in Agić’s paper are 80% on full MSD-HR and 94% for POS-only. It is also worth mentioning that this POS tagger can be used for both Croatian and Serbian (in Latin characters). If you put in Cyrillic characters, everything is a noun:
Први Npmsn потпредседник Npmsn Владе Npmsn Републике Npmsn Србије Npmsn
However, you can just transliterate the characters to Latin and then it works:
Prvi Agpmsn potpredsednik Ncmsn Vlade Npfsg Republike Ncfsg Srbije Npfsg
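The transliteration itself is a fixed letter-for-letter mapping, so it fits in a few lines. This is a minimal sketch using the standard Serbian Cyrillic to Latin correspondence; the function name is hypothetical.

```python
# Lowercase Serbian Cyrillic letters and their Latin equivalents.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Build a translation table that also covers the uppercase letters.
_TABLE = {}
for cyr, lat in _LOWER.items():
    _TABLE[ord(cyr)] = lat
    _TABLE[ord(cyr.upper())] = lat.capitalize()


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic text to Latin script."""
    return text.translate(_TABLE)


print(cyrillic_to_latin("Први потпредседник Владе Републике Србије"))
# Prvi potpredsednik Vlade Republike Srbije
```

One limitation: in words written entirely in capitals, digraphs come out as Lj, Nj and Dž rather than LJ, NJ and DŽ.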
Using the latest model
Agić also maintains a repository on GitHub for the corpora used to train
his POS tagger (SETimes.HR). All of them are licensed under CC,
but at the time of writing
web include the
Building the model is quite simple, because the corpus has a similar format to the hunpos input/output format (except for more columns). So we just have to strip off a few columns:
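One way to strip the columns, sketched in Python. The column layout is an assumption: tab-separated rows with the token in the second column and the full tag in the fifth, as in CoNLL-style files. Check the actual corpus file and adjust `token_col` and `tag_col` if it differs. The sample rows are made up for illustration.

```python
def corpus_to_hunpos(lines, token_col=1, tag_col=4):
    """Reduce tab-separated corpus rows to 'token<TAB>tag' lines.

    Column indices are zero-based and are assumptions about the corpus
    layout. Empty lines are kept as sentence separators.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            out.append("")
            continue
        fields = line.split("\t")
        out.append(fields[token_col] + "\t" + fields[tag_col])
    return out


# Invented rows in the assumed layout, not copied from the corpus.
sample = [
    "1\tLučka\tlučki\tA\tAgpfsn\t_\n",
    "2\tkapetanija\tkapetanija\tN\tNcfsn\t_\n",
    "\n",
    "1\tNa\tna\tS\tSl\t_\n",
]
print("\n".join(corpus_to_hunpos(sample)))
```

The resulting lines can be written to a file and passed to the hunpos training binary (`hunpos-train`, which as far as I know takes the model name as its argument and reads the training data from standard input).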
Now you have the latest version of the model.