Filtering an OCR Scan of Harry Potter with a List of the 5000 Most Common Words

In a previous post I took a brief look at the vocabulary in a scan of Harry Potter in Spanish. I’ve now created a list of common Spanish words from the Gutenberg Dammit corpus, against which I can compare my scan.

Taking the 5000 most common words from the Gutenberg corpus and comparing them with my scan of Harry Potter, I find 2257 distinct words in Harry Potter that also occur in the top 5000 list. On the other hand, there are 5938 distinct words in my scan of Harry Potter that do not occur in the list of most common words.
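The comparison boils down to two set operations. Here is a minimal sketch of how it could look; the function name and the tiny sample lists are made up for illustration, and lower-casing every token is my assumption, not necessarily what the original script did.

```python
def split_vocabulary(scan_tokens, common_words):
    """Split the distinct words of a scan into those found in the
    common-word list and those missing from it."""
    # Lower-casing is an assumption so that "El" and "el" count as one word.
    scan_vocab = {t.lower() for t in scan_tokens}
    common = {w.lower() for w in common_words}
    covered = scan_vocab & common      # in the scan and in the list
    uncovered = scan_vocab - common    # in the scan only
    return covered, uncovered


# Made-up sample data, not taken from the real scan.
sample_tokens = ["el", "gato", "pulcro", "El", "úmico", "taburete"]
sample_common = ["el", "la", "gato", "único"]

covered, uncovered = split_vocabulary(sample_tokens, sample_common)
print(len(covered), "covered,", len(uncovered), "uncovered")
```

On the real data, the sizes of the two sets would correspond to the 2257 and 5938 figures.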

Looking at the list of these words, there are indeed a lot of OCR errors like úmico (único?) or cerpresa (sorpresa?). But there are also some real words like pulcro, telespectadores or taburete.

In the running text, these uncommon words account for a total of 13,918 running words. That’s almost 18% of the whole text (77,655 words). We cannot just ignore them all, so let’s look a bit closer.
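The step from distinct words to running words just means weighting each word by how often it occurs. A rough sketch, assuming the scan is already tokenised into a list of strings (the function name and sample data are hypothetical):

```python
from collections import Counter


def uncovered_share(scan_tokens, common_words):
    """Return (uncovered running words, total running words, share)."""
    counts = Counter(t.lower() for t in scan_tokens)
    common = {w.lower() for w in common_words}
    total = sum(counts.values())
    # Sum the occurrences of every word that is missing from the list.
    uncovered = sum(n for word, n in counts.items() if word not in common)
    share = uncovered / total if total else 0.0
    return uncovered, total, share


# Made-up sample data, not taken from the real scan.
sample = ["el", "gato", "pulcro", "el", "úmico", "taburete"]
print(uncovered_share(sample, ["el", "gato"]))

# The figures quoted in the post give the "almost 18%":
post_share = 13918 / 77655
print(f"{post_share:.1%}")
```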

Looking at the different words with their number of occurrences in the scan of Harry Potter, we can indeed find many real words. Some are proper nouns like Seamus (15 times), Ginny (4 times) and so on. Others are names of objects that only exist in the Harry Potter world like bludger (19 times) or quaffle (16 times). And some are indeed correct spellings like esquivar (5 times), agachar (10 times) or centímetro (13 times).

Skimming over the data, I roughly chose a cut-off at 5 occurrences. If a word occurs at least 5 times in my scan of Harry Potter, I put it into the list of words to know even if it does not occur in the list of 5000 most common words.

This brings us up to 70,741 of the 77,655 words in the running text (about 91%), meaning that 6914 words in the running text are still not covered. We might be able to assume that many of these are OCR errors. On the other hand, it’s quite common for a book to contain many words that occur only once.
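The cut-off rule and the resulting coverage can be sketched like this. The names words_to_know, coverage and min_count are made up, the cut-off is treated as inclusive, and the sample counts are invented.

```python
from collections import Counter


def words_to_know(counts, common, min_count=5):
    """Words worth knowing: in the common list, or frequent in the scan."""
    return {w for w, n in counts.items() if w in common or n >= min_count}


def coverage(counts, known):
    """Share of running words whose word form is in `known`."""
    total = sum(counts.values())
    covered = sum(n for w, n in counts.items() if w in known)
    return covered / total if total else 0.0


# Made-up counts, not the real data.
counts = Counter({"el": 10, "bludger": 5, "ginny": 4, "úmico": 1})
known = words_to_know(counts, common={"el"})
print(sorted(known), coverage(counts, known))

# With the figures from the post:
post_coverage = 70741 / 77655
print(f"{post_coverage:.1%}")
```

Raising or lowering min_count trades list size against coverage, so it is worth trying a few values before settling on one.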

Thus, I’d now just say that it’s worth knowing any word that occurs in the list of 5000 most common words or that occurs in Harry Potter at least 5 times. This gives the ability to understand about 90% of the book. If most of the remaining 10% are OCR errors, then this might in fact give the ability to understand 98% of the running words.

Unfortunately, combining this approach with my list of known Spanish words (from Anki) still shows 1143 unknown words. Many of these are already clear to me and just never made it into my Anki deck, for example señorita, ignorar, elegante, distancia, detestar. All proper nouns could also be removed from this list right away (Harry, King’s Cross). And some OCR errors are just too common, like the single-character detections T, P or m.

But there are also some correct findings like el rato, el tejido or palidecer.
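Some of this clean-up could be automated. The following is only a sketch under a few assumptions: single-character tokens are treated as OCR noise, and a word that never appears in lower case in the scan is treated as a proper noun. That second heuristic is crude, since it also drops ordinary words that happen to occur only at the start of a sentence. All names and the sample data are hypothetical.

```python
def filter_candidates(scan_tokens, candidates, known_words):
    """Reduce a candidate word list to words that are probably worth learning.

    scan_tokens: tokens from the scan in their original case
    candidates:  lower-cased words from the "words to know" list
    known_words: lower-cased words already known (e.g. exported from Anki)
    """
    # Tokens seen at least once entirely in lower case.
    seen_lowercase = {t for t in scan_tokens if t.islower()}
    result = set()
    for word in candidates:
        if len(word) <= 1:
            continue  # single characters: assumed OCR noise
        if word in known_words:
            continue  # already known
        if word not in seen_lowercase:
            continue  # only ever capitalised: assumed proper noun
        result.add(word)
    return result


# Made-up sample data, not taken from the real scan.
tokens = ["Harry", "palidecer", "T", "Palidecer", "señorita", "Ginny", "rato"]
candidates = {"harry", "palidecer", "t", "señorita", "ginny", "rato"}
remaining = filter_candidates(tokens, candidates, known_words={"señorita"})
print(sorted(remaining))
```

This would not catch words like ignorar or elegante that are obvious without being in the deck; those still need a manual pass.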

All in all, the result is still quite disappointing. And it was quite hacky and annoying to read several vocabulary files with a Python script, do some set operations on them and print the output. Much more annoying than just reading the book and looking up a word here and there. I’d really like to have some tooling for this. Unfortunately, my own custom application does not support long texts like books at the moment.
