I’ve now scanned all pages of my Spanish edition of Harry Potter (y la piedra filosofal). This allows me to check which vocabulary I should know to read it comfortably and which of the currently unknown vocabulary I should add to Anki. Words that occur only once in the book (and might also be rare in general Spanish) are not worth learning right now.

The scan quality is not ideal, so the OCR quality suffers as well. I just used tesseract out of the box, and without any further tweaks it was really finicky during my tests: if the text lines are not completely straight or if the text is a bit blurry, detection goes down the drain.

I can also see this in the word list: it contains a lot of obviously wrong detections like eortix5, º'r or twm¡/…. These could easily be filtered out by comparing the detected words with a list of all Spanish words. I’m not sure if something like that exists out in the wild, but it could of course be created from a huge text corpus. I’m a bit lazy and hope that the list works for me in its current state. The wrong detections will hopefully each occur only once throughout the text and thus be deemed unworthy of Anki.
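The filtering idea could be sketched roughly like this. It is only an illustration, not what I actually ran: the function name `filter_known_words` and the tiny `spanish_words` set are made up here, and a real run would need an actual Spanish word list.

```python
def filter_known_words(tokens, vocabulary):
    """Keep only tokens that appear in the reference vocabulary (case-insensitive)."""
    known = {word.lower() for word in vocabulary}
    return [token for token in tokens if token.lower() in known]


# Hypothetical sample data: a tiny stand-in for a real Spanish word list.
spanish_words = {"piedra", "filosofal", "mago", "varita"}

# Two genuine words mixed with misdetections like the ones mentioned above.
ocr_tokens = ["piedra", "eortix5", "Mago", "º'r", "varita"]

print(filter_known_words(ocr_tokens, spanish_words))
```

A plain set lookup is enough here because the check is exact. Near misses (a real word with one misread letter) would be dropped as well, which is fine for this purpose.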

Results

OCR with lemmatization detected 8195 different words (in lemmatized form) in the book. Of these, 2916 words occur more than once. In total, the text has 77,655 words. The English book is said to have 76,944 words, so my total word count looks plausible. Of course, due to OCR errors it cannot be absolutely precise.
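For illustration, numbers like these can be derived from a list of lemmas with a simple frequency count. This is a minimal sketch that assumes OCR and lemmatization have already produced the list. The function name `summarize` and the sample data are made up for this example.

```python
from collections import Counter


def summarize(lemmas):
    """Return total, distinct, repeated and single-occurrence word counts."""
    counts = Counter(lemmas)
    total = sum(counts.values())      # all running words
    distinct = len(counts)            # different lemmas
    repeated = sum(1 for n in counts.values() if n > 1)
    once = distinct - repeated        # lemmas seen exactly once
    return {"total": total, "distinct": distinct, "repeated": repeated, "once": once}


# Hypothetical sample: already lemmatized tokens.
sample = ["harry", "ser", "un", "mago", "harry", "ser", "varita"]
print(summarize(sample))
```

Applied to the figures above, the same bookkeeping gives 8195 - 2916 = 5279 words that occur exactly once.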

The 98% rule says that we can understand unknown words from context, and thus read a text without problems, if we know 98 out of 100 words in it. Doing the maths, this means that in my Spanish edition of Harry Potter I’d have to know 76,102 out of the 77,655 written words. Or in other words, it’s OK if I do not know 1553 words.
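The arithmetic behind these two numbers, as a small sketch. Rounding up to the next whole word is my assumption about how to read "at least 98%", and the helper name `coverage_budget` is made up for this example.

```python
def coverage_budget(total_words, coverage_percent=98):
    """Return (words that must be known, words that may stay unknown)."""
    # Ceiling division in integer arithmetic, so "at least 98%" is always met.
    known_needed = -(-total_words * coverage_percent // 100)
    return known_needed, total_words - known_needed


known_needed, unknown_allowed = coverage_budget(77_655)
print(known_needed, unknown_allowed)  # 76102 1553
```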

According to my OCR analysis (including wrong OCR detections), there are 5279 words that occur only once throughout the whole book. If I knew none of them, that alone would be far more than the 1553 words I can afford not to know, so I cannot simply skip all of them. To decide which ones I can omit and which ones I should learn, I’d now need information about the importance of these words in general Spanish. So I’ll need to search for a list of Spanish words, or create one from Wikipedia, and then write a follow-up blog post.

Luckily, that will then also get rid of the misdetected words from my list. So stay tuned.

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.