Creating a Vocabulary List for a Printed Book

I’m currently reading Harry Potter in Spanish. Reading a book in a foreign language that you’ve already read in your native language before is quite relaxing, because even if you do not understand everything you still know what it’s all about.

Up to now I have looked up every word I did not know while reading the book, but this is quite cumbersome and, according to an article I recently read, not very useful either. Many words occur only once in the whole book and are not really important for the language in general. One should instead focus on the words that are repeated often in the book, because these are either common in the general language or important for the topic one is currently reading about. An example of the former in Harry Potter might be profesor. It is used often in Harry Potter because the story takes place in a school for wizards. For the latter I don’t have a perfect example at hand, maybe buscador, which in Harry Potter is Harry’s position in Quidditch. In common language it means search engine or searcher (meaning a person), a word that I do not actively use even in German.

If the book were digital, we could just run it through a word analysis and check all occurring words, but I own a printed copy (working with computers every day, I prefer to have my books on paper; plus I like knowing that I can just keep them on my shelf until I grow old). One day I finally had the idea that I could of course scan it! And luckily there are mobile phone apps that make photographing books a bit easier.

Photographing the Book

I had quite a good experience with CamScanner. Its premium version can even do the OCR on its own, so you could skip the tesseract part. One nice thing about CamScanner is that it normalizes the photographed page.

When you take a picture of an open book, the lines are usually not completely straight because the page does not lie completely flat on the table (as a single sheet of paper would). If you put a raw image into tesseract for OCR, tesseract fails miserably. But with CamScanner’s normalized version it works quite well.

CamScanner can photograph two pages at a time and creates one JPEG image per page.

OCR with tesseract

We can then copy these images to a computer and perform OCR with tesseract. I installed tesseract from my Linux distribution’s package repository including additional data for the Spanish language. You’d have to check how to install it on your operating system.

Once tesseract is installed, we can just run it on all images one-by-one:

for file in BookScan*.jpg; do
    tesseract "$file" "$file" -l spa
done

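This produces a .txt file next to each image with the detected text. Because the image name is also passed as the output base, the files are named like BookScan1.jpg.txt.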

Just be aware that tesseract is quite finicky about the quality of the photographed text. I had an image that was a bit blurry because I did not hold the phone completely steady, and tesseract only detected gibberish on that one.
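To spot such pages without reading every .txt file, a rough check could look like the sketch below. This is only a heuristic I am assuming here, not something tesseract provides: the function name looks_like_gibberish and the 0.7 threshold are made up for illustration. It counts how many tokens consist purely of letters once surrounding punctuation is stripped.

```python
import re

# Punctuation that may surround a Spanish word in the scanned text
PUNCTUATION = '.,;:!?¡¿"()«»—-'

def looks_like_gibberish(page_text, min_word_ratio=0.7):
    """Heuristic: flag a page if too few tokens look like real words."""
    tokens = page_text.split()
    if not tokens:
        # An empty page is treated as a failed scan
        return True

    # A token counts as word-like if only letters remain after stripping punctuation
    wordlike = [
        token for token in tokens
        if re.fullmatch(r'[^\W\d_]+', token.strip(PUNCTUATION))
    ]
    return len(wordlike) / len(tokens) < min_word_ratio
```

Pages flagged this way are candidates for photographing again. The threshold would need tuning on real scans, because pages with many numbers or short lines may be flagged by mistake.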

Concatenating the Pages

Now we have one .txt file per page. With a bit of Python we can concatenate these pages to a single chapter (or the whole book, but while writing this article I only did the first chapter). We’ll have to fix a few things from the scan along the way, but (if we’re not too pedantic) everything’s quite easy.

First of all, each page contains the page number at the bottom. We want to get rid of that. Easy enough: whenever we find a number at the end of the page preceded by new lines, remove it. Let’s write a function for that:

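import re

def remove_page_number(page_text):
    # Strip a trailing number (and the whitespace around it) from the end of the page.
    # No re.M here: as fourth positional argument it would be read as count, not as a flag.
    return re.sub(r'\s+\d+\s*$', '', page_text)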

My edition of Harry Potter uses — to indicate direct speech, but it seems the stanza tokenizer cannot handle that very well. Thus, I replaced it with the poor-man’s digital quotation mark ".

def replace_direct_speech_characters(text):
    return text.replace('—', '"')

I added a function to concatenate all pages, but that’s of course a simple one:

def concat_pages(pages):
    return '\n'.join([page.strip() for page in pages])

And finally, we need a function to remove hyphens at line breaks. I decided to go the easy way here and just always concatenate the lines when there is a hyphen at the end of the line. But this can lead to mistakes if the word itself contains a hyphen, e.g. in Harry Potter this can happen for the word Quien-usted-sabe (which in my word list occurs wrongly as Quienusted-sabe).

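def unhyphenize(full_text):
    paragraphs = re.split(r'\n\n+', full_text)

    fixed_paragraphs = []
    for paragraph in paragraphs:
        prepared_lines = []
        for line in paragraph.strip().split('\n'):
            line = line.rstrip()
            # endswith() also works for empty lines, where line[-1] would raise an IndexError
            if line.endswith('-'):
                prepared_lines.append(line[:-1])
            else:
                prepared_lines.append(line + ' ')

        # Remove the space left over after the last line of the paragraph
        fixed_paragraphs.append(''.join(prepared_lines).strip())

    return '\n\n'.join(fixed_paragraphs)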
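If the wrongly joined compounds bother you, one possible workaround is a small list of words whose hyphen should be kept. The following is just a sketch under that assumption, not what I used for my word list: the function name join_wrapped_lines and the keep_hyphen parameter are invented for this example. It joins the lines of a single paragraph and keeps the hyphen when the word formed across the line break is in the list.

```python
def join_wrapped_lines(lines, keep_hyphen=frozenset()):
    """Join the lines of one paragraph, removing line-break hyphens
    unless the resulting word is listed in keep_hyphen."""
    result = ''
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue

        if line.endswith('-') and i + 1 < len(lines):
            # Build the word that spans the line break, hyphen included
            last_fragment = line.split()[-1]
            next_tokens = lines[i + 1].split()
            candidate = last_fragment + (next_tokens[0] if next_tokens else '')

            if candidate.strip('.,;:!?"') in keep_hyphen:
                result += line        # known compound: keep the hyphen
            else:
                result += line[:-1]   # ordinary line-break hyphen: drop it
        else:
            result += line + ' '

    return result.strip()
```

For example, calling it with keep_hyphen={'Quien-usted-sabe'} leaves that word intact while ordinary line-break hyphens are still removed. The list has to be maintained by hand, so this only pays off for a few frequent compounds.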

With that we have basically everything we need to concatenate the OCR’ed text into a single text document:

pages = []
for i in range(1, 16):
    with open(f'BookScan{i}.jpg.txt', 'rt') as f:
        text = f.read()

    text = remove_page_number(text)
    pages.append(text)

text = concat_pages(pages)
text = unhyphenize(text)
text = replace_direct_speech_characters(text)
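The loop above hard-codes the page range 1 to 15. For a whole book it might be more convenient to discover the files automatically. Here is a minimal sketch, assuming the files follow the BookScan<number>.jpg.txt naming from above; the helper name page_files_in_order is hypothetical. Sorting by the number matters, because a plain alphabetical sort would put BookScan10 before BookScan2.

```python
import re
from pathlib import Path

def page_files_in_order(directory='.', pattern='BookScan*.jpg.txt'):
    """Return the OCR text files sorted by the page counter in their name."""
    def page_counter(path):
        # First run of digits in the file name, e.g. 10 for BookScan10.jpg.txt
        match = re.search(r'\d+', path.name)
        return int(match.group()) if match else 0

    return sorted(Path(directory).glob(pattern), key=page_counter)
```

Each returned path can then be read with path.read_text(encoding='utf-8') and passed through remove_page_number as before.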

Lemmatizing Words with stanza

I’ve already used stanza several times before in this blog to do text analysis. There may be a simpler solution if only lemmatization is needed, but I’m really happy with stanza and it works well, so I’ll continue to use it.

from collections import Counter
import stanza

c = Counter()
nlp = stanza.Pipeline('es')
doc = nlp(text)

for word in doc.iter_words():
    word_data = word.to_dict()
    if 'upos' in word_data and word_data['upos'].lower() not in ['punct', 'sym']:
        c[word.lemma] += 1

for word, count in c.most_common():
    print(f'{word}\t{count}')
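Since the idea is to concentrate on words that are repeated, the raw counts could be filtered a bit further. The sketch below is one possible way, with several assumptions of mine: the function name vocabulary_candidates, the min_count of 2, and the choice to skip numbers and proper nouns (names like Harry) are all illustrative. It works on plain (lemma, upos) pairs using the Universal POS tags, so it can be tried without loading a stanza model.

```python
from collections import Counter

# Universal POS tags left out of the vocabulary list (an assumption, adjust to taste)
SKIPPED_UPOS = frozenset({'PUNCT', 'SYM', 'NUM', 'PROPN'})

def vocabulary_candidates(lemma_upos_pairs, min_count=2, skipped_upos=SKIPPED_UPOS):
    """Count lemmas and keep only those seen at least min_count times."""
    counts = Counter(
        lemma.lower()
        for lemma, upos in lemma_upos_pairs
        if lemma and upos not in skipped_upos
    )
    return [(lemma, n) for lemma, n in counts.most_common() if n >= min_count]
```

With the stanza document from above, the pairs could be produced with ((word.lemma, word.upos) for word in doc.iter_words()).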

Future Possibilities

This blog post focused on the technical aspects of extracting vocabulary from a printed book. I’ll probably write another blog post where I do a bit of analysis. Ideally, you should understand 98 out of 100 words in a foreign-language text if you want to have fun reading it. As I understand it, this refers to running words in the text, so if a word occurs 5 times in 100 words and you know it, then you’re already 5 percentage points closer to the 98%. I want to analyze what that would mean for my copy of Harry Potter: which words I still have to learn, how many words you need to know in total, and how many words you can just ignore and still read the book confidently.
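To make the running-words idea concrete, here is a rough sketch of how the coverage could be computed from a Counter of lemmas like the one built above. The function names coverage and lemmas_needed are my own invention for this illustration, and it assumes that knowing a lemma means understanding all of its occurrences, which is a simplification.

```python
from collections import Counter

def coverage(lemma_counts, known_lemmas):
    """Share of running words whose lemma is in known_lemmas."""
    total = sum(lemma_counts.values())
    if total == 0:
        return 0.0
    known = sum(n for lemma, n in lemma_counts.items() if lemma in known_lemmas)
    return known / total

def lemmas_needed(lemma_counts, target=0.98):
    """How many of the most frequent lemmas are needed to reach the target coverage."""
    total = sum(lemma_counts.values())
    covered = 0
    for needed, (_, n) in enumerate(lemma_counts.most_common(), start=1):
        covered += n
        if covered >= target * total:
            return needed
    return len(lemma_counts)
```

A lemma that occurs 5 times among 100 running words contributes 0.05 to the coverage, which matches the 5 points mentioned above.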

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.