Simple Grammar Search using a POS-Tree from Stanza Output

I’m currently playing around a bit with grammar detection in English lyrics. A first experiment using past perfect and comparative in English looked promising, so I’ll take some notes about my approach.

For simple grammar it’s enough to look at the POS tags and maybe the previous and next word, but I guess that for more difficult grammar this will not be enough anymore. Thus I wanted to build a grammar tree from the sentence and then try some pattern matching on the tree.

Natural language processing knowns two different forms of relationship analysis in sentences: Dependency Parsing and Constituency Parsing. I have not yet fully grasped the exact difference between both, but I have the feeling that Dependency parsing (focusing on the relationships between each word) is what I want.

The natural language processing library Stanza can output dependency infomation between words, namely a reference to the head token and a description of the relation named deprel. We can use these head links to construct a tree for each sentence. The deprel would then be annotations at the edges of the tree.

For the first few experiments I just used a simple Python class to build the tree and print some debug output. But when I needed to access the siblings of a node, I thought, there must be an existing library in Python to build trees.

The Python library anytree seems to be doing what I want. We could use Anynode with additional keyword arguments or even our own class with the NodeMixin mixin. The library NetworkX should also work. It’s built for general graphs, but of course a tree is just a special kind of graph.

Both libraries have some missing parts. anytree does not seem to have a way to store edge labels which we could use for deprel. Since each node only has a single parent, we can store the edge label with the node, but it feels a bit weird. NetworkX on the other hand does not seem to have a direct way to get the siblings of a node. You’d have to go to the parent node first and then back to the children. Thus, I’ll stick with anytree for this post.

First we can create a node class for the tree using the anynode.NodeMixin. I’ll use nodes without a token to represent the document root and each sentence root. In my examples I always saw only a single word with head=0, which could act as the sentence’s root, but as far as I understand the stanza documentation a virtual node should be used as root. I guess there might be more than one word with head=0:

head (int): The id of the syntactic head of this word in the sentence, 1-based for actual words in the sentence (0 is reserved for an artificial symbol that represents the root of the syntactic tree).

I also added some helper methods to access information about the tokens.

import sys

import stanza
from anytree import NodeMixin, RenderTree


class TreeNode(NodeMixin):
    def __init__(self, token, parent=None, children=None):
        self.token = token
        self.parent = parent
        if children:
            self.children = children

    @property
    def name(self):
        if self.token:
            return self.token.text
        else:
            return "<no-word>"
    @property
    def deprel(self):
        if self.token:
            return self.token.deprel
        else:
            return "<no-word>"

    @property
    def xpos(self):
        if self.token:
            return self.token.xpos
        else:
            return "<no-word>"

    def __repr__(self):
        return f"TreeNode({self.token['text']})"

Now we can read the document and create a tree. Again, this is just quickly hacked together, so it might be far from optimal. It should just do the trick for a proof-of-concept.

def build_tree(nlp, text):
    doc = nlp(text)

    root_node = TreeNode(None)

    for sentence in doc.sentences:
        sentence_node = TreeNode(None, parent=root_node)

        id_to_node = {}
        for word in sentence.words:
            id_to_node[word.id] = TreeNode(word)

        for node in id_to_node.values():
            if node.token.head == 0:
                node.parent = sentence_node
            else:
                node.parent = id_to_node[node.token.head]

    return root_node


def main(file_path):
    nlp = stanza.Pipeline('en')

    with open(file_path, 'r') as f:
        text = f.read()

    tree = build_tree(nlp, text)

    for pre, fill, node in RenderTree(tree):
        print(f"{pre}{node.name} (DepRel: {node.deprel}, XPOS: {node.xpos})")



if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python script.py <path>")
        sys.exit(1)

    main(sys.argv[1])

Now, let’s try to find some grammar in an example document. I’ll try to detect Wh-clefts, because I had never heard of that grammar before (even though I use it myself) and it requires access to siblings.

We can write a function that takes as input argument a single tree node and checks some conditions on the tree starting at that node. For a Wh-cleft the conditions should be something along the line of:

wh-word labeled with deprel nsubj:outer (in one example it was labeled only as nsubj)
has as first sibling a copula
has as child a deprel acl:relcl

I am not sure whether these are indeed the exact required conditions, probably not. But it should be enough to show the concept.

def is_wh_cleft(node):
    if not node.token:
        return False

    if node.deprel != 'nsubj:outer' or not node.name.lower().startswith('wh'):
        return False

    siblings = node.siblings
    if len(siblings) == 0 or siblings[0].deprel != 'cop':
        return False

    has_relcl_child = any(child.deprel == 'acl:relcl' for child in node.children)
    if not has_relcl_child:
        return False

    return True

Now we can run this on each sentence tree of a test document and output the tree if we found a wh-cleft. As said, one of the sentences had the wh-word only labeled as nsubj, not as nsubj:outer, so that one is not detected.

    for sentence in tree.children:
        wh_cleft = any(is_wh_cleft(node) for node in PreOrderIter(sentence))
        if wh_cleft:
            print('Found wh-cleft in the following tree:')
            for pre, fill, node in RenderTree(sentence):
                print(f"{pre}{node.name} (DepRel: {node.deprel}, XPOS: {node.xpos})")

Here’s the test input I used. The second sentence is a normal question using a wh-word, this should not be detected.

What she did was help the children.
What do you think about this?
What they like is smoked salmon.
What we need to do is get new batteries for it.

And that’s the output I got:

Found wh-cleft in the following tree:
<no-word> (DepRel: <no-word>, XPOS: <no-word>)
└── help (DepRel: root, XPOS: VB)
    ├── What (DepRel: nsubj:outer, XPOS: WP)
    │   └── did (DepRel: acl:relcl, XPOS: VBD)
    │       └── she (DepRel: nsubj, XPOS: PRP)
    ├── was (DepRel: cop, XPOS: VBD)
    ├── children (DepRel: obj, XPOS: NNS)
    │   └── the (DepRel: det, XPOS: DT)
    └── . (DepRel: punct, XPOS: .)
Found wh-cleft in the following tree:
<no-word> (DepRel: <no-word>, XPOS: <no-word>)
└── get (DepRel: root, XPOS: VB)
    ├── What (DepRel: nsubj:outer, XPOS: WP)
    │   └── need (DepRel: acl:relcl, XPOS: VBP)
    │       ├── we (DepRel: nsubj, XPOS: PRP)
    │       └── do (DepRel: xcomp, XPOS: VB)
    │           └── to (DepRel: mark, XPOS: TO)
    ├── is (DepRel: cop, XPOS: VBZ)
    ├── batteries (DepRel: obj, XPOS: NNS)
    │   └── new (DepRel: amod, XPOS: JJ)
    ├── it (DepRel: obl, XPOS: PRP)
    │   └── for (DepRel: case, XPOS: IN)
    └── . (DepRel: punct, XPOS: .)

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.