Simple Grammar Search using a POS-Tree from Stanza Output
I’m currently playing around a bit with grammar detection in English lyrics. A first experiment using past perfect and comparative in English looked promising, so I’ll take some notes about my approach.
For simple grammar it’s enough to look at the POS tags and maybe the previous and next word, but I guess that for more difficult grammar this will not be enough anymore. Thus I wanted to build a grammar tree from the sentence and then try some pattern matching on the tree.
Natural language processing knowns two different forms of relationship analysis in sentences: Dependency Parsing and Constituency Parsing. I have not yet fully grasped the exact difference between both, but I have the feeling that Dependency parsing (focusing on the relationships between each word) is what I want.
The natural language processing library Stanza can output dependency infomation
between words, namely a reference to the head
token and a description of
the relation named deprel
. We can use these head
links to construct a tree
for each sentence. The deprel
would then be annotations at the edges of the
tree.
For the first few experiments I just used a simple Python class to build the tree and print some debug output. But when I needed to access the siblings of a node, I thought, there must be an existing library in Python to build trees.
The Python library anytree seems
to be doing what I want. We could use Anynode
with additional keyword
arguments or even our own class with the NodeMixin
mixin.
The library NetworkX should also work. It’s built for
general graphs, but of course a tree is just a special kind of graph.
Both libraries have some missing parts. anytree does not seem to have a way
to store edge labels which we could use for deprel
. Since each node only
has a single parent, we can store the edge label with the node, but it feels
a bit weird. NetworkX on the other hand does not seem to have a direct way to
get the siblings of a node. You’d have to go to the parent node first and then
back to the children. Thus, I’ll stick with anytree for this post.
First we can create a node class for the tree using the anynode.NodeMixin
.
I’ll use nodes without a token to represent the document root and each sentence
root. In my examples I always saw only a single word with head=0
, which could
act as the sentence’s root, but as far as I understand the stanza documentation
a virtual node should be used as root. I guess there might be more than one word with
head=0
:
head (int): The id of the syntactic head of this word in the sentence, 1-based for actual words in the sentence (0 is reserved for an artificial symbol that represents the root of the syntactic tree).
I also added some helper methods to access information about the tokens.
import sys
import stanza
from anytree import NodeMixin, RenderTree
class TreeNode(NodeMixin):
def __init__(self, token, parent=None, children=None):
self.token = token
self.parent = parent
if children:
self.children = children
@property
def name(self):
if self.token:
return self.token.text
else:
return "<no-word>"
@property
def deprel(self):
if self.token:
return self.token.deprel
else:
return "<no-word>"
@property
def xpos(self):
if self.token:
return self.token.xpos
else:
return "<no-word>"
def __repr__(self):
return f"TreeNode({self.token['text']})"
Now we can read the document and create a tree. Again, this is just quickly hacked together, so it might be far from optimal. It should just do the trick for a proof-of-concept.
def build_tree(nlp, text):
doc = nlp(text)
root_node = TreeNode(None)
for sentence in doc.sentences:
sentence_node = TreeNode(None, parent=root_node)
id_to_node = {}
for word in sentence.words:
id_to_node[word.id] = TreeNode(word)
for node in id_to_node.values():
if node.token.head == 0:
node.parent = sentence_node
else:
node.parent = id_to_node[node.token.head]
return root_node
def main(file_path):
nlp = stanza.Pipeline('en')
with open(file_path, 'r') as f:
text = f.read()
tree = build_tree(nlp, text)
for pre, fill, node in RenderTree(tree):
print(f"{pre}{node.name} (DepRel: {node.deprel}, XPOS: {node.xpos})")
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python script.py <path>")
sys.exit(1)
main(sys.argv[1])
Now, let’s try to find some grammar in an example document. I’ll try to detect Wh-clefts, because I had never heard of that grammar before (even though I use it myself) and it requires access to siblings.
We can write a function that takes as input argument a single tree node and checks some conditions on the tree starting at that node. For a Wh-cleft the conditions should be something along the line of:
- wh-word labeled with deprel
nsubj:outer
(in one example it was labeled only asnsubj
) - has as first sibling a copula
- has as child a deprel
acl:relcl
I am not sure whether these are indeed the exact required conditions, probably not. But it should be enough to show the concept.
def is_wh_cleft(node):
if not node.token:
return False
if node.deprel != 'nsubj:outer' or not node.name.lower().startswith('wh'):
return False
siblings = node.siblings
if len(siblings) == 0 or siblings[0].deprel != 'cop':
return False
has_relcl_child = any(child.deprel == 'acl:relcl' for child in node.children)
if not has_relcl_child:
return False
return True
Now we can run this on each sentence tree of a test document and output the tree if we found a
wh-cleft. As said, one of the sentences had the wh-word only labeled as
nsubj
, not as nsubj:outer
, so that one is not detected.
for sentence in tree.children:
wh_cleft = any(is_wh_cleft(node) for node in PreOrderIter(sentence))
if wh_cleft:
print('Found wh-cleft in the following tree:')
for pre, fill, node in RenderTree(sentence):
print(f"{pre}{node.name} (DepRel: {node.deprel}, XPOS: {node.xpos})")
Here’s the test input I used. The second sentence is a normal question using a wh-word, this should not be detected.
What she did was help the children.
What do you think about this?
What they like is smoked salmon.
What we need to do is get new batteries for it.
And that’s the output I got:
Found wh-cleft in the following tree:
<no-word> (DepRel: <no-word>, XPOS: <no-word>)
└── help (DepRel: root, XPOS: VB)
├── What (DepRel: nsubj:outer, XPOS: WP)
│ └── did (DepRel: acl:relcl, XPOS: VBD)
│ └── she (DepRel: nsubj, XPOS: PRP)
├── was (DepRel: cop, XPOS: VBD)
├── children (DepRel: obj, XPOS: NNS)
│ └── the (DepRel: det, XPOS: DT)
└── . (DepRel: punct, XPOS: .)
Found wh-cleft in the following tree:
<no-word> (DepRel: <no-word>, XPOS: <no-word>)
└── get (DepRel: root, XPOS: VB)
├── What (DepRel: nsubj:outer, XPOS: WP)
│ └── need (DepRel: acl:relcl, XPOS: VBP)
│ ├── we (DepRel: nsubj, XPOS: PRP)
│ └── do (DepRel: xcomp, XPOS: VB)
│ └── to (DepRel: mark, XPOS: TO)
├── is (DepRel: cop, XPOS: VBZ)
├── batteries (DepRel: obj, XPOS: NNS)
│ └── new (DepRel: amod, XPOS: JJ)
├── it (DepRel: obl, XPOS: PRP)
│ └── for (DepRel: case, XPOS: IN)
└── . (DepRel: punct, XPOS: .)