Entity recognition with Machine Learning - spaCy

In our pilot study – Entity recognition with Machine Learning – we evaluated the effectiveness of applying automated named entity recognition to a set of miscellaneous documents in a corporate document management system. The idea is to use Machine Learning and Artificial Intelligence techniques to identify and extract natural person names from documents.
This results in a list of person names and the positions where those names appear in the document; that information can be added to the document as meta-data and used, for example, to link data or to assess GDPR sensitivity.
In that pilot study, we compared ready-to-use, pre-trained models for English of OpenNLP and Stanford CoreNLP, and trained our own models based on an annotated dataset of 2,616 documents with 23,462 natural person names. Although the pre-trained model of Stanford CoreNLP scores significantly better than OpenNLP on our real-world dataset, the results are not good enough for fully automated systems.
However, we learned that modest training efforts (using up to 2,000 labeled documents) boost effectiveness and narrow the differences between OpenNLP and CoreNLP – especially for precision – reaching up to 90–95% precision and 70–80% recall.
In this pilot study, we push things further: we introduce spaCy, another free, open-source NLP library, and compare its results with those of OpenNLP and Stanford CoreNLP.
About spaCy
spaCy is an open-source Python library that claims to offer “blazing fast and industrial-strength natural language processing”. It is a rising star in the NLP world: “in the five years since its release, spaCy has become an industry standard with a huge ecosystem”.
It is an interesting addition to our pilot study as it is based on an artificial neural network (OpenNLP is based on a maximum entropy model and Stanford CoreNLP on a conditional random field model).
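To illustrate the kind of output we are after, here is a minimal sketch (with a made-up example sentence) of how spaCy's pre-trained model can be used to extract person names together with their positions, ready to be stored as document meta-data:

```python
import spacy

# Load the pre-trained "large" English model (install it first with:
# python -m spacy download en_core_web_lg).
nlp = spacy.load("en_core_web_lg")

text = "The contract was signed by John Smith and reviewed by Mary Jones."
doc = nlp(text)

# Collect natural person names with their character offsets.
persons = [(ent.text, ent.start_char, ent.end_char)
           for ent in doc.ents if ent.label_ == "PERSON"]
print(persons)  # typically: [('John Smith', 27, 37), ('Mary Jones', 54, 64)]
```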
Setup for our study
We used the exact same data as in our pilot study and assessed two scenarios: using the pre-trained models for English, and using a model trained on our own training data. For the pre-trained scenario, we used the “large” model ‘en_core_web_lg’ (there are also “small” and “medium” models, but the “large” model gave better results for our data). For the training scenario, we did not start from scratch; instead, we used the option offered by spaCy to start from a pre-trained model (the same “large” model) and update it with our training data. (We also tried training a blank model, as we did for OpenNLP and Stanford CoreNLP in our pilot study, but updating a pre-trained model yielded better results.) We used the same training set 1 with 1,500 labeled documents as in our pilot study.
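For illustration, here is a minimal sketch of that update step using spaCy's v2-style training API (spaCy v3 uses Example objects and config-based training instead); the two-item TRAIN_DATA below is a hypothetical stand-in for the annotations derived from our 1,500 labeled documents, and the output path is invented:

```python
import random
import spacy
from spacy.util import minibatch

# Start from the pre-trained "large" English model and update it,
# rather than training a blank model from scratch.
nlp = spacy.load("en_core_web_lg")

# Annotation format: (text, {"entities": [(start, end, label)]}).
TRAIN_DATA = [
    ("Report drafted by Jane Doe.", {"entities": [(18, 26, "PERSON")]}),
    ("Please contact Tom Baker.", {"entities": [(15, 24, "PERSON")]}),
]

# Update only the NER component; keep the other pipes frozen.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # resume from the pre-trained weights
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35,
                       losses=losses)
        print(epoch, losses)

nlp.to_disk("custom_person_ner")  # hypothetical output path
```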
Results and conclusion
To assess the quality of the results, we again used the same test set 1 with 200 labeled documents as in our pilot study, allowing a side-by-side comparison with the OpenNLP and Stanford CoreNLP results from that study.
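The comparison rests on exact-match precision and recall over the predicted name spans. As a sketch of how these can be computed (the spans below are hypothetical):

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (doc_id, start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical spans: (document id, start offset, end offset).
gold = {("doc1", 18, 26), ("doc1", 40, 51), ("doc2", 0, 9)}
pred = {("doc1", 18, 26), ("doc2", 0, 9), ("doc2", 30, 38)}
print(precision_recall(pred, gold))  # (0.667, 0.667): 2 of 3 correct each way
```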
Precision and recall of OpenNLP, CoreNLP and spaCy for the pre-trained models and custom-trained models (training set 1 with 1,500 labeled documents and test set 1 with 200 labeled documents)
Test set 1  | OpenNLP            | CoreNLP            | spaCy
Model       | Precision | Recall | Precision | Recall | Precision | Recall
Pre-trained | 0.30      | 0.57   | 0.70      | 0.64   | 0.15      | 0.53
1500-1      | 0.92      | 0.63   | 0.92      | 0.76   | 0.77      | 0.93
It is clear, once again, that pre-trained models can underperform on real-world data; spaCy's pre-trained model yields the lowest results of the three. At the same time, it is equally clear that a relatively small amount of training boosts results significantly. After updating the pre-trained model with the 1,500 labeled documents, spaCy scores best on recall and reasonably well on precision. This high recall / reasonable precision result is very appealing, because it is easier to correct low precision than low recall. A high recall / low precision solution identifies most natural person names, but mixes them with a large number of false positives; since the output is a list of candidate names, those false positives can be found and removed by scanning the list. As such, a high recall / low precision solution can be turned into a high recall / high precision solution with reasonable effort. The opposite is far less tractable: a high precision / low recall solution misses many natural person names, and the only way to find those missed cases is to go through all the data again.
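To make that correction step concrete, here is a sketch (with hypothetical model output) of how a high recall candidate list can be reviewed efficiently:

```python
# Hypothetical output of a high recall / low precision model:
# (candidate name, document id) pairs.
hits = [
    ("John Smith", "doc1"),
    ("Invoice Total", "doc2"),   # false positive
    ("Mary Jones", "doc1"),
    ("John Smith", "doc3"),
]

# Reviewing the unique candidate names is far cheaper than re-reading
# every document: each non-name is rejected once, and all of its
# occurrences are dropped in one go.
unique_names = sorted({name for name, _ in hits})
rejected = {"Invoice Total"}  # outcome of the manual review
clean_hits = [(n, d) for n, d in hits if n not in rejected]
print(clean_hits)  # recall is preserved, precision is restored
```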
In that respect, we cannot say that the spaCy library works better than OpenNLP and Stanford CoreNLP. It works differently and, most importantly, can be quite complementary to OpenNLP and CoreNLP, as those methods yield higher precision but lower recall.