
Entity recognition with Machine Learning - Ensemble Learners


In part I and part II of this series, we checked the effectiveness of applying automated named entity recognition on a set of miscellaneous documents in a corporate document management system. The idea is to use machine learning and artificial intelligence techniques to identify and extract natural person names from documents. This results in a list of person names and the position of where those person names appear in the document, and that information can be added to the document as meta-data. This can e.g., be used to link data or to derive GDPR-sensitivity.

What did we learn so far?

In part I – Entity recognition with Machine Learning – we compared ready-to-use, pre-trained models for English from OpenNLP and Stanford CoreNLP, and trained our own models on an annotated dataset of 2,616 documents containing 23,462 natural person names. Although the pre-trained model of Stanford CoreNLP scores significantly better than OpenNLP on our real-world dataset, the results are not good enough for fully automated systems. However, we learned that modest training efforts (using up to 2,000 labeled documents) boost effectiveness and narrow the differences between OpenNLP and CoreNLP – especially for precision – reaching up to 90-95% precision and 70-80% recall.

In part II – Entity recognition with Machine Learning: spaCy – we added spaCy to the comparison. spaCy is a free, open-source NLP library for Python that promises “blazing fast and industrial-strength natural language processing”. According to its website, it is a rising star in the NLP world: “in the five years since its release, spaCy has become an industry standard with a huge ecosystem”. Applying the pre-trained spaCy models to the exact same data again revealed rather low performance (15% precision, 53% recall), but once more, a limited amount of training (1,500 labeled documents used to update the pre-trained models) boosts performance (77% precision, 93% recall), resulting in the best score for recall and a reasonable score for precision. In that respect, we cannot say that spaCy works better than OpenNLP and Stanford CoreNLP. It works differently, and most importantly, it can be quite complementary to OpenNLP and CoreNLP, as those methods yield higher precision but lower recall.

In this third part, we focus on using a combination of methods/models, also called ensemble learners.

Introduction to ensemble learners

Any machine learning method/model suffers from a trade-off between precision and recall: an evaluation that is too soft/relaxed results in many false positives – objects that are not of interest – while an evaluation that is too strict/tight results in many false negatives – objects of interest that are missed. Even human classifiers suffer from this trade-off and almost never reach 100% precision (all identified objects are indeed relevant) and 100% recall (no relevant object is missed).
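To make the two measures concrete, here is a minimal sketch of how precision and recall can be computed for a name-extraction task, comparing a model’s proposed names against a gold-standard list (the names and numbers are illustrative, not from our dataset):

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of extracted names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# A model that proposes 4 names, 3 of which are correct, out of 6 real names:
p, r = precision_recall({"Alice", "Bob", "Carol", "IBM"},
                        {"Alice", "Bob", "Carol", "Dave", "Eve", "Frank"})
print(p, r)  # 0.75 0.5
```

Here the false positive “IBM” pulls precision down to 0.75, while the three missed names pull recall down to 0.5.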

However, as every single machine learning method/model will have a different view on the classification (as any human classifier will have a different opinion), the combination of multiple methods/models might yield better results.

For example, imagine one is interested in a maximum precision solution, i.e., a solution where all identified objects are of interest at the risk of missing some objects of interest – in our example, all person names proposed by the model are indeed person names, but some person names will be missed by the system. In that case, one can only retain those person names that are labeled by multiple methods/models as person names, increasing the likelihood that it is indeed a person name (as multiple independent models say so).

Or, imagine one is interested in a maximum recall solution, i.e., a solution where none of the objects of interest are missed at the cost of identifying objects that are not of interest – in our example, all person names present in the document set are identified by the system, but some of the identified words are mistakenly classified as person names. In that case, one can combine all person names revealed by at least one method/model, increasing the likelihood that all person names will be found.

In these two extreme examples, it is obvious that combining multiple methods can increase precision and recall respectively, but at the expense of lower recall and precision respectively. However, when there is enough diversity amongst methods/models, combining them can boost both precision and recall at the same time (compared to using a single method/model).

Ensemble Learners

Types of ensemble learners

The idea of ensemble learning is to combine multiple methods/models to achieve better results. To reach a single outcome, voting can be used, i.e., based on the outcomes of multiple methods/models, the final decision is taken by applying a threshold to the sum of outcomes. That can be a majority vote (at least half of the methods/models – or half plus one – need to classify the object as an object of interest), a specific threshold (at least x methods/models need to classify the object as an object of interest), or a weighted combination of votes (every method/model can have a different importance/weight in the voting). For example, if 10 models are available to identify natural person names, one can decide that at least 4 of them must label a word as a person name before it is accepted.
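A threshold vote like this is straightforward to implement. The following sketch (with made-up model outputs for illustration) counts how many models proposed each name and keeps only those reaching the minimum number of votes:

```python
from collections import Counter

def vote(model_outputs, min_votes):
    """Keep a name only if at least `min_votes` models proposed it.

    model_outputs: list of sets, one set of extracted names per model.
    """
    counts = Counter(name for names in model_outputs for name in names)
    return {name for name, n in counts.items() if n >= min_votes}

outputs = [
    {"Alice Smith", "Bob Jones"},          # model 1
    {"Alice Smith", "Acme Corp"},          # model 2
    {"Alice Smith", "Bob Jones", "Eve"},   # model 3
]
print(vote(outputs, min_votes=2))  # {'Alice Smith', 'Bob Jones'}
```

Setting `min_votes=1` gives the maximum-recall union of all outputs; setting it to the number of models gives the maximum-precision intersection.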


The following two typical ensemble learner techniques use this voting mechanism:

  • Bagging: the same method is trained on different independent samples of the data to arrive at multiple models, and voting is used to come to a final verdict (a classical example is a random forest, which is a combination of multiple decision trees).
  • Boosting: first, a model is trained on the data, and next models are added emphasizing the cases that were misclassified in the previous model (i.e., a new model is trained based on a new data sample mainly – but not only – containing the misclassified cases of the previous model). Again, voting is used to come to a final verdict (the difference with bagging is that with bagging different methods/models are trained in parallel on multiple independent subsets of the data; with boosting, different methods/models are trained in cascade based on the misclassified cases of the previous models).
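The bagging idea can be sketched in a few lines. Below, a deliberately simple 1-D “stump” classifier (the midpoint between the two class means) is trained on bootstrap resamples of a toy dataset, and the resulting stumps vote on new points; none of this is our NER setup, it only illustrates the mechanism:

```python
import random

def train_stump(sample):
    """Fit a 1-D threshold 'stump': midpoint between the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:  # degenerate bootstrap sample: fall back
        return 0.5
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def bagged_predict(stumps, x):
    """Majority vote over all bootstrap-trained stumps."""
    votes = sum(1 for t in stumps if x > t)
    return 1 if votes * 2 >= len(stumps) else 0

random.seed(0)
# Toy data: points below ~0.4 are class 0, above are class 1.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.35, 0),
        (0.45, 1), (0.6, 1), (0.7, 1), (0.9, 1)]
# Bagging: train each stump on a bootstrap resample of the data.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]
print(bagged_predict(stumps, 0.8))   # 1
print(bagged_predict(stumps, 0.15))  # 0
```

A random forest follows the same recipe with decision trees in place of the stumps.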

Another typical approach (not based on voting) is to train a machine learning model based on the outcomes of multiple techniques/models:

  • Stacking: first multiple methods/models are trained independently on the labeled data. Next, all the outcomes are combined resulting in a new dataset with as many features as the number of methods/models used in the first step. A new machine learning model is trained on that data to come to a final classification. This is a two-stage approach: first train multiple different machine learning models in parallel on the source data, next train a single machine learning model on the outcomes of all the models of the first stage.
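The two-stage structure of stacking can be sketched as follows. Each candidate name gets one 0/1 feature per base model (did that model propose it?), and a second-stage learner is trained on those features against the true labels. The article uses a random forest as the meta-model; to keep this sketch dependency-free, the meta-model below is a trivial learned vote threshold, and all names and labels are made up for illustration:

```python
def stack_features(candidates, model_outputs):
    """Stage 1: one 0/1 feature per base model for every candidate name."""
    return {c: [1 if c in out else 0 for out in model_outputs]
            for c in candidates}

def train_meta(features, labels):
    """Stage 2 (toy meta-learner): pick the vote threshold with the fewest
    training errors. A random forest, as in the article, would learn richer
    combinations of the same features."""
    n_models = len(next(iter(features.values())))
    return min(range(n_models + 1),
               key=lambda t: sum((sum(f) >= t) != labels[c]
                                 for c, f in features.items()))

model_outputs = [{"Alice", "Bob"}, {"Alice", "Acme"}, {"Alice", "Bob", "Eve"}]
candidates = {"Alice", "Bob", "Acme", "Eve"}
labels = {"Alice": True, "Bob": True, "Acme": False, "Eve": False}

feats = stack_features(candidates, model_outputs)
threshold = train_meta(feats, labels)
print(threshold)  # 2: 'at least two models agree' is error-free here
```

The point is that the meta-model learns from the data how much to trust each base model’s output, instead of that rule being fixed by hand.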

Setup of comparative study

The starting point of an ensemble learner is a set of methods with a variety of results. For our case, OpenNLP, CoreNLP, and spaCy can obviously be combined to construct an ensemble learner. However, the higher the variety in the results, the better the chances of arriving at better results. This means that combining more methods/models yields more opportunities for improvement.

To arrive at more methods than the three under study (OpenNLP, CoreNLP, spaCy), one can derive multiple models from the same method. Indeed, a given method (like spaCy) relies on multiple parameters that guide how the method derives a model from training data. Changing these so-called ‘hyperparameter’ values results in different outcomes. So, by varying these hyperparameter values, multiple models can be generated from the same basic technique.

We experimented with two hyperparameters in spaCy: (1) the number of iterations made to train the model (labeled examples are shuffled and presented multiple times during training to find optimal model parameters) and (2) the dropout rate (the percentage of neurons dropped during training to prevent overfitting – spaCy is based on artificial neural networks).

We trained models on our document set of 1,500 labeled documents with 1, 2, 5, 10, and 20 iterations, and dropout rates of 40% and 50%, resulting in 10 model variants based on spaCy. We ran those 10 trained models on our validation sample of 200 documents (exactly the same setup as in our previous tests) and observed that every model variant does indeed yield different results, i.e., every model produces a different set of identified natural person names, but all spaCy model variants share a high recall and reasonable precision (in line with our previous results for spaCy).
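The ten variants are simply the cross-product of the two hyperparameter grids. A minimal sketch (the commented training loop assumes a spaCy v2-style `nlp.update` API and hypothetical `texts`/`annotations` variables, and is not runnable as-is):

```python
from itertools import product

# The ten spaCy variants from the text: five iteration counts
# crossed with two dropout rates.
iterations = [1, 2, 5, 10, 20]
dropout_rates = [0.4, 0.5]

variants = list(product(iterations, dropout_rates))
print(len(variants))  # 10

# Training each variant might look roughly like this (assumed API):
# for n_iter, drop in variants:
#     for _ in range(n_iter):
#         nlp.update(texts, annotations, drop=drop, sgd=optimizer)
```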


Based on the 10 spaCy variants, and combined with the OpenNLP model, multiple ensemble learners were derived.

On one hand, the classical voting mechanism was used to assign a final classification, resulting in a set of ensemble learners where a word is classified as a person name if a minimum of 1, 2, 3 … 10 models do so.

On the other hand, an ensemble learner was derived using a ‘stacking’ mechanism: based on the outcomes of all the models and the real outcome, a random forest was trained to obtain a final classification.

Precision and recall of ensemble methods (voting and stacking) for custom trained models (trained on 1,500 labeled documents, validated on 200 labeled documents)

10 spaCy variants combined with OpenNLP

Ensemble                   Precision   Recall
1 vote                     0.30        0.96
2 votes                    0.49        0.91
3 votes                    0.61        0.88
4 votes                    0.70        0.86
5 votes                    0.76        0.84
6 votes                    0.80        0.82
7 votes                    0.85        0.81
8 votes                    0.88        0.80
9 votes                    0.92        0.78
10 votes                   0.95        0.75
Stacking (random forest)   0.92        0.84

As expected, precision goes up and recall goes down as the number of votes required for a positive classification (a word being a person’s name) increases. Tuning the minimum required number of votes allows one to choose an equilibrium between precision and recall. However, for the given validation set, the stacking solution (training a new model on the outcomes of a set of models) yields the best overall results (equal precision but far better recall).
