
Automatic Document Classification with Machine Learning and AI

Modern businesses are leveraging machine learning (ML) based solutions to automate operations and make document management faster and more effective.

The latest systems incorporate artificial intelligence (AI) to “read” documents like a human, identify and classify the type of document, and extract key data. Such systems can efficiently and accurately convert the varied content of those documents into standard data types, scanning for relevant information and feeding it into common data stores.

The importance of cognitive capture

Automated data capture technology already increases workplace efficiency and decreases business costs—but “intelligent” capture is even more powerful, leveraging AI and robotic process automation (RPA) to bring additional benefits to enterprises. Cognitive capture also uses natural language understanding to recognize phrases and determine the “emotion” of the document.

When business information is trapped in unstructured documents, it remains essentially invisible. Until now, most companies simply scanned these documents, indexed them with a date and document number, and stored them in a repository.

Only now, with these new cognitive systems, has data capture made it possible to capture all documents—not just structured layouts—and make that data actionable with a minimum of human intervention (IBM – The essential buyer’s guide to data capture and automation).

The most valuable data capture solution is one that can help:

  • Classify: the software learns to recognize different types of documents after being given a few variations and examples.
  • Extract: the software trains itself to understand context, such as what an invoice number is and what should (or shouldn’t) appear around the number, so there’s a high degree of accuracy in the extraction.
  • Validate: advanced search capabilities can validate extracted data from a document with existing information in another system.

Every industry has its own unique document types, and every organization handles documents in its own way according to its policies and procedures. Xenit is now working closely with customers and investing in automatic document classification, as well as data capture and extraction.

A real case of data classification using ML/AI

One of our customers is an international supplier of ready-to-use building product systems for waterproofing, building repairs, tile laying, and industrial floor coatings. They had a long and impractical process involving multiple manual steps to integrate documents into their system, and this laborious process was repeated several times a day. Our client’s goal was to automate document classification.

Their daily document-processing workflow consisted of:

  • Starting by selecting the right template for the document;
  • Adding some metadata to the template like the project number, project manager, and client, among many other things;
  • Printing document pending signatures;
  • Acquiring necessary signatures;
  • Scanning the document to PDF and assigning it to the project manager;
  • Placing the document under the right department folder, based on the document type.

The huge number of documents that needed to be processed and classified, combined with the complexity of the above process, led to a high probability of human error.

To overcome this challenge, our team suggested using machine learning models to automatically classify documents into a set of predefined categories (i.e., two documents belong to the same category if they are typically related).

The solution

The company provided us with almost 20 thousand documents and their respective templates, stored on an encrypted drive to protect private information. Our team built a scalable, deployable model to classify this set of documents and extract information from them.

These 20 thousand PDF documents fell into 7 different classes. The model extracts the text of each PDF document and applies vectorization to it. Count vectorization involves counting the number of occurrences of each word in a document (i.e., a distinct text such as an article, a book, or even a paragraph).
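To make count vectorization concrete, here is a minimal sketch using only the Python standard library. The tokenizer, the function names, and the two sample texts are assumptions made for illustration; they are not the code or data used in the project, and the PDF text extraction step is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_vectorize(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct word in the corpus, sorted so that
    # each word always maps to the same column.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical snippets standing in for text extracted from PDFs.
docs = [
    "Invoice number 1042 for waterproofing project",
    "Signed contract for tile laying project, project manager approved",
]
vocabulary, vectors = count_vectorize(docs)
```

Each document becomes a vector with one entry per vocabulary word, so documents of different lengths end up with the same number of features.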

The vectorized representation is then fed into the model for prediction. We used two popular NLP (natural language processing) vectorization approaches:

  • Extraction of BoW (bag of words), and
  • TFIDF (term frequency-inverse document frequency)
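The difference between the two approaches is in how each word is weighted. The sketch below computes TFIDF weights by hand, assuming one common smoothed definition of the inverse document frequency, `ln((1 + N) / (1 + df)) + 1`. The exact variant used in the project is not stated, and the sample texts are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectorize(documents):
    """Return (vocabulary, vectors) where each entry is count * idf."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (an assumed variant):
    # words found in fewer documents receive larger weights.
    idf = {
        word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
        for word in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] * idf[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical snippets standing in for text extracted from PDFs.
docs = [
    "invoice for waterproofing project",
    "contract for tile laying project",
    "invoice for floor coating",
]
vocabulary, vectors = tfidf_vectorize(docs)
weights = dict(zip(vocabulary, vectors[0]))
```

In the first document, "for" appears in every text and gets the lowest weight, while "waterproofing" appears in only one text and gets the highest, even though both occur once.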

Once these features were extracted, and after holding out a validation set and a test set from the training data, we applied different models to classify the data (logistic regression, support vector machines, and neural networks).
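As an illustration of this step, the sketch below trains a small multinomial logistic regression with per-example gradient steps and scores it on held-out examples. Everything in it is an assumption for demonstration purposes: the toy count vectors, the three classes, the hyperparameters, and the deterministic hold-out that stands in for a random split. A real project would more likely rely on an established ML library.

```python
import math


def predict_proba(weights, biases, features):
    """Softmax over one linear score per class."""
    scores = [
        sum(w * v for w, v in zip(row, features)) + bias
        for row, bias in zip(weights, biases)
    ]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, features):
    """Return the index of the most probable class."""
    probs = predict_proba(weights, biases, features)
    return probs.index(max(probs))


def train(X, y, n_classes, epochs=300, lr=0.1):
    """Fit multinomial logistic regression with per-example gradient steps."""
    weights = [[0.0] * len(X[0]) for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for features, label in zip(X, y):
            probs = predict_proba(weights, biases, features)
            for k in range(n_classes):
                # Gradient of the cross-entropy loss with respect to score k.
                error = probs[k] - (1.0 if k == label else 0.0)
                biases[k] -= lr * error
                for j, value in enumerate(features):
                    weights[k][j] -= lr * error * value
    return weights, biases


# Hypothetical count vectors over the vocabulary ["invoice", "contract", "report"].
X = [[2, 0, 0], [3, 0, 1], [1, 0, 0], [2, 1, 0],   # class 0
     [0, 2, 0], [0, 3, 1], [1, 2, 0], [0, 1, 0],   # class 1
     [0, 0, 2], [1, 0, 3], [0, 0, 1], [0, 1, 2]]   # class 2
y = [0] * 4 + [1] * 4 + [2] * 4

# Deterministic hold-out: every fourth example goes to the validation set.
train_idx = [i for i in range(len(X)) if i % 4 != 3]
valid_idx = [i for i in range(len(X)) if i % 4 == 3]
weights, biases = train(
    [X[i] for i in train_idx], [y[i] for i in train_idx], n_classes=3
)

correct = sum(predict(weights, biases, X[i]) == y[i] for i in valid_idx)
validation_accuracy = correct / len(valid_idx)
```

The model never sees the held-out examples during training, so `validation_accuracy` estimates how well it generalizes to new documents.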

This NLP approach proved very effective, even with simple vectorization. Comparing TFIDF with BoW, TFIDF produced better results on the metrics described in the glossary below, and it was also more stable. Prediction and training times for the models (not including the time taken to extract the text from the PDF documents) were negligible.

The models’ metrics were very promising. Accuracy, precision, recall, and F1 score varied only slightly between models, but the best result came from logistic regression on TFIDF features, with an accuracy of 0.945, precision of 0.966, recall of 0.919, and F1 score of 0.942.
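As a quick sanity check, the reported F1 score can be recomputed from the reported precision and recall. This assumes the F1 score is the harmonic mean of those two published figures; with seven classes, the averaging scheme actually used may differ slightly.

```python
# Figures reported above for logistic regression on TFIDF features.
precision = 0.966
recall = 0.919

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```

The result rounds to the reported 0.942, so the three figures are consistent with each other.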

By taking advantage of the existing documents and by looking rigorously into a functional problem, we were able to provide a machine learning-based solution using a basic yet effective way to extract data, build a model, and deliver a classification.

Eliminating manual processing can be a turning point for a company. Implementing such a solution could reduce administrative overhead and speed up document delivery, resulting in improved customer satisfaction.


Glossary

  • Bag Of Words (BoW): it builds a vocabulary of all the unique words in the whole set of documents and represents each document by how often each of those words appears in it.
  • TFIDF: it works similarly to BoW, but the word count (or term frequency) is also counterweighted by how common the word is in the entire set of documents.
  • Accuracy: it answers the following question: How many documents did we correctly classify out of all the documents?
  • Precision: it answers the following question: Of the documents we classified as Type1 (the name of a class), how many are actually Type1?
  • Recall: it answers the following question: Of all the documents that are Type1, how many did we correctly predict?
  • F1: the harmonic mean (a kind of average) of precision and recall.
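
The four metrics above can be computed directly from a list of true labels and a list of predicted labels. The following is a minimal sketch for a single class of interest; the function name, the labels, and the predictions are invented for illustration.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, plus precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true and predicted document types.
y_true = ["Type1", "Type1", "Type1", "Type2", "Type2", "Type3"]
y_pred = ["Type1", "Type1", "Type2", "Type2", "Type1", "Type3"]
metrics = classification_metrics(y_true, y_pred, positive="Type1")
```

Here four of the six documents are classified correctly. For Type1, two of the three predictions are right (precision) and two of the three actual Type1 documents are found (recall).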