Context:
To comply with GDPR, companies need to categorize their documents. Many companies manage millions of documents, so manually checking whether each one contains private or sensitive information is not realistic. However, this kind of categorisation is necessary to implement appropriate processes for protecting privacy-related information.
Objective:
We want to apply basic ML techniques to classify documents into 3 or 4 GDPR categories. The idea is to set up a basic ML infrastructure and try a number of techniques to see where positive results can be obtained in the short term. One setup will use deep learning (neural networks) to evaluate the potential of this AI technique.
Tasks:
- study the set of documents to be used and the task at hand: categorizing them by GDPR category
- identify 2 techniques and approaches that could yield short-term results
- study and analyse state-of-the-art techniques
- (Google TensorFlow …)
- implement these 2 classification approaches
- run 2 learning experiments and assist in building a learning model
- draw conclusions on the effort it takes to train ML models
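
As an illustration of what one "basic ML technique" from the tasks above could look like, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is not the method chosen for the project: the technique is picked here only as an example baseline, and the category labels ("sensitive", "personal", "none"), the toy training sentences, and all function and class names are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; the labels are illustrative, not official GDPR categories.
TRAIN = [
    ("Employee medical report and diagnosis", "sensitive"),
    ("Patient health record with blood test results", "sensitive"),
    ("Customer name, home address and phone number", "personal"),
    ("Contact list with email address and date of birth", "personal"),
    ("Quarterly sales figures and product roadmap", "none"),
    ("Meeting agenda for the product launch", "none"),
]

model = NaiveBayesClassifier().fit(*zip(*TRAIN))
prediction = model.predict("Health record and diagnosis of a patient")
print(prediction)
```

A word-count baseline of this kind is cheap to train and gives a reference point against which a deep learning setup can be compared; with real documents, the training set would need to be far larger and labelled by someone who knows the GDPR categories in use.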