Chaos converter

High-Level Project Summary

We have developed a web service for queries and a back end that downloads and processes the documents. The project is divided into three phases:

1- Keyword extraction and document summarization, covering the requirements of the challenge.

2- Document similarity: grouping and searching documents by similarity.

3- Knowledge graph: building a knowledge graph from the processed documents to find relationships between them.

Together, these three phases give us a way to find documents, read them quickly and discover relationships between them.

Detailed Project Description


The project has been divided into three phases.


The project starts with a script that downloads the documents and their associated information through the provided API. Once extracted, this information is stored in ElasticSearch, which serves as the search engine and as the source of data for the subsequent processes.
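As an illustration of the storage step, the sketch below serializes downloaded documents into the newline-delimited JSON body that the ElasticSearch bulk endpoint expects. The index name `ntrs_documents` and the fields `id`, `title` and `text` are assumptions made for this example, not the schema used in the project.

```python
import json


def build_bulk_payload(documents, index_name="ntrs_documents"):
    """Serialize documents into the NDJSON body of an ElasticSearch _bulk request.

    The index name and the document fields are hypothetical.
    """
    lines = []
    for doc in documents:
        # Action line: index the document under its own id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        # Source line: the fields to store and search.
        lines.append(json.dumps({"title": doc["title"], "text": doc["text"]}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Made-up documents standing in for the output of the download script.
docs = [
    {"id": "doc-1", "title": "Example report", "text": "Extracted text of the first document."},
    {"id": "doc-2", "title": "Another report", "text": "Extracted text of the second document."},
]
payload = build_bulk_payload(docs)
print(payload)
```

The resulting string can be sent with a POST request to the `/_bulk` endpoint using the content type `application/x-ndjson`.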


First phase - Document summary and keyword extraction.


Due to time constraints, we relied on existing libraries that already give acceptable results, rather than reinventing the wheel in pursuit of better quality.


The first problem we noticed was the poor quality of the OCR (mainly due to the quality of the documents). To try to overcome these errors, we used a grammar checker (language-tool).


We generated the summaries with the HuggingFace library, using a fine-tuned distilbart model whose results we liked.
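A minimal sketch of how such a summarization step could look is shown below. The checkpoint name `sshleifer/distilbart-cnn-12-6` is an assumption (the exact fine-tuned model is not specified here), and the word-based chunking is an added illustration for documents longer than the model's input limit, not necessarily what the project does.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_document(text, summarizer, max_words=400):
    """Summarize each chunk with `summarizer` and join the partial summaries.

    `summarizer` is any callable that behaves like a HuggingFace summarization
    pipeline: it takes a string and returns a list of dicts with a
    "summary_text" key.
    """
    partial_summaries = []
    for chunk in split_into_chunks(text, max_words):
        result = summarizer(chunk)
        partial_summaries.append(result[0]["summary_text"].strip())
    return " ".join(partial_summaries)


def load_distilbart_summarizer(model_name="sshleifer/distilbart-cnn-12-6"):
    """Build a summarization pipeline; the model name is an assumed checkpoint."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline
    return pipeline("summarization", model=model_name)
```

With the `transformers` package installed, a call such as `summarize_document(ocr_text, load_distilbart_summarizer())` would download the model on first use and return the joined summary, where `ocr_text` stands for the extracted text of one document.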


Keyword extraction was performed with the RAKE library.
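To show the idea behind RAKE, here is a simplified re-implementation of its scoring scheme. It is not the library used in the project. Candidate phrases are obtained by splitting the text at punctuation and stopwords, each word is scored as degree divided by frequency, and a phrase scores the sum of its words. The stopword list is deliberately tiny and only meant for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real run would use a complete one.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
             "of", "on", "the", "to", "was", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text, top_n=5):
    """Return the top_n (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, the word itself included.
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[word] / freq[word] for word in phrase)
        for phrase in set(phrases)
    }
    # Sort by descending score, then alphabetically to make ties deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


sample = ("The thermal protection system of the shuttle was tested. "
          "Thermal protection is critical for reentry.")
print(rake_keywords(sample, top_n=2))
# → [('thermal protection system', 8.0), ('thermal protection', 5.0)]
```

Longer phrases made of words that rarely appear alone score highest, which is why multi-word technical terms tend to rise to the top.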


As future work for this phase:

- Investigate techniques to correct OCR errors.

- Test different models for summary generation.

- Test other techniques for keyword extraction (Text Rank, Multi-word Keyword Scoring, Expand Rank, Position Rank, Word Attraction Rank, Embed Rank, TAKE, KeyBERT, Topic Rank).


Second phase - Grouping and searching of documents by similarity


With the first phase, we could search for documents that share keywords in the hope that they are similar. To take hope out of the equation, we propose to apply document similarity techniques.


The basic idea is to apply the TF-IDF transformation to the documents to obtain representative vectors, and then group them into clusters using cosine similarity. Once the documents are grouped, we can store which documents are similar to each other, and even define a cosine similarity threshold and use it when searching for new documents in the future.
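The following sketch illustrates the proposed approach in plain Python: it builds sparse TF-IDF vectors, compares them with cosine similarity and keeps the pairs above a threshold. The whitespace tokenizer, the smoothed IDF variant and the threshold of 0.3 are assumptions chosen for the example. The clustering step itself is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer, used only for illustration."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(documents, threshold=0.3):
    """Return (i, j, score) for every pair of documents at or above the threshold."""
    vectors = tfidf_vectors(documents)
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine_similarity(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((i, j, round(score, 3)))
    return pairs


corpus = [
    "rocket engine combustion test",
    "rocket engine test results",
    "lunar soil sample analysis",
]
pairs = similar_pairs(corpus)
print(pairs)  # only documents 0 and 1 share terms
```

The stored pairs can then answer "which documents are similar to this one", and a new document can be vectorized the same way and compared against the existing vectors using the same threshold.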


Third phase - Knowledge extraction


Here we are going to use the spaCy library and the Neo4j database.


The objective is to extract information from the documents in the form of relationships between entities: Entity -> Relationship -> Entity (for example, Joe -> Work_in -> NASA).
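As a sketch of how one such triple could be stored in Neo4j, the helper below turns it into a Cypher `MERGE` query plus its parameters. The `Entity` label and the `name` property are assumptions made for this example and do not describe a schema defined by the project.

```python
import re


def triple_to_cypher(head, relation, tail):
    """Return a (query, parameters) pair that stores one triple in Neo4j.

    The Entity label and the name property are hypothetical.
    """
    # Relationship types cannot be passed as Cypher parameters, so the type is
    # sanitized and written into the query text; entity names stay parameterized.
    rel_type = re.sub(r"\W+", "_", relation.strip()).strip("_").upper()
    if not rel_type or rel_type[0].isdigit():
        raise ValueError(f"cannot build a relationship type from {relation!r}")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{rel_type}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triple_to_cypher("Joe", "Work_in", "NASA")
print(query)
print(params)
```

Using `MERGE` instead of `CREATE` keeps the graph free of duplicate nodes when the same entity appears in several documents. With the official Neo4j Python driver, the pair could be executed with `session.run(query, params)`.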


With this information we can link documents that a priori had no apparent relationship. For example, document A contains the solution to a problem that is mentioned in document B.
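The linking idea can be sketched without a graph database: two documents are related when their triples mention the same entity. The triples below are invented for illustration and stand in for the output of the extraction pipeline.

```python
from collections import defaultdict
from itertools import combinations


def link_documents(triples_by_document):
    """Map each pair of documents to the set of entities they both mention."""
    docs_by_entity = defaultdict(set)
    for doc_id, triples in triples_by_document.items():
        for head, _relation, tail in triples:
            docs_by_entity[head].add(doc_id)
            docs_by_entity[tail].add(doc_id)
    links = defaultdict(set)
    for entity, doc_ids in docs_by_entity.items():
        # Every pair of documents sharing this entity gets a link.
        for pair in combinations(sorted(doc_ids), 2):
            links[pair].add(entity)
    return dict(links)


# Hypothetical triples: document A solves a problem that document B mentions.
triples_by_document = {
    "A": [("Patch X", "SOLVES", "Valve leak")],
    "B": [("Valve leak", "OBSERVED_IN", "Test stand 2")],
    "C": [("Joe", "WORK_IN", "NASA")],
}
links = link_documents(triples_by_document)
print(links)  # → {('A', 'B'): {'Valve leak'}}
```

In a graph database the same question becomes a path query between document nodes, which also covers links that pass through more than one intermediate entity.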


For this objective we apply a chain of NLP tasks:

1- Coreference resolution

2- Named Entity Recognition

3- Named Entity Linking

4- Relationship Extraction


Finally, the web interface should be modified to add similarity searches and Neo4j searches.

Space Agency Data

We used documents extracted from NTRS, filtered with Center = Legacy CDMS and restricted to documents available for download.

Hackathon Journey

It has been a good experience. The challenge was very well explained and the resources provided were very well organized and documented. It was very easy to use them.


We selected this challenge because one of the group members loves the NLP field and is always looking for challenges that make him think. It is also valuable experience, because the business world faces the same problem of knowing how to organize and take advantage of information.


We would like to thank our wives for their patience in putting up with the extra work and giving us some time for the challenge.

References

- Programming language: Python

- Libraries: HuggingFace, language_tool, RAKE, nltk, PyPDF2, Django

- Tools: Docker, docker-compose, ElasticSearch, Git, Jupyter notebooks

- Image generated with DALL-E

- Google Presentations

Tags

#nlp #elasticsearch #keyword #summarize #documentsimilarity #knowledgeextraction #knowledgegraph #grammarchecker