Document Analyser System

High-Level Project Summary

Our project aims to make NASA research papers easily accessible, searchable, and usable for researchers, academics, students, decision makers, and the general public. By helping people find relevant information in these papers much faster, it can accelerate research and improve knowledge sharing. In this way we preserve NASA's legacy of research and exploration and make it readily available to innovators, researchers, and academics, so that this data can be put to better use in solving modern problems.

Detailed Project Description

What did we develop?



We have developed an NLP-based model that analyzes all the files in a directory (the corpus) and generates an interactive report for each file, known as the Document Analysis Report.

This report conveys the crux of the whole document interactively, in the form of charts and graphs built from the words used in the text. It helps the reader determine whether the document is relevant to their current requirement, so they can easily decide which articles are worth reading in full.

The report also includes a short summary of the document, approximately 10 to 20 percent of the length of the original text.


We also provide the links to each original document and its Document Analysis Report, along with that document's category, subcategory, and top keywords, in a spreadsheet known as the Mapping Sheet. Collecting the links to all the documents and their reports in a single place, together with their keywords, makes the corpus significantly easier to search and navigate.



Report (our project report; please open it in a new tab if it does not open directly)




How does it solve the challenge?



Our project addresses this challenge by making the Legacy CDMS documents more searchable and navigable: it classifies them into appropriate categories and subcategories based on term frequency in the text and organises them in a spreadsheet along with a list of keywords, so users can find the required content more easily and quickly.

It also distils long research documents into brief, precise reports with insightful graphs and a concise summary, making it easier for the user to find content relevant to their context.



What exactly does it do?



Our model takes a corpus of documents as input: a directory containing, in PDF format, all the documents for which reports are to be generated.

The model works iteratively: it steps through each document in the corpus and generates its report. Its key features are:






  1. Scans the content of the original document from the PDF file and generates a bag of words by cleaning the extracted text (see the sketch after this list)
  2. Generates a high-level summary of the contents of the document
  3. Generates interactive and appealing graphs from the bag of words to provide critical insights into the contents of the document in a visually appealing and easily understandable manner
  4. Categorizes the documents into various categories and subcategories using the NASA STI Subject and Scope Category Guide
  5. Combines all the graphs and the summary into the ‘Document Analysis Report’, a PDF file that conveys all these insights to the user in a single place
  6. Generates a Mapping Sheet containing the links to all the original documents and their reports, along with categories, subcategories, and keywords, to provide better search and navigation
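
For illustration, here is a minimal sketch of the first feature: extracting the text of a single PDF and turning it into a cleaned bag of words. The file path is only a placeholder, and the sketch assumes PyPDF2 (the PdfReader API) and NLTK with its punkt and stopwords data downloaded.

```python
import re
from collections import Counter

import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def bag_of_words(pdf_path):
    """Extract the text of a PDF and return keyword frequencies."""
    reader = PyPDF2.PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)

    # Keep only alphabetic tokens and drop common stopwords such as 'is' and 'are'.
    stop = set(stopwords.words("english"))
    tokens = [w.lower() for w in word_tokenize(text)]
    keywords = [w for w in tokens if re.fullmatch(r"[a-z]+", w) and w not in stop]
    return Counter(keywords)


# Example with a placeholder path:
# print(bag_of_words("Corpus/sample.pdf").most_common(10))
```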



How does it work?



The model is developed mainly in Python, along with several libraries for NLP, PDF parsing, and so on. It works through the following steps:





  1. The model takes the path of a directory (the corpus) as input, scans the directory using Python's OS module, obtains the addresses of all the documents present in the corpus, and stores them in a list
  2. Using the file addresses collected in step 1, it opens and parses each document in the list iteratively, extracts all the text in that document using the PyPDF module, and stores it in a string
  3. The model then tokenizes the text in two different ways, word tokenization and sentence tokenization, using the NLTK library, creating a vocabulary of both the words and the sentences present in the text
  4. It then cleans the data by removing stopwords (the most commonly used words such as 'is' and 'are'), numerics, and special characters from the word vocabulary, creating a list of the major keywords in the document
  5. The algorithm then calculates the term frequency and inverse document frequency of the words listed in step 4, and uses the TF-IDF of each word to score the sentences stored in the sentence vocabulary created in step 3
  6. Using these scores, the algorithm generates a summary by picking the top 10% of sentences from the sentence vocabulary and combining them (a condensed sketch of steps 1 to 6 follows this list)
  7. The model also generates several graphs from the bag of keywords, such as a word cloud, a keyword frequency graph, and a term frequency percentage graph, using the Plotly library, to provide meaningful insights about the contents of the document in an easily understandable and visually appealing manner
  8. The model also classifies the document into categories and subcategories based on the keywords it contains, using the category- and subcategory-wise keyword lists given in the NASA STI Subject and Scope Category Guide
  9. The model then generates the 'Document Analysis Report' using the FPDF module; it provides the user with all the insights about the document and contains the graphs generated in step 7, the summary generated in step 6, and the bag of words with frequencies
  10. Finally, after generating the reports for all the documents in the corpus, it generates a Mapping Sheet using the Pandas module, containing the links to all the original documents and their reports along with categories, subcategories, and keywords, to provide better search and navigation
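
A condensed sketch of steps 1 to 6, assuming PyPDF2 and NLTK (with the punkt and stopwords data downloaded). For brevity, sentences here are scored by term frequency alone; the full model also weights each term by its inverse document frequency across the corpus. The directory name "Corpus" is a placeholder.

```python
import os
import re
from collections import Counter

import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

STOP = set(stopwords.words("english"))


def extract_text(pdf_path):
    """Step 2: read every page of the PDF and concatenate the text."""
    reader = PyPDF2.PdfReader(pdf_path)
    return " ".join(page.extract_text() or "" for page in reader.pages)


def summarise(text, ratio=0.10):
    """Steps 3-6: score sentences by keyword frequency and keep the top 10%."""
    sentences = sent_tokenize(text)
    words = [w.lower() for w in word_tokenize(text)
             if re.fullmatch(r"[a-z]+", w.lower()) and w.lower() not in STOP]
    tf = Counter(words)

    def score(sentence):
        tokens = [w.lower() for w in word_tokenize(sentence)]
        return sum(tf.get(w, 0) for w in tokens) / (len(tokens) or 1)

    keep = max(1, int(len(sentences) * ratio))
    top = set(sorted(sentences, key=score, reverse=True)[:keep])
    # Reassemble the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)


def analyse_corpus(corpus_dir):
    """Step 1: collect the paths of all PDFs in the corpus and summarise each one."""
    paths = [os.path.join(corpus_dir, name)
             for name in sorted(os.listdir(corpus_dir))
             if name.lower().endswith(".pdf")]
    return {path: summarise(extract_text(path)) for path in paths}


# summaries = analyse_corpus("Corpus")   # placeholder directory name
```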



What benefits does our model have?



Our model benefits anyone who has to navigate lengthy research articles and documents to find relevant information and is frustrated by how tedious and time consuming that task is.


NASA's current search mainly scans the article title and abstract to filter results. Moreover, even once users reach those documents, the documents are so lengthy that exploring each article in detail becomes a very tedious task.


Our model presents the major keywords in the document, along with their frequencies, as visually appealing graphs that help the user determine at a glance whether the article is relevant to their context. It also provides a high-level summary that the user can consult alongside the graphs to decide whether the document is worth exploring in detail.


Our model also provides the links to all these documents and their reports, along with categories, subcategories, and keywords, in a single spreadsheet, leading to faster and easier search.



What do we hope to achieve?



We hope to provide a real-time application that can analyze the contents of a document in depth and communicate its insights in a brief, precise, clear, and understandable manner.


Some things we would like to incorporate in our current model to make it better are:






  • Due to limited time and resources we were able to apply our solution only to a small corpus of about 60 documents; given more time and better computational resources, we would like to apply it to much larger corpora
  • Currently, our model uses extractive techniques for text summarization. We would like to apply abstractive summarization techniques, as they produce better results, but at present we lack the computational and monetary resources to do so
  • We would like to build an interactive chatbot as part of our model, which could make the application much more interactive and provide better assistance to users in finding the right information
  • We would also like to incorporate a web scraping feature that can extract data from the NASA NTRS server in real time, automating the task of corpus creation
  • We would also like to add a real-time summarizer that could generate a custom summary whose length and language follow the inputs given by the user
  • We would also like to use translation features to make our model multilingual and accessible to people all around the world
  • Finally, we would like to develop better recommendation engines that can recommend articles similar to the ones the user reads most



Looking at the Big Picture 



In the longer run, our model will help people from different professional backgrounds work more efficiently and productively by providing easier access to years of NASA research data in a single place, which in turn can lead to better solutions to many real-world problems. The groups for whom our model will be most beneficial are:



1. Researchers and Innovators


Our model will be very helpful to future researchers and innovators, as it provides easier access to and navigation of this treasure trove of research and information in an accessible and understandable way, which will accelerate the process of research and innovation.


2. Academics


It will be highly useful for academics, as they will gain better access to knowledge, leading to better sharing of knowledge and information.


3. Policy Makers


It will also be highly useful for policymakers, who can get brief and specific reports on a variety of issues to help them make better decisions.


4. General Public


Our model will also be useful for the general public, especially curious and inquisitive individuals, as this treasure trove of information becomes more accessible to them.



What tools, coding languages, software, and hardware did we use?



We used the following tools and software in the development of our project:






  • Python 
  • Google Workspace 
  • Google Colab
  • PyPDF2 module
  • FPDF module
  • NLTK
  • Matplotlib
  • Plotly
  • Pandas
  • Regular Expressions

Space Agency Data

We used data from the NASA NTRS (NASA Technical Reports Server) to build our corpus:


https://ntrs.nasa.gov/


We also used the NASA STI Subject and Scope Category Guide for categorising the documents into different categories according to keyword frequency:


https://ntrs.nasa.gov/api/citations/20000025197/downloads/20000025197.pdf

Hackathon Journey

How would we describe our Space Apps Challenge Experience?



Space Apps has given us a golden opportunity to learn new skills and to use our skills and talents for the betterment of society, working on a problem we have faced ourselves. Tackled properly, it can accelerate research and development, knowledge sharing, and quick access to information.



What did we learn?



Working on this project for the hackathon, to help preserve the Space Legacy and make it easily accessible, has given us a chance to learn and grow. We gained knowledge of many new tools and technologies, as well as of ongoing studies and research in domains such as Machine Learning and Natural Language Processing. Some of our major learning outcomes were:






  1. We did extensive searching to clean the various text datasets, which sharpened our querying skills and gave us practical experience of the work of a data scientist
  2. While going through the NTRS portal for data collection, we learned about the massive research being conducted by NASA in different domains to improve human life in many aspects
  3. We learned to work efficiently with text data using several Python frameworks such as NLTK, FPDF, PyPDF, and Plotly; we had never had the opportunity to do this kind of project before
  4. We got to polish our skills in areas we already knew well while exploring and discovering new features, functions, and algorithms; for example, we learned and used several Python libraries we had not even heard of before
  5. Most of all, we came to understand the practical application of data science and natural language processing, and how we could use them to solve a problem we have personally experienced



What inspired our team to choose this challenge?



This is a problem we have experienced personally. As both students and developers, we have had to scan and read through large documents to find relevant information. We know how tedious and frustrating that process can be, and how such tasks reduce overall productivity.

This inspired us to choose this challenge and build a solution that can help other people facing the same problem, boosting their overall productivity and efficiency and accelerating research and innovation.



How did we develop our project?



1. Data Collection: First, we gathered the datasets, the NASA Legacy CDMS documents, from the NTRS portal.


2. Data Warehousing: The collected data was stored in a repository named Corpus, the main storage directory for all the data used in our project. (Any other user of our model can create a similar Corpus directory containing the documents they want analyzed, and the model will analyze every document in it and generate its report.)


3. Tokenization: The data in the corpus is then tokenized to create a bag of words and a bag of sentences for further operations.



4. Data Cleaning and Preprocessing: The tokenized data is then cleaned and preprocessed by removing stopwords, numeric tokens, and other special characters.


5. TF-IDF Summarization: Scores are assigned to the sentence tokens using the TF-IDF technique, and the top 10% of sentences with the highest scores are combined into the summary.


6. Data Visualization: The tokenized data is then used to plot various graphs, such as a word cloud, a keyword frequency graph, and a term frequency percentage graph, giving the user insights in a visually appealing way.
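
For illustration, a minimal sketch of this step, assuming Plotly (from the tools list) and the third-party wordcloud package for the word cloud; keyword_counts stands for the bag of words produced by the earlier cleaning step, and the output file names are placeholders.

```python
import plotly.express as px
from wordcloud import WordCloud  # assumed package; not part of the tools list above


def plot_keywords(keyword_counts, top_n=20):
    """Save an interactive keyword-frequency bar chart and a word-cloud image."""
    words, counts = zip(*keyword_counts.most_common(top_n))
    fig = px.bar(x=list(words), y=list(counts),
                 labels={"x": "Keyword", "y": "Frequency"},
                 title="Top keywords in the document")
    fig.write_html("keyword_frequency.html")  # interactive chart, placeholder file name

    WordCloud(width=800, height=400, background_color="white") \
        .generate_from_frequencies(dict(keyword_counts)) \
        .to_file("wordcloud.png")
```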


7. Categorization: The document is then classified into categories and subcategories based on term frequency, using the keyword data from the NASA STI Subject and Scope Category Guide.
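
A minimal sketch of the keyword-matching idea; the category keyword lists shown here are invented placeholders standing in for the ones taken from the NASA STI Subject and Scope Category Guide.

```python
# Placeholder excerpts; the real lists come from the NASA STI Subject and
# Scope Category Guide.
CATEGORY_KEYWORDS = {
    "Aeronautics": {"aircraft", "aerodynamics", "flight"},
    "Space Sciences": {"orbit", "planetary", "astronomy"},
}


def categorise(keyword_counts):
    """Pick the category whose keyword list best matches the document's term frequencies."""
    scores = {category: sum(keyword_counts.get(word, 0) for word in words)
              for category, words in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)
```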


8. Report Generation and Mapping Sheet: The Document Analysis Report is then generated from the summary and graphs, and all the details (document link, report link, category, subcategory, and keywords) are written to a spreadsheet called the Mapping Sheet.
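
A sketch of this final step using the FPDF and Pandas modules from the tools list; the file names, the chart image, and the example row are placeholders.

```python
import pandas as pd
from fpdf import FPDF


def write_report(summary, report_path, chart_image="wordcloud.png"):
    """Combine the summary and a chart image into the Document Analysis Report PDF."""
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.multi_cell(0, 8, summary)          # the extracted summary text
    pdf.image(chart_image, w=180)          # one of the generated graphs
    pdf.output(report_path)


def write_mapping_sheet(rows, path="mapping_sheet.xlsx"):
    """rows: one dict per document with links, category, subcategory, and keywords."""
    pd.DataFrame(rows).to_excel(path, index=False)  # needs an Excel writer such as openpyxl


# Example row with placeholder values:
# write_mapping_sheet([{"document": "Corpus/sample.pdf",
#                       "report": "Reports/sample_report.pdf",
#                       "category": "Space Sciences", "subcategory": "Astronomy",
#                       "keywords": "orbit, planetary, telescope"}])
```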



What challenges and setbacks did our team face?






  1. Several datasets are copyrighted, so we could not include them in our project.
  2. Data from OCR-scanned documents was a bit messy for the model.
  3. It was a constant race against time.
  4. Due to lack of time, we were not able to fully develop everything we had planned.



A word of thanks!

  

First of all, we would like to express our gratitude to NASA and the other organisers for giving us a golden opportunity to express and explore ourselves; because of them, we got a chance to think outside the box and create something that could be an asset to society. Secondly, we would like to thank NTRS for providing the data for our application, as well as the developers and maintainers of the tools and technologies we used to build our solution. Finally, we would like to thank everyone who supported us, directly or indirectly, on our journey.








Tags

#nlp #machinelearning #dataforgood #legacy #project #artificialintelligence #creativity #students #ntrs #nasa