Document Analyser System

High-Level Project Summary

Our project aims to make NASA research papers easily accessible, searchable, and usable for researchers, academics, students, decision makers, and the general public. By helping people find relevant information in these papers much faster, it can accelerate research and improve knowledge sharing. In this way we preserve NASA's legacy of research and exploration and make it readily available to innovators, researchers, and academics, so that this data can be put to better use in solving modern problems.

Detailed Project Description

What did we develop?



We have developed an NLP-based model that analyzes all the files in a directory (the corpus) and generates an interactive report for each file, known as the Document Analysis Report.

This report conveys the crux of the whole document interactively, in the form of charts and graphs built from the words used in the text. It helps the reader determine whether the document is relevant to their current requirement, so they can easily decide which articles are worth reading in full.

The report also includes a short summary of the document, approximately 10 to 20 percent of the length of the original text.


We also provide the links to each original document and its Document Analysis Report, along with that document's category, subcategory, and top keywords, in a spreadsheet known as the Mapping Sheet. Collecting the links to all the documents and their reports in a single place, together with their keywords, makes the corpus significantly easier to search and navigate.



Report (our project report; please open it in a new tab if it does not open directly)




How does it solve the challenge?



Our project addresses this challenge by making the Legacy CDMS documents more searchable and navigable: it classifies them into appropriate categories and subcategories based on term frequency in the text and organises them in a spreadsheet along with a list of keywords, so users can find the required content more easily and quickly.

It also distils long research documents into brief, precise reports with insightful graphs and a concise summary, making it easier for the user to find content relevant to their context.



What exactly does it do?



Our model takes a corpus of documents as input: a directory containing, in PDF format, all the documents for which reports are to be generated.

The model works iteratively: it steps through each document in the corpus and generates its report. Its key features are:






  1. Scans the content of the original document from the PDF file and generates a bag of words by cleaning the extracted text (see the sketch after this list)
  2. Generates a high-level summary of the contents of the document
  3. Generates interactive and appealing graphs from the bag of words to provide critical insights into the contents of the document in a visually appealing and easily understandable manner
  4. Categorizes the documents into various categories and subcategories using the NASA STI Subject and Scope Category Guide
  5. Combines all the graphs and the summary into the ‘Document Analysis Report’, a PDF file that conveys all these insights to the user in a single place
  6. Generates a Mapping Sheet containing the links to all the original documents and their reports, along with categories, subcategories, and keywords, to provide better search and navigation
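
For illustration, here is a minimal sketch of the first feature: extracting the text of a single PDF and turning it into a cleaned bag of words. The file path is only a placeholder, and the sketch assumes PyPDF2 (the PdfReader API) and NLTK with its punkt and stopwords data downloaded.

```python
import re
from collections import Counter

import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def bag_of_words(pdf_path):
    """Extract the text of a PDF and return keyword frequencies."""
    reader = PyPDF2.PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)

    # Keep only alphabetic tokens and drop common stopwords such as 'is' and 'are'.
    stop = set(stopwords.words("english"))
    tokens = [w.lower() for w in word_tokenize(text)]
    keywords = [w for w in tokens if re.fullmatch(r"[a-z]+", w) and w not in stop]
    return Counter(keywords)


# Example with a placeholder path:
# print(bag_of_words("Corpus/sample.pdf").most_common(10))
```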



How does it work?



The model is developed mainly in Python, along with several libraries for NLP, PDF parsing, and so on. It works through the following steps:





  1. The model takes the path of a directory (the corpus) as input, scans the directory using Python's OS module, obtains the addresses of all the documents present in the corpus, and stores them in a list
  2. Using the file addresses collected in step 1, it opens and parses each document in the list iteratively, extracts all the text in that document using the PyPDF module, and stores it in a string
  3. The model then tokenizes the text in two different ways, word tokenization and sentence tokenization, using the NLTK library, creating a vocabulary of both the words and the sentences present in the text
  4. It then cleans the data by removing stopwords (the most commonly used words such as 'is' and 'are'), numerics, and special characters from the word vocabulary, creating a list of the major keywords in the document
  5. The algorithm then calculates the term frequency and inverse document frequency of the words listed in step 4, and uses the TF-IDF of each word to score the sentences stored in the sentence vocabulary created in step 3
  6. Using these scores, the algorithm generates a summary by picking the top 10% of sentences from the sentence vocabulary and combining them (a condensed sketch of steps 1 to 6 follows this list)
  7. The model also generates several graphs from the bag of keywords, such as a word cloud, a keyword frequency graph, and a term frequency percentage graph, using the Plotly library, to provide meaningful insights about the contents of the document in an easily understandable and visually appealing manner
  8. The model also classifies the document into categories and subcategories based on the keywords it contains, using the category- and subcategory-wise keyword lists given in the NASA STI Subject and Scope Category Guide
  9. The model then generates the 'Document Analysis Report' using the FPDF module; it provides the user with all the insights about the document and contains the graphs generated in step 7, the summary generated in step 6, and the bag of words with frequencies
  10. Finally, after generating the reports for all the documents in the corpus, it generates a Mapping Sheet using the Pandas module, containing the links to all the original documents and their reports along with categories, subcategories, and keywords, to provide better search and navigation
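
A condensed sketch of steps 1 to 6, assuming PyPDF2 and NLTK (with the punkt and stopwords data downloaded). For brevity, sentences here are scored by term frequency alone; the full model also weights each term by its inverse document frequency across the corpus. The directory name "Corpus" is a placeholder.

```python
import os
import re
from collections import Counter

import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

STOP = set(stopwords.words("english"))


def extract_text(pdf_path):
    """Step 2: read every page of the PDF and concatenate the text."""
    reader = PyPDF2.PdfReader(pdf_path)
    return " ".join(page.extract_text() or "" for page in reader.pages)


def summarise(text, ratio=0.10):
    """Steps 3-6: score sentences by keyword frequency and keep the top 10%."""
    sentences = sent_tokenize(text)
    words = [w.lower() for w in word_tokenize(text)
             if re.fullmatch(r"[a-z]+", w.lower()) and w.lower() not in STOP]
    tf = Counter(words)

    def score(sentence):
        tokens = [w.lower() for w in word_tokenize(sentence)]
        return sum(tf.get(w, 0) for w in tokens) / (len(tokens) or 1)

    keep = max(1, int(len(sentences) * ratio))
    top = set(sorted(sentences, key=score, reverse=True)[:keep])
    # Reassemble the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)


def analyse_corpus(corpus_dir):
    """Step 1: collect the paths of all PDFs in the corpus and summarise each one."""
    paths = [os.path.join(corpus_dir, name)
             for name in sorted(os.listdir(corpus_dir))
             if name.lower().endswith(".pdf")]
    return {path: summarise(extract_text(path)) for path in paths}


# summaries = analyse_corpus("Corpus")   # placeholder directory name
```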



What benefits does our model have?



Our model benefits anyone who has to navigate lengthy research articles and documents to find relevant information and is frustrated by how tedious and time consuming that task is.


NASA's current search mainly scans the article title and abstract to filter results. Moreover, even once users reach those documents, the documents are so lengthy that exploring each article in detail becomes a very tedious task.


Our model presents the major keywords in the document, along with their frequencies, as visually appealing graphs that help the user determine at a glance whether the article is relevant to their context. It also provides a high-level summary that the user can consult alongside the graphs to decide whether the document is worth exploring in detail.


Our model also provides the links to all these documents and their reports, along with categories, subcategories, and keywords, in a single spreadsheet, leading to faster and easier search.



What do we hope to achieve?



We hope to provide a real-time application that can analyze the contents of a document in depth and communicate its insights in a brief, precise, clear, and understandable manner.


Some things we would like to incorporate in our current model to make it better are:






  • Due to limited time and resources we were able to apply our solution only to a small corpus of about 60 documents; given more time and better computational resources, we would like to apply it to much larger corpora
  • Currently, our model uses extractive techniques for text summarization. We would like to apply abstractive summarization techniques, as they produce better results, but at present we lack the computational and monetary resources to do so
  • We would like to build an interactive chatbot as part of our model, which could make the application much more interactive and provide better assistance to users in finding the right information
  • We would also like to incorporate a web scraping feature that can extract data from the NASA NTRS server in real time, automating the task of corpus creation
  • We would also like to add a real-time summarizer that could generate a custom summary whose length and language follow the inputs given by the user
  • We would also like to use translation features to make our model multilingual and accessible to people all around the world
  • Finally, we would like to develop better recommendation engines that can recommend articles similar to the ones the user reads most



Looking at the Big Picture 



In the longer run, our model will help people from different professional backgrounds work more efficiently and productively by providing easier access to years of NASA research data in a single place, which in turn can lead to better solutions to many real-world problems. The groups for whom our model will be most beneficial are:



1. Researchers and Innovators


Our model will be very helpful to future researchers and innovators, as it provides easier access to and navigation of this treasure trove of research and information in an accessible and understandable way, which will accelerate the process of research and innovation.


2. Academics


It will be highly useful for academics, as they will gain better access to knowledge, leading to better sharing of knowledge and information.


3. Policy Makers


It will also be highly useful for policymakers, who can get brief and specific reports on a variety of issues to help them make better decisions.


4. General Public


Our model will also be useful for the general public, especially curious and inquisitive individuals, as this treasure trove of information becomes more accessible to them.



What tools, coding languages, software, and hardware did we use?



We used the following tools and software in the development of our project:






  • Python 
  • Google Workspace 
  • Google Colab
  • PyPDF2 module
  • FPDF module
  • NLTK
  • Matplotlib
  • Plotly
  • Pandas
  • Regular Expressions

Space Agency Data

We used data from the NASA NTRS (NASA Technical Reports Server) to build our corpus:


https://ntrs.nasa.gov/


We also used the NASA STI Subject and Scope Category Guide for categorising the documents into different categories according to keyword frequency:


https://ntrs.nasa.gov/api/citations/20000025197/downloads/20000025197.pdf

Hackathon Journey

How would we describe our Space Apps Challenge Experience?



Space Apps has given us a golden opportunity to learn new skills and to use our skills and talents for the betterment of society, working on a problem we have faced ourselves. Tackled properly, it can accelerate research and development, knowledge sharing, and quick access to information.



What did we learn?



Working on this project for the hackathon, to help preserve the Space Legacy and make it easily accessible, has given us a chance to learn and grow. We gained knowledge of many new tools and technologies, as well as of ongoing studies and research in domains such as Machine Learning and Natural Language Processing. Some of our major learning outcomes were:






  1. We did extensive searching to clean the various text datasets, which sharpened our querying skills and gave us practical experience of the work of a data scientist
  2. While going through the NTRS portal for data collection, we learned about the massive research being conducted by NASA in different domains to improve human life in many aspects
  3. We learned to work efficiently with text data using several Python frameworks such as NLTK, FPDF, PyPDF, and Plotly; we had never had the opportunity to do this kind of project before
  4. We got to polish our skills in areas we already knew well while exploring and discovering new features, functions, and algorithms; for example, we learned and used several Python libraries we had not even heard of before
  5. Most of all, we came to understand the practical application of data science and natural language processing, and how we could use them to solve a problem we have personally experienced



What inspired our team to choose this challenge?



This is a problem we have experienced personally. As both students and developers, we have had to scan and read through large documents to find relevant information. We know how tedious and frustrating that process can be, and how such tasks reduce overall productivity.

This inspired us to choose this challenge and build a solution that can help other people facing the same problem, boosting their overall productivity and efficiency and accelerating research and innovation.



How did we develop our project?



1. Data Collection: First, we gathered the datasets, the NASA Legacy CDMS documents, from the NTRS portal.


2. Data Warehousing: The collected data was stored in a repository named Corpus, the main storage directory for all the data used in our project. (Any other user of our model can create a similar Corpus directory containing the documents they want analyzed, and the model will analyze every document in it and generate its report.)


3. Tokenization: The data in the corpus is then tokenized to create a bag of words and a bag of sentences for further operations.



4. Data Cleaning and Preprocessing: The tokenized data is then cleaned and preprocessed by removing stopwords, numeric tokens, and other special characters.


5. TF-IDF Summarization: Scores are assigned to the sentence tokens using the TF-IDF technique, and the top 10% of sentences with the highest scores are combined into the summary.


6. Data Visualization: The tokenized data is then used to plot various graphs, such as a word cloud, a keyword frequency graph, and a term frequency percentage graph, giving the user insights in a visually appealing way.
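
For illustration, a minimal sketch of this step, assuming Plotly (from the tools list) and the third-party wordcloud package for the word cloud; keyword_counts stands for the bag of words produced by the earlier cleaning step, and the output file names are placeholders.

```python
import plotly.express as px
from wordcloud import WordCloud  # assumed package; not part of the tools list above


def plot_keywords(keyword_counts, top_n=20):
    """Save an interactive keyword-frequency bar chart and a word-cloud image."""
    words, counts = zip(*keyword_counts.most_common(top_n))
    fig = px.bar(x=list(words), y=list(counts),
                 labels={"x": "Keyword", "y": "Frequency"},
                 title="Top keywords in the document")
    fig.write_html("keyword_frequency.html")  # interactive chart, placeholder file name

    WordCloud(width=800, height=400, background_color="white") \
        .generate_from_frequencies(dict(keyword_counts)) \
        .to_file("wordcloud.png")
```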


7. Categorization: The document is then classified into categories and subcategories based on term frequency, using the keyword data from the NASA STI Subject and Scope Category Guide.
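
A minimal sketch of the keyword-matching idea; the category keyword lists shown here are invented placeholders standing in for the ones taken from the NASA STI Subject and Scope Category Guide.

```python
# Placeholder excerpts; the real lists come from the NASA STI Subject and
# Scope Category Guide.
CATEGORY_KEYWORDS = {
    "Aeronautics": {"aircraft", "aerodynamics", "flight"},
    "Space Sciences": {"orbit", "planetary", "astronomy"},
}


def categorise(keyword_counts):
    """Pick the category whose keyword list best matches the document's term frequencies."""
    scores = {category: sum(keyword_counts.get(word, 0) for word in words)
              for category, words in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)
```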


8. Report Generation and Mapping Sheet: The Document Analysis Report is then generated from the summary and graphs, and all the details (document link, report link, category, subcategory, and keywords) are written to a spreadsheet called the Mapping Sheet.
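
A sketch of this final step using the FPDF and Pandas modules from the tools list; the file names, the chart image, and the example row are placeholders.

```python
import pandas as pd
from fpdf import FPDF


def write_report(summary, report_path, chart_image="wordcloud.png"):
    """Combine the summary and a chart image into the Document Analysis Report PDF."""
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.multi_cell(0, 8, summary)          # the extracted summary text
    pdf.image(chart_image, w=180)          # one of the generated graphs
    pdf.output(report_path)


def write_mapping_sheet(rows, path="mapping_sheet.xlsx"):
    """rows: one dict per document with links, category, subcategory, and keywords."""
    pd.DataFrame(rows).to_excel(path, index=False)  # needs an Excel writer such as openpyxl


# Example row with placeholder values:
# write_mapping_sheet([{"document": "Corpus/sample.pdf",
#                       "report": "Reports/sample_report.pdf",
#                       "category": "Space Sciences", "subcategory": "Astronomy",
#                       "keywords": "orbit, planetary, telescope"}])
```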



What challenges and setbacks did our team face?






  1. Several datasets are copyrighted, so we could not include them in our project.
  2. Data from OCR-scanned documents was a bit messy for the model.
  3. It was a constant race against time.
  4. Due to lack of time, we were not able to fully develop everything we had planned.



A word of thanks!

  

First of all, we would like to express our gratitude to NASA and the other organisers for giving us a golden opportunity to express and explore ourselves; because of them, we got a chance to think outside the box and create something that could be an asset to society. Secondly, we would like to thank NTRS for providing the data for our application, as well as the developers and maintainers of the tools and technologies we used to build our solution. Finally, we would like to thank everyone who supported us, directly or indirectly, on our journey.








Tags

#nlp #machinelearning #dataforgood #legacy #project #artificialintelligence #creativity #students #ntrs #nasa