Neon_Hackers NTRS Text Analysis

High-Level Project Summary

This project analyses NASA NTRS records and builds analysis reports for them. The pipeline downloads the required PDFs through the NASA NTRS API, extracts their text into a corpus, extracts keywords and plots their frequencies, summarises each document with named entity recognition, constructs a knowledge graph, and generates a report containing the keywords, summary and knowledge graph.

Detailed Project Description

1. Downloading all the required PDFs using the NASA NTRS API. Using the API's bulk record download for the centre "Legacy CDMS" to obtain a JSON file containing the records for the required category of PDFs.

2. Extraction of text from PDFs and creating a Corpus. Using the PyPDF2 library in Python.

3. Keyword Extraction and Frequency Plot. Using the TF-IDF vectorizer (scikit-learn's TfidfVectorizer), YAKE and Seaborn libraries in Python.

4. Document Summarization with Named Entity Recognition. Using the heapq module, spaCy and its displaCy visualizer in Python.

5. Knowledge Graph Construction. Using the NetworkX, spaCy and Matplotlib libraries in Python.

6. Report Generation with Keywords, Summary and Knowledge Graph. Using the nbreport library in Python.
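
To illustrate the role heapq can play in the summarization step (step 4), here is a minimal sketch of frequency-based extractive summarization. It is not the project's actual code: the function name `summarize`, the tiny stop-word list and the regex-based sentence splitting are all assumptions made for illustration, and a real pipeline would more likely take tokens and sentences from spaCy.

```python
import heapq
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; spaCy ships a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was",
              "for", "on", "it", "that", "with", "as", "by"}


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    if not words:
        return ""
    freq = Counter(words)
    top = max(freq.values())

    # Score each sentence as the sum of the normalised frequencies of its words.
    scores = {}
    for i, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        scores[i] = sum(freq[t] / top for t in tokens if t in freq)

    # heapq.nlargest picks the indices of the best-scoring sentences.
    best = heapq.nlargest(n_sentences, scores, key=scores.get)
    return " ".join(sentences[i] for i in sorted(best))


sample = ("Apollo missions returned lunar samples. The weather was nice. "
          "Lunar samples from Apollo missions changed lunar science.")
print(summarize(sample, n_sentences=1))
```

Sentences that repeat the corpus's most frequent content words score highest, so the off-topic middle sentence of the sample is dropped first.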

Space Agency Data

NASA NTRS Legacy CDMS Records
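
As a rough illustration of how such records can be requested, the sketch below builds a query URL for the NTRS citations search endpoint listed in the References, using the `subjectCategory` filter from that same URL. The helper names `build_search_url` and `fetch_records` are hypothetical. A filter for the Legacy CDMS centre would presumably be passed the same way, but its parameter name is not given here, so it is left out.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Search endpoint taken from the project's reference list.
SEARCH_URL = "https://ntrs.nasa.gov/api/citations/search"


def build_search_url(**filters):
    """Build a search URL; values are percent-encoded (spaces become %20)."""
    return f"{SEARCH_URL}?{urlencode(filters, quote_via=quote)}"


def fetch_records(url, timeout=30):
    """Download and decode the JSON response (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = build_search_url(subjectCategory='"Lunar Planetary Exploration"')
print(url)
# records = fetch_records(url)  # uncomment to query the live API
```

Only the URL construction runs by default; the network call is left commented out so the sketch can be tried offline.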

Hackathon Journey

One fine morning, I Googled "biggest hackathon in the world" and discovered the NASA Space Apps Challenge. Having won Smart India Hackathon 2020 and finished as runner-up at ASEAN India Hackathon 2021, I was instantly intrigued and signed up for the NASA Hackathon 2022.


My teammate and I were both very excited, and we decided to participate in person at the Bristol centre despite the rail strike across the UK. When we arrived, we received a warm welcome from John Bradford, our Local Lead. We were bamboozled (and a little intimidated) by the sheer young talent around us. We took it as inspiration and coded all day long, completing almost 80% of our work. The next day, we attended the hackathon online because of the time and money involved in travelling to the Bristol centre. Finally, we added the finishing touches and completed the coding.


On the whole, we thoroughly enjoyed this journey and were really honoured to have been a part of the NASA International Space Apps Challenge 2022!

References

1. Downloading all the required PDFs using the NASA NTRS API: https://ntrs.nasa.gov/api/citations/search?subjectCategory=%22Lunar%20Planetary%20Exploration%22 , https://www.sti.nasa.gov/harvesting-data-from-ntrs/

2. Extraction of text from PDFs and creating a Corpus: https://pypi.org/project/pyPdf/ , https://pypi.org/project/pyreadr/

3. Keyword Extraction and Frequency Plot: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html , https://github.com/LIAAD/yake

4. Document Summarization with Named Entity Recognition: https://spacy.io/ , https://github.com/explosion/spaCy , https://docs.python.org/3/library/heapq.html , https://spacy.io/usage/visualizers

5. Knowledge Graph Construction: https://networkx.org/

6. Report Generation with Keywords, Summary and Knowledge Graph: https://github.com/choldgraf/nbreport

Tags

#NASA #NTRS #SpaceAppsChallenge #NLP #KnowledgeGraph #KeywordExtraction #TextSummarization #PDFExtraction