High-Level Project Summary
Gandr is an AI-powered search engine that uses indexes generated through Natural Language Processing (NLP) to help researchers find reports. We leveraged NLP to process the text of documents on the NASA Technical Report Server (NTRS) and summarize each one into a collection of keywords, which are stored in a SQL database. A Flask web app then lets users enter a query, which is summarized the same way and used to find relevant reports. By intelligently indexing documents with NLP, our project offers a more streamlined way to access the NTRS.
Link to Final Project
Link to Project "Demo"
Detailed Project Description
Gandr uses spaCy's NLP model to process text and generate tokens. From those tokens and the frequency of each tokenized word, it builds a list of keywords, which is stored in a SQL database along with the document ID.
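A minimal sketch of this keyword-extraction step is below; the function name, the pipeline choice (en_core_web_sm), and the keyword count are illustrative assumptions, not our exact code:

```python
import heapq
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

def extract_keywords(text, top_n=20):
    """Tokenize a document and return its most frequent content words."""
    doc = nlp(text.lower())
    # keep alphabetic tokens that are not stop words
    words = [tok.text for tok in doc if tok.is_alpha and not tok.is_stop]
    freq = Counter(words)
    # heapq.nlargest picks the top_n words by frequency
    return heapq.nlargest(top_n, freq, key=freq.get)
```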
To create our corpus, we wrote an algorithm that asynchronously scrapes the NTRS and indexes each document. By keeping the scraper running throughout the project, we indexed more than 25,000 documents.
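A simplified sketch of the asynchronous scraping pattern follows. The endpoint is taken from the NTRS OpenAPI documentation, but the paging parameter names and page counts here are assumptions for illustration:

```python
import asyncio
import aiohttp

SEARCH_URL = "https://ntrs.nasa.gov/api/citations/search"  # NTRS OpenAPI search endpoint

async def fetch_page(session, page, page_size=100):
    """Fetch one page of citation metadata from the NTRS API."""
    # paging parameter names are assumptions based on the OpenAPI docs
    params = {"page.from": page * page_size, "page.size": page_size}
    async with session.get(SEARCH_URL, params=params) as resp:
        resp.raise_for_status()
        return await resp.json()

async def scrape(n_pages):
    async with aiohttp.ClientSession() as session:
        # issue all page requests concurrently rather than one at a time
        tasks = [fetch_page(session, p) for p in range(n_pages)]
        return await asyncio.gather(*tasks)

pages = asyncio.run(scrape(10))
```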
With our corpus in place, we built a Flask web application. The site's simple interface lets users enter a query, which is sent to the server and tokenized to generate a list of keywords. Those keywords are matched against the database to find the most relevant documents, which are returned to the user in an easily accessible format.
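A hedged sketch of what such a query endpoint can look like, reusing the extract_keywords helper from the sketch above; the route, template, database file, and table names are hypothetical:

```python
import sqlite3
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    keywords = extract_keywords(query)  # same tokenizer used when indexing
    if not keywords:
        return render_template("results.html", results=[])
    conn = sqlite3.connect("gandr.db")
    placeholders = ",".join("?" * len(keywords))
    # rank documents by how many of the query's keywords they share
    rows = conn.execute(
        f"SELECT doc_id, COUNT(*) AS hits FROM keywords"
        f" WHERE keyword IN ({placeholders})"
        f" GROUP BY doc_id ORDER BY hits DESC LIMIT 20",
        keywords,
    ).fetchall()
    conn.close()
    return render_template("results.html", results=rows)
```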
The NTRS search form accepts several inputs, such as a query string, authors' names, date, document type, and so on, yet most users never make use of these fields. And while the plain search function is certainly useful, it only queries titles and abstracts: not every document has an abstract, and an abstract can only hold so much information about the actual document.
Our project offers a more inclusive way to query the NTRS by letting users run searches that are matched against entire documents.
All scripts are written in Python. We used spaCy's NLP model to analyze lexical data, and asyncio and aiohttp to scrape files from the NTRS. We used the sqlite3 module to create and interact with our SQL database. Our front end was built with the Flask framework, with the individual pages structured and styled in HTML and CSS.
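For the database layer, a minimal sketch of how the index table might be created with the sqlite3 module; the schema is our assumption of the (document ID, keyword) shape described above:

```python
import sqlite3

conn = sqlite3.connect("gandr.db")
conn.executescript("""
    -- one row per (document, keyword) pair
    CREATE TABLE IF NOT EXISTS keywords (
        doc_id  TEXT NOT NULL,  -- NTRS document id
        keyword TEXT NOT NULL
    );
    -- speed up lookups by keyword at query time
    CREATE INDEX IF NOT EXISTS idx_keyword ON keywords(keyword);
""")
conn.commit()
conn.close()
```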
Space Agency Data
As our project's objective is to increase the accessibility of existing data on NASA's Technical Report Server, we relied heavily on the information and data from the site. We also used the NTRS OpenAPI documentation to understand how to interact with the API.
Hackathon Journey
Our Space Apps experience was wild. We all tried something new.
Austin experimented with natural language processing, learning how NLP works, including syntactic and sentiment analysis techniques as well as subtasks such as tokenization and relationship extraction. Chen learnt how to create and query databases, building on a little prior knowledge, with the added challenge of devising it all in Python and creating a system that could accommodate the needs of our application. I (Jude) learnt how to use the NTRS API and interact with it asynchronously, while also being on standby, monitoring the progress of the scraper, and helping out the team where I could.
We were inspired to choose this challenge because we collectively have an interest in AI and software development, and it gave us the opportunity to experiment with both. Prior to the hackathon, we only had experience with binary image classification; we wanted to learn more about text analysis and what it can be used for.
We approached the project eager and ready to work, immediately jumping into the deep end. We started by looking into the NTRS to understand what it was and how we could use it. With this foundation, we worked through different approaches to the problem and resolved to split the task into three major components: interacting with the NTRS, processing the data using NLP, and creating a UI.
We worked through setbacks and challenges by communicating with one another and working together as a team. We also asked the Subject Matter Experts about problems we encountered that we could not solve on our own. We really owe it to all the fresh faces who gave us help and advice, especially when we couldn't see it any other way.
References
Generic:
Languages: Python
Tools: GitHub, Visual Studio Code
Text Classification:
Libraries: spaCy, string, heapq, re
Databasing:
Languages: SQL
Libraries: sqlite3
Tools: DB Browser for SQLite
Scraper:
Libraries: asyncio, aiohttp, requests, urllib, time
Web Application:
Languages: HTML, CSS
Libraries: Flask
We all programmed in Python using Visual Studio Code, using its Live Share feature to collaborate. We also used GitHub to store our project and conveniently document our changes.
Tags
#software, #NLP, #webapp, #NTRS

