High-Level Project Summary
Gandr is an AI-powered search engine that uses indexes generated through Natural Language Processing (NLP) to help researchers find reports. We leveraged NLP to process the text of documents on the NASA Technical Report Server (NTRS) and summarize each one into a collection of keywords, which are stored in a SQL database. A Flask web app then lets users enter a query, which is summarized the same way and used to find relevant reports. By intelligently indexing documents with NLP, our project offers a more streamlined way to access the NTRS.
Link to Final Project
Link to Project "Demo"
Detailed Project Description
Gandr uses spaCy's NLP model to process text and generate tokens. From those tokens and the frequency of each tokenized word, it builds a list of keywords, which is stored in a SQL database along with the document ID.
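A minimal sketch of this keyword-extraction step is below; the function name, the pipeline choice (en_core_web_sm), and the keyword count are illustrative assumptions, not our exact code:

```python
import heapq
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

def extract_keywords(text, top_n=20):
    """Tokenize a document and return its most frequent content words."""
    doc = nlp(text.lower())
    # keep alphabetic tokens that are not stop words
    words = [tok.text for tok in doc if tok.is_alpha and not tok.is_stop]
    freq = Counter(words)
    # heapq.nlargest picks the top_n words by frequency
    return heapq.nlargest(top_n, freq, key=freq.get)
```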
To create our corpus, we wrote an algorithm that asynchronously scrapes the NTRS and indexes each document. By keeping the scraper running throughout the project, we indexed more than 25,000 documents.
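A simplified sketch of the asynchronous scraping pattern follows. The endpoint is taken from the NTRS OpenAPI documentation, but the paging parameter names and page counts here are assumptions for illustration:

```python
import asyncio
import aiohttp

SEARCH_URL = "https://ntrs.nasa.gov/api/citations/search"  # NTRS OpenAPI search endpoint

async def fetch_page(session, page, page_size=100):
    """Fetch one page of citation metadata from the NTRS API."""
    # paging parameter names are assumptions based on the OpenAPI docs
    params = {"page.from": page * page_size, "page.size": page_size}
    async with session.get(SEARCH_URL, params=params) as resp:
        resp.raise_for_status()
        return await resp.json()

async def scrape(n_pages):
    async with aiohttp.ClientSession() as session:
        # issue all page requests concurrently rather than one at a time
        tasks = [fetch_page(session, p) for p in range(n_pages)]
        return await asyncio.gather(*tasks)

pages = asyncio.run(scrape(10))
```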
With our corpus in place, we built a Flask web application. The site's simple interface lets users enter a query, which is sent to the server and tokenized to generate a list of keywords. Those keywords are matched against the database to find the most relevant documents, which are returned to the user in an easily accessible format.
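A hedged sketch of what such a query endpoint can look like, reusing the extract_keywords helper from the sketch above; the route, template, database file, and table names are hypothetical:

```python
import sqlite3
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    keywords = extract_keywords(query)  # same tokenizer used when indexing
    if not keywords:
        return render_template("results.html", results=[])
    conn = sqlite3.connect("gandr.db")
    placeholders = ",".join("?" * len(keywords))
    # rank documents by how many of the query's keywords they share
    rows = conn.execute(
        f"SELECT doc_id, COUNT(*) AS hits FROM keywords"
        f" WHERE keyword IN ({placeholders})"
        f" GROUP BY doc_id ORDER BY hits DESC LIMIT 20",
        keywords,
    ).fetchall()
    conn.close()
    return render_template("results.html", results=rows)
```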
The NTRS search form accepts several inputs, such as a query string, authors' names, date, document type, and so on, yet most users never make use of these fields. And while the plain search function is certainly useful, it only queries titles and abstracts: not every document has an abstract, and an abstract can only hold so much information about the actual document.
Our project offers a more inclusive way to query the NTRS by letting users run searches that are matched against entire documents.
All scripts are written in Python. We used spaCy's NLP model to analyze lexical data, and asyncio and aiohttp to scrape files from the NTRS. We used the sqlite3 module to create and interact with our SQL database. Our front end was built with the Flask framework, with the individual pages structured and styled in HTML and CSS.
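For the database layer, a minimal sketch of how the index table might be created with the sqlite3 module; the schema is our assumption of the (document ID, keyword) shape described above:

```python
import sqlite3

conn = sqlite3.connect("gandr.db")
conn.executescript("""
    -- one row per (document, keyword) pair
    CREATE TABLE IF NOT EXISTS keywords (
        doc_id  TEXT NOT NULL,  -- NTRS document id
        keyword TEXT NOT NULL
    );
    -- speed up lookups by keyword at query time
    CREATE INDEX IF NOT EXISTS idx_keyword ON keywords(keyword);
""")
conn.commit()
conn.close()
```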
Space Agency Data
As our project's objective is to increase the accessibility of existing data on NASA's Technical Report Server, we relied heavily on the information and data from the site. We also used the NTRS OpenAPI documentation to understand how to interact with the API.
Hackathon Journey
Our Space Apps experience was wild. We all tried something new.
Austin experimented with natural language processing, learning how NLP works, including syntactic and sentiment analysis techniques as well as subtasks such as tokenization and relationship extraction. Chen learnt how to create and query databases, building on a little prior knowledge, with the added challenge of devising it all in Python and creating a system that could accommodate the needs of our application. I (Jude) learnt how to use the NTRS API and interact with it asynchronously, while also being on standby, monitoring the progress of the scraper, and helping out the team where I could.
We were inspired to choose this challenge because we collectively have an interest in AI and software development, and it gave us the opportunity to experiment with both. Prior to the hackathon, we only had experience with binary image classification; we wanted to learn more about text analysis and what it can be used for.
We approached the project eager and ready to work, immediately jumping into the deep end. We started by looking into the NTRS to understand what it was and how we could use it. With this foundation, we worked through different approaches to the problem and resolved to split the task into three major components: interacting with the NTRS, processing the data using NLP, and creating a UI.
We worked through setbacks and challenges by communicating with one another and working together as a team. We also asked the Subject Matter Experts about problems we encountered that we could not solve on our own. We really owe it to all the fresh faces who gave us help and advice, especially when we couldn't see it any other way.
References
Generic:
Languages: Python
Tools: GitHub, Visual Studio Code
Text Classification:
Libraries: spaCy, string, heapq, re
Databasing:
Languages: SQL
Libraries: sqlite3
Tools: DB Browser for SQLite
Scraper:
Libraries: asyncio, aiohttp, requests, urllib, time
Web Application:
Languages: HTML, CSS
Libraries: Flask
We all programmed in Python using Visual Studio Code, using its Live Share feature to collaborate. We also used GitHub to store our project and conveniently document our changes.
Tags
#software, #NLP, #webapp, #NTRS

