Awards & Nominations

Shock Wave Surfers has received the following awards and nominations.

Global Nominee

CREATION.AI: An End-To-End Text Analysis Pipeline

High-Level Project Summary

At creation.ai, we strive to answer the difficult questions and present unique insights that even a researcher might be oblivious to. Our application uses state-of-the-art deep learning models, as well as traditional algorithms, to present the user with never-before-seen analytics, all through a polished web interface. It reads the paper ten times over and comes up with a short and sweet extract all about it, so you don't have to! It picks out the most important words, which alone could account for what the paper does, presented to you in the form of charismatic word clouds and graphs. As a cherry on top, it also suggests similar papers from our database which could be the key to your success!

Detailed Project Description

CREATION.AI


The Problem


The NASA Technical Report Server (NTRS) includes hundreds of thousands of items containing scientific and technical information, through which it is nearly impossible to navigate (believe us, we've experienced it first-hand).

Our solution proposes a new approach to sift through this data efficiently.


The Solution

At its core, our application has two hemispheres:


The Librarian (Still Under Development)

Like a very well-educated librarian who can recommend books based on ones already borrowed, ours is an intelligent chatbot that guides the discussion with the patron, helping them query the NTRS database tagged CDMS using the resulting keywords and n-grams.

It uses keyword extraction to pull similar words from the corpus, matches them against the queried document, and scours our corpus for similar documents.

It then uses an NLP-based ranking algorithm to rank the results and presents them as a binder of similar documents, with even the similarities highlighted!
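
The production ranking code is not reproduced here, but the retrieve-and-rank idea can be sketched in a few lines. The snippet below uses TF-IDF vectors and cosine similarity as illustrative stand-ins for our keyword matching and NLP-based ranking; the function name and scoring details are assumptions for the example, not the exact implementation.

```python
# Illustrative sketch of the retrieve-and-rank step, using TF-IDF and
# cosine similarity as stand-ins for the actual ranking algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_similar_documents(query_doc, corpus, top_k=5):
    """Return (index, score) pairs for the corpus documents most similar to query_doc."""
    # Unigrams and bigrams, mirroring the keyword/n-gram matching described above.
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    # Fit on the corpus plus the query so both share one vocabulary.
    matrix = vectorizer.fit_transform(list(corpus) + [query_doc])
    corpus_vectors, query_vector = matrix[:-1], matrix[-1]
    scores = cosine_similarity(query_vector, corpus_vectors).ravel()
    top = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]
```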

Current Status

We are currently able to search and match single-page documents against those in our corpus. Support for multi-page documents and their visualisation is under development.

'The Librarian' is essentially a well-built, intricate information retrieval system, updated with state-of-the-art techniques and processes.

The Analyst

The Analyst, meanwhile, has quite a few features!

  • The Text Summariser: uses a best-in-the-business fine-tuned Transformers model to generate a custom summary just for you (a sketch follows this list)!
  • Word-Level Analysis: generates aesthetically pleasing word clouds, frequency-distribution graphs, and data frames rife with information.
  • Keyword Extraction: uses another state-of-the-art Transformers model to produce tags and keywords that can help you look in all the right places, not to mention categorise data effectively.
  • Question & Answer Generation: still under development; it leverages the summariser to work on part of the extracted text.
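
As a rough illustration of how the summariser could be wired up with Hugging Face's transformers pipeline API (the checkpoint name below is a placeholder, not the fine-tuned model we actually deployed):

```python
# Illustrative summariser sketch; the pretrained checkpoint is a
# placeholder, not the project's fine-tuned model.
from transformers import pipeline

summariser = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarise(text, max_len=150, min_len=40):
    """Generate a short, abstractive summary of a cleaned paper excerpt."""
    result = summariser(text, max_length=max_len, min_length=min_len,
                        do_sample=False)
    return result[0]["summary_text"]
```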


The Web App

The web application uses the Flask framework to integrate the backend with a beautiful front-end built with HTML, CSS & vanilla JavaScript.

It works smoothly and is currently hosted locally, retrieving results in under 10 seconds!
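
A minimal sketch of what such a Flask endpoint could look like; the route name and payload shape are assumptions for illustration, and summarise stands in for the Analyst's summariser:

```python
# Minimal Flask sketch; the route and payload are illustrative assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

def summarise(text):
    """Placeholder for the Analyst's summariser sketched earlier."""
    return text[:200]

@app.route("/analyse", methods=["POST"])
def analyse():
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text:
        return jsonify(error="no text supplied"), 400
    return jsonify(summary=summarise(text))

if __name__ == "__main__":
    app.run(debug=True)  # hosted locally, as described above
```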


Technologies Used

Our entire application has been built in Python, leveraging several different frameworks.

  • PyTorch: A Deep Learning Framework
  • Flask: A Web-Development Framework
  • Deep Learning: For Both Hemispheres Of Our Application
  • Recursive Searching: For The Librarian Hemisphere

Space Agency Data

We have strictly confined ourselves to the data provided by NASA, and even of that we have only used a subset.

Of the 381,000 data points available through the CDMS centre, we have used a subset of 2,000 data points to train our models.


An Issue With The Data

The data available from the CDMS centre was unreliable and unusable for such heavy tasks; even the text available in JSON format was extremely dirty.

We wrote numerous custom text-cleaning algorithms because pre-existing ones could not solve this issue.
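
As an illustration of the kind of cleaning involved (our actual rules are more extensive and tuned to the CDMS text), a minimal sketch:

```python
# Illustrative text-cleaning sketch; the real pipeline applies many more
# corpus-specific rules than shown here.
import re

def clean_text(raw):
    text = raw.replace("\x00", " ")              # drop stray null bytes
    text = re.sub(r"<[^>]+>", " ", text)         # strip leftover HTML tags
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)  # remove non-printable chars
    text = re.sub(r"-\n(\w)", r"\1", text)       # re-join hyphenated line breaks
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text.strip()
```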

Hackathon Journey

The Journey

Our SpaceApps experience was a memorable one - we had an amazing time collaborating with one another.

Both of us had always been fascinated by how the field of natural language processing works.

We approached the challenge with virtually no experience of text analysis or even search algorithms. From the very first day, we decided to employ a 'divide and conquer' strategy, and it worked out 'ok' for us!


One of the major issues we faced was the quality of the data, with even the extracted text being extremely iffy and unreadable. Finally, we created a text-cleaning algorithm that was able to fix most of our issues.

Another issue we are proud of tackling was the lack of manpower: over the course of the weekend, we were unable to find a third team member with a similar skill set, which made it impossible for us to finish the entire application within the limited time frame!


We'd like to thank the NASA SpaceApps administration for coming up with a challenging yet interesting problem for us to work on!

References

Data

  • The NTRS Server
  • The NTRS OpenAPI


Resources

  1. PyPDF2, for converting PDFs to text files: https://pypdf2.readthedocs.io/en/latest/index.html
  2. PyTorch and Transformers, for the summary and Q&A generation models
  3. NLTK, for writing and analyzing the corpus: https://www.nltk.org/
  4. Inspiration to accept this year's challenge came from the 2015 challenge winners, Team NYSpacetag: https://github.com/jonroberts/nasaMining ; https://2015.spaceappschallenge.org/project/nyspacetag/
  5. The term “will” in this document refers to a forward-looking statement that may or may not actually occur.
  6. N-Grams: https://en.wikipedia.org/wiki/N-gram
  7. Project Page: https://github.com/space-apps-tacoma/creation.ai

Tags

#stateoftheart, #deeplearning, #analysis