Archetix's NTRS Extended Terminal

High-Level Project Summary

Imagine how difficult it can be to locate the information you need in a large repository of technical papers. Similar challenges are faced in academia. We reimagined accessibility and discoverability using cloud technologies and AI/NLP models. Our models enhanced the existing database with the following metadata:

Language - the dominant language detected from the fulltext of the article

Topics - topics derived from the fulltext of the article, based on the NASA STI Scope and Subject Category Guide

AI Summary - a summary generated with Sumy NLP

Similar articles - an AI-generated list of similar articles from a recommendation engine (not yet finished)

Detailed Project Description

What exactly does it do?

The Archetix Data Squad app, which we created during the Space Apps hackathon in Prague on 1-2 October 2022, is an extended interface for https://ntrs.nasa.gov/. It lets users search a limited part of the database (the so-called Corpus), which was created from the first 100 articles filtered by selecting Center = Legacy CDMS.
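For illustration, here is a minimal Python sketch of how such a Corpus could be pulled. The endpoint path, the parameter names ("center", "page.size"), and the center value are our assumptions, to be verified against the NTRS OpenAPI spec at https://ntrs.nasa.gov/api/openapi/:

```python
# Minimal sketch of fetching the first 100 articles for the Corpus.
# Parameter names and the center value are assumptions - check the
# NTRS OpenAPI spec before relying on them.
import requests

NTRS_SEARCH_URL = "https://ntrs.nasa.gov/api/citations/search"

def fetch_corpus(center="CDMS", limit=100):
    """Fetch up to `limit` citation records for the given center."""
    response = requests.get(
        NTRS_SEARCH_URL,
        params={"center": center, "page.size": limit},  # assumed names
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])
```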


How does it work?

There is a copy of the NTRS database (the Corpus) in the Keboola cloud, including the fulltext of the articles. Through the extended interface, the user can search this Corpus and retrieve articles in a similar way as in the original interface. Even though the Corpus is small right now, it runs on highly scalable technology, so we expect it to maintain peak performance and high access speed as it grows.

To achieve this speed, a snapshot of the database (excluding the fulltext) is copied to a Firestore database in Google Cloud, a JSON document database with very fast reads. The snapshot is accessed via the frontend app at https://archetix-data-squad.web.app/. This stack should withstand peak loads and keep the app accessible.
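A rough sketch of this copy step (collection and field names are illustrative, not our exact schema). Note that Firestore caps each document at 1 MiB, one more reason to keep the fulltext out of the snapshot:

```python
# Rough sketch of the snapshot copy. Collection and field names are
# illustrative; Firestore limits each document to 1 MiB, so the
# ~50,000-word fulltext stays out of the snapshot.
from google.cloud import firestore

db = firestore.Client()  # uses default Google Cloud credentials

def push_snapshot(articles):
    """Write each article, minus its fulltext, as a Firestore document."""
    for article in articles:
        doc = {k: v for k, v in article.items() if k != "fulltext"}
        db.collection("articles").document(str(article["id"])).set(doc)
```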


What benefits does it have?

The key benefit, as mentioned in the challenge, is that this extended interface enhances the original database with the metadata attributes and functions listed in the summary above for each article: Language, Topics, AI Summary, and Similar articles.

In this early stage of development there is also functionality that lets users select specific articles and run the metadata analyses above on demand, as sketched below.
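A minimal sketch of this enrichment step, assuming the langdetect package for language detection and Sumy's LSA summarizer (the exact Sumy summarizer we used may differ):

```python
# Minimal sketch of the on-demand enrichment, assuming langdetect and
# Sumy's LSA summarizer (the summarizer actually used may differ).
# Requires: pip install langdetect sumy nltk
from langdetect import detect
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer  # needs nltk 'punkt' data
from sumy.summarizers.lsa import LsaSummarizer

def enrich(fulltext, sentence_count=3):
    """Compute the Language and AI Summary attributes for one article."""
    language = detect(fulltext)  # e.g. 'en'
    parser = PlaintextParser.from_string(fulltext, Tokenizer("english"))
    sentences = LsaSummarizer()(parser.document, sentence_count)
    return {"language": language, "ai_summary": " ".join(map(str, sentences))}
```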


What tools, coding languages, hardware, or software did you use to develop your project?

Frontend - Firebase Hosting, Bootstrap, HTML, CSS, JS

Backend - Firebase Firestore, Keboola (extraction, orchestration), partly Google BigQuery for temporary data storage during processing, Python, Sumy NLP (https://www.topcoder.com/thrive/articles/text-summarization-in-nlp)

Space Agency Data

We used the NASA Technical Reports Server (NTRS), a digitized archive of various technical reports, articles, and papers, including the Content & Document Management System (CDMS), which we filtered by selecting Center = Legacy CDMS.

We accessed the database via the website https://ntrs.nasa.gov/ and its NTRS OpenAPI: https://ntrs.nasa.gov/api/openapi/

In addition, we used the following NASA guidelines:

NASA Thesaurus - https://sti.nasa.gov/nasa-thesaurus/


NASA STI Scope and Subject Category Guide - https://ntrs.nasa.gov/api/citations/20000025197/downloads/20000025197.pdf

Hackathon Journey

We were inspired to select this challenge by our professional background as data scientists. The challenge was close to real-life problems, and we could use our toolbox to the fullest. No matter the dataset (or the database behind it), AI-enhanced search and recommendation engines have a firm place in today's data science projects.



Per aspera ad astra (through hardships to the stars)!


Especially for our NLP/AI expert Melanie, it was a first step towards her dream career as an NLP data engineer. Being able to test pre-trained NLP models on real space challenges helped her understand the field. To quote Melanie: per aspera ad astra (through hardships to the stars). Tomáš discovered the potential and awesomeness of GitHub, Firebase, and Bootstrap. Michal explored new functions in Keboola and played with NLP models. Marek learned more about Firebase Hosting and Firestore management and coordinated the project from end to end.


As we were also challenged to create a functional solution, we discovered again how important it is to be able to move a local application and model to the cloud. In this case, Keboola and Google Cloud in particular were the perfect environments for putting our project into production.


The challenges we came across and overcame were an overloaded API, the structure and complexity of the PDFs and the text inside them, and the complexity of the topics we tried the pre-trained models on. We tested multiple NLP models and selected the one that handled the scientific articles best.


As in every data scientist's daily job, connecting two systems was a challenge: in our case, extracting data from NTRS into Keboola and from Keboola into Firestore for the frontend application. Keeping the formats consistent and forcing large strings (around 50,000 words) into the database was especially tricky. But Keboola managed it, which created a stable foundation for our progress. The same goes for data cleaning, which is an important but painful process.
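For illustration, a sketch of how such a large-text load can be staged. Keboola Storage imports tables from CSV, so the long fulltext strings must survive CSV quoting (field names here are illustrative):

```python
# Sketch of staging the fulltext for a Keboola Storage load. Keboola
# imports tables from CSV, so the ~50,000-word strings must survive
# quoting; QUOTE_ALL keeps embedded commas and newlines intact.
import csv

def write_staging_csv(articles, path="corpus_fulltext.csv"):
    """Write id + fulltext rows to a CSV file ready for import."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["id", "fulltext"])
        for article in articles:
            writer.writerow([article["id"], article["fulltext"]])
```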

Because this was a team effort, we relied heavily on collaboration tools such as GitHub, Google Workspace, Deepnote, Jupyter Notebooks, Keboola, and Slack. It was amazing to see how these tools perform in the real-time work of a team.


We would also love to mention that, as in every data science and development project, only after exploring the topic for a few hours did we understand how to approach it, create a solid business case, and develop the final product.

We would like to thank Vojta Tůma from Keboola for helping us load large amounts of text data into Keboola Storage, and Honza Spratek and the organizing team from Keboola and Planetum for providing us with an amazing venue, food, and support during the weekend hackathon.


There are some questions for future research:

How can we process the graphs and tables that are typically part of scientific papers?

How often should the database be updated, and which queries should be enhanced first?

Tags

#NTRS #AI #database #terminal #recommendation #NLP