SUNYANTRA - Our Literature Legacy

High-Level Project Summary

Sunyantra is a search-engine, document-analysis, and recommendation website that improves the accessibility and discoverability of public technical documents from custom-enriched NASA Technical Reports Server (NTRS) data through Natural Language Processing (NLP). We have developed a powerful search engine that not only searches the database but also lets users upload documents to the website we have built. Our system automatically analyzes the uploaded files and recommends a list of similar documents. It is also equipped with features such as language detection, document summarization, keyword extraction, and term-frequency tallies.

Link to Final Project

Detailed Project Description

Introduction

Our team "Sunyantra" chose to work on the challenge "Can AI Preserve Our Science Legacy?". The challenge focuses mainly on developing techniques that use artificial intelligence to improve the accessibility and discoverability of records in the public NASA Technical Reports Server (NTRS). NTRS holds more than 381,547 scientific documents, making it difficult to locate desired information in such a large repository. Acknowledging the problem, and taking the resources and data provided by the agency into consideration, our team came up with a solution that integrates the finest AI technology, creativity, and research.

Status Quo

We came up with the idea of a search engine that supports search by file, document analysis, document-based recommendation, and enrichment of our existing NTRS data. Our final product is a web application that communicates our technology to laypersons. The Unique Selling Point (USP) of our project is that users can upload documents and files to the website, and our AI system automatically suggests a list of similar documents along with a summary, predicted keywords, a word-frequency tally, and the language used in the document. We achieved all of this by fine-tuning models on the titles and abstracts of documents, using transfer learning from pre-trained models to reach higher accuracy with fewer resources, as sketched below.
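As a rough illustration of that step, the sketch below shows how fine-tuning a pre-trained checkpoint on title/abstract text could look with the Hugging Face Trainer API; the CSV path, column names, label count, and hyperparameters are placeholders rather than our exact training setup.

```python
# Minimal transfer-learning sketch: fine-tune a pre-trained encoder on
# NTRS titles and abstracts. The CSV path, column names, label count,
# and hyperparameters below are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Hypothetical CSV with a "text" column (title + abstract) and a "label" column.
data = load_dataset("csv", data_files="ntrs_corpus.csv")["train"]
data = data.train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

data = data.map(tokenize, batched=True)

# num_labels is a placeholder for the size of the keyword label set.
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=8)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```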

Our website is built with the Python programming language and the FastAPI framework, and is hosted on the Heroku platform because it suits automation and accelerates development productivity. The frontend is written in JavaScript with ReactJS, is hosted on Netlify, and communicates with the backend over a REST API. Our machine learning models are hosted on an AWS SageMaker instance for better performance. We were provided with the NASA NTRS open API as one of the resources, and with its help we scraped around 2,500 rows of document metadata. Some records in NTRS had empty values, especially those from the Legacy CDMS center; we filled in fields such as keywords and abstracts, plus additional statistical metrics like word frequency and detected language, using our machine learning models. Once our data corpus was ready, we developed ML models for text summarization, keyword classification and frequency measurement, and language detection. We used the PyPDF2 Python library to extract text from the PDFs.
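As an example of those last two steps, the sketch below pairs PyPDF2 text extraction with a simple word-frequency tally; the file name and the regex tokenizer are illustrative, not our production pipeline.

```python
# Extract text from a PDF with PyPDF2 and tally word frequencies.
# The file name and the simple regex tokenizer are illustrative only.
import re
from collections import Counter

from PyPDF2 import PdfReader

reader = PdfReader("report.pdf")  # hypothetical NTRS document
text = "\n".join(page.extract_text() or "" for page in reader.pages)

words = re.findall(r"[a-zA-Z']+", text.lower())
frequencies = Counter(words)

print(frequencies.most_common(10))  # ten most frequent terms
```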

Our Future Goals

In the near future, we plan to use Grover's algorithm in our search engine to speed up search, reducing the lookup complexity from O(N) to O(√N). We also plan to let users describe a research concept in the search bar and have the AI return similar documents for that idea. For researchers exploring new directions, this would be a perfect tool to check whether particular work has already been done; if it has not, they can pursue and publish it themselves. We are also looking into letting users search phrases in their local language and receive document summaries in that language.

Resources

We have pushed our code to GitHub. The README file inside each repo explains the prerequisites and setup required to run our project on your device. The GitHub repository for our frontend application is at https://github.com/lamdiv/Sunya, and its production build is hosted on Netlify at https://sunyantra.netlify.app/. Our backend application, hosted on Heroku, is at https://nasa-spaceapp-inference.herokuapp.com/docs, and its GitHub repo is at https://github.com/sahajrajmalla/nasa-spaceapp-inference.

Space Agency Data

The challenge we are working on focuses mainly on the accessibility and discoverability of records. As stated in the challenge, we had to come up with a solution for easily locating desired information in a very large repository, so making the best use of the data is the foundation of our project. We were provided with the NASA NTRS API as a resource. Our team used the API to scrape data from NTRS, then used that data to fine-tune machine learning models from Hugging Face and host them on an Amazon Web Services SageMaker instance. We also enriched the NTRS dataset for the Legacy CDMS center and others that had empty abstracts and keywords, and computed statistical analyses of those documents, such as word frequency and similar listings. This is how we prepared our data corpus. The automated Python script we developed scraped more than 2,500 scientific documents/PDFs and converted them into text and CSV files, respectively. We used the corpus to train our ML models to recommend similar files, summarize documents, detect language, and list keywords with the frequency of their use. The three pre-trained models used for transfer learning are papluca/xlm-roberta-base-language-detection for language detection, sshleifer/distilbart-cnn-12-6 for document summarization, and distilbert-base-uncased for keyword classification.
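For illustration, loading two of these checkpoints through the Hugging Face pipeline API could look like the sketch below; the sample abstract and generation lengths are placeholders, and the keyword classifier would load our fine-tuned distilbert-base-uncased weights rather than the base checkpoint.

```python
# Load the pre-trained checkpoints named above via Hugging Face pipelines.
# The sample abstract and length limits are placeholders; the keyword
# classifier (fine-tuned distilbert-base-uncased) is omitted here.
from transformers import pipeline

language_detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

abstract = (
    "This report surveys thermal protection concepts for blunt-body "
    "re-entry vehicles and summarizes the associated flight-test results."
)

print(language_detector(abstract))   # e.g. [{'label': 'en', 'score': 0.99}]
print(summarizer(abstract, max_length=40, min_length=10)[0]["summary_text"])
```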

Hackathon Journey

The NASA Space Apps Challenge is one of the most ambitious and back-breaking hackathons our team has ever participated in. To reach our target, we all had to push ourselves outside our comfort zones. The journey was challenging and exhausting, but when we finally completed our project, we were in awe of our own potential; it made us realize how much we had been underestimating ourselves. We are grateful for the opportunity to participate, because without this competition and all the hustle throughout the week, we would never have unleashed our strength. Beyond that, we learned to work in a team, to respect and accept the variety in each other's thinking, to resolve internal conflicts together, and much more that we will be forever grateful for.

Our team's formation is a bit of an interesting story, but the most amazing part is how well the teammates' visions matched: all of us wanted to do something with AI. We were most drawn to "Can AI Preserve Our Science Legacy?" because we thought a contribution to organizing scientific papers and documents could create a meaningful impact on the scientific community. Research papers are among the most important assets of the science community, and their number grows every day, so managing them well is crucial. Hence, we came up with the idea of a search engine for easily locating desired information in such a large repository. First, we built a corpus and trained our ML models on that data. Our final product is a website: users can upload a file or enter keywords in the search bar, and our algorithm analyzes the upload or query and recommends a list of similar files. The website also offers file summarization, language detection, keyword extraction, and word-frequency counts.

Participating virtually in such a competitive hackathon was tedious; we faced technical errors and communication gaps time and again. It is also festival season here in our home country, and with so much celebration around us it was difficult to create a focused working and coding environment. However, we overcame these challenges with patience and faith in each other. Finally, we would like to thank the Nepal Astronomical Society (NASO) for hosting local events and teaching us about the NASA Space Apps Challenge. Participating in the hackathon was a fruitful and enlightening experience.

Tags

#NLP, #NTRS, #text-analytics, #python, #transformers, #javascript, #ReactJS, #document, #heroku, #netlify, #NASA, #space, #literature