RSR

High-Level Project Summary

Readanator Scanator Reportanator (RSR) is an application that applies AI-driven natural language processing (NLP). It can be used by scientists and historical researchers. Many NASA Technical Reports Server (NTRS) documents are legacy documents that were digitized by scanning and Optical Character Recognition (OCR), which makes them difficult to read and use within NTRS. Our AI uses NLP to automatically read these documents, summarize them, and produce a list of keywords. This addresses the challenge by improving the accessibility of these documents.

Detailed Project Description

When developing our project, we focused on doing exactly what the challenge asked for in its main objective section. We considered what types of information future researchers will need to locate desired documents and what data would best aid their search for relevant information. This led us to develop a program that sorts through a corpus of PDF files, selects the relevant documents, and summarizes each PDF's 10-20 pages into a short 10-20 paragraph output. We also wanted the program to filter out irrelevant words, so we created a function to do so, making the summarized output much more concise. This allows researchers to access and analyze data from the NTRS records more efficiently, which achieves the goal of aiding their search for relevant information. The entire program was written in Python using various libraries.
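The word-filtering step can be illustrated with a short sketch. The snippet below is a hypothetical version that relies on NLTK's English stop-word list; our actual function may differ in its details, and the name extract_keywords is illustrative rather than taken from our code.

# Hypothetical sketch of the stop-word filtering / keyword step, assuming NLTK.
import re
from collections import Counter

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

def extract_keywords(text, top_n=15):
    """Return the most frequent non-stop-words in a document's text."""
    words = re.findall(r"[a-zA-Z]{3,}", text.lower())
    stops = set(stopwords.words("english"))
    filtered = [w for w in words if w not in stops]
    return [word for word, _ in Counter(filtered).most_common(top_n)]

Dropping stop-words before counting is what keeps the keyword list and the summary focused on the document's actual subject matter.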

Space Agency Data

For Space Agency Data, we first tried the NTRS API, pulling data with API GET requests through a tool called Postman. However, after realizing that the API could not access the content of the PDF files/articles on NTRS, we instead used the NTRS database directly, manually downloading PDF files and building our own corpus for the program to analyze, extract, and summarize.
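For reference, a citation query like the ones we ran through Postman can also be issued from Python with the requests library. The sketch below assumes the public NTRS citation search endpoint and response fields; treat the exact parameter and field names as assumptions rather than confirmed details of our workflow.

# Hypothetical NTRS citation search; endpoint, parameters, and fields are assumptions.
import requests

response = requests.get(
    "https://ntrs.nasa.gov/api/citations/search",
    params={"q": "optical character recognition"},
    timeout=30,
)
response.raise_for_status()
for record in response.json().get("results", []):
    # Each citation record links back to its NTRS entry but not to the PDF text itself.
    print(record.get("title"), "-", record.get("id"))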

Hackathon Journey

Day 1 of our Hackathon Journey began around 8:30 AM, when the team was briefed on the challenge to be tackled over the coming 48 hours. We began by researching natural language processing and how to use text summarization to develop an application that could produce summaries of large volumes of articles and text in the NASA Technical Reports Server (NTRS).


Romain: On Day 1, I used an infographic website maker called Canva.com to design the user interface of the application we were creating. My main challenge was figuring out what exactly needed to be on the user interface, so I looked at database websites like PubMed to help with the formatting. Once I had identified the features the app needed, such as an advanced search tool and a key-term filter, it was a matter of carefully putting everything together. I used the free graphics on Canva.com as well as a royalty-free image website called Unsplash.com to build the user interface visually. Following this, I spent my time developing the schematic for the user interface. The main difficulty was setting up the logical structure, since a flow chart covering many buttons on a page can become quite convoluted.


Kaevon: On Day 1, I started off by attempting to pull data from the NTRS library using the NTRS API. I issued GET requests through a tool called Postman and was able to pull the citations of the articles into an output terminal, including a link to each article. After realizing I could not extract the actual content of the articles through the API, I went another route: manually downloading the PDF files of the articles, then writing Python code to analyze, extract, and summarize the key concepts of the PDF files. This program essentially condenses a 20-page PDF file into a concise set of about 20 paragraphs, allowing for easier analysis of the data.
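A rough illustration of that summarization step follows. It uses a common frequency-based extractive approach with NLTK; the actual hackathon code may have scored sentences differently, and the name summarize is illustrative only.

# Hypothetical frequency-based extractive summarizer (one common approach,
# not necessarily the exact method used during the hackathon).
from collections import Counter

from nltk.tokenize import sent_tokenize, word_tokenize  # requires: nltk.download("punkt")
from nltk.corpus import stopwords                        # requires: nltk.download("stopwords")

def summarize(text, max_sentences=20):
    stops = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stops]
    freq = Counter(words)
    sentences = sent_tokenize(text)
    # Score each sentence by the frequencies of the content words it contains.
    scored = sorted(
        sentences,
        key=lambda s: sum(freq.get(w.lower(), 0) for w in word_tokenize(s)),
        reverse=True,
    )
    top = set(scored[:max_sentences])
    # Re-emit the selected sentences in their original order.
    return "\n\n".join(s for s in sentences if s in top)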


Estevan: On Day 1, I started off by understanding the challenge and how NASA currently finds the records it needs. I then located the resources needed to extract text from PDF files and store that text in memory. After comparing different ways of extracting text from PDFs, I chose the 'pdfminer' module since it made the conversion from PDF to text smoother, and I discovered a number of other modules that are useful for extracting data from PDFs. I then began researching how to apply NLP so the AI could generate summaries from the PDFs.
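As a minimal sketch of that extraction step, pdfminer.six exposes a high-level extract_text helper; the load_corpus wrapper below is illustrative rather than our exact code.

# Minimal sketch of the text-extraction step using pdfminer.six.
from pathlib import Path

from pdfminer.high_level import extract_text

def load_corpus(folder):
    """Extract the raw text of every PDF in a folder, keyed by file name."""
    return {pdf.name: extract_text(str(pdf)) for pdf in Path(folder).glob("*.pdf")}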


Kenneth: On Day 1, the first thing I did was understand the problem and break it down into smaller challenges. One crucial component of our team's challenge was becoming familiar with natural language processing, and after some research on NLP the team decided how we would use and implement it. I was then tasked with reviewing the NASA resources and figuring out how to use them in our project. I also researched various methods for setting up an environment for the AI, as well as some techniques for developing it.


Joseline: On Day 1, I began designing a simple team image that could represent all of our members in a single picture, so we could capture the experience in a photo. I also worked on a simple sketch for our team's logo. Once I had drafted each, I solidified the designs and added color, finishing the team picture first and then the logo for our team's name. Lastly, I drafted a simple template to use for the presentation of our application.