High-Level Project Summary
Our project's objective was to solve an important problem in the way an enterprise (in this case NASA) stores and accesses its digital information. More specifically, we set out to transform physical files into user-friendly, searchable documents. To do this, we created a web application with a search engine that gives users rapid access to information. To make the process more intuitive, our search engine lets users search by the information they need rather than by a document's file name.
Link to Final Project
Link to Project "Demo"
Detailed Project Description
Technology Stack:
- Django (Front-End, Back-End, Webserver)
- MySQL (Database)
- Python (scripting language)
Functionalities:
- Extract text from PDFs (can add any number of PDFs)
- Pre-process & clean text from PDFs (many errors in original PDF files)
- Generate summaries from extracted text
- Generate keywords from extracted text
- Generate knowledge base relationships from extracted text
- Conduct search queries against digital information
- Search queries return ranked results + document name & summary
How does it work:
- Text is extracted using the PyPDF2 library, which extracts all readable text from PDFs
- Using regex (regular expressions in Python), we remove unwanted characters
- Using an NLP model, we fix grammatical mistakes so that the later steps perform better
- Using rake-nltk (a Python library commonly used for NLP tasks), we generate keywords for each page of each document, record how often each keyword is used, and save both values to our database.
- Using nltk & heapq (Python libraries), we generate summaries for each document and save them to our database.
- Using transformers & PyTorch (Python libraries), we generate knowledge base relationships for each page of each document and save them to our database.
- For our search engine, we use Django's search features to find all keywords & knowledge base relationships that CONTAIN the user's search input. A function we wrote then ranks/scores the matches, and we return to the user a ranked list of search results along with the corresponding documents' summaries.
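To make the cleaning and summary steps above more concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is an illustration under assumptions, not our exact code: it uses only the standard library (re, heapq, collections) where the project used nltk for tokenization and stopwords, and the names clean_text, summarize, and STOPWORDS as well as the tiny stopword list are hypothetical.

```python
import heapq
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption); nltk ships a full one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "was", "by", "on", "for", "with", "as", "are", "this"}


def clean_text(raw: str) -> str:
    """Replace unwanted characters with spaces, then collapse all whitespace."""
    text = re.sub(r"[^A-Za-z0-9.,;:!?'\"()\-\s]", " ", raw)
    return re.sub(r"\s+", " ", text).strip()


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences whose words are most frequent in the whole text."""
    # Naive sentence split on end-of-sentence punctuation (assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    if not freq:
        return ""
    top = max(freq.values())

    # Score each sentence by the normalized frequency of its non-stopwords.
    scores = {}
    for index, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        scores[index] = sum(freq[t] / top for t in tokens if t in freq)

    # heapq picks the best sentences; sorting restores document order.
    best = heapq.nlargest(max_sentences, scores, key=scores.get)
    return " ".join(sentences[i] for i in sorted(best))


raw_page = ("The Moon\x0c orbits   the Earth. Lunar samples were returned by Apollo. "
            "The Moon has lunar maria. Weather was nice.")
summary = summarize(clean_text(raw_page), max_sentences=2)
print(summary)
```

Scoring sentences by word frequency keeps the summary extractive: every sentence in the output appears verbatim in the source document, which matters when the source is an archival record.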
Space Agency Data
We used the provided NTRS data. More specifically, we downloaded 6 PDF files with the moon as a general subject.
Hackathon Journey
It all started when I (Nicola Kerin) was looking through my emails and saw that NASA had just completed its Double Asteroid Redirection Test (DART) mission. Somehow, from the live stream, I ended up on the spaceappschallenge.org website. I was ecstatic to learn that anyone could start a team and participate in any challenge. From the many challenges, I opted for the "Can AI Preserve Our Science Legacy?" challenge since I've recently developed a deep interest in NLP. I then recruited my father and we began our journey.
We quickly realized that the project we chose was going to require a hefty amount of time, and given that we had signed up with only 4-5 days until submission, we knew that we needed a structured approach. We started by going over all the concepts that we knew, which meant singling out the technologies that we'd eventually use. Although they may not have been the "best", they were the best options at our disposal. The next step was to set up a common DevOps repo for easy version control. Next, we started experimenting with various NLP models to output keywords from an input text. We then found an interesting sub-technology of NLP called Knowledge Bases. To put it in simple terms, using NLP we can extract knowledge bases, which are essentially libraries of information about a product, service, department, or topic. This gave us, alongside keywords, more advanced extracted information in the form of relationships between two subjects (for example: NASA --> is a --> Space Agency).
The next step was to start creating our corpus. To make testing easier, we chose PDFs that share the same general subject (in our case, the moon). We made our corpus easy to manipulate by using MySQL as a database & Django as a Front-End/Back-End/Webserver. Once the database was populated with all the documents, pages, keywords (per page), and knowledge bases (per page), the next step was to create a search engine that would take the search query and find all related documents. To do this we used some simple logic: we ordered results by frequency distribution, meaning we displayed first the documents that had the most related keywords and knowledge base relationships.
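The ranking idea can be sketched in a few lines of plain Python. This is a simplified illustration, not our production code: the Keyword and Relation classes, the search function, the scoring weights (keyword frequency plus one point per matching relationship), and the file names are all assumptions made for the example. In the real application the records live in MySQL, and a Django lookup such as icontains would presumably play the role of the "contains" test.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    document: str   # hypothetical file name
    page: int
    text: str
    frequency: int  # how often the keyword occurs on that page


@dataclass(frozen=True)
class Relation:
    document: str
    page: int
    head: str       # e.g. "NASA"
    relation: str   # e.g. "is a"
    tail: str       # e.g. "Space Agency"


def search(query, keywords, relations):
    """Return (document, score) pairs, highest score first."""
    needle = query.strip().lower()
    if not needle:
        return []

    scores = defaultdict(int)
    # A keyword that CONTAINS the query adds its frequency (assumed weight).
    for kw in keywords:
        if needle in kw.text.lower():
            scores[kw.document] += kw.frequency
    # A matching relationship adds one point (assumed weight).
    for rel in relations:
        if needle in f"{rel.head} {rel.relation} {rel.tail}".lower():
            scores[rel.document] += 1

    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keywords = [
    Keyword("apollo_report.pdf", 1, "lunar surface", 3),
    Keyword("apollo_report.pdf", 2, "lunar module", 1),
    Keyword("moon_geology.pdf", 1, "lunar regolith", 2),
    Keyword("moon_geology.pdf", 4, "basalt", 5),
]
relations = [
    Relation("moon_geology.pdf", 1, "Lunar regolith", "covers", "Moon"),
    Relation("apollo_report.pdf", 3, "NASA", "is a", "Space Agency"),
]

results = search("lunar", keywords, relations)
print(results)
```

Summing per document means a file with many weak matches can outrank a file with one strong match; whether that is desirable depends on the corpus, so the weights are worth tuning.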
Once the search engine was functional, the final step was to display the content (documents) to the user. We wanted to present the user with the top (most relevant) results, and we wanted each result to display the name of the document, a summary of the document, and the key topics (keywords/knowledge base relationships). Our next step would have been a custom page for users to conduct these searches. Unfortunately, time did not permit us to build it.
Throughout this whole process, we learnt a great deal about NLP, Django, MySQL and Python, not only on an individual level but also on an organizational level. We had to create a scalable infrastructure that allowed users to add more files, new keywords, new knowledge bases, new summaries, etc. Many, many, many problems arose! But with our tenacity, we managed to create something we are truly proud of.
Thank you NASA Space Apps for this amazing opportunity & experience!
References
NLTK:
- https://realpython.com/nltk-nlp-python/
SUMMARIZATION:
- https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/
- https://stackabuse.com/text-summarization-with-nltk-in-python/
KNOWLEDGE BASE:
- https://medium.com/nlplanet/building-a-knowledge-base-from-texts-a-full-practical-example-8dbbffb912fa
DJANGO:
- https://stackoverflow.com/questions/910169/resize-fields-in-django-admin
- https://stackoverflow.com/questions/12522661/how-to-override-the-queryset-giving-the-filters-in-list-filter
- https://docs.djangoproject.com/en/4.1/ref/contrib/admin/filters/
Tags
#AI, #NLP, #NASA, #SPACEAPPSCHALLENGE

