Awards & Nominations

FIONA has received the following awards and nominations. Way to go!

Global Finalists Honorable Mentions

Towards preserving our science legacy: Building a topically-aware, searchable, and accessible system

High-Level Project Summary

The NTRS repository is hard to use in its current state: its filter-based search is only helpful when users know exactly what they are looking for. To address this issue, we created an enhanced search engine that displays the topics present in the collection before showing the relevant documents. Comparing search results by topic lets users explore them more efficiently and understand them better. In our system, users submit a query and then select the topics they are interested in; those topics guide the retrieval process and refine the results returned by a state-of-the-art search engine.

Link to Final Project

Detailed Project Description

Our web-based solution facilitates the exploration of the data stored in the NASA Technical Report Server (NTRS) by using a state-of-the-art search engine with extended capabilities for managing, displaying and interacting with topics. 

The entire NTRS database was indexed so that it can be accessed efficiently through user queries [2], and we explored the use of topic modeling [4, 6] to discover topics in the corpus. Our solution combines these two tools to provide an enhanced search experience. By displaying the topics and letting users filter by them, it helps users find documents that are more closely related to their topics of interest.
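As an illustration of how topic selection could narrow a ranked result list, here is a minimal sketch in plain Python. It is not taken from our codebase: the function name `filter_by_topics`, the `min_weight` threshold, and the choice to multiply the search score by the topic weight are all assumptions made for the example, as are the toy scores and topic distributions.

```python
def filter_by_topics(hits, doc_topics, selected, min_weight=0.3):
    """Keep hits that carry enough weight on the selected topics, then re-rank.

    hits       : list of (doc_id, search_score) pairs from the search engine
    doc_topics : dict mapping doc_id to its topic distribution (list of floats)
    selected   : set of topic indices chosen by the user
    min_weight : hypothetical cut-off on the combined weight of selected topics
    """
    ranked = []
    for doc_id, score in hits:
        # Total probability mass the document places on the chosen topics.
        weight = sum(doc_topics[doc_id][t] for t in selected)
        if weight >= min_weight:
            # Assumed scoring rule: scale the search score by the topic weight.
            ranked.append((doc_id, score * weight))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy data: three hits and their (made-up) distributions over three topics.
hits = [("doc_a", 2.0), ("doc_b", 1.5), ("doc_c", 1.0)]
doc_topics = {
    "doc_a": [0.1, 0.8, 0.1],
    "doc_b": [0.7, 0.2, 0.1],
    "doc_c": [0.5, 0.1, 0.4],
}

results = filter_by_topics(hits, doc_topics, selected={0})
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In this toy run, the hit with the highest search score places little weight on topic 0, so it is dropped once the user selects that topic. This is the kind of behavior topic filtering is meant to provide.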

Our system models the topics in the corpus by estimating their distribution over the whole corpus with the Latent Dirichlet Allocation [4] implementation in the scikit-learn [6] library. When a user searches for a key phrase, the system maps the query into the same representation as the topics, i.e., the same topic space, and displays the topics related to the query in a 2D scatter plot. The user can then select the desired topics through a visualization on our application’s dashboard [1, 3]. Once the topics are selected, the user receives query results based on those topics. For each search result, the system also displays statistics such as word frequency counts and a summary, so that the user can quickly judge the contents of the document.

In the future, we hope to build a scalable, high-performance search engine that integrates topic modeling into the experience of exploring the NTRS repository. We hope this codebase will serve as a starting point for making data accessible in other domains, and possibly for extending that accessibility to other forms of information such as audio, images, and video.

Our project is built entirely within the Python ecosystem. We used the following packages:


  • User Interface: Dash [1]
  • Search Engine: Whoosh [2]
  • Topic Modeling: scikit-learn (Latent Dirichlet Allocation) [4, 6]
  • Data Preprocessing: spaCy, pandas, NumPy [7, 9, 10]
  • API to fetch metadata and data: requests and json
  • Plotting: Plotly [3]
  • General Natural Language Processing tool kit: NLTK [8]

Codebase: https://github.com/hurricane-fiona/nasa

Space Agency Data

We used a subset of the NTRS data available here to demonstrate the viability of our idea. Our current working prototype has a search engine indexed on 166,441 metadata files, fetched chronologically starting from 1960, and our initial topic model used the full text of the documents where it was available (3,019 full texts).

Hackathon Journey

This SpaceApps challenge was a collaborative, challenging, educational, and fun experience for our team. The hackathon gave our team of mostly academics and first-time hackathon participants an opportunity to work on a tangible, real-world problem. We learned from each other and worked together as a team, playing to individual strengths and compensating for each other’s weaknesses to develop a usable product.

We approached the problem by brainstorming ideas as a team and identifying the features we could implement in a prototype within the available time frame. We then broke the system down into its components and assigned each one to the team members responsible for it. We resolved setbacks and challenges by communicating efficiently and effectively with each other, with one team member acting as an arbitrator whenever a couple of members were at an impasse.

We would like to thank NASA for hosting the competition, Dalhousie University for hosting the local event, and the members of the team for creating a friendly and collaborative learning experience for each other.

References

  1. Dash Bootstrap Components: Nokeri, Tshepo Chris. "Dash Bootstrap Components." Web App Development and Real-Time Web Analytics with Python. Apress, Berkeley, CA, 2022. 87-97.
  2. Whoosh — A fast, pure Python search engine library. https://whoosh.readthedocs.io/en/latest/ 
  3. Plotly: Sievert, Carson. Interactive web-based data visualization with R, plotly, and shiny. CRC Press, 2020.
  4. Topic modeling: Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
  5. Animaker. https://app.animaker.com/
  6. Scikit-learn: Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research 12 (2011): 2825-2830.
  7. spaCy: Srinivasa-Desikan, Bhargav. Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd, 2018.
  8. NLTK: Hardeniya, Nitin, et al. Natural Language Processing: Python and NLTK. Packt Publishing Ltd, 2016.
  9. Pandas: McKinney, Wes. "pandas: a foundational Python library for data analysis and statistics." Python for High Performance and Scientific Computing 14.9 (2011): 1-9.
  10. NumPy: Harris, Charles R., et al. "Array programming with NumPy." Nature 585.7825 (2020): 357-362.

Tags

#SearchEngine, #InformationRetrieval, #TopicModelling, #MachineLearning, #NaturalLanguageProcessing