Lineax

FORAGE!

High-Level Project Summary

FORAGE! searches through millions of words across thousands of files. The project combines three Natural Language Processing techniques: summarization, keyword identification, and relevance-based search. Input data of any size is vectorized and searched against the user's query. This lets any user locate a desired file or perform semantic analysis across terabytes of data. The solution can be used by any firm or organization that deals with Big Data.

Link to Final Project

https://github.com/ItsMeAbby/NASA_SPACE_APPS_CHALLENGExLINEAX

Link to Project "Demo"

https://docs.google.com/presentation/d/1w8fyKMRBY48hwHtrPQ6geX7JUoXLi-eV

Detailed Project Description

Our product faces no significant competitive entry barriers. What will make the business successful is our highly optimized search algorithm, which performs well even on a very large number of files and volume of data. As of now, no one in Europe offers a comparable solution (searching for a particular phrase or sentence across terabytes of data on a system), and this is our USP (Unique Selling Point). It will enable large-scale enterprises to search through their records, whether they belong to healthcare, marketing, or any other domain.

The simple pipeline is:

Input data
Extract keywords
Extract summary
Extract relevant documents
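
As an illustration only, the steps above can be sketched with nothing but the Python standard library. This is not the code from our repository: the frequency-based keyword ranking, the sentence-scoring summarizer, the TF-IDF weighting with cosine similarity, the stop-word list, and every function name below are assumptions chosen to keep the sketch small and self-contained.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "with", "by", "was", "were", "it", "this", "that"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def extract_keywords(text, top_k=5):
    """Rank tokens by raw frequency (a simple stand-in for a real keyword extractor)."""
    return [word for word, _ in Counter(tokenize(text)).most_common(top_k)]


def extract_summary(text, max_sentences=1):
    """Extractive summary: keep the sentences with the highest average token frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    best = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in best)


def tfidf_vectors(docs):
    """Turn each document into a sparse {token: tf-idf weight} dictionary."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, top_k=3):
    """Return up to top_k (document index, similarity) pairs, most relevant first."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scores = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    docs = [
        "The rocket engine test was completed at the launch site.",
        "Soil moisture data from satellite sensors.",
        "Budget report for the fiscal year.",
    ]
    print(extract_keywords(docs[0]))
    print(extract_summary(" ".join(docs)))
    print(search("rocket engine", docs))
```

A sketch like this rebuilds the vectors on every query, which is fine for a handful of files; for large collections the document vectors would be computed once and stored.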

The social impact of this application is that fewer human resources will be needed to search documents, which in turn improves both the results and the user experience.

Software:

VS Code
GitHub
Google Colab

Languages:

Python

Space Agency Data

We accessed data through the API documented in Swagger UI (nasa.gov), specifically records with center = "Legacy CDMS". We wrote a small script to pull data from the API, which gave us a few documents; these are in our project repo under the nasa_dataset folder.
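
A minimal sketch of what such a download script could look like is given here for illustration. It is not the script from our repo: the endpoint URL is a deliberate placeholder, and the query parameter names, the response shape, and the function names are all assumptions that would need to be checked against the API's Swagger page.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: replace with the real search endpoint listed in Swagger UI.
SEARCH_URL = "https://example.invalid/api/search"


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_documents(center="Legacy CDMS", limit=10, get_json=http_get_json):
    """Request records for one center and return at most `limit` of them.

    Assumed response shape (not verified against the real API):
    {"results": [{"title": ..., "abstract": ..., "center": ...}, ...]}
    """
    query = urllib.parse.urlencode({"center": center, "limit": limit})
    payload = get_json(f"{SEARCH_URL}?{query}")
    records = payload.get("results", [])
    # Filter again on the client side in case the server ignores the parameter.
    return [r for r in records if r.get("center") == center][:limit]
```

Passing the HTTP helper in as `get_json` keeps the filtering logic testable without network access.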

The data was used for the following purposes:

Test keyword extraction on space agency documents
Test relevance-based search on the space agency dataset
Test summarization on the space agency dataset

The final results were developed on a small chunk of the space agency dataset, which is available in our final project.

Hackathon Journey

It was quite fun.

We learned to work on real-world problems.

We have all been working on NLP for a long time, and it addresses a real need in today's world, so the challenge intrigued us.

Time management was a challenge because we all work professionally, but collaborative tools helped us a lot.

References

Nazir, Waseemullah & Fatima, Zainab & Zardari, Shehnila. (2022). A Novel Approach for Semantic Extractive Text Summarization. Applied Sciences. 12. 4479.
PPT template by Slidesgo (free Google Slides themes and PowerPoint templates)
Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp 257-289.
Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 684-691.
Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 806-810.
The remaining dependencies are listed in requirements.txt on the master branch of ItsMeAbby/NASA_SPACE_APPS_CHALLENGExLINEAX (github.com).