High-Level Project Summary
The project searches through millions of words across thousands of files. It combines three Natural Language Processing techniques: summarization, keyword identification, and relevance-based search. Input data of any size is vectorized and searched against the user's query. This lets a user locate any desired file or perform semantic analysis across terabytes of data. The solution can be used by any firm or organization that deals with Big Data.
Link to Final Project
Link to Project "Demo"
Detailed Project Description
There are no significant competitive entry barriers for our product; what will make the business successful is our highly optimized search algorithm, which performs well even with a very large number of files and a large volume of data. As of now, to our knowledge, no one in Europe offers a similar solution (searching for a particular phrase or sentence across terabytes of data on a system), and this will be our USP (Unique Selling Point). It will enable large-scale enterprises to search through their records, whether they belong to healthcare, marketing, or any other domain.
The simple pipeline is:
- Input data
- Extract keywords
- Extract summary
- Extract relevant documents
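The two extraction stages can be illustrated with a minimal sketch. It assumes a simple term-frequency approach for both keywords and summaries, which is a stand-in chosen for brevity and not the project's actual method (the references point to YAKE for keyword extraction and to a semantic extractive summarizer). The names `STOPWORDS`, `tokenize`, `extract_keywords`, and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens, minus stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms in the text."""
    counts = Counter(tokenize(text))
    return [word for word, _ in counts.most_common(top_n)]


def summarize(text, max_sentences=2):
    """Extractive summary: keep the sentences with the highest average term frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(counts[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentences by score, then restore document order for readability.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Calling `extract_keywords(text)` and `summarize(text)` on the text of each input file would yield the keyword list and the short summary that the later search stage can display alongside each result.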
The social impact of this application is that fewer human resources will be needed to search through documents, which in turn will improve the results and the user experience.
Software:
- VS Code
- GitHub
- Google Colab
Languages:
- Python
Space Agency Data
We accessed data through the Swagger UI (nasa.gov), specifically records with center = "Legacy CDMS". We wrote a small script to fetch data from the API, which gave us a few documents; these are in our project repo under the nasa_dataset folder.
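A script of this kind might look roughly like the sketch below. It is an illustration only: the base URL passed to `build_query_url` is a placeholder, the record structure is assumed to be JSON-serializable dictionaries, and the helper names and output file naming are hypothetical. Only the `center` filter value and the `nasa_dataset` folder name come from the description above. The actual HTTP request is left out because the endpoint is not stated here.

```python
import json
from pathlib import Path
from urllib.parse import urlencode


def build_query_url(base_url, center):
    """Build a search URL that filters by center (the parameter name is an assumption)."""
    return f"{base_url}?{urlencode({'center': center})}"


def save_documents(records, out_dir="nasa_dataset"):
    """Write each record to its own JSON file in out_dir and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, record in enumerate(records):
        path = out / f"doc_{i:04d}.json"  # hypothetical naming scheme
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

The records returned by the API for the URL from `build_query_url(base_url, "Legacy CDMS")` would then be passed to `save_documents` to populate the dataset folder.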
The data was used for the following purposes:
- Test keyword extraction on space agency documents
- Test relevance-based search on the space agency dataset
- Test summarization on the space agency dataset
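As an illustration of how relevance-based search over a small document set can be checked, here is a minimal sketch. It assumes TF-IDF weighting with cosine similarity, a common baseline that may differ from the vectorization the project actually uses; the function names (`build_index`, `vectorize`, `cosine`, `search`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Map tokens to a sparse TF-IDF vector, ignoring terms unseen at index time."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def build_index(docs):
    """docs maps a document name to its text. Returns (idf weights, document vectors)."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    # Smoothed IDF so that terms found in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {name: vectorize(tokens, idf) for name, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, idf, vectors, top_k=3):
    """Return up to top_k (name, score) pairs with non-zero similarity, best first."""
    query_vec = vectorize(tokenize(query), idf)
    scored = [(name, cosine(query_vec, vec)) for name, vec in vectors.items()]
    scored = [(name, score) for name, score in scored if score > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```

The index is built once with `build_index` and reused for every query, so only the query needs to be vectorized at search time.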
The final results were developed on this small portion of the space agency dataset, which is available in our final project.
Hackathon Journey
It was quite fun.
We learned to work on real-world problems.
We have all been working on NLP for a long time, and it is very much a need of the current world, so that intrigued us.
Time management was a setback, as we are all working professionals, but collaborative tools helped us a lot.
References
- Nazir, Waseemullah & Fatima, Zainab & Zardari, Shehnila. (2022). A Novel Approach for Semantic Extractive Text Summarization. Applied Sciences. 12. 4479.
- PPT template by Slidesgo (Free Google Slides themes and PowerPoint templates)
- Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp. 257-289.
- Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 684-691.
- Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 806-810.
- The remaining dependencies are listed in NASA_SPACE_APPS_CHALLENGExLINEAX/requirements.txt at master · ItsMeAbby/NASA_SPACE_APPS_CHALLENGExLINEAX (github.com)
Tags
#pdf #search #document-access

