PDF Document Analysis

High-Level Project Summary

The project extracts key information from a PDF file of the user's choice. It is written in Python and consists of three scripts, each with a different purpose. 'pdf document info.py' extracts information about the document, such as the author and title. 'pdf summarizer.py' summarizes the document in 100 words. 'pdf keywords.py' extracts the top 20 keywords of the document. By analyzing a PDF file with these features, researchers can save time when looking for information and can categorize documents according to their needs.

Detailed Project Description

The project summarizes a PDF document, extracts information such as the title, the author, and other data that might be useful to the reader, and extracts keywords from the document. It was developed in Python, and the main library used was gensim, which is practical and does not require complex code.
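To illustrate the idea of summarizing to a word budget, here is a minimal sketch that uses only the Python standard library. It is not the project's code and it is not gensim's algorithm: it simply scores each sentence by the average frequency of its words and keeps the best-scoring sentences that fit within the limit. The names `summarize`, `split_sentences`, `tokenize`, and `STOP_WORDS` are hypothetical, and the input is assumed to be text already extracted from the PDF. (The `gensim.summarization` module that gensim 3.x provided was removed in gensim 4.0, which is another reason the sketch avoids depending on it.)

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "was",
    "were", "it", "that", "this", "for", "on", "with", "as", "by", "be",
}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text and return its alphabetic words minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, word_count=100):
    """Return the highest-scoring sentences that fit within word_count words."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        # Average document-wide frequency of the sentence's content words.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Visit sentences from best to worst score and keep those that still fit.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= word_count:
            chosen.append(i)
            used += length

    # Emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))


if __name__ == "__main__":
    sample = ("Mars rovers study rocks. Rovers send rock data to Earth. "
              "The weather was nice today. Rock data helps Mars science.")
    print(summarize(sample, word_count=10))
```

Changing `word_count` is all it takes to make the summary longer or shorter, which mirrors the adjustable 100-word limit described for the project.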

In each of the .py files (summarizer, keywords, document info), the application asks the user to supply a PDF file. In 'pdf document info.py', one line of code was enough to extract several pieces of information about the file. In 'summarizer.py', the document first had to be converted to text; then, again with simple code, a satisfactory 100-word summary was produced, and the word limit can easily be raised or lowered. Finally, 'pdf keywords.py' extracts the top 20 keywords from the document and displays them in a data frame.
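The keyword step can be sketched in a similar way. The example below is an assumption-laden stand-in rather than the project's implementation: it ranks words by relative frequency instead of using gensim's keyword scoring, and it prints a plain-text table where the project uses a data frame. The names `top_keywords`, `format_table`, and `STOP_WORDS` are hypothetical, and the input is assumed to be text already extracted from the PDF.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "the", "and", "for", "are", "was", "were", "that", "this", "with",
    "from", "have", "has", "not", "but", "its", "which",
}


def top_keywords(text, top_n=20):
    """Return up to top_n (word, weight) pairs, heaviest first.

    The weight is the word's share of all counted words, a simple
    stand-in for the weights a dedicated keyword extractor would give.
    """
    # Keep alphabetic words of three or more letters that are not stop words.
    words = [w for w in re.findall(r"[a-z]{3,}", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_n)]


def format_table(rows):
    """Render (word, weight) pairs as a two-column plain-text table."""
    lines = [f"{'keyword':<15}{'weight':>8}"]
    lines += [f"{word:<15}{weight:>8.3f}" for word, weight in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "Orbit data and orbit models: the orbit of the probe, probe data."
    print(format_table(top_keywords(sample, top_n=20)))
```

If pandas is available, the same list of pairs could be passed to a data frame constructor with two column names to get a display closer to the one the project describes.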

The project required only simple code, and it is very efficient.

Space Agency Data

The data used were several PDF files from the NASA Technical Reports Server (NTRS).

Hackathon Journey

It has been an excellent experience. I had never developed such an application before, and being able to see it through feels very satisfying. The main challenge was to create a project that is simple to develop in terms of code and efficient in terms of the desired outcome, and I believe it is exactly that: simple and efficient. The inspiration behind this challenge was the need for solutions to the time-consuming process of reading and understanding any sort of data, not just for researchers but also for people who are simply space enthusiasts. Furthermore, this project can be applied in any industry, which is another reason I chose this challenge.

References

  • Several PDF documents from NASA's NTRS records
  • https://towardsdatascience.com/how-to-extract-keywords-from-pdfs-and-arrange-in-order-of-their-weights-using-python-841556083341
  • https://rare-technologies.com/text-summarization-with-gensim/



Tags

#pdfdocumentanalysis