High-Level Project Summary
In this project we propose using Amazon Comprehend (AC), which uses natural language processing (NLP) to extract insights from the content of public NTRS documents. It develops insights by recognizing the entities, key phrases, language, and other common elements in a document. Using Amazon Comprehend, we can create web and mobile apps that understand the structure of documents and return improved search results, which will improve the accessibility and discoverability of public NTRS records. Amazon Comprehend supports asynchronous analysis jobs for large document sets. We can find documents about a particular subject using AC topic modeling, and we can specify the number of topics that AC should return from the document collection.
Link to Final Project
Link to Project "Demo"
Detailed Project Description
Amazon Comprehend uses a pre-trained model to gather insights about a document or a set of documents. This model is continuously trained on a large body of text, so there is no need for us to provide training data.
We can also use Amazon Comprehend to build our own custom models for custom classification and custom entity recognition.
Amazon Comprehend provides topic modeling using a built-in model. Topic modeling examines a corpus of documents and organizes the documents based on similar keywords within them. In our case, we can use the NASA Thesaurus as the keyword vocabulary for this step.
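As a minimal sketch of how such a topic modeling job might be started with the boto3 Python SDK (the bucket names, IAM role ARN, and number of topics below are placeholders, not values from this project):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Placeholder S3 locations and IAM role; replace with real resources.
response = comprehend.start_topics_detection_job(
    JobName="ntrs-topic-modeling",
    NumberOfTopics=25,  # how many topics Comprehend should return
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3AccessRole",
    InputDataConfig={
        "S3Uri": "s3://example-ntrs-bucket/input/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://example-ntrs-bucket/topics-output/"},
)
print(response["JobId"], response["JobStatus"])
```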
Amazon Comprehend provides synchronous and asynchronous document processing modes. Use synchronous mode for processing one document or a batch of up to 25 documents. Use an asynchronous job to process a large number of documents.
For database access we use the NASA STI Repository OpenAPI (Data Dictionary: Citation Search Response) to build the API that queries the document database. Here we use asynchronous jobs, since NASA's document repository is huge and a large number of documents must be processed to make them searchable.
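As a rough sketch of how citation records might be pulled from NTRS for analysis (the endpoint path, query parameter, and field names below are assumptions that should be verified against the STI OpenAPI data dictionary referenced later):

```python
import requests

# Assumed NTRS citation search endpoint; verify against the OpenAPI docs.
NTRS_SEARCH_URL = "https://ntrs.nasa.gov/api/citations/search"

def fetch_abstracts(query):
    """Fetch citation titles and abstracts from NTRS for a search term."""
    resp = requests.get(NTRS_SEARCH_URL, params={"q": query})
    resp.raise_for_status()
    payload = resp.json()
    # Each citation record is assumed to carry a title and an abstract.
    return [
        {"title": item.get("title", ""), "abstract": item.get("abstract", "")}
        for item in payload.get("results", [])
    ]

if __name__ == "__main__":
    for doc in fetch_abstracts("planetary entry vehicles"):
        print(doc["title"])
```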
Insights
Amazon Comprehend can analyze a document or set of documents to gather insights about it. Some of the insights that Amazon Comprehend develops about a document include:
- Entities – Amazon Comprehend returns a list of entities, such as people, places, and locations, identified in a document.
- Events – Amazon Comprehend detects specific types of events and related details.
- Key phrases – Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score.
- Personally identifiable information (PII) – Amazon Comprehend analyzes documents to detect personal data that identify an individual, such as an address, bank account number, or phone number.
- Dominant language – Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can identify 100 languages.
- Sentiment – Amazon Comprehend determines the dominant sentiment of a document. Sentiment can be positive, neutral, negative, or mixed.
- Targeted Sentiment – Amazon Comprehend determines the sentiment of specific entities mentioned in a document. The sentiment of each mention can be positive, neutral, negative, or mixed.
- Syntax analysis – Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun.
Real-time analysis using the built-in models
We can use the Amazon Comprehend console to run real-time analysis of a UTF-8 encoded text document. The document can be English or one of the other languages supported by Amazon Comprehend. The results are shown in the console so that you can review the analysis.
We can replace the sample text with our own text and then choose Analyze to get an analysis of that text. Below the text being analyzed, the Results pane shows more information about the text.
Using the Amazon Comprehend API
The Amazon Comprehend API supports operations to perform real-time (synchronous) analysis and operations to start and manage asynchronous analysis jobs.
We can use the Amazon Comprehend API operations directly, or we can use the AWS CLI or one of the SDKs.
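For example, a minimal synchronous call from Python with boto3 might look like the following sketch (the sample text is arbitrary):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "The Apollo 11 mission landed the first humans on the Moon in July 1969."

# Real-time (synchronous) insight operations on a single document.
language = comprehend.detect_dominant_language(Text=text)
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")

print(language["Languages"])
print([(e["Text"], e["Type"]) for e in entities["Entities"]])
print([kp["Text"] for kp in key_phrases["KeyPhrases"]])
```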
Async analysis for Amazon Comprehend insights
This section demonstrates how to use the Amazon Comprehend API to run asynchronous analysis jobs; the operations can be called through the AWS CLI, Java, or Python SDKs.
Steps on NASA public NTRS
The following section describes using the Amazon Comprehend API to run asynchronous operations for Amazon Comprehend insights.
Prerequisites
Documents must be UTF-8 encoded text files. We can submit our documents in two formats. The format we use depends on the type of documents we want to analyze, as described below.
- One document per file – Each file contains one input document. This is best for collections of large documents.
- One document per line – The input is one or more files. Each line in a file is considered a document. This is best for short documents, such as social media postings. Each line must end with a line feed (LF, \n), a carriage return (CR, \r), or both (CRLF, \r\n). You can't use the UTF-8 line separator (U+2028) to end a line.
When we start an analysis job, we need to specify the S3 location for our input data. The URI must be in the same AWS Region as the API endpoint that we are calling. The URI can point to a single file or it can be the prefix for a collection of data files.
We must grant Amazon Comprehend access to the Amazon S3 bucket that contains our document collection and output files.
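As an illustrative sketch (the bucket and key names are placeholders), documents pulled from NTRS could be written in one-document-per-line format and uploaded to S3 with boto3:

```python
import boto3

# Placeholder bucket and key names for illustration only.
BUCKET = "example-ntrs-bucket"
KEY = "input/abstracts.txt"

abstracts = [
    "Abstract of the first NTRS document.",
    "Abstract of the second NTRS document.",
]

# One document per line; each line ends with a line feed (\n).
body = "\n".join(a.replace("\n", " ") for a in abstracts) + "\n"

s3 = boto3.client("s3")
s3.put_object(Bucket=BUCKET, Key=KEY, Body=body.encode("utf-8"))
```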
Starting an analysis job
To submit an analysis job, use either the Amazon Comprehend console or the appropriate Start* operation:
- StartDominantLanguageDetectionJob — Start a job to detect the dominant language in each document in the collection. For more information about the dominant language in a document, see Dominant language.
- StartEntitiesDetectionJob — Start a job to detect entities in each document in the collection. For more information about entities, see Entities.
- StartKeyPhrasesDetectionJob — Start a job to detect key phrases in each document in the collection. For more information about key phrases, see Key phrases.
- StartPiiEntitiesDetectionJob — Start a job to detect personally identifiable information (PII) in each document in the collection. For more information about PII, see Detecting PII entities.
- StartSentimentDetectionJob — Start a job to detect the sentiment in each document in the collection. For more information about sentiments, see Sentiment.
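For instance, starting an entities detection job over the uploaded collection might look like this sketch (the S3 URIs and the IAM role ARN are placeholders):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.start_entities_detection_job(
    JobName="ntrs-entities",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3AccessRole",
    InputDataConfig={
        "S3Uri": "s3://example-ntrs-bucket/input/abstracts.txt",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://example-ntrs-bucket/entities-output/"},
)
job_id = response["JobId"]
print("Started job:", job_id)
```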
Monitoring analysis jobs
The Start* operation returns an ID that you can use to monitor the job's progress.
To monitor progress using the API, you use one of two operations, depending on whether you want to monitor the progress of an individual job or multiple jobs.
To monitor the progress of an individual analysis job, use the Describe* operations. You provide the job ID returned by the Start* operation. The response from the Describe* operation contains the JobStatus field with the job's status.
To monitor the progress of multiple analysis jobs, use the List* operations. List* operations return a list of jobs that you submitted to Amazon Comprehend. The response includes a JobStatus field for each job that tells you the status of the job.
If the status field is set to COMPLETED or FAILED, job processing has completed.
To get the status of individual jobs, use the Describe* operation for the analysis that you are performing.
- DescribeDominantLanguageDetectionJob
- DescribeEntitiesDetectionJob
- DescribeKeyPhrasesDetectionJob
- DescribePiiEntitiesDetectionJob
- DescribeSentimentDetectionJob
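A simple polling loop for an individual job could look like this sketch (reusing the job ID returned by the Start* call above):

```python
import time
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

def wait_for_entities_job(job_id, delay_seconds=60):
    """Poll an entities detection job until it completes or fails."""
    while True:
        desc = comprehend.describe_entities_detection_job(JobId=job_id)
        status = desc["EntitiesDetectionJobProperties"]["JobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return desc
        time.sleep(delay_seconds)
```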
To get the status of multiple jobs, use the List* operation for the analysis that you are performing.
- ListDominantLanguageDetectionJobs
- ListEntitiesDetectionJobs
- ListKeyPhrasesDetectionJobs
- ListPiiEntitiesDetectionJobs
- ListSentimentDetectionJobs
To restrict the results to jobs that match certain criteria, use the List* operations' Filter parameter. You can filter on the job name, the job status, and the date and time that the job was submitted. For more information, see the Filter parameter for each of the List* operations in the Amazon Comprehend API reference.
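For example, listing only completed entities detection jobs might look like this sketch:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Restrict the listing to jobs that have completed.
resp = comprehend.list_entities_detection_jobs(
    Filter={"JobStatus": "COMPLETED"},
    MaxResults=10,
)
for job in resp["EntitiesDetectionJobPropertiesList"]:
    print(job["JobName"], job["JobStatus"], job["SubmitTime"])
```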
Getting analysis results
After an analysis job has finished, use a Describe* operation to get the location of the results. If the job status is COMPLETED, the response includes an OutputDataConfig field that contains a field with the Amazon S3 location of the output file. The file, output.tar.gz, is a compressed archive that contains the results of the analysis.
If the status of a job is FAILED, the response contains a Message field that describes the reason that the analysis job didn't complete successfully.
To get the status of individual jobs, use the appropriate Describe* operation:
- DescribeDominantLanguageDetectionJob
- DescribeEntitiesDetectionJob
- DescribeKeyPhrasesDetectionJob
- DescribeSentimentDetectionJob
The results are returned in a single file, with one JSON structure for each document. Each response file also includes error messages for any job with the status field set to FAILED.
Each of the following sections shows examples of output for the two input formats.
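As a sketch, the archive could be downloaded and parsed line by line as follows (the bucket and key below are placeholders; the real location comes from OutputDataConfig in the Describe* response):

```python
import json
import tarfile
import boto3

s3 = boto3.client("s3")

# Placeholder location; use the S3 URI returned in OutputDataConfig.
s3.download_file("example-ntrs-bucket", "entities-output/output.tar.gz", "output.tar.gz")

with tarfile.open("output.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        extracted = tar.extractfile(member)
        if extracted is None:
            continue
        for line in extracted.read().decode("utf-8").splitlines():
            result = json.loads(line)  # one JSON structure per document
            print(result.get("File"), result.get("Line"))
```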
Getting dominant language detection results
The following is an example of an output file from an analysis that detected the dominant language. The format of the input is one document per line. For more information, see the DetectDominantLanguage operation.
{"File": "0_doc", "Languages": [{"LanguageCode": "en", "Score": 0.9514502286911011}, {"LanguageCode": "de", "Score": 0.02374090999364853}, {"LanguageCode": "nl", "Score": 0.003208699868991971}, "Line": 0}
{"File": "1_doc", "Languages": [{"LanguageCode": "en", "Score": 0.9822712540626526}, {"LanguageCode": "de", "Score": 0.002621392020955682}, {"LanguageCode": "es", "Score": 0.002386554144322872}], "Line": 1}
The following is an example of output from an analysis where the format of the input is one document per file:
{"File": "small_doc", "Languages": [{"LanguageCode": "en", "Score": 0.9728053212165833}, {"LanguageCode": "de", "Score": 0.007670710328966379}, {"LanguageCode": "es", "Score": 0.0028472368139773607}]}
{"File": "huge_doc", "Languages": [{"LanguageCode": "en", "Score": 0.984955906867981}, {"LanguageCode": "de", "Score": 0.0026436643674969673}, {"LanguageCode": "fr", "Score": 0.0014206881169229746}]}
Getting entity detection results
The following example shows an output file from an analysis that detected entities in documents. The format of the input is one document per line. The output contains two error messages, one for a document that is too long and one for a document that isn't in UTF-8 format.
{"File": "50_docs", "Line": 0, "Entities": [{"BeginOffset": 0, "EndOffset": 22, "Score": 0.9763959646224976, "Text": "Cluj-NapocaCluj-Napoca", "Type": "LOCATION"}"]}
{"File": "50_docs", "Line": 1, "Entities": [{"BeginOffset": 11, "EndOffset": 15, "Score": 0.9615424871444702, "Text": "Maat", "Type": "PERSON"}}]}
{"File": "50_docs", "Line": 2, "ErrorCode": "DOCUMENT_SIZE_EXCEEDED", "ErrorMessage": "Document size exceeds maximum size limit 102400 bytes."}
{"File": "50_docs", "Line": 3, "ErrorCode": "UNSUPPORTED_ENCODING", "ErrorMessage": "Document is not in UTF-8 format and all subsequent lines are ignored."}
The following is an example of output from an analysis where the format of the input is one document per file. The output contains two error messages, one for a document that is too long and one for a document that isn't in UTF-8 format.
{"File": "non_utf8.txt", "ErrorCode": "UNSUPPORTED_ENCODING", "ErrorMessage": "Document is not in UTF-8 format and all subsequent line are ignored."}
{"File": "small_doc", "Entities": [{"BeginOffset": 0, "EndOffset": 4, "Score": 0.645766019821167, "Text": "Maat", "Type": "PERSON"}]}
{"File": "huge_doc", "ErrorCode": "DOCUMENT_SIZE_EXCEEDED", "ErrorMessage": "Document size exceeds size limit 102400 bytes."}
Getting key phrase detection results
The following is an example of an output file from an analysis that detected key phrases in a document. The format of the input is one document per line.
{"File": "50_docs", "KeyPhrases": [{"BeginOffset": 0, "EndOffset": 22, "Score": 0.8948641419410706, "Text": "Cluj-NapocaCluj-Napoca"}, {"BeginOffset": 45, "EndOffset": 49, "Score": 0.9989854693412781, "Text": "Cluj"}], "Line": 0}
The following is an example of the output from an analysis where the format of the input is one document per file.
{"File": "1_doc", "KeyPhrases": [{"BeginOffset": 0, "EndOffset": 22, "Score": 0.8948641419410706, "Text": "Cluj-NapocaCluj-Napoca"}, {"BeginOffset": 45, "EndOffset": 49, "Score": 0.9989854693412781, "Text": "Cluj"}]}
Getting personally identifiable information (PII) detection results
The following is an example of an output file from an analysis job that detected PII entities in documents. The format of the input is one document per line.
{"Entities":[{"Type":"NAME","BeginOffset":40,"EndOffset":69,"Score":0.999995},{"Type":"ADDRESS","BeginOffset":247,"EndOffset":253,"Score":0.998828},{"Type":"BANK_ACCOUNT_NUMBER","BeginOffset":406,"EndOffset":411,"Score":0.693283}],"File":"doc.txt","Line":0}
{"Entities":[{"Type":"SSN","BeginOffset":1114,"EndOffset":1124,"Score":0.999999},{"Type":"EMAIL","BeginOffset":3742,"EndOffset":3775,"Score":0.999993},{"Type":"PIN","BeginOffset":4098,"EndOffset":4102,"Score":0.999995}],"File":"doc.txt","Line":1}
The following is an example of output from an analysis where the format of the input is one document per file.
{"Entities":[{"Type":"NAME","BeginOffset":40,"EndOffset":69,"Score":0.999995},{"Type":"ADDRESS","BeginOffset":247,"EndOffset":253,"Score":0.998828},{"Type":"BANK_ROUTING","BeginOffset":279,"EndOffset":289,"Score":0.999999}],"File":"doc.txt"}
Getting sentiment detection results
The following is an example of an output file from an analysis that detected the sentiment expressed in a document. It includes an error message because one document is too long. The format of the input is one document per line.
{"File": "50_docs", "Line": 0, "Sentiment": "NEUTRAL", "SentimentScore": {"Mixed": 0.002734508365392685, "Negative": 0.008935936726629734, "Neutral": 0.9841893315315247, "Positive": 0.004140198230743408}}
{"File": "50_docs", "Line": 1, "ErrorCode": "DOCUMENT_SIZE_EXCEEDED", "ErrorMessage": "Document size is exceeded maximum size limit 5120 bytes."}
{"File": "50_docs", "Line": 2, "Sentiment": "NEUTRAL", "SentimentScore": {"Mixed": 0.0023119584657251835, "Negative": 0.0029857370536774397, "Neutral": 0.9866572022438049, "Positive": 0.008045154623687267}}
The following is an example of the output from an analysis where the format of the input is one document per file.
{"File": "small_doc", "Sentiment": "NEUTRAL", "SentimentScore": {"Mixed": 0.0023450672160834074, "Negative": 0.0009663937962614, "Neutral": 0.9795311689376831, "Positive": 0.017157377675175667}}
{"File": "huge_doc", "ErrorCode": "DOCUMENT_SIZE_EXCEEDED", "ErrorMes
Space Agency Data
[1] NASA STI Program. (2012). NASA thesaurus [Data file]. Retrieved from https://sti.nasa.gov/nasa-thesaurus/
[2] NASA Technical Report Server
[3] NASA STI Repository OpenAPI: Data Dictionary - Citation Search Response
https://sti.nasa.gov/docs/OpenAPI-Data-Dictionary-062021.pdf
[4] NASA STI Repository OpenAPI
https://sti.nasa.gov/docs/STI_Open_API_Documentation_20210426.pdf
Since the problem statement clearly mentions that we need to work with NTRS data, we used the NASA Thesaurus for keyword search and the STI repository OpenAPI to build the API for the proposed web and mobile apps and for the database connections.
Hackathon Journey
For this problem statement, we planned a cloud-based solution using Amazon Comprehend. It was a good experience to learn about the datasets provided by NASA and the problems the agency is trying to solve.
References
[1] AWS Comprehend
https://aws.amazon.com/comprehend/
[2] NASA STI Program. (2012). NASA thesaurus [Data file]. Retrieved from https://sti.nasa.gov/nasa-thesaurus/
[3] NASA Technical Report Server
[4] NASA STI Repository OpenAPI: Data Dictionary - Citation Search Response
https://sti.nasa.gov/docs/OpenAPI-Data-Dictionary-062021.pdf
[5] NASA STI Repository OpenAPI
https://sti.nasa.gov/docs/STI_Open_API_Documentation_20210426.pdf
[6] NASA Scientific and Technical Information Scope and Subject Category Guide
https://ntrs.nasa.gov/api/citations/20000025197/downloads/20000025197.pdf
Tags
#NASA #NTRS #AWSComprehend #NLP #Cloud #DigitalSolution

