Can AWS Comprehend be used to split documents into sentences?


I have started trying AWS Comprehend. One thing I noticed is that how a document is divided into sentences affects the sentiment analysis and entity extraction results, especially when the document mixes sentences of different sentiment or some sentences are not capitalized. So correctly splitting the sentences is an important step. However, I can't find an API in Comprehend that splits a document into sentences. Is that because Comprehend doesn't have such a step? If it does, could someone point out how to obtain the splitting results?
BTW, I tried Stanford CoreNLP and Google Cloud Natural Language. They both make mistakes in some cases.

Here is what I did: I added '>>>' as a separator between reviews when I was scraping them, then I used this code:
import boto3

comprehend = boto3.client("comprehend")
# Split the scraped text on the '>>>' separator and analyze each review on its own
reviews = all_reviews_as_text.split('>>>')
responses = []
for review in reviews:
    response = comprehend.detect_sentiment(Text=review, LanguageCode="en")
    responses.append(response)
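If you want per-sentence results rather than per-review results, here is a minimal sketch of one way to do it. It assumes a very naive regex splitter (split_sentences is a hypothetical helper, not part of Comprehend, and it will break on abbreviations such as "e.g.") and uses boto3's batch_detect_sentiment, which accepts up to 25 texts per call. The client is passed in as a parameter so the function can be tried without AWS credentials.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_sentiments(client, text, language_code="en", batch_size=25):
    """Return (sentence, sentiment) pairs, sending sentences in batches.

    `client` is expected to behave like boto3.client("comprehend").
    """
    sentences = split_sentences(text)
    results = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        response = client.batch_detect_sentiment(
            TextList=batch, LanguageCode=language_code
        )
        # Each result carries the index of the text it refers to within the batch.
        for item in response["ResultList"]:
            results.append((batch[item["Index"]], item["Sentiment"]))
    return results


# Usage (needs AWS credentials):
#   import boto3
#   pairs = sentence_sentiments(boto3.client("comprehend"), review_text)
```

For anything beyond simple text you would probably swap split_sentences for a proper tokenizer such as nltk.sent_tokenize.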

Related

Google Natural Language Sentiment Analysis incorrect result

We have the Google Natural Language API integrated into our product for sentiment analysis (https://cloud.google.com/natural-language). One of our customers complained that when they write "BAD", it shows a positive sentiment.
On further investigation, we found that when the Google Natural Language sentiment analysis API is called with the input BAD or Bad (note the all caps or initial capital), it identifies the text as an entity (a location or consumer good) and returns a positive result, whereas "bad" in all lowercase returns negative.
Has anyone faced a similar problem? How did you solve it?
One obvious fix is to convert the text to lowercase, but that may break other use cases (for example, entities may no longer be recognized in lowercase text). Another approach, which we are building, is to check our own dictionary of words with sentiments before calling the Google APIs, but that doesn't solve the underlying problem, which may occur with any other text.
Any input will help us. Thank you!

The NLP API uses an underlying model that is neural in nature. Its knowledge comes from training on real-world text. It is normal to get different results for different capitalizations, as they can relate to different uses of the same word, e.g. Mike (person), mike (microphone, slang), MIKE (military alphabet entry).
The second key aspect is that the model is tuned for, and meant to be used on, larger pieces of text rather than single words, so good results cannot be expected in this case.

Is there a way to explicitly set start and end of sentences on Google Natural Language API sentiment analysis?

I am using the Google Natural Language API for sentiment analysis. I have an array of text strings that I concatenate and send to Google to get a sentiment value for each sentence, but Google has its own idea of where the sentences start and end, so I get scrambled sentiment results and a different number of sentiment scores than sentences sent.
If you could only set a flag like <sentence> </sentence> for each phrase you would like to treat as a separate sentence - that would be great, but docs don't have info about it.
P.S.
I am doing it this way (sending one chunk rather than making a separate API request for each sentence) because I have thousands of sentences and latency is important.

With Google Natural Language API, a way to mark separate sentences that seems to work (although not officially documented) is to set the document type to HTML and end each sentence with .<br> (full-stop and HTML line break tag) to hint an end-of-sentence to the parser.
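A rough sketch of how the request body for that trick could be built. The function name build_html_document is made up for illustration, and the dict follows the shape of the REST analyzeSentiment request (a document with type and content, plus encodingType). Two assumptions of mine: sentences are HTML-escaped so stray < or & characters don't confuse the parser, and a full stop is only added when the sentence doesn't already end in ., ! or ?. Since the <br> hint is undocumented, check the sentence count that comes back.

```python
import html


def build_html_document(sentences):
    """Build an analyzeSentiment request body with one '<br>'-terminated line per sentence."""
    marked = []
    for sentence in sentences:
        text = html.escape(sentence.strip())
        # Make sure every sentence ends with terminal punctuation before the <br> hint.
        if not text.endswith((".", "!", "?")):
            text += "."
        marked.append(text + "<br>")
    return {
        "document": {"type": "HTML", "content": "\n".join(marked)},
        "encodingType": "UTF8",
    }


body = build_html_document(["I love it", "It broke after a week."])
print(body["document"]["content"])
```

If the number of sentences in the response differs from len(sentences), the hint was not honored for that input and the scores should not be matched up by position.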

Is it possible to use the multiclass classifier of aws to recognize the given place of the text?

I'm using AWS SageMaker, and I want to create something that, given a text description, recognizes the place being described. Is it possible?

If there are no other classes besides the text that you would like your model to identify, you may not need a multiclass classifier.
You could train your own text detection model using Amazon SageMaker, and train using a dataset with labelled examples using the Object Detection Algorithm, but this becomes rather involved for a problem that has existing solutions available.
If the appearance of the text you're trying to detect is identical each time, your problem space gets reduced from trying to interpret variable text to simply having to gather enough examples and perform object detection for the "pattern" your text forms visually. Note that if the text were to appear in different fonts or styles, the generic object detection method would not interpret it dynamically, and an OCR-based solution would likely be necessary.
More broadly, for text identification in images on AWS, you have quite a few options:
Amazon Rekognition has a DetectText method that will enable you to easily find text within an image. If it's a small or simple phrase, with alphanumeric characters, this should work very well for your use case.
Amazon Textract will help you perform OCR (optical character recognition) while retaining the structure of the source. This is great for documents and tables, but doesn't sound like it may be applicable to your use case.
The AWS marketplace will also have hosted options available from third party vendors. One example of this for text region identification is this one from RocketML.
There are also some great open source tools I'd recommend looking into: OpenCV for ascertaining the text bounding boxes, and Tesseract for OCR and text extraction. This blog post does a good job walking through the process of using them together.
Any of these will help to solve your problem of performing OCR/text identification on AWS, but the best choice comes down to what your current and future needs are, and how quickly you're looking to implement the feature.

Your question is not clear regarding the data that you have or the problem that you want to solve.
If you have a text that includes a place name (for example, "I visited Seattle and enjoyed the fish market"), you can use Amazon Comprehend Named Entity Extraction (NEE), which covers places ("Seattle" in the above example):
{
    "Entities": [
        {
            "Score": 0.9857407212257385,
            "Type": "LOCATION",
            "Text": "Seattle",
            "BeginOffset": 10,
            "EndOffset": 17
        }
    ]
}
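To pull just the places out of a response shaped like the one above, something along these lines should work. extract_locations and the min_score threshold are my own illustrative choices, not part of the Comprehend API; the response itself would come from boto3's detect_entities call.

```python
def extract_locations(response, min_score=0.9):
    """Return the text of every LOCATION entity at or above min_score."""
    return [
        entity["Text"]
        for entity in response.get("Entities", [])
        if entity["Type"] == "LOCATION" and entity["Score"] >= min_score
    ]


# Sample response copied from the answer above.
sample_response = {
    "Entities": [
        {
            "Score": 0.9857407212257385,
            "Type": "LOCATION",
            "Text": "Seattle",
            "BeginOffset": 10,
            "EndOffset": 17,
        }
    ]
}
print(extract_locations(sample_response))

# With real credentials the response would come from:
#   import boto3
#   response = boto3.client("comprehend").detect_entities(
#       Text="I visited Seattle and enjoyed the fish market", LanguageCode="en")
```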
If the description is more general and you want to classify if the description is of a hotel, a restaurant, a theme park, a concert/show, or similar types of places, you can either use the Custom classification in Comprehend or the Neural Topic Model in SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html). You will need some examples of the classes and documents/sentences that are used for the model training.

Text document classification with AWS Machine Learning

I've been looking at using AWS Machine Learning to implement a categorizer for my project. I have something on the order of 40,000 documents that have several text-only features. For example: Name (< 200 chars) and Description (potentially hundreds or thousands of words).
In a nutshell, I'm looking to assign categories (0 or more) to each document based on its content.
I've read through the AWS ML tutorial and checked out a few other sources but the available material seems to deal with feature fields that are numeric, boolean, datetime, or otherwise non-textual.
Is AWS Machine Learning capable of performing multi-class categorization on documents based primarily (or possibly only) on text fields? And if so, is there any reference material available for this particular avenue?

Mainly, you don't need "text fields". First you have to create a vector space model (VSM) from your corpus (texts), then you can weight your VSM with tf-idf, and you can use numeric fields.
Are you sure that you want to apply AWS ML to train on a corpus of only 40,000 documents?
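To make the vector space model idea concrete, here is a small standard-library sketch. It assumes the plainest tf-idf variant (term frequency = count / document length, idf = log(N / document frequency)) and a crude lowercase alphanumeric tokenizer; real libraries typically add smoothing and normalization. The function names are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors), one tf-idf weighted numeric vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["red apple", "red car"])
print(vocabulary)
print(vectors)
```

Each document becomes a row of numbers, one column per vocabulary term, which is the kind of numeric feature a classifier can consume. Terms that occur in every document (like "red" here) get weight zero.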

Simple Phrases recognition

I am looking to recognize simple phrases, like what happens in Google Calendar,
but rather than parsing calendar entries I have to parse sentences related to finance, accounting and to-do lists. So, for example, I have to parse sentences like
I spent 50 dollars on food yesterday
I need to mark and separate the info as Reason: 'food', Cost: 50 and Time: <Yesterday's Date>
My question is: do I go in for full-fledged Natural Language Processing like that
given in these questions, and use something like GATE?
Machine Learning and Natural Language Processing
Natural Language Processing in Ruby
Ideas for Natural Language Processing project?
https://stackoverflow.com/a/3058063/492561
Or is it better to write simple grammars using something like ANTLR and try to recognize the phrases that way?
Or should I go really low-level and just define a syntax and use regular expressions?
Time is a constraint: I have about 45 to 50 days, and I don't know how to use ANTLR or NLP libraries like GATE.
Preferred languages: Python, Java, Ruby (not in any particular order)
PS: This is not homework, so please don't tag it as such.
PPS: Please try to give an answer with facts on why a particular method is better.
Even if a particular method may not fit inside the time constraint, please feel free to share it because it might benefit someone else.

You could indeed look at named entity recognition. From your question I understand your domain is pretty well defined, so you can identify the (few?) entities (dates, currencies, money amounts, time expressions, etc.) that are relevant for you. If the phrases are very simple, you could go with a rule-based approach; otherwise it's likely to get too complex too soon.
Just to get yourself up and running in a few seconds, http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code is an extremely nice example of what you could do. Of course I would not expect high accuracy from just 6 lines of Python, but it should give you an idea of how it works:
1>>> import nltk
2>>> def extract_entities(text):
3...     for sent in nltk.sent_tokenize(text):
4...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
5...             if hasattr(chunk, 'label'):
6...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
The core idea is on lines 3 and 4: line 3 splits the text into sentences and iterates over them.
Line 4 splits each sentence into tokens, runs part-of-speech tagging on them, and then feeds the POS-tagged sentence to the named entity recognition algorithm. That's the very basic pipeline.
In general, nltk is an extremely beautiful piece of software, and very well documented: I would look at it. Other answers contain very useful links.
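For comparison, the rule-based route mentioned above for very simple phrases could look roughly like this. It is only a sketch under strong assumptions: the sentence must follow the exact template "I spent <amount> dollars on <reason> [today|yesterday]", and the pattern, the parse_expense name and the output keys are all made up to match the example in the question.

```python
import re
from datetime import date, timedelta

# Template (an assumption): "I spent <amount> dollars on <reason> [today|yesterday]"
EXPENSE_PATTERN = re.compile(
    r"I spent (?P<cost>\d+(?:\.\d+)?) dollars on (?P<reason>.+?)"
    r"(?: (?P<when>today|yesterday))?\.?$",
    re.IGNORECASE,
)


def parse_expense(sentence, today=None):
    """Return a dict with Reason, Cost and Time, or None if the sentence doesn't fit."""
    match = EXPENSE_PATTERN.match(sentence.strip())
    if match is None:
        return None
    today = today or date.today()
    # A missing time expression is treated as "today".
    when = (match.group("when") or "today").lower()
    day = today - timedelta(days=1) if when == "yesterday" else today
    return {
        "Reason": match.group("reason"),
        "Cost": float(match.group("cost")),
        "Time": day,
    }


print(parse_expense("I spent 50 dollars on food yesterday"))
```

This works while the phrasing stays fixed, but every new wording ("paid", "bought", "last Friday") needs another rule, which is where the complexity mentioned above starts to bite.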

Your task is a type of Information Extraction task, specifically relation/fact extraction, preceded by Named Entity Recognition.
Take a look at the following frameworks for Java/Python:
GExp
GATE
NLTK. Python. Book chapter on Information Extraction.
UIMA. (used for IBM's Watson.)
