Google Natural Language Sentiment Analysis incorrect result - google-cloud-platform

We have Google Natural Language AI integrated into our product for sentiment analysis (https://cloud.google.com/natural-language). One of our customers complained that when they write "BAD", it shows a positive sentiment.
On further investigation, we found that when the Google Natural Language sentiment analysis API is called with the input "BAD" or "Bad" (note the all-caps or initial-capital spelling), it identifies the text as an entity (a location or consumer good) and returns a positive sentiment, whereas "bad" in all lowercase returns a negative one.
Has anyone faced a similar problem? How did you solve it?
One obvious fix is to convert the text to lowercase, but that may break other use cases (for example, where entities are no longer recognized in lowercased text). Another approach we are building is to check our own dictionary of sentiment-bearing words before calling the Google APIs, but that doesn't address the underlying problem, which may occur with any other text.
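For illustration, a minimal sketch of the lowercasing workaround using the google-cloud-language Python client; the rule of only lowercasing single-word inputs is a hypothetical heuristic, not something the API prescribes:

    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()

    def sentiment_score(text: str) -> float:
        # Hypothetical heuristic: lowercase lone words so "BAD" is not read as
        # an entity name, but leave longer text untouched so real entities survive.
        if len(text.split()) == 1:
            text = text.lower()
        document = language_v1.Document(
            content=text, type_=language_v1.Document.Type.PLAIN_TEXT
        )
        response = client.analyze_sentiment(request={"document": document})
        return response.document_sentiment.score  # -1.0 (negative) .. 1.0 (positive)

    print(sentiment_score("BAD"))  # analyzed as "bad"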
Inputs will help us. Thank you!

The NLP API uses an underlying model that is neural in nature. Its knowledge comes from training on real-world text. It is normal to get different results for different capitalizations, since they can correspond to different uses of the same word form, e.g. Mike (a person), mike (slang for microphone), MIKE (the military-alphabet entry).
The second key aspect is that the model is tuned for, and meant to be used on, larger pieces of text rather than single words, so good results cannot be expected in this case.

Related

Aligning sentences to a corpus and finding mismatches

The ideal goal is to correct the output of a speech-to-text model against a reference corpus (the actual text). I don't mind using any off-the-shelf tool, either from the NLP space or Elasticsearch.
I have a reference corpus like the following:
It is a reliance that has led to a cycle of addiction that has
destroyed lives it is a cycle that makes you sick when you try to stop
and potentially takes your life when you don't and beyond its physical
effects this cycle of addiction also includes constant contact with
the criminal justice system and not just a cycle of arrests release
and violation.
In fact it's much longer ...
On the other hand, I have a set of sentences recognized by a speech-to-text model in a CSV file:
1, is a cycle that makes you dick when
2, try two stops and essentially hates your
3, posses activated
4, lives when who don't and beyond
As you can see, because the speech-to-text model is not perfect, there are errors. For example:
1) Compared with the reference corpus, some subsentences are misrecognized (e.g. "dick" instead of "sick" in sentence number 1).
2) Some sentences do not match the corpus at all, e.g. number 3.
3) Putting the sentences together does not cover the whole paragraph.
So basically I wonder what this task is called in NLP, so I can do better googling, and I would appreciate it if you could name specific functions or examples I can leverage, e.g. in spaCy, NLTK, or any other tool.
edit: I already have experience with NLP (Coursera certificate), so I am looking for a concrete answer and/or example rather than a scientific paper. This is not a general error correction task or next-word prediction based on sequential models.
The most suitable NLP technique for this is probably language models.
They predict the likelihood of a word given the previous words (or surrounding words).
They can be used for error correction.
You may find the following useful:
article
page
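For the alignment-and-mismatch part specifically, a minimal sketch using Python's standard difflib; the sliding-window comparison is just one simple option, with the corpus and hypotheses taken from the question:

    import difflib

    reference = ("it is a cycle that makes you sick when you try to stop "
                 "and potentially takes your life when you don't and beyond")
    reference_words = reference.split()

    def align(hypothesis):
        # Slide a window of the hypothesis length over the reference and keep
        # the most similar span; low scores flag sentences with no counterpart.
        hyp_words = hypothesis.split()
        n = len(hyp_words)
        best_ratio, best_span = 0.0, None
        for start in range(len(reference_words) - n + 1):
            span = reference_words[start:start + n]
            ratio = difflib.SequenceMatcher(None, hyp_words, span).ratio()
            if ratio > best_ratio:
                best_ratio, best_span = ratio, span
        return best_ratio, " ".join(best_span) if best_span else None

    for hyp in ["is a cycle that makes you dick when", "posses activated"]:
        ratio, match = align(hyp)
        print(f"{hyp!r} -> {match!r} (similarity {ratio:.2f})")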
Why do you think this is "not a general error correction task"? I think it is. You could look into 'grammar correction' or 'sentence validity'.
Sentence validity is discussed at How to check whether a sentence is correct (simple grammar check in Python)?. The listed tools also provide suggestions, and could therefore be useful for you.
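As a concrete starting point, a minimal sketch using the third-party language_tool_python package (one of several grammar-checking options; the package choice is mine, not from the linked question):

    import language_tool_python

    tool = language_tool_python.LanguageTool("en-US")
    # Each match carries a rule id, an explanation, and replacement suggestions
    for match in tool.check("try two stops and essentially hates your"):
        print(match.ruleId, match.message, match.replacements[:3])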

AWS Lex matches the wrong intent despite entering exact utterance

I have been having this problem in a variety of different cases.
I'll share an example of one.
I have a few FAQ intents.
One answers "What is Named Entity Recognition"
These are its utterances:
Tell me about Named Entity Recognition
Tell me about NER
What is NER
What do you mean by Named Entity Recognition
What is Named Entity Recognition
and the other answers "What is Optical Character Recognition?"
These are its utterances:
OCR
What do you mean by OCR
Can you tell me what OCR is
Tell about OCR
What is optical character recognition
What is OCR
When I enter "What is ocr?" it works as expected and shows the answer for OCR.
But when I instead enter OCR in all caps in the same exact question ("What is OCR?"), it switches to the NER intent and shows me the answer for "What is NER?"
Can anyone explain why it does this? And, more importantly, what do I do to make it work as expected?
Do keep in mind that this is just one example. I have encountered this in many other similar scenarios too.
There was also a case where the intent utterances didn't seem to match even remotely. But it still switched to the wrong intent.
As per the Lex and Alexa documentation, acronyms in custom slot types should be written as either a single word in all caps (OCR) or lowercase letters separated by periods and spaces (o. c. r.).
Along the bottom of the table, you can see the examples for Fire HD7, Fire h. d., Fire HD, and Fire HD 7 that demonstrate this -- both of the valid options will resolve to the same Slot Value Output.
Assuming utterances are set up in accordance with best practice, if you're providing vocal input, it's important to note that utterances are sensitive to things such as inflection in your voice, pacing/space between words, accents, and more.
As for immediate steps to improve accuracy, you can always try breaking up your intents further, where instead of having two intents, you have one for each permutation of custom slot value (NER, Named Entity Recognition, OCR, and Optical Character Recognition). It's easy for humans to understand that the first letter of a phrase maps to the letters in an acronym, but when it comes to teaching a chatbot to understand that these phrases are synonymous, that's a bit harder.
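If you manage the bot programmatically, here is a minimal sketch of a synonym-bearing custom slot type via boto3's Lex Model Building client; the slot-type name and value lists are illustrative:

    import boto3

    lex = boto3.client("lex-models")

    # Each canonical value lists its variants as synonyms; TOP_RESOLUTION makes
    # Lex return the canonical value whenever a synonym is matched.
    lex.put_slot_type(
        name="NlpTerms",
        valueSelectionStrategy="TOP_RESOLUTION",
        enumerationValues=[
            {"value": "OCR", "synonyms": ["o. c. r.", "optical character recognition"]},
            {"value": "NER", "synonyms": ["n. e. r.", "named entity recognition"]},
        ],
    )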
In the end I didn't find a proper solution and used some really inelegant workarounds instead, but hey, as long as it works :D
The workaround I used was to make a "what" intent, a "how" intent, etc., keeping the sentence structure intact:
For example :
IntentName => "Bot_HowTo"
Utterances =>
"What is {slotName}"
"What are {slotName}"
"Meaning of {slotName}"
Slots =>
name : "slotName"
values (using synonyms) :
{OCR => "ocr", "Optical Character Recognition"}
{NER => "ner", "Named Entity Recognition"}
This greatly reduces the number of intents needed and also eliminates a lot of the ambiguity. All questions in a "what" (or similar) format go straight to that intent.
And then in my codehook I see which synonym was matched and provide the answer accordingly.
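A minimal sketch of such a code hook as an AWS Lambda function (Lex V1 event shape; the answer table is illustrative):

    # Hypothetical answer lookup keyed by the slot's resolved (canonical) value
    ANSWERS = {
        "ocr": "Optical Character Recognition converts images of text into machine-readable text.",
        "ner": "Named Entity Recognition locates and classifies entities in text.",
    }

    def lambda_handler(event, context):
        # Lex V1 delivers the matched intent and its filled slots in the event
        slot_value = (event.get("currentIntent", {})
                           .get("slots", {})
                           .get("slotName") or "").lower()
        answer = ANSWERS.get(slot_value, "Sorry, I don't know that term yet.")
        return {
            "dialogAction": {
                "type": "Close",
                "fulfillmentState": "Fulfilled",
                "message": {"contentType": "PlainText", "content": answer},
            }
        }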

How to optimize number recognition with Google Vision API?

I am experiencing strange behavior when using this Vision ML API.
I am capturing images from a live stream and have tens of thousands of key frames cropped for detection of a single digit against a clear background. However, the performance of the Google ML Vision API is very unreliable for such a simple task. I am wondering why that might be and what I can do about it.
I have some hypotheses:
The language detection fails and leads to an empty response, which I tend to get often (I have double-checked that the empty responses are not caused by authentication problems).
The background somehow makes the task hard.
The numbers are too small; the images are 35x35 pixels and the character strokes are roughly 4 pixels wide.
The live stream introduces artifacts that are invisible to the eye but very disturbing for the OCR.
Google doesn't want us to use the Vision API for these kinds of problems, and we should instead use a model pre-trained on MNIST to recognize the numbers.
I have used both detect-text and detect-document; the latter is a bit more accurate.
I came up with one solution which seems to work quite well.
I added text around the numbers (to give them context) and then stripped the added text with a regexp, keeping only the numbers. It seems the API is not meant for isolated character recognition; it likes to have some context words around the numbers to increase its confidence. This solution works quite well for my use case and probably for many others, since adding context text around the numbers is trivial ("My shoe number is: X"). Adding text to images is a trivial task with ImageMagick.
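A minimal sketch of that context-word trick; the answer suggests ImageMagick, but the same preprocessing is easy with Pillow (file paths and the prompt string are illustrative):

    import re
    from PIL import Image, ImageDraw

    PROMPT = "My shoe number is:"

    def add_context(in_path, out_path):
        digit = Image.open(in_path)  # e.g. a 35x35 crop of a single digit
        canvas = Image.new("RGB", (digit.width + 130, digit.height + 10), "white")
        ImageDraw.Draw(canvas).text((5, canvas.height // 2 - 5), PROMPT, fill="black")
        canvas.paste(digit, (125, 5))
        canvas.save(out_path)

    def extract_number(ocr_text):
        # Strip the injected context words and keep only the digits
        match = re.search(re.escape(PROMPT) + r"\s*(\d+)", ocr_text)
        return match.group(1) if match else None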

Testing if a string contains one of several thousand substrings

I'm going to be running through live Twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to match against, what's the best way to select the relevant tweets? This project is in its infancy, so I'm open to looking into any solution (i.e. language agnostic). Any help would be greatly appreciated.
Update: I'd be curious if anyone has any insight into how the Yahoo! Placemaker API solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.
You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection
If you use compiled regular expressions, it should be pretty fast, especially if you put lots of titles into one expression.
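A minimal sketch of the single-expression idea; the sample titles are placeholders for the ~7000 real ones:

    import re

    titles = ["The Matrix", "Inception", "Up"]  # ~7000 in practice
    # Longest-first so overlapping titles match greedily; \b keeps word boundaries
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in sorted(titles, key=len, reverse=True)) + r")\b",
        re.IGNORECASE,
    )

    def titles_in(tweet):
        return {m.group(1) for m in pattern.finditer(tweet)}

    print(titles_in("Inception is better than The Matrix"))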
Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.
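A minimal sketch of that parse-then-lookup approach, assuming tweets follow a known pattern; the "is better than" pattern is borrowed from the answer below:

    import re

    titles = {"the matrix", "inception"}  # ~7000 lowercased entries in practice
    pattern = re.compile(r"(.+?) is better than (.+)", re.IGNORECASE)

    def matched_titles(tweet):
        m = pattern.search(tweet)
        if not m:
            return []
        # Normalize the captured phrases, then test set membership in O(1)
        return [g.strip().lower() for g in m.groups()
                if g.strip().lower() in titles]

    print(matched_titles("Inception is better than The Matrix"))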
Working off what erickson suggested, the most feasible approach is to search for the fixed phrase ("is better than" in your example) and then check the captured text against the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.
For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
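A minimal multi-pattern Rabin-Karp sketch: patterns are grouped by length and one rolling hash per distinct length slides over the text (matching here is case-sensitive, so normalize beforehand):

    from collections import defaultdict

    BASE, MOD = 256, (1 << 61) - 1

    def find_any(text, patterns):
        # length -> hash -> patterns with that hash (collisions are re-verified)
        by_len = defaultdict(dict)
        for p in patterns:
            h = 0
            for ch in p:
                h = (h * BASE + ord(ch)) % MOD
            by_len[len(p)].setdefault(h, set()).add(p)
        hits = set()
        for length, table in by_len.items():
            if length > len(text):
                continue
            high = pow(BASE, length - 1, MOD)  # weight of the leading character
            h = 0
            for ch in text[:length]:
                h = (h * BASE + ord(ch)) % MOD
            for i in range(len(text) - length + 1):
                if h in table and text[i:i + length] in table[h]:
                    hits.add(text[i:i + length])
                if i + length < len(text):  # roll the hash one character forward
                    h = ((h - ord(text[i]) * high) * BASE + ord(text[i + length])) % MOD
        return hits

    print(find_any("inception is better than the matrix", {"inception", "the matrix"}))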

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and gargantuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data? (Unless you are willing to pay thousands of dollars for proprietary data sets.)
You could build a geocoding API on top of OpenStreetMap (they publish their data in dumps on a regular basis), I guess, but their coverage was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, or they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users, early adopters of GPS technology, who were more than willing to enter street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance as the crow flies; a sketch of the standard haversine formula follows below. From there, try to figure out how to include roads. (My guess is you would need a data point for each and every curve, holding the geocoded location of the curve and the angle of the road on north/south and east/west vectors. You'd probably need to take incline into account too, to get accurate road measurements.)
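A sketch of that crow-flies math, the standard haversine great-circle formula:

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        r = 6371.0  # mean Earth radius in kilometres
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = (math.sin(dphi / 2) ** 2
             + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(a))

    print(haversine_km(51.5074, -0.1278, 48.8566, 2.3522))  # London-Paris, ~344 km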
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
Here is what I would do:
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.
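A minimal sketch of that first prototype as a Flask endpoint; the in-memory lookup table stands in for the denormalized database:

    from flask import Flask, abort

    app = Flask(__name__)

    # Illustrative entries; in practice, loaded from the raw data dump above
    ZIP_TO_LATLON = {"10001": (40.7506, -73.9972), "94103": (37.7726, -122.4099)}

    @app.route("/geocode/<zip_code>")
    def geocode(zip_code):
        if zip_code not in ZIP_TO_LATLON:
            abort(404)
        lat, lon = ZIP_TO_LATLON[zip_code]
        return f"{lat},{lon}", 200, {"Content-Type": "text/plain"}

    if __name__ == "__main__":
        app.run()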