Google Cloud Vision API - TEXT_DETECTION - google-cloud-platform

When I try to recognize text in an image, such as the Italian word "Perchè", the Vision API returns "Perche" (a plain "e" instead of the accented "è").
I don't want to use languageHints to get better results, because I have to run OCR across several different languages.
What is the problem here?

This is a known issue with the Cloud Vision API when you don't use language hints.
You can see the actual bug report here.
It is in the Accepted state, but there has been radio silence on it for the last few months. It may take some time for a fix to roll out.

Related

Google Dialogflow - Asking a customer for a unique number

I am trying to ask a customer for a unique number. I have tested this using the test console and it's coming up with multiple variations without giving a value.
The numbers are a mix of 4/6/8 digits. I want a customer to be able to say 'my plan number is 12345678' and for me to be able to get that value and work with it.
What parameters or system entities should I be using to get a result? Often it will miss a digit, insert a hyphen, and so on.
P.S. this is using voice only, not text.
There's a feature called auto speech adaptation that can help in this specific case. After enabling it, see point 5 in "Example speech recognition improvements". It explains how to use auto speech adaptation with regexp entities to capture digit sequences, and it gives a regular expression you can use. It also recommends the @sys.number-sequence entity.
The enhanced speech models can also help with the number identification accuracy, but bear in mind that it is still a beta feature.
For reference you can also check the article Improving speech recognition for contact centers in the Google Cloud Blog.
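As a rough illustration of the digit-sequence idea, here is a minimal Python sketch that pulls a 4, 6 or 8 digit number out of a transcript even when the recognizer inserts spaces or hyphens. The helper name extract_plan_number and the pattern are assumptions made for this example. This is not the regular expression from the Dialogflow documentation, and it runs on the transcript after recognition rather than inside the agent.

```python
import re

# Hypothetical post-processing helper, not part of Dialogflow.
# A digit run may contain single spaces or hyphens between digits,
# which is how transcripts sometimes render spoken numbers.
_DIGIT_RUN = re.compile(r"\d(?:[ -]?\d)*")


def extract_plan_number(transcript, allowed_lengths=(4, 6, 8)):
    """Return the first digit run with an allowed length, or None."""
    for match in _DIGIT_RUN.finditer(transcript):
        # Strip the separators so only the digits remain.
        digits = re.sub(r"[ -]", "", match.group())
        if len(digits) in allowed_lengths:
            return digits
    return None
```

Calling extract_plan_number("my plan number is 1234-56 78") would give back the digits without separators, which can then be validated against your own records.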

How do I tell Google Vision OCR Api that image is numerical numbers only?

I am using the Google Vision OCR API to convert a large amount of handwriting into keyed text. Large segments of the data are numerical only, but sometimes Google Vision returns alphabetic characters instead. The handwriting is often really sloppy, so I am not shocked that this is happening.
Is there any way to tell the API that the image only includes numbers and no letters?
I have read the documentation on Google Vision, but have not found anything that would achieve this.
gcloud ml vision detect-document "https://myimage.png"
I am hoping that if Google Vision knows in advance that the image only contains numbers, then perhaps it will make it easier and more likely to recognize the correct numbers.

How to disable auto correction for Google Cloud Speech to Text API

Is there a way to disable auto correction for the Google Cloud Speech to Text API? It is important for me to get an accurate transcript of the user's speech, including any errors they make, rather than a corrected version.
It is difficult to distinguish between mistakes made by the speaker (grammar or pronunciation errors) in the audio content and mistakes made by the Speech API. However, you can inspect the different versions of the text output that the model predicts behind the scenes by using the maxAlternatives property of the API.
You have not provided an example of your use case, but if you already expect unusual pronunciations or acronyms, you can add phrase hints to the request (the phrases list in speechContexts).
Please provide further details if this doesn't answer your question.
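To make the two settings concrete, here is a minimal Python sketch that builds a JSON body for the REST speech:recognize method with maxAlternatives and a speechContexts phrase list. The helper name, the bucket URI, and the encoding, sample rate and language values are placeholders assumed for illustration. The sketch only builds the body and does not send it.

```python
import json


def build_recognize_request(audio_uri, phrases, max_alternatives=5):
    """Build a request body dict for speech:recognize (hypothetical helper)."""
    return {
        "config": {
            # Placeholder audio settings; they must match the real audio.
            "encoding": "FLAC",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            # Ask for several candidate transcripts instead of only the best.
            "maxAlternatives": max_alternatives,
            # Phrase hints for words the recognizer might otherwise miss.
            "speechContexts": [{"phrases": list(phrases)}],
        },
        "audio": {"uri": audio_uri},
    }


# Example with a made-up bucket path and hint list.
body = build_recognize_request("gs://my-bucket/audio.flac", ["OCR", "SAPI"])
print(json.dumps(body, indent=2))
```

Comparing the returned alternatives side by side is one way to spot where the top transcript may have smoothed over what the speaker actually said.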

Regarding Google Speech API

I am using the Google Speech API to convert voice to text. It works fine when I use my own recorded voice,
but the result is poor with a computer-generated female voice, like a cell phone network operator's announcement.
Has anyone faced this kind of problem, or does anyone have a solution? Please help me solve this issue.
Thank you.
Did you set the proper sampling rate (sampleRateHertz) for the speech?
Did it return something close but not correct, or did it fail completely and return no speech at all? If nothing was converted, verify that you sent the correct information to the Speech API.
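One quick way to check the sampling rate, assuming the audio is a WAV file, is to read it from the file header and compare it with the value sent as sampleRateHertz. The sketch below uses Python's standard wave module. The helper name and the generated one-second test clip are assumptions for illustration.

```python
import io
import wave


def wav_sample_rate(source):
    """Return the sample rate in Hz of a WAV file (path or file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


# Build a one-second silent 8 kHz, 16-bit mono clip in memory to try it out.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)
buffer.seek(0)

print(wav_sample_rate(buffer))
```

If the number printed for your real file differs from the sampleRateHertz in the request, that mismatch is a likely cause of poor results.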

Getting the amplitude(or rms voltage) of audio signal captured in C++ by wavin lib.?

I am working on a very basic robotics project and wish to implement voice recognition in it.
I know it's a complex thing, but I wish to do it for only 3 or 4 commands (or words).
I know that using wavin I can record audio, but I wish to do real-time amplitude analysis on the audio signal. How can that be done? The wave will be input as 8-bit mono.
I have thought of dividing the signal into sets of a specific duration, further dividing each into smaller subsets, getting the average RMS value over each subset, summing them up, and then seeing how different they are from the stored reference signal. If the error is below an accepted value for all (or most) of the sets, then print the word.
How can this be implemented?
If you can offer any other suggestions, that would be great.
Thanks in advance.
There is no simple way to recognize words, because they are basically sequences of phonemes that can vary in time and frequency.
Classical isolated word recognition systems use MFCCs (mel-frequency cepstral coefficients) of the signal as input data and try to recognize patterns using HMM (hidden Markov model) or DTW (dynamic time warping) algorithms.
You will also need a silence detection module if you don't want a record button.
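A very simple form of silence detection is an energy threshold on short frames. Here is a minimal Python sketch for 8-bit unsigned mono PCM, the format mentioned in the question. The frame length of 80 samples (10 ms at an assumed 8 kHz rate) and the threshold value are assumptions that would need tuning on real recordings.

```python
import math


def frame_rms(samples, frame_len):
    """RMS amplitude of each non-overlapping frame of 8-bit unsigned PCM."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # 8-bit PCM is unsigned with silence at 128, so recentre on zero.
        energy = sum((s - 128) ** 2 for s in frame) / frame_len
        rms.append(math.sqrt(energy))
    return rms


def speech_frames(samples, frame_len=80, threshold=4.0):
    """Flag each frame as speech (True) or silence (False) by its RMS."""
    return [value > threshold for value in frame_rms(samples, frame_len)]
```

The same per-frame RMS values are the quantity described in the question, but on their own they capture loudness only, which is why the answer recommends spectral features such as MFCCs for telling words apart.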
For instance, the Edinburgh University toolkit provides some of these tools (with good documentation).
If you don't want to build it from scratch, or you want a source of inspiration, here is an (old but free) implementation of such a system (which uses its own toolkit) with a full explanation and practical examples of how it works.
This system is an LVCSR (Large-Vocabulary Continuous Speech Recognition) system, and you only need a subset of it. If someone knows of an open source reduced-vocabulary system (like a simple IVR), it would be welcome.
If you want to build a basic system on your own, I recommend using MFCC and DTW.
For each target word to model:
  record some instances of the word
  compute delta-MFCCs at regular intervals (e.g. every 10 ms) across the word to form a model
When you want to recognize a signal:
  compute the delta-MFCCs of this signal
  use DTW to compare these delta-MFCCs to each modeled word's delta-MFCCs
  output the word that fits best (use a threshold to drop garbage)
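The comparison step above can be sketched in Python as follows. This is a minimal, unoptimized DTW under the assumption that the feature vectors (delta-MFCCs or anything else) have already been computed. The function names, the templates dictionary layout and the threshold are hypothetical, and feature extraction is not shown.

```python
import math


def dtw_distance(a, b):
    """DTW cost between two sequences of equal-dimension feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize(features, templates, threshold):
    """Return the closest template word, or None if nothing is close enough."""
    best_word, best_cost = None, float("inf")
    for word, examples in templates.items():
        for example in examples:
            current = dtw_distance(features, example)
            if current < best_cost:
                best_word, best_cost = word, current
    return best_word if best_cost <= threshold else None
```

Here templates maps each word to a list of recorded examples, so several recordings per word can be kept and the nearest one wins. The threshold plays the role of the garbage filter in the last step.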
If you just want to recognize a few commands, there are many commercial and free products you can use. See "Need text to speech and speech recognition tools for Linux", "What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?" or "Speech Recognition on iPhone". The answers to these questions link to many available products and tools. Recognizing and understanding a list of commands is a very common problem that has been solved commercially. Many of the voice-automated phone systems you call use this type of technology. The same technology is available to developers.
From watching these questions for a few months, I've seen most developer choices break down like this:
Windows folks - use the System.Speech features of .NET or Microsoft.Speech and install the free recognizers Microsoft provides. Windows 7 includes a full speech engine. Others are downloadable for free. There is a C++ API to the same engines known as SAPI. See http://msdn.microsoft.com/en-us/magazine/cc163663.aspx or http://msdn.microsoft.com/en-us/library/ms723627(v=vs.85).aspx
Linux folks - Sphinx seems to have a good following. See http://cmusphinx.sourceforge.net/ and http://cmusphinx.sourceforge.net/wiki/
Commercial products - Nuance, Loquendo, AT&T, others
Online service - Nuance, Yapme, others
Of course this may also be helpful - http://en.wikipedia.org/wiki/List_of_speech_recognition_software