Google Vision API does not recognize single digits - google-cloud-platform

I have a project that makes use of the Google Vision API DOCUMENT_TEXT_DETECTION feature in order to extract text from document images.
The API often has trouble recognizing single digits, as you can see in this image:
I suppose the problem could be related to a noise-removal algorithm that treats isolated single digits as noise. Is there a way to improve the Vision response in these situations (for example, by adjusting a noise threshold or other parameters)?
At other times Vision confuses digits with letters:
But if I specify the parameter languageHints = 'en' or 'mt', these digits are ignored by the OCR. Is there a way to force the recognition of digits or Latin characters?

Unfortunately I think the Vision API is optimized for both ends of the spectrum -- dense text (DOCUMENT_TEXT_DETECTION) on one end, and arbitrary bits of text (TEXT_DETECTION) on the other. As you noted in the comments, the regular TEXT_DETECTION works better for these stray single digits while DOCUMENT_TEXT_DETECTION works better overall.
As far as I've heard, there are no current plans to try to cover both of these in a single way, but it's possible that this could improve in the future.
I think there have been other requests for more fine-tuning and hinting on what you're looking to detect (e.g., in similar feature requests), but this doesn't seem to be available yet. Perhaps in the future you'll be able to provide more hints about the format of the text you're looking for in images (e.g., phone numbers, single digits, etc.).
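In the meantime, it's cheap to run both feature types on the same image and compare. Here's a minimal sketch using the google-cloud-vision Python client; the file name is just a placeholder, and passing language hints this way is an assumption based on the client's generated helpers:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# "document.png" is a placeholder for your own image file.
with open("document.png", "rb") as f:
    image = vision.Image(content=f.read())

# DOCUMENT_TEXT_DETECTION: tuned for dense document text.
doc_response = client.document_text_detection(image=image)
print(doc_response.full_text_annotation.text)

# TEXT_DETECTION: often handles sparse, isolated characters better.
text_response = client.text_detection(
    image=image,
    image_context={"language_hints": ["en"]},  # the hints the question mentions
)
for annotation in text_response.text_annotations:
    print(annotation.description)
```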

Related

Google Natural Language Sentiment Analysis incorrect result

We have Google Natural Language AI integrated into our product for sentiment analysis (https://cloud.google.com/natural-language). One of our customers complained that when they write "BAD", it shows a positive sentiment.
On further investigation, we found that when the Google Natural Language sentiment analysis API is called with the input "BAD" or "Bad" (note the all-caps or capitalized first letter), it identifies the text as an entity (a location or consumer good) and returns a positive result, while when we write "bad" in all lowercase, it returns negative.
Has anyone faced a similar problem? How did you solve it?
One obvious workaround is converting the text to lowercase, but that may break other use cases (for example, where entities fail to be recognized because of the lowercase text). Another approach we are building is to use our own dictionary of words with sentiments before calling the Google APIs, but that doesn't solve the underlying problem, which may occur with any other text.
Inputs will help us. Thank you!
The NLP API uses an underlying model that is neural in nature. The knowledge comes from training on real world text. It is normal to get different results for different capitalizations as they can relate to different uses of the same trigram, e.g. Mike (person), mike (microphone, slang), MIKE (military alphabet entry).
The second key aspect is that the model is tuned for, and meant to be used on, larger pieces of text rather than single words, so good results cannot be expected in this case.
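For what it's worth, the capitalization effect is easy to reproduce with the Python client. A minimal sketch (the loop over the three spellings mirrors the question; nothing else is assumed beyond the standard google-cloud-language client):

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

# Compare the sentiment score for the three capitalizations from the question.
for text in ("BAD", "Bad", "bad"):
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    sentiment = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment
    # score is in [-1.0, 1.0]; magnitude measures overall emotional strength.
    print(f"{text!r}: score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")
```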

Google speech adds extra digits and mis-transcribes 9 and 10 digit strings

Scenario: a user speaks a 9 or 10 digit ID and Google speech is used to transcribe it.
Google STT sometimes forces the number into a phone number format, adding mystery digits to make it fit (and thus failing to capture the number accurately).
For example if the caller says "485839485", it may come out as "485-839-4850", with an extra digit that the caller never said. Digits are sometimes added in the middle of the number as well.
This happens even with added hints such as "one,two,three,four,five,six,seven,eight,nine,zero"
Has anyone found a workaround to this issue?
There are many open-source speech recognition toolkits that will recognize number sequences reliably and for free; you just need to spend an hour setting them up.
This behavior seems to be related to the logic used by the API's model when performing transcription. Since this issue stems from an internal process that tries to fit transcribed numbers into a phone-number format, I don't think there is a current workaround for this scenario; however, I recommend you take a look at the ticket that has been created to track this issue, as well as the Release Notes documentation of the Speech-to-Text API, to keep track of new functionality added to the service.
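Until that changes, the closest you can get is passing the digit words from the question as phrase hints. A minimal sketch with the google-cloud-speech Python client; the file name, encoding, and sample rate are assumptions for illustration:

```python
from google.cloud import speech

client = speech.SpeechClient()

# "digits.wav" is a placeholder for the caller's recording.
with open("digits.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # The digit hints from the question; they bias the model toward these
    # words but, as noted above, do not prevent phone-number formatting.
    speech_contexts=[
        speech.SpeechContext(
            phrases=["one", "two", "three", "four", "five",
                     "six", "seven", "eight", "nine", "zero"]
        )
    ],
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```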

How to optimize number recognition with Google Vision API?

I am experiencing strange behavior when using this Vision ML API.
I am capturing images from a live stream and I have tens of thousands of key frames cropped for detection of a single digit against a clear background. However, the performance of the Google ML Vision API is very unreliable for such a simple task. I am wondering why that might be and what I can do about it.
I have some hypotheses:
The language detection fails and leads to empty response, which I tend to get often (I have double checked that the empty response is not caused by authentication problems).
The background somehow makes the task hard.
The numbers are too small; they are 35x35 pixel images and the character strokes are clean, approximately 4 pixels wide.
The live stream causes some artifacts, which are invisible to the eye, but very disturbing for the OCR.
Google doesn't want us to use the Vision API for these kinds of problems, and we should instead use a model pre-trained on MNIST to recognize the digits.
I have used both detect-text and detect-document; the latter is a bit more accurate.
I came up with one solution, which seems to be working quite well.
I added text around the numbers (in order to give context), then removed the surrounding text with a regexp and picked out the numbers. It seems the API is not built for isolated character recognition; it likes to have some context words around the numbers to increase its confidence. This solution works quite well for my use case and probably for many others, since adding context text around the numbers is a trivial thing to do ("My shoe number is: X"). Adding text to images is a trivial task with ImageMagick.
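As a sketch of the idea (using Pillow instead of ImageMagick; the context phrase, canvas sizes, and helper names are all hypothetical), the wrapping and unwrapping might look like this:

```python
import re
from PIL import Image, ImageDraw

CONTEXT = "My shoe number is:"  # hypothetical context phrase

def add_context(digit_img: Image.Image) -> Image.Image:
    """Paste the digit crop onto a larger canvas with context text before it."""
    canvas = Image.new("RGB", (digit_img.width + 220, digit_img.height + 40), "white")
    ImageDraw.Draw(canvas).text((10, canvas.height // 2 - 6), CONTEXT, fill="black")
    canvas.paste(digit_img, (160, 20))
    return canvas

def extract_digits(ocr_text: str) -> str:
    """Strip the known context from the OCR output and keep only the digits."""
    match = re.search(re.escape(CONTEXT) + r"\s*(\d+)", ocr_text)
    return match.group(1) if match else ""
```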

Voice activated password implementation in python

I want to record a word beforehand and, when the same password is spoken into the Python script, the program should run if the spoken password matches the previously recorded file. I do not want to use speech recognition toolkits, as the passwords might not be proper words but could be complete gibberish. I started by saving the previously recorded file and the newly spoken sound as numpy arrays. Now I need a way to determine if the two arrays are 'close' to each other. Can someone point me in the right direction?
It is not possible to compare two speech samples on a sample level (i.e., in the time domain). Each part of the spoken words might vary in length, so they won't line up, and the levels of each part will also vary. Another problem is that the phase of the individual components the sound signal consists of can change too, so two signals that sound the same can look very different in the time domain. So the best solution is likely to move the signal into the frequency domain. One common way to do this is the Fast Fourier Transform (FFT). You can look it up; there is a lot of material about this on the net, and good support for it in Python.
Then you could proceed like this:
Divide the sound sample into small segments of a few milliseconds.
Find the principal coefficients of the FFT of each segment.
Compare the sequences of some selected principal coefficients.
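A naive sketch of those steps in Python/NumPy. The segment length, coefficient count, and cosine-similarity comparison are all assumptions; a real system would add alignment (e.g., dynamic time warping) and a threshold tuned on your own recordings:

```python
import numpy as np

def fft_features(signal: np.ndarray, rate: int,
                 segment_ms: int = 20, n_coeffs: int = 16) -> np.ndarray:
    """Split the signal into short segments and keep the first FFT magnitudes."""
    seg_len = int(rate * segment_ms / 1000)
    n_segments = len(signal) // seg_len
    features = []
    for i in range(n_segments):
        segment = signal[i * seg_len:(i + 1) * seg_len]
        spectrum = np.abs(np.fft.rfft(segment))  # magnitude discards phase
        features.append(spectrum[:n_coeffs])
    return np.array(features)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two feature sequences, truncated to equal length."""
    n = min(len(a), len(b))
    a, b = a[:n].ravel(), b[:n].ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Accept the password if similarity(features_a, features_b) exceeds a
# threshold you determine empirically from your own recordings.
```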

How can I capture images from the screen and compare them to known images in C++?

In an effort to see if it is possible to easily break very simple CAPTCHAs, I am attempting to write a program (as simple and small as possible). This program, which I hope to write in C++, should do the following:
Make a partial screenshot of a known area of the screen. (Assume the CAPTCHA is always in the exact same place, for instance pixels 500-600 in x, pixels 300-400 in y.)
Automatically dissect the CAPTCHA into individual letters. (The CAPTCHAS I will create for testing will all have only a few white letters, always on a black background, spaced well apart, to make things easy on me.)
The program then compares the "cut" letters against an array of "known" images of letters (which look similar to the letters used in the CAPTCHA), which contains 26 elements, each holding an image of a single letter of the English alphabet.
The program takes the letter associated with the best-matching image and sends that character to the console (via std::cout).
My question is: Is there an easy-to-use library (I am only a beginner at programming) which can handle tasks 1-3? (Task 4 is rather easy.) The third point in particular is one I haven't found much worthwhile material on. Ideally this library would have a "score" function, returning a float that indicates how similar two images are; the one with the highest score is then the best hit. (E.g., 100.0 means the images are identical, 29.56 means they are very different, etc.)
A good library for this job is OpenCV. http://opencv.org
OpenCV has all the necessary low-level image processing tools to segment the different elements of the captcha. Then you can use its template matching module.
You could even try to detect letters directly without the preprocessing. It will be slower, but the captcha image is typically so small that it should rarely matter. See:
http://docs.opencv.org/modules/imgproc/doc/object_detection.html#cv2.matchTemplate
For some tutorials to get into the library see:
http://docs.opencv.org/doc/tutorials/tutorials.html
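A minimal sketch of the matching step, shown with OpenCV's Python bindings for brevity (the C++ API, cv::matchTemplate and cv::minMaxLoc, mirrors these calls one-for-one). The file names are placeholders for one segmented captcha letter and the 26 reference images:

```python
import cv2

# "letter.png" is one segmented letter cut from the captcha;
# "letters/A.png" ... "letters/Z.png" are the known reference images.
letter_img = cv2.imread("letter.png", cv2.IMREAD_GRAYSCALE)

best_letter, best_score = None, -1.0
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    template = cv2.imread(f"letters/{letter}.png", cv2.IMREAD_GRAYSCALE)
    # TM_CCOEFF_NORMED yields a score in [-1, 1], where 1.0 means identical.
    result = cv2.matchTemplate(letter_img, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, _ = cv2.minMaxLoc(result)
    if score > best_score:
        best_letter, best_score = letter, score

print(best_letter, best_score)  # the highest score is the best hit
```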