How to optimize number recognition with Google Vision API?

I am experiencing strange behavior when using the Vision ML API.
I am capturing images from a live stream, and I have tens of thousands of key frames cropped for detection of a single digit against a clear background. However, the performance of the Google ML Vision API is very unreliable for such a simple task. I am wondering why that might be and what I can do about it.
I have some hypotheses:
The language detection fails and leads to an empty response, which I tend to get often (I have double-checked that the empty response is not caused by authentication problems).
The background somehow makes the task hard.
The numbers are too small; they are 35x35 images, and the character strokes are clear, with a width of approximately 4 pixels.
The live stream causes some artifacts which are invisible to the eye but very disruptive for the OCR.
Google doesn't want us to use the Vision API for these kinds of problems, and we should instead use a model pre-trained on MNIST to recognize numbers.
I have used both detect-text and detect-document; the latter is a bit more accurate.

I came up with one solution, which seems to be working quite well.
I added text around the numbers (in order to give context), then stripped the added text with a regexp and kept the numbers. It seems the API is not meant for isolated character recognition; it likes to have some context words around the numbers to increase its confidence. This solution works quite well for my use case, and probably for many others too, since adding context text around the numbers is a trivial thing to do ("My shoe number is: X"). Adding text to images should be a trivial task with ImageMagick.
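A minimal sketch of that pipeline in Python, assuming Pillow stands in for ImageMagick to composite the context text, plus the official google-cloud-vision client; the context phrase, file name, and helper name are all made up for illustration:

import io
import re

from PIL import Image, ImageDraw          # assumption: Pillow instead of ImageMagick
from google.cloud import vision

CONTEXT = "My shoe number is: "

def read_digit(crop_path):
    # Paste the small digit crop onto a wider white canvas with a context phrase.
    crop = Image.open(crop_path).convert("RGB")
    canvas = Image.new("RGB", (crop.width + 160, max(crop.height, 20)), "white")
    ImageDraw.Draw(canvas).text((2, 2), CONTEXT, fill="black")
    canvas.paste(crop, (150, 0))

    buf = io.BytesIO()
    canvas.save(buf, format="PNG")

    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=buf.getvalue()))
    text = response.full_text_annotation.text

    # Strip the context again and keep only the digits.
    match = re.search(re.escape(CONTEXT.strip()) + r"\s*(\d+)", text)
    return match.group(1) if match else None

print(read_digit("frame_0001.png"))       # hypothetical key-frame crop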

Related

Google Natural Language Sentiment Analysis incorrect result

We have Google Natural Language AI integrated into our product for sentiment analysis (https://cloud.google.com/natural-language). One of our customers complained that when they write "BAD", it shows a positive sentiment.
On further investigation, we found that when the Google Natural Language sentiment analysis API is called with the input BAD or Bad (note: all caps or first letter capitalized), it identifies the text as an entity (a location or consumer good) and returns a positive result, while when we write "bad" in all lowercase, it returns a negative one.
Has anyone faced a similar problem? How did you solve it?
One obvious fix looks like converting the text to lowercase, but that may break other use cases (perhaps ones where entities fail to be recognized because of lowercase text). Another approach we are building is to use our own dictionary of words with sentiments before calling the Google APIs, but that doesn't solve the underlying problem, which may occur with any other text.
Any input would help us. Thank you!
The NLP API uses an underlying model that is neural in nature. Its knowledge comes from training on real-world text. It is normal to get different results for different capitalizations, as they can relate to different uses of the same token, e.g. Mike (person), mike (microphone, slang), MIKE (military alphabet entry).
The second key aspect is that the model is tuned for, and meant to be used on, larger pieces of text, not single words, so good results cannot be expected in this case.
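A quick way to reproduce this is to score both capitalizations and compare; a minimal sketch using the official google-cloud-language Python client, with the inputs taken from the question:

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

for text in ("BAD", "bad"):
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT, language="en"
    )
    sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
    # score runs from -1.0 (clearly negative) to +1.0 (clearly positive)
    print(f"{text!r}: score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")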

Text Detection with YOLO on Challenging Images

I have images that look as follows:
My goal is to detect and recognize the number 31197394. I have already fine-tuned a deep neural network on text recognition. It can successfully identify the correct number if the number is provided in the following format:
The only task that remains is the detection of the corresponding bounding box. For this purpose, I have tried darknet. Unfortunately, it's not recognizing anything. Does anyone have an idea of a network that performs better on these kinds of images? I know that Amazon Rekognition is able to solve this task, but I need a solution that works offline. So my hopes are still high that there exist pre-trained networks that work. Thanks a lot for your help!
Don't say darknet doesn't work. It depends on how you labeled your dataset. It is true that the numbers you want to recognize are too small, so if you don't make any changes to the images during the pre-processing phase, it will be complicated for a neural network to recognize them well. So what you can do that will surely work is:
1. Before labeling, increase the size of all images to twice their current size (e.g. 1000x1000); see the sketch after this list.
2. Use this size (1000x1000) for the darknet trainer instead of the default size proposed by darknet, which is 416x416. You would then have to change the configuration file.
3. Use the latest darknet version (YOLOv4).
4. In the configuration file, always keep the number of subdivisions at 1.
I should also mention that this method is very memory-hungry, so you need a machine with more than 16 GB of RAM. The advantage is that it works...
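Step 1, upscaling the images before labeling, is a one-liner with OpenCV; a minimal sketch in Python, with file names made up for illustration:

import cv2

# Double the resolution of a training image before labeling it.
img = cv2.imread("train/frame_0001.jpg")
big = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
cv2.imwrite("train_2x/frame_0001.jpg", big)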
Thanks for your answers, guys! You were right, I had to fine-tune YOLO to make it work. So I created a dataset and fine-tuned YOLOv5. I am surprised how good the results are. Despite only having about 300 images in total, I get an accuracy of 97% at predicting the correct number. This is mainly due to strong augmentations. And indeed the memory requirements are large, but I could train on a 32 GB RAM machine. I can really encourage anyone who faces similar problems to give YOLO a shot!
Maybe use an R-CNN to identify the region where the number is and then pass that region to your fine-tuned neural network for the digit classification

Google Vision API does not recognize single digits

I have a project that makes use of the Google Vision API DOCUMENT_TEXT_DETECTION feature in order to extract text from document images.
Often the API has trouble recognizing single digits, as you can see in this image:
I suppose the problem could be related to some noise-removal algorithm that treats isolated single digits as noise. Is there a way to improve the Vision response in these situations (for example, by adjusting a noise threshold or other parameters)?
At other times, Vision confuses digits with letters:
But if I specify the parameter languageHints = 'en' or 'mt', these digits are ignored by the OCR. Is there a way to force the recognition of digits or Latin characters?
Unfortunately I think the Vision API is optimized for both ends of the spectrum -- dense text (DOCUMENT_TEXT_DETECTION) on one end, and arbitrary bits of text (TEXT_DETECTION) on the other. As you noted in the comments, the regular TEXT_DETECTION works better for these stray single digits while DOCUMENT_TEXT_DETECTION works better overall.
As far as I've heard, there are no current plans to try to cover both of these in a single way, but it's possible that this could improve in the future.
I think there have been other requests to do more fine-tuning and hinting on what you're looking to detect (e.g., here and here), but this doesn't seem to be available yet. Perhaps in the future you'll be able to provide more hints on the format of the text that you're looking to find in images (e.g., phone numbers, single digits, etc.).
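For what it's worth, running both modes side by side (with the languageHints parameter from the question) is straightforward in the Python client; a minimal sketch, with the file name made up:

from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("document_page.png", "rb") as f:   # hypothetical input image
    image = vision.Image(content=f.read())

# Optional language hints, as in the question.
context = vision.ImageContext(language_hints=["en"])

# Sparse-text mode: often better for stray single digits.
sparse = client.text_detection(image=image, image_context=context)
# Dense-text mode: usually better for full document pages.
dense = client.document_text_detection(image=image, image_context=context)

print("TEXT_DETECTION:",
      sparse.text_annotations[0].description if sparse.text_annotations else "")
print("DOCUMENT_TEXT_DETECTION:", dense.full_text_annotation.text)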

How can I capture images from the screen and compare them to known images in C++?

In an effort to see if it is possible to easily break very simple CAPTCHAs, I am attempting to write a program (as simple and small as possible). This program, which I hope to write in C++, should do the following:
Make a partial screenshot of a known area of the screen. (Assuming that the CAPTCHA is always in the exact same place - for instance: Pixels 500-600 x, pixels 300-400 y).
Automatically dissect the CAPTCHA into individual letters. (The CAPTCHAs I will create for testing will all have only a few white letters, always on a black background, spaced well apart, to make things easy on me.)
The program then compares the "cut" letters against an array of "known" images of letters (which look similar to the letters used in the CAPTCHA); the array contains 26 elements, each holding an image of a single letter of the English alphabet.
The program takes the letter associated with the best-matching image and sends that key to the console (via std::cout).
My question is: Is there an easy-to-use library (I am only a beginner at programming) that can handle tasks 1-3? (Task 4 is rather easy.) On the third point especially, I have found barely anything worthwhile. Ideally the library would have a "score" function, using a float to indicate how similar the images are; the one with the highest score is then the best hit (e.g. 100.0 means the images are identical, 29.56 means they are very different, etc.).
A good library for this job is OpenCV: http://opencv.org
OpenCV has all the necessary low-level image processing tools to segment the different elements of the captcha. Then you can use its template matching module.
You could even try to detect the letters directly, without the preprocessing. It will be slower, but the captcha image is typically so small that it should rarely matter. See:
http://docs.opencv.org/modules/imgproc/doc/object_detection.html#cv2.matchTemplate
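The "score" function asked about in the question maps directly onto matchTemplate; a minimal sketch using the Python bindings (the C++ call cv::matchTemplate is analogous), with file names made up for illustration:

import cv2

# Load one segmented captcha letter, as grayscale.
letter = cv2.imread("captcha_letter.png", cv2.IMREAD_GRAYSCALE)

best_score, best_char = -1.0, None
for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    # Each template must be no larger than the letter crop.
    template = cv2.imread(f"templates/{ch}.png", cv2.IMREAD_GRAYSCALE)
    # TM_CCOEFF_NORMED yields a similarity score in [-1, 1]; 1.0 means identical.
    result = cv2.matchTemplate(letter, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, _ = cv2.minMaxLoc(result)
    if score > best_score:
        best_score, best_char = score, ch

print(best_char, best_score)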
For some tutorials to get into the library see:
http://docs.opencv.org/doc/tutorials/tutorials.html

How do I write a Perl script to filter out digital pictures that have been doctored?

Last night before going to bed, I browsed through the Scalar Data section of Learning Perl again and came across the following sentence:
the ability to have any character in a string means you can create, scan, and manipulate raw binary data as strings.
An idea immediately hit me: I could let Perl scan the pictures that I have stored on my hard disk to check whether they contain the string Adobe. It seems that by doing so, I can tell which of them have been photoshopped. So I tried to implement the idea and came up with the following code:
#!perl
use autodie;
use strict;
use warnings;

{
    local $/ = "\n\n";              # read in paragraph-sized chunks, not lines
    my $dir = 'f:/TestPix/';
    my @pix = glob "$dir/*";
    foreach my $file (@pix) {
        open my $pic, '<', $file;
        binmode $pic;               # the pictures are binary data, not text
        while (<$pic>) {
            if (/Adobe/) {
                print "$file\n";
                last;               # report each file at most once
            }
        }
        close $pic;
    }
}
Excitingly, the code seems to really work, and it does the job of filtering out the pictures that have been photoshopped. But the problem is that many pictures are edited by other utilities, so I'm kind of stuck there. Do we have some simple but universal method to tell whether a digital picture has been edited or not, something like
if (!= /the original format/) {...}
Or do we simply have to add more conditions, like
if (/Adobe/|/ACDSee/|/some other picture editors/)
Any ideas on this? Or am I oversimplifying due to my miserably limited programming knowledge?
Thanks, as always, for any guidance.
Your best bet in Perl is probably ExifTool. This gives you access to whatever non-image information is embedded in the image. However, as other people have said, it's possible to strip this information out, of course.
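For illustration, the same check can also be done by shelling out to the exiftool command-line tool; a minimal sketch in Python (Software and CreatorTool are standard EXIF/XMP tag names; the file name is made up):

import subprocess

# Ask exiftool for the editing-software tags of one picture.
out = subprocess.run(
    ["exiftool", "-Software", "-CreatorTool", "photo.jpg"],
    capture_output=True, text=True, check=True,
).stdout
print("possibly edited" if "Adobe" in out else "no Adobe marker found")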
I'm not going to say there is absolutely no way to detect alterations in an image, but the problem is extremely difficult.
The only person I know of who claims to have an answer is Dr. Neal Krawetz, who claims that digitally altered parts of an image will have different compression error rates from the original portions. He claims that re-saving a JPEG at different quality levels will highlight these differences.
I have not found this to be the case, in my investigations, but perhaps you might have better results.
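The re-save technique Krawetz describes (often called error level analysis) is easy to prototype if you want to try it yourself; a minimal sketch with Pillow, using a hypothetical photo.jpg:

from PIL import Image, ImageChops

# Re-save the JPEG at a fixed quality and look at where it differs most:
# regions pasted in from elsewhere often show a different error level.
orig = Image.open("photo.jpg").convert("RGB")
orig.save("resaved.jpg", "JPEG", quality=90)
resaved = Image.open("resaved.jpg")

diff = ImageChops.difference(orig, resaved)
# Amplify the (usually tiny) differences so they are visible on screen.
diff = diff.point(lambda px: min(255, px * 20))
diff.save("error_levels.png")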
No. There is no functional distinction between a perfectly edited image and one which was the way it is from the start: it's all just a bag of pixels in the end, after all, and any metadata can be removed or forged at will.
The name of the graphics program used to edit the image is not part of the image data itself but of something called metadata, which may be stored in the image file but, as others have noted, is neither required (some programs may not store it; some may offer the option of not storing it) nor reliable: if you forged an image, you might have forged the metadata as well.
So the answer to your question is "no, there's no way to universally tell if the pic was edited or not, although some image editing software may write its signature into the image file, where it will be left through the carelessness of the person doing the editing."
If you're inclined to learn more about image processing in Perl, you could take a look at some of the excellent modules CPAN has to offer:
Image::Magick - read, manipulate, and write a large number of image file formats
GD - create colour drawings using a large number of graphics primitives, and emit the drawings in various formats.
GD::Graph - create charts
GD::Graph3d - create 3D Graphs with GD and GD::Graph
However, there are other utilities available for identifying various image formats. It's more of a question for Super User, but on various unix distros you can use file to identify many different types of files, and on Mac OS X, Graphic Converter has never let me down. (It was even able to open the bizarre multi-file X-ray of my cat's shattered pelvis that I got on a disc from the vet.)
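If you want that check from inside a script rather than the shell, shelling out to file works; a minimal Python sketch (file name made up):

import subprocess

# `file` inspects the magic bytes, so it reports the real format
# even when the file extension lies.
out = subprocess.run(["file", "photo.jpg"], capture_output=True, text=True).stdout
print(out)   # e.g. "photo.jpg: JPEG image data, ..."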
How would you know what the original format was? I'm pretty sure there's no guaranteed way to tell if an image has been modified.
I can just open the file (with my favourite programming language and filesystem API) and write whatever I want into it, willy-nilly. As long as I don't screw something up with the file format, you'd never know it happened.
Heck, I could print the image out and then scan it back in; how would you tell it from an original?
As others have stated, there is no way to know whether the image was doctored. I'm guessing what you basically want to know is the difference between a realistic photograph and one that has been enhanced or modified.
There's always the option of running some extremely complex image recognition algorithm that analyzes every pixel in your image to determine whether it was doctored. Such a solution would probably involve AI that examines millions of photos, both doctored and not, and learns from them. However, this is more of a theoretical solution than a practical one (the kind of thing you'd only see in movies): it would be extremely complex to develop, would probably take years, and even then it probably wouldn't be correct 100% of the time. I'm guessing AI technology isn't at that level yet and may take a while to get there.
A not-commonly-known feature of exiftool allows you to recognize the originating software through an analysis of the JPEG quantization tables (without relying on image metadata). It recognizes tables written by many applications. Note that some cameras may use the same quantization tables as some applications, so this isn't a 100% solution, but it is worth looking into. Here is an example of exiftool run on two images, the first of which was edited by Photoshop.
> exiftool -jpegdigest a.jpg b.jpg
======== a.jpg
JPEG Digest : Adobe Photoshop, Quality 10
======== b.jpg
JPEG Digest : Canon EOS 30D/40D/50D/300D, Normal
2 image files read
This will work even if the metadata has been removed.
There is existing software out there which uses various techniques (compression artifacting, comparison to signature profiles in a database of cameras, etc.) to analyze the actual image data for evidence of alteration. If you have access to such software and the software available to you provides an API for external access to these analysis functions, then there's a decent chance that a Perl module exists which will interface with that API and, if no such module exists, it could probably be created rather quickly.
In theory, it would also be possible to implement the image analysis code directly in native Perl, but I'm not aware of anyone having done so and I expect that you'd be better off writing something that low-level and processor-intensive in a fully-compiled language (e.g., C/C++) rather than in Perl.
http://www.impulseadventure.com/photo/jpeg-snoop.html is a tool that does the job fairly well.
If there has been any cloning, there is a variation in pixel density or concentration which sometimes shows up upon manual inspection: a Photoshop-cloned area will have an even pixel density (by which I mean less pixel variation than in a scanned image).