I am using AWS Rekognition to detect text from a pdf that is converted into a jpeg.
The text in the image is roughly 10-12 pt on a regular letter-size page. However, the font changes several times throughout the image.
Are the missed detections and low confidence levels due to the frequent font changes, or to the small font size?
Essentially, I'd like to know what kind of image and text I need to get the best results from a text detection algorithm.
The DetectText API can detect up to 50 words in an image, and to be detected, text must be within +/- 30 degrees orientation of the horizontal axis.
You are trying to extract a page full of text, and that's the problem :)
AWS now provides the Amazon Textract service, which is specifically intended for OCR on images and documents.
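For images that do fit within DetectText's limits, low-confidence results can be filtered out after the call. The following is a minimal sketch, assuming the response shape that Rekognition returns for DetectText (a `TextDetections` list whose entries carry `DetectedText`, `Type` and `Confidence`). The helper names, the 80% threshold and the default region are illustrative assumptions, not part of the original answer.

```python
def words_above_confidence(response, min_confidence=80.0):
    """Return (text, confidence) pairs for WORD detections at or above a threshold."""
    results = []
    for det in response.get("TextDetections", []):
        # Rekognition reports both LINE and WORD entries; keep words only.
        if det.get("Type") == "WORD" and det.get("Confidence", 0.0) >= min_confidence:
            results.append((det["DetectedText"], det["Confidence"]))
    return results


def detect_text_in_file(path, region_name="us-east-1"):
    """Call DetectText on a local image (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parsing helper works without boto3

    client = boto3.client("rekognition", region_name=region_name)
    with open(path, "rb") as f:
        return client.detect_text(Image={"Bytes": f.read()})
```

Typical use would be `words_above_confidence(detect_text_in_file("page.jpg"))`, where `page.jpg` is a placeholder path.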
A basic text detection API (e.g. Google's) does not return anything for the following image. To try Google's Vision API, save the image locally and run:
gcloud ml vision detect-text <local-path-to-image> | grep description
It may return gibberish. The text we want is RAW9405. Are there any existing models for this or does it require training?
What you can do is use craft-text-detector, which is available as open source. You will get the bounding box coordinates for every single word, and based on the y-axis you can form sentences, then use Tesseract for recognition.
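Grouping word boxes into lines by their y position could look roughly like this. It is a sketch only: the `(text, x, y, w, h)` tuple layout, the function name and the 0.5 tolerance are assumptions made for illustration, and a real detector's output would need to be converted into that form first.

```python
def group_words_into_lines(words, tolerance=0.5):
    """Group (text, x, y, w, h) word boxes into lines of text.

    A word joins an existing line when its vertical center is within
    `tolerance` times the height of the first word placed on that line.
    """
    lines = []  # each entry: {"center": float, "height": float, "words": [...]}
    for word in sorted(words, key=lambda b: b[2] + b[4] / 2):
        _, _, y, _, h = word
        center = y + h / 2
        for line in lines:
            if abs(center - line["center"]) < tolerance * line["height"]:
                line["words"].append(word)
                break
        else:
            lines.append({"center": center, "height": h, "words": [word]})
    # Lines go top to bottom, words within a line go left to right.
    lines.sort(key=lambda line: line["center"])
    return [
        " ".join(w[0] for w in sorted(line["words"], key=lambda b: b[1]))
        for line in lines
    ]
```

Each resulting line region (or each word crop) can then be passed to the recognizer.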
I am attempting to use Azure Computer Vision OCR to recognize text in a jpg image. The image is about 2000x3000 pixels and is a picture of a contract. I want to get all the text and the bounding boxes. The image DPI is over 300 and its quality is very clear. I noticed that a lot of text was being skipped, so I cropped a section of the image and submitted that instead. This time it recognized text that it did not recognize before. Why would it do this? If the quality of the image never changed and the image was within the bounds of the resolution requirements, why is it skipping text?
I'm trying to use Microsoft's Computer Vision OCR API to get information from a table in an image. The trouble I'm having is that the data returned typically has all sorts of quirky regions, and I'm attempting to piece all the regions together to get full lines of readable and parseable text.
The only way I've thought of that makes any sense is to use the orientation to rotate the bounding box coordinates and check which "lines" are within a given percentage of the height of another given bounding box - perhaps 20% or so.
This is literally the only way I've thought of so far, and I'm beginning to think I'm overcomplicating this. Is there a standard way that people tend to build up OCR regions to get readable text?
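The rotate-then-compare idea described above can be sketched as follows. Everything here is an assumption for illustration: boxes are `(x, y, w, h)` in image coordinates (x right, y down), the skew angle is given in degrees under the standard rotation matrix in those coordinates, and the 20% tolerance is the figure suggested in the question. Depending on how a given API reports its angle, the sign may need to be flipped.

```python
import math


def deskew_point(x, y, angle_degrees, origin=(0.0, 0.0)):
    """Rotate (x, y) by -angle_degrees around origin to undo a page skew."""
    theta = math.radians(-angle_degrees)
    ox, oy = origin
    dx, dy = x - ox, y - oy
    return (
        ox + dx * math.cos(theta) - dy * math.sin(theta),
        oy + dx * math.sin(theta) + dy * math.cos(theta),
    )


def same_line(box_a, box_b, angle_degrees=0.0, tolerance=0.2):
    """True if two (x, y, w, h) boxes sit on one text line after deskewing.

    The deskewed vertical centers must differ by at most `tolerance`
    times the taller box's height.
    """
    centers = [
        deskew_point(x + w / 2, y + h / 2, angle_degrees)
        for x, y, w, h in (box_a, box_b)
    ]
    max_height = max(box_a[3], box_b[3])
    return abs(centers[0][1] - centers[1][1]) <= tolerance * max_height
```

Comparing centers rather than top edges keeps the check stable when neighbouring words have different heights.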
There is no standard way as such. However, people often go with regular expressions, depending on the requirement.
Azure OCR returns a JSON response containing words and their bounding boxes. From there on, it is up to you to interpret the result. The OCR APIs do not help with this task.
As a start, regex is a great way to parse text data. Or try a machine learning approach as described in this reddit post: https://www.reddit.com/r/MachineLearning/comments/53ovp9/extracting_a_total_cost_from_ocr_paper_receipt/
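As a small illustration of the regex route, here is a sketch that pulls a total out of receipt-style OCR text. The pattern, the function name and the sample text in the usage note are made up for this example. Thousands separators and amounts without a leading "total" are not handled.

```python
import re

# "total" as a whole word, then any non-digit filler on the same line,
# then an amount with an optional two-digit decimal part.
TOTAL_PATTERN = re.compile(r"\btotal\b[^\d\n]*(\d+(?:[.,]\d{2})?)", re.IGNORECASE)


def extract_total(ocr_text):
    """Return the last amount that follows the word 'total', or None."""
    matches = TOTAL_PATTERN.findall(ocr_text)
    if not matches:
        return None
    # Accept a decimal comma as well as a decimal point.
    return float(matches[-1].replace(",", "."))
```

For text such as "Subtotal 7.00" followed by "TOTAL: $7.56", the word boundary skips the subtotal line and the function returns 7.56.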
I am experimenting with the Google Vision API text detection feature, trying to perform OCR on text images. The text images are quite clean and it works 80% of the time. The remaining 20% of errors include misinterpreted numbers / characters (fixable), and some words / numbers that simply don't show up (not fixable!).
I followed the tips on the best practices page (image is 1024x768, 16-bit PNG) to no avail.
Here is an example. This sample page:
https://storage.googleapis.com/ximian-cloud.appspot.com/sample_page.png
has the number 177 (under Observations, right of "RT ARM"), and this is not detected at all by the API ...
I tried:
Twice the resolution (2048 x 1536)
BMP 24-bit
BMP 32-bit
All of the above, in grayscale
All of the above, inverted (black background and white letters)
No luck ...
Any hint on why this is happening? Is it the API, or does my image need different formatting?
This is an error that has been noticed and registered already, now in the process of getting fixed, hopefully quite soon.
Does anyone know of a C++ library for taking an image and performing image recognition on it such that it can find letters based on a given font and/or font height? Even one that doesn't let you select a font would be nice (e.g. readLetters(Image image)).
I've been looking into this a lot lately. Your best bet is simply Tesseract. If you need layout analysis on top of the OCR, then go with Ocropus (which in turn uses Tesseract to do the OCR). Layout analysis refers to being able to detect the position of text in the image and do things like line segmentation, block segmentation, etc.
I've found some really good tips through experimentation with Tesseract that are worth sharing. Basically I had to do a lot of preprocessing for the image.
Upsize/Downsize your input image to 300 dpi.
Remove color from the image. Grey scale is good. I actually used a dither threshold and made my input black and white.
Cut out unnecessary junk from your image.
For all three of the above I used netpbm (a set of image manipulation tools for Unix) to get to the point where I was getting pretty much 100 percent accuracy for what I needed.
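The three steps above can be sketched in plain Python on nested lists of pixels. This is not how netpbm does it. It is only an illustration of the operations, and the function names, the luminance weights and the `ink_below` cutoff are assumptions chosen for the example.

```python
def scale_for_300_dpi(current_dpi, target_dpi=300):
    """Factor by which to resize an image so it ends up at target_dpi."""
    return target_dpi / current_dpi


def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def crop_to_content(gray_rows, ink_below=250):
    """Cut away border rows and columns that hold only near-white pixels."""
    rows = [i for i, row in enumerate(gray_rows) if any(p < ink_below for p in row)]
    if not rows:
        return gray_rows  # blank page: nothing to crop to
    width = len(gray_rows[0])
    cols = [j for j in range(width) if any(row[j] < ink_below for row in gray_rows)]
    return [row[cols[0]:cols[-1] + 1] for row in gray_rows[rows[0]:rows[-1] + 1]]
```

A 150 dpi scan, for instance, gives a scale factor of 2.0, so both width and height would be doubled before OCR.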
If you have a highly customized font and go with tesseract alone you have to "Train" the system -- basically you have to feed a bunch of training data. This is well documented on the tesseract-ocr site. You essentially create a new "language" for your font and pass it in with the -l parameter.
The other training mechanism I found was with Ocropus, using neural net (bpnet) training. It requires a lot of input data to build a good statistical model.
In terms of invoking them, Tesseract and Ocropus are both C++. It won't be as simple as ReadLines(Image), but there is an API you can check out. You can also invoke them via the command line.
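A command-line invocation can be wrapped in a few lines. This sketch assumes the `tesseract` binary is on the PATH and uses its usual form `tesseract <image> <outputbase> -l <lang>`. The helper names are hypothetical, and `myfont` in the usage note stands in for whatever custom "language" was trained.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for the tesseract command-line tool."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]  # selects the trained language / font data
    return cmd


def run_tesseract(image_path, output_base, lang=None):
    """Run Tesseract and return the text it writes to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

For example, `run_tesseract("page.png", "out", lang="myfont")` would run `tesseract page.png out -l myfont` and read back `out.txt`.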
While I cannot recommend one in particular, the term you are looking for is OCR (Optical Character Recognition).
There is tesseract-ocr which is a professional library to do this.
From their website:
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available
I think what you want is Conjecture. Used to be the libgocr project. I haven't used it for a few years but it used to be very reliable if you set up a key.
The Tesseract OCR library gives pretty accurate results; it's a C and C++ library.
My initial results were around 80% accurate, but after applying pre-processing to the images before supplying them for OCR, the results were around 95% accurate.
What the pre-processing involves:
1) Binarize the bitmap (B&W worked better for me).
2) Resample your image to 300 dpi.
3) Save your image in a lossless format, such as LZW TIFF or CCITT Group 4 TIFF.
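One common way to pick the cutoff for the binarization step is Otsu's method. The answer above does not say which method was used, so this is an assumed choice, sketched in plain Python on a grayscale image stored as rows of 0-255 integers.

```python
def otsu_threshold(gray_pixels):
    """Return the cutoff t that best separates dark from light pixels (Otsu)."""
    hist = [0] * 256
    for p in gray_pixels:
        hist[p] += 1
    total = len(gray_pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    sum_bg = 0
    weight_bg = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue  # no pixels at or below t yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is at or below t
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance, up to a constant factor.
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray_rows):
    """Map pixels above the Otsu cutoff to 255 (white), the rest to 0 (black)."""
    cutoff = otsu_threshold([p for row in gray_rows for p in row])
    return [[255 if p > cutoff else 0 for p in row] for row in gray_rows]
```

The binarized rows would then be written out in a lossless format, as step 3 recommends, before being handed to the OCR engine.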