Restrict Preprocessing of Tesseract - c++

I am new to tesseract library and I set it up on Ubuntu 12.04.
I am using this data set for recognition. When I fed these images to Tesseract as they are (without any preprocessing) using this code, I was getting roughly 70-75% accuracy.
I want the accuracy to be 90+%, so I did some preprocessing. The steps I followed to enhance the images are:
Steps for Preprocessing
Applied a bottom-hat operator with a circular structuring element of radius 12
Took the complement of the image to make the background white and the text black
Enhanced the contrast of the resulting image
Eroded the image.
After these steps I get pretty clear images, which can be seen here. But now, when I feed these images to Tesseract using that same code, accuracy drops below 50%, and I don't know why. Is it because Tesseract does some preprocessing of its own? If so, how can I stop Tesseract from doing that preprocessing? If not, why is it giving me bad results when the image is so much cleaner now? Pardon me if I have asked a basic question.

Regarding your question of why Tesseract delivers better results when given a binary image instead of a grayscale image as input:
Tesseract does an internal binarization of grayscale input using various methods (I haven't figured out exactly which binarization method is used; sometimes a local adaptive threshold is mentioned on the internet, sometimes a global Otsu threshold). What is certain is that Tesseract performs character recognition on a binary image, and that Tesseract's preprocessing can still fail on specific problems (it doesn't have good layout analysis, for example). So if you do the preprocessing yourself, give Tesseract only a binary image containing the text, and disable all of Tesseract's layout analysis, you can achieve better results than letting Tesseract do everything for you. Since it is a free, open-source utility, it has some known drawbacks, which have to be accepted.
If you use Tesseract as a command-line tool, this thread is very useful for the parameters: tesseract command line page segmentation
If you use the Tesseract source code to develop your own C++ code, you have to initialize Tesseract with the correct parameters. The parameters are described on the Tesseract API page: tesseract API

Well, I was feeding a grayscale (8 bpp) image to Tesseract after preprocessing, so Tesseract was trying to binarize that grayscale image, i.e. convert it to black and white, and that was giving me bad results; I still don't know why.
But after that I first converted my grayscale image into a b/w (1 bpp) image and then fed that image to Tesseract, and I got relatively much better results.

Related

TEXT_DETECTION ignoring / eliminating words

I am experimenting with the Google Vision API text detection feature, trying to perform OCR on text images. The text images are quite clean and it works 80% of the time. The remaining 20% of errors include misinterpreted numbers/characters (fixable) and some words/numbers that simply don't show up (not fixable!).
I followed the tips on the best practices page (image is 1024x768, 16-bit PNG), to no avail.
Here is an example: this sample page
https://storage.googleapis.com/ximian-cloud.appspot.com/sample_page.png
has the number 177 (under Observations, to the right of "RT ARM"), which is not detected at all by the API ...
I tried:
Twice the resolution (2048 x 1536)
BMP 24-bit
BMP 32-bit
All of the above, in grayscale
All of the above, inverted (black background and white letters)
No luck ...
Any hint on why this is happening? Is it the API, or could my image format use some adjustment?
This is an error that has already been noticed and registered, and it is now in the process of being fixed, hopefully quite soon.

Does tesseract-ocr perform any image pre-processing?

I am currently working with the Tesseract OCR engine and I'm using it in conjunction with OpenCV for pre-processing the image before sending it to the OCR engine. However, I was wondering if Tesseract itself is performing some image pre-processing before extracting the text. If so, what are the methods that Tesseract implements?
My objective is to ensure I don't perform redundant pre-processing methods. Some of the pre-processing methods I perform are adaptiveThreshold and GaussianBlur.
Any help/guidance would be much appreciated!
EDIT:
I understand Tesseract does basic image pre-processing. I would like to know if it is possible to bypass these methods and directly feed in an image that I processed manually (again, in order to avoid redundant processing on the image).
Tesseract uses the Leptonica library for various pre-processing operations, such as the Otsu binarization algorithm, dilation, erosion, and so on.
But because these operations are not tuned to your data, they will produce bad results in some cases.
For more information read this page.

Extracting part of a scanned document (personal ID) - which library and method to choose?

I have to process a lot of scanned IDs and I need to extract photos from them for further processing.
Here's a fictional example:
The problem is that the scans are not perfectly aligned (rotated up to 10 degrees). So I need to find their position, rotate them and cut out the photo. This turned out to be a lot harder than I originally thought.
I checked OpenCV, and the only thing I found was rectangle detection, but it didn't give me good results: the rectangle doesn't always match well enough on the samples. Also, its image-matching algorithm only works for non-rotated images, since it's just a brute-force comparison.
So I thought about using ARToolkit (an augmented reality lib), because I know that it is able to locate a given marker on an image very precisely. But it seems that the markers have to be very simple, so I can't use a constant part of the document for this purpose (please correct me if I'm wrong). Also, I found it super hard to compile on Ubuntu 11.10.
OCR - haven't tried this one yet, and before I start my research I'd be thankful for any suggestions on what to look for.
I'm looking for a C (preferable)/C++ solution. Python is an option too.
If you don't find another ideal solution, one method I ended up using for OCR preprocessing in the past was to convert the source images to PPM and use unpaper in Ubuntu. You can attempt to deskew the image based on whichever sides you specify as having clearly-defined edges, and there is an option to bypass the filters that would normally be applied to black and white text. You probably don't want those for images.
Example for images skewed no more than 15 degrees, using the bottom and right edges to detect rotation:
unpaper -n -dn bottom,right -dr 15 input.ppm output.ppm
unpaper was written in C, if the source is any help to you.

What is the difference between ImageMagick and GraphicsMagick?

I've found myself evaluating both of these libs. Apart from what the GraphicsMagick comparison says, I see that ImageMagick still gets updates, and it seems that the two are almost identical.
I'm just looking to do basic image manipulation in C++ (i.e. image load, filters, display); are there any differences I should be aware of when choosing between these libraries?
As with many things in life, different people have different ideas about what is best. If you ask a landscape photographer who wanders around in the rain in Scotland's mountains which is the best camera in the world, he's going to tell you a light-weight, weather-sealed camera. Ask a studio photographer, and he'll tell you the highest resolution one with the best flash sync speed. And if you ask a sports photographer he'll tell you the one with the fastest autofocus and highest frame rate. So it is with ImageMagick and GraphicsMagick.
Having answered around 2,000 StackOverflow questions on ImageMagick over the last 5+ years, I make the following observations...
In terms of popularity...
ImageMagick questions on SO outnumber GraphicsMagick questions by a factor of 12:1 (7,375 questions vs 611 at May 2019), and
ImageMagick followers on SO outnumber GraphicsMagick followers by 15:1 (387 followers versus 25 at May 2019)
In terms of performance...
I am happy to concede that GraphicsMagick may be faster for some, but not all problems. However, if speed is your most important consideration, I think you should probably be using either libvips, or parallel code on today's multi-core CPUs or heavily SIMD-optimised (or GPU-optimised) libraries like OpenCV.
In terms of features and flexibility...
There is one very clear winner here - ImageMagick. My experience is that there are many features missing from GraphicsMagick which are present in ImageMagick and I list some of these below, in no particular order.
I freely admit I am not as familiar with GraphicsMagick as I am with ImageMagick, but I made my very best effort to find any mention of the features in the most recent GraphicsMagick source code. So, for Canny Edge Detector, I ran the following command on the GM source code:
find . -type f -exec grep -i Canny {} \;
and found nothing.
Canny Edge detector
This appears to be completely missing in GM. See -canny radiusxsigma{+lower-percent}{+upper-percent} in IM.
See example here and sample of edge-detection on Lena image:
Parenthesised processing, sophisticated re-sequencing
This is a killer feature of ImageMagick that I frequently sorely miss when having to use GM. IM can load, or create, or clone a whole series of images and apply different processing selectively to specific images and re-sequence, duplicate and re-order them very simply and conveniently. It is hard to convey the incredible flexibility this affords you in a short answer.
Imagine you want to do something fairly simple like load image A and blur it, load image B and make it greyscale and then place the images side-by-side with Image B on the left. That looks like this with ImageMagick:
magick imageA.png -blur x3 \( imageB.png -colorspace gray \) +swap +append result.png
You can't even get started with GM; it will complain about the parentheses. If you remove them, it will complain about swapping the image order. If you remove that too, it will apply the greyscale conversion to both images (because it doesn't understand parentheses) and place imageA on the left.
See the following sequencing commands in IM:
-swap
-clone
-duplicate
-delete
-insert
-reverse
fx DIY Image Processing Operator
IM has the -fx operator, which allows you to create and experiment with incredibly sophisticated image processing. You can have a function evaluated for every single pixel in an image. The function can be as complicated as you like (save it in a file if you want to) and can use all mathematical operations, ternary-style if statements, references to pixels (even in other images) and their brightness or saturation, and so on.
Here are a couple of examples:
magick rose: -channel G -fx 'sin(pi*i/w)' -separate fx_sine_gradient.gif
magick -size 80x80 xc: -channel G -fx 'sin((i-w/2)*(j-h/2)/w)/2+.5' -separate fx_2d_gradient.gif
A StackOverflow answer that uses this feature to great effect in processing green-screen (chroma-keyed) images is here.
Fourier (frequency domain) Analysis
There appears to be no mention of forward or reverse Fourier Analysis in GM, nor the High Dynamic Range support (see later) that is typically required to support it. See -fft in IM.
Connected Component Analysis / Labelling/ Blob Analysis
There appears to be no "Connected Component Analysis" in GM - also known as "labelling" and "Blob Analysis". See -connected-components connectivity for 4- and 8-connected blob analysis.
This feature alone has provided 60+ answers - see here.
Hough Line Detection
There appears to be no Hough Line Detection in GM. See -hough-lines widthxheight{+threshold} in IM.
See description of the feature here and following example of detected lines:
Moments and Perceptual Hash (pHash)
There appears to be no support for image moments calculation (centroids and higher orders), nor Perceptual Hashing in GM. See -moments in IM.
Morphology
There appears to be no support for Morphological processing in GM. In IM there is sophisticated support for:
dilation
erosion
morphological opening and closing
skeletonisation
distance morphology
top hat and bottom hat morphology
Hit and Miss morphology - line ends, line junctions, peaks, ridges, Convex Hulls etc
See all the sophisticated processing you can do with this great tutorial.
Contrast Limited Adaptive Histogram Equalisation - CLAHE
There appears to be no support for Contrast Limited Adaptive Histogram Equalisation in GM. See -clahe widthxheight{%}{+}number-bins{+}clip-limit{!} in IM.
HDRI - High Dynamic Range Imaging
There appears to be no support for High Dynamic Range Imaging in GM - just 8, 16, and 32-bit integer types.
Convolution
ImageMagick supports many types of convolution:
Difference of Gaussians DoG
Laplacian
Sobel
Compass
Prewitt
Roberts
Frei-Chen
None of these are mentioned in the GM source code.
Magick Persistent Register (MPR)
This is an invaluable feature present in ImageMagick that allows you to write intermediate processing results to named chunks of memory during processing without the overhead of writing to disk. For example, you can prepare a texture or pattern and then tile it over an image, or prepare a mask and then alter it and apply it later in the same processing without going to disk.
Here's an example:
magick tree.gif -flip -write mpr:tree +delete -size 64x64 tile:mpr:tree mpr_tile.gif
Broader Colourspace Support
IM supports the following colourspaces not found in GM:
CIELab
HCL
HSI
LMS
others.
Pango Support
IM supports Pango Text Markup Language which is similar to HTML and allows you to annotate images with text that changes:
font, colour, size, weight, italics
subscript, superscript, strike-through
justification
mid-sentence and much, much more. There is a great example here.
Shrink-on-load with JPEG
This invaluable feature allows the library to shrink JPEG images as they are read from disk, so that only the necessary coefficients are read, so the I/O is lessened, and the memory consumption is minimised. It can massively improve performance when down-scaling images.
See example here.
Defined maximum JPEG size when writing
IM supports the much-requested option to specify a maximum filesize when writing JPEG files, -define jpeg:extent=400KB for example.
Polar coordinate transforms
IM supports conversion between cartesian and polar coordinates, see -distort polar and -distort depolar.
Statistics and operations on customisable areas
With its -statistic MxN operator, ImageMagick can generate many useful kinds of statistics and effects. For example, you can set each pixel in an image to the gradient (difference between brightest and darkest) of its 5x3 neighbourhood:
magick image.png -statistic gradient 5x3 result.png
Or you can set each pixel to the median of its 1x200 neighbourhood:
magick image.png -statistic median 1x200 result.png
See example of application here.
Sequences of images
ImageMagick supports sequences of images, so if you have a set of very noisy images shot at high ISO, you can load up the entire sequence of images and, for example, take the median or average of all images to reduce noise. See the -evaluate-sequence operator. I do not mean the median in a surrounding neighbourhood in a single image, I mean by finding the median of all images at each pixel position.
The above is not an exhaustive list by any means, they are just the first few things that came to mind when I thought about the differences. I didn't even mention support for HEIC (Apple's format for iPhone images), increasingly common High Dynamic Range formats such as EXR, or any others. In fact, if you compare the file formats supported by the two products (gm convert -list format and magick identify -list format) you will find that IM supports 261 formats and GM supports 192.
As I said, different people have different opinions. Choose the one that you like and enjoy using it.
As always, I am indebted to Anthony Thyssen for his excellent insights and discourse on ImageMagick at https://www.imagemagick.org/Usage/ Thanks also to Fred Weinhaus for his examples.
From what I have read, GraphicsMagick is more stable and faster.
I did a couple of unscientific tests and found gm to be twice as fast as im (doing a resize).
I found ImageMagick to be incredibly slow for processing TIFF group-4 images (B&W document images), mainly due to the fact that it converts from 1-bit-per-pixel to 8 and back again to do any image manipulation. The GraphicsMagick group overhauled the TIFF format support with their version 1.2, and it is much faster at processing these types of images than the original ImageMagick was. The current GraphicsMagick stable release is at 1.3.5.
I use ImageMagick when speed isn't a factor. However on the server side, where tens of thousands of images are being processed daily, GraphicsMagick is quite noticeably faster - in some cases up to 50% faster in benchmarks!
History
GraphicsMagick was forked from ImageMagick back in 2002 due to disputes between the founding developers; thus they share the same codebase.
Ref : https://en.wikipedia.org/wiki/GraphicsMagick
Goal
graphicsmagick
focuses on a simple, stable, and clearer codebase/architecture
imagemagick
focuses on rolling out new features and extending a wider toolbase
Other than speed, imagemagick adds a number of CLI tools to the terminal shell, whereas graphicsmagick is a single tool which you can call.
CLI interface design
graphicsmagick
gm <command> <options> <file>
imagemagick
convert <options> <file>
compare <options> <file>
imho, I prefer (in fact, only use) graphicsmagick (gm) over imagemagick, as the latter has a higher chance of tool-name clashes, which cause lots of issues in finding out why certain tools are not running, especially during server-side automation tasks. In summary, graphicsmagick has a much clearer design.
Imagine a binary called convert in a project: is it imagemagick's convert or your own rolled tool in the project that will be called?
list of imagemagick tools (including convert, compare, display) : https://imagemagick.org/script/command-line-tools.php
list of graphicsmagick commands :
http://www.graphicsmagick.org/utilities.html
note: as of v7, as mentioned by Mark S, imagemagick is now distributed as a single binary, which also supports the older v6 commands.
Performance
a simple memory consumption test can be found here :
https://coderwall.com/p/1l7h-a/imagemagick-bloat-graphicsmagick
Dependencies
GraphicsMagick depends on 36 libraries whereas ImageMagick requires 64. Ref : http://www.graphicsmagick.org/1.3/FAQ.html
Note that GraphicsMagick provides API and ABI stability, which isn't part of the guarantee for ImageMagick. This would be important in the long run unless you are vendoring all your dependencies.
GraphicsMagick was an early fork from Imagemagick. You can read about Imagemagick's history and the fork to GraphicsMagick at https://imagemagick.org/script/history.php. It seems that Imagemagick has continued to be developed rather extensively, while GraphicsMagick has remained more or less stagnant since the fork.

C++ Library for image recognition: images containing words to string

Does anyone know of a c++ library for taking an image and performing image recognition on it such that it can find letters based on a given font and/or font height? Even one that doesn't let you select a font would be nice (eg: readLetters(Image image).
I've been looking into this a lot lately. Your best bet is simply Tesseract. If you need layout analysis on top of the OCR, then go with Ocropus (which in turn uses Tesseract to do the OCR). Layout analysis refers to being able to detect the position of text on the image and do things like line segmentation, block segmentation, etc.
I've found some really good tips through experimentation with Tesseract that are worth sharing. Basically I had to do a lot of preprocessing for the image.
Upsize/Downsize your input image to 300 dpi.
Remove color from the image. Grayscale is good. I actually used a dither threshold and made my input black and white.
Cut out unnecessary junk from your image.
For all three above I used netpbm (a set of image manipulation tools for Unix) to get to the point where I was getting pretty much 100 percent accuracy for what I needed.
If you have a highly customized font and go with tesseract alone you have to "Train" the system -- basically you have to feed a bunch of training data. This is well documented on the tesseract-ocr site. You essentially create a new "language" for your font and pass it in with the -l parameter.
The other training mechanism I found was in Ocropus, using neural net (bpnet) training. It requires a lot of input data to build a good statistical model.
In terms of invoking them, Tesseract and Ocropus are both C++. It won't be as simple as ReadLines(Image), but there is an API you can check out. You can also invoke them via the command line.
While I cannot recommend one in particular, the term you are looking for is OCR (Optical Character Recognition).
There is tesseract-ocr which is a professional library to do this.
From their web site:
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available
I think what you want is Conjecture. Used to be the libgocr project. I haven't used it for a few years but it used to be very reliable if you set up a key.
The Tesseract OCR library gives pretty accurate results; it's a C and C++ library.
My initial results were around 80% accurate, but after applying pre-processing on the images before supplying them for OCR, the results were around 95% accurate.
What is pre-processing:
1) Binarize the bitmap (B&W worked better for me). How it could be done
2) Resample your image to 300 dpi
3) Save your image in a lossless format, such as LZW TIFF or CCITT Group 4 TIFF.