Differentiate between Handwritten and Machine printed texts - python-2.7

Is there any effective way to detect and extract only the handwritten part from a noisy image containing both handwritten and machine-printed text? The image is attached below.
https://i.stack.imgur.com/yN2Do.jpg

You can see this as a detection problem: Detect (draw axis-aligned bounding boxes around) all characters which are machine printed.
The simplest way to do this is a sliding-window + a classifier:
Crop a patch out of the image for which you want to know "is this machine-printed text?"
Apply a classifier that takes the patch as input and outputs a probability for "yes, this is printed text".
The classifier will likely be a CNN.
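A minimal sketch of the sliding-window idea in Python. The classifier itself is left as a placeholder (`classify_patch` is a hypothetical function you would replace with your trained CNN); the window size, stride and threshold are illustrative only.

```python
import cv2

def classify_patch(patch):
    """Placeholder: replace with your trained CNN.
    Should return the probability that the patch contains printed text."""
    raise NotImplementedError

def find_printed_regions(image, win=32, stride=16, threshold=0.5):
    """Slide a fixed-size window over the image and keep the windows
    that the classifier labels as machine-printed text."""
    boxes = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            patch = image[y:y + win, x:x + win]
            if classify_patch(patch) >= threshold:
                boxes.append((x, y, win, win))   # axis-aligned bounding box
    return boxes

# usage (hypothetical):
# img = cv2.imread("form.jpg", cv2.IMREAD_GRAYSCALE)
# printed_boxes = find_printed_regions(img)
```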

I assume your images share the same layout as the given image, with the content in a fixed format and known coordinates for the machine-printed text. You can use that coordinate information to separate the two text categories.
As mentioned by @Rethunk, you can also use the font information of the machine-printed text to get a more precise result.
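If the layout really is fixed, a quick sketch of the coordinate-based idea looks like this. The field names and coordinates are made-up placeholders; you would measure them once from your template form.

```python
import cv2

img = cv2.imread("form.jpg")                 # hypothetical scanned form

# Known layout: (x, y, width, height) of each machine-printed field.
printed_fields = {"name_label": (40, 60, 220, 35),
                  "date_label": (40, 120, 180, 35)}

# Mask the printed regions so only the handwritten parts remain.
handwritten_only = img.copy()
for x, y, w, h in printed_fields.values():
    handwritten_only[y:y + h, x:x + w] = 255   # paint the printed field white
cv2.imwrite("handwritten_only.jpg", handwritten_only)
```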

Related

Finding lines from text image OpenCV C++

I have images contaminated with random lines, like the following one:
I want to apply some preprocessing to them in order to find the lines (the lines that distort the writing).
I have seen some approaches, but they are in Python, not C++:
Remove noisy lines from an image
Remove non straight lines from text image
In C++ I have tried, but these are my result images:
The result I want (produced manually with Photoshop):
How can I find the lines in these images in C++ with OpenCV? Thanks.
I am not sure about this. As @Chistoph Racwit said, you might need to use some sort of OCR.
But just to try it out, I think you can apply a horizontal filter that highlights any horizontal line in the image. It might not give the best-looking result, but with some clean-up you could end up with a map of where the lines are in the image.
You can use this image to detect the lines' locations and draw them onto the original image in red.
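The question asks for C++, but as a quick illustration here is the idea in Python; the OpenCV calls map one-to-one to the C++ API. The kernel width and file names are assumptions for the sketch.

```python
import cv2

img = cv2.imread("noisy_text.png")            # hypothetical input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# A wide, flat kernel keeps only long horizontal structures (the distorting lines).
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Draw the detected line pixels in red on the original image.
img[lines > 0] = (0, 0, 255)
cv2.imwrite("lines_marked.png", img)
```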

Rekognition for numbers aligned vertically?

Rekognition does great with traditional horizontally aligned numbers but doesn't work well when numbers are vertically aligned (top to bottom). Can anyone think of a way to use Rekognition for vertically aligned numbers?
I've tried cropping and rotating the image, but with the same poor results.
I use Python, but that doesn't really matter since Rekognition does the work internally. (See the attached example; it seems very clear to me and would work perfectly if the numbers were aligned horizontally.)
You would need to write some code that looks at the bounding boxes of the returned text and makes some assumptions about how they align.
For example (a sketch follows this list):
If the bounding boxes mostly overlap horizontally
Sort them in order of the vertical position
Confirm that the vertical spacing is within a given distance
Concatenate them together into one string
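A minimal sketch of that idea with boto3, assuming the image is a local file and that all the digits sit in roughly the same column; the overlap and spacing checks from the list above are only hinted at in the comments.

```python
import boto3

rek = boto3.client("rekognition")
with open("gauge.jpg", "rb") as f:                      # hypothetical local image
    resp = rek.detect_text(Image={"Bytes": f.read()})

# Keep the individual WORD detections (the digits), ignoring LINE groupings.
words = [d for d in resp["TextDetections"] if d["Type"] == "WORD"]

# If the bounding boxes mostly overlap horizontally, sort them top-to-bottom
# (optionally also checking that the vertical gaps stay within a distance)
# and concatenate the detected text into one string.
words.sort(key=lambda d: d["Geometry"]["BoundingBox"]["Top"])
number = "".join(d["DetectedText"] for d in words)
print(number)
```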
Update: I tried putting your sample image into Rekognition and it didn't detect the numbers. However, when I cropped the image to a smaller section, it successfully detected the numbers. It also provided them back in "top-down" order.

OpenCV: Ignore text-like contours

Background
I want to detect all contours in an image that contains 2D geometric shapes, but strip away anything that looks like text.
Example input:
I tried to detect text areas with Tesseract and remove those areas afterwards. For some images where the OCR recognition is good this works fine: text areas are recognized at a fairly good rate and the contours of the recognized text can then be removed. But for most images the text is not recognized well, so I cannot remove the irrelevant text contours from the image.
Therefore my question is: How can I distinguish text-like contours from my 2D "geometric" contours?
If you don't care about the text and just want to get rid of it, then you can detect only the outer contours by passing RETR_EXTERNAL as the mode parameter to the findContours() function. That will give you the outermost contours and ignore anything contained inside the geometric shapes.
Or if you want more control, you can pass the mode parameter as RETR_TREE and then walk the tree, keeping only the top-level contours and ignoring anything below that level in the hierarchy. That way you'll get everything and you can decide later what you want to keep and what you want to ignore.
Read this page of the OpenCV documentation for information on how findContours() represents the hierarchy (the page is from a Python tutorial, but it is generic enough to follow along).
Of course that will only work if the images always look similar to the example you gave in your question, i.e. the text is always inside the geometric shapes. If you have text outside the shapes, then you could look at the size of the contours (bounding rectangles) and ignore anything that falls below a certain threshold (assuming text contours will be much smaller than your geometric shapes).
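A short sketch of the RETR_EXTERNAL approach, with the optional area filter mentioned above; the file name and area threshold are illustrative.

```python
import cv2

img = cv2.imread("shapes.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input
binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Outer contours only: anything nested inside a shape (e.g. text) is ignored.
# Note: OpenCV 3.x returns (image, contours, hierarchy) instead of two values.
contours, hierarchy = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                        cv2.CHAIN_APPROX_SIMPLE)

# Optionally drop small contours that are likely stray text outside the shapes.
min_area = 500                                          # illustrative threshold
shapes = [c for c in contours if cv2.contourArea(c) >= min_area]
```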
The contours that belong to text also form regions in your example, so you can use region properties to eliminate the unwanted regions (the text contours). I suggest using properties such as eccentricity, solidity or compactness (you can find a code example here: https://github.com/mribrahim/Blob-Detection).
For example, regular shapes and other blobs can be distinguished by their compactness value, or you can combine several of these properties.
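A minimal sketch of two of those region properties computed from OpenCV contours; the filter thresholds at the end are only illustrative and would need tuning on your images.

```python
import cv2
import numpy as np

def compactness(contour):
    """4*pi*area / perimeter^2: close to 1 for circles, small for
    ragged, elongated blobs such as text."""
    area = cv2.contourArea(contour)
    perim = cv2.arcLength(contour, True)
    return 4.0 * np.pi * area / (perim * perim) if perim > 0 else 0.0

def solidity(contour):
    """area / convex-hull area: regular geometric shapes are close to 1."""
    area = cv2.contourArea(contour)
    hull_area = cv2.contourArea(cv2.convexHull(contour))
    return area / hull_area if hull_area > 0 else 0.0

# Keep only contours that look like regular shapes (thresholds are illustrative):
# shapes = [c for c in contours if compactness(c) > 0.3 and solidity(c) > 0.8]
```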

Improve Tesseract detection quality

I am trying to extract alphanumeric characters (a-z0-9) which do not form meaningful words from an image taken with a consumer camera (including mobile phones). The characters have equal size and font type and are not formatted. The actual processing is done under Windows.
The following image shows the raw input:
After perspective processing I apply the following with OpenCV:
Convert from RGB to gray
Apply cv::medianBlur to remove noise
Convert the image to binary using adaptive thresholding cv::adaptiveThreshold
I know the number of rows and columns of the grid. Thus I simply extract each grid cell using this information.
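A sketch of the preprocessing and grid-cell extraction described above, assuming the perspective correction has already been done; the grid dimensions, blur size and threshold parameters are placeholders.

```python
import cv2

img = cv2.imread("board.jpg")                         # hypothetical input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 3)                        # remove salt-and-pepper noise
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 31, 15)

rows, cols = 5, 10                                    # known grid dimensions (example)
h, w = binary.shape
cell_h, cell_w = h // rows, w // cols
cells = [binary[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
         for r in range(rows) for c in range(cols)]
```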
After all these steps I get images which look similar to these:
Then I run tesseract (latest SVN version with latest training data) on each extracted cell image individually (I tried different -psm and -l values):
tesseract.exe -l eng -psm 11 sample.png outtext
The results produced by tesseract are not very good:
Most characters are not recognized.
The grid lines are sometimes interpreted as "l" or "i" characters.
I have already experimented with morphological operations (open, close, erode, dilate) and replaced adaptive thresholding with Otsu thresholding (THRESH_OTSU), but the results got worse.
What else could I try to improve the recognition quality? Or is there even a better method to extract the characters besides using tesseract (for instance template matching?)?
Edit (21-12-2014):
I tested simple template matching (using normalized cross-correlation and LMS), but with even worse results. However, I have made a huge step forward by extracting each character using findContours and then running Tesseract on only one character with the -psm 10 option, which interprets each input image as a single character. Additionally, I remove non-alphanumeric characters in a post-processing step. The first results are encouraging, with detection rates of 90% and better. The main problem is misdetection of "9", "g" and "q" characters.
As I said here, you can tell Tesseract to pay attention to "almost identical" characters.
Also, some of Tesseract's options will not help you in your example.
For instance, "Pocahonta5S" will most of the time become "PocahontaSS", because the digit sits inside a word made of letters; you can observe it behaving this way.
Concerning pre-processing, you are better off using a sharpening filter.
Don't forget that Tesseract always applies Otsu thresholding before reading anything.
If you want good results, sharpening plus adaptive thresholding with some other filters is a good idea.
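A quick sketch of that sharpen-then-threshold preprocessing; the kernel and threshold parameters are only an assumption and would need tuning.

```python
import cv2
import numpy as np

img = cv2.imread("characters.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input

# Simple 3x3 sharpening kernel followed by adaptive thresholding.
sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float32)
sharpened = cv2.filter2D(img, -1, sharpen_kernel)

binary = cv2.adaptiveThreshold(sharpened, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 10)
cv2.imwrite("preprocessed.png", binary)
```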
I recommend using OpenCV in combination with Tesseract.
The problem with your input images for Tesseract is the non-character regions in the image.
My own approach
To get rid of these, I would use the OpenCV findContours function to obtain all contours in your binary image. Afterwards, define some criteria to eliminate the non-character regions. For example, only take regions that lie inside the image and do not touch the border, or only take regions with a specific area or a specific height-to-width ratio. Find features that let you distinguish between character and non-character contours.
Afterwards, eliminate these non-character regions and pass the images on to Tesseract.
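A minimal sketch of such a contour filter; the area and aspect-ratio limits are illustrative assumptions, and the surviving crops would then be fed to Tesseract (e.g. with -psm 10).

```python
import cv2

def character_regions(binary, min_area=50, max_area=5000):
    """Return bounding boxes of contours that plausibly contain a single
    character: not touching the border, with reasonable area and aspect ratio."""
    h, w = binary.shape
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, bw, bh = cv2.boundingRect(c)
        if x == 0 or y == 0 or x + bw == w or y + bh == h:
            continue                                   # touches the image border
        area = bw * bh
        ratio = bh / float(bw)
        if min_area <= area <= max_area and 0.5 <= ratio <= 3.0:
            boxes.append((x, y, bw, bh))
    return boxes
```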
Just as an idea for testing this approach in general: eliminate the non-character regions manually (GIMP, Paint, ...) and give the image to Tesseract. If the result meets your expectations, you can try to eliminate the non-character regions with the method proposed above.
I suggest a similar approach I'm using in my case.
(My only problem is speed, which you should not have if it is only a few characters to compare.)
First: normalize the form to a default size and transform it:
https://www.youtube.com/watch?v=W9oRTI6mLnU
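A short sketch of that normalization step with a perspective warp; the corner coordinates and output size are made-up placeholders, and in practice the source corners would come from corner or contour detection.

```python
import cv2
import numpy as np

img = cv2.imread("photo_of_form.jpg")          # hypothetical input

# Corner points of the form in the photo and the corners of the
# normalized output, both as float32 arrays (values are illustrative).
src = np.float32([[55, 80], [610, 95], [620, 470], [40, 455]])
dst = np.float32([[0, 0], [600, 0], [600, 400], [0, 400]])

M = cv2.getPerspectiveTransform(src, dst)
normalized = cv2.warpPerspective(img, M, (600, 400))
cv2.imwrite("normalized_form.jpg", normalized)
```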
Second: Use matchTemplate
Improve template matching with many templates for one Image/ find characters on image
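And a minimal matchTemplate sketch for classifying a single extracted cell against a set of character templates; the template directory layout and file names are assumptions.

```python
import cv2
import glob
import os

# Hypothetical template set: one image per character, named like "A.png", "7.png".
templates = {os.path.splitext(os.path.basename(p))[0]:
             cv2.imread(p, cv2.IMREAD_GRAYSCALE)
             for p in glob.glob("templates/*.png")}

cell = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)    # one extracted character cell
best_char, best_score = None, -1.0
for char, tmpl in templates.items():
    res = cv2.matchTemplate(cell, tmpl, cv2.TM_CCOEFF_NORMED)  # normalized cross-correlation
    score = float(res.max())
    if score > best_score:
        best_char, best_score = char, score
print(best_char, best_score)
```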
I also played around with OCR, but I didn't like it for two reasons:
It is something of a black box, and it's hard to debug why something isn't recognized.
In my case it was never 100% accurate, no matter what I did, even for screenshots with "perfect" characters.

How to detect Text Area from image?

I want to detect the text area in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input is text only, but it fails when the input image contains non-text content, so I want to detect only the text content in the image. Any idea how to do that would be helpful. Thanks.
Take a look at this bounding box technique demonstrated with OpenCV code:
Input:
Eroded:
Result:
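A quick sketch of that bounding-box technique in Python (the original demonstration uses erosion on a light-on-dark image; here the binary image is inverted, so dilation merges neighbouring characters into blocks). The kernel size, iteration count and area threshold are illustrative.

```python
import cv2

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)     # hypothetical input
binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Dilate so neighbouring characters merge into solid text blocks.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
dilated = cv2.dilate(binary, kernel, iterations=3)

contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
result = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 300:                                     # skip small specks
        cv2.rectangle(result, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("text_boxes.png", result)
```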
Well, I'm not well-experienced in image processing, but I hope I could help you with my theoretical approach.
In most cases, text forms parallel, horizontal rows, and the space between the rows contains mostly background pixels. This can be exploited to solve the problem.
So, if you project every pixel row of the image onto a single column, you get a one-pixel-wide image as output. When the input image contains text, the output will very likely show a periodic pattern in which dark areas are repeatedly followed by brighter areas. These darker "groups" indicate the positions of the text content, while the brighter "groups" indicate the gaps between the individual rows.
You'll probably find that the brighter areas are much smaller than the others. Text is also much more regular than most other picture elements, so it should be easy to separate.
You have to implement a procedure that detects these periodic recurrences. Once the script can determine that the input picture has these characteristics, there is a high chance that it contains text. (However, this approach cannot distinguish between actual text and plain horizontal stripes...)
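A small sketch of that projection idea, assuming a binarized page image; the 0.2 threshold and the crude periodicity check are placeholders for the more careful procedure described above.

```python
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)      # hypothetical input
binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Sum every row: text rows give high values, inter-line gaps give values near zero.
profile = binary.sum(axis=1)

# Rows whose ink content exceeds a fraction of the maximum count as "text rows".
is_text_row = profile > 0.2 * profile.max()             # 0.2 is an illustrative threshold

# A roughly periodic alternation of text rows and gaps suggests the image contains text.
transitions = np.count_nonzero(np.diff(is_text_row.astype(int)))
print("text-like row bands:", transitions // 2)
```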
For the next step, you must find a way to determine the boundaries of the paragraphs using the method described above. I'm thinking of a fairly naive algorithm which would divide the input image into smaller, narrow stripes (50-100 px) and check each area separately. It would then compare the results to build a map of the areas likely to contain text. This method would not be very accurate, but that probably does not bother the OCR system.
And finally, you need to use the text map to run the OCR only on the desired locations.
On the other hand, this method will fail if the input text is rotated by more than about 3-5 degrees. There is another drawback: if you have only a few rows, the pattern search becomes very unreliable. More rows, more accuracy...
I wrote an answer to a similar question which may be useful to readers who share this problem, since it covers several different ways of finding text areas. When I looked this question up, the existing answers did not fit my problem case, so the link below may help others too:
Detect text area in an image using python and opencv
At present, the best way to detect text is with EAST (An Efficient and Accurate Scene Text Detector).
The EAST pipeline is capable of predicting words and lines of text at arbitrary orientations on 720p images, and furthermore, can run at 13 FPS, according to the authors.
EAST quick start tutorial can be found here
EAST paper can be found here
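A minimal sketch of running EAST through OpenCV's dnn module, assuming you have downloaded the pretrained frozen_east_text_detection.pb model separately (the input size and file names are assumptions); decoding the outputs into boxes is left to the linked tutorial.

```python
import cv2

# Assumes the pretrained EAST model file has been downloaded separately.
net = cv2.dnn.readNet("frozen_east_text_detection.pb")

img = cv2.imread("scene.jpg")                          # hypothetical input
H, W = 320, 320                                        # EAST input must be a multiple of 32
blob = cv2.dnn.blobFromImage(img, 1.0, (W, H),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
# `scores` holds per-cell text confidences and `geometry` the rotated-box
# parameters; decode them (e.g. as in the linked tutorial) and apply
# non-maximum suppression to obtain the final word boxes.
```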