Optically detect specific characters in a text document - computer-vision

We try to detect certain rare characters when doing OCR of scans of old documents.
What is the state-of-the-art approach to text-object detection in document image analysis and OCR? Are standard SURF or SIFT algorithms applicable?

Related

Bold text detection

I am currently working on a project where I need to detect bold text in an image that mixes several font sizes (so mathematical morphology is not an option). This detection will run in parallel with an OCR system (tesseract) to find out which information (set in bold) is important in a document.
I already tested the wordFontAttribute() function of tesseract, but it is inconsistent: it gives poor bold-detection results and decreases the performance of my OCR system, because using this function requires an old version of tesseract (v3).
I found a couple of research papers on font style detection, and therefore on bold detection ("Automatic Detection of Italic, Bold and All-Capital Words in Document Images" and "Script Independent Detection of Bold Words in Multi Font-size Documents", both on Google Scholar).
I was wondering whether there is a code implementation of this research online.
Any other ideas on bold detection are also welcome.

Improving tesseract performance for specific tasks

I have already read the answers to this question.
I have a series of images that contain a single word between 3-10 characters. They are images created on a computer itself, so quality of the images is consistent and the images don't have any noise on them. The fonts are quite large (about 30 pixels in height). This should already be easy enough for tesseract to read accurately but what are some techniques I can use for improving the speed, even if it's only an improvement of a few milliseconds?
The character set consists of uppercase letters only. As the OCR task in this case is very specific, would it help if I train the tesseract engine with this specific font and font size, or is that overkill?
Edited to include sample
Other than tesseract, are there any other solutions that I can use with C/C++ that can provide better performance? Could it be done faster with OpenCV? Compatibility with Linux is preferred.
If all the letters have the same size and style, you can try something really simple, like running blob detection followed by template matching of the individual letters. I am not sure how it will compare with tesseract, but it is a very simple experiment. (Additionally, lowering the resolution will speed things up.)
You can also have a look at this question: Simple Digit Recognition OCR in OpenCV-Python; it may be relevant.
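The blob-detection-plus-template-matching idea can be sketched in plain Python on a tiny binary image. This is only an illustration under stated assumptions: the 3x3 glyphs in TEMPLATES and the IMAGE grid are made up, every letter is assumed to be a single 4-connected blob of exactly the template size, and the helper names (find_blobs, match_score, read_word) are hypothetical. A real experiment would more likely use OpenCV's connected-component and template-matching routines.

```python
from collections import deque

# Hypothetical 3x3 glyph templates (1 = ink, 0 = background).
TEMPLATES = {
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "O": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# Made-up test image spelling "LOT", with one blank column between letters.
IMAGE = [
    [1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0],
]


def find_blobs(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink blobs."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill that tracks the blob's bounding box.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Sort left to right so the letters come out in reading order.
    return sorted(boxes, key=lambda box: box[1])


def match_score(patch, template):
    """Fraction of pixels that agree; 0.0 when the shapes differ."""
    if len(patch) != len(template) or len(patch[0]) != len(template[0]):
        return 0.0
    agree = sum(p == t for prow, trow in zip(patch, template)
                for p, t in zip(prow, trow))
    return agree / (len(patch) * len(patch[0]))


def read_word(grid, templates):
    """Label each blob with the best-matching template name."""
    letters = []
    for top, left, bottom, right in find_blobs(grid):
        patch = [row[left:right + 1] for row in grid[top:bottom + 1]]
        best = max(templates, key=lambda name: match_score(patch, templates[name]))
        letters.append(best)
    return "".join(letters)
```

Calling read_word(IMAGE, TEMPLATES) runs both stages on the toy image. Letters made of several disconnected parts, or letters that touch each other, would break the one-blob-per-letter assumption and need extra handling.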

Feature Extraction of a binary image

I am doing an OCR project using C++ and OpenCV. I have some black-and-white images of separated handwritten characters. I want to extract unique features from those images in order to classify them using LIBSVM. Can anyone tell me which feature extraction algorithms in OpenCV are suitable?
You can read this. And try this.

How to detect style of text in an image using opencv?

I am currently working with OpenCV. I have an image containing text, and I want to find out the style (bold, italic) of the text. How can I achieve this? Thanks
What you can do is the following (assuming a letter-by-letter approach):
Using segmentation techniques, first segment out the letters.
Compare the segmented letters against your own data set of pre-segmented/pre-filtered letters to find the font style.
The comparison can be done using various features: SIFT, SURF, BRISK, Harris corners, template matching, or something you come up with yourself. My best guess would be to go with Haar features and training.
Once you have a set of features for a letter, finding the closest candidate in your pre-filtered data set can be done with different techniques such as KNN, Euclidean distance, etc. If you use Haar features, OpenCV can help a lot with retrieval.
Eventually you might end up doing some OCR that includes font style.
OpenCV has a set of built-in feature descriptors, which you can read about here
Good Luck!
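The matching step described above (closest candidate by KNN with Euclidean distance) could look roughly like the sketch below. Everything in it is an assumption for illustration: the two-number feature vectors (ink density, normalized slant) stand in for whatever descriptor you actually extract, the REFERENCE values are invented, and knn_style is a hypothetical helper, not an OpenCV function.

```python
import math
from collections import Counter

# Invented reference set: (ink_density, normalized_slant) -> style label.
REFERENCE = [
    ((0.20, 0.05), "regular"), ((0.22, 0.00), "regular"), ((0.18, 0.02), "regular"),
    ((0.40, 0.03), "bold"), ((0.42, 0.00), "bold"), ((0.38, 0.05), "bold"),
    ((0.20, 0.30), "italic"), ((0.21, 0.35), "italic"), ((0.19, 0.28), "italic"),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_style(query, dataset, k=3):
    """Majority vote among the k reference letters closest to the query."""
    nearest = sorted(dataset, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, knn_style((0.41, 0.02), REFERENCE) votes among the three nearest reference letters. Features on very different scales should be normalized first, otherwise the largest-valued feature dominates the distance.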
This might help you, I know it's not exact. But it will suffice for my similar project.
"Typefont is an experimental library that detects the font of a text in a image."
https://github.com/Vasile-Peste/Typefont

C++ Library for image recognition: images containing words to string

Does anyone know of a C++ library for taking an image and performing image recognition on it such that it can find letters based on a given font and/or font height? Even one that doesn't let you select a font would be nice (e.g. readLetters(Image image)).
I've been looking into this a lot lately. Your best bet is simply Tesseract. If you need layout analysis on top of the OCR, then go with Ocropus (which in turn uses Tesseract to do the OCR). Layout analysis refers to being able to detect the position of text in the image and do things like line segmentation, block segmentation, etc.
I've found some really good tips through experimentation with Tesseract that are worth sharing. Basically I had to do a lot of preprocessing for the image.
Upsize/Downsize your input image to 300 dpi.
Remove color from the image. Grey scale is good. I actually used a dither threshold and made my input black and white.
Cut out unnecessary junk from your image.
For all three of the above I used netpbm (a set of image manipulation tools for Unix) to get to the point where I was getting pretty much 100 percent accuracy for what I needed.
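The three preprocessing steps can be illustrated without any image library, on an image held as nested lists of (r, g, b) tuples. This is a sketch under assumptions, not the netpbm pipeline used above: the function names are hypothetical, a plain fixed threshold stands in for the dither threshold, and "junk removal" is reduced to cropping to the bounding box of the dark pixels.

```python
def scale_factor(current_dpi, target_dpi=300):
    """Factor by which to resize an image so that it ends up at target_dpi."""
    return target_dpi / current_dpi


def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to 0-255 gray levels (BT.601 luma weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]


def to_black_and_white(gray, threshold=128):
    """Fixed threshold: values below it become black (0), the rest white (255)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


def crop_to_ink(bw, ink=0):
    """Crop to the bounding box of the ink pixels; return the input if there are none."""
    rows = [i for i, row in enumerate(bw) if ink in row]
    if not rows:
        return bw
    cols = [j for j in range(len(bw[0])) if any(row[j] == ink for row in bw)]
    return [row[cols[0]:cols[-1] + 1] for row in bw[rows[0]:rows[-1] + 1]]
```

A 150 dpi scan, for instance, would be enlarged by scale_factor(150) before recognition, then passed through to_gray, to_black_and_white and crop_to_ink in that order.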
If you have a highly customized font and go with tesseract alone you have to "Train" the system -- basically you have to feed a bunch of training data. This is well documented on the tesseract-ocr site. You essentially create a new "language" for your font and pass it in with the -l parameter.
The other training mechanism I found was with Ocropus, using neural net (bpnet) training. It requires a lot of input data to build a good statistical model.
In terms of invoking them, Tesseract and Ocropus are both C++. It won't be as simple as ReadLines(Image), but there is an API you can check out. You can also invoke them via the command line.
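As a rough sketch of the command-line route, the snippet below builds the usual `tesseract <image> <output base> -l <language>` invocation from Python. The helper names and the "myfont" language name are hypothetical; "myfont" stands for whatever custom "language" you trained for your font. The runner assumes the tesseract executable is installed and on PATH.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for a basic tesseract run.

    tesseract writes the recognized text to <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    """Run tesseract if the executable is available; raise otherwise."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
```

With a custom-trained font, the call would be something like run_tesseract("word.png", "word", lang="myfont"), which is equivalent to typing `tesseract word.png word -l myfont` in a shell.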
While I cannot recommend one in particular, the term you are looking for is OCR (Optical Character Recognition).
There is tesseract-ocr which is a professional library to do this.
From their website:
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available
I think what you want is Conjecture. Used to be the libgocr project. I haven't used it for a few years but it used to be very reliable if you set up a key.
The Tesseract OCR library gives pretty accurate results; it's a C and C++ library.
My initial results were around 80% accurate, but after pre-processing the images before passing them to the OCR, the results were around 95% accurate.
The pre-processing consists of:
1) Binarize the bitmap (B&W worked better for me). How it could be done
2) Resample your image to 300 dpi
3) Save your image in a lossless format, such as LZW TIFF or CCITT Group 4 TIFF.
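For the binarization step, one common way to pick the black/white cut-off automatically is Otsu's method. The sketch below is a plain-Python version on a flat list of 0-255 gray values. It is offered as one possible option and is not necessarily what was used for the results above; the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight0, sum0 = 0, 0
    for t in range(256):
        weight0 += hist[t]
        sum0 += t * hist[t]
        if weight0 == 0:
            continue  # no pixels in the dark class yet
        weight1 = total - weight0
        if weight1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        between_var = weight0 * weight1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels):
    """Map each pixel to 0 (dark) or 255 (light) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

On a scan with clearly separated ink and paper levels, binarize(pixels) yields a pure black-and-white pixel list, which can then be resampled and saved in a lossless format as described in steps 2 and 3.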