I'm trying to use Microsoft's Computer Vision OCR API to get information from a table on an image. The trouble that I'm having is that the data returned typically has all sorts of qwerky regions going on and I'm attempting to piece all the regions together to get full lines of readable and parse-able text.
The only way I've thought of that makes any sense is to use the orientation to rotate the bounding box coordinates and check which "lines" are within a given percentage of the height of another given bounding box - perhaps 20% or so.
This is literally the only way I've thought of so far and I'm beginning to think I'm over complicating this; is there a standard way that people tend to build up OCR regions to get readable text?
There is no standard way as such. However, people do go with the option of REGEX, depending on the requirement.
Azure OCR returns the JSON Response as words and their bounding boxes. From there on, it is up to you to interpret the result. The ocr apis do not help with this task.
As a start, regex is a great way to parse text data. Or try a machine learning approach as described in this reddit post: https://www.reddit.com/r/MachineLearning/comments/53ovp9/extracting_a_total_cost_from_ocr_paper_receipt/
Related
I raised this question due to curiousity while using Google Goggle and Google's "Search by Image".
If you try giving Google an image to search, it can show you some results. Identical images work best (of course), but taken photo of various objects could be difficult.
I guess Google Goggle has workaround a bit by using text recognition and image matching recognition. If text recognition found the text, for instance, "SONY", then things might get simpler. If a brand's image is detected, then things should be simpler as well. The same goes with other famous brand and famous landmark, such as an Eiffel Tower. Having text and brand's image could help recognize things easily.
But if we are to search for something more obscure (need a better wording here), for instance, take this ramen image.
If you put this image into Google, you will get images of various other images that have similar colors and sometimes similar shape. Heck, there are other ramen images in the result, but I think it would be better if these ramen images are up in the top, since we input a ramen image, and our context here is ramen.
So here is my question, will it be possible to create such a software that can understand the context of the image? How can we express the context in the software?
Man, you just pointet out the very reason why so much people work on computer vision.
Is is quite easy to mathematically describe objects. Color, shape, density, . . .
All those can be calculated easily.
But computer vision becomes very complex when talking about "real life objects".
Angle, luminosity, and simply non consistency make it really almost impossible to detect an object accurately.
When working on computer vision, you should always ask yourself : what makes the object I want to recognize unique ?
What descriptor can I use that no other object possess ?
Ask yourself the question for theses ramen. Let's say I simply want to detect ramens.
What if the color of the soup changes? What if the meat is bigger ?
If you want to know more, you should read about pattern recognition and pattern matching.
And if you can find the solution to this kind of problems in a generic way, you can register for the nobel price I think :)
Some things are quite well known nowadays, like face recognition or OCR; but they are often quite specialized and apply to only one domain.
Think about it, even Google's image search algorithm sucks when you feed it with ramen.
It is pretty efficient with sudoku though, as he knows exactly what he is searching for.
All the difference is made in training, where you give a list of assumptions to help the algorithm.
So basically you got it. either you create a really nice computer vision system good at detecting one thing based on a lot of assumptions, or an "ok" but quite generic one :).
The choice mostly depends on your application
I have to process a lot of scanned IDs and I need to extract photos from them for further processing.
Here's a fictional example:
The problem is that the scans are not perfectly aligned (rotated up to 10 degrees). So I need to find their position, rotate them and cut out the photo. This turned out to be a lot harder than I originally thought.
I checked OpenCV and the only thing I found was rectangle detection but it didn't give me good results: the rectangle not always matches good enough on samples. Also its image matching algorithm works only for not-rotated image since it's just a brute force comparison.
So I though about using ARToolkit (augmented reality lib) because I know that it's able to very precisely locate given marker on an image. But it it seems that the markers have to be very simple, so I can't use a constant part of the document for this purpose (please correct me if I'm wrong). Also, I found it super-hard to compile it on Ubuntu 11.10.
OCR - haven't tried this one yet and before I start my research I'd be thankful for any suggestions what to look for.
I look for a C(preferable)/C++ solution. Python is an option too.
If you don't find another ideal solution, one method I ended up using for OCR preprocessing in the past was to convert the source images to PPM and use unpaper in Ubuntu. You can attempt to deskew the image based on whichever sides you specify as having clearly-defined edges, and there is an option to bypass the filters that would normally be applied to black and white text. You probably don't want those for images.
Example for images skewed no more than 15 degrees, using the bottom and right edges to detect rotation:
unpaper -n -dn bottom,right -dr 15 input.ppm output.ppm
unpaper was written in C, if the source is any help to you.
What kind of debugging is available for image processing/computer vision/computer graphics applications in C++? What do you use to track errors/partial results of your method?
What I have found so far is just one tool for online and one for offline debugging:
bmd: attaches to a running process and enables you to view a block of memory as an image
imdebug: enables printf-style of debugging
Both are quite outdated and not really what I would expect.
What would seem useful for offline debugging would be some style of image logging, lets say a set of commands which enable you to write images together with text (probably in the form of HTML, maybe hierarchical), easy to switch off at both compile and run time, and the least obtrusive it can get.
The output could look like this (output from our simple tool):
http://tsh.plankton.tk/htmldebug/d8egf100-RF-SVM-RBF_AC-LINEAR_DB.html
Are you aware of some code that goes in this direction?
I would be grateful for any hints.
Coming from a ray tracing perspective, maybe some of those visual methods are also useful to you (it is one of my plans to write a short paper about such techniques):
Surface Normal Visualization. Helps to find surface discontinuities. (no image handy, the look is very much reminiscent of normal maps)
color <- rgb (normal.x+0.5, normal.y+0.5, normal.z+0.5)
Distance Visualization. Helps to find surface discontinuities and errors in finding a nearest point. (image taken from an abandoned ray tracer of mine)
color <- (intersection.z-min)/range, ...
Bounding Volume Traversal Visualization. Helps visualizing a bounding volume hierarchy or other hierarchical structures, and helps to see the traversal hotspots, like a code profiler (e.g. Kd-trees). (tbp of http://ompf.org/forum coined the term Kd-vision).
color <- number_of_traversal_steps/f
Bounding Box Visualization (image from picogen or so, some years ago). Helps to verify the partitioning.
color <- const
Stereo. Maybe useful in your case as for the real stereographic appearance. I must admit I never used this for debugging, but when I think about it, it could prove really useful when implementing new types of 3d-primitives and -trees (image from gladius, which was an attempt to unify realtime and non-realtime ray tracing)
You just render two images with slightly shifted position, focusing on some point
Hit-or-not visualization. May help to find epsilon errors. (image taken from metatrace)
if (hit) color = const_a;
else color = const_b
Some hybrid of several techniques.
Linear interpolation: lerp(debug_a, debug_b)
Interlacing: if(y%2==0) debug_a else debug_b
Any combination of ideas, for example the color-tone from Bounding Box Visualization, but with actual scene-intersection and lighting applied
You may find some more glitches and debugging imagery on http://phresnel.org , http://phresnel.deviantart.com , http://picogen.deviantart.com , and maybe http://greenhybrid.deviantart.com (an old account).
Generally, I prefer to dump bytearray of currently processed image as raw data triplets and run Imagemagick to create png from it with number e.g img01.png. In this way i can trace the algorithms very easy. Imagemagick is run from the function in the program using system call. This make possible do debug without using any external libs for image formats.
Another option, if you are using Qt is to work with QImage and use img.save("img01.png") from time to time like a printf is used for debugging.
it's a bit primitive compared to what you are looking for, but i have done what you suggested in your OP using standard logging and by writing image files. typically, the logging and signal export processes and staging exist in unit tests.
signals are given identifiers (often input filename), which may be augmented (often process name or stage).
for development of processors, it's quite handy.
adding html for messages would be simple. in that context, you could produce viewable html output easily - you would not need to generate any html, just use html template files and then insert the messages.
i would just do it myself (as i've done multiple times already for multiple signal types) if you get no good referrals.
In Qt Creator you can watch image modification while stepping through the code in the normal C++ debugger, see e.g. http://labs.qt.nokia.com/2010/04/22/peek-and-poke-vol-3/
Good night :)
I am currently playing with the DevIL library that allows me to load in image and check RGB values per pixel. Just as a personal learning project, I'm trying to write a very basic OCR system for a couple of images I made myself in Photoshop.
I am successfully able to remove all the distortions in the image and I'm left with text and numbers. I am currently not looking for an advanced neural network that learns from input. I want to start out relatively easy and so I've set out to identify the individual characters and count the pixels in those characters.
I have two problems:
Identifying the individual characters.
Most importantly: I need an algorithm to count connected pixels (of the same color) without counting pixels I've previously counted. I have no mathemathical background so this is the biggest issue for me.
Any help in the matter is appreciated, thanks.
edit:
I have tagged this question as C++ because that is what I am currently using. However, pseudo-code or easily readable code from another language is also fine.
The flood fill algorithm will work for counting the included pixels, as long as you have the images filtered down to simple black & white bitmaps.
Having said that, you can perform character recognition by comparing each character to a set of standard images of each character in your set, measuring the similarity, and then choosing the character with the highest score.
Take a look at this question for more information.
Not sure this helps, but there is a GPL OCR lib called gocr.
Apologies if this is too far off-topic, but IMHO Vigra (not the other one!) is a much better image processing library for C++ than DevIL.
Does anyone know of a c++ library for taking an image and performing image recognition on it such that it can find letters based on a given font and/or font height? Even one that doesn't let you select a font would be nice (eg: readLetters(Image image).
I've been looking into this a lot lately. Your best is simply Tesseract. If you need layout analysis on top of the OCR than go with Ocropus (which in turn uses Tesseract to do the OCR). Layout analysis refers to being able to detect position of text on the image and do things like line segmentation, block segmentation, etc.
I've found some really good tips through experimentation with Tesseract that are worth sharing. Basically I had to do a lot of preprocessing for the image.
Upsize/Downsize your input image to 300 dpi.
Remove color from the image. Grey scale is good. I actually used a dither threshold and made my input black and white.
Cut out unnecessary junk from your image.
For all three above I used netbpm (a set of image manipulation tools for unix) to get to point where I was getting pretty much 100 percent accuracy for what I needed.
If you have a highly customized font and go with tesseract alone you have to "Train" the system -- basically you have to feed a bunch of training data. This is well documented on the tesseract-ocr site. You essentially create a new "language" for your font and pass it in with the -l parameter.
The other training mechanism I found was with Ocropus using nueral net (bpnet) training. It requires a lot of input data to build a good statistical model.
In terms of invoking Tesseract/Ocropus are both C++. It won't be as simple as ReadLines(Image) but there is an API you can check out. You can also invoke via command line.
While I cannot recommend one in particular, the term you are looking for is OCR (Optical Character Recognition).
There is tesseract-ocr which is a professional library to do this.
From there web site
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available
I think what you want is Conjecture. Used to be the libgocr project. I haven't used it for a few years but it used to be very reliable if you set up a key.
The Tesseract OCR library gives pretty accurate results, its a C and C++ library.
My initial results were around 80% accurate, but applying pre-processing on the images before supplying in for OCR the results were around 95% accurate.
What is pre-preprocessing:
1) Binarize the bitmap (B&W worked better for me). How it could be done
2) Resampling your image to 300 dpi
3) Save your image in a lossless format, such as LZW TIFF or CCITT Group 4 TIFF.