How to get the Character Level Data from Amazon Textract? - amazon-web-services

I'm using Amazon Textract to perform OCR for a small application, and I'm trying to find a way to get the character coordinates within each word.
Is there any way I can find the character level coordinates/character data?

For each 'word', yes there is: Textract returns geometry at the line and word level, but not per character. The documentation specifies how:
Using Amazon Textract: Item Location on a Document Page
https://docs.aws.amazon.com/textract/latest/dg/text-location.html
Amazon Textract operations return the location and geometry of items found on a document page. DetectDocumentText and GetDocumentTextDetection return the location and geometry for lines and words, while AnalyzeDocument and GetDocumentAnalysis return the location and geometry of key-value pairs, tables, cells, and selection elements.
To determine where an item is on a document page, use the bounding box (Geometry) information that's returned by the Amazon Textract operation in a Block object. The Geometry object contains two types of location and geometric information for detected items:
An axis-aligned BoundingBox object that contains the top-left coordinate and the width and height of the item.
A polygon object that describes the outline of the item, specified as an array of Point objects that contain X (horizontal axis) and Y (vertical axis) document page coordinates of each point.
You can use geometry information to draw bounding boxes around detected items. For an example that uses BoundingBox and Polygon information to draw boxes around lines and vertical lines at the start and end of each word, see Detecting Document Text with Amazon Textract.
Bounding Box
A bounding box (BoundingBox) has the following properties:
Height – The height of the bounding box as a ratio of the overall document page height.
Left – The X coordinate of the top-left point of the bounding box as a ratio of the overall document page width.
Top – The Y coordinate of the top-left point of the bounding box as a ratio of the overall document page height.
Width – The width of the bounding box as a ratio of the overall document page width.
Each BoundingBox property has a value between 0 and 1. The value is a ratio of the overall image width (applies to Left and Width) or height (applies to Height and Top). For example, if the input image is 700 x 200 pixels, and the top-left coordinate of the bounding box is (350,50) pixels, the API returns a Left value of 0.5 (350/700) and a Top value of 0.25 (50/200).
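To make the word-level geometry concrete, here is a minimal Python/boto3 sketch (my own illustration, not part of the AWS documentation) that converts the normalized BoundingBox ratios of each WORD block into pixel coordinates. The file name and the 700 x 200 image size are assumptions taken from the example above.

```python
# Minimal sketch: fetch word-level geometry with Textract and convert the
# normalized BoundingBox ratios to pixel coordinates.
import boto3

client = boto3.client("textract")

# "document.png" is a placeholder; Textract accepts the image bytes directly
with open("document.png", "rb") as f:
    response = client.detect_document_text(Document={"Bytes": f.read()})

page_width, page_height = 700, 200  # pixel size of the input image (assumed)

for block in response["Blocks"]:
    if block["BlockType"] != "WORD":
        continue
    box = block["Geometry"]["BoundingBox"]
    left = box["Left"] * page_width      # ratios -> pixels
    top = box["Top"] * page_height
    width = box["Width"] * page_width
    height = box["Height"] * page_height
    print(block["Text"], (round(left), round(top), round(width), round(height)))
```

Since Textract only returns geometry down to the word level, anything finer than these word boxes has to be derived on your side.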

Related

Output vector for YOLO and sliding window algorithms

As I understand it, in the YOLO algorithm we divide the input image into a grid, for example 19x19, and we have an output vector (pc, bx, by, bh, bw, c) for each cell. Then we can train our network. My question is: why do we give the network an XML file with only one bounding box, labels, etc. (if only one object is in the image) instead of giving it 19*19 = 361 of them? Does the implementation of the network divide the image and create a vector for each cell automatically? (How does it do that?)
The same question applies to the sliding window algorithm: why do we give the network only one vector with a label and bounding box instead of a vector for each sliding window?
Let's say that the output of YOLO is composed of 19 by 19 grid cells, and each grid cell has some depth. Each grid cell can detect some bounding boxes, whose maximum number depends on the configuration of the model. For example, if one grid cell can detect up to 5 bounding boxes, the model can detect 19x19x5 = 1805 bounding boxes in total.
Since this number is too large, we train the model such that only the grid cell that contains the center of the bounding box predicts a bounding box with high confidence. When we train the model, we first figure out where the center of the true bounding box falls, and train the model so that the grid cell containing the center predicts a bounding box similar to the true one with high probability, and so that the other grid cells predict bounding boxes with as low a probability as possible (when the probability is lower than a threshold, the prediction is discarded).
The image below shows a grid cell containing the box center when the output has 13 by 13 grid cells.
The same holds when there is more than one object in the training images. If there are two objects in a training image, we update the two grid cells that contain the centers of the two true boxes so that they produce bounding boxes with high probability.
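To make that concrete, here is a rough NumPy sketch (my own illustration, not from the original answer) of how the training code expands a single labelled box into a full 19x19 grid target, using the (pc, bx, by, bh, bw, c) layout from the question; real YOLO implementations add anchor matching and other details on top of this.

```python
import numpy as np

S, B, C = 19, 5, 1                     # grid size, boxes per cell, number of classes
target = np.zeros((S, S, B, 5 + C))    # per cell and box: pc, bx, by, bh, bw, class scores

# one labelled object: center (x, y) and size (w, h), all relative to the image
x, y, w, h, cls = 0.47, 0.62, 0.20, 0.35, 0

col = int(x * S)                       # grid cell containing the box center
row = int(y * S)

target[row, col, 0, 0] = 1.0           # pc = 1 only for the responsible cell
target[row, col, 0, 1] = x * S - col   # center offset within that cell
target[row, col, 0, 2] = y * S - row
target[row, col, 0, 3] = h             # bh, bw relative to the whole image
target[row, col, 0, 4] = w
target[row, col, 0, 5 + cls] = 1.0     # one-hot class label

# every other cell (and box slot) keeps pc = 0, so during training their
# predicted confidence is pushed toward zero
```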

Using OpenCV, how to detect a box in an image while eliminating objects printed inside the box?

I am trying to develop a box sorting application in Qt using OpenCV. I want to measure the width and length of the box.
As shown in the image above, I want to detect only the outermost lines (i.e. the box edges), which will give me the width and length of the box, regardless of whatever is printed inside it.
What I tried:
First I tried using findContours() and selected the contour with the maximum area, but the contour of the outer edge is often not closed (broken somewhere in the Canny output) and hence does not get detected as a single contour.
The Hough line transform gives me too many lines; I don't know how to pick out only the four lines I am interested in.
Then I tried my own algorithm:
Convert the image to grayscale.
Take one column of the image and compare every pixel with the next pixel down that column; if the difference in their values is greater than some threshold (say 100), that pixel belongs to an edge, so store it in an array. Do this for all columns and it gives the upper line of the box, parallel to the x axis.
Follow the same procedure, but starting from the last column and last row (i.e. from bottom to top); it gives the lower line, parallel to the x axis.
Likewise, find the lines parallel to the y axis as well. Now I have four arrays of points, one for each side (a rough sketch of this scan is shown below).
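Here is a rough Python/OpenCV reconstruction of that column scan (my own sketch of the approach described above; the file name and the threshold of 100 are assumptions), which also shows why a tilted box mixes points from two different sides:

```python
import cv2
import numpy as np

gray = cv2.imread("box.jpg", cv2.IMREAD_GRAYSCALE).astype(np.int32)
THRESHOLD = 100

top_edge = []                                   # one point per column
rows, cols = gray.shape
for x in range(cols):
    for y in range(rows - 1):
        # first strong change between a pixel and the next one down the column
        if abs(gray[y, x] - gray[y + 1, x]) > THRESHOLD:
            top_edge.append((x, y))
            break

# scanning from the bottom (and per row, from left and right) gives the other
# three point arrays; when the box is tilted, each array picks up points from
# two adjacent sides, which is why the drawn lines come out diagonal
```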
Now this gives me good results if the box is placed so that its sides are exactly parallel to the X and Y axes. If the box is oriented even slightly in some direction, it gives me diagonal lines, which is to be expected, as shown in the image below.
As shown in the image below, I removed the first 10 and last 10 points from all four arrays of points (which are responsible for drawing the diagonal lines) and drew the lines, but this is not going to work when the box is tilted more, and the measurements will also go wrong.
Now my question is,
Is there any simpler way in OpenCV to get only the outer edges (rectangle) of the box and get their dimensions, ignoring anything printed on the box, whatever its orientation?
I am not necessarily asking you to correct/improve my algorithm, but any suggestions on that are also welcome. Sorry for such a long post.
I would suggest the following steps:
1: Make a mask image by using cv::inRange() (see the documentation) to select the background color. Then use cv::bitwise_not() to invert this mask. This will give you only the box.
2: If you're not concerned about shadows or depth effects making your measurement inaccurate, you can proceed right away with cv::findContours() again. Select the biggest contour and store its cv::RotatedRect (from cv::minAreaRect()).
3: This cv::RotatedRect will give you a rotatedRect.size that defines the width and the height of your box in pixels.
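A Python/OpenCV version of these three steps might look like the sketch below (my own code, not the answerer's; the input file, the background color range, and the OpenCV 4 findContours signature are assumptions):

```python
import cv2
import numpy as np

img = cv2.imread("box.jpg")

# step 1: mask the background color, then invert so only the box remains
lower = np.array([200, 200, 200])        # assumed background color range
upper = np.array([255, 255, 255])
background = cv2.inRange(img, lower, upper)
box_mask = cv2.bitwise_not(background)

# step 2: biggest contour of the mask
contours, _ = cv2.findContours(box_mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
biggest = max(contours, key=cv2.contourArea)

# step 3: RotatedRect -> width and height in pixels, independent of orientation
(cx, cy), (width, height), angle = cv2.minAreaRect(biggest)
print("box size in pixels:", width, height)
```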
Since the box is placed on a contrasting background, you should be able to use Otsu thresholding (a sketch of these steps follows the list):
threshold the image (use Otsu method)
filter out any stray pixels that are outside the box region (let's hope you don't get many such pixels and can easily remove them with a median or a morphological filter)
find contours
combine all contour points and get their convex hull (the idea here is to find the convex region that bounds all these contours in the box region, regardless of their connectivity)
apply a polygon approximation (approxPolyDP) to this convex hull and check if you get a quadrangle
if there are no perspective distortions, you should get a rectangle, otherwise you will have to correct it
if you get a rectangle, you have its dimensions. You can also find the minimum area rectangle (minAreaRect) of the convex hull, which should directly give you a RotatedRect
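The whole pipeline, sketched in Python/OpenCV (again my own illustration, not the answerer's code; the file name, filter sizes and the 0.02 approximation factor are assumptions):

```python
import cv2
import numpy as np

gray = cv2.imread("box.jpg", cv2.IMREAD_GRAYSCALE)

# Otsu threshold (use THRESH_BINARY_INV instead if the box is darker than the background)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# remove stray pixels with a median filter and a morphological opening
binary = cv2.medianBlur(binary, 5)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)

# combine all contour points and take their convex hull
points = np.vstack([c.reshape(-1, 2) for c in contours])
hull = cv2.convexHull(points)

# polygon approximation: four vertices means we recovered the box outline
approx = cv2.approxPolyDP(hull, 0.02 * cv2.arcLength(hull, True), True)
if len(approx) == 4:
    print("quadrangle corners:", approx.reshape(-1, 2))

# the minimum-area rectangle of the hull gives the dimensions directly
(cx, cy), (w, h), angle = cv2.minAreaRect(hull)
print("box size in pixels:", w, h)
```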

Grouping different scale bounding boxes

I've created an OpenCV application for human detection in images.
I run my algorithm on the same image over different scales, and when detections are made, at the end I have information about the bounding box position and the scale at which it was detected. Then I want to transform that rectangle to the original scale, given that the position and size will vary.
I've been trying to wrap my head around this and I've gotten nowhere. This should be rather simple, but at the moment I am clueless.
Help anyone?
OK, I got the answer elsewhere:
"What you should do is store the scale where you are at for each detection. Then transforming should be rather easy right. Imagine you have the following.
X and Y coordinates (center of bounding box) at scale 1/2 of the original. This means that you should multiply with the inverse of the scale to get the location in the original, which would be 2X, 2Y (again for the bounxing box center).
So first transform the center of the bounding box, than calculate the width and height of your bounding box in the original, again by multiplying with the inverse. Then from the center, your box will be +-width_double/2 and +-height_double/2."
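A tiny worked example of that mapping (my own sketch; the numbers are made up):

```python
def to_original(cx, cy, w, h, scale):
    """Map a detection made on a scaled image back to the original image.

    cx, cy: bounding box center in the scaled image; scale: e.g. 0.5 for half size.
    """
    inv = 1.0 / scale
    cx0, cy0 = cx * inv, cy * inv          # center in the original image
    w0, h0 = w * inv, h * inv              # width and height in the original image
    # corners: +-w0/2 and +-h0/2 around the transformed center
    return (cx0 - w0 / 2, cy0 - h0 / 2, cx0 + w0 / 2, cy0 + h0 / 2)

# a detection at half scale with center (100, 80) and size 40 x 60
print(to_original(100, 80, 40, 60, 0.5))   # -> (160.0, 100.0, 240.0, 220.0)
```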

How to implement painting (with layer support) in OpenGL?

situation
I'm implementing a height field editor, with two views. The main view displays the height field in 3D enabling trackball navigation. The edit view shows the height field as a 2D image.
On top of this height field, new images can be applied that alter its appearance (cut holes, lower or raise specific areas). These are called patches.
Both the height field and the patches are one-channel grayscale PNG images.
For visualization I'm using the Visualization Library framework (C++) and OpenGL 4.
task
Implement a drawing tool, available in the 2D edit view (orthographic projection), that creates these patches (as separate images) at runtime.
important notes / constraints
the image of the height field may be scaled, rotated and transposed.
the patches need to have the same scale as the height field, so one pixel in the patch covers exactly a pixel in the height field.
as a result of the scaling the size of a framebuffer pixel may be bigger or smaller than the size of the height field/patch image pixel.
the scene contains objects (example: a pointing arrow) that should not appear in the patch.
question
What is the right approach to this task? So far I had the following ideas:
Use some kind of Qt canvas to create the patch, then map it to the height field image proportions and save it as a new patch. This would be done every time the user starts drawing; this way implementing undo would be easy (remove the last patch created).
Use a neutral-colored image in combination with texture buffer objects to implement some kind of canvas myself. This way, every time the user stops drawing, the contents of the canvas are mapped to the height field and saved as a patch, and the canvas is reset for the next drawing.
There are some examples using a framebuffer object. However, I'm not sure if this approach fits my needs. When I use OpenGL to draw a sub-image into the framebuffer, won't the resulting image contain all the rendered data?
Here is what I ended up with:
I use the PickIntersector of the Visualization Library to pick against the height field image in the edit view.
This yields local coordinates on the image.
These are transformed to UV coordinates, which in turn are transformed into pixel coordinates.
This is done when the user presses a mouse button and continues to happen as the mouse moves, as long as it's over the image.
I have a PatchCanvas class that collects all these points. On command, it uses the Anti-Grain Geometry library to actually rasterize the lines that can be constructed from the points.
After that is done, the rasterized image is divided up into a grid of fixed size. Every tile is scanned for a color different from the neutral one. Tiles that contain only the neutral color are dropped. The others are saved following the appropriate naming scheme and can be loaded in the next frame.
AGG supports lines of different widths. This isn't implemented yet, but the idea is to pick two adjacent points in screen space, get their UV coordinates, convert them to pixels and use that as the line thickness. This should result in broader strokes for zoomed-out views.
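For the tile-scanning step described above, here is a small NumPy sketch (my own illustration, not the actual PatchCanvas code; the tile size and the neutral gray value are assumptions):

```python
import numpy as np

NEUTRAL = 127          # assumed neutral gray value of an untouched canvas
TILE = 64              # assumed tile size in pixels

def nonempty_tiles(patch):
    """Yield (row, col, tile) for every tile that differs from the neutral color."""
    rows, cols = patch.shape
    for r in range(0, rows, TILE):
        for c in range(0, cols, TILE):
            tile = patch[r:r + TILE, c:c + TILE]
            if np.any(tile != NEUTRAL):      # keep only tiles that were drawn on
                yield r // TILE, c // TILE, tile

# usage: save every non-neutral tile under a grid-based name
patch = np.full((256, 256), NEUTRAL, dtype=np.uint8)
patch[10:40, 200:230] = 255                  # pretend the user drew a stroke here
for row, col, tile in nonempty_tiles(patch):
    print(f"would save tile_{row}_{col}.png with shape {tile.shape}")
```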

Image Processing - Rotation and Optical Character Recognition

Good Morning everybody,
Today I want to ask about the topic "Image Manipulation in C++".
So far I am able to filter all the noisy stuff out of the picture and change the color to black and white.
But now I have two questions.
First Question:
Below you see a screenshot of the image. What is the best way to find out how to rotate the text? In the end it would be nice if the text were horizontal. Does anybody have a good link or an example?
Second Question:
How should I go on? Do you think I should send the image to an "Optical Character Recognizer" (a), or should I extract each letter myself (b)?
If the answer is (a), what is the smallest OCR library? All the libraries I have found so far (like GOCR or Tesseract) seem to be overpowered and difficult to integrate into an existing project.
If the answer is (b), what is the best way to save each letter as its own image? Should I search for a white pixel and then go from pixel to pixel, saving the coordinates in a 2D array? And what about the letter "i"? ;)
Thanks to everybody who helps me find my way! Sorry for the strange English above. I'm still a language noob :-)
The usual name for the problem in your first question is "skew correction".
You may Google for it (there are lots of references); there are nice papers showing, for example, how to estimate the skew angle and produce a horizontally aligned image.
An easy way to start (though not as good as the approaches mentioned above) is to perform a Principal Component Analysis on the coordinates of the text pixels:
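A minimal sketch of that idea (my own code, not the answerer's): treat the coordinates of the white text pixels as a point cloud and take the direction of greatest variance as the text orientation. The file name and the white-text-on-black convention are assumptions.

```python
import cv2
import numpy as np

binary = cv2.imread("text.png", cv2.IMREAD_GRAYSCALE)
ys, xs = np.nonzero(binary > 128)            # coordinates of the white pixels
coords = np.column_stack((xs, ys)).astype(np.float64)
coords -= coords.mean(axis=0)                # center the point cloud

# principal axis = eigenvector of the covariance matrix with the largest eigenvalue
cov = np.cov(coords, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
principal = eigvecs[:, np.argmax(eigvals)]

angle = np.degrees(np.arctan2(principal[1], principal[0]))
print(f"estimated skew: {angle:.1f} degrees")   # rotate by -angle to deskew
```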
For your first question:
First, Remove any "specs" of noisy white pixels that aren't part of the letter sequence. A gentle low-pass filter (pixel color = average of surrounding pixels) followed by a clamping of the pixel values to pure black or pure white. This should get rid of the little "dot" underneath the "a" character in your image and any other specs.
Now search for the following pixels:
xMin = white pixel with the lowest x value (white pixel closest to the left edge)
xMax = white pixel with the largest x value (white pixel closest to the right edge)
yMin = white pixel with the lowest y value (white pixel closest to the top edge)
yMax = white pixel with the largest y value (white pixel closest to the bottom edge)
with these four pixel values, form a bounding box: Rect(xMin, yMin, xMax, yMax);
compute the area of the bounding box and find the center.
Using the center of the bounding box as the pivot, rotate the image by N degrees. (You can pick N; 1 degree would be an OK value.)
Repeat the process of finding xMin,xMax,yMin,yMax and recompute the area
Continue rotating by N degrees until you've rotated K degrees. Also rotate by -N degrees until you've rotated by -K degrees. (Where K is the max rotation... say 30 degrees). At each step recompute the area of the bounding box.
The rotation that produces the bounding box with the smallest area is likely the rotation that aligns the letters parallel to the bottom edge (horizontal alignment).
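A sketch of that search in Python/OpenCV (my own code, rotating about the image center for simplicity rather than the bounding box center; N = 1 and K = 30 as suggested above, and the file name is a placeholder):

```python
import cv2
import numpy as np

binary = cv2.imread("text.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(binary, 128, 255, cv2.THRESH_BINARY)

h, w = binary.shape
center = (w / 2, h / 2)

best_angle, best_area = 0.0, float("inf")
for angle in np.arange(-30.0, 30.0 + 1.0, 1.0):   # -K .. +K in steps of N
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(binary, M, (w, h))
    xs = np.nonzero(rotated.max(axis=0) > 0)[0]   # columns containing white pixels
    ys = np.nonzero(rotated.max(axis=1) > 0)[0]   # rows containing white pixels
    if len(xs) == 0 or len(ys) == 0:
        continue
    area = (xs[-1] - xs[0]) * (ys[-1] - ys[0])    # bounding box area at this angle
    if area < best_area:
        best_area, best_angle = area, angle

print("rotation with the smallest bounding box:", best_angle, "degrees")
```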
You could measure the height to each white pixel from the bottom and find how much the text is leaning. It's a very simple approach but it worked fine for me when I tried it.
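One way to read that suggestion (my interpretation, not the answerer's code): record the bottom-most white pixel in every column and fit a straight line through those points; the slope of the line gives the lean angle.

```python
import cv2
import numpy as np

binary = cv2.imread("text.png", cv2.IMREAD_GRAYSCALE)
h, w = binary.shape

cols, bottoms = [], []
for x in range(w):
    ys = np.nonzero(binary[:, x] > 128)[0]
    if len(ys):                        # this column contains text pixels
        cols.append(x)
        bottoms.append(ys[-1])         # bottom-most white pixel in the column

slope, intercept = np.polyfit(cols, bottoms, 1)   # least-squares line fit
angle = np.degrees(np.arctan(slope))
print(f"text is leaning by about {angle:.1f} degrees")
```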