As I understand it, in the YOLO algorithm we divide the input image into a grid, for example 19x19, and the network has to produce an output vector (pc, bx, by, bh, bw, c) for each cell. Then we can train our network. My question is: why do we give the network an XML file with only one bounding box, label, etc. (if there is only one object in the image) instead of giving it 19*19 = 361 of them? Does the implementation of the network divide the image and create a vector for each cell automatically? (How does it do that?)
The same question applies to the sliding-window algorithm: why do we give the network only one vector with a label and bounding box instead of a vector for each sliding window?
Let's say that the output of YOLO is composed of 19 by 19 grid cells, and each grid cell has some depth. Each grid cell can detect some bounding boxes, whose maximum number depends on the configuration of the model. For example, if one grid cell can detect up to 5 bounding boxes, the model can detect 19x19x5 = 1805 bounding boxes in total.
Since this number is too large, we train the model such that only the grid cell that contains the center of a true bounding box predicts a box with high confidence. When we train the model, we first figure out where the center of the true bounding box falls, and train the model such that the grid cell containing that center predicts a bounding box similar to the ground-truth one with high probability, and such that all other grid cells predict boxes with as low a probability as possible (when the probability is lower than a threshold, the prediction is discarded).
The image below shows a grid cell containing the box center when the output has 13 by 13 grid cells.
The same applies when there is more than one object in a training image. If there are two objects in a training image, we update the two grid cells that contain the centers of the two true boxes such that they produce bounding boxes with high probability.
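To make this concrete, here is a minimal sketch, assuming a 19x19 grid, one box per cell, and a single class, of how a training pipeline could expand a single annotated box into the full per-cell target tensor; the function name and exact encoding are illustrative and differ between YOLO versions.

```python
import numpy as np

def build_yolo_target(box, grid=19, num_classes=1):
    """Encode one ground-truth box (x, y, w, h in [0, 1], class_id)
    into a (grid, grid, 5 + num_classes) target tensor.
    Only the cell containing the box center gets confidence 1;
    every other cell keeps confidence 0."""
    x, y, w, h, class_id = box
    target = np.zeros((grid, grid, 5 + num_classes), dtype=np.float32)

    col = min(int(x * grid), grid - 1)   # grid cell containing the center (x axis)
    row = min(int(y * grid), grid - 1)   # grid cell containing the center (y axis)

    target[row, col, 0] = 1.0                     # objectness / confidence
    target[row, col, 1] = x * grid - col          # center offset within the cell
    target[row, col, 2] = y * grid - row
    target[row, col, 3] = w                       # width relative to the image
    target[row, col, 4] = h                       # height relative to the image
    target[row, col, 5 + class_id] = 1.0          # one-hot class label
    return target

# One annotated object -> a full 19x19 target, built automatically by the loader.
target = build_yolo_target((0.47, 0.62, 0.30, 0.25, 0))
print(target.shape)  # (19, 19, 6)
```

So the annotation file only needs the true boxes; the data loader expands them into the 361 per-cell vectors before computing the loss.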
Related
Let's say I annotated all images in my dataset to have 20 bounding boxes.
I basically want my predicted bounding boxes to also number exactly 20. After training, however, I get varying numbers of bounding boxes, not 20.
I'm trying to detect the same 20 objects in an image. All the objects are the same, so I have only 1 class for all 20 bounding boxes.
I'm currently using YOLOv5 but is there a better model for a use-case like this?
I suggest selecting the 20 detected objects with the highest confidence. You can do that easily by appending all detected boxes, together with their confidences and labels, to a list, sorting it by confidence, and then iterating through only the first 20 entries. You can then draw the bounding boxes of those filtered objects (20 objects).
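A minimal sketch of that filtering step, assuming each detection is already available as a (box, confidence, label) tuple (the variable names and detection format are illustrative, not the YOLOv5 API):

```python
def keep_top_k(detections, k=20):
    """detections: list of (box, confidence, label) tuples.
    Returns the k detections with the highest confidence."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    return detections[:k]

# Example with made-up detections; each box is (x1, y1, x2, y2)
dets = [((10, 10, 50, 50), 0.91, 0), ((60, 20, 90, 80), 0.42, 0)]
top = keep_top_k(dets, k=20)
for box, conf, label in top:
    print(box, conf, label)   # draw each kept box here, e.g. with cv2.rectangle
```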
I'm trying to use Amazon Textract to perform OCR to build a small application. I'm trying to find a way to get the character co-ordinates from each word.
Is there any way I can find the character level coordinates/character data?
For each 'word', yes there is (Textract returns geometry down to the word level, not for individual characters). The documentation specifies how:
Using Amazon Textract: Item Location on a Document Page
https://docs.aws.amazon.com/textract/latest/dg/text-location.html
Amazon Textract operations return the location and geometry of items found on a document page. DetectDocumentText and GetDocumentTextDetection return the location and geometry for lines and words, while AnalyzeDocument and GetDocumentAnalysis return the location and geometry of key-value pairs, tables, cells, and selection elements.
To determine where an item is on a document page, use the bounding box (Geometry) information that's returned by the Amazon Textract operation in a Block object. The Geometry object contains two types of location and geometric information for detected items:
An axis-aligned BoundingBox object that contains the top-left coordinate and the width and height of the item.
A polygon object that describes the outline of the item, specified as an array of Point objects that contain X (horizontal axis) and Y (vertical axis) document page coordinates of each point.
You can use geometry information to draw bounding boxes around detected items. For an example that uses BoundingBox and Polygon information to draw boxes around lines and vertical lines at the start and end of each word, see Detecting Document Text with Amazon Textract. The example output is similar to the following.
Bounding Box
A bounding box (BoundingBox) has the following properties:
Height – The height of the bounding box as a ratio of the overall document page height.
Left – The X coordinate of the top-left point of the bounding box as a ratio of the overall document page width.
Top – The Y coordinate of the top-left point of the bounding box as a ratio of the overall document page height.
Width – The width of the bounding box as a ratio of the overall document page width.
Each BoundingBox property has a value between 0 and 1. The value is a ratio of the overall image width (applies to Left and Width) or height (applies to Height and Top). For example, if the input image is 700 x 200 pixels, and the top-left coordinate of the bounding box is (350,50) pixels, the API returns a Left value of 0.5 (350/700) and a Top value of 0.25 (50/200).
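As an illustration, here is a small helper, assuming you already have a Block's BoundingBox dictionary and the page size in pixels, that converts those ratios back into pixel coordinates, consistent with the 700 x 200 example above:

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a Textract BoundingBox (ratios in [0, 1]) to pixel coordinates.
    bbox is the dict found under Block['Geometry']['BoundingBox']."""
    left = bbox["Left"] * page_width
    top = bbox["Top"] * page_height
    width = bbox["Width"] * page_width
    height = bbox["Height"] * page_height
    return left, top, width, height

# The documentation example: a 700 x 200 pixel page with Left = 0.5, Top = 0.25
print(bbox_to_pixels({"Left": 0.5, "Top": 0.25, "Width": 0.1, "Height": 0.2}, 700, 200))
# -> (350.0, 50.0, 70.0, 40.0)
```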
I'm struggling to find a solution to the following problem:
I used OpenCV to mark all connected white pixels with a unique label.
Now I have a group of those labeled elements.
Those objects are often 90% rectangular, but most of the time they contain some extra lines and other clutter.
I'm searching for an algorithm which does the following:
- gets the biggest rectangle out of the image (within the same label)
- fast performance
- maybe even filters for the largest rectangle which contains at least xx% of pixels with the same label
Maybe someone can help me
Thanks a lot
Edit: example pictures (in this case for licence plate location):
My desired output of the algorithm would be the rectangle of the plate (and of course all other rectangles in the image; I'm going to filter them later).
Important: the rectangles may be rotated!
My suggestion
make sure to fill small holes either by blob analysis or mathematical morphology;
compute the distance map in the white areas;
binarize the distance map with a threshold equal to half the plate height.
The rectangles will then appear as line segments about as long as the plate width minus the plate height. You can locate them by fitting rotated rectangular bounding boxes; they should have a large aspect ratio.
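A rough Python/OpenCV sketch of this suggestion, assuming a binary mask for one label and a known approximate plate height in pixels (the closing kernel size and aspect-ratio cutoff are assumptions to tune):

```python
import cv2
import numpy as np

def find_plate_like_rects(mask, plate_height_px, min_aspect=3.0):
    """mask: binary image (uint8, 0/255) of one connected label.
    Returns rotated rects fitted around ridge segments that survive
    the half-plate-height distance threshold."""
    # 1. fill small holes with a morphological closing
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    filled = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # 2. distance map inside the white areas
    dist = cv2.distanceTransform(filled, cv2.DIST_L2, 5)

    # 3. binarize at half the plate height: only pixels deep inside a
    #    plate-sized rectangle survive, forming short line segments
    ridge = (dist >= plate_height_px / 2.0).astype(np.uint8) * 255

    # 4. fit a rotated bounding box around each segment and keep elongated ones
    contours, _ = cv2.findContours(ridge, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    rects = []
    for c in contours:
        (cx, cy), (w, h), angle = cv2.minAreaRect(c)
        if max(w, h) >= min_aspect * max(min(w, h), 1.0):
            rects.append(((cx, cy), (w, h), angle))
    return rects
```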
I am trying to develop a box-sorting application in Qt using OpenCV. I want to measure the width and length of the box.
As shown in the image above, I want to detect only the outermost lines (i.e. the box edges), which will give me the width and length of the box, regardless of whatever is printed inside it.
What I tried:
First I tried using findContours() and selected the contour with the maximum area, but the contour of the outer edge is often not closed (broken somewhere in the Canny output) and hence does not get detected as a contour.
The Hough line transform gives me too many lines; I don't know how to pick out only the four lines I am interested in.
I then tried my own algorithm (a rough sketch in code follows the steps below):
Convert the image to grayscale.
Take one column of the image and compare every pixel with the next pixel in that column; if the difference in their values is greater than some threshold (say 100), that pixel belongs to an edge, so store it in an array. Do this for all columns, which gives the upper line of the box parallel to the x-axis.
Follow the same procedure, but starting from the last column and last row (i.e. from bottom to top); this gives the lower line parallel to the x-axis.
Likewise, find the lines parallel to the y-axis as well. Now I have four arrays of points, one for each side.
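A rough sketch of this column-scan idea in Python/OpenCV, for the top edge only (the file name and the threshold of 100 are placeholders):

```python
import cv2
import numpy as np

img = cv2.imread("box.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
threshold = 100

top_edge = []
for x in range(img.shape[1]):                # scan every column from the top
    col = img[:, x].astype(np.int32)
    diffs = np.abs(np.diff(col))             # difference between successive pixels
    hits = np.where(diffs > threshold)[0]
    if hits.size:                            # first strong jump = upper edge point
        top_edge.append((x, int(hits[0])))

# The same loop run from the bottom, and over rows for the left/right sides,
# gives the other three point arrays.
```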
This gives me good results if the box is placed such that its sides are exactly parallel to the X and Y axes. If the box is even slightly rotated, it gives me diagonal lines, which is to be expected, as shown in the image below.
As shown in the image below, I removed the first 10 and last 10 points from all four arrays (which are responsible for the diagonal lines) and drew the lines. This is not going to work when the box is tilted more, and the measurements will also go wrong.
Now my question is:
Is there any simpler way in OpenCV to get only the outer edges (the rectangle) of the box and obtain their dimensions, ignoring anything printed on the box and working for any orientation?
I am not necessarily asking you to correct/improve my algorithm, but any suggestions on that are also welcome. Sorry for such a big post.
I would suggest the following steps:
1: Make a mask image by using cv::inRange() to select the background color. Then use cv::bitwise_not() to invert this mask. This will give you only the box.
2: If you're not concerned about shadows or depth effects making your measurement inaccurate, you can proceed right away with cv::findContours() again. Select the biggest contour and store its cv::RotatedRect (e.g. from cv::minAreaRect()).
3: This cv::RotatedRect has a size member that gives you the width and the height of your box in pixels.
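Those three steps could look roughly like this using the Python bindings of the same functions (the background color range and file name are assumptions you would tune for your setup):

```python
import cv2
import numpy as np

img = cv2.imread("box.jpg")                              # hypothetical input

# 1: mask the background color (range is an assumption) and invert it
lower, upper = np.array([200, 200, 200]), np.array([255, 255, 255])
background = cv2.inRange(img, lower, upper)
box_mask = cv2.bitwise_not(background)

# 2: find the biggest contour in the mask
contours, _ = cv2.findContours(box_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
biggest = max(contours, key=cv2.contourArea)

# 3: its rotated rectangle gives the width and height in pixels
rect = cv2.minAreaRect(biggest)                          # ((cx, cy), (w, h), angle)
(w, h) = rect[1]
print("box size in pixels:", w, h)
```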
Since the box is placed on a contrasting background, you should be able to use Otsu thresholding.
threshold the image (use Otsu method)
filter out any stray pixels that are outside the box region (let's hope you don't get many such pixels and can easily remove them with a median or a morphological filter)
find contours
combine all contour points and get their convex hull (the idea here is to find the convex region that bounds all these contours in the box region regardless of their connectivity)
apply a polygon approximation (approxPolyDP) to this convex hull and check if you get a quadrangle
if there are no perspective distortions, you should get a rectangle, otherwise you will have to correct it
if you get a rectangle, you have its dimensions. You can also find the minimum-area rectangle (minAreaRect) of the convex hull, which directly gives you a RotatedRect
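Put together, these steps could look roughly like the following Python/OpenCV sketch (the threshold direction, median kernel size, and approxPolyDP epsilon are assumptions to tune):

```python
import cv2
import numpy as np

gray = cv2.imread("box.jpg", cv2.IMREAD_GRAYSCALE)       # hypothetical input

# Otsu threshold, then a median filter to drop stray pixels
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
binary = cv2.medianBlur(binary, 5)

# find contours and merge all their points into one convex hull
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
all_points = np.vstack(contours)
hull = cv2.convexHull(all_points)

# polygon approximation: with no perspective distortion this should give 4 points
epsilon = 0.02 * cv2.arcLength(hull, True)
quad = cv2.approxPolyDP(hull, epsilon, True)

# the minimum-area rectangle of the hull directly gives a rotated rect
rect = cv2.minAreaRect(hull)                             # ((cx, cy), (w, h), angle)
print("corners:", len(quad), "size:", rect[1])
```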
I've created an OpenCV application for human detection in images.
I run my algorithm on the same image over different scales, and when detections are made, at the end I have the bounding box position and the scale at which it was found. I then want to transform that rectangle back to the original scale, given that both its position and size vary with the scale.
I've been trying to wrap my head around this and have gotten nowhere. This should be rather simple, but at the moment I am clueless.
Help anyone?
OK, I got the answer elsewhere:
"What you should do is store the scale you are at for each detection. Then transforming back is rather easy. Imagine you have the following.
X and Y coordinates (the center of the bounding box) at scale 1/2 of the original. This means that you should multiply by the inverse of the scale to get the location in the original, which would be 2X, 2Y (again for the bounding box center).
So first transform the center of the bounding box, then calculate the width and height of your bounding box in the original, again by multiplying by the inverse. Then, from the center, your box extends ±width/2 and ±height/2 (using the rescaled width and height)."
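A minimal sketch of that transformation, assuming detections are stored as center coordinates plus width and height together with the resize factor that was applied:

```python
def to_original_scale(cx, cy, w, h, scale):
    """Map a detection made on a resized image back to the original image.
    `scale` is the resize factor that was applied (e.g. 0.5 for half size)."""
    inv = 1.0 / scale
    cx_o, cy_o = cx * inv, cy * inv          # transform the box center first
    w_o, h_o = w * inv, h * inv              # then the width and height
    # corners in the original image, +- half the size from the center
    x1, y1 = cx_o - w_o / 2, cy_o - h_o / 2
    x2, y2 = cx_o + w_o / 2, cy_o + h_o / 2
    return x1, y1, x2, y2

# detection found at half scale: center (100, 80), size 40 x 60
print(to_original_scale(100, 80, 40, 60, 0.5))   # -> (160.0, 100.0, 240.0, 220.0)
```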