Use fewer than 5 anchor scales in FPN object detection model - computer-vision

Usually the anchor sizes are set to {32, 64, 128, 256, 512}. However, in my dataset, I don't have boxes as large as 512 x 512. So I would like to use only 4 anchor scales, i.e. {32, 64, 128, 256}. How would this be possible, since the FPN has 5 levels?
To elaborate, consider the following image. (It's from an article about detectron2)
Decreasing the number of anchors isn't really straightforward, since removing a scale involves removing a stage of the ResNet (a ResNet block) from being used. Both the BoxHead and the RPN expect P2 to P5 (the RPN expects res5/P6 as well). So my question is: if I were to remove an anchor scale (in my case 512 x 512, since my images are only 300 x 300 and objects won't exceed that size), which ResNet block should be ignored? Should the low resolution block (res2) be ignored, or should the high res one (res5) be removed?
Or is it that the structure does not allow removal of an anchor scale and 5 scales must be used?

You can remove the anchor scale, but be aware that you also need to modify your RPN and your BoxHead. The P2 will have the largest dimension (512 in your case).
But maybe think about keeping all of them and only changing the scales, going from 16 up to 256. I guess this can save you a lot of reorganization of your model, and it will also improve detection of smaller objects.

Related

How to interpret anchor boxes in Yolo or R-CNN?

Algorithms like YOLO or R-CNN use the concept of anchor boxes for predicting objects. https://pjreddie.com/darknet/yolo/
The anchor boxes are trained on a specific dataset; the set for the COCO dataset is:
anchors = 0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828
However, I don't understand how to interpret these anchor boxes. What does a pair of values such as (0.57273, 0.677385) mean?
In the original YOLO, or YOLOv1, the prediction was done without any assumptions about the shape of the target objects. Let's say that the network tries to detect humans. We know that, generally, humans fit in a vertical rectangular box rather than a square one. However, the original YOLO tried to detect humans with rectangular and square boxes with equal probability.
But this is not efficient and might decrease the prediction speed.
So in YOLOv2, some assumptions about the shapes of the objects were added. These are anchor boxes. Usually we feed the anchor boxes to the network as a list of numbers: a series of width/height pairs:
anchors = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]
In the above example, (0.57273, 0.677385) represents a single anchor box, in which the two elements are width and height respectively. That is, this list defines 5 different anchor boxes. Note that these values are relative to the output size. For example, YOLOv2 outputs a 13x13 feature map, and you can get the absolute values by multiplying the anchor values by 13.
Using anchor boxes made the prediction a little bit faster. But the accuracy might decrease. The paper of YOLOv2 says:
Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even though the mAP decreases, the increase in recall means that our model has more room to improve.
That's what I understood: YOLO divides a 416x416 image into a 13x13 grid, so each grid cell is 32 pixels. The anchor box sizes are relative to the size of a grid cell.
So an anchor box with width and height (0.57273, 0.677385) actually has a size of
w = 0.57273 * 32 = 18.3 pixels
h = 0.677385 * 32 = 21.67 pixels
If you convert all those values, you can then plot them on a 416x416 image to visualise them.
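A minimal sketch of that conversion in C++, following the arithmetic above (the anchors are the COCO values from the question, and the 32-pixel stride is the 416/13 grid-cell size assumed here):
#include <cstdio>

int main() {
    // COCO anchors from the question, as (width, height) pairs in grid-cell units
    const double anchors[5][2] = {
        {0.57273, 0.677385}, {1.87446, 2.06253}, {3.33843, 5.47434},
        {7.88282, 3.52778},  {9.77052, 9.16828}};
    const double cell = 416.0 / 13.0;   // 32 pixels per grid cell
    for (const auto& a : anchors)
        std::printf("w = %.2f px, h = %.2f px\n", a[0] * cell, a[1] * cell);
    return 0;
}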

Given the pitch of an image, how to calculate the GL_UNPACK_ALIGNMENT?

Most image libraries give an image as a width, height, pitch and an array of pixels. OpenGL, however, does not take a pitch directly; instead it expects:
glPixelStorei(GL_UNPACK_ALIGNMENT, align);
glPixelStorei(GL_UNPACK_ROW_LENGTH, row_length);
row_length is pitch / bytes_per_pixel, but how does one calculate the proper value for align?
For the vast majority of images that you'll see, you shouldn't need GL_UNPACK_ROW_LENGTH. Setting it to a non-zero value can degrade upload performance severely; only employ it when you absolutely need it.
For most images, the pitch will simply be (width * bpp), aligned to some boundary. For example, if the bpp is 3 (bytes per pixel) and the width is 9, you get 27. That's not a nicely aligned row size for many systems, so image libraries will often round it up to the nearest multiple of 4. Thus the pitch you would see in most such images is 28.
So to compute the alignment, simply compute the largest power of two (up to 8) that divides into the pitch. If the pitch in the image is 28, the largest power of two is 4. Set that to GL_UNPACK_ALIGNMENT. Generally, you can stop at 4, because that's the largest alignment most images will use.
The times when you need to use GL_ROW_LENGTH are when the computed pitch (align(width * bpp, alignment)) does not equal the actual pitch. If that happens, then you need to use both the alignment and the row length. But this is such a rare occurrence that you should probably just discard any image that uses such a wildly incorrect pitch.
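Putting that rule into code, a minimal sketch (pure arithmetic, no GL calls; the function name is made up):
#include <cstddef>

// Largest power of two (up to 8) that evenly divides the pitch.
int unpack_alignment_for_pitch(std::size_t pitch)
{
    if (pitch % 8 == 0) return 8;
    if (pitch % 4 == 0) return 4;
    if (pitch % 2 == 0) return 2;
    return 1;
}
// Example: a 9-pixel-wide RGB image padded to a 28-byte pitch gives 4.
// If align(width * bpp, alignment) still differs from the actual pitch,
// also set GL_UNPACK_ROW_LENGTH to pitch / bytes_per_pixel, as described above.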
GL_UNPACK_ALIGNMENT describes the byte alignment that the start of each pixel row in the image is assumed to have. Setting it to one is almost always the safe bet (i.e., it works with almost every image type, and it has never failed for me).
The default value is four, which works fine for RGBA images but fails miserably for RGB images whose row size is not a multiple of four (you'll know it instantly when you see the image: it'll look like a set of diagonal bands).

Filtering 1bpp images

I'm looking to filter a 1 bit per pixel image using a 3x3 filter: for each input pixel, the corresponding output pixel is set to 1 if the weighted sum of the pixels surrounding it (with weights determined by the filter) exceeds some threshold.
I was hoping that this would be more efficient than converting to 8 bpp and then filtering that, but I can't think of a good way to do it. A naive method is to keep track of nine pointers to bytes (three consecutive rows and also pointers to either side of the current byte in each row, for calculating the output for the first and last bits in these bytes) and for each input pixel compute
sum = filter[0] * (lastRowPtr & aMask > 0) + filter[1] * (lastRowPtr & bMask > 0) + ... + filter[8] * (nextRowPtr & hMask > 0),
with extra faff for bits at the edge of a byte. However, this is slow and seems really ugly. You're not gaining any parallelism from the fact that you've got eight pixels in each byte and instead are having to do tonnes of extra work masking things.
Are there any good sources for how to best do this sort of thing? A solution to this particular problem would be amazing, but I'd be happy being pointed to any examples of efficient image processing on 1bpp images in C/C++. I'd like to replace some more 8 bpp stuff with 1 bpp algorithms in the future to avoid image conversions and copying, so any general resources on this would be appreciated.
I found a number of years ago that unpacking the bits to bytes, doing the filter, then packing the bytes back to bits was faster than working with the bits directly. It seems counter-intuitive because it's 3 loops instead of 1, but the simplicity of each loop more than made up for it.
I can't guarantee that it's still the fastest; compilers and especially processors are prone to change. However simplifying each loop not only makes it easier to optimize, it makes it easier to read. That's got to be worth something.
A further advantage to unpacking to a separate buffer is that it gives you flexibility for what you do at the edges. By making the buffer 2 bytes larger than the input, you unpack starting at byte 1, then set byte 0 and the last byte to whatever you like, and the filtering loop doesn't have to worry about boundary conditions at all.
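As an illustration of the unpack/pack steps, a minimal sketch assuming MSB-first bit order within each byte (the filter itself then runs over plain bytes):
#include <cstdint>
#include <cstddef>

// Expand one packed 1 bpp row into one byte per pixel (0 or 1).
void unpack_row(const std::uint8_t* src, std::uint8_t* dst, std::size_t width)
{
    for (std::size_t x = 0; x < width; ++x)
        dst[x] = (src[x >> 3] >> (7 - (x & 7))) & 1;
}

// Pack a row of 0/1 bytes back into 1 bpp.
void pack_row(const std::uint8_t* src, std::uint8_t* dst, std::size_t width)
{
    for (std::size_t x = 0; x < width; ++x) {
        if ((x & 7) == 0) dst[x >> 3] = 0;
        dst[x >> 3] |= static_cast<std::uint8_t>((src[x] & 1) << (7 - (x & 7)));
    }
}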
Look into separable filters. Among other things, they allow massive parallelism in the cases where they work.
For example, in your 3x3 sample-weight-and-filter case:
Sample 1x3 (horizontal) pixels into a buffer. This can be done in isolation for each pixel, so a 1024x1024 image can run 1024^2 simultaneous tasks, all of which perform 3 samples.
Sample 3x1 (vertical) pixels from the buffer. Again, this can be done on every pixel simultaneously.
Use the contents of the buffer to cull pixels from the original texture.
The advantage to this approach, mathematically, is that it cuts the number of sample operations from n^2 to 2n, although it requires a buffer of equal size to the source (if you're already performing a copy, that can be used as the buffer; you just can't modify the original source for step 2). In order to keep memory use at 2n, you can perform steps 2 and 3 together (this is a bit tricky and not entirely pleasant); if memory isn't an issue, you can spend 3n on two buffers (source, hblur, vblur).
Because each operation is working in complete isolation from an immutable source, you can perform the filter on every pixel simultaneously if you have enough cores. Or, in a more realistic scenario, you can take advantage of paging and caching to load and process a single column or row. This is convenient when working with odd strides, padding at the end of a row, etc. The second round of samples (vertical) may screw with your cache, but at the very worst, one round will be cache-friendly and you've cut the per-pixel sampling cost from quadratic in the kernel size to linear.
Now, I've yet to touch on the case of storing data in bits specifically. That does make things slightly more complicated, but not terribly much so. Assuming you can use a rolling window, something like:
d = s[x-1] + s[x] + s[x+1]
works. Interestingly, if you were to rotate the image 90 degrees during the output of step 1 (trivial, sample from (y,x) when reading), you can get away with loading at most two horizontally adjacent bytes for any sample, and only a single byte something like 75% of the time. This plays a little less friendly with cache during the read, but greatly simplifies the algorithm (enough that it may regain the loss).
Pseudo-code:
buffer source, dest, vbuf, hbuf;
for_each (y, x)   // loop over each row, then each column; generally works better wrt paging
{
    hbuf(x, y) = (source(y, x-1) + source(y, x) + source(y, x+1)) / 3   // swap x and y to spin 90 degrees
}
for_each (y, x)
{
    vbuf(x, 1-y) = (hbuf(y, x-1) + hbuf(y, x) + hbuf(y, x+1)) / 3   // 1-y to reverse the 90 degree spin
}
for_each (y, x)
{
    dest(x, y) = threshold(vbuf(x, y))   // threshold the fully blurred value, not the intermediate hbuf
}
Accessing bits within the bytes (source(x, y) indicates access/sample) is relatively simple to do, but kind of a pain to write out here, so is left to the reader. The principle, particularly implemented in this fashion (with the 90 degree rotation), only requires 2 passes of n samples each, and always samples from immediately adjacent bits/bytes (never requiring you to calculate the position of the bit in the next row). All in all, it's massively faster and simpler than any alternative.
Rather than expanding the entire image to 1 bit/byte (or 8bpp, essentially, as you noted), you can simply expand the current window - read the first byte of the first row, shift and mask, then read out the three bits you need; do the same for the other two rows. Then, for the next window, you simply discard the left column and fetch one more bit from each row. The logic and code to do this right isn't as easy as simply expanding the entire image, but it'll take a lot less memory.
As a middle ground, you could just expand the three rows you're currently working on. Probably easier to code that way.
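For the bit access these answers leave to the reader, here is a minimal sketch (MSB-first bit order and a row pitch of stride bytes assumed; edge handling omitted; names are made up):
#include <cstdint>
#include <cstddef>

// Read the bit at (x, y) from a packed 1 bpp image.
inline int sample_bit(const std::uint8_t* img, std::size_t stride, int x, int y)
{
    return (img[static_cast<std::size_t>(y) * stride + (x >> 3)] >> (7 - (x & 7))) & 1;
}

// Weighted 3x3 sum around (x, y); filter[] is in row-major order.
int weighted_sum_3x3(const std::uint8_t* img, std::size_t stride,
                     int x, int y, const int filter[9])
{
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += filter[(dy + 1) * 3 + (dx + 1)] * sample_bit(img, stride, x + dx, y + dy);
    return sum;   // compare against the threshold to decide the output bit
}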

glPixelStorei(GL_UNPACK_ALIGNMENT, 1) Disadvantages?

What are the disadvantages of always using an alignment of 1?
glPixelStorei(GL_UNPACK_ALIGNMENT, 1)
glPixelStorei(GL_PACK_ALIGNMENT, 1)
Will it impact performance on modern gpus?
How can data not be 1-byte aligned?
This strongly suggests a lack of understanding of what the row alignment in pixel transfer operations means.
Image data that you pass to OpenGL is expected to be grouped into rows. Each row contains width number of pixels, with each pixel being the size defined by the format and type parameters. So a format of GL_RGB with a type of GL_UNSIGNED_BYTE will result in a pixel that is 24 bits in size. Pixels are otherwise expected to be packed, so a row of 16 of these pixels will take up 48 bytes.
Each row is expected to be aligned on a specific value, as defined by GL_PACK/UNPACK_ALIGNMENT. This means that the value you add to the pointer to get to the next row is: align(pixel_size * width, GL_*_ALIGNMENT). If the pixel size is 3 bytes, the width is 2, and the alignment is 1, the row byte size is 6. If the alignment is 4, the row byte size is 8.
See the problem?
Image data, which may come from some image file format as loaded with some image loader, has a row alignment. Sometimes this is 1-byte aligned, and sometimes it isn't. DDS images have an alignment specified as part of the format. In many cases, images have 4-byte row alignments; pixel sizes less than 32-bits will therefore have padding at the end of rows with certain widths. If the alignment you give OpenGL doesn't match that, then you get a malformed texture.
You set the alignment to match the image format's alignment. Unless you know or can otherwise ensure that your row alignment is always 1 (and that's unlikely unless you've written your own image format or DDS writer), you need to set the row alignment to be exactly what your image format uses.
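To make the row-stride arithmetic from the example above concrete, a minimal sketch (the function name is made up):
#include <cstddef>

// Byte distance between row starts, derived from width, pixel size and alignment.
std::size_t row_byte_size(std::size_t width, std::size_t bytes_per_pixel, std::size_t alignment)
{
    std::size_t unpadded = width * bytes_per_pixel;
    return (unpadded + alignment - 1) / alignment * alignment;   // round up to the alignment
}
// row_byte_size(2, 3, 1) == 6 and row_byte_size(2, 3, 4) == 8, matching the example above.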
Will it impact performance on modern gpus?
No, because the pixel store settings are only relevant for the transfer of data to or from the GPU, namely the alignment of your data. Once in GPU memory, it's aligned in whatever way the GPU and driver desire.
There will be no impact on performance. Setting a higher alignment (in OpenGL) doesn't improve anything or speed anything up.
All the alignment does is tell OpenGL where to expect the next row of pixels to start. You should always use an alignment of 1 if your image pixels are tightly packed, i.e. if there are no gaps between where one row of bytes ends and where the next row starts.
The default alignment is 4 (i.e. OpenGL expects the next row of pixels to start at a memory offset divisible by 4), which may cause problems when you load R, RG or RGB textures whose components are not 4-byte floats, or whose width is not divisible by 4. If your image pixels are tightly packed, you have to change the alignment to 1 in order for the unpacking to work.
You could (though I personally haven't encountered one) have an image of, say, 3x3 RGB ubyte whose rows are 4-byte aligned, with 3 extra bytes used as padding at the end. Its rows would look like this:
R - G - B - R - G - B - R - G - B - X - X - X (12 bytes in total)
The reason for this is that aligned data can improve the performance of the processor (I'm not sure how true/justified that is on today's processors). If you have any control over how the original image is composed, then maybe aligning it one way or another will improve the handling of it. But this is done prior to OpenGL. OpenGL has no way of changing anything about this; it only cares about where to find the pixels.
So, back to the 3x3 image row above: setting the alignment to 4 would be good (and necessary) to jump over the padding at the end. If you set it to 1, it will mess up your result, so you need to keep/restore it to 4. (Note that you could also use GL_UNPACK_ROW_LENGTH to jump over it, as that is the parameter used when dealing with subsets of the image, in which case you sometimes have to jump much more than the 3 or 7 bytes that the maximum alignment of 8 can give you. In our example, supplying a row length of 4 and an alignment of 1 will also work.)
The same goes for packing. You can tell OpenGL to align the pixel rows it writes to 1, 2, 4 or 8. If you're saving a 3x3 RGB ubyte image, you should set the alignment to 1. Technically, if you want the resulting rows to be tightly packed, you should always give 1. If you want (for whatever reason) to create some padding, you can give another value. Giving (in our example) a GL_PACK_ALIGNMENT of 4 would result in rows that look like the row above (with the 3 extra padding bytes at the end). Note that in that case your containing object (an OpenCV Mat, a bitmap, etc.) should be able to accommodate that extra padding.
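As a usage sketch, uploading a tightly packed RGB image (assuming width, height and pixels come from your image loader):
glPixelStorei(GL_UNPACK_ALIGNMENT, 1);              // rows may start at any byte
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0,
             GL_RGB, GL_UNSIGNED_BYTE, pixels);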

compact representation and delivery of point data

I have an array of point data, the values of points are represented as x co-ordinate and y co-ordinate.
These points could number anywhere from 500 up to 2000 or more.
The data represents a motion path which could range from the simple to very complex and can also have cusps in it.
Can I represent this data as one spline, a collection of splines, or some other format with very tight compression?
I have tried representing them as a collection of Beziers, but at best I am getting a saving of 40%.
For instance, if I have an array of 500 points, that gives me 500 x and 500 y values, so I have 1000 pieces of data.
I get around 100 quadratic Beziers from this. Each Bezier is represented as controlx, controly, anchorx, anchory,
which gives me 100 x 4 = 400 pieces of data.
So input = 1000 pieces, output = 400 pieces.
I would like to tighten this further; any suggestions?
By its nature, a spline is an approximation. You can reduce the number of splines you use to reach a higher compression ratio.
You can also achieve lossless compression by using some kind of encoding scheme. I am just making this up as I type, using the range example from another answer (1000 for x and 400 for y):
Each point only needs 19 bits (10 for x, 9 for y), so you can use 3 bytes to represent a full coordinate.
Use 2 bytes to represent a displacement of up to +/- 63.
Use 1 byte to represent a short displacement of up to +/- 7 for x and +/- 3 for y.
To decode the sequence properly, you need some prefix to identify the type of encoding. Let's say we use 110 for a full point, 10 for a displacement, and 0 for a short displacement.
The bit layout will look like this,
Coordinates: 110xxxxxxxxxxxyyyyyyyyyy
Displacement: 10xxxxxxxyyyyyyy
Short Displacement: 0xxxxyyy
Unless your sequence is totally random, you can easily achieve high compression ratio with this scheme.
Let's see how it works using a short example.
3 points: A(500, 400), B(550, 380), C(545, 381)
Let's say you were using 2 bytes for each coordinate value. It would take 12 bytes to encode this without compression.
To encode the sequence using the compression scheme,
A is the first point, so the full coordinate will be used: 3 bytes.
B's displacement from A is (50, -20) and can be encoded as a displacement: 2 bytes.
C's displacement from B is (-5, 1), which fits the range of a short displacement: 1 byte.
So you save 6 bytes out of 12. The real compression ratio depends entirely on the data pattern. It works best on points forming a moving path. If the points are random, only a 25% saving can be achieved.
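A sketch of that scheme in C++ (the prefix codes, field widths and ranges follow the description above; the full-point record is padded to 3 bytes so every record stays byte-aligned; all names are made up):
#include <cstdint>
#include <cstdlib>
#include <vector>

struct BitWriter {
    std::vector<std::uint8_t> bytes;
    int used = 0;                              // bits already used in the last byte
    void put(std::uint32_t value, int nbits) { // write nbits, most significant bit first
        for (int i = nbits - 1; i >= 0; --i) {
            if (used == 0) bytes.push_back(0);
            bytes.back() |= static_cast<std::uint8_t>(((value >> i) & 1u) << (7 - used));
            used = (used + 1) % 8;
        }
    }
};

struct Point { int x, y; };

void encode(const std::vector<Point>& pts, BitWriter& out)
{
    Point prev{0, 0};
    bool first = true;
    for (const Point& p : pts) {
        int dx = p.x - prev.x, dy = p.y - prev.y;
        if (!first && std::abs(dx) <= 7 && std::abs(dy) <= 3) {
            out.put(0b0, 1);                                   // short displacement: 1 byte
            out.put(static_cast<std::uint32_t>(dx) & 0xF, 4);  // 4-bit two's complement dx
            out.put(static_cast<std::uint32_t>(dy) & 0x7, 3);  // 3-bit two's complement dy
        } else if (!first && std::abs(dx) <= 63 && std::abs(dy) <= 63) {
            out.put(0b10, 2);                                  // displacement: 2 bytes
            out.put(static_cast<std::uint32_t>(dx) & 0x7F, 7);
            out.put(static_cast<std::uint32_t>(dy) & 0x7F, 7);
        } else {
            out.put(0b110, 3);                                 // full point: 3 bytes
            out.put(static_cast<std::uint32_t>(p.x), 11);      // x in 0..1000
            out.put(static_cast<std::uint32_t>(p.y), 10);      // y in 0..400
        }
        prev = p;
        first = false;
    }
}
// encode({{500, 400}, {550, 380}, {545, 381}}, writer) produces 3 + 2 + 1 = 6 bytes.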
If, for example, you use 32-bit integers for point coordinates and there is a range limit, like x: 0..1000, y: 0..400, you can pack (x, y) into a single 32-bit variable.
That way you achieve another 50% compression.
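For illustration, one possible packing (a sketch; 16 bits each is more than enough for the 0..1000 and 0..400 ranges):
#include <cstdint>

std::uint32_t pack_point(std::uint16_t x, std::uint16_t y)
{
    return (static_cast<std::uint32_t>(x) << 16) | y;   // x in the high half, y in the low half
}

void unpack_point(std::uint32_t packed, std::uint16_t& x, std::uint16_t& y)
{
    x = static_cast<std::uint16_t>(packed >> 16);
    y = static_cast<std::uint16_t>(packed & 0xFFFF);
}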
You could do a frequency analysis of the numbers you are trying to encode and use varying bit lengths to represent them; of course, here I am vaguely describing Huffman coding.
Firstly, only keep as many decimal places in your data as you actually need. Removing them will reduce your accuracy, but it's a calculated loss. To do that, try converting your number to a string, locating the dot's position, and cutting off the excess characters from the end. That could process faster than math, IMO. Lastly, you can convert it back to a number.
150.234636746 -> "150.234636746" -> "150.23" -> 150.23
Secondly, try storing your data relative to the last number ("relative values"). Basically, subtract the previous number from the current one. Then later, to "decompress" it, you can keep an accumulator variable and add the deltas back up.
A    A    A         A    R   R
150, 200, 250  ->   150, 50, 50
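A minimal sketch of that idea (function names are made up):
#include <vector>

// Store the first value as-is, then each value as a delta from its predecessor.
std::vector<int> to_relative(const std::vector<int>& values)
{
    std::vector<int> out;
    int prev = 0;
    for (int v : values) { out.push_back(v - prev); prev = v; }
    return out;
}

// "Decompress" by keeping an accumulator and adding the deltas back up.
std::vector<int> to_absolute(const std::vector<int>& deltas)
{
    std::vector<int> out;
    int acc = 0;
    for (int d : deltas) { acc += d; out.push_back(acc); }
    return out;
}
// to_relative({150, 200, 250}) == {150, 50, 50}, matching the example above.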