Pixel array compression - C++

I am comparing two images and getting an array of unmatched pixels, like:
rgb(12, 54, 69) 1 4
rgb(19, 54, 98) 4 8
rgb(12, 54, 69) 2 9
rgb(86, 85, 10) 9 7
I need to transmit this over the network, so to compress it I can group the coordinates by colour:
rgb(12, 54, 69) (1, 4), (2, 9)
rgb(19, 54, 98) (4, 8)
rgb(86, 85, 10) (9, 7)
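
A minimal sketch of that grouping step, assuming the diff entries come in as (colour, x, y) tuples (shown in Python for brevity; in C++ the same idea is a std::map from the packed colour to a vector of coordinate pairs):

from collections import defaultdict

def group_by_colour(diff_pixels):
    """diff_pixels: iterable of ((r, g, b), x, y); returns {colour: [(x, y), ...]}."""
    groups = defaultdict(list)
    for colour, x, y in diff_pixels:
        groups[colour].append((x, y))
    return dict(groups)

# With the entries above this yields:
# {(12, 54, 69): [(1, 4), (2, 9)], (19, 54, 98): [(4, 8)], (86, 85, 10): [(9, 7)]}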
However, I doubt this simple compression will yield much benefit when the difference is large. I haven't run any tests yet.
When the whole image has changed, normal JPEG compression of the new image will be much smaller. However, for any small difference this method will have a smaller byte overhead, and there is no way to know the amount of change without looping over each image from top to bottom.
Is there any standard way of doing this? I'll be implementing it in C++ on top of protobuf or Boost serialization, and Qt.

Compress the image to JPEG at the start and pass the size of this new JPEG file to your function. As you iterate in your function, as soon as you exceed the size passed in, return a failure and just send the JPEG. This way you do not have to reach the end of your function.
However, I would first profile your current code; in the unlikely case you actually are bottlenecked by this code, you would probably be better served by creating a native/C module for doing this traversal.
Of course this presumes that the time taken by your function exceeds the JPEG creation time on average. As always, profile, profile, profile before optimizing.
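
A rough sketch of that early-exit idea, assuming Pillow is available for the in-memory JPEG baseline and using the per-colour grouping from the question (Python for brevity; the C++ structure is the same). The per-entry byte costs are placeholders for whatever your protobuf/Boost encoding actually produces, and new_frame is assumed to be a PIL image:

import io
from PIL import Image  # assumption: Pillow available for the JPEG baseline

def diff_or_jpeg(new_frame, diff_pixels, bytes_per_colour=3, bytes_per_pair=4):
    """Encode the new frame as JPEG first, then build the per-colour diff,
    bailing out as soon as the diff would be larger than the JPEG."""
    buf = io.BytesIO()
    new_frame.save(buf, format='JPEG')
    jpeg_bytes = buf.getvalue()

    groups, running_size = {}, 0
    for colour, x, y in diff_pixels:
        if colour not in groups:
            groups[colour] = []
            running_size += bytes_per_colour   # colour stored once per group
        groups[colour].append((x, y))
        running_size += bytes_per_pair         # one coordinate pair
        if running_size >= len(jpeg_bytes):    # the diff no longer wins: stop early
            return 'jpeg', jpeg_bytes
    return 'diff', groups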

Related

Use fewer than 5 anchor scales in an FPN object detection model

Usually the anchor sizes are set to {32, 64, 128, 256, 512}. However, in my dataset I don't have boxes as large as 512 x 512, so I would like to use only 4 anchor scales, i.e. {32, 64, 128, 256}. How would this be possible, since the FPN has 5 levels?
To elaborate, consider the following image. (It's from an article about detectron2)
Decreasing the number of anchors isn't really straightforward, since removing a scale involves removing a stage of the ResNet (a ResNet block) from being used. Both the BoxHead and the RPN expect P2 to P5 (the RPN expects res5/P6 as well). So my question is: if I were to remove an anchor scale (in my case 512 x 512, since my images are only 300 x 300 and objects won't exceed that size), which ResNet block should be ignored? Should the early, high-resolution block (res2) be ignored, or should the late, low-resolution one (res5) be removed?
Or is it that the structure does not allow removal of an anchor scale and 5 scales must be used?
You can remove the anchor scale, but be aware that you also need to modify your RPN and your BoxHead. P2 will have the largest dimension (512 in your case).
But maybe think about keeping all of them and changing only the anchor sizes, running from 16 up to 256. I guess this can save you a lot of reorganization of your model, plus it will improve detection of smaller objects.
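
The question mentions detectron2; if that is the framework in use, keeping all five FPN levels and only shifting the anchor sizes is roughly a one-line config change. A sketch, with a model-zoo config picked purely as an example:

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))

# One anchor size per FPN level (P2..P6): shift the whole range down instead of
# dropping a level, so the RPN and BoxHead still receive all five feature maps.
cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[16], [32], [64], [128], [256]]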

Reshaping the input layer to a single channel and multiple images

In some reference code I have picked up, there is:
net_->input_blobs()[0]->Reshape(1, 3, height, width);
My prototxt has:
input_shape {
  dim: 1
  dim: 3
  dim: 260
  dim: 347
}
I have been indirectly informed that the model provided has been tuned for greyscale (we have both a colour and a greyscale prototxt), and the currently-used Python code uses a greyscaled input with three identical channels.
Now I want to do two things, either together or separately: process 4 images in a single call to net_->Forward(), and pass those four images in as one-channel greyscale. So, first, choosing a single channel:
net_->input_blobs()[0]->Reshape(1, 1, height, width);
What are the repercussions of changing the number of channels? How do all my layers react? Will it work? If it works, will a one-channel net be faster?
Second, choosing four images:
net_->input_blobs()[0]->Reshape(4, 3, height, width);
I have a feeling that won't work, and I should be looking at increasing the number of input_blobs, but how to do that? Or what is the correct approach?
Working with a single channel rather than three identical ones should be faster (fewer multiply-add operations). Since this is done at the finest scale, it might even have a noticeable impact on run time.
Feeding 4 images as a single batch is usually faster than processing each image separately as a batch with one image (due to internal optimization of the computation to work with batches).
Bottom line: you should get better run time running a single batch of four images. If the input consists of three identical channels, it is better to modify the model to work with only one.
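
For concreteness, a minimal sketch of the batched four-image setup in pycaffe (the C++ Reshape/Forward calls in the question map onto these one-to-one; blob and file names are assumptions):

import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)  # placeholder file names

height, width = 260, 347
net.blobs['data'].reshape(4, 3, height, width)  # batch of 4; still 3 channels here
net.reshape()                                   # propagate the new shape through every layer

# Four greyscale images replicated into 3 identical channels, shape (4, 3, H, W)
batch = np.zeros((4, 3, height, width), dtype=np.float32)
net.blobs['data'].data[...] = batch
out = net.forward()

# Dropping to a single channel is not just a Reshape: the first conv layer's weights
# expect 3 input channels, so you also need a 1-channel deploy prototxt (e.g. with the
# conv1 kernels summed over RGB) before Reshape(4, 1, height, width) will load.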

Deep Learning: Using a pretrained network's earlier activations

I have around 25k images belonging to 14 different classes (different kinds of neck-lines, e.g. v-neck, round neck, etc.). The images mainly contain the top part of the apparel and/or the face of the model. Here are some examples:
In order to do this, I thought of extracting the features after the 1st block of VGG16 (pretrained on ImageNet), because the feature maps of the earlier blocks capture things like lines, shapes, etc. Here is the model.summary():
Layer (type) Output Shape Param #
=================================================================
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 802816) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 3288338432
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 16) 65552
=================================================================
Total params: 3,305,224,016
Trainable params: 3,305,224,016
Non-trainable params: 0
The problem is that the total number of parameters is huge. Can you please advise, considering my specific dataset, how to reduce that?
The problem is: a dense layer will create weights for each of the inputs.
So, because you've got an image full of pixels and you haven't reduced its size significantly, adding Flatten + Dense results in this absurd number of weights: flattening 112 x 112 x 64 gives 802,816 inputs, and 802,816 inputs times 4,096 units (plus biases) is exactly the ~3.29 billion parameters you see in fc1.
I'm not sure I understand why you want only the first block. This is the block that identifies very rudimentary little shapes without considering their relations to each other. And certainly, there are no earlier blocks.
I do recommend that you use more blocks, both to identify more elaborate features and to reduce the image size. The best thing to do is simply to take the entire VGG16 model with include_top=False and trainable=False, and add your own trainable top layers.
Now, if you really, really want to use so few blocks and not reduce the image size, then you can try adding a GlobalMaxPooling2D layer. This keeps only the max values over the entire image. It may or may not be useful, depending on how possible it is to identify what you want with so few convolutions. (Another option is GlobalAveragePooling2D, but I believe that is even less effective in this case.) Either way, more blocks will lead to better results.
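
A minimal Keras sketch of that recommendation: the full VGG16 base frozen with include_top=False, a global pooling layer to collapse the spatial map, and a small trainable head for the 14 neck-line classes. The head width and the 224 x 224 input are placeholder choices, not values from the question:

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the pretrained convolutional blocks fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),  # or GlobalMaxPooling2D, as discussed above
    layers.Dense(256, activation='relu'),
    layers.Dense(14, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()  # the parameter count is now in the millions, not billions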
You have to make your own feature extractor by removing all the dense layers from the original VGG and building your own dense layers. I suggest you add two dense layers: fc1 with 1024 nodes and fc2 with 512 nodes. Of course, you have to add a third dense layer that acts as a temporary classifier in order to train the feature extractor. Then train only these three layers, keeping the rest of VGG with trainable=False; that will reduce the parameters. After training, you should remove that last layer so you are left with the feature extractor. Then, for every image, you have 512 features that you can feed to a simple NN or an SVM of your choice as the classifier.
You should have a GPU with at least 8 GB of memory.
In the Keras blog you can find how to fine-tune your last layers in order to build your own feature extractor, which is more or less what you are looking for: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
Hope it helps!!
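
A hedged sketch of that second recipe: frozen VGG16 base, fc1 (1024) and fc2 (512) plus a temporary classifier trained on the 14 classes, then the classifier popped off so the remaining model emits 512-dimensional features for an SVM. Layer names and the input size are illustrative:

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models
from sklearn.svm import SVC

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(1024, activation='relu', name='fc1'),
    layers.Dense(512, activation='relu', name='fc2'),
    layers.Dense(14, activation='softmax', name='tmp_classifier'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(x_train, y_train, ...)  # trains only fc1/fc2/tmp_classifier

model.pop()  # drop the temporary classifier; the model now outputs fc2's 512 features
# features = model.predict(x_train)
# clf = SVC().fit(features, labels)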
You're on the right track, but you've been confused (as I once was) by an inconsistency in the way people describe neural nets. In Keras docs, the "top" layer is what other docs call the "bottom", "last", "final", or "deepest" layer. It's the layer that calculates the final probabilities. To implement transfer learning, you freeze the early layers (where images enter the network and are convolved) and replace or re-train the final layer (where the answer comes out). Keras calls that final layer "top". So in Keras-talk, you either instantiate the model with include_top=False, or you remove a layer with model.pop().
I hope that helps.
The response from Eric is correct. I second his recommendation that you read this blog post by François Chollet, the creator of Keras: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
If you are using a pre-trained model, you don't have to re-train the lower layers; just keep the last few classification layers trainable. For example, if you'd like to freeze the first 6 layers you can call:
for idx, layer in enumerate(model.layers[:6]):
    print('Make layer {} {} untrainable.'.format(idx, layer.name))
    layer.trainable = False
Then if you call model.summary(), you'll see that you have far fewer trainable parameters, which will not only make training faster but usually also gives better results when you do not alter the pre-trained convolutional layers.

Is there any way to determine the width and height of an RGB values array?

I have an RGB values array with a different raw size each time. I'm trying to determine which width/height would be most suitable for it.
The idea is that I'm getting raw files and I want to display the file data as a BMP image (e.g. Hex Workshop has that feature, called Data Visualizer).
Any suggestions?
Regards.
Find the divisors of the pixel array size.
For instance, if your array contains 243 pixels, divisors are 1, 3, 9, 27, 81 and 243. It means that your image is either 1x243, 3x81, 9x27, 27x9, 81x3 or 243x1.
You can only guess which is the right one by analyzing the image content: vertical or horizontal features, recurring patterns, common aspect ratios, etc.
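
A small sketch of that divisor enumeration (for a packed 24-bit RGB buffer, num_pixels would be the byte count divided by 3):

import math

def candidate_dimensions(num_pixels):
    """Return every (width, height) pair whose product equals num_pixels."""
    dims = []
    for w in range(1, math.isqrt(num_pixels) + 1):
        if num_pixels % w == 0:
            h = num_pixels // w
            dims.append((w, h))
            if h != w:
                dims.append((h, w))
    return sorted(dims)

# candidate_dimensions(243) -> [(1, 243), (3, 81), (9, 27), (27, 9), (81, 3), (243, 1)]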

Check for similarity on different size images

I have a video source that produces many streams for different devices (such as HD televisions, pads, smartphones, etc.), and each of them has to be checked against the others for similarity. Each video stream delivers 50 images per second, one image every 20 milliseconds.
Let's take, for instance, img1 coming from stream1 at time ts=1, img2 coming from stream2 at ts=1, and img1.1 taken from stream1 at ts=2 (20 milliseconds later than ts=1). The comparison result should look something like this:
compare(img1, img1) = 1 same image same size
compare(img1, img2) = 0.9 same image different size
compare(img1, img1.1) = 0.8 different images same size
Ideally this should be done in real time, so within 20 milliseconds. The goal is to understand whether the streams are out of synchronization. I have already implemented some comparison methods (none of them works for this case yet):
1) histogram (SSE and OpenCV CUDA); result: compare(img1, img2) ~= compare(img1, img1.1)
2) PSNR (SSE and OpenCV CUDA); result: compare(img1, img2) < compare(img1, img1.1)
3) SSIM (SSE and OpenCV CUDA); same result as PSNR
Maybe I am getting bad results because of the resize interpolation method?
Is it possible to build a comparison method that fulfills my requirements? Any ideas?
I'm afraid that you're running into a Real Problem (TM). This is not a trivial let's-give-it-to-the-intern problem.
The main challenge is that you can't do a brute-force comparison. HD images are 3 MB or more, and you're talking about O(N*M) comparisons (in time and across streams).
What you essentially need is a fingerprint that's robust against resizing but time-variant. And since you didn't realize that (the histogram idea, for instance, is quite time-stable), you didn't include the necessary information in this question.
So this isn't a C++ question, really. You need to understand your inputs.
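
To make the "fingerprint" idea a bit more concrete, here is one minimal sketch (assuming OpenCV): shrink every frame to a tiny fixed-size greyscale thumbnail, which removes the dependence on the source resolution while still changing from frame to frame. Whether its discrimination is good enough to separate "same frame, different size" from "next frame, same stream" on your content, within the 20 ms budget, is exactly what you would have to measure:

import cv2
import numpy as np

def fingerprint(frame_bgr, size=16):
    """Tiny greyscale thumbnail: cheap, resolution-independent, frame-dependent."""
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    thumb = cv2.resize(grey, (size, size), interpolation=cv2.INTER_AREA)
    return thumb.astype(np.float32) / 255.0

def similarity(fp1, fp2):
    """1.0 for identical fingerprints, decreasing as they diverge."""
    return 1.0 - float(np.mean(np.abs(fp1 - fp2)))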