Size of prototype mask and Mask produced in Yolact - computer-vision

I read the papers describing YOLACT and YOLACT++, and I'm confused about the mask size and the prototype masks. There is an illustration of Protonet whose output has size 138 * 138 * 32. Is this the size of the prototype masks? The paper also says that the algorithm produces an image-sized mask, so please clarify the size of the mask that is produced.

Take for example an input with the following size:
(H,W,C) = (512,512,3)
The protonet will give you the following output size (a.k.a. proto-masks): (128,128,32), where 32 is the number of prototypes. Each spatial dimension is 1/4 of the input size.
The mask is obtained as a linear combination of the prototypes, weighted by the corresponding coefficients predicted by the prediction module.
Therefore you will have a mask of size (128,128). This mask is then cropped according to the bbox prediction (after NMS).
The bbox values can be expressed relative to the image size, so the box (0.5,0.5,1.,1.), which corresponds to (256.,256.,512.,512.) in the input image, becomes (64.,64.,128.,128.) in the mask created from the protos.
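The steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the YOLACT implementation: the helper names (`assemble_mask`, `box_to_mask_coords`, `crop`) are hypothetical, the sigmoid after the linear combination follows the paper's description, and the crop is assumed to zero out everything outside the box rather than change the mask size.

```python
import math

def assemble_mask(protos, coeffs):
    """Combine k prototype maps (each an H x W nested list) with k coefficients."""
    h, w = len(protos[0]), len(protos[0][0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            # Linear combination of the prototypes at this pixel
            s = sum(c * p[y][x] for c, p in zip(coeffs, protos))
            # Sigmoid squashes the result into (0, 1)
            row.append(1.0 / (1.0 + math.exp(-s)))
        mask.append(row)
    return mask

def box_to_mask_coords(rel_box, size):
    """Scale a relative (x1, y1, x2, y2) box to a square grid of the given size."""
    return tuple(round(v * size) for v in rel_box)

def crop(mask, box):
    """Keep mask values inside the box and zero out everything else."""
    x1, y1, x2, y2 = box
    return [[v if (x1 <= x < x2 and y1 <= y < y2) else 0.0
             for x, v in enumerate(row)]
            for y, row in enumerate(mask)]

# The same relative box in input-image and prototype-mask coordinates
print(box_to_mask_coords((0.5, 0.5, 1.0, 1.0), 512))
print(box_to_mask_coords((0.5, 0.5, 1.0, 1.0), 128))
```

The real prototypes are 128 x 128 x 32 for a 512 x 512 input; the functions work on any size, so a tiny 2 x 2 example is enough to try them out.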

Related

Compression of an image

I have been calculating the uncompressed and compressed file sizes of an image. So far the compressed image has always been smaller than the uncompressed one, as I would expect. If an image contains a large number of different colours, then storing the palette takes up a significant amount of space, and more bits are also needed to store each code. My question is: could this compression method result in a larger file than the uncompressed RGB image? What is the size (in pixels) of the smallest square RGB image, containing a total of k different colours, for which this compression method is still useful? In other words, for a given value of k, we want to find the smallest integer n for which an image of size n×n takes up less storage space after compression than the original RGB image.
Let's begin by making a small simplification -- the size of the encoded output depends on the number of pixels (the actual proportion of width vs. height doesn't really matter). Hence, let's generalize the problem to number of pixels N, from which we can always calculate n by taking a square root.
To further simplify the problem, we will also ignore the overhead of any image headers/metadata, such as width, height, size of the palette, etc. In practice, this would generally be some relatively small constant.
Problem Statement
Given that we have
N representing the number of pixels in an image
k representing the number of distinct colours in an image
24 bits per pixel RGB encoding
$L_{RGB}$ representing the length (in bits) of an RGB image
$L_P$ representing the length (in bits) of a palette image
our goal is to solve the following inequality
$$L_P < L_{RGB}$$
in terms of N.
Size of RGB Image
An RGB image is just an array of N pixels, each pixel taking up a fixed number of bits given by the RGB encoding. Hence,
$$L_{RGB} = 24N$$
Size of Palette Image
A palette image consists of two parts: a palette, and the pixels.
A palette is an array of k colours, each colour taking up a fixed number of bits given by the RGB encoding. Therefore, the size of the palette is
$$24k$$
In this case, each pixel holds an index to a palette entry, rather than an actual RGB colour. The number of bits required to represent k values is
$$\log_2 k$$
However, unless we can encode fractional bits (which I consider outside the scope of this question), we need to round this up. Therefore, the number of bits required to encode a palette index is
$$\lceil \log_2 k \rceil$$
Since there are N such palette indices, the size of the pixel data is
$$N \lceil \log_2 k \rceil$$
and the total size of the palette image is
$$L_P = 24k + N \lceil \log_2 k \rceil$$
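Solving the Inequality
$$24k + N \lceil \log_2 k \rceil < 24N$$
$$24k < N \left(24 - \lceil \log_2 k \rceil\right)$$
And finally (for $\lceil \log_2 k \rceil < 24$)
$$N > \frac{24k}{24 - \lceil \log_2 k \rceil}$$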
In Python, we could express this in the following way:
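import math

def limit_size(k):
    # Pixel count above which the palette image is smaller than the RGB image
    return (k * 24.) / (24. - math.ceil(math.log(k, 2)))

def size_rgb(N):
    # Size in bits of an N-pixel RGB image
    return (N * 24.)

def size_pal(N, k):
    # Size in bits of the N palette indices plus the k palette entries
    return (N * math.ceil(math.log(k, 2))) + (k * 24.)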
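To get from the pixel-count limit to the side length n asked for in the question, one possible sketch is shown below. The function names `index_bits` and `smallest_n` are hypothetical, and integer arithmetic (`int.bit_length`, `math.isqrt`, Python 3.8+) is used instead of `math.log` only to sidestep floating-point rounding for large k; the headers are still ignored, as in the derivation.

```python
import math

def index_bits(k):
    # ceil(log2(k)) computed with integers only
    return (k - 1).bit_length()

def smallest_n(k):
    """Smallest side n such that an n x n palette image beats 24-bit RGB."""
    bits = index_bits(k)
    if bits >= 24:
        raise ValueError("palette indices are as wide as RGB pixels")
    # Smallest integer N with N * (24 - bits) > 24 * k
    n_pixels = (24 * k) // (24 - bits) + 1
    # Smallest n with n * n >= n_pixels
    return math.isqrt(n_pixels - 1) + 1

for k in (2, 16, 256):
    print(k, smallest_n(k))
```

For k = 256, for example, the limit is 384 pixels, so a 19×19 image (361 pixels) is still larger as a palette image, while a 20×20 image (400 pixels) is smaller.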
In general no, but your question is not precise.
If we compress arbitrary files, the result can be larger. E.g. if you compress a randomly generated sequence of bytes, there is not much to compress, so you still pay for the header of the compression program, which tells which compression method is used and some versioning, and possibly for some escaping. This will enlarge the file. A good compression program will see that compression does not shrink the data, so it should simply not compress and state in the header that the data is stored as-is. Possibly this is decided per region of the file.
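The claim about random data is easy to check with a general-purpose compressor. The sketch below uses Python's standard `zlib` module as a stand-in; it is not tied to any particular image format, and the seed and buffer size are arbitrary choices for the illustration.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(4096))
uniform_bytes = bytes(4096)  # 4096 zero bytes, highly compressible

packed_random = zlib.compress(random_bytes)
packed_uniform = zlib.compress(uniform_bytes)

# Random data gains a few bytes of header/trailer overhead
print(len(random_bytes), len(packed_random))
# Repetitive data shrinks considerably
print(len(uniform_bytes), len(packed_uniform))
```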
But your question is about images. Compression is done inside the file, and often not on the whole file but just on the image data. In this case the program will see that there is no benefit in compressing and will keep the data uncompressed. Because the image headers are always present, this changes only a flag, so the size does not increase.
But this can also depend on the file format. You wrote about a "palette", but palettes are not used much nowadays: compression is done by finding similar patterns in the file. But again, this depends on the image format. If you look up a particular file format on Wikipedia, you may see a table with the header parameters (e.g. bit depth or number of colours (palette), definitions of colours, and the methods used to compress).
Then, for palette-like images, the answer of Dan Mašek (https://stackoverflow.com/a/58683948/2758823) has a nice mathematical explanation, but one should not forget that compression relies heavily on heuristics and on testing with real examples: real images have patterns.

OpenCV: Denoising image / video frame

I want to denoise a video using OpenCV and C++. I found on the OpenCV doc site this:
fastNlMeansDenoising(contourImage,contourImage2);
Every time a new frame is loaded, my program should denoise the current frame (contourImage) and write it to contourImage2.
But if I run the code, it returns 0 and exits. What am I doing wrong or is there an alternative way to denoise an image? (It should be fast, because I am processing a video)
Since you are using C++, note that you are not providing the full argument list. Try it this way:
cv::fastNlMeansDenoisingColored(contourImage, contourImage2, 10, 10, 7, 21);
// This is the original function to be used:
cv::fastNlMeansDenoising(src[, dst[, h[, templateWindowSize[, searchWindowSize]]]]) → dst
Parameters:
src – Input 8-bit 1-channel, 2-channel or 3-channel image.
dst – Output image with the same size and type as src .
templateWindowSize – Size in pixels of the template patch that is used to compute weights. Should be odd. Recommended value 7 pixels.
searchWindowSize – Size in pixels of the window that is used to compute weighted average for given pixel. Should be odd. Affects performance linearly: the greater searchWindowSize, the greater the denoising time. Recommended value 21 pixels.
h – Parameter regulating filter strength. Big h value perfectly removes noise but also removes image details, smaller h value preserves details but also preserves some noise.

OpenCV: Understanding Kernel

My book says this about the Image Kernel concept in OpenCV
When a computation is done over a pixel neighborhood, it is common to
represent this with a kernel matrix. This kernel describes how the
pixels involved in the computation are combined in order to obtain the
desired result.
In image blur techniques, we use the kernel size.
cv::GaussianBlur(inputImage,outputImage,Size(1,1),0,0)
So, if I say the kernel size is Size(1,1), does that mean the kernel has only 1 pixel?
Please have a look at the following image
In here, what's the kernel size? Size(3,3)? If I say Size(1,1) for this image, does that mean the kernel has only 1 pixel and the pixel value is 0 (the first value in the image)?
The kernel size in the example image you gave is 3-by-3 (Size(3,3)), yes. A kernel size of 1-by-1 is valid, although it wouldn't be very interesting.
The generic name for the operation being performed by GaussianBlur is a convolution.
The GaussianBlur function is creating a Gaussian kernel, which is basically a matrix that represents how you should combine a window of n-by-n pixels to get a single pixel value (using a Gaussian-shaped blurring pattern in this case).
A kernel of size 1-by-1 can't do anything other than scalar multiplication of an image; that is, convolution by the 1-by-1 matrix [c] is just c * inputImage.
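To make the two cases concrete, here is a small pure-Python sketch of applying a kernel to an image stored as nested lists. The function name `apply_kernel` is hypothetical, only the "valid" region is computed (no border handling), and the kernel is not flipped, which makes no difference for symmetric kernels such as a Gaussian. It is meant to illustrate the idea, not to reproduce what OpenCV does internally.

```python
def apply_kernel(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += kernel[j][i] * image[y + j][x + i]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# A 1-by-1 kernel [c] only scales every pixel by c
print(apply_kernel(image, [[2]]))

# A 3-by-3 kernel combines nine neighbouring pixels into one value
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(apply_kernel(image, ones))
```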
Typically, you'll want to choose a n-by-n Gaussian kernel that satisfies:
spread of Gaussian (i.e. standard deviation or variance) such that it blurs the amount you want
larger number means more blurring; smaller number means less blurring
choose n sufficiently large as to not truncate the Gaussian too close to the mode
Links:
Convolution (Wikipedia)
Gaussian blur (Wikipedia)
this section in particular
The image you post is a 3x3 kernel, which would be specified by cv::Size(3,3). You are correct in saying that cv::Size(1,1) corresponds to a single pixel, but saying "cv::Size(1,1)" in reference to the image is not meaningful. A 1x1 kernel would simply have the value [1].
This image is a kernel and its size is 3x3. Kernels are applied to an image by multiplying corresponding pixel values and summing the 9 results. This is called convolution / filtering in the literature. You can look at the following resources for more information:
http://en.wikipedia.org/wiki/Kernel_(image_processing)
http://homepages.inf.ed.ac.uk/rbf/HIPR2/filtops.htm
http://www.cse.usf.edu/~r1k/MachineVisionBook/MachineVision.files/MachineVision_Chapter4.pdf

quantization of dct image for steganography

I have a greyscale image. I split it into 8x8 blocks and computed the DCT of each block. I want to quantize the DCT coefficients and then replace their LSBs with my secret message bits. How exactly do I quantize the coefficients? Should I use the quantization matrix used by JPEG? How do I determine the values of such a quantization matrix?
You will probably want to set the quality level to the highest (smallest values in the quantization matrix) so that the modified LSB of each coefficient perturbs the image data the least.
For encoding:
You will need access to the DCT values after quantization and before entropy coding. There you can modify the LSBs. You should probably only modify the non-zero coefficient values, or you will make the compressed image file much larger and more distorted. This way, you will probably be able to encode 20-30 bits per DCT block.
For decoding:
You will need to do the reverse and get access to the DCT values immediately after the entropy decode and before the dequantization step.
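A minimal sketch of the embedding and extraction steps on a flat list of already-quantized integer coefficients is shown below. The function names are hypothetical, and one detail goes beyond the answer: coefficients equal to 1 are skipped as well as zeros (as in JSteg-style schemes), because clearing the LSB of a 1 would turn it into a 0 and the decoder could no longer tell which coefficients carry data.

```python
def embed_bits(coeffs, bits):
    """Write message bits into the LSBs of usable quantized coefficients."""
    out = list(coeffs)
    pending = iter(bits)
    for idx, c in enumerate(out):
        if c in (0, 1):          # skipped: cannot carry a bit reliably
            continue
        bit = next(pending, None)
        if bit is None:          # message fully embedded
            break
        out[idx] = (c & ~1) | bit
    return out                   # bits beyond the capacity are silently dropped

def extract_bits(coeffs, count):
    """Read back the first `count` message bits from the usable coefficients."""
    usable = [c & 1 for c in coeffs if c not in (0, 1)]
    return usable[:count]

quantized = [12, -3, 0, 1, 5, -1, 2, 0]
message = [1, 0, 1, 1]
stego = embed_bits(quantized, message)
print(stego)
print(extract_bits(stego, len(message)))
```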
To calculate the total number of bits available for your message, use the following example:
For a VGA sized image (640x480) which is encoded as 4:2:0 (subsampled color in both dimensions), you will have 40 x 30 = 1200 MCUs. Each MCU has 6 DCT blocks (4Y, 1Cr, 1Cb). This is a total of 7200 DCT blocks. If each block encodes an average of 25 coefficients (a reasonable quality level), then your message can be a total of 7200x25 = 180000 bits.
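The arithmetic of this example can be written as a short function. The name `capacity_bits` and its parameters are hypothetical; the 16×16 MCU size and the 6 blocks per MCU correspond to the 4:2:0 layout described above, and the average of 25 usable coefficients per block is the answer's assumption, not a measured value.

```python
def capacity_bits(width, height, coeffs_per_block=25):
    """Estimate message capacity for a 4:2:0 JPEG, one bit per usable coefficient."""
    mcus_x = -(-width // 16)    # ceiling division: MCUs are 16x16 pixels
    mcus_y = -(-height // 16)
    blocks = mcus_x * mcus_y * 6  # 4 Y + 1 Cb + 1 Cr blocks per MCU
    return blocks * coeffs_per_block

print(capacity_bits(640, 480))
```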

Scaling performed by gpu::dft with OpenCV in C++

I want to use a GPU-accelerated algorithm to perform a fast and memory-saving DFT. But when I call gpu::dft, the destination matrix is scaled, as explained in the documentation. How can I avoid the scaling of the width to dft_size.width / 2 + 1? Also, why is it scaled like this? My code for the DFT is this:
cv::gpu::GpuMat d_in, d_out;
d_in = in;
d_out.create(d_in.size(), CV_32FC2 );
cv::gpu::dft( d_in, d_out, d_in.size() );
where in is a CV_32FC1 matrix, which is 512x512.
The best solution would be a destination matrix which has the size d_in.size and the type CV_32FC2.
This is due to complex conjugate symmetry that is present in the output of an FFT. Intel IPP has a good description of this packing (the same packing is used by OpenCV). The OpenCV dft function also describes this packing.
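The symmetry itself can be seen with a naive one-dimensional DFT in plain Python. This is only a sketch of the mathematical property for a real input signal; it does not reproduce the two-dimensional packing used by OpenCV or IPP, and the `dft` helper and sample values are made up for the illustration.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]  # real-valued input
spectrum = dft(signal)
n = len(signal)

# For real input, bin n-k is the complex conjugate of bin k
for k in range(1, n):
    assert abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9

# So only n/2 + 1 bins carry independent information
half = spectrum[: n // 2 + 1]
print(len(half))
```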
So, from the gpu::dft documentation we have:
If the source matrix is complex and the output is not specified as real, the destination matrix is complex and has the dft_size size and CV_32FC2 type.
So, make sure you pass a complex matrix to the gpu::dft function if you don't want it to be packed. You will need to set the second channel to all zeros:
Mat realData;
// ... get your real data...
Mat cplxData = Mat::zeros(realData.size(), realData.type());
vector<Mat> channels;
channels.push_back(realData);
channels.push_back(cplxData);
Mat fftInput;
merge(channels, fftInput);
GpuMat fftGpu(fftInput.size(), fftInput.type());
fftGpu.upload(fftInput);
// do the gpu::dft here...
There is a caveat though...you get about a 30-40% performance boost when using CCS packed data, so you will lose some performance by using the full-complex output.
Hope that helps!
Scaling is done to obtain a result within the range of +/- 1.0. This is the most useful form for most applications that need to deal with a frequency representation of the data. To retrieve a result which is not scaled, just don't enable the DFT_SCALE flag.
Edit
The width of the result is reduced because the spectrum is symmetric. So all you have to do is append the existing values in a symmetric fashion.
The spectrum is symmetric because half of the width corresponds to the limit given by the sampling theorem. For example, a 2048-point DFT of a signal with a sample rate of 48 kHz can only represent frequencies up to 24 kHz, and this frequency is represented at half of the width.
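As a one-dimensional sketch of "appending the values in a symmetric fashion", the functions below rebuild a full spectrum from its first n/2 + 1 bins and map a bin index to a frequency. Both names are hypothetical, and the real gpu::dft output is two-dimensional, so this only illustrates the principle along one axis.

```python
def expand_half_spectrum(half, n):
    """Rebuild all n bins of a real signal's spectrum from bins 0 .. n/2."""
    full = list(half)
    for k in range(len(half), n):
        # Bin k mirrors bin n-k as its complex conjugate
        full.append(half[n - k].conjugate())
    return full

def bin_frequency(k, sample_rate, n):
    """Frequency in Hz represented by bin k of an n-point DFT."""
    return k * sample_rate / n

# Half spectrum of the real signal [0, 1, 0, 0] (a unit impulse delayed by one sample)
half = [1 + 0j, -1j, -1 + 0j]
print(expand_half_spectrum(half, 4))

# Bin 1024 of a 2048-point DFT at 48 kHz sits at the 24 kHz limit
print(bin_frequency(1024, 48000, 2048))
```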
Also for reference take a look at Spectrum Analysis Using the Discrete Fourier Transform.