I have an OpenCV CV_16UC3 matrix in which only the lower 8 bits per channel are occupied. I want to create a CV_8UC3 matrix from it. Currently I use this method:
cv::Mat mat8uc3_rgb(imgHeight, imgWidth, CV_8UC3); // rows first, then cols
mat16uc3_rgb.convertTo(mat8uc3_rgb, CV_8UC3);
This produces the desired result, but I wonder whether it can be made faster or more efficient.
Edit:
The entire processing chain consists of only 4 sub-steps (per-frame computing time determined by QueryPerformanceCounter measurements on a video scene):

1. Mount the raw byte buffer in an OpenCV Mat:
cv::Mat mat16uc1_bayer(imgHeight, RawImageWidth, CV_16UC1, (uint8*)payload);

2. De-mosaicking:
cv::cvtColor(mat16uc1_bayer, mat16uc3_rgb, cv::COLOR_BayerGR2BGR);
needs 0.008808 [s]

3. Pixel shift (only 12 of the 16 bits are occupied, but we only need 8 of them):
uses OpenCV's parallel pixel access via mat16uc3_rgb.forEach<>() (see the sketch after this list)
needs 0.004927 [s]

4. Conversion from CV_16UC3 to CV_8UC3:
mat16uc3_rgb.convertTo(mat8uc3_rgb, CV_8UC3);
needs 0.006913 [s]
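For reference, here is a minimal sketch of what step 3 could look like, assuming a right shift by 4 bits (12 valid bits, of which we keep the top 8) applied in place with cv::Mat::forEach; the matrix name is taken from the snippets above:

#include <opencv2/core.hpp>

// In-place shift of every 16-bit channel value; forEach parallelizes over the pixels.
void shiftPixels(cv::Mat& mat16uc3_rgb)
{
    mat16uc3_rgb.forEach<cv::Vec3w>([](cv::Vec3w& px, const int* /*pos*/) {
        px[0] >>= 4;   // keep the top 8 of the 12 valid bits
        px[1] >>= 4;
        px[2] >>= 4;
    });
}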
I think I won't be able to do without converting the raw buffer into a cv::Mat or without the demosaicking. The pixel shift probably can't be accelerated any further (the parallelized forEach() is already used there). I had hoped that the conversion from CV_16UC3 to CV_8UC3 could be done by just updating the matrix header info or similar, because the matrix data is already correct and doesn't need to be scaled any further.
I think you can safely assume that cv::Mat::convertTo is the fastest possible implementation of that operation.
Seeing that you are going from one element size to another, it will likely not be a zero-cost operation: a memory copy is required for the rearranging.
If you are designing a very high-performance system, you should do an in-depth analysis of your bottlenecks and redesign your system to minimize them. Ask yourself: is this conversion really required at this point? Can I solve it by writing a custom function that integrates multiple operations in one? Can I use CPU parallelism extensions, multithreading or GPU acceleration? Etc.
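As an illustration of the "integrate multiple operations in one" idea, here is a hedged sketch that fuses the pixel shift (step 3) and the 16U-to-8U conversion (step 4) into a single parallel pass, so the intermediate 16-bit write and the separate convertTo() disappear. The shift amount of 4 bits and the matrix layout are assumptions taken from the question:

#include <opencv2/core.hpp>

void shiftAndConvert(const cv::Mat& src16, cv::Mat& dst8)
{
    CV_Assert(src16.type() == CV_16UC3);
    dst8.create(src16.size(), CV_8UC3);

    cv::parallel_for_(cv::Range(0, src16.rows), [&](const cv::Range& range) {
        for (int y = range.start; y < range.end; ++y) {
            const cv::Vec3w* s = src16.ptr<cv::Vec3w>(y);
            cv::Vec3b* d = dst8.ptr<cv::Vec3b>(y);
            for (int x = 0; x < src16.cols; ++x) {
                // 12 valid bits per channel: drop the lowest 4, keep the top 8.
                d[x][0] = static_cast<uchar>(s[x][0] >> 4);
                d[x][1] = static_cast<uchar>(s[x][1] >> 4);
                d[x][2] = static_cast<uchar>(s[x][2] >> 4);
            }
        }
    });
}

Whether this actually beats forEach + convertTo depends on memory bandwidth and core count, so it is worth measuring with the same QueryPerformanceCounter setup.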
Related
I have a signal to which I apply an FFT, do a convolution with itself, and then an IFFT back to the time domain. The signal is 8192 samples long. If I pad the signal to 16384 (N*2) and perform the operations, I get the correct output. However, is this necessary? When I try to stick with 8192 using C2C FFT transforms, the data looks the same throughout until the IFFT (when using 8192 the result only contains every 2nd point of the 16384 result).
I've run this through MATLAB and got the same results, so I suspect it's more to do with the maths than the implementation, but as I'm doing this in CUDA, any advice is welcome. I don't mind having to pad the data in some shape if necessary, but the data is fine up to the point where I do the IFFT.
N.B. I know I'm not doing all the computation on the GPU; this was simply to remove errors and allow me to see what the code was doing.
Link to code: http://pasted.co/e2e5e625
This is what I should get:
This is what I get if I don't pad:
I have a signal to which I apply an FFT, do a convolution with itself, and then an IFFT back to the time domain.
Looking at your code, you are not doing a "convolution with itself" in the frequency domain but rather a multiplication by itself.
The entire sequence of operations (FFT, multiplication, IFFT) would correspond to computing the circular convolution of the signal with itself in the time domain. The circular convolution would only be equivalent to a linear convolution if the signal is first padded to a length at least 2*N-1 (which happens to be the minimum size required to store all linear convolution coefficients after the IFFT).
You may use a smaller FFT size (i.e. smaller than 2*N-1, but at least N) to compute a linear convolution by using the Overlap-add method.
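To make the wrap-around effect concrete, here is a small plain C++ sketch (no CUDA/cuFFT involved) that computes the linear self-convolution directly and then shows what a length-L FFT-multiply-IFFT pipeline would return, namely the linear result aliased modulo L. With L = N the tail wraps onto the head; with L = 2*N (as in your 16384 case) nothing wraps, because 2*N >= 2*N-1:

#include <cstdio>
#include <vector>

// Linear convolution of x with itself (length 2*N-1).
std::vector<double> linearSelfConv(const std::vector<double>& x)
{
    const size_t N = x.size();
    std::vector<double> y(2 * N - 1, 0.0);
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            y[i + j] += x[i] * x[j];
    return y;
}

// What a length-L FFT -> multiply -> IFFT pipeline produces: the circular
// convolution, i.e. the linear convolution wrapped around modulo L.
std::vector<double> circularSelfConv(const std::vector<double>& x, size_t L)
{
    std::vector<double> lin = linearSelfConv(x);
    std::vector<double> y(L, 0.0);
    for (size_t k = 0; k < lin.size(); ++k)
        y[k % L] += lin[k];   // time-domain aliasing
    return y;
}

int main()
{
    std::vector<double> x = {1, 2, 3, 4};              // N = 4
    auto aliased = circularSelfConv(x, x.size());      // L = N: tail folded back
    auto padded  = circularSelfConv(x, 2 * x.size());  // L = 2N: equals the linear convolution
    for (double v : aliased) std::printf("%g ", v);
    std::printf("\n");
    for (double v : padded) std::printf("%g ", v);
    std::printf("\n");
    return 0;
}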
I'm looking at optimising some code with NVIDIA's OpenVX, and from previous experience with the CUDA API, GPU memory allocation is always a significant overhead.
So, I have a series of cv::Mat from video that I want to copy into an image; the naive code is of course:
vxImage = nvx_cv::createVXImageFromCVMat(context, cvMat);
The optimisation would be to allocate a single image, then just copy the bits on top. Looking at the header files (documentation is rather scant) I find:
nvx_cv::copyCVMatToVXMatrix(vxImage, cvMat);
However, the name is VXMatrix, so the compiler complains about a mismatch between the vx_matrix and vx_image types, of course. As far as I can tell, there is no copyCVMatToVXImage API; am I missing something, or is there another way to do this?
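I'm not aware of a copyCVMatToVXImage helper either, but if your OpenVX version is 1.1 or newer you can bypass the nvx_cv helpers and refresh a pre-allocated vx_image with the standard vxCopyImagePatch call. A hedged sketch, assuming an 8-bit 3-channel cv::Mat mapped to VX_DF_IMAGE_RGB (channel order may need attention, since cv::Mat is usually BGR):

#include <VX/vx.h>
#include <opencv2/core.hpp>

// Allocate the image once, outside the per-frame loop.
vx_image createReusableImage(vx_context context, const cv::Mat& mat)
{
    return vxCreateImage(context, (vx_uint32)mat.cols, (vx_uint32)mat.rows,
                         VX_DF_IMAGE_RGB);
}

// Per frame: copy the cv::Mat pixels into the existing image (no reallocation).
vx_status uploadCVMat(vx_image image, const cv::Mat& mat)
{
    vx_rectangle_t rect = { 0, 0, (vx_uint32)mat.cols, (vx_uint32)mat.rows };

    vx_imagepatch_addressing_t addr;
    addr.dim_x    = (vx_uint32)mat.cols;
    addr.dim_y    = (vx_uint32)mat.rows;
    addr.stride_x = 3;                      // bytes per pixel for 8UC3
    addr.stride_y = (vx_int32)mat.step;     // row pitch of the cv::Mat
    addr.scale_x  = VX_SCALE_UNITY;
    addr.scale_y  = VX_SCALE_UNITY;
    addr.step_x   = 1;
    addr.step_y   = 1;

    return vxCopyImagePatch(image, &rect, 0, &addr, (void*)mat.data,
                            VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
}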
What is the most efficient way to shift and scale (linear operation) each channel with different shift/scale values (floats) per channel? The input image may be any data type, but the output matrix needs to be CV_32F.
I could write my own for loop to do this, similar to
out[i,j,c] = scale[c] * in[i,j,c] + shift[c]
, but I wonder if that would be slow compared to more optimized routines. Scalar operations are supported, but only over the entire image, so I don't know how to isolate a given channel for this. I looked at split, but the docs don't mention whether memory is copied or not. If it is, it's sure to be slower. There are routines for color conversion, which is similar, but of course only specific conversions are supported.
Might there be some way, similar to
out.channels[c] = scale[c] * in.channels[c] + shift[c]
Preferably, I'd like to skip creating an intermediate matrix.
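One option that avoids any intermediate matrix is a single fused pass over the pixels. A hedged sketch, assuming CV_8UC3 input (other input depths would need their own instantiation or a templated body) and the hypothetical names scale/shift for the per-channel coefficients:

#include <opencv2/core.hpp>

void scaleShiftPerChannel(const cv::Mat& in, cv::Mat& out,
                          const float scale[3], const float shift[3])
{
    CV_Assert(in.type() == CV_8UC3);
    out.create(in.size(), CV_32FC3);

    cv::parallel_for_(cv::Range(0, in.rows), [&](const cv::Range& range) {
        for (int y = range.start; y < range.end; ++y) {
            const cv::Vec3b* s = in.ptr<cv::Vec3b>(y);
            cv::Vec3f* d = out.ptr<cv::Vec3f>(y);
            for (int x = 0; x < in.cols; ++x)
                for (int c = 0; c < 3; ++c)
                    d[x][c] = scale[c] * s[x][c] + shift[c];   // per-channel linear map
        }
    });
}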
I have read in many places that one should avoid OpenGL texture formats that are 3 bytes per texel and should always use 4-byte, bus-aligned representations. Keeping that in mind, I have a couple of questions about the glTexImage2D API.
From the documentation, the API signature is:
void glTexImage2D(GLenum target, GLint level, GLint internalformat,
GLsizei width, GLsizei height, GLint border,
GLenum format, GLenum type, const GLvoid * data);
Am I correct in assuming that if I have an RGB image and I want an RGBA representation, it suffices to specify the internalformat parameter as GL_RGBA and the format parameter as GL_RGB? Would there be an internal conversion between the formats when generating the texture?
My second question is: what if I have grayscale data (so just one channel)? Is it OK to represent this as GL_RED, or is it also better to have a 4-byte representation for this?
I disagree with the recommendation you found to avoid RGB formats. I can't think of a good reason to avoid RGB textures. Some GPUs support them natively, many others do not. You have two scenarios:
The GPU does not support RGB textures. It will use a format that is largely equivalent to RGBA for storage, and ignore the A component during sampling.
The GPU supports RGB textures.
In scenario 1, you end up with pretty much the same thing as you get when specifying RGBA as the internal format. In scenario 2, you save 25% of memory (and the corresponding bandwidth) by actually using a RGB internal format.
Therefore, using RGB for the internal format is never worse, and can be better on some systems.
What you have in your code fragment is completely legal in desktop OpenGL. The RGB data you provide will be expanded to RGBA by filling the A component with 1.0. This would not be the case in OpenGL ES, where you can only use a very controlled number of formats for each internal format, which mostly avoids format conversions during TexImage and TexSubImage operations.
It is generally beneficial to match the internal format and the format/type to avoid conversions. But even that is not as clear cut as it might seem. Say you compare loading RGB data or RGBA data into a texture with an internal RGBA format. Loading RGBA has a clear advantage by not using a format conversion. Since on the other hand the RGB data is smaller, loading it requires less memory bandwidth, and might cause less cache pollution.
Now, the memory bandwidth of modern computer systems is so high that you can't really saturate it with sequential access from a single core. So the option that avoids conversion is likely to be better. But it's going to be very platform and situation dependent. For example, if intermediate copies are needed, the smaller amount of data could win. Particularly if the actual RGB to RGBA expansion can be done as part of a copy performed by the GPU.
One thing I would definitely avoid is doing conversions in your own code. Say you get RGB data from somewhere, and you really do need to load it into a RGBA texture. Drivers have highly optimized code for these conversions, or they might even happen as part of a GPU blit. And it's always going to be better to perform the conversion as part of a copy, compared to your code creating another copy of the data.
Sounds confusing? These kinds of tradeoffs are very common when looking at performance. It's often the case that to get the optimal performance with OpenGL, you need to use a different code path for different GPUs/platforms.
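To make the two variants concrete, here is a hedged sketch of the corresponding glTexImage2D calls; width, height and pixels are placeholder names, and pixels is assumed to point to tightly packed 8-bit RGB data:

#include <GL/gl.h>

void uploadRGB(GLsizei width, GLsizei height, const void* pixels, int wantAlpha)
{
    // Rows of 3-byte pixels are usually not 4-byte aligned, so relax the unpack alignment.
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);

    if (wantAlpha) {
        // RGBA internal format fed with RGB client data: legal in desktop GL,
        // the driver fills the A component with 1.0.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, pixels);
    } else {
        // RGB internal format: never worse, and saves memory where natively supported.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB8, width, height, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, pixels);
    }
}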
Am I correct in assuming that if I have an RGB image and I want an RGBA representation, it suffices to specify the internalformat parameter as GL_RGBA and the format parameter as GL_RGB? Would there be an internal conversion between the formats when generating the texture?
This will work, and GL will assign a constant 1.0 to the alpha component for every texel. OpenGL is required to convert compatible image data into the GPU's native format during pixel transfer, and this includes things like adding extra image channels, converting from floating-point to fixed-point, swapping bytes for endian differences between CPU and GPU.
For best pixel transfer performance, you need to eliminate all of that extra work that GL would do. That means if you used GL_RGBA8 as your internal format your data type should be GL_UNSIGNED_BYTE and the pixels should be some variant of RGBA (often GL_BGRA is the fast path -- the data can be copied straight from CPU to GPU unaltered).
Storing only 3 components on the CPU side will obviously prevent a direct copy, and will make the driver do more work. Whether this matters really depends on how much and how frequently you transfer pixel data.
My second question is: what if I have grayscale data (so just one channel)? Is it OK to represent this as GL_RED, or is it also better to have a 4-byte representation for this?
GL_RED is not a representation of your data. That only tells GL which channels the pixel transfer data contains, you have to combine it with a data type (e.g. GL_UNSIGNED_BYTE) to make any sense out of it.
GL_R8 would be a common internal format for a grayscale image, and that is perfectly fine. The rule of thumb you need to concern yourself with is actually that data sizes need to be aligned to a power of two. So 1-, 2-, 4- and 8-byte image formats are fine. The oddball is 3, which happens when you try to use something like GL_RGB8 (the driver is going to have to pad that image format for alignment).
Unless you actually need 32 bits' worth of gradation (4.29 billion shades of gray!) in your grayscale image, stick with either GL_R8 (256 shades) or GL_R16 (65536 shades) for the internal format.
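A hedged sketch of the two transfer paths described above; the pixel pointers and sizes are placeholders, and a loader such as GLEW (or glad) is assumed so that the GL_BGRA and GL_R8 enums are available:

#include <GL/glew.h>

// Common fast path for color on desktop GL: RGBA8 internal format fed with
// BGRA/UNSIGNED_BYTE client data, so the driver can copy without conversion.
void uploadBGRA(GLsizei width, GLsizei height, const void* bgraPixels)
{
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_BGRA, GL_UNSIGNED_BYTE, bgraPixels);
}

// Grayscale: one 8-bit channel, GL_R8 internal format with GL_RED client data.
void uploadGray(GLsizei width, GLsizei height, const void* grayPixels)
{
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);   // 1-byte pixels, rows rarely 4-byte aligned
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, width, height, 0,
                 GL_RED, GL_UNSIGNED_BYTE, grayPixels);
}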
I have some memory that has been allocated on the device as a single allocation of H*W*sizeof(float) bytes.
This represents an H*W matrix.
I have code where I need to swap the quadrants of the matrix. Can I use cudaMemcpy2D to accomplish this? Would I just need to specify spitch and dpitch as W*sizeof(float) and use pointers to each quadrant of the matrix?
Also, when the cudaMemcpy documentation talks about the memory areas not overlapping: does that mean src and dst cannot overlap at all? As in, if I had a 10-byte-wide array that I wanted to shift left by one position, it would fail?
Thanks
You can use cudaMemcpy2D for moving around sub-blocks which are part of larger pitched linear memory allocations. There is no problem in doing that. The non-overlapping requirement is non-negotiable and it will fail if you try it. The source and destination can come from the same allocation, but the address ranges of the source and destination cannot overlap. If you need to do some "in-situ" copying where there is overlap, you might be better served to write a kernel to do it (see the matrix transpose example in the SDK as a sound way to do that kind of thing).
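For illustration, here is a hedged sketch of the cudaMemcpy2D approach as an out-of-place quadrant swap (fftshift-style), assuming H and W are even and d_src/d_dst are separate device allocations of H*W floats:

#include <cuda_runtime.h>

void swapQuadrants(float* d_dst, const float* d_src, int H, int W)
{
    const size_t pitch  = W * sizeof(float);        // full row pitch in bytes
    const size_t wBytes = (W / 2) * sizeof(float);  // half-row width in bytes
    const int    h2 = H / 2, w2 = W / 2;

    // top-left -> bottom-right
    cudaMemcpy2D(d_dst + h2 * W + w2, pitch, d_src,               pitch,
                 wBytes, h2, cudaMemcpyDeviceToDevice);
    // top-right -> bottom-left
    cudaMemcpy2D(d_dst + h2 * W,      pitch, d_src + w2,          pitch,
                 wBytes, h2, cudaMemcpyDeviceToDevice);
    // bottom-left -> top-right
    cudaMemcpy2D(d_dst + w2,          pitch, d_src + h2 * W,      pitch,
                 wBytes, h2, cudaMemcpyDeviceToDevice);
    // bottom-right -> top-left
    cudaMemcpy2D(d_dst,               pitch, d_src + h2 * W + w2, pitch,
                 wBytes, h2, cudaMemcpyDeviceToDevice);
}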
I suggest writing a simple kernel to do this matrix manipulation. I think it would be easier to write than using cudaMemcpy(2D) and almost certainly faster, assuming you write it to get good memory coalescing.
It's probably easiest to do an out-of-place transform (i.e. different input and output arrays) to avoid clobbering the input matrix. Each thread would simply read from its input offset and write to the transformed offset.
It would be similar to a matrix transpose. There is a matrix transpose example in the CUDA SDK.
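A hedged sketch of the kernel alternative: each thread reads one element of the H x W input and writes it to the quadrant-swapped position in a separate output array (out-of-place, so no overlap issues). H and W are assumed even:

#include <cuda_runtime.h>

__global__ void swapQuadrantsKernel(const float* in, float* out, int H, int W)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    // Shift each coordinate by half the size, wrapping around (fftshift-style).
    int xs = (x + W / 2) % W;
    int ys = (y + H / 2) % H;
    out[ys * W + xs] = in[y * W + x];
}

// Launch example:
//   dim3 block(16, 16);
//   dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
//   swapQuadrantsKernel<<<grid, block>>>(d_in, d_out, H, W);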