Linear Operator on each channel of OpenCV Mat - c++

What is the most efficient way to shift and scale (linear operation) each channel with different shift/scale values (floats) per channel? The input image may be any data type, but the output matrix needs to be CV_32F.
I could write my own for loop to do this, similar to
out[i,j,k] = scale[i] * in[i,j,k] + shift[i]
, but I wonder if that would be slow compared to more optimized routines. Scalar operations are supported, but only over the entire image, so I don't know how to isolate a given channel for this. I looked at split, but the docs don't mention whether memory is copied or not. If it is, it's sure to be slower. There are routines for color conversion, which is similar, but of course only specific conversions are supported.
Might there be some way, similar to
out.channels[i] = scale[i] * in.channels[i] + shift[i]
Preferably, I'd like to skip creating an intermediate matrix.
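A minimal sketch of the single-pass loop the question describes, written in plain C++ without OpenCV so that it stands alone. It assumes the interleaved channel layout (channel c of pixel p at index p * channels + c); the function name `scaleShiftPerChannel` is hypothetical. It says nothing about how this compares in speed to OpenCV's own routines.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies out = scale[c] * in + shift[c] to every element of an interleaved
// image buffer. T is the input element type; the output is always float.
template <typename T>
std::vector<float> scaleShiftPerChannel(const std::vector<T>& in,
                                        const std::vector<float>& scale,
                                        const std::vector<float>& shift) {
    const std::size_t channels = scale.size();
    assert(channels > 0 && shift.size() == channels);
    assert(in.size() % channels == 0);

    std::vector<float> out(in.size());
    const std::size_t pixels = in.size() / channels;
    for (std::size_t p = 0; p < pixels; ++p) {
        const std::size_t base = p * channels;
        for (std::size_t c = 0; c < channels; ++c) {
            // Convert to float first so the arithmetic never wraps in T.
            out[base + c] = scale[c] * static_cast<float>(in[base + c]) + shift[c];
        }
    }
    return out;
}
```

Because input conversion, scaling and shifting happen in one pass, no intermediate matrix is created. With a cv::Mat the same inner loop could run over each row pointer obtained from Mat::ptr, assuming the default interleaved storage.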

Related

High performance OpenCV matrix conversion from 16UC3 to 8UC3

I have an OpenCV CV_16UC3 matrix in which only the lower 8Bit per channel are occupied. I want to create a CV_8UC3 from it. Currently I use this method:
cv::Mat mat8uc3_rgb(imgHeight, imgWidth, CV_8UC3);  // cv::Mat takes (rows, cols, type)
mat16uc3_rgb.convertTo(mat8uc3_rgb, CV_8UC3);
This has the desired result, but I wonder if it can be faster or more performant somehow.
Edit:
The entire processing chain consists of only 4 sub-steps (computing time framewise determined by QueryPerformanceCounter measurement on video scene)
1. Mount the raw byte buffer in an OpenCV Mat:
   cv::Mat mat16uc1_bayer(imgHeight, RawImageWidth, CV_16UC1, (uint8*)payload);
2. Demosaicing:
   cv::cvtColor(mat16uc1_bayer, mat16uc3_rgb, cv::COLOR_BayerGR2BGR);
   needs 0.008808[s]
3. Pixel shift (only 12 of the 16 bits are occupied, but we only need 8 of them):
   uses OpenCV parallel access to the pixels via mat16uc3_rgb.forEach<>
   needs 0.004927[s]
4. Conversion from CV_16UC3 to CV_8UC3:
   mat16uc3_rgb.convertTo(mat8uc3_rgb, CV_8UC3);
   needs 0.006913[s]
I think I won't be able to do without mounting the raw buffer in a cv::Mat or the demosaicing. The pixel shift probably can't be accelerated any further (the parallelized forEach() is already used here). I hoped that, when converting from CV_16UC3 to CV_8UC3, an update of the matrix header info or similar would be possible, because the matrix data is already correct and doesn't have to be scaled any more.
I think you can safely assume that cv::Mat::convertTo is the fastest possible implementation of that operation.
Since you are going from one element type to another (16 bits to 8 bits per channel), it is unlikely to be a zero-cost operation: the data has to be copied and repacked.
If you are designing a very high-performance system, you should do an in-depth analysis of your bottlenecks and redesign your system to minimize them. Ask yourself: is this conversion really required at this point? Can I solve it by writing a custom function that integrates multiple operations in one? Can I use CPU parallelism extensions, multithreading or GPU acceleration? Etc.
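As an illustration of "a custom function that integrates multiple operations in one", here is a hedged sketch that fuses the pixel shift and the 16-bit to 8-bit narrowing into a single pass over a flat buffer. The name `shiftAndNarrow` and the saturating behaviour are assumptions made for this example; it is not the asker's forEach code and has not been benchmarked against convertTo.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fused step: drops the low `shiftBits` bits of each 16-bit
// sample and stores the result as 8 bits, saturating at 255.
std::vector<std::uint8_t> shiftAndNarrow(const std::vector<std::uint16_t>& src,
                                         unsigned shiftBits) {
    assert(shiftBits < 16);
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const unsigned v = static_cast<unsigned>(src[i]) >> shiftBits;
        dst[i] = static_cast<std::uint8_t>(v > 255u ? 255u : v);
    }
    return dst;
}
```

For 12-bit data, `shiftAndNarrow(samples, 4)` keeps the 8 most significant of the occupied bits, so the image is read once and written once instead of being traversed twice.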

What is the fastest way to perform FFT on a large file?

I am working on a C++ project that needs to perform FFT on a large 2D raster data (10 to 100 GB). In particular, the performance is quite bad when applying FFT for each column, whose elements are not contiguous in memory (placed with a stride of the width of the data).
Currently, I'm doing this: since the data does not fit in memory, I read several columns, say n columns, into memory with their orientation transposed (so that a column in the file becomes a row in memory) and apply the FFT with an external library (MKL). I read (fread) n pixels, move on to the next row (fseek by width - n), read n pixels, jump to the next row, and so on. When the operation (FFT) on the column chunk is done, I write it back to the file in the same manner: I write n pixels, jump to the next row, and so on. This way of reading and writing the file takes too much time, so I want to find some way of speeding it up.
I have considered transposing the whole file beforehand, but the entire process includes both row-major and column-major FFT operations and transposing will not benefit.
I'd like to hear any experiences or idea about this kind of column-major operations on a large data. Any suggestions related particularly to FFT or MKL will help as well.
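The column-chunk read described in the question can be sketched on an in-memory buffer as follows. This only illustrates the access pattern (n contiguous values per row, written out transposed); the function name `gatherColumnsTransposed` is hypothetical, and the real code would replace the pointer arithmetic with fseek/fread on the file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies columns [firstCol, firstCol + n) of a row-major width x height raster
// into a buffer where each source column becomes one contiguous row.
std::vector<float> gatherColumnsTransposed(const std::vector<float>& raster,
                                           std::size_t width, std::size_t height,
                                           std::size_t firstCol, std::size_t n) {
    assert(raster.size() == width * height);
    assert(firstCol + n <= width);
    std::vector<float> chunk(n * height);
    for (std::size_t row = 0; row < height; ++row) {
        // n contiguous values of this row (one fread in the file version).
        const float* src = raster.data() + row * width + firstCol;
        for (std::size_t c = 0; c < n; ++c) {
            chunk[c * height + row] = src[c];  // transposed write
        }
    }
    return chunk;
}
```

Each row contributes only n contiguous values, so a larger n means fewer, longer reads per byte transferred, at the cost of a chunk of n * height elements held in memory.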
Why not work with both transposed and non-transposed data at the same time? That will double the storage requirement, but it may be worth it.
Consider switching to a Hadamard transform. As a complete IPS, the transform requires no multiplications, since all of its coefficients are plus or minus one. If you need the result in a Fourier basis, a matrix multiplication will change bases.

Optimized image convolution algorithm

I am working on implementing Image convolution in C++, and I already have a naive working code based on the given pseudo code:
for each image row in input image:
    for each pixel in image row:
        set accumulator to zero
        for each kernel row in kernel:
            for each element in kernel row:
                if element position corresponding* to pixel position then
                    multiply element value corresponding* to pixel value
                    add result to accumulator
                endif
        set output image pixel to accumulator
As this can be a big bottleneck with big images and kernels, I was wondering if there is some other approach to make things faster, even with additional input info such as a sparse image or kernel, an already known kernel, etc.
I know this can be parallelized, but it's not doable in my case.
if element position corresponding* to pixel position then
I presume this test is meant to avoid a multiplication by 0. Skip the test! Multiplying by 0 is way faster than the delays caused by a conditional jump.
The other alternative (and it's always better to post actual code rather than pseudo-code, here you have me guessing at what you implemented!) is that you're testing for out-of-bounds access. That is terribly expensive also. It is best to break up your loops so that you don't need to do this testing for the majority of the pixels:
for (row = 0; row < k/2; ++row) {
    // inner loop over kernel rows is adjusted so it only loops over part of the kernel
}
for (row = k/2; row < nrows-k/2; ++row) {
    // inner loop over kernel rows is unrestricted
}
for (row = nrows-k/2; row < nrows; ++row) {
    // inner loop over kernel rows is adjusted
}
Of course, the same applies to loops over columns, leading to 9 repetitions of the inner loop over kernel values. It's ugly but way faster.
To avoid the code repetition you can create a larger image, copy the image data over, padded with zeros on all sides. The loops now do not need to worry about accessing out-of-bounds, you have much simpler code.
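The padding approach can be sketched as follows. This is a minimal example assuming a row-major float image and an odd-sized square kernel; `filterZeroPadded` is a hypothetical name. The kernel is applied without flipping (a correlation), so flip it first if a true convolution is wanted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Filters a rows x cols row-major image with a k x k kernel (k odd),
// treating pixels outside the image as zero.
std::vector<float> filterZeroPadded(const std::vector<float>& image,
                                    std::size_t rows, std::size_t cols,
                                    const std::vector<float>& kernel, std::size_t k) {
    assert(image.size() == rows * cols);
    assert(k % 2 == 1 && kernel.size() == k * k);
    const std::size_t r = k / 2;
    const std::size_t pcols = cols + 2 * r;

    // Copy the image into the centre of a larger, zero-filled buffer.
    std::vector<float> padded((rows + 2 * r) * pcols, 0.0f);
    for (std::size_t y = 0; y < rows; ++y)
        for (std::size_t x = 0; x < cols; ++x)
            padded[(y + r) * pcols + (x + r)] = image[y * cols + x];

    // The inner loops no longer need any bounds tests.
    std::vector<float> out(rows * cols);
    for (std::size_t y = 0; y < rows; ++y) {
        for (std::size_t x = 0; x < cols; ++x) {
            float acc = 0.0f;
            for (std::size_t ky = 0; ky < k; ++ky)
                for (std::size_t kx = 0; kx < k; ++kx)
                    acc += kernel[ky * k + kx] * padded[(y + ky) * pcols + (x + kx)];
            out[y * cols + x] = acc;
        }
    }
    return out;
}
```

The price is one extra copy of the image; in exchange there is a single branch-free inner loop instead of nine variants.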
Next, a certain class of kernels can be decomposed into 1D kernels. For example, the well-known Sobel kernel results from the convolution of [1,2,1] and [1,0,-1]^T. For a 3x3 kernel this is not a huge deal, but for larger kernels it is. In general, for an NxN kernel, you go from N^2 to 2N operations per pixel.
In particular, the Gaussian kernel is separable. This is a very important smoothing filter that can also be used for computing derivatives.
Besides the obvious computational cost saving, the code is also much simpler for these 1D convolutions. The 9 repeated blocks of code we had earlier become 3 for a 1D filter. The same code for the horizontal filter can be re-used for the vertical one.
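A hedged sketch of the separable approach: one generic 1D pass is used twice, first along rows and then along columns. The names `filter1D` and `filterSeparable` are hypothetical. For brevity the border is handled with a simple test inside the loop, which the loop splitting or padding described earlier would remove. As before, kernels are applied without flipping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One 1D pass. `lines` is the number of independent lines, `len` the number
// of samples per line, `lineStride` the distance between line starts and
// `stride` the distance between neighbouring samples on a line.
// Samples outside the image count as zero.
void filter1D(const std::vector<float>& in, std::vector<float>& out,
              std::size_t lines, std::size_t len,
              std::size_t lineStride, std::size_t stride,
              const std::vector<float>& kernel) {
    const std::size_t r = kernel.size() / 2;
    for (std::size_t l = 0; l < lines; ++l) {
        for (std::size_t i = 0; i < len; ++i) {
            float acc = 0.0f;
            for (std::size_t t = 0; t < kernel.size(); ++t) {
                // Source position is i + t - r; skip it when outside the line.
                if (i + t < r || i + t - r >= len) continue;
                acc += kernel[t] * in[l * lineStride + (i + t - r) * stride];
            }
            out[l * lineStride + i * stride] = acc;
        }
    }
}

// Applies `horizontal` along each row, then `vertical` along each column.
std::vector<float> filterSeparable(const std::vector<float>& image,
                                   std::size_t rows, std::size_t cols,
                                   const std::vector<float>& horizontal,
                                   const std::vector<float>& vertical) {
    assert(image.size() == rows * cols);
    assert(horizontal.size() % 2 == 1 && vertical.size() % 2 == 1);
    std::vector<float> tmp(image.size()), out(image.size());
    filter1D(image, tmp, rows, cols, cols, 1, horizontal);  // along each row
    filter1D(tmp, out, cols, rows, 1, cols, vertical);      // along each column
    return out;
}
```

Calling `filterSeparable(image, rows, cols, {1, 0, -1}, {1, 2, 1})` gives the same result as filtering with the full 3x3 Sobel kernel, using 6 multiplications per pixel instead of 9.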
Finally, as already mentioned in MBo's answer, you can compute the convolution through the DFT. The DFT can be computed using the FFT in O(MN log MN) (for an image of size MxN). This requires padding the kernel to the size of the image, transforming both to the Fourier domain, multiplying them together, and inverse-transforming the result. 3 transforms in total. Whether this is more efficient than the direct computation depends on the size of the kernel and whether it is separable or not.
For small kernel sizes the simple method might be faster. Also note that separable kernels (for example, the Gaussian kernel), as mentioned, allow filtering by rows and then by columns, resulting in O(N^2 * M) complexity (where N is the image size and M the kernel size).
For other cases, there is fast convolution based on the FFT (Fast Fourier Transform). Its complexity is O(N^2*logN), compared to O(N^2*M^2) for the naive implementation.
Of course, there are some peculiarities in applying these techniques, for example edge effects, but one needs to account for them in the naive implementation too (to a lesser degree, though).
FI = FFT(Image)
FK = FFT(Kernel)
Prod = FI * FK (element-by-element complex multiplication)
Conv(I, K) = InverseFFT(Prod)
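The four steps above can be sketched in 1D as follows. This is a minimal illustration, not a tuned implementation: it uses a simple recursive radix-2 FFT written for this example (a real project would more likely call a library such as MKL or FFTW), and the names `fft` and `fftConvolve` are hypothetical. For images, the same transform is applied along rows and then along columns.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Recursive radix-2 FFT; a.size() must be a power of two.
// invert = true computes the unscaled inverse transform.
void fft(cvec& a, bool invert) {
    const std::size_t n = a.size();
    if (n <= 1) return;
    cvec even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    fft(even, invert);
    fft(odd, invert);
    const double pi = std::acos(-1.0);
    const double sign = invert ? 1.0 : -1.0;
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle = sign * 2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const std::complex<double> w = std::polar(1.0, angle);
        a[k] = even[k] + w * odd[k];
        a[k + n / 2] = even[k] - w * odd[k];
    }
}

// Linear convolution of two real 1D signals via the FFT.
std::vector<double> fftConvolve(const std::vector<double>& x,
                                const std::vector<double>& h) {
    assert(!x.empty() && !h.empty());
    const std::size_t outLen = x.size() + h.size() - 1;
    std::size_t n = 1;
    while (n < outLen) n *= 2;  // zero-pad to a power of two

    cvec fx(n), fh(n);  // value-initialised to zero
    for (std::size_t i = 0; i < x.size(); ++i) fx[i] = x[i];
    for (std::size_t i = 0; i < h.size(); ++i) fh[i] = h[i];

    fft(fx, false);                                      // FI = FFT(Image)
    fft(fh, false);                                      // FK = FFT(Kernel)
    for (std::size_t i = 0; i < n; ++i) fx[i] *= fh[i];  // Prod = FI * FK
    fft(fx, true);                                       // InverseFFT(Prod)

    std::vector<double> out(outLen);
    for (std::size_t i = 0; i < outLen; ++i)
        out[i] = fx[i].real() / static_cast<double>(n);  // inverse scaling
    return out;
}
```

Padding both inputs to at least x.size() + h.size() - 1 samples is what turns the circular convolution computed by the FFT into the ordinary linear one.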
Note that you can use a fast library intended for image filtering; for example, OpenCV can apply a kernel to a 1024x1024 image in 5-30 milliseconds.
One way to speed this up, depending on the target platform, might be to find every distinct value in the kernel, store one copy of the image in memory per distinct value, and multiply each copy by its kernel value; at the end, shift, sum and divide the copies into one image. This could be done on a graphics processor, for example, where memory is ample and which is better suited to this kind of tight, repetitive processing. The image copies need to tolerate overflow of the pixel values, or you could use floating-point values.

Are these the irreconcilable cons of using DictVectorizer in Scikit learn?

I have 5+ million records from which to predict people's race. One textual feature gives rise to tens of thousands more. For example, the name 'Smith' gives rise to 'sm', 'mi', 'it', etc. I then need to transform it into a sparse matrix:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X2= vec.fit_transform(measurements)
Because of the tens of thousands of generated features, I can't use the following to give me an array, otherwise I am getting an out of memory error.
X = vec.fit_transform(measurements).toarray()
As far as I can tell, a lot of other functions/modules in scikit-learn only allow array-format data to be fitted. For example: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA and http://scikit-learn.org/stable/modules/feature_selection.html for dimensionality reduction and feature selection.
pca = PCA(n_components=2)
pca.fit(X) # X works but not X2, though I can't get X with my big data set because of out-of-memory error
I am not certain that this will help, but you can try to slice your X2 into smaller parts (but still as big as possible), and use IncrementalPCA on them.
from sklearn.utils import gen_batches
from sklearn.decomposition import IncrementalPCA
pca = IncrementalPCA()
n_samples, n_features = X2.shape
batchsize = n_features*5
for batch in gen_batches(n_samples, batchsize):
    pca.partial_fit(X2[batch].toarray())
You may change the constant 5 to a bigger number if your RAM allows it.
As you noticed you probably won't be able to convert your text features into a numpy array.
So you'll need to focus on techniques that can handle sparse data.
PCA is not one of them.
The reason is that PCA performs centering of the data, which makes the data dense (picture a sparse matrix, then subtract 0.5 from every element).
This SO answer provides more explanation and an alternative:
To clarify: PCA is mathematically defined as centering the data (removing the mean value from each feature) and then applying truncated SVD on the centered data.
As centering the data would destroy the sparsity and force a dense representation that often does not fit in memory any more, it is common to directly do truncated SVD on sparse data (without centering). This resembles PCA but it's not exactly the same.
In the context of text data performing SVD after a TfidfVectorizer or a CountVectorizer is actually a famous technique called latent semantic analysis.
As for the feature selection part, you'll probably have to modify the source code of your scoring function (e.g. chi2) so that it handles sparse matrices without making them dense.
It is possible; this is mostly a trade-off between keeping the sparsity of the matrices and using efficient array operations.
In your case though I'd try and throw this at a classifier first to see if the extra work is worth your time.

Is it possible to use a value in d[x] register as address in vld?

I have an image of size M x N, and each pixel is 14 bits (all of them are stored in 16-bit integers, but the 2 least significant bits are not used). I want to map each pixel to an 8-bit value using a mapping function, which is simply an array of 16384 values. I perform this image tone mapping in pure C++ as follows:
for(int i=0;i<imageSize;i++)
{
    resultImage[i] = mappingArray[image[i]];
}
However, I want to optimize this operation using ARM NEON intrinsics. Since there are 32 (correct me if I'm wrong) NEON d registers, I cannot use the VTBL instruction for a lookup table larger than 8x32 = 256 elements. Moreover, there is another discussion on Stack Overflow about using a lookup table larger than 32 bytes:
ARM NEON: How to implement a 256bytes Look Up table
How can I optimize such a simple-looking operation? I am thinking of using the pixels of the image as the address parameter of the VLD instruction, something like the following:
VLD1.8 {d1},[d0] ??
Is it possible? Or how can I handle this?
The optimization in the other example works by holding an entire lookup table in registers. You simply cannot do this: your table is 16384 bytes (2^14 -> 2^8), and that is way, way more than you have in register space.
Hence, your table will reside in L1 cache. The obvious C++ code:
unsigned char mappingArray[16384];
fill(mappingArray);
for(int i=0;i<imageSize;i++)
{
resultImage[i] = mappingArray[image[i]>>2];
}
will probably compile straight to the most efficient code. The problem isn't how you get things in registers. The problem is that you need memory access to your input image, mapping table and output image.
If speed was a problem, I'd solve this by aggressively trimming the table to perhaps 128 entries, and using linear interpolation on the next few bits.
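A scalar sketch of that idea, assuming the mapping curve is smooth enough for linear interpolation to be acceptable. The 7/7 bit split, the 129-entry table (one extra entry so the last segment has an end point) and the names `buildCoarseTable` and `mapInterpolated` are choices made for this example, not part of the answer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds a 129-entry table by sampling a full 16384-entry table every 128
// inputs; the extra last entry is the end point of the final segment.
std::vector<std::uint8_t> buildCoarseTable(const std::vector<std::uint8_t>& full) {
    assert(full.size() == 16384);
    std::vector<std::uint8_t> coarse(129);
    for (int i = 0; i < 128; ++i) coarse[i] = full[i * 128];
    coarse[128] = full[16383];
    return coarse;
}

// Maps a 14-bit value: the top 7 bits select a segment, the low 7 bits
// interpolate linearly between its two end points (truncating division).
std::uint8_t mapInterpolated(const std::vector<std::uint8_t>& coarse,
                             std::uint16_t value14) {
    assert(coarse.size() == 129 && value14 < 16384);
    const int idx = value14 >> 7;
    const int frac = value14 & 127;
    const int a = coarse[idx];
    const int b = coarse[idx + 1];
    return static_cast<std::uint8_t>(a + ((b - a) * frac) / 128);
}
```

With the storage format from the question, the argument would be `image[i] >> 2`. The result is an approximation of the full table, so its error should be checked against the real mapping before relying on it.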
Given a large look-up table, the normal process is to look very closely at it to figure out (or find on the internet) the algorithm to compute each entry. If that algorithm turns out to be simple enough then you might find that it's faster to perform the calculations in parallel rather than to perform scalar table look-ups.
Alternatively, based on the shape of the data you can try to find approximations which are up to requirements but which are easier to compute.
For example, you can use VTBL on the top three or four bits of the input, and linear interpolation on the rest. But this only works if the curve is smooth enough that linear interpolation is an adequate approximation.
A common operation which matches the parameters stated is linear to sRGB conversion; in which case you're looking at raising each input to the power of 5/12. That's a bit hairy, but you might still be able to get some performance gain if you don't need to be too accurate.