How can I make this as fast as possible? - Iterating through an image mat - c++

The question is quite straightforward. I'll also explain what I do in case there is a faster way to do this without optimizing this specific way.
I go through an image and its rgb values. I have bins of size 256 for each color. So for every pixel I calculate the 3 bins of its rgb values. The bins essentially give me the index to access data for the specific color in a large vector. With this data, I do some calculations which are irrelevant. What I want to optimize is the accessing part.
Keep in mind that the large vector has an extra dimension. Every pixel belongs to some defined areas of the image. For every area it belongs to, it has an element in the big vector. So, if a pixel belongs in 4 areas(eg 3,9,12,13) then the data I want to access is: data[colorIndex][3],data[colorIndex][9],data[colorIndex][12],data[colorIndex][13].
I think that's enough to explain the code which is the following:
//Just filling with data for the sake of the example
int cols = 200; int rows = 200;
cv::Mat image(200, 200, CV_8UC3);
image.setTo(Scalar(100, 100, 100));
int numberOfAreas = 50;
//For every pixel (first dimension) we have a vector<int> containing ones for every area the pixel belongs to.
//For this example, every pixel belongs to every area.
vector<vector<int>> areasThePixelBelongs(200 * 200, vector<int>(numberOfAreas, 1));
int numberOfBins = 32;
int sizeOfBin = 256 / numberOfBins;
vector<vector<float>> data(pow(numberOfBins, 3), vector<float>(numberOfAreas, 1));
//Filling complete
//Part I need to optimize
uchar* matPointer;
for (int y = 0; y < rows; y++) {
matPointer = image.ptr<uchar>(y);
for (int x = 0; x < cols; x++) {
int red = matPointer[x * 3 + 2];
int green = matPointer[x * 3 + 1];
int blue = matPointer[x * 3];
int binNumberRed = red / sizeOfBin;
int binNumberGreen = green / sizeOfBin;
int binNumberBlue = blue / sizeOfBin;
//Instead of a 3d vector where I access the elements like: color[binNumberRed][binNumberGreen][binNumberBlue]
//I use a 1d vector where I just have to calculate the 1d index as follows
int index = binNumberRed * numberOfBins * numberOfBins + binNumberGreen * numberOfBins + binNumberBlue;
vector<int>& areasOfPixel = areasThePixelBelongs[y*cols+x];
int numberOfPixelAreas = areasOfPixel.size();
for (int i = 0; i < numberOfPixelAreas; i++) {
float valueOfInterest = data[index][areasOfPixel[i]];
//Some calculations here...
}
}
}
Would it be better accessing each mat element as a Vec3b? I think I'm essentially accessing an element 3 times for each pixel using uchar. Would accessing one Vec3b be faster?

First of all vector<vector<T>> is not efficiently stored in memory as it is not contiguous. This as often a big impact on performance and should be avoided as mush as possible (especially when the inner arrays are of the same size). Instead of this, you can use std::array for fixed-size arrays or a flatten std::vector (with the size dim1 * dim2 * ... dimN).
Moreover, the loop is a good candidate for parallelization. You can parallelize this code easily with OpenMP. This assumes Some calculations here can be implemented in a thread-safe way (you should be careful about shared writes if any). If this code is embarrassingly-parallel, then the resulting parallel code can be much faster. Still, using multi-threading introduces some overhead which may be too big compared to the overall computation time (which is highly dependent of the content in Some calculations here).
Finally, regarding the content in Some calculations here it may or may not be possible to adapt the code so the compiler use SIMD instructions. The data[index][areasOfPixel[i]] will likely prevent most compiler to do that, but the following computation could be. Note that software prefetching and gather instructions may help to speed up a bit the data[index][areasOfPixel[i]] operation.
Note that the way you access pixels should not have a significant impact on the runtime as the computation should be bounded by the speed of the inner loop iterating on areas containing some unknown code (unless this unknown code actually access pixels too).

Related

How to access matrix data in opencv by another mat with locations (indexing)

Suppose I have a Mat of indices (locations) called B, We can say that this Mat has dimensions of 1 x 100 and We suppose to have another Mat, called A, full of data of the same dimensions of B.
Now, I would access to the data of A with B. Usually I would create a for loop and I would take for each elements of B, the right elements of A. For the most fussy of the site, this is the code that I would write:
for(int i=0; i < B.cols; i++){
int index = B.at<int>(0, i);
std::cout<<A.at<int>(0, index)<<std:endl;
}
Ok, now that I showed you what I could do, I ask you if there is a way to access the matrix A, always using the B indices, in a more intelligent and fast way. As someone could do in python thanks to the numpy.take() function.
This operation is called remapping. In OpenCV, you can use function cv::remap for this purpose.
Below I present the very basic example of how remap algorithm works; please note that I don't handle border conditions in this example, but cv::remap does - it allows you to use mirroring, clamping, etc. to specify what happens if the indices exceed the dimensions of the image. I also don't show how interpolation is done; check the cv::remap documentation that I've linked to above.
If you are going to use remapping you will probably have to convert indices to floating point; you will also have to introduce another array of indices that should be trivial (all equal to 0) if your image is one-dimensional. If this starts to represent a problem because of performance, I'd suggest you implement the 1-D remap equivalent yourself. But benchmark first before optimizing, of course.
For all the details, check the documentation, which covers everything you need to know to use te algorithm.
cv::Mat<float> remap_example(cv::Mat<float> image,
cv::Mat<float> positions_x,
cv::Mat<float> positions_y)
{
// sizes of positions arrays must be the same
int size_x = positions_x.cols;
int size_y = positions_x.rows;
auto out = cv::Mat<float>(size_y, size_x);
for(int y = 0; y < size_y; ++y)
for(int x = 0; x < size_x; ++x)
{
float ps_x = positions_x(x, y);
float ps_y = positions_y(x, y);
// use interpolation to determine intensity at image(ps_x, ps_y),
// at this point also handle border conditions
// float interpolated = bilinear_interpolation(image, ps_x, ps_y);
out(x, y) = interpolated;
}
return out;
}
One fast way is to use pointer for both A (data) and B (indexes).
const int* pA = A.ptr<int>(0);
const int* pIndexB = B.ptr<int>(0);
int sum = 0;
for(int i = 0; i < Bi.cols; ++i)
{
sum += pA[*pIndexB++];
}
Note: Be carefull with pixel type, in this case (as you write in your code) is int!
Note2: Using cout for each point access put the optimization useless!
Note3: In this article Satya compare four methods for pixel access and fastest seems "foreach": https://www.learnopencv.com/parallel-pixel-access-in-opencv-using-foreach/

Intensity Histogram ++

I'm writing my own Intensity histogram for greyscale images where the number of bins is passed into the function.
This is what i have so far:
std::vector<unsigned int> Image::histogram(const int bins)
{
std::vector<unsigned int> histogram(bins ,0);
for (unsigned int i(0); i < bins; i++)
{
for (unsigned int j(0); j < m_height * m_width; ++j)
{
if (i == m_p_image[j])
{
histogram[i]++;
}
}
}
return histogram;
}
This works perfectly for 256 bins as each count is added to histogram, but for 128 bins its misses the second half of the image, I know I need to implement a way of grouping points together if the bin size is less than 256 but I'm unsure how to do this.
Your code strikes me as unnecessarily clumsy. There's no real need for the outer loop.
To answer the question you asked, however, the usual way to do this would be to use linear interpolation--that is, find the proportional position of a value in the input range, then increment the same proportional position in the output range.
for (j =0; j<height * width; j++) {
double input_pos = image[j] / 256.0;
int output_pos = int(input_pos * bin_count);
++histogram[output_pos];
}
Given that these are colors, you could (if you chose to) apply a gamma curve instead of doing linear interpolation. The reason to do that would be if you wanted to model how you see colors instead of just basing the histogram on the input numbers themselves. The difference between the two is based on the fact that vision is something like logarithmic instead of linear, so a linear histogram (especially if you're using relatively few bins compared to the number of possible input values) doesn't represent what we see very accurately.

Highly efficient way of reordering and rotating image simultaneously

For fast jpeg-loading I implemented a .mex-wrapper for turbojpeg to read (large) jpegs into MATLAB efficiently. The actual decoding takes only around 120 ms (not 5ms) for a 4000x3000px image. However, the pixel ordering is RGBRGBRGB... , while MATLAB requires a [W x H x 3] matrix, which in memory is a W*H*3 array, where the first WH entries correspond to red, the second WH entries to green, and the last WH entries to blue.
Additionally the image is mirrored around the axis from top left to bottom right.
The straightforward implementation of a rearrangement loop is the following:
// buffer contains mirrored and scrambled output of turbojpe
// outImg contains image matrix for use in MATLAB
// imgSize is an array containing {H,W,3}
for(int j=0; j<imgSize[1]; j++) {
for(int i=0; i<imgSize[0]; i++) {
curIdx = j*imgSize[0] + i;
curBufIdx = (i*imgSize[1] + j)*3;
outImg[curIdx] = buffer[curBufIdx++];
outImg[curIdx + imgSize[0]*imgSize[1] ] = buffer[curBufIdx++];
outImg[curIdx + 2*imgSize[0]*imgSize[1] ] = buffer[curBufIdx];
}
}
It works, but it takes around 120ms (not 20ms), about as long as the actual decoding. Any suggestions on how to make this code more efficient?
Due to a bug I updated the processing times.
EDIT: 99% of C libraries will store images row-major, meaning if you get a 3 x WH (a 2D array) from turbojpeg, you can just treat it as a 3 x W x H (the expected input above). In this representation, pixels read across then down. You need them to read down then across in MATLAB. You also need to convert pixel order (RGBRGBRGB...) to planar order (RRRR....GGGGG....BBBBB...). The solution is permute(reshape(I,3,W,H),[3 2 1]).
This is one of those situations where MATLAB's permute command is probably going to be faster than anything you will code by hand on short notice (at least 50% faster than the loop shown). I usually steer away from solutions with mexCallMATLAB, but I think this may be an exception. However, the input is a mxArray, which may be inconvenient. Anyway, here's how to do a permute(I,[3 2 1]):
#include "mex.h"
int computePixelCtoPlanarMATLAB(mxArray*& imgPermuted, const mxArray* img)
{
mxArray *permuteRHSArgs[2];
// img must be row-major (across first), pixel order (RGBRGBRGB...)
permuteRHSArgs[0] = const_cast<mxArray*>(img);
permuteRHSArgs[1] = mxCreateDoubleMatrix(1,3,mxREAL);
// output is col-major, planar order (rows x cols x 3)
double *p = mxGetPr(permuteRHSArgs[1]);
p[0] = 3;
p[1] = 2;
p[2] = 1;
return mexCallMATLAB(1, &imgPermuted, 2, permuteRHSArgs, "permute");
}
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ) {
// do some argument checking first (not shown)
// ...
computePixelCtoPlanarMATLAB(plhs[0], prhs[0]);
}
Or call permute(I,[3 2 1]) yourself back in MATLAB.
What about the reshape to first go from 3xWH to 3xWxH? Just tell the code that it's really 3xWxH! reshape moves no data -- it just tells MATLAB to treat a given data buffer as being a certain size.

How do you iterate through a pitched CUDA array?

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.
Cuda by Example is a great start.
The snippet on page 43 shows:
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x; // handle the data at this index
if (tid < N)
c[tid] = a[tid] + b[tid];
}
Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that into threads for you, in CUDA you have to tell it (via the number of blocks and number of threads in <<<...>>>) to run it sufficient times to iterate through your array, using a thread ID number as an iterator. In other words you can have a CUDA kernel always run 10,000 times which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:
// Host code
int width = 64, height = 64;
float* devPtr; size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c > width; ++c) {
float element = row[c];
}
}
}
This example doesn't seem too useful to me. First they declare an array that is 64 x 64, then the kernel is set to execute 512 x 100 times. That's fine, because the kernel does nothing other than iterate through the array (so it runs 51,200 loops through a 64 x 64 array).
According to this answer the iterator for when there are blocks of threads going on will be
int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
So if I wanted to run the first snippet in my question for a pitched array, I could just make sure I had enough blocks and threads to cover every element including the padding that I don't care about. But that seems wasteful.
So how do I iterate through a pitched array without going through the padding elements?
In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).
After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.
In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.
The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
So given an image of dimensions width x height in pixels, this is how it comes together:
Allocating memory
I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc which is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it could change. In my case the dimensions are all the same (complex to complex transform) but I have arrays that are real numbers so I store a complexPitch and a realPitch. The pitched memory is done like this:
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
To copy memory to/from pitched arrays you cannot use cudaMemcpy.
cudaMemcpy2D(inputGPU, complexPitch, //destination and destination pitch
inputPinned, width * sizeof(CFPtype), //source and source pitch (= width because it's not padded).
width * sizeof(CFPtype), height, cudaMemcpyKind::cudaMemcpyHostToDevice);
FFT plan for pitched arrays
JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:
int n[] = {height, width};
int nembed[] = {height, complexPitch/sizeof(CFPtype)};
result = cufftPlanMany(
&plan,
2, n, //transform rank and dimensions
nembed, 1, //input array physical dimensions and stride
1, //input distance to next batch (irrelevant because we are only doing 1)
nembed, 1, //output array physical dimensions and stride
1, //output distance to next batch
cufftType::CUFFT_C2C, 1);
Executing the FFT is trivial:
cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);
So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.
void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));
CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.
Traversing a pitched array in parallel
The kernel uses the standard pointer arithmetic from the documentation:
__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
int threadX = threadIdx.x + blockDim.x * blockIdx.x;
if (threadX >= width)
return;
int threadY = threadIdx.y + blockDim.y * blockIdx.y;
if (threadY >= height)
return;
CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
CFPtype complex = threadRow[threadX];
FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
FPtype *magElement = &(magRow[threadX]);
FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
FPtype *phaseElement = &(phaseRow[threadX]);
*magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
*phaseElement = atan2(complex.y, complex.x);
}
The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.

Add 1 to vector<unsigned char> value - Histogram in C++

I guess it's such an easy question (I'm coming from Java), but I can't figure out how it works.
I simply want to increment an vector element by one. The reason for this is, that I want to compute a histogram out of image values. But whatever I try I just can accomplish to assign a value to the vector. But not to increment it by one!
This is my histogram function:
void histogram(unsigned char** image, int height,
int width, vector<unsigned char>& histogramArray) {
for (int i = 0; i < width; i++) {
for (int j = 0; j < height; j++) {
// histogramArray[1] = (int)histogramArray[1] + (int)1;
// add histogram position by one if greylevel occured
histogramArray[(int)image[i][j]]++;
}
}
// display output
for (int i = 0; i < 256; i++) {
cout << "Position: " << i << endl;
cout << "Histogram Value: " << (int)histogramArray[i] << endl;
}
}
But whatever I try to add one to the histogramArray position, it leads to just 0 in the output. I'm only allowed to assign concrete values like:
histogramArray[1] = 2;
Is there any simple and easy way? I though iterators are hopefully not necesarry at this point, because I know the exakt index position where I want to increment something.
EDIT:
I'm so sorry, I should have been more precise with my question, thank you for your help so far! The code above is working, but it shows a different mean value out of the histogram (difference of around 90) than it should. Also the histogram values are way different than in a graphic program - even though the image values are exactly the same! Thats why I investigated the function and found out if I set the histogram to zeros and then just try to increase one element, nothing happens! This is the commented code above:
for (int i = 0; i < width; i++) {
for (int j = 0; j < height; j++) {
histogramArray[1]++;
// add histogram position by one if greylevel occured
// histogramArray[(int)image[i][j]]++;
}
}
So the position 1 remains 0, instead of having the value height*width. Because of this, I think the correct calculation histogramArray[image[i][j]]++; is also not working properly.
Do you have any explanation for this? This was my main question, I'm sorry.
Just for completeness, this is my mean function for the histogram:
unsigned char meanHistogram(vector<unsigned char>& histogram) {
int allOccurences = 0;
int allValues = 0;
for (int i = 0; i < 256; i++) {
allOccurences += histogram[i] * i;
allValues += histogram[i];
}
return (allOccurences / (float) allValues) + 0.5f;
}
And I initialize the image like this:
unsigned char** image= new unsigned char*[width];
for (int i = 0; i < width; i++) {
image[i] = new unsigned char[height];
}
But there shouldn't be any problem with the initialization code, since all other computations work perfectly and I am able to manipulate and safe the original image. But it's true, that I should change width and height - since I had only square images it didn't matter so far.
The Histogram is created like this and then the function is called like that:
vector<unsigned char> histogramArray(256);
histogram(array, adaptedHeight, adaptedWidth, histogramArray);
So do you have any clue why this part histogramArray[1]++; don't increases my histogram? histogramArray[1] remains 0 all the time! histogramArray[1] = 2; is working perfectly. Also histogramArray[(int)image[i][j]]++; seems to calculate something, but as I said, I think it's wrongly calculating.
I appreciate any help very much! The reason why I used a 2D Array is simply because it is asked for. I like the 1D version also much more, because it's way simpler!
You see, the current problem in your code is not incrementing a value versus assigning to it; it's the way you index your image. The way you've written your histogram function and the image access part puts very fine restrictions on how you need to allocate your images for this code to work.
For example, assuming your histogram function is as you've written it above, none of these image allocation strategies will work: (I've used char instead of unsigned char for brevity.)
char image [width * height]; // Obvious; "char[]" != "char **"
char * image = new char [width * height]; // "char*" != "char **"
char image [height][width]; // Most surprisingly, this won't work either.
The reason why the third case won't work is tough to explain simply. Suffice it to say that a 2D array like this will not implicitly decay into a pointer to pointer, and if it did, it would be meaningless. Contrary to what you might read in some books or hear from some people, in C/C++, arrays and pointers are not the same thing!
Anyway, for your histogram function to work correctly, you have to allocate your image like this:
char** image = new char* [height];
for (int i = 0; i < height; ++i)
image[i] = new char [width];
Now you can fill the image, for example:
for (int i = 0; i < height; ++i)
for (int j = 0; j < width; ++j)
image[i][j] = rand() % 256; // Or whatever...
On an image allocated like this, you can call your histogram function and it will work. After you're done with this image, you have to free it like this:
for (int i = 0; i < height; ++i)
delete[] image[i];
delete[] image;
For now, that's enough about allocation. I'll come back to it later.
In addition to the above, it is vital to note the order of iteration over your image. The way you've written it, you iterate over your columns on the outside, and your inner loop walks over the rows. Most (all?) image file formats and many (most?) image processing applications I've seen do it the other way around. The memory allocations I've shown above also assume that the first index is for the row, and the second is for the column. I suggest you do this too, unless you've very good reasons not to.
No matter which layout you choose for your images (the recommended row-major, or your current column-major,) it is in issue that you should always keep in your mind and take notice of.
Now, on to my recommended way of allocating and accessing images and calculating histograms.
I suggest that you allocate and free images like this:
// Allocate:
char * image = new char [height * width];
// Free:
delete[] image;
That's it; no nasty (de)allocation loops, and every image is one contiguous block of memory. When you want to access row i and column j (note which is which) you do it like this:
image[i * width + j] = 42;
char x = image[i * width + j];
And you'd calculate the histogram like this:
void histogram (
unsigned char * image, int height, int width,
// Note that the elements here are pixel-counts, not colors!
vector<unsigned> & histogram
) {
// Make sure histogram has enough room; you can do this outside as well.
if (histogram.size() < 256)
histogram.resize (256, 0);
int pixels = height * width;
for (int i = 0; i < pixels; ++i)
histogram[image[i]]++;
}
I've eliminated the printing code, which should not be there anyway. Note that I've used a single loop to go through the whole image; this is another advantage of allocating a 1D array. Also, for this particular function, it doesn't matter whether your images are row-major or column major, since it doesn't matter in what order we go through the pixels; it only matters that we go through all the pixels and nothing more.
UPDATE: After the question update, I think all of the above discussion is moot and notwithstanding! I believe the problem could be in the declaration of the histogram vector. It should be a vector of unsigned ints, not single bytes. Your problem seems to be that the value of the vector elements seem to stay at zero when your simplify the code and increment just one element, and are off from the values they need to be when you run the actual code. Well, this could be a symptom of numeric wrap-around. If the number of pixels in your image are a a multiple of 256 (e.g. 32x32 or 1024x1024 image) then it is natural that the sum of their number would be 0 mod 256.
I've already alluded to this point in my original answer. If you read my implementation of the histogram function, you see in the signature that I've declared my vector as vector<unsigned> and have put a comment above it that says this victor counts pixels, so its data type should be suitable.
I guess I should have made it bolder and clearer! I hope this solves your problem.