Highly efficient way of reordering and rotating image simultaneously - c++

For fast jpeg-loading I implemented a .mex-wrapper for turbojpeg to read (large) jpegs into MATLAB efficiently. The actual decoding takes only around 120 ms (not 5ms) for a 4000x3000px image. However, the pixel ordering is RGBRGBRGB... , while MATLAB requires a [W x H x 3] matrix, which in memory is a W*H*3 array, where the first WH entries correspond to red, the second WH entries to green, and the last WH entries to blue.
Additionally the image is mirrored around the axis from top left to bottom right.
The straightforward implementation of a rearrangement loop is the following:
// buffer contains mirrored and scrambled output of turbojpeg
// outImg contains image matrix for use in MATLAB
// imgSize is an array containing {H,W,3}
for(int j=0; j<imgSize[1]; j++) {
    for(int i=0; i<imgSize[0]; i++) {
        curIdx = j*imgSize[0] + i;
        curBufIdx = (i*imgSize[1] + j)*3;
        outImg[curIdx] = buffer[curBufIdx++];
        outImg[curIdx + imgSize[0]*imgSize[1] ] = buffer[curBufIdx++];
        outImg[curIdx + 2*imgSize[0]*imgSize[1] ] = buffer[curBufIdx];
    }
}
It works, but it takes around 120ms (not 20ms), about as long as the actual decoding. Any suggestions on how to make this code more efficient?
Due to a bug I updated the processing times.

EDIT: 99% of C libraries will store images row-major, meaning if you get a 3 x WH (a 2D array) from turbojpeg, you can just treat it as a 3 x W x H (the expected input above). In this representation, pixels read across then down. You need them to read down then across in MATLAB. You also need to convert pixel order (RGBRGBRGB...) to planar order (RRRR....GGGGG....BBBBB...). The solution is permute(reshape(I,3,W,H),[3 2 1]).
This is one of those situations where MATLAB's permute command is probably going to be faster than anything you will code by hand on short notice (at least 50% faster than the loop shown). I usually steer away from solutions with mexCallMATLAB, but I think this may be an exception. However, the input must be an mxArray, which may be inconvenient. Anyway, here's how to do a permute(I,[3 2 1]):
#include "mex.h"
int computePixelCtoPlanarMATLAB(mxArray*& imgPermuted, const mxArray* img)
{
    mxArray *permuteRHSArgs[2];
    // img must be row-major (across first), pixel order (RGBRGBRGB...)
    permuteRHSArgs[0] = const_cast<mxArray*>(img);
    permuteRHSArgs[1] = mxCreateDoubleMatrix(1,3,mxREAL);
    // output is col-major, planar order (rows x cols x 3)
    double *p = mxGetPr(permuteRHSArgs[1]);
    p[0] = 3;
    p[1] = 2;
    p[2] = 1;
    return mexCallMATLAB(1, &imgPermuted, 2, permuteRHSArgs, "permute");
}
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ) {
    // do some argument checking first (not shown)
    // ...
    computePixelCtoPlanarMATLAB(plhs[0], prhs[0]);
}
Or call permute(I,[3 2 1]) yourself back in MATLAB.
What about the reshape to first go from 3xWH to 3xWxH? Just tell the code that it's really 3xWxH! reshape moves no data -- it just tells MATLAB to treat a given data buffer as being a certain size.
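If you would rather stay in C++ than call back into MATLAB, the same index mapping as permute(reshape(I,3,W,H),[3 2 1]) can be written by hand. The following is only a sketch: the function name interleavedToPlanar and the tile parameter are made up for illustration, it assumes the source rows have no padding, and whether the tiling actually beats permute on your machine is something you would have to measure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a row-major interleaved buffer (RGBRGB..., `height` rows of `width`
// pixels, no row padding assumed) into MATLAB's column-major planar layout
// (H x W x 3). The image is processed in small square tiles so that both the
// reads and the writes stay within a limited memory range at a time.
std::vector<std::uint8_t> interleavedToPlanar(const std::uint8_t* src,
                                              std::size_t height,
                                              std::size_t width,
                                              std::size_t tile = 32)
{
    assert(tile > 0);
    const std::size_t plane = height * width;
    std::vector<std::uint8_t> dst(3 * plane);
    for (std::size_t r0 = 0; r0 < height; r0 += tile) {
        const std::size_t r1 = std::min(r0 + tile, height);
        for (std::size_t c0 = 0; c0 < width; c0 += tile) {
            const std::size_t c1 = std::min(c0 + tile, width);
            for (std::size_t c = c0; c < c1; ++c) {
                for (std::size_t r = r0; r < r1; ++r) {
                    const std::uint8_t* px = src + (r * width + c) * 3;
                    const std::size_t o = c * height + r;  // column-major index
                    dst[o]             = px[0];  // red plane
                    dst[o + plane]     = px[1];  // green plane
                    dst[o + 2 * plane] = px[2];  // blue plane
                }
            }
        }
    }
    return dst;
}
```

In a MEX file you would write into the buffer of the output mxArray instead of returning a std::vector; the vector is used here only to keep the sketch self-contained.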


How can I make this as fast as possible? - Iterating through an image mat

The question is quite straightforward. I'll also explain what I do in case there is a faster way to do this without optimizing this specific way.
I go through an image and its rgb values. I have bins of size 256 for each color. So for every pixel I calculate the 3 bins of its rgb values. The bins essentially give me the index to access data for the specific color in a large vector. With this data, I do some calculations which are irrelevant. What I want to optimize is the accessing part.
Keep in mind that the large vector has an extra dimension. Every pixel belongs to some defined areas of the image. For every area it belongs to, it has an element in the big vector. So, if a pixel belongs in 4 areas(eg 3,9,12,13) then the data I want to access is: data[colorIndex][3],data[colorIndex][9],data[colorIndex][12],data[colorIndex][13].
I think that's enough to explain the code which is the following:
//Just filling with data for the sake of the example
int cols = 200; int rows = 200;
cv::Mat image(200, 200, CV_8UC3);
image.setTo(Scalar(100, 100, 100));
int numberOfAreas = 50;
//For every pixel (first dimension) we have a vector<int> containing ones for every area the pixel belongs to.
//For this example, every pixel belongs to every area.
vector<vector<int>> areasThePixelBelongs(200 * 200, vector<int>(numberOfAreas, 1));
int numberOfBins = 32;
int sizeOfBin = 256 / numberOfBins;
vector<vector<float>> data(pow(numberOfBins, 3), vector<float>(numberOfAreas, 1));
//Filling complete
//Part I need to optimize
uchar* matPointer;
for (int y = 0; y < rows; y++) {
matPointer = image.ptr<uchar>(y);
for (int x = 0; x < cols; x++) {
int red = matPointer[x * 3 + 2];
int green = matPointer[x * 3 + 1];
int blue = matPointer[x * 3];
int binNumberRed = red / sizeOfBin;
int binNumberGreen = green / sizeOfBin;
int binNumberBlue = blue / sizeOfBin;
//Instead of a 3d vector where I access the elements like: color[binNumberRed][binNumberGreen][binNumberBlue]
//I use a 1d vector where I just have to calculate the 1d index as follows
int index = binNumberRed * numberOfBins * numberOfBins + binNumberGreen * numberOfBins + binNumberBlue;
vector<int>& areasOfPixel = areasThePixelBelongs[y*cols+x];
int numberOfPixelAreas = areasOfPixel.size();
for (int i = 0; i < numberOfPixelAreas; i++) {
float valueOfInterest = data[index][areasOfPixel[i]];
//Some calculations here...
}
}
}
Would it be better accessing each mat element as a Vec3b? I think I'm essentially accessing an element 3 times for each pixel using uchar. Would accessing one Vec3b be faster?
First of all, vector<vector<T>> is not stored efficiently in memory because it is not contiguous. This often has a big impact on performance and should be avoided as much as possible (especially when the inner arrays are all the same size). Instead, you can use std::array for fixed-size arrays or a flattened std::vector (with the size dim1 * dim2 * ... dimN).
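As a rough illustration of the flattened layout, here is a minimal sketch. FlatTable and binIndex are hypothetical names introduced for this example; binIndex just repeats the index computation from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat replacement for vector<vector<float>>: one contiguous block of
// rows * cols floats, addressed as row * cols + col.
class FlatTable {
public:
    FlatTable(std::size_t rows, std::size_t cols, float init = 0.0f)
        : cols_(cols), values_(rows * cols, init) {}

    float& at(std::size_t row, std::size_t col) {
        assert(col < cols_ && row * cols_ + col < values_.size());
        return values_[row * cols_ + col];
    }

    // Pointer to the first element of a row; the whole row is contiguous.
    const float* row(std::size_t r) const { return values_.data() + r * cols_; }

private:
    std::size_t cols_;
    std::vector<float> values_;
};

// Same 1D bin index as in the question: red-major, then green, then blue.
inline std::size_t binIndex(int red, int green, int blue, int numberOfBins) {
    const int sizeOfBin = 256 / numberOfBins;
    return (static_cast<std::size_t>(red / sizeOfBin) * numberOfBins
            + green / sizeOfBin) * numberOfBins
           + blue / sizeOfBin;
}
```

With this layout, data[index][areasOfPixel[i]] becomes table.row(index)[areasOfPixel[i]], and all areas of one color bin sit next to each other in memory.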
Moreover, the loop is a good candidate for parallelization. You can parallelize this code easily with OpenMP. This assumes Some calculations here can be implemented in a thread-safe way (you should be careful about shared writes, if any). If this code is embarrassingly parallel, then the resulting parallel code can be much faster. Still, multi-threading introduces some overhead, which may be too large compared to the overall computation time (which depends heavily on the content of Some calculations here).
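A minimal sketch of what the OpenMP version could look like, assuming the unknown calculations reduce to a sum (which is purely a stand-in) and that the bin index of every pixel has already been computed. The function name sumOverPixels and its parameters are made up for this example. Without OpenMP enabled (e.g. -fopenmp on GCC/Clang) the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Sums data[bin][area] over all pixels and all areas. The sum is only a
// placeholder for "Some calculations here".
//   binOfPixel: one precomputed bin index per pixel (rows * cols entries)
//   data:       flattened bins x numberOfAreas table
double sumOverPixels(const std::vector<int>& binOfPixel,
                     const std::vector<float>& data,
                     int numberOfAreas, int rows, int cols)
{
    double total = 0.0;
    // Each thread accumulates a private copy of `total`; the copies are
    // combined at the end, so there are no shared writes inside the loop.
    #pragma omp parallel for reduction(+ : total)
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            const std::size_t bin =
                static_cast<std::size_t>(binOfPixel[y * cols + x]);
            const float* areaValues = &data[bin * numberOfAreas];
            for (int a = 0; a < numberOfAreas; ++a) {
                total += areaValues[a];
            }
        }
    }
    return total;
}
```

Parallelizing over rows keeps each thread on its own contiguous part of the image. If the real calculations write to shared state, they need per-thread buffers or some other form of synchronization instead of a simple reduction.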
Finally, depending on the content of Some calculations here, it may or may not be possible to adapt the code so that the compiler uses SIMD instructions. The data[index][areasOfPixel[i]] access will likely prevent most compilers from doing that, but the computation that follows could be vectorized. Note that software prefetching and gather instructions may help to speed up the data[index][areasOfPixel[i]] operation a bit.
Note that the way you access pixels should not have a significant impact on the runtime, as the computation should be bound by the speed of the inner loop iterating over the areas and containing the unknown code (unless this unknown code actually accesses pixels too).

How to access matrix data in opencv by another mat with locations (indexing)

Suppose I have a Mat of indices (locations) called B, with dimensions 1 x 100, and another Mat called A, full of data, with the same dimensions as B.
Now I would like to access the data of A using B. Usually I would write a for loop and, for each element of B, take the corresponding element of A. For the fussiest readers of the site, this is the code that I would write:
for(int i=0; i < B.cols; i++){
    int index = B.at<int>(0, i);
    std::cout<<A.at<int>(0, index)<<std::endl;
}
Ok, now that I showed you what I could do, I ask you if there is a way to access the matrix A, always using the B indices, in a more intelligent and fast way. As someone could do in python thanks to the numpy.take() function.
This operation is called remapping. In OpenCV, you can use function cv::remap for this purpose.
Below I present the very basic example of how remap algorithm works; please note that I don't handle border conditions in this example, but cv::remap does - it allows you to use mirroring, clamping, etc. to specify what happens if the indices exceed the dimensions of the image. I also don't show how interpolation is done; check the cv::remap documentation that I've linked to above.
If you are going to use remapping you will probably have to convert indices to floating point; you will also have to introduce another array of indices that should be trivial (all equal to 0) if your image is one-dimensional. If this starts to represent a problem because of performance, I'd suggest you implement the 1-D remap equivalent yourself. But benchmark first before optimizing, of course.
For all the details, check the documentation, which covers everything you need to know to use the algorithm.
cv::Mat_<float> remap_example(cv::Mat_<float> image,
                              cv::Mat_<float> positions_x,
                              cv::Mat_<float> positions_y)
{
    // sizes of positions arrays must be the same
    int size_x = positions_x.cols;
    int size_y = positions_x.rows;
    auto out = cv::Mat_<float>(size_y, size_x);
    for(int y = 0; y < size_y; ++y)
    {
        for(int x = 0; x < size_x; ++x)
        {
            // cv::Mat_ is indexed as (row, column), i.e. (y, x)
            float ps_x = positions_x(y, x);
            float ps_y = positions_y(y, x);
            // use interpolation to determine intensity at image(ps_x, ps_y),
            // at this point also handle border conditions
            // (bilinear_interpolation is a placeholder, not an OpenCV function)
            float interpolated = bilinear_interpolation(image, ps_x, ps_y);
            out(y, x) = interpolated;
        }
    }
    return out;
}
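If cv::remap turns out to be too heavy for plain integer indices, the 1-D equivalent mentioned above is just a gather. Here is a minimal sketch using std::vector instead of cv::Mat so that it stands on its own; the name take is borrowed from numpy.take and is not an OpenCV function.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// 1-D gather: out[i] = values[indices[i]].
// Unlike cv::remap there is no interpolation and no border mode;
// an index outside of `values` is reported as an error.
template <typename T>
std::vector<T> take(const std::vector<T>& values,
                    const std::vector<int>& indices)
{
    std::vector<T> out;
    out.reserve(indices.size());
    for (int idx : indices) {
        if (idx < 0 || static_cast<std::size_t>(idx) >= values.size()) {
            throw std::out_of_range("take: index outside of values");
        }
        out.push_back(values[static_cast<std::size_t>(idx)]);
    }
    return out;
}
```

The bounds check costs a little per element; if the indices are known to be valid it can be dropped, which leaves the same loop as the raw-pointer variant.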
One fast way is to use pointers for both A (data) and B (indices).
const int* pA = A.ptr<int>(0);
const int* pIndexB = B.ptr<int>(0);
int sum = 0;
for(int i = 0; i < B.cols; ++i)
    sum += pA[*pIndexB++];
Note: Be careful with the pixel type; in this case (as in your code) it is int!
Note2: Using cout for each element access makes the optimization pointless!
Note3: In this article Satya compares four methods for pixel access, and the fastest seems to be "forEach": https://www.learnopencv.com/parallel-pixel-access-in-opencv-using-foreach/

(C++)(Visual Studio) Change RGB to Grayscale

I am accessing the image like so:
pDoc = GetDocument();
int iBitPerPixel = pDoc->_bmp->bitsperpixel; // used to see if grayscale(8 bits) or RGB (24 bits)
int iWidth = pDoc->_bmp->width;
int iHeight = pDoc->_bmp->height;
BYTE *pImg = pDoc->_bmp->point; // pointer used to point at pixels in the image
int Wp = iWidth;
const int area = iWidth * iHeight;
int r; // red pixel value
int g; // green pixel value
int b; // blue pixel value
int gray; // gray pixel value
BYTE *pImgGS = pImg; // grayscale image pixel array
and attempting to change the rgb image to gray like so:
// convert RGB values to grayscale at each pixel, then put in grayscale array
for (int i = 0; i<iHeight; i++)
{
    for (int j = 0; j<iWidth; j++)
    {
        r = pImg[i*iWidth * 3 + j * 3 + 2];
        g = pImg[i*iWidth * 3 + j * 3 + 1];
        b = pImg[i*Wp + j * 3];
        r * 0.299;
        g * 0.587;
        b * 0.144;
        gray = std::round(r + g + b);
        pImgGS[i*Wp + j] = gray;
    }
}
finally, this is how I try to draw the image:
//draw the picture as grayscale
for (int i = 0; i < iHeight; i++) {
    for (int j = 0; j < iWidth; j++) {
        // this should set every corresponding grayscale picture to the current picture as grayscale
        pImg[i*Wp + j] = pImgGS[i*Wp + j];
    }
}
original image:
and the resulting image that I get is this:
First, check that the image type is 24 bits per pixel.
Second, allocate memory for pImgGS:
BYTE* pImgGS = (BYTE*)malloc(sizeof(BYTE)*iWidth *iHeight);
Please refer to this article to see how bmp data is saved. bmp images are saved upside down. Also, the first 54 bytes of the file are header information (a 14-byte BITMAPFILEHEADER followed by a 40-byte BITMAPINFOHEADER in the common case), not pixel data.
Hence you should access values in following way,
double r,g,b;
unsigned char gray;
for (int i = 0; i<iHeight; i++)
{
    for (int j = 0; j<iWidth; j++)
    {
        r = (double)pImg[(i*iWidth + j)*3 + 2];
        g = (double)pImg[(i*iWidth + j)*3 + 1];
        b = (double)pImg[(i*iWidth + j)*3 + 0];
        r= r * 0.299;
        g= g * 0.587;
        b= b * 0.114; // the BT.601 blue weight is 0.114, not 0.144
        gray = floor((r + g + b + 0.5));
        pImgGS[(iHeight-i-1)*iWidth + j] = gray;
    }
}
If there is padding present, first determine the padding and then access the pixels in a different way. Refer to this to understand pitch and padding.
double r,g,b;
unsigned char gray;
long index=0; // byte offset of the current row
// pitch is the number of bytes per row including padding
for (int i = 0; i<iHeight; i++)
{
    for (int j = 0; j<iWidth; j++)
    {
        r = (double)pImg[index+ (j)*3 + 2];
        g = (double)pImg[index+ (j)*3 + 1];
        b = (double)pImg[index+ (j)*3 + 0];
        r= r * 0.299;
        g= g * 0.587;
        b= b * 0.114; // the BT.601 blue weight is 0.114, not 0.144
        gray = floor((r + g + b + 0.5));
        pImgGS[(iHeight-i-1)*iWidth + j] = gray;
    }
    index =index +pitch;
}
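The pitch used above is not computed anywhere in the snippet. For an uncompressed Windows bitmap, each row is padded to a multiple of 4 bytes, so it can be derived from the width and bit depth. The following sketch puts the two pieces together; dibStride and bgr24ToGrey are names made up for this example, and it assumes a bottom-up 24-bit image in BGR byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes per row of an uncompressed DIB: rows are padded to a multiple of
// 4 bytes (32 bits).
inline std::size_t dibStride(std::size_t width, std::size_t bitsPerPixel) {
    return ((width * bitsPerPixel + 31) / 32) * 4;
}

// Converts 24-bit BGR pixel data (bottom-up rows, padded to the DIB stride)
// into a tightly packed, top-down 8-bit grey image.
std::vector<std::uint8_t> bgr24ToGrey(const std::uint8_t* pixels,
                                      std::size_t width, std::size_t height)
{
    const std::size_t pitch = dibStride(width, 24);
    std::vector<std::uint8_t> grey(width * height);
    for (std::size_t i = 0; i < height; ++i) {
        const std::uint8_t* row = pixels + i * pitch;
        for (std::size_t j = 0; j < width; ++j) {
            // BT.601 luma weights; the bytes are stored as B, G, R
            const double y = 0.299 * row[3 * j + 2]
                           + 0.587 * row[3 * j + 1]
                           + 0.114 * row[3 * j + 0];
            // flip vertically because the source rows are stored bottom-up
            grey[(height - 1 - i) * width + j] =
                static_cast<std::uint8_t>(y + 0.5);
        }
    }
    return grey;
}
```

For example, a 24-bit image that is 5 pixels wide has 15 bytes of pixel data per row but a pitch of 16.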
While drawing image,
as pImg is 24bpp, you need to copy gray values thrice to each R,G,B channel. If you ultimately want to save grayscale image in bmp format, then again you have to write bmp data upside down or you can simply skip that step in converting to gray here:
pImgGS[(iHeight-i-1)*iWidth + j] = gray;
tl; dr:
Make one common path. Convert everything to 32-bits in a well-defined manner, and do not use image dimensions or coordinates. Refactor the YCbCr conversion ( = grey value calculation) into a separate function, this is easier to read and runs at exactly the same speed.
The lengthy stuff
First, you seem to have been confused by strides and offsets. The artefact that you see is because you accidentally wrote out one value (and in total only one third of the data) when you should have written three values.
One can get confused with this easily, but here it happened because you do useless stuff that you needed not do in the first place. You are iterating coordinates left to right, top-to-bottom and painstakingly calculate the correct byte offset in the data for each location.
However, you're doing a full-screen effect, so what you really want is iterate over the complete image. Who cares about the width and height? You know the beginning of the data, and you know the length. One loop over the complete blob will do the same, only faster, with less obscure code, and fewer opportunities of getting something wrong.
Next, 24-bit bitmaps are common as files, but they are rather unusual for in-memory representation because the format is nasty to access and unsuitable for hardware. Drawing such a bitmap will require a lot of work from the driver or the graphics hardware (it will work, but it will not work well). Therefore, 32-bit depth is usually a much better, faster, and more comfortable choice. It is much more "natural" to access program-wise.
You can rather trivially convert 24-bit to 32-bit. Iterate over the complete bitmap data and write out a complete 32-bit word for each 3 byte-tuple read. Windows bitmaps ignore the A channel (the highest-order byte), so just leave it zero, or whatever.
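A minimal sketch of that 24-to-32-bit conversion, assuming tightly packed input in BGR byte order (the usual Windows layout) with no row padding. If the rows are padded, the padding has to be skipped at the end of each row. The function name bgr24ToXrgb32 is made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands tightly packed 24-bit BGR data into one 32-bit word per pixel,
// laid out as 0x00RRGGBB. The top byte (the unused A channel) is left zero.
std::vector<std::uint32_t> bgr24ToXrgb32(const std::uint8_t* in,
                                         std::size_t pixelCount)
{
    std::vector<std::uint32_t> out(pixelCount);
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const std::uint32_t b = in[3 * i + 0];
        const std::uint32_t g = in[3 * i + 1];
        const std::uint32_t r = in[3 * i + 2];
        out[i] = (r << 16) | (g << 8) | b;
    }
    return out;
}
```

Note that there is a single loop over the pixel count, with no width or height involved, which is the point made above.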
Also, there is no such thing as an 8-bit greyscale bitmap. This simply doesn't exist. Although there exist bitmaps that look like greyscale bitmaps, they are in reality paletted 8-bit bitmaps where (incidentally) the bmiColors member contains all greyscale values.
Therefore, unless you can guarantee that you will only ever process images that you have created yourself, you cannot just rely that e.g. the values 5 and 73 correspond to 5/255 and 73/255 greyscale intensity, respectively. That may be the case, but it is in general a wrong assumption.
In order to be on the safe side as far as correctness goes, you must convert your 8-bit greyscale bitmaps to real colors by looking up the indices (the bitmap's grey values are really indices) in the palette. Otherwise, you could be loading a greyscale image where the palette is the other way around (so 5 would mean 250 and 250 would mean 5), or a bitmap which isn't greyscale at all.
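The palette lookup could be sketched as follows. PaletteEntry is a hypothetical stand-in with the same byte order as the Windows RGBQUAD structure (blue, green, red, reserved), and the function name is made up; in real code you would read the table from the bitmap's bmiColors member.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for RGBQUAD: blue, green, red, reserved.
struct PaletteEntry {
    std::uint8_t blue, green, red, reserved;
};

// Converts 8-bit palette indices into 32-bit 0x00RRGGBB words by looking
// each index up in the palette instead of treating it as a grey value.
std::vector<std::uint32_t> indexed8ToXrgb32(
    const std::uint8_t* indices, std::size_t pixelCount,
    const std::array<PaletteEntry, 256>& palette)
{
    std::vector<std::uint32_t> out(pixelCount);
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const PaletteEntry& c = palette[indices[i]];
        out[i] = (static_cast<std::uint32_t>(c.red) << 16)
               | (static_cast<std::uint32_t>(c.green) << 8)
               | static_cast<std::uint32_t>(c.blue);
    }
    return out;
}
```

With an inverted grey palette, index 5 comes out as grey level 250, exactly the case described above that a plain copy of the index would get wrong.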
So... you want to convert 24-bit and you want to convert 8-bit bitmaps, both to 32-bit depth. That means you do all the annoying what-if stuff once at the beginning, and the rest is one identical common path. That's a good thing.
What you will be showing on-screen is always a 32-bit bitmap where the topmost byte is ignored, and the lower three are all the same value, resulting in what looks like a shade of grey. That's simple, and simple is good.
Note that if you do a BT.601 style YCbCr conversion (as indicated by your use of the constants 0.299, 0.587, and 0.144; the BT.601 blue weight is actually 0.114), and if your 8-bit greyscale images are perceptive (this is something you must know, there is no way of telling from the file!), then for 100% correctness you need to do the inverse transformation when converting from paletted 8-bit to RGB. Otherwise, your final result will look almost right, but not quite. If your 8-bit greyscales are linear, i.e. were created without using the above constants (again, you must know, you cannot tell from the image), you need to copy everything as-is (here, doing the conversion would make it look almost-but-not-quite right).
About the RGB-to-greyscale conversion: you do not need an extra greyscale bitmap just to hold values that you never need again afterwards. You can read the three color values from the loaded bitmap, calculate Y, and directly build the 32-bit ARGB word, which you then write out to the final bitmap. This saves one entirely useless round-trip to memory.
Something like this:
uint32_t* out = (uint32_t*) output_bitmap_data;
for(int i = 0; i < inputSize; i+= 3)
{
    // index with i so that each iteration reads the next 3-byte tuple
    uint8_t Y = calc_greyscale(in[i], in[i+1], in[i+2]);
    *out++ = (Y<<16) | (Y<<8) | Y;
}
Alternatively, you can also do the from-whatever-to-32 conversion, and then do the to-greyscale conversion in-place there. This, in turn, introduces an extra round-trip to memory, but the code becomes much, much easier overall.
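The calc_greyscale helper is not spelled out above. One possible version, as a sketch only: the names greyFromRgb and packGrey are made up, and the weights are the BT.601 constants rounded to 8-bit fixed point (77 + 150 + 29 = 256), which is an approximation I chose here to avoid floating point, so results can differ by one level from the floating-point formula.

```cpp
#include <cstdint>

// BT.601 luma in 8-bit fixed point:
//   0.299 * 256 ~ 77, 0.587 * 256 ~ 150, 0.114 * 256 ~ 29
// The +128 rounds to nearest before the shift by 8 (division by 256).
inline std::uint8_t greyFromRgb(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const std::uint32_t y = 77u * r + 150u * g + 29u * b + 128u;
    return static_cast<std::uint8_t>(y >> 8);
}

// Builds the 32-bit word 0x00YYYYYY: same value in R, G and B,
// top byte left zero.
inline std::uint32_t packGrey(std::uint8_t y) {
    const std::uint32_t v = y;
    return (v << 16) | (v << 8) | v;
}
```

Note the argument order: if the input bytes are stored as B, G, R, the call would be greyFromRgb(in[i+2], in[i+1], in[i]).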

Faster algorithm to check the colors in a image

Supposing I am given an image of 2048x2048 and I want to know the total number of colors present in the image, what is the fastest possible algorithm? I came up with two algorithms, but they are slow.
Algorithm 1:
Compare the current pixel and the next pixel, and if they are different:
Check a temporary variable, which contains all the detected colors, to see if the color is present or not
If not present add it to the array(List) and increment noOfColors.
This Algorithm works but is slow. For a 1600x1200 pixels image it takes around 3 sec.
Algorithm 2:
The obvious method of checking the each pixel with all other pixels and recording the no of occurences of the color and incrementing the count. This is very very slow, almost like a hung app. So is there any better approach? I need all the pixel info.
You could use std::set (or std::unordered_set), and simply do a single loop through the pixels, adding the colors to the set. Then the number of colors is the size of the set.
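A minimal sketch of the set-based count, assuming the pixels are available as 32-bit values with the color in the low 24 bits (the top byte is masked off so that alpha or padding does not create extra "colors"). The function name is made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Counts the distinct 24-bit colors in a buffer of 32-bit pixels.
std::size_t countColors(const std::vector<std::uint32_t>& pixels) {
    std::unordered_set<std::uint32_t> seen;
    for (std::uint32_t p : pixels) {
        seen.insert(p & 0x00FFFFFFu);  // ignore the top (alpha/unused) byte
    }
    return seen.size();
}
```

Each insert is O(1) on average, so the whole count is a single pass over the image instead of comparing every pixel against a growing list.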
Well, this is suited for parallelization. Split the image in several parts and execute the algorithm for each part in a separate task. To avoid syncing each should have its own storage for the unique colors. When all tasks are done, you aggregate the results.
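Roughly, the split-and-merge idea could look like this. It is a sketch under the same assumption of 32-bit pixels with the color in the low 24 bits; countColorsInParts and its parts parameter are made up for this example. The OpenMP pragma is optional: without OpenMP enabled, the parts are simply processed one after another with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Splits the pixel buffer into `parts` contiguous chunks, collects the
// unique colors of each chunk in its own set (no locking needed), and
// merges the per-chunk sets at the end.
std::size_t countColorsInParts(const std::vector<std::uint32_t>& pixels,
                               int parts)
{
    assert(parts > 0);
    std::vector<std::unordered_set<std::uint32_t>> partial(
        static_cast<std::size_t>(parts));
    const std::size_t n = pixels.size();

    #pragma omp parallel for
    for (int p = 0; p < parts; ++p) {
        const std::size_t begin = n * static_cast<std::size_t>(p) / parts;
        const std::size_t end = n * (static_cast<std::size_t>(p) + 1) / parts;
        for (std::size_t i = begin; i < end; ++i) {
            partial[static_cast<std::size_t>(p)].insert(pixels[i] & 0x00FFFFFFu);
        }
    }

    // Aggregate: a color seen in several chunks is counted once.
    std::unordered_set<std::uint32_t> all;
    for (const auto& s : partial) {
        all.insert(s.begin(), s.end());
    }
    return all.size();
}
```

The merge step is serial, so for images with very many distinct colors it can dominate the runtime; it is worth measuring before assuming a speedup.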
DRAM is dirt cheap. Use brute force. Fill a tab, count.
On a Core 2 Duo @ 3.0 GHz:
0.35secs for 4096x4096 32 bits rgb
0.20secs after some trivial parallelization (I know nothing of omp)
However, if you are to use 64bit rgb (one channel = 16 bits) it is another question (not enough memory).
You shall probably need a good hash table function.
Using random pixels, same size takes 10 secs.
Remark: at 0.15 secs, the std::bitset<> solution is faster (it gets slower trivially parallelized !).
Solution, c++11
#include <vector>
#include <random>
#include <iostream>
#include <boost/chrono.hpp>
#define _16M (256*256*256)
typedef union {
    struct { unsigned char r,g,b,n ; } r_g_b_n ;
    unsigned char rgb[4] ;
    unsigned i_rgb;
} RGB ;
RGB make_RGB(unsigned char r, unsigned char g , unsigned char b) {
    RGB res;
    res.r_g_b_n.r = r;
    res.r_g_b_n.g = g;
    res.r_g_b_n.b = b;
    res.r_g_b_n.n = 0;
    return res;
}
static_assert(sizeof(RGB)==4,"bad RGB size not 4");
static_assert(sizeof(unsigned)==4,"bad i_RGB size not 4");
struct Image
{
    Image (unsigned M, unsigned N) : M_(M) , N_(N) , v_(M*N) {}
    const RGB* tab() const {return & v_[0] ; }
    RGB* tab() {return & v_[0] ; }
    unsigned M_ , N_;
    std::vector<RGB> v_;
};
void FillRandom(Image & im) {
    std::uniform_int_distribution<unsigned> rnd(0,_16M-1);
    std::mt19937 rng;
    const int N = im.M_ * im.N_;
    RGB* tab = im.tab();
    for (int i=0; i<N; i++) {
        unsigned r = rnd(rng) ;
        *tab++ = make_RGB( (r & 0xFF) , (r>>8 & 0xFF), (r>>16 & 0xFF) ) ;
    }
}
size_t Count(const Image & im) {
    const int N = im.M_ * im.N_;
    std::vector<char> count(_16M,0);
    const RGB* tab = im.tab();
    #pragma omp parallel
    #pragma omp for
    for (int i=0; i<N; i++) {
        // index with i so that every pixel is visited, not only the first
        count[ tab[i].i_rgb ] = 1 ;
    }
    size_t nColors = 0 ;
    #pragma omp parallel
    // reduction: each thread sums privately, results are combined at the end
    #pragma omp for reduction(+:nColors)
    for (int i = 0 ; i<_16M; i++) nColors += count[i];
    return nColors;
}
int main() {
    Image im(4096,4096);
    typedef boost::chrono::high_resolution_clock hrc;
    auto start = hrc::now();
    std::cout << " # colors " << Count(im) << std::endl ;
    boost::chrono::duration<double> sec = hrc::now() - start;
    std::cout << " took " << sec.count() << " seconds\n";
    return 0;
}
The only feasible algorithm here is building a sort of a histogram of the image colors. The only difference in your case is that instead of calculating the population of each color you need just to know if it's zero or not.
Depending on which color space you work in, you may use either an std::set to tag existing colors (as Joachim Pileborg suggested), or just use something like std::bitset, which is obviously faster. This depends on how many distinct colors exist in your color space.
Also, as Marius Bancila noted, this procedure is a perfect match for parallelization. Calculate the histogram-like data for image parts, and then merge it. Naturally, the image division should be based on its memory layout, not its geometric properties. In simple words: split the image vertically (by batches of scan lines), not horizontally.
And, if possible, you should either use some low-level library/code to run through the pixels, or try to write your own. At the very least you must obtain a pointer to a scan line and run over its pixels in a batch, rather than doing something like GetPixel for each pixel.
The point here is that the ideal representation of an image as a 2D array of colors is not necessarily the way the image is stored in memory (color components can be arranged in "planes", there could be "padding", etc.). So getting the pixels using a GetPixel-like function may take time.
The question may even be somewhat meaningless if the image is not the result of a "vectorial draw". Think of a photograph: between two nearby "greens" you find every shade of green, so the colors in this case are no more and no less than the ones supported by the encoding of the image itself (2^24, or 256, or 16 or ...). So, unless you are interested in the color distribution (how differently they are used), just counting them makes very little sense.
A workaround can be:
1. Create an in-memory bitmap having pixels in a "single plane format"
2. Blit your image into that bitmap using BitBlt or similar (this lets the OS do the pixel conversion, using the GPU if any)
3. Get the bitmap bits (this lets you access the stored values)
4. Run your "counting algorithm" (whatever it is) on those values.
Note that step 1 and 2 can be avoided if you already know that the image is already in planar format.
If you have a multicore system, step 4 can also be assigned to different threads, each working part of the image.
You can use bitset which allows you to set individual bits and has a count function.
You have a bit for each colour, there are 256 values for each of RGB, so that's 256*256*256 bits (16,777,216 colours). The bitset will use a byte for every 8 bits so it will use 2MB.
Use the pixel colour as an index into the bitset:
bitset<256*256*256> colours;
for(int pixel: pixels) {
    colours[pixel] = true;
}
This has linear complexity.
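A slightly fuller sketch of the bitset approach, under two assumptions of mine: the pixels are 32-bit values with the color in the low 24 bits, and the 2 MB bitset is allocated on the heap, since a local object of that size can overflow the stack on some platforms. The function name is made up for this example.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Counts distinct 24-bit colors with one bit per possible color.
std::size_t countColorsBitset(const std::vector<std::uint32_t>& pixels) {
    using ColourSet = std::bitset<(1u << 24)>;  // 16,777,216 bits = 2 MiB
    // Heap allocation; a default-constructed bitset has all bits cleared.
    std::unique_ptr<ColourSet> seen(new ColourSet());
    for (std::uint32_t p : pixels) {
        seen->set(p & 0x00FFFFFFu);  // mask off the top byte before indexing
    }
    return seen->count();  // number of bits set = number of distinct colors
}
```

Masking matters here: an unmasked pixel value with a non-zero top byte would index past the end of the bitset.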
Late comer to this answer, but could not help it since this algorithm is brutally fast, developed about 2 or more decades ago, when it really mattered.
3-D Lookup Table Color Matching
Basically, it creates a 3D color lookup table, and the search is very fast. I've made some modifications to suit my purpose of image binarization, so I reduced the color space from ff ff ff to f f f, and it's even 10 times faster. As it is right out of the box, I haven't found anything even close, including hash tables.
char * creatematcharray(struct rgb_color *palette, int palettesize)
int rval=16, gval=16, bval=16, len, r, g, b;
char *taken, *match, *same;
int i, set, sqstep, tp, maxtp, *entryr, *entryg, *entryb;
char *table;
// Prepare table buffers:
size_t size_of_table = len*sizeof(char);
table=(char *)malloc(size_of_table);
if (table==nullptr) return nullptr;
// Select colors to use for fill:
size_t size_of_taken = (palettesize * sizeof(int) * 3) +
(palettesize*sizeof(char)) + (len * sizeof(char));
taken=(char *)malloc(size_of_taken);
same=taken + (len * sizeof(char));
entryr=(int*)(same + (palettesize * sizeof(char)));
entryg=entryr + palettesize;
entryb=entryg + palettesize;
if (taken==nullptr)
free((void *)table);
return nullptr;
std::memset((void *)taken, 0, len * sizeof(char));
// std::cout << "sizes: " << size_of_table << " " << size_of_taken << std::endl;
for (i=0; i<palettesize; i++)
// Compute 3d-table coordinates of palette rgb color:
r=palette[i].r&0x0f, g=palette[i].g&0x0f, b=palette[i].b&0x0f;
// Put color in position:
if (taken[b*rval*gval+g*rval+r]==0) set++;
else same[match[b*rval*gval+g*rval+r]]=1;
entryr[i]=r; entryg[i]=g; entryb[i]=b;
// ### Fill match_array by steps: ###
for (set=len-set, sqstep=1; set>0; sqstep++)
for (i=0; i<palettesize && set>0; i++)
if (same[i]==0)
// Fill all six sides of incremented cube (by pairs, 3 loops):
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b+=sqstep*2)
if (b>=0 && b<bval)
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r++)
if (r>=0 && r<rval)
{ // Draw one 3d line:
if (tp<b*rval*gval+0*rval+r)
if (maxtp>b*rval*gval+(gval-1)*rval+r)
for (; tp<=maxtp; tp+=rval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g+=sqstep*2)
if (g>=0 && g<gval)
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b++)
if (b>=0 && b<bval)
{ // Draw one 3d line:
if (tp<b*rval*gval+g*rval+0)
if (maxtp>b*rval*gval+g*rval+(rval-1))
for (; tp<=maxtp; tp++)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r+=sqstep*2)
if (r>=0 && r<rval)
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g++)
if (g>=0 && g<gval)
{ // Draw one 3d line:
if (tp<0*rval*gval+g*rval+r)
if (maxtp>(bval-1)*rval*gval+g*rval+r)
for (; tp<=maxtp; tp+=rval*gval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
free((void *)taken);
return table;
The answer: unordered_map
I use unordered_map, based on my testing.
You should test, because your compiler / library may exhibit different performance. Comment out #define USEHASH to use map instead.
On my machine, the vanilla unordered_map (a hash implementation) is about twice as fast as map. Inasmuch as different compilers and libraries can vary enormously, you must test to see which is better. In production, I build a fake image on first start of the app, run both algorithms on it and time them, save an indication of which one is faster, and then preferentially use that for all subsequent starts on that machine. It's nit-picky, but hey, the user's time is valuable to them.
For a DSLR image with 12,106,244 pixels (about 12 megapixels, not a typo) and 11,857,131 distinct colors (also not a typo), map takes about 14 seconds, while unordered map takes about 7 seconds:
Test Code:
#define USEHASH 1
#ifdef USEHASH
#include <unordered_map>
#else
#include <map>
#endif
size = im->xw * im->yw;
#ifdef USEHASH
// unordered_map is about twice as fast as map on my mac with qt5
// --------------------------------------------------------------
std::unordered_map<qint64, unsigned char> colors;
colors.reserve(size); // pre-allocate the hash space
#else
std::map<qint64, unsigned char> colors;
#endif
...use of either is in a loop where I build a 48-bit value of 0RGB in a 64-bit variable corresponding to the 16-bit RGB values of the image pixels, like so:
for (i=0; i<size; i++)
{
    pel = BUILDPEL(i); // macro just shovels 0RGB into 64 bit pel from im
                       // You'd do the same for your image structure
                       // in whatever way is fastest for you
    colors[pel] = 1;
}
cc = colors.size();
// time here: 14 secs for map, 7 secs for unordered_map with
// 12,106,244 pixels containing 11,857,131 colors on 12/24 core,
// 3 GHz, 64GB machine.
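Since BUILDPEL is not shown, here is one way the packing could be done. This is a hypothetical sketch, not the macro from my code: Rgb16, packRgb48 and countColors48 are made-up names, and it uses std::unordered_set rather than the unordered_map above, because only the presence of a color matters.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// One pixel with 16 bits per channel (hypothetical layout).
struct Rgb16 {
    std::uint16_t r, g, b;
};

// Packs the three 16-bit channels into the low 48 bits of a 64-bit value
// (0RGB): red in bits 32..47, green in bits 16..31, blue in bits 0..15.
inline std::uint64_t packRgb48(std::uint16_t r, std::uint16_t g, std::uint16_t b) {
    return (static_cast<std::uint64_t>(r) << 32)
         | (static_cast<std::uint64_t>(g) << 16)
         | static_cast<std::uint64_t>(b);
}

// Counts the distinct 48-bit colors in the image.
std::size_t countColors48(const std::vector<Rgb16>& pixels) {
    std::unordered_set<std::uint64_t> colors;
    colors.reserve(pixels.size());  // pre-allocate the hash space
    for (const Rgb16& p : pixels) {
        colors.insert(packRgb48(p.r, p.g, p.b));
    }
    return colors.size();
}
```

Reserving one slot per pixel is the worst case (every pixel a different color), which is close to what the DSLR image above actually contains.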

How to improve sorting pixels in cvMat?

I am trying to sort pixel values of an image (example 80x20) from lowest to highest.
Below is the some code:
bool sortPixel(int first, int second)
{
    return (first < second);
}
for(int y=0; y<height; y++)
{
    for(int x=0; x<width; x++)
    {
        vect_sortPixel.push_back(cvGetReal2D(srcImg, y, x));
        sort(vect_sortPixel.begin(), vect_sortPixel.end(), sortPixel);
    }
}
But it takes quite long time to compute. Any suggestion to reduce the processing time?
Thank you.
Don't use getReal2D. It's quite slow.
Convert the image to cv::Mat (or Mat). Use its data pointer to get the pixel values. Mat::data (a member variable, not a function) gives you a pointer to the original matrix. Use that.
And as far as sorting is concerned, I would advise you to first make an array of all the pixels, then sort it using Merge sort (time complexity O(n log n))
#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;
int main()
{
    Mat img = imread("filename.jpg",CV_LOAD_IMAGE_COLOR);
    unsigned char *input = (unsigned char*)(img.data);
    int r,g,b;
    for(int j = 0;j < img.rows;j++){
        for(int i = 0;i < img.cols;i++){
            // 3 bytes per pixel (B, G, R); img.step is the row length in bytes
            b = input[img.step * j + 3 * i];
            g = input[img.step * j + 3 * i + 1];
            r = input[img.step * j + 3 * i + 2];
        }
    }
    return 0;
}
Using this you can access pixel values from the main matrix.
Warning: This is not how you compare it. I'm suggesting that by using something like this, you can access pixel values.
Mat::data gives you a pointer to the original matrix. This matrix is a 1-D array with all the given pixel values.
Image => (x,y,z),(x1,y1,z1), etc..
Mat(original matrix) => x,y,z,x1,y1,z1,...
If you still have some doubts regarding how to extract data from Mat, visit this link OpenCV get pixel channel value from Mat image
and here's a link regarding Merge Sort http://www.cplusplus.happycodings.com/Algorithms/code17.html
There are few problems in your code:
As Froyo already said you use cvGetReal2D which is actually not very fast. You have to convert your cvMat to cv::Mat. To do this there's cv::Mat constructor:
// converts old-style CvMat to the new matrix; the data is not copied by default
Mat(const CvMat* m, bool copyData=false);
After this, use direct pixel access as mentioned in this SO question.
Another problem is that you use push_back, which is also not very fast. You know the size of the array, so why don't you allocate the needed memory at the beginning? Like this:
vector<int> vect_sortPixel(mat.cols*mat.rows);
And then just use vect_sortPixel[i] to get the needed pixel.
Why do you call sort in the loop? You have to call it after loop, when array is already created! Default STL's sort should work fast:
Approximately N*logN comparisons on average (where N is
last-first). In the worst case, up to N^2, depending on specific
sorting algorithm used by library implementation.
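Putting these points together (one allocation, direct access to the pixel buffer, one call to sort after the loop), a sketch could look like this. It assumes a single-channel 8-bit image given as a raw pointer plus row stride, which is what cv::Mat exposes through data and step; the function name sortedPixels is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies all pixels of a single-channel 8-bit image into one vector and
// sorts them in ascending order.
//   data: pointer to the first pixel
//   step: number of bytes per row (may be larger than cols due to padding)
std::vector<std::uint8_t> sortedPixels(const std::uint8_t* data,
                                       std::size_t rows, std::size_t cols,
                                       std::size_t step)
{
    assert(step >= cols);
    std::vector<std::uint8_t> values;
    values.reserve(rows * cols);  // one allocation up front
    for (std::size_t r = 0; r < rows; ++r) {
        const std::uint8_t* row = data + r * step;
        values.insert(values.end(), row, row + cols);  // skip row padding
    }
    std::sort(values.begin(), values.end());  // sort once, after the loop
    return values;
}
```

For an 80x20 image this should be effectively instantaneous; the cost in the original code comes from re-sorting the growing vector for every pixel.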