Porting desktop GLSL shader that uses bit operations to GLES - opengl

I'm porting a desktop OpenGL application to GLES-2 (iOS specifically). In the desktop version, some GLSL shaders relied on integer bit operations, which GLES lacks.
This function was used originally in a Fragment Shader:
int reverseByte(int a)
{
    int b = 0;
    for (int i = 0; i < 8; i++)
    {
        b <<= 1;
        b |= ((a & (1 << i)) >> i);
    }
    return b;
}
// ---- usage example: ----
// get inputs from somewhere, just some test values here...
int r = 255;
int g = 128;
int b = 20;
r = reverseByte(r);
g = reverseByte(g);
b = reverseByte(b);
/* produces:
r = 255
g = 1
b = 40
*/
// color would then be normalized to [0,1] range and further used...
It reverses the order of bits in a byte. This is used with RGB colors, in the [0,255] range. GLES lacks integer bit manipulation, so the above function doesn't compile. I did some research trying to find a replacement for it and found several other possible ways of reversing bits in here, but all rely on integer bit operations.
My question is: Is there a way to achieve similar or equivalent result using just floating point operations and/or the stuff available in GLSL-ES?
Side notes:
I cannot precompute the values in the CPU and pass data as a texture or whatever, as the data is procedurally generated by the shader.
You might suggest packing the data into a texture, reading it back to the CPU, processing it there, and then updating the texture with the results. That is actually my current solution, but performance is very poor due to the large data transfers. I would very much like to be able to do it directly in the shader.

It's a bit involved, but this should do the trick:
int b = 0;
for (int i = 0; i < 8; i++)
{
    b *= 2;
    b += a - (a / 2) * 2; // a mod 2; GLSL ES 1.00 has no integer mod() and reserves the % operator
    a /= 2;
}
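The same idea can be checked on the CPU before moving it into a shader. The following is a minimal C++ sketch, not shader code: it restricts itself to floating-point multiply, divide, floor and mod, which GLSL ES also provides. The name reverseByteFloat is made up for this illustration, and precision behavior on a particular mobile GPU is not something this sketch verifies.

```cpp
#include <cassert>
#include <cmath>

// Reverses the 8 bits of an integer value in [0, 255] using only
// floating-point multiply, divide, floor and mod.
float reverseByteFloat(float a)
{
    assert(a >= 0.0f && a <= 255.0f);
    float b = 0.0f;
    for (int i = 0; i < 8; i++)
    {
        float bit = std::fmod(a, 2.0f); // lowest bit of a (0 or 1)
        b = b * 2.0f + bit;             // shift b left and append the bit
        a = std::floor(a / 2.0f);       // shift a right by one
    }
    return b;
}
```

With the test values from the question, 255 maps to 255, 128 to 1 and 20 to 40.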


UE4 capture frame using ID3D11Texture2D and convert to R8G8B8 bitmap

I'm working on a streaming prototype using UE4.
My goal here (in this post) is solely about capturing frames and saving one as a bitmap, just to visually ensure frames are correctly captured.
I'm currently capturing frames converting the backbuffer to a ID3D11Texture2D then mapping it.
Note: I tried the ReadSurfaceData approach in the render thread, but it performed poorly (FPS dropped to 15, and I'd like to capture at 60 FPS), whereas mapping the DirectX texture from the backbuffer currently takes 1 to 3 milliseconds.
When debugging, I can see the D3D11_TEXTURE2D_DESC's format is DXGI_FORMAT_R10G10B10A2_UNORM, so red, green and blue are stored on 10 bits each, and alpha on 2 bits.
My questions :
How to convert the texture's data (using the D3D11_MAPPED_SUBRESOURCE pData pointer) to a R8G8B8(A8), that is, 8 bit per color (a R8G8B8 without the alpha would also be fine for me there) ?
Also, am I doing anything wrong about capturing the frame ?
What I've tried :
All the following code is executed in a callback function registered to OnBackBufferReadyToPresent (code below).
void* NativeResource = BackBuffer->GetNativeResource();
if (NativeResource == nullptr)
UE_LOG(LogTemp, Error, TEXT("Couldn't retrieve native resource"));
ID3D11Texture2D* BackBufferTexture = static_cast<ID3D11Texture2D*>(NativeResource);
D3D11_TEXTURE2D_DESC BackBufferTextureDesc;
// Get the device context
ID3D11Device* d3dDevice;
ID3D11DeviceContext* d3dContext;
// Staging resource
ID3D11Texture2D* StagingTexture;
D3D11_TEXTURE2D_DESC StagingTextureDesc = BackBufferTextureDesc;
StagingTextureDesc.Usage = D3D11_USAGE_STAGING;
StagingTextureDesc.BindFlags = 0;
StagingTextureDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
StagingTextureDesc.MiscFlags = 0;
HRESULT hr = d3dDevice->CreateTexture2D(&StagingTextureDesc, nullptr, &StagingTexture);
if (FAILED(hr))
UE_LOG(LogTemp, Error, TEXT("CreateTexture failed"));
// Copy the texture to the staging resource
d3dContext->CopyResource(StagingTexture, BackBufferTexture);
// Map the staging resource
hr = d3dContext->Map(
if (FAILED(hr))
UE_LOG(LogTemp, Error, TEXT("Map failed"));
// See https://dev.to/muiz6/c-how-to-write-a-bitmap-image-from-scratch-1k6m for the struct definitions & the initialization of bmpHeader and bmpInfoHeader
// I didn't copy that code here to avoid overloading this post, as it's identical to the article's code
// Just making clear the reassigned values below
bmpHeader.sizeOfBitmapFile = 54 + StagingTextureDesc.Width * StagingTextureDesc.Height * 4;
bmpInfoHeader.width = StagingTextureDesc.Width;
bmpInfoHeader.height = StagingTextureDesc.Height;
std::ofstream fout("output.bmp", std::ios::binary);
fout.write((char*)&bmpHeader, 14);
fout.write((char*)&bmpInfoHeader, 40);
// TODO : convert to R8G8B8 (see below for my attempt at this)
d3dContext->Unmap(StagingTexture, 0);
(As mentioned in the code comments, I followed this article about the BMP headers for saving the bitmap to a file)
Texture data
One thing I'm concerned about is the retrieved data with this method.
I used a temporary array to check with the debugger what's inside.
// Just noted which width and height had the texture and hardcoded it here to allocate the right size
uint32_t data[1936 * 1056];
// Multiply by 4 as there are 4 bytes (32 bits) per pixel
memcpy(data, mapInfo.pData, StagingTextureDesc.Width * StagingTextureDesc.Height * 4);
Turns out the 1935 first uint32 in this array all contain the same value ; 3595933029. And after that, the same values are often seen hundred times in a row.
This makes me think the frame isn't captured as it should, because the UE4 editor's window doesn't have the exact same color on its first row all along (whether it's top or bottom).
R10G10B10A2 to R8G8B8(A8)
So I tried to guess how to convert from R10G10B10A2 to R8G8B8. I started from this value that appears 1935 times in a row at the beginning of the data buffer : 3595933029.
When I color pick an editor's window screenshot (using the Windows tool, which gets me an image with the exact same dimensions as the DirectX texture, that is 1936x1056), I get the following different colors:
R=56, G=57, B=52 (top left & bottom left)
R=0, G=0, B=0 (top right)
R=46, G=40, B=72 (bottom right - it overlaps the task bar, thus the color)
So I tried to manually convert the color to check if it matches any of those I color picked.
I thought about bit shifting to simply compare the values
3595933029 (value in retrieved buffer) in binary : 11010110010101011001010101100101
Can already see the pattern : 11 followed 3 times by the 10-bit value 0101100101, and none of the picked colors follow this (except the black corner, which would be only made of zeros though)
Anyway, assuming RRRRRRRRRR GGGGGGGGGG BBBBBBBBBB AA order (ditched bits are marked with an x) :
R=214, G=86, B=86 : doesn't match
R=89, G=89, B=89 : doesn't match
If that can help, here's the editor window that should be captured (it really is a Third person template, didn't add anything to it except this capture code)
Here's the generated bitmap when shifting bits :
Code to generate bitmap's pixels data :
struct Pixel {
uint8_t blue = 0;
uint8_t green = 0;
uint8_t red = 0;
} pixel;
uint32_t* pointer = (uint32_t*)mapInfo.pData;
size_t numberOfPixels = bmpInfoHeader.width * bmpInfoHeader.height;
for (int i = 0; i < numberOfPixels; i++) {
uint32_t value = *pointer;
// Ditch the color's 2 last bits, keep the 8 first
pixel.blue = value >> 2;
pixel.green = value >> 12;
pixel.red = value >> 22;
fout.write((char*)&pixel, 3);
It somewhat seems similar in the present colors, however that doesn't look at all like the editor.
What am I missing ?
First of all, you are assuming that mapInfo.RowPitch is exactly StagingTextureDesc.Width * 4. This is often not true. When copying to/from Direct3D resources, you need to do 'row-by-row' copies. Also, allocating a multi-megabyte array on the stack (1936 * 1056 * 4 bytes is roughly 8 MB) is not good practice.
#include <cstdint>
#include <cstring>
#include <memory>
// Assumes our staging texture is 4 bytes-per-pixel
// Allocate temporary memory
auto data = std::unique_ptr<uint32_t[]>(
    new uint32_t[StagingTextureDesc.Width * StagingTextureDesc.Height]);
auto src = static_cast<uint8_t*>(mapInfo.pData);
uint32_t* dest = data.get();
for(UINT y = 0; y < StagingTextureDesc.Height; ++y)
{
    // Copy one row; each pixel is 4 bytes (32 bits)
    memcpy(dest, src, StagingTextureDesc.Width * sizeof(uint32_t));
    // Advance the source by the pitch, which may include padding
    src += mapInfo.RowPitch;
    dest += StagingTextureDesc.Width;
}
For C++11, using std::unique_ptr ensures the memory is eventually released automatically. You can transfer ownership of the memory to something else with uint32_t* ptr = data.release(). See cppreference.
With C++14, the better way to write the allocation is: auto data = std::make_unique<uint32_t[]>(StagingTextureDesc.Width * StagingTextureDesc.Height);. This assumes you are fine with a C++ exception being thrown for out-of-memory.
If you want to return an error code for out-of-memory instead of a C++ exception, use: auto data = std::unique_ptr<uint32_t[]>(new (std::nothrow) uint32_t[StagingTextureDesc.Width * StagingTextureDesc.Height]); if (!data) // return error
Converting 10:10:10:2 content to 8:8:8:8 content can be done efficiently on the CPU with bit-shifting.
The tricky bit is dealing with the up-scaling of the 2-bit alpha to 8-bits. For example, you want the Alpha of 11 to map to 255, not 192.
Here's a replacement for the loop above
// Assumes our staging texture is DXGI_FORMAT_R10G10B10A2_UNORM
for(UINT y = 0; y < StagingTextureDesc.Height; ++y)
{
    auto sptr = reinterpret_cast<uint32_t*>(src);
    for(UINT x = 0; x < StagingTextureDesc.Width; ++x)
    {
        uint32_t t = *(sptr++);
        uint32_t r = (t & 0x000003ff) >> 2;
        uint32_t g = (t & 0x000ffc00) >> 12;
        uint32_t b = (t & 0x3ff00000) >> 22;
        // Upscale alpha
        // 11xxxxxx -> 11111111 (255)
        // 10xxxxxx -> 10101010 (170)
        // 01xxxxxx -> 01010101 (85)
        // 00xxxxxx -> 00000000 (0)
        t &= 0xc0000000;
        uint32_t a = (t >> 24) | (t >> 26) | (t >> 28) | (t >> 30);
        // Convert to DXGI_FORMAT_R8G8B8A8_UNORM
        *(dest++) = r | (g << 8) | (b << 16) | (a << 24);
    }
    src += mapInfo.RowPitch;
}
Of course we can combine the shifting operations since we move them down and then back up in the previous loop. We do need to update the masks to remove the bits that are normally shifted off by the full shifts. This replaces the inner body of the loop above:
// Convert from 10:10:10:2 to 8:8:8:8
uint32_t t = *(sptr++);
uint32_t r = (t & 0x000003fc) >> 2;
uint32_t g = (t & 0x000ff000) >> 4;
uint32_t b = (t & 0x3fc00000) >> 6;
t &= 0xc0000000;
uint32_t a = t | (t >> 2) | (t >> 4) | (t >> 6);
*(dest++) = r | g | b | a;
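As a quick sanity check of the combined masks, here is a self-contained sketch that wraps the per-pixel conversion in a function and applies it to the value 3595933029 from the question's buffer. The function names are made up for this illustration; the masks and shifts are the ones shown above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts one DXGI_FORMAT_R10G10B10A2_UNORM pixel to
// DXGI_FORMAT_R8G8B8A8_UNORM by keeping the top 8 bits of each color channel
// and replicating the 2 alpha bits across 8 bits.
uint32_t ConvertR10G10B10A2ToR8G8B8A8(uint32_t t)
{
    uint32_t r = (t & 0x000003fc) >> 2; // red bits 2..9    -> bits 0..7
    uint32_t g = (t & 0x000ff000) >> 4; // green bits 12..19 -> bits 8..15
    uint32_t b = (t & 0x3fc00000) >> 6; // blue bits 22..29  -> bits 16..23
    uint32_t a = t & 0xc0000000;        // alpha bits 30..31
    a = a | (a >> 2) | (a >> 4) | (a >> 6);
    return r | g | b | a;
}

// Checks the value seen in the question's buffer: each 10-bit channel is
// 0101100101 (357), which becomes 89 (0x59), and alpha 11 becomes 255.
void CheckQuestionSample()
{
    assert(ConvertR10G10B10A2ToR8G8B8A8(3595933029u) == 0xFF595959u);
}
```

So that buffer value decodes to the opaque gray R=89, G=89, B=89.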
Any time you reduce the bit-depth you will introduce error. Techniques like ordered dithering and error-diffusion dithering are commonly used in pixels conversions of this nature. These introduce a bit of noise to the image to reduce the visual impact of the lost low bits.
For examples of conversions for all DXGI_FORMAT types, see DirectXTex which makes use of DirectXMath for all the various packed vector types. DirectXTex also implements both 4x4 ordered dithering and Floyd-Steinberg error-diffusion dithering when reducing bit-depth.
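To illustrate the ordered-dithering idea mentioned above, here is a minimal sketch for a single channel. It uses a 2x2 threshold matrix for brevity and is not DirectXTex's implementation (which, as noted, uses a 4x4 matrix); the function name and the matrix layout are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: reduces one 10-bit channel value to 8 bits with a
// 2x2 ordered dither. x and y are the pixel coordinates.
uint8_t DitherChannel10To8(uint32_t value10, uint32_t x, uint32_t y)
{
    assert(value10 <= 1023u);
    // Thresholds expressed in units of the 2 bits that get dropped (0..3).
    static const uint32_t kBayer2x2[2][2] = { { 0u, 2u }, { 3u, 1u } };
    // Add the position-dependent threshold, clamp to the 10-bit range,
    // then drop the low 2 bits.
    const uint32_t biased = std::min(value10 + kBayer2x2[y & 1u][x & 1u], 1023u);
    return static_cast<uint8_t>(biased >> 2);
}
```

For the 10-bit value 357 (which is 89.25 in 8-bit terms), three pixels of each 2x2 block come out as 89 and one as 90, so the block average preserves the fraction that plain truncation would discard.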

Mixing audio channels

I am implementing an audio channel mixer and using Viktor T. Toth's algorithm. Trying to mix two audio channel streams.
In the code, quantization_ is the number of bytes per sample (the bit depth of a channel expressed in bytes). My mix function takes pointers to destination and source uint8_t buffers, mixes the two channels, and writes into the destination buffer. Because the data arrives in a uint8_t buffer, I use addition, multiplication, and division to assemble the actual 8-, 16- or 24-bit samples and then split the result back into bytes.
Generally, it gives the expected output sample values. However, some samples end up near zero when they should not, as I can see when I open the output in Audacity. In the screenshot, the bottom 2 signals are the two mono channels and the top one is the mixed channel. It can be seen that there are some very low values, especially in the middle.
Below, is my mix function;
void audio_mixer::mix(uint8_t* dest, const uint8_t* source)
{
    uint64_t mixed_sample = 0;
    uint64_t dest_sample = 0;
    uint64_t source_sample = 0;
    uint64_t factor = 0;
    for (int i = 0; i < channel_size_; ++i)
    {
        dest_sample = 0;
        source_sample = 0;
        factor = 1;
        for (int j = 0; j < quantization_; ++j)
        {
            dest_sample += factor * static_cast<uint64_t>(*dest++);
            source_sample += factor * static_cast<uint64_t>(*source++);
            factor = factor * 256;
        }
        mixed_sample = (dest_sample + source_sample) - (dest_sample * source_sample / factor);
        dest -= quantization_;
        for (int k = 0; k < quantization_; ++k)
        {
            *dest++ = static_cast<uint8_t>(mixed_sample % 256);
            mixed_sample = mixed_sample / 256;
        }
    }
}
It seems like you aren't treating the signed audio samples correctly. The horizontal line should be zero voltage from your audio signal.
If you look at the positive voltage audio samples they obey your equation correctly (except for the peak values in the center). The negative values are being compressed which makes me feel like they are being treated as small positive voltages instead of negative voltages.
In other words, maybe those unsigned ints should be signed ints so the top bit indicates the voltage polarity and you can have audio samples in the range +127 to -128.
Those peak values in the center look like they are wrapping around past 255, which would be the peak value for an unsigned byte representation of your audio. I'm not sure how this would happen, but it seems related to the unsigned vs signed signals.
Maybe you should try the other formula Viktor provided in his document:
Z = 2(A+B) - (AB/128) - 256
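For reference, here is a minimal sketch of that formula for unsigned 8-bit samples. The quiet-signal branch (Z = AB/128 when both inputs are below 128) is taken from Toth's article as I recall it and is not quoted in this thread, so treat it as an assumption; the function name is also made up. Signed samples would first have to be offset into the unsigned range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: mixes two unsigned 8-bit samples.
uint8_t MixUnsigned8(uint8_t a, uint8_t b)
{
    const int A = a;
    const int B = b;
    int z;
    if (A < 128 && B < 128)
        z = (A * B) / 128;                     // assumed branch for two quiet inputs
    else
        z = 2 * (A + B) - (A * B) / 128 - 256; // formula quoted above
    assert(z >= 0);                            // holds for all 8-bit inputs
    // With integer division the result can reach 256 at A = B = 255.
    return static_cast<uint8_t>(std::min(z, 255));
}
```

For example, MixUnsigned8(200, 100) gives 2 * 300 - 156 - 256 = 188.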

Optimize a nearest neighbor resizing algorithm for speed

I'm using the following algorithm to perform nearest neighbor resizing. Is there any way to optimize its speed? Input and output buffers are in ARGB format, though images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight) ;
    const int colors = 4;
    for (int y = 0; y < targetHeight; y++)
    {
        int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
        int i_xdest = y * targetWidth;
        for (int x = 0; x < targetWidth; x++)
        {
            int x2 = ((x * x_ratio) >> 16) ;
            int y2_x2_colors = (y2_xsource + x2) * colors;
            int i_x_colors = (i_xdest + x) * colors;
            output[i_x_colors] = input[y2_x2_colors];
            output[i_x_colors + 1] = input[y2_x2_colors + 1];
            output[i_x_colors + 2] = input[y2_x2_colors + 2];
            output[i_x_colors + 3] = input[y2_x2_colors + 3];
        }
    }
}
restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare pointerToOutput and pointerToInput as uint32_t pointers, so that the four 8-bit copy-assignments can be combined into a single 32-bit one, assuming the pointers are 32-bit aligned.
There's little that you can do to speed this up, as you already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler didn't see that yet).
In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work on the properties of the relation Xd=Wd.Xs/Ws in integers), and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
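A simplified sketch of the tabulation idea: it precomputes the source column for every destination column once and uses memcpy for destination rows that map to the same source row. It uses a plain lookup table rather than the run-length coding suggested above, works on 32-bit pixels, and the function name is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: nearest-neighbor resize of 32-bit pixels with a
// precomputed column table and memcpy of repeated rows.
void resizeNearestTabulated(const uint32_t* input, uint32_t* output,
                            int sourceWidth, int sourceHeight,
                            int targetWidth, int targetHeight)
{
    assert(sourceWidth > 0 && sourceHeight > 0 && targetWidth > 0 && targetHeight > 0);
    // Tabulate the source column for every destination column once.
    std::vector<int> xmap(targetWidth);
    for (int x = 0; x < targetWidth; ++x)
        xmap[x] = static_cast<int>((static_cast<int64_t>(x) * sourceWidth) / targetWidth);

    int previousSourceRow = -1;
    for (int y = 0; y < targetHeight; ++y)
    {
        const int sourceRow = static_cast<int>((static_cast<int64_t>(y) * sourceHeight) / targetHeight);
        uint32_t* destLine = output + static_cast<std::size_t>(y) * targetWidth;
        if (sourceRow == previousSourceRow)
        {
            // Same source row as the previous destination row: copy that row.
            std::memcpy(destLine, destLine - targetWidth, targetWidth * sizeof(uint32_t));
            continue;
        }
        const uint32_t* sourceLine = input + static_cast<std::size_t>(sourceRow) * sourceWidth;
        for (int x = 0; x < targetWidth; ++x)
            destLine[x] = sourceLine[xmap[x]];
        previousSourceRow = sourceRow;
    }
}
```

Whether this beats the fixed-point loop depends on the enlargement factor and the compiler, so it is worth measuring rather than assuming.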
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for speedup, there could remain the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might make it possible to control the replication (or decimation) of the pixels. But due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can utilize massive parallelization by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and providing a properly sized quad can give you inherent nearest neighbor calculations. OpenGL could also give you access to other resizing sampling techniques by simply changing the properties of the texture you read from (which would amount to a single gl command and could be an easy parameter to your resize function).
Also later in development, you could simply swap out a shader for other blending techniques which also keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry it can become almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const uint32_t* input32 = (const uint32_t*)input;
    uint32_t* output32 = (uint32_t*)output;
    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
    int x_ratio_with_color = x_ratio;
    for (int y = 0; y < targetHeight; y++)
    {
        int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
        int i_xdest = y * targetWidth;
        int source_x_offset = 0; // 16.16 fixed-point source column
        int startingOffset = y2_xsource;
        const uint32_t * inputLine = input32 + startingOffset;
        for (int x = 0; x < targetWidth; x++)
        {
            // Copy one whole 32-bit pixel, then advance both positions
            int sourceOffset = source_x_offset >> 16;
            output32[i_xdest] = inputLine[sourceOffset];
            i_xdest += 1;
            source_x_offset += x_ratio_with_color;
        }
    }
}

How to efficiently render a 24-bpp image on a 32-bpp display?

First of all, I'm programming in the kernel context so no existing libraries exist. In fact this code is going to go into a library of my own.
Two questions, one more important than the other:
As the title suggests, how can I efficiently render a 24-bpp image onto a 32-bpp device, assuming that I have the address of the frame buffer?
Currently I have this code:
void BitmapImage::Render24(uint16_t x, uint16_t y, void (*r)(uint16_t, uint16_t, uint32_t))
{
    uint32_t imght = Math::AbsoluteValue(this->DIB->GetBitmapHeight());
    uint64_t ptr = (uint64_t)this->ActualBMP + this->Header->BitmapArrayOffset;
    uint64_t rowsize = ((this->DIB->GetBitsPerPixel() * this->DIB->GetBitmapWidth() + 31) / 32) * 4;
    uint64_t oposx = x;
    uint64_t posx = oposx;
    uint64_t posy = y + (this->DIB->Type == InfoHeaderV1 && this->DIB->GetBitmapHeight() < 0 ? 0 : this->DIB->GetBitmapHeight());
    for(uint32_t d = 0; d < imght; d++)
    {
        for(uint32_t w = 0; w < rowsize / (this->DIB->GetBitsPerPixel() / 8); w++)
        {
            r(posx, posy, (*((uint32_t*)ptr) & 0xFFFFFF));
            ptr += this->DIB->GetBitsPerPixel() / 8;
        }
        posx = oposx;
    }
}
r is a function pointer to a PutPixel-esque thing that accepts x, y, and colour parameters.
Obviously this code is terribly slow, since plotting pixels one at a time is never a good idea.
For my 32-bpp rendering code (which I also have a question about, more on that later) I can easily Memory::Copy() the bitmap array (I'm loading bmp files here) to the frame buffer.
However, how do I do this with 24bpp images? On a 24bpp display this would be fine but I'm working with a 32bpp one.
One solution I can think of right now is to create another bitmap array which essentially contains values of 0x00(colour) and the use that to draw to the screen -- I don't think this is very good though, so I'm looking for a better alternative.
Next question:
2. Given, for obvious reasons, one cannot simply Memory::Copy() the entire array at once onto the frame buffer, the next best thing would be to copy them row by row.
Is there a better way?
Basically something like this:
for (uint32_t l = 0; l < h; ++l) // l is the line index in pixels
{
    // srcPitch is the distance between lines in bytes
    unsigned char* srcLine = (unsigned char*)srcBuffer + l * srcPitch;
    // trgPitch is the distance between lines in 32-bit pixels
    unsigned* trgLine = ((unsigned*)trgBuffer) + l * trgPitch;
    for (uint32_t c = 0; c < w; ++c) // c is the column index in pixels
    {
        // build target pixel. arrange indexes to fit your render target (0, 1, 2)
        *trgLine++ = (srcLine[0] << 16) | (srcLine[1] << 8)
                   | srcLine[2] | (0xffu << 24);
        srcLine += 3;
    }
}
A few notes:
- better to write to a different buffer than the render buffer so the image is displayed at once.
- using functions for pixel placement like you did is very (very very) slow.
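The loop above can be wrapped into a self-contained function along these lines. This is a sketch under stated assumptions: the source bytes are taken to be in BMP order (blue, green, red), the target is taken to be 0xAARRGGBB with the top byte forced to 0xFF, and the function and parameter names are made up for this example. Adjust the byte indexes if your frame buffer uses a different layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: expands 24-bpp pixels (3 bytes each) into 32-bpp
// pixels, one row at a time. srcPitch is in bytes, dstPitchPixels is in
// 32-bit pixels.
void Expand24To32(const uint8_t* src, std::size_t srcPitch,
                  uint32_t* dst, std::size_t dstPitchPixels,
                  uint32_t w, uint32_t h)
{
    assert(srcPitch >= static_cast<std::size_t>(w) * 3 && dstPitchPixels >= w);
    for (uint32_t l = 0; l < h; ++l)
    {
        const uint8_t* s = src + l * srcPitch;
        uint32_t* d = dst + l * dstPitchPixels;
        for (uint32_t c = 0; c < w; ++c)
        {
            // s[0] = blue, s[1] = green, s[2] = red (assumed BMP byte order)
            *d++ = 0xFF000000u
                 | (static_cast<uint32_t>(s[2]) << 16)
                 | (static_cast<uint32_t>(s[1]) << 8)
                 |  static_cast<uint32_t>(s[0]);
            s += 3;
        }
    }
}
```

Reading the source byte by byte avoids the unaligned 32-bit read past the end of the last pixel that a cast to uint32_t* would perform.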

Faster algorithm to check the colors in a image

Suppose I am given a 2048x2048 image and I want to know the total number of colors present in it. What is the fastest possible algorithm? I came up with two algorithms, but both are slow.
Algorithm 1:
Compare the current pixel and the next pixel and if they are different
Check a temporary variable, which contains all the detected colors, to see if the color is present or not
If not present add it to the array(List) and increment noOfColors.
This Algorithm works but is slow. For a 1600x1200 pixels image it takes around 3 sec.
Algorithm 2:
The obvious method of checking the each pixel with all other pixels and recording the no of occurences of the color and incrementing the count. This is very very slow, almost like a hung app. So is there any better approach? I need all the pixel info.
You could use std::set (or std::unordered_set), and simply do a single loop though the pixels, adding the colors to the set. Then the number of colors is the size of the set.
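A minimal sketch of that single loop, assuming the pixels are available as packed 32-bit values with the color in the low 24 bits (the function name and the pixel layout are assumptions for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper: counts distinct 24-bit colors with a hash set.
std::size_t CountColorsWithSet(const std::vector<uint32_t>& pixels)
{
    std::unordered_set<uint32_t> seen;
    for (uint32_t p : pixels)
        seen.insert(p & 0x00FFFFFFu); // ignore the top (alpha/padding) byte
    return seen.size();
}
```

Each pixel costs one hash insertion on average, so the whole pass is linear in the number of pixels.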
Well, this is suited for parallelization. Split the image in several parts and execute the algorithm for each part in a separate task. To avoid syncing each should have its own storage for the unique colors. When all tasks are done, you aggregate the results.
DRAM is dirt cheap. Use brute force. Fill a tab, count.
On a Core 2 Duo @ 3.0 GHz:
0.35secs for 4096x4096 32 bits rgb
0.20secs after some trivial parallelization (I do know nothing of omp)
However, if you are to use 64bit rgb (one channel = 16 bits) it is another question (not enough memory).
You shall probably need a good hash table function.
Using random pixels, same size takes 10 secs.
Remark: at 0.15 secs, the std::bitset<> solution is faster (it gets slower trivially parallelized !).
Solution, c++11
#include <vector>
#include <random>
#include <iostream>
#include <boost/chrono.hpp>
#define _16M 256*256*256
typedef union {
    struct { unsigned char r,g,b,n ; } r_g_b_n ;
    unsigned char rgb[4] ;
    unsigned i_rgb;
} RGB ;
RGB make_RGB(unsigned char r, unsigned char g , unsigned char b) {
    RGB res;
    res.r_g_b_n.r = r;
    res.r_g_b_n.g = g;
    res.r_g_b_n.b = b;
    res.r_g_b_n.n = 0;
    return res;
}
static_assert(sizeof(RGB)==4,"bad RGB size not 4");
static_assert(sizeof(unsigned)==4,"bad i_RGB size not 4");
struct Image
{
    Image (unsigned M, unsigned N) : M_(M) , N_(N) , v_(M*N) {}
    const RGB* tab() const {return & v_[0] ; }
    RGB* tab() {return & v_[0] ; }
    unsigned M_ , N_;
    std::vector<RGB> v_;
};
void FillRandom(Image & im) {
    std::uniform_int_distribution<unsigned> rnd(0,_16M-1);
    std::mt19937 rng;
    const int N = im.M_ * im.N_;
    RGB* tab = im.tab();
    for (int i=0; i<N; i++) {
        unsigned r = rnd(rng) ;
        *tab++ = make_RGB( (r & 0xFF) , (r>>8 & 0xFF), (r>>16 & 0xFF) ) ;
    }
}
size_t Count(const Image & im) {
    const int N = im.M_ * im.N_;
    std::vector<char> count(_16M,0);
    const RGB* tab = im.tab();
    // Mark the color of every pixel as seen
    #pragma omp parallel for
    for (int i=0; i<N; i++) {
        count[ tab[i].i_rgb ] = 1 ;
    }
    size_t nColors = 0 ;
    // Sum the marks; the reduction keeps the parallel sum race-free
    #pragma omp parallel for reduction(+:nColors)
    for (int i = 0 ; i<_16M; i++) nColors += count[i];
    return nColors;
}
int main() {
    Image im(4096,4096);
    typedef boost::chrono::high_resolution_clock hrc;
    auto start = hrc::now();
    std::cout << " # colors " << Count(im) << std::endl ;
    boost::chrono::duration<double> sec = hrc::now() - start;
    std::cout << " took " << sec.count() << " seconds\n";
    return 0;
}
The only feasible algorithm here is building a sort of a histogram of the image colors. The only difference in your case is that instead of calculating the population of each color you need just to know if it's zero or not.
Depending on which color space you work in, you may use either an std::set to tag existing colors (as Joachim Pileborg suggested), or just use something like std::bitset, which is obviously faster. This depends on how many distinct colors exist in your color space.
Also, like Marius Bancila noted, this procedure is a perfect match for parallelization. Calculate the histogram-like data for image parts, and then merge it. Naturally the image division should be based on its memory partition, not the geometric properties. In simple words - split the image vertically (by batches of scan lines), not horizontally.
And, if possible, you should either use some low-level library/code to run through pixels, or try to write your own. At least you must obtain a pointer to scan line and run on its pixels in a batch, rather than doing something like GetPixel for each pixel.
The point here is that the ideal representation of an image as a 2D array of colors is not necessarily the way the image is stored in memory (color components can be arranged in "planes", there could be "padding", etc.). So getting the pixels using a GetPixel-like function may take time.
The question may even be somewhat meaningless if the image is not the result of a "vectorial draw". Think of a photograph: between two nearby "greens" you find all the shades of green, so the colors in this case are no more and no less than the ones supported by the encoding of the image itself (2^24, or 256, or 16 or ...). So, unless you are interested in the color distribution (how differently used they are), just counting them makes little sense.
A workaround can be:
1. Create an in-memory bitmap having pixels in a "single plane format"
2. Blit your image into that bitmap using BitBlt or similar (this lets the OS do the pixel conversion from the GPU, if any)
3. Get the bitmap bits (this lets you access the stored values)
4. Run your "counting algorithm" (whatever it is) on those values.
Note that step 1 and 2 can be avoided if you already know that the image is already in planar format.
If you have a multicore system, step 4 can also be assigned to different threads, each working part of the image.
You can use bitset which allows you to set individual bits and has a count function.
You have a bit for each colour, there are 256 values for each of RGB, so that's 256*256*256 bits (16,777,216 colours). The bitset will use a byte for every 8 bits so it will use 2MB.
Use the pixel colour as an index into the bitset:
bitset<256*256*256> colours;
for(int pixel: pixels) {
    colours[pixel] = true;
}
This has linear complexity.
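A self-contained sketch of the bitset approach, assuming packed 32-bit pixels with the color in the low 24 bits (the function name and pixel layout are assumptions for this example). The bitset is allocated on the heap because 2 MB may be more than the default stack size allows on some platforms.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper: counts distinct 24-bit colors with one bit per color.
std::size_t CountColorsWithBitset(const std::vector<uint32_t>& pixels)
{
    typedef std::bitset<(1u << 24)> ColorBits; // 16,777,216 bits = 2 MB
    std::unique_ptr<ColorBits> seen(new ColorBits()); // all bits start cleared
    for (uint32_t p : pixels)
        seen->set(p & 0x00FFFFFFu); // mask keeps the index within the bitset
    return seen->count();
}
```

The final count() call scans the whole 2 MB regardless of image size, which is a fixed cost on top of the per-pixel loop.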
Late comer to this answer, but could not help it since this algorithm is brutally fast, developed about 2 or more decades ago, when it really mattered.
3-D Lookup Table Color Matching
Basically, it creates a 3D color look-up table and the search is very fast. I've made some modifications to suit my purpose for image binarization, so I reduced the color space from ff ff ff to f f f, and it's even 10 times faster. Right out of the box, I haven't found anything even close, including hash tables.
char * creatematcharray(struct rgb_color *palette, int palettesize)
int rval=16, gval=16, bval=16, len, r, g, b;
char *taken, *match, *same;
int i, set, sqstep, tp, maxtp, *entryr, *entryg, *entryb;
char *table;
// Prepare table buffers:
size_t size_of_table = len*sizeof(char);
table=(char *)malloc(size_of_table);
if (table==nullptr) return nullptr;
// Select colors to use for fill:
size_t size_of_taken = (palettesize * sizeof(int) * 3) +
(palettesize*sizeof(char)) + (len * sizeof(char));
taken=(char *)malloc(size_of_taken);
same=taken + (len * sizeof(char));
entryr=(int*)(same + (palettesize * sizeof(char)));
entryg=entryr + palettesize;
entryb=entryg + palettesize;
if (taken==nullptr)
free((void *)table);
return nullptr;
std::memset((void *)taken, 0, len * sizeof(char));
// std::cout << "sizes: " << size_of_table << " " << size_of_taken << std::endl;
for (i=0; i<palettesize; i++)
// Compute 3d-table coordinates of palette rgb color:
r=palette[i].r&0x0f, g=palette[i].g&0x0f, b=palette[i].b&0x0f;
// Put color in position:
if (taken[b*rval*gval+g*rval+r]==0) set++;
else same[match[b*rval*gval+g*rval+r]]=1;
entryr[i]=r; entryg[i]=g; entryb[i]=b;
// ### Fill match_array by steps: ###
for (set=len-set, sqstep=1; set>0; sqstep++)
for (i=0; i<palettesize && set>0; i++)
if (same[i]==0)
// Fill all six sides of incremented cube (by pairs, 3 loops):
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b+=sqstep*2)
if (b>=0 && b<bval)
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r++)
if (r>=0 && r<rval)
{ // Draw one 3d line:
if (tp<b*rval*gval+0*rval+r)
if (maxtp>b*rval*gval+(gval-1)*rval+r)
for (; tp<=maxtp; tp+=rval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g+=sqstep*2)
if (g>=0 && g<gval)
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b++)
if (b>=0 && b<bval)
{ // Draw one 3d line:
if (tp<b*rval*gval+g*rval+0)
if (maxtp>b*rval*gval+g*rval+(rval-1))
for (; tp<=maxtp; tp++)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r+=sqstep*2)
if (r>=0 && r<rval)
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g++)
if (g>=0 && g<gval)
{ // Draw one 3d line:
if (tp<0*rval*gval+g*rval+r)
if (maxtp>(bval-1)*rval*gval+g*rval+r)
for (; tp<=maxtp; tp+=rval*gval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
free((void *)taken);
return table;
The answer: unordered_map
I use unordered_map, based on my testing.
You should test because your compiler / library may exhibit different performance. Comment out #define USEHASH to use map instead.
On my machine, the vanilla unordered_map (a hash implementation) is about twice as fast as map. Inasmuch as different compilers and libraries can vary enormously, you must test to see which is better. In production, I build a fake image on first start of the app, run both algorithms on it and time them, save an indication of which one is faster, and then preferentially use that for all subsequent starts on that machine. It's nit-picky, but hey, the user's time is valuable to them.
For a DSLR image with 12,106,244 pixels (about 12 megapixels, not a typo) and 11,857,131 distinct colors (also not a typo), map takes about 14 seconds, while unordered map takes about 7 seconds:
Test Code:
#define USEHASH 1
#ifdef USEHASH
#include <unordered_map>
#else
#include <map>
#endif
size = im->xw * im->yw;
#ifdef USEHASH
// unordered_map is about twice as fast as map on my mac with qt5
// --------------------------------------------------------------
std::unordered_map<qint64, unsigned char> colors;
colors.reserve(size); // pre-allocate the hash space
#else
std::map<qint64, unsigned char> colors;
#endif
...use of either is in a loop where I build a 48-bit value of 0RGB in a 64-bit variable corresponding to the 16-bit RGB values of the image pixels, like so:
for (i=0; i<size; i++)
pel = BUILDPEL(i); // macro just shovels 0RGB into 64 bit pel from im
// You'd do the same for your image structure
// in whatever way is fastest for you
colors[pel] = 1;
cc = colors.size();
// time here: 14 secs for map, 7 secs for unordered_map with
// 12,106,244 pixels containing 11,857,131 colors on 12/24 core,
// 3 GHz, 64GB machine.