Count elements in texture - c++

I have a 3D texture of 32-bit unsigned integers initialized with zeroes. It is defined as follows:
D3D11_TEXTURE3D_DESC description{};
description.Format = DXGI_FORMAT_R32_UINT;
description.Usage = D3D11_USAGE_DEFAULT;
description.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
description.CPUAccessFlags = 0;
description.MipLevels = 1;
description.Width = ...;
description.Height = ...;
description.Depth = ...;
I am writing to this texture in a compute shader to set a bit at a specified position if a certain condition is fulfilled:
RWTexture3D<uint> txOutput : register(u0);
cbuffer InputBuffer : register(b0)
{
uint position;
/** other elements **/
}
#define SET_BIT(value, position) value |= (1U << position)
[numthreads(8, 8, 8)]
void main(uint3 threadID : SV_DispatchThreadID)
{
if(/** some condition **/)
{
uint value = txOutput[threadID];
SET_BIT(value, position);
txOutput[threadID] = value;
}
}
I need to know, in the C++ code, how many elements of this texture have a certain bit position set. How could this be done?

You will have to read back the texture to the CPU with the ID3D11DeviceContext::Map API:
https://learn.microsoft.com/en-us/windows/win32/api/d3d11/nf-d3d11-id3d11devicecontext-map
You will get out a void* that you cast to uint32_t*, which will point to the start of your data; mind the RowPitch and DepthPitch of the mapped subresource when walking rows and depth slices. Also note that a D3D11_USAGE_DEFAULT texture with no CPU access flags cannot be mapped directly, so you first copy it into a D3D11_USAGE_STAGING texture with ID3D11DeviceContext::CopyResource and map that copy.
You need to get better at looking up the DirectX documentation; it is really quite good. There are a lot of harder things you will need to find in the documentation if you keep doing 3D graphics.
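A minimal sketch of the whole readback-and-count step, assuming the device, the immediate context, the texture described in the question and the bit position are at hand (since the texture is DEFAULT usage with no CPU access flags, it is copied into a STAGING texture first and that copy is mapped):

#include <d3d11.h>
#include <cstdint>
#include <cstddef>

// Counts how many texels of the R32_UINT 3D texture have bit 'position' set.
size_t CountTexelsWithBitSet(ID3D11Device* device, ID3D11DeviceContext* context,
                             ID3D11Texture3D* texture, uint32_t position)
{
    // Describe a staging copy that the CPU is allowed to read.
    D3D11_TEXTURE3D_DESC desc{};
    texture->GetDesc(&desc);
    desc.Usage = D3D11_USAGE_STAGING;
    desc.BindFlags = 0;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.MiscFlags = 0;

    ID3D11Texture3D* staging = nullptr;
    if (FAILED(device->CreateTexture3D(&desc, nullptr, &staging)))
        return 0;

    // Copy the GPU texture into the staging texture, then map it for reading.
    context->CopyResource(staging, texture);

    size_t count = 0;
    D3D11_MAPPED_SUBRESOURCE mapped{};
    if (SUCCEEDED(context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        const uint8_t* base = static_cast<const uint8_t*>(mapped.pData);
        for (UINT z = 0; z < desc.Depth; ++z)
            for (UINT y = 0; y < desc.Height; ++y)
            {
                // Rows and slices are padded: step by RowPitch/DepthPitch, not by Width.
                const uint32_t* row = reinterpret_cast<const uint32_t*>(
                    base + z * mapped.DepthPitch + y * mapped.RowPitch);
                for (UINT x = 0; x < desc.Width; ++x)
                    if (row[x] & (1u << position))
                        ++count;
            }
        context->Unmap(staging, 0);
    }
    staging->Release();
    return count;
}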

Edit: I use DirectML to do these tasks now, only using compute shaders for exotic work.
When I need to read back to the CPU I always accumulate on the CPU, because accumulating on the GPU is difficult and only partially parallel.
Summing a texture on the GPU is called a parallel reduction, and this style of programming is called general-purpose GPU (GPGPU) programming. The most excellent resource on GPGPU for DirectCompute is this slide deck from NVIDIA, which goes through optimizing parallel reductions: https://on-demand.gputechconf.com/gtc/2010/presentations/S12312-DirectCompute-Pre-Conference-Tutorial.pdf
From the slides:
// Tree reduction in groupshared memory: each pass halves the number of
// active threads and adds the upper half of sdata onto the lower half.
for (unsigned int s = groupDim_x / 2; s > 0; s >>= 1)
{
    if (tid < s)
    {
        sdata[tid] += sdata[tid + s];
    }
    GroupMemoryBarrierWithGroupSync(); // make the partial sums visible to the whole group
}

Related

DirectX - Writing to 3D Texture Causing Display Driver Failure

I'm testing writing to 2D and 3D textures in compute shaders, outputting a gradient noise texture consisting of 32-bit floats. Writing to a 2D texture works fine, but writing to a 3D texture doesn't. Are there additional considerations that need to be made when creating a 3D texture compared to a 2D texture?
Code of how I'm defining the 3D texture below:
HRESULT BaseComputeShader::CreateTexture3D(UINT width, UINT height, UINT depth, DXGI_FORMAT format, ID3D11Texture3D** texture)
{
D3D11_TEXTURE3D_DESC textureDesc;
ZeroMemory(&textureDesc, sizeof(textureDesc));
textureDesc.Width = width;
textureDesc.Height = height;
textureDesc.Depth = depth;
textureDesc.MipLevels = 1;
textureDesc.Format = format;
textureDesc.Usage = D3D11_USAGE_DEFAULT;
textureDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
textureDesc.CPUAccessFlags = 0;
textureDesc.MiscFlags = 0;
return renderer->CreateTexture3D(&textureDesc, 0, texture);
}
HRESULT BaseComputeShader::CreateTexture3DUAV(UINT depth, DXGI_FORMAT format, ID3D11Texture3D** texture, ID3D11UnorderedAccessView** unorderedAccessView)
{
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
ZeroMemory(&uavDesc, sizeof(uavDesc));
uavDesc.Format = format;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE3D;
uavDesc.Texture3D.MipSlice = 0;
uavDesc.Texture3D.FirstWSlice = 0;
uavDesc.Texture3D.WSize = depth;
return renderer->CreateUnorderedAccessView(*texture, &uavDesc, unorderedAccessView);
}
HRESULT BaseComputeShader::CreateTexture3DSRV(DXGI_FORMAT format, ID3D11Texture3D** texture, ID3D11ShaderResourceView** shaderResourceView)
{
D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc;
ZeroMemory(&srvDesc, sizeof(srvDesc));
srvDesc.Format = format;
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE3D;
srvDesc.Texture3D.MostDetailedMip = 0;
srvDesc.Texture3D.MipLevels = 1;
return renderer->CreateShaderResourceView(*texture, &srvDesc, shaderResourceView);
}
And how I'm writing to it in the compute shader:
// The texture we're writing to
RWTexture3D<float> outputTexture : register(u0);
[numthreads(8, 8, 8)]
void main(uint3 DTid : SV_DispatchThreadID)
{
float noiseValue = 0.0f;
float value = 0.0f;
float localAmplitude = amplitude;
float localFrequency = frequency;
// Loop for the number of octaves, running the noise function as many times as desired (8 is usually sufficient)
for (int k = 0; k < octaves; k++)
{
noiseValue = noise(float3(DTid.x * localFrequency, DTid.y * localFrequency, DTid.z * localFrequency)) * localAmplitude;
value += noiseValue;
// Calculate a new amplitude based on the input persistence/gain value
// localAmplitude will get smaller as the number of layers (i.e. k) increases
localAmplitude *= persistence;
// Calculate a new frequency based on a lacunarity value of 2.0
// This gives us 2^k as the frequency
// i.e. Frequency at k = 4 will be f * 2^4 as we have looped 4 times
localFrequency *= 2.0f;
}
// Output value to the 3D index in the texture provided by thread indexing
outputTexture[DTid.xyz] = value;
}
And finally, how I'm running the shader:
// Set the shader
deviceContext->CSSetShader(computeShader, nullptr, 0);
// Set the shader's buffers and views
deviceContext->CSSetConstantBuffers(0, 1, &cBuffer);
deviceContext->CSSetUnorderedAccessViews(0, 1, &textureUAV, nullptr);
// Launch the shader
deviceContext->Dispatch(512, 512, 512);
// Reset the shader now we're done
deviceContext->CSSetShader(nullptr, nullptr, 0);
// Reset the shader views
ID3D11UnorderedAccessView* ppUAViewnullptr[1] = { nullptr };
deviceContext->CSSetUnorderedAccessViews(0, 1, ppUAViewnullptr, nullptr);
// Create the shader resource view for access in other shaders
HRESULT result = CreateTexture3DSRV(DXGI_FORMAT_R32_FLOAT, &texture, &textureSRV);
if (result != S_OK)
{
MessageBox(NULL, L"Failed to create texture SRV after compute shader execution", L"Failed", MB_OK);
exit(0);
}
My bad, simple mistake. Compute shader threads are limited in number: a single thread group is limited to a total of 1024 threads, and a Dispatch call cannot issue more than 65535 thread groups. The HLSL compiler will catch the former issue, but the Visual C++ compiler will not catch the latter.
If you create a texture of 512 * 512 * 512 (which seems to be what you are trying to achieve), your dispatch needs to be divided into groups:
deviceContext->Dispatch(512 / 8, 512 / 8, 512 / 8);
In your previous case, the dispatch was:
(512*8) * (512*8) * (512*8) = 68,719,476,736 threads
which very likely triggered the timeout detection and crashed the driver.
Also, the limit of 65535 thread groups is per dimension, so with 64 groups per dimension you are completely safe to run this.
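As a side note, a hedged sketch of a dispatch that also handles sizes which are not an exact multiple of the 8x8x8 group size, using a round-up division (width, height, depth and deviceContext are assumed to be the same values used elsewhere in the question):

const UINT groupSize = 8;   // must match [numthreads(8, 8, 8)] in the shader
UINT groupsX = (width  + groupSize - 1) / groupSize;
UINT groupsY = (height + groupSize - 1) / groupSize;
UINT groupsZ = (depth  + groupSize - 1) / groupSize;
deviceContext->Dispatch(groupsX, groupsY, groupsZ);
// Any surplus threads in the last group along each axis should be masked off in the shader.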
One last thing: you can create both the shader resource view and the unordered access view right after creating your 3D texture (before the dispatch call).
This is generally recommended to avoid mixing context code with resource-creation code.
On resource creation, your check is not valid either:
if (result != S_OK)
The HRESULT success condition is >= 0 (there are success codes other than S_OK), so you should use the built-in macros instead, e.g.:
if (FAILED(result))
for the failure branch, or SUCCEEDED(result) for the success branch.
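Applied to the SRV-creation check from the question, the corrected error branch would look like this (same handling as in the question, only the test changes):

HRESULT result = CreateTexture3DSRV(DXGI_FORMAT_R32_FLOAT, &texture, &textureSRV);
if (FAILED(result))
{
    MessageBox(NULL, L"Failed to create texture SRV after compute shader execution", L"Failed", MB_OK);
    exit(0);
}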

Metal Prevented Device Address Mode

I am creating a graphics application that uses Metal to render everything. When I did a frame debug, under pipeline statistics every one of my draw calls had a "!!" priority alert titled "Prevented Device Address Mode Load" with the details:
Indexing using unsigned int for offset prevents addressing calculation in device. To prevent this extra ALU operation use int for offset.
So, for my simplest draw call that involves this, here is what is going on. There is a large amount of vertex data followed by an index buffer. The index buffer is created and filled at the start and is constant from then on. The vertex data is constantly changing.
I have the following types:
struct Vertex {
float3 data;
};
typedef int32_t indexType;
Then the following draw call
[encoder drawIndexedPrimitives:MTLPrimitiveTypeTriangle indexCount:/*int here*/ indexType:MTLIndexTypeUInt32 indexBuffer:indexBuffer indexBufferOffset:0];
Which goes to the following vertex function
vertex VertexOutTC vertex_fun(constant Vertex * vertexBuffer [[ buffer(0) ]],
indexType vid [[ vertex_id ]],
constant matrix_float3x3* matrix [[buffer(1)]]) {
const float2 coords[] = {float2(-1, -1), float2(-1, 1), float2(1, -1), float2(1, 1)};
Vertex vert = vertexBuffer[vid];
VertexOutTC out;
out.position = float4((*matrix * float3(vert.data.x, vert.data.y, 1.0)).xy, ((float)((int)vid/4))/10000.0, 1.0);
out.color = HSVtoRGB(vert.data.z, 1.0, 1.0);
out.tc = coords[vid % 4];
return out;
}
I am very confused about what exactly I am doing wrong here. The error would seem to suggest I shouldn't use an unsigned type for the offset, which I am guessing is the index buffer.
The thing is, ultimately, for the index buffer there are only MTLIndexTypeUInt32 and MTLIndexTypeUInt16, both of which are unsigned. Furthermore, if I try to use a raw int as the type, the shader won't compile. What is going on here?
In Table 5.1 of the Metal Shading Language Specification, they list the "Corresponding Data Type" for vertex_id as ushort or uint. (There are similar tables in that document for all the rest of the types, my examples will use thread_position_in_grid which is the same).
Meanwhile, the hardware prefers signed types for addressing. So if you do
kernel void test(uint position [[thread_position_in_grid]], device float *test) {
test[position] = position;
test[position + 1] = position;
test[position + 2] = position;
}
we are indexing test by an unsigned integer. Debugging this shader, we can see that it involves 23 instructions and has the "Prevented Device Mode Store" warning.
If we convert to int instead, this uses only 18 instructions:
kernel void test(uint position [[thread_position_in_grid]], device float *test) {
test[(int)position] = position;
test[(int)position + 1] = position;
test[(int)position + 2] = position;
}
However, not all uint values can fit into int, so this optimization only works for half the range of uint. Still, that covers many use cases.
What about ushort? Well,
kernel void test(ushort position [[thread_position_in_grid]], device float *test) {
test[position] = position;
test[position + 1] = position;
test[position + 2] = position;
}
This version is only 17 instructions. We are also "warned" about using unsigned indexing here, even though it is faster than the signed versions above. This suggests to me the warning is not especially well-designed and requires significant interpretation.
kernel void test(ushort position [[thread_position_in_grid]], device float *test) {
short p = position;
test[p] = position;
test[p + 1] = position;
test[p + 2] = position;
}
This is the signed version of short, and fixes the warning, but is also 17 instructions. So it makes Xcode happier, but I'm not sure it's actually better.
Finally, here's the case I was in. My position ranges above signed short, but below unsigned short. Does it make sense to promote short to int for the indexing?
kernel void test(ushort position [[thread_position_in_grid]], device float *test) {
int p = position;
test[p] = position;
test[p + 1] = position;
test[p + 2] = position;
}
This is also 17 instructions, and generates the device store warning. I believe the compiler proves ushort fits into int, and ignores the conversion. This "unsigned" arithmetic then produces a warning telling me to use int, even though that's exactly what I did.
In summary, these warnings are a bit naive, and should really be confirmed or refuted through on-device testing.

Porting desktop GLSL shader that uses bit operations to GLES

I'm porting a desktop OpenGL application to GLES-2 (iOS specifically). In the desktop version, some GLSL shaders relied on integer bit operations, which GLES lacks.
This function was used originally in a Fragment Shader:
int reverseByte(int a)
{
int b = 0;
for (int i = 0; i < 8; i++)
{
b <<= 1;
b |= ((a & (1 << i)) >> i);
}
return b;
}
// ---- usage example: ----
// get inputs from somewhere, just some test values here...
int r = 255;
int g = 128;
int b = 20;
r = reverseByte(r);
g = reverseByte(g);
b = reverseByte(b);
/* produces:
r = 255
g = 1
b = 40
*/
// color would then be normalized to [0,1] range and further used...
It reverses the order of bits in a byte. This is used with RGB colors in the [0,255] range. GLES lacks integer bit manipulation, so the above function doesn't compile. I did some research trying to find a replacement for it and found several other possible ways of reversing bits, but they all rely on integer bit operations.
My question is: Is there a way to achieve similar or equivalent result using just floating point operations and/or the stuff available in GLSL-ES?
Side notes:
I cannot precompute the values on the CPU and pass the data in as a texture or whatever, as the data is procedurally generated by the shader.
You might think of suggesting that I pack the data into a texture, upload it to the CPU, process it and then update the texture with the results. Well, that is actually my current solution, but performance is very poor due to the large data transfers. I would very much like to be able to do it directly in the shader.
It's a bit involved, but this should do the trick (written with float arithmetic, since GLSL ES only provides mod() for floats):
float reverseByte(float a)   // a is a byte value in [0.0, 255.0]
{
    float b = 0.0;
    for (int i = 0; i < 8; i++)
    {
        b = b * 2.0 + mod(a, 2.0);   // shift the result left and append the lowest bit of a
        a = floor(a / 2.0);          // shift a right by one bit
    }
    return b;
}
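As a sanity check, the same divide-and-remainder arithmetic can be verified on the CPU against the example values from the question (plain C++, nothing GLSL-specific):

#include <cassert>

int reverseByteArithmetic(int a)   // a in [0, 255]
{
    int b = 0;
    for (int i = 0; i < 8; i++)
    {
        b = b * 2 + (a % 2);   // append the lowest bit of a
        a /= 2;                // drop that bit
    }
    return b;
}

int main()
{
    assert(reverseByteArithmetic(255) == 255);
    assert(reverseByteArithmetic(128) == 1);
    assert(reverseByteArithmetic(20) == 40);
}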

Faster algorithm to check the colors in a image

Supposing I am given an image of 2048x2048 and I want to know the total number of colors present in the image, what is the fastest possible algorithm? I came up with two algorithms, but they are slow.
Algorithm 1:
Compare the current pixel and the next pixel; if they are different,
check a temporary variable, which contains all the detected colors, to see if the color is present or not.
If not present, add it to the array (list) and increment noOfColors.
This algorithm works but is slow. For a 1600x1200 pixel image it takes around 3 seconds.
Algorithm 2:
The obvious method of checking each pixel against all other pixels, recording the number of occurrences of each color and incrementing the count. This is very, very slow, almost like a hung app. So is there any better approach? I need all the pixel info.
You could use std::set (or std::unordered_set), and simply do a single loop through the pixels, adding the colors to the set. Then the number of colors is the size of the set.
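A minimal sketch of that approach, assuming the image is available as a flat array of 32-bit packed RGB pixel values (adjust the packing to your pixel format):

#include <unordered_set>
#include <cstdint>
#include <cstddef>

size_t countColors(const uint32_t* pixels, size_t pixelCount)
{
    std::unordered_set<uint32_t> colors;
    for (size_t i = 0; i < pixelCount; ++i)
        colors.insert(pixels[i]);   // duplicates are ignored by the set
    return colors.size();           // number of distinct colors
}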
Well, this is well suited for parallelization. Split the image into several parts and execute the algorithm for each part in a separate task. To avoid syncing, each task should have its own storage for the unique colors. When all tasks are done, you aggregate the results.
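A hedged sketch of that split/merge idea with std::thread, using a private std::unordered_set per task and a final merge (the flat packed-pixel layout is the same assumption as above):

#include <thread>
#include <vector>
#include <unordered_set>
#include <cstdint>
#include <cstddef>

size_t countColorsParallel(const uint32_t* pixels, size_t pixelCount, unsigned numTasks)
{
    if (numTasks == 0) numTasks = 1;
    std::vector<std::unordered_set<uint32_t>> partial(numTasks);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numTasks; ++t)
    {
        workers.emplace_back([&, t] {
            size_t begin = pixelCount * t / numTasks;
            size_t end   = pixelCount * (t + 1) / numTasks;
            for (size_t i = begin; i < end; ++i)
                partial[t].insert(pixels[i]);   // no syncing: each task has its own storage
        });
    }
    for (auto& w : workers) w.join();

    std::unordered_set<uint32_t> merged;   // aggregate the per-task results
    for (const auto& s : partial)
        merged.insert(s.begin(), s.end());
    return merged.size();
}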
DRAM is dirt cheap. Use brute force: fill a table, count.
On a Core 2 Duo @ 3.0 GHz:
0.35 secs for 4096x4096 32-bit RGB
0.20 secs after some trivial parallelization (I know nothing of OpenMP)
However, if you are to use 64-bit RGB (16 bits per channel) it is another question (not enough memory); you would probably need a good hash table function.
Using random pixels, the same size takes 10 secs.
Remark: at 0.15 secs, the std::bitset<> solution is faster (it gets slower when trivially parallelized!).
Solution, C++11:
#include <vector>
#include <random>
#include <iostream>
#include <boost/chrono.hpp>
#define _16M (256*256*256)
typedef union {
struct { unsigned char r,g,b,n ; } r_g_b_n ;
unsigned char rgb[4] ;
unsigned i_rgb;
} RGB ;
RGB make_RGB(unsigned char r, unsigned char g , unsigned char b) {
RGB res;
res.r_g_b_n.r = r;
res.r_g_b_n.g = g;
res.r_g_b_n.b = b;
res.r_g_b_n.n = 0;
return res;
}
static_assert(sizeof(RGB)==4,"bad RGB size not 4");
static_assert(sizeof(unsigned)==4,"bad i_RGB size not 4");
struct Image
{
Image (unsigned M, unsigned N) : M_(M) , N_(N) , v_(M*N) {}
const RGB* tab() const {return & v_[0] ; }
RGB* tab() {return & v_[0] ; }
unsigned M_ , N_;
std::vector<RGB> v_;
};
void FillRandom(Image & im) {
std::uniform_int_distribution<unsigned> rnd(0,_16M-1);
std::mt19937 rng;
const int N = im.M_ * im.N_;
RGB* tab = im.tab();
for (int i=0; i<N; i++) {
unsigned r = rnd(rng) ;
*tab++ = make_RGB( (r & 0xFF) , (r>>8 & 0xFF), (r>>16 & 0xFF) ) ;
}
}
size_t Count(const Image & im) {
    const int N = im.M_ * im.N_;
    std::vector<char> count(_16M,0);
    const RGB* tab = im.tab();
    // Mark every colour that occurs; indexing by i keeps the loop safe to parallelize.
    #pragma omp parallel for
    for (int i=0; i<N; i++) {
        count[ tab[i].i_rgb ] = 1 ;
    }
    // Sum the marks; the reduction clause avoids a race on nColors.
    size_t nColors = 0 ;
    #pragma omp parallel for reduction(+:nColors)
    for (int i = 0 ; i<_16M; i++) nColors += count[i];
    return nColors;
}
int main() {
Image im(4096,4096);
FillRandom(im);
typedef boost::chrono::high_resolution_clock hrc;
auto start = hrc::now();
std::cout << " # colors " << Count(im) << std::endl ;
boost::chrono::duration<double> sec = hrc::now() - start;
std::cout << " took " << sec.count() << " seconds\n";
return 0;
}
The only feasible algorithm here is building a sort of a histogram of the image colors. The only difference in your case is that instead of calculating the population of each color you need just to know if it's zero or not.
Depending on which color space you work, you may use either an std::set to tag existing colors (as Joachim Pileborg suggested), or just use something like std::bitset, which is obviously faster. This depends on how much distinct colors exist in your color-space.
Also, like Marius Bancila noted, this procedure is a perfect match for parallelization: calculate the histogram-like data for image parts, then merge it. Naturally the image division should be based on its memory partition, not its geometric properties. In simple words: split the image vertically (by batches of scan lines), not horizontally.
And, if possible, you should either use some low-level library/code to run through the pixels, or try to write your own. At the very least you should obtain a pointer to each scan line and run over its pixels in a batch, rather than doing something like GetPixel for each pixel.
The point here is that the ideal representation of an image as a 2D array of colors is not the way the image is actually stored in memory (color components can be arranged in "planes", there can be "padding", etc.), so getting the pixels using a GetPixel-like function may take time.
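As an illustration of the scan-line idea, here is a sketch under an assumed layout (a raw byte pointer plus a per-row stride, three bytes per pixel); the point is simply to touch pixels through a row pointer instead of a per-pixel accessor call:

#include <unordered_set>
#include <cstdint>
#include <cstddef>

size_t countColorsByScanline(const uint8_t* data, size_t width, size_t height, size_t stride)
{
    std::unordered_set<uint32_t> colors;
    for (size_t y = 0; y < height; ++y)
    {
        const uint8_t* row = data + y * stride;   // one row pointer per scan line
        for (size_t x = 0; x < width; ++x)
        {
            const uint8_t* p = row + x * 3;
            colors.insert(uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16));
        }
    }
    return colors.size();
}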
The question, then, may even be somewhat meaningless if the image is not the result of a "vectorial draw": think of a photograph: between two nearby "greens" you find all the shades of green, so the colors, in this case, are no more and no less than the ones supported by the encoding of the image itself (2^24, or 256, or 16 or ...). So, unless you are interested in the color distribution (how differently used they are), just counting them makes very little sense.
A workaround can be:
1. Create an in-memory bitmap having pixels in a "single plane" format.
2. Blit your image into that bitmap using BitBlt or similar (this lets the OS do the pixel conversion, on the GPU if any).
3. Get the bitmap bits (this lets you access the stored values).
4. Run your "counting algorithm" (whatever it is) on those values.
Note that steps 1 and 2 can be avoided if you already know the image is in a planar format.
If you have a multicore system, step 4 can also be split across different threads, each working on part of the image.
You can use std::bitset, which allows you to set individual bits and has a count function.
You have a bit for each colour; there are 256 values for each of R, G and B, so that's 256*256*256 bits (16,777,216 colours). The bitset packs 8 bits per byte, so it will use 2 MB.
Use the pixel colour as an index into the bitset:
bitset<256*256*256> colours;
for(int pixel: pixels) {
colours[pixel] = true;
}
colours.count();
This has linear complexity.
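Since a 2 MB object is large for a local variable on some platforms' default stacks, here is a hedged variant of the same idea with the bitset allocated on the heap (the packed 0x00RRGGBB pixel layout is an assumption):

#include <bitset>
#include <memory>
#include <vector>
#include <cstdint>
#include <cstddef>

size_t countColorsBitset(const std::vector<uint32_t>& pixels)
{
    auto colours = std::make_unique<std::bitset<256 * 256 * 256>>();   // ~2 MB, heap allocated
    for (uint32_t pixel : pixels)
        (*colours)[pixel & 0x00FFFFFF] = true;   // mask off any alpha/padding byte
    return colours->count();                     // number of distinct colours
}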
Latecomer to this answer, but I could not help it since this algorithm is brutally fast; it was developed two or more decades ago, when it really mattered.
3-D Lookup Table Color Matching
http://www.ddj.com/cpp/184403257
Basically, it creates a 3D color lookup table and the search is very fast. I've done some modifications to suit my purpose of image binarization, reducing the color space from ff ff ff to f f f, and it's even 10 times faster. Right out of the box, I haven't found anything even close, including hash tables.
char * creatematcharray(struct rgb_color *palette, int palettesize)
{
int rval=16, gval=16, bval=16, len, r, g, b;
char *taken, *match, *same;
int i, set, sqstep, tp, maxtp, *entryr, *entryg, *entryb;
char *table;
len=rval*gval*bval;
// Prepare table buffers:
size_t size_of_table = len*sizeof(char);
table=(char *)malloc(size_of_table);
if (table==nullptr) return nullptr;
// Select colors to use for fill:
set=0;
size_t size_of_taken = (palettesize * sizeof(int) * 3) +
(palettesize*sizeof(char)) + (len * sizeof(char));
taken=(char *)malloc(size_of_taken);
same=taken + (len * sizeof(char));
entryr=(int*)(same + (palettesize * sizeof(char)));
entryg=entryr + palettesize;
entryb=entryg + palettesize;
if (taken==nullptr)
{
free((void *)table);
return nullptr;
}
std::memset((void *)taken, 0, len * sizeof(char));
// std::cout << "sizes: " << size_of_table << " " << size_of_taken << std::endl;
match=table;
for (i=0; i<palettesize; i++)
{
same[i]=0;
// Compute 3d-table coordinates of palette rgb color:
r=palette[i].r&0x0f, g=palette[i].g&0x0f, b=palette[i].b&0x0f;
// Put color in position:
if (taken[b*rval*gval+g*rval+r]==0) set++;
else same[match[b*rval*gval+g*rval+r]]=1;
match[b*rval*gval+g*rval+r]=i;
taken[b*rval*gval+g*rval+r]=1;
entryr[i]=r; entryg[i]=g; entryb[i]=b;
}
// ### Fill match_array by steps: ###
for (set=len-set, sqstep=1; set>0; sqstep++)
{
for (i=0; i<palettesize && set>0; i++)
if (same[i]==0)
{
// Fill all six sides of incremented cube (by pairs, 3 loops):
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b+=sqstep*2)
if (b>=0 && b<bval)
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r++)
if (r>=0 && r<rval)
{ // Draw one 3d line:
tp=b*rval*gval+(entryg[i]-sqstep)*rval+r;
maxtp=b*rval*gval+(entryg[i]+sqstep)*rval+r;
if (tp<b*rval*gval+0*rval+r)
tp=b*rval*gval+0*rval+r;
if (maxtp>b*rval*gval+(gval-1)*rval+r)
maxtp=b*rval*gval+(gval-1)*rval+r;
for (; tp<=maxtp; tp+=rval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
}
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g+=sqstep*2)
if (g>=0 && g<gval)
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b++)
if (b>=0 && b<bval)
{ // Draw one 3d line:
tp=b*rval*gval+g*rval+(entryr[i]-sqstep);
maxtp=b*rval*gval+g*rval+(entryr[i]+sqstep);
if (tp<b*rval*gval+g*rval+0)
tp=b*rval*gval+g*rval+0;
if (maxtp>b*rval*gval+g*rval+(rval-1))
maxtp=b*rval*gval+g*rval+(rval-1);
for (; tp<=maxtp; tp++)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
}
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r+=sqstep*2)
if (r>=0 && r<rval)
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g++)
if (g>=0 && g<gval)
{ // Draw one 3d line:
tp=(entryb[i]-sqstep)*rval*gval+g*rval+r;
maxtp=(entryb[i]+sqstep)*rval*gval+g*rval+r;
if (tp<0*rval*gval+g*rval+r)
tp=0*rval*gval+g*rval+r;
if (maxtp>(bval-1)*rval*gval+g*rval+r)
maxtp=(bval-1)*rval*gval+g*rval+r;
for (; tp<=maxtp; tp+=rval*gval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
}
}
}
free((void *)taken);
return table;
}
The answer: unordered_map
I use unordered_map, based on my testing.
You should test, because your compiler / library may exhibit different performance. Comment out #define USEHASH to use map instead.
On my machine, the vanilla unordered_map (a hash implementation) is about twice as fast as map. Inasmuch as different compilers and libraries can vary enormously, you must test to see which is better. In production, I build a fake image on the first start of the app, run both algorithms on it and time them, save an indication of which one is faster, and then preferentially use that for all subsequent starts on that machine. It's nit-picky, but hey, the user's time is valuable to them.
For a DSLR image with 12,106,244 pixels (about 12 megapixels, not a typo) and 11,857,131 distinct colors (also not a typo), map takes about 14 seconds, while unordered map takes about 7 seconds:
Test Code:
#define USEHASH 1
#ifdef USEHASH
#include <unordered_map>
#endif
size = im->xw * im->yw;
#ifdef USEHASH
// unordered_map is about twice as fast as map on my mac with qt5
// --------------------------------------------------------------
std::unordered_map<qint64, unsigned char> colors;
colors.reserve(size); // pre-allocate the hash space
#else
std::map<qint64, unsigned char> colors;
#endif
...use of either is in a loop where I build a 48-bit value of 0RGB in a 64-bit variable corresponding to the 16-bit RGB values of the image pixels, like so:
for (i=0; i<size; i++)
{
pel = BUILDPEL(i); // macro just shovels 0RGB into 64 bit pel from im
// You'd do the same for your image structure
// in whatever way is fastest for you
colors[pel] = 1;
}
cc = colors.size();
// time here: 14 secs for map, 7 secs for unordered_map with
// 12,106,244 pixels containing 11,857,131 colors on 12/24 core,
// 3 GHz, 64GB machine.

exchanging 2 memory positions

I am working with OpenCV and Qt. OpenCV uses BGR while Qt uses RGB, so I have to swap those two bytes for very big images.
Is there a better way of doing the following?
I cannot think of anything faster, but it looks so simple and lame...
int width = iplImage->width;
int height = iplImage->height;
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
int limit = height * width;
for (int y = 0; y < limit; ++y) {
buf = iplImagePtr[2];
iplImagePtr[2] = iplImagePtr[0];
iplImagePtr[0] = buf;
iplImagePtr += 3;
}
QImage img((uchar *) iplImage->imageData, width, height,
QImage::Format_RGB888);
We are currently dealing with this issue in a Qt application. We've found the Intel Integrated Performance Primitives (IPP) to be the fastest way to do this; they have extremely optimized code. In the HTML help files, the Intel ippiSwapChannels documentation has an example of exactly what you are looking for.
There are a couple of downsides:
One is the size of the library, but you can statically link just the library routines you need.
The other is running on AMD CPUs: Intel libs run VERY slowly by default on AMD. Check out www.agner.org/optimize/asmlib.zip for details on how to work around that.
I think this looks absolutely fine. That the code is simple is not something negative. If you want to make it shorter you could use std::swap:
std::swap(iplImagePtr[0], iplImagePtr[2]);
You could also do the following:
uchar* end = iplImagePtr + height * width * 3;
for ( ; iplImagePtr != end; iplImagePtr += 3) {
std::swap(iplImagePtr[0], iplImagePtr[2]);
}
There's cvConvertImage to do the whole thing in one line, but I doubt it's any faster either.
Couldn't you use one of the following methods ?
void QImage::invertPixels ( InvertMode mode = InvertRgb )
or
QImage QImage::rgbSwapped () const
Hope this helps a bit !
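For example, if a copy is acceptable rather than an in-place swap, you could build the QImage exactly as in the question and then swap:

QImage img((uchar *) iplImage->imageData, width, height, QImage::Format_RGB888);
QImage rgb = img.rgbSwapped();   // returns a copy with the red and blue channels exchanged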
I would be inclined to do something like the following, working on the basis of the RGB data being in three-byte blocks.
int i = 0;
int limit = width * height * 3;   // three bytes per pixel
while (i < limit)
{
    buf = iplImagePtr[i];                 // blue colour byte
    iplImagePtr[i] = iplImagePtr[i + 2];  // move the red colour byte into the blue slot
    iplImagePtr[i + 2] = buf;             // put the saved blue colour byte into what was the red slot
    i += 3;
}
I doubt it is any 'faster', but at the end of the day you just have to go through the entire image, pixel by pixel.
You could always do this:
int width = iplImage->width;
int height = iplImage->height;
uchar *start = (uchar *) iplImage->imageData;
uchar *end = start + width * height * 3;   // three bytes per pixel
for (uchar *p = start ; p < end ; p += 3)
{
uchar buf = *p;
*p = *(p+2);
*(p+2) = buf;
}
but a decent compiler would do this anyway.
Your biggest overhead in these sorts of operations is going to be memory bandwidth.
If you're using Windows then you can probably do this conversion using the BitBlt and two appropriately set up DIBs. If you're really lucky then this could be done in the graphics hardware.
I hate to ruin anyone's day, but if you don't want to go the IPP route (see photo_tom's answer) or pull in an optimized library, you might get better performance from the following (modifying Andreas' answer):
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
std::swap(iplImagePtr[y * 3], iplImagePtr[y * 3 + 2]);
}
Now hold on, folks, I hear you yelling "but all those extra multiplies and adds!" The thing is, this form of the loop is far easier for a compiler to optimize, especially if it gets smart enough to multithread this sort of algorithm, because each pass through the loop is independent of those before or after. In the other form, the value of iplImagePtr was dependent on the value in the previous pass. In this form, it is constant throughout the whole loop; only y changes, and that is in a very, very common "count from 0 to N-1" loop construct, so it's easier for an optimizer to digest.
Or maybe it doesn't make a difference these days because optimizers are insanely smart (are they?). I wonder what a benchmark would say...
P.S. If you actually benchmark this, I'd also like to see how well the following performs:
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
    uchar *pixel = iplImagePtr + y * 3;
    std::swap(pixel[0], pixel[2]);
}
Again, pixel is defined in the loop to limit its scope and keep the optimizer from thinking there's a cycle-to-cycle dependency. If the compiler increments and decrements the stack pointer each time through the loop to "create" and "destroy" pixel, well, it's stupid and I'll apologize for wasting your time.
cvCvtColor(iplImage, iplImage, CV_BGR2RGB);