Faster alpha blending using a lookup table? - c++

I have made a lookup table that blends two single-byte channels (256 levels per channel) by a single-byte alpha channel, using no floating point values (hence no float-to-int conversions). Each entry in the table is a channel value pre-scaled by an alpha fraction, in 256ths.
In all, a full 3-channel RGB blend requires two lookups into the array per channel, plus an addition: a total of 6 lookups and 3 additions. In the example below, I split the colors into separate values for ease of demonstration. It shows how to blend the three channels R, G, and B by an alpha value ranging from 0 to 255.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
BYTE rem = 255 - av; // Remaining fraction
rDest = _lookup[r1][rem] + _lookup[r2][av];
gDest = _lookup[g1][rem] + _lookup[g2][av];
bDest = _lookup[b1][rem] + _lookup[b2][av];
It works great. Precise as you can get using 256 color channels. In fact, you would get the same exact values using the actual floating point calculations. The lookup table was calculated using doubles to begin with. The lookup table is too big to fit in this post (65536 bytes). (If you would like a copy of it, email me at ten.turtle.toes#gmail.com, but don't expect a reply until tomorrow because I am going to sleep now.)
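For reference (a sketch, not from the original post): the 65536-byte table doesn't need to be mailed around; it can be generated at startup in a few lines, assuming the `_lookup[channel][alpha]` indexing used in the snippet above.

```cpp
#include <cstdint>

typedef uint8_t BYTE;

// Sketch: build the 65536-byte blend table at startup (assuming the
// _lookup[channel][alpha] indexing used in the question).
// _lookup[c][a] = round(c * a / 255.0), so that
// _lookup[c1][255 - av] + _lookup[c2][av] blends c1 toward c2 by av.
static BYTE _lookup[256][256];

static void InitLookup()
{
    for (int c = 0; c < 256; ++c)
        for (int a = 0; a < 256; ++a)
            _lookup[c][a] = (BYTE)(c * a / 255.0 + 0.5);
}
```

The sum of the two entries never overflows a BYTE, since the two alpha weights add up to exactly 255.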
So... what do you think? Is it worth it or not?

I would be interested in seeing some benchmarks.
There is an algorithm that can do perfect alpha blending without any floating point calculations or lookup tables. You can find more info in the following document (the algorithm and code are described at the end).
I also did an SSE implementation of this long ago, if you are interested...
void PreOver_SSE2(void* dest, const void* source1, const void* source2, size_t size)
{
    static const size_t STRIDE = sizeof(__m128i) * 4;
    static const uint32_t PSD = 64;

    static const __m128i round  = _mm_set1_epi16(128);
    static const __m128i lomask = _mm_set1_epi32(0x00FF00FF);

    assert(source1 != NULL && source2 != NULL && dest != NULL);
    assert(size % STRIDE == 0);

    const __m128i* source128_1 = reinterpret_cast<const __m128i*>(source1);
    const __m128i* source128_2 = reinterpret_cast<const __m128i*>(source2);
    __m128i* dest128 = reinterpret_cast<__m128i*>(dest);

    __m128i d, s, a, rb, ag, t;

    for (size_t k = 0, length = size / STRIDE; k < length; ++k)
    {
        // TODO: put prefetch between calculations? (R.N)
        _mm_prefetch(reinterpret_cast<const char*>(source128_1 + PSD), _MM_HINT_NTA);
        _mm_prefetch(reinterpret_cast<const char*>(source128_2 + PSD), _MM_HINT_NTA);

        // work on entire cacheline before next prefetch
        for (int n = 0; n < 4; ++n, ++dest128, ++source128_1, ++source128_2)
        {
            // TODO: assembly optimization: use PSHUFD on moves before calculations, lower latency than MOVDQA (R.N)
            // http://software.intel.com/en-us/articles/fast-simd-integer-move-for-the-intel-pentiumr-4-processor/
            // TODO: load entire cacheline at the same time? are there enough registers? 32-bit mode (special compile for 64-bit?) (R.N)
            s = _mm_load_si128(source128_1);   // AABBGGRR
            d = _mm_load_si128(source128_2);   // AABBGGRR

            // PRELERP(S, D) = S + D - ((((S*D[A] + 0x80) >> 8) + (S*D[A] + 0x80)) >> 8)
            // T = S*D[A] + 0x80  =>  PRELERP(S, D) = S + D - (((T >> 8) + T) >> 8)

            // set alpha to lo16 from dest_
            a  = _mm_srli_epi32(d, 24);        // 000000AA
            rb = _mm_slli_epi32(a, 16);        // 00AA0000
            a  = _mm_or_si128(rb, a);          // 00AA00AA

            rb = _mm_and_si128(lomask, s);     // 00BB00RR
            rb = _mm_mullo_epi16(rb, a);       // BBBBRRRR
            rb = _mm_add_epi16(rb, round);     // BBBBRRRR
            t  = _mm_srli_epi16(rb, 8);
            t  = _mm_add_epi16(t, rb);
            rb = _mm_srli_epi16(t, 8);         // 00BB00RR

            ag = _mm_srli_epi16(s, 8);         // 00AA00GG
            ag = _mm_mullo_epi16(ag, a);       // AAAAGGGG
            ag = _mm_add_epi16(ag, round);
            t  = _mm_srli_epi16(ag, 8);
            t  = _mm_add_epi16(t, ag);
            ag = _mm_andnot_si128(lomask, t);  // AA00GG00

            rb = _mm_or_si128(rb, ag);         // AABBGGRR pack

            rb = _mm_sub_epi8(s, rb);          // sub S - (D[A]*S)/255
            d  = _mm_add_epi8(d, rb);          // add D + [S - (D[A]*S)/255]

            _mm_stream_si128(dest128, d);
        }
    }
    _mm_mfence(); // ensure the last WC buffers get flushed to memory
}
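As an aside (my sketch, not from the original answer): the per-channel math the vector loop performs can be sanity-checked against a scalar model; the function name here is mine.

```cpp
#include <cstdint>

// Scalar model of one channel of PRELERP (name is mine):
// result = S + D - round(S * D[A] / 255), where the division by 255 is done
// exactly with T = S*D[A] + 0x80; ((T >> 8) + T) >> 8, as in the SSE2 loop above.
static inline uint8_t prelerp_channel(uint8_t s, uint8_t d, uint8_t dest_alpha)
{
    uint32_t t = s * dest_alpha + 0x80;     // T = S*D[A] + 0x80
    uint32_t scaled = ((t >> 8) + t) >> 8;  // round(S*D[A] / 255), exactly
    return (uint8_t)(s + d - scaled);       // S + D - round(S*D[A]/255)
}
```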

Today's processors can do a lot of calculation in the time it takes to get one value from memory, especially if it's not in cache. That makes it especially important to benchmark possible solutions, because you can't easily reason what the result will be.
I don't know why you're concerned about floating point conversions; this can all be done with integers.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
BYTE rem = 255 - av; // Remaining fraction
rDest = (r1*rem + r2*av) / 255;
gDest = (g1*rem + g2*av) / 255;
bDest = (b1*rem + b2*av) / 255;
If you want to get really clever, you can replace the divide with a multiply followed by a right-shift.
Edit: Here's the version using a right-shift. Adding a constant might reduce any truncation errors, but I'll leave that as an exercise for the reader.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
int aMult = 0x10000 * av / 255;
int rem = 0x10000 - aMult; // Remaining fraction
rDest = (r1*rem + r2*aMult) >> 16;
gDest = (g1*rem + g2*aMult) >> 16;
bDest = (b1*rem + b2*aMult) >> 16;
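For what it's worth (my addition, not part of the answer above): the divide by 255 can also be done exactly, with rounding, using one add and two shifts; this is the same identity the PRELERP answer relies on. Helper names here are mine.

```cpp
#include <cstdint>

// Exact round(x / 255) without a division, valid over the blend's whole
// input range 0 <= x <= 255*255 (hypothetical helper names, for illustration).
static inline uint32_t div255_round(uint32_t x)
{
    uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}

// One blended channel: round((c1*(255 - av) + c2*av) / 255).
static inline uint8_t blend_channel(uint8_t c1, uint8_t c2, uint8_t av)
{
    return (uint8_t)div255_round(c1 * (255u - av) + c2 * av);
}
```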

Mixing colors is computationally trivial. I would be surprised if this yielded a significant benefit; as always, you must first prove that this calculation is actually a performance bottleneck. And I would suggest that a better solution is to use hardware-accelerated blending.

Related

Casting AVX512 mask types

I'm trying to figure out how to use masked loads and stores for the last few elements to be processed. My use case involves converting a packed 10 bit data stream to 16 bit which means loading 5 bytes before storing 4 shorts. This results in different masks of different types.
The main loop itself is not a problem. But at the end I'm left with up to 19 bytes input / 15 shorts output which I thought I could process in up to two loop iterations using the 128 bit vectors. Here is the outline of the code.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void convert(uint16_t* out, ptrdiff_t n, const uint8_t* in)
{
    uint16_t* const out_end = out + n;
    for(uint16_t* out32_end = out + (n & -32); out < out32_end; in += 40, out += 32) {
        /*
         * insert main loop here using ZMM vectors
         */
    }
    if(out_end - out >= 16) {
        /*
         * insert half-sized iteration here using YMM vectors
         */
        in += 20;
        out += 16;
    }
    // up to 19 bytes of input remaining, up to 15 shorts of output
    const unsigned out_remain = out_end - out;
    const unsigned in_remain = (out_remain * 10 + 7) / 8;
    unsigned in_mask = (1 << in_remain) - 1;
    unsigned out_mask = (1 << out_remain) - 1;
    while(out_mask) {
        __mmask16 load_mask = _cvtu32_mask16(in_mask);
        __m128i packed = _mm_maskz_loadu_epi8(load_mask, in);
        /* insert computation here. No masks required */
        __mmask8 store_mask = _cvtu32_mask8(out_mask);
        _mm_mask_storeu_epi16(out, store_mask, packed);
        in += 10;
        out += 8;
        in_mask >>= 10;
        out_mask >>= 8;
    }
}
(Compile with -O3 -mavx2 -mavx512f -mavx512bw -mavx512vl -mavx512dq)
My idea was to create a bit mask from the number of remaining elements (since I know it fits comfortably in an integer / mask register), then shift values out of the mask as they are processed.
I have two issues with this approach:
I'm re-setting the masks from GP registers each iteration instead of using the kshift family of instructions
_cvtu32_mask8 (kmovb) is the only instruction in this code that requires AVX512DQ. Limiting the number of suitable hardware platforms just for that seems weird
What I'm wondering about:
Can I cast __mmask32 to __mmask16 and __mmask8?
If I can, I could set it once from the GP register, then shift it in its own register. Like this:
__mmask32 load_mask = _cvtu32_mask32(in_mask);
__mmask32 store_mask = _cvtu32_mask32(out_mask);
while(out < out_end) {
    __m128i packed = _mm_maskz_loadu_epi8((__mmask16) load_mask, in);
    /* insert computation here. No masks required */
    _mm_mask_storeu_epi16(out, (__mmask8) store_mask, packed);
    load_mask = _kshiftri_mask32(load_mask, 10);
    store_mask = _kshiftri_mask32(store_mask, 8);
    in += 10;
    out += 8;
}
GCC seems to be fine with this pattern, but Clang and MSVC generate worse code, moving the mask in and out of GP registers for no apparent reason.
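As an aside (not from the original post): the tail logic itself can be unit-tested without AVX512 hardware with a scalar model of the mask-consumption pattern; names here are mine.

```cpp
// Scalar model of the tail-mask pattern (no AVX512 needed, names are mine):
// one bit per remaining element; each iteration consumes 10 input bits and
// 8 output bits by shifting, exactly like the _kshiftri_mask32 version.
static unsigned tail_iterations(unsigned out_remain /* 0..15 */)
{
    unsigned in_remain = (out_remain * 10 + 7) / 8;
    unsigned in_mask = (1u << in_remain) - 1;
    unsigned out_mask = (1u << out_remain) - 1;
    unsigned iters = 0;
    while (out_mask) {
        // a real body would use the low bits of in_mask as the load mask
        // and the low 8 bits of out_mask as the store mask
        in_mask >>= 10;
        out_mask >>= 8;
        ++iters;
    }
    return iters;
}
```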

Efficient C++ code (no libs) for image transformation into custom RGB pixel greyscale

I'm currently working on a C++ implementation of a ToGreyscale method, and I want to ask: what is the most efficient way to transform the "unsigned char* source" using custom RGB input params?
Below is a current idea, but maybe using a Vector would be better?
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
    for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
        float r = pixel[0];
        float g = pixel[1];
        float b = pixel[2];
        // Do something with r, g, b
    }
}
The most efficient single-threaded CPU implementation uses manually optimized SIMD code.
SIMD extensions are specific to the processor architecture:
for x86 there are the SSE and AVX extensions, NEON for ARM, AltiVec for PowerPC...
In many cases the compiler can generate very efficient code that utilizes a SIMD extension without any involvement from the programmer (just by setting compiler flags).
There are also many cases where the compiler can't generate efficient code (for many reasons).
When you need very high performance, it's recommended to implement the routine using C intrinsic functions.
Most intrinsic functions map directly to assembly instructions (one to one), without the need to write assembly.
There are many downsides to using intrinsics (compared to a generic C implementation): the code is complicated to write and maintain, and it is platform specific and not portable.
A good reference for x86 intrinsics is the Intel Intrinsics Guide.
The posted code uses the SSE instruction set extension.
The implementation is very efficient, but not top performance (using AVX2, for example, may be faster but less portable).
For better efficiency my code uses a fixed-point implementation.
In many cases fixed point is more efficient than floating point (but more difficult).
The most complicated part of this particular algorithm is reordering the RGB elements:
when RGB elements are ordered in triples r,g,b,r,g,b,r,g,b... you need to reorder them to rrrr... gggg... bbbb... in order to utilize SIMD.
Naming conventions:
Don't be scared by the long, weird variable names.
I use this naming convention (it's my own) because it helps me follow the code.
r7_r6_r5_r4_r3_r2_r1_r0 for example marks an XMM register with 8 uint16 elements.
The following implementation includes code with and without SSE intrinsics:
//Optimized implementation (use SSE intrinsics):
//----------------------------------------------
#include <intrin.h>
//Convert from RGBRGBRGB... to RRR..., GGG..., BBB...
//Input: Two XMM registers (24 uint8 elements) ordered RGBRGB...
//Output: Three XMM registers ordered RRR..., GGG... and BBB...
// Unpack the result from uint8 elements to uint16 elements.
static __inline void GatherRGBx8(const __m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
const __m128i b7_g7_r7_b6_g6_r6_b5_g5,
__m128i &r7_r6_r5_r4_r3_r2_r1_r0,
__m128i &g7_g6_g5_g4_g3_g2_g1_g0,
__m128i &b7_b6_b5_b4_b3_b2_b1_b0)
{
//Shuffle mask for gathering 4 R elements, 4 G elements and 4 B elements (also set last 4 elements to duplication of first 4 elements).
const __m128i shuffle_mask = _mm_set_epi8(9,6,3,0, 11,8,5,2, 10,7,4,1, 9,6,3,0);
__m128i b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4 = _mm_alignr_epi8(b7_g7_r7_b6_g6_r6_b5_g5, r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, 12);
//Gather 4 R elements, 4 G elements and 4 B elements.
//Remark: As I recall _mm_shuffle_epi8 instruction is not so efficient (I think execution is about 5 times longer than other shuffle instructions).
__m128i r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0 = _mm_shuffle_epi8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, shuffle_mask);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4 = _mm_shuffle_epi8(b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4, shuffle_mask);
//Put 8 R elements in lower part.
__m128i b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0 = _mm_alignr_epi8(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 12);
//Put 8 G elements in lower part.
__m128i g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 4);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0 = _mm_alignr_epi8(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4, g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz, 12);
//Put 8 B elements in lower part.
__m128i b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 4);
__m128i zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0 = _mm_alignr_epi8(zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4, b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz, 12);
//Unpack uint8 elements to uint16 elements.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_cvtepu8_epi16(b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_cvtepu8_epi16(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_cvtepu8_epi16(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0);
}
//Calculate 8 Grayscale elements from 8 RGB elements.
//Y = 0.2989*R + 0.5870*G + 0.1140*B
//Conversion model used by MATLAB https://www.mathworks.com/help/matlab/ref/rgb2gray.html
static __inline __m128i Rgb2Yx8(__m128i r7_r6_r5_r4_r3_r2_r1_r0,
__m128i g7_g6_g5_g4_g3_g2_g1_g0,
__m128i b7_b6_b5_b4_b3_b2_b1_b0)
{
//Each coefficient is expanded by 2^15, and rounded to int16 (add 0.5 for rounding).
const __m128i r_coef = _mm_set1_epi16((short)(0.2989*32768.0 + 0.5)); //8 coefficients - R scale factor.
const __m128i g_coef = _mm_set1_epi16((short)(0.5870*32768.0 + 0.5)); //8 coefficients - G scale factor.
const __m128i b_coef = _mm_set1_epi16((short)(0.1140*32768.0 + 0.5)); //8 coefficients - B scale factor.
//Multiply input elements by 64 for improved accuracy.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_slli_epi16(r7_r6_r5_r4_r3_r2_r1_r0, 6);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_slli_epi16(g7_g6_g5_g4_g3_g2_g1_g0, 6);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_slli_epi16(b7_b6_b5_b4_b3_b2_b1_b0, 6);
//Use the special intrinsic _mm_mulhrs_epi16 that calculates round(r*r_coef/2^15).
//Calculate Y = 0.2989*R + 0.5870*G + 0.1140*B (use fixed point computations)
__m128i y7_y6_y5_y4_y3_y2_y1_y0 = _mm_add_epi16(_mm_add_epi16(
_mm_mulhrs_epi16(r7_r6_r5_r4_r3_r2_r1_r0, r_coef),
_mm_mulhrs_epi16(g7_g6_g5_g4_g3_g2_g1_g0, g_coef)),
_mm_mulhrs_epi16(b7_b6_b5_b4_b3_b2_b1_b0, b_coef));
//Divide result by 64.
y7_y6_y5_y4_y3_y2_y1_y0 = _mm_srli_epi16(y7_y6_y5_y4_y3_y2_y1_y0, 6);
return y7_y6_y5_y4_y3_y2_y1_y0;
}
//Convert single row from RGB to Grayscale (use SSE intrinsics).
//I0 points to the source row, and J0 points to the destination row.
//I0 -> rgbrgbrgbrgbrgbrgb...
//J0 -> yyyyyy
static void Rgb2GraySingleRow_useSSE(const unsigned char I0[],
const int image_width,
unsigned char J0[])
{
int x; //Index in J0.
int srcx; //Index in I0.
__m128i r7_r6_r5_r4_r3_r2_r1_r0;
__m128i g7_g6_g5_g4_g3_g2_g1_g0;
__m128i b7_b6_b5_b4_b3_b2_b1_b0;
srcx = 0;
//Process 8 pixels per iteration.
for (x = 0; x < image_width; x += 8)
{
//Load 8 elements of each color channel R,G,B from first row.
__m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0 = _mm_loadu_si128((__m128i*)&I0[srcx]); //Unaligned load of 16 uint8 elements
__m128i b7_g7_r7_b6_g6_r6_b5_g5 = _mm_loadu_si128((__m128i*)&I0[srcx+16]); //Unaligned load of (only) 8 uint8 elements (lower half of XMM register).
//Separate RGB, and put together R elements, G elements and B elements (together in same XMM register).
//Result is also unpacked from uint8 to uint16 elements.
GatherRGBx8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
b7_g7_r7_b6_g6_r6_b5_g5,
r7_r6_r5_r4_r3_r2_r1_r0,
g7_g6_g5_g4_g3_g2_g1_g0,
b7_b6_b5_b4_b3_b2_b1_b0);
//Calculate 8 Y elements.
__m128i y7_y6_y5_y4_y3_y2_y1_y0 = Rgb2Yx8(r7_r6_r5_r4_r3_r2_r1_r0,
g7_g6_g5_g4_g3_g2_g1_g0,
b7_b6_b5_b4_b3_b2_b1_b0);
//Pack uint16 elements to 16 uint8 elements (put result in single XMM register). Only lower 8 uint8 elements are relevant.
__m128i j7_j6_j5_j4_j3_j2_j1_j0 = _mm_packus_epi16(y7_y6_y5_y4_y3_y2_y1_y0, y7_y6_y5_y4_y3_y2_y1_y0);
//Store 8 elements of Y in row Y0, and 8 elements of Y in row Y1.
_mm_storel_epi64((__m128i*)&J0[x], j7_j6_j5_j4_j3_j2_j1_j0);
srcx += 24; //Advance 24 source bytes per iteration.
}
}
//Convert image I from pixel ordered RGB to Grayscale format.
//Conversion formula: Y = 0.2989*R + 0.5870*G + 0.1140*B (Rec.ITU-R BT.601)
//Formula is based on MATLAB rgb2gray function: https://www.mathworks.com/help/matlab/ref/rgb2gray.html
//Implementation uses SSE intrinsics for performance optimization.
//Use fixed point computations for better performance.
//I - Input image in pixel ordered RGB format.
//image_width - Number of columns of I.
//image_height - Number of rows of I.
//J - Destination "image" in Grayscale format.
//I is pixel ordered RGB color format (size in bytes is image_width*image_height*3):
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//
//J is in Grayscale format (size in bytes is image_width*image_height):
//YYYYYY
//YYYYYY
//YYYYYY
//YYYYYY
//
//Limitations:
//1. image_width must be a multiple of 8.
//2. I and J must be two separate arrays (in place computation is not supported).
//3. Rows of I and J are contiguous in memory (byte stride is not supported [but would be simple to add]).
//
//Comments:
//1. The conversion formula is not exact, but it's a commonly used approximation.
//2. Code uses SSE 4.1 instruction set.
// Better performance can be achieved with an AVX2 implementation.
// (AVX2 is supported by Intel Core 4th generation and above, and newer AMD processors).
//3. The code is not the best possible SSE optimization:
// It uses unaligned load and store operations.
// It utilizes only half of an XMM register in a few cases.
// Instruction selection is probably sub-optimal.
void Rgb2Gray_useSSE(const unsigned char I[],
const int image_width,
const int image_height,
unsigned char J[])
{
//I0 points source image row.
const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
//J0 points destination image row.
unsigned char *J0; //J0 -> YYYYYY
int y; //Row index
//Process one row per iteration.
for (y = 0; y < image_height; y ++)
{
I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
//Convert row I0 from RGB to Grayscale.
Rgb2GraySingleRow_useSSE(I0,
image_width,
J0);
}
}
//Convert single row from RGB to Grayscale (simple C code without intrinsics).
static void Rgb2GraySingleRow_Simple(const unsigned char I0[],
const int image_width,
unsigned char J0[])
{
int x; //index in J0.
int srcx; //Index in I0.
srcx = 0;
//Process 1 pixel per iteration.
for (x = 0; x < image_width; x++)
{
float r = (float)I0[srcx]; //Load red pixel and convert to float
float g = (float)I0[srcx+1]; //Green
float b = (float)I0[srcx+2]; //Blue
float gray = 0.2989f*r + 0.5870f*g + 0.1140f*b; //Convert to Grayscale (use BT.601 conversion coefficients).
J0[x] = (unsigned char)(gray + 0.5f); //Add 0.5 for rounding.
srcx += 3; //Advance 3 source bytes per iteration.
}
}
//Convert RGB to Grayscale using simple C code (without SIMD intrinsics).
//Use as reference (for time measurements).
void Rgb2Gray_Simple(const unsigned char I[],
const int image_width,
const int image_height,
unsigned char J[])
{
//I0 points source image row.
const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
//J0 points destination image row.
unsigned char *J0; //J0 -> YYYYYY
int y; //Row index
//Process one row per iteration.
for (y = 0; y < image_height; y ++)
{
I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
//Convert row I0 from RGB to Grayscale.
Rgb2GraySingleRow_Simple(I0,
image_width,
J0);
}
}
On my machine, the manually optimized code is about 3 times faster than the simple version.
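For testing (my sketch, not part of the original answer): the fixed-point arithmetic in Rgb2Yx8 can be mirrored step by step in scalar code; the function name is mine.

```cpp
#include <cstdint>

// Scalar mirror of Rgb2Yx8 (name is mine): inputs scaled by 64, each product
// rounded like _mm_mulhrs_epi16 (round(x*coef/2^15)), summed, then divided by 64.
static inline uint8_t Rgb2Y_Scalar(uint8_t r, uint8_t g, uint8_t b)
{
    const int r_coef = (int)(0.2989 * 32768.0 + 0.5); // 9794
    const int g_coef = (int)(0.5870 * 32768.0 + 0.5); // 19235
    const int b_coef = (int)(0.1140 * 32768.0 + 0.5); // 3736
    int yr = (r * 64 * r_coef + 16384) >> 15;
    int yg = (g * 64 * g_coef + 16384) >> 15;
    int yb = (b * 64 * b_coef + 16384) >> 15;
    return (uint8_t)((yr + yg + yb) >> 6);
}
```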
Alternatively, you can take the mean of the RGB channels and assign the result to each channel (note this is a plain average, not a luminance-weighted conversion):
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
    for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
        float grayscaleValue = 0;
        for (int k = 0; k < 3; k++) {
            grayscaleValue += pixel[k];
        }
        grayscaleValue /= 3;
        for (int k = 0; k < 3; k++) {
            pixel[k] = (uint8_t)grayscaleValue;
        }
    }
}

Efficient way to convert from premultiplied float RGBA to 8-bit RGBA?

I'm looking for a more efficient way to convert from RGBA stored as doubles in a premultiplied colorspace to 8-bit-integer-per-channel non-premultiplied RGBA. It's a significant cost in my image processing.
For one channel, say R, the code looks something like this:
double temp = alpha > 0 ? src_r / alpha : 0;
uint8_t out_r = (uint8_t)min( 255, max( 0, int(temp * 255 + 0.5) ) );
This involves three conditionals which I think prevent the compiler/CPU from optimizing this as well as it could. I think that some chips, in particular x86_64 have specialized double clamping operations, so in theory the above might be doable without conditionals.
Is there some technique, or special functions, which can make this conversion faster?
I'm using GCC and would be happy with a solution in C or C++ or with inline ASM if need be.
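As an aside (a sketch, not from the original question or answers): the scalar version can be written so the clamp compiles branch-free; std::min/std::max on doubles typically compile to minsd/maxsd on x86-64, and the alpha test becomes a compare-and-select rather than a branch. Names here are mine.

```cpp
#include <algorithm>
#include <cstdint>

// Branchless-leaning scalar version (a sketch; names are mine). The ternary
// typically compiles to a compare + select, and the min/max pair to
// minsd/maxsd, so there is no unpredictable branch on the hot path.
static inline uint8_t unpremultiply_channel(double src, double alpha)
{
    double inv = (alpha > 0.0) ? 255.0 / alpha : 0.0;
    double v = src * inv + 0.5;
    v = std::min(255.0, std::max(0.0, v));
    return (uint8_t)v;
}
```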
Here is an outline with some code (untested). This will convert four pixels at once. The main advantage of this method is that it only has to do the division once (not four times); division is slow. But it has to do a transpose (AoS to SoA) to do this. It uses mostly SSE, except for converting the doubles to floats (which needs AVX).
1.) Load 16 doubles
2.) Convert them to floats
3.) Transpose from rgba rgba rgba rgba to rrrr gggg bbbb aaaa
4.) Divide all 4 alphas in one instruction
5.) Round floats to ints
6.) Compress 32-bit to 8-bit with saturation for underflow and overflow
7.) Transpose back to rgba rgba rgba rgba
8.) Write 4 pixels as integers in rgba format
#include <immintrin.h>
double rgba[16];
int out[4];
//load 16 doubles and convert to floats
__m128 tmp1 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[0]));
__m128 tmp2 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[4]));
__m128 tmp3 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[8]));
__m128 tmp4 = _mm256_cvtpd_ps(_mm256_load_pd(&rgba[12]));
//rgba rgba rgba rgba -> rrrr gggg bbbb aaaa
_MM_TRANSPOSE4_PS(tmp1,tmp2,tmp3,tmp4);
//fact = alpha > 0 ? 255.0f/ alpha : 0
__m128 fact = _mm_div_ps(_mm_set1_ps(255.0f),tmp4);
fact = _mm_and_ps(fact, _mm_cmpgt_ps(tmp4, _mm_setzero_ps())); //zero out lanes where alpha == 0, as the comment above intends
tmp1 = _mm_mul_ps(fact,tmp1); //rrrr
tmp2 = _mm_mul_ps(fact,tmp2); //gggg
tmp3 = _mm_mul_ps(fact,tmp3); //bbbb
tmp4 = _mm_mul_ps(_mm_set1_ps(255.0f), tmp4); //aaaa
//round to nearest int
__m128i tmp1i = _mm_cvtps_epi32(tmp1);
__m128i tmp2i = _mm_cvtps_epi32(tmp2);
__m128i tmp3i = _mm_cvtps_epi32(tmp3);
__m128i tmp4i = _mm_cvtps_epi32(tmp4);
//compress from 32bit to 8 bit
__m128i tmp5i = _mm_packs_epi32(tmp1i, tmp2i);
__m128i tmp6i = _mm_packs_epi32(tmp3i, tmp4i);
__m128i tmp7i = _mm_packs_epi16(tmp5i, tmp6i);
//transpose back to rgba rgba rgba rgba
__m128i out16 = _mm_shuffle_epi8(tmp7i,_mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f));
_mm_store_si128((__m128i*)out, out16);
Ok, this is pseudo code, but with SSE how about something like
const c = (1/255, 1/255, 1/255, 1/255)
floats = (r, g, b, a)
alpha = (a, a, a, a)
alpha *= (c, c, c, c)
floats /= alpha
ints = cvt_float_to_int(floats)
ints = min(ints, (255, 255, 255, 255))
Here's an implementation
void convert(const double* floats, byte* bytes, const int width, const int height, const int step) {
    for(int y = 0; y < height; ++y) {
        const double* float_row = floats + y * width * 4; // 4 doubles per pixel
        byte* byte_row = bytes + y * step;
        for(int x = 0; x < width; ++x) {
            __m128d src1 = _mm_load_pd(float_row);
            __m128d src2 = _mm_load_pd(float_row + 2);
            __m128d mul = _mm_set1_pd(255.0 / float_row[3]);
            __m128d norm1 = _mm_min_pd(_mm_set1_pd(255), _mm_mul_pd(src1, mul));
            __m128d norm2 = _mm_min_pd(_mm_set1_pd(255), _mm_mul_pd(src2, mul));
            __m128i dst1 = _mm_shuffle_epi8(_mm_cvtpd_epi32(norm1), _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,4,0));
            __m128i dst2 = _mm_shuffle_epi8(_mm_cvtpd_epi32(norm2), _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,4,0,0x80,0x80));
            _mm_store_ss((float*)byte_row, _mm_castsi128_ps(_mm_or_si128(dst1, dst2)));
            float_row += 4;
            byte_row += 4;
        }
    }
}
Edit: In my original answer I worked with floats instead of doubles (thanks to @Z boson for catching that); the float version is below if anyone's interested. Note for the OP: I don't handle the alpha==0 case, so you'll get NaN with my solution; if you want that handled, go with @Z boson's solution.
Here's the float version:
void convert(const float* floats, byte* bytes, const int width, const int height, const int step) {
    for(int y = 0; y < height; ++y) {
        const float* float_row = floats + y * width * 4; // 4 floats per pixel
        byte* byte_row = bytes + y * step;
        for(int x = 0; x < width; ++x) {
            __m128 src = _mm_load_ps(float_row);
            __m128 mul = _mm_set1_ps(255.0f / float_row[3]);
            __m128i cvt = _mm_cvtps_epi32(_mm_mul_ps(src, mul));
            __m128i res = _mm_min_epi32(cvt, _mm_set1_epi32(255));
            __m128i dst = _mm_shuffle_epi8(res, _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80,12,8,4,0));
            _mm_store_ss((float*)byte_row, _mm_castsi128_ps(dst));
            float_row += 4;
            byte_row += 4;
        }
    }
}
Because of SSE alignment constraints, make sure your input pointers are 16-byte aligned, and use step to make sure each row starts at an aligned address. Many libs take such a step argument, but if you don't need it, you can simplify by using a single loop.
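For completeness (my sketch, not part of the original answer): one portable way to obtain that alignment is C11/C++17 aligned_alloc on GCC/Clang; MSVC needs _aligned_malloc or _mm_malloc instead. The helper name here is mine.

```cpp
#include <cstddef>
#include <cstdlib>

// Allocate a 16-byte-aligned buffer of doubles (sketch; helper name is mine).
// aligned_alloc requires the total size to be a multiple of the alignment,
// so n should be even here (n doubles = n*8 bytes).
static double* alloc_aligned_doubles(size_t n)
{
    return static_cast<double*>(aligned_alloc(16, n * sizeof(double)));
}
```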
I quickly tested with this and get good values:
int main() {
__declspec(align(16)) double src[] = { 10,100,1000,255, 10,100,20,50 };
__declspec(align(16)) byte dst[8];
convert(src, dst, 2, 1, 16); // dst == { 10,100,255,255 }
return 0;
}
I only have Visual Studio right now, so I can't test with gcc's optimizer. I'm getting a 1.8x speedup for doubles and 4.5x for floats; it could be less with gcc -O3, but my code could also be optimized further.
Three things to look into
Do this with OpenGL using a shader.
Use Single Instruction Multiple Data (SIMD) - you might get a bit of parallelization.
Look at using saturated arithmetic operations (SADD and SMULL on arm)

Speeding up some SSE2 Intrinsics for color conversion

I'm trying to perform image colour conversion from YCbCr to BGRA (Don't ask about the A bit, such a headache).
Anyway, this needs to perform as fast as possible, so I've written it using compiler intrinsics to take advantage of SSE2. This is my first venture into SIMD land; I'm basically a beginner, so I'm sure there's plenty I'm doing inefficiently.
My arithmetic code for doing the actual colour conversion turns out to be particularly slow, and Intel's VTune is showing it up as a significant bottleneck.
So, is there any way I can speed up the following code? It's being done in 32-bit, 4 pixels at a time. I originally tried doing it in 8 bits, 16 pixels at a time (as in the upper loop), but the calculations overflow and break the conversion. This whole process, including the Intel JPEG decode, is taking ~14 ms for a single field of full HD. It'd be great if I could get it down to at least 12 ms, ideally 10 ms.
Any help or tips gratefully appreciated. Thanks!
const __m128i s128_8 = _mm_set1_epi8((char)128);
const int nNumPixels = roi.width * roi.height;
for (int i=0; i<nNumPixels; i+=32)
{
    // Go ahead and prefetch our packed UV data.
    // As long as the load remains directly next, this saves us time.
    _mm_prefetch((const char*)&pSrc8u[2][i],_MM_HINT_T0);

    // We need to fetch and blit out our K before we write over it with UV data.
    __m128i sK1 = _mm_load_si128((__m128i*)&pSrc8u[2][i]);
    __m128i sK2 = _mm_load_si128((__m128i*)&pSrc8u[2][i+16]);

    // Using the destination buffer temporarily here so we don't need to waste time doing a memory allocation.
    _mm_store_si128 ((__m128i*)&m_pKBuffer[i], sK1);
    _mm_store_si128 ((__m128i*)&m_pKBuffer[i+16], sK2);

    // In theory, this prefetch needs to be some cycles ahead of the first read. It isn't, yet it does appear to save us time. Worth investigating.
    _mm_prefetch((const char*)&pSrc8u[1][i],_MM_HINT_T0);

    __m128i sUVI1 = _mm_load_si128((__m128i*)&pSrc8u[1][i]);
    __m128i sUVI2 = _mm_load_si128((__m128i*)&pSrc8u[1][i+16]);

    // Subtract the 128 here, ahead of our YCbCr -> BGRA conversion, so we can go 16 pixels at a time rather than 4.
    sUVI1 = _mm_sub_epi8(sUVI1, s128_8);
    sUVI2 = _mm_sub_epi8(sUVI2, s128_8);

    // Swizzle and double up UV data from interleaved 8x1 byte blocks into planar
    __m128i sU1 = _mm_unpacklo_epi8(sUVI1, sUVI1);
    __m128i sV1 = _mm_unpackhi_epi8(sUVI1, sUVI1);
    __m128i sU2 = _mm_unpacklo_epi8(sUVI2, sUVI2);
    __m128i sV2 = _mm_unpackhi_epi8(sUVI2, sUVI2);

    _mm_store_si128((__m128i*)&pSrc8u[1][i], sU1);
    _mm_store_si128((__m128i*)&pSrc8u[1][i+16], sU2);
    _mm_store_si128((__m128i*)&pSrc8u[2][i], sV1);
    _mm_store_si128((__m128i*)&pSrc8u[2][i+16], sV2);
}
const __m128i s16 = _mm_set1_epi32(16);
const __m128i s299 = _mm_set1_epi32(299);
const __m128i s410 = _mm_set1_epi32(410);
const __m128i s518 = _mm_set1_epi32(518);
const __m128i s101 = _mm_set1_epi32(101);
const __m128i s209 = _mm_set1_epi32(209);
Ipp8u* pDstP = pDst8u;
for (int i=0; i<nNumPixels; i+=4, pDstP+=16)
{
    __m128i sK = _mm_set_epi32(m_pKBuffer[i], m_pKBuffer[i+1], m_pKBuffer[i+2], m_pKBuffer[i+3]);
    __m128i sY = _mm_set_epi32(pSrc8u[0][i], pSrc8u[0][i+1], pSrc8u[0][i+2], pSrc8u[0][i+3]);
    __m128i sU = _mm_set_epi32((char)pSrc8u[1][i], (char)pSrc8u[1][i+1], (char)pSrc8u[1][i+2], (char)pSrc8u[1][i+3]);
    __m128i sV = _mm_set_epi32((char)pSrc8u[2][i], (char)pSrc8u[2][i+1], (char)pSrc8u[2][i+2], (char)pSrc8u[2][i+3]);

    // N.b. - Attempted to do the sub 16 in 8 bits, similar to the sub 128 for U and V - however doing it here is quicker,
    // as the time saved on the arithmetic is less than the time taken by the additional loads/stores needed in the swizzle loop
    sY = _mm_mullo_epi32(_mm_sub_epi32(sY, s16), s299);

    __m128i sR = _mm_srli_epi32(_mm_add_epi32(sY,_mm_mullo_epi32(s410, sV)), 8);
    __m128i sG = _mm_srli_epi32(_mm_sub_epi32(_mm_sub_epi32(sY, _mm_mullo_epi32(s101, sU)),_mm_mullo_epi32(s209, sV)), 8);
    __m128i sB = _mm_srli_epi32(_mm_add_epi32(sY, _mm_mullo_epi32(s518, sU)), 8);

    //Microsoft's YUV Conversion
    //__m128i sC = _mm_sub_epi32(sY, s16);
    //__m128i sD = _mm_sub_epi32(sU, s128);
    //__m128i sE = _mm_sub_epi32(sV, s128);
    //
    //__m128i sR = _mm_srli_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(s298, sC), _mm_mullo_epi32(s409, sE)), s128), 8);
    //__m128i sG = _mm_srli_epi32(_mm_add_epi32(_mm_sub_epi32(_mm_mullo_epi32(s298, sC), _mm_sub_epi32(_mm_mullo_epi32(s100, sD), _mm_mullo_epi32(s208, sE))), s128), 8);
    //__m128i sB = _mm_srli_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(s298, sC), _mm_mullo_epi32(s516, sD)), s128), 8);

    __m128i sKGl = _mm_unpacklo_epi32(sK, sG);
    __m128i sKGh = _mm_unpackhi_epi32(sK, sG);
    __m128i sRBl = _mm_unpacklo_epi32(sR, sB);
    __m128i sRBh = _mm_unpackhi_epi32(sR, sB);

    __m128i sKRGB1 = _mm_unpackhi_epi32(sKGh,sRBh);
    __m128i sKRGB2 = _mm_unpacklo_epi32(sKGh,sRBh);
    __m128i sKRGB3 = _mm_unpackhi_epi32(sKGl,sRBl);
    __m128i sKRGB4 = _mm_unpacklo_epi32(sKGl,sRBl);

    __m128i p1 = _mm_packus_epi16(sKRGB1, sKRGB2);
    __m128i p2 = _mm_packus_epi16(sKRGB3, sKRGB4);
    __m128i po = _mm_packus_epi16(p1, p2);

    _mm_store_si128((__m128i*)pDstP, po);
}
You may be bandwidth limited here, as there is very little computation relative to the number of loads and stores.
One suggestion: get rid of the _mm_prefetch intrinsics - they are almost certainly not helping and may even hinder operation on more recent CPUs (which already do a pretty good job with automatic prefetching).
Another area to look at:
__m128i sK = _mm_set_epi32(m_pKBuffer[i], m_pKBuffer[i+1], m_pKBuffer[i+2], m_pKBuffer[i+3]);
__m128i sY = _mm_set_epi32(pSrc8u[0][i], pSrc8u[0][i+1], pSrc8u[0][i+2], pSrc8u[0][i+3]);
__m128i sU = _mm_set_epi32((char)pSrc8u[1][i], (char)pSrc8u[1][i+1], (char)pSrc8u[1][i+2], (char)pSrc8u[1][i+3]);
__m128i sV = _mm_set_epi32((char)pSrc8u[2][i], (char)pSrc8u[2][i+1], (char)pSrc8u[2][i+2], (char)pSrc8u[2][i+3]);
This is generating a lot of unnecessary instructions - you should be using _mm_load_xxx and _mm_unpackxx_xxx here. It will look like more code, but it will be a lot more efficient. And you should probably be processing 16 pixels per iteration of the loop, rather than 4 - that way you load a vector of 8 bit values once, and unpack to get each set of 4 values as a vector of 32 bit ints as needed.

Any way to make this relatively simple (nested for memory copy) C++ code more efficient?

I realize this is kind of a goofy question, for lack of a better term. I'm just kind of looking for any outside idea on increasing the efficiency of this code, as it's bogging down the system very badly (it has to perform this function a lot) and I'm running low on ideas.
What it's doing is loading two image containers (imgRGB for a full-color image and imgBW for a b&w image) pixel by individual pixel from an image that's stored in "unsigned char *pImage".
Both imgRGB and imgBW are containers for accessing individual pixels as necessary.
// input is in the form of an unsigned char
// unsigned char *pImage
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
imgRGB[y][x].blue = *pImage;
pImage++;
imgRGB[y][x].green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
imgRGB[y][x].red = *pImage;
pImage++;
}
}
Like I said, I was just kind of looking for fresh input and ideas on better memory management and/or copy than this. Sometimes I look at my own code so much I get tunnel vision... a bit of a mental block. If anyone wants/needs more information, by all means let me know.
The obvious question is, do you need to copy the data in the first place? Can't you just define accessor functions to extract the R, G and B values for any given pixel from the original input array?
If the image data is transient so you have to keep a copy of it, you could just make a raw copy of it without any reformatting, and again define accessors to index into each pixel/channel on that.
Assuming the copy you outlined is necessary, unrolling the loop a few times may help.
I think the best approach will be to unroll the loop enough times to ensure that each iteration processes a chunk of data divisible by 4 bytes (so in each iteration, the loop can simply read a small number of ints, rather than a large number of chars)
Of course this requires you to mask out bits of these ints when writing, but that's a fast operation, and most importantly, it is done in registers, without burdening the memory subsystem or the CPU cache:
// First, we need to treat the input image as an array of ints. This is a bit nasty and technically unportable, but you get the idea.
unsigned int* img = reinterpret_cast<unsigned int*>(pImage);
for (int y = 0; y < 640; ++y)
{
for (int x = 0; x < 480; x += 4)
{
// At the start of each iteration, read 3 ints. That's 12 bytes, enough to write exactly 4 pixels.
unsigned int i0 = *img;
unsigned int i1 = *(img+1);
unsigned int i2 = *(img+2);
img += 3;
// This probably won't make a difference, but keeping a reference to the found pixel saves some typing, and it may assist the compiler in avoiding aliasing.
ImgRGB& pix0 = imgRGB[y][x];
pix0.blue = i0 & 0xff;
pix0.green = (i0 >> 8) & 0xff;
pix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
ImgRGB& pix1 = imgRGB[y][x+1];
pix1.blue = (i0 >> 24) & 0xff;
pix1.green = i1 & 0xff;
pix1.red = (i1 >> 8) & 0xff; // red of pixel 1 is byte 1 of the second int
imgBW[y][x+1] = i1 & 0xff;
ImgRGB& pix2 = imgRGB[y][x+2];
pix2.blue = (i1 >> 16) & 0xff;
pix2.green = (i1 >> 24) & 0xff;
pix2.red = i2 & 0xff;
imgBW[y][x+2] = (i1 >> 24) & 0xff;
ImgRGB& pix3 = imgRGB[y][x+3];
pix3.blue = (i2 >> 8) & 0xff;
pix3.green = (i2 >> 16) & 0xff;
pix3.red = (i2 >> 24) & 0xff;
imgBW[y][x+3] = (i2 >> 16) & 0xff;
}
}
It is also very likely that you're better off filling a temporary ImgRGB value and then writing that entire struct to memory at once, meaning that the first block would look like this instead (the following blocks would be similar, of course):
ImgRGB& pix0 = imgRGB[y][x];
ImgRGB tmpPix0;
tmpPix0.blue = i0 & 0xff;
tmpPix0.green = (i0 >> 8) & 0xff;
tmpPix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
pix0 = tmpPix0;
Depending on how clever the compiler is, this may cut down dramatically on the required number of reads.
Assuming the original code is naively compiled (which is probably unlikely, but will serve as an example), this will get you from 3 reads and 4 writes per pixel (read the RGB channels, write RGB + BW) down to 0.75 reads and 2 writes per pixel (one write for the RGB struct, and one for the BW value).
You could also accumulate the 4 writes to the BW image in a single int, and then write that in one go too, something like this:
unsigned int bw = 0;
bw |= (i0 >> 8) & 0xff;
bw |= (i1 & 0xff) << 8;
bw |= ((i1 >> 24) & 0xff) << 16;
bw |= ((i2 >> 16) & 0xff) << 24;
*(imgBW + (y*480 + x)/4) = bw; // Assuming you can treat imgBW as an array of ints
This would cut down on the number of writes to 1.25 per pixel (1 per RGB struct, and 1 for every 4 BW values)
Again, the benefit will probably be a lot smaller (or even nonexistent), but it may be worth a shot.
Taking this a step further, the same could be done without too much trouble using the SSE instructions, allowing you to process 4 times as many values per iteration. (Assuming you're running on x86)
Of course, an important disclaimer here is that the above is nonportable. The reinterpret_cast is probably an academic point (it'll most likely work no matter what, especially if you can ensure that the original array is aligned on a 32-bit boundary, which will typically be the case for large allocations on all platforms)
A bigger issue is that the bit-twiddling depends on the CPU's endianness.
But in practice, this should work on x86. and with small changes, it should work on big-endian machines too. (modulo any bugs in my code, of course. I haven't tested or even compiled any of it ;))
But no matter how you solve it, you're going to see the biggest speed improvements from minimizing the number of reads and writes, and trying to accumulate as much data in the CPU's registers as possible. Read all you can in large chunks, like ints, reorder it in the registers (accumulate it into a number of ints, or write it into temporary instances of the RGB struct), and then write those combined value out to memory.
Depending on how much you know about low-level optimizations, it may be surprising to you, but temporary variables are fine, while direct memory to memory access can be slow (for example your pointer dereferencing assigned directly into the array). The problem with this is that you may get more memory accesses than necessary, and it's harder for the compiler to guarantee that no aliasing will occur, and so it may be unable to reorder or combine the memory accesses. You're generally better off writing as much as you can early on (top of the loop), doing as much as possible in temporaries (because the compiler can keep everything in registers), and then write everything out at the end. That also gives the compiler as much leeway as possible to wait for the initially slow reads.
Finally, adding a 4th dummy value to the RGB struct (so it has a total size of 32bit) will most likely help a lot too (because then writing such a struct is a single 32-bit write, which is simpler and more efficient than the current 24-bit)
When deciding how much to unroll the loop (you could do the above twice or more in each iteration), keep in mind how many registers your CPU has. Spilling out into the cache will probably hurt you, as there are plenty of memory accesses already, but on the other hand, unroll as much as you can afford given the number of registers available. The above uses 3 registers for the input data and one to accumulate the BW values, and it may need one or two more to compute the necessary addresses, so on x86, doubling the above might be pushing it a bit (you have 8 registers total, and some of them have special meanings). On the other hand, modern CPUs do a lot to compensate for register pressure by using a much larger number of registers behind the scenes, so further unrolling might still be a net performance win.
As always, measure measure measure. It's impossible to say what's fast and what isn't until you've tested it.
Another general point to keep in mind is that data dependencies are bad. This won't be a big deal as long as you're only dealing with integral values, but it still inhibits instruction reordering, and superscalar execution.
In the above, I've tried to keep dependency chains as short as possible. Rather than continually incrementing the same pointer (which means that each increment is dependant on the previous one), adding a different offset to the same base address means that every address can be computed independently, again giving more freedom to the compiler to reorder and reschedule instructions.
I think the array accesses (are they real array accesses or operator []?) are going to kill you. Each one represents a multiply.
Basically, you want something like this:
for (int y=0; y < height; y++) {
unsigned char *destBgr = imgRgb.GetScanline(y); // inline methods are better
unsigned char *destBW = imgBW.GetScanline(y);
for (int x=0; x < width; x++) {
*destBgr++ = *pImage++;
*destBW++ = *destBgr++ = *pImage++; // do this in one shot - don't double deref
*destBgr++ = *pImage++;
}
}
This will do two multiplies per scanline. Your code was doing 4 multiplies per PIXEL.
What I like to do in situations like this is go into the debugger and step through the disassembly to see what it is really doing (or have the compiler generate an assembly listing). This can give you a lot of clues about where inefficencies are. They are often not where you think!
By implementing the changes suggested by Assaf and David Lee above, you can get a before and after instruction count. This really helps me in optimizing tight inner loops.
You could optimize away some of the pointer arithmetic you're doing over and over with the subscript operators [][] and use an iterator instead (that is, advance a pointer).
Memory bandwidth is your bottleneck here. There is a theoretical minimum time required to transfer all the data to and from system memory. I wrote a little test to compare the OP's version with some simple assembler to see how good the compiler was. I'm using VS2005 with default release mode settings. Here's the code:
#include <windows.h>
#include <iostream>
using namespace std;
const int
c_width = 640,
c_height = 480;
typedef struct _RGBData
{
unsigned char
r,
g,
b;
// I'm assuming there's no padding byte here
} RGBData;
// similar to the code given
void SimpleTest
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
for (int y = 0 ; y < c_height ; ++y)
{
for (int x = 0 ; x < c_width ; ++x)
{
rgb [x + y * c_width].b = *src;
src++;
rgb [x + y * c_width].g = *src;
bw [x + y * c_width] = *src;
src++;
rgb [x + y * c_width].r = *src;
src++;
}
}
}
// the assembler version
void ASM
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
const int
count = 3 * c_width * c_height / 12;
_asm
{
push ebp
mov esi,src
mov edi,bw
mov ecx,count
mov ebp,rgb
l1:
mov eax,[esi]
mov ebx,[esi+4]
mov edx,[esi+8]
mov [ebp],eax
shl eax,16
mov [ebp+4],ebx
rol ebx,16
mov [ebp+8],edx
shr edx,24
and eax,0xff000000
and ebx,0x00ffff00
and edx,0x000000ff
or eax,ebx
or eax,edx
add esi,12
bswap eax
add ebp,12
stosd
loop l1
pop ebp
}
}
// timing framework
LONGLONG TimeFunction
(
void (*function) (unsigned char *src, RGBData *rgb, unsigned char *bw),
char *description,
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
LARGE_INTEGER
start,
end;
cout << "Testing '" << description << "'...";
memset (rgb, 0, sizeof *rgb * c_width * c_height);
memset (bw, 0, c_width * c_height);
QueryPerformanceCounter (&start);
function (src, rgb, bw);
QueryPerformanceCounter (&end);
bool
ok = true;
unsigned char
*bw_check = bw,
i = 0;
RGBData
*rgb_check = rgb;
for (int count = 0 ; count < c_width * c_height ; ++count)
{
if (bw_check [count] != i || rgb_check [count].r != i || rgb_check [count].g != i || rgb_check [count].b != i)
{
ok = false;
break;
}
++i;
}
cout << (end.QuadPart - start.QuadPart) << (ok ? " OK" : " Failed") << endl;
return end.QuadPart - start.QuadPart;
}
int main
(
int argc,
char *argv []
)
{
unsigned char
*source_data = new unsigned char [c_width * c_height * 3];
RGBData
*rgb = new RGBData [c_width * c_height];
unsigned char
*bw = new unsigned char [c_width * c_height];
int
v = 0;
for (unsigned char *dest = source_data ; dest < &source_data [c_width * c_height * 3] ; ++dest)
{
*dest = v++ / 3;
}
LONGLONG
totals [2] = {0, 0};
for (int i = 0 ; i < 10 ; ++i)
{
cout << "Iteration: " << i << endl;
totals [0] += TimeFunction (SimpleTest, "Initial Copy", source_data, rgb, bw);
totals [1] += TimeFunction ( ASM, " ASM Copy", source_data, rgb, bw);
}
LARGE_INTEGER
freq;
QueryPerformanceFrequency (&freq);
freq.QuadPart /= 100000;
cout << totals [0] / freq.QuadPart << "ns" << endl;
cout << totals [1] / freq.QuadPart << "ns" << endl;
delete [] bw;
delete [] rgb;
delete [] source_data;
return 0;
}
And the ratio between C and assembler I was getting was about 2.5:1, i.e. C was 2.5 times the time of the assembler version.
I've just noticed the original data was in BGR order. If the copy swapped the B and R components, that would make the assembler code a bit more complex, but it would make the C code more complex too.
Ideally, you need to work out what the theoretical minimum time is and compare it to what you're actually getting. To do that, you need to know the memory frequency and the type of memory and the workings of the CPU's MMU.
You might try using a simple cast to get your RGB data, and just recompute the grayscale data:
#pragma pack(push, 1)
typedef unsigned char bw_t;
typedef struct {
unsigned char blue;
unsigned char green;
unsigned char red;
} rgb_t;
#pragma pack(pop)
rgb_t *imageRGB = (rgb_t*)pImage;
bw_t *imageBW = (bw_t*)calloc(640*480, sizeof(bw_t));
// RGB(X,Y) = imageRGB[Y*480 + X]
// BW(X,Y) = imageBW[Y*480 + X]
for (int y = 0; y < 640; ++y)
{
// GRAYSCALE() is assumed to be defined elsewhere (the original code just used the green channel)
// try and pull some larger number of bytes from pImage per iteration
// 24 bytes / sizeof(rgb_t) = 8 pixels per iteration
for (int x = 0; x < 480; x += 8)
{
imageBW[y*480 + x ] = GRAYSCALE(imageRGB[y*480 + x ]);
imageBW[y*480 + x + 1] = GRAYSCALE(imageRGB[y*480 + x + 1]);
imageBW[y*480 + x + 2] = GRAYSCALE(imageRGB[y*480 + x + 2]);
imageBW[y*480 + x + 3] = GRAYSCALE(imageRGB[y*480 + x + 3]);
imageBW[y*480 + x + 4] = GRAYSCALE(imageRGB[y*480 + x + 4]);
imageBW[y*480 + x + 5] = GRAYSCALE(imageRGB[y*480 + x + 5]);
imageBW[y*480 + x + 6] = GRAYSCALE(imageRGB[y*480 + x + 6]);
imageBW[y*480 + x + 7] = GRAYSCALE(imageRGB[y*480 + x + 7]);
}
}
Several steps you can take. Result at the end of this answer.
First, use pointers.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int y=0; y < 640; ++y) {
for (int x=0; x < 480; ++x) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
}
If imgRGB and imgBW are declared as:
unsigned char imgBW[480][640];
RGB imgRGB[480][640];
You can combine the two loops:
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int i=0; i < 640 * 480; ++i) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
You can exploit the fact that word reads are faster than four char reads. We will use a helper macro for this. Note this example assumes a little-endian target system.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = reinterpret_cast<const uint32_t *>(pImage);
// Each iteration reads 3 words (12 bytes) and writes exactly 4 pixels.
for (int i=0; i < 640 * 480 / 4; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->blue = pixels; \
pixels >>= 8; \
\
rgbOut->green = pixels; \
*bwOut = pixels; \
pixels >>= 8; \
\
rgbOut->red = pixels; \
pixels >>= 8; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= static_cast<uint64_t>(*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef READ_PIXEL
#undef WRITE_PIXEL
}
(Your compiler will probably optimize away the redundant or operation in the first READ_PIXEL. It will also optimize shifts, removing the redundant << 0, too.)
If the structure of RGB is thus:
struct RGB {
unsigned char blue, green, red;
};
You can optimize even further and copy to the struct directly, instead of through its members (red, green, blue). This can be done using an anonymous struct (or casting, but that makes the code messier and probably more error-prone). (Again, this is dependent on little-endian systems, etc.):
union RGB {
struct {
unsigned char blue, green, red;
};
uint32_t rgb:24; // Make sure it's a bitfield, otherwise the union will stretch and break the ++ operator.
};
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = reinterpret_cast<const uint32_t *>(pImage);
// As before: 3 word reads (12 bytes) per iteration, 4 pixels written.
for (int i=0; i < 640 * 480 / 4; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->rgb = pixels; \
pixels >>= 8; \
\
*bwOut = pixels; \
pixels >>= 16; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= static_cast<uint64_t>(*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef READ_PIXEL
#undef WRITE_PIXEL
}
You can optimize writing the pixel similarly as we did with reading (writing in words rather than 24-bits). In fact, that'd be a pretty good idea, and will be a great next step in optimization. Too tired to code it, though. =]
Of course, you can write the routine in assembly language. This makes it less portable than it already is, however.
I'm assuming the following at the moment, so please let me know if my assumptions are wrong:
a) imgRGB is a structure of the type
struct ImgRGB
{
unsigned char blue;
unsigned char green;
unsigned char red;
};
or at least something similar.
b) imgBW looks something like this:
struct ImgBW
{
unsigned char BW;
};
c) The code is single threaded
Assuming the above, I see several problems with your code:
You put the assignment to the BW part right in the middle of the assignments to the other containers. If you're working on a modern CPU, chances are that with the size of your data your L1 cache gets invalidated every time you're switching containers and you're looking at reloading or switching a cache line. Caches are optimised for linear access these days so hopping to and fro doesn't help. Accessing main memory is a lot slower, so that would be a noticeable performance hit. To verify if this is a problem, temporarily I'd remove the assignment to imgBW and measure if there is a noticeable speedup.
The array access doesn't help and it'll potentially slow down the code a little, although a decent optimiser should take care of that. I'd probably write the loop along these lines instead, but would not expect a big performance gain. Maybe a couple percent.
ImgRGB *pRGB = &imgRGB[0][0];
for (int y=0; y < 640; ++y) {
    for (int x=0; x < 480; ++x, ++pRGB) {
        pRGB->blue = *pImage;
        ...
    }
}
For consistency I would change from using postfix to prefix increment but I would not expect to see a big gain.
If you can waste a little storage (well, 25%) you might gain from adding a fourth dummy unsigned char to the structure ImgRGB provided that this would increase the size of the structure to the size of an int. Native ints are usually fastest to access and if you're looking at a structure of chars that are not filling up an int completely, you're potentially running into all sorts of interesting access issues that can slow your code down noticeably because the compiler might have to generate additional instructions to extract the unsigned chars. Again, try this and measure the result - it might make a noticeable difference or none at all. In the same vein, upping the size of the structure members from unsigned char to unsigned int might waste lots of space but potentially can speed up the code. Nevertheless as long as pImage is a pointer to an unsigned char, you would only eliminate half the problem.
All in all you are down to making your loop fit to your underlying hardware, so for specific optimisation techniques you might have to read up on what your hardware does well and what it does badly.
Make sure pImage, imgRGB, and imgBW are marked __restrict.
Use SSE and do it sixteen bytes at a time.
Actually from what you're doing there it looks like you could use a simple memcpy() to copy pImage into imgRGB (since imgRGB is in row-major format and apparently in the same order as pImage). You could fill out imgBW by using a series of SSE swizzle and store ops to pack down the green values but it might be cumbersome since you'd need to work on ( 3*16 =) 48 bytes at a time.
Are you sure pImage and your output arrays are all in dcache when you start this? Try using a prefetch hint to fetch 128 bytes ahead and measure to see if that improves things.
Edit If you're not on x86, replace "SSE" with the appropriate SIMD instruction set for your hardware, of course. (That'd be VMX, Altivec, SPU, VLIW, HLSL, etc.)
If possible, fix this at a higher level than bit or instruction twiddling!
You could specialize the B&W image class to one that references the green channel of the color image class (thus saving a copy per pixel). If you always create them in pairs, you might not even need the naive imgBW class at all.
By taking care about how you store the data in imgRGB, you could copy a triplet at a time from the input data. Better, you might copy the whole thing, or even just store a reference (which makes the previous suggestion easy as well).
If you don't control the implementation of everything here, you might be stuck, then:
Last resort: unroll the loop (cue someone mentioning Duff's device, or just ask the compiler to do it for you...), though I don't think you'll see much improvement...
It seems that you defined each pixel as some kind of structure or object. Using a primitive type (say, an int) could be faster. As others have mentioned, the compiler is likely to optimize the array access using pointer increments. If the compiler doesn't do that for you, you can do it yourself to avoid multiplications when you use array[][].
Since you only need 3 bytes per pixel, you could pack one pixel into one int. By doing that, you could copy 3 bytes at a time instead of byte-by-byte. The only tricky thing is that when you want to read individual color components of a pixel, you will need some bit masking and shifting. This could give you more overhead than that saved by using an int.
Or you can use 3 int arrays for 3 color components respectively. You will need a lot more storage, though.
Here is one very tiny, very simple optimization:
You are referring to imgRGB[y][x] repeatedly, and that address likely needs to be re-calculated at each step.
Instead, calculate it once, and see if that makes some improvement:
Pixel* apixel;
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
apixel = &imgRGB[y][x];
apixel->blue = *pImage;
pImage++;
apixel->green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
apixel->red = *pImage;
pImage++;
}
}
If pImage is already entirely in memory, why do you need to massage the data? I mean if it is already in pseudo-RGB format, why can't you just write some inline routines/macros that can spit out the values on demand instead of copying it around?
If rearranging the pixel data is important for later operations, consider block operations and/or cache line optimization.