SSE2 code optimization to compress an image - c++

I want to optimize the for loop below with SSE/SSE2 instructions to speed up image compression.
size_t height = get_height();
size_t width = get_width();
size_t total_size = height * width * 3;
uint8_t *src = get_pixels();
uint8_t *dst = new uint8_t[total_size / 6];
uint8_t *tmp = dst;
rgb_t block[16];
if (height % 4 != 0 || width % 4 != 0) {
cerr << "Texture compression only supported for images if width and height are multiples of 4" << endl;
// Split image in 4x4 pixels zones
for (unsigned y = 0; y < height; y += 4, src += width * 3 * 4) {
for (unsigned x = 0; x < width; x += 4, dst += 8) {
const rgb_t *row0 = reinterpret_cast<const rgb_t*>(src + x * 3);
const rgb_t *row1 = row0 + width;
const rgb_t *row2 = row1 + width;
const rgb_t *row3 = row2 + width;
// Extract 4x4 matrix of pixels from a linearized matrix(linear memory).
memcpy(block, row0, 12);
memcpy(block + 4, row1, 12);
memcpy(block + 8, row2, 12);
memcpy(block + 12, row3, 12);
// Compress block and write result in dst.
compress_block(block, dst);
How can I read an entire row of the block from memory into SSE/SSE2 registers when a row has 4 elements of 3 bytes each? The rgb_t structure has three uint8_t members.

Why do you think the compiler doesn't already make good code for those 12-byte copies?
But if it doesn't, copying 16 bytes instead of 12 for the first three rows (each copy overlapping the start of the next row's slot) will probably let it use SSE vectors. Padding your array lets you do the last copy as a 16-byte memcpy as well, which should also compile to a 16-byte vector load/store:
alignas(16) rgb_t buf[16 + 4];
Aligning probably doesn't matter much, since only the first store will be aligned anyway. But it might help the function you're passing the buffer to as well.
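As a rough sketch of the overlapping 16-byte copy idea (not the asker's actual code): the helper name load_block_4x4 is made up for illustration, rgb_t is assumed to be three tightly packed uint8_t members, the destination is assumed to have 4 spare bytes after the 16 pixels, and the source is assumed to have at least 4 readable bytes after the last pixel of the block (for the final block of an image that means over-allocating the source or falling back to 12-byte copies).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

struct rgb_t { uint8_t r, g, b; };
static_assert(sizeof(rgb_t) == 3, "rgb_t must be tightly packed");

// Hypothetical helper: gathers the 4x4 block whose top-left pixel is column x
// of the row pointed to by src (width pixels per row) into block[].
// Each row is moved with one unaligned 16-byte load and store. The 4 surplus
// bytes of a row are overwritten by the next row; the last row's surplus
// lands in the padding after the 16th pixel.
// Assumptions: block holds at least 16 * 3 + 4 bytes, and 4 bytes past the
// block's last source pixel are readable.
inline void load_block_4x4(const uint8_t* src, size_t width, size_t x, rgb_t* block) {
    assert(x + 4 <= width);
    const uint8_t* row = src + x * 3;
    uint8_t* out = reinterpret_cast<uint8_t*>(block);
    for (int i = 0; i < 4; ++i, row += width * 3, out += 12) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
    }
}
```

Whether this beats four 12-byte memcpy calls depends on the compiler and target, so it is worth checking the generated assembly and timing both versions.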


What is the highest bit depth greyscale image I can export from FreeImage?

As context, I'm working with building a topographic program which needs relatively extreme detail. I do not expect the files to be small, and they do not formally need to be viewed on a monitor, they just need to have very high resolution.
I know that most image formats are limited to 8 bpp, on account of the standard limits on both monitors (at a reasonable price) and on human perception. However, 2⁸ is just 256 possible values, which induces plateauing artifacts in a reconstructed displacement. 2¹⁶ may be close enough at 65,536 possible values, which I have achieved.
I'm using FreeImage and DLang to construct the data, currently on a Linux Mint machine.
However, when I went on to 2³², software support seemed to fade on me. I tried a TIFF of this form and nothing seemed to be able to interpret it, either showing a completely (or mostly) transparent image (remembering that I didn't expect any monitor to really support 2³² shades of a channel) or complaining about being unable to decode the RGB data. I imagine that it's because it was assumed to be an RGB or RGBA image.
FreeImage is reasonably well documented for most purposes, but I'm now wondering, what is the highest-precision single-channel format I can export, and how would I do it? Can anyone provide an example? Am I really limited, in any typical and not-home-rolled image format, to 16-bit? I know that's high enough for, say, medical imaging, but I'm sure I'm not the first person to try to aim higher and we science-types can be pretty ambitious about our precision-level…
Did I make a glaring mistake in my code? Is there something else I should try instead for this kind of precision?
Here's my code.
The 16-bit TIFF that worked
void writeGrayscaleMonochromeBitmap(const double width, const double height) {
FIBITMAP *bitmap = FreeImage_AllocateT(FIT_UINT16, cast(int)width, cast(int)height);
for(int y = 0; y < height; y++) {
ubyte *scanline = FreeImage_GetScanLine(bitmap, y);
for(int x = 0; x < width; x++) {
ushort v = cast(ushort)((x * 0xFFFF)/width);
ubyte[2] bytes = nativeToLittleEndian(cast(ushort)(x/width * 0xFFFF));
scanline[x * ushort.sizeof + 0] = bytes[0];
scanline[x * ushort.sizeof + 1] = bytes[1];
FreeImage_Save(FIF_TIFF, bitmap, "test.tif", TIFF_DEFAULT);
The 32-bit TIFF that didn't really work
void writeGrayscaleMonochromeBitmap32(const double width, const double height) {
FIBITMAP *bitmap = FreeImage_AllocateT(FIT_UINT32, cast(int)width, cast(int)height);
writeln(width, ", ", height);
writeln("Width: ", FreeImage_GetWidth(bitmap));
for(int y = 0; y < height; y++) {
ubyte *scanline = FreeImage_GetScanLine(bitmap, y);
writeln(y, ": ", scanline);
for(int x = 0; x < width; x++) {
//writeln(x, " < ", width);
uint v = cast(uint)((x/width) * 0xFFFFFFFF);
writeln("V: ", v);
ubyte[4] bytes = nativeToLittleEndian(v);
scanline[x * uint.sizeof + 0] = bytes[0];
scanline[x * uint.sizeof + 1] = bytes[1];
scanline[x * uint.sizeof + 2] = bytes[2];
scanline[x * uint.sizeof + 3] = bytes[3];
FreeImage_Save(FIF_TIFF, bitmap, "test32.tif", TIFF_NONE);
Thanks for any pointers.
For a single channel, the highest bit depth available from FreeImage is 32 bits, as FIT_UINT32. However, the file format must support it, and at the moment only TIFF appears to be up to the task (see page 104 of the Stanford Documentation). Additionally, most monitors cannot display more than 8 bits per sample (12 in extreme cases), so it is very difficult to read the data back out and have it render properly.
A unit test that compares the bytes before marshaling to the bitmap with bytes sampled from the same bitmap afterward shows that the data is in fact being encoded.
To write data to a 16-bit grayscale image (currently supported by J2K, JP2, PGM, PGMRAW, PNG and TIF), you would do something like this:
void toFreeImageUINT16PNG(string fileName, const double width, const double height, double[] data) {
FIBITMAP *bitmap = FreeImage_AllocateT(FIT_UINT16, cast(int)width, cast(int)height);
for(int y = 0; y < height; y++) {
ubyte *scanline = FreeImage_GetScanLine(bitmap, y);
for(int x = 0; x < width; x++) {
//This magic has to happen with the y-coordinate in order to keep FreeImage from following its default behavior, and generating
//the image upside down.
ushort v = cast(ushort)(data[cast(ulong)(((height - 1) - y) * width + x)] * 0xFFFF); //((x * 0xFFFF)/width);
ubyte[2] bytes = nativeToLittleEndian(v);
scanline[x * ushort.sizeof + 0] = bytes[0];
scanline[x * ushort.sizeof + 1] = bytes[1];
FreeImage_Save(FIF_PNG, bitmap, fileName.toStringz);
Of course you would want to make adjustments for your target file type. To export as 48-bit RGB16, you would do this.
void toFreeImageColorPNG(string fileName, const double width, const double height, double[] data) {
FIBITMAP *bitmap = FreeImage_AllocateT(FIT_RGB16, cast(int)width, cast(int)height);
uint pitch = FreeImage_GetPitch(bitmap);
uint bpp = FreeImage_GetBPP(bitmap);
for(int y = 0; y < height; y++) {
ubyte *scanline = FreeImage_GetScanLine(bitmap, y);
for(int x = 0; x < width; x++) {
ulong offset = cast(ulong)((((height - 1) - y) * width + x) * 3);
ushort r = cast(ushort)(data[(offset + 0)] * 0xFFFF);
ushort g = cast(ushort)(data[(offset + 1)] * 0xFFFF);
ushort b = cast(ushort)(data[(offset + 2)] * 0xFFFF);
ubyte[6] bytes = nativeToLittleEndian(r) ~ nativeToLittleEndian(g) ~ nativeToLittleEndian(b);
scanline[(x * 3 * ushort.sizeof) + 0] = bytes[0];
scanline[(x * 3 * ushort.sizeof) + 1] = bytes[1];
scanline[(x * 3 * ushort.sizeof) + 2] = bytes[2];
scanline[(x * 3 * ushort.sizeof) + 3] = bytes[3];
scanline[(x * 3 * ushort.sizeof) + 4] = bytes[4];
scanline[(x * 3 * ushort.sizeof) + 5] = bytes[5];
FreeImage_Save(FIF_PNG, bitmap, fileName.toStringz);
Lastly, to encode a UINT32 greyscale image (limited purely to TIFF at the moment), you would do this.
void toFreeImageTIF32(string fileName, const double width, const double height, double[] data) {
FIBITMAP *bitmap = FreeImage_AllocateT(FIT_UINT32, cast(int)width, cast(int)height);
int xtest = cast(int)(width/2);
int ytest = cast(int)(height/2);
uint comp1a = cast(uint)(data[cast(ulong)(((height - 1) - ytest) * width + xtest)] * 0xFFFFFFFF);
writeln("initial: ", nativeToLittleEndian(comp1a));
for(int y = 0; y < height; y++) {
ubyte *scanline = FreeImage_GetScanLine(bitmap, y);
for(int x = 0; x < width; x++) {
//This magic has to happen with the y-coordinate in order to keep FreeImage from following its default behavior, and generating
//the image upside down.
ulong i = cast(ulong)(((height - 1) - y) * width + x);
uint v = cast(uint)(data[i] * 0xFFFFFFFF);
ubyte[4] bytes = nativeToLittleEndian(v);
scanline[x * uint.sizeof + 0] = bytes[0];
scanline[x * uint.sizeof + 1] = bytes[1];
scanline[x * uint.sizeof + 2] = bytes[2];
scanline[x * uint.sizeof + 3] = bytes[3];
ulong index = cast(ulong)(xtest * uint.sizeof);
writeln("Final: ", FreeImage_GetScanLine(bitmap, ytest)
[index .. index + uint.sizeof]);
FreeImage_Save(FIF_TIFF, bitmap, fileName.toStringz);
I've yet to find a program, built by anyone else, that will readily render a 32-bit grayscale image on a monitor's available palette. However, I left my checking code in; it consistently writes out the same bytes in the "initial" line at the top and the "Final" line at the bottom, and that's consistent enough for me.
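The two ideas the functions above share (scaling a normalized sample to the full integer range, and reading the source rows in reverse so the image is not written upside down) can be sketched independently of FreeImage. This is a C++ illustration under assumptions: the names quantize16, quantize32 and flipped_index are hypothetical, samples are assumed to lie in [0, 1], and unlike the D code above it clamps and rounds to nearest instead of truncating.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Maps a sample in [0, 1] to the full 16-bit range.
// Out-of-range input is clamped; the result is rounded to nearest.
inline uint16_t quantize16(double v) {
    v = std::fmin(std::fmax(v, 0.0), 1.0);
    return static_cast<uint16_t>(std::lround(v * 65535.0));
}

// Same idea for the full 32-bit range (llround, because the result can
// exceed the range of a 32-bit long).
inline uint32_t quantize32(double v) {
    v = std::fmin(std::fmax(v, 0.0), 1.0);
    return static_cast<uint32_t>(std::llround(v * 4294967295.0));
}

// Index into a top-down data array for the pixel at column x of
// bottom-up scanline y.
inline size_t flipped_index(size_t x, size_t y, size_t width, size_t height) {
    assert(x < width && y < height);
    return (height - 1 - y) * width + x;
}
```

With 16 bits there are 65,536 levels and with 32 bits 4,294,967,296, so the step between adjacent levels shrinks by a factor of 65,536 when moving from the first to the second.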
Hopefully this will help someone else out in the future.

Efficient C++ code (no libs) for image transformation into custom RGB pixel greyscale

Currently working on C++ implementation of ToGreyscale method and I want to ask what is the most efficient way to transform "unsigned char* source" using custom RGB input params.
Below is a current idea, but maybe using a Vector would be better?
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
float r = pixel[0];
float g = pixel[1];
float b = pixel[2];
// Do something with r, g, b
The most efficient single-threaded CPU implementation is a manually optimized SIMD implementation.
SIMD extensions are specific to the processor architecture.
For x86 there are the SSE and AVX extensions, NEON for ARM, AltiVec for PowerPC...
In many cases the compiler can generate very efficient code that uses the SIMD extensions without any effort from the programmer (just by setting compiler flags).
There are also many cases where the compiler can't generate efficient code (for many reasons).
When you need very high performance, it's recommended to implement it using C intrinsic functions.
Most intrinsics are converted directly to assembly instructions (one intrinsic to one instruction), without the need to know assembly.
There are downsides to using intrinsics (compared to a generic C implementation): the implementation is complicated to write and to maintain, and the code is platform-specific and not portable.
A good reference for x86 intrinsics is the Intel Intrinsics Guide.
The posted code uses the SSE instruction set extensions.
The implementation is very efficient, but it does not give top performance (using AVX2, for example, may be faster but is less portable).
For better efficiency my code uses a fixed-point implementation.
In many cases fixed point is more efficient than floating point (but more difficult).
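To make the fixed-point idea concrete, here is a scalar sketch of one such scheme: coefficients scaled by 2^15, inputs pre-multiplied by 64, a rounding multiply-high, and a final division by 64. The names mul_q15, grey_fixed and grey_float are hypothetical, and the scaling choices are assumptions picked so that every intermediate value fits in a signed 16-bit lane.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a rounding "multiply high": round(a * b / 2^15).
inline int16_t mul_q15(int16_t a, int16_t b) {
    return static_cast<int16_t>((static_cast<int32_t>(a) * b + (1 << 14)) >> 15);
}

// Fixed-point grey value: Y = 0.2989*R + 0.5870*G + 0.1140*B.
inline uint8_t grey_fixed(uint8_t r, uint8_t g, uint8_t b) {
    // Coefficients scaled by 2^15 and rounded to int16.
    const int16_t kr = static_cast<int16_t>(0.2989 * 32768.0 + 0.5);
    const int16_t kg = static_cast<int16_t>(0.5870 * 32768.0 + 0.5);
    const int16_t kb = static_cast<int16_t>(0.1140 * 32768.0 + 0.5);
    // Inputs are multiplied by 64 (255 * 64 = 16320 still fits in int16)
    // so that the rounding error of each product is 1/64 of a grey level.
    const int sum = mul_q15(static_cast<int16_t>(r << 6), kr)
                  + mul_q15(static_cast<int16_t>(g << 6), kg)
                  + mul_q15(static_cast<int16_t>(b << 6), kb);
    return static_cast<uint8_t>(sum >> 6);  // divide by 64 (truncates)
}

// Floating-point reference with rounding to nearest.
inline uint8_t grey_float(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint8_t>(0.2989f * r + 0.5870f * g + 0.1140f * b + 0.5f);
}
```

Because the final shift truncates, this sketch can come out one level below the rounded floating-point result; pure white, for example, maps to 254 instead of 255. Adding 32 to the sum before shifting would round instead.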
The most complicated part of this specific algorithm is reordering the RGB elements.
When the RGB elements are ordered in triples r,g,b,r,g,b,r,g,b... you need to reorder them to rrrr... gggg... bbbb... in order to utilize SIMD.
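A plain scalar version of that reordering is shown here as a sketch; the function name deinterleave_rgb is hypothetical. It is not fast, but it states exactly what any shuffle-based SIMD version has to produce, which makes it handy as a reference in unit tests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Splits n interleaved pixels (r,g,b,r,g,b,...) into three planar arrays
// (rrrr..., gggg..., bbbb...). Each output array must hold n bytes.
inline void deinterleave_rgb(const uint8_t* rgb, size_t n,
                             uint8_t* r, uint8_t* g, uint8_t* b) {
    assert(rgb && r && g && b);
    for (size_t i = 0; i < n; ++i) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```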
Naming conventions:
Don't be scared by the long weird variable names.
I am using this weird naming convention (it's my convention), because it helps me follow the code.
r7_r6_r5_r4_r3_r2_r1_r0 for example marks an XMM register with 8 uint16 elements.
The following implementation includes code with and without SSE intrinsics:
//Optimized implementation (use SSE intrinsics):
#include <intrin.h>
//Convert from RGBRGBRGB... to RRR..., GGG..., BBB...
//Input: Two XMM registers (24 uint8 elements) ordered RGBRGB...
//Output: Three XMM registers ordered RRR..., GGG... and BBB...
// Unpack the result from uint8 elements to uint16 elements.
static __inline void GatherRGBx8(const __m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
const __m128i b7_g7_r7_b6_g6_r6_b5_g5,
__m128i &r7_r6_r5_r4_r3_r2_r1_r0,
__m128i &g7_g6_g5_g4_g3_g2_g1_g0,
__m128i &b7_b6_b5_b4_b3_b2_b1_b0)
//Shuffle mask for gathering 4 R elements, 4 G elements and 4 B elements (also set last 4 elements to duplication of first 4 elements).
const __m128i shuffle_mask = _mm_set_epi8(9,6,3,0, 11,8,5,2, 10,7,4,1, 9,6,3,0);
__m128i b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4 = _mm_alignr_epi8(b7_g7_r7_b6_g6_r6_b5_g5, r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, 12);
//Gather 4 R elements, 4 G elements and 4 B elements.
//Remark: As I recall _mm_shuffle_epi8 instruction is not so efficient (I think execution is about 5 times longer than other shuffle instructions).
__m128i r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0 = _mm_shuffle_epi8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, shuffle_mask);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4 = _mm_shuffle_epi8(b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4, shuffle_mask);
//Put 8 R elements in lower part.
__m128i b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0 = _mm_alignr_epi8(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 12);
//Put 8 G elements in lower part.
__m128i g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 4);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0 = _mm_alignr_epi8(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4, g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz, 12);
//Put 8 B elements in lower part.
__m128i b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 4);
__m128i zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0 = _mm_alignr_epi8(zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4, b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz, 12);
//Unpack uint8 elements to uint16 elements.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_cvtepu8_epi16(b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_cvtepu8_epi16(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_cvtepu8_epi16(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0);
//Calculate 8 Grayscale elements from 8 RGB elements.
//Y = 0.2989*R + 0.5870*G + 0.1140*B
//Conversion model used by MATLAB
static __inline __m128i Rgb2Yx8(__m128i r7_r6_r5_r4_r3_r2_r1_r0,
__m128i g7_g6_g5_g4_g3_g2_g1_g0,
__m128i b7_b6_b5_b4_b3_b2_b1_b0)
//Each coefficient is expanded by 2^15, and rounded to int16 (add 0.5 for rounding).
const __m128i r_coef = _mm_set1_epi16((short)(0.2989*32768.0 + 0.5)); //8 coefficients - R scale factor.
const __m128i g_coef = _mm_set1_epi16((short)(0.5870*32768.0 + 0.5)); //8 coefficients - G scale factor.
const __m128i b_coef = _mm_set1_epi16((short)(0.1140*32768.0 + 0.5)); //8 coefficients - B scale factor.
//Multiply input elements by 64 for improved accuracy.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_slli_epi16(r7_r6_r5_r4_r3_r2_r1_r0, 6);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_slli_epi16(g7_g6_g5_g4_g3_g2_g1_g0, 6);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_slli_epi16(b7_b6_b5_b4_b3_b2_b1_b0, 6);
//Use the special intrinsic _mm_mulhrs_epi16 that calculates round(r*r_coef/2^15).
//Calculate Y = 0.2989*R + 0.5870*G + 0.1140*B (use fixed point computations)
__m128i y7_y6_y5_y4_y3_y2_y1_y0 = _mm_add_epi16(_mm_add_epi16(
_mm_mulhrs_epi16(r7_r6_r5_r4_r3_r2_r1_r0, r_coef),
_mm_mulhrs_epi16(g7_g6_g5_g4_g3_g2_g1_g0, g_coef)),
_mm_mulhrs_epi16(b7_b6_b5_b4_b3_b2_b1_b0, b_coef));
//Divide result by 64.
y7_y6_y5_y4_y3_y2_y1_y0 = _mm_srli_epi16(y7_y6_y5_y4_y3_y2_y1_y0, 6);
return y7_y6_y5_y4_y3_y2_y1_y0;
//Convert single row from RGB to Grayscale (use SSE intrinsics).
//I0 points source row, and J0 points destination row.
//I0 -> rgbrgbrgbrgbrgbrgb...
//J0 -> yyyyyy
static void Rgb2GraySingleRow_useSSE(const unsigned char I0[],
                                     const int image_width,
                                     unsigned char J0[])
{
    int x; //Index in J0.
    int srcx; //Index in I0.
    __m128i r7_r6_r5_r4_r3_r2_r1_r0;
    __m128i g7_g6_g5_g4_g3_g2_g1_g0;
    __m128i b7_b6_b5_b4_b3_b2_b1_b0;
    srcx = 0;
    //Process 8 pixels per iteration.
    for (x = 0; x < image_width; x += 8)
    {
        //Load 8 elements of each color channel R,G,B from first row.
        __m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0 = _mm_loadu_si128((__m128i*)&I0[srcx]); //Unaligned load of 16 uint8 elements
        __m128i b7_g7_r7_b6_g6_r6_b5_g5 = _mm_loadl_epi64((__m128i*)&I0[srcx+16]); //Unaligned load of (only) 8 uint8 elements (lower half of XMM register).
        //Separate RGB, and put together R elements, G elements and B elements (together in same XMM register).
        //Result is also unpacked from uint8 to uint16 elements.
        GatherRGBx8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
                    b7_g7_r7_b6_g6_r6_b5_g5,
                    r7_r6_r5_r4_r3_r2_r1_r0,
                    g7_g6_g5_g4_g3_g2_g1_g0,
                    b7_b6_b5_b4_b3_b2_b1_b0);
        //Calculate 8 Y elements.
        __m128i y7_y6_y5_y4_y3_y2_y1_y0 = Rgb2Yx8(r7_r6_r5_r4_r3_r2_r1_r0,
                                                  g7_g6_g5_g4_g3_g2_g1_g0,
                                                  b7_b6_b5_b4_b3_b2_b1_b0);
        //Pack uint16 elements to 16 uint8 elements (put result in single XMM register). Only lower 8 uint8 elements are relevant.
        __m128i j7_j6_j5_j4_j3_j2_j1_j0 = _mm_packus_epi16(y7_y6_y5_y4_y3_y2_y1_y0, y7_y6_y5_y4_y3_y2_y1_y0);
        //Store 8 elements of Y in row J0.
        _mm_storel_epi64((__m128i*)&J0[x], j7_j6_j5_j4_j3_j2_j1_j0);
        srcx += 24; //Advance 24 source bytes per iteration.
    }
}
//Convert image I from pixel ordered RGB to Grayscale format.
//Conversion formula: Y = 0.2989*R + 0.5870*G + 0.1140*B (Rec.ITU-R BT.601)
//Formula is based on MATLAB rgb2gray function:
//Implementation uses SSE intrinsics for performance optimization.
//Use fixed point computations for better performance.
//I - Input image in pixel ordered RGB format.
//image_width - Number of columns of I.
//image_height - Number of rows of I.
//J - Destination "image" in Grayscale format.
//I is pixel ordered RGB color format (size in bytes is image_width*image_height*3):
//J is in Grayscale format (size in bytes is image_width*image_height):
//Limitations:
//1. image_width must be a multiple of 8.
//2. I and J must be two separate arrays (in place computation is not supported).
//3. Rows of I and J are contiguous in memory (byte stride is not supported, [but simple to add]).
//Comments:
//1. The conversion formula is incorrect, but it's a commonly used approximation.
//2. Code uses SSE 4.1 instruction set.
// Better performance can be achieved using AVX2 implementation.
// (AVX2 is supported by Intel Core 4'th generation and above, and new AMD processors).
//3. The code is not the best SSE optimization:
// Uses unaligned load and store operations.
// Utilize only half XMM register in few cases.
// Instruction selection is probably sub-optimal.
void Rgb2Gray_useSSE(const unsigned char I[],
                     const int image_width,
                     const int image_height,
                     unsigned char J[])
{
    //I0 points source image row.
    const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
    //J0 points destination image row.
    unsigned char *J0; //J0 -> YYYYYY
    int y; //Row index
    //Process one row per iteration.
    for (y = 0; y < image_height; y++)
    {
        I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
        J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
        //Convert row I0 from RGB to Grayscale.
        Rgb2GraySingleRow_useSSE(I0, image_width, J0);
    }
}
//Convert single row from RGB to Grayscale (simple C code without intrinsics).
static void Rgb2GraySingleRow_Simple(const unsigned char I0[],
const int image_width,
unsigned char J0[])
int x; //index in J0.
int srcx; //Index in I0.
srcx = 0;
//Process 1 pixel per iteration.
for (x = 0; x < image_width; x++)
float r = (float)I0[srcx]; //Load red pixel and convert to float
float g = (float)I0[srcx+1]; //Green
float b = (float)I0[srcx+2]; //Blue
float gray = 0.2989f*r + 0.5870f*g + 0.1140f*b; //Convert to Grayscale (use BT.601 conversion coefficients).
J0[x] = (unsigned char)(gray + 0.5f); //Add 0.5 for rounding.
srcx += 3; //Advance 3 source bytes per iteration.
//Convert RGB to Grayscale using simple C code (without SIMD intrinsics).
//Use as reference (for time measurements).
void Rgb2Gray_Simple(const unsigned char I[],
                     const int image_width,
                     const int image_height,
                     unsigned char J[])
{
    //I0 points source image row.
    const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
    //J0 points destination image row.
    unsigned char *J0; //J0 -> YYYYYY
    int y; //Row index
    //Process one row per iteration.
    for (y = 0; y < image_height; y++)
    {
        I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
        J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
        //Convert row I0 from RGB to Grayscale.
        Rgb2GraySingleRow_Simple(I0, image_width, J0);
    }
}
On my machine, the manually optimized code is about 3 times faster.
You can compute the mean of the RGB channels and then assign the result to each channel.
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
    for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
        float grayscaleValue = 0;
        for (int k = 0; k < 3; k++) {
            grayscaleValue += pixel[k];
        }
        grayscaleValue /= 3;
        for (int k = 0; k < 3; k++) {
            pixel[k] = static_cast<uint8_t>(grayscaleValue);
        }
    }
}
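Since the question asks about custom RGB parameters, the averaging loop generalizes to a weighted sum. The following is a minimal sketch, assuming the weights are passed as three floats and that pixel_size is the byte distance between consecutive pixels (3 for RGB, 4 for RGBA); the function name and signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts pixel_count pixels to greyscale in place using caller-supplied
// channel weights. The first three bytes of each pixel are taken to be
// R, G, B; any further bytes (e.g. alpha) are left untouched.
inline void to_greyscale(uint8_t* pixels, size_t pixel_count, size_t pixel_size,
                         float wr, float wg, float wb) {
    assert(pixel_size >= 3);
    for (size_t i = 0; i < pixel_count; ++i, pixels += pixel_size) {
        float y = wr * pixels[0] + wg * pixels[1] + wb * pixels[2];
        // Clamp so that weights summing to more than 1 cannot overflow.
        if (y < 0.0f) y = 0.0f;
        if (y > 255.0f) y = 255.0f;
        const uint8_t grey = static_cast<uint8_t>(y + 0.5f);  // round to nearest
        pixels[0] = pixels[1] = pixels[2] = grey;
    }
}
```

A simple loop like this is also a good candidate for compiler auto-vectorization, so it is worth measuring it with optimization enabled before reaching for intrinsics.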

Broken BMP when save bitmap by SOIL. Screenshot area

This is a continuation of my last question about saving a screenshot with SOIL. Now I wonder how to take a screenshot of part of the screen and eliminate the cause of that strange behaviour. My code:
bool saveTexture(string path, glm::vec2 startPos, glm::vec2 endPos)
const char *charPath = path.c_str();
GLuint widthPart = abs(endPos.x - startPos.x);
GLuint heightPart = abs(endPos.y - startPos.y);
auto& hdr = bmi.bmiHeader;
hdr.biSize = sizeof(bmi.bmiHeader);
hdr.biWidth = widthPart;
hdr.biHeight = -1.0 * heightPart;
hdr.biPlanes = 1;
hdr.biBitCount = 24;
hdr.biCompression = BI_RGB;
hdr.biSizeImage = 0;
hdr.biXPelsPerMeter = 0;
hdr.biYPelsPerMeter = 0;
hdr.biClrUsed = 0;
hdr.biClrImportant = 0;
unsigned char* bitmapBits = (unsigned char*)malloc(3 * widthPart * heightPart);
HDC hdc = GetDC(NULL);
HDC hBmpDc = CreateCompatibleDC(hdc);
HBITMAP hBmp = CreateDIBSection(hdc, &bmi, DIB_RGB_COLORS, (void**)&bitmapBits, nullptr, 0);
SelectObject(hBmpDc, hBmp);
BitBlt(hBmpDc, 0, 0, widthPart, heightPart, hdc, startPos.x, startPos.y, SRCCOPY);
- int bytes = widthPart * heightPart * 3;
- // invert R and B chanels
- for (unsigned i = 0; i< bytes - 2; i += 3)
- {
- int tmp = bitmapBits[i + 2];
- bitmapBits[i + 2] = bitmapBits[i];
- bitmapBits[i] = tmp;
- }
+ unsigned stride = (widthPart * (hdr.biBitCount / 8) + 3) & ~3;
+ // invert R and B chanels
+ for (unsigned row = 0; row < heightPart; ++row) {
+ for (unsigned col = 0; col < widthPart; ++col) {
+ // Calculate the pixel index into the buffer, taking the alignment into account
+ const size_t index{ row * stride + col * hdr.biBitCount / 8 };
+ std::swap(bitmapBits[index], bitmapBits[index + 2]);
+ }
+ }
int texture = SOIL_save_image(charPath, SOIL_SAVE_TYPE_BMP, widthPart, heightPart, 3, bitmapBits);
return texture;
When I run this with even values of widthPart and heightPart, it works perfectly. But if either of them is odd, I get broken BMPs like these:
I have checked the conversion code twice, but it seems to me that the cause is in how I use the blit functions. The RGB conversion does not affect the problem. What could be the reason? Is this the right way to blit an area with BitBlt?
Update: Whether the numbers are even or odd makes no difference. A correct picture is produced when the two numbers are equal. I don't know where the problem is.
SOIL_save_image functions check parameters for errors and send to stbi_write_bmp:
int stbi_write_bmp(char *filename, int x, int y, int comp, void *data)
int pad = (-x*3) & 3;
return outfile(filename,-1,-1,x,y,comp,data,0,pad,
"11 4 22 4" "4 44 22 444444",
'B', 'M', 14+40+(x*3+pad)*y, 0,0, 14+40, // file header
40, x,y, 1,24, 0,0,0,0,0,0); // bitmap header
outfile function:
static int outfile(char const *filename, int rgb_dir, int vdir, int x, int
y, int comp, void *data, int alpha, int pad, char *fmt, ...)
FILE *f = fopen(filename, "wb");
if (f) {
va_list v;
va_start(v, fmt);
writefv(f, fmt, v);
return f != NULL;
The broken bitmap images are the result of a disagreement about data layout between Windows bitmaps and what the SOIL library expects [1]. The pixel buffer returned from CreateDIBSection follows the Windows rules (see Bitmap Header Types):
The scan lines are DWORD aligned [...]. They must be padded for scan line widths, in bytes, that are not evenly divisible by four [...].
In other words: The width, in bytes, of each scanline is (biWidth * (biBitCount / 8) + 3) & ~3. The SOIL library, on the other hand, doesn't expect pixel buffers to be DWORD aligned.
To fix this, the pixel data needs to be converted before being passed to SOIL, by stripping (potential) padding and exchanging the R and B color channels. The following code does so in place [2]:
unsigned stride = (widthPart * (hdr.biBitCount / 8) + 3) & ~3;
for (unsigned row = 0; row < heightPart; ++row) {
    for (unsigned col = 0; col < widthPart; ++col) {
        // Calculate the source pixel index, taking the alignment into account
        const size_t index_src{ row * stride + col * hdr.biBitCount / 8 };
        // Calculate the destination pixel index (no alignment)
        const size_t index_dst{ (row * widthPart + col) * (hdr.biBitCount / 8) };
        // Read color channels
        const unsigned char b{ bitmapBits[index_src] };
        const unsigned char g{ bitmapBits[index_src + 1] };
        const unsigned char r{ bitmapBits[index_src + 2] };
        // Write color channels switching R and B, and remove padding
        bitmapBits[index_dst] = r;
        bitmapBits[index_dst + 1] = g;
        bitmapBits[index_dst + 2] = b;
    }
}
With this code, index_src is the index into the pixel buffer, which includes padding to enforce proper DWORD alignment. index_dst is the index without any padding applied. Moving pixels from index_src to index_dst removes (potential) padding.
[1] The tell-tale sign is scanlines shifting to the left or right by one or two pixels (or individual color channels shifting at different speeds). This is usually a safe indication that there is a disagreement about scanline alignment.
[2] This operation is destructive, i.e. the pixel buffer can no longer be passed to Windows GDI functions once converted, although the original data can be reconstructed, even if that is a bit more involved.
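For readers who want to try the stride arithmetic without any Windows headers, here is a portable sketch. The names dib_stride and dib_to_packed_rgb are hypothetical, and unlike the in-place code above it writes into a separate buffer, so the padded original stays usable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes per scanline of a Windows DIB: each row is padded to a multiple of 4.
inline size_t dib_stride(size_t width, size_t bits_per_pixel) {
    return (width * (bits_per_pixel / 8) + 3) & ~static_cast<size_t>(3);
}

// Copies a padded 24-bit BGR pixel buffer into a tightly packed RGB buffer,
// dropping the row padding and swapping the R and B channels on the way.
inline std::vector<uint8_t> dib_to_packed_rgb(const uint8_t* dib,
                                              size_t width, size_t height) {
    assert(dib != nullptr);
    const size_t stride = dib_stride(width, 24);
    std::vector<uint8_t> out(width * height * 3);
    for (size_t row = 0; row < height; ++row) {
        for (size_t col = 0; col < width; ++col) {
            const uint8_t* src = dib + row * stride + col * 3;  // padded layout
            uint8_t* dst = &out[(row * width + col) * 3];       // packed layout
            dst[0] = src[2];  // R
            dst[1] = src[1];  // G
            dst[2] = src[0];  // B
        }
    }
    return out;
}
```

For a 24-bit image the stride equals width * 3 only when width is a multiple of 4: a width of 3 gives 12 bytes per row (9 pixel bytes plus 3 of padding) and a width of 5 gives 16.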

AccessViolationException using BitmapData in C++

Below is my program. I am trying to apply a grayscale filter using the BitmapData class in Visual C++. I am getting an AccessViolationException at the line tagged by the comment. I have tried using /clr:safe and /clr:pure, but to no avail. In C# this would be solved by using an unsafe block. Any suggestions? None of the other solutions on related questions worked.
Bitmap^ bmp = gcnew Bitmap(pictureBox1->Image);
BitmapData^ data = bmp->LockBits(Rectangle(0,0,bmp->Width,bmp->Height), ImageLockMode::ReadWrite, PixelFormat::Format24bppRgb);
int blue=0, green=0, red=0;
System::IntPtr s = data->Scan0;
int* P = (int*)(void*)s;
for (int i =0; i<bmp->Height;i++)
for (int j = 0; j < bmp->Width*3; j++)
blue = (int)P[0]; //access violation exception
green =(int )P[1];
red = (int)P[2];
int avg = (int)((blue + green + red) / 3);
P[0] = avg;
P[1] = avg;
P[2] = avg;
P +=3;
pictureBox1->Image = bmp;
You are using an int* when you should be using a byte*. Your pixels are three bytes each, one byte per channel. Your int is (likely) 4 bytes, so P[0] reads an entire pixel plus one byte past it. This is why you get an access violation; you are overrunning the bounds of the image buffer.
When you increment a pointer, each step adds sizeof *P bytes to it. In this case, P += 3 advances the pointer P by 12 bytes. That is far too much, and you'll never be able to read a single pixel (or channel) of a 24bpp image with an int*. Your inner loop also runs Width * 3 times while advancing a whole pixel per iteration, so even with a byte pointer it would walk three times too far. Finally, you are assuming that your stride is Width * 3, which may or may not be correct (bitmap rows are 4-byte aligned).
Byte* base = static_cast<Byte*>(data->Scan0.ToPointer());
int stride = data->Stride;
for (int y = 0; y < data->Height; ++y) {
    Byte* src = base + y * stride;
    for (int x = 0; x < data->Width; ++x, src += 3) {
        // bitmaps are stored in BGR order (though not really important here).
        // I'm assuming a 24bpp bitmap.
        Byte b = src[0];
        Byte g = src[1];
        Byte r = src[2];
        int average = (r + g + b) / 3;
        src[0] = src[1] = src[2] = (Byte)average;
    }
}
bmp->UnlockBits(data); // release the lock before using the bitmap again
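The same stride-aware loop can be exercised outside of .NET on a plain byte buffer. This is a sketch in standard C++ with a hypothetical function name; it assumes a 24bpp buffer and a non-negative stride, and it deliberately never touches the padding bytes at the end of each row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Replaces every pixel of a 24bpp buffer with the average of its three
// channels. stride is the distance in bytes between the starts of two
// consecutive rows and may be larger than width * 3.
inline void grey_average_bgr24(uint8_t* base, int width, int height, int stride) {
    assert(stride >= width * 3);
    for (int y = 0; y < height; ++y) {
        uint8_t* p = base + static_cast<size_t>(y) * stride;
        for (int x = 0; x < width; ++x, p += 3) {
            const int average = (p[0] + p[1] + p[2]) / 3;
            p[0] = p[1] = p[2] = static_cast<uint8_t>(average);
        }
    }
}
```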

array, copy pixels to correct index, algorithm

I have a 2x2 image, so the pixel count is 4.
Each pixel is 4 bytes,
so I have an array of 16 bytes, mas[16]: width * height * 4 = 16.
I want to make the same image scaled up by a factor of 2, which means that each pixel becomes four pixels.
The new array will have a size of 64 bytes, newMas[64]: width*2 * height*2 * 4.
The problem is that I can't copy the pixels into newMas correctly, in a way that works for images of different sizes.
This code copies the pixels to mas[16]:
size_t width = CGImageGetWidth(imgRef);
size_t height = CGImageGetHeight(imgRef);
const size_t bytesPerRow = width * 4;
const size_t bitmapByteCount = bytesPerRow * height;
size_t mas[bitmapByteCount];
UInt8* data = (UInt8*)CGBitmapContextGetData(bmContext);
for (size_t i = 0; i < bitmapByteCount; i +=4)
UInt8 a = data[i];
UInt8 r = data[i + 1];
UInt8 g = data[i + 2];
UInt8 b = data[i + 3];
mas[i] = a;
mas[i+1] = r;
mas[i+2] = g;
mas[i+3] = b;
In general, using the built-in image drawing API will be faster and less error-prone than writing your own image-manipulation code. There are at least three potential errors in the code above:
It assumes that there's no padding at the end of rows (iOS seems to pad up to a multiple of 16 bytes); you need to use CGImageGetBytesPerRow().
It assumes a fixed pixel format.
It gets the width/height from a CGImage but the data from a CGBitmapContext.
Assuming you have a UIImage,
CGRect r = {{0,0},img.size};
r.size.width *= 2;
r.size.height *= 2;
UIGraphicsBeginImageContext(r.size);
// This turns off interpolation in order to do pixel-doubling.
CGContextSetInterpolationQuality(UIGraphicsGetCurrentContext(), kCGInterpolationNone);
[img drawInRect:r];
UIImage * bigImg = UIGraphicsGetImageFromCurrentImageContext();
UIGraphicsEndImageContext();
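If you do want to copy the pixels by hand, the index mapping is: destination pixel (x, y) takes its value from source pixel (x / 2, y / 2). Below is a minimal C++ sketch of that mapping; the function name is hypothetical, and it assumes the caller passes the real source row length in bytes (which may include padding) and a fixed number of bytes per pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns a tightly packed copy of the image at twice the width and height,
// where every source pixel becomes a 2x2 square (nearest-neighbour scaling).
inline std::vector<uint8_t> double_pixels(const uint8_t* src,
                                          size_t width, size_t height,
                                          size_t src_bytes_per_row,
                                          size_t bytes_per_pixel = 4) {
    assert(src_bytes_per_row >= width * bytes_per_pixel);
    const size_t dst_bytes_per_row = width * 2 * bytes_per_pixel;
    std::vector<uint8_t> dst(dst_bytes_per_row * height * 2);
    for (size_t y = 0; y < height * 2; ++y) {
        const uint8_t* src_row = src + (y / 2) * src_bytes_per_row;
        uint8_t* dst_row = &dst[y * dst_bytes_per_row];
        for (size_t x = 0; x < width * 2; ++x) {
            std::memcpy(dst_row + x * bytes_per_pixel,
                        src_row + (x / 2) * bytes_per_pixel,
                        bytes_per_pixel);
        }
    }
    return dst;
}
```

For the 2x2 example from the question (16 source bytes) this produces the expected 64 bytes.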