OpenCL: Downsampling with bilinear interpolation

OpenCL: Downsampling with bilinear interpolation - c++

I've a problem with downsampling image with bilinear interpolation. I've read almost all relevant articles on stackoverflow and searched around in google, trying to solve or at least to find the problem in my OpenCL kernel. This is my main source for the theory. After I implemented this code in OpenCL:
__kernel void downsample(__global uchar* image, __global uchar* outputImage, __global int* width, __global int* height, __global float* factor){
//image vector containing original RGB values
//outputImage vector containing "downsampled" RGB mean values
//factor - downsampling factor, downscaling the image by factor: 1024*1024 -> 1024/factor * 1024/factor
int r = get_global_id(0);
int c = get_global_id(1); //current coordinates
int oWidth = get_global_size(0);
int olc, ohc, olr, ohr; //coordinates of the original image used for bilinear interpolation
int index; //linearized index of the point
uchar q11, q12, q21, q22;
float accurate_c, accurate_r; //the exact scaled point
int k;
accurate_c = convert_float(c*factor[0]);
olc=convert_int(accurate_c);
ohc=olc+1;
if(!(ohc<width[0]))
ohc=olc;
accurate_r = convert_float(r*factor[0]);
olr=convert_int(accurate_r);
ohr=olr+1;
if(!(ohr<height[0]))
ohr=olr;
index= (c + r*oWidth)*3; //3 bytes per pixel
//Compute RGB values: take a central mean RGB values among four points
for(k=0; k<3; k++){
q11=image[(olc + olr*width[0])*3+k];
q12=image[(olc + ohr*width[0])*3+k];
q21=image[(ohc + olr*width[0])*3+k];
q22=image[(ohc + ohr*width[0])*3+k];
outputImage[index+k] = convert_uchar((q11*(ohc - accurate_c)*(ohr - accurate_r) +
q21*(accurate_c - olc)*(ohr - accurate_r) +
q12*(ohc - accurate_c)*(accurate_r - olr) +
q22*(accurate_c - olc)*(accurate_r - olr)));
}
}
The kernel works with factor = 2, 4, 5, 6 but not with factor = 3, 7 (I get missing pixels, and the image appears little bit skewed) whereas the "identical" code written in c++ works fine with all factor values. I cann't explain it to myself why that happens in opencl. I attach my full code project here

Related

Efficient C++ code (no libs) for image transformation into custom RGB pixel greyscale

Currently working on C++ implementation of ToGreyscale method and I want to ask what is the most efficient way to transform "unsigned char* source" using custom RGB input params.
Below is a current idea, but maybe using a Vector would be better?
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
float r = pixel[0];
float g = pixel[1];
float b = pixel[2];
// Do something with r, g, b
}
}

The most efficient single threaded CPU implementation, is using manually optimized SIMD implementation.
SIMD extensions are specific for processor architecture.
For x86 there are SSE and AVX extensions, NEON for ARM, AltiVec for PowerPC...
In many cases the compiler is able to generate very efficient code that utilize the SIMD extension without any knowledge of the programmer (just by setting compiler flags).
There are also many cases where the compiler can't generate efficient code (many reasons for that).
When you need to get very high performance, it's recommended to implement it using C intrinsic functions.
Most of intrinsic instructions are converted directly to assembly instructions (instruction to instruction), without the need to know assembly.
There are many downsides of using intrinsic (compared to generic C implementation): Implementation is complicated to code and to maintain, and the code is platform specific and not portable.
A good reference for x86 intrinsics is Intel Intrinsics Guide.
The posted code uses SSE instruction set extension.
The implementation is very efficient, but not the top performance (using AVX2 for example may be faster, but less portable).
For better efficiency my code uses fixed point implementation.
In many cases fixed point is more efficient than floating point (but more difficult).
The most complicated part of the specific algorithm is reordering the RGB elements.
When RGB elements are ordered in triples r,g,b,r,g,b,r,g,b... you need to reorder them to rrrr... gggg... bbbb... in order of utilizing SIMD.
Naming conventions:
Don't be scared by the long weird variable names.
I am using this weird naming convention (it's my convention), because it helps me follow the code.
r7_r6_r5_r4_r3_r2_r1_r0 for example marks an XMM register with 8 uint16 elements.
The following implementation includes code with and without SSE intrinsics:
//Optimized implementation (use SSE intrinsics):
//----------------------------------------------
#include <intrin.h>
//Convert from RGBRGBRGB... to RRR..., GGG..., BBB...
//Input: Two XMM registers (24 uint8 elements) ordered RGBRGB...
//Output: Three XMM registers ordered RRR..., GGG... and BBB...
// Unpack the result from uint8 elements to uint16 elements.
static __inline void GatherRGBx8(const __m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
const __m128i b7_g7_r7_b6_g6_r6_b5_g5,
__m128i &r7_r6_r5_r4_r3_r2_r1_r0,
__m128i &g7_g6_g5_g4_g3_g2_g1_g0,
__m128i &b7_b6_b5_b4_b3_b2_b1_b0)
{
//Shuffle mask for gathering 4 R elements, 4 G elements and 4 B elements (also set last 4 elements to duplication of first 4 elements).
const __m128i shuffle_mask = _mm_set_epi8(9,6,3,0, 11,8,5,2, 10,7,4,1, 9,6,3,0);
__m128i b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4 = _mm_alignr_epi8(b7_g7_r7_b6_g6_r6_b5_g5, r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, 12);
//Gather 4 R elements, 4 G elements and 4 B elements.
//Remark: As I recall _mm_shuffle_epi8 instruction is not so efficient (I think execution is about 5 times longer than other shuffle instructions).
__m128i r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0 = _mm_shuffle_epi8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, shuffle_mask);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4 = _mm_shuffle_epi8(b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4, shuffle_mask);
//Put 8 R elements in lower part.
__m128i b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0 = _mm_alignr_epi8(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 12);
//Put 8 G elements in lower part.
__m128i g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 4);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0 = _mm_alignr_epi8(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4, g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz, 12);
//Put 8 B elements in lower part.
__m128i b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 4);
__m128i zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0 = _mm_alignr_epi8(zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4, b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz, 12);
//Unpack uint8 elements to uint16 elements.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_cvtepu8_epi16(b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_cvtepu8_epi16(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_cvtepu8_epi16(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0);
}
//Calculate 8 Grayscale elements from 8 RGB elements.
//Y = 0.2989*R + 0.5870*G + 0.1140*B
//Conversion model used by MATLAB https://www.mathworks.com/help/matlab/ref/rgb2gray.html
static __inline __m128i Rgb2Yx8(__m128i r7_r6_r5_r4_r3_r2_r1_r0,
__m128i g7_g6_g5_g4_g3_g2_g1_g0,
__m128i b7_b6_b5_b4_b3_b2_b1_b0)
{
//Each coefficient is expanded by 2^15, and rounded to int16 (add 0.5 for rounding).
const __m128i r_coef = _mm_set1_epi16((short)(0.2989*32768.0 + 0.5)); //8 coefficients - R scale factor.
const __m128i g_coef = _mm_set1_epi16((short)(0.5870*32768.0 + 0.5)); //8 coefficients - G scale factor.
const __m128i b_coef = _mm_set1_epi16((short)(0.1140*32768.0 + 0.5)); //8 coefficients - B scale factor.
//Multiply input elements by 64 for improved accuracy.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_slli_epi16(r7_r6_r5_r4_r3_r2_r1_r0, 6);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_slli_epi16(g7_g6_g5_g4_g3_g2_g1_g0, 6);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_slli_epi16(b7_b6_b5_b4_b3_b2_b1_b0, 6);
//Use the special intrinsic _mm_mulhrs_epi16 that calculates round(r*r_coef/2^15).
//Calculate Y = 0.2989*R + 0.5870*G + 0.1140*B (use fixed point computations)
__m128i y7_y6_y5_y4_y3_y2_y1_y0 = _mm_add_epi16(_mm_add_epi16(
_mm_mulhrs_epi16(r7_r6_r5_r4_r3_r2_r1_r0, r_coef),
_mm_mulhrs_epi16(g7_g6_g5_g4_g3_g2_g1_g0, g_coef)),
_mm_mulhrs_epi16(b7_b6_b5_b4_b3_b2_b1_b0, b_coef));
//Divide result by 64.
y7_y6_y5_y4_y3_y2_y1_y0 = _mm_srli_epi16(y7_y6_y5_y4_y3_y2_y1_y0, 6);
return y7_y6_y5_y4_y3_y2_y1_y0;
}
//Convert single row from RGB to Grayscale (use SSE intrinsics).
//I0 points source row, and J0 points destination row.
//I0 -> rgbrgbrgbrgbrgbrgb...
//J0 -> yyyyyy
static void Rgb2GraySingleRow_useSSE(const unsigned char I0[],
const int image_width,
unsigned char J0[])
{
int x; //Index in J0.
int srcx; //Index in I0.
__m128i r7_r6_r5_r4_r3_r2_r1_r0;
__m128i g7_g6_g5_g4_g3_g2_g1_g0;
__m128i b7_b6_b5_b4_b3_b2_b1_b0;
srcx = 0;
//Process 8 pixels per iteration.
for (x = 0; x < image_width; x += 8)
{
//Load 8 elements of each color channel R,G,B from first row.
__m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0 = _mm_loadu_si128((__m128i*)&I0[srcx]); //Unaligned load of 16 uint8 elements
__m128i b7_g7_r7_b6_g6_r6_b5_g5 = _mm_loadu_si128((__m128i*)&I0[srcx+16]); //Unaligned load of (only) 8 uint8 elements (lower half of XMM register).
//Separate RGB, and put together R elements, G elements and B elements (together in same XMM register).
//Result is also unpacked from uint8 to uint16 elements.
GatherRGBx8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
b7_g7_r7_b6_g6_r6_b5_g5,
r7_r6_r5_r4_r3_r2_r1_r0,
g7_g6_g5_g4_g3_g2_g1_g0,
b7_b6_b5_b4_b3_b2_b1_b0);
//Calculate 8 Y elements.
__m128i y7_y6_y5_y4_y3_y2_y1_y0 = Rgb2Yx8(r7_r6_r5_r4_r3_r2_r1_r0,
g7_g6_g5_g4_g3_g2_g1_g0,
b7_b6_b5_b4_b3_b2_b1_b0);
//Pack uint16 elements to 16 uint8 elements (put result in single XMM register). Only lower 8 uint8 elements are relevant.
__m128i j7_j6_j5_j4_j3_j2_j1_j0 = _mm_packus_epi16(y7_y6_y5_y4_y3_y2_y1_y0, y7_y6_y5_y4_y3_y2_y1_y0);
//Store 8 elements of Y in row Y0, and 8 elements of Y in row Y1.
_mm_storel_epi64((__m128i*)&J0[x], j7_j6_j5_j4_j3_j2_j1_j0);
srcx += 24; //Advance 24 source bytes per iteration.
}
}
//Convert image I from pixel ordered RGB to Grayscale format.
//Conversion formula: Y = 0.2989*R + 0.5870*G + 0.1140*B (Rec.ITU-R BT.601)
//Formula is based on MATLAB rgb2gray function: https://www.mathworks.com/help/matlab/ref/rgb2gray.html
//Implementation uses SSE intrinsics for performance optimization.
//Use fixed point computations for better performance.
//I - Input image in pixel ordered RGB format.
//image_width - Number of columns of I.
//image_height - Number of rows of I.
//J - Destination "image" in Grayscale format.
//I is pixel ordered RGB color format (size in bytes is image_width*image_height*3):
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//
//J is in Grayscale format (size in bytes is image_width*image_height):
//YYYYYY
//YYYYYY
//YYYYYY
//YYYYYY
//
//Limitations:
//1. image_width must be a multiple of 8.
//2. I and J must be two separate arrays (in place computation is not supported).
//3. Rows of I and J are continues in memory (bytes stride is not supported, [but simple to add]).
//
//Comments:
//1. The conversion formula is incorrect, but it's a commonly used approximation.
//2. Code uses SSE 4.1 instruction set.
// Better performance can be archived using AVX2 implementation.
// (AVX2 is supported by Intel Core 4'th generation and above, and new AMD processors).
//3. The code is not the best SSE optimization:
// Uses unaligned load and store operations.
// Utilize only half XMM register in few cases.
// Instruction selection is probably sub-optimal.
void Rgb2Gray_useSSE(const unsigned char I[],
const int image_width,
const int image_height,
unsigned char J[])
{
//I0 points source image row.
const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
//J0 points destination image row.
unsigned char *J0; //J0 -> YYYYYY
int y; //Row index
//Process one row per iteration.
for (y = 0; y < image_height; y ++)
{
I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
//Convert row I0 from RGB to Grayscale.
Rgb2GraySingleRow_useSSE(I0,
image_width,
J0);
}
}
//Convert single row from RGB to Grayscale (simple C code without intrinsics).
static void Rgb2GraySingleRow_Simple(const unsigned char I0[],
const int image_width,
unsigned char J0[])
{
int x; //index in J0.
int srcx; //Index in I0.
srcx = 0;
//Process 1 pixel per iteration.
for (x = 0; x < image_width; x++)
{
float r = (float)I0[srcx]; //Load red pixel and convert to float
float g = (float)I0[srcx+1]; //Green
float b = (float)I0[srcx+2]; //Blue
float gray = 0.2989f*r + 0.5870f*g + 0.1140f*b; //Convert to Grayscale (use BT.601 conversion coefficients).
J0[x] = (unsigned char)(gray + 0.5f); //Add 0.5 for rounding.
srcx += 3; //Advance 3 source bytes per iteration.
}
}
//Convert RGB to Grayscale using simple C code (without SIMD intrinsics).
//Use as reference (for time measurements).
void Rgb2Gray_Simple(const unsigned char I[],
const int image_width,
const int image_height,
unsigned char J[])
{
//I0 points source image row.
const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
//J0 points destination image row.
unsigned char *J0; //J0 -> YYYYYY
int y; //Row index
//Process one row per iteration.
for (y = 0; y < image_height; y ++)
{
I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
//Convert row I0 from RGB to Grayscale.
Rgb2GraySingleRow_Simple(I0,
image_width,
J0);
}
}
In my machine, the manually optimized code is about 3 times faster.

You can find medium value between RGB channels, then, assign the result to each channel.
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
float grayscaleValue = 0;
for (int k = 0; k < 3; k++) {
grayscaleValue += pixel[k];
}
grayscaleValue /= 3;
for (int k = 0; k < 3; k++) {
pixel[k] = grayscaleValue;
}
}
}

(C++)(Visual Studio) Change RGB to Grayscale

I am accessing the image like so:
pDoc = GetDocument();
int iBitPerPixel = pDoc->_bmp->bitsperpixel; // used to see if grayscale(8 bits) or RGB (24 bits)
int iWidth = pDoc->_bmp->width;
int iHeight = pDoc->_bmp->height;
BYTE *pImg = pDoc->_bmp->point; // pointer used to point at pixels in the image
int Wp = iWidth;
const int area = iWidth * iHeight;
int r; // red pixel value
int g; // green pixel value
int b; // blue pixel value
int gray; // gray pixel value
BYTE *pImgGS = pImg; // grayscale image pixel array
and attempting to change the rgb image to gray like so:
// convert RGB values to grayscale at each pixel, then put in grayscale array
for (int i = 0; i<iHeight; i++)
for (int j = 0; j<iWidth; j++)
{
r = pImg[i*iWidth * 3 + j * 3 + 2];
g = pImg[i*iWidth * 3 + j * 3 + 1];
b = pImg[i*Wp + j * 3];
r * 0.299;
g * 0.587;
b * 0.144;
gray = std::round(r + g + b);
pImgGS[i*Wp + j] = gray;
}
finally, this is how I try to draw the image:
//draw the picture as grayscale
for (int i = 0; i < iHeight; i++) {
for (int j = 0; j < iWidth; j++) {
// this should set every corresponding grayscale picture to the current picture as grayscale
pImg[i*Wp + j] = pImgGS[i*Wp + j];
}
}
}
original image:
and the resulting image that I get is this:

First check if image type is 24 bits per pixels.
Second, allocate memory to pImgGS;
BYTE* pImgGS = (BTYE*)malloc(sizeof(BYTE)*iWidth *iHeight);
Please refer this article to see how bmp data is saved. bmp images are saved upside down. Also, first 54 byte of information is BITMAPFILEHEADER.
Hence you should access values in following way,
double r,g,b;
unsigned char gray;
for (int i = 0; i<iHeight; i++)
{
for (int j = 0; j<iWidth; j++)
{
r = (double)pImg[(i*iWidth + j)*3 + 2];
g = (double)pImg[(i*iWidth + j)*3 + 1];
b = (double)pImg[(i*iWidth + j)*3 + 0];
r= r * 0.299;
g= g * 0.587;
b= b * 0.144;
gray = floor((r + g + b + 0.5));
pImgGS[(iHeight-i-1)*iWidth + j] = gray;
}
}
If there is padding present, then first determine padding and access in different way. Refer this to understand pitch and padding.
double r,g,b;
unsigned char gray;
long index=0;
for (int i = 0; i<iHeight; i++)
{
for (int j = 0; j<iWidth; j++)
{
r = (double)pImg[index+ (j)*3 + 2];
g = (double)pImg[index+ (j)*3 + 1];
b = (double)pImg[index+ (j)*3 + 0];
r= r * 0.299;
g= g * 0.587;
b= b * 0.144;
gray = floor((r + g + b + 0.5));
pImgGS[(iHeight-i-1)*iWidth + j] = gray;
}
index =index +pitch;
}
While drawing image,
as pImg is 24bpp, you need to copy gray values thrice to each R,G,B channel. If you ultimately want to save grayscale image in bmp format, then again you have to write bmp data upside down or you can simply skip that step in converting to gray here:
pImgGS[(iHeight-i-1)*iWidth + j] = gray;

tl; dr:
Make one common path. Convert everything to 32-bits in a well-defined manner, and do not use image dimensions or coordinates. Refactor the YCbCr conversion ( = grey value calculation) into a separate function, this is easier to read and runs at exactly the same speed.
The lengthy stuff
First, you seem to have been confused with strides and offsets. The artefact that you see is because you accidentially wrote out one value (and in total only one third of the data) when you should have written three values.
One can get confused with this easily, but here it happened because you do useless stuff that you needed not do in the first place. You are iterating coordinates left to right, top-to-bottom and painstakingly calculate the correct byte offset in the data for each location.
However, you're doing a full-screen effect, so what you really want is iterate over the complete image. Who cares about the width and height? You know the beginning of the data, and you know the length. One loop over the complete blob will do the same, only faster, with less obscure code, and fewer opportunities of getting something wrong.
Next, 24-bit bitmaps are common as files, but they are rather unusual for in-memory representation because the format is nasty to access and unsuitable for hardware. Drawing such a bitmap will require a lot of work from the driver or the graphics hardware (it will work, but it will not work well). Therefore, 32-bit depth is usually a much better, faster, and more comfortable choice. It is much more "natural" to access program-wise.
You can rather trivially convert 24-bit to 32-bit. Iterate over the complete bitmap data and write out a complete 32-bit word for each 3 byte-tuple read. Windows bitmaps ignore the A channel (the highest-order byte), so just leave it zero, or whatever.
Also, there is no such thing as a 8-bit greyscale bitmap. This simply doesn't exist. Although there exist bitmaps that look like greyscale bitmaps, they are in reality paletted 8-bit bitmaps where (incidentially) the bmiColors member contains all greyscale values.
Therefore, unless you can guarantee that you will only ever process images that you have created yourself, you cannot just rely that e.g. the values 5 and 73 correspond to 5/255 and 73/255 greyscale intensity, respectively. That may be the case, but it is in general a wrong assumption.
In order to be on the safe side as far as correctness goes, you must convert your 8-bit greyscale bitmaps to real colors by looking up the indices (the bitmap's grey values are really indices) in the palette. Otherwise, you could be loading a greyscale image where the palette is the other way around (so 5 would mean 250 and 250 would mean 5), or a bitmap which isn't greyscale at all.
So... you want to convert 24-bit and you want to convert 8-bit bitmaps, both to 32-bit depth. That means you do all the annoying what-if stuff once at the beginning, and the rest is one identical common path. That's a good thing.
What you will be showing on-screen is always a 32-bit bitmap where the topmost byte is ignored, and the lower three are all the same value, resulting in what looks like a shade of grey. That's simple, and simple is good.
Note that if you do a BT.601 style YCbCr conversion (as indicated by your use of the constants 0.299, 0.587, and 0.144), and if your 8-bit greyscale images are perceptive (this is something you must know, there is no way of telling from the file!), then for 100% correctness, you need to to the inverse transformation when converting from paletted 8-bit to RGB. Otherwise, your final result will look like almost right, but not quite. If your 8-bit greycales are linear, i.e. were created without using the above constants (again, you must know, you cannot tell from the image), you need to copy everything as-is (here, doing the conversion would make it look almost-but-not-quite right).
About the RGB-to-greyscale conversion, you do not need an extra greyscale bitmap just to hold the values that you never need again afterwards. You can read the three color values from the loaded bitmap, calculate Y, and directly build the 32-bit ARGB word, which you then write out to the final bitmap. This saves one entirely useless round-trip to memory which is not necessary.
Something like this:
uint32_t* out = (uint32_t*) output_bitmap_data;
for(int i = 0; i < inputSize; i+= 3)
{
uint8_t Y = calc_greyscale(in[0], in[1], in[2]);
*out++ = (Y<<16) | (Y<<8) | Y;
}
Alternatively, you can also do the from-whatever-to-32 conversion, and then do the to-greyscale conversion in-place there. This, in turn, introduces an extra round-trip to memory, but the code becomes much, much easier overall.

rotate pixel array in any given degree (e.g. 45 degree) in c++

I am Trying to rotate an RGB/RGBA image in 45 degree in c++.
I have the pixels of the image stored in unsigned char *pBuffer.
I have found this code for 90 degree rotation -
void rotate90(unsigned char *buffer, const unsigned int width, const unsigned int height)
{
const unsigned int sizeBuffer = width * height * 3;
unsigned char *tempBuffer = new unsigned char[sizeBuffer];
for (int y = 0, destinationColumn = height - 1; y < height; ++y, --destinationColumn)
{
int offset = y * width;
for (int x = 0; x < width; x++)
{
tempBuffer[(x * height) + destinationColumn] = buffer[offset + x];
}
}
// Copy rotated pixels
memcpy(buffer, tempBuffer, sizeBuffer);
delete[] tempBuffer;
}
But i want to rotate my image in any given degree for rotation in C++

Short answer. You need to implement some kind of interpolation.
Long answer. First of all keep in mind that resulting bitmap will have a different size. For example if you have a 100x100 pixel image the 45 degrees rotated image will have size of 141x141 pixels, but only the central part will have original pixels, the cornerswill be empty (white? transparent? is up to you).
Regarding how to compute the central pixels I suggest you to do as follows: for each pixel in the destination picture implements the rotation that maps it back in the original one. This implies floating point computations and you will end with a result like 32.4;12.7 Now you have a few choices. The simplest is just round the result to the closest integer approximation and get that pixel. This will give a grained result but may be acceptable in come cases. Or you can get the 4 closest pixel and interpolate them. Linear interpolation is an option but gives a somewhat blurry result, other schemes are possible (cubic)

C++AMP Computing gradient using texture on a 16 bit image

I am working with depth images retrieved from kinect which are 16 bits. I found some difficulties on making my own filters due to the index or the size of the images.
I am working with Textures because allows to work with any bit size of images.
So, I am trying to compute an easy gradient to understand what is wrong or why it doesn't work as I expected.
You can see that there is something wrong when I use y dir.
For x:
For y:
That's my code:
typedef concurrency::graphics::texture<unsigned int, 2> TextureData;
typedef concurrency::graphics::texture_view<unsigned int, 2> Texture
cv::Mat image = cv::imread("Depth247.tiff", CV_LOAD_IMAGE_ANYDEPTH);
//just a copy from another image
cv::Mat image2(image.clone() );
concurrency::extent<2> imageSize(640, 480);
int bits = 16;
const unsigned int nBytes = imageSize.size() * 2; // 614400
{
uchar* data = image.data;
// Result data
TextureData texDataD(imageSize, bits);
Texture texR(texDataD);
parallel_for_each(
imageSize,
[=](concurrency::index<2> idx) restrict(amp)
{
int x = idx[0];
int y = idx[1];
// 65535 is the maxium value that can take a pixel with 16 bits (2^16 - 1)
int valX = (x / (float)imageSize[0]) * 65535;
int valY = (y / (float)imageSize[1]) * 65535;
texR.set(idx, valX);
});
//concurrency::graphics::copy(texR, image2.data, imageSize.size() *(bits / 8u));
concurrency::graphics::copy_async(texR, image2.data, imageSize.size() *(bits) );
cv::imshow("result", image2);
cv::waitKey(50);
}
Any help will be very appreciated.

Your indexes are swapped in two places.
int x = idx[0];
int y = idx[1];
Remember that C++AMP uses row-major indices for arrays. Thus idx[0] refers to row, y axis. This is why the picture you have for "For x" looks like what I would expect for texR.set(idx, valY).
Similarly the extent of image is also using swapped values.
int valX = (x / (float)imageSize[0]) * 65535;
int valY = (y / (float)imageSize[1]) * 65535;
Here imageSize[0] refers to the number of columns (the y value) not the number of rows.
I'm not familiar with OpenCV but I'm assuming that it also uses a row major format for cv::Mat. It might invert the y axis with 0, 0 top-left not bottom-left. The Kinect data may do similar things but again, it's row major.
There may be other places in your code that have the same issue but I think if you double check how you are using index and extent you should be able to fix this.

OpenCV and Unsharp Masking Like Adobe Photoshop

I am trying to implement unsharp masking like it's done in Adobe Photoshop. I gathered a lot of information on the interent but I'm not sure if I'm missing something. Here's the code:
void unsharpMask( cv::Mat* img, double amount, double radius, double threshold ) {
// create blurred img
cv::Mat img32F, imgBlur32F, imgHighContrast32F, imgDiff32F, unsharpMas32F, colDelta32F, compRes, compRes32F, prod;
double r = 1.5;
img->convertTo( img32F, CV_32F );
cv::GaussianBlur( img32F, imgBlur32F, cv::Size(0,0), radius );
cv::subtract( img32F, imgBlur32F, unsharpMas32F );
// increase contrast( original, amount percent )
imgHighContrast32F = img32F * amount / 100.0f;
cv::subtract( imgHighContrast32F, img32F, imgDiff32F );
unsharpMas32F /= 255.0f;
cv::multiply( unsharpMas32F, imgDiff32F, colDelta32F );
cv::compare( cv::abs( colDelta32F ), threshold, compRes, cv::CMP_GT );
compRes.convertTo( compRes32F, CV_32F );
cv::multiply( compRes32F, colDelta32F, prod );
cv::add( img32F, prod, img32F );
img32F.convertTo( *img, CV_8U );
}
At the moment I am testing with a grayscale image. If i try the exact same parameters in Photoshop I get much better result. My own code leads to noisy images. What am I doing wrong.
The 2nd question is, how i can apply unsharp masking on RGB images? Do I have to unsharp mask each of the 3 channels or would it be better in another color space? How are these things done in Photoshop?
Thanks for your help!

I'm trying to replicate Photoshop's Unsharp Mask as well.
Let's ignore the Threshold for a second.
I will show you how to replicate Photoshop's Unsharp Mask using its Gaussian Blur.
Assuming O is the original image layer.
Create a new layer GB which is a Gaussian Blur applied on O.
Create a new layer which is O - GB (Using Apply Image).
Create a new layer by inverting GB - invGB.
Create a new layer which is O + invGB using Image Apply.
Create a new layer which is inversion of the previous layer, namely inv(O + invGB).
Create a new layer which is O + (O - GB) - inv(O + invGB).
When you do that in Photoshop you'll get a perfect reproduction of the Unsharp Mask.
If you do the math recalling that inv(L) = 1 - L you will get that the Unsharp Mask is
USM(O) = 3O - 2B.
Yet when I do that directly in MATLAB I don't get Photoshop's results.
Hopefully someone will know the exact math.
Update
OK,
I figured it out.
In Photoshop USM(O) = O + (2 * (Amount / 100) * (O - GB))
Where GB is a Gaussian Blurred version of O.
Yet, in order to replicate Photoshop's results you must do the steps above and clip the result of each step into [0, 1] as done in Photoshop.

According to docs:
C++: void GaussianBlur(InputArray src, OutputArray dst, Size ksize,
double sigmaX, double sigmaY=0, int borderType=BORDER_DEFAULT )
4th parameter is not "radius" it is "sigma" - gaussian kernel standard deviation. Radius is rather "ksize". Anyway Photoshop is not open source, hence we can not be sure they use the same way as OpenCV to calculate radius from sigma.
Channels
Yes you should apply sharp to any or to all channels, it depends on your purpose. Sure you can use any space: if you want sharp only brightness-component and don't want to increase color noise you can covert it to HSL or Lab-space and sharp L-channel only (Photoshop has all this options too).

In response to #Royi, the 2x multiplier results from assuming no clamping in this formula:
USM(Original) = Original + Amount / 100 * ((Original - GB) - (1 - (Original + (1 - GB))))
Ignoring clamping this incorrectly reduces to:
USM(Original) = Original + 2 * Amount / 100 * (Original - GB)
However, as you also point out, (Original - GB) and (Original + inv(GB)) are clamped to [0, 1]:
USM(Original) = Original + Amount / 100 *
(Max(0, Min(1, Original - GB)) - (1 - (Max(0, Min(1, Original + (1 - GB))))))
This correctly reduces to:
USM(Original) = Original + Amount / 100 * (Original - GB)
Here is an example illustrating why:
https://legacy.imagemagick.org/discourse-server/viewtopic.php?p=133597#p133597

Here's the code what I have done.
I am using this code to implement Unsharp Mask and it is working well for me.
Hope it is useful for you.
void USM(cv::Mat &O, int d, int amp, int threshold)
{
cv::Mat GB;
cv::Mat O_GB;
cv::subtract(O, GB, O_GB);
cv::Mat invGB = cv::Scalar(255) - GB;
cv::add(O, invGB, invGB);
invGB = cv::Scalar(255) - invGB;
for (int i = 0; i < O.rows; i++)
{
for (int j = 0; j < O.cols; j++)
{
unsigned char o_rgb = O.at<unsigned char>(i, j);
unsigned char d_rgb = O_GB.at<unsigned char>(i, j);
unsigned char inv_rgb = invGB.at<unsigned char>(i, j);
int newVal = o_rgb;
if (d_rgb >= threshold)
{
newVal = o_rgb + (d_rgb - inv_rgb) * amp;
if (newVal < 0) newVal = 0;
if (newVal > 255) newVal = 255;
}
O.at<unsigned char>(i, j) = unsigned char(newVal);
}
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js