I want to write to a texture from my DirectX 11 compute shader. However, I have no idea how to display the result on screen, nor am I sure what sort of buffer I should use for this.
Welcome to Stack Overflow :)
The resource type to choose is RWTexture2D<float4>, since you can display it directly on screen via a swap chain.
You can look at the DirectX SDK OIT sample:
They have declared a RWTexture2D<float4> frameBuffer that they access in the function SortAndRenderCS of OIT_CS.hlsl.
// convert the color to floats
float4 color[3];
color[0].r = (r0 >> 0 & 0xFF) / 255.0f;
color[0].g = (r0 >> 8 & 0xFF) / 255.0f;
color[0].b = (r0 >> 16 & 0xFF) / 255.0f;
color[0].a = (r0 >> 24 & 0xFF) / 255.0f;
color[1].r = (r1 >> 0 & 0xFF) / 255.0f;
color[1].g = (r1 >> 8 & 0xFF) / 255.0f;
color[1].b = (r1 >> 16 & 0xFF) / 255.0f;
color[1].a = (r1 >> 24 & 0xFF) / 255.0f;
color[2].r = (r2 >> 0 & 0xFF) / 255.0f;
color[2].g = (r2 >> 8 & 0xFF) / 255.0f;
color[2].b = (r2 >> 16 & 0xFF) / 255.0f;
color[2].a = (r2 >> 24 & 0xFF) / 255.0f;
float4 result = lerp(lerp(lerp(0, color[2], color[2].a), color[1], color[1].a), color[0], color[0].a);
result.a = 1.0f;
frameBuffer[nDTid.xy] = result;
As you can see, r0, r1 and r2 are uint values that hold RGBA colors (one byte per channel); each channel is extracted with shifts and masks and then normalized.
You don't need to do that if you already have float4 values, of course.
Then they do those lerps (for interpolation). Again, you shouldn't need to do that.
What matters for you is that they access frameBuffer using array notation with a uint2 for the coordinates.
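For reference, the shift-and-mask unpacking from the shader can be mirrored on the CPU side. This is only a sketch in C++: Float4 and unpack_rgba8 are hypothetical names, not part of the SDK sample, and the layout (R in the lowest byte, A in the highest) is taken from the snippet above.

```cpp
#include <cstdint>

// Hypothetical helper type: one normalized RGBA color, each channel in [0, 1].
struct Float4 { float r, g, b, a; };

// Unpack a 32-bit value that stores R in the lowest byte and A in the highest,
// the same layout the shader snippet assumes for r0, r1 and r2.
inline Float4 unpack_rgba8(std::uint32_t packed)
{
    return Float4{
        ((packed >> 0)  & 0xFF) / 255.0f,   // red
        ((packed >> 8)  & 0xFF) / 255.0f,   // green
        ((packed >> 16) & 0xFF) / 255.0f,   // blue
        ((packed >> 24) & 0xFF) / 255.0f,   // alpha
    };
}
```

If your compute shader already works with float4 colors, none of this is needed; you can assign them to the RWTexture2D directly.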
I am trying to implement a function that blends two colors encoded as RGB565 using alpha blending:
Crgb565 = (1-a)Argb565 + a*Brgb565
where a is the alpha parameter, and the alpha blending value of 0.0-1.0 is mapped to an unsigned char value in the range 0-32.
We can choose to use a five-bit representation for a instead, restricting it to the range 0-31 (effectively mapping to an alpha blending value of 0.0-0.96875).
Below is the code I am trying to implement. Can you suggest a better way with fewer temporary variables and better memory optimization (number of multiplications and required memory accesses)? Is my logic for alpha blending correct? I am not getting the expected output, so I seem to be missing something. Please review the code; every suggestion is appreciated. I also have some doubts about the alpha parameter, which I have put in the code comments. Is there any way to shorten the alpha blending equations (the division operation)?
=====================================================
unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha)
{
unsigned short res = 0;
// Alpha converted from [0..255] to [0..31] (8 bit to 5 bit)
/* I want the alpha parameter (0-32), do i need to add something in Alpha before right shift?? */
Alpha = Alpha >> 3;
// Split Image A into R, G, B components
/*Do I need to take it as unsigned short or uint8_t also work fine ??*/
unsigned short A_r = A >> 11;
unsigned short A_g = (A >> 5) & ((1u << 6) - 1); // ((1u << 6) - 1) --> 00000000 00111111
unsigned short A_b = A & ((1u << 5) - 1); // ((1u << 5) - 1) --> 00000000 00011111
// Split Image B into R, G, B components
unsigned short B_r = B >> 11;
unsigned short B_g = (B >> 5) & ((1u << 6) - 1);
unsigned short B_b = B & ((1u << 5) - 1);
// Alpha blend components
/*Do I need to use 255(8 bit) instead of 32(5 bit), Why we are dividing by it , I have taken the ref from internet , but need little bit more clarification ??*/
unsigned short uiC_r = (A_r * Alpha + B_r * (32 - Alpha)) / 32;
unsigned short uiC_g = (A_g * Alpha + B_g * (32 - Alpha)) / 32;
unsigned short uiC_b = (A_b * Alpha + B_b * (32 - Alpha)) / 32;
// Pack result
res= (unsigned short) ((uiC_r << 11) | (uiC_g << 5) | uiC_b);
return res;
}
=====================
EDIT:
Adding method 2. Is this approach correct?
Method 2:
// rrrrrggggggbbbbb
#define RB_MASK 63519 // 0b1111100000011111 --> hex :F81F
#define G_MASK 2016 // 0b0000011111100000 --> hex :07E0
#define RB_MUL_MASK 2032608 // 0b111110000001111100000 --> hex :1F03E0
#define G_MUL_MASK 64512 // 0b000001111110000000000 --> hex :FC00
unsigned short blend_rgb565(unsigned short A,unsigned short B,unsigned char Alpha) {
// Alpha converted from [0..255] to [0..31]
Alpha = Alpha >> 3;
uint8_t beta = 32 - Alpha;
// so (0..32)*Alpha + (0..32)*beta always in 0..32
return (unsigned short)
(
(
( ( Alpha * (uint32_t)( A & RB_MASK ) + beta * (uint32_t)( B & RB_MASK )) & RB_MUL_MASK )
|
( ( Alpha * ( A & G_MASK ) + beta * ( B & G_MASK )) & G_MUL_MASK )
)
>> 5 // removing the alpha component 5 bit
);
}
It's possible to reduce the multiplies from 6 to 2 if you space out the RGB values into 2 32-bit integers before multiplying:
unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha)
{
unsigned short res = 0;
// Alpha converted from [0..255] to [0..31] (8 bit to 5 bit)
Alpha = Alpha >> 3;
// Alpha = (Alpha + (Alpha >> 5)) >> 3; // map from 0-255 to 0-32 (if Alpha is unsigned short or larger)
// Space out A and B from RRRRRGGGGGGBBBBB to 00000RRRRR000000GGGGGG00000BBBBB
// 31 = 11111 binary
// 63 = 111111 binary
unsigned int A32 = (unsigned int)A;
unsigned int A_spaced = A32 & 31; // B
A_spaced |= (A32 & (63 << 5)) << 5; // G
A_spaced |= (A32 & (31 << 11)) << 11; // R
unsigned int B32 = (unsigned int)B;
unsigned int B_spaced = B32 & 31; // B
B_spaced |= (B32 & (63 << 5)) << 5; // G
B_spaced |= (B32 & (31 << 11)) << 11; // R
// multiply and add the alpha to give a result RRRRRrrrrr0GGGGGGgggggBBBBBbbbbb,
// where RGB are the most significant bits we want to keep
unsigned int C_spaced = (A_spaced * Alpha) + (B_spaced * (32 - Alpha));
// remap back to RRRRRGGGGGGBBBBB
res = (unsigned short)(((C_spaced >> 5) & 31) + ((C_spaced >> 10) & (63 << 5)) + ((C_spaced >> 16) & (31 << 11)));
return res;
}
You need to profile this to see whether it is actually faster: it assumes that the multiplications you save are slower than the extra bit manipulations that replace them.
"can you please suggest better way wrt less temp variable"
There is no advantage to removing temporary variables from the implementation. When you compile with optimizations turned on (e.g. -O2 or /O2), those temporary variables get optimized away.
Two adjustments I would make to your code:
Use uint16_t instead of unsigned short. On most platforms it won't matter, since sizeof(uint16_t)==sizeof(unsigned short), but it helps to be explicit.
There is no point in converting alpha from an 8-bit value to a 5-bit value. You'll get better blending accuracy if you let alpha keep its full range.
Some of your bit-shifting looks weird. It might work. But I use a simpler approach.
Here's an adjustment to your implementation:
#include <stdint.h>
#define MAKE_RGB565(r, g, b) ((uint16_t)(((r) << 11) | ((g) << 5) | (b)))
uint16_t blend_rgb565(uint16_t a, uint16_t b, uint8_t Alpha)
{
const uint8_t invAlpha = 255 - Alpha;
uint16_t A_r = a >> 11;
uint16_t A_g = (a >> 5) & 0x3f;
uint16_t A_b = a & 0x1f;
uint16_t B_r = b >> 11;
uint16_t B_g = (b >> 5) & 0x3f;
uint16_t B_b = b & 0x1f;
uint32_t C_r = (A_r * invAlpha + B_r * Alpha) / 255;
uint32_t C_g = (A_g * invAlpha + B_g * Alpha) / 255;
uint32_t C_b = (A_b * invAlpha + B_b * Alpha) / 255;
return MAKE_RGB565(C_r, C_g, C_b);
}
But the bigger issue is that this function works on exactly one pair of pixel colors. If you invoke it across an entire image or pair of images, the function-call overhead is going to be a major performance issue, even with compiler optimizations and inlining. So if you are calling this function rows x cols times, you should probably manually inline the code into the loop that enumerates every pixel of the image (or pair of images).
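As a rough illustration of inlining the blend into the pixel loop, here is a sketch in C++. The function name blend_rgb565_buffer and its parameters are made up for this example, and it assumes both images are stored as contiguous arrays of the same length. Whether it beats calling the per-pixel function depends on your compiler, so measure it.

```cpp
#include <cstddef>
#include <cstdint>

// Blend two RGB565 buffers of `count` pixels into `dst` in a single loop.
// alpha = 0 yields src_a, alpha = 255 yields src_b.
void blend_rgb565_buffer(const std::uint16_t* src_a, const std::uint16_t* src_b,
                         std::uint16_t* dst, std::size_t count, std::uint8_t alpha)
{
    const std::uint32_t w_b = alpha;          // weight of src_b
    const std::uint32_t w_a = 255u - alpha;   // weight of src_a, computed once

    for (std::size_t i = 0; i < count; ++i) {
        const std::uint32_t a = src_a[i];
        const std::uint32_t b = src_b[i];

        // Same per-channel arithmetic as the single-pixel version, written in place.
        const std::uint32_t red   = ((a >> 11) * w_a + (b >> 11) * w_b) / 255u;
        const std::uint32_t green = (((a >> 5) & 0x3Fu) * w_a + ((b >> 5) & 0x3Fu) * w_b) / 255u;
        const std::uint32_t blue  = ((a & 0x1Fu) * w_a + (b & 0x1Fu) * w_b) / 255u;

        dst[i] = static_cast<std::uint16_t>((red << 11) | (green << 5) | blue);
    }
}
```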
In the same vein as @samgak's answer, you can implement this more efficiently on a 64-bit architecture by "post-masking", as follows:
rrrrrggggggbbbbb
Replicate to a long long (by shifting or mapping the long long to four shorts)
---------------- rrrrrggggggbbbbb rrrrrggggggbbbbb rrrrrggggggbbbbb
Mask out the useless bits
---------------- rrrrr----------- -----gggggg----- -----------bbbbb
Multiply by α
-----------rrrrr rrrrr----------- ggggggggggg----- ------bbbbbbbbbb
Mask out the low order bits
-----------rrrrr ---------------- gggggg---------- ------bbbbb-----
Pack
rrrrrggggggbbbbb
Another saving is possible by rewriting
(1 - α) X + α Y
as
X + α (Y - X)
(or X - α (X - Y) to avoid negatives). This spares a multiply (at the expense of a comparison).
Update:
The "saving" above cannot work because the negatives should be handled component-wise.
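Applied to one channel at a time, the X + α (Y - X) form does work, because the sign is then handled per component. A minimal sketch in C++, assuming a 0-32 alpha; lerp_channel and blend_rgb565_lerp are hypothetical names. Note that the integer division truncates toward zero, so results can differ by one from the two-multiply form when Y < X.

```cpp
#include <cassert>
#include <cstdint>

// Blend one channel as X + alpha * (Y - X) / 32, with alpha in 0..32.
// Signed arithmetic is used because (y - x) may be negative.
inline int lerp_channel(int x, int y, int alpha)
{
    assert(alpha >= 0 && alpha <= 32);
    return x + (alpha * (y - x)) / 32;
}

// Apply the one-multiply form to each RGB565 channel separately.
// alpha = 0 yields a, alpha = 32 yields b.
inline std::uint16_t blend_rgb565_lerp(std::uint16_t a, std::uint16_t b, int alpha)
{
    const int red   = lerp_channel(a >> 11,         b >> 11,         alpha);
    const int green = lerp_channel((a >> 5) & 0x3F, (b >> 5) & 0x3F, alpha);
    const int blue  = lerp_channel(a & 0x1F,        b & 0x1F,        alpha);
    return static_cast<std::uint16_t>((red << 11) | (green << 5) | blue);
}
```

This keeps three multiplies per pixel instead of six, but it gives up the packed-multiply trick, so it is a different trade-off rather than an addition to it.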
I have this function for RGB blending. What I'm trying to do is process red and blue together to reduce the number of operations.
Here's the original code:
#define REDMASK (0xff0000)
#define GREENMASK (0x00ff00)
#define BLUEMASK (0x0000ff)
typedef unsigned int Pixel;
inline Pixel AddBlend( Pixel a_Color1, Pixel a_Color2 )
{
const unsigned int r = (a_Color1 & REDMASK) + (a_Color2 & REDMASK);
const unsigned int g = (a_Color1 & GREENMASK) + (a_Color2 & GREENMASK);
const unsigned int b = (a_Color1 & BLUEMASK) + (a_Color2 & BLUEMASK);
const unsigned r1 = (r & REDMASK) | (REDMASK * (r >> 24));
const unsigned g1 = (g & GREENMASK) | (GREENMASK * (g >> 16));
const unsigned b1 = (b & BLUEMASK) | (BLUEMASK * (b >> 8));
return (r1 + g1 + b1);
}
And here's what I have so far. My problem right now is that the colours are not blending correctly. What am I doing wrong here?
typedef unsigned int Pixel;
inline Pixel AddBlend( Pixel a_Color1, Pixel a_Color2 ){
const unsigned int rb = ( ( a_Color1 & 0xff00ff ) + ( a_Color2 & 0xff00ff ) );
const unsigned int g = ( a_Color1 & GREENMASK ) + ( a_Color2 & GREENMASK );
const unsigned rb1 = ( rb & 0xff00ff ) | ( 0xff00ff * ( rb >> 8 ));
const unsigned g1 = (g & GREENMASK) | (GREENMASK * (g >> 16));
return (rb1 + g1);
}
The (REDMASK * (r >> 24)) part of the original code clamps values that overflow. This works with one color component, but not with two. You'll need to split it into two parts, one to handle the red overflow and one for the blue. The red overflow can be handled as in the original (using rb in place of r), but the blue overflow needs a little adjustment to ignore any of the red contribution.
BLUEMASK * ((rb & 0x100) >> 8)
This results in
const unsigned rb1 = (rb & 0xff00ff) | (REDMASK * (rb >> 24)) | (BLUEMASK * ((rb & 0x100) >> 8));
Combining two colors like this works because there is a gap between red and blue that the overflow can occupy (the green bits). If you tried this with red/green or green/blue the overflow for the part stored in the lower byte would collide with the value for the part stored in the higher byte.
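Putting the pieces together, the combined red/blue version could look like the sketch below. It is written in C++ with constexpr constants instead of the question's macros; RBMASK is a name introduced here for 0xff00ff, and the pixel layout (0xRRGGBB with the top byte unused) is assumed from the masks in the question.

```cpp
#include <cstdint>

typedef std::uint32_t Pixel;

constexpr Pixel REDMASK   = 0xff0000;
constexpr Pixel GREENMASK = 0x00ff00;
constexpr Pixel BLUEMASK  = 0x0000ff;
constexpr Pixel RBMASK    = REDMASK | BLUEMASK;  // hypothetical name for 0xff00ff

// Saturating additive blend that sums red and blue in one addition.
inline Pixel AddBlend(Pixel a_Color1, Pixel a_Color2)
{
    const Pixel rb = (a_Color1 & RBMASK) + (a_Color2 & RBMASK);
    const Pixel g  = (a_Color1 & GREENMASK) + (a_Color2 & GREENMASK);

    // Red overflow carries into bit 24, blue overflow into bit 8 (the green gap).
    const Pixel rb1 = (rb & RBMASK)
                    | (REDMASK  * (rb >> 24))
                    | (BLUEMASK * ((rb >> 8) & 1));
    // Green overflow carries into bit 16.
    const Pixel g1 = (g & GREENMASK) | (GREENMASK * (g >> 16));

    return rb1 | g1;
}
```

For example, adding 0x0000ff and 0x000001 saturates blue at 0x0000ff instead of leaking a carry into green.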
I'm making a program and I need to paint a rectangle of the same color as the title bar.
If I try to get the color like this:
ARGB rgbActiveColor = GetSysColor(COLOR_ACTIVECAPTION);
ARGB rgbInactiveColor = GetSysColor(COLOR_INACTIVECAPTION);
rgbActiveColor |= 0xFF000000; // Because of alpha
rgbInactiveColor |= 0xFF000000;
I get a totally different color in Windows 8. It always returns an orange or brown color instead of the actual color (let's say, blue).
Using DwmGetColorizationColor works, but the color is darker because I need to eliminate alpha. I try to do it like this:
BYTE r = ((RED * ALPHA) + (255 * (255 - ALPHA))) / 255; // R' = (R * A) + (1 - A)
BYTE g = ((GREEN * ALPHA) + (255 * (255 - ALPHA))) / 255; // G' = (G * A) + (1 - A)
BYTE b = ((BLUE * ALPHA) + (255 * (255 - ALPHA))) / 255; // B' = (B * A) + (1 - A)
So, my problems are:
I don't know how I can correctly convert the return color from ARGB to RGB
I don't know how to get the inactive title bar color
EDIT: My ARGB to RGB code seems to work unless I set color intensity in Control Panel to max (because somehow alpha is 0, and the color is green) or min.
EDIT2: This is not a duplicate because this is specifically about W8+.
This is a hackish solution and will probably only work with Windows 8 and 8.1 (I'm going to test later with 10).
I analysed the windows colors and this is what I could see:
Active window title (or caption) color is the result of blending between 0xD9D9D9 and the color in \HKEY_CURRENT_USER\Software\Microsoft\Windows\DWM\ColorizationColor using the value in \HKEY_CURRENT_USER\Software\Microsoft\Windows\DWM\ColorizationColorBalance (it's on a 0-100 scale) as an "alpha".
Inactive windows have the color 0xEBEBEB
So...
if (fActive)
{
DWORD ColorizationColor;
DWORD ColorizationColorBalance;
DWORD size = sizeof(DWORD);
RegGetValue(HKEY_CURRENT_USER, L"Software\\Microsoft\\Windows\\DWM", L"ColorizationColor", RRF_RT_REG_DWORD, 0, &ColorizationColor, &size);
RegGetValue(HKEY_CURRENT_USER, L"Software\\Microsoft\\Windows\\DWM", L"ColorizationColorBalance", RRF_RT_REG_DWORD, 0, &ColorizationColorBalance, &size);
BYTE ALPHA = 255 * ColorizationColorBalance / 100; // Convert from 0-100 to 0-255
BYTE RED = (ColorizationColor >> 16) & 0xFF;
BYTE GREEN = (ColorizationColor >> 8) & 0xFF;
BYTE BLUE = ColorizationColor & 0xFF;
BYTE r = ((RED * ALPHA) + (0xD9 * (255 - ALPHA))) / 255;
BYTE g = ((GREEN * ALPHA) + (0xD9 * (255 - ALPHA))) / 255;
BYTE b = ((BLUE * ALPHA) + (0xD9 * (255 - ALPHA))) / 255;
graphics.FillRectangle(&SolidBrush(Color(r, g, b)), Rect(...));
}
else
{
graphics.FillRectangle(&SolidBrush(0xFFEBEBEB), Rect(...));
}
Because this probably won't work on Windows 7, different code should be used for different systems.
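The blend arithmetic itself does not depend on the Windows API, so it can be pulled into a small function and checked separately. This is only a sketch: blend_caption_color is a hypothetical name, the 0xD9 base gray and the 0-100 balance scale come from the observations above, and the clamp on balance is an added safety assumption.

```cpp
#include <cstdint>

// Blend a colorization color (0xAARRGGBB, alpha byte ignored) over a gray base.
// `balance` is on a 0..100 scale; returns the opaque result as 0x00RRGGBB.
inline std::uint32_t blend_caption_color(std::uint32_t colorization, unsigned balance,
                                         unsigned base_gray = 0xD9)
{
    if (balance > 100u) balance = 100u;           // assumption: clamp out-of-range input
    const unsigned alpha = 255u * balance / 100u; // convert from 0-100 to 0-255
    const unsigned inv   = 255u - alpha;

    const unsigned red   = (colorization >> 16) & 0xFFu;
    const unsigned green = (colorization >> 8)  & 0xFFu;
    const unsigned blue  =  colorization        & 0xFFu;

    const unsigned r = (red   * alpha + base_gray * inv) / 255u;
    const unsigned g = (green * alpha + base_gray * inv) / 255u;
    const unsigned b = (blue  * alpha + base_gray * inv) / 255u;

    return (r << 16) | (g << 8) | b;
}
```

With a balance of 0 the result is the plain 0xD9D9D9 gray, and with a balance of 100 it is the colorization color itself.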
I have two colors and I use this method to do a simple alpha blending:
int Color::blend(int col1, int col2)
{
float a1 = ((col1 & 0x000000FF) / 255.0);
return ((int)((((col1 & 0xFF000000) >> 24) * a1) + (((col2 & 0xFF000000) >> 24) * (1.0 - a1)))) << 24 |
((int)((((col1 & 0x00FF0000) >> 16) * a1) + (((col2 & 0x00FF0000) >> 16) * (1.0 - a1)))) << 16 |
((int)((((col1 & 0x0000FF00) >> 8 ) * a1) + (((col2 & 0x0000FF00) >> 8 ) * (1.0 - a1)))) << 8 | 255;
}
(The colors are in RGBA8888 format)
This works, but I was wondering: is this the fastest way, or is there a more efficient one?
You might be able to eke out a little more performance by representing a1*(2^24) as an integer, doing the arithmetic in integers, then shifting the result down by 24 bits. On modern architectures I doubt it would gain you much, though. If you want better performance, you'll really need to go for SIMD operations.
Oh, one thing: You should express the calculation of a1 as a1 = ((col1 & 0x000000FF) * (1.0 / 255.0)). That'll avoid an expensive FP division. (Compilers won't usually do that on their own, due to the potential loss of precision.)
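To make the fixed-point idea concrete, here is a sketch in C++ that scales the alpha to 2^24 and does the rest in integers. The function name blend_rgba8888_fixed is hypothetical, the channel layout (R in the top byte, alpha in the lowest) follows the question, and 64-bit intermediates are used to keep the arithmetic clearly in range. It rounds to nearest, so individual results may differ by one from the truncating float version.

```cpp
#include <cstdint>

// Blend col1 over col2 using col1's alpha, entirely in integer arithmetic.
// Colors are RGBA8888: R in bits 24-31, G in 16-23, B in 8-15, A in 0-7.
inline std::uint32_t blend_rgba8888_fixed(std::uint32_t col1, std::uint32_t col2)
{
    const std::uint64_t one  = 1ull << 24;                      // 1.0 in fixed point
    const std::uint64_t a    = ((col1 & 0xFFu) * one) / 255u;   // alpha scaled to 0..2^24
    const std::uint64_t inv  = one - a;                         // 1 - alpha
    const std::uint64_t half = one >> 1;                        // rounding term

    auto mix = [&](unsigned shift) -> std::uint32_t {
        const std::uint64_t c1 = (col1 >> shift) & 0xFFu;
        const std::uint64_t c2 = (col2 >> shift) & 0xFFu;
        return static_cast<std::uint32_t>((c1 * a + c2 * inv + half) >> 24);
    };

    // The result is fully opaque, as in the original function.
    return (mix(24) << 24) | (mix(16) << 16) | (mix(8) << 8) | 0xFFu;
}
```

As noted above, this is unlikely to gain much on modern hardware; treat it as a baseline to benchmark against a SIMD version.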
I'm working in C++ with an array of unsigned char representing the pixels of an image. Each pixel has 3 channels (R, G, B). The image is stored linearly, like this:
RGBRGBRGBRGB.....
How do I split each of the R,G and B, into separate arrays efficiently?
I tried:
for(int pos = 0; pos < srcWidth * srcHeight; pos++) {
int rgbPos = pos * 3;
splitChannels[0][pos] = rgbSrcData[rgbPos];
splitChannels[1][pos] = rgbSrcData[rgbPos + 1];
splitChannels[2][pos] = rgbSrcData[rgbPos + 2];
}
But this is surprisingly slow.
Thanks!
My attempt: load and store the bytes four by four. Byte scrambling will be tedious but possibly throughput will improve.
// Load 4 interleaved pixels (12 bytes) as three 32-bit words.
// Assumes little-endian byte order; i advances by 3 and j by 1 per group of 4 pixels.
unsigned int RGB0= ((int*)rgbSrcData)[i];     // R0 G0 B0 R1
unsigned int RGB1= ((int*)rgbSrcData)[i + 1]; // G1 B1 R2 G2
unsigned int RGB2= ((int*)rgbSrcData)[i + 2]; // B2 R3 G3 B3
// Rearrange and store 4 unpacked pixels
((int*)splitChannels[0])[j]=
(RGB0 & 0xFF) | ((RGB0 >> 16) & 0xFF00) | (RGB1 & 0xFF0000) | ((RGB2 & 0xFF00) << 16);
((int*)splitChannels[1])[j]=
((RGB0 & 0xFF00) >> 8) | ((RGB1 & 0xFF) << 8) | ((RGB1 >> 8) & 0xFF0000) | ((RGB2 & 0xFF0000) << 8);
((int*)splitChannels[2])[j]=
((RGB0 & 0xFF0000) >> 16) | (RGB1 & 0xFF00) | ((RGB2 & 0xFF) << 16) | (RGB2 & 0xFF000000);
(CAUTION: untested!) A shift-only version is also possible.
An SSE solution would be more complex (the stride 3 does not get along with powers of 2).
A useful technique for making it run faster is loop unwinding (also known as loop unrolling).
You can read about it here: http://en.wikipedia.org/wiki/Loop_unwinding
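As a sketch of what that could look like for the channel split, here is a version in C++ unrolled by four pixels with a tail loop for the remainder. The name split_rgb_unrolled and the three separate output pointers are assumptions made for this example. Optimizing compilers often unroll simple loops on their own, so check with a profiler whether doing it by hand helps.

```cpp
#include <cstddef>

// Split interleaved RGBRGB... data into three planes, 4 pixels per iteration.
void split_rgb_unrolled(const unsigned char* src,
                        unsigned char* r, unsigned char* g, unsigned char* b,
                        std::size_t pixel_count)
{
    std::size_t pos = 0;
    const std::size_t unrolled_end = pixel_count - (pixel_count % 4);

    // Main loop: handle four pixels (12 source bytes) per iteration.
    for (; pos < unrolled_end; pos += 4) {
        const unsigned char* p = src + pos * 3;
        r[pos]     = p[0];  g[pos]     = p[1];   b[pos]     = p[2];
        r[pos + 1] = p[3];  g[pos + 1] = p[4];   b[pos + 1] = p[5];
        r[pos + 2] = p[6];  g[pos + 2] = p[7];   b[pos + 2] = p[8];
        r[pos + 3] = p[9];  g[pos + 3] = p[10];  b[pos + 3] = p[11];
    }

    // Tail loop: handle the remaining 0 to 3 pixels one at a time.
    for (; pos < pixel_count; ++pos) {
        const unsigned char* p = src + pos * 3;
        r[pos] = p[0];
        g[pos] = p[1];
        b[pos] = p[2];
    }
}
```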