Problem testing DTid.x Direct3D ComputeShader HLSL

I'm attempting to write a fairly simple compute shader that calculates a simple moving average.
It is my first shader where I had to test DTid.x for certain conditions related to the logic.
The shader works and the moving average is calculated as expected, except for the case DTid.x = 0, where I get a bad result.
It seems my test of the value DTid.x is somehow corrupted, or not possible, for the case DTid.x = 0.
I may be missing some fundamental understanding of how compute shaders work, because this piece of code seems very simple, but it doesn't work as I'd expect it to.
Hopefully someone can tell me why this code doesn't work for the case DTid.x = 0.
For example, I simplified the shader to...
[numthreads(1024, 1, 1)]
void CSSimpleMovingAvgDX(uint3 DTid : SV_DispatchThreadID)
{
    // I added the early-out below to try to limit the logic to a single thread.
    // I initially had it check for a range like > 50 and < 100 and that did work as expected.
    // But I saw that my value at DTid.x = 0 was corrupted and I started to work on solving why. But no luck.
    // It is just the case of DTid.x = 0 where this shader does not work.
    if (DTid.x > 0)
    {
        return;
    }
    nAvgCnt = 1;
    ft0 = asfloat(BufferY0.Load(DTid.x * 4)); // load data at the actual DTid.x location
    if (DTid.x > 0) // to avoid loading a second value for averaging
    {
        // somehow this code is still being executed for the case DTid.x = 0?
        nAvgCnt = nAvgCnt + 1;
        ft1 = asfloat(BufferY0.Load((DTid.x - 1) * 4)); // load the data value at the previous DTid.x location
    }
    if (nAvgCnt > 1) // if DTid.x was larger than 0, we should have loaded ft1 and we can average ft0 and ft1
    {
        result = ((ft0 + ft1) / ((float)nAvgCnt));
    }
    else
    {
        result = ft0;
    }
    // And when I add the code below, which should override the code above, the result is still corrupted?
    if (DTid.x < 2)
        result = ft0;
    llByteOffsetLS = ((DTid.x) * dwStrideSomeBuffer);
    BufferOut0.Store(llByteOffsetLS, asuint(result)); // store the result; all good except for the case DTid.x = 0
}

I am compiling the shader with FXC. My actual shader was slightly more involved than the one above. When I added the /Od option (disable optimizations), the code behaved as expected. Without /Od I tried to refactor the code over and over with no luck; eventually I gave every section its own distinct variable names so the compiler would treat them separately, and that finally worked. So the lesson I learned is: never reuse a variable in any way. Another option, in the worst case, would be to disassemble the compiled shader to understand how it was optimized. If you are attempting a large shader with several conditions/branches, I'd start with /Od and only remove it later, and avoid reusing variables, or you may start chasing problems that are not truly problems.
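For reference, this is the kind of FXC invocation I mean; the file names here are just placeholders for your own (the entry point is the one from the shader above):

fxc /T cs_5_0 /E CSSimpleMovingAvgDX /Od /Zi MyShader.hlsl /Fo MyShader.cso
fxc /dumpbin /Fc MyShader.asm MyShader.cso

/T selects the target profile, /E the entry point, /Od disables optimizations, /Zi adds debug information, and /Fo writes the compiled object. The second command disassembles the compiled shader (/Fc writes the assembly listing) so you can see what the optimizer actually did with your variables.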

Related

glfwSwapBuffers slow (>3s)

Paul Aner is looking for a canonical answer:
I think the reason for this question is clear: I want the main loop NOT to lock up while a compute shader is processing larger amounts of data. I could try to separate the data into smaller chunks, but if the computations were done on the CPU, I would simply start a thread and everything would run nicely and smoothly. Although I would of course have to wait until the calculation thread delivers new data to update the screen, the GUI (ImGUI) would not lock up...
I have written a program that does some calculations on a compute shader and the returned data is then being displayed. This works perfectly, except that the program execution is blocked while the shader is running (see code below) and depending on the parameters, this can take a while:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GLfloat* mapped = (GLfloat*)(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}
void main()
{
    // Initialization stuff
    // ...
    while (glfwWindowShouldClose(Window) == 0)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glfwPollEvents();
        glfwSwapInterval(2); // Doesn't matter what I put here
        CalculateSomething(Result);
        Render(Result);
        glfwSwapBuffers(Window.WindowHandle);
    }
}
To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

bool GPU_busy()
{
    GLint GPU_status;
    if (GPU_sync == NULL)
        return false;
    else
    {
        glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
        return GPU_status == GL_UNSIGNALED;
    }
}
These two functions are part of a class, and it would get a little messy and complicated if I had to post all of that here (if more code is needed, tell me). So on every loop iteration, when the class is told to do the computation, it first checks whether the GPU is busy. If the GPU is done, the result is copied to CPU memory (or a new calculation is started); otherwise it returns to main without doing anything else. Anyway, this approach works in that it produces the right result. But my main loop is still blocked.
Doing some timing revealed that CalculateSomething, Render (and everything else) run fast (as I would expect them to). But now glfwSwapBuffers takes >3000 ms (depending on how long the calculations of the compute shader take).
Shouldn't it be possible to swap buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader is not done yet, the old result should get rendered). Or am I missing something here (do queued OpenGL calls get processed before glfwSwapBuffers does anything)?
Edit:
I'm not sure why this question got closed or what additional information is needed (maybe other than the OS, which would be Windows). As for "desired behavior": well, I'd like the glfwSwapBuffers call not to block my main loop. For additional information, please ask...
As pointed out by Erdal Küçük, an implicit call of glFlush might cause latency. I put this call before glfwSwapBuffers for testing purposes and timed it - no latency there...
I'm sure I can't be the only one who ever ran into this problem. Maybe someone could try to reproduce it? Simply put a compute shader in the main loop that takes a few seconds to do its calculations. I have read somewhere that similar problems occur especially when calling glMapBuffer. This seems to be an issue with the GPU driver (mine would be an integrated Intel GPU). But nowhere have I read about latencies above 200 ms...
I solved a similar issue with a GL_PIXEL_PACK_BUFFER effectively used as an offscreen compute shader. The approach with fences is correct, but you then need a separate function that checks the status of the fence using glGetSynciv to read the GL_SYNC_STATUS. The solution (admittedly in Java) can be found here.
An explanation for why this is necessary can be found in @Nick Clark's comment:
Every call in OpenGL is asynchronous, except for the frame buffer swap, which stalls the calling thread until all submitted functions have been executed. Thus, the reason why glfwSwapBuffers seems to take so long.
The relevant portion from the solution is:
public void finishHMRead( int pboIndex ){
    int[] length = new int[1];
    int[] status = new int[1];
    GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
    int signalStatus = status[0];
    int glSignaled = GLES30.GL_SIGNALED;
    if( signalStatus == glSignaled ){
        // Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
        ByteBuffer pixelBuffer;
        texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );

        // map data to a bytebuffer
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
        pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );

        // Copy to the long term ByteBuffer
        pixelBuffer.rewind(); // copy from the beginning
        texLayerByteBuffers[ pboIndex ].put( pixelBuffer );

        // Unmap and unbind the currently bound pixel buffer
        GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
        Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
        acknowledgeHMReadComplete();
    } else {
        // If it wasn't done, resubmit for another check in the next render update cycle
        RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
        UpdateList.getRef().addRenderUpdate( finishHmRead );
    }
}
Basically, fire off the compute shader, then wait for the glGetSynciv check of GL_SYNC_STATUS to equal GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
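For the desktop OpenGL / GLFW code in the question, the same idea looks roughly like the sketch below. This is only an illustration: GPU_sync, Result, X and Y are the names from the question, the SSBO is assumed to still be bound to GL_SHADER_STORAGE_BUFFER, and error handling is omitted.

void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush(); // make sure the dispatch and the fence are actually submitted to the GPU
}

bool TryFetchResult(GLfloat* Result)
{
    if (GPU_sync == NULL)
        return false;
    GLint status = GL_UNSIGNALED;
    glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &status);
    if (status != GL_SIGNALED)
        return false; // still busy: skip this frame and keep rendering the old Result
    // The dispatch has finished, so mapping the buffer now should not stall.
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); // same barrier as in the original code
    GLfloat* mapped = (GLfloat*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                                 sizeof(GLfloat) * X * Y, GL_MAP_READ_BIT);
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    glDeleteSync(GPU_sync);
    GPU_sync = NULL;
    return true;
}

In the main loop, call TryFetchResult every frame and keep drawing the previous Result while it returns false; that way the buffer swap never has to wait for the compute dispatch to finish.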

C++ local uninitialized structure caused performance drop?

I'm working on a 3d software renderer. In my code, I've declared a structure Arti3DVSOutput with no default constructor. It's like this:
struct Arti3DVSOutput {
    vec4 vPosition; // vec4 has a default ctor that sets all 4 floats to 0.0f.
    float Varyings[g_ciMaxVaryingNum];
};
void Arti3DDevice::GetTransformedVertex(uint32_t i_iVertexIndex, Arti3DTransformedVertex *out)
{
    // Try to fetch the result from the cache.
    uint32_t iCacheIndex = i_iVertexIndex & (g_ciCacheSize - 1);
    if (vCache[iCacheIndex].tag == i_iVertexIndex)
    {
        *out = *(vCache[iCacheIndex].v);
    }
    else
    {
        // Cache miss. Need calculation.
        Arti3DVSInput vsinput;
        // code that fills in "vsinput" omitted ..........
        Arti3DVSOutput vs_output;
        // Whether the following line is commented out makes a big difference.
        //memset(&vs_output, 0, sizeof(Arti3DVSOutput));
        mRC.pfnVS(&vsinput, &mRC.globals, &vs_output);
        *out = vs_output;
        // Store the result in the cache.
        vCache[iCacheIndex].tag = i_iVertexIndex;
        vCache[iCacheIndex].v = &tvBuffer[i_iVertexIndex];
        tvBuffer[i_iVertexIndex] = vs_output;
    }
}
mRC.pfnVS is a function pointer, and the function it points to is implemented like this:
void NewCubeVS(Arti3DVSInput *i_pVSInput, Arti3DShaderUniform *i_pUniform, Arti3DVSOutput *o_pVSOutput)
{
    o_pVSOutput->vPosition   = i_pUniform->mvp * i_pVSInput->ShaderInputs[0];
    o_pVSOutput->Varyings[0] = i_pVSInput->ShaderInputs[1].x;
    o_pVSOutput->Varyings[1] = i_pVSInput->ShaderInputs[1].y;
    o_pVSOutput->Varyings[2] = i_pVSInput->ShaderInputs[1].z;
}
As you can see, all I do in this function is fill in some members of "o_pVSOutput". No read operation is performed.
Here comes the problem: the performance of the renderer drops sharply, from 400+ fps to 60+ fps, when the local variable "vs_output" is not set to 0 before I pass its address to the function ("NewCubeVS" in this case) as the third parameter.
The rendered image is exactly the same. When I turn off optimization (-O0), the performance of the two versions is the same. Once I turn on optimization (-O1, -O2 or -O3), the performance difference shows up again.
I profiled the program and found something very strange. The increase in time cost in the "vs_output uninitialized" version does not take place in the function "GetTransformedVertex", not even near it. The time increase is caused by some SSE intrinsics well after "GetTransformedVertex" is called. I'm really confused...
FYI, I'm using Visual Studio 2013 Community.
Now I do know that this performance drop is caused by the uninitialized structure. But I don't know how. Does it implicitly turn off some of the compiler's optimizations?
If necessary, I will post my source code to my GitHub for your reference.
Any opinions are appreciated! Thank you in advance.
Update: Enlightened by @KerrekSB, I did some more tests.
Calling memset() with different values, the performance can be quite different!
1: With 0: 400+ fps.
2: With 1, 2, 3, 4, 5...: 40~60 fps.
Then I removed the memset() and explicitly implemented a ctor for Arti3DVSOutput. The ctor did nothing but set all the floats in Varyings[] to one valid floating-point value (e.g. 0.0f, 1.5f, 100.0f...). Haha, 400+ fps.
So far, it seems that the values/content in Arti3DVSOutput have a great effect on the performance.
Then I did some more tests to find out which piece of memory of Arti3DVSOutput really matters. Here comes the code.
Arti3DVSOutput vs_output; // No explicit default ctor in this version.
// Comment out the following 12 lines of code one by one to find out which piece of uninitialized memory really matters.
vs_output.Varyings[0] = 0.0f;
vs_output.Varyings[1] = 0.0f;
vs_output.Varyings[2] = 0.0f;
vs_output.Varyings[3] = 0.0f;
vs_output.Varyings[4] = 0.0f;
vs_output.Varyings[5] = 0.0f;
vs_output.Varyings[6] = 0.0f;
vs_output.Varyings[7] = 0.0f;
vs_output.Varyings[8] = 0.0f;
vs_output.Varyings[9] = 0.0f;
vs_output.Varyings[10] = 0.0f;
vs_output.Varyings[11] = 0.0f;
mRC.pfnVS(&vsinput, &mRC.globals, &vs_output);
Comment out those 12 lines of code one by one and run the program.
The results are as follows:
Commented line(s)    FPS
0                    420
1                    420
2                    420
3                    420
4                    200
5                    420
6                    280
7                    195
8                    197
9                    200
10                   200
11                   420
0,1,2,3,5,11         420
4,6,7,8,9,10          60
It seems the 4th, 6th, 7th, 8th, 9th and 10th elements of Varyings[] all contribute to the performance drop.
I'm really confused by what the compiler is doing behind my back. Is there some kind of value check the compiler has to perform?
Solution:
I figured it out!
The source of the problem is that uninitialized (or improperly initialized) floating-point values are later used as operands by SSE intrinsics. Those invalid floats (in practice, garbage bit patterns that decode as denormals) force the SSE units onto a slow assist path and slow the intrinsics down greatly.
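A side note for anyone who runs into the same thing: besides simply initializing the structure (the real fix above), a common mitigation on SSE2-class CPUs is to enable the flush-to-zero and denormals-are-zero modes, so that stray bit patterns that happen to decode as denormals can no longer trigger the slow assist path. A minimal sketch, not taken from the renderer in question:

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE / _MM_FLUSH_ZERO_ON
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE / _MM_DENORMALS_ZERO_ON

// Call once per thread that does SSE floating-point math.
void EnableFlushToZero()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // denormal results are flushed to 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // denormal inputs are read as 0
}

Note that this changes the floating-point semantics slightly (denormals become 0), so it is a trade-off rather than a free fix.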

While loop in compute shader is crashing my video card driver

I am trying to implement a binary search in a compute shader with HLSL. It's not a classic binary search, as the search key as well as the array values are floats. If there is no array value matching the search key, the search is supposed to return the last index (minIdx and maxIdx are equal at that point). This is the worst case for a classic binary search as it takes the maximum number of operations; I am aware of this.
So here's my problem:
My implementation looks like this:
uint BinarySearch (Texture2D<float> InputTexture, float key, uint minIdx, uint maxIdx)
{
    uint midIdx = 0;
    while (minIdx <= maxIdx)
    {
        midIdx = minIdx + ((maxIdx + 1 - minIdx) / 2);
        if (InputTexture[uint2(midIdx, 0)] == key)
        {
            // this might be a very rare case
            return midIdx;
        }
        // determine which subarray to search next
        else if (InputTexture[uint2(midIdx, 0)] < key)
        {
            // as we have a decreasingly sorted array, we need to change the
            // max index here instead of the min
            maxIdx = midIdx - 1;
        }
        else if (InputTexture[uint2(midIdx, 0)] > key)
        {
            minIdx = midIdx;
        }
    }
    return minIdx;
}
This leads to my video driver crashing on program execution. I don't get a compile error.
However, if I use an if instead of the while, I can execute it and the first iteration works as expected.
I already did a couple of searches and I suspect this might have something to do with dynamic looping in a compute shader. But I have no prior experience with compute shaders and only a little experience with HLSL as well, which is why I feel kind of lost.
I am compiling this with cs_5_0.
Could anyone please explain what I am doing wrong, or at least point me to some documentation/explanation? Anything that can get me started on solving and understanding this would be super-appreciated!
DirectCompute shaders are still subject to the Timeout Detection & Recovery (TDR) behavior in the drivers. This basically means if your shader takes more than 2 seconds, the driver assumes the GPU has hung and resets it. This can be challenging with DirectCompute where you intentionally want the shader to run a long while (much longer than rendering usually would). In this case it may be a bug, but it's something to be aware of.
With Windows 8.0 or later, you can allow long-running shaders by using D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT when you create the device. This will, however, apply to all shaders not just DirectCompute so you should be careful about using this generally.
For special-purpose systems, you can also use registry keys to disable TDRs.
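For reference, a minimal sketch of creating a device with that flag; this is illustrative rather than taken from the question's code, and it requires the Direct3D 11.1 runtime (Windows 8 or later):

#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

HRESULT CreateDeviceForLongRunningCompute(ID3D11Device** device, ID3D11DeviceContext** context)
{
    UINT flags = D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT; // opt out of the ~2 second TDR limit
    D3D_FEATURE_LEVEL level = D3D_FEATURE_LEVEL_11_0;
    return D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr,
                             flags, &level, 1, D3D11_SDK_VERSION,
                             device, nullptr, context);
}

That said, it is also worth re-checking the loop itself: as posted, when minIdx == maxIdx and the texel at midIdx is greater than key, midIdx equals minIdx and "minIdx = midIdx" changes nothing, so the loop never terminates and a TDR will fire no matter how fast the GPU is.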

newComputePipelineStateWithFunction failed

I am trying to run a neural net on Metal.
The basic idea is data duplication: each GPU thread runs one version of the net for random data points.
I have written other shaders that work fine.
I also tried my code in a C++ command-line app. No errors there.
There is also no compile error.
I used the Apple documentation to convert it to Metal C++, since not everything from C++11 is supported.
It crashes after it loads the kernel function, when it tries to assign newComputePipelineStateWithFunction to the Metal device. This means there is a problem with the code that isn't caught at compile time.
MCVE:
kernel void net(const device float *inputsVector [[ buffer(0) ]], // layout of net *
                uint id [[ thread_position_in_grid ]]) {
    uint floatSize = sizeof(tempFloat);
    uint inputsVectorSize = sizeof(inputsVector) / floatSize;
    float newArray[inputsVectorSize];
    float test = inputsVector[id];
    newArray[id] = test;
}
Update
It has everything to do with dynamic arrays.
Since it fails while creating the pipeline state and doesn't crash while running the actual shader, it must be a coding issue, not an input issue.
Assigning values from a dynamic array to a buffer makes it fail.
The real problem:
It is a memory issue!
To all the people saying that it was a memory issue, you were right!
Here is some pseudo code to illustrate it. Sorry that it is in "Swift", but it is easier to read. Metal shaders have a funky way of coming to life: they are first initialised without values, to get the memory. It was this step that failed, because it relied on a later step: setting the buffer.
It all comes down to which values are available when. My understanding of newComputePipelineStateWithFunction was wrong. It does not simply get the shader function; it is also a tiny step in the initialisation process.
class MetalShader {
    // buffers
    var aBuffer : [Float]
    var aBufferCount : Int

    // step One : newComputePipelineStateWithFunction
    memory init() {
        // assign shader memory
        // create memory for one int
        let aStaticValue : Int
        // create memory for one int
        var aNotSoStaticValue : Int // this will succeed, assigns memory for one int
        // create memory for 10 floats
        var aStaticArray : [Float] = [Float](count: aStaticValue, repeatedValue: y) // this will succeed
        // create memory for x floats
        var aDynamicArray : [Float] = [Float](count: aBuffer.count, repeatedValue: y) // this will fail
        var aDynamicArray : [Float] = [Float](count: aBufferCount, repeatedValue: y) // this will fail
        let tempValue : Float // one float from a loop
    }

    // step Two : commandEncoder.setBuffer()
    assign buffers (buffers) {
        aBuffer = cpuMemoryBuffer
    }

    // step Three : commandEncoder.endEncoding()
    actual init() {
        // set shader values
        let aStaticValue : Int = 0
        var aNotSoStaticValue : Int = aBuffer.count
        var aDynamicArray : [Float] = [Float](count: aBuffer.count, repeatedValue: 1) // this could work, but the app already crashed before getting to this point.
    }

    // step Four : commandBuffer.commit()
    func shaderFunction() {
        // do stuff
        for i in 0..<aBuffer.count {
            let tempValue = aBuffer[i]
        }
    }
}
Fix:
I finally realised that buffers are technically dynamic arrays and instead of creating arrays inside the shader, I could also just add more buffers. This obviously works.
I think your problem is with this line:
uint schemeVectorSize = sizeof(schemeVector) / uintSize;
Here schemeVector is dynamic, so, as in classic C++, you cannot use sizeof on a dynamic array to get the number of elements. sizeof would only work on arrays you had defined locally/statically in the Metal shader code.
Just imagine how it works internally: at compile time, the Metal compiler is supposed to turn the sizeof call into a constant... but it can't, since schemeVector is a parameter of your shader and can therefore have any size...
So for me the solution would be to compute schemeVectorSize in the C++/Objective-C/Swift part of your code and pass it as a parameter to the shader (as a uniform, in OpenGL ES terminology...).
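On the shader side, that suggestion could look roughly like the sketch below; the buffer indices and the extra output buffer are illustrative, not taken from the original project. The element count is passed in from the CPU side instead of being recovered with sizeof:

kernel void net(const device float *inputsVector     [[ buffer(0) ]],
                constant uint      &inputsVectorSize [[ buffer(1) ]], // filled on the CPU side
                device float       *outputVector     [[ buffer(2) ]],
                uint id [[ thread_position_in_grid ]])
{
    if (id >= inputsVectorSize)
        return;                          // guard threads beyond the end of the data
    outputVector[id] = inputsVector[id]; // write straight to a buffer, no local dynamic array needed
}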

My interrupt routine does not access an array correctly

Update to this - it seems there are some issues with the trig functions in math.h (using the MPIDE compiler). It is no wonder I couldn't see this with my debugger, which was using its own math.h and therefore giving me the expected (correct) results. I found this out by accident on the Microchip boards and instead implemented a 'fast sine/cosine' approximation (see devmaster dot com for this). My ISR and ColourWheel array now work perfectly.
I must say that, as a relative newcomer to C/C++, I have spent a lot of hours reviewing and re-reviewing my own code for errors. The last thing on my mind was that some very basic functions, no doubt written decades ago, could give such problems.
I suppose I would have seen the problem earlier if I'd had access to a screen dump of the actual array but, as my chip is connected to my LED cube, I've no way to access the data in the chip directly.
Hey ho! When I get the chance I'll post a link to a YouTube video showing the wave function that I've now been able to program; it looks pretty good on my LED cube.
Russell
ps
Thank you all so very much for your help here - it stopped me giving up completely by giving me some avenues to chase down. I certainly did not know much about endianness before this, so I learned about that and about some systematic ways to go about a robust debugging approach.
I have a problem when trying to access an array in an interrupt routine.
The following is a snippet of code from inside the ISR.
if (CubeStatusArray[x][y][Layer]){
    for(int8_t bitpos=7; bitpos >= 0; bitpos--){
        if((ColourWheel[Colour]>>16)&(1<<bitpos)) { // This line seems to cause trouble
            setHigh(SINRED_PORT,SINRED_PIN);
        }
        else {
            setLow(SINRED_PORT,SINRED_PIN);
        }
    }
}
..........
ColourWheel[Colour] has been declared as follows at the start of my program (outside any function)
static volatile uint32_t ColourWheel[255]; // this is the array from which
                                           // the colours can be obtained -
                                           // all set as 3 eight-bit numbers
                                           // using up 24 bits of a 32-bit
                                           // unsigned int.
What this snippet of code does is take each bit of an eight-bit segment of the colour value and set the port/pin high or low accordingly, MSB first (I then have some other code which updates a TLC5940 LED driver IC for each high/low on the pin, and the code goes on to handle the green and blue 8 bits in a similar way).
This does not work and my colours output to my LEDs behave incorrectly.
However, if I change the code as follows then the routine works
if (CubeStatusArray[x][y][Layer]){
    for(int8_t bitpos=7; bitpos >= 0; bitpos--){
        if(0b00000000111111111110101010111110>>16)&(1<<bitpos)) { // This line seems to cause trouble
            setHigh(SINRED_PORT,SINRED_PIN);
        }
        else {
            setLow(SINRED_PORT,SINRED_PIN);
        }
    }
}
..........
(The actual binary number in the line is irrelevant: the first 8 bits are always zero, the next 8 bits represent a red colour, the next 8 a green colour, and the last 8 a blue colour.)
So why does the ISR work with the fixed number but not when I use a number held in the array?
Following is the actual code showing the full RGB update:
if (CubeStatusArray[x][y][Layer]){
    for(int8_t bitpos=7; bitpos >= 0; bitpos--){
        if((ColourWheel[Colour]>>16)&(1<<bitpos)) {
            setHigh(SINRED_PORT,SINRED_PIN);
        }
        else {
            setLow(SINRED_PORT,SINRED_PIN);
        }
        if((ColourWheel[Colour]>>8)&(1<<bitpos)) {
            setHigh(SINGREEN_PORT,SINGREEN_PIN);
        }
        else {
            setLow(SINGREEN_PORT,SINGREEN_PIN);
        }
        if((ColourWheel[Colour])&(1<<bitpos)) {
            setHigh(SINBLUE_PORT,SINBLUE_PIN);
        }
        else {
            setLow(SINBLUE_PORT,SINBLUE_PIN);
        }
        pulse(SCLK_PORT, SCLK_PIN);
        pulse(GSCLK_PORT, GSCLK_PIN);
        Data_Counter++;
        GSCLK_Counter++;
    }
}
I assume the missing ( after if is a typo.
The indicated research technique, in the absence of a debugger, is:
Confirm one more time that test if( ( 0b00000000111111111110101010111110 >> 16 ) & ( 1 << bitpos ) ) works. Collect (print) the result for each bitpos
Store 0b00000000111111111110101010111110 in element 0 of the array. Repeat with if( ( ColourWheel[0] >> 16 ) & ( 1 << bitpos ) ). Collect results and compare with base case.
Store 0b00000000111111111110101010111110 in all elements of the array. Repeat with if( ( ColourWheel[Colour] >> 16 ) & ( 1 << bitpos ) ) for several different Colour values (assigned manually, though). Collect results and compare with base case.
Store 0b00000000111111111110101010111110 in all elements of the array. Repeat with if( ( ColourWheel[Colour] >> 16 ) & ( 1 << bitpos ) ) with a value for Colour normally assigned. Collect results and compare with base case.
Revert to the original program and retest. Collect results and compare with base case.
I am fairly confident the value in ColourWheel[Colour] is not what you expect, or is unstable. Validate the index range and read the array only once; a small speed enhancement is included below.
[Edit] If the receiving end does not like the slower signal changes caused by replacing a constant with ColourWheel[Colour]>>16, more efficient code may solve this.
if (CubeStatusArray[x][y][Layer]){
    uint32_t value = 0;
    uint32_t maskR = 0x800000UL;
    uint32_t maskG = 0x8000UL;
    uint32_t maskB = 0x80UL;
    if ((Colour >= 0) && (Colour < 255)) {
        value = ColourWheel[Colour]; // read the volatile array exactly once
    }
    // All you need to do is shift 'value'; the masks stay fixed.
    for(int8_t bitpos=7; bitpos >= 0; bitpos--){
        if (value & maskR) setHigh(SINRED_PORT,SINRED_PIN);     // set red
        else               setLow(SINRED_PORT,SINRED_PIN);
        if (value & maskG) setHigh(SINGREEN_PORT,SINGREEN_PIN); // set green
        else               setLow(SINGREEN_PORT,SINGREEN_PIN);
        if (value & maskB) setHigh(SINBLUE_PORT,SINBLUE_PIN);   // set blue
        else               setLow(SINBLUE_PORT,SINBLUE_PIN);
        pulse(SCLK_PORT, SCLK_PIN);
        pulse(GSCLK_PORT, GSCLK_PIN);
        Data_Counter++;
        GSCLK_Counter++;
        value <<= 1;
    }
}