InterlockedSubtract workaround - memory barriers - HLSL

I have a situation where I would love to have an InterlockedSubtract function in HLSL. InterlockedAdd works fine for integers, but I'm stuck using a RWByteAddressBuffer of uints - I'm using every single bit, and I would rather not resort to having an encode/decode function to make ints behave exactly like uints.
My current workaround looks like this:
uint oldValue = Source.Load(oldParent, y);
Source.InterlockedMin(oldParent, oldValue - 1, y);
The issue is that, as I understand it, these operations can interleave across several threads, like so:
Thread 1: o1 = Source.Load(l)              // l = 10, o1 = 10
Thread 2: o2 = Source.Load(l)              // l = 10, o2 = 10
Thread 1: Source.InterlockedMin(l, o1 - 1) // l = 9
Thread 2: Source.InterlockedMin(l, o2 - 1) // l = 9
These would only decrement the value once, despite the two calls.
As I understand it, I can't just make it a one-liner, as the compiled instructions could desync anyway.
Is there a workaround I'm missing? I could refactor my code to use another uint as a subtraction counter, then use another kernel to subtract those from the actual counts, but I'd much prefer to keep it within one kernel.
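(One general-purpose fix for the load/modify/store race sketched above is a compare-exchange retry loop; HLSL's InterlockedCompareExchange can play that role. Here is a minimal C++ sketch of the pattern, with std::atomic standing in for the buffer slot and atomicDecrement as a made-up name:

#include <atomic>

// Classic CAS retry: keep retrying until no other thread raced us
// between the load and the exchange.
void atomicDecrement(std::atomic<unsigned>& slot)
{
    unsigned expected = slot.load();
    // On failure, compare_exchange_weak reloads `expected` with the
    // current value, so the loop retries with fresh data.
    while (!slot.compare_exchange_weak(expected, expected - 1)) {
    }
}

Two racing decrements then always net two, unlike the Load + InterlockedMin pair above.)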

Thanks @PeterCordes, you were absolutely right: InterlockedAdd on a uint wraps around on overflow, which is exactly what a subtract-via-add needs.
I did something very simple to test:
HLSL
#pragma kernel Overflow

RWByteAddressBuffer buff;

// Each thread gets its own 4-byte slot in the buffer.
uint ByteIndex(uint3 id)
{
    return (id.x + id.y * 8) * 4;
}

[numthreads(8,8,1)]
void Overflow (uint3 id : SV_DispatchThreadID)
{
    uint originalValue;
    buff.InterlockedAdd(ByteIndex(id), 1, originalValue);
}
And C#
public ComputeShader shade;
private ComputeBuffer b;
private int kernel;
private uint[] uints = new uint[64];

private void Start()
{
    // 4294967294 == uint.MaxValue - 1, so the second press of Space wraps the counters to 0.
    for (int i = 0; i < 64; i++) { uints[i] = 4294967294; }
    for (int i = 0; i < 64; i++) { Debug.Log(uints[i]); }
}

void Update()
{
    if (Input.GetKeyDown(KeyCode.Space))
    {
        b = new ComputeBuffer(64, sizeof(uint));
        b.SetData(uints);
        kernel = shade.FindKernel("Overflow");
        shade.SetBuffer(kernel, Shader.PropertyToID("buff"), b);
        shade.Dispatch(kernel, 1, 1, 1);
        b.GetData(uints);
        b.Dispose();
        for (int i = 0; i < 64; i++) { Debug.Log(uints[i]); }
    }
}
This clearly exhibits the desired behavior.
It would be nice if this behavior were documented in HLSL, but at least I know it now.
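For reference, this is just two's-complement modular arithmetic, and the same wraparound can be sanity-checked on the CPU. A minimal C++ sketch (plain uint32_t standing in for the buffer slots; not HLSL, just the same mod-2^32 rule):

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t counter = 10;
    // Adding 0xFFFFFFFF (the two's complement of 1) subtracts 1 modulo 2^32,
    // which is why InterlockedAdd with the two's complement of n acts as an
    // InterlockedSubtract of n on a uint slot.
    counter += 0xFFFFFFFFu;
    std::printf("%u\n", counter); // prints 9
    return 0;
}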

Related

C++ optimizations

I'm doing some real-time stuff and I need a lot of speed. But in my code, I have this:
float maxdepth;
uint32_t faceindex;
for (uint32_t tr_iterator = 0; tr_iterator < facesNum-1; tr_iterator++)
{
    maxdepth = VXTrisDepth[tr_iterator];
    faceindex = tr_iterator;
    uint32_t tr_literator = 3*tr_iterator;
    uint32_t facelindex = 3*faceindex;
    for (uint32_t tr_titerator = tr_iterator+1; tr_titerator < facesNum; tr_titerator++)
    {
        float depth = VXTrisDepth[tr_titerator];
        if (depth > maxdepth)
        {
            maxdepth = depth;
            faceindex = tr_titerator;
        }
    }
    Vei2 itmpx = trs[tr_literator+0];
    trs[tr_literator+0] = trs[facelindex+0];
    trs[facelindex+0] = itmpx;
    itmpx = trs[tr_literator+1];
    trs[tr_literator+1] = trs[facelindex+1];
    trs[facelindex+1] = itmpx;
    itmpx = trs[tr_literator+2];
    trs[tr_literator+2] = trs[facelindex+2];
    trs[facelindex+2] = itmpx;
    float id = VXTrisDepth[tr_iterator];
    VXTrisDepth[tr_iterator] = VXTrisDepth[faceindex];
    VXTrisDepth[faceindex] = id;
}
VXTrisDepth is just an array of float, facesNum is a uint32_t and is a big number, trs is an array of Vei2, and Vei2 is just an integer 2D vector.
The problem is that when facesNum is something like 16074, this loop takes 700 ms to run on my computer, and that's way too much. Any ideas for optimizations?
I've rewritten it a bit to find out what you were really doing.
Warning: all code is untested.
float maxdepth;
uint32_t faceindex;
for (uint32_t tr_iterator = 0; tr_iterator < facesNum-1; tr_iterator++) {
    uint32_t tr_literator = 3*tr_iterator;
    // find the largest remaining depth in one call
    auto fi = std::max_element(&VXTrisDepth[tr_iterator], &VXTrisDepth[facesNum]);
    maxdepth = *fi;
    faceindex = std::distance(&VXTrisDepth[0], fi);
    uint32_t facelindex = 3*faceindex; // must be computed *after* faceindex is known
    // hmm, was this originally a VEC3...
    std::swap(trs[tr_literator+0], trs[facelindex+0]);
    std::swap(trs[tr_literator+1], trs[facelindex+1]);
    std::swap(trs[tr_literator+2], trs[facelindex+2]);
    // with the above this looks like a struct of arrays, SoA vs AoS
    std::swap(VXTrisDepth[tr_iterator], VXTrisDepth[faceindex]);
}
Now it looks like a selection sort over two parallel arrays, which is O(N^2): with facesNum around 16074 that is roughly 130 million comparisons, so no wonder it feels slow.
There are multiple methods to sort this:
1. External index: make an array of length facesNum, initialized from 0 to facesNum-1, and sort it using the indexed values of VXTrisDepth as the key. Then reorder the two original arrays according to the index array.
2. External pair of index and key: to make it easy, use std::pair, sort it, and then reorder the two original arrays.
3. Sort the two arrays as if they were one (a slight hack): std::swap can be specialized on a type, so it can be misused to swap corresponding elements of both arrays at once. No extra storage needed.
Let's try an easy version with the external pair.
We need three stages:
1. make the helper array, O(N)
2. sort the helper array, O(N lg N)
3. reorder the original arrays, O(N)
And some more code:
// make the helper array
using hPair = std::pair<float, int>; // member order is important: depth is the sort key
std::vector<hPair> helper;
helper.reserve(facesNum);
for (int idx = 0; idx < facesNum; idx++)
    helper.emplace_back(VXTrisDepth[idx], idx);
// sort it using std::pair's operator< or write your own comparator
// (note std::sort is ascending; the original loop sorted descending)
std::sort(helper.begin(), helper.end());
// reorder the SoA arrays; copy into a fresh buffer instead of swapping in
// place, because in-place swaps would clobber elements we still need
std::vector<Vei2> sortedTrs(3 * facesNum);
auto vx = std::begin(VXTrisDepth);
int out = 0;
for (auto& help : helper) {
    int src = 3 * help.second;   // where this face's three Vei2 entries live
    sortedTrs[out++] = trs[src + 0];
    sortedTrs[out++] = trs[src + 1];
    sortedTrs[out++] = trs[src + 2];
    *vx++ = help.first;          // we already have the sorted depth in helper
}
// finally copy (or std::move) sortedTrs back over trs
Remember to test that it still works... you already have a test framework, right?
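For completeness, here is a minimal sketch of the external-index variant (method 1 above); the function name sortByDepthIndex and the descending comparator are my own choices, the latter to match the original loop's largest-first order:

#include <algorithm>
#include <numeric>
#include <vector>

// Build a permutation that sorts by depth, without touching the data arrays.
std::vector<int> sortByDepthIndex(const std::vector<float>& depth)
{
    std::vector<int> order(depth.size());
    std::iota(order.begin(), order.end(), 0);  // fill with 0, 1, 2, ...
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return depth[a] > depth[b]; }); // descending
    return order;
}

order[k] then names the source face that belongs at position k, and one linear pass can copy the depth and the three trs entries per face into their new slots.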

Initializing typedef struct from C library properly in C++

I want to include a library in my C++ project (it controls RGB LED strips on the Raspberry Pi).
Importing the library works fine, but I'm having quite an issue with properly initializing some structs. I'm pretty lost as to where to even find the proper syntax; I did a lot of googling but didn't get very far.
What I want to do first is get the sample application that comes with the library going. See: https://github.com/richardghirst/rpi_ws281x/blob/master/main.c
My main issue is this: how do I do what is done below, the C++ way?
ws2811_t ledstring =
{
    .freq = TARGET_FREQ,
    .dmanum = DMA,
    .channel =
    {
        [0] =
        {
            .gpionum = GPIO_PIN,
            .count = LED_COUNT,
            .invert = 0,
            .brightness = 255,
        },
        [1] =
        {
            .gpionum = 0,
            .count = 0,
            .invert = 0,
            .brightness = 0,
        },
    },
};
The way this is initialized is C-specific and doesn't compile in any current C++ standard. See: Why does C++11 not support designated initializer list as C99?
So far I have only ever used my own structs and have never used typedef, so I'm just confused by the way the structs are defined here.
The struct(s) that gets initialized above is defined in this way. See: https://github.com/richardghirst/rpi_ws281x/blob/master/ws2811.h
typedef struct
{
    int gpionum;        //< GPIO Pin with PWM alternate function
    int invert;         //< Invert output signal
    int count;          //< Number of LEDs, 0 if channel is unused
    int brightness;     //< Brightness value between 0 and 255
    ws2811_led_t *leds; //< LED buffers, allocated by driver based on count
} ws2811_channel_t;

typedef struct
{
    struct ws2811_device *device; //< Private data for driver use
    uint32_t freq;                //< Required output frequency
    int dmanum;                   //< DMA number _not_ already in use
    ws2811_channel_t channel[RPI_PWM_CHANNELS];
} ws2811_t;
What I tried was this:
ws2811_led_t matrix[WIDTH][HEIGHT];
ws2811_channel_t channel0 = {GPIO_PIN,LED_COUNT,0,255,*matrix};
ws2811_t ledstring = {nullptr,TARGET_FREQ,DMA,channel0};
That compiles but results in a malloc error when I come to actually "render" to the LED strip:
int x, y;
for (x = 0; x < WIDTH; x++)
{
    for (y = 0; y < HEIGHT; y++)
    {
        cout << "LEDs size: " << (y * WIDTH) + x << endl;
        ledstring.channel[0].leds[(y * WIDTH) + x] = matrix[x][y];
    }
}
Results in this error message after the loop construct finishes:
malloc(): memory corruption (fast): 0x021acaa8
You should be able to use the following initializer:
ws2811_t ledstring =
{
    nullptr,
    TARGET_FREQ,
    DMA,
    {
        { GPIO_PIN, 0, LED_COUNT, 255 },
        { 0 }
    }
};
This line
ledstring.channel[0].leds[(y * WIDTH) + x] = matrix[x][y];
is almost certainly the cause of the memory corruption, as that can only happen by either a buffer overrun or dereferencing an invalid (but non-NULL) pointer.
I see some problems in this code:
ws2811_channel_t channel0 = {GPIO_PIN,LED_COUNT,0,255,*matrix};
ws2811_t ledstring = {nullptr,TARGET_FREQ,DMA,channel0};
First, the initializer for channel0 doesn't match the field order of ws2811_channel_t. The struct declares gpionum, invert, count, brightness, leds, so {GPIO_PIN, LED_COUNT, 0, 255, *matrix} sets invert to LED_COUNT and count to 0. Also note the header comment: leds is allocated by the driver based on count, so you shouldn't need to point it at matrix at all.
Next, you are initializing channel0.leds to point at the two-dimensional array matrix, but then indexing it as a one-dimensional array in ledstring.channel[0].leds[(y * WIDTH) + x]. Since matrix is declared [WIDTH][HEIGHT], element matrix[x][y] sits at linear offset x * HEIGHT + y, not (y * WIDTH) + x, so the loop mixes two different layouts.
Finally, the last initializer for ledstring should probably be {channel0} for clarity. That's not a big issue, but it makes clear that you're initializing an array and lets you list more than one channel.
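If you'd rather not depend on positional aggregate initialization at all, another option (a minimal sketch using the macros from the sample app) is to value-initialize the struct and assign the fields by name; note that C++20 did eventually add designated initializers, but still without array designators like [0] =:

ws2811_t ledstring{};  // value-initialization zeroes every member
ledstring.freq = TARGET_FREQ;
ledstring.dmanum = DMA;
ledstring.channel[0].gpionum = GPIO_PIN;
ledstring.channel[0].count = LED_COUNT;
ledstring.channel[0].invert = 0;
ledstring.channel[0].brightness = 255;
// channel[1] stays all-zero, matching the [1] = { ... } branch of the C initializer

This also leaves leds as a null pointer, which suits the header's note that the driver allocates the LED buffers based on count.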

Access violation, can't figure out the reason

So, I've been building this class:
public class BitArray {
public:
    unsigned char* Data;
    UInt64 BitLen;
    UInt64 ByteLen;
private:
    void SetLen(UInt64 BitLen) {
        this->BitLen = BitLen;
        ByteLen = (BitLen + 7) / 8;
        Data = new unsigned char(ByteLen + 1);
        Data[ByteLen] = 0;
    }
public:
    BitArray(UInt64 BitLen) {
        SetLen(BitLen);
    }
    BitArray(unsigned char* Data, UInt64 BitLen) {
        SetLen(BitLen);
        memcpy(this->Data, Data, ByteLen);
    }
    unsigned char GetByte(UInt64 BitStart) {
        UInt64 ByteStart = BitStart / 8;
        unsigned char BitsLow = (BitStart - ByteStart * 8);
        unsigned char BitsHigh = 8 - BitsLow;
        unsigned char high = (Data[ByteStart] & ((1 << BitsHigh) - 1)) << BitsLow;
        unsigned char low = (Data[ByteStart + 1] >> BitsHigh) & ((1 << BitsLow) - 1);
        return high | low;
    }
    BitArray* SubArray(UInt64 BitStart, UInt64 BitLen) {
        BitArray* ret = new BitArray(BitLen);
        UInt64 rc = 0;
        for (UInt64 i = BitStart; i < BitLen; i += 8) {
            ret->Data[rc] = GetByte(i);
            rc++;
        }
        Data[rc - 1] ^= (1 << (BitLen - ret->ByteLen * 8)) - 1;
        return ret;
    }
};
I just finished writing the SubArray function and went on to test it, but I get "Access violation: attempted to read protected memory" on the line where GetByte(i) gets called. I tested a bit and it doesn't seem to have anything to do with the data array or i; placing int derp = GetByte(0); on the first line of the function produces the same error.
Calling GetByte from outside the class works fine; I don't understand what's going on.
The test function looks like this:
unsigned char test[] = { 0, 1, 2, 3, 4, 5, 6, 7 };
BitArray* juku = new BitArray(test, 64);
auto banana = juku->GetByte(7); //this works fine
auto pie = juku->SubArray(7, 8);
You might want to consider creating an array of characters, changing:
Data = new unsigned char(ByteLen + 1);
into:
Data = new unsigned char[ByteLen + 1];
In the former, the value inside the parentheses is not the desired length, it's the value that *Data gets initialised to. If you use 65 (in an ASCII system), the first character becomes A.
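A tiny standalone sketch (not from the original code) that makes the difference visible:

#include <cstdio>

int main()
{
    unsigned char* one  = new unsigned char(65); // ONE char, initialized to 65 ('A')
    unsigned char* many = new unsigned char[65]; // array of 65 uninitialized chars
    std::printf("%c\n", *one); // prints: A
    delete one;      // single object: plain delete
    delete[] many;   // array: delete[]
    return 0;
}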
Having said that, C++ already has a pretty efficient std::bitset for exactly the situation you seem to be in. If your intent is to learn how to make classes, by all means write your own. However, if you want to just make your life simple, you may want to consider using the facilities already provided rather than rolling your own.
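For instance (a minimal sketch; note std::bitset needs its size at compile time, unlike the dynamic BitArray above):

#include <bitset>
#include <cstdio>

int main()
{
    std::bitset<64> bits;  // 64 bits, all zero
    bits.set(7);           // set bit 7
    std::printf("%d %zu\n", (int)bits.test(7), bits.count()); // prints: 1 1
    return 0;
}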

How to speed up a function that returns a pointer to an object in C++?

I am a mechanical engineer, so please understand I am not trained in proper coding. I have a finite element code that uses grids to make elements, which make a model. The element is not important to this question, so I have left it out. The elements and grids are read in from a file, and that part works.
class Grid
{
private:
    int id;
    double x;
    double y;
    double z;
public:
    Grid();
    Grid(int, double, double, double);
    int get_id() { return id; };
};

Grid::Grid() {};

Grid::Grid(int t_id, double t_x, double t_y, double t_z)
{
    id = t_id; x = t_x; y = t_y; z = t_z;
}
class SurfaceModel
{
private:
    Grid** grids;
    Element** elements;
    int grid_count;
    int elem_count;
public:
    SurfaceModel();
    SurfaceModel(int, int);
    ~SurfaceModel();
    void read_grid(std::string);
    int get_grid_count() { return grid_count; };
    Grid* get_grid(int);
};

SurfaceModel::SurfaceModel()
{
    grids = NULL;
    elements = NULL;
}

SurfaceModel::SurfaceModel(int g, int e)
{
    grids = new Grid*[g];
    for (int i = 0; i < g; i++)
        grids[i] = NULL;
    elements = new Element*[e];
    for (int i = 0; i < e; i++)
        elements[i] = NULL;
}

void SurfaceModel::read_grid(std::string line)
{
    ... blah blah ...
    grids[index] = new Grid(n_id, n_x, n_y, n_z);
    ... blah blah ....
}

Grid* SurfaceModel::get_grid(int i)
{
    if (i < grid_count)
        return grids[i];
    else
        return NULL;
}
When I need to actually use the grid I use get_grid, maybe something like this:
SurfaceModel model(...);
.... blah blah .....
for (int i = 0; i < model.get_grid_count(); i++)
{
    Grid *cur_grid = model.get_grid(i);
    int cur_id = cur_grid->get_id();
}
My problem is that the call to get_grid seems to take more time than it should simply to return my object. I ran gprof on the code and found that get_grid gets called about 4 billion times over a very large simulation, and another operation that uses the x, y, z values runs about as many times. That operation does some multiplication. What I found is that get_grid and the math each take about the same amount of time (~40 seconds). It seems like I have done something wrong. Is there a faster way to get that object out of there?
I think you're forgetting to set grid_count and elem_count.
This means they will have indeterminate values. If you loop up to those values, you can easily end up doing far more iterations than you intended.
SurfaceModel::SurfaceModel()
    : grids(NULL),
      elements(NULL),
      grid_count(0),
      elem_count(0)
{
}

SurfaceModel::SurfaceModel(int g, int e)
    : grid_count(g),
      elem_count(e)
{
    grids = new Grid*[g];
    for (int i = 0; i < g; i++)
        grids[i] = NULL;
    elements = new Element*[e];
    for (int i = 0; i < e; i++)
        elements[i] = NULL;
}
However, I suggest you get rid of every instance of new in this program (and use a std::vector for the grids).
On a modern CPU accessing memory often takes longer than doing multiplication. Getting good performance on modern systems can often mean focusing more on optimizing memory accesses than optimizing computation. Because you are storing your grid objects as an array of dynamically allocated pointers the grid objects themselves will be stored non-contiguously in memory and you will likely get many cache misses when trying to access them. In this example you would probably see a significant speedup by storing your grid objects directly in an array or vector since you will be accessing contiguous memory in your loop and so get good cache utilization and effective hardware prefetching.
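A minimal sketch of that layout change (my own simplified SurfaceModel, assuming the Grid class from the question):

#include <cstddef>
#include <vector>

class SurfaceModel
{
    std::vector<Grid> grids; // Grid objects stored contiguously, by value
public:
    void add_grid(const Grid& g) { grids.push_back(g); }
    std::size_t get_grid_count() const { return grids.size(); }
    const Grid& get_grid(std::size_t i) const { return grids[i]; } // no pointer chase
};

Iterating the grids in order now walks one contiguous block of memory, which lets the cache and the hardware prefetcher do their job.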
4 billion calls at a microsecond each (a pretty acceptable time in many cases) would give 4,000 seconds. Since you only see about 40 s (if I read it right), each call is averaging around 10 ns, and I doubt there's anything seriously wrong here. If it's still too slow for the task, I'd consider parallel computing.

Reversing for loop causing system errors

This feels like a newbie issue, but I can't seem to figure it out. I want to iterate over the items in a std::vector. Currently I use this loop:
for (unsigned int i = 0; i < buffer.size(); i++) {
    myclass* var = buffer.at(i);
    [...]
}
However, I realised that I actually want to iterate over it in the opposite order: starting at the end and working my way to 0. So I tried using this iterator:
for (unsigned int i = buffer.size()-1; i >= 0; i--) {
    myclass* var = buffer.at(i);
    [...]
}
But after simply replacing the old loop with the new one (and, of course, recompiling), the program goes from running properly to crashing the first time it hits this line, with this error:
http://i43.tinypic.com/20sinlw.png
Followed by a "[Program] has stopped working" dialog box.
The program also returns exit code 3, according to Code::Blocks, which (if this article is to be believed) means ERROR_PATH_NOT_FOUND: The system cannot find the path specified.
Any advice? Am I just missing something in my for loop that's maybe causing some sort of memory issue? Is the return code of 3, or the article, misleading, and it doesn't actually mean "path not found"?
An unsigned integer is always >= 0. Furthermore, decrementing from 0 leaps to a large number.
When i == 0 (i.e. what should be the last iteration), the decrement i-- causes i to wrap around to the largest possible value for an unsigned int. Thus, the condition i >= 0 still holds, even though you'd like the loop to stop.
To fix this, you can try something like this, which maintains the original loop logic, but yields a decrementing i:
unsigned int size = buffer.size();
for (unsigned int j = 0; j < size; j++) {
    unsigned int i = size - j - 1;
    myclass* var = buffer.at(i);
    // ...
}
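Another common idiom (a small sketch, not from the original answer) folds the decrement into the loop condition, so the comparison sees the value before it can wrap:

for (unsigned int i = buffer.size(); i-- > 0; ) {
    myclass* var = buffer.at(i);
    // ...
}

The condition compares the pre-decrement value against 0 and then decrements, so the body sees size-1 down to 0 and the loop exits cleanly once i has been 0.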
Alternatively, since std::vector has rbegin and rend methods defined, you can use iterators:
for (std::vector<myclass *>::reverse_iterator i = buffer.rbegin(); i != buffer.rend(); ++i)
{
    myclass* var = *i;
    // ...
}
(There might be small syntactic errors - I don't have a compiler handy)
#include <vector>
using namespace std;

int main() {
    vector<int> buffer = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    for (vector<int>::reverse_iterator it = buffer.rbegin(); it != buffer.rend(); it++) {
        // do your stuff
    }
    return 0;
}