Ways to speed up a huge case statement? C++

I am running through a file and dealing with 30 or so different fragment types. Every time, I read in a fragment and compare its type (in hex) with those of the fragments I know. Is this fast, or is there a quicker way I can do this?
Here is a sample of the code I am using:
// Iterate through the fragments and address them individually
for(int i = 0; i < header.fragmentCount; i++)
{
// Read in memory for the current fragment
memcpy(&frag, (wld + file_pos), sizeof(struct_wld_basic_frag));
// Deal with each frag type
switch(frag.id)
{
// Texture Bitmap Name(s)
case 0x03:
errorLog.OutputSuccess("[%i] 0x03 - Texture Bitmap Name", i);
break;
// Texture Bitmap Info
case 0x04:
errorLog.OutputSuccess("[%i] 0x04 - Texture Bitmap Info", i);
break;
// Texture Bitmap Reference Info
case 0x05:
errorLog.OutputSuccess("[%i] 0x05 - Texture Bitmap Reference Info", i);
break;
// Two-dimensional Object
case 0x06:
errorLog.OutputSuccess("[%i] 0x06 - Two-dimensional Object", i);
break;
It runs through about 30 of these and when there are thousands of fragments, it can chug a bit. How would one recommend I speed this process up?
Thank you!

If all of these cases are the same except for the format string, consider having an array of format strings, and no case, as in:
const char *fmtStrings[] = {
NULL, NULL, NULL,
"[%i] 0x03 - Texture Bitmap Name",
"[%i] 0x04 - Texture Bitmap Info",
/* ... */
};
// ...
errorLog.OutputSuccess(fmtStrings[frag.id], i);
// (range checks elided)
This should be less expensive than a switch, as it won't involve a branch misprediction penalty. That said, the cost of this switch is probably less than the cost of actually formatting the output string, so your optimization efforts may be a bit misplaced.

The case statement should be very fast, because when your code is optimized (and even sometimes when it isn't) it is implemented as a jump table. Go into the debugger and put a breakpoint on the switch and check the disassembly to make sure that's the case.

I think performing the memcpy is probably causing a lot of overhead. Maybe use your switch statement on a direct access to your data at (wld + file_pos).

I'm skeptical that the 30 case statements are the issue. That's just not very much code compared to whatever your memcpy and errorLog methods are doing. First verify that your speed is limited by CPU time and not by disk access. If you really are CPU bound, examine the code in a profiler.

If your fragment identifiers aren't too sparse, you can create an array of fragment type names and use it as a lookup table.
static const char *FRAGMENT_NAMES[] = {
0,
0,
0,
"Texture Bitmap Name", // 0x03
"Texture Bitmap Info", // 0x04
// etc.
};
...
const char *name = FRAGMENT_NAMES[frag.id];
if (name) {
errorLog.OutputSuccess("[%i] %x - %s", i, frag.id, name);
} else {
// unknown name
}

If your log statements are always strings of the form "[%i] 0xdd - message..." and frag.id is always an integer between 0 and 30, you could instead declare an array of strings:
std::string messagesArray[] = {"[%i] 0x00 - message one", "[%i] 0x01 - message two", ...};
Then replace the switch statement with
errorLog.OutputSuccess(messagesArray[frag.id], i);

If the possible fragment type values are all contiguous, and you don't want to do anything much more complex than printing a string upon matching, you can just index into an array, e.g.:
const char* typeNames[] = {"Texture Bitmap Name", "Texture Bitmap Info", ...};
/* for each frag.id: */
if (LOWER_LIMIT <= frag.id && frag.id < UPPER_LIMIT) {
printf("[%i] %#02x - %s\n", i, frag.id, typeNames[frag.id-LOWER_LIMIT]);
} else {
/* complain about error */
}

It's impossible to say for sure without seeing more, but it appears that you can avoid the memcpy, and instead use a pointer to walk through the data.
struct_wld_basic_frag *frag = (struct_wld_basic_frag *)wld;
for (int i = 0; i < header.fragmentCount; i++)
errorLog.OutputSuccess(fragment_strings[frag[i].id], i);
For the moment, I've assumed an array of strings for the different fragment types, as recommended by @Chris and @Ates. Even at worst, that will improve readability and maintainability without hurting speed. At best, it might (for example) improve cache usage and give a major speed improvement: one copy of the code to call errorLog.OutputSuccess instead of 30 separate copies could make room for a lot of other "stuff" in the cache.
Avoiding copying the data every time is a lot more likely to do real good, though. At the same time, I should probably add that it's possible for this to cause a problem: if the data isn't correctly aligned in the original buffer, attempting to use the pointer won't work.
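If alignment does turn out to be a problem, a middle ground is to memcpy each fragment header into an aligned local as it is visited; this is well-defined for any byte offset, and compilers usually lower the small fixed-size copy to plain loads. A sketch (the two-field struct here is a stand-in for the real struct_wld_basic_frag, whose layout isn't shown in the question):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for the asker's struct_wld_basic_frag (real layout not shown).
struct struct_wld_basic_frag { uint32_t size; uint32_t id; };

// Alignment-safe read: memcpy from an arbitrary byte offset into a properly
// aligned local. On mainstream compilers this becomes an ordinary load.
struct_wld_basic_frag readFrag(const unsigned char* wld, size_t file_pos) {
    struct_wld_basic_frag frag;
    std::memcpy(&frag, wld + file_pos, sizeof frag);
    return frag;
}
```

Usage: call readFrag(wld, file_pos) in the loop instead of the cast-and-index pointer walk; the copy is one small struct, not the whole file.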


Communicating an array of bvec2 between host->shader and/or shader->shader

I need to communicate two boolean values per entry of an array in a compute shader. For now, I'm getting them from the CPU, but later I will want to generate these values from another compute shader that runs before that one. I got this working as follows:
Using glm::bvec2 I can place the booleans relatively packed into memory (the bvec stores one bool per byte; could be nicer, but it will do for now, and I can always manually pack this). Then, I use vkMapMemory to place the data into a Vulkan buffer (I then copy it to a device-local buffer, but that's probably irrelevant here).
GLSL's bvec2 is unfortunately not equivalent to that (or at least it won't give me the expected values if I use it; maybe I'm doing it wrong? Using bvec2 changes[] yields wrong results in the following code, and I suspect an alignment mismatch). Because of that, the compute shader accesses this array as follows:
layout (binding = 2, scalar) buffer Changes
{
uint changes[];
};
void main() {
//uint is 4 byte, glm::bvec2 is 2 byte
uint changeIndex = gl_GlobalInvocationID.x / 2;
//order seems to be reversed in memory:
//v1(x1, y1) followed by v2(x2, y2) is stored as: 0000000(y2) 0000000(x2) 0000000(y1) 0000000(x1)
uint changeOffset = (gl_GlobalInvocationID.x % 2) * 16;
uint maskx = 1 << (changeOffset + 0);
uint masky = 1 << (changeOffset + 8);
uint uchange = changes[changeIndex];
bvec2 change = bvec2(uchange & maskx, uchange & masky);
}
This works. Took a bit of trial and error but there we go. I have two questions now:
Is there a more elegant way to do this?
When generating the values via compute shaders, I would not be using glm::bvec2. Should I perhaps just manually pack the booleans - one per bit - into uints, or is there a better way?
Performance is pretty important to me in this application, as I'm trying to benchmark things. Memory usage optimizations are secondary, but also worth considering. Being relatively inexperienced with optimizing GLSL, I'm happy about any advice you can give me.
Since glm::bvec2 stores a boolean as two bytes, perhaps the explicitly 8-bit unsigned integer vector type u8vec2 provided by the GL_EXT_shader_8bit_storage extension would be more convenient here? I don't know if the Vulkan driver you're using will support the necessary feature (I assume it's storageBuffer8BitAccess), though.
The comment by Andrea mentions a useful extension: GL_EXT_shader_8bit_storage.
Since write access to my booleans happens in parallel, the only option for fully tight packing (one boolean per bit) would be atomics. I've chosen to trade memory efficiency for performance by storing two booleans in one byte, "wasting" 6 bits. The code is as follows:
#extension GL_EXT_shader_8bit_storage : enable
void getBools(in uint data, out bool split, out bool merge) {
split = (data & 1) > 0;
merge = (data & 2) > 0;
}
uint setBools(in bool split, in bool merge) {
uint result = 0;
if (split) result = result | 1;
if (merge) result = result | 2;
return result;
}
//usage:
layout (binding = 4, scalar) buffer ChangesBuffer
{
uint8_t changes[];
};
//[...]
bool split, merge;
getBools(uint(changes[invocationIdx]), split, merge);
//[...]
changes[idx] = uint8_t(setBools(split, merge));
Note the explicit conversions: the data types provided by the extension do not support arithmetic operations and must be converted before use.
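The same bit layout can be mirrored on the host for sanity-checking, e.g. in C++ (a sketch; only the packing logic is reproduced, not the buffer plumbing):

```cpp
#include <cassert>
#include <cstdint>

// Host-side mirror of the shader's setBools/getBools: two booleans stored in
// the low two bits of a byte-sized value.
uint8_t setBools(bool split, bool merge) {
    uint8_t result = 0;
    if (split) result |= 1;
    if (merge) result |= 2;
    return result;
}

void getBools(uint8_t data, bool& split, bool& merge) {
    split = (data & 1) != 0;
    merge = (data & 2) != 0;
}
```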

Element-wise shifting from smaller array to a larger array

I am programming an ESP32 in the Arduino framework. For my application, I need to create a buffer which will store information from both the present and the last time it was accessed. Here is what I am attempting to do.
//first buffer
char buffer1[4];
//second buffer
char buffer2[8];
void setup() {
//setup
}
//buffer1 values will change with each iteration of loop from external inputs
//buffer2 must store most recent values of buffer1 plus values of buffer1 from when loop last ran
for example:
**loop first iteration**
void loop() {
buffer1[0] = 1;
buffer1[1] = 2;
buffer1[2] = 3;
buffer1[3] = 1;
saveold(); //this is the function I'm trying to implement to save values to buffer2 in an element-wise way
}
//value of buffer2 should now be: buffer2 = {1,2,3,1,0,0,0,0}
**loop second iteration**
void loop() {
buffer1[0] = 2;
buffer1[1] = 3;
buffer1[2] = 4;
buffer1[3] = 2;
saveold();
}
//value of buffer2 should now be: buffer2 = {2,3,4,2,1,2,3,1}
From what I've been able to understand through searching online, the "saveold" function I'm trying to make should implement some form of memmove for these array operations. I've tried to piece it together, but I always overwrite the value of buffer2 instead of shifting new values in while retaining the old ones. This is all I've got:
void saveold() {
memmove(&buffer2[0], &buffer1[0], (sizeof(buffer1[0]) * 4));
}
From my understanding, this copies buffer1 starting from index position 0 to buffer2, starting at index position 0, for 4 bytes (where 1 char = 1 byte).
Computer science is not my background, so perhaps there is some fundamental solution or strategy that I am missing. Any pointers would be appreciated.
You have multiple options to implement saveold():
Solution 1
void saveold() {
// "shift" lower half into upper half, saving recent values (actually it's a copy)
buffer2[4] = buffer2[0];
buffer2[5] = buffer2[1];
buffer2[6] = buffer2[2];
buffer2[7] = buffer2[3];
// copy current values
buffer2[0] = buffer1[0];
buffer2[1] = buffer1[1];
buffer2[2] = buffer1[2];
buffer2[3] = buffer1[3];
}
Solution 2
void saveold() {
// "shift" lower half into upper half, saving recent values (actually it's a copy)
memcpy(buffer2 + 4, buffer2 + 0, 4 * sizeof buffer2[0]);
// copy current values
memcpy(buffer2 + 0, buffer1, 4 * sizeof buffer1[0]);
}
Some notes
There are even more ways to do it. Anyway, choose the one you understand best.
Be sure that buffer2 is exactly double the size of buffer1.
memcpy() can be used safely only if source and destination don't overlap; memmove() handles overlapping regions correctly.
&buffer1[0] is the same as buffer1 + 0. Feel free to use the expression you understand better.
sizeof is an operator, not a function. So sizeof buffer1[0] evaluates to the size of buffer1[0]. The most common and accepted expression to calculate the element count of an array is sizeof buffer1 / sizeof buffer1[0]. You only need parentheses when taking the size of a data type, like sizeof (int).
Solution 3
The last note leads directly to this improvement of solution 1:
void saveold() {
// "shift" lower half into upper half, saving recent values
size_t size = sizeof buffer2 / sizeof buffer2[0];
for (size_t i = 0; i < size / 2; ++i) {
buffer2[size / 2 + i] = buffer2[i];
}
// copy current values
for (size_t i = 0; i < size / 2; ++i) {
buffer2[i] = buffer1[i];
}
}
To apply this knowledge to solution 2 is left as an exercise for you. ;-)
The correct way to do this is to use buffer pointers, not hard-copy backups. Doing hard copies with memcpy is particularly bad on slow legacy microcontrollers such as AVR. I'm not quite sure what MCU this ESP32 has; it seems to be an unusual one from Tensilica. Anyway, this answer applies universally to any processor where you have more data than the CPU's data word length.
perhaps there is some fundamental solution or strategy that I am missing.
Indeed - it really sounds that what you are looking for is a ring buffer. That is, an array of fixed size which has a pointer to the beginning of the valid data, and another pointer at the end of the data. You move the pointers, not the data. This is much more efficient both in terms of execution speed and RAM usage, compared to making naive hardcopies with memcpy.
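To make the ring-buffer idea concrete, here is a minimal sketch (illustrative only; the names are made up, and a real implementation would add whatever empty/full handling the application needs). Pushing overwrites the oldest element when full; nothing is ever shifted.

```cpp
#include <cassert>
#include <cstddef>

// Minimal ring buffer: a fixed array plus a head index. "Shifting" is
// replaced by moving the index, so pushes are O(1) regardless of N.
template <typename T, size_t N>
struct RingBuffer {
    T data[N] = {};
    size_t head = 0;    // index of the oldest stored element
    size_t count = 0;   // number of valid elements

    void push(T value) {                 // overwrite the oldest when full
        data[(head + count) % N] = value;
        if (count < N) ++count;
        else head = (head + 1) % N;
    }
    T at(size_t i) const {               // i == 0 is the oldest element
        return data[(head + i) % N];
    }
};
```

For the question's use case, pushing the four fresh buffer1 values each loop iteration keeps the last two snapshots available without any copying.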

While loop in compute shader is crashing my video card driver

I am trying to implement a Binary Search in a compute shader with HLSL. It's not a classic Binary Search as the search key as well as the array values are float. If there is no matching array value to the search key, the search is supposed to return the last index (minIdx and maxIdx match at this point). This is the worst case for classic Binary Search as it takes the maximum number of operations, I am aware of this.
So here's my problem:
My implementation looks like this:
uint BinarySearch (Texture2D<float> InputTexture, float key, uint minIdx, uint maxIdx)
{
uint midIdx = 0;
while (minIdx <= maxIdx)
{
midIdx = minIdx + ((maxIdx + 1 - minIdx) / 2);
if (InputTexture[uint2(midIdx, 0)] == key)
{
// this might be a very rare case
return midIdx;
}
// determine which subarray to search next
else if (InputTexture[uint2(midIdx, 0)] < key)
{
// as we have a decreasingly sorted array, we need to change the
// max index here instead of the min
maxIdx = midIdx - 1;
}
else if (InputTexture[uint2(midIdx, 0)] > key)
{
minIdx = midIdx;
}
}
return minIdx;
}
This leads to my video driver crashing on program execution. I don't get a compile error.
However, if I use an if instead of the while I can execute it and the first iteration works as expected.
I already did a couple of searches and I suspect this might have to do something with dynamic looping in a compute shader. But I have no prior experience with compute shaders and only little experience with HLSL as well, which is why I feel kind of lost.
I am compiling this with cs_5_0.
Could anyone please explain what I am doing wrong or at least hint me to some documentation/explanation? Anything that can get me started on solving and understanding this would be super-appreciated!
DirectCompute shaders are still subject to the Timeout Detection & Recovery (TDR) behavior in the drivers. This basically means if your shader takes more than 2 seconds, the driver assumes the GPU has hung and resets it. This can be challenging with DirectCompute where you intentionally want the shader to run a long while (much longer than rendering usually would). In this case it may be a bug, but it's something to be aware of.
With Windows 8.0 or later, you can allow long-running shaders by using D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT when you create the device. This will, however, apply to all shaders not just DirectCompute so you should be careful about using this generally.
For special-purpose systems, you can also use registry keys to disable TDRs.
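In this case it likely is such a bug: the update minIdx = midIdx makes no progress once the range has shrunk to a single element whose value is greater than the key, so the while loop never exits and the TDR fires. A host-side C++ port (a sketch, with a plain float array standing in for the texture and an iteration cap so the hang is observable) demonstrates it:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Port of the shader's loop. Returns the number of iterations used,
// or -1 if the cap was hit (i.e. the loop would never terminate).
int binarySearchIterations(const std::vector<float>& data, float key,
                           uint32_t minIdx, uint32_t maxIdx, int cap) {
    for (int it = 0; it < cap; ++it) {
        if (minIdx > maxIdx) return it;            // while-condition fails
        uint32_t midIdx = minIdx + (maxIdx + 1 - minIdx) / 2;
        if (data[midIdx] == key) return it;
        else if (data[midIdx] < key) maxIdx = midIdx - 1; // decreasing array
        else minIdx = midIdx;  // bug: no progress once midIdx == minIdx
    }
    return -1;  // on the GPU this is the infinite loop that trips the TDR
}
```

Changing the else branch to minIdx = midIdx + 1 and computing midIdx without the +1 bias is the usual way to guarantee progress in both directions.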

Want to translate/typecast parts of a char array into values

I'm playing around with networking, and I've hit a bit of a road block with translating a packet of lots of data into the values I want.
Basically I've made a mockup of what I'm expecting my packets to look like: essentially a char (8-bit value) indicating what the message is, which is detected by a switch statement that then populates values based on the data after that 8-bit value. I'm expecting my packet to have all sorts of messages in it, which may not be in order.
Eg, I may end up with the heartbeat at the end, or a string of text from a chat message, etc.
I just want to be able to say to my program: take the data from a certain point in the char array and typecast (if that's the term for it?) it into what I want it to be. What is a nice easy way to do that?
char bufferIncoming[15];
ZeroMemory(bufferIncoming,15);
//Making a mock packet
bufferIncoming[0] = 0x01; //Heartbeat value
bufferIncoming[1] = 0x01; //Heartbeat again just cause I can
bufferIncoming[2] = 0x10; //This should = 16 if it's just an 8bit number,
bufferIncoming[3] = 0x00; // This
bufferIncoming[4] = 0x00; // and this
bufferIncoming[5] = 0x00; // and this
bufferIncoming[6] = 0x09; // and this should equal "9" if it is a 32bit number (int)
bufferIncoming[7] = 0x00;
bufferIncoming[8] = 0x00;
bufferIncoming[9] = 0x01;
bufferIncoming[10] = 0x00; //These 4 should be 256 I think when combined into an unsigned int
//End of mockup packet
int bufferSize = 15; //Just an arbitrary value for now
int i = 0;
while (i < bufferSize)
{
switch (bufferIncoming[i])
{
case 0x01: //Heart Beat
{
cout << "Heartbeat ";
}
break;
case 0x10: //Player Data
{
//We've detected the byte that indicates the following 8 bytes will be player data. In this case a X and Y position
playerPosition.X = ??????????; //How do I combine the 4 hex values for this?
playerPosition.Y = ??????????;
}
break;
default:
{
cout << ".";
}
break;
}
i++;
}
cout << " End of Packet\n";
UPDATE
Following Clairvoire's idea I added the following.
playerPosition.X = long(bufferIncoming[3]) << 24 | long(bufferIncoming[4]) << 16 | long(bufferIncoming[5]) << 8 | long(bufferIncoming[6]);
Notice I changed around the shifting values.
Another important change was
unsigned char bufferIncoming[15]
If I didn't do that, I was getting negative values being mixed with the combining of each element. I don't know what the compiler was doing under the hood but it was bloody annoying.
As you can imagine this is not my preferred solution but I'll give it a go. "Chad" has a good example of how I could have structured it, and a fellow programmer from work also recommended his implementation. But...
I have this feeling that there must be a faster cleaner way of doing what I want. I've tried things like...
playerPosition.X = *(bufferIncoming + 4) //Only giving me the value of the one hex value, not the combined >_<
playerPosition.X = reinterpret_cast<unsigned long>(&bufferIncoming); //Some random number that I dont know what it was
..and a few other things that I've deleted that didn't work either. What I was expecting to do was point somewhere in that char buffer and say "hey playerPosition, start reading from this position, and fill in your values based off the byte data there".
Such as maybe...
playerPosition = (playerPosition)bufferIncoming[5]; //Reads from this spot and fills in the 8 bytes worth of data
//or
playerPosition.Y = (playerPosition)bufferIncoming[9]; //Reads in the 4 bytes of values
...Why doesn't it work like that, or something similar?
There is probably a pretty version of this, but personally I would combine the four char variables using left shifts and ors like so:
playerPosition.X = long(buffer[0]) | long(buffer[1])<<8 | long(buffer[2])<<16 | long(buffer[3])<<24;
Endianness shouldn't be a concern, since bitwise logic is always executed the same, with the lowest order on the right (like how the ones place is on the right for decimal numbers)
Edit: Endianness may become a factor depending on how the sending machine initially splits the integer up before sending it across the network. If it doesn't decompose the integer in the same way as it does to recompose it using shifts, you may get a value where the first byte is last and the last byte is first. It's small ambiguities like these that prompt most to use networking libraries, aha.
An example of splitting an integer using bitwise would look something like this
buffer[0] = integer&0xFF;
buffer[1] = (integer>>8)&0xFF;
buffer[2] = (integer>>16)&0xFF;
buffer[3] = (integer>>24)&0xFF;
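Putting the two directions together, the split and combine shown above round-trip cleanly; the byte order on the wire is little-endian by construction, independent of host endianness (a sketch; splitU32/combineU32 are made-up helper names):

```cpp
#include <cassert>
#include <cstdint>

// Decompose a 32-bit value into 4 bytes, least significant byte first.
void splitU32(uint32_t v, unsigned char buffer[4]) {
    buffer[0] = v & 0xFF;
    buffer[1] = (v >> 8) & 0xFF;
    buffer[2] = (v >> 16) & 0xFF;
    buffer[3] = (v >> 24) & 0xFF;
}

// Recombine with shifts and ORs, exactly mirroring the decomposition.
uint32_t combineU32(const unsigned char buffer[4]) {
    return uint32_t(buffer[0]) | uint32_t(buffer[1]) << 8 |
           uint32_t(buffer[2]) << 16 | uint32_t(buffer[3]) << 24;
}
```

Because both ends agree on the byte order, this avoids the ambiguity mentioned in the edit above; the unsigned char type also sidesteps the sign-extension issue the asker ran into.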
In a typical messaging protocol, the most straightforward approach is a set of message structs that you can cast to directly. Inheritance (or composition) along with byte-aligned structures (important when casting from a raw data pointer, as here) makes this relatively easy:
struct Header
{
unsigned char message_type_;
unsigned long message_length_;
};
struct HeartBeat : public Header
{
// no data, just a heartbeat
};
struct PlayerData : public Header
{
unsigned long position_x_;
unsigned long position_y_;
};
unsigned char* raw_message; // filled elsewhere
// reinterpret_cast is usually best avoided, however in this particular
// case we are casting two completely unrelated types and is therefore
// necessary
Header* h = reinterpret_cast<Header*>(raw_message);
switch(h->message_type_)
{
case HeartBeat_MessageType:
break;
case PlayerData_MessageType:
{
PlayerData* data = reinterpret_cast<PlayerData*>(h);
}
break;
}
Was talking to one of the programmers I know on Skype and he showed me the solution I was looking for.
playerPosition.X = *(int*)(bufferIncoming+3);
I couldn't remember how to get it to work, or what its called. But it seems all good now.
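For reference, the pointer cast in that one-liner assumes the int at that offset is suitably aligned and runs afoul of strict aliasing; a memcpy into a local is the well-defined equivalent and compiles to the same load on mainstream compilers (a sketch, with a hypothetical helper name):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Well-defined equivalent of *(int*)(bufferIncoming + 3): memcpy avoids
// the alignment and strict-aliasing assumptions of the pointer cast.
int32_t readInt32At(const unsigned char* buf, size_t offset) {
    int32_t value;
    std::memcpy(&value, buf + offset, sizeof value);
    return value;
}
```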
Thanks guys for helping out :)

Random memory accesses are expensive?

While optimizing my connect-four game engine I reached a point where further improvements can only be minimal, because much of the CPU time is used by the instruction TableEntry te = mTable[idx + i] in the following code sample.
TableEntry getTableEntry(unsigned __int64 lock)
{
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
for (int i = 0; i < BUCKETSIZE; i++)
{
TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
if (te.height == NOTSET || lock == te.lock)
return te;
}
return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried to replace the vector by allocating the table with new, without speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist Hashing function) could be expensive, but really that much? Do you have suggestions to improve the function?
Thank you!
Edit: BUCKETSIZE has a value of 4. It's used as the collision strategy. The size of one TableEntry is 16 bytes; the struct looks like the following:
struct TableEntry
{ // Old New
unsigned __int64 lock; // 8 8
enum { VALID, UBOUND, LBOUND } flag; // 4 4
short score; // 4 2
char move; // 4 1
char height; // 4 1
// -------
// 24 16 Bytes
TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
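The 16-byte figure in the comment column depends on the compiler's layout rules, so it is worth guarding with a static_assert. A sketch (with explicit fixed-width types standing in for the original declarations; actual packing is compiler- and ABI-dependent, which is exactly what the assertion makes visible):

```cpp
#include <cassert>
#include <cstdint>

// Fixed-width restatement of the reordered struct: 8 + 4 + 2 + 1 + 1 bytes,
// which fits the 8-byte alignment of the first member with no padding.
struct TableEntry {
    uint64_t lock;    // 8
    int32_t  flag;    // 4 (the enum, stored as an int)
    int16_t  score;   // 2
    int8_t   move;    // 1
    int8_t   height;  // 1
};
static_assert(sizeof(TableEntry) == 16, "expected tight 16-byte packing");
```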
Summary: The function originally needed 39 seconds. After making the changes jdehaan suggested, the function now needs 33 seconds (the program stops after 100 seconds). It's better but I think Konrad Rudolph is right and the main reason why it's that slow are the cache misses.
You are making copies of your table entry; what about using a TableEntry& as the type? For the default value at the bottom, a static default TableEntry will also do. I suppose that is where you lose much of the time.
const TableEntry& getTableEntry(unsigned __int64 lock)
{
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
for (int i = 0; i < BUCKETSIZE; i++)
{
// hopefuly now less than 35% of CPU usage :-)
const TableEntry& te = mTable[idx + i];
if (te.height == NOTSET || lock == te.lock)
return te;
}
return DEFAULT_TABLE_ENTRY;
}
How big is a table entry? I suspect it's the copy that is expensive not the memory lookup.
Memory accesses are quicker when they are contiguous because of cache hits, but it seems you are already doing that.
The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses which can be up to three orders of magnitude slower than access to memory in cache. Three orders of magnitude – that’s a factor 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.
I have seen this exact problem with hash tables before. The problem is that continuous random access to the hashtable touch all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size you will thrash. This manifests as the exact problem you are encountering: That instruction which first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.
Use pointers
TableEntry* getTableEntry(unsigned __int64 lock) {
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
TableEntry* max = &mTable[idx + BUCKETSIZE];
for (TableEntry* te = &mTable[idx]; te < max; te++)
{
if (te->height == NOTSET || lock == te->lock)
return te;
}
return &DEFAULT_TABLE_ENTRY;
}