I'm using OpenGL, and in my code I have some unreadable and annoying lines like:
newChild->dposition = dvec_4(dvec_4(newChild->boundingVerts[2].position, 1) + newChild->parent->dposition);
The idea was to keep positions in vec3s; with many objects in the scene, that could amount to a good saving in storage and, even more importantly, reduce the size of the buffers sent to the graphics card. But it leads to really hard-to-read code casting back and forth, and I imagine all those casts cost something. So is it better to keep vec4s and avoid the casting?
Without having access to all of the code, it is hard to say.
However, I would say that using vec4s might well bring performance and code-quality benefits:
Guessing that the data is mostly used on the GPU, it is likely more efficient to load/store a vec4 than a vec3. I am not certain, but I do not believe there is a single instruction to load a vec3; I think it gets broken into loading a vec2 and a float.
Later, you could easily store some additional data in that extra float.
Less casting, and more readable code.
Depending on the memory layout and member types of your struct, it may be aligned to 16 bytes anyway, undoing your "memory optimization".
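To illustrate that last point, here is a minimal sketch (assuming GLM, and a struct forced to 16-byte alignment, e.g. for SIMD or std140-style layouts); the fourth float ends up free either way:

#include <cstdio>
#include <glm/glm.hpp>

struct alignas(16) Node3 {
    glm::vec3 position; // 12 bytes of payload...
};                      // ...but sizeof(Node3) == 16 because of padding

struct Node4 {
    glm::vec4 position; // 16 bytes, no padding required
};

int main() {
    std::printf("%zu %zu\n", sizeof(Node3), sizeof(Node4)); // prints: 16 16
}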
If something is wrong, please correct me.
So I have a struct I need to create. Essentially, its purpose is that an array of these structs will be returned from a variety of functions. All of these structs will be gathered (maybe a few hundred), and I will then need the normal for just one of them: the one with the least t.
So, to summarise: there will be many of these structs, but the normal value will only be needed for one of them.
I was trying to create a struct that captures this idea, so that the one chosen struct can either contain its normal or a way to calculate it. I designed the following struct:
struct result {
    float t;
    bool calculated; // tells whether to read normal.norm or call normal.calculate()
    union {
        float2 norm;
        float2 (*calculate)();
    } normal;
};
Is this the proper way to express this idea?
For example's sake, some of these normal calculations might involve things like trigonometry to figure out a normal on a complex curved surface. We would only want to do this if absolutely necessary.
(I don't believe you've provided enough context to fully answer your question, but I'll try based on what you've given me.)
Is this the proper way to express this idea?
Since you seem to be concerned with performance - probably not. Calling a function through a function pointer is often expensive. What's more, that function doesn't even get the t field's value when you call it... so this will probably not even work.
What should you do, then?
First, figure out if this is even a pain point w.r.t. performance. Don't just optimize this because there's a potential for optimization.
Assuming this is useful to optimize - try avoiding this whole materialization of results. Instead, determine which result needs its normal, then be willing to spend some effort getting that normal; it won't be that bad, since you'll only be doing the work for one piece of data rather than all of them.
PS - No need to use unions these days; we have std::variant in C++17. That would save you the boolean, too.
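For illustration, a minimal sketch of the std::variant approach (C++17); float2 is stubbed out here, and the callback is given t explicitly, since the original function pointer never received it:

#include <variant>

struct float2 { float x, y; }; // stand-in for the asker's float2

struct result {
    float t;
    // Either the precomputed normal, or a function that computes it from t.
    // std::variant remembers which alternative is active, so no bool is needed.
    std::variant<float2, float2 (*)(float)> normal;
};

float2 get_normal(const result& r) {
    if (auto* n = std::get_if<float2>(&r.normal))
        return *n;                                      // already computed
    return std::get<float2 (*)(float)>(r.normal)(r.t);  // compute on demand
}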
Is there any difference between allocating multiple small SSBOs for use in compute shaders and allocating one big one, internally mapped to many arrays?
By difference I mean read/write performance. Does GPU memory even care about the SSBO partitioning, or does it handle everything uniformly?
Here is an example in shader:
layout (std430, binding=1) buffer bufferA
{ int elementsA[]; };
layout (std430, binding=2) buffer bufferB
{ int elementsB[]; };
...
//VS
layout (std430, binding=1) buffer buffers
{
    int elementsA[MAXCOUNT_A];
    int elementsB[MAXCOUNT_B];
    ...
};
One big buffer would avoid the need for many allocations on the CPU side and result in cleaner code, leaving the memory partitioning to the shader code. Of course, I'd need to specify a maximum size for each array representing a buffer, which might result in needless memory allocation. I am, however, more concerned about the runtime access speed.
Is this merging even good practice? Right now my code is making too many small buffer allocations, which is kind of ugly :D.
GPU memory does care about what type of data storage you use. You must first ask yourself why you need SSBOs in general: SSBO data may be stored in global memory on the GPU, whereas UBOs sit in on-chip shared memory, access to which is much faster. I would use SSBOs only for a really HUGE amount of data, where your application cannot live with the UBO block size limits.
Now, regarding your question: you must try it and profile. It is hard to tell whether you're going to gain by using several buffers or just one. But I would go for one long buffer, as it requires less bookkeeping, takes fewer binding slots, and will probably perform faster due to spatial locality of the data in video memory. But I leave the actual test to you.
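For what it's worth, a minimal sketch of the one-big-buffer setup on the CPU side (plain GL 4.3 calls; MAXCOUNT_A and MAXCOUNT_B mirror the shader declaration above):

GLuint ssbo;
glGenBuffers(1, &ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER,
             (MAXCOUNT_A + MAXCOUNT_B) * sizeof(GLint),
             nullptr, GL_DYNAMIC_DRAW);

// Either bind the whole buffer to a single slot...
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, ssbo);

// ...or keep the two-buffer shader interface and expose sub-ranges. Note
// that the offset must be a multiple of
// GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT (query it first).
glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 1, ssbo,
                  0, MAXCOUNT_A * sizeof(GLint));
glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 2, ssbo,
                  MAXCOUNT_A * sizeof(GLint), MAXCOUNT_B * sizeof(GLint));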
I want to learn how to write better code that takes advantage of the CPU's cache. Working with contiguous memory seems to be the ideal situation. That being said, I'm curious if there are similar improvements that can be made with non-contiguous memory, but with an array of pointers to follow, like:
#include <cstdint>
#include <vector>

struct Position {
    int32_t x, y, z;
};
...
std::vector<Position*> posPointers;
...
void updatePosition() {
    for (size_t i = 0; i < posPointers.size(); i++) {
        Position& nextPos = *posPointers[i];
        nextPos.x++;
        nextPos.y++;
        nextPos.z++;
    }
}
This is just some rough mock-up code, and for the sake of learning this properly let's just say that all Position structs were created randomly all over the heap.
Can modern, smart processors such as Intel's i7 look ahead and see that they're going to need the data behind the next pointers very shortly? Would the following lines of code help?
... // for loop
Position& nextPos1 = *posPointers[i];
Position& nextPos2 = *posPointers[i+1];
Position& nextPos3 = *posPointers[i+2];
Position& nextPos4 = *posPointers[i+3];
... // Work on data here
I had read some presentation slides that seemed to indicate code like this would cause the processor to prefetch some data. Is that true? I am aware there are non-standard, platform specific, ways to call prefetching like __builtin_prefetch, but throwing that all over the place just seems like an ugly premature optimization. I am looking for a way I can subconsciously write cache-efficient code.
I know you didn't ask (and probably don't need) a sermon on the proper treatment of caches, but I thought I'd contribute my two cents anyway. Note that all this only applies in hot code. Remember that premature optimization is the root of all evil.
As has been pointed out in the comments, the best way is to have containers of actual data. Generally speaking, flat data structures are much preferable to "pointer spaghetti", even if you have to duplicate some data and/or pay a price for resizing/moving/defragmenting your data structures.
And as you know, flat data structures (e.g. an array of data) only pay off if you access them linearly and sequentially most of the time.
But this strategy may not always be usable. In lieu of actual linear data, you can use other strategies like employing pool allocators, and iterating over the pools themselves, instead of over the vector holding the pointers. This of course has its own disadvantages and can be a bit more complicated.
I'm sure you know this already, but it bears mentioning again: one of the most effective techniques for getting the most out of your cache is having smaller data! In the above code, if you can get away with int16_t instead of int32_t, you should definitely do so. You should pack your many bools and flags and enums into bit-fields, use indexes instead of pointers (especially on 64-bit systems), use fixed-size hash values in your data structures instead of strings, etc.
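As a hypothetical illustration of the smaller-data advice (the exact sizes are implementation-dependent, but these are typical):

#include <cstdint>

struct Fat {                // typically 16 bytes
    int32_t x, y, z;        // 12 bytes
    bool visible;           // 1 byte + 3 bytes of padding
};

struct Slim {               // typically 8 bytes: twice as many per cache line
    int16_t x, y, z;        // fine if the values fit in 16 bits
    uint16_t visible : 1;   // bools and flags packed into a bit-field
    uint16_t flags   : 15;
};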
Now, about your main question: can the processor follow random pointers around and bring the data into cache before it is needed? To a very limited extent, this does happen. As you probably know, modern CPUs employ a lot of tricks to increase their speed (i.e. their instruction retire rate): store buffers, out-of-order execution, superscalar pipelines, multiple functional units of every kind, branch prediction, etc. Most of the time, these tricks just help the CPU keep executing instructions even when the current ones have stalled or are taking too long to finish. For memory loads (the slowest operation, if the data is not in the cache), this means the CPU should get to the load instruction as soon as possible, calculate the address, and request the data from the memory controller. However, the memory controller can have only a very limited number of outstanding requests (usually two these days, but I'm not sure). This means that even if the CPU did very sophisticated things to look ahead into other memory locations (e.g. the elements of your posPointers vector) and deduce that these are the addresses of data your code is going to need, it couldn't get very far ahead, because the memory controller can have only so many requests pending.
In any case, AFAIK, CPUs don't actually do that yet. Note that this is a hard case: the addresses of your randomly distributed memory locations are themselves in memory (as opposed to being in a register, or calculable from the contents of a register). And even if CPUs did it, it wouldn't have much of an effect anyway, because of the memory-interface limitations.
The prefetching technique you mention seems valid to me and I've seen it used, but it only yields a noticeable effect if your CPU has something to do while waiting for the future data to arrive. Incrementing three integers takes a lot less time than loading 12 bytes from memory (loading one cache line, actually), and therefore it won't mean much for the execution time. But if you had something worthwhile and more heavyweight to overlap with the memory prefetches (e.g. calculating a complex function that doesn't require data from memory!) then you could get very nice speedups. You see, the time to go through the above loop is essentially the sum of the times of all the cache misses; the coordinate increments and the loop bookkeeping come for free. So you'd have won more if the free stuff were more valuable!
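To make that concrete, here is a sketch of the question's loop with an explicit prefetch, using the non-standard (GCC/Clang) __builtin_prefetch the asker mentioned; the lookahead distance of 4 iterations is a guess that would need profiling:

void updatePositionWithPrefetch() {
    const size_t n = posPointers.size();
    for (size_t i = 0; i < n; i++) {
        // Hint to the CPU: we'll dereference this pointer a few iterations
        // from now. Only helps if there's enough work to hide the latency.
        if (i + 4 < n)
            __builtin_prefetch(posPointers[i + 4]);
        Position& nextPos = *posPointers[i];
        nextPos.x++;
        nextPos.y++;
        nextPos.z++;
    }
}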
Modern processors have hardware prefetching mechanisms: Intel Hardware prefetcher. They infer stride access patterns to memory and prefetch memory locations that are likely to be accessed in the near future.
However, in the case of totally random pointer chasing, such techniques cannot help. The processor does not know that the program is performing pointer chasing, so it cannot prefetch accordingly. In such cases, hardware mechanisms can even be detrimental to performance, as they prefetch values that are not likely to be used.
The best you can do is try to organize your data structures in memory in such a way that accesses to contiguous portions of memory are more likely.
So, let's say that I have two vertex buffers: one that describes the actual shape I want to draw, and another one that is able to influence the first.
So, what I actually want to be able to do is something like this:
uniform VBO second_one; // pseudocode: there is no such type

void main()
{
    for (int i = 0; i < size_of_array(second_one); ++i)
        // do things with second_one[i] to alter the values
    // create the output information
}
One thing I might want to do is gravity, so that each point in second_one drags the vertex a bit closer to it, and so on; then, after the point is adjusted, apply the matrices to get its actual location.
I would be really surprised if this, or something close to it, were possible. But the whole point is to be able to use a second VBO, or to make it a uniform of type vec3, say, so I can access it.
For what you're wanting, you have three options.
An array of uniforms. GLSL lets you do uniform vec3 stuff[50];, and arrays in GLSL have a .length() method, so you can find out how big they are. Of course, there are limits on the number of uniforms you can use, but you shouldn't need more than 20-30 of these. Anything more than that and you'll really feel the performance drain.
Uniform buffer objects. These can store a bit more data than non-block uniforms, but they still have limits, and the storage comes from a buffer object. Accesses to them are, depending on hardware, slightly slower than accesses to direct uniforms.
Buffer textures. This is a way to attach a buffer object to a texture. With this, you can access vast amounts of memory from within a shader. But be warned: they're not fast to access. If you can make do with one of the above methods, do so; a setup sketch for this option follows below.
Note that #2 and #3 will only be found on hardware capable of supporting GL 3.x and above. So DX10-class hardware.
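If you do end up with option #3, a minimal setup sketch on the C++ side (pointCount and pointData are placeholders; in the shader, declare a samplerBuffer and read it with texelFetch()):

GLuint tbo, tex;
glGenBuffers(1, &tbo);
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
glBufferData(GL_TEXTURE_BUFFER, pointCount * 4 * sizeof(float),
             pointData, GL_STATIC_DRAW);

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_BUFFER, tex);
// GL_RGBA32F: the shader sees one vec4 per texelFetch(); three-component
// formats for buffer textures only arrived in GL 4.0.
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);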
If execution speed is important, should I use this,
struct myType {
float dim[3];
};
myType arr[size];
or to use a 2D array, as in float arr[size][3]?
It does not matter. The compiler will produce exactly the same code in almost all cases. The only difference would be if the struct introduced some kind of padding, but given that you have three floats, that seems unlikely.
It depends on your use case. If you typically use the three dimensions together, the struct organization is reasonable. If you often use the dimensions individually, the array layout is likely to give better performance: contemporary processors don't load individual words but whole cache lines, so if only part of the data is used, words are loaded that are never touched.
The array layout is also more amenable to parallel processing, e.g. using SIMD operations. This is unfortunate to some extent, because the object layout is generally different. Actually, the two layouts you are comparing are probably identical, but if you change things to float array[3][size], they become different.
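To illustrate the distinction this answer draws (SIZE is a placeholder constant):

#include <cstddef>

constexpr std::size_t SIZE = 1024; // placeholder

// AoS: x, y, z of one element sit together; good when the three
// dimensions are used together.
struct myType { float dim[3]; };
myType aos[SIZE];

// SoA: all x values contiguous, then all y, then all z; better for SIMD
// and for passes touching a single component.
float soa[3][SIZE];

// A pass that reads only one component loads every cache line of the AoS
// layout, but only a third of the SoA data:
float sum0_aos(const myType* a, std::size_t n) {
    float s = 0;
    for (std::size_t i = 0; i < n; ++i) s += a[i].dim[0];
    return s;
}
float sum0_soa(const float (*a)[SIZE]) {
    float s = 0;
    for (std::size_t i = 0; i < SIZE; ++i) s += a[0][i];
    return s;
}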
No difference at all. Pick what is more readable to you.
Unless you're working on some weird platform, the memory layout of those two will be the same -- and for the compiler this is what counts most.
The only difference shows up when you pass something to a function.
With the array solution you never copy the array contents: the array decays to a pointer, so only the address is passed.
With the struct solution, the struct is copied unless you explicitly pass its address (or a reference).
One other thing to keep in mind, which another poster mentioned: if dim will always have a size of 3, but the collection really represents something like "Red, Green, Blue" or "X, Y, Z" or "Car, Truck, Boat", from a maintenance standpoint you might be better off breaking them out. That is, use something like
typedef struct VEHICLES
{
    float fCar;
    float fTruck;
    float fBoat;
} Vehicles;
That way when you come back in two years to debug it, or someone else has to look at it, they will not have to guess what dim[0], dim[1] and dim[2] refer to.
You might want to map the 2D array out to 1D; it can be more cache friendly.
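A sketch of the flattening this answer suggests; note that a C-style float arr[size][3] is already contiguous, so this mainly pays off when the rows would otherwise be allocated separately (SIZE is a placeholder):

#include <cstddef>
#include <vector>

std::vector<float> flat(SIZE * 3); // one allocation, fully contiguous

// Row-major indexing: component j of element i.
inline float& dim(std::vector<float>& f, std::size_t i, std::size_t j) {
    return f[i * 3 + j];
}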