Performance hit when combining class member arrays to a single array - c++

I'm trying to make my code a bit faster and I'm trying to find out if I can gain some performance by better managing arrays stored in objects and stuff.
So the basic idea is that I tend to keep separate arrays for temporary and permanent states. This means they have to be indexed separately all the time, and I have to explicitly write the proper member name every time I want to use them.
This is what a particular class with such arrays looks like:
class solution
{
public:
    //Costs
    float *cost_array;
    float *temp_cost_array;
    //Cost trend
    float *d_cost_array;
    float *temp_d_cost_array;
    ...
};
Now, because I have functions/methods that work on either the temp or the permanent state depending on input arguments, they look like this:
void do_stuff(bool temp){
    if (temp)
        work_on(this->temp_cost_array);
    else
        work_on(this->cost_array);
}
These are very simplistic examples of such branches. These arrays may be indexed separately here and there within the code. So, precisely because such branches are all over the place, I thought this was yet another reason to combine everything, so that I could get rid of those code branches as well.
So I converted my class to:
class solution
{
public:
    //Costs
    float **cost_array;
    //Cost trend
    float **d_cost_array;
    ...
};
Those double-pointer members each point to an array of size 2, with each element being a float* array. These are dynamically allocated just once during object creation at the start of the program and deleted at the end of the program.
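Roughly, the allocation looks like this (the length n and the constructor are simplified placeholders, not my actual code):

solution::solution(int n)
{
    // outer arrays of two pointers, allocated once per object
    cost_array = new float*[2];
    cost_array[0] = new float[n];     // permanent state
    cost_array[1] = new float[n];     // temporary state

    d_cost_array = new float*[2];
    d_cost_array[0] = new float[n];
    d_cost_array[1] = new float[n];
}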
So after that I also converted all the temp branches of my code like this:
void do_stuff(bool temp){
    work_on(this->cost_array[temp]);
}
It looks WAY more elegant, but for some reason performance got much worse (almost 2 times worse), and I seriously can't understand why that happened.
So, as a first insight, I'd really love to hear from more experienced people, if my rationale behind that code optimization was valid or not.
Could the extra indexing required to access each array introduce such a major performance hit that it outweighs getting rid of all the if branching? For sure it depends on how the whole thing works, but the code is a beast and I've no idea how to properly profile the thing as a whole.
Thanks
EDIT:
Environment settings:
Running on Windows 10, VS 2017, Full Optimization enabled (/Ox)

The reason for such a huge performance degradation might be that the change introduced another level of indirection, and traversing that extra level can slow the program down quite significantly.
The object prior the change:
*array -> data[]
*temp_array -> data[]
Assuming the object (i.e. this) is in the CPU cache, prior to the change you had one cache miss: read either of the pointers from the cache (cache hit) and then access the cold data (cache miss).
The object after the change:
**array -> * -> data[]
* -> data[]
Now we have to read the double pointer from the object (cache hit), then fetch the inner pointer from the heap-allocated array of two pointers (cache miss), then access the cold data itself (another cache miss).
Sure, that is the worst-case scenario described above, but it might well be what is happening.
The fix is quite easy: allocate those pointers in the object with float *cost_array[2], not dynamically, i.e.:
*array[2] -> data[]
-> data[]
So in terms of storage and levels of indirection this exactly corresponds to the original data structure prior to the change, and it should behave much the same.
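A sketch of the fixed class, mirroring the one from the question (only the two members shown):

class solution
{
public:
    //Costs: [0] = permanent, [1] = temporary
    float *cost_array[2];      // the two pointers live inside the object itself
    //Cost trend
    float *d_cost_array[2];
    // ...
};

do_stuff can keep indexing with cost_array[temp]; the difference is that the inner pointers are now read from the object that is (presumably) already in cache, instead of from a separately allocated, likely cold, array of pointers.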

Related

Storing large amounts of compile time constant data

I hope that this question hasn’t been asked before, but I couldn’t find any answer after googling for an hour.
I have the following problem: I do numerical mathematics and I have about 40 MB of doubles in the form of certain matrices that are compile time constants. I very frequently use these matrices throughout my program. The creation of these matrices takes a week of computation, so I do not want to compute them on the fly but use precomputed values.
What is the best way of doing this in a portable manner? Right now I have some very large CPP-files, in which I create dynamically allocated matrices with the constants as initialiser lists. Something along the lines of:
data.cpp:
namespace // Anonymous
{
// math::matrix uses dynamic memory internally
const math::matrix K { 3.1337e2, 3.1415926e00, /* a couple of hundred-thousand numbers */ };
}
const math::matrix& get_K() { return K; }
data.h:
const math::matrix& get_K();
I am not sure if this can cause problems with too much data on the stack, due to the initialiser list. Or if this can crash some compilers. Or if this is the right way to go about things. It does seem to be working though.
I also thought about loading the matrices at program startup from an external file, but that also seems messy to me.
Thank you very much in advance!
I am not sure if this can cause problems with too much data on the stack, due to the initialiser list.
There should not be a problem assuming it has static storage with non-dynamic initialisation. Which should be the case if math::matrix is an aggregate.
Given that the values will be compile time constant, you might consider defining them in a header, so that all translation units can take advantage of them at compile time. How beneficial that would be depends on how the values are used.
I also thought about loading the matrices at program startup from an external file
The benefit of this approach is the added flexibility that you gain because you would not need to recompile the program when you change the data. This is particularly useful if the program is expensive to compile.
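If you go that route, a minimal sketch of the loader might look like this (the file name and value count are assumptions; constructing the math::matrix from the raw values is left out):

#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <vector>

std::vector<double> load_matrix_data(const char* path, std::size_t count)
{
    std::vector<double> values(count);
    std::ifstream in(path, std::ios::binary);
    if (!in.read(reinterpret_cast<char*>(values.data()),
                 static_cast<std::streamsize>(count * sizeof(double))))
        throw std::runtime_error("failed to load matrix data");
    return values;
}

// e.g. const math::matrix K{ load_matrix_data("K.bin", /* value count */ 5000000) };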
A slightly cleaner approach to the in-source initialiser list:
// math::matrix uses dynamic memory internally
const math::matrix K {
#include "matrix_initial_values.h"
};
And, in the included header,
3.1337e2, 3.1415926e00, 1,2,3,4,5e6, 7.0,...
Considering your comment of "A few hundred thousand" float values: 1M double values takes 8,000,000 bytes, or about 7.6MB. That's not going to blow the stack. Win64 has a max stack size of 1GB, so you'd have to go really, really nuts, and that's assuming that these values are actually placed on the stack, which they should not be given that it's const.
This is probably implementation-specific, but a large block of literals is typically stored as a big chunk of code-space data that's loaded directly into the process' memory space. The identifier (K) is really just a label for that data. It doesn't exist on the stack or the heap anymore than code does.

Use local variables or access multiple times a struct value ( C++ )

In JS it is good practice to create a variable for reuse, instead of accessing the value in a deep object structure:
for (var i = 0, l = arr.length; i < l; ++i) {
    var someValueINeedOftenHere = arr[i].path.to.value;
    // do several things with this var..
}
So instead of finding the value in this deep object structure, we store it locally and we can reuse it over and over again. This should be a good practice, not only because it lets you write cleaner code, but also because of performance.
So, when I am writing C++ code and I have to iterate over a vector which holds a lot of structs/objects - is it the same there, or does it not matter?
Generally speaking, in C/C++ it doesn't matter. In C and C++, the memory layout of every structure is known at compile-time. When you type arr[i].path.to.value, that's going to be essentially the same as *(&arr[0] + i * (something) + offset_1 + offset_2 + offset_3), and all that will get simplified at compile-time to something like *(&arr[0] + i * (something) + something). And those something's will be computed by the compiler and hard-coded into the binary, so effectively looking up arr[i].path.to is not faster than arr[i].path.to.value.
This is not mandated by the standard or anything as far as I know, but it's how most compilers will actually work.
If you want to be sure in some specific case, you can look at godbolt and see what assembly it cooks up: http://gcc.godbolt.org/
Note that I'm assuming that when you make the local variable, you are taking a reference to the value arr[i].path.to.value, which is most similar to what you do in javascript. If you actually copy the value to a new variable then that will create some copying overhead. I don't think that copying it would be advantageous w.r.t. cache misses unless the usage pattern is pretty complicated. Once you access arr[i].path.to.value once, all the stuff around it is going to be in the cache, and there's no reason that copying it onto the stack would make anything faster.
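For illustration, a small sketch (the nested types are hypothetical stand-ins for arr[i].path.to.value):

#include <cstddef>
#include <vector>

struct To   { int value; };
struct Path { To to; };
struct Elem { Path path; };

void work(std::vector<Elem>& arr)
{
    for (std::size_t i = 0; i < arr.size(); ++i) {
        // Binding a reference mirrors the JS habit without copying anything;
        // either way the compiler folds the member offsets into one constant.
        int& v = arr[i].path.to.value;
        v += 1;
        v *= 2;
    }
}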
It doesn't matter if none of arr, arr[i], path, to, value are/involve references/pointers, as pointed out by Chris Beck.
However, if any of arr, arr[i], path, to, value is/involves a reference/pointer then accessing them will possibly be a cache miss. Accessing them many times makes it multiple cache misses (probably).
Storing a reference/pointer directly to value might therefore be more efficient than getting multiple cache misses while chasing the pointer(s) to get to value. Bear in mind the compiler will probably optimize things like this anyway.
Storing value as a copy might be more efficient if it avoids cache misses, and it's relatively light to copy, and you don't need to modify the original value. Again, the compiler will most probably optimize these cases as well if it's clear that doing so will give a performance improvement.
It all depends. Only optimize if it's been proven to be a problem.

Is there any performance difference between accessing a hardcode array and a run time initialization array?

For example, I want to create a square root table using array SQRT[i] to optimize a game, but I don't know if there is a performance difference between the following initialization when accessing the value of SQRT[i]:
Hardcode array
int SQRT[]={0,1,1,1,2,2,2,2,2,3,3,.......255,255,255};
Generate value at run time
#include <cmath>

int SQRT[65536];
int main(){
    for(int i=0;i<65536;i++){
        SQRT[i]=static_cast<int>(std::sqrt(i));
    }
    //other code
    return 0;
}
Some example of accessing them:
if(SQRT[a*a+b*b]>something)
...
I'm not clear on whether the program stores or accesses a hardcoded array in a different way, and I also don't know whether the compiler would optimize the hardcoded array to speed up access time. Are there performance differences between them when accessing the array?
First, you should do the hard-coded array right:
static const int SQRT[]={0,1,1,1,2,2,2,2,2,3,3,.......255,255,255};
(also using uint8_t instead of int is probably better, so as to reduce the size of the array and making it more cache-friendly)
Doing this has one big advantage over the alternative: the compiler can easily check that the contents of the array cannot be changed.
Without this, the compiler has to be paranoid -- every function call might change the contents of SQRT, and every pointer could potentially be pointing into SQRT and thus any write through an int* or char* might be modifying the array. If the compiler cannot prove this doesn't happen, then that puts limits on the kinds of optimizations it can do, which in some cases could show a performance impact.
Another potential advantage is the ability to resolve things involving constants at compile-time.
If needed, you may be able to help the compiler figure things out with clever use of __restrict__.
In modern C++ you can get the best of both worlds; it should be possible (and in a reasonable way) to write code that would run at compile time to initialize SQRT as a constexpr. That would best be asked a new question, though.
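For reference, a sketch of that constexpr approach (assuming C++17; some compilers may need their compile-time step limits raised for a 65536-entry table):

#include <array>
#include <cstddef>
#include <cstdint>

// floor(sqrt(idx)) == i for every idx in [i*i, (i+1)*(i+1)), so fill by ranges.
constexpr std::array<std::uint8_t, 65536> make_sqrt_table()
{
    std::array<std::uint8_t, 65536> t{};
    std::size_t idx = 0;
    for (unsigned i = 0; i < 256; ++i) {
        std::size_t next = std::size_t(i + 1) * (i + 1);
        if (next > t.size()) next = t.size();
        for (; idx < next; ++idx)
            t[idx] = static_cast<std::uint8_t>(i);
    }
    return t;
}

constexpr auto SQRT = make_sqrt_table();   // read-only storage, cannot alias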
As people said in comments:
if(SQRT[a*a+b*b]>something)
is a horrible example use-case. If that's all you need SQRT for, just square something.
As long as you can tell the compiler that SQRT doesn't alias anything, then a run-time loop will make your executable smaller, and only add a tiny amount of CPU overhead during startup. Definitely use uint8_t, not int. Loading a 32-bit temporary from an 8-bit memory location is no slower than from a zero-padded 32-bit memory location. (The extra movzx instruction on x86, instead of using a memory operand, will more than pay for itself in reduced cache pollution. RISC machines typically don't allow memory operands anyway, so you always need an instruction to load the value into a register.)
Also, sqrt is 10-21 cycle latency on Sandybridge. If you don't need it often, the int->double, sqrt, double->int chain is not much worse than an L2 cache hit. And better than going to L3 or main memory. If you need a lot of sqrt, then sure, make a LUT. The throughput will be much better, even if you're bouncing around in the table and causing L1 misses.
You could optimize the initialization by squaring instead of sqrting, with something like
#include <cstdint>
#include <cstring>

uint8_t sqrt_lookup[65536];
void init_sqrt (void)
{
    int idx = 0;
    for (int i=0 ; i < 256 ; i++) {
        // TODO: check that there isn't an off-by-one here
        int iplus1_sqr = (i+1)*(i+1);
        memset(sqrt_lookup+idx, i, iplus1_sqr-idx);
        idx = iplus1_sqr;
    }
}
You can still get the benefits of having sqrt_lookup being const (compiler knows it can't alias). Either use restrict, or lie to the compiler, so users of the table see a const array, but you actually write to it.
This might involve lying to the compiler, by having it declared extern const in most places, but not in the file that initializes it. You'd have to make sure this actually works, and doesn't create code referring to two different symbols. If you just cast away the const in the function that initializes it, you might get a segfault if the compiler placed it in rodata (or read-only bss memory if it's uninitialized, if that's possible on some platforms?)
Maybe we can avoid lying to the compiler, with:
uint8_t restrict private_sqrt_table[65536]; // not sure about this use of restrict, maybe that will do it?
const uint8_t *const sqrt_lookup = private_sqrt_table;
Actually, that's just a const pointer to const data, not a guarantee that what it's pointing to can't be changed by another reference.
The access time will be the same. When you hardcode an array, the startup code that runs before main initializes it (in an embedded system the startup code copies the read-write data, i.e. the hardcoded values, from ROM to the RAM address where the array is located; if the array is constant, it is accessed directly from ROM).
If the for loop is used to initialize, then there is the overhead of calling the sqrt function.

Performance impact of objects

I am a beginner programmer with some experience in C and C++ programming. I was assigned by the university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data member access through a pointer translate to (e.g. vector->x)?
2. Is it much more than, say, another approach where you simply access the memory through a char* (at the same memory location as variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location, or if I just access it directly?
4. Another question regarding the subject: is accessing heap memory faster than accessing stack memory?
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether that's a pointer to an object or a char* - it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, or you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
In general, you should not start optimizing before you know what part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program. You do not know what part to optimize until you profile your code. Almost universally it's programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
1. On x86, a pointer access is typically one extra instruction, above and beyond what you normally need to perform the operation (e.g. y = object->x; would be one load of the address in object, one load of the value of x, and one store to y - in x86 assembler both loads and stores are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. In other architectures it really comes down to how the architecture works - some architectures have very limited ways of accessing memory and/or loading addresses into pointers, etc., making it awkward to access data through pointers.
2. Exactly the same number of instructions - this applies for all of the above.
3. As #2 - objects in themselves have no impact at all.
4. Heap memory and stack memory are the same kind. One answer says that "stack memory is always in the cache", which is true if it's "near the top of the stack" where all the activity goes on; but if you have an object that was created in main and a pointer to it is passed down through several layers of function calls before being dereferenced, there is an obvious chance that this memory hasn't been used for a long while, so there is no real difference there either. The big difference is that "heap memory is plenty of space, stack is limited", along with "running out of heap allows limited recovery, running out of stack is the immediate end of execution [without tricks that aren't very portable]".
If you look at class as a synonym for struct in C (which aside from some details, they really are), then you will realize that class and objects are not really adding any extra "effort" to the code generated.
Of course, used correctly, C++ can make it much easier to write code where you deal with things that are "do this in a very similar way, but subtly differently". In C, you often end up with:
void drawStuff(Shape *shapes, int count)
{
    for(int i = 0; i < count; i++)
    {
        switch (shapes[i].shapeType)
        {
        case Circle:
            ... code to draw a circle ...
            break;
        case Rectangle:
            ... code to draw a rectangle ...
            break;
        case Square:
            ...
            break;
        case Triangle:
            ...
            break;
        }
    }
}
In C++, we can do this at object creation time, and your "drawStuff" becomes:
void drawStuff(std::vector<Shape*> shapes)
{
    for(auto s : shapes)
    {
        s->Draw();
    }
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to do the selection of which object to create, but once the choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example).
Finally, if performance is IMPORTANT, then run benchmarks, run profiling, and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to dramatically re-organise your data and code because performance sucks due to some bad decision is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes will effectively pass a pointer to the shapes vector instead of copying the entire thing - which may make a difference if there are a few thousand elements in shapes.)
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a field through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to already specify an offset to the target address, meaning that there is no performance penalty there; but even on extreme RISC machines that don't have that, adding the offset should be so cheap that the load completely shadows it.
Stack and heap memory are really the same thing. Just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
Variable. On most processors instructions are translated to something called microcode, similar to how Java bytecode is translated to processor-specific instructions before you run it. How many actual instructions you get differs between processor manufacturers and models.
Same as above; it depends on processor internals most of us know little about.
1+2. What you should be asking is how many clock cycles these operations take. On modern platforms the answer is one. It does not matter how many instructions they are; a modern processor has optimizations to make both run in one clock cycle. I will not go into detail here. In other words, when talking about CPU load there is no difference at all.
Here you have the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs to have data from memory before it can run - and that can take a HUGE amount of clock cycles. Actually, someone proved a few years ago that even with a very optimized program an x86 processor spends at least 50% of its time waiting for memory access.
When you use stack memory you are actually doing the same thing as creating an array of structs. For the data, instructions are not duplicated unless you have virtual functions. This keeps the data contiguous, and if you do sequential access you will get optimal cache hits. When you use heap memory you will create an array of pointers, and each object will have its own memory. That memory will NOT be contiguous, so sequential access will cause a lot of cache misses. And cache misses are what will really make your application slower and should be avoided at all cost.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out [object1][object2] etc. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", this means that you only access one variable, so your sequential access jumps over the other variables in each object and gets more cache misses than if your X variables were laid out contiguously in their own array. And if you have code that uses all the variables in your object, standard arrays will not be slower than an array of objects. It will just load the different arrays into different cache blocks.
While standard arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects - Java objects use more memory and are always stored on the heap. This is the most common mistake that C++ programmers make in Java, and then they complain that Java is slow. However, if they know how to write optimal C++ programs they store data in arrays, which are as fast in Java as in C++.
What I usually do is write a class to store the data, which contains the arrays. Even if you use the heap, it's just one object, which becomes as fast as using the stack. Then I have something like "class myitem { private: int pos; mydata data; public: int getVar1() { return data.getVar1(pos); } }". I am not writing out all of the code here, just illustrating how I do this. Then when I iterate through it, the iterator class does not actually return a new myitem instance for each item; it increases the pos value and returns the same object. This means you get a nice OO API while you actually only have a few objects and nicely contiguous arrays. This pattern is the fastest pattern in C++, and if you don't use it in Java you will know pain.
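Roughly, a sketch of that pattern (mydata/myitem are just illustrative names):

#include <cstddef>
#include <vector>

class mydata {                 // owns the bulk data as plain, contiguous arrays
public:
    std::vector<int>   var1;
    std::vector<float> var2;
    int   getVar1(std::size_t pos) const { return var1[pos]; }
    float getVar2(std::size_t pos) const { return var2[pos]; }
};

class myitem {                 // lightweight view; the iterator reuses one instance
public:
    myitem(const mydata& d, std::size_t p = 0) : data(d), pos(p) {}
    int   getVar1() const { return data.getVar1(pos); }
    float getVar2() const { return data.getVar2(pos); }
    void  next()          { ++pos; }   // advance instead of constructing new items
private:
    const mydata& data;
    std::size_t   pos;
};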
The fact that we get multiple function calls does not really matter. Modern processors have something called branch prediction, which will remove the cost of the vast majority of those calls. Long before the code actually runs, the branch predictor will have figured out what the chains of calls do and replaced them with a single call in the generated microcode.
Also, even if all those calls did run, each would take far fewer clock cycles than the memory accesses they require, which, as I pointed out, makes memory layout the only issue that should bother you.

C++ structure: more members, MUCH slower member access time?

I have a linked list of structures. Let's say I insert x million nodes into the linked list, then I iterate through all nodes to find a given value.
The strange thing is (for me at least), if I have a structure like this:
struct node
{
    int a;
    node *nxt;
};
Then I can iterate through the list and check the value of a ten times faster than when I have another member in the struct, like this:
struct node_complex
{
    int a;
    string b;
    node_complex *nxt;
};
I also tried it with C-style strings (char arrays), and the result was the same: just because I had another member (the string), the whole iteration (+ value check) was 10 times slower, even though I never even touched that member! Now, I do not know how the internals of structures work, but it looks like a high price to pay...
What is the catch?
Edit:
I am a beginner and this is the first time I use pointers, so chances are, the mistake is on my part. I will post the code ASAP (not being at home now).
Update:
I checked the values again, and I now see a much smaller difference: 2x instead of 10x.
It is much more reasonable for sure.
While it is certainly possible it was the case yesterday too and I was just so freaking tired last night I could not divide two numbers, I have just made more tests and the results are mind blowing.
The times for the same number of nodes are:
One int and a pointer: the time to iterate through is 0.101
One int and a string: 0.196
One int and 2 strings: 0.274
One int and 3 strings: 0.147 (!!!)
For two ints it is: 0.107
Look what happens when there is more than two strings in the structure! It gets faster! Did somebody drop LSD into my coffee? No! I do not drink coffee.
It is way too fckd up for my brain at the mo' so I think I will just figure it out on my own instead of draining public resources here at SO.
(Ad: I do not think my profiling class is buggy, and anyway I can see the time difference with my own eyes).
Anyhow, thanks for the help.
Cheers.
It must be related to memory access. You speak of a million linked elements. With just an int and a pointer in the node, each node takes 8 bytes (assuming 32-bit pointers). This takes up 8 MB of memory, which is around the size of typical cache memories.
When you add other members, you increase the overall size of your data. It no longer fits entirely in the cache memory, and you revert to plain memory accesses that are much slower.
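A quick way to see the size growth (exact numbers depend on the implementation):

#include <cstdio>
#include <string>

struct node         { int a; node* nxt; };
struct node_complex { int a; std::string b; node_complex* nxt; };

int main()
{
    std::printf("node: %zu bytes, node_complex: %zu bytes\n",
                sizeof(node), sizeof(node_complex));
    // Multiply the difference by a million nodes and the larger list
    // no longer fits in the cache.
}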
This may also be caused because during the iteration you may create a copy of your structures. That is:
node* pHead;
// ...
for (node* p = pHead; p; p = p->nxt)
{
    node myNode = *p; // here you create a copy!
    // ...
}
Copying a simple structure is very fast. But the member you've added is a string, which is a complex object, and copying it is a relatively expensive operation involving heap access.
Most likely, the issue is that your larger struct no longer fits inside a single cache line.
As I recall, mainstream CPUs typically use a cache line of 32 bytes. This means that data is read into the cache in chunks of 32 bytes at a time, and if you move past these 32 bytes, a second memory fetch is required.
Looking at your struct, it starts with an int, accounting for 4 bytes (usually), and then std::string (I assume, even though the namespace isn't specified), which in my standard library implementation (from VS2010) takes up 28 bytes, which gives us 32 bytes total. Which means that the initial int and the next pointer will be placed in different cache lines, using twice as much cache space, and requiring twice as many memory accesses if both members are accessed during iteration.
If only the pointer is accessed, this shouldn't make a difference, though, as only the second cache line then has to be retrieved from memory.
If you always access the int and the pointer, and the string is required less often, reordering the members may help:
struct node_complex
{
    int a;
    node_complex *nxt;
    string b;
};
In this case, the next pointer and the int are located next to each other, on the same cache line, so they can be read without requiring additional memory reads. But then you incur the additional cost once you need to read the string.
Of course, it's also possible that your benchmarking code includes creation of the nodes, or (intentional or otherwise) copies being created of the nodes, which would obviously also affect performance.
I'm not a specialist at all, but the "cache miss" problem rings a bell while reading your problem.
When you add a member, the structure gets bigger, which can also cause cache misses when going through the linked list (a linked list is naturally cache-unfriendly if the nodes aren't allocated in one block, close to each other in memory).
I can't find another explanation.
However, since the creation code and the loop aren't provided, it's still hard to tell whether your code simply doesn't traverse the list in an efficient way.
Perhaps a solution would be a linked list of pointers to your objects. It may make things more complicated (unless you use smart pointers, etc.) but it may improve search time.