C++ structure: more members, MUCH slower member access time? - c++

I have a linked list of structures. Let's say I insert x million nodes into the linked list,
then I iterate through all nodes to find a given value.
The strange thing is (for me at least), if I have a structure like this:
struct node
{
    int a;
    node *nxt;
};
Then I can iterate through the list and check the value of a ten times faster than when I have another member in the struct, like this:
struct node_complex
{
    int a;
    string b;
    node_complex *nxt;
};
I also tried it with C-style strings (char arrays), and the result was the same: just because I had another member (the string), the whole iteration (plus the value check) was 10 times slower, even though I never even touched that member! Now, I do not know how the internals of structures work, but it looks like a high price to pay...
What is the catch?
Edit:
I am a beginner and this is the first time I use pointers, so chances are, the mistake is on my part. I will post the code ASAP (not being at home now).
Update:
I checked the values again, and I now see a much smaller difference: 2x instead of 10x.
It is much more reasonable for sure.
While it is certainly possible it was the case yesterday too and I was just so freaking tired last night I could not divide two numbers, I have just made more tests and the results are mind blowing.
The times for the same number of nodes are:
One int and a pointer: 0.101
One int and a string: 0.196
One int and 2 strings: 0.274
One int and 3 strings: 0.147 (!!!)
Two ints: 0.107
Look what happens when there are more than two strings in the structure! It gets faster! Did somebody drop LSD into my coffee? No! I do not drink coffee.
It is way too messed up for my brain at the moment, so I think I will just figure it out on my own instead of draining public resources here at SO.
(Addendum: I do not think my profiling class is buggy, and in any case I can see the time difference with my own eyes.)
Anyhow, thanks for the help.
Cheers.

It must be related to memory access. You speak of a million linked elements. With just an int and a pointer in each node, a node takes 8 bytes (assuming 32-bit pointers). A million of them take up 8 MB of memory, which is around the size of typical cache memories.
When you add other members, you increase the overall size of your data. It no longer fits entirely in the cache, and you fall back to plain memory accesses, which are much slower.
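The size effect is easy to check directly; a minimal sketch (exact sizes are implementation- and platform-dependent):

```cpp
#include <string>

// Minimal reproduction of the two node layouts from the question.
// Adding the string member grows every node, so far fewer nodes fit
// in each cache level during traversal.
struct node {
    int a;
    node* nxt;
};

struct node_complex {
    int a;
    std::string b;
    node_complex* nxt;
};
```

On a typical 64-bit build, sizeof(node) is 16 while sizeof(node_complex) is 48 or more, tripling the cache footprint of the list even if the string is never touched.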

This may also be caused because during the iteration you may create a copy of your structures. That is:
node* pHead;
// ...
for (node* p = pHead; p; p = p->nxt)
{
    node myNode = *p; // here you create a copy!
    // ...
}
Copying a simple structure is very fast. But the member you've added is a string, which is a complex object: copying it is a relatively expensive operation involving heap access.
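A traversal that cannot hit this problem reads the member through the pointer and never copies a node; a minimal sketch:

```cpp
// Traversal that never copies a node: only the pointer is advanced,
// and the int member is read in place through it.
struct node {
    int a;
    node* nxt;
};

// Counts how many nodes hold the given value; no per-node copies are made,
// so the string member (if the struct had one) would never be touched.
int count_value(const node* head, int value) {
    int hits = 0;
    for (const node* p = head; p != nullptr; p = p->nxt) {
        if (p->a == value) ++hits;  // read through the pointer, no copy
    }
    return hits;
}
```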

Most likely, the issue is that your larger struct no longer fits inside a single cache line.
As I recall, mainstream CPUs have typically used a cache line of 32 bytes (64 bytes on most modern x86 CPUs; I'll assume 32 here). This means that data is read into the cache in chunks of 32 bytes at a time, and if you move past these 32 bytes, a second memory fetch is required.
Looking at your struct, it starts with an int, accounting for 4 bytes (usually), followed by std::string (I assume, even though the namespace isn't specified), which in my standard library implementation (from VS2010) takes up 28 bytes, giving 32 bytes total. This means that the initial int and the next pointer will be placed in different cache lines, using twice as much cache space and requiring twice as many memory accesses if both members are accessed during iteration.
If only the pointer is accessed, this shouldn't make a difference, though, as only the second cache line then has to be retrieved from memory.
If you always access the int and the pointer, and the string is required less often, reordering the members may help:
struct node_complex
{
    int a;
    node_complex *nxt;
    string b;
};
In this case, the next pointer and the int are located next to each others, on the same cache line, so they can be read without requiring additional memory reads. But then you incur the additional cost once you need to read the string.
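If you want to verify the layout, offsetof can show where the next pointer lands. A sketch with a 28-byte char array standing in for the string (offsetof on a type containing std::string is only conditionally supported; the numbers assume a typical 64-bit build):

```cpp
#include <cstddef>

// Stand-ins for the question's structs, with a 28-byte char array in place
// of std::string so that offsetof is well-defined (both are standard-layout).
struct cold_in_middle {        // original ordering: int, payload, pointer
    int a;
    char b[28];                // the rarely-used payload
    cold_in_middle* nxt;       // pushed past the first 32 bytes
};

struct hot_members_first {     // reordered: hot members adjacent
    int a;
    hot_members_first* nxt;    // same cache line as 'a'
    char b[28];
};
```

On a typical 64-bit build, nxt sits at offset 32 in the first layout but at offset 8 in the second, so the two hot members share a cache line after reordering.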
Of course, it's also possible that your benchmarking code includes creation of the nodes, or (intentional or otherwise) copies being created of the nodes, which would obviously also affect performance.

I'm not a specialist at all, but the "cache miss" problem rings a bell when reading your problem.
When you add a member, you make the structure bigger, which can also cause cache misses when walking through the linked list (which is naturally cache-unfriendly unless the nodes are allocated in one block, close to each other in memory).
I can't find another explanation.
However, since we aren't shown the list creation and the loop, it's still hard to tell whether your code simply doesn't traverse the list in an efficient way.

Perhaps a solution would be a linked list of pointers to your objects. It may make things more complicated (unless you use smart pointers, etc.) but it may improve search speed.

Related

Performance hit when combining class member arrays to a single array

I'm trying to make my code a bit faster, and I'm trying to find out whether I can gain some performance by managing the arrays stored in my objects more carefully.
The basic idea is that I tend to keep separate arrays for temporary and permanent states. This means they have to be indexed separately all the time, and I have to explicitly write the proper member name every time I want to use them.
This is how a particular class with such arrays looks:
class solution
{
public:
    // Costs
    float *cost_array;
    float *temp_cost_array;
    // Cost trend
    float *d_cost_array;
    float *temp_d_cost_array;
    ...
};
Now, because I have functions/methods that work on either the temporary or the permanent state depending on input arguments, they look like this:
void do_stuff(bool temp) {
    if (temp)
        work_on(this->temp_cost_array);
    else
        work_on(this->cost_array);
}
These are very simplistic examples of such branches; these arrays may be indexed separately here and there within the code. Exactly because such stuff is all over the place, I thought this was yet another reason to combine everything, so that I could get rid of those code branches as well.
So I converted my class to:
class solution
{
public:
    // Costs
    float **cost_array;
    // Cost trend
    float **d_cost_array;
    ...
};
Those double pointers each point to an array of size 2, with each element being a float* array. They are dynamically allocated just once, during object creation at the start of the program, and deleted at the end of the program.
So after that I also converted all the temp branches of my code like this:
void do_stuff(bool temp) {
    work_on(this->cost_array[temp]);
}
It looks WAY more elegant than before, but for some reason performance got much worse (almost 2 times worse), and I seriously can't understand why that happened.
So, as a first insight, I'd really love to hear from more experienced people whether my rationale behind that code optimization was valid or not.
Could the extra indexing required to access each array introduce such a major performance hit that it outweighs removing all the if branching? For sure it depends on how the whole thing works, but the code is a beast and I've no idea how to properly profile the thing as a whole.
Thanks
EDIT:
Environment settings:
Running on Windows 10, VS 2017, Full Optimization enabled (/Ox)
The reason for such a huge performance degradation might be that the change introduced another level of indirection, and traversing it can slow the program down quite significantly.
The object prior to the change:

    *array      -> data[]
    *temp_array -> data[]

Assuming the object (i.e. this) is in the CPU cache, prior to the change you had one cache miss: take either of the pointers from the cache (cache hit) and access the cold data (cache miss).
The object after the change:

    **array -> * -> data[]
               * -> data[]

Now we have to fetch the pointer to the array of pointers (cache hit), then index into that cold pointer array (cache miss), then access the cold data itself (another cache miss).
Sure, that is the worst-case scenario, but it might well be what is happening.
The fix is quite easy: store those pointers in the object as a fixed-size array, float *cost_array[2], instead of allocating them dynamically, i.e.:

    *array[2] -> data[]
              -> data[]

In terms of storage and levels of indirection this corresponds exactly to the original data structure prior to the change, and it should behave much the same.
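A minimal sketch of that fix (the class and accessors are simplified stand-ins for the question's code):

```cpp
#include <cstddef>

// The suggested fix: a fixed-size array of two pointers stored directly in
// the object, so selecting temp vs. permanent costs no extra pointer chase
// through a separately allocated float** block.
class solution {
public:
    float* cost_array[2];  // [0] = permanent, [1] = temporary

    explicit solution(std::size_t n) {
        cost_array[0] = new float[n]();  // zero-initialized
        cost_array[1] = new float[n]();
    }
    ~solution() {
        delete[] cost_array[0];
        delete[] cost_array[1];
    }
    solution(const solution&) = delete;
    solution& operator=(const solution&) = delete;

    // The branch-free indexing from the question still works unchanged.
    void set(bool temp, std::size_t i, float v) { cost_array[temp][i] = v; }
    float get(bool temp, std::size_t i) const { return cost_array[temp][i]; }
};
```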

Is every element access in std::vector a cache miss?

It's known that std::vector holds its data on the heap, so the instance of the vector itself and its first element have different addresses. On the other hand, std::array is a lightweight wrapper around a raw array, and its address is equal to its first element's address.
Let's assume that the collections are big enough to fill the L1 cache with int32 values. On my machine with 384 kB of L1 cache that's 98304 numbers.
If I iterate over the std::vector, it turns out that I always access the address of the vector itself first and the element's address next, and these addresses are probably not in the same cache line. So every element access would be a cache miss.
But if I iterate over a std::array, the addresses are in the same cache line, so it should be faster.
I tested with VS2013 with full optimization, and std::array is approximately 20% faster.
Am I right in my assumptions?
Update (in order not to create a second similar topic): in this code I have an array and some local variable:
void test(array<int, 10>& arr)
{
    int m{ 42 };
    for (int i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = i * m;
    }
}
In the loop I'm accessing both an array and a stack variable which are placed far from each other in memory. Does that mean that every iteration I'll access different memory and miss the cache?
Many of the things you've said are correct, but I do not believe that you're seeing cache misses at the rate that you believe you are. Rather, I think you're seeing other effects of compiler optimizations.
You are right that when you look up an element in a std::vector, there are two memory reads: first, a read of the pointer to the elements; second, a read of the element itself. However, if you do multiple sequential reads on the std::vector, then chances are that only the very first read will miss the cache on the elements, and all successive reads will either be in cache or be misses you could not have avoided anyway. Memory caches are optimized for locality of reference, so whenever a single address is pulled into the cache, a large number of adjacent memory addresses are pulled in as well. As a result, if you iterate over the elements of a std::vector, most of the time you won't have any cache misses at all, and the performance should look quite similar to that of a regular array.

It's also worth remembering that the cache stores multiple different memory locations, not just one, so the fact that you're reading both something on the stack (the std::vector's internal pointer) and something on the heap (the elements), or two different elements on the stack, won't immediately cause a cache miss.
Something to keep in mind is that cache misses are extremely expensive compared to cache hits - often 10x slower - so if you were indeed seeing a cache miss on each element of the std::vector you wouldn't see a gap of only 20% in performance. You'd see something a lot closer to a 2x or greater performance gap.
So why, then, are you seeing a difference in performance? One big factor that you haven't yet accounted for is that if you use a std::array<int, 10>, then the compiler can tell at compile-time that the array has exactly ten elements in it and can unroll or otherwise optimize the loop you have to eliminate unnecessary checks. In fact, the compiler could in principle replace the loop with 10 sequential blocks of code that all write to a specific array element, which might be a lot faster than repeatedly branching backwards in the loop. On the other hand, with equivalent code that uses std::vector, the compiler can't always know in advance how many times the loop will run, so chances are it can't generate code that's as good as the code it generated for the array.
Then there's the fact that the code you've written here is so small that any attempt to time it will pick up a ton of noise. It would be difficult to assess its speed reliably, since something as simple as putting it in a for loop changes the cache behavior compared to a "cold" run of the method.
Overall, I wouldn't attribute this to cache misses, since I doubt there's any appreciably different number of them. Rather, I think this is compiler optimization on arrays whose sizes are known statically compared with optimization on std::vectors whose sizes can only be known dynamically.
I think it has nothing to do with cache misses.
You can think of std::array as a wrapper around a raw array, i.e. int arr[10], and of vector as a wrapper around a dynamic array, i.e. new int[10]. They should have the same performance. However, when you access a vector, you operate on the dynamic array through a pointer, and normally the compiler can optimize code with arrays better than code with pointers. That might be the reason for your test result: std::array is faster.
You can run a test replacing std::array with int arr[10]. Although std::array is just a wrapper around int arr[10], you might get even better performance (in some cases the compiler can do better optimization with a raw array). You can also run another test replacing the vector with new int[10]; they should have equal performance.
For your second question: the local variable m will be kept in a register (if optimization is enabled), so there will be no access to the memory location of m during the loop. It won't be a source of cache misses either.
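For completeness, the two variants under discussion can be written side by side; they compute identical results, so any timing difference comes from code generation, not semantics (a sketch):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// The same fill loop over std::array and std::vector. Semantically identical,
// but the array's size is a compile-time constant, which lets the compiler
// unroll the loop and drop end-of-loop checks.
void fill_array(std::array<int, 10>& arr) {
    int m = 42;
    for (std::size_t i = 0; i < arr.size(); ++i)
        arr[i] = static_cast<int>(i) * m;
}

void fill_vector(std::vector<int>& v) {
    int m = 42;  // kept in a register under optimization
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<int>(i) * m;
}
```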

Keep vector members of class contiguous with class instance

I have a class that implements two simple, pre-sized stacks; these are stored as vector members of the class, pre-sized by the constructor. They are small, cache-line-friendly objects.
Those two stacks are constant in size, persisted, and updated lazily, and they are often accessed together by some computationally cheap methods that can, however, be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in a good state (the code is clean and does what it's supposed to do), and all sizes are kept under control (64 kB to 128 kB in most cases for the whole chain of ops including results; rarely do they get close to 256 kB, so at worst an L2 look-up and often an L1 hit).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
    std::vector<ControlPoint> m_controls;
    std::vector<Segment> m_segments;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    std::vector<unsigned int> m_sgSampleCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;
};
Curve::Curve(){
    m_controls = std::vector<ControlPoint>(CONTROL_CAP);
    m_segments = std::vector<Segment>( (CONTROL_CAP-3) );
    m_cvCount = 0;
    m_sgCount = 0;
    m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP-3);
    m_maxIter = 3;
    m_iterSamples = 20;
    m_lengthTolerance = 0.001f;
    m_length = 0.0f;
}
Curve::~Curve(){}
Bear with the verbosity, please; I'm trying to educate myself and make sure I'm not operating on some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data. Keep in mind this is on Intel CPUs (Ivy Bridge and a few Haswell) and with GCC 4.4; I have no other use cases for this:
I'm assuming that if the actual storage of the controls and segments is contiguous with the instance of Curve, that's an ideal scenario for the cache (size-wise the lot can easily fit in the caches of my target CPUs).
A related assumption is that if the vectors are distant from the instance of the Curve, and from each other, then as methods alternately access the contents of those two members, there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (is data pulled into the cache in one stretch of cache-line size starting from the address first looked up on a new operation), or am I misunderstanding the caching mechanism, and can the cache pull in and preserve multiple smaller stretches of RAM?
2) Following the above: so far, by pure circumstance, all my tests end up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally, instancing the class reserves only the space for the object itself, and the vectors' storage is then allocated in the next free chunk available, which is not guaranteed to be anywhere near my Curve instance if the instance happened to fill a small empty niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand that to guarantee performance I'd have to write an allocator of sorts to make sure the object itself is large enough on instancing, and then place the vectors' contents in there myself and refer to those from then on.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like this. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc, it's not guaranteed contiguous"; that one I already have down :) ).
If the Curve instance fits into a cache line and the data of each of the two vectors also fits into a cache line, the situation is not that bad, because you then have four constant cache lines. If every element were accessed indirectly and positioned randomly in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into fewer than four cache lines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class and not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector content in contiguous storage with the Curve instance.
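A sketch of what that std::array variant could look like; CONTROL_CAP, ControlPoint and Segment are simplified stand-ins for the question's real types:

```cpp
#include <array>

// Stand-ins for the question's types; the real ones will differ.
constexpr unsigned CONTROL_CAP = 16;
struct ControlPoint { float x, y, z; };
struct Segment { float length; };

// With std::array the element storage is embedded in the Curve object itself:
// no separate heap allocation, and no pointer indirection when the cheap
// methods touch both members together.
class Curve {
private:
    std::array<ControlPoint, CONTROL_CAP> m_controls{};
    std::array<Segment, CONTROL_CAP - 3> m_segments{};
    unsigned m_cvCount = 0;
    unsigned m_sgCount = 0;
public:
    ControlPoint& control(unsigned i) { return m_controls[i]; }
    Segment& segment(unsigned i) { return m_segments[i]; }
};
```

Because the arrays are embedded, sizeof(Curve) includes all the element storage, which is exactly the contiguity the question is after.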
BTW: Short style remark:
Curve::Curve()
{
    m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
    m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
    ...
}
...should be written like this:
Curve::Curve() :
    m_controls(CONTROL_CAP),
    m_segments(CONTROL_CAP - 3)
{
    ...
}
This is called an "initializer list"; search for that term for further explanation. Also, a default-constructed element, which you provide as the second parameter, is already the default, so there is no need to specify it explicitly.

c++ Alternative implementation to avoid shifting between RAM and SWAP memory

I have a program that uses dynamic programming to calculate some information. The problem is that, in theory, the memory used grows exponentially. Some filters that I use limit this growth, but for a big input they still can't prevent my program from running out of RAM.
The program runs on 4 threads. When I run it with a really big input, I noticed that at some point the program starts to use swap, because my RAM is not big enough. The consequence is that my CPU usage decreases from about 380% to 15% or lower.
There is only one variable that uses this much memory, which is the following data structure:
Edit (added the type), using the CLN library:
class My_Map {
    typedef std::pair<double, short> key;
    typedef cln::cl_I value;
public:
    tbb::concurrent_hash_map<key, value>* map;
    My_Map() { map = new tbb::concurrent_hash_map<key, value>(); }
    ~My_Map() { delete map; }
    // some functions for operations on the map
};
In my main program I am using this data structure as a global variable:
My_Map* container = new My_Map();
Question:
Is there a way to avoid this shifting of memory between swap and RAM? I thought pushing all the memory onto the heap would help, but it seems not to. So I don't know whether it is possible to, say, make full use of the swap space, or something else. This shifting of memory just costs a lot of time, and the CPU usage decreases dramatically.
If you have 1 GB of RAM and a program that uses 2 GB, then you're going to have to find somewhere else to store the excess data, obviously. The default OS mechanism is swap, but the alternative is to manage your own 'swapping' by using a memory-mapped file.
You open a file and reserve a block of virtual memory in it, then you bring pages of the file into RAM to work on them. The OS manages this for you for the most part, but you should think about your memory usage and try to keep your accesses to blocks that are already in memory, where you can.
On Windows you use CreateFileMapping(); on Linux and Mac you use mmap().
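A minimal POSIX sketch of the idea (error handling reduced to returning -1; the path is just an example):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Writes 'value' into a file-backed mapping and reads it back. The file acts
// as explicitly managed backing store: the OS pages parts of it in and out
// on demand, just like swap, but under your control.
int roundtrip_via_mmap(const char* path, int value) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, sizeof(int)) != 0) { close(fd); return -1; }

    void* mem = mmap(nullptr, sizeof(int), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { close(fd); return -1; }

    *static_cast<int*>(mem) = value;     // write through the mapping
    int back = *static_cast<int*>(mem);  // read it back

    munmap(mem, sizeof(int));
    close(fd);
    unlink(path);  // demo only: discard the backing file
    return back;
}
```

In a real program you would map a region far larger than one int and keep hot blocks grouped together so resident pages are reused.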
The OS is working properly here: it doesn't distinguish between stack and heap when swapping; it pages out whatever you seem not to be using and loads whatever you ask for.
There are a few things you could try:
consider whether myType can be made smaller, e.g. using int8_t or even width-appropriate bitfields instead of int, using pointers to pooled strings instead of worst-case-length character arrays, or using offsets into arrays where they're smaller than pointers, etc. If you show us the type, maybe we can suggest things.
think about your paging: if you have many objects on one memory page (likely 4 kB), the whole page needs to stay in memory if any one of them is being used, so try to get objects that will be used around the same time onto the same memory page. This may involve hashing into small arrays of related myType objects, or even moving all your data into a packed array if possible (binary searching can be pretty quick anyway). Naively used hash tables tend to thrash memory because similar objects are put in completely unrelated buckets.
serialisation/deserialisation with compression is a possibility: instead of letting the OS swap out full myType memory, you may be able to proactively serialise them into a more compact form then deserialise them only when needed
consider whether you need to process all the data simultaneously... if you can batch up the work in such a way that you get all "group A" out of the way using less memory then you can move on to "group B"
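The packed-array idea from the list above can be sketched like this: keys and values live in one sorted, contiguous vector, so lookups touch a few adjacent cache lines instead of hashing into unrelated buckets (the key and value types are simplified stand-ins):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// A flat, sorted key/value store: one contiguous allocation, binary-search
// lookups. Insertion is O(n) due to shifting, so this suits read-heavy data.
struct PackedMap {
    std::vector<std::pair<long long, int>> items;  // kept sorted by key

    void insert(long long key, int value) {
        auto it = std::lower_bound(
            items.begin(), items.end(), key,
            [](const std::pair<long long, int>& p, long long k) { return p.first < k; });
        items.insert(it, {key, value});
    }

    const int* find(long long key) const {
        auto it = std::lower_bound(
            items.begin(), items.end(), key,
            [](const std::pair<long long, int>& p, long long k) { return p.first < k; });
        return (it != items.end() && it->first == key) ? &it->second : nullptr;
    }
};
```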
UPDATE: now you've posted your actual data types...
Sadly, using short might not help much, because sizeof(key) needs to be 16 anyway for alignment of the double; if you don't need the precision, you could consider float. Another option would be to create an array of separate maps...
tbb::concurrent_hash_map<double,value> map[65536];
You can then index as map[my_short][my_double]. It could be better or worse, but it is easy to try, so you might as well benchmark it...
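A sketch of that indexing scheme, with std::unordered_map standing in for tbb::concurrent_hash_map (same idea, but this version is not thread-safe):

```cpp
#include <cstdint>
#include <unordered_map>

// 65536 shards selected by the short; each shard is keyed by the double
// alone, so the padded pair key disappears from every stored entry.
struct ShardedMap {
    std::unordered_map<double, long long> shard[65536];

    long long& at(short s, double d) {
        // cast to uint16_t so a negative short can't produce a negative index
        return shard[static_cast<std::uint16_t>(s)][d];
    }
};
```

Note the object is several megabytes because of the 65536 empty maps, so it belongs on the heap or in static storage, not on the stack.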
For cl_I, a 2-minute dig suggests the data is stored in a union: presumably word is used for small values, and one of the pointers when necessary. That looks like a pretty good design, hard to improve on.
If numbers tend to repeat a lot (a big if), you could experiment with keeping a registry of big cl_Is with a bi-directional mapping to packed integer ids, which you'd store in My_Map::map (fussy, though). To explain: say you get 987123498723489; you push_back it onto a vector<cl_I>, then in a hash_map<cl_I, int> map 987123498723489 to that index (i.e. vector.size() - 1). Keep going as new numbers are encountered. You can always map from an int id back to a cl_I by direct indexing into the vector, and the other direction is an amortised O(1) hash table lookup.
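The registry idea can be sketched with long long standing in for cl_I:

```cpp
#include <unordered_map>
#include <vector>

// Each distinct big number is stored once in a vector; the main map then only
// has to hold a small integer id. id -> number is direct indexing; the
// number -> id direction is one amortised O(1) hash lookup.
struct BigIntRegistry {
    std::vector<long long> by_id;            // id -> number
    std::unordered_map<long long, int> id_of;  // number -> id

    int intern(long long n) {
        auto it = id_of.find(n);
        if (it != id_of.end()) return it->second;  // already registered
        by_id.push_back(n);
        int id = static_cast<int>(by_id.size()) - 1;
        id_of.emplace(n, id);
        return id;
    }

    long long lookup(int id) const { return by_id[id]; }
};
```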

(C++) Looking for tips to reduce memory usage

I've got a problem with a really tight and tough memory limit. I'm a C++ geek and I want to reduce my memory usage. Please give me some tips.
One of my friends recommended taking the functions inside my structs out of them.
for example, instead of using:
struct node {
    int f()
    {}
};
he recommended that I use:
int f(node x)
{}
does this really help?
Note: I have lots of copies of my struct.
here's some more information:
I'm coding a sort of segment tree for a practice problem on an online judge. I keep the tree nodes in a struct. My struct has these variables:
    int start;
    int end;
    bool flag;
    node* left;
    node* right;
The memory limit is 16 MB and I'm using 16.38 MB.
I'm guessing by the subtext of your question that the majority of your memory usage is data, not code. Here are a couple of tips:
If your data ranges are limited, take advantage of it. If the range of an integer is -128 to 127, use char instead of int, or unsigned char if it's 0 to 255. Likewise use int16_t or uint16_t for ranges of -32768..32767 and 0..65535.
Rearrange the structure elements so the larger items come first, so that data alignment doesn't leave dead space in the middle of the structure. You can also usually control padding via compiler options, but it's better just to make the layout optimal in the first place.
Use containers that don't have a lot of overhead. Use vector instead of list, for example. Use boost::ptr_vector instead of std::vector containing shared_ptr.
Avoid virtual methods. The first virtual method you add to a struct or class adds a hidden pointer to a vtable.
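The rearrangement tip is easy to demonstrate; on a typical 64-bit ABI the interleaved layout below needs padding around the double that the reordered one avoids:

```cpp
// Two layouts with identical members. Interleaving small and large members
// forces padding before the 8-byte-aligned double and at the tail; putting
// the most strongly aligned member first packs the small ones together.
struct padded {
    char flag1;   // typically followed by 7 bytes of padding
    double value;
    char flag2;   // typically followed by 7 bytes of tail padding
};

struct compact {
    double value; // largest alignment first
    char flag1;
    char flag2;   // far less tail padding
};
```

On common 64-bit ABIs sizeof(padded) is 24 while sizeof(compact) is 16, a one-third saving from reordering alone.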
No, regular member functions don't make the class or struct larger. Introducing a virtual function might (on many platforms) add a vtable pointer to the class. On x86 that would increase the size by four bytes. No more memory will be required as you add virtual functions, though -- one pointer is sufficient. The size of a class or struct type is never zero (regardless of whether it has any member variables or virtual functions). This is to make sure that each instance occupies its own memory space (source, section 9.0.3).
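A quick check of both claims (sizes assume a typical implementation with one hidden vtable pointer per polymorphic object):

```cpp
// Non-virtual member functions add nothing to an object's size; the first
// virtual function adds one hidden vtable pointer, and further virtual
// functions add nothing more.
struct plain {
    int x;
    int f() { return x; }  // non-virtual: no size cost
};

struct with_virtual {
    int x;
    virtual int f() { return x; }
    virtual int g() { return x + 1; }  // second virtual: still one vtable pointer
    virtual ~with_virtual() {}
};
```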
In my opinion, the best way to reduce memory is to consider your algorithmic space complexity instead of just doing fine-grained code optimizations. Reconsider things like dynamic programming tables, unnecessary copies, and generally anything that is questionable in terms of memory efficiency. Also, try to free memory resources early, whenever they are not needed anymore.
For your final example (the tree), you can use a clever hack with XOR to replace the two node pointers with a single node pointer, as described here. This only works if you traverse the tree in the right order, however. Obviously this hurts code readability, so should be something of a last resort.
You could use compilation flags to do some optimization. If you are using g++, you could test with -O2.
There are great threads about the subject:
C++ Optimization Techniques
Should we still be optimizing "in the small"?
Constants and compiler optimization in C++
What are the known C/C++ optimizations for GCC
The two possibilities are not at all equivalent:
In the first, f() is a member function of node.
In the second, f() is a free (or namespace-scope) function. (Note also that the signatures of the two f()s are different.)
Now note that, in the first style, f() is an inline member function: defining a member function inside the class body makes it inline. Inlining is not guaranteed, though; it is just a hint to the compiler. For functions with small bodies, inlining may be good, as it avoids function call overhead. However, I have never seen it be a make-or-break factor.
If you do not want inlining, or if f() does not qualify for it, you should define it outside the class body (probably in a .cpp file) as:
int node::f() { /* stuff here */ }
If memory usage is a problem in your code, then most probably the above topics are not relevant. Exploring the following might give you some hints:
Find the sizes of all classes in your program. Use sizeof to find this information, e.g. sizeof(node)
Find what is the maximum number of objects of each class that your program is creating.
Using the above two series of information, estimate worst case memory usage by your program
Worst case memory usage = n1 * sizeof( node1 ) + n2 * sizeof( node2 ) + ...
If the above number is too high, then you have the following choices:
Reducing the maximum number of instances of the classes. This probably won't be possible, because it depends on the input to the program, which is beyond your control
Reduce the size of each class. This is in your control, though.
How to reduce the size of the classes? Try ordering the class members so that they pack compactly and avoid padding.
As others have said, having methods doesn't increase the size of the struct unless one of them is virtual.
You can use bitfields to effectively compress the data (this is especially effective with your boolean...). Also, you can use indices instead of pointers, to save some bytes.
Remember to allocate your nodes in big chunks rather than individually (e.g., using new[] once, not regular new many times) to avoid memory management overhead.
If you don't need the full flexibility your node pointers provide, you may be able to reduce or eliminate them. For example, heapsort always has a near-full binary tree, so the standard implementation uses an implicit tree, which doesn't need any pointers at all.
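Putting the bitfield, index, and chunk-allocation tips together for the segment-tree node from the question; a sketch (exact sizes depend on the ABI):

```cpp
#include <cstdint>
#include <vector>

// 32-bit indices into one pooled vector replace the two 64-bit pointers, and
// the bool is packed into a spare bit next to 'end'. The pool also provides
// the "allocate in big chunks" benefit: one allocation holds all nodes.
struct compact_node {
    std::int32_t start;
    std::int32_t  end  : 31;  // 31 bits for the range end
    std::uint32_t flag : 1;   // the bool, packed into the spare bit
    std::int32_t left;        // index into the pool; -1 means "no child"
    std::int32_t right;
};

// The original layout, for comparison (two 8-byte pointers on 64-bit).
struct pointer_node {
    int start;
    int end;
    bool flag;
    pointer_node* left;
    pointer_node* right;
};
```

On a typical 64-bit build this roughly halves the per-node size (16 bytes versus 32), which matters when the limit is 16 MB and usage is 16.38 MB.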
Above all, finding a different algorithm may change the game completely...