Existing questions in this area that still don't ask my specific question:
Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size
Correct way to ensure sharing with std::hardware_constructive_interference_size
The answer to the second one is actually what prompts this question.
So, assume I want constructive interference, and I put a few variables into a single struct that fits within std::hardware_constructive_interference_size:
struct together
{
    int a;
    int b;
};
The advantage seems too weak to justify rejecting compilation with a static_assert when it does not fit:
// Not going to do the below:
static_assert(sizeof(together) <= std::hardware_constructive_interference_size);
Still, aligning is helpful to keep the structure from spanning two cache lines:
struct alignas(std::hardware_constructive_interference_size) together
{
    int a;
    int b;
};
However, the same effect can be achieved by aligning on the structure's own size:
struct alignas(std::bit_ceil(2*sizeof(int))) together
{
    int a;
    int b;
};
If the structure is larger than std::hardware_constructive_interference_size, it may still be helpful to align it on its own size, because:
The constant is a compile-time hint that may become obsolete on later CPUs the compiled program runs on
It is the cache line size of only one of the cache levels; if there is more than one, exceeding the line size of that level may still give useful sharing of another level's cache line
Aligning on the structure size never costs much more than a factor of two in space. Aligning on the cache line size can cost more, if the cache line size turns out to be much larger than the structure.
So, is there any point left for std::hardware_constructive_interference_size?
Consider a std::deque<T>. It's often implemented using chunks of a given size. But how many T's do you store per chunk? A reasonable answer is std::hardware_constructive_interference_size/sizeof(T), if sizeof(T) is small.
Similarly, a string class with the Small String Optimization may aim for a size of std::hardware_constructive_interference_size. In general, the size is useful when you can have a run-time variable amount of data with high locality of reference.
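For instance, here is a hedged sketch of deriving a chunk size from the constant (the Chunk type and its elements constant are made up for illustration; real std::deque implementations pick their chunk sizes in their own ways):
#include <algorithm>
#include <cstddef>
#include <new>  // std::hardware_constructive_interference_size

// Hypothetical chunk for a deque-like container: size the element count so
// one chunk fills roughly one cache line (at least one element for large T).
template <typename T>
struct Chunk {
    static constexpr std::size_t elements =
        std::max<std::size_t>(1,
            std::hardware_constructive_interference_size / sizeof(T));
    T data[elements];
};

// With 64-byte lines and 4-byte ints this gives 16 ints per chunk.
static_assert(Chunk<int>::elements * sizeof(int)
                  <= 2 * std::hardware_constructive_interference_size,
              "small elements keep a chunk close to one cache line");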
Related
I know that this is all implementation specific, but for the sake of example let's assume that for a certain modern computer:
int takes up a whole WORD
short takes up half a WORD
Will the short actually take up less memory, or will it just be stored in the first half of a WORD with unused memory in the second half? Will a C/C++ compiler ever try to pack two or more smaller variables into a single WORD, or will this space always be wasted?
That depends a lot on usage.
1. You can usually force the compiler to optimize for space.
2. Objects are accessed most efficiently when aligned to a memory boundary that is a multiple of their size (on most architectures). Because of this, the compiler may inject padding to get better alignment; this usually happens when objects of different sizes are required to sit beside each other.
3. The compiler is NOT allowed to rearrange the order of variables in a structure (if they are in the same private/public/protected section).
4. I do not believe there are any requirements on the ordering of variables in the local stack frame, so the compiler should be able to pack local variables optimally and use all available space (even potentially re-using space for POD variables, or using no space at all if it can keep a variable in a register).
But if you have a structure that uses objects of the same size:
struct X
{
    short var1;
    short var2;
};
then most likely there will be no padding in the above structure (no guarantee, but it's highly likely there is none).
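A quick way to check this on a given implementation (the struct X is repeated from the snippet above; the assertion is only a sanity check for one implementation, not a portable guarantee):
#include <cstddef>

struct X
{
    short var1;
    short var2;
};

// On typical implementations sizeof(short) == 2 and X has no padding, so the
// whole struct occupies 4 bytes. The standard does not guarantee this, which
// is why it is checked rather than assumed.
static_assert(sizeof(X) == 2 * sizeof(short),
              "this implementation inserted padding into X");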
Because of point 3 above: if you want to help your compiler pack a structure optimally, ordering your members from largest to smallest makes packing without padding much easier (but the standard does not impose any requirements on padding).
// if we assume sizeof(int) == 8
struct Y
{
    char x; // 1 byte
            // The compiler will (probably) insert 7 bytes of padding here
            // to make sure that y is on an 8-byte boundary
            // for the most efficient reads.
    int  y;
    char z; // 1 byte
            // The compiler will (probably) insert 7 bytes of padding here
            // to make sure that the whole structure has a size
            // that is a multiple of 8 (the largest member).
            // This allows optimal packing of arrays of type Y.
};
The compiler can still achieve optimal packing and fast access if you arrange the object like this:
struct Y
{
    int  y;
    char x;
    char z;
    // The compiler will probably add 6 bytes of padding here
    // so that we get optimal access to objects in an array.
};
for the sake of example let's assume that for a certain modern computer:
Let's assume a good modern compiler like clang or g++ on a normal, standard architecture, even without optimizing for speed.
Will the short actually take up less memory
Yes. A modern compiler will pack objects as tightly as possible and will probably use only the memory required. Note: most compilers by default optimize for speed, so they will maintain optimal alignment and pad if they have to (e.g. when objects of different sizes sit in a structure they cannot re-order).
or will it just be stored in the first half of a WORD with unused memory in the second half?
Unlikely, unless there is some requirement the compiler must maintain, such as the declaration order within a structure.
Will a C/C++ compiler ever try to pack two or more smaller variables into a single WORD
Yes, all the time. The defaults are usually to (1) optimize for speed and (2) optimize for size (the two are not always mutually exclusive). You can also force modern compilers to optimize for space and pack structures without padding.
or will this space always be wasted?
Unlikely.
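As a hedged illustration of the point about forcing the compiler to pack structures: GCC and Clang accept a packed attribute (MSVC has #pragma pack); whether you want this depends on whether the extra misaligned loads are worth the saved space.
#include <cstdio>

struct Padded {
    char c;
    int  i;   // usually preceded by 3 bytes of padding so it sits on a 4-byte boundary
};

// GCC/Clang extension: remove all padding; reads of 'i' may be slower on some
// architectures because it is no longer naturally aligned.
struct __attribute__((packed)) Packed {
    char c;
    int  i;
};

int main() {
    std::printf("sizeof(Padded) = %zu, sizeof(Packed) = %zu\n",
                sizeof(Padded), sizeof(Packed));  // typically 8 vs 5
}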
To achieve best performance, try to minimize cache misses. I guess we can all agree on that.
What I suggest and would like to ask about is the following. I say that this:
template <typename T>
struct S {   // a name is required for the template to compile
    T* a;
    T* b;
    T* c;
};
is more vulnerable to cache misses than this:
template <typename T>
struct S {
    T a;
    T b;
    T c;
};
Often I make the argument: Minimize heap allocations to minimize cache misses. Am I wrong about this?
Rationale, from my work on emulators (I wrote a PowerPC emulator including the MMU): memory is pulled in pages or blocks. If you allocate everything on the stack, the compiler has a better chance of putting everything in a contiguous memory chunk, which means that pulling a single page/block will bring in your whole struct/class (assuming you're not using gigantic structs/classes), and hence you'll have fewer cache misses.
I don't fully understand cache lines in modern CPUs when people mention them (and I don't know whether the term simply refers to the page-table walk across multiple cache levels). Some people told me my argument is incorrect because of that, and I didn't understand what they meant. Can someone please tell me whether my argument is correct or incorrect, and whether it's wrong for a particular reason on x86/x64 architectures?
On the stack or on the heap, you will get cache misses. So no, it's not about minimizing heap allocations.
The question is how the processor can reuse cache information as much as possible and predict where you want to go. That's why a vector is better than a list or a map: you go through your data in a predictable manner.
So the question is: does your struct hold, say, 3 floats allocated on the heap, or 3 arrays of float? If it's the former, that's bad; store the data itself rather than pointers. If it's the latter, good: you have locality when you loop over each array.
The 3 ground rules are locality, locality, locality.
Then there is the whole discussion about Arrays of Structures (AoS), Structures of Arrays (SoA, usually better when not all entries are useful for computations) and Arrays of Structures of Arrays (AoSoA, with vectorized code, the last arrays would be packed floats/integers...).
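A minimal sketch of the AoS versus SoA distinction (the ParticleAoS/ParticlesSoA names are made up for illustration):
#include <vector>

// Array of Structures: each element carries all fields, even the ones a given
// loop never touches, so those unused fields still occupy cache lines.
struct ParticleAoS {
    float x, y, z;
    float mass;
};
using ParticlesAoS = std::vector<ParticleAoS>;

// Structure of Arrays: each field lives in its own contiguous array, so a loop
// that only reads positions streams through memory with no wasted bytes.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> mass;
};

float sumX(const ParticlesSoA& p) {
    float s = 0.0f;
    for (float v : p.x) s += v;   // touches only the x array: high locality
    return s;
}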
I have a class that implements two simple, pre-sized stacks; they are stored as vector members pre-sized by the constructor. They are small, cache-line-friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that can, however, be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in good shape (the code is clean and does what it's supposed to do), and all sizes are kept under control (64K to 128K in most cases for the whole chain of ops including results; rarely do they get close to 256K, so at worst an L2 look-up and often L1).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
    std::vector<ControlPoint> m_controls;
    std::vector<Segment> m_segments;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    std::vector<unsigned int> m_sgSampleCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;
};
Curve::Curve(){
    m_controls = std::vector<ControlPoint>(CONTROL_CAP);
    m_segments = std::vector<Segment>(CONTROL_CAP - 3);
    m_cvCount = 0;
    m_sgCount = 0;
    m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP - 3);
    m_maxIter = 3;
    m_iterSamples = 20;
    m_lengthTolerance = 0.001f;
    m_length = 0.0f;
}
Curve::~Curve(){}
Bear with the verbosity, please; I'm trying to educate myself and make sure I'm not operating on some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data, keep in mind this is on Intel CPUs (Ivy and a few Haswell) and with GCC 4.4, I have no other use cases for this:
I'm assuming that if the actual storage for the controls and segments is contiguous with the Curve instance, that's an ideal scenario for the cache (size-wise the whole lot easily fits on my target CPUs).
A related assumption is that if the vectors are far from the Curve instance, and from each other, then as methods alternately access the contents of those two members there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (data is pulled in for the entire stretch of the cache size, starting from the address first looked up on a new operation, and not in several conveniently sized smaller segments), or am I misunderstanding the caching mechanism, and the cache can pull in and preserve multiple smaller stretches of RAM?
2) Following from the above: so far, by pure circumstance, all my tests end up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally instancing the class reserves only the space for that object, and the vectors are then allocated in the next free chunk available, which is not guaranteed to be anywhere near my Curve instance if the instance happened to land in a small niche of free memory elsewhere.
Is this correct?
3) Assuming 1 and 2 are correct, or functionally close enough, I understand that to guarantee performance I'd have to write an allocator of sorts to make sure the class object is allocated large enough when instanced, and then copy the vectors' contents into that space myself and refer to them there from then on.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).
If the Curve instance fits into a cache line and the data of the two vectors also fits into a cache line each, the situation is not that bad, because you then have four constant cache lines. If every element were accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into fewer than four cache lines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class and not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector content in contiguous storage with the Curve instance.
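A minimal sketch of that suggestion, assuming CONTROL_CAP is a compile-time constant and the fixed capacities are acceptable (the ControlPoint and Segment definitions here are placeholders; the count members from the question keep tracking how many slots are in use):
#include <array>

constexpr unsigned CONTROL_CAP = 16;  // hypothetical fixed capacity

struct ControlPoint { float x, y, z; };
struct Segment      { float length;  };

class Curve {
private:
    // Element storage is embedded in the object itself: no heap allocation,
    // no pointer indirection, and everything travels together in memory.
    std::array<ControlPoint, CONTROL_CAP>     m_controls{};
    std::array<Segment, CONTROL_CAP - 3>      m_segments{};
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount{};
    unsigned int m_cvCount = 0;   // how many controls are actually in use
    unsigned int m_sgCount = 0;   // how many segments are actually in use
};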
BTW: Short style remark:
Curve::Curve()
{
    m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
    m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
    ...
}
...should be written like this:
Curve::Curve() :
    m_controls(CONTROL_CAP),
    m_segments(CONTROL_CAP - 3)
{
    ...
}
This is called a "member initializer list"; search for that term for further explanation. Also, the default-initialized element you provided as the second parameter is already the default, so there is no need to specify it explicitly.
How do I write C/C++ code that takes care of cache line alignment automatically?
Suppose we write a structure in C with 5 members in it, and we want to align this structure's members to different cache lines on different x86 CPUs.
For example, suppose I have two x86 machines, Machine_1 and Machine_2.
Machine_1 has a 64-byte cache line and Machine_2 has a 32-byte cache line.
How should I write the code so that each member ends up on its own cache line on both Machine_1 and Machine_2?
struct test_cache_alignment {
    int a;
    int b;
    int c;
    int d;
    int e;
};
This mostly breaks down into 2 separate problems.
The first problem is ensuring that the structure as a whole begins on a cache line boundary, which depends on where the structure is. If you allocate memory for the structure using malloc() then you need a malloc() that will ensure alignment. If you put a structure in global data then the compiler and/or linker has to ensure alignment. If you have a structure as local data (on the stack) then the compiler has to generate code that ensures alignment.
This is only partly solvable. You can write your own malloc() or write a wrapper around an existing malloc(). You might be able to have special sections that are aligned (instead of using the normal .rodata, .data and .bss sections) and convince the linker to do the right thing. You probably won't be able to get the compiler to generate suitably aligned local data.
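For the heap case, a minimal sketch of such a wrapper, assuming C11/C++17's aligned_alloc is available (posix_memalign or _aligned_malloc would be the equivalents elsewhere); the function name alloc_cache_aligned is made up for illustration:
#include <cstdlib>
#include <new>

// Allocate 'size' bytes whose start address is a multiple of 'alignment'.
// aligned_alloc requires the size to be a multiple of the alignment, so
// round it up first.
void* alloc_cache_aligned(std::size_t size, std::size_t alignment) {
    std::size_t rounded = (size + alignment - 1) / alignment * alignment;
    void* p = std::aligned_alloc(alignment, rounded);
    if (!p) throw std::bad_alloc{};
    return p;  // release with std::free()
}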
The second part of the problem is ensuring that the offsets of members within the structure are multiples of the cache line size. This means that if the structure as a whole is aligned then the members of the structure will also be aligned. This might not be so hard to do (as long as you don't mind "slightly not portable" code and painful micro-management). For example:
#include <stdint.h>

#define CACHE_LINE_SIZE 32

struct test_cache_alignment {
    int a;
    uint8_t padding1[CACHE_LINE_SIZE - sizeof(int)];
    int b;
    uint8_t padding2[CACHE_LINE_SIZE - sizeof(int)];
    int c;
    uint8_t padding3[CACHE_LINE_SIZE - sizeof(int)];
    int d;
    uint8_t padding4[CACHE_LINE_SIZE - sizeof(int)];
    int e;
    uint8_t padding5[CACHE_LINE_SIZE - sizeof(int)];
};
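A hedged alternative, if C++11 (or C11 with <stdalign.h>) is available: alignas on each member forces the same layout without hand-counting padding, and the structure as a whole inherits the cache-line alignment.
#define CACHE_LINE_SIZE 32

struct test_cache_alignment {
    alignas(CACHE_LINE_SIZE) int a;  // offset 0
    alignas(CACHE_LINE_SIZE) int b;  // offset 32
    alignas(CACHE_LINE_SIZE) int c;  // offset 64
    alignas(CACHE_LINE_SIZE) int d;  // offset 96
    alignas(CACHE_LINE_SIZE) int e;  // offset 128
};

static_assert(alignof(test_cache_alignment) == CACHE_LINE_SIZE,
              "the whole struct is cache-line aligned as well");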
However, for this specific case (a structure of integers) it's rare to want to waste space like this. Without the padding it would have all fit in a single cache line, and spreading it across many cache lines will only increase cache misses and reduce performance.
The only case I can think of where you actually want to use a whole cache line is to reduce false sharing in multi-CPU systems (e.g. to avoid "cache line bouncing" caused by different CPUs modifying different members of the same structure at the same time). Often for these cases you're doing something wrong to begin with (e.g. maybe it's better to have separate local variables and not use a structure at all).
I am coding a C simulation, in which, given a sequence of rules to verify, we break it up into 'slices' and verify each slice. (The basic idea is that the order is important, and the actual meaning of a rule is affected by some rules above it; we can make a 'slice' with each rule and only those rules above it which overlap it. We then verify the slices, which are usually much smaller than the whole sequence was.)
My problem is as follows.
I have a struct (policy) which contains an array of structs (rules), and an int (length).
My original implementation used malloc and realloc liberally:
struct policy{
    struct rule *rules;
    int length;
};
...
struct policy makePolicy(int length)
{
    struct policy newPolicy;
    newPolicy.rules = malloc(length * sizeof(struct rule));
    newPolicy.length = length;
    return newPolicy;
}
...
struct policy makeSlice(struct policy inPol, int rulePos)
{
    if(rulePos > inPol.length - 1){
        printf("Slice base outside policy \n");
        exit(1);
    }
    struct policy slice = makePolicy(inPol.length);
    //create slice, loop counter gets stored in sliceLength
    slice.rules = realloc(slice.rules, sliceLength * sizeof(struct rule));
    slice.length = sliceLength;
    return slice;
}
As this uses malloc'ed memory, I'm assuming it makes heavy use of heap.
Now I'm trying to port to an experimental parallel machine, which has no malloc.
I sadly went and allocated everything with fixed size arrays.
Now here's the shocker.
The new code runs slower. Much slower.
(The original code used to wait for minutes on end when the slice length was say 200, and maybe an hour at over 300 ... now it does that when the slice length is 70, 80 ... and has been taking hours for say 120. Still not 200.)
The only thing is that now the slices are given the same memory as a full policy (MAXBUFLEN is 10000), but the whole thing doesn't seem to be running out of memory at all. 'top' shows that the total memory consumed is quite modest, in the tens-of-megabytes range, as before. (And of course, since I'm storing the length, I'm not looping over the whole thing, just the part with real rules.)
Could anyone please help explain why it suddenly got so much slower?
It seems that when you fix the struct to a larger size (say 10000 rules), your cache locality becomes much worse than in the original version. You can use a profiler (oprofile, or cachegrind in Valgrind) to see whether the cache is the problem.
In the original program, one cache line can hold up to 8 struct policy objects (on a 32-bit machine with a 64-byte cache line). But in the modified version a cache line can no longer hold even one, since the struct is now much larger than the cache line.
Moving the length field up can improve performance in this case, since the length and the first few struct rules can then fit into a single cache line:
struct policy{
    int length;
    struct rule rules[10000];
};
To solve this problem you need to write your own custom allocator to ensure cache locality. If you are writing a parallel version of this program, also remember to isolate memory used by different threads into different cache lines to avoid cache line contention.
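A minimal sketch of that last point, assuming C++17 for std::hardware_destructive_interference_size (on older toolchains a hard-coded 64 is the usual stand-in): give each thread's mutable state its own cache line so writes from one thread don't invalidate the line another thread is using. The PerThreadState name is made up for illustration.
#include <new>     // std::hardware_destructive_interference_size
#include <vector>

// One slot per worker thread; alignas pads each slot out to a full cache
// line, so two threads never write to the same line (no false sharing).
struct alignas(std::hardware_destructive_interference_size) PerThreadState {
    unsigned long slicesVerified = 0;
};

// Usage sketch: std::vector<PerThreadState> state(threadCount);
// thread i only ever touches state[i].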