Is there a speed problem when using pragma pack? - c++

struct window_data
{
window_props win_props;
bool VSync;
Window::event_callback_fn EventCallback;
};
I have this struct in my program.
sizeof(window_data) equals 120.
#pragma pack(push, 2)
struct window_data
{
window_props win_props;
bool VSync;
Window::event_callback_fn EventCallback;
};
#pragma pack(pop)
If i use #pragma pack(push, 2), sizeof(window_data) equals 114.
#pragma pack(push, 1)
struct window_data
{
window_props win_props;
bool VSync;
Window::event_callback_fn EventCallback;
};
#pragma pack(pop)
And in this case, sizeof(window_data) equals 113.
So, is there a problem with using the latest case?

In the normal case, without the pragma, the compiler is going to lay out the structure with padding so that the fields are correctly aligned. For example EventCallback is presumably a 64 bit pointer, so its address should be lined up on a 8-byte boundary.
When you use the pragma, fields may wind up not being aligned. Depending on the CPU, this may mean using different load/store instructions to access the field, or using the normal instructions and getting degraded performance on them.
In exchange for this degraded CPU performance accessing the structure, there are a few benefits. Packed structures have a more predictable layout and are sometimes used directly for serializing data, or accessing hardware using memory mapped IO. They also take up less space, which may help your program fit into ram, or help your working set fit into cache. In some cases this tradeoff is worth the CPU penalty.
You should not pack unless you have a specific reason to do so. If you are doing it to achieve a performance benefit, then measure to confirm it's actually better.

Related

Does the order of class members affect access speed?

I'm writing a delegate library that is supposed to have absolutely no overhead. Therefore it's important that the access of a function pointer is done as fast as possible.
So my question is: Does the access speed depend on the member position in a class? I heard that the most important member should be the first in the member declaration, and that make sense to me, because that means that the this pointer of a class points to the same address as the important member (assuming non-virtual classes). Whereas if the important member would be at any other position, the CPU would have to calculate it's position by adding this and the offset in class layout.
On the other hand I know that the compiler represents that address as a qword-ptr, which contains the information of the offset.
So my question comes down to: Does resolving a qword-ptr take a constant time or does it increase if the offset is not 0? Does the behaviour stay the same on different plattforms?
Most machines have a load instruction or addressing mode that can include a small constant displacement for no extra cost.
On x86, [reg] vs. [reg + disp8] costs 1 extra byte for the 8-bit displacement part of the addressing mode. On RISC-like machines, e.g. ARM, fixed-width instructions mean that load/store instructions always have some bits for a displacement (which can simply be all zero to access the first member given a pointer to the start of the object).
Group the hottest members together at the front of the class, preferably sorted by size to avoid gaps for padding (How do I organize members in a struct to waste the least space on alignment?) Hopefully the hot members will all be in the same cache line. (If your class/struct extends into a 2nd cache line, hopefully only the first line has to stay hot in cache most of the time, reducing the footprint of your working set.)
If the member isn't in the same page as the start of the object, Sandybridge-family's pointer-chasing optimization can cause extra latency if this was also loaded from memory.
Is there a penalty when base+offset is in a different page than the base? Normally it reduces the L1d load-use latency from 5 to 4 cycles for addressing modes like [rdi + 0..2047] by optimistically using just the register value as an input to the TLB, but has to retry if it guessed wrong. (Not a pipeline flush, just retrying that load uop without the shortcut.)
Note that function-pointers mostly depend on branch prediction to be efficient, with access latency only mattering to check the prediction (and start branch recovery if it was wrong). i.e. speculative execution + branch prediction hides the latency control dependencies in CPUs with out-of-order exec.
The order of class members may affect performance, but usually not due to the offset. Because as mentioned above, almost all architectures have load/store with offset. For small structs then you need 1 more byte on x86 and 0 more byte on fixed-width ISAs (but even with that extra byte the x86 instruction is still usually shorter than the fixed 4-byte instructions in those ISAs). If the struct is huge then you may need 4 more bytes for the 4-byte displacement in x86-64, but the instruction count is still 1. On fixed-width ISAs you'll need at least another instruction to get the 32-bit displacement, yet the offset calculation cost is just tiny compared to the effect of cache misses which is the main thing that may incur performance degradation when changing member positions.
So the order of class members affects the fields' positions in cache, and you'll want the important members to be in cache and in the same cache line. Typically you'll put the largest hot member at the beginning to avoid padding. But if the hottest members are small it may be better to move them to the beginning if they don't cause padding. For example
struct mystruct
{
uint32_t extremely_hot;
uint8_t very_hot[4];
void* ptr;
}
If ptr isn't accessed very often it may be a better idea to keep it after the hotter fields like that
But moving the fields around isn't always the better solution. In many cases you may consider splitting the class into two, one for hot members and one for cold ones. In fact I've read somewhere that the Intel compiler has a feature that will automatically split hot and cold members of a class into separate classes when running profile-guided optimization. Unfortunately I couldn't find the source right now
Take a simple example
struct game_player
{
int32_t id;
int16_t positionX;
int16_t positionY;
int16_t health;
int16_t attribute;
game_player* spouse;
time_t join_time;
};
game_player[MAX_PLAYERS];
Only the first 5 fields are commonly used when rendering the object on screen, so we can split them into a hot class
struct game_player_hot
{
int32_t id;
int16_t positionX;
int16_t positionY;
int16_t health;
int16_t attribute;
};
struct game_player_cold
{
game_player* spouse;
time_t join_time;
};
game_player_hot players_hot[MAX_PLAYERS];
game_player_cold players_cold[MAX_PLAYERS];
Sometimes it's recommended to use SoA (struct of arrays) instead of AoS (array of structs) or a mix of that if the same field of different objects are accessed at the same time or in the same manner. For example if we have a list of vectors to sum, instead of
struct my_vector
{
uint16_t x, y, z;
uint16_t attribute; // very rarely used
}
my_vector vectors[MAX];
we'll use
struct my_vector
{
uint16_t x[MAX]; // hot
uint16_t y[MAX]; // hot
uint16_t z[MAX]; // hot
uint16_t attribute[MAX];
}
That way all the dimension values are kept hot and close to each other. Now we also have easier and better vectorization, and it also keeps the hot things hot.
For more information read
AoS and SoA
Nomad Game Engine: Part 4.3 — AoS vs SoA
How to Manipulate Data Structure to Optimize Memory Use on 32-Bit Intel® Architecture
Memory Layout Transformations
Why is SOA (Structures of Arrays) faster than AOS?

Writing a Low latency code in C/ C++ language that automatically take care of cache lines size

How to write C/C++ code that takes care of the cache line alignment automatically.
Suppose we write an structure in c and have 5 members in to it and we want to align the this structures members to the different cache lines in different hardware X86 hardware CPU.
For example, If I have two X86 machine Machine_1 and Machine_2.
And Machine_1 has 64 byte cache line and Machine_2 has 32 byte cache line.
How will I do a coding so that each variable will be aligns to different cache lines for both the Machine_1 and Machine_2.
struct test_cache_alignment {
int a;
int b;
int c;
int d;
int e;
};
Thanks,
Abhishek
This mostly breaks down into 2 separate problems.
The first problem is ensuring that the structure as a whole begins on a cache line boundary, which depends on where the structure is. If you allocate memory for the structure using malloc() then you need a malloc() that will ensure alignment. If you put a structure in global data then the compiler and/or linker has to ensure alignment. If you have a structure as local data (on the stack) then the compiler has to generate code that ensures alignment.
This is only partly solvable. You can write your own malloc() or write a wrapper around an existing malloc(). You might be able to have special sections that are aligned (instead of using the normal .rodata, .data and .bss sections) and convince the linker to do the right thing. You probably won't be able to get the compiler to generate suitably aligned local data.
The second part of the problem is ensuring that offsets of member within the structure are multiples of the cache line size. This means that if the structure as a whole is aligned then the members of the structure will also be aligned. This might not be so hard to do (as long as you don't mind "slightly not portable" code and painful micro-management). For example:
#define CACHE_LINE_SIZE 32
struct test_cache_alignment {
int a;
uint8_t padding1[CACHE_LINE_SIZE - sizeof(int)];
int b;
uint8_t padding2[CACHE_LINE_SIZE - sizeof(int)];
int c;
uint8_t padding3[CACHE_LINE_SIZE - sizeof(int)];
int d;
uint8_t padding4[CACHE_LINE_SIZE - sizeof(int)];
int e;
uint8_t padding5[CACHE_LINE_SIZE - sizeof(int)];
};
However; for this specific case (a structure of integers) it's rare to want to waste space like this. Without the padding it would have all fit in a single cache line and spreading it across many cache lines will only increase cache misses and reduce performance.
The only case I can think of where you actually want to use a whole cache line is to reduce false sharing in multi-CPU systems (e.g. to avoid "cache line bouncing" caused by different CPUs modifying different members of the same structure at the same time). Often for these cases you're doing something wrong to begin with (e.g. maybe it's better to have separate local variables and not use a structure at all).

Does alignment really matter for performance in C++11?

Does alignment really matter for performance in C++11?
There is an advice in Stroustrup's book to order the members in a struct
beginning from the biggest to the smallest. But I wonder if someone
has made measurements to actually see if this makes any difference,
and if it is worth it to think about when writing code.
Alignment matters not only for performance, but also for correctness. Some architectures will fail with an processor trap if the data is not aligned correctly, or access the wrong memory location. On others, access to unaligned variables is broken into multiple accesses and bitshifts (often inside the hardware, sometimes by OS trap handler), losing atomicity.
The advice to sort members in descending order of size is for optimal packing / minimum space wasted by padding, not for alignment or speed. Members will be correctly aligned no matter what order you list them in, unless you request non-conformant layout using specialized pragmas (i.e. the non-portable #pragma pack) or keywords. Although total structure size is affected by padding and also affects speed, often there is another ordering that is optimal.
For best performance, you should try to get members which are used together into the same cache line, and members that are accessed by different threads into different cache lines. Sometimes that means a lot of padding to get a cross-thread shared variable alone in its own cache line. But that's better than taking a performance hit from false sharing.
Just to add to Ben's great answer:
Defining struct members in the same order they are later accessed in your application will reduce cache misses and possibly increase performance. This will work provided the entire structure does not fit into L1 cache.
On the other hand, ordering the members from biggest to smallest may reduce overall memory usage, which may be important when storing an array of small structures.
Let's assume that for an architecture (I don't know them that well, I think that would be the case for default settings 32bit gcc, someone will correct me in comments) this structure:
struct MemoryUnused {
uint8_t val0;
uint16_t val1;
uint8_t val2;
uint16_t val3;
uint8_t val4;
uint32_t val5;
uint8_t val6;
}
takes 20 bytes in memory, while this:
struct MemoryNotLost {
uint32_t val5;
uint16_t val1;
uint16_t val3;
uint8_t val0;
uint8_t val2;
uint8_t val4;
uint8_t val6;
}
Will take 12. That's 8 bytes lost due to padding, and it's a 67% increase in size of the smallers struct. With a large array of such structs, the gain would be significant and, simply because of the amount of used memory, will decrease the amount of cache misses.

why does size of the struct need to be a multiple of the largest alignment of any struct member

I understand the padding that takes place between the members of a struct to ensure correct alignment of individual types. However, why does the data structure have to be a multiple of alignment of largest member? I don't understand the padding is needed at the end.
Reference:
http://en.wikipedia.org/wiki/Data_structure_alignment
Good question. Consider this hypothetical type:
struct A {
int n;
bool flag;
};
So, an object of type A should take five bytes (four for the int plus one for the bool), but in fact it takes eight. Why?
The answer is seen if you use the type like this:
const size_t N = 100;
A a[N];
If each A were only five bytes, then a[0] would align but a[1], a[2] and most of the other elements would not.
But why does alignment even matter? There are several reasons, all hardware-related. One reason is that recently/frequently used memory is cached in cache lines on the CPU silicon for rapid access. An aligned object smaller than a cache line always fits in a single line (but see the interesting comments appended below), but an unaligned object may straddle two lines, wasting cache.
There are actually even more fundamental hardware reasons, having to do with the way byte-addressable data is transferred down a 32- or 64-bit data bus, quite apart from cache lines. Not only will misalignment clog the bus with extra fetches (due as before to straddling), but it will also force registers to shift bytes as they come in. Even worse, misalignment tends to confuse optimization logic (at least, Intel's optimization manual says that it does, though I have no personal knowledge of this last point). So, misalignment is very bad from a performance standpoint.
It usually is worth it to waste the padding bytes for these reasons.
Update: The comments below are all useful. I recommend them.
Depending on the hardware, alignment might be necessary or just help speeding up execution.
There is a certain number of processors (ARM I believe) in which an unaligned access leads to a hardware exception. Plain and simple.
Even though typical x86 processors are more lenient, there is still a penalty in accessing unaligned fundamental types, as the processor has to do more work to bring the bits into the register before being able to operate on it. Compilers usually offer specific attributes/pragmas when packing is desirable nonetheless.
Because of virtual addressing.
"...aligning a page on a page-sized boundary lets the
hardware map a virtual address to a physical address by substituting
the higher bits in the address, rather than doing complex arithmetic."
By the way, I found the Wikipedia page on this quite well written.
If the register size of the CPU is 32 bits, then it can grab memory that is on 32 bit boundaries with a single assembly instruction. It is slower to grab 32 bits, and then get the byte that starts at bit 8.
BTW: There doesn't have to be padding. You can ask that structures be packed.

which size of struct member alignment in VC bring performance benefit?

does struct member alignment in VC bring performance benefit? if it is what is the best performance implication by using this and which size is best for current cpu architecture (x86_64, SSE2+, ..)
Perf takes a nose-dive on x86 and x64 cores when a member straddles a cache line boundary. The common compiler default is 8 byte packing which ensures you're okay on long long, double and 64-bit pointer members.
SSE2 instructions require an alignment of 16, the code will bomb if it is off. You cannot get that out of a packing pragma, the heap allocator for example will only provide an 8-byte alignment guarantee. Find out what your compiler and CRT support. Something like __declspec(align(16)) and a custom allocator like _aligned_malloc(). Or over-allocate the memory and tweak the pointer yourself.
The default alignment used by the compiler should be appropriate for the target platform (32- or 64-bit Intel/AMD) for general data. To take advantage of SIMD, you might have to use a more restrictive alignment on those arrays, but that's usually done with a #pragma or special data type that applies just to the data you'll be using in the SIMD instructions.