How and when to align to cache line size? - c++

In Dmitry Vyukov's excellent bounded MPMC queue written in C++
(see: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue)
he adds some padding variables. I presume this is to make the structure align to a cache line for performance.
I have some questions:
1. Why is it done in this way?
2. Is it a portable method that will always work?
3. In what cases would it be best to use __attribute__((aligned(64))) instead?
4. Why would padding before a buffer pointer help with performance? Isn't just the pointer loaded into the cache, so it's really only the size of a pointer?
static size_t const cacheline_size = 64;
typedef char cacheline_pad_t [cacheline_size];
cacheline_pad_t pad0_;
cell_t* const buffer_;
size_t const buffer_mask_;
cacheline_pad_t pad1_;
std::atomic<size_t> enqueue_pos_;
cacheline_pad_t pad2_;
std::atomic<size_t> dequeue_pos_;
cacheline_pad_t pad3_;
Would this concept work under GCC for C code?

It's done this way so that different cores modifying different fields won't have to bounce the cache line containing both of them between their caches. In general, for a processor to access some data in memory, the entire cache line containing it must be in that processor's local cache. If it's modifying that data, that cache entry usually must be the only copy in any cache in the system (Exclusive mode in the MESI/MOESI-style cache coherence protocols). When separate cores modify different data that happens to live on the same cache line, they waste time moving the whole line back and forth between their caches; this is known as false sharing.
In the particular example you give, one core can be enqueueing an entry (reading (shared) buffer_ and writing (exclusive) only enqueue_pos_) while another dequeues (shared buffer_ and exclusive dequeue_pos_) without either core stalling on a cache line owned by the other.
The padding at the beginning means that buffer_ and buffer_mask_ end up on the same cache line, rather than split across two lines and thus requiring double the memory traffic to access.
I'm unsure whether the technique is entirely portable. The assumption is that each cacheline_pad_t will itself be aligned to a 64 byte (its size) cache line boundary, and hence whatever follows it will be on the next cache line. So far as I know, the C and C++ language standards only require this of whole structures, so that they can live in arrays nicely, without violating alignment requirements of any of their members. (see comments)
The attribute approach would be more compiler specific, but might cut the size of this structure in half, since the padding would be limited to rounding up each element to a full cache line. That could be quite beneficial if one had a lot of these.
The same concept applies in C as well as C++.
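For reference, here is a minimal sketch of the attribute/alignas approach in C++11 (the member names mirror the queue fields above; 64 bytes is assumed as the cache line size, and in C11 you would use _Alignas or GCC's __attribute__((aligned(64))) in the same way):
#include <atomic>
#include <cstddef>

struct mpmc_state {
  // Each over-aligned member starts its own 64-byte cache line, so producers
  // writing enqueue_pos_ and consumers writing dequeue_pos_ never contend.
  alignas(64) void* buffer_;                           // read-mostly, shared
  alignas(64) std::atomic<std::size_t> enqueue_pos_;   // written by producers
  alignas(64) std::atomic<std::size_t> dequeue_pos_;   // written by consumers
};
static_assert(alignof(mpmc_state) == 64,
              "the struct inherits the strictest member alignment");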

You may need to align to a cache line boundary, which is typically 64 bytes per cache line, when you are working with interrupts or high-performance data reads, and it is mandatory when working with interprocess sockets. With interprocess sockets there are control variables that cannot be spread out across multiple cache lines or DDR RAM words, or else the L1/L2 caches or the DDR RAM will act like a low-pass filter and filter out your interrupt data. That is bad! It means you get bizarre errors when your algorithm is otherwise good, and it has the potential to make you go insane!
DDR RAM almost always reads in 128-bit words (DDR RAM words), which is 16 bytes, so the ring buffer variables shall not be spread out across multiple DDR RAM words. Some systems use 64-bit DDR RAM words, and technically you could get a 32-bit DDR RAM word on a 16-bit CPU, but one would use SDRAM in that situation.
One may also just be interested in minimizing the number of cache lines in use when reading data in a high-performance algorithm. In my case, I developed the world's fastest integer-to-string algorithm (40% faster than the prior fastest algorithm) and I'm working on optimizing the Grisu algorithm, which is the world's fastest floating-point algorithm. In order to print a floating-point number you must print the integer part, so one optimization I have implemented for Grisu is cache-line-aligning its lookup tables (LUTs) into exactly 15 cache lines, which is rather odd in that it actually aligned like that. This takes the LUTs from the .bss section (i.e. static memory) and places them onto the stack (or the heap, but the stack is more appropriate).
I have not benchmarked this, but it's worth bringing up, and I learned a lot about it: the fastest way to load values is to load them from the i-cache and not the d-cache. The difference is that the i-cache is read-only and has much larger cache lines because it's read-only (2 KB is what a professor once quoted me). So you may actually degrade your performance with array indexing as opposed to loading a variable like this:
int faster_way = 12345678;
as opposed to the slower way:
int variables[2] = { 12345678, 123456789};
int slower_way = variables[0];
The difference is that int faster_way = 12345678 will get loaded from the i-cache lines by offsetting to the constant in the i-cache from the start of the function, while slower_way = variables[0] will get loaded from the smaller d-cache lines using much slower array indexing. This particular subtlety, as I just discovered, is actually slowing down my and many others' integer-to-string algorithms. I say this because you may think you're optimizing by cache-aligning read-only data when you're not.
Typically in C++ you would use the std::align function. I would advise not using it because it is not guaranteed to work optimally. Here is the fastest way to align to a cache line, which, to be up front, I am the author of, and this is a shameless plug:
Kabuki Toolkit Memory Alignment Algorithm
#include <cstdint>  // intptr_t

namespace _ {
/* Aligns the given pointer up to a power-of-two boundary using a premade mask.
@brief  The algorithm is a two's-complement trick that works by masking off the
        desired number of least-significant bits and adding the result to the
        pointer.
@param  pointer The pointer to align.
@param  mask    The mask for the least-significant bits to align.
@return An aligned pointer of typename T. */
template <typename T = char>
inline T* AlignUp(void* pointer, intptr_t mask) {
  intptr_t value = reinterpret_cast<intptr_t>(pointer);
  value += (-value) & mask;  // add the distance to the next boundary
  return reinterpret_cast<T*>(value);
}
}  //< namespace _
// Example calls using the faster mask technique.
enum { kSize = 256 };
char buffer[kSize + 64];
char* aligned_to_64_byte_cache_line = AlignUp<> (buffer, 63);
char16_t* aligned_to_64_byte_cache_line2 = AlignUp<char16_t> (buffer, 63);
and here is the faster std::align replacement:
#include <cstddef>  // size_t
#include <cstdint>  // intptr_t

inline void* align_kabuki(size_t align, size_t size, void*& ptr,
                          size_t& space) noexcept {
  // Begin Kabuki Toolkit Implementation
  intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
           offset  = (-int_ptr) & (align - 1);
  if ((space -= offset) < size) {
    space += offset;  // not enough room: restore space and report failure
    return nullptr;
  }
  return reinterpret_cast<void*>(int_ptr + offset);
  // End Kabuki Toolkit Implementation
}
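A hypothetical usage example of the replacement above, following std::align's calling convention (the buffer and request sizes are arbitrary; note that, unlike std::align, the version shown here returns the aligned address and shrinks space, but does not advance ptr itself):
#include <cstddef>

char storage[256 + 64];
void* ptr = storage;
std::size_t space = sizeof(storage);
// Ask for 256 bytes aligned to a 64-byte cache line; nullptr means the request
// did not fit, in which case space is left unchanged.
void* cache_aligned = align_kabuki(64, 256, ptr, space);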

Related

C++ Atomic operations within contiguous block of memory

Is it possible to use atomic operations, possibly using the std::atomic library, when assigning values in a contiguous block of memory?
If I have this code:
uint16_t* data = (uint16_t*) calloc(num_values, size);
What can I do to make operations like this atomic:
data[i] = 5;
I will have multiple threads assigning to data, possibly at the same index, at the same time. The order in which these threads modify the value at a particular index doesn't matter to me, as long as the modifications are atomic, avoiding any possible mangling of the data.
EDIT: So, per #user4581301, I'm providing some context for my issue here.
I am writing a program to align depth video data frames to color video data frames. The camera sensors for depth and color have different focal characteristics so they do not come completely aligned.
The general algorithm involves projecting a pixel in depth space to a region in color space, then, overwriting all values in the depth frame, spanning that region, with that single pixel.
I am parallelizing this algorithm. These projected regions may overlap; thus, when parallelized, writes to an index may occur concurrently.
Pseudo-code looks like this:
for x in depth_video_width:
    for y in depth_video_height:
        pixel = get_pixel(x, y)
        x_min, x_max, y_min, y_max = project_depth_pixel(x, y)
        // iterate over projected region
        for x` in [x_min, x_max]:
            for y` in [y_min, y_max]:
                // possible concurrent modification here
                data[x`, y`] = pixel
The outer loop or outermost two loops are parallelized.
You're not going to be able to do exactly what you want like this.
An atomic array doesn't make much sense, nor is it what you want (you want individual writes to be atomic).
You can have an array of atomics:
#include <atomic>
#include <array>
#include <cstdint>

int main()
{
    std::array<std::atomic<uint16_t>, 5> data{};
    data[1] = 5;
}
… but now you can't just access a contiguous block of uint16_ts, which it's implied you want to do.
If you don't mind something platform-specific, you can keep your array of uint16_ts and ensure that you only use atomic operations with each one (e.g. GCC's __atomic intrinsics).
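For example, a minimal sketch with GCC/Clang's __atomic built-ins (the helper names are made up; relaxed ordering is assumed since the question says the write order doesn't matter):
#include <cstddef>
#include <cstdint>

// The array stays a plain uint16_t*, but every thread must use these helpers
// for shared elements; a plain data[i] = 5 alongside them is still a data race.
inline void store_relaxed(uint16_t* data, std::size_t i, uint16_t value) {
  __atomic_store_n(&data[i], value, __ATOMIC_RELAXED);
}
inline uint16_t load_relaxed(uint16_t* data, std::size_t i) {
  return __atomic_load_n(&data[i], __ATOMIC_RELAXED);
}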
But, generally, I think you're going to want to keep it simple and just lock a mutex around accesses to a normal array. Measure to be sure, but you may be surprised at how much of a performance loss you don't get.
If you're desperate for atomics, and desperate for an underlying array of uint16_t, and desperate for a standard solution, you could wait for C++20 and keep an std::atomic_ref (this is like a non-owning std::atomic) for each element, then access the elements through those. But then you still have to be cautious about any operation accessing the elements directly, possibly by using a lock, or at least by being very careful about what's doing what and when. At this point your code is much more complex: be sure it's worthwhile.
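A C++20 sketch of that atomic_ref approach (again, the helper is hypothetical; the buffer stays a plain uint16_t array, and uint16_t easily satisfies std::atomic_ref's alignment requirement):
#include <atomic>
#include <cstddef>
#include <cstdint>

inline void atomic_write(uint16_t* data, std::size_t i, uint16_t pixel) {
  // Construct a temporary atomic view of this one element for the store;
  // all concurrent accesses to data[i] must also go through atomic_ref.
  std::atomic_ref<uint16_t>(data[i]).store(pixel, std::memory_order_relaxed);
}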
To add to the last answer, I would advise caution with an array of atomics, since on x86 an atomic read-modify-write (and, in fact, any write) needs exclusive ownership of the entire cache line containing the element. In practice this means that when modifying element i in your array, you pull that cache line away from every other core, so other threads touching anything on that particular cache line stall while it bounces between caches.
The solution to your problem is a mutex as mentioned in the other answer.
As for the maximum natively supported atomic access width, it currently seems to be 64 bits (see https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html):
The Intel-64 memory ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:
• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of alignment.
In other words, your processor doesn't know how to do atomic operations wider than 64 bits. And I'm not even mentioning here that the STL implementation of std::atomic can fall back to a lock (see https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free).

Does the order of class members affect access speed?

I'm writing a delegate library that is supposed to have absolutely no overhead. Therefore it's important that the access of a function pointer is done as fast as possible.
So my question is: does the access speed depend on the member's position in a class? I heard that the most important member should be the first in the member declaration, and that makes sense to me, because it means that the this pointer of a class points to the same address as the important member (assuming non-virtual classes). Whereas if the important member were at any other position, the CPU would have to calculate its position by adding this and the offset within the class layout.
On the other hand, I know that the compiler represents that address as a qword ptr, which contains the information about the offset.
So my question comes down to: does resolving a qword ptr take constant time, or does it increase if the offset is not 0? Does the behaviour stay the same on different platforms?
Most machines have a load instruction or addressing mode that can include a small constant displacement for no extra cost.
On x86, [reg] vs. [reg + disp8] costs 1 extra byte for the 8-bit displacement part of the addressing mode. On RISC-like machines, e.g. ARM, fixed-width instructions mean that load/store instructions always have some bits for a displacement (which can simply be all zero to access the first member given a pointer to the start of the object).
Group the hottest members together at the front of the class, preferably sorted by size to avoid gaps for padding (How do I organize members in a struct to waste the least space on alignment?) Hopefully the hot members will all be in the same cache line. (If your class/struct extends into a 2nd cache line, hopefully only the first line has to stay hot in cache most of the time, reducing the footprint of your working set.)
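A small sketch of that layout advice for the delegate case (field names and sizes are invented; 64 bytes is assumed as the cache line size):
#include <cstdint>

struct alignas(64) Delegate {
  // Hot members first, largest to smallest, so they share the first cache
  // line and sit at small (cheap) offsets from `this`.
  void (*fn)(void*);          // called on every invocation
  void* context;              // argument forwarded to fn
  std::uint32_t call_count;
  // Colder members afterwards; they may spill into a second line.
  std::uint32_t debug_flags;
  char name[40];
};
static_assert(sizeof(Delegate) == 64, "hot and cold members fit one line here");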
If the member isn't in the same page as the start of the object, Sandybridge-family's pointer-chasing optimization can cause extra latency if this was also loaded from memory.
Is there a penalty when base+offset is in a different page than the base? Normally it reduces the L1d load-use latency from 5 to 4 cycles for addressing modes like [rdi + 0..2047] by optimistically using just the register value as an input to the TLB, but has to retry if it guessed wrong. (Not a pipeline flush, just retrying that load uop without the shortcut.)
Note that function-pointers mostly depend on branch prediction to be efficient, with access latency only mattering to check the prediction (and start branch recovery if it was wrong). i.e. speculative execution + branch prediction hides the latency control dependencies in CPUs with out-of-order exec.
The order of class members may affect performance, but usually not because of the offset, since, as mentioned above, almost all architectures have load/store with an offset. For small structs you need 1 more byte on x86 and 0 more bytes on fixed-width ISAs (though even with that extra byte the x86 instruction is still usually shorter than the fixed 4-byte instructions of those ISAs). If the struct is huge you may need 4 more bytes for the 4-byte displacement on x86-64, but the instruction count is still 1. On fixed-width ISAs you'll need at least one more instruction to materialize the 32-bit displacement, yet the offset-calculation cost is tiny compared to the effect of cache misses, which are the main thing that may degrade performance when changing member positions.
So the order of class members affects the fields' positions in cache, and you'll want the important members to be in cache and on the same cache line. Typically you'll put the largest hot member at the beginning to avoid padding, but if the hottest members are small it can still pay to move them to the front, as long as doing so doesn't introduce padding. For example:
struct mystruct
{
    uint32_t extremely_hot;
    uint8_t  very_hot[4];
    void*    ptr;
};
If ptr isn't accessed very often it may be a better idea to keep it after the hotter fields like that
But moving the fields around isn't always the better solution. In many cases you may consider splitting the class into two, one for hot members and one for cold ones. In fact I've read somewhere that the Intel compiler has a feature that will automatically split hot and cold members of a class into separate classes when running profile-guided optimization. Unfortunately I couldn't find the source right now
Take a simple example
struct game_player
{
    int32_t id;
    int16_t positionX;
    int16_t positionY;
    int16_t health;
    int16_t attribute;
    game_player* spouse;
    time_t join_time;
};
game_player players[MAX_PLAYERS];
Only the first 5 fields are commonly used when rendering the object on screen, so we can split them into a hot class
struct game_player_hot
{
    int32_t id;
    int16_t positionX;
    int16_t positionY;
    int16_t health;
    int16_t attribute;
};
struct game_player_cold
{
    game_player* spouse;
    time_t join_time;
};
game_player_hot players_hot[MAX_PLAYERS];
game_player_cold players_cold[MAX_PLAYERS];
Sometimes it's recommended to use SoA (struct of arrays) instead of AoS (array of structs), or a mix of the two, if the same field of different objects is accessed at the same time or in the same manner. For example, if we have a list of vectors to sum, instead of
struct my_vector
{
    uint16_t x, y, z;
    uint16_t attribute; // very rarely used
};
my_vector vectors[MAX];
we'll use
struct my_vector
{
    uint16_t x[MAX]; // hot
    uint16_t y[MAX]; // hot
    uint16_t z[MAX]; // hot
    uint16_t attribute[MAX];
};
That way all the dimension values are kept hot and close to each other, and the layout also makes vectorization easier and more effective, as in the sketch below.
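For instance, summing just the x components over the SoA layout streams through contiguous, homogeneous data that the compiler can auto-vectorize easily (a sketch reusing the my_vector and MAX names from the snippet above):
#include <cstddef>
#include <cstdint>

std::uint32_t sum_x(const my_vector& v) {
  std::uint32_t total = 0;
  // Only the x[] array is touched, so every cache line fetched is 100%
  // useful data; attribute[] never enters the cache.
  for (std::size_t i = 0; i < MAX; ++i)
    total += v.x[i];
  return total;
}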
For more information read
AoS and SoA
Nomad Game Engine: Part 4.3 — AoS vs SoA
How to Manipulate Data Structure to Optimize Memory Use on 32-Bit Intel® Architecture
Memory Layout Transformations
Why is SOA (Structures of Arrays) faster than AOS?

Double-checking understanding of memory coalescing in CUDA

Suppose I define some arrays which are visible to the GPU:
double* doubleArr = createCUDADouble(fieldLen);
float* floatArr = createCUDAFloat(fieldLen);
char* charArr = createCUDAChar(fieldLen);
Now, I have the following CUDA thread:
void thread(){
    int o = getOffset(); // the same for all threads in launch
    double d = doubleArr[threadIdx.x + o];
    float f = floatArr[threadIdx.x + o];
    char c = charArr[threadIdx.x + o];
}
I'm not quite sure whether I correctly interpret the documentation, and it's very critical for my design: will the memory accesses for double, float and char be nicely coalesced? (Guess: yes, it will fit into sizeof(type) * blockSize.x / (transaction size) transactions, plus maybe one extra transaction at the upper and lower boundary.)
Yes, for all the cases you have shown, and assuming createCUDAxxxxx translates into some kind of ordinary cudaMalloc type operation, everything should nicely coalesce.
If we have ordinary 1D device arrays allocated via cudaMalloc, in general we should have good coalescing behavior across threads if our load pattern includes an array index of the form:
data_array[some_constant + threadIdx.x];
It really does not matter what data type the array is - it will coalesce nicely.
However, from a performance perspective, global loads (assuming an L1 miss) will occur in a minimum 128-byte granularity. Therefore loading larger sizes per thread (say, int, float, double, float4, etc.) may give slightly better performance. The caches tend to mitigate any difference, if the loads are across a large enough number of warps.
It's pretty easy also to verify this on a particular piece of code with a profiler. There are many ways to do this depending on which profiler you choose, but for example with nvprof you can do:
nvprof --metrics gld_efficiency ./my_exe
and it will return an average percentage number that more or less exactly reflects the percentage of optimal coalescing that is occurring on global loads.
This is the presentation I usually cite for additional background info on memory optimization.
I suppose someone will come along and notice that this pattern:
data_array[some_constant + threadIdx.x];
roughly corresponds to the access type shown on slides 40-41 of the above presentation. And aha!! efficiency drops to 50%-80%. That is true, if only a single warp-load is being considered. However, referring to slide 40, we see that the "first" load will require two cachelines to be loaded. After that however, additional loads (moving to the right, for simplicity) will only require one additional/new cacheline per warp-load (assuming the existence of an L1 or L2 cache, and reasonable locality, i.e. lack of thrashing). Therefore, over a reasonably large array (more than just 128 bytes), the average requirement will be one new cacheline per warp, which corresponds to 100% efficiency.

Does int32_t have lower latency than int8_t, int16_t and int64_t?

(I'm referring to Intel CPUs, mainly with GCC, but possibly ICC or MSVC.)
Is it true that using int8_t, int16_t or int64_t is less efficient compared with int32_t, due to additional instructions generated to convert between the CPU word size and the chosen variable size?
I would be interested if anybody has any examples or best practices for this. I sometimes use smaller variable sizes to reduce cache line loads, but say I only consumed 50 bytes of a cache line, with one variable being an 8-bit int: might it be quicker to use the remaining cache line space and promote the 8-bit int to a 32-bit int, etc.?
You can stuff more uint8_ts into a cache line, so loading N uint8_ts will be faster than loading N uint32_ts.
In addition, if you are using a modern Intel chip with SIMD instructions, a smart compiler will vectorize what it can. Again, using a small variable in your code will allow the compiler to stuff more lanes into a SIMD register.
I think it is best to use the smallest size you can, and leave the details up to the compiler. The compiler is probably smarter than you (and me) when it comes to stuff like this. For many operations (say unsigned addition), the compiler can use the same code for uint8, uint16 or uint32 (and just ignore the upper bits), so there is no speed difference.
The bottom line is that a cache miss is WAY more expensive than any arithmetic or logical operation, so it is nearly always better to worry about cache (and thus data size) than simple arithmetic.
(It used to be true, a long time ago, that on Sun workstations using double was significantly faster than float, because the hardware only supported double. I don't think that is true any more for modern x86, as the SIMD hardware (SSE, etc.) has direct support for both single and double precision.)
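As a concrete illustration of the data-size point above: a 64-byte cache line holds 64 uint8_t values but only 16 uint32_t values, so scanning the same element count in the narrow type touches a quarter of the cache lines (a sketch; the accumulator is widened so the arithmetic itself costs the same):
#include <cstddef>
#include <cstdint>

std::uint32_t sum_u8(const std::uint8_t* values, std::size_t n) {
  std::uint32_t total = 0;       // accumulate in a register-width type
  for (std::size_t i = 0; i < n; ++i)
    total += values[i];          // 64 elements per cache line; easy to vectorize
  return total;
}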
Mark Lakata's answer points in the right direction. I would like to add some points.
A wonderful resource for understanding and making optimization decisions is the set of Agner Fog documents.
The Instruction Tables document has the latencies for the most common instructions. You can see that some of them perform better in their native-size version: a mov, for example, may be eliminated, and a mul may have less latency.
However, here we are talking about gaining one clock cycle; we would have to execute a lot of instructions to compensate for a single cache miss. If this were the whole story, it would not be worth it.
The real problem comes with the decoders.
When you use length-changing prefixes (and you will, by using non-native-size words) the decoder takes extra cycles.
The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes.
In older (but still present) microarchitectures the penalty was severe, especially with some kinds of arithmetic instructions. In later microarchitectures this has been mitigated, but the penalty is still present.
Another aspect to consider is that using a non-native size requires prefixing the instructions, thereby generating larger code.
This is as close as it gets to the statement "additional instructions [are] generated to convert between the CPU word size and the chosen variable size", since Intel CPUs can handle non-native word sizes.
With other CPUs, especially RISC ones, this is not generally true and more instructions may be generated.
So while you are making optimal use of the data cache, you are also making bad use of the instruction cache.
It is also worth noting that on the common x64 ABI the stack must be aligned on a 16-byte boundary, and that the compiler usually saves local variables in the native word size or one close to it (e.g. a DWORD on a 64-bit system).
Only if you are allocating a sufficient number of local variables, or if you are using arrays or packed structs, can you gain benefits from using a small variable size.
If you declare a single uint16_t variable, it will probably take the same stack space as a single uint64_t, so it is best to go for the fastest size.
Furthermore, when it comes to the data cache, it is the locality that matters, rather than the data size alone.
So, what to do?
Luckily, you don't have to decide between having small data or small code.
If you have a considerable quantity of data, it is usually handled with arrays or pointers and with intermediate variables. An example is this line of code:
t = my_big_data[i];
Here my approach is:
Keep the external representation of data, i.e. the my_big_data array, as small as possible. For example, if that array stores temperatures, use a coded uint8_t for each element.
Keep the internal representation of data, i.e. the t variable, as close as possible to the CPU word size. For example t could be a uint32_t or uint64_t.
This way your program optimizes both caches and uses the native word size.
As a bonus, you may later decide to switch to SIMD instructions without having to repack the my_big_data memory layout.
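A small sketch of that split, assuming temperatures stored as a coded uint8_t (the offset encoding is invented purely for illustration):
#include <cstddef>
#include <cstdint>

// External representation: one byte per sample keeps the array cache-dense.
// A stored byte b encodes the temperature b + 223 kelvin (hypothetical coding).
std::int64_t sum_kelvin(const std::uint8_t* my_big_data, std::size_t n) {
  std::int64_t total = 0;                    // internal: native word width
  for (std::size_t i = 0; i < n; ++i) {
    std::uint_fast32_t t = my_big_data[i];   // widen once on load
    total += static_cast<std::int64_t>(t) + 223;
  }
  return total;
}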
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
D. Knuth
When you design your structures' memory layout, be problem-driven. For example, age values need 8 bits, and city distances in miles need 16 bits.
When you code the algorithms, use the fastest type the compiler is known to have for that scope. For example, integers are faster than floating-point numbers, and uint_fast8_t is no slower than uint8_t.
When it is then time to improve performance, start by changing the algorithm (using faster types, eliminating redundant operations, and so on) and then, if needed, the data structures (aligning, padding, packing and so on).

Why does the size of a struct need to be a multiple of the largest alignment of any struct member

I understand the padding that takes place between the members of a struct to ensure correct alignment of individual types. However, why does the data structure have to be a multiple of the alignment of the largest member? I don't understand why padding is needed at the end.
Reference:
http://en.wikipedia.org/wiki/Data_structure_alignment
Good question. Consider this hypothetical type:
struct A {
    int n;
    bool flag;
};
So, an object of type A should take five bytes (four for the int plus one for the bool), but in fact it takes eight. Why?
The answer is seen if you use the type like this:
const size_t N = 100;
A a[N];
If each A were only five bytes, then a[0] would align but a[1], a[2] and most of the other elements would not.
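A quick way to check this in code, using the struct A above (the exact numbers assume the common ABI with a 4-byte int and a 1-byte bool):
// Tail padding rounds sizeof(A) up to a multiple of alignof(A), so in
// A a[N] the member n of every element lands on a 4-byte boundary.
static_assert(sizeof(A) % alignof(A) == 0, "array elements stay aligned");
static_assert(sizeof(A) == 8, "4 for n + 1 for flag + 3 bytes of tail padding");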
But why does alignment even matter? There are several reasons, all hardware-related. One reason is that recently/frequently used memory is cached in cache lines on the CPU silicon for rapid access. An aligned object smaller than a cache line always fits in a single line (but see the interesting comments appended below), but an unaligned object may straddle two lines, wasting cache.
There are actually even more fundamental hardware reasons, having to do with the way byte-addressable data is transferred down a 32- or 64-bit data bus, quite apart from cache lines. Not only will misalignment clog the bus with extra fetches (due as before to straddling), but it will also force registers to shift bytes as they come in. Even worse, misalignment tends to confuse optimization logic (at least, Intel's optimization manual says that it does, though I have no personal knowledge of this last point). So, misalignment is very bad from a performance standpoint.
It usually is worth it to waste the padding bytes for these reasons.
Update: The comments below are all useful. I recommend them.
Depending on the hardware, alignment might be necessary or just help speeding up execution.
There are a number of processors (ARM, I believe) on which an unaligned access leads to a hardware exception. Plain and simple.
Even though typical x86 processors are more lenient, there is still a penalty in accessing unaligned fundamental types, as the processor has to do more work to bring the bits into the register before being able to operate on it. Compilers usually offer specific attributes/pragmas when packing is desirable nonetheless.
Because of virtual addressing.
"...aligning a page on a page-sized boundary lets the
hardware map a virtual address to a physical address by substituting
the higher bits in the address, rather than doing complex arithmetic."
By the way, I found the Wikipedia page on this quite well written.
If the register size of the CPU is 32 bits, then it can grab memory that is on 32 bit boundaries with a single assembly instruction. It is slower to grab 32 bits, and then get the byte that starts at bit 8.
BTW: There doesn't have to be padding. You can ask that structures be packed.
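For example, with GCC or Clang you can ask for a packed layout explicitly, trading away the alignment (and the performance guarantees discussed above) for size; a minimal sketch:
#include <cstdint>

struct Padded {              // natural layout: 4 + 1 + 3 bytes of tail padding
  std::uint32_t n;
  std::uint8_t flag;
};                           // sizeof(Padded) is typically 8

struct __attribute__((packed)) Packed {
  std::uint32_t n;           // may now sit at an unaligned address
  std::uint8_t flag;
};                           // sizeof(Packed) is 5

// Reading Packed::n can be slower, and dereferencing a pointer to it on a
// strict-alignment architecture (e.g. older ARM) can fault.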