How can some architectures guarantee that aligned memory operations are atomic? - concurrency

As explained in this post: Why is integer assignment on a naturally aligned variable atomic on x86? :
Memory load/store on a byte value - and any correctly aligned value up to 64 bits
is guaranteed to be atomic on x86.
But what if:
1- The data crosses cache-line boundaries. Assume I have short a = 1234; and the address of a is halfword aligned, but for some reason the 2-byte value is split between two cache lines, so the CPU needs to do extra work to fetch and concatenate the halves. How can this remain atomic?
2- The value is paged out. Assume the value the CPU is trying to fetch is properly aligned but is not in the cache or even in memory, so it has to be fetched all the way from disk. How is this still atomic?
I'd like to ask a third related question while we're at it:
3- Why does the data need to be aligned to its data type at all? Why isn't it enough for it to sit within a single cache-line block, given that every memory load/store is done in cache-line blocks and not in specific data sizes?

Related

Does a std::vector<std::unique_ptr<T>> with each individual unique_ptr vector item cache-line aligned make sense in order to avoid false sharing?

Assuming such a vector, std::vector<T, boost::alignment::aligned_allocator<T, 64>>, is read/write accessed by several cores concurrently and is allocated on a per-core basis (i.e. vector index 0 is used by CPU core 0 only, index 1 by core 1, and so on), then to avoid false sharing the underlying T would have to be declared alignas(64) or otherwise padded to a standard x86 cache-line size (i.e. 64 bytes). But what if the vector's T is std::unique_ptr<U>? Does the same still hold and make sense, i.e. is each vector item - in this case a std::unique_ptr - required to be 64 bytes in size?
If you want to be able to modify the pointers, then yes, you should ensure the pointers themselves are aligned. However, if the pointers are never changed while your parallel code is running, then they don't have to be (it's fine to share read-only data among threads even if they share cache lines). But then you have to make sure U itself is correctly aligned.
Note: don't assume a 64-byte cache-line size; use std::hardware_destructive_interference_size instead.
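For illustration, a minimal sketch of that advice, assuming C++17 and a hypothetical per-core Slot type (the name Slot, the element type, and the core count are placeholders, not from the question):
#include <atomic>
#include <new>      // std::hardware_destructive_interference_size (C++17; library support varies)
#include <vector>

// Hypothetical per-core slot: alignas pads each element out to its own cache line,
// so writes by different cores do not falsely share a line.
struct alignas(std::hardware_destructive_interference_size) Slot {
    std::atomic<int> value{0};
};

int main() {
    std::vector<Slot> per_core(8);                           // index i used only by core i
    per_core[0].value.store(42, std::memory_order_relaxed);
}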

C++ Atomic operations within contiguous block of memory

Is it possible to use atomic operations, possibly using the std::atomic library, when assigning values in a contiguous block of memory?
If I have this code:
uint16_t* data = (uint16_t*) calloc(num_values, size);
What can I do to make operations like this atomic:
data[i] = 5;
I will have multiple threads assigning to data, possibly at the same index, at the same time. The order in which these threads modify the value at a particular index doesn't matter to me, as long as the modifications are atomic, avoiding any possible mangling of the data.
EDIT: So, per #user4581301, I'm providing some context for my issue here.
I am writing a program to align depth video data frames to color video data frames. The camera sensors for depth and color have different focal characteristics so they do not come completely aligned.
The general algorithm involves projecting a pixel in depth space to a region in color space, then, overwriting all values in the depth frame, spanning that region, with that single pixel.
I am parallelizing this algorithm. These projected regions may overlap, so when parallelized, writes to an index may occur concurrently.
Pseudo-code looks like this:
for x in depth_video_width:
    for y in depth_video_height:
        pixel = get_pixel(x, y)
        x_min, x_max, y_min, y_max = project_depth_pixel(x, y)
        // iterate over the projected region
        for x' in [x_min, x_max]:
            for y' in [y_min, y_max]:
                // possible concurrent modification here
                data[x', y'] = pixel
The outer loop or outermost two loops are parallelized.
You're not going to be able to do exactly what you want like this.
An atomic array doesn't make much sense, nor is it what you want (you want individual writes to be atomic).
You can have an array of atomics:
#include <atomic>
#include <array>
#include <cstdint>

int main()
{
    std::array<std::atomic<uint16_t>, 5> data{};
    data[1] = 5;
}
… but now you can't just access a contiguous block of uint16_ts, which it's implied you want to do.
If you don't mind something platform-specific, you can keep your array of uint16_ts and ensure that you only use atomic operations with each one (e.g. GCC's __atomic intrinsics).
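A minimal sketch of that platform-specific route, assuming GCC/Clang and their __atomic builtins (the buffer size and index here are placeholders):
#include <cstddef>
#include <cstdint>
#include <cstdlib>

int main() {
    std::size_t num_values = 1024;
    // Plain contiguous uint16_t buffer; every shared access goes through the builtins.
    uint16_t* data = static_cast<uint16_t*>(std::calloc(num_values, sizeof(uint16_t)));

    __atomic_store_n(&data[42], uint16_t{5}, __ATOMIC_RELAXED);  // atomic write
    uint16_t v = __atomic_load_n(&data[42], __ATOMIC_RELAXED);   // atomic read
    (void)v;

    std::free(data);
}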
But, generally, I think you're going to want to keep it simple and just lock a mutex around accesses to a normal array. Measure to be sure, but you may be surprised at how much of a performance loss you don't get.
If you're desperate for atomics, and desperate for an underlying array of uint16_t, and desperate for a standard solution, you could wait for C++20 and keep an std::atomic_ref (this is like a non-owning std::atomic) for each element, then access the elements through those. But then you still have to be cautious about any operation accessing the elements directly, possibly by using a lock, or at least by being very careful about what's doing what and when. At this point your code is much more complex: be sure it's worthwhile.
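For completeness, a sketch of that C++20 std::atomic_ref route over a plain contiguous buffer (the names and sizes are illustrative only):
#include <atomic>
#include <cstdint>
#include <vector>

int main() {
    std::vector<uint16_t> data(1024);         // plain contiguous uint16_t storage

    // Wrap one element for a single atomic store; the underlying array stays plain.
    std::atomic_ref<uint16_t> ref(data[42]);
    ref.store(5, std::memory_order_relaxed);

    // Caveat: every concurrent access to data[42] must also go through std::atomic_ref.
}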
To add to the last answer, I would strongly advocate against using an array of atomics here, since contended atomic accesses fight over the containing cache line (at least on x86). In practice, it means that when accessing element i in your array (either to read or to write it), other threads touching that particular cache line are slowed down while the line bounces between cores.
The solution to your problem is a mutex as mentioned in the other answer.
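For comparison, a sketch of the mutex approach both answers point to, using a single coarse lock (the simplest possible scheme, not tuned for this workload; names are placeholders):
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

std::vector<uint16_t> data(1024);
std::mutex data_mutex;                      // one coarse lock guarding the whole buffer

void write_pixel(std::size_t i, uint16_t pixel) {
    std::lock_guard<std::mutex> lock(data_mutex);
    data[i] = pixel;                        // plain write, serialized by the mutex
}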
As for the maximum size of a guaranteed-atomic plain memory access, it currently seems to be 64 bits (see https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html):
The Intel-64 memory ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:
• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of alignment.
In other words, your processor doesn't guarantee that plain memory operations wider than 64 bits are atomic. And I'm not even mentioning here the standard-library implementation of std::atomic, which may fall back to a lock (see https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free).
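To see where a particular implementation draws that line, you can query is_always_lock_free for different object sizes; a small C++17 sketch (the Bytes16 struct is just a placeholder 16-byte payload):
#include <atomic>
#include <cstdint>
#include <iostream>

struct Bytes16 { std::uint64_t a, b; };    // trivially copyable 16-byte payload

int main() {
    std::cout << "1 byte:   " << std::atomic<std::uint8_t>::is_always_lock_free  << '\n'
              << "2 bytes:  " << std::atomic<std::uint16_t>::is_always_lock_free << '\n'
              << "4 bytes:  " << std::atomic<std::uint32_t>::is_always_lock_free << '\n'
              << "8 bytes:  " << std::atomic<std::uint64_t>::is_always_lock_free << '\n'
              << "16 bytes: " << std::atomic<Bytes16>::is_always_lock_free       << '\n';
}
On typical x86-64 builds the first four print 1; whether the 16-byte case is reported lock-free depends on the compiler and flags (e.g. whether cmpxchg16b may be used), which is exactly the cppreference caveat above.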

Largest "atomic" Type?

When working in a concurrent, parallel programming language with multiple threads working together on multiple cores and/or multiple sockets, what is the largest value in memory considered to be atomic?
What I mean is: a string, being a sequence of bytes, is decidedly not atomic because a write to that location in memory may take some time to update the string. Therefore, a lock of some kind must be acquired when reading and writing to the string so that other threads don't see corrupted, half-finished results. However, a string on the stack is atomic because AFAIK the stack is not a shared area of memory across threads.
Is the largest guaranteed, lockless unit a bit or a byte, or does it depend on the instruction used to write that byte? Is it possible, for instance, for a thread to read an integer while another thread is moving the value bit-by-bit from a register to the stack to shared memory, causing the reader thread to see a half-written value?
I guess I am asking what the largest atomic value is on x86_64 and what the guarantees are.
The largest atomic instruction in x86-64 is lock cmpxchg16b, which reads and writes 16 bytes atomically.
Although it is usually used to atomically update a 16-byte object in memory, it can also be used to atomically read such a value.
To atomically update a value, load rdx:rax with the prior value and rcx:rbx with the new value. The instruction atomically updates the memory location with the new value only if the prior value hasn't changed.
To atomically read a 16-byte value, load rdx:rax and rcx:rbx with the same value. (It doesn't matter what value, but 0 is a good choice.) The instruction atomically reads the current value into rdx:rax.
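From C++ you would normally reach this instruction through std::atomic rather than hand-written assembly; a hedged sketch on a 16-byte trivially copyable struct (whether it compiles to an inline lock cmpxchg16b or a libatomic call depends on the compiler and flags such as -mcx16):
#include <atomic>
#include <cstdint>

struct Pair { std::uint64_t lo, hi; };       // 16 bytes, trivially copyable

std::atomic<Pair> shared{Pair{0, 0}};

void update(Pair desired) {
    Pair expected = shared.load();           // 16-byte atomic read
    // Classic CAS loop: retry until no other thread changed the value in between.
    while (!shared.compare_exchange_weak(expected, desired)) {
        // On failure, expected holds the value seen in memory; loop and try again.
    }
}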

How does std::alignas optimize the performance of a program?

On a 32-bit machine, one memory read cycle gets 4 bytes of data.
So it should take 32 read cycles to read the 128-byte buffer declared below:
char buffer[128];
Now, suppose I align this buffer as shown below; how will that make it faster to read?
alignas(128) char buffer[128];
I am assuming a memory read cycle will still fetch only 4 bytes.
The size of the registers used for memory access is only one part of the story, the other part is the size of the cache-line.
If a cache line is 64 bytes and your char[128] only has its natural alignment (which for char is just 1 byte), the CPU generally needs to touch three different cache lines. With alignas(64) or alignas(128), only two cache lines need to be touched.
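A small sketch of that arithmetic, counting how many lines a buffer at a given address touches (the 64-byte line size and the example addresses are assumptions for illustration):
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kLineSize = 64;   // assumed x86-like cache-line size

// Number of cache lines touched by the byte range [addr, addr + size).
std::size_t lines_touched(std::uintptr_t addr, std::size_t size) {
    std::uintptr_t first = addr / kLineSize;
    std::uintptr_t last  = (addr + size - 1) / kLineSize;
    return static_cast<std::size_t>(last - first + 1);
}

int main() {
    std::printf("%zu\n", lines_touched(0x1030, 128));  // misaligned start: 3 lines
    std::printf("%zu\n", lines_touched(0x1040, 128));  // 64-byte aligned:  2 lines
}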
If you are working with a memory-mapped file, or under swapping conditions, the next level of alignment kicks in: the size of a memory page. This would call for 4096- or 8192-byte alignments.
However, I seriously doubt that alignas() has any significant positive effect if the specified alignment is larger than the natural alignment that the compiler uses anyway: It significantly increases memory consumption, which may be enough to trigger more cache-lines/memory pages being touched in the first place. It's only the small misalignments that need to be avoided because they may trigger huge slowdowns on some CPUs, or might be downright illegal/impossible on others.
Thus, truth is only in measurement: If you need all the speedup you can get, try it, measure the runtime difference, and see whether it works out.
On a 32-bit machine, one memory read cycle gets 4 bytes of data.
It's not that simple. Just the term "32 bit machine" is already too broad and can mean many things. 32b registers (GP registers? ALU registers? Address registers?)? 32b address bus? 32b data bus? 32b instruction word size?
And "memory read" by whom. CPU? Cache? DMA chip?
If you have a HW platform where memory is read by 4 bytes (aligned by 4) in single cycle and without any cache, then alignas(128) will do no difference (than alignas(4)).

Why does Malloc() care about boundary alignments?

I've heard that malloc() aligns memory based on the type that is being allocated. For example, from the book Understanding and Using C Pointers:
The memory allocated will be aligned according to the pointer's data type. For example, a four-byte integer would be allocated on an address boundary evenly divisible by four.
If I follow, this means that
int *integer = malloc(sizeof(int)); will be allocated on an address boundary evenly divisible by four, even without casting the result of malloc to (int *).
I was working on a chat server; I read of a similar effect with structs.
And I have to ask: logically, why does it matter what the address boundary itself is divisible by? What's wrong with allocating a group of memory to the tune of n*sizeof(int) using an integer at address 129?
I know how pointer arithmetic works *(integer+1), but I can't work out the importance of boundaries...
The memory allocated will be aligned according to the pointer's data type.
If you are talking about malloc, this is false. malloc doesn't care what you do with the data and will allocate memory aligned to fit the most stringent native type of the implementation.
From the standard:
The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated).
And:
Logically, why does it matter what the address boundary itself is divisible by?
Due to the workings of the underlying machine, accessing unaligned data might be more expensive (e.g. x86) or illegal (e.g. arm). This lets the hardware take shortcuts that improve performance / simplify implementation.
In many processors, data that isn't aligned will cause a "trap" or "exception" (this is a different form of exception than those understood by the C++ compiler). Even on processors that don't trap when data isn't aligned, it is typically slower (twice as slow, for example) when the data is not correctly aligned. So it's in the compiler's/runtime library's best interest to ensure that things are nicely aligned.
And by the way, malloc (typically) doesn't know what you are allocating. Instead, malloc will align ALL data, no matter what size it is, to some suitable boundary that is "good enough" for general data access - typically 8 or 16 bytes in modern OS/processor combinations, 4 bytes in older systems.
This is because malloc won't know if you do char* p = malloc(1000); or double* p = malloc(1000);, so it has to assume you are storing double or whatever is the item with the largest alignment requirement.
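A small sketch of that guarantee: whatever the requested size, the returned address is aligned at least to alignof(max_align_t) (the 1-byte request and the printout are just for illustration):
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    void* p = std::malloc(1);                       // even a 1-byte request...
    auto addr = reinterpret_cast<std::uintptr_t>(p);

    // ...comes back aligned for the most demanding fundamental type.
    std::printf("alignof(max_align_t) = %zu\n", alignof(std::max_align_t));
    std::printf("address mod that alignment = %zu\n",
                static_cast<std::size_t>(addr % alignof(std::max_align_t)));  // prints 0

    std::free(p);
}
For alignments stricter than that (e.g. cache lines or pages), aligned_alloc or posix_memalign is the usual tool.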
The importance of alignment is not a language issue but a hardware issue. Some machines are incapable of reading a data value that is not properly aligned. Others can do it but do so less efficiently, e.g., requiring two reads to read one misaligned value.
The book quote is wrong; the memory returned by malloc is guaranteed to be aligned correctly for any type. Even if you write char *ch = malloc(37);, it is still aligned for int or any other type.
You seem to be asking "What is alignment?" If so, there are several questions on SO about this already, e.g. here, or a good explanation from IBM here.
It depends on the hardware. Even assuming int is 32 bits, malloc(sizeof(int)) could return an address divisible by 1, 2, or 4. Different processors handle unaligned access differently.
Processors don't read directly from RAM any more; that's too slow (it takes hundreds of cycles). So when they do grab RAM, they grab it in big chunks, like 64 bytes at a time. If your address isn't aligned, the 4-byte integer might straddle two 64-byte cache lines, so your processor has to do two loads and fix up the result. Or maybe the engineers decided that building the hardware to fix up unaligned loads isn't necessary, so the processor signals an exception: either your program crashes, or the operating system catches the exception and fixes up the operation (hundreds of wasted cycles).
Aligning addresses means your program plays nicely with hardware.
Because it's faster; most processors prefer data that is aligned. Some processors CANNOT access data that is not aligned at all! (If you try to access such data, the processor may raise a fault.)