C++ - Atomic operations within a contiguous block of memory

Is it possible to use atomic operations, possibly via the std::atomic library, when assigning values in a contiguous block of memory?
If I have this code:
uint16_t* data = (uint16_t*) calloc(num_values, size);
What can I do to make operations like this atomic:
data[i] = 5;
I will have multiple threads assigning to data, possibly at the same index, at the same time. The order in which these threads modify the value at a particular index doesn't matter to me, as long as the modifications are atomic, avoiding any possible mangling of the data.
EDIT: So, per #user4581301, I'm providing some context for my issue here.
I am writing a program to align depth video data frames to color video data frames. The camera sensors for depth and color have different focal characteristics so they do not come completely aligned.
The general algorithm involves projecting a pixel in depth space to a region in color space, then overwriting all values in the depth frame spanning that region with that single pixel.
I am parallelizing this algorithm. These projected regions may overlap, so when parallelized, writes to an index may occur concurrently.
Pseudo-code looks like this:
for x in depth_video_width:
    for y in depth_video_height:
        pixel = get_pixel(x, y)
        x_min, x_max, y_min, y_max = project_depth_pixel(x, y)
        // iterate over the projected region
        for x' in [x_min, x_max]:
            for y' in [y_min, y_max]:
                // possible concurrent modification here
                data[x', y'] = pixel
The outer loop or outermost two loops are parallelized.
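For concreteness, here is roughly what that pseudocode looks like in C++, assuming hypothetical get_pixel and project_depth_pixel helpers and OpenMP for parallelizing the outer loops; the marked store is where concurrent writes to the same index can occur:
#include <cstdint>

// Hypothetical helpers mirroring the pseudocode above.
uint16_t get_pixel(int x, int y);
void project_depth_pixel(int x, int y, int& x_min, int& x_max, int& y_min, int& y_max);

void align_frames(uint16_t* data, int width, int height)
{
    #pragma omp parallel for collapse(2)  // parallelize the outermost two loops
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            uint16_t pixel = get_pixel(x, y);
            int x_min, x_max, y_min, y_max;
            project_depth_pixel(x, y, x_min, x_max, y_min, y_max);
            for (int xp = x_min; xp <= x_max; ++xp) {
                for (int yp = y_min; yp <= y_max; ++yp) {
                    data[yp * width + xp] = pixel;  // possible concurrent modification here
                }
            }
        }
    }
}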

You're not going to be able to do exactly what you want like this.
An atomic array doesn't make much sense, nor is it what you want (you want individual writes to be atomic).
You can have an array of atomics:
#include <atomic>
#include <array>
#include <cstdint>

int main()
{
    std::array<std::atomic<uint16_t>, 5> data{};
    data[1] = 5;
}
… but now you can't just access a contiguous block of uint16_ts, which it's implied you want to do.
If you don't mind something platform-specific, you can keep your array of uint16_ts and ensure that you only use atomic operations with each one (e.g. GCC's __atomic intrinsics).
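For example, a minimal sketch using GCC/Clang's __atomic builtins on a plain uint16_t array (assuming every thread that touches an element goes through these helpers):
#include <cstddef>
#include <cstdint>

void store_atomic(uint16_t* data, size_t i, uint16_t value)
{
    // Relaxed is enough here: the question only needs untorn writes, not ordering.
    __atomic_store_n(&data[i], value, __ATOMIC_RELAXED);
}

uint16_t load_atomic(uint16_t* data, size_t i)
{
    return __atomic_load_n(&data[i], __ATOMIC_RELAXED);
}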
But, generally, I think you're going to want to keep it simple and just lock a mutex around accesses to a normal array. Measure to be sure, but you may be surprised at how much of a performance loss you don't get.
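A minimal sketch of the mutex approach, keeping the contiguous uint16_t buffer (the wrapper type here is hypothetical, not part of the question's code):
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct GuardedFrame {
    std::vector<uint16_t> data;
    std::mutex m;

    void set(std::size_t i, uint16_t value)
    {
        std::lock_guard<std::mutex> lock(m);  // serialize all writers
        data[i] = value;
    }
};
If a single lock turns out to be too contended, a common middle ground is to shard the buffer into stripes, each guarded by its own mutex.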
If you're desperate for atomics, and desperate for an underlying array of uint16_t, and desperate for a standard solution, you could wait for C++20 and keep an std::atomic_ref (this is like a non-owning std::atomic) for each element, then access the elements through those. But then you still have to be cautious about any operation accessing the elements directly, possibly by using a lock, or at least by being very careful about what's doing what and when. At this point your code is much more complex: be sure it's worthwhile.
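A C++20 sketch of the std::atomic_ref variant over the same contiguous buffer, using relaxed ordering since the question only needs untorn stores:
#include <atomic>
#include <cstddef>
#include <cstdint>

void write_element(uint16_t* data, std::size_t i, uint16_t value)
{
    // Non-owning atomic view of one plain element, for the duration of this access.
    std::atomic_ref<uint16_t> ref(data[i]);
    ref.store(value, std::memory_order_relaxed);
}
Note that every concurrent access to such an element has to go through a std::atomic_ref, and the element must satisfy std::atomic_ref<uint16_t>::required_alignment, which a calloc'd uint16_t buffer already does.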

To add to the previous answer, I would strongly advocate against using an array of atomics, since any read or write to an atomic locks an entire cache line (at least on x86). In practice, it means that when accessing element i in your array (either to read or to write it), you lock the cache line around that element (so other threads can't access that particular cache line).
The solution to your problem is a mutex as mentioned in the other answer.
The maximum width currently supported for atomic operations seems to be 64 bits (see https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html):
The Intel-64 memory ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:
• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2-byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4-byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8-byte boundary.
Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of alignment.
In other words, your processor cannot perform atomic operations wider than 64 bits. And that's without even mentioning that the standard library implementation of std::atomic may fall back to using a lock (see https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free).
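If you'd rather verify this on your own toolchain than take the manual's word for it, a small illustrative C++17 check:
#include <atomic>
#include <cstdint>
#include <iostream>

struct Pair128 { std::uint64_t lo, hi; };  // 16 bytes, wider than the 64-bit limit above

int main()
{
    // True on x86-64: 8-byte atomics map directly to ordinary aligned loads/stores.
    std::cout << "64-bit always lock-free: "
              << std::atomic<std::uint64_t>::is_always_lock_free << '\n';

    // Frequently reported as not lock-free (or implemented via CMPXCHG16B or a lock).
    std::atomic<Pair128> wide{};
    std::cout << "128-bit lock-free: " << wide.is_lock_free() << '\n';
}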

Related

Read a non-atomic variable, atomically?

I have a non-atomic 62-bit double which is incremented in one thread regularly. This access does not need to be atomic. However, this variable is occasionally read (not written) by another thread. If I align the variable on a 64-bit boundary the read is atomic.
However, is there any way I can ensure I do not read the variable mid-way during the increment? Could I call a CPU instruction which serialises the pipeline or something? Memory barrier?
I thought of declaring the variable atomic and using std::memory_order::memory_order_relaxed in my critical thread (and a stricter memory barrier in the rare thread), but it seems to be just as expensive.
Since you tagged x86, this will be x86-specific.
An increment is essentially three parts: read, add, write. The increment is not atomic, but all three steps of it are (the add doesn't count I suppose, it's not observable anyway), as long as the variable does not cross a cache line boundary (this condition is weaker than having to be aligned to its natural alignment; it has been like this as of P6, before that quadwords had to be aligned).
So you already can't read a torn value. The worst you could do is overwrite the variable in between the moments where it is read and the new value is written, but you're only reading it.
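If you want the standard to back this up instead of relying on x86 behaviour, a C++20 sketch with std::atomic_ref and relaxed ordering keeps the hot thread cheap (these compile to ordinary MOVs on x86, no LOCK prefix) while formally removing the data race:
#include <atomic>

double counter = 0.0;  // incremented by one thread, occasionally read by another

void hot_thread_increment()
{
    std::atomic_ref<double> ref(counter);
    // Single writer: relaxed load + plain add + relaxed store, no atomic RMW needed.
    double cur = ref.load(std::memory_order_relaxed);
    ref.store(cur + 1.0, std::memory_order_relaxed);
}

double rare_thread_read()
{
    return std::atomic_ref<double>(counter).load(std::memory_order_relaxed);
}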

CUDA - Understanding parallel execution of threads (warps) and coalesced memory access

I just started to code in CUDA and I'm trying to get my head around the concepts of how threads are executed and memory accessed in order to get the most out of the GPU. I read through the CUDA best practice guide, the book CUDA by Example and several posts here. I also found the reduction example by Mark Harris quite interesting and useful, but despite all the information I got rather confused on the details.
Let's assume we have a large 2D array (N*M) on which we do column-wise operations. I split the array into blocks so that each block has a number of threads that is a multiple of 32 (all threads fit into several warps). The first thread in each block allocates additional memory (a copy of the initial array, but only for the size of its own dimension) and shares the pointer using a __shared__ variable so that all threads of the same block can access the same memory. Since the number of threads is a multiple of 32, so should be the memory in order to be accessed in a single read. However, I need to have extra padding around the memory block, a border, so that the width of my array becomes (32*x)+2 columns. The border comes from decomposing the large array, so that I have overlapping areas in which a copy of its neighbours is temporarily available.
Coalesced memory access:
Imagine the threads of a block are accessing the local memory block
int x = threadIdx.x;

for (int y = 0; y < height; y++)
{
    double value_centre = array[y*width + x+1]; // remember we have the border so we need an offset of +1
    double value_left   = array[y*width + x  ]; // hence the left element is at x
    double value_right  = array[y*width + x+2]; // and the right element at x+2

    // .. do something
}
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned? Note also, if that is not the case then I would have unaligned access to the array for each row after the first one, since the width of my array is (32*x)+2, and hence not 32-byte aligned. A further padding would however solve the problem for each new row.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one which is accessed without any offset?
Thread executed in a warp:
Threads in a warp are executed in parallel only if all the instructions are the same (according to link). If I do have a conditional statement / diverging execution, then that particular thread will be executed by itself and not within a warp with the others.
For example if I initialise the array I could do something like
1 int x = threadIdx.x;
2
3 array[x+1] = globalArray[blockIdx.x * blockDim.x + x]; // remember the border and therefore use +1
4
5 if (x == 0 || x == blockDim.x-1) // border
6 {
7     array[x] = DBL_MAX;
8 }
Will the warp be of size 32 and execute in parallel until line 3, then stop for all other threads while only the first and last thread continue on to initialise the border? Or will those be separated from all other threads already at the beginning, since there is an if statement that all other threads do not fulfill?
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to be valid for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a single warp, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
I wonder what the best practice is, to have either memory perfectly aligned for each row (in which case I would run each block with (32*x-2) threads so that the array with border is (32*x-2)+2 a multiple of 32 for each new line) or do it the way I had demonstrated above, with threads a multiple of 32 for each block and just live with the unaligned memory. I am aware that these sort of questions are not always straightforward and often depend on particular cases, but sometimes certain things are a bad practice and should not become habit.
When I experimented a little bit, I didn't really notice a difference in execution time, but maybe my examples were just too simple. I tried to get information from the visual profiler, but I haven't really understood all the information it gives me. I got however a warning that my occupancy level is at 17%, which I think must be really low and therefore there is something I do wrong. I didn't manage to find information on how threads are executed in parallel and how efficient my memory access is.
-Edit-
Added and highlighted 2 questions, one about memory access, the other one about how threads are collected to a single warp.
Now, my understanding is that since I do have an offset (+1,+2), which is unavoidable, I will have at least two reads per warp and per assignment (except for the left elements), or does it not matter from where I start reading as long as the memory after the 1st thread is perfectly aligned?
Yes, it does matter "from where you start reading" if you are trying to achieve perfect coalescing. Perfect coalescing means the read activity for a given warp and a given instruction all comes from the same 128-byte aligned cacheline.
Question: Is my understanding correct that in the example above only the first row would allow coalesced access, and only for the left element in the array, since that is the only one which is accessed without any offset?
Yes. For cc2.0 and higher devices, the cache(s) may mitigate some of the drawbacks of unaligned access.
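One common way to handle the +2 border columns is to pad each row out to the access granularity so that every row starts on a segment boundary; a small C++ sketch of that arithmetic, assuming the 128-byte granularity mentioned above (CUDA's cudaMallocPitch performs a similar row padding for you):
#include <cstddef>

// Round a row of `width_elems` elements of `elem_size` bytes up to a multiple
// of the 128-byte segment size, so each row begins on a segment boundary.
constexpr std::size_t padded_pitch_bytes(std::size_t width_elems, std::size_t elem_size)
{
    const std::size_t segment = 128;
    return (width_elems * elem_size + segment - 1) / segment * segment;
}

// Example: a (32*x)+2 wide row of doubles would then be indexed as
//   array[y * (padded_pitch_bytes(width, sizeof(double)) / sizeof(double)) + x + 1]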
Question: How are threads collected into a single warp? Each thread in a warp needs to share the same instructions. Does this need to be valid for the whole function? This is not the case for thread 1 (x=0), since it also initialises the border and is therefore different from the others. To my understanding, thread 1 is executed in a single warp, threads (2-33, etc.) in another warp, which then doesn't access the memory in a single read due to misalignment, and then again the final thread in a single warp due to the other border. Is that correct?
The grouping of threads into warps always follows the same rules, and will not vary based on the specifics of the code you write, but is only affected by your launch configuration. When you write code that not all the threads will participate in (such as in your if statement), then the warp still proceeds in lockstep, but the threads that do not participate are idle. When you are filling in borders like this, it's rarely possible to get perfectly aligned or coalesced reads, so don't worry about it. The machine gives you that flexibility.

Acquire/Release versus Sequentially Consistent memory order

For any std::atomic<T> where T is a primitive type:
If I blindly use std::memory_order_acq_rel for fetch_xxx operations, std::memory_order_acquire for load operations and std::memory_order_release for store operations (I mean, just as if resetting the default memory ordering of those functions):
Will the results be the same as if I had used std::memory_order_seq_cst (the default) for all of these operations?
If the results are the same, is this usage any different from using std::memory_order_seq_cst in terms of efficiency?
The C++11 memory ordering parameters for atomic operations specify constraints on the ordering. If you do a store with std::memory_order_release, and a load from another thread reads the value with std::memory_order_acquire then subsequent read operations from the second thread will see any values stored to any memory location by the first thread that were prior to the store-release, or a later store to any of those memory locations.
If both the store and subsequent load are std::memory_order_seq_cst then the relationship between these two threads is the same. You need more threads to see the difference.
e.g. std::atomic<int> variables x and y, both initially 0.
Thread 1:
x.store(1,std::memory_order_release);
Thread 2:
y.store(1,std::memory_order_release);
Thread 3:
int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire);
Thread 4:
int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);
As written, there is no relationship between the stores to x and y, so it is quite possible to see a==1, b==0 in thread 3, and c==1 and d==0 in thread 4.
If all the memory orderings are changed to std::memory_order_seq_cst then this enforces an ordering between the stores to x and y. Consequently, if thread 3 sees a==1 and b==0 then that means the store to x must be before the store to y, so if thread 4 sees c==1, meaning the store to y has completed, then the store to x must also have completed, so we must have d==1.
In practice, using std::memory_order_seq_cst everywhere will add additional overhead to either loads or stores or both, depending on your compiler and processor architecture. For example, a common technique for x86 processors is to use XCHG instructions rather than MOV instructions for std::memory_order_seq_cst stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release a plain MOV will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
Memory ordering is hard. I devoted almost an entire chapter to it in my book.
Memory ordering can be quite tricky, and the effects of getting it wrong are often very subtle.
The key point with all memory ordering is that it guarantees what "HAS HAPPENED", not what is going to happen. For example, if you store something to a couple of variables (e.g. x = 7; y = 11;), then another processor may be able to see y as 11 before it sees the value 7 in x. By using a memory ordering operation between setting x and setting y, the processor that you are using will guarantee that x = 7; has been written to memory before it continues to store something in y.
Most of the time, it's not REALLY important which order your writes happen, as long as the value is updated eventually. But if we, say, have a circular buffer with integers, and we do something like:
buffer[index] = 32;
index = (index + 1) % buffersize;
and some other thread is using index to determine that the new value has been written, then we NEED to have 32 written FIRST, then index updated AFTER. Otherwise, the other thread may get old data.
The same applies to making semaphores, mutexes and such things work - this is why the terms release and acquire are used for the memory barrier types.
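A minimal C++11 sketch of that circular-buffer handoff (illustrative only; single producer, single consumer, no overflow handling):
#include <atomic>
#include <cstddef>

constexpr std::size_t buffersize = 64;
int buffer[buffersize];
std::atomic<std::size_t> index{0};

void producer(int value)
{
    std::size_t i = index.load(std::memory_order_relaxed);
    buffer[i % buffersize] = value;                 // write the data first...
    index.store(i + 1, std::memory_order_release);  // ...then publish it
}

bool consumer(std::size_t last_seen, int& out)
{
    if (index.load(std::memory_order_acquire) == last_seen)  // pairs with the release above
        return false;                                        // nothing new yet
    out = buffer[last_seen % buffersize];  // safe: written before the matching release
    return true;
}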
Now, seq_cst is the strictest ordering rule - it enforces that both reads and writes of the data you've written go out to memory before the processor can continue to do more operations. This will be slower than doing the specific acquire or release barriers. It forces the processor to make sure stores AND loads have been completed, as opposed to just stores or just loads.
How much difference does that make? It is highly dependent on the system architecture. On some systems, the cache needs to be flushed [partially] and interrupts sent from one core to another to say "Please do this cache-flushing work before you continue" - this can take several hundred cycles. On other processors, it's only some small percentage slower than doing a regular memory write. X86 is pretty good at doing this fast. Some types of embedded processors, (some models of - not sure?) ARM for example, require a bit more work in the processor to ensure everything works.

Does the C++11 memory model prevent memory tearing and conflicts?

Reading a draft of C++11 I was interested by clause 1.7.3:
A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. ... Two threads of execution (1.10) can update and access separate memory locations without interfering with each other.
Does this clause protect from hardware related race conditions such as:
unaligned data access where memory is updated in two bus transactions (memory tearing)?
where you have distinct objects within a system memory unit, e.g. two 16-bit signed integers in a 32-bit word, and each independent update of the separate objects requires the entire memory unit to be written (memory conflict)?
Regarding the second point, the standard guarantees that there will be no race there. That being said, I have been told that this guarantee is not implemented in current compilers, and it might even be impossible to implement on some architectures.
Regarding the first point, if the second point is guaranteed, and if your program does not contain any race condition, then the natural outcome is that this will not be a race condition either. That is, given the premise that the standard guarantees that writes to different sub-word locations are safe, the only case where you can have a race condition is if multiple threads access the same variable (one that is split across words, or, more probably for this to be problematic, across cache lines).
Again, this might be hard or even impossible to implement. If your unaligned datum crosses a cache line, it would be almost impossible to guarantee the correctness of the code without imposing a huge cost on performance. You should try to avoid unaligned variables as much as possible for this and other reasons (including raw performance: a write to an object that touches two cache lines involves writing as many as 32 bytes to memory, and if any other thread is touching any of those cache lines, it also involves the cost of synchronization of the caches).
It does not protect against memory tearing, which is only visible when two threads access the same memory location (but the clause only applies to separate memory locations).
It appears to protect against memory conflict, according to your example. The most likely way to achieve this is that a system which can't write less than 32 bits at once would have 32-bit char, and then two separate objects could never share a "system memory unit". (The only way two 16-bit integers can be adjacent on a system with 32-bit char is as bitfields.)
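To make the second point concrete, here is an illustrative sketch: the two adjacent uint16_t members below are distinct memory locations, so the concurrent writes are race-free (even if the hardware has to implement them with wider read-modify-write cycles), whereas the two adjacent non-zero-width bit-fields form a single memory location and unsynchronized concurrent writes to them would be a data race:
#include <cstdint>
#include <thread>

struct Separate {
    std::uint16_t a;  // its own memory location
    std::uint16_t b;  // its own memory location
};

struct SharedLocation {
    std::uint16_t a : 16;  // these adjacent bit-fields form ONE memory location,
    std::uint16_t b : 16;  // so concurrent writes to a and b would race
};

int main()
{
    Separate s{};
    std::thread t1([&s] { s.a = 1; });  // no data race: separate memory locations
    std::thread t2([&s] { s.b = 2; });
    t1.join();
    t2.join();
}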

How and when to align to cache line size?

In Dmitry Vyukov's excellent bounded mpmc queue written in C++
See: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
He adds some padding variables. I presume this is to make it align to a cache line for performance.
I have some questions.
Why is it done in this way?
Is it a portable method that will always work?
In what cases would it be best to use __attribute__((aligned(64))) instead?
Why would padding before a buffer pointer help with performance? Isn't just the pointer loaded into the cache, so it's really only the size of a pointer?
static size_t const cacheline_size = 64;
typedef char cacheline_pad_t [cacheline_size];
cacheline_pad_t pad0_;
cell_t* const buffer_;
size_t const buffer_mask_;
cacheline_pad_t pad1_;
std::atomic<size_t> enqueue_pos_;
cacheline_pad_t pad2_;
std::atomic<size_t> dequeue_pos_;
cacheline_pad_t pad3_;
Would this concept work under gcc for c code?
It's done this way so that different cores modifying different fields won't have to bounce the cache line containing both of them between their caches. In general, for a processor to access some data in memory, the entire cache line containing it must be in that processor's local cache. If it's modifying that data, that cache entry usually must be the only copy in any cache in the system (Exclusive mode in the MESI/MOESI-style cache coherence protocols). When separate cores try to modify different data that happens to live on the same cache line, and thus waste time moving that whole line back and forth, that's known as false sharing.
In the particular example you give, one core can be enqueueing an entry (reading (shared) buffer_ and writing (exclusive) only enqueue_pos_) while another dequeues (shared buffer_ and exclusive dequeue_pos_) without either core stalling on a cache line owned by the other.
The padding at the beginning means that buffer_ and buffer_mask_ end up on the same cache line, rather than split across two lines and thus requiring double the memory traffic to access.
I'm unsure whether the technique is entirely portable. The assumption is that each cacheline_pad_t will itself be aligned to a 64 byte (its size) cache line boundary, and hence whatever follows it will be on the next cache line. So far as I know, the C and C++ language standards only require this of whole structures, so that they can live in arrays nicely, without violating alignment requirements of any of their members. (see comments)
The attribute approach would be more compiler specific, but might cut the size of this structure in half, since the padding would be limited to rounding up each element to a full cache line. That could be quite beneficial if one had a lot of these.
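For reference, a sketch of the attribute-style alternative in standard C++11 using alignas (where available, C++17's std::hardware_destructive_interference_size from <new> can replace the hard-coded 64):
#include <atomic>
#include <cstddef>

struct queue_layout {
    // Each hot counter starts on its own 64-byte cache line, so producers
    // bouncing enqueue_pos_ don't also invalidate dequeue_pos_ (no false sharing).
    alignas(64) std::atomic<std::size_t> enqueue_pos_;
    alignas(64) std::atomic<std::size_t> dequeue_pos_;
    alignas(64) void* buffer_;   // read-mostly fields can share a line
    std::size_t buffer_mask_;
};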
The same concept applies in C as well as C++.
You may need to align to a cache line boundary, which is typically 64 bytes per cache line, when you are working with interrupts or high-performance data reads, and it is mandatory when working with interprocess sockets. With interprocess sockets, there are control variables that cannot be spread out across multiple cache lines or DDR RAM words, else it will cause the L1, L2, etc. caches or DDR RAM to function as a low-pass filter and filter out your interrupt data! THAT IS BAD!!! That means you get bizarre errors when your algorithm is good, and it has the potential to make you go insane!
DDR RAM is almost always going to read in 128-bit words (DDR RAM words), which is 16 bytes, so ring buffer variables shall not be spread out across multiple DDR RAM words. Some systems do use 64-bit DDR RAM words, and technically you could get a 32-bit DDR RAM word on a 16-bit CPU, but one would use SDRAM in that situation.
One may also just be interested in minimizing the number of cache lines in use when reading data in a high-performance algorithm. In my case, I developed the world's fastest integer-to-string algorithm (40% faster than the prior fastest algorithm) and I'm working on optimizing the Grisu algorithm, which is the world's fastest floating-point algorithm. In order to print the floating-point number you must print the integer, so in order to optimize Grisu, one optimization I have implemented is cache-line-aligning the lookup tables (LUT) for Grisu into exactly 15 cache lines, which is rather odd in that it actually aligned like that. This takes the LUTs from the .bss section (i.e. static memory) and places them onto the stack (or the heap, but the stack is more appropriate). I have not benchmarked this, but it's good to bring up, and something I learned a lot about is that the fastest way to load values is to load them from the i-cache and not the d-cache. The difference is that the i-cache is read-only and has much larger cache lines because it's read-only (2KB was what a professor quoted me once). So you're actually going to degrade your performance with array indexing as opposed to loading a variable like this:
int faster_way = 12345678;
as opposed to the slower way:
int variables[2] = { 12345678, 123456789};
int slower_way = variables[0];
The difference is that int faster_way = 12345678 will get loaded from the i-cache lines by offsetting to the variable in the i-cache from the start of the function, while slower_way = variables[0] will get loaded from the smaller d-cache lines using much slower array indexing. This particular subtlety, as I just discovered, is actually slowing down my and many others' integer-to-string algorithms. I say this because you may think you're optimizing by cache-aligning read-only data when you're not.
Typically in C++, you will use the std::align function. I would advise not using this function because it is not guaranteed to work optimally. Here is the fastest way to align to a cache line, which, to be up front, I am the author of, and this is a shameless plug:
Kabuki Toolkit Memory Alignment Algorithm
#include <cstdint>

namespace _ {
/* Aligns the given pointer to a power-of-two boundary with a premade mask.
#return An aligned pointer of typename T.
#brief  Algorithm is a 2's complement trick that works by masking off
        the desired number of bits in 2's complement and adding them to the
        pointer.
#param  pointer The pointer to align.
#param  mask    The mask for the Least Significant bits to align. */
template <typename T = char>
inline T* AlignUp(void* pointer, intptr_t mask) {
  intptr_t value = reinterpret_cast<intptr_t>(pointer);
  value += (-value) & mask;
  return reinterpret_cast<T*>(value);
}
}  //< namespace _
// Example calls using the faster mask technique.
enum { kSize = 256 };
char buffer[kSize + 64];
char* aligned_to_64_byte_cache_line = AlignUp<> (buffer, 63);
char16_t* aligned_to_64_byte_cache_line2 = AlignUp<char16_t> (buffer, 63);
and here is the faster std::align replacement:
inline void* align_kabuki(size_t align, size_t size, void*& ptr,
                          size_t& space) noexcept {
  // Begin Kabuki Toolkit Implementation
  intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
           offset  = (-int_ptr) & (align - 1);
  if ((space -= offset) < size) {
    space += offset;
    return nullptr;
  }
  return reinterpret_cast<void*>(int_ptr + offset);
  // End Kabuki Toolkit Implementation
}