Is ConcurrentHashMap::computeIfAbsent atomic per key or per ConcurrentHashMap? - concurrency

In a call to ConcurrentHashMap::computeIfAbsent I use a slightly expensive mappingFunction. The mappingFunctions are safe to execute concurrently if and only if they are for different keys.
I'm wondering if the mappingFunctions are executed concurrently for different keys. If this is not the case, each mappingFunction would be executed one-at-a-time, leading to unnecessary waiting time. To fix this I would need to write more complex code and use putIfAbsent.
Does anyone know if mappingFunctions are executed concurrently if they are for different keys?
The documentation states:
The entire method invocation is performed atomically, so the function is applied at most once per key.
This may or may not answer my question, depending on how you read it.

Related

Using "memory_order::relaxed" or "memory_order::acq_rel" to generate unique IDs?

I have read in several places that relaxed ordering was ok to generate unique IDs. I have a doubt about that, because if two threads call at the same time:
uniqueId.fetch_add(1, std::memory_order::relaxed);
Then the value incremented by thread-A might not be visible yet to thread-B. This means both threads could get the same unique ID.
For this reason, I would rather use std::memory_order::acq_rel.
What do you think?
Impossible to test, in practice.
std::memory_order_* is about how stores and loads to memory locations other than the atomic object itself synchronize.
A single atomic object's value is always consistent among all threads. It has exactly one modification order that all threads agree on and that is consistent with sequencing of loads/stores in each thread, regardless of std::memory_order_*.
(However this is true only for each atomic object viewed individually. The same does not apply between multiple atomic objects.)
Whether std::memory_order::relaxed is sufficient in your case depends on whether the resulting ID values are going to be used through other shared objects (whether atomic or not) between the threads, but the expression
uniqueId.fetch_add(1, std::memory_order::relaxed)
, even when used in multiple threads, will generate each ID only once (assuming uniqueId refers to the same std::atomic object, no other stores are applied to it and no overflow/wrap-around happens). It is important though that fetch_add itself is an atomic read-modify-write. A load followed by a store would not be an atomic operation and wouldn't guarantee that no store from another thread intervenes between the load and store.
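A minimal sketch of that guarantee (illustrative names, plain std::thread workers): every call gets a distinct ID even with relaxed ordering, because each fetch_add is a single indivisible read-modify-write in the counter's one modification order.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<unsigned> uniqueId{0};   // hypothetical global ID counter

unsigned next_id()
{
    // Relaxed is enough for uniqueness; stronger orders only matter if the
    // IDs are used to synchronize access to other shared data.
    return uniqueId.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            for (int i = 0; i < 1000; ++i)
                next_id();                    // no two calls return the same value
        });
    for (auto& w : workers)
        w.join();
    std::printf("IDs handed out: %u\n", uniqueId.load());   // prints 4000
}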

Atomic operation propagation/visibility (atomic load vs atomic RMW load)

Context 
I am writing a thread-safe protothread/coroutine library in C++, and I am using atomics to make task switching lock-free. I want it to be as performant as possible. I have a general understanding of atomics and lock-free programming, but I do not have enough expertise to optimise my code. I did a lot of research, but it was hard to find answers to my specific problem: What is the propagation delay/visibility for different atomic operations under different memory orders?
Current assumptions 
I read that changes to memory are propagated to other threads, in such a way that they might become visible:
in different orders to different observers,
with some delay.
I am unsure as to whether this delayed visibility and inconsistent propagation applies only to non-atomic reads, or to atomic reads as well, potentially depending on what memory order is used. As I am developing on an x86 machine, I have no way of testing the behaviour on weakly ordered systems.
Do all atomic reads always read the latest values, regardless of the type of operation and the memory order used? 
I am pretty sure that all read-modify-write (RMW) operations always read the most recent value written by any thread, regardless of the memory order used. The same seems to be true for sequentially consistent operations, but only if all other modifications to a variable are also sequentially consistent. Both are said to be slow, which is not good for my task. If not all atomic reads get the most recent value, then I will have to use RMW operations just for reading an atomic variable's latest value, or use atomic reads in a while loop, to my current understanding.
Does the propagation of writes (ignoring side effects) depend on the memory order and the atomic operation used? 
(This question only matters if the answer to the previous question is that not all atomic reads always read the most recent value. Please read carefully, I do not ask about the visibility and propagation of side-effects here. I am merely concerned with the value of the atomic variable itself.) This would imply that depending on what operation is used to modify an atomic variable, it would be guaranteed that any following atomic read receives the most recent value of the variable. So I would have to choose between an operation guaranteed to always read the latest value, or use relaxed atomic reads, in tandem with this special write operation that guarantees instant visibility of the modification to other atomic operations.
Is atomic lock-free?
First of all, let's get rid of the elephant in the room: using atomic in your code doesn't guarantee a lock-free implementation. atomic is only an enabler for a lock-free implementation. is_lock_free() will tell you if it's really lock-free for the C++ implementation and the underlying types that you are using.
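As a quick illustration (a small sketch, not from the question's code), you can query this per specialization:

#include <atomic>
#include <cstdio>

struct Big { char bytes[64]; };   // illustrative type, almost certainly too large to be lock-free

int main()
{
    std::atomic<int> a{0};
    std::atomic<Big> b{};
    // is_lock_free() reports whether this particular specialization is lock-free
    // on this implementation; std::atomic itself makes no such promise.
    std::printf("atomic<int>: %d, atomic<Big>: %d\n",
                (int)a.is_lock_free(), (int)b.is_lock_free());
}

Since C++17 there is also the compile-time constant std::atomic<T>::is_always_lock_free, which can be checked without an object.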
What's the latest value?
The term "latest" is very ambiguous in the world of multithreading, because what is the "latest" for one thread, which might be put to sleep by the OS, might no longer be the latest for another thread that is active.
The only thing std::atomic guarantees is protection against race conditions, by ensuring that reads, writes and read-modify-writes (RMW) performed on one atomic in one thread are performed atomically, without any interruption, and that all other threads see either the value before or the value after, but never what's in-between. So atomics synchronize threads by creating an order between concurrent operations on the same atomic object.
You need to see every thread as a parallel universe with its own time, one that is unaware of the time in the other parallel universes. And like in quantum physics, the only thing that you can know in one thread about another thread is what you can observe (i.e. a "happens-before" relation between the universes).
This means that you should not conceive multithreaded time as if there would be an absolute "latest" across all the threads. You need to conceive time as relative to the other threads. This is why atomics don't create an absolute latest, but only ensure a sequential ordering of the successive states that an atomic will have.
Propagation
The propagation doesn't depend on the memory order or the atomic operation performed. memory_order is about sequencing constraints on non-atomic variables around atomic operations, which act like fences. The best explanation of how this works is certainly Herb Sutter's presentation, which is definitely worth its hour and a half if you're working on multithreading optimisation.
Although it is possible that a particular C++ implementation could implement some atomic operation in a way that influences propagation, you cannot rely on any such observation, since there is no guarantee that propagation works the same way in the next release of the compiler, or with another compiler on another CPU architecture.
But does propagation matter?
When designing lock-free algorithms, it is tempting to read atomic variables to get the latest status. But whereas such a read-only access is atomic, the action immediately after it is not. So the following instructions might assume a state which is already obsolete (for example because the thread is put to sleep immediately after the atomic read).
Take if(my_atomic_variable<10) and suppose that you read 9. Suppose you're in the best possible world and 9 is the absolutely latest value set by all the concurrent threads. Comparing its value with <10 is not atomic, so by the time the comparison succeeds and the if branches, my_atomic_variable might already have a new value of 10. And this kind of problem can occur regardless of how fast the propagation is, even if the read were guaranteed to always get the latest value. And I didn't even mention the ABA problem yet.
The only benefit of the read is to avoid a data race and UB. But if you want to synchronize decisions/actions across threads, you need to use an RMW, such as compare-and-swap (e.g. atomic_compare_exchange_strong), so that the ordering of atomic operations results in a predictable outcome.
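As a sketch of that last point (the threshold 10 and the helper name are made up), the non-atomic "check, then act" can be folded into one compare-and-swap loop:

#include <atomic>

std::atomic<int> my_atomic_variable{0};

// Increment only while the value is below 10. The naive
// "if (my_atomic_variable < 10) ++my_atomic_variable;" can overshoot, because
// another thread may update the variable between the read and the write.
bool try_increment_below_10()
{
    int current = my_atomic_variable.load();
    while (current < 10) {
        // Writes current + 1 only if the value is still `current`; on failure
        // `current` is reloaded with the latest value and the test runs again.
        if (my_atomic_variable.compare_exchange_strong(current, current + 1))
            return true;
    }
    return false;   // the threshold was already reached
}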
After some discussion, here are my findings: First, let's define what an atomic variable's latest value means: the very latest write to the atomic variable in wall-clock time, i.e., from an external observer's point of view. If there are multiple simultaneous last writes (e.g., on multiple cores during the same cycle), then it doesn't really matter which one of them is chosen.
Atomic loads of any memory order have no guarantee that the latest value is read. This means that writes have to propagate before you can access them. This propagation may be out of order with respect to the order in which they were executed, as well as differing in order with respect to different observers.
std::atomic_int counter{0};

void thread()
{
    // Imagine no race between read and write.
    int value = counter.load(std::memory_order_relaxed);
    counter.store(value + 1, std::memory_order_relaxed);
}

for (int i = 0; i < 1000; i++)
    std::async(thread);
In this example, according to my understanding of the spec, even if no read-write executions were to interfere, there could still be multiple executions of thread that read the same value, so that in the end, counter would not be 1000. This is because when using plain reads, although threads are guaranteed to read modifications of the same variable in the correct order (they will not read a new value and then an older value on the next read), they are not guaranteed to read the globally latest value written to the variable.
This creates the relativity effect (as in Einstein's physics) that every thread has its own "truth", and this is exactly why we need to use sequential consistency (or acquire/release) to restore causality: If we simply use relaxed loads, then we can even have broken causality and apparent time loops, which can happen because of instruction reordering in combination with out-of-order propagation. Memory ordering will ensure that those separate realities perceived by separate threads are at least causally consistent.
Atomic read-modify-write (RMW) operations (such as exchange, compare_exchange, fetch_add, …) are guaranteed to operate on the latest value as defined above. This means that propagation of writes is forced, and results in one universal view on the memory (if all reads you make are from atomic variables using RMW operations), independent of threads. So, if you use atomic.compare_exchange_strong(value, value, std::memory_order_relaxed) or atomic.fetch_or(0, std::memory_order_relaxed), then you are guaranteed to perceive one global order of modification that encompasses all atomic variables. Note that this does not guarantee you any ordering or causality of non-RMW reads.
std::atomic_int counter{0};

void thread()
{
    // Imagine no race between read and write.
    int value = counter.fetch_or(0, std::memory_order_relaxed);
    counter.store(value + 1, std::memory_order_relaxed);
}

for (int i = 0; i < 1000; i++)
    std::async(thread);
In this example (again, under the assumption that none of the thread() executions interfere with each other), it seems to me that the spec forbids value to contain anything but the globally latest written value. So, counter would always be 1000 in the end.
Now, when to use which kind of read? 
If you only need causality within each thread (there might still be different views on what happened in which order, but at least every single reader has a causally consistent view on the world), then atomic loads and acquire/release or sequential consistency suffice.
But if you also need fresh reads (so that you must never read values other than the globally (across all threads) latest value), then you should use RMW operations for reading. Those alone do not create causality for non-atomic and non-RMW reads, but all RMW reads across all threads share the exact same view on the world, which is always up to date.
So, to conclude: Use atomic loads if different world views are allowed, but if you need an objective reality, use RMWs to load.
Multithreading is a surprising area.
First, an atomic read is not necessarily ordered after a write. That is, reading a value does not mean that the value was written before the read. Such a read may even see (indirectly, through another thread) the result of some subsequent atomic write by the same thread.
Sequential consistency is clearly about visibility and propagation. When a thread performs a "sequentially consistent" write to an atomic, it makes all its previous writes visible to other threads (propagation). In that case a (sequentially consistent) read is ordered in relation to the write.
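A sketch of that visibility effect (a minimal flag/payload example; release/acquire is the part of the sequentially consistent guarantee that matters here, seq_cst additionally imposes a single total order on all such operations):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // ordinary, non-atomic data
std::atomic<bool> ready{false};

void writer()
{
    payload = 42;                                     // plain write
    ready.store(true, std::memory_order_release);     // publishes everything above it
}

void reader()
{
    while (!ready.load(std::memory_order_acquire))    // spin until the flag is seen
        ;
    assert(payload == 42);   // the acquire load synchronized with the release store
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}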
Generally the most performant operations are "relaxed" atomic operations, but they provide minimal ordering guarantees. In principle they even allow some causality paradoxes... :-)

Multithreading - hopefully a simple task

I have an iterative process coded in C++ which takes a long time and am considering converting my code to use multiple threads. But I am concerned that it could be very complicated and risk lock-ups and bugs. However I suspect that for this particular problem it may be trivial, but I would like confirmation.
I am hoping I can use threading code which is as simple as this here.
My program employs large amounts of global arrays and structures. I assume that the individual threads need not concern themselves if other threads are attempting to read the same data at the same time.
I would also assume that if one thread wanted to increment a global float variable by say 1.5 and another thread wanted to decrement it by 0.1 then so long as I didn't care about the order of events then both threads would succeed in their task without any special code (like mutexes and locks etc) and the float would eventually end up larger by 1.4. If all my assumptions are correct then my task will be easy - Please advise.
EDIT: just to make it absolutely clear - it doesn't matter at all the order in which the float is incremented / decremented. So long as its value ends up larger by 1.4 then I am happy. The value of the float is not read until after all the threads have completed their task.
EDIT: As a more concrete example, imagine we had the task of finding the total donations made to a charity from different states in the US. We could have a global like this:
float total_donations= 0;
Then we could have 50 separate threads, each of which calculated a local float called donations_from_this_state. And each thread would separately perform:
total_donations += donations_from_this_state;
Obviously which order the threads performed their task in would make no difference to the end result.
I assume that the individual threads need not concern themselves if other threads are attempting to read the same data at the same time.
Correct. As long as all threads are readers no synchronization is needed as no values are changed in the shared data.
I would also assume that if one thread wanted to increment a global float variable by say 1.5 and another thread wanted to decrement it by 0.1 then so long as I didn't care about the order of events then both threads would succeed in their task without any special code (like mutexs and locks etc) and the float would eventually end up larger by 1.4
This assumption is not correct. If you have two or more threads writing to the same shared variable, and that variable is not internally synchronized, then you need external synchronization; otherwise your code has undefined behavior per [intro.multithread]/21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Where a conflicting action is specified by [intro.multithread]/4:
Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.
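For the donations example specifically, one lock-free way to make the accumulation well-defined is to keep the total in a std::atomic<float> and add to it with a compare-exchange loop (std::atomic<float>::fetch_add only exists since C++20). A sketch with hypothetical per-state amounts:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<float> total_donations{0.0f};

// Adds `amount` atomically; the loop retries if another thread updated the
// total between our load and our store.
void add_donation(float amount)
{
    float current = total_donations.load();
    while (!total_donations.compare_exchange_weak(current, current + amount))
        ;   // `current` now holds the latest value, try again
}

int main()
{
    std::vector<std::thread> states;
    for (int i = 0; i < 50; ++i)
        states.emplace_back(add_donation, 1.0f);      // hypothetical per-state amount
    for (auto& t : states)
        t.join();
    std::printf("total: %.1f\n", total_donations.load());   // 50.0 here
}

A plain std::mutex around total_donations += donations_from_this_state would of course also work, and is usually the simpler choice.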

What is faster in CUDA: global memory write + __threadfence() or atomicExch() to global memory?

Assuming that we have lots of threads that will access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes but the writes are coalesced. On the other hand, atomicExch() takes into account just the important memory addresses, but I don't know if the writes are coalesced or not.
In code:
array[threadIdx.x] = value;
Or
atomicExch(&array[threadIdx.x] , value);
Thanks.
On Kepler GPUs, I would bet on atomicExch since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.
Please make an experiment and report the results.
Those two do very different things.
atomicExch ensures that no two threads try to modify a given cell at a time. If such a conflict occurs, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic...() function.
__threadfence() delays the current thread (and only the current thread!) to ensure that any subsequent writes by the given thread do actually happen later.
As such, __threadfence() on its own, without any follow-up code, is not very interesting.
For that reason, I don't think there is a point in comparing the efficiency of those two. Maybe if you could show a more concrete use case I could relate...
Note that neither of those actually gives you any guarantees on the actual order of execution of the threads.

Can shared memory be read and validated without mutexes?

On Linux I'm using shmget and shmat to setup a shared memory segment that one process will write to and one or more processes will read from. The data that is being shared is a few megabytes in size and when updated is completely rewritten; it's never partially updated.
I have my shared memory segment laid out as follows:
-------------------------
| t0 | actual data | t1 |
-------------------------
where t0 and t1 are copies of the time when the writer began its update (with enough precision such that successive updates are guaranteed to have differing times). The writer first writes to t1, then copies in the data, then writes to t0. The reader on the other hand reads t0, then the data, then t1. If the reader gets the same value for t0 and t1 then it considers the data consistent and valid, if not, it tries again.
Does this procedure ensure that if the reader thinks the data is valid then it actually is?
Do I need to worry about out-of-order execution (OOE)? If so, would the reader using memcpy to get the entire shared memory segment overcome the OOE issues on the reader side? (This assumes that memcpy performs its copy linearly and ascending through the address space. Is that assumption valid?)
Modern hardware is actually anything but sequentially consistent. Thus, this is not guaranteed to work as such if you don't execute memory barriers at the appropriate spots. Barriers are needed because the architecture implements a weaker shared memory consistency model than sequential consistency. This as such has nothing to do with pipelining or OoO, but with allowing multiple processors to efficiently access the memory system in parallel. See e.g. Shared memory consistency models: A tutorial. On a uniprocessor, you don't need barriers, because all the code executes sequentially on that one processor.
Also, there is no need to have two time fields; a sequence count is probably a better choice, as there is no need to worry whether two updates are so close that they get the same timestamp, and updating a counter is much faster than getting the current time. Also, there is no chance that the clock moves backwards in time, which might happen e.g. when ntpd adjusts for clock drift. Though this last problem can be overcome on Linux by using clock_gettime(CLOCK_MONOTONIC, ...). Another advantage of using sequence counters instead of timestamps is that you need only one sequence counter. The writer increments the counter both before writing the data and after the write is done. Then the reader reads the sequence number, checks that it's even, and if so, reads the data, and finally reads the sequence number again and compares it to the first sequence number. If the sequence number is odd, it means a write is in progress, and there is no need to read the data.
The Linux kernel uses a locking primitive called a seqlock that does something like the above. If you're not afraid of "GPL contamination", you can google for the implementation; as such it's trivial, but the trick is getting the barriers correct.
Joe Duffy gives the exact same algorithm and calls it: "A scalable reader/writer scheme with optimistic retry".
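A sketch of that single-counter scheme using C++ atomics (the struct, sizes and names are illustrative; in the shmget/shmat setup it would be placed in the shared segment, the sketch assumes a single writer, and the payload is accessed through relaxed atomics only so that a read racing with the write is not formally a data race):

#include <atomic>
#include <cstddef>

struct Shared {
    std::atomic<unsigned> seq{0};      // even: data stable, odd: update in progress
    std::atomic<unsigned> data[256];   // illustrative payload
};

void write(Shared& s, const unsigned* src, std::size_t n)
{
    unsigned seq0 = s.seq.load(std::memory_order_relaxed);
    s.seq.store(seq0 + 1, std::memory_order_relaxed);       // counter goes odd
    std::atomic_thread_fence(std::memory_order_release);    // a reader that sees new payload also sees the odd counter
    for (std::size_t i = 0; i < n; ++i)
        s.data[i].store(src[i], std::memory_order_relaxed);
    s.seq.store(seq0 + 2, std::memory_order_release);       // even again: data is stable
}

bool try_read(const Shared& s, unsigned* dst, std::size_t n)
{
    unsigned before = s.seq.load(std::memory_order_acquire);
    if (before & 1)
        return false;                                        // writer is mid-update
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = s.data[i].load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);     // pairs with the writer's release fence
    return before == s.seq.load(std::memory_order_relaxed);  // unchanged counter => consistent copy
}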
It works.
You need two sequence number fields.
You need to read and write them in opposite order.
You might need to have memory barriers in place, depending on the memory ordering guarantees of the system.
Specifically, you need read acquire and store release semantics for the readers and writers when they access t0 or t1 for reading and writing respectively.
What instructions are needed to achieve this depends on the architecture. E.g. on x86/x64, because of the relatively strong guarantees, one needs no machine-specific barriers at all in this specific case*.
* one still needs to ensure that the compiler/JIT does not mess around with the loads and stores, e.g. by using volatile (which has a different meaning in Java and C# than in ISO C/C++; compilers may differ, however, e.g. using VC++ 2005 or above with volatile it would be safe to do the above, see the "Microsoft Specific" section). It can be done with other compilers as well on x86/x64; the emitted assembly code should be inspected to make sure that accesses to t0 and t1 are not eliminated or moved around by the compiler.
As a side note, if you ever need MFENCE, then lock or [TopOfStack], 0 might be a better option, depending on your needs.