Am I right in the following assumptions:
I don't need to explicitly synchronize access to std::atomic<T> objects from different threads with my own synchronization objects, on any platform
std::atomic<T> operations may be lock-free or not, depending on the platform
std::atomic_bool and std::atomic<bool> (and other types like these) are actually the same thing
std::atomic_flag is the only class for which the standard guarantees lock-free operations on every platform
Also, where can I find useful info about std::memory_order and how to use it properly?
Let's go through these one by one.
I don't need to explicitly synchronize access to std::atomic<T> objects from different threads with my own synchronization objects, on any platform
Yes, atomic objects are fully synchronized on all of their accessor methods.
The only time a data race can occur with the atomic types is during construction. It involves constructing an atomic object A, passing its address to another thread via an atomic pointer using memory_order_relaxed (to intentionally work around the sequential consistency of std::atomic), and then accessing A from that second thread. So, don't do that? :)
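For illustration, here is a sketch of that pathological publication pattern; the names are hypothetical and the release/acquire fix is deliberately omitted:
#include <atomic>

std::atomic<std::atomic<int>*> slot{nullptr};      // hypothetical publication channel

void publisher()
{
    auto* a = new std::atomic<int>(42);            // construct A...
    slot.store(a, std::memory_order_relaxed);      // ...and publish it relaxed:
                                                   // no happens-before edge is created
}

void consumer()
{
    std::atomic<int>* p = slot.load(std::memory_order_relaxed);
    if (p) (void)p->load();                        // may observe A before its
                                                   // construction has completed
}
Publishing the pointer with memory_order_release and loading it with memory_order_acquire instead would make this well-defined.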
Speaking of construction, there are three ways to initialize your atomic types:
// Method 1: constructor
std::atomic<int> my_int(5);
// Method 2: atomic_init
std::atomic<int> my_int; // must be default constructed
std::atomic_init(&my_int, 5); // only allowed once
// Method 3: ATOMIC_VAR_INIT
// may be implemented using locks even if std::atomic<int> is lock-free
std::atomic<int> my_int = ATOMIC_VAR_INIT(5);
With either of the latter two methods, the same data-race possibility applies.
std::atomic<T> operations may be lock-free or not, depending on the platform
Correct. For all integral types, there are macros you can check that tell you whether a given atomic specialization is never, sometimes, or always lock-free; the value of the macro is 0, 1, or 2 respectively. The full list of macros is taken from §29.4 of the standard, where unspecified is their stand-in for "0, 1, or 2":
#define ATOMIC_BOOL_LOCK_FREE unspecified
#define ATOMIC_CHAR_LOCK_FREE unspecified
#define ATOMIC_CHAR16_T_LOCK_FREE unspecified
#define ATOMIC_CHAR32_T_LOCK_FREE unspecified
#define ATOMIC_WCHAR_T_LOCK_FREE unspecified
#define ATOMIC_SHORT_LOCK_FREE unspecified
#define ATOMIC_INT_LOCK_FREE unspecified
#define ATOMIC_LONG_LOCK_FREE unspecified
#define ATOMIC_LLONG_LOCK_FREE unspecified
#define ATOMIC_POINTER_LOCK_FREE unspecified
Note that these defines apply to both the unsigned and signed variants of the corresponding types.
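If you only need a compile-time decision, you can branch on the macro value directly; a minimal sketch:
#include <atomic>

#if ATOMIC_INT_LOCK_FREE == 2
// always lock-free: std::atomic<int> never takes a lock on this platform
#elif ATOMIC_INT_LOCK_FREE == 1
// sometimes lock-free: a runtime check is needed (see below)
#else
// never lock-free: every operation may take a lock
#endif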
In the case that the #define is 1, you have to check at runtime. This is accomplished as follows:
std::atomic<int> my_int;

if (my_int.is_lock_free()) {
    // do lock-free stuff
}

if (std::atomic_is_lock_free(&my_int)) {
    // also do lock-free stuff
}
std::atomic_bool and std::atomic<bool> (and other types like these) are actually the same thing
Yes, these are just typedefs for your convenience. The full list is found in Table 194 of the standard:
Named type | Integral argument type
----------------+-----------------------
atomic_char | char
atomic_schar | signed char
atomic_uchar | unsigned char
atomic_short | short
atomic_ushort | unsigned short
atomic_int | int
atomic_uint | unsigned int
atomic_long | long
atomic_ulong | unsigned long
atomic_llong | long long
atomic_ullong | unsigned long long
atomic_char16_t | char16_t
atomic_char32_t | char32_t
atomic_wchar_t | wchar_t
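You can verify the equivalence with a one-line compile-time check; a minimal sketch:
#include <atomic>
#include <type_traits>

static_assert(std::is_same<std::atomic_int, std::atomic<int>>::value,
              "std::atomic_int is just a typedef for std::atomic<int>");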
std::atomic_flag is the only class for which the standard guarantees lock-free operations on every platform
Correct, as guaranteed by §29.7/2 in the standard.
Note that there's no guarantee on the initialization state of atomic_flag unless you initialize it with the macro as follows:
std::atomic_flag guard = ATOMIC_FLAG_INIT; // guaranteed to be initialized cleared
A similar macro, ATOMIC_VAR_INIT, exists for the other atomic types (Method 3 above).
The standard does not specify whether atomic_flag can experience the same construction data race that the other atomic types can.
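Since atomic_flag offers only test_and_set and clear, its canonical use is a tiny spinlock. A minimal sketch, reusing the guard flag from above (the free functions are just for illustration):
#include <atomic>

std::atomic_flag guard = ATOMIC_FLAG_INIT;  // guaranteed cleared

void lock()
{
    while (guard.test_and_set(std::memory_order_acquire))
        ;                                   // spin until the holder clears the flag
}

void unlock()
{
    guard.clear(std::memory_order_release);
}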
Also, where can I find useful info about std::memory_order and how to use it properly?
As suggested by @WhozCraig, cppreference.com has the best reference.
And as @erenon suggests, Boost.Atomic has a great essay on how to use memory fences for lock-free programming.
Related
Suppose that I have the following type:
struct T {
    int32_t high;
    int32_t low;
};
Is it defined behavior to perform atomic accesses (using, e.g., atomic_load or atomic_fetch_add) on all of x, &x->high, and &x->low (assuming struct T *x)?
My understanding is that the C/C++ memory models are defined using histories over individual locations (to accommodate weak memory architectures). If accesses can cross locations, does this mean synchronization across locations? If that is the case, then I assume that would imply that histories are essentially per-byte, and accessing an int is just like synchronizing across the underlying 4 (or 8) bytes.
edit: revised the example to avoid the union since the main part of the question is about the concurrency model.
edit: revised to use the standard atomics from stdatomic.h
For C11/C18 (I cannot speak for C++), the Standard atomic_xxx() functions of <stdatomic.h> are only defined to take _Atomic qualified arguments. So to do atomic_xxx() operations on the fields of your struct T you would need:
struct T {
    _Atomic int32_t high;
    _Atomic int32_t low;
};

struct T foo, bar;
and then you would be able to do (for example) atomic_fetch_add(&foo.high, 42). But bar = atomic_load(&foo) would be undefined.
Conversely, you could have:
struct T {
    int32_t high;
    int32_t low;
};

_Atomic struct T foo;
struct T bar;
and now bar = atomic_load(&foo) is defined. But access to any individual field of foo is undefined, whether or not that field is _Atomic qualified.
Going by the Standard, _Atomic xxxx objects should be thought of as entirely distinct from "ordinary" xxxx objects -- they may have different sizes, representations and alignments. Casting an xxxx to/from an _Atomic xxxx is, therefore, no more sensible than casting one struct to/from another, different struct.
But, for gcc and the __atomic_xxx() built-ins, you can do whatever the processor will support. Indeed, for gcc the (otherwise) standard atomic_xxx() functions will accept arguments which are not _Atomic qualified, and map them to the built-ins. clang, on the other hand, treats passing a non-_Atomic qualified type to the standard functions as an error. IMHO this is a bug in gcc's <stdatomic.h>.
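For illustration, a minimal sketch of those built-ins, which gcc and clang accept on plain (non-_Atomic) objects; the function names are hypothetical:
long counter = 0;                                        // plain, not _Atomic

void bump(void)
{
    __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);   // atomic RMW on a plain long
}

long snapshot(void)
{
    return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);  // atomic load of a plain long
}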
Loading an inactive union member yields an indeterminate value regardless of using or not using __atomic_load to do that load.
This question here indicates that std::atomic<T> is generally supposed to have the same size as T, and indeed that seems to be the case for gcc, clang, and msvc on x86, x64, and ARM.
In an implementation where std::atomic<T> is always lock-free for some type T, is its memory layout guaranteed to be the same as the memory layout of T? Are there any additional special requirements imposed by std::atomic, such as alignment?
Upon reviewing [atomics.types.generic], which the answer you linked quotes in part, the only remark regarding alignment is the note which you saw before:
Note: The representation of an atomic specialization need not have the same size as its corresponding argument type. Specializations should have the same size whenever possible, as this reduces the effort required to port existing code. — end note
In a newer version:
The representation of an atomic specialization need not have the same size and alignment requirement as its corresponding argument type.
Moreover, at least one architecture, IA64, imposes alignment requirements on atomic instructions such as cmpxchg.acq, which indicates that a compiler targeting IA64 may well need to align atomic types differently from non-atomic types, even in the absence of a lock.
Furthermore, the use of a compiler feature such as packed structs will cause alignment to differ between atomic and non-atomic variants. Consider the following example:
#include <atomic>
#include <iostream>
struct __attribute__ ((packed)) atom {
    char a;
    std::atomic_long b;
};

struct __attribute__ ((packed)) nonatom {
    char a;
    long b;
};
atom atom1;
nonatom nonatom1;
void disp_aligns()
{
    std::cout << alignof(atom1.b) << std::endl;     // alignof on an expression is a GNU extension
    std::cout << alignof(nonatom1.b) << std::endl;
}
On at least one configuration, the alignment of atom1.b will be on an 8-byte boundary, while the alignment of nonatom1.b will be on a 1-byte boundary. However, this is under the supposition that we requested that the structs be packed; it's not clear whether you are interested in this case.
From the standard:
The representation of an atomic specialization need not have the same size and alignment requirement as its corresponding argument type.
So the answer, at least for now, is no: it is not guaranteed to be the same size, nor to have the same alignment. But it might have, unless it doesn't, and then it won't.
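So if your code depends on matching layout, it is worth pinning the assumption down per platform; a minimal check:
#include <atomic>

static_assert(sizeof(std::atomic<long>) == sizeof(long),
              "std::atomic<long> and long differ in size on this platform");
static_assert(alignof(std::atomic<long>) == alignof(long),
              "std::atomic<long> and long differ in alignment on this platform");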
Is the following code guaranteed to return the expected value of counter (40,000,000), according to the C++11 memory model? (NOT limited to x86).
#include <atomic>
#include <cstdio>
#include <thread>
using namespace std;
void ThreadProc(atomic<int>& counter)
{
    for (int i = 0; i < 10000000; i++)
        counter.fetch_add(1, memory_order_relaxed);
}

int main()
{
#define COUNT 4
    atomic<int> counter = { 0 };
    thread threads[COUNT] = {};
    for (size_t i = 0; i < COUNT; i++)
        threads[i] = thread(ThreadProc, ref(counter));
    for (size_t i = 0; i < COUNT; i++)
        threads[i].join();
    printf("Counter: %i", counter.load(memory_order_relaxed));
    return 0;
}
In particular, will relaxed atomics coordinate such that two threads will not simultaneously read the current value, independently increment it, and both write their incremented value, effectively losing one of the writes?
Some lines from the spec seem to indicate that counter must consistently be 40,000,000 in the above example.
[Note: operations specifying memory_order_relaxed are relaxed with respect to memory ordering. Implementations must still guarantee that any given atomic access to a particular atomic object be indivisible with respect to all other atomic accesses to that object. — end note]
Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M and A happens before (as defined below) B, then A shall precede B in the modification order of M, which is defined below.
This talk also supports the notion that the above code is race free.
https://www.youtube.com/watch?v=KeLBd2EJLOU&feature=youtu.be&t=1h9m30s
It appears to me that there is an indivisible ordering of the atomic operations, but we have no guarantees what the order is. So all increments must take place 'one before the other' without the race I described above.
But then a few things potentially point in the other direction:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
I've been informed by a coworker that there are known mistakes in Sutter's talk. Though I've yet to find any sources for this.
Multiple members of the C++ community smarter than I have implied that a relaxed atomic add could be buffered such that a subsequent relaxed atomic add could read and operate on a stale value.
The code in your question is race free; all increments are ordered and the outcome of 40000000 is guaranteed.
The references in your question contain all the relevant quotes from the standard.
The part where it says that atomic stores should be visible within a reasonable time applies only to single stores.
In your case, the counter is incremented with an atomic read-modify-write operation, and those are guaranteed to operate on the latest value in the modification order.
Multiple members of the C++ community (...) have implied that a relaxed atomic add could be buffered such that a subsequent relaxed atomic add could read and operate on a stale value.
This is not possible, as long as the modifications are based on atomic read-modify-write operations.
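To see the distinction, compare a broken load-then-store increment with the read-modify-write form; a minimal sketch with hypothetical function names:
#include <atomic>

std::atomic<int> counter{0};

void lost_update()    // broken: two separate atomic operations
{
    int v = counter.load(std::memory_order_relaxed);   // another thread can
    counter.store(v + 1, std::memory_order_relaxed);   // increment in between
}

void safe_update()    // correct: one indivisible read-modify-write
{
    counter.fetch_add(1, std::memory_order_relaxed);   // reads the last value in
                                                       // the modification order
}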
Atomic increments would be useless if a reliable outcome were not guaranteed by the standard.
As is known, std::atomic and volatile are different things.
There are two main differences:
Two optimizations that are allowed for std::atomic<int> a; but not for volatile int a;:
fused operations: a = 1; a = 2; can be replaced by the compiler with a = 2;
constant propagation: a = 1; local = a; can be replaced by the compiler with a = 1; local = 1;
Reordering of ordinary reads/writes across atomic/volatile operations:
for volatile int a;, volatile read/write operations can't be reordered with respect to one another, but nearby ordinary reads/writes can still be reordered around them.
for std::atomic<int> a;, reordering of nearby ordinary reads/writes is restricted according to the memory order used in the atomic operation, e.g. a.load(std::memory_order_...);
I.e. volatile doesn't introduce memory fences, but std::atomic can.
As is well described in the article:
Herb Sutter, January 08, 2009 - part 1: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484
Herb Sutter, January 08, 2009 - part 2: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484?pgno=2
For example, std::atomic should be used for concurrent multi-threaded programs (CPU-Core <-> CPU-Core), while volatile should be used for access to memory-mapped regions on devices (CPU-Core <-> Device).
But sometimes both sets of guarantees are required: a variable with unusual semantics that also has any or all of the atomicity and/or ordering guarantees needed for lock-free coding, i.e. volatile std::atomic<>. This can be required for several reasons:
ordering: to prevent reordering of ordinary reads/writes, for example, for reads from CPU-RAM to which the data has been written by the device's DMA controller
For example:
char cpu_ram_data_written_by_device[1024];
device_dma_will_write_here( cpu_ram_data_written_by_device );

// physically mapped to a device register
volatile bool *device_ready = get_pointer_device_ready_flag();

//... somewhere much later
while(!*device_ready); // spin-wait (a memory fence is required here!)
for(auto &i : cpu_ram_data_written_by_device) std::cout << i;
spilling: the CPU writes to CPU-RAM and then the device's DMA controller reads from this memory: https://en.wikipedia.org/wiki/Register_allocation#Spilling
example:
char cpu_ram_data_will_read_by_device[1024];
device_dma_will_read_it( cpu_ram_data_will_read_by_device );

// physically mapped to a device register
volatile bool *data_ready = get_pointer_data_ready_flag();

//... somewhere much later
for(auto &i : cpu_ram_data_will_read_by_device) i = 10;
*data_ready = true; // the buffer must be spilled to RAM before this write; a memory fence is required here
atomic: to guarantee that the volatile operation is atomic, i.e. that it consists of a single operation instead of several: one 8-byte operation instead of two 4-byte operations (see the sketch below)
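For illustration, a hedged sketch of what the combination looks like; the register address and the helper are made up:
#include <atomic>
#include <cstdint>

// hypothetical: a 64-bit device register mapped at this made-up address
auto *reg = reinterpret_cast<volatile std::atomic<std::uint64_t>*>(0x40000000);

void poke()
{
    // volatile: the store cannot be fused away or elided by the compiler;
    // atomic: the 8-byte write is one indivisible operation, not two 4-byte ones
    reg->store(0xDEADBEEFull, std::memory_order_release);
}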
On this point, Herb Sutter wrote about volatile atomic<T> on January 08, 2009: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484?pgno=2
Finally, to express a variable that both has unusual semantics and has any or all of the atomicity and/or ordering guarantees needed for lock-free coding, only the ISO C++0x draft Standard provides a direct way to spell it: volatile atomic.
But do the modern standards C++11 (not the C++0x draft), C++14, and C++17 guarantee that volatile atomic<T> has both semantics (volatile + atomic)?
Does volatile atomic<T> combine the strongest guarantees of both volatile and atomic?
As in volatile: avoids fused operations and constant propagation, as described at the beginning of the question
As in std::atomic: introduces memory fences to provide ordering, spilling, and atomicity
And can we do reinterpret_cast from volatile int *ptr; to volatile std::atomic<int>*?
Yes, it does.
Section 29.6.5, "Requirements for operations on atomic types"
Many operations are volatile-qualified. The “volatile as device register” semantics have not changed in the standard. This qualification means that volatility is preserved when applying these operations to volatile objects.
I checked the working drafts from 2008 through 2016, and the same text is in all of them. Therefore it should apply to C++11, C++14, and C++17.
And can we do reinterpret_cast from volatile int *ptr; to volatile std::atomic<int>*?
You can do such casts if and only if the ABI says that both types (here int and std::atomic<int>) have the same representation and restrictions: the same size, alignment, and possible bit patterns, with the same meaning for the same bit patterns.
Everything about volatile is directly connected with the ABI: volatile-qualified variables must have the canonical ABI representation at sequence points, and operations on volatile objects only assume they follow their ABI requirements and nothing else. So wherever volatile is used in C or C++, you can rely alternatively on the language standard or the platform ABI.
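If you do make that cast, it is prudent to document the ABI assumptions with compile-time checks; a sketch with a hypothetical as_atomic helper (the asserts record, rather than remove, the ABI dependence):
#include <atomic>

static_assert(sizeof(std::atomic<int>) == sizeof(int),
              "representation assumption does not hold on this ABI");
static_assert(alignof(std::atomic<int>) == alignof(int),
              "alignment assumption does not hold on this ABI");

volatile std::atomic<int>* as_atomic(volatile int *p)
{
    // only meaningful where the ABI guarantees identical representation
    return reinterpret_cast<volatile std::atomic<int>*>(p);
}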
(I hope this answer is not deleted just because some people despise volatile semantics and dislike depending on the ABI and other platform-specific notions.)
I'm trying to find where the comparison semantics for the type T used with std::atomic are defined.
I know that besides the built-in specializations for integral types, T can be any TriviallyCopyable type. But how do operations like compare_exchange_X know how to compare an instance of T?
I imagine they must simply do a byte by byte comparison of the user defined object (like a memcmp) but I don't see where in the standard this is explicitly mentioned.
So, suppose I have:
struct foo
{
    std::uint64_t x;
    std::uint64_t y;
};
How does the compiler know how to compare two std::atomic<foo> instances when I call std::atomic<foo>::compare_exchange_weak()?
In draft n3936, memcmp semantics are explicitly described in section 29.6.5.
Note: For example, the effect of atomic_compare_exchange_strong is
if (memcmp(object, expected, sizeof(*object)) == 0)
    memcpy(object, &desired, sizeof(*object));
else
    memcpy(expected, object, sizeof(*object));
and
Note: The memcpy and memcmp semantics of the compare-and-exchange operations may result in failed comparisons for values that compare equal with operator== if the underlying type has padding bits, trap bits, or alternate representations of the same value.
That wording has been present at least since n3485.
Note that only memcmp(p1, p2, sizeof(T)) != 0 is meaningful to compare_exchange_weak: failure is then guaranteed. memcmp(p1, p2, sizeof(T)) == 0 allows, but does not guarantee, success.
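A practical consequence: for a type that may contain padding, zero the whole object before filling in its fields, so the byte-wise comparison sees deterministic bytes. A sketch with a hypothetical try_update helper (this particular foo happens to have no padding, but the pattern generalizes):
#include <atomic>
#include <cstdint>
#include <cstring>

struct foo
{
    std::uint64_t x;
    std::uint64_t y;
};

bool try_update(std::atomic<foo>& a, const foo& desired)
{
    foo expected;
    std::memset(&expected, 0, sizeof expected);  // clear any padding bytes
    expected.x = 1;
    expected.y = 2;
    return a.compare_exchange_strong(expected, desired);  // memcmp/memcpy semantics
}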
It's implementation-defined. It could just be using a mutex lock, or it could be using intrinsics on memory blobs. The standard simply defines it such that the latter can work as an implementation strategy.
The compiler doesn't know anything here; it's all in the library. Since it's a template, you can go read how your implementation does it.