I've actually heard claims both ways. I suspect they are not, but I wanted to get the topic settled.
C++03 does not know about the existence of threads, so the concept of atomicity doesn't make much sense for C++03: the standard simply says nothing about it.
C++11 does know about threads, but once again says nothing about the atomicity of assigning pointers. However, C++11 does contain std::atomic<T*>, which is guaranteed to be atomic.
Note that even if writing to a raw pointer is atomic on your platform, the compiler is still free to move that assignment around, so that doesn't really buy you anything.
If you need to write to a pointer which is shared between threads, either use std::atomic<T*> (or the not-yet-official boost::atomic<T*>, GCC's atomic intrinsics, or Windows' Interlocked* functions) or wrap all accesses to that pointer in mutexes.
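As a minimal sketch (the names Node, shared_node, producer, and consumer are illustrative, not from the question), publishing a pointer through std::atomic<T*> might look like this:

#include <atomic>

struct Node { int value; };

std::atomic<Node*> shared_node{nullptr};   // pointer shared between threads

void producer() {
    Node* n = new Node{42};                           // prepare the object first
    shared_node.store(n, std::memory_order_release);  // then publish it atomically
}

void consumer() {
    Node* n;
    while ((n = shared_node.load(std::memory_order_acquire)) == nullptr) {
        // spin until the producer has published the node
    }
    int v = n->value;  // safe: the acquire load synchronizes with the release store
    (void)v;
}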
The C++ standard does not define specific threading behavior. Depending on the compiler and the platform, the pointer assignment may or may not be atomic.
Related
I wrote some multithreaded but lock-free code that compiled and apparently executed fine on an earlier C++11-supporting GCC (7 or older). The atomic fields were ints and so on. To the best of my recollection, I used normal C/C++ operations to operate on them (a=1;, etc.) in places where atomicity or event ordering wasn't a concern.
Later I had to do some double-width CAS operations, so I made a little struct with a pointer and a counter, as is common. When I tried the same normal C/C++ operations on it, I got errors saying the variable had no such members. (Which is what you'd expect from most normal templates, but I half-expected atomic to work differently, in part because normal assignments to and from atomic ints were, to the best of my recollection, supported.)
So, a multi-part question:
1. Should we use the atomic methods in all cases, even (say) initialization done by one thread with no race conditions? 1a) Once declared atomic, is there no way to access it non-atomically? 1b) Do we have to use the more verbose atomic<> methods to do so?
2. Or can we, for integer types at least, use normal C/C++ operations? And in that case, will those operations be the same as load()/store(), or are they merely normal assignments?
3. And a semi-meta question: is there any insight as to why normal C/C++ operations aren't supported on atomic<> variables? I'm not sure if the C++11 language as spec'd has the power to write code that does that, but the spec can certainly require the compiler to do things the language as spec'd isn't powerful enough to do.
You may be looking for C++20's std::atomic_ref<T>, which gives you the ability to do atomic ops on objects that can also be accessed non-atomically. Make sure your non-atomic T object is declared with sufficient alignment for atomic<T>, e.g.
alignas(std::atomic_ref<long long>::required_alignment)
long long sometimes_shared_var;
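A minimal C++20 sketch of that pattern (the function names are illustrative):

#include <atomic>

alignas(std::atomic_ref<long long>::required_alignment)
long long sometimes_shared_var = 0;

void single_threaded_phase() {
    sometimes_shared_var = 1;   // plain non-atomic access, no atomic overhead
}

void shared_phase() {
    std::atomic_ref<long long> ref(sometimes_shared_var);
    ref.fetch_add(1, std::memory_order_relaxed);  // atomic access to the same object
}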
But that requires C++20, and nothing equivalent is available in C++17 or earlier. Once an atomic object is constructed, I don't think there's any guaranteed portable safe way to modify it other than its atomic member functions.
Its internal object representation isn't specified by the standard, so even if no other thread has a reference to it, using memcpy to get the struct sixteenbyte object out of an atomic<sixteenbyte> efficiently isn't guaranteed to be safe. You'd have to know how a specific implementation stores it. Checking sizeof(atomic<T>) == sizeof(T) is a good sign, though, and mainstream implementations do in practice just use a T as the object representation for atomic<T>.
Related: How can I implement ABA counter with c++11 CAS? for a nasty union hack ("safe" in GNU C++) to give efficient access to a single member, because compilers don't optimize foo.load().ptr to just atomically load that member. Instead GCC and clang will lock cmpxchg16b to load the whole pointer+counter pair, then extract just the first member. C++20 atomic_ref<> should solve that.
Accessing members of atomic<struct foo>: one reason for not allowing shared.x = tmp; is that it's the wrong mental model. If two different threads are storing to different members of the same struct, how does the language define any ordering for what other threads see? Plus it was probably considered too easy for programmers to design their lockless algorithms incorrectly if stuff like that were allowed.
Also, how would you even implement that? Return an lvalue-reference? It can't be to the underlying non-atomic object. And what if the code captures that reference and keeps using it long after calling some function that's not load or store?
Remember that ISO C++'s ordering model works in terms of synchronizes-with, not in terms of local reordering and a single cache-coherent domain like the way real ISAs define their memory models. The ISO C++ model is always strictly in terms of reading, writing, or RMWing the entire atomic object. So a load of the object can always sync-with any store of the whole object.
In hardware that would actually still work for a store to one member and a load from a different member if the whole object is in one cache line, on real-world ISAs. At least I think so, although possibly not on some SMT systems. (Being in one cache line is necessary for lock-free atomic access to the whole object to be possible on most ISAs.)
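To illustrate the whole-object model (the struct and function names here are illustrative):

#include <atomic>

struct PtrAndCounter { void* ptr; unsigned counter; };

std::atomic<PtrAndCounter> shared_obj{PtrAndCounter{nullptr, 0}};

void writer(void* p) {
    // The model only describes stores of the entire object:
    shared_obj.store(PtrAndCounter{p, 1}, std::memory_order_release);
}

PtrAndCounter reader() {
    // A load of the whole object can synchronize-with the store above;
    // there is no standard way to atomically load just .ptr.
    return shared_obj.load(std::memory_order_acquire);
}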
do we have to use the more verbose atomic<> methods to do so?
The member functions of atomic<T> include overloads of all the operators, including operator= (store) and cast back to T (load). a = 1; is equivalent to a.store(1, std::memory_order_seq_cst) for atomic<int> a; and is the slowest way to set a new value.
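Concretely (a small illustrative comparison):

#include <atomic>

std::atomic<int> a{0};

void examples() {
    a = 1;                                  // operator=: same as the next line
    a.store(1, std::memory_order_seq_cst);  // explicit equivalent, strongest ordering
    a.store(1, std::memory_order_relaxed);  // cheaper when no ordering is needed
    int v = a;                              // implicit conversion: a.load(seq_cst)
    (void)v;
}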
Should we use the atomic methods in all cases, even (say) initialization done by one thread with no race conditions?
You don't have any choice, other than passing args to the constructors of std::atomic<T> objects.
You can use mo_relaxed loads / stores while your object is still thread-private, though. Avoid any RMW operators like +=. e.g. a.store(a.load(relaxed) + 1, relaxed); will compile about the same as for non-atomic objects of register-width or smaller.
(Except that it can't optimize away and keep the value in a register, so use local temporaries instead of actually updating the atomic object).
But for atomic objects too large to be lock-free, there's not really anything you can do efficiently except construct them with the right values in the first place.
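A sketch of those two options, assuming the object is still thread-private during initialization:

#include <atomic>

std::atomic<int> counter{42};   // best: pass the initial value to the constructor

void still_thread_private() {
    // Compiles about like a plain increment while no other thread can
    // access the object; avoids an actual atomic RMW instruction.
    counter.store(counter.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);
}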
The atomic fields were ints and so on. ...
and apparently executed fine
If you mean plain int, not atomic<int> then it wasn't portably safe.
Data-race UB doesn't guarantee visible breakage, the nasty thing with undefined behaviour is that happening to work in your test case is one of the things that's allowed to happen.
And in many cases with a pure load or pure store, it won't break, especially on strongly ordered x86, unless the load or store can hoist or sink out of a loop. See Why is integer assignment on a naturally aligned variable atomic on x86? It'll eventually bite you when a compiler manages to do cross-file inlining and reorders some operations at compile time, though.
why normal C/C++ operations aren't supported on atomic<> variables?
... but the spec can certainly require the compiler to do things the language as spec'd isn't powerful enough to do.
This in fact was a limitation of C++11 through C++17. Most compilers have no problem with it: for example, gcc/clang's implementation of the <atomic> header uses the __atomic_ builtins, which take a plain T* pointer.
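For instance (a GCC/clang-specific sketch, not portable ISO C++):

long long x;   // a plain object, suitably aligned

void gnu_builtins(long long v) {
    __atomic_store_n(&x, v, __ATOMIC_RELEASE);               // atomic store to a plain T
    long long seen = __atomic_load_n(&x, __ATOMIC_ACQUIRE);  // atomic load from a plain T
    (void)seen;
}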
The C++20 proposal for atomic_ref is p0019, which cites as motivation:
An object could be heavily used non-atomically in well-defined phases of an application. Forcing such objects to be exclusively atomic would incur an unnecessary performance penalty.
3.2. Atomic Operations on Members of a Very Large Array
High-performance computing (HPC) applications use very large arrays. Computations with these arrays typically have distinct phases that allocate and initialize members of the array, update members of the array, and read members of the array. Parallel algorithms for initialization (e.g., zero fill) have non-conflicting access when assigning member values. Parallel algorithms for updates have conflicting access to members which must be guarded by atomic operations. Parallel algorithms with read-only access require best-performing streaming read access, random read access, vectorization, or other guaranteed non-conflicting HPC pattern.
All of these things are a problem with std::atomic<>, confirming your suspicion that this is a problem for C++11.
Instead of introducing a way to do non-atomic access to std::atomic<T>, they introduced a way to do atomic access to a T object. One problem with this is that atomic<T> might need more alignment than a T would get by default, so be careful.
Unlike with giving atomic access to members of T, you could plausibly have a .non_atomic() member function that returned an lvalue reference to the underlying object.
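A purely hypothetical sketch of that idea (atomic_with_escape_hatch and non_atomic are made-up names, and the cast relies on the implementation representing atomic<T> as a plain T; no standard sanctions this):

#include <atomic>

template <class T>
struct atomic_with_escape_hatch : std::atomic<T> {
    using std::atomic<T>::atomic;
    // Hypothetical escape hatch: an lvalue reference to the underlying object.
    T& non_atomic() {
        // Only plausible because mainstream implementations store a plain T.
        return reinterpret_cast<T&>(*this);
    }
};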
I use std::atomic for atomicity. Still, in some places in the code atomicity is not needed by the program logic. In those cases I'm wondering whether it is OK, both pedantically and practically, to use the constructor in place of store() as an optimization. For example:
// p.store(nullptr, std::memory_order_relaxed);
new (&p) std::atomic<node*>(nullptr);  // p is a std::atomic<node*>
In accord with the standard, whether this works depends entirely on the implementation of std::atomic<T>. If it is lock-free for that T, then the implementation probably just stores a T. If it isn't lock-free, things get more complex, since it may store a mutex or some other thing.
The thing is, you don't know what std::atomic<T> stores. This matters because if it stores a const-qualified object or a reference type, then reusing the storage here will cause problems. The pointer returned by placement-new can certainly be used, but if a const or reference type is used, the original object name p cannot.
Why would std::atomic<T> store a const or reference type? Who knows; my point is that, because its implementation is not under your control, then pedantically you cannot know how any particular implementation behaves.
As for "practically", it's unlikely that this will cause a problem. Especially if the atomic<T> is always lock-free.
That being said, "practically" should also include some notion of how other users will interpret this code. While people experienced with doing things like reusing storage will be able to understand what the code is doing, they will likely be puzzled by why you're doing it. That means you'll need to either stick a comment on that line or make a (template) function non_atomic_reset.
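A hedged sketch of such a non_atomic_reset helper (the answer only suggests the name; this particular shape is my assumption):

#include <atomic>
#include <new>
#include <type_traits>

// Reuse the storage of an atomic via placement-new. Only defensible when
// no other thread can possibly access 'a' at this point.
template <class T>
void non_atomic_reset(std::atomic<T>& a, T value) {
    static_assert(std::is_trivially_destructible<std::atomic<T>>::value,
                  "skipping the old object's destructor must be harmless");
    ::new (static_cast<void*>(&a)) std::atomic<T>(value);
}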
Also, it should be noted that std::shared_ptr uses atomic increments/decrements for its reference counter. I bring that up because there is no std::single_threaded_shared_ptr that doesn't use atomics, or a special constructor that doesn't use atomics. So even in cases where you're using shared_ptr in pure single-threaded code, those atomics are still firing. This was considered a reasonable tradeoff by the C++ standards committee.
Atomics aren't cheap, but they're not that expensive (most of the time) that using unusual mechanisms like this to bypass an atomic store is a good idea. As always, profile to see if the code obfuscation is worth it.
Consider this function:
void f(void* loc)
{
    auto p = new (loc) volatile int{42};
    *p = 0;
}
I have checked the code generated by clang, gcc and CL; none of them elide the initialization. (The write of 42 might be observed by the hardware.)
Is this an extension provided by compilers to the standard? Does the standard allow compilers not to perform the write of 42?
Actually, for objects of class type, it is specified that the constructor of an object is executed without consideration for the volatile qualifier [class.ctor]:
A constructor can be invoked for a const, volatile or const volatile object. const and volatile semantics (10.1.7.1) are not applied on an object under construction. They come into effect when the constructor for the most derived object (4.5) ends.
[intro.execution]/8 lists the minimum requirements for a conforming implementation; these are also known as “observable behavior”. The first requirement is that “Access to volatile objects are evaluated strictly according to the rules of the abstract machine.” The compiler is required to produce all observable behavior. In particular, it is not allowed to remove accesses to volatile objects. And note that “object” here is used in the compiler-writer’s sense: it includes built-in types.
This is not a coherent question because what it means for a compiler to perform a write is platform-specific. There is no platform-independent notion of performing a write other than perhaps seeing the effects of a write in a subsequent read.
As you see, typical compilers on x86 will emit a write instruction but no memory barrier. The CPU may reorder the write, coalesce it, or even avoid doing any write to main memory because of the way the platform's cache coherence works.
The reason they made this implementation choice is that it makes volatile work for a broad range of applications, including those where the standard requires it to work, and because it has acceptable performance consequences. The standard, being platform-neutral, doesn't dictate platform-specific decisions like this and compiler writers do not understand it to do that.
They could have forced every volatile access to be uncoalescable, un-reorderable, and pushed through the cache subsystem to main memory. But that would provide terrible performance and, on this platform, no significant benefits. So they don't do it, and they don't understand the C++ standard to suggest that there's some mythical observer on the memory bus who must see specific things. The very existence of a memory bus is platform-specific. The standard is not platform-specific.
You will sometimes see people argue, for example, that the standard somehow requires the compiler to issue instructions to do volatile writes in order but that it doesn't matter if the CPU coalesces or re-orders the writes. This is, frankly, silly. The C++ standard doesn't impose requirements on the instructions compilers generate but rather on what those instructions must actually do when executed. It doesn't distinguish between optimizations done by a CPU and optimizations done by a compiler and any such distinctions would be platform-specific anyway.
If the standard allows a CPU to re-order two writes, then it allows the compiler to re-order them. It does not, and cannot, make that kind of distinction. Of course, compiler writers may still decide to issue the writes in order even though the CPU can re-order them, because that may make the most sense on their platform.
class CSample {
    int a;
    // ... lots of fields
};

CSample c;
As we know, CSample has a default copy constructor. When I do this:
CSample d = c;
the default copy constructor runs. My question is: is it thread-safe? Someone may modify c in another thread while the copy is being made. If so, how does the compiler do it? And if not, I think it's horrible that the compiler can't guarantee that the copy constructor is thread-safe.
Nothing in C++ is thread-safe¹ unless explicitly noted.
If you need to read object c while it may be modified in another thread, you are responsible for locking it. That is a general rule, and there is no reason why reading it for the purpose of creating a copy should be an exception.
Note that the copy being created does not need to be locked, because no other thread knows about it yet. Only the source needs to be.
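A minimal sketch of that rule (the mutex and its placement are illustrative):

#include <mutex>

struct CSample { int a; /* ... lots of fields ... */ };

CSample c;
std::mutex c_mutex;   // guards every access to c

CSample copy_of_c() {
    std::lock_guard<std::mutex> lock(c_mutex);  // lock the source during the copy;
    return c;                                   // the new copy itself needs no lock
}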
The compiler does not guarantee anything to be thread-safe on its own, because 99.9% of things don't need to be thread-safe. Most things only need to be reentrant³. So in the rare case you actually need to make something thread-safe, you have to use locks (std::mutex) or atomic types (std::atomic<int>).
You can also simply make your objects constant; then you can read them without locking, because nothing writes them after creation. Code using constant objects is both more easily parallelised and more easily understood in general, because there are fewer things with state you have to track.
Note that on the most common architecture the mov instruction with naturally aligned int operands happens to be atomic. On other CPU types even that might not be true. And because the compiler is allowed to preload values, integer assignment in C++ is not thread-safe anyway.
¹A set of operations is considered thread-safe if calling them concurrently on the same object is well defined². In C++, calling any modifying operation and any other operation concurrently on the same object is a data race, which is UndefinedBehaviour™.
²It is important to note that even if an object is "thread-safe", it usually doesn't help you much anyway. An object may guarantee that a concurrent reader always sees either the new or the old value (C++ itself allows that when an int c is being changed from 0 to 1000 by one thread, another thread may read, say, 232), but most of the time that isn't enough, because you need to read multiple values in a consistent state, for which you have to lock over them yourself anyway.
³Reentrant means that the same operation may be called on different objects at the same time. There are a few functions in the standard C library that are not reentrant, because they use global (static) buffers or other state. Most have reentrant variants (usually with an _r suffix) and the standard C++ library uses these, so the C++ part is generally reentrant.
The general rule in the standard is simple: if an object (and sub-objects are objects) is accessed by more than one thread, and is modified by any thread, then all accesses must be synchronized. There are numerous reasons for this, but the most basic one is that protecting at the lowest level is usually the wrong level of granularity; adding synchronization primitives there would only make the code run significantly slower, without any real advantage for the user, even in a multithreaded environment. Even if the copy constructor were "thread-safe", unless the object is somehow totally independent of all other context, you'd probably still need some sort of synchronization primitives at a higher level.

And with regards to "thread-safety": the usual meaning among experienced practitioners is that the object/class/whatever specifies exactly how much protection it guarantees, precisely because low-level definitions such as the one you (and many, many others) seem to use are useless. Synchronizing each function in a class is generally useless. (Java made the experiment, and then backed off, because the guarantees made in the initial versions of its containers turned out to be expensive and worthless.)
Assuming that d or c are accessed concurrently on multiple threads, this is not thread-safe. This would amount to a data-race which is undefined behavior.
Csample d = c;
is just as unsafe as
int d = c;
is.
I've looked at the standard but couldn't find any indication that simply writing to memory would be considered observable behaviour. If not, that would mean the compiled code need not actually write to that memory. If a compiler chose to optimize away such accesses, anything involving mapped memory, or shared memory, might not work.
1.9-8 seems to define a very limited observable behaviour but indicates an implementation may define more. Can one assume that any quality compiler would treat modifying memory as observable behaviour? That is, it may not guarantee atomicity or ordering, but does it guarantee that data will eventually be written?
So, have I overlooked something in the standard, or is writing to memory merely something the compiler decides to do?
Statements from the current or C++0x standard are good. Please note I'm not talking about accessing memory through a function; I mean direct access, such as writing data through a pointer (perhaps retrieved via mmap or another library function).
This kind of thing is what volatile exists for. Else, writing to memory and never apparently reading from it is not observable behaviour. However, in the general case, it would be quite impossible for the optimizer to prove that you never read it back except in relatively trivial examples, so it's not usually an issue.
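For example (a sketch; the memory-mapped scenario is illustrative):

// Writes through a volatile lvalue are observable behaviour, so the
// compiler must perform them even though nothing here reads them back.
void poke_device(volatile int* mapped) {
    *mapped = 42;
}
// With a plain int* instead, a store that is provably never read back
// could legally be eliminated.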
Can one assume that any quality compiler would treat modifying memory as observable behaviour?
No; volatile is meant for marking that. However, you cannot fully trust the compiler even after adding the volatile qualifier, at least according to a 2008 paper: http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf
EDIT:
From the C standard (not C++): http://c0x.coding-guidelines.com/5.1.2.3.html
An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
My reading of C99 is that unless you specify volatile, how and when the variable is actually accessed is implementation-defined. If you specify the volatile qualifier, then the code must work according to the rules of an abstract machine.
Relevant parts in the standard are: 6.7.3 Type qualifiers (volatile description) and 5.1.2.3 Program execution (the abstract machine definition).
For some time now I've known that many compilers actually have heuristics to detect cases when a variable should be reread and when it is okay to use a cached copy. volatile makes it clear to the compiler that every access to the variable must actually access the memory. Without volatile, the compiler seems free to never reread the variable.
And BTW, wrapping the access in a function doesn't change that, since a function even without inline might still be inlined by the compiler within the current compilation unit.
From your question below:
Assume I use an array on the heap (unspecified where it is allocated), and I use that array to perform a calculation (temp space). The optimizer sees that it doesn't actually need any of that space as it can use strictly registers. Does the compiler nonetheless write the temp values to the memory?
Per MSalters below:
It's not guaranteed, and unlikely. Consider a Static Single Assignment optimizer. This figures out each possible write/read dependency, and then assigns registers to optimize these dependencies. As a side effect, any write that's not followed by a (possible) read creates no dependencies at all, and is eliminated. In your example ("use strictly registers") the optimizer has satisfied all write/read dependencies with registers, so it won't write to memory at all. All reads produce the correct values, so it's a correct optimization.
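A sketch of the kind of code where that happens:

// Both stores to tmp are followed only by reads the optimizer can satisfy
// from registers, so a typical SSA-based optimizer emits no memory
// traffic for tmp at all.
int sum_of_squares(int a, int b) {
    int tmp[2];   // "temp space" that may never reach memory
    tmp[0] = a * a;
    tmp[1] = b * b;
    return tmp[0] + tmp[1];
}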