In the C++11 standard, the machine model changed from a single-threaded machine to a multi-threaded machine.
Does this mean that the typical static int x; void func() { x = 0; while (x == 0) {} } example of an optimized-out read will no longer happen in C++11?
EDIT: for those who don't know this example (I'm seriously astonished), please read this: https://en.wikipedia.org/wiki/Volatile_variable
EDIT2:
OK, I was really expecting that everyone who knew what volatile is had seen this example.
If you use the code in the example, the read of the variable in the loop will be optimized out, making the loop endless.
The solution, of course, is to use volatile, which forces the compiler to read the variable on each access.
My question is whether this problem is obsolete in C++11: since the machine model is multi-threaded, the compiler should consider concurrent access to a variable to be possible.
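To spell the example out (a minimal sketch of the pattern; the writer is implied):

static int x;

void func()
{
    x = 0;
    while (x == 0) {} // without volatile, the compiler may hoist the read
                      // of x out of the loop, making the loop endless
}

// some other thread (or signal handler, or hardware) is expected to do:
// x = 1;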
Whether it is optimized out depends entirely on compilers and what they choose to optimize away. The C++98/03 memory model does not recognize the possibility that x could change between the setting of it and the retrieval of the value.
The C++11 memory model does recognize that x could be changed. However, it doesn't care. Non-atomic access to variables (ie: not using std::atomics or proper mutexes) yields undefined behavior. So it's perfectly fine for a C++11 compiler to assume that x never changes between the write and reads, since undefined behavior can mean, "the function never sees x change ever."
Now, let's look at what C++11 says about volatile int x;. If you put that in there, and you have some other thread mess with x, you still have undefined behavior. Volatile does not affect threading behavior. C++11's memory model does not define reads or writes from/to x to be atomic, nor does it require the memory barriers needed for non-atomic reads/writes to be properly ordered. volatile has nothing to do with it one way or the other.
Oh, your code might work. But C++11 doesn't guarantee it.
What volatile tells the compiler is that it can't optimize memory reads from that variable. However, CPU cores have different caches, and most memory writes do not immediately go out to main memory. They get stored in that core's local cache, and may be written... eventually.
CPUs have ways to force cache lines out into memory and to synchronize memory access among different cores. These memory barriers allow two threads to communicate effectively. Merely reading from memory in one core that was written in another core isn't enough; the core that wrote the memory needs to issue a barrier, and the core that's reading it needs to have had that barrier complete before reading it to actually get the data.
volatile guarantees none of this. Volatile works with "hardware, mapped memory and stuff" because the hardware that writes that memory makes sure that the cache issue is taken care of. If CPU cores issued a memory barrier after every write, you can basically kiss any hope of performance goodbye. So C++11 has specific language saying when constructs are required to issue a barrier.
volatile is about memory access (when to read); threading is about memory integrity (what is actually stored there).
The C++11 memory model is specific about what operations will cause writes in one thread to become visible in another. It's about memory integrity, which is not something volatile handles. And memory integrity generally requires both threads to do something.
For example, if thread A locks a mutex, does a write, and then unlocks it, the C++11 memory model only requires that write to become visible to thread B if thread B later locks it. Until it actually acquires that particular lock, it's undefined what value is there. This stuff is laid out in great detail in section 1.10 of the standard.
Let's look at the code you cite, with respect to the standard. Section 1.10, p8 speaks of the ability of certain library calls to cause a thread to "synchronize with" another thread. Most of the other paragraphs explain how synchronization (and other things) build an order of operations between threads. Of course, your code doesn't invoke any of this. There is no synchronization point, no dependency ordering, nothing.
Without such protection, without some form of synchronization or ordering, 1.10 p21 comes in:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Your program contains two conflicting actions (reading from x and writing to x). Neither is atomic, and neither is ordered by synchronization to happen before the other.
Thus, you have achieved undefined behavior.
So the only case where you get guaranteed multithreaded behavior by the C++11 memory model is if you use a proper mutex or std::atomic<int> x with the proper atomic load/store calls.
Oh, and you don't need to make x volatile too. Anytime you call a (non-inline) function, that function or something it calls could modify a global variable. So it cannot optimize away the read of x in the while loop. And every C++11 mechanism to synchronize requires calling a function. That just so happens to invoke a memory barrier.
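For illustration, a minimal sketch of the guaranteed version of the questioner's example (the second function is my addition):

#include <atomic>

static std::atomic<int> x{0};

void func()
{
    while (x.load() == 0) {} // a seq_cst load: never optimized away, and the
                             // store below is guaranteed to become visible
}

void other_thread()
{
    x.store(1); // a seq_cst store: synchronizes with the load above,
                // so the loop is guaranteed to terminate
}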
Intel developer zone mentions, "Volatile: Almost Useless for Multi-Threaded Programming"
The volatile keyword is used in this signal handler example from cppreference.com
#include <csignal>
#include <iostream>

namespace
{
    volatile std::sig_atomic_t gSignalStatus;
}

void signal_handler(int signal)
{
    gSignalStatus = signal;
}

int main()
{
    // Install a signal handler
    std::signal(SIGINT, signal_handler);

    std::cout << "SignalValue: " << gSignalStatus << '\n';
    std::cout << "Sending signal " << SIGINT << '\n';
    std::raise(SIGINT);
    std::cout << "SignalValue: " << gSignalStatus << '\n';
}
Related
Can external I/O be relied upon as a form of cross-thread synchronization?
To be specific, consider the pseudocode below, which assumes the existence of network/socket functions:
int a; // Globally accessible data.
socket s1, s2; // Platform-specific.

int main() {
    // Set up + connect two sockets to (the same) remote machine.
    s1 = ...;
    s2 = ...;

    std::thread t1{thread1}, t2{thread2};
    t1.join();
    t2.join();
}

void thread1() {
    a = 42;
    send(s1, "foo");
}

void thread2() {
    recv(s2); // Blocking receive (error handling omitted).
    f(a); // Use a; should be 42.
}
We assume that the remote machine only sends data to s2 upon receiving the "foo" from s1. If this assumption fails, then certainly undefined behavior will result. But if it holds (and no other external failure occurs like network data corruption, etc.), does this program produce defined behavior?
"Never", "unspecified (depends on implementation)", "depends on the guarantees provided by the implementation of send/recv" are example answers of the sort I'm expecting, preferably with justification from the C++ standard (or other relevant standards, such as POSIX for sockets/networking).
If "never", then changing a to be a std::atomic<int> initialized to a definite value (say 0) would avoid undefined behaviour, but then is the value guaranteed to be read as 42 in thread2 or could a stale value be read? Do POSIX sockets provide a further guarantee that ensures a stale value will not be read?
If "depends", do POSIX sockets provide the relevant guarantee to make it defined behavior? (How about if s1 and s2 were the same socket instead of two separate sockets?)
For reference, the standard I/O library has a clause which seems to provide an analogous guarantee when working with iostreams (27.2.3¶2 in N4604):
If one thread makes a library call a that writes a value to a stream and, as a result, another thread reads this value from the stream through a library call b such that this does not result in a data race, then a’s write synchronizes with b’s read.
So is it a matter of the underlying network library/functions being used providing a similar guarantee?
In practical terms, it seems the compiler can't reorder accesses to the global a with respect to the send and recv functions (as they could use a in principle). However, the thread running thread2 could still read a stale value of a unless some kind of memory barrier / synchronization guarantee is provided by the send/recv pair itself.
Short answer: No, there is no generic guarantee that a will be updated. My suggestion would be to send the value of a along with "foo" - e.g. "foo, 42", or something like it. That is guaranteed to work, and the overhead is probably insignificant. [There may of course be other reasons why that doesn't work well.]
Long rambling stuff that doesn't really answer the problem:
Global data is not guaranteed to be "visible" immediately in different cores of multicore processors without further operations. Yes, most modern processors are "coherent", but not all models of all brands are guaranteed to be. So if thread2 runs on a processor that has already cached a copy of a, it cannot be guaranteed that the value of a is 42 at the point when you call f.
The C++ standard guarantees that global variables are loaded after the function call, so the compiler is not allowed to do:
tmp = a;
recv(...);
f(tmp);
but as I said above, cache operations may be needed to guarantee that all processors see the same value at the same time. If send and recv take long enough, or touch enough memory [there is no direct measure that says how long or how much], you may see the correct value most or even all of the time, but there is no guarantee for ordinary types that they are ACTUALLY updated outside of the thread that wrote the value last.
std::atomic will help on some types of processors, but there is no guarantee that this is "visible" in a second thread or on a second processor core at any reasonable time after it was changed.
The only practical solution is to have some kind of "repeat until I see it change" code - this may require one value that is (for example) a counter, and one value that is the actual value. If a represents, say, the number of data items available in a buffer, it is probably "it changed value" that matters, so just check "is this the same as last time?". The std::atomic operations have ordering guarantees, which allow you to ensure that "if I update this field, the other field is guaranteed to appear at the same time or before it". So you can use that to guarantee that a pair of data items is set together: for example, a counter indicating the "version number" of the current data, and "the new value is X".
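A sketch of that counter-plus-value idea using std::atomic ordering guarantees (my own illustration, not code from the question):

#include <atomic>

std::atomic<int> version{0};
std::atomic<int> value{0};

// Writer: store the value first, then publish it by bumping the version.
void write_value(int v)
{
    value.store(v, std::memory_order_relaxed);
    version.fetch_add(1, std::memory_order_release);
}

// Reader: "repeat until I see it change"; the acquire load of version
// guarantees the matching (or a newer) value is then visible.
int read_new_value(int last_seen_version)
{
    while (version.load(std::memory_order_acquire) == last_seen_version)
        ; // spin
    return value.load(std::memory_order_relaxed);
}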
Of course, if you KNOW what processor architectures your code will run on, you can plausibly make more advanced guesses as to what the behaviour will be. For example all x86 and many ARM processors use the cache-interface to implement atomic updates on a variable, so by doing an atomic update on one core, you can know that "no other processor will have a stale value of this". But there are processors available that do not have this implementation detail, and where an update, even with an atomic instruction, will not be updated on other cores or in other threads until "some time in the future, uncertain when".
In general, no, external I/O can't be relied upon for cross-thread synchronization.
The question is out-of-scope of the C++ standard itself, as it involves the behavior of external/OS library functions. So whether the program is undefined behavior depends on any synchronization guarantees provided by the network I/O functions. In the absence of such guarantees, it is indeed undefined behavior. Switching to (initialized) atomics to avoid undefined behavior still wouldn't guarantee the "correct" up-to-date value will be read. To ensure that within the realms of the C++ standard would require some kind of locking (e.g. spinlock or mutex), even though it seems like waiting shouldn't be required due to the real-time ordering of the situation.
In general, the notion of "real-time" synchronization (involving visibility rather than merely ordering) required to avoid having to potentially wait after the recv returns before loading a isn't supported by the C++ standard. At a lower level, this notion does exist, however, and would typically be implemented through inter-processor interrupts, e.g. FlushProcessWriteBuffers on Windows, or sys_membarrier on Linux. This would be inserted after the store to a and before the send in thread1. No synchronization or barrier would be required in thread2. (It also seems like a simple SFENCE in thread1 might suffice on x86 due to its strong memory model, at least in the absence of non-temporal loads/stores.)
A compiler barrier shouldn't be needed in either thread for the reasons outlined in the question (call to an external function send, which for all the compiler knows could be acquiring an internal mutex to synchronize with the other call to recv).
Insidious problems of the sort described in section 4.3 of Hans Boehm's paper "Threads Cannot be Implemented as a Library" should not be a concern as the C++ compiler is thread-aware (and in particular the opaque functions send and recv could contain synchronization operations), so transformations introducing writes to a after the send in thread1 are not permissible under the memory model.
This leaves the open question of whether the POSIX network functions provide the necessary guarantees. I highly doubt it, as on some of the architectures with weak memory models, they are highly non-trivial and/or expensive to provide (requiring a process-wide mutex or IPI as mentioned earlier). On x86 specifically, it's almost certain that accessing a shared resource like a socket will entail an SFENCE or MFENCE (or even a LOCK-prefixed instruction) somewhere along the line, which should be sufficient, but this is unlikely to be enshrined in a standard anywhere. Edit: In fact, I think even the INT to switch to kernel mode entails a drain of the store buffer (the best reference I have to hand is this forum post).
Consider the following spin_lock() implementation, originally from this answer:
void spin_lock(volatile bool* lock) {
    for (;;) {
        // inserts an acquire memory barrier and a compiler barrier
        if (!__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
            return;

        while (*lock) // no barriers; is it OK?
            cpu_relax();
    }
}
What I already know:
volatile prevents the compiler from optimizing out the re-read of *lock on each iteration of the while loop;
volatile inserts neither memory nor compiler barriers;
such an implementation actually works in GCC for x86 (e.g. in the Linux kernel) and some other architectures;
at least one memory and compiler barrier is required in a spin_lock() implementation for a generic architecture; this example inserts them in __atomic_test_and_set().
Questions:
Is volatile enough here, or are there architectures or compilers where a memory or compiler barrier or an atomic operation is required in the while loop?
1.1 According to C++ standards?
1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?
Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)
Is the while loop safe according to C++11 and its memory model?
There are several related questions, but I was unable to construct an explicit and unambiguous answer from them:
Q: Memory barrier in a single thread
In principle: Yes, if program execution moves from one core to the next, it might not see all writes that occurred on the previous core.
Q: memory barrier and cache flush
On pretty much all modern architectures, caches (like the L1 and L2 caches) are ensured coherent by hardware. There is no need to flush any cache to make memory visible to other CPUs.
Q: Is my spin lock implementation correct and optimal?
Q: Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?
Q: Do you expect that future CPU generations are not cache coherent?
This is important: in C++, volatile has nothing at all to do with concurrency! The purpose of volatile is to tell the compiler that it shall not optimize accesses to the affected object. It does not tell the CPU anything, primarily because the CPU would already know whether memory is volatile or not. The purpose of volatile is effectively to deal with memory-mapped I/O.
The C++ standard is very clear in section 1.10 [intro.multithread] that unsynchronized access to an object which is modified in one thread and is accessed (modified or read) in another thread is undefined behavior. The synchronization primitives avoiding undefined behavior are library components like the atomic classes or mutexes. This clause mentions volatile only in the context of signals (i.e., as volatile sig_atomic_t) and in the context of forward progress (i.e., that a thread will eventually do something which has an observable effect, like accessing a volatile object or doing I/O). There is no mention of volatile in conjunction with synchronization.
Thus, unsynchronized access to a variable shared across threads leads to undefined behavior. Whether it is declared volatile or not doesn't matter to this undefined behavior.
From the Wikipedia page on memory barriers:
... Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively.
To me this implies that Itanium requires a suitable fence to make reads/writes visible to other processors, but this may in fact just be for purposes of ordering. The question, I think, really boils down to:
Does there exist an architecture where a processor might never update its local cache if not instructed to do so? I don't know the answer, but if you posed the question in this form then someone else might. In such an architecture your code potentially goes into an infinite loop where the read of *lock always sees the same value.
In terms of general C++ legality, the one atomic test-and-set in your example isn't enough, since it implements only a single fence. That allows you to see the initial state of *lock when entering the while loop, but not to see when it changes: you are then reading a variable that is changed in another thread without synchronisation, which is undefined behavior. So the answer to your questions (1.1/3) is no.
On the other hand, in practice, the answer to (1.2/2) is yes (given GCC's volatile semantics), as long as the architecture guarantees cache coherence without explicit memory fences. That is true of x86 and probably of many other architectures, but I can't give a definite answer on whether it is true for all architectures that GCC supports. It is, however, generally unwise to knowingly rely on particular behavior of code that is technically undefined according to the language spec, especially when it is possible to get the same result without doing so.
Incidentally, given that memory_order_relaxed exists, there seems little reason not to use it in this case rather than try to hand-optimise by using non-atomic reads, i.e. changing the while loop in your example to:
while (atomic_load_explicit(lock, memory_order_relaxed)) {
    cpu_relax();
}
On x86_64 for instance the atomic load becomes a regular mov instruction and the optimised assembly output is essentially the same as it would be for your original example.
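For comparison, a fully standard C++11 sketch of the same lock using std::atomic_flag (my own illustration, not code from the question; C++11's atomic_flag has no plain load, so the inner spin is the RMW itself - std::atomic<bool> with a relaxed load would recover a read-only inner loop):

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock()
{
    // acquire semantics: the critical section cannot move above the lock
    while (lock_flag.test_and_set(std::memory_order_acquire))
        ; // spin; every iteration is an atomic operation, so no UB
}

void spin_unlock()
{
    // release semantics: the critical section cannot move below the unlock
    lock_flag.clear(std::memory_order_release);
}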
Is volatile enough here, or are there architectures or compilers where a memory or compiler barrier or an atomic operation is required in the while loop?
Will the volatile code see the change? Yes, but not necessarily as quickly as if there were a memory barrier. At some point, some form of synchronization will occur, and the new state will be read from the variable, but there are no guarantees on how much has happened elsewhere in the code.
1.1 According to C++ standards?
From cppreference : memory_order
It is the memory model and the memory order which define the generalized hardware that the code needs to work on. For a message to pass between threads of execution, an inter-thread happens-before relationship needs to occur. This requires one of the following:
A synchronizes-with B
A is dependency-ordered before B
A synchronizes-with some evaluation X, and X is sequenced-before B
A is sequenced-before some evaluation X, and X inter-thread happens-before B
A inter-thread happens-before some evaluation X, and X inter-thread happens-before B
As you are not establishing any of these relationships, there are forms of your program that may fail on some current hardware.
In practice, the end of a time-slice will cause the memory to become coherent, or any form of barrier on the non-spinlock thread will ensure that the caches are flushed.
I'm not sure, though, what causes the volatile read to eventually get the "current value".
1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?
As the code is not consistent with the generalized machine model of C++11, it is likely that this code will fail to perform on implementations which try to adhere closely to the standard.
From cppreference : const volatile qualifiers
Volatile access stops optimizations from moving work from before it to after it, and from after it to before it.
"This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution"
So an implementation has to ensure that reads are performed from the memory location rather than from any local copy. But it does not have to ensure that a volatile write is flushed through the caches to produce a coherent view across all the CPUs. In this sense, there is no bound on how long after a write to a volatile variable the new value will become visible to another thread.
Also see kernel.org on why volatile is nearly always wrong in the kernel.
Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)
There is no guarantee that the volatile write gets out of the thread which sets it, so it is not really safe. On Linux it may happen to be safe.
Is the while loop safe according to C++11 and its memory model?
No - as it doesn't create any of the inter-thread messaging primitives.
I have a really simple question. I have a simple type variable (like int). I have one process, one writer thread, and several "read-only" threads. How should I declare the variable?
volatile int
std::atomic<int>
int
I expect that when the "writer" thread modifies the value, all "reader" threads should see the fresh value ASAP.
It's OK to read and write the variable at the same time, but I expect the reader to obtain either the old value or the new value, not some "intermediate" value.
I'm using a single-CPU Xeon E5 v3 machine. I do not need to be portable; I run the code only on this server, and I compile with -march=native -mtune=native. Performance is very important, so I do not want to add "synchronization overhead" unless absolutely required.
If I just use int and one thread writes value is it possible that in another thread I do not see "fresh" value for a while?
Just use std::atomic.
Don't use volatile, and don't use a plain int; neither gives the necessary synchronisation. Modifying the variable in one thread and accessing it from another without synchronisation gives undefined behaviour.
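A minimal sketch of what that looks like (names are illustrative):

#include <atomic>

std::atomic<int> value{0}; // shared: one writer, many readers

void writer()           // the single writer thread
{
    value.store(42);    // indivisible write; readers see old or new, never a mix
}

int reader()            // any of the read-only threads
{
    return value.load();
}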
If you have unsynchronized access to a variable with one or more writers, then your program has undefined behavior. Somehow you have to guarantee that while a write is happening, no other write or read can happen. This is called synchronization. How you achieve this synchronization depends on the application.
For something like this, where we have one writer and several readers and are using a TriviallyCopyable datatype, a std::atomic<> will work. The atomic variable will make sure, under the hood, that only one thread can access the variable at the same time.
If you do not have a TriviallyCopyable type, or you do not want to use a std::atomic, you could also use a conventional std::mutex and a std::lock_guard to control access:

std::mutex mutx; // protects some_variable

{ // enter locking scope
    std::lock_guard<std::mutex> lock(mutx); // locks the mutex for the scope
    some_variable = some_value; // do work
} // end of scope; lock is destroyed and mutx is released
An important thing to keep in mind with this approach is that you want to keep the // do work section as short as possible, since while the mutex is locked no other thread can enter that section.
Another option would be to use a std::shared_timed_mutex (C++14) or std::shared_mutex (C++17), which allow multiple readers to share the mutex, but when you need to write you can still lock the mutex exclusively and write the data.
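For illustration, a sketch of that reader/writer variant (assuming C++17 std::shared_mutex; the names are mine):

#include <shared_mutex>

std::shared_mutex rw_mutex; // guards shared_value
int shared_value = 0;

int read_value()
{
    std::shared_lock<std::shared_mutex> lock(rw_mutex); // many readers at once
    return shared_value;
}

void write_value(int v)
{
    std::unique_lock<std::shared_mutex> lock(rw_mutex); // exclusive: blocks readers and writers
    shared_value = v;
}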
You do not want to use volatile to control synchronization as jalf states in this answer:
For thread-safe accesses to shared data, we need a guarantee that:
the read/write actually happens (that the compiler won't just store the value in a register instead and defer updating main memory until much later)
that no reordering takes place. Assume that we use a volatile variable as a flag to indicate whether or not some data is ready to be read. In our code, we simply set the flag after preparing the data, so all looks fine. But what if the instructions are reordered so the flag is set first?
volatile does guarantee the first point. It also guarantees that no reordering occurs between different volatile reads/writes. All volatile memory accesses will occur in the order in which they're specified. That is all we need for what volatile is intended for: manipulating I/O registers or memory-mapped hardware, but it doesn't help us in multithreaded code where the volatile object is often only used to synchronize access to non-volatile data. Those accesses can still be reordered relative to the volatile ones.
As always if you measure the performance and the performance is lacking then you can try a different solution but make sure to remeasure and compare after changing.
Lastly Herb Sutter has an excellent presentation he did at C++ and Beyond 2012 called Atomic Weapons that:
This is a two-part talk that covers the C++ memory model, how locks and atomics and fences interact and map to hardware, and more. Even though we’re talking about C++, much of this is also applicable to Java and .NET which have similar memory models, but not all the features of C++ (such as relaxed atomics).
I'll expand a little on the previous answers.
As noted previously, just using int, or even volatile int, is not enough, for various reasons (even with the memory-ordering constraints of Intel processors).
So yes, you should use atomic types for that, but you need extra consideration: atomic types guarantee coherent access, but if you have visibility concerns you need to specify a memory barrier (memory order).
Barriers enforce visibility and coherency between threads; on Intel and most modern architectures, they enforce cache synchronization so updates are visible to every core. The problem is that they can be expensive if you're not careful.
The possible memory orders are:
relaxed: no special barrier; only coherent reads/writes are enforced;
sequential consistency: the strongest possible constraint (and the default);
acquire: enforces that no loads after the current one are reordered before it, and adds the barrier required to ensure that released stores are visible;
consume: a simplified version of acquire that mostly only constrains reordering;
release: enforces that all stores before the current one are complete before it, and that the memory writes are visible to loads performing an acquire barrier.
So, if you want to be sure that updates to the variable are visible to readers, you need to flag your store with (at least) a release memory order, and on the reader side you need (again, at least) an acquire memory order. Otherwise, readers may not see the actual version of the integer (they'll see a coherent version at least - the old one or the new one, but not an ugly mix of the two).
Of course, the default behavior (full consistency) will also give you the correct behavior, but at the expense of a lot of synchronization. In short, each time you add a barrier it forces cache synchronization which is almost as expensive as several cache misses (and thus reads/writes in main memory.)
So, in short you should declare your int as atomic and use the following code for store and load:
// Your variable
std::atomic<int> v;
// Read
x = v.load(std::memory_order_acquire);
// Write
v.store(x, std::memory_order_release);
And just to complete the picture: sometimes (and more often than you think) you don't really need sequential consistency (or even the partial release/acquire consistency), since visibility of updates is pretty relative. When dealing with concurrent operations, an update takes place not when the write is performed but when others see the change, so reading the old value is probably not a problem!
I strongly recommend reading articles related to relativistic programming and RCU, here are some interesting links:
Relativistic Programming wiki: http://wiki.cs.pdx.edu/rp/
Structured Deferral: Synchronization via Procrastination: https://queue.acm.org/detail.cfm?id=2488549
Introduction to RCU Concepts: http://www.rdrop.com/~paulmck/RCU/RCU.LinuxCon.2013.10.22a.pdf
Let's start with int. In general, when used on a single-processor, single-core machine, this should be sufficient, assuming the int is the same size as or smaller than the CPU word (like a 32-bit int on a 32-bit CPU). In this case, assuming correctly aligned addresses (a high-level language should ensure this by default), the write/read operations should be atomic. This is guaranteed by Intel, as stated in [1]. However, in the C++ specification, simultaneous reading and writing from different threads is undefined behaviour:
§1.10/6: Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.
Now, volatile. This keyword disables almost every optimization on the variable, which is the reason it was used. For example, when optimizing, the compiler may decide that a variable you only read in one thread is constant there and simply replace it with its initial value. volatile solves such problems. However, it does not make access to the variable atomic. Also, in most cases, it is simply unnecessary, because using proper multithreading tools, like a mutex or memory barrier, will achieve the same effect as volatile on its own, as described for instance in [2].
While this may be sufficient for most uses, there are other operations that are not guaranteed to be atomic; incrementation is one of them. This is where std::atomic comes in. It has those operations defined, such as the mentioned increments in [3]. It is also well defined when reading and writing from different threads [4].
In addition, as stated in the answers in [5], there exist many other factors that may (negatively) influence the atomicity of operations, from losing cache coherency between multiple cores down to hardware details.
To summarize, std::atomic was created to support accesses from different threads, and it is highly recommended to use it when multithreading.
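For example, a sketch contrasting the two for the increment case mentioned above (my illustration):

#include <atomic>

int plain_counter = 0;              // ++plain_counter from two threads is a data
                                    // race: the read-modify-write is not atomic
std::atomic<int> atomic_counter{0};

void work()
{
    ++atomic_counter;               // a single atomic RMW, equivalent to
                                    // atomic_counter.fetch_add(1)
}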
[1] http://www.intel.com/Assets/PDF/manual/253668.pdf see section 8.1.1.
[2] https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
[3] http://en.cppreference.com/w/cpp/atomic/atomic/operator_arith
[4] http://en.cppreference.com/w/cpp/atomic/atomic
[5] Are C++ Reads and Writes of an int Atomic?
The other answers, which say to use atomic and not volatile, are correct when portability matters. If you're asking this question, and it's a good question, that's the practical answer for you - not "but, if the standard library doesn't provide one, you can implement a lock-free, wait-free data structure yourself!" Nevertheless, if the standard library doesn't provide one, you can implement a lock-free data structure yourself that works on a particular compiler and a particular architecture, provided that there's only one writer. (Also, somebody has to implement those atomic primitives in the standard library.) If I'm wrong about this, I'm sure someone will kindly inform me.
If you absolutely need an algorithm guaranteed to be lock-free on all platforms, you might be able to build one with atomic_flag. If even that doesn’t suffice, and you need to roll your own data structure, you can do that.
Since there’s only one writer thread, your CPU might guarantee that certain operations on your data will still work atomically even if you just use normal accesses instead of locks or even compare-and-swaps. This is not safe according to the language standard, because C++ has to work on architectures where it isn’t, but it can be safe, for example, on an x86 CPU if you guarantee that the variable you’re updating fits into a single cache line that it doesn’t share with anything else, and you might be able to ensure this with nonstandard extensions such as __attribute__ (( aligned (x) )).
Similarly, your compiler might provide some guarantees: g++ in particular guarantees that the compiler will not assume that the memory referenced by a volatile* is unchanged unless the current thread could have changed it. It will actually re-read the variable from memory each time you dereference it. That is in no way sufficient to ensure thread safety, but it can be handy if another thread is updating the variable.
A real-world example might be: the writer thread maintains some kind of pointer (on its own cache line) which points to a consistent view of the data structure that will remain valid through all future updates. It updates its data with the RCU pattern, making sure to use a release operation (implemented in an architecture-specific way) after updating its copy of the data and before making the pointer to that data globally visible, so that any other thread that sees the updated pointer is guaranteed to see the updated data as well. The reader then makes a local copy (not volatile) of the current value of the pointer, getting a view of the data which will stay valid even after the writer thread updates again, and works with that. You want to use volatile on the single variable that notifies the readers of the updates, so they can see those updates even if the compiler “knows” your thread couldn’t have changed it. In this framework, the shared data just needs to be constant, and readers will use the RCU pattern. That’s one of the two ways I’ve seen volatile be useful in the real world (the other being when you don’t want to optimize out your timing loop).
There also needs to be some way, in this scheme, for the program to know when nobody’s using an old view of the data structure any longer. If that’s a count of readers, that count needs to be atomically modified in a single operation at the same time as the pointer is read (so getting the current view of the data structure involves an atomic CAS). Or this might be a periodic tick when all the threads are guaranteed to be done with the data they’re working with now. It might be a generational data structure where the writer rotates through pre-allocated buffers.
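A bare-bones sketch of the pointer-publication part of that scheme (my illustration; it deliberately ignores reclamation, which is the hard part discussed above):

#include <atomic>

struct Data { int payload; };

std::atomic<Data*> current{nullptr};

// Writer: fill in a fresh copy, then publish it with release semantics,
// so everything written to *fresh is visible to anyone who sees the pointer.
void publish(Data* fresh)
{
    current.store(fresh, std::memory_order_release);
}

// Reader: acquire-load the pointer; the view it points to stays consistent
// even if the writer publishes a newer copy afterwards.
const Data* snapshot()
{
    return current.load(std::memory_order_acquire);
}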
Also observe that a lot of things your program might do could implicitly serialize the threads: those atomic hardware instructions lock the processor bus and force other CPUs to wait, those memory fences could stall your threads, or your threads might be waiting in line to allocate memory from the heap.
Unfortunately it depends.
When a variable is read and written in multiple threads, there are two possible failures:
1) Tearing, where half the data is pre-change and half the data is post-change.
2) Stale data, where the data read has some older value.
int, volatile int and std::atomic<int> all don't tear.
Stale data is a different issue; however, any value read will be one that actually existed at some point, so it can be considered correct.
volatile: this tells the compiler neither to cache the data nor to re-order operations around it. It improves coherence between threads by ensuring that all operations in a thread occur either before the access to the variable, at it, or after it.
This means that
volatile int x;
int y;

y = 5;
x = 7;
the instruction for x = 7 will be emitted after the one for y = 5.
Unfortunately, the CPU is also capable of re-ordering operations. This can mean that another thread sees x == 7 before y == 5.
std::atomic<int> x; would guarantee that, after seeing x == 7, another thread would see y == 5 (assuming no other thread is modifying y).
So all reads of int, volatile int, and std::atomic<int> will show previous valid values of x. Using volatile and atomic increases the ordering guarantees between values.
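A sketch of the std::atomic variant of that example (assuming one writer and one reader):

#include <atomic>
#include <cassert>

std::atomic<int> x{0};
int y = 0;

void writer()
{
    y = 5;
    x.store(7, std::memory_order_release); // y = 5 cannot move below this store
}

void reader()
{
    if (x.load(std::memory_order_acquire) == 7)
        assert(y == 5); // guaranteed: the release/acquire pair orders y before x
}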
See kernel.org barriers
I have simple type variable (like int). I have one process, one writer thread, several "readonly" threads. How should I declare variable?
volatile int
std::atomic<int>
int
Use std::atomic with memory_order_relaxed for the store and load
It's quick, and from your description of your problem, safe. E.g.
void func_fast()
{
    std::atomic<int> a;
    a.store(1, std::memory_order_relaxed);
}
Compiles to:
func_fast():
    movl $1, -24(%rsp)
    ret
This assumes you don't need to guarantee that any other data is seen to be written before the integer is updated, and therefore the slower and more complicated synchronisation is unnecessary.
If you use the atomic naively like this:
void func_slow()
{
    std::atomic<int> b;
    b = 1;
}
You get an MFENCE instruction with no memory_order* specification, which is massively slower (100 cycles or more vs just 1 or 2 for the bare MOV).
func_slow():
    movl $1, -24(%rsp)
    mfence
    ret
See http://goo.gl/svPpUa
(Interestingly on Intel the use of memory_order_release and _acquire for this code results in the same assembly language. Intel guarantees that writes and reads happen in order when using the standard MOV instruction).
Here is my attempt at the bounty:
a. The general answer already given above says "use atomics". This is the correct answer; volatile is not enough.
b. If you dislike that answer, and you are on Intel, and you have a properly aligned int, and you love unportable solutions, you can get away with a simple volatile, using Intel's strong memory-ordering guarantees.
TL;DR: Use std::atomic<int> with a mutex around it if you read multiple times.
It depends on how strong a guarantee you want.
First, volatile is a compiler hint and you shouldn't count on it doing anything helpful.
If you use a plain int, you can suffer from memory aliasing. Say you have something like
struct Data {
    int x;
    bool q;
};
Depending on how this is aligned in memory and on the exact implementation of the CPU and memory bus, it's possible that writing to q will actually overwrite x when the page is copied from the CPU cache back to RAM. So unless you know how much to allocate around your int, it's not guaranteed that your writer will be able to write without being overwritten by some other thread.
Also, even if you do the write, you depend on the processor to reload the data into the caches of other cores, so there's no guarantee that your other thread will see the new value.
std::atomic<int> basically guarantees that you will always allocate sufficient memory, properly aligned so that you don't suffer from aliasing. Depending on the memory order requested you will also disable a bunch of optimizations, like caching, so everything will run slightly slower.
This still doesn't guarantee that if you read the variable multiple times you'll get the same value. The only way to do that is to put a mutex around it to block the writer from changing it.
Still, it's better to find a library that already solves the problem you have, one that has already been tested by others to make sure it works well.
I'm writing a C++ app.
I have a class variable that more than one thread is writing to.
In C++, anything that can be modified without the compiler "realizing" that it's being changed needs to be marked volatile, right? So if my code is multi-threaded, and one thread may write to a var while another reads from it, do I need to mark the var volatile?
[I don't have a race condition since I'm relying on writes to ints being atomic]
Thanks!
C++ doesn't yet have any provision for multithreading. In practice, volatile doesn't do what you mean (it was designed for memory-mapped hardware, and while the two issues are similar, they are different enough that volatile doesn't do the right thing - note that volatile has been used in other languages for MT contexts).
So if you want to write an object in one thread and read it in another, you'll have to use the synchronization facilities your implementation provides, where it needs them. For the ones I know of, volatile plays no role in that.
FYI, the next standard will take MT into account, and volatile will play no role in that, so this won't change. You'll just have standard-defined conditions in which synchronization is needed, and standard-defined ways of achieving it.
Yes, volatile is the absolute minimum you'll need. It ensures that the code generator won't generate code that stores the variable in a register, and that reads and writes always go to/from memory. Most code generators can provide atomicity guarantees for variables that have the same size as the native CPU word; they'll ensure the memory address is aligned so that the variable cannot straddle a cache-line boundary.
That is, however, not a very strong contract on modern multi-core CPUs. volatile does not promise that another thread that runs on another core can see updates to the variable. That requires a memory barrier, usually an instruction that flushes the CPU cache. If you don't provide a barrier, the thread will in effect keep running until such a flush occurs naturally. That will eventually happen; the thread scheduler is bound to provide one. That can take milliseconds.
Once you've taken care of details like this, you'll eventually have re-invented a condition variable (aka event) that isn't likely to be any faster than the one provided by a threading library. Or as well tested. Don't invent your own, threading is hard enough to get right, you don't need the FUD of not being sure that the very basic primitives are solid.
volatile instructs the compiler not to optimize based on its "intuition" about a variable's value or usage, since the variable could be modified "from the outside". volatile won't provide any synchronization, however, and your assumption that writes to int are atomic is anything but realistic!
I'd guess we'd need to see some usage to know whether volatile is needed in your case (or to check the behavior of your program), or, more importantly, whether you have some sort of synchronization.
I think that volatile only really applies to reading, especially reading memory-mapped I/O registers.
It can be used to tell the compiler to not assume that once it has read from a memory location that the value won't change:
while (*p)
{
// ...
}
In the above code, if *p is not written to within the loop, the compiler might decide to move the read outside the loop, more like this:
cached_p = *p;
while (cached_p)
{
    // ...
}
If p is a pointer to a memory-mapped I/O port, you would want the first version where the port is checked before the loop is entered every time.
If p is a pointer to memory in a multi-threaded app, you're still not guaranteed that writes are atomic.
Without locking you may still get 'impossible' re-orderings done by the compiler or processor. And there's no guarantee that writes to ints are atomic.
It would be better to use proper locking.
Volatile will solve your problem, i.e. it will guarantee consistency among all the caches of the system. However, it will be inefficient, since it will update the variable in memory for each R or W access. You might consider using a memory barrier instead, only where it is needed.
If you are working with gcc/icc, have a look at the sync built-ins: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
EDIT (mostly about pm100 comment):
I understand that my beliefs are not a reference so I found something to quote :)
The volatile keyword was devised to prevent compiler optimizations that might render code incorrect in the presence of certain asynchronous events. For example, if you declare a primitive variable as volatile, the compiler is not permitted to cache it in a register
From Dr Dobb's
More interesting:
Volatile fields are linearizable. Reading a volatile field is like acquiring a lock; the working memory is invalidated and the volatile field's current value is reread from memory. Writing a volatile field is like releasing a lock : the volatile field is immediately written back to memory.
(this is all about consistency, not about atomicity)
from The Art of multiprocessor programming, Maurice Herlihy & Nir Shavit
A lock contains memory-synchronization code; if you don't lock, you must do something else, and using the volatile keyword is probably the simplest thing you can do (even though it was designed for external devices with memory bound to the address space, that's not the point here).
I have a multi-R/W lock class that keeps read, write, pending-read, and pending-write counters. A mutex guards them from multiple threads.
My question is: do we still need the counters to be declared volatile so that the compiler won't screw them up while doing its optimizations?
Or does the compiler take into account that the counters are guarded by the mutex?
I understand that the mutex is a run-time mechanism for synchronization, and the "volatile" keyword is a compile-time indication to the compiler to do the right thing while doing the optimizations.
Regards,
-Jay.
There are two basically unrelated items here that are always confused.
volatile
threads, locks, memory barriers, etc.
volatile is used to tell the compiler to produce code that reads the variable from memory, not from a register, and not to reorder the code around it. In general, not to optimize or take "short-cuts".
memory barriers (supplied by mutexes, locks, etc), as quoted from Herb Sutter in another answer, are for preventing the CPU from reordering read/write memory requests, regardless of how the compiler said to do it. ie don't optimize, don't take short cuts - at the CPU level.
Similar, but in fact very different things.
In your case, and in most cases of locking, the reason that volatile is NOT necessary is that function calls are being made for the sake of locking. i.e.:
Normal function calls affecting optimizations:

extern void library_func(); // from some external library

int x; // global

int f()
{
    x = 2;
    library_func();
    return x; // x is reloaded because it may have changed
}
unless the compiler can examine library_func() and determine that it doesn't touch x, it will re-read x on the return. This is even WITHOUT volatile.
Threading:
int f(SomeObject& obj)
{
    int temp1 = obj.x;

    lock(obj.mutex); // really should use RAII
    int temp2 = obj.x;
    int temp3 = obj.x;
    unlock(obj.mutex);

    return temp1 + temp2 + temp3;
}
After reading obj.x for temp1, the compiler is going to re-read obj.x for temp2 - NOT because of the magic of locks - but because it is unsure whether lock() modified obj. You could probably set compiler flags to aggressively optimize (no-alias, etc) and thus not re-read x, but then a bunch of your code would probably start failing.
For temp3, the compiler (hopefully) won't re-read obj.x.
If for some reason obj.x could change between temp2 and temp3, then you would use volatile (and your locking would be broken/useless).
Lastly, if your lock()/unlock() functions were somehow inlined, maybe the compiler could evaluate the code and see that obj.x doesn't get changed. But I guarantee one of two things here:
- the inline code eventually calls some OS-level lock function (thus preventing evaluation), or
- you call some asm memory-barrier instructions (i.e. ones wrapped in inline functions like __InterlockedCompareExchange) that your compiler will recognize and thus avoid reordering around.
EDIT: P.S. I forgot to mention - for pthreads stuff, some compilers are marked as "POSIX compliant" which means, among other things, that they will recognize the pthread_ functions and not do bad optimizations around them. ie even though the C++ standard doesn't mention threads yet, those compilers do (at least minimally).
So, short answer
you don't need volatile.
From Herb Sutter's article "Use Critical Sections (Preferably Locks) to Eliminate Races" (http://www.ddj.com/cpp/201804238):
So, for a reordering transformation to be valid, it must respect the program's critical sections by obeying the one key rule of critical sections: Code can't move out of a critical section. (It's always okay for code to move in.) We enforce this golden rule by requiring symmetric one-way fence semantics for the beginning and end of any critical section, illustrated by the arrows in Figure 1:
Entering a critical section is an acquire operation, or an implicit acquire fence: Code can never cross the fence upward, that is, move from an original location after the fence to execute before the fence. Code that appears before the fence in source code order, however, can happily cross the fence downward to execute later.
Exiting a critical section is a release operation, or an implicit release fence: This is just the inverse requirement that code can't cross the fence downward, only upward. It guarantees that any other thread that sees the final release write will also see all of the writes before it.
So for a compiler to produce correct code for a target platform, the correct acquire and release semantics must be followed when a critical section is entered and exited (and the term critical section is used here in its generic sense, not necessarily in the Win32 sense of something protected by a CRITICAL_SECTION structure - the critical section can be protected by other synchronization objects). So you should not have to mark the shared variables as volatile, as long as they are accessed only within protected critical sections.
volatile is used to inform the optimizer to always load the current value of the location, rather than load it into a register and assume that it won't change. This is most valuable when working with dual-ported memory locations or locations that can be updated real-time from sources external to the thread.
The mutex is a run-time OS mechanism that the compiler really doesn't know anything about - so the optimizer wouldn't take that into account. It will prevent more than one thread from accessing the counters at one time, but the values of those counters are still subject to change even while the mutex is in effect.
So, you're marking the vars volatile because they can be externally modified, and not because they're inside a mutex guard.
Keep them volatile.
While this may depend on the threading library you are using, my understanding is that any decent library will not require use of volatile.
In Pthreads, for example, use of a mutex will ensure that your data gets committed to memory correctly.
EDIT: I hereby endorse tony's answer as being better than my own.
You still need the "volatile" keyword.
The mutexes prevent the counters from concurrent access.
"volatile" tells the compiler to actually use the counter instead of caching it in a CPU register (which would not be updated by the concurrent thread).