I wonder if it would be possible to write std::atomic<> for the use on the AVR µC. The
__atomic_xxx() built-ins are unfortunately not implemented in avr-gcc.
As I understand basic load/store of uint8_t on AVR is atomic, but e.g. operator++() not because it implies an rmw-cycle. So, for these operations one has to disable interrupts, because this is the only form of concurrency on this hardware. For larger types than uint8_t even the operator=(T) needs to be protected from interrupts.
On the other hand one has to apply a memory-barrier to the data member of the std::atomic<> template: e.g. this data member has the name value one has to use
asm volatile("" : "+m" (value)); to accomplish real load/stores on the machine.
Is this sufficient to implement the std::atomic<>?
Because this implementation is lock-free, it should be useable for ISRs on that hardware.
If one would implement such std::atomic<> this would led to unefficient machine code inside the ISR, because the unneccessary interrupt disabling and/or the memory-barrier, which prevents optimization.
Well, this could be circumvented by extending the interface of the std::atomic<> with unsafe operations.
On the other hand: would it be more feasable to implement std::atomic_ref<> and use this outside the ISR?
Related
in volatile: The Multithreaded Programmer's Best Friend, Andrei Alexandrescu gives this example:
class Gadget
{
public:
void Wait()
{
while (!flag_)
{
Sleep(1000); // sleeps for 1000 milliseconds
}
}
void Wakeup()
{
flag_ = true;
}
...
private:
bool flag_;
};
he states,
... the compiler concludes that it can cache flag_ in a register ... it harms correctness: after you call Wait for some Gadget object, although another thread calls Wakeup, Wait will loop forever. This is because the change of flag_ will not be reflected in the register that caches flag_.
then he offers a solution:
If you use the volatile modifier on a variable, the compiler won't cache that variable in registers — each access will hit the actual memory location of that variable.
now, other people mentioned on stackoverflow and elsewhere that volatile keyword doesn't really offer any thread-safety guarantees, and i should use std::atomic or mutex synchronization instead, which i do agree.
however, going the std::atomic route for example, which internally uses memory fences read_acquire and write_release (Acquire and Release Semantics), i don't see how it actually fixes the register-cache problem in particular.
in case of x86 for example, every load on x86/64 already implies acquire semantics and every store implies release semantics such that compiled code under x86 doesn't emit any actual memory barriers at all. (The Purpose of memory_order_consume in C++11)
g = Guard.load(memory_order_acquire);
if (g != 0)
p = Payload;
On Intel x86-64, the Clang compiler generates compact machine code for this example – one machine instruction per line of C++ source code. This family of processors features a strong memory model, so the compiler doesn’t need to emit special memory barrier instructions to implement the read-acquire.
so.... just assuming x86 arch for now, how does std::atomic solve the cache in registry problem? w/ no memory barrier instructions for read-acquire in compiled code, it seems to be the same as the compiled code for just regular read.
Did you notice that there was no load from just a register in your code? There was an explicit memory load from _Guard. So it did in fact prevent caching in a register.
Now how it does this is up to the specific platform's implementation of std::atomic, but it must do this.
And, by the way, Alexandrescu's reasoning is completely wrong for modern platforms. While it's true that volatile prevents the compiler from caching in a register, it doesn't prevent similar caching being done by the CPU or by hardware. On some platforms, it might happen to be adequate, but there is absolutely no reason to write gratuitously non-portable code that might break on a future CPU, compiler, library, or platform when a fully-portable alternative is readily available.
volatile is not necessary for any "sane" implementation when the Gadget example is changed to use std::atomic<bool>. The reason for this is not that the standard forbids the use of registers, instead (§29.3/13 in n3690):
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
Of course, what constitutes "reasonable" is open to interpretation, and it's "should", not "shall", so an implementation might ignore the requirement without violating the letter of the standard. Typical implementations do not cache the results of atomic loads, nor (much) delay issuing an atomic store to the CPU, and thus leave the decision largely to the hardware. If you would like to enforce this behavior, you should use volatile std::atomic<bool> instead. In both cases, however, if another thread sets the flag, the Wait() should be finite, but if your compiler and/or CPU are so willing, can still take much longer than you would like.
Also note that a memory fence does not guarantee that a store becomes visible to another thread immediately nor any sooner than it otherwise would. So even if the compiler added fence instructions to Gadget's methods, they wouldn't help at all. Fences are used to guarantee consistency, not to increase performance.
A coworker and I write software for a variety of platforms running on x86, x64, Itanium, PowerPC, and other 10 year old server CPUs.
We just had a discussion about whether mutex functions such as pthread_mutex_lock() ... pthread_mutex_unlock() are sufficient by themselves, or whether the protected variable needs to be volatile.
int foo::bar()
{
//...
//code which may or may not access _protected.
pthread_mutex_lock(m);
int ret = _protected;
pthread_mutex_unlock(m);
return ret;
}
My concern is caching. Could the compiler place a copy of _protected on the stack or in a register, and use that stale value in the assignment? If not, what prevents that from happening? Are variations of this pattern vulnerable?
I presume that the compiler doesn't actually understand that pthread_mutex_lock() is a special function, so are we just protected by sequence points?
Thanks greatly.
Update: Alright, I can see a trend with answers explaining why volatile is bad. I respect those answers, but articles on that subject are easy to find online. What I can't find online, and the reason I'm asking this question, is how I'm protected without volatile. If the above code is correct, how is it invulnerable to caching issues?
Simplest answer is volatile is not needed for multi-threading at all.
The long answer is that sequence points like critical sections are platform dependent as is whatever threading solution you're using so most of your thread safety is also platform dependent.
C++0x has a concept of threads and thread safety but the current standard does not and therefore volatile is sometimes misidentified as something to prevent reordering of operations and memory access for multi-threading programming when it was never intended and can't be reliably used that way.
The only thing volatile should be used for in C++ is to allow access to memory mapped devices, allow uses of variables between setjmp and longjmp, and to allow uses of sig_atomic_t variables in signal handlers. The keyword itself does not make a variable atomic.
Good news in C++0x we will have the STL construct std::atomic which can be used to guarantee atomic operations and thread safe constructs for variables. Until your compiler of choice supports it you may need to turn to the boost library or bust out some assembly code to create your own objects to provide atomic variables.
P.S. A lot of the confusion is caused by Java and .NET actually enforcing multi-threaded semantics with the keyword volatile C++ however follows suit with C where this is not the case.
Your threading library should include the apropriate CPU and compiler barriers on mutex lock and unlock. For GCC, a memory clobber on an asm statement acts as a compiler barrier.
Actually, there are two things that protect your code from (compiler) caching:
You are calling a non-pure external function (pthread_mutex_*()), which means that the compiler doesn't know that that function doesn't modify your global variables, so it has to reload them.
As I said, pthread_mutex_*() includes a compiler barrier, e.g: on glibc/x86 pthread_mutex_lock() ends up calling the macro lll_lock(), which has a memory clobber, forcing the compiler to reload variables.
If the above code is correct, how is it invulnerable to caching
issues?
Until C++0x, it is not. And it is not specified in C. So, it really depends on the compiler. In general, if the compiler does not guarantee that it will respect ordering constraints on memory accesses for functions or operations that involve multiple threads, you will not be able to write multithreaded safe code with that compiler. See Hans J Boehm's Threads Cannot be Implemented as a Library.
As for what abstractions your compiler should support for thread safe code, the wikipedia entry on Memory Barriers is a pretty good starting point.
(As for why people suggested volatile, some compilers treat volatile as a memory barrier for the compiler. It's definitely not standard.)
The volatile keyword is a hint to the compiler that the variable might change outside of program logic, such as a memory-mapped hardware register that could change as part of an interrupt service routine. This prevents the compiler from assuming a cached value is always correct and would normally force a memory read to retrieve the value. This usage pre-dates threading by a couple decades or so. I've seen it used with variables manipulated by signals as well, but I'm not sure that usage was correct.
Variables guarded by mutexes are guaranteed to be correct when read or written by different threads. The threading API is required to ensure that such views of variables are consistent. This access is all part of your program logic and the volatile keyword is irrelevant here.
With the exception of the simplest spin lock algorithm, mutex code is quite involved: a good optimized mutex lock/unlock code contains the kind of code even excellent programmer struggle to understand. It uses special compare and set instructions, manages not only the unlocked/locked state but also the wait queue, optionally uses system calls to go into a wait state (for lock) or wake up other threads (for unlock).
There is no way the average compiler can decode and "understand" all that complex code (again, with the exception of the simple spin lock) no matter way, so even for a compiler not aware of what a mutex is, and how it relates to synchronization, there is no way in practice a compiler could optimize anything around such code.
That's if the code was "inline", or available for analyse for the purpose of cross module optimization, or if global optimization is available.
I presume that the compiler doesn't actually understand that
pthread_mutex_lock() is a special function, so are we just protected
by sequence points?
The compiler does not know what it does, so does not try to optimize around it.
How is it "special"? It's opaque and treated as such. It is not special among opaque functions.
There is no semantic difference with an arbitrary opaque function that can access any other object.
My concern is caching. Could the compiler place a copy of _protected
on the stack or in a register, and use that stale value in the
assignment?
Yes, in code that act on objects transparently and directly, by using the variable name or pointers in a way that the compiler can follow. Not in code that might use arbitrary pointers to indirectly use variables.
So yes between calls to opaque functions. Not across.
And also for variables which can only be used in the function, by name: for local variables that don't have either their address taken or a reference bound to them (such that the compiler cannot follow all further uses). These can indeed be "cached" across arbitrary calls include lock/unlock.
If not, what prevents that from happening? Are variations of this
pattern vulnerable?
Opacity of the functions. Non inlining. Assembly code. System calls. Code complexity. Everything that make compilers bail out and think "that's complicated stuff just make calls to it".
The default position of a compiler is always the "let's execute stupidly I don't understand what is being done anyway" not "I will optimize that/let's rewrite the algorithm I know better". Most code is not optimized in complex non local way.
Now let's assume the absolute worse (from out point of view which is that the compiler should give up, that is the absolute best from the point of view of an optimizing algorithm):
the function is "inline" (= available for inlining) (or global optimization kicks in, or all functions are morally "inline");
no memory barrier is needed (as in a mono-processor time sharing system, and in a multi-processor strongly ordered system) in that synchronization primitive (lock or unlock) so it contains no such thing;
there is no special instruction (like compare and set) used (for example for a spin lock, the unlock operation is a simple write);
there is no system call to pause or wake threads (not needed in a spin lock);
then we might have a problem as the compiler could optimize around the function call. This is fixed trivially by inserting a compiler barrier such as an empty asm statement with a "clobber" for other accessible variables. That means that compiler just assumes that anything that might be accessible to a called function is "clobbered".
or whether the protected variable needs to be volatile.
You can make it volatile for the usual reason you make things volatile: to be certain to be able to access the variable in the debugger, to prevent a floating point variable from having the wrong datatype at runtime, etc.
Making it volatile would actually not even fix the issue described above as volatile is essentially a memory operation in the abstract machine that has the semantics of an I/O operation and as such is only ordered with respect to
real I/O like iostream
system calls
other volatile operations
asm memory clobbers (but then no memory side effect is reordered around those)
calls to external functions (as they might do one the above)
Volatile is not ordered with respect to non volatile memory side effects. That makes volatile practically useless (useless for practical uses) for writing thread safe code in even the most specific case where volatile would a priori help, the case where no memory fence is ever needed: when programming threading primitives on a time sharing system on a single CPU. (That may be one of the least understood aspects of either C or C++.)
So while volatile does prevent "caching", volatile doesn't even prevent compiler reordering of lock/unlock operation unless all shared variables are volatile.
Locks/synchronisation primitives make sure the data is not cached in registers/cpu cache, that means data propagates to memory. If two threads are accessing/ modifying data with in locks, it is guaranteed that data is read from memory and written to memory. We don't need volatile in this use case.
But the case where you have code with double checks, compiler can optimise the code and remove redundant code, to prevent that we need volatile.
Example: see singleton pattern example
https://en.m.wikipedia.org/wiki/Singleton_pattern#Lazy_initialization
Why do some one write this kind of code?
Ans: There is a performance benefit of not accuiring lock.
PS: This is my first post on stack overflow.
Not if the object you're locking is volatile, eg: if the value it represents depends on something foreign to the program (hardware state).
volatile should NOT be used to denote any kind of behavior that is the result of executing the program.
If it's actually volatile what I personally would do is locking the value of the pointer/address, instead of the underlying object.
eg:
volatile int i = 0;
// ... Later in a thread
// ... Code that may not access anything without a lock
std::uintptr_t ptr_to_lock = &i;
some_lock(ptr_to_lock);
// use i
release_some_lock(ptr_to_lock);
Please note that it only works if ALL the code ever using the object in a thread locks the same address. So be mindful of that when using threads with some variable that is part of an API.
I have a linux multithread application in C++.
In this application in class App offer variable Status:
class App {
...
typedef enum { asStop=0, asStart, asRestart, asWork, asClose } TAppStatus;
TAppStatus Status;
...
}
All threads are often check Status by calling GetStatus() function.
inline TAppStatus App::GetStatus(){ return Status };
Other functions of the application can assign a different values to a Status variable by calling SetStatus() function and do not use Mutexes.
void App::SetStatus( TAppStatus aStatus ){ Status=aStatus };
Edit: All threads use Status in switch operator:
switch ( App::GetStatus() ){ case asStop: ... case asStart: ... };
Is the assignment in this case, an atomic operation?
Is this correct code?
Thanks.
There is no portable way to implement synchronized variables in C99 or C++03 and pthread library does not provide one either. You can:
Use C++0x <atomic> header (or C1x <stdatomic.h>). Gcc does support it for C++ if given -std=c++0x or -std=gnu++0x option since version 4.4.
Use the Linux-specific <linux/atomic.h> (this is implementation used by kernel, but it should be usable from userland as well).
Use GCC-specific __sync_* builtin functions.
Use some other library that provides atomic operations like glib.
Use locks, but that's orders of magnitude slower compared to the fast operation itself.
Note: As Martinho pointed out, while they are called "atomic", for store and load it's not the atomic property (operation cannot be interrupted and load always sees or does not see the whole store, which is usually true of 32-bit stores and loads) but the ordering property (if you store a and than b, nobody may get new value of b and than old value of a) that is hard to get but necessary in this case.
This depends entirely upon the enum representation chosen. For x86, I believe, then all assignment operations of the operating system's word size (so 32bit for x86 and 64bit for x64) and alignment of that size as well are atomic, so a simple read and write is atomic.
Even assuming that it is the correct size and alignment, this doesn't mean that these functions are thread-safe, depends on what the status is used for.
Edit: In addition, the compiler's optimizer may well wreak havoc if there are no uses of atomic operations or other volatile accesses.
Edit to your edit: No, that's not thread safe at all. If you converted it manually into a jump table then you might be thread-safe, I'll need to think about it for a little while.
On certain architecture this assignment might be atomic (by accident), but even if it is, this code is wrong. Compiler and hardware may perform various optimizations, with might break this "atomicity". Look at: http://video.google.com/videoplay?docid=-4714369049736584770#
Use locks or atomic http://www.stdthread.co.uk/doc/headers/atomic/atomic.html variable to fix it.
Plain load has acquire semantics on x86, plain store has release semantics, however compiler still can reorder instructions. While fences and locked instructions (locked xchg, locked cmpxchg) prevent both hardware and compiler from reordering, plain loads and stores are still necessary to protect with compiler barriers. Visual C++ provides _ReadWriterBarrier() function, which prevents compiler from reordering, also C++ provides volatile keyword for the same reason. I write all this information just to make sure that I get everything right. So all written above is true, is there any reason to mark as volatile variables which are going to be used in functions protected with _ReadWriteBarrier()?
For example:
int load(int& var)
{
_ReadWriteBarrier();
T value = var;
_ReadWriteBarrier();
return value;
}
Is it safe to make that variable non-volatile? As far as I understand it is, because function is protected and no reordering could be done by compiler inside. On the other hand Visual C++ provides special behavior for volatile variables (different from the one that standard does), it makes volatile reads and writes atomic loads and stores, but my target is x86 and plain loads and stores are supposed to be atomic on x86 anyway, right?
Thanks in advance.
Volatile keyword is available in C too. "volatile" is often used in embedded System, especially when value of the variable may change at any time-without any action being taken by the code - three common scenarios include reading from a memory-mapped peripheral register or global variables either modified by an interrupt service routine or those within a multi-threaded program.
So it is the last scenario where volatile could be considered to be similar to _ReadWriteBarrier.
_ReadWriteBarrier is not a function - _ReadWriteBarrier does not insert any additional instructions, and it does not prevent the CPU from rearranging reads and writes— it only prevents the compiler from rearranging them. _ReadWriteBarrier is to prevent compiler reordering.
MemoryBarrier is to prevent CPU reordering!
A compiler typically rearranges instructions... C++ does not contain built-in support for multithreaded programs so the compiler assumes the code is single-threaded when reordering the code. With MSVC use _ReadWriteBarrier in the code, so that the compiler will not move reads and writes across it.
Check this link for more detailed discussion on those topics
http://msdn.microsoft.com/en-us/library/ee418650(v=vs.85).aspx
Regarding your code snippet - you do not have to use ReadWriteBarrier as a SYNC primitive - the first call to _ReadWriteBarrier is not necessary.
When using ReadWriteBarrier you do not have to use volatile
You wrote "it makes volatile reads and writes atomic loads and stores" - I don't think that is OK to say that, Atomicity and volatility are different. Atomic operations are considered to be indivisible - ... http://www.yoda.arachsys.com/csharp/threads/volatility.shtml
Note: I am not an expert on this topic, some of my statements are "what I heard on the internet", but I think I csan still clear up some misconceptions.
[edit] In general, I would rely on platform-specifics such as x86 atomic reads and lack of OOOX only in isolated, local optimizations that are guarded by an #ifdef checking the target platform, ideally accompanied by a portable solution in the #else path.
Things to look out for
atomicity of read / write operations
reordering due to compiler optimizations (this includes a different order seen by another thread due to simple register caching)
out-of-order execution in the CPU
Possible misconceptions
1. As far as I understand it is, because function is protected and no reordering could be done by compiler inside.
[edit] To clarify: the _ReadWriteBarrier provides protection against instruction reordering, however, you have to look beyond the scope of the function. _ReadWriteBarrier has been fixed in VS 2010 to do that, earlier versions may be broken (depending on the optimizations they actually do).
Optimization isn't limited to functions. There are multiple mechanisms (automatic inlining, link time code generation) that span functions and even compilation units (and can provide much more significant optimizations than small-scoped register caching).
2. Visual C++ [...] makes volatile reads and writes atomic loads and stores,
Where did you find that? MSDN says that beyond the standard, will put memory barriers around reads and writes, no guarantee for atomic reads.
[edit] Note that C#, Java, Delphi etc. have different memory mdoels and may make different guarantees.
3. plain loads and stores are supposed to be atomic on x86 anyway, right?
No, they are not. Unaligned reads are not atomic. They happen to be atomic if they are well-aligned - a fact I'd not rely on unless it's isolated and easily exchanged. Otherwise your "simplificaiton fo x86" becomes a lockdown to that target.
[edit] Unaligned reads happen:
char * c = new char[sizeof(int)+1];
load(*(int *)c); // allowed by standard to be unaligned
load(*(int *)(c+1)); // unaligned with most allocators
#pragma pack(push,1)
struct
{
char c;
int i;
} foo;
load(foo.i); // caller said so
#pragma pack(pop)
This is of course all academic if you remember the parameter must be aligned, and you control all code. I wouldn't write such code anymore, because I've been bitten to often by laziness of the past.
4. Plain load has acquire semantics on x86, plain store has release semantics
No. x86 processors do not use out-of-order execution (or rather, no visible OOOX - I think), but this doesn't stop the optimizer from reordering instructions.
5. _ReadBarrier / _WriteBarrier / _ReadWriteBarrier do all the magic
They don't - they just prevent reordering by the optimizer. MSDN finally made it a big bad warning for VS2010, but the information apparently applies to previous versions as well.
Now, to your question.
I assume the purpose of the snippet is to pass any variable N, and load it (atomically?) The straightforward choice would be an interlocked read or (on Visual C++ 2005 and later) a volatile read.
Otherwise you'd need a barrier for both compiler and CPU before the read, in VC++ parlor this would be:
int load(int& var)
{
// force Optimizer to complete all memory writes:
// (Note that this had issues before VC++ 2010)
_WriteBarrier();
// force CPU to settle all pending read/writes, and not to start new ones:
MemoryBarrier();
// now, read.
int value = var;
return value;
}
Noe that _WriteBarrier has a second warning in MSDN:
*In past versions of the Visual C++ compiler, the _ReadWriteBarrier and _WriteBarrier functions were enforced only locally and did not affect functions up the call tree. These functions are now enforced all the way up the call tree.*
I hope that is correct. stackoverflowers, please correct me if I'm wrong.
I'm not quite sure if it's safe to spin on a volatile variable in user-mode threads, to implement a light-weight spin_lock, I looked at the tbb source code, tbb_machine.h:170,
//! Spin WHILE the value of the variable is equal to a given value
/** T and U should be comparable types. */
template<typename T, typename U>
void spin_wait_while_eq( const volatile T& location, U value ) {
atomic_backoff backoff;
while( location==value ) backoff.pause();
}
And there is no fences in atomic_backoff class as I can see. While from other user-mode spin_lock implementation, most of them use CAS (Compare and Swap).
Tricky. I'd say that in theory, this code isn't safe. If there are no memory barriers then the data accesses you're guarding could be moved across the spinlock. However, this would only be done if the compiler inlined very aggressively, and could see a purpose in this reordering.
Perhaps Intel simply determined that this code works on all current compilers, and that even though a compiler could theoretically perform transformations that'd break this code, those transformations wouldn't speed up the program, and so compilers probably won't do it.
Another possibility is that this code is only used on compilers that implicitly implement memory barriers around volatile accesses. Microsoft's Visual C++ compiler (2005 and later) does this. Have you checked if this is wrapped in an #ifdef block or similar, applying this implementation only on compilers where volatile use memory barriers?
Note that this is a spin wait and not a spin lock. There is no write operation involved. Nothing is acquired.
If you tried to add a write operation to complete the locking process, you would have an unsolvable race.