InterlockedExchange vs InterlockedCompareExchange spin locks - c++

I've written a basic spin lock (see below) using InterlockedExchange. However I've seen a lot of implementations use InterlockedCompareExchange instead. Is mine incorrect in some unforeseen way, and if not, what are the pros and cons of each approach (if indeed there are any)?
PS I know the sleep is expensive and I'd want to have an attempt count before I call it.
class SpinLock
{
public:
    SpinLock() : m_lock( 0 ) {}
    ~SpinLock() {}

    void Lock()
    {
        while( InterlockedExchange( &m_lock, 1 ) == 1 )
        {
            Sleep( 0 );
        }
    }

    void Unlock()
    {
        InterlockedExchange( &m_lock, 0 );
    }

private:
    volatile unsigned int m_lock;
};

First of all, InterlockedExchange takes a LONG. Please repeat after me: a LONG isn't the same as an int. This may seem like a small thing, but it can cause you grief.
Now, to elaborate a little on what Mats Petersson said:
Your spinlock will have horrible performance since the InterlockedExchange loop in Lock will modify the m_lock variable unconditionally, causing a lot of work to be done by the processors behind the scenes to maintain cache coherency.
To make matters worse, by not ensuring that your m_lock variable is on a cache line by itself, the above effect is amplified and could affect other data, unlucky enough to share the cache line with the instance of your spinlock.
These are just two moderately subtle issues with this code. There are others. The simple fact is that locks aren't easy to get right, and you shouldn't be implementing custom locking primitives. Please don't reinvent the wheel. Use the facilities provided to you by the operating system. It's unlikely they themselves are a bottleneck.
If you do find you have a performance issue (that is, you have profiling data that suggests a performance bottleneck) first focus on algorithmic changes and on improving parallelization and reducing lock contention. If the problem persists then and only then look elsewhere.

There is very little difference between CMPXCHG and XCHG (which are the x86 instructions you'd get from the two intrinsic functions you mention).
I think the main difference is that in an SMP system with a lot of contention on the lock, you don't get a bunch of writes when the value is already "locked" - which means that the other processors don't have to read back a value that is already there in their cache.
In a debug build, you'd also want to ensure that Unlock() is only called from the current owner of the lock!
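To make the difference concrete, here is a minimal test-and-test-and-set sketch (my own, not taken from the question) that addresses both points above: it spins on a plain read and only attempts the atomic exchange when the lock looks free, and it keeps the LONG lock word alone on its cache line. Treat it as an illustration, not a replacement for the OS-provided locks.
// Sketch only; assumes <windows.h>. Test-and-test-and-set with a padded LONG.
class TtasSpinLock
{
public:
    void Lock()
    {
        for( ;; )
        {
            // Cheap local spin: read without writing, so other cores keep their cache lines.
            while( m_lock != 0 )
                YieldProcessor();        // emits a pause instruction
            // Only now attempt the atomic 0 -> 1 transition.
            if( InterlockedCompareExchange( &m_lock, 1, 0 ) == 0 )
                return;
        }
    }
    void Unlock()
    {
        InterlockedExchange( &m_lock, 0 );
    }
private:
    // A LONG (not an int), alone on its cache line to avoid false sharing.
    alignas( 64 ) volatile LONG m_lock = 0;
};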

Can/should non-lock-free atomics be implemented with a SeqLock?

In both the MSVC STL and LLVM libc++ implementations, std::atomic for sizes that cannot use native atomic instructions is implemented using a spin lock.
libc++ (Github):
_LIBCPP_INLINE_VISIBILITY void __lock() const volatile {
    while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
}
_LIBCPP_INLINE_VISIBILITY void __lock() const {
    while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
}
MSVC (Github) (recently discussed in this Q&A):
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff   = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif
    /* ... */
#endif
}
While thinking of a possible better implementation, I wonder whether it would be feasible to replace this with a SeqLock? The advantage would be cheap reads, as long as reads don't contend with writes.
Another thing I'm questioning is whether a SeqLock could be improved to use an OS wait. It appears to me that if a reader observes an odd count, it can wait on the underlying atomic-wait mechanism (Linux futex / Windows WaitOnAddress), thus avoiding the starvation problem of a spinlock.
To me it looks possible. Though the C++ memory model doesn't currently cover a SeqLock, types in std::atomic must be trivially copyable, so memcpy reads/writes in the SeqLock will work and will deal with races, provided sufficient barriers are used to get a volatile-equivalent without defeating optimizations too badly. This will be part of a specific C++ implementation's header files, so it doesn't have to be portable.
Existing SO Q&As about implementing a SeqLock in C++ (perhaps using other std::atomic operations):
Implementing 64 bit atomic counter with 32 bit atomics
how to implement a seqlock lock using c++11 atomic library
Yes, you can use a SeqLock as a readers/writers lock if you provide mutual exclusion between writers. You'd still get read-side scalability, while writes and RMWs would stay about the same.
It's not a bad idea, although it has potential fairness problems for readers if you have very frequent writes. Maybe not a good idea for a mainstream standard library, at least not without some testing with some different workloads / use-cases on a range of hardware, since working great on some machines but faceplanting on others is not what you want for standard library stuff. (Code that wants great performance for its special case often unfortunately has to use an implementation that's tuned for it, not the standard one.)
Mutual exclusion is possible with a separate spinlock, or just using the low bit of the sequence number. In fact I've seen other descriptions of a SeqLock that assumed you'd be using it with multiple writers, and didn't even mention the single-writer case that allows pure-load and pure-store for the sequence number to avoid the cost of an atomic RMW.
How to use the sequence number as a spinlock
A writer or RMWer attempts to atomically CAS the sequence number to increment (if it wasn't already odd). If the sequence number is already odd, writers just spin until they see an even value.
This would mean writers have to start by reading the sequence number before trying to write, which can cause extra coherency traffic (MESI Share request, then RFO). On a machine that actually had a fetch_or in hardware, you could use that to atomically make the count odd and see if you won the race to take it from even to odd.
On x86-64, you can use lock bts to set the low bit and find out what the old low bit was, then load the whole sequence number if it was previously even (because you won the race, no other writer is going to be modifying it). So you can do a release-store of that plus 1 to "unlock" instead of needing a lock add.
Making other writers faster at reclaiming the lock may actually be a bad thing, though: you want to give a window for readers to complete. Maybe just use multiple pause instructions (or equivalent on non-x86) in write-side spin loops, more than in read-side spins. If contention is low, readers probably had time to see it before writers got to it, otherwise writers will frequently see it locked and go into the slower spin loop. Maybe with faster-increasing backoff for writers, too.
An LL/SC machine could (in asm at least) test-and-increment just as easily as CAS or TAS. I don't know how to write pure C++ that would compile to just that. fetch_or could compile efficiently for LL/SC, but still to a store even if it was already odd. (If you have to LL separately from SC, you might as well make the most of it and not store at all if it will be useless, and hope that the hardware is designed to make the best of things.)
(It's critical to not unconditionally increment; you must not unlock another writer's ownership of the lock. But an atomic-RMW that leaves the value unchanged is always ok for correctness, if not performance.)
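To make the sequence-number-as-lock scheme concrete, here is a minimal multi-writer SeqLock sketch with C++11 atomics (my own construction, not taken from any implementation). Writers CAS the counter from even to odd to claim ownership; readers retry if the count is odd or changes across the copy. The plain memcpy of the payload glosses over the data-race problem discussed above; a real implementation would need volatile-equivalent or per-word atomic accesses for the payload.
#include <atomic>
#include <cstring>
#include <type_traits>

template <class T>
class SeqLocked {
    static_assert(std::is_trivially_copyable<T>::value, "payload must be trivially copyable");
public:
    void store(const T& desired) {
        unsigned seq = seq_.load(std::memory_order_relaxed);
        for (;;) {
            if (seq & 1) {                                  // another writer owns the lock
                seq = seq_.load(std::memory_order_relaxed); // spin until it goes even
                continue;
            }
            // Claim the lock by taking the count from even to odd.
            if (seq_.compare_exchange_weak(seq, seq + 1, std::memory_order_acquire,
                                           std::memory_order_relaxed))
                break;
        }
        std::memcpy(&data_, &desired, sizeof(T));           // payload write (see caveat above)
        seq_.store(seq + 2, std::memory_order_release);     // back to even: unlock and publish
    }

    T load() const {
        T result;
        for (;;) {
            unsigned before = seq_.load(std::memory_order_acquire);
            if (before & 1)
                continue;                                   // write in progress, retry
            std::memcpy(&result, &data_, sizeof(T));        // payload read (see caveat above)
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq_.load(std::memory_order_relaxed) == before)
                return result;                              // no writer intervened
        }
    }

private:
    std::atomic<unsigned> seq_{0};
    T data_{};
};
In the single-writer case mentioned earlier, the CAS in store() could be replaced by a plain relaxed load plus a store of seq + 1, avoiding the atomic RMW entirely.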
It may not be a good idea by default because of bad results with heavy write activity making it potentially hard for a reader to get a successful read done. As Wikipedia points out:
The reader never blocks, but it may have to retry if a write is in progress; this speeds up the readers in the case where the data was not modified, since they do not have to acquire the lock as they would with a traditional read–write lock. Also, writers do not wait for readers, whereas with traditional read–write locks they do, leading to potential resource starvation in a situation where there are a number of readers (because the writer must wait for there to be no readers). Because of these two factors, seqlocks are more efficient than traditional read–write locks for the situation where there are many readers and few writers. The drawback is that if there is too much write activity or the reader is too slow, they might livelock (and the readers may starve).
The "too slow reader" problem is unlikely, just a small memcpy. Code shouldn't expect good results from std::atomic<T> for very large T; the general assumption is that you'd only bother with std::atomic for a T that can be lock-free on some implementations. (Usually not including transactional memory since mainstream implementations don't do that.)
But the "too much write" problem could still be real: SeqLock is best for read-mostly data. Readers may have a bad time with a heavy write mix, retrying even more than with a simple spinlock or a readers-writers lock.
It would be nice if there were a way to make this an option for an implementation, like an optional template parameter such as std::atomic<T, true>, or a #pragma, or a #define before including <atomic>. Or a command-line option.
An optional template param affects every use of the type, but might be slightly less clunky than a separate class name like gnu::atomic_seqlock<T>. An optional template param would still make these types be std::atomic, so e.g. specializations of other things for std::atomic would still match. But it might break other things, IDK.
Might be fun to hack something up to experiment with.

Is this approach of barriers right?

I have found that pthread_barrier_wait is quite slow, so at one place in my code I replaced pthread_barrier_wait with my own version of a barrier (my_barrier), which uses an atomic variable. I found it to be much faster than pthread_barrier_wait. Is there any flaw in using this approach? Is it correct? Also, I don't know why it is faster than pthread_barrier_wait. Any clue?
EDIT
I am primarily interested in cases where the number of threads equals the number of cores.
atomic<int> thread_count = 0;

void my_barrier()
{
    thread_count++;
    while( thread_count % NUM_OF_THREADS )
        sched_yield();
}
Your barrier implementation does not work, at least not if the barrier will be used more than once. Consider this case:
NUM_OF_THREADS-1 threads are waiting at the barrier, spinning.
Last thread arrives and passes through the barrier.
Last thread exits barrier, continues processing, finishes its next task, and reenters the barrier wait.
Only now do the other waiting threads get scheduled, and they can't exit the barrier because the counter was incremented again. Deadlock.
In addition, one often-overlooked but nasty issue with dynamically allocated barriers is destroying/freeing them. You'd like any one of the threads to be able to perform the destroy/free after the barrier wait returns, as long as you know nobody will be trying to wait on it again, but this requires making sure all waiters have finished touching memory in the barrier object before any waiters wake up - not an easy problem to solve. See my past questions on implementing barriers...
How can barriers be destroyable as soon as pthread_barrier_wait returns?
Can a correct fail-safe process-shared barrier be implemented on Linux?
And unless you know you have a special-case where none of the difficult problems apply, don't try implementing your own for an application.
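For comparison, the usual way to make a reusable spinning barrier avoid the deadlock described above is a generation (sense-reversing) counter: waiters spin on a generation number, so a fast thread re-entering the barrier can't confuse threads still leaving the previous round. A minimal sketch with C++11 atomics (my own, and it still busy-waits, with all the caveats that entails):
#include <atomic>
#include <sched.h>

class SpinBarrier {
public:
    explicit SpinBarrier(int n) : total_(n), waiting_(0), generation_(0) {}

    void wait() {
        unsigned gen = generation_.load(std::memory_order_acquire);
        if (waiting_.fetch_add(1, std::memory_order_acq_rel) + 1 == total_) {
            // Last thread to arrive: reset the counter, then open the next generation.
            waiting_.store(0, std::memory_order_relaxed);
            generation_.fetch_add(1, std::memory_order_release);
        } else {
            // Spin on the generation, not on the counter, so re-entry can't deadlock.
            while (generation_.load(std::memory_order_acquire) == gen)
                sched_yield();
        }
    }

private:
    const int total_;
    std::atomic<int> waiting_;
    std::atomic<unsigned> generation_;
};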
AFAICT it's correct, and it looks like it's faster, but in the highly contended case it'll be a lot worse. The highly contended case is when you have lots of threads, way more than CPUs.
There's a way to make fast barriers though, using eventcounts (search for them on Google).
// Pseudocode: eventcount_* here is a hypothetical eventcount API, not a standard library.
struct barrier {
    atomic<int> count;
    struct eventcount ec;
};

void my_barrier_wait(struct barrier *b)
{
    eventcount_key_t key;

    if (--b->count == 0) {
        // Last thread to arrive wakes everyone waiting on the eventcount.
        eventcount_broadcast(&b->ec);
        return;
    }

    for (;;) {
        // Capture the eventcount state, re-check the condition, then wait.
        key = eventcount_get(&b->ec);
        if (!b->count)
            return;
        // A typical eventcount API would take 'key' here, so the wait is skipped
        // if a broadcast happened between eventcount_get and this call.
        eventcount_wait(&b->ec);
    }
}
This should scale way better.
Though frankly, when you use barriers, I don't think performance matters much; a barrier is not supposed to be an operation that needs to be fast. This looks a lot like premature optimization.
Your barrier should be correct from what I can see, as long as you don't use the barrier too often or your thread number is a power of two. Theoretically your atomic counter will overflow somewhere (after hundreds of millions of uses for typical core counts, but still), so you might want to add some functionality to reset it somewhere.
Now to why it is faster: I'm not entirely sure, but I think pthread_barrier_wait will let the thread sleep till it is time to wake up. Yours is spinning on the condition, yielding in each iteration. However, if there is no other application/thread which needs the processing time, the thread will likely be scheduled again directly after the yield, so the wait time is shorter. At least that's what playing around with this kind of barrier seemed to indicate on my system.
As a side note: since you use atomic<int> I assume you use C++11. Wouldn't it make sense to use std::this_thread::yield() instead of sched_yield() in that case to remove the dependency on pthreads?
This link might also be interesting for you; it measures the performance of various barrier implementations (yours is roughly the lock xadd + while(i<NCPU) case, except for the yielding).

IPC through shared memory with atomic_t; is it good for x86?

I have the following code for interprocess communication through shared memory. One process writes to a log and the other reads from it. One way is to use semaphores, but here I'm using an atomic flag (log_flag) of type atomic_t which resides inside the shared memory. The log (log_data) is also shared.
Now the question is, would this work on the x86 architecture, or do I need semaphores or mutexes? What if I make log_flag non-atomic? Given that x86 has a strict memory model and proactive cache coherence, and optimizations are not applied to pointers, I think it would still work?
EDIT: Note that I have a multicore processor with 8 cores, so I don't have any problem with busy waits here!
// Process 1 calls this function
void write_log( void * data, size_t size )
{
    while( *log_flag )
        ;
    memcpy( log_data, data, size );
    *log_flag = 1;
}

// Process 2 calls this function
void read_log( void * data, size_t size )
{
    while( !( *log_flag ) )
        ;
    memcpy( data, log_data, size );
    *log_flag = 0;
}
You may want to use the following macro in the loop, to avoid stressing the memory bus:
#if defined(__x86_64) || defined(__i386)
#define cpu_relax() __asm__("pause":::"memory")
#else
#define cpu_relax() __asm__("":::"memory")
#endif
Also, the "memory" clobber makes it act as a compiler barrier, so there is no need to declare log_flag as volatile.
But I think this is overkill; it should only be done for hard real-time stuff. You should be fine using a futex. And maybe you could simply use a pipe; it's sufficiently fast for almost all purposes.
I wouldn't recommend that for two reasons: first, although pointer access may not be optimized by the compiler, that doesn't mean the pointed value won't be cached by the processor. Second, the fact that it is atomic won't prevent a read access between the end of the while loop and the line that does *log_flag=0. A mutex is safer, though a lot slower.
If you're using pthreads, consider using an RW mutex to protect the whole buffer, that way you don't need a flag to control it, the mutex is itself the flag and you'll have better performance when doing frequent reads.
I also don't recommend empty while() loops; you'll hog the processor that way. Put a usleep(1000) inside the loop to give the processor a chance to breathe.
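As a sketch of the RW-mutex suggestion above (my own code, reusing the question's log_data buffer): a pthread_rwlock_t replaces the flag entirely. Note that for two separate processes, as in the question, the lock would have to live in the shared segment and be initialized with the PTHREAD_PROCESS_SHARED attribute, which is glossed over here.
#include <pthread.h>
#include <string.h>

pthread_rwlock_t log_lock = PTHREAD_RWLOCK_INITIALIZER;

void write_log_rw( void * data, size_t size )
{
    pthread_rwlock_wrlock( &log_lock );   // exclusive access while writing
    memcpy( log_data, data, size );
    pthread_rwlock_unlock( &log_lock );
}

void read_log_rw( void * data, size_t size )
{
    pthread_rwlock_rdlock( &log_lock );   // many readers may hold this at once
    memcpy( data, log_data, size );
    pthread_rwlock_unlock( &log_lock );
}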
There are a whole bunch of reasons why you should use a semaphore and not rely on a flag.
Your read-log while loop is spinning unnecessarily. This wastes system resources such as power, and it also means that the CPU cannot be used for other tasks.
I would be surprised if x86 fully guarantees read and write ordering. Incoming data may set the log flag to 1 only to have outgoing data set it to 0. This may potentially mean that you end up losing data.
I don't know where you got the idea that optimizations are not applied to pointers in general. Optimizations can be applied anywhere that makes no observable difference. The compiler will probably not know that log_flag can be changed by a concurrent process.
Problem 2 may appear only rarely, and tracking down the issue will be hard. So do yourself a favour and use the correct operating system primitives. They will guarantee that things work as expected.
As long as log_flag is atomic you will be fine.
If log_flag was just a regular bool, you have no guarantee it will work.
The compiler could reorder your instructions:
*log_flag = 1;
memcpy( log_data, data, size );
This is semantically identical on a uniprocessor system as long as log_flag is not accessed inside memcpy. Your only saving grace may be an inferior optimizer that can't deduce which variables are accessed in memcpy.
The cpu can reorder your instructions
It may choose to load the log_flag before the loop to optimize the pipeline.
The cache may reorder your memory writes.
The cache line that contains log_flag may get synced to the other processor before the cache line containing data.
What you need is a way to tell the compiler, cpu, and cache "hands off", so that they don't make assumptions about the order. That can only be done with a memory fence. std::atomic, std::mutex, and semaphore all have the correct memory fence instructions embedded in their code.
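For comparison, here is roughly what the flag handshake looks like with C++11 atomics instead of a bare flag; the acquire loads and release stores supply exactly the ordering the answer above is talking about. Names are illustrative, and for a cross-process setup the atomic would additionally have to live in the shared mapping and be lock-free (address-free).
#include <atomic>
#include <cstring>
#include <cstddef>

// Illustrative single-slot handshake; log_flag and log_data would live in shared memory.
std::atomic<int> log_flag{0};
char             log_data[256];

void write_log( const void * data, std::size_t size )
{
    while( log_flag.load( std::memory_order_acquire ) != 0 )
        ;                                            // wait until the reader has drained the slot
    std::memcpy( log_data, data, size );
    log_flag.store( 1, std::memory_order_release );  // publish: the memcpy can't sink below this
}

void read_log( void * data, std::size_t size )
{
    while( log_flag.load( std::memory_order_acquire ) == 0 )
        ;                                            // wait for the writer to publish
    std::memcpy( data, log_data, size );
    log_flag.store( 0, std::memory_order_release );  // hand the slot back to the writer
}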

C++ Thread Safe Integer

I have currently created a C++ class for a thread-safe integer which simply stores an integer privately and has public get and set functions which use a boost::mutex to ensure that only one change at a time can be applied to the integer.
Is this the most efficient way to do it, I have been informed that mutexes are quite resource intensive? The class is used a lot, very rapidly so it could well be a bottleneck...
Googling "C++ thread safe integer" returns unclear views and opinions on the thread safety of integer operations on different architectures.
Some say that a 32-bit int on a 32-bit arch is safe, but 64 on 32 isn't due to 'alignment'. Others say it is compiler/OS specific (which I don't doubt).
I am using Ubuntu 9.10 on 32 bit machines, some have dual cores and so threads may be executed simultaneously on different cores in some cases and I am using GCC 4.4's g++ compiler.
Thanks in advance...
Please Note: The answer I have marked as 'correct' was most suitable for my problem - however there are some excellent points made in the other answers and they are all worth reading!
There is the C++0x atomic library, and there is also a Boost.Atomic library under development; both use lock-free techniques.
It's not compiler and OS specific, it's architecture specific. The compiler and OS come into it because they're the tools you work through, but they're not the ones setting the real rules. This is why the C++ standard won't touch the issue.
I have never in my life heard of a 64-bit integer write, which can be split into two 32-bit writes, being interrupted halfway through. (Yes, that's an invitation to others to post counterexamples.) Specifically, I have never heard of a CPU's load/store unit allowing a misaligned write to be interrupted; an interrupting source has to wait for the whole misaligned access to complete.
To have an interruptible load/store unit, its state would have to be saved to the stack... and the load/store unit is what saves the rest of the CPU's state to the stack. This would be hugely complicated, and bug prone, if the load/store unit were interruptible... and all that you would gain is one cycle less latency in responding to interrupts, which, at best, is measured in tens of cycles. Totally not worth it.
Back in 1997, a coworker and I wrote a C++ Queue template which was used in a multiprocessing system. (Each processor had its own OS running, and its own local memory, so these queues were only needed for memory shared between processors.) We worked out a way to make the queue change state with a single integer write, and treated this write as an atomic operation. Also, we required that each end of the queue (i.e. the read or write index) be owned by one and only one processor. Thirteen years later, the code is still running fine, and we even have a version that handles multiple readers.
Still, if you want to treat a 64-bit integer write as atomic, align the field to a 64-bit boundary. Why worry?
EDIT: For the case you mention in your comment, I'd need more information to be sure, so let me give an example of something that could be implemented without specialized synchronization code.
Suppose you have N writers and one reader. You want the writers to be able to signal events to the reader. The events themselves have no data; you just want an event count, really.
Declare a structure for the shared memory, shared between all writers and the reader:
#include <stdint.h>

struct FlagTable
{
    uint32_t flag[NWriters];
};
(Make this a class or template or whatever as you see fit.)
Each writer needs to be told its index and given a pointer to this table:
class Writer
{
public:
    Writer(FlagTable* flags_, size_t index_): flags(flags_), index(index_) {}
    void SignalEvent(uint32_t eventCount = 1);
private:
    FlagTable* flags;
    size_t index;
};
When the writer wants to signal an event (or several), it updates its flag:
void Writer::SignalEvent(uint32_t eventCount)
{
    // Effectively atomic: only one writer modifies this value, and
    // the state changes when the incremented value is written out.
    flags->flag[index] += eventCount;
}
The reader keeps a local copy of all the flag values it has seen:
class Reader
{
public:
    Reader(FlagTable* flags_): flags(flags_)
    {
        for(size_t i = 0; i < NWriters; ++i)
            seenFlags[i] = flags->flag[i];
    }
    bool AnyEvents(void);
    uint32_t CountEvents(int writerIndex);
private:
    FlagTable* flags;
    uint32_t seenFlags[NWriters];
};
To find out if any events have happened, it just looks for changed values:
bool Reader::AnyEvents(void)
{
    for(size_t i = 0; i < NWriters; ++i)
        if(seenFlags[i] != flags->flag[i])
            return true;
    return false;
}
If something happened, we can check each source and get the event count:
uint32_t Reader::CountEvents(int writerIndex)
{
    // Only read a flag once per function call. If you read it twice,
    // it may change between reads and then funny stuff happens.
    uint32_t newFlag = flags->flag[writerIndex];
    // Our local copy, though, we can mess with all we want since there
    // is only one reader.
    uint32_t oldFlag = seenFlags[writerIndex];
    // Next line atomically changes Reader state, marking the events as counted.
    seenFlags[writerIndex] = newFlag;
    return newFlag - oldFlag;
}
Now the big gotcha in all this? It's nonblocking, which is to say that you can't make the Reader sleep until a Writer writes something. The Reader has to choose between sitting in a spin-loop waiting for AnyEvents() to return true, which minimizes latency, or it can sleep a bit each time through, which saves CPU but could let a lot of events build up. So it's better than nothing, but it's not the solution to everything.
Using actual synchronization primitives, one would only need to wrap this code with a mutex and condition variable to make it properly blocking: the Reader would sleep until there was something to do. Since you used atomic operations with the flags, you could actually keep the amount of time the mutex is locked to a minimum: the Writer would only need to lock the mutex long enough to send the condition, and not set the flag, and the reader only needs to wait for the condition before calling AnyEvents() (basically, it's like the sleep-loop case above, but with a wait-for-condition instead of a sleep call).
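A sketch of that mutex-plus-condition-variable wrapper might look like the following (my own names, and it assumes the Writer/Reader classes above); the flag update itself stays outside the lock, which is only held long enough to signal or to wait.
#include <condition_variable>
#include <mutex>

std::mutex              g_eventMutex;
std::condition_variable g_eventCond;

void SignalEventBlocking(Writer& w, uint32_t eventCount = 1)
{
    w.SignalEvent(eventCount);                        // flag update happens outside the lock
    std::lock_guard<std::mutex> lk(g_eventMutex);     // lock held only long enough to signal
    g_eventCond.notify_one();
}

void WaitForEvents(Reader& r)
{
    std::unique_lock<std::mutex> lk(g_eventMutex);
    g_eventCond.wait(lk, [&]{ return r.AnyEvents(); });  // re-checks the flags after every wakeup
}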
C++ has no real atomic integer implementation, and neither do most common libraries.
Consider that even if such an implementation existed, it would have to fall back on some sort of mutex, because you cannot guarantee atomic operations across all architectures.
As you're using GCC, and depending on what operations you want to perform on the integer, you might get away with GCC's atomic builtins.
These might be a bit faster than mutexes, but in some cases still a lot slower than "normal" operations.
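As a rough illustration for the GCC 4.4 mentioned in the question, a counter built on those builtins might look like the sketch below. The class is my own, but __sync_fetch_and_add, __sync_add_and_fetch, and __sync_sub_and_fetch are real GCC builtins; on newer compilers you would simply use std::atomic<int> instead.
// Minimal counter sketch using GCC's legacy __sync builtins.
class AtomicCounter
{
public:
    explicit AtomicCounter(int initial = 0) : value(initial) {}

    int get()
    {
        // Full-barrier read: add 0 and return the value that was there.
        return __sync_fetch_and_add(&value, 0);
    }

    int increment()    // returns the new value
    {
        return __sync_add_and_fetch(&value, 1);
    }

    int decrement()    // returns the new value
    {
        return __sync_sub_and_fetch(&value, 1);
    }

private:
    int value;
};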
For full, general purpose synchronization, as others have already mentioned, the traditional synchronization tools are pretty much required. However, for certain special cases it is possible to take advantage of hardware optimizations. Specifically, most modern CPUs support atomic increment & decrement on integers. The GLib library has pretty good cross-platform support for this. Essentially, the library wraps CPU & compiler specific assembly code for these operations and defaults to mutex protection where they're not available. It's certainly not very general-purpose but if you're only interested in maintaining a counter, this might be sufficient.
You can also have a look at the atomic ops section of Intel's Threading Building Blocks or the atomic_ops project.

Synchronizing access to variable

I need to provide synchronization to some members of a structure.
If the structure is something like this
struct SharedStruct {
    int Value1;
    int Value2;
};
and I have a global variable
SharedStruct obj;
I want that the write from a processor
obj.Value1 = 5; // Processor B
to be immediately visible to the other processors, so that when I test the value
if(obj.Value1 == 5) { DoSmth(); } // Processor A
else DoSmthElse();
to get the new value, not some old value from the cache.
First I thought that using volatile when writing/reading the values would be enough, but I read that volatile can't solve this kind of issue.
The members are guaranteed to be properly aligned on 2/4/8 byte boundaries, and writes should be atomic in this case, but I'm not sure how the cache could interfere with this.
Would using memory barriers (mfence, sfence, etc.) be enough? Or are some interlocked operations required?
Or maybe something like
lock mov addr, REGISTER
?
The easiest would obviously be some locking mechanism, but speed is critical and I can't afford locks :(
Edit
Maybe I should clarify a bit. The value is set only once (it behaves like a flag). All the other threads just need to read it. That's why I think there may be a way to force the read of this new value without using locks.
Thanks in advance!
There Ain't No Such Thing As A Free Lunch. If your data is being accessed from multiple threads, and it is necessary that updates are immediately visible by those other threads, then you have to protect the shared struct by a mutex, or a readers/writers lock, or some similar mechanism.
Performance is a valid concern when synchronizing code, but it is trumped by correctness. Generally speaking, aim for correctness first and then profile your code. Worrying about performance when you haven't yet nailed down correctness is premature optimization.
Use explicitly atomic instructions. I believe most compilers offer these as intrinsics. Compare and Exchange is another good one.
If you intend to write a lockless algorithm, you need to write it so that your changes only take effect when conditions are as expected.
For example, if you intend to insert a linked list object, use the compare/exchange stuff so that it only inserts if the pointer still points at the same location when you actually do the update.
Or if you are going to decrement a reference count and free the memory at count 0, you will want to pre-free it by making it unavailable somehow, check that the count is still 0 and then really free it. Or something like that.
Using a lock, operate, unlock design is generally a lot easier. The lock-free algorithms are really difficult to get right.
All the other answers here seem to hand wave about the complexities of updating shared variables using mutexes, etc. It is true that you want the update to be atomic.
And you could use various OS primitives to ensure that, and that would be good programming style.
However, on most modern processors (certainly the x86), writes of small, aligned scalar values are atomic and immediately visible to other processors due to cache coherency.
So in this special case, you don't need all the synchronizing junk; the hardware does the atomic operation for you. Certainly this is safe with 4-byte values (e.g., "int" in 32-bit C compilers).
So you could just initialize Value1 with an uninteresting value (say 0) before you start the parallel threads, and simply write other values there. If the question is exiting the loop on a fixed value (e.g., if value1 == 5) this will be perfectly safe.
If you insist on capturing the first value written, this won't work. But if you have a parallel set of threads, and any value written other than the uninteresting one will do, this is also fine.
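If you later have C++11 available, the set-once flag in the question maps directly onto a std::atomic<int> with a release store on the writing processor and acquire loads on the readers. The sketch below uses illustrative names and, per the caveat in the next answer, it still does not make the whole if/else atomic.
#include <atomic>

void DoSmth();
void DoSmthElse();

struct SharedStruct {
    std::atomic<int> Value1{0};   // written once by processor B, read by the others
    int Value2;
};

SharedStruct obj;

void writer()                     // Processor B
{
    // Release store: anything written before this line is visible to a
    // thread that later observes Value1 == 5 with an acquire load.
    obj.Value1.store(5, std::memory_order_release);
}

void reader()                     // Processor A
{
    if (obj.Value1.load(std::memory_order_acquire) == 5)
        DoSmth();
    else
        DoSmthElse();
}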
I second peterb's answer to aim for correctness first. Yes, you can use memory barriers here, but they will not do what you want.
You said "immediately". However, no matter how immediate this update can be, you could (and will) end up with the if() condition being evaluated, then the flag being set, and then DoSmthElse() being executed afterwards. This is called a race condition...
You want to synchronize something, it seems, but it is not this flag.
Making the field volatile should make the change "immediately" visible in other threads, but there is no guarantee that the instant at which thread A executes the update doesn't occur after thread B tests the value but before thread B executes the body of the if/else statement.
It sounds like what you really want to do is make that if/else statement atomic, and that will require either a lock, or an algorithm that is tolerant of this sort of situation.