Parsing compare_and_swap - concurrency

I am trying to parse compare_and_swap, because honestly I don't understand a thing.
do {
    while (compare_and_swap(&lock, 0, 1) != 0)
        ;   /* do nothing */
    /* critical section */
    lock = 0;
    /* remainder section */
} while (true);

int compare_and_swap(int *value, int expected, int new_value) {
    int temp = *value;
    if (*value == expected)
        *value = new_value;
    return temp;
}
Let's assume lock is initially 0.
First pass: compare_and_swap returns 0, lock is set to 1, the critical section runs, then lock is set back to 0.
Rinse and repeat.
I don't know if I parsed it well, but is it possible that compare_and_swap takes one less cycle than test_and_set to do the same thing (ensure safe concurrency between threads)?

It's difficult to tell what you are actually asking here, but that is not a correct implementation of compare_and_swap, since it is not atomic.
A correct, atomic implementation will generally take advantage of a compare-and-swap CPU instruction, e.g. LOCK CMPXCHG on x86. However, you are probably much better off using a function provided by the system library for performing compare-and-swap (such as InterlockedCompareExchange in the Windows API), or the compare-exchange member functions of std::atomic<> in C++11.
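For illustration only, here is a rough sketch of the question's spin-loop rewritten against std::atomic<int>. This is my own sketch, not the original code; the variable name lock and the surrounding function are placeholders.
#include <atomic>

std::atomic<int> lock{0};   // 0 = unlocked, 1 = locked

void worker() {
    do {
        int expected = 0;
        // spin until we atomically change lock from 0 to 1
        while (!lock.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
            expected = 0;   // a failed CAS writes the observed value back into expected
        }
        /* critical section */
        lock.store(0, std::memory_order_release);   // unlock
        /* remainder section */
    } while (true);
}
Here the atomicity comes from compare_exchange_weak itself, so there is no window between the compare and the swap as there is in the hand-written function above.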

Related

Difference between Interlocked, InterlockedAcquire, and InterlockedRelease if single thread reordering is impossible

In all likelihood, a lockless implementation is already overkill for the purposes of my application, but I wanted to look into memory barriers and lockless-ness anyways in case I ever actually need to use these concepts in the future.
From what I can tell:
an "InterlockedAcquire" function performs an atomic operation while preventing the compiler from moving code statements after the InterlockedAcquire to before the InterlockedAcquire.
an "InterlockedRelease" function performs an atomic operation while preventing the compiler from moving code statements before the InterlockedRelease to after the InterlockedRelease.
a vanilla "Interlocked" function performs an atomic operation while preventing the compiler from moving code statements in either direction across the Interlocked call.
My question is, if a function is structured such that the compiler can't reorder any of the code anyway because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
For a more concrete example, here's my current application - the produce() function as part of what will eventually be a multiple producer, single consumer queue built using a circular buffer:
template <typename T>
class Queue {
private:
    long headIndex;
    long tailIndex;
    T* array[MAXQUEUESIZE];
public:
    Queue() {
        headIndex = 0;
        tailIndex = 0;
        memset(array, 0, MAXQUEUESIZE * sizeof(void*));
    }
    ~Queue() {
    }
    bool produce(T value) {
        //1) prevents concurrent calls to produce() from causing corruption:
        long indexRetVal;
        long reservedIndex;
        do {
            reservedIndex = tailIndex;
            indexRetVal = InterlockedCompareExchange64(&tailIndex, (reservedIndex + 1) % MAXQUEUESIZE, reservedIndex);
        } while (indexRetVal != reservedIndex);
        //2) allocates the node.
        T* newValPtr = (T*) malloc(sizeof(T));
        if (newValPtr == nullptr) {
            OutputDebugString("Queue: malloc returned null");
            return false;
        }
        *newValPtr = value;
        //3) prevents a concurrent call to consume from causing corruption by atomically replacing the old pointer:
        T* valPtrRetVal = InterlockedCompareExchangePointer(array + reservedIndex, newValPtr, nullptr);
        //if the previous value wasn't null, then our circular buffer overflowed:
        if (valPtrRetVal != nullptr) {
            OutputDebugString("Queue: circular buffer overflowed");
            free(newValPtr); //as pointed out by RbMm
            return false;
        }
        //otherwise, everything worked fine
        return true;
    }
};
As I understand it, 3) will occur after 1) and 2) regardless of what I do anyway, but I should change 1) to an InterlockedRelease because I don't care whether it occurs before or after 2) and I should let the compiler decide.
My question is, if a function is structured such that the compiler can't reorder any of the code anyway because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
You may be confusing C++ statements with instructions. Your question isn't CPU specific, so you have to pretend you have no idea what the CPU instructions look like.
Consider this code:
if (a == 2)
{
    b = 5;
}
Now, here's an example of a re-ordering of this code that doesn't affect a single thread:
int c = b;
b = 5;
if (a != 2)
    b = c;
This performs the same operations but in a different order. It has no effect on single-threaded code. But, of course, if another thread was accessing b, it could see a value of 5 from this code even if a was never 2.
Thus it could also see a value of 5 from the original code even if a is never 2!
Why? Because the two bits of code behave the same from the point of view of a single thread. And unless you use operations with guaranteed threading semantics, that's all the compiler, CPU, caches, and other platform components need to preserve.
So your belief that reordering any of the code would affect single-threaded behavior is most likely incorrect. There are lots of ways to reorder and optimize code that don't affect single-threaded behavior.
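As a rough illustration (my own sketch, not part of the original answer), the way to rule out that transformation is to give b guaranteed threading semantics, for example by making it a std::atomic:
#include <atomic>

std::atomic<int> b{0};
int a = 0;   // illustrative only; set before the threads start

void writer() {
    if (a == 2)
        b.store(5, std::memory_order_release);   // the compiler may not invent this store speculatively
}

void reader() {
    // A reader that observes 5 here knows the writer really executed the
    // branch, i.e. that a == 2 actually held.
    int seen = b.load(std::memory_order_acquire);
    (void)seen;
}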
There is a document on MSDN that explains the difference: Acquire and Release Semantics.
For the sample:
a++;
b++;
c++;
If we use acquire semantics to increment a, other processors would always see the increment of a before the increments of b and c;
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The plain InterlockedXxx routines have both acquire and release semantics by default.
More specifically, for four values:
a++;
b++;
c++;
d++;
If we use acquire semantics to increment b, other processors would always see the increment of b before the increments of c and d;
The order may be a->b->c,d or b->a,c,d.
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The order may be a,b->c->d or a,b,d->c.
To quote from this answer by @antiduh:
Acquire says "only worry about stuff after me". Release says "only
worry about stuff before me". Combining those both is a full memory
barrier.
All three versions prevent the compiler from moving code across the function call, but the compiler is not the only place that reordering takes place.
Modern CPUs have "out-of-order execution" and even "speculative execution". Acquire and release semantics cause the compiler to emit instructions (or instruction flags/prefixes) that control reordering within the CPU.
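As a rough C++11 analogue of the a/b/c example above (my own sketch, not from the MSDN page), where the ordering is chosen per operation on std::atomic counters:
#include <atomic>

std::atomic<int> a{0}, b{0}, c{0};

void increment_all() {
    // acquire on a: the increments of b and c cannot appear to happen before it
    a.fetch_add(1, std::memory_order_acquire);
    b.fetch_add(1, std::memory_order_relaxed);
    // release on c: the increments of a and b are visible before it
    c.fetch_add(1, std::memory_order_release);
}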

Implementing 64 bit atomic counter with 32 bit atomics

I would like to cobble together a uint64 atomic counter from atomic uint32s. The counter has a single writer and multiple readers. The writer is a signal handler so it must not block.
My idea is to use a generation count with the low bit as a read lock. The reader retries until the generation count is stable across the read, and the low bit is unset.
Is the following code correct in design and use of memory ordering? Is there a better way?
#include <atomic>
#include <cstdint>
using namespace std;

class counter {
    atomic<uint32_t> lo_{};
    atomic<uint32_t> hi_{};
    atomic<uint32_t> gen_{};

public:
    uint64_t read() const {
        auto acquire = memory_order_acquire;
        uint32_t lo, hi, gen1, gen2;
        do {
            gen1 = gen_.load(acquire);
            lo = lo_.load(acquire);
            hi = hi_.load(acquire);
            gen2 = gen_.load(acquire);
        } while (gen1 != gen2 || (gen1 & 1));
        return (uint64_t(hi) << 32) | lo;
    }

    void increment() {
        auto release = memory_order_release;
        gen_.fetch_add(1, release);
        uint32_t newlo = 1 + lo_.fetch_add(1, release);
        if (newlo == 0) {
            hi_.fetch_add(1, release);
        }
        gen_.fetch_add(1, release);
    }
};
edit: whoops, fixed auto acquire = memory_order_release;
This is a known pattern, called a SeqLock. https://en.wikipedia.org/wiki/Seqlock. (With the simplification that there's only one writer so no extra support for excluding simultaneous writers is needed.) It's not lock-free; a writer sleeping at the wrong time will leave readers spinning until the writer finishes. But in the common case where that doesn't happen, it has excellent performance with no contention between readers which are truly read-only.
You don't need or want the increment of the payload to use atomic RMW operations. (Unless you're on a system that can cheaply do a 64-bit atomic add or load, then do that instead of a SeqLock).
You can just load both halves with atomic 32-bit loads, increment it, and atomically store the result. (With cheap relaxed or release memory order for the payload, and using a release store for the 2nd sequence counter update, what you're calling the "generation" counter).
Similarly the sequence counter also doesn't need to be an atomic RMW. (Unless you're using it as a spinlock with multiple writers)
The single writer only needs pure loads and pure stores with only release ordering, which are (much) cheaper than atomic RMW, or stores with seq_cst ordering:
load the counter and the value in any order
store a new counter (old+1)
store the new value (or just update the low half if you want to branch on no carry)
store the final counter.
The ordering of the stores in those three bullet points is the only thing that matters. A write fence after the first store could be good, because we don't really want the cost of making both stores of both halves of the value release, on CPUs where that's more expensive than relaxed.
Unfortunately to satisfy C++ rules, the value has to be atomic<T>, which makes it inconvenient to get the compiler to generate the most efficient code possible for loading both halves. e.g. ARM ldrd or ldp / stp load-pair aren't guaranteed atomic until ARMv8.4a, but that doesn't matter. (And compilers often won't optimize two separate atomic 32-bit loads into one wider load.)
Values other threads read while the sequence-counter is odd are irrelevant, but we'd like to avoid undefined behaviour. Maybe we could use a union of a volatile uint64_t and an atomic<uint64_t>
I wrote this C++ SeqLock<class T> template for another question I didn't finish writing an answer for (figuring out which versions of ARM have 64-bit atomic load and store).
This tries to check if the target already supports lock-free atomic operations on atomic<T> to stop you from using this when it's pointless. (Disable that for testing purposes by defining IGNORE_SIZECHECK.) TODO: transparently fall back to doing that, maybe with a template specialization, instead of using a static_assert.
I provided an inc() function for T that supports a ++ operator. TODO would be an apply() that accepts a lambda to do something to a T, and store the result between sequence counter updates.
// **UNTESTED**

#include <atomic>

#ifdef UNIPROCESSOR
// all readers and writers run on the same core (or same software thread)
// ordering instructions at compile time is all that's necessary
#define ATOMIC_FENCE std::atomic_signal_fence
#else
// A reader can be running on another core while writing.
// Memory barriers or ARMv8 acquire / release loads / stores are needed
#define ATOMIC_FENCE std::atomic_thread_fence
#endif
// using fences instead of .store(std::memory_order_release) will stop the compiler
// from taking advantage of a release-store instruction instead of a separate fence, like on AArch64
// But fences allow it to be optimized away to just compile-time ordering for the single-thread or uniprocessor case.

// SINGLE WRITER only.
// uses volatile + barriers for the data itself, like pre-C++11
template <class T>
class SeqLocked
{
#ifndef IGNORE_SIZECHECK
    // sizeof(T) > sizeof(unsigned)
    static_assert(!std::atomic<T>::is_always_lock_free, "A Seq Lock with a type small enough to be atomic on its own is totally pointless, and we don't have a specialization that replaces it with a straight wrapper for atomic<T>");
#endif

    // C++17 doesn't have a good way to express a load that doesn't care about tearing
    // without explicitly writing it as multiple small parts and thus gimping the compiler if it can use larger loads
    volatile T data;    // volatile should be fine on any implementation where pre-C++11 lockless code was possible with volatile,
                        // even though Data Race UB does apply to volatile variables in ISO C++11 and later.
                        // even non-volatile normally works in practice, being ordered by compiler barriers.

    std::atomic<unsigned> seqcount{0};  // Even means valid, odd means modification in progress.
                                        // unsigned definitely wraps around at a power of 2 on overflow

public:
    T get() const {
        unsigned c0, c1;
        T tmp;
        // READER RETRY LOOP
        do {
            c0 = seqcount.load(std::memory_order_acquire);  // or for your signal-handler use-case, relaxed load followed by ATOMIC_FENCE(std::memory_order_acquire);
            tmp = (T)data;                                  // load
            ATOMIC_FENCE(std::memory_order_acquire);        // LoadLoad barrier
            c1 = seqcount.load(std::memory_order_relaxed);
        } while (c0 & 1 || c0 != c1);   // retry if the counter changed or is odd
        return tmp;
    }

    // TODO: a version of this that takes a lambda for the operation on tmp
    T inc()     // WRITER
    {
        unsigned orig_count = seqcount.load(std::memory_order_relaxed);
        // we're the only writer, avoid an atomic RMW.
        seqcount.store(orig_count + 1, std::memory_order_relaxed);
        ATOMIC_FENCE(std::memory_order_release);   // 2-way barrier *after* the store, not like a release store. Or like making data=tmp a release operation.
        // make sure the counter becomes odd *before* any data change

        T tmp = data;   // load into a non-volatile temporary
        ++tmp;          // make any change to it
        data = tmp;     // store

        seqcount.store(orig_count + 2, std::memory_order_release);  // or use ATOMIC_FENCE(std::memory_order_release); *before* this, so the UNIPROCESSOR case can just do compile-time ordering
        return tmp;
    }

    void set(T newval) {
        unsigned orig_count = seqcount.load(std::memory_order_relaxed);

        seqcount.store(orig_count + 1, std::memory_order_relaxed);
        ATOMIC_FENCE(std::memory_order_release);
        // make sure the data stores appear after the first counter update.

        data = newval;  // store

        ATOMIC_FENCE(std::memory_order_release);
        seqcount.store(orig_count + 2, std::memory_order_relaxed);  // Or use mo_release here, better on AArch64
    }
};
/***** test callers *******/
#include <stdint.h>

struct sixteenbyte {
    //unsigned arr[4];
    unsigned long a, b, c, d;
    sixteenbyte() = default;
    sixteenbyte(const volatile sixteenbyte &old)
        : a(old.a), b(old.b), c(old.c), d(old.d) {}
    //arr(old.arr) {}
};

void test_inc(SeqLocked<uint64_t> &obj) { obj.inc(); }
sixteenbyte test_get(SeqLocked<sixteenbyte> &obj) { return obj.get(); }
//void test_set(SeqLocked<sixteenbyte> &obj, sixteenbyte val) { obj.set(val); }

uint64_t test_get(SeqLocked<uint64_t> &obj) {
    return obj.get();
}

// void atomic_inc_u64_seq_cst(std::atomic<uint64_t> &a) { ++a; }

uint64_t u64_inc_relaxed(std::atomic<uint64_t> &a) {
    // same but without dmb barriers
    return 1 + a.fetch_add(1, std::memory_order_relaxed);
}

uint64_t u64_load_relaxed(std::atomic<uint64_t> &a) {
    // gcc uses LDREXD, not just LDRD?
    return a.load(std::memory_order_relaxed);
}

void u64_store_relaxed(std::atomic<uint64_t> &a, uint64_t val) {
    // gcc uses a LL/SC retry loop even for a pure store?
    a.store(val, std::memory_order_relaxed);
}
It compiles to the asm we want on the Godbolt compiler explorer for ARM, and other ISAs. At least for int64_t; larger struct types may be copied less efficiently because of cumbersome volatile rules.
It uses non-atomic volatile T data for the shared data. This is technically data-race undefined behaviour, but all compilers we use in practice were fine with pre-C++11 multi-threaded access to volatile objects. And pre-C++11, people even depended on atomicity for some sizes. We do not, we check the counter and only use the value we read if there were no concurrent writes. (That's the whole point of a SeqLock.)
One problem with volatile T data is that in ISO C++, T foo = data won't compile for struct objects unless you provide a copy-constructor from a volatile object, like
sixteenbyte(const volatile sixteenbyte &old)
: a(old.a), b(old.b), c(old.c), d(old.d) {}
This is really annoying for us, because we don't care about the details of how memory is read, just that multiple reads aren't optimized into one.
volatile is really the wrong tool here, and plain T data with sufficient fencing to ensure that the read actually happens between the reads of the atomic counter would be better. E.g. we could do that in GNU C with an asm("" ::: "memory") compiler barrier against reordering before/after the accesses. That would let the compiler copy larger objects with SIMD vectors, or whatever, which it won't do with separate volatile accesses.
I think std::atomic_thread_fence(mo_acquire) would also be a sufficient barrier, but I'm not 100% sure.
In ISO C, you can copy a volatile aggregate (struct), and the compiler will emit whatever asm it normally would to copy that many bytes. But in C++, we can't have nice things apparently.
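Going back to the fence-based alternative two paragraphs up: a reader along those lines might look roughly like this in GNU C++. This is an untested sketch of my own; it assumes the member is declared as a plain non-volatile T data, and it still technically has data-race UB in ISO C++.
    // hypothetical alternative to get(), assuming a plain (non-volatile) T data member
    T get_plain() const {
        unsigned c0, c1;
        T tmp;
        do {
            c0 = seqcount.load(std::memory_order_acquire);
            asm volatile("" ::: "memory");      // compile-time barrier: don't hoist the copy
            tmp = data;                         // plain copy; the compiler may use wide / SIMD loads
            std::atomic_thread_fence(std::memory_order_acquire);  // keep the counter re-check after the copy
            c1 = seqcount.load(std::memory_order_relaxed);
        } while (c0 & 1 || c0 != c1);
        return tmp;
    }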
Related: single-core systems with a writer in an interrupt handler
In an embedded system with one core, and some variables that are only updated by interrupt handlers, you may have a writer that can interrupt the reader but not vice versa. That allows some cheaper variations that use the value itself to detect torn reads.
See Reading a 64 bit variable that is updated by an ISR, especially for a monotonic counter Brendan's suggestion of reading the most significant-half first, then the low half, then the most-significant half again. If it matches, your read wasn't torn in a way that matters. (A write that didn't change the high half isn't a problem even if it interrupted the reader to change the low half right before or after the reader read it.)
Or in general, re-read the whole value until you see the same value twice in a row.
Neither of these techniques is SMP-safe: the read retry only guards against torn reads, not torn writes if the writer stored the halves separately. That's why a SeqLock uses a 3rd atomic integer as a sequence counter. They would work in any case where the writer is atomic wrt. the reader, but the reader isn't atomic. Interrupt handler vs. main code is one such case, or signal handler is equivalent.
You could potentially use the low half of a monotonic counter as a sequence number, if you don't mind incrementing by 2 instead of 1. (Perhaps requiring readers to do a 64-bit right shift by 1 to recover the actual number. So that's not good.)
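To make the most-significant-half-first idea concrete, here is a small sketch of that read loop for the single-core, writer-in-ISR case. This is my own illustration; counter_hi and counter_lo are hypothetical 32-bit halves that only the interrupt handler writes.
#include <stdint.h>

volatile uint32_t counter_hi, counter_lo;   // written only by the ISR on this core

uint64_t read_counter(void) {
    uint32_t hi, lo, hi2;
    do {
        hi  = counter_hi;    // most-significant half first
        lo  = counter_lo;
        hi2 = counter_hi;    // re-read: unchanged means no carry crossed our read
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}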

Synchronizing a non-atomic array using atomic variables and memory fences

I've been trying to build a simple (i.e. inefficient) MPMC queue using only std C++ but I'm having trouble getting the underlying array to synchronize between threads. A simplified version of the queue is:
constexpr int POISON = 5000;

class MPMC {
    std::atomic_int mPushStartCounter;
    std::atomic_int mPushEndCounter;
    std::atomic_int mPopCounter;

    static constexpr int Size = 1 << 20;
    int mData[Size];

public:
    MPMC() {
        mPushStartCounter.store(0);
        mPushEndCounter.store(-1);
        mPopCounter.store(0);
        for (int i = 0; i < Size; i++) {
            //preset data with a poison flag to
            // detect race conditions
            mData[i] = POISON;
        }
    }

    void push(int x) {
        int index = mPushStartCounter.fetch_add(1);
        mData[index] = x; //Race condition
        atomic_thread_fence(std::memory_order_release);

        int expected = index - 1;
        while (!mPushEndCounter.compare_exchange_strong(expected, index, std::memory_order_acq_rel)) {
            std::this_thread::yield();
        }
    }

    int pop() {
        int index = mPopCounter.load();
        if (index <= mPushEndCounter.load(std::memory_order_acquire) &&
            mPopCounter.compare_exchange_strong(index, index + 1, std::memory_order_acq_rel)) {
            return mData[index]; //race condition
        } else {
            return pop();
        }
    }
};
It uses three atomic variables for synchronization:
mPushStartCounter that is used by push(int) to determine which location to write to.
mPushEndCounter that is used to signal to pop() that push(int) has finished writing up to that point in the array.
mPopCounter that is used by pop() to prevent double pops from occurring.
In push(), between writing to the array mData and updating mPushEndCounter I've put a release barrier in an attempt to force synchronization of the mData array.
The way I understand cppreference, this should force Fence-Atomic Synchronization, where
the CAS in push() is an 'atomic store X',
the load of mPushEndCounter in pop() is an 'atomic acquire operation Y',
the release barrier 'F' in push() is 'sequenced-before X'.
In which case cppreference states that
In this case, all non-atomic and relaxed atomic stores that are sequenced-before F in thread A will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread B after Y.
Which I interpreted to mean that the write to mData from push() would be visible in pop(). This is however not the case, sometimes pop() reads uninitialized data. I believe this to be a synchronization issues because if I check the contents of the queue afterwards, or via breakpoint, it reads correct data instead.
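For comparison, here is a minimal standalone example of the fence-atomic synchronization rule being quoted. This is my own illustration with made-up names, not the queue code above.
#include <atomic>
#include <cassert>

int payload = 0;                    // non-atomic data
std::atomic<bool> ready{false};

void thread_A() {
    payload = 42;                                          // non-atomic store, sequenced-before F
    std::atomic_thread_fence(std::memory_order_release);   // F: release fence
    ready.store(true, std::memory_order_relaxed);          // X: atomic store sequenced after F
}

void thread_B() {
    if (ready.load(std::memory_order_acquire)) {           // Y: acquire operation reading X's value
        assert(payload == 42);                             // guaranteed by fence-atomic synchronization
    }
}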
I am using clang 6.0.1, and g++ 7.3.0.
I tried looking at the generated assembly but it looks correct to me: the write to the array is followed by a lock cmpxchg and the read is preceded by a check on the same variable. Which to the best of my limited knowledge should work as expected on x64 because
Loads are not reordered with other loads, hence the load from array can not speculate ahead of reading the atomic counter.
stores are not reordered with other stores, hence the cmpxchg always comes after the store to array.
lock cmpxchg flushes the write-buffer, cache, etc. Therefore if another thread observes it as finished, one can rely on cache coherency to guarantee that write to the array has finished. I am not too sure that this is correct however.
I've posted a runnable test on Github. The test code involves 16 threads, half of which push the numbers 0 to 4999 and the other half read back 5000 elements each. It then combines the results of all the readers and checks that we've seen all the numbers in [0, 4999] exactly 8 times (which fails) and scans the underlying array once more to see if it contains all the numbers in [0, 4999] exactly 8 times (which succeeds).

about spin lock

I have some questions about this Boost spinlock code:
class spinlock
{
public:
    spinlock()
        : v_(0)
    {
    }

    bool try_lock()
    {
        long r = InterlockedExchange(&v_, 1);
        _ReadWriteBarrier();    // 1. What does this mean?
        return r == 0;
    }

    void lock()
    {
        for (unsigned k = 0; !try_lock(); ++k)
        {
            yield(k);
        }
    }

    void unlock()
    {
        _ReadWriteBarrier();
        *const_cast<long volatile*>(&v_) = 0;
        // 2. Why don't we need to use InterlockedExchange(&v_, 0)?
    }

private:
    long v_;
};
A _ReadWriteBarrier() is a "memory barrier" (in this case for both reads and writes): a special instruction to the processor to ensure that any instructions resulting in memory operations have completed (load and store operations - or, on x86 processors for example, any operation which has a memory operand on either side). In this particular case, it makes sure that the InterlockedExchange(&v_, 1) has completed before we continue.
Because an InterlockedExchange would be less efficient: it takes more interaction with the other cores in the machine to ensure they have all "let go" of the value, which makes no sense here, since (in correctly working code) we only unlock if we actually hold the lock, so no other processor will have a different value cached than what we're writing over anyway. A plain volatile write to the memory is just as good.
The barriers are there to ensure memory synchronization; without
them, different threads may see modifications of memory in
different orders.
And the InterlockedExchange isn't necessary in the second case
because we're not interested in the previous value. The role of
InterlockedExchange is doubtlessly to set the value and return
the previous value. (And why v_ would be long, when it can
only take values 0 and 1, is beyond me.)
There are three issues with atomic access to variables. First, ensuring that there is no thread switch in the middle of reading or writing a value; if this happens it's called "tearing"; the second thread can see a partly written value, which will usually be nonsensical. Second, ensuring that all processors see the change that is being made with a write, or that the processor reading a value sees any previous changes to that value; this is called "cache coherency". Third, ensuring that the compiler doesn't move code across the read or write; this is called "code motion". InterlockedExchange does the first two; although the MSDN documentation is rather muddled, _ReadWriteBarrier does the third, and possibly the second.
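For reference, a rough portable equivalent of that spinlock using C++11 std::atomic (my own sketch, not taken from Boost), which makes the same point: the unlock only needs a release store, not an atomic exchange.
#include <atomic>
#include <thread>

class spinlock_cxx11
{
public:
    bool try_lock()
    {
        // the exchange is the atomic RMW; acquire ordering replaces the explicit barrier
        return v_.exchange(1, std::memory_order_acquire) == 0;
    }
    void lock()
    {
        while (!try_lock())
            std::this_thread::yield();   // stand-in for the original yield(k)
    }
    void unlock()
    {
        // plain release store: we hold the lock, so we don't need the previous value
        v_.store(0, std::memory_order_release);
    }
private:
    std::atomic<long> v_{0};
};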

Atomic counter in gcc

I must be just having a moment, because this should be easy but I can't seem to get it working right.
Whats the correct way to implement an atomic counter in GCC?
i.e. I want a counter that runs from zero to 4 and is thread safe.
I was doing this (which is further wrapped in a class, but not here)
static volatile int _count = 0;
const int limit = 4;

int get_count() {
    // Create a local copy of diskid
    int save_count = __sync_fetch_and_add(&_count, 1);
    if (save_count >= limit) {
        __sync_fetch_and_and(&_count, 0);  // Set it back to zero
    }
    return save_count;
}
But it's running from 1 through 4 inclusive and then wrapping around to zero.
It should go from 0 to 3. Normally I'd do a counter with a mod operator, but I don't know how to do that safely.
Perhaps this version is better. Can you see any problems with it, or offer a better solution?
int get_count() {
    // Create a local copy of diskid
    int save_count = _count;
    if (save_count >= limit) {
        __sync_fetch_and_and(&_count, 0);  // Set it back to zero
        return 0;
    }
    return save_count;
}
Actually, I should point out that it's not absolutely critical that each thread get a different value. If two threads happened to read the same value at the same time that wouldn't be a problem. But they can't exceed limit at any time.
Your code isn't atomic (and your second get_count doesn't even increment the counter value)!
Say count is 3 at the start and two threads simultaneously call get_count. One of them will get his atomic add done first and increments count to 4. If the second thread is fast enough, it can increment it to 5 before the first thread resets it to zero.
Also, in your wraparound processing, you reset count to 0 but not save_count. This is clearly not what's intended.
This is easiest if limit is a power of 2. Don't ever do the reduction yourself, just use
return (unsigned) __sync_fetch_and_add(&count, 1) % (unsigned) limit;
or alternatively
return __sync_fetch_and_add(&count, 1) & (limit - 1);
This only does one atomic operation per invocation, is safe and very cheap. For generic limits, you can still use %, but that will break the sequence if the counter ever overflows. You can try using a 64-bit value (if your platform supports 64-bit atomics) and just hope it never overflows; this is a bad idea though. The proper way to do this is using an atomic compare-exchange operation. You do this:
int old_count, new_count;
do {
    old_count = count;
    new_count = old_count + 1;
    if (new_count >= limit) new_count = 0;  // or use %
} while (!__sync_bool_compare_and_swap(&count, old_count, new_count));
This approach generalizes to more complicated sequences and update operations too.
That said, this type of lockless operation is tricky to get right, relies on undefined behavior to some degree (all current compilers get this right, but no C/C++ standard before C++0x actually has a well-defined memory model) and is easy to break. I recommend using a simple mutex/lock unless you've profiled it and found it to be a bottleneck.
You're in luck, because the range you want happens to fit into exactly 2 bits.
Easy solution: Let the volatile variable count up forever. But after you read it, use just the lowest two bits (val & 3). Presto, atomic counter from 0-3.
It's impossible to create anything atomic in pure C, even with volatile. You need asm. C1x will have special atomic types, but until then you're stuck with asm.
You have two problems.
__sync_fetch_and_add will return the previous value (i.e., before adding one). So at the step where _count becomes 3, your local save_count variable is getting back 2. So you actually have to increment _count up to 4 before it'll come back as a 3.
But even on top of that, you're specifically looking for it to be >= 4 before you reset it back to 0. That's just a question of using the wrong limit if you're only looking for it to get as high as three.
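Putting those two observations together with the earlier power-of-two trick, a minimal corrected get_count might look like this. This is my own sketch, keeping the GCC __sync builtin from the question and assuming limit stays a power of two.
static volatile int _count = 0;
static const int limit = 4;   // must remain a power of two for the mask trick

int get_count() {
    // Let the counter run up forever; reduce it to [0, limit) on the way out.
    // __sync_fetch_and_add returns the pre-increment value, so the sequence starts at 0.
    return __sync_fetch_and_add(&_count, 1) & (limit - 1);
}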