Fetch-and-add using OpenMP atomic operations - C++

I’m using OpenMP and need to use the fetch-and-add operation. However, OpenMP doesn’t provide an appropriate directive/call. I’d like to preserve maximum portability, hence I don’t want to rely on compiler intrinsics.
Rather, I’m searching for a way to harness OpenMP’s atomic operations to implement this but I’ve hit a dead end. Can this even be done? N.B., the following code almost does what I want:
#pragma omp atomic
x += a;
Almost – but not quite, since I really need the old value of x. fetch_and_add should be defined to produce the same result as the following (only non-locking):
template <typename T>
T fetch_and_add(volatile T& value, T increment) {
    T old;
    #pragma omp critical
    {
        old = value;
        value += increment;
    }
    return old;
}
(An equivalent question could be asked for compare-and-swap but one can be implemented in terms of the other, if I’m not mistaken.)
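For concreteness, here is the equivalence I mean, sketched with C++11 std::atomic (not an option for me here, but it shows that fetch-and-add can be built from compare-and-swap):

#include <atomic>

// Sketch: fetch-and-add implemented with a compare-and-swap retry loop.
template <typename T>
T fetch_and_add_via_cas(std::atomic<T>& value, T increment) {
    T old = value.load();
    // On failure, compare_exchange_weak refreshes `old` with the current value.
    while (!value.compare_exchange_weak(old, old + increment)) {
    }
    return old;
}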

As of OpenMP 3.1 there is support for capturing atomic updates: you can capture either the old value or the new value. Since we have to bring the value in from memory to increment it anyway, it only makes sense that we should be able to access it from, say, a CPU register and put it into a thread-private variable.
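For example, both capture forms look like this (a minimal sketch, assuming a compiler with OpenMP 3.1 support):

#include <omp.h>
#include <cstdio>

int main() {
    int x = 0;
    #pragma omp parallel num_threads(4)
    {
        int old;
        #pragma omp atomic capture
        { old = x; x += 1; }   // old receives the value *before* the increment
        std::printf("thread %d saw %d\n", omp_get_thread_num(), old);
    }
    int result;
    #pragma omp atomic capture
    result = x += 1;           // result receives the value *after* the increment
    std::printf("final: %d\n", result);
    return 0;
}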
There's a nice work-around if you're using gcc (or g++): look up the atomic builtins:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
I think Intel's C/C++ compiler also has support for this, but I haven't tried it.
For now (until OpenMP 3.1 is implemented), I've used inline wrapper functions in C++ where you can choose which version to use at compile time:
template <class T>
inline T my_fetch_add(T *ptr, T val) {
#if defined(GCC_EXTENSION)
    return __sync_fetch_and_add(ptr, val);
#elif defined(OPENMP_3_1)
    T t;
    #pragma omp atomic capture
    { t = *ptr; *ptr += val; }
    return t;
#else
    #error "define GCC_EXTENSION or OPENMP_3_1 to select an implementation"
#endif
}
Update: I just tried Intel's C++ compiler; it currently has support for OpenMP 3.1 (atomic capture is implemented). Intel offers free use of its compilers on Linux for non-commercial purposes:
http://software.intel.com/en-us/articles/non-commercial-software-download/
GCC 4.7 will support OpenMP 3.1, when it is eventually released... hopefully soon :)

If you want to get the old value of x, and a is not changed, use (x - a) as the old value:
int fetch_and_add(int *x, int a) {
    #pragma omp atomic
    *x += a;
    return (*x - a);
}
UPDATE: this was not really an answer, because *x can be modified by another thread after the atomic update, so the value returned may be neither the old nor the new value.
So it seems impossible to make a universal fetch-and-add using OpenMP pragmas alone. By universal I mean an operation which can easily be used from any place in OpenMP code.
You can use the omp_*_lock functions to simulate atomics:
typedef struct { omp_lock_t lock; int value; } atomic_simulated_t;

/* x->lock must be initialized with omp_init_lock() before first use */
int fetch_and_add(atomic_simulated_t *x, int a)
{
    int ret;
    omp_set_lock(&x->lock);
    ret = x->value;        /* read the old value, as fetch-and-add requires */
    x->value += a;
    omp_unset_lock(&x->lock);
    return ret;
}
This is ugly and slow (doing 2 atomic ops instead of 1). But if you want your code to be very portable, it will not be the fastest in all cases.
You say "as the following (only non-locking)". But what is the difference between "non-locking" operations (using the CPU's LOCK prefix, or LL/SC, etc.) and locking operations (which are themselves implemented with several atomic instructions, a busy loop for short waits, and OS sleeping for long waits)?

Related

OpenMP atomic on return values

I have some trouble understanding the definition of scalar expression in OpenMP.
In particular, I thought that calling a function and using its return value in an atomic expression is not allowed.
Looking at the Compiler Explorer asm output, it seems to me as if it is atomic, though.
Maybe someone can clarify this.
#include <cmath>

int f() { return sin(2); }

int main()
{
    int i = 5;
    #pragma omp atomic
    i += f();
    return i;
}
@paleonix's comment is a good answer, so I'm expanding on it a little here.
The important point is that the atomicity is an issue only for the read/modify/write operation expressed by the += operator.
How you generate the value to be added is irrelevant, and what happens while generating that value is unaffected by the atomic pragma, so it can still contain races (should you want them :-)).
Of course, this is unlike the use of a critical section, where the whole scope of the critical region is protected.
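In other words, the code above behaves roughly as if it were written like this (a sketch; the call to f() happens outside the protected read-modify-write):

#include <cmath>

int f() { return sin(2); }

int main()
{
    int i = 5;
    int tmp = f();   // evaluated outside the atomic region; may race with other code
    #pragma omp atomic
    i += tmp;        // only this read-modify-write of i is atomic
    return i;
}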

Implementing 64 bit atomic counter with 32 bit atomics

I would like to cobble together a uint64 atomic counter from atomic uint32s. The counter has a single writer and multiple readers. The writer is a signal handler so it must not block.
My idea is to use a generation count with the low bit as a read lock. The reader retries until the generation count is stable across the read, and the low bit is unset.
Is the following code correct in design and use of memory ordering? Is there a better way?
#include <atomic>
#include <cstdint>

using namespace std;

class counter {
    atomic<uint32_t> lo_{};
    atomic<uint32_t> hi_{};
    atomic<uint32_t> gen_{};

public:
    uint64_t read() const {
        auto acquire = memory_order_acquire;
        uint32_t lo, hi, gen1, gen2;
        do {
            gen1 = gen_.load(acquire);
            lo = lo_.load(acquire);
            hi = hi_.load(acquire);
            gen2 = gen_.load(acquire);
        } while (gen1 != gen2 || (gen1 & 1));
        return (uint64_t(hi) << 32) | lo;
    }

    void increment() {
        auto release = memory_order_release;
        gen_.fetch_add(1, release);
        uint32_t newlo = 1 + lo_.fetch_add(1, release);
        if (newlo == 0) {
            hi_.fetch_add(1, release);
        }
        gen_.fetch_add(1, release);
    }
};
edit: whoops, fixed auto acquire = memory_order_release;
This is a known pattern, called a SeqLock. https://en.wikipedia.org/wiki/Seqlock. (With the simplification that there's only one writer so no extra support for excluding simultaneous writers is needed.) It's not lock-free; a writer sleeping at the wrong time will leave readers spinning until the writer finishes. But in the common case where that doesn't happen, it has excellent performance with no contention between readers which are truly read-only.
You don't need or want the increment of the payload to use atomic RMW operations. (Unless you're on a system that can cheaply do a 64-bit atomic add or load, then do that instead of a SeqLock).
You can just load both halves with atomic 32-bit loads, increment it, and atomically store the result. (With cheap relaxed or release memory order for the payload, and using a release store for the 2nd sequence counter update, what you're calling the "generation" counter).
Similarly, the sequence-counter updates don't need to be atomic RMWs either. (Unless you're using it as a spinlock with multiple writers.)
The single writer only needs pure loads and pure stores with only release ordering, which are (much) cheaper than atomic RMW, or stores with seq_cst ordering:
load the counter and the value in any order
store a new counter (old+1)
store the new value (or just update the low half if you want to branch on no carry)
store the final counter.
The relative ordering of the stores in those three bullet points is the only thing that matters. A release fence after the first store could be good, because we don't really want the cost of making the stores of both halves of the value release operations, on CPUs where that's more expensive than relaxed.
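Applied to the class from the question, the writer side might look like this (a sketch, untested; relaxed loads are fine because we're the only writer):

#include <atomic>
#include <cstdint>

struct counter {
    std::atomic<uint32_t> lo_{0}, hi_{0}, gen_{0};

    void increment() {   // SINGLE writer only
        uint32_t gen = gen_.load(std::memory_order_relaxed);
        uint32_t lo  = lo_.load(std::memory_order_relaxed);
        uint32_t hi  = hi_.load(std::memory_order_relaxed);

        gen_.store(gen + 1, std::memory_order_relaxed);       // counter becomes odd
        std::atomic_thread_fence(std::memory_order_release);  // payload stores stay after this

        uint64_t v = ((uint64_t(hi) << 32) | lo) + 1;
        lo_.store(uint32_t(v), std::memory_order_relaxed);
        hi_.store(uint32_t(v >> 32), std::memory_order_relaxed);

        gen_.store(gen + 2, std::memory_order_release);       // even again: readers may use it
    }
};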
Unfortunately to satisfy C++ rules, the value has to be atomic<T>, which makes it inconvenient to get the compiler to generate the most efficient code possible for loading both halves. e.g. ARM ldrd or ldp / stp load-pair aren't guaranteed atomic until ARMv8.4a, but that doesn't matter. (And compilers often won't optimize two separate atomic 32-bit loads into one wider load.)
Values other threads read while the sequence counter is odd are irrelevant, but we'd like to avoid undefined behaviour. Maybe we could use a union of a volatile uint64_t and an atomic<uint64_t>.
I wrote this C++ SeqLock<class T> template for another question I didn't finish writing an answer for (figuring out which versions of ARM have 64-bit atomic load and store).
This tries to check whether the target already supports lock-free atomic operations on atomic<T>, to stop you from using this when it's pointless. (Disable that check for testing purposes by defining IGNORE_SIZECHECK.) TODO: transparently fall back to doing that, maybe with a template specialization, instead of using a static_assert.
I provided an inc() function for any T that supports a ++ operator. A TODO would be an apply() that accepts a lambda, to do something arbitrary to a T and store the result between the sequence-counter updates (a sketch appears after the class below).
// **UNTESTED**

#include <atomic>

#ifdef UNIPROCESSOR
// All readers and writers run on the same core (or same software thread);
// ordering instructions at compile time is all that's necessary.
#define ATOMIC_FENCE std::atomic_signal_fence
#else
// A reader can be running on another core while we're writing.
// Memory barriers or ARMv8 acquire/release loads/stores are needed.
#define ATOMIC_FENCE std::atomic_thread_fence
#endif
// Using fences instead of .store(std::memory_order_release) stops the compiler
// from taking advantage of a release-store instruction instead of a separate
// fence, like on AArch64. But fences allow it all to optimize away to just
// compile-time ordering for the single-thread or uniprocessor case.

// SINGLE WRITER only.
// Uses volatile + barriers for the data itself, like pre-C++11 code.
template <class T>
class SeqLocked
{
#ifndef IGNORE_SIZECHECK
    // i.e. sizeof(T) > sizeof(unsigned)
    static_assert(!std::atomic<T>::is_always_lock_free,
        "A SeqLock with a type small enough to be atomic on its own is totally pointless, "
        "and we don't have a specialization that replaces it with a straight wrapper for atomic<T>");
#endif

    // C++17 doesn't have a good way to express a load that doesn't care about
    // tearing, without explicitly writing it as multiple small parts and thus
    // gimping the compiler if it could have used larger loads.
    volatile T data;  // volatile should be fine on any implementation where
                      // pre-C++11 lockless code was possible with volatile, even
                      // though data-race UB does apply to volatile variables in
                      // ISO C++11 and later. Even non-volatile normally works in
                      // practice, being ordered by compiler barriers.

    std::atomic<unsigned> seqcount{0};  // Even means valid, odd means modification in progress.
                                        // unsigned definitely wraps around at a power of 2 on overflow.

public:
    T get() const {
        unsigned c0, c1;
        T tmp;
        do {  // READER RETRY LOOP
            c0 = seqcount.load(std::memory_order_acquire);
            // or for your signal-handler use case: relaxed load followed by
            // ATOMIC_FENCE(std::memory_order_acquire);
            tmp = (T)data;  // load
            ATOMIC_FENCE(std::memory_order_acquire);  // LoadLoad barrier
            c1 = seqcount.load(std::memory_order_relaxed);
        } while ((c0 & 1) || c0 != c1);  // retry if the counter changed or is odd
        return tmp;
    }

    // TODO: a version of this that takes a lambda for the operation on tmp
    T inc()  // WRITER
    {
        unsigned orig_count = seqcount.load(std::memory_order_relaxed);
        // We're the only writer, so avoid an atomic RMW.
        seqcount.store(orig_count + 1, std::memory_order_relaxed);
        ATOMIC_FENCE(std::memory_order_release);
        // ^ 2-way barrier *after* the store, not like a release store; or like
        //   making data=tmp a release operation. Makes sure the counter becomes
        //   odd *before* any data change.

        T tmp = data;  // load into a non-volatile temporary
        ++tmp;         // make any change to it
        data = tmp;    // store

        seqcount.store(orig_count + 2, std::memory_order_release);
        // or use ATOMIC_FENCE(std::memory_order_release) *before* this store,
        // so the UNIPROCESSOR case can do just compile-time ordering
        return tmp;
    }

    void set(T newval) {
        unsigned orig_count = seqcount.load(std::memory_order_relaxed);

        seqcount.store(orig_count + 1, std::memory_order_relaxed);
        ATOMIC_FENCE(std::memory_order_release);
        // make sure the data stores appear after the first counter update

        data = newval;  // store

        ATOMIC_FENCE(std::memory_order_release);
        seqcount.store(orig_count + 2, std::memory_order_relaxed);
        // Or use mo_release here, better on AArch64
    }
};
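Following the TODO above, an apply() member might look like this (a sketch, untested; it would go inside SeqLocked<T>, with the same single-writer assumption as inc()):

template <class F>
T apply(F&& func)   // WRITER only: arbitrary update between the counter stores
{
    unsigned orig_count = seqcount.load(std::memory_order_relaxed);
    seqcount.store(orig_count + 1, std::memory_order_relaxed);
    ATOMIC_FENCE(std::memory_order_release);   // odd counter before any data change

    T tmp = data;      // load into a non-volatile temporary
    tmp = func(tmp);   // e.g. func = [](T v) { return v * 2; }
    data = tmp;        // store

    seqcount.store(orig_count + 2, std::memory_order_release);
    return tmp;
}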
/***** test callers *******/
#include <stdint.h>

struct sixteenbyte {
    //unsigned arr[4];
    unsigned long a, b, c, d;
    sixteenbyte() = default;
    sixteenbyte(const volatile sixteenbyte &old)
        : a(old.a), b(old.b), c(old.c), d(old.d) {}
    //  arr(old.arr) {}
};

void test_inc(SeqLocked<uint64_t> &obj) { obj.inc(); }
sixteenbyte test_get(SeqLocked<sixteenbyte> &obj) { return obj.get(); }
//void test_set(SeqLocked<sixteenbyte> &obj, sixteenbyte val) { obj.set(val); }

uint64_t test_get(SeqLocked<uint64_t> &obj) {
    return obj.get();
}

// void atomic_inc_u64_seq_cst(std::atomic<uint64_t> &a) { ++a; }

uint64_t u64_inc_relaxed(std::atomic<uint64_t> &a) {
    // same but without dmb barriers
    return 1 + a.fetch_add(1, std::memory_order_relaxed);
}

uint64_t u64_load_relaxed(std::atomic<uint64_t> &a) {
    // gcc uses LDREXD, not just LDRD?
    return a.load(std::memory_order_relaxed);
}

void u64_store_relaxed(std::atomic<uint64_t> &a, uint64_t val) {
    // gcc uses an LL/SC retry loop even for a pure store?
    a.store(val, std::memory_order_relaxed);
}
It compiles to the asm we want on the Godbolt compiler explorer for ARM, and other ISAs. At least for int64_t; larger struct types may be copied less efficiently because of cumbersome volatile rules.
It uses non-atomic volatile T data for the shared data. This is technically data-race undefined behaviour, but all compilers we use in practice were fine with pre-C++11 multi-threaded access to volatile objects. And pre-C++11, people even depended on atomicity for some sizes. We do not, we check the counter and only use the value we read if there were no concurrent writes. (That's the whole point of a SeqLock.)
One problem with volatile T data is that in ISO C++, T foo = data won't compile for struct objects unless you provide a copy-constructor from a volatile object, like
sixteenbyte(const volatile sixteenbyte &old)
    : a(old.a), b(old.b), c(old.c), d(old.d) {}
This is really annoying for us, because we don't care about the details of how memory is read, just that multiple reads aren't optimized into one.
volatile is really the wrong tool here, and plain T data with sufficient fencing to ensure that the read actually happens between the reads of the atomic counter would be better. e.g. we could do that in GNU C with an asm("" ::: "memory") compiler barrier against reordering before/after the accesses. That would let the compiler copy larger objects with SIMD vectors, or whatever, which it won't do with separate volatile accesses.
I think std::atomic_thread_fence(mo_acquire) would also be a sufficient barrier, but I'm not 100% sure.
In ISO C, you can copy a volatile aggregate (struct), and the compiler will emit whatever asm it normally would to copy that many bytes. But in C++, we can't have nice things apparently.
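For example, a get() built that way might look like this (a sketch, untested, GNU C only; the plain load of data is still technically data-race UB in ISO C++, as discussed above):

#include <atomic>

// Reader using plain (non-volatile) data plus barriers. The asm statements
// stop the compiler from moving or merging the data load; the acquire fence
// provides the run-time LoadLoad ordering on weakly-ordered CPUs.
template <class T>
T get_with_barriers(const std::atomic<unsigned> &seqcount, const T &data) {
    unsigned c0, c1;
    T tmp;
    do {
        c0 = seqcount.load(std::memory_order_acquire);
        asm("" ::: "memory");   // compiler barrier: keep the data load here
        tmp = data;             // plain load: the compiler may use SIMD copies
        std::atomic_thread_fence(std::memory_order_acquire);  // LoadLoad barrier
        c1 = seqcount.load(std::memory_order_relaxed);
    } while ((c0 & 1) || c0 != c1);
    return tmp;
}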
Related: single-core systems with a writer in an interrupt handler
In an embedded system with one core, and some variables that are only updated by interrupt handlers, you may have a writer that can interrupt the reader but not vice versa. That allows some cheaper variations that use the value itself to detect torn reads.
See Reading a 64 bit variable that is updated by an ISR, especially, for a monotonic counter, Brendan's suggestion of reading the most-significant half first, then the low half, then the most-significant half again. If it matches, your read wasn't torn in a way that matters. (A write that didn't change the high half isn't a problem even if it interrupted the reader to change the low half right before or after the reader read it.)
Or in general, re-read the whole value until you see the same value twice in a row.
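A sketch of that high/low/high read for a monotonic counter (single core only, writer in an ISR; the variable names are hypothetical):

#include <stdint.h>

// Reader for a 64-bit monotonic counter updated only by an ISR on the same
// core. NOT SMP-safe: it only guards against the reader being interrupted.
extern volatile uint32_t counter_hi, counter_lo;   // hypothetical globals

uint64_t read_counter(void) {
    uint32_t hi, lo, hi2;
    do {
        hi  = counter_hi;   // most-significant half first
        lo  = counter_lo;
        hi2 = counter_hi;   // re-read: did a carry into the high half happen?
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}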
Neither of these techniques is SMP-safe: the read retry only guards against torn reads, not torn writes if the writer stored the halves separately. That's why a SeqLock uses a 3rd atomic integer as a sequence counter. They would work in any case where the writer is atomic with respect to the reader, but the reader isn't atomic. Interrupt handler vs. main code is one such case; a signal handler is equivalent.
You could potentially use the low half of a monotonic counter as a sequence number, if you don't mind incrementing by 2 instead of 1. (Perhaps requiring readers to do a 64-bit right shift by 1 to recover the actual number. So that's not good.)

Fast and Lock Free Single Writer, Multiple Reader

I've got a single writer which has to increment a variable at a fairly high frequency and also one or more readers who access this variable at a lower frequency.
The write is triggered by an external interrupt.
Since I need to write at high speed I don't want to use mutexes or other expensive locking mechanisms.
The approach I came up with was copying the value after writing to it. The reader can then compare the original with the copy. If they are equal, the variable's content is valid.
Here is my implementation in C++:
template<typename T>
class SafeValue
{
private:
    volatile T _value;
    volatile T _valueCheck;

public:
    void setValue(T newValue)
    {
        _value = newValue;
        _valueCheck = _value;
    }

    T getValue()
    {
        volatile T value;
        volatile T valueCheck;
        do
        {
            valueCheck = _valueCheck;
            value = _value;
        } while (value != valueCheck);
        return value;
    }
};
The idea behind this is to detect data races while reading and retry if they happen. However, I don't know if this will always work. I haven't found anything about this approach online, therefore my question:
Is there any problem with my approach when used with a single writer and multiple readers?
I already know that high writing frequencies may cause starvation of the readers. Are there more bad effects I have to be cautious of? Could it even be that this isn't thread-safe at all?
Edit 1:
My target system is an ARM Cortex-A15.
T should be able to be at least any primitive integral type.
Edit 2:
std::atomic is too slow on both the reader and the writer side. I benchmarked it on my system: writes are roughly 30 times slower and reads roughly 50 times slower than unprotected primitive operations.
If this single variable is just an integer, pointer, or plain old value type, you can probably just use std::atomic.
You should try using std::atomic first, but make sure that your compiler knows and understands your target architecture. Since you are targeting a Cortex-A15 (an ARMv7-A CPU), make sure to use -march=armv7-a or even -mcpu=cortex-a15.
The former will generate the ldrexd instruction, which should be atomic according to the ARM docs:
Single-copy atomicity
In ARMv7, the single-copy atomic processor accesses are:
all byte accesses
all halfword accesses to halfword-aligned locations
all word accesses to word-aligned locations
memory accesses caused by LDREXD and STREXD instructions to doubleword-aligned locations.
The latter will generate the ldrd instruction, which should be atomic on targets supporting the Large Physical Address Extension:
In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.
--- Note ---
The Large Physical Address Extension adds this requirement to avoid the need for complex measures to avoid atomicity issues when changing translation table entries, without creating a requirement that all locations in the memory system are 64-bit single-copy atomic.
You can also check how the Linux kernel implements those:
#ifdef CONFIG_ARM_LPAE
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("# atomic64_read\n"
        "   ldrd %0, %H0, [%1]"
        : "=&r" (result)
        : "r" (&v->counter), "Qo" (v->counter)
    );
    return result;
}
#else
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("# atomic64_read\n"
        "   ldrexd %0, %H0, [%1]"
        : "=&r" (result)
        : "r" (&v->counter), "Qo" (v->counter)
    );
    return result;
}
#endif
There's no way anyone can know. You would have to check whether your compiler documents any multi-threaded semantics that would guarantee this will work, or look at the generated assembly code and convince yourself that it will work. Be warned that in the latter case, it is always possible that a later version of the compiler, different optimization options, or a newer CPU might break the code.
I'd suggest testing std::atomic with the appropriate memory_order. If for some reason that's too slow, use inline assembly.
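For instance, since there's only a single writer, the counter doesn't even need an atomic read-modify-write (a sketch, assuming a 64-bit counter):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// Single writer (e.g. the interrupt handler): a plain load + store is enough,
// no RMW needed, because nobody else ever writes the variable.
void increment() {
    uint64_t v = counter.load(std::memory_order_relaxed);
    counter.store(v + 1, std::memory_order_release);
}

// Any number of readers.
uint64_t read() {
    return counter.load(std::memory_order_acquire);
}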
Another option is to have a buffer of non-atomic values the publisher produces and an atomic pointer to the latest.
#include <atomic>
#include <utility>

template<class T>
class PublisherValue {
    static auto constexpr N = 32;
    T values_[N];
    std::atomic<T*> current_{values_};

public:
    PublisherValue() = default;
    PublisherValue(PublisherValue const&) = delete;
    PublisherValue& operator=(PublisherValue const&) = delete;

    // Single writer thread only.
    template<class U>
    void store(U&& value) {
        T* p = current_.load(std::memory_order_relaxed);
        if (++p == values_ + N)
            p = values_;
        *p = std::forward<U>(value);
        current_.store(p, std::memory_order_release); // (1)
    }

    // Multiple readers. Make a copy to avoid referring to the value for too long.
    T load() const {
        return *current_.load(std::memory_order_consume); // Sync with (1).
    }
};
This is wait-free, but there is a small chance that a reader is de-scheduled while copying the value and hence reads the oldest value while it is being partially overwritten. Making N bigger reduces this risk.
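A minimal usage sketch (names are hypothetical):

#include <cstdint>
#include <thread>

PublisherValue<uint64_t> pv;

void writer() {                    // exactly one writer thread
    for (uint64_t i = 1; i <= 1000000; ++i)
        pv.store(i);
}

void reader() {                    // any number of reader threads
    uint64_t last = 0;
    while (last < 1000000)
        last = pv.load();
}

int main() {
    pv.store(0);                   // publish an initial value before readers start
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}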

Should I protect operations on primitive types with mutexes for being thread-safe in C++?

What is the best approach to achieve thread-safety for rather simple operations?
Consider a pair of functions:
void setVal(int val)
{
    this->_val = val;
}

int getVal() {
    return this->_val;
}
Since even assignments of primitive types aren't guaranteed to be atomic, should I modify every getter and setter in the program in the following way to be thread-safe?
void setVal(int val)
{
    this->_mutex.lock();
    this->_val = val;
    this->_mutex.unlock();
}

int getVal() {
    this->_mutex.lock();
    int result = this->_val;
    this->_mutex.unlock();
    return result;
}
Are you using _val in multiple threads? If not, then no, you don't need to synchronize access to it.
If it is used from multiple threads, then yes, you need to synchronize access, either using a mutex or by using an atomic type (like std::atomic<T> in C++0x, though other threading libraries have nonstandard atomic types as well).
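For example, with an atomic type the getter and setter need no explicit locking (a sketch using C++11 std::atomic; the class name is made up):

#include <atomic>

class Holder {                       // hypothetical class name
    std::atomic<int> _val{0};
public:
    void setVal(int val) { _val.store(val); }   // sequentially consistent by default
    int getVal() const { return _val.load(); }
};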
Mutexes are very costly, as they can be shared across processes. If the state you're limiting access to is only to be constrained to threads within your current process, then go for something much less heavy, such as a critical section or a semaphore.
On 32-bit x86 platforms, reads and writes of 32-bit values aligned on 4-byte boundary are atomic. On 64-bit platforms you can also rely on 64-bit loads and stores of 8-byte aligned values to be atomic as well. SPARC and POWER CPUs also work like that.
C++ doesn't make any guarantees like that, but in practice no compiler is going to mess with it, since every non-trivial multi-threaded program relies on this behaviour.
int getVal() {
    this->_mutex.lock();
    int result = this->_val;
    this->_mutex.unlock();
    return result;
}
What exactly are you hoping to accomplish with this? Sure, you've stopped this->_val from changing before you saved it into result, but it may still change before result is returned, or between the return and the assignment to whatever you assign it to, or a microsecond later. Regardless of what you do, you are just going to get a snapshot of a moving target. Deal with it.
void setVal(int val)
{
    this->_mutex.lock();
    this->_val = val;
    this->_mutex.unlock();
}
Similarly, what is this buying you? If you call setVal(-5) and setVal(17) from separate threads at the same time, what value should be there after both complete? You've gone to some trouble to make sure that the first to start is also the first to finish, but how does that help you get the "right" value set?

lock-free memory reclamation with 64bit pointers

Herlihy and Shavit's book (The Art of Multiprocessor Programming) solution to memory reclamation uses Java's AtomicStampedReference<T>.
To write one in C++ for x86_64, I imagine, requires at least a 12-byte swap operation: 8 bytes for a 64-bit pointer and 4 for the int stamp.
Is there x86 hardware support for this, and if not, any pointers on how to do wait-free memory reclamation without it?
Yes, there is hardware support, though I don't know if it is exposed by C++ libraries. Anyway, if you don't mind doing some low-level unportable assembly-language trickery, look up the CMPXCHG16B instruction in the Intel manuals.
Windows gives you a bunch of Interlocked functions that are atomic and can probably be used to do what you want. Similar functions exist for other platforms, and I believe Boost has an interlocked library as well.
Your question isn't super clear and I don't have a copy of Herlihy and Shavit lying around. Perhaps if you elaborated or gave pseudocode outlining what you want to do, we could give you a more specific answer.
OK, I have the book, so hopefully this helps.
For others who may provide answers, the point is to implement this class:
template <typename T>
class AtomicReference {
public:
    void set(T *ref, int stamp) { ... }
    T *get(int *stamp) { ... }

private:
    T *_ref;
    int _stamp;
};
in a lock-free way, so that:
set() updates the reference and the stamp, atomically.
get() returns the reference and sets *stamp to the stamp corresponding to that reference.
JDonner, please correct me if I am wrong.
Now my answer: I don't think you can do it without a lock somewhere (a lock can be a while (test_and_set() != ...) loop), so there is no lock-free algorithm for this. Otherwise it would be possible to build an N-byte register in a lock-free way for any N.
If you look at section 9.8.1 of the book, the AtomicMarkableReference is the same thing with a single bit instead of an integer stamp. The authors suggest "stealing" a bit from a pointer, so that the mark and the pointer can be extracted from a single word (almost a quote). This obviously means that they want to use a single atomic register to do it.
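That bit-stealing trick might look like this with C++11 atomics (a sketch, untested; it assumes T is at least 2-byte aligned so the low pointer bit is free):

#include <atomic>
#include <cstdint>

// Sketch: AtomicMarkableReference via bit stealing.
template <typename T>
class AtomicMarkableReference {
    std::atomic<uintptr_t> word_{0};   // pointer and mark packed into one word
public:
    void set(T *ref, bool mark) {
        word_.store(reinterpret_cast<uintptr_t>(ref) | uintptr_t(mark));
    }
    T *get(bool *mark) {
        uintptr_t w = word_.load();
        *mark = w & 1;
        return reinterpret_cast<T *>(w & ~uintptr_t(1));
    }
    // A single-word CAS updates pointer and mark together, atomically.
    bool compare_and_set(T *expRef, bool expMark, T *newRef, bool newMark) {
        uintptr_t expected = reinterpret_cast<uintptr_t>(expRef) | uintptr_t(expMark);
        uintptr_t desired  = reinterpret_cast<uintptr_t>(newRef) | uintptr_t(newMark);
        return word_.compare_exchange_strong(expected, desired);
    }
};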
However, there may be a way to build wait-free memory reclamation without it. I don't know.
Yes, x64 supports this; you need to use CMPXCHG16B.
You can save a bit of memory by relying on the fact that the pointer will use fewer than 64 bits. First, define a compare-and-set function (this asm works in GCC and ICC):
#include <stdint.h>

inline bool CAS_(volatile uint64_t *mem, uint64_t old_val, uint64_t new_val)
{
    uint32_t old_high = old_val >> 32, old_low = (uint32_t)old_val;
    uint32_t new_high = new_val >> 32, new_low = (uint32_t)new_val;
    char res;

    asm volatile("lock; cmpxchg8b %0; setz %1"
                 : "+m" (*mem),      // the 8-byte memory operand, read/write
                   "=q" (res),       // success flag (an output; it was wrongly an input before)
                   "+a" (old_low),   // expected low half in eax (receives the old value on failure)
                   "+d" (old_high)   // expected high half in edx
                 : "b" (new_low),    // replacement low half in ebx
                   "c" (new_high)    // replacement high half in ecx
                 : "cc", "memory");
    return res;
}
You'll then need to build a tagged-pointer type. I'm assuming a 40-bit pointer with a cacheline-width of 128-bytes (like Nehalem). Aligning to the cache-line will give enormous speed improvements by reducing false-sharing, contention, etc.; this has the obvious trade-off of using a lot more memory, in some situations.
template <typename pointer_type, typename tag_type, int PtrAlign = 7, int PtrWidth = 40>
struct tagged_pointer
{
    typedef unsigned long raw_value_type;

    // Compute the masks in the raw type so the shift doesn't overflow a 32-bit int.
    static const raw_value_type PtrMask = (raw_value_type(1) << (PtrWidth - PtrAlign)) - 1;
    static const raw_value_type TagMask = ~PtrMask;

    raw_value_type raw_m_;

    tagged_pointer() : raw_m_(0) {}
    tagged_pointer(pointer_type ptr) { this->pack(ptr, 0); }
    tagged_pointer(pointer_type ptr, tag_type tag) { this->pack(ptr, tag); }

    void pack(pointer_type ptr, tag_type tag)
    {
        this->raw_m_ = 0;
        this->raw_m_ |= ((raw_value_type)ptr >> PtrAlign) & PtrMask;
        this->raw_m_ |= ((raw_value_type)tag << (PtrWidth - PtrAlign)) & TagMask;
    }

    pointer_type ptr() const
    {
        raw_value_type p = (this->raw_m_ & PtrMask) << PtrAlign;
        return *reinterpret_cast<pointer_type*>(&p);
    }

    tag_type tag() const
    {
        raw_value_type t = (this->raw_m_ & TagMask) >> (PtrWidth - PtrAlign);
        return *reinterpret_cast<tag_type*>(&t);
    }
};
I haven't had a chance to debug this code, so you'll need to do that, but this is the general idea.
Note: on the x86_64 architecture with gcc you can enable 128-bit CAS with the -mcx16 gcc option.
int main()
{
    __int128_t x = 0;
    __sync_bool_compare_and_swap(&x, 0, 10);
    return 0;
}
Compile with:
gcc -mcx16 file.c
The cmpxchg16b instruction provides the expected operation, but beware that some older x86-64 processors don't have it.
You then just need to build an entity with the counter and the pointer, plus the inline asm code. I've written a blog post on the subject here: Implementing Generic Double Word Compare And Swap.
Nevertheless, you don't need this operation if you just want to prevent early free and ABA issues. Hazard pointers are simpler and don't require specific asm code (as long as you use C++11 atomic values). I've got a repo on Bitbucket with experimental implementations of various lock-free algorithms: Lock Free Experiment (beware: all these implementations are toys for experimentation, not reliable, tested code for production).