G++ SSE memory alignment on the stack

G++ SSE memory alignment on the stack - c++

I am attempting to re-write a raytracer using Streaming SIMD Extensions. My original raytracer used inline assembly and movups instructions to load data into the xmm registers. I have read that compiler intrinsics are not significantly slower than inline assembly (I suspect I may even gain speed by avoiding unaligned memory accesses), and much more portable, so I am attempting to migrate my SSE code to use the intrinsics in xmmintrin.h. The primary class affected is vector, which looks something like this:
#include "xmmintrin.h"
union vector {
__m128 simd;
float raw[4];
//some constructors
//a bunch of functions and operators
} __attribute__ ((aligned (16)));
I have read previously that the g++ compiler will automatically allocate structs along memory boundaries equal to that of the size of the largest member variable, but this does not seem to be occurring, and the aligned attribute isn't helping. My research indicates that this is likely because I am allocating a whole bunch of function-local vectors on the stack, and that alignment on the stack is not guaranteed in x86. Is there any way to force this alignment? I should mention that this is running under native x86 Linux on a 32-bit machine, not Cygwin. I intend to implement multithreading in this application further down the line, so declaring the offending vector instances to be static isn't an option. I'm willing to increase the size of my vector data structure, if needed.

The simplest way is std::aligned_storage, which takes alignment as a second parameter.
If you don't have it yet, you might want to check Boost's version.
Then you can build your union:
union vector {
__m128 simd;
std::aligned_storage<16, 16> alignment_only;
}
Finally, if it does not work, you can always create your own little class:
template <typename Type, intptr_t Align> // Align must be a power of 2
class RawStorage
{
public:
Type* operator->() {
return reinterpret_cast<Type const*>(aligned());
}
Type const* operator->() const {
return reinterpret_cast<Type const*>(aligned());
}
Type& operator*() { return *(operator->()); }
Type const& operator*() const { return *(operator->()); }
private:
unsigned char* aligned() {
if (data & ~(Align-1) == data) { return data; }
return (data + Align) & ~(Align-1);
}
unsigned char data[sizeof(Type) + Align - 1];
};
It will allocate a bit more storage than necessary, but this way alignment is guaranteed.
int main(int argc, char* argv[])
{
RawStorage<__m128, 16> simd;
*simd = /* ... */;
return 0;
}
With luck, the compiler might be able to optimize away the pointer alignment stuff if it detects the alignment is necessary right.

A few weeks ago, I had re-written an old ray tracing assignment from my university days, updating it to run it on 64-bit linux and to make use of the SIMD instructions. (The old version incidentally ran under DOS on a 486, to give you an idea of when I last did anything with it).
There very well may be better ways of doing it, but here is what I did ...
typedef float v4f_t __attribute__((vector_size (16)));
class Vector {
...
union {
v4f_t simd;
float f[4];
} __attribute__ ((aligned (16)));
...
};
Disassembling my compiled binary showed that it was indeed making use of the movaps instruction.
Hope this helps.

Normally all you should need is:
union vector {
__m128 simd;
float raw[4];
};
i.e. no additional __attribute__ ((aligned (16))) required for the union itself.
This works as expected on pretty much every compiler I've ever used, with the notable exception of gcc 2.95.2 back in the day, which used to screw up stack alignment in some cases.

I use this union trick all the time with __m128 and it works with GCC on Mac and Visual C++ on Windows, so this must be a bug in the compiler that you use.
The other answers contain good workarounds though.

If you need an array of N of these objects, allocate vector raw[N+1], and use vector* const array = reinterpret_cast<vector*>(reinterpret_cast<intptr_t>(raw+1) & ~15) as the base address of your array. This will always be aligned.

Related

Is there any portable way to ensure a struct is defined without padding bytes using c++11? [duplicate]

This question already has answers here:
Compile-time check to make sure that there is no padding anywhere in a struct
(4 answers)
Closed 3 years ago.
Lets consider the following task:
My C++ module as part of an embedded system receives 8 bytes of data, like: uint8_t data[8].
The value of the first byte determines the layout of the rest (20-30 different). In order to get the data effectively, I would create different structs for each layout and put each to a union and read the data directly from the address of my input through a pointer like this:
struct Interpretation_1 {
uint8_t multiplexer;
uint8_t timestamp;
uint32_t position;
uint16_t speed;
};
// and a lot of other struct like this (with bitfields, etc..., layout is not defined by me :( )
union DataInterpreter {
Interpretation_1 movement;
//Interpretation_2 temperatures;
//etc...
};
...
uint8_t exampleData[8] {1u, 10u, 20u,0u,0u,0u, 5u,0u};
DataInterpreter* interpreter = reinterpret_cast<DataInterpreter*>(&exampleData);
std::cout << "position: " << +interpreter->movement.position << "\n";
The problem I have is, the compiler can insert padding bytes to the interpretation structs and this kills my idea. I know I can use
with gcc: struct MyStruct{} __attribute__((__packed__));
with MSVC: I can use #pragma pack(push, 1) MyStruct{}; #pragma pack(pop)
with clang: ? (I could check it)
But is there any portable way to achieve this? I know c++11 has e.g. alignas for alignment control, but can I use it for this? I have to use c++11 but I would be just interested if there is a better solution with later version of c++.

But is there any portable way to achieve this?
No, there is no (standard) way to "make" a type that would have padding to not have padding in C++. All objects are aligned at least as much as their type requires and if that alignment doesn't match with the previous sub objects, then there will be padding and that is unavoidable.
Furthermore, there is another problem: You're accessing through a reinterpreted pointed that doesn't point to an object of compatible type. The behaviour of the program is undefined.
We can conclude that classes are not generally useful for representing arbitrary binary data. The packed structures are non-standard, and they also aren't compatible across different systems with different representations for integers (byte endianness).
There is a way to check whether a type contains padding: Compare the size of the sub objects to the size of the complete object, and do this recursively to each member. If the sizes don't match, then there is padding. This is quite tricky however because C++ has minimal reflection capabilities, so you need to resort either hard coding or meta programming.
Given such check, you can make the compilation fail on systems where the assumption doesn't hold.
Another handy tool is std::has_unique_object_representations (since C++17) which will always be false for all types that have padding. But note that it will also be false for types that contain floats for example. Only types that return true can be meaningfully compared for equality with std::memcmp.

Reading from unaligned memory is undefined behavior in C++. In other words, the compiler is allowed to assume that every uint32_t is located at a alignof(uint32_t)-byte boundary and every uint16_t is located at a alignof(uint16_t)-byte boundary. This means that if you somehow manage to pack your bytes portably, doing interpreter->movement.position will still trigger undefined behaviour.
(In practice, on most architectures, unaligned memory access will still work, but albeit incur a performance penalty.)
You could, however, write a wrapper, like how std::vector<bool>::operator[] works:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <type_traits>
template <typename T>
struct unaligned_wrapper {
static_assert(std::is_trivial<T>::value);
std::aligned_storage_t<sizeof(T), 1> buf;
operator T() const noexcept {
T ret;
memcpy(&ret, &buf, sizeof(T));
return ret;
}
unaligned_wrapper& operator=(T t) noexcept {
memcpy(&buf, &t, sizeof(T));
return *this;
}
};
struct Interpretation_1 {
unaligned_wrapper<uint8_t> multiplexer;
unaligned_wrapper<uint8_t> timestamp;
unaligned_wrapper<uint32_t> position;
unaligned_wrapper<uint16_t> speed;
};
// and a lot of other struct like this (with bitfields, etc..., layout is not defined by me :( )
union DataInterpreter {
Interpretation_1 movement;
//Interpretation_2 temperatures;
//etc...
};
int main(){
uint8_t exampleData[8] {1u, 10u, 20u,0u,0u,0u, 5u,0u};
DataInterpreter* interpreter = reinterpret_cast<DataInterpreter*>(&exampleData);
std::cout << "position: " << interpreter->movement.position << "\n";
}
This would ensure that every read or write to the unaligned integer is transformed to a bytewise memcpy, which does not have any alignment requirement. There might be a performance penalty for this on architectures with the ability to access unaligned memory quickly, but it would work on any conforming compiler.

Using std::atomic with aligned classes

I have a mat4 class, a 4x4 matrix that uses sse intrinsics. This class is aligned using _MM_ALIGN16, because it stores the matrix as a set of __m128's. The problem is, when I declare an atomic<mat4>, my compiler yells at me:
f:\program files (x86)\microsoft visual studio 12.0\vc\include\atomic(504): error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned
This is the same error I get when I try to pass any class aligned with _MM_ALIGN16 as an argument for a function (without using const &).
How can I declare an atomic version of my mat4 class?

The MSC compiler has never supported more than 4 bytes of alignment for parameters on the x86 stack, and there is no workaround.
You can verify this yourself by compiling,
struct A { __declspec(align(4)) int x; };
void foo(A a) {}
versus,
// won't compile, alignment guarantee can't be fulfilled
struct A { __declspec(align(8)) int x; };
versus,
// __m128d is naturally aligned, again - won't compile
struct A { __m128d x; };
Generally MSC is absolved by the following,
You cannot specify alignment for function parameters.
align (C++)
And you cannot specify the alignment, because MSC writers wanted to reserve the freedom to decide on the alignment,
The x86 compiler uses a different method for aligning the stack. By
default, the stack is 4-byte aligned. Although this is space
efficient, you can see that there are some data types that need to be
8-byte aligned, and that, in order to get good performance, 16-byte
alignment is sometimes needed. The compiler can determine, on some
occasions, that dynamic 8-byte stack alignment would be
beneficial—notably when there are double values on the stack.
The compiler does this in two ways. First, the compiler can use
link-time code generation (LTCG), when specified by the user at
compile and link time, to generate the call-tree for the complete
program. With this, it can determine regions of the call-tree where
8-byte stack alignment would be beneficial, and it determines
call-sites where the dynamic stack alignment gets the best payoff. The
second way is used when the function has doubles on the stack, but,
for whatever reason, has not yet been 8-byte aligned. The compiler
applies a heuristic (which improves with each iteration of the
compiler) to determine whether the function should be dynamically
8-byte aligned.
Windows Data Alignment on IPF, x86, and x64
Thus as long as you use MSC with the 32-bit platform toolset, this issue is unavoidable.
The x64 ABI has been explicit about the alignment, defining that non-trivial structures or structures over certain sizes are passed as a pointer parameter. This is elaborated in Section 3.2.3 of the ABI, and MSC had to implement this to be compatible with the ABI.
Path 1: Use another Windows compiler toolchain: GCC or ICC.
Path 2: Move to a 64-bit platform MSC toolset
Path 3: Reduce your use cases to std::atomic<T> with T=__m128d, because it will be possible to skip the stack and pass the variable in an XMM register directly.

The atomic<T> probably has a constructor which is passed a copy of T as a (formal) parameter. For example in the atomic header packaged with GCC 4.5 :
97: atomic(_Tp __i) : _M_i(__i) { }
This is problematic for exactly the same reason as any other function which has a memory aligned type as a parameter: It would be very complicated and slow for functions to keep track of memory aligned data on the stack.
Even if the compiler allowed it, this approach would incur a significant performance penalty. Assuming you are trying to optimise for speed I would implement a less fine grained memory access approach. Either locking access to a chunk of memory whilst performing a series of calculations, or explicitly designing your program so that threads never try and access the same piece of memory.

I faced a similar problem using Agner Fog's vectorclass in MSVC. The problem happens in 32-bit mode. If you compile in 64-bit mode release mode I don't think you will have this problem. In Windows and Unix all variables on the stack are aligned to 16 bytes in 64-bit mode but not necessarily in 32-bit mode. In his manual under compile time errors he writes
"error C2719: formal parameter with __declspec(align('16')) won't be aligned".
The Microsoft compiler cannot handle vectors as function parameters. The
easiest solution is to change the parameter to a const reference, e.g.:
Vec4f my_function(Vec4f const & x) {
... }
So if you use a const reference (as you mentioned) when you pass your class to a function it should work in 32-bit mode as well.
Edit: Based on this Self-contained, STL-compatible implementation of std::vector I think you can use a "thin wrapper". Something like.
template <typename T>
struct wrapper : public T
{
wrapper() {}
wrapper(const T& rhs) : T(rhs) {}
};
struct __declspec(align(64)) mat4
{
//float x, y, z, w;
};
int main()
{
atomic< wrapper<mat4> > m; // OK, no C2719 error
return 0;
}

I don't profess to understand how __declspec(align(foo)) is supposed to work, but this standard C++ program compiles and runs fine in gcc & clang using alignas(16):
struct alignas(16) mat4 {
float some_floats[4][4];
};
std::atomic<mat4> am4;
static_assert(alignof(decltype(am4)) == 16,
"Jabberwocky is killing user.");
int main() {
static const mat4 foo = {{
{ 1, 2, 3, 4 },
{ 1, 2, 3, 4 },
{ 1, 2, 3, 4 },
{ 1, 2, 3, 4 }
}};
am4 = foo;
}

What is the recommended way to align memory in C++11

I am working on a single producer single consumer ring buffer implementation.I have two requirements:
Align a single heap allocated instance of a ring buffer to a cache line.
Align a field within a ring buffer to a cache line (to prevent false sharing).
My class looks something like:
#define CACHE_LINE_SIZE 64 // To be used later.
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
....
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_; // This needs to be aligned to a cache line.
};
Let me first tackle point 1 i.e. aligning a single heap allocated instance of the class. There are a few ways:
Use the c++ 11 alignas(..) specifier:
template<typename T, uint64_t num_events>
class alignas(CACHE_LINE_SIZE) RingBuffer {
public:
....
private:
// All the private fields.
};
Use posix_memalign(..) + placement new(..) without altering the class definition. This suffers from not being platform independent:
void* buffer;
if (posix_memalign(&buffer, 64, sizeof(processor::RingBuffer<int, kRingBufferSize>)) != 0) {
perror("posix_memalign did not work!");
abort();
}
// Use placement new on a cache aligned buffer.
auto ring_buffer = new(buffer) processor::RingBuffer<int, kRingBufferSize>();
Use the GCC/Clang extension __attribute__ ((aligned(#)))
template<typename T, uint64_t num_events>
class RingBuffer {
public:
....
private:
// All the private fields.
} __attribute__ ((aligned(CACHE_LINE_SIZE)));
I tried to use the C++ 11 standardized aligned_alloc(..) function instead of posix_memalign(..) but GCC 4.8.1 on Ubuntu 12.04 could not find the definition in stdlib.h
Are all of these guaranteed to do the same thing? My goal is cache-line alignment so any method that has some limits on alignment (say double word) will not do. Platform independence which would point to using the standardized alignas(..) is a secondary goal.
I am not clear on whether alignas(..) and __attribute__((aligned(#))) have some limit which could be below the cache line on the machine. I can't reproduce this any more but while printing addresses I think I did not always get 64 byte aligned addresses with alignas(..). On the contrary posix_memalign(..) seemed to always work. Again I cannot reproduce this any more so maybe I was making a mistake.
The second aim is to align a field within a class/struct to a cache line. I am doing this to prevent false sharing. I have tried the following ways:
Use the C++ 11 alignas(..) specifier:
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
...
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_ alignas(CACHE_LINE_SIZE);
};
Use the GCC/Clang extension __attribute__ ((aligned(#)))
template<typename T, uint64_t num_events>
class RingBuffer { // This needs to be aligned to a cache line.
public:
...
private:
std::atomic<int64_t> publisher_sequence_ ;
int64_t cached_consumer_sequence_;
T* events_;
std::atomic<int64_t> consumer_sequence_ __attribute__ ((aligned (CACHE_LINE_SIZE)));
};
Both these methods seem to align consumer_sequence to an address 64 bytes after the beginning of the object so whether consumer_sequence is cache aligned depends on whether the object itself is cache aligned. Here my question is - are there any better ways to do the same?
EDIT:
The reason aligned_alloc did not work on my machine was that I was on eglibc 2.15 (Ubuntu 12.04). It worked on a later version of eglibc.
From the man page: The function aligned_alloc() was added to glibc in version 2.16.
This makes it pretty useless for me since I cannot require such a recent version of eglibc/glibc.

Unfortunately the best I have found is allocating extra space and then using the "aligned" part. So the RingBuffer new can request an extra 64 bytes and then return the first 64 byte aligned part of that. It wastes space but will give the alignment you need. You will likely need to set the memory before what is returned to the actual alloc address to unallocate it.
[Memory returned][ptr to start of memory][aligned memory][extra memory]
(assuming no inheritence from RingBuffer) something like:
void * RingBuffer::operator new(size_t request)
{
static const size_t ptr_alloc = sizeof(void *);
static const size_t align_size = 64;
static const size_t request_size = sizeof(RingBuffer)+align_size;
static const size_t needed = ptr_alloc+request_size;
void * alloc = ::operator new(needed);
void *ptr = std::align(align_size, sizeof(RingBuffer),
alloc+ptr_alloc, request_size);
((void **)ptr)[-1] = alloc; // save for delete calls to use
return ptr;
}
void RingBuffer::operator delete(void * ptr)
{
if (ptr) // 0 is valid, but a noop, so prevent passing negative memory
{
void * alloc = ((void **)ptr)[-1];
::operator delete (alloc);
}
}
For the second requirement of having a data member of RingBuffer also 64 byte aligned, for that if you know that the start of this is aligned, you can pad to force the alignment for data members.

The answer to your problem is std::aligned_storage. It can be used top level and for individual members of a class.

After some more research my thoughts are:
Like #TemplateRex pointed out there does not seem to be a standard way to align to more than 16 bytes. So even if we use the standardized alignas(..)there is no guarantee unless the alignment boundary is less than or equal to 16 bytes. I'll have to verify that it works as expected on a target platform.
__attribute ((aligned(#))) or alignas(..) cannot be used to align a heap allocated object as I suspected i.e. new() doesn't do anything with these annotations. They seem to work for static objects or stack allocations with the caveats from (1).
Either posix_memalign(..) (non standard) or aligned_alloc(..) (standardized but couldn't get it to work on GCC 4.8.1) + placement new(..) seems to be the solution. My solution for when I need platform independent code is compiler specific macros :)
Alignment for struct/class fields seems to work with both __attribute ((aligned(#))) and alignas() as noted in the answer. Again I think the caveats from (1) about guarantees on alignment stand.
So my current solution is to use posix_memalign(..) + placement new(..) for aligning a heap allocated instance of my class since my target platform right now is Linux only. I am also using alignas(..) for aligning fields since it's standardized and at least works on Clang and GCC. I'll be happy to change it if a better answer comes along.

I don't know if it is the best way to align memory allocated with a new operator, but it is certainly very simple !
This is the way it is done in thread sanitizer pass in GCC 6.1.0
#define ALIGNED(x) __attribute__((aligned(x)))
static char myarray[sizeof(myClass)] ALIGNED(64) ;
var = new(myarray) myClass;
Well, in sanitizer_common/sanitizer_internal_defs.h, it is also written
// Please only use the ALIGNED macro before the type.
// Using ALIGNED after the variable declaration is not portable!
So I do not know why the ALIGNED here is used after the variable declaration. But it is an other story.

Memory Access Violations When Using SSE Operations

I've been trying to re-implement some existing vector and matrix classes to use SSE3 commands, and I seem to be running into these "memory access violation" errors whenever I perform a series of operations on an array of vectors. I'm relatively new to SSE, so I've been starting off simple. Here's the entirety of my vector class:
class SSEVector3D
{
public:
SSEVector3D();
SSEVector3D(float x, float y, float z);
SSEVector3D& operator+=(const SSEVector3D& rhs); //< Elementwise Addition
float x() const;
float y() const;
float z() const;
private:
float m_coords[3] __attribute__ ((aligned (16))); //< The x, y and z coordinates
};
So, not a whole lot going on yet, just some constructors, accessors, and one operation. Using my (admittedly limited) knowledge of SSE, I implemented the addition operation as follows:
SSEVector3D& SSEVector3D::operator+=(const SSEVector3D& rhs)
{
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
return (*this);
}
To speed-test my new vector class against the old one (to see if it's worth re-implementing the whole thing), I created a simple program that generates a random array of SSEVector3D objects and adds them together. Nothing too complicated:
SSEVector3D sseSum(0, 0, 0);
for(i=0; i<sseVectors.size(); i++)
{
sseSum += sseVectors[i];
}
printf("Total: %f %f %f\n", sseSum.x(), sseSum.y(), sseSum.z());
The sseVectors variable is an std::vector containing elements of type SSEVector3D, whose components are all initialized to random numbers between -1 and 1.
Here's the issue I'm having. If the size of sseVectors is 8,191 or less (a number I arrived at through a lot of trial and error), this runs fine. If the size is 8,192 or more, I get this error when I try to run it:
signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
However, if I comment out that print statement at the end, I get no error even if sseVectors has a size of 8,192 or more.
Is there something wrong with the way I've written this vector class? I'm running Ubuntu 12.04.1 with GCC version 4.6

First, and foremost, don't do this
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
With SSE, always do your loads and stores explicitly via the appropriate intrinsics, never by just dereferencing. Instead of storing an array of 3 floats in your class, store a value of type _m128. That should make the compiler align instances of your class correctly, without any need for align attributes.
Note, however, that this won't work very well with MSVC. MSVC seems to generally be unable to cope with alignment requirements stronger than 8-byte aligned for by-value arguments :-(. The last time I needed to port SSE code to windows, my solution was to use Intel's C++ compiler for the SSE parts instead of MSVC...

The trick is to notice that __m128 is 16 byte aligned. Use _malloc_aligned() to assure that your float array is correctly aligned, then you can go ahead and cast your float to an array of __m128. Make sure also that the number of floats you allocate is divisible by four.

lock-free memory reclamation with 64bit pointers

Herlihy and Shavit's book (The Art of Multiprocessor Programming) solution to memory reclamation uses Java's AtomicStampedReference<T>;.
To write one in C++ for the x86_64 I imagine requires at least a 12 byte swap operation - 8 for a 64bit pointer and 4 for the int.
Is there x86 hardware support for this and if not, any pointers on how to do wait-free memory reclamation without it?

Yes, there is hardware support, though I don't know if it is exposed by C++ libraries. Anyway, if you don't mind doing some low-level unportable assembly language trickery - look up the CMPXCHG16B instruction in Intel manuals.

Windows gives you a bunch of Interlocked functions that are atomic and can probably be used to do what you want. Similar functions exist for other platforms, and I believe Boost has an interlocked library as well.
Your question isn't super clear and I don't have a copy of Herlihy and Shavit laying around. Perhaps if you elaborated or gave psuedo code outlining what you want to do, we can give you a more specific answer.

Ok hopefully, I have the book,
For others that may provides answers, the point is to implement this class :
class AtomicReference<T>{
public:
void set(T *ref, int stamp){ ... }
T *get(int *stamp){ ... }
private:
T *_ref;
int _stamp;
};
in a lock-free way so that :
set() updates the reference and the stamp, atomicly.
get() returns the reference and set *stamp to the stamp corresponding to the reference.
JDonner please, correct me if I am wrong.
Now my answer : I don't think you can do it without a lock somewhere (a lock can be while(test_and_set() != ..)). Therefore there is no lockfree algorithm for this. This would mean that it is possible to build an N-bythe register a lock-free way for any N.
If you look at the book pragma 9.8.1, The AtomicMarkableReference wich is the same with a single bit insteam of an integer stamp. The author suggest to "steal" a bit from a pointer to extract the mark and the pointer from a single word (alsmost quoted) This obviously mean that they want to use a single atomic register to do it.
However, there may be a way to bluid a wait-free memory reclamation without it. I don't know.

Yes, x64 supports this; you need to use CMPXCHG16B.

You can save a bit on memory by relying on the fact that the pointer will use less than 64 bits. First, define a compare&set function (this ASM works in GCC & ICC):
inline bool CAS_ (volatile uint64_t* mem, uint64_t old_val, uint64_t new_val)
{
unsigned long old_high = old_val >> 32, old_low = old_val;
unsigned long new_high = new_val >> 32, new_low = new_val;
char res = 0;
asm volatile("lock; cmpxchg8b (%6);"
"setz %7; "
: "=a" (old_low), // 0
"=d" (old_high) // 1
: "0" (old_low), // 2
"1" (old_high), // 3
"b" (new_low), // 4
"c" (new_high), // 5
"r" (mem), // 6
"m" (res) // 7
: "cc", "memory");
return res;
}
You'll then need to build a tagged-pointer type. I'm assuming a 40-bit pointer with a cacheline-width of 128-bytes (like Nehalem). Aligning to the cache-line will give enormous speed improvements by reducing false-sharing, contention, etc.; this has the obvious trade-off of using a lot more memory, in some situations.
template <typename pointer_type, typename tag_type, int PtrAlign=7, int PtrWidth=40>
struct tagged_pointer
{
static const unsigned int PtrMask = (1 << (PtrWidth - PtrAlign)) - 1;
static const unsigned int TagMask = ~ PtrMask;
typedef unsigned long raw_value_type;
raw_value_type raw_m_;
tagged_pointer () : raw_m_(0) {}
tagged_pointer (pointer_type ptr) { this->pack(ptr, 0); }
tagged_pointer (pointer_type ptr, tag_type tag) { this->pack(ptr, tag); }
void pack (pointer_type ptr, tag_type tag)
{
this->raw_m_ = 0;
this->raw_m_ |= ((ptr >> PtrAlign) & PtrMask);
this->raw_m_ |= ((tag << (PtrWidth - PtrAlign)) & TagMask);
}
pointer_type ptr () const
{
raw_value_type p = (this->raw_m_ & PtrMask) << PtrAlign;
return *reinterpret_cast<pointer_type*>(&p);
}
tag_type tag () const
{
raw_value_type t = (this->raw_m_ & TagMask) >> (PtrWidth - PtrAlign_;
return *reinterpret_cast<tag_type*>(&t);
}
};
I haven't had a chance to debug this code, so you'll need to do that, but this is the general idea.

Note, on x86_64 architecture and gcc you can enable 128 bit CAS. It can be enabled with -mcx16 gcc option.
int main()
{
__int128_t x = 0;
__sync_bool_compare_and_swap(&x,0,10);
return 0;
}
Compile with:
gcc -mcx16 file.c

The cmpxchg16b operation provides the expected operation but beware that some older x86-64 processors don't have this instruction.
You then just need to build an entity with the counter and and the pointer and the asm-inline code. I've written a blog post on the subject here:Implementing Generic Double Word Compare And Swap
Nevertheless, you don't need this operation if you just want to prevent early-free and ABA issues. The hazard pointer is more simpler and doesn't require specific asm code (as long as you use C++11 atomic values.) I've got a repo on bitbucket with experimental implementations of various lock-free algorithms: Lock Free Experiment (beware all these implementations are toys for experimentation, not reliable and tested code for production.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js