improve atomic read from InterlockedCompareExchange()

improve atomic read from InterlockedCompareExchange() - c++

Assuming architecture is ARM64 or x86-64.
I want to make sure if these two are equivalent:
a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
MyBarrier(); a = *(volatile __int64*)p; MyBarrier();
Where MyBarrier() is a memory barrier (hint) of compiler level, like __asm__ __volatile__ ("" ::: "memory").
So method 2 is supposed to be faster than method 1.
I heard that _Interlocked() functions would also imply memory barrier of both compiler and hardware level.
I heard that read (proper-aligned) intrinsic data is atomic on these architectures, but I am not sure if method 2 could be widely used?
(ps. because I think CPU will handle data dependency automatically so hardware barrier is not much considered here.)
Thank you for any advise/correction on this.
Here is some benchmarks on Ivy Bridge (i5 laptop).
(1E+006 loops: 27ms):
; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx
(1E+006 loops: 27ms):
; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 7ms):
; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 1.26ms, not synchronized?):
; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]

For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is true on your platform.
However, _MemoryBarrier() is not a "hint to the compiler". _MemoryBarrier() on x86 prevents compiler and CPU reordering, and also ensures global visibility after the write. You also probably only need the first _MemoryBarrier(), the second one could be replaced with a _ReadWriteBarrier() unless a is also a shared variable - but you don't even need that since you are reading through a volatile pointer, which will prevent any compiler reordering in MSVC.
When you create this replacement, you basically end up with pretty much the same result:
// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val
// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val
Running these two in a loop, on my i7 Ivy Bridge laptop, gives equal results, within 2-3%.
However, with two memory barriers, the "optimized version" is actually around 2x slower.
So the better question is: Why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic, and an optimizing compiler should compile it to the most optimized version for your architecture, and add all the necessary barriers to prevent reordering and ensure cache coherency.

Related

Why does MSVC generate nop instructions for atomic loads on x64?

If you compile code such as
#include <atomic>
int load(std::atomic<int> *p) {
return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC generates NOP padding after each memory load:
int load(std::atomic<int> *) PROC
mov edx, DWORD PTR [rcx]
npad 1
mov eax, DWORD PTR [rcx]
npad 1
add eax, edx
ret 0
Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?

p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997
the nops get inserted because of the flag /volatileMetadata which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It’ll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.
And compiling with /volatileMetadata- does indeed remove the npad.

Error : Invalid Character '(' in mnemonic

Hi I am trying to compile the below assembly code on Linux using gcc 7.5 version but somehow getting the error
Error : Invalid Character '(' in mnemonic
bool InterlockedCompareAndStore128(int *dest,int *newVal,int *oldVal)
{
asm(
"push %rbx\n"
"push %rdi\n"
"mov %rcx, %rdi\n" // ptr to dest -> RDI
"mov 8(%rdx), %rcx\n" // newVal -> RCX:RBX
"mov (%rdx), %rbx\n"
"mov 8(%r8), %rdx\n" // oldVal -> RDX:RAX
"mov (%r8), %rax\n"
"lock (%rdi), cmpxchg16b\n"
"mov $0, %rax\n"
"jnz exit\n"
"inc1 %rax\n"
"exit:;\n"
"pop %rdi\n"
"pop %rbx\n"
);
}
Can anyone suggest how to resolve this . Checked many online links and tutorials for Assembly code but could not relate the exact issue.
Thanks for the help in advance.
In Windows I could see the implementation of the above function as:
function InterlockedCompareExchange128;
asm
.PUSHNV RBX
MOV R10,RCX
MOV RBX,R8
MOV RCX,RDX
MOV RDX,[R9+8]
MOV RAX,[R9]
LOCK CMPXCHG16B [R10]
MOV [R9+8],RDX
MOV [R9],RAX
SETZ AL
MOVZX EAX, AL
end;
For PUSHNV , I could not found anything related to this on Linux. So , basically I am trying to implement same functionality in c++ on Linux.

The question here was about Invalid Character '(' in mnemonic which the other answer addresses.
However, OP's code has a number of issues beyond that problem. Here's (what I think are) two better approaches to this problem. Note that I've changed the order of the parameters and turned them const.
This one continues to use inline asm, but uses Extended asm instead of Basic. While I'm of the don't use inline asm school of thought, this might be useful or at least educational.
bool InterlockedCompareAndStore128B(__int64 *dest, const __int64 *oldVal, const __int64 *newVal)
{
bool result;
__int64 ovl = oldVal[0];
__int64 ovh = oldVal[1];
asm volatile ("lock cmpxchg16b %[ptr]"
: "=#ccz" (result), [ptr] "+m" (*dest),
"+d" (ovh), "+a" (ovl)
: "c" (newVal[1]), "b" (newVal[0])
: "cc", "memory");
// cmpxchg16b changes rdx:rax to the current value in dest. Useful if you need
// to loop until you succeed, but OP's code doesn't save the values, so I'm
// just following that spec.
//oldVal[0] = ovl;
//oldVal[1] = ovh;
return result;
}
In addition to solving the problems with the original code, it's also inlineable and shorter. The constraints likely make it harder to read, but the fact that there's only 1 line of asm might help offset that. If you want to understand what the constraints mean, check out this page (scroll down to x86 family) and the description of flag output constraints (again, scroll down for x86 family).
As an alternative, this code uses a gcc builtin and allows the compiler to generate the appropriate asm instructions. Note that this must be built with -mcx16 for best results.
bool InterlockedCompareAndStore128C(__int128 *dest, const __int128 *oldVal, const __int128 *newVal)
{
// While a sensible person would use __atomic_compare_exchange_n and let gcc generate
// cmpxchg16b, gcc decided they needed to turn this into a big hairy function call:
// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
// In short, if someone wants to compare/exchange against readonly memory, you can't just
// use cmpxchg16b cuz it would crash. Why would anyone try to exchange memory that can't
// be written to? Apparently because it's expected to *not* crash if the compare fails
// and nothing gets written. So no one gets to use that 1 line instruction and everyone
// gets an entire routine (that uses MUTEX instead of lockfree) to support this absurd
// border case. Sounds dumb to me, but that's where things stand as of 2021-05-07.
// Use the legacy function instead.
bool b = __sync_bool_compare_and_swap(dest, *oldVal, *newVal);
return b;
}
For the kibizters in the crowd, here's the code generated by -m64 -O3 -mcx16 for that last one:
InterlockedCompareAndStore128C(__int128*, __int128 const*, __int128 const*):
mov rcx, rdx
push rbx
mov rax, QWORD PTR [rsi]
mov rbx, QWORD PTR [rcx]
mov rdx, QWORD PTR [rsi+8]
mov rcx, QWORD PTR [rcx+8]
lock cmpxchg16b XMMWORD PTR [rdi]
pop rbx
sete al
ret
If someone wants to fiddle, here's the godbolt link.

There are a number of problems with this code, and I'm not convinced I'm doing you any favors by telling you how to fix the specific problem.
But the short answer is that
"lock (%rdi), cmpxchg16b\n"
should be
"lock cmpxchg16b (%rdi)\n"
Tada, now it compiles. Well, it would if inc1 was a real instruction.
But I can't help but notice that the pointers here are int *, which is 4 bytes, not 16. And that this function is not declared as naked. And using Extended asm would save you from having to push all these registers around by hand, making this code a lot slower than it needs to be.
But most of all, you should really use the builtins, like __atomic_compare_exchange because inline asm is error prone, not portable, and really hard to maintain.

How to measure the number of increments per second

I want to measure the speed in which my PC can increment a counter N times (e.g., for N = 10^9).
I tried the following code:
using namespace std
auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i)
{
}
auto end = chrono::steady_clock::now();
However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.
How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).
I use Windows 10 and Visual Studio 15.9.7.
A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)

This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your question, you do not define a constraint that the compiler must do a loop at all. Because the loop has no effect, the optimizer will remove it.
Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right because it defines too many constraints that limits the compiler's freedom to implement optimal code. Volatile often forces the compiler to write the result back to memory each time the variable is modified.
eg, this code:
void foo(int N) {
for (volatile int i = 0; i < N; ++i)
{
}
}
Becomes this assembly (on an x64 compiler I tried):
mov DWORD PTR [rsp-4], 0
mov eax, DWORD PTR [rsp-4]
cmp edi, eax
jle .L1
.L3:
mov eax, DWORD PTR [rsp-4] # Read i from mem
add eax, 1 # i++
mov DWORD PTR [rsp-4], eax # Write i to mem
mov eax, DWORD PTR [rsp-4] # Read it back again before
# evaluating the loop condition.
cmp eax, edi # Is i < N?
jl .L3 # Jump back to L3 if not.
.L1:
It sounds like your real question is more like how fast is:
L1: add eax, 1
jmp L1
Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.
I recommend playing with Godbolt to understand more about what the compiler is doing. eg https://godbolt.org/z/59XUSu

You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile("") but MSVC inline assembly has always been different and is disabled completely for 64bit programs.
It is possible to use MASM to side-step that restriction:
.MODEL FLAT
.CODE
_testfun PROC
sub ecx, 1
jnz _testfun
ret
_testfun ENDP
END
Import it into your code with extern "C" void testfun(unsigned N);.

Try volatile int i = 0 In your for loop. The volatile keyword tells the compiler this variable could change at any time, due to outside events or threads, and therefore it can't make the same assumptions about what the variable might be in the future.

C++ atomics and cross-thread visibility

AFAIK C++ atomics (<atomic>) family provide 3 benefits:
primitive instructions indivisibility (no dirty reads),
memory ordering (both, for CPU and compiler) and
cross-thread visibility/changes propagation.
And I am not sure about the third bullet, thus take a look at the following example.
#include <atomic>
std::atomic_bool a_flag = ATOMIC_VAR_INIT(false);
struct Data {
int x;
long long y;
char const* z;
} data;
void thread0()
{
// due to "release" the data will be written to memory
// exactly in the following order: x -> y -> z
data.x = 1;
data.y = 100;
data.z = "foo";
// there can be an arbitrary delay between the write
// to any of the members and it's visibility in other
// threads (which don't synchronize explicitly)
// atomic_bool guarantees that the write to the "a_flag"
// will be clean, thus no other thread will ever read some
// strange mixture of 4bit + 4bits
a_flag.store(true, std::memory_order_release);
}
void thread1()
{
while (a_flag.load(std::memory_order_acquire) == false) {};
// "acquire" on a "released" atomic guarantees that all the writes from
// thread0 (thus data members modification) will be visible here
}
void thread2()
{
while (data.y != 100) {};
// not "acquiring" the "a_flag" doesn't guarantee that will see all the
// memory writes, but when I see the z == 100 I know I can assume that
// prior writes have been done due to "release ordering" => assert(x == 1)
}
int main()
{
thread0(); // concurrently
thread1(); // concurrently
thread2(); // concurrently
// join
return 0;
}
First, please validate my assumptions in code (especially thread2).
Second, my questions are:
How does the a_flag write propagate to other cores?
Does the std::atomic synchronize the a_flag in the writer cache with the other cores cache (using MESI, or anything else), or the propagation is automatic?
Assuming that on a particular machine a write to a flag is atomic (think int_32 on x86) AND we don't have any private memory to synchronize (we only have a flag) do we need to use atomics?
Taking into consideration most popular CPU architectures (x86, x64, ARM v.whatever, IA-64), is the cross-core visibility (I am now not considering reorderings) automatic (but potentially delayed), or you need to issue specific commands to propagate any piece of data?

Cores themselves don't matter. The question is "how do all cores see the same memory update eventually", which is something your hardware does for you (e.g. cache coherency protocols). There is only one memory, so the main concern is caching, which is a private concern of the hardware.
That question seems unclear. What matters is the acquire-release pair formed by the load and store of a_flag, which is a synchronisation point and causes the effects of thread0 and thread1 to appear in a certain order (i.e. everything in thread0 before the store happens-before everything after the loop in thread1).
Yes, otherwise you wouldn't have synchronisation point.
You don't need any "commands" in C++. C++ isn't even aware of the fact that it's running on any particular kind of CPU. You could probably run a C++ program on a Rubik's cube with enough imagination. A C++ compiler chooses the necessary instructions to implement the synchronisation behaviour that's described by the C++ memory model, and on x86 that involves issuing instruction lock prefixes and memory fences, as well as not reordering instructions too much. Since x86 has a strongly ordered memory model, the above code should produce minimal additional code compared to the naive, incorrect one without atomics.
Having your thread2 in the code makes the entire program undefined behaviour.
Just for fun, and to show that working out what's happening for yourself can be edifying, I compiled the code in three variations. (I added a glbbal int x and in thread1 I added x = data.y;).
Acquire/Release: (your code)
thread0:
mov DWORD PTR data, 1
mov DWORD PTR data+4, 100
mov DWORD PTR data+8, 0
mov DWORD PTR data+12, OFFSET FLAT:.LC0
mov BYTE PTR a_flag, 1
ret
thread1:
.L14:
movzx eax, BYTE PTR a_flag
test al, al
je .L14
mov eax, DWORD PTR data+4
mov DWORD PTR x, eax
ret
Sequentially consistent: (remove the explicit ordering)
thread0:
mov eax, 1
mov DWORD PTR data, 1
mov DWORD PTR data+4, 100
mov DWORD PTR data+8, 0
mov DWORD PTR data+12, OFFSET FLAT:.LC0
xchg al, BYTE PTR a_flag
ret
thread1:
.L14:
movzx eax, BYTE PTR a_flag
test al, al
je .L14
mov eax, DWORD PTR data+4
mov DWORD PTR x, eax
ret
"Naive": (just using bool)
thread0:
mov DWORD PTR data, 1
mov DWORD PTR data+4, 100
mov DWORD PTR data+8, 0
mov DWORD PTR data+12, OFFSET FLAT:.LC0
mov BYTE PTR a_flag, 1
ret
thread1:
cmp BYTE PTR a_flag, 0
jne .L3
.L4:
jmp .L4
.L3:
mov eax, DWORD PTR data+4
mov DWORD PTR x, eax
ret
As you can see, there's not a big difference. The "incorrect" version actually looks mostly correct, except for missing the load (it uses cmp with memory operand). The sequentially consistent version hides its expensiveness in the xcgh instruction, which has an implicit lock prefix and doesn't seem to require any explicit fences.

Passing a pointer from C to assembly

I want to use "_test_and_set lock" assembly language implementation with atomic swap assembly instruction in my C/C++ program.
class LockImpl
{
public:
static void lockResource(DWORD resourceLock )
{
__asm
{
InUseLoop: mov eax, 0;0=In Use
xchg eax, resourceLock
cmp eax, 0
je InUseLoop
}
}
static void unLockResource(DWORD resourceLock )
{
__asm
{
mov resourceLock , 1
}
}
};
This works but there is a bug in here.
The problem is that i want to pass DWORD * resourceLock instead of DWORD resourceLock.
So question is that how to pass a pointer from C/C++ to assembly and get it back. ?
thanks in advance.
Regards,
-Jay.
P.S. this is done to avoid context switches between user space and kernel space.

If you're writing this for Windows, you should seriously consider using a critical section object. The critical section API functions are optimised such that they won't transition into kernel mode unless they really need to, so the normal case of no contention has very little overhead.
The biggest problem with your spin lock is that if you're on a single CPU system and you're waiting for the lock, then you're using all the cycles you can and whatever is holding the lock won't even get a chance to run until your timeslice is up and the kernel preempts your thread.
Using a critical section will be more successful than trying to roll your own user mode spin lock.

In terms of your actual question, it's pretty simple: just change the function headers to use volatile DWORD *resourceLock, and change the assembly lines that touch resourceLock to use indirection:
mov ecx, dword ptr [resourceLock]
xchg eax, dword ptr [ecx]
and
mov ecx, dword ptr [resourceLock]
lock mov dword ptr [ecx], 1
However, note that you've got a couple of other problems looming:
You say you're developing this on Windows, but want to switch to Linux. However, you're using MSVC-specific inline assembly - this will have to be ported to gcc-style when you move to Linux (in particular that involves switching from Intel syntax to AT&T syntax). You will be much better off developing with gcc even on Windows; that will minimise the pain of migration (see mingw for gcc for Windows).
Greg Hewgill is absolutely right about spinning uselessly, stopping the lock-holder from getting CPU. Consider yielding the CPU if you've been spinning for too long.
On a multiprocessor x86, you might well have a problem with memory loads and stores being re-ordered around your lock - mfence instructions in the lock and unlock procedures might be necessary.
Really, if you're worrying about locking that means you're using threading, which probably means you're using the platform-specific threading APIs already. So use the native synchronisation primitives, and switch out to the pthreads versions when you switch to Linux.

Apparently, you are compiling with MSVC using inline assembly blocks in your C++ code.
As a general remark, you should really use compiler intrinsics as inline assembly has no future: it's no more supported my MS compilers when compiling for x64.
If you need to have functions fine tuned in assembly, you will have to implement them in separate files.

The main problems with the original version in the question is that it needs to use register indirect addressing and take a reference (or pointer parameter) rather than a by-value parameter for the lock DWORD.
Here's a working solution for Visual C++. EDIT: I have worked offline with the author and we have verified the code in this answer works in his test harness correctly.
But if you're using Windows, you should really by using the Interlocked API (i.e. InterlockedExchange).
Edit: As noted by CAF, lock xchg is not required because xchg automatically asserts a BusLock.
I also added a faster version that does a non-locking read before attempting to do the xchg. This significantly reduces BusLock contention on the memory interface. The algorithm can be sped up quite a bit more (in a contentious multithreaded case) by doing backoffs (yield then sleep) for locks held a long time. For the single-threaded-CPU case, using a OS lock that sleeps immediately on held-locks will be fastest.
class LockImpl
{
// This is a simple SpinLock
// 0 - in use / busy
// 1 - free / available
public:
static void lockResource(volatile DWORD &resourceLock )
{
__asm
{
mov ebx, resourceLock
InUseLoop:
mov eax, 0 ;0=In Use
xchg eax, [ebx]
cmp eax, 0
je InUseLoop
}
}
static void lockResource_FasterVersion(DWORD &resourceLock )
{
__asm
{
mov ebx, resourceLock
InUseLoop:
mov eax, [ebx] ;// Read without BusLock
cmp eax, 0
je InUseLoop ;// Retry Read if Busy
mov eax, 0
xchg eax, [ebx] ;// XCHG with BusLock
cmp eax, 0
je InUseLoop ;// Retry if Busy
}
}
static void unLockResource(volatile DWORD &resourceLock)
{
__asm
{
mov ebx, resourceLock
mov [ebx], 1
}
}
};
// A little testing code here
volatile DWORD aaa=1;
void test()
{
LockImpl::lockResource(aaa);
LockImpl::unLockResource(aaa);
}

You should be using something like this:
volatile LONG resourceLock = 1;
if(InterlockedCompareExchange(&resourceLock, 0, 1) == 1) {
// success!
// do something, and then
resourceLock = 1;
} else {
// failed, try again later
}
See InterlockedCompareExchange.

Look at your compiler documentation to find out how to print the generated assembly language for functions.
Print the assembly language for this function:
static void unLockResource(DWORD resourceLock )
{
resourceLock = 0;
return;
}
This may not work because the compiler can optimize the function and remove all the code. You should change the above function to pass a pointer to resourceLock and then have the function set the lock. Print the assembly of this working function.

I already provided a working version which answered the original poster's question both on how to get the parameters passed in ASM and how to get his lock working correctly.
Many other answers have questioned the wiseness of using ASM at all and mentioned that either intrinsics or C OS calls should be used. The following works as well and is a C++ version of my ASM answer. There is a snippet of ASM in there that only needs to be used if your platform does not support InterlockedExchange().
class LockImpl
{
// This is a simple SpinLock
// 0 - in use / busy
// 1 - free / available
public:
#if 1
static DWORD MyInterlockedExchange(volatile DWORD *variable,DWORD newval)
{
// InterlockedExchange() uses LONG / He wants to use DWORD
return((DWORD)InterlockedExchange(
(volatile LONG *)variable,(LONG)newval));
}
#else
// You can use this if you don't have InterlockedExchange()
// on your platform. Otherwise no ASM is required.
static DWORD MyInterlockedExchange(volatile DWORD *variable,DWORD newval)
{
DWORD old;
__asm
{
mov ebx, variable
mov eax, newval
xchg eax, [ebx] ;// XCHG with BusLock
mov old, eax
}
return(old);
}
#endif
static void lockResource(volatile DWORD &resourceLock )
{
DWORD oldval;
do
{
while(0==resourceLock)
{
// Could have a yield, spin count, exponential
// backoff, OS CS fallback, etc. here
}
oldval=MyInterlockedExchange(&resourceLock,0);
} while (0==oldval);
}
static void unLockResource(volatile DWORD &resourceLock)
{
// _ReadWriteBarrier() is a VC++ intrinsic that generates
// no instructions / only prevents compiler reordering.
// GCC uses __sync_synchronize() or __asm__ ( :::"memory" )
_ReadWriteBarrier();
resourceLock=1;
}
};

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js