Summary: I had expected that std::atomic<int*>::load with std::memory_order_relaxed would be close to the performance of just loading a pointer directly, at least when the loaded value rarely changes. I saw far worse performance for the atomic load than a normal load on Visual Studio C++ 2012, so I decided to investigate. It turns out that the atomic load is implemented as a compare-and-swap loop, which I suspect is not the fastest possible implementation.
Question: Is there some reason that std::atomic<int*>::load needs to do a compare-and-swap loop?
Background: I believe that MSVC++ 2012 is doing a compare-and-swap loop on atomic load of a pointer based on this test program:
#include <atomic>
#include <iostream>
template<class T>
__declspec(noinline) T loadRelaxed(const std::atomic<T>& t) {
return t.load(std::memory_order_relaxed);
}
int main() {
int i = 42;
char c = 42;
std::atomic<int*> ptr(&i);
std::atomic<int> integer;
std::atomic<char> character;
std::cout
<< *loadRelaxed(ptr) << ' '
<< loadRelaxed(integer) << ' '
<< loadRelaxed(character) << std::endl;
return 0;
}
I'm using a __declspec(noinline) function in order to isolate the assembly instructions related to the atomic load. I made a new MSVC++ 2012 project, added an x64 platform, selected the release configuration, ran the program in the debugger and looked at the disassembly. Turns out that both std::atomic<char> and std::atomic<int> parameters end up giving the same call to loadRelaxed<int> - this must be something the optimizer did. Here is the disassembly of the two loadRelaxed instantiations that get called:
loadRelaxed<int * __ptr64>
000000013F4B1790 prefetchw [rcx]
000000013F4B1793 mov rax,qword ptr [rcx]
000000013F4B1796 mov rdx,rax
000000013F4B1799 lock cmpxchg qword ptr [rcx],rdx
000000013F4B179E jne loadRelaxed<int * __ptr64>+6h (013F4B1796h)
loadRelaxed<int>
000000013F3F1940 prefetchw [rcx]
000000013F3F1943 mov eax,dword ptr [rcx]
000000013F3F1945 mov edx,eax
000000013F3F1947 lock cmpxchg dword ptr [rcx],edx
000000013F3F194B jne loadRelaxed<int>+5h (013F3F1945h)
The instruction lock cmpxchg is atomic compare-and-swap and we see here that the code for atomically loading a char, an int or an int* is a compare-and-swap loop. I also built this code for 32-bit x86 and that implementation is still based on lock cmpxchg.
I do not believe that relaxed atomic loads require compare-and-swap. In the end this std::atomic implementation was not usable for my purpose, but I still wanted the interface, so I made my own std::atomic using MSVC's barrier intrinsics. It performs better than the default std::atomic for my use case. You can see the code here. It is intended to implement the C++11 semantics for all the load and store orderings. By the way, GCC 4.6 is no better in this regard, and I don't know about GCC 4.7.
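A minimal sketch of the idea (not the author's actual linked code): on x86/x64, a relaxed atomic load of a register-sized value is just a plain aligned load, and a volatile access plus a compiler-only barrier keeps the compiler from reordering or eliding it, with no lock prefix involved.
#include <intrin.h>

template<class T>
T myLoadRelaxed(const T& t) {
    // Plain aligned load; atomic on x86/x64 for register-sized T.
    T value = *const_cast<const volatile T*>(&t);
    _ReadWriteBarrier();  // MSVC compiler barrier; emits no instructions
    return value;
}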
Related
I know that atomic variables are lock-free: they don't block the thread. But I have one question.
Are read-modify-write operations like std::atomic::fetch_add also executed atomically?
I don't think this operation is just a single instruction; it needs multiple cycles. So if it doesn't lock the memory bus (actually, I don't know whether mutex locking involves a memory-bus lock), another thread could perform a memory operation between the read and the store.
So I think it would require locking even for an atomic variable. Is my understanding correct?
Your understanding was correct on earlier x86 architectures.
The x86 architecture provides the LOCK instruction prefix, and atomic variables depend on it. Early on, LOCK was implemented by locking the bus to prevent memory access from other CPU cores. As you can imagine, this implementation was very inefficient.
Most modern x86 processors support a hardware implementation of CAS (compare-and-swap), which ensures the correctness of atomic operations on multi-processor and multi-core systems. The CAS implementation does not lock the bus; it only blocks other CPUs' access to the cache lines holding the associated memory.
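To illustrate the semantics, here is a sketch in pseudo-C++ (not real hardware code): CAS atomically compares the value at an address with an expected value and installs a new value only on a match.
// Pseudo-C++ sketch of CAS semantics; the hardware (lock cmpxchg on x86)
// executes the whole body as one indivisible step.
bool compare_and_swap(long long* addr, long long& expected, long long desired)
{
    if (*addr == expected) {   // unchanged since we last read it?
        *addr = desired;       // install the new value
        return true;
    }
    expected = *addr;          // otherwise report what we actually saw
    return false;
}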
Let me show you some code. Here is an example:
#include <iostream>
#include <thread>
#include <atomic>
std::atomic<long long> data;
void do_work()
{
data.fetch_add(1, std::memory_order_relaxed);
}
int main()
{
std::thread th1(do_work);
std::thread th2(do_work);
std::thread th3(do_work);
std::thread th4(do_work);
std::thread th5(do_work);
th1.join();
th2.join();
th3.join();
th4.join();
th5.join();
std::cout << "Result:" << data << '\n';
}
Converting the above code into instructions: with GCC 8, the do_work function is translated into
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 1
mov DWORD PTR [rbp-12], 0
mov rax, QWORD PTR [rbp-8]
mov edx, OFFSET FLAT:data
lock xadd QWORD PTR [rdx], rax
nop
pop rbp
ret
GCC uses lock xadd to make the read-modify-write atomic: the read, the add, and the store happen as one indivisible operation, so no separate lock is needed.
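For operations that have no dedicated x86 instruction, the library instead falls back to a compare-exchange loop. A sketch using a hypothetical atomic multiply (not a standard member function):
#include <atomic>

// Atomic multiply has no x86 instruction, so it is built from a CAS loop;
// on failure, compare_exchange_weak refreshes 'expected' with the current value.
void atomic_multiply(std::atomic<long long>& a, long long factor)
{
    long long expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, expected * factor,
                                    std::memory_order_relaxed)) {
        // another thread changed 'a' between our load and the CAS; retry
    }
}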
I wrote a simple multithreaded program as follows:
#include <iostream>
#include <future>
#include <thread>
#include <chrono>
static bool finished = false;
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
int main()
{
auto result=std::async(std::launch::async, func);
std::this_thread::sleep_for(std::chrono::seconds(1));
finished=true;
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
It behaves normally in debug mode in Visual Studio, or with -O0 in GCC, and prints the result after 1 second. But it gets stuck and does not print anything in Release mode, or with -O1, -O2 or -O3.
Two threads accessing a non-atomic, unguarded variable is undefined behaviour (UB). This concerns finished. You can fix this by making finished of type std::atomic<bool>.
My fix:
#include <iostream>
#include <future>
#include <atomic>
#include <thread>
#include <chrono>
static std::atomic<bool> finished{false};
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
int main()
{
auto result=std::async(std::launch::async, func);
std::this_thread::sleep_for(std::chrono::seconds(1));
finished=true;
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
Output:
result =1023045342
main thread id=140147660588864
Live Demo on coliru
Somebody may think 'It's a bool – probably one bit. How can this be non-atomic?' (I did when I started with multi-threading myself.)
But note that lack-of-tearing is not the only thing that std::atomic gives you. It also makes concurrent read+write access from multiple threads well-defined, stopping the compiler from assuming that re-reading the variable will always see the same value.
Making a bool unguarded and non-atomic can cause additional issues:
The compiler might decide to optimize the variable into a register, or even CSE multiple accesses into one and hoist the load out of the loop.
The variable might be cached for a CPU core. (In real life, CPUs have coherent caches. This is not a real problem, but the C++ standard is loose enough to cover hypothetical C++ implementations on non-coherent shared memory where atomic<bool> with memory_order_relaxed store/load would work, but where volatile wouldn't. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)
To prevent this from happening, the compiler must be told explicitly not to perform such optimizations.
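As a sketch of how little is needed for this particular pattern: a standalone exit flag works even with relaxed ordering, since no other data is synchronized through it.
#include <atomic>

static std::atomic<bool> finished{false};

// Relaxed ordering suffices for a lone exit flag: the load cannot be hoisted
// out of the loop, and the store becomes visible to the spinning thread.
int func()
{
    size_t i = 0;
    while (!finished.load(std::memory_order_relaxed))
        ++i;
    return static_cast<int>(i);
}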
I'm a little bit surprised by the evolving discussion concerning the potential relation of volatile to this issue. Thus, I'd like to spend my two cents:
Is volatile useful with threads
Who's afraid of a big bad optimizing compiler?.
Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.
I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:
func():
cmp BYTE PTR finished[rip], 0
jne .L4
.L5:
jmp .L5
.L4:
mov eax, 0
ret
So, what is happening here?
First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.
If it is not false (i.e., true), we exit the loop on the first run. This is accomplished by jne .L4, which jumps when not equal to label .L4, where the value of i (0) is stored in a register for later use and the function returns.
If it is false however, we move to
.L5:
jmp .L5
This is an unconditional jump to label .L5, which happens to be the jump instruction itself.
In other words, the thread is put into an infinite busy loop.
So why has this happened?
As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.
Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.
The optimised code provides the two code paths that will result from entering the function with a constant bool value; either it runs the loop infinitely, or the loop is never run.
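In source terms, the optimised code behaves roughly like this (an illustrative reconstruction, not actual compiler output):
int func()
{
    if (finished)   // finished is read once, at entry
        return 0;   // i can only ever be 0 here
    for (;;) { }    // finished assumed constant: the loop can never exit
}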
At -O0, the compiler (as expected) does not optimise the loop body and comparison away:
func():
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 0
.L148:
movzx eax, BYTE PTR finished[rip]
test al, al
jne .L147
add QWORD PTR [rbp-8], 1
jmp .L148
.L147:
mov rax, QWORD PTR [rbp-8]
pop rbp
ret
Therefore the function, when unoptimised, does work. The lack of atomicity here is typically not a problem, because the code and data type are simple. Probably the worst we could run into is a value of i that is off by one from what it should be.
A more complex system with data-structures is far more likely to result in corrupted data, or improper execution.
For the sake of completeness in the learning curve: you should avoid using global variables. You did a good job, though, by making it static, so it is local to the translation unit.
Here is an example:
#include <atomic>
#include <future>
#include <functional>
#include <iostream>
class ST {
public:
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
void setFinished(bool val)
{
finished = val;
}
private:
std::atomic<bool> finished{false};
};
int main()
{
ST st;
auto result=std::async(std::launch::async, &ST::func, std::ref(st));
std::this_thread::sleep_for(std::chrono::seconds(1));
st.setFinished(true);
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
Live on wandbox
Assuming the architecture is ARM64 or x86-64, I want to make sure whether these two are equivalent:
a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
MyBarrier(); a = *(volatile __int64*)p; MyBarrier();
Where MyBarrier() is a compiler-level memory barrier (hint), like __asm__ __volatile__ ("" ::: "memory").
So method 2 is supposed to be faster than method 1.
I have heard that the _Interlocked*() functions also imply a full memory barrier at both the compiler and hardware levels.
I have also heard that reads of properly aligned data are atomic on these architectures, but I am not sure whether method 2 can be widely used.
(P.S. I think the CPU handles data dependencies automatically, so a hardware barrier is not much of a concern here.)
Thank you for any advice or corrections on this.
Here are some benchmarks from an Ivy Bridge (i5 laptop).
(1E+006 loops: 27ms):
; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx
(1E+006 loops: 27ms):
; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 7ms):
; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 1.26ms, not synchronized?):
; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]
For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is true on your platform.
However, _MemoryBarrier() is not a "hint to the compiler". _MemoryBarrier() on x86 prevents compiler and CPU reordering, and also ensures global visibility after the write. You also probably only need the first _MemoryBarrier(); the second one could be replaced with a _ReadWriteBarrier(), unless a is also a shared variable. But you don't even need that, since you are reading through a volatile pointer, which will prevent any compiler reordering in MSVC.
When you create this replacement, you basically end up with pretty much the same result:
// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val
// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val
Running these two in a loop, on my i7 Ivy Bridge laptop, gives equal results, within 2-3%.
However, with two memory barriers, the "optimized version" is actually around 2x slower.
So the better question is: Why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic, and an optimizing compiler should compile it to the most optimized version for your architecture, and add all the necessary barriers to prevent reordering and ensure cache coherency.
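A sketch of what that looks like (assuming x86-64): the expensive synchronization is paid on the store side, so even a seq_cst load compiles to a plain mov.
#include <atomic>

std::atomic<long long> val;

// On x86-64 a seq_cst load is a plain aligned load (a single mov);
// it is the seq_cst *store* that requires xchg or mfence.
long long read_val()
{
    return val.load(std::memory_order_seq_cst);
}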
I'm trying to implement a simple busy loop function.
This should keep polling a std::atomic variable for a maximum number of times (spinCount), and return true if the status did change (to anything other than NOT_AVAILABLE) within the given tries, or false otherwise:
// noinline is just to be able to inspect the resulting ASM a bit easier - in final code, this function SHOULD be inlined!
__declspec(noinline) static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
int iSpinCount = 0;
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
return iSpinCount == spinCount;
}
However, it seems that MSVC just optimizes the loop away in Release mode for Win64. I'm pretty bad with assembly, but it doesn't look to me like it ever even tries to read the value of statusPtr at all:
int iSpinCount = 0;
000000013F7E2040 xor eax,eax
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
000000013F7E2042 inc eax
000000013F7E2044 cmp eax,edx
000000013F7E2046 jge trySpinWait+12h (013F7E2052h)
000000013F7E2048 mov r8d,dword ptr [rcx]
000000013F7E204B test r8d,r8d
000000013F7E204E je trySpinWait+2h (013F7E2042h)
return iSpinCount == spinCount;
000000013F7E2050 cmp eax,edx
000000013F7E2052 sete al
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this, but it seems that's not the case (or rather, my understanding was probably wrong).
What am I doing wrong here, or rather - how can I best implement that loop without having it optimized away, with least impact on overall performance?
I know I could use #pragma optimize( "", off ), but (other than in the example above) in my final code I'd very much like to have this call inlined into a larger function for performance reasons. It seems that this #pragma will generally prevent inlining, though.
Appreciate any thoughts!
Thanks
but it doesn't look to me like it ever even tries to read the value of statusPtr at all
It does reload it on every iteration of the loop:
000000013F7E2048 mov r8d,dword ptr [rcx] # rcx is statusPtr
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this,
You do not need anything more than std::memory_order_relaxed here, because there is only one variable shared between threads (what's more, this code doesn't change the value of the atomic variable). There are no reordering concerns.
In other words, this function works as expected.
You may want to use the PAUSE instruction; see Benefitting Power and Performance Sleep Loops.
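A sketch of the loop with a PAUSE hint (statusPtr's type and the 0 == NOT_AVAILABLE convention below are stand-ins for the asker's Status enum):
#include <atomic>
#include <immintrin.h>  // _mm_pause

bool trySpinWait(std::atomic<int>* statusPtr, const int spinCount)
{
    int iSpinCount = 0;
    // PAUSE tells the CPU this is a spin-wait loop: it saves power and
    // avoids memory-order mis-speculation penalties on loop exit.
    while (++iSpinCount < spinCount &&
           statusPtr->load(std::memory_order_relaxed) == 0)
        _mm_pause();
    return iSpinCount == spinCount;
}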
I want to use a test-and-set lock, implemented in assembly language with an atomic swap instruction, in my C/C++ program.
class LockImpl
{
public:
static void lockResource(DWORD resourceLock )
{
__asm
{
InUseLoop: mov eax, 0 ; 0 = In Use
xchg eax, resourceLock
cmp eax, 0
je InUseLoop
}
}
static void unLockResource(DWORD resourceLock )
{
__asm
{
mov resourceLock , 1
}
}
};
This works, but there is a bug in it.
The problem is that I want to pass DWORD* resourceLock instead of DWORD resourceLock.
So the question is: how do I pass a pointer from C/C++ to assembly and get it back?
Thanks in advance.
Regards,
-Jay.
P.S. This is done to avoid context switches between user space and kernel space.
If you're writing this for Windows, you should seriously consider using a critical section object. The critical section API functions are optimised such that they won't transition into kernel mode unless they really need to, so the normal case of no contention has very little overhead.
The biggest problem with your spin lock is that if you're on a single CPU system and you're waiting for the lock, then you're using all the cycles you can and whatever is holding the lock won't even get a chance to run until your timeslice is up and the kernel preempts your thread.
Using a critical section will be more successful than trying to roll your own user mode spin lock.
In terms of your actual question, it's pretty simple: just change the function headers to use volatile DWORD *resourceLock, and change the assembly lines that touch resourceLock to use indirection:
mov ecx, dword ptr [resourceLock]
xchg eax, dword ptr [ecx]
and
mov ecx, dword ptr [resourceLock]
mov dword ptr [ecx], 1
(Note that a lock prefix is not valid on mov; an aligned store is atomic on x86 anyway.)
However, note that you've got a couple of other problems looming:
You say you're developing this on Windows, but want to switch to Linux. However, you're using MSVC-specific inline assembly; this will have to be ported to GCC style when you move to Linux (in particular, that involves switching from Intel syntax to AT&T syntax). You will be much better off developing with GCC even on Windows; that will minimise the pain of migration (see MinGW for GCC on Windows).
Greg Hewgill is absolutely right about spinning uselessly, stopping the lock-holder from getting CPU. Consider yielding the CPU if you've been spinning for too long.
On a multiprocessor x86, you might well have a problem with memory loads and stores being re-ordered around your lock - mfence instructions in the lock and unlock procedures might be necessary.
Really, if you're worrying about locking, that means you're using threading, which probably means you're using the platform-specific threading APIs already. So use the native synchronisation primitives, and switch to the pthreads versions when you move to Linux.
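For reference, a minimal sketch of the critical-section approach (the spin count of 4000 is an arbitrary illustrative value):
#include <windows.h>

CRITICAL_SECTION g_cs;

// Spins briefly in user mode under contention, then sleeps in the kernel.
void init()    { InitializeCriticalSectionAndSpinCount(&g_cs, 4000); }
void lock()    { EnterCriticalSection(&g_cs); }
void unlock()  { LeaveCriticalSection(&g_cs); }
void destroy() { DeleteCriticalSection(&g_cs); }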
Apparently, you are compiling with MSVC using inline assembly blocks in your C++ code.
As a general remark, you should really use compiler intrinsics, since inline assembly has no future: it is no longer supported by Microsoft's compilers when compiling for x64.
If you need functions fine-tuned in assembly, you will have to implement them in separate files.
The main problems with the original version in the question are that it needs to use register-indirect addressing and take a reference (or pointer parameter) rather than a by-value parameter for the lock DWORD.
Here's a working solution for Visual C++. EDIT: I have worked offline with the author and we have verified that the code in this answer works correctly in his test harness.
But if you're using Windows, you should really be using the Interlocked API (e.g. InterlockedExchange).
Edit: As noted by CAF, lock xchg is not required because xchg automatically asserts a bus lock.
I also added a faster version that does a non-locking read before attempting the xchg. This significantly reduces bus-lock contention on the memory interface. The algorithm can be sped up quite a bit more (in the contended multithreaded case) with backoff (yield, then sleep) for locks held a long time. For the single-CPU case, using an OS lock that sleeps immediately on held locks will be fastest.
class LockImpl
{
// This is a simple SpinLock
// 0 - in use / busy
// 1 - free / available
public:
static void lockResource(volatile DWORD &resourceLock )
{
__asm
{
mov ebx, resourceLock
InUseLoop:
mov eax, 0 ;0=In Use
xchg eax, [ebx]
cmp eax, 0
je InUseLoop
}
}
static void lockResource_FasterVersion(DWORD &resourceLock )
{
__asm
{
mov ebx, resourceLock
InUseLoop:
mov eax, [ebx] ;// Read without BusLock
cmp eax, 0
je InUseLoop ;// Retry Read if Busy
mov eax, 0
xchg eax, [ebx] ;// XCHG with BusLock
cmp eax, 0
je InUseLoop ;// Retry if Busy
}
}
static void unLockResource(volatile DWORD &resourceLock)
{
__asm
{
mov ebx, resourceLock
mov [ebx], 1
}
}
};
// A little testing code here
volatile DWORD aaa=1;
void test()
{
LockImpl::lockResource(aaa);
LockImpl::unLockResource(aaa);
}
You should be using something like this:
volatile LONG resourceLock = 1;
if(InterlockedCompareExchange(&resourceLock, 0, 1) == 1) {
// success!
// do something, and then
resourceLock = 1;
} else {
// failed, try again later
}
See InterlockedCompareExchange.
Look at your compiler documentation to find out how to print the generated assembly language for functions.
Print the assembly language for this function:
static void unLockResource(DWORD resourceLock )
{
resourceLock = 0;
return;
}
This may not work because the compiler can optimize the function and remove all the code. You should change the above function to pass a pointer to resourceLock and then have the function set the lock. Print the assembly of this working function.
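A sketch of the pointer-taking version suggested above (hypothetical; note the question's convention is 1 = free, so the unlock stores 1):
#include <windows.h>

// The write is now observable through the pointer, so the compiler
// must emit the store instead of optimizing it away.
static void unLockResource(volatile DWORD* resourceLock)
{
    *resourceLock = 1;  // 1 = free / available
}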
I already provided a working version that answered the original poster's question, both on how to pass the parameters into the ASM and how to get the lock working correctly.
Many other answers have questioned the wisdom of using ASM at all and mentioned that intrinsics or OS calls should be used instead. The following works as well and is a C++ version of my ASM answer. There is a snippet of ASM in there that only needs to be used if your platform does not support InterlockedExchange().
class LockImpl
{
// This is a simple SpinLock
// 0 - in use / busy
// 1 - free / available
public:
#if 1
static DWORD MyInterlockedExchange(volatile DWORD *variable,DWORD newval)
{
// InterlockedExchange() uses LONG / He wants to use DWORD
return((DWORD)InterlockedExchange(
(volatile LONG *)variable,(LONG)newval));
}
#else
// You can use this if you don't have InterlockedExchange()
// on your platform. Otherwise no ASM is required.
static DWORD MyInterlockedExchange(volatile DWORD *variable,DWORD newval)
{
DWORD old;
__asm
{
mov ebx, variable
mov eax, newval
xchg eax, [ebx] ;// XCHG with BusLock
mov old, eax
}
return(old);
}
#endif
static void lockResource(volatile DWORD &resourceLock )
{
DWORD oldval;
do
{
while(0==resourceLock)
{
// Could have a yield, spin count, exponential
// backoff, OS CS fallback, etc. here
}
oldval=MyInterlockedExchange(&resourceLock,0);
} while (0==oldval);
}
static void unLockResource(volatile DWORD &resourceLock)
{
// _ReadWriteBarrier() is a VC++ intrinsic that generates
// no instructions / only prevents compiler reordering.
// GCC's compiler-only equivalent is __asm__ __volatile__("" ::: "memory");
// __sync_synchronize() is a full hardware barrier, which is stronger.
_ReadWriteBarrier();
resourceLock=1;
}
};
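Hypothetical usage of the class above, following its convention (1 = free / available):
volatile DWORD resourceLock = 1;

void criticalWork()
{
    LockImpl::lockResource(resourceLock);
    // ... access the shared resource ...
    LockImpl::unLockResource(resourceLock);
}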