I am playing with atomic reads and writes currently and have hit a wall in my understanding. I understand that writing to a variable (eg via increment) has to be atomic, but I'm not sure about reading the variable.
Consider _InterlockedExchangeAdd on Windows, or __sync_add_and_fetch on Linux. I cannot find a function that atomically retrieves the value being updated. Now I've done my research before posting here and Are C++ Reads and Writes of an int Atomic? tells me that a read isn't atomic.
1) If I use the functions above, how do I atomically read the value, say if returning it from a function?
2) If I didn't want to use these functions and just wanted to lock a mutex before every write of my "atomic" variable, in the function that retrieves the current value of the variable, would I need to first lock the mutex, copy the current value, unlock the mutex then return the copy?
EDIT
I am using a compiler that doesn't have access to the atomic headers, hence have to use these APIs.
You cannot find the answer because there was no one way to do it so that it is (a) fast and (b) portable. It depends on: C++ or C, compiler, compiler version, compiler settings, library, architecture... the list goes on and on.
Here is a starting point:
If you have access to standard atomics or then use atomic_load() - it is portable and fast.
If you don't have standard atomics then use CAS. For example, on Windows it will be _InterlockedCompareExchange(,0,0). See https://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx. On Linux it will be __sync_val_compare_and_swap(, 0, 0). See https://gcc.gnu.org/onlinedocs/gcc-4.4.3/gcc/Atomic-Builtins.html.
If you want to use direct assignment - then you should check with an expert who knows where your code will be running.
I happen to have a snippet of the assembler code which may explain why CAS is a reasonable alternative. This is C, i86, Microsoft compiler VS2015, Win64 target:
volatile long debug_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
00000001401A6958 xor edi,edi
00000001401A695A mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A695D xor eax,eax
00000001401A695F lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6964 mov dword ptr [rbp-0Ch],eax
debug_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A6967 prefetchw [rbp+30h]
00000001401A696B mov eax,dword ptr [rbp+30h]
00000001401A696E xchg ax,ax
00000001401A6970 mov ecx,eax
00000001401A6972 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6977 jne foo+30h (01401A6970h)
00000001401A6979 mov dword ptr [rbp-0Ch],eax
volatile long release_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955 mov eax,dword ptr [rbp+30h]
release_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A6958 mov dword ptr [rbp-0Ch],eax
00000001401A695B xor edi,edi
00000001401A695D mov eax,dword ptr [rbp-0Ch]
00000001401A6960 xor eax,eax
00000001401A6962 lock cmpxchg dword ptr [rbp+30h],edi
00000001401A6967 mov dword ptr [rbp-0Ch],eax
release_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A696A prefetchw [rbp+30h]
00000001401A696E mov eax,dword ptr [rbp+30h]
00000001401A6971 mov ecx,eax
00000001401A6973 lock cmpxchg dword ptr [rbp+30h],ecx
00000001401A6978 jne foo+31h (01401A6971h)
00000001401A697A mov dword ptr [rbp-0Ch],eax
Your plan (2) for using mutex is correct.
Good luck.
Related
I have a question about performance. I think this can also applies to other languages (not only C++).
Imagine that I have this function:
int addNumber(int a, int b){
int result = a + b;
return result;
}
Is there any performance improvement if I write the code above like this?
int addNumber(int a, int b){
return a + b;
}
I have this question because the second function doesn´t declare a 3rd variable. But would the compiler detect this in the first code?
To answer this question you can look at the generated assembler code. With -O2, x86-64 gcc 6.2 generates exactly the same code for both methods:
addNumber(int, int):
lea eax, [rdi+rsi]
ret
addNumber2(int, int):
lea eax, [rdi+rsi]
ret
Only without optimization turned on, there is a difference:
addNumber(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add eax, edx
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
addNumber2(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
However, performance comparison without optimization is meaningless
In principle there is no difference between the two approaches. The majority of compilers have handled this type of optimisation for some decades.
Additionally, if the function can be inlined (e.g. its definition is visible to the compiler when compiling code that uses such a function) the majority of compilers will eliminate the function altogether, and simply emit code to add the two variables passed and store the result as required by the caller.
Obviously, the comments above assume compiling with a relevant optimisation setting (e.g. not doing a debug build without optimisation).
Personally, I would not write such a function anyway. It is easier, in the caller, to write c = a + b instead of c = addNumber(a, b), so having a function like that offers no benefit to either programmer (effort to understand) or program (performance, etc). You might as well write comments that give no useful information.
c = a + b; // add a and b and store into c
Any self-respecting code reviewer would complain bitterly about uninformative functions or uninformative comments.
I'd only use such a function if its name conveyed some special meaning (i.e. more than just adding two values) for the application
c = FunkyOperation(a,b);
int FunkyOperation(int a, int b)
{
/* Many useful ways of implementing this operation.
One of those ways happens to be addition, but we need to
go through 25 pages of obscure mathematical proof to
realise that
*/
return a + b;
}
I was curious to see what the cost is of accessing a data member through a pointer compared with not through a pointer, so came up with this test:
#include <iostream>
struct X{
int a;
};
int main(){
X* xheap = new X();
std::cin >> xheap->a;
volatile int x = xheap->a;
X xstack;
std::cin >> xstack.a;
volatile int y = xstack.a;
}
the generated x86 is:
int main(){
push rbx
sub rsp,20h
X* xheap = new X();
mov ecx,4
call qword ptr [__imp_operator new (013FCD3158h)]
mov rbx,rax
test rax,rax
je main+1Fh (013FCD125Fh)
xor eax,eax
mov dword ptr [rbx],eax
jmp main+21h (013FCD1261h)
xor ebx,ebx
std::cin >> xheap->a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov rdx,rbx
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int x = xheap->a;
mov eax,dword ptr [rbx]
X xstack;
std::cin >> xstack.a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov dword ptr [x],eax
lea rdx,[xstack]
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int y = xstack.a;
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
It looks like the non-pointer access takes two instructions, compared to oneinstruction for the access through a pointer. Could somebody please tell me why this is and which would take fewer CPU cycles to retrieve?
I am trying to understand if pointers do incur more CPU instructions/cycles when accessing data members through them as opposed to non-pointer-access.
That's a terrible test.
The complete assignment to x is this:
mov eax,dword ptr [rbx]
mov dword ptr [x],eax
(the compiler is allowed to re-order the instructions somewhat, and has).
The assignment to y (which the compiler has given the same address as x) is
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
which is almost the same (read memory pointed to by register, write to the stack).
The first one would be more complicated except that the compiler kept xheap in register rbx after the call to new, so it doesn't need to re-load it.
In either case I would be more worried about whether any of those accesses misses the L1 or L2 caches than about the precise instructions. (The processor doesn't even directly execute those instructions, they get converted internally to a different instruction set, and it may execute them in a different order.)
Accessing via a pointer instead of directly accessing from the stack costs you one extra indirection in the worst case (fetching the pointer). This is almost always irrelevant in itself; you need to look at your whole algorithm and how it works with the processor's caches and branch prediction logic.
AFAIK C++ atomics (<atomic>) family provide 3 benefits:
primitive instructions indivisibility (no dirty reads),
memory ordering (both, for CPU and compiler) and
cross-thread visibility/changes propagation.
And I am not sure about the third bullet, thus take a look at the following example.
#include <atomic>
std::atomic_bool a_flag = ATOMIC_VAR_INIT(false);
struct Data {
int x;
long long y;
char const* z;
} data;
void thread0()
{
// due to "release" the data will be written to memory
// exactly in the following order: x -> y -> z
data.x = 1;
data.y = 100;
data.z = "foo";
// there can be an arbitrary delay between the write
// to any of the members and it's visibility in other
// threads (which don't synchronize explicitly)
// atomic_bool guarantees that the write to the "a_flag"
// will be clean, thus no other thread will ever read some
// strange mixture of 4bit + 4bits
a_flag.store(true, std::memory_order_release);
}
void thread1()
{
while (a_flag.load(std::memory_order_acquire) == false) {};
// "acquire" on a "released" atomic guarantees that all the writes from
// thread0 (thus data members modification) will be visible here
}
void thread2()
{
while (data.y != 100) {};
// not "acquiring" the "a_flag" doesn't guarantee that will see all the
// memory writes, but when I see the z == 100 I know I can assume that
// prior writes have been done due to "release ordering" => assert(x == 1)
}
int main()
{
thread0(); // concurrently
thread1(); // concurrently
thread2(); // concurrently
// join
return 0;
}
First, please validate my assumptions in code (especially thread2).
Second, my questions are:
How does the a_flag write propagate to other cores?
Does the std::atomic synchronize the a_flag in the writer cache with the other cores cache (using MESI, or anything else), or the propagation is automatic?
Assuming that on a particular machine a write to a flag is atomic (think int_32 on x86) AND we don't have any private memory to synchronize (we only have a flag) do we need to use atomics?
Taking into consideration most popular CPU architectures (x86, x64, ARM v.whatever, IA-64), is the cross-core visibility (I am now not considering reorderings) automatic (but potentially delayed), or you need to issue specific commands to propagate any piece of data?
Cores themselves don't matter. The question is "how do all cores see the same memory update eventually", which is something your hardware does for you (e.g. cache coherency protocols). There is only one memory, so the main concern is caching, which is a private concern of the hardware.
That question seems unclear. What matters is the acquire-release pair formed by the load and store of a_flag, which is a synchronisation point and causes the effects of thread0 and thread1 to appear in a certain order (i.e. everything in thread0 before the store happens-before everything after the loop in thread1).
Yes, otherwise you wouldn't have synchronisation point.
You don't need any "commands" in C++. C++ isn't even aware of the fact that it's running on any particular kind of CPU. You could probably run a C++ program on a Rubik's cube with enough imagination. A C++ compiler chooses the necessary instructions to implement the synchronisation behaviour that's described by the C++ memory model, and on x86 that involves issuing instruction lock prefixes and memory fences, as well as not reordering instructions too much. Since x86 has a strongly ordered memory model, the above code should produce minimal additional code compared to the naive, incorrect one without atomics.
Having your thread2 in the code makes the entire program undefined behaviour.
Just for fun, and to show that working out what's happening for yourself can be edifying, I compiled the code in three variations. (I added a glbbal int x and in thread1 I added x = data.y;).
Acquire/Release: (your code)
thread0:
mov DWORD PTR data, 1
mov DWORD PTR data+4, 100
mov DWORD PTR data+8, 0
mov DWORD PTR data+12, OFFSET FLAT:.LC0
mov BYTE PTR a_flag, 1
ret
thread1:
.L14:
movzx eax, BYTE PTR a_flag
test al, al
je .L14
mov eax, DWORD PTR data+4
mov DWORD PTR x, eax
ret
Sequentially consistent: (remove the explicit ordering)
thread0:
mov eax, 1
mov DWORD PTR data, 1
mov DWORD PTR data+4, 100
mov DWORD PTR data+8, 0
mov DWORD PTR data+12, OFFSET FLAT:.LC0
xchg al, BYTE PTR a_flag
ret
thread1:
.L14:
movzx eax, BYTE PTR a_flag
test al, al
je .L14
mov eax, DWORD PTR data+4
mov DWORD PTR x, eax
ret
"Naive": (just using bool)
thread0:
mov DWORD PTR data, 1
mov DWORD PTR data+4, 100
mov DWORD PTR data+8, 0
mov DWORD PTR data+12, OFFSET FLAT:.LC0
mov BYTE PTR a_flag, 1
ret
thread1:
cmp BYTE PTR a_flag, 0
jne .L3
.L4:
jmp .L4
.L3:
mov eax, DWORD PTR data+4
mov DWORD PTR x, eax
ret
As you can see, there's not a big difference. The "incorrect" version actually looks mostly correct, except for missing the load (it uses cmp with memory operand). The sequentially consistent version hides its expensiveness in the xcgh instruction, which has an implicit lock prefix and doesn't seem to require any explicit fences.
According to some post here ( Efficiency of std::copy vs memcpy , ) std::copy is supposed to be reduce to memcpy/memmove on a pod type. i am trying to test that but i cant replicate the result.
I am using visual studio 2010 and I tried all optimization levels
struct pod_
{
unsigned int v1 ,v2 ,v3 ;
} ;
typedef pod_ T ;
static_assert(std::is_pod<pod_>::value, "Struct must be a POD type");
const unsigned int size = 20*1024*1024 / sizeof(T);
std::vector<T> buffer1(size) ;
std::vector<T> buffer2((size)) ;
And i tried this :
std::copy(buffer1.begin(),buffer1.end(),&buffer2[0]);
0030109C mov esi,dword ptr [esp+14h]
003010A0 mov ecx,dword ptr [esp+18h]
003010A4 mov edi,dword ptr [esp+24h]
003010A8 mov eax,esi
003010AA cmp esi,ecx
003010AC je main+8Eh (3010CEh)
003010AE mov edx,edi
003010B0 sub edx,esi
003010B2 mov ebx,dword ptr [eax]
003010B4 mov dword ptr [edx+eax],ebx
003010B7 mov ebx,dword ptr [eax+4]
003010BA mov dword ptr [edx+eax+4],ebx
003010BE mov ebx,dword ptr [eax+8]
003010C1 mov dword ptr [edx+eax+8],ebx
003010C5 add eax,0Ch
003010C8 cmp eax,ecx
003010CA jne main+72h (3010B2h)
003010CC xor ebx,ebx
casting to a primitive type seems to work.
std::copy((char *)&buffer1[0],(char *)&buffer1[buffer1.size() - 1],(char *)&buffer2[0]);
003010CE sub ecx,esi
003010D0 mov eax,2AAAAAABh
003010D5 imul ecx
003010D7 sar edx,1
003010D9 mov eax,edx
003010DB shr eax,1Fh
003010DE add eax,edx
003010E0 lea eax,[eax+eax*2]
003010E3 lea ecx,[eax*4-0Ch]
003010EA push ecx
003010EB push esi
003010EC push edi
003010ED call dword ptr [__imp__memmove (3020B0h)]
003010F3 add esp,0Ch
The "answer" in the thread you post is wrong. Generally, I would expect
std::copy to be more efficient than memcpy or memmove (because it
is more specialized) for POD types. Whether this is the case when using
iterators depends on the compiler, but any optimization does depend on
the compiler being able to "see through" the iterators. Depending on
the compiler and the implementation of the library, this may not be
possible.
Note too that your test code has undefined behavior. One of the
requirements for using std::copy (and memcpy) is that the
destination is not in the range of the source ([first,last) for
std::copy, [source,source+n) for memcpy). If the source and the
destination overlap, the behavior is undefined.
If you've been programming for a while then you probably noticed something completely impossible occurs every now and then for which you are convinced there is no possible explanation ("IT'S A COMPILER BUG!!"). After you find out what it was caused by though you are like "oooohhh".
Well, it just happened to me :(
Here AuthDb is NOT NULL but a valid pointer:
SingleResult sr(AuthDb, format("SELECT Id, Access, Flags, SessionKey, RealmSplitPreference FROM accounts WHERE Name = '%s'") % Escaped(account_name));
Here it mysteriously becomes NULL:
struct SingleResult : public BaseResult
{
SingleResult(Database *db, const boost::format& query) { _ExecuteQuery(db, query.str()); }
}
Notice that it's the immediate next call. It can be explained much better with two screenshots:
http://img187.imageshack.us/img187/5757/ss1zm.png
http://img513.imageshack.us/img513/5610/ss2b.png
EDIT: AuthDb is a global variable. It itself keeps pointing to the right thing; but the copy of the ptr Database *db points at NULL.
ASM code (unfortunately I don't even know how to read it :/)
Of the first screenshot
01214E06 mov eax,dword ptr [ebp-328h]
01214E0C push eax
01214E0D push offset string "SELECT Id, Access, Flags, Sessio"... (13C6278h)
01214E12 lea ecx,[ebp-150h]
01214E18 call boost::basic_format<char,std::char_traits<char>,std::allocator<char> >::basic_format<char,std::char_traits<char>,std::allocator<char> > (11A3260h)
01214E1D mov dword ptr [ebp-32Ch],eax
01214E23 mov ecx,dword ptr [ebp-32Ch]
01214E29 mov dword ptr [ebp-330h],ecx
01214E2F mov byte ptr [ebp-4],2
01214E33 mov ecx,dword ptr [ebp-330h]
01214E39 call boost::basic_format<char,std::char_traits<char>,std::allocator<char> >::operator%<Snow::Escaped> (11A3E18h)
01214E3E push eax
01214E3F mov edx,dword ptr [__tls_index (144EC40h)]
01214E45 mov eax,dword ptr fs:[0000002Ch]
01214E4B mov ecx,dword ptr [eax+edx*4]
01214E4E mov edx,dword ptr [ecx+12A3Ch]
01214E54 push edx
01214E55 lea ecx,[sr]
01214E58 call Snow::SingleResult::SingleResult (11A27D4h)
01214E5D mov byte ptr [ebp-4],4 // VS GREEN ARROW IS HERE
01214E61 lea ecx,[ebp-150h]
01214E67 call boost::basic_format<char,std::char_traits<char>,std::allocator<char> >::~basic_format<char,std::char_traits<char>,std::allocator<char> > (11A1DBBh)
01214E6C mov byte ptr [ebp-4],5
01214E70 lea ecx,[ebp-170h]
01214E76 call Snow::Escaped::~Escaped (11A42D2h)
const bool account_found = !sr.Error();
01214E7B lea ecx,[sr]
01214E7E call Snow::BaseResult::Error (11A2964h)
01214E83 movzx eax,al
01214E86 test eax,eax
01214E88 sete cl
01214E8B mov byte ptr [account_found],cl
if (!account_found) {
01214E8E movzx edx,byte ptr [account_found]
01214E92 test edx,edx
01214E94 jne AuthSession+1C0h (1214F10h)
client.Kill(format("%s: Attempted to login with non existant account `%s'") % client % account_name, true);
Second screenshot
011A8E7D mov dword ptr [ebp-10h],ecx
011A8E80 mov ecx,dword ptr [this]
011A8E83 call Snow::BaseResult::BaseResult (11A31D9h)
011A8E88 mov dword ptr [ebp-4],0
011A8E8F lea eax,[ebp-30h]
011A8E92 push eax
011A8E93 mov ecx,dword ptr [query]
011A8E96 call boost::basic_format<char,std::char_traits<char>,std::allocator<char> >::str (11A1E01h)
011A8E9B mov dword ptr [ebp-34h],eax
011A8E9E mov ecx,dword ptr [ebp-34h]
011A8EA1 mov dword ptr [ebp-38h],ecx
011A8EA4 mov byte ptr [ebp-4],1
011A8EA8 mov edx,dword ptr [ebp-38h]
011A8EAB push edx
011A8EAC mov eax,dword ptr [db]
011A8EAF push eax
011A8EB0 mov ecx,dword ptr [this]
011A8EB3 call Snow::SingleResult::_ExecuteQuery (124F380h)
011A8EB8 mov byte ptr [ebp-4],0 // VS GREEN ARROW HERE
011A8EBC lea ecx,[ebp-30h]
011A8EBF call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string<char,std::char_traits<char>,std::allocator<char> > (11A2C02h)
011A8EC4 mov dword ptr [ebp-4],0FFFFFFFFh
011A8ECB mov eax,dword ptr [this]
011A8ECE mov ecx,dword ptr [ebp-0Ch]
011A8ED1 mov dword ptr fs:[0],ecx
011A8ED8 pop edi
011A8ED9 add esp,38h
011A8EDC cmp ebp,esp
011A8EDE call _RTC_CheckEsp (12B4450h)
011A8EE3 mov esp,ebp
011A8EE5 pop ebp
011A8EE6 ret 8
UPDATE
Following the suggestion of peterchen, I added ASSERT(AuthDb); here:
ASSERT(AuthDb);
SingleResult sr(AuthDb, format("SELECT Id, Access, Flags, SessionKey, RealmSplitPreference FROM accounts WHERE Name = '%s'") % Escaped(account_name));
And it failed O.o And yet the debugger keeps insisting that it's not NULL.. It's not shadowed by a local
UPDATE2*
cout << AuthDb; is 0 in there even if the debugger says it's not
FOUND THE PROBLEM
Database *AuthDb = NULL, *GameDb = NULL; in a .cpp
extern thread Database *AuthDb, *GameDb; in a .h
The variable was marked thread (TLS - Thread local storage) in the header, but not TLS in the definition...
Countless hours wasted on this super stupid mistake, no warnings or hints or anything from the compiler that I feel like killing right now. :( Oh well, like I said for each impossible behavior there is a solution that once known seems obvious :)
Thanks to everyone who helped, I was really desperate!
Is AuthDB a thread-local variable?
Maybe the debugger isn't handling it correctly. What if you ASSERT(AuthDB) before the constructor is called?
UPDATE: If it is thread-local, it simply hasn't been initialized in this thread.
One possibility is that in the second screenshot, you have the debugger stopped at the very beginning of the function, before the stack has been manipulated, and so the variable locations aren't correct. You might also be after the end of the function, where the stack has already been torn down. I've seen that sort of thing before.
Expand that function to be a multiline function, so that it looks like this:
struct SingleResult : public BaseResult
{
SingleResult(Database *db, const boost::format& query)
{
_ExecuteQuery(db, query.str());
}
}
... and see if it still shows db as null when you have the debugger stopped on the _ExecuteQuery line.
Do you have a local AuthDB that is null and hides your global one?
(I'd expect the debugger in this case to correctly show you the local one... but with VS quality decay, I'd not rely on that)
I'd change the code to:
_ASSERTE(AuthDb);
SingleResult sr(AuthDb, format(...));
....
struct SingleResult : public BaseResult
{
SingleResult(Database *db, const boost::format& query)
{
_ASSERTE(db);
_ExecuteQuery(db, query.str());
}
}
And follow the disassembly in the debugger.
Well I'm not sure what classes/functions you're using here, but from a quick glance, shouldn't it be:
SingleResult sr(AuthDb, format("SELECT Id, Access, Flags, SessionKey, RealmSplitPreference FROM accounts WHERE Name = '%s'", Escaped(account_name)));
instead of:
SingleResult sr(AuthDb, format("SELECT Id, Access, Flags, SessionKey, RealmSplitPreference FROM accounts WHERE Name = '%s'") % Escaped(account_name));
It seems to me you're putting a modulus of the of Escaped(account_name) rather than passing that as an argument to format. However, I could just be confused.
Mike
I have no guess what is going on, but if I were debugging this I would be looking at the assembly language to see what is happening. You may need to get a better understanding of your platforms calling convention (i.e. how are arguments passed, on the stack, in registers, etc.) to troubleshoot this.
There is likely a bug somewhere else in your program. I suggest you find the problem by looking elsewhere in your code.
Another possibility--you might be looking at a memory overwrite due to a wild pointer somewhere.
Assuming your debugger supports it when you hit the first line set a memory-write breakpoint on AuthDb.