Is VC++ still broken Sequentially-Consistent-wise? - c++

I watched (most of) Herb Sutter's "atomic<> Weapons" talk, and I wanted to test the "conditional lock" sample with a loop inside it. Apparently, although (if I understand correctly) the C++11 standard says the example below should work properly and be sequentially consistent, it does not.
Before you read on, my question is: Is this correct? Is the compiler broken? Is my code broken - do I have a race condition here which I missed? How do I bypass this?
I tried it on 3 different versions of Visual C++: VC10 professional, VC11 professional and VC12 Express (== Visual Studio 2013 Desktop Express).
Below is the code I used for the Visual Studio 2013. For the other versions I used boost instead of std, but the idea is the same.
#include <iostream>
#include <thread>
#include <mutex>
int a = 0;
std::mutex m;
void other()
{
std::lock_guard<std::mutex> l(m);
std::this_thread::sleep_for(std::chrono::milliseconds(2));
a = 999999;
std::this_thread::sleep_for(std::chrono::seconds(2));
std::cout << a << "\n";
}
int main(int argc, char* argv[])
{
bool work = (argc > 1);
if (work)
{
m.lock();
}
std::thread th(other);
for (int i = 0; i < 100000000; ++i)
{
if (i % 7 == 3)
{
if (work)
{
++a;
}
}
}
if (work)
{
std::cout << a << "\n";
m.unlock();
}
th.join();
}
To summarize the idea of the code: The global variable a is protected by the global mutex m. Assuming there are no command line arguments (argc==1) the thread which runs other() is the only one which is supposed to access the global variable a.
The correct output of the program is to print 999999.
However, because of the compiler loop optimization (using a register for in-loop increments and at the end of the loop copying the value back to a), a is modified by the assembly even though it's not supposed to.
This happened in all 3 VC versions, although in this code example in VC12 I had to plant some calls to sleep() to make it break.
Here's some of the assembly code (the address of a in this run is 0x00f65498):
Loop initialization - value from a is copied into edi
27: for (int i = 0; i < 100000000; ++i)
00F61543 xor esi,esi
00F61545 mov edi,dword ptr ds:[0F65498h]
00F6154B jmp main+0C0h (0F61550h)
00F6154D lea ecx,[ecx]
28: {
29: if (i % 7 == 3)
Increment within the condition, and after the loop copied back to the location of a unconditionally
30: {
31: if (work)
00F61572 mov al,byte ptr [esp+1Bh]
00F61576 jne main+0EDh (0F6157Dh)
00F61578 test al,al
00F6157A je main+0EDh (0F6157Dh)
32: {
33: ++a;
00F6157C inc edi
27: for (int i = 0; i < 100000000; ++i)
00F6157D inc esi
00F6157E cmp esi,5F5E100h
00F61584 jl main+0C0h (0F61550h)
32: {
33: ++a;
00F61586 mov dword ptr ds:[0F65498h],edi
34: }
And the output of the program is 0.

The 'volatile' keyword will prevent that kind of optimization. That's exactly what it's for: every use of 'a' will be read or written exactly as shown, and won't be moved in a different order to other volatile variables.
The implementation of the mutex should include compiler-specific instructions to cause a "fence" at that point, telling the optimizer not to reorder instructions across that boundary. Since the implementation is not from the compiler vendor, maybe that's left out? I've never checked.
Since 'a' is global, I would generally think the compiler would be more careful with it. But, VS10 doesn't know about threads so it won't consider that other threads will use it. Since the optimizer grasps the entire loop execution, it knows that functions called from within the loop won't touch 'a' and that's enough for it.
I'm not sure what the new standard says about the visibility of non-volatile global variables to other threads. That is, is there a rule that would prevent this optimization? Even if the compiler can see all the way down the call tree and knows that no called function uses the global, must it still assume that other threads can?
I suggest trying the newer compiler with the compiler-provided std::mutex, and checking what the C++ standard and current drafts say about that. I think the above should help you know what to look for.
—John

Almost a month later, Microsoft still hasn't responded to the bug in MSDN Connect.
To summarize the above comments (and some further tests), apparently it happens in VS2013 professional as well, but the bug only happens when building for Win32, not for x64. The generated assembly code in x64 doesn't have this problem.
So it appears that it is a bug in the optimizer, and that there's no race condition in this code.
Apparently this bug also happens in GCC 4.8.1, but not in GCC 4.9.
(Thanks to Voo, nosid and Chris Dodd for all their testing).
It was suggested to mark a as volatile. This indeed prevents the bug, but only because it prevents the optimizer from performing the loop register optimization.
I found another solution: Add another local variable b, and if needed (and under lock) do the following:
Copy a into b
Increment b in the loop
Copy back to a if needed
The optimizer replaces the local variable with a register, so the code is still optimized, but the copies from and to a are done only if needed, and under lock.
Here's the new main() code, with arrows marking the changed lines.
int main(int argc, char* argv[])
{
bool work = (argc == 1);
int b = 0; // <----
if (work)
{
m.lock();
b = a; // <----
}
std::thread th(other);
for (int i = 0; i < 100000000; ++i)
{
if (i % 7 == 3)
{
if (work)
{
++b; // <----
}
}
}
if (work)
{
a = b; // <----
std::cout << a << "\n";
m.unlock();
}
th.join();
}
And this is what the assembly code looks like (&a == 0x000744b0, b replaced with edi):
21: int b = 0;
00071473 xor edi,edi
22:
23: if (work)
00071475 test bl,bl
00071477 je main+5Bh (07149Bh)
24: {
25: m.lock();
........
00071492 add esp,4
26: b = a;
00071495 mov edi,dword ptr ds:[744B0h]
27: }
28:
........
33: {
34: if (work)
00071504 test bl,bl
00071506 je main+0C9h (071509h)
35: {
36: ++b;
00071508 inc edi
30: for (int i = 0; i < 100000000; ++i)
00071509 inc esi
0007150A cmp esi,5F5E100h
00071510 jl main+0A0h (0714E0h)
37: }
38: }
39: }
40:
41: if (work)
00071512 test bl,bl
00071514 je main+10Ch (07154Ch)
42: {
43: a = b;
44: std::cout << a << "\n";
00071516 mov ecx,dword ptr ds:[73084h]
0007151C push edi
0007151D mov dword ptr ds:[744B0h],edi
00071523 call dword ptr ds:[73070h]
00071529 mov ecx,eax
0007152B call std::operator<<<std::char_traits<char> > (071A80h)
........
This keeps the optimization and solves (or works around) the problem.

Related

Result of TLS variable access not cached

Edit: It seems this is a compiler bug indeed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803
I am writing a wrapper for writing logs that uses TLS to store a std::stringstream buffer. This code will be used by shared libraries. When looking at the code on godbolt.org, it seems that neither gcc nor clang will cache the result of a TLS lookup (the loop repeatedly calls '__tls_get_addr()'), even though I believe I have designed my class in a way that should let them.
#include <sstream>
class LogStream
{
public:
LogStream()
: m_buffer(getBuffer())
{
}
LogStream(std::stringstream& buffer)
: m_buffer(buffer)
{
}
static std::stringstream& getBuffer()
{
thread_local std::stringstream buffer;
return buffer;
}
template <typename T>
inline LogStream& operator<<(const T& t)
{
m_buffer << t;
return *this;
}
private:
std::stringstream& m_buffer;
};
int main()
{
LogStream log{};
for (int i = 0; i < 12345678; ++i)
{
log << i;
}
}
Looking at the assembly code output both gcc and clang generate pretty similar output:
clang 5.0.0:
xor ebx, ebx
.LBB0_3: # =>This Inner Loop Header: Depth=1
data16
lea rdi, [rip + LogStream::getBuffer[abi:cxx11]()::buffer[abi:cxx11]@TLSGD]
data16
data16
rex64
call __tls_get_addr@PLT // Called on every loop iteration.
lea rdi, [rax + 16]
mov esi, ebx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
inc ebx
cmp ebx, 12345678
jne .LBB0_3
gcc 7.2:
xor ebx, ebx
.L3:
lea rdi, guard variable for LogStream::getBuffer[abi:cxx11]()::buffer@tlsld[rip]
call __tls_get_addr@PLT // Called on every loop iteration.
mov esi, ebx
add ebx, 1
lea rdi, LogStream::getBuffer[abi:cxx11]()::buffer@dtpoff[rax+16]
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
cmp ebx, 12345678
jne .L3
How can I convince both compilers that the lookup doesn't need to be repeatedly done?
Compiler options: -std=c++11 -O3 -fPIC
This really looks like an optimization bug in both Clang and GCC.
Here's what I think happens. (I might be completely off.) The compiler completely inlines everything down to this code:
int main()
{
// pseudo-access
std::stringstream& m_buffer = LogStream::getBuffer::buffer;
for (int i = 0; i < 12345678; ++i)
{
m_buffer << i;
}
}
And then, not realizing that access to a thread-local is very expensive under -fPIC, it decides that the temporary reference to the global is not necessary and inlines that as well:
int main()
{
for (int i = 0; i < 12345678; ++i)
{
// pseudo-access
LogStream::getBuffer::buffer << i;
}
}
Whatever actually happens, this is clearly a pessimization of the code you wrote. You should report this as a bug to GCC and Clang.
GCC bugtracker: https://gcc.gnu.org/bugzilla/
Clang bugtracker: https://bugs.llvm.org/

Adding stringstream/cout hurts performance, even when the code is never called

I have a program in which a simple function is called a large number of times. I have added some simple logging code and find that this significantly affects performance, even when the logging code is not actually called. A complete (but simplified) test case is shown below:
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <sstream>
using namespace std::chrono;
std::mt19937 rng;
uint32_t getValue()
{
// Just some pointless work, helps stop this function from getting inlined.
for (int x = 0; x < 100; x++)
{
rng();
}
// Get a value, which happens never to be zero
uint32_t value = rng();
// This (by chance) is never true
if (value == 0)
{
value++; // This if statement won't get optimized away when printing below is commented out.
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
return value;
}
int main(int argc, char* argv[])
{
// Just for timing
high_resolution_clock::time_point start = high_resolution_clock::now();
uint32_t sum = 0;
for (uint32_t i = 0; i < 10000000; i++)
{
sum += getValue();
}
milliseconds elapsed = duration_cast<milliseconds>(high_resolution_clock::now() - start);
// Use (print) the sum to make sure it doesn't get optimized away.
std::cout << "Sum = " << sum << ", Elapsed = " << elapsed.count() << "ms" << std::endl;
return 0;
}
Note that the code contains stringstream and cout but these are never actually called. However, the presence of these three lines of code increases the run time from 2.9 to 3.3 seconds. This is in release mode on VS2013. Curiously, if I build in GCC using '-O3' flag the extra three lines of code actually decrease the runtime by half a second or so.
I understand that the extra code could impact the resulting executable in a number of ways, such as by preventing inlining or causing more cache misses. The real question is whether there is anything I can do to improve on this situation? Switching to sprintf()/printf() doesn't seem to make a difference. Do I need to simply accept that adding such logging code to small functions will affect performance even if not called?
Note: For completeness, my real/full scenario is that I use a wrapper macro to throw exceptions, and I like to log when such an exception is thrown. So when I call THROW_EXCEPT(...) it inserts code similar to that shown above and then throws. This then hurts when I throw exceptions from inside a small function. Are there any better alternatives here?
Edit: Here is a VS2013 solution for quick testing, and so compiler settings can be checked: https://drive.google.com/file/d/0B7b4UnjhhIiEamFyS0hjSnVzbGM/view?usp=sharing
So I initially thought that this was due to branch prediction and optimising out branches so I took a look at the annotated assembly for when the code is commented out:
if (value == 0)
00E21371 mov ecx,1
00E21376 cmove eax,ecx
{
value++;
Here we see that the compiler has helpfully optimised out our branch, so what if we put in a more complex statement to prevent it from doing so:
if (value == 0)
00AE1371 jne getValue+99h (0AE1379h)
{
value /= value;
00AE1373 xor edx,edx
00AE1375 xor ecx,ecx
00AE1377 div eax,ecx
Here the branch is left in, but this version runs about as fast as the previous example with the three lines commented out. So let's have a look at the assembly with those lines left in:
if (value == 0)
008F13A0 jne getValue+20Bh (08F14EBh)
{
value++;
std::stringstream ss;
008F13A6 lea ecx,[ebp-58h]
008F13A9 mov dword ptr [ss],8F32B4h
008F13B3 mov dword ptr [ebp-0B0h],8F32F4h
008F13BD call dword ptr ds:[8F30A4h]
008F13C3 push 0
008F13C5 lea eax,[ebp-0A8h]
008F13CB mov dword ptr [ebp-4],0
008F13D2 push eax
008F13D3 lea ecx,[ss]
008F13D9 mov dword ptr [ebp-10h],1
008F13E0 call dword ptr ds:[8F30A0h]
008F13E6 mov dword ptr [ebp-4],1
008F13ED mov eax,dword ptr [ss]
008F13F3 mov eax,dword ptr [eax+4]
008F13F6 mov dword ptr ss[eax],8F32B0h
008F1401 mov eax,dword ptr [ss]
008F1407 mov ecx,dword ptr [eax+4]
008F140A lea eax,[ecx-68h]
008F140D mov dword ptr [ebp+ecx-0C4h],eax
008F1414 lea ecx,[ebp-0A8h]
008F141A call dword ptr ds:[8F30B0h]
008F1420 mov dword ptr [ebp-4],0FFFFFFFFh
That's a lot of instructions if that branch is ever hit. So what if we try something else?
if (value == 0)
011F1371 jne getValue+0A6h (011F1386h)
{
value++;
printf("This never gets printed, but commenting out these three lines improves performance.");
011F1373 push 11F31D0h
011F1378 call dword ptr ds:[11F30ECh]
011F137E add esp,4
Here we have far fewer instructions and once again it runs as quickly as with all lines commented out.
So I'm not sure I can say for certain exactly what is happening here but I feel at the moment it is a combination of branch prediction and CPU instruction cache misses.
In order to solve this problem you could move the logging into a function like so:
void log()
{
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
and
if (value == 0)
{
value++;
log();
Then it runs as fast as before with all those instructions replaced with a single call log (011C12E0h).

Dangerous error Visual c++ 2005

I bumped into a very serious error using Visual Studio 2005 with a C++ Win32 console application. The problem shows up when running the (simplified) code below with the following project properties: C++|optimization|optimization|/O2 (or /O1, or /Ox), C++|optimization|Whole program optimization|/GL, linker|optimization|/ltcg
#include "stdafx.h"
#include <iostream>
using namespace std;
const int MAXVAL=10;
class MyClass
{
private:
int p;
bool isGood;
public:
int SetUp(int val);
};
int MyClass::SetUp(int val)
{
isGood = true;
if (MAXVAL<val)
{
int wait;
cerr<<"ERROR, "<<MAXVAL<<"<"<<val<<endl;
cin>>wait;
//exit(1); //for x64 uncomment, for win32 leave commented
}
if (isGood) p=4;
return 1;
}
int _tmain(int argc, _TCHAR* argv[])
{
int wait=0, setupVal1=10, setupVal2=12;
MyClass classInstance1;
MyClass classInstance2;
if (MAXVAL>=setupVal1) classInstance1.SetUp(setupVal1);
if (MAXVAL>setupVal2) classInstance2.SetUp(setupVal2);
cerr<<"exit, enter value to terminate\n";
cin>>wait;
return 0;
}
The output shows that value 10 is smaller than value 10! I already found out that changing the setting from /O2 to /Od solves the problem (setting /Og, which is part of /O2, causes the problem), but that really slows down execution. Changing the code a bit can also make it go away, but then I can never be sure that the code is reliable. I am using Visual Studio 2005 Professional (version 8.0.50727.867) on Windows 7.
My questions are: can someone try to reproduce this error using Visual Studio 2005 (I already tried VS 2010, no problem), and if so, what happens here?
Can I assume that newer versions have solved this problem? (I am considering buying VS 2012.)
Thank you
You can reduce your example significantly and still get the same problem! You don't need two instances, and you don't need any of the other local or member variables. Also, you can hardcode MAXVAL.
Quick summary of what "solves" the problem:
making MAXVAL a nonconst int
setting setupVal2 to a value less than 10
surprisingly, changing the condition 10<val to val>10 !!!
Here's my minimal version to reproduce the problem:
#include "stdafx.h"
#include <iostream>
using namespace std;
class MyClass
{
public:
int SetUp(int val);
};
int MyClass::SetUp(int val)
{
if (10<val)
cout<<10<<"<"<<val<<endl;
return 1;
}
int _tmain(int argc, _TCHAR* argv[])
{
int setupVal1=10, setupVal2=12;
MyClass classInstance;
classInstance.SetUp(setupVal1);
classInstance.SetUp(setupVal2);
cin.get();
return 0;
}
The problem, as witnessed by the disassembly, is that the compiler thinks 10<val is always true and therefore omits the check.
_TEXT SEGMENT
?SetUp@MyClass@@QAEHH@Z PROC ; MyClass::SetUp
; _val$ = ecx
; 16 : if (10<val)
; 17 : cout<<10<<"<"<<val<<endl;
mov eax, DWORD PTR __imp_?endl@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@1@AAV21@@Z
push eax
push ecx
mov ecx, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
push 10 ; 0000000aH
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
push eax
call ??$?6U?$char_traits@D@std@@@std@@YAAAV?$basic_ostream@DU?$char_traits@D@std@@@0@AAV10@PBD@Z ; std::operator<<<std::char_traits<char> >
add esp, 4
mov ecx, eax
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@H@Z
mov ecx, eax
call DWORD PTR __imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@P6AAAV01@AAV01@@Z@Z
; 18 : return 1;
mov eax, 1
; 19 : }

Possible VS2012 compiler bug (maybe in Whole Program Optimization?)

Could this be a compiler error? My environment is:
Win7 pro (64-bit)
VS2012 (update 3)
I compile the tiny console program below. Things work fine for 64-bit release and debug builds. The 32-bit debug build also works just fine. The 32-bit release build, however, displays 'BUG!!'.
If I disable 'Whole Program Optimization', that fixes the issue.
Any ideas?
#include <string>
#include <iostream>
int main()
{
std::string const buffer = "hello, world";
std::string::size_type pos = 0;
std::string::size_type previous_pos;
while (pos != std::string::npos)
{
previous_pos = ++pos;
pos = buffer.find('w', pos);
}
if (previous_pos == std::string::npos)
{
std::cout << "BUG!!"<< std::endl;
}
return 0;
}
I can reproduce this too. When the bug manifests, the code is testing eax to determine whether to output "BUG", which is the same register it's using for 'pos'.
17: previous_pos = ++pos;
013C12E5 inc eax
...
21: if (previous_pos == std::string::npos)
00301345 cmp eax,eax
00301347 jne main+0F6h (0301366h)
However, if you make a change that gets the optimizer to realise the two variables are distinct, then the test is different. If I add ++previous_pos at the end of the loop body then it uses ecx for previous_pos and the bug goes away:
22: if (previous_pos == std::string::npos)
00361349 cmp ecx,eax
0036134B jne main+0FAh (036136Ah)
If I change the find to 'pos = buffer.find('w', previous_pos);' (searching from previous_pos instead of pos, which has the same value) then it uses ebx, and again the bug goes away:
21: if (previous_pos == std::string::npos)
00191345 cmp ebx,eax
00191347 jne main+0F6h (0191366h)
So it seems in the original that the optimiser is erroneously deciding that it can use eax for both of those variables, despite the line 'pos = buffer.find('w', pos);' that can set pos to a different value than previous_pos.

Extremely bizarre code generation in Visual C++, for nearly identical code; 3x speed difference

The code below (reduced from my larger code, after my astonishment at how its speed paled in comparison with that of std::vector) has two peculiar features:
It runs more than three times faster when I make a very tiny modification to the source code (always compiling it with /O2 with Visual C++ 2010).
Note: To make this a little more fun, I put a hint for the modification at the end, so you can spend some time figuring out the change yourself. The original code was ~500 lines, so it took me a heck of a lot longer to pin it down, since the fix looks pretty irrelevant to the performance.
It runs about 20% faster with /MTd than with /MT, even though the output loop looks the same!!!
The difference in the assembly code for the tiny-modification case is:
Loop without the modification (~300 ms):
00403383 mov esi,dword ptr [esp+10h]
00403387 mov edx,dword ptr [esp+0Ch]
0040338B mov dword ptr [edx+esi*4],eax
0040338E add dword ptr [esp+10h],ecx
00403392 add eax,ecx
00403394 cmp eax,4000000h
00403399 jl main+43h (403383h)
Loop with /MTd (looks identical! but ~270 ms):
00407D73 mov esi,dword ptr [esp+10h]
00407D77 mov edx,dword ptr [esp+0Ch]
00407D7B mov dword ptr [edx+esi*4],eax
00407D7E add dword ptr [esp+10h],ecx
00407D82 add eax,ecx
00407D84 cmp eax,4000000h
00407D89 jl main+43h (407D73h)
Loop with the modification (~100 ms!!):
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
Now my question is, why should the changes above have the effects they do? It's completely bizarre!
Especially the first one -- it shouldn't affect anything at all (once you see the difference in the code), and yet it lowers the speed dramatically.
Is there an explanation for this?
#include <cstdio>
#include <ctime>
#include <algorithm>
#include <memory>
template<class T, class Allocator = std::allocator<T> >
struct vector : Allocator
{
T *p;
size_t n;
struct scoped
{
T *p_;
size_t n_;
Allocator &a_;
~scoped() { if (p_) { a_.deallocate(p_, n_); } }
scoped(Allocator &a, size_t n) : a_(a), n_(n), p_(a.allocate(n, 0)) { }
void swap(T *&p, size_t &n)
{
std::swap(p_, p);
std::swap(n_, n);
}
};
vector(size_t n) : n(0), p(0) { scoped(*this, n).swap(p, n); }
void push_back(T const &value) { p[n++] = value; }
};
int main()
{
int const COUNT = 1 << 26;
vector<int> vect(COUNT);
clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
}
Hint:
It has to do with the allocator.
Answer:
Change Allocator &a_ to Allocator a_.
For what it's worth, my speculation for the difference between /MT and /MTd is that the /MTd heap allocation will paint the heap memory for debugging purposes making it more likely to be paged in - that occurs before you start the clock.
If you 'pre-heat' the vector allocation, you get the same numbers for /MT and /MTd:
vector<int> vect(COUNT);
// make sure vect's memory is warmed up
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
vect.n = 0; // clear the vector
clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
It's strange that Allocator& will break the alias chain while Allocator will not.
You can try
for(int i=vect.n; i<COUNT;++i){
...
}
to make it explicit that i and n stay in sync.
This should make the loop much easier for VC to optimize.
Hmm... It seems that the "fastest" code
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
is somewhat over-optimized. In this loop, vect.n is ignored entirely.
If an exception were thrown inside the loop, vect.n would not be updated correctly.
So the answer might be: when you use Allocator, VC figures out that vect.n will never be used again, so it can be ignored. That's impressive, but in general it is not very useful, and it is dangerous.