Could this be a compiler error? My environment is:
Win7 pro (64-bit)
VS2012 (update 3)
I compile the tiny console program below. The x64 release and debug builds work fine, as does the x86 debug build. The x86 release build, however, displays 'BUG!!'.
Disabling 'Whole Program Optimization' fixes the issue.
Any ideas?
#include <string>
#include <iostream>

int main()
{
    std::string const buffer = "hello, world";
    std::string::size_type pos = 0;
    std::string::size_type previous_pos;
    while (pos != std::string::npos)
    {
        previous_pos = ++pos;
        pos = buffer.find('w', pos);
    }
    if (previous_pos == std::string::npos)
    {
        std::cout << "BUG!!" << std::endl;
    }
    return 0;
}
I can reproduce this too. When the bug manifests, the code is testing eax to determine whether to output "BUG", which is the same register it's using for 'pos'.
17: previous_pos = ++pos;
013C12E5 inc eax
...
21: if (previous_pos == std::string::npos)
00301345 cmp eax,eax
00301347 jne main+0F6h (0301366h)
However, if you make a change to get the optimiser to realise they are distinct, then the test is different. If I add ++previous_pos at the end of the loop body, it uses ecx for previous_pos and the bug goes away:
22: if (previous_pos == std::string::npos)
00361349 cmp ecx,eax
0036134B jne main+0FAh (036136Ah)
If I change the find to 'pos = buffer.find('w', previous_pos);' (searching from previous_pos instead of pos, which has the same value) then it uses ebx, and again the bug goes away:
21: if (previous_pos == std::string::npos)
00191345 cmp ebx,eax
00191347 jne main+0F6h (0191366h)
So it seems in the original that the optimiser is erroneously deciding that it can use eax for both of those variables, despite the line 'pos = buffer.find('w', pos);' that can set pos to a different value than previous_pos.
Edit: it seems this is indeed a compiler bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803
I am writing a wrapper for writing logs that uses TLS to store a std::stringstream buffer. This code will be used by shared libraries. Looking at the code on godbolt.org, it seems that neither gcc nor clang will cache the result of a TLS lookup (the loop repeatedly calls __tls_get_addr()), even though I believe I have designed my class in a way that should let them.
#include <sstream>

class LogStream
{
public:
    LogStream()
        : m_buffer(getBuffer())
    {
    }

    LogStream(std::stringstream& buffer)
        : m_buffer(buffer)
    {
    }

    static std::stringstream& getBuffer()
    {
        thread_local std::stringstream buffer;
        return buffer;
    }

    template <typename T>
    inline LogStream& operator<<(const T& t)
    {
        m_buffer << t;
        return *this;
    }

private:
    std::stringstream& m_buffer;
};

int main()
{
    LogStream log{};
    for (int i = 0; i < 12345678; ++i)
    {
        log << i;
    }
}
Looking at the assembly output, both gcc and clang generate pretty similar code:
clang 5.0.0:
xor ebx, ebx
.LBB0_3: # =>This Inner Loop Header: Depth=1
data16
lea rdi, [rip + LogStream::getBuffer[abi:cxx11]()::buffer[abi:cxx11]#TLSGD]
data16
data16
rex64
call __tls_get_addr#PLT // Called on every loop iteration.
lea rdi, [rax + 16]
mov esi, ebx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)#PLT
inc ebx
cmp ebx, 12345678
jne .LBB0_3
gcc 7.2:
xor ebx, ebx
.L3:
lea rdi, guard variable for LogStream::getBuffer[abi:cxx11]()::buffer#tlsld[rip]
call __tls_get_addr#PLT // Called on every loop iteration.
mov esi, ebx
add ebx, 1
lea rdi, LogStream::getBuffer[abi:cxx11]()::buffer#dtpoff[rax+16]
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)#PLT
cmp ebx, 12345678
jne .L3
How can I convince both compilers that the lookup doesn't need to be repeated on every iteration?
Compiler options: -std=c++11 -O3 -fPIC
Godbolt link
This really looks like an optimization bug in both Clang and GCC.
Here's what I think happens. (I might be completely off.) The compiler completely inlines everything down to this code:
int main()
{
    // pseudo-access
    std::stringstream& m_buffer = LogStream::getBuffer::buffer;
    for (int i = 0; i < 12345678; ++i)
    {
        m_buffer << i;
    }
}
And then, not realizing that access to a thread-local is very expensive under -fPIC, it decides that the temporary reference to the global is not necessary and inlines that as well:
int main()
{
    for (int i = 0; i < 12345678; ++i)
    {
        // pseudo-access
        LogStream::getBuffer::buffer << i;
    }
}
Whatever actually happens, this is clearly a pessimization of the code you wrote. You should report it as a bug to both GCC and Clang.
GCC bugtracker: https://gcc.gnu.org/bugzilla/
Clang bugtracker: https://bugs.llvm.org/
Closed 6 years ago. This question needs details or clarity; it is not currently accepting answers.
I have been taking a stab at writing a metamorphic engine. I started by trying to analyze the opcode bytes of a function's instructions, but it does not seem to find anything. The instruction I am looking for in the function is MOV. Why does it not report anything, even though MOV instructions are present in the function?
#include <iostream>
#include <Windows.h>

using namespace std;

struct OPCODE
{
    unsigned short usSize;
    PBYTE pbOpCode;
    bool bRelative;
    bool bMutated;
};

namespace MOVRegisters
{
    enum MovRegisters
    {
        EAX = 0xB8,
        ECX,
        EDX,
        EBX,
        ESP,
        EBP,
        ESI,
        EDI
    };
}

bool __fastcall bIsMOV(PBYTE pInstruction)
{
    if (*pInstruction == MOVRegisters::EAX || *pInstruction == MOVRegisters::ECX ||
        *pInstruction == MOVRegisters::EDX || *pInstruction == MOVRegisters::EBX ||
        *pInstruction == MOVRegisters::ESP || *pInstruction == MOVRegisters::EBP ||
        *pInstruction == MOVRegisters::ESI || *pInstruction == MOVRegisters::EDI)
        return true;
    else
        return false;
}

void pCheckByte(PVOID pFunction, PBYTE pFirstFive)
{
    if (*pFirstFive == 0x0)
        memcpy(pFirstFive, pFunction, 5);
    else
        memcpy(pFunction, pFirstFive, 5);

    PBYTE pCurrentByte = (PBYTE)pFunction;
    while (*pCurrentByte != 0xC3 && *pCurrentByte != 0xC2 &&
           *pCurrentByte != 0xCB && *pCurrentByte != 0xCA)
    {
        OPCODE* pNewOp = new OPCODE();
        pNewOp->pbOpCode = pCurrentByte;
        if (bIsMOV(pCurrentByte))
        {
            cout << "mov instr.\n";
        }
    }
}

void function()
{
    int eaxVal;
    __asm
    {
        mov eax, 5
        add eax, 6
        mov eaxVal, eax
    }
    printf("Testing %d\n", eaxVal);
}

int main()
{
    PBYTE pFirstFive = (PBYTE)malloc(5);
    RtlZeroMemory(pFirstFive, 5);
    while (true)
    {
        pCheckByte(function, pFirstFive);
        system("pause");
    }
    return 0;
}
Did you look at the disassembly of function()? The first instruction probably won't be mov eax, 5, since MSVC probably makes a stack frame in functions with inline asm. (push ebp / mov ebp, esp).
Does your code actually loop over the bytes of the function? pCurrentByte is never advanced inside the loop, so it examines the same byte forever, and it leaks memory every iteration. The only occurrence of pNewOp is the two lines below, so it's write-only.
OPCODE* pNewOp = new OPCODE();
pNewOp->pbOpCode = pCurrentByte;
Note that looping over all the bytes will give false positives, because 0xb3 or whatever can occur as a non-opcode byte. (e.g. a ModR/M or SIB byte, or immediate data.) Similarly, you could have false positives on your 0xC3, ... scan for ret instructions. Again, look at disassembly with the raw machine code.
Writing your own code for parsing x86 machine code seems like a lot of unnecessary work; there are many tools and libraries that already do this.
Also, single-step through your C++ code in a debugger to see what it does.
I have a program in which a simple function is called a large number of times. I have added some simple logging code and find that this significantly affects performance, even when the logging code is not actually called. A complete (but simplified) test case is shown below:
#include <chrono>
#include <iostream>
#include <random>
#include <sstream>

using namespace std::chrono;

std::mt19937 rng;

uint32_t getValue()
{
    // Just some pointless work, helps stop this function from getting inlined.
    for (int x = 0; x < 100; x++)
    {
        rng();
    }

    // Get a value, which happens never to be zero
    uint32_t value = rng();

    // This (by chance) is never true
    if (value == 0)
    {
        value++; // This if statement won't get optimized away when the printing below is commented out.
        std::stringstream ss;
        ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
        std::cout << ss.str();
    }
    return value;
}

int main(int argc, char* argv[])
{
    // Just for timing
    high_resolution_clock::time_point start = high_resolution_clock::now();

    uint32_t sum = 0;
    for (uint32_t i = 0; i < 10000000; i++)
    {
        sum += getValue();
    }

    milliseconds elapsed = duration_cast<milliseconds>(high_resolution_clock::now() - start);

    // Use (print) the sum to make sure it doesn't get optimized away.
    std::cout << "Sum = " << sum << ", Elapsed = " << elapsed.count() << "ms" << std::endl;
    return 0;
}
Note that the code contains stringstream and cout but these are never actually called. However, the presence of these three lines of code increases the run time from 2.9 to 3.3 seconds. This is in release mode on VS2013. Curiously, if I build with GCC using the '-O3' flag, the extra three lines of code actually decrease the runtime by half a second or so.
I understand that the extra code could impact the resulting executable in a number of ways, such as by preventing inlining or causing more cache misses. The real question is whether there is anything I can do to improve on this situation? Switching to sprintf()/printf() doesn't seem to make a difference. Do I need to simply accept that adding such logging code to small functions will affect performance even if not called?
Note: For completeness, my real/full scenario is that I use a wrapper macro to throw exceptions, and I like to log when such an exception is thrown. So when I call THROW_EXCEPT(...) it inserts code similar to that shown above and then throws. This then hurts performance when I throw exceptions from inside a small function. Are there any better alternatives here?
Edit: Here is a VS2013 solution for quick testing, and so compiler settings can be checked: https://drive.google.com/file/d/0B7b4UnjhhIiEamFyS0hjSnVzbGM/view?usp=sharing
I initially thought that this was due to branch prediction and the optimising-out of branches, so I took a look at the annotated assembly for when the logging code is commented out:
if (value == 0)
00E21371 mov ecx,1
00E21376 cmove eax,ecx
{
value++;
Here we see that the compiler has helpfully optimised out our branch, so what if we put in a more complex statement to prevent it from doing so:
if (value == 0)
00AE1371 jne getValue+99h (0AE1379h)
{
value /= value;
00AE1373 xor edx,edx
00AE1375 xor ecx,ecx
00AE1377 div eax,ecx
Here the branch is left in, but when running this it is about as fast as the previous example with the logging lines commented out. So let's have a look at the assembly with those lines left in:
if (value == 0)
008F13A0 jne getValue+20Bh (08F14EBh)
{
value++;
std::stringstream ss;
008F13A6 lea ecx,[ebp-58h]
008F13A9 mov dword ptr [ss],8F32B4h
008F13B3 mov dword ptr [ebp-0B0h],8F32F4h
008F13BD call dword ptr ds:[8F30A4h]
008F13C3 push 0
008F13C5 lea eax,[ebp-0A8h]
008F13CB mov dword ptr [ebp-4],0
008F13D2 push eax
008F13D3 lea ecx,[ss]
008F13D9 mov dword ptr [ebp-10h],1
008F13E0 call dword ptr ds:[8F30A0h]
008F13E6 mov dword ptr [ebp-4],1
008F13ED mov eax,dword ptr [ss]
008F13F3 mov eax,dword ptr [eax+4]
008F13F6 mov dword ptr ss[eax],8F32B0h
008F1401 mov eax,dword ptr [ss]
008F1407 mov ecx,dword ptr [eax+4]
008F140A lea eax,[ecx-68h]
008F140D mov dword ptr [ebp+ecx-0C4h],eax
008F1414 lea ecx,[ebp-0A8h]
008F141A call dword ptr ds:[8F30B0h]
008F1420 mov dword ptr [ebp-4],0FFFFFFFFh
That's a lot of instructions if that branch is ever hit. So what if we try something else?
if (value == 0)
011F1371 jne getValue+0A6h (011F1386h)
{
value++;
printf("This never gets printed, but commenting out these three lines improves performance.");
011F1373 push 11F31D0h
011F1378 call dword ptr ds:[11F30ECh]
011F137E add esp,4
Here we have far fewer instructions and once again it runs as quickly as with all lines commented out.
So I'm not sure I can say for certain exactly what is happening here, but at the moment I feel it is a combination of branch prediction and CPU instruction cache misses.
In order to solve this problem you could move the logging into a function like so:
void log()
{
    std::stringstream ss;
    ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
    std::cout << ss.str();
}
and
if (value == 0)
{
    value++;
    log();
}
Then it runs as fast as before, with all those instructions replaced by a single call to log (011C12E0h).
I watched (most of) Herb Sutter's "atomic<> Weapons" talk, and I wanted to test the "conditional lock with a loop inside" example. Apparently, although (if I understand correctly) the C++11 standard says the example below should work properly and be sequentially consistent, it does not.
Before you read on, my question is: Is this correct? Is the compiler broken? Is my code broken - do I have a race condition here which I missed? How do I bypass this?
I tried it on 3 different versions of Visual C++: VC10 professional, VC11 professional and VC12 Express (== Visual Studio 2013 Desktop Express).
Below is the code I used for the Visual Studio 2013. For the other versions I used boost instead of std, but the idea is the same.
#include <iostream>
#include <thread>
#include <mutex>

int a = 0;
std::mutex m;

void other()
{
    std::lock_guard<std::mutex> l(m);
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
    a = 999999;
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << a << "\n";
}

int main(int argc, char* argv[])
{
    bool work = (argc > 1);
    if (work)
    {
        m.lock();
    }
    std::thread th(other);
    for (int i = 0; i < 100000000; ++i)
    {
        if (i % 7 == 3)
        {
            if (work)
            {
                ++a;
            }
        }
    }
    if (work)
    {
        std::cout << a << "\n";
        m.unlock();
    }
    th.join();
}
To summarize the idea of the code: the global variable a is protected by the global mutex m. Assuming there are no command line arguments (argc == 1), the thread which runs other() is the only one which is supposed to access the global variable a.
The correct output of the program is to print 999999.
However, because of the compiler's loop optimization (using a register for the in-loop increments and copying the value back to a at the end of the loop), a is modified by the generated assembly even though it's not supposed to be.
This happened in all 3 VC versions, although with this code example in VC12 I had to plant some calls to sleep() to make it break.
Here's some of the assembly code (the address of a in this run is 0x00f65498):
Loop initialization: the value of a is copied into edi.
27: for (int i = 0; i < 100000000; ++i)
00F61543 xor esi,esi
00F61545 mov edi,dword ptr ds:[0F65498h]
00F6154B jmp main+0C0h (0F61550h)
00F6154D lea ecx,[ecx]
28: {
29: if (i % 7 == 3)
The increment happens within the condition, and after the loop the value is copied back to the location of a unconditionally:
30: {
31: if (work)
00F61572 mov al,byte ptr [esp+1Bh]
00F61576 jne main+0EDh (0F6157Dh)
00F61578 test al,al
00F6157A je main+0EDh (0F6157Dh)
32: {
33: ++a;
00F6157C inc edi
27: for (int i = 0; i < 100000000; ++i)
00F6157D inc esi
00F6157E cmp esi,5F5E100h
00F61584 jl main+0C0h (0F61550h)
32: {
33: ++a;
00F61586 mov dword ptr ds:[0F65498h],edi
34: }
And the output of the program is 0.
The 'volatile' keyword will prevent that kind of optimization. That's exactly what it's for: every use of 'a' will be read or written exactly as written, and won't be reordered relative to other volatile accesses.
The implementation of the mutex should include compiler-specific instructions to cause a "fence" at that point, telling the optimizer not to reorder instructions across that boundary. Since the implementation is not from the compiler vendor, maybe that's left out? I've never checked.
Since 'a' is global, I would generally think the compiler would be more careful with it. But, VS10 doesn't know about threads so it won't consider that other threads will use it. Since the optimizer grasps the entire loop execution, it knows that functions called from within the loop won't touch 'a' and that's enough for it.
I'm not sure what the new standard says about thread visibility of global variables other than volatile. That is, is there a rule that would prevent that optimization (even though the function can be grasped all the way down so it knows other functions don't use the global, must it assume that other threads can) ?
I suggest trying the newer compiler with the compiler-provided std::mutex, and checking what the C++ standard and current drafts say about that. I think the above should help you know what to look for.
—John
Almost a month later, Microsoft still hasn't responded to the bug in MSDN Connect.
To summarize the above comments (and some further tests), apparently it happens in VS2013 professional as well, but the bug only happens when building for Win32, not for x64. The generated assembly code in x64 doesn't have this problem.
So it appears that it is a bug in the optimizer, and that there's no race condition in this code.
Apparently this bug also happens in GCC 4.8.1, but not in GCC 4.9.
(Thanks to Voo, nosid and Chris Dodd for all their testing).
It was suggested to mark a as volatile. This indeed prevents the bug, but only because it prevents the optimizer from performing the loop register optimization.
I found another solution: Add another local variable b, and if needed (and under lock) do the following:
Copy a into b
Increment b in the loop
Copy back to a if needed
The optimizer replaces the local variable with a register, so the code is still optimized, but the copies from and to a are done only if needed, and under lock.
Here's the new main() code, with arrows marking the changed lines.
int main(int argc, char* argv[])
{
    bool work = (argc > 1);
    int b = 0;                // <----
    if (work)
    {
        m.lock();
        b = a;                // <----
    }
    std::thread th(other);
    for (int i = 0; i < 100000000; ++i)
    {
        if (i % 7 == 3)
        {
            if (work)
            {
                ++b;          // <----
            }
        }
    }
    if (work)
    {
        a = b;                // <----
        std::cout << a << "\n";
        m.unlock();
    }
    th.join();
}
And this is what the assembly code looks like (&a == 0x000744b0, b replaced with edi):
21: int b = 0;
00071473 xor edi,edi
22:
23: if (work)
00071475 test bl,bl
00071477 je main+5Bh (07149Bh)
24: {
25: m.lock();
........
00071492 add esp,4
26: b = a;
00071495 mov edi,dword ptr ds:[744B0h]
27: }
28:
........
33: {
34: if (work)
00071504 test bl,bl
00071506 je main+0C9h (071509h)
35: {
36: ++b;
00071508 inc edi
30: for (int i = 0; i < 100000000; ++i)
00071509 inc esi
0007150A cmp esi,5F5E100h
00071510 jl main+0A0h (0714E0h)
37: }
38: }
39: }
40:
41: if (work)
00071512 test bl,bl
00071514 je main+10Ch (07154Ch)
42: {
43: a = b;
44: std::cout << a << "\n";
00071516 mov ecx,dword ptr ds:[73084h]
0007151C push edi
0007151D mov dword ptr ds:[744B0h],edi
00071523 call dword ptr ds:[73070h]
00071529 mov ecx,eax
0007152B call std::operator<<<std::char_traits<char> > (071A80h)
........
This keeps the optimization and solves (or works around) the problem.
For some days now I have been using MSVC 2013, and my application crashes when executing the following code (a sparse matrix multiplied by a vector; pseudo code: A = this * pVector):
complex<double> x = (A.getValue(lRow) + (mValues[lIdx] * pVectorB->getValue(lCol)));
Before, I used MSVC 2005 and the application ran fine.
The exception (First-chance exception at 0x000000014075D1D2 in psc64.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.) was thrown.
I track the assembly to:
addpd xmm6, xmmword ptr [rax+rbx*8]
It crashes only with optimization /O2 (maximize speed), but not with optimization disabled (/Od).
I can also avoid the crash by adding code (cout << "bla bla") to the method pVectorB->getValue(lCol).
I believed it could be a problem with uninitialized variables, but I could not find any, so I looked into the disassembly.
I checked XMM6 and ptr [rax+rbx*8]; they are the same in the non-crashing build (with the cout << "bla bla") and the crashing one.
Is there anything else I should look at besides XMM6 and the value at ptr [rax+rbx*8]?
I have been looking for the problem for quite some time, but could not find any hint that tracks it down to the line of code I have to correct.
Any help is highly appreciated. Thank you.
The code for getValue:
template <class T> class Vector
{
    const T& getValue(const int pIdx) const
    {
        if (false == checkBounds(pIdx)) {
            throw MathException(__FILE__, __LINE__, "T& Vector<class T>::getValue(const pIdx): checkBounds fails pIdx = %i", pIdx);
        }
        return mVal[pIdx];
    }

    bool checkBounds(const int pIdx) const
    {
        bool ret = true;
        if (pIdx >= mMaxSize) {
            DBG_SEVERE2("pIdx >= mMaxSize, pIdx = %i, mMaxSize = %i", pIdx, mMaxSize);
            ret = false;
        }
        if (pIdx < 0) {
            DBG_SEVERE1("pIdx < 0, pIdx = %i", pIdx);
            ret = false;
        }
        return ret;
    }
};
The allocation of mVal:
void* lTmp = calloc((4 * sizeof(complex<double>)) + 4, 1);
((char*)lTmp)[0] = 0xC;
((char*)lTmp)[1] = 0xC;
((char*)lTmp)[(4 * sizeof(complex<double>)) + 2] = 0xC;
((char*)lTmp)[(4 * sizeof(complex<double>)) + 3] = 0xC;
mVal = (void*)(((char*)lTmp) + 2);
SOLUTION:
As suggested, it works without the 2 guard bytes in front of and behind the desired array (mVal). It also works when the padding before and after the array is a multiple of 16 bytes.
When using memory operands with SSE instructions like addpd xmm6, xmmword ptr [rax+rbx*8], the memory operand must be 16-byte aligned. There are unaligned load instructions: you could movdqu from [rax+rbx*8] and then operate on the register. But if you use the memory form, the alignment matters. The optimization flags probably changed the alignment of your array, or the optimizer may have folded the load into a memory operand (which is faster in some cases) and caused the problem that way.