Why does 64-bit VC++ compiler add nop instruction after function calls? - c++

I've compiled the following using Visual Studio C++ 2008 SP1, x64 C++ compiler:
I'm curious: why did the compiler add those nop instructions after those calls?
PS1. I would understand if the 2nd and 3rd nops were there to align the code on a 4-byte boundary, but the 1st nop breaks that assumption.
PS2. The C++ code that was compiled had no loops or special optimization stuff in it:
CTestDlg::CTestDlg(CWnd* pParent /*=NULL*/)
    : CDialog(CTestDlg::IDD, pParent)
{
    m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);

    // This makes no sense. I used it to set a debugger breakpoint
    ::GdiFlush();

    srand(::GetTickCount());
}
PS3. Additional Info: First off, thank you everyone for your input.
Here's additional observations:
My first guess was that incremental linking could've had something to do with it. But, the Release build settings in the Visual Studio for the project have incremental linking off.
This seems to affect x64 builds only. The same code built as x86 (or Win32) does not have those nops, even though the instructions used are very similar:
I tried to build it with a newer linker, and even though the x64 code produced by VS 2013 looks somewhat different, it still adds those nops after some calls:
Also, dynamic vs. static linking to MFC made no difference to the presence of those nops. This one is built with dynamic linking to the MFC DLLs with VS 2013:
Also note that those nops can appear after near and far calls as well, and they have nothing to do with alignment. Here's a part of the code that I got from IDA if I step a little bit further on:
As you can see, the nop is inserted after a far call and happens to leave the next lea instruction at an address ending in B! That makes no sense if those nops were added for alignment only.
I was originally inclined to believe that, since near relative calls (the ones that start with E8) are somewhat faster than indirect calls through memory (the ones that start with FF 15 here),
the linker may try to use near calls first; and since those are one byte shorter, when it succeeds it may pad the leftover byte with a nop. But the example (5) above pretty much defeats this hypothesis.
So I still don't have a clear answer to this.

This is purely a guess, but it might be some kind of SEH optimization. I say optimization because SEH seems to work fine without the NOPs too; a NOP might just help speed up unwinding.
In the following example (live demo with VC2017), there is a NOP inserted after a call to basic_string::assign in test1 but not in test2 (identical but declared as non-throwing [1]).
#include <stdio.h>
#include <string>

int test1() {
    std::string s = "a"; // NOP inserted here
    s += getchar();
    return (int)s.length();
}

int test2() throw() {
    std::string s = "a";
    s += getchar();
    return (int)s.length();
}

int main()
{
    return test1() + test2();
}
Assembly:
test1:
. . .
call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign
npad 1 ; nop
call getchar
. . .
test2:
. . .
call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign
call getchar
Note that MSVS compiles by default with the /EHsc flag (synchronous exception handling). Without that flag the NOPs disappear, and with /EHa (synchronous and asynchronous exception handling), throw() no longer makes a difference because SEH is always on.
[1] For some reason only throw() seems to reduce the code size; using noexcept makes the generated code even bigger and summons even more NOPs. MSVC...

This is special filler that lets the exception-handling/unwinding machinery correctly determine whether it is looking at the prologue, the epilogue, or the body of the function.
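A hedged illustration of that idea (this sample is mine, not from the thread): on x64 the unwinder uses the return address, which points just past the call, to decide which function and which EH region a frame belongs to. If a call is the very last instruction of a try region or of the function body, a trailing nop keeps that return address inside the correct region.
#include <stdexcept>

void do_work() { throw std::runtime_error("example"); } // stand-in callee

int wrapper()
{
    try {
        do_work(); // if this call ended the try region exactly, the return
                   // address would point one past the region without padding
    }
    catch (...) {
        return -1;
    }
    return 0;
}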

This is due to the x64 calling convention, which requires the stack to be 16-byte aligned before any call instruction. To my knowledge this is not a hardware requirement but a software one. It provides a way to be sure that, when entering a function (that is, after a call instruction), the value of the stack pointer is always 8 modulo 16, which permits simple data alignment and aligned stores/loads on the stack.
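A small sketch of checking that last claim (assumptions: MSVC targeting x64, and the _AddressOfReturnAddress intrinsic from <intrin.h>, which returns the address of the slot holding the return address):
#include <cstdint>
#include <cstdio>
#include <intrin.h>

void check_stack_alignment()
{
    // On entry the return address sits at RSP, so if the caller honoured the
    // 16-byte rule before its call instruction, this value should print 8.
    std::uintptr_t slot = reinterpret_cast<std::uintptr_t>(_AddressOfReturnAddress());
    std::printf("return-address slot mod 16 = %u\n", static_cast<unsigned>(slot % 16));
}

int main()
{
    check_stack_alignment();
    return 0;
}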

Related

InitializeCriticalSection works in one project, but fails in another

Using Visual Studio 2019 Professional on Windows 10 x64. I have several C++ DLL projects, some of which are multi-threaded. I'm using CRITICAL_SECTION objects for thread safety.
In DLL1:
CRITICAL_SECTION critDLL1;
InitializeCriticalSection(&critDLL1);
In DLL2:
CRITICAL_SECTION critDLL2;
InitializeCriticalSection(&critDLL2);
When I use critDLL1 with EnterCriticalSection or LeaveCriticalSection, everything is fine in both _DEBUG and NDEBUG builds. But when I use critDLL2, I get an access violation in ntdll.dll in NDEBUG (though not in _DEBUG).
After popping up message boxes in NDEBUG mode, I was eventually able to track the problem down to the first use of EnterCriticalSection.
What might be causing the CRITICAL_SECTION to fail in one project but work in others? The MSDN page was not helpful.
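For reference, the intended usage pattern looks roughly like this (a sketch with hypothetical names, not the actual project code):
#include <windows.h>

static CRITICAL_SECTION g_crit;

void InitLocks()    { InitializeCriticalSection(&g_crit); } // call once, before any contention
void DestroyLocks() { DeleteCriticalSection(&g_crit); }     // call once, at shutdown

void ThreadSafeWork()
{
    EnterCriticalSection(&g_crit);
    // ... touch the shared state ...
    LeaveCriticalSection(&g_crit);
}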
UPDATE 1
After comparing the project settings of DLL1 (working) and DLL2 (not working), I accidentally got DLL2 working. I confirmed this by reverting to an earlier version (which crashes) and then making the project change (no crash!).
This is the setting:
Project Properties > C/C++ > Optimization > Whole Program Optimization
Set this to Yes (/GL) and my program crashes. Change that to No and it works fine. What does the /GL switch do and why might it cause this crash?
UPDATE 2
The excellent answer from @Acorn and the comment from @RaymondChen provided the clues to track down and then resolve the issue. There were two problems (both programmer errors).
PROBLEM 1
The assumption of Whole Program Optimization (WPO) is that the MSVC compiler is compiling "the whole program". That is an incorrect assumption for my DLL project, which internally consumes a 3rd-party library and is in turn consumed by an external application written in Delphi. The setting defaults to Yes (/GL) but should be No. This feels like a bug in Visual Studio, but in any case the programmer needs to be aware of it. I don't know all the details of what WPO is meant to do, but at least for DLLs meant to be consumed by other applications, the default should be changed.
PROBLEM 2
Serious programmer error. It was a call into a 3rd-party library, which wrote a 128-byte ASCII string into the buffer I passed; that was the error:
// Before
// m_config::acSerial defined as "char acSerial[21]"
(void) m_pLib->GetPara(XPARA_PRODUCT_INFO, &m_config.acSerial[0]);
EnterCriticalSection(&crit); // Crash!
// After
#define SERIAL_LEN 20
// m_config::acSerial defined as "char acSerial[SERIAL_LEN+1]"
//...
char acSerial[128];
(void) m_pLib->GetPara(XPARA_PRODUCT_INFO, &acSerial[0]);
strncpy(m_config.acSerial, acSerial, max(SERIAL_LEN, strlen(acSerial)));
EnterCriticalSection(&crit); // Works!
The error, now obvious, is that the 3rd party library did not copy the serial number of the device into the char* I provided...it copied 128 bytes into my char* stomping over everything contiguous in memory after acSerial. This wasn't noticed before because m_pLib->GetPara(XPARA_PRODUCT_INFO, ...) was one of the first calls into the 3rd party library and the rest of the contiguous data was mostly NULL at that point.
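For reference, a defensive copy helper that guarantees both truncation and null termination could look like this (hypothetical names; it assumes the library may write up to 128 bytes into a separate staging buffer):
#include <algorithm>
#include <cstring>

constexpr std::size_t SERIAL_LEN = 20;

void CopySerial(char (&dest)[SERIAL_LEN + 1], const char* staged)
{
    // copy at most SERIAL_LEN characters from the staging buffer, then always terminate
    const std::size_t n = std::min<std::size_t>(SERIAL_LEN, std::strlen(staged));
    std::memcpy(dest, staged, n);
    dest[n] = '\0';
}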
The problem never had anything to do with the CRITICAL_SECTION. My thanks to Acorn and RaymondChen; sanity has been restored to this corner of the universe.
If your program crashes under WPO (an optimization that assumes that whatever you are compiling is the entire program), it means that either the assumption is incorrect, or the optimizer ends up exploiting some undefined behavior that it previously didn't (without the optimization applied), even if the assumption is correct.
In general, avoid enabling optimizations unless you are really sure you know you meet their requirements.
For further analysis, please provide an MRE (minimal reproducible example).

Are functions in a C++/CLI native class compiled to MSIL or native x64 machine code?

This question is related to another question of mine, titled Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems. I didn't receive any comments or answers, but eventually I found out myself that the problem is caused by function thunks that the compiler inserts whenever a managed function calls an unmanaged one, and vice versa. I won't go into the details again, because today I want to focus on another consequence of this thunking mechanism.
To provide some context for the question, my problem was the replacement of a C++ function for 64-to-128-bit unsigned integer multiplication in an unmanaged C++/CLI class by a function in an MASM64 file for the sake of performance. The ASM replacement is as simple as can be:
AsmMul1 proc                        ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z
; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH
    mov     rax, rcx                ; rax = Factor1
    mul     rdx                     ; rdx:rax = Factor1 * Factor2
    mov     qword ptr [r8], rax     ; [r8] = ProductL
    mov     qword ptr [r9], rdx     ; [r9] = ProductH
    ret
AsmMul1 endp
I expected a big performance boost by replacing a compiled function with four 32-to-64-bit multiplications with a simple CPU MUL instruction. The big surprise was that the ASM version was about four times slower (!) than the C++ version. After a lot of research and testing, I found out that some function calls in C++/CLI involve thunking, which obviously is such a complex thing that it takes much more time than the thunked function itself.
After reading more about this thunking, it turned out that whenever you are using the compiler option /clr, the calling convention of all functions is silently changed to __clrcall, which means that they become managed functions. Exceptions are functions that use compiler intrinsics, inline ASM, and calls to other DLLs via dllimport - and as my tests revealed, this seems to include functions that call external ASM functions.
As long as all interacting functions use the __clrcall convention (i.e. are managed), no thunking is involved, and everything runs smoothly. As soon as the managed/unmanaged boundary is crossed in either direction, thunking kicks in, and performance is seriously degraded.
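As an aside that ties into the intrinsics exemption mentioned above, here is a hedged sketch of doing the 64-to-128-bit multiplication with MSVC's _umul128 intrinsic instead of an external MASM routine, so no call has to leave the translation unit:
#include <intrin.h>

// Sketch only: _umul128 is an MSVC x64 intrinsic that compiles to a single MUL,
// returning the low half and writing the high half through the last parameter.
void Mul64To128(unsigned __int64 factor1, unsigned __int64 factor2,
                unsigned __int64* productL, unsigned __int64* productH)
{
    *productL = _umul128(factor1, factor2, productH);
}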
Now, after this long prologue, let's get to the core of my question. As far as I understand the __clrcall convention, and the /clr compiler switch, marking a function in an unmanaged C++ class this way causes the compiler to emit MSIL code. I've found this sentence in the documentation of __clrcall:
When marking a function as __clrcall, you indicate the function implementation must be MSIL and that the native entry point function will not be generated.
Frankly, this is scaring me! After all, I'm going through the hassles of writing C++/CLI code in order to get real native code, i.e. super-fast x64 machine code. However, this doesn't seem to be the default for mixed assemblies. Please correct me if I'm getting it wrong: If I'm using the project defaults given by VC2017, my assembly contains MSIL, which will be JIT-compiled. True?
There is a #pragma managed that seems to inhibit the generation of MSIL in favor of native code on a per-function basis. I've tested it, and it works, but then the problem is that thunking gets in the way again as soon as the native code calls a managed function, and vice versa. In my C++/CLI project, I found no way to configure the thunking and code generation without getting a performance hit at some place.
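For concreteness, a minimal sketch of that per-function switch (my example, assuming the file is compiled with /clr):
// With /clr, everything in this file defaults to __clrcall (MSIL) unless we opt out.
#pragma managed(push, off)
// Native x64 code from here on; every call that crosses the managed/native
// boundary in either direction still goes through a thunk.
unsigned long long MulNative(unsigned long long a, unsigned long long b)
{
    return a * b;   // stand-in for the performance-critical work
}
#pragma managed(pop)

int ManagedCaller()
{
    return static_cast<int>(MulNative(3, 5));   // managed -> native: thunked
}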
So what I'm asking myself now is: what's the point of using C++/CLI in the first place? Does it give me any performance advantage when everything is still compiled to MSIL? Maybe it's better to write everything in pure C++ and use P/Invoke to call those functions? I don't know; I'm kind of stuck here.
Maybe someone can shed some light on this terribly poorly documented topic...

Stack runtime check failure with sqrt intrinsic in VS2012

While debugging some crash, I've come across some code which simplifies down to the following case:
#include <cmath>
#pragma intrinsic (sqrt)

class MyClass
{
public:
    MyClass() { m[0] = 0; }
    double& x() { return m[0]; }
private:
    double m[1];
};

void function()
{
    MyClass obj;
    obj.x() = -sqrt(2.0);
}

int main()
{
    function();
    return 0;
}
When built in Debug|Win32 with VS2012 (Pro Version 11.0.61030.00 Update 4, and Express for Windows Desktop Version 11.0.61030.00 Update 4), the code triggers run-time check errors at the end of the function execution, which show up as either (in a random fashion):
Run-Time Check Failure #2 - Stack around the variable 'obj' was corrupted.
or
A buffer overrun has occurred in Test.exe which has corrupted the program's internal state. Press Break to debug the program or Continue to terminate the program.
I understand that this usually means some sort of buffer overrun/underrun for objects on the stack. Perhaps I'm overlooking something, but I can't see anywhere in this C++ code where such a buffer overrun could occur. After playing around with various tweaks to the code and stepping through the generated assembly code of the function (see "details" section below), I'd be tempted to say it looks like a bug in Visual Studio 2012, but perhaps I'm just in too deep and missing something.
Are there intrinsic function usage requirements or other C++ standard requirements that this code does not meet, which could explain this behaviour?
If not, is disabling function intrinsics the only way to obtain correct run-time check behaviour (other than workarounds such as the 0-sqrt trick noted below, which could easily get lost)?
The details
Playing around the code, I've noted that the run-time check errors go away when I disable the sqrt intrinsic by commenting out the #pragma line.
Otherwise with the sqrt intrinsic pragma (or the /Oi compiler option) :
Using a setter such as obj.setx(double x) { m[0] = x; }, not surprisingly also generates the run-time check errors.
Replacing obj.x() = -sqrt(2.0) with obj.x() = +sqrt(2.0) or obj.x() = 0.0-sqrt(2.0) to my surprise does not generate the run-time check errors.
Similarly replacing obj.x() = -sqrt(2.0) with obj.x() = -1.4142135623730951; does not generate the run-time check error.
Replacing the member double m[1]; with double m; (along with the m[0] accesses) only seems to generate the "Run-Time Check Failure #2" error (even with obj.x() = -sqrt(2.0)), and sometimes runs fine.
Declaring obj as a static instance, or allocating it on the heap does not generate the run-time check errors.
Setting compiler warnings to level 4 does not produce any warnings.
Compiling the same code with VS2005 Pro or VS2010 Express does not generate the run-time check errors.
For what it's worth, I've noted the problem on a Windows 7 (with Intel Xeon CPU) and a Windows 8.1 machine (with Intel Core i7 CPU).
Then I went on to look at the generated assembly code. For the purpose of illustration, I will refer to "the failing version" as the one obtained from the code provided above, whereas I've generated a "working version" by simply commenting the #pragma intrinsic (sqrt) line. A side-by-side diff view of the resulting generated assembly code is shown below with the "failing version" on the left, and the "working version" on the right:
First, I noted that the _RTC_CheckStackVars call is responsible for the "Run-Time Check Failure #2" errors and checks in particular whether the 0xCCCCCCCC magic cookies are still intact around the obj object on the stack (which happens to start at an offset of -20 bytes relative to the original value of ESP). In the following screenshots, I've highlighted the object location in green and the magic cookie location in red. At the start of the function in the "working version" this is what it looks like:
then later right before the call to _RTC_CheckStackVars:
Now in the "failing version", the preamble include an additional (line 3415)
and esp,0FFFFFFF8h
which essentially aligns obj on an 8-byte boundary. Specifically, whenever the function is called with an initial value of ESP that ends in a 0 or 8 nibble, obj is stored starting at an offset of -24 bytes relative to the initial value of ESP.
The problem is that _RTC_CheckStackVars still looks for those 0xCCCCCCCC magic cookies at the same locations relative to the original ESP value as in the "working version" depicted above (i.e. offsets of -24 and -12 bytes). In this case, obj's first 4 bytes actually overlap one of the magic cookie locations. This is shown in the screenshots below at the start of the "failing version":
then later right before the call to _RTC_CheckStackVars:
We can note in passing that the actual data corresponding to obj.m[0] is identical between the "working version" and the "failing version" ("cd 3b 7f 66 9e a0 f6 bf", i.e. the expected value of -1.4142135623730951 when interpreted as a double).
Incidentally, the _RTC_CheckStackVars checks actually passes whenever the initial value of ESP ends with a 4 or C nibble (in which case obj starts at a -20 bytes offset, just like in the "working version").
After the _RTC_CheckStackVars checks complete (assuming they pass), there is an additional check that the restored value of ESP matches the original value. This check, when it fails, is responsible for the "A buffer overrun has occurred in ..." message.
In the "working version", the original ESP is copied to EBP early in the preamble (line 3415) and it's this value which is used to compute the checksum by xoring with a ___security_cookie (line 3425). In the "failing version", the checksum computation is based on ESP (line 3425) after ESP has been decremented by 12 while pushing some registers (lines 3417-3419), but the corresponding check with the restored ESP is done at the same point where those registers have been restored.
So, in short, and unless I've got this wrong, it looks like the "working version" follows standard textbook practice for stack handling, whereas the "failing version" messes up the run-time checks.
P.S.: "Debug build" refers to the standard set of compiler options of the "Debug" config from the "Win32 Console Application" new project template.
As pointed out by Hans in the comments, the issue can no longer be reproduced with Visual Studio 2013.
Similarly, the official answer on the Microsoft Connect bug report is:
we are unable to reproduce it with VS2013 Update 4 RTM. The product team itself is no longer directly accepting feedback for Microsoft Visual Studio 2012 and earlier products. You can get support for issues with Visual Studio 2012 and earlier by visiting one of the resources in the link below:
http://www.visualstudio.com/support/support-overview-vs
So, given that the problem is triggered only on VS2012 with function intrinsics (the /Oi compiler option), runtime checks (either the /RTCs or /RTC1 compiler option) and use of the unary minus operator, getting rid of any one (or more) of those conditions should work around the problem.
Thus, it seems the available options are:
Upgrade to the latest Visual Studio (if your project permits)
Disable runtime checks for the affected functions by surrounding them with #pragma runtime_check such as in the following sample:
#pragma runtime_check ("s", off)
void function()
{
MyClass obj;
obj.x() = -sqrt(2.0);
}
#pragma runtime_check ("s", restore)
Disable intrinsics by removing the #pragma intrinsic (sqrt) line and adding #pragma function (sqrt) instead (see MSDN for more info); a short sketch follows this list.
If intrinsics have been activated for all files through the "Enable Intrinsic Functions" project property (the /Oi compiler option), you would need to deactivate that project property. You can then enable intrinsics on a piece-by-piece basis for specific functions while checking that they are not affected by the bug (with #pragma intrinsic directives for each required intrinsic function).
Tweak the code using workarounds such as 0-sqrt(2.0), -1*sqrt(2.0) (which remove the unary minus operator) in an attempt to fool the compiler into using a different code generation path. Note that this is very likely to break with seemingly minor code changes.
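Here is the sketch mentioned in the disable-intrinsics option above (my example, not the original repro): with the intrinsic disabled for sqrt, the compiler emits a call into the CRT instead of expanding it inline, which avoids the affected code-generation path.
#include <cmath>
#pragma function (sqrt)   // generate a call to the CRT sqrt instead of the intrinsic

double negative_root(double v)
{
    return -sqrt(v);      // unary minus applied to the (no longer intrinsic) result
}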

C++ inline assembly (Intel compiler): LEA and MOV behaving differently in Windows and Linux

I am converting a huge Windows DLL to work on both Windows and Linux. The DLL has a lot of assembly (and SSE2 instructions) for video manipulation.
The code now compiles fine on both Windows and Linux using Intel compiler included in Intel ComposerXE-2011 on Windows and Intel ComposerXE-2013 SP1 on Linux.
The execution, however, crashes on Linux when trying to call a function pointer. I traced the code in gdb and indeed the function pointer doesn't point to the required function (whereas on Windows it does). Almost everything else works fine.
This is the sequence of code:
...
mov rdi, this
lea rdx, [rdi].m_sSomeStruct
...
lea rax, FUNCTION_NAME # if replaced by 'mov', works in Linux but crashes in Windows
mov [rdx].m_pfnFunction, rax
...
call [rdx].m_pfnFunction # crash in Linux
where:
1) 'this' has a struct member m_sSomeStruct.
2) m_sSomeStruct has a member m_pfnFunction, which is a pointer to a function.
3) FUNCTION_NAME is a free function in the same compilation unit.
4) All those pure assembly functions are declared as naked.
5) 64-bit environment.
What is confusing me the most is that if I replace the 'lea' instruction that is supposed to load the function's address into rax with a 'mov' instruction, it works fine on Linux but crashes on Windows. I traced the code in both Visual Studio and gdb and apparently in Windows 'lea' gives the correct function address, whereas in Linux 'mov' does.
I tried looking into the Intel assembly reference but didn't find much to help me there (unless I wasn't looking in the right place).
Any help is appreciated. Thanks!
Edit More details:
1) I tried using square brackets
lea rax, [FUNCTION_NAME]
but that didn't change the behaviour on either Windows or Linux.
2) I looked at the disassembly in gdb and in Windows; both seem to show the same instructions that I actually wrote. What's even worse is that I tried putting the lea and mov one after the other, and when I look at them in the disassembly in gdb, the address printed after each instruction following a # sign (which I'm assuming is the address that's going to be stored in the register) is actually the same, and is NOT the correct address of the function.
It looked like this in gdb disassembler
lea 0xOffset1(%rip), %rax # 0xSomeAddress
mov 0xOffset2(%rip), %rax # 0xSomeAddress
where both addresses were identical and the two offsets differed only by the size difference between the lea and mov instructions.
But somehow, when I check the contents of the registers after each instruction executes, mov seems to put in the correct value!
3) The member variable m_pfnFunction is of type LOAD_FUNCTION which is defined as
typedef void (*LOAD_FUNCTION)(const void*, void*);
4) The function FUNCTION_NAME is declared in the .h (within a namespace) as
void FUNCTION_NAME(const void* , void*);
and implemented in .cpp as
__declspec(naked) void namespace_name::FUNCTION_NAME(const void* , void*)
{
...
}
5) I tried turning off optimizations by adding
#pragma optimize("", off)
but I still have the same issue
Offhand, I suspect that the way linking to DLLs works in the latter case is that FUNCTION_NAME is a memory location that will be set to the loaded address of the function. That is, the symbol refers to a pointer to the function, not to the entry point itself.
I'm familiar with Windows (not the other platform), and I've seen how calling a function might either:
(1) generate a CALL to that address, which is filled in at link time. That is normal enough for functions in the same module, but if it is discovered at link time that the function lives in a different DLL, then the import library provides a stub that the linker treats the same as any normal function, but which is nothing more than JMP [????]. The table of addresses of imported functions is arranged to have bytes encoding a JMP instruction just before the field that will hold each address. The table is populated at DLL load time.
(2) If the compiler knows that the function will be in a different DLL, it can generate more efficient code: it emits an indirect CALL through the address located in the import table. The stub function shown in (1) has a symbol name associated with it, and the actual field containing the address has a symbol name too. They are both named for the function, but with different "decorations". In general, a program might contain fixup references to both.
So, I conjecture that the symbol name you used matches the stub function on one compiler, and (that it works in a similar way) matches the pointer on the other platform. Maybe the assembler assigns the unmangled name to one or the other depending on whether it is declared as imported, and the options are different on the two toolchains.
Hope that helps. I suppose you could look at run-time in a debugger and see if the above helps you interpret the address and the stuff around it.
After reading about the difference between mov and lea here: What's the purpose of the LEA instruction?, it looks to me like on Linux there is one additional level of indirection added for the function pointer. The mov instruction reads through that extra level of indirection, while on Windows, without the extra indirection, you would use lea.
Are you by any chance compiling with PIC on Linux? I could see that adding the extra indirection layer.
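If dropping into C++ for this one assignment is acceptable, a hedged sketch (member and function names mirror the question, but the code is only an illustration) is to let the compiler materialize the address, since it will emit either a RIP-relative lea or a load through the GOT/import slot as the platform requires:
typedef void (*LOAD_FUNCTION)(const void*, void*);

// Stand-in for the real free function from the question.
void FUNCTION_NAME(const void*, void*) { }

struct SomeStruct
{
    LOAD_FUNCTION m_pfnFunction;
};

void InstallPointer(SomeStruct& s)
{
    s.m_pfnFunction = &FUNCTION_NAME;   // compiler picks lea or a GOT/import load as needed
    s.m_pfnFunction(nullptr, nullptr);  // equivalent of "call [rdx].m_pfnFunction"
}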

Why would this access violation occur with the /Og and /GL flags, with pass-by-reference?

When (and only when) I compile my program with the /Og and /GL flag using the Windows Server 2003 DDK C++ compiler (it's fine on WDK 7.1 as well as Visual Studio 2010!), I get an access violation when I run this:
#include <algorithm>
#include <vector>

template<typename T> bool less(T a, T b) { return a < b; }

int main()
{
    std::vector<int> s;
    for (int i = 0; i < 13; i++)
        s.push_back(i);
    std::stable_sort(s.begin(), s.end(), &less<const int&>);
}
The access violation goes away when I change the last line to
std::stable_sort(s.begin(), s.end(), &less<int>);
-- in other words, it goes away when I let my item get copied instead of merely referenced.
(I have no multithreading of any sort going on whatsoever.)
Why would something like this happen? Am I invoking some undefined behavior through passing by const &?
Compiler flags:
/Og /GL /MD /EHsc
Linker flags: (none)
INCLUDE environment variable:
C:\WinDDK\3790.1830\inc\crt
LIB environment variable:
C:\WinDDK\3790.1830\lib\crt\I386;C:\WinDDK\3790.1830\lib\wxp\I386
Operating system: Windows 7 x64
Platform: 32-bit compilation gives error (64-bit runs correctly)
Edit:
I just tried it with the Windows XP DDK (that's C:\WinDDK\2600) and I got:
error LNK2001: unresolved external symbol
"bool __cdecl less(int const &,int const &)" (?less##YA_NABH0#Z)
but when I changed it from a template to a regular function, it magically worked with both compilers!
I'm suspecting this means that I've found a bug that happens while taking the address of a templated function, using the DDK compilers. Any ideas if this might be the case, or if it's a different corner case I don't know about?
I tried this with a Windows Server 2003 DDK SP1 installation (the non-SP1 DDK isn't readily available at the moment). This uses cl.exe version 13.10.4035 for 80x86. It appears to have the same problem you've found.
If you step through the code in a debugger (which is made a bit easier by following along with the .cod file generated using the /FAsc option) you'll find that the less<int const &>() function expects to be called with the pointers to the int values passed in eax and edx. However, the function that calls less<int const&>() (named _Insertion_sort_1<>()) calls it passing the pointers on the stack.
If you turn the templated less function into a non-templated function, it expects the parameters to be passed on the stack, so everyone is happy.
Of a bit more interest is what happens when you change less<const int&> to be less<int> instead. There's no crash, but nothing gets sorted either (of course, you'd need to change your program to start out with a non-sorted vector to actually see this effect). That's because when you change to less<int> the less function no longer dereferences any pointers - it expects the actual int values to be passed in registers (ecx and edx in this case). But no pointer dereference means no crash. However, the caller, _Insertion_sort_1, still passes the arguments on the stack, so the comparison being performed by less<int> has nothing to do with the values in the vector.
So that's what's happening, but I don't really know what the root cause is - as others have mentioned, it looks like a compiler bug related to the optimizations.
Since the bug has apparently been fixed, there's obviously no point in reporting it (the compiler in that version of the DDK corresponds to something close to VS 2003/VC 7.1).
By the way, I was unable to get your example to compile completely cleanly: to get it to build at all, I had to link bufferoverflowu.lib to satisfy the stack-checking stuff, and even then the linker complained about "multiple '.rdata' sections found with different attributes". I seem to remember that being a warning that was safe to ignore, but I really don't remember. I don't think either of these has anything to do with the bug, though.
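Given the observation in the question that switching from a template to a regular function made the problem disappear, a workaround sketch for projects stuck on that toolchain could be:
#include <algorithm>
#include <vector>

// Non-template comparison: the miscompilation was only observed when the
// address of a template specialization was passed to std::stable_sort.
bool less_int(const int& a, const int& b) { return a < b; }

int main()
{
    std::vector<int> s;
    for (int i = 12; i >= 0; i--)   // start unsorted so the sort has work to do
        s.push_back(i);
    std::stable_sort(s.begin(), s.end(), &less_int);
    return 0;
}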
If you don't get it on newer compilers, it's most likely a bug.
Do you have a small self-contained repro?