While debugging a crash, I came across some code that simplifies down to the following case:
#include <cmath>
#pragma intrinsic (sqrt)
class MyClass
{
public:
MyClass() { m[0] = 0; }
double& x() { return m[0]; }
private:
double m[1];
};
void function()
{
MyClass obj;
obj.x() = -sqrt(2.0);
}
int main()
{
function();
return 0;
}
When built in Debug|Win32 with VS2012 (Pro Version 11.0.61030.00 Update 4, and Express for Windows Desktop Version 11.0.61030.00 Update 4), the code triggers run-time check errors at the end of the function's execution, which randomly show up as either:
Run-Time Check Failure #2 - Stack around the variable 'obj' was corrupted.
or
A buffer overrun has occurred in Test.exe which has corrupted the program's internal state. Press Break to debug the program or Continue to terminate the program.
I understand that this usually means some sort of buffer overrun/underrun for objects on the stack. Perhaps I'm overlooking something, but I can't see anywhere in this C++ code where such a buffer overrun could occur. After playing around with various tweaks to the code and stepping through the generated assembly code of the function (see "details" section below), I'd be tempted to say it looks like a bug in Visual Studio 2012, but perhaps I'm just in too deep and missing something.
Are there intrinsic function usage requirements or other C++ standard requirements that this code does not meet, which could explain this behaviour?
If not, is disabling function intrinsics the only way to obtain correct run-time check behaviour (other than workarounds such as the 0-sqrt trick noted below, which could easily get lost)?
The details
Playing around with the code, I've noted that the run-time check errors go away when I disable the sqrt intrinsic by commenting out the #pragma line.
Otherwise, with the sqrt intrinsic pragma (or the /Oi compiler option):
Using a setter such as obj.setx(double x) { m[0] = x; } unsurprisingly also generates the run-time check errors.
Replacing obj.x() = -sqrt(2.0) with obj.x() = +sqrt(2.0) or obj.x() = 0.0-sqrt(2.0) to my surprise does not generate the run-time check errors.
Similarly replacing obj.x() = -sqrt(2.0) with obj.x() = -1.4142135623730951; does not generate the run-time check error.
Replacing the member double m[1]; with double m; (adjusting the m[0] accesses accordingly) only seems to generate the "Run-Time Check Failure #2" error (even with obj.x() = -sqrt(2.0)), and sometimes runs fine.
Declaring obj as a static instance, or allocating it on the heap does not generate the run-time check errors.
Setting compiler warnings to level 4 does not produce any warnings.
Compiling the same code with VS2005 Pro or VS2010 Express does not generate the run-time check errors.
For what it's worth, I've noted the problem on a Windows 7 (with Intel Xeon CPU) and a Windows 8.1 machine (with Intel Core i7 CPU).
Then I went on to look at the generated assembly code. For the purpose of illustration, I will refer to "the failing version" as the one obtained from the code provided above, whereas I've generated a "working version" by simply commenting the #pragma intrinsic (sqrt) line. A side-by-side diff view of the resulting generated assembly code is shown below with the "failing version" on the left, and the "working version" on the right:
First, I've noted that the _RTC_CheckStackVars call is responsible for the "Run-Time Check Failure #2" errors, and in particular checks whether the 0xCCCCCCCC magic cookies are still intact around the obj object on the stack (which happens to start at an offset of -20 bytes relative to the original value of ESP). In the following screenshots, I've highlighted the object location in green and the magic cookie locations in red. At the start of the function in the "working version", this is what it looks like:
then later right before the call to _RTC_CheckStackVars:
Now in the "failing version", the preamble include an additional (line 3415)
and esp,0FFFFFFF8h
which essentially aligns obj on an 8-byte boundary. Specifically, whenever the function is called with an initial value of ESP whose last nibble is 0 or 8, obj is stored starting at an offset of -24 bytes relative to the initial value of ESP.
The problem is that _RTC_CheckStackVars still looks for the 0xCCCCCCCC magic cookies at the same locations relative to the original ESP value as in the "working version" depicted above (i.e. offsets of -24 and -12 bytes). In this case, obj's first 4 bytes actually overlap one of the magic cookie locations. This is shown in the screenshots below at the start of the "failing version":
then later right before the call to _RTC_CheckStackVars:
We can note in passing that the actual data corresponding to obj.m[0] is identical between the "working version" and the "failing version" ("cd 3b 7f 66 9e a0 f6 bf", or the expected value of -1.4142135623730951 when interpreted as a double).
Incidentally, the _RTC_CheckStackVars check actually passes whenever the initial value of ESP ends with a 4 or C nibble (in which case obj starts at a -20 byte offset, just like in the "working version").
After the _RTC_CheckStackVars checks complete (assuming it passes), there is an additional check that the restored value of ESP corresponds to the original value. This check, when it fails, is responsible for the "A buffer overrun has occurred in ..." message.
In the "working version", the original ESP is copied to EBP early in the preamble (line 3415) and it's this value which is used to compute the checksum by xoring with a ___security_cookie (line 3425). In the "failing version", the checksum computation is based on ESP (line 3425) after ESP has been decremented by 12 while pushing some registers (lines 3417-3419), but the corresponding check with the restored ESP is done at the same point where those registers have been restored.
So, in short, and unless I've gotten this wrong, it looks like the "working version" follows the standard textbook approach to stack handling, whereas the "failing version" messes up the run-time checks.
P.S.: "Debug build" refers to the standard set of compiler options of the "Debug" config from the "Win32 Console Application" new project template.
As pointed out by Hans in the comments, the issue can no longer be reproduced with Visual Studio 2013.
Similarly, the official answer on the Microsoft Connect bug report is:
we are unable to reproduce it with VS2013 Update 4 RTM. The product team itself no longer directly accepting feedback for Microsoft Visual Studio 2012 and earlier products. You can get support for issues with Visual Studio 2012 and earlier by visiting one of the resources in the link below:
http://www.visualstudio.com/support/support-overview-vs
So, given that the problem is triggered only on VS2012 with function intrinsics (/Oi compiler option), run-time checks (/RTCs or /RTC1 compiler option) and usage of the unary minus operator, getting rid of any one (or more) of those conditions should work around the problem.
Thus, it seems the available options are:
Upgrade to the latest Visual Studio (if your project permits)
Disable runtime checks for the affected functions by surrounding them with #pragma runtime_check such as in the following sample:
#pragma runtime_check ("s", off)
void function()
{
MyClass obj;
obj.x() = -sqrt(2.0);
}
#pragma runtime_check ("s", restore)
Disable intrinsics by removing the #pragma intrinsic (sqrt) line, and adding #pragma function (sqrt) (see MSDN for more info).
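For example (a minimal sketch; the function name is illustrative):
#include <cmath>
#pragma function (sqrt)  // force calls to the CRT sqrt instead of the intrinsic
double negated_root(double v)
{
    return -sqrt(v);  // now a real call, avoiding the buggy intrinsic code path
}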
If intrinsics have been activated for all files through the "Enable Intrinsic Functions" project property (/Oi compiler option), you would need to deactivate that project property. You can then enable intrinsics on a case-by-case basis for specific functions, while checking that they are not affected by the bug (with a #pragma intrinsic directive for each required intrinsic function).
Tweak the code using workarounds such as 0-sqrt(2.0) or -1*sqrt(2.0) (which remove the unary minus operator) in an attempt to steer the compiler onto a different code generation path. Note that this is very likely to break with seemingly minor code changes.
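For instance, reusing MyClass from the sample above:
void function_workaround()
{
    MyClass obj;
    obj.x() = 0 - sqrt(2.0);   // binary minus instead of unary minus
    // or: obj.x() = -1 * sqrt(2.0);
}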
Related
Using Visual Studio 2019 Professional on Windows 10 x64. I have several C++ DLL projects, some of which are multi-threaded. I'm using CRITICAL_SECTION objects for thread safety.
In DLL1:
CRITICAL_SECTION critDLL1;
InitializeCriticalSection(&critDLL1);
In DLL2:
CRITICAL_SECTION critDLL2;
InitializeCriticalSection(&critDLL2);
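For reference, the usual usage pattern looks something like this minimal sketch (illustrative only, not the project's actual code):
#include <windows.h>
static CRITICAL_SECTION critDLL1;
BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID)
{
    if (reason == DLL_PROCESS_ATTACH)
        InitializeCriticalSection(&critDLL1);  // must happen before any Enter/Leave
    else if (reason == DLL_PROCESS_DETACH)
        DeleteCriticalSection(&critDLL1);
    return TRUE;
}
void do_work()
{
    EnterCriticalSection(&critDLL1);
    // ... code that must not run concurrently ...
    LeaveCriticalSection(&critDLL1);
}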
When I use critDLL1 with EnterCriticalSection or LeaveCriticalSection, everything is fine in both _DEBUG and NDEBUG mode. But when I use critDLL2, I get an access violation in 'ntdll.dll' in NDEBUG (though not in _DEBUG).
After popping up message boxes in NDEBUG mode, I was eventually able to track the problem down to the first use of EnterCriticalSection.
What might be causing the CRITICAL_SECTION to fail in one project but work in others? The MSDN page was not helpful.
UPDATE 1
After comparing the project settings of DLL1 (working) and DLL2 (not working), I accidentally got DLL2 working. I've confirmed this by reverting to an earlier version (which crashes) and then making the project changes (no crash!).
This is the setting:
Project Properties > C/C++ > Optimization > Whole Program Optimization
Set this to Yes (/GL) and my program crashes. Change that to No and it works fine. What does the /GL switch do and why might it cause this crash?
UPDATE 2
The excellent answer from @Acorn and comment from @RaymondChen provided the clues to track down and then resolve the issue. There were two problems (both programmer errors).
PROBLEM 1
The assumption of Whole Program Optimization (WPO) is that the MSVC compiler is compiling "the whole program". This is an incorrect assumption for my DLL project, which internally consumes a 3rd-party library and is in turn consumed by an external application written in Delphi. This setting is set to Yes (/GL) by default but should be No. This feels like a bug in Visual Studio, but in any case, the programmer needs to be aware of it. I don't know all the details of what WPO is meant to do, but at least for DLLs meant to be consumed by other applications, the default should be changed.
PROBLEM 2
Serious programmer error. It was a call into a 3rd-party library, which returned a 128-byte ASCII string; that was the error:
// Before
// m_config::acSerial defined as "char acSerial[21]"
(void) m_pLib->GetPara(XPARA_PRODUCT_INFO, &m_config.acSerial[0]);
EnterCriticalSection(&crit); // Crash!
// After
#define SERIAL_LEN 20
// m_config::acSerial defined as "char acSerial[SERIAL_LEN+1]"
//...
char acSerial[128]; // large enough for the library's 128-byte write
(void) m_pLib->GetPara(XPARA_PRODUCT_INFO, &acSerial[0]);
// Copy at most SERIAL_LEN characters (min, not max, so the 21-byte member
// cannot overflow again), then terminate explicitly since strncpy won't.
strncpy(m_config.acSerial, acSerial, min((size_t)SERIAL_LEN, strlen(acSerial)));
m_config.acSerial[SERIAL_LEN] = '\0';
EnterCriticalSection(&crit); // Works!
The error, now obvious, is that the 3rd-party library did not copy just the serial number of the device into the char* I provided... it copied 128 bytes into my char*, stomping over everything contiguous in memory after acSerial. This wasn't noticed before because m_pLib->GetPara(XPARA_PRODUCT_INFO, ...) was one of the first calls into the 3rd-party library, and the rest of the contiguous data was mostly NULL at that point.
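A minimal sketch of that failure mode (the adjacency here is hypothetical; the point is that the 128-byte write lands on whatever happens to follow the buffer):
#include <windows.h>
#include <cstring>
struct Config {
    char acSerial[21];      // 21-byte buffer handed to the library
    CRITICAL_SECTION crit;  // bystander data living right behind it
};
void demo(Config& cfg)
{
    // Stand-in for GetPara(XPARA_PRODUCT_INFO, ...): the library writes
    // 128 bytes into the 21-byte buffer, corrupting cfg.crit and anything
    // else contiguous with it. (Deliberately buggy, for illustration.)
    std::memset(cfg.acSerial, 'X', 128);
    EnterCriticalSection(&cfg.crit);  // then crashes here, far from the cause
}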
The problem was never to do with the CRITICAL_SECTION. My thanks to Acorn and RaymondChen... sanity has been restored to this corner of the universe.
If your program crashes under WPO (an optimization that assumes that whatever you are compiling is the entire program), it means either that the assumption is incorrect, or that the optimizer ends up exploiting some undefined behavior it previously didn't (even if the assumption is correct).
In general, avoid enabling optimizations unless you are really sure you know you meet their requirements.
For further analysis, please provide an MRE.
I've compiled the following using Visual Studio C++ 2008 SP1, x64 C++ compiler:
I'm curious, why did compiler add those nop instructions after those calls?
PS1. I would understand if the 2nd and 3rd nops were there to align the code on a 4-byte boundary, but the 1st nop breaks that assumption.
PS2. The C++ code that was compiled had no loops or special optimization stuff in it:
CTestDlg::CTestDlg(CWnd* pParent /*=NULL*/)
: CDialog(CTestDlg::IDD, pParent)
{
m_hIcon = AfxGetApp()->LoadIcon(IDR_MAINFRAME);
//This makes no sense. I used it to set a debugger breakpoint
::GdiFlush();
srand(::GetTickCount());
}
PS3. Additional Info: First off, thank you everyone for your input.
Here's additional observations:
My first guess was that incremental linking could have had something to do with it, but the Release build settings for the project in Visual Studio have incremental linking off.
This seems to affect x64 builds only. The same code built as x86 (or Win32) does not have those nops, even though instructions used are very similar:
I tried to build it with a newer linker, and even though the x64 code produced by VS 2013 looks somewhat different, it still adds those nops after some calls:
Also, dynamic vs. static linking to MFC made no difference to the presence of those nops. This one is built with dynamic linking to the MFC DLLs with VS 2013:
Also note that those nops can appear after near and far calls alike, and they have nothing to do with alignment. Here's a part of the code that I got from IDA if I step a little bit further on:
As you can see, the nop is inserted after a far call that happens to "align" the next lea instruction on an address ending in B! That makes no sense if those nops were added for alignment only.
I was originally inclined to believe that since near relative calls (i.e. those that start with E8) are somewhat faster than far calls (the ones that start with FF 15 in this case), the linker may try to go with near calls first; and since those are one byte shorter than far calls, when it succeeds it may pad the remaining space with nops at the end. But then example (5) above kinda defeats this hypothesis.
So I still don't have a clear answer to this.
This is purely a guess, but it might be some kind of SEH optimization. I say optimization because SEH seems to work fine without the NOPs too; the NOP might help speed up unwinding.
In the following example (live demo with VC2017), there is a NOP inserted after a call to basic_string::assign in test1 but not in test2 (identical but declared as non-throwing¹).
#include <stdio.h>
#include <string>
int test1() {
std::string s = "a"; // NOP insterted here
s += getchar();
return (int)s.length();
}
int test2() throw() {
std::string s = "a";
s += getchar();
return (int)s.length();
}
int main()
{
return test1() + test2();
}
Assembly:
test1:
. . .
call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign
npad 1 ; nop
call getchar
. . .
test2:
. . .
call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign
call getchar
Note that MSVS compiles by default with the /EHsc flag (synchronous exception handling). Without that flag the NOPs disappear, and with /EHa (synchronous and asynchronous exception handling), throw() no longer makes a difference because SEH is always on.
¹ For some reason only throw() seems to reduce the code size; using noexcept makes the generated code even bigger and summons even more NOPs. MSVC...
This is special filler that lets the exception-handling/unwinding machinery correctly detect whether it is in the prologue, epilogue, or body of the function.
This is due to the x64 calling convention, which requires the stack to be 16-byte aligned before any call instruction. This is not (to my knowledge) a hardware requirement but a software one. It provides a way to be sure that, when entering a function (that is, after a call instruction), the value of the stack pointer is always 8 modulo 16, thus permitting simple data alignment and stores/reads from aligned locations on the stack.
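One concrete beneficiary is 16-byte SIMD data on the stack; a sketch (assuming SSE intrinsics):
#include <xmmintrin.h>
void use_simd()
{
    // Because RSP has a known alignment at function entry, the compiler can
    // give this local a 16-byte-aligned slot and use aligned (movaps-style)
    // loads and stores for it.
    __m128 v = _mm_set_ps1(1.0f);
    v = _mm_add_ps(v, v);
    (void)v;
}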
Okay, I've stumbled upon the most obscure bug I have ever encountered. I would have commented on the almost identical question, but I do not have enough reputation. :(
What the bug does: my program tries to execute a memory area that is not executable when it returns from a function: "Access violation on executing address 0x00000000".
I tracked the bug down to Visual Studio 2012's xatomic.h header (pulled in via the C++11 standard header <atomic>), where x86 inline assembly overwrites the ebx register. Once that happens, the thread's stack is destroyed permanently.
I know quite precisely when this happens. The bug is triggered by the boost::lockfree::queue::empty() function, and only in a release build with optimizations on; the empty() function must be inlined by the compiler into its caller. The program works perfectly fine in debug mode, as the empty() function is not inlined there.
I get many compiler warnings about modifying the ebx register:
"include\boost-1_55\boost\atomic\detail\windows.hpp(1598): warning C4731: 'BuzyStack<JobPool>::push' : frame pointer register 'ebx' modified by inline assembly code"
"include\boost-1_55\boost\atomic\detail\windows.hpp(1598): warning C4731: 'BuzyStack<JobPool>::push' : frame pointer register 'ebx' modified by inline assembly code"
"y:\work\visualstudio\vc\include\xatomic.h(2133): warning C4731: 'ThreadSubSystem::join_pool' : frame pointer register 'ebx' modified by inline assembly code"
"y:\work\visualstudio\vc\include\xatomic.h(2137): warning C4731: 'ThreadSubSystem::join_pool' : frame pointer register 'ebx' modified by inline assembly code"
BuzyStack is my concurrent "thread pool stack" that manages threading pools. Items can be concurrently pushed to / popped from the BuzyStack.
I really do need the boost::lockfree::queue::empty() function, so how do I fix this?
What I have already done is quite radical. I modified the __asm {} parts of the Visual Studio 2012 (Update 4) xatomic.h header where the ebx register is overwritten: I force ebx to be preserved by saving it into a temporary variable at the beginning of the __asm block and restoring it at the end of the block. This works; the bug is gone, but I can still see a point in my program where the call stack is temporarily invalid. Also, the number of compiler warnings doubled when I made this change.
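A minimal sketch of the idea (32-bit MSVC inline assembly; the middle instruction stands in for the original xatomic.h code that clobbers ebx):
void preserve_ebx_around_asm()
{
    int saved_ebx;
    __asm {
        mov saved_ebx, ebx   // spill ebx to a temporary before the block
        mov ebx, 0           // stand-in for the code that overwrites ebx
        mov ebx, saved_ebx   // restore ebx so the caller's frame survives
    }
}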
(Update)
Sorry for being unclear with the question. In short: how do I fix this bug? It seems to be the MSVC compiler's fault.
I do not have any inline asm code in my own code whatsoever. All the warnings are generated by code in the boost 1.55 atomic and lockfree libraries, plus the MSVC 2012 xatomic.h header.
The standard header mod was only a temporary workaround; I no longer use the modded header, nor the empty() function. The bug still exists and destroys my stack today if I call the empty() function.
Our project uses a few boost 1.48 libraries on several platforms, including Windows, Mac, Android, and iOS.
We are able to consistently get the iOS version of the project to crash (nontrivially but reliably), and from our investigation we see that ~thread_data_base is being called on the thread's thread_info while its thread is still running. This seems to happen as a result of the smart pointer reaching a zero count, even though it is obviously still in scope in the thread_proxy function which creates it and runs the requested function in the thread.
This seems to happen in various cases; the call stack is not identical between crashes, though there are a few variations which are common.
Just to be clear: this often requires running code which is creating hundreds of threads, though there are never more than about 30 running simultaneously. I have "been lucky" and got it very early in the run also, but that's rare.
I created a version of the destructor which actually catches the code red-handed:
in libs/thread/src/pthread/thread.cpp:
thread_data_base::~thread_data_base()
{
boost::detail::thread_data_base* const thread_info=detail::get_current_thread_data();
void *void_thread_info = (void *) thread_info;
void *void_this = (void *) this;
// is somebody destructing the thread_data other than its own thread?
// (remember that its own which should no longer point to it anyway,
// because of the call to detail::set_current_thread_data(0) in thread_proxy)
if (void_thread_info) { // == void_this) {
__builtin_trap();
}
}
I should note that (as seen from the commented-out code) I had previously checked that void_thread_info == void_this, because I was only checking for the case where the thread's current thread_info was killing itself. I have also seen cases where the value returned by get_current_thread_data is non-zero and different from this, which is really weird.
Also, when I first wrote that version of the code, I wrote:
if (((void*)thread_info) == ((void*)this))
and at run-time I got some very weird exception that said something about a virtual function table or something like that; I don't remember exactly. I decided that it was trying to call == for this object type and was unhappy with that, so I rewrote it as above, putting the conversions to void * on separate lines of code. That in itself is quite suspicious to me. I am not one to rush to blame compilers, but...
I should also note that when we did catch this happening in the trap, we saw the destructor for ~shared_count appear twice consecutively on the stack in the Xcode source view. Doubly weird. We tried to look at the disassembly, but couldn't make much out of it.
Again, it looks like this is always the result of the shared_count (which seems to be owned by the shared_ptr which owns the thread_info) reaching zero too early.
Update: it seems that it is possible to get into situations which reach the above trap without the situation doing any harm. Since fixing the issue (see answer) I have seen it happen, but always after thread_info->run() has finished executing. I don't yet understand how... but it's working.
Some additional info:
I should note that the boost.sh from Pete Goodliffe (modified by others) that is commonly used to compile boost for iOS has the following note in the header:
: ${EXTRA_CPPFLAGS:="-DBOOST_AC_USE_PTHREADS -DBOOST_SP_USE_PTHREADS"}
# The EXTRA_CPPFLAGS definition works around a thread race issue in
# shared_ptr. I encountered this historically and have not verified that
# the fix is no longer required. Without using the posix thread primitives
# an invalid compare-and-swap ARM instruction (non-thread-safe) was used for the
# shared_ptr use count causing nasty and subtle bugs.
#
# Should perhaps also consider/use instead: -BOOST_SP_USE_PTHREADS
I use those flags, but to no avail.
I found the following, which is very tantalizing; it looks like they had the same issue in std::thread:
http://llvm.org/bugs/show_bug.cgi?format=multiple&id=12730
That suggested using an alternate implementation inside boost for ARM processors, which also seems to directly address this issue:
spinlock_gcc_arm.hpp
The version included with boost 1.48 uses outdated ARM assembly. I took the updated version from boost 1.52, but I'm having trouble compiling it.
I get the following error:
predicated instructions must be in IT block
I found a reference to what looks to be a similar use of this instruction here:
https://zeromq.jira.com/browse/LIBZMQ-414
I was able to use the same idea to get the 1.52 code to compile by modifying the code as follows (I inserted an appropriate IT instruction):
__asm__ __volatile__(
"ldrex %0, [%2]; \n"        // load-exclusive the current lock word
"cmp %0, %1; \n"            // is it already locked (equal to 1)?
"it ne; \n"                 // Thumb-2: predicate the next instruction
"strexne %0, %1, [%2]; \n"  // store-exclusive 1 only if it was free
BOOST_SP_ARM_BARRIER :
"=&r"( r ):                 // outputs
"r"( 1 ), "r"( &v_ ):       // inputs
"memory", "cc" );
In any case, there are #ifdefs in this file which look for the ARM architecture, which is not defined that way in my environment. After I simply edited the file so that only ARM 7 code was left, the compiler complained about the definition of BOOST_SP_ARM_BARRIER:
In file included from ./boost/smart_ptr/detail/spinlock.hpp:35:
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:39:13: error: instruction requires a CPU feature not currently enabled
BOOST_SP_ARM_BARRIER :
^
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:13:32: note: expanded from macro 'BOOST_SP_ARM_BARRIER'
# define BOOST_SP_ARM_BARRIER "dmb"
Any ideas?
Figured this out. It turns out that the boost.sh script that I mention in the question chose the incorrect boost flags to address this problem: instead of BOOST_SP_USE_PTHREADS (and the other flag there with it, BOOST_AC_USE_PTHREADS), what is needed on iOS is BOOST_SP_USE_SPINLOCK. This ends up giving pretty much the identical solution used in the std::thread issue referred to in the question.
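In the boost.sh script quoted in the question, that amounts to changing the EXTRA_CPPFLAGS definition along these lines (a sketch based on the line quoted earlier):
: ${EXTRA_CPPFLAGS:="-DBOOST_SP_USE_SPINLOCK"}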
If you are compiling for any modern iOS device which uses ARM 7, but using an older boost (we are using 1.48), you need to copy the file spinlock_gcc_arm.hpp from a more recent boost (like 1.52). That file is #ifdef'd for the different ARM architectures, but it is not clear to me that the defines it is looking for are defined in the iOS compile environment using the script. So you can either edit the file (violent but effective) or invest some time to figure out how to make this tidy and correct.
In any case, you may need to insert the extra assembly instruction that I did above in the question:
"it ne; \n"
I have not yet gone back to see if I can delete that instruction now that I have my compile environment working properly.
However, we're not done yet. The code used in boost for this option includes, as discussed, ARM assembly language instructions. ARM chips support two instruction sets which can't be mixed in a given module (not sure of the exact scope, but evidently file-by-file is an acceptable granularity when compiling). The instructions used in boost for this locking include non-Thumb instructions, but iOS by default uses the Thumb instruction set. The boost code, aware of the instruction set issue, checks that you have ARM enabled but not Thumb; however, by default in iOS, Thumb is on.
Getting the compiler to generate non-Thumb ARM code depends on which compiler you are using for iOS: Apple's LLVM or LLVM GCC. GCC is deprecated, and Apple's LLVM is the default when you use Xcode.
For the default Clang + Apple LLVM 4.1, you need to compile using the -mno-thumb flag. Any files in your iOS app which use any part of boost involving smart pointers will also have to be compiled with -mno-thumb.
To compile boost like this, I think you can just add -mno-thumb to the EXTRA_CPPFLAGS in the script. (I modified the user-config.jam directly while experimenting and haven't yet gone back to clean up.)
For your app, in Xcode you need to select your target, go into the Build Phases tab, and there select Compile Sources. There you have the option of adding compile flags, so for each relevant file (which includes boost), add the -mno-thumb flag. You can also do this directly in project.pbxproj, where each file has
settings = { COMPILER_FLAGS = ""; };
you just change this to
settings = { COMPILER_FLAGS = "-mno-thumb"; };
But there's a little more. You also have to modify the darwin.jam file in the tools/build/v2/tools directory. In boost 1.48, there is code that says:
case arm :
{
options = -arch armv6;
}
This has to be modified to
case arm :
{
options = -arch armv7 ;
}
Finally, in the boost.sh script, in the function writeBjamUserConfig(), you should remove the references to -arch armv6.
If somebody knows how to do this a little more generally and cleanly, I'm sure we'd all benefit. For now, this is where I've gotten to, and I hope that this will help other iOS boost threads users. I hope that the various variants of the boost.sh iOS script out there will be updated. I plan to add some more links to this answer later.
Update: for a great article which describes the issue at the processor level, see here:
http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu
Enjoy!
I use boost.asio, boost.thread, boost.smart_ptr, etc. on the iOS platform. The app always crashes when run in release mode, raising SIGABRT. The crash call stack is:
__stack_chk_fail
boost::asio::detail::completion_handler
boost::asio::detail::task_io_service_operation::complete
boost::asio::detail::task_io_service::do_run_one
boost::asio::detail::task_io_service::run
boost::asio::io_service::run
[Screenshot: creating an asio work object with a new thread and io_service][1]
When trying to solve the problem, I found the following articles:
[boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-build][2]
[The issue of spin_lock and thumb on iOS][3]
Then I tried adding -mno-thumb to my project's compile flags, and the problem that occurred in release mode was gone.
However, a new bug appeared: EXC_ARM_DA_ALIGN, which crashed where I try to convert network data to host endianness.
As [this article][4] says, ARM instructions require that memory accesses be aligned.
Following the article [Exc_arm_da_align][5], I fixed it by using memcpy for the data conversion, instead of converting directly through the pointer.
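A minimal sketch of that kind of fix (the names are illustrative, not from the original code):
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>  // ntohl
// Read a 32-bit big-endian value from a possibly unaligned buffer.
// memcpy into an aligned local avoids the unaligned load that raises
// EXC_ARM_DA_ALIGN; ntohl then converts network order to host order.
uint32_t read_u32(const unsigned char* p)
{
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return ntohl(v);
}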
[1]: http://i.stack.imgur.com/3ijF4.png
[2]: http://stackoverflow.com/questions/4201262/boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-builds/4245821#4245821
[3]: http://groups.google.com/group/boost-list/browse_thread/thread/7dc1e80659182ab3
[4]: https://brewx.qualcomm.com/bws/content/gi/common/appseng/en/knowledgebase/docs/kb95.html
[5]: http://www.cnblogs.com/unionfind/archive/2013/02/25/2932262.html
When (and only when) I compile my program with the /Og and /GL flags using the Windows Server 2003 DDK C++ compiler (it's fine on WDK 7.1 as well as Visual Studio 2010!), I get an access violation when I run this:
#include <algorithm>
#include <vector>
template<typename T> bool less(T a, T b) { return a < b; }
int main()
{
std::vector<int> s;
for (int i = 0; i < 13; i++)
s.push_back(i);
std::stable_sort(s.begin(), s.end(), &less<const int&>);
}
The access violation goes away when I change the last line to
std::stable_sort(s.begin(), s.end(), &less<int>);
-- in other words, it goes away when I let my item get copied instead of merely referenced.
(I have no multithreading of any sort going on whatsoever.)
Why would something like this happen? Am I invoking some undefined behavior through passing by const &?
Compiler flags:
/Og /GL /MD /EHsc
Linker flags: (none)
INCLUDE environment variable:
C:\WinDDK\3790.1830\inc\crt
LIB environment variable:
C:\WinDDK\3790.1830\lib\crt\I386;C:\WinDDK\3790.1830\lib\wxp\I386
Operating system: Windows 7 x64
Platform: 32-bit compilation gives error (64-bit runs correctly)
Edit:
I just tried it with the Windows XP DDK (that's C:\WinDDK\2600) and I got:
error LNK2001: unresolved external symbol
"bool __cdecl less(int const &,int const &)" (?less##YA_NABH0#Z)
but when I changed it from a template to a regular function, it magically worked with both compilers!
I suspect this means that I've found a bug that occurs when taking the address of a templated function, using the DDK compilers. Any ideas whether this might be the case, or whether it's a different corner case I don't know about?
I tried this with a Windows Server 2003 DDK SP1 installation (the non-SP1 DDK isn't readily available at the moment). This uses cl.exe version 13.10.4035 for 80x86. It appears to have the same problem you've found.
If you step through the code in a debugger (which is made a bit easier by following along with the .cod file generated using the /FAsc option) you'll find that the less<int const &>() function expects to be called with the pointers to the int values passed in eax and edx. However, the function that calls less<int const&>() (named _Insertion_sort_1<>()) calls it passing the pointers on the stack.
If you turn the templated less function into a non-templated function, it expects the parameters to be passed on the stack, so everyone is happy.
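In code, that non-templated variant would look something like this (a sketch, not from the original post; less_int is an illustrative name):
#include <algorithm>
#include <vector>
template<typename T> bool less(T a, T b) { return a < b; }
// Plain, non-template comparator: taking &less_int avoids taking the
// address of a template specialization, which is what triggered the bug.
bool less_int(const int& a, const int& b) { return less<const int&>(a, b); }
int main()
{
    std::vector<int> s;
    for (int i = 13; i > 0; i--)
        s.push_back(i);
    std::stable_sort(s.begin(), s.end(), &less_int);
}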
Of a bit more interest is what happens when you change less<const int&> to be less<int> instead. There's no crash, but nothing gets sorted either (of course, you'd need to change your program to start out with a non-sorted vector to actually see this effect). That's because when you change to less<int> the less function no longer dereferences any pointers - it expects the actual int values to be passed in registers (ecx and edx in this case). But no pointer dereference means no crash. However, the caller, _Insertion_sort_1, still passes the arguments on the stack, so the comparison being performed by less<int> has nothing to do with the values in the vector.
So that's what's happening, but I don't really know what the root cause is - as others have mentioned, it looks like a compiler bug related to the optimizations.
Since the bug has apparently been fixed, there's obviously no point in reporting it (the compiler in that version of the DDK corresponds to something close to VS 2003/VC 7.1).
By the way, I was unable to get your example to compile completely cleanly. To get it to build at all, I had to link bufferoverflowu.lib to get the stack-checking stuff to link, and even then the linker complained about "multiple '.rdata' sections found with different attributes". I seem to remember that being a warning that was safe to ignore, but I really don't remember. I don't think either of these has anything to do with the bug, though.
If you don't get it on newer compilers, it's most likely a bug.
Do you have a small self-contained repro?