Result of TLS variable access not cached - c++

Edit: It seems this is a compiler bug indeed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803
I am writing a wrapper for writing logs that uses TLS to store a std::stringstream buffer. This code will be used by shared libraries. Looking at the code on godbolt.org, it seems that neither gcc nor clang will cache the result of a TLS lookup (the loop repeatedly calls '__tls_get_addr()'), even though I believe I have designed my class in a way that should let them.
#include <sstream>

class LogStream
{
public:
    LogStream()
        : m_buffer(getBuffer())
    {
    }

    LogStream(std::stringstream& buffer)
        : m_buffer(buffer)
    {
    }

    static std::stringstream& getBuffer()
    {
        thread_local std::stringstream buffer;
        return buffer;
    }

    template <typename T>
    inline LogStream& operator<<(const T& t)
    {
        m_buffer << t;
        return *this;
    }

private:
    std::stringstream& m_buffer;
};

int main()
{
    LogStream log{};
    for (int i = 0; i < 12345678; ++i)
    {
        log << i;
    }
}
Looking at the assembly, both gcc and clang generate pretty similar output:
clang 5.0.0:
    xor ebx, ebx
.LBB0_3: # =>This Inner Loop Header: Depth=1
    data16
    lea rdi, [rip + LogStream::getBuffer[abi:cxx11]()::buffer[abi:cxx11]@TLSGD]
    data16
    data16
    rex64
    call __tls_get_addr@PLT // Called on every loop iteration.
    lea rdi, [rax + 16]
    mov esi, ebx
    call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
    inc ebx
    cmp ebx, 12345678
    jne .LBB0_3
gcc 7.2:
    xor ebx, ebx
.L3:
    lea rdi, guard variable for LogStream::getBuffer[abi:cxx11]()::buffer@tlsld[rip]
    call __tls_get_addr@PLT // Called on every loop iteration.
    mov esi, ebx
    add ebx, 1
    lea rdi, LogStream::getBuffer[abi:cxx11]()::buffer@dtpoff[rax+16]
    call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
    cmp ebx, 12345678
    jne .L3
How can I convince both compilers that the lookup doesn't need to be repeatedly done?
Compiler options: -std=c++11 -O3 -fPIC
Godbolt link

This really looks like an optimization bug in both Clang and GCC.
Here's what I think happens. (I might be completely off.) The compiler completely inlines everything down to this code:
int main()
{
    // pseudo-access
    std::stringstream& m_buffer = LogStream::getBuffer::buffer;
    for (int i = 0; i < 12345678; ++i)
    {
        m_buffer << i;
    }
}
And then, not realizing that access to a thread-local is very expensive under -fPIC, it decides that the temporary reference to the global is not necessary and inlines that as well:
int main()
{
    for (int i = 0; i < 12345678; ++i)
    {
        // pseudo-access
        LogStream::getBuffer::buffer << i;
    }
}
Whatever actually happens, this is clearly a pessimization of the code you wrote. You should report this as a bug to GCC and Clang.
GCC bugtracker: https://gcc.gnu.org/bugzilla/
Clang bugtracker: https://bugs.llvm.org/
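Until the bug is fixed, a possible workaround (my own sketch, not verified against the affected compiler versions) is to do the TLS lookup once yourself and then launder the pointer through an empty asm statement, so the optimizer can no longer trace it back to the thread-local and rematerialize the __tls_get_addr call inside the loop:

int main()
{
    // Workaround sketch (untested): one explicit TLS lookup, reusing
    // the LogStream class from the question.
    std::stringstream* p = &LogStream::getBuffer();

    // GCC/Clang extension: the empty asm makes p opaque to the optimizer,
    // pinning the already-computed address in a register.
    asm volatile("" : "+r"(p));

    LogStream log{*p}; // uses the reference-taking constructor
    for (int i = 0; i < 12345678; ++i)
    {
        log << i;
    }
}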

Related

DWCAS-alternative with no help of the kernel

I just wanted to test whether my compiler recognizes ...
atomic<pair<uintptr_t, uintptr_t>>
... and uses DWCASes on it (like x86-64 lock cmpxchg16b) or whether it supplements the pair with a conventional lock.
So I first wrote a minimal program with a single noinline function which does a compare-and-swap on an atomic pair. The compiler generated a lot of code for this which I didn't understand, and I didn't see any LOCK-prefixed instructions in it. I was curious whether the implementation places a lock within the atomic itself and printed the sizeof of the above atomic pair: 24 on a 64-bit platform, so apparently without a lock inside.
Finally I wrote a program in which all the threads my system has (Ryzen Threadripper 64 core, Win10, SMT off) increment both halves of a single atomic pair a predefined number of times. Then I calculated the time for each increment in nanoseconds. The time is rather high, about 20,000 ns for each successful increment, so at first it looked as if there was a lock I had overlooked; but that couldn't be true with a sizeof of 24 bytes for this atomic. And when I looked at the Process Viewer I saw that all 64 cores were at nearly 100% user CPU time the whole time, so there couldn't be any kernel intervention.
So is there anyone here smarter than me who can identify what this DWCAS substitute does from the assembly dump?
Here's my test-code:
#include <iostream>
#include <atomic>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <vector>

using namespace std;
using namespace chrono;

struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

int main()
{
    cout << "sizeof(atomic<pair<uintptr_t, uintptr_t>>): " << sizeof(atomic_pair) << endl;
    atomic_pair ap( uip_pair( 0, 0 ) );
    cout << "atomic<pair<uintptr_t, uintptr_t>>::is_lock_free: " << ap.is_lock_free() << endl;
    mutex mtx;
    unsigned nThreads = thread::hardware_concurrency();
    unsigned ready = nThreads;
    condition_variable cvReady;
    bool run = false;
    condition_variable cvRun;
    atomic_int64_t sumDur = 0;
    auto theThread = [&]( size_t n )
    {
        unique_lock<mutex> lock( mtx );
        if( !--ready )
            cvReady.notify_one();
        cvRun.wait( lock, [&]() -> bool { return run; } );
        lock.unlock();
        auto start = high_resolution_clock::now();
        uip_pair cmp = ap.load( memory_order_relaxed );
        for( ; n--; )
            while( !ap.compare_exchange_weak( cmp, uip_pair( cmp.first + 1, cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
        sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        lock.lock();
    };
    vector<jthread> threads;
    threads.reserve( nThreads );
    static size_t const ROUNDS = 100'000;
    for( unsigned t = nThreads; t--; )
        threads.emplace_back( theThread, ROUNDS );
    unique_lock<mutex> lock( mtx );
    cvReady.wait( lock, [&]() -> bool { return !ready; } );
    run = true;
    cvRun.notify_all();
    lock.unlock();
    for( jthread &thr : threads )
        thr.join();
    cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
    uip_pair p = ap.load( memory_order_relaxed );
    cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}
[EDIT]: I've extracted the compare_exchange_weak call into a noinline function and disassembled the code:
struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

#if defined(_MSC_VER)
    #define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
    #define NOINLINE __attribute__((noinline))
#endif

NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
    return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
}
    mov eax, 1
    mov r10, rcx
    mov r9d, eax
    xchg DWORD PTR [rcx], eax
    test eax, eax
    je SHORT label8
label1:
    mov eax, DWORD PTR [rcx]
    test eax, eax
    je SHORT label7
label2:
    mov eax, r9d
    test r9d, r9d
    je SHORT label5
label4:
    pause
    sub eax, 1
    jne SHORT label4
    cmp r9d, 64
    jl SHORT label5
    lea r9d, QWORD PTR [rax+64]
    jmp SHORT label6
label5:
    add r9d, r9d
label6:
    mov eax, DWORD PTR [rcx]
    test eax, eax
    jne SHORT label2
label7:
    mov eax, 1
    xchg DWORD PTR [rcx], eax
    test eax, eax
    jne SHORT label1
label8:
    mov rax, QWORD PTR [rcx+8]
    sub rax, QWORD PTR [rdx]
    jne SHORT label9
    mov rax, QWORD PTR [rcx+16]
    sub rax, QWORD PTR [rdx+8]
label9:
    test rax, rax
    sete al
    test al, al
    je SHORT label10
    movups xmm0, XMMWORD PTR [r8]
    movups XMMWORD PTR [rcx+8], xmm0
    xor ecx, ecx
    xchg DWORD PTR [r10], ecx
    ret
label10:
    movups xmm0, XMMWORD PTR [rcx+8]
    xor ecx, ecx
    movups XMMWORD PTR [rdx], xmm0
    xchg DWORD PTR [r10], ecx
    ret
Maybe someone understands the disassembly. Remember that XCHG is implicitly LOCK'ed on x86. It seems to me that MSVC uses some kind of software transactional memory here. I can extend the structure embedded in the atomic arbitrarily, and the size difference is still 8 bytes; so MSVC always uses some kind of STM.
As Nate pointed out in comments, it is a spinlock.
You can look up the source; it ships with the compiler and is also available on GitHub.
If you build an unoptimized debug configuration, you can step into this source during interactive debugging!
There's a member variable called _Spinlock.
And here's the locking function:
#if 1 // TRANSITION, ABI, GH-1151
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
    while (_InterlockedExchange(&_Spinlock, 1) != 0) { // TRANSITION, GH-1133: _InterlockedExchange_acq
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            __yield();
        }
    }
#else // ^^^ defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) ^^^
#error Unsupported hardware
#endif
}
(Disclosure: I brought this increasing backoff from the Intel manual into there; it was just an xchg loop before. See the linked issue and PR.)
Spinlock use is known to be suboptimal; a mutex that does a kernel wait should have been used instead. The problem with a spinlock is that, in the rare case where a context switch happens while a low-priority thread holds the spinlock, it will take a while for that spinlock to be released, as the scheduler is not aware of the high-priority thread waiting on it.
Sure, not using cmpxchg16b is also suboptimal. Still, for bigger atomics a non-lock-free mechanism has to be used. (There was no deliberate decision to avoid cmpxchg16b; it is just a consequence of ABI compatibility going back to Visual Studio 2015.)
There's an issue about making this better, which will hopefully be addressed with the next ABI break: https://github.com/microsoft/STL/issues/1151
As for transactional memory: it might make sense to use hardware transactional memory there. I can speculate that Intel RTM could possibly be implemented there with intrinsics, or there may be some future OS API for it (like an enhanced SRWLOCK), but it is likely that nobody will want more complexity there, as a non-lock-free atomic is a compatibility facility, not something you would deliberately want to use.
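If you really need a lock-free DWCAS on x64 MSVC, one option is to bypass std::atomic and use the cmpxchg16b intrinsic directly. A minimal sketch of my own (dwcas is a hypothetical helper name; it requires the target to be 16-byte aligned):

#include <intrin.h>
#include <cstdint>

struct alignas(16) uip_pair
{
    std::uintptr_t first, second;
};

// Sketch: tries to replace *target with xchg if *target equals *expected;
// on failure, *expected receives the current value. Returns true on success.
// _InterlockedCompareExchange128 compiles to lock cmpxchg16b on x64.
inline bool dwcas(uip_pair volatile* target, uip_pair* expected, uip_pair xchg)
{
    return _InterlockedCompareExchange128(
               reinterpret_cast<long long volatile*>(target),
               static_cast<long long>(xchg.second), // high 8 bytes
               static_cast<long long>(xchg.first),  // low 8 bytes
               reinterpret_cast<long long*>(expected)) != 0;
}

On GCC/Clang, the 16-byte __atomic builtins (with -mcx16) play a similar role.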

Avoiding the overheads of std::function

I want to run a set of operations over elements in a (custom) singly-linked list. The code to traverse the linked list and run the operations is simple, but repetitive and could be done wrong if copy/pasted everywhere. Performance and careful memory allocation are important in my program, so I want to avoid unnecessary overheads.
I want to write a wrapper to contain the repetitive code and encapsulate the operations which are to take place on each element of the linked list. As the functions which take place within the operation vary, I need to capture multiple variables (in the real code) which must be provided to the operation, so I looked at using std::function. The actual calculations done in this example code are meaningless.
#include <functional>
#include <iostream>
#include <memory>

struct Foo
{
    explicit Foo(int num) : variable(num) {}
    int variable;
    std::unique_ptr<Foo> next;
};

void doStuff(Foo& foo, std::function<void(Foo&)> operation)
{
    Foo* fooPtr = &foo;
    do
    {
        operation(*fooPtr);
    } while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}

int main(int argc, char** argv)
{
    int val = 7;
    Foo first(4);
    first.next = std::make_unique<Foo>(5);
    first.next->next = std::make_unique<Foo>(6);
#ifdef USE_FUNC
    for (long i = 0; i < 100000000; ++i)
    {
        doStuff(first, [&](Foo& foo){ foo.variable += val + i; /* Other, more complex functionality here */ });
    }
    doStuff(first, [&](Foo& foo){ std::cout << foo.variable << std::endl; /* Other, more complex and different functionality here */ });
#else
    for (long i = 0; i < 100000000; ++i)
    {
        Foo* fooPtr = &first;
        do
        {
            fooPtr->variable += val + i;
        } while (fooPtr->next && (fooPtr = fooPtr->next.get()));
    }
    Foo* fooPtr = &first;
    do
    {
        std::cout << fooPtr->variable << std::endl;
    } while (fooPtr->next && (fooPtr = fooPtr->next.get()));
#endif
}
If run as:
g++ test.cpp -O3 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.252s
user 0m0.250s
sys 0m0.001s
Whereas if run as:
g++ test.cpp -O3 -Wall -DUSE_FUNC -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.834s
user 0m0.831s
sys 0m0.001s
These timings are fairly consistent across multiple runs, and show a 4x multiplier when using std::function. Is there a better way I can do what I want to do?
Use a template:
template<typename T>
void doStuff(Foo& foo, T const& operation)
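The call sites from the question don't change at all: T is deduced as the unique closure type of each lambda, so every distinct lambda instantiates its own copy of doStuff whose body the compiler can inline. For instance, inside the benchmark loop (same call as in the question):

// T is deduced as the lambda's closure type, so operator()
// can be inlined into this doStuff instantiation.
doStuff(first, [&](Foo& foo){ foo.variable += val + i; });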
For me this gives:
mvine#xxx:~/mikeytemp$ g++ test.cpp -O3 -DUSE_FUNC -std=c++14 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.534s
user 0m0.529s
sys 0m0.005s
mvine#xxx:~/mikeytemp$ g++ test.cpp -O3 -std=c++14 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.583s
user 0m0.583s
sys 0m0.000s
std::function objects are quite heavyweight, but they have their use where the payload is quite large (>10000 cycles) or needs to be polymorphic, such as in a generalized job scheduler.
They need to contain a copy of your callable object and handle any exceptions it might throw.
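The cost comes from type erasure: a std::function-like wrapper hides the callable behind a stored pointer and calls it through an indirect function pointer, which the optimizer usually cannot inline. A minimal non-owning sketch of the idea (function_ref is a hypothetical name; real std::function additionally handles ownership, small-buffer storage, and the empty state):

// Illustrative only: shows why calls through the wrapper are indirect.
template <typename Signature>
class function_ref; // primary template, never defined

template <typename R, typename... Args>
class function_ref<R(Args...)>
{
    void* obj_;                   // erased pointer to the callable
    R (*invoke_)(void*, Args...); // trampoline; an indirect call at the call site

public:
    template <typename F>
    function_ref(F& f)
        : obj_(&f),
          invoke_([](void* o, Args... a) -> R {
              return (*static_cast<F*>(o))(a...);
          })
    {
    }

    R operator()(Args... a) const { return invoke_(obj_, a...); }
};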
Using a template gets you much closer to the metal as the resulting code frequently gets inlined.
template <typename Func>
void doStuff(Foo& foo, Func operation)
{
    Foo* fooPtr = &foo;
    do
    {
        operation(*fooPtr);
    } while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}
The compiler will be able to look inside your function and eliminate redundancy.
On Godbolt, your inner loop becomes:
.LBB0_6: # =>This Loop Header: Depth=1
    lea edx, [rax + 7]
    mov rsi, rcx
.LBB0_7: # Parent Loop BB0_6 Depth=1
    add dword ptr [rsi], edx
    mov rsi, qword ptr [rsi + 8]
    test rsi, rsi
    jne .LBB0_7
    mov esi, eax
    or esi, 1
    add esi, 7
    mov rdx, rcx
.LBB0_9: # Parent Loop BB0_6 Depth=1
    add dword ptr [rdx], esi
    mov rdx, qword ptr [rdx + 8]
    test rdx, rdx
    jne .LBB0_9
    add rax, 2
    cmp rax, 100000000
    jne .LBB0_6
As a bonus, if you didn't use a linked list, the loop may disappear entirely.

Will template function typedef specifier be properly inlined when creating each instance of template function?

I have made a function that operates on several streams of data at the same time and creates an output result which is written to a destination stream. A huge amount of time has been put into optimizing the performance of this function (OpenMP, intrinsics, etc.), and it performs beautifully.
There is a lot of math involved here; needless to say, it is a very long function.
Now I want to implement the same function with replacement math code for each instance, without writing each version of the function by hand. I want to differentiate between the instances using only #defines or inlined functions (the code has to be inlined in each version).
I went for templates, but templates allow only type specifiers, and I realized that #defines can't be used here. The remaining solution would be inlined math functions, so the simplified idea is to create a header like this:
'alm_quazimodo.h':
#pragma once

typedef struct ALM_DATA
{
    int l, t, r, b;
    int scan;
    BYTE* data;
} ALM_DATA;

typedef BYTE (*MATH_FX)(BYTE&, BYTE&);

// etc
inline BYTE math_a1(BYTE& A, BYTE& B){ return ((BYTE)((B > A) ? B : A)); }
inline BYTE math_a2(BYTE& A, BYTE& B){ return ((BYTE)(255 - ((long)((long)(255 - A) * (255 - B)) >> 8))); }
inline BYTE math_a3(BYTE& A, BYTE& B){ return ((BYTE)((B < 128) ? (2*(((long)A>>1)+64))*((float)B/255) : (255-(2*(255-(((long)A>>1)+64))*(float)(255-B)/255)))); }
// etc

template <typename MATH>
inline int const template_math_av (MATH math, ALM_DATA& a, ALM_DATA& b)
{
    // ultra simplified version of very complex code
    for (int y = a.t; y <= a.b; y++)
    {
        int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; x++)
        {
            int xoffset = yoffset + x;
            a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
        }
    }
    return 0;
}

ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b);
ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b);
and math_caller is defined in 'alm_quazimodo.cpp' as follows:
#include "stdafx.h"
#include "alm_quazimodo.h"
ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b)
{
switch(condition)
{
case 1: return template_math_av<MATH_FX>(math_a1, a, b);
break;
case 2: return template_math_av<MATH_FX>(math_a2, a, b);
break;
case 3: return template_math_av<MATH_FX>(math_a3, a, b);
break;
// etc
}
return -1;
}
The main concern here is optimization: mainly inlining of the MATH function code without breaking the existing optimizations of the original code, and without writing a separate instance of the function for each specific math operation, of course ;)
So does this template properly inline all the math functions?
And are there any suggestions for optimizing this function template?
If nothing else, thanks for reading this lengthy question.
It all depends on your compiler, the optimization level, and how and where the math_a1 to math_a3 functions are defined.
Usually, the compiler can optimize this if the functions in question are inline functions in the same compilation unit as the rest of the code.
If this doesn't happen for you, you may want to consider functors instead of functions, as in the sketch below.
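Applied to the question's code, that would mean wrapping each math_aN in a functor instead of forcing everything through the MATH_FX pointer type, so each instantiation of template_math_av sees the concrete callable and can inline it. A sketch, assuming the declarations from alm_quazimodo.h (math_a1_f is a name I made up):

// Sketch: each functor is a distinct type, so template_math_av gets a
// separate instantiation per operation, and the call to math(...) can
// be inlined.
struct math_a1_f
{
    BYTE operator()(BYTE& A, BYTE& B) const { return (BYTE)((B > A) ? B : A); }
};

ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b)
{
    switch (condition)
    {
    case 1: return template_math_av(math_a1_f{}, a, b); // MATH deduced as math_a1_f
    // ... one functor per math_aN, analogous to the cases above
    }
    return -1;
}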
Here are some simple examples I experimented with. You can do the same for your function and check the behavior of different compilers.
For my example, GCC 7.3 and clang 6.0 are pretty good at optimizing out function calls (provided they see the definition of the function, of course). However, somewhat surprisingly, ICC 18.0.0 is only able to optimize out functors and closures; even inline functions give it some trouble.
Just to have some code here in case the link stops working in the future.
For the following code:
template <typename T, int size, typename Closure>
T accumulate(T (&array)[size], T init, Closure closure) {
    for (int i = 0; i < size; ++i) {
        init = closure(init, array[i]);
    }
    return init;
}

int sum(int x, int y) { return x + y; }

inline int sub_inline(int x, int y) { return x - y; }

struct mul_functor {
    int operator ()(int x, int y) const { return x * y; }
};

extern int extern_operation(int x, int y);

int accumulate_function(int (&array)[5]) {
    return accumulate(array, 0, sum);
}

int accumulate_inline(int (&array)[5]) {
    return accumulate(array, 0, sub_inline);
}

int accumulate_functor(int (&array)[5]) {
    return accumulate(array, 1, mul_functor());
}

int accumulate_closure(int (&array)[5]) {
    return accumulate(array, 0, [](int x, int y) { return x | y; });
}

int accumulate_exetern(int (&array)[5]) {
    return accumulate(array, 0, extern_operation);
}
GCC 7.3 (x86) produces the following assembly:
sum(int, int):
    lea eax, [rdi+rsi]
    ret
accumulate_function(int (&) [5]):
    mov eax, DWORD PTR [rdi+4]
    add eax, DWORD PTR [rdi]
    add eax, DWORD PTR [rdi+8]
    add eax, DWORD PTR [rdi+12]
    add eax, DWORD PTR [rdi+16]
    ret
accumulate_inline(int (&) [5]):
    mov eax, DWORD PTR [rdi]
    neg eax
    sub eax, DWORD PTR [rdi+4]
    sub eax, DWORD PTR [rdi+8]
    sub eax, DWORD PTR [rdi+12]
    sub eax, DWORD PTR [rdi+16]
    ret
accumulate_functor(int (&) [5]):
    mov eax, DWORD PTR [rdi]
    imul eax, DWORD PTR [rdi+4]
    imul eax, DWORD PTR [rdi+8]
    imul eax, DWORD PTR [rdi+12]
    imul eax, DWORD PTR [rdi+16]
    ret
accumulate_closure(int (&) [5]):
    mov eax, DWORD PTR [rdi+4]
    or eax, DWORD PTR [rdi+8]
    or eax, DWORD PTR [rdi+12]
    or eax, DWORD PTR [rdi]
    or eax, DWORD PTR [rdi+16]
    ret
accumulate_exetern(int (&) [5]):
    push rbp
    push rbx
    lea rbp, [rdi+20]
    mov rbx, rdi
    xor eax, eax
    sub rsp, 8
.L8:
    mov esi, DWORD PTR [rbx]
    mov edi, eax
    add rbx, 4
    call extern_operation(int, int)
    cmp rbx, rbp
    jne .L8
    add rsp, 8
    pop rbx
    pop rbp
    ret

Excessive stack usage for simple function in debug build

I have a simple class using a kind of ATL database access.
All functions are defined in a header file.
The problematic functions all do the same thing; there are some macros in use. The generated code looks like this:
void InitBindings()
{
    if (sName)                // Static global char*
        m_sTableName = sName; // Save into member
    { AddCol("Name",  some_constant_data... _GetOleDBType(...), ...); };
    { AddCol("Name1", some_other_constant_data_GetOleDBType(...), ...); };
    ...
}
AddCol returns a reference to a structure, but as you can see it is ignored.
When I look at the assembler code for a function that makes 6 AddCol calls, I can see that the function requires 2176 bytes of stack space. I have functions that require 20 KB and more. And in the debugger I can see that this stack space isn't used at all (it is all initialized to 0xCC and never touched).
See the assembler code at the end.
The problem can be seen with VS 2015 and VS 2017, and only in Debug mode.
In Release mode the function reserves no extra stack space at all.
The only rule I can see is: more AddCol calls cause more stack to be reserved. Approximately 500 bytes are reserved per AddCol call.
Again: the function returns no object; AddCol just returns a reference to the binding information.
I already used the following pragmas in front of the function (but inside the class definition in the header):
__pragma(runtime_checks("", off)) __pragma(optimize("ts", on)) __pragma(strict_gs_check(push, off))
But to no avail. These pragmas should turn optimization on and switch off runtime checks and stack checks. How can I reduce this unneeded stack space that is allocated? In some cases I see stack overflows in the debug version when these functions are used; there are no problems in the release version.
; 325 : BIND_BEGIN(CMasterData, _T("tblMasterData"))
    push ebp
    mov ebp, esp
    sub esp, 2176 ; 00000880H
    push ebx
    push esi
    push edi
    mov DWORD PTR _this$[ebp], ecx
    mov eax, OFFSET ??_C@_1BM@GOLNKAI@?$AAt?$AAb?$AAl?$AAM?$AAa?$AAs?$AAt?$AAe?$AAr?$AAD?$AAa?$AAt?$AAa?$AA?$AA@
    test eax, eax
    je SHORT $LN2@InitBindin
    push OFFSET ??_C@_1BM@GOLNKAI@?$AAt?$AAb?$AAl?$AAM?$AAa?$AAs?$AAt?$AAe?$AAr?$AAD?$AAa?$AAt?$AAa?$AA?$AA@
    mov ecx, DWORD PTR _this$[ebp]
    add ecx, 136 ; 00000088H
    call DWORD PTR __imp_??4?$CStringT@_WV?$StrTraitMFC_DLL@_WV?$ChTraitsCRT@_W@ATL@@@@@ATL@@QAEAAV01@PB_W@Z
$LN2@InitBindin:
; 326 : // Columns:
; 327 : B$C_IDENT (_T("Id"), m_lId);
    push 0
    push 0
    push 1
    push 4
    push 0
    call ?_GetOleDBType@ATL@@YAGAAJ@Z ; ATL::_GetOleDBType
    add esp, 4
    movzx eax, ax
    push eax
    push 0
    push OFFSET ??_C@_15NCCOGFKM@?$AAI?$AAd?$AA?$AA@
    mov ecx, DWORD PTR _this$[ebp]
    call ?AddCol@CDBAccess@DB@@QAEAAUS_BIND@2@PB_WKGKW4TYPE@32@0_N@Z ; DB::CDBAccess::AddCol
; 328 : B$C (_T("Name"), m_szName);
    push 0
    push 0
    push 0
    push 122 ; 0000007aH
    mov eax, 4
    push eax
    call ?_GetOleDBType@ATL@@YAGQA_W@Z ; ATL::_GetOleDBType
    add esp, 4
    movzx ecx, ax
    push ecx
    push 4
    push OFFSET ??_C@_19DINFBLAK@?$AAN?$AAa?$AAm?$AAe?$AA?$AA@
    mov ecx, DWORD PTR _this$[ebp]
    call ?AddCol@CDBAccess@DB@@QAEAAUS_BIND@2@PB_WKGKW4TYPE@32@0_N@Z ; DB::CDBAccess::AddCol
; 329 : B$C (_T("Data"), m_data);
    push 0
    push 0
    push 0
    push 4
    push 128 ; 00000080H
    call ?_GetOleDBType@ATL@@YAGAAVCComBSTR@1@@Z ; ATL::_GetOleDBType
    add esp, 4
    movzx eax, ax
    push eax
    push 128 ; 00000080H
    push OFFSET ??_C@_19IEEMEPMH@?$AAD?$AAa?$AAt?$AAa?$AA?$AA@
    mov ecx, DWORD PTR _this$[ebp]
    call ?AddCol@CDBAccess@DB@@QAEAAUS_BIND@2@PB_WKGKW4TYPE@32@0_N@Z ; DB::CDBAccess::AddCol
It is a compiler bug, already known on Microsoft Connect.
EDIT: The problem seems to be fixed in VS 2017 15.5.1.
The problem has to do with a bug in the built-in offsetof.
It is not possible for me to #undef _CRT_USE_BUILTIN_OFFSETOF as written in this case.
For me, the only thing that works is to #undef offsetof and to use one of these:
#define myoffsetof1(s,m) ((size_t)&reinterpret_cast<char const volatile&>((((s*)0)->m)))
#define myoffsetof2(s, m) ((size_t)&(((s*)0)->m))
#undef offsetof
#define offsetof myoffsetof1
All ATL DB consumers are affected.
Here is a minimal repro that shows the bug. Set a breakpoint on the Init function, look at the assembler code, and wonder how much stack is used!
// StackUsage.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <string>
#include <list>
#include <iostream>

using namespace std;

struct CRec
{
    char t1[20];
    char t2[20];
    char t3[20];
    char t4[20];
    char t5[20];
    int i1, i2, i3, i4, i5;
    GUID g1, g2, g3, g4, g5;
    DBTIMESTAMP d1, d2, d3, d4, d5;
};

#define sizeofmember(s,m) sizeof(reinterpret_cast<const s *>(0)->m)
#define typeofmember(c,m) _GetOleDBType(((c*)0)->m)
#define myoffsetof1(s,m) ((size_t)&reinterpret_cast<char const volatile&>((((s*)0)->m)))
#define myoffsetof2(s, m) ((size_t)&(((s*)0)->m))

// Uncomment these lines to fix the bug
// #undef offsetof
// #define offsetof myoffsetof1

#define COL(n,v) { AddCol(n,offsetof(CRec,v),typeofmember(CRec,v),sizeofmember(CRec,v)); }

class CFoo
{
public:
    CFoo()
    {
        Init();
    }

    void Init()
    {
        COL("t1", t1);
        COL("t2", t2);
        COL("t3", t3);
        COL("t4", t4);
        COL("t5", t5);
        COL("i1", i1);
        COL("i2", i2);
        COL("i3", i3);
        COL("i4", i4);
        COL("i5", i5);
        COL("g1", g1);
        COL("g2", g2);
        COL("g2", g3);
        COL("g2", g4);
        COL("g2", g5);
        COL("d1", d1);
        COL("d2", d2);
        COL("d2", d3);
        COL("d2", d4);
        COL("d2", d5);
    }

    void AddCol(PCSTR szName, ULONG nOffset, DBTYPE wType, ULONG nSize)
    {
        cout << szName << '\t' << nOffset << '\t' << wType << '\t' << nSize << endl;
    }
};

int main()
{
    CFoo foo;
    return 0;
}

The compiler decided to call the function `POW`, instead of evaluating it at compile time. Why?

I got the snippet below from this comment by @CaffeineAddict.
#include <iostream>

template<typename base_t, typename expo_t>
constexpr base_t POW(base_t base, expo_t expo)
{
    return (expo != 0) ? base * POW(base, expo - 1) : 1;
}

int main(int argc, char** argv)
{
    std::cout << POW((unsigned __int64)2, 63) << std::endl;
    return 0;
}
with the following disassembly obtained from VS2015:
int main(int argc, char** argv)
{
009418A0 push ebp
009418A1 mov ebp,esp
009418A3 sub esp,0C0h
009418A9 push ebx
009418AA push esi
009418AB push edi
009418AC lea edi,[ebp-0C0h]
009418B2 mov ecx,30h
009418B7 mov eax,0CCCCCCCCh
009418BC rep stos dword ptr es:[edi]
std::cout << POW((unsigned __int64)2, 63) << std::endl;
009418BE mov esi,esp
009418C0 push offset std::endl<char,std::char_traits<char> > (0941064h)
009418C5 push 3Fh
009418C7 push 0
009418C9 push 2
009418CB call POW<unsigned __int64,int> (09410FAh) <<========
009418D0 add esp,0Ch
009418D3 mov edi,esp
009418D5 push edx
009418D6 push eax
009418D7 mov ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (094A098h)]
009418DD call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (094A0ACh)]
009418E3 cmp edi,esp
009418E5 call __RTC_CheckEsp (0941127h)
009418EA mov ecx,eax
009418EC call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (094A0B0h)]
009418F2 cmp esi,esp
009418F4 call __RTC_CheckEsp (0941127h)
return 0;
009418F9 xor eax,eax
}
which shows (see the characters "<<======" introduced by me in the disassembly) that the compiler didn't evaluate the function POW at compile time. From his comment, @CaffeineAddict seemed to expect this behavior from the compiler. But I still can't understand why this was expected at all.
Two reasons.
First, a constexpr function isn't guaranteed to be called at compile time. To force the compiler to call it at compile time, you must store the result in a constexpr variable first, i.e.
constexpr auto pow = POW((unsigned __int64)2, 63);
std::cout << pow << std::endl;
Secondly, you must build the project in the Release configuration. In VS, you'll find you can set breakpoints in and step through constexpr functions if you have built the project using the Debug configuration.
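Putting both points together, here is a minimal sketch (substituting the portable unsigned long long for the MSVC-specific unsigned __int64):

#include <iostream>

template<typename base_t, typename expo_t>
constexpr base_t POW(base_t base, expo_t expo)
{
    return (expo != 0) ? base * POW(base, expo - 1) : 1;
}

int main()
{
    // A constexpr variable must be initialized by a constant expression,
    // so the compiler is forced to evaluate POW at compile time; in a
    // Release build the disassembly then contains only the constant.
    constexpr auto result = POW(2ull, 63);
    static_assert(result == 1ull << 63, "evaluated at compile time");
    std::cout << result << std::endl;
    return 0;
}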