Avoiding the overheads of std::function - c++

I want to run a set of operations over the elements of a (custom) singly-linked list. The code to traverse the linked list and run the operations is simple, but it is repetitive and easy to get wrong if copy/pasted everywhere. Performance and careful memory allocation are important in my program, so I want to avoid unnecessary overheads.
I want to write a wrapper that contains the repetitive traversal code and encapsulates the operations to run on each element of the linked list. Because the work done inside each operation varies, I need to capture multiple variables (in the real code) and provide them to the operation, so I looked at using std::function. The actual calculations done in this example code are meaningless.
#include <functional>
#include <iostream>
#include <memory>
struct Foo
{
explicit Foo(int num) : variable(num) {}
int variable;
std::unique_ptr<Foo> next;
};
void doStuff(Foo& foo, std::function<void(Foo&)> operation)
{
Foo* fooPtr = &foo;
do
{
operation(*fooPtr);
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}
int main(int argc, char** argv)
{
int val = 7;
Foo first(4);
first.next = std::make_unique<Foo>(5);
first.next->next = std::make_unique<Foo>(6);
#ifdef USE_FUNC
for (long i = 0; i < 100000000; ++i)
{
doStuff(first, [&](Foo& foo){ foo.variable += val + i; /*Other, more complex functionality here */ });
}
doStuff(first, [&](Foo& foo){ std::cout << foo.variable << std::endl; /*Other, more complex and different functionality here */ });
#else
for (long i = 0; i < 100000000; ++i)
{
Foo* fooPtr = &first;
do
{
fooPtr->variable += val + i;
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}
Foo* fooPtr = &first;
do
{
std::cout << fooPtr->variable << std::endl;
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
#endif
}
If run as:
g++ test.cpp -O3 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.252s
user 0m0.250s
sys 0m0.001s
Whereas if run as:
g++ test.cpp -O3 -Wall -DUSE_FUNC -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.834s
user 0m0.831s
sys 0m0.001s
These timings are fairly consistent across multiple runs, and show the std::function version taking more than 3x as long (0.834s vs 0.252s). Is there a better way to do what I want to do?

Use a template:
template<typename T>
void doStuff(Foo& foo, T const& operation)
For me this gives:
mvine@xxx:~/mikeytemp$ g++ test.cpp -O3 -DUSE_FUNC -std=c++14 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.534s
user 0m0.529s
sys 0m0.005s
mvine@xxx:~/mikeytemp$ g++ test.cpp -O3 -std=c++14 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.583s
user 0m0.583s
sys 0m0.000s

std::function objects are quite heavyweight, but they have their uses where the payload is large (more than about 10,000 cycles) or the callable needs to be polymorphic, as in a generalised job scheduler.
They need to contain a type-erased copy of your callable object, so each call goes through an indirection that the optimizer often cannot see through.
Using a template gets you much closer to the metal, as the resulting code frequently gets inlined.
template <typename Func>
void doStuff(Foo& foo, Func operation)
{
Foo* fooPtr = &foo;
do
{
operation(*fooPtr);
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}
The compiler will be able to look inside your function and eliminate redundancy.
On Godbolt, your inner loop becomes
.LBB0_6: # =>This Loop Header: Depth=1
lea edx, [rax + 7]
mov rsi, rcx
.LBB0_7: # Parent Loop BB0_6 Depth=1
add dword ptr [rsi], edx
mov rsi, qword ptr [rsi + 8]
test rsi, rsi
jne .LBB0_7
mov esi, eax
or esi, 1
add esi, 7
mov rdx, rcx
.LBB0_9: # Parent Loop BB0_6 Depth=1
add dword ptr [rdx], esi
mov rdx, qword ptr [rdx + 8]
test rdx, rdx
jne .LBB0_9
add rax, 2
cmp rax, 100000000
jne .LBB0_6
As a bonus, if you didn't use a linked list, the loop may disappear entirely.
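For reference, the templated doStuff from this answer can be dropped into the question's program roughly as in the sketch below. The runExample helper and its iterations parameter are hypothetical names introduced only so the result can be checked; the traversal is written as a for loop, which also copes with a null starting pointer, and the callable is taken by value as in the answer.

```cpp
#include <memory>
#include <vector>

struct Foo
{
    explicit Foo(int num) : variable(num) {}
    int variable;
    std::unique_ptr<Foo> next;
};

// The callable's concrete type is a template parameter, so the compiler
// sees the lambda body at the call site and is free to inline it.
template <typename Func>
void doStuff(Foo& foo, Func operation)
{
    for (Foo* fooPtr = &foo; fooPtr != nullptr; fooPtr = fooPtr->next.get())
    {
        operation(*fooPtr);
    }
}

// Hypothetical helper: builds the three-element list from the question,
// applies the update `iterations` times and returns the final values.
std::vector<int> runExample(long iterations)
{
    int val = 7;
    Foo first(4);
    first.next = std::make_unique<Foo>(5);
    first.next->next = std::make_unique<Foo>(6);

    for (long i = 0; i < iterations; ++i)
    {
        doStuff(first, [&](Foo& foo) { foo.variable += static_cast<int>(val + i); });
    }

    std::vector<int> values;
    doStuff(first, [&](Foo& foo) { values.push_back(foo.variable); });
    return values;
}
```

Each lambda has its own closure type, so every call to doStuff instantiates a separate copy of the loop. That is the price of giving the optimizer full visibility.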

Related

DWCAS-alternative with no help of the kernel

I just wanted to test whether my compiler recognizes ...
atomic<pair<uintptr_t, uintptr_t>>
... and uses DWCAS on it (like x86-64 lock cmpxchg16b), or whether it supplements the pair with an ordinary lock.
So I first wrote a minimal program with a single noinline function that does a compare-and-swap on an atomic pair. The compiler generated a lot of code for this which I didn't understand, and I didn't see any LOCK-prefixed instructions in it. I was curious whether the implementation places a lock within the atomic, so I printed the sizeof of the above atomic pair: 24 on a 64-bit platform, which I took to mean there is no lock.
Finally I wrote a program in which all the threads my system has (Ryzen Threadripper, 64 cores, Win10, SMT off) increment both halves of a single atomic pair a predefined number of times. Then I calculated the time for each increment in nanoseconds. The time is rather high, about 20,000 ns for each successful increment, so at first it looked to me as if there was a lock I had overlooked; but I thought that couldn't be true with a sizeof of 24 bytes for this atomic. And when I looked at the Process Viewer I saw that all 64 cores were at nearly 100% user CPU time all the time, so there couldn't be any kernel intervention.
So is there anyone here smarter than me who can identify what this DWCAS substitute does from the assembly dump?
Here's my test-code:
#include <iostream>
#include <atomic>
#include <cstdint>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <vector>
using namespace std;
using namespace chrono;
struct uip_pair
{
uip_pair() = default;
uip_pair( uintptr_t first, uintptr_t second ) :
first( first ),
second( second )
{
}
uintptr_t first, second;
};
using atomic_pair = atomic<uip_pair>;
int main()
{
cout << "sizeof(atomic<pair<uintptr_t, uintptr_t>>): " << sizeof(atomic_pair) << endl;
atomic_pair ap( uip_pair( 0, 0 ) );
cout << "atomic<pair<uintptr_t, uintptr_t>>::is_lock_free: " << ap.is_lock_free() << endl;
mutex mtx;
unsigned nThreads = thread::hardware_concurrency();
unsigned ready = nThreads;
condition_variable cvReady;
bool run = false;
condition_variable cvRun;
atomic_int64_t sumDur = 0;
auto theThread = [&]( size_t n )
{
unique_lock<mutex> lock( mtx );
if( !--ready )
cvReady.notify_one();
cvRun.wait( lock, [&]() -> bool { return run; } );
lock.unlock();
auto start = high_resolution_clock::now();
uip_pair cmp = ap.load( memory_order_relaxed );
for( ; n--; )
while( !ap.compare_exchange_weak( cmp, uip_pair( cmp.first + 1, cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
lock.lock();
};
vector<jthread> threads;
threads.reserve( nThreads );
static size_t const ROUNDS = 100'000;
for( unsigned t = nThreads; t--; )
threads.emplace_back( theThread, ROUNDS );
unique_lock<mutex> lock( mtx );
cvReady.wait( lock, [&]() -> bool { return !ready; } );
run = true;
cvRun.notify_all();
lock.unlock();
for( jthread &thr : threads )
thr.join();
cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
uip_pair p = ap.load( memory_order_relaxed );
cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}
[EDIT]: I've extracted the compare_exchange_weak-function into a noinline-function and disassembled the code:
struct uip_pair
{
uip_pair() = default;
uip_pair( uintptr_t first, uintptr_t second ) :
first( first ),
second( second )
{
}
uintptr_t first, second;
};
using atomic_pair = atomic<uip_pair>;
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute((noinline))
#endif
NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
}
mov eax, 1
mov r10, rcx
mov r9d, eax
xchg DWORD PTR [rcx], eax
test eax, eax
je SHORT label8
label1:
mov eax, DWORD PTR [rcx]
test eax, eax
je SHORT label7
label2:
mov eax, r9d
test r9d, r9d
je SHORT label5
label4:
pause
sub eax, 1
jne SHORT label4
cmp r9d, 64
jl SHORT label5
lea r9d, QWORD PTR [rax+64]
jmp SHORT label6
label5:
add r9d, r9d
label6:
mov eax, DWORD PTR [rcx]
test eax, eax
jne SHORT label2
label7:
mov eax, 1
xchg DWORD PTR [rcx], eax
test eax, eax
jne SHORT label1
label8:
mov rax, QWORD PTR [rcx+8]
sub rax, QWORD PTR [rdx]
jne SHORT label9
mov rax, QWORD PTR [rcx+16]
sub rax, QWORD PTR [rdx+8]
label9:
test rax, rax
sete al
test al, al
je SHORT label10
movups xmm0, XMMWORD PTR [r8]
movups XMMWORD PTR [rcx+8], xmm0
xor ecx, ecx
xchg DWORD PTR [r10], ecx
ret
label10:
movups xmm0, XMMWORD PTR [rcx+8]
xor ecx, ecx
movups XMMWORD PTR [rdx], xmm0
xchg DWORD PTR [r10], ecx
ret
Maybe someone understands the disassembly. Remember that XCHG is implicitly LOCK'ed on x86. It seems to me that MSVC uses some kind of software transactional memory here. I can extend the shared structure embedded in the atomic arbitrarily but the difference is still 8 bytes; so MSVC always uses some kind of STM.
As Nate pointed out in comments, it is a spinlock.
You can look up the source: it ships with the compiler and is available on GitHub.
If you build an unoptimized debug build, you can step into this source during interactive debugging!
There's a member variable called _Spinlock.
And here's the locking function:
#if 1 // TRANSITION, ABI, GH-1151
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
// Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
// Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
// The code in mentioned manual is covered by the 0BSD license.
int _Current_backoff = 1;
const int _Max_backoff = 64;
while (_InterlockedExchange(&_Spinlock, 1) != 0) {
while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
_mm_pause();
}
_Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
}
}
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
while (_InterlockedExchange(&_Spinlock, 1) != 0) { // TRANSITION, GH-1133: _InterlockedExchange_acq
while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
__yield();
}
}
#else // ^^^ defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) ^^^
#error Unsupported hardware
#endif
}
(Disclosure: I brought this increasing backoff from the Intel manual in there; before that it was just an xchg loop: issue, PR.)
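The same test-and-test-and-set idea with capped exponential backoff can be sketched in portable C++. This is not the MSVC implementation: BackoffSpinlock, Pair and increment_concurrently are hypothetical names, and std::this_thread::yield() stands in for the _mm_pause intrinsic used above.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical portable spinlock following the structure shown above.
class BackoffSpinlock
{
public:
    void lock() noexcept
    {
        int backoff = 1;
        constexpr int maxBackoff = 64;
        // Outer loop: try to take the lock with an atomic exchange.
        while (flag_.exchange(true, std::memory_order_acquire))
        {
            // Inner loop: spin on plain loads until the lock looks free,
            // waiting a little longer after each failed check.
            while (flag_.load(std::memory_order_relaxed))
            {
                for (int count = backoff; count != 0; --count)
                {
                    std::this_thread::yield(); // assumption: replaces _mm_pause
                }
                backoff = backoff < maxBackoff ? backoff << 1 : maxBackoff;
            }
        }
    }

    void unlock() noexcept { flag_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> flag_{false};
};

struct Pair
{
    std::uintptr_t first = 0;
    std::uintptr_t second = 0;
};

// Increments both halves of a shared pair from several threads under the lock.
Pair increment_concurrently(unsigned nThreads, unsigned rounds)
{
    BackoffSpinlock lock;
    Pair shared;
    std::vector<std::thread> threads;
    threads.reserve(nThreads);
    for (unsigned t = 0; t < nThreads; ++t)
    {
        threads.emplace_back([&] {
            for (unsigned r = 0; r < rounds; ++r)
            {
                lock.lock();
                ++shared.first;
                ++shared.second;
                lock.unlock();
            }
        });
    }
    for (std::thread& thr : threads)
    {
        thr.join();
    }
    return shared;
}
```

Spinning on relaxed loads between exchange attempts keeps waiting threads from issuing locked operations continuously, which is the point of the inner loop in the library code as well.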
Using a spinlock here is known to be suboptimal; a mutex that does a kernel wait should have been used instead. The problem with a spinlock is that in the rare case when a context switch happens while a low-priority thread holds the lock, it will take a while for that lock to be released, because the scheduler is not aware of the high-priority thread waiting on it.
Sure, not using cmpxchg16b is also suboptimal. Still, for bigger atomics a non-lock-free mechanism has to be used. (No decision was made to avoid cmpxchg16b; it is just a consequence of ABI compatibility back to Visual Studio 2015.)
There's an issue about making it better, that will hopefully be addressed with the next ABI break: https://github.com/microsoft/STL/issues/1151
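Whether a given atomic falls back to a lock can be checked without reading any disassembly. The sketch below (C++17, with the hypothetical names AtomicLayout and describe_atomic_pair) compares the size of the payload with the size of the atomic and queries is_always_lock_free, which is a compile-time constant.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct uip_pair
{
    std::uintptr_t first, second;
};

// Hypothetical summary of how the implementation lays out atomic<uip_pair>.
struct AtomicLayout
{
    std::size_t payloadSize;  // sizeof(uip_pair)
    std::size_t atomicSize;   // sizeof(std::atomic<uip_pair>)
    bool alwaysLockFree;      // compile-time guarantee, if any
};

inline AtomicLayout describe_atomic_pair()
{
    return {sizeof(uip_pair),
            sizeof(std::atomic<uip_pair>),
            std::atomic<uip_pair>::is_always_lock_free};
}
```

An atomic that is larger than its payload is a strong hint that extra state is stored next to the value. The question reports 24 bytes on 64-bit MSVC for a 16-byte payload, which is consistent with the _Spinlock member plus padding.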
As for transactional memory: it might make sense to use hardware transactional memory there. I can speculate that Intel RTM could be used there via intrinsics, or there may be some future OS API for it (like an enhanced SRWLOCK), but it is likely that nobody will want more complexity there, as a non-lock-free atomic is a compatibility facility, not something you would deliberately want to use.

Will template function typedef specifier be properly inlined when creating each instance of template function?

I have written a function that operates on several streams of data at the same time and creates an output result which is written to a destination stream. A huge amount of time has been put into optimizing the performance of this function (OpenMP, intrinsics, etc.), and it performs beautifully.
There is a lot of math involved, so needless to say it is a very long function.
Now I want to reuse the same function with the math code replaced for each variant, without writing out each version of the function by hand. I want to differentiate between the variants using only #defines or inlined functions (the code has to be inlined in each version).
I went for templates, but as far as I could tell templates allow only type specifiers, and I realized that #defines can't be used here. The remaining solution would be inlined math functions, so the simplified idea is to create a header like this:
'alm_quasimodo.h':
#pragma once
typedef struct ALM_DATA
{
int l, t, r, b;
int scan;
BYTE* data;
} ALM_DATA;
typedef BYTE (*MATH_FX)(BYTE&, BYTE&);
// etc
inline BYTE math_a1(BYTE& A, BYTE& B){ return ((BYTE)((B > A) ? B:A)); }
inline BYTE math_a2(BYTE& A, BYTE& B){ return ((BYTE)(255 - ((long)((long)(255 - A) * (255 - B)) >> 8))); }
inline BYTE math_a3(BYTE& A, BYTE& B){ return ((BYTE)((B < 128)?(2*(((long)A>>1)+64))*((float)B/255):(255-(2*(255-(((long)A>>1)+64))*(float)(255-B)/255)))); }
// etc
template <typename MATH>
inline int const template_math_av (MATH math, ALM_DATA& a, ALM_DATA& b)
{
// ultra simplified version of very complex code
for (int y = a.t; y <= a.b; y++)
{
int yoffset = y * a.scan;
for (int x = a.l; x <= a.r; x++)
{
int xoffset = yoffset + x;
a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
}
}
return 0;
}
ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b);
and math_caller is defined in 'alm_quasimodo.cpp' as follows:
#include "stdafx.h"
#include "alm_quasimodo.h"
ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b)
{
switch(condition)
{
case 1: return template_math_av<MATH_FX>(math_a1, a, b);
break;
case 2: return template_math_av<MATH_FX>(math_a2, a, b);
break;
case 3: return template_math_av<MATH_FX>(math_a3, a, b);
break;
// etc
}
return -1;
}
My main concern here is optimization, mainly inlining of the MATH function code, without breaking the existing optimizations of the original code. And without writing out each instance of the function for a specific math operation, of course ;)
So does this template properly inline all the math functions?
And are there any suggestions for how to optimize this function template?
If nothing else, thanks for reading this lengthy question.
It all depends on your compiler, the optimization level, and how and where the math_a1 to math_a3 functions are defined.
Usually, the compiler can optimize this if the functions in question are inline functions in the same compilation unit as the rest of the code.
If this doesn't happen for you, you may want to consider functors instead of functions.
Here are some simple examples I experimented with. You can do the same for your function, and check the behavior of different compilers.
For my example, GCC 7.3 and clang 6.0 are pretty good in optimizing-out function calls (provided they see the definition of the function of course). However, somewhat surprisingly, ICC 18.0.0 is only able to optimize-out functors and closures. Even inline functions give it some trouble.
Just to have some code here in case the link stops working in the future.
For the following code:
template <typename T, int size, typename Closure>
T accumulate(T (&array)[size], T init, Closure closure) {
for (int i = 0; i < size; ++i) {
init = closure(init, array[i]);
}
return init;
}
int sum(int x, int y) { return x + y; }
inline int sub_inline(int x, int y) { return x - y; }
struct mul_functor {
int operator ()(int x, int y) const { return x * y; }
};
extern int extern_operation(int x, int y);
int accumulate_function(int (&array)[5]) {
return accumulate(array, 0, sum);
}
int accumulate_inline(int (&array)[5]) {
return accumulate(array, 0, sub_inline);
}
int accumulate_functor(int (&array)[5]) {
return accumulate(array, 1, mul_functor());
}
int accumulate_closure(int (&array)[5]) {
return accumulate(array, 0, [](int x, int y) { return x | y; });
}
int accumulate_exetern(int (&array)[5]) {
return accumulate(array, 0, extern_operation);
}
GCC 7.3 (x86-64) produces the following assembly:
sum(int, int):
lea eax, [rdi+rsi]
ret
accumulate_function(int (&) [5]):
mov eax, DWORD PTR [rdi+4]
add eax, DWORD PTR [rdi]
add eax, DWORD PTR [rdi+8]
add eax, DWORD PTR [rdi+12]
add eax, DWORD PTR [rdi+16]
ret
accumulate_inline(int (&) [5]):
mov eax, DWORD PTR [rdi]
neg eax
sub eax, DWORD PTR [rdi+4]
sub eax, DWORD PTR [rdi+8]
sub eax, DWORD PTR [rdi+12]
sub eax, DWORD PTR [rdi+16]
ret
accumulate_functor(int (&) [5]):
mov eax, DWORD PTR [rdi]
imul eax, DWORD PTR [rdi+4]
imul eax, DWORD PTR [rdi+8]
imul eax, DWORD PTR [rdi+12]
imul eax, DWORD PTR [rdi+16]
ret
accumulate_closure(int (&) [5]):
mov eax, DWORD PTR [rdi+4]
or eax, DWORD PTR [rdi+8]
or eax, DWORD PTR [rdi+12]
or eax, DWORD PTR [rdi]
or eax, DWORD PTR [rdi+16]
ret
accumulate_exetern(int (&) [5]):
push rbp
push rbx
lea rbp, [rdi+20]
mov rbx, rdi
xor eax, eax
sub rsp, 8
.L8:
mov esi, DWORD PTR [rbx]
mov edi, eax
add rbx, 4
call extern_operation(int, int)
cmp rbx, rbp
jne .L8
add rsp, 8
pop rbx
pop rbp
ret
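Applied to the question's code, the functor suggestion could look like the sketch below. In the question every call instantiates template_math_av<MATH_FX> with the same function-pointer type, so all operations share one instantiation and inlining depends on the compiler propagating the pointer value. With functors each operation has its own type and therefore its own instantiation. The names AlmData, MathMax and MathScreen are hypothetical, BYTE is assumed to be an 8-bit unsigned type, and the ALM_API export macro is left out.

```cpp
#include <cstdint>

using BYTE = std::uint8_t; // assumption: stand-in for the Windows BYTE typedef

struct AlmData
{
    int l, t, r, b;
    int scan;
    BYTE* data;
};

// Functor counterparts of math_a1 and math_a2 from the question.
struct MathMax
{
    BYTE operator()(BYTE a, BYTE b) const { return b > a ? b : a; }
};

struct MathScreen
{
    BYTE operator()(BYTE a, BYTE b) const
    {
        return static_cast<BYTE>(255 - (((255 - a) * (255 - b)) >> 8));
    }
};

// One instantiation per functor type, so the call target is known statically.
template <typename Math>
int template_math_av(Math math, AlmData& a, const AlmData& b)
{
    for (int y = a.t; y <= a.b; ++y)
    {
        const int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; ++x)
        {
            const int xoffset = yoffset + x;
            a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
        }
    }
    return 0;
}

int math_caller(int condition, AlmData& a, const AlmData& b)
{
    switch (condition)
    {
    case 1: return template_math_av(MathMax{}, a, b);
    case 2: return template_math_av(MathScreen{}, a, b);
    default: return -1;
    }
}
```

Whether the calls are actually inlined still has to be confirmed in the generated assembly for the compiler and flags in use.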

Result of TLS variable access not cached

Edit: It seems this is a compiler bug indeed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803
I am writing a wrapper for writing logs that uses TLS to store a std::stringstream buffer. This code will be used by shared libraries. Looking at the code on godbolt.org, it seems that neither gcc nor clang will cache the result of a TLS lookup (the loop repeatedly calls '__tls_get_addr()'), even though I believe I have designed my class in a way that should let them.
#include <sstream>
class LogStream
{
public:
LogStream()
: m_buffer(getBuffer())
{
}
LogStream(std::stringstream& buffer)
: m_buffer(buffer)
{
}
static std::stringstream& getBuffer()
{
thread_local std::stringstream buffer;
return buffer;
}
template <typename T>
inline LogStream& operator<<(const T& t)
{
m_buffer << t;
return *this;
}
private:
std::stringstream& m_buffer;
};
int main()
{
LogStream log{};
for (int i = 0; i < 12345678; ++i)
{
log << i;
}
}
Looking at the assembly code output both gcc and clang generate pretty similar output:
clang 5.0.0:
xor ebx, ebx
.LBB0_3: # =>This Inner Loop Header: Depth=1
data16
lea rdi, [rip + LogStream::getBuffer[abi:cxx11]()::buffer[abi:cxx11]@TLSGD]
data16
data16
rex64
call __tls_get_addr@PLT // Called on every loop iteration.
lea rdi, [rax + 16]
mov esi, ebx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
inc ebx
cmp ebx, 12345678
jne .LBB0_3
gcc 7.2:
xor ebx, ebx
.L3:
lea rdi, guard variable for LogStream::getBuffer[abi:cxx11]()::buffer@tlsld[rip]
call __tls_get_addr@PLT // Called on every loop iteration.
mov esi, ebx
add ebx, 1
lea rdi, LogStream::getBuffer[abi:cxx11]()::buffer@dtpoff[rax+16]
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
cmp ebx, 12345678
jne .L3
How can I convince both compilers that the lookup doesn't need to be repeatedly done?
Compiler options: -std=c++11 -O3 -fPIC
Godbolt link
This really looks like an optimization bug in both Clang and GCC.
Here's what I think happens. (I might be completely off.) The compiler completely inlines everything down to this code:
int main()
{
// pseudo-access
std::stringstream& m_buffer = LogStream::getBuffer::buffer;
for (int i = 0; i < 12345678; ++i)
{
m_buffer << i;
}
}
And then, not realizing that access to a thread-local is very expensive under -fPIC, it decides that the temporary reference to the global is not necessary and inlines that as well:
int main()
{
for (int i = 0; i < 12345678; ++i)
{
// pseudo-access
LogStream::getBuffer::buffer << i;
}
}
Whatever actually happens, this is clearly a pessimization of the code you wrote. You should report this as a bug to GCC and Clang.
GCC bugtracker: https://gcc.gnu.org/bugzilla/
Clang bugtracker: https://bugs.llvm.org/

Adding stringstream/cout hurts performance, even when the code is never called

I have a program in which a simple function is called a large number of times. I have added some simple logging code and find that this significantly affects performance, even when the logging code is not actually called. A complete (but simplified) test case is shown below:
#include <chrono>
#include <iostream>
#include <random>
#include <sstream>
using namespace std::chrono;
std::mt19937 rng;
uint32_t getValue()
{
// Just some pointless work, helps stop this function from getting inlined.
for (int x = 0; x < 100; x++)
{
rng();
}
// Get a value, which happens never to be zero
uint32_t value = rng();
// This (by chance) is never true
if (value == 0)
{
value++; // This if statement won't get optimized away when printing below is commented out.
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
return value;
}
int main(int argc, char* argv[])
{
// Just for timing
high_resolution_clock::time_point start = high_resolution_clock::now();
uint32_t sum = 0;
for (uint32_t i = 0; i < 10000000; i++)
{
sum += getValue();
}
milliseconds elapsed = duration_cast<milliseconds>(high_resolution_clock::now() - start);
// Use (print) the sum to make sure it doesn't get optimized away.
std::cout << "Sum = " << sum << ", Elapsed = " << elapsed.count() << "ms" << std::endl;
return 0;
}
Note that the code contains stringstream and cout but these are never actually called. However, the presence of these three lines of code increases the run time from 2.9 to 3.3 seconds. This is in release mode on VS2013. Curiously, if I build in GCC using '-O3' flag the extra three lines of code actually decrease the runtime by half a second or so.
I understand that the extra code could impact the resulting executable in a number of ways, such as by preventing inlining or causing more cache misses. The real question is whether there is anything I can do to improve on this situation? Switching to sprintf()/printf() doesn't seem to make a difference. Do I need to simply accept that adding such logging code to small functions will affect performance even if not called?
Note: For completeness, my real/full scenario is that I use a wrapper macro to throw exceptions, and I like to log when such an exception is thrown. So when I call THROW_EXCEPT(...) it inserts code similar to that shown above and then throws. This then hurts when I throw exceptions from inside a small function. Are there any better alternatives here?
Edit: Here is a VS2013 solution for quick testing, and so compiler settings can be checked: https://drive.google.com/file/d/0B7b4UnjhhIiEamFyS0hjSnVzbGM/view?usp=sharing
So I initially thought that this was due to branch prediction and optimising out branches so I took a look at the annotated assembly for when the code is commented out:
if (value == 0)
00E21371 mov ecx,1
00E21376 cmove eax,ecx
{
value++;
Here we see that the compiler has helpfully optimised out our branch, so what if we put in a more complex statement to prevent it from doing so:
if (value == 0)
00AE1371 jne getValue+99h (0AE1379h)
{
value /= value;
00AE1373 xor edx,edx
00AE1375 xor ecx,ecx
00AE1377 div eax,ecx
Here the branch is left in, but this runs about as fast as the previous example with the following lines commented out. So let's have a look at the assembly with those lines left in:
if (value == 0)
008F13A0 jne getValue+20Bh (08F14EBh)
{
value++;
std::stringstream ss;
008F13A6 lea ecx,[ebp-58h]
008F13A9 mov dword ptr [ss],8F32B4h
008F13B3 mov dword ptr [ebp-0B0h],8F32F4h
008F13BD call dword ptr ds:[8F30A4h]
008F13C3 push 0
008F13C5 lea eax,[ebp-0A8h]
008F13CB mov dword ptr [ebp-4],0
008F13D2 push eax
008F13D3 lea ecx,[ss]
008F13D9 mov dword ptr [ebp-10h],1
008F13E0 call dword ptr ds:[8F30A0h]
008F13E6 mov dword ptr [ebp-4],1
008F13ED mov eax,dword ptr [ss]
008F13F3 mov eax,dword ptr [eax+4]
008F13F6 mov dword ptr ss[eax],8F32B0h
008F1401 mov eax,dword ptr [ss]
008F1407 mov ecx,dword ptr [eax+4]
008F140A lea eax,[ecx-68h]
008F140D mov dword ptr [ebp+ecx-0C4h],eax
008F1414 lea ecx,[ebp-0A8h]
008F141A call dword ptr ds:[8F30B0h]
008F1420 mov dword ptr [ebp-4],0FFFFFFFFh
That's a lot of instructions if that branch is ever hit. So what if we try something else?
if (value == 0)
011F1371 jne getValue+0A6h (011F1386h)
{
value++;
printf("This never gets printed, but commenting out these three lines improves performance.");
011F1373 push 11F31D0h
011F1378 call dword ptr ds:[11F30ECh]
011F137E add esp,4
Here we have far fewer instructions and once again it runs as quickly as with all lines commented out.
So I'm not sure I can say for certain exactly what is happening here but I feel at the moment it is a combination of branch prediction and CPU instruction cache misses.
In order to solve this problem you could move the logging into a function like so:
void log()
{
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
and
if (value == 0)
{
value++;
log();
Then it runs as fast as before with all those instructions replaced with a single call log (011C12E0h).
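For the THROW_EXCEPT scenario mentioned in the question, the same idea can be sketched as an out-of-line helper that both logs and throws. The names logAndThrow and checkedValue are hypothetical, and the NOINLINE macro is an assumption about how to ask each compiler to keep the slow path out of the caller; it is a request to the optimizer, not something the standard guarantees.

```cpp
#include <cstdint>
#include <iostream>
#include <sstream>
#include <stdexcept>

#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Hypothetical helper: all stream construction and formatting lives here,
// so the hot function only contains a compare, a branch and a call.
[[noreturn]] NOINLINE void logAndThrow(const char* message)
{
    std::stringstream ss;
    ss << "Exception: " << message << '\n';
    std::cerr << ss.str();
    throw std::runtime_error(message);
}

// Hypothetical small, frequently called function using the helper.
std::uint32_t checkedValue(std::uint32_t value)
{
    if (value == 0)
    {
        logAndThrow("value must not be zero");
    }
    return value;
}
```

A macro such as THROW_EXCEPT could expand to a call to a helper like this instead of expanding the logging code inline at every use.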

Exceptions, move semantics and optimizations: at compiler's mercy (MSVC2010)?

While upgrading my old exception class hierarchy to use some C++11 features, I did some speed tests and came across results that are somewhat frustrating. All of this was done with the x64 MSVC++ 2010 compiler with maximum speed optimization, /O2.
Two very simple structs, both with bitwise copy semantics. One has no move assignment operator (why would you need one?), the other has one. Two simple inlined functions return newly created instances of these structs by value, and the results get assigned to local variables. Also notice the try/catch block around everything. Here is the code:
#include <iostream>
#include <windows.h>
struct TFoo
{
unsigned long long int m0;
unsigned long long int m1;
TFoo( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}
};
struct TBar
{
unsigned long long int m0;
unsigned long long int m1;
TBar( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}
TBar & operator=( TBar && f )
{
m0 = f.m0;
m1 = f.m1;
f.m0 = f.m1 = 0;
return ( *this );
}
};
TFoo MakeFoo( unsigned long long int f )
{
return ( TFoo( f ) );
}
TBar MakeBar( unsigned long long int f )
{
return ( TBar( f ) );
}
int main( void )
{
try
{
unsigned long long int lMin = 0;
unsigned long long int lMax = 20000000;
LARGE_INTEGER lStart = { 0 };
LARGE_INTEGER lEnd = { 0 };
TFoo lFoo( 0 );
TBar lBar( 0 );
::QueryPerformanceCounter( &lStart );
for( auto i = lMin; i < lMax; i++ )
{
lFoo = MakeFoo( i );
}
::QueryPerformanceCounter( &lEnd );
std::cout << "lFoo = ( " << lFoo.m0 << " , " << lFoo.m1 << " )\t\tMakeFoo count : " << lEnd.QuadPart - lStart.QuadPart << std::endl;
::QueryPerformanceCounter( &lStart );
for( auto i = lMin; i < lMax; i++ )
{
lBar = MakeBar( i );
}
::QueryPerformanceCounter( &lEnd );
std::cout << "lBar = ( " << lBar.m0 << " , " << lBar.m1 << " )\t\tMakeBar count : " << lEnd.QuadPart - lStart.QuadPart << std::endl;
}
catch( ... ){}
return ( 0 );
}
Program output:
lFoo = ( 19999999 , 9999999 ) MakeFoo count : 428652
lBar = ( 19999999 , 9999999 ) MakeBar count : 74518
Assembler for both loops (showing surrounding counter calls ) :
//- MakeFoo loop START --------------------------------
00000001`3f4388aa 488d4810 lea rcx,[rax+10h]
00000001`3f4388ae ff1594db0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
00000001`3f4388b4 448bdf mov r11d,edi
00000001`3f4388b7 48897c2428 mov qword ptr [rsp+28h],rdi
00000001`3f4388bc 0f1f4000 nop dword ptr [rax]
00000001`3f4388c0 4981fb002d3101 cmp r11,1312D00h
00000001`3f4388c7 732a jae Prototype_Console!main+0x83 (00000001`3f4388f3)
00000001`3f4388c9 4c895c2450 mov qword ptr [rsp+50h],r11
00000001`3f4388ce 498bc3 mov rax,r11
00000001`3f4388d1 48d1e8 shr rax,1
00000001`3f4388d4 4889442458 mov qword ptr [rsp+58h],rax // these 3 lines
00000001`3f4388d9 0f28442450 movaps xmm0,xmmword ptr [rsp+50h] // are of interest
00000001`3f4388de 660f7f442430 movdqa xmmword ptr [rsp+30h],xmm0 // see MakeBar
00000001`3f4388e4 49ffc3 inc r11
00000001`3f4388e7 4c895c2428 mov qword ptr [rsp+28h],r11
00000001`3f4388ec 4c8b6c2438 mov r13,qword ptr [rsp+38h] // this one too
00000001`3f4388f1 ebcd jmp Prototype_Console!main+0x50 (00000001`3f4388c0)
00000001`3f4388f3 488d8c24c0000000 lea rcx,[rsp+0C0h]
00000001`3f4388fb ff1547db0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
//- MakeFoo loop END --------------------------------
//- MakeBar loop START --------------------------------
00000001`3f4389d1 488d8c24c8000000 lea rcx,[rsp+0C8h]
00000001`3f4389d9 ff1569da0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
00000001`3f4389df 4c8bdf mov r11,rdi
00000001`3f4389e2 48897c2440 mov qword ptr [rsp+40h],rdi
00000001`3f4389e7 4981fb002d3101 cmp r11,1312D00h
00000001`3f4389ee 7322 jae Prototype_Console!main+0x1a2 (00000001`3f438a12)
00000001`3f4389f0 4c895c2478 mov qword ptr [rsp+78h],r11
00000001`3f4389f5 498bf3 mov rsi,r11
00000001`3f4389f8 48d1ee shr rsi,1
00000001`3f4389fb 4d8be3 mov r12,r11 // these 3 lines
00000001`3f4389fe 4c895c2468 mov qword ptr [rsp+68h],r11 // are of interest
00000001`3f438a03 48897c2478 mov qword ptr [rsp+78h],rdi // see MakeFoo
00000001`3f438a08 49ffc3 inc r11
00000001`3f438a0b 4c895c2440 mov qword ptr [rsp+40h],r11
00000001`3f438a10 ebd5 jmp Prototype_Console!main+0x177 (00000001`3f4389e7)
00000001`3f438a12 488d8c24c0000000 lea rcx,[rsp+0C0h]
00000001`3f438a1a ff1528da0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
//- MakeBar loop END --------------------------------
Both times are the same if I remove the try/catch block. But in its presence, the compiler clearly optimizes the code better for the struct with the redundant move operator=. Also, the MakeFoo time does depend on the size of TFoo and its layout, but in general it is several times worse than for MakeBar, whose time does not depend on small size changes.
Questions:
Is it a compiler-specific feature of MSVC++ 2010 (could someone check with GCC?)?
Is it because the compiler has to preserve the temporary until the call finishes, so it cannot "rip it apart" in the case of MakeFoo, whereas in the case of MakeBar it knows that we allow it to use move semantics and it "rips it apart", generating faster code?
Can I expect the same behavior for similar things without a try/catch block, but in more complicated scenarios?
Your test is flawed. When compiled with /O2 /EHsc, it runs to completion in a fraction of a second and there is high variability in the results of the test.
I retried the same test, but ran it for 100x as many iterations, with the following results (the results were similar across several runs of the test):
lFoo = ( 1999999999 , 999999999 ) MakeFoo count : 16584927
lBar = ( 1999999999 , 999999999 ) MakeBar count : 16613002
Your test does not show any difference between the performance of assignment of the two types.
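One way to reduce the run-to-run variability is to repeat the timed loop several times and keep the best result. The sketch below is a portable variant using std::chrono::steady_clock rather than QueryPerformanceCounter; Measurement and measureMakeFoo are hypothetical names, and it is not the harness that produced the numbers above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

struct TFoo
{
    std::uint64_t m0;
    std::uint64_t m1;
    explicit TFoo(std::uint64_t f) : m0(f), m1(f / 2) {}
};

inline TFoo MakeFoo(std::uint64_t f) { return TFoo(f); }

// Hypothetical result type: the last value assigned and the fastest repeat.
struct Measurement
{
    TFoo last;
    std::chrono::nanoseconds best;
};

Measurement measureMakeFoo(std::uint64_t iterations, int repeats)
{
    using clock = std::chrono::steady_clock;
    TFoo foo(0);
    auto best = std::chrono::nanoseconds::max();
    for (int r = 0; r < repeats; ++r)
    {
        const auto start = clock::now();
        for (std::uint64_t i = 0; i < iterations; ++i)
        {
            foo = MakeFoo(i);
        }
        const auto elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start);
        best = std::min(best, elapsed); // keep the least disturbed run
    }
    // Returning foo keeps the result observable; an optimizer may still
    // simplify the loop, so the generated code is worth checking too.
    return {foo, best};
}
```

The same function can be duplicated for TBar to compare the two types under identical conditions.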