DWCAS alternative with no help of the kernel - C++

I just wanted to test whether my compiler recognizes ...
atomic<pair<uintptr_t, uintptr_t>>
... and uses DWCASes on it (like x86-64 lock cmpxchg16b), or whether it supplements the pair with a usual lock.
So I first wrote a minimal program with a single noinline function which does a compare-and-swap on an atomic pair. The compiler generated a lot of code for this which I didn't understand, and I didn't see any LOCK-prefixed instructions in it. I was curious whether the implementation places a lock within the atomic and printed the sizeof of the above atomic pair: 24 on a 64-bit platform, i.e. only 8 bytes beyond the data itself, so obviously without a full lock object like a mutex inside.
At last I wrote a program which increments both halves of a single atomic pair from all the threads my system has (Ryzen Threadripper 64 core, Win10, SMT off) a predefined number of times. Then I calculated the time for each increment in nanoseconds. The time is rather high, about 20,000 ns for each successful increment, so at first it looked to me as if there was a kernel-based lock I had overlooked; but that can't be true with a sizeof of this atomic of 24 bytes. And when I looked at the process viewer I saw that all 64 cores were at nearly 100% user CPU time the whole run, so there couldn't be any kernel intervention.
Is there anyone here smarter than me who can identify what this DWCAS substitute does from the assembly dump?
Here's my test code:
#include <iostream>
#include <atomic>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <vector>

using namespace std;
using namespace chrono;

struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

int main()
{
    cout << "sizeof(atomic<pair<uintptr_t, uintptr_t>>): " << sizeof(atomic_pair) << endl;
    atomic_pair ap( uip_pair( 0, 0 ) );
    cout << "atomic<pair<uintptr_t, uintptr_t>>::is_lock_free: " << ap.is_lock_free() << endl;
    mutex mtx;
    unsigned nThreads = thread::hardware_concurrency();
    unsigned ready = nThreads;
    condition_variable cvReady;
    bool run = false;
    condition_variable cvRun;
    atomic_int64_t sumDur = 0;
    auto theThread = [&]( size_t n )
    {
        unique_lock<mutex> lock( mtx );
        if( !--ready )
            cvReady.notify_one();
        cvRun.wait( lock, [&]() -> bool { return run; } );
        lock.unlock();
        auto start = high_resolution_clock::now();
        uip_pair cmp = ap.load( memory_order_relaxed );
        for( ; n--; )
            while( !ap.compare_exchange_weak( cmp, uip_pair( cmp.first + 1, cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
        sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        lock.lock();
    };
    vector<jthread> threads;
    threads.reserve( nThreads );
    static size_t const ROUNDS = 100'000;
    for( unsigned t = nThreads; t--; )
        threads.emplace_back( theThread, ROUNDS );
    unique_lock<mutex> lock( mtx );
    cvReady.wait( lock, [&]() -> bool { return !ready; } );
    run = true;
    cvRun.notify_all();
    lock.unlock();
    for( jthread &thr : threads )
        thr.join();
    cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
    uip_pair p = ap.load( memory_order_relaxed );
    cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}
[EDIT]: I've extracted the compare_exchange_weak call into a noinline function and disassembled the code:
#include <atomic>
#include <cstdint>

using namespace std;

struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#endif

NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
    return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
}
mov eax, 1
mov r10, rcx
mov r9d, eax
xchg DWORD PTR [rcx], eax
test eax, eax
je SHORT label8
label1:
mov eax, DWORD PTR [rcx]
test eax, eax
je SHORT label7
label2:
mov eax, r9d
test r9d, r9d
je SHORT label5
label4:
pause
sub eax, 1
jne SHORT label4
cmp r9d, 64
jl SHORT label5
lea r9d, QWORD PTR [rax+64]
jmp SHORT label6
label5:
add r9d, r9d
label6:
mov eax, DWORD PTR [rcx]
test eax, eax
jne SHORT label2
label7:
mov eax, 1
xchg DWORD PTR [rcx], eax
test eax, eax
jne SHORT label1
label8:
mov rax, QWORD PTR [rcx+8]
sub rax, QWORD PTR [rdx]
jne SHORT label9
mov rax, QWORD PTR [rcx+16]
sub rax, QWORD PTR [rdx+8]
label9:
test rax, rax
sete al
test al, al
je SHORT label10
movups xmm0, XMMWORD PTR [r8]
movups XMMWORD PTR [rcx+8], xmm0
xor ecx, ecx
xchg DWORD PTR [r10], ecx
ret
label10:
movups xmm0, XMMWORD PTR [rcx+8]
xor ecx, ecx
movups XMMWORD PTR [rdx], xmm0
xchg DWORD PTR [r10], ecx
ret
Maybe someone understands the disassembly. Remember that XCHG is implicitly LOCK'ed on x86. It seems to me that MSVC uses some kind of software transactional memory here. I can extend the shared structure embedded in the atomic arbitrarily, but the size difference is still 8 bytes; so MSVC always seems to use some kind of STM.
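For what it's worth, here is a minimal sketch (my own check, not from the original post) to verify the claimed constant 8-byte overhead for differently sized payloads:
#include <atomic>
#include <cstdio>

template<std::size_t N>
struct blob { unsigned char data[N]; };

int main()
{
    // If the observation above holds, each printed atomic size is exactly
    // payload size + 8 on MSVC x64 (other compilers may differ).
    printf( "%zu -> %zu\n", sizeof(blob<16>), sizeof(std::atomic<blob<16>>) );
    printf( "%zu -> %zu\n", sizeof(blob<32>), sizeof(std::atomic<blob<32>>) );
    printf( "%zu -> %zu\n", sizeof(blob<64>), sizeof(std::atomic<blob<64>>) );
}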

As Nate pointed out in comments, it is a spinlock.
You can look up the source; it ships with the compiler and is available on GitHub.
If you build an unoptimized debug configuration, you can step into this source during interactive debugging!
There's a member variable called _Spinlock.
And here's the locking function:
#if 1 // TRANSITION, ABI, GH-1151
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
    while (_InterlockedExchange(&_Spinlock, 1) != 0) { // TRANSITION, GH-1133: _InterlockedExchange_acq
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            __yield();
        }
    }
#else // ^^^ defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) ^^^
#error Unsupported hardware
#endif
}
(Disclosure: I brought this increasing back-off from the Intel manual into there; it was just an xchg loop before. See the issue and the PR.)
Spinlock use is known to be suboptimal; a mutex that does a kernel wait should have been used instead. The problem with a spinlock is that, in the rare case when a context switch happens while a low-priority thread holds the spinlock, it will take a while until that spinlock is released, as the scheduler is not aware of the high-priority thread waiting on it.
Sure, not using cmpxchg16b is also suboptimal. Still, for bigger atomics a non-lock-free mechanism has to be used. (There was no deliberate decision to avoid cmpxchg16b; it is just a consequence of ABI compatibility back to Visual Studio 2015.)
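To illustrate the mutex alternative, here is a minimal sketch (hypothetical code, not the STL's implementation) of a kernel-waiting guarded pair; contended waiters block in the kernel instead of burning CPU in a spin loop:
#include <cstdint>
#include <mutex>

struct uip_pair { std::uintptr_t first, second; };

// Hypothetical mutex-guarded 16-byte compare-exchange. The mutex makes the
// operation non-lock-free, but waiters sleep instead of spinning.
struct locked_pair
{
    bool compare_exchange( uip_pair &expected, uip_pair desired )
    {
        std::lock_guard<std::mutex> guard( mtx );
        if( value.first == expected.first && value.second == expected.second )
        {
            value = desired;
            return true;
        }
        expected = value; // report the current value, like compare_exchange does
        return false;
    }
    uip_pair value { 0, 0 };
    std::mutex mtx;
};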
There's an issue about making it better, that will hopefully be addressed with the next ABI break: https://github.com/microsoft/STL/issues/1151
As for transactional memory: it might make sense to use hardware transactional memory there. I can speculate that Intel RTM could possibly be used there via intrinsics, or there may be some future OS API for it (like an enhanced SRWLOCK), but it is likely that nobody will want more complexity there, as a non-lock-free atomic is a compatibility facility, not something you would deliberately want to use.
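For the curious, a rough sketch of what such an RTM path might look like with the _xbegin/_xend/_xabort intrinsics (purely speculative, requires RTM-capable hardware, and emphatically not what the MSVC STL does):
#include <immintrin.h>
#include <atomic>
#include <cstdint>

struct pair64 { std::uint64_t first, second; };

std::atomic<int> fallback_lock { 0 };

// Speculative 16-byte CAS via transactional lock elision, with a plain
// spinlock as the fallback path when the transaction aborts.
bool cas_pair( pair64 &target, pair64 &expected, pair64 desired )
{
    if( _xbegin() == _XBEGIN_STARTED )
    {
        // Read the fallback lock inside the transaction so that a concurrent
        // fallback-path holder aborts this transaction.
        if( fallback_lock.load( std::memory_order_relaxed ) != 0 )
            _xabort( 0xff );
        bool match = target.first == expected.first && target.second == expected.second;
        if( match )
            target = desired;
        else
            expected = target;
        _xend();
        return match;
    }
    // Fallback: simple spinlock (a production version would add back-off).
    while( fallback_lock.exchange( 1, std::memory_order_acquire ) )
        _mm_pause();
    bool match = target.first == expected.first && target.second == expected.second;
    if( match )
        target = desired;
    else
        expected = target;
    fallback_lock.store( 0, std::memory_order_release );
    return match;
}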

Result of TLS variable access not cached

Edit: It seems this is a compiler bug indeed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82803
I am writing a wrapper for writing logs that uses TLS to store a std::stringstream buffer. This code will be used by shared libraries. Looking at the code on godbolt.org, it seems that neither gcc nor clang will cache the result of a TLS lookup (the loop repeatedly calls __tls_get_addr()), although I believe I have designed my class in a way that should allow it.
#include <sstream>

class LogStream
{
public:
    LogStream()
        : m_buffer(getBuffer())
    {
    }

    LogStream(std::stringstream& buffer)
        : m_buffer(buffer)
    {
    }

    static std::stringstream& getBuffer()
    {
        thread_local std::stringstream buffer;
        return buffer;
    }

    template <typename T>
    inline LogStream& operator<<(const T& t)
    {
        m_buffer << t;
        return *this;
    }

private:
    std::stringstream& m_buffer;
};

int main()
{
    LogStream log{};
    for (int i = 0; i < 12345678; ++i)
    {
        log << i;
    }
}
Looking at the assembly output, both gcc and clang generate pretty similar code:
clang 5.0.0:
xor ebx, ebx
.LBB0_3: # =>This Inner Loop Header: Depth=1
data16
lea rdi, [rip + LogStream::getBuffer[abi:cxx11]()::buffer[abi:cxx11]@TLSGD]
data16
data16
rex64
call __tls_get_addr@PLT // Called on every loop iteration.
lea rdi, [rax + 16]
mov esi, ebx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
inc ebx
cmp ebx, 12345678
jne .LBB0_3
gcc 7.2:
xor ebx, ebx
.L3:
lea rdi, guard variable for LogStream::getBuffer[abi:cxx11]()::buffer@tlsld[rip]
call __tls_get_addr@PLT // Called on every loop iteration.
mov esi, ebx
add ebx, 1
lea rdi, LogStream::getBuffer[abi:cxx11]()::buffer@dtpoff[rax+16]
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)@PLT
cmp ebx, 12345678
jne .L3
How can I convince both compilers that the lookup doesn't need to be repeatedly done?
Compiler options: -std=c++11 -O3 -fPIC
Godbolt link
This really looks like an optimization bug in both Clang and GCC.
Here's what I think happens. (I might be completely off.) The compiler completely inlines everything down to this code:
int main()
{
    // pseudo-access
    std::stringstream& m_buffer = LogStream::getBuffer::buffer;
    for (int i = 0; i < 12345678; ++i)
    {
        m_buffer << i;
    }
}
And then, not realizing that access to a thread-local is very expensive under -fPIC, it decides that the temporary reference to the global is not necessary and inlines that as well:
int main()
{
    for (int i = 0; i < 12345678; ++i)
    {
        // pseudo-access
        LogStream::getBuffer::buffer << i;
    }
}
Whatever actually happens, this is clearly a pessimization of the code you wrote. You should report this as a bug to GCC and Clang.
GCC bugtracker: https://gcc.gnu.org/bugzilla/
Clang bugtracker: https://bugs.llvm.org/
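Until the bug is fixed, one workaround worth trying (an untested sketch; given that the compiler proved willing to inline the reference away, it may or may not defeat the pessimization) is to take the TLS address once yourself and pass it through the explicit constructor:
int main()
{
    // One __tls_get_addr call up front; the loop body then only uses the
    // already-computed reference stored inside the LogStream.
    std::stringstream& buffer = LogStream::getBuffer();
    LogStream log{buffer};
    for (int i = 0; i < 12345678; ++i)
    {
        log << i;
    }
}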

Excessive stack usage for simple function in debug build

I have a simple class using a kind of ATL database access.
All functions are defined in a header file.
The problematic functions all do the same thing. There are some macros in use; the generated code looks like this:
void InitBindings()
{
    if (sName)                // Static global char*
        m_sTableName = sName; // Save into member
    { AddCol("Name", some_constant_data... _GetOleDBType(...), ...); };
    { AddCol("Name1", some_other_constant_data_GetOleDBType(...), ...); };
    ...
}
AddCol returns a reference to a structure, but as you can see it is ignored.
When I look at the assembler code of a function with 6 AddCol calls, I can see that the function requires 2176 bytes of stack space. I have functions that require 20 KB and more. And in the debugger I can see that the stack isn't used at all (it is all initialized to 0xCC and never touched).
See the assembler code at the end.
The problem can be seen with VS 2015 and VS 2017, but only in Debug mode.
In Release mode the function reserves no extra stack space at all.
The only rule I see is: more AddCol calls cause more stack to be reserved; approximately 500 bytes are reserved per AddCol call.
Again: the function returns no object; AddCol returns a reference to the binding information.
I already used the following pragmas in front of the function (but inside the class definition in the header):
__pragma(runtime_checks("", off)) __pragma(optimize("ts", on)) __pragma(strict_gs_check(push, off))
But to no avail. These pragmas should turn optimization on and switch off runtime checks and stack checks. How can I reduce this unneeded stack space that gets allocated? In some cases I see stack overflows in the debug version when these functions are used. No problems in the release version.
; 325 : BIND_BEGIN(CMasterData, _T("tblMasterData"))
push ebp
mov ebp, esp
sub esp, 2176 ; 00000880H
push ebx
push esi
push edi
mov DWORD PTR _this$[ebp], ecx
mov eax, OFFSET ??_C@_1BM@GOLNKAI@?$AAt?$AAb?$AAl?$AAM?$AAa?$AAs?$AAt?$AAe?$AAr?$AAD?$AAa?$AAt?$AAa?$AA?$AA@
test eax, eax
je SHORT $LN2@InitBindin
push OFFSET ??_C@_1BM@GOLNKAI@?$AAt?$AAb?$AAl?$AAM?$AAa?$AAs?$AAt?$AAe?$AAr?$AAD?$AAa?$AAt?$AAa?$AA?$AA@
mov ecx, DWORD PTR _this$[ebp]
add ecx, 136 ; 00000088H
call DWORD PTR __imp_??4?$CStringT@_WV?$StrTraitMFC_DLL@_WV?$ChTraitsCRT@_W@ATL@@@@@ATL@@QAEAAV01@PB_W@Z
$LN2@InitBindin:
; 326 : // Columns:
; 327 : B$C_IDENT (_T("Id"), m_lId);
push 0
push 0
push 1
push 4
push 0
call ?_GetOleDBType@ATL@@YAGAAJ@Z ; ATL::_GetOleDBType
add esp, 4
movzx eax, ax
push eax
push 0
push OFFSET ??_C@_15NCCOGFKM@?$AAI?$AAd?$AA?$AA@
mov ecx, DWORD PTR _this$[ebp]
call ?AddCol@CDBAccess@DB@@QAEAAUS_BIND@2@PB_WKGKW4TYPE@32@0_N@Z ; DB::CDBAccess::AddCol
; 328 : B$C (_T("Name"), m_szName);
push 0
push 0
push 0
push 122 ; 0000007aH
mov eax, 4
push eax
call ?_GetOleDBType@ATL@@YAGQA_W@Z ; ATL::_GetOleDBType
add esp, 4
movzx ecx, ax
push ecx
push 4
push OFFSET ??_C@_19DINFBLAK@?$AAN?$AAa?$AAm?$AAe?$AA?$AA@
mov ecx, DWORD PTR _this$[ebp]
call ?AddCol@CDBAccess@DB@@QAEAAUS_BIND@2@PB_WKGKW4TYPE@32@0_N@Z ; DB::CDBAccess::AddCol
; 329 : B$C (_T("Data"), m_data);
push 0
push 0
push 0
push 4
push 128 ; 00000080H
call ?_GetOleDBType@ATL@@YAGAAVCComBSTR@1@@Z ; ATL::_GetOleDBType
add esp, 4
movzx eax, ax
push eax
push 128 ; 00000080H
push OFFSET ??_C@_19IEEMEPMH@?$AAD?$AAa?$AAt?$AAa?$AA?$AA@
mov ecx, DWORD PTR _this$[ebp]
call ?AddCol@CDBAccess@DB@@QAEAAUS_BIND@2@PB_WKGKW4TYPE@32@0_N@Z ; DB::CDBAccess::AddCol
It is a compiler bug, already known on Connect.
EDIT: The problem seems to be fixed in VS 2017 15.5.1.
The problem has to do with a bug in the built-in offsetof.
It is not possible for me to #undef _CRT_USE_BUILTIN_OFFSETOF as described in this case.
For me the only thing that works is to #undef offsetof and use one of these:
#define myoffsetof1(s,m) ((size_t)&reinterpret_cast<char const volatile&>((((s*)0)->m)))
#define myoffsetof2(s, m) ((size_t)&(((s*)0)->m))
#undef offsetof
#define offsetof myoffsetof1
All ATL DB consumers are affected.
Here is a minimal repro that shows the bug. Set a breakpoint on the Init function, look at the assembler code, and wonder how much stack is used!
// StackUsage.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <string>
#include <list>
#include <iostream>

using namespace std;

struct CRec
{
    char t1[20];
    char t2[20];
    char t3[20];
    char t4[20];
    char t5[20];
    int i1, i2, i3, i4, i5;
    GUID g1, g2, g3, g4, g5;
    DBTIMESTAMP d1, d2, d3, d4, d5;
};

#define sizeofmember(s,m) sizeof(reinterpret_cast<const s *>(0)->m)
#define typeofmember(c,m) _GetOleDBType(((c*)0)->m)
#define myoffsetof1(s,m) ((size_t)&reinterpret_cast<char const volatile&>((((s*)0)->m)))
#define myoffsetof2(s, m) ((size_t)&(((s*)0)->m))

// Uncomment these lines to fix the bug
// #undef offsetof
// #define offsetof myoffsetof1

#define COL(n,v) { AddCol(n,offsetof(CRec,v),typeofmember(CRec,v),sizeofmember(CRec,v)); }

class CFoo
{
public:
    CFoo()
    {
        Init();
    }

    void Init()
    {
        COL("t1", t1);
        COL("t2", t2);
        COL("t3", t3);
        COL("t4", t4);
        COL("t5", t5);
        COL("i1", i1);
        COL("i2", i2);
        COL("i3", i3);
        COL("i4", i4);
        COL("i5", i5);
        COL("g1", g1);
        COL("g2", g2);
        COL("g2", g3);
        COL("g2", g4);
        COL("g2", g5);
        COL("d1", d1);
        COL("d2", d2);
        COL("d2", d3);
        COL("d2", d4);
        COL("d2", d5);
    }

    void AddCol(PCSTR szName, ULONG nOffset, DBTYPE wType, ULONG nSize)
    {
        cout << szName << '\t' << nOffset << '\t' << wType << '\t' << nSize << endl;
    }
};

int main()
{
    CFoo foo;
    return 0;
}

The compiler decided to call the function `POW`, instead of evaluating it at compile time. Why?

I got the snippet below from this comment by @CaffeineAddict.
#include <iostream>

template<typename base_t, typename expo_t>
constexpr base_t POW(base_t base, expo_t expo)
{
    return (expo != 0) ? base * POW(base, expo - 1) : 1;
}

int main(int argc, char** argv)
{
    std::cout << POW((unsigned __int64)2, 63) << std::endl;
    return 0;
}
with the following disassembly obtained from VS2015:
int main(int argc, char** argv)
{
009418A0 push ebp
009418A1 mov ebp,esp
009418A3 sub esp,0C0h
009418A9 push ebx
009418AA push esi
009418AB push edi
009418AC lea edi,[ebp-0C0h]
009418B2 mov ecx,30h
009418B7 mov eax,0CCCCCCCCh
009418BC rep stos dword ptr es:[edi]
std::cout << POW((unsigned __int64)2, 63) << std::endl;
009418BE mov esi,esp
009418C0 push offset std::endl<char,std::char_traits<char> > (0941064h)
009418C5 push 3Fh
009418C7 push 0
009418C9 push 2
009418CB call POW<unsigned __int64,int> (09410FAh) <<========
009418D0 add esp,0Ch
009418D3 mov edi,esp
009418D5 push edx
009418D6 push eax
009418D7 mov ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (094A098h)]
009418DD call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (094A0ACh)]
009418E3 cmp edi,esp
009418E5 call __RTC_CheckEsp (0941127h)
009418EA mov ecx,eax
009418EC call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (094A0B0h)]
009418F2 cmp esi,esp
009418F4 call __RTC_CheckEsp (0941127h)
return 0;
009418F9 xor eax,eax
}
which shows (see the characters "<<======" I introduced in the disassembly) that the compiler didn't evaluate the function POW at compile time. From his comment, @CaffeineAddict seemed to expect this behavior from the compiler. But I still can't understand why this was expected at all.
Two reasons.
First, a constexpr function isn't guaranteed to be called at compile time. To force the compiler to evaluate it at compile time, you must store the result in a constexpr variable first, i.e.
constexpr auto pow = POW((unsigned __int64)2, 63);
std::cout << pow << std::endl;
Secondly, you must build the project in Release configuration. In VS, you'll find you can break-point through constexpr functions if you have built the project using Debug configuration.
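A further note (standard C++, not specific to VS): any context that requires a constant expression also forces compile-time evaluation, for example:
// This fails to compile unless POW is evaluated at compile time.
static_assert(POW((unsigned __int64)2, 3) == 8, "POW must be a constant expression");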

Exceptions, move semantics and optimizations: at compiler's mercy (MSVC2010)?

While doing some upgrades to my old exception-class hierarchy to utilize some C++11 features, I did some speed tests and came across results that are somewhat frustrating. All of this was done with the 64-bit MSVC++ 2010 compiler, maximum speed optimization /O2.
Two very simple structs, both with bitwise copy semantics. One without a move assignment operator (why would you need one?), the other with one. Two simple inlined functions returning newly created instances of these structs by value, which get assigned to a local variable. Also, notice the try/catch block around the code. Here is the code:
#include <iostream>
#include <windows.h>

struct TFoo
{
    unsigned long long int m0;
    unsigned long long int m1;

    TFoo( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}
};

struct TBar
{
    unsigned long long int m0;
    unsigned long long int m1;

    TBar( unsigned long long int f ) : m0( f ), m1( f / 2 ) {}

    TBar & operator=( TBar && f )
    {
        m0 = f.m0;
        m1 = f.m1;
        f.m0 = f.m1 = 0;
        return ( *this );
    }
};

TFoo MakeFoo( unsigned long long int f )
{
    return ( TFoo( f ) );
}

TBar MakeBar( unsigned long long int f )
{
    return ( TBar( f ) );
}

int main( void )
{
    try
    {
        unsigned long long int lMin = 0;
        unsigned long long int lMax = 20000000;
        LARGE_INTEGER lStart = { 0 };
        LARGE_INTEGER lEnd = { 0 };
        TFoo lFoo( 0 );
        TBar lBar( 0 );

        ::QueryPerformanceCounter( &lStart );
        for( auto i = lMin; i < lMax; i++ )
        {
            lFoo = MakeFoo( i );
        }
        ::QueryPerformanceCounter( &lEnd );
        std::cout << "lFoo = ( " << lFoo.m0 << " , " << lFoo.m1 << " )\t\tMakeFoo count : " << lEnd.QuadPart - lStart.QuadPart << std::endl;

        ::QueryPerformanceCounter( &lStart );
        for( auto i = lMin; i < lMax; i++ )
        {
            lBar = MakeBar( i );
        }
        ::QueryPerformanceCounter( &lEnd );
        std::cout << "lBar = ( " << lBar.m0 << " , " << lBar.m1 << " )\t\tMakeBar count : " << lEnd.QuadPart - lStart.QuadPart << std::endl;
    }
    catch( ... ) {}

    return ( 0 );
}
Program output:
lFoo = ( 19999999 , 9999999 ) MakeFoo count : 428652
lBar = ( 19999999 , 9999999 ) MakeBar count : 74518
Assembler for both loops (showing the surrounding counter calls):
//- MakeFoo loop START --------------------------------
00000001`3f4388aa 488d4810 lea rcx,[rax+10h]
00000001`3f4388ae ff1594db0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
00000001`3f4388b4 448bdf mov r11d,edi
00000001`3f4388b7 48897c2428 mov qword ptr [rsp+28h],rdi
00000001`3f4388bc 0f1f4000 nop dword ptr [rax]
00000001`3f4388c0 4981fb002d3101 cmp r11,1312D00h
00000001`3f4388c7 732a jae Prototype_Console!main+0x83 (00000001`3f4388f3)
00000001`3f4388c9 4c895c2450 mov qword ptr [rsp+50h],r11
00000001`3f4388ce 498bc3 mov rax,r11
00000001`3f4388d1 48d1e8 shr rax,1
00000001`3f4388d4 4889442458 mov qword ptr [rsp+58h],rax // these 3 lines
00000001`3f4388d9 0f28442450 movaps xmm0,xmmword ptr [rsp+50h] // are of interest
00000001`3f4388de 660f7f442430 movdqa xmmword ptr [rsp+30h],xmm0 // see MakeBar
00000001`3f4388e4 49ffc3 inc r11
00000001`3f4388e7 4c895c2428 mov qword ptr [rsp+28h],r11
00000001`3f4388ec 4c8b6c2438 mov r13,qword ptr [rsp+38h] // this one too
00000001`3f4388f1 ebcd jmp Prototype_Console!main+0x50 (00000001`3f4388c0)
00000001`3f4388f3 488d8c24c0000000 lea rcx,[rsp+0C0h]
00000001`3f4388fb ff1547db0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
//- MakeFoo loop END --------------------------------
//- MakeBar loop START --------------------------------
00000001`3f4389d1 488d8c24c8000000 lea rcx,[rsp+0C8h]
00000001`3f4389d9 ff1569da0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
00000001`3f4389df 4c8bdf mov r11,rdi
00000001`3f4389e2 48897c2440 mov qword ptr [rsp+40h],rdi
00000001`3f4389e7 4981fb002d3101 cmp r11,1312D00h
00000001`3f4389ee 7322 jae Prototype_Console!main+0x1a2 (00000001`3f438a12)
00000001`3f4389f0 4c895c2478 mov qword ptr [rsp+78h],r11
00000001`3f4389f5 498bf3 mov rsi,r11
00000001`3f4389f8 48d1ee shr rsi,1
00000001`3f4389fb 4d8be3 mov r12,r11 // these 3 lines
00000001`3f4389fe 4c895c2468 mov qword ptr [rsp+68h],r11 // are of interest
00000001`3f438a03 48897c2478 mov qword ptr [rsp+78h],rdi // see MakeFoo
00000001`3f438a08 49ffc3 inc r11
00000001`3f438a0b 4c895c2440 mov qword ptr [rsp+40h],r11
00000001`3f438a10 ebd5 jmp Prototype_Console!main+0x177 (00000001`3f4389e7)
00000001`3f438a12 488d8c24c0000000 lea rcx,[rsp+0C0h]
00000001`3f438a1a ff1528da0400 call qword ptr [Prototype_Console!_imp_QueryPerformanceCounter (00000001`3f486448)]
//- MakeBar loop END --------------------------------
Both times are the same if I remove the try/catch block. But in its presence, the compiler clearly optimizes the code better for the struct with the seemingly redundant move operator=. Also, the MakeFoo time does depend on the size of TFoo and its layout, but in general it is several times worse than for MakeBar, whose time does not depend on small size changes.
Questions:
Is it a compiler-specific feature of MSVC++ 2010 (could someone check for GCC?)?
Is it because the compiler has to preserve the temporary until the call finishes, so it cannot "rip it apart" in the case of MakeFoo, while in the case of MakeBar it knows that we allow it to use move semantics, so it "rips it apart" and generates faster code?
Can I expect the same behavior for similar things without a try/catch block, but in more complicated scenarios?
Your test is flawed. When compiled with /O2 /EHsc, it runs to completion in a fraction of a second and there is high variability in the results of the test.
I retried the same test, but ran it for 100x as many iterations, with the following results (the results were similar across several runs of the test):
lFoo = ( 1999999999 , 999999999 ) MakeFoo count : 16584927
lBar = ( 1999999999 , 999999999 ) MakeBar count : 16613002
Your test does not show any difference between the performance of assignment of the two types.

CPUID implementations in C++

I would like to know if somebody around here has some good examples of a C++ CPUID implementation that can be referenced from any of the managed .NET languages.
Also, should this not be the case, should I be aware of certain implementation differences between x86 and x64?
I would like to use CPUID to get info on the machine my software is running on (crash reporting etc...) and I want to keep everything as widely compatible as possible.
The primary reason I ask is that I am a total noob when it comes to writing what will probably be pure machine instructions, though I have basic knowledge about CPU registers and so on...
Before people start telling me to Google: I found some examples online, but usually they were not meant to allow interaction from managed code, and none of the examples were aimed at both x86 and x64. Most examples appeared to be x86-specific.
Accessing raw CPUID information is actually very easy; here is a C++ class for that which works in Windows, Linux and OSX:
#ifndef CPUID_H
#define CPUID_H

#ifdef _WIN32
#include <limits.h>
#include <intrin.h>
typedef unsigned __int32 uint32_t;
#else
#include <stdint.h>
#endif

class CPUID {
    uint32_t regs[4];

public:
    explicit CPUID(unsigned i) {
#ifdef _WIN32
        __cpuid((int *)regs, (int)i);
#else
        asm volatile
            ("cpuid" : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
             : "a" (i), "c" (0));
        // ECX is set to zero for CPUID function 4
#endif
    }

    const uint32_t &EAX() const {return regs[0];}
    const uint32_t &EBX() const {return regs[1];}
    const uint32_t &ECX() const {return regs[2];}
    const uint32_t &EDX() const {return regs[3];}
};

#endif // CPUID_H
To use it, just instantiate an instance of the class, load the CPUID leaf you are interested in, and examine the registers. For example:
#include "CPUID.h"
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char *argv[]) {
CPUID cpuID(0); // Get CPU vendor
string vendor;
vendor += string((const char *)&cpuID.EBX(), 4);
vendor += string((const char *)&cpuID.EDX(), 4);
vendor += string((const char *)&cpuID.ECX(), 4);
cout << "CPU vendor = " << vendor << endl;
return 0;
}
This Wikipedia page tells you how to use CPUID: http://en.wikipedia.org/wiki/CPUID
EDIT: Added #include <intrin.h> for Windows, per comments.
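As a further usage example, leaf 1 reports feature flags; the bit positions below come from the public CPUID documentation (EDX bit 26 = SSE2, ECX bit 28 = AVX):
#include "CPUID.h"
#include <iostream>

int main() {
    CPUID cpuID(1); // Get feature flags

    // EDX bit 26 = SSE2, ECX bit 28 = AVX.
    std::cout << "SSE2: " << ((cpuID.EDX() >> 26) & 1) << std::endl;
    std::cout << "AVX : " << ((cpuID.ECX() >> 28) & 1) << std::endl;

    return 0;
}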
See this MSDN article about __cpuid.
There is a comprehensive sample that compiles with Visual Studio 2005 or better. For Visual Studio 6, you can use this instead of the compiler intrinsic __cpuid:
void __cpuid(int CPUInfo[4], int InfoType)
{
    __asm
    {
        mov esi, CPUInfo
        mov eax, InfoType
        xor ecx, ecx
        cpuid
        mov dword ptr [esi + 0], eax
        mov dword ptr [esi + 4], ebx
        mov dword ptr [esi + 8], ecx
        mov dword ptr [esi + 12], edx
    }
}
For Visual Studio 2005, you can use this instead of the compiler intrinsic __cpuidex:
void __cpuidex(int CPUInfo[4], int InfoType, int ECXValue)
{
    __asm
    {
        mov esi, CPUInfo
        mov eax, InfoType
        mov ecx, ECXValue
        cpuid
        mov dword ptr [esi + 0], eax
        mov dword ptr [esi + 4], ebx
        mov dword ptr [esi + 8], ecx
        mov dword ptr [esi + 12], edx
    }
}
Might not be exactly what you are looking for, but Intel has a good article and sample code for enumerating Intel 64-bit platform architectures (processor, cache, etc.) which also seems to cover 32-bit x86 processors.
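On the managed-code side of the question: one common approach (a sketch using the CPUID class from the first answer; the DLL and export names are hypothetical) is to wrap the query in a plain C export in a native DLL and P/Invoke it from .NET:
#include "CPUID.h"

// Hypothetical native export. On the managed side, declare it e.g. as
// [DllImport("NativeCpuId.dll")] static extern void GetCpuId(uint leaf, uint[] regs);
extern "C" __declspec(dllexport)
void GetCpuId(unsigned leaf, unsigned regs[4])
{
    CPUID cpuID(leaf);
    regs[0] = cpuID.EAX();
    regs[1] = cpuID.EBX();
    regs[2] = cpuID.ECX();
    regs[3] = cpuID.EDX();
}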