Addressing two unrelated arrays with [base+index*scale] without UB - C++

I'm trying to figure out how to get GCC or Clang to use [base+index*scale] addressing to index two unrelated arrays, like this, to save an add instruction:
;; rdi is dstArr, rsi is srcArr - dstArr (in float elements), rax is the end of dstArr
.Loop:
    mov ecx, dword ptr [rdi + 4*rsi]
    mov dword ptr [rdi], ecx
    add rdi, 4
    cmp rdi, rax
    jb .Loop
The C++ code below achieves this, but because dst and src are unrelated, subtracting them is UB, even though the difference is never dereferenced. (I think it would nonetheless work on all x86-64 hardware.)
#include <cstddef>
#include <cstring>
void offset(float* __restrict__ dst, const float* __restrict__ src, size_t n) {
    const float *enddst = dst + n;
    const std::ptrdiff_t offset = src - dst;
    while (dst < enddst) {
        std::memcpy(dst, /* src */ dst + offset, sizeof(float));
        dst++;
    }
}
(Ignore the silly loop body, it's just for illustration as something using the addresses.)
Is there a way to do this without UB?
godbolt with non-UB and UB version
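One commonly suggested workaround is to do the subtraction on integers instead of pointers: converting through uintptr_t is implementation-defined rather than UB, and round-trips as expected on mainstream x86-64 compilers. A sketch (whether the compiler still folds this into [base+index*scale] addressing is not guaranteed and worth checking on godbolt):
#include <cstddef>
#include <cstdint>
#include <cstring>

void offset_int(float* __restrict__ dst, const float* __restrict__ src, std::size_t n) {
    const float* enddst = dst + n;
    // Byte distance computed on unsigned integers: it wraps if src < dst, and
    // adding it back wraps again to the right address. No pointer arithmetic
    // between unrelated objects, so no UB at this step.
    const std::uintptr_t diff = reinterpret_cast<std::uintptr_t>(src)
                              - reinterpret_cast<std::uintptr_t>(dst);
    while (dst < enddst) {
        const float* s = reinterpret_cast<const float*>(
            reinterpret_cast<std::uintptr_t>(dst) + diff);
        std::memcpy(dst, s, sizeof(float));
        dst++;
    }
}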

Related

DWCAS-alternative with no help of the kernel

I just wanted to test if my compiler recognizes ...
atomic<pair<uintptr_t, uintptr_t>>
... and uses DWCASes on it (like x86-64 lock cmpxchg16b) or if it supplements the pair with a usual lock.
So I first wrote a minimal program with a single noinline function which does a compare and swap on an atomic pair. The compiler generated a lot of code for this which I didn't understand, and I didn't see any LOCK-prefixed instructions in it. I was curious about whether the implementation places a lock within the atomic, so I printed the sizeof of the above atomic pair: 24 on a 64-bit platform, so apparently without a lock.
At last I wrote a program which increments both parts of a single atomic pair from all the threads my system has (Ryzen Threadripper, 64 cores, Win10, SMT off) a predefined number of times. Then I calculated the time for each increment in nanoseconds. The time is rather high, about 20,000 ns for each successful increment, so at first it looked to me as if there was a lock I had overlooked; but that couldn't be true with a sizeof of 24 bytes for this atomic. And when I looked at the process viewer, I saw that all 64 cores were at nearly 100% user CPU time the whole time - so there couldn't be any kernel intervention.
So is there anyone here smarter than me who can identify what this DWCAS substitute does from the assembly dump?
Here's my test code:
#include <iostream>
#include <atomic>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <vector>
using namespace std;
using namespace chrono;
struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

int main()
{
    cout << "sizeof(atomic<pair<uintptr_t, uintptr_t>>): " << sizeof(atomic_pair) << endl;
    atomic_pair ap( uip_pair( 0, 0 ) );
    cout << "atomic<pair<uintptr_t, uintptr_t>>::is_lock_free: " << ap.is_lock_free() << endl;
    mutex mtx;
    unsigned nThreads = thread::hardware_concurrency();
    unsigned ready = nThreads;
    condition_variable cvReady;
    bool run = false;
    condition_variable cvRun;
    atomic_int64_t sumDur = 0;
    auto theThread = [&]( size_t n )
    {
        unique_lock<mutex> lock( mtx );
        if( !--ready )
            cvReady.notify_one();
        cvRun.wait( lock, [&]() -> bool { return run; } );
        lock.unlock();
        auto start = high_resolution_clock::now();
        uip_pair cmp = ap.load( memory_order_relaxed );
        for( ; n--; )
            while( !ap.compare_exchange_weak( cmp, uip_pair( cmp.first + 1, cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
        sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        lock.lock();
    };
    vector<jthread> threads;
    threads.reserve( nThreads );
    static size_t const ROUNDS = 100'000;
    for( unsigned t = nThreads; t--; )
        threads.emplace_back( theThread, ROUNDS );
    unique_lock<mutex> lock( mtx );
    cvReady.wait( lock, [&]() -> bool { return !ready; } );
    run = true;
    cvRun.notify_all();
    lock.unlock();
    for( jthread &thr : threads )
        thr.join();
    cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
    uip_pair p = ap.load( memory_order_relaxed );
    cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}
[EDIT]: I've extracted the compare_exchange_weak call into a noinline function and disassembled the code:
struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

#if defined(_MSC_VER)
    #define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
    #define NOINLINE __attribute((noinline))
#endif

NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
    return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
}
    mov eax, 1
    mov r10, rcx
    mov r9d, eax
    xchg DWORD PTR [rcx], eax
    test eax, eax
    je SHORT label8
label1:
    mov eax, DWORD PTR [rcx]
    test eax, eax
    je SHORT label7
label2:
    mov eax, r9d
    test r9d, r9d
    je SHORT label5
label4:
    pause
    sub eax, 1
    jne SHORT label4
    cmp r9d, 64
    jl SHORT label5
    lea r9d, QWORD PTR [rax+64]
    jmp SHORT label6
label5:
    add r9d, r9d
label6:
    mov eax, DWORD PTR [rcx]
    test eax, eax
    jne SHORT label2
label7:
    mov eax, 1
    xchg DWORD PTR [rcx], eax
    test eax, eax
    jne SHORT label1
label8:
    mov rax, QWORD PTR [rcx+8]
    sub rax, QWORD PTR [rdx]
    jne SHORT label9
    mov rax, QWORD PTR [rcx+16]
    sub rax, QWORD PTR [rdx+8]
label9:
    test rax, rax
    sete al
    test al, al
    je SHORT label10
    movups xmm0, XMMWORD PTR [r8]
    movups XMMWORD PTR [rcx+8], xmm0
    xor ecx, ecx
    xchg DWORD PTR [r10], ecx
    ret
label10:
    movups xmm0, XMMWORD PTR [rcx+8]
    xor ecx, ecx
    movups XMMWORD PTR [rdx], xmm0
    xchg DWORD PTR [r10], ecx
    ret
Maybe someone understands the disassembly. Remember that XCHG is implicitly LOCK'ed on x86. It seems to me that MSVC uses some kind of software transactional memory here. I can extend the structure embedded in the atomic arbitrarily, but the size difference is always 8 bytes; so MSVC always uses some kind of STM.
As Nate pointed out in comments, it is a spinlock.
You can look up the source; it ships with the compiler and is available on GitHub.
If you build an unoptimized debug, you can step into this source during interactive debugging!
There's a member variable called _Spinlock.
And here's the locking function:
#if 1 // TRANSITION, ABI, GH-1151
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
    while (_InterlockedExchange(&_Spinlock, 1) != 0) { // TRANSITION, GH-1133: _InterlockedExchange_acq
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            __yield();
        }
    }
#else // ^^^ defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) ^^^
#error Unsupported hardware
#endif
}
(disclosure: I brought this increasing backoff from the Intel manual into there; it was just an xchg loop before - issue, PR)
Spinlock use is known to be suboptimal; a mutex that does a kernel wait should have been used instead. The problem with a spinlock is that, in the rare case where a context switch happens while a low-priority thread holds the spinlock, it will take a while to unlock it, as the scheduler is not aware of the high-priority thread waiting on that spinlock.
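For illustration, here is a minimal sketch of that kernel-wait idea on Windows 8+ using WaitOnAddress (the type and names are illustrative, not what the STL actually does; link with Synchronization.lib):
#include <windows.h>

struct KernelWaitLock {
    volatile LONG state = 0;
    void lock() {
        while (InterlockedExchange(&state, 1) != 0) {
            LONG locked = 1;
            // Sleeps in the kernel while state still equals 'locked', so the
            // scheduler knows a thread is waiting here.
            WaitOnAddress(&state, &locked, sizeof(state), INFINITE);
        }
    }
    void unlock() {
        InterlockedExchange(&state, 0);
        WakeByAddressSingle((PVOID)&state);
    }
};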
Sure, not using cmpxchg16b is also suboptimal. Still, for bigger atomics a non-lock-free mechanism has to be used. (There was no deliberate decision to avoid cmpxchg16b; it is just a consequence of ABI compatibility going back to Visual Studio 2015.)
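For comparison, a sketch of what a direct DWCAS looks like with the MSVC intrinsic (an illustration of the instruction, not a drop-in replacement for std::atomic; the operand must be 16-byte aligned):
#include <intrin.h>
#include <cstdint>

struct alignas(16) pair128 {
    std::int64_t first, second;
};

// Emits lock cmpxchg16b. Returns true on success; on failure 'expected' is
// overwritten with the value observed in 'target', mirroring
// compare_exchange semantics.
inline bool dwcas(pair128 volatile& target, pair128& expected, const pair128& desired) {
    return _InterlockedCompareExchange128(
        reinterpret_cast<volatile long long*>(&target),
        desired.second,   // high 64 bits
        desired.first,    // low 64 bits
        reinterpret_cast<long long*>(&expected)) != 0;
}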
There's an issue about making it better, that will hopefully be addressed with the next ABI break: https://github.com/microsoft/STL/issues/1151
As for transactional memory: it might make sense to use hardware transactional memory there. I can speculate that Intel RTM could possibly be used there via intrinsics, or there may be some future OS API for it (like an enhanced SRWLOCK), but it is likely that nobody will want more complexity there, as a non-lock-free atomic is a compatibility facility, not something you would deliberately want to use.
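Purely as a sketch of that speculation (the RTM intrinsics do exist in <immintrin.h>, but TSX is disabled on many recent CPUs, and a locked fallback path is always required because a transaction may abort for any reason):
#include <immintrin.h>

// Hypothetical fast path: try to update a value transactionally; the caller
// must fall back to a lock whenever this returns false.
template <class T>
bool try_transactional_store(T& dest, const T& desired) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        dest = desired;   // plain stores inside the transaction
        _xend();
        return true;
    }
    return false;
}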

Will template function typedef specifier be properly inlined when creating each instance of template function?

I have made a function that operates on several streams of data at the same time and creates an output result which is put to a destination stream. A huge amount of time has been put into optimizing the performance of this function (OpenMP, intrinsics, etc.), and it performs beautifully.
There is a lot of math involved here; needless to say, it is a very long function.
Now I want to implement the same function with replacement math code for each instance, without writing each version of this function by hand. I want to differentiate between the different instances of this function using only #defines or inlined functions (the code has to be inlined in each version).
I went for templates, but templates allow only type specifiers, and I realized that #defines can't be used here. The remaining solution would be inlined math functions, so the simplified idea is to create a header like this:
'alm_quasimodo.h':
#pragma once

typedef struct ALM_DATA
{
    int l, t, r, b;
    int scan;
    BYTE* data;
} ALM_DATA;

typedef BYTE (*MATH_FX)(BYTE&, BYTE&);

// etc
inline BYTE math_a1(BYTE& A, BYTE& B){ return ((BYTE)((B > A) ? B:A)); }
inline BYTE math_a2(BYTE& A, BYTE& B){ return ((BYTE)(255 - ((long)((long)(255 - A) * (255 - B)) >> 8))); }
inline BYTE math_a3(BYTE& A, BYTE& B){ return ((BYTE)((B < 128)?(2*(((long)A>>1)+64))*((float)B/255):(255-(2*(255-(((long)A>>1)+64))*(float)(255-B)/255)))); }
// etc

template <typename MATH>
inline int const template_math_av (MATH math, ALM_DATA& a, ALM_DATA& b)
{
    // ultra simplified version of very complex code
    for (int y = a.t; y <= a.b; y++)
    {
        int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; x++)
        {
            int xoffset = yoffset + x;
            a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
        }
    }
    return 0;
}

ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b);
and math_caller is defined in 'alm_quasimodo.cpp' as follows:
#include "stdafx.h"
#include "alm_quazimodo.h"
ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b)
{
switch(condition)
{
case 1: return template_math_av<MATH_FX>(math_a1, a, b);
break;
case 2: return template_math_av<MATH_FX>(math_a2, a, b);
break;
case 3: return template_math_av<MATH_FX>(math_a3, a, b);
break;
// etc
}
return -1;
}
The main concern here is optimization, mainly the inlining of the MATH function code, without breaking the existing optimizations of the original code. And, of course, without writing a separate instance of the function for each specific math operation ;)
So does this template properly inline all the math functions?
And any suggestions on how to optimize this function template?
If nothing else, thanks for reading this lengthy question.
It all depends on your compiler, the optimization level, and how and where the math_a1 to math_a3 functions are defined.
Usually, the compiler can optimize this if the functions in question are inline functions in the same compilation unit as the rest of the code.
If this doesn't happen for you, you may want to consider functors instead of functions.
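Another option, sketched here against the BYTE/ALM_DATA declarations from the question: make the function itself a non-type template parameter, so each instantiation knows at compile time exactly which function it calls and can inline it.
template <BYTE (*MATH)(BYTE&, BYTE&)>
inline int template_math_av(ALM_DATA& a, ALM_DATA& b)
{
    for (int y = a.t; y <= a.b; y++)
    {
        int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; x++)
        {
            int xoffset = yoffset + x;
            a.data[xoffset] = MATH(a.data[xoffset], b.data[xoffset]);
        }
    }
    return 0;
}

// In math_caller, each case then instantiates a distinct, fully inlinable copy:
//   case 1: return template_math_av<math_a1>(a, b);
//   case 2: return template_math_av<math_a2>(a, b);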
Here are some simple examples I experimented with. You can do the same for your function and check the behavior of different compilers.
For my example, GCC 7.3 and Clang 6.0 are pretty good at optimizing out function calls (provided they see the definition of the function, of course). However, somewhat surprisingly, ICC 18.0.0 is only able to optimize out functors and closures. Even inline functions give it some trouble.
Just to have some code here in case the link stops working in the future.
For the following code:
template <typename T, int size, typename Closure>
T accumulate(T (&array)[size], T init, Closure closure) {
    for (int i = 0; i < size; ++i) {
        init = closure(init, array[i]);
    }
    return init;
}

int sum(int x, int y) { return x + y; }

inline int sub_inline(int x, int y) { return x - y; }

struct mul_functor {
    int operator ()(int x, int y) const { return x * y; }
};

extern int extern_operation(int x, int y);

int accumulate_function(int (&array)[5]) {
    return accumulate(array, 0, sum);
}

int accumulate_inline(int (&array)[5]) {
    return accumulate(array, 0, sub_inline);
}

int accumulate_functor(int (&array)[5]) {
    return accumulate(array, 1, mul_functor());
}

int accumulate_closure(int (&array)[5]) {
    return accumulate(array, 0, [](int x, int y) { return x | y; });
}

int accumulate_exetern(int (&array)[5]) {
    return accumulate(array, 0, extern_operation);
}
GCC 7.3 (x86-64) produces the following assembly:
sum(int, int):
    lea eax, [rdi+rsi]
    ret
accumulate_function(int (&) [5]):
    mov eax, DWORD PTR [rdi+4]
    add eax, DWORD PTR [rdi]
    add eax, DWORD PTR [rdi+8]
    add eax, DWORD PTR [rdi+12]
    add eax, DWORD PTR [rdi+16]
    ret
accumulate_inline(int (&) [5]):
    mov eax, DWORD PTR [rdi]
    neg eax
    sub eax, DWORD PTR [rdi+4]
    sub eax, DWORD PTR [rdi+8]
    sub eax, DWORD PTR [rdi+12]
    sub eax, DWORD PTR [rdi+16]
    ret
accumulate_functor(int (&) [5]):
    mov eax, DWORD PTR [rdi]
    imul eax, DWORD PTR [rdi+4]
    imul eax, DWORD PTR [rdi+8]
    imul eax, DWORD PTR [rdi+12]
    imul eax, DWORD PTR [rdi+16]
    ret
accumulate_closure(int (&) [5]):
    mov eax, DWORD PTR [rdi+4]
    or eax, DWORD PTR [rdi+8]
    or eax, DWORD PTR [rdi+12]
    or eax, DWORD PTR [rdi]
    or eax, DWORD PTR [rdi+16]
    ret
accumulate_exetern(int (&) [5]):
    push rbp
    push rbx
    lea rbp, [rdi+20]
    mov rbx, rdi
    xor eax, eax
    sub rsp, 8
.L8:
    mov esi, DWORD PTR [rbx]
    mov edi, eax
    add rbx, 4
    call extern_operation(int, int)
    cmp rbx, rbp
    jne .L8
    add rsp, 8
    pop rbx
    pop rbp
    ret

Are compilers generally able to condense multiple contiguous memory copies into one operation?

Something like the following:
struct Vec2
{
    int x, y;
};

struct Bounds
{
    int left, top, right, bottom;
};

int main()
{
    Vec2 topLeft = {5, 5};
    Vec2 bottomRight = { 10, 10 };
    Bounds bounds;
    //___Here is copy operation
    //___Note they're not in contiguous order, harder for the compiler?
    bounds.left = topLeft.x;
    bounds.bottom = bottomRight.y;
    bounds.top = topLeft.y;
    bounds.right = bottomRight.x;
}
Those four assignments could be done like so:
memcpy(&bounds, &topLeft, sizeof(Vec2));
memcpy(&bounds.right, &bottomRight, sizeof(Vec2));
I'm wondering two things:
Are compilers usually able to optimise in this way?
Are four int copies the same as two int pair copies, as copying memory is O(n)?
I got the following disassembly results for the four copies:
bounds.left = topLeft.x;
00007FF642291034 mov dword ptr [bounds],5
bounds.bottom = bottomRight.y;
00007FF64229103C mov dword ptr [rsp+2Ch],0Ah
bounds.top = topLeft.y;
00007FF642291044 mov dword ptr [rsp+24h],5
bounds.right = bottomRight.x;
00007FF64229104C mov dword ptr [rsp+28h],0Ah
And confusingly, the disassembly differs between the first memcpy and the second - the first shows only a single instruction - which I don't understand:
memcpy(&bounds, &topLeft, sizeof(Vec2));
00007FF64229105E mov rbx,qword ptr [topLeft] // This is only one instruction
memcpy(&bounds.right, &bottomRight, sizeof(Vec2));
00007FF642291063 mov rdi,qword ptr [bottomRight] // Compared to 6?
00007FF642291068 mov qword ptr [bounds],rbx
00007FF64229106D mov qword ptr [rsp+28h],rdi
00007FF642291072 jmp main+7Eh (07FF64229107Eh)
00007FF642291074 mov rdi,qword ptr [rsp+28h]
00007FF642291079 mov rbx,qword ptr [bounds]
Any modern compiler supporting threads has to consider instruction dependencies and reordering. With that technology in place, it will quickly discover that there are no dependencies in the set of assignments you have, which means they can be reordered into linear memory order and then combined.
Not that it likely matters; the CPU cache will just load the whole cache line on first access, and flush the whole cache line at some later point. It's these operations that take time, not the CPU operations themselves.
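Note that the disassembly in the question appears to come from an unoptimized build (each field is stored separately as an immediate). A sketch for checking the optimized codegen, with the result escaping the function so the stores cannot be eliminated:
struct Vec2   { int x, y; };
struct Bounds { int left, top, right, bottom; };

Bounds makeBounds(const Vec2& topLeft, const Vec2& bottomRight)
{
    Bounds bounds;
    // Same out-of-order field assignments as in the question.
    bounds.left   = topLeft.x;
    bounds.bottom = bottomRight.y;
    bounds.top    = topLeft.y;
    bounds.right  = bottomRight.x;
    return bounds;
}
At -O2, GCC and Clang commonly merge the four 32-bit stores into two 64-bit moves, matching the hand-written memcpy version.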

Unusual heap size limitations in VS2003 C++

I have a C++ app that uses large arrays of data, and have noticed while testing that it is running out of memory while there is still plenty of memory available. I have reduced the code to a sample test case as follows:
void MemTest()
{
    size_t Size = 500*1024*1024; // 500 MB
    if (Size > _HEAP_MAXREQ)
        TRACE("Invalid Size");
    void * mem = malloc(Size);
    if (mem == NULL)
        TRACE("allocation failed");
}
If I create a new MFC project, include this function, and run it from InitInstance, it works fine in debug mode (memory allocated as expected), yet fails in release mode (malloc returns NULL). Single-stepping through the release build into the C runtime, my function gets inlined and I get the following:
// malloc.c
void * __cdecl _malloc_base (size_t size)
{
    void *res = _nh_malloc_base(size, _newmode);
    RTCCALLBACK(_RTC_Allocate_hook, (res, size, 0));
    return res;
}
Calling _nh_malloc_base
void * __cdecl _nh_malloc_base (size_t size, int nhFlag)
{
    void * pvReturn;
    // validate size
    if (size > _HEAP_MAXREQ)
        return NULL;
    ...
And (size > _HEAP_MAXREQ) returns true, and hence my memory doesn't get allocated. Putting a watch on size comes back with the expected 500MB, which suggests the program is linking in a different runtime library with a much smaller _HEAP_MAXREQ. Grepping the VC++ folders for _HEAP_MAXREQ shows the expected 0xFFFFFFE0, so I can't figure out what is happening here. Does anyone know of any CRT changes or versions that would cause this problem, or am I missing something way more obvious?
Edit: As suggested by Andreas, looking at this in the assembly view shows the following:
--- f:\vs70builds\3077\vc\crtbld\crt\src\malloc.c ------------------------------
_heap_alloc:
0040B0E5 push 0Ch
0040B0E7 push 4280B0h
0040B0EC call __SEH_prolog (40CFF8h)
0040B0F1 mov esi,dword ptr [size]
0040B0F4 cmp dword ptr [___active_heap (434660h)],3
0040B0FB jne $L19917+7 (40B12Bh)
0040B0FD cmp esi,dword ptr [___sbh_threshold (43464Ch)]
0040B103 ja $L19917+7 (40B12Bh)
0040B105 push 4
0040B107 call _lock (40DE73h)
0040B10C pop ecx
0040B10D and dword ptr [ebp-4],0
0040B111 push esi
0040B112 call __sbh_alloc_block (40E736h)
0040B117 pop ecx
0040B118 mov dword ptr [pvReturn],eax
0040B11B or dword ptr [ebp-4],0FFFFFFFFh
0040B11F call $L19916 (40B157h)
$L19917:
0040B124 mov eax,dword ptr [pvReturn]
0040B127 test eax,eax
0040B129 jne $L19917+2Ah (40B14Eh)
0040B12B test esi,esi
0040B12D jne $L19917+0Ch (40B130h)
0040B12F inc esi
0040B130 cmp dword ptr [___active_heap (434660h)],1
0040B137 je $L19917+1Bh (40B13Fh)
0040B139 add esi,0Fh
0040B13C and esi,0FFFFFFF0h
0040B13F push esi
0040B140 push 0
0040B142 push dword ptr [__crtheap (43465Ch)]
0040B148 call dword ptr [__imp__HeapAlloc@12 (425144h)]
0040B14E call __SEH_epilog (40D033h)
0040B153 ret
$L19914:
0040B154 mov esi,dword ptr [ebp+8]
$L19916:
0040B157 push 4
0040B159 call _unlock (40DDBEh)
0040B15E pop ecx
$L19929:
0040B15F ret
_nh_malloc:
0040B160 cmp dword ptr [esp+4],0FFFFFFE0h
0040B165 ja _nh_malloc+29h (40B189h)
With the registers as follows:
EAX = 009C8AF0 EBX = FFFFFFFF ECX = 009C8A88 EDX = 00747365 ESI = 00430F80
EDI = 00430F80 EIP = 0040B160 ESP = 0013FDF4 EBP = 0013FFC0 EFL = 00000206
So the compare does appear to be against the correct constant, i.e. 0040B160 cmp dword ptr [esp+4],0FFFFFFE0h, and [esp+4] at 0013FDF8 holds 1F400000h (my 500MB).
Second edit: The problem was actually in HeapAlloc, as per Andreas' post. Changing to a new separate heap for large objects, using HeapCreate & HeapAlloc, did not help alleviate the problem, nor did an attempt to use VirtualAlloc with various parameters. Some further experimentation showed that where allocating one large section of contiguous memory fails, two smaller blocks yielding the same total memory are OK; e.g. where a 300MB malloc fails, 2 x 150MB mallocs work fine. So it looks like I'll need a new array class that can live in a number of biggish memory fragments rather than a single contiguous block. Not a major problem, but I would have expected a bit more out of Win32 in this day and age.
Last edit: The following yielded 1.875GB of space, albeit non-contiguous
#define TenMB 1024*1024*10

void SmallerAllocs()
{
    size_t Total = 0;
    LPVOID p[200];
    for (int i = 0; i < 200; i++)
    {
        p[i] = malloc(TenMB);
        if (p[i])
            Total += TenMB;
        else
            break;
    }
    CString Msg;
    Msg.Format("Allocated %0.3lfGB", Total / (1024.0*1024.0*1024.0));
    AfxMessageBox(Msg, MB_OK);
}
Might it be the case that the debugger is playing a trick on you in release mode? Neither single-stepping nor the values of variables are reliable in release mode.
I tried your example in VS2003 in release mode, and when single-stepping it does at first look like the code lands on the return NULL line, but when I continue stepping it eventually continues into HeapAlloc. I would guess that it's this function that's failing. Looking at the disassembly of if (size > _HEAP_MAXREQ) reveals the following:
00401078 cmp dword ptr [esp+4],0FFFFFFE0h
so I don't think it's a problem with _HEAP_MAXREQ.
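Given the later edits, the failure looks like address-space fragmentation in the 32-bit process rather than a CRT limit. A diagnostic sketch (plain Win32; the function name is illustrative) that walks the address space and reports the largest free region, which is an upper bound on any single contiguous allocation:
#include <windows.h>

SIZE_T LargestFreeRegion()
{
    SIZE_T largest = 0;
    MEMORY_BASIC_INFORMATION mbi;
    char* p = NULL;
    while (VirtualQuery(p, &mbi, sizeof(mbi)) == sizeof(mbi))
    {
        if (mbi.State == MEM_FREE && mbi.RegionSize > largest)
            largest = mbi.RegionSize;
        p = (char*)mbi.BaseAddress + mbi.RegionSize;
    }
    return largest;
}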

CPUID implementations in C++

I would like to know if somebody around here has some good examples of a C++ CPUID implementation that can be referenced from any of the managed .net languages.
Also, should this not be the case, should I be aware of certain implementation differences between X86 and X64?
I would like to use CPUID to get info on the machine my software is running on (crashreporting etc...) and I want to keep everything as widely compatible as possible.
The primary reason I ask is that I am a total noob when it comes to writing what will probably be machine instructions, though I have basic knowledge about CPU registers and so on...
Before people start telling me to Google: I found some examples online, but usually they were not meant to allow interaction from managed code and none of the examples were aimed at both X86 and X64. Most examples appeared to be X86 specific.
Accessing raw CPUID information is actually very easy; here is a C++ class for that which works on Windows, Linux, and OS X:
#ifndef CPUID_H
#define CPUID_H

#ifdef _WIN32
#include <limits.h>
#include <intrin.h>
typedef unsigned __int32 uint32_t;
#else
#include <stdint.h>
#endif

class CPUID {
    uint32_t regs[4];

public:
    explicit CPUID(unsigned i) {
#ifdef _WIN32
        __cpuid((int *)regs, (int)i);
#else
        asm volatile
            ("cpuid" : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
             : "a" (i), "c" (0));
        // ECX is set to zero for CPUID function 4
#endif
    }

    const uint32_t &EAX() const {return regs[0];}
    const uint32_t &EBX() const {return regs[1];}
    const uint32_t &ECX() const {return regs[2];}
    const uint32_t &EDX() const {return regs[3];}
};

#endif // CPUID_H
To use it, just instantiate an instance of the class with the CPUID leaf you are interested in, then examine the registers. For example:
#include "CPUID.h"
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char *argv[]) {
CPUID cpuID(0); // Get CPU vendor
string vendor;
vendor += string((const char *)&cpuID.EBX(), 4);
vendor += string((const char *)&cpuID.EDX(), 4);
vendor += string((const char *)&cpuID.ECX(), 4);
cout << "CPU vendor = " << vendor << endl;
return 0;
}
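For crash reporting, the human-readable brand string may be more useful than the vendor ID. It lives in the extended leaves 0x80000002 through 0x80000004; a sketch reusing the class above (production code should first check via leaf 0x80000000 that the extended leaves exist):
#include "CPUID.h"
#include <iostream>
#include <string>

int main() {
    std::string brand;
    for (unsigned leaf = 0x80000002; leaf <= 0x80000004; ++leaf) {
        CPUID id(leaf);
        // Each leaf yields 16 bytes of the 48-byte brand string,
        // in EAX, EBX, ECX, EDX order.
        brand += std::string((const char *)&id.EAX(), 4);
        brand += std::string((const char *)&id.EBX(), 4);
        brand += std::string((const char *)&id.ECX(), 4);
        brand += std::string((const char *)&id.EDX(), 4);
    }
    // c_str() stops at the NUL terminator embedded in the 48 bytes.
    std::cout << "CPU brand = " << brand.c_str() << std::endl;
    return 0;
}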
This Wikipedia page tells you how to use CPUID: http://en.wikipedia.org/wiki/CPUID
EDIT: Added #include <intrin.h> for Windows, per comments.
See this MSDN article about __cpuid.
There is a comprehensive sample that compiles with Visual Studio 2005 or better. For Visual Studio 6, you can use this instead of the compiler intrinsic __cpuid:
void __cpuid(int CPUInfo[4], int InfoType)
{
    __asm
    {
        mov esi, CPUInfo
        mov eax, InfoType
        xor ecx, ecx
        cpuid
        mov dword ptr [esi + 0], eax
        mov dword ptr [esi + 4], ebx
        mov dword ptr [esi + 8], ecx
        mov dword ptr [esi + 12], edx
    }
}
For Visual Studio 2005, you can use this instead of the compiler intrinsic __cpuidex:
void __cpuidex(int CPUInfo[4], int InfoType, int ECXValue)
{
    __asm
    {
        mov esi, CPUInfo
        mov eax, InfoType
        mov ecx, ECXValue
        cpuid
        mov dword ptr [esi + 0], eax
        mov dword ptr [esi + 4], ebx
        mov dword ptr [esi + 8], ecx
        mov dword ptr [esi + 12], edx
    }
}
It might not be exactly what you are looking for, but Intel has a good article and sample code for enumerating Intel 64-bit platform architectures (processor, cache, etc.), which also seems to cover 32-bit x86 processors.
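Since the question asks about calling this from managed .NET code: the usual bridge is a tiny native DLL export consumed via P/Invoke. A sketch (the DLL and function names are illustrative); using the __cpuid intrinsic means the same source builds for both x86 and x64, where inline asm is unavailable:
// cpuinfo.cpp - build as a native DLL (separate x86 and x64 builds as needed)
#include <intrin.h>

extern "C" __declspec(dllexport) void GetCpuId(int info[4], int infoType)
{
    __cpuid(info, infoType);
}

// C# side (illustrative):
//   [DllImport("cpuinfo.dll")]
//   static extern void GetCpuId(int[] info, int infoType);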