I would like to know if somebody around here has some good examples of a C++ CPUID implementation that can be referenced from any of the managed .net languages.
Also, should this not be the case, should I be aware of certain implementation differences between X86 and X64?
I would like to use CPUID to get info on the machine my software is running on (crashreporting etc...) and I want to keep everything as widely compatible as possible.
Primary reason I ask is because I am a total noob when it comes to writing what will probably be all machine instructions though I have basic knowledge about CPU registers and so on...
Before people start telling me to Google: I found some examples online, but usually they were not meant to allow interaction from managed code and none of the examples were aimed at both X86 and X64. Most examples appeared to be X86 specific.
Accessing raw CPUID information is actually very easy, here is a C++ class for that which works in Windows, Linux and OSX:
#ifndef CPUID_H
#define CPUID_H
#ifdef _WIN32
#include <limits.h>
#include <intrin.h>
typedef unsigned __int32 uint32_t;
#else
#include <stdint.h>
#endif
class CPUID {
uint32_t regs[4];
public:
explicit CPUID(unsigned i) {
#ifdef _WIN32
__cpuid((int *)regs, (int)i);
#else
asm volatile
("cpuid" : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
: "a" (i), "c" (0));
// ECX is set to zero for CPUID function 4
#endif
}
const uint32_t &EAX() const {return regs[0];}
const uint32_t &EBX() const {return regs[1];}
const uint32_t &ECX() const {return regs[2];}
const uint32_t &EDX() const {return regs[3];}
};
#endif // CPUID_H
To use it just instantiate an instance of the class, load the CPUID instruction you are interested in and examine the registers. For example:
#include "CPUID.h"
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char *argv[]) {
CPUID cpuID(0); // Get CPU vendor
string vendor;
vendor += string((const char *)&cpuID.EBX(), 4);
vendor += string((const char *)&cpuID.EDX(), 4);
vendor += string((const char *)&cpuID.ECX(), 4);
cout << "CPU vendor = " << vendor << endl;
return 0;
}
This Wikipedia page tells you how to use CPUID: http://en.wikipedia.org/wiki/CPUID
EDIT: Added #include <intrin.h> for Windows, per comments.
See this MSDN article about __cpuid.
There is a comprehensive sample that compiles with Visual Studio 2005 or better. For Visual Studio 6, you can use this instead of the compiler instrinsic __cpuid:
void __cpuid(int CPUInfo[4], int InfoType)
{
__asm
{
mov esi, CPUInfo
mov eax, InfoType
xor ecx, ecx
cpuid
mov dword ptr [esi + 0], eax
mov dword ptr [esi + 4], ebx
mov dword ptr [esi + 8], ecx
mov dword ptr [esi + 12], edx
}
}
For Visual Studio 2005, you can use this instead of the compiler instrinsic __cpuidex:
void __cpuidex(int CPUInfo[4], int InfoType, int ECXValue)
{
__asm
{
mov esi, CPUInfo
mov eax, InfoType
mov ecx, ECXValue
cpuid
mov dword ptr [esi + 0], eax
mov dword ptr [esi + 4], ebx
mov dword ptr [esi + 8], ecx
mov dword ptr [esi + 12], edx
}
}
Might not be exactly what you are looking for, but Intel have a good article and sample code for enumerating Intel 64 bit platform architectures (processor, cache, etc.) which also seems to cover 32 bit x86 processors.
Related
I'm trying to write a program that can mask its command line arguments after it reads them. I know this is stored in the PEB, so I tried using the answer to "How to get the Process Environment Block (PEB) address using assembler (x64 OS)?" by Sirmabus to get that and modify it there. Here's a minimal program that does that:
#include <wchar.h>
#include <windows.h>
#include <winnt.h>
#include <winternl.h>
// Thread Environment Block (TEB)
#if defined(_M_X64) // x64
PTEB tebPtr = reinterpret_cast<PTEB>(__readgsqword(reinterpret_cast<DWORD_PTR>(&static_cast<NT_TIB*>(nullptr)->Self)));
#else // x86
PTEB tebPtr = reinterpret_cast<PTEB>(__readfsdword(reinterpret_cast<DWORD_PTR>(&static_cast<NT_TIB*>(nullptr)->Self)));
#endif
// Process Environment Block (PEB)
PPEB pebPtr = tebPtr->ProcessEnvironmentBlock;
int main() {
UNICODE_STRING *s = &pebPtr->ProcessParameters->CommandLine;
wmemset(s->Buffer, 'x', s->Length / sizeof *s->Buffer);
getwchar();
}
I compiled this both as 32-bit and 64-bit, and tested it on both 32-bit and 64-bit versions of Windows. I looked for the command line using Process Explorer, and also by using this PowerShell command to fetch it via WMI:
Get-WmiObject Win32_Process -Filter "name = 'overwrite.exe'" | Select-Object CommandLine
I've found that this works in every combination I tested it in, except for using WMI on a WOW64 process. Summarizing my test results in a table:
Architecture
Process Explorer
WMI
64-bit executable on 64-bit OS (native)
✔️ xxxxxxxxxxxxx
✔️ xxxxxxxxxxxxx
32-bit executable on 64-bit OS (WOW64)
✔️ xxxxxxxxxxxxx
❌ overwrite.exe
32-bit executable on 32-bit OS (native)
✔️ xxxxxxxxxxxxx
✔️ xxxxxxxxxxxxx
How can I modify my code to make this work in the WMI WOW64 case too?
wow64 processes have 2 PEB (32 and 64 bit) and 2 different ProcessEnvironmentBlock (again 32 and 64). the command line exist in both. some tools take command line correct (from 32 ProcessEnvironmentBlock for 32bit processes) and some unconditional from 64bit ProcessEnvironmentBlock (on 64 bit os). so you want zero (all or first char) of command line in both blocks. for do this in "native" block we not need access TEB/PEB/ProcessEnvironmentBlock - the GetCommandLineW return the direct pointer to the command-line string in ProcessEnvironmentBlock. so next code is enough:
PWSTR psz = GetCommandLineW();
while (*psz) *psz++ = 0;
or simply
*GetCommandLineW() = 0;
is enough
as side note, for get TEB pointer not need write own macro - NtCurrentTeb() macro already exist in winnt.h
access 64 bit ProcessEnvironmentBlock from 32 bit process already not trivial.
one way suggested in comment.
another way more simply, but not documented - call NtQueryInformationProcess with ProcessWow64Information
When the ProcessInformationClass parameter is ProcessWow64Information, the buffer pointed to by the
ProcessInformation parameter should be large enough to hold a
ULONG_PTR. If this value is nonzero, the process is running in a WOW64
environment. Otherwise, the process is not running in a WOW64
environment.
so this value receive some pointer. but msdn not say for what he point . in reality this pointer to 64 PEB of process in wow64 process.
so code can be next:
#ifndef _WIN64
PEB64* peb64;
if (0 <= NtQueryInformationProcess(NtCurrentProcess(),
ProcessWow64Information, &peb64, sizeof(peb64), 0) && peb64)
{
// ...
}
#endif
but declare and use 64 bit structures in 32bit process very not comfortable (need all time check that pointer < 0x100000000 )
another original way - execute small 64bit shellcode which do the task.
the code doing approximately the following:
#include <winternl.h>
#include <intrin.h>
void ZeroCmdLine()
{
PUNICODE_STRING CommandLine =
&NtCurrentTeb()->ProcessEnvironmentBlock->ProcessParameters->CommandLine;
if (USHORT Length = CommandLine->Length)
{
//*CommandLine->Buffer = 0;
__stosw((PUSHORT)CommandLine->Buffer, 0, Length / sizeof(WCHAR));
}
}
you need create asm, file (if yet not have it in project) with the next code
.686
.MODEL FLAT
.code
#ZeroCmdLine#0 proc
push ebp
mov ebp,esp
and esp,not 15
push 33h
call ##1
;++++++++ x64 +++++++++
sub esp,20h
call ##0
add esp,20h
retf
##0:
DQ 000003025048b4865h
DQ 0408b4860408b4800h
DQ 00de3677048b70f20h
DQ 033e9d178788b4857h
DQ 0ccccc35fab66f3c0h
;-------- x64 ---------
##1:
call fword ptr [esp]
leave
ret
#ZeroCmdLine#0 endp
end
the code in the DQs came from this:
mov rax,gs:30h
mov rax,[rax+60h]
mov rax,[rax+20h]
movzx ecx,word ptr [rax+70h]
jecxz ##2
push rdi
mov rdi,[rax+78h]
shr ecx,1
xor eax,eax
rep stosw
pop rdi
##2:
ret
int3
int3
custom build: ml /c /Cp $(InputFileName) -> $(InputName).obj
declare in c++
#ifdef __cplusplus
extern "C"
#endif
void FASTCALL ZeroCmdLine(void);
and call it.
#ifndef _WIN64
BOOL bWow;
if (IsWow64Process(GetCurrentProcess(), &bWow) && bWow)
{
ZeroCmdLine();
}
#endif
To be explicit about the reason for the difference, it's that for a WOW64 process, Process Explorer will read from the 32-bit PEB, while WMI will read from the 64-bit PEB. For completeness, here's a WOW64 program written in NASM and C that will change its command line to all 3s as seen by Process Explorer, and to all 6s as seen by WMI:
global _memcpy64, _wmemset64, _readgsqword64
section .text
BITS 32
_memcpy64:
push edi
push esi
call 0x33:.heavensgate
pop esi
pop edi
ret
BITS 64
.heavensgate:
mov rdi, [esp + 20]
mov rsi, [esp + 28]
mov rcx, [esp + 36]
mov rdx, rdi
rep movsb
mov eax, edx
shr rdx, 32
retf
BITS 32
_wmemset64:
push edi
call 0x33:.heavensgate
pop edi
ret
BITS 64
.heavensgate:
mov rdi, [esp + 16]
mov eax, [esp + 24]
mov rcx, [esp + 28]
mov rdx, rdi
rep stosw
mov eax, edx
shr rdx, 32
retf
BITS 32
_readgsqword64:
call 0x33:.heavensgate
ret
BITS 64
.heavensgate:
mov rdx, [rsp + 12]
mov rdx, gs:[rdx]
mov eax, edx
shr rdx, 32
retf
#include <windows.h>
#include <winternl.h>
#include <stdint.h>
typedef struct _TEB64 {
PVOID64 Reserved1[12];
PVOID64 ProcessEnvironmentBlock;
PVOID64 Reserved2[399];
BYTE Reserved3[1952];
PVOID64 TlsSlots[64];
BYTE Reserved4[8];
PVOID64 Reserved5[26];
PVOID64 ReservedForOle; // Windows 2000 only
PVOID64 Reserved6[4];
PVOID64 TlsExpansionSlots;
} TEB64;
typedef struct _PEB64 {
BYTE Reserved1[2];
BYTE BeingDebugged;
BYTE Reserved2[21];
PVOID64 LoaderData;
PVOID64 ProcessParameters;
BYTE Reserved3[520];
PVOID64 PostProcessInitRoutine;
BYTE Reserved4[136];
ULONG SessionId;
} PEB64;
typedef struct _UNICODE_STRING64 {
USHORT Length;
USHORT MaximumLength;
PVOID64 Buffer;
} UNICODE_STRING64;
typedef struct _RTL_USER_PROCESS_PARAMETERS64 {
BYTE Reserved1[16];
PVOID64 Reserved2[10];
UNICODE_STRING64 ImagePathName;
UNICODE_STRING64 CommandLine;
} RTL_USER_PROCESS_PARAMETERS64;
PVOID64 memcpy64(PVOID64 dest, PVOID64 src, uint64_t count);
PVOID64 wmemset64(PVOID64 dest, wchar_t c, uint64_t count);
uint64_t readgsqword64(uint64_t offset);
PVOID64 NtCurrentTeb64(void) {
return (PVOID64)readgsqword64(FIELD_OFFSET(NT_TIB64, Self));
}
int main(void) {
UNICODE_STRING *pcmdline = &NtCurrentTeb()->ProcessEnvironmentBlock->ProcessParameters->CommandLine;
wmemset(pcmdline->Buffer, '3', pcmdline->Length / sizeof(wchar_t));
TEB64 teb;
memcpy64(&teb, NtCurrentTeb64(), sizeof teb);
PEB64 peb;
memcpy64(&peb, teb.ProcessEnvironmentBlock, sizeof peb);
RTL_USER_PROCESS_PARAMETERS64 params;
memcpy64(¶ms, peb.ProcessParameters, sizeof params);
wmemset64(params.CommandLine.Buffer, '6', params.CommandLine.Length / sizeof(wchar_t));
getwchar();
}
(A real program doing this should probably include some error checking and sanity tests to make sure it's running on the architecture it expects.)
I just wanted to test if my compiler recognizes ...
atomic<pair<uintptr_t, uintptr_t>
... and uses DWCASes on it (like x86-64 lock cmpxchg16b) or if it supplements the pair with a usual lock.
So I first wrote a minimal program with a single noinline-function which does a compare and swap on a atomic pair. The compiler generated a lot of code for this which I didn't understand, and I didn't saw any LOCK-prefixed instructions in that. I was curious about whether the
implementation places a lock within the atomic and printed a sizeof of the above atomic pair: 24 on a 64-bit-platform, so obviously without a lock.
At last I wrote a program which increments both portions of a single atomic pair by all the threads my system has (Ryzen Threadripper 64 core, Win10, SMT off) a predefined number of times. Then I calculated the time for each increment in nanoseconds. The time is rather high, about 20.000ns for each successful increment, so it first looked to
me if there was a lock I overlooked; so this couldn't be true with a sizeof of this atomic of 24 bytes. And when I saw at the Processs Viewer I saw that all 64 cores were nearly at 100% user CPU time all the time - so there couldn't be any kernel-interventions.
So is there anyone here smarter than me and can identify what this DWCAS-substitute does from the assembly-dump ?
Here's my test-code:
#include <iostream>
#include <atomic>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <vector>
using namespace std;
using namespace chrono;
struct uip_pair
{
uip_pair() = default;
uip_pair( uintptr_t first, uintptr_t second ) :
first( first ),
second( second )
{
}
uintptr_t first, second;
};
using atomic_pair = atomic<uip_pair>;
int main()
{
cout << "sizeof(atomic<pair<uintptr_t, uintptr_t>>): " << sizeof(atomic_pair) << endl;
atomic_pair ap( uip_pair( 0, 0 ) );
cout << "atomic<pair<uintptr_t, uintptr_t>>::is_lock_free: " << ap.is_lock_free() << endl;
mutex mtx;
unsigned nThreads = thread::hardware_concurrency();
unsigned ready = nThreads;
condition_variable cvReady;
bool run = false;
condition_variable cvRun;
atomic_int64_t sumDur = 0;
auto theThread = [&]( size_t n )
{
unique_lock<mutex> lock( mtx );
if( !--ready )
cvReady.notify_one();
cvRun.wait( lock, [&]() -> bool { return run; } );
lock.unlock();
auto start = high_resolution_clock::now();
uip_pair cmp = ap.load( memory_order_relaxed );
for( ; n--; )
while( !ap.compare_exchange_weak( cmp, uip_pair( cmp.first + 1, cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
lock.lock();
};
vector<jthread> threads;
threads.reserve( nThreads );
static size_t const ROUNDS = 100'000;
for( unsigned t = nThreads; t--; )
threads.emplace_back( theThread, ROUNDS );
unique_lock<mutex> lock( mtx );
cvReady.wait( lock, [&]() -> bool { return !ready; } );
run = true;
cvRun.notify_all();
lock.unlock();
for( jthread &thr : threads )
thr.join();
cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
uip_pair p = ap.load( memory_order_relaxed );
cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}
[EDIT]: I've extracted the compare_exchange_weak-function into a noinline-function and disassembled the code:
struct uip_pair
{
uip_pair() = default;
uip_pair( uintptr_t first, uintptr_t second ) :
first( first ),
second( second )
{
}
uintptr_t first, second;
};
using atomic_pair = atomic<uip_pair>;
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute((noinline))
#endif
NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
}
mov eax, 1
mov r10, rcx
mov r9d, eax
xchg DWORD PTR [rcx], eax
test eax, eax
je SHORT label8
label1:
mov eax, DWORD PTR [rcx]
test eax, eax
je SHORT label7
label2:
mov eax, r9d
test r9d, r9d
je SHORT label5
label4:
pause
sub eax, 1
jne SHORT label4
cmp r9d, 64
jl SHORT label5
lea r9d, QWORD PTR [rax+64]
jmp SHORT label6
label5:
add r9d, r9d
label6:
mov eax, DWORD PTR [rcx]
test eax, eax
jne SHORT label2
label7:
mov eax, 1
xchg DWORD PTR [rcx], eax
test eax, eax
jne SHORT label1
label8:
mov rax, QWORD PTR [rcx+8]
sub rax, QWORD PTR [rdx]
jne SHORT label9
mov rax, QWORD PTR [rcx+16]
sub rax, QWORD PTR [rdx+8]
label9:
test rax, rax
sete al
test al, al
je SHORT label10
movups xmm0, XMMWORD PTR [r8]
movups XMMWORD PTR [rcx+8], xmm0
xor ecx, ecx
xchg DWORD PTR [r10], ecx
ret
label10:
movups xmm0, XMMWORD PTR [rcx+8]
xor ecx, ecx
movups XMMWORD PTR [rdx], xmm0
xchg DWORD PTR [r10], ecx
ret
Maybe someone understands the disassembly. Remember that XCHG is implicitly LOCK'ed on x86. It seems to me that MSVC uses some kind of software transactional memory here. I can extend the shared structure embedded in the atomic arbitrarily but the difference is still 8 bytes; so MSVC always uses some kind of STM.
As Nate pointed out in comments, it is a spinlock.
You can look up source, it ships with the compiler. and is available on Github.
If you build an unoptimized debug, you can step into this source during interactive debugging!
There's a member variable called _Spinlock.
And here's the locking function:
#if 1 // TRANSITION, ABI, GH-1151
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
// Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
// Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
// The code in mentioned manual is covered by the 0BSD license.
int _Current_backoff = 1;
const int _Max_backoff = 64;
while (_InterlockedExchange(&_Spinlock, 1) != 0) {
while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
_mm_pause();
}
_Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
}
}
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
while (_InterlockedExchange(&_Spinlock, 1) != 0) { // TRANSITION, GH-1133: _InterlockedExchange_acq
while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
__yield();
}
}
#else // ^^^ defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) ^^^
#error Unsupported hardware
#endif
}
(disclosure: I brought this increasing backoff from Intel manual into there, it was just an xchg loop before, issue, PR)
Spinlock use is known to be suboptimal, instead mutex that does kernel wait should have been used. The problem with spinlock is that in a rare case when context switch happens while holding a spinlock by a low priority thread it will take a while to unlock that spinlock, as scheduler will not be aware of high-priority thread waiting on that spinlock.
Sure not using cmpxchg16b is also suboptimal. Still, for bigger atomics non-lock-free mechanism has to be used. (There's no decision to avoid cmpxchg16b made, it is just a consequence of ABI compatibility down to Visual Studio 2015)
There's an issue about making it better, that will hopefully be addressed with the next ABI break: https://github.com/microsoft/STL/issues/1151
As for transaction memory. It might make sense to use hardware transaction memory there. I can speculate that Intel RTM could possibly be implemented there with intrinsics, or there may be some future OS API for them (like, enhanced SRWLOCK), but it is likely that nobody will want more complexity there, as non-lock-free atomic is a compatibility facility, not something you would deliberately want to use.
I'm trying to figure out how to get GCC or Clang to use [base+index*scale] addressing to index two unrelated arrays, like this, to save an add instruction:
;; rdi is srcArr, rsi is srcArr - dstArr, rax is len of arrays
Loop:
mov ecx, dword ptr [rdi + 4*rsi]
mov dword ptr [rdi], ecx
add rdi, 4
cmp rdi, rax
jb .Loop
The C++ code below achieves this, but because dst and src are unrelated, subtracting them is UB, even though the difference is never dereferenced. (I think this will nonetheless work on all x86/64 hardware though.)
#include <cstddef>
#include <cstring>
void offset(float* __restrict__ dst, const float* __restrict__ src, size_t n) {
const float *enddst = dst + n;
const std::ptrdiff_t offset = src - dst;
while (dst < enddst) {
std::memcpy(dst, /* src */ dst + offset, sizeof(float));
dst++;
}
}
(Ignore the silly loop body, it's just for illustration as something using the addresses.)
Is there a way to do this without UB?
godbolt with non-UB and UB version
I have simple class using a kind of ATL database access.
All functions are defined in a header file.
The problematic functions all do the same. There are some macros in use. The generated code looks like this
void InitBindings()
{
if (sName) // Static global char*
m_sTableName = sName; // Save into member
{ AddCol("Name", some_constant_data... _GetOleDBType(...), ...); };
{ AddCol("Name1", some_other_constant_data_GetOleDBType(...), ...); };
...
}
AddCol returns a reference to a structure, but as you see it is ignored.
When I look into the assembler code where I have a function that uses 6 AddCol calls I can see that the function requires 2176 bytes of stack space. I have functions that requires 20kb and more. And in the debugger I can see that the stack isn't use at all. (All initialized to 0xCC and never touched)
See assembler code at the end.
The problem can be seen with VS-2015, and VS-2017.Only in Debug mode.
In Release mode the function reserves no extra stack space at all.
The only rule I see is; more AddCol calls, will cause more stack to be reserved. I can see that approximativ 500bytes per AddCol call is reserved.
Again: The function returns no object, it returns a reference to the binding information.
I already used the following pragmas in front of the function (but inside the class definition in the header):
__pragma(runtime_checks("", off)) __pragma(optimize("ts", on)) __pragma(strict_gs_check(push, off))
But no avail. This pragmas should turn optimization on, switches off runtime checks and stack checks. How can I reduce this unneeded stack space that is allocated. In some cases I can see stack overflows in the debug version, when this functions are used. No problems in the release version.
; 325 : BIND_BEGIN(CMasterData, _T("tblMasterData"))
push ebp
mov ebp, esp
sub esp, 2176 ; 00000880H
push ebx
push esi
push edi
mov DWORD PTR _this$[ebp], ecx
mov eax, OFFSET ??_C#_1BM#GOLNKAI#?$AAt?$AAb?$AAl?$AAM?$AAa?$AAs?$AAt?$AAe?$AAr?$AAD?$AAa?$AAt?$AAa?$AA?$AA#
test eax, eax
je SHORT $LN2#InitBindin
push OFFSET ??_C#_1BM#GOLNKAI#?$AAt?$AAb?$AAl?$AAM?$AAa?$AAs?$AAt?$AAe?$AAr?$AAD?$AAa?$AAt?$AAa?$AA?$AA#
mov ecx, DWORD PTR _this$[ebp]
add ecx, 136 ; 00000088H
call DWORD PTR __imp_??4?$CStringT#_WV?$StrTraitMFC_DLL#_WV?$ChTraitsCRT#_W#ATL#####ATL##QAEAAV01#PB_W#Z
$LN2#InitBindin:
; 326 : // Columns:
; 327 : B$C_IDENT (_T("Id"), m_lId);
push 0
push 0
push 1
push 4
push 0
call ?_GetOleDBType#ATL##YAGAAJ#Z ; ATL::_GetOleDBType
add esp, 4
movzx eax, ax
push eax
push 0
push OFFSET ??_C#_15NCCOGFKM#?$AAI?$AAd?$AA?$AA#
mov ecx, DWORD PTR _this$[ebp]
call ?AddCol#CDBAccess#DB##QAEAAUS_BIND#2#PB_WKGKW4TYPE#32#0_N#Z ; DB::CDBAccess::AddCol
; 328 : B$C (_T("Name"), m_szName);
push 0
push 0
push 0
push 122 ; 0000007aH
mov eax, 4
push eax
call ?_GetOleDBType#ATL##YAGQA_W#Z ; ATL::_GetOleDBType
add esp, 4
movzx ecx, ax
push ecx
push 4
push OFFSET ??_C#_19DINFBLAK#?$AAN?$AAa?$AAm?$AAe?$AA?$AA#
mov ecx, DWORD PTR _this$[ebp]
call ?AddCol#CDBAccess#DB##QAEAAUS_BIND#2#PB_WKGKW4TYPE#32#0_N#Z ; DB::CDBAccess::AddCol
; 329 : B$C (_T("Data"), m_data);
push 0
push 0
push 0
push 4
push 128 ; 00000080H
call ?_GetOleDBType#ATL##YAGAAVCComBSTR#1##Z ; ATL::_GetOleDBType
add esp, 4
movzx eax, ax
push eax
push 128 ; 00000080H
push OFFSET ??_C#_19IEEMEPMH#?$AAD?$AAa?$AAt?$AAa?$AA?$AA#
mov ecx, DWORD PTR _this$[ebp]
call ?AddCol#CDBAccess#DB##QAEAAUS_BIND#2#PB_WKGKW4TYPE#32#0_N#Z ; DB::CDBAccess::AddCol
It is a compiler bug. Already known in connect.
EDIT The problem seams to be fixed in VS-2017 15.5.1
The problem has to do with a bug in the built in offsetof.
It is not possible for me to #undef _CRT_USE_BUILTIN_OFFSETOF as written in this case.
For me it only works to #undef offsetof and to use one of this:
#define myoffsetof1(s,m) ((size_t)&reinterpret_cast<char const volatile&>((((s*)0)->m)))
#define myoffsetof2(s, m) ((size_t)&(((s*)0)->m))
#undef offsetof
#define offsetof myoffsetof1
All ATL DB consumers are affected.
Here is a minimum repro, that shows the bug. Set a breakpint on the Init function. Look into the assembler code and wonder how much stack is used!
// StackUsage.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <string>
#include <list>
#include <iostream>
using namespace std;
struct CRec
{
char t1[20];
char t2[20];
char t3[20];
char t4[20];
char t5[20];
int i1, i2, i3, i4, i5;
GUID g1, g2, g3, g4, g5;
DBTIMESTAMP d1, d2, d3, d4, d5;
};
#define sizeofmember(s,m) sizeof(reinterpret_cast<const s *>(0)->m)
#define typeofmember(c,m) _GetOleDBType(((c*)0)->m)
#define myoffsetof1(s,m) ((size_t)&reinterpret_cast<char const volatile&>((((s*)0)->m)))
#define myoffsetof2(s, m) ((size_t)&(((s*)0)->m))
// Undef this lines to fix the bug
// #undef offsetof
// #define offsetof myoffsetof1
#define COL(n,v) { AddCol(n,offsetof(CRec,v),typeofmember(CRec,v),sizeofmember(CRec,v)); }
class CFoo
{
public:
CFoo()
{
Init();
}
void Init()
{
COL("t1", t1);
COL("t2", t2);
COL("t3", t3);
COL("t4", t4);
COL("t5", t5);
COL("i1", i1);
COL("i2", i2);
COL("i3", i3);
COL("i4", i4);
COL("i5", i5);
COL("g1", g1);
COL("g2", g2);
COL("g2", g3);
COL("g2", g4);
COL("g2", g5);
COL("d1", d1);
COL("d2", d2);
COL("d2", d3);
COL("d2", d4);
COL("d2", d5);
}
void AddCol(PCSTR szName, ULONG nOffset, DBTYPE wType, ULONG nSize)
{
cout << szName << '\t' << nOffset << '\t' << wType << '\t' << nSize << endl;
}
};
int main()
{
CFoo foo;
return 0;
}
I am implementing mmap function using system call.(I am implementing mmap manually because of some reasons.)
But I am getting return value -14 (-EFAULT, I checked with GDB) whith this message:
WARN Nar::Mmap: Memory allocation failed.
Here is function:
void *Mmap(void *Address, size_t Length, int Prot, int Flags, int Fd, off_t Offset) {
MmapArgument ma;
ma.Address = (unsigned long)Address;
ma.Length = (unsigned long)Length;
ma.Prot = (unsigned long)Prot;
ma.Flags = (unsigned long)Flags;
ma.Fd = (unsigned long)Fd;
ma.Offset = (unsigned long)Offset;
void *ptr = (void *)CallSystem(SysMmap, (uint64_t)&ma, Unused, Unused, Unused, Unused);
int errCode = (int)ptr;
if(errCode < 0) {
Print("WARN Nar::Mmap: Memory allocation failed.\n");
return NULL;
}
return ptr;
}
I wrote a macro(To use like malloc() function):
#define Malloc(x) Mmap(0, x, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
and I used like this:
Malloc(45);
I looked at man page. I couldn't find about EFAULT on mmap man page, but I found something about EFAULT on mmap2 man page.
EFAULT Problem with getting the data from user space.
I think this means something is wrong with passing struct to system call.
But I believe nothing is wrong with my struct:
struct MmapArgument {
unsigned long Address;
unsigned long Length;
unsigned long Prot;
unsigned long Flags;
unsigned long Fd;
unsigned long Offset;
};
Maybe something is wrong with handing result value?
Openning a file (which doesn't exist) with CallSystem gave me -2(-ENOENT), which is correct.
EDIT: Full source of CallSystem. open, write, close works, but mmap(or old_mmap) not works.
All of the arguments were passed well.
section .text
global CallSystem
CallSystem:
mov rax, rdi ;RAX
mov rbx, rsi ;RBX
mov r10, rdx
mov r11, rcx
mov rcx, r10 ;RCX
mov rdx, r11 ;RDX
mov rsi, r8 ;RSI
mov rdi, r9 ;RDI
int 0x80
mov rdx, 0 ;Upper 64bit
ret ;Return
It is unclear why you are calling mmap via your CallSystem function, I'll assume it is a requirement of your assignment.
The main problem with your code is that you are using int 0x80. This will only work if all the addresses passed to int 0x80 can be expressed in a 32-bit integer. That isn't the case in your code. This line:
MmapArgument ma;
places your structure on the stack. In 64-bit code the stack is at the top end of the addressable address space well beyond what can be represented in a 32-bit address. Usually the bottom of the stack is somewhere in the region of 0x00007FFFFFFFFFFF. int 0x80 only works on the bottom half of the 64-bit registers, so effectively stack based addresses get truncated, resulting in an incorrect address. To make proper 64-bit system calls it is preferable to use the syscall instruction
The 64-bit System V ABI has a section on the general mechanism for the syscall interface in section A.2.1 AMD64 Linux Kernel Conventions. It says:
User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi,
%rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. The kernel destroys
registers %rcx and %r11.
We can create a simplified version of your SystemCall code by placing the systemcallnum as the last parameter. As the 7th parameter it will be the first and only value passed on the stack. We can move that value from the stack into RAX to be used as the system call number. The first 6 values are passed in the registers, and with the exception of RCX we can simply keep all the registers as-is. RCX has to be moved to R10 because the 4th parameter differs between a normal function call and the Linux kernel SYSCALL convention.
Some simplified code for demonstration purposes could look like:
global CallSystem
section .text
CallSystem:
mov rax, [rsp+8] ; CallSystem 7th arg is 1st val passed on stack
mov r10, rcx ; 4th argument passed to syscall in r10
; RDI, RSI, RDX, R8, R9 are passed straight through
; to the sycall because they match the inputs to CallSystem
syscall
ret
The C++ could look like:
#include <stdlib.h>
#include <sys/mman.h>
#include <stdint.h>
#include <iostream>
using namespace std;
extern "C" uint64_t CallSystem (uint64_t arg1, uint64_t arg2,
uint64_t arg3, uint64_t arg4,
uint64_t arg5, uint64_t arg6,
uint64_t syscallnum);
int main()
{
uint64_t addr;
addr = CallSystem(static_cast<uint64_t>(NULL), 45,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS,
-1, 0, 0x9);
cout << reinterpret_cast<void *>(addr) << endl;
}
In the case of mmap the syscall is 0x09. That can be found in the file asm/unistd_64.h:
#define __NR_mmap 9
The rest of the arguments are typical of the newer form of mmap. From the manpage:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If your run strace on your executable (ie strace ./a.out) you should find a line that looks like this if it works:
mmap(NULL, 45, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed8e7cc000
The return value will differ, but it should match what the demonstration program displays.
You should be able to adapt this code to what you are doing. This should at least be a reasonable starting point.
If you want to pass the syscallnum as the first parameter to CallSystem you will have to modify the assembly code to move all the registers so that they align properly between the function call convention and syscall conventions. I leave that as a simple exercise to the reader. Doing so will yield a lot less efficient code.