C++ SYSENTER x86 calls in inline assembly

I'm learning how sysenter works on x86, and I created a simple x86 console application that should call the NtWriteVirtualMemory function manually from inline assembly.
I started with the code below, but the compiler doesn't seem to understand the "sysenter" opcode, so I decided to _emit the bytes for sysenter instead (maybe I need to change something in my project settings?). It compiles, but when it is about to call the function, Visual Studio reports that my ret is an illegal instruction, and the program stops.
Does anyone know how to do this correctly?
#include <windows.h>
#include <iostream>
__declspec(naked) void __KiFastSystemCall()
{
__asm
{
mov edx, esp
// need to emit "sysenter" because of syntaxerrors, "Opcode"; "newline"
_emit 0x0F
_emit 0x34
ret // illegal instruction after execute?
}
}
void Test_NtWriteVirtualMemory(HANDLE hProcess, PVOID BaseAddress, PVOID Buffer, SIZE_T sizeToWrite, SIZE_T* NumberOfBytesWritten)
{
__asm
{
push NumberOfBytesWritten
push sizeToWrite
push Buffer
push BaseAddress
push hProcess
mov eax, 0x3A // Syscall ID NtWriteVirtualMemory in Windows10
mov edx, __KiFastSystemCall
call edx
add esp, 0x14 // 5 push * 4 bytes 20 dec
retn
}
}
void Test_NtWriteVirtualMemory(HANDLE hProcess, PVOID BaseAddress, PVOID Buffer, SIZE_T sizeToWrite, SIZE_T* NumberOfBytesWritten)
{
__asm
{
push NumberOfBytesWritten
push sizeToWrite
push Buffer
push BaseAddress
push hProcess
mov eax, 0x3a // Syscall ID NtWriteVirtualMemory in Windows10
mov edx, 0x76F88E00
call edx
ret 0x14
}
}
int main()
{
std::cout << "Test Hello World\n";
HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetProcessId("MyGame.exe"));
if (hProcess == NULL)
return false;
DWORD TestAddress = 0x87A0B4; // hardcoded
DWORD TestValue = 4;
Test_NtWriteVirtualMemory(hProcess, (PVOID)TestAddress, (PVOID)TestValue, sizeof(DWORD), NULL);
CloseHandle(hProcess);
return 0;
}

Do you have a 32-bit only version of Windows?
sysenter was the "successor" of int 2eh and introduced during the Windows XP era.
The 64-bit versions of Windows don't use it, in fact it was removed since:
sysenter and sysexit are illegal in long mode on an AMD CPU (irrespective of the compatibility mode).
The IA32_SYSENTER_CS MSR is left to zero by a 64-bit version of Windows [1].
This will cause a #GP fault when executing sysenter.
If you single-step through your __KiFastSystemCall you should see the debugger catch an exception with code 0xc0000005 when executing sysenter.
So, in order to use sysenter you must have a real 32-bit version of Windows.
Running a 32-bit program on a 64-bit version of Windows won't work, that's compatibility mode (done through the WOW64 machinery).
If, besides having a 64-bit version of Windows, you also have an AMD CPU then it won't work double time.
Windows 64-bit uses syscall for 64-bit programs, or an indirect call through the WOW32Reserved field of the TEB [2] for 32-bit programs; you should use those.
Beware that the 64-bit system call convention is slightly different from the usual one: particularly it assumes the syscall is in a function of its own, thus it expects the parameters on the stack to be shifted up by 8.
Plus, the first parameter must be in r10, not rcx.
For example, if you inline the syscall instruction, the first parameter on the stack (if any) must be at rsp + 28h and not at rsp + 20h.
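The offset arithmetic in that example can be sketched in C++. This is only an illustrative model, assuming the usual Windows x64 layout of an 8-byte return address followed by 32 bytes of shadow space; the name stack_arg_offset and its parameters are made up for this sketch and are not part of any Windows API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte offset from rsp of the n-th stack-passed argument
// (n = 0 is the fifth argument overall) at the moment syscall executes.
//   in_own_function == true : syscall sits in its own stub, so the stub's
//                             return address occupies [rsp].
//   in_own_function == false: layout at the call site, before any call.
constexpr std::size_t stack_arg_offset(std::size_t n, bool in_own_function) {
    return (in_own_function ? 8u : 0u)  // return address pushed by call
         + 0x20u                        // shadow space for rcx, rdx, r8, r9
         + n * 8u;                      // each stack slot is 8 bytes
}

static_assert(stack_arg_offset(0, true) == 0x28, "layout the kernel expects");
static_assert(stack_arg_offset(0, false) == 0x20, "plain call-site layout");
```

An inlined syscall therefore has to leave an extra 8-byte gap, as if a return address had been pushed, so that the first stack argument sits at the offset given by stack_arg_offset(0, true).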
The 32-bit compatibility mode syscall convention is also different, you need to set both eax and ecx to specific values.
I didn't dig what exactly ecx is used for but it may be related to an optimization called Turbo thunks and must be set to a specific value.
Note that while syscall numbers are very volatile, turbo thunks are even more because they can be disabled by the admin.
[1] I don't have a definitive source for this; it is just zero on my version of Windows, and it makes sysenter fault.
[2] I.e. a call DWORD [fs:0c0h]; this will point to code that will jump to a gate descriptor for a 64-bit code segment that in turn will execute a syscall.

Since you got illegal instruction on the ret instruction rather than the sysenter instruction, you know the sysenter instruction was decoded correctly by the CPU. Your call got into kernel mode, but the kernel didn't like your system call invocation.
Probably it was depending on user-space to help save some registers because sysenter is very minimal. Check the stack pointer after returning from the kernel as you single-step before letting ret execute.
I'd only be speculating as to the problem, but wrapping the syscall gate in another function call looks wrong to my eyes. As I said in comments, do not do this because the syscall numbers can change on you.
Under Linux, 32-bit processes call through the VDSO (a library injected into their address space by the kernel) to get the optimal system-call instruction, used in a way that matches what the kernel wants. (sysenter doesn't preserve the stack pointer so user-space has to help.)
Perhaps if you want to play with this instruction you're better off writing a toy OS.
Sorry it's not a ton of answer, but it's not completely unreasonable.

Making system calls in x86 Windows is different from x64. You need to specify the correct arguments length in ret, otherwise you will get an illegal instruction and/or a run-time ESP error.
Furthermore, I don't recommend using inline assembly; instead put the code in an .asm file or use it as shellcode.
To make a correct x86 system call on x86 Windows:
mov eax, SYSCALL_INDEX
call sysentry
ret ARGUMENTS_LENGTH_SIZE
sysentry:
mov edx,esp
sysenter
retn
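The ARGUMENTS_LENGTH_SIZE operand is the number of argument bytes the stub has to pop. As a rough sketch, assuming the stdcall rule that each argument occupies whole 4-byte stack slots on x86-32, it can be computed at compile time. StdcallArgBytes, slot_bytes and Ptr32 are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// On x86-32, every stdcall argument occupies a whole number of 4-byte slots.
constexpr std::size_t slot_bytes(std::size_t size) {
    return (size + 3) / 4 * 4;
}

// Sums the slot sizes of a parameter list (C++11-style recursion).
template <typename... Args>
struct StdcallArgBytes;

template <>
struct StdcallArgBytes<> {
    static constexpr std::size_t value = 0;
};

template <typename First, typename... Rest>
struct StdcallArgBytes<First, Rest...> {
    static constexpr std::size_t value =
        slot_bytes(sizeof(First)) + StdcallArgBytes<Rest...>::value;
};

// Stand-in for a 32-bit pointer or HANDLE, so the result is the same even if
// this sketch is compiled on a 64-bit host.
using Ptr32 = std::uint32_t;

// NtWriteVirtualMemory takes five pointer-sized arguments on x86-32.
static_assert(StdcallArgBytes<Ptr32, Ptr32, Ptr32, Ptr32, Ptr32>::value == 0x14,
              "5 arguments * 4 bytes = ret 0x14");
```

For the five-argument call in the question this gives 0x14, which matches the "5 push * 4 bytes" comment in the asker's code.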
To make a correct x64 system call on x64 Windows:
mov eax, SYSCALL_INDEX
mov r10,rcx
syscall
retn
The above will work 100% correctly on any x86 and x64 Windows (tested). Can't help you with inline assembly though, because I never used it that way.
Enjoy.

Related

Compiler explorer and GCC have different outputs

I have some C code that when given to Compiler Explorer, it outputs:
mov BYTE PTR [rbp-4], al
mov eax, ecx
mov BYTE PTR [rbp-8], al
mov eax, edx
mov BYTE PTR [rbp-12], al
However if I use GCC or G++ then it gives me this:
mov BYTE PTR 16[rbp], al
mov eax, edx
mov BYTE PTR 24[rbp], al
mov eax, ecx
mov BYTE PTR 32[rbp], al
I have no idea why the BYTE PTRs are different. They have a completely wrong address and I don't get why they are before the [rbp] part.
If you know how to reproduce the first output using gcc or g++ please help!
gcc.exe (GCC) 8.2.0
Looks like GCC for the Windows x64 calling convention is using the shadow space (32 bytes above the return address) reserved by its caller. Godbolt's GCC installs target GNU/Linux, i.e. the x86-64 System V ABI.
You can get the same code on Godbolt by marking your function with __attribute__((ms_abi)). Of course that means your caller has to see that attribute in the prototype so it knows to reserve that space, and which registers to pass function args in.
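A minimal sketch of that attribute in use, assuming GCC or Clang on x86-64. The function names sum3 and call_sum3 and the MS_ABI macro are invented for this example; the macro expands to nothing on other compilers so the code still builds there.

```cpp
#include <cassert>

// GCC/Clang-specific attribute; defined away elsewhere so the sketch still builds.
#if defined(__GNUC__) && defined(__x86_64__)
#define MS_ABI __attribute__((ms_abi))
#else
#define MS_ABI
#endif

// Hypothetical function: with MS_ABI on x86-64, a, b and c arrive in cl, dl
// and r8b, and an unoptimized build may spill them into the caller's shadow
// space above the return address.
MS_ABI int sum3(char a, char b, char c);

// The caller sees the attribute in the prototype above, so it reserves the
// 32 bytes of shadow space and uses the Windows x64 argument registers.
int call_sum3() {
    return sum3(1, 2, 3);
}

MS_ABI int sum3(char a, char b, char c) {
    return a + b + c;
}
```

Compiling this without optimization on Godbolt should show stores at positive offsets from rbp inside sum3, similar to the second listing in the question.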
The Windows x64 calling convention is mostly worse than x86-64 System V; fewer arg-passing registers for example. One of its only advantages is easier implementation of variadic functions (because of the shadow space), and having some call-preserved XMM regs. (Probably too many, but x86-64 SysV has zero.) So more likely you want to use a cross compiler (targeting GNU/Linux) on Windows, or use __attribute__((sysv_abi)) on all your functions. (https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html)
The XMM part of the calling convention is normally irrelevant for kernel code; most kernels avoid saving/restoring the SIMD/FPU state on kernel entry/exit by not letting the compiler use SIMD/FP instructions.

Mixing C++ and assembly: can't pass multiple parameters from C++ function to assembly

I've been frustrated by passing parameters from a c++ function to assembly. I couldn't find anything that helped on Google and would really like your help. I am using Visual Studio 2017 and masm to compile my assembly code.
This is a simplified version of my c++ file where I call the assembly procedure set_clock
int main()
{
TimeInfo localTime;
char clock[4] = { 0,0,0,0 };
set_clock(clock,&localTime);
system("pause");
return 0;
}
I run into problems in the assembly file. I can't figure out why the second parameter passed to the function turns out huge. I was going off my textbook, which shows similar code with PROC followed by parameters. I don't know why the first parameter is passed successfully and the second one isn't. Can someone tell me the correct way to pass multiple parameters?
.code
set_clock PROC,
array:qword,address:qword
mov rdx,array ; works fine memory address: 0x1052440000616
mov rdi,address ; value of rdi is 14757395258967641292
mov al, [rdx]
mov [rdi],al ; ERROR: cant access that memory location
ret
set_clock ENDP
END
MASM's high-level crap is biting you in the ass. x64 Windows passes the first 4 args in rcx, rdx, r8, r9 (for any of those 4 that are integer/pointer).
mov rdx,array
mov rdi,address
assembles to
mov rdx, rcx ; clobber 2nd arg with a copy of the 1st
mov rdi, rdx ; copy array again
Use a disassembler to check for yourself. It's always a good idea to check the real machine code by disassembling, or by using your debugger's disassembly view instead of source mode, if anything weird is happening with assembler macros.
I'm not sure why this would result in an inaccessible memory location. If both args really are pointers to locals, then it should just be loading and storing back into the same stack location. But if char clock[4] is a const in static storage, it might be in a read-only memory page which would explain the store failing.
Either way, use a debugger and find out.
BTW, rdi is a call-preserved (aka non-volatile) register in the x64 Windows convention. (https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx). Use call-clobbered registers for scratch regs unless you run out and need to save/restore some call-preserved regs. See also Agner Fog's calling conventions doc (http://agner.org/optimize/), and other links in the x86 tag wiki.
It's call-clobbered in x86-64 System V, which also passes args in different registers. Maybe you were looking at a different example?
Hopefully-fixed version, using movzx to avoid a false dependency on RAX when loading a byte.
set_clock PROC,
array:qword,address:qword
movzx eax, byte ptr [array]
mov [address], al
ret
set_clock ENDP
I don't use MASM, but I think array:qword makes array an alias for rcx. Or you could skip declaring the parameters and just use rcx and rdx directly, and document it with comments. That would be easier for everyone to understand.
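To check the assembly against a known-good result, the fixed routine can be modelled in C++. This is a sketch, not the asker's code: set_clock_reference is a made-up name, and the second argument is treated as a plain byte pointer because the layout of TimeInfo isn't shown in the question.

```cpp
#include <cassert>

// Hypothetical C++ model of the fixed set_clock: load one byte through the
// first pointer (rcx on x64 Windows) and store it through the second (rdx).
void set_clock_reference(const char* array, void* address) {
    unsigned char first = static_cast<unsigned char>(array[0]);  // movzx eax, byte ptr [rcx]
    *static_cast<unsigned char*>(address) = first;               // mov [rdx], al
}
```

Calling both versions with the same inputs and comparing the byte written is a quick way to confirm that the arguments arrive in the registers you expect.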
You definitely don't want useless mov reg,reg instructions cluttering your code; if you're writing in asm in the first place, wasted instructions would cut into any speedups you're getting.

How to get efficient asm for zeroing a tiny struct with MSVC++ for x86-32?

My project is compiled for 32-bit in both Windows and Linux. I have an 8-byte struct that's used just about everywhere:
struct Value {
unsigned char type;
union { // 4 bytes
unsigned long ref;
float num;
};
};
In a lot of places I need to zero out the struct, which is done like so:
#define NULL_VALUE_LITERAL {0, {0L}}
static const Value NULL_VALUE = NULL_VALUE_LITERAL;
// example of clearing a value
var = NULL_VALUE;
This however does not compile to the most efficient code in Visual Studio 2013, even with all optimizations on. What I see in the assembly is that the memory location for NULL_VALUE is being read, then written to the var. This results in two reads from memory and two writes to memory. This clearing however happens a lot, even in routines that are time-sensitive, and I'm looking to optimize.
If I set the value to NULL_VALUE_LITERAL, it's worse. The literal data, which again is all zeroes, is copied into a temporary stack value and THEN copied to the variable, even if the variable is also on the stack. So that's absurd.
There's also a common situation like this:
*pd->v1 = NULL_VALUE;
It has similar assembly code to the var=NULL_VALUE above, but it's something I can't optimize with inline assembly should I choose to go that route.
From my research the very, very fastest way to clear the memory would be something like this:
xor eax, eax
mov byte ptr [var], al
mov dword ptr [var+4], eax
Or better still, since the struct alignment means there's just junk for 3 bytes after the data type:
xor eax, eax
mov dword ptr [var], eax
mov dword ptr [var+4], eax
Can you think of any way I can get code similar to that, optimized to avoid the memory reads that are totally unnecessary?
I tried some other methods, which end up creating what I feel is overly bloated code writing a 32-bit 0 literal to the two addresses, but IIRC writing a literal to memory still isn't as fast as writing a register to memory. I'm looking to eke out any extra performance I can get.
Ideally I would also like the result to be highly readable. Your help is appreciated.
I'd recommend uint32_t or unsigned int for the union with float. long on Linux x86-64 is a 64-bit type, which is probably not what you want.
I can reproduce the missed-optimization with MSVC CL19 -Ox on the Godbolt compiler explorer for x86-32 and x86-64. Workarounds that work with CL19:
make type an unsigned int instead of char, so there's no padding in the struct, then assign from a literal {0, {0L}} instead of a static const Value object. (Then you get two mov-immediate stores: mov DWORD PTR [eax], 0 / mov DWORD PTR [eax+4], 0).
gcc also has struct-zeroing missed-optimizations with padding in structs, but not as bad as MSVC (Bug 82142). It just defeats merging into wider stores; it doesn't get gcc to create an object on the stack and copy from that.
std::memset: probably the best option, MSVC compiles it to a single 64-bit store using SSE2. xorps xmm0, xmm0 / movq QWORD PTR [mem], xmm0. (gcc -m32 -O3 compiles this memset to two mov-immediate stores.)
void arg_memset(Value *vp) {
memset(vp, 0, sizeof(*vp));
}
;; x86 (32-bit) MSVC -Ox
mov eax, DWORD PTR _vp$[esp-4]
xorps xmm0, xmm0
movq QWORD PTR [eax], xmm0
ret 0
This is what I'd choose for modern CPUs (Intel and AMD). The penalty for crossing a cache-line is low enough that it's worth saving an instruction if it doesn't happen all the time. xor-zeroing is extremely cheap (especially on Intel SnB-family).
"IIRC writing a literal to memory still isn't as fast as writing a register to memory"
In asm, constants embedded in the instruction are called immediate data. mov-immediate to memory is mostly fine on x86, but it's a bit bloated for code-size.
(x86-64 only): A store with a RIP-relative addressing mode and an immediate can't micro-fuse on Intel CPUs, so it's 2 fused-domain uops. (See Agner Fog's microarch pdf, and other links in the x86 tag wiki.) This means it's worth it (for front-end bandwidth) to zero a register if you're doing more than one store to a RIP-relative address. Other addressing modes do fuse, though, so it's just a code-size issue.
Related: Micro fusion and addressing modes (indexed addressing modes un-laminate on Sandybridge/Ivybridge, but Haswell and later can keep indexed stores micro-fused.) This isn't dependent on immediate vs. register source.
"I think memset would be a very poor fit since this is just an 8-byte struct."
Modern compilers know what some heavily-used / important standard library functions do (memset, memcpy, etc.), and treat them like intrinsics. There's very little difference as far as optimization is concerned between a = b and memcpy(&a, &b, sizeof(a)) if they have the same type.
You might get a function call to the actual library implementation in debug mode, but debug mode is very slow anyway. If you have debug-mode perf requirements, that's unusual. (But does happen for code that needs to keep up with something else...)
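As a self-contained sketch of the memset approach: the struct below follows the question but uses std::uint32_t for the union member, as recommended earlier in this answer. The helper names clear, copy_assign and copy_memcpy are invented for this example, and the size check assumes a typical target with a 4-byte float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The struct from the question, with uint32_t instead of unsigned long so the
// union stays 4 bytes on every target.
struct Value {
    unsigned char type;
    union {
        std::uint32_t ref;
        float num;
    };
};

static_assert(sizeof(Value) == 8, "1 byte type + 3 bytes padding + 4 byte union");

// Zero all 8 bytes, padding included. Optimizing compilers normally inline
// this fixed-size memset instead of calling the library function.
inline void clear(Value* v) {
    std::memset(v, 0, sizeof *v);
}

// Struct assignment and a same-type memcpy express the same copy.
inline void copy_assign(Value* dst, const Value* src) { *dst = *src; }
inline void copy_memcpy(Value* dst, const Value* src) {
    std::memcpy(dst, src, sizeof *dst);
}
```

Which instructions a given compiler emits for clear still has to be checked in the disassembly, for example on Godbolt with the same flags as your build.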

C++ VS2010 Compiler doesn't use 'push' for a simple function call

I just started to learn a bit of assembler from compiler output.
test(1);
This simple function call creates following asm output (compiled with x64)
000000013FFF2E76 mov ecx,1
000000013FFF2E7B call test (13FFF33C0h)
But why isn't it:
000000013FFF2E76 push 1
000000013FFF2E7B call test (13FFF33C0h)
I thought a function parameter would be pushed to the stack and then popped in the function. Can somebody explain why VS prefers the top one?
It's because that's the ABI on x64 Windows.
On Windows x64, the first integer argument is passed in RCX, the second in RDX, the third in R8 and the fourth in R9. The fifth and following are passed through the stack.
Because your function has a single argument, only RCX is used.
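The register assignment described above can be written down as a small lookup. This is just an illustrative model for integer and pointer arguments; win64_int_arg_location is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: where the Windows x64 convention puts the i-th
// integer or pointer argument (0-based index).
inline std::string win64_int_arg_location(std::size_t i) {
    static const char* const regs[] = {"rcx", "rdx", "r8", "r9"};
    if (i < 4) return regs[i];
    return "stack";  // fifth and later arguments go on the stack
}
```

For test(1) only index 0 is in play, which is why the disassembly shows a single write to (e)cx and no push.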
The compiler issued a write to ECX because all writes to 32-bit registers result in zeroing the higher part of the 64-bit register, and 32-bit immediates are obviously shorter than 64-bit ones (instruction cache anyone?).
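The zero-extension rule can be modelled with plain integers. This sketch only imitates the architectural behaviour in C++; write32 and write16 are invented names, with the 16-bit case included for contrast because narrower writes leave the upper bits alone.

```cpp
#include <cassert>
#include <cstdint>

// Model of "mov ecx, imm32": the previous 64-bit value is discarded and the
// upper 32 bits of the full register become zero.
constexpr std::uint64_t write32(std::uint64_t /*old_value*/, std::uint32_t v) {
    return v;
}

// Model of "mov cx, imm16": only the low 16 bits change, the rest is kept.
constexpr std::uint64_t write16(std::uint64_t old_value, std::uint16_t v) {
    return (old_value & ~std::uint64_t{0xFFFF}) | v;
}

static_assert(write32(0xFFFFFFFFFFFFFFFFull, 1) == 1, "upper half cleared");
```

So after mov ecx,1 the whole of rcx holds 1, whatever it contained before.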

Doubting about the Threads window of visual studio

As you can see above, there are 4 Win32 threads at exactly the same location. How should I understand this?
UPDATE
7C92E4BE mov dword ptr [esp],eax
7C92E4C1 mov dword ptr [esp+4],0
7C92E4C9 mov dword ptr [esp+8],0
7C92E4D1 mov dword ptr [esp+10h],0
7C92E4D9 push esp
7C92E4DA call 7C92E508
7C92E4DF mov eax,dword ptr [esp]
7C92E4E2 mov esp,ebp
7C92E4E4 pop ebp
7C92E4E5 ret
7C92E4E6 lea esp,[esp]
7C92E4ED lea ecx,[ecx]
7C92E4F0 mov edx,esp
7C92E4F2 sysenter
7C92E4F4 ret
At a guess, they're probably sleeping in something like WaitForSingleObject or similar.
The debugger shows the next ring3 processor instruction that is going to be executed. In this case the thread has executed sysenter, which makes a ring0 system call into the operating system's kernel. This kernel system call is waiting for something to happen before returning control back to the calling code. Once that something happens, execution resumes at the next user-mode instruction, which in this case is ret.
If you have 4 threads that are all calling the same function that waits for a system call at the same location, you will have 4 threads that show the same address in the Threads window. This is something that you will see quite often in applications built with the Windows subsystem, which usually have a number of threads that are started by the Windows API that spend most of their time waiting for kernel events.
At a guess, you have a thread pool of some sort, so you have four threads all executing the same thread function. In this case, all four are most likely idle, waiting for a task they need to execute. If that's the case, it's quite sensible that all four show the same location.
You'll need to ignore the threads that are started by Microsoft code. I'm guessing at mmsys or DirectX from your screen shot. Microsoft code is very thread-happy.
You can get better diagnostics about what they do when you enable the Microsoft Symbol Server. You'll get decent names in the Call Stack window, often letting you guess what their purpose is. Of course, you'll never get to look at their code.