Porting Inline GASM to x64 MASM Access Violation Issue - c++

I am currently porting some code to MS Windows x64 from the https://github.com/mono project which was written for GCC Linux and I am having some challenges.
Currently I am unsure if my translation from x64 AT&T inline ASM to x64 MASM is correct. It compiles fine but my test case fails as memcpy throws exceptions/memory access violations after my ASM function executes. Is my translation correct?
One of the things that really challenged me is that rip is not accessible in Windows x64 MASM. I really don't know how to translate the remaining lines of the AT&T syntax (see below), but I gave it my best try. Did I handle the lack of rip access correctly?
If my work is correct then why is memcpy failing?
Here is the related C++:
void mono_context_get_current(MonoContext cnt); //declare the ASM func
//Pass the static struct pointer to the ASM function mono_context_get_current
//The purpose here is to clobber it
#ifdef _MSC_VER
#define MONO_CONTEXT_GET_CURRENT(ctx) do { \
mono_context_get_current(ctx); \
} while (0)
#endif
static MonoContext cur_thread_ctx = {0};
MONO_CONTEXT_GET_CURRENT (cur_thread_ctx);
memcpy (&info->ctx, &cur_thread_ctx, sizeof (MonoContext)); //memcpy throws Exception.
Here is the current ASM function.
mono_context_get_current PROTO
.code
mono_context_get_current PROC
mov rax, rcx ;Assume that rcx contains the pointer being passed
mov [rax+00h], rax
mov [rax+08h], rbx
mov [rax+10h], rcx
mov [rax+18h], rdx ;purpose is to offset from my understanding of the GCC assembly
mov [rax+20h], rbp
mov [rax+28h], rsp
mov [rax+30h], rsi
mov [rax+38h], rdi
mov [rax+40h], r8
mov [rax+48h], r9
mov [rax+50h], r10
mov [rax+58h], r11
mov [rax+60h], r12
mov [rax+68h], r13
mov [rax+70h], r14
mov [rax+78h], r15
call $ + 5
mov rdx, [rax+80h]
pop rdx
mono_context_get_current ENDP
END
To my understanding, the rcx register should contain the struct pointer, and I should be using rdx for the pop.
As I mentioned I have GCC ASM for non-Win64 platforms which appears to work on those platforms. This is what that code looks like:
#define MONO_CONTEXT_GET_CURRENT(ctx) \
__asm__ __volatile__( \
"movq $0x0, 0x00(%0)\n" \
"movq %%rbx, 0x08(%0)\n" \
"movq %%rcx, 0x10(%0)\n" \
"movq %%rdx, 0x18(%0)\n" \
"movq %%rbp, 0x20(%0)\n" \
"movq %%rsp, 0x28(%0)\n" \
"movq %%rsi, 0x30(%0)\n" \
"movq %%rdi, 0x38(%0)\n" \
"movq %%r8, 0x40(%0)\n" \
"movq %%r9, 0x48(%0)\n" \
"movq %%r10, 0x50(%0)\n" \
"movq %%r11, 0x58(%0)\n" \
"movq %%r12, 0x60(%0)\n" \
"movq %%r13, 0x68(%0)\n" \
"movq %%r14, 0x70(%0)\n" \
"movq %%r15, 0x78(%0)\n" \
"leaq (%%rip), %%rdx\n" \
"movq %%rdx, 0x80(%0)\n" \
: \
: "a" (&(ctx)) \
: "rdx", "memory")
Thanks for any help you may be able to offer! I'll be the first to admit my assembly is pretty rusty.

You can let gcc create the asm file for you (gcc can emit Intel syntax as well, although the output still uses GAS directives and is not MASM source):
gcc -S -masm=intel myfile.c

Comparing between the two versions there appears to be some discrepancy:
movq $0x0, 0x00(%0)
It doesn't look like rax is being saved; instead, that memory slot is zeroed out.
leaq (%%rip), %%rdx
You should be able to translate that into Intel syntax:
lea rdx, [rip]
which is valid if you're using 64-bit relative addressing mode.
And these lines are incorrectly translated from AT&T syntax:
call $ + 5
mov rdx, [rax+80h] ; looks reversed
pop rdx
Here's how I've translated the original gas syntax above:
mov qword ptr [rcx], 0
mov [rcx + 0x08], rbx
mov [rcx + 0x10], rax
mov [rcx + 0x18], rdx
mov [rcx + 0x20], rbp
mov [rcx + 0x28], rsp
mov [rcx + 0x30], rsi
mov [rcx + 0x38], rdi
mov [rcx + 0x40], r8
mov [rcx + 0x48], r9
mov [rcx + 0x50], r10
mov [rcx + 0x58], r11
mov [rcx + 0x60], r12
mov [rcx + 0x68], r13
mov [rcx + 0x70], r14
mov [rcx + 0x78], r15
lea rdx, [rip]
mov [rcx + 0x80], rdx
mov rdx, [rcx + 0x18] ; restore old rdx since it's on clobber list
Note that I swapped the roles of rcx and rax just to save an extra mov: rcx holds the struct pointer here, so rax is stored in the slot (0x10) where the GAS version stores rcx. You might need to modify this depending on your invariants.
If it still crashes I'd advise stepping through it with a debugger.
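Since the GCC macro passes &(ctx) and the assembly stores through that pointer, it can also help to pin down the layout on the C++ side. The following is a minimal sketch, assuming a hypothetical ContextSketch struct of sixteen 64-bit general-purpose slots followed by rip (the real MonoContext may be declared differently). It checks at compile time that the offsets match the 0x00 to 0x80 displacements used above, and the hypothetical copy_context helper shows a signature that takes the struct by pointer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for MonoContext (an assumption, not mono's definition):
// sixteen general-purpose register slots followed by rip.
struct ContextSketch {
    std::uint64_t gregs[16];  // rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi, r8..r15
    std::uint64_t rip;
};

// The assembly writes to [base+00h] .. [base+78h] and then [base+80h].
static_assert(offsetof(ContextSketch, gregs) == 0x00, "first slot at 0x00");
static_assert(offsetof(ContextSketch, rip) == 0x80, "rip slot at 0x80");
static_assert(sizeof(ContextSketch) == 0x88, "17 slots of 8 bytes each");

// Byte offset of general-purpose slot `index` (0 = rax slot, 15 = r15 slot).
constexpr std::size_t slot_offset(std::size_t index) {
    return offsetof(ContextSketch, gregs) + index * sizeof(std::uint64_t);
}

// Hypothetical helper: takes both contexts by pointer, the same way the
// assembly routine expects to receive the struct's address.
inline void copy_context(ContextSketch* dst, const ContextSketch* src) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, sizeof(ContextSketch));
}
```

If the static_asserts fail for the real struct, the displacements in the assembly would need to change accordingly.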

Related

vector bool compiler xor specialization?

I was thinking again about implementing the quadratic sieve for fun, which requires Gaussian elimination over a binary field; that is, the operations required are 1. swapping rows and 2. XORing rows.
My ideas were either to maintain a bit array using a vector of 64-bit ints and bit twiddling, or use vector<bool>, which is probably space-optimized on my system. The bit array must be able to be dynamically sized, so std::bitset won't work. The advantage of maintaining my own ints is that I can XOR 64 bits at a time which is a neat trick. I wanted to see what a compiler would do for a loop that XOR'd bool vectors: (I wasn't able to use ^=, see operator |= on std::vector<bool>)
void xor_vector(std::vector<bool>& a, std::vector<bool>& b) {
for (std::size_t i=0; i<a.size(); ++i)
a[i] = a[i] ^ b[i];
}
I have a very basic understanding of x86 but it looks like the compiler isn't actually XORing words together? Is there a way to get the compiler to XOR entire words at a time?
https://godbolt.org/z/PbGdv3sKT
xor_vector(std::vector<bool, std::allocator<bool> >&, std::vector<bool, std::allocator<bool> >&):
mov r8, QWORD PTR [rdi]
mov rax, QWORD PTR [rdi+16]
mov edx, DWORD PTR [rdi+24]
sub rax, r8
lea rdi, [rdx+rax*8]
test rdi, rdi
je .L11
push rbp
mov r10d, 1
push rbx
mov r9, QWORD PTR [rsi]
xor esi, esi
jmp .L7
.L16:
mov rdx, r10
sal rdx, cl
mov rcx, QWORD PTR [r11]
mov rbp, rdx
test rdx, rcx
setne bl
and rbp, QWORD PTR [rax]
setne bpl
.L4:
mov rax, rdx
not rdx
or rax, rcx
and rdx, rcx
cmp bpl, bl
cmovne rdx, rax
add rsi, 1
mov QWORD PTR [r11], rdx
cmp rsi, rdi
je .L15
.L7:
test rsi, rsi
lea rax, [rsi+63]
mov rdx, rsi
cmovns rax, rsi
sar rdx, 63
shr rdx, 58
sar rax, 6
lea rcx, [rsi+rdx]
sal rax, 3
and ecx, 63
lea r11, [r8+rax]
add rax, r9
sub rcx, rdx
jns .L16
add rcx, 64
mov rdx, r10
sal rdx, cl
mov rcx, QWORD PTR [r11-8]
mov rbp, rdx
test rcx, rdx
setne bl
and rbp, QWORD PTR [rax-8]
setne bpl
sub r11, 8
jmp .L4
.L15:
pop rbx
pop rbp
ret
.L11:
ret
My question is similar to bitwise operations on vector<bool> but the answers are dated and don't seem to answer my question.
Update: I tested with a 256 bit sized bitset too. Still I don't see XORing whole machine words.
void xor_vector(std::bitset<256>& a, std::bitset<256>& b) {
for (std::size_t i=0; i<a.size(); ++i)
a[i] = a[i] ^ b[i];
}
https://godbolt.org/z/jKEf89E1j
xor_vector(std::bitset<256ul>&, std::bitset<256ul>&):
push rbx
mov r8, rdi
mov r11, rsi
xor edx, edx
mov ebx, 1
.L4:
mov rsi, rdx
mov rcx, rdx
mov rax, rbx
shr rsi, 6
and ecx, 63
sal rax, cl
mov rdi, QWORD PTR [r8+rsi*8]
mov rcx, rax
and rcx, QWORD PTR [r11+rsi*8]
mov rcx, rax
setne r10b
test rax, rdi
not rax
setne r9b
or rcx, rdi
and rax, rdi
cmp r10b, r9b
cmovne rax, rcx
add rdx, 1
mov QWORD PTR [r8+rsi*8], rax
cmp rdx, 256
jne .L4
pop rbx
ret
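The hand-rolled alternative mentioned in the question, a vector of 64-bit ints with bit twiddling, can be sketched as follows. This is only an illustration under stated assumptions: BitRow and xor_rows are hypothetical names, and the layout (bit i lives in word i / 64 at position i % 64) is one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dynamically sized bit row packed into 64-bit words.
struct BitRow {
    std::size_t bits;                  // number of valid bits
    std::vector<std::uint64_t> words;  // ceil(bits / 64) words, zero-initialized

    explicit BitRow(std::size_t n) : bits(n), words((n + 63) / 64, 0) {}

    void set(std::size_t i, bool value) {
        assert(i < bits);
        const std::uint64_t mask = std::uint64_t{1} << (i % 64);
        if (value) {
            words[i / 64] |= mask;
        } else {
            words[i / 64] &= ~mask;
        }
    }

    bool get(std::size_t i) const {
        assert(i < bits);
        return ((words[i / 64] >> (i % 64)) & 1u) != 0;
    }
};

// a ^= b, processing 64 bits per loop iteration.
inline void xor_rows(BitRow& a, const BitRow& b) {
    assert(a.bits == b.bits);
    for (std::size_t w = 0; w < a.words.size(); ++w) {
        a.words[w] ^= b.words[w];
    }
}
```

Because the loop body is a plain word XOR over contiguous storage, the compiler sees whole-word operations directly instead of having to reconstruct them from per-bit proxy accesses. Swapping two rows is a std::swap of the two BitRow objects.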

Understanding what clang is doing in assembly, decrementing for a loop that is incrementing

Consider the following code, in C++:
#include <cstdlib>
std::size_t count(std::size_t n)
{
std::size_t i = 0;
while (i < n) {
asm volatile("": : :"memory");
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
It is just a loop that is incrementing its value, and returns it at the end. The asm volatile prevents the loop from being optimized away. We compile it under g++ 8.1 and clang++ 5.0 with the arguments -Wall -Wextra -std=c++11 -g -O3.
Now, if we look at what compiler explorer is producing, we have, for g++:
count(unsigned long):
mov rax, rdi
test rdi, rdi
je .L2
xor edx, edx
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
.L2:
ret
main:
mov eax, 1
xor edx, edx
cmp edi, 1
jg .L25
.L21:
add rdx, 1
cmp rdx, rax
jb .L21
mov eax, edx
ret
.L25:
push rcx
mov rdi, QWORD PTR [rsi+8]
mov edx, 10
xor esi, esi
call strtoll
mov rdx, rax
test rax, rax
je .L11
xor edx, edx
.L12:
add rdx, 1
cmp rdx, rax
jb .L12
.L11:
mov eax, edx
pop rdx
ret
and for clang++:
count(unsigned long): # #count(unsigned long)
test rdi, rdi
je .LBB0_1
mov rax, rdi
.LBB0_3: # =>This Inner Loop Header: Depth=1
dec rax
jne .LBB0_3
mov rax, rdi
ret
.LBB0_1:
xor edi, edi
mov rax, rdi
ret
main: # #main
push rbx
cmp edi, 2
jl .LBB1_1
mov rdi, qword ptr [rsi + 8]
xor ebx, ebx
xor esi, esi
mov edx, 10
call strtoll
test rax, rax
jne .LBB1_3
mov eax, ebx
pop rbx
ret
.LBB1_1:
mov eax, 1
.LBB1_3:
mov rcx, rax
.LBB1_4: # =>This Inner Loop Header: Depth=1
dec rcx
jne .LBB1_4
mov rbx, rax
mov eax, ebx
pop rbx
ret
Understanding the code generated by g++ is not that complicated; the loop is:
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
Every iteration increments rdx and compares it to rax, which holds n, the loop bound.
Now, I have no idea of what clang++ is doing. Apparently it uses dec, which is weird to me, and I don't even understand where the actual loop is. My question is the following: what is clang doing?
(I am looking for comments about the clang assembly code to describe what is done at each step and how it actually works).
The effect of the function is to return n, either by counting up to n and returning the result, or by simply returning the passed-in value of n. The clang code does the latter. The counting loop is here:
mov rax, rdi
.LBB0_3: # =>This Inner Loop Header: Depth=1
dec rax
jne .LBB0_3
mov rax, rdi
ret
It begins by copying the value of n into rax. It decrements the value in rax, and if the result is not 0, it jumps back to .LBB0_3. If the value is 0 it falls through to the next instruction, which copies the original value of n into rax and returns.
There is no i stored, but the code does the loop the prescribed number of times, and returns the value that i would have had, namely, n.
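To make the transformation easier to see, here is a rough C++ rendering of both shapes. It is a sketch for illustration only: count_up and count_down are hypothetical names, and the asm volatile barrier from the question is omitted, so a compiler is free to fold these loops away (the returned values are unaffected).

```cpp
#include <cassert>
#include <cstddef>

// Source-level shape: i counts up until it reaches n.
std::size_t count_up(std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        ++i;
    }
    return i;
}

// Shape of the clang output: a separate counter runs from n down to 0,
// and the original n is returned at the end.
std::size_t count_down(std::size_t n) {
    if (n == 0) {                  // test rdi, rdi / je .LBB0_1
        return 0;                  // xor edi, edi / mov rax, rdi / ret
    }
    std::size_t remaining = n;     // mov rax, rdi
    do {
        --remaining;               // dec rax
    } while (remaining != 0);      // jne .LBB0_3
    assert(remaining == 0);        // the body has run exactly n times
    return n;                      // mov rax, rdi / ret
}
```

Counting down lets the dec instruction set the zero flag itself, so no separate cmp is needed inside the loop.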

Broken CPUID brand string?

I am printing some information about the CPU in my OS using the CPUID instruction.
Reading and printing the vendor string (GenuineIntel) works well, but reading the brand string gives me a rather strange string.
ok cpu-info <= Run command
CPU Vendor name: GenuineIntel <= Vendor string is good
CPU Brand: D: l(R) Core(TMD: CPU MD: <= What..?
ok
The brand string is supposed to be:
Intel(R) Core(TM) i5 CPU M 540
But what I got is:
D: l(R) Core(TMD: CPU MD:
C++ code:
char vendorString[13] = { 0, };
Dword eax, ebx, ecx, edx;
ACpuid(0, &eax, &ebx, &ecx, &edx);
*((Dword*)vendorString) = ebx;
*((Dword*)vendorString + 1) = edx;
*((Dword*)vendorString + 2) = ecx;
Console::Output.Write(L"CPU vendor name: ");
for (int i = 0; i < 13; i++) {
Console::Output.Write((wchar_t)(vendorString[i]));
}
Console::Output.WriteLine();
char brandString[48] = { 0, };
ACpuid(0x80000002, &eax, &ebx, &ecx, &edx);
*((Dword*)brandString) = eax;
*((Dword*)brandString + 1) = ebx;
*((Dword*)brandString + 2) = ecx;
*((Dword*)brandString + 3) = edx;
ACpuid(0x80000003, &eax, &ebx, &ecx, &edx);
*((Dword*)brandString + 4) = eax;
*((Dword*)brandString + 5) = ebx;
*((Dword*)brandString + 6) = ecx;
*((Dword*)brandString + 7) = edx;
ACpuid(0x80000004, &eax, &ebx, &ecx, &edx);
*((Dword*)brandString + 8) = eax;
*((Dword*)brandString + 9) = ebx;
*((Dword*)brandString + 10) = ecx;
*((Dword*)brandString + 11) = edx;
Console::Output.Write(L"CPU brand: ");
for (int i = 0; i < 48; i++) {
Console::Output.Write((wchar_t) brandString[i]);
}
Console::Output.WriteLine();
NOTE:
This program is a UEFI application. No problem with permissions.
Console is a wrapper class for the EFI console. Not C# stuff.
Dword = unsigned 32-bit integer
Assembly code(MASM):
;Cpuid command
;ACpuid(Type, pEax, pEbx, pEcx, pEdx)
ACpuid Proc
;Type => Rcx
;pEax => Rdx
;pEbx => R8
;pEcx => R9
;pEdx => [ rbp + 48 ] ?
push rbp
mov rbp, rsp
push rax
push rsi
mov rax, rcx
cpuid
mov [ rdx ], eax
mov [ r8 ], ebx
mov [ r9 ], ecx
mov rsi, [ rbp + 48 ]
mov [ rsi ], rdx
pop rsi
pop rax
pop rbp
ret
ACpuid Endp
I agree with Ross Ridge that you should use the compiler intrinsic __cpuid. As for why your code likely doesn't work as is - there are some bugs that will cause problems.
CPUID destroys the contents of RAX, RBX, RCX, and RDX and yet you do this in your code:
cpuid
mov [ rdx ], eax
RDX has been destroyed by the time mov [ rdx ], eax is executed, rendering the pointer in RDX invalid. You'll need to move RDX to another register before using the CPUID instruction.
Per the Windows 64-bit Calling Convention these are the volatile registers that need to be preserved by the caller:
The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile and must be considered destroyed on function calls (unless otherwise safety-provable by analysis such as whole program optimization).
These are the non-volatile ones that need to be preserved by the callee:
The registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered nonvolatile and must be saved and restored by a function that uses them.
We can use R10 (a volatile register) to store RDX temporarily. Rather than use RSI in the code we can reuse R10 for updating the value at pEdx. We won't need to preserve RSI if we don't use it. CPUID does destroy RBX, and RBX is non-volatile, so we need to preserve it. RAX is volatile so we don't need to preserve it.
In your code you have this line:
mov [ rsi ], rdx
RSI is a memory address (pEdx) provided by the caller to store the value in EDX. The code you have would move the contents of the 8-byte register RDX to a memory location that was expecting a 4-byte DWORD. This could potentially trash data in the caller. This really should have been:
mov [ rsi ], edx
With all of the above in mind we could code the ACpuid routine this way:
option casemap:none
.code
;Cpuid command
;ACpuid(Type, pEax, pEbx, pEcx, pEdx)
ACpuid Proc
;Type => Rcx
;pEax => Rdx
;pEbx => R8
;pEcx => R9
;pEdx => [ rbp + 48 ] ?
push rbp
mov rbp, rsp
push rbx ; Preserve RBX (destroyed by CPUID)
mov r10, rdx ; Save RDX before CPUID
mov rax, rcx
cpuid
mov [ r10 ], eax
mov [ r8 ], ebx
mov [ r9 ], ecx
mov r10, [ rbp + 48 ]
mov [ r10 ], edx ; Last parameter is pointer to 32-bit DWORD,
; Move EDX to the memory location, not RDX
pop rbx
pop rbp
ret
ACpuid Endp
end
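On the C++ side, the twelve register values can be packed into the brand string without the pointer casts. The following is a minimal sketch, assuming a hypothetical brand_from_regs helper that receives EAX, EBX, ECX, EDX from leaves 0x80000002, 0x80000003 and 0x80000004 in that order; how the values are obtained (the corrected ACpuid or an intrinsic) is left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: regs[0..3] come from leaf 0x80000002, regs[4..7] from
// 0x80000003, regs[8..11] from 0x80000004, each in EAX, EBX, ECX, EDX order.
inline std::string brand_from_regs(const std::array<std::uint32_t, 12>& regs) {
    static_assert(sizeof(std::uint32_t) * 12 == 48, "brand string is 48 bytes");
    char buffer[49] = {0};                 // 48 bytes plus a guaranteed terminator
    std::memcpy(buffer, regs.data(), 48);  // bytes are copied in memory order
    return std::string(buffer);            // stops at the first NUL byte
}
```

Copying with memcpy avoids writing 32-bit values through a char array via casts, and the extra terminator byte keeps the result well-formed even if all 48 bytes are non-zero.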

Segmentation fault in NASM 64bit

I am trying to output the result to the user after getting 3 inputs from scanf.
When I run my code, I am able to get the input I need. However it crashes after I collect the input and begin the calculation.
By the way, I am using Ubuntu 14.04 with g++ and NASM 64bit.
Here's how it should look:
This program is brought to you by Chris Tarazi
Welcome to Areas of Trapezoids
Please enter one of the base numbers: 5.8
Please enter the other base number: 2.2
Please enter the height: 6.5
****//Crashes here with Segmentation fault (core dumped)****
The area of a trapezoid with sizes 5.799999999999999365, 2.200000000000000153,
and 6.500000000000000000 is 26.000000000000000328
Have a nice day. Enjoy your trapezoids.
C++ file:
#include <stdio.h>
#include <stdint.h>
extern "C" double ComputeArea(); // links with global in assembly
using namespace std;
int main()
{
double area;
printf("This program is brought to you by Chris Tarazi.\n");
area = ComputeArea();
printf("Have a nice day. Enjoy your trapezoids.\n");
return 0;
}
Assembly file:
extern printf ; This function will be linked later.
extern scanf
global ComputeArea ; Declare function global to link with "extern" from C++.
;---------------------------------Declare variables-------------------------------------------
segment .data
welcome: db "Welcome to the area of trapezoids.", 10, 0
input: db "Please enter one of the base numbers: ", 0
secInput: db "Please enter the other base number: ", 0
output: db "The area of a trapezoid with sizes %1.18lf, %1.18lf, and %1.18lf is %1.18lf .", 10, 0
hInput: db "Please enter the height: ", 0
inputformat: db "%lf", 0
stringformat: db "%s", 0
fourfloatformat: db "%1.18lf %1.18lf %1.18lf %1.18lf", 0
;---------------------------------Begin segment of executable code------------------------------
segment .text
ComputeArea: ; Area of trapezoid = ((a + b) / 2) * h.
push rbp ; Save a copy of the stack base pointer
mov rbp, rsp ; We do this in order to be 100% compatible with C and C++.
push rbx ; Back up rbx
push rcx ; Back up rcx
push rdx ; Back up rdx
push rsi ; Back up rsi
push rdi ; Back up rdi
push r8 ; Back up r8
push r9 ; Back up r9
push r10 ; Back up r10
push r11 ; Back up r11
push r12 ; Back up r12
push r13 ; Back up r13
push r14 ; Back up r14
push r15 ; Back up r15
pushf ; Back up rflags
;---------------------------------Output messages to user---------------------------------------
mov qword rax, 0
mov rdi, stringformat
mov rsi, welcome
call printf
mov qword rax, 0
mov rdi, stringformat
mov rsi, input
call printf
push qword 0
mov qword rax, 0
mov rdi, inputformat
mov rsi, rsp ;firstbase
call scanf
movsd xmm0, [rsp]
pop rax
mov qword rax, 0
mov rdi, stringformat
mov rsi, secInput
call printf
push qword 0
mov qword rax, 0
mov rdi, inputformat
mov rsi, rsp ;secondbase
call scanf
movsd xmm1, [rsp + 4]
pop rax
mov qword rax, 0
mov rdi, stringformat
mov rsi, hInput
call printf
push qword 0
mov qword rax, 0
mov rdi, inputformat
mov rsi, rsp ;height
call scanf
movsd xmm2, [rsp + 8]
pop rax
;---------------------------------Begin ComputeArea Calculation-----------------------------------
mov rax, 2
cvtsi2sd xmm3, rax
addsd xmm0, xmm1
divsd xmm0, xmm3
mulsd xmm0, xmm2
ret
;---------------------------------Output result to user-------------------------------------------
mov rax, 3
mov rdi, output
call printf
First off, why on earth are you saving ALL of those registers?!? The ABI for 64-bit Linux says you only need to save rbx, rbp, and r12 - r15, and only if you use those registers in your function. Also, since you are writing assembler, there is no need to create a stack frame in 64-bit land (plus you aren't even using rbp, so why create a stack frame?). The only thing that is very important is to make sure your stack is aligned on a 16-byte boundary before each call. call pushes an 8-byte return address, so all you need in your ComputeArea function is sub rsp, 8 at the start and add rsp, 8 right before your ret.
In your first scanf you are using rsp without adjusting it, you just overwrote something!
You do some computations here:
mov rax, 2
cvtsi2sd xmm3, rax
addsd xmm0, xmm1
divsd xmm0, xmm3
mulsd xmm0, xmm2
ret
You return from the procedure here but do not pop all of those registers you just pushed!! So basically your stack pointer is all messed up! The CPU does not know what the return address is!
What you do in the prologue, must be reversed in the epilogue before you return!
Maybe, you should start simple, read in 3 floats and try to print them!
When I correct your code, this is my output:
Welcome to the area of trapezoids.
Please enter one of the base numbers: 5.8
Please enter the other base number: 2.2
Please enter the height: 6.5
The area of a trapezoid with sizes 5.799999999999999822, 2.200000000000000178, and 6.500000000000000000 is 26.000000000000000000 .
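As a reference for the arithmetic the assembly is meant to perform (addsd, divsd by 2, mulsd), the same formula in C++ looks like this. The function name trapezoid_area is only an illustrative assumption.

```cpp
#include <cassert>

// Area of a trapezoid: ((a + b) / 2) * h, the formula from the comment
// at the top of ComputeArea.
inline double trapezoid_area(double base_a, double base_b, double height) {
    assert(height >= 0.0);
    return ((base_a + base_b) / 2.0) * height;
}
```

Calling trapezoid_area(5.8, 2.2, 6.5) gives 26, which agrees with the corrected program's output above and makes a handy check while debugging the assembly version.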

Convert AT&T syntax to Intel Syntax (ASM)

I've been trying to access the peb information of an executable as seen here: Access x64 TEB C++ & Assembly
The code works only in AT&T syntax for some odd reason but when I try to use Intel syntax, it fails to give the same value. There's of course an error on my part. So I'm asking..
How can I convert:
int main()
{
void* ptr = 0; //0x7fff5c4ff3c0
asm volatile
(
"movq %%gs:0x30, %%rax\n\t"
"movq 0x60(%%rax), %%rax\n\t"
"movq 0x18(%%rax), %%rax\n\t"
"movq %%rax, %0\n"
: "=r" (ptr) ::
);
}
to Intel Syntax?
I tried:
asm volatile
(
"movq rax, gs:[0x30]\n\t"
"movq rax, [rax + 0x60]\n\t"
"movq rax, [rax + 0x18]\n\t"
"movq rax, %0\n"
: "=r" (ptr) ::
);
and:
asm volatile
(
"mov rax, QWORD PTR gs:[0x30]\n\t"
"mov rax, QWORD PTR [rax + 0x60]\n\t"
"mov rax, QWORD PTR [rax + 0x18]\n\t"
"movq rax, %0\n" //mov rax, QWORD PTR [%0]\n
: "=r" (ptr) ::
);
They do not print the same value as the AT&T syntax: 0x7fff5c4ff3c0
Any ideas?
You forgot to reverse operand order on the last line. That said, the only instruction you need to have in asm is the first one due to the gs segment override, the rest could be done in C.
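The part that could be done in C or C++ is the pair of dereferences at offsets 0x60 and 0x18. A minimal sketch follows, assuming the value loaded from gs:0x30 has already been obtained (by the single asm instruction) and is passed in as a plain pointer; read_pointer_at and follow_chain are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: reads the pointer-sized value stored at base + offset.
inline void* read_pointer_at(const void* base, std::size_t offset) {
    assert(base != nullptr);
    void* value = nullptr;
    std::memcpy(&value,
                static_cast<const unsigned char*>(base) + offset,
                sizeof(value));
    return value;
}

// Follows the two loads from the question:
//   movq 0x60(%rax), %rax
//   movq 0x18(%rax), %rax
// `teb` is assumed to be the value that the gs:0x30 load produced.
inline void* follow_chain(const void* teb) {
    void* second = read_pointer_at(teb, 0x60);
    return read_pointer_at(second, 0x18);
}
```

With this split, the inline asm shrinks to the one gs-relative load, which leaves much less operand ordering to get wrong.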