How do I convert a C++ function to assembly (x86_64)?

This is my .CPP file
#include <cstdio> // printf
#include <iostream>
using namespace std;
extern "C" void KeysAsm(int arr[], int n, int thetha, int rho);
// Keep this and call it from assembler
extern "C"
void crim(int *xp, int *yp) {
int temp = *xp;
*xp = *yp;
*yp = temp+2;
}
// Translate this into Intel assembler
void KeysCpp(int arr[], int n, int thetha, int rho){
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
crim(&arr[j], &arr[j + 1]);
}
}
arr[i]= arr[i] + thetha / rho * 2 - 4;
}
}
// Function to print an array
void printArray(int arr[], int size){
int i;
for (i = 0; i < size; i++)
cout << arr[i] << "\n";
cout << endl;
}
int main() {
int gamma1[]{
9,
270,
88,
-12,
456,
80,
45,
123,
427,
999
};
int gamma2[]{
900,
312,
542,
234,
234,
1,
566,
123,
427,
111
};
printf("Array:\n");
printArray(gamma1, 10);
KeysAsm(gamma1, 10, 5, 6);
printf("Array Result Asm:\n");
printArray(gamma1, 10);
KeysCpp(gamma2, 10, 5, 6);
printf("Array Result Cpp:\n");
printArray(gamma2, 10);
}
I want to convert the KeysCpp function into assembly language and call it from this same .CPP file. The crim function should stay in the .CPP file as it is; only KeysCpp should be converted.
Here is my .ASM file
PUBLIC KeysAsm
includelib kernel32.lib
_DATA SEGMENT
EXTERN crim:PROC
_DATA ENDS
_TEXT SEGMENT
KeysAsm PROC
push rbp
mov rbp, rsp
sub rsp, 40
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov DWORD PTR [rbp-32], edx
mov DWORD PTR [rbp-36], ecx
mov DWORD PTR [rbp-4], 0
jmp L3
L3:
mov eax, DWORD PTR [rbp-28]
sub eax, 1
cmp DWORD PTR [rbp-4], eax
jl L7
L4:
mov eax, DWORD PTR [rbp-28]
sub eax, DWORD PTR [rbp-4]
sub eax, 1
cmp DWORD PTR [rbp-8], eax
jl L6
L5:
add DWORD PTR [rbp-8], 1
L6:
mov eax, DWORD PTR [rbp-8]
cdqe
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov edx, DWORD PTR [rax]
mov eax, DWORD PTR [rbp-8]
cdqe
add rax, 1
lea rcx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rcx
mov eax, DWORD PTR [rax]
cmp edx, eax
jle L5
mov eax, DWORD PTR [rbp-8]
cdqe
add rax, 1
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rdx, rax
mov eax, DWORD PTR [rbp-8]
cdqe
lea rcx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rcx
mov rsi, rdx
mov rdi, rax
call crim
L7:
mov DWORD PTR [rbp-8], 0
jmp L4
KeysAsm ENDP
_TEXT ENDS
END
I am using Visual Studio 2017 to run this project.
I get the following error when I run this code:
Unhandled exception at 0x00007FF74B0E429C in MatrixMultiplication.exe: Stack cookie instrumentation code detected a stack-based buffer overrun. occurred

Your asm looks like it's expecting the x86-64 System V calling convention, with args in RDI, ESI, EDX, ECX. But you said you're compiling with Visual Studio, so the compiler-generated code will use the Windows x64 calling convention: RCX, EDX, R8D, R9D.
And when you call crim, it can use shadow space (32 bytes above its return address, which you didn't reserve space for).
It looks like you got this asm from un-optimized compiler output, probably from https://godbolt.org/z/ea4MPh81r using GCC for Linux, without using -mabi=ms to override the default -mabi=sysv when compiling for non-Windows targets. And then you modified it to make the loop infinite, with a jmp at the bottom instead of a ret? Maybe a different GCC version than 12.2 since the label numbers and code don't match exactly.
(The signs of un-optimized compiler output are all the reloads from [rbp-whatever], and redoing sign-extension with cdqe before using an int to index an array. A human would know the int must be non-negative. The signs of GCC specifically are the numbered labels like .L1: (where you just removed the .) and the heavy use of RAX for as much as possible in a debug build. Choices like lea rdx, [0+rax*4] to copy-and-shift, and the exact way that instruction is printed in Intel syntax, also match GCC.)
To compile a single function for Windows x64, isolate it and give the compiler only prototypes for anything it calls:
extern "C" void crim(int *xp, int *yp); // prototype only
void KeysCpp(int arr[], int n, int thetha, int rho){
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
crim(&arr[j], &arr[j + 1]);
}
}
arr[i]= arr[i] + thetha / rho * 2 - 4;
}
}
Then on Godbolt, use gcc -O3 -mabi=ms, or use MSVC which always targets Windows. https://godbolt.org/z/Mj5Gb54b5 shows both GCC and MSVC with optimization enabled.
KeysCpp(int*, int, int, int): ; demangled name
cmp edx, 1
jle .L11 ; "shrink wrap" optimization: early-out on n<=1 before saving regs
push r15 ; save some call-preserved regs
push r14
lea r14, [rcx+4] ; arr + 1
push r13
mov r13, rcx
Unfortunately GCC fails to hoist the thetha / rho * 2 - 4 loop-invariant, instead redoing idiv every time through the loop. Seems like an obvious optimization since those are local vars whose address hasn't been taken at all, and it keeps thetha (typo for theta?) and rho in registers. So MSVC is much more efficient here. Clang also misses this optimization.
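Since thetha / rho * 2 - 4 does not change inside the loop, it can also be hoisted by hand in the C++ source instead of relying on the compiler. The following is only a minimal sketch: KeysCppHoisted, sameResult and checkHoisting are hypothetical names, and crim is repeated here so the block compiles on its own. The n <= 1 early-out keeps the division from running in cases where the original loop body would never execute (for example rho == 0 with a one-element array).

```cpp
#include <cassert>
#include <vector>

// Same behaviour as crim() in the question: swap, then add 2 to the second slot.
void crim(int *xp, int *yp) {
    int temp = *xp;
    *xp = *yp;
    *yp = temp + 2;
}

// Original loop from the question, kept as the reference.
void KeysCpp(int arr[], int n, int thetha, int rho) {
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j + 1]) crim(&arr[j], &arr[j + 1]);
        }
        arr[i] = arr[i] + thetha / rho * 2 - 4;
    }
}

// Hypothetical hand-hoisted variant: one division per call
// instead of one per outer-loop iteration.
void KeysCppHoisted(int arr[], int n, int thetha, int rho) {
    if (n <= 1) return;                      // loop would not run; skip the division
    const int delta = thetha / rho * 2 - 4;  // loop-invariant part
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j + 1]) crim(&arr[j], &arr[j + 1]);
        }
        arr[i] += delta;
    }
}

// Returns true when both versions produce the same array for the given input.
bool sameResult(std::vector<int> data, int thetha, int rho) {
    std::vector<int> a = data, b = data;
    KeysCpp(a.data(), static_cast<int>(a.size()), thetha, rho);
    KeysCppHoisted(b.data(), static_cast<int>(b.size()), thetha, rho);
    return a == b;
}

// Compares the two versions on the gamma2 data from the question.
void checkHoisting() {
    assert(sameResult({900, 312, 542, 234, 234, 1, 566, 123, 427, 111}, 5, 6));
}
```

Calling checkHoisting() compares the two versions on the gamma2 input from the question. Whether the manual hoist matters in practice depends on the compiler and flags, so it is worth checking the generated code on Godbolt.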

Why is the object prefix converted to function argument?

In the learncpp article about the hidden this pointer, the author mentioned that the compiler converts the object prefix to an argument passed by address to the function.
In the example:
simple.setID(2);
Will be converted to:
setID(&simple, 2); // note that simple has been changed from an object prefix to a function argument!
Why does the compiler do this? I've tried searching other documentation about it but couldn't find any. I've asked other people but they say it is a mistake or the compiler doesn't do that.
I have a second question on this topic. Let's go back to the example:
simple.setID(2); //Will be converted to setID(&simple, 2);
If the compiler converts it, won't it just look exactly like a function that has a name of setID and has two parameters?
void setID(MyClass* obj, int id) {
return;
}
int main() {
MyClass simple;
simple.setID(2); //Will be converted to setID(&simple, 2);
setID(&simple, 2);
}
Line 6 and 7 would look exactly the same.
"object prefix to an argument passed by address to the function"
This refers to how implementations usually translate the call to machine code (but they could do it any other way).
"Why does the compiler do this?"
The called member function needs some way to refer to the object, and one way is to handle the object's address like an argument.
"If the compiler converts it, won't it just look exactly like a function that has a name of setID and has two parameters?"
If you have this code:
struct Test {
int v = 0;
Test(int v ) : v(v) {
}
void test(int a) {
int v = this->v;
int r = a;
}
};
void test(Test* t, int a) {
int v = t->v;
int r = a + v;
}
int main() {
Test a(2);
a.test(1);
test(&a, 1);
return 0;
}
gcc-12 generates this assembly (for x86-64, with optimizations turned off):
Test::Test(int) [base object constructor]:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov DWORD PTR [rbp-12], esi
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rbp-12]
mov DWORD PTR [rax], edx
nop
pop rbp
ret
Test::test(int a):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
// int v = this->v;
mov rax, QWORD PTR [rbp-24]
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-4], eax
// int r = a;
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-8], eax
// end of function
nop
pop rbp
ret
test(Test* t, int a):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
// int v = t->v;
mov rax, QWORD PTR [rbp-24]
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-4], eax
// int r = a + v;
mov edx, DWORD PTR [rbp-28]
mov eax, DWORD PTR [rbp-4]
add eax, edx
mov DWORD PTR [rbp-8], eax
// end of function
nop
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-4]
mov esi, 2
mov rdi, rax
call Test::Test(int) [complete object constructor]
// a.test(1);
lea rax, [rbp-4]
mov esi, 1
mov rdi, rax
call Test::test(int)
// test(&a, 1);
lea rax, [rbp-4]
mov esi, 1
mov rdi, rax
call test(Test*, int)
// end of main
mov eax, 0
leave
ret
So the machine code generated without optimizations for the two calls, a.test(1) and test(&a, 1), looks identical: in both cases the address of a goes into rdi and 1 goes into esi. That is what the statement refers to.
But again, this is an implementation detail of how the compiler translates C++ to machine code, not part of the C++ language itself.
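The same point can be seen from the C++ side without reading assembly. A pointer to a member function can be invoked with the object passed as an explicit first argument, while a free function of the same name remains a separate function (just as Test::test(int) and test(Test*, int) are separate symbols in the listing). A minimal sketch, assuming C++17 for std::invoke; Simple, setID and demo are illustrative names, and the free function deliberately adds 100 so the two can be told apart:

```cpp
#include <cassert>
#include <functional>

struct Simple {
    int id = 0;
    void setID(int newId) { id = newId; }  // receives the object via the implicit `this`
};

// Free function with an explicit object parameter.
// It is a different function from Simple::setID, so it can behave differently.
void setID(Simple* obj, int newId) { obj->id = newId + 100; }

int demo() {
    Simple simple;

    simple.setID(2);                          // member call, object passed implicitly
    assert(simple.id == 2);

    std::invoke(&Simple::setID, &simple, 3);  // same member, object passed explicitly
    assert(simple.id == 3);

    setID(&simple, 4);                        // free function, resolved separately
    return simple.id;
}
```

Here demo() returns 104, because the last call resolves to the free function and not to the member.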

Replacing even array elements with zeros in assembler

I have the following problem:
I have an array of a given size. Every element with an even value must be set to zero, and the modified array returned.
Here is my C++ code:
#include <cstdlib> // system
#include <iostream>
int main()
{
const int size = 10;
int arr[size] = { 1,2,3,4,5,6,7,8,9,10 };
for (int i = 0; i < size; i++)
{
if (arr[i] % 2 == 0)
arr[i] = 0;
}
for (int i = 0; i < size; i++)
{
std::cout << arr[i] << ' ';
}
system("pause");
return 0;
}
Here is my NASM code:
%include "io64.inc"
section .text
global CMAIN
CMAIN:
mov DWORD size$[rbp], 10 ; size of array
; elements of array
mov DWORD arr$[rbp], 1
mov DWORD arr$[rbp+4], 2
mov DWORD arr$[rbp+8], 3
mov DWORD arr$[rbp+12], 4
mov DWORD arr$[rbp+16], 5
mov DWORD arr$[rbp+20], 6
mov DWORD arr$[rbp+24], 7
mov DWORD arr$[rbp+28], 8
mov DWORD arr$[rbp+32], 9
mov DWORD arr$[rbp+36], 10
mov DWORD i$4[rbp], 0
jmp SHORT $LN4@main
$LN2@main:
mov eax, DWORD i$4[rbp]
inc eax
mov DWORD i$4[rbp], eax
$LN4@main:
cmp DWORD i$4[rbp], 10
jge SHORT $LN3@main
movsxd rax, DWORD i$4[rbp]
mov eax, DWORD arr$[rbp+rax*4]
cdq
and eax, 1
xor eax, edx
sub eax, edx
test eax, eax
jne SHORT $LN8@main
movsxd rax, DWORD i$4[rbp]
mov DWORD arr$[rbp+rax*4], 0
$LN8@main:
jmp SHORT $LN2@main
$LN3@main:
mov DWORD i$5[rbp], 0
jmp SHORT $LN7@main
$LN5@main:
mov eax, DWORD i$5[rbp]
inc eax
mov DWORD i$5[rbp], eax
$LN7@main:
cmp DWORD i$5[rbp], 10
jge SHORT $LN6@main
$LN6@main:
mov edi, eax
lea rcx, QWORD [rbp-32]
mov eax, edi
lea rsp, QWORD [rbp+360]
pop rdi
pop rbp
ret
My question is: will this code work and how can it be optimized?
I just get errors like this:
C:\Users\79268\AppData\Local\Temp\SASM\program.asm:1: fatal: unable to open include file `io64.inc'
gcc.exe: error: C:\Users\79268\AppData\Local\Temp\SASM\program.o: No such file or directory
C:\Users\79268\AppData\Local\Temp\SASM\program.asm:6: error: comma, colon, decorator or end of line expected after operand
C:\Users\79268\AppData\Local\Temp\SASM\program.asm:9: error: comma, colon, decorator or end of line expected after operand
C:\Users\79268\AppData\Local\Temp\SASM\program.asm:10: error: comma, colon, decorator or end of line expected after operand
So I installed NASM and the SASM IDE correctly...
I tried to rewrite the code and compile it in SASM:
section .data
arr dd 1, 2, 3, 4, 5, 6, 7, 8, 9,10
section .text
global CMAIN
CMAIN:
call calc
push arr
calc:
lea rsi, [arr]
mov rcx, [10]
mov rdi, rsi
xor rbx, rbx
##for:
lodsq
test al, 1
cmovz rax, rbx
stosq
loop ##for
ret
And I still get an error:
c:/program files (x86)/sasm/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.8.1/../../../../x86_64-w64-mingw32/lib/../lib/libmingw32.a(lib64_libmingw32_a-crt0_c.o):
crt0_c.c:(.text.startup+0x25): undefined reference to `WinMain'
Now I get an error for the updated code as well:
BITS 64
section .text
global _start
_start:
push arr
lea rsi, [arr]
mov rcx, [10]
mov rdi, rsi
xor rbx, rbx
##for:
lodsq
test al, 1
cmovz rax, rbx
stosq
loop ##for
ret
section .data
arr dd 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$nasm -f elf *.asm; ld -m elf_i386 -s -o demo *.o
$demo
/usr/bin/timeout: the monitored command dumped core
sh: line 1: 21184 Segmentation fault /usr/bin/timeout 10s demo
Any idea how to fix and compile this program? SASM is not working on my machine, so I tried to use this online NASM compiler: ASM Online compiler
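For comparison, the per-element logic that the test al, 1 / cmovz sequence is aiming at can be sketched in C++. Note that dd declares 4-byte elements while lodsq and stosq move 8 bytes at a time, so the sketch below walks the array one 32-bit integer at a time. zeroEvens and checkZeroEvens are hypothetical names, and the bit test mirrors test al, 1 (bit 0 clear means even on two's-complement targets, including for negative values):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: set every even element to zero, in place.
void zeroEvens(std::int32_t* arr, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        if ((arr[i] & 1) == 0) {  // bit 0 clear: the value is even
            arr[i] = 0;
        }
    }
}

// Runs the helper on the array from the question and checks the result.
void checkZeroEvens() {
    std::int32_t arr[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    zeroEvens(arr, 10);
    const std::int32_t expected[10] = {1, 0, 3, 0, 5, 0, 7, 0, 9, 0};
    for (std::size_t i = 0; i < 10; ++i) {
        assert(arr[i] == expected[i]);
    }
}
```

Compiling this helper with optimization on Godbolt gives a compact reference to compare a hand-written loop against.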

Why is vzeroupper being inserted at the end of this code?

I noticed something strange when I compile this code on godbolt, with MSVC:
#include <intrin.h>
#include <cstdint>
void test(unsigned char*& pSrc) {
__m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(pSrc));
int32_t mask = _mm256_movemask_epi8(data);
if (!mask) {
++pSrc;
}
else {
unsigned long v;
_BitScanForward(&v, mask);
pSrc += v;
}
}
I get this resulting assembly:
pSrc$ = 8
void test(unsigned char * &) PROC ; test, COMDAT
mov rdx, QWORD PTR [rcx]
vmovdqu ymm0, YMMWORD PTR [rdx]
vpmovmskb eax, ymm0
test eax, eax
jne SHORT $LN2@test
mov eax, 1
add rax, rdx
mov QWORD PTR [rcx], rax
vzeroupper ; Why is this being inserted?
ret 0
$LN2@test:
bsf eax, eax
add rax, rdx
mov QWORD PTR [rcx], rax
vzeroupper ; Why is this being inserted?
ret 0
void test(unsigned char * &) ENDP ; test
Why is vzeroupper being inserted at the end of each scope? I heard that it's because of switching between SSE and AVX, but I'm not doing that here. I'm using exclusively AVX code.
I was wondering, does this pose a performance problem?

Union with Insidious bug

I'm having trouble understanding the following. When we want to traverse a whole array and compare each element with a value stored in the array itself, say arr[0], why is it advised to first copy that value into a local variable, like int acomp = arr[0], and compare every element with acomp, rather than comparing every element with arr[0] directly?
For example, in the following union code it was pointed out to me that Code 2 is better than Code 1, but I am not quite sure why.
void unionarr(int p, int q) { // Code 1
    for (int i = 0; i < size; i++)
        if (arr[i] == arr[p])
            arr[i] = arr[q];
}
void unionarr(int p, int q) { // Code 2
    int pid = arr[p];
    int qid = arr[q];
    for (int i = 0; i < size; i++)
        if (arr[i] == pid)
            arr[i] = qid;
}
It's a correctness issue. The assignment inside the for loop can modify array values. You might modify the very elements that are being used in the comparison or right-hand side of the assignment. That's why you must save them before entering the loop.
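A small input makes the difference visible. In the sketch below, both variants take the array as a parameter instead of using the global arr and size, purely so they can be run side by side; unionRereading and unionSaved are hypothetical names for Code 1 and Code 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Code 1: re-reads arr[p] and arr[q] on every iteration.
std::vector<int> unionRereading(std::vector<int> arr, int p, int q) {
    for (std::size_t i = 0; i < arr.size(); i++)
        if (arr[i] == arr[p])
            arr[i] = arr[q];
    return arr;
}

// Code 2: saves both values before the loop starts modifying the array.
std::vector<int> unionSaved(std::vector<int> arr, int p, int q) {
    const int pid = arr[p];
    const int qid = arr[q];
    for (std::size_t i = 0; i < arr.size(); i++)
        if (arr[i] == pid)
            arr[i] = qid;
    return arr;
}

// Runs both variants on {0, 1, 0, 0} with p = 0, q = 1.
void checkUnion() {
    const std::vector<int> input{0, 1, 0, 0};
    // Code 1 overwrites arr[p] at i == 0, so later zeros no longer match.
    assert((unionRereading(input, 0, 1) == std::vector<int>{1, 1, 0, 0}));
    // Code 2 still compares against the saved 0 and relabels every zero.
    assert((unionSaved(input, 0, 1) == std::vector<int>{1, 1, 1, 1}));
}
```

With p = 0 the first iteration changes arr[p] itself, which is why Code 1 stops matching the remaining elements.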
Making local copies pid and qid of values that would otherwise be looked up in the array repeatedly is also something of a performance optimisation.
However, I would be surprised if any modern compiler would fail to pick that up and do that optimisation implicitly.
Using https://godbolt.org/ you can compare the two. What you care about is the instructions inside the loop.
With Clang 4.0 the assembly is:
Code 1
movsxd rax, dword ptr [rbp - 16]
mov ecx, dword ptr [4*rax + arr]
movsxd rax, dword ptr [rbp - 8]
cmp ecx, dword ptr [4*rax + arr]
jne .LBB0_4
movsxd rax, dword ptr [rbp - 12]
mov ecx, dword ptr [4*rax + arr]
movsxd rax, dword ptr [rbp - 16]
mov dword ptr [4*rax + arr], ecx
Code 2
movsxd rax, dword ptr [rbp - 24]
mov ecx, dword ptr [4*rax + arr]
cmp ecx, dword ptr [rbp - 16]
jne .LBB0_4
mov eax, dword ptr [rbp - 20]
movsxd rcx, dword ptr [rbp - 24]
mov dword ptr [4*rcx + arr], eax

nasm function called from C ends up segfaulting

I have to find the min and max values in an array using only one conditional jump instruction.
After compiling and linking the two files below I get a Segmentation Fault (core dumped), but I don't understand why that is.
Question: What is causing the segmentation fault?
main.cpp
#include <cstdio>
#include <time.h>
using namespace std;
extern "C" void minmax(int n, int * tab, int * max, int * min);
int main(){
const int rozmiar = 100000;
const int liczba_powtorzen = 10000;
int tab[rozmiar] = {1, 3, 3, -65, 3, 123, 4, 32, 342, 22, 11, 32, 44, 12, 324, 43};
tab[rozmiar-1] = -1000;
int min, max;
min = 99999;
max = -99999;
clock_t start, stop;
start = clock();
for(int i=0; i<liczba_powtorzen; i++){
minmax(rozmiar, tab, &max, &min);
}
printf("min = %d max = %d\n", min, max);
stop = clock();
printf("\n time = %f ( %d cykli)", (stop - start)*1.0/CLOCKS_PER_SEC, (stop - start));
return 0;
}
minmax.asm
global minmax ; required for linker and NASM
section .text ; start of the "CODE segment"
minmax:
push ebp
mov ebp, esp ; set up the EBP
push ecx ; save used registers
push esi
mov ecx, [ebp+8] ; array length n
mov esi, [ebp+12] ; array address
mov eax, [ebp+16] ;max
mov edi, [ebp+20] ; min
lp: add eax, [esi] ; fetch an array element
cmp eax, [esi]
jl max ; max<[esi] ->update max
cmp edi, [esi]
jg min ; min>[esi] ->update min
add esi, 4 ; move to another element
loop lp ; loop over all elements
max:
mov eax, esi
ret
min:
mov edi, esi
ret
pop esi ; restore used registers
pop ecx
pop ebp
ret ; return to caller
Long story short:
You need to restore the stack before using ret.
Your asm implementation is faulty on many levels, but the reason for your segmentation fault is poor understanding of how ret works.
Invalid use of ret
ret does not bring you back to the last jump, it reads the value that is at the top of the stack, and returns to that address.
After you jump to either min: or max:, you execute ret, where you should be jumping back into your loop.
This means that it will try to return back to the address at the top of the stack, which certainly isn't a valid address; you modified it upon entering the function.
push ebp
mov ebp, esp ; set up the EBP
push ecx ; save used registers
push esi ; note, this is where `ret` will try to go
I don't know exactly what you're trying to do, but the assembly function is poorly written.
Try this:
push ebp
mov ebp, esp ; set up the EBP
push ecx ; save used registers
push esi
push edi ; EDI is also callee-saved in cdecl
mov ecx, [ebp+8] ; array length n
mov esi, [ebp+12] ; array address
mov eax, 0x80000000 ; INT_MIN: initial value for max
mov edi,[ebp+16]
mov [edi], eax
mov eax, 0x7fffffff ; INT_MAX: initial value for min
mov edi,[ebp+20]
mov [edi], eax
lp:
mov edi,[ebp+16] ; pointer to max
lodsd ; eax = [esi], esi += 4
cmp [edi], eax
jg short _min_test ; max > element: keep max
mov [edi], eax
_min_test:
mov edi,[ebp+20] ; pointer to min
cmp [edi], eax
jl short _loop ; min < element: keep min
mov [edi], eax
_loop:
loop lp
pop edi ; restore used registers
pop esi
pop ecx
pop ebp
ret ; return to caller
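To check the assembly routine, its results can be compared against a plain C++ version with the same parameter order. This is only a sketch: minmax_ref and checkMinmaxRef are hypothetical names. Like the assembly above, it starts from INT_MIN and INT_MAX and ignores the caller's initial values, and it is meant for n >= 1.

```cpp
#include <cassert>
#include <climits>

// Hypothetical reference version with the same parameter order as minmax().
void minmax_ref(int n, const int* tab, int* max, int* min) {
    *max = INT_MIN;
    *min = INT_MAX;
    for (int i = 0; i < n; ++i) {
        if (tab[i] > *max) *max = tab[i];
        if (tab[i] < *min) *min = tab[i];
    }
}

// Runs the reference on the first 16 values from main.cpp.
void checkMinmaxRef() {
    const int tab[16] = {1, 3, 3, -65, 3, 123, 4, 32, 342, 22, 11, 32, 44, 12, 324, 43};
    int min = 99999, max = -99999;
    minmax_ref(16, tab, &max, &min);
    assert(max == 342);
    assert(min == -65);
}
```

Running the assembly routine and minmax_ref on the same array and comparing both outputs is a quick way to catch mistakes such as swapped min and max pointers.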