I'm trying to call x64 assembly function from C++ code with four parameters and the assembly function reset the first parameter to zero every time. Please find the code snippet below.
C++ code: test.cpp
#include <iostream>
extern "C" int IntegerShift_(unsigned int a, unsigned int* a_shl, unsigned int* a_shr, unsigned int count);
int main(int argc, char const *argv[])
{
unsigned int a = 3119, count = 6, a_shl, a_shr;
std::cout << "a value before calling " << a << std::endl;
IntegerShift_(a, &a_shl, &a_shr, count);
std::cout << "a value after calling " << a << std::endl;
return 0;
}
x64 assembly code: test.asm
section .data
section .bss
section .text
global IntegerShift_
IntegerShift_:
;prologue
push rbp
mov rbp, rsp
mov rax, rdi
shl rax, cl
mov [rsi], rax
mov rax, rdi
shr rax, cl
mov [rdx], rax
xor rax,rax
;epilogue
mov rbp, rsp
pop rbp
ret
I'm working on the below environment.
OS - Ubuntu 18.04 64-bit
Assembler - nasm (2.13.02)
C++ compiler - g++ (7.4.0)
processor - Intel® Pentium(R) CPU G3240 # 3.10GHz × 2
and I'm compiling my code as below
$ nasm -f elf64 -g -F dwarf test.asm
$ g++ -g -o test test.cpp test.o
$ ./test
$ a value before calling 3119
$ a value after calling 0
But if i comment out the line mov [rdx], rax from assembly function, its not resetting the value of variable a. I'm new to x64 assembly programming and I couldn't find the relation between rdx register and variable a.
unsigned int* a_shl, unsigned int* a_shr are pointers to unsigned int, a 32-bit (dword) type.
You do two qword stores, mov [rsi], rax and mov [rdx], rax which store outside of the pointed-to objects.
The C equivalent would be a function that takes unsigned int* args and does
*(unsigned long)a_shr = a>>count;. This is of course UB, and behaviour like this (overwriting other variables) is pretty much what you'd expect.
Presumably you compiled with optimization disabled so the caller actually reloaded a from the stack. And it put a_shr or a_shl next to a in its stack frame, and one of your stores zeroed your caller's copy of a.
(As usual, gcc happened to zero the upper 32 bits of RDI while it put a into EDI as the first arg. Writing a 32-bit register zero-extends to the full register. So your other bug; right shifting high garbage into the low 32 bits for a_shr, didn't bite you with this caller.)
Simpler implementation:
global IntegerShift ; why the trailing underscore? That's weird for no reason.
IntegerShift:
;prologue not needed, we don't even use the stack
; so don't waste instructions making a frame pointer.
mov eax, edi
shl rax, cl ; a<<count
mov [rsi], eax ; 32-bit store
;mov rax, rdi ; we can just destroy our local a, we're done with it
shr edi, cl ; a>>count
mov [rdx], edi ; 32-bit store
xor eax, eax ; return 0
ret
xor eax, eax is the most efficient way to zero a 64-bit register (no wasted REX prefix). And your return value is only 32-bit anyway because you declared it int, so it makes no sense to be using 64-bit registers.
BTW, if you had BMI2 available (which you don't on your budget Pentium CPU, unfortunately), you could avoid all the register copying, and be more efficient on Intel CPUs (SHL/RX is only 1 uop instead of 3 for shl/r reg, cl because of legacy x86 FLAGS-unmodified semantics for the cl=0 case)
shlx eax, edi, ecx
shrx edi, edi, ecx
mov [rsi], eax
mov [rdx], edi
xor eax, eax
ret
Related
I am trying to include a function I have coded in nasm into a project that is in c++.
When I try to include the function in the cpp file using the extern keyword and compile the project, linker (ld) fails with "undefined reference to `rnd_by_seed(unsigned long long)".
Almost as if there was missing some reference in cmakelists file.
cmake_minimum_required(VERSION 3.23)
project(App)
set(CMAKE_CXX_STANDARD 23)
#enable_language(ASM_NASM)
set(NASM_COMPILER nasm)
set(CMAKE_ASM_NASM_OBJECT_FORMAT elf64)
set(CMAKE_ASM_NASM_COMPILE_OBJECT "<CMAKE_ASM_NASM_COMPILER> <INCLUDES> <FLAGS>
-f ${CMAKE_ASM_NASM_OBJECT_FORMAT} -o <OBJECT> <SOURCE>")
set(SOURCE_FILES main.cpp Tools/rnd_byseed.asm)
add_executable(App ${SOURCE_FILES})
here's the nasm function I am trying to execute:
global rnd_by_seed
;------------------------------------------
; return random number by seed value
; #param rdi seed value (unsigned 64bit integer)
; #return rax random number (64bit unsigned)
rnd_by_seed:
push rbp
mov rbp, rsp
mov eax, dword [rsp+8]
mov ebx, 16807
mul ebx
mov esi, edx
mov edi, eax
mov eax, dword [rsp+4]
mul ebx
add eax, esi
adc edx, 0
shl eax, 1
rcl edx, 1
shr eax, 1
add edx, edi
adc eax, 0
xchg eax, edx
test edx, 80000000h
jz Store
and edx, 7fffffffh
add eax, 1
adc edx, 0
Store:
mov dword [rsp+4], edx
mov dword [rsp+8], eax
mov rax, qword [rsp+8]
add rsp, 16
leave
ret
and main.cpp:
#include <iostream>
extern "C" unsigned long long rnd_by_seed(unsigned long long seed);
int main()
{
std::cout << "Hello, World!" << std::endl;
// call rnd_rnd_by_seed function
auto val = rnd_by_seed(12);
printf("val = %llu", val);
return 0;
}
I tried to look at different methods for including nasm files into cmakelists file, got worse results. This is the only cmakelists file that does not fail when reloading the project.
You're almost there, just need to uncomment the enable_language(ASM_NASM) line.
enable_language(ASM_NASM)
set(CMAKE_ASM_NASM_OBJECT_FORMAT elf64)
set(CMAKE_ASM_NASM_COMPILE_OBJECT "<CMAKE_ASM_NASM_COMPILER> <INCLUDES> <FLAGS -f ${CMAKE_ASM_NASM_OBJECT_FORMAT} -o <OBJECT> <SOURCE>")
Or enable ASM_NASM at project level.
project(App C CXX ASM_NASM)
However, your assembly looks wrong - assuming x64 SysV ABI, I see at least the following issues:
the first 64-bit argument is passed in via RDI, not stack
the value of register RBX must be preserved
writing to [rsp+4], [rsp+8] will damade the stack
add rsp, 16 is bogus
I have found that this code causes a startling error in the gnu C++ compiler when it is optimizing.
#include <stdio.h>
int main()
{
int a = 333666999, b = 0;
for (short i = 0; i<7; ++i)
{
b += a;
printf("%d ", b);
}
return 9;
}
To compile using g++ -Os fail.cpp the executable does not print seven numbers, it goes on forever, printing and printing. I am using -
-rwxr-xr-x 4 root root 700388 Jun 3 2013 /usr/bin/g++
Is there a later corrected version?
The compiler is very, very rarely wrong. In this case, b is overflowing, which is undefined behaviour for signed integers:
$ g++ --version
g++ (GCC) 10.2.0
...
$ g++ -Os -otest test.cpp
test.cpp: In function ‘int main()’:
test.cpp:8:11: warning: iteration 6 invokes undefined behavior [-Waggressive-loop-optimizations]
8 | b += a;
| ~~^~~~
test.cpp:6:24: note: within this loop
6 | for (short i = 0; i<7; ++i)
| ~^~
And if you invoke undefined behaviour, the compiler is free to do whatever it likes, including making your program never terminate.
Edit: Some people seem to think that the UB should only affect the value of b, but not the loop iteration. This is not according to the Standard (UB can cause literally anything to happen) but it's a reasonable thought, so let's look at the generated assembly to see why the loop doesn't terminate.
First without -Os:
.LC0:
.string "%d "
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-12], 333666999
mov DWORD PTR [rbp-4], 0
mov WORD PTR [rbp-6], 0
.L3:
cmp WORD PTR [rbp-6], 6 # Compare i to 6
jg .L2 # If greater, jump to end
mov eax, DWORD PTR [rbp-12]
add DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
movzx eax, WORD PTR [rbp-6]
add eax, 1
mov WORD PTR [rbp-6], ax
jmp .L3
.L2:
mov eax, 9
leave
ret
Then with -Os:
.LC0:
.string "%d "
main:
push rbx
xor ebx, ebx
.L2:
add ebx, 333666999
mov edi, OFFSET FLAT:.LC0
xor eax, eax
mov esi, ebx
call printf
jmp .L2
The comparison and jump instructions are completely gone. Ironically, the compiler did exactly what you asked it to do: optimize for size, so remove as many instructions as it can while obeying the C++ standard. -O3 and -O2 generate the exact same code as -Os here.
-O1 generates a very interesting output:
.LC0:
.string "%d "
main:
push rbx
mov ebx, 0
.L2:
add ebx, 333666999
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp ebx, -1959298303
jne .L2
mov eax, 9
pop rbx
ret
Here, the compiler optimized away the loop counter i and just compares the value of b to its final value after 7 iterations, using the fact that signed overflow happens according to two's complement on this platform! Cheeky, isn't it? :)
I am using g++ version 4.8.1. Thomas has version 10.2.0 which evidently puts out a warning about "undefined behavior" when adding two signed integers. However, being only a warning still goes ahead and compiles the program. In all circumstances though, the "undefined behavior" should only be concerning the integers being added. In practice those integers in fact do abide by the 2's complement expected result. The "undefined behavior" should not overwrite other variables in the program. Otherwise the executable cannot be trusted at all. And if it cannot be trusted it shouldn't be compiled. Perhaps there is an even later version of the gnu compiler that works correctly when optimizing?
Firstly: This code is considered to be of pure fun, please do not do anything like this in production. We will not be responsible of any harm caused to you, your company or your reindeer after compiling and executing this piece of code in any environment. The code below is not safe, not portable and is plainly dangerous. Be warned. Long post below. You were warned.
Now, after the disclaimer: Let's consider the following piece of code:
#include <stdio.h>
int fun()
{
return 5;
}
typedef int(*F)(void) ;
int main(int argc, char const *argv[])
{
void *ptr = &&hi;
F f = (F)ptr;
int c = f();
printf("TT: %d\n", c);
if(c == 5) goto bye;
//else goto bye; /* <---- This is the most important line. Pay attention to it */
hi:
c = 5;
asm volatile ("movl $5, %eax");
asm volatile ("retq");
bye:
return 66;
}
For the beginning we have the function fun which I have created purely for reference to get the generated assembly code.
Then we declare a function pointer F to functions taking no parameters and returning an int.
Then we use the not so well known GCC extension https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html to get the address of a label hi, and this works in clang too. Then we do something evil, we create a function pointer F called f and initialize it to be the label above.
Then the worst of all, we actually call this function, and assign its return value to a local variable, called C and the we print it out.
The following is an if to check if the value assigned to the c is actually the one we need, and if yes go to bye so that he application exits normally, with exit code 66. If that can be considered a normal exit code.
The next line is commented out, but I can say this is the most important line in the entire application.
The piece of code after the label hi is to assign 5 to the value of c, then two lines of assembly to initialize the value of eax to 5 and to actually return from the "function" call. As mentioned, there is a reference function, fun which generates the same code.
And now we compile this application, and run it on our online platform: https://gcc.godbolt.org/z/K6z5Yc
It generates the following assembly (with -O1 turned on, and O0 gives a similar result, albeit a bit more longer):
# else goto bye is COMMENTED OUT
fun:
mov eax, 5
ret
.LC0:
.string "TT: %d\n"
main:
push rbx
mov eax, OFFSET FLAT:.L3
call rax
mov ebx, eax
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp ebx, 5
je .L4
.L3:
movl $5, %eax
retq
.L4:
mov eax, 66
pop rbx
ret
The important lines are mov eax, OFFSET FLAT:.L3 where the L3 corresponds to our hi label, and the line after that: call rax which actually calls it.
And runs like:
ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 66
TT: 5
Now, let's revisit the most important line in the application and uncomment it.
With -O0 we get the following assembly, generated by gcc:
# else goto bye is UNCOMMENTED
# even gcc -O0 "knows" hi: is unreachable.
fun:
push rbp
mov rbp, rsp
mov eax, 5
pop rbp
ret
.LC0:
.string "TT: %d\n"
main:
push rbp
mov rbp, rsp
sub rsp, 48
mov DWORD PTR [rbp-36], edi
mov QWORD PTR [rbp-48], rsi
mov QWORD PTR [rbp-8], OFFSET FLAT:.L4
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rbp-16], rax
mov rax, QWORD PTR [rbp-16]
call rax
mov DWORD PTR [rbp-20], eax
mov eax, DWORD PTR [rbp-20]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp DWORD PTR [rbp-20], 5
nop
.L4:
mov eax, 66
leave
ret
and the following output:
ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 66
so, as you can see our printf was never called, the culprit is the line mov QWORD PTR [rbp-8], OFFSET FLAT:.L4 where L4 actually corresponds to our bye label.
And from what I can see from the generated assembly, not a piece of code from the part after hi was added into the generated code.
But at least the application runs and at least has some code for comparing c to 5.
On the other end, clang, with O0 generates the following nightmare, which by the way crashes:
# else goto bye is UNCOMMENTED
# clang -O0 also doesn't emit any instructions for the hi: block
fun: # #fun
push rbp
mov rbp, rsp
mov eax, 5
pop rbp
ret
main: # #main
push rbp
mov rbp, rsp
sub rsp, 48
mov dword ptr [rbp - 4], 0
mov dword ptr [rbp - 8], edi
mov qword ptr [rbp - 16], rsi
mov qword ptr [rbp - 24], 1
mov rax, qword ptr [rbp - 24]
mov qword ptr [rbp - 32], rax
call qword ptr [rbp - 32]
mov dword ptr [rbp - 36], eax
mov esi, dword ptr [rbp - 36]
movabs rdi, offset .L.str
mov al, 0
call printf
cmp dword ptr [rbp - 36], 5
jne .LBB1_2
jmp .LBB1_3
.LBB1_2:
jmp .LBB1_3
.LBB1_3:
mov eax, 66
add rsp, 48
pop rbp
ret
.L.str:
.asciz "TT: %d\n"
If we turn on some optimization, for example O1, we get from gcc:
# else goto bye is UNCOMMENTED
# gcc -O1
fun:
mov eax, 5
ret
.LC0:
.string "TT: %d\n"
main:
sub rsp, 8
mov eax, OFFSET FLAT:.L3
call rax
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
.L3:
mov eax, 66
add rsp, 8
ret
and the application crashes, which is sort of understandable. Again, the compiler had entirely removed our hi section (mov eax, OFFSET FLAT:.L3 goes tiptoe to L3 which corresponds to our bye section) and unfortunately decided that it's a good idea to increase rsp before a ret so to be sure we end up somewhere totally different where we need to be.
And clang delivers something even more dubious:
# else goto bye is UNCOMMENTED
# clang -O1
fun: # #fun
mov eax, 5
ret
main: # #main
push rax
mov eax, 1
call rax
mov edi, offset .L.str
mov esi, eax
xor eax, eax
call printf
mov eax, 66
pop rcx
ret
.L.str:
.asciz "TT: %d\n"
1 ? How on earth did clang end up with this?
To some level I understand that the compiler decided that dead code after an if where both if and else go to the same location is not needed, but here my knowledge and insight stops.
So now, dear C and C++ gurus, assembly aficionados and compiler crushers, here comes the question:
Why?
Why do you think did the compiler decide that the two labels should be considered equivalent if we have added the else branch, or why did clang put there 1, and last but not least: someone with a deep understanding of the C standard could maybe point out where this piece of code deviated so badly from normality that we ended up in this really really weird situation.
someone with a deep understanding of the C standard could maybe point out where this piece of code deviated so badly from normality that we ended up in this really really weird situation.
You think the ISO C standard has anything to say about this code? It's chock full of UB and GNU extensions, notably pointers to local labels.
Casting a label pointer to a function pointer and calling through it is obviously UB. The GCC manual doesn't say you can do that. It's also UB to goto a label in another function.
You were only able to make that work by tricking the compiler into thinking that block might be reached so it's not removed, then using GNU C Basic asm statements to emit a ret instruction there.
GCC and clang remove dead code even with optimization disabled; e.g. if(0) { ... } doesn't emit any instructions to implement the ...
Also note that the c=5 in hi: compiles with optimization fully disabled (and else goto bye commented) to asm like movl $5, -20(%rbp). i.e. using the caller's RBP to modify local variables in the stack frame of the caller. So you have a nested function.
GNU C allows you to define nested functions that can access the local vars of their parent scope. (If you liked the asm you got from your experiment, you'll love the executable trampoline of machine-code that GCC stores to the stack with mov-immediate if you take a pointer to a nested function!)
asm volatile ("movl $5, %eax"); is missing a clobber on EAX. You step on the compiler's toes which would be UB if this statement was ever reached normally, rather than as if it were a separate function.
The use-case for GNU C Basic asm (no constraints / clobbers) is instructions like cli (disable interrupts), not anything involving integer registers, and definitely not ret.
If you want to define a callable function using inline asm, you can use asm("") at global scope, or as the body of an __attribute__((naked)) function.
I'm learning performance in C++ (and C++11). And I need to performance in Debug and Release mode because I spend time in debugging and in executing.
I'm surprise with this two tests and how much change with the different compiler flags optimizations.
Test iterator 1:
Optimization 0 (-O0): faster.
Optimization 3 (-O3): slower.
Test iterator 2:
Optimization 0 (-O0): slower.
Optimization 3 (-O3): faster.
P.D.: I use the following clock code.
Test iterator 1:
void test_iterator_1()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
size_t count = v.size();
for (unsigned int i = 0; i < count; ++i) {
v[i] = 1;
}
}
Test iterator 2:
void test_iterator_2()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
for (int& i : v) {
i = 1;
}
}
UPDATE: The problem is still the same, but for ranged-for in -O3 the differences is small. So for loop 1 is the best.
UPDATE 2: Results:
With -O3:
t1: 80 units
t2: 74 units
With -O0:
t1: 287 units
t2: 538 units
UPDATE 3: The CODE!. Compile with: g++ -std=c++11 test.cpp -O0 (and then -O3)
Your first test is actually setting the value of each element in the vector to 1.
Your second test is setting the value of a copy of each element in the vector to 1 (the original vector is the same).
When you optimize, the second loop more than likely is removed entirely as it is basically doing nothing.
If you want the second loop to actually set the value:
for (int& i : v) // notice the &
{
i = 1;
}
Once you make that change, your loops are likely to produce assembly code that is almost identical.
As a side note, if you wanted to initialize the entire vector to a single value, the better way to do it is:
std::vector<int> v(SIZE, 1);
EDIT
The assembly is fairly long (100+ lines), so I won't post it all, but a couple things to note:
Version 1 will store a value for count and increment i, testing for it each time. Version 2 uses iterators (basically the same as std::for_each(b.begin(), v.end() ...)). So the code for the loop maintenance is very different (it is more setup for version 2, but less work each iteration).
Version 1 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
push eax
lea ecx, DWORD PTR _v$[ebp]
call ??A?$vector#HV?$allocator#H#std###std##QAEAAHI#Z ; std::vector<int,std::allocator<int> >::operator[]
mov DWORD PTR [eax], 1
Version 2 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
mov DWORD PTR [eax], 1
When they get optimized, this all changes and (other than the ordering of a few instructions), the output is almost identical.
Version 1 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov ecx, DWORD PTR _v$[ebp+4]
mov edx, DWORD PTR _v$[ebp]
sub ecx, edx
sar ecx, 2 ; this is the only differing instruction
test ecx, ecx
je SHORT $LN3#test_itera
push edi
mov eax, 1
mov edi, edx
rep stosd
pop edi
$LN3#test_itera:
test edx, edx
je SHORT $LN21#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN21#test_itera:
mov esp, ebp
pop ebp
ret 0
Version 2 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov edx, DWORD PTR _v$[ebp]
mov ecx, DWORD PTR _v$[ebp+4]
mov eax, edx
cmp edx, ecx
je SHORT $LN1#test_itera
$LL33#test_itera:
mov DWORD PTR [eax], 1
add eax, 4
cmp eax, ecx
jne SHORT $LL33#test_itera
$LN1#test_itera:
test edx, edx
je SHORT $LN47#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN47#test_itera:
mov esp, ebp
pop ebp
ret 0
Do not worry about how much time each operation takes, that falls squarely under the premature optimization is the root of all evil quote by Donald Knuth. Write easy to understand, simple programs, your time while writing the program (and reading it next week to tweak it, or to find out why the &%$# it is giving crazy results) is much more valuable than any computer time wasted. Just compare your weekly income to the price of an off-the-shelf machine, and think how much of your time is required to shave off a few minutes of compute time.
Do worry when you have measurements showing that the performance isn't adequate. Then you must measure where your runtime (or memory, or whatever else resource is critical) is spent, and see how to make that better. The (sadly out of print) book "Writing Efficient Programs" by Jon Bentley (much of it also appears in his "Programming Pearls") is an eye-opener, and a must read for any budding programmer.
Optimization is pattern matching: The compiler has a number of different situations it can recognize and optimize. If you change the code in a way that makes the pattern unrecognizable to the compiler, suddenly the effect of your optimization vanishes.
So, what you are witnessing is nothing more or less than that the ranged for loop produces more bloated code without optimization, but that in this form the optimizer is able to recognize a pattern that it cannot recognize for the iterator-free case.
In any case, if you are curious, you should take a look at the produced assembler code (compile with -S option).
I'm trying to produce assembly code like this (so that it works with nasm)
;hello.asm
[SECTION .text]
global _start
_start:
jmp short ender
starter:
xor eax, eax ;clean up the registers
xor ebx, ebx
xor edx, edx
xor ecx, ecx
mov al, 4 ;syscall write
mov bl, 1 ;stdout is 1
pop ecx ;get the address of the string from the stack
mov dl, 5 ;length of the string
int 0x80
xor eax, eax
mov al, 1 ;exit the shellcode
xor ebx,ebx
int 0x80
ender:
call starter ;put the address of the string on the stack
db 'hello'
First off, what assembly style is this and second, how can I produce it from a C file using a command similar to gcc -S code.c -o code.S -masm=intel
This is Intel style.
What's wrong with the commandline you wrote in the question?