Accessing Assembly language from C++ - c++

This is my programming assignment. I need to find the largest value in an array of integers using a routine written in 8086 assembly language. This is my attempt:
#include <iostream.h>
#include <conio.h>
int returnLargest(int a[])
{
int max;
asm mov si,offset a
for(int i=0;i<6;i++) //Assuming six numbers in the array...Can be set to a variable 'n' later
{
asm mov ax,[si]
asm mov max,ax
asm inc si
cout<<max<<"\n"; //Just to see what is there in the memory location
}
asm mov si,offset a
asm mov cx,0000h
asm mov dx, [si]
asm mov cx,06h
skip: asm mov si,offset a
asm mov bx,[si]
asm mov max,bx
asm inc si
abc: asm mov bx,max
asm cmp [si],bx
asm jl ok
asm mov bx,[si]
asm mov max,bx
ok: asm loop abc
asm mov ax,max
return max;
}
void main()
{
clrscr();
int n;
int a[]={1,2,3,4,5,6};
n=returnLargest(a);
cout<<n; //Prints the largest
getch();
}
The expected answer is
1
2
3
4
5
6
6. But what I get is this :
Here I sit down and think... Isn't the value at index i of the array actually stored in memory? At least we were taught that if a[i] is 12 (say), then the ith memory location has the number 12 written inside it.
Or, if the value isn't stored at that memory location, how do I write into the memory location so as to accomplish the desired task?
I would also appreciate links to material (online or in print) for brushing up on these concepts.
EDIT :
The same code in assembly works just fine...
data segment
a db 01h,02h,03h,04h,05h,06h,'$'
max db ?
data ends
code segment
start:
assume cs:code,ds:data
mov ax,data
mov ds,ax
mov si,offset a
mov cx,0000h
back: mov dl,byte ptr [si]
cmp dl,'$'
je skip
inc cx
inc si
jmp back
skip: mov si,offset a
mov bl,byte ptr[si]
mov max,bl
inc si
abc: mov bl,max
cmp [si],bl
jl ok
mov bl,[si]
mov max,bl
ok: loop abc
mov al,max
int 03h
code ends
end start

mov si,offset a is incorrect. When you have a function parameter declared as int a[], the function actually receives a pointer. Since you want the pointer value (a) rather than its address (&a in C, offset a in assembly), use mov si, a.
Additionally, inc si doesn't seem right - you need to increase si by sizeof(int) for each element.
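A plain C++ version of the routine may make both points concrete. This is only a sketch, not the asker's code: the name largest, the explicit length parameter n, and the array sample are assumptions added for illustration. It checks that an int a[] parameter has the same type as int*, and the comments note where the sizeof(int) stride comes in.

```cpp
#include <cstddef>
#include <type_traits>

// A parameter written as `int a[]` is adjusted to `int* a`, so these two
// function types are identical. The function receives a pointer value.
static_assert(std::is_same<void(int[]), void(int*)>::value,
              "an array parameter is really a pointer parameter");

// Hypothetical reference version of the assignment (requires C++14).
// `a[i]` reads the int at byte address a + i * sizeof(int), which is why
// the assembly has to advance si by sizeof(int), not by 1.
constexpr int largest(const int* a, std::size_t n) {
    int max = a[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (a[i] > max) {
            max = a[i];
        }
    }
    return max;
}

constexpr int sample[] = {1, 2, 3, 4, 5, 6};
static_assert(largest(sample, 6) == 6, "largest of 1..6 is 6");
```

With a 16-bit compiler such as the one in the question, sizeof(int) is 2, so the matching assembly step would be add si, 2 (or two inc si instructions).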
Edit:
You are mixing C++ code (for loop, cout) with your assembly. The C++ code is likely to use the same registers, which would cause conflicts. You should avoid doing this.
You also need to find out which registers your function is allowed to change according to the calling convention used. If you use any registers which aren't allowed to change, you need to push them at the beginning and pop them at the end.

You will have to make sure your compiler doesn't use your registers. The best way would be to write the entire function in assembly and implement the desired calling convention (cdecl or stdcall, whichever fits). Then call that function from C/C++.
However, if you know you will use only one compiler and how it works, you shouldn't have any problems with inline assembly, but it really is a pitfall.

Related

Decoding assembly from MSVC 32-bit release (homework). What does no-op do?

Heads up: this is homework. I'm given assembly generated by MSVC 32-bit Release with optimizations on, and I'm supposed to decode it back into C++. I've included the function from the top down to the line I'm having problems with. The comments are mine, which I wrote while trying to understand it.
Note: Code is supposedly generated from C++. Not traditional ASM.
Note 2: There is one area of undefined behavior in the code.
Here are the lines I'm stuck with
TheFunction: ; TheFunction(int* a, int s);
0F2D4670 push ebp ; Push/clear/save ebp
0F2D4671 mov ebp,esp ; ebp now points to top of stack
0F2D4673 push ecx ; Push/clear/save ecx
0F2D4674 push ebx ; Push/clear/save ebx
0F2D4675 push esi ; Push/clear/save esi
0F2D4676 mov ebx,edx ; ebx = int s
0F2D4678 mov esi,1 ; esi = 1
0F2D467D push edi ; calling convention ; Push/clear/save edi
0F2D467E mov edi,dword ptr [a (0F2D95E8h)] ; edi = a[0]
0F2D4684 cmp ebx,esi ; if(s < 1)
0F2D4686 jl SomeFunction+3Ch (0F2D46ACh) ; Jump to return
0F2D4688 nop dword ptr [eax+eax] ; !! <-- No op involving dereferencing? What does this do?
0F2D4690 mov eax,dword ptr [edi+esi*4-4] ; !! <-- edi is *a, while esi is 1. There is no address here!
..... More code but I've figured these out ....
I've more or less got the gist of the function. It's a function that takes a pointer to an int, with an underlying array, and a size. It then goes through each element in the array from last to first, adding to each subsequent one and printing it out. However, I still haven't got the details down and need help.
Two questions, both at the end of the code snippet: what does a no-op on a dereferenced pointer do, and am I reading the last line correctly in that it's attempting to dereference something not in memory?
The nop dword ptr [eax+eax] instruction does nothing. It doesn't even access the memory location given by the operand. It literally performs no operation.
It's just there so the next instruction is aligned to a 16-byte boundary. You'll notice that next instruction address is 0F2D4690 which ends with 0 which means it's 16-byte aligned. This can improve the performance of loops. Somewhere there will be an instruction that jumps back to 0F2D4690 as part of a loop. This particular form of a NOP instruction is used because it encodes a single NOP instruction in 8 bytes.
There is no corresponding C++ code for this instruction. You shouldn't try to represent it in your C++ code, just ignore it.
Also note that your comment for mov edi,dword ptr [a (0F2D95E8h)] is incorrect. Instead of being edi = a[0] it's simply edi = a. The variable a isn't a parameter at all, instead it's a global (or file level static) variable located at memory location 0F2D95E8h. This instruction just loads the value from memory.
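The alignment arithmetic behind that 8-byte NOP can be checked with a few lines of C++. This is a sketch for illustration only: padding_to_align, nop_addr and loop_addr are hypothetical names, and the two addresses are copied from the listing in the question.

```cpp
#include <cstdint>

// Number of padding bytes needed so that `addr` reaches the next multiple
// of `alignment`. Assumes `alignment` is a power of two.
constexpr std::uint32_t padding_to_align(std::uint32_t addr,
                                         std::uint32_t alignment) {
    return (alignment - (addr & (alignment - 1))) & (alignment - 1);
}

// Address of the nop, and of the instruction that follows it.
constexpr std::uint32_t nop_addr  = 0x0F2D4688;
constexpr std::uint32_t loop_addr = 0x0F2D4690;

// 0x...88 is 8 bytes short of a 16-byte boundary, so one 8-byte NOP fills the gap.
static_assert(padding_to_align(nop_addr, 16) == 8,
              "the nop has to occupy 8 bytes");
static_assert(nop_addr + padding_to_align(nop_addr, 16) == loop_addr,
              "the instruction after the nop is 16-byte aligned");
```

An address that is already aligned needs no padding, which is why the function masks the result a second time.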

Understanding volatile asm vs volatile variable

We consider the following program, that is just timing a loop:
#include <cstdlib>
std::size_t count(std::size_t n)
{
#ifdef VOLATILEVAR
volatile std::size_t i = 0;
#else
std::size_t i = 0;
#endif
while (i < n) {
#ifdef VOLATILEASM
asm volatile("": : :"memory");
#endif
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
For readability, the version with both volatile variable and volatile asm reads as follows:
#include <cstdlib>
std::size_t count(std::size_t n)
{
volatile std::size_t i = 0;
while (i < n) {
asm volatile("": : :"memory");
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
Compilation under g++ 8 with g++ -Wall -Wextra -g -std=c++11 -O3 loop.cpp -o loop gives roughly the following timings:
default: 0m0.001s
-DVOLATILEASM: 0m1.171s
-DVOLATILEVAR: 0m5.954s
-DVOLATILEVAR -DVOLATILEASM: 0m5.965s
The question I have is: why is that? The default version is normal since the loop is optimized away by the compiler. But I have a harder time understanding why -DVOLATILEVAR takes so much longer than -DVOLATILEASM, since both should force the loop to run.
Compiler explorer gives the following count function for -DVOLATILEASM:
count(unsigned long):
mov rax, rdi
test rdi, rdi
je .L2
xor edx, edx
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
.L2:
ret
and for -DVOLATILEVAR (and the combined -DVOLATILEASM -DVOLATILEVAR):
count(unsigned long):
mov QWORD PTR [rsp-8], 0
mov rax, QWORD PTR [rsp-8]
cmp rdi, rax
jbe .L2
.L3:
mov rax, QWORD PTR [rsp-8]
add rax, 1
mov QWORD PTR [rsp-8], rax
mov rax, QWORD PTR [rsp-8]
cmp rax, rdi
jb .L3
.L2:
mov rax, QWORD PTR [rsp-8]
ret
What is the exact reason for that? Why does the volatile qualification of the variable prevent the compiler from generating the same loop as the one with asm volatile?
When you make i volatile, you tell the compiler that something it doesn't know about can change its value. That means it is forced to load its value every time you use it, and it has to store it every time you write to it. When i is not volatile, the compiler can optimize that synchronization away.
-DVOLATILEVAR forces the compiler to keep the loop counter in memory, so the loop bottlenecks on the latency of a store/reload (store forwarding, ~5 cycles) plus the 1-cycle latency of an add.
Every assignment to and read from volatile int i is considered an observable side-effect of the program that the optimizer has to make happen in memory, not just a register. This is what volatile means.
There's also a reload for the compare, but that's only a throughput issue, not latency. The ~6 cycle loop carried data dependency means your CPU doesn't bottleneck on any throughput limits.
This is similar to what you'd get from -O0 compiler output, so have a look at my answer on Adding a redundant assignment speeds up code when compiled without optimization for more about loops like that, and x86 store-forwarding.
With only VOLATILEASM, the empty asm template (""), has to run the right number of times. Being empty, it doesn't add any instructions to the loop, so you're left with a 2-uop add / cmp+jne loop that can run at 1 iteration per clock on modern x86 CPUs.
Critically, the loop counter can stay in a register, despite the compiler memory barrier. A "memory" clobber is treated like a call to a non-inline function: it might read or modify any object that it might possibly have a reference to, but that does not include local variables that have never had their address escape the function (i.e. we never called sscanf("0", "%d", &i) or posix_memalign(&i, 64, 1234)). But if we did, then the "memory" barrier would have to spill / reload it, because an external function could have saved a pointer to the object.
i.e. a "memory" clobber is only a full compiler barrier for objects that could possibly be visible outside the current function. This is really only an issue when messing around and looking at compiler output to see what barriers do what, because a barrier can only matter for multi-threading correctness for variables that other threads could possibly have a pointer to.
And BTW, your asm statement is already implicitly volatile because it has no output operands. (See Extended-Asm#Volatile in the gcc manual).
You can add a dummy output to make a non-volatile asm statement the compiler can optimize away, but unfortunately gcc still keeps the empty loop after eliminating a non-volatile asm statement from it. If i's address has escaped the function, removing the asm statement entirely turns the loop into a single compare jump over a store, right before the function returns. I think it would be legal to simply return without ever storing to that local, because there's no way a correct program can know that it managed to read i from another thread before i went out of scope.
But anyway, here's the source I used. As I said, note that there's always an asm statement here, and I'm controlling whether it's volatile or not.
#include <stdlib.h>
#include <stdio.h>
#ifndef VOLATILEVAR // compile with -DVOLATILEVAR=volatile to apply that
#define VOLATILEVAR
#endif
#ifndef VOLATILEASM // Different from your def; yours drops the whole asm statement
#define VOLATILEASM
#endif
// note I ported this to also be valid C, but I didn't try -xc to compile as C.
size_t count(size_t n)
{
int dummy; // asm with no outputs is implicitly volatile
VOLATILEVAR size_t i = 0;
sscanf("0", "%zu", &i); // %zu is the conversion for size_t; this call only makes i's address escape
while (i < n) {
asm VOLATILEASM ("nop # operand = %0": "=r"(dummy) : :"memory");
++i;
}
return i;
}
compiles (with gcc4.9 and newer -O3, neither VOLATILE enabled) to this weird asm.
(Godbolt compiler explorer with gcc and clang):
# gcc8.1 -O3 with sscanf(.., &i) but non-volatile asm
# the asm nop doesn't appear anywhere, but gcc is making clunky code.
.L8:
mov rdx, rax # i, <retval>
.L3: # first iter entry point
lea rax, [rdx+1] # <retval>,
cmp rax, rbx # <retval>, n
jb .L8 #,
Nice job, gcc.... gcc4.8 -O3 avoids pulling an extra mov inside the loop:
# gcc4.8 -O3 with sscanf(.., &i) but non-volatile asm
.L3:
add rdx, 1 # i,
cmp rbx, rdx # n, i
ja .L3 #,
mov rax, rdx # i.0, i # outside the loop
Anyway, without the dummy output operand, or with volatile, gcc8.1 gives us:
# gcc8.1 with sscanf(&i) and asm volatile("nop" ::: "memory")
.L3:
nop # operand = eax # dummy
mov rax, QWORD PTR [rsp+8] # tmp96, i
add rax, 1 # <retval>,
mov QWORD PTR [rsp+8], rax # i, <retval>
cmp rax, rbx # <retval>, n
jb .L3 #,
So we see the same store/reload of the loop counter; the only difference from volatile i is that the cmp doesn't need to reload it.
I used nop instead of just a comment because Godbolt hides comment-only lines by default, and I wanted to see it. For gcc, it's purely a text substitution: we're looking at the compiler's asm output with operands substituted into the template before it's sent to the assembler. For clang, there might be some effect because the asm has to be valid (i.e. actually assemble correctly).
If we comment out the scanf and remove the dummy output operand, we get a register-only loop with the nop in it. But keep the dummy output operand and the nop doesn't appear anywhere.

MSVC Assembly function arguments C++ vs _asm

I have a function which takes 3 arguments, dest, src0, src1, each a pointer to data of size 12. I made two versions. One is written in C and optimized by the compiler, the other one is fully written in _asm. So yeah. 3 arguments? I naturally do something like:
mov ecx, [src0]
mov edx, [src1]
mov eax, [dest]
I am a bit confused by the compiler, as it saw fit to add the following:
_src0$ = -8 ; size = 4
_dest$ = -4 ; size = 4
_src1$ = 8 ; size = 4
?vm_vec_add_scalar_asm@@YAXPAUvec3d@@PBU1@1@Z PROC ; vm_vec_add_scalar_asm
; _dest$ = ecx
; _src0$ = edx
; 20 : {
sub esp, 8
mov DWORD PTR _src0$[esp+8], edx
mov DWORD PTR _dest$[esp+8], ecx
; 21 : _asm
; 22 : {
; 23 : mov ecx, [src0]
mov ecx, DWORD PTR _src0$[esp+8]
; 24 : mov edx, [src1]
mov edx, DWORD PTR _src1$[esp+4]
; 25 : mov eax, [dest]
mov eax, DWORD PTR _dest$[esp+8]
Function body etc.
add esp, 8
ret 0
What does _src0$[esp+8] etc. even mean? Why does it do all this stuff before my code? Why does it (apparently) try so hard to put everything on the stack?
In comparison, the C++ version has only the following before its body, which is pretty similar:
_src1$ = 8 ; size = 4
?vm_vec_add@@YAXPAUvec3d@@PBU1@1@Z PROC ; vm_vec_add
; _dest$ = ecx
; _src0$ = edx
mov eax, DWORD PTR _src1$[esp-4]
Why is so little sufficient here?
The answer of Mats Petersson explained __fastcall. But I guess that is not exactly what you're asking ...
Actually _src0$[esp+8] just means [_src0$ + esp + 8], and _src0$ is defined above:
_src0$ = -8 ; size = 4
So, the whole expression _src0$[esp+8] is nothing but [esp] ...
To see why it does all this, you should probably first understand what Mats Petersson said in his post about __fastcall or, more generally, what a calling convention is. See the link in his post for detailed information.
Assuming that you have understood __fastcall, let's see what happens in your code. The compiler is using __fastcall. Your callee function is f(dst, src0, src1), which requires 3 parameters, so according to the calling convention, when a caller calls f, it does the following:
Move dst to ecx and src0 to edx
Push src1 onto the stack
Push the 4 bytes return address onto the stack
Go to the starting address of the function f
And the callee f, when its code begins, then knows where the parameters are: dst and src0 are in the registers ecx and edx, respectively; esp is pointing to the 4 bytes return address, but the 4 bytes below it (i.e. DWORD PTR[esp+4]) is exactly src1.
So, in your "C++ version", the function f just does what it should do:
mov eax, DWORD PTR _src1$[esp-4]
Here _src1$ = 8, so _src1$[esp-4] is exactly [esp+4]. See, it just retrieves the parameter src1 and stores it in eax.
There is however a tricky point here. In the code of f, if you want to use the parameter src1 multiple times, you can certainly do that, because it's always stored in the stack, right below the return address; but what if you want to use dst and src0 multiple times? They are in the registers, and can be destroyed at any time.
So in that case, the compiler should do the following: right after entering the function f, it should remember the current values of ecx and edx (by pushing them onto the stack). These 8 bytes are the so-called "shadow space". It is not done in your "C++ version", probably because the compiler knows for sure that these two parameters will not be used multiple times, or that it can handle it properly some other way.
Now, what happens in your _asm version? The problem here is that you are using inline assembly. The compiler then loses control of the registers, and it cannot assume that the registers ecx and edx are safe in your _asm block (they actually are not, since you used them in the _asm block). Thus it is forced to save them at the beginning of the function.
The saving goes as follows: it first lowers esp by 8 bytes (sub esp, 8) to reserve stack space, then moves edx and ecx to [esp] and [esp+4] respectively.
And then it can safely enter your _asm block. Now in its mind (if it has one), the picture is that [esp] is src0, [esp+4] is dst, [esp+8] is the 4 byte return address, and [esp+12] is src1. It no longer thinks about ecx and edx.
Thus your first instruction in the _asm block, mov ecx, [src0], should be interpreted as mov ecx, [esp], which is the same as
mov ecx, DWORD PTR _src0$[esp+8]
and the same for the other two instructions.
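The offset arithmetic can be double-checked mechanically. The following is only a sketch: the constant names and the helper disp are hypothetical, and the values are copied from the two listings in the question. It treats sym[esp+k] as the address esp + k + sym and computes the net displacement from esp.

```cpp
// Symbolic frame offsets as printed at the top of the compiler listing.
constexpr int src0_off = -8;  // _src0$
constexpr int dest_off = -4;  // _dest$
constexpr int src1_off =  8;  // _src1$

// Net displacement from esp for an operand written as sym[esp+k].
constexpr int disp(int sym, int k) { return sym + k; }

// _asm version, after `sub esp, 8` has moved esp down by 8 bytes:
static_assert(disp(src0_off, 8) == 0,  "_src0$[esp+8] is [esp]");
static_assert(disp(dest_off, 8) == 4,  "_dest$[esp+8] is [esp+4]");
static_assert(disp(src1_off, 4) == 12, "_src1$[esp+4] is [esp+12]");

// C++ version, where esp still points at the return address:
static_assert(disp(src1_off, -4) == 4, "_src1$[esp-4] is [esp+4]");
```

The 8-byte difference between the two src1 displacements (12 versus 4) is exactly the space reserved by sub esp, 8.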
At this point, you might say, aha it's doing stupid things, I don't want it to waste time and space on that, is there a way?
Well there is a way - do not use inline assembly... it's convenient, but there is a compromise.
You can write the assembly function f in a .asm source file and make it public. In the C/C++ code, declare it as extern "C" f(...). Then, when you begin your assembly function f, you can work directly with ecx and edx.
The compiler has decided to use a calling convention that uses "pass arguments in registers" aka __fastcall. This allows the compiler to pass some of the arguments in registers, instead of pushing onto stack, and this can reduce the overhead in the call, because moving from a variable to a register is faster than pushing onto the stack, and it's now already in a register when we get to the callee function, so no need to read it from the stack.
There is a lot more information about how calling conventions work on the web. The wikipedia article on x86 calling conventions is a good starting point.

x86 Assembly Compare Arguments

I'm using visual studio and calling assembly from C++. I know that when you pass an argument to assembly the first argument is in ECX and the second is in EDX. Why can't I compare the two registers directly without first copying ECX to EAX?
C++:
#include <iostream>
extern "C" int PassingParameters(int a, int b);
int main()
{
std::cout << "The function returned: " << PassingParameters(5, 10) << std::endl;
std::cin.get();
return 0;
}
ASM: This gives the wrong value when comparing the two registers directly.
.code
PassingParameters proc
cmp edx, ecx
jg ReturnEAX
mov eax, edx
ReturnEAX:
ret
PassingParameters endp
end
But if I write it like this I get the correct value, and can compare the two registers directly, why is this?
.code
PassingParameters proc
mov eax, ecx ; copy ecx to eax.
cmp edx, ecx ; compare ecx and edx directly like above, but this gives the correct value.
jg ReturnEAX
mov eax, edx
ReturnEAX:
ret
PassingParameters endp
end
In your first version if the jg is taken, you're leaving eax exactly as it was upon entry to the function (i.e., we pretty much have no clue). Since the return value will normally be in eax, that's going to give an undefined return whenever the jg is taken. In other words, what you've written is roughly like:
int PassingParameters(int a, int b) {
if (a >= b)
return b;
}
In this case, if a < b (the case where the jg is taken), your return value is garbage.
In the second code sequence, you're loading one value into eax. Then, if the jg is not taken, you're loading the other value into eax. Either way, the return value will be one input parameter or the other (depending on which is greater). In other words, what you have is roughly equivalent to:
int PassingParameters(int a, int b) {
if (a<b)
return a;
return b;
}
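For comparison, the second (working) assembly listing can be mirrored instruction by instruction in C++. This is only a sketch: the function name passing_parameters_fixed and the local variable eax are hypothetical stand-ins, with a playing the role of ecx and b the role of edx.

```cpp
// Mirrors the working listing one instruction at a time (requires C++14).
constexpr int passing_parameters_fixed(int a, int b) {
    int eax = a;       // mov eax, ecx
    if (b > a) {       // cmp edx, ecx / jg ReturnEAX
        return eax;    // ReturnEAX: ret  (eax still holds a)
    }
    eax = b;           // mov eax, edx
    return eax;        // ret
}

// The call from the question's main(): PassingParameters(5, 10).
static_assert(passing_parameters_fixed(5, 10) == 5, "jg taken, a is returned");
static_assert(passing_parameters_fixed(10, 5) == 5, "jg not taken, b is returned");
static_assert(passing_parameters_fixed(7, 7) == 7,  "equal inputs, b is returned");
```

Every path assigns eax before returning, which is the property the first listing lacks.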
P.S. I would also note that your code looks like x86, not 64-bit code at all. For 64-bit code, you should be using RAX, RCX, etc., rather than EAX, ECX, and such.

Stack walk with inline asm for VC++

I have inserted the following asm code in my C++ code. I am using a VC++ compiler.
char c;
curr_stack_return_addr = s.AddrFrame.Offset; //I am doing a stack walk
__asm{
push bx
mov eax, curr_stack_return_addr
mov bl, BYTE PTR [eax - 1]
mov c,bl
pop bx
}
I get the correct value in c for my functions but it crashes when it reaches system functions on stack. I get no compiler errors. What did I do wrong?
Resolved: I forgot to check for end of stack! The return address in last frame is 0. Thanks everyone.
I see two problems here:
push bl and pop bl don't exist. You can only push and pop words or dwords. The compiler warns about this, by the way.
How do you know that eax points to a legal address?
You have no way of knowing the value of eax when your program enters the asm block.