Can I make a C++ method in external assembly (function.asm)?

I am writing a program that requires one function in assembly. It would be pretty helpful to encapsulate the assembly function in a C++ class, so its own data is isolated and I can create multiple instances.
If I create a class and call an external function from a C++ method, the function is reentrant as long as it has its own stack and keeps its local "variables" in the stack frame.
Is there some way to make the assembly function a C++ method, maybe using name mangling, so the function is implemented in assembly but the prototype is declared inside the C++ class?
If that is not possible, is there some way to create multiple instances (dynamically) of the assembly function even though it is not part of the class? Something like cloning the function in memory and just calling it, obviously using relocatable code (adding a delta displacement for variables and data if required)...

I am writing a program that requires one function in assembly.
Then, by definition, your program becomes much less portable, and it depends upon the calling conventions and ABI of your C++ implementation and your operating system.
It is then reasonable to use some compiler-specific features (which are not in portable standard C++11, e.g. n3337).
My recommendation is to take advantage of GCC extended assembly. Read the chapter on using assembly language with C (which of course also applies to C++).
By directly embedding some extended asm inside a C++ member function, you avoid the hassle of calling some function. Probably, your assembler code is really short and executed quickly. So it is better to embed it in C or C++ functions, avoiding the costs of function call prologue and epilogue.
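For instance, a minimal sketch of that approach (x86-64 and GCC/Clang with AT&T syntax assumed; the class and the rotate operation are made up for illustration):

class Mixer {
    unsigned long seed;
public:
    explicit Mixer(unsigned long s) : seed(s) {}
    unsigned long mix(unsigned long x) {
        unsigned long r = x ^ seed;
        // one rotate-left instruction, described to the compiler via constraints,
        // so there is no call overhead and the optimizer can keep r in a register
        asm("rolq $13, %0" : "+r"(r) : : "cc");
        return r;
    }
};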
NB: In 2019, it makes no economic sense to spend effort writing large amounts of assembly code: most optimizing compilers produce better assembly than a reasonable programmer can (in a reasonable amount of time). So you have an incentive to use small chunks of assembly inside larger C++ or C functions.

Yes, you can. Either define it as an inline wrapper that passes all the args (including the implicit this pointer) to an external function, or figure out the name-mangling to define the right symbol for the function entry point in asm.
An example of the wrapper way:
extern "C" int asm_function(myclass *p, int a, double b);
class myclass {
int q, r, member_array[4];
int my_method(int a, double b) { return asm_function(this, a, b); }
};
A stand-alone definition of my_method for x86-64 would be just jmp asm_function, a tailcall, because the args are identical. So after inlining, you'll have call asm_function instead of call _Zmyclass_mymethodZd or whatever the actual name mangling is. (I made that up).
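If you do want the real mangled symbol instead, one way (sketched here assuming a GCC/Itanium-ABI toolchain; the commands and stub are illustrative) is to compile a throwaway C++ definition and read the name out of the object file with nm or c++filt, then define that exact symbol in your .asm file:

// stub.cpp -- throwaway definition just to learn the mangled name
// (build with something like: g++ -c stub.cpp && nm stub.o)
class myclass {
    int q, r, member_array[4];
public:
    int my_method(int a, double b);
};
int myclass::my_method(int, double) { return 0; }  // nm prints e.g. _ZN7myclass9my_methodEid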
In GNU C / C++, there's also the asm keyword to set the asm symbol name for a function, instead of letting the normal name-mangling rules generate it from the class and member-function name, and arg types. (Or with extern "C", usually just a leading underscore or not, depending on the platform.)
class myclass {
    int q, r, member_array[4];
public:
    int my_method(int a, double b)
        asm("myclass_my_method_int_double");   // symbol name for the separate asm file
};
Then in your .asm file (e.g. NASM syntax, for the x86-64 System V calling convention)
global myclass_my_method_int_double
myclass_my_method_int_double:
    ;; inputs: myclass *this in RDI, int a in ESI, double b in XMM0
    cvtsd2si eax, xmm0
    add      eax, [rdi+4]        ;; this->r
    imul     eax, esi
    ret
(You can pick any name you want for your asm function; it doesn't have to encode the args. But doing that will let you overload it without conflicting symbol names.)
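For example, a hedged sketch of that overloading point (the class and names are made up):

class widget {
public:
    int scale(int a, double b)  asm("widget_scale_int_double");   // each overload gets
    int scale(long a, float b)  asm("widget_scale_long_float");   // its own asm symbol
};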
Example on Godbolt of a test caller calling the asm("") way:
void foo(myclass *p) {
    p->my_method(1, 1.0);
}
compiles to
foo(myclass*):
    movsd xmm0, qword ptr [rip + .LCPI0_0]   # xmm0 = mem[0],zero
    mov   esi, 1
    jmp   myclass_my_method_int_double       # TAILCALL
Note that the caller emitted jmp myclass_my_method_int_double, using your name, not a mangled name.
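Putting the pieces together, a self-contained usage sketch might look like this (the build commands are assumptions, e.g. nasm -felf64 function.asm and g++ main.cpp function.o on a System V platform):

// main.cpp
#include <cstdio>

class myclass {
    int q, r, member_array[4];
public:
    myclass() : q(0), r(7), member_array{} {}
    int my_method(int a, double b)
        asm("myclass_my_method_int_double");   // defined in function.asm above
};

int main() {
    myclass m;
    std::printf("%d\n", m.my_method(3, 2.0));  // with the sample asm: ((int)b + this->r) * a = 27
}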

Related

Are functions in a C++/CLI native class compiled to MSIL or native x64 machine code?

This question is related to another question of mine, titled Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems. I didn't receive any comments or answers, but eventually I found out myself that the problem is caused by function thunks that are inserted by the compiler whenever a managed function calls an unmanaged one, and vice versa. I won't go into the details once again, because today I want to focus on another consequence of this thunking mechanism.
To provide some context for the question, my problem was the replacement of a C++ function for 64-to-128-bit unsigned integer multiplication in an unmanaged C++/CLI class with a function in a MASM64 file, for the sake of performance. The ASM replacement is as simple as can be:
AsmMul1 proc                     ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z
    ; rcx  : Factor1
    ; rdx  : Factor2
    ; [r8] : ProductL
    ; [r9] : ProductH
    mov  rax, rcx                ; rax = Factor1
    mul  rdx                     ; rdx:rax = Factor1 * Factor2
    mov  qword ptr [r8], rax     ; [r8] = ProductL
    mov  qword ptr [r9], rdx     ; [r9] = ProductH
    ret
AsmMul1 endp
I expected a big performance boost by replacing a compiled function with four 32-to-64-bit multiplications with a simple CPU MUL instruction. The big surprise was that the ASM version was about four times slower (!) than the C++ version. After a lot of research and testing, I found out that some function calls in C++/CLI involve thunking, which obviously is such a complex thing that it takes much more time than the thunked function itself.
After reading more about this thunking, it turned out that whenever you are using the compiler option /clr, the calling convention of all functions is silently changed to __clrcall, which means that they become managed functions. Exceptions are functions that use compiler intrinsics, inline ASM, and calls to other DLLs via dllimport - and as my tests revealed, this seems to include functions that call external ASM functions.
As long as all interacting functions use the __clrcall convention (i.e. are managed), no thunking is involved, and everything runs smoothly. As soon as the managed/unmanaged boundary is crossed in either direction, thunking kicks in, and performance is seriously degraded.
Now, after this long prologue, let's get to the core of my question. As far as I understand the __clrcall convention, and the /clr compiler switch, marking a function in an unmanaged C++ class this way causes the compiler to emit MSIL code. I've found this sentence in the documentation of __clrcall:
When marking a function as __clrcall, you indicate the function implementation must be MSIL and that the native entry point function will not be generated.
Frankly, this is scaring me! After all, I'm going through the hassles of writing C++/CLI code in order to get real native code, i.e. super-fast x64 machine code. However, this doesn't seem to be the default for mixed assemblies. Please correct me if I'm getting it wrong: If I'm using the project defaults given by VC2017, my assembly contains MSIL, which will be JIT-compiled. True?
There is a #pragma managed(push, off) (and a matching #pragma unmanaged) that seems to inhibit the generation of MSIL in favor of native code on a per-function basis. I've tested it, and it works, but then the problem is that thunking gets in the way again as soon as the native code calls a managed function, and vice versa. In my C++/CLI project, I found no way to configure the thunking and code generation without getting a performance hit somewhere.
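For illustration, a minimal sketch of that pragma usage (the function name is made up; _umul128 is a compiler intrinsic of the kind mentioned above; note that calls crossing the managed/native boundary still get thunked):

#include <intrin.h>

#pragma managed(push, off)        // everything below is compiled to native x64
void MulNative(unsigned __int64 a, unsigned __int64 b,
               unsigned __int64 &lo, unsigned __int64 &hi)
{
    lo = _umul128(a, b, &hi);     // intrinsic: expands to a single MUL inside this function
}
#pragma managed(pop)              // back to managed (MSIL) code generation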
So what I'm asking myself now: what's the point in using C++/CLI in the first place? Does it give me performance advantages, when everything is still compiled to MSIL? Maybe it's better to write everything in pure C++ and use P/Invoke to call those functions? I don't know; I'm kind of stuck here.
Maybe someone can shed some light on this terribly poorly documented topic...

Jump/tailcall to another function

I have two functions, looking like this in C++:
void f1(...);
void f2(...);
I can change the body of f1, but f2 is defined in another library I cannot change. I absolutely have to (tail) call f2 inside f1, and I must pass all arguments provided to f1 to f2, but as far as I know, this is impossible in pure C or C++. There is no alternative of f2 that accepts a va_list, unfortunately. The call to f2 happens last in the function, so I need some form of tailcall.
I decided to use assembly to pop the stack frame of the current function and then jump to f2 (it is actually received as a function pointer in a variable, which is why I first store it in a register):
__asm {
    mov eax, f2
    leave
    jmp eax
}
In MSVC++, in Debug, it appears to work at first, but it somehow messes with the return values of other functions, and sometimes it crashes. In Release, it always crashes.
Is this assembly code incorrect, or do some optimizations of the compiler somehow break this code?
The compiler will make no guarantees at the point you are digging around. A trampoline function might work, but you have to save state between them, and do a lot of digging around.
Here is a skeleton, but you will need to know a lot about calling conventions, class method invocation, etc...
/* argn, ..., arg0, retaddr */
trampoline:
    push <all volatile regs>
    call <get thread local storage>
    copy <volatile regs and ret addr> to <local storage>
    pop <volatile regs>
    remove ret addr
    call f2
    call <get thread local storage>
    restore <volatile regs and ret addr>
    jmp f1
    ret
You have to write f1 in pure asm for it to be guaranteed-safe.
In all the major x86 calling conventions, the callee "owns" the args, and can modify the stack-space that held them. (Whether or not the C source changes them and whether or not they're declared const).
e.g. void foo(int x) { x += 1; bar(x); } might modify the stack space above the return address that holds x, if compiled with optimization disabled. Making another call with the same args requires storing them again unless you know the callee hasn't stepped on them. The same argument applies for tailcalling from the end of one function.
I checked on the Godbolt compiler explorer; both MSVC and gcc do in fact modify x on the stack in debug builds. gcc uses add DWORD PTR [ebp+8], 1 before pushing [ebp+8].
Compilers in practice may not actually take advantage of this for variadic functions, though, so depending on the definitions of your functions, you might get away with it if you can convince them to make a tailcall.
Note that void bar(...); is not a valid prototype in C, though:
# gcc -xc on Godbolt to force compiling as C, not C++
<source>:1:10: error: ISO C requires a named argument before '...'
It is valid in C++, or at least g++ accepts it while gcc doesn't. MSVC accepts it in C++ mode, but not in C mode. (Godbolt has a whole separate C mode with a different set of compilers, which you can use to get MSVC to compile code as C instead of C++. I don't know a command-line option to flip it to C mode the way gcc has -xc and -xc++)
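A quick illustration of that portability wrinkle (the declarations are hypothetical):

void bar(...);          // C++ only: g++ (and MSVC in C++ mode) accept it; C compilers reject it, as shown above
void baz(int n, ...);   // fine in both C and C++: a named parameter precedes the ellipsis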
Anyway, it might work (in optimized builds) to write f2(); at the end of f1, but that's nasty and completely lying to the compiler about what args are passed. And of course it only works for a calling convention with no register args. (But you were showing 32-bit asm, so you might well be using a calling convention with no register args.)
Any decent compiler will use jmp f2 to make an optimized tail-call in this case, because they both return void. (For non-void, you would return f2();)
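A heavily hedged sketch of that hack, assuming C linkage and a stack-args-only convention such as 32-bit cdecl (it really does lie to the compiler, so treat it as a last resort):

extern "C" void f2();   // the real f2 in the other library actually takes arguments
extern "C" void f1()    // likewise: f1's real callers push args that we never declare
{
    // ... work that must not disturb the incoming stack arguments ...
    f2();               // optimized builds typically compile this to "jmp f2"
}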
BTW, if mov eax, f2 works, then jmp f2 will also work.
Your code can't work in an optimized build, though, because you're assuming that the compiler made a legacy stack-frame, and that the function won't inline anywhere.
It's unsafe even in a debug build because the compiler may have pushed some call-preserved registers that need to be popped before leaving the function (and before running leave to destroy the stack frame).
The trampoline idea that @mevets showed could maybe be simplified: if there's a reasonable fixed upper size limit on the args, you can copy maybe 64 or 128 bytes of potential-args from your incoming args into args for f1. A few SIMD vectors will do it. Then you can call f1 normally, then tail-call f2 from your asm wrapper.
If there are potentially register args, save them to stack space before the args you copy, and restore them before tailcalling.

Can a C/C++ compiler inline builtin functions like malloc()?

While inspecting the disassembly of the function below,
void *malloc_float_align(size_t n, unsigned int a, float *&dizi)
{
    void *adres  = NULL;
    void *adres2 = NULL;
    adres = malloc(n * sizeof(float) + a);
    size_t adr  = (size_t)adres;
    size_t adr2 = adr + a - (adr & (a - 1u));
    adres2 = (void *)adr2;
    dizi = (float *)adres2;
    return adres;
}
Builtin functions are not inlined even with the inline optimization flag set.
; Line 26
$LN4:
push rbx
sub rsp, 32 ; 00000020H
; Line 29
mov ecx, 160 ; 000000a0H
mov rbx, r8
call QWORD PTR __imp_malloc <------this is not inlined
; Line 31
mov rcx, rax
; Line 33
mov rdx, rax
and ecx, 31
sub rdx, rcx
add rdx, 32 ; 00000020H
mov QWORD PTR [rbx], rdx
; Line 35
add rsp, 32 ; 00000020H
pop rbx
ret 0
Question: is this a must-have property of functions like malloc? Can we inline it some way to inspect it (or any other function like strcmp/new/free/delete)? Is this forbidden?
Typically the compiler will inline functions when it has the source code available during compilation (in other words, when the function is defined in a header file rather than just declared with a prototype).
However, in this case, the function (malloc) is in a DLL, so clearly the source code is not available to the compiler while it compiles your code. It has nothing to do with what malloc does (etc.). However, it's also likely that malloc wouldn't be inlined anyway, since it is a fairly large function [at least it often is], which prevents it from being inlined even if the source code were available.
If you are using Visual Studio, you can almost certainly find the source code for your runtime library, as it is supplied with the Visual Studio package.
(The C runtime functions are in a DLL because many different programs in the system use the same functions, so putting them in a DLL that is loaded once for all "users" of the functionality will give a good saving on the size of all the code in the system. Although malloc is perhaps only a few hundred bytes, a function like printf can easily add some 5-25KB to the size of an executable. Multiply that by the number of "users" of printf, and there is likely several hundred kilobytes just from that one function "saved" - and of course, all other functions such as fopen, fclose, malloc, calloc, free, and so on all add a little bit each to the overall size)
A C compiler is allowed to inline malloc (or, as you see in your example, part of it), but it is not required to inline anything. The heuristics it uses need not be documented, and they're usually quite complex, but normally only short functions will be inlined, since otherwise code-bloat is likely.
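As an aside on that "allowed to" point: even without inlining malloc's body, compilers that know malloc's semantics can optimize around it. For example, recent gcc and clang will often elide a malloc/free pair whose pointer doesn't escape (a hedged illustration; nothing guarantees this):

#include <cstdlib>

int demo() {
    int *p = static_cast<int *>(std::malloc(sizeof(int)));
    if (!p) return -1;
    *p = 42;
    int v = *p;
    std::free(p);
    return v;    // typically compiled to just "mov eax, 42 / ret" at -O2
}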
malloc and friends are implemented in the runtime library, so they're not available for inlining. They would need to have their implementation in their header files for that to happen.
If you want to see their disassembly, you could step into them with a debugger. Or, depending on the compiler and runtime you're using, the source code might be available. It is available for both gcc and msvc, for example.
The main thing stopping the inlining of malloc() et al is their complexity — and the obvious fact that no inline definition of the function is provided. Besides, you may need different versions of the function at different times; it would be harder (messier) for tools like valgrind to work, and you could not arrange to use a debugging version of the functions if their code is expanded inline.

Order of function signature, call and definition

I want to ask about the order of function signature, call, and definition: which one does the computer look at first, second, and third?
So:
#include <iostream>
using namespace std;

void max(void);
void min(void);

int main() {
    max();
    min();
    return 0;
}

void max() {
    return;
}

void min() {
    return;
}
So this is what I think: the computer will go to main and look at the function call, then it will look at the function signature, and last of all it will look at the definition.
Is that right?
Thanks
Is that right?
No.
You need to understand the difference between function declarations and function definitions, the difference between compilation, linking, and execution, and the difference between non-virtual and virtual functions.
Function declarations
This is a function declaration: void max(void);. It doesn't tell the compiler anything about what the function does. What it does is to tell the compiler how to call the function and how to interpret the result. When the compiler is compiling the body of some function, call it function A, the compiler doesn't need to know what other functions do. All it needs to know is what to do with the functions that function A calls. The compiler might generate code in assembly or some intermediate language that corresponds to your C++ function calls. Or it might reject your C++ code because your code doesn't make sense.
Determining whether your code makes sense is another key purpose of those function declarations. This is particularly important in C++ where multiple functions can have the same name. How would the compiler know which of the half dozen or so max functions to call if it didn't know about those functions? When your C++ code calls some function, the compiler must find one best match (possibly involving type conversions) with one of those function declarations. Your code doesn't make sense if the compiler can't find a match at all, or if it finds more than one match but can't distinguish one as the best match.
When the compiler does find a best match, the generated code will be in the form of a call to an undefined external reference to that function. Where that function lives is not the job of the compiler.
Function definitions
That void max(void) was a function declaration. The corresponding void max() {...} is the definition of that function. When the compiler is processing void max() {...} it doesn't have to worry about what other functions have called it. It just has to worry about processing void max() {...}. The body of this function becomes assembly or intermediate-language code that is inserted into some compiled object file. The address of the entry point to this generated code is marked as such.
Compilation versus linking
So far I've talked about what the compiler does. It generates chunks of low-level code that correspond to your C++ code. That generated code is not ready for prime time because of those external references. Resolving those undefined external references is the job of the linker. The linker is what builds your executable from multiple object files, multiple libraries. It keeps track of where it has put those chunks of code in the executable. What about those undefined external references? If the linker has already placed that reference in the executable, the linker simply fills in the placeholder for that reference. If the linker hasn't come across the definition for that reference, it puts the reference and the placeholder onto a list of still-unresolved references. Every time the linker adds a chunk of code to the executable, it checks that list to see if it can fix any of those still-unresolved references. At the end, you will either have all references resolved or you will still have some outstanding ones. The latter is an error. The former means that you have an executable.
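A tiny two-file sketch of that division of labour (the file and function names are made up):

// main.cpp -- the compiler only sees the declaration and emits a call through an
// undefined external reference to (the mangled name of) twice(int)
int twice(int x);
int main() { return twice(21); }

// twice.cpp -- compiled separately; the linker later resolves main.cpp's
// reference to the entry point of this definition
int twice(int x) { return 2 * x; }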
Execution
When your code runs, those function calls are really just some stack management wrapped around the machine language equivalent of that evil goto statement. There's no examining your function declarations; those don't even exist by the time the code is executed. Return? That's a goto also.
Non-virtual versus virtual functions
What I said above pertains to non-virtual functions. Run-time dispatching does occur for virtual functions. That run-time dispatching has nothing to do with examining function declarations. Those virtual functions are perhaps an issue for a different question.
One last thing:
Get out of the habit of using namespace std; Think of it as akin to smoking. It's a bad habit.
As you may know, the compiler converts the program into machine code (via several intermediate steps). Here is the dissassembly of the machine code for main() when compiled on Visual Studio 2012 in debug mode on Windows 8:
int main() {
00C24400 push ebp # Setup stack frame
00C24401 mov ebp,esp
00C24403 sub esp,0C0h
00C24409 push ebx
00C2440A push esi
00C2440B push edi
00C2440C lea edi,[ebp-0C0h] # Fill with guard bytes
00C24412 mov ecx,30h
00C24417 mov eax,0CCCCCCCCh
00C2441C rep stos dword ptr es:[edi]
max();
00C2441E call max (0C21302h) # Call max
min();
00C24423 call min (0C2126Ch) # Call min
return 0;
00C24428 xor eax,eax
}
00C2442A pop edi # Restore stack frame
00C2442B pop esi
00C2442C pop ebx
00C2442D add esp,0C0h
00C24433 cmp ebp,esp
}
00C24435 call __RTC_CheckEsp (0C212D5h) # Check for memory corruption
00C2443A mov esp,ebp
00C2443C pop ebp
00C2443D ret
The exact details will vary from compiler to compiler and operating system to operating system. If min() or max() had arguments or return values, they would be passed as appropriate for the architecture. The key point is that the compiler has already worked out what the arguments and return values are and created machine code that just passes or accepts them.
You can learn more details if you wish, to help with debugging or to make low-level calls, but be aware that the machine code emitted can be highly variable. For example, here is the same code compiled on the same system in release mode (i.e. with optimizations on):
return 0;
01151270 xor eax,eax
}
01151272 ret
As you can see, it has detected that min() and max() do nothing and removed them completely. Since there is now no stack frame to setup and restore, that is gone, leaving a single instruction to set eax to 0 then returning (since the return value is in the eax register).

C++ custom calling convention

While reverse engineering, I came across a very odd program that uses a calling convention that passes one argument in eax (very odd compiler??). I want to call that function now and I don't know how to declare it; IDA defines it as
bool __usercall foo<ax>(int param1<eax>, int param2);
where param1 is passed in the eax register. I tried something like
bool MyFoo(int param1, int param2)
{
    __asm mov eax, param1;
    return reinterpret_cast<bool(__stdcall *)(int)>(g_FooAddress)(param2);
}
However, unfortunately my compiler makes use of the eax register when pushing param2 onto the stack. Is there any way to do this cleanly without writing the whole call in inline assembler? (I am using Visual Studio, if that matters.)
There are "normal" calling conventions which pass arguments via registers. If you are using MSVC for example, __fastcall.
http://en.wikipedia.org/wiki/X86_calling_conventions#fastcall
You cannot define your own calling conventions, but I would suggest creating a wrapper function which does its own calling/cleanup via inline assembly. This is probably the most practical way to achieve this effect, though you could probably also do it faster by using __fastcall, doing a bit of register swapping, then jmp-ing to the correct function.
There's more to a calling convention than argument passing though, so option #1 is probably the best as you'll get full control over how the caller acts.
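A hedged sketch of such a wrapper as a naked function, so the compiler cannot touch eax between loading it and the call (MSVC, 32-bit; it assumes, like the __stdcall cast in the question, that the target pops its own stack argument):

void *g_FooAddress;   // assumed: holds the reverse-engineered function's address, as in the question

__declspec(naked) bool __stdcall MyFoo(int param1, int param2)
{
    __asm {
        mov  ecx, g_FooAddress       ; load the target's address (inline asm reads the variable's value)
        mov  eax, [esp + 4]          ; param1 -> eax (the custom register argument)
        push dword ptr [esp + 8]     ; param2 becomes the one stack argument
        call ecx                     ; target cleans its own stack arg (stdcall-style, per the question)
        ret  8                       ; MyFoo is __stdcall: pop our two arguments on return
    }
}

Because the function is __declspec(naked), MSVC emits no prologue or argument-passing code of its own, so nothing clobbers eax before the call; the cost is that you manage the stack yourself, as the ret 8 shows.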