Can a C/C++ compiler inline builtin functions like malloc()? - c++

While inspecting the disassembly of below function,
void * malloc_float_align(size_t n, unsigned int a, float *& dizi)
{
    void * adres=NULL;
    void * adres2=NULL;
    adres=malloc(n*sizeof(float)+a);
    size_t adr=(size_t)adres;
    size_t adr2=adr+a-(adr&(a-1u));
    adres2=(void *)adr2;
    dizi=(float *)adres2;
    return adres;
}
Builtin functions are not inlined even with the inline optimization flag set.
; Line 26
$LN4:
        push    rbx
        sub     rsp, 32                 ; 00000020H
; Line 29
        mov     ecx, 160                ; 000000a0H
        mov     rbx, r8
        call    QWORD PTR __imp_malloc  ; <------ this is not inlined
; Line 31
        mov     rcx, rax
; Line 33
        mov     rdx, rax
        and     ecx, 31
        sub     rdx, rcx
        add     rdx, 32                 ; 00000020H
        mov     QWORD PTR [rbx], rdx
; Line 35
        add     rsp, 32                 ; 00000020H
        pop     rbx
        ret     0
Question: is this a must-have property of functions like malloc? Can we inline it some way to inspect it (or any other function like strcmp/new/free/delete)? Is this forbidden?

Typically the compiler will inline functions when it has the source code available during compilation (in other words, when the function is defined, rather than just declared as a prototype, in a header file).
However, in this case, the function (malloc) is in a DLL, so clearly the source code is not available to the compiler during the compilation of your code. It has nothing to do with what malloc does (etc). It is also likely that malloc would not be inlined anyway, since it is a fairly large function [at least it often is], which prevents it from being inlined even if the source code is available.
If you are using Visual Studio, you can almost certainly find the source code for your runtime library, as it is supplied with the Visual Studio package.
(The C runtime functions are in a DLL because many different programs in the system use the same functions, so putting them in a DLL that is loaded once for all "users" of the functionality gives a good saving on the size of all the code in the system. Although malloc is perhaps only a few hundred bytes, a function like printf can easily add some 5-25KB to the size of an executable. Multiply that by the number of "users" of printf, and there are likely several hundred kilobytes "saved" from that one function alone - and of course, all the other functions such as fopen, fclose, malloc, calloc, free, and so on each add a little bit to the overall size.)

A C compiler is allowed to inline malloc (or, as you see in your example, part of it), but it is not required to inline anything. The heuristics it uses need not be documented, and they're usually quite complex, but normally only short functions will be inlined, since otherwise code-bloat is likely.

malloc and friends are implemented in the runtime library, so they're not available for inlining. They would need to have their implementation in their header files for that to happen.
If you want to see their disassembly, you could step into them with a debugger. Or, depending on the compiler and runtime you're using, the source code might be available. It is available for both gcc and msvc, for example.
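To illustrate the "implementation must be visible" point: a wrapper whose body lives in a header can be inlined at every call site, while the malloc call inside it still goes out to the runtime library. A minimal sketch (the wrapper name is invented for illustration):

```cpp
#include <cstdlib>

// Hypothetical header-defined wrapper: because the body is visible to the
// compiler at the call site, the wrapper itself can be inlined away, but the
// call to malloc inside it remains an opaque library call.
inline void* checked_alloc(std::size_t n) {
    void* p = std::malloc(n);
    if (!p)
        std::abort();  // allocation failed: bail out
    return p;
}
```

Compiling a caller of checked_alloc with optimization enabled typically shows the null check folded into the caller, with only the `call malloc` itself surviving as an external call.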

The main thing stopping the inlining of malloc() et al is their complexity — and the obvious fact that no inline definition of the function is provided. Besides, you may need different versions of the function at different times; it would be harder (messier) for tools like valgrind to work, and you could not arrange to use a debugging version of the functions if their code is expanded inline.
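As an aside, the hand-rolled alignment arithmetic in the question's code can also be expressed with the standard library. A sketch using std::align (C++11; the function name and interface here are invented, not the question's exact code):

```cpp
#include <cstdlib>
#include <cstdint>
#include <memory>

// Over-allocate by 'alignment' bytes, then let std::align round the pointer
// up to the next boundary. Returns the raw pointer (which the caller must
// eventually pass to free) and writes the aligned pointer through 'aligned'.
void* malloc_aligned(std::size_t n, std::size_t alignment, float*& aligned) {
    std::size_t space = n * sizeof(float) + alignment;
    void* raw = std::malloc(space);
    void* p = raw;
    // std::align adjusts p upward if enough space remains, else returns null
    if (raw && std::align(alignment, n * sizeof(float), p, space))
        aligned = static_cast<float*>(p);
    else
        aligned = nullptr;
    return raw;
}
```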

Related

Why does `monotonic_buffer_resource` appear in the assembly when it doesn't seem to be used?

This is a follow-up from another question.
I think the following code should not use monotonic_buffer_resource, but in the generated assembly there are references to it.
void default_pmr_alloc(std::pmr::polymorphic_allocator<int>& alloc) {
    (void)alloc.allocate(1);
}
godbolt
I looked into the source code of the header files and libstdc++, but could not find how monotonic_buffer_resource was selected to be used by the default pmr allocator.
The assembly tells the story. In particular, this:
cmp rax, OFFSET FLAT:_ZNSt3pmr25monotonic_buffer_resource11do_allocateEmm
jne .L11
This appears to be a test to see if the memory resource is a monotonic_buffer_resource. This seems to be done by checking the do_allocate member of the vtable. If it is not such a resource (ie: if do_allocate in the memory resource is not the monotonic one), then it jumps down to this:
.L11:
mov rdi, rbx
mov edx, 4
mov esi, 4
pop rbx
jmp rax
This appears to be a vtable call.
The rest of the assembly appears to be an inlined version of monotonic_buffer_resource::do_allocate. Which is why it conditionally calls std::pmr::monotonic_buffer_resource::_M_new_buffer.
So overall, this implementation of polymorphic_resource::allocate seems to have some built-in inlining of monotonic_buffer_resource::do_allocate if the resource is appropriate for that. That is, it won't do a vtable call if it can determine that it should call monotonic_buffer_resource::do_allocate.
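The same fast-path pattern can be sketched in plain C++. The stand-in types below are invented (libstdc++ compares the vtable slot directly rather than using typeid, as the assembly shows), but the idea is the same: detect the expected dynamic type, and call the override non-virtually so the optimizer can inline it; otherwise fall back to normal virtual dispatch:

```cpp
#include <typeinfo>

// Stand-ins for memory_resource / monotonic_buffer_resource (names invented).
struct resource {
    virtual ~resource() = default;
    virtual int do_allocate(int n) { return n; }       // generic path
};
struct monotonic : resource {
    int do_allocate(int n) override { return n * 2; }  // "fast" path
};

// Devirtualize the common case: a qualified call (monotonic::do_allocate)
// is non-virtual, so the compiler is free to inline its body here.
int allocate(resource& r, int n) {
    if (typeid(r) == typeid(monotonic))
        return static_cast<monotonic&>(r).monotonic::do_allocate(n);
    return r.do_allocate(n);  // anything else: ordinary vtable call
}
```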

Are functions in a C++/CLI native class compiled to MSIL or native x64 machine code?

This question is related to another question of mine, titled Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems. I didn't receive any comments and answers, but eventually I found out myself that the problem is caused by function thunks that are inserted by the compiler whenever a managed function calls an unmanaged one, and vice versa. I won't go into the details once again, because today I want to focus on another consequence of this thunking mechanism.
To provide some context for the question, my problem was the replacement of a C++ function for 64-to-128-bit unsigned integer multiplication in an unmanaged C++/CLI class by a function in an MASM64 file for the sake of performance. The ASM replacement is as simple as can be:
AsmMul1 proc                            ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z
; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH
        mov     rax, rcx                ; rax = Factor1
        mul     rdx                     ; rdx:rax = Factor1 * Factor2
        mov     qword ptr [r8], rax     ; [r8] = ProductL
        mov     qword ptr [r9], rdx     ; [r9] = ProductH
        ret
AsmMul1 endp
I expected a big performance boost by replacing a compiled function with four 32-to-64-bit multiplications with a simple CPU MUL instruction. The big surprise was that the ASM version was about four times slower (!) than the C++ version. After a lot of research and testing, I found out that some function calls in C++/CLI involve thunking, which obviously is such a complex thing that it takes much more time than the thunked function itself.
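For reference, a C++ fallback of the kind the ASM was meant to replace can be sketched portably as four 32x32 -> 64-bit partial products (the function name is invented; this is not the questioner's exact code):

```cpp
#include <cstdint>

// Portable 64x64 -> 128-bit unsigned multiply built from four
// 32x32 -> 64-bit multiplies, the pattern the single MUL replaces.
void mul64x64_128(std::uint64_t a, std::uint64_t b,
                  std::uint64_t& lo, std::uint64_t& hi) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    const std::uint64_t p0 = a_lo * b_lo;   // bits   0..63
    const std::uint64_t p1 = a_lo * b_hi;   // bits  32..95
    const std::uint64_t p2 = a_hi * b_lo;   // bits  32..95
    const std::uint64_t p3 = a_hi * b_hi;   // bits 64..127

    // Middle column: carries from p0 plus the low halves of p1 and p2.
    const std::uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);
    lo = (p0 & 0xFFFFFFFFu) | (mid << 32);
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```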
After reading more about this thunking, it turned out that whenever you are using the compiler option /clr, the calling convention of all functions is silently changed to __clrcall, which means that they become managed functions. Exceptions are functions that use compiler intrinsics, inline ASM, and calls to other DLLs via dllimport - and as my tests revealed, this seems to include functions that call external ASM functions.
As long as all interacting functions use the __clrcall convention (i.e. are managed), no thunking is involved, and everything runs smoothly. As soon as the managed/unmanaged boundary is crossed in either direction, thunking kicks in, and performance is seriously degraded.
Now, after this long prologue, let's get to the core of my question. As far as I understand the __clrcall convention, and the /clr compiler switch, marking a function in an unmanaged C++ class this way causes the compiler to emit MSIL code. I've found this sentence in the documentation of __clrcall:
When marking a function as __clrcall, you indicate the function
implementation must be MSIL and that the native entry point function
will not be generated.
Frankly, this is scaring me! After all, I'm going through the hassles of writing C++/CLI code in order to get real native code, i.e. super-fast x64 machine code. However, this doesn't seem to be the default for mixed assemblies. Please correct me if I'm getting it wrong: If I'm using the project defaults given by VC2017, my assembly contains MSIL, which will be JIT-compiled. True?
There is a #pragma managed that seems to inhibit the generation of MSIL in favor of native code on a per-function basis. I've tested it, and it works, but then the problem is that thunking gets in the way again as soon as the native code calls a managed function, and vice versa. In my C++/CLI project, I found no way to configure the thunking and code generation without getting a performance hit at some place.
So what I'm asking myself now: What's the point in using C++/CLI in the first place? Does it give me performance advantages, when everything is still compiled to MSIL? Maybe it's better to write everything in pure C++ and use Pinvoke to call those functions? I don't know, I'm kind of stuck here.
Maybe someone can shed some light on this terribly poorly documented topic...

Can I make a C++ method in external assembly (function.asm)?

I am writing a program that requires one function in assembly. It would be pretty helpful to encapsulate the assembly function in a C++ class, so its own data is isolated and I can create multiple instances.
If I create a class and call an external function from a C++ method, the function is reentrant as long as it keeps its own local "variables" in its stack frame.
Is there some way to make the assembly function a C++ method, maybe using name mangling, so the function is implemented in assembly but the prototype is declared inside the C++ class?
If not possible, is there some way to create multiple instances (dynamically) of the assembly function although it is not part of the class? Something like clone the function in memory and just call it, obviously using relocatable code (adding a delta displacement for variables and data if required)...
I am writing a program that requires one function in assembly.
Then, by definition, your program becomes much less portable. And depends upon the calling conventions and ABI of your C++ implementation and your operating system.
It then makes sense to use some compiler-specific features (which are not in portable standard C++11, e.g. in n3337).
My recommendation is then to take advantage of GCC extended assembly. Read the chapter on using assembly language with C (it also, and of course, applies to C++).
By directly embedding some extended asm inside a C++ member function, you avoid the hassle of calling some function. Probably, your assembler code is really short and executed quickly. So it is better to embed it in C or C++ functions, avoiding the costs of function call prologue and epilogue.
NB: In 2019, there is no economic sense in spending effort writing large amounts of assembly code: most optimizing compilers produce better assembler code than a reasonable programmer can (in a reasonable time). So you have an incentive to use small assembler code chunks inside larger C++ or C functions.
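As a concrete example of the extended-asm approach, here is a minimal sketch (x86-64 GCC/Clang syntax, guarded so it falls back to plain C++ on other compilers or architectures):

```cpp
// GCC/Clang extended asm: add two ints in place.
// The "+r" constraint makes 'result' both an input and an output register.
int add_asm(int a, int b) {
#if defined(__GNUC__) && defined(__x86_64__)
    int result = a;
    asm("addl %1, %0" : "+r"(result) : "r"(b));
    return result;
#else
    return a + b;  // portable fallback
#endif
}
```

Because the asm is embedded directly in the function, there is no call prologue/epilogue overhead, which is exactly the point made above.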
Yes, you can. Either define it as an inline wrapper that passes all the args (including the implicit this pointer) to an external function, or figure out the name-mangling to define the right symbol for the function entry point in asm.
An example of the wrapper way:
class myclass;  // forward declaration so the prototype below can use myclass*
extern "C" int asm_function(myclass *p, int a, double b);

class myclass {
    int q, r, member_array[4];
public:
    int my_method(int a, double b) { return asm_function(this, a, b); }
};
A stand-alone definition of my_method for x86-64 would be just jmp asm_function, a tailcall, because the args are identical. So after inlining, you'll have call asm_function instead of call _Zmyclass_mymethodZd or whatever the actual name mangling is. (I made that up).
In GNU C / C++, there's also the asm keyword to set the asm symbol name for a function, instead of letting the normal name-mangling rules generate it from the class and member-function name, and arg types. (Or with extern "C", usually just a leading underscore or not, depending on the platform.)
class myclass {
    int q, r, member_array[4];
public:
    int my_method(int a, double b)
        asm("myclass_my_method_int_double"); // symbol name for separate asm
};
Then in your .asm file (e.g. NASM syntax, for the x86-64 System V calling convention)
global myclass_my_method_int_double
myclass_my_method_int_double:
        ;; inputs: myclass *this in RDI, int a in ESI, double b in XMM0
        cvtsd2si eax, xmm0
        add      eax, [rdi+4]      ;; this->r
        imul     eax, esi
        ret
(You can pick any name you want for your asm function; it doesn't have to encode the args. But doing that will let you overload it without conflicting symbol names.)
Example on Godbolt of a test caller calling the asm("") way:
void foo(myclass *p){
    p->my_method(1, 1.0);
}
compiles to
foo(myclass*):
        movsd   xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
        mov     esi, 1
        jmp     myclass_my_method_int_double     # TAILCALL
Note that the caller emitted jmp myclass_my_method_int_double, using your name, not a mangled name.

C++ inline assembly (Intel compiler): LEA and MOV behaving differently in Windows and Linux

I am converting a huge Windows dll to work on both Windows and Linux. The dll has a lot of assembly (and SS2 instructions) for video manipulation.
The code now compiles fine on both Windows and Linux using Intel compiler included in Intel ComposerXE-2011 on Windows and Intel ComposerXE-2013 SP1 on Linux.
The execution, however, crashes in Linux when trying to call a function pointer. I traced the code in gdb and indeed the function pointer doesn't point to the required function (whereas in Windows it does). Almost everything else works fine.
This is the sequence of code:
...
mov rdi, this
lea rdx, [rdi].m_sSomeStruct
...
lea rax, FUNCTION_NAME # if replaced by 'mov', works in Linux but crashes in Windows
mov [rdx].m_pfnFunction, rax
...
call [rdx].m_pfnFunction # crash in Linux
where:
1) 'this' has a struct member m_sSomeStruct.
2) m_sSomeStruct has a member m_pfnFunction, which is a pointer to a function.
3) FUNCTION_NAME is a free function in the same compilation unit.
4) All those pure assembly functions are declared as naked.
5) 64-bit environment.
What is confusing me the most is that if I replace the 'lea' instruction that is supposed to load the function's address into rax with a 'mov' instruction, it works fine on Linux but crashes on Windows. I traced the code in both Visual Studio and gdb and apparently in Windows 'lea' gives the correct function address, whereas in Linux 'mov' does.
I tried looking into the Intel assembly reference but didn't find much to help me there (unless I wasn't looking in the right place).
Any help is appreciated. Thanks!
Edit More details:
1) I tried using square brackets
lea rax, [FUNCTION_NAME]
but that didn't change the behaviour in Windows nor in Linux.
2) I looked at the disassembly in gdb and in Windows; both seem to show the same instructions that I actually wrote. What's even worse is that I tried putting both lea/mov one after the other, and when I look at them in the disassembly in gdb, the address printed after each instruction following a # sign (which I'm assuming is the address that's going to be stored in the register) is actually the same, and is NOT the correct address of the function.
It looked like this in gdb disassembler
lea 0xOffset1(%rip), %rax # 0xSomeAddress
mov 0xOffset2(%rip), %rax # 0xSomeAddress
where both (SomeAddress) values were identical, and the two offsets differed only by the size difference between the lea and mov instructions.
But somehow, when I check the contents of the registers after each execution, mov seems to put in the correct value!
3) The member variable m_pfnFunction is of type LOAD_FUNCTION which is defined as
typedef void (*LOAD_FUNCTION)(const void*, void*);
4) The function FUNCTION_NAME is declared in the .h (within a namespace) as
void FUNCTION_NAME(const void* , void*);
and implemented in .cpp as
__declspec(naked) void namespace_name::FUNCTION_NAME(const void* , void*)
{
...
}
5) I tried turning off optimizations by adding
#pragma optimize("", off)
but I still have the same issue
Off hand, I suspect that the way linking to DLLs works in the latter case is that FUNCTION_NAME is a memory location that actually will be set to the loaded address of the function. That is, it's a reference (or pointer) to the function, not the entry point.
I'm familiar with Win (not the other), and I've seen how calling a function might either
(1) generate a CALL to that address, which is filled in at link time. Normal enough for functions in the same module, but if it's discovered at link time that it's in a different DLL, then the Import Library is a stub that the linker treats the same as any normal function, but is nothing more than JMP [????]. The table of addresses to imported functions is arranged to have bytes that code a JMP instruction just before the field that will hold the address. The table is populated at DLL Load time.
(2) If the compiler knows that the function will be in a different DLL, it can generate more efficient code: It codes an indirect CALL to the address located in the import table. The stub function shown in (1) has a symbol name associated with it, and the actual field containing the address has a symbol name too. They both are named for the function, but with different "decorations". In general, a program might contain fixup references to both.
So, I conjecture that the symbol name you used matches the stub function on one compiler, and (that it works in a similar way) matches the pointer on the other platform. Maybe the assembler assigns the unmangled name to one or the other depending on whether it is declared as imported, and the options are different on the two toolchains.
Hope that helps. I suppose you could look at run-time in a debugger and see if the above helps you interpret the address and the stuff around it.
After reading the difference between mov and lea here What's the purpose of the LEA instruction? it looks to me like on Linux there is one additional level of indirection added into the function pointer. The mov instruction causes that extra level of indirection to be passed through, while on Windows without that extra indirection you would use lea.
Are you by any chance compiling with PIC on Linux? I could see that adding the extra indirection layer.
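One way to sidestep the lea-versus-mov question entirely is to let the compiler materialize the address: assign the function pointer in C++ (or in a small C++ helper called from the asm), and the compiler will emit a RIP-relative lea, a GOT load, or whatever the platform's code model requires. A sketch with invented stand-in names mirroring the question's types:

```cpp
typedef void (*LOAD_FUNCTION)(const void*, void*);

// Stand-in for the free function whose address is being taken.
void my_function(const void* in, void* out) {
    *static_cast<int*>(out) = *static_cast<const int*>(in) + 1;
}

struct SomeStruct {
    LOAD_FUNCTION m_pfnFunction;
};

void setup(SomeStruct& s) {
    // The compiler chooses the correct instruction and relocation
    // (lea vs GOT load) for PIC or non-PIC builds; no hand-written asm needed.
    s.m_pfnFunction = &my_function;
}
```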

How to call a function using Delphi's register calling conventions from Visual C++?

I have a program written in Visual C++ 2012, and I was trying to call a function written in Delphi(which I don't have the source code). Here is the code in Visual C++:
int (__fastcall *test)(void*) = (int(__fastcall *)(void*))0x00489A7D;
test((void *)0x12345678);
But in the compiled code it actually was:
.text:1000113B mov eax, 489A7Dh
.text:10001140 mov ecx, 12345678h
.text:10001145 call eax
And what I am expecting is:
.text:1000113B mov ebx, 489A7Dh
.text:10001140 mov eax, 12345678h
.text:10001145 call ebx
I know 'fastcall' uses EAX, ECX, EDX as parameters, but I don't know why the Visual C++ compiler uses EAX as the entry point. Shouldn't EAX be the first parameter (12345678h)?
I tried to call the Delphi function in assembly code and it works, but I really want to know how to do that without using assembly.
So is it possible to make the Visual C++ compiler generate the code I am expecting? If yes, how?
Delphi's register calling convention, also known as Borland fastcall, on x86 uses EAX, EDX and ECX registers, in that order.
However, Microsoft's fastcall calling convention uses different registers. It does not use EAX at all. Instead it uses ECX and EDX registers for first two parameters, as described by the documentation.
So, with that information you could probably write some assembler to make a Delphi register function call from C++, by moving the parameter into the EAX register. However, it's going to be so much easier to let the Delphi compiler do that. Especially as I imagine that your real problem involves multiple functions and more than a single parameter.
I suggest that you write some Pascal code to adapt between stdcall and register.
function FuncRegister(param: Pointer): Integer; register; external '...';

function FuncStdcall(param: Pointer): Integer; stdcall;
begin
  Result := FuncRegister(param);
end;

exports
  FuncStdcall;
Then you can call FuncStdcall from your C++ code and let the Delphi compiler handle the parameter passing.