Can I make clang generate absolute addresses for function pointers? - c++

Here's a simplified version of some code I'm working with right now:
int Do_A_Thing(void)
{
return(42);
}
void Some_Function(void)
{
int (*fn_do_thing)(void) = Do_A_Thing;
fn_do_thing();
}
When I compile this in xcode 4.1, the assembler it generates sets the fn_do_thing variable like so:
0x2006: calll 0x200b ;
0x200b: popl %eax ; get EIP
0x200c: movl 1333(%eax), %eax
I.e. it generates a relative address for the place to find the Do_A_Thing function - "current instruction plus 1333", which according to the map file is a "non-lazy pointer" to the function.
When I compile similar code on Windows with Visual Studio, windows generates a fixed address instead of doing it relatively like this. If Do_A_Thing lives at, for example, 0x40050914, it just sets fn_do_thing to 0x40050914. By contrast, xcode sets it to "where I am + some offset".
Is there any way to make xcode generate an absolute address to set the function pointer to, like visual studio does? Or is there some reason that wouldn't work? I have noticed that every time I run the program, the Do_A_Thing function (and all other functions) seem to load at a different address.

You're looking at position independent code (more specifically Position Independent Executable) in action. This, as you noticed, allows the OS to load the binary anywhere in memory, which provides numerous security improvements for potentially insecure code.
You can disable it via removing a linker option in XCode (-Wl,-pie).
Note that on x86_64 (amd64), instructions can operate relative to the instruction pointer, which improves the efficiency of this technique (and makes it basically "free" in performance cost).

Related

How to execute separate compiled binary file from inside program on MCU?

I have an MCU (say an STM32) running, and I would like to 'pass' it a separately compiled binary file over UART/USB and use it like calling a function, where I can pass it data and collect its output? After its complete, a second, different binary would be sent to be executed, and so on.
How can I do this? Does this require an OS be running? I'd like to avoid that overhead.
Thanks!
It is somewhat specific to the mcu what the exact call function is but you are just making a function call. You can try the function pointer thing but that has been known to fail with thumb (on gcc)(stm32 uses the thumb instruction set from arm).
First off you need to decide in your overall system design if you want to use a specific address for this code. for example 0x20001000. or do you want to have several of these resident at the same time and want to load them at any one of multiple possible addresses? This will determine how you link this code. Is this code standalone? with its own variables or does it want to know how to call functions in other code? All of this determines how you build this code. The easiest, at least to first try this out, is a fixed address. Build like you build your normal application but based in a ram address like 0x20001000. Then you load the program sent to you at that address.
In any case the normal way to "call" a function in thumb (say an stm32). Is the bl or blx instruction. But normally in this situation you would use bx but to make it a call need a return address. The way arm/thumb works is that for bx and other related instructions the lsbit determines the mode you switch/stay in when branching. Lsbit set is thumb lsbit clear is arm. This is all documented in the arm documentation which completely covers your question BTW, not sure why you are asking...
Gcc and I assume llvm struggles to get this right and then some users know enough to be dangerous and do the worst thing of ADDing one (rather than ORRing one) or even attempting to put the one there. Sometimes putting the one there helps the compiler (this is if you try to do the function pointer approach and hope the compiler does all the work for you *myfun = 0x10000 kind of thing). But it has been shown on this site that you can make subtle changes to the code or depending on the exact situation the compiler will get it right or wrong and without looking at the code you have to help with the orr one thing. As with most things when you need an exact instruction, just do this in asm (not inline please, use real) yourself, make your life 10000 times easier...and your code significantly more reliable.
So here is my trivial solution, extremely reliable, port the asm to your assembly language.
.thumb
.thumb_func
.globl HOP
HOP:
bx r0
I C it looks like this
void HOP ( unsigned int );
Now if you loaded to address 0x20001000 then after loading there
HOP(0x20001000|1);
Or you can
.thumb
.thumb_func
.globl HOP
HOP:
orr r0,#1
bx r0
Then
HOP(0x20001000);
The compiler generates a bl to hop which means the return path is covered.
If you want to send say a parameter...
.thumb
.thumb_func
.globl HOP
HOP:
orr r1,#1
bx r1
void HOP ( unsigned int, unsigned int );
HOP(myparameter,0x20001000);
Easy and extremely reliable, compiler cannot mess this up.
If you need to have functions and global variables between the main app and the downloaded app, then there are a few solutions and they involve resolving addresses, if the loaded app and the main app are not linked at the same time (doing a copy and jump and single link is generally painful and should be avoided, but...) then like any shared library you need to have a mechanism for resolving addresses. If this downloaded code has several functions and global variables and/or your main app has several functions and global variables that the downloaded library needs, then you have to solve this. Essentially one side has to have a table of addresses in a way that both sides agree on the format, could be as a simple array of addresses and both sides know which address is which simply from position. Or you create a list of addresses with labels and then you have to search through the list matching up names to addresses for all the things you need to resolve. You could for example use the above to have a setup function that you pass an array/structure to (structures across compile domains is of course a very bad thing). That function then sets up all the local function pointers and variable pointers to the main app so that subsequent functions in this downloaded library can call the functions in the main app. And/or vice versa this first function can pass back an array structure of all the things in the library.
Alternatively a known offset in the downloaded library there could be an array/structure for example the first words/bytes of that downloaded library. Providing one or the other or both, that the main app can find all the function addresses and variables and/or the caller can be given the main applications function addresses and variables so that when one calls the other it all works... This of course means function pointers and variable pointers in both directions for all of this to work. Think about how .so or .dlls work in linux or windows, you have to replicate that yourself.
Or you go the path of linking at the same time, then the downloaded code has to have been built along with the code being run, which is probably not desirable, but some folks do this, or they do this to load code from flash to ram for various reasons. but that is a way to resolve all the addresses at build time. then part of the binary in the build you extract separately from the final binary and then pass it around later.
If you do not want a fixed address, then you need to build the downloaded binary as position independent, and you should link that with .text and .bss and .data at the same address.
MEMORY
{
hello : ORIGIN = 0x20001000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > hello
.rodata : { *(.rodata*) } > hello
.bss : { *(.bss*) } > hello
.data : { *(.data*) } > hello
}
you should obviously do this anyway, but with position independent then you have it all packed in along with the GOT (might need a .got entry but I think it knows to use .data). Note, if you put .data after .bss with gnu at least and insure, even if it is a bogus variable you do not use, make sure you have one .data then .bss is zero padded and allocated for you, no need to set it up in a bootstrap.
If you build for position independence then you can load it almost anywhere, clearly on arm/thumb at least on a word boundary.
In general for other instruction sets the function pointer thing works just fine. In ALL cases you simply look at the documentation for the processor and see the instruction(s) used for calling and returning or branching and simply use that instruction, be it by having the compiler do it or forcing the right instruction so that you do not have it fail down the road in a re-compile (and have a very painful debug). arm and mips have 16 bit modes that require specific instructions or solutions for switching modes. x86 has different modes 32 bit and 64 bit and ways to switch modes, but normally you do not need to mess with this for something like this. msp430, pic, avr, these should be just a function pointer thing in C should work fine. In general do the function pointer thing then see what the compiler generates and compare that to the processor documentation. (compare it to a non-function pointer call).
If you do not know these basic C concepts of function pointer, linking a bare metal app on an mcu/processor, bootstrap, .text, .data, etc. You need to go learn all that.
The times you decide to switch to an operating system are....if you need a filesystem, networking, or a few things like this where you just do not want to do that yourself. Now sure there is lwip for networking and some embedded filesystem libraries. And multithreading then an os as well, but if all you want to do is generate a branch/jump/call instruction you do not need an operating system for that. Just generate the call/branch/whatever.
Loading and execution a fully linked binary and loading and calling a single function (and returning to the caller) are not really the same thing. The latter is somewhat complicated and involves "dynamic linking", where the code effectively and secures in the same execution environment as the caller.
Loading a complete stand-alone executable in the other hand is more straightforward and is the function of a bootloader. A bootloader loads and jumps to the loaded executable which then establishes it's own execution environment. Returning to the bootloader requires a processor reset.
In this case it would make sense to have the bootloader load and execute code in RAM if you are going to be frequently loading different code. However be aware that on Harvard Architecture devices like STM32, RAM execution may slow down execution because data and instruction fetch share the same bus.
The actual implementation of a bootloader will depend on the target architecture, but for Cortex-M devices is fairly straightforward and dealt with elsewhere.
STM32 actually includes an on-chip bootloader (you need to configure the boot source pins to invoke it), which I believe can load and execute code in RAM. It is normally used to load a secondary bootloader to load and program flash, but it can be used for loading any code.
You do need to build and link your code to run from RAM at the address tle loader locates it, or if supported build position-indeoendent code that can run from anywhere.

C++ self erasing code [duplicate]

I was reading this question because I'm trying to find the size of a function in a C++ program, It is hinted at that there may be a way that is platform specific. My targeted platform is windows
The method I currently have in my head is the following:
1. Obtain a pointer to the function
2. Increment the Pointer (& counter) until I reach the machine code value for ret
3. The counter will be the size of the function?
Edit1: To clarify what I mean by 'size' I mean the number of bytes (machine code) that make up the function.
Edit2: There have been a few comments asking why or what do I plan to do with this. The honest answer is I have no intention, and I can't really see the benefits of knowing a functions length pre-compile time. (although I'm sure there are some)
This seems like a valid method to me, will this work?
Wow, I use function size counting all the time and it has lots and lots of uses. Is it reliable? No way. Is it standard c++? No way. But that's why you need to check it in the disassembler to make sure it worked, every time that you release a new version. Compiler flags can mess up the ordering.
static void funcIwantToCount()
{
// do stuff
}
static void funcToDelimitMyOtherFunc()
{
__asm _emit 0xCC
__asm _emit 0xCC
__asm _emit 0xCC
__asm _emit 0xCC
}
int getlength( void *funcaddress )
{
int length = 0;
for(length = 0; *((UINT32 *)(&((unsigned char *)funcaddress)[length])) != 0xCCCCCCCC; ++length);
return length;
}
It seems to work better with static functions. Global optimizations can kill it.
P.S. I hate people, asking why you want to do this and it's impossible, etc. Stop asking these questions, please. Makes you sound stupid. Programmers are often asked to do non-standard things, because new products almost always push the limits of what's availble. If they don't, your product is probably a rehash of what's already been done. Boring!!!
No, this will not work:
There is no guarantee that your function only contains a single ret instruction.
Even if it only does contain a single ret, you can't just look at the individual bytes - because the corresponding value could appear as simply a value, rather than an instruction.
The first problem can possibly be worked around if you restrict your coding style to, say, only have a single point of return in your function, but the other basically requires a disassembler so you can tell the individual instructions apart.
It is possible to obtain all blocks of a function, but is an unnatural question to ask what is the 'size' of a function. Optimized code will rearrange code blocks in the order of execution and will move seldom used blocks (exception paths) into outer parts of the module. For more details, see Profile-Guided Optimizations for example how Visual C++ achieves this in link time code generation. So a function can start at address 0x00001000, branch at 0x00001100 into a jump at 0x20001000 and a ret, and have some exception handling code 0x20001000. At 0x00001110 another function starts. What is the 'size' of your function? It does span from 0x00001000 to +0x20001000, but it 'owns' only few blocks in that span. So your question should be unasked.
There are other valid questions in this context, like the total number of instructions a function has (can be determined from the program symbol database and from the image), and more importantly, what is the number of instructions in the frequent executed code path inside the function. All these are questions normally asked in the context of performance measurement and there are tools that instrument code and can give very detailed answers.
Chasing pointers in memory and searching for ret will get you nowhere I'm afraid. Modern code is way way way more complex than that.
This won't work... what if there's a jump, a dummy ret, and then the target of the jump? Your code will be fooled.
In general, it's impossible to do this with 100% accuracy because you have to predict all code paths, which is like solving the halting problem. You can get "pretty good" accuracy if you implement your own disassembler, but no solution will be nearly as easy as you imagine.
A "trick" would be to find out which function's code is after the function that you're looking for, which would give pretty good results assuming certain (dangerous) assumptions. But then you'd have to know what function comes after your function, which, after optimizations, is pretty hard to figure out.
Edit 1:
What if the function doesn't even end with a ret instruction at all? It could very well just jmp back to its caller (though it's unlikely).
Edit 2:
Don't forget that x86, at least, has variable-length instructions...
Update:
For those saying that flow analysis isn't the same as solving the halting problem:
Consider what happens when you have code like:
foo:
....
jmp foo
You will have to follow the jump each time to figure out the end of the function, and you cannot ignore it past the first time because you don't know whether or not you're dealing with self-modifying code. (You could have inline assembly in your C++ code that modifies itself, for instance.) It could very well extend to some other place of memory, so your analyzer will (or should) end in an infinite loop, unless you tolerate false negatives.
Isn't that like the halting problem?
I'm posting this to say two things:
1) Most of the answers given here are really bad and will break easily. If you use the C function pointer (using the function name), in a debug build of your executable, and possibly in other circumstances, it may point to a JMP shim that will not have the function body itself. Here's an example. If I do the following for the function I defined below:
FARPROC pfn = (FARPROC)some_function_with_possibility_to_get_its_size_at_runtime;
the pfn I get (for example: 0x7FF724241893) will point to this, which is just a JMP instruction:
Additionally, a compiler can nest several of those shims, or branch your function code so that it will have multiple epilogs, or ret instructions. Heck, it may not even use a ret instruction. Then, there's no guarantee that functions themselves will be compiled and linked in the order you define them in the source code.
You can do all that stuff in assembly language, but not in C or C++.
2) So that above was the bad news. The good news is that the answer to the original question is, yes, there's a way (or a hack) to get the exact function size, but it comes with the following limitations:
It works in 64-bit executables on Windows only.
It is obviously Microsoft specific and is not portable.
You have to do this at run-time.
The concept is simple -- utilize the way SEH is implemented in x64 Windows binaries. Compiler adds details of each function into the PE32+ header (into the IMAGE_DIRECTORY_ENTRY_EXCEPTION directory of the optional header) that you can use to obtain the exact function size. (In case you're wondering, this information is used for catching, handling and unwinding of exceptions in the __try/__except/__finally blocks.)
Here's a quick example:
//You will have to call this when your app initializes and then
//cache the size somewhere in the global variable because it will not
//change after the executable image is built.
size_t fn_size; //Will receive function size in bytes, or 0 if error
some_function_with_possibility_to_get_its_size_at_runtime(&fn_size);
and then:
#include <Windows.h>
//The function itself has to be defined for two types of a call:
// 1) when you call it just to get its size, and
// 2) for its normal operation
bool some_function_with_possibility_to_get_its_size_at_runtime(size_t* p_getSizeOnly = NULL)
{
//This input parameter will define what we want to do:
if(!p_getSizeOnly)
{
//Do this function's normal work
//...
return true;
}
else
{
//Get this function size
//INFO: Works only in 64-bit builds on Windows!
size_t nFnSz = 0;
//One of the reasons why we have to do this at run-time is
//so that we can get the address of a byte inside
//the function body... we'll get it as this thread context:
CONTEXT context = {0};
RtlCaptureContext(&context);
DWORD64 ImgBase = 0;
RUNTIME_FUNCTION* pRTFn = RtlLookupFunctionEntry(context.Rip, &ImgBase, NULL);
if(pRTFn)
{
nFnSz = pRTFn->EndAddress - pRTFn->BeginAddress;
}
*p_getSizeOnly = nFnSz;
return false;
}
}
This can work in very limited scenarios. I use it in part of a code injection utility I wrote. I don't remember where I found the information, but I have the following (C++ in VS2005):
#pragma runtime_checks("", off)
static DWORD WINAPI InjectionProc(LPVOID lpvParameter)
{
// do something
return 0;
}
static DWORD WINAPI InjectionProcEnd()
{
return 0;
}
#pragma runtime_checks("", on)
And then in some other function I have:
size_t cbInjectionProc = (size_t)InjectionProcEnd - (size_t)InjectionProc;
You have to turn off some optimizations and declare the functions as static to get this to work; I don't recall the specifics. I don't know if this is an exact byte count, but it is close enough. The size is only that of the immediate function; it doesn't include any other functions that may be called by that function. Aside from extreme edge cases like this, "the size of a function" is meaningless and useless.
The real solution to this is to dig into your compiler's documentation. The ARM compiler we use can be made to produce an assembly dump (code.dis), from which it's fairly trivial to subtract the offsets between a given mangled function label and the next mangled function label.
I'm not certain which tools you will need for this with a windows target, however. It looks like the tools listed in the answer to this question might be what you're looking for.
Also note that I (working in the embedded space) assumed you were talking about post-compile-analysis. It still might be possible to examine these intermediate files programmatically as part of a build provided that:
The target function is in a different object
The build system has been taught the dependencies
You know for sure that the compiler will build these object files
Note that I'm not sure entirely WHY you want to know this information. I've needed it in the past to be sure that I can fit a particular chunk of code in a very particular place in memory. I have to admit I'm curious what purpose this would have on a more general desktop-OS target.
In C++, the there is no notion of function size. In addition to everything else mentioned, preprocessor macros also make for an indeterminate size. If you want to count number of instruction words, you can't do that in C++, because it doesn't exist until it's been compiled.
What do you mean "size of a function"?
If you mean a function pointer than it is always just 4 bytes for 32bits systems.
If you mean the size of the code than you should just disassemble generated code and find the entry point and closest ret call. One way to do it is to read the instruction pointer register at the beginning and at the end of your function.
If you want to figure out the number of instructions called in the average case for your function you can use profilers and divide the number of retired instructions on the number of calls.
I think it will work on windows programs created with msvc, as for branches the 'ret' seems to always come at the end (even if there are branches that return early it does a jne to go the end).
However you will need some kind of disassembler library to figure the current opcode length as they are variable length for x86. If you don't do this you'll run into false positives.
I would not be surprised if there are cases this doesn't catch.
There is no facilities in Standard C++ to obtain the size or length of a function.
See my answer here: Is it possible to load a function into some allocated memory and run it from there?
In general, knowing the size of a function is used in embedded systems when copying executable code from a read-only source (or a slow memory device, such as a serial Flash) into RAM. Desktop and other operating systems load functions into memory using other techniques, such as dynamic or shared libraries.
Just set PAGE_EXECUTE_READWRITE at the address where you got your function. Then read every byte. When you got byte "0xCC" it means that the end of function is actual_reading_address - 1.
Using GCC, not so hard at all.
void do_something(void) {
printf("%s!", "Hello your name is Cemetech");
do_something_END:
}
...
printf("size of function do_something: %i", (int)(&&do_something_END - (int)do_something));
below code the get the accurate function block size, it works fine with my test
runtime_checks disable _RTC_CheckEsp in debug mode
#pragma runtime_checks("", off)
DWORD __stdcall loadDll(char* pDllFullPath)
{
OutputDebugStringA(pDllFullPath);
//OutputDebugStringA("loadDll...................\r\n");
return 0;
//return test(pDllFullPath);
}
#pragma runtime_checks("", restore)
DWORD __stdcall getFuncSize_loadDll()
{
DWORD maxSize=(PBYTE)getFuncSize_loadDll-(PBYTE)loadDll;
PBYTE pTail=(PBYTE)getFuncSize_loadDll-1;
while(*pTail != 0xC2 && *pTail != 0xC3) --pTail;
if (*pTail==0xC2)
{ //0xC3 : ret
//0xC2 04 00 : ret 4
pTail +=3;
}
return pTail-(PBYTE)loadDll;
};
The non-portable, but API-based and correctly working approach is to use program database readers - like dbghelp.dll on Windows or readelf on Linux. The usage of those is only possible if debug info is enabled/present along with the program. Here's an example on how it works on Windows:
SYMBOL_INFO symbol = { };
symbol.SizeOfStruct = sizeof(SYMBOL_INFO);
// Implies, that the module is loaded into _dbg_session_handle, see ::SymInitialize & ::SymLoadModule64
::SymFromAddr(_dbg_session_handle, address, 0, &symbol);
You will get the size of the function in symbol.Size, but you may also need additional logic identifying whether the address given is a actually a function, a shim placed there by incremental linker or a DLL call thunk (same thing).
I guess somewhat similar can be done via readelf on Linux, but maybe you'll have to come up with the library on top of its sourcecode...
You must bear in mind that although disassembly-based approach is possible, you'll basically have to analyze a directed graph with endpoints in ret, halt, jmp (PROVIDED you have incremental linking enabled and you're able to read jmp-table to identify whether the jmp you're facing in function is internal to that function (missing in image's jmp-table) or external (present in that table; such jmps frequently occur as part of tail-call optimization on x64, as I know)), any calls that are meant to be nonret (like an exception generating helper), etc.
It's an old question but still...
For Windows x64, functions all have a function table, which contains the offset and the size of the function. https://learn.microsoft.com/en-us/windows/win32/debug/pe-format . This function table is used for unwinding when an exception is thrown.
That said, this doesn't contain information like inlining, and all the other issues that people already noted...
int GetFuncSizeX86(unsigned char* Func)
{
if (!Func)
{
printf("x86Helper : Function Ptr NULL\n");
return 0;
}
for (int count = 0; ; count++)
{
if (Func[count] == 0xC3)
{
unsigned char prevInstruc = *(Func - 1);
if (Func[1] == 0xCC // int3
|| prevInstruc == 0x5D// pop ebp
|| prevInstruc == 0x5B// pop ebx
|| prevInstruc == 0x5E// pop esi
|| prevInstruc == 0x5F// pop edi
|| prevInstruc == 0xCC// int3
|| prevInstruc == 0xC9)// leave
return count++;
}
}
}
you could use this assumming you are in x86 or x86_64

C++: Why does this speed my code up?

I have the following function
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
int image_hash_key,
Mat* preloaded_images,
int* random_values){
int first_pixel_row = patch_top_left_row + random_values[0];
int first_pixel_col = patch_top_left_col + random_values[1];
int second_pixel_row = patch_top_left_row + random_values[2];
int second_pixel_col = patch_top_left_col + random_values[3];
int channel = random_values[4];
Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
Which is called about one and a half million times with different values for patch_top_left_row and patch_top_left_col. This takes about 2 seconds to run, now when I change the calculation of first_pixel_row etc to not use the arguments but hard coded numbers instead (shown below), the thing runs sub second and I don't know why. Is the compiler doing something smart here ( I am using gcc cross compiler)?
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
int image_hash_key,
Mat* preloaded_images,
int* random_values){
int first_pixel_row = 5 + random_values[0];
int first_pixel_col = 6 + random_values[1];
int second_pixel_row = 8 + random_values[2];
int second_pixel_col = 10 + random_values[3];
int channel = random_values[4];
Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
EDIT:
I have pasted the assembly from the two versions of the function
using arguments: http://pastebin.com/tpCi8c0F
using constants: http://pastebin.com/bV0d7QH7
EDIT:
After compiling with -O3 I get the following clock ticks and speeds:
using arguments: 1990000 ticks and 1.99seconds
using constants: 330000 ticks and 0.33seconds
EDIT:
using argumenst with -03 compilation: http://pastebin.com/fW2HCnHc
using constant with -03 compilation: http://pastebin.com/FHs68Agi
On the x86 platform there are instructions that very quickly add small integers to a register. These instructions are the lea (aka 'load effective address') instructions and they are meant for computing address offsets for structures and the like. The small integer being added is actually part of the instruction. Smart compilers know that these instructions are very quick and use them for addition even when addresses are not involved.
I bet if you changed the constants to some random value that was at least 24 bits long that you would see much of the speedup disappear.
Secondly those constants are known values. The compiler can do a lot to arrange for those values to end up in a register in the most efficient way possible. With an argument, unless the argument is passed in a register (and I think your function has too many arguments for that calling convention to be used) the compiler has no choice but to fetch the number from memory using a stack offset load instruction. That isn't a particularly slow instruction or anything, but with constants the compiler is free to do something much faster than may involve simply fetching the number from the instruction itself. The lea instructions are simply the most extreme example of this.
Edit: Now that you've pasted the assembly things are much clearer
In the non-constant code, here is how the add is done:
addl -68(%rbp), %eax
This fetches a value from the stack an offset -68(%rpb) and adds it to the %eax% register.
In the constant code, here is how the add is done:
addl $5, %eax
and if you look at the actual numbers, you see this:
0138 83C005
It's pretty clear that the constant being added is encoded directly into the instruction as a small value. This is going to be much faster to fetch than fetching a value from a stack offset for a number of reasons. First it's smaller. Secondly, it's part of an instruction stream with no branches. So it will be pre-fetched and pipelined with no possibility for cache stalls of any kind.
So while my surmise about the lea instruction wasn't correct, I was still on the right track. The constant version uses a small instruction specifically oriented towards adding a small integer to a register. The non-constant version has to fetch an integer that may be of indeterminate size (so it has to fetch ALL the bits, not just the low ones) from a stack offset (which adds in an additional add to compute the actual address from the offset and stack base address).
Edit 2: Now that you've posted the -O3 results
Well, it's much more confusing now. It's apparently inlined the function in question and it jumps around a whole ton between the code for the inlined function and the code for the calling function. I'm going to need to see the original code for the whole file to make a proper analysis.
But what I strongly suspect is happening now is that the unpredictability of the values retrieved from get_random_number_in_range is severely limiting the optimization options available to the compiler. In fact, it looks like in the constant version it doesn't even bother to call get_random_number_in_range because the value is tossed out and never used.
I'm assuming that the values of patch_top_left_row and patch_top_left_col are generated in a loop somewhere. I would push this loop into this function. If the compiler knows the values are generated as part of a loop, there are a very large number of optimization options open to it. In the extreme case it could use some of the SIMD instructions that are part of the various SSE or 3dnow! instruction suites to make things a whole ton faster than even the version you have that uses constants.
The other option would be to make this function inline, which would hint to the compiler that it should try inserting it into the loop in which it's called. If the compiler takes the hint (this function is a bit largish, so the compiler might not) it will have much the same effect as if you'd stuffed the loop into the function.
Well, binary arithmetic operations of immediate constant vs. memory format are expected to produce faster code than the ones of memory vs. memory format, but the timing effect you observe appears to be too extreme, especially considering that there are other operations inside that function.
Could it be that the compiler decided to inline your function? Inlining would allow the compiler to easily eliminate everything related to the unused patch_top_left_row and patch_top_left_col parameters in the second version, including any steps that prepare/calculate these parameters in the calling code.
Technically, this can be done even if the function is not inlined, but it is generally more complicated.

Getting The Size of a C++ Function

I was reading this question because I'm trying to find the size of a function in a C++ program, It is hinted at that there may be a way that is platform specific. My targeted platform is windows
The method I currently have in my head is the following:
1. Obtain a pointer to the function
2. Increment the Pointer (& counter) until I reach the machine code value for ret
3. The counter will be the size of the function?
Edit1: To clarify what I mean by 'size' I mean the number of bytes (machine code) that make up the function.
Edit2: There have been a few comments asking why or what do I plan to do with this. The honest answer is I have no intention, and I can't really see the benefits of knowing a functions length pre-compile time. (although I'm sure there are some)
This seems like a valid method to me, will this work?
Wow, I use function size counting all the time and it has lots and lots of uses. Is it reliable? No way. Is it standard c++? No way. But that's why you need to check it in the disassembler to make sure it worked, every time that you release a new version. Compiler flags can mess up the ordering.
static void funcIwantToCount()
{
// do stuff
}
static void funcToDelimitMyOtherFunc()
{
__asm _emit 0xCC
__asm _emit 0xCC
__asm _emit 0xCC
__asm _emit 0xCC
}
int getlength( void *funcaddress )
{
int length = 0;
for(length = 0; *((UINT32 *)(&((unsigned char *)funcaddress)[length])) != 0xCCCCCCCC; ++length);
return length;
}
It seems to work better with static functions. Global optimizations can kill it.
P.S. I hate people, asking why you want to do this and it's impossible, etc. Stop asking these questions, please. Makes you sound stupid. Programmers are often asked to do non-standard things, because new products almost always push the limits of what's availble. If they don't, your product is probably a rehash of what's already been done. Boring!!!
No, this will not work:
There is no guarantee that your function only contains a single ret instruction.
Even if it only does contain a single ret, you can't just look at the individual bytes - because the corresponding value could appear as simply a value, rather than an instruction.
The first problem can possibly be worked around if you restrict your coding style to, say, only have a single point of return in your function, but the other basically requires a disassembler so you can tell the individual instructions apart.
It is possible to obtain all blocks of a function, but is an unnatural question to ask what is the 'size' of a function. Optimized code will rearrange code blocks in the order of execution and will move seldom used blocks (exception paths) into outer parts of the module. For more details, see Profile-Guided Optimizations for example how Visual C++ achieves this in link time code generation. So a function can start at address 0x00001000, branch at 0x00001100 into a jump at 0x20001000 and a ret, and have some exception handling code 0x20001000. At 0x00001110 another function starts. What is the 'size' of your function? It does span from 0x00001000 to +0x20001000, but it 'owns' only few blocks in that span. So your question should be unasked.
There are other valid questions in this context, like the total number of instructions a function has (can be determined from the program symbol database and from the image), and more importantly, what is the number of instructions in the frequent executed code path inside the function. All these are questions normally asked in the context of performance measurement and there are tools that instrument code and can give very detailed answers.
Chasing pointers in memory and searching for ret will get you nowhere I'm afraid. Modern code is way way way more complex than that.
This won't work... what if there's a jump, a dummy ret, and then the target of the jump? Your code will be fooled.
In general, it's impossible to do this with 100% accuracy because you have to predict all code paths, which is like solving the halting problem. You can get "pretty good" accuracy if you implement your own disassembler, but no solution will be nearly as easy as you imagine.
A "trick" would be to find out which function's code is after the function that you're looking for, which would give pretty good results assuming certain (dangerous) assumptions. But then you'd have to know what function comes after your function, which, after optimizations, is pretty hard to figure out.
Edit 1:
What if the function doesn't even end with a ret instruction at all? It could very well just jmp back to its caller (though it's unlikely).
Edit 2:
Don't forget that x86, at least, has variable-length instructions...
Update:
For those saying that flow analysis isn't the same as solving the halting problem:
Consider what happens when you have code like:
foo:
....
jmp foo
You will have to follow the jump each time to figure out the end of the function, and you cannot ignore it past the first time because you don't know whether or not you're dealing with self-modifying code. (You could have inline assembly in your C++ code that modifies itself, for instance.) It could very well extend to some other place of memory, so your analyzer will (or should) end in an infinite loop, unless you tolerate false negatives.
Isn't that like the halting problem?
I'm posting this to say two things:
1) Most of the answers given here are really bad and will break easily. If you use the C function pointer (using the function name), in a debug build of your executable, and possibly in other circumstances, it may point to a JMP shim that will not have the function body itself. Here's an example. If I do the following for the function I defined below:
FARPROC pfn = (FARPROC)some_function_with_possibility_to_get_its_size_at_runtime;
the pfn I get (for example: 0x7FF724241893) will point to this, which is just a JMP instruction:
Additionally, a compiler can nest several of those shims, or branch your function code so that it will have multiple epilogs, or ret instructions. Heck, it may not even use a ret instruction. Then, there's no guarantee that functions themselves will be compiled and linked in the order you define them in the source code.
You can do all that stuff in assembly language, but not in C or C++.
2) So that above was the bad news. The good news is that the answer to the original question is, yes, there's a way (or a hack) to get the exact function size, but it comes with the following limitations:
It works in 64-bit executables on Windows only.
It is obviously Microsoft specific and is not portable.
You have to do this at run-time.
The concept is simple -- utilize the way SEH is implemented in x64 Windows binaries. Compiler adds details of each function into the PE32+ header (into the IMAGE_DIRECTORY_ENTRY_EXCEPTION directory of the optional header) that you can use to obtain the exact function size. (In case you're wondering, this information is used for catching, handling and unwinding of exceptions in the __try/__except/__finally blocks.)
Here's a quick example:
//You will have to call this when your app initializes and then
//cache the size somewhere in the global variable because it will not
//change after the executable image is built.
size_t fn_size; //Will receive function size in bytes, or 0 if error
some_function_with_possibility_to_get_its_size_at_runtime(&fn_size);
and then:
#include <Windows.h>
//The function itself has to be defined for two types of a call:
// 1) when you call it just to get its size, and
// 2) for its normal operation
bool some_function_with_possibility_to_get_its_size_at_runtime(size_t* p_getSizeOnly = NULL)
{
//This input parameter will define what we want to do:
if(!p_getSizeOnly)
{
//Do this function's normal work
//...
return true;
}
else
{
//Get this function size
//INFO: Works only in 64-bit builds on Windows!
size_t nFnSz = 0;
//One of the reasons why we have to do this at run-time is
//so that we can get the address of a byte inside
//the function body... we'll get it as this thread context:
CONTEXT context = {0};
RtlCaptureContext(&context);
DWORD64 ImgBase = 0;
RUNTIME_FUNCTION* pRTFn = RtlLookupFunctionEntry(context.Rip, &ImgBase, NULL);
if(pRTFn)
{
nFnSz = pRTFn->EndAddress - pRTFn->BeginAddress;
}
*p_getSizeOnly = nFnSz;
return false;
}
}
This can work in very limited scenarios. I use it in part of a code injection utility I wrote. I don't remember where I found the information, but I have the following (C++ in VS2005):
#pragma runtime_checks("", off)
static DWORD WINAPI InjectionProc(LPVOID lpvParameter)
{
// do something
return 0;
}
static DWORD WINAPI InjectionProcEnd()
{
return 0;
}
#pragma runtime_checks("", on)
And then in some other function I have:
size_t cbInjectionProc = (size_t)InjectionProcEnd - (size_t)InjectionProc;
You have to turn off some optimizations and declare the functions as static to get this to work; I don't recall the specifics. I don't know if this is an exact byte count, but it is close enough. The size is only that of the immediate function; it doesn't include any other functions that may be called by that function. Aside from extreme edge cases like this, "the size of a function" is meaningless and useless.
The real solution to this is to dig into your compiler's documentation. The ARM compiler we use can be made to produce an assembly dump (code.dis), from which it's fairly trivial to subtract the offsets between a given mangled function label and the next mangled function label.
I'm not certain which tools you will need for this with a windows target, however. It looks like the tools listed in the answer to this question might be what you're looking for.
Also note that I (working in the embedded space) assumed you were talking about post-compile-analysis. It still might be possible to examine these intermediate files programmatically as part of a build provided that:
The target function is in a different object
The build system has been taught the dependencies
You know for sure that the compiler will build these object files
Note that I'm not sure entirely WHY you want to know this information. I've needed it in the past to be sure that I can fit a particular chunk of code in a very particular place in memory. I have to admit I'm curious what purpose this would have on a more general desktop-OS target.
In C++, the there is no notion of function size. In addition to everything else mentioned, preprocessor macros also make for an indeterminate size. If you want to count number of instruction words, you can't do that in C++, because it doesn't exist until it's been compiled.
What do you mean "size of a function"?
If you mean a function pointer than it is always just 4 bytes for 32bits systems.
If you mean the size of the code than you should just disassemble generated code and find the entry point and closest ret call. One way to do it is to read the instruction pointer register at the beginning and at the end of your function.
If you want to figure out the number of instructions called in the average case for your function you can use profilers and divide the number of retired instructions on the number of calls.
I think it will work on windows programs created with msvc, as for branches the 'ret' seems to always come at the end (even if there are branches that return early it does a jne to go the end).
However you will need some kind of disassembler library to figure the current opcode length as they are variable length for x86. If you don't do this you'll run into false positives.
I would not be surprised if there are cases this doesn't catch.
There is no facilities in Standard C++ to obtain the size or length of a function.
See my answer here: Is it possible to load a function into some allocated memory and run it from there?
In general, knowing the size of a function is used in embedded systems when copying executable code from a read-only source (or a slow memory device, such as a serial Flash) into RAM. Desktop and other operating systems load functions into memory using other techniques, such as dynamic or shared libraries.
Just set PAGE_EXECUTE_READWRITE at the address where you got your function. Then read every byte. When you got byte "0xCC" it means that the end of function is actual_reading_address - 1.
Using GCC, not so hard at all.
void do_something(void) {
printf("%s!", "Hello your name is Cemetech");
do_something_END:
}
...
printf("size of function do_something: %i", (int)(&&do_something_END - (int)do_something));
below code the get the accurate function block size, it works fine with my test
runtime_checks disable _RTC_CheckEsp in debug mode
#pragma runtime_checks("", off)
DWORD __stdcall loadDll(char* pDllFullPath)
{
OutputDebugStringA(pDllFullPath);
//OutputDebugStringA("loadDll...................\r\n");
return 0;
//return test(pDllFullPath);
}
#pragma runtime_checks("", restore)
DWORD __stdcall getFuncSize_loadDll()
{
DWORD maxSize=(PBYTE)getFuncSize_loadDll-(PBYTE)loadDll;
PBYTE pTail=(PBYTE)getFuncSize_loadDll-1;
while(*pTail != 0xC2 && *pTail != 0xC3) --pTail;
if (*pTail==0xC2)
{ //0xC3 : ret
//0xC2 04 00 : ret 4
pTail +=3;
}
return pTail-(PBYTE)loadDll;
};
The non-portable, but API-based and correctly working approach is to use program database readers - like dbghelp.dll on Windows or readelf on Linux. The usage of those is only possible if debug info is enabled/present along with the program. Here's an example on how it works on Windows:
SYMBOL_INFO symbol = { };
symbol.SizeOfStruct = sizeof(SYMBOL_INFO);
// Implies, that the module is loaded into _dbg_session_handle, see ::SymInitialize & ::SymLoadModule64
::SymFromAddr(_dbg_session_handle, address, 0, &symbol);
You will get the size of the function in symbol.Size, but you may also need additional logic identifying whether the address given is a actually a function, a shim placed there by incremental linker or a DLL call thunk (same thing).
I guess somewhat similar can be done via readelf on Linux, but maybe you'll have to come up with the library on top of its sourcecode...
You must bear in mind that although disassembly-based approach is possible, you'll basically have to analyze a directed graph with endpoints in ret, halt, jmp (PROVIDED you have incremental linking enabled and you're able to read jmp-table to identify whether the jmp you're facing in function is internal to that function (missing in image's jmp-table) or external (present in that table; such jmps frequently occur as part of tail-call optimization on x64, as I know)), any calls that are meant to be nonret (like an exception generating helper), etc.
It's an old question but still...
For Windows x64, functions all have a function table, which contains the offset and the size of the function. https://learn.microsoft.com/en-us/windows/win32/debug/pe-format . This function table is used for unwinding when an exception is thrown.
That said, this doesn't contain information like inlining, and all the other issues that people already noted...
int GetFuncSizeX86(unsigned char* Func)
{
if (!Func)
{
printf("x86Helper : Function Ptr NULL\n");
return 0;
}
for (int count = 0; ; count++)
{
if (Func[count] == 0xC3)
{
unsigned char prevInstruc = *(Func - 1);
if (Func[1] == 0xCC // int3
|| prevInstruc == 0x5D// pop ebp
|| prevInstruc == 0x5B// pop ebx
|| prevInstruc == 0x5E// pop esi
|| prevInstruc == 0x5F// pop edi
|| prevInstruc == 0xCC// int3
|| prevInstruc == 0xC9)// leave
return count++;
}
}
}
you could use this assumming you are in x86 or x86_64

What does the PIC register (%ebx) do?

I have written a "dangerous" program in C++ that jumps back and forth from one stack frame to another. The goal is to be jump from the lowest level of a call stack to a caller, do something, and then jump back down again, each time skipping all the calls inbetween.
I do this by manually changing the stack base address (setting %ebp) and jumping to a label address. It totally works, with gcc and icc both, without any stack corruption at all. The day this worked was a cool day.
Now I'm taking the same program and re-writing it in C, and it doesn't work. Specifically, it doesn't work with gcc v4.0.1 (Mac OS). Once I jump to the new stack frame (with the stack base pointer set correctly), the following instructions execute, being just before a call to fprintf. The last instruction listed here crashes, dereferencing NULL:
lea 0x18b8(%ebx), %eax
mov (%eax), %eax
mov (%eax), %eax
I've done some debugging, and I've figured out that by setting the %ebx register manually when I switch stack frames (using a value I observed before leaving the function in the first place), I fix the bug. I've read that this register deals with "position independent code" in gcc.
What is position independent code? How does position independent code work? To what is this register pointing?
EBX points to the Global Offset Table. See this reference about PIC on i386. The link explains what PIC is an how EBX is used.
PIC is code that is relocated dynamically when it is loaded. Code that is non-PIC has jump and call addresses set at link time. PIC has a table that references all the places where such values exist, much like a .dll.
When the image is loaded, the loader will dynamically update those values. Other schemes reference a data value that defines a "base" and the target address is decided by performing calculations on the base. The base is usually set by the loader again.
Finally, other schemes use various trampolines that call to known relative offsets. The relative offsets contain code and/or data that are updated by a loader.
There are different reasons why different schemes are chosen. Some are fast when run, but slower to load. Some are fast to load, but have less runtime performance.