I am in a dilemma: which would be the better-performing option for the dispatch loop of a VM?
option 1 - force inlining of the instruction functions and use a computed goto instead of a switch, so that each label jumps straight into the (effectively inlined) code for that instruction... or...
option 2 - use a lookup array of function pointers, each pointing to a fastcall function, and the instruction determines the index.
Basically, which is better: a lookup table of jump addresses with inline code, or a lookup table of fastcall function addresses? Yes, I know, both are effectively just memory addresses and jumps back and forth, but I suspect fastcall may still push some data onto the stack if it runs out of register space, even when it is told to use registers for the parameters.
Compiler is GCC.
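For concreteness, here is roughly what I mean by the two options, boiled down to a made-up two-opcode bytecode. The opcode names and handler bodies are mine, for illustration only; the computed-goto form uses GCC's labels-as-values extension, and on 32-bit x86 the option-2 handlers could additionally be marked __attribute__((fastcall)):

#include <stdint.h>

enum { OP_HALT = 0, OP_INC = 1 };   /* made-up two-opcode instruction set */

/* option 1: computed goto (GCC labels-as-values); the handler body sits directly
   under each label, i.e. it is effectively inlined into the dispatch loop */
static int run_computed_goto(const uint8_t *ip) {
    static void *dispatch[] = { &&do_halt, &&do_inc };
    int acc = 0;
    goto *dispatch[*ip];
do_inc:
    acc++;
    ip++;
    goto *dispatch[*ip];
do_halt:
    return acc;
}

/* option 2: lookup array of function pointers indexed by the opcode */
typedef void (*handler_t)(int *acc);
static void h_halt(int *acc) { (void)acc; }
static void h_inc(int *acc)  { (*acc)++; }
static const handler_t handlers[] = { h_halt, h_inc };

static int run_function_table(const uint8_t *ip) {
    int acc = 0;
    while (*ip != OP_HALT) {
        handlers[*ip](&acc);
        ip++;
    }
    return acc;
}

Feeding either loop the byte sequence { OP_INC, OP_INC, OP_HALT } should return 2.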
I assume that with "virtual machine" you refer to a simulated processor executing some sort of bytecode, similar to the "Java virtual machine", and not a whole simulated computer that allows installation of another OS (like in VirtualBox/VMware).
My suggestion is to let the compiler make the decision about what has the best performance, and create a big traditional "switch" on the current item of the bytecode stream. This will likely result in a jump table created by the compiler, so it is as fast (or slow) as your computed-goto variant, but more portable.
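A minimal sketch of that suggestion, assuming the same made-up two-opcode bytecode as in the question above (at -O2 GCC will typically lower the switch to a jump table):

#include <stdint.h>

enum { OP_HALT = 0, OP_INC = 1 };   /* same illustrative opcodes as above */

static int run_switch(const uint8_t *ip) {
    int acc = 0;
    for (;;) {
        switch (*ip++) {            /* usually compiled into a jump table */
        case OP_INC:  acc++; break;
        case OP_HALT: return acc;
        default:      return -1;    /* unknown opcode */
        }
    }
}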
Your variant 2 - a lookup array of function pointers - is likely slower than inlined functions, as there is extra overhead with non-inlined functions, such as the handling of return values. After all, some of your VM-op functions (like "goto" or "set-register-to-immediate") have to modify the instruction pointer, while others don't need to.
Generally, calls through function pointers (or jumps via a jump table) are slow on current CPUs, as they are rarely predicted correctly by the branch predictor. So, if you think about optimizing your VM, try to find an instruction set that requires as few dispatch points as possible.
Would a large amount of stack space required by a function prevent it from being inlined? Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?
int stringyfunc(int args, char *svar);   // assumed external helper

int inlineme(int args) {
    char svar[10000];
    return stringyfunc(args, svar);
}
I'm more concerned about gcc, but icc and llvm would also be nice to know.
I know this isn't ideal, but I'm very curious. The code is probably also pretty bad for the cache.
Yes, the decision to inline or not depends on the complexity of the function, its stack and register usage, and the context in which the call is made. The rules are compiler- and target-platform-dependent. Always check the generated assembly when performance matters.
Compare this version, where a 10000-char array prevents it from being inlined (GCC 8.2, x64, -O2):
int stringyfunc(int args, char *svar);   // external helper, defined elsewhere

inline int inlineme(int args) {
    char svar[10000];
    return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}
Generated assembly:
inlineme(int):
sub rsp, 10008
mov rsi, rsp
call stringyfunc(int, char*)
add rsp, 10008
ret
test(int):
jmp inlineme(int)
with this one, where a much smaller 10-char array allows it to be inlined:
inline int inlineme(int args) {
    char svar[10];
    return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}
Generated assembly:
test(int):
sub rsp, 24
lea rsi, [rsp+6]
call stringyfunc(int, char*)
add rsp, 24
ret
Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?
Not necessarily in general. In fact, inline expansion can sometimes reduce stack space usage due to not having to set up space for function arguments.
Expanding a "wide" call into a single frame which calls other "wide" functions can be a problem though, and unless the optimiser guards against that separately, it may have to avoid expansion of "wide" functions in general.
In case of recursion: Most likely yes.
An example from the LLVM source:
if (IsCallerRecursive &&
AllocatedSize > InlineConstants::TotalAllocaSizeRecursiveCaller) {
InlineResult IR = "recursive and allocates too much stack space";
From the GCC source:
For stack growth limits we always base the growth in stack usage
of the callers. We want to prevent applications from segfaulting
on stack overflow when functions with huge stack frames gets
inlined.
Controlling the limits, from the GCC manual:
--param name=value
large-function-growth
Specifies maximal growth of large function caused by inlining in percents. For example, parameter value 100 limits large function growth to 2.0 times the original size.
large-stack-frame
The limit specifying large stack frames. While inlining the algorithm is trying to not grow past this limit too much.
large-stack-frame-growth
Specifies maximal growth of large stack frames caused by inlining in percents. For example, parameter value 1000 limits large stack frame growth to 11 times the original size.
Yes, partly because compilers do stack allocation for the whole function once in prologue/epilogue, not moving the stack pointer around as they enter/leave block scopes.
and each inlined call to inlineme() would need its own buffer.
No, I'm pretty sure compilers are smart enough to reuse the same stack space for different instances of the same function, because only one instance of that C variable can ever be in-scope at once.
Optimization after inlining can merge some of the operations of the inline function into calling code, but I think it would be rare for the compiler to end up with 2 versions of the array it wanted to keep around simultaneously.
I don't see why that would be a concern for inlineing. Can you give an example of how functions that require a lot of stack would be problematic to inline?
A real example of a problem it could create (which compiler heuristics mostly avoid):
Inlining if (rare_special_case) use_much_stack() into a recursive function that otherwise doesn't use much stack would be an obvious problem for performance (more cache and TLB misses), and even correctness if you recurse deep enough to actually overflow the stack.
(Especially in a constrained environment like Linux kernel stacks, typically 8kiB or 16kiB per thread, up from 4k on 32-bit platforms in older Linux versions. https://elinux.org/Kernel_Small_Stacks has some info and historical quotes about trying to get away with 4k stacks so the kernel didn't have to find 2 contiguous physical pages per task).
Compilers normally make functions allocate all the stack space they'll ever need up front (except for VLAs and alloca). Inlining an error-handling or special-case handling function instead of calling it in the rare case where it's needed will put a large stack allocation (and often save/restore of more call-preserved registers) in the main prologue/epilogue, where it affects the fast path, too. Especially if the fast path didn't make any other function calls.
If you don't inline the handler, that stack space will never be used if there aren't errors (or the special case didn't happen). So the fast-path can be faster, with fewer push/pop instructions and not allocating any big buffers before going on to call another function. (Even if the function itself isn't actually recursive, having this happen in multiple functions in a deep call tree could waste a lot of stack.)
I've read that the Linux kernel does manually do this optimization in a few key places where gcc's inlining heuristics make an unwanted decision to inline: break a function up into fast-path with a call to the slow path, and use __attribute__((noinline)) on the bigger slow-path function to make sure it doesn't inline.
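A hedged sketch of that manual split; the names and buffer size are made up for illustration, not taken from any real kernel code:

/* slow path: keeps the big buffer and extra saved registers out of the caller */
__attribute__((noinline)) static int handle_error_slow(int err) {
    char scratch[4096];              /* large buffer only the rare path needs */
    /* ... format a detailed report into scratch ... */
    (void)scratch;
    return -err;
}

/* fast path: small frame, no big allocation, slow path only reached on error */
static inline int do_work(int x) {
    if (__builtin_expect(x < 0, 0))  /* rare special case */
        return handle_error_slow(x);
    return x * 2;
}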
In some cases not doing a separate allocation inside a conditional block is a missed optimization, but more stack-pointer manipulation makes stack unwinding metadata to support exceptions (and backtraces) more bloated (especially saving/restoring of call-preserved registers that stack unwinding for exceptions has to restore).
If you were doing a save and/or allocate inside a conditional block before running some common code that's reached either way (with another branch to decide which registers to restore in the epilogue), then there'd be no way for the exception handler machinery to know whether to load just R12, or R13 as well (for example) from where this function saved them, without some kind of insanely complicated metadata format that could signal a register or memory location to be tested for some condition. The .eh_frame section in ELF executables / libraries is bloated enough as is! (It's non-optional, BTW. The x86-64 System V ABI (for example) requires it even in code that doesn't support exceptions, or in C. In some ways that's good, because it means backtraces usually work, even though passing an exception back up through such a function would cause breakage.)
You can definitely adjust the stack pointer inside a conditional block, though. Code compiled for 32-bit x86 (with crappy stack-args calling conventions) can and does use push even inside conditional branches. So as long as you clean up the stack before leaving the block that allocated space, it's doable. That's not saving/restoring registers, just moving the stack pointer. (In functions built without a frame pointer, the unwind metadata has to record all such changes, because the stack pointer is the only reference for finding saved registers and the return address.)
I'm not sure exactly what the details are on why compiler can't / don't want to be smarter allocating large extra stack space only inside a block that uses it. Probably a good part of the problem is that their internals just aren't set up to be able to even look for this kind of optimization.
Related: Raymond Chen posted a blog about the PowerPC calling convention, and how there are specific requirements on function prologues / epilogues that make stack unwinding work. (And the rules imply / require the existence of a red zone below the stack pointer that's safe from async clobber. A few other calling conventions use red zones, like x86-64 System V, but Windows x64 doesn't. Raymond posted another blog about red zones)
I am implementing a simple VM, and currently I am using runtime arithmetic to calculate individual program object addresses as offsets from base pointers.
I asked a couple of questions on the subject today, but I seem to be going slowly nowhere.
I learned a couple of things though, from question one -
Object and struct member access and address offset calculation -
I learned that modern processors have virtual addressing capabilities, allowing memory offsets to be calculated without any additional cycles devoted to arithmetic.
And from question two - Are address offsets resolved during compile time in C/C++? - I learned that there is no guarantee for this happening when doing the offsets manually.
By now it should be clear that what I want to achieve is to take advantage of the virtual memory addressing features of the hardware and offload those from the runtime.
I am using GCC. As for the platform - I am developing on x86 in Windows, but since it is a VM I'd like to have it running efficiently on all platforms supported by GCC.
So ANY information on the subject is welcome and will be very appreciated.
Thanks in advance!
EDIT: Some overview of my program's code generation - during the design stage the program is built as a tree hierarchy, which is then recursively serialized into one continuous memory block, along with indexing the objects and calculating their offsets from the beginning of the program memory block.
EDIT 2: Here is some pseudo code of the VM:
switch (*instruction) {
case 1: call_fn1(*(instruction+1));
        instruction += 1 + sizeof(parameter1); break;
case 2: call_fn2(*(instruction+1), *(instruction+1+sizeof(parameter1)));
        instruction += 1 + sizeof(parameter1) + sizeof(parameter2); break;
case 3: instruction += *(instruction+1); break;
}
Case 1 is a function that takes one parameter, which is found immediately after the instruction, so it is passed as an offset of 1 byte from the instruction. The instruction pointer is incremented by 1 + the size of the first parameter to find the next instruction.
Case 2 is a function that takes two parameters, same as before, first parameter passed as 1 byte offset, second parameter passed as offset of 1 byte plus the size of the first parameter. The instruction pointer is then incremented by the size of the instruction plus sizes of both parameters.
Case 3 is a goto statement, the instruction pointer is incremented by an offset which immediately follows the goto instruction.
EDIT 3: To my understanding, the OS will provide each process with its own dedicated virtual address space. If so, does this mean the first address is always ... well, zero, so the offset from the first byte of the memory block is actually the very address of this element? If an address space is dedicated to every process, and I know the offset of my program memory block AND the offset of every program object from the first byte of the memory block, then are the object addresses resolved during compile time?
Problem is those offsets are not available during the compilation of the C code, they become known during the "compilation" phase and translation to bytecode. Does this mean there is no way to do object memory address calculation for "free"?
How is this done in Java, for example, where only the virtual machine is compiled to machine code? Does this mean the calculation of object addresses takes a performance penalty because of runtime arithmetic?
Here's an attempt to shed some light on how the linked questions and answers apply to this situation.
The answer to the first question mixes two different things, the first is the addressing modes in X86 instruction and the second is virtual-to-physical address mapping. The first is something that is done by compilers and the second is something that is (typically) set up by the operating system. In your case you should only be worrying about the first.
Instructions in X86 assembly have great flexibility in how they access a memory address. Instructions that read or write memory have the address calculated according to the following formula:
segment + base + index * size + offset
The segment portion of the address is almost always the default DS segment and can usually be ignored. The base portion is given by one of the general purpose registers or the stack pointer. The index part is given by one of the general purpose registers and the size is either 1, 2, 4, or 8. Finally the offset is a constant value embedded in the instruction. Each of these components is optional, but obviously at least one must be given.
This addressing capability is what is generally meant when talking about computing addresses without explicit arithmetic instructions. There is also a special instruction that one of the commenters mentioned: LEA, which does the address calculation but, instead of reading or writing memory, stores the computed address in a register.
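As a small illustration (my own example, not from the question): an ordinary indexed access in C typically compiles to a single mov that uses the base + index * size + offset form, with no separate arithmetic instructions:

/* likely becomes something like: mov eax, DWORD PTR [rdi + rsi*4 + 8] */
int third_after(const int *base, long i) {
    return base[i + 2];
}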
For the code you included in the question, it is quite plausible that the compiler would use these addressing modes to avoid explicit arithmetic instructions.
As an example, the current value of the instruction variable could be held in the ESI register. Additionally, each of sizeof(parameter1) and sizeof(parameter2) is a compile-time constant. In the standard X86 calling conventions function arguments are pushed in reverse order (so the first argument is at the top of the stack), so the assembly code might look something like
case1:
PUSH [ESI+1]
CALL fn1
ADD ESP,4 ; drop arguments from stack
ADD ESI,5
JMP end_switch
case2:
PUSH [ESI+5]
PUSH [ESI+1]
CALL fn2
ADD ESP,8 ; drop arguments from stack
ADD ESI,9
JMP end_switch
case3:
MOV ESI,[ESI+1]
JMP end_switch
end_switch:
this is assuming that the size of both parameters is 4 bytes. Of course the actual code is up to the compiler, and it is reasonable to expect that the compiler will output fairly efficient code as long as you ask for some level of optimization.
You have a data item X in the VM, at relative address A, and an instruction that says (for instance) push X, is that right? And you want to be able to execute this instruction without having to add A to the base address of the VM's data area.
I have written a VM that solves this problem by mapping the VM's data area to a fixed Virtual Address. The compiler knows this Virtual Address, and so can adjust A at compile time. Would this solution work for you? Can you change the compiler yourself?
My VM runs on a smart card, and I have complete control over the OS, so it's a very different environment from yours. But Windows does have some facilities for allocating memory at a fixed address -- the VirtualAlloc function, for instance. You might like to try this out. If you do try it out, you might find that Windows is allocating regions that clash with your fixed-address data area, so you will probably have to load by hand any DLLs that you use, after you have allocated the VM's data area.
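A minimal sketch of that experiment, using an arbitrary illustrative base address (VirtualAlloc simply returns NULL if something else already occupies the requested range):

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* try to reserve and commit 16 MiB of VM data area at a fixed address */
    void *base = VirtualAlloc((LPVOID)0x40000000, 16 * 1024 * 1024,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!base) {
        printf("fixed-address allocation failed: %lu\n", GetLastError());
        return 1;
    }
    /* ... load the serialized program block at the known base ... */
    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}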
But there will probably be unforeseen problems to overcome, and it might not be worth the trouble.
Playing with virtual address translation, page tables or TLBs is something that can only be done at the OS kernel level, and is unportable between platforms and processor families. Furthermore hardware address translation on most CPU ISAs is usually supported only at the level of certain page sizes.
To answer my own question, based on the many responses I got.
Turns out what I want to achieve is not really possible in my situation; getting memory address calculations for free is attainable only when specific requirements are met, and it requires compilation to machine-specific instructions.
I am developing a visual, Lego-style drag-and-drop programming environment for educational purposes, which relies on a simple VM to execute the program code. I was hoping to maximize performance, but it is just not possible in my scenario. It is not that big of a deal though, because program elements can also generate their C code equivalent, which can then be compiled conventionally to maximize performance.
Thanks to everyone who responded and clarified a matter that wasn't really clear to me!
In C++, what is a good heuristic for estimating the compute-time benefit of inlining a function, particularly when the function is called very frequently and accounts for >= 10% of the program's execution time (e.g. the evaluation function of a brute-force or stochastic optimization process)? Even though inlining may ultimately be beyond my control, I am still curious.
There is no general answer. It depends on the hardware, the number and type of its arguments, and what is done in the function. And how often it is called, and where. On a Sparc, for example, arguments (and the return value) are passed in registers, and each function gets 16 new registers: if the function is complex enough, those new registers may avoid spilling that would occur if the function were inlined, and the non-inline version may end up faster than the inlined one. On an Intel, which is register poor, and passes arguments in registers, just the opposite might be true, for the same function in the same program. More generally, inlining may increase program size, reducing locality. Or for very simple functions, it may reduce program size; but that again depends on the architecture. The only possible way to know is to try both, measuring the time. And even then you'll only know for that particular program, on that particular hardware.
A function call and return on some architectures take as few as one instruction each (although they're generally not RISC-like single-cycle instructions.) In general, you can compare that to the number of cycles represented by the body of the function. A simple property access might be only a single instruction, and so putting it into a non-inlined function will triple the number of instructions to execute it -- obviously a great candidate for inlining. On the other hand, a function that formats a string for printing might represent hundreds of instructions, so two more isn't going to make any difference at all.
If your bottleneck is in a recursive function, and assuming that the level of recursion is not minimal (i.e. average recursion is not just a few levels), you are better off in working with the algorithm in the function rather than with inlining.
Try, if possible, to transform the recursion into a loop or tail-recursion (that can be implicitly transformed into a loop by the compiler), or try to determine where in the function the cost is being spent. Try to minimize the impact of the internal operations (maybe you are dynamically allocating memory that could have auto storage duration, or maybe you can factor a common operation to be performed external to the function in a wrapper and passed in as an extra argument,...)
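A minimal sketch of the recursion-to-loop rewrite suggested above, using a made-up sum-over-list function:

typedef struct node { int value; struct node *next; } node;

/* recursive version: one stack frame per element */
int sum_rec(const node *n) {
    return n ? n->value + sum_rec(n->next) : 0;
}

/* iterative version: constant stack usage, same result */
int sum_iter(const node *n) {
    int total = 0;
    for (; n; n = n->next)
        total += n->value;
    return total;
}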
*EDIT after the comment that recursion was not intended, but rather iteration*
If the compiler has access to the definition of the function, it will make the right decision for you in most cases. If it does not have access to the definition, just move the code around so that it does see it. Maybe make the function a static function to provide an extra hint that it won't be used anywhere else, or even mark it as inline (knowing that this will not force inlining), but avoid using special attributes that will force inlining, as the compiler probably does it better than any simple heuristic that can be produced without looking at the code.
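A small sketch of those hints with a made-up evaluation function (the name is illustrative only): keep the definition visible in the translation unit (or a header), mark it static so the compiler knows there are no external users, and add inline purely as a hint:

#include <stddef.h>

static inline double evaluate(const double *v, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += v[i] * v[i];
    return acc;
}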
All inlining saves you is the entry/exit cost of the function, so it's only worth considering if the function does almost nothing.
Certainly if the function itself contains a function call, it's probably not worth considering.
Even if the function does very little, it has to be called so much that it owns the program counter a significant percent of the time, before any speedup of the function would be noticeable.
The behaviour here is somewhat compiler-dependent. With a recursive function, inlining could obviously in theory go on forever. The 'inline' keyword is only a hint to the compiler; it can choose to ignore it if it can't do anything with it. Some compilers will inline the recursive function to a certain depth.
As for the 'how much will this speed things up' - unfortunately we can't provide any sort of answer to that question as 'it depends' - how much work is the function doing vs the overhead of the function call mechanism itself. Why don't you set up a test and see?
Our experience, 20+ years of writing computationally intensive C++, is that inlining is no silver bullet. You really do need to profile your code to see whether inlining will increase performance. For us except for low level 2D and 3D point and vector manipulations inlining is a waste of time. You are far better off working out a better algorithm than trying to micromanage clock ticks.
I have divided the whole question into smaller ones:
What different algorithms is GDB capable of using to reconstruct stack traces?
How does each stack-trace reconstruction algorithm work at a high level? What are its advantages and disadvantages?
What meta-information does the compiler need to provide in the program for each stack-trace reconstruction algorithm to work?
And what are the corresponding g++ compiler switches that enable/disable each particular algorithm?
Speaking in pseudocode, you could call the stack "an array of packed stack frames", where every stack frame is a data structure of variable size that you could express like:
#include <cstddef>
#include <cstdint>

template <std::size_t N> struct stackframe {
    std::uintptr_t contents[N];
#ifndef OMIT_FRAME_POINTER
    void *nextfp;    // really a stackframe<M> * for the caller's own (unknown) M
#endif
    void *retaddr;
};
Problem is that every function has a different <N> - frame sizes vary.
The compiler knows frame sizes, and if creating debugging information will usually emit these as part of that. All the debugger then needs to do is to locate the last program counter, look up the function in the symbol table, then use that name to look up the framesize in the debugging information. Add that to the stackpointer and you get to the beginning of the next frame.
If using this method you don't require frame linkage, and backtracing will work just fine even if you use -fomit-frame-pointer. On the other hand, if you have frame linkage, then iterating the stack is just following a linked list - because every framepointer in a new stackframe is initialized by the function prologue code to point to the previous one.
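A minimal sketch of that linked-list walk, assuming x86-64 frame layout and code built without -fomit-frame-pointer (so every frame stores the caller's frame pointer next to the return address):

#include <stdio.h>

static void walk_frames(void) {
    /* __builtin_frame_address is a GCC builtin; with linked frames, fp[0] is the
       caller's saved frame pointer and fp[1] is the return address */
    void **fp = (void **)__builtin_frame_address(0);
    while (fp) {
        printf("frame at %p, return address %p\n", (void *)fp, fp[1]);
        void **next = (void **)fp[0];
        if (next <= fp)      /* the chain must move towards higher addresses */
            break;
        fp = next;
    }
}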
If you have neither frame size information nor framepointers, but still a symbol table, then you can also perform backtracing by a bit of reverse engineering to calculate the framesizes from the actual binary. Start with the program counter, look up the function it belongs to in the symbol table, and then disassemble the function from the start. Isolate all operations between the beginning of the function and the program counter that actually modify the stackpointer (write anything to the stack and/or allocate stackspace). That calculates the frame size for the current function, so subtract that from the stackpointer, and you should (on most architectures) find the last word written to the stack before the function was entered - which is usually the return address into the caller. Re-iterate as necessary.
Finally, you can perform a heuristic analysis of the contents of the stack - isolate all words in the stack that are within executably-mapped segments of the process address space (and thereby could be function offsets aka return addresses), and play a what-if game: look up the memory, disassemble the instruction there and see if it actually is some sort of call instruction, if so whether it really called the 'next' candidate address, and whether you can construct an uninterrupted call sequence from that. This works to a degree even if the binary is completely stripped (although all you could get in that case is a list of return addresses). I don't think GDB employs this technique, but some embedded low-level debuggers do. On x86, due to the varying instruction lengths, this is terribly difficult to do because you can't easily "step back" through an instruction stream, but on RISC, where instruction lengths are fixed (e.g. on ARM), this is much simpler.
There are some holes that make simple or even complex/exhaustive implementations of these algorithms fall over sometimes, like tail-recursive functions, inlined code, and so on. The GDB source code might give you some more ideas:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/frame.c
GDB employs a variety of such techniques.
Can someone explain the mechanics of a jump table and why it would be needed in embedded systems?
A jump table can be either an array of pointers to functions or an array of machine code jump instructions. If you have a relatively static set of functions (such as system calls or virtual functions for a class) then you can create this table once and call the functions using a simple index into the array. This would mean retrieving the pointer and calling a function or jumping to the machine code depending on the type of table used.
The benefits of doing this in embedded programming are:
Indexes are more memory efficient than machine code or pointers, so there is a potential for memory savings in constrained environments.
For any particular function the index will remain stable and changing the function merely requires swapping out the function pointer.
It does cost you a tiny bit of performance for accessing the table, but this is no worse than any other virtual function call.
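A minimal sketch of the function-pointer flavour; the handler names are made up, and a real system would validate the index before using it, as shown:

#include <stdio.h>

typedef void (*handler_t)(void);

static void op_led_on(void)  { puts("led on");  }
static void op_led_off(void) { puts("led off"); }

static const handler_t table[] = { op_led_on, op_led_off };

static void dispatch(unsigned idx) {
    if (idx < sizeof table / sizeof table[0])   /* validate before indexing */
        table[idx]();
}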
A jump table, also known as a branch table, is a series of instructions, all unconditionally branching to another point in code.
You can think of them as a switch (or select) statement where all the cases are filled:
MyJump(int c)
{
    switch (c)
    {
    case 0:
        goto func0label;
    case 1:
        goto func1label;
    case 2:
        goto func2label;
    }
}
Note that there's no return - the code that it jumps to will execute the return, and it will jump back to wherever MyJump was called from.
This is useful for state machines where you execute certain code based on the state variable. There are many, many other uses, but this is one of the main uses.
It's used where you don't want to waste time fiddling with the stack, and want to save code space. It is especially of use in interrupt handlers where speed is extremely important, and the peripheral that caused the interrupt is only known by a single variable. This is similar to the vector table in processors with interrupt controllers.
One use would be taking a $0.60 microcontroller and generating a composite (TV) signal for video applications. The micro isn't powerful - in fact it's just barely fast enough to write each scan line. A jump table would be used to draw characters, because it would take too long to load a bitmap from memory and use a for() loop to shove the bitmap out. Instead there's a separate jump to the letter and scan line, and then 8 or so instructions that actually write the data directly to the port.
-Adam
Jump tables are commonly (but not exclusively) used in finite state machines to make them data driven.
Instead of nested switch/case
switch (state) {
case A:
    switch (event) {
    case e1: /* ... */ break;
    case e2: /* ... */ break;
    }
    break;
case B:
    switch (event) {
    case e3: /* ... */ break;
    case e1: /* ... */ break;
    }
    break;
}
you can make a 2D array of function pointers and just call handleEvent[state][event]().
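A minimal sketch of that table with made-up state/event names; C99 designated initializers keep the layout readable:

typedef enum { STATE_A, STATE_B, NUM_STATES } state_t;
typedef enum { EVENT_E1, EVENT_E2, EVENT_E3, NUM_EVENTS } event_t;
typedef void (*handler_t)(void);

static void on_a_e1(void) { /* handle e1 while in state A */ }
static void noop(void)    { /* ignore the event */ }

static const handler_t handleEvent[NUM_STATES][NUM_EVENTS] = {
    [STATE_A] = { [EVENT_E1] = on_a_e1, [EVENT_E2] = noop, [EVENT_E3] = noop },
    [STATE_B] = { [EVENT_E1] = noop,    [EVENT_E2] = noop, [EVENT_E3] = noop },
};

/* usage: handleEvent[state][event](); */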
From Wikipedia:
In computer programming, a branch table (sometimes known as a jump table) is a term used to describe an efficient method of transferring program control (branching) to another part of a program (or a different program that may have been dynamically loaded) using a table of branch instructions. The branch table construction is commonly used when programming in assembly language but may also be generated by a compiler.
A branch table consists of a serial list of unconditional branch instructions that is branched into using an offset created by multiplying a sequential index by the instruction length (the number of bytes in memory occupied by each branch instruction). It makes use of the fact that machine code instructions for branching have a fixed length and can be executed extremely efficiently by most hardware, and is most useful when dealing with raw data values that may be easily converted to sequential index values. Given such data, a branch table can be extremely efficient; it usually consists of the following steps: optionally validating the input data to ensure it is acceptable; transforming the data into an offset into the branch table, which usually involves multiplying or shifting it to take into account the instruction length; and branching to an address made up of the base of the table and the generated offset, which often involves an addition of the offset onto the program counter register.
A jump table is described here, but briefly, it's an array of addresses the CPU should jump to based on certain conditions. As an example, a C switch statement is often implemented as a jump table where each jump entry will go to a particular "case" label.
In embedded systems, where memory usage is at a premium, many constructs are better served by using a jump table instead of more memory-intensive methods (like a massive if-else-if).
Wikipedia sums it up pretty well:
In computer programming, a branch table (sometimes known as a jump table) is a term used to describe an efficient method of transferring program control (branching) to another part of a program (or a different program that may have been dynamically loaded) using a table of branch instructions. The branch table construction is commonly used when programming in assembly language but may also be generated by a compiler.
... Use of branch tables and other raw data encoding was common in the early days of computing when memory was expensive, CPUs were slower and compact data representation and efficient choice of alternatives were important. Nowadays, they are commonly used in embedded programming and operating system development.
In other words, it's a useful construct to use when your system is extremely memory and/or CPU limited, as is often the case in an embedded platform.
Jump tables, more often known as branch tables, are usually used only by the machine.
The compiler creates a list of all labels in an assembly program and links all labels to a memory location. A jump table is pretty much a reference card to where a function, variable, or whatever the label may be, is stored in memory.
So as a function executes, on finishing it jumps back to its previous memory location or jumps to the next function, etc.
And if you're talking about what I think you are, you don't just need them in embedded systems but in any type of compiled/interpreted environment.
Brian Gianforcaro