Optimal virtual machine/byte-code interpreter loop - c++

My project has a VM that executes a byte-code compiled from a domain-specific-language. I'm looking at ways that I can improve the execution time of the byte-code. As a first step I'd like to see if there is a way to simply improve the byte-code interpreter before I venture into machine code compilation.
The main loop of the interpreter looks like this:
while(true)
{
uint8_t cmd = *code++;
switch( cmd )
{
case op_1: ...; break;
...
}
}
QUESTION: Is there a faster way to implement this loop without resorting to assembler?
The one option I see is GCC-specific: using a dynamic goto with label addresses. Rather than a break at the end of each case, I could jump directly to the next instruction. I had hoped the optimizer would do this for me, but looking at the disassembly it apparently doesn't: there is a repeated constant jump at the end of most opcodes.
If relevant, the VM is a simple register-based machine with floating-point and integer registers (8 of each). There is no stack, only a global heap (the language is not that complicated).
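For reference, the label-address idea described above might look roughly like the following. This is only a sketch under stated assumptions: it relies on the GCC "Labels as Values" extension (Clang accepts it too), and the opcodes OP_LOADI, OP_ADD, OP_HALT, their encoding, and the function run are invented for illustration and are not the real VM's instruction set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opcodes, made up for this sketch.
enum : std::uint8_t { OP_LOADI, OP_ADD, OP_HALT };

// Runs a tiny register program and returns integer register 0.
// Assumed encoding: LOADI reg imm8 | ADD dst src | HALT
std::int64_t run(const std::uint8_t* code) {
    // GCC/Clang extension: &&label yields the address of a label.
    // The table is indexed by opcode, so its order must match the enum.
    static void* const dispatch[] = { &&do_loadi, &&do_add, &&do_halt };
    std::int64_t ireg[8] = {};

    // Every handler ends with its own indirect jump to the next handler,
    // instead of sharing one jump back to the top of a loop.
    // No opcode range check here: the bytecode is assumed to be trusted.
    #define NEXT() goto *dispatch[*code++]
    NEXT();

do_loadi: {
        std::uint8_t r = *code++;
        assert(r < 8);
        ireg[r] = *code++;
    }
    NEXT();
do_add: {
        std::uint8_t dst = *code++;
        std::uint8_t src = *code++;
        assert(dst < 8 && src < 8);
        ireg[dst] += ireg[src];
    }
    NEXT();
do_halt:
    #undef NEXT
    return ireg[0];
}
```

Called with the program { OP_LOADI, 0, 5, OP_LOADI, 1, 7, OP_ADD, 0, 1, OP_HALT }, run loads 5 and 7 into registers 0 and 1 and adds them. Whether this beats the plain switch depends on the compiler and CPU, so it needs to be measured.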

One very easy optimisation is that instead of
switch /case/case/case/case/case,
just define an array with function pointers (where each function would process a specified command, or a couple of commands in which case you could set several entries in the array to the same function, and the function itself could check the exact code), and instead of
switch(cmd)
just do a
array[cmd]()
This assumes that you don't have too many commands. Also, add a range check if you will not define all the possible commands (maybe you only have 300 commands but need 2 bytes to represent them; instead of defining an array with 65536 items, just check that the command is less than 301 and, if it's not, don't do the lookup).
If you don't do that, at least order the cases so that the most frequently used commands come first in the switch statement.
Otherwise you could look into hash tables, but I assume you don't have that many commands, in which case the overhead of a hash function would probably cost you more than the switch it replaces (unless the hash function is VERY simple).
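A minimal sketch of the function-pointer array with the range check, assuming a made-up VM struct and three invented opcodes (none of these names come from the question):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical VM state, invented for this sketch.
struct VM {
    const std::uint8_t* code;
    std::int64_t acc;
    bool running;
};

using Handler = void (*)(VM&);

void op_inc(VM& vm)  { ++vm.acc; }
void op_dec(VM& vm)  { --vm.acc; }
void op_halt(VM& vm) { vm.running = false; }

// One entry per command; the command value is the array index.
constexpr Handler handlers[] = { op_inc, op_dec, op_halt };
constexpr std::size_t handler_count = sizeof(handlers) / sizeof(handlers[0]);

std::int64_t run(const std::uint8_t* code) {
    assert(code != nullptr);
    VM vm{code, 0, true};
    while (vm.running) {
        std::uint8_t cmd = *vm.code++;
        if (cmd >= handler_count) break;  // undefined command: skip the lookup
        handlers[cmd](vm);                // replaces switch(cmd)
    }
    return vm.acc;
}
```

Note that each command now costs an indirect call, so it is worth benchmarking against the switch before committing to this.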

What's the architecture? You may get a speed-up with word-aligned opcodes, but it'll blow out your code size, which means you'll have to balance it against the cost of a cache miss.

A few obvious optimizations I see:
If you don't use cmd anywhere other than in the switch(), use the pointer dereference directly: switch( *code++ ). For a long-running while(true) loop, this can help a little.
In the switch(), you can use continue instead of break. When continue is used inside an if/else or a switch, the compiler knows that execution has to jump to the enclosing loop; the same is not true for break (with respect to switch).
Hope this helps.

Related

How do jump-tables work?

In the following document, pages 4-5:
http://www.open-std.org/jtc1/sc22/wg21/docs/ESC_Boston_01_304_paper.pdf
typedef int (* jumpfnct)(void * param);
static int CaseError(void * param)
{
return -1;
}
static jumpfnct const jumptable[] =
{
CaseError, CaseError, ...
.
.
.
Case44, CaseError, ...
.
.
.
CaseError, Case255
};
result = index <= 0xFF ? jumptable[index](param) : -1;
it is comparing IF-ELSE vs SWITCH and then introduces this "Jump table". Apparently it is the fastest implementation of the three. What exactly is it? I cannot see how it could work??
The jumptable is a method of mapping some input integer to an action. It stems from the fact that you can use the input integer as the index of an array.
The code sets up an array of pointers to functions. Your input integer is then used to select one of these function pointers. Generally, it looks like it's going to be a pointer to the function CaseError. However, every now and again, it will be a different function that is being pointed to.
It's designed so that
jumptable[62] = Case62;
jumptable[95] = Case95;
jumptable[35] = Case35;
jumptable[34] = CaseError; /* For example... and so it goes on */
Thus, selecting the right function to call takes constant time. With the if-else chain and the switch, the time taken to select the correct function depends on the input integer, assuming the compiler doesn't optimize the switch into a jump table itself. If it's for embedded code, there's a chance that optimizations of this kind have been disabled; you'd have to check.
Once the correct function-pointer is found, the last line simply calls it:
result = index <= 0xFF ? jumptable[index](param) : -1;
becomes
result = index <= 0xFF /* Check that the index is within
the range of the jump table */
? jumptable[index](param) /* jumptable[index] selects a function
then it gets called with (param) */
: -1; /* If the index is out of range, set result to be -1
Personally, I think a better choice would be to call
CaseError(param) here */
jumpfnct is a pointer to a function. jumptable is an array that consists of a number of jumpfncts. The functions can be called just by referencing their position in the array.
For example, jumptable[0](param) will execute the first function, passing along param. jumptable[1](param) will execute the second function, etc.
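To make the table concrete, here is a small compilable sketch in the spirit of the paper's snippet. The bodies of Case44 and Case255, the helper make_table, and the wrapper dispatch are assumptions added for illustration; the paper elides them. Out-of-range indexes are routed to CaseError, as suggested in the comment above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using jumpfnct = int (*)(void* param);

int CaseError(void*) { return -1; }

// Hypothetical handlers: each adds its case number to the int behind param.
int Case44(void* param) {
    assert(param != nullptr);
    return *static_cast<int*>(param) + 44;
}
int Case255(void* param) {
    assert(param != nullptr);
    return *static_cast<int*>(param) + 255;
}

// Every slot defaults to CaseError; only the valid cases are overwritten.
std::array<jumpfnct, 256> make_table() {
    std::array<jumpfnct, 256> table;
    table.fill(CaseError);
    table[44] = Case44;
    table[255] = Case255;
    return table;
}

const std::array<jumpfnct, 256> jumptable = make_table();

// Constant-time selection: one range check, one indexed load, one call.
int dispatch(std::size_t index, void* param) {
    return index < jumptable.size() ? jumptable[index](param)
                                    : CaseError(param);
}
```

With int v = 1, dispatch(44, &v) reaches Case44, while dispatch(34, &v) and dispatch(1000, &v) both end up in CaseError.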
If you don't know about function pointers, you shouldn't use this trick. They're very handy, in a narrow domain.
It's very fast and space efficient, when what you're doing is switching between a large number of similar function calls. You are adding a function call overhead that a switch statement doesn't necessarily have, so it might not be appropriate in all circumstances. If your code is something like this:
switch(x) {
case 1:
function1();
break;
case 2:
function2();
break;
...
}
A jump table might be a good substitution. If, though, your switch is something like this:
switch(x) {
case 1:
y++;
break;
case 1023:
y--;
break;
...
}
It probably wouldn't be worth doing.
I've used them in a toy FORTH language interpreter, where they were invaluable, but in most cases you're not going to see a speed benefit that makes them worth using. Use them if it makes the logic of your program clearer, not for optimization.
This jumptable returns a pointer-to-function by its index. You define this table in a way that invalid indexes point to the function that returns some invalid code (like -1 in the example) and valid indexes point to the functions you need to call.
Construction
jumptable[index]
returns pointer-to-function and this function gets called
jumptable[index](param)
where param is some custom parameter.
A Jump-Table is an obvious, but rarely used optimization, that for some reason seems to have fallen out of favor.
Briefly, instead of testing a value and exiting out of a switch/case or if-else block to branch to function or code path, you create an array which is filled with the addresses of the functions the program can branch to.
Once completed, this arrangement eliminates the relentless if testing that comes with if-else and switch/case blocks. The code uses the variable that would otherwise be tested with if as a subscript into the function-pointer array, and proceeds directly to the appropriate code, sans ANY if testing. A perfectly efficient branch. The assembly code should literally be a jump.
If you profile code and find a hot-spot where the program is spending a large % of its time, look to this kind of optimization to improve performance. A little bit of this can go a long way if it's part of a code's hot-spot.
Thanks for the link. Nice find!
As mentioned in the comment above, whether this solution is more or less efficient than, for example, a switch statement depends on the amount of work that needs to be done for each case.
Writing a regular switch statement for the values you want to process will definitely be a clearer way to see what the code does. So unless either space or speed requirements dictate a more sophisticated solution, I would suggest that this is not a "better" solution.
Tables of function pointers are, however, an efficient and good way to solve certain problems. I use function pointers in a table quite regularly to do things like "Benchmark 11 different solutions to a problem", where I have a struct with the name of the function and the function, and some parameters perhaps. Then I have one function to time and loop over the code a few million times (or whatever it takes to get a long enough measurement to make sense).
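A rough sketch of that benchmarking table, assuming two made-up candidate functions (count_index and count_find, both counting commas) and hypothetical helper names time_candidate and run_all:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

// Two hypothetical solutions to the same problem: counting commas.
std::size_t count_index(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i != s.size(); ++i)
        if (s[i] == ',') ++n;
    return n;
}
std::size_t count_find(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t pos = s.find(','); pos != std::string::npos;
         pos = s.find(',', pos + 1))
        ++n;
    return n;
}

// The table entry: a display name plus the function to call.
struct Candidate {
    const char* name;
    std::size_t (*fn)(const std::string&);
};

constexpr Candidate candidates[] = {
    { "index loop", count_index },
    { "find loop",  count_find  },
};

// Calls one candidate reps times and returns the elapsed seconds.
// The result of the last call is stored in out so it can be checked.
double time_candidate(const Candidate& c, const std::string& input,
                      int reps, std::size_t& out) {
    assert(reps > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) out = c.fn(input);
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// One loop drives every entry in the table.
void run_all(const std::string& input, int reps) {
    for (const Candidate& c : candidates) {
        std::size_t result = 0;
        double secs = time_candidate(c, input, reps, result);
        std::printf("%-12s result=%zu  %.6f s\n", c.name, result, secs);
    }
}
```

Adding a twelfth solution is then one more line in the table. In a real measurement you would also want to keep the optimizer from discarding the calls, which this sketch does not attempt.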

Switch statement with huge number of cases

What happens if a switch has more than 5000 cases? What are the drawbacks, and how can we replace it with something faster?
Note: I am not expecting to use an array to store the cases, as it's the same.
There's no specific reason to think you'd want anything other than a switch/case statement (and indeed I'd actively expect it to be unhelpful). The compiler should create efficient dispatching code, which might involve some combination of static [sparse] table(s) and direct indexing, binary branching etc.; it's got insights into the static values of the cases and should do an excellent job (retuning it on the fly each time you change the cases, whereas new values that don't fit well with a hand-crafted approach - such as wildly differing values when you'd had a pretty packed array lookup - could require reworking of code or silently cause memory bloat or a performance drop).
People really cared about this kind of thing back when C was trying to win over hard-core assembly programmers... the compilers were held accountable for generating good code. Put another way - if it's not (measurably) broken, don't fix it.
More generally, it's great to be curious about this kind of thing and get people's ideas on alternatives and their performance implications, but if you really care and the performance difference could make a useful difference to your program (especially if profiling suggests it) then always benchmark with your program doing real work.
As food for thought... in case one might be stuck with an old/buggy/inefficient compiler or just love hacking.
The inner workings of a switch statement consist of two parts: finding the address to jump to and, well, jumping there. For the first part you need a table to find the address. As the number of cases increases, the table gets bigger and searching for the jump address takes time. This is the point compilers try to optimize by combining several techniques, but one easy approach is to index the table directly, which depends on the case value space.
As a back-of-the-napkin example:
switch (n) {
case 1: foo(); break;
case 2: bar(); break;
case 3: baz(); break;
}
With such a piece of code the compiler can create an array of jump addresses and get the address directly with array[n]. Now the search takes just O(1). But if you had a switch like the one below:
switch (n) {
case 10: foo(); break;
case 17: bar(); break;
case 23: baz(); break;
// and a lot other
}
the compiler needs to generate a table containing (case_id, jump_address) pairs and code to search through that structure, which with the worst implementation can take O(n). (Decent compilers optimize the hell out of such a scenario when they are fully unleashed by enabling their optimization flags, to a degree that when you need to debug such optimized code your brain starts to fry.)
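To illustrate the idea of a (case_id, jump_address) table, here is a hand-written sketch that searches a sorted table with std::lower_bound, so the lookup takes O(log n) rather than O(n). It uses function pointers in place of jump addresses, and the names Entry, table and dispatch are invented; it is not a description of what any particular compiler emits.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

using Action = int (*)();

// Hypothetical actions standing in for the case bodies.
int foo() { return 1; }
int bar() { return 2; }
int baz() { return 3; }

struct Entry {
    int case_id;
    Action target;
};

// Must stay sorted by case_id for the binary search below.
const Entry table[] = { {10, foo}, {17, bar}, {23, baz} };

static bool table_is_sorted() {
    return std::is_sorted(std::begin(table), std::end(table),
        [](const Entry& a, const Entry& b) { return a.case_id < b.case_id; });
}

// Returns the action's result, or -1 when n matches no case
// (the equivalent of falling through to "default").
int dispatch(int n) {
    assert(table_is_sorted());
    const Entry* it = std::lower_bound(
        std::begin(table), std::end(table), n,
        [](const Entry& e, int key) { return e.case_id < key; });
    if (it == std::end(table) || it->case_id != n) return -1;
    return it->target();
}
```

Here dispatch(17) calls bar, while dispatch(11) finds no matching entry and returns -1.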
Then the question is: can you do all this yourself at the C level to beat the compiler? The funny thing is that while creating tables and searching through them seems easy, jumping to a variable point using goto is not possible in standard C. So if you are not going to use function pointers, due to overhead or code structure, you are stuck... well, unless you are using GCC. GCC has a non-standard feature called Labels as Values which lets you get pointers to labels.
To complete the example you can write the second switch statement with "labels as values" feature like this:
const void *cases[] = {&&case_foo, &&case_bar, &&case_baz, ....};
goto *cases[n];
case_foo:
foo();
goto switch_end;
case_bar:
bar();
goto switch_end;
case_baz:
baz();
goto switch_end;
// and a lot other
switch_end:
Of course, if you are talking about 5000 cases, it is much better to write a piece of code that generates this code for you, and that is probably the only way to maintain such software.
As closing notes: will this improve your daily work? No. Will this improve your skills? Yes, and speaking from experience, I once found myself improving a security algorithm in a smart card just by optimizing case values. It is a strange world.
Try to use Dictionary class with Delegate values. At least it makes code a little bit more readable.
A big switch statement, typically an auto-generated one, may take a long time to compile. But I like the idea that the compiler optimizes the switch statement.
One way to break apart the switch statement is to use bucketing:
int getIt(int input)
{
int bucket = input%16;
switch(bucket)
{
case 1:
return getItBucket1(input);
case 2:
return getItBucket2(input);
...
...
}
return -1;
}
So in the code above, we broke our switch statement into 16 parts. It is easy to change the number of buckets in auto-generated code.
This code adds the run-time cost of one layer of indirection (a function call). But if the buckets are defined in different files, they can be compiled in parallel, which is faster.

Text iteration, Assembly versus C++

I am making a program which frequently reads chunks of text received from the web, looking for specific characters and parsing the data accordingly. I am becoming fairly skilled with C++ and have made it work well; however, is Assembly going to be faster than a loop such as
for(size_t len = 0;len != tstring.length();len++) {
if(tstring[len] == ',')
stuff();
}
Would an inline-assembly routine using cmp and jz/jnz be faster? I don't want to waste my time working with asm just to be able to say I used it, only for true speed purposes.
Thank you,
No way. Your loop is so simple, the cost of the optimizer losing the ability to reason about your code is going to be way higher than any performance you could gain. This isn't SSE intrinsics or a bootloader, it's a trivial loop.
An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:
you're retrieving tstring.length() once per loop iteration; that's unnecessary.
you're using random indexing, tstring[len] which might be a more-expensive operation than using a forward iterator.
you're calling stuff() during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whatever stuff() does), and only afterwards iterate over those results.
There's already a standard library function that is likely to be optimized at a low level, strchr(), for exactly that kind of scanning. The C++ standard library's std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
In particular, strchr() has SSE2 (using pcmpeqb, maskmov... and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).
My advice: Check your library implementation / documentation, and/or the actual generated assembly code for your program. You might well be using fast code already ... or would be if you'd switch from your hand-grown character-by-character simple search to just using std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...
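Putting two of the suggestions above together (let the library do the scanning, and collect the hit positions before calling stuff()), a sketch could look like this. It swaps in std::memchr for strchr(), since a std::string knows its length and may contain embedded NUL bytes; the function name find_all is made up for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Pass 1: collect the offset of every occurrence of needle in text.
// The per-character search is delegated to std::memchr, which standard
// libraries commonly implement with vectorized code.
std::vector<std::size_t> find_all(const std::string& text, char needle) {
    std::vector<std::size_t> hits;
    const char* base = text.data();
    const char* end = base + text.size();
    const char* p = base;
    while (p < end) {
        const void* match =
            std::memchr(p, needle, static_cast<std::size_t>(end - p));
        if (match == nullptr) break;          // no further occurrence
        p = static_cast<const char*>(match);
        hits.push_back(static_cast<std::size_t>(p - base));
        ++p;                                  // resume after this hit
    }
    return hits;
}
```

Pass 2 is then a plain loop over the returned offsets that calls stuff() for each one. For the string "xxx,xxxxx,x,xxxx" the offsets are 3, 9 and 11. Whether the two-pass split pays off depends on what stuff() does, so benchmark both.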
Checking characters one by one is not the fastest thing to do. Maybe you should try something like this and find out if it's faster.
string s("xxx,xxxxx,x,xxxx");
string::size_type pos = s.find(',');
while(pos != string::npos){
do_stuff(pos);
pos = s.find(',', pos+1);
}
Each iteration of the loop gives you the position of the next ',' character, so the program needs only a few iterations to finish the job.
Would an inline-assembly routine using cmp and jz/jnz be faster?
Maybe, maybe not. It depends upon what stuff() does, what the type and scope of tstring is, and what your assembly looks like.
First, measure the speed of the maintainable C++ code. Only if this loop dominates your program's speed should you consider rewriting it.
If you choose to rewrite it, keep both implementations available, and comparatively measure them. Only use the less maintainable version if it is faster, and if the speed increase matters. Also, since you have the original version in place, future readers will be able to understand your intent even if they don't know asm that well.

How to print result of C++ evaluation with GDB?

I've been looking around but was unable to figure out how one could print out in GDB the result of an evaluation. For example, in the code below:
if (strcmp(current_node->word,min_node->word) > 0)
min_node = current_node;
(above I was trying out a possible method for checking alphabetical order for strings, and wasn't absolutely certain it works correctly.)
Now I could watch min_node and see if the value changes but in more involved code this is sometimes more complicated. I am wondering if there is a simple way to watch the evaluation of a test on the line where GDB / program flow currently is.
There is no expression-level single stepping in gdb, if that's what you are asking for.
Your options are (from most commonly to most infrequently used):
evaluate the expression in gdb, doing print strcmp(current_node->word,min_node->word). Surprisingly, this works: gdb can evaluate function calls, by injecting code into the running program and having it execute the code. Of course, this is fairly dangerous if the functions have side effects or may crash; in this case, it is so harmless that people typically won't think about potential problems.
perform instruction-level (assembly) single-stepping (ni/si). When the call instruction is done, you find the result in a register, according to the processor conventions (%eax on x86).
edit the code to assign intermediate values to variables, and split that into separate lines/statements; then use regular single-stepping and inspect the variables.
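As an illustration of the last option, the comparison can be pulled out into its own variable so that, after stepping over that line with next, print cmp shows the result. The Node type and the find_min function are assumptions made up for this sketch. Note that strcmp returns a negative value when its first argument sorts first, so this sketch tests cmp < 0 to track the alphabetically smallest word.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical list node, invented for this sketch.
struct Node {
    const char* word;
    Node* next;
};

// Returns the node whose word sorts first alphabetically.
const Node* find_min(const Node* head) {
    assert(head != nullptr);
    const Node* min_node = head;
    for (const Node* current_node = head; current_node != nullptr;
         current_node = current_node->next) {
        // Intermediate value on its own line: inspect it in gdb
        // with "print cmp" after stepping past this statement.
        int cmp = std::strcmp(current_node->word, min_node->word);
        if (cmp < 0)
            min_node = current_node;
    }
    return min_node;
}
```

Building with optimizations disabled (for example -O0 -g) makes it more likely that cmp is still available as a variable in the debugger.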
You may simply try typing:
call my_function()
As far as I remember, though, it won't work when the function has been inlined.

Gentle introduction to JIT and dynamic compilation / code generation

The deceptively simple foundation of dynamic code generation within a C/C++ framework has already been covered in another question. Are there any gentle introductions to the topic with code examples?
My eyes are starting to bleed staring at highly intricate open source JIT compilers when my needs are much more modest.
Are there good texts on the subject that don't assume a doctorate in computer science? I'm looking for well worn patterns, things to watch out for, performance considerations, etc. Electronic or tree-based resources can be equally valuable. You can assume a working knowledge of (not just x86) assembly language.
Well a pattern I've used in emulators goes something like this:
typedef void (*code_ptr)();
unsigned long instruction_pointer = entry_point;
std::map<unsigned long, code_ptr> code_map;
void execute_block() {
code_ptr f;
std::map<unsigned long, code_ptr>::iterator it = code_map.find(instruction_pointer);
if(it != code_map.end()) {
f = it->second;
} else {
f = generate_code_block();
code_map[instruction_pointer] = f;
}
f();
instruction_pointer = update_instruction_pointer();
}
void execute() {
while(true) {
execute_block();
}
}
This is a simplification, but the idea is there. Basically, every time the engine is asked to execute a "basic block" (usually everything up to the next flow control op, or a whole function if possible), it will look it up to see if it has already been created. If so, execute it; else create it, add it and then execute.
rinse repeat :)
As for the code generation, that gets a little complicated, but the idea is to emit a proper "function" which does the work of your basic block in the context of your VM.
EDIT: note that I haven't demonstrated any optimizations either, but you asked for a "gentle introduction"
EDIT 2: I forgot to mention one of the most immediately productive speed-ups you can implement with this pattern. Basically, if you never remove a block from your tree (you can work around it if you do, but it is way simpler if you never do), then you can "chain" blocks together to avoid lookups. Here's the concept. Whenever you return from f() and are about to do the "update_instruction_pointer", if the block you just executed ended in either a call or an unconditional jump, or didn't end in flow control at all, then you can "fix up" its ret instruction with a direct jmp to the next block it'll execute (because it'll always be the same one), provided you have already emitted that block. This way you spend more and more time executing in the VM and less and less in the "execute_block" function.
I'm not aware of any sources specifically related to JITs, but I imagine that it's pretty much like a normal compiler, only simpler if you aren't worried about performance.
The easiest way is to start with a VM interpreter. Then, for each VM instruction, generate the assembly code that the interpreter would have executed.
To go beyond that, I imagine that you would parse the VM byte codes and convert them into some sort of suitable intermediate form (three address code? SSA?) and then optimize and generate code as in any other compiler.
For a stack-based VM, it may help to keep track of the "current" stack depth as you translate the byte codes into intermediate form, and treat each stack location as a variable. For example, if you think that the current stack depth is 4, and you see a "push" instruction, you might generate an assignment to "stack_variable_5" and increment a compile-time stack counter, or something like that. An "add" when the stack depth is 5 might generate the code "stack_variable_4 = stack_variable_4+stack_variable_5" and decrement the compile-time stack counter.
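A small sketch of that depth-tracking translation, assuming a made-up two-instruction bytecode (Op::Push with a constant, Op::Add) and emitting the intermediate form as plain text lines; the names Op, Instr and translate are hypothetical:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stack bytecode, invented for this sketch.
enum class Op { Push, Add };
struct Instr {
    Op op;
    int value;  // only meaningful for Push
};

// Turns stack code into assignments on numbered stack variables.
std::vector<std::string> translate(const std::vector<Instr>& code) {
    std::vector<std::string> out;
    int depth = 0;  // compile-time stack depth
    for (const Instr& in : code) {
        switch (in.op) {
        case Op::Push:
            ++depth;  // the pushed value lives in the new top slot
            out.push_back("stack_variable_" + std::to_string(depth) +
                          " = " + std::to_string(in.value));
            break;
        case Op::Add: {
            assert(depth >= 2);  // need two operands on the stack
            std::string lhs = "stack_variable_" + std::to_string(depth - 1);
            std::string rhs = "stack_variable_" + std::to_string(depth);
            out.push_back(lhs + " = " + lhs + " + " + rhs);
            --depth;  // two operands consumed, one result left
            break;
        }
        }
    }
    return out;
}
```

Translating push 2, push 3, add produces three lines, the last being "stack_variable_1 = stack_variable_1 + stack_variable_2". A real translator would emit IR nodes rather than strings, but the bookkeeping is the same.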
It is also possible to translate stack based code into syntax trees. Maintain a compile-time stack. Every "push" instruction causes a representation of the thing being pushed to be stored on the stack. Operators create syntax tree nodes that include their operands. For example, "X Y +" might cause the stack to contain "var(X)", then "var(X) var(Y)" and then the plus pops both var references off and pushes "plus(var(X), var(Y))".
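The compile-time stack described above can be sketched as follows. To keep it short, tree nodes are represented by their printed form (a string) instead of a real node type, only the "+" operator is handled, and the function name to_tree is invented:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Translates postfix tokens such as {"X", "Y", "+"} into a printed tree.
std::string to_tree(const std::vector<std::string>& tokens) {
    std::vector<std::string> stack;  // compile-time stack of subtrees
    for (const std::string& tok : tokens) {
        if (tok == "+") {
            assert(stack.size() >= 2);
            // The operator pops its operands and pushes a node
            // that contains them.
            std::string rhs = stack.back();
            stack.pop_back();
            std::string lhs = stack.back();
            stack.pop_back();
            stack.push_back("plus(" + lhs + ", " + rhs + ")");
        } else {
            // Anything else is treated as a variable being pushed.
            stack.push_back("var(" + tok + ")");
        }
    }
    assert(stack.size() == 1);  // well-formed input leaves one tree
    return stack.back();
}
```

For the tokens X Y + this yields plus(var(X), var(Y)), matching the stack states described above.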
Get yourself a copy of Joel Pobar's book on Rotor (when it's out), and delve through the source to the SSCLI. Beware, insanity lies within :)