I was wondering whether all control flow graphs obtained from typical JVM bytecode (see how to) of a single method (no recursion allowed) can be translated back into equivalent ifs-and-whiles code.
If not, what is the smallest JVM bytecode sequence that cannot be translated back into ifs and whiles?
There are several reasons why bytecode control flow may not be translatable back into Java without extreme measures.
JSR/RET - this instruction pair has no equivalent in Java. The best you can do is inline it. However, this will lead to an exponential increase in code size if the subroutines are nested.
Irreducible loops - In Java, every loop has a single entry point which dominates the rest of the loop. An "irreducible" loop is one that has multiple distinct entry points, and hence no direct Java equivalent. There are several approaches. My preferred solution is to duplicate part of the loop body, though this can lead to exponential blow up in pathological cases as well. The other approach is to turn the method into a while-switch state machine, but this obscures the original control flow.
An example instruction sequence is
ifnull L3
L2: nop
L3: goto L2
This is the simplest possible irreducible loop. It is impossible to turn into Java without changing the structure or duplicating part of the code (though in this case, there are no actual statements so duplicating wouldn't be so bad).
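For illustration, here is a sketch of the while-switch rendering of that loop (the state numbers mirror the bytecode labels, and x stands for the value tested by ifnull; written as C++, but the same shape works in Java):
// Sketch: the irreducible loop above as a while-switch state machine.
// Like the bytecode, this loops forever.
void stateMachine(void* x) {
    int state = (x == nullptr) ? 3 : 2;   // ifnull L3: jump to L3, else fall into L2
    while (true) {
        switch (state) {
            case 2: state = 3; break;     // L2: nop, then fall into L3
            case 3: state = 2; break;     // L3: goto L2
        }
    }
}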
The last part is exception handling. Java requires all exception handling to be done through structured try/catch blocks and its variations, while bytecode doesn't. At the bytecode level, exception handlers are basically another form of goto. In pathological cases, the best you can do is create a separate try/catch for every instruction that throws and repeat the process above.
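To make the pathological case concrete, the decompiled shape degenerates into something like this (a C++ sketch with invented step functions; the Java equivalent is analogous):
#include <exception>

void step1();    // invented: instructions that may throw
void step2();
void handler();  // invented: the shared exception handler's body

void body() {
    // One try/catch per throwing instruction, each routing control to the
    // handler, mimicking the bytecode's per-instruction handler ranges.
    try { step1(); } catch (const std::exception&) { handler(); return; }
    try { step2(); } catch (const std::exception&) { handler(); return; }
}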
I think a jump into the middle of a loop is not expressible in structured code:
JMP L1 // jump into the middle of a loop
L2:
IFCMP L3 // loop condition
// do something inside the loop
L1:
// do something else inside the loop
JMP L2
L3:
// exit the loop
Sorry, this is not exactly JVM bytecode, but you can get the idea.
Related
I've heard this often enough to actually question it - many people say that in an if-else statement, one should put the condition that is most likely to be true first. So, if the condition is likely to be false most of the time, put !condition in the if statement; otherwise use condition. A contrived illustration of what I mean:
if (likely) {
// do this
} else {
// do that
}
if (!unlikely) {
// do this
} else {
// do that
}
Some people say that this is more efficient - perhaps due to branch prediction or some other optimization; I've never actually enquired when the topic has been broached - but as far as I can tell there will always be one test, and both paths will result in a jump.
So my question is - is there a convincing reason (where a "convincing reason" may be a tiny efficiency gain) why the condition that is most likely to be true should come first in an if-else statement?
There are 2 reasons why order may matter:
the branches of an if/else-if/else chain have different probabilities due to a known distribution of the data
Sample - sort apples: most are good yellow ones (90%), but some are orange (5%) and some are of other colors (5%).
if (apple.isYellow) {...}
else if (apple.isOrange) {...}
else {...}
vs.
if (!apple.isYellow && !apple.isOrange) {...}
else if (apple.isOrange) {...}
else {...}
In the first sample, 90% of the apples are handled with just one check and 10% hit two; in the second sample, only 5% are handled with one check and 95% hit two.
So if you know that there is a significant difference between the chances of each branch being taken, it may be useful to move the likely one up to be the first condition.
Note that your sample with a single if makes no difference at that level.
low-level CPU optimizations that may favor one of the branches (this is also more about the incoming data consistently hitting the same branch of the condition).
Earlier/simpler CPUs may need to flush the instruction pipeline when a conditional jump is taken, compared to the case where code executes sequentially. This may be the reason for such a suggestion.
Sample (fake assembly):
IF R1 > R2 JUMP ElseBranch
ADD R2, R3 -> R4 // CPU decodes this command WHILE executing previous IF
....
JUMP EndCondition
ElseBranch:
ADD 4, R3 -> R4 // CPU will have to decode this command AFTER the jump
// and drop the results of parsing ADD R2, R3 -> R4
....
EndCondition:
....
Modern CPUs should not have this problem, as they decode commands for both branches. They even have branch prediction logic for conditions: if a condition has mostly resolved one way, the CPU will assume it will be resolved that particular way again and start executing code in that branch before the check is finished. To my knowledge it does not matter on current CPUs whether it is the first or the alternative branch of a condition. Check out Why is it faster to process a sorted array than an unsorted array? for good info on that.
My project has a VM that executes byte-code compiled from a domain-specific language. I'm looking at ways to improve the execution time of the byte-code. As a first step I'd like to see if there is a way to simply improve the byte-code interpreter before I venture into machine-code compilation.
The main loop of the interpreter looks like this:
while(true)
{
uint8_t cmd = *code++;
switch( cmd )
{
case op_1: ...; break;
...
}
}
QUESTION: Is there a faster way to implement this loop without resorting to assembler?
The one option I see is GCC-specific: computed goto using label addresses. Rather than a break at the end of each case I could jump directly to the next instruction's handler. I had hoped the optimizer would do this for me, but looking at the disassembly it apparently doesn't: there is a repeated constant jump at the end of most op_codes.
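For reference, that label-address dispatch would look roughly like this (a sketch with invented opcode names; "labels as values" is a GNU extension supported by GCC and Clang):
#include <stdint.h>

void run(const uint8_t* code) {
    // Each handler dispatches straight to the next opcode's handler
    // instead of branching back to a loop head. The opcode numbering
    // (0 = halt, 1, 2, ...) is invented.
    static void* dispatch[] = { &&do_halt, &&do_op_1, &&do_op_2 };
    goto *dispatch[*code++];
do_op_1:
    /* ... execute op_1 ... */
    goto *dispatch[*code++];
do_op_2:
    /* ... execute op_2 ... */
    goto *dispatch[*code++];
do_halt:
    return;
}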
If relevant: the VM is a simple register-based machine with floating-point and integer registers (8 of each). There is no stack, only a global heap (the language is not that complicated).
One very easy optimisation: instead of
switch/case/case/case/case/case,
just define an array of function pointers (where each function processes a specific command, or a couple of commands, in which case you can point several entries of the array at the same function and let the function itself check the exact code), and instead of
switch(cmd)
just do
array[cmd]()
This is given that you don't have too many commands. Also, do some checking if you will not define all the possible commands (maybe you only have 300 commands, but you have to use 2 bytes to represent them; so instead of defining an array with 65536 items, just check that the command is less than 301 and don't do the lookup if it's not).
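A minimal sketch of that table (handler names are invented here; assumes one-byte commands so every index is in range, and init_table() is called once before run()):
#include <stdint.h>

typedef void (*handler)();

static void op_nop()  { /* unknown/unused command: do nothing */ }
static void op_add()  { /* ... process the add command ... */ }
static void op_load() { /* ... process the load command ... */ }

static handler table[256];  // one entry per possible uint8_t command

static void init_table() {  // call once before run()
    for (int i = 0; i < 256; ++i) table[i] = op_nop;  // safe default
    table[0x01] = op_add;
    table[0x02] = op_load;
}

static void run(const uint8_t* code, const uint8_t* end) {
    while (code != end)
        table[*code++]();   // array[cmd]() in place of the switch
}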
If you won't do that, at least sort the cases so that the most used commands are at the beginning of the switch statement.
Otherwise you could look into hash tables, but I assume you don't have that many commands, and in that case the overhead of computing a hash would probably cost you more than what skipping the switch saves. (Or use a VERY simple hash function.)
What's the architecture? You may get a speed-up with word-aligned opcodes, but it'll blow out your code size, which means you'll have to balance it against the cost of a cache miss.
A few obvious optimizations I see:
If you don't use cmd anywhere other than the switch(), use the pointer indirection directly: switch( *code++ ). For a long-running while(true) loop, this can be a little helpful.
In the switch(), you can use continue instead of break: when continue is used inside if/else or switch, the compiler knows that execution has to jump back to the outer loop; the same is not true for break (with respect to switch).
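Put together, the two suggestions give a loop like this (a sketch; op_1 and op_halt are invented placeholders):
#include <stdint.h>

enum { op_halt = 0, op_1 = 1 /* ... */ };

void run(const uint8_t* code) {
    while (true) {
        switch (*code++) {      // suggestion 1: no separate cmd variable
            case op_1:
                /* ... handle op_1 ... */
                continue;       // suggestion 2: continue instead of break
            case op_halt:
                return;
        }
    }
}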
Hope this helps.
Is there any performance effect on "Lines of code - (C)" running inside nested ifs?
if (condition_1)
{
    /* Lines of code - (A) */
    if (condition_2)
    {
        /* Lines of code - (B) */
        if (condition_n)
        {
            /* Lines of code - (C) */
        }
    }
}
Does that mean you can nest any number of if statements without affecting the execution time of the code enclosed at the end of the last if statement?
Remember C and C++ are translated to their assembly equivalents. In most cases, this is likely to be via some form of compare (e.g. cmp) and some form of jmp instruction.
As such, whatever code is generated from (C) will still be the same. The if nesting has no bearing on that output. If the resultant code is add eax, 1, it will still be the same thing no matter how many ifs precede it.
The only performance penalty will be in the number of if statements you use and whether or not the resultant assembly (jxx) is expensive on your system. However, I doubt that repeated nested use of if is likely to be a performance bottleneck in your application. Usually, it is the time required to process data and/or the time required to get data.
You won't affect the execution time of the indicated code itself, but if evaluating your conditions is complex, or affected by other factors, then it could potentially lengthen the total time of execution.
The code will run as fast as if it were outside.
Just remember that evaluating an expression (in an if statement) is not "free" and takes a bit of time (more if the condition is more complex), so if your code is deeply nested it will take more time to reach it.
I've been looking around but was unable to figure out how one could print out in GDB the result of an evaluation. For example, in the code below:
if (strcmp(current_node->word,min_node->word) > 0)
min_node = current_node;
(above I was trying out a possible method for checking alphabetical order for strings, and wasn't absolutely certain it works correctly.)
Now I could watch min_node and see if the value changes but in more involved code this is sometimes more complicated. I am wondering if there is a simple way to watch the evaluation of a test on the line where GDB / program flow currently is.
There is no expression-level single stepping in gdb, if that's what you are asking for.
Your options are (from most to least commonly used):
evaluate the expression in gdb, doing print strcmp(current_node->word,min_node->word). Surprisingly, this works: gdb can evaluate function calls, by injecting code into the running program and having it execute the code. Of course, this is fairly dangerous if the functions have side effects or may crash; in this case, it is so harmless that people typically won't think about potential problems.
perform instruction-level (assembly) single-stepping (ni/si). When the call instruction is done, you find the result in a register, according to the processor conventions (%eax on x86).
edit the code to assign intermediate values to variables, and split that into separate lines/statements; then use regular single-stepping and inspect the variables.
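For the last option, the edit to the code from the question would be as small as this (sketch):
// Name the intermediate result so a plain single-step plus "print cmp"
// shows the evaluation.
int cmp = strcmp(current_node->word, min_node->word);
if (cmp > 0)
    min_node = current_node;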
You may simply try to type:
call my_function()
as far as I remember, though it won't work when the function is inlined.
The deceptively simple foundation of dynamic code generation within a C/C++ framework has already been covered in another question. Are there any gentle introductions into the topic with code examples?
My eyes are starting to bleed staring at highly intricate open source JIT compilers when my needs are much more modest.
Are there good texts on the subject that don't assume a doctorate in computer science? I'm looking for well worn patterns, things to watch out for, performance considerations, etc. Electronic or tree-based resources can be equally valuable. You can assume a working knowledge of (not just x86) assembly language.
Well a pattern I've used in emulators goes something like this:
#include <map>

typedef void (*code_ptr)();

code_ptr generate_code_block();              // emit native code for the block at instruction_pointer
unsigned long update_instruction_pointer();  // ask the VM where the last block left off
unsigned long entry_point;                   // set to the program entry before execute()

unsigned long instruction_pointer = entry_point;
std::map<unsigned long, code_ptr> code_map;

void execute_block() {
    code_ptr f;
    std::map<unsigned long, code_ptr>::iterator it = code_map.find(instruction_pointer);
    if(it != code_map.end()) {
        f = it->second;
    } else {
        f = generate_code_block();
        code_map[instruction_pointer] = f;
    }
    f();
    instruction_pointer = update_instruction_pointer();
}

void execute() {
    while(true) {
        execute_block();
    }
}
This is a simplification, but the idea is there. Basically, every time the engine is asked to execute a "basic block" (usually everything up to the next flow-control op, or a whole function where possible), it will look it up to see if it has already been created. If so, execute it; else create it, add it, and then execute it.
rinse repeat :)
As for the code generation, that gets a little complicated, but the idea is to emit a proper "function" which does the work of your basic block in the context of your VM.
EDIT: note that I haven't demonstrated any optimizations either, but you asked for a "gentle introduction"
EDIT 2: I forgot to mention one of the most immediately productive speed-ups you can implement with this pattern. Basically, if you never remove a block from your tree (you can work around it if you do, but it is way simpler if you never do), then you can "chain" blocks together to avoid lookups. Here's the concept: whenever you return from f() and are about to do the "update_instruction_pointer", if the block you just executed ended in either a call, an unconditional jump, or didn't end in flow control at all, then you can "fixup" its ret instruction with a direct jmp to the next block it'll execute (because it'll always be the same one), provided you have already emitted it. This makes it so you are executing more and more in the emitted code and less and less in the "execute_block" function.
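The ret patching itself is target-specific, but a portable approximation of the same idea (with invented names) is to cache the successor block whenever it can never change, so the map lookup is skipped:
typedef void (*code_ptr)();

struct Block {
    code_ptr run;
    Block*   next;  // cached successor, or null when it varies at runtime
};

Block* lookup_or_compile(unsigned long ip);   // invented: find-or-generate a block
unsigned long update_instruction_pointer();

void execute(unsigned long entry_point) {
    Block* b = lookup_or_compile(entry_point);
    while (true) {
        b->run();
        // A block ending in a call, an unconditional jump, or plain
        // fallthrough always has the same successor: chain directly.
        b = b->next ? b->next
                    : lookup_or_compile(update_instruction_pointer());
    }
}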
I'm not aware of any sources specifically related to JITs, but I imagine that it's pretty much like a normal compiler, only simpler if you aren't worried about performance.
The easiest way is to start with a VM interpreter. Then, for each VM instruction, generate the assembly code that the interpreter would have executed.
To go beyond that, I imagine that you would parse the VM byte codes and convert them into some sort of suitable intermediate form (three address code? SSA?) and then optimize and generate code as in any other compiler.
For a stack-based VM, it may help to keep track of the "current" stack depth as you translate the byte codes into intermediate form, and treat each stack location as a variable. For example, if you think that the current stack depth is 4 and you see a "push" instruction, you might generate an assignment to "stack_variable_5" and increment a compile-time stack counter, or something like that. An "add" when the stack depth is 5 might generate the code "stack_variable_4 = stack_variable_4 + stack_variable_5" and decrement the compile-time stack counter.
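A toy version of that bookkeeping, emitting the intermediate form as text (everything here is invented for illustration):
#include <stdio.h>

static int depth = 0;  // compile-time stack counter

static void emit_push(int constant) {
    ++depth;
    printf("stack_variable_%d = %d\n", depth, constant);
}

static void emit_add() {
    printf("stack_variable_%d = stack_variable_%d + stack_variable_%d\n",
           depth - 1, depth - 1, depth);
    --depth;
}

int main() {
    emit_push(2);  // stack_variable_1 = 2
    emit_push(3);  // stack_variable_2 = 3
    emit_add();    // stack_variable_1 = stack_variable_1 + stack_variable_2
    return 0;
}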
It is also possible to translate stack based code into syntax trees. Maintain a compile-time stack. Every "push" instruction causes a representation of the thing being pushed to be stored on the stack. Operators create syntax tree nodes that include their operands. For example, "X Y +" might cause the stack to contain "var(X)", then "var(X) var(Y)" and then the plus pops both var references off and pushes "plus(var(X), var(Y))".
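And the same "X Y +" example with a compile-time stack, using strings to stand in for real syntax-tree nodes (a sketch):
#include <iostream>
#include <stack>
#include <string>

static std::stack<std::string> nodes;  // the compile-time stack

static void push_var(const std::string& name) {
    nodes.push("var(" + name + ")");
}

static void apply_plus() {
    std::string rhs = nodes.top(); nodes.pop();
    std::string lhs = nodes.top(); nodes.pop();
    nodes.push("plus(" + lhs + ", " + rhs + ")");
}

int main() {
    push_var("X");   // stack: var(X)
    push_var("Y");   // stack: var(X) var(Y)
    apply_plus();    // stack: plus(var(X), var(Y))
    std::cout << nodes.top() << "\n";
    return 0;
}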
Get yourself a copy of Joel Pobar's book on Rotor (when it's out), and delve through the source to the SSCLI. Beware, insanity lies within :)