How to generate LLVM IR for nested blocks in LLVM

There is a block A in a function, and another block B inside block A. How do I generate the LLVM IR for this? For example:
#include <stdio.h>

int fun()
{   /* block A */
    int i = 0;
    {   /* block B */
        int i = 1;
        printf("i in block B is %d\n", i);
    }
    printf("i in block A is %d\n", i);
    return 0;
}

Your blocks A and B aren't basic blocks, they're just blocks. C (or whichever language this is) does not have a concept of basic blocks - LLVM does.
Basic blocks in LLVM do not have to (and often don't) correspond to blocks in the source language. Basically a basic block is just a unit of code, such that you never jump into or out of the middle of it. You only jump to the beginning of the block and only jump from its end.
Blocks in source languages can serve many purposes. Sometimes they're used as part of control flow statements - sometimes they're not. And sometimes you can have control flow without blocks. For example, in many languages, loops and if statements can be used with a single statement body that is not a compound statement (e.g. if (condition) return; - no block here, but still control flow). Likewise switch statements generally don't have a block for each case and then there's of course goto.
So when there's control flow without blocks, the generated program will contain more basic blocks than the source program has blocks. And in the opposite case, where there are blocks without control flow, the generated program will contain fewer basic blocks.
In your example, the function fun contains no control flow other than the implicit return at the end of the function. Therefore you should only generate a single basic block for it.
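For illustration, here is a minimal sketch using LLVM's C++ IRBuilder API (exact calls vary across LLVM versions, and the printf declaration here is written by hand). It emits the whole function into one entry block, with the two source-level i variables as two distinct allocas:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

int main()
{
    llvm::LLVMContext ctx;
    llvm::Module module("demo", ctx);
    llvm::IRBuilder<> b(ctx);

    // declare: int printf(const char *, ...)
    auto *printfTy = llvm::FunctionType::get(
        b.getInt32Ty(), {b.getInt8PtrTy()},   // b.getPtrTy() on newer, opaque-pointer LLVMs
        /*isVarArg=*/true);
    auto printfFn = module.getOrInsertFunction("printf", printfTy);

    // define: int fun() with a single basic block
    auto *funTy = llvm::FunctionType::get(b.getInt32Ty(), false);
    auto *fun = llvm::Function::Create(
        funTy, llvm::Function::ExternalLinkage, "fun", &module);
    b.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fun));

    // the two source-level `i` variables become two distinct stack slots
    auto *iA = b.CreateAlloca(b.getInt32Ty(), nullptr, "i.A");
    auto *iB = b.CreateAlloca(b.getInt32Ty(), nullptr, "i.B");
    b.CreateStore(b.getInt32(0), iA);
    b.CreateStore(b.getInt32(1), iB);

    b.CreateCall(printfFn, {b.CreateGlobalStringPtr("i in block B is %d\n"),
                            b.CreateLoad(b.getInt32Ty(), iB)});
    b.CreateCall(printfFn, {b.CreateGlobalStringPtr("i in block A is %d\n"),
                            b.CreateLoad(b.getInt32Ty(), iA)});

    b.CreateRet(b.getInt32(0));
    module.print(llvm::outs(), nullptr);   // dump the textual IR
}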

Related

Can all control flow graphs be translated back using if and while?

I was wondering whether all control flow graphs obtained from typical JVM bytecode for a single method (no recursion allowed) can be translated back to equivalent code using only ifs and whiles.
If not, what is the smallest JVM bytecode sequence that cannot be translated back to ifs and whiles?
There are several reasons why bytecode control flow may not be translatable back into Java without extreme measures.
JSR/RET - this instruction pair has no equivalent in Java. The best you can do is inline it. However, this will lead to an exponential increase in code size if they are nested.
Irreducible loops - In Java, every loop has a single entry point which dominates the rest of the loop. An "irreducible" loop is one that has multiple distinct entry points, and hence no direct Java equivalent. There are several approaches. My preferred solution is to duplicate part of the loop body, though this can lead to exponential blow up in pathological cases as well. The other approach is to turn the method into a while-switch state machine, but this obscures the original control flow (see the sketch after the example below).
An example instruction sequence is
ifnull L3
L2: nop
L3: goto L2
This is the simplest possible irreducible loop. It is impossible to turn into Java without changing the structure or duplicating part of the code (though in this case, there are no actual statements so duplicating wouldn't be so bad).
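To make the while-switch idea concrete, here is a sketch of that encoding (written as C++ for illustration; the state numbers are arbitrary and value_is_null stands for the operand that ifnull tests). Like the bytecode, it never exits:

bool value_is_null = false;          // operand tested by ifnull
int state = value_is_null ? 3 : 2;   // ifnull L3; otherwise fall through to L2
while (true) {
    switch (state) {
    case 2:              // L2: nop
        state = 3;       //     fall through to L3
        break;
    case 3:              // L3: goto L2
        state = 2;
        break;
    }
}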
The last part is exception handling. Java requires all exception handling to be done through structured try/catch blocks and its variations, while bytecode doesn't. At the bytecode level, exception handlers are basically another form of goto. In pathological cases, the best you can do is create a separate try/catch for every instruction that throws and repeat the process above.
I think a jump into the middle of a loop is not expressible in structured code:
JMP L1 // jump into the middle of a loop
L2:
IFCMP L3 // loop condition
// do something inside the loop
L1:
// do something else inside the loop
JMP L2
L3:
// exit the loop
Sorry, this is not exactly JVM bytecode, but you can get the idea.

Design elements for inline asm in concurrent usage

I can't find a neat explanation of how I'm supposed to write a piece of inline asm, and of the problems that can possibly arise from concurrent use of a foo function that contains asm code in it.
The problem that I see is that in asm the registers are uniquely named, so one name is strictly tied to one precise portion of your CPU; that's a big problem if you are writing a piece of code that is supposed to run concurrently, because you can't simply conjure up extra registers with the same name.
The other problem is that asm doesn't really use a calling convention; you simply use registers and/or values, and sometimes touching one register implies a silent action on another register that doesn't even show up explicitly in your code. So I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack frame if it contains asm code.
Now, with what gcc calls extended asm, I can basically declare where the input and the output go, so each function can use its own parameters "as registers", and the pattern is the following:

asm ( assembler template
      : output operands
      : input operands
      : clobbered registers
    );
Assuming that my main target for now is mathematical operations, and my function is only supposed to provide a certain functionality and perform some computation (no internal locks), is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design that I'm supposed to give this kind of code snippet.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can decode and execute them. Your operating system only creates the illusion of running multiple threads (and processes, too) by employing scheduling: every thread, or process, is allocated a certain amount of time during which it gets to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When the currently executing thread or process is switched, the operating system asks the processor to perform something called a context switch. In a nutshell, the processor saves the state it had while executing the previous task/thread/process into some memory area controlled by the OS. The new task/thread/process has its context restored from the previously stored state and continues its execution. When this task/thread/process' time slice on the CPU is up, the scheduler decides which task/thread/process to resume next. The time slice is usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description: refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems, except that there is more than one unit that can execute the instructions. This is also true for multi-core processors: every one of the cores has its own set of registers. The basic picture stays the same - the scheduler in your OS decides whether the code being executed actually runs at the same time on multiple cores of one processor.
Thus, your concerns in this case are not valid, though they were raised for very understandable reasons. Remember that the only thing threads share is the main memory: each thread has its own registers and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you wrote. That's why you need to specify it. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions - assuming that we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code:

unsigned int get_cr0(void)
{
    unsigned int rc;

    __asm__ ("movl %%cr0, %0\n"
             : "=r"(rc)
             :
             :
            );
    return rc;
}
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but this is not important right now. See how I put %0 in the instruction, and then specified "=r"(rc) in the output list. This means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify on the input/output list. They are numbered starting from zero, as you can see.
I can't really remember which instructions use registers that are not encoded as operands, so I can't give you an example right now. In that case, you would need to put those registers on the clobber list (the last one). I'm pretty sure you can refer to the GCC documentation on extended asm for more information.
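For illustration, one well-known case of implicit register effects is x86's cpuid: it overwrites eax, ebx, ecx and edx. If only eax is captured as an output, the others belong on the clobber list so the compiler knows their previous contents are destroyed (a sketch, not production code):

unsigned int cpuid_eax(unsigned int leaf)
{
    unsigned int eax;

    __asm__ ("cpuid"
             : "=a"(eax)              /* output: eax after cpuid       */
             : "a"(leaf)              /* input: requested leaf in eax  */
             : "ebx", "ecx", "edx");  /* silently overwritten by cpuid */
    return eax;
}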
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support it at all, for example.

Optimal virtual machine/byte-code interpreter loop

My project has a VM that executes a byte-code compiled from a domain-specific-language. I'm looking at ways that I can improve the execution time of the byte-code. As a first step I'd like to see if there is a way to simply improve the byte-code interpreter before I venture into machine code compilation.
The main loop of the interpreter looks like this:
while (true)
{
    uint8_t cmd = *code++;

    switch (cmd)
    {
    case op_1: ...; break;
    ...
    }
}
QUESTION: Is there a faster way to implement this loop without resorting to assembler?
The one option I see is GCC-specific: computed goto using label addresses ("labels as values"). Rather than a break at the end of each case, I could jump directly to the handler for the next instruction (a sketch of what I mean follows below). I had hoped the optimizer would do this for me, but looking at the disassembly it apparently doesn't: there is a repeated constant jump at the end of most op_codes.
If relevant the VM is a simple register based machine with floating point and integer registers (8 of each). There is no stack, only a global heap (that language is not that complicated).
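For reference, the labels-as-values dispatch I have in mind looks roughly like this (a sketch with placeholder opcodes; this relies on a GNU extension):

#include <cstdint>
#include <cstdio>

void run(const uint8_t *code)
{
    // one label address per opcode value; opcode 0 halts
    static void *dispatch[] = { &&op_halt, &&op_print };

    goto *dispatch[*code++];      // enter the dispatch chain

op_print:
    std::puts("op_print");
    goto *dispatch[*code++];      // jump straight to the next handler

op_halt:
    return;
}

int main()
{
    const uint8_t program[] = { 1, 1, 0 };   // print, print, halt
    run(program);
}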
One very easy optimisation: instead of

switch/case/case/case/case/case,

define an array of function pointers, where each function processes one specific command (or a couple of commands, in which case you can set several entries of the array to the same function and let the function itself check the exact code). Then instead of

switch (cmd)

just do

array[cmd]();

This assumes you don't have too many commands. Also, add a bounds check if you won't define all the possible commands: maybe you only have 300 commands but need 2 bytes to represent them, so instead of defining an array with 65536 entries, just check that the command is less than 301 and skip the lookup if it's not.
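A sketch of that dispatch-table idea (all names are placeholders):

#include <cstdint>
#include <cstdio>

typedef void (*handler_t)(void);

void op_halt(void) { std::puts("halt"); }
void op_add(void)  { std::puts("add"); }

// Only 301 real commands even though opcodes are 2 bytes wide,
// so bounds-check instead of building a 65536-entry table.
handler_t table[301] = { op_halt, op_add /* , ... */ };

void dispatch(uint16_t cmd)
{
    if (cmd < 301 && table[cmd])   // undefined commands are ignored
        table[cmd]();              // replaces switch (cmd)
}

int main()
{
    dispatch(1);       // add
    dispatch(40000);   // out of range: no lookup at all
}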
If you won't do that, at least sort the commands so that the most heavily used ones are at the beginning of the switch statement.
Beyond that, you could look into hash tables, but I assume you don't have that many commands, in which case the overhead of computing a hash would probably cost you more than the switch. (Or use a VERY simple hash function.)
What's the architecture? You may get a speed-up with word-aligned opcodes, but it'll blow out your code size, which means you'll have to balance it against the cost of a cache miss.
A few obvious optimizations I see:
If you don't use cmd anywhere other than the switch(), use the pointer indirection directly: switch( *code++ ). Over many iterations of the while(true) loop, this can help a little.
In the switch(), you can use continue instead of break: when continue is used inside a switch within a loop, the compiler knows that execution has to jump back to the top of the enclosing loop, whereas break merely exits the switch.
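A sketch combining both suggestions (the opcodes are placeholders):

#include <cstdint>
#include <cstdio>

enum { op_halt, op_print };

void run(const uint8_t *code)
{
    while (true)
    {
        switch (*code++)            // no separate cmd variable
        {
        case op_print:
            std::puts("op_print");
            continue;               // jump straight back to while(true)
        case op_halt:
            return;
        }
    }
}

int main()
{
    const uint8_t program[] = { op_print, op_print, op_halt };
    run(program);
}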
Hope this helps.

Performance question on nested if's

Is there any performance effect on the "Lines of code - (C)" from running inside nested ifs?
if (condition_1)
{
    /* Lines of code - (A) */
    if (condition_2)
    {
        /* Lines of code - (B) */
        if (condition_n)
        {
            /* Lines of code - (C) */
        }
    }
}
Does that mean you can nest any number of if statements without affecting the execution time of the code enclosed at the end of the last if statement?
Remember C and C++ are translated to their assembly equivalents. In most cases, this is likely to be via some form of compare (e.g. cmp) and some form of jmp instruction.
As such, whatever code is generated from (C) will still be the same. The if nesting has no bearing on the output. If the resultant code is to generate add eax, 1 no matter how many ifs precede that, it will still be the same thing.
The only performance penalty is in the number of if statements you use and whether the resultant jump instructions (jxx) are expensive on your system. However, I doubt that repeated nested use of if is likely to be a performance bottleneck in your application. Usually the bottleneck is the time required to process data and/or the time required to get it.
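To make that concrete, here is a small sketch; the assembly in the comments is illustrative, not exact compiler output:

#include <cstdio>

void f(int a, int b)
{
    if (a)                    //   test a; je done
    {
        if (b)                //   test b; je done
        {
            std::puts("C");   // code at (C): identical no matter how
        }                     // many ifs are nested around it
    }
}                             // done:

int main()
{
    f(1, 1);   // prints "C"; the cost is the two tests, not the nesting
}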
You won't affect the execution time of the indicated code itself, but if evaluating your conditions is complex, or affected by other factors, then it could potentially lengthen the total time of execution.
The code will run as fast as if it were outside.
Just remember that evaluating an expression (in an if statement) is not "free" and takes a bit of time (more if the condition is more complex), so if your code is deeply nested it will take more time to reach it.

Gentle introduction to JIT and dynamic compilation / code generation

The deceptively simple foundation of dynamic code generation within a C/C++ framework has already been covered in another question. Are there any gentle introductions into the topic, with code examples?
My eyes are starting to bleed staring at highly intricate open source JIT compilers when my needs are much more modest.
Are there good texts on the subject that don't assume a doctorate in computer science? I'm looking for well-worn patterns, things to watch out for, performance considerations, etc. Electronic or dead-tree resources can be equally valuable. You can assume a working knowledge of (not just x86) assembly language.
Well a pattern I've used in emulators goes something like this:
#include <map>

typedef void (*code_ptr)();

unsigned long instruction_pointer = entry_point;   // entry_point supplied elsewhere
std::map<unsigned long, code_ptr> code_map;

code_ptr generate_code_block();             // emits native code for the block at instruction_pointer
unsigned long update_instruction_pointer();

void execute_block() {
    code_ptr f;
    std::map<unsigned long, code_ptr>::iterator it = code_map.find(instruction_pointer);
    if (it != code_map.end()) {
        f = it->second;                     // block already compiled: reuse it
    } else {
        f = generate_code_block();          // compile on first use...
        code_map[instruction_pointer] = f;  // ...and cache it for next time
    }
    f();
    instruction_pointer = update_instruction_pointer();
}

void execute() {
    while (true) {
        execute_block();
    }
}
This is a simplification, but the idea is there. Basically, every time the engine is asked to execute a "basic block" (usually everything up to the next flow-control op, or a whole function if possible), it will look it up to see if it has already been created. If so, execute it; else create it, add it, and then execute.
rinse repeat :)
As for the code generation, that gets a little complicated, but the idea is to emit a proper "function" which does the work of your basic block in the context of your VM.
EDIT: note that I haven't demonstrated any optimizations either, but you asked for a "gentle introduction"
EDIT 2: I forgot to mention one of the most immediately productive speed-ups you can implement with this pattern. Basically, if you never remove a block from your tree (you can work around it if you do, but it is way simpler if you never do), then you can "chain" blocks together to avoid lookups. Here's the concept: whenever you return from f() and are about to do the "update_instruction_pointer", if the block you just executed ended in either a call, an unconditional jump, or didn't end in flow control at all, then you can "fixup" its ret instruction with a direct jmp to the next block it'll execute (because it'll always be the same one), provided you have already emitted that block. This makes it so you are executing more and more in the generated code and less and less in the "execute_block" function.
I'm not aware of any sources specifically related to JITs, but I imagine that it's pretty much like a normal compiler, only simpler if you aren't worried about performance.
The easiest way is to start with a VM interpreter. Then, for each VM instruction, generate the assembly code that the interpreter would have executed.
To go beyond that, I imagine that you would parse the VM byte codes and convert them into some sort of suitable intermediate form (three address code? SSA?) and then optimize and generate code as in any other compiler.
For a stack-based VM, it may help to keep track of the "current" stack depth as you translate the byte codes into intermediate form, and treat each stack location as a variable. For example, if you think the current stack depth is 4 and you see a "push" instruction, you might generate an assignment to "stack_variable_5" and increment a compile-time stack counter, or something like that. An "add" when the stack depth is 5 might generate the code "stack_variable_4 = stack_variable_4 + stack_variable_5" and decrement the compile-time stack counter.
It is also possible to translate stack based code into syntax trees. Maintain a compile-time stack. Every "push" instruction causes a representation of the thing being pushed to be stored on the stack. Operators create syntax tree nodes that include their operands. For example, "X Y +" might cause the stack to contain "var(X)", then "var(X) var(Y)" and then the plus pops both var references off and pushes "plus(var(X), var(Y))".
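A sketch of that tree-building translation for the postfix input "X Y +" (the Node shape here is hypothetical):

#include <iostream>
#include <memory>
#include <stack>
#include <string>

struct Node {
    std::string label;                   // "var(X)", "plus", ...
    std::unique_ptr<Node> lhs, rhs;      // children for operator nodes
};

int main()
{
    std::stack<std::unique_ptr<Node>> s;   // the compile-time stack

    // push X  -> var(X); push Y -> var(Y)
    s.push(std::unique_ptr<Node>(new Node{"var(X)", nullptr, nullptr}));
    s.push(std::unique_ptr<Node>(new Node{"var(Y)", nullptr, nullptr}));

    // '+' pops both operand references and pushes plus(var(X), var(Y))
    std::unique_ptr<Node> rhs = std::move(s.top()); s.pop();
    std::unique_ptr<Node> lhs = std::move(s.top()); s.pop();
    s.push(std::unique_ptr<Node>(
        new Node{"plus", std::move(lhs), std::move(rhs)}));

    std::cout << s.top()->label << "(" << s.top()->lhs->label << ", "
              << s.top()->rhs->label << ")\n";   // plus(var(X), var(Y))
}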
Get yourself a copy of Joel Pobar's book on Rotor (when it's out), and delve through the source to the SSCLI. Beware, insanity lies within :)