Inserting a block between two blocks in LLVM - c++

I want to insert a block in between two basic blocks in LLVM. So for example, if a basic block A was jumping to basic block B, I want to insert a basic block C in between them such that A jumps to C and C jumps to B. How can I do that? I do have the basic idea that I need to change the terminating instruction of Basic Block A, such that the target B is replaced by C, but how do I go on adding the new basic block C in between?

Yes, you need to change (or replace) the terminating instruction of basic block A - for example, if it's a branch, you can use BranchInst::setSuccessor(). You then create basic block C and make sure that its terminating instruction jumps to B, which will make it in-between.
All you need to do is to change the terminators' targets - you don't need to rearrange the block order in the memory or anything like that.
However, you must be aware that there are two special instructions you need to worry about - phi nodes and landing pads.
Phi nodes refer only to the block's immediate predecessors. That means that if you insert C between A and B, you must fix all the phi nodes in B by either removing them or making them refer to C instead of A.
If B is a landingpad block (contains a landingpad instruction), it is only legal to jump into it directly from the unwind target of an invoke instruction. If the jump from A to B is through the unwind target, you can't add a basic block in-between unless you make C itself into a landingpad and remove the landingpad from B.
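An untested sketch of that rewiring, assuming a reasonably recent LLVM (replacePhiUsesWith appeared around LLVM 9) and that the A→B edge is an ordinary branch rather than an invoke unwind edge:
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Insert a fresh block C on the A->B edge and return it.
BasicBlock *insertBetween(BasicBlock *A, BasicBlock *B) {
  BasicBlock *C =
      BasicBlock::Create(A->getContext(), "between", A->getParent(), B);

  // Redirect every A->B target of A's terminator to C.
  Instruction *Term = A->getTerminator();
  for (unsigned i = 0, e = Term->getNumSuccessors(); i != e; ++i)
    if (Term->getSuccessor(i) == B)
      Term->setSuccessor(i, C);

  // C's only job is to jump on to B.
  BranchInst::Create(B, C);

  // Fix B's phi nodes: values that used to arrive from A now arrive from C.
  B->replacePhiUsesWith(A, C);
  return C;
}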

There is a function called llvm::SplitEdge, declared in llvm/Transforms/Utils/BasicBlockUtils.h. It does exactly what the question asked for.
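A minimal usage sketch, with A and B as in the question (newer versions also take optional analyses such as the dominator tree so they are kept up to date):
#include "llvm/Transforms/Utils/BasicBlockUtils.h"

// Splits the A->B edge and returns the newly inserted block.
llvm::BasicBlock *C = llvm::SplitEdge(A, B);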

Related

How to determine if a BasicBlock is controlled by an `if`

I want to use LLVM to analyze whether a basic block is affected by the control flow of an `if` (i.e., a `br` instruction). "A basic block BB is NOT affected by the br" means that no matter which of the two successors the br jumps to, BB will be executed for sure.
My current rule for determining whether a basic block BB is affected is (if true, affected):
¬(postDominate(BB, BranchInst->leftBB) && postDominate(BB, BranchInst->rightBB))
Since I cannot exhaustively test the rule above on all possible CFGs, I want to know whether this rule is sound and complete.
Thanks!
Further question
I am also unsure whether I should use dominance rather than post-dominance, as below (I know the difference between post-dominance and dominance, but which should I use? Both rules seem to work in this example, but I am not sure which one will or won't work in other cases):
Dominate(BranchInst->leftBB, BB) || Dominate(BranchInst->rightBB, BB)
A block Y is control dependent on block X if and only if Y postdominates at least one but not all successors of X.
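As an untested sketch of that definition - assuming PDT is an up-to-date llvm::PostDominatorTree, and remembering that PDT.dominates(Y, S) means "Y post-dominates S":
#include "llvm/Analysis/PostDominators.h"
#include "llvm/IR/CFG.h"

using namespace llvm;

// Y is control dependent on X iff Y post-dominates at least one,
// but not all, of X's successors.
bool isControlDependentOn(PostDominatorTree &PDT, BasicBlock *Y,
                          BasicBlock *X) {
  bool Some = false, All = true;
  for (BasicBlock *Succ : successors(X)) {
    if (PDT.dominates(Y, Succ))
      Some = true;
    else
      All = false;
  }
  return Some && !All;
}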
llvm::ReverseIDFCalculator in llvm/Analysis/IteratedDominanceFrontier.h will calculate the post-dominance frontier for you, which is exactly what you need. The "iterated" part isn't relevant to your use case, so ignore the setLiveInBlocks() method.
I have not tested this at all, but I expect that something like this should do the trick:
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/IteratedDominanceFrontier.h"
#include "llvm/Analysis/PostDominators.h"

// PDT is the llvm::PostDominatorTree
SmallPtrSet<BasicBlock *, 1> BBSet{block_with_branch};
SmallVector<BasicBlock *, 32> IDFBlocks;
ReverseIDFCalculator IDFs(PDT);
IDFs.setDefiningBlocks(BBSet);
IDFs.calculate(IDFBlocks); // IDFBlocks now holds the post-dominance frontier
The effect of control dependence propagates: a block can be affected indirectly, via blocks that are themselves control dependent on the branch. Applying the definition iteratively to all the control-dependent blocks is the right way to catch those.
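An untested worklist sketch of that iteration, reusing isControlDependentOn from the sketch above (the quadratic scan over the function is for clarity, not speed):
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/PostDominators.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Collect every block affected, directly or through a chain of control
// dependences, by the branch at the end of Root.
SmallPtrSet<BasicBlock *, 8> affectedBy(PostDominatorTree &PDT, Function &F,
                                        BasicBlock *Root) {
  SmallPtrSet<BasicBlock *, 8> Affected;
  SmallVector<BasicBlock *, 8> Worklist{Root};
  while (!Worklist.empty()) {
    BasicBlock *X = Worklist.pop_back_val();
    for (BasicBlock &Y : F)
      if (&Y != Root && !Affected.count(&Y) &&
          isControlDependentOn(PDT, &Y, X)) {
        Affected.insert(&Y);
        Worklist.push_back(&Y);
      }
  }
  return Affected;
}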

LLVM insert if/else into existing basic block

I want to check the value of some instruction at runtime. Therefore, I create a compare instruction and a branch instruction which branches to either the "then" basic block or the "else" basic block. However, I am not sure how to insert the created basic blocks after the conditional branch, nor how the splitting of the existing basic block works.
Instruction* someInst;  // the value to check at runtime
IRBuilder<> B(someInst);
Value* condition = B.CreateICmp(CmpInst::ICMP_UGT, someInst, someValue);
BasicBlock* thenBB = BasicBlock::Create(*ctx, "then");  // created detached from the function
BasicBlock* elseBB = BasicBlock::Create(*ctx, "else");  // created detached from the function
B.CreateCondBr(condition, thenBB, elseBB);
B.SetInsertPoint(thenBB);
// insert stuff
B.SetInsertPoint(elseBB);
// insert stuff
How can I insert an if/else in the middle of an existing basic block?
Short answer: you can probably use llvm::SplitBlockAndInsertIfThenElse. Don't forget your PHI node.
According to Wikipedia, a basic block:
is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
An if-then-else therefore involves several blocks:
The block that contains the condition,
The then block
The else block
Optionally, the block after the then and else blocks (if then and else don't return or branch elsewhere).
To insert an if-then-else, the original basic block must be split into (1) and (4). The condition check and conditional branch go into (1), and (2) and (3) each finish with a branch to (4). The SplitBlockAndInsertIfThenElse function will do this for you in simple cases. If you have more complicated requirements - such as then or else containing their own control flow - you may need to do the splitting yourself.
If your then or else blocks modify variables, you will need a PHI node. The Kaleidoscope tutorial explains why PHI nodes are needed and how to use them. The tutorial references the Static Single Assignment Wikipedia article, which is useful background.
There is a helper function you can use called llvm::SplitBlockAndInsertIfThenElse. You'll need to #include "llvm/Transforms/Utils/BasicBlockUtils.h".
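Putting it together for the question's snippet - an untested sketch, assuming someInst is not a terminator (the split point is the instruction after it, so the comparison can use someInst's value):
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"

using namespace llvm;

void insertIfElse(Instruction *someInst, Value *someValue) {
  // Split after someInst so the comparison can refer to its result.
  Instruction *SplitPt = someInst->getNextNode();
  IRBuilder<> B(SplitPt);
  Value *Cond = B.CreateICmp(CmpInst::ICMP_UGT, someInst, someValue);

  // Creates the then/else blocks, splits the rest of the original block
  // into a tail block, and wires up all the branches.
  Instruction *ThenTerm = nullptr;
  Instruction *ElseTerm = nullptr;
  SplitBlockAndInsertIfThenElse(Cond, SplitPt, &ThenTerm, &ElseTerm);

  B.SetInsertPoint(ThenTerm); // insert "then" code before the branch to the tail
  // ...
  B.SetInsertPoint(ElseTerm); // insert "else" code
  // ...
}
If the two sides compute a value the tail block needs, create a PHINode at the start of the tail block with one incoming value per side.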

How to get the memory address of all operands of an expression

I have an expression such as a=b+c-d*e, and with the help of an LLVM pass I want to build a string like this:
"[Hexadecimal address of 'b'] [opcode of +] [Hexadecimal address of 'c'] [opcode of -] [Hexadecimal address of 'd'] [opcode of *] [Hexadecimal address of 'e']".
How can I do that?
First of all, keep in mind that variables do not necessarily reside in memory; they can be stored in registers or elided altogether. In the context of LLVM IR, this means a value may be used directly from another value, with no store or load involved at all.
Assuming all the variables involved do need to be loaded from memory, the most straightforward way I can think of is locating the store, then doing a depth-first walk backwards through the operands, recording an opcode between each pair of operands and stopping when you reach a load. For your provided snippet, that should give you b's load, then the plus opcode, then c's load, then the minus opcode, and so on.
Now that you have such a sequence, I'd say the simplest way to generate a string from it is to insert a call to C's sprintf with a dynamically-built format string, passing it the pointers that you found (that were loaded from).
I see two issues with the above, though:
There's some inherent ambiguity here - just visiting them this way cannot distinguish, for example, (b+c-d)*e from b+(c-d)*e. So I think it would make sense to also record "(" and ")" whenever you enter an arithmetic instruction and leave it, respectively.
This approach does not actually check that all the operations are part of the same expression. So if you have tmp = b+c; a = tmp-d*e;, and tmp is optimized away, then it will look the same in the IR. The only way I can think of for enforcing that is compiling with debug symbols and digging into those to identify distinct expressions - though I don't really know if that's possible - or actually modify Clang to record expression boundaries :\
Pseudo-code for this approach (with simplistic sequence-handling operations):
functionPass:
    for each instruction:
        if instruction is store:
            processExpression(store)

processExpression(store):
    sequence <- initialize
    visit(sequence, store.value)
    generateSprintfCallFromSequence(sequence)

visit(sequence, value):
    if value is load:
        sequence.add(load.pointer)
    else if value is binaryop:
        // sequence.add(openingParen)
        visit(sequence, binaryop.operand(0))
        sequence.add(binaryop.opcode)
        visit(sequence, binaryop.operand(1))
        // sequence.add(closingParen)
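An untested C++ sketch of this pass's core, assuming the expression is a tree of BinaryOperators whose leaves are LoadInsts; the Token type and all names are illustrative, and printf is used instead of sprintf to avoid managing a buffer:
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include <string>
#include <vector>

using namespace llvm;

struct Token {
  Value *Pointer = nullptr; // set for leaves (loaded-from addresses)
  unsigned Opcode = 0;      // set for operators
};

// In-order walk: left operand, opcode, right operand; stop at loads.
static void visit(std::vector<Token> &Seq, Value *V) {
  if (auto *Load = dyn_cast<LoadInst>(V)) {
    Seq.push_back({Load->getPointerOperand(), 0});
  } else if (auto *BinOp = dyn_cast<BinaryOperator>(V)) {
    visit(Seq, BinOp->getOperand(0));
    Seq.push_back({nullptr, BinOp->getOpcode()});
    visit(Seq, BinOp->getOperand(1));
  }
}

// Emit a printf call right before the store that ends the expression.
static void emitTrace(Module &M, StoreInst *SI) {
  std::vector<Token> Seq;
  visit(Seq, SI->getValueOperand());

  IRBuilder<> B(SI);
  std::string Fmt;
  std::vector<Value *> Args;
  for (const Token &T : Seq) {
    if (T.Pointer) {
      Fmt += "%p ";                              // the operand's address
      Args.push_back(T.Pointer);
    } else {
      Fmt += Instruction::getOpcodeName(T.Opcode); // e.g. "add", "sub"
      Fmt += ' ';
    }
  }
  FunctionType *PrintfTy = FunctionType::get(
      B.getInt32Ty(), {B.getInt8Ty()->getPointerTo()}, /*isVarArg=*/true);
  FunctionCallee Printf = M.getOrInsertFunction("printf", PrintfTy);
  Args.insert(Args.begin(), B.CreateGlobalStringPtr(Fmt));
  B.CreateCall(Printf, Args);
}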

Writing an interpreter in C++ - determining the loop target for a break statement

I am writing a simple program interpreter in C++. When I am building the internal representation of the program and I encounter a break statement, how do I determine the enclosing loop's target location?
void Imp::whilestmt()
{
    Expr *pExpr;
    accept(Token::WHILE);
    expr(pExpr);
    WhileStmt *pwhilestmt = new WhileStmt(pExpr, vm.getLocationCounter());
    vm.add(pwhilestmt);
    accept(Token::LOOP);
    stmtlist();
    pwhilestmt->setTarget(vm.getLocationCounter());
    accept(Token::END);
    accept(Token::LOOP);
    vm.add(new EndLoopStmt);
}
My break statement object is going to take the while statement's target as a parameter; how can I determine this?
I'd consider building a kind of execution tree/pipeline. Every LOOP/WHILE would be a new branch (similarly to every function), so when you encounter an END/BREAK instruction you just revert to the branch's origin point and continue down the line.
I think the solution is to add a forward reference that is resolved (by looking up the location of the end of the loop) when all the code for that level of loop has been produced.
In other words, when generating the code for the loop, you need to emit a jump instruction whose target you don't know yet. The solution is to give the jump a placeholder destination - set it to instruction 0, -1, 0xdeaddead, or something else that can easily be identified later for debugging purposes, because the best way to avoid "I didn't fix it up properly" bugs is to make those places easy to spot (bugs only occur in things that are hard to identify, just like it never rains when you carry an umbrella). Keep a fixup list of such jumps until you have generated the entire loop, then work your way through that list and fill in the address that you now know is "here" (the next instruction after the loop). I suspect you also need something similar for the condition of the loop itself - if it's false, you need to continue after the loop.
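A minimal sketch of that fixup-list idea, with made-up types rather than the asker's VM API:
#include <vector>

struct JumpStmt {
    int target = -1;                     // -1 = placeholder, easy to spot
    void setTarget(int t) { target = t; }
};

struct LoopContext {
    std::vector<JumpStmt *> breakFixups; // jumps waiting for the loop's end
};

// On 'break': emit a jump with a placeholder target and record it in
// the innermost loop's context.
JumpStmt *emitBreak(LoopContext &loop) {
    JumpStmt *jump = new JumpStmt();
    loop.breakFixups.push_back(jump);
    return jump;
}

// At 'end loop': patch every pending break to jump past the loop.
void fixupBreaks(LoopContext &loop, int endLocation) {
    for (JumpStmt *jump : loop.breakFixups)
        jump->setTarget(endLocation);
}
For nested loops, keep a stack of LoopContexts so each break registers with the innermost enclosing loop.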
I added setTarget as a virtual function of Stmt. I stored the start location in the part that handles the if statements, then checked whether there were any break statements between the start location and the current location; if there were, I set their target to the current location.
It's a really messy way to do it, but it works for now.

How can I avoid using the stack with continuation-passing style?

For my diploma thesis I chose to implement the task of the ICFP 2004 contest.
The task--as I translated it to myself--is to write a compiler which translates a high-level ant-language into a low-level ant-assembly. In my case this means using a DSL written in Clojure (a Lisp dialect) as the high-level ant-language to produce ant-assembly.
UPDATE:
The ant-assembly has several restrictions: there are no assembly instructions for calling functions (that is, I can't write CALL function1, param1), none for returning from functions, and none for pushing return addresses onto a stack. Also, there is no stack at all (for passing parameters), no heap, and no memory of any kind. The only thing I have is a GOTO/JUMP instruction.
Actually, the ant-assembly describes the transitions of a state machine (= the ants' "brain"). For "function calls" (= state transitions), all I have is a JUMP/GOTO.
While I have nothing like a stack, a heap or a proper CALL instruction, I would still like to be able to call functions in the ant-assembly (by JUMPing to certain labels).
In several places I read that by transforming my Clojure DSL function calls into continuation-passing style (CPS), I can avoid using the stack[1] and can translate my ant-assembly function calls into plain JUMPs (or GOTOs). This is exactly what I need, because in the ant-assembly I have no stack at all, only a GOTO instruction.
My problem is that after an ant-assembly function has finished, I have no way to tell the interpreter (which interprets the ant-assembly instructions) where to continue. Maybe an example helps:
The high-level Clojure DSL:
(defn search-for-food [cont]
  (sense-food-here?              ; a conditional w/ 2 branches
    (pickup-food                 ; true branch, food was found
      (go-home                   ; ***
        (drop-food
          (search-for-food cont))))
    (move                        ; false branch, continue searching
      (search-for-food cont))))

(defn run-away-from-enemy [cont]
  (sense-enemy-here?             ; a conditional w/ 2 branches
    (go-home                     ; ***
      (call-help-from-others cont))
    (search-for-food cont)))

(defn go-home [cont]
  (turn-backwards
    ; don't bother that this "while" is not in CPS now
    (while (not (sense-home-here?))
      (move)))
  (cont))
The ant-assembly I'd like to produce from the go-home function is:
FUNCTION-GO-HOME:
  turn left nextline
  turn left nextline
  turn left nextline                          ; now we turned backwards
SENSE-HOME:
  sense here home WE-ARE-AT-HOME CONTINUE-MOVING
CONTINUE-MOVING:
  move SENSE-HOME
WE-ARE-AT-HOME:
  JUMP ???
FUNCTION-DROP-FOOD:
  ...
FUNCTION-CALL-HELP-FROM-OTHERS:
  ...
The syntax for the ant-asm instructions above:
turn direction which-line-to-jump
sense direction what jump-if-true jump-if-false
move which-line-to-jump
My problem is that I can't work out what to write on the last line of the assembly (JUMP ???), because - as you can see in the example - go-home can be invoked with two different continuations:
(go-home
  (drop-food))
and
(go-home
  (call-help-from-others))
After go-home has finished I'd like to call either drop-food or call-help-from-others. In assembly: after I arrived at home (=the WE-ARE-AT-HOME label) I'd like to jump either to the label FUNCTION-DROP-FOOD or to the FUNCTION-CALL-HELP-FROM-OTHERS.
How could I do that without a stack, without PUSHing the address of the next instruction (= FUNCTION-DROP-FOOD / FUNCTION-CALL-HELP-FROM-OTHERS) onto a stack? My problem is that I don't understand how continuation-passing style (= no stack, only GOTO/JUMP) could help me solve this.
(I can try to explain this again if the things above are incomprehensible.)
And huge thanks in advance for your help!
--
[1] "interpreting it requires no control stack or other unbounded temporary storage". Steele: Rabbit: a compiler for Scheme.
Yes, you've provided the precise motivation for continuation-passing style.
It looks like you've partially translated your code into continuation-passing-style, but not completely.
I would advise you to take a look at PLAI, but I can show you a bit of how your function would be transformed, assuming I can guess at Clojure syntax and mix in Scheme's lambda.
(defn search-for-food [cont]
  (sense-food-here?            ; a conditional w/ 2 branches
    (search-for-food
      (lambda (r)
        (drop-food r
          (lambda (s)
            (go-home s cont)))))
    (search-for-food
      (lambda (r)
        (move r cont)))))
I'm a bit confused by the fact that you're searching for food whether or not you sense food here, and I find myself suspicious that either this is weird half-translated code, or just doesn't mean exactly what you think it means.
Hope this helps!
And really: go take a look at PLAI. The CPS transform is covered in good detail there, though there's a bunch of stuff for you to read first.
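To see why CPS removes the need for a return address, here is a tiny C++ rendering of the idea (all names made up, nothing assumed about the asker's DSL): every transfer of control is a tail call, which a compiler can lower to a plain jump.
#include <cstdio>

using Cont = void (*)();

void drop_food() { std::puts("drop-food"); }
void call_help_from_others() { std::puts("call-help-from-others"); }

// go_home never returns; it jumps on to whatever continuation it was given.
void go_home(Cont cont) {
    std::puts("go-home");
    cont(); // a tail call: no return address needs to be saved
}

int main() {
    go_home(drop_food);             // "JUMP ???" becomes "JUMP FUNCTION-DROP-FOOD"
    go_home(call_help_from_others); // same code, different continuation
}
Of course, passing the continuation still needs some way to carry a parameter, which is exactly the limitation the next answer discusses.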
Your ant assembly language is not even Turing-complete. You said it has no memory, so how are you supposed to allocate the environments for your function calls? You can at most get it to accept regular languages and simulate finite automata: anything more complex requires memory. To be Turing-complete you'll need what amounts to a garbage-collected heap. To do everything you need to do to evaluate CPS terms you'll also need an indirect GOTO primitive. Function calls in CPS are basically (possibly indirect) GOTOs that provide parameter passing, and the parameters you pass require memory.
Clearly, your two basic options are to inline everything, with no "external" procedures (for extra credit, look up the original meaning of "internal" and "external" here), or to somehow "remember" where you need to go on "return" from a procedure "call" (the return point does not necessarily have to fall in the physical locations immediately following the calling point). Basically, the return point identifier can be a code address, an index into a branch table, or even a character symbol - it just needs to identify the return target relative to the called procedure.
The most obvious approach here is to track, in your compiler, all of the return targets for a given call target; then, at the end of the called procedure, build a branch table (or branch ladder) that selects among the possible return targets. (In most cases there are only a handful of possible return targets, though for commonly used procedures there could be hundreds or thousands.) At the call point, the caller then loads a parameter with the index of its return point relative to the called procedure.
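A toy C++ rendering of that branch-ladder technique, using goto to mimic the ant-assembly's JUMP (all names made up): the caller sets a return-point index before jumping, and the callee's last line dispatches on it.
#include <cstdio>

// retpt plays the role of the parameter the caller loads before jumping.
void run(int retpt) {
    goto GO_HOME;

RET0: // return point for the search-for-food path
    std::puts("drop-food");
    return;

RET1: // return point for the run-away path
    std::puts("call-help-from-others");
    return;

GO_HOME:
    std::puts("go-home");
    // The branch ladder standing in for "JUMP ???":
    if (retpt == 0) goto RET0;
    goto RET1;
}

int main() {
    run(0); // "call" go-home, then drop-food
    run(1); // "call" go-home, then call-help-from-others
}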
Obviously, if the callee in turn calls another procedure, the first return point identifier must be preserved somehow.
Continuation passing is, after all, just a more generalized form of a return address.
You might be interested in Andrew Appel's book Compiling with Continuations.