Checking for backedges in an LLVM pass - c++

I am writing an LLVM pass that modifies the intermediate code. I want to check each terminating instruction of a basic block to see if it has a back edge. To make this clearer: in the following example, I want to see whether reaching the labels land.lhs.true or if.end requires a back jump.
entry:
%pa = alloca %struct.Vertex, align 4
.........
br i1 %cmp, label %land.lhs.true, label %if.end

Not sure what you mean by back edge or back jump here, as LLVM intermediate code has no explicit layout in memory. You should think of the basic blocks within each function as having no explicit order and no explicit assignment to memory addresses. This is handled by the backend when emitting assembly code.
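Since layout is not available, any notion of "back" has to come from the control flow graph itself. One common layout-independent definition, assumed here because the question does not say which one is meant, is that an edge is a back edge if its target is still on the stack during a depth-first walk from the entry block. A minimal sketch on a toy graph follows; Cfg and findBackEdges are hypothetical names for illustration, not LLVM API:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy CFG: block i has successor indices succ[i]. Block 0 is the entry.
// This is an illustrative model, not LLVM's BasicBlock.
using Cfg = std::vector<std::vector<std::size_t>>;
using Edge = std::pair<std::size_t, std::size_t>;

enum class Color { White, Grey, Black };

static void dfs(const Cfg &succ, std::size_t block, std::vector<Color> &color,
                std::vector<Edge> &backEdges) {
  color[block] = Color::Grey;  // block is on the current DFS path
  for (std::size_t next : succ[block]) {
    assert(next < succ.size() && "successor index out of range");
    if (color[next] == Color::Grey)
      backEdges.emplace_back(block, next);  // target is an ancestor on the path
    else if (color[next] == Color::White)
      dfs(succ, next, color, backEdges);
  }
  color[block] = Color::Black;  // fully explored
}

// Returns every edge (from, to) whose target is still on the DFS stack.
std::vector<Edge> findBackEdges(const Cfg &succ) {
  std::vector<Color> color(succ.size(), Color::White);
  std::vector<Edge> backEdges;
  if (!succ.empty())
    dfs(succ, 0, color, backEdges);
  return backEdges;
}
```

For a graph where block 2 branches back to block 1, this reports the single edge (2, 1). In a real pass the same walk would run over the successors of each block's terminator.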

Related

Checking the top bits of an i64 Value in LLVM IR

I am going to keep this short and to the point, but if further clarifications are needed please let me know.
I have an i64 Value that I want to check the top bits of if they are zeros or not. If they are zeros, I would do something, if they are not, I would do something else. How do I instrument the IR to allow this to happen at runtime?
One thing I found is that LLVM has an intrinsic "llvm.ctlz" that counts the leading zeros and puts them in an i64 Value, but how do I use its return value to do the checking? Or how do I instrument so the checking happens at runtime?
Any help or suggestions would be appreciated. Thanks!
You didn't say how many top bits, so I'll do an example with the top 32 bits. Given i64 %x, I'd check it with %result = icmp uge i64 %x, 4294967296 because 4294967296 is 2^32, the smallest value that has a 1 bit among the top 32 bits. If you want to check that the top two bits are zero, use 2^62 (4611686018427387904) instead.
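To see why a single unsigned comparison is enough, here is the same check written in plain C++ on a 64-bit unsigned value, next to the equivalent shift-and-test form. This is only a sketch of the arithmetic; the function names are made up for illustration and n is assumed to be between 1 and 63:

```cpp
#include <cassert>
#include <cstdint>

// True if any of the top n bits of x is set (1 <= n <= 63).
// Mirrors an unsigned "x >= 2^(64-n)" comparison, as icmp uge would do.
bool topBitsNonZero(std::uint64_t x, unsigned n) {
  assert(n >= 1 && n <= 63 && "n must leave at least one low bit");
  const std::uint64_t threshold = std::uint64_t{1} << (64 - n);
  return x >= threshold;
}

// Equivalent formulation: shift the low bits away and test for zero.
bool topBitsNonZeroByShift(std::uint64_t x, unsigned n) {
  assert(n >= 1 && n <= 63 && "n must leave at least one low bit");
  return (x >> (64 - n)) != 0;
}
```

With n = 32 the threshold is 4294967296, and with n = 2 it is 4611686018427387904, matching the constants above.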
In order to do two different things based on the value of %result, in general you'll want to branch on it. BasicBlock has a method splitBasicBlock that takes an instruction to split at. Use that to split your block into a before and an after. Create new blocks for the true side and the false side, and add a branch on your result to the new blocks: br i1 %result, label %cond_true, label %cond_false. Make sure those two new blocks branch back to the after block.
Depending on what you want to do, you may not need an entire block. For instance, if you're only calculating a value and not doing any side-effecting operations, you might be able to use a select instruction instead of a branch and separate blocks.

How LLVM mem2reg pass works

mem2reg is an important optimization pass in llvm. I want to understand how this optimization works but didn't find good articles, books, tutorials and similar.
I found these two links:
https://blog.katastros.com/a?ID=01300-3d6589c1-1993-4fb1-8975-939f10c20503
https://www.zzzconsulting.se/2018/07/16/llvm-exercise.html
Both links explain that one can use Cytron's classical SSA algorithm to implement this pass, but reading the original paper I didn't see how alloca instructions are converted to registers.
As alloca is an instruction specific to LLVM IR, I wonder if the algorithm that converts alloca instructions to registers is an ad-hoc algorithm that only works for LLVM, or if there is a general theoretical framework, whose name I just don't know yet, that explains how to promote memory variables to register variables.
On the official documentation, it says:
mem2reg: This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.
By this description, it seems that one just needs to check whether all users of the variable are load and store instructions, and if so, it can be promoted to a register.
So if you can link me to articles, books, algorithms and so on, I would appreciate it.
"Efficiently Computing Static Single Assignment Form and the Control Dependence Graph" by Cytron, Ferrante et al.
The alloca instruction is named after the alloca() function in C, the rarely used counterpart to malloc() that allocates memory on the stack instead of the heap, so that the memory disappears when the function returns without needing to be explicitly freed.
In the paper, figure 2 shows an example of "straight-line code":
V ← 4
← V + 5
V ← 6
← V + 7
That text isn't valid LLVM IR. If we wanted to rewrite the same example in pre-mem2reg LLVM IR, it would look like this:
; V ← 4
store i32 4, i32* %V.addr
; ← V + 5
%tmp1 = load i32* %V.addr
%tmp2 = add i32 %tmp1, 5
store i32 %tmp2, i32* %V.addr
; V ← 6
store i32 6, i32* %V.addr
; ← V + 7
%tmp3 = load i32* %V.addr
%tmp4 = add i32 %tmp3, 7
store i32 %tmp4, i32* %V.addr
It's easy enough to see in this example how you could always replace %tmp1 with i32 4 using store-to-load forwarding, but you can't always remove the final store. Knowing nothing about %V.addr means that we must assume it may be used elsewhere.
If you know that %V.addr is an alloca, you can simplify a lot of things. You can see the alloca instruction, so you know you can see all of its uses, and you never have to worry that a store to some other %ptr may alias with %V.addr. You see the allocation, so you know what its alignment is. You know that accesses through the pointer cannot fault anywhere in the function, because there is no free() equivalent for an alloca that anyone could call. If you see a store that isn't loaded before the end of the function, you can eliminate it as a dead store, since the alloca's lifetime ends at the function return. Finally, you may delete the alloca itself once you've removed all its uses.
Thus if you start with an alloca whose only users are loads and stores, the problem has little in common with most memory optimization problems: there's no possibility of pointer aliasing nor concern about faulting memory accesses. What you do need to do is place the ϕ functions (phi instructions in LLVM) in the right places given the control flow graph, and that's what the paper describes how to do.
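For the straight-line case, where no ϕ functions are needed, the rewrite boils down to remembering the last stored value and handing it to each load. Here is a minimal sketch of that idea on a toy instruction list. Inst and promoteStraightLine are hypothetical names, the model only stores constants, and this is not how LLVM's pass is implemented; with more than one basic block you would additionally need the ϕ placement from the paper:

```cpp
#include <cassert>
#include <vector>

// Toy instruction acting on one stack slot. Hypothetical model for
// illustration only, not LLVM's StoreInst/LoadInst.
struct Inst {
  enum Kind { Store, Load } kind;
  int value;  // Store: the constant being stored. Load: unused.
};

// For a single basic block, resolves each load to the most recently stored
// value. After this, every store and the slot itself could be dropped.
// Returns the value each load resolves to, in program order.
std::vector<int> promoteStraightLine(const std::vector<Inst> &block) {
  bool initialized = false;  // has the slot been stored to yet?
  int current = 0;           // value currently held by the slot
  std::vector<int> loadValues;
  for (const Inst &inst : block) {
    if (inst.kind == Inst::Store) {
      current = inst.value;  // later loads will see this value
      initialized = true;
    } else {
      assert(initialized && "load of an uninitialized slot");
      loadValues.push_back(current);
    }
  }
  return loadValues;
}
```

Fed the sequence "store 4, load, store 6, load" from the example, the two loads resolve to 4 and 6, which is what the two versions of V stand for in the paper's figure.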

Use of inserted instruction is necessary or not

In LLVM, if we insert an instruction into the IR through a pass, is it necessary to also insert an instruction that uses the result of the inserted instruction, or to store that result into some variable already present in the IR, so that it is not useless?
For example, can't I insert the instruction
%result = add i32 4, 3
where %result is not used in subsequent instructions?
You should be able to insert it but if an optimization pass runs after your pass it might be eliminated because it's unused and doesn't have side effects.
No, it's absolutely not necessary. If you insert the instruction properly (i.e. use the API correctly), it can be left unused.
As a matter of fact, unused values can be left around by various optimization passes as well. LLVM has other passes like DCE (dead code elimination) that will remove unused instructions.
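To make the "unused and no side effects" criterion concrete, here is a small sketch of dead code elimination over a toy instruction list. The Inst struct and eliminateDeadCode are hypothetical and far simpler than LLVM's DCE pass; the point is only that an instruction survives if something uses it or if it has side effects:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction. Hypothetical model, not LLVM's Instruction class.
struct Inst {
  int id;                     // name of the value this instruction defines
  std::vector<int> operands;  // ids of the values it uses
  bool hasSideEffects;        // e.g. a store or a call
};

// Repeatedly removes instructions that have no users and no side effects.
std::vector<Inst> eliminateDeadCode(std::vector<Inst> insts) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t i = 0; i < insts.size(); ++i) {
      if (insts[i].hasSideEffects)
        continue;  // must be kept even if its result is unused
      const int id = insts[i].id;
      const bool used = std::any_of(
          insts.begin(), insts.end(), [id](const Inst &other) {
            return std::find(other.operands.begin(), other.operands.end(),
                             id) != other.operands.end();
          });
      if (!used) {
        insts.erase(insts.begin() + static_cast<std::ptrdiff_t>(i));
        changed = true;
        break;  // restart the scan: removing one may expose another
      }
    }
  }
  return insts;
}
```

An unused add like the one in the question would be removed by such a pass, while an unused call or store would stay.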

Using the address from LLVM store instruction to create another

I'm working with LLVM to take a store instruction and replace it with another so that I can take something like
store i64 %0, i64* %a
and replace it with
store i64 <value>, i64* %a
I've used
llvm::Value *str = i->getOperand(1);
to get the address that my old instruction is using, and then I create a new store via (i is the current instruction location, so this store will be created before the store I'm replacing)
StoreInst *store = new StoreInst(value, str, i);
I then delete the store I've replaced with
i->eraseFromParent();
But I'm getting the error:
While deleting: i64%
Use still stuck around after Def is destroyed: store i64 , i64* %a
and a failure message saying that the assertion use_empty() && "Uses remain when a value is destroyed!" failed.
How could I get around this? I'd love to create a store instruction and then use LLVM's ReplaceInstWithInst, but I can't find a way to create a store instruction without giving it a location to insert itself. I'm also not 100% sure that will solve my issue.
I'll add that prior to my store replacement, I'm matching an instruction i, then getting the value I need before performing i->eraseFromParent, so I'm not sure if that is part of my problem; I'm assuming that eraseFromParent moves i along to the following store instruction.
eraseFromParent removes an instruction from the enclosing basic block (and consequently, from the enclosing function). It doesn't move it anywhere. Erasing an instruction this way without taking care of its uses first will leave your IR malformed, which is why you're getting the error - it's as if you deleted line 1 from the following C snippet:
1 int x = 3;
2 int y = x + 1;
Obviously you'll get an error on the remaining line: the definition of x is now missing!
ReplaceInstWithInst is probably the best way to replace one instruction with another. You don't need to supply the new instruction with an insertion location: just pass NULL for that argument (or better yet, omit it) and you will get a dangling instruction, which you can then place wherever you want.
Because of the problem described above, by the way, the key method that ReplaceInstWithInst invokes is Value::replaceAllUsesWith, which ensures that you won't be left with missing values in your IR.
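The idea behind replacing all uses before erasing can be sketched with a toy value class that tracks its users. This is a hypothetical model that only mimics the concept; the Value struct below is not llvm::Value and its members are invented for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy value with operand and user lists. Hypothetical model, not LLVM's
// implementation of use-def chains.
struct Value {
  std::vector<Value *> operands;  // values this one reads
  std::vector<Value *> users;     // values that read this one

  void addOperand(Value *v) {
    operands.push_back(v);
    v->users.push_back(this);
  }

  // Redirects every user of this value to `replacement`.
  void replaceAllUsesWith(Value *replacement) {
    assert(replacement != this && "cannot replace a value with itself");
    for (Value *user : users) {
      std::replace(user->operands.begin(), user->operands.end(), this,
                   replacement);
      replacement->users.push_back(user);
    }
    users.clear();
  }

  // Only a value with no remaining users may be deleted safely.
  bool safeToErase() const { return users.empty(); }
};
```

In terms of the C snippet, x is not safe to erase while y still reads it. After x.replaceAllUsesWith(&other), y reads other instead and x can go.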

effect of goto on C++ compiler optimization

What are the performance benefits or penalties of using goto with a modern C++ compiler?
I am writing a C++ code generator and use of goto will make it easier to write. No one will touch the resulting C++ files so don't get all "goto is bad" on me. As a benefit, they save the use of temporary variables.
I was wondering, from a purely compiler optimization perspective, the result that goto has on the compiler's optimizer? Does it make code faster, slower, or generally no change in performance compared to using temporaries / flags.
The part of a compiler that would be affected works with a flow graph. The syntax you use to create a particular flow graph will normally be irrelevant as long as you're writing strictly portable code: if you create something like a while loop using a goto instead of an actual while statement, it's going to produce the same flow graph as if you had used the syntax for a while loop. Using non-portable code, however, modern compilers allow you to add annotations to loops to predict whether they'll be taken or not. Depending on the compiler, you may or may not be able to duplicate that extra information using a goto (but most compilers that have annotations for loops also have annotations for if statements, so a likely-taken or likely-not-taken annotation on the if controlling a goto would generally have the same effect as a similar annotation on the corresponding loop).
It is possible, however, to produce a flow graph with gotos that couldn't be produced by any normal flow control statements (loops, switch, etc.), such as conditionally jumping directly into the middle of a loop, depending on the value in a global. In such a case, you may produce an irreducible flow graph, and when/if you do, that will often limit the ability of the compiler to optimize the code.
In other words, if (for example) you took code that was written with normal for, while, switch, etc., and converted it to use goto in every case, but retained the same structure, almost any reasonably modern compiler could probably produce essentially identical code either way. If, however, you use gotos to produce the mess of spaghetti like some of the FORTRAN I had to look at decades ago, then the compiler probably won't be able to do much with it.
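A small sketch of both situations, with made-up function names. The first two functions have the same flow graph (one loop header, one back edge) even though one uses while and the other goto. The third jumps into the middle of the loop body, so the loop has two entry points, which is the kind of shape that gives an irreducible flow graph at the source level; it uses a parameter rather than a global to decide, purely to keep the example self-contained:

```cpp
#include <cassert>

// Structured loop: sums 1..n.
int sumWhile(int n) {
  int total = 0;
  int i = 1;
  while (i <= n) {
    total += i;
    ++i;
  }
  return total;
}

// Same control flow written with goto: one loop header, one back edge.
int sumGoto(int n) {
  int total = 0;
  int i = 1;
head:
  if (i > n) goto done;
  total += i;
  ++i;
  goto head;
done:
  return total;
}

// Conditionally jumps into the middle of the loop body, so the cycle can be
// entered both at the loop condition and at the increment.
int sumSkippingFirstAdd(int n, bool skipFirst) {
  assert(n >= 0 && "n is expected to be non-negative");
  int total = 0;
  int i = 1;
  if (skipFirst) goto increment;  // second entry into the loop
  while (i <= n) {
    total += i;
  increment:
    ++i;
  }
  return total;
}
```

Whether a particular compiler optimizes the third function worse is not something this sketch shows; it only illustrates the shape of the graph being discussed.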
How do you think loops are represented at the assembly level?
Using jump instructions to labels...
Many compilers will actually use jumps even in their Intermediate Representation:
int loop(int* i) {
    int result = 0;
    while(*i) {
        result += *i;
    }
    return result;
}
int jump(int* i) {
    int result = 0;
    while (true) {
        if (not *i) { goto end; }
        result += *i;
    }
end:
    return result;
}
Yields in LLVM:
define i32 @_Z4loopPi(i32* nocapture %i) nounwind uwtable readonly {
  %1 = load i32* %i, align 4, !tbaa !0
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %3, label %.lr.ph..lr.ph.split_crit_edge
.lr.ph..lr.ph.split_crit_edge: ; preds = %.lr.ph..lr.ph.split_crit_edge, %0
  br label %.lr.ph..lr.ph.split_crit_edge
; <label>:3 ; preds = %0
  ret i32 0
}
define i32 @_Z4jumpPi(i32* nocapture %i) nounwind uwtable readonly {
  %1 = load i32* %i, align 4, !tbaa !0
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %3, label %.lr.ph..lr.ph.split_crit_edge
.lr.ph..lr.ph.split_crit_edge: ; preds = %.lr.ph..lr.ph.split_crit_edge, %0
  br label %.lr.ph..lr.ph.split_crit_edge
; <label>:3 ; preds = %0
  ret i32 0
}
Where br is the branch instruction (a jump, either conditional or unconditional).
All optimizations are performed on this structure. So, goto is the bread and butter of optimizers.
I was wondering, from a purely compiler optimization perspective, the result that goto's have on the compiler's optimizer? Does it make code faster, slower, or generally no change in performance compared to using temporaries / flags.
Why do you care? Your primary concern should be getting your code generator to create the correct code. Efficiency is of much less importance than correctness. Your question should be "Will my use of gotos make my generated code more likely or less likely to be correct?"
Look at the code generated by lex/yacc or flex/bison. That code is chock full of gotos. There's a good reason for that. lex and yacc implement finite state machines. Since the machine goes to another state at state transitions, the goto is arguably the most natural tool for such transitions.
There is a simple way to eliminate those gotos in many cases by using a while loop around a switch statement. This is structured code. Per Douglas Jones (Jones D. W., How (not) to code a finite state machine, SIGPLAN Not. 23, 8 (Aug. 1988), 19-22.), this is the worst way to encode an FSM. He argues that a goto-based scheme is better.
He also argues that there is an even better approach, which is to convert your FSM to a control flow diagram using graph theory techniques. That's not always easy. It is an NP-hard problem. That's why you still see a lot of FSMs, particularly auto-generated FSMs, implemented as either a loop around a switch or with state transitions implemented via gotos.
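To make the two common encodings concrete, here is a tiny two-state machine written both ways. The machine itself (accept binary strings with an even number of '1' characters) and the function names are invented for illustration and are not taken from Jones's paper:

```cpp
#include <cassert>
#include <string>

// Loop-around-switch encoding: the state lives in a variable and every
// input character goes through the switch.
bool evenOnesSwitch(const std::string &input) {
  enum class State { Even, Odd } state = State::Even;
  for (char c : input) {
    assert((c == '0' || c == '1') && "input must be binary digits");
    switch (state) {
      case State::Even:
        if (c == '1') state = State::Odd;
        break;
      case State::Odd:
        if (c == '1') state = State::Even;
        break;
    }
  }
  return state == State::Even;
}

// goto encoding: each label is a state and each goto is a transition, so the
// current state is implied by the position in the code.
bool evenOnesGoto(const std::string &input) {
  std::string::size_type pos = 0;
even:
  if (pos == input.size()) return true;
  if (input[pos++] == '1') goto odd;
  goto even;
odd:
  if (pos == input.size()) return false;
  if (input[pos++] == '1') goto even;
  goto odd;
}
```

Both functions accept the same strings. The goto version has no state variable to load and dispatch on, which is the usual argument for why generated code tends to use it.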
I agree heartily with David Hammen's answer, but I only have one point to add.
When people are taught about compilers, they are taught about all the wonderful optimizations that compilers can do.
They are not taught that the actual value of this depends on who the user is.
If the code you are writing (or generating) and compiling contains very few function calls and could itself consume a large fraction of some other program's time, then yes, compiler optimization matters.
If the code being generated contains function calls, or if for some other reason the program counter spends a small fraction of its time in the generated code, it's not worth worrying about.
Why? Because even if that code could be so aggressively optimized that it took zero time, it would save no more than that small fraction, and there are probably much bigger performance issues, that the compiler can't fix, that are happy to be evading your attention.