When using a phi node in a basic block, is there a suggested order in which I should place the labels if one predecessor block is more likely than the others? For example, take the simple factorial function listed below.
define private i64 @fact(i64 %start) {
entry:
%0 = icmp sle i64 1, %start
br i1 %0, label %loop, label %endcond
loop: ; preds = %loop, %entry
%1 = phi i64 [ %res, %loop ], [ 1, %entry ] ; if %start > 2 predecessor
%2 = phi i64 [ %3, %loop ], [ %start, %entry ] ; is likely %loop
%res = mul i64 %1, %2
%3 = sub i64 %2, 1
%cond = icmp sle i64 1, %3
br i1 %cond, label %loop, label %endcond
endcond: ; preds = %loop, %entry
%fin = phi i64 [ %res, %loop ], [ 1, %entry ] ; highly unlikely
ret i64 %fin ; predecessor is %entry
}
While it is possible that the user will call @fact(1), it is unlikely, so I expect that in most cases the predecessor block for the phi node in endcond will be post.loop. So is my assumption that in this case
%fin = phi i64 [ %res, %post.loop ], [ 1, %entry ]
is better than
%fin = phi i64 [ 1, %entry ], [ %res, %post.loop ]
correct? And if so, why or why not?
It doesn't make a difference. The order of the incoming pairs in a phi node carries no meaning. LLVM analyzes your code to estimate the branch probabilities, and it uses those estimates to order the resulting blocks.
You can influence this by using branch weight metadata: http://llvm.org/docs/BlockFrequencyTerminology.html
I am a newbie to LLVM, and I am trying to change the type of a loop variable (PHINode).
For example, I have an IR as follows:
for.cond1.preheader: ; preds = %entry, %for.inc15
%k.03 = phi i32 [ 0, %entry ], [ %inc16, %for.inc15 ]
br label %for.cond4.preheader
for.cond4.preheader: ; preds = %for.cond1.preheader, %for.inc12
%indvars.iv5 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next6, %for.inc12 ]
br label %for.body6
for.body6: ; preds = %for.cond4.preheader, %for.body6
......
%add7 = add nsw i32 %2, 1
......
for.inc12: ; preds = %for.body6
......
for.inc15: ; preds = %for.inc12
%inc16 = add nuw nsw i32 %k.03, 1
%exitcond9 = icmp ne i32 %inc16, 100
br i1 %exitcond9, label %for.cond1.preheader, label %for.end17, !llvm.loop !7
I want to change the type of the variable %k.03 from 32-bit to 64-bit, and recursively change all of its references (uses) to 64-bit as well. The desired result is as follows:
for.cond1.preheader: ; preds = %entry, %for.inc15
%k.03 = phi i64 [ 0, %entry ], [ %inc16, %for.inc15 ]
br label %for.cond4.preheader
for.cond4.preheader: ; preds = %for.cond1.preheader, %for.inc12
%indvars.iv5 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next6, %for.inc12 ]
br label %for.body6
for.body6: ; preds = %for.cond4.preheader, %for.body6
......
%add7 = add nsw i32 %2, 1
......
for.inc12: ; preds = %for.body6
......
for.inc15: ; preds = %for.inc12
%inc16 = add nuw nsw i64 %k.03, 1
%exitcond9 = icmp ne i64 %inc16, 100
br i1 %exitcond9, label %for.cond1.preheader, label %for.end17, !llvm.loop !7
Then my approach is as follows:
static void __ChangeInsTo64Bit(User *inst) {
if (!std::strcmp(cast<Instruction>(inst)->getOpcodeName(), "br")) {
return;
}
if (std::strcmp(cast<Instruction>(inst)->getOpcodeName(), "icmp")) {
inst->mutateType(inst->getType()->getWithNewBitWidth(64));
}
for (auto OI = inst->op_begin(), OE = inst->op_end(); OI != OE; ++OI) {
Value *val = *OI;
val->mutateType(val->getType()->getWithNewBitWidth(64));
LLVM_DEBUG(dbgs() << "The Operand is: " << val->getName()
<< "; The bitwidth is: " << val->getType()->getIntegerBitWidth() << "\n");
}
}
static void __ChangePHINodeWidthTo64(Loop *OuterLoop, ScalarEvolution *SE) {
PHINode *OuterPHINode = OuterLoop->getInductionVariable(*SE);
if (!OuterPHINode) {
return ;
}
unsigned int outerPHIWidth = OuterPHINode->getType()->getIntegerBitWidth();
if (outerPHIWidth == 32) {
__ChangeInsTo64Bit(OuterPHINode);
for (User* user : OuterPHINode->users()) {
__ChangeInsTo64Bit(user);
for (User* u : user->users()) {
__ChangeInsTo64Bit(u);
}
}
}
}
With the above code I can achieve my goal, but it also unexpectedly modifies a data type in the basic block for.body6:
%add7 = add nsw i32 %2, i64 1
This causes program errors. Can someone point out what is wrong with my approach, or suggest the correct approach?
if (!std::strcmp(cast<Instruction>(inst)->getOpcodeName(), "br")) {
return;
}
if (std::strcmp(cast<Instruction>(inst)->getOpcodeName(), "icmp")) {
inst->mutateType(inst->getType()->getWithNewBitWidth(64));
}
That seems like a very janky and error-prone way to do RTTI, especially since LLVM provides the isa<> template for that very purpose.
Case in point, the "icmp" check will match everything that is not icmp (since there is no ! in front of the condition), and I would expect that this is not what you intended.
At the very least, the following is equivalent to your code and a lot more legible:
if (isa<BranchInst>(inst)) {
return;
}
if (!isa<ICmpInst>(inst)) {
inst->mutateType(inst->getType()->getWithNewBitWidth(64));
}
Removing the negation in front of the second test should address at least part of your issue.
Trying to understand phi instruction semantics in llvm-IR
(https://llvm.org/docs/LangRef.html#phi-instruction)
Let's consider the following example:
; Function Attrs: norecurse nounwind
define i32 @main(i32 %argc, i8** %argv) {
entry:
switch i32 %argc, label %L1 [ i32 0, label %L0
i32 1, label %L1 ]
L0:
%x = add i32 %argc, 1
br label %L1
L1:
%y = phi i32 [ %argc, %entry ], [ %x, %L0 ]
%z = sub i32 %y, 1
%w = udiv i32 100, %z
ret i32 %w
}
Compilation with clang-7.0.1
$ clang-7.0.1 -O0 test.ll -o a.out
PHINode should have one entry for each predecessor of its parent basic
block!
%y = phi i32 [ %argc, %entry ], [ %x, %L0 ]
fatal error: error in backend: Broken function found, compilation
aborted!
When I replaced "%y = phi ..." by "%y = add i32 2, 1" the test was compiled successfully.
The question is about the error message: why does it say that not all predecessors are listed in the phi in this test? I can't work it out from the description of the phi instruction in LangRef.html#phi-instruction.
When there are multiple edges from block A to block B, then the PHI nodes in B must list A as many times as there are edges, each time with the same value. In your case there are two edges from entry to L1 (one for the default case of the switch and one for the 1 case), so entry needs to be listed twice in the PHI node.
But perhaps the cleaner solution in this case would be to just remove the [i32 1, label %L1] case from your switch as that's redundant anyway. Then there'd only be one edge and you'd only need one entry for entry.
Say the IR code looks like:
define void @_Z1mbb(i1 zeroext %r, i1 zeroext %y) nounwind {
entry:
%r.addr = alloca i8, align 1
%y.addr = alloca i8, align 1
%l = alloca i8, align 1
%frombool = zext i1 %r to i8
store i8 %frombool, i8* %r.addr, align 1
%frombool1 = zext i1 %y to i8
store i8 %frombool1, i8* %y.addr, align 1
%0 = load i8* %y.addr, align 1
%tobool = trunc i8 %0 to i1
br i1 %tobool, label %lor.end, label %lor.rhs
lor.rhs: ; preds = %entry
%1 = load i8* %r.addr, align 1
%tobool2 = trunc i8 %1 to i1
br label %lor.end
lor.end: ; preds = %lor.rhs, %entry
%2 = phi i1 [ true, %entry ], [ %tobool2, %lor.rhs ]
%frombool3 = zext i1 %2 to i8
store i8 %frombool3, i8* %l, align 1
ret void
}
The phi node has two pairs: [ true, %entry ] and [ %tobool2, %lor.rhs ]. How do I extract %entry and %lor.rhs and find the corresponding basic block of each pair? Any help will be appreciated.
PHI->getNumIncomingValues() : returns the number of incoming values in the PHINode
For your phi node:
%2 = phi i1 [ true, %entry ], [ %tobool2, %lor.rhs ]
PHI->getIncomingValue(0) : gives true
PHI->getIncomingBlock(0) : gives %entry
PHI->getIncomingValue(1) : gives %tobool2
PHI->getIncomingBlock(1) : gives %lor.rhs
There are iterators for blocks and values as well.
http://llvm.org/doxygen/classllvm_1_1PHINode.html
Always refer to the Doxygen docs to see all the APIs associated with a class (e.g. PHINode).
I am getting the following error while inserting an instruction using an llvm pass:
Instruction does not dominate all uses!
%add = add nsw i32 10, 2
%cmp3 = icmp ne i32 %a.01, %add
Broken module found, compilation aborted!
I have the source code in a bitcode file whose snippet is:
if.then: ; preds = %entry
%add = add nsw i32 10, 2
br label %if.end
if.else: ; preds = %entry
%sub = sub nsw i32 10, 2
br label %if.end
if.end: ; preds = %if.else, %if.then
%a.0 = phi i32 [ %add, %if.then ], [ %sub, %if.else ]
%a.01 = call i32 @tauInt32Ty(i32 %a.0) ; line A
%add3 = add nsw i32 %a.01, 2
%add4 = add nsw i32 %a.01, 3
%call5 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([7 x i8]* @.str2, i32 0, i32 0), i32 %add3, i32 %add4)
I want to insert a new instruction after "line A" which is :
%cmp3 = icmp ne i32 %a.01, %add
And I have written a function pass whose snippet of the code which does this task is :
for (Function::iterator bb = F.begin(), e = F.end(); bb != e; ++bb) {
for (BasicBlock::iterator i = bb->begin(), e = bb->end(); i != e; ++i) {
std::string str;
if (isa<CallInst>(i)) {
BasicBlock::iterator next_it = i;
next_it++;
Instruction* next = dyn_cast<Instruction>(&*next_it);
CallInst* ci = dyn_cast<CallInst>(&*i);
Function* ff = ci->getCalledFunction();
str = ff->getName();
errs()<<"> "<<str<<"\n";
if(!str.compare("tauInt32Ty")) {
hotPathSSA1::varVersionWithPathsSet::iterator start = tauArguments[&*ci].begin();
hotPathSSA1::varVersionWithPathsSet::iterator end = tauArguments[&*ci].end();
Value* specArgs = start->second; // specArgs points to %add
ICmpInst* int1_cmp_56 = new ICmpInst(next, ICmpInst::ICMP_NE, ci, specArgs, "cmp3");
}
}
}
}
I have not encountered such a problem yet, but I think your problem is the if statement. %add belongs to the if.then basic block and is not accessible from the if.end block, because if.end can also be reached through if.else, where %add is never computed. This is why the phi instruction "chooses" whichever value is available, %add or %sub. So you have to pass %a.0, not %add, as the argument to your ICmpInst.
So at the suggestion of a colleague, I just tested the speed difference between the ternary operator and the equivalent If-Else block... and it seems that the ternary operator yields code that is between 1x and 2x faster than If-Else. My code is:
gettimeofday(&tv3, 0);
for(i = 0; i < N; i++)
{
a = i & 1;
if(a) a = b; else a = c;
}
gettimeofday(&tv4, 0);
gettimeofday(&tv1, 0);
for(i = 0; i < N; i++)
{
a = i & 1;
a = a ? b : c;
}
gettimeofday(&tv2, 0);
(Sorry for using gettimeofday and not clock_gettime... I will endeavor to better myself.)
I tried changing the order in which I timed the blocks, but the results seem to persist. What gives? Also, the If-Else shows much more variability in terms of execution speed. Should I be examining the assembly that gcc generates?
By the way, this is all at optimization level zero (-O0).
Am I imagining this, or is there something I'm not taking into account, or is this a machine-dependent thing, or what? Any help is appreciated.
There's a good chance that the ternary operator gets compiled into a cmov while the if/else results in a cmp+jmp. Just take a look at the assembly (using -S) to be sure. With optimizations enabled, it won't matter any more anyway, as any good compiler should produce the same code in both cases.
You could also go completely branchless and measure if it makes any difference:
int m = -(i & 1);
a = (b & m) | (c & ~m);
On today's architectures, this style of programming has grown a bit out of fashion.
This is a nice explanation: http://www.nynaeve.net/?p=178
Basically, there are "conditional set" processor instructions, which are faster than branching and setting in separate instructions.
If there is any difference, change your compiler!
For this kind of question I use the Try Out LLVM page. It's an old release of LLVM (still using the gcc front-end), but those are old tricks.
Here is my little sample program (simplified version of yours):
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main (int argc, char* argv[]) {
int N = atoi(argv[0]);
int a = 0, d = 0, b = atoi(argv[1]), c = atoi(argv[2]);
int i;
for(i = 0; i < N; i++)
{
a = i & 1;
if(a) a = b+i; else a = c+i;
}
for(i = 0; i < N; i++)
{
d = i & 1;
d = d ? b+i : c+i;
}
printf("%d %d", a, d);
return 0;
}
And here is the corresponding LLVM IR that was generated:
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%0 = load i8** %argv, align 8 ; <i8*> [#uses=1]
%N = tail call i32 @atoi(i8* %0) nounwind readonly ; <i32> [#uses=5]
%2 = getelementptr inbounds i8** %argv, i64 1 ; <i8**> [#uses=1]
%3 = load i8** %2, align 8 ; <i8*> [#uses=1]
%b = tail call i32 @atoi(i8* %3) nounwind readonly ; <i32> [#uses=2]
%5 = getelementptr inbounds i8** %argv, i64 2 ; <i8**> [#uses=1]
%6 = load i8** %5, align 8 ; <i8*> [#uses=1]
%c = tail call i32 @atoi(i8* %6) nounwind readonly ; <i32> [#uses=2]
%8 = icmp sgt i32 %N, 0 ; <i1> [#uses=2]
br i1 %8, label %bb, label %bb11
bb: ; preds = %bb, %entry
%9 = phi i32 [ %10, %bb ], [ 0, %entry ] ; <i32> [#uses=2]
%10 = add nsw i32 %9, 1 ; <i32> [#uses=2]
%exitcond22 = icmp eq i32 %10, %N ; <i1> [#uses=1]
br i1 %exitcond22, label %bb10.preheader, label %bb
bb10.preheader: ; preds = %bb
%11 = and i32 %9, 1 ; <i32> [#uses=1]
%12 = icmp eq i32 %11, 0 ; <i1> [#uses=1]
%.pn13 = select i1 %12, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp21 = add i32 %N, -1 ; <i32> [#uses=1]
%a.1 = add i32 %.pn13, %tmp21 ; <i32> [#uses=2]
br i1 %8, label %bb6, label %bb11
bb6: ; preds = %bb6, %bb10.preheader
%13 = phi i32 [ %14, %bb6 ], [ 0, %bb10.preheader ] ; <i32> [#uses=2]
%14 = add nsw i32 %13, 1 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %14, %N ; <i1> [#uses=1]
br i1 %exitcond, label %bb10.bb11_crit_edge, label %bb6
bb10.bb11_crit_edge: ; preds = %bb6
%15 = and i32 %13, 1 ; <i32> [#uses=1]
%16 = icmp eq i32 %15, 0 ; <i1> [#uses=1]
%.pn = select i1 %16, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp = add i32 %N, -1 ; <i32> [#uses=1]
%d.1 = add i32 %.pn, %tmp ; <i32> [#uses=1]
br label %bb11
bb11: ; preds = %bb10.bb11_crit_edge, %bb10.preheader, %entry
%a.0 = phi i32 [ %a.1, %bb10.bb11_crit_edge ], [ %a.1, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
%d.0 = phi i32 [ %d.1, %bb10.bb11_crit_edge ], [ 0, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
%17 = tail call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %a.0, i32 %d.0) nounwind ; <i32> [#uses=0]
ret i32 0
}
Okay, so this is likely to be hard to read, even though I went ahead and renamed some variables to make it a bit easier.
The important bits are these two blocks:
%.pn13 = select i1 %12, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp21 = add i32 %N, -1 ; <i32> [#uses=1]
%a.1 = add i32 %.pn13, %tmp21 ; <i32> [#uses=2]
%.pn = select i1 %16, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp = add i32 %N, -1 ; <i32> [#uses=1]
%d.1 = add i32 %.pn, %tmp ; <i32> [#uses=1]
Which respectively set a and d.
And the conclusion is: No difference
Note: in a simpler example the two variables actually got merged, it seems here that the optimizer did not detect the similarity...
Any decent compiler should generate the same code for these if optimisation is turned on.
Understand that it's entirely up to the compiler how it interprets a ternary expression (unless you actually force it not to with (inline) asm). It could just as easily treat a ternary expression as 'if..else' in its internal representation, and depending on the target backend, it may choose to generate a conditional move instruction (on x86, CMOVcc is one such instruction; there should also be ones for min/max, abs, etc.). The main motivation for using a conditional move is to trade the risk of a branch mispredict for a memory/register move operation. The caveat is that, nearly all the time, the operands that may be conditionally moved have to be evaluated into registers first to take advantage of the cmov instruction.
This means that the conditional evaluation now has to be unconditional, and this will appear to increase the length of the unconditional path of the program. But understand that a branch mispredict is most often resolved by 'flushing' the pipeline, which means that the instructions that would have finished executing are discarded (turned into no-operation instructions). The actual number of instructions executed is therefore higher because of the stalls or NOPs, and the effect scales with the depth of the processor pipeline and the misprediction rate.
This brings an interesting dilemma in determining the right heuristics. First, we know for sure that if the pipeline is too shallow, or the branch predictor is fully able to learn the pattern from branch history, then cmov is not worth doing. It's also not worth doing if the cost of evaluating the conditional arguments is, on average, greater than the cost of misprediction.
These are perhaps the core reasons why compilers have difficulty exploiting the cmov instruction: choosing the heuristic depends largely on runtime profiling information. It makes more sense to use this in a JIT compiler, since it can gather runtime instrumentation feedback and build stronger heuristics for using it ("Is the branch truly unpredictable?"). For a static compiler without training data or a profiler, it is much harder to tell when this will be useful. However, as mentioned above, a simple negative heuristic is that if the compiler knows that the dataset is completely random, or that forcing conditional evaluation to become unconditional is costly (perhaps due to irreducible, costly operations like fp divides), it makes sense not to do this.
Any compiler worth its salt will do all that. Question is, what will it do after all dependable heuristics have been used up...