I have to do a comparison and I want to know which will be faster.
1)
for (i=0;i<4;i++){
if (object1(i)==object2(i))
retval = true;
else {
retval = false;
break;
}
}
2)
if (object1(0) == object2(0) && object1(1) == object2(1) && object1(2) == object2(2) && object1(3) == object2(3))
    retval = true;
else
    retval = false;
Or will both perform the same?
Thanks for the advice.
Strictly speaking the most efficient path would be:
retval = object1(0) == object2(0) && object1(1) == object2(1).....
This basically does the same as your loop, but doesn't have to compare the result to true to determine the outcome of the condition.
However, I strongly recommend keeping the loop, as it is far easier to adapt to add or remove numbers.
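For instance, here is a minimal sketch (my names, not from the question; it assumes objects exposing operator()(int) as in the question) of how the loop form adapts when the element count changes:
// Hypothetical helper: only the bound n changes when elements are
// added or removed; the chained && version has to be edited by hand.
template <typename Obj>
bool equalUpTo(const Obj& a, const Obj& b, int n) {
    for (int i = 0; i < n; ++i)
        if (a(i) != b(i))
            return false;
    return true;
}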
You need to measure. But in any case, the first version can be simplified quite a bit:
for (i = 0; i < 4; ++i)
if (object1(i) != object2(i))
return false;
return true;
Now choose the more readable form. I’d choose the loop here, unless you have confirmed that there is a performance problem caused by this code.
If optimization flags are on, the compiler might produce the same machine instructions for both versions, unrolling the for loop completely, since the exact number of iterations is known to the compiler (see: loop unrolling).
By the way, if you care so much, then you could write this:
bool retValue = (object1(0)==object2(0)) &&
(object1(1)==object2(1)) &&
(object1(2)==object2(2)) &&
(object1(3)==object2(3));
which avoids both the for loop and the if-else branch, and doesn't depend on compiler optimization.
As always with optimization, the one and only rule is: MEASURE.
Furthermore, I suspect that the compiler could optimize this code in ways you (and I) couldn't even imagine. Therefore, I'd suggest writing it in the most readable form.
I like to play with the Try out LLVM and Clang page for this:
struct Object {
int operator()(int i) const;
};
bool loop(Object const& left, Object const& right) {
bool retval = false;
for (int i = 0; i < 4; i++) {
if (left(i) == right(i) )
retval = true;
else {
retval = false;
break;
}
}
return true;
}
bool inlineif(Object const& left, Object const& right) {
bool retval = true;
if ( left(0) == right(0) &&
left(1) == right(1) &&
left(2) == right(2) &&
left(3) == right(3))
retval = true;
else
retval = false;
return retval;
}
bool betterloop(Object const& left, Object const& right) {
for (int i = 0; i < 4; ++i)
if (left(i) != right(i))
return false;
return true;
}
bool betterif(Object const& left, Object const& right) {
return left(0) == right(0) &&
left(1) == right(1) &&
left(2) == right(2) &&
left(3) == right(3);
}
This produces the following IR for the two loop versions (regardless of how they are written):
define zeroext i1 @_Z4loopRK6ObjectS1_(%struct.Object* %left, %struct.Object* %right) uwtable {
br label %1
; <label>:1 ; preds = %7, %0
%i.0 = phi i32 [ 0, %0 ], [ %8, %7 ]
%2 = icmp slt i32 %i.0, 4
br i1 %2, label %3, label %9
; <label>:3 ; preds = %1
%4 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %left, i32 %i.0)
%5 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %right, i32 %i.0)
%6 = icmp eq i32 %4, %5
br i1 %6, label %7, label %9
; <label>:7 ; preds = %3
%8 = add nsw i32 %i.0, 1
br label %1
; <label>:9 ; preds = %3, %1
ret i1 true
}
And very similar IR for the two if versions (so I'll give only one):
define zeroext i1 @_Z8betterifRK6ObjectS1_(%struct.Object* %left, %struct.Object* %right) uwtable {
%1 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %left, i32 0)
%2 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %right, i32 0)
%3 = icmp eq i32 %1, %2
br i1 %3, label %4, label %16
; <label>:4 ; preds = %0
%5 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %left, i32 1)
%6 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %right, i32 1)
%7 = icmp eq i32 %5, %6
br i1 %7, label %8, label %16
; <label>:8 ; preds = %4
%9 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %left, i32 2)
%10 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %right, i32 2)
%11 = icmp eq i32 %9, %10
br i1 %11, label %12, label %16
; <label>:12 ; preds = %8
%13 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %left, i32 3)
%14 = tail call i32 @_ZNK6ObjectclEi(%struct.Object* %right, i32 3)
%15 = icmp eq i32 %13, %14
br label %16
; <label>:16 ; preds = %12, %8, %4, %0
%17 = phi i1 [ false, %8 ], [ false, %4 ], [ false, %0 ], [ %15, %12 ]
ret i1 %17
}
The important instruction here is br, the branching instruction. It can be used either as a simple goto or with conditions on the edges:
br i1 %11, label %12, label %16
means: if the i1 value %11 is true, go to label %12, otherwise go to label %16.
It seems that "naturally" LLVM will not unroll the traditional loop version, so the if version performs better here. I am quite surprised, actually, that it does not and I cannot figure out why it would not...
So, the inline if code might be a bit faster, but it might also be unnoticeable depending on the cost of left(i) == right(i) (and even then), as CPUs are quite good at branch prediction.
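If you do want numbers rather than IR, a rough, self-contained timing sketch along these lines could be a starting point; the Object definition, iteration count, and function names here are all made up for illustration, and real results depend entirely on what object1(i)/object2(i) cost and on the compiler flags:
// Rough timing sketch (not from the original answer).
#include <chrono>
#include <cstdio>

struct Object {
    int v[4];
    int operator()(int i) const { return v[i]; }   // trivial stand-in
};

static bool loopCompare(const Object& l, const Object& r) {
    for (int i = 0; i < 4; ++i)
        if (l(i) != r(i)) return false;
    return true;
}

static bool chainedCompare(const Object& l, const Object& r) {
    return l(0) == r(0) && l(1) == r(1) && l(2) == r(2) && l(3) == r(3);
}

int main() {
    Object a = {{1, 2, 3, 4}}, b = {{1, 2, 3, 5}};
    volatile bool sink = false;   // keeps the comparisons from being optimized away
    auto time = [&](bool (*f)(const Object&, const Object&)) {
        auto t0 = std::chrono::steady_clock::now();
        for (int n = 0; n < 10000000; ++n) sink = f(a, b);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    };
    std::printf("loop: %lld us, chained: %lld us\n",
                (long long)time(loopCompare), (long long)time(chainedCompare));
    return 0;
}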
I'm writing an LLVM LoopPass in which I need to know which values
are used outside the loop. For that I have this code:
virtual bool runOnLoop(Loop *loop, LPPassManager &LPM)
{
for (auto it = loop->block_begin(); it != loop->block_end(); it++)
{
for (auto inst = (*it)->begin(); inst != (*it)->end(); inst++)
{
if (Is_Used_Outside_This_loop(loop,(Instruction *) inst))
{
errs() << inst->getName().str();
errs() << " is used outside the loop\n";
}
}
}
// ...
}
The inner function seemed right at first, but with the *.ll file below,
it gives incorrect classification for %tmp5 since it is used twice
inside a basic block of the loop.
bool Is_Used_Outside_This_loop(Loop *loop, Value *v)
{
int n=0;
int numUses = v->getNumUses();
for (auto it = loop->block_begin(); it != loop->block_end(); it++)
{
if (v->isUsedInBasicBlock(*it))
{
n++;
}
}
if (n == numUses) return false;
else return true;
}
The following *.ll code shows that %tmp5 is used twice
inside a basic block of the loop. When I carefully searched the API,
I couldn't find anything like Value::numUsesInBasicBlock( ... )
; Function Attrs: nounwind uwtable
define internal void @foo(i8* %s) #0 {
entry:
%s.addr = alloca i8*, align 8
%c = alloca i8, align 1
store i8* %s, i8** %s.addr, align 8
store i8 0, i8* %c, align 1
br label %while.cond
while.cond: ; preds = %while.body, %entry
%tmp = load i8*, i8** %s.addr, align 8
%tmp1 = load i8, i8* %tmp, align 1
%conv = sext i8 %tmp1 to i32
%cmp = icmp eq i32 %conv, 97
br i1 %cmp, label %lor.end, label %lor.rhs
lor.rhs: ; preds = %while.cond
%tmp2 = load i8*, i8** %s.addr, align 8
%tmp3 = load i8, i8* %tmp2, align 1
%conv2 = sext i8 %tmp3 to i32
%cmp3 = icmp eq i32 %conv2, 98
br label %lor.end
lor.end: ; preds = %lor.rhs, %while.cond
%tmp4 = phi i1 [ true, %while.cond ], [ %cmp3, %lor.rhs ]
br i1 %tmp4, label %while.body, label %while.end
while.body: ; preds = %lor.end
%tmp5 = load i8*, i8** %s.addr, align 8
%incdec.ptr = getelementptr inbounds i8, i8* %tmp5, i32 1
store i8* %incdec.ptr, i8** %s.addr, align 8
%tmp6 = load i8, i8* %tmp5, align 1
store i8 %tmp6, i8* %c, align 1
br label %while.cond
while.end: ; preds = %lor.end
%tmp7 = load i8*, i8** %s.addr, align 8
%tmp8 = load i8, i8* %tmp7, align 1
%conv5 = sext i8 %tmp8 to i32
%cmp6 = icmp eq i32 %conv5, 99
br i1 %cmp6, label %if.then, label %if.end
if.then: ; preds = %while.end
%tmp9 = load i8*, i8** %s.addr, align 8
%incdec.ptr8 = getelementptr inbounds i8, i8* %tmp9, i32 1
store i8* %incdec.ptr8, i8** %s.addr, align 8
br label %if.end
if.end: ; preds = %if.then, %while.end
ret void
}
Clearly, there's got to be a way of doing this, right? Thanks!
The problem is your comparison. You count the total number of uses of the value, but then compare it with the number of basic blocks in which the value is used.
So, if the value is used twice in the same basic block, the number of uses is 2 but the counter n of basic blocks that use the value is only incremented once, hence the mismatch between n and numUses.
Maybe a more concise way to do what you want is:
void FindUsesNotIn(
llvm::SmallPtrSetImpl<llvm::BasicBlock *> &Blocks,
llvm::SmallPtrSetImpl<llvm::Value *> &OutUses) {
for(const auto &b : Blocks)
for(auto &i : *b)
for(const auto &u : i.users()) {
auto *userInst = llvm::dyn_cast<llvm::Instruction>(u);
if(userInst && !Blocks.count(userInst->getParent())) {
OutUses.insert(&i);
break;
}
}
}
and then in the runOnLoop method have something like this:
virtual bool runOnLoop(llvm::Loop *loop, llvm::LPPassManager &LPM) {
llvm::SmallPtrSet<llvm::BasicBlock*, 10> loopBlocks(loop->block_begin(), loop->block_end());
llvm::SmallPtrSet<llvm::Value *, 10> outs;
FindUsesNotIn(loopBlocks, outs);
for(const auto *e : outs)
llvm::dbgs() << *e << '\n';
return false;
}
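Alternatively, a smaller change to your original predicate (just a sketch, reusing your function name and assuming a reasonably recent LLVM with Value::users() and Loop::contains()) is to check per use rather than per block:
// Sketch: walk the users of v directly; v is "used outside" as soon as one
// user instruction lives in a block that the loop does not contain.
bool Is_Used_Outside_This_loop(Loop *loop, Value *v)
{
    for (auto *user : v->users())
    {
        auto *userInst = dyn_cast<Instruction>(user);
        if (userInst && !loop->contains(userInst->getParent()))
            return true;
    }
    return false;
}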
Here's how I solved it, though it looks overly complicated:
virtual bool runOnLoop(Loop *loop, LPPassManager &LPM)
{
for (auto it = loop->block_begin(); it != loop->block_end(); it++)
{
for (auto inst = (*it)->begin(); inst != (*it)->end(); inst++)
{
int n=0;
for (auto use = inst->use_begin(); use != inst->use_end(); use++)
{
Instruction *i = (Instruction *) use->getUser();
if (BasicBlockBelongsToLoop(i->getParent(),loop))
{
n++;
}
}
assert(n <= inst->getNumUses());
if (n < inst->getNumUses())
{
errs() << inst->getName().str();
errs() << " is used outside the loop\n";
}
}
}
// ...
I also couldn't find anything in the API to check whether a basic block belongs to a loop, so I had to write my own BasicBlockBelongsToLoop; here it is:
bool BasicBlockBelongsToLoop(BasicBlock *BB, Loop *loop)
{
for (auto it = loop->block_begin(); it != loop->block_end(); it++)
{
if (BB == (*it))
{
return true;
}
}
return false;
}
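For what it's worth, the Loop class (through LoopBase) already provides a membership test, so the manual scan shouldn't be necessary; a minimal sketch:
// Sketch: Loop::contains(BasicBlock*) answers the same question directly.
bool BasicBlockBelongsToLoop(BasicBlock *BB, Loop *loop)
{
    return loop->contains(BB);
}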
There is a branch in my IR that I want to delete completely (condition + branch + true basic block + false basic block). It looks like this:
%4 = icmp sge i32 %2, %3
br i1 %4, label %5, label %7
; <label>:5 ; preds = %0
%6 = load i32* %x, align 4
store i32 %6, i32* %z, align 4
br label %9
; <label>:7 ; preds = %0
%8 = load i32* %y, align 4
store i32 %8, i32* %z, align 4
br label %9
; <label>:9 ; preds = %7, %5
%10 = call dereferenceable(140) %"class.std::basic_ostream"* @_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc(%"class.std::basic_ostream"* dereferenceable(140) @_ZSt4cout, i8* getelementptr inbounds ([5 x i8]* @.str, i32 0, i32 0))
%11 = load i32* %z, align 4
%12 = call dereferenceable(140) %"class.std::basic_ostream"* @_ZNSolsEi(%"class.std::basic_ostream"* %10, i32 %11)
%13 = call dereferenceable(140) %"class.std::basic_ostream"* @_ZNSolsEPFRSoS_E(%"class.std::basic_ostream"* %12, %"class.std::basic_ostream"* (%"class.std::basic_ostream"*)* @_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_)
ret i32 0
Now, to delete it, is there a removeBranch function, or do I need to delete the instructions one by one? I have been trying the latter way, but I have seen every error from "Basic block in main does not have a terminator" to "use remains when def is destroyed", and many more. I have used eraseFromParent, ReplaceInstWithValue, ReplaceInstWithInst, removeFromParent, etc.
Can anyone be kind enough to point me in the correct direction?
This is my FunctionPass:
bool runOnFunction(Function &F) override {
for (auto& B : F)
for (auto& I : B)
if(auto* brn = dyn_cast<BranchInst>(&I))
if(brn->isConditional()){
Instruction* cond = dyn_cast<Instruction>(brn->getCondition());
if(cond->getOpcode() == Instruction::ICmp){
branch_vector.push_back(brn);
//removeConditionalBranch(dyn_cast<BranchInst>(brn));
}
}
/*For now just delete the branches in the vector.*/
for(auto b : branch_vector)
removeConditionalBranch(dyn_cast<BranchInst>(b));
return true;
}
I don't know of any RemoveBranch utility function, but something like this should work. The idea is to delete the branch instruction, then delete anything that becomes dead as a result, and then merge the initial block with the join block.
// for DeleteDeadBlock, MergeBlockIntoPredecessor
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
// for RecursivelyDeleteTriviallyDeadInstructions
#include "llvm/Transforms/Utils/Local.h"
void removeConditionalBranch(BranchInst *Branch) {
assert(Branch &&
Branch->isConditional() &&
Branch->getNumSuccessors() == 2);
BasicBlock *Parent = Branch->getParent();
BasicBlock *ThenBlock = Branch->getSuccessor(0);
BasicBlock *ElseBlock = Branch->getSuccessor(1);
BasicBlock *ThenSuccessor = ThenBlock->getUniqueSuccessor();
BasicBlock *ElseSuccessor = ElseBlock->getUniqueSuccessor();
assert(ThenSuccessor && ElseSuccessor && ThenSuccessor == ElseSuccessor);
Value *Cond = Branch->getCondition(); // grab the condition before erasing its user
Branch->eraseFromParent();
RecursivelyDeleteTriviallyDeadInstructions(Cond);
DeleteDeadBlock(ThenBlock);
DeleteDeadBlock(ElseBlock);
IRBuilder<> Builder(Parent);
Builder.CreateBr(ThenSuccessor);
bool Merged = MergeBlockIntoPredecessor(ThenSuccessor);
assert(Merged);
}
This code only handles the simple case you've shown, with the then and else blocks both jumping unconditionally to a common join block (it will fail with an assertion error for anything more complicated). More complicated control flow will be a bit trickier to handle, but you should still be able to use this code as a starting point.
I am getting the following error while inserting an instruction using an LLVM pass:
Instruction does not dominate all uses!
%add = add nsw i32 10, 2
%cmp3 = icmp ne i32 %a.01, %add
Broken module found, compilation aborted!
I have the source code in a bitcode file whose snippet is:
if.then: ; preds = %entry
%add = add nsw i32 10, 2
br label %if.end
if.else: ; preds = %entry
%sub = sub nsw i32 10, 2
br label %if.end
if.end: ; preds = %if.else, %if.then
%a.0 = phi i32 [ %add, %if.then ], [ %sub, %if.else ]
%a.01 = call i32 @tauInt32Ty(i32 %a.0) ; line A
%add3 = add nsw i32 %a.01, 2
%add4 = add nsw i32 %a.01, 3
%call5 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([7 x i8]* @.str2, i32 0, i32 0), i32 %add3, i32 %add4)
I want to insert a new instruction after "line A" which is :
%cmp3 = icmp ne i32 %a.01, %add
And I have written a function pass; the snippet of code that does this task is:
for (Function::iterator bb = F.begin(), e = F.end(); bb != e; ++bb) {
for (BasicBlock::iterator i = bb->begin(), e = bb->end(); i != e; ++i) {
std::string str;
if(isa<CallInst>(i)) { // || true
BasicBlock::iterator next_it = i;
next_it++;
Instruction* next = dyn_cast<Instruction>(&*next_it);
CallInst* ci = dyn_cast<CallInst>(&*i);
Function* ff = ci->getCalledFunction();
str = ff->getName();
errs()<<"> "<<str<<"\n";
if(!str.compare("tauInt32Ty")) {
hotPathSSA1::varVersionWithPathsSet::iterator start = tauArguments[&*ci].begin();
hotPathSSA1::varVersionWithPathsSet::iterator end = tauArguments[&*ci].end();
Value* specArgs = start->second; // specArgs points to %add
ICmpInst* int1_cmp_56 = new ICmpInst(next, ICmpInst::ICMP_NE, ci, specArgs, "cmp3");
}
}
}
}
I have not encountered such a problem yet, but I think your problem is the if statement. %add belongs to the if.then basic block and is not accessible from the if.end block. This is why the phi instruction "chooses" which value is available, %add or %sub. So you have to use %a.0, not %add, as the argument for your ICmpInst.
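Concretely, a sketch of that change in your pass (reusing next and i from your snippet; getArgOperand(0) is how %a.0 is reachable from the call) might look like:
// Sketch: compare against the value that dominates if.end, i.e. the phi
// result %a.0 that was passed into the call, rather than %add.
if (CallInst *ci = dyn_cast<CallInst>(&*i)) {
    Value *domValue = ci->getArgOperand(0); // %a.0 in the IR above
    new ICmpInst(next, ICmpInst::ICMP_NE, ci, domValue, "cmp3");
}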
As in, say my header file is:
class A
{
void Complicated();
};
And my source file
void A::Complicated()
{
...really long function...
}
Can I split the source file into
void DoInitialStuff(pass necessary vars by ref or value)
{
...
}
void HandleCaseA(pass necessary vars by ref or value)
{
...
}
void HandleCaseB(pass necessary vars by ref or value)
{
...
}
void FinishUp(pass necessary vars by ref or value)
{
...
}
void A::Complicated()
{
...
DoInitialStuff(...);
switch ...
HandleCaseA(...)
HandleCaseB(...)
...
FinishUp(...)
}
Entirely for readability and without any fear of impact in terms of performance?
You should mark the functions static so that the compiler knows they are local to that translation unit.
Without static, the compiler cannot assume (barring LTO / WPA) that the function is not called from another translation unit, so it has to keep an out-of-line copy and is less likely to inline the calls.
Demonstration using the LLVM Try Out page.
That said, code for readability first, micro-optimizations (and such tweaking is a micro-optimization) should only come after performance measures.
Example:
#include <cstdio>
static void foo(int i) {
int m = i % 3;
printf("%d %d", i, m);
}
int main(int argc, char* argv[]) {
for (int i = 0; i != argc; ++i) {
foo(i);
}
}
Produces with static:
; ModuleID = '/tmp/webcompile/_27689_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
@.str = private constant [6 x i8] c"%d %d\00" ; <[6 x i8]*> [#uses=1]
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%cmp4 = icmp eq i32 %argc, 0 ; <i1> [#uses=1]
br i1 %cmp4, label %for.end, label %for.body
for.body: ; preds = %for.body, %entry
%0 = phi i32 [ %inc, %for.body ], [ 0, %entry ] ; <i32> [#uses=3]
%rem.i = srem i32 %0, 3 ; <i32> [#uses=1]
%call.i = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %0, i32 %rem.i) nounwind ; <i32> [#uses=0]
%inc = add nsw i32 %0, 1 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %inc, %argc ; <i1> [#uses=1]
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body, %entry
ret i32 0
}
declare i32 @printf(i8* nocapture, ...) nounwind
Without static:
; ModuleID = '/tmp/webcompile/_27859_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
@.str = private constant [6 x i8] c"%d %d\00" ; <[6 x i8]*> [#uses=1]
define void @foo(int)(i32 %i) nounwind {
entry:
%rem = srem i32 %i, 3 ; <i32> [#uses=1]
%call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %i, i32 %rem) ; <i32> [#uses=0]
ret void
}
declare i32 @printf(i8* nocapture, ...) nounwind
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%cmp4 = icmp eq i32 %argc, 0 ; <i1> [#uses=1]
br i1 %cmp4, label %for.end, label %for.body
for.body: ; preds = %for.body, %entry
%0 = phi i32 [ %inc, %for.body ], [ 0, %entry ] ; <i32> [#uses=3]
%rem.i = srem i32 %0, 3 ; <i32> [#uses=1]
%call.i = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %0, i32 %rem.i) nounwind ; <i32> [#uses=0]
%inc = add nsw i32 %0, 1 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %inc, %argc ; <i1> [#uses=1]
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body, %entry
ret i32 0
}
It depends on aliasing (pointers to that function) and function length (a large function inlined in a branch could throw the other branch out of cache, thus hurting performance).
Let the compiler worry about that, you worry about your code :)
A complicated function is likely to have its speed dominated by the operations within the function; the overhead of a function call won't be noticeable even if it isn't inlined.
You don't have much control over the inlining of a function; the best way to know is to try it and find out.
A compiler's optimizer might be more effective with shorter pieces of code, so you might find it getting faster even if it's not inlined.
If you split up your code into logical groupings the compiler will do what it deems best: If it's short and easy, the compiler should inline it and the result is the same. If however the code is complicated, making an extra function call might actually be faster than doing all the work inlined, so you leave the compiler the option to do that too. On top of all that, the logically split code can be far easier for a maintainer to grok and avoid future bugs.
I suggest you create a helper class to break your complicated function into method calls, much like you were proposing, but without the long, boring and unreadable task of passing arguments to each and every one of these smaller functions. Pass these arguments only once by making them member variables of the helper class.
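A rough sketch of that helper-class idea (the name ComplicatedImpl and the member layout are mine, not from your code):
// Hypothetical helper: shared state becomes member data, and each step of
// the long function becomes a small private method.
class ComplicatedImpl {
public:
    explicit ComplicatedImpl(A& owner) : owner_(owner) {}

    void Run() {
        DoInitialStuff();
        // switch (...) { case ...: HandleCaseA(); break; ... }
        FinishUp();
    }

private:
    void DoInitialStuff() { /* ... */ }
    void HandleCaseA()    { /* ... */ }
    void HandleCaseB()    { /* ... */ }
    void FinishUp()       { /* ... */ }

    A& owner_;   // plus whatever intermediate state the steps share
};

void A::Complicated()
{
    ComplicatedImpl impl(*this);
    impl.Run();
}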
Don't focus on optimization at this point, make sure your code is readable and you'll be fine 99% of the time.
So at the suggestion of a colleague, I just tested the speed difference between the ternary operator and the equivalent If-Else block... and it seems that the ternary operator yields code that is between 1x and 2x faster than If-Else. My code is:
gettimeofday(&tv3, 0);
for(i = 0; i < N; i++)
{
a = i & 1;
if(a) a = b; else a = c;
}
gettimeofday(&tv4, 0);
gettimeofday(&tv1, 0);
for(i = 0; i < N; i++)
{
a = i & 1;
a = a ? b : c;
}
gettimeofday(&tv2, 0);
(Sorry for using gettimeofday and not clock_gettime... I will endeavor to better myself.)
I tried changing the order in which I timed the blocks, but the results seem to persist. What gives? Also, the If-Else shows much more variability in terms of execution speed. Should I be examining the assembly that gcc generates?
By the way, this is all at optimization level zero (-O0).
Am I imagining this, or is there something I'm not taking into account, or is this a machine-dependent thing, or what? Any help is appreciated.
There's a good chance that the ternary operator gets compiled into a cmov while the if/else results in a cmp+jmp. Just take a look at the assembly (using -S) to be sure. With optimizations enabled, it won't matter any more anyway, as any good compiler should produce the same code in both cases.
You could also go completely branchless and measure if it makes any difference:
int m = -(i & 1);
a = (b & m) | (c & ~m);
On today's architectures, this style of programming has grown a bit out of fashion.
This is a nice explanation: http://www.nynaeve.net/?p=178
Basically, there are "conditional set" processor instructions, which are faster than branching and setting a value in separate instructions.
If there is any difference, change your compiler!
For this kind of question I use the Try Out LLVM page. It's an old release of LLVM (still using the gcc front-end), but those are old tricks.
Here is my little sample program (simplified version of yours):
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main (int argc, char* argv[]) {
int N = atoi(argv[0]);
int a = 0, d = 0, b = atoi(argv[1]), c = atoi(argv[2]);
int i;
for(i = 0; i < N; i++)
{
a = i & 1;
if(a) a = b+i; else a = c+i;
}
for(i = 0; i < N; i++)
{
d = i & 1;
d = d ? b+i : c+i;
}
printf("%d %d", a, d);
return 0;
}
And here is the corresponding LLVM IR that was generated:
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%0 = load i8** %argv, align 8 ; <i8*> [#uses=1]
%N = tail call i32 @atoi(i8* %0) nounwind readonly ; <i32> [#uses=5]
%2 = getelementptr inbounds i8** %argv, i64 1 ; <i8**> [#uses=1]
%3 = load i8** %2, align 8 ; <i8*> [#uses=1]
%b = tail call i32 @atoi(i8* %3) nounwind readonly ; <i32> [#uses=2]
%5 = getelementptr inbounds i8** %argv, i64 2 ; <i8**> [#uses=1]
%6 = load i8** %5, align 8 ; <i8*> [#uses=1]
%c = tail call i32 @atoi(i8* %6) nounwind readonly ; <i32> [#uses=2]
%8 = icmp sgt i32 %N, 0 ; <i1> [#uses=2]
br i1 %8, label %bb, label %bb11
bb: ; preds = %bb, %entry
%9 = phi i32 [ %10, %bb ], [ 0, %entry ] ; <i32> [#uses=2]
%10 = add nsw i32 %9, 1 ; <i32> [#uses=2]
%exitcond22 = icmp eq i32 %10, %N ; <i1> [#uses=1]
br i1 %exitcond22, label %bb10.preheader, label %bb
bb10.preheader: ; preds = %bb
%11 = and i32 %9, 1 ; <i32> [#uses=1]
%12 = icmp eq i32 %11, 0 ; <i1> [#uses=1]
%.pn13 = select i1 %12, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp21 = add i32 %N, -1 ; <i32> [#uses=1]
%a.1 = add i32 %.pn13, %tmp21 ; <i32> [#uses=2]
br i1 %8, label %bb6, label %bb11
bb6: ; preds = %bb6, %bb10.preheader
%13 = phi i32 [ %14, %bb6 ], [ 0, %bb10.preheader ] ; <i32> [#uses=2]
%14 = add nsw i32 %13, 1 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %14, %N ; <i1> [#uses=1]
br i1 %exitcond, label %bb10.bb11_crit_edge, label %bb6
bb10.bb11_crit_edge: ; preds = %bb6
%15 = and i32 %13, 1 ; <i32> [#uses=1]
%16 = icmp eq i32 %15, 0 ; <i1> [#uses=1]
%.pn = select i1 %16, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp = add i32 %N, -1 ; <i32> [#uses=1]
%d.1 = add i32 %.pn, %tmp ; <i32> [#uses=1]
br label %bb11
bb11: ; preds = %bb10.bb11_crit_edge, %bb10.preheader, %entry
%a.0 = phi i32 [ %a.1, %bb10.bb11_crit_edge ], [ %a.1, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
%d.0 = phi i32 [ %d.1, %bb10.bb11_crit_edge ], [ 0, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
%17 = tail call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %a.0, i32 %d.0) nounwind ; <i32> [#uses=0]
ret i32 0
}
Okay, so it probably reads like Chinese, even though I went ahead and renamed some variables to make it a bit easier to read.
The important bits are these two blocks:
%.pn13 = select i1 %12, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp21 = add i32 %N, -1 ; <i32> [#uses=1]
%a.1 = add i32 %.pn13, %tmp21 ; <i32> [#uses=2]
%.pn = select i1 %16, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp = add i32 %N, -1 ; <i32> [#uses=1]
%d.1 = add i32 %.pn, %tmp ; <i32> [#uses=1]
Which respectively set a and d.
And the conclusion is: No difference
Note: in a simpler example the two variables actually got merged; it seems that here the optimizer did not detect the similarity...
Any decent compiler should generate the same code for these if optimisation is turned on.
Understand that it's entirely up to the compiler how it interprets a ternary expression (unless you actually force it not to with (inline) asm). It could just as easily treat the ternary expression as an 'if..else' in its internal representation, and, depending on the target backend, it may choose to generate a conditional move instruction (on x86, CMOVcc is one such instruction; there are also ones for min/max, abs, etc.). The main motivation for using a conditional move is to transfer the risk of a branch mispredict to a memory/register move operation. The caveat is that, nearly all the time, the operand that will be conditionally loaded has to be evaluated down to register form to take advantage of the cmov instruction.
This means that evaluation that used to be conditional now has to happen unconditionally, which appears to lengthen the unconditional path of the program. But understand that a branch mispredict is most often resolved by 'flushing' the pipeline, which means that the instructions that would have finished executing are discarded (turned into no-operation instructions). So the actual number of instructions executed is higher because of the stalls or NOPs, and the effect scales with the depth of the processor pipeline and the misprediction rate.
This brings up an interesting dilemma in determining the right heuristics. First, we know for sure that if the pipeline is too shallow, or if branch prediction is fully able to learn the pattern from the branch history, then cmov is not worth doing. It's also not worth doing if the cost of evaluating the conditional arguments is, on average, greater than the cost of the mispredictions.
These are perhaps the core reasons why compilers have difficulty exploiting the cmov instruction: the heuristic depends largely on runtime profiling information. It makes more sense to use it in a JIT compiler, which can gather runtime instrumentation feedback and build a stronger heuristic ("Is the branch truly unpredictable?"). For a static compiler without training data or a profiler, it is much harder to know when it will be useful. However, a simple negative heuristic is, as mentioned, that if the compiler knows the dataset is completely random, or that forcing conditional evaluation to be unconditional is costly (perhaps due to irreducible, expensive operations like floating-point divides), then it is best not to do it.
Any compiler worth its salt will do all of that. The question is what it will do after all the dependable heuristics have been used up...