llc: LLVM ERROR: Cannot select: - llvm

The llc gave me the following error:
LLVM ERROR: Cannot select: t20: i8,ch = load<LD1[%x], zext from i1> t0, FrameIndex:i16<0>, undef:i16
t1: i16 = FrameIndex<0>
t3: i16 = undef
In function: main
This is the content of the prg.ll file:
; ModuleID = 'new_module'
define i16 @main() {
entry:
%x = alloca i1
store i1 true, i1* %x
%0 = load i1, i1* %x
%relation_op = icmp eq i1 %0, true
br i1 %relation_op, label %then, label %else
then: ; preds = %entry
store i1 false, i1* %x
br label %ifcont3
else: ; preds = %entry
%1 = load i1, i1* %x
%relation_op1 = icmp eq i1 %1, false
br i1 %relation_op1, label %then2, label %ifcont
then2: ; preds = %else
store i1 true, i1* %x
br label %ifcont
ifcont: ; preds = %then2, %else
br label %ifcont3
ifcont3: ; preds = %ifcont, %then
ret i16 0
}
I don't understand what llc is telling me. prg.ll is the output of my custom compiler for AVR. I found the LLVM backend for AVR at this link: avr-llvm backend. Until now, the backend has worked fine. Can anyone see what the problem is?
Thanks in advance!

I changed the width of the bool type in my compiler from i1 to i8 (in this case, x is a bool). That solved my problem. The AVR backend apparently doesn't handle i1 properly. I will post the answer from the issue tracker if they tell me what exactly the problem is.
The answer from issue tracker:
A bunch of the LLVM backends handle i1 badly (which is pretty sad). This is why almost all frontends define bool to be i8.
I would definitely like to fix this though. By the looks of this, it is probably failing on the zext from i1 operation. All that should be needed is to promote the i1 to an i8 internally.

Related

Is there a way to automatically eliminate trailing terminator instructions in llvm?

Recently I added function verification to every function generation and found this error:
Terminator found in the middle of a basic block!
label %then
Currently, for a return statement I just generate a ret and nothing else, but things go wrong when it is inside a conditional block that has a br after it. Apparently you can't put terminators in the middle of a block, but I can't decide how to solve this. Using a phi node seems odd, since the code doesn't necessarily return a value on each branch, and I can't figure out a way to pass the return value to the phi node. I'm also unsure whether using some kind of signal to tell codegen to stop would make things more complicated. That's why I want to know if there's a way to automatically eliminate code after the first terminator.
Here's a Fibonacci function as an example, generated from code:
define weak i64 @.testModule.recFib(i64 %n) {
entry:
%_bool.0 = icmp eq i64 %n, 1
%_bool.1 = icmp eq i64 %n, 2
%_bool.2 = or i1 %_bool.0, %_bool.1
br i1 %_bool.2, label %then, label %continue
then: ; preds = %entry
ret i64 1
br label %continue
continue: ; preds = %then, %entry
%_int.0 = sub i64 %n, 1
%_int.1 = call i64 @.testModule.recFib(i64 %_int.0)
%_int.2 = sub i64 %n, 2
%_int.3 = call i64 @.testModule.recFib(i64 %_int.2)
%_int.4 = add i64 %_int.1, %_int.3
ret i64 %_int.4
}
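One common way for a frontend to avoid this error is to refuse to append anything to a block that already ends in a terminator. The sketch below models that idea with toy types: Block and emit are invented for illustration and are not LLVM classes. With the real API, the equivalent check would be whether the current insert block's getTerminator() returns non-null before emitting the br.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a basic block; NOT the LLVM BasicBlock class.
struct Block {
    std::string label;
    std::vector<std::string> insts;
    bool terminated = false;  // becomes true once a ret/br has been appended
};

// Append an instruction unless the block already ends in a terminator.
// Returns true if the instruction was actually emitted.
bool emit(Block& bb, const std::string& inst, bool isTerminator = false) {
    if (bb.terminated) {
        return false;  // anything after ret/br is dead code: drop it
    }
    bb.insts.push_back(inst);
    bb.terminated = isTerminator;
    return true;
}
```

Applied to the %then block above, the ret i64 1 would be emitted and the following br label %continue would be silently skipped, which leaves a well-formed block.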

How to get labels from a phinode and their corresponding basicblocks in LLVM?

Say the IR code looks like:
define void @_Z1mbb(i1 zeroext %r, i1 zeroext %y) nounwind {
entry:
%r.addr = alloca i8, align 1
%y.addr = alloca i8, align 1
%l = alloca i8, align 1
%frombool = zext i1 %r to i8
store i8 %frombool, i8* %r.addr, align 1
%frombool1 = zext i1 %y to i8
store i8 %frombool1, i8* %y.addr, align 1
%0 = load i8* %y.addr, align 1
%tobool = trunc i8 %0 to i1
br i1 %tobool, label %lor.end, label %lor.rhs
lor.rhs: ; preds = %entry
%1 = load i8* %r.addr, align 1
%tobool2 = trunc i8 %1 to i1
br label %lor.end
lor.end: ; preds = %lor.rhs, %entry
%2 = phi i1 [ true, %entry ], [ %tobool2, %lor.rhs ]
%frombool3 = zext i1 %2 to i8
store i8 %frombool3, i8* %l, align 1
ret void
}
the phinode has 2 pairs [ true, %entry ], [ %tobool2, %lor.rhs ]. How do I extract %entry and %lor.rhs and find the corresponding basicblock of each pair? Any help will be appreciated.
PHI->getNumIncomingValues() : returns number of incoming values in PHINode
For your phi node:
%2 = phi i1 [ true, %entry ], [ %tobool2, %lor.rhs ]
PHI->getIncomingValue(0) : gives true
PHI->getIncomingBlock(0) : gives %entry
There are iterators for blocks and values as well.
http://llvm.org/doxygen/classllvm_1_1PHINode.html
Always refer to the doxygen docs to see all the APIs associated with a class (e.g. PHINode).
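The three accessors above combine into a simple index loop. The sketch below uses a stand-in type, MockPhi, which is invented here so the example compiles without LLVM; only the accessor names mirror the ones listed above. In LLVM proper, getIncomingValue returns a Value* and getIncomingBlock a BasicBlock*, not strings.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in for a phi node; NOT llvm::PHINode. Each entry is (value, block).
struct MockPhi {
    std::vector<std::pair<std::string, std::string>> incoming;

    unsigned getNumIncomingValues() const {
        return static_cast<unsigned>(incoming.size());
    }
    const std::string& getIncomingValue(unsigned i) const {
        return incoming[i].first;
    }
    const std::string& getIncomingBlock(unsigned i) const {
        return incoming[i].second;
    }
};

// Collect the predecessor block of every [value, block] pair, in order.
std::vector<std::string> incomingBlocks(const MockPhi& phi) {
    std::vector<std::string> blocks;
    for (unsigned i = 0, e = phi.getNumIncomingValues(); i != e; ++i) {
        blocks.push_back(phi.getIncomingBlock(i));
    }
    return blocks;
}
```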

In an LLVM loop pass, I want to copy and paste a set of instructions; an error occurs when copying the icmp and branch instructions

Below is my code in LLVM LoopPass.
virtual bool runOnLoop(Loop* L, LPPassManager &LPM) {
BasicBlock& loopCondBlock = *(L->getHeader());
BasicBlock& loopIncBlock = *(L->getLoopLatch());
BranchInst* brInsInLoopInc = dyn_cast<BranchInst>(loopIncBlock.getTerminator());
for (auto &inst: loopCondBlock) {
auto *new_inst = inst.clone();
new_inst->insertBefore(brInsInLoopInc);
llvm::ValueToValueMapTy vmap;
llvm::RemapInstruction(new_inst, vmap, RF_NoModuleLevelChanges | RF_IgnoreMissingLocals);
}
return true;
}
I want to copy instructions in for.cond and paste them on for.inc before branch back to for.cond instruction.
Example original IR:
for.cond: ; preds = %for.inc, %entry
%0 = load i32, i32* %i, align 4
%cmp = icmp ult i32 %0, 50000000
br i1 %cmp, label %for.body, label %for.end
for.body: ; preds = %for.cond
...
for.inc: ; preds = %for.body
...
br label %for.cond
IR Expected:
for.cond: ; preds = %for.inc, %entry
%0 = load i32, i32* %i, align 4
%cmp = icmp ult i32 %0, 50000000
br i1 %cmp, label %for.body, label %for.end
for.body: ; preds = %for.cond
...
for.inc: ; preds = %for.body
...
// ******PASS ADDED******
%4 = load i32, i32* %i, align 4
%cmp2 = icmp ult i32 %4, 50000000
br i1 %cmp2, label %for.body, label %for.end
// ******PASS ADDED******
My Loop Pass Result:
for.cond: ; preds = %for.inc, %entry
%0 = load i32, i32* %i, align 4
%cmp = icmp ult i32 %0, 50000000
br i1 %cmp, label %for.body, label %for.end
for.body: ; preds = %for.inc, %for.cond
...
for.inc: ; preds = %for.body
...
// ******PASS ADDED******
%4 = load i32, i32* %i, align 4
%5 = icmp ult i32 %0, 50000000
br i1 %cmp, label %for.body, label %for.end
// ******PASS ADDED******
br label %for.cond
How can I fix the icmp and the related branch instruction so that they are correct, and remove the "br label %for.cond"?
Thank you for your help.
I think you forgot to actually fill the value to value map used in the remap call with mappings from original to copy.
According to the manual, deleting instructions should work by calling eraseFromParent.
This code is untested, so it will likely contain errors, but it should convey the idea:
virtual bool runOnLoop(Loop* L, LPPassManager &LPM) {
BasicBlock& loopCondBlock = *(L->getHeader());
BasicBlock& loopIncBlock = *(L->getLoopLatch());
BranchInst* brInsInLoopInc = dyn_cast<BranchInst>(loopIncBlock.getTerminator());
llvm::ValueToValueMapTy vmap;
for (auto &inst: loopCondBlock) {
auto *new_inst = inst.clone();
new_inst->insertBefore(brInsInLoopInc);
// map each instruction to its copy
vmap[&inst] = new_inst;
// now this should remap each instruction to its copy
llvm::RemapInstruction(new_inst, vmap, RF_NoModuleLevelChanges | RF_IgnoreMissingLocals);
}
// now erase the original branch
brInsInLoopInc->eraseFromParent();
// the loop was modified, so report a change
return true;
}
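Why the map has to be filled can be seen in a toy model that does not depend on LLVM. Inst and cloneAndRemap below are invented for illustration: each clone is recorded as original -> copy, and operands are then rewritten through that map. Operands that are not in the map (values defined outside the copied block) are left alone, which is roughly the effect of RF_IgnoreMissingLocals.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy instruction: a name plus pointers to the instructions it uses.
// NOT llvm::Instruction.
struct Inst {
    std::string name;
    std::vector<Inst*> operands;
};

// Clone every instruction in `block`, then remap each clone's operands so
// they point at the clones instead of the originals.
std::vector<std::unique_ptr<Inst>> cloneAndRemap(const std::vector<Inst*>& block) {
    std::map<const Inst*, Inst*> vmap;  // original -> copy
    std::vector<std::unique_ptr<Inst>> clones;
    for (const Inst* inst : block) {
        auto copy = std::make_unique<Inst>(*inst);
        copy->name += ".copy";
        vmap[inst] = copy.get();
        clones.push_back(std::move(copy));
    }
    for (auto& copy : clones) {
        for (Inst*& op : copy->operands) {
            auto it = vmap.find(op);
            if (it != vmap.end()) {
                op = it->second;  // operand was copied too: use the copy
            }
        }
    }
    return clones;
}
```

With an empty map, the copied icmp would keep pointing at the original load, which matches the `icmp ult i32 %0` seen in the pass result above.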
I solved my problem by adding the code below right after "inst.clone();".
if(inst.hasName()) {
new_inst->setName(inst.getName()+NameSuffix);
}

Input in LLVM, I think I do not understand dominance and the location of phi nodes

My goal is to do something simple in LLVM. I want to, using the C library function getchar, define an LLVM function that reads an input from the commandline. Here is my algorithm in pseudocode:
getInt:
get a character, set the value to VAL
check if VAL is '-'
if yes then set SGN to -1 and set VAL to the next character else set SGN to 1
set NV = to the next char minus 48
while (NV >= 0) // 48 is the first ASCII character that represents a number
set VAL = VAL*10
set VAL = VAL + NV
set NV to the next char minus 48
return SGN*VAL
So now, the LLVM code I come up with for doing this is in my head the most straightforward way to translate the above into LLVM IR. However, I get the error
"PHI nodes not grouped at the top of the basic block." If I move some things around to fix this error, I get errors about dominance. Below is the LLVM IR code that gives me the PHI nodes error. I believe I am misunderstanding something basic about LLVM IR, so any help you can give is super appreciated.
define i32 @getIntLoop() {
_L1:
%0 = call i32 @getchar()
%1 = phi i32 [ %0, %_L1 ], [ %3, %_L2 ], [ %8, %_L4 ]
%2 = icmp eq i32 %1, 45
br i1 %2, label %_L2, label %_L5
_L2: ; preds = %_L1
%3 = call i32 @getchar()
br label %_L3
_L3: ; preds = %_L4, %_L2
%4 = call i32 @getchar()
%5 = icmp slt i32 %4, 40
br i1 %5, label %_L5, label %_L4
_L4: ; preds = %_L3
%6 = sub i32 %4, 48
%7 = mul i32 %1, 10
%8 = add i32 %6, %7
br label %_L3
_L5: ; preds = %_L3, %_L1
br i1 %2, label %_L6, label %_L7
_L6: ; preds = %_L5
%9 = mul i32 -1, %1
ret i32 %9
_L7: ; preds = %_L5
ret i32 %1
}
You're getting a very clear error, though. According to the LLVM IR language reference:
There must be no non-phi instructions between the start of a basic
block and the PHI instructions: i.e. PHI instructions must be first in
a basic block.
You have a phi in _L1 which violates this: it comes after the call to getchar.
Why does it have %_L1 as one of its sources? There are no jumps to %_L1 anywhere else. I think you should first understand how phi works, possibly by compiling small pieces of C code into LLVM IR with Clang and see what gets generated.
Put simply, a phi is needed to keep SSA form consistent while still being able to assign one of several values to the same register. Make sure you read about SSA, which explains phi nodes as well. An additional good resource is the LLVM tutorial, which you should go through; in particular, part 5 covers phis. As suggested above, running small pieces of C through Clang is a great way to understand how things work. This is in no way "hacky": it's the scientific method! You read the theory, think hard about it, form hypotheses about how things work, and then verify those hypotheses by running Clang and seeing what it generates for real-life control flow.
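As a starting point for that experiment, here is a minimal sketch of the pseudocode's algorithm in C++. The names getInt and nextChar are made up for illustration, and the character source is passed in as a callable (an assumption, so the function can be exercised without a terminal). It also starts the value at 0 and stops at the first non-digit, which is what the pseudocode appears to intend. The variables val and c are reassigned inside the loop, so they are the ones that should turn into phi nodes at the top of the loop block once the IR is in SSA form.

```cpp
#include <functional>

// Reads an optionally negative decimal integer, one character at a time.
// `nextChar` is a hypothetical character source, e.g. a wrapper around getchar.
int getInt(const std::function<int()>& nextChar) {
    int c = nextChar();
    int sign = 1;
    if (c == '-') {
        sign = -1;
        c = nextChar();
    }
    int val = 0;
    while (c >= '0' && c <= '9') {
        val = val * 10 + (c - '0');  // '0' is ASCII 48
        c = nextChar();
    }
    return sign * val;
}
```

Calling getInt([] { return std::getchar(); }) (with <cstdio> included) reads from standard input. Compiling the file with clang++ -O1 -S -emit-llvm should show the phi nodes grouped at the start of the loop block, with one incoming pair per predecessor.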

Speed difference between If-Else and Ternary operator in C...?

So at the suggestion of a colleague, I just tested the speed difference between the ternary operator and the equivalent If-Else block... and it seems that the ternary operator yields code that is between 1x and 2x faster than If-Else. My code is:
gettimeofday(&tv3, 0);
for(i = 0; i < N; i++)
{
a = i & 1;
if(a) a = b; else a = c;
}
gettimeofday(&tv4, 0);
gettimeofday(&tv1, 0);
for(i = 0; i < N; i++)
{
a = i & 1;
a = a ? b : c;
}
gettimeofday(&tv2, 0);
(Sorry for using gettimeofday and not clock_gettime... I will endeavor to better myself.)
I tried changing the order in which I timed the blocks, but the results seem to persist. What gives? Also, the If-Else shows much more variability in terms of execution speed. Should I be examining the assembly that gcc generates?
By the way, this is all at optimization level zero (-O0).
Am I imagining this, or is there something I'm not taking into account, or is this a machine-dependent thing, or what? Any help is appreciated.
There's a good chance that the ternary operator gets compiled into a cmov while the if/else results in a cmp+jmp. Just take a look at the assembly (using -S) to be sure. With optimizations enabled, it won't matter any more anyway, as any good compiler should produce the same code in both cases.
You could also go completely branchless and measure if it makes any difference:
int m = -(i & 1);
a = (b & m) | (c & ~m);
On today's architectures, this style of programming has grown a bit out of fashion.
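To check that the mask trick computes the same result as the ternary, the two can be written as small functions and compared. This is only a sketch: the function names are invented here, and it assumes a two's-complement int, so that -(i & 1) is either 0 or all ones.

```cpp
// Ternary form from the question: b when i is odd, c otherwise.
int selectTernary(int i, int b, int c) {
    return (i & 1) ? b : c;
}

// Branchless form: build an all-zeros or all-ones mask from the low bit,
// then blend b and c with it.
int selectBranchless(int i, int b, int c) {
    int m = -(i & 1);  // 0 when i is even, -1 (all ones) when i is odd
    return (b & m) | (c & ~m);
}
```

Timing both inside the original loops, at the same optimization level, would show whether the branchless form makes any measurable difference on a given machine.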
This is a nice explanation: http://www.nynaeve.net/?p=178
Basically, there are "conditional set" processor instructions, which are faster than branching and setting in separate instructions.
If there is any difference, change your compiler!
For this kind of question I use the Try Out LLVM page. It's an old release of LLVM (still using the gcc front-end), but those are old tricks.
Here is my little sample program (simplified version of yours):
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main (int argc, char* argv[]) {
int N = atoi(argv[0]);
int a = 0, d = 0, b = atoi(argv[1]), c = atoi(argv[2]);
int i;
for(i = 0; i < N; i++)
{
a = i & 1;
if(a) a = b+i; else a = c+i;
}
for(i = 0; i < N; i++)
{
d = i & 1;
d = d ? b+i : c+i;
}
printf("%d %d", a, d);
return 0;
}
And there is the corresponding LLVM IR generated:
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%0 = load i8** %argv, align 8 ; <i8*> [#uses=1]
%N = tail call i32 @atoi(i8* %0) nounwind readonly ; <i32> [#uses=5]
%2 = getelementptr inbounds i8** %argv, i64 1 ; <i8**> [#uses=1]
%3 = load i8** %2, align 8 ; <i8*> [#uses=1]
%b = tail call i32 @atoi(i8* %3) nounwind readonly ; <i32> [#uses=2]
%5 = getelementptr inbounds i8** %argv, i64 2 ; <i8**> [#uses=1]
%6 = load i8** %5, align 8 ; <i8*> [#uses=1]
%c = tail call i32 @atoi(i8* %6) nounwind readonly ; <i32> [#uses=2]
%8 = icmp sgt i32 %N, 0 ; <i1> [#uses=2]
br i1 %8, label %bb, label %bb11
bb: ; preds = %bb, %entry
%9 = phi i32 [ %10, %bb ], [ 0, %entry ] ; <i32> [#uses=2]
%10 = add nsw i32 %9, 1 ; <i32> [#uses=2]
%exitcond22 = icmp eq i32 %10, %N ; <i1> [#uses=1]
br i1 %exitcond22, label %bb10.preheader, label %bb
bb10.preheader: ; preds = %bb
%11 = and i32 %9, 1 ; <i32> [#uses=1]
%12 = icmp eq i32 %11, 0 ; <i1> [#uses=1]
%.pn13 = select i1 %12, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp21 = add i32 %N, -1 ; <i32> [#uses=1]
%a.1 = add i32 %.pn13, %tmp21 ; <i32> [#uses=2]
br i1 %8, label %bb6, label %bb11
bb6: ; preds = %bb6, %bb10.preheader
%13 = phi i32 [ %14, %bb6 ], [ 0, %bb10.preheader ] ; <i32> [#uses=2]
%14 = add nsw i32 %13, 1 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %14, %N ; <i1> [#uses=1]
br i1 %exitcond, label %bb10.bb11_crit_edge, label %bb6
bb10.bb11_crit_edge: ; preds = %bb6
%15 = and i32 %13, 1 ; <i32> [#uses=1]
%16 = icmp eq i32 %15, 0 ; <i1> [#uses=1]
%.pn = select i1 %16, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp = add i32 %N, -1 ; <i32> [#uses=1]
%d.1 = add i32 %.pn, %tmp ; <i32> [#uses=1]
br label %bb11
bb11: ; preds = %bb10.bb11_crit_edge, %bb10.preheader, %entry
%a.0 = phi i32 [ %a.1, %bb10.bb11_crit_edge ], [ %a.1, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
%d.0 = phi i32 [ %d.1, %bb10.bb11_crit_edge ], [ 0, %bb10.preheader ], [ 0, %entry ] ; <i32> [#uses=1]
%17 = tail call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %a.0, i32 %d.0) nounwind ; <i32> [#uses=0]
ret i32 0
}
Okay, so this will likely look like gibberish, even though I went ahead and renamed some variables to make it a bit easier to read.
The important bits are these two blocks:
%.pn13 = select i1 %12, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp21 = add i32 %N, -1 ; <i32> [#uses=1]
%a.1 = add i32 %.pn13, %tmp21 ; <i32> [#uses=2]
%.pn = select i1 %16, i32 %c, i32 %b ; <i32> [#uses=1]
%tmp = add i32 %N, -1 ; <i32> [#uses=1]
%d.1 = add i32 %.pn, %tmp ; <i32> [#uses=1]
Which respectively set a and d.
And the conclusion is: No difference
Note: in a simpler example the two variables actually got merged, it seems here that the optimizer did not detect the similarity...
Any decent compiler should generate the same code for these if optimisation is turned on.
Understand that it's entirely up to the compiler how it interprets a ternary expression (unless you actually force it not to with (inline) asm). It could just as easily treat a ternary expression as 'if..else' in its intermediate representation and, depending on the target backend, choose to generate a conditional move instruction (on x86, CMOVcc is one such; there should also be ones for min/max, abs, etc.). The main motivation for using a conditional move is to trade the risk of a branch mispredict for a memory/register move operation. The caveat is that, nearly all the time, the operands that will be conditionally moved have to be evaluated down to register form to take advantage of the cmov instruction.
This means that the previously conditional evaluation now has to be unconditional, which appears to increase the length of the unconditional path of the program. But understand that a branch mispredict is most often resolved by 'flushing' the pipeline, which means that the instructions that would have finished executing are ignored (turned into no-operation instructions). The actual number of instructions executed is therefore higher because of the stalls or NOPs, and the effect scales with the depth of the processor pipeline and the misprediction rate.
This brings an interesting dilemma in determining the right heuristics. First, we know for sure that if the pipeline is too shallow, or the branch predictor is fully able to learn the pattern from the branch history, then cmov is not worth doing. It's also not worth doing if the cost of evaluating the conditional argument is greater than the average cost of misprediction.
These are perhaps the core reasons why compilers have difficulty exploiting the cmov instruction: choosing the heuristic depends largely on runtime profiling information. It makes more sense to use cmov in a JIT compiler, since it can gather runtime instrumentation feedback and build stronger heuristics for using it ("Is the branch truly unpredictable?"). For a static compiler without training data or a profiler, it is much harder to know when this will be useful. However, a simple negative heuristic is, as mentioned above, that if the compiler knows that the dataset is completely random, or that forcing conditional evaluation to become unconditional is costly (perhaps due to irreducible, costly operations like fp divides), it would be a good heuristic not to do this.
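The trade-off described above can be written down as a toy cost comparison. This is only an illustration of the reasoning, not how any real compiler decides: the function name and all three inputs are hypothetical quantities that a compiler would have to estimate or obtain from profiling.

```cpp
// Returns true when the expected cost of mispredicting the branch exceeds
// the extra cost of evaluating both operands unconditionally for a cmov.
//   mispredictRate     : fraction of executions that mispredict (0.0 to 1.0)
//   flushPenaltyCycles : cycles lost per pipeline flush (deeper pipeline, larger value)
//   extraEvalCycles    : cycles spent on the operand that a branch would have skipped
bool preferConditionalMove(double mispredictRate,
                           double flushPenaltyCycles,
                           double extraEvalCycles) {
    double expectedBranchCost = mispredictRate * flushPenaltyCycles;
    return expectedBranchCost > extraEvalCycles;
}
```

A well-predicted branch (low mispredictRate), a shallow pipeline (low flushPenaltyCycles), or an expensive operand such as an fp divide (high extraEvalCycles) all push the result toward keeping the branch, in line with the cases listed above.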
Any compiler worth its salt will do all that. Question is, what will it do after all dependable heuristics have been used up...