I was challenged with a rather educational task: to extend LLVM in the following way:
Add a register XACC and instructions LRXACC(), SRXACC(arg1), XMAC(arg1,arg2) to the SPARC back-end. The instructions do the following:
LRXACC: load value from XACC,
SRXACC: write to XACC,
XMAC: XACC+=(arg1*arg2)>>31
Provide builtins in the Clang front-end for all three of them.
My source for testing is:
int main() {
int acc = 0;
__builtin___srxacc(acc);
__builtin___xmac(12345,6789);
acc = __builtin___lrxacc();
return 0;
}
I was able to support conversion of the builtins into intrinsic functions. The IR file I get from Clang looks fine:
define i32 @main() #0 {
entry:
%retval = alloca i32, align 4
%acc = alloca i32, align 4
store i32 0, i32* %retval
store i32 0, i32* %acc, align 4
%0 = load i32* %acc, align 4
call void @llvm.sparc.srxacc(i32 %0)
call void @llvm.sparc.xmac(i32 12345, i32 6789)
%1 = call i32 @llvm.sparc.lrxacc()
store i32 %1, i32* %acc, align 4
ret i32 0
}
The issue appears during the DAG combining step, and the final output code looks like this:
.text
.file "file22.ll"
.globl main
.align 4
.type main,#function
main: ! @main
! BB#0: ! %entry
add %sp, -104, %sp
st %g0, [%sp+100]
st %g0, [%sp+96]
lrxacc %xacc, %o0
st %o0, [%sp+96]
sethi 0, %o0
retl
add %sp, 104, %sp
.Ltmp0:
.size main, .Ltmp0-main
.ident "clang version 3.6.0 (trunk)"
DAGCombiner deletes the srxacc and xmac instructions as redundant. (In its ::Run method it checks each node for use_empty() and deletes it if that's the case.)
The combiner does that because these instructions keep their result in a register, so the fact that one of them depends on another is not visible in the graph.
I would appreciate any suggestions on how to avoid removal of my instructions.
Thank you!
Edit
To simplify and concretize: instructions which are represented in the IR like void @llvm.sparc.srxacc(i32 %0) look to the combiner as if they don't affect the computation, so the corresponding SDNodes receive an empty UseList. How do I get around that?
You may use chain tokens to represent a control dependency between SDNodes. This way you can add a fake dependency between two instructions even if the second one doesn't consume any output of the first.
You may use CopyToReg and CopyFromReg nodes to cope with predefined physical registers.
You may use Glue to glue several simple instructions into a complex pseudo-instruction.
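Concretely for this question, here is a minimal sketch (the names SP::XACC, Intrinsic::sparc_srxacc and the helper itself are assumptions based on the question's setup, and the exact hook varies by LLVM version) of lowering srxacc through CopyToReg so the node carries a chain and survives the combiner:
// Called from SparcTargetLowering::LowerOperation for ISD::INTRINSIC_VOID.
// SP::XACC and Intrinsic::sparc_srxacc are hypothetical names here.
static SDValue LowerIntrinsicVoid(SDValue Op, SelectionDAG &DAG) {
  SDLoc DL(Op);
  SDValue Chain = Op.getOperand(0); // incoming chain token
  unsigned IntNo = cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue();
  if (IntNo == Intrinsic::sparc_srxacc) {
    // CopyToReg consumes and produces a chain; a later CopyFromReg of XACC
    // threaded on the same chain keeps this node alive through DAG combining.
    return DAG.getCopyToReg(Chain, DL, SP::XACC, Op.getOperand(2));
  }
  return SDValue(); // fall back to default handling
}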
Consider the following simple example compiled for x86:
int foo(int aa, int bb) {
return aa / bb;
}
You may also want to investigate a more complicated example, though the DAG picture is too big to post here (you can view the DAG with the -view-sched-dags option):
void foo(int aa, int bb, int *x, int *y, int *z) {
*x = aa / bb;
*y = aa % bb;
*z = aa * bb;
}
Take a look at this post too.
I want to insert a load and a store instruction before the first instruction of the first basic block of a function (used to simulate the performance overhead of our work). The LLVM pass is written as follows:
Value *One = llvm::ConstantInt::get(Type::getInt32Ty(Context),1);
for (Function::iterator bb = tmp->begin(); bb != tmp->end(); ++bb) {
//for every instruction of the block
for (BasicBlock::iterator inst = bb->begin(); inst != bb->end(); ++inst){
if(inst == bb->begin() && bb == tmp->begin()){
BasicBlock* bp = &*bb;
Instruction* pinst = &*inst;
AllocaInst *pa = new AllocaInst(Type::getInt32Ty(Context), "buf", pinst);
StoreInst* newstore = new StoreInst(One, pa, pinst);
LoadInst* newload = new LoadInst(pa, "loadvalue", pinst);
}
}
}
The inserted load and store instructions can be seen in the xx.ll file:
define i32 @fun() #0 {
entry:
%buf = alloca i32
%loadvalue = load i32, i32* %buf
store i32 %loadvalue, i32* %buf
%call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i64 0, i64 0))
ret i32 1
}
However, the inserted instructions disappeared in the target executable file.
How can I fix this problem?
They were probably eliminated by optimization because they don't have any visible effect. Try marking the load and store as volatile.
In general your algorithm won't work because LLVM expects all of the allocas in a function to be the first instructions. You should scan through the first basic block and find the first non-alloca instruction and insert new code there, making sure to add allocas first (as you did).
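Putting both points together, a sketch of the fixed insertion code (using the same pre-3.8-era constructors as the question; tmp and Context are assumed from the surrounding pass):
Function *F = &*tmp;
BasicBlock &entry = F->getEntryBlock();
// Skip the leading allocas; new code goes before the first non-alloca.
BasicBlock::iterator ip = entry.begin();
while (isa<AllocaInst>(&*ip))
    ++ip;
Instruction *pinst = &*ip;
Value *One = ConstantInt::get(Type::getInt32Ty(Context), 1);
// The new alloca is inserted before pinst, i.e. right after the others.
AllocaInst *pa = new AllocaInst(Type::getInt32Ty(Context), "buf", pinst);
// Volatile accesses have an observable effect, so DCE must keep them.
new StoreInst(One, pa, /*isVolatile=*/true, pinst);
new LoadInst(pa, "loadvalue", /*isVolatile=*/true, pinst);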
In the first answer here, the following was mentioned about the stack memory in C++:
When a function is called, a block is reserved on the top of the stack for local variables and some bookkeeping data.
This makes perfect sense at the top level, and makes me curious how smart compilers are when allocating this memory, given the context of this question: since braces themselves are not a stack frame in C (I assume this holds true for C++ as well), I want to check whether compilers optimize the reserved memory based on variable scopes within a single function.
In the following I'm assuming that the stack looks like this before a function call:
--------
|main()|
-------- <- stack pointer: space above it is used for current scope
| |
| |
| |
| |
--------
And then the following after invoking a function f():
--------
|main()|
-------- <- old stack pointer (osp)
| f() |
-------- <- stack pointer, variables will now be placed between here and osp upon reaching their declarations
| |
| |
| |
| |
--------
For example, given this function
void f() {
int x = 0;
int y = 5;
int z = x + y;
}
Presumably, this will just allocate 3*sizeof(int) + some extra overhead for bookkeeping.
However, what about this function:
void g() {
for (int i = 0; i < 100000; i++) {
int x = 0;
}
{
MyObject myObject[1000];
}
{
MyObject myObject[1000];
}
}
Ignoring compiler optimizations which may elide a lot of the above (since these variables really do nothing), I'm curious about the following in the second example:
For the for loop: will the stack space be large enough to fit all 100000 ints?
On top of that, will the stack space contain 1000*sizeof(MyObject) or 2000*sizeof(MyObject)?
In general: does the compiler take variable scope into account when determining how much memory it will need for the new stack frame, before invoking a certain function? If this is compiler-specific, how do some well-known compilers do it?
The compiler will allocate space as needed (typically for all items at the beginning of the function), but not for each iteration in the loop.
For example, here is what Clang produces, as LLVM-IR:
define void @_Z1gv() #0 {
%i = alloca i32, align 4
%x = alloca i32, align 4
%myObject = alloca [1000 x %class.MyObject], align 16
%myObject1 = alloca [1000 x %class.MyObject], align 16
store i32 0, i32* %i, align 4
br label %1
; <label>:1: ; preds = %5, %0
%2 = load i32, i32* %i, align 4
%3 = icmp slt i32 %2, 100000
br i1 %3, label %4, label %8
; <label>:4: ; preds = %1
store i32 0, i32* %x, align 4
br label %5
; <label>:5: ; preds = %4
%6 = load i32, i32* %i, align 4
%7 = add nsw i32 %6, 1
store i32 %7, i32* %i, align 4
br label %1
; <label>:8: ; preds = %1
ret void
}
This is the result of:
class MyObject
{
public:
int x, y;
};
void g() {
for (int i = 0; i < 100000; i++)
{
int x = 0;
}
{
MyObject myObject[1000];
}
{
MyObject myObject[1000];
}
}
So, as you can see, x is allocated only once, not 100000 times. Because only ONE of those variables will exist at any given time.
(The compiler could reuse the space for myObject[1000] for x and the second myObject[1000] - and probably would do so for an optimised build, but in that case it would also completely remove these variables as they are not used, so it wouldn't show very well)
In a modern compiler, the function is first transformed into a flow graph. On every arc of the graph, the compiler knows how many variables are live - that is to say, holding a visible value. Some of those will live in registers, and for the others the compiler will need to reserve stack space.
Things get a bit more complicated as the optimizer gets further involved, because it may prefer not to move stack variables around. That's not free.
Still, in the end the compiler has all the assembly operations ready, and can just count how many unique stack addresses are used.
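You can observe the slot sharing directly. In this small example (hedged: the standard makes no such promise, but optimized builds of GCC and Clang commonly print the same address twice) the two arrays live in disjoint scopes, so they can occupy the same stack slot:
#include <cstdio>

void g() {
    {
        char a[1000];
        std::printf("a: %p\n", static_cast<void*>(a)); // first scope
    }
    {
        char b[1000];
        std::printf("b: %p\n", static_cast<void*>(b)); // often the same address
    }
}

int main() { g(); }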
I am learning LLVM basics. I am trying to get into the builder framework and have set up the module, a function header, etc., but I have not yet been able to figure out a way to create a simple sequence like this with the builder:
%0 = 41
%1 = add i32 42, %0
Meaning how can I use the pseudo register notation through the builder framework?
I have tried to create a plus instruction based on two constants. The core line I'm using to generate the (integer) addition is:
Value *L = (Value *)m_left->Create_LLVM( );
Value *R = (Value *)m_right->Create_LLVM();
if ( L == 0 || R == 0 ) return 0;
llvm::Value *p_instruction = Get_Builder().CreateAdd( L, R, "addtmp" );
This contains lots of my own functions, but I guess the basics are clear. I get a Value pointer for the left and right operands, which are both constants, and then create an add operation with the builder framework. Again, the module and builder are set up correctly; when I call dump() I see all the other stuff I do, but this line above does not create any IR code.
I would expect it to create something like
%4 = add i32 %3, %2
or something similar. Am I misunderstanding something fundamental about the way operations are to be constructed with the builder or is it just some small oversight of some detail?
Thanks
Hard to say what you are doing wrong without the fancy Create_LLVM() functions, but in general, for adding two constants:
You have to create two ConstantInts:
auto& ctx = getGlobalContext(); // just your LLVMContext
auto* L = ConstantInt::get(Type::getInt32Ty(ctx), 41);
auto* R = ConstantInt::get(Type::getInt32Ty(ctx), 42);
auto& builder = Get_Builder();
builder.Insert(L); // just a no-op in standard builder impl
builder.Insert(R); // just a no-op in standard builder impl
builder.CreateAdd(L, R, "addtmp");
You should get:
%addtmp = add i32 41, 42
You said that your builder is set up correctly, so it will add the add at the end of the BasicBlock it currently operates on. And I assume you have already created a Function with at least one BasicBlock.
Edit:
What will give you an add instruction in any case is to create it by calling the C++ API directly, without the builder:
BinaryOperator* add = BinaryOperator::Create(Instruction::Add, L, R, "addtmp", BB);
where BB is the current BasicBlock.
To get something more sophisticated (adding to variables) the canonical way is this:
First you need some memory. The AllocaInst allocates memory on the stack:
You can use the builder for this:
auto* A = builder.CreateAlloca (Type::getInt32Ty(ctx), nullptr, "a");
auto* B = builder.CreateAlloca (Type::getInt32Ty(ctx), nullptr, "b");
For simplicity I'll just take the constants from above and store them in A and B.
To store values we need StoreInst:
builder.CreateStore (L, A, /*isVolatile=*/false);
builder.CreateStore (R, B, /*isVolatile=*/false);
For the addition we load the value from memory to a register using LoadInst:
auto* addLHS = builder.CreateLoad(A);
auto* addRHS = builder.CreateLoad(B);
Finally the addition as above:
auto* add = builder.CreateAdd(addLHS, addRHS , "add");
And with the pointer to add you can go on, e.g., returning it or storing it to another variable.
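For instance, to return the sum (this single builder call produces the ret in the IR below):
builder.CreateRet(add);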
The IR should look like this:
define i32 @foo() {
entry:
%a = alloca i32, align 4
%b = alloca i32, align 4
store i32 41, i32* %a, align 4
store i32 42, i32* %b, align 4
%0 = load i32* %a, align 4
%1 = load i32* %b, align 4
%add = add i32 %0, %1
ret i32 %add
}
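For completeness, here is a self-contained sketch that produces essentially that IR (written against the pre-opaque-pointer C++ API; details such as CreateLoad's signature shift between LLVM versions):
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

int main() {
  LLVMContext Ctx;
  Module M("demo", Ctx);
  Type *I32 = Type::getInt32Ty(Ctx);
  // int foo(): one function with a single entry block, driven by an IRBuilder.
  Function *F = Function::Create(FunctionType::get(I32, false),
                                 Function::ExternalLinkage, "foo", &M);
  IRBuilder<> B(BasicBlock::Create(Ctx, "entry", F));
  Value *A  = B.CreateAlloca(I32, nullptr, "a");
  Value *Bp = B.CreateAlloca(I32, nullptr, "b");
  B.CreateStore(ConstantInt::get(I32, 41), A);
  B.CreateStore(ConstantInt::get(I32, 42), Bp);
  Value *L = B.CreateLoad(I32, A);       // %0
  Value *R = B.CreateLoad(I32, Bp);      // %1
  B.CreateRet(B.CreateAdd(L, R, "add")); // %add, then ret
  M.print(outs(), nullptr);              // dump the textual IR
}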
Are measurable performance gains possible from using VC++'s __assume? If so, please post proof with code and benchmarks in your answer.
The sparse MSDN article on __assume: http://msdn.microsoft.com/en-us/library/1b3fsfxw(v=vs.100).aspx
Mentioned in the article is the use of __assume(0) to make switch statements faster by __assume(0)ing the default case. I measured no performance increase from using __assume(0) in that way:
void NoAssumeSwitchStatement(int i)
{
switch (i)
{
case 0:
vector<int>();
break;
case 1:
vector<int>();
break;
default:
break;
}
}
void AssumeSwitchStatement(int i)
{
switch (i)
{
case 0:
vector<int>();
break;
case 1:
vector<int>();
break;
default:
__assume(0);
}
}
int main(int argc, char* argv[])
{
const int Iterations = 1000000;
LARGE_INTEGER start, middle, end;
QueryPerformanceCounter(&start);
for (int i = 0; i < Iterations; ++i)
{
NoAssumeSwitchStatement(i % 2);
}
QueryPerformanceCounter(&middle);
for (int i = 0; i < Iterations; ++i)
{
AssumeSwitchStatement(i % 2);
}
QueryPerformanceCounter(&end);
LARGE_INTEGER cpuFrequency;
QueryPerformanceFrequency(&cpuFrequency);
cout << "NoAssumeSwitchStatement: " << (((double)(middle.QuadPart - start.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;
cout << " AssumeSwitchStatement: " << (((double)(end.QuadPart - middle.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;
return 0;
}
Rounded console output, 1000000 iterations:
NoAssumeSwitchStatement: 46ms
AssumeSwitchStatement: 46ms
Benchmarks lie. They rarely measure what you want them to. In this particular case, the methods were probably inlined, and so the __assume was simply redundant.
As to the actual question, yes, it may help. A switch is generally implemented by a jump table; by reducing the size of this table, or by removing some entries, the compiler might be able to select better CPU instructions to implement the switch.
In your extreme case, it can turn the switch into an if (i == 0) { } else { } structure, which is usually efficient.
Furthermore, trimming dead branches helps keep the code tidy, and less code means better usage of the CPU instruction cache.
However, these are micro-optimizations, and they rarely pay off: you need a profiler to point them out to you, and even then it might be difficult to know which particular transformation to make (is __assume the best?). This is expert's work.
EDIT: In action with LLVM
void foo(void);
void bar(void);
void regular(int i) {
switch(i) {
case 0: foo(); break;
case 1: bar(); break;
}
}
void optimized(int i) {
switch(i) {
case 0: foo(); break;
case 1: bar(); break;
default: __builtin_unreachable();
}
}
Note that the only difference is the presence or absence of __builtin_unreachable(), which is similar to MSVC's __assume(0).
define void @regular(i32 %i) nounwind uwtable {
switch i32 %i, label %3 [
i32 0, label %1
i32 1, label %2
]
; <label>:1 ; preds = %0
tail call void @foo() nounwind
br label %3
; <label>:2 ; preds = %0
tail call void @bar() nounwind
br label %3
; <label>:3 ; preds = %2, %1, %0
ret void
}
define void @optimized(i32 %i) nounwind uwtable {
%cond = icmp eq i32 %i, 1
br i1 %cond, label %2, label %1
; <label>:1 ; preds = %0
tail call void @foo() nounwind
br label %3
; <label>:2 ; preds = %0
tail call void @bar() nounwind
br label %3
; <label>:3 ; preds = %2, %1
ret void
}
And note here how the switch statement in regular can be optimized into a simple comparison in optimized.
This maps to the following x86 assembly:
.globl regular | .globl optimized
.align 16, 0x90 | .align 16, 0x90
.type regular,@function | .type optimized,@function
regular: | optimized:
.Ltmp0: | .Ltmp3:
.cfi_startproc | .cfi_startproc
# BB#0: | # BB#0:
cmpl $1, %edi | cmpl $1, %edi
je .LBB0_3 | je .LBB1_2
# BB#1: |
testl %edi, %edi |
jne .LBB0_4 |
# BB#2: | # BB#1:
jmp foo | jmp foo
.LBB0_3: | .LBB1_2:
jmp bar | jmp bar
.LBB0_4: |
ret |
.Ltmp1: | .Ltmp4:
.size regular, .Ltmp1-regular | .size optimized, .Ltmp4-optimized
.Ltmp2: | .Ltmp5:
.cfi_endproc | .cfi_endproc
.Leh_func_end0: | .Leh_func_end1:
Note how, in the second case:
the code is tighter (less instructions)
there is a single comparison/jump (cmpl/je) on all paths (and not one path with a single jump and a path with two)
Also note that the two are so close that I have no idea how to measure anything other than noise...
On the other hand, semantically it does indicate an intent, though perhaps an assert could be better suited for the semantics only.
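(As an aside, if you want the hint portably, a common wrapper looks roughly like this sketch; C++23 also standardizes it as std::unreachable():)
// Portable "unreachable" hint: __assume(0) on MSVC,
// __builtin_unreachable() on GCC and Clang (which defines __GNUC__).
#if defined(_MSC_VER)
  #define UNREACHABLE() __assume(0)
#elif defined(__GNUC__)
  #define UNREACHABLE() __builtin_unreachable()
#else
  #define UNREACHABLE() ((void)0)
#endif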
Seems it does make a little difference if you set the right compiler switches...
Three runs follow. No optimizations, opt for speed and opt for size.
This run has no optimizations
C:\temp\code>cl /EHsc /FAscu assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj
C:\temp\code>assume
NoAssumeSwitchStatement: 29.5321ms
AssumeSwitchStatement: 31.0288ms
This is with max optimizations (/Ox). Note that /O2 was basically identical speed-wise.
C:\temp\code>cl /Ox /EHsc /Fa assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj
C:\temp\code>assume
NoAssumeSwitchStatement: 1.33492ms
AssumeSwitchStatement: 0.666948ms
This run was to minimize code space
C:\temp\code>cl -O1 /EHsc /FAscu assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj
C:\temp\code>assume
NoAssumeSwitchStatement: 5.67691ms
AssumeSwitchStatement: 5.36186ms
Note that the output assembly code agrees with what Matthieu M. had to say when speed optimizations are used. The switch functions were called in the other cases.
After compiling the next snippet of code with clang -O2 (or with an online demo):
#include <stdio.h>
#include <stdlib.h>
int flop(int x);
int flip(int x) {
if (x == 0) return 1;
return (x+1)*flop(x-1);
}
int flop(int x) {
if (x == 0) return 1;
return (x+0)*flip(x-1);
}
int main(int argc, char **argv) {
printf("%d\n", flip(atoi(argv[1])));
}
I'm getting the following snippet of LLVM assembly in flip:
bb1.i: ; preds = %bb1
%4 = add nsw i32 %x, -2 ; <i32> [#uses=1]
%5 = tail call i32 @flip(i32 %4) nounwind ; <i32> [#uses=1]
%6 = mul nsw i32 %5, %2 ; <i32> [#uses=1]
br label %flop.exit
I thought that a tail call means dropping the current stack frame (i.e. the return will go to the upper frame, so the next instruction should be ret %5), but according to this code it will do a mul afterwards. And in the native assembly there is a simple call without tail optimisation (even with the appropriate flag for llc).
Can somebody explain why clang generates such code?
Also, I can't understand why LLVM has the tail marker at all, if it could simply check that the next ret uses the result of the previous call and then do the appropriate optimisation, or generate the native equivalent of a tail-call instruction.
Take a look at the 'call' instruction in the LLVM Assembly Language Reference Manual. It says:
The optional "tail" marker indicates that the callee function does not access any allocas or varargs in the caller. Note that calls may be marked "tail" even if they do not occur before a ret instruction.
It's likely that one of the LLVM optimization passes in Clang analyzes whether or not the callee accesses any allocas or varargs in the caller. If it doesn't, the pass marks the call as a tail call and lets another part of LLVM figure out what to do with the "tail" marker. Maybe the function can't be a real tail call right now, but after further transformations it could be. I'm guessing it's done this way to make the ordering of the passes less important.
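To illustrate the difference: the mul after the recursive call is exactly what prevents a real tail call in flip. A hedged rewrite in accumulator style (the flip_acc/flop_acc names are made up) moves the pending multiply before the call, leaving the call in tail position where the backend can turn it into a jump:
int flop_acc(int x, int acc);

// flip(x) == flip_acc(x, 1): the pending (x+1) factor is folded into the
// accumulator before recursing, so nothing is left to do after the call.
int flip_acc(int x, int acc) {
  if (x == 0) return acc;
  return flop_acc(x - 1, acc * (x + 1)); // genuine tail call
}

int flop_acc(int x, int acc) {
  if (x == 0) return acc;
  return flip_acc(x - 1, acc * x);       // genuine tail call
}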