Are measurable performance gains possible from using VC++'s __assume? - c++

Are measurable performance gains possible from using VC++'s __assume? If so, please post proof with code and benchmarks in your answer.
The sparse MSDN article on __assume: http://msdn.microsoft.com/en-us/library/1b3fsfxw(v=vs.100).aspx
The article mentions using __assume(0) in the default case to make switch statements faster. I measured no performance increase from using __assume(0) that way:
#include <windows.h>   // QueryPerformanceCounter, QueryPerformanceFrequency, LARGE_INTEGER
#include <iostream>
#include <vector>
using namespace std;

void NoAssumeSwitchStatement(int i)
{
    switch (i)
    {
    case 0:
        vector<int>();
        break;
    case 1:
        vector<int>();
        break;
    default:
        break;
    }
}

void AssumeSwitchStatement(int i)
{
    switch (i)
    {
    case 0:
        vector<int>();
        break;
    case 1:
        vector<int>();
        break;
    default:
        __assume(0);
    }
}

int main(int argc, char* argv[])
{
    const int Iterations = 1000000;
    LARGE_INTEGER start, middle, end;
    QueryPerformanceCounter(&start);

    for (int i = 0; i < Iterations; ++i)
    {
        NoAssumeSwitchStatement(i % 2);
    }

    QueryPerformanceCounter(&middle);

    for (int i = 0; i < Iterations; ++i)
    {
        AssumeSwitchStatement(i % 2);
    }

    QueryPerformanceCounter(&end);

    LARGE_INTEGER cpuFrequency;
    QueryPerformanceFrequency(&cpuFrequency);

    cout << "NoAssumeSwitchStatement: " << (((double)(middle.QuadPart - start.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;
    cout << "  AssumeSwitchStatement: " << (((double)(end.QuadPart - middle.QuadPart)) * 1000) / (double)cpuFrequency.QuadPart << "ms" << endl;

    return 0;
}
Rounded console output, 1000000 iterations:
NoAssumeSwitchStatement: 46ms
AssumeSwitchStatement: 46ms

Benchmarks lie. They rarely measure what you want them to. In this particular case, the methods were probably inlined, and so the __assume was simply redundant.
As to the actual question, yes, it may help. A switch is generally implemented by a jump table; by reducing the size of this table, or by removing some entries, the compiler might be able to select better CPU instructions to implement the switch.
In your extreme case, it can turn the switch into an if (i == 0) { } else { } structure, which is usually efficient.
Furthermore, trimming dead branches helps keep the code tidy, and less code means better usage of the CPU instruction cache.
However, those are micro-optimizations, and they rarely pay off: you need a profiler to point them out to you, and even then it might be difficult to understand the particular transformation to make (is __assume the best?). This is expert work.
EDIT: In action with LLVM
void foo(void);
void bar(void);

void regular(int i) {
  switch(i) {
  case 0: foo(); break;
  case 1: bar(); break;
  }
}

void optimized(int i) {
  switch(i) {
  case 0: foo(); break;
  case 1: bar(); break;
  default: __builtin_unreachable();
  }
}
Note that the only difference is the presence or absence of __builtin_unreachable(), which is similar to MSVC's __assume(0).
define void @regular(i32 %i) nounwind uwtable {
  switch i32 %i, label %3 [
    i32 0, label %1
    i32 1, label %2
  ]

; <label>:1                                       ; preds = %0
  tail call void @foo() nounwind
  br label %3

; <label>:2                                       ; preds = %0
  tail call void @bar() nounwind
  br label %3

; <label>:3                                       ; preds = %2, %1, %0
  ret void
}
define void @optimized(i32 %i) nounwind uwtable {
  %cond = icmp eq i32 %i, 1
  br i1 %cond, label %2, label %1

; <label>:1                                       ; preds = %0
  tail call void @foo() nounwind
  br label %3

; <label>:2                                       ; preds = %0
  tail call void @bar() nounwind
  br label %3

; <label>:3                                       ; preds = %2, %1
  ret void
}
And note here how the switch statement in regular can be optimized into a simple comparison in optimized.
This maps to the following x86 assembly:
.globl regular | .globl optimized
.align 16, 0x90 | .align 16, 0x90
.type regular,@function | .type optimized,@function
regular: | optimized:
.Ltmp0: | .Ltmp3:
.cfi_startproc | .cfi_startproc
# BB#0: | # BB#0:
cmpl $1, %edi | cmpl $1, %edi
je .LBB0_3 | je .LBB1_2
# BB#1: |
testl %edi, %edi |
jne .LBB0_4 |
# BB#2: | # BB#1:
jmp foo | jmp foo
.LBB0_3: | .LBB1_2:
jmp bar | jmp bar
.LBB0_4: |
ret |
.Ltmp1: | .Ltmp4:
.size regular, .Ltmp1-regular | .size optimized, .Ltmp4-optimized
.Ltmp2: | .Ltmp5:
.cfi_endproc | .cfi_endproc
.Leh_func_end0: | .Leh_func_end1:
Note how, in the second case:
the code is tighter (fewer instructions)
there is a single comparison/jump (cmpl/je) on all paths (instead of one path with a single jump and another path with two)
Also note how the two are so close that I have no idea how to measure anything other than noise...
On the other hand, it does semantically indicate an intent, though an assert might be better suited if the semantics are all you are after.
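For the semantic side, one common pattern (a sketch only; the UNREACHABLE macro name is made up here, not from either compiler's documentation) is to assert in debug builds and emit the optimizer hint only in release builds, so the intent is both documented and checked:

    #include <cassert>

    // Debug builds: verify the assumption. Release builds: tell the optimizer
    // the branch cannot be reached (MSVC __assume, GCC/Clang __builtin_unreachable).
    #if defined(NDEBUG)
    #  if defined(_MSC_VER)
    #    define UNREACHABLE() __assume(0)
    #  else
    #    define UNREACHABLE() __builtin_unreachable()
    #  endif
    #else
    #  define UNREACHABLE() assert(false && "unreachable")
    #endif

    void dispatch(int i)
    {
        switch (i)
        {
        case 0: /* ... */ break;
        case 1: /* ... */ break;
        default: UNREACHABLE();
        }
    }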

It seems it does make a little difference if you set the right compiler switches...
Three runs follow: no optimizations, optimized for speed, and optimized for size.
This run has no optimizations
C:\temp\code>cl /EHsc /FAscu assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj
C:\temp\code>assume
NoAssumeSwitchStatement: 29.5321ms
AssumeSwitchStatement: 31.0288ms
This is with max optimizations (/Ox). Note that /O2 was basically identical speed-wise.
C:\temp\code>cl /Ox /EHsc /Fa assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj
C:\temp\code>assume
NoAssumeSwitchStatement: 1.33492ms
AssumeSwitchStatement: 0.666948ms
This run was to minimize code space
C:\temp\code>cl -O1 /EHsc /FAscu assume.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
assume.cpp
Microsoft (R) Incremental Linker Version 10.00.40219.01
/out:assume.exe
assume.obj
C:\temp\code>assume
NoAssumeSwitchStatement: 5.67691ms
AssumeSwitchStatement: 5.36186ms
Note that the output assembly code agrees with what Matthieu M. had to say when the speed optimizations are used. In the other cases, the switch functions were actual calls.
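If you want the benchmark to measure the out-of-line switch rather than whatever the optimizer folds after inlining, one option (my sketch, not part of this answer) is to mark the two functions from the question with MSVC's __declspec(noinline):

    #include <vector>

    // Same functions as in the question, kept out of line so the switch
    // (and the __assume hint) survive into the generated code.
    __declspec(noinline) void NoAssumeSwitchStatement(int i)
    {
        switch (i)
        {
        case 0: std::vector<int>(); break;
        case 1: std::vector<int>(); break;
        default: break;
        }
    }

    __declspec(noinline) void AssumeSwitchStatement(int i)
    {
        switch (i)
        {
        case 0: std::vector<int>(); break;
        case 1: std::vector<int>(); break;
        default: __assume(0);
        }
    }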

Related

Insert a load and a store instruction before the first instruction of the first basicblock of a function

I want to insert a load and a store instruction before the first instruction of the first basic block of a function (used to simulate the performance overhead of our work). The LLVM pass is written as follows:
Value *One = llvm::ConstantInt::get(Type::getInt32Ty(Context), 1);

for (Function::iterator bb = tmp->begin(); bb != tmp->end(); ++bb) {
  // for every instruction of the block
  for (BasicBlock::iterator inst = bb->begin(); inst != bb->end(); ++inst) {
    if (inst == bb->begin() && bb == tmp->begin()) {
      BasicBlock *bp = &*bb;
      Instruction *pinst = &*inst;
      AllocaInst *pa = new AllocaInst(Int32Ty, "buf", pinst);
      StoreInst *newstore = new StoreInst(One, pa, pinst);
      LoadInst *newload = new LoadInst(pa, "loadvalue", pinst);
    }
  }
}
The inserted load and store instructions can be seen in the xx.ll file:
define i32 @fun() #0 {
entry:
  %buf = alloca i32
  %loadvalue = load i32, i32* %buf
  store i32 %loadvalue, i32* %buf
  %call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i64 0, i64 0))
  ret i32 1
}
However, the inserted instructions disappeared in the target executable file.
How can I fix this problem?
They were probably eliminated by optimization because they don't have any visible effect. Try marking the load and store as volatile.
In general your algorithm won't work because LLVM expects all of the allocas in a function to be the first instructions. You should scan through the first basic block and find the first non-alloca instruction and insert new code there, making sure to add allocas first (as you did).
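Putting both suggestions together might look roughly like this (a sketch only, written against the same older C++ API the question already uses; Context and tmp come from the question's code, and the constructor signatures differ between LLVM releases):

    // Sketch: put the new alloca at the very top of the entry block, then insert a
    // volatile store/load pair after the block's leading allocas. Volatile accesses
    // are never treated as dead, so later passes keep them.
    Function &F = *tmp;
    BasicBlock &Entry = F.getEntryBlock();

    // Find the first instruction that is not an alloca.
    BasicBlock::iterator It = Entry.begin();
    while (isa<AllocaInst>(*It))
      ++It;
    Instruction *FirstNonAlloca = &*It;

    Value *One = llvm::ConstantInt::get(Type::getInt32Ty(Context), 1);

    // Allocas go before everything else in the entry block.
    AllocaInst *Buf = new AllocaInst(Type::getInt32Ty(Context), "buf", &*Entry.begin());

    // Volatile store/load pair: the optimizer must preserve these.
    StoreInst *St = new StoreInst(One, Buf, /*isVolatile=*/true, FirstNonAlloca);
    LoadInst  *Ld = new LoadInst(Buf, "loadvalue", /*isVolatile=*/true, FirstNonAlloca);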

C++: How does the compiler know how much memory to allocate for each stack frame?

In the first answer here, the following was mentioned about the stack memory in C++:
When a function is called, a block is reserved on the top of the stack for local variables and some bookkeeping data.
This makes perfect sense at the top level, and makes me curious about how smart compilers are when allocating this memory, given the context of this question: since braces themselves are not a stack frame in C (I assume this holds true for C++ as well), I want to check whether compilers optimize reserved memory based on variable scopes within a single function.
In the following I'm assuming that the stack looks like this before a function call:
--------
|main()|
-------- <- stack pointer: space above it is used for current scope
| |
| |
| |
| |
--------
And then the following after invoking a function f():
--------
|main()|
-------- <- old stack pointer (osp)
| f() |
-------- <- stack pointer, variables will now be placed between here and osp upon reaching their declarations
| |
| |
| |
| |
--------
For example, given this function
void f() {
    int x = 0;
    int y = 5;
    int z = x + y;
}
Presumably, this will just allocate 3*sizeof(int) + some extra overhead for bookkeeping.
However, what about this function:
void g() {
    for (int i = 0; i < 100000; i++) {
        int x = 0;
    }
    {
        MyObject myObject[1000];
    }
    {
        MyObject myObject[1000];
    }
}
Ignoring compiler optimizations that may elide a lot of the above (since these variables really do nothing), I'm curious about the following in the second example:
For the for loop: will the stack space be large enough to fit all 100000 ints?
On top of that, will the stack space contain 1000*sizeof(MyObject) or 2000*sizeof(MyObject)?
In general: does the compiler take variable scope into account when determining how much memory it will need for the new stack frame, before invoking a certain function? If this is compiler-specific, how do some well-known compilers do it?
The compiler will allocate space as needed (typically for all items at the beginning of the function), but not for each iteration in the loop.
For example, what Clang produces, as LLVM-IR
define void @_Z1gv() #0 {
  %i = alloca i32, align 4
  %x = alloca i32, align 4
  %myObject = alloca [1000 x %class.MyObject], align 16
  %myObject1 = alloca [1000 x %class.MyObject], align 16
  store i32 0, i32* %i, align 4
  br label %1

; <label>:1:                                      ; preds = %5, %0
  %2 = load i32, i32* %i, align 4
  %3 = icmp slt i32 %2, 100000
  br i1 %3, label %4, label %8

; <label>:4:                                      ; preds = %1
  store i32 0, i32* %x, align 4
  br label %5

; <label>:5:                                      ; preds = %4
  %6 = load i32, i32* %i, align 4
  %7 = add nsw i32 %6, 1
  store i32 %7, i32* %i, align 4
  br label %1

; <label>:8:                                      ; preds = %1
  ret void
}
This is the result of:
class MyObject
{
public:
    int x, y;
};

void g() {
    for (int i = 0; i < 100000; i++)
    {
        int x = 0;
    }
    {
        MyObject myObject[1000];
    }
    {
        MyObject myObject[1000];
    }
}
So, as you can see, x is allocated only once, not 100000 times, because only ONE of those variables exists at any given time.
(The compiler could reuse the space for myObject[1000] for x and the second myObject[1000] - and probably would do so for an optimised build, but in that case it would also completely remove these variables as they are not used, so it wouldn't show very well)
In a modern compiler, the function is first transformed to a flow graph. In every arc of the flow, the compiler knows how many variables are live - that is to say holding a visible value. Some of those will live in registers, and for the others the compiler will need to reserve stack space.
Things get a bit more complicated as the optimizer gets further involved, because it may prefer not to move stack variables around. That's not free.
Still, in the end the compiler has all the assembly operations ready, and can just count how many unique stack addresses are used.
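If you want to see whether your particular compiler reuses the slot for the two disjoint scopes, a quick sketch (entirely implementation-specific; nothing in the standard guarantees the result) is to print the addresses of the two arrays. With many compilers, even at -O0, both blocks report the same address:

    #include <cstdio>

    struct MyObject { int x, y; };

    int main() {
        {
            MyObject a[1000];
            std::printf("first  block: %p\n", static_cast<void*>(a));  // address of the first array
        }
        {
            MyObject b[1000];
            std::printf("second block: %p\n", static_cast<void*>(b));  // often the same address: the slot was reused
        }
    }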

LLVM: Prevent deletion of relevant SDNodes by DAGCombiner::recursivelyDeleteUnusedNodes

I was challenged with a rather educational task: to extend LLVM in the following way:
Add a register XACC and instructions LRXACC(), SRXACC(arg1), XMAC(arg1,arg2) to the SPARC back-end. The instructions do the following:
LRXACC: load value from XACC,
SRXACC: write to XACC,
XMAC: XACC+=(arg1*arg2)>>31
Provide builtins in the Clang front-end for all three of them.
My source for testing is:
int main() {
    int acc = 0;
    __builtin___srxacc(acc);
    __builtin___xmac(12345, 6789);
    acc = __builtin___lrxacc();
    return 0;
}
I was able to support conversion of the builtins into intrinsic functions. The IR file I get from clang looks fine:
define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  %acc = alloca i32, align 4
  store i32 0, i32* %retval
  store i32 0, i32* %acc, align 4
  %0 = load i32* %acc, align 4
  call void @llvm.sparc.srxacc(i32 %0)
  call void @llvm.sparc.xmac(i32 12345, i32 6789)
  %1 = call i32 @llvm.sparc.lrxacc()
  store i32 %1, i32* %acc, align 4
  ret i32 0
}
The issue appears during the DAG combining step, and the final output code looks like this:
        .text
        .file   "file22.ll"
        .globl  main
        .align  4
        .type   main,#function
main:                                   ! @main
! BB#0:                                 ! %entry
        add %sp, -104, %sp
        st %g0, [%sp+100]
        st %g0, [%sp+96]
        lrxacc %xacc, %o0
        st %o0, [%sp+96]
        sethi 0, %o0
        retl
        add %sp, 104, %sp
.Ltmp0:
        .size   main, .Ltmp0-main
        .ident  "clang version 3.6.0 (trunk)"
DAGCombiner deletes the srxacc and xmac instructions as redundant. (In the ::Run method it checks whether a node is use_empty() and deletes it if so.)
The combiner does that because they store their result in a register, so the graph does not show that one of them depends on the other.
I would appreciate any suggestions on how to avoid removal of my instructions.
Thank you!
Edit
To simplify and concretize: instructions that are represented in IR code like void @llvm.sparc.srxacc(i32 %0) look to the combiner as if they don't affect the computation, and the corresponding SDNodes receive an empty UseList. How do I get around that?
You may use chain tokens to represent a control dependency between SDNodes. This way you can add a fake dependency between two instructions even if the second one doesn't consume any output of the first one.
You may use CopyToReg and CopyFromReg nodes to cope with predefined physical registers.
You may use Glue to glue several simple instructions into a complex pseudo-instruction.
Consider the following simple example compiled for x86:
int foo(int aa, int bb) {
    return aa / bb;
}
You may also want to investigate a more complicated example, though the DAG picture is too big to post here (you can view the DAG with the -view-sched-dags option):
void foo(int aa, int bb, int *x, int *y, int *z) {
    *x = aa / bb;
    *y = aa % bb;
    *z = aa * bb;
}
Take a look at this post too.
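For the srxacc case specifically, the usual shape of the fix might look roughly like this (a rough sketch only; SP::XACC and Intrinsic::sparc_srxacc are names from the question's own exercise, not stock LLVM, and the surrounding lowering hook is assumed): lower the intrinsic into a CopyToReg of the physical register and return the resulting chain, so the node stays reachable from the DAG root.

    // Sketch: lower llvm.sparc.srxacc (an ISD::INTRINSIC_VOID node) into a copy
    // to the XACC physical register. Returning the new chain keeps the node
    // live, so DAGCombiner cannot drop it as unused.
    static SDValue LowerSRXACC(SDValue Op, SelectionDAG &DAG) {
      SDLoc DL(Op);
      SDValue Chain = Op.getOperand(0);   // operand 0 of INTRINSIC_VOID is the chain
      SDValue Val   = Op.getOperand(2);   // operand 1 is the intrinsic ID, operand 2 the argument
      // SP::XACC is the register added for this exercise (not part of stock LLVM).
      return DAG.getCopyToReg(Chain, DL, SP::XACC, Val);
    }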

Which one will execute faster, if (flag==0) or if (0==flag)?

Interview question: Which one will execute faster, if (flag==0) or if (0==flag)? Why?
I haven't seen any correct answer yet (and there are already some). Caveat: Nawaz did point out the user-defined trap. And I regret my hastily cast upvote on "stupidest question", because it seems that many did not get it right, and it gives room for a nice discussion on compiler optimization :)
The answer is:
What is flag's type?
In the case where flag is actually a user-defined type, it depends on which overload of operator== is selected. Of course it may seem stupid that they would not be symmetric, but it's certainly allowed, and I have seen other abuses already.
If flag is a built-in type, then both should be the same speed.
From the Wikipedia article on x86, I'd bet on a Jxx instruction for the if statement: perhaps a JNZ (Jump if Not Zero) or some equivalent.
I doubt the compiler would miss such an obvious optimization, even with optimizations turned off. This is the type of thing that peephole optimization is designed for.
EDIT: Sprang up again, so let's add some assembly (LLVM 2.7 IR)
int regular(int c) {
    if (c == 0) { return 0; }
    return 1;
}

int yoda(int c) {
    if (0 == c) { return 0; }
    return 1;
}
define i32 @regular(i32 %c) nounwind readnone {
entry:
  %not. = icmp ne i32 %c, 0               ; <i1> [#uses=1]
  %.0 = zext i1 %not. to i32              ; <i32> [#uses=1]
  ret i32 %.0
}

define i32 @yoda(i32 %c) nounwind readnone {
entry:
  %not. = icmp ne i32 %c, 0               ; <i1> [#uses=1]
  %.0 = zext i1 %not. to i32              ; <i32> [#uses=1]
  ret i32 %.0
}
Even if one does not know how to read the IR, I think it is self explanatory.
Same code for amd64 with GCC 4.1.2:
        .loc 1 4 0      # int f = argc;
        movl    -20(%rbp), %eax
        movl    %eax, -4(%rbp)
        .loc 1 6 0      # if( f == 0 ) {
        cmpl    $0, -4(%rbp)
        jne     .L2
        .loc 1 7 0      # return 0;
        movl    $0, -36(%rbp)
        jmp     .L4
        .loc 1 8 0      # }
.L2:
        .loc 1 10 0     # if( 0 == f ) {
        cmpl    $0, -4(%rbp)
        jne     .L5
        .loc 1 11 0     # return 1;
        movl    $1, -36(%rbp)
        jmp     .L4
        .loc 1 12 0     # }
.L5:
        .loc 1 14 0     # return 2;
        movl    $2, -36(%rbp)
.L4:
        movl    -36(%rbp), %eax
        .loc 1 15 0     # }
        leave
        ret
There will be no difference in your versions.
I'm assuming that the type of flag is not a user-defined type; rather, it's some built-in type. (Enums are an exception: you can treat an enum as if it were built-in, since its values are of a built-in type.)
If it's a user-defined type (other than an enum), then the answer depends entirely on how you've overloaded operator==. Note that you would have to overload == by defining two functions, one for each of your versions!
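For illustration (a hypothetical Flag wrapper type, not from this answer): a member overload handles flag == 0, and a separate free function is needed for 0 == flag:

    // Hypothetical wrapper type, only to show why two overloads are needed.
    struct Flag {
        int value;

        // selected for: flag == 0
        bool operator==(int rhs) const { return value == rhs; }
    };

    // selected for: 0 == flag
    bool operator==(int lhs, const Flag &flag) { return flag.value == lhs; }

    int main() {
        Flag flag{0};
        bool a = (flag == 0);  // calls the member operator==
        bool b = (0 == flag);  // calls the free operator==
        return a && b ? 0 : 1;
    }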
There is absolutely no difference.
You might gain points in answering that interview question by referring to the elimination of assignment/comparison typos, though:
if (flag = 0) // typo here
{
    // code never executes
}

if (0 = flag) // typo and syntactic error -> compiler complains
{
    // ...
}
While it's true that e.g. a C compiler does warn in the case of the former (flag = 0), there are no such warnings in PHP, Perl, JavaScript or <insert language here>.
There will be absolutely no difference speed-wise. Why should there be?
Well, there is a difference when flag is a user-defined type:
struct sInt
{
    sInt( int i ) : wrappedInt(i)
    {
        std::cout << "ctor called" << std::endl;
    }

    operator int()
    {
        std::cout << "operator int()" << std::endl;
        return wrappedInt;
    }

    bool operator==(int nComp)
    {
        std::cout << "bool operator==(int nComp)" << std::endl;
        return (nComp == wrappedInt);
    }

    int wrappedInt;
};
int _tmain(int argc, _TCHAR* argv[])
{
    sInt s(0);

    // in this case this will probably be faster
    if ( 0 == s )
    {
        std::cout << "equal" << std::endl;
    }

    if ( s == 0 )
    {
        std::cout << "equal" << std::endl;
    }
}
In the first case (0==s) the conversion operator is called and then the returned result is compared to 0. In the second case the == operator is called.
When in doubt benchmark it and learn the truth.
They should be exactly the same in terms of speed.
Notice however that some people put the constant on the left in equality comparisons (the so-called "Yoda conditionals") to avoid the errors that arise if you write = (assignment operator) instead of == (equality comparison operator); since assigning to a literal triggers a compilation error, this kind of mistake is avoided.
if(flag=0) // <--- typo: = instead of ==; flag is now set to 0
{
// this is never executed
}
if(0=flag) // <--- compiler error, cannot assign value to literal
{
}
On the other hand, most people find "Yoda conditionals" weird-looking and annoying, especially since the class of errors they prevent can also be spotted by using adequate compiler warnings.
if(flag=0) // <--- warning: assignment in conditional expression
{
}
As others have said, there is no difference.
0 has to be evaluated. flag has to be evaluated. This process takes the same time, no matter which side each is placed on.
The right answer would be: they're both the same speed.
Even the expressions if(flag==0) and if(0==flag) have the same number of characters! If one of them were written as if(flag== 0), then the compiler would have one extra space to parse, so you would have a legitimate reason to point at compile time.
But since there is no such thing, there is absolutely no reason why one should be faster than the other. If there is a reason, then the compiler is doing some very, very strange things to the generated code...
Which one is faster depends on which version of == you are using. Here's a snippet that uses two possible implementations of ==; depending on whether you write x1 == 0 or 0 == x1, one of the two is selected.
If you are just using a POD this really shouldn't matter when it comes to speed.
#include <iostream>
using namespace std;

class x {
public:
    bool operator==(int n) { cout << "hello\n"; return 0; }
    friend bool operator==(int n, const x& a) { cout << "world\n"; return 0; }
};

int main()
{
    x x1;
    //int m = 0;
    int k = (x1 == 0);  // calls the member operator==  -> "hello"
    int j = (0 == x1);  // calls the friend operator==  -> "world"
}
Well, I agree completely with everything said in the comments to the OP; for the exercise's sake:
If the compiler is not clever enough (indeed, you should not use it then) or optimization is disabled, x == 0 could compile to a native assembly jump-if-zero instruction, while 0 == x could be a more generic (and costly) comparison of numeric values.
Still, I wouldn't like to work for a boss who thinks in these terms...
Surely no difference in terms of execution speeds. The condition needs to be evaluated in both cases in the same way.
I think the best answer is "what language is this example in"?
The question did not specify the language and it's tagged both 'C' and 'C++'. A precise answer needs more information.
It's a lousy programming question, but it could be a good one in the devious "let's give the interviewee enough rope to either hang himself or build a tree swing" department. The problem with those kinds of questions is they usually get written down and handed down from interviewer to interviewer until they reach people who don't really understand it from all the angles.
Build two simple programs using the suggested ways.
Compile them to assembly. Look at the assembly and you can judge for yourself, but I doubt there is a difference!
Interviews are getting lower than ever.
Just as an aside (I actually think any decent compiler will make this question moot, since it will optimise it), using 0 == flag over flag == 0 does prevent the typo where you forget one of the = signs (i.e. if you accidentally type flag = 0 it will compile, but 0 = flag will not), which I think is a mistake everyone has made at one point or another...
If there were any difference at all, what would stop the compiler from choosing the faster one?
So logically, there can't be any difference. Probably this is what the interviewer expects. It is actually a brilliant question.

tail call generated by clang 1.1 and 1.0 (llvm 2.7 and 2.6)

After compiling the next snippet of code with clang -O2 (or with the online demo):
#include <stdio.h>
#include <stdlib.h>

int flop(int x);

int flip(int x) {
    if (x == 0) return 1;
    return (x+1)*flop(x-1);
}

int flop(int x) {
    if (x == 0) return 1;
    return (x+0)*flip(x-1);
}

int main(int argc, char **argv) {
    printf("%d\n", flip(atoi(argv[1])));
}
I get the following snippet of LLVM assembly in flip:
bb1.i:                                            ; preds = %bb1
  %4 = add nsw i32 %x, -2                         ; <i32> [#uses=1]
  %5 = tail call i32 @flip(i32 %4) nounwind       ; <i32> [#uses=1]
  %6 = mul nsw i32 %5, %2                         ; <i32> [#uses=1]
  br label %flop.exit
I thought that a tail call means dropping the current stack frame (i.e. the return will go to the upper frame, so the next instruction should be ret %5), but according to this code it will do a mul on the result. And in the native assembly there is a simple call without tail optimisation (even with the appropriate flag for llc).
Can somebody explain why clang generates such code?
Also, I can't understand why LLVM has the tail marker at all, if it could simply check that the next ret uses the result of the previous call and then do the appropriate optimisation or generate the native equivalent of a tail-call instruction.
Take a look at the 'call' instruction in the LLVM Assembly Language Reference Manual. It says:
The optional "tail" marker indicates that the callee function does not access any allocas or varargs in the caller. Note that calls may be marked "tail" even if they do not occur before a ret instruction.
It's likely that one of the LLVM optimization passes in Clang analyzes whether or not the callee accesses any allocas or varargs in the caller. If it doesn't, the pass marks the call as a tail call and lets another part of LLVM figure out what to do with the "tail" marker. Maybe the function can't be a real tail call right now, but after further transformations it could be. I'm guessing it's done this way to make the ordering of the passes less important.
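Note also that in flip the recursive call is not in tail position in the source to begin with: the multiplication by (x+1) still has to run after flop returns, so no amount of optimization can turn that particular call into a plain jump. An accumulator-passing rewrite (my sketch, not from the question) puts the calls into true tail position:

    #include <stdio.h>
    #include <stdlib.h>

    static int flop_acc(int x, int acc);

    /* Accumulator-passing rewrite: each recursive call is the last thing the
       function does, so the compiler is free to turn it into a jump. */
    static int flip_acc(int x, int acc) {
        if (x == 0) return acc;
        return flop_acc(x - 1, acc * (x + 1));   /* tail position */
    }

    static int flop_acc(int x, int acc) {
        if (x == 0) return acc;
        return flip_acc(x - 1, acc * (x + 0));   /* tail position */
    }

    int main(int argc, char **argv) {
        printf("%d\n", flip_acc(atoi(argv[1]), 1));
        return 0;
    }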