Effect of goto on C++ compiler optimization

What are the performance benefits or penalties of using goto with a modern C++ compiler?
I am writing a C++ code generator and use of goto will make it easier to write. No one will touch the resulting C++ files so don't get all "goto is bad" on me. As a benefit, they save the use of temporary variables.
I was wondering, from a purely compiler-optimization perspective, what effect goto has on the compiler's optimizer. Does it make code faster, slower, or generally no different in performance compared to using temporaries / flags?

The part of a compiler that would be affected works with a flow graph. The syntax you use to create a particular flow graph will normally be irrelevant as long as you're writing strictly portable code: if you create something like a while loop using a goto instead of an actual while statement, it's going to produce the same flow graph as if you used the syntax for a while loop. Using non-portable code, however, modern compilers allow you to add annotations to loops to predict whether they'll be taken or not. Depending on the compiler, you may or may not be able to duplicate that extra information using a goto (but most compilers that have annotations for loops also have them for if statements, so a likely-taken or likely-not-taken annotation on the if controlling a goto would generally have the same effect as a similar annotation on the corresponding loop).
It is possible, however, to produce a flow graph with gotos that couldn't be produced by any normal flow control statements (loops, switch, etc.), such as conditionally jumping directly into the middle of a loop, depending on the value in a global. In such a case, you may produce an irreducible flow graph, and when/if you do, that will often limit the ability of the compiler to optimize the code.
In other words, if (for example) you took code that was written with normal for, while, switch, etc., and converted it to use goto in every case, but retained the same structure, almost any reasonably modern compiler could probably produce essentially identical code either way. If, however, you use gotos to produce the mess of spaghetti like some of the FORTRAN I had to look at decades ago, then the compiler probably won't be able to do much with it.

How do you think loops are represented at the assembly level?
Using jump instructions to labels...
Many compilers will actually use jumps even in their Intermediate Representation:
int loop(int* i) {
    int result = 0;
    while (*i) {
        result += *i;
    }
    return result;
}

int jump(int* i) {
    int result = 0;
    while (true) {
        if (not *i) { goto end; }
        result += *i;
    }
end:
    return result;
}
Yields in LLVM:
define i32 @_Z4loopPi(i32* nocapture %i) nounwind uwtable readonly {
  %1 = load i32* %i, align 4, !tbaa !0
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %3, label %.lr.ph..lr.ph.split_crit_edge
.lr.ph..lr.ph.split_crit_edge:  ; preds = %.lr.ph..lr.ph.split_crit_edge, %0
  br label %.lr.ph..lr.ph.split_crit_edge
; <label>:3  ; preds = %0
  ret i32 0
}
define i32 @_Z4jumpPi(i32* nocapture %i) nounwind uwtable readonly {
  %1 = load i32* %i, align 4, !tbaa !0
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %3, label %.lr.ph..lr.ph.split_crit_edge
.lr.ph..lr.ph.split_crit_edge:  ; preds = %.lr.ph..lr.ph.split_crit_edge, %0
  br label %.lr.ph..lr.ph.split_crit_edge
; <label>:3  ; preds = %0
  ret i32 0
}
Where br is the branch instruction (a conditional jump).
All optimizations are performed on this structure. So, goto is the bread and butter of optimizers.

I was wondering, from a purely compiler-optimization perspective, what effect gotos have on the compiler's optimizer. Does it make code faster, slower, or generally no different in performance compared to using temporaries / flags?
Why do you care? Your primary concern should be getting your code generator to create the correct code. Efficiency is of much less importance than correctness. Your question should be "Will my use of gotos make my generated code more likely or less likely to be correct?"
Look at the code generated by lex/yacc or flex/bison. That code is chock full of gotos. There's a good reason for that. lex and yacc implement finite state machines. Since the machine goes to another state at state transitions, the goto is arguably the most natural tool for such transitions.
There is a simple way to eliminate those gotos in many cases by using a while loop around a switch statement. This is structured code. Per Douglas Jones (Jones D. W., How (not) to code a finite state machine, SIGPLAN Not. 23, 8 (Aug. 1988), 19-22.), this is the worst way to encode a FSM. He argues that a goto-based scheme is better.
He also argues that there is an even better approach, which is to convert your FSM to a control flow graph using graph-theoretic techniques. That's not always easy; it is an NP-hard problem. That's why you still see a lot of FSMs, particularly auto-generated FSMs, implemented as either a loop around a switch or with state transitions implemented via gotos.

I agree heartily with David Hammen's answer, but I only have one point to add.
When people are taught about compilers, they are taught about all the wonderful optimizations that compilers can do.
They are not taught that the actual value of this depends on who the user is.
If the code you are writing (or generating) and compiling contains very few function calls and could itself consume a large fraction of some other program's time, then yes, compiler optimization matters.
If the code being generated contains function calls, or if for some other reason the program counter spends a small fraction of its time in the generated code, it's not worth worrying about.
Why? Because even if that code could be so aggressively optimized that it took zero time, it would save no more than that small fraction, and there are probably much bigger performance issues, which the compiler can't fix, that are happily evading your attention.

Related

How LLVM mem2reg pass works

mem2reg is an important optimization pass in LLVM. I want to understand how this optimization works, but I haven't found good articles, books, tutorials, or similar material on it.
I found these two links:
https://blog.katastros.com/a?ID=01300-3d6589c1-1993-4fb1-8975-939f10c20503
https://www.zzzconsulting.se/2018/07/16/llvm-exercise.html
Both links explain that one can use Cytron's classical SSA algorithm to implement this pass, but reading the original paper I didn't see how alloca instructions are converted to registers.
As alloca is an instruction specific to LLVM IR, I wonder if the algorithm that converts alloca instructions to registers is an ad-hoc algorithm that only works for LLVM, or if there is a general theoretical framework, whose name I just don't know yet, that explains how to promote memory variables to register variables.
On the official documentation, it says:
mem2reg: This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.
By this description, it seems that one just needs to check whether all users of the variable are load and store instructions, and if so, it can be promoted to a register.
So if you can link me to articles, books, algorithms and so on, I would appreciate it.
"Efficiently Computing Static Single Assignment Form and the Control Dependence Graph" by Cytron, Ferrante et al.
The alloca instruction is named after the alloca() function in C, the rarely used counterpart to malloc() that allocates memory on the stack instead of the heap, so that the memory disappears when the function returns without needing to be explicitly freed.
In the paper, figure 2 shows an example of "straight-line code":
V ← 4
← V + 5
V ← 6
← V + 7
That text isn't valid LLVM IR, if we wanted to rewrite the same example in pre-mem2reg LLVM IR, it would look like this:
; V ← 4
store i32 4, i32* %V.addr
; ← V + 5
%tmp1 = load i32* %V.addr
%tmp2 = add i32 %tmp1, 5
store i32 %tmp2, i32* %V.addr
; V ← 6
store i32 6, i32* %V.addr
; ← V + 7
%tmp3 = load i32* %V.addr
%tmp4 = add i32 %tmp3, 7
store i32 %tmp4, i32* %V.addr
It's easy enough to see in this example how you could always replace %tmp1 with i32 4 using store-to-load forwarding, but you can't always remove the final store. Knowing nothing about %V.addr means that we must assume it may be used elsewhere.
If you know that %V.addr is an alloca, you can simplify a lot of things. Because you can see the alloca instruction, you can see all of its uses, so you never have to worry that a store through some other pointer might alias %V.addr. You can see the allocation, so you know its alignment. You know pointer accesses to it cannot fault anywhere in the function, and there is no free() equivalent for an alloca that anyone could call. If you see a store that isn't loaded before the end of the function, you can eliminate it as a dead store, since you know the alloca's lifetime ends when the function returns. Finally, you may delete the alloca itself once you've removed all its uses.
Thus if you start with an alloca whose only users are loads and stores, the problem has little in common with most memory optimization problems: there's no possibility of pointer aliasing nor concern about faulting memory accesses. What you do need to do is place the ϕ functions (phi instructions in LLVM) in the right places given the control flow graph, and that's what the paper describes how to do.

Pre evaluate LLVM IR

Let's suppose we have expressions like:
%rem = srem i32 %i.0, 10
%mul = mul nsw i32 %rem, 2
%i.0 is an llvm::PHINode whose bounds I can get.
The question is: is there a way to get the value of %mul at compile time? I'm writing an LLVM pass and I need to evaluate some expressions which use %i.0. I'm searching for a function, class, or something else to which I can give a value for %i.0 and which will evaluate the expression and return the result.
You could clone the code (the containing function or the entire module, depending on how much context you need), then replace %i.0 with a constant value, run the constant propagation pass on the code, and finally check whether %mul is assigned to a constant value and if so, extract it.
It's not elegant, but I think it would work. Just pay attention to:
Make sure %mul is not elided out - for example, return it from the function, or store its value to memory, or something.
Be aware constant propagation assumes some things about the code, in particular that it already passed through mem2reg.

Julia/LLVM Efficient Division of Integer Numbers with Integer Result

I ran into a basic type stability issue where dividing two Integers will produce some concrete type of AbstractFloat.
typeof(60 * 5 / 60)
> Float64
Now this is the safe thing to do, but it incurs runtime overhead converting to a float.
What if we know that division will always result in a number with remainder 0, ie. an Integer?
We can use either:
div(60 * 5 , 60)
fld(60 * 5 , 60)
Which gives us some concrete type of Integer, however this approach still has overhead which we can see from the LLVM IR:
@code_llvm div(60 * 5, 60)
So is there any magic we can do to remove the runtime overhead when we know that the result will not have a remainder?
Possible Solution Paths:
I would prefer this be solved using a Julia construct, even if we need to create it, rather than injecting LLVM IR... But then again, we could just wrap that injection into a Julia function...
Or maybe we need a macro like @inbounds for safe integer division resulting in an integer.
Or maybe there is some purely mathematical way to perform this that applies to any language?
Integer division is one of the slowest cache-independent operations on a CPU; indeed, floating-point division is faster on most CPUs (test it yourself and see). If you know what you'll be dividing by in advance (and want to divide by it many times), it can be worth precomputing factors that allow you to replace integer-division with multiplication/shift/add. There are many sites that describe this basic idea, here's one.
For an implementation in julia, see
https://gist.github.com/simonster/a3b691e71cc2b3826e39
You're right — there is a little bit of overhead in the div function, but it's not because there may be a remainder. It's because div(typemin(Int),-1) is an error, as is div(x, 0). So the overhead you're seeing in @code_llvm is just the checks for those cases. The LLVM instruction that you want is just sdiv i64 %0, %1… and the processor will even throw a SIGFPE in those error conditions. We can use llvmcall to create our own "overhead-free" version:
julia> unsafe_div(x::Int64,y::Int64) = Base.llvmcall("""
%3 = sdiv i64 %0, %1
ret i64 %3""", Int64, Tuple{Int64, Int64}, x, y)
unsafe_div (generic function with 1 method)
julia> unsafe_div(8,3)
2
julia> @code_llvm unsafe_div(8,3)
define i64 @julia_unsafe_div_21585(i64, i64) {
top:
  %2 = sdiv i64 %0, %1
  ret i64 %2
}
julia> unsafe_div(8,0)
ERROR: DivideError: integer division error
in unsafe_div at none:1
So if that works, why does Julia insist on inserting those checks into the LLVM IR itself? It's because LLVM considers those error cases to be undefined behavior within its optimization passes. So if LLVM can ever prove that it would error through static analysis, it changes its output to skip the division (and subsequent exception) entirely! This custom div function is indeed unsafe:
julia> f() = unsafe_div(8,0)
f (generic function with 2 methods)
julia> f()
13315560704
julia> @code_llvm f()
define i64 @julia_f_21626() {
top:
  ret i64 undef
}
On my machine (an old Nehalem i5), this unsafe version can speed div up by about 5-10%, so the overhead here isn't really all that terrible relative to the inherent cost of integer division. As @tholy notes, it's still really slow compared to almost all other CPU operations, so if you're frequently dividing by the same number, you may want to investigate the alternatives in his answer.

Find loop termination condition variable

I want to find the variable that is used to check for termination of a loop.
For example, in the loop below I should get %n:
for.body8:  ; preds = %for.body8.preheader, %for.body8
  %i.116 = phi i32 [ %inc12, %for.body8 ], [ 0, %for.body8.preheader ]
  %inc12 = add nsw i32 %i.116, 1
  .....
  %6 = load i32* %n, align 4, !tbaa !0
  %cmp7 = icmp slt i32 %inc12, %6
  br i1 %cmp7, label %for.body8, label %for.end13.loopexit
Is there any direct method to get this value?
One way I can do it is by iterating over the instructions and checking for the icmp instruction, but I don't think that's a proper method.
Please suggest a method.
Thanks in advance.
While there is no way to do this for general loops, it is possible to find this out in some cases. In LLVM there is a pass called '-indvars: Canonicalize Induction Variables' which is described as
This transformation analyzes and transforms the induction variables
(and computations derived from them) into simpler forms suitable for
subsequent analysis and transformation.
This transformation makes the following changes to each loop with an
identifiable induction variable:
All loops are transformed to have a single canonical induction variable which starts at zero and steps by one.
The canonical induction variable is guaranteed to be the first PHI node in the loop header block.
Any pointer arithmetic recurrences are raised to use array subscripts.
If the trip count of a loop is computable, this pass also makes the
following changes:
The exit condition for the loop is canonicalized to compare the induction value against the exit value. This turns loops like:
for (i = 7; i*i < 1000; ++i)
into
for (i = 0; i != 25; ++i)
Any use outside of the loop of an expression derived from the indvar is changed to compute the derived value outside of the loop,
eliminating the dependence on the exit value of the induction
variable. If the only purpose of the loop is to compute the exit value
of some derived expression, this transformation will make the loop
dead.
This transformation should be followed by strength reduction after all
of the desired loop transformations have been performed. Additionally,
on targets where it is profitable, the loop could be transformed to
count down to zero (the "do loop" optimization).
and sounds like it does just what you need.
Unfortunately there is no general solution to this. Your question is an instance of the Halting Problem, proven to have no general solution: http://en.wikipedia.org/wiki/Halting_problem
If you're able to cut the problem space down to something extremely simple, using a subset of operations that is not Turing-complete (http://en.wikipedia.org/wiki/Turing-complete), you may be able to come up with a solution. However, there is no general solution.

How to see lowered C++

I'm trying to improve my understanding of how C++ actually works. Is there a way to see how the compiler lowers my code into something simpler? For example, I'd like to see how all the copy constructors are called, how overloaded function calls have been resolved, all the template expansion and instantiation complete, etc. Right now I'm learning about how C++ compilers interpret my code through experimentation, but it'd be nice just to see a lowered form of my code, even if it is very ugly. I'm looking for something analogous to g++ -E, which shows the result of the preprocessor, but for C++.
Edit: I should have added that I'm not looking for a disassembler. There's a huge gulf between C++ source code and assembled code. Inside this gulf are complicated things like template meta-programming and all sorts of implicit calls to operator methods (assignments! casts! constructors! ...) as well as heavily overloaded functions with very complicated resolution rules, etc. I'm looking for tools to help me understand how my code is interpreted by the C++ compiler. Right now, the only thing I can do is try little experiments and piecemeal put together an understanding of what the compiler is doing. I'd like to see more detail on what's going on. It would help greatly, for example, in debugging template metaprogramming problems.
At the moment, I think that your best bet is Clang (you can try some simple code on the Try Out LLVM page).
When compiling C, C++ or Obj-C with Clang/LLVM, you may ask the compiler to emit the Intermediate Representation (LLVM IR) instead of going the full way to assembly/binary form.
The LLVM IR is a fully specified language used internally by the compiler:
Clang lowers the C++ code to LLVM IR
LLVM optimizes the IR
An LLVM backend (for example x86) produces the assembly from the IR
The IR is the last step before machine-specific code, so you don't have to learn specific assembly directives and you still get a very low-level representation of what's really going on under the hood.
You can get the IR both before and after optimizations; the latter is more representative of real code, but further away from what you originally wrote.
Example with a C program:
#include <stdio.h>
#include <stdlib.h>
static int factorial(int X) {
    if (X == 0) return 1;
    return X * factorial(X - 1);
}

int main(int argc, char **argv) {
    printf("%d\n", factorial(atoi(argv[1])));
}
Corresponding IR:
; ModuleID = '/tmp/webcompile/_10956_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00"
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
; <label>:0
  %1 = getelementptr inbounds i8** %argv, i64 1
  %2 = load i8** %1, align 8, !tbaa !0
  %3 = tail call i64 @strtol(i8* nocapture %2, i8** null, i32 10) nounwind
  %4 = trunc i64 %3 to i32
  %5 = icmp eq i32 %4, 0
  br i1 %5, label %factorial.exit, label %tailrecurse.i
tailrecurse.i:  ; preds = %tailrecurse.i, %0
  %indvar.i = phi i32 [ %indvar.next.i, %tailrecurse.i ], [ 0, %0 ]
  %accumulator.tr1.i = phi i32 [ %6, %tailrecurse.i ], [ 1, %0 ]
  %X.tr2.i = sub i32 %4, %indvar.i
  %6 = mul nsw i32 %X.tr2.i, %accumulator.tr1.i
  %indvar.next.i = add i32 %indvar.i, 1
  %exitcond = icmp eq i32 %indvar.next.i, %4
  br i1 %exitcond, label %factorial.exit, label %tailrecurse.i
factorial.exit:  ; preds = %tailrecurse.i, %0
  %accumulator.tr.lcssa.i = phi i32 [ 1, %0 ], [ %6, %tailrecurse.i ]
  %7 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %accumulator.tr.lcssa.i) nounwind
  ret i32 0
}
declare i32 @printf(i8* nocapture, ...) nounwind
declare i64 @strtol(i8*, i8** nocapture, i32) nounwind
!0 = metadata !{metadata !"any pointer", metadata !1}
!1 = metadata !{metadata !"omnipotent char", metadata !2}
!2 = metadata !{metadata !"Simple C/C++ TBAA", null}
I personally find it relatively readable (it tries to preserve the variable names somewhat, and the function names are still there) once you get past the initial discovery of the language.
The first C++ compiler was cfront, which was, as the name implies, a front-end for C; in theory, cfront's output is what you'd like to see. But cfront hasn't been available for many years; it was a commercial product, and the source is not available.
Modern C++ compilers don't use a C intermediary; if there's an intermediary at all, it's an internal compiler representation, not something you'd enjoy looking at! The -S option to g++ will spit out *.s files: assembly code, which includes just enough symbols that you could, in theory, follow it.
The very first (circa 1989) C++ compilers compiled C++ into C. But that's not been true for a very long time — very long time meaning I know of no widely available compiler that did things that way in the last 15 years. The best you are going to do is look at the assembly language output, which requires some amount of knowledge and analysis to understand.
The assembly level output of a C++ compiler is generally not called 'lowered'. It's called 'compiled'. I can understand how you might come by that terminology. Assembly is a lower-level language. But that's not the terminology everybody else uses and it will confuse people if you use it.
Most popular C++ compilers have an option somewhere that allows you to see the assembly level output. The Open Source g++ compiler has the -S option that does this. It will create a file that ends in .s. You can look through this file to see the resulting assembly language.
In order for the assembly language to more directly correspond to the C++ code I would recommend compiling with the -O0 option to turn off optimization. The results of optimization can result in assembly code that bears little or no obvious resemblance to the original C++ code. Though viewing that code can help you understand what the optimizer is doing.
Another problem is that the symbols (the names for functions and classes and things) in the assembly output will be what is called 'mangled'. This is because most assembly languages do not allow :: as part of the symbol name, and because C++ can also have the same names for different kinds of symbols. The compiler transforms the names of things in your C++ code into different names that will be valid in the assembly code.
For g++ this mangling can be undone with the c++filt program.
c++filt <myprogram.s >myprogram_demangled.s
This will help make the assembly file more readable.
First, you can preprocess it (this is the first step the compiler actually performs before compiling) with cpp or g++ -E.
The second step is to parse and translate it, with g++ -S.
This link about the compilation process might interest you.
You can run g++ (or any gcc front-end) with one or more of the -fdump-tree- flags (complete list), which will dump intermediate representations of the code from different compiler passes in an output format that looks similar to C. However, this output is usually quite hard to read, with lots of compiler-generated temporary variables and other artifacts of compilation. It's mainly intended for debugging the compiler itself, but for simple examples you might be able to infer what gcc is doing with your C++ code by studying the intermediate representation.
The Comeau C++ compiler generates C code. But you'll have to pay for it.
Instead of doing experimentation, you can use a debugger and step through the flow of your code. This way you can easily see which constructors are called and how overloaded function calls are actually resolved.