llvm no optimization for certain instructions - llvm

I am inserting some instructions into llvm IR to do some Integer Overflow(IO) checking. However, I am annoyed to find some of my instructions inserted get optimized by the optimizer.
For example, when there is
int result = data + 1;
My pass would see this operation as a potential overflow site and add(IR of course, I wrote C here to make my life easier):
int64_t result_64bit = (int64_t)data + 1
if (result_64bit > 2147483647) { report(); }
However, this instruction gets optimized out.
Here's what I tried:
add -O0 flag. It won't work. I have dumped IR after every pass and that happened on pass Remove redundant operation. (Not Redundant from my perspective...)
Move the pass to the end of optimization by setting EP_OptimizerLast flag.
Well, this is not optimal as some information gets lost. For example, I heavily rely on nsw/nuw flag to decide signs, some of these flags gets removed after optimizations.
So what I am asking is if there is some kind of instruction guard in LLVM that guards an instruction against been optimized or removed?

Related

How good is the Visual Studio compiler at branch-prediction for simple if-statements?

Here is some c++ pseudo-code as an example:
bool importantFlag = false;
for (SomeObject obj : arr) {
if (obj.someBool) {
importantFlag = true;
}
obj.doSomethingUnrelated();
}
Obviously, once the if-statement evaluates as true and runs the code inside, there is no reason to even perform the check again since the result will be the same either way. Is the compiler smart enough to recognize this or will it continue checking the if-statement with each loop iteration and possibly redundantly assign importantFlag to true again? This could potentially have a noticeable impact on performance if the number of loop iterations is large, and breaking out of the loop is not an option here.
I generally ignore these kinds of situations and just put my faith into the compiler, but it would be nice to know exactly how it handles these kinds of situations.
Branch-prediction is a run-time thing, done by the CPU not the compiler.
The relevant optimization here would be if-conversion to a very cheap branchless flag |= obj.someBool;.
Ahead-of-time C++ compilers make machine code for the CPU to run; they aren't interpreters. See also Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” and How to remove "noise" from GCC/clang assembly output? for more about looking at optimized compiler-generated asm.
I guess what you're suggesting could be making a 2nd version of the loop that doesn't look at the bool, and converting the if() into an if() goto to set the flag once and then run the other version of the loop from this point onward. That would likely not be worth it, since a single OR instruction is so cheap if other members of the same object are already getting accessed.
But it's a plausible optimization; however I don't think compilers would typically do it for you. You can of course do it manually, although you'd have to iterate manually instead of using a range-for, because you want to use the same iterator to start part-way through the range.
Branch likelihood estimation at compile time is a thing compilers do to figure out whether branchy or branchless code is appropriate, e.g. gcc optimization flag -O3 makes code slower than -O2 uses CMOV for a case that looks unpredictable, but when run on sorted data is actually very predictable. My answer there shows the asm real-world compilers make; note that they don't multi-version the loop, although that wouldn't be possible in that case if the compiler didn't know about the data being sorted.
Also to guess which side of a branch is more likely, so they can lay out the fast path with fewer taken branches. That's what the C++20 [[likely]] / [[unlikely]] hints are for, BTW, not actually influencing run-time branch prediction. Except on some CPUs, indirectly via static prediction the first time a CPU sees a branch. Or a few ISAs, like PowerPC and MIPS, have "branch-likely" instructions with actual run-time hints for the CPU which compilers might or might not actually use even if available. See
How do the likely/unlikely macros in the Linux kernel work and what is their benefit? - They influence branch layout, making the "likely" path a straight line (branches usually not-taken) for I-cache locality and contiguous fetch.
Is there a compiler hint for GCC to force branch prediction to always go a certain way?
If you expect to have a large data set you could just have two for loops, the first of them breaking when the importantFlag is set to true. It's hard to know specifically what optimizations the compiler will make since it's not well documented.
Peter Cordes has already given a great answer.
I'd also like to mention shortcircuiting
In this example
if( importantFlag || some_expensive_check() ) {
importantFlag = true;
}
Once important Flag is set to true, the expensive check will never be performed, since the || stops at the first true.

Is it good practice to construct long circuit statements?

Question Context: [C++] I want to know what is theoretically the fastest, and what the compiler will do. I don't want to hear about premature optimization is the root of all evil, etc.
I was writing some code like this:
bool b0 = ...;
bool b1 = ...;
if (b0 && b1)
{
...
}
But then I was thinking: the code, as-is, will compile into two TEST instructions, if compiled without optimizations. This means two branches. So I was thinking that it might be better to write:
if (b0 & b1)
Which will produce only one TEST instruction, if no optimization is done by the compiler. But then I feel that this is against my code-style. I usually write && and ||.
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast). Will the compiler automatically compile it like &, even if I have used a && in the code? And what is theoretically faster? Does the behavior change if I do this:
if (b0 && b1)
{ ... }
else if (b0)
{ ... }
else if (b1)
{ ... }
else
{ ... }
Q: As I could have guessed, this is very depended on the situation, but is it a common trick for a compiler to replace a && with a &?
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast).
Most likely nothing more to increase the optimization.
As stated in my comments, you really can't optimize the evaluation any further than:
AND B0 WITH B1 (sets condition flags)
JUMP ZERO TO ...
Although, if you have a lot of simple boolean logic or data operations, some processors may conditionally execute them.
Will the compiler automatically compile it like &, even if I have used a && in the code?
And what is theoretically faster?
In most platforms, there is no difference in evaluation of A & B versus A && B.
In the final evaluation, either a compare or an AND instruction is executed, then a jump based on the status. Two instructions.
Most processors don't have Boolean registers. It's all numbers and bits.
Optimize By Boolean Logic
Your best option is to review the design and set up your algorithms to use Boolean algebra. You can than simplify the Boolean expressions.
Another option is to implement the code so that the compiler can generate conditional assembly instructions, if the platform supports them.
Optimize: Reduce jumps
Processors favor arithmetic and data transfers over jumps.
Many processors are always feeding an instruction pipeline. When it comes to a conditional branch instruction, the processor has to wait (suspend the instruction prefetching) until the condition status is determined. Then it can determine where the next instruction will be fetched.
If you can't remove the jumps, such as in a loop, make the ratio of data processing to jumping bigger in the data side. Search for "Loop Unrolling". Many compilers will perform this when optimization levels are increased.
Optimize: Data Cache
You may notice increased performance by organizing your data for best data cache usage.
For example, instead of 3 large arrays, use one array of a structure containing 3 elements. This allows the elements in use to be close to each other (and reduce the likelihood of accessing data outside of the cache).
Summary
The difference in evaluation of A && B versus A & B as conditional expressions is known as a micro-optimization. You will achieve improved performance by using Boolean algebra to reduce the quantity of conditional expressions. Jumps, or changes in execution path, slow down instruction execution. Fetching data outside of the data cache also slows down execution. You will most likely get better performance by redesigning your code and helping the compiler to reduce the branches and more effective use of the data cache.
If you care about what's fastest, why do you care what the compiler will do without optimisation?
Q: As I could have guessed, this is very depended on the situation, but is it a common trick for a compiler to replace a && with a &?
This question seems to assume that the compiler transforms C++ code into more C++ code. It doesn't. It transforms your code into machine instructions (including the assembler as part of the compiler for argument's sake). You should not assume there is a one-to-one mapping from a C++ operator like && or & to a particular instruction.
With optimisation the compiler will do whatever it thinks will be faster. If a single instruction would be faster the compiler will generate a single instruction for if (b0 && b1), you don't need to bugger up your code with micro-optimisations to help it make such a simple transformation.
The compiler knows the instruction set it's using, it knows the context the condition is in and whether it can be removed entirely as dead code, or moved elsewhere to help the pipeline, or simplified by constant propagation, etc. etc.
And if you really care about what's fastest, why would you compute b1 until you know it's actually needed? If obtaining the value of b1 has no side effects the compiler could even transform your code to:
bool b0 = ...;
if (b0)
{
bool b1 = ...;
if (b1)
{
Does that mean two if conditions are faster than a &?! Of course not.
In other words, the whole premise of the question is flawed. Do not compromise the readability and simplicity of your code in the misguided pursuit of the "theoretically fastest" micro-optimisation. Spend your time improving the algorithms and data structures used not trying to second guess which instructions the compiler will generate.

C++: Why does this speed my code up?

I have the following function
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
int image_hash_key,
Mat* preloaded_images,
int* random_values){
int first_pixel_row = patch_top_left_row + random_values[0];
int first_pixel_col = patch_top_left_col + random_values[1];
int second_pixel_row = patch_top_left_row + random_values[2];
int second_pixel_col = patch_top_left_col + random_values[3];
int channel = random_values[4];
Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
Which is called about one and a half million times with different values for patch_top_left_row and patch_top_left_col. This takes about 2 seconds to run, now when I change the calculation of first_pixel_row etc to not use the arguments but hard coded numbers instead (shown below), the thing runs sub second and I don't know why. Is the compiler doing something smart here ( I am using gcc cross compiler)?
double single_channel_add(int patch_top_left_row, int patch_top_left_col,
int image_hash_key,
Mat* preloaded_images,
int* random_values){
int first_pixel_row = 5 + random_values[0];
int first_pixel_col = 6 + random_values[1];
int second_pixel_row = 8 + random_values[2];
int second_pixel_col = 10 + random_values[3];
int channel = random_values[4];
Vec3b* first_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(first_pixel_row, first_pixel_col);
Vec3b* second_pixel_bgr = preloaded_images[image_hash_key].ptr<Vec3b>(second_pixel_row, second_pixel_col);
return (*first_pixel_bgr)[channel] + (*second_pixel_bgr)[channel];
}
EDIT:
I have pasted the assembly from the two versions of the function
using arguments: http://pastebin.com/tpCi8c0F
using constants: http://pastebin.com/bV0d7QH7
EDIT:
After compiling with -O3 I get the following clock ticks and speeds:
using arguments: 1990000 ticks and 1.99seconds
using constants: 330000 ticks and 0.33seconds
EDIT:
using argumenst with -03 compilation: http://pastebin.com/fW2HCnHc
using constant with -03 compilation: http://pastebin.com/FHs68Agi
On the x86 platform there are instructions that very quickly add small integers to a register. These instructions are the lea (aka 'load effective address') instructions and they are meant for computing address offsets for structures and the like. The small integer being added is actually part of the instruction. Smart compilers know that these instructions are very quick and use them for addition even when addresses are not involved.
I bet if you changed the constants to some random value that was at least 24 bits long that you would see much of the speedup disappear.
Secondly those constants are known values. The compiler can do a lot to arrange for those values to end up in a register in the most efficient way possible. With an argument, unless the argument is passed in a register (and I think your function has too many arguments for that calling convention to be used) the compiler has no choice but to fetch the number from memory using a stack offset load instruction. That isn't a particularly slow instruction or anything, but with constants the compiler is free to do something much faster than may involve simply fetching the number from the instruction itself. The lea instructions are simply the most extreme example of this.
Edit: Now that you've pasted the assembly things are much clearer
In the non-constant code, here is how the add is done:
addl -68(%rbp), %eax
This fetches a value from the stack an offset -68(%rpb) and adds it to the %eax% register.
In the constant code, here is how the add is done:
addl $5, %eax
and if you look at the actual numbers, you see this:
0138 83C005
It's pretty clear that the constant being added is encoded directly into the instruction as a small value. This is going to be much faster to fetch than fetching a value from a stack offset for a number of reasons. First it's smaller. Secondly, it's part of an instruction stream with no branches. So it will be pre-fetched and pipelined with no possibility for cache stalls of any kind.
So while my surmise about the lea instruction wasn't correct, I was still on the right track. The constant version uses a small instruction specifically oriented towards adding a small integer to a register. The non-constant version has to fetch an integer that may be of indeterminate size (so it has to fetch ALL the bits, not just the low ones) from a stack offset (which adds in an additional add to compute the actual address from the offset and stack base address).
Edit 2: Now that you've posted the -O3 results
Well, it's much more confusing now. It's apparently inlined the function in question and it jumps around a whole ton between the code for the inlined function and the code for the calling function. I'm going to need to see the original code for the whole file to make a proper analysis.
But what I strongly suspect is happening now is that the unpredictability of the values retrieved from get_random_number_in_range is severely limiting the optimization options available to the compiler. In fact, it looks like in the constant version it doesn't even bother to call get_random_number_in_range because the value is tossed out and never used.
I'm assuming that the values of patch_top_left_row and patch_top_left_col are generated in a loop somewhere. I would push this loop into this function. If the compiler knows the values are generated as part of a loop, there are a very large number of optimization options open to it. In the extreme case it could use some of the SIMD instructions that are part of the various SSE or 3dnow! instruction suites to make things a whole ton faster than even the version you have that uses constants.
The other option would be to make this function inline, which would hint to the compiler that it should try inserting it into the loop in which it's called. If the compiler takes the hint (this function is a bit largish, so the compiler might not) it will have much the same effect as if you'd stuffed the loop into the function.
Well, binary arithmetic operations of immediate constant vs. memory format are expected to produce faster code than the ones of memory vs. memory format, but the timing effect you observe appears to be too extreme, especially considering that there are other operations inside that function.
Could it be that the compiler decided to inline your function? Inlining would allow the compiler to easily eliminate everything related to the unused patch_top_left_row and patch_top_left_col parameters in the second version, including any steps that prepare/calculate these parameters in the calling code.
Technically, this can be done even if the function is not inlined, but it is generally more complicated.

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to concider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instruction. This means I have to choose an operation to wouldn't be performed only once due to compiler optimization (we can't use -o0 flag as school staff does not).
Bad example: var = i; - the compiler would only perform the last command.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S don't know if it matters but I write in CPP
1) This sounds (to me) like an impossible task, if optimizations are (or might be) enabled. You can never be sure on what the compiler will do during optimizations. I'd definitely do something like reusing the previous result. If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
In general, if this answer goes into the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.

Force compiler to not optimize side-effect-less statements

I was reading some old game programming books and as some of you might know, back in that day it was usually faster to do bit hacks than do things the standard way. (Converting float to int, mask sign bit, convert back for absolute value, instead of just calling fabs(), for example)
Nowadays is almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.
But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:
void float_to_int(float f)
{
int i = static_cast<int>(f); // has no side-effects
}
Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.
The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.
I'm using Visual Studio 2008 / G++ (3.4.4).
Edit
To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that with this the statements with no side-effect will be optimized out, hence the situation.
Edit Again
To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.
We all know that the old tricks aren't very useful anymore, I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern day CPU's, but it never hurts to know.
So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.
Premature quoting of Knuth is the root of all annoyance.
Assignment to a volatile variable shold never be optimized away, so this might give you the result you want:
static volatile int i = 0;
void float_to_int(float f)
{
i = static_cast<int>(f); // has no side-effects
}
So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements
You are by definition skewing the results.
Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.
In this case I suggest you make the function return the integer value:
int float_to_int(float f)
{
return static_cast<int>(f);
}
Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.
extern int float_to_int(float f)
int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);
Now compare this to an empty function like:
int take_float_return_int(float /* f */)
{
return 1;
}
Which should also be external.
The difference in times should give you an idea of the expense of what you're trying to measure.
What always worked on all compilers I used so far:
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
note that this skews results, boith methods should write to writeMe.
volatile tells the compiler "the value may be accessed without your notice", thus the compiler cannot omit the calculation and drop the result. To block propagiation of input constants, you might need to run them through an extern volatile, too:
extern volatile float readMe = 0;
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
int main()
{
readMe = 17;
float_to_int(readMe);
}
Still, all optimizations inbetween the read and the write can be applied "with full force". The read and write to the global variable are often good "fenceposts" when inspecting the generated assembly.
Without the extern the compiler may notice that a reference to the variable is never taken, and thus determine it can't be volatile. Technically, with Link Time Code Generation, it might not be enough, but I haven't found a compiler that agressive. (For a compiler that indeed removes the access, the reference would need to be passed to a function in a DLL loaded at runtime)
Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, if the code behaves as if no optimisation takes place. However, you can often trick them into not doing so if you indicate that value might be used later, so I would change your code to:
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
As others have suggested, you will need to examine the assemnler output to check that this approach actually works.
You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.
These quite clearly state that casting between float and int is a really bad idea because it requires a store from the int register to memory followed by a load into a float register. These operations cause a bubble in the pipeline and waste many precious cycles.
a function call incurs quite a bit of overhead, so I would remove this anyway.
adding a dummy += i; is no problem, as long as you keep this same bit of code in the alternate profile too. (So the code you are comparing it against).
Last but not least: generate asm code. Even if you can not code in asm, the generated code is typically understandable since it will have labels and commented C code behind it. So you know (sortoff) what happens, and which bits are kept.
R
p.s. found this too:
inline float pslNegFabs32f(float x){
__asm{
fld x //Push 'x' into st(0) of FPU stack
fabs
fchs //change sign
fstp x //Pop from st(0) of FPU stack
}
return x;
}
supposedly also very fast. You might want to profile this too. (although it is hardly portable code)
Return the value?
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.
You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.
If you are using Microsoft's compiler - cl.exe, you can use the following statement to turn optimization on/off on a per-function level [link to doc].
#pragma optimize("" ,{ on |off })
Turn optimizations off for functions defined after the current line:
#pragma optimize("" ,off)
Turn optimizations back on:
#pragma optimize("" ,on)
For example, in the following image, you can notice 3 things.
Compiler optimizations flag is set - /O2, so code will get optimized.
Optimizations are turned off for first function - square(), and turned back on before square2() is defined.
Amount of assembly code generated for 1st function is higher. In second function there is no assembly code generated for int i = num; statement in code.
Thus while 1st function is not optimized, the second function is.
See https://godbolt.org/z/qJTBHg for link to this code on compiler explorer.
A similar directive exists for gcc too - https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
A micro-benchmark around this statement will not be representative of using this approach in a genuine scenerio; the surrounding instructions and their affect on the pipeline and cache are generally as important as any given statement in itself.
GCC 4 does a lot of micro-optimizations now, that GCC 3.4 has never done. GCC4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin(), fabs(), etc., as well as optimizing such calls to their FPU, SSE or 3D Now! equivalents.
I know the Intel compiler is also extremely good at these kinds of optimizations.
My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.
Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.
All of this is due to the fact that newer CPUs are pipelined - instructions are decoded and dispatched to fast computation units. Instructions no longer run in terms of clock cycles, and are more sensitive to cache misses and inter-instruction dependencies.
Check the AMD and Intel processor programming manuals for all the gritty details.