As discussed in this question, C++11 allows compilers to optimize endless loops away.
However, in embedded devices which have a single purpose, endless loops make sense and are actually quite often used. Even a completely empty while(1); is useful for a watchdog-assisted reset. Terminating but empty loops can also be useful in embedded development.
Is there an elegant way to specifically tell the compiler to not remove empty or endless loops, without disabling optimization altogether?
One of the requirements for a loop to be removed (as mentioned in that question) is that it
does not access or modify volatile objects
So,
void wait_forever(void)
{
volatile int i = 1;
while (i) ;
}
should do the trick, although I would certainly verify this by looking at the disassembly of a program produced with your particular toolchain.
A function like this would be a good candidate for GCC's noreturn attribute as well.
void wait_forever(void) __attribute__ ((noreturn));
void wait_forever(void)
{
volatile int i = 1;
while (i) ;
}
int main(void)
{
if (something_bad_happened)
wait_forever();
}
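Since C++11, the standard [[noreturn]] attribute expresses the same intent portably. A minimal sketch, assuming a C++11 compiler:
// Standard C++11 alternative to the GCC-specific attribute.
[[noreturn]] void wait_forever(void)
{
    volatile int i = 1;
    while (i) ; // the volatile read keeps the loop alive
}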
Before you down-vote or start saying that goto is evil and obsolete, please read the justification of why it is viable in this case. Before you mark it as a duplicate, please read the full question.
I was reading about virtual machine interpreters when I stumbled across computed gotos. Apparently they allow a significant performance improvement of certain pieces of code. The best-known example is the main VM interpreter loop.
Consider a (very) simple VM like this:
#include <iostream>
enum class Opcode
{
HALT,
INC,
DEC,
BIT_LEFT,
BIT_RIGHT,
RET
};
int main()
{
Opcode program[] = { // an example program that returns 10
Opcode::INC,
Opcode::BIT_LEFT,
Opcode::BIT_LEFT,
Opcode::BIT_LEFT,
Opcode::INC,
Opcode::INC,
Opcode::RET
};
int result = 0;
for (Opcode instruction : program)
{
switch (instruction)
{
case Opcode::HALT:
break;
case Opcode::INC:
++result;
break;
case Opcode::DEC:
--result;
break;
case Opcode::BIT_LEFT:
result <<= 1;
break;
case Opcode::BIT_RIGHT:
result >>= 1;
break;
case Opcode::RET:
std::cout << result;
return 0;
}
}
}
All this VM can do is a few simple operations on one number of type int and print it. In spite of its doubtful usefulness, it illustrates the subject nonetheless.
The critical part of the VM is obviously the switch statement in the for loop. Its performance is determined by many factors, of which the most important ones are most certainly branch prediction and the action of jumping to the appropriate point of execution (the case labels).
There is room for optimization here. In order to speed up the execution of this loop, one might use so-called computed gotos.
Computed Gotos
Computed gotos are a construct well known to Fortran programmers and to those using a certain (non-standard) GCC extension. I do not endorse the use of any non-standard, implementation-defined, or (obviously) undefined behavior. However, to illustrate the concept in question, I will use the syntax of the mentioned GCC extension.
In standard C++ we are allowed to define labels that can later be jumped to by a goto statement:
goto some_label;
some_label:
do_something();
Doing this isn't considered good code (and for a good reason!). Although there are good arguments against using goto (most of which relate to code maintainability), there is an application for this abominated feature: the improvement of performance.
Using a goto statement can be faster than a function invocation. This is because of the amount of "paperwork", like setting up the stack frame and returning a value, that has to be done when invoking a function. Meanwhile, a goto can sometimes be compiled to a single jmp assembly instruction.
To exploit the full potential of goto, an extension to the GCC compiler was made that allows goto to be more dynamic. That is, the label to jump to can be determined at run time.
This extension allows one to obtain a label pointer, similar to a function pointer, and goto it:
void* label_ptr = &&some_label;
goto *label_ptr;
some_label:
do_something();
This is an interesting concept that allows us to further enhance our simple VM. Instead of using a switch statement, we will use an array of label pointers (a so-called jump table) and then goto the appropriate one (the opcode will be used to index the array):
// Courtesy of Eli Bendersky
// This code is licensed under the Unlicense
int interp_cgoto(unsigned char* code, int initval) {
/* The indices of labels in the dispatch_table are the relevant opcodes
*/
static void* dispatch_table[] = {
&&do_halt, &&do_inc, &&do_dec, &&do_mul2,
&&do_div2, &&do_add7, &&do_neg};
#define DISPATCH() goto *dispatch_table[code[pc++]]
int pc = 0;
int val = initval;
DISPATCH();
while (1) {
do_halt:
return val;
do_inc:
val++;
DISPATCH();
do_dec:
val--;
DISPATCH();
do_mul2:
val *= 2;
DISPATCH();
do_div2:
val /= 2;
DISPATCH();
do_add7:
val += 7;
DISPATCH();
do_neg:
val = -val;
DISPATCH();
}
}
This version is about 25% faster than the one that uses a switch (the one on the linked blog post, not the one above). This is because there is only one jump performed after each operation, instead of two.
Control flow with switch: if we execute, say, Opcode::FOO and then Opcode::SOMETHING, two jumps are performed after each instruction is executed. The first one is back to the switch code and the second is to the actual instruction.
In contrast, if we go with an array of label pointers (as a reminder, they are non-standard), there is only one jump: each instruction dispatches directly to the next one.
It is worthwhile to note that in addition to saving cycles by doing fewer operations, we also enhance the quality of branch prediction by eliminating the additional jump.
Now, we know that by using an array of label pointers instead of a switch we can improve the performance of our VM significantly (by about 20%). I figured that maybe this could have some other applications too.
I came to the conclusion that this technique could be used in any program that has a loop which sequentially and indirectly dispatches some logic. A simple example of this (apart from the VM) could be invoking a virtual method on every element of a container of polymorphic objects:
std::vector<Base*> objects;
objects = get_objects();
for (auto object : objects)
{
object->foo();
}
Now, this has many more applications.
There is one problem, though: there is no such thing as label pointers in standard C++. As such, the question is: is there a way to simulate the behaviour of computed gotos in standard C++ that can match them in performance?
Edit 1:
There is yet another downside to using the switch. I was reminded of it by user1937198. It is bounds checking. In short, the compiler checks whether the value of the variable inside the switch matches any of the cases. This adds redundant branching (the check is mandated by the standard).
Edit 2:
In response to cmaster, I will clarify my idea for reducing the overhead of virtual function calls. A dirty approach to this would be to have an id in each derived instance representing its type, which would be used to index the jump table (label pointer array); a rough sketch follows after this list. The problem is that:
There are no jump tables in standard C++.
It would require us to modify all jump tables when a new derived class is added.
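For illustration, a minimal sketch of that dirty approach in standard C++; all the names here (Kind, Shape, Circle, Square, foo_circle, foo_square) are hypothetical:
#include <vector>

enum class Kind { Circle, Square }; // hand-maintained type id

struct Shape {
    Kind kind; // set by each derived constructor
    explicit Shape(Kind k) : kind(k) {}
};

struct Circle : Shape { Circle() : Shape(Kind::Circle) {} };
struct Square : Shape { Square() : Shape(Kind::Square) {} };

void foo_circle(Shape&) { /* circle-specific logic */ }
void foo_square(Shape&) { /* square-specific logic */ }

void run(std::vector<Shape*>& objects) {
    using Fn = void (*)(Shape&);
    // The "jump table": indexed by the type id instead of going
    // through a vtable. It must be edited whenever a derived class
    // is added, which is exactly the maintenance problem listed above.
    static const Fn table[] = { foo_circle, foo_square };
    for (Shape* object : objects)
        table[static_cast<int>(object->kind)](*object);
}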
I would be thankful if someone came up with some type of template magic (or a macro as a last resort) that would allow writing this in a cleaner, more extensible, and automated way.
On recent versions of MSVC, the key is to give the optimizer the hints it needs so that it can tell that just indexing into the jump table is a safe transform. There are two constraints on the original code that prevent this, and thus make optimizing to the code generated by the computed-label code an invalid transform.
First, in the original code, if the program counter runs past the end of the program, the loop exits. In the computed-label code, undefined behavior (dereferencing an out-of-range index) is invoked. Thus the compiler has to insert a check for this, causing it to generate a basic block for the loop header rather than inlining it in each switch block.
Second, in the original code, the default case is not handled. While the switch covers all enum values, and thus it is undefined behavior for no branch to match, the MSVC optimizer is not intelligent enough to exploit this, so it generates a default case that does nothing. Checking this default case requires a conditional, as it handles a large range of values. The computed-goto code invokes undefined behavior in this case as well.
The solution to the first issue is simple: don't use a C++ range-for loop; use a while loop or a for loop with no condition. The solution for the second unfortunately requires platform-specific code to tell the optimizer the default is undefined behavior, in the form of __assume(0); but something analogous is present in most compilers (__builtin_unreachable() in Clang and GCC), and it can be conditionally compiled to nothing when no equivalent is present, without any correctness issues.
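For instance, a hedged sketch of how that hint could be wrapped portably (the macro name is made up):
// Hypothetical portability wrapper for the "unreachable" hint.
#if defined(_MSC_VER)
    #define VM_UNREACHABLE() __assume(0)
#elif defined(__GNUC__) || defined(__clang__)
    #define VM_UNREACHABLE() __builtin_unreachable()
#else
    #define VM_UNREACHABLE() ((void)0) /* no hint, still correct */
#endif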
So the result of this is:
#include <iostream>
enum class Opcode
{
HALT,
INC,
DEC,
BIT_LEFT,
BIT_RIGHT,
RET
};
int run(Opcode* program) {
int result = 0;
for (int i = 0; true;i++)
{
auto instruction = program[i];
switch (instruction)
{
case Opcode::HALT:
break;
case Opcode::INC:
++result;
break;
case Opcode::DEC:
--result;
break;
case Opcode::BIT_LEFT:
result <<= 1;
break;
case Opcode::BIT_RIGHT:
result >>= 1;
break;
case Opcode::RET:
std::cout << result;
return 0;
default:
__assume(0);
}
}
}
The generated assembly can be verified on Godbolt.
I'm looking for a compiler flag that will allow me to prevent the compiler optimising away the loop in code like this:
#include <cstdio>
#include <memory>

void func() {
    std::unique_ptr<int> up1(new int(0)), up2;
    up2 = std::move(up1);
    for(int i = 0; i < 1000000000; i++) {
        if(up2) {
            *up2 += 1;
        }
    }
    if(up2)
        printf("%d", *up2);
}
in both C++ and Rust code. I'm trying to compare similar sections of code in terms of speed, and running this loop rather than just evaluating the overall result is important. Since Rust statically guarantees that ownership of the pointer hasn't been moved, it doesn't need the null-pointer check on each iteration of the loop, and I would therefore imagine it would produce faster code, if the loop couldn't be optimised out for whatever reason.
Rust compiles using an LLVM backend, so I would preferably be using that for C++ as well.
In Rust you can use test::black_box.
In C++ (using GCC or Clang), you can use asm volatile("" : "+r" (datum));. See this answer.
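A minimal sketch of that inline-asm trick, assuming GCC or Clang (function name hypothetical): the empty asm statement claims to read and modify datum, so the optimizer must actually compute it on every iteration.
int keep_loop(int n) {
    int datum = 0;
    for (int i = 0; i < n; i++) {
        datum += 1;
        asm volatile("" : "+r"(datum)); // opaque to the optimizer
    }
    return datum;
}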
One typical way to avoid having the compiler optimize away loops is to make their bounds indeterminate at compile time. In this example, rather than hard-coding the bound of 1000000000, loop up to a count which is read from stdin or argv.
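A hedged sketch of that approach; note that a sufficiently clever compiler may still fold simple arithmetic into a closed form, so verify the generated assembly:
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    // The bound comes from argv, so it is unknown at compile time
    // and the loop cannot simply be deleted.
    long n = (argc > 1) ? std::strtol(argv[1], nullptr, 10) : 1000000000L;
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i;
    std::printf("%ld\n", sum);
    return 0;
}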
Please refer to section 41.2.2 (Instruction Reordering) of "TCPL", 4th edition, by B. Stroustrup, which I transcribe below:
To gain performance, compilers, optimizers, and hardware reorder instructions. Consider:
// thread 1:
int x;
bool x_init;
void init()
{
x = initialize(); // no use of x_init in initialize()
x_init = true;
// ...
}
For this piece of code there is no stated reason to assign to x before assigning to x_init. The optimizer (or the hardware instruction scheduler) may decide to speed up the program by executing x_init = true first. We probably meant for x_init to indicate whether x had been initialized by initializer() or not. However, we did not say that, so the hardware, the compiler, and the optimizer do not know that.
Add another thread to the program:
// thread 2:
extern int x;
extern bool x_init;
void f2()
{
int y;
while (!x_init) // if necessary, wait for initialization to complete
this_thread::sleep_for(milliseconds{10});
y = x;
// ...
}
Now we have a problem: thread 2 may never wait and thus will assign an uninitialized x to y. Even if thread 1 did not set x_init and x in "the wrong order", we still may have a problem. In thread 2, there are no assignments to x_init, so an optimizer may decide to lift the evaluation of !x_init out of the loop, so that thread 2 either never sleeps or sleeps forever.
Does the Standard allow the reordering in thread 1? (A quotation from the Standard would be welcome.) Why would that speed up the program?
Both answers in this discussion on SO seem to indicate that no such optimization occurs when there are global variables in the code, as x_init above.
What does the author mean by "to lift the evaluation of !x_init out of the loop"? Is this something like this?
if( !x_init ) while(true) this_thread::sleep_for(milliseconds{10});
y = x;
This is not so much an issue of the C++ compiler/standard, but of modern CPUs. Have a look here. The compiler isn't going to emit memory barrier instructions between the assignments of x and x_init unless you tell it to.
For what it is worth, prior to C++11, the standard had no notion of multithreading in its abstract machine model. Things are a bit nicer these days.
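For completeness, a minimal sketch of the C++11 fix for the example above: making x_init an atomic with release/acquire ordering both prevents the reordering and stops the compiler from hoisting the test out of the loop. The value 42 stands in for the book's initialize() call.
#include <atomic>
#include <chrono>
#include <thread>

int x;
std::atomic<bool> x_init{false};

void init()
{
    x = 42; // stands in for initialize()
    x_init.store(true, std::memory_order_release); // publishes x
}

void f2()
{
    while (!x_init.load(std::memory_order_acquire)) // re-read every iteration
        std::this_thread::sleep_for(std::chrono::milliseconds{10});
    int y = x; // guaranteed to see the value stored in init()
    (void)y;
}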
The C++11 standard does not "allow" or "prevent" reordering as such. It specifies ways to force specific "barriers" that, in turn, prevent the compiler from moving instructions across them. A compiler, in this example, might reorder the assignments because that might be more efficient on a CPU with multiple computing units (ALUs/hyper-threading/etc.), even with a single core. Typically, if your CPU has two ALUs that can work in parallel, there is no reason the compiler would not try to feed them with as much work as it can.
I'm not speaking of the out-of-order reordering of CPU instructions that's done internally in Intel CPUs (for example), but of compile-time ordering to ensure all the computing resources are busy doing some work.
I think it depends on the compilation flags. Typically, unless you tell it otherwise, the compiler must assume that another compilation unit (say B.cpp, which is not visible at compile time) can declare extern bool x_init; and change it at any time. Then the reordering optimization would break the expected behavior (B could define the initialize() function). This example is trivial and unlikely to break. The linked SO answers are not related to this "optimization"; they simply say that, in their case, the compiler cannot assume that the global array is not modified externally, and as such cannot make the optimization. This is not like your example.
Yes. It's a very common optimization trick. Instead of:
// test is a bool
for (int i = 0; i < 345; i++) {
if (test) do_something();
}
The compiler might do:
if (test) for(int i = 0; i < 345; i++) { do_something(); }
And save 344 useless tests.
I'm thinking of using pure/const functions more heavily in my C++ code. (pure/const attribute in GCC)
However, I am curious how strict I should be about it and what could possibly break.
The most obvious case is debug output (in whatever form: it could be on cout, in some file, or in some custom debug class). I will probably have a lot of functions which don't have any side effects apart from this sort of debug output. No matter whether the debug output is made or not, it will have absolutely no effect on the rest of my application.
Another case I'm thinking of is the use of some SmartPointer class which may do some extra stuff in global memory when in debug mode. If I use such an object in a pure/const function, it does have some slight side effects (in the sense that some memory will probably be different), but these should not be real side effects (in the sense that the behaviour differs in any way).
Similarly for mutexes and other stuff. I can think of many complex cases where a function has some side effects (in the sense that some memory will be different, maybe even some threads are created, some filesystem manipulation is made, etc.) but makes no computational difference (all those side effects could very well be left out, and I would even prefer that).
So, to summarize, I want to mark functions as pure/const which are not pure/const in a strict sense. An easy example:
#include <iostream>
using namespace std;

int foo(int) __attribute__((const));

int bar(int x) {
    int sum = 0;
    for(int i = 0; i < 100; ++i)
        sum += foo(x);
    return sum;
}

int foo_callcounter = 0;

int main() {
    cout << "bar 42 = " << bar(42) << endl;
    cout << "foo callcounter = " << foo_callcounter << endl;
}

int foo(int x) {
    cout << "DEBUG: foo(" << x << ")" << endl;
    foo_callcounter++;
    return x; // or whatever
}
Note that the function foo is not const in a strict sense. However, it doesn't matter what foo_callcounter is in the end. It also doesn't matter whether the debug statement is printed (in case the function is not called).
I would expect the output:
DEBUG: foo(42)
bar 42 = 4200
foo callcounter = 1
And without optimisation:
DEBUG: foo(42) (100 times)
bar 42 = 4200
foo callcounter = 100
Both cases are totally fine because what only matters for my usecase is the return value of bar(42).
How does it work out in practice? If I mark such functions as pure/const, could it break anything (considering that the code is all correct)?
Note that I know that some compilers might not support this attribute at all. (BTW, I am collecting them here.) I also know how to make use of these attributes in a way that keeps the code portable (via #defines). Also, all compilers which are interesting to me support it in some way, so I don't care if my code runs slower with compilers that do not.
I also know that the optimised code probably will look different depending on the compiler and even the compiler version.
Very relevant is also this LWN article "Implications of pure and constant functions", especially the "Cheats" chapter. (Thanks ArtemGr for the hint.)
I'm thinking of using pure/const functions more heavily in my C++ code.
That’s a slippery slope. These attributes are non-standard and their benefit is restricted mostly to micro-optimizations.
That’s not a good trade-off. Write clean code instead, don’t apply such micro-optimizations unless you’ve profiled carefully and there’s no way around it. Or not at all.
Notice that in principle these attributes are quite nice because they state implied assumptions of the functions explicitly for both the compiler and the programmer. That’s good. However, there are other methods of making similar assumptions explicit (including documentation). But since these attributes are non-standard, they have no place in normal code. They should be restricted to very judicious use in performance-critical libraries where the author tries to emit best code for every compiler. That is, the writer is aware of the fact that only GCC can use these attributes, and has made different choices for other compilers.
You could definitely break the portability of your code. And why would you want to implement your own smart pointer, apart from the learning experience? Aren't there enough of them available in (near-)standard libraries?
I would expect the output:
I would expect the input:
int bar(int x) {
return foo(x) * 100;
}
Your code actually looks strange to me. As a maintainer I would either think that foo actually has side effects, or, more likely, rewrite it immediately to the function above.
How does it work out in practice? If I mark such functions as pure/const, could it break anything (considering that the code is all correct)?
If the code is all correct, then no. But the chances that your code is correct are small. If your code is incorrect, then this feature can mask bugs:
int foo(int x) {
    globalmutex.lock();
    // complicated calculation code
    return -1; // oops: this early return leaves globalmutex locked
    // more complicated calculation
    globalmutex.unlock(); // never reached
    return x;
}
Now given the bar from above:
int main() {
cout << bar(-1);
}
This terminates with __attribute__((const)) but deadlocks otherwise.
It also highly depends on the implementation. For example:
void f() {
for(;;)
{
globalmutex.unlock();
cout << foo(42) << '\n';
globalmutex.lock();
}
}
Where should the compiler move the call foo(42) to? Is it allowed to optimize this code? Not in general! So unless the loop is really trivial, you get no benefit from this feature. But if your loop is trivial, you can easily optimize it yourself.
EDIT: as Albert requested a less obvious situation, here it comes:
For example, if you implement operator<< for an ostream, you use ostream::sentry, which locks the stream buffer. Suppose you call a pure/const f after you released it or before you locked it. Someone uses this operator as cout << YourType(), and f also uses cout << "debug info". According to you, the compiler is free to move the invocation of f into the critical section. Deadlock occurs.
I would examine the generated asm to see what difference they make. (My guess would be that switching from C++ streams to something else would yield more of a real benefit, see: http://typethinker.blogspot.com/2010/05/are-c-iostreams-really-slow.html )
I think nobody knows this (with the exception of gcc programmers), simply because you rely on undefined and undocumented behaviour, which can change from version to version. But how about something like this:
#ifdef NDEBUG
#define safe_pure __attribute__((pure))
#else
#define safe_pure
#endif
I know it's not exactly what you want, but now you can use the pure attribute without breaking the rules.
If you do want to know the answer, you may ask in the gcc forums (mailing list, whatever), they should be able to give you the exact answer.
Meaning of the code: when NDEBUG (the symbol used in assert macros) is defined, we don't debug, have no side effects, and can use the pure attribute. When it is not defined, we have side effects (the debug output), so the pure attribute is not used.
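A minimal usage sketch (the function is hypothetical):
int foo(int x) safe_pure; // expands to the attribute only when NDEBUG is defined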
Consider an example like this:
if (flag)
for (condition)
do_something();
else
for (condition)
do_something_else();
If flag doesn't change inside the for loops, this should be semantically equivalent to:
for (condition)
if (flag)
do_something();
else
do_something_else();
However, in the first case the code might be much longer (e.g. if several for loops are used, or if do_something() is a code block that is mostly identical to do_something_else()), while in the second case the flag gets checked many times.
I'm curious whether current C++ compilers (most importantly, g++) would be able to optimize the second example to get rid of the repeated tests inside the for loop. If so, under what conditions is this possible?
Yes, if it is determined that flag doesn't change and can't be changed by do_something or do_something_else, it can be pulled outside the loop. I've heard of this called loop hoisting, but Wikipedia has an entry called "loop invariant code motion".
If flag is a local variable, the compiler should be able to do this optimization, since it's guaranteed to have no effect on the behavior of the generated code.
If flag is a global variable and you call functions inside your loop, the compiler might not perform the optimization: it may not be able to determine whether those functions modify the global.
This can also be affected by the sort of optimization you do - optimizing for size would favor the non-hoisted version while optimizing for speed would probably favor the hoisted version.
In general, this isn't the sort of thing that you should worry about, unless profiling tells you that the function is a hotspot and you see that less than efficient code is actually being generated by going over the assembly the compiler outputs. Micro-optimizations like this you should always just leave to the compiler unless you absolutely have to.
Tried with GCC and -O3:
void foo();
void bar();
int main()
{
bool doesnt_change = true;
for (int i = 0; i != 3; ++i) {
if (doesnt_change) {
foo();
}
else {
bar();
}
}
}
Result for main:
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call ___main
call __Z3foov
call __Z3foov
call __Z3foov
xorl %eax, %eax
leave
ret
So it does optimize away the choice (and unrolls smaller loops).
This optimization is not done if doesnt_change is global.
I'm sure that if the compiler can determine that the flag will remain constant, it can do some shuffling:
const bool flag = /* ... */;
for (..; ..; ..)
{
if (flag)
{
// ...
}
else
{
// ...
}
}
If the flag is not const, the compiler cannot necessarily optimize the loop, because it can't be sure flag won't change. It can if it does static analysis, but not all compilers do, I think. const is the sure-fire way of telling the compiler the flag won't change, after that it's up to the compiler.
As usual, profile and find out if it's really a problem.
I would be wary of saying that it will. Can the compiler guarantee that the value won't be modified by this, or another, thread?
That said, the second version of the code is generally more readable and it would probably be the last thing to optimize in a block of code.
As many have said: it depends.
If you want to be sure, you should try to force a compile-time decision. Templates often come in handy for this:
for (condition)
do_it<flag>();
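Note that flag must be a compile-time constant for this to work. A minimal sketch of the full pattern, assuming C++17's if constexpr; do_something and do_something_else stand in for the hypothetical functions from the question:
#include <cstdio>

void do_something()      { std::puts("something"); }
void do_something_else() { std::puts("something else"); }

// The branch is resolved when the template is instantiated, so each
// instantiation of do_it contains no runtime test at all.
template <bool Flag>
void do_it() {
    if constexpr (Flag)
        do_something();
    else
        do_something_else();
}

int main() {
    constexpr bool flag = true; // must be known at compile time
    for (int i = 0; i < 3; ++i)
        do_it<flag>();
}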
Generally, yes. But there is no guarantee, and the places where the compiler will do it are probably rare.
What most compilers do without a problem is hoisting immutable evaluations out of the loop, e.g. if your condition is
if (a<b) ....
when a and b are not affected by the loop, the comparison will be made once before the loop.
This means that if the compiler can determine the condition does not change, the test is cheap and the jump well predicted. This in turn means the test itself costs one cycle, or no cycle at all (really).
In which cases splitting the loop would be beneficial?
a) a very tight loop where the 1 cycle is a significant cost
b) the entire loop with both parts does not fit the code cache
Now, the compiler can only make assumptions about the code cache, and usually can order the code in a way that one branch will fit the cache.
Without any testing, I'd expect a) to be the only case where such an optimization would be applied, because it's not always the better choice:
In which cases splitting the loop would be bad?
When splitting the loop increases code size beyond the code cache, you will take a significant hit. Now, that only affects you if the loop itself is called within another loop, but that's something the compiler usually can't determine.
[edit]
I couldn't get VC9 to split the following loop (one of the few cases where it might actually be beneficial)
extern volatile int vflag = 0;
int foo(int count)
{
int sum = 0;
int flag = vflag;
for(int i=0; i<count; ++i)
{
if (flag)
sum += i;
else
sum -= i;
}
return sum;
}
[edit 2]
note that with int flag = true; the second branch does get optimized away. (and no, const doesn't make a difference here ;))
What does that mean? Either it doesn't support that, it doesn't matter, or my analysis is wrong ;-)
Generally, I'd assume this is an optimization that is valuable only in very few cases, and can be done by hand easily in most scenarios.
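For reference, a hedged sketch of what doing it by hand looks like for the loop from the edit above (vflag as declared there; the function name is made up):
// Hand-unswitched version of foo(): the flag is tested once, and each
// branch gets its own copy of the loop.
int foo_unswitched(int count)
{
    int sum = 0;
    if (vflag) {
        for (int i = 0; i < count; ++i)
            sum += i;
    } else {
        for (int i = 0; i < count; ++i)
            sum -= i;
    }
    return sum;
}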
It's called a loop invariant and the optimization is called loop invariant code motion and also code hoisting. The fact that it's in a conditional will definitely make the code analysis more complex and the compiler may or may not invert the loop and the conditional depending on how clever the optimizer is.
There is a general answer for any specific case of this kind of question, and that's to compile your program and look at the generated code.