Does Using goto to Escape Control Structures Ever Produce Different Assembly?


There is a lot of debate about the goto statement. This question is not about whether its use is right or wrong, but simply about whether it ever actually produces different assembly.
I'm specifically looking at Visual Studio 2013, but an example in any compiler would be wonderful.
Bjarne Stroustrup states:
The scope of a label is the function it is in (§6.3.4). This implies that you can use goto to jump into and out of blocks. The only restriction is that you cannot jump past an initializer or an exception handler (§13.5).
One of the few sensible uses of goto in ordinary code is to break out from a nested loop or switch-statement.
My question then: Is there any instance in which goto still produces different assembly than what can already be accomplished by use of other control structures?
For example, this produces identical assembly:
auto r = rand();
auto sum = 0;
for(auto i = rand(); i > 0; --i){
    switch(r){
    case 1:
        ++sum;
        goto END;
    default:
        sum += rand();
        break;
    }
}
sum++;
END:; // a label must be followed by a statement
To this non-goto code:
auto r = rand();
auto b = false;
auto sum = 0;
for(auto i = rand(); i > 0; --i){
    switch(r){
    case 1:
        ++sum;
        b = true;
        break;
    default:
        sum += rand();
        break;
    }
    if(b)break;
}
if(!b)sum++;
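A self-contained sketch for comparing the forms yourself, for example on Compiler Explorer with -O2. To make the functions deterministic it assumes the rand() calls are replaced by parameters (r, n) and by the loop counter; the names with_goto, with_flag and with_lambda are hypothetical. A third common alternative, leaving nested control structures with return from an immediately invoked lambda, is included for comparison; it is not from the question.

```cpp
// Deterministic stand-ins for the question's rand() calls (assumption):
// r selects the switch case, n is the iteration count, and i replaces
// the rand() that is added in the default case.
int with_goto(int r, int n) {
    int sum = 0;
    for (int i = n; i > 0; --i) {
        switch (r) {
        case 1:
            ++sum;
            goto END;            // leaves the switch and the loop at once
        default:
            sum += i;
            break;
        }
    }
    sum++;
END:
    return sum;
}

int with_flag(int r, int n) {
    int sum = 0;
    bool b = false;              // "did we leave early?" flag
    for (int i = n; i > 0; --i) {
        switch (r) {
        case 1:
            ++sum;
            b = true;
            break;               // only leaves the switch
        default:
            sum += i;
            break;
        }
        if (b) break;            // second test needed to leave the loop
    }
    if (!b) sum++;
    return sum;
}

int with_lambda(int r, int n) {
    int sum = 0;
    // An immediately invoked lambda turns "escape everything" into a return.
    bool finished = [&] {
        for (int i = n; i > 0; --i) {
            switch (r) {
            case 1:
                ++sum;
                return false;    // early exit from both switch and loop
            default:
                sum += i;
                break;
            }
        }
        return true;
    }();
    if (finished) sum++;
    return sum;
}
```

All three return the same value for the same inputs; whether they compile to identical instructions is exactly what the question asks, so check the assembly for your compiler and flags rather than assuming it.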

Here's my experience: I once had a bit of code that was extremely time-critical. It had a loop (while (condition) ...) that iterated zero times on average, because the condition was almost always false. The compiler insisted on loop optimisations, moving work outside the loop that then ran even when the loop body wasn't executed at all, which slowed the code down.
I tried to rewrite the loop using goto, hoping to confuse the optimiser enough to give up on optimising the code, and failed. gcc and clang optimise depending on the actual control flow, not depending on what C or C++ code you use.

Related

Trivial example of reordering memory operations

I was trying to write some code that would let me observe the reordering of memory operations.
In the following example I expected that, on some executions of set_values(), the order of the assignments could change. In particular, notification = 1 might occur before the other operations, but that didn't happen even after thousands of iterations.
I've compiled the code with -O3 optimization.
Here is the YouTube video I'm referring to: https://youtu.be/qlkMbxUbKfw?t=200
#include <iostream>
#include <thread>
int a{0};
int b{0};
int c{0};
int notification{0};
void set_values()
{
    a = 1;
    b = 2;
    c = 3;
    notification = 1;
}
void calculate()
{
    while(notification != 1);
    a += b + c;
}
void reset()
{
    a = 0;
    b = 0;
    c = 0;
    notification = 0;
}
int main()
{
    a=6; //just to allow first iteration
    for(int i = 0 ; a == 6 ; i++)
    {
        reset();
        std::thread t1(calculate);
        std::thread t2(set_values);
        t1.join();
        t2.join();
        std::cout << "Iteration: " << i << ", a = " << a << std::endl;
    }
    return 0;
}
Now the program is stuck in an infinite loop. I expected that in some iterations the order of the instructions in set_values() could change (due to cache-related optimizations). For example, notification = 1 might be executed before c = 3, which would trigger calculate() and give a == 3; that satisfies the loop's termination condition (a != 6) and would prove reordering.
Or can someone provide another trivial example of code that helps observe the reordering of memory operations?

The compiler can indeed reorder your assignments in the function set_values. However, it is not required to do so. In this case it has no reason to reorder anything, since you are assigning constants to all four variables.
Now the program is stuck in infinited loop.
This is probably because while(notification != 1); will be optimized to an infinite loop.
With a bit of work, we can find a way to make the compiler reorder the assignment notify = 1 before the other statements, see https://godbolt.org/z/GY-pAw.
Notice that the program reads x from the standard input, this is done to force the compiler to read from a memory location.
I've also made the variable notification volatile, so that while(notification != 1); doesn't get optimised away.
You can try this example on your machine, I've been able to consistently fail the assertion using g++9.2 and -O3 running on an Intel Sandy Bridge cpu.
Be aware that the cpu itself can reorder instructions if they are independent of each other, see https://en.wikipedia.org/wiki/Out-of-order_execution. This is, however, a bit tricky to test and reproduce consistently.
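The question also asked for another trivial example. A classic one for hardware reordering is the store-buffering litmus test (a well-known technique, not the code behind the godbolt link above). On x86, ordinary stores are never reordered with earlier stores, which is why a store-store sequence like the one in set_values() is not reordered by the CPU there; a store can, however, become visible after a later load from a different address. The sketch below, assuming C++11 threads and atomics, counts how often both threads read 0. The function name is hypothetical, and how often the outcome appears, if at all, depends on the CPU, core count and timing.

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test: each worker stores 1 to its own variable and
// then loads the other worker's variable. If both loads return 0, a store was
// effectively ordered after the following load (by the compiler or the CPU).
// Relaxed atomics make that outcome legal without any data race.
int count_store_load_reorderings(int iterations) {
    std::atomic<int> x{0}, y{0}, r1{0}, r2{0};
    std::atomic<int> go{0};    // iteration number the workers may run
    std::atomic<int> done{0};  // workers finished with the current iteration

    auto worker = [&](std::atomic<int>* mine, std::atomic<int>* other,
                      std::atomic<int>* result) {
        for (int it = 1; it <= iterations; ++it) {
            while (go.load(std::memory_order_acquire) != it)
                std::this_thread::yield();                  // wait for start signal
            mine->store(1, std::memory_order_relaxed);      // store ...
            result->store(other->load(std::memory_order_relaxed),
                          std::memory_order_relaxed);       // ... then load
            done.fetch_add(1, std::memory_order_acq_rel);
        }
    };

    std::thread t1(worker, &x, &y, &r1);
    std::thread t2(worker, &y, &x, &r2);

    int observed = 0;
    for (int it = 1; it <= iterations; ++it) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        done.store(0, std::memory_order_relaxed);
        go.store(it, std::memory_order_release);  // publish resets, start both workers
        while (done.load(std::memory_order_acquire) != 2)
            std::this_thread::yield();            // wait for both workers
        if (r1.load(std::memory_order_relaxed) == 0 &&
            r2.load(std::memory_order_relaxed) == 0)
            ++observed;
    }
    t1.join();
    t2.join();
    return observed;
}
```

With std::memory_order_seq_cst on the two stores and two loads inside the worker, the r1 == 0 && r2 == 0 outcome is forbidden, so the count should stay at zero.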

Your compiler optimizes in unexpected ways but is allowed to do so because you are violating a fundamental rule of the C++ memory model.
You must not access a memory location from multiple threads without synchronization if at least one of them writes to it; that is a data race, and a data race is undefined behavior.
To synchronize, either use a std::mutex or use std::atomic<int> instead of int for your variables.
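A minimal sketch of the std::atomic option applied to the question's program, assuming C++11 or later. Only notification is atomic here: its release store and acquire load guarantee that calculate() sees all of set_values()'s writes, so a is always 6 afterwards. run_once() is a hypothetical helper replacing the question's main loop. Making all four variables std::atomic<int> with the default (sequentially consistent) ordering would also be correct.

```cpp
#include <atomic>
#include <thread>

int a{0}, b{0}, c{0};              // plain ints, protected by the flag below
std::atomic<int> notification{0};

void set_values() {
    a = 1;
    b = 2;
    c = 3;
    notification.store(1, std::memory_order_release);  // publish a, b, c
}

void calculate() {
    while (notification.load(std::memory_order_acquire) != 1)
        std::this_thread::yield();  // wait until set_values() publishes
    a += b + c;                     // always sees a = 1, b = 2, c = 3
}

// Runs both threads once and returns the final value of a.
int run_once() {
    a = b = c = 0;
    notification.store(0, std::memory_order_relaxed);
    std::thread t1(calculate);
    std::thread t2(set_values);
    t1.join();
    t2.join();
    return a;
}
```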

Does the continue statement really increase the speed of a loop in C++?

So, I am new to online competitive programming, and I came across code where I use an if-else statement inside a for loop. I want to increase the speed of the loop, and after doing some research I came across the break and continue statements.
So my question is: does using continue really increase the speed of the loop or not?
CODE :
int even_sum = 0;
for(int i=0;i<200;i++){
    if(i%4 == 0){
        even_sum +=i;
        continue;
    }else{
        //do other stuff when sum of multiple of 4 is not calculated
    }
}

In the specific code in the question, the code has the identical meaning with and without the continue: In either case, after execution leaves even_sum +=i;, it flows to the closing } of the for statement. Any compiler of even modest quality should treat the two options identically.
The intended purpose of continue is not to speed up code by requesting a jump the compiler is going to make anyway but to skip code that is undesired in the current loop iteration—it acts as if the remaining code had been enclosed in an else clause but may be more visually appealing and less disruptive to human perception of the code.
It is conceivable a very rudimentary compiler, or even a decent compiler but with optimization disabled, might generate a jump instruction for the continue and also a jump instruction for the “then” clause of the if statement to jump over the else clause. The latter would never be executed and would have no direct effect on program execution time, but it would increase the size of the program and thus could have indirect effects. This possibility is of negligible concern in typical modern environments, where you are unlikely to encounter such a rudimentary compiler.

No, there's no speed advantage in using continue here. Both versions of your code are equivalent, and even without optimizations they produce the same machine code.
However, sometimes continue can make your code a lot more efficient, if you have structured your loop in a specific way, e.g.
This:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
        continue;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
is a lot more efficient than:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
because the former skips the call to huge_computation_but_always_false_when_multiple_of_4() whenever i is a multiple of 4, i.e. in a quarter of the iterations.
So even though both versions always produce the same result (given that huge_computation_but_always_false_when_multiple_of_4() has no side effects), the first one, which uses continue, does less work and will be faster whenever that function is expensive.
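To make the difference concrete, here is a sketch that counts how often the expensive function is called. Its body is a hypothetical stand-in (it only has to be false for multiples of 4), and the call counter is instrumentation added for the demonstration, so this version of the function is not side-effect free.

```cpp
// Instrumentation only: how many times the "expensive" function ran.
static int calls = 0;

// Hypothetical stand-in; like the original, always false for multiples of 4.
bool huge_computation_but_always_false_when_multiple_of_4(int i) {
    ++calls;
    return i % 4 != 0 && i % 7 == 0;
}

// Returns the number of calls made by the version that uses continue.
int calls_with_continue() {
    calls = 0;
    int even_sum = 0;
    for (int i = 0; i < 200; i++) {
        if (i % 4 == 0) {
            even_sum += i;
            continue;                 // skips the expensive call
        }
        if (huge_computation_but_always_false_when_multiple_of_4(i)) {
            // do stuff
        }
    }
    return calls;
}

// Returns the number of calls made by the version without continue.
int calls_without_continue() {
    calls = 0;
    int even_sum = 0;
    for (int i = 0; i < 200; i++) {
        if (i % 4 == 0) {
            even_sum += i;
        }
        if (huge_computation_but_always_false_when_multiple_of_4(i)) {
            // do stuff
        }
    }
    return calls;
}
```

The first version makes 150 calls and the second 200; the saving is proportional to how expensive each call is.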

If statement not executed slows program

I have an if statement that is currently never executed, however if I print something to the screen it takes over ten times longer for the program to run than if a variable is declared. Doing a bit of research online this seems to be some kind of branch prediction issue. Is there anything I can do to improve the program speed?
Basically, myTest and myTest_new return the same thing, except that one is a macro and the other is a function. I am just measuring how long bitTest takes to execute: it runs in 3 seconds with just the declaration in the if statement, but takes over a minute when the Serial1.println() call is in the if statement, even though neither is ever executed.
void bitTest()
{
    int count = 0;
    Serial1.println("New Test");
    int lastint = 0;
    Serial1.println("int");
    for (int index = -2147483647; index <= 2147483647; index+=1000) {
        if (index <= 0 && lastint > 0) {
            break;
        }
        lastint = index;
        for (int num = 0; num <= 31; num++) {
            ++1000;
            int vcr1 = myTest(index, num);
            int vcr2 = myTest_new(index, num);
            if (vcr1 != vcr2) {
                Serial1.println("Test"); // leave this println() and it takes 300 seconds for the test to run
                //int x = 0;
            }
        } // if (index)
    } // for (index)
    Serial1.print("count = ");
    Serial1.println(count);
    return;
}

This is much less likely to be caused by branch prediction (which shouldn't be influenced by what the branch body contains) than by the fact that
{
    int x = 0;
}
simply does nothing, because the scope of x ends at }, so that the compiler simply ditches the whole if clause, including the check. Note that this is only possible because the expression that if checks has no side effects, and neither does the block that would get executed.
By the way, the code you showed would usually directly be "compiled away", because the compiler, at compile time, can determine whether the if clause could ever be executed, unless you explicitly tell the compiler to omit such safe optimizations. Hence, I kind of doubt your "10 times as slow" measurement. Either the code you're showing isn't the actual example on which you demonstrate this, or you should turn on compiler optimization prior to doing performance comparisons.

The reason why your program takes forever is that it's buggy:
for (int index = -2147483647; index <= 2147483647; index+=1000) {
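Simply put: near the maximum int value, index += 1000 overflows. Signed integer overflow is undefined behavior in C and C++; in practice it usually wraps around to a large negative value. Since index <= 2147483647 is always true for a 32-bit int, there's no "correct" way for your loop to terminate. Hence you invented your strange lastint > 0 check.
Now, fix up the loop (I mean, you're really just using every 1000th value, so why not simply loop a counter from 0 to about 2*2147483 and compute index from it?)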
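A sketch of a loop that visits the same values without overflowing, assuming a 32-bit int. It is a variant of the counter suggestion: the arithmetic is done in a 64-bit integer, so adding 1000 can't overflow, the ordinary loop condition ends the loop, and the lastint check becomes unnecessary. Visit and visit_every_1000th are hypothetical names, and the body only records which values were visited.

```cpp
#include <cstdint>

// Hypothetical summary of the values the loop body would see.
struct Visit {
    long long count = 0;
    int first = 0;
    int last = 0;
};

Visit visit_every_1000th() {
    Visit v;
    const std::int64_t start = -2147483647;
    const std::int64_t stop = 2147483647;
    // 64-bit loop variable: wide + 1000 never overflows, so wide <= stop
    // eventually becomes false and the loop terminates normally.
    for (std::int64_t wide = start; wide <= stop; wide += 1000) {
        int index = static_cast<int>(wide);  // always within int range here
        if (v.count == 0) v.first = index;
        v.last = index;
        ++v.count;
        // ... the inner loop over num would go here ...
    }
    return v;
}
```

The last value visited is 2147483353, which is also the last value the original loop reaches before its increment overflows.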
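++1000;
is ill-formed in both C and C++: the operand of ++ must be a modifiable lvalue, and a literal isn't one, so a conforming compiler rejects this line. This is very much WTF, and it means the code you posted can't be exactly the code you ran.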
All in all, your program is a mess. Re-write it, and debug a clean, well-defined version of it.

Is continue instant?

In the following two code snippets, is there actually any difference in compilation or run-time speed?
for (int i = 0; i < 50; i++)
{
    if (i % 3 == 0)
        continue;
    printf("Yay");
}
and
for (int i = 0; i < 50; i++)
{
    if (i % 3 != 0)
        printf("Yay");
}
Personally, in the situations where there is a lot more than a print statement, I've been using the first method as to reduce the amount of indentation for the containing code. Been wondering for a while so found it about time I ask whether it's actually having an effect other than visually.
Reply to Alf (I couldn't get code working in comments...)
More accurate to my usage is something along the lines of a "handleObjectMovement" function which would include
for each object
    if object position is static
        continue
    deal with velocity and jazz
compared with
for each object
    if object position is not static
        deal with velocity and jazz
Hence me not using return. Essentially "if it's not relevant to this iteration, move on"

The behaviour is the same, so the runtime speed should be the same unless the compiler does something stupid (or unless you disable optimisation).
It's impossible to say whether there's a difference in compilation speed, since it depends on the details of how the compiler parses, analyses and translates the two variations.
If speed is important, measure it.

If you know which branch of the condition has the higher probability, you may use GCC's likely/unlikely macros (wrappers around __builtin_expect).
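A sketch of the usual macro definitions, assuming GCC or Clang; on other compilers they fall back to the plain condition. The LIKELY/UNLIKELY names are a common convention rather than a standard API, and count_yays is a hypothetical version of the question's loop that counts instead of printing.

```cpp
// __builtin_expect(expr, value) tells GCC/Clang which value expr usually has.
#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

int count_yays() {
    int yays = 0;
    for (int i = 0; i < 50; i++) {
        if (UNLIKELY(i % 3 == 0))  // hint: the skip path is the rarer one
            continue;
        ++yays;                    // stands in for printf("Yay")
    }
    return yays;
}
```

Since C++20 the same hint can be written without macros as `if (i % 3 == 0) [[unlikely]] continue;`. For a pattern this regular the hardware branch predictor will probably learn it anyway, so measure before relying on the hint.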

How about getting rid of the check altogether?
for (int t = 0; t < 33; t++)
{
    int i = t + (t >> 1) + 1;
    printf("%d\n", i);
}
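Why this works: the numbers that are not multiples of 3 come in pairs 3k + 1 and 3k + 2, and for t = 2k and t = 2k + 1 the expression t + (t >> 1) + 1 yields exactly 3k + 1 and 3k + 2. A small sketch to check that both loops produce the same sequence (the function names are hypothetical):

```cpp
#include <vector>

// Values of i for which the original loop prints: 0 <= i < 50, i % 3 != 0.
std::vector<int> filtered() {
    std::vector<int> out;
    for (int i = 0; i < 50; i++)
        if (i % 3 != 0) out.push_back(i);
    return out;
}

// Values produced by the branch-free formula.
std::vector<int> by_formula() {
    std::vector<int> out;
    for (int t = 0; t < 33; t++)
        out.push_back(t + (t >> 1) + 1);
    return out;
}
```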

How to keep unreachable code?

I'd like to write a function that would have some optional code to be executed or not depending on user settings. The function is cpu-intensive and having ifs in it would be slow since the branch predictor is not that good.
My idea is making a copy in memory of the function and replace NOPs with a jump when I don't want to execute some code. My working example goes like this:
int Test()
{
    int x = 2;
    for (int i=0 ; i<10 ; i++)
    {
        x *= 2;
        __asm {NOP}; // to skip it replace this
        __asm {NOP}; // by JMP 2 (after the goto)
        x *= 2; // Op to skip or not
        x *= 2;
    }
    return x;
}
In my test's main, I copy this function into a newly allocated executable memory and replace the NOPs by a JMP 2 so that the following x *= 2 is not executed. JMP 2 is really "skip the next 2 bytes".
The problem is that I would have to change the JMP operand every time I edit the code to be skipped and change its size.
An alternative that would fix this problem would be:
__asm {NOP}; // to skip it replace this
__asm {NOP}; // by JMP 2 (after the goto)
goto dont_do_it;
x *= 2; // Op to skip or not
dont_do_it:
x *= 2;
I would then want to skip or not the goto, which has a fixed size. Unfortunately, in full optimization mode, the goto and the x*=2 are removed because they are unreachable at compilation time.
Hence the need to keep that dead code.
I'm using Visual Studio 2008.

You can cut the cost of the branch by a factor of up to 10 (the loop's trip count) just by moving it out of the loop, so that it is evaluated once instead of on every iteration:
int Test(bool should_skip)
{
    int x = 2;
    if (should_skip) {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
        }
    } else {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
            x *= 2;
        }
    }
    return x;
}
In this case, and others like it, that might also provoke the compiler into doing a better job of optimising the loop body, since it will consider the two possibilities separately rather than trying to optimise conditional code, and it won't optimise anything away as dead.
If this results in too much duplicated code to be maintainable, use a template that takes x by reference:
int x = 2;
if (should_skip) {
    doLoop<true>(x);
} else {
    doLoop<false>(x);
}
And check that the compiler inlines it.
Obviously this increases code size a bit, which will occasionally be a concern. Whichever way you do it though, if this change doesn't produce a measurable performance improvement then I'd guess that yours won't either.
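The answer doesn't show doLoop; here is one possible definition, a sketch assuming the loop body from the question. The skip flag is a template parameter, so the if inside is a compile-time constant that an optimizing compiler removes from each instantiation. x is a long long because, with the optional step enabled, 10 iterations compute 2 * 8^10 = 2^31, which overflows a 32-bit int (the question's int version has the same problem). TestTemplate is a hypothetical name.

```cpp
// Skip is known at compile time, so each instantiation contains only one
// version of the loop body and no run-time test.
template <bool Skip>
void doLoop(long long& x) {
    for (int i = 0; i < 10; i++) {
        x *= 2;
        if (!Skip) {
            x *= 2;  // the optional operation
        }
        x *= 2;
    }
}

// The single run-time branch, taken once outside the loop.
long long TestTemplate(bool should_skip) {
    long long x = 2;
    if (should_skip) {
        doLoop<true>(x);
    } else {
        doLoop<false>(x);
    }
    return x;
}
```

With C++17 you can write `if constexpr (!Skip)` to make the compile-time nature explicit.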

If the number of permutations for the code is reasonable, you can define your code as C++ templates and generate all variants.
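A sketch of that idea with two hypothetical optional steps: each combination of flags is a separate template instantiation, and the variant is chosen once, outside the hot loop, through a table of function pointers. The names kernel, kernels and select_kernel are illustrative only.

```cpp
#include <array>

// Each <DoubleExtra, AddOne> combination is compiled as its own function,
// with the optional steps either present or absent; no run-time tests remain.
template <bool DoubleExtra, bool AddOne>
long long kernel() {
    long long x = 2;
    for (int i = 0; i < 10; i++) {
        x *= 2;
        if (DoubleExtra) x *= 2;  // optional step 1
        if (AddOne) x += 1;       // optional step 2
    }
    return x;
}

using Kernel = long long (*)();

// All four variants, indexed by (DoubleExtra ? 2 : 0) + (AddOne ? 1 : 0).
const std::array<Kernel, 4> kernels = {
    &kernel<false, false>, &kernel<false, true>,
    &kernel<true, false>,  &kernel<true, true>,
};

// Pick the variant once from the user settings.
Kernel select_kernel(bool double_extra, bool add_one) {
    return kernels[(double_extra ? 2 : 0) + (add_one ? 1 : 0)];
}
```

The number of instantiations doubles with every flag, which is why this only makes sense when the number of permutations stays small.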

You do not specify what compiler and platform you are using, which will prevent most people from being able to help you. For example, on some platforms, the code section is not going to be writeable, so you won't be able to replace the NOPs with a JMP.
You are trying to pick-and-choose the optimizations offered to you by the compiler and second-guessing it. In general, it's a bad idea. Either write the whole inner loop block in assembly, which would prevent the compiler from eliminating it as dead code, or put the damn if statement in there and let the compiler do its thing.
I'm also dubious that the branch prediction is bad enough where you will gain any sort of a net win from doing what you're proposing. Are you sure this isn't a case of premature optimization? Have you written the code in the most obvious way possible and only then determined that its performance isn't good enough? That would be my suggested start.

Here's an actual answer to the actual question!
volatile int y = 0;
int Test()
{
    int x = 2;
    int i; // no initializer: in C++ a goto may not jump past an initialized declaration
    for (i=0 ; i<10 ; i++)
    {
        x *= 2;
        __asm {NOP}; // to skip it replace this
        __asm {NOP}; // by JMP 2 (after the goto)
        goto dont_do_it;
keep_my_code:
        x *= 2; // Op to skip or not
dont_do_it:
        x *= 2;
    }
    if (y) goto keep_my_code; // never taken at run time, but keeps the code reachable
    return x;
}

Is this x64? You might be able to use function pointers and a conditional move to avoid the branch predictor. Load the address of the procedure based on the user settings; one of the procedures could be a dummy that does nothing. You should be able to do this without any inline ASM at all.
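A sketch of the function-pointer approach, with hypothetical names and without inline assembly. Whether the selection compiles to a conditional move is up to the compiler. Note also that modern CPUs predict the target of an indirect call too, and the optional step can no longer be inlined, so this is worth profiling against a plain if. x is a long long because three doublings per iteration over 10 iterations reach 2^31, which overflows a 32-bit int.

```cpp
// The optional step is either a real operation or a dummy that does nothing.
using Step = void (*)(long long&);

void double_it(long long& x) { x *= 2; }
void do_nothing(long long&) {}

long long TestWithPointer(bool run_optional_step) {
    // Select the procedure once, from the user setting, outside the loop.
    Step optional = run_optional_step ? &double_it : &do_nothing;
    long long x = 2;
    for (int i = 0; i < 10; i++) {
        x *= 2;
        optional(x);  // indirect call instead of an if in the loop body
        x *= 2;
    }
    return x;
}
```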

This may give insight:
The #pragma optimize directive in Visual Studio.
That said, for this particular problem I would hand-code into ASM, using the VS asm output as a reference point.
At the meta level, I would have to be very certain this was the best design & algorithm for what I was doing before I started optimizing for the CPU pipe.

If you get this to work then I would profile it to make sure that it really is faster for you. On modern CPUs there is very little you can do that is slower than modifying code that is already in the cpu cache, or worse, the cpu pipeline. The cpu basically has to throw out all the work that is in the pipeline and start again.
