So I encountered some strange behavior, which I stripped down to the following minimal example:
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> vec;
    for (int i = 0; i < 1000; i++)
    {
        vec.push_back(2150000 * i);
        if (i % 100 == 0) std::cout << i << std::endl;
    }
}
When compiling with gcc 7.3.0 using the command
c++ -Wall -O2 program.cpp -o program
I get no warnings. Running the program produces the following output:
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
[ snip several thousand lines of output ]
1073741600
1073741700
1073741800
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
which I guess means that I finally ran out of memory for the vector.
Clearly something is wrong here. I guess this has something to do with the fact that 2150000 * 1000 is slightly larger than 2^31, but it's not quite as simple as that -- if I decrease this number to 2149000 then the program behaves as expected:
0
100
200
300
400
500
600
700
800
900
The cout isn't necessary to reproduce this behavior, so I suppose a minimal example is actually
#include <vector>

int main()
{
    std::vector<int> vec;
    for (int i = 0; i < 1000; i++)
    {
        vec.push_back(2150000 * i);
    }
}
Running this causes the program to wait for a long time and then crash.
Question
I'm fairly new to C++ at any serious level. Am I doing something stupid here that allows for undefined behavior, and if so, what? Or is this a bug in gcc?
I did try to Google this, but I don't really know what to Google.
Addendum
I see that (signed) integer overflow is undefined behavior in C++. To my understanding, that would only mean that the behavior of the expression
2150000 * i
is undefined -- i.e. that it could evaluate to an arbitrary number. That said, we can see that this expression is at least not changing the value of i.
To answer my own question, after examining the assembler output it looks like g++ optimizes this loop by changing
for (int i = 0; i < 1000; i++)
{
    vec.push_back(2150000 * i);
}
to something like
for (int j = 0; j < 1000 * 2150000; j += 2150000)
{
    vec.push_back(j);
}
I guess the addition is faster than doing a multiplication on each iteration, and because signed overflow is undefined behavior, the compiler may apply this transformation without worrying about what happens if the calculation overflows.
Of course the termination test of the optimized loop never succeeds, so ultimately I end up with something more like
for (int j = 0; true; j += 2150000)
{
    vec.push_back(j);
}
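For what it's worth, the overflow itself is easy to avoid by doing the multiplication in a wider type. Here is a minimal sketch of one possible fix (my own, not part of the original program): the 2150000LL literal forces a 64-bit multiplication, so there is no signed overflow and the loop terminates after 1000 iterations as expected.

#include <vector>

int main()
{
    std::vector<int> vec;
    for (int i = 0; i < 1000; i++)
    {
        // Multiply in 64 bits so the product cannot overflow; converting
        // an out-of-range value back to int is implementation-defined
        // (well-defined wrapping since C++20), not undefined behavior.
        long long product = 2150000LL * i;
        vec.push_back(static_cast<int>(product));
    }
}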
Related
can you please explain why this code is going in infinite loop? I am unable to find the error. It is working fine with small values of n and m.
#include <bits/stdc++.h>
using namespace std;

int main()
{
    long long n = 1000000, m = 1000000;
    long long k = 1;
    for (long long i = 0; i < n; i++)
    {
        for (long long j = 0; j < m; j++)
        {
            k++;
        }
    }
    cout << k;
    return 0;
}
It's not infinite, but that k++ operation has to run 1,000,000 * 1,000,000 = 1,000,000,000,000 times, which simply takes too long. That's exactly why it works well with small n and m values.
It is a typical target for optimization.
Build with -Ofast.
g++ t_duration.cpp -Ofast -std=c++11 -o a_fast
#time ./a_fast
1000000000001
real 0m0.002s
user 0m0.000s
sys 0m0.002s
It takes almost no time to return the output.
Build with -O1.
g++ t_duration.cpp -O1 -std=c++11 -o a_1
#./a_1
419774 ms
About 420 seconds to complete the calculation.
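Presumably (this is an assumption about the transformation, not verified against the actual compiler output) the optimizer sees that the nested loops do nothing but increment k, and reduces them to a closed form along these lines:

#include <iostream>
using namespace std;

int main()
{
    long long n = 1000000, m = 1000000;
    // Roughly the closed form the optimizer can derive from the original
    // nested loops: k starts at 1 and is incremented n*m times.
    long long k = 1 + n * m;   // 1,000,000,000,001
    cout << k;
    return 0;
}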
When I run this code I get this as my output:
1000000000001
Reference: I ran this code in the CodeChef IDE (https://www.codechef.com/ide).
You can try it in that IDE once; I guess there is some problem with your IDE, or it might be some other issue.
It took me less than 20 seconds to run this (on the clock).
But when I put the same code in Code::Blocks it takes a long time (like you said, as if an infinite loop were running), and the reason is quite simple: it has to do 1,000,000,000,000 iterations.
However, this raises a new question from your question: what is the difference between the CodeChef IDE and the Code::Blocks compiler?
So the answer is: try it in the CodeChef IDE, this code runs fast there.
Hope this helps you 😃
I have the following C++ code. It should print 1. There is only one illegal memory access and it is a reading one, i.e., for i=5 in the for loop, we obtain c[5]=cdata[5] and cdata[5] does not exist. Still, all writing calls are legal, i.e., the program never tries to write to some non-allocated array cell. If I compile it with g++ -O3, it prints 0. If I compile without -O3 it prints 1. It's the strangest error I've ever seen with g++.
Later edit to address some answers. The undefined behavior should be limited to reading cdata[5] and writing it to c[5]. The assignment c[5]=cdata[5] may read some unknown garbage or something undefined from cdata[5]. Ok, but that should only copy the garbage to c[5] without writing anything whatsoever somewhere else!
#include <iostream>
#include <cstdlib>
using namespace std;

int n, d;
int *f, *c;

void loadData() {
    int fdata[] = {7, 2, 2, 7, 7, 1};
    int cdata[] = {66, 5, 4, 3, 2};
    n = 6;
    d = 3;
    f = new int[n + 1];
    c = new int[n];
    f[0] = fdata[0];
    c[0] = cdata[0];
    for (int i = 1; i < n; i++) {
        f[i] = fdata[i];
        c[i] = cdata[i];
    }
    cout << f[5] << endl;
}

int main() {
    loadData();
}
It will be very hard to find out exactly what is happening to your code before and after the optimisation. As you already knew and pointed out yourself, you are trying to go out of bounds of an array, which leads to undefined behaviour.
However, you are curious about why it is (apparently) the -O3 flag which "causes" the issue!
Let's start by saying that it is actually -O -fpeel-loops which causes your code to be re-organised in a way that makes your error apparent. -O3 enables several optimisation flags, -fpeel-loops among them.
You can read here about which compiler flags are enabled at which stage of optimisation.
In a nutshell, -fpeel-loops will re-organise the loop so that the first and last couple of iterations are taken out of the loop itself, and some variables are cleared from memory. Small loops may even be taken apart completely!
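As a rough illustration of the idea (my own sketch, not what gcc actually emits for this code), peeling the first iteration of the loop in question would turn it into something like:

// Before peeling:
for (int i = 1; i < n; i++) {
    f[i] = fdata[i];
    c[i] = cdata[i];
}

// After peeling the first iteration (conceptually):
f[1] = fdata[1];
c[1] = cdata[1];
for (int i = 2; i < n; i++) {
    f[i] = fdata[i];
    c[i] = cdata[i];
}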
With this said and considered, try running this piece of code, with -O -fpeel-loops or even with -O3:
#include <iostream>
#include <cstdlib>
using namespace std;

int n, d;
int *f, *c;

void loadData() {
    int fdata[] = {7, 2, 2, 7, 7, 1};
    int cdata[] = {66, 5, 4, 3, 2};
    n = 6;
    d = 3;
    f = new int[n + 1];
    c = new int[n];
    f[0] = fdata[0];
    c[0] = cdata[0];
    for (int i = 1; i < n; i++) {
        cout << f[i];
        f[i] = fdata[i];
        c[i] = cdata[i];
    }
    cout << "\nFINAL F[5]:" << f[5] << endl;
}

int main() {
    loadData();
}
You will see that it will print 1 regardless of your optimisation flags.
That is because of the statement cout << f[i], which changes the way -fpeel-loops operates.
Also, experiment with this block of code:
f[0] = fdata[0];
c[0] = cdata[0];
c[1] = cdata[1];
c[2] = cdata[2];
c[3] = cdata[3];
c[4] = cdata[4];
for (int i = 1; i < n; i++) {
    f[i] = fdata[i];
    c[5] = cdata[5];
}
cout << "\nFINAL F[5]:" << f[5] << endl;
You will see that even in this case, with all your optimisation flags, the output is 1 and not 0. Even in this case:
for (int i = 1; i < n; i++) {
    c[i] = cdata[i];
}
for (int i = 1; i < n; i++) {
    f[i] = fdata[i];
}
The produced output is actually 1. This is, again, because we have changed the structure of the loop, and -fpeel-loops is no longer able to reorganise it in the way that produced the error. The same goes for wrapping it into a while loop:
int i = 1;
while (i < 6) {
    f[i] = fdata[i];
    c[i] = cdata[i];
    i++;
}
Although with the while loop, -O3 will prevent compilation here because of -Waggressive-loop-optimizations, so you should test it with -O -fpeel-loops.
So, we can't really know for sure how your compiler is reorganising that loop for you; however, it is using the so-called as-if rule to do so, and you can read more about it here.
Of course, your compiler takes the freedom of refactoring that code for you based on the assumption that you abide by the rules. The as-if rule will always produce the desired output, provided the programmer has not caused undefined behaviour. In our case we have indeed broken the rules by going out of bounds of that array, so the strategy with which that loop was refactored was built upon the idea that this could not happen, and we made it happen.
So it is not quite as simple and straightforward as saying that all your problems are created by reading from an unallocated memory address. Of course, that is ultimately why the problem manifested itself, but you were reading out of bounds of cdata, which is a different array! So it is not as simple as saying that the mere read out of bounds of your array is causing the issue, because if you do it like this:
int i = 0;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
f[i] = fdata[i];
c[i] = cdata[i];
i++;
It will work and print 1! Even though we are definitely going out of bounds with that cdata array! So it is not the mere fact of reading from an unallocated memory address that is causing the issue.
Once again, it is actually the loop-refactoring strategy of -fpeel-loops, which assumed that you would not go out of bounds of that array and changed your loop accordingly.
Lastly, the fact that the optimisation flags lead to an output of 0 is strictly related to your machine. That 0 has no meaning; it is not the product of an actual logical operation, but rather a junk value found at a non-allocated memory address, or the result of an operation performed on such a junk value.
If you want to discover why this happens, you can examine the generated assembly code. Here it is in Compiler Explorer: https://godbolt.org/z/bs7ej7
Note that with -O3, the generated assembly code is not exactly easy to read.
As you mentioned, you have a memory problem in your code.
cdata has 5 members, so its maximum valid index is 4.
But in the for loop, when i == 5, invalid data is read, which causes undefined behavior:
c[i] = cdata[i];
Please modify your code like this and try again.
for (int i = 1; i < n; i++)
    f[i] = fdata[i];

for (int i = 1; i < 5; i++)
    c[i] = cdata[i];
The answer is simple - you are accessing memory out of bounds of the array. Who knows what is there? It could be some other variable in your program, it could be random garbage; it is completely undefined.
So, when you compile, sometimes you just so happen to get valid data there, and other times the data is not what you expect.
It is purely coincidental whether the data you expect is there or not; what the compiler does determines what data ends up there, i.e. it is completely unpredictable.
As for memory access, reading invalid memory is still a big no-no. So is reading potentially uninitialized values. You are doing both. They may not cause a segmentation fault, but they are still big no-no's.
My recommendation is use valgrind or a similar memory checking tool. If you run this code through valgrind, it will yell at you. With this, you can catch memory errors in your code that the compiler might not catch. This memory error is obvious, but some are hard to track down and almost all lead to undefined behavior, which makes you want to pull your teeth out with a rusty pair of pliers.
In short, don't access out of bounds elements and pray that the answer you are looking for just so happens to exist at that memory address.
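One possible fix for the code in question (a sketch of my own, not the only way to do it) is to let each loop take its bound from the array it reads, so the cdata loop stops at index 4 and never touches cdata[5]:

#include <iostream>
#include <iterator>   // std::size (C++17)
using namespace std;

int n, d;
int *f, *c;

void loadData() {
    int fdata[] = {7, 2, 2, 7, 7, 1};
    int cdata[] = {66, 5, 4, 3, 2};
    n = 6;
    d = 3;
    f = new int[n + 1];
    c = new int[n];
    // Each loop's bound comes from the array it reads, so no
    // out-of-bounds access is possible.
    for (size_t i = 0; i < size(fdata); i++)
        f[i] = fdata[i];
    for (size_t i = 0; i < size(cdata); i++)
        c[i] = cdata[i];
    cout << f[5] << endl;   // prints 1 regardless of optimisation level
}

int main() {
    loadData();
}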
I was reading an interesting post about Why is it faster to process a sorted array than an unsorted array? and saw a comment made by @mp31415 that said:
Just for the record. On Windows / VS2017 / i7-6700K 4GHz there is NO difference between two versions. It takes 0.6s for both cases. If number of iterations in the external loop is increased 10 times the execution time increases 10 times too to 6s in both cases
So I tried it on an online C/C++ compiler (with, I suppose, a modern server architecture), and I get, for the sorted and unsorted versions respectively, ~1.9s and ~1.85s: not so different, but repeatable.
So I wonder if it is still true for modern architectures?
The question was from 2012, not so far from now...
Or where am I wrong?
Clarification for reopening:
Please forget about my adding the C code as an example. That was a terrible mistake. Not only was the code erroneous, but posting it misled people who were focusing on the code itself rather than on the question.
When I first tried the C++ code used in the link above, I got only a 2% difference (1.9s and 1.85s).
My first question and intent was about the previous post, its C++ code and the comment of @mp31415.
@rustyx made an interesting comment, and I wondered if it could explain what I observed.
Interestingly, a debug build exhibits 400% difference between sorted/unsorted, and a release build at most 5% difference (i7-7700).
In other words, my question is:
Why did the C++ code in the previous post not perform as well as claimed by the previous OP?
More precisely:
Could the timing difference between a release build and a debug build explain it?
You're a victim of the as-if rule:
... conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...
Consider the function under test ...
const size_t arraySize = 32768;
int *data;

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
And the generated assembly (VS 2017, x86_64 /O2 mode)
The machine does not execute your loops, instead it executes a similar program that does this:
long long test()
{
    long long sum = 0;
    // Primary loop
    for (size_t c = 0; c < arraySize; ++c)
    {
        for (size_t i = 0; i < 20000; ++i)
        {
            if (data[c] >= 128)
                sum += data[c] * 5;
        }
    }
    return sum;
}
Observe how the optimizer reversed the order of the loops and defeated your benchmark.
Obviously the latter version is much more branch-predictor-friendly.
You can in turn defeat the loop hoisting optimization by introducing a dependency in the outer loop:
long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        sum += data[sum % 15]; // <== dependency!
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
Now this version again exhibits a massive difference between sorted/unsorted data. On my system (i7-7700) 1.6s vs 11s (or 700%).
Conclusion: branch predictor is more important than ever these days when we are facing unprecedented pipeline depths and instruction-level parallelism.
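As a side note (my own sketch, not taken from the post under discussion), the data-dependent branch can also be removed altogether in a hypothetical test_branchless() variant, which tends to make sorted and unsorted inputs perform about the same because there is nothing left to mispredict:

long long test_branchless()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        for (size_t c = 0; c < arraySize; ++c)
        {
            // The comparison yields 0 or 1; multiplying by it adds
            // either 0 or data[c], with no conditional branch.
            sum += static_cast<long long>(data[c] >= 128) * data[c];
        }
    }
    return sum;
}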
I have a weird problem. I have written some MEX/Matlab-functions using C++. On my computer everything works fine. However, using the institute's cluster, the code sometimes simply stops running without any error (a core file is created which says "CPU limit time exceeded"). Unfortunately, on the cluster I cannot really debug my code and I also cannot reproduce the error.
What I do know is that the error only occurs for very large runs, i.e., when a lot of memory is required. My assumption is therefore that my code has some memory leaks.
The only real part where I could think of is the following bit:
#include <vector>
using std::vector;

vector<int> createVec(int length) {
    vector<int> out(length);
    for (int i = 0; i < length; ++i)
        out[i] = 2.0 + i; // the real vector is obviously filled different, but it's just simple computations
    return out;
}

void someFunction() {
    int numUser = 5;
    int numStages = 3;
    // do some stuff
    for (int user = 0; user < numUser; ++user) {
        vector< vector<int> > userData(numStages);
        for (int stage = 0; stage < numStages; ++stage) {
            userData[stage] = createVec(42);
            // use the vector for some computations
        }
    }
}
My question now is: Could this bit produce memory leaks, or is it safe thanks to RAII (which I would think it is)? Question for the MATLAB experts: Does this behave any differently when run as a MEX file?
Thanks
I appear to have coded a class that travels backwards in time. Allow me to explain:
I have a function, OrthogonalCamera::project(), that sets a matrix to a certain value. I then print out the value of that matrix, as such.
cam.project();
std::cout << "My Projection Matrix: " << std::endl << ProjectionMatrix::getMatrix() << std::endl;
cam.project() pushes a matrix onto ProjectionMatrix's stack (I am using the std::stack container), and ProjectionMatrix::getMatrix() just returns the stack's top element. If I run just this code, I get the following output:
2 0 0 0
0 7.7957 0 0
0 0 -0.001 0
-1 -1 -0.998 1
But if I run the code with these two lines after the std::cout call
float *foo = new float[16];
Mat4 fooMatrix = foo;
Then I get this output:
2 0 0 0
0 -2 0 0
0 0 -0.001 0
-1 1 -0.998 1
My question is the following: what could I possibly be doing such that code executed after I print a value changes the value being printed?
Some of the functions I'm using:
static void load(Mat4 &set)
{
    if (ProjectionMatrix::matrices.size() > 0)
        ProjectionMatrix::matrices.pop();
    ProjectionMatrix::matrices.push(set);
}

static Mat4 &getMatrix()
{
    return ProjectionMatrix::matrices.top();
}
and
void OrthogonalCamera::project()
{
    Mat4 orthProjection = { { 2.0f / (this->r - this->l), 0, 0, -1 * ((this->r + this->l) / (this->r - this->l)) },
                            { 0, 2.0f / (this->t - this->b), 0, -1 * ((this->t + this->b) / (this->t - this->b)) },
                            { 0, 0, -2.0f / (this->farClip - this->nearClip), -1 * ((this->farClip + this->nearClip) / (this->farClip - this->nearClip)) },
                            { 0, 0, 0, 1 } }; // this is apparently the projection matrix for an orthographic projection
    orthProjection = orthProjection.transpose();
    ProjectionMatrix::load(orthProjection);
}
EDIT: whoever formatted my code, thank you. I'm not really too good with the formatting here, and it looks much nicer now :)
FURTHER EDIT: I have verified that the initialization of fooMatrix is running after I call std::cout.
UPTEENTH EDIT: Here is the function that initializes fooMatrix:
typedef Matrix<float, 4, 4> Mat4;

template<typename T, unsigned int rows, unsigned int cols>
Matrix<T, rows, cols>::Matrix(T *set)
{
    this->matrixData = new T*[rows];
    for (unsigned int i = 0; i < rows; i++)
    {
        this->matrixData[i] = new T[cols];
    }

    unsigned int counter = 0; // because I was too lazy to use set[(i * cols) + j]
    for (unsigned int i = 0; i < rows; i++)
    {
        for (unsigned int j = 0; j < cols; j++)
        {
            this->matrixData[i][j] = set[counter];
            counter++;
        }
    }
}
g64th EDIT: This isn't just an output problem. I actually have to use the value of the matrix elsewhere, and its value aligns with the described behaviours (whether or not I print it).
TREE 3rd EDIT: Running it through the debugger gave me yet another different value:
-7.559 0 0 0
0 -2 0 0
0 0 -0.001 0
1 1 -0.998 1
a(g64, g64)th EDIT: The problem does not exist when compiling on Linux, just on Windows with MinGW. Could it be a compiler bug? That would make me sad.
FINAL EDIT: It works now. I don't know what I did, but it works. I've made sure I was using an up-to-date build that didn't have the code that ensures causality still functions, and it works. Thank you for helping me figure this out, Stack Overflow community. As always, you've been helpful and tolerant of my slowness. I'll be hypervigilant for any undefined behaviour or pointer screw-ups that can cause this unpredictability.
You're not writing your program instruction by instruction. You are describing its behavior to a C++ compiler, which then tries to express the same in machine code.
The compiler is allowed to reorder your code, as long as the observable behavior does not change.
In other words, the compiler is almost certainly reordering your code. So why does the observable behavior change?
Because your code exhibits undefined behavior.
Again, you are writing C++ code. C++ is a standard, a specification saying what the "meaning" of your code is. You're working under a contract that "As long as I, the programmer, write code that can be interpreted according to the C++ standard, then you, the compiler, will generate an executable whose behavior matches that of my source code".
If your code does anything not specified in this standard, then you have violated this contract. You have fed the compiler code whose behavior can not be interpreted according to the C++ standard. And then all bets are off. The compiler trusted you. It believed that you would fulfill the contract. It analyzed your code and generated an executable based on the assumption that you would write code that had a well-defined meaning. You did not, so the compiler was working under a false assumption. And then anything it builds on top of that assumption is also invalid.
Garbage in, garbage out. :)
Sadly, there's no easy way to pinpoint the error. You can carefully study every piece of your code, or you can try stepping through the offending code in the debugger. Or break into the debugger at the point where the "wrong" value is seen, and study the disassembly and how you got there.
It's a pain, but that's undefined behavior for you. :)
Analysis tools may help: Valgrind on Linux; on Windows, depending on your version of Visual Studio, the /analyze switch may or may not be available; Clang has a similar option built in.
What is your compiler? If you are compiling with gcc, try turning on thorough and verbose warnings. If you are using Visual Studio, set your warnings to /W4 and treat all warnings as errors.
Once you have done that and can still compile, if the bug still exists, then run the program through Valgrind. It is likely that at some earlier point in your program you read past the end of some array and then wrote something; whatever you wrote is overwriting what you're trying to print. Then, when you put more things on the stack, reading past the end of that array lands in a completely different location in memory, so you end up overwriting something else instead. Valgrind was made to catch stuff like that.
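To make that failure mode concrete, here is a contrived sketch (my own, not taken from the asker's code) of how a single out-of-bounds write can silently change a neighbouring value; being undefined behaviour, whether anything visible happens at all depends on the compiler and optimisation level:

#include <iostream>

int main()
{
    int value = 42;               // the value we think we are printing
    int buffer[4] = {0, 1, 2, 3};

    // Undefined behaviour: the last iteration writes one element past the
    // end of buffer. Depending on stack layout and optimisation, this may
    // overwrite 'value', some other object, or nothing visible at all.
    for (int i = 0; i <= 4; ++i)
        buffer[i] = -1;

    std::cout << value << std::endl;  // may print 42, -1, or anything else
}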