Auto in loop and optimizations - c++

Can you explain why there is such a difference in computation time between the following pieces of code (not optimized)? I suspect RVO vs. move-construction, but I'm not really sure.
In general, what is the best practice when encountering such a case? Is an auto declaration in a loop considered bad practice when initializing non-POD data?
Using auto inside the loop:
#include <vector>

std::vector<int> foo()
{
    return {1, 2, 3, 4, 5};
}

int main()
{
    for (size_t i = 0; i < 1000000; ++i)
        auto f = foo();
    return 0;
}
Output:
./a.out 0.17s user 0.00s system 97% cpu 0.177 total
Vector instance outside the loop:
#include <vector>

std::vector<int> foo()
{
    return {1, 2, 3, 4, 5};
}

int main()
{
    std::vector<int> f;
    for (size_t i = 0; i < 1000000; ++i)
        f = foo();
    return 0;
}
Output:
./a.out 0.32s user 0.00s system 99% cpu 0.325 total

I suspect RVO vs move-construction but I'm not really sure.
Yes, that is almost certainly what's happening. The first case move-initialises a variable from the function's return value: in this case, the move can be elided by making the function initialise it in place. The second case move-assigns from the return value; assignments can't be elided. I believe GCC performs elision even at optimisation level zero, unless you explicitly disable it.
In the final case (with -O3, which has now been removed from the question) the compiler probably notices that the loop has no side effects, and removes it entirely.
You might (or might not) get a more useful benchmark by declaring the vector volatile and compiling with optimisation. This will force the compiler to actually create/assign it on each iteration, even if it thinks it knows better.
Is auto declaration in a loop considered as a bad practice when initializing non-POD data ?
No; if anything, it's considered better practice to declare things in the narrowest scope that's needed. So if it's only needed in the loop, declare it in the loop. In some circumstances, you may get better performance by declaring a complicated object outside a loop to avoid recreating it on each iteration; but only do that when you're sure that the performance benefit (a) exists and (b) is worth the loss of locality.

I don't see your example having anything to do with auto. You wrote two different programs.
While
for (size_t i = 0; i < 1000000; ++i)
auto f = foo();
is equivalent to
for (size_t i = 0; i < 1000000; ++i)
std::vector<int> f = foo();
-- which means you create a new vector (and destroy the old one) on every iteration. And yes, your foo implementation benefits from RVO, but that is not the point here: you still create a new vector in the place where the loop makes room for f.
The snippet
std::vector<int> f;
for (size_t i = 0; i < 1000000; ++i)
f = foo();
uses assignment to an existing vector. And yes, with RVO it may become a move-assignment, depending on foo (it is one in your case), so you can expect it to be fast. But it is still a different thing: it is always the single f that is in charge of managing the resources.
But what you show very beautifully here is that it often makes sense to follow the general rule:
Declare variables as close to their use as possible.
See this discussion.

Related

Benchmarking adding elements to vector when size is known

I have made a tiny benchmark for adding new elements to a vector whose size I know in advance.
Code:
struct foo {
    foo() = default;
    foo(double x, double y, double z) : x(x), y(y), z(z) {}
    double x;
    double y;
    double z;
};
void resize_and_index() {
    std::vector<foo> bar(1000);
    for (auto& item : bar) {
        item.x = 5;
        item.y = 5;
        item.z = 5;
    }
}

void reserve_and_push() {
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++) {
        bar.push_back(foo(5, 5, 5));
    }
}

void reserve_and_push_move() {
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++) {
        bar.push_back(std::move(foo(5, 5, 5)));
    }
}

void reserve_and_embalce() {
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++) {
        bar.emplace_back(5, 5, 5);
    }
}
I then called each method 100000 times.
Results:
resize_and_index: 176 mSec
reserve_and_push: 560 mSec
reserve_and_push_move: 574 mSec
reserve_and_embalce: 143 mSec
Calling code:
const size_t repeate = 100000;

auto start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    resize_and_index();
}
auto stop_time = clock();
std::cout << "resize_and_index: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push();
}
stop_time = clock();
std::cout << "reserve_and_push: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push_move();
}
stop_time = clock();
std::cout << "reserve_and_push_move: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_embalce();
}
stop_time = clock();
std::cout << "reserve_and_embalce: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;
My questions:
Why did I get these results? What makes emplace_back superior to the others?
Why does std::move make the performance slightly worse?
Benchmarking conditions:
Compiler: VS.NET 2013 C++ compiler (/O2 Max speed Optimization)
OS : Windows 8
Processor: Intel Core i7-410U CPU @ 2.00 GHz
Another Machine (By horstling):
VS2013, Win7, Xeon 1241 @ 3.5 GHz
resize_and_index: 144 mSec
reserve_and_push: 199 mSec
reserve_and_push_move: 201 mSec
reserve_and_embalce: 111 mSec
First, reserve_and_push and reserve_and_push_move are semantically equivalent. The temporary foo you construct is already an rvalue (the rvalue reference overload of push_back is already used); wrapping it in a move does not change anything, except possibly obscure the code for the compiler, which could explain the slight performance loss. (Though I think it more likely to be noise.) Also, your class has identical copy and move semantics.
Second, the resize_and_index variant might be faster if you write the loop's body as
item = foo(5, 5, 5);
although only profiling will show that. The point is that the compiler might generate suboptimal code for the three separate assignments.
Third, you should also try this:
std::vector<foo> v(1000, foo(5, 5, 5));
Fourth, this benchmark is extremely sensitive to the compiler realizing that none of these functions actually do anything and simply optimizing their complete bodies out.
Now for analysis. Note that if you really want to know what's going on, you'll have to inspect the assembly the compiler generates.
The first version does the following:
Allocate space for 1000 foos.
Loop and default-construct each one.
Loop over all elements and reassign the values.
The main question here is whether the compiler realizes that the constructor in the second step is a no-op and that it can omit the entire loop. Assembly inspection can show that.
The second and third versions do the following:
Allocate space for 1000 foos.
1000 times:
construct a temporary foo object
ensure there is still enough allocated space
move the temporary into the allocated space (for your type, equivalent to a copy, since your class doesn't have special move semantics).
Increment the vector's size.
There is a lot of room for optimization here for the compiler. If it inlines all operations into the same function, it could realize that the size check is superfluous. It could then realize that your move constructor cannot throw, which means the entire loop is uninterruptible, which means it could merge all the increments into one assignment. If it doesn't inline the push_back, it has to place the temporary in memory and pass a reference to it; there's a number of ways it could special-case this to be more efficient, but it's unlikely that it will.
But unless the compiler does some of these, I would expect this version to be a lot slower than the others.
The fourth version does the following:
Allocate enough space for 1000 foos.
1000 times:
ensure there is still enough allocated space
create a new object in the allocated space, using the constructor with the three arguments
increment the size
This is similar to the previous, with two differences: first, the way the MS standard library implements push_back, it has to check whether the passed reference is a reference into the vector itself; this greatly increases complexity of the function, inhibiting inlining. emplace_back does not have this problem. Second, emplace_back gets three simple scalar arguments instead of a reference to a stack object; if the function doesn't get inlined, this is significantly more efficient to pass.
Unless you work exclusively with Microsoft's compiler, I would strongly recommend you compare with other compilers (and their standard libraries). I also think that my suggested version would beat all four of yours, but I haven't profiled this.
In the end, unless the code is really performance-sensitive, you should write the version that is most readable. (That's another place where my version wins, IMO.)
Why did I get these results? What makes emplace_back superior to the others?
You got these results because you benchmarked it and you had to get some results :).
emplace_back does better in this case because it constructs the object directly in the memory location reserved by the vector. So it does not have to first create an object (perhaps a temporary) outside and then copy/move it into the vector's reserved location, thereby saving some overhead.
Why does std::move make the performance slightly worse ?
If you are asking why it's more costly than emplace, it is because it has to 'move' the object. In this case the move operation reduces to a copy, so it must be the copy operation that takes the extra time, since no such copy happens in the emplace case.
You can try digging the assembly code generated and see what exactly is happening.
Also, I don't think comparing the rest of the functions against resize_and_index is fair. There is a possibility that objects are instantiated more than once in the other cases.
I am not sure whether the discrepancy between reserve_and_push and reserve_and_push_move is just noise. I did a simple test using g++ 4.8.4 and noticed an increase in executable size/additional assembly instructions, even though in theory the std::move can be ignored by the compiler.

Memory allocation for return value of a function in a loop in C++11: how does it optimize?

I'm in the mood for some premature optimization and was wondering the following.
If one has a for loop, and inside that loop there is a call to a function that returns a container, say a vector, whose value is captured into a variable in the loop using move semantics, for instance:
std::vector<any_type> function(int i)
{
    std::vector<any_type> output(3);
    output[0] = i;
    output[1] = i * 2;
    output[2] = i - 3;
    return output;
}
int main()
{
    for (int i = 0; i < 10; ++i)
    {
        // stuff
        auto value = function(i);
        // do stuff with value ...
        // ... but in such a way that it can be discarded in the next iteration
    }
}
How do compilers handle this memory-wise in the case that move semantics are applied (and that the function will not be inlined)? I would imagine that the most efficient thing to do is to allocate a single piece of memory for all the values, both inside the function and outside in the for-loop, that will get overwritten in each iteration.
I am mainly interested in this, because in my real-life application the vectors I'm creating are a lot larger than in the example given here. I am concerned that if I use functions like this, the allocation and destruction process will take up a lot of useless time, because I already know that I'm going to use that fixed amount of memory a lot of times. So, what I'm actually asking is whether there's some way that compilers would optimize to something of this form:
void function(int i, std::vector<any_type> &output)
{
    // fill output
}

int main()
{
    std::vector<any_type> dummy; // allocate memory only once
    for (int i = 0; i < 10; ++i)
    {
        // stuff
        function(i, dummy);
        // do stuff with dummy
    }
}
In particular I'm interested in the GCC implementation, but would also like to know what, say, the Intel compiler does.
Here, the most predictable optimization is RVO. When a function returns an object that is used to initialize a new variable, the compiler can elide the additional copy or move and construct directly at the destination (it means that a program can contain two versions of the function, depending on the use case).
Here, you will still pay for allocating and destroying a buffer inside the vector at each loop iteration. If that is unacceptable, you will have to rely on another solution, such as std::array (since your function seems to use a fixed size), or move the vector before the loop and reuse it.
I would imagine that the most efficient thing to do is to allocate a
single piece of memory for all the values, both inside the function
and outside in the for-loop, that will get overwritten in each
iteration.
I don't think that any of the current compilers can do that. (I would be stunned to see that.) If you want to get insights, watch Chandler Carruth's talk.
If you need this kind of optimization, you need to do it yourself: Allocate the vector outside the loop and pass it by non-const reference to function() as argument. Of course, don't forget to call clear() when you are done or call clear() first inside function().
All this has nothing to do with move semantics, nothing has changed with C++11 in this respect.
If your loop is a busy loop, then allocating a container in each iteration can cost you a lot. It's easier to find yourself in such a situation than you would probably expect. Andrei Alexandrescu presents an example in his talk Writing Quick Code in C++, Quickly. The surprising thing is that doing unnecessary heap allocations in a tight loop like the one in his example can be slower than the actual file I/O. I was surprised to see that. By the way, the container was std::string.

clearing a vector or defining a new vector, which one is faster

Which method is faster and has less overhead?
Method 1:
void foo() {
    std::vector<int> aVector;
    for (int i = 0; i < 1000000; ++i) {
        aVector.clear();
        aVector.push_back(i);
    }
}
Method 2:
void foo() {
    for (int i = 0; i < 1000000; ++i) {
        std::vector<int> aVector;
        aVector.push_back(i);
    }
}
You may say that the example is meaningless! But this is just a snippet from my big code. In short I want to know is it better to
"create a vector once and clear it for usage"
or
"create a new vector every time"
UPDATE
Thanks for the suggestions, I tested both and here are the results
Method 1:
$ time ./test1
real 0m0.044s
user 0m0.042s
sys 0m0.002s
Method 2:
$ time ./test2
real 0m0.601s
user 0m0.599s
sys 0m0.002s
Clearing the vector is better. Maybe this helps someone else :)
The clear() version is most likely to be faster, as you retain the memory allocated by previous push_back()s into the vector, thus decreasing the need for reallocation.
Also, you do away with one constructor call and one destructor call per iteration.
This is all ignoring what your compiler's optimizer might do with this code.
Creating an empty vector has very little overhead. GROWING a vector to a large size is potentially quite expensive, as it doubles in size each time, so a 1M-entry vector would have 15-20 "copies" made of the current content.
For trivial basic types, such as int, the overhead of creating an object and destroying the object is "nothing", but for any more complex object, you will have to take into account the construction and destruction of the object, which is often substantially more than the "put the object in the vector" and "remove it from the vector". In other words, the constructor and destructor for each object is what matters.
For EVERY "which is faster of X or Y" you really need to benchmark for the circumstances that you want to understand, unless it's VERY obvious that one is clearly faster than the other (such as "linear search or binary search of X elements", where linear search is proportional to X, and binary search is log2(x)).
Further, I'm slightly confused by your example: storing ONE element in a vector is quite cumbersome, and a fair bit of overhead over int x = i; -- I presume you don't really mean that as a benchmark. In other words, your particular comparison is not very fair, because clearly constructing 1M vectors is more work than constructing ONE vector and filling and clearing it 1M times. However, if you made your test something like this:
void foo() {
    for (int i = 0; i < 1000; ++i) {
        std::vector<int> aVector;
        for (int j = 0; j < 1000; ++j) {
            aVector.push_back(i);
        }
    }
}
[and the corresponding change to the other code], I expect the results would be fairly similar.

c++ variable declaration

I'm wondering if this code:
int main() {
    int p;
    for (int i = 0; i < 10; i++) {
        p = ...;
    }
    return 0;
}
is exactly the same as this one:
int main() {
    for (int i = 0; i < 10; i++) {
        int p = ...;
    }
    return 0;
}
in terms of efficiency?
I mean, will the p variable be recreated 10 times in the second example?
It is the same in terms of efficiency.
It's not the same in terms of readability, though. The second is better in this respect, isn't it?
It's a semantic difference which the code keeps hidden, because it makes no difference for int, but it makes a difference to the human reader. Do you want to carry the value of whatever calculation you do in ... outside of the loop? You don't, so you should write code that reflects your intention.
A human reader will need to scan the function and look for other uses of p to convince himself that what you did was just premature "optimization" and didn't have any deeper purpose.
Assuming it makes a difference for the type you use, you can help the human reader by commenting your code
/* p is only used inside the for loop, but is declared outside to keep it from reallocating */
std::vector<int> p;
p.reserve(10);
for (int i = 0; i < 10; i++) {
    p.clear();
    /* ... */
}
In this case, it's the same. Use the smallest scope possible for the most readable code.
If int were a class with a significant constructor and destructor, then the first (declaring it outside the loop) can be a significant savings - but inside you usually need to recreate the state anyway... so oftentimes it ends up being no savings at all.
One instance where it might make a difference is containers. A string or vector uses internal storage that gets grown to fit the size of the data it is storing. You may not want to reconstruct this container each time through the loop, instead, just clear its contents and it may not need as many reallocations inside the loop. This can (in some cases) result in a significant performance improvement.
The bottom-line is write it clearly, and if profiling shows it matters, move it out :)
They are equal in terms of efficiency - you should trust your compiler to get rid of the immeasurably small difference. The second is better design.
Edit: This isn't necessarily true for custom types, especially those that deal with memory. If you were writing a loop for any T, I'd sure use the first form just in case. But if you know that it's an inbuilt type, like int, pointer, char, float, bool, etc. I'd go for the second.
In the second example, p is visible only inside the for loop; you cannot use it further down in your code.
In terms of efficiency, they are equal.

c++ for loop temporary variable use

Which of the following is better and why? (Particular to c++)
a.
int i(0), iMax(vec.size()); // vec is a container, say std::vector
for (; i < iMax; ++i)
{
    //loop body
}
b.
for (int i(0); i < vec.size(); ++i)
{
    //loop body
}
I have seen advice for (a) because of the call to the size() function. This is bothering me. Doesn't any modern compiler optimize (b) to be similar to (a)?
Example (b) has a different meaning from example (a), and the compiler must interpret it as you wrote it.
If, (for some made-up reason that I can't think of), I wrote code to do this:
for (int i(0); i < vec.size(); ++i)
{
    if (i % 4 == 0)
        vec.push_back(Widget());
}
I really would not want the compiler to optimise out each call to vec.size(), because I would get different results.
I like:
for (int i = 0, e = vec.size(); i != e; ++i)
Of course, this would also work for iterators:
for (vector<int>::const_iterator i = v.begin(), e = v.end(); i != e; ++i)
I like this because it's both efficient (calling end() just once), and also relatively succinct (only having to type vector<int>::const_iterator once).
I'm surprised nobody has said the obvious:
In 99.99% of cases, it doesn't matter.
Unless you are using some container where calculating size() is an expensive operation, it is unfathomable that your program will go even a few nanoseconds slower. I would say stick with the more readable until you profile your code and find that size() is a bottleneck.
There are two issues to debate here:
The variable scope
The end condition re-evaluation
Variable scope
Normally, you wouldn't need the loop variable to be visible outside of the loop. That's why you can declare it inside the for construct.
End condition re-evaluation
Andrew Shepherd stated it nicely: it means something different to put a function call inside the end condition:
for (vector<...>::size_type i = 0; i < v.size(); ++i) { // vector size may grow
    if (...) v.push_back(i); // contrived, but possible
}
// note: this code may be replaced by a std::for_each construct; the previous can't.
for (vector<...>::size_type i = 0, elements = v.size(); i != elements; ++i) {
}
Why is it bothering you?
Those two alternatives don't seem to be doing the same thing. One does a fixed number of iterations, while the other depends on the loop's body.
Another alternative could be
for (vector<T>::iterator it = vec.begin(); it != vec.end(); ++it) {
    //loop body
}
Unless you need the loop variable outside the loop, the second approach is preferable.
Iterators will actually give you as good or better performance. (There was a big comparison thread on comp.lang.c++.moderated a few years back).
Also, I would use
int i = 0;
Rather than the constructor like syntax you're using. While valid, it's not idiomatic.
Somewhat unrelated:
Warning: Comparison between signed and unsigned integer.
The correct type for array and vector indices is size_t.
Strictly speaking, in C++ it is even std::vector<>::size_type.
Amazing how many C/C++ developers still get this one wrong.
Let's look at the generated code (I use MSVS 2008 with full optimization).
a.
int i(0), iMax(vec.size()); // vec is a container, say std::vector
for (; i < iMax; ++i)
{
    //loop body
}
The for loop produces 2 assembler instructions.
b.
for (int i(0); i < vec.size(); ++i)
{
    //loop body
}
The for loop produces 8 assembler instructions. vec.size() is successfully inlined.
c.
for (std::vector<int>::const_iterator i = vec.begin(), e = vec.end(); i != e; ++i)
{
    //loop body
}
The for loop produces 15 assembler instructions (everything is inlined, but the code has a lot of jumps)
So, if your application is performance critical use a). Otherwise b) or c).
It should be noted that the iterator examples:
for (vector<T>::iterator it = vec.begin(); it != vec.end(); it++) {
    //loop body
}
could invalidate the loop iterator 'it' should the loop body cause the vector to reallocate. Thus it is not equivalent to
for (int i = 0; i < vec.size(); ++i) {
    //loop body
}
where loop body adds elements to vec.
Simple question: are you modifying vec in the loop?
The answer to this question will lead to your answer too.
It's very hard for a compiler to hoist the vec.size() call in the safe knowledge that it's constant, unless it gets inlined (which hopefully it often will be!). But at the very least, i should be declared in the second style (b), even if the size() call needs to be "manually" hoisted out of the loop!
This one is preferable:
typedef vector<int> container; // not really required; you could just use vector<int> in the for loop
for (container::const_iterator i = v.begin(); i != v.end(); ++i)
{
    // do something with (*i)
}
- I can tell right away that the vector is not being updated
- anyone can tell what is happening here
- I know how many iterations there are
- v.end() returns an iterator one past the last element, so there's no overhead of checking the size
- easy to update for different containers or value types
(b) won't calculate/call the function each time.
-- begin excerpt --
Loop Invariant Code Motion:
GCC includes loop-invariant code motion as part of its loop optimizer, as well as in its partial redundancy elimination pass. This optimization removes instructions from loops which compute a value that does not change throughout the lifetime of a loop.
-- end excerpt --
More optimizations for gcc:
https://www.in.redhat.com/software/gnupro/technical/gnupro_gcc.php3
Why not sidestep the issue entirely with BOOST_FOREACH?
#include <boost/foreach.hpp>

std::vector<double> vec;
//...
BOOST_FOREACH(double &d, vec)
{
    std::cout << d;
}