Trusting the Return Value Optimization - C++

How do you go about using the return value optimization?
Are there any cases where I can trust a modern compiler to use the optimization, or should I always go the safe way and return a pointer of some type / use a reference as a parameter?
Are there any known cases where the return value optimization can't be made?
It seems to me that the return value optimization would be fairly easy for a compiler to perform.

Whenever compiler optimizations are enabled (and in most compilers, even when optimizations are disabled), RVO will take place. NRVO is slightly less common, but most compilers will perform this optimization as well, at least when optimizations are enabled.
You're right, the optimization is fairly easy for a compiler to perform, which is why compilers almost always do it. The only cases where it "can't be made" are the ones where the optimization doesn't apply: RVO only applies when you return an unnamed temporary. If you want to return a named local variable, NRVO applies instead, and while it is slightly more complex for a compiler to implement, it's doable, and modern compilers have no problem with it.

See: http://en.wikipedia.org/wiki/Return_value_optimization#Compiler_support

To have the best chance that it occurs, you can return an object constructed directly in the return statement [can anyone remember the name for this idiom - I've forgotten it]:
Foo f() {
    ...
    return Foo( ... );
}
But as with all optimisations, the compiler can always choose not to do it. And at the end of the day, if you need to return a value there is no alternative to trusting the compiler - pointers and references won't cut it.

Related

Out params vs. NRVO in the presence of error codes

We have a code base that uses out params extensively because every function can fail with some error enum.
This is getting very messy and the code is sometimes unreadable.
I want to eliminate this pattern and bring a more modern approach.
The goal is to transform:
error_t fn(param_t *out) {
    // filling 'out'
}
param_t param;
error_t err = fn(&param);
into something like:
std::expected<param_t, error_t> fn() {
    param_t ret;
    // filling 'ret'
    return ret;
}
auto result = fn(); // then check result.has_value() and use *result or result.error()
The following questions are in order to convince myself and others this change is for the best:
I know that at the standard level, NRVO is not mandatory (unlike RVO in C++17), but practically, is there any chance it won't happen in any of the major compilers?
Are there any advantages of using out parameters instead of NRVO?
Assuming NRVO happens, is there a significant change in the generated assembly (assuming an optimized expected implementation [perhaps with the boolean representing whether an error occurred completely disappearing])?
First off, a few assumptions:
We are looking at functions that are not being inlined. If the function is inlined, the two forms are almost guaranteed to be equivalent.
We are going to assume that the call sites of the function actually check the error condition before using the returned value.
We are going to assume that the returned value has not been pre-initialized with partial data.
We are going to assume that we only care about optimized code here.
That being established:
I know that at the standard level, NRVO is not mandatory (unlike RVO in C++17), but practically, is there any chance it won't happen in any of the major compilers?
Assuming that NRVO is being performed is a safe bet at this point. I'm sure someone could come up with a contrived situation where it wouldn't happen, but I generally feel confident that in almost all use-cases, NRVO is being performed on current modern compilers.
That being said, I would never rely on this behavior for program correctness. I.e., I wouldn't write a weird copy constructor with side effects on the assumption that it never gets invoked thanks to NRVO.
Are there any advantages of using out parameters instead of NRVO?
In general no, but like all things in C++, there are edge-case scenarios where it could come up. Explicit memory layouts for maximizing cache coherency would be a good use-case for "returning by pointer".
Assuming NRVO happens, is there a significant change in the generated assembly (assuming an optimized expected implementation [perhaps with the boolean representing whether an error occurred completely disappearing])?
That question doesn't make that much sense to me. expected<> behaves a lot more like a variant<> than a tuple<>, so the idea of "the boolean representing whether an error occurred" completely disappearing doesn't really make sense.
That being said, I think we can use std::variant to estimate:
https://godbolt.org/g/XpqLLG
It's "different" but not necessarily better or worse in my opinion.

How can I be sure that Return value optimization will be done

I have written functions that return huge objects by value. My coworkers complain that this will do a redundant copy and suggest returning objects by reference as a function argument. I know that return value optimization will be done and the copies will be eliminated, but the code will be used in a library that can be compiled by different compilers and I can't test on all of them. In order to convince my coworkers that it is safe to return objects by value, I need some document where this is stated.
I have looked at the C++03 standard but can't find anything about return value optimization. Can you please give a link to a document (standard) where it is defined that RVO will be done? Or, if that doesn't exist, where can I find a list of compilers that support RVO?
The standard never guarantees that RVO will happen; it just allows it.
You can check the actual code produced to find out if it happened, but this still is no guarantee that it will still happen in the future.
But at the end, every decent compiler can perform RVO in many cases, and even if no RVO happens, C++11 (and later) move construction can make returning relatively cheap.
One method you could use to prove to your coworkers that RVO is being done is to put printfs or other similar statements in the code.
HugeObject& HugeObject::operator=(const HugeObject& rhs)
{
    printf("HugeObject::operator= called\n");
    return *this;
}
and
HugeObject::HugeObject(const HugeObject& rhs)
{
    printf("HugeObject::copy constructor called\n");
}
and
HugeObject SomeFunctionThatCreatesHugeObject()
{
    ...
    printf("SomeFunction returning HugeObject\n");
}
Then run the code in question, and verify that the expected number of objects have been constructed/copied.

return value optimization vs auto_ptr for large vectors

If I use auto_ptr as the return value of a function that populates large vectors, this makes the function a source function (it will create an internal auto_ptr and hand over ownership when it returns a non-const auto_ptr). However, I cannot use this function with STL algorithms because, in order to access the data, I need to dereference the auto_ptr. A good example, I guess, would be a field of N vectors, with each vector having 100 components. Whether the function returns each 100-component vector by value or by reference is not the same, if N is large.
Also, when I try this very basic code:
class t
{
public:
    t() { std::cout << "ctor" << std::endl; }
    ~t() { std::cout << "dtor" << std::endl; }
};
t valueFun()
{
    return t();
}
std::auto_ptr<t> autoFun()
{
    return std::auto_ptr<t>(new t());
}
calls to both autoFun and valueFun result in the output
ctor
dtor
so I cannot actually see the extra automatic variable that would be created and handed off to the return statement. Does this mean that return value optimization is performed for the valueFun call? Does valueFun create two automatic objects at all in this case?
How do I then optimize a population of such a large data structure with a function?
There are many options for this, and dynamic allocation may not be the best.
Before we even delve into this discussion: is this a bottleneck?
If you did not profile and ensure it is a bottleneck, then this discussion could be completely off... Remember that profiling debug builds is pretty much useless.
Now, in C++03 there are several options, from the most palatable to the least one:
trust the compiler: unnamed temporaries get RVO even in debug builds in gcc, for example
use an "out" parameter (pass by reference)
allocate on the heap and return a pointer (smart or not)
check the compiler output
Personally, I would trust my compiler on this unless a profiler proves I am wrong.
In C++11, move semantics help us get more confident, because whenever there is a return statement, if RVO cannot kick in, then a move constructor (if available) is used automatically; and the move constructor of vector is dirt cheap.
So it becomes:
trust the compiler: either RVO or move semantics
allocate on the heap and return a unique_ptr
but really the second point should be used only for those few classes where move semantics do not help much: the cost of a move is usually proportional to the sizeof of the returned type. For example, a std::array<T, 10> has a size equal to 10 * sizeof(T), so moving it is not so cheap, and it might benefit from heap allocation + unique_ptr.
Tangent: you trust your compiler already. You trust it to warn you about errors, you trust it to warn you about dangerous/probably incorrect constructs, you trust it to correctly translate your code into machine assembly, you trust it to apply meaningful optimization to get a decent speed-up... Not trusting a compiler to apply RVO in obvious cases is like not trusting your heart surgeon with a $10 bill: it's the least of your worries. ;)
I am fairly sure that the compiler will do Return Value Optimization for valueFun. The main cases where return value optimization cannot be applied by the compiler are:
returning parameters
returning a different object based on a conditional
Thus the auto_ptr is not necessary, and would be even slower due to having to use the heap.
If you are still worried about the cost of moving around such a large vector, you might want to look into using the move semantics (std::vector aCopy(std::move(otherVector))) of C++11. These are almost as fast as RVO and can be used anywhere (a move is also guaranteed to be used for return values when RVO cannot be applied).
I believe most modern compilers support move semantics (or rvalue references, technically) at this point.

Is it meaningful to optimize i++ as ++i to avoid the temporary variable?

Someone told me that I can write
for (iterator it = somecontainer.begin(); it != somecontainer.end(); ++it)
instead of
for (iterator it = somecontainer.begin(); it != somecontainer.end(); it++)
...since the latter one has the cost of an extra unused temporary variable. Is this optimization useful with modern compilers? Do I need to consider this optimization when writing code?
It's a good habit to get into, since iterators may be arbitrarily complex. For vector::iterator or int indexes, no, it won't make a difference.
The compiler can never eliminate (elide) the copy because copy elision only eliminates intermediate temporaries, not unused ones. For lightweight objects including most iterators, the compiler can optimize out the code implementing the copy. However, it isn't always obvious when it can't. For example, post-incrementing istream_iterator<string> is guaranteed to copy the last string object read. On a non-reference-counted string implementation, that will allocate, copy, and immediately free memory. The issue is even more likely to apply to heavier, non-iterator classes supporting post-increment.
There is certainly no disadvantage. It seems to have become the predominant style over the past decade or two.
I don't normally consider the prefix ++ an optimization. It's just what I write by default, because it might be faster, it's just as easy to write, and there's no downside.
But I doubt it's worth going back and changing existing code to use the prefix version.
Yes, that's conceptually the right thing to do. Do you care if it's i++ or ++i? No, you don't. Which one is better? The second one is better since it's potentially faster. So you choose the second (pre-increment).
In typical cases there will be no difference in the emitted code, but iterators can have any implementation whatsoever. You can't control that, and you can't control whether the compiler will emit good code. What you can control is how you convey your intent. You don't need the post-increment here.
No. (Stylistically I prefer it, because my native language is English which is mostly a verb-precedes-noun language, and so "increment it" reads more easily than "it increment". But that's style and subjective.)
However, unless you're changing somecontainer's contents in your loop, you might consider grabbing the return value of somecontainer.end() into a temporary variable. What you're doing there will actually call the function on every iteration.
It depends on the type to which you're applying operator++. If you are talking about a user defined type, that's going to involve copying the entire UDT, which can be very expensive.
However, if you are talking about a builtin type, then there's likely no difference at all.
It's probably good to get in the habit of using pre-increment where possible, even when post-increment is fine, simply to be consistent and to give your code a uniform look. You should be leaning on the compiler to optimize some things for you, but there's no reason to make its job harder than it has to be.
Consider how post-increment is typically implemented in a user-defined type:
MyIterator operator++(int) {
    MyIterator retval(*this);
    ++*this;
    return retval;
}
So we have two questions: can my compiler optimise this in cases where the return value is unused, and will my compiler optimise this in cases where the return value is unused?
As for "can it?", there certainly are cases where it can't. If the code isn't inlined, we're out of luck. If the copy constructor of MyIterator has observable side effects then we're out of luck - copy constructor elision allows return values and copies of temporaries to be constructed in place, so at least the value might only be copied once. But it doesn't allow return values not to be constructed at all. Depending on your compiler, memory allocation might well be an observable side-effect.
Of course iterators usually are intended to be inlined, and iterators usually are lightweight and won't take any effort to copy. As long as we're OK on these two counts, I think we're in business - the compiler can inline the code, spot that retval is unused and that its creation is side-effect free, and remove it. This just leaves a pre-increment.
As for "will it?" - try it on your compiler. I can't be bothered, because I always use pre-increment where the two are equivalent ;-)
For "do I need to consider this optimization when writing code?", I would say not as such, but that premature pessimization is as bad a habit as premature optimization. Just because one is (at best) a waste of time does not mean that the other is noble - don't go out of your way to make your code slower just because you can. If someone genuinely prefers to see i++ in a loop then it's fairly unlikely ever to make their code slower, so they can optimise for aesthetics if they must. Personally I'd prefer that people improve their taste...
It will actually make a difference in unoptimized code, which is to say debug builds.

Will the C++ compiler optimize away unused return value?

If I have a function that returns an object, but this return value is never used by the caller, will the compiler optimize away the copy? (Possibly an always/sometimes/never answer.)
Elementary example:
ReturnValue MyClass::FunctionThatAltersMembersAndNeverFails()
{
    // Do stuff to members of MyClass that never fails
    return successfulResultObject;
}
void MyClass::DoWork()
{
    // Do some stuff
    FunctionThatAltersMembersAndNeverFails();
    // Do more stuff
}
In this case, will the ReturnValue object get copied at all? Does it even get constructed? (I know it probably depends on the compiler, but let's narrow this discussion down to the popular modern ones.)
EDIT: Let's simplify this a bit, since there doesn't seem to be a consensus in the general case. What if ReturnValue is an int, and we return 0 instead of successfulResultObject?
If the ReturnValue class has a non-trivial copy constructor, the compiler cannot simply skip constructing the return object - the temporary still has to be created even though it is never used. Copy elision may merge the copies involved, but it does not allow the return value not to be constructed at all.
If the copy constructor is inline, the compiler might be able to inline the call, which in turn might cause a elimination of much of its code (also depending on whether FunctionThatAltersMembersAndNeverFails is inline).
They most likely will if the optimization level causes them to inline the code. If not, they would have to generate two different translations of the same code to make it work, which could open up a lot of edge case problems.
The linker can take care of this sort of thing, even if the original caller and called are in different compilation units.
If you have a good reason to be concerned about the CPU load dedicated to a method call (premature optimization is the root of all evil,) you might consider the many inlining options available to you, including (gasp!) a macro.
Do you REALLY need to optimize at this level?
If return value is an int and you return 0 (as in the edited question), then this may get optimized away.
You have to look at the underlying assembly. If the function is not inlined then the underlying assembly will execute a mov eax, 0 (or xor eax, eax) to set eax (which is usually used for integer return values) to 0. If the function is inlined, this will certainly get optimized away.
But this scenario isn't too useful if you're worried about what happens when you return objects larger than 32 bits. You'll need to refer to the answers to the unedited question, which paint a pretty good picture: if everything is inlined then most of it will be optimized out. If it is not inlined, then the functions must be called even if they don't really do anything, and that includes the constructor of an object (since the compiler doesn't know whether the constructor modified global variables or did something else weird).
I doubt most compilers could do that if the caller and callee were in different compilation units (i.e. different files). Maybe if they were both in the same file, they could.
There is a pretty good chance that a peephole optimizer will catch this. Many (most?) compilers implement one, so the answer is probably "yes".
As others have noted, this is not a trivial question at the AST-rewriting level.
Peephole optimizers work on a representation of the code at a level equivalent to assembly language (but before generation of actual machine code). There is a chance to notice the load of the return value into a register followed by an overwrite with no intermediate read, and just remove the load. This is done on a case-by-case basis.
I just tried this example on Compiler Explorer, and at -O3 the mov is not generated when the return value is not used.
https://gcc.godbolt.org/z/v5WGPr