zero Functor construction and overhead even with new and delete? - c++

If I have a functor class with no state, but I create it from the heap with new, are typical compilers smart enough to optimize away the creation overhead entirely?
This question has come up when making a bunch of stateless functors. If they're allocated on the stack, does their 0 state class body mean that the stack really isn't changed at all? It seems it must in case you later take an address of the functor instance.
Same for heap allocation.
In that case, functors are always adding a (trivial, but non-zero) overhead in their creation. But maybe compilers can see whether the address is used and if not it can eliminate that stack allocation. (Or, can it even eliminate a heap allocation?)
But how about a functor that's created as a temporary?
#include <iostream>
struct GTfunctor
{
inline bool operator()(int a, int b) {return a>b; }
};
int main()
{
GTfunctor* f= new GTfunctor;
GTfunctor g;
std::cout<< (*f)(2,1) << std::endl;
std::cout<< g(2,1) << std::endl;
std::cout<< GTfunctor()(2,1) << std::endl;
delete f;
}
So in the concrete example above, the three lines each call the same functor in three different ways. In this example, is there any efficiency difference between the ways? Or is the compiler able to optimize each line all the way down to being a compute-less print statement?
Edit:
Most answers say that the compiler could never inline/eliminate the heap allocated functor. But is this really true as well? Most compilers (GCC, MS, Intel) have linktime optimization as well which could indeed do this optimization. (but does it?)

are typical compilers smart enough to optimize away the creation overhead entirely?
When you're creating them on the heap, I doubt whether the compiler is allowed to. IMO:
Invoking new implies invoking operator new.
operator new is a non-trivial function defined in the run-time library.
The compiler isn't allowed to decide that you didn't really mean to invoke such a function and to decide that as an optimization it will silently not invoke it.
When you're creating them on the stack, and not taking their address, then maybe ... or maybe not: my guess is that every object has a non-zero size, in order to occupy some memory, in order to have an identity, even when the object has no state apart from its identity.

Obviously, it depends on your compiler.
I would say
No compiler will optimize away the object on the heap. (This is because, as ChrisW says, compilers will never optimize away a call to new, which is almost certainly defined in another translation unit.)
Some compilers will optimize away a named object on the stack. I've known gcc to do this optimization quite often.
Most compilers will optimize away an unnamed object on the stack. This is one of the "standard" C++ optimizations, especially as more advanced C++ users tend to create lots of unnamed temporary variables.
Unfortunately, these are only rules of thumb. Optimizers are notoriously unpredictable; really the only way to know what your compiler is doing is to read the assembly output.
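For calls with constant arguments there is a way to take the guesswork out without reading assembly. This is my own sketch, not something from the question: it assumes C++11 or later and adds constexpr and const to the call operator, which the original GTfunctor does not have. If the static_asserts compile, the named and the unnamed functor calls were both evaluated at compile time.

```cpp
// Variant of the question's functor; constexpr and const are additions
// made for this illustration.
struct GTfunctor
{
    constexpr bool operator()(int a, int b) const { return a > b; }
};

// Named object, the analogue of "GTfunctor g;" in the question.
constexpr GTfunctor g{};
constexpr bool named_result = g(2, 1);

// Unnamed temporary, the analogue of "GTfunctor()(2,1)".
constexpr bool temporary_result = GTfunctor{}(2, 1);

static_assert(named_result, "named functor call folded at compile time");
static_assert(temporary_result, "temporary functor call folded at compile time");
```

This says nothing about calls with run-time arguments; for those the assembly output is still the authority.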

I highly doubt this type of optimization is allowed, but if your functor has no state, why would you want to initialize it on the heap? It should be just as easy to use it as a temporary.

A C++ object is always non-zero in size. The "empty base class optimization" allows an empty base class subobject to take up no space, but that doesn't apply here.
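The size rules can be checked with sizeof. A minimal sketch: Comparator and Holder are hypothetical types invented for the illustration, and the equality check assumes a mainstream compiler that applies the empty base class optimization to this layout.

```cpp
struct GTfunctor
{
    bool operator()(int a, int b) const { return a > b; }
};

// Hypothetical type: the empty functor is a base class, so it can share
// its address with the int member and add nothing to the size.
struct Comparator : GTfunctor
{
    int threshold;
};

// Hypothetical type: the empty functor is a member, so it needs its own
// address and therefore its own storage (plus padding).
struct Holder
{
    GTfunctor f;
    int threshold;
};

static_assert(sizeof(GTfunctor) >= 1, "a complete object is never zero-sized");
static_assert(sizeof(Comparator) == sizeof(int), "empty base adds no storage");
static_assert(sizeof(Holder) > sizeof(int), "empty member still occupies storage");
```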
I have not worked on any C++ optimizer, so whatever I say is just speculation. I think the 2nd and 3rd calls will easily be expanded inline with no overhead, and no GTfunctor will be created. The functor pointer, however, is a different story. In your example it may seem simple enough that any optimizer should be able to eliminate the heap allocation, but in a non-trivial program you may be creating the functors in one translation unit and using them in another, or even in a different library for which the compiler/linker/loader/runtime system has no source code, where it is almost impossible to optimize. Given that the optimization is not easy, the potential performance gain is small, and the number of cases where an empty functor is allocated on the heap is probably small, I think most optimizer writers will not put it high on their to-do list.

The compiler cannot optimize out a call to new or delete. It may however optimize out the variable created on the stack since it has no state.

Simple way to answer the heap question:
GTfunctor *f = new GTfunctor;
The value of f must not be null, so what should it be? And you also had:
GTfunctor *g = new GTfunctor;
Now the value of g must not equal the value of f, so what should each be?
Furthermore, neither f nor g may be equal to any other pointer obtained from new, unless some pointer elsewhere is somehow initialised to be equal to f or g, which (depending on the code that comes after) may involve examining what the entire rest of the program does.
Yes, if by local examination of the code the compiler can see that you never rely on any of these requirements, then it could perform a rewrite such that no heap allocation occurs. The problem is, if your code was that simple, you could probably do that rewrite yourself and end up with a more readable program anyway, e.g. your test program would look like your stack-based g example. So real programs would not benefit from such an optimisation in the compiler.
Presumably the reason you're doing this is because sometimes the functor does have data, depending on which type is chosen at runtime. So compile-time analysis cannot usefully work its magic here.
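The identity requirements above can be written down as a small check. This is only a sketch (the function name is made up, and const on the call operator is my addition); a compiler that can see the whole function is still free to fold the comparison and skip the allocations, as long as the result stays the same.

```cpp
#include <cassert>

struct GTfunctor
{
    bool operator()(int a, int b) const { return a > b; }
};

// Hypothetical helper: allocates two stateless functors and reports
// whether they got distinct addresses while both are alive.
bool heap_functors_have_distinct_addresses()
{
    GTfunctor* f = new GTfunctor;
    GTfunctor* g = new GTfunctor;
    assert(f != nullptr && g != nullptr); // plain new never returns null
    const bool distinct = (f != g);       // two live objects, two addresses
    delete f;
    delete g;
    return distinct;
}
```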

The C++ Standard states that each object (imho including those on the heap) must have a size of at least one byte, so it can be uniquely addressed.
Generating functors with new can lead to two problems:
The construction generally cannot be optimized away. New is a function with complex side effects (bad_alloc).
Because you address the functor indirectly the compiler may not be able to inline the function.
Chances are good that you will not see a sign of the functor, if you generate it on the stack.
Side note: The inline statement is not necessary. Every function which is defined in a class definition is treated as inlineable.

The compiler can probably figure out that operator() doesn't use any member variables, and optimize it to the max. I wouldn't make any assumptions about the local or heap allocated variables, though.
Edit: When in doubt, turn on the assembly output option on your compiler and see what it's actually doing. No sense in listening to a bunch of idiots on the web when you can see the real answer for yourself.

The answer to your question has two aspects.
Does the compiler optimize away the heap allocation: I strongly doubt it, but I'm not a standard guy, so I have to look it up.
Can the compiler optimize by inlining the object's operator()? Yes. As long as you don't specify the call as virtual, even the pointer dereferencing isn't actually performed.

Related

Manipulating std::array in a function

I have a multidimensional array of fixed size in my code, and I need to be able to change the values within it in a separate function. I want to know, are std::arrays passed as references in a method or is a copy made? So can I do this:
using std::array;
void foo (array<array<int,WIDTH>,HEIGHT> bar);
//manipulates the values in the array
...
int main() {
array<array<int,WIDTH>,HEIGHT> baz;
...
foo(baz);
//baz is changed
}
Or do I need to explicitly turn it into a reference? I fear that if I created an array function that returned a copy, it would be too messy and not as fast.
I want to know, are std::arrays passed as references in a method or is a copy made?
std::array is a value type. If you pass by value, a copy will (conceptually) be made.
This is different from the behaviour of a C-style array (int bar[HEIGHT][WIDTH]). A C-style array argument is not copied: it decays to a pointer to its first element, so the callee works on the caller's data. This is due to a fundamental design decision (some would say error) in the C language many moons ago.
do I need to explicitly turn it into a reference?
If you wish to pass a reference, yes.
I fear that if I created an array function that returned a copy, it would be too messy
Functional programmers would argue that it was cleaner.
and not as fast.
Be careful of assumptions like this. Among other reasons:
a) Compilers don't always do as you tell them. They can write code that has the same effect as if they'd done what you tell them, but achieves the same result quicker. (google: "as-if rule c++")
b) Modern CPU architectures are specifically engineered to be extremely good at moving contiguous blocks of memory.
Write the cleanest code you can, using abstract concepts to express your intent clearly. If the program runs too slowly, uses too much memory, uses too much power or your end users are complaining, then maybe it is time to concern yourself with by-reference or by-value optimisations.
However, I can assure you that any delays you are seeing in execution speed are much more likely to be in your treatment of IO or due to the selection of algorithms exhibiting high time complexity.
Pass by reference (or pointer) if you want to avoid having a copy made:
void foo (array<array<int,WIDTH>,HEIGHT> &bar);
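To make the difference concrete, here is a small sketch. WIDTH and HEIGHT are given made-up values, and Grid, set_by_value, set_by_reference and the two helper functions are names invented for the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t WIDTH = 3;  // assumed sizes, not from the question
constexpr std::size_t HEIGHT = 2;
using Grid = std::array<std::array<int, WIDTH>, HEIGHT>;

// Receives a copy: the caller's array is untouched.
void set_by_value(Grid bar) { bar[0][0] = 42; }

// Receives a reference: the caller's array is modified.
void set_by_reference(Grid& bar) { bar[0][0] = 42; }

int corner_after_by_value()
{
    Grid baz{};               // all elements zero-initialized
    assert(baz[0][0] == 0);
    set_by_value(baz);
    return baz[0][0];         // still 0
}

int corner_after_by_reference()
{
    Grid baz{};
    assert(baz[0][0] == 0);
    set_by_reference(baz);
    return baz[0][0];         // now 42
}
```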

Inlining a function that manipulates data on heap

I'm working on optimizing code where most of the objects are allocated on the heap.
What I'm trying to understand is: if/why the compiler might not inline a function call that potentially manipulates data on heap.
To make things more clear, suppose you have the following code:
class A
{
public:
void foo() // non-const function
{
// modify data
i++;
...
}
private:
int i;
// can be anything here, including pointers
};
int main()
{
A a; // allocate something on stack
auto ptr = std::make_unique<A>(); // allocate something on heap
a.foo(); // case 1
ptr->foo(); // case 2
return 0;
}
Is it possible that a.foo() gets inlined while ptr->foo() does not?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
Assume that there are no virtual functions
EDIT: I guess my question is partially theoretical. Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
That is not relevant. Inline function and "regular" function calls have the same effect on the heap.
The implementation, inline or not, is in the code segment anyway.
Is it possible that a.foo() gets inlined while ptr->foo() does not?
Highly unlikely. Both of these calls will probably be inlined if the implementation is visible to the compiler and the compiler decides that it would be beneficial.
I used "case 2" in my code numerous times and it was always inlined using g++.
Although it is mostly implementation-specific, there is no real limitation that restricts a call through a pointer compared to a call on a stack object (besides the virtual functions which you already mentioned).
You should note that the produced inlined code might still be different. Case 2 will have to first determine the actual address which will have an impact on the performance, but it should be pretty much the same from there.
if/why the compiler might not inline a function call that potentially manipulates data on heap.
The compiler is free to inline a function call or not (and might decide that after devirtualization). The inlining decision is at the compiler's discretion (so the inline keyword, like register, is often ignored when making optimization decisions). The compiler usually decides whether to inline separately for each particular call (that is, for every occurrence of the called function name).
Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
This is really easy. Often (among other criteria) inlining is decided according to the depth of previously inlined nested function calls, or according to the current size of the expanded internal representation. So it does happen that a particular occurrence of ptr->foo() would be inlined (e.g. because it occurs in a small function) but another occurrence of a.foo() won't be inlined.
Remember, inlining decisions are generally made at each call site. And on some compilers, the thresholds used by the compiler may vary or can be tuned.
But inlining does not always speed up execution time (because of CPU cache and branch predictor issues, and many other mysteries....), and that is yet another reason why sometimes a compiler won't inline a particular call.
For GCC compiler, read about inline functions and various optimization options (notice that -finline-limit=100 and -finline-limit=200 will give different inlining decisions; you could even play with different --params options; the MILEPOST GCC project used machine learning techniques to tune these....).
Perhaps some compilers can more easily do devirtualization for stack allocated data (I really don't know, and compilers are making progress on such issues). This is probably the reason why (perhaps!) heap vs stack allocation could influence inlining decisions.

Is there any difference in performance to declare a large variable inside a function as `static`?

Not sure if this has already been asked before. While answering this very simple question, I asked myself the following instead. Consider this:
void foo()
{
int i{};
const ReallyAnyType data[] = { item1, item2, item3,
/* many items that may be potentially heavy to recreate, e.g. of class type */ };
/* function code here... */
}
Now in theory, local variables are recreated every time control reaches the function, right? I.e. look at int i above - it's going to be recreated on the stack for sure. What about the array above? Can a compiler be smart enough to optimize its creation to occur only once, or do I need a static modifier here anyway? What if the array is not const? (OK, if it's not const, there is probably no sense in creating it only once, since re-initialization to the default state may be required between calls due to modifications being made during function execution.)
Might sound like a basic question, but for some reason I still ponder. Also, ignore the "why would you want to do this" - this is just a language question, not applied to a certain programming problem or design. I mean both C and C++ here. Should there be differences between the two regarding this question, please outline those.
There are two questions here, I think:
Can a compiler optimize a non-static const object to be effectively static so that it is only created once; and
Is it a reasonable expectation that a given compiler will do so.
I think the answer to the second question is "No", because I don't see the point of doing a huge amount of control flow analysis to save the programmer the trouble of typing the word static. However, I've often been surprised what optimizations people spend their time writing (as opposed to the optimizations which I think they should be working on :-) ). All the same, I would strongly recommend using the word static if that's what you wanted.
For the first question, there are circumstances under which the compiler could perform the optimization based on the "as-if" rule, but in very few cases would it work out.
First of all, if any object or subobject in the initializer has a non-trivial constructor/destructor, then the construction/destruction is visible, and this is not an example of copy elision. (This paragraph is C++ only, of course.)
The same would be true if any computation in the initializer list has visible side-effects.
And it should go without saying that if any subobject's value is not constant, the computation of that subobject would need to be done on each construction.
If the object and all subobjects are trivially copyable, all the initializer-list computations are constant, and the only construction cost is that of copying from a template into the object, then the compiler still couldn't perform the optimization if there is any chance that the addresses of more than one live instance of the object might be simultaneously visible. For example, if the function were recursive, and the object's address was used somewhere (hard to avoid for an array), then there would be the possibility that the addresses of two of these objects from different recursive invocations of the function might be compared. And they would have to compare unequal, since they are in fact separate objects. (And, now that I think of it, the function would not even need to be recursive in a multi-threaded environment.)
So the burden of proof for a compiler wishing to optimize that object into a single static instance is quite high. As I said, it may well be that a given compiler actually attempts to perform that task, but I definitely wouldn't expect it to.
The compiler would almost certainly do whatever is deemed most optimal, but most likely it will have it in read-only memory and turn your local variable into a pointer that points to the array in read-only memory. This assumes your array is equivalent to a POD type (or a class composed of POD types; if your class does something non-trivial and/or modifies other things, there is no way the compiler can fairly do this optimization).
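The address argument from the recursive case can be sketched as follows. Both function names are made up for the illustration. With static there is one array, so a nested call sees the same address; without static, the language rules say two live invocations have two distinct arrays, so the second function should return false when called the same way (I have deliberately not asserted that here).

```cpp
#include <cassert>

// One shared array: a nested invocation sees the same address.
bool static_array_shared(const int* outer, int depth)
{
    assert(depth >= 0);
    static const int data[] = {1, 2, 3};
    if (depth == 0)
        return outer == data;
    return static_array_shared(data, depth - 1);
}

// One array per invocation: the outer and inner arrays are both alive
// during the comparison, so they are separate objects.
bool automatic_array_shared(const int* outer, int depth)
{
    assert(depth >= 0);
    const int data[] = {1, 2, 3};
    if (depth == 0)
        return outer == data;
    return automatic_array_shared(data, depth - 1);
}
```

Calling either function as f(nullptr, 1) makes the outer invocation pass the address of its own array to the inner one, which then does the comparison.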

as-if rule and removal of allocation

The "as-if rule" gives the compiler the right to optimize out or reorder expressions that would not make a difference to the output and correctness of a program under certain rules, such as:
§1.9.5
A conforming implementation executing a well-formed program shall
produce the same observable behavior as one of the possible executions
of the corresponding instance of the abstract machine with the same
program and the same input.
The cppreference url I linked above specifically mentions special rules for the values of volatile objects, as well as for "new expressions", under C++14:
New-expression has another exception from the as-if rule: the compiler
may remove calls to the replaceable allocation functions even if a
user-defined replacement is provided and has observable side-effects.
I assume "replaceable" here is what is talked about for example in
§18.6.1.1.2
Replaceable: a C++ program may define a function with this function
signature that displaces the default version defined by the C++
standard library.
Is it correct that mem below can be removed or reordered under the as-if rule?
{
... some conformant code // upper block of code
auto mem = std::make_unique<std::array<double, 5000000>>();
... more conformant code, not using mem // lower block of code
}
Is there a way to ensure it's not removed, and stays between the upper and lower blocks of code? A well placed volatile (either/or volatile std::array or left of auto) comes to mind, but as there is no reading of mem, I think even that would not help under the as-if rule.
Side note; I've not been able to get visual studio 2015 to optimize out mem and the allocation at all.
Clarification: The way to observe this would be that the allocation call to the OS comes between any i/o from the two blocks. The point of this is for test cases and/or trying to get objects to be allocated at new locations.
Yes; No. Not within C++.
The abstract machine of C++ does not talk about system allocation calls at all. Only the side effects of such a call that impact the behavior of the abstract machine are fixed by C++, and even then the compiler is free to do something else, so long as it results in the same observable behavior on the part of the program in the abstract machine.
In the abstract machine, auto mem = std::make_unique<std::array<double, 5000000>>(); creates a variable mem. It, if used, gives you access to a large amount of doubles packed into an array. The abstract machine is free to throw an exception, or provide you with that large amount of doubles; either is fine.
Note that it is legal for a C++ compiler to replace all allocations through new with an unconditional throw of an allocation failure (or returning nullptr for the no-throw versions), but that would be a poor quality of implementation.
In the case where it is allocated, the C++ standard doesn't really say where it comes from. The compiler is free to use a static array, for example, and make the delete call a no-op (note it may have to prove it catches all ways to call delete on the buffer).
Next, if you have a static array, if nobody reads or writes to it (and the construction cannot be observed), the compiler is free to eliminate it.
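Whether a particular compiler really removes an unused allocation can be observed by replacing the global allocation functions with counting versions. The sketch below is only an illustration: g_new_calls, Big and calls_for_unused_allocation are made-up names, and the result is deliberately not stated, because under the C++14 rule quoted in the question both 0 and 1 are permitted, depending on compiler and optimization level.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

static std::size_t g_new_calls = 0; // hypothetical counter

// Replacement for the global allocation function, with a visible side effect.
void* operator new(std::size_t size)
{
    ++g_new_calls;
    if (void* p = std::malloc(size ? size : 1))
        return p;
    throw std::bad_alloc();
}

// Matching replacements for the deallocation functions.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

struct Big
{
    double values[1000];
};

// Returns how many times the replaced operator new was actually called
// for an allocation whose contents are never read or written.
std::size_t calls_for_unused_allocation()
{
    const std::size_t before = g_new_calls;
    Big* mem = new Big;
    delete mem;
    const std::size_t after = g_new_calls;
    assert(after >= before);
    return after - before;
}
```

Comparing the returned count between an unoptimized and an optimized build shows what your toolchain does with this particular pattern; it does not generalize to other code.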
That being said, much of the above relies on the compiler knowing what is going on.
So an approach is to make it impossible for the compiler to know. Have your code load a DLL, then pass a pointer to the unique_ptr to that DLL at the points where you want its state to be known.
Because the compiler cannot optimize over run-time DLL calls, the state of the variable has to basically be what you'd expect it to be.
Sadly, there is no standard way to dynamically load code like that in C++, so you'll have to rely upon your current system.
Said DLL can be separately written to be a noop; or, even, you can examine some external state, and conditionally load and pass the data to the DLL based on the external state. So long as the compiler cannot prove said external state will occur, it cannot optimize around the calls not being made. Then, never set that external state.
Declare the variable at the top of the block. Pass a pointer to it to the fake-external-DLL while uninitialized. Repeat just before initializing it, then after. Then finally, do it at the end of the block before destroying it, .reset() it, then do it again.

return value optimization vs auto_ptr for large vectors

If I use auto_ptr as a return value of a function that populates large vectors, this makes the function a source function (it will create an internal auto_ptr and pass over ownership when it returns a non-const auto_ptr). However, I cannot use this function with STL algorithms because, in order to access the data, I need to dereference the auto_ptr. A good example I guess would be a field of vectors of size N, with each vector having 100 components. Whether the function returns each 100-component vector by value or by ref makes a difference if N is large.
Also, when I try this very basic code:
class t
{
public:
t() { std::cout << "ctor" << std::endl; }
~t() { std::cout << "dtor" << std::endl; }
};
t valueFun()
{
return t();
}
std::auto_ptr<t> autoFun()
{
return std::auto_ptr<t>(new t());
}
both autoFun and valueFun calls result in the output
ctor
dtor
so I cannot actually see the automatic variable which is being created to be passed away to the return statement. Does this mean that the Return Value Optimization is set for the valueFun call? Does valueFun create two automatic objects at all in this case?
How do I then optimize a population of such a large data structure with a function?
There are many options for this, and dynamic allocation may not be the best.
Before we even delve in this discussion: is this a bottleneck ?
If you did not profile and ensure it was a bottleneck, then this discussion could be completely off... Remember that profiling debug builds is pretty much useless.
Now, in C++03 there are several options, from the most palatable to the least one:
trust the compiler: unnamed variables use RVO even in Debug builds in gcc, for example.
use an "out" parameter (pass by reference)
allocate on the heap and return a pointer (smart or not)
check the compiler output
Personally, I would trust my compiler on this unless a profiler proves I am wrong.
In C++11, move semantics help us getting more confident, because whenever there is a return statement, if RVO cannot kick in, then a move constructor (if available) can be used automatically; and move constructors on vector are dirt cheap.
So it becomes:
trust the compiler: either RVO or move semantics
allocate on the heap and return a unique_ptr
but really the second point should be used only for those few classes where move semantics do not help much: the cost of move semantics is usually proportional to the return of sizeof, for example a std::array<T,10> has a size equal to 10*sizeof(T) so it's not so good and might benefit from heap allocation + unique_ptr.
Tangent: you trust your compiler already. You trust it to warn you about errors, you trust it to warn you about dangerous/probably incorrect constructs, you trust it to correctly translate your code into machine assembly, you trust it to apply meaningful optimization to get a decent speed-up... Not trusting a compiler to apply RVO in obvious cases is like not trusting your heart surgeon with a $10 bill: it's the least of your worries. ;)
I am fairly sure that the compiler will do Return Value Optimization for valueFun. The main cases where return value optimization cannot be applied by the compiler are:
returning parameters
returning a different object based on a conditional
Thus the auto_ptr is not necessary, and would be even slower due to having to use the heap.
If you are still worried about the costs of moving around such a large vector, you might want to look into the move semantics of C++11 (std::vector aCopy(std::move(otherVector))). These are almost as fast as RVO and can be used anywhere (a move is also guaranteed to be used for return values when RVO cannot be applied).
I believe most modern compilers support move semantics (or rvalue references, technically) at this point.
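A way to see this for yourself is to count copy and move constructions when a large object is returned by value. This is a sketch assuming C++11 or later; Field, make_field, the counters and the helper function are all names invented for the illustration. Whether the move count ends up 0 (RVO applied) or higher depends on the compiler and flags, so only the copy count is checked.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

static int g_copies = 0; // hypothetical counters
static int g_moves = 0;

struct Field
{
    std::vector<double> data;

    explicit Field(std::size_t n) : data(n, 0.0) {}
    Field(const Field& other) : data(other.data) { ++g_copies; }
    Field(Field&& other) noexcept : data(std::move(other.data)) { ++g_moves; }
};

// Populates a large structure and returns it by value. The return is
// either elided (named RVO) or done with the move constructor.
Field make_field(std::size_t n)
{
    assert(n > 0);
    Field f(n);
    f.data[0] = 1.0;
    return f;
}

// Returns the number of copy constructions observed, or -1 if the
// returned object does not have the expected size.
int copies_when_returning_by_value()
{
    g_copies = 0;
    g_moves = 0;
    Field f = make_field(100);
    return f.data.size() == 100 ? g_copies : -1;
}
```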