Most of the functions in <functional> use functors. If I write a struct like this:
struct Test
{
    bool operator()() const
    {
        // Something
        return true;
    }
    // No member variables
};
Is there a perf hit? Would an object of Test be created? Or can the compiler optimize the object away?
GCC, at least, can optimize away the object creation and inline your functor, so you can expect performance comparable to a hand-crafted loop. Of course, you must compile with -O2.
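For illustration, here is a minimal sketch of the situation (the int parameter for operator() and the use of std::count_if are my own additions, not from the question):

#include <algorithm>
#include <vector>

struct Test
{
    // No member variables
    bool operator()(int x) const { return x > 0; }
};

int countPositives(const std::vector<int>& v)
{
    // std::count_if receives a copy of the functor, but the copy is free:
    // the object is empty, and with -O2 the call to operator() is inlined.
    return static_cast<int>(std::count_if(v.begin(), v.end(), Test{}));
}

Compiled with g++ -O2, the loop this produces is typically indistinguishable from a hand-written one.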
Yes, the compiler can optimize the "object creation" (which is trivial in this case) away if it wants to. However, if you really care, you should compile your program and inspect the assembly code.
Even if the compiler were having a bad day and somehow couldn't figure out how to optimize this (it's very simple as optimizations go), with no data members and no constructor the "performance hit" of "creating an object" would be at most one instruction to adjust the stack pointer (since every object must have a unique address), plus maybe a couple more to copy the object if the compiler also fails to inline the function call that uses the functor. "Creating objects" is cheap. What takes time is allocating memory via new, because the OS has to be petitioned for the memory and a contiguous block that isn't being used by something else has to be found. Putting things on the stack is trivial.
There is no "use" of the structure, so as the code currently stands, it is still just a definition (and takes up no space).
If you create an object of type Test, it will take up non-zero space. If the compiler can deduce that nothing takes its address (or anything similar), it is free to optimize away the space usage.
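As a quick illustration of the non-zero-size rule (the static_assert is my own addition):

struct Test
{
    bool operator()() const { return true; }
};

// Every complete object has a unique address, hence non-zero size...
static_assert(sizeof(Test) >= 1, "empty classes still occupy storage");
// ...but when nothing observes that address, the compiler may still
// optimize the storage away entirely.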
In C++, what is the preferred/recommended way to create an object in a function/method and return it to be used outside the creating function's scope?
In most functional languages, option 3 (and sometimes even option 1) would be preferred, but what is the best way to handle this in C++?
Option 1 (return unique_ptr)
pros: function is pure and does not change input params
cons: is this an unnecessarily complicated solution?
std::unique_ptr<SomeClass> createSometing() {
    auto s = std::make_unique<SomeClass>();
    return s;
}
Option 2 (pass result as a reference parameter)
pros: simple and does not involve pointers
cons: an input parameter is changed (this makes the function less pure and more unpredictable: the result reference parameter could be changed anywhere within the function, which can get hard and messy to track in larger functions).
void createSometing(SomeClass& result) {
    SomeClass s;
    result = s;
}
Option 3 (return by value - involves copying)
pros: simple and clear
cons: involves copying an object, which could be expensive. But is this OK?
SomeClass createSometing() {
    SomeClass s;
    return s;
}
In modern C++, the rule is that the compiler is smarter than the programmer. Said differently, the programmer is expected to write code that is easy to read and maintain, and unless profiling has proven that there is an unacceptable bottleneck, low-level concerns should be left to the optimizing compiler.
For that reason, and unless profiling proves that another way is required, I would first try option 3 and return a plain object. If the object is movable, moving it is generally not too expensive. Furthermore, most compilers are able to fully elide the copy/move operation when they can. If I remember correctly, copy elision is even mandatory starting with C++17 for statements like this:
T foo = functionReturningT();
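As a sketch of that guarantee (the SomeClass body here is invented for illustration): with copy and move constructors deleted, the following still compiles under -std=c++17, because the prvalue returned by the factory is constructed directly in its destination. Note that the guarantee applies to returning a prvalue; NRVO for a named local ("SomeClass s; return s;") remains optional.

struct SomeClass {
    int value = 0;
    SomeClass() = default;
    SomeClass(const SomeClass&) = delete;
    SomeClass(SomeClass&&) = delete;
};

SomeClass createSometing() {
    return SomeClass{};              // constructed in place at the call site
}

int main() {
    SomeClass t = createSometing();  // no copy, no move (guaranteed in C++17)
    return t.value;
}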
This is a loaded question, because the matter involves a decision to create the object on the heap or not. In C++, it's ideal to have objects that can be passed around as values cheaply; std::string is a good example of that, and it's generally a premature pessimization to allocate a std::string on the heap. On the other hand, the object you are creating may be large and expensive to copy. In that case, putting it on the heap would be preferable. But that assumes a copy would have to take place at all: by default, the copy is elided! But also: figure out whether the type could be made cheaper to copy.
So there’s no “one way suits all”. In my experience, legacy code tends to overuse the heap.
In most cases, returning by value is preferable, since all mainstream compilers will have the function instantiate the object in the storage where it'll reside, without moves or copies.
Then, the object can be copy-constructed on the heap by the user of the function, if they so desire, and the compiler will get rid of that copy as well.
Micromanagement of this stuff, without looking at actual generated code, is typically a waste of time, since the code declares intent and not the implementation. Compilers these days literally produce code that has equivalent meaning, taking the C++ source’s semantics, but not necessarily using the source to dictate identical implementation at the machine level.
Thus, in most instances, returning by value is the sensible default, unless the type is borked and doesn't support that. Unfortunately, some widely used types are in this camp, e.g. Qt's QObject.
TL;DR: Given MyType myFactoryFunction();, the statement auto obj = std::make_unique<MyType>(myFactoryFunction()); will neither copy nor move on modern compilers in a release build, if the type is designed well.
There isn't a single right answer and it depends on the situation and personal preference to some extent. Here are pros and cons of different approaches.
Just declare it
SomeClass foo(arg1, arg2);
Factory functions should be relatively uncommon and only needed if the code creating the object doesn't have all the necessary information to create it (or shouldn't, due to encapsulation reasons). Perhaps it's more common in other languages to have factory functions for everything, but instantiating objects directly should be the first pick.
Return by value
SomeClass createSomeClass();
The first question is whether you want the resulting object to live on the stack or the heap. The default for small objects is the stack, since it's more efficient as you skip the call to malloc(). With Return Value Optimization usually there's no copy.
Return by pointer
std::unique_ptr<SomeClass> createSomeClass();
or
SomeClass* createSomeClass();
Reasons you might pick this include: the object is large and you want it heap-allocated; the object is created out of some data store and the caller won't own the memory; you want a nullable return type to signal errors.
Out parameter
bool createSomeClass(SomeClass&);
The main benefit of out parameters is when you have multiple return values. For example, you might want to return true/false for whether the object creation succeeded (e.g. if, like an integer, your object doesn't have a valid "unset" state). You might also have a factory function that returns multiple things, e.g.
void createUserAndToken(User& user, Token& token);
In summary, I'd say: by default, go with return by value. Do you need to signal failure? Use an out parameter or a pointer (a sketch of the failure-signaling variant follows). Is it a large object that lives on the heap, or some other data structure you're giving out a handle to? Return by pointer. And if you don't strictly need a factory function, just declare the object directly.
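Here is a hedged sketch of that failure-signaling pattern (the Config type and createConfig are invented for illustration):

#include <string>

struct Config {
    std::string name;
};

bool createConfig(const std::string& input, Config& out)
{
    if (input.empty())
        return false;    // creation failed; `out` is left untouched
    out.name = input;
    return true;         // success; the caller may now use `out`
}

Usage would look like: Config cfg; if (createConfig(userInput, cfg)) { /* use cfg */ }.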
Consider the following code:
class cFoo {
private:
    int m1;
    char m2;
public:
    int doSomething1();
    int doSomething2();
    int doSomething3();
};
class cBar {
private:
    cFoo mFoo;
public:
    cFoo getFoo(){ return mFoo; }
};
void some_function_in_the_callstack_hierarchy(cBar aBar) {
    int test1 = aBar.getFoo().doSomething1();
    int test2 = aBar.getFoo().doSomething2();
    ...
}
In the line where getFoo() is called, the compiler will generate a temporary cFoo object in order to call doSomething1().
Does the compiler reuse the stack memory which is used for these temporary objects?
How much stack memory will the call of "some_function_in_the_callstack_hierarchy" reserve? Does it reserve memory for every generated temporary?
My guess was that the compiler only reserves memory for one object of cFoo and reuses that memory for the different calls, but if I add
int test3 = aBar.getFoo().doSomething3();
I can see that the needed stack size for "some_function_in_the_callstack_hierarchy" is much larger, and not only because of the additional local int variable.
On the other hand, if I then replace
cFoo getFoo(){ return mFoo; }
with a reference (only for testing purposes, because returning a reference to a private member is not good):
const cFoo& getFoo(){ return mFoo; }
it needs far less stack memory than the size of one cFoo.
So it seems to me that the compiler reserves extra stack memory for every generated temporary object in the function. But this would be very inefficient.
Can someone explain this?
The optimizing compiler is transforming your source code into some internal representation, and normalizing it.
With free-software compilers (like GCC and Clang/LLVM), you are able to look into that internal representation (at the very least by patching the compiler code or running it under a debugger).
BTW, sometimes temporary values do not even need any stack space, e.g. because they have been optimized away, or because they can sit in registers. And quite often they reuse some unneeded slot in the current call frame. Also (particularly in C++), a lot of (small) functions are inlined, like your getFoo probably is, so they have no call frame of their own. Recent GCC versions are even sometimes capable of tail-call optimization (essentially, reusing the caller's call frame).
If you compile with GCC (i.e. g++), I would suggest playing with optimization options and developer options (and some others). Perhaps also use -Wstack-usage=48 (or some other value, in bytes per call frame) and/or -fstack-usage.
First, if you can read assembler code, compile yourcode.cc with g++ -S -fverbose-asm -O yourcode.cc and look into the emitted yourcode.s
(don't forget to play with optimization flags, so replace -O with -O2 or -O3 ....)
Then, if you are more curious about how the compiler is optimizing, try g++ -O -fdump-tree-all -c yourcode.cc and you'll get a lot of so called "dump files" which contain a partial textual rendering of internal representations relevant to GCC.
If you are even more curious, look into my GCC MELT and notably its documentation page (which contains a lot of slides & references).
So for me it seems that the compiler reserves extra stack memory for every generated temporary object in the function.
Certainly not in the general case (and of course assuming you enable some optimizations). And even if some space is reserved, it will be reused very quickly.
BTW, notice that the C++11 standard does not even mention a stack. One could imagine a C++ program compiled without using any stack (e.g. a whole-program optimization detecting a program without recursion, whose stack space and layout could then be optimized away entirely). I don't know of any such compiler, but I do know that compilers can be quite clever...
Attempting to analyse how a compiler is going to treat a particular piece of code is getting progressively more difficult as optimisation strategies get more aggressive.
All a compiler has to do is implement the C++ standard and compile the code without introducing or removing any observable side-effects (with some exceptions, such as return value and named return value optimisation).
You can see from your code that, since cFoo is not a polymorphic type and has no member data, a compiler could optimise out the creation of an object altogether and call what are essentially therefore static functions directly. I'd imagine that even at the time of my writing, some compilers are already doing that. You could always check the output assembly to be sure.
Edit: The OP has now introduced class members. But since these are never initialised and are private, the compiler can remove them without thinking too hard about that. This answer therefore still applies.
The lifetime of a temporary object extends to the end of the full expression containing it; see paragraph 12.2, "Temporary objects", of the Standard.
It is very unlikely that a compiler, even at the lowest optimisation settings, will fail to reuse the space after the end of a temporary object's lifetime.
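To see the lifetimes directly, one can give cFoo a noisy destructor (a sketch; the printing and the simplified class bodies are my own additions, not the OP's code):

#include <cstdio>

struct cFoo {
    int m1 = 0;
    cFoo() = default;
    cFoo(const cFoo&) = default;
    ~cFoo() { std::printf("cFoo at %p destroyed\n", (void*)this); }
    int doSomething1() const { return m1 + 1; }
};

struct cBar {
    cFoo mFoo;
    cFoo getFoo() { return mFoo; }
};

int main() {
    cBar b;
    int test1 = b.getFoo().doSomething1();  // temporary destroyed here
    int test2 = b.getFoo().doSomething1();  // its slot may be reused here
    return test1 + test2;
}

Each temporary is destroyed at the end of its own statement, so the second one is free to occupy the same stack slot as the first.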
I am currently developing a simple clone detector for C, written in C++ and am asking myself endless questions about efficiency and how to optimise C++ code.
One question I have is regarding how to efficiently pass structs. If given a struct similar to something below:
typedef struct {
    unsigned int a;
    void *b;
} my_struct;
and a function which performs numerous operations (not assignments) on its my_struct parameter. This function is called for each node in an AST traversal (so quite a lot), and based on some preliminary reading, what I understand is that passing an instance of a struct (rather than a pointer) causes a copy of it to be made for the called function.
Therefore, is it more efficient to pass the struct as a pointer and then dereference?
void foo(my_struct *s) {
    // ... then dereference: s->a ...
}
Basically: speed of copy vs. speed of dereference is my question.
I assume, due to memory consumption, that it would be smarter to pass the struct as a pointer, but I have no idea of any side-effects regarding speed.
It depends.
In your case, your struct isn't much bigger than a pointer; it's likely that passing the entire struct as a function argument won't be much slower than passing a pointer.
Inside your function, accessing members of the struct via a pointer might be slower than accessing members of a local struct object. If so, then passing the struct directly might give you faster code overall if you're doing a lot inside the function. But that depends on the capabilities of the CPU and the generated code; it may be that member access has the same speed whether it's through a pointer or not.
The only way to answer the question is to measure the performance of your own code. Any answers you get will apply only to your own current situation, and could change on other targets systems or with a different version of the compiler.
Be sure you're telling your compiler to optimize your code. If you don't do that, there's not much point in measuring performance.
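If you do measure, a minimal harness might look like this (my own sketch, not from the question; compile with optimization, e.g. g++ -O2). Be aware that a good optimizer may inline both variants into identical code, so inspect the assembly before trusting the numbers:

#include <chrono>
#include <cstdio>

typedef struct {
    unsigned int a;
    void *b;
} my_struct;

static unsigned int by_value(my_struct s)          { return s.a * 2u + 1u; }
static unsigned int by_pointer(const my_struct *s) { return s->a * 2u + 1u; }

volatile unsigned int sink;  // volatile stores keep the loops from being removed

int main() {
    my_struct s = { 42u, nullptr };
    const int iterations = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) sink = by_value(s);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) sink = by_pointer(&s);
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("by value:   %lld ms\n", (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("by pointer: %lld ms\n", (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}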
Reading from the stack (the copy) will be more reliably fast, since the stack is less likely to be paged out of cache; however, you also incur the cost of doing the copy, and you potentially fill up the stack more.
My rule of thumb is to pass "simple" data by value and "complex" data by reference. Generally, if my data is more than 8-16 bytes, I start considering whether it is worth passing a reference.
The other thing to consider here is whether it is even worth optimizing. I try to avoid doing things in unexpected ways (different from the rest of the code) unless I can prove that there is or will be a problem, as breaking a pattern makes the code harder to support.
Just a question for my own curiosity. I have heard many times that it's best to use the copy/destroy paradigm when writing a method. So if you have a method like this:
OtherClass MyClass::getObject(){
    OtherClass returnedObject;
    return returnedObject;
}
supposedly the compiler will optimize this by essentially inlining the method and constructing the object on the stack of the method that calls getObject. I'm wondering how that would work in a loop such as this:
for(int i=0; i<10; i++){
    list.push_back(myClass.getObject());
}
would the compiler put 10 instances of OtherClass on the stack so it could inline this method and avoid the copy and destroy that would happen in unoptimized code? What about code like this:
while(!isDone){
    list.push_back(myClass.getObject());
    // other logic which decides whether or not to set isDone
}
In this case the compiler couldn't possibly know how many times getObject will be called, so presumably it can't pre-allocate anything on the stack; my assumption is that no inlining is done and every time the method is called I will pay the full cost of copying OtherClass?
I realize that all compilers are different, and that this depends on whether the compiler believes this code is optimal. I'm speaking only in general terms: how will most compilers most likely respond? I'm curious how this sort of optimization is done.
for(int i=0; i<10; i++){
    list.push_back(myClass.getObject());
}
would the compiler put 10 instances of OtherClass on the stack so it could inline this method and avoid the copy and destroy that would happen in unoptimized code?
It doesn't need to put 10 instances on the stack just to avoid the copy and destroy... if there's space for one object to be returned, with or without Return Value Optimisation, then it can reuse that space 10 times, each time copying from that same stack space into newly heap-allocated memory during the list push_back.
It would even be within the compiler's rights to allocate the new memory up front and arrange for myClass.getObject() to construct the objects directly in that memory.
Further, if the optimiser chooses to unroll the loop, it could potentially call myClass.getObject() 10 times, even with some overlap or parallelism, IF it can somehow convince itself that this produces the same overall result. In that situation, it would indeed need space for 10 return objects, and again it's up to the compiler whether that's on the stack or, through some miraculously clever optimisation, directly in the heap memory.
In practice, I would expect compilers to need to copy from stack to heap; I doubt very much that any mainstream compiler is clever enough to arrange direct construction in the heap memory. Loop unrolling and RVO are common optimisations, though. But even if both kick in, I'd expect each call to getObject to serially construct a result on the stack which is then copied to the heap.
If you want to "know", write some code to test it with your own compiler. You can have the constructor write out the "this" pointer value.
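A sketch of that experiment (the printing and the small driver are my own scaffolding; class names follow the question):

#include <cstdio>
#include <vector>

struct OtherClass {
    OtherClass() { std::printf("default-constructed at %p\n", (void*)this); }
    OtherClass(const OtherClass&) { std::printf("copy-constructed at %p\n", (void*)this); }
};

struct MyClass {
    OtherClass getObject() {
        OtherClass returnedObject;   // NRVO candidate
        return returnedObject;
    }
};

int main() {
    MyClass myClass;
    std::vector<OtherClass> list;
    list.reserve(3);                 // avoid reallocation noise in the output
    for (int i = 0; i < 3; ++i)
        list.push_back(myClass.getObject());
}

With optimization enabled, the typical output shows one default construction per iteration (often at the same stack address every time, i.e. the return slot being reused) followed by one copy into the vector's heap storage.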
What about code like this:
while(!isDone){
    list.push_back(myClass.getObject());
    // other logic which decides whether or not to set isDone
}
The more complex and less idiomatic the code is, the less likely it is that compiler writers have been able (and bothered) to optimise for it. Here, you're not even showing us a complexity level we can speculate on. Try it with your compiler and optimisation settings and see...
It depends on which version of which compiler on which OS.
Why not get your compiler to output its assembly so you can take a look yourself?
gcc - http://www.delorie.com/djgpp/v2faq/faq8_20.html
visual studio - Viewing Assembly level code from Visual C++ project
In general, an optimizing compiler can make any changes whatsoever to your code so long as the resulting program's observable behavior isn't changed. That includes inlining functions (or not), even if a function isn't marked inline by the programmer.
The only thing the compiler has to care about is the program's behavior. If the optimization keeps the program logic and data intact, the optimization is legal: what goes in (all possible program input) has to come out (all possible program output) the same way as without optimization.
Whether this particular optimization is possible (it surely is; whether it is an actual optimization is a different thing!) depends on the target platform's instruction set and whether it is feasible to implement.
I'm writing something performance-critical and wanted to know if it could make a difference if I use:
int test( int a, int b, int c )
{
    // Do millions of calculations with a, b, c
    return a + b + c; // placeholder result
}
or
class myStorage
{
public:
int a, b, c;
};
int test( myStorage values )
{
    // Do millions of calculations with values.a, values.b, values.c
    return values.a + values.b + values.c; // placeholder result
}
Does this basically result in similar code? Is there an extra overhead of accessing the class members?
I'm sure that this is clear to an expert in C++, so I won't try to write an unrealistic benchmark for it right now.
The compiler will probably treat them the same. If it has any brains at all, it will copy values.a, values.b, and values.c into local variables or registers, which is also what happens in the simple case.
The relevant maxims:
Premature optimization is the root of much evil.
Write it so you can read it at 1am six months from now and still understand what you were trying to do.
Most of the time significant optimization comes from restructuring your algorithm, not small changes in how variables are accessed. Yes, I know there are exceptions, but this probably isn't one of them.
This sounds like premature optimization.
That being said, there are some differences and opportunities but they will affect multiple calls to the function rather than performance in the function.
First of all, in the second option you may want to pass myStorage as a constant reference.
As a result, your compiled code will likely push a single value onto the stack (the address of the object, to allow you to access the container), rather than pushing three separate values. If you have additional fields (in addition to a-c), passing myStorage by value might actually cost you more, because you will invoke the copy constructor and copy all the additional fields as well. All of this is a per-call cost, not a cost within the function.
If you are doing tons of calculations with a, b and c within the function, then it really doesn't matter how you transfer or access them. If you pass by reference, the initial cost might be slightly higher (since your object, if passed by reference, could be on the heap rather than the stack), but once accessed for the first time, caching and the registers on your machine will probably make access cheap. If you pass your object by value, then it really doesn't matter, since even initially the values will be nearby on the stack.
For the code you provided, if these are the only fields, there will likely be no difference: "values.variable" is merely interpreted as an offset into the stack, not as "look up one object, then access another address".
Of course, if you don't buy these arguments, just define local variables as the first step in your function, copy the values from the object, and then use those variables. If you really use them multiple times, the initial cost of this copy won't matter :)
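A sketch combining the two suggestions above, i.e. a constant-reference parameter plus local copies of the fields (the loop body is a stand-in for the real calculations):

class myStorage
{
public:
    int a, b, c;
};

int test(const myStorage& values)   // a single pointer-sized argument
{
    int a = values.a;               // local copies: likely kept in registers
    int b = values.b;
    int c = values.c;

    int result = 0;
    for (int i = 0; i < 1000000; ++i)
        result += a * b - c;        // stand-in for the real calculations
    return result;
}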
No, your CPU will cache the variables you use over and over again.
I think there is some overhead, but it may not be much. If the object lives on the heap, its address is stored on the stack and points at the heap object, and you then access the instance variables through it.
If you store the variable as an int on the stack, it will be faster, because the value is already on the stack and the machine just goes to the stack to get it out for the calculation :).
It also depends on whether you store the class's instance variable values on the stack or not. If, inside test(), you do:
int a = objA.a;
int b = objA.b;
int c = objA.c;
I think the performance would be almost the same.
If you're really writing performance-critical code and you think one version should be faster than the other, write both versions and measure the timing (with the code compiled with the right optimization switch). You may even want to look at the generated assembly code. A lot of quite subtle things can affect the speed of a code snippet, like register spilling, etc.
you can also start your function with
int & a = values.a;
int & b = values.b;
although the compiler should be smart enough to do that for you behind the scenes. In general, I prefer to pass around structures or classes; this often makes it clearer what the function is meant to do, plus you don't have to change the signature every time you want to take another parameter into account.
As with your previous, similar question: it depends on the compiler and platform. If there is any difference at all, it will be very small.
Both values on the stack and values in an object are commonly accessed using a pointer (the stack pointer, or the this pointer) and some offset (the location in the function's stack frame, or the location inside the class).
Here are some cases where it might make a difference:
Depending on your platform, the stack pointer might be held in a CPU register, whereas the this pointer might not. If this is the case, accessing this (which is presumably on the stack) would require an extra memory lookup.
Memory locality might be different. If the object in memory is larger than one cache line, the fields are spread out over multiple cache lines. Bringing only the relevant values together in a stack frame might improve cache efficiency.
Do note, however, how often I used the word "might" here. The only way to be sure is to measure it.
If you can't profile the program, print out the assembly language for the code fragments.
In general, less assembly code means fewer instructions to execute, which speeds up performance. This is a technique for getting a rough estimate of performance when a profiler is not available.
An assembly language listing will allow you to see differences, if any, between implementations.