Can the compiler inline methods that generate objects within a loop? [C++]

Just a question for my own curiosity. I have heard many times that it's best to return objects by value (the copy/destroy paradigm) when writing a method. So if you have a method like this:
OtherClass MyClass::getObject() {
    OtherClass returnedObject;
    return returnedObject;
}
supposedly the compiler will optimize this by essentially inlining the method and constructing the object directly on the stack of the method that calls getObject. I'm wondering how that would work in a loop such as this:
for (int i = 0; i < 10; i++) {
    list.push_back(myClass.getObject());
}
Would the compiler put 10 instances of OtherClass on the stack so it could inline this method and avoid the copy and destroy that would happen in unoptimized code? What about code like this:
while (!isDone) {
    list.push_back(myClass.getObject());
    // other logic which decides whether or not to set isDone
}
In this case the compiler couldn't possibly know how many times getObject will be called, so presumably it can't pre-allocate anything on the stack. My assumption is that no inlining is done and every time the method is called I will pay the full cost of copying OtherClass?
I realize that all compilers are different, and that this depends on whether the compiler believes this optimization is worthwhile. I'm speaking only in general terms: how are most compilers likely to respond? I'm curious how this sort of optimization is done.

for (int i = 0; i < 10; i++) {
    list.push_back(myClass.getObject());
}
would the compiler put 10 instances of OtherClass on the stack so it could inline this method and avoid the copy and destroy that would happen in unoptimized code?
It doesn't need to put 10 instances on the stack just to avoid the copy and destroy... if there's space for one object to be returned, with or without Return Value Optimisation, then it can reuse that space 10 times - each time copying from that same stack space into new heap-allocated memory during the list's push_back.
It would even be within the compiler's rights to allocate new memory and arrange for myClass.getObject() to construct the objects directly in that memory.
Further, if the optimiser chooses to unroll the loop, it could potentially call myClass.getObject() 10 times - even with some overlap or parallelism - if it can somehow convince itself that that produces the same overall result. In that situation, it would indeed need space for 10 return objects, and again it's up to the compiler whether that's on the stack or, through some miraculously clever optimisation, directly in the heap memory.
In practice, I would expect compilers to need to copy from the stack to the heap - I doubt very much that any mainstream compiler is clever enough to arrange direct construction in the heap memory. Loop unrolling and RVO are common optimisations, though. But even if both kick in, I'd expect each call to getObject to serially construct a result on the stack which is then copied to the heap.
If you want to "know", write some code to test for your own compiler. You can have the constructor write out the "this" pointer value.
What about code like this:
while (!isDone) {
    list.push_back(myClass.getObject());
    // other logic which decides whether or not to set isDone
}
}
The more complex and less idiomatic the code is, the less likely the compiler writers have been able and bothered to optimise for it. Here, you're not even showing us a complexity level we can speculate on. Try it for your compiler and optimisation settings and see....

It depends on which version of which compiler on which OS.
Why not get your compiler to output its assembly so you can take a look yourself?
gcc - http://www.delorie.com/djgpp/v2faq/faq8_20.html
Visual Studio - Viewing Assembly level code from Visual C++ project
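For example, assuming a source file named foo.cpp, commands along these lines should produce an assembly listing (exact flags may vary by compiler version):
g++ -O2 -S foo.cpp     (GCC: writes foo.s)
cl /O2 /FA foo.cpp     (Visual C++: writes foo.asm)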

In general, an optimizing compiler can make any changes whatsoever to your code so long as the resulting program's behavior isn't visibly changed. That includes inlining functions (or not), even if that function isn't marked inline by the programmer.
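For illustration (a sketch with made-up names): nothing below is marked inline, yet a typical compiler that can see the definition of getX will fold it into scale at -O2.
struct Point {
    int x;
    int getX() const { return x; }   // not marked inline
};

int scale(const Point& p) {
    return p.getX() * 2;             // usually compiled as if it were p.x * 2
}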

The only thing the compiler has to care about is the program's behavior. If the optimization keeps the program's logic and data intact, the optimization is legal. What goes in (all possible program input) has to come out (all possible program output) the same way as without optimization.
Whether this particular optimization is possible (it surely is; whether it is an actual improvement is a different thing!) depends on the target platform's instruction set and on whether it is feasible to implement.

Related

Is declaring a var outside the loop bad?

I wrote this basic code for a DSP/audio application I'm making:
double input = 0.0;
for (int i = 0; i < nChannels; i++) {
    input = inputs[i];
    // ... per-sample DSP processing ...
}
and a DSP engineering expert told me: "you should not declare it outside the loop, otherwise it creates a dependency and the compiler can't deal with it as efficiently as possible."
He's talking about the var input, I think. Why is this? Isn't it better to declare it once and overwrite it?
Maybe it's something to do with a different memory location being used? i.e. a register instead of the stack?
Good old K&R C compilers in the early eighties used to produce code as near as possible to what the programmer wrote, and programmers used to do their best to produce optimized source code. Modern optimizing compilers can rework things provided the resulting code has the same observable effects as the original code. So here, assuming the input variable is not used outside the loop, an optimizing compiler could optimize out the line double input = 0.0; because there are no observable effects until the next assignment: input = inputs[i];. And for the same reason it could just as well hoist the variable declaration outside the loop (whether in the source C++ file it is inside or not).
Short story: unless you want to produce code for one specific compiler with one specific set of parameters (in which case you should thoroughly examine the generated assembly code), you should never worry about those low-level optimizations. Some people say the compiler is smarter than you; others say the compiler will produce its own code whichever way you write yours.
What matters is just readability and variable scoping. Here input is functionally local to the loop, so it should be declared inside the loop. Full stop. Any other optimization consideration is just useless, unless you have special requirements for low-level optimization (profiling showing that these lines require special processing).
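In other words, the loop-local form below (a sketch based on the question's snippet) is the one to prefer for readability, and an optimizing compiler will generate equivalent code either way:
for (int i = 0; i < nChannels; i++) {
    double input = inputs[i];   // scoped to the one iteration that uses it
    // ... per-sample DSP processing ...
}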
Many people think that declaring a variable allocates some memory for you to use. It does not work like that. It does not allocate a register either.
It only creates for you a name (and an associated type) that you can use to link consumers of values with their producers.
On a 50-year-old compiler (or one written by students in their 3rd year Compiler Construction course), that may indeed be implemented by allocating some memory for the variable on the stack and using that every time the variable is referenced. It's simple, it works, and it's horribly inefficient. A good step up is putting local variables in registers when possible, but that uses registers inefficiently and it's not where compilers are at currently (and haven't been for some time).
Linking consumers with producers creates a data flow graph. In most modern compilers, it's the edges in that graph that receive registers. This is completely removed from any variables as you declared them. They no longer exist. You can see this in action if you use -emit-llvm in clang.
So variables aren't real, they're just labels. Use them as you want.
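For example, assuming a source file named foo.cpp, the following dumps the IR so you can check that the declared variable is gone:
clang++ -O1 -S -emit-llvm foo.cpp -o foo.ll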
It is better to declare the variable inside the loop, but the reason given is wrong.
There is a rule of thumb: declare variables in the smallest scope possible. Your code is more readable and less error prone this way.
As for the performance question, it doesn't matter at all for any modern compiler where exactly you declare your variables. For example, clang eliminates the variable entirely at -O1 in its own IR: https://godbolt.org/g/yjs4dA
One corner case, however: if you ever take the address of input, the variable can't be (easily) eliminated, and then you should declare it inside the loop if you care about performance.
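A sketch of that corner case (applyFilter and process are hypothetical names):
void applyFilter(double* sample);        // defined elsewhere, not visible to the optimizer

void process(const double* inputs, int nChannels) {
    for (int i = 0; i < nChannels; i++) {
        double input = inputs[i];
        applyFilter(&input);             // the address of input escapes, so the compiler
                                         // may have to keep a real stack slot for it
    }
}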

Understanding cost of multiple . and -> operator use?

Out of habit, when accessing values via . or ->, I assign them to variables any time the value is going to be used more than once. My understanding is that in scripting languages like ActionScript this is pretty important. However, in C/C++, I'm wondering if this is a meaningless chore; am I wasting effort that the compiler is going to handle for me, or am I exercising a good practice, and why?
struct Foo
{
public:
    Foo(int val) { m_intVal = val; }
    int GetInt() { return m_intVal; }
    int m_intVal; // not private for sake of last example
};

void Bar()
{
    Foo* foo = GetFooFromSomewhere();

    SomeFuncUsingIntValA(foo->GetInt()); // accessing via dereference then function
    SomeFuncUsingIntValB(foo->GetInt()); // accessing via dereference then function
    SomeFuncUsingIntValC(foo->GetInt()); // accessing via dereference then function

    // Is this better?
    int val = foo->GetInt();
    SomeFuncUsingIntValA(val);
    SomeFuncUsingIntValB(val);
    SomeFuncUsingIntValC(val);

    ///////////////////////////////////////////////
    // And likewise with . operator
    Foo fooDot(5);
    SomeFuncUsingIntValA(fooDot.GetInt()); // accessing via function
    SomeFuncUsingIntValB(fooDot.GetInt()); // accessing via function
    SomeFuncUsingIntValC(fooDot.GetInt()); // accessing via function

    // Is this better?
    int valDot = fooDot.GetInt();
    SomeFuncUsingIntValA(valDot);
    SomeFuncUsingIntValB(valDot);
    SomeFuncUsingIntValC(valDot);

    ///////////////////////////////////////////////
    // And lastly, a dot operator to a member, not a function
    SomeFuncUsingIntValA(fooDot.m_intVal); // accessing via member
    SomeFuncUsingIntValB(fooDot.m_intVal); // accessing via member
    SomeFuncUsingIntValC(fooDot.m_intVal); // accessing via member

    // Is this better?
    int valAsMember = fooDot.m_intVal;
    SomeFuncUsingIntValA(valAsMember);
    SomeFuncUsingIntValB(valAsMember);
    SomeFuncUsingIntValC(valAsMember);
}
OK, so I'll try to go for an answer here.
Short version: you definitely don't need to do this.
Long version: you might need to do this.
So here it goes: in interpreted programs like JavaScript these kinds of things might have a noticeable impact. In compiled programs like C++, not so much, to the point of not at all.
Most of the time you don't need to worry about these things, because an immense amount of resources has been poured into compiler optimization algorithms (and actual implementations), and the compiler will usually decide correctly what to do: allocate an extra register and save the result in order to reuse it, or recompute it every time and save that register space, etc.
There are instances where the compiler can't do this: when it can't prove that multiple calls produce the same result. Then it has no choice but to make all the calls.
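A sketch of such a case (all names here are made up): if the called function's body isn't visible, the compiler usually has to assume each call could return something different.
int expensiveLookup(const char* key);   // defined in another translation unit
void doA(int);
void doB(int);

void useTwice(const char* key) {
    doA(expensiveLookup(key));          // without seeing the body (or link-time
    doB(expensiveLookup(key));          // optimization), both calls must be emitted
}

void useOnce(const char* key) {
    const int v = expensiveLookup(key); // caching by hand removes the ambiguity
    doA(v);
    doB(v);
}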
Now let's assume that the compiler makes the wrong choice and you, as a precaution, make the effort of micro-optimizing. Say you make the optimization and squeeze out a 10% performance increase (which is already an overly optimistic figure for this kind of optimization) on that portion of code. But what do you know, your code spends only 1% of its time in that portion of code. The rest of the time is most likely spent in some hot loops and waiting for data fetches. So you spend a non-negligible amount of effort optimizing the code yourself only to get a 0.1% increase in total execution time, which won't even be observable because external factors vary the execution time by far more than that amount.
So don't spend time on micro-optimizations in C++.
However, there are cases where you might need to do this, and even crazier things. But only after properly profiling your code, and that is another discussion.
So worry about readability, don't worry about micro-optimizations.
The question is not really related to -> and . operators, but rather about repetitive expressions in general. Yes, it is true that most modern compilers are smart enough to optimize the code that evaluates the same expression repeatedly (assuming it has no observable side-effects).
However, using an explicit intermediate variable typically makes the program much more readable, since it explicitly exposes the fact that the same value is supposed to be used in all contexts. It exposes the fact that it was your intent to use the same value in all contexts.
If you repeat the same expression to generate that value again and again, this fact becomes much less obvious. Firstly, it is difficult to say at first sight whether the expressions are really identical (especially when they are long). Secondly, it is not obvious whether sequential evaluations of seemingly the same expression produce identical results.
Finally, slicing long expressions into smaller ones by using intermediate variables can significantly simplify debugging the code in a step-by-step debugger, since it gives the user a much greater degree of control through the "step in" and "step over" commands.
It's for sure better in terms of readability and maintainability to have such a temporary variable.
In terms of performance, you shouldn't worry about such micro-optimization at this stage (premature optimization). Moreover, modern C++ compilers can optimize it anyway, so you really shouldn't worry about it.

Compiler optimization problem

Most of the functions in <functional> use functors. If I write a struct like this:
struct Test
{
    bool operator()(int value) const
    {
        // Something (placeholder predicate logic)
        return value != 0;
    }
    // No member variables
};
Is there a perf hit? Would an object of Test be created? Or can the compiler optimize the object away?
GCC at least can optimize the object creation away and inline your functor, so you can expect performance on par with a hand-crafted loop. Of course you must compile with -O2.
Yes, the compiler can optimize the "object creation" (which is trivial in this case) away if it wants to. However, if you really care, you should compile your program and inspect the assembly code.
Even if the compiler was having a bad day and somehow couldn't figure out how to optimize this (it's very simple as optimizations go) - with no data members and no constructor the "performance hit" to "create an object" would be at most one instruction (plus maybe a couple more to copy the object, if the compiler also doesn't figure out how to inline the function call that uses the functor) to increment the stack pointer (since every object must have a unique address). "Creating objects" is cheap. What takes time is allocating memory, via new (because the OS has to be petitioned for the memory, and it has to search for a contiguous block that isn't being used by something else). Putting things on the stack is trivial.
There is no "use" of the structure, so as the code currently stands, it is still just a definition (and takes up no space).
If you create an object of type Test, it will take up non-zero space. If the compiler can deduce that nothing takes its address (or anything similar), it is free to optimize away the space usage.
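For illustration, a minimal sketch of how such a stateless functor is typically used (countNonZero and the predicate logic are made up); with optimization on, the temporary Test object and the call to its operator() normally disappear into the loop:
#include <algorithm>
#include <vector>

struct Test {
    bool operator()(int value) const { return value != 0; }
    // No member variables
};

int countNonZero(const std::vector<int>& v) {
    // The Test() temporary holds no data; a decent compiler inlines
    // operator(), so this compiles like a hand-written counting loop.
    return static_cast<int>(std::count_if(v.begin(), v.end(), Test()));
}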

Performance when accessing class members

I'm writing something performance-critical and wanted to know if it could make a difference if I use:
int test( int a, int b, int c )
{
    // Do millions of calculations with a, b, c
}
or
class myStorage
{
public:
    int a, b, c;
};

int test( myStorage values )
{
    // Do millions of calculations with values.a, values.b, values.c
}
Does this basically result in similar code? Is there an extra overhead of accessing the class members?
I'm sure that this is clear to an expert in C++, so I won't try to write an unrealistic benchmark for it right now.
The compiler will probably equalize them. If it has any brains at all, it will copy values.a, values.b, and values.c into local variables or registers, which is also what happens in the simple case.
The relevant maxims:
Premature optimization is the root of much evil.
Write it so you can read it at 1am six months from now and still understand what you were trying to do.
Most of the time significant optimization comes from restructuring your algorithm, not small changes in how variables are accessed. Yes, I know there are exceptions, but this probably isn't one of them.
This sounds like premature optimization.
That being said, there are some differences and opportunities but they will affect multiple calls to the function rather than performance in the function.
First of all, in the second option you may want to pass myStorage as a constant reference.
As a result of that, your compiled code will likely push a single value onto the stack (to allow you to access the container), rather than pushing three separate values. If you have additional fields (in addition to a-c), passing myStorage not as a reference might actually cost you more, because you will be invoking a copy constructor and essentially copying all the additional fields. All of this is a cost per call, not within the function.
If you are doing tons of calculations with a b and c within the function, then it really doesn't matter how you transfer or access them. If you passed by reference, the initial cost might be slightly more (since your object, if passed by reference, could be on the heap rather than the stack), but once accessed for the first time, caching and registers on your machine will probably mean low-cost access. If you have passed your object by value, then it really doesn't matter, since even initially, the values will be nearby on the stack.
For the code you provided, if these are the only fields, there will likely not be a difference. The "values.variable" is merely interpreted as an offset in the stack, not as "look up one object, then access another address".
Of course, if you don't buy these arguments, just define local variables as the first step in your function, copy the values from the object, and then use those variables. If you really use them multiple times, the initial cost of this copy won't matter :)
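For completeness, a sketch of the const-reference version suggested earlier in this answer (the calculation body is just a placeholder):
int test(const myStorage& values)
{
    // Only a single reference is passed per call; the compiler is free to
    // load values.a, values.b and values.c into registers once and reuse them.
    return values.a + values.b + values.c;   // placeholder for the real calculations
}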
No, your CPU would cache the variables you use over and over again.
I think there is some overhead, but it may not be much: the memory address of the object is stored on the stack, pointing to the object in heap memory, and then you access the instance variable through it.
If you store the int variable on the stack, it would be faster, because the value is already on the stack and the machine just goes to the stack to get it out and calculate with it :).
It also depends on whether you store the class's instance variable values on the stack or not. If inside test() you do something like:
int a = objA.a;
int b = objA.b;
int c = objA.c;
I think it would give almost the same performance.
If you're really writing performance-critical code and you think one version should be faster than the other, write both versions and test the timing (with the code compiled with the right optimization switch). You may even want to look at the generated assembly code. A lot of quite subtle things can affect the speed of a code snippet, like register spilling, etc.
You can also start your function with
int & a = values.a;
int & b = values.b;
although the compiler should be smart enough to do that for you behind the scenes. In general I prefer to pass around structures or classes; this often makes it clearer what the function is meant to do, plus you don't have to change the signatures every time you want to take another parameter into account.
As with your previous, similar question: it depends on the compiler and platform. If there is any difference at all, it will be very small.
Both values on the stack and values in an object are commonly accessed using a pointer (the stack pointer, or the this pointer) and some offset (the location in the function's stack frame, or the location inside the class).
Here are some cases where it might make a difference:
Depending on your platform, the stack pointer might be held in a CPU register, whereas the this pointer might not. If this is the case, accessing this (which is presumably on the stack) would require an extra memory lookup.
Memory locality might be different. If the object in memory is larger than one cache line, the fields are spread out over multiple cache lines. Bringing only the relevant values together in a stack frame might improve cache efficiency.
Do note, however, how often I used the word "might" here. The only way to be sure is to measure it.
If you can't profile the program, print out the assembly language for the code fragments.
In general, less assembly code means fewer instructions to execute, which speeds up performance. This is a technique for getting a rough estimate of performance when a profiler is not available.
An assembly language listing will allow you to see differences, if any, between implementations.

Overhead of calling tiny functions from a tight inner loop? [C++]

Say you see a loop like this one:
for (int i = 0;
     i < thing.getParent().getObjectModel().getElements(SOME_TYPE).count();
     ++i)
{
    thing.getData().insert(
        thing.getData().count(),
        thing.getParent().getObjectModel().getElements(SOME_TYPE)[i].getName()
    );
}
If this were Java I'd probably not think twice. But in performance-critical sections of C++, it makes me want to tinker with it... however, I don't know if the compiler is smart enough to make that futile.
This is a made up example but all it's doing is inserting strings into a container. Please don't assume any of these are STL types, think in general terms about the following:
Is having a messy condition in the for loop going to get evaluated each time, or only once?
If those get methods are simply returning references to member variables on the objects, will they be inlined away?
Would you expect custom [] operators to get optimized at all?
In other words is it worth the time (in performance only, not readability) to convert it to something like:
ElementContainer &source =
    thing.getParent().getObjectModel().getElements(SOME_TYPE);
int num = source.count();
Store &destination = thing.getData();
for (int i = 0; i < num; ++i)
{
    destination.insert(destination.count(), source[i].getName());
}
Remember, this is a tight loop, called millions of times a second. What I wonder is if all this will shave a couple of cycles per loop or something more substantial?
Yes I know the quote about "premature optimisation". And I know that profiling is important. But this is a more general question about modern compilers, Visual Studio in particular.
The general way to answer such questions is to look at the produced assembly. With gcc, this involves replacing the -c flag with -S.
My own rule is not to fight the compiler. If something is to be inlined, then I make sure that the compiler has all the information needed to perform such an inline, and (possibly) I try to urge it to do so with an explicit inline keyword.
Also, inlining saves a few opcodes but makes the code grow, which, as far as L1 cache is concerned, can be very bad for performance.
All the questions you are asking are compiler-specific, so the only sensible answer is "it depends". If it is important to you, you should (as always) look at the code the compiler is emitting and do some timing experiments. Make sure your code is compiled with all optimisations turned on - this can make a big difference for things like operator[](), which is often implemented as an inline function, but which won't be inlined (in GCC at least) unless you turn on optimisation.
If the loop is that critical, I can only suggest that you look at the generated code. If the compiler is allowed to aggressively optimise the calls away then perhaps it will not be an issue. Sorry to say this, but modern compilers can optimise incredibly well and I really would suggest profiling to find the best solution in your particular case.
If the methods are small and can and will be inlined, then the compiler may do the same optimizations that you have done. So, look at the generated code and compare.
Edit: It is also important to mark const methods as const, e.g. in your example count() and getName() should be const to let the compiler know that these methods do not alter the contents of the given object.
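For example, assuming hypothetical declarations for the made-up types in that loop, the const-qualified signatures would look roughly like this:
#include <string>

class Element {
public:
    std::string getName() const;            // promises not to modify the Element
};

class ElementContainer {
public:
    int count() const;                      // promises not to modify the container
    const Element& operator[](int i) const;
};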
As a rule, you should not have all that garbage in your "for condition" unless the result is going to be changing during your loop execution.
Use another variable set outside the loop. This will eliminate the WTF when reading the code, it will not negatively impact performance, and it will sidestep the question of how well the functions get optimized. If those calls are not optimized this will also result in performance increase.
I think in this case you are asking the compiler to do more than it legitimately can given the scope of compile-time information it has access to. So, in particular cases the messy condition may be optimized away, but really, the compiler has no particularly good way to know what kind of side effects you might have from that long chain of function calls. I would assume that breaking out the test would be faster unless I have benchmarking (or disassembly) that shows otherwise.
This is one of the cases where a JIT compiler has a big advantage over a C++ compiler. It can, in principle, optimize for the most common case seen at runtime and generate machine code specialized for that (plus checks to make sure that one falls into that case). This sort of thing is used all the time for polymorphic method calls that turn out not to actually be used polymorphically; whether it could catch something as complex as your example, though, I'm not certain.
For what it's worth, if speed really mattered, I'd split it up in Java too.