Coding Practices which enable the compiler/optimizer to make a faster program - c++

Many years ago, C compilers were not particularly smart. As a workaround K&R invented the register keyword, to hint to the compiler, that maybe it would be a good idea to keep this variable in an internal register. They also made the tertiary operator to help generate better code.
As time passed, the compilers matured. They became very smart in that their flow analysis allowing them to make better decisions about what values to hold in registers than you could possibly do. The register keyword became unimportant.
FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.
What coding practices are available that may enable the compiler/optimizer to generate faster code?
Identifying the platform and compiler you use, would be appreciated.
Why does the technique seem to work?
Sample code is encouraged.
Here is a related question
[Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?
[Edit] Offset related link

Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:
Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.
Next, profile your code.
Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.
Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo, i++)
{
barOut.munge(foo1, foo2[i]);
}
}
the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo, i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

The order you traverse memory can have profound impacts on performance and compilers aren't really good at figuring that out and fixing it. You have to be conscientious of cache locality concerns when you write code if you care about performance. For example two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column major format will tend to make you have more cache misses and make your program more memory bound than processor bound:
#define N 1000000;
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[i][j];
}
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[j][i];
}
}

Generic Optimizations
Here as some of my favorite optimizations. I have actually increased execution times and reduced program sizes by using these.
Declare small functions as inline or macros
Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method has fewer statements in its content than the combined overhead. These are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)
Remove dead and redundant code
If the code isn't used or does not contribute to the program's result, get rid of it.
Simplify design of algorithms
I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
Loop Unrolling
Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.
Edit: provide an example of loop unrolling
Before:
unsigned int sum = 0;
for (size_t i; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
After unrolling:
unsigned int sum = 0;
size_t i = 0;
**const size_t STATEMENTS_PER_LOOP = 8;**
for (i = 0; i < BYTES_TO_CHECKSUM; **i = i / STATEMENTS_PER_LOOP**)
{
sum += *buffer++; // 1
sum += *buffer++; // 2
sum += *buffer++; // 3
sum += *buffer++; // 4
sum += *buffer++; // 5
sum += *buffer++; // 6
sum += *buffer++; // 7
sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
In this advantage, a secondary benefit is gained: more statements are executed before the processor has to reload the instruction cache.
I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.
Reduction of if statements
Processors hate branches, or jumps, since it forces the processor to reload its queue of instructions.
Boolean Arithmetic (Edited: applied code format to code fragment, added example)
Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.
Example:
struct Reader_Interface
{
virtual bool write(unsigned int value) = 0;
};
struct Rectangle
{
unsigned int origin_x;
unsigned int origin_y;
unsigned int height;
unsigned int width;
bool write(Reader_Interface * p_reader)
{
bool status = false;
if (p_reader)
{
status = p_reader->write(origin_x);
status = status && p_reader->write(origin_y);
status = status && p_reader->write(height);
status = status && p_reader->write(width);
}
return status;
};
Factor Variable Allocation outside of loops
If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
Factor constant expressions outside of loops
If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
I/O in blocks
Read and write data in large chunks (blocks). The bigger the better. For example, reading one octect at a time is less efficient than reading 1024 octets with one read.
Example:
static const char Menu_Text[] = "\n"
"1) Print\n"
"2) Insert new customer\n"
"3) Destroy\n"
"4) Launch Nasal Demons\n"
"Enter selection: ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
The efficiency of this technique can be visually demonstrated. :-)
Don't use printf family for constant data
Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.
Format to memory, then write
Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
Declare constant text (string literals) as static const
When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
Lastly, Code like the compiler would
Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.

The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.
That said, you shouldn't inner-loop on a small function from one file in another file, as that stops it from being inlined.
Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.
Which leads to the next point, read the ^#$# manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.

On most modern processors, the biggest bottleneck is memory.
Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another and know that they are disjoint, carefully putting an alias keyword on the function parameters can really help the compiler generate faster code. However if the memory regions do overlap and you used 'alias', you're in for a good debugging session of undefined behaviors!
Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.
Also don't try to convert floating point values to int and vice versa too much since they use different registers and converting from one type to another means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.

The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.
However, I would remind people that for the code to be optimised you have to tell the compiler to to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.

use const correctness as much as possible in your code. It allows the compiler to optimize much better.
In this document are loads of other optimization tips: CPP optimizations (a bit old document though)
highlights:
use constructor initialization lists
use prefix operators
use explicit constructors
inline functions
avoid temporary objects
be aware of the cost of virtual functions
return objects via reference parameters
consider per class allocation
consider stl container allocators
the 'empty member' optimization
etc

Attempt to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that's what most compilers convert your code to to do their optimizations because it's easier to work with. By doing this places where the compiler might get confused are brought to light. It also makes all but the worst register allocators work as good as the best register allocators, and allows you to debug more easily because you almost never have to wonder where a variable got it's value from as there was only one place it was assigned.
Avoid global variables.
When working with data by reference or pointer pull that into local variables, do your work, and then copy it back. (unless you have a good reason not to)
Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:
x= f();
if(!x){
a();
} else if (x<0){
b();
} else {
c();
}
is almost always cheaper than testing for other constants.
Another trick is to use subtraction to eliminate one compare in range testing.
#define FOO_MIN 8
#define FOO_MAX 199
int good_foo(int foo) {
unsigned int bar = foo-FOO_MIN;
int rc = ((FOO_MAX-FOO_MIN) < bar) ? 1 : 0;
return rc;
}
This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping
up with the result of the first comparison while doing the second and then combining them.
This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.
When using the string functions in c (strcpy, memcpy, ...) remember what they return -- the destination! You can often get better code by 'forgetting' your copy of the pointer to destination and just grab it back from the return of these functions.
Never overlook the oppurtunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up that:
foo_t * make_foo(int a, int b, int c) {
foo_t * x = malloc(sizeof(foo));
if (!x) {
// return NULL;
return x; // x is NULL, already in the register used for returns, so duh
}
x->a= a;
x->b = b;
x->c = c;
return x;
}
Of course, you could reverse the logic on that if and only have one return point.
(tricks I recalled later)
Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into registers or stack positions that called functions usually expect their parameters to be in (it has to deviate in both the called function and the location of all callers to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need and avoid generating code to preserve variable values that are in registers or memory locations that the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benifit of inlining code, but without actually inlining.

I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do its job, otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the paramters. If you look at the well-known open-source libraries they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop, the loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch is faster than doing a test like if(a || b || ... || z). Check first if your compiler does this automatically, some do and it's more readable to have the if though.

In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow in some platforms (e.g., vxworks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean but after that device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worse case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices ) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.

A dumb little tip, but one that will save you some microscopic amounts of speed and code.
Always pass function arguments in the same order.
If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).
The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.
While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.

Two coding technics I didn't saw in the above list:
Bypass linker by writing code as an unique source
While separate compilation is really nice for compiling time, it is very bad when you speak of optimization. Basically the compiler can't optimize beyond compilation unit, that is linker reserved domain.
But if you design well your program you can can also compile it through an unique common source. That is instead of compiling unit1.c and unit2.c then link both objects, compile all.c that merely #include unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.
It's very like writing headers only programs in C++ (and even easier to do in C).
This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware it change part of C semantic and you can meet some problems like static variables or macro collision. For most programs it's easy enough to overcome the small problems that occurs. Also be aware that compiling as an unique source is way slower and may takes huge amount of memory (usually not a problem with modern systems).
Using this simple technique I happened to make some programs I wrote ten times faster!
Like the register keyword, this trick could also become obsolete soon. Optimizing through linker begin to be supported by compilers gcc: Link time optimization.
Separate atomic tasks in loops
This one is more tricky. It's about interaction between algorithm design and the way optimizer manage cache and register allocation. Quite often programs have to loop over some data structure and for each item perform some actions. Quite often the actions performed can be splitted between two logically independent tasks. If that is the case you can write exactly the same program with two loops on the same boundary performing exactly one task. In some case writing it this way can be faster than the unique loop (details are more complex, but an explanation can be that with the simple task case all variables can be kept in processor registers and with the more complex one it's not possible and some registers must be written to memory and read back later and the cost is higher than additional flow control).
Be careful with this one (profile performances using this trick or not) as like using register it may as well give lesser performances than improved ones.

I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.

Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.
Example:
int fac2(int x, int cur) {
if (x == 1) return cur;
return fac2(x - 1, cur * x);
}
int fac(int x) {
return fac2(x, 1);
}
Of course this example doesn't have any bounds checking.
Late Edit
While I have no direct knowledge of the code; it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.

Don't do the same work over and over again!
A common antipattern that I see goes along these lines:
void Function()
{
MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}
The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, knows that the aggregated object isn't changing over the course of these calls, for the love of all that is holy...
void Function()
{
MySingleton* s = MySingleton::GetInstance();
AggregatedObject* ao = s->GetAggregatedObject();
ao->DoSomething();
ao->DoSomethingElse();
ao->DoSomethingCool();
ao->DoSomethingReallyNeat();
ao->DoSomethingYetAgain();
}
In the case of the singleton getter the calls may not be too costly, but it is certainly a cost (typically, "check to see if the object has been created, if it hasn't, create it, then return it). The more complicated this chain of getters becomes, the more wasted time we'll have.

Use the most local scope possible for all variable declarations.
Use const whenever possible
Dont use register unless you plan to profile both with and without it
The first 2 of these, especially #1 one help the optimizer analyze the code. It will especially help it to make good choices about what variables to keep in registers.
Blindly using the register keyword is as likely to help as hurt your optimization, It's just too hard to know what will matter until you look at the assembly output or profile.
There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.

Align your data to native/natural boundaries.

I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.

A neat technique I learned from #MSalters comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:
// before
BigObject a, b;
if(condition)
return a;
else
return b;
// after
BigObject a, b;
if(condition)
swap(a,b);
return a;

If you've got small functions you call repeatedly, i have in the past got large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.
Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.

Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.
Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.
I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.
Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.
And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.
Regards
Mark
When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1

When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.

For performance, focus first on writing maintenable code - componentized, loosely coupled, etc, so when you have to isolate a part either to rewrite, optimize or simply profile, you can do it without much effort.
Optimizer will help your program's performance marginally.

You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say
Assume that the program has been
written correctly, compiled with full
optimization, tested and put into
production.
In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.
If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.
After that is done, micro-optimization (of the hot-spots) can give you a good payoff.

i use intel compiler. on both Windows and Linux.
when more or less done i profile the code. then hang on the hotspots and trying to change the code to allow compiler make a better job.
if a code is a computational one and contain a lot of loops - vectorization report in intel compiler is very helpful - look for 'vec-report' in help.
so the main idea - polish the performance critical code. as for the rest - priority to be correct and maintainable - short functions, clear code that could be understood 1 year later.

One optimization i have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults associated the associated construction costs.
As i generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.

One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.
The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.

I have long suspected, but never proved that declaring arrays so that they hold a power of 2, as the number of elements, enables the optimizer to do a strength reduction by replacing a multiply by a shift by a number of bits, when looking up individual elements.

Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.

Related

Vector re-declaration versus insertions in looping operations - C++

I have an option to either create and destroy a vector on every call to func() and push elements in each iteration, as shown in Example A OR fixed the initialization and only overwrite old values in each iteration, as shown in Example B.
Example A:
void func ()
{
std::vector<double> my_vec(5, 0.0);
for ( int i = 0; i < my_vec.size(); i++) {
my_vec.push_back(i);
// do something
}
}
while (condition) {
func();
}
Example B:
void func (std::vector<double>& my_vec)
{
for ( int i = 0; i < my_vec.size(); i++) {
my_vec[i] = i;
// do something
}
}
while (condition) {
std::vector<double> my_vec(5, 0.0);
func(myVec);
}
Which of the two would be computationally inexpensive. The size of the array won't be more than 10.
I still suspect that the question that was asked is not the question that was intended, but it occurred to me that the main point of my answer would likely not change. If the question gets updated, I can always edit this answer to match (or delete it, if it turns out to be inapplicable).
De-prioritize optimizations
There are various factors that should affect how you write your code. Among the desirable goals are space optimization, time optimization, data encapsulation, logic encapsulation, readability, robustness, and correct functionality. Ideally, all of these goals would be achievable in every piece of code, but that is not particularly realistic. Much more likely is a situation where one or more of these goals must be sacrificed in favor of the others. When that happens, optimizations should typically yield to everything else.
That is not to say that optimizations should be ignored. There are plenty of optimizations that rarely obstruct the higher-priority goals. These range from the small, such as passing by const reference instead of by value, to the large, such as choosing the logarithmic algorithm instead of the exponential one. However, the optimizations that do interfere with the other goals should be postponed until after your code is reasonably complete and functioning correctly. At that point, a profiler should be used to determine where the bottlenecks actually are. Those bottlenecks are the only places where other goals should yield to optimizations, and only if the profiler confirms that the optimizations achieved their goals.
For the question being asked, this means that the main concern should not be computational expense, but encapsulation. Why should the caller of func() need to allocate space for func() to work with? It should not, unless a profiler identified this as a performance bottleneck. And if a profiler did that, it would be much easier (and more reliable!) to ask the profiler if the change helps than to ask Stack Overflow.
Why?
I can think of two major reasons to de-prioritize optimizations. First, the "sniff test" is unreliable. While there might be a few people who can identify bottlenecks by looking at code, there are many, many more who merely think they can. Second, that's why we have optimizing compilers. It is not unheard of for someone to come up with this super-clever optimization trick only to discover that the compiler was already doing it. Keep your code clean and let the compiler handle the routine optimizations. Only step in when the task demonstrably exceeds the compiler's capabilities.
See also: premature-optimization
Choosing an optimization
OK, suppose the profiler did identify construction of this small, 10-element array as a bottleneck. The next step is to test an alternative, right? Almost. First you need an alternative, and I would consider a review of the theoretical benefits of various alternatives to be useful. Just keep in mind that this is theoretical and that the profiler gets the final say. So I'll go into the pros and cons of the alternatives from the question, as well as some other alternatives that might bear consideration. Let's start from the worst options, working our way to the better ones.
Example A
In Example A, a vector is created with 5 elements, then elements are pushed onto the vector until i meets or exceeds the vector's size. Seeing how i and the vector's size are both increased by one each iteration (and i starts smaller than the size), this loop will run until the vector grows large enough to crash the program. That means probably billions of iterations (despite the question's claim that the size will not exceed 10).
Easily the most computationally expensive option. Don't do this.
Example B
In example B, a vector is created for each iteration of the outer while loop, which is then accessed by reference from within func(). The performance cons here include passing a parameter to func() and having func() access the vector indirectly through a reference. There are no performance pros as this does everything the baseline (see below) would do, plus some extra steps.
Even though a compiler might be able to compensate for the cons, I see no reason to try this approach.
Baseline
The baseline I'm using is a fix to Example A's infinite loop. Specifically, replace "my_vec.push_back(i);" with Example B's "my_vec[i] = i;". This simple approach is along the lines of what I would expect for the initial assessment by the profiler. If you cannot beat simple, stick with it.
Example B*
The text of the question presents an inaccurate assessment of Example B. Interestingly, the assessment describes an approach that has the potential to improve on the baseline. To get code that matches the textual description, move Example B's "std::vector<double> my_vec(5, 0.0);" to the line immediately before the while statement. This has the effect of constructing the vector only once, rather than constructing it with each iteration.
The cons of this approach are the same as those of Example B as originally coded. However, we now pick up a gain in that the vector's constructor is called only once. If construction is more expensive than the indirection costs, the result should be a net improvement once the while loop iterates often enough. (Beware these conditions: that's a significant "if" and there is no a priori guess as to how many iterations is "enough".) It would be reasonable to try this and see what the profiler says.
Get some static
A variant on Example B* that helps preserve encapsulation is to use the baseline (the fixed Example A), but precede the declaration of the vector with the keyword static. This brings in the benefit of constructing the vector only once, but without the overhead associated with making the vector a parameter. In fact, the benefit could be greater than in Example B* since construction happens only once per program execution, rather than each time the while loop is started. The more times the while loop is started, the greater this benefit.
The main con here is that the vector will occupy memory throughout the program's execution. Unlike Example B*, it will not release its memory when the block containing the while loop ends. Using this approach in too many places would lead to memory bloat. So while it is reasonable to profile this approach, you might want to consider other options. (Of course if the profiler calls this out as the bottleneck, dwarfing all others, the cost is small enough to pay.)
Fix the size
My personal choice for what optimization to try here would be to start from the baseline and switch the vector to std::array<10,double>. My main motivation is that the needed size won't be more than 10. Also relevant is that the construction of a double is trivial. Construction of the array should be on par with declaring 10 variables of type double, which I would expect to be negligible. So no need for fancy optimization tricks. Just let the compiler do its thing.
The expected possible benefit of this approach is that a vector allocates space on the heap for its storage, which has an overhead cost. The local array would not have this cost. However, this is only a possible benefit. A vector implementation might already take advantage of this performance consideration for small vectors. (Maybe it does not use the heap until the capacity needs to exceed some magic number, perhaps more than 10.) I would refer you back to earlier when I mentioned "super-clever" and "compiler was already doing it".
I'd run this through the profiler. If there's no benefit, there is likely no benefit from the other approaches. Give them a try, sure, since they're easy enough, but it would probably be a better use of your time to look at other aspects to optimize.

Efficiency of accessing a value through a pointer vs storing as temporary value

If there's function that takes a pointer to struct as parameter and the function has a loop that access a member at every iteration like:
int function_a(struct b *b)
{
// ... do something then
for(int i = 0; i < 500; ++i){
b->c->d[i] = value();
}
}
Is it going to retrieve the location c points to and d points to from memory every time?
Now consider the following situation:
int function_a(struct b *b)
{
// ... do something then
float *f = b->c->d;
for(int i = 0; i < 500; ++i){
f[i] = value();
}
}
Would that be faster?
I urge you to heed Thomas Matthews' advice regarding profiling, however to answer your question: it depends.
This particular transformation is also known as code hoisting which consists in moving code without side-effect and with the same result at each call outside of a loop. As noted though, this is only performed if the compiler can prove:
that there is no side-effect
that the same result is computed at each call
In both cases, this basically means that the compiler should have access to the full code (see the definitions) of both:
the expression itself, to prove the absence of side-effect
anything that might alter something about the expression, to prove that the same result is computed each time
Therefore, it is actually unlikely that it will perform the optimization unless all the code for the body loop is included in the headers (and thus can be inlined) because any opaque function could hide a modification of b->c (for example) through an evil global variable.
In your example, nothing proves that value() does not change b->c... so no, the compiler would be wrong to hoist the code unless it has access to the definition of value() and can rule this possibility out.
Using a temporary may not be faster than accessing a temporary: depends on the platform.
When in doubt, look at the assembly language generated by the compiler. On the ARM processor, when accessing the memory:
A register is loaded with the address of the variable.
The register is dereferenced to obtain the value (and stored in
another register).
This is very similar to dereferencing a pointer:
A register is loaded with the pointer value.
The register is dereferenced to obtain the value.
There may be a second load from memory to fetch the pointer value. The truth is in the assembly language.
This is known as a micro-optimization and should only be applied as a last resort to speed up code in performance critical areas. Use a profiler to find out where the bottlenecks are and address those first.
You don't make it clear why you are focusing on this, whether you have already done some performance analysis and homed in on this routine or whether you are doing "blind mans optimization" - which consists of looking at the code and saying "maybe that's slow".
Let me first address your up-front question:
a->b->c[i]
vs
f[i]
If you compile these two pieces of code without optimization then there is a very high probability that f[i] will be faster.
As soon as you enable optimization all bets are off. Firstly, which architecture you are using is unknown, so the cost of the sequential fetches in a->b->c is unknown, also we don't know how many registers are available, or what optimizations the compiler might use. It is conceivable that the cost of any single write might be high enough that, if the CPU uses pipelining, the writes take long enough to make it irrelevant whether we spend some time doing pointer math between writes or not.
As a somewhat experienced optimizer, I'd be more interested in "what does value() do?". Can the compiler be certain that value() does not modify any of the values in a, a-> or a->b->c?
If you absolutely, definitively know that these values won't change, have done perf analysis and found that this loop is a bottleneck, looked at the assembler to determine that the compiler does not emit the most efficient code, then you might optimize it as follows:
int function_a(struct b* const b)
{
/// optimization: we found XYC compiler for Leg architecture was
/// emitting instructions that repeatedly fetched the array base
/// address every iteration.
float* const end = f + 500;
for (float* it = b->c->d; it < end; ++it)
*it = value();
}
HOWEVER: Making such a low-level optimization carries a risk. C/C++ optimizers these days can be pretty smart. One way to keep them from generating the most efficient code possible is to start hand-optimizing things.
What we've done here is made an efficiently tight loop, but that may not be the most efficient way to achieve the result in assembly.
In the i = 0; i < 500 case, depending on the implementation of value(), may actually stride or vectorize the loop in such a way as to keep the memory bus busy, or it might use special, wide registers to do multiple operations at a time. Our optimization may create a pathelogical scenario whereby we force the compiler to emit the least efficient order of operations.
Again - we don't know the reasons you are focusing on this part of the code, but in practice I've always found it is very unlikely you will gain much by hand-optimizing this part of a loop.
If you are developing under Linux, you may want to look into valgrind to assist you with perf analysis. If you are developing under Visual Studio then "Analyze" -> "Performance and Diagnostics" (ctrl-alt-f9) will bring up the perf wizard. Click start and select "Instrumentation".

Loops - storing data or recalculating

How much time does saving a value cost me processor-vise? Say i have a calculated value x that i will use 2 times, 5 times, or 20 times.At what point does it get more optimal to save the value calculated instead of recalculating it each time i use it?
example:
int a=0,b=-5;
for(int i=0;i<k;++i)
a+=abs(b);
or
int a=0,b=-5;
int x=abs(b);
for(int i=0;i<k;++i)
a+=x;
At what k value does the second scenario produce better results? Also, how much is this RAM dependent?
Since the value of abs(b) doesn't change inside the for loop, a compiler will most likely optimize both snippets to the same result i.e. evaluating the value of abs(b) just once.
It is almost impossible to provide an answer other than measure in a real scenario. When you cache the data in the code, it may be stored in a register (in the code you provide it will most probably be), or it might be flushed to L1 cache, or L2 cache... depending on what the loop is doing (how much data is it using?). If the value is cached in a register the cost is 0, the farther it is pushed the higher the cost it will take to retrieve the value.
In general, write code that is easy to read and maintain, then measure the performance of the application, and if that is not good, profile. Find the hotspots, find why they are hotspots and then work from there on. I doubt that caching vs. calculating abs(x) for something as above would ever be a hotspot in a real application. So don't sweat it.
I would suggest (this is without testing mind you) that the example with
int x=abs(b)
outside the loop will be faster simply because you're avoiding allocating a stack frame each iteration in order to call abs().
That being said, if the compiler is smart enough, it may figure out what you're doing and produce the same (or similar) instructions for both.
As a rule of thumb it doesn't cost you much, if anything, to store that value outside the loop, since the compiler is most likely going to store the result of abs(x) into a register anyways. In fact, when the compiler optimizes this code (assuming you have optimizations turned on), one of the first things it will do is pull that abs(x) out of the loop.
You can further help the compiler generate good code by qualifying your declaration of "x" with the "register" hint. This will ask the compiler to store x into a register value if possible.
If you want to see what the compiler actually does with your code, one thing to do is to tell it to compile but not assemble (in gcc, the option is -S) and look at the resulting assembly code. In many cases, the compiler will generate better code than you can optimize by hand. However, there's also no reason to NOT do these easy optimizations yourself.
Addendum:
Compiling the above code with optimizations turned on in GCC will result in code equivalent to:
a = abs(b) * k;
Try it and see.
For many cases it produces better perf from k=2. The example you gave is . not one. Most compilers try to perform this kind of hoisting when even low levels of optimization are enabled. The value is stored, at worst, on the local stack and so is likely to stay fairly cache warm, negating your memory concerns.
But potentially it will be held in a register.
The original has to perform an adittional branch, repeat the calculations and return the value. Abs is one example of a function the compiler may be able to recognize as a constexpr and hoist.
While developing your own classes, this is one of the reason you should try to mark members and references as construe whenever possible.

performance of std::vector c++ size() inside loop in member function

Similar question, but less specific:
Performance issue for vector::size() in a loop
Suppose we're in a member function like:
void Object::DoStuff() {
for( int k = 0; k < (int)this->m_Array.size(); k++ )
{
this->SomeNotConstFunction();
this->ConstFunction();
double x = SomeExternalFunction(i);
}
}
1) I'm willing to believe that if only the "SomeExternalFunction" is called that the compiler will optimize and not redundantly call size() on m_Array ... is this the case?
2) Wouldn't you almost certainly get a boost in speed from doing
int N = m_Array.size()
for( int k = 0; k < N; k++ ) { ... }
if you're calling some member function that is not const ?
Edit Not sure where these down-votes and snide comments about micro-optimization are coming from, perhaps I can clarify:
Firstly, it's not to optimize per-se but just understand what the compiler will and will not fix. Usually I use the size() function but I ask now because here the array might have millions of data points.
Secondly, the situation is that "SomeNotConstFunction" might have a very rare chance of changing the size of the array, or its ability to do so might depend on some other variable being toggled. So, I'm asking at what point will the compiler fail, and what exactly is the time cost incurred by size() when the array really might change, despite human-known reasons that it won't?
Third, the operations in-loop are pretty trivial, there are just millions of them but they are embarrassingly parallel. I would hope that by externally placing the value would let the compiler vectorize some of the work.
Do not get into the habit of doing things like that.
The cases where the optimization you make in (2) is:
safe to do
has a noticeable difference
something your compiler cannot figure out on its own
are few and far in-between.
If it were just the latter two points, I would just advise that you're worrying about something unimportant. However, that first point is the real killer: you do not want to get in the habit of giving yourself extra chances to make mistakes. It's far, far easier to accelerate slow, correct code than it is to debug fast, buggy code.
Now, that said, I'll try answering your question. The definitions of the functions SomeNotConstFunction and SomeConstFunction are (presumably) in the same translation unit. So if these functions really do not modify the vector, the compiler can figure it out, and it will only "call" size once.
However, the compiler does not have access to the definition of SomeExternalFunction, and so must assume that every call to that function has the potential of modifying your vector. The presence of that function in your loop guarantees that `size is "called" every time.
I put "called" in quotes, however, because it is such a trivial function that it almost certainly gets inlined. Also, the function is ridiculously cheap -- two memory lookups (both nearly guaranteed to be cache hits), and either a subtraction and a right shift, or maybe even a specialized single instruction that does both.
Even if SomeExternalFunction does absolutely nothing, it's quite possible that "calling" size every time would still only be a small-to-negligible fraction of the running time of your loop.
Edit: In response to the edit....
what exactly is the time cost incurred by size() when the array really might change
The difference in the times you see when you time the two different versions of code. If you're doing very low level optimizations like that, you can't get answers through "pure reason" -- you must empirically test the results.
And if you really are doing such low level optimizations (and you can guarantee that the vector won't resize), you should probably be more worried about the fact the compiler doesn't know the base pointer of the array is constant, rather than it not knowing the size is constant.
If SomeExternalFunction really is external to the compilation unit, then you have pretty much no chance of the compiler vectorizing the loop, no matter what you do. (I suppose it might be possible at link time, though....) And it's also unlikely to be "trivial" because it requires function call overhead -- at least if "trivial" means the same thing to you as to me. (again, I don't know how good link time optimizations are....)
If you really can guarantee that some operations will not resize the vector, you might consider refining your class's API (or at least it's protected or private parts) to include functions that self-evidently won't resize the vector.
The size method will typically be inlined by the compiler, so there will be a minimal performance hit, though there will usually be some.
On the other hand, this is typically only true for vectors. If you are using a std::list, for instance, the size method can be quite expensive.
If you are concerned with performance, you should get in the habit of using iterators and/or algorithms like std::for_each, rather than a size-based for loop.
The micro optimization remarks are probably because the two most common implementations of vector::size() are
return _Size;
and
return _End - _Begin;
Hoisting them out of the loop will probably not noticably improve the performance.
And if it is obvious to everyone that it can be done, the compiler is also likely to notice. With modern compilers, and if SomeExternalFunction is statically linked, the compiler is usually able to see if the call might affect the vector's size.
Trust your compiler!
In MSVC 2015, it does a return (this->_Mylast() - this->_Myfirst()). I can't tell you offhand just how the optimizer might deal with this; but unless your array is const, the optimizer must allow for the possibility that you may modify its number of elements; making it hard to optimize out. In Qt, it equates to an inline function that that does a return d->size; ; that is, for a QVector.
I've taken to doing it in one particular project I'm working on, but it is for performance-oriented code. Unless you are interested in deeply optimizing something, I wouldn't bother. It probably is pretty fast any of these ways. In Qt, it is at most one pointer dereferencing, and is more typing. It looks like it could make a difference in MSVC.
I think nobody has offered a definitive answer so far; but if you really want to test it, have the compiler emit assembly source code, and inspect it both ways. I wouldn't be surprised to find that there's no difference when highly optimized. Let's not forget, though, that unoptimized performance during debug is also a factor that might be taken into consideration, when a lot of e.g. number crunching is involved.
I think the OP's original ? really could use to give how the array is declared.

Shall I optimize or let compiler to do that?

What is the preferred method of writing loops according to efficiency:
Way a)
/*here I'm hoping that compiler will optimize this
code and won't be calling size every time it iterates through this loop*/
for (unsigned i = firstString.size(); i < anotherString.size(), ++i)
{
//do something
}
or maybe should I do it this way:
Way b)
unsigned first = firstString.size();
unsigned second = anotherString.size();
and now I can write:
for (unsigned i = first; i < second, ++i)
{
//do something
}
the second way seems to me like worse option for two reasons: scope polluting and verbosity but it has the advantage of being sure that size() will be invoked once for each object.
Looking forward to your answers.
I usually write this code as:
/* i and size are local to the loop */
for (size_t i = firstString.size(), size = anotherString.size(); i < size; ++i) {
// do something
}
This way I do not pollute the parent scope and avoid calling anotherString.size() for each loop iteration.
It is especially useful with iterators:
for(some_generic_type<T>::forward_iterator it = container.begin(), end = container.end();
it != end; ++it) {
// do something with *it
}
Since C++ 11 the code can be shortened even more by writing a range-based for loop:
for(const auto& item : container) {
// do something with item
}
or
for(auto item : container) {
// do something with item
}
In general, let the compiler do it. Focus on the algorithmic complexity of what you're doing rather than micro-optimizations.
However, note that your two examples are not semantically identical - if the body of the loop changes the size of the second string, the two loops will not iterate the same amount of times. For that reason, the compiler might not be able to perform the specific optimization you're talking about.
I would first use the first version, simply because it looks cleaner and easier to type. Then you can profile it to see if anything needs to be more optimized.
But I highly doubt that the first version will cause a noticable performance drop. If the container implements size() like this:
inline size_t size() const
{
return _internal_data_member_representing_size;
}
then the compiler should be able to inline the function, eliding the function call. My compiler's implementation of the standard containers all do this.
How will a good compiler optimize your code? Not at all, as it can't be sure size() has any side-effects. If size() had any side effects your code relied on, they'd now be gone after a possible compiler optimization.
This kind of optimization really isn't safe from a compiler's perspective, you need to do it on your own. Doing on your own doesn't mean you need to introduce two additional local variables. Depending on your implementation of size, it might be an O(1) operation. If size is also declared inline, you'll also spare the function call, making the call to size() as good as a local member access.
Don't pre-optimize your code. If you have a performance problem, use a profiler to find it, otherwise you are wasting development time. Just write the simplest / cleanest code that you can.
This is one of those things that you should test yourself. Run the loops 10,000 or even 100,000 iterations and see what difference, if any, exists.
That should tell you everything you want to know.
My recommendation is to let inconsequential optimizations creep into your style. What I mean by this is that if you learn a more optimal way of doing something, and you cant see any disadvantages to it (as far as maintainability, readability, etc) then you might as well adopt it.
But don't become obsessed. Optimizations that sacrifice maintainability should be saved for very small sections of code that you have measured and KNOW will have a major impact on your application. When you do decide to optimize, remember that picking the right algorithm for the job is often far more important than tight code.
I'm hoping that compiler will optimize this...
You shouldn't. Anything involving
A call to an unknown function or
A call to a method that might be overridden
is hard for a C++ compiler to optimize. You might get lucky, but you can't count on it.
Nevertheless, because you find the first version simpler and easier to read and understand, you should write the code exactly the way it is shown in your simple example, with the calls to size() in the loop. You should consider the second version, where you have extra variables that pull the common call out of the loop, only if your application is too slow and if you have measurements showing that this loop is a bottleneck.
Here's how I look at it. Performance and style are both important, and you have to choose between the two.
You can try it out and see if there is a performance hit. If there is an unacceptable performance hit, then choose the second option, otherwise feel free to choose style.
You shouldn't optimize your code, unless you have a proof (obtained via profiler) that this part of code is bottleneck. Needless code optimization will only waste your time, it won't improve anything.
You can waste hours trying to improve one loop, only to get 0.001% performance increase.
If you're worried about performance - use profilers.
There's nothing really wrong with way (b) if you just want to write something that will probably be no worse than way (a), and possibly faster. It also makes it clearer that you know that the string's size will remain constant.
The compiler may or may not spot that size will remain constant; just in case, you might as well perform this optimization yourself. I'd certainly do this if I was suspicious that the code I was writing was going to be run a lot, even if I wasn't sure that it would be a big deal. It's very straightforward to do, it takes no more than 10 extra seconds thinking about it, it's very unlikely to slow things down, and, if nothing else, will almost certainly make the unoptimized build run a bit more quickly.
(Also the first variable in style (b) is unnecessary; the code for the init expression is run only once.)
How much percent of time is spent in for as opposed to // do something? (Don't guess - sample it.) If it is < 10% you probably have bigger issues elsewhere.
Everybody says "Compilers are so smart these days."
Well they're no smarter than the poor coders who write them.
You need to be smart too. Maybe the compiler can optimize it but why tempt it not to?
For the "std::size_t size()const" member function which not only is O(1) but is also declared "const" and so can be automatically pulled out of the loop by the compiler, it probably doesn't matter. That said, I wouldn't count on the compiler to remove it from the loop, and I think it is a good habit to get into to factor out the calls within the loop for cases where the function isn't constant or O(1). In addition, I think assigning the values to a variable leads to the code being more readable. I would not suggest, though, that you make any premature optimizations if it will result in the code being harder to read. Again, though, I think the following code is more readable, since there is less to read within the loop:
std::size_t firststrlen = firststr.size();
std::size_t secondstrlen = secondstr.size();
for ( std::size_t i = firststrlen; i < secondstrlen; i++ ){
// ...
}
Also, I should point out that you should use "std::size_t" instead of "unsigned", as the type of "std::size_t" can vary from one platform to another, and using "unsigned" can lead to trunctations and errors on platforms for which the type of "std::size_t" is "unsigned long" instead of "unsigned int".