Recursive functions in C/C++

Are recursive functions in C/C++ useful in any way? Where are they mostly used?
Are there any memory advantages to using recursive functions?
Edit: is recursion better than using a while loop?

Recursive functions are primarily used because they make algorithms easier to design. For example, suppose you need to traverse a directory tree recursively. Its depth is limited, so you are unlikely ever to hit recursion deep enough to cause a stack overflow, and writing a tree traversal recursively is much easier than doing the same thing iteratively.
In most cases recursive functions don't save memory compared to iterative solutions. Even worse, they consume stack memory, which is relatively scarce.

They have many uses, and some things become very difficult, if not impossible, without them. Iterating over trees, for instance.

Recursion definitely has advantages for problems with a recursive nature. Other posters named some of them.
Using C's support for recursion definitely has advantages for memory management. When you try to avoid recursion, most of the time you end up using your own stack or some other dynamic data structure to break the problem down. This involves dynamic memory management in C/C++. Dynamic memory management is costly and error-prone!
You can't beat the stack
On the other hand, when you just use the stack and use recursion with local variables, the memory management is simple, and the stack is most of the time more time-efficient than any memory management you can do yourself or with plain C/C++ memory management. The reason is that the system stack is a simple and convenient data structure with low overhead, and it is implemented using special processor operations that are optimized for this work. Believe me, you can't beat that, since compilers, operating systems and processors have been optimized for stack manipulation for decades!
PS: Also, the stack does not become fragmented the way heap memory easily does. In this way it is also possible to save memory by using the stack / recursion.

Implement QuickSort with and without using recursion, then you can tell by yourself if it's useful or not.

I often find recursive algorithms easier to understand because they involve less mutable state. Consider the algorithm for determining the greatest common divisor of two numbers.
unsigned greatest_common_divisor_iter(unsigned x, unsigned y)
{
    while (y != 0)
    {
        unsigned temp = x % y;
        x = y;
        y = temp;
    }
    return x;
}
unsigned greatest_common_divisor(unsigned x, unsigned y)
{
    return y == 0 ? x : greatest_common_divisor(y, x % y);
}
There is too much renaming going on in the iterative version for my taste. In the recursive version, everything is immutable, so you could even make x and y const if you liked.

When using recursion, you can store data on the stack (effectively, in the calling contexts of all the functions above the current instance) that you would otherwise have to store on the heap with dynamic allocation if you were trying to do the same thing with a while loop.
Think of most divide-and-conquer algorithms where there are two things to do on each call (that is, one of the calls is not tail-recursive).
And with respect to Tom's interesting comment/subquestion, this advantage of recursive functions is maybe particularly noticeable in C where the management of dynamic memory is so basic. But that doesn't make it very specific to C.

Dynamic programming is a key area where recursion is crucial, though it goes beyond that (remembering sub-answers can give drastic performance improvements). Algorithms are where recursion is normally used, rather than typical day to day coding. It's more a computer-science concept than a programming one.
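To illustrate the remark about remembering sub-answers, here is a minimal sketch that memoizes a recursive Fibonacci function. The choice of Fibonacci and the names fib_memo and fib are assumptions made for this illustration; they do not come from the answer above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the n-th Fibonacci number, caching each
// sub-answer in `memo` so that it is computed only once.
// A stored 0 means "not computed yet" (every cached value is >= 1).
std::uint64_t fib_memo(int n, std::vector<std::uint64_t>& memo)
{
    assert(n >= 0 && static_cast<std::size_t>(n) < memo.size());
    if (n < 2) return static_cast<std::uint64_t>(n);
    if (memo[n] != 0) return memo[n];   // sub-answer already known
    memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo);
    return memo[n];
}

// Convenience wrapper that owns the cache for one computation.
std::uint64_t fib(int n)
{
    assert(n >= 0);
    std::vector<std::uint64_t> memo(static_cast<std::size_t>(n) + 1, 0);
    return fib_memo(n, memo);
}
```

Without the cache, the same recursion recomputes the same sub-problems over and over and takes exponential time; with it, each value from 2 to n is computed once.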

One thing that is worth mentioning is that in most functional languages (Scheme for example), you can take advantage of tail call optimization, and thus you can use recursive functions without increasing the amount of memory on your stack.
Basically, complex recursive tail calls can run flawlessly in Scheme, while in C/C++ the same ones may cause a stack overflow, because C and C++ compilers are not required to optimize tail calls.

There are two reasons I see for using recursion:
an algorithm operates on recursive data structures (like e.g. a tree)
an algorithm is of recursive nature (often happens for mathematical problems as recursion often offers beautiful solutions)
Handle recursion with care as there is always the danger of infinite recursion.

Recursive functions make it easier to code solutions that have a recurrence relation.
For instance the factorial function has a recurrence relation:
factorial(0) = 1
factorial(n) = n * factorial(n-1)
Below I have implemented factorial using recursion and looping.
The recursive version closely mirrors the recurrence relation defined above and is
hence easier to read.
Recursive:
double factorial ( int n )
{
    return ( n ? n * factorial ( n-1 ) : 1 );
}
Looping:
double factorial ( int n )
{
    double result = 1;
    while ( n > 1 )
    {
        result = result * n;
        n--;
    }
    return result;
}
One more thing:
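Strictly speaking, the recursive call in the version above is not a tail call, because the multiplication by n still has to happen after the call returns. If the function is rewritten to carry the running product in an extra accumulator parameter, the recursive call becomes a tail call and can be tail-call optimized. This brings the space complexity of recursive factorial down to the space complexity of the iterative factorial.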
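To make the tail-call point concrete, here is a minimal sketch of factorial with an accumulator parameter, so that the recursive call is the very last thing the function does. The names factorial_acc and factorial_tail are made up for this illustration. Whether the call is actually turned into a loop depends on the compiler and the optimization level; neither C nor C++ requires it.

```cpp
#include <cassert>

// Hypothetical helper: `acc` carries the product computed so far, so nothing
// remains to be done after the recursive call returns (a tail call).
double factorial_acc(int n, double acc)
{
    assert(n >= 0);
    return n <= 1 ? acc : factorial_acc(n - 1, acc * n);
}

// Public entry point: starts the accumulator at 1.
double factorial_tail(int n)
{
    return factorial_acc(n, 1.0);
}
```

For example, factorial_tail(5) evaluates factorial_acc(5, 1), then (4, 5), (3, 20), (2, 60), (1, 120), and returns 120.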

Related

Binary Search Tree preorder traversal, recursion vs loop?

We can traverse a binary search tree recursively like this:
void traverse1(node t) {
    if (t != NULL) {
        visit(t);
        traverse1(t->left);
        traverse1(t->right);
    }
}
and also with a loop (and an explicit stack) like this:
void traverse2(node root) {
    if (root == NULL)
        return;  // nothing to visit; avoids dereferencing a null node below
    stack.push(root);
    while (stack.notEmpty()) {
        node next = stack.pop();
        visit(next);
        if (next->right != NULL)
            stack.push(next->right);
        if (next->left != NULL)
            stack.push(next->left);
    }
}
Question
Which one is more efficient, and why?
I think the time complexity of both methods is O(n), so is the only difference in space complexity, or is there something else?

It depends on how you define efficiency. Is it runtime, amount of code, size of the executable, how much memory/stack space is used, or how easy the code is to understand?
Recursion is very easy to code, hopefully easy to understand, and takes less code, which can also mean a smaller executable. On the other hand, recursion will use more stack space once you have more than a few items to traverse.
Looping needs more code (as your example above shows) and could be considered a bit more complex. But the traversal itself is just one call placed on the stack, rather than several. So if you have a lot of items to traverse, the loop should be faster, because it avoids the time spent pushing a call frame onto the stack and popping it off again for every node, which is what happens with recursion.

Apart from efficiency, if your tree is too deep or if your stack space is limited, you may run into overflow: stack overflow!
With an iterative approach, you can place the explicit stack in the much larger heap space. With recursion, you don't have a choice, as the stack frames are pushed and popped for you.
I know that such constrained stack environments may be a bit rare; nevertheless, one needs to be aware of it.

Both versions have the same space and time complexity.
The recursion implicitly uses the stack (memory location) for storing call context, and the second uses a stack abstract data type effectively emulating the first version, using stack explicitly.
The difference is that with the first version you risk a stack overflow with deep, unbalanced trees; however, it is conceptually simpler (fewer opportunities for bugs). The second uses dynamic allocation for storing the pointers to the nodes still to be visited.
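As a runnable sketch of the two approaches being compared, the following C++ code implements the same preorder traversal recursively and with an explicit std::stack, and checks that both visit the nodes in the same order. The Node type and all function names are assumptions made for this illustration, not the asker's actual types.

```cpp
#include <cassert>
#include <stack>
#include <vector>

struct Node {
    int value;
    Node* left;
    Node* right;
};

// Recursive preorder: the call stack remembers where to resume.
void preorder_recursive(const Node* t, std::vector<int>& out)
{
    if (t == nullptr) return;
    out.push_back(t->value);
    preorder_recursive(t->left, out);
    preorder_recursive(t->right, out);
}

// Iterative preorder: a heap-backed std::stack plays the same role.
void preorder_iterative(const Node* root, std::vector<int>& out)
{
    std::stack<const Node*> pending;
    if (root != nullptr) pending.push(root);
    while (!pending.empty()) {
        const Node* next = pending.top();
        pending.pop();
        out.push_back(next->value);
        // Push right first so that left is popped (visited) first.
        if (next->right != nullptr) pending.push(next->right);
        if (next->left != nullptr) pending.push(next->left);
    }
}

// Builds a five-node tree and returns its preorder, after checking that
// both traversals agree.
std::vector<int> demo_preorder()
{
    Node d{4, nullptr, nullptr};
    Node e{5, nullptr, nullptr};
    Node b{2, &d, &e};
    Node c{3, nullptr, nullptr};
    Node a{1, &b, &c};
    std::vector<int> recursive, iterative;
    preorder_recursive(&a, recursive);
    preorder_iterative(&a, iterative);
    assert(recursive == iterative);
    return recursive;
}
```

Both functions do the same amount of work per node. What differs is where the bookkeeping lives: in call frames for the first, in the container behind std::stack for the second.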

You will have to measure the difference to know for sure. I personally have a feeling that the recursive formulation will beat the one with an explicit stack in this particular instance.
What the non-recursive version has going for it is that it eliminates the calls. On the other hand - depending on the exact library implementation - the pushes and the pop might also resolve to function calls.
A decent optimizing compiler will typically compile your recursive function into something similar to the following pseudo-code:
void traverse1(node t) {
start:
    if (t != NULL) {
        visit(t);
        traverse1(t->left);
        t = t->right;
        goto start;
    }
}
This eliminates one of the recursive calls. It is known as tail call elimination.

The time and space complexity are the same. The only difference is that traverse2 doesn't call itself recursively. This should make it slightly faster, as pushing/popping from a stack is a cheaper operation than calling a function.
That said, I think the recursive version is "cleaner", so I'd personally use that, unless it turns out to be too slow in practice.

How to avoid stack overflow of a recursive function?

For example, if we traverse a rather big tree with the following function, we may get a stack overflow.
void inorder(node* n)
{
    if (n == nullptr) return;
    inorder(n->l);
    n->print();
    inorder(n->r);
}
How can I add a condition or something similar to the function to prevent such an overflow from happening?

Consider iteration instead of recursion, if that is really a concern.
http://en.wikipedia.org/wiki/Tree_traversal
See the pseudocode there for the iterative versions:
iterativeInorder
iterativePreorder
iterativePostorder
Basically, use your own list as a stack data structure in a while loop. That way you can effectively replace function recursion.
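The idea of using your own stack in a while loop can be sketched in C++ as follows. The Node layout (value, l, r), the function names, and the left-leaning stress tree are assumptions for this illustration; they only loosely follow the code in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* l;
    Node* r;
};

// In-order traversal with an explicit stack kept in a std::vector.
std::vector<int> inorder_iterative(Node* root)
{
    std::vector<int> visited;
    std::vector<Node*> stack;
    Node* n = root;
    while (n != nullptr || !stack.empty()) {
        // Walk as far left as possible, remembering the path.
        while (n != nullptr) {
            stack.push_back(n);
            n = n->l;
        }
        n = stack.back();
        stack.pop_back();
        visited.push_back(n->value);  // visit the node
        n = n->r;                     // then continue in its right subtree
    }
    return visited;
}

// Hypothetical stress helper: a left-leaning chain of `depth` nodes, i.e. a
// degenerate tree that is as deep as it has nodes. nodes[0] is the root.
std::vector<Node> make_left_chain(int depth)
{
    assert(depth > 0);
    std::vector<Node> nodes(static_cast<std::size_t>(depth));
    for (int i = 0; i < depth; ++i) {
        nodes[i].value = i;
        nodes[i].l = (i + 1 < depth) ? &nodes[i + 1] : nullptr;
        nodes[i].r = nullptr;
    }
    return nodes;
}
```

The explicit stack grows on the heap, so the traversal depth is bounded by available memory rather than by the thread's stack size. A chain of a few hundred thousand nodes, which could overflow the call stack with the recursive version, is handled without deep recursion.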

There's no portable solution other than by replacing recursion
with explicit management of the stack (using
std::vector<Node*>). Non-portably, you can keep track of the
depth using a static variable; if you know the maximum stack
size, and how much stack each recursion takes, then you can
check that the depth doesn't exceed that.
On a lot of systems, like Linux and Solaris, you can't know the
maximum stack depth up front, since the stack is allocated
dynamically. At least under Linux and Solaris, however, once
memory has been allocated to the stack, it will remain allocated
and remain assigned to the stack. So you can recurse fairly
deeply at the start of the program (possibly crashing, but
before having done anything), and then check against this value
later:
#include <new>  // std::bad_alloc

static char const* lowerBound = nullptr;
// At start-up...
void
preallocateStack( int currentCount )
{
    char dummyToTakeSpace[1000];
    -- currentCount;
    if ( currentCount <= 0 ) {
        // Deepest frame reached: remember its address as the limit.
        lowerBound = dummyToTakeSpace;
    } else {
        preallocateStack( currentCount );
    }
}
void
checkStack()
{
    char dummyForAddress;
    // The stack grows down, so a lower address means a deeper stack.
    if ( &dummyForAddress < lowerBound ) {
        throw std::bad_alloc(); // Or something more specific.
    }
}
You'll note that there are a couple of cases of
undefined/unspecified behavior floating around in that code, but
I've used it successfully on a couple of occasions (under
Solaris on Sparc, but Linux on PC works exactly the same in this
regard). It will, in fact, work on almost any system where:
- the stack grows down, and
- local variables are allocated on the stack.
Thus, it will also work on Windows, but if it fails to
allocate the initial stack, you'll have to relink, rather than
just run the program at a moment when there's less activity on
the box (or change ulimits) (since the stack size on Windows
is fixed at link time).
EDIT:
One comment concerning the use of an explicit stack: some
systems (including Linux, by default) overcommit, which means
that you cannot reliably get an out of memory error when
extending an std::vector<>; the system will tell
std::vector<> that the memory is there, and then give the
program a segment violation when it attempts to access it.

The thing about recursion is that you can never guarantee that it will never overflow the stack, unless you can put some bounds on both the (minimum) size of memory and (maximum) size of the input. What you can do, however, is guarantee that it will overflow the stack if you have an infinite loop...
I see your "if() return;" terminating condition, so you should avoid infinite loops as long as every branch of your tree ends in a null. So one possibility is malformed input where some branch of the tree never reaches a null. (This would occur, e.g., if you have a loop in your tree data structure.)
The only other possibility I see is that your tree data structure is simply too big for the amount of stack memory available. (N.B. this is virtual memory and swap space can be used, so it's not necessarily an issue of insufficient RAM.) If that's the case, you may need to come up with some other algorithmic approach that is not recursive. Although your function has a small memory footprint, so unless you've omitted some additional processing that it does, your tree would really have to be REALLY DEEP for this to be an issue. (N.B. it's maximum depth that's an issue here, not the total number of nodes.)

You could increase the stack size for your OS. This is normally configured through ulimit if you're on a Unix-like environment.
E.g. on Linux you can do ulimit -s unlimited which will set the size of the stack to 'unlimited' although IIRC there is a hard limit and you cannot dedicate your entire memory to one process (although one of the answers in the links below mentions an unlimited amount).
My suggestions would be to run ulimit -s which will give you the current size of the stack and if you're still getting a stack overflow double that limit until you're happy.
Have a look here, here and here for a more detailed discussion on the size of the stack and how to update it.

If you have a very large tree, and you are running into issues with overrunning your stack using recursive traversals, the problem is likely that you do not have a balanced tree. The first suggestion then is to try a balanced binary tree, such as red-black tree or AVL tree, or a tree with more than 2 children per node, such as a B+ tree. The C++ library provides std::map<> and std::set<> which are typically implemented using a balanced tree.
However, one simple way to avoid recursive in-order traversals is to modify your tree to be threaded. That is, use the right pointer of leaf nodes to indicate the next node. The traversal of such a tree would look something like this:
n = find_first(n);
while (!is_null(n)) {
    n->print();
    if (n->is_leaf()) n = n->r;
    else n = find_first(n->r);
}

You can add a static variable to keep track of the times the function is called. If it's getting close to what you think would crash your system, perform some routine to notify the user of the error.

Here is a small prototype of the changes, which associates another int variable with the recursive function. You could pass the variable as an argument to the function, starting with zero at the root and incrementing it as you go down the tree.
Drawback: this solution comes at the cost of an extra int being passed on every call.
void inorder(node* n, int depth)
{
    if (n == nullptr) return;
    if (depth < limit) // limit chosen according to the system's stack limit
    {
        inorder(n->l, depth + 1);
        n->print();
        inorder(n->r, depth + 1);
    }
    else
    {
        n->print(); // too deep: print this node but do not descend further
    }
}
Consider for further research:
The problem may lie not with the traversal itself but with the shape of the tree, and could be avoided with better tree creation and updating. Look into balanced trees if you have not already considered them.

Reducing for loop overhead

I have a need to iterate through an amortization formula, which looks like this:
R = ( L * (r / m) ) / ( 1 - pow( (1 + (r / m)), (-1 * m * t) ) );
I'm using a for loop for iteration, and incrementing the L (loan value) by 1 each time. The loop works just fine, but it did make me wonder about something else, which is the value (or lack thereof) in performing basic operations before a loop executes and then referencing those values through a variable. For example, I could further modify this function to look like
// outside for loop
amortization = (r/m)/(1 - pow( (1+(r/m)), (-1*m*t) ) );
// inside for loop
R = L * amortization;
This way, instead of having to perform lots of math operations on every iteration of the loop, I can just reference the variable amortization and perform a single operation.
What I'm wondering is how relevant is this? Is there any actual value in extracting these operations, or is the time saved so small that we're talking about a savings of milliseconds from a for loop that iterates approx. 200,000 times. Follow up question: would extracting operations like this be worth it if I were doing more expensive operations like sqrt?
(note: in case it matters, I'm asking about this specifically with c++ in mind)

Compilers would apply an optimization technique here called loop-invariant code motion. It does pretty much what you did manually, i.e. it extracts a constant part of an expression that is evaluated repeatedly in a loop into a precomputed value stored in a variable (or register). Hence you are not likely to gain any performance by doing this yourself.
Of course if it's critical speed-wise, you should profile and/or review the assembly code produced by compiler in both cases.
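As a sketch of what the manual version of this transformation looks like, the two functions below compute the same payments, one with the loan-independent factor hoisted out of the loop and one with the full formula evaluated on every iteration. The function names and the parameter layout are assumptions for this illustration; r, m, t and L follow the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Payment for each loan value first_loan, first_loan + 1, ..., with the
// loan-independent factor computed once before the loop.
std::vector<double> payments_hoisted(double first_loan, int count,
                                     double r, double m, double t)
{
    assert(count >= 0 && r > 0.0 && m > 0.0);
    const double amortization =
        (r / m) / (1.0 - std::pow(1.0 + r / m, -m * t));
    std::vector<double> result;
    result.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        const double L = first_loan + i;
        result.push_back(L * amortization);
    }
    return result;
}

// Same computation with the whole formula evaluated on every iteration.
std::vector<double> payments_inline(double first_loan, int count,
                                    double r, double m, double t)
{
    assert(count >= 0 && r > 0.0 && m > 0.0);
    std::vector<double> result;
    result.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        const double L = first_loan + i;
        result.push_back((L * (r / m)) /
                         (1.0 - std::pow(1.0 + r / m, -m * t)));
    }
    return result;
}
```

The two versions agree up to floating-point rounding, because the operations are grouped differently. To see whether your compiler already hoists the std::pow call, compare the generated assembly or time both versions with optimization enabled.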

Compilers already move loop invariant code to the outside of the loop. This optimisation is known as "Loop Invariant Code Motion" or "Hoisting Invariants".
If you want to know how much it affects performance then the only way you are going to know is if you try. I would imagine that if you are doing this 200,000 times then it certainly could affect performance (if the compiler doesn't already do it for you).

As others have mentioned, good compilers will do this sort of optimization automatically. However...
First, pow is probably a library function, so your compiler might or might not know it is a "pure" function (i.e., that its behavior depends only on its arguments). If not, it will be unable to perform this optimization, because for all it knows pow might print a message or something.
Second, I think factoring this expression out of the loop makes the code easier to understand, anyway. The human reading your code can also factor out this expression "automatically", but why make them? It just distracts them from your algorithm's flow.
So I would say to make this change regardless.
That said, if you really care about performance, get a decent profiler and do not worry about micro-optimizations until it tells you to. Your priorities early on should be (a) use a decent algorithm and (b) implement it clearly. And not in that order.

When you have optimizations turned on, moving constant expressions outside of a loop is something that compilers are pretty good at doing on their own, so this might not buy you any speed-up anyway.
But if the compiler doesn't do it, this looks like a reasonable thing to try. Then time it, IF this is actually taking longer than you require.

Comparing forward and reverse loop for int with one limit as 0

Consider the example with for loop:
for(int i = 0; i <= NUM; i++); // forward
for(int i = NUM; i >= 0; i--); // reverse
I tested these loops with gcc (linux-64). Without any optimization flag, the forward loop was faster; with optimization at O3/O4, the reverse loop was faster.
I heard somewhere that the forward loop is faster due to better cache replacement techniques.
Personally, I think the reverse loop should be faster (whether NUM is a constant or a variable), because any microprocessor will have a single instruction for the comparison with 0 in i >= 0 (i.e. JLZ (jump if less than zero) or an equivalent).
Is there any deterministic answer to this?

No, there is absolutely no deterministic answer for this. You're looking at two different levels of abstraction.
C++ has absolutely nothing to say about what happens under the covers, performance-wise. It specifies a virtual machine which executes C++ code and, while it covers functionality, it does not cover performance of the underlying environment (a).
Which of those is faster will depend on a variety of factors. You may find yourself running on a CPU which makes no distinction between comparing with an arbitrary value and comparing with zero.
You may find an architecture where incrementing a register is ten times faster than decrementing one, bizarre though that may seem.
You may even find a brain-dead architecture that has no decrement, add or subtract instructions at all, and you have to emulate decrement by calling increment 2^n - 1 times (where n is the word size).
Bottom line: you can't presume to know what's going on under the hood unless you want to look at a very specific CPU, compiler, etc.
You should optimise your code for readability first. If you need to process things in an increasing manner, use the first option. If a decreasing manner, use the latter. If either way seems equally natural, then choose the fastest one, discovered by benchmarking or analysis of the underlying architecture and assembler code. But only do this if you have a specific performance problem, otherwise you're wasting effort.
In any case, since you're almost certainly going to be using i for something, it's likely that whatever tiny increase in performance you get by going the fastest way will be more than swamped by the fact that you now have to calculate NUM-i inside the loop (unless, of course, the compiler is smarter than the developer which, based on what I've seen from gcc, is quite possible).
(a) It does specify certain performance-related things such as the time complexity of some things in the containers library, but not specifically the thing you're asking about, whether forward loops or reverse ones are faster.

The cache replacement techniques only come in effect if there is a conflict. Perhaps NUM isn't big enough for it to have an effect, or perhaps the mapping of virtual to physical memory happens to be favorable for the cache replacement algorithm.
Trying to potentially save a single machine instruction is showing lack of trust for the compiler. If it was that easy, surely the optimizer would know that!

Maybe incrementing a loop variable is so much more common that CPU's branch prediction works better on those.
With compiler optimization, your loop might simply be unrolled (assuming, as I do, that your NUM is a #define constant) and therefore be faster.

This doesn't really answer your question, but here is a thought. How about this loop:
int i = NUM + 1;
while ( i --> 0 )//it looks as if i goes to zero (like in calculus)!
{
}

Coding Practices which enable the compiler/optimizer to make a faster program

Many years ago, C compilers were not particularly smart. As a workaround, K&R invented the register keyword to hint to the compiler that it might be a good idea to keep this variable in an internal register. They also made the ternary operator to help generate better code.
As time passed, the compilers matured. They became very smart, in that their flow analysis allowed them to make better decisions about which values to hold in registers than you possibly could. The register keyword became unimportant.
FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.
What coding practices are available that may enable the compiler/optimizer to generate faster code?
Identifying the platform and compiler you use, would be appreciated.
Why does the technique seem to work?
Sample code is encouraged.
Here is a related question
[Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?

Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:
Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.
Next, profile your code.
Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.
Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i=0; i<numFoo; i++)
    {
        barOut.munge(foo1, foo2[i]);
    }
}
the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i=0; i<numFoo; i++)
    {
        barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

The order in which you traverse memory can have a profound impact on performance, and compilers aren't really good at figuring that out and fixing it. You have to be conscious of cache locality when you write code if you care about performance. For example, two-dimensional arrays in C are laid out in row-major order. Traversing such arrays in column-major order will tend to cause more cache misses and make your program memory bound rather than processor bound:
#define N 1000
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
    for(int j = 0; j < N; j++){
        sum += matrix[i][j];
    }
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
    for(int j = 0; j < N; j++){
        sum += matrix[j][i];
    }
}
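The two traversal orders can be tried out with the following self-contained sketch, which stores an n-by-n matrix row by row in one contiguous std::vector. The function names and the flat-vector layout are assumptions for this illustration. Both functions return the same sum; only the memory access pattern differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The matrix is stored row by row in one contiguous buffer, like a C array:
// element (i, j) lives at index i * n + j.
long long sum_row_major(const std::vector<int>& m, std::size_t n)
{
    assert(m.size() == n * n);
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += m[i * n + j];   // consecutive addresses
    return sum;
}

long long sum_column_major(const std::vector<int>& m, std::size_t n)
{
    assert(m.size() == n * n);
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += m[j * n + i];   // jumps n elements between accesses
    return sum;
}
```

To observe the effect, time both functions (for example with std::chrono::steady_clock) on a matrix much larger than your CPU caches. The size of the difference depends on the hardware and on whether the compiler manages to interchange the loops.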

Generic Optimizations
Here are some of my favorite optimizations. I have actually reduced execution times and program sizes by using these.
Declare small functions as inline or macros
Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method has fewer statements in its content than the combined overhead. These are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)
Remove dead and redundant code
If the code isn't used or does not contribute to the program's result, get rid of it.
Simplify design of algorithms
I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
Loop Unrolling
Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.
Edit: provide an example of loop unrolling
Before:
unsigned int sum = 0;
for (size_t i = 0; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}
After unrolling:
unsigned int sum = 0;
size_t i = 0;
const size_t STATEMENTS_PER_LOOP = 8;
// Main loop: runs only while a full group of 8 bytes remains.
for (i = 0; i + STATEMENTS_PER_LOOP <= BYTES_TO_CHECKSUM; i += STATEMENTS_PER_LOOP)
{
    sum += *buffer++; // 1
    sum += *buffer++; // 2
    sum += *buffer++; // 3
    sum += *buffer++; // 4
    sum += *buffer++; // 5
    sum += *buffer++; // 6
    sum += *buffer++; // 7
    sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}
In this example, a secondary benefit is gained: more statements are executed before the processor has to reload the instruction cache.
I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.
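For readers who want to experiment, here is a small self-contained variant of the checksum example, unrolled by a factor of four and paired with the plain loop so the two can be compared. The function names and the unroll factor are assumptions for this illustration. Modern compilers often unroll such loops themselves at higher optimization levels, so measure before relying on it.

```cpp
#include <cassert>
#include <cstddef>

// Reference version: one byte per iteration.
unsigned int checksum_simple(const unsigned char* buffer, std::size_t size)
{
    assert(buffer != nullptr || size == 0);
    unsigned int sum = 0;
    for (std::size_t i = 0; i < size; ++i)
        sum += buffer[i];
    return sum;
}

// Unrolled version: four bytes per iteration, then a remainder loop.
unsigned int checksum_unrolled(const unsigned char* buffer, std::size_t size)
{
    assert(buffer != nullptr || size == 0);
    unsigned int sum = 0;
    std::size_t i = 0;
    // Main loop: runs only while a full group of four bytes remains.
    for (; i + 4 <= size; i += 4) {
        sum += buffer[i];
        sum += buffer[i + 1];
        sum += buffer[i + 2];
        sum += buffer[i + 3];
    }
    // Remainder: the last 0 to 3 bytes.
    for (; i < size; ++i)
        sum += buffer[i];
    return sum;
}
```

The loop condition i + 4 <= size is what keeps the unrolled body from reading past the end of the buffer when the size is not a multiple of four.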
Reduction of if statements
Processors hate branches, or jumps, since they can force the processor to reload its queue of instructions.
Boolean Arithmetic (Edited: applied code format to code fragment, added example)
Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.
Example:
struct Reader_Interface
{
    virtual bool write(unsigned int value) = 0;
};
struct Rectangle
{
    unsigned int origin_x;
    unsigned int origin_y;
    unsigned int height;
    unsigned int width;
    bool write(Reader_Interface * p_reader)
    {
        bool status = false;
        if (p_reader)
        {
            // Each write runs only if all previous writes succeeded.
            status = p_reader->write(origin_x);
            status = status && p_reader->write(origin_y);
            status = status && p_reader->write(height);
            status = status && p_reader->write(width);
        }
        return status;
    }
};
Factor Variable Allocation outside of loops
If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
Factor constant expressions outside of loops
If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
I/O in blocks
Read and write data in large chunks (blocks). The bigger the better. For example, reading one octet at a time is less efficient than reading 1024 octets with one read.
Example:
static const char Menu_Text[] = "\n"
"1) Print\n"
"2) Insert new customer\n"
"3) Destroy\n"
"4) Launch Nasal Demons\n"
"Enter selection: ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
The efficiency of this technique can be visually demonstrated. :-)
Don't use printf family for constant data
Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.
Format to memory, then write
Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
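A minimal sketch of the format-then-write idea follows. It uses std::snprintf rather than sprintf so that the buffer cannot overflow, which is a substitution on my part; the function names, the report layout and the 256-byte buffer are likewise assumptions for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Formats a small report into `buffer` and returns the number of characters
// written (excluding the terminating '\0'), or 0 if the buffer is too small.
std::size_t format_report(char* buffer, std::size_t capacity,
                          const char* name, int balance)
{
    assert(buffer != nullptr && name != nullptr);
    std::size_t used = 0;
    // Constant section.
    int n = std::snprintf(buffer + used, capacity - used, "Customer: ");
    if (n < 0 || static_cast<std::size_t>(n) >= capacity - used) return 0;
    used += static_cast<std::size_t>(n);
    // Variable section.
    n = std::snprintf(buffer + used, capacity - used,
                      "%s, balance %d\n", name, balance);
    if (n < 0 || static_cast<std::size_t>(n) >= capacity - used) return 0;
    used += static_cast<std::size_t>(n);
    return used;
}

// Sends the whole formatted block with a single fwrite call.
bool write_report(std::FILE* out, const char* name, int balance)
{
    char buffer[256];
    const std::size_t length =
        format_report(buffer, sizeof buffer, name, balance);
    return length > 0 && std::fwrite(buffer, 1, length, out) == length;
}
```

Calling write_report(stdout, "Ada", 42) would emit the line "Customer: Ada, balance 42" in one write instead of several formatted output calls.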
Declare constant text (string literals) as static const
When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
Lastly, Code like the compiler would
Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.

The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.
That said, avoid calling a small function defined in another source file from an inner loop, as that usually stops it from being inlined.
Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.
Which leads to the next point, read the ^#$# manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__((aligned(N))) there. If you want something very specific from the optimizer, you might have to be specific.

On most modern processors, the biggest bottleneck is memory.
Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another and know that they are disjoint, carefully putting a no-alias qualifier (restrict in C99, __restrict__ in GCC) on the function parameters can really help the compiler generate faster code. However, if the memory regions do overlap and you used restrict anyway, you're in for a good debugging session of undefined behavior!
Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.
Also, don't convert floating point values to int and vice versa more than necessary. On some architectures they use different register sets, so converting from one type to the other means executing the actual conversion instruction, writing the value to memory and reading it back into the proper register set.
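For the aliasing point above, a small sketch of a no-alias promise on function parameters. The function add_scaled is hypothetical, and __restrict__ is the GCC/Clang spelling (C99 has restrict, MSVC has __restrict):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The __restrict__ qualifiers promise that dst and src never overlap.
void add_scaled(float* __restrict__ dst, const float* __restrict__ src,
                float k, std::size_t n) {
    // Debug-only check of the no-overlap promise, done on raw addresses.
    const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
    const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
    const std::uintptr_t bytes = n * sizeof(float);
    assert(d + bytes <= s || s + bytes <= d);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += k * src[i];  // a store to dst cannot change any src element
    }
}
```

If the two ranges do overlap, the promise is broken and the behavior is undefined, which is why the debug assertion is there.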

The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.
However, I would remind people that for the code to be optimised you have to tell the compiler to optimise it. Lots of people (including me when I forget) post C++ benchmarks here that are meaningless because the optimiser was not enabled.

Use const correctness as much as possible in your code. It allows the compiler to optimize much better.
This document has loads of other optimization tips: CPP optimizations (a somewhat old document, though)
Highlights:
use constructor initialization lists
use prefix operators
use explicit constructors
inline functions
avoid temporary objects
be aware of the cost of virtual functions
return objects via reference parameters
consider per class allocation
consider stl container allocators
the 'empty member' optimization
etc

Attempt to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that's what most compilers convert your code to in order to do their optimizations, because it's easier to work with. By doing this, places where the compiler might get confused are brought to light. It also makes all but the worst register allocators work as well as the best register allocators, and allows you to debug more easily, because you almost never have to wonder where a variable got its value from, as there was only one place it was assigned.
Avoid global variables.
When working with data by reference or pointer pull that into local variables, do your work, and then copy it back. (unless you have a good reason not to)
Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:
x= f();
if(!x){
a();
} else if (x<0){
b();
} else {
c();
}
is almost always cheaper than testing for other constants.
Another trick is to use subtraction to eliminate one compare in range testing.
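#define FOO_MIN 8
#define FOO_MAX 199
/* Returns 1 if FOO_MIN <= foo <= FOO_MAX, using a single comparison. */
int good_foo(int foo) {
    /* Unsigned subtraction wraps to a huge value when foo < FOO_MIN. */
    unsigned int bar = (unsigned int)foo - FOO_MIN;
    int rc = (bar <= (unsigned int)(FOO_MAX - FOO_MIN)) ? 1 : 0;
    return rc;
}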
This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping up with the result of the first comparison while doing the second and then combining them.
This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.
When using the string functions in C (strcpy, memcpy, ...) remember what they return -- the destination! You can often get better code by 'forgetting' your copy of the pointer to the destination and just grabbing it back from the return value of these functions.
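A tiny sketch of using the returned destination pointer. The helper join2 and its parameters are invented for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: builds "<a><b>" in dst and returns dst.
// strcpy and strcat both return their destination argument, so the
// calls chain and the final result can be returned directly.
char* join2(char* dst, std::size_t cap, const char* a, const char* b) {
    assert(std::strlen(a) + std::strlen(b) + 1 <= cap);  // caller guarantees room
    return std::strcat(std::strcpy(dst, a), b);
}
```

Here the function never needs to keep its own copy of dst alive across the two calls.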
Never overlook the opportunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up on that:
foo_t * make_foo(int a, int b, int c) {
foo_t * x = malloc(sizeof(foo_t));
if (!x) {
// return NULL;
return x; // x is NULL, already in the register used for returns, so duh
}
x->a= a;
x->b = b;
x->c = c;
return x;
}
Of course, you could reverse the logic on that if and only have one return point.
(tricks I recalled later)
Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function, then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into the registers or stack positions that called functions usually expect their parameters to be in (the compiler has to deviate in both the called function and at every call site to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need, and avoid generating code to preserve variable values that are in registers or memory locations that the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benefit of inlining code, but without actually inlining.

I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do their job; otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the parameters. If you look at the well-known open-source libraries, they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop. The loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch can be faster than doing a test like if(a || b || ... || z) when the conditions compare one value against several constants. Check first whether your compiler does this automatically; some do, and the if is more readable.
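The tip above about hoisting calls out of loops can be sketched as follows. The function count_spaces is made up, and strlen is one of the standard functions that some compilers already model and hoist on their own; the same pattern applies to any large or external function without side effects:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical example: strlen(s) does not change during the loop, so it is
// called once up front instead of in the loop condition on every iteration.
std::size_t count_spaces(const char* s) {
    assert(s != nullptr);
    const std::size_t len = std::strlen(s);  // hoisted call
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (s[i] == ' ') ++count;
    }
    return count;
}
```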

In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow on some platforms (e.g., vxworks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean, but after the device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worst-case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.
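To make the memory pool idea concrete, here is a minimal fixed-size block pool. It is only a sketch under simple assumptions (single-threaded, one block size, storage embedded in the object); the class name FixedPool and its interface are invented for this example and are not the glib slice API:

```cpp
#include <cassert>
#include <cstddef>

// Minimal fixed-size block pool (illustrative, not thread-safe).
// All storage lives inside the object, so no heap calls are made and
// both acquire() and release() take constant time.
template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
    union Block {
        Block* next;  // used while the block is on the free list
        alignas(std::max_align_t) unsigned char bytes[BlockSize];  // used while handed out
    };
    Block blocks_[BlockCount];
    Block* free_list_;

public:
    FixedPool() : free_list_(nullptr) {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            blocks_[i].next = free_list_;
            free_list_ = &blocks_[i];
        }
    }
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Returns a block of BlockSize bytes, or nullptr when the pool is exhausted.
    void* acquire() {
        if (free_list_ == nullptr) return nullptr;
        Block* b = free_list_;
        free_list_ = b->next;
        return b;
    }

    // Gives a block previously obtained from acquire() back to the pool.
    void release(void* p) {
        Block* b = static_cast<Block*>(p);
        assert(b >= blocks_ && b < blocks_ + BlockCount);  // must come from this pool
        b->next = free_list_;
        free_list_ = b;
    }
};
```

Because every block has the same size, the pool cannot fragment, and exhaustion shows up as a nullptr that the caller can handle deterministically.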

A dumb little tip, but one that will save you some microscopic amounts of speed and code.
Always pass function arguments in the same order.
If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).
The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.
While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.

Two coding techniques I didn't see in the above list:
Bypass the linker by writing code as a single source
While separate compilation is really nice for compile time, it is very bad when it comes to optimization. Basically, the compiler can't optimize beyond the compilation unit; that is the linker's domain.
But if you design your program well, you can also compile it through a single common source. That is, instead of compiling unit1.c and unit2.c and then linking both objects, compile all.c, which merely #includes unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.
It's very much like writing header-only programs in C++ (and even easier to do in C).
This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware that it changes part of the C semantics, and you can run into problems such as collisions between static variables or macros. For most programs it's easy enough to overcome the small problems that occur. Also be aware that compiling as a single source is much slower and may take a huge amount of memory (usually not a problem with modern systems).
Using this simple technique I managed to make some programs I wrote ten times faster!
Like the register keyword, this trick could also become obsolete soon. Optimizing through the linker is beginning to be supported by compilers, e.g. gcc: Link time optimization.
Separate atomic tasks in loops
This one is trickier. It's about the interaction between algorithm design and the way the optimizer manages cache and register allocation. Quite often programs have to loop over some data structure and perform some actions for each item. Quite often the actions performed can be split into two logically independent tasks. If that is the case, you can write exactly the same program with two loops over the same range, each performing exactly one task. In some cases writing it this way can be faster than the single loop (the details are more complex, but one explanation is that in the simple-task case all variables can be kept in processor registers, while in the more complex one that is not possible, so some registers must be written to memory and read back later, and that cost is higher than the additional flow control).
Be careful with this one (profile performance with and without this trick), as, like using register, it may just as well reduce performance as improve it.
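A small illustration of splitting one loop into two. The functions fused and split are hypothetical, and the tasks are deliberately trivial, so this only shows the shape of the transformation and not a guaranteed speedup:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One loop doing two independent tasks per element.
double fused(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        sum += in[i];           // task 1: reduction
        out[i] = in[i] * 2.0;   // task 2: element-wise transform
    }
    return sum;
}

// The same work split into two loops over the same range, one task each.
double split(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        sum += in[i];
    }
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * 2.0;
    }
    return sum;
}
```

Both versions produce identical results; which one is faster depends on the per-item work, the register pressure and the cache behavior, so measure both.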

I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.

Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.
Example:
int fac2(int x, int cur) {
if (x == 1) return cur;
return fac2(x - 1, cur * x);
}
int fac(int x) {
return fac2(x, 1);
}
Of course this example doesn't have any bounds checking.
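For comparison, this is roughly the loop that tail-call elimination can turn fac2 into. The names fac2_loop and fac_loop are made up here, and the exact code a given compiler emits will differ:

```cpp
#include <cassert>

// Loop form of fac2: the tail call becomes an update of the two
// parameters followed by a jump back to the top.
int fac2_loop(int x, int cur) {
    assert(x >= 1);  // like the recursive version, no overflow checking
    while (x != 1) {
        cur = cur * x;
        x = x - 1;
    }
    return cur;
}

int fac_loop(int x) {
    return fac2_loop(x, 1);
}
```

The loop uses a constant amount of stack no matter how large x is, which is the benefit the optimization buys.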
Late Edit
While I have no direct knowledge of the code, it seems clear that the requirements for using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.

Don't do the same work over and over again!
A common antipattern that I see goes along these lines:
void Function()
{
MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}
The compiler actually has to call all of those functions every time, since it usually cannot prove that they keep returning the same object. Assuming you, the programmer, know that the aggregated object isn't changing over the course of these calls, for the love of all that is holy...
void Function()
{
MySingleton* s = MySingleton::GetInstance();
AggregatedObject* ao = s->GetAggregatedObject();
ao->DoSomething();
ao->DoSomethingElse();
ao->DoSomethingCool();
ao->DoSomethingReallyNeat();
ao->DoSomethingYetAgain();
}
In the case of the singleton getter the calls may not be too costly, but there is certainly a cost (typically, "check to see if the object has been created; if it hasn't, create it; then return it"). The more complicated this chain of getters becomes, the more time is wasted.

Use the most local scope possible for all variable declarations.
Use const whenever possible
Don't use register unless you plan to profile both with and without it
The first two of these, especially the first, help the optimizer analyze the code. They especially help it make good choices about which variables to keep in registers.
Blindly using the register keyword is as likely to hurt as to help your optimization. It's just too hard to know what will matter until you look at the assembly output or profile.
There are other things that matter to getting good performance out of code, such as designing your data structures to maximize cache locality. But the question was about the optimizer.

Align your data to native/natural boundaries.

I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.
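A stripped-down sketch of the small-string idea described above. The class SmallString is a stand-in written for this example and not the original library's class; copying is disabled to keep it short:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative small-string class: text that fits in 28 bytes (terminator
// included) is stored in the embedded buffer; longer text goes to the heap.
class SmallString {
    char* data_;    // points at buf_ or at a heap block
    char buf_[28];  // embedded storage: up to 27 characters plus '\0'

public:
    explicit SmallString(const char* s) : data_(buf_) {
        assert(s != nullptr);
        const std::size_t len = std::strlen(s);
        if (len >= sizeof buf_) {
            data_ = new char[len + 1];   // too long for the embedded buffer
        }
        std::memcpy(data_, s, len + 1);  // copies the terminator too
    }
    ~SmallString() {
        if (data_ != buf_) delete[] data_;
    }
    // Copying is disabled to keep the sketch short.
    SmallString(const SmallString&) = delete;
    SmallString& operator=(const SmallString&) = delete;

    const char* c_str() const { return data_; }
    bool is_inline() const { return data_ == buf_; }
};
```

Short strings never touch the allocator, which is where the reduction in small allocations and fragmentation would come from.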

A neat technique I learned from @MSalters' comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:
// before
BigObject a, b;
if(condition)
return a;
else
return b;
// after
BigObject a, b;
if(condition)
swap(a,b);
return a;

If you've got small functions you call repeatedly, I have in the past got large gains by putting them in headers as "static inline". Function calls on x86 are surprisingly expensive.
Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.
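As a sketch of replacing recursion with an explicit stack, here is a tree sum written both ways. The Node type and function names are invented for the example, and std::vector is used as the stack, which means the iterative version trades call-stack usage for heap allocations:

```cpp
#include <cassert>
#include <vector>

// Hypothetical binary tree node.
struct Node {
    int value;
    Node* left;
    Node* right;
};

// Recursive version, for comparison.
int sum_recursive(const Node* n) {
    if (n == nullptr) return 0;
    return n->value + sum_recursive(n->left) + sum_recursive(n->right);
}

// Same traversal with an explicit stack instead of function calls.
int sum_iterative(const Node* root) {
    std::vector<const Node*> stack;
    if (root != nullptr) stack.push_back(root);
    int total = 0;
    while (!stack.empty()) {
        const Node* n = stack.back();
        stack.pop_back();
        assert(n != nullptr);  // only non-null nodes are ever pushed
        total += n->value;
        if (n->left != nullptr) stack.push_back(n->left);
        if (n->right != nullptr) stack.push_back(n->right);
    }
    return total;
}
```

Whether the iterative form wins depends on tree depth and allocator cost, so it is worth profiling both before committing to the rewrite.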

Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.
Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.
I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.
Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.
And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.
Regards
Mark
When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1

When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.

For performance, focus first on writing maintainable code (componentized, loosely coupled, etc.), so that when you have to isolate a part to rewrite, optimize or simply profile it, you can do so without much effort.
The optimizer will only help your program's performance marginally.

You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say
Assume that the program has been
written correctly, compiled with full
optimization, tested and put into
production.
In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.
If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be achieved in every program as first written, but in my experience they can in many (except for very small programs).
After that is done, micro-optimization (of the hot-spots) can give you a good payoff.

I use the Intel compiler, on both Windows and Linux.
When more or less done, I profile the code, then focus on the hotspots and try to change the code to let the compiler do a better job.
If the code is computational and contains a lot of loops, the vectorization report in the Intel compiler is very helpful; look for 'vec-report' in the help.
So the main idea: polish the performance-critical code. As for the rest, the priority is to be correct and maintainable: short functions and clear code that can be understood a year later.

One optimization I have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has a benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults associated with construction.
As I generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.

One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.
The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.

I have long suspected, but never proved that declaring arrays so that they hold a power of 2, as the number of elements, enables the optimizer to do a strength reduction by replacing a multiply by a shift by a number of bits, when looking up individual elements.

Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.
