State of "memset" functionality in C++ with modern compilers


Context:
A while ago, I stumbled upon this 2001 DDJ article by Alexandrescu:
http://www.ddj.com/cpp/184403799
It's about comparing various ways to initialize a buffer to some value, like what "memset" does for single-byte values. He compared various implementations (memcpy, explicit "for" loop, Duff's device) and did not really find a best candidate across all dataset sizes and all compilers.
Quote:
There is a very deep, and sad, realization underlying all this. We are in 2001, the year of the Spatial Odyssey. (...) Just step out of the box and look at us — after 50 years, we're still not terribly good at filling and copying memory.
Question:
Does anyone have more recent information about this problem? Do recent GCC and Visual C++ implementations perform significantly better than 7 years ago?
I'm writing code that has a lifetime of 5+ (probably 10+) years and that will process array sizes from a few bytes to hundreds of megabytes. I can't assume that my choices now will still be optimal in 5 years. What should I do:
a) use the system's memset (or equivalent) and forget about optimal performance or assume the runtime and compiler will handle this for me.
b) benchmark once and for all on various array sizes and compilers and switch at runtime between several routines.
c) run the benchmark at program initialization and switch at runtime based on accurate (?) data.
Edit: I'm working on image processing software. My array items are PODs and every millisecond counts !
Edit 2: Thanks for the first answers, here is some additional information: Buffer initialization may represent 20%-40% of the total runtime of some algorithms. The platform may vary in the next 5+ years, although it will stay in the "fastest CPU money can buy from DELL" category. Compilers will be some form of GCC and Visual C++. No embedded stuff or exotic architectures on the radar. I'd like to hear from people who had to update their software when MMX and SSE appeared, since I'll have to do the same when "SSE2015" becomes available... :)

The DDJ article acknowledges that memset is the best answer, and much faster than what he was trying to achieve:
There is something sacrosanct about C's memory manipulation functions memset, memcpy, and memcmp. They are likely to be highly optimized by the compiler vendor, to the extent that the compiler might detect calls to these functions and replace them with inline assembler instructions — this is the case with MSVC.
So, if memset works for you (i.e. you are initializing with a single byte) then use it.
Whilst every millisecond may count, you should establish what percentage of your execution time is lost to setting memory. It is likely very low (1 or 2%??) given that you have useful work to do as well. Given that, the optimization effort would likely have a much better rate of return elsewhere.
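To make the "single byte" condition concrete, here is a minimal sketch. The Pixel type and the function names are assumptions made up for this example, not something from the article. memset only repeats one byte, so a wider value needs an element-wise fill such as std::fill, which optimizing compilers typically vectorize anyway.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 16-bit grayscale pixel; any trivially copyable type works.
using Pixel = std::uint16_t;

// Byte-wise fill: every byte of the buffer becomes `value`.
inline void clear_bytes(std::vector<Pixel>& buf, unsigned char value) {
    if (!buf.empty())
        std::memset(buf.data(), value, buf.size() * sizeof(Pixel));
}

// Element-wise fill: needed when the pattern is wider than one byte.
inline void fill_pixels(std::vector<Pixel>& buf, Pixel value) {
    std::fill(buf.begin(), buf.end(), value);
}
```

Note that clear_bytes(buf, 0xAB) leaves every 16-bit pixel equal to 0xABAB, which is only what you want when the pattern really is one repeated byte (zero being the common case).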

The MASM Forum has a lot of incredible assembly language programmers/hobbyists who have beaten this issue completely to death (have a look through The Laboratory). The results were much like Christopher's response: SSE is incredible for large, aligned, buffers, but going down you will eventually reach such a small size that a basic for loop is just as quick.

Memset/memcpy are mostly written with a basic instruction set in mind, and so can be outperformed by specialized SSE routines, which on the other hand enforce certain alignment constraints.
But to reduce it to a list :
For data-sets <= several hundred kilobytes memcpy/memset perform faster than anything you could mock up.
For data-sets > megabytes use a combination of memcpy/memset to get the alignment and then use your own SSE optimized routines/fallback to optimized routines from Intel etc.
Enforce the alignment at the start up and use your own SSE-routines.
This list only comes into play for things where you need the performance. Too small/or once initialized data-sets are not worth the hassle.
Here is an implementation of memcpy from AMD, I can't find the article which described the concept behind the code.
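As a rough sketch of the "memset to reach alignment, then your own SSE routine" idea from the list above (this is not the AMD code, and the function name is made up for the example): the head up to the next 16-byte boundary and the tail go through memset, the body uses aligned SSE2 stores. It falls back to plain memset when SSE2 is not available at compile time. Routines meant for very large buffers may use non-temporal stores such as _mm_stream_si128 instead; whether that helps is something to measure, not assume.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2_FILL 1  // hypothetical feature macro for this sketch
#endif

// Hypothetical helper: fill `size` bytes at `dst` with `value`.
void fill_bytes_sse2(void* dst, unsigned char value, std::size_t size) {
    assert(dst != nullptr || size == 0);
    unsigned char* p = static_cast<unsigned char*>(dst);
#if defined(HAVE_SSE2_FILL)
    // Head: let memset handle the bytes up to the next 16-byte boundary.
    const std::size_t misalign = reinterpret_cast<std::uintptr_t>(p) & 15u;
    std::size_t head = misalign ? 16u - misalign : 0u;
    if (head > size) head = size;
    if (head) std::memset(p, value, head);
    p += head;
    size -= head;

    // Body: aligned 16-byte stores.
    const __m128i v = _mm_set1_epi8(static_cast<char>(value));
    for (; size >= 16; p += 16, size -= 16)
        _mm_store_si128(reinterpret_cast<__m128i*>(p), v);
#endif
    // Tail (or the whole buffer when SSE2 is unavailable).
    if (size) std::memset(p, value, size);
}
```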

d) Accept that trying to play "jedi mind tricks" with the initialization will lead to more lost programmer hours than the cumulative milliseconds difference between some obscure but fast method versus something obvious and clear.

It depends what you're doing. If you have a very specific case, you can often vastly outperform the system libc (and/or compiler inlining) of memset and memcpy.
For example, for the program I work on, I wrote a 16-byte-aligned memcpy and memset designed for small data sizes. The memcpy was made for multiple-of-16 sizes greater than or equal to 64 only (with data aligned to 16), and memset was made for multiple-of-128 sizes only. These restrictions allowed me to get enormous speed, and since I controlled the application, I could tailor the functions specifically to what was needed, and also tailor the application to align all necessary data.
The memcpy performed at about 8-9x the speed of the Windows native memcpy, knocking a 460-byte copy down to a mere 50 clock cycles. The memset was about 2.5x faster, filling a stack array of zeros extremely quickly.
If you're interested in these functions, they can be found here; drop down to around line 600 for the memcpy and memset. They're rather trivial. Note they're designed for small buffers that are supposed to be in cache; if you want to initialize enormous amounts of data in memory while bypassing cache, your issue may be more complex.

You can take a look at liboil; they (try to) provide different implementations of the same function and choose the fastest on initialization. Liboil has a pretty liberal licence, so you can also use it in proprietary software.
http://liboil.freedesktop.org/
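The general idea of timing candidates at initialization and keeping the winner can be sketched without liboil. Everything below (the FillFn type, the three candidates, pick_fastest_fill) is hypothetical and only illustrates the approach; it does not use or imitate liboil's API. A single short timing run like this is noisy, so a real version would repeat the measurement and probably pick per size class.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

using FillFn = void (*)(unsigned char*, unsigned char, std::size_t);

void fill_memset(unsigned char* p, unsigned char v, std::size_t n) {
    if (n) std::memset(p, v, n);
}
void fill_std(unsigned char* p, unsigned char v, std::size_t n) {
    std::fill(p, p + n, v);
}
void fill_loop(unsigned char* p, unsigned char v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] = v;
}

// Time each candidate on a scratch buffer of `size` bytes; return the fastest.
FillFn pick_fastest_fill(std::size_t size, int repeats = 16) {
    assert(size > 0 && repeats > 0);
    const FillFn candidates[] = {fill_memset, fill_std, fill_loop};
    std::vector<unsigned char> scratch(size);
    FillFn best = candidates[0];
    auto best_time = std::chrono::steady_clock::duration::max();
    for (FillFn fn : candidates) {
        const auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            fn(scratch.data(), static_cast<unsigned char>(r), size);
        const auto elapsed = std::chrono::steady_clock::now() - start;
        // Read the buffer back so the stores cannot be discarded as dead.
        volatile unsigned char sink = scratch[size / 2];
        (void)sink;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = fn;
        }
    }
    return best;
}
```

At startup you would call something like pick_fastest_fill(typical_size) once and store the returned pointer for later use.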

Well, this all depends on your problem domain and your specifications. Have you run into performance issues, failed to meet a timing deadline, and pinpointed memset as the root of all evil? If so, you're in the one and only case where you could consider some memset tuning.
Then you should also keep in mind that memset will vary with the hardware and platform it runs on. During those five years, will the software run on the same platform? On the same architecture? Once you come to that conclusion, you can try to 'roll your own' memset, typically playing with the alignment of buffers and making sure you zero 32-bit values at once, depending on what is most performant on your architecture.
I once ran into the same issue with memcmp, where the alignment overhead caused some problems, but typically this will not result in miracles, only a small improvement, if any. If you're missing your requirements by an order of magnitude, then this won't get you any further.

If memory is not a problem, then precreate a static buffer of the size you need, initialized to your value(s). As far as I know, both these compilers are optimizing compilers, so if you use a simple for-loop, the compiler should generate the optimum assembler-commands to copy the buffer across.
If memory is a problem, use a smaller buffer & copy that across at sizeof(..) offsets into the new buffer.
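A minimal sketch of the "smaller buffer copied at offsets" variant, assuming trivially copyable elements. The name tile_fill and its parameters are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Fill `dst` (dst_count elements) by copying a pre-initialized `tile`
// (tile_count elements) at successive offsets; the last copy may be partial.
template <typename T>
void tile_fill(T* dst, std::size_t dst_count, const T* tile, std::size_t tile_count) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires a trivially copyable type");
    assert(tile_count > 0);
    for (std::size_t off = 0; off < dst_count; off += tile_count) {
        const std::size_t n = std::min(tile_count, dst_count - off);
        std::memcpy(dst + off, tile, n * sizeof(T));
    }
}
```

The tile size is a tuning knob: a tile that stays in cache keeps the source reads cheap, so the cost is dominated by the writes to the destination.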
HTH

I would always choose an initialization method that is part of the runtime or OS (memset) I am using (worst case, pick one that is part of a library that I am using).
Why: If you are implementing your own initialization, you might end up with a marginally better solution now, but it is likely that in a couple of years the runtime has improved. And you don't want to do the same work that the guys maintaining the runtime do.
All this stands if the improvement in runtime is marginal. If you have a difference of an order of magnitude between memset and your own initialization, then it makes sense to have your code running, but I really doubt this case.

If you have to allocate your memory as well as initialize it, I would:
Use calloc instead of malloc
Change as many of my default values to be zero as possible (ex: let my default enumeration value be zero; or if a boolean variable's default value is 'true', store its inverse value in the structure)
The reason for this is that calloc zero-initializes memory for you. While this will involve the overhead for zeroing memory, most runtimes are likely to have this routine highly optimized -- more optimized than malloc/new with a call to memset.
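A small sketch of that layout, with a hypothetical Settings record whose defaults are all zero (the names are made up for the example). The boolean is stored inverted, as suggested above, so that "enabled by default" becomes an all-zero bit pattern.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical record laid out so that all-zero bytes are its default state.
enum class Mode : int { Default = 0, Fast = 1 };

struct Settings {
    Mode mode;      // default enumerator is 0
    int retries;    // default is 0
    bool disabled;  // "enabled" defaults to true, so its inverse is stored
};

// Allocate `count` zero-initialized records; release them with std::free.
// Returns nullptr if the allocation fails.
Settings* make_settings(std::size_t count) {
    return static_cast<Settings*>(std::calloc(count, sizeof(Settings)));
}
```

This only works for plain data types like the one shown; anything with constructors or non-zero defaults still needs explicit initialization.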

As always with these types of questions, the problem is constrained by factors outside of your control, namely, the bandwidth of the memory. And if the host OS decides to start paging the memory then things get far worse. On Win32 platforms, the memory is paged and pages are only allocated on first use which will generate a big pause every page boundary whilst the OS finds a page to use (this may require another process' page to be paged to disk).
This, however, is the absolute fastest memset ever written:
void memset (void *memory, size_t size, byte value)
{
}
Not doing something is always the fastest way. Is there any way the algorithms can be written to avoid the initial memset? What are the algorithms you're using?

The year isn't 2001 anymore. Since then, new versions of Visual Studio have appeared. I've taken the time to study the memset in those. They will use SSE for memset (if available, of course). If your old code was correct, statistically it will now be faster. But you might hit an unfortunate corner case.
I expect the same from GCC, although I haven't studied the code. It's a fairly obvious improvement, and an Open-Source compiler. Someone will have created the patch.

Related

How to measure sequential memory read speed in C/C++

The problem does not take CPU cache into consideration. That is, let the cache do its job (let cpu cache improve the performance).
My idea is to allocate a big enough chunk of memory (so that not all of it fits into cache), treat it as an array of one data type (like int), and do additions over it to keep the compiler from completely optimizing away the code that reads the memory. The problem is: does the data type affect the measurement? Or is there a more general way of doing it?
EDIT: My question might have been a bit misleading before. An example is AIDA64's memory and cache benchmark, which is able to measure the memory read/write speed as well as latency. I want to get a general idea of how it is done.

Microbenchmarks like this are not easy in C/C++. The amount of time something takes in C++ is not a specified aspect of the language. Indeed, for every use case except this one, faster is better, so compilers are encouraged to do smart things.
The trick to these is to write the benchmark, compile it, and then look at the assembly to see whether it's doing clever tricks. Or, at the very least, check to make sure that it makes sense (accessing more memory = more time).
Compilers are getting smart. Addition is not always enough. More than once I've had Visual Studio realize what I was doing to construct the microbenchmark and compile it all away.
At the moment, I am having good luck using the argc argument passed into main as a seed, and using a cryptographic hash like SHA1 or MD5 to fill the data. This tends to be enough to trick the compiler into actually writing all of the reads. But verify your results. There's no guarantee that a new compiler doesn't get even smarter.
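A minimal sketch of such a measurement, assuming a runtime-dependent seed (argc, for instance) and a simple std::iota fill in place of the cryptographic hash mentioned above. ReadResult and measure_sequential_read are names made up for this example. The buffer has to be much larger than the last-level cache for the number to say anything about main memory, and the generated assembly should still be checked as described above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct ReadResult {
    std::uint64_t checksum;       // returned so the loads have a visible effect
    double gigabytes_per_second;  // 0.0 if the clock could not resolve the pass
};

// Sum `words` 64-bit values whose contents depend on `seed`, and time the pass.
ReadResult measure_sequential_read(std::size_t words, std::uint64_t seed) {
    assert(words > 0);
    std::vector<std::uint64_t> data(words);
    std::iota(data.begin(), data.end(), seed);  // seed, seed + 1, seed + 2, ...

    const auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::uint64_t v : data) sum += v;
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double bytes = static_cast<double>(words) * sizeof(std::uint64_t);
    return {sum, seconds > 0.0 ? bytes / seconds / 1e9 : 0.0};
}
```

Printing the checksum in main is what keeps the summation loop from being removed as dead code.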

What are some good guidelines for choosing the size of integer types?

I've been searching around a bit, and I haven't really come up with an answer for this.
When I'm programming on embedded devices with limited memory, I'm generally in the habit of using the smallest integral/floating point type that will do the job, for instance, if I know that a counter will always be between zero and 255, I'll declare it as a uint8_t.
However, in less memory-constrained environments, I'm used to just using int for everything, as per the Google C++ Styleguide. When I look at existing code, it often tends to be done this way.
To be clear, I get the rationale behind doing this, (Google explains it very well), but I'm not precisely clear on the rationale behind doing things the first way.
It seems to me that reducing the memory footprint of your program, even on a system where you don't care about memory usage, would be good for overall speed, since, logically, less overall data would mean more of it could fit in CPU cache.
Complicating matters, however, is the fact that compilers will automatically pad data and align it to boundaries such that it can be fetched in a single bus cycle. I guess, then, it comes down to whether or not compilers are smart enough to take, say, two 32-bit integers and stick them together in a single 64-bit block vs. individually padding each one to 64 bits.
I suppose whether or not the CPU itself could take advantage of this also depends on its exact internals, but the idea that optimizing memory size improves performance, particularly on newer processors, is evidenced in the fact that the Linux kernel relied for a while on gcc's -Os option for an overall performance boost.
So I guess that my question is why the Google method seems to be so much more prevalent in actual code. Is there a hidden cost here that I'm missing?

The usual reason that the "google method" is commonly used is that int is often good enough, and it is typically the first option taught in beginner's material. It also takes more effort (man hours, etc) to optimise nontrivial code for "limited memory" - effort which is pointless if not actually needed.
If the "actual code" is written for portability, then int is a good default choice.
Whether written for portability or not, a lot of programs are only ever run on hosts with sufficient memory resources and with an int type that can represent the required range of values. This means it is not necessary to worry about memory usage (e.g. optimising size of variables based on the specific range of values they need to support) and the program just works.
Programming for "limited memory" is certainly common, but not typically why most code is written. Quite a few modern embedded systems have more than enough memory and other resources, so the techniques are not always needed for them.
A lot of code written for what you call "limited memory" also does not actually need to be. There is a point, as programmers learn more, that a significant number start indulging in premature optimisation - worrying about performance or memory usage, even when there is no demonstrated need for them to do so. While there is certainly a significant body of code written for "limited memory" because of a genuine need, there is a lot more such code written due to premature optimisation.

"embedded devices ... counter between zero and 255, I'll declare it as a uint8_t"
That might be counterproductive. Especially on embedded systems, 8 bit fetches might be slower. Besides, a counter is likely in a register, and there's no benefit in using half a register.
The chief reason to use uint8_t is when you have a contiguous set of them. Could be an array, but also adjacent members in a class.
As the comments already note, -Os is unrelated - its benefit is that with smaller code, the memory bus has more bandwidth left for data.
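To illustrate the "contiguous set" point: the two hypothetical structs below hold the same four fields, once as uint8_t and once as int. On typical platforms with a 32-bit int the first is 4 bytes and the second 16, so an array of the narrow type puts four times as many elements in each cache line. The exact sizes are implementation-defined, which is why the check only compares them.

```cpp
#include <cstddef>
#include <cstdint>

struct NarrowSample {  // four adjacent 8-bit members
    std::uint8_t r, g, b, a;
};

struct WideSample {    // the same fields stored as int
    int r, g, b, a;
};

// Bytes needed for `count` contiguous elements of type T.
template <typename T>
constexpr std::size_t array_bytes(std::size_t count) {
    return count * sizeof(T);
}

static_assert(sizeof(NarrowSample) < sizeof(WideSample),
              "narrow members pack more densely than int members");
```

For a single local counter none of this matters, which is the point made above about registers.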

From my experience, 90% of all code in a bigger project does not need particular optimization, since 95% of all memory consumption and of all execution time is spent in less than 10% of the code you write. In the rest of the code, try to emphasize simplicity and maintainability. Mostly, that means using ints or size_t as integer types. Usually, there is no need to optimize the size of local variables, but it can make sense if you have a lot of instances of a type in a large array. Item 6 in the excellent book C++ Coding Standards: 101 Rules, Guidelines and Best Practices (C++ In-Depth) by Herb Sutter and Andrei Alexandrescu says:
"Correctness, simplicity, and clarity come first."
Most importantly, understand where these less than 10% of code are, that really need optimization. Otherwise, keep interfaces simple and uniform.

Nice discussion! But I wonder why nobody speaks about CPU register size, memory bus architecture, CPU architecture and so on. Saying "int is best" is not general at all. If you have small embedded systems like an 8-bit AVR, int is a very bad choice for a counter running from 0 .. 255.
And using int on ARM where you maybe have a 16 bit bus interface can also be a very bad idea if you really only need 16 bits or less.
As for all optimizations: look at the code the compiler produces, measure how long actions really take, and check memory consumption on heap/stack if necessary. It makes no sense to hand-craft unmaintainable code to save 8 bits somewhere if your hardware still has megabytes left.
Using tools like valgrind and the profiling supported by the target/compiler gives far more insight than any theoretical discussion here.
There is no general "best integer type"! It always depends on CPU architecture, memory bus, caches and more.

How to allocate a memory block and place it into Cache?

I want to dynamically allocate a memory block for an array in C/C++, and this array will be accessed at a high frequency. So I want this array to stay on chip, i.e., in the Cache. How can I do this explicitly with code in C/C++?

There is no standard C++ language feature that allows you to do this.
Depending on your compiler and CPU, you may be able to use an arch-specific CPU instruction in an asm block:
T* p = new T(...);
size_t n = sizeof(T);
asm {
"CACHE n bytes at address p"
}
...or some builtin compiler function ("intrinsic") that does this.
You will need to consult your CPU manual and/or your compiler manual.
As an example, x86 CPUs have a set of instructions starting with PREFETCH.
And another example, GCC has a function called __builtin_prefetch. See GCC Data Prefetch Support
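A minimal sketch of using __builtin_prefetch on GCC or Clang. The PREFETCH_READ wrapper, the function name, and the prefetch distance of 16 elements are assumptions made up for this example. Keep in mind that a prefetch is only a hint: it asks for the data to be brought into cache early, it does not pin it there.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability wrapper; expands to nothing on other compilers.
// Arguments to __builtin_prefetch: address, 0 = read access, 3 = high temporal locality.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)
#endif

// Sum an array while hinting the element `distance` steps ahead.
long long sum_with_prefetch(const int* data, std::size_t n, std::size_t distance = 16) {
    assert(data != nullptr || n == 0);
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n)
            PREFETCH_READ(data + i + distance);
        sum += data[i];
    }
    return sum;
}
```

For a plain sequential walk like this one the hardware prefetcher usually does the job already; explicit hints tend to matter more for irregular access patterns, and only a measurement will tell.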

I will try to answer this question from a slightly different perspective. Do you really need to do this? And even if there were a way to do it, would it be worth it? Imagine there is a "magic" void * malloc_and_lock_in_cache( int cacheLevel ) function. What are you going to do with this data? If the application is limited to a while (1) loop with random array access from a single thread, you will get this behaviour anyway thanks to optimisation and the CPU architecture. In more real-world solutions you always have logic around the access: locking for multithreading, certain conditions, etc. Then the question is: are the rest of your application's algorithms so perfect that the only thing left to do is to allocate the array in cache?
Are all the other access/sorting/lookup functions state-of-the-art logic that cannot be improved, so that the only option left is the very limited performance gain from trying to override the CPU's optimisation?
Also, do you plan to run your application on raw hardware without ANY operating system, so that you don't have to care about how your allocation affects the OS and the other applications running alongside?
And what should happen if your application runs inside a virtual machine or an environment like Xen?
I can remember a similarly popular subject 15-18 years ago: physical memory usage and disk caching utilities. Tools like MS-DOS SmartDrive and similar utilities were REALLY useful and sped things up a lot. Usenet was full of 'tuning advice' and performance analyses for things like write-through/write-back settings.
Especially if your DOS application processed large amounts of data and implemented its own memory swapping logic (I am talking about the times when 4MB of RAM was a luxury), it mostly became a drama: on the one hand you needed as much memory as you could get, on the other hand you needed swapping, and the swapping itself went through the cache, etc.
But what happened next? We got VM386 mode and disk caching/memory swapping integrated into the OS, and nobody cared anymore about tuning smartdrive/ramdisks. In general it was 'cheaper' to allocate as much virtual memory as you needed than to implement your own voodoo algorithms to swap physical memory blocks (although this functionality is still in WinAPI).
So I would really recommend concentrating your efforts on algorithms and application design rather than trying to use very low-level features with really unpredictable results, unless you are developing some new microkernel OS.

I don't think you can. First, which cache? L3, L2, L1? You can prefetch, and align so that its access is more optimized, and then you can maybe touch it periodically to make it stay and not get evicted by LRU, but you can't really make it stay in cache.

First you have to know the architecture of the machine you want to run the code on. Then you should check if there's an instruction doing that kind of stuff.
Actually using the memory heavily will force the cache controller to put this region in cache.
And there are three rules of optimizing, you may want to know them first :)
http://c2.com/cgi/wiki?RulesOfOptimization

Effective optimization strategies on modern C++ compilers

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.
It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.
I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."
Here are some concrete ideas I'm considering:
Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Lastly, to nip certain kinds of answers in the bud:
I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs.
I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.

Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.
General process:
Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye on changes or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
Get a profiler that can watch your CPU's performance counters, or can at least guess at cache misses. (AMD CodeAnalyst, cachegrind, vTune, etc.)
Some other specific things:
Understand strict aliasing. Once you do, make use of restrict if your compiler has it. (Examine the disasm here too!)
Check out different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without this can result in better performance. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
Do be aware of invariants. Something you know is invariant may not be to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops. E.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; Fortunately you can usually see things that need this treatment in the disassembly view. This also helps with debugging later (makes the watch/locals window behave much more nicely in debug builds)!
Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.
WRT your std::priority_queue question: inserting things into a vector (the default backend for a priority_queue) tends to move a lot of elements around. If you can break up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead -- but this is really dependent on your usage patterns.
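The phased approach from the last point can be sketched as follows, next to the std::priority_queue behaviour it replaces. The function names are made up for the example, and it only applies when insertions and removals do not interleave.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Phase 1: collect everything. Phase 2: sort once. Phase 3: read in order.
std::vector<int> drain_phased(const std::vector<int>& input) {
    std::vector<int> items;
    items.reserve(input.size());  // one allocation up front
    items.insert(items.end(), input.begin(), input.end());
    std::sort(items.begin(), items.end(), std::greater<int>());  // largest first
    return items;
}

// Reference behaviour: std::priority_queue pops the largest element first.
std::vector<int> drain_priority_queue(const std::vector<int>& input) {
    std::priority_queue<int> pq(input.begin(), input.end());
    std::vector<int> out;
    out.reserve(input.size());
    while (!pq.empty()) {
        out.push_back(pq.top());
        pq.pop();
    }
    return out;
}
```

Both return the elements in the same order; which one is faster for a given workload is something the profiler has to answer.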

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
I would only consider this as a last option. The STL containers and algorithms have been thoroughly tested. Creating new ones is expensive in terms of development time.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
First, try reserving space for the vectors. Check out the std::vector::reserve method. A vector that keeps growing or changing to larger sizes is going to waste dynamic memory and execution time. Add some code to determine a good value for an upper bound.
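A small sketch that makes the effect of reserve visible by counting capacity changes while appending. count_reallocations is a helper made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Count how many times push_back changes the vector's capacity while
// appending `n` elements; each change means a reallocation and a copy/move.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set the count is zero, because push_back is guaranteed not to reallocate while the size stays within the reserved capacity. Without it the count depends on the library's growth factor.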
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
As a matter of principle, always pass large structures by reference or pointer. Prefer passing by constant reference. If you are using pointers, consider using smart pointers.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Modern compilers are very aware of instruction caches (pipelines) and try to keep them from being reloaded. You can always assist your compiler by writing code that uses fewer branches (from if, switch, loop constructs and function calls).
You may see a more significant performance gain by adjusting your program to optimize use of the data cache. Search the web for Data-Oriented Design. There are many excellent articles on this topic.
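As an illustration of the nested-loop question, here is a sketch with a hypothetical row-major Matrix type (all names are made up for the example). Both functions compute the same sum; only the traversal order differs. The first walks consecutive addresses, the second jumps by a whole row on every step, which on large matrices usually means far more cache misses. Compilers can sometimes interchange such loops themselves, but it is not something to rely on.

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix in one contiguous vector: element (r, c) is at r * cols + c.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c, double init)
        : rows(r), cols(c), data(r * c, init) {}
};

// Inner loop walks consecutive addresses.
double sum_row_major(const Matrix& m) {
    double s = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            s += m.data[r * m.cols + c];
    return s;
}

// Inner loop strides by `cols` elements: same elements, different order.
double sum_column_major(const Matrix& m) {
    double s = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            s += m.data[r * m.cols + c];
    return s;
}
```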
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
For accuracy, keep everything as a double. Adjust for rounding only when necessary and perhaps before displaying. This falls under the optimization rule, Use less code, eliminate extraneous or deadwood code.
Also see the section above about reserving space in containers before using them.
Some processors can load and store floating point numbers as fast as integers, or even faster. You would need to gather profile data before optimizing here. However, if you know there is a minimal resolution, you could use integers and change your base to that minimal resolution. For example, when dealing with U.S. money, integers can be used to represent 1/100 or 1/1000 of a dollar.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
This is an incorrect assumption. Compilers can optimize based on the function's signature, especially if the parameters correctly use const. I always like to assist the compiler by moving constant stuff outside of the loop. For an upper limit value, such as a string length, assign it to a const variable before the loop. The const modifier will assist the optimizer.
There is always the count-down optimization in loops. For many processors, a jump on register equals zero is more efficient than compare and jump if less than.
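Both suggestions in one small sketch, with function names made up for the example. The first version computes the length once before the loop; the second counts down so that the loop test is a comparison against zero. Whether the count-down form gains anything on a given processor is something to check in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Count occurrences of `ch`; the length is computed once, before the loop.
std::size_t count_char(const char* text, char ch) {
    assert(text != nullptr);
    const std::size_t length = std::strlen(text);  // hoisted loop invariant
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i)
        if (text[i] == ch) ++count;
    return count;
}

// Count-down variant: the loop condition compares the index against zero.
std::size_t count_char_countdown(const char* text, char ch) {
    assert(text != nullptr);
    std::size_t count = 0;
    for (std::size_t i = std::strlen(text); i != 0; --i)
        if (text[i - 1] == ch) ++count;
    return count;
}
```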
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
I would avoid "micro optimizations". If you have any doubts, print out the assembly code generated by the compiler (for the area you are questioning) under the highest optimization setting. Try rewriting the code to express the compiler's assembly code. Optimize this code, if you can. Anything more requires platform specific instructions.
Optimization Ideas & Concepts
1. Computers prefer to execute sequential instructions.
Branching upsets them. Some modern processors have enough instruction cache to contain code for small loops. When in doubt, don't cause branches.
2. Eliminate Requirements
Less code, more performance.
3. Optimize designs before code
Often times, more performance can be gained by changing the design versus changing the implementation of the design. Less design promotes less code, generates more performance.
4. Consider data organization
Optimize the data.
Organize frequently used fields into substructures.
Set data sizes to fit into a data cache line.
Remove constant data out of data structures.
Use const specifier as much as possible.
5. Consider page swapping
Operating systems will swap out your program or task for another one. Often times into a 'swap file' on the hard drive. Breaking up the code into chunks that contain heavily executed code and less executed code will assist the OS. Also, coagulate heavily used code into tighter units. The idea is to reduce the swapping of code from the hard drive (such as fetching "far" functions). If code must be swapped out, it should be as one unit.
6. Consider I/O optimizations
(Includes file I/O too).
Most I/O prefers fewer large chunks of data to many small chunks of data. Hard drives like to keep spinning. Larger data packets have less overhead than smaller packets.
Format data into a buffer then write the buffer.
7. Eliminate the competition
Get rid of any programs and tasks that are competing against your application for the processor(s). Such tasks as virus scanning and playing music. Even I/O drivers want a piece of the action (which is why you want to reduce the number of I/O transactions).
These should keep you busy for a while. :-)

Use of memory buffer pools can be of great performance benefit vs. dynamic allocation. More so if they reduce or prevent heap fragmentation over long execution runs.
Be aware of data location. If you have a significant mix of local vs. global data you may be overworking the cache mechanism. Try to keep data sets in close proximity to make maximum use of cache line validity.
Even though compilers do a wonderful job with loops, I still scrutinize them when performance tuning. You can spot architectural flaws that yield orders of magnitude where the compiler may only trim percentages.
If a single priority queue is using a lot of time in its operation, there may be benefit to creating a battery of queues representing buckets of priority. It would be complexity being traded for speed in this case.
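As an illustration of the "battery of queues" idea, here is a sketch that assumes priorities are small integers in a known range. BucketQueue is a hypothetical class, not something from the question's code:

```cpp
#include <cstddef>
#include <queue>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical bucketed priority queue: one FIFO queue per priority level.
// Assumes priorities are integers in [0, levels); level 0 is served first.
template <typename T>
class BucketQueue {
public:
    explicit BucketQueue(std::size_t levels) : buckets_(levels) {}

    void push(std::size_t priority, T value) {
        buckets_.at(priority).push(std::move(value));  // throws if out of range
        ++size_;
    }

    bool empty() const { return size_ == 0; }

    // Remove and return the oldest item in the lowest-numbered non-empty bucket.
    T pop() {
        for (auto& bucket : buckets_) {
            if (!bucket.empty()) {
                T value = std::move(bucket.front());
                bucket.pop();
                --size_;
                return value;
            }
        }
        throw std::logic_error("pop() on empty BucketQueue");
    }

private:
    std::vector<std::queue<T>> buckets_;
    std::size_t size_ = 0;
};
```

push is O(1) and pop is O(levels), so this only pays off when the number of distinct priorities is small compared with the number of queued items.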
I notice you didn't mention the use of SSE type instructions. Could they be applicable to your type of number crunching?
Best of luck.

Here is a nice paper on the subject.

About STL containers.
Most people here claim STL offers one of the fastest possible implementations of the container algorithms. And I say the opposite: for most real-world scenarios the STL containers taken as-is yield really catastrophic performance.
People argue about the complexity of the algorithms used in STL. Here STL is good: O(1) for list/queue, vector (amortized), and O(log(N)) for map. But this is not the real performance bottleneck for a typical application! For many applications the real bottleneck is the heap operations (malloc/free, new/delete, etc.).
A typical operation on the list costs just a few CPU cycles. On a map, some tens, maybe more (this depends on the cache state and log(N) of course). And typical heap operations cost from hundreds to thousands (!!!) of CPU cycles. For multithreaded applications, for instance, they also require synchronization (interlocked operations). Plus on some OSs (such as Windows XP) the heap functions are implemented entirely in the kernel mode.
So the actual performance of the STL containers in a typical scenario is dominated by the number of heap operations they perform. And here they're disastrous. Not because they're implemented poorly, but because of their design.
On the other hand there're other containers which are designed differently.
Once I've designed and written such containers for my own needs:
http://www.codeproject.com/KB/recipes/Containers.aspx
And it proved for me to be superior from the performance point of view, and not only.
But recently I've discovered I'm not the only one who thought about this.
boost::intrusive is the container library that is implemented in the manner similar to what I did then.
I suggest you try it (if you didn't already)

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
Generally, not unless you're working with a poor implementation. I wouldn't replace an STL container or algorithm just because you think you can write tighter code. I'd do it only if the STL version is more general than it needs to be for your problem. If you can write a simpler version that does just what you need, then there might be some speed to gain there.
One exception I've seen is to replace a copy-on-write std::string with one that doesn't require thread synchronization.
for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Unlikely. But if you're using a lot of time allocating up to a certain size, it might be profitable to add a reserve() call.
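A small sketch of the reserve() suggestion, assuming the number of elements (or a good upper bound) is known before the loop. make_squares is a hypothetical example function:

```cpp
#include <cstddef>
#include <vector>

// Fill a vector with n squares. reserve() performs one allocation up front,
// so the push_back calls below never reallocate.
inline std::vector<int> make_squares(std::size_t n) {
    std::vector<int> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<int>(i * i));
    }
    return out;
}
```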
performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference.
When working with containers, I pass iterators for the inputs and an output iterator, which is still pretty general.
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
Not very. Yes. I find that missed branch predictions and cache-hostile memory access patterns are the two biggest killers of performance (once you've gotten to reasonable algorithms). A lot of older code uses "early out" tests to reduce calculations. But on modern processors, that's often more expensive than doing the math and ignoring the result.
A significant bottleneck in my code used to be conversions from floating point to integers
Yup. I recently discovered the same issue.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop.
Some compilers can deal with this. Visual C++ has a "link-time code generation" option that effectively re-invokes the compiler to do further optimization. And, in the case of functions like strlen, many compilers will recognize that as an intrinsic function.
Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand? On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
When you're optimizing at this low level, there are few reliable rules of thumb. Compilers will vary. Measure your current solution, and decide if it's too slow. If it is, come up with a hypothesis (e.g., "What if I replace the inner if-statements with a look-up table?"). It might help ("eliminates stalls due to failed branch predictions") or it might hurt ("look-up access pattern hurts cache coherence"). Experiment and measure incrementally.
I'll often clone the straightforward implementation and use an #ifdef HAND_OPTIMIZED/#else/#endif to switch between the reference version and the tweaked version. It's useful for later code maintenance and validation. I commit each successful experiment to change control, and keep a log (spreadsheet) with the changelist number, run times, and explanation for each step in optimization. As I learn more about how the code behaves, the log makes it easy to back up and branch off in another direction.
You need a framework for running reproducible timing tests and to compare results to the reference version to make sure you don't inadvertently introduce bugs.

If I were working on this, I would expect an end-stage where things like cache locality and vector operations would come into play.
However, before getting to the end stage, I would expect to find a series of problems of different sizes having less to do with compiler-level optimization, and more to do with odd stuff going on that could never be guessed, but once found, are simple to fix. Usually they revolve around class overdesign and data structure issues.
Here's an example of this kind of process.
I have found that generalized container classes with iterators, which in principle the compiler can optimize down to minimal cycles, often are not so optimized for some obscure reason. I've also heard other cases on SO where this happens.
Others have said, before you do anything else, profile. I agree with that approach except I think there's a better way, and it's indicated in that link. Whenever I find myself asking if some specific thing, like STL, could be a problem, I just might be right - BUT - I'm guessing. The fundamental winning idea in performance tuning is find out, don't guess. It is easy to find out for sure what is taking the time, so don't guess.

here is some stuff I had used:
templates to specialize innermost loops bounds (makes them really fast)
use __restrict__ keywords for alias problems
reserve vectors beforehand to sane defaults.
avoid using map (it can be really slow)
vector append/ insert can be significantly slow. If that is the case, raw operations may make it faster
N-byte memory alignment (Intel has pragma aligned, http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm)
trying to keep memory within L1/L2 caches.
compiled with NDEBUG
profile using oprofile, use opannotate to look for specific lines (stl overhead is clearly visible then)
here are sample parts of profile data (so you know where to look for problems)
* Output annotated source file with samples
* Output all files
*
* CPU: Core 2, speed 1995 MHz (estimated)
--
* Total samples for file : "/home/andrey/gamess/source/blas.f"
*
* 1020586 14.0896
--
* Total samples for file : "/home/andrey/libqc/rysq/src/fock.cpp"
*
* 962558 13.2885
--
* Total samples for file : "/usr/include/boost/numeric/ublas/detail/matrix_assign.hpp"
*
* 748150 10.3285
--
* Total samples for file : "/usr/include/boost/numeric/ublas/functional.hpp"
*
* 639714 8.8315
--
* Total samples for file : "/home/andrey/gamess/source/eigen.f"
*
* 429129 5.9243
--
* Total samples for file : "/usr/include/c++/4.3/bits/stl_algobase.h"
*
* 411725 5.6840
--
example of code from my project
template<int ni, int nj, int nk, int nl>
inline void eval(const Data::density_type &D, const Data::fock_type &F,
const double *__restrict Q, double scale) {
const double * __restrict Dij = D[0];
...
double * __restrict Fij = F[0];
...
for (int l = 0, kl = 0, ijkl = 0; l < nl; ++l) {
for (int k = 0; k < nk; ++k, ++kl) {
for (int j = 0, ij = 0; j < nj; ++j, ++jk, ++jl) {
for (int i = 0; i < ni; ++i, ++ij, ++ik, ++il, ++ijkl) {

And I think the main hint anyone could give you is: measure, measure, measure. That and improving your algorithms.
The way you use certain language features, the compiler version, std lib implementation, platform, machine - all play their role in performance, and you haven't mentioned many of those, and none of us ever had your exact setup.
Regarding replacing std::vector: use a drop-in replacement (e.g., this one) and just try it out.

How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I can't speak for all compilers, but my experience with GCC shows that it will not heavily optimize code with respect to the cache. I would expect this to be true for most modern compilers. Optimization such as reordering nested loops can definitely affect performance. If you believe that you have memory access patterns that could lead to many cache misses, it will be in your interest to investigate this.
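To make the loop-reordering point concrete, here is a sketch with a hypothetical row-major Matrix type. Both functions compute the same sum; only the traversal order differs, and the size of the effect depends on the matrix dimensions and the cache:

```cpp
#include <cstddef>
#include <vector>

// A rows x cols matrix stored row-major in one contiguous vector.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;  // element (r, c) lives at data[r * cols + c]
};

// Inner loop walks consecutive addresses: cache-friendly for row-major data.
inline double sum_row_order(const Matrix& m) {
    double total = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            total += m.data[r * m.cols + c];
    return total;
}

// Inner loop strides by `cols` elements: same result, but on large
// matrices it typically touches a new cache line on every access.
inline double sum_column_order(const Matrix& m) {
    double total = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            total += m.data[r * m.cols + c];
    return total;
}
```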

Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
The STL is generally the fastest for the general case. If you have a very specific case, you might see a speed-up with a hand-rolled one. For example, std::sort (typically introsort, a quicksort hybrid) is the fastest general sort, but if you know in advance that your elements are virtually already ordered, then insertion sort might be a better choice.
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
This depends on where you are going to do the static allocation. One thing I tried along this line was to static allocate a large amount of memory on the stack, then re-use later. Results? Heap memory was substantially faster. Just because an item is on the stack doesn't make it faster to access- the speed of stack memory also depends on things like cache. A statically allocated global array may not be any faster than the heap. I assume that you have already tried techniques like just reserving the upper bound. If you have a lot of vectors that have the same upper bound, consider improving cache by having a vector of structs, which contain the data members.
I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
I personally normally pass the result in by reference in this scenario. It allows for a lot more re-use. Passing large data structures by value and hoping that the compiler uses RVO is not a good idea when you can just manually use RVO yourself.
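A sketch of what passing the result in by reference buys in terms of re-use. fill_multiples is a hypothetical function; the point is that the caller's buffer keeps its capacity between calls:

```cpp
#include <cstddef>
#include <vector>

// Writes the first n multiples of `step` into `out`, reusing its storage.
// clear() leaves the capacity unchanged, so repeated calls with similar n
// stop allocating after the first one.
inline void fill_multiples(std::vector<double>& out, std::size_t n, double step) {
    out.clear();
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<double>(i) * step);
    }
}
```

In a hot loop the caller declares the vector once outside the loop and passes it in on every iteration.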
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I found that they weren't particularly cache-aware. The issue is that the compiler doesn't understand your program and can't predict the vast majority of its state, especially if you depend heavily on the heap. If you have a profiler that ships with your compiler, for example Visual Studio's Profile Guided Optimization, then this can produce excellent speedups.
Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
There are different floating-point models - Visual Studio gives an fp:fast compiler setting. As for the exact effects of doing such, I can't be certain. However, you could try altering the floating point precision or other settings in your compiler and checking the result.
One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
I've never come across such a scenario. However, if you're genuinely concerned about such, then the option remains to do it manually. One of the things that you could try is calling a function on a const reference, suggesting to the compiler that the value won't change.
One of the other things that I want to point out is the use of non-standard extensions to the compiler, for example provided by Visual Studio is __assume. http://msdn.microsoft.com/en-us/library/1b3fsfxw(VS.80).aspx
There's also multithreading, which I would expect you've already looked into. You could try some specific optimizations, such as SSE, which another answer suggested.
Edit: I realized that a lot of the suggestions I posted referenced Visual Studio directly. That's true, but, GCC almost certainly provides alternatives to the majority of them. I just have personal experience with VS most.

The STL priority queue implementation is fairly well-optimized for what it does, but certain kinds of heaps have special properties that can improve your performance on certain algorithms. Fibonacci heaps are one example. Also, if you're storing objects with a small key and a large amount of satellite data, you'll get a major improvement in cache performance if you store that data separately, even if it means storing one extra pointer per object.
As for arrays, I've found std::vector to even slightly out-perform compile-time-constant arrays. That said, its optimizations are general, and specific knowledge of your algorithm's access patterns may allow you to optimize further for cache locality, alignment, coloring, etc. If you find that your performance drops significantly past a certain threshold due to cache effects, hand-optimized arrays may move that problem size threshold by as much as a factor of two in some cases, but it's unlikely to make a huge difference for small inner loops that fit easily within the cache, or large working sets that exceed the size of any CPU cache. Work on the priority queue first.
Most of the overhead of dynamic memory allocation is constant with respect to the size of the object being allocated. Allocating one large object and returning it by a pointer isn't going to hurt as much as copying it. The threshold for copying vs. dynamic allocation varies greatly between systems, but it should be fairly consistent within a chip generation.
Compilers are quite cache-aware when cpu-specific tuning is turned on, but they don't know the size of the cache. If you're optimizing for cache size, you may want to detect that or have the user specify it at run-time, since that will vary even between processors of the same generation.
As for floating point, you absolutely should be using SSE. This doesn't necessarily require learning SSE yourself, as there are many libraries of highly-optimized SSE code that do all sorts of important scientific computing operations. If you're compiling 64-bit code, the compiler might emit some SSE code automatically, as SSE2 is part of the x86_64 instruction set. SSE will also save you some of the overhead of x87 floating point, since it's not converting back and forth to 80-bit values internally. Those conversions can also be a source of accuracy problems, since you can get different results from the same set of operations depending on how they get compiled, so it's good to be rid of them.

If you work on big matrices for instance, consider tiling your loops to improve the locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses.
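As one possible illustration of tiling, here is a blocked transpose of a square row-major matrix. transpose_tiled is a hypothetical function and the default tile size of 32 is an arbitrary starting point that would need tuning per machine:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Transpose an n x n row-major matrix in tile x tile blocks, so the source
// and destination rows touched by one block are more likely to stay in cache.
// Precondition: tile > 0.
inline std::vector<double> transpose_tiled(const std::vector<double>& src,
                                           std::size_t n,
                                           std::size_t tile = 32) {
    std::vector<double> dst(n * n);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t jj = 0; jj < n; jj += tile) {
            const std::size_t i_end = std::min(ii + tile, n);
            const std::size_t j_end = std::min(jj + tile, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    dst[j * n + i] = src[i * n + j];
        }
    }
    return dst;
}
```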

One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand?
On some compilers this is incorrect. With whole-program (link-time) optimization enabled, the compiler sees the code across all translation units (including static libraries) and can optimize it the same way it would if it were in a single translation unit. A few that support this feature come to mind:
Microsoft Visual C++ compilers
Intel C++ Compiler
LLVM-GCC
GCC (I think, not sure)

i'm surprised no one has mentioned these two:
Link time optimization: clang and g++ from 4.5 on support link time optimization. I've heard that in g++'s case the heuristics are still pretty immature, but they should improve quickly since the main architecture is laid out.
Benefits range from interprocedural optimizations at object file level to highly sought stuff like inlining of virtual calls (devirtualization).
Project inlining: this might seem to some like a very crude approach, but it is that very crudeness which makes it so powerful: this amounts to dumping all your headers and .cpp files into a single, really big .cpp file and compiling that; basically it will give you the same benefits of link-time optimization in your trip back to 1999. Of course, if your project is really big, you'll still need a 2010 machine; this thing will eat your RAM like there is no tomorrow. However, even in that case, you can split it into more than one not-so-damn-huge .cpp file.

If you are doing heavy floating point math you should consider using SSE to vectorize your computations if that maps well to your problem.
Google SSE intrinsics for more information about this.

Here is something that worked for me once. I can't say that it will work for you. I had code on the lines of
switch(num) {
case 1: result = f1(param); break;
case 2: result = f2(param); break;
//...
}
Then I got a serious performance boost when I changed it to
// init:
funcs[N] = {f1, f2 /*...*/};
// later in the code:
result = (funcs[num])(param);
Perhaps someone here can explain the reason the latter version is better. I suppose it has something to do with the fact that there are no conditional branches there.
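A compilable sketch of the table version, with hypothetical handlers standing in for f1, f2 and so on. Note that many compilers already turn a dense switch into a jump table, so the gain is not guaranteed and should be measured:

```cpp
#include <array>
#include <cstddef>

// Hypothetical handlers standing in for f1, f2, ...
inline int f1(int x) { return x + 1; }
inline int f2(int x) { return x * 2; }
inline int f3(int x) { return x * x; }

using Handler = int (*)(int);

// Table indexed by (num - 1), matching `case 1`, `case 2`, ... in the switch.
constexpr std::array<Handler, 3> kHandlers = {{f1, f2, f3}};

// Returns the handler result, or `fallback` if num is out of range.
inline int dispatch(std::size_t num, int param, int fallback = 0) {
    if (num < 1 || num > kHandlers.size()) return fallback;
    return kHandlers[num - 1](param);
}
```

The range check replaces the switch's implicit "no matching case" path; dropping it is only safe if num is validated elsewhere.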

My current project is a multi-threaded media server written in C++. It's a time-critical application, since slow functions can cause bad results in media streaming such as loss of sync, high latency, and huge delays.
The strategy I usually use to guarantee the best possible performance is to minimize the number of heavy operating system calls that allocate or manage resources such as memory, files, and sockets.
At first I wrote my own STL, network, and file management classes.
All my container classes ("MySTL") manage their own memory blocks to avoid multiple alloc (new) / free (delete) calls. Released objects are enqueued in a memory block pool to be reused when needed. That way I improve performance and protect my code against memory fragmentation.
For the parts of the code that need to access slower system resources (like files, databases, scripts, network writes) I use separate threads. But not one thread per unit (for example, not one thread per socket); otherwise the operating system would lose performance managing a high number of threads. Instead, you can group objects of the same class to be processed on one separate thread where possible.
For example, if you have to write data to a network socket but the socket write buffer is full, I save the data in a send-queue buffer (whose memory is shared by all sockets) to be sent on a separate thread as soon as the sockets become writable again. This way your main threads should never stall in a blocked state waiting for the operating system to free a specific resource. All released buffers are saved and reused when needed.
Finally, a profiling tool is welcome to find the program's bottlenecks and show which algorithms should be improved.
I have succeeded with this strategy: I have servers that have run for 500+ days on a Linux machine without rebooting, with thousands of users logging in every day.
[02:01] -alpha.ip.tv- Uptime: 525days 12hrs 43mins 7secs

Optimizing for space instead of speed in C++

When you say "optimization", people tend to think "speed". But what about embedded systems where speed isn't all that critical, but memory is a major constraint? What are some guidelines, techniques, and tricks that can be used for shaving off those extra kilobytes in ROM and RAM? How does one "profile" code to see where the memory bloat is?
P.S. One could argue that "prematurely" optimizing for space in embedded systems isn't all that evil, because you leave yourself more room for data storage and feature creep. It also allows you to cut hardware production costs because your code can run on smaller ROM/RAM.
P.P.S. References to articles and books are welcome too!
P.P.P.S. These questions are closely related: 404615, 1561629

My experience from an extremely constrained embedded memory environment:
Use fixed size buffers. Don't use pointers or dynamic allocation because they have too much overhead.
Use the smallest int data type that works.
Don't ever use recursion. Always use looping.
Don't pass lots of function parameters. Use globals instead. :)

There are many things you can do to reduce your memory footprints, I'm sure people have written books on the subject, but a few of the major ones are:
Compiler options to reduce code size (including -Os and packing/alignment options)
Linker options to strip dead code
If you're loading from flash (or ROM) to ram to execute (rather than executing from flash), then use a compressed flash image, and decompress it with your bootloader.
Use static allocation: a heap is an inefficient way to allocate limited memory, and it might fail due to fragmentation if it is constrained.
Tools to find the stack high-watermark (typically they fill the stack with a pattern, execute the program, then see where the pattern remains), so you can set the stack size(s) optimally
And of course, optimising the algorithms you use for memory footprint (often at expense of speed)

A few obvious ones
If speed isn't critical, execute the code directly from flash.
Declare constant data tables using const. This will avoid the data being copied from flash to RAM
Pack large data tables tightly using the smallest data types, and in the correct order to avoid padding.
Use compression for large sets of data (as long as the compression code doesn't outweigh the data)
Turn off exception handling and RTTI.
Did anybody mention using -Os? ;-)
Folding knowledge into data
One of the rules of Unix philosophy can help make code more compact:
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
I can't count how many times I've seen elaborate branching logic, spanning many pages, that could've been folded into a nice compact table of rules, constants, and function pointers. State machines can often be represented this way (State Pattern). The Command Pattern also applies. It's all about the declarative vs imperative styles of programming.
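To show what folding branching logic into a table can look like, here is a sketch of a tiny state machine. The door controller, its states and its events are all hypothetical:

```cpp
#include <cstddef>

// Hypothetical example: a door controller with three states and two events.
enum class State : std::size_t { Closed, Open, Locked };
enum class Event : std::size_t { Push, Key };

// The transition rules live in data: kNext[state][event] is the next state.
constexpr State kNext[3][2] = {
    /* Closed */ {State::Open,   State::Locked},
    /* Open   */ {State::Closed, State::Open},
    /* Locked */ {State::Locked, State::Closed},
};

// The logic is a single table lookup; changing a rule means editing the table.
constexpr State step(State s, Event e) {
    return kNext[static_cast<std::size_t>(s)][static_cast<std::size_t>(e)];
}
```

Because the table is const, on many embedded toolchains it can be placed in ROM instead of RAM.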
Log codes + binary data instead of text
Instead of logging plain text, log event codes and binary data. Then use a "phrasebook" to reconstitute the event messages. The messages in the phrasebook can even contain printf-style format specifiers, so that the event data values are displayed neatly within the text.
Minimize the number of threads
Each thread needs its own memory block for a stack and TSS. Where you don't need preemption, consider making your tasks execute co-operatively within the same thread (cooperative multi-tasking).
Use memory pools instead of hoarding
To avoid heap fragmentation, I've often seen separate modules hoard large static memory buffers for their own use, even when the memory is only occasionally required. A memory pool could be used instead so that the memory is only used "on demand". However, this approach may require careful analysis and instrumentation to make sure pools are not depleted at runtime.
Dynamic allocation only at initialization
In embedded systems where only one application runs indefinitely, you can use dynamic allocation in a sensible way that doesn't lead to fragmentation: Just dynamically allocate once in your various initialization routines, and never free the memory. reserve() your containers to the correct capacity and don't let them auto-grow. If you need to frequently allocate/free buffers of data (say, for communication packets), then use memory pools. I once even extended the C/C++ runtimes so that it would abort my program if anything tried to dynamically allocate memory after the initialization sequence.

As with all optimization, first optimize algorithms, second optimize the code and data, finally optimize the compiler.
I don't know what your program does, so I can't advise on algorithms. Many others have written about the compiler. So, here's some advice on code and data:
Eliminate redundancy in your code. Any repeated code that's three or more lines long, repeated three times in your code, should be changed to a function call.
Eliminate redundancy in your data. Find the most compact representation: merge read-only data, and consider using compression codes.
Run the code through a regular profiler; eliminate all code that isn't used.

Generate a map file from your linker. It will show how the memory is allocated. This is a good start when optimizing for memory usage. It also will show all the functions and how the code-space is laid out.

Here's a book on the subject Small Memory Software: Patterns for systems with limited memory.

Compile in VS with /Os. Often times this is even faster than optimizing for speed anyway, because smaller code size == less paging.
Comdat folding should be enabled in the linker (it is by default in release builds)
Be careful about data structure packing; often this results in the compiler generating more code (== more memory) to access unaligned memory. Using 1 bit for a boolean flag is a classic example.
Also, be careful when choosing a memory efficient algorithm over an algorithm with a better runtime. This is where premature optimizations come in.

Ok most were mentioned already, but here is my list anyway:
Learn what your compiler can do. Read compiler documentation, experiment with code examples. Check settings.
Check generated code at target optimization level. Sometimes results are surprising and often it turns out optimization actually slows things down (or just take too much space).
Choose a suitable memory model. If you target a really small, tight system, a large or huge memory model might not be the best choice (but is usually the easiest to program for...)
Prefer static allocation. Use dynamic allocation only on startup or over a statically allocated buffer (pool or maximum instance sized static buffer).
Use C99 style data types. Use the smallest sufficient data type for storage types. Local variables like loop variables are sometimes more efficient with "fast" data types.
Select inline candidates. Some parameter-heavy functions with relatively simple bodies are better off when inlined. Or consider passing a structure of parameters. Globals are also an option, but be careful - tests and maintenance can become difficult if anyone working with them isn't disciplined enough.
Use the const keyword well, and be aware of array initialization implications.
Map file, ideally also with module sizes. Check also what is included from the CRT (is it really necessary?).
Recursion: just say no (limited stack space).
Floating point numbers - prefer fixed point math. Floating point tends to include and call a lot of code (even for simple addition or multiplication).
C++: you should know C++ VERY WELL. If you don't, program constrained embedded systems in C, please. Those who dare must be careful with all advanced C++ constructs (inheritance, templates, exceptions, overloading, etc.). Consider close-to-HW code to be rather Super-C, with C++ used where it counts: in high level logic, GUI, etc.
Disable whatever you don't need in compiler settings (be it parts of libraries, language constructs, etc.)
Last but not least - while hunting for smallest possible code size - don't overdo it. Watch out also for performance and maintainability. Over-optimized code tends to decay very quickly.

Firstly, tell your compiler to optimize for code size. GCC has the -Os flag for this.
Everything else is at the algorithmic level - use similar tools that you would for finding memory leaks, but instead look for allocs and frees that you could avoid.
Also take a look at commonly used data structure packing - if you can shave a byte or two off them, you can cut down memory use substantially.
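One way to avoid repeated allocs and frees is to hand out fixed-size blocks from a statically sized pool. The sketch below is only an illustration: StaticPool and its members are hypothetical names, and it assumes every object fits in one block and needs no stricter alignment than a pointer.

```cpp
#include <cassert>
#include <stddef.h>

// Fixed-capacity pool of equally sized blocks. Free blocks are chained
// through their own storage, so the bookkeeping is a single pointer.
// Assumption: stored objects need no stricter alignment than a pointer.
template <size_t BlockSize, size_t BlockCount>
class StaticPool {
    union Block {
        Block* next;                     // used while the block is free
        unsigned char bytes[BlockSize];  // used while the block is handed out
    };
    Block blocks_[BlockCount];
    Block* free_;

public:
    // Thread every block onto the free list.
    StaticPool() : free_(0) {
        for (size_t i = 0; i < BlockCount; ++i) {
            blocks_[i].next = free_;
            free_ = &blocks_[i];
        }
    }

    // Returns a block, or a null pointer when the pool is exhausted.
    void* allocate() {
        if (!free_) return 0;
        Block* b = free_;
        free_ = b->next;
        return b;
    }

    // Puts a block obtained from allocate() back on the free list.
    void release(void* p) {
        Block* b = static_cast<Block*>(p);
        assert(b >= blocks_ && b < blocks_ + BlockCount);
        b->next = free_;
        free_ = b;
    }
};
```

Declaring an instance such as `static StaticPool<32, 16> pool;` at file scope keeps the whole buffer in static storage, so its size shows up in the map file instead of in heap usage at run time.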

If you're looking for a good way to profile your application's heap usage, check out valgrind's massif tool. It will let you take snapshots of your app's memory usage profile over time, and you can then use that information to better see where the "low hanging fruit" is, and aim your optimizations accordingly.

Profiling code or data bloat can be done via map files: for gcc see here, for VS see here.
I have yet to see a useful tool for size profiling though (and don't have time to fix my VS AddIn hack).

On top of what others suggest:
Limit use of c++ features, write like in ANSI C with minor extensions. Standard (std::) templates use a large system of dynamic allocation. If you can, avoid templates altogether. While not inherently harmful, they make it way too easy to generate lots and lots of machine code from just a couple simple, clean, elegant high-level instructions. This encourages writing in a way that - despite all the "clean code" advantages - is very memory hungry.
If you must use templates, write your own or use ones designed for embedded use, pass fixed sizes as template parameters, and write a test program so you can test your template AND check your -S output to ensure the compiler is not generating horrible assembly code to instantiate it.
Align your structures by hand, or use #pragma pack
{char a; long b; char c; long d; char e; char f; } // typically 20 bytes with a 4-byte, 4-byte-aligned long,
{char a; char c; char e; char f; long b; long d; } // typically 12 bytes under the same assumptions.
For the same reason, use a centralized global data storage structure instead of scattered local static variables.
Intelligently balance usage of malloc()/new and static structures.
If you need a subset of functionality of given library, consider writing your own.
Unroll short loops.
for(i=0;i<3;i++){ transform_vector[i]; }
is longer than
transform_vector[0];
transform_vector[1];
transform_vector[2];
Don't do that for longer ones.
Pack multiple files together to let the compiler inline short functions and perform various optimizations that the linker can't.
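The structure-ordering tip above can be checked with sizeof. This is a small sketch with made-up struct names. It uses int32_t instead of long so that the member width is the same on 32-bit and 64-bit targets. The exact sizes still depend on the compiler's alignment rules.

```cpp
#include <stddef.h>
#include <stdint.h>

// Members in declaration order: each lone char is followed by padding
// so that the next 32-bit member starts on an aligned address.
struct Scattered {
    char a;
    int32_t b;
    char c;
    int32_t d;
    char e;
    char f;
};

// Same members with the small ones grouped first, so they share one
// aligned slot instead of each dragging padding along.
struct Grouped {
    char a;
    char c;
    char e;
    char f;
    int32_t b;
    int32_t d;
};

// Bytes of padding the compiler added to a struct holding
// four chars and two 32-bit integers.
template <typename T>
inline size_t padding_bytes() {
    return sizeof(T) - (4 * sizeof(char) + 2 * sizeof(int32_t));
}
```

On a target where int32_t is 4-byte aligned, Scattered typically comes out at 20 bytes and Grouped at 12. Printing both sizeof values on the real target is the reliable way to confirm it.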

Don't be afraid to write 'little languages' inside your program. Sometimes a table of strings and an interpreter can get a LOT done. For instance, in a system I've worked on, we have a lot of internal tables, which have to be accessed in various ways (loop through, whatever). We've got an internal system of commands for referencing the tables that forms a sort of half-way language that's quite compact for what it gets done.
But, BE CAREFUL! Know that you are writing such things (I wrote one accidentally, myself), and DOCUMENT what you are doing. The original developers do NOT seem to have been conscious of what they were doing, so it's much harder to manage than it should be.
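To make the 'little language' idea concrete, here is a minimal sketch of a byte-coded stack machine. It is not the system described above: the opcodes, the run function, and the sample program are all invented for illustration.

```cpp
#include <stddef.h>
#include <stdint.h>

// Opcodes of a hypothetical stack-based mini language.
enum Op { OP_END = 0, OP_PUSH, OP_ADD, OP_MUL };

// Interprets a program stored as a compact byte table and returns the
// value on top of the stack. Malformed programs (stack underflow or
// overflow, truncated PUSH, unknown opcode) yield 0.
inline int32_t run(const uint8_t* prog, size_t len) {
    int32_t stack[8];
    size_t sp = 0;
    for (size_t pc = 0; pc < len; ++pc) {
        switch (prog[pc]) {
        case OP_PUSH:  // the next byte is the literal operand
            if (pc + 1 >= len || sp >= 8) return 0;
            stack[sp++] = prog[++pc];
            break;
        case OP_ADD:
        case OP_MUL: {
            if (sp < 2) return 0;
            int32_t rhs = stack[--sp];
            int32_t lhs = stack[--sp];
            stack[sp++] = (prog[pc] == OP_ADD) ? lhs + rhs : lhs * rhs;
            break;
        }
        case OP_END:
            return sp ? stack[sp - 1] : 0;
        default:
            return 0;
        }
    }
    return sp ? stack[sp - 1] : 0;
}

// (2 + 3) * 4 encoded as a small table of read-only bytes.
static const uint8_t kProgram[] = {
    OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_END
};
```

The interpreter is paid for once, and every additional "program" costs only a few bytes of constant data. Documenting the opcode set next to the enum keeps the language from becoming the accidental kind warned about above.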

Optimizing is a popular term but often technically incorrect. It literally means to make optimal. Such a condition is never actually achieved for either speed or size. We can simply take measures to move toward optimization.
Many (but not all) of the techniques used to move toward minimum time to a computing result sacrifice memory, and many (but not all) of the techniques used to move toward minimum memory requirement lengthen the time to result.
Reduction of memory requirements amounts to a fixed number of general techniques. It is difficult to find a specific technique that does not neatly fit into one or more of these. If you did all of them, you'd have something very close to the minimal space requirement for the program if not the absolute minimum possible. For a real application, it could take a team of experienced programmers a thousand years to do it.
1. Remove all redundancy from stored data, including intermediates.
2. Remove all need for storing data that could be streamed instead.
3. Allocate only the number of bytes needed, never a single more.
4. Remove all unused data.
5. Remove all unused variables.
6. Free data as soon as it is no longer possibly needed.
7. Remove all unused algorithms and branches within algorithms.
8. Find the algorithm that is represented in the minimally sized execution unit.
9. Remove all unused space between items.
This is a computer science view of the topic, not a developer's one.
For instance, packing a data structure is an effort that combines (3) and (9) above. Compressing data is a way to at least partly achieve (1) above. Reducing overhead of higher level programming constructs is a way to achieve some progress in (7) and (8). Dynamic allocation is an attempt to exploit a multitasking environment to employ (3). Compilation warnings, if turned on, can help with (5). Destructors attempt to assist with (6). Sockets, streams, and pipes can be used to accomplish (2). Simplifying a polynomial is a technique to gain ground in (8).
Understanding the meaning of these nine and the various ways to achieve them is the result of years of learning and of checking the memory maps that result from compilation. Embedded programmers often learn them more quickly because of the limited memory available.
Using the -Os option on a GNU compiler asks the compiler to look for patterns that can be transformed to accomplish these. -Os is an aggregate flag that turns on a number of optimization features, each of which attempts transformations toward one of the nine tasks above.
Compiler directives can produce results without programmer effort, but automated processes in the compiler rarely correct problems created by lack of awareness in the writers of the code.

Bear in mind the implementation cost of some C++ features, such as virtual function tables and overloaded operators that create temporary objects.

Along with what everyone else said, I'd just like to add: don't use virtual functions, because with virtual functions a vtable must be created, which can take up who knows how much space.
Also watch out for exceptions. With gcc, I don't believe the size grows with each try-catch block (except for 2 function calls per try-catch), but there is a fixed-size function that must be linked in, which could waste precious bytes.
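Part of the virtual-function cost can be measured directly. The sketch below uses made-up class names and compares the size of a plain struct with one that has virtual members. On common implementations the difference is a hidden per-object pointer to the vtable. The vtable itself is a separate per-class cost that sizeof does not show and that has to be read from the map file.

```cpp
#include <stddef.h>

// No virtual members: the object is just its data.
struct Plain {
    int value;
    int get() const { return value; }
};

// Virtual members: common implementations add a hidden pointer to the
// class's vtable in every object, plus one vtable per class.
struct WithVirtual {
    int value;
    virtual ~WithVirtual() {}
    virtual int get() const { return value; }
};

// Extra bytes carried by every WithVirtual object compared to Plain.
inline size_t per_object_overhead() {
    return sizeof(WithVirtual) - sizeof(Plain);
}
```

For a handful of long-lived objects this overhead is often negligible. It starts to matter when thousands of small objects each carry the extra pointer.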
