JNA Fortran performance tuning

I'm wrapping native code (mostly Fortran 77) using JNA. The output (i.e. the results) of the native function consists of a bunch of nested (custom) types/structs, which I map to corresponding Structures in JNA. These Structures mostly consist of an array of other Structures (so Structure A holds an array of Structure B, Structure B holds an array of Structure C, etc.).
Using some benchmarking (mainly by logging time differences) I've found that most of the time is spent not in the native code but in JNA's mapping: the Fortran subroutine call takes about 50 ms, but the total time is 250 ms.
I've found that
Calling .setAutoWrite(false) on our Structures reduces overhead by about a factor of 2 (total execution time almost halves)
Keeping (statically allocated) arrays as small as possible helps to keep JNA overhead low
Changing DOUBLE PRECISION (double) to REAL (float) does not seem to make any difference
Are there any further tricks to optimize JNA performance in our case? I know I could flatten my structures down to a 1D array of primitives and use direct mapping, but I'd like to avoid that (because it would be a pain to encode/decode these structures)

As noted in the JNA FAQ, direct mapping would be your best performance increase, but you've excluded that as an option. It also notes that the calling overhead for each native call is another performance hit, which you've partially addressed by changing setAutoWrite().
You also did mention flattening your structures to an array of primitives, but rejected that due to encoding/decoding complexity. However, moving in this direction is probably the next best choice, and it's possible that the biggest performance issue you're currently facing is a combination of JNA's Structure access using reflection and native reads. Oracle notes:
Because reflection involves types that are dynamically resolved,
certain Java virtual machine optimizations can not be performed.
Consequently, reflective operations have slower performance than their
non-reflective counterparts, and should be avoided in sections of code
which are called frequently in performance-sensitive applications.
Since you are here asking a performance-related question and using JNA Structures, I can only assume you're writing a "performance-sensitive application". Internally, the Structure does this:
for (StructField structField : fields().values()) {
readField(structField);
}
which does a single Native read for each field, followed by this, which ends up using reflection under the hood.
setFieldValue(structField.field, result, true);
The moral of the story is that with Structures, each field normally involves a native read plus a reflection write, or a reflection read plus a native write.
The first step you can take without making any other changes is to call setAutoSynch(false) on the structure. (You've already done half of this with the "write" version; this disables both read and write.) From the docs:
For extremely large or complex structures where you only need to
access a small number of fields, you may see a significant performance
benefit by avoiding automatic structure reads and writes. If auto-read
and -write are disabled, it is up to you to ensure that the Java
fields of interest are synched before and after native function calls
via readField(String) and writeField(String,Object). This is typically
most effective when a native call populates a large structure and you
only need a few fields out of it. After the native call you can call
readField(String) on only the fields of interest.
To really go all out, flattening will possibly help a little more by getting rid of the remaining reflection overhead. The trick is making the offset conversions easy.
Some directions to go, balancing complexity vs. performance:
To write to native memory, allocate and clear a buffer of bytes (mem = new Memory(size); mem.clear(); or just new byte[size]), and write specific fields to the byte offset you determine using the value from Structure.fieldOffset(name). This does use reflection, but you could do this once for each structure and store a map of name to offset for later use.
For reading from native memory, make all your native read calls using a flat buffer to reduce the native overhead to a single read/write. You can cast that buffer to a Structure when you read it (incurring reflection for each field once) or read specific byte offsets per the above strategy.


C++ Classes for High Performance Computing

According to this Quora answer,
One of the simplest rules of thumb is to remember that hardware loves arrays, and is highly optimized for iteration over arrays. A simple optimization for many problems is just to stop using fancy data structures and just use plain arrays (or std::vectors in C++). This can take some getting used to.
Are C++ classes one of those "fancy data structures," i.e. a kind of data type that can be replaced by arrays to achieve a higher performance in a C++ program?
If your class looks like this:
struct Person {
double age;
double income;
size_t location;
};
then you might benefit from rearranging to
std::vector<double> ages;
std::vector<double> incomes;
std::vector<size_t> locations;
But it depends on your access patterns. If you frequently access multiple elements of a person at a time, then having the elements blocked together makes sense.
If your class looks like this:
struct Population {
std::vector<double> many_ages;
std::vector<double> many_incomes;
std::vector<size_t> many_locations;
};
Then you're using the form your resource recommended. Using any one of these arrays individually is faster than using the first class, but using elements from all three arrays simultaneously is probably slower with the second class.
Ultimately, you should structure your code to be as clean and intuitive as possible. The biggest source of speed will be a strong understanding and appropriate use of algorithms, not the memory layout. I recommend disregarding this unless you already have strong HPC skills and need to squeeze maximum performance from your machine. In almost every other case your development time and sanity are worth far more than saving a few clock cycles.
More broadly
An interesting paper related to this is SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems. A lot of work has gone into mapping ML algorithms to GPUs and, for ML applications, getting the memory layout right does make a real difference since so much time is spent on training and GPUs are optimized specifically for contiguous-array processing. But, the authors of the paper contend that even here if you understand algorithms well you can beat specialized hardware with optimized memory layouts, and they demonstrate this by getting their CPU to train 3.5x faster than their GPU.
More broadly, your question deals with the idea of cache misses. Since a cache miss is roughly 200x more expensive than an L1 cache reference, if your data layout is optimized to your computation, then you can really save time. However, as the above suggests, it is rarely the case that simply rearranging your data magically makes everything faster. Consider matrix multiplication. It's the perfect example because the data is laid out in a single array, as requested by your resource. However, for a simple triple-loop matrix-multiplication (GEMM) implementation there are still 6 ways to arrange your loops. Some of these ways are much more efficient than others, but none of them gives you anywhere near peak performance. Read through this step-by-step explanation of matmult to get a better sense of all the algorithmic optimizations necessary to get good performance.
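To make the loop-ordering point concrete, here is a minimal sketch (assuming square, row-major matrices) of two of the six orderings. Both perform identical arithmetic, but the second streams through B and C contiguously in the inner loop and is typically much faster:

#include <cstddef>
#include <vector>

// ijk order: the inner loop strides down a column of B (cache-hostile).
void matmul_ijk(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

// ikj order: the inner loop walks rows of B and C sequentially
// (cache-friendly). C must be zero-initialized before the call.
void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const double a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}

Even the better ordering is nowhere near peak performance; blocking, vectorization, and the other algorithmic optimizations do the rest.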
What the above should demonstrate is that even for situations in which we have only a few arrays laid out exactly as your resource suggests, the layout alone doesn't give us the speed. Good algorithms do. Data layout considerations, if any, flow from the algorithms we choose and higher-level hardware constraints.
If this is so for simple arrays and operations like matrix multiplication, by extension you should also expect it to be so for "fancy data structures" as well.
Are C++ classes one of those "fancy data structures,"
A C++ class is a construct which can be used to create a data type. It can be used to create data structures such as lists, queues etc.
i.e. a kind of data type
A class is a data type
that can be replaced by arrays
A class and array are not interchangeable. Arrays are data structures. You are comparing apples with oranges.
to achieve a higher performance in a C++ program?
That depends on how you implement your class
Are C++ classes one of those "fancy data structures,"
I think they are referring in particular to containers like std::map, std::deque, std::list etc, that hold data in many different heap allocations and therefore iterating over the contents of the container requires the CPU to "hop around" in the RAM address space to some extent, rather than just reading RAM sequentially. It's that hopping-around that often limits performance, as the CPU's on-board memory caches are less effective at avoiding execution-stalls due to RAM latency when future RAM access locations aren't easily predictable.
A C++ class on its own may or may not encourage non-sequential RAM access; whether it does or not depends entirely on how the class was implemented (and in particular on whether it is holding its data via multiple heap allocations). The std::vector class (mentioned in the forum text) is an example of a C++ class that does not require any non-sequential memory accesses as you iterate across its contents.
Are C++ classes one of those "fancy data structures," i.e. a kind of data type that can be replaced by arrays to achieve a higher performance in a C++ program?
Both computer time and your development time are valuable.
Don't optimize code unless you are sure it is taking most of the CPU time.
So first use a profiler (e.g. gprof) and read the documentation of your C or C++ compiler (e.g. GCC). Compilers are capable of fancy optimizations.
If you really care about HPC, learn about GPGPU programming with e.g. OpenCL or OpenACC.
If you happen to use Linux (a common OS in the HPC world), read Advanced Linux Programming, then syscalls(2), then time(7).

How can in-memory data structures be pre-initialized for ROM-based programs?

Consider STL's unordered_map. The same template class is used both for hashtables that are generated at runtime and for hashtables comprised of compile-time constant values. While recent versions of C++ add constexpr support, it does not extend to more complex operations involving the free store; consequently, building a hashtable from compile-time constants must still happen at runtime, making it just as expensive as building any other hashtable at runtime.
Ideally, a perfect compiler would see this and pre-evaluate the hashtable construction at compile-time and embed it directly within the program.
It got me thinking about retrocomputing and microcontrollers, which would conceivably have their software written in C or C++ given the development cost of assembly: these environments often have limited RAM but plenty of ROM, and those in-memory data structures (such as an unordered_map) certainly could be pre-generated and persisted to ROM entirely at compile-time.
As mentioned, the C++ language does not support this for non-trivial constexpr. I understand you could hack this together, assuming you can base your complex data structure on an array type or reduce it down to a constexpr - or write it all in assembly, manually setting each byte of the structure in a hex editor, and hope it matches the compiler's representation of your struct types (for example).
How is it done today then? And how was it done in the days of 16-bit and 32-bit games consoles where RAM and CPU cycles were at a premium? I'm especially keen to learn about ROM cartridge-based games where the structures are immediately accessible as raw memory.
In C++ microcontroller systems, all constructors of objects with static storage duration are called during boot-up, around the point where the .data and .bss segments are initialized, before main() is called. This is one of several reasons why C++ tends to be unsuitable for such systems - it actually executes application code during the start-up phase.
The kind of applications you mention, like old video games, most likely had tables pre-calculated and written in ROM. I doubt that C++ was used much for such games, or if it was, they used a restricted subset of the language.
unordered_map is a highly inefficient type of data structure that you'd never use in that kind of setting. More reasonable things like arrays and trees are easy to do with static initializers (as sketched below). If it gets more complex, you write a program to generate C containing the wanted static initializers and run it on a system that can handle it.
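As a concrete illustration of the static-initializer route (names and entries invented for this sketch): a sorted const table that a typical embedded toolchain places in ROM (.rodata), searched with a binary search, so nothing is constructed at runtime:

#include <cstdint>
#include <cstring>

struct Entry {
    const char* key;
    std::uint16_t value;
};

// Sorted by key at table-generation time (by hand or by a generator
// program); const + constant initializers let the linker keep it in ROM.
static const Entry kTable[] = {
    {"alpha", 1},
    {"beta",  2},
    {"gamma", 3},
};

const Entry* lookup(const char* key) {
    int lo = 0, hi = int(sizeof kTable / sizeof kTable[0]) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = std::strcmp(key, kTable[mid].key);
        if (cmp == 0) return &kTable[mid];
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return nullptr;  // not found
}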
Back in "Ye Olden Days" - the early 80's and 90's - RAM was very expensive, flash was both expensive and unreliable, but ROM, particularly mask ROM, was cheap.
Video game consoles usually executed the games from ROM, using a tiny amount of RAM as "scratchpad" memory. For example, the original NES console had 2048 bytes of RAM.
Compiled languages weren't used in game development, so to answer your original question: data structures were initialized by copying an empty structure from ROM to RAM.

I need an infinite bit-mask in C++

Skippable Context: I have a simulation loop (using fixed update but variable rendering pattern), which instantiates entities of classes that are generated on the fly according to user-input and/or file configurations from a database of millions of components (of the container state modifying pattern).
I have implemented a system that automagically... deduces (some kind of logic cell/math I don't know the name of) and applies needed components whenever the user input/config ignores the fact that one of their options requires additional components.
How so? Many of the components are complex formulae or fuzzy logic (gates?) or other complex scientific reasoning, coded in a way that can operate my simulation's structure, its objects, its environment, thus sometimes one component relies on the other and I need the 'deduction algorithm/system' to be able to pass that reliance to the class constructor.
I've used maximum granularity in the way I decided to store these 'knowledge pieces', because I really can't waste any memory given the sizes and compute intensity of the simulations and the number of individual instances. But now I'm running into the issue of a single instance needing thousands, sometimes tens of thousands, of components, and I need the instance's "creation-map" both saved and still bound as a private member so I can: 1st - know where my deductions are leading the constructor of instances, and maybe be able to use memoization to reduce build times; 2nd - implement injection of changes into a live instance during the simulation¹.
What I think I need: a possibly infinite, or at least very long, bit-mask, so I can iterate the creation faster and keep the final component-tree of my dynamically-constructed objects logged for future use.
Possible approaches, which I have no idea will work: 1st - Manually and sequentially store values for the bit flags in each RAM cell, using the RAM wafer as my bit-mask. 2nd - Break the map down into smaller bit-masks of known size (hard, because the final number of components is unknown until the creation is done, and decoupling the deduction might be impossible without refactoring the entire system). 3rd - Figure out a way to make an infinite bit-mask, or use some library that has implemented a really long integer (5.12e+11 or bigger).
My code is written in C++, my rendering and compute kernels are Vulkan.
My objective question:
How do I implement an arbitrarily long bit-mask that is memory and compute efficient?
If I am allowed a bonus question, what is the most efficient way to implement such a bit-mask, assuming I have no architecture (software and hardware) constraints?
¹ I cannot browse an object's tree during the simulation and I also cannot afford to pause the simulation and wait for the browsing to be done before injecting the modification, I need to know and be able to make an arbitrary change on an arbitrary frame of simulation in both a pre-scheduled manner and a real-time manner.
what is the most efficient way to implement such a bit-mask
Put std::vector<bool> there and profile. If/when you discover you're spending significant time working with that vector, read further.
There’s no most efficient way, it all depends on what exactly does your code do with these bits.
One standard approach is to keep a contiguous vector of integers and treat it as a vector of bits. For example, std::vector<bool> on Windows uses 32-bit values, keeping 32 bools in each one.
However, the API of std::vector is too general. If your vector has a majority of 0s, or a majority of 1s, you'll be able to implement many operations (get value, find first/next/previous 0/1) much faster than std::vector<bool> does.
Also, depending on your access patterns, you may want either smaller or larger elements. If you decide to use large SIMD types, i.e. __m128i or __m256i, don't forget about their alignment requirements.
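As a starting point beyond std::vector<bool>, here is a minimal sketch of the word-based approach described above (64-bit words assumed; std::countr_zero requires C++20). It grows on demand, so the final number of components need not be known up front, and find_first skips 64 zero bits at a time, which is where the win over a bit-by-bit scan comes from on sparse masks:

#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

class DynamicBitset {
public:
    static constexpr std::size_t npos = std::size_t(-1);

    void set(std::size_t i) {
        if (i / 64 >= words_.size()) words_.resize(i / 64 + 1, 0);
        words_[i / 64] |= std::uint64_t(1) << (i % 64);
    }
    bool test(std::size_t i) const {
        return i / 64 < words_.size() && ((words_[i / 64] >> (i % 64)) & 1);
    }
    // Index of the first set bit, or npos; whole zero words are skipped.
    std::size_t find_first() const {
        for (std::size_t w = 0; w < words_.size(); ++w)
            if (words_[w])
                return w * 64 + std::size_t(std::countr_zero(words_[w]));
        return npos;
    }
private:
    std::vector<std::uint64_t> words_;
};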

Not enough memory in C++ : write to file instead, read data in when needed?

I'm developing a tool for wavelet image analysis and machine learning on Linux machines in C++.
It is limited by the size of the images, the number of scales and their corresponding filters (up to 2048x2048 doubles) for each of N orientations as well as additional memory and processing overhead by a machine learning algorithm.
Unfortunately my Linux system programming skills are shallow at best, so I'm currently not using swap, but figure it should be possible somehow?
I'm required to keep the imaginary and real parts of the filtered images of each scale and orientation, as well as the corresponding wavelets, for reconstruction purposes. I keep them in memory for additional speed for small images.
Regarding the memory use: I already
store everything no more than once,
only what is needed,
cut out any double entries or redundancy,
pass by reference only,
use pointers over temporary objects,
free memory as soon as it is not required any more and
limit the number of calculations to the absolute minimum.
As with most data processing tools, speed is of the essence. As long as there is enough memory, the tool is about 3x as fast as the same implementation in Matlab.
But as soon as I'm out of memory, nothing works any more. Unfortunately, most of the images I'm training the algorithm on are huge (raw data of 4096x4096 double entries, even larger after symmetric padding), so I hit the ceiling quite often.
Would it be bad practise to temporarily write data that is not needed for the current calculation / processing step from memory to the disk?
What approach / data format would be most suitable to do that?
I was thinking of using rapidXML to read and write an XML to a binary file and then read out only the required data. Would this work?
Is a memory-mapped file what I need? https://en.wikipedia.org/wiki/Memory-mapped_file
I'm aware that this will result in performance loss, but it is more important that the software runs smoothly and does not freeze.
I know that there are libraries out there that can do wavelet image analysis, so please spare the "Why reinvent the wheel, just use XYZ instead". I'm using very specific wavelets, I'm required to do it myself and I'm not supposed to use external libraries.
Yes, writing data to the disk to save memory is bad practice.
There is usually no need to manually write your data to the disk to save memory, unless you are reaching the limits of what you can address (4GB on 32bit machines, much more in 64bit machines).
The reason for this is that the OS is already doing exactly the same thing. It is very possible that your own solution would be slower than what the OS is doing. Read this Wikipedia article if you are not familiar with the concept of paging and virtual memory.
Did you look into using mmap and munmap to bring the images (and temporary results) into your address space and discard them when you no longer need them? mmap allows you to map the content of a file directly into memory: no more fread/fwrite, just direct memory access. Writes to the memory region are written back to the file too, and bringing back that intermediate state later on is no harder than redoing the mmap.
The big advantages are:
no encoding in a bloated format like XML
perfectly suitable for transient results such as matrices that are represented in contiguous memory regions.
Dead simple to implement.
Completely delegate to the OS the decision of when to swap in and out.
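For illustration, a minimal sketch of the idea (POSIX calls; error handling trimmed, and the path and element count are placeholders):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Back a matrix of doubles with a file. Writes to the returned pointer
// land in the page cache and eventually in the file; the OS decides
// when to evict pages, so memory pressure is handled for you.
double* map_matrix(const char* path, std::size_t count) {
    const std::size_t bytes = count * sizeof(double);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    ftruncate(fd, static_cast<off_t>(bytes));   // size the backing file
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                                  // the mapping keeps the file alive
    return static_cast<double*>(p);
}

// munmap(ptr, bytes) when the intermediate result is not needed;
// a later mmap of the same file brings the state back.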
This doesn't solve your fundamental problem, but: Are you sure you need to be doing everything in double precision? You may not be able to use integer coefficient wavelets, but storing the image data itself in doubles is usually pretty wasteful. Also, 4k images aren't very big ... I'm assuming you are actually using frames of some sort, so you have redundant entries; otherwise your numbers don't seem to add up (and are you storing them sparsely?) ... or maybe you are just using a large number at once.
As for "should I write to disk"? This can help, particularly if you are getting a 4x increase (or more) by taking image data to double precision. You can answer it for yourself though: just measure the time to load and compare it to your compute time to see if this is worth pursuing. The wavelet itself should be very cheap, so I'm guessing you're mostly dominated by your learning algorithm. In that case, go ahead and throw out the original data or whatever until you need it again.

Optimizing for space instead of speed in C++

When you say "optimization", people tend to think "speed". But what about embedded systems where speed isn't all that critical, but memory is a major constraint? What are some guidelines, techniques, and tricks that can be used for shaving off those extra kilobytes in ROM and RAM? How does one "profile" code to see where the memory bloat is?
P.S. One could argue that "prematurely" optimizing for space in embedded systems isn't all that evil, because you leave yourself more room for data storage and feature creep. It also allows you to cut hardware production costs because your code can run on smaller ROM/RAM.
P.P.S. References to articles and books are welcome too!
P.P.P.S. These questions are closely related: 404615, 1561629
My experience from an extremely constrained embedded memory environment:
Use fixed size buffers. Don't use pointers or dynamic allocation because they have too much overhead.
Use the smallest int data type that works.
Don't ever use recursion. Always use looping.
Don't pass lots of function parameters. Use globals instead. :)
There are many things you can do to reduce your memory footprint; I'm sure people have written books on the subject, but a few of the major ones are:
Compiler options to reduce code size (including -Os and packing/alignment options)
Linker options to strip dead code
If you're loading from flash (or ROM) to ram to execute (rather than executing from flash), then use a compressed flash image, and decompress it with your bootloader.
Use static allocation: a heap is an inefficient way to allocate limited memory, and it might fail due to fragmentation if memory is constrained.
Tools to find the stack high-watermark (typically they fill the stack with a pattern, execute the program, then see where the pattern remains - see the sketch after this list), so you can set the stack size(s) optimally
And of course, optimising the algorithms you use for memory footprint (often at expense of speed)
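For the stack high-watermark item, a hypothetical sketch of the fill-pattern technique (a descending stack is assumed, and the symbol name is invented; the real one comes from your linker script):

#include <stdint.h>

extern uint32_t __stack_start[];  // lowest address of the stack region

#define STACK_FILL 0xDEADBEEFu

void stack_paint(void) {
    uint32_t marker;              // a local approximates the current SP
    // Paint from the bottom of the stack up to just below the live
    // frames, leaving some headroom so we don't clobber ourselves.
    for (uint32_t* p = __stack_start; p + 16 < &marker; ++p)
        *p = STACK_FILL;
}

uint32_t stack_unused_bytes(void) {
    const uint32_t* p = __stack_start;
    while (*p == STACK_FILL)
        ++p;                      // pattern intact => word never touched
    return (uint32_t)((p - __stack_start) * sizeof *p);
}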
A few obvious ones
If speed isn't critical, execute the code directly from flash.
Declare constant data tables using const. This will avoid the data being copied from flash to RAM.
Pack large data tables tightly using the smallest data types, and in the correct order to avoid padding.
Use compression for large sets of data (as long as the compression code doesn't outweigh the data)
Turn off exception handling and RTTI.
Did anybody mention using -Os? ;-)
Folding knowledge into data
One of the rules of Unix philosophy can help make code more compact:
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
I can't count how many times I've seen elaborate branching logic, spanning many pages, that could've been folded into a nice compact table of rules, constants, and function pointers. State machines can often be represented this way (State Pattern). The Command Pattern also applies. It's all about the declarative vs imperative styles of programming.
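As an illustration (the states, events, and table contents here are invented), folding a state machine into data can be as small as this: a const transition table replaces pages of branching, and the engine that walks it stays trivial no matter how many rules are added:

enum State { IDLE, RUNNING, DONE, NUM_STATES };
enum Event { START, TICK, FINISH, NUM_EVENTS };

// The "knowledge" lives in this const table (ROM-able on embedded
// targets); adding a rule is a data edit, not new control flow.
static const State kTransition[NUM_STATES][NUM_EVENTS] = {
    /* IDLE    */ { RUNNING, IDLE,    IDLE },
    /* RUNNING */ { RUNNING, RUNNING, DONE },
    /* DONE    */ { DONE,    DONE,    DONE },
};

State step(State s, Event e) { return kTransition[s][e]; }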
Log codes + binary data instead of text
Instead of logging plain text, log event codes and binary data. Then use a "phrasebook" to reconstitute the event messages. The messages in the phrasebook can even contain printf-style format specifiers, so that the event data values are displayed neatly within the text.
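A minimal sketch of the idea, with the event codes and record layout invented for illustration - the target emits fixed-size binary records, and a host-side phrasebook turns them back into text:

#include <cstdint>
#include <cstdio>

enum : std::uint16_t { EV_TEMP_HIGH = 1, EV_LINK_DOWN = 2 };

struct LogRecord {
    std::uint16_t code;   // index into the host's phrasebook
    std::int32_t  arg;    // raw event data, formatted on the host
};

void log_event(std::FILE* sink, std::uint16_t code, std::int32_t arg) {
    LogRecord r{code, arg};
    std::fwrite(&r, sizeof r, 1, sink);  // 8 bytes instead of a text line
}

// Host phrasebook (not on the target):
//   1 -> "temperature high: %d C"
//   2 -> "link %d went down"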
Minimize the number of threads
Each thread needs its own memory block for a stack and TSS. Where you don't need preemption, consider making your tasks execute co-operatively within the same thread (cooperative multi-tasking), as sketched below.
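A hedged sketch of that cooperative alternative (the task names are invented): one loop, one stack; each task runs briefly and returns, so no per-thread stacks or preemption machinery are needed:

// Each task is a short function that never blocks; it keeps its own
// state in statics/structs and returns quickly.
typedef void (*Task)(void);

static void poll_sensors(void)   { /* ... */ }
static void update_display(void) { /* ... */ }
static void handle_comms(void)   { /* ... */ }

static const Task kTasks[] = { poll_sensors, update_display, handle_comms };

void scheduler_run(void) {
    for (;;)                                   // round-robin, single stack
        for (unsigned i = 0; i < sizeof kTasks / sizeof kTasks[0]; ++i)
            kTasks[i]();
}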
Use memory pools instead of hoarding
To avoid heap fragmentation, I've often seen separate modules hoard large static memory buffers for their own use, even when the memory is only occasionally required. A memory pool could be used instead so that the memory is only used "on demand". However, this approach may require careful analysis and instrumentation to make sure pools are not depleted at runtime.
Dynamic allocation only at initialization
In embedded systems where only one application runs indefinitely, you can use dynamic allocation in a sensible way that doesn't lead to fragmentation: just dynamically allocate once in your various initialization routines, and never free the memory. reserve() your containers to the correct capacity and don't let them auto-grow. If you need to frequently allocate/free buffers of data (say, for communication packets), then use memory pools; a sketch of one follows. I once even extended the C/C++ runtime so that it would abort my program if anything tried to dynamically allocate memory after the initialization sequence.
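A minimal sketch of such a fixed-block pool, assuming block size and count are known at compile time - all memory is claimed once, and alloc/release never touch the heap, so fragmentation cannot occur:

#include <cstddef>

template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
    static_assert(BlockSize >= sizeof(void*), "blocks must hold a pointer");
    static_assert(BlockSize % alignof(void*) == 0, "blocks must stay aligned");
public:
    FixedPool() {
        // Thread every block onto the free list once, at init time.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            char* block = storage_ + i * BlockSize;
            *reinterpret_cast<char**>(block) = free_;
            free_ = block;
        }
    }
    void* alloc() {                  // O(1); nullptr when depleted
        if (!free_) return nullptr;
        char* p = free_;
        free_ = *reinterpret_cast<char**>(p);
        return p;
    }
    void release(void* p) {          // O(1); back onto the free list
        *static_cast<char**>(p) = free_;
        free_ = static_cast<char*>(p);
    }
private:
    alignas(void*) char storage_[BlockSize * BlockCount];
    char* free_ = nullptr;
};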
As with all optimization, first optimize algorithms, second optimize the code and data, finally optimize the compiler.
I don't know what your program does, so I can't advice on algorithms. Many others have written about the compiler. So, here's some advice on code and data:
Eliminate redundancy in your code. Any repeated code that's three or more lines long, repeated three times in your code, should be changed to a function call.
Eliminate redundancy in your data. Find the most compact representation: merge read-only data, and consider using compression codes.
Run the code through a regular profiler; eliminate all code that isn't used.
Generate a map file from your linker. It will show how the memory is allocated. This is a good start when optimizing for memory usage. It also will show all the functions and how the code-space is laid out.
Here's a book on the subject: Small Memory Software: Patterns for Systems with Limited Memory.
Compile in VS with /Os. Oftentimes this is even faster than optimizing for speed anyway, because smaller code size == less paging.
Comdat folding should be enabled in the linker (it is by default in release builds)
Be careful about data structure packing; often this results in the compiler generating more code (== more memory) to access unaligned memory. Using 1 bit for a boolean flag is a classic example.
Also, be careful when choosing a memory efficient algorithm over an algorithm with a better runtime. This is where premature optimizations come in.
Ok most were mentioned already, but here is my list anyway:
Learn what your compiler can do. Read compiler documentation, experiment with code examples. Check settings.
Check generated code at your target optimization level. Sometimes the results are surprising, and often it turns out the optimization actually slows things down (or just takes too much space).
Choose a suitable memory model. If you target a really small, tight system, a large or huge memory model might not be the best choice (but is usually the easiest to program for...).
Prefer static allocation. Use dynamic allocation only at startup, or over a statically allocated buffer (a pool, or a maximum-instance-sized static buffer).
Use C99-style data types. Use the smallest sufficient data type for storage; local variables like loop variables are sometimes more efficient with "fast" data types.
Select inline candidates. Some parameter-heavy functions with relatively simple bodies are better off inlined. Or consider passing a structure of parameters. Globals are also an option, but be careful - tests and maintenance can become difficult if anyone touching them isn't disciplined enough.
Use the const keyword well, and be aware of array initialization implications.
Map file, ideally also with module sizes. Check also what is included from the CRT (is it really necessary?).
Recursion - just say no (limited stack space).
Floating point numbers - prefer fixed-point math (see the sketch after this list). Floating point tends to include and call a lot of code (even for simple addition or multiplication).
C++ - you should know C++ VERY WELL. If you don't, program constrained embedded systems in C, please. Those who dare must be careful with all advanced C++ constructs (inheritance, templates, exceptions, overloading, etc.). Consider keeping close-to-HW code rather Super-C, with C++ used where it counts: in high-level logic, GUI, etc.
Disable whatever you don't need in compiler settings (be it parts of libraries, language constructs, etc.)
Last but not least - while hunting for smallest possible code size - don't overdo it. Watch out also for performance and maintainability. Over-optimized code tends to decay very quickly.
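To give the fixed-point item above a concrete shape, here is a minimal Q16.16 sketch (the format choice is an assumption): addition is a plain integer add, and multiplication is a widening multiply plus a shift, so no floating-point library gets linked in:

#include <cstdint>

typedef std::int32_t fix16;  // Q16.16: 16 integer bits, 16 fraction bits

inline fix16 fix16_from_int(int x)       { return (fix16)(x << 16); }
inline fix16 fix16_add(fix16 a, fix16 b) { return a + b; }  // plain integer add
inline fix16 fix16_mul(fix16 a, fix16 b) {
    return (fix16)(((std::int64_t)a * b) >> 16);  // widen, multiply, rescale
}

// Example: 2.5 * 1.5 == 3.75 (0x28000 * 0x18000 -> 0x3C000)
// fix16 r = fix16_mul(fix16_from_int(5) / 2, fix16_from_int(3) / 2);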
Firstly, tell your compiler to optimize for code size. GCC has the -Os flag for this.
Everything else is at the algorithmic level - use similar tools that you would for finding memory leaks, but instead look for allocs and frees that you could avoid.
Also take a look at commonly used data structure packing - if you can shave a byte or two off them, you can cut down memory use substantially.
If you're looking for a good way to profile your application's heap usage, check out valgrind's massif tool. It will let you take snapshots of your app's memory usage profile over time, and you can then use that information to better see where the "low hanging fruit" is, and aim your optimizations accordingly.
Profiling code or data bloat can be done via map files: for gcc see here, for VS see here.
I have yet to see a useful tool for size profiling though (and don't have time to fix my VS AddIn hack).
On top of what others suggest:
Limit use of C++ features; write like in ANSI C with minor extensions. Standard (std::) templates use a large system of dynamic allocation. If you can, avoid templates altogether. While not inherently harmful, they make it way too easy to generate lots and lots of machine code from just a couple of simple, clean, elegant high-level instructions. This encourages writing in a way that - despite all the "clean code" advantages - is very memory hungry.
If you must use templates, write your own or use ones designed for embedded use, pass fixed sizes as template parameters, and write a test program so you can test your template AND check your -S output to ensure the compiler is not generating horrible assembly code to instantiate it.
Align your structures by hand, or use #pragma pack
{ char a; long b; char c; long d; char e; char f; }  // is 20 bytes (assuming 4-byte long and natural alignment),
{ char a; char c; char e; char f; long b; long d; }  // is 12 bytes.
For the same reason, use a centralized global data storage structure instead of scattered local static variables.
Intelligently balance usage of malloc()/new and static structures.
If you need a subset of functionality of given library, consider writing your own.
Unroll short loops.
for(i=0;i<3;i++){ transform_vector[i]; }
is longer than
transform_vector[0];
transform_vector[1];
transform_vector[2];
Don't do that for longer ones.
Pack multiple files together to let the compiler inline short functions and perform various optimizations the linker can't.
Don't be afraid to write 'little languages' inside your program. Sometimes a table of strings and an interpreter can get a LOT done. For instance, in a system I've worked on, we have a lot of internal tables, which have to be accessed in various ways (loop through, whatever). We've got an internal system of commands for referencing the tables that forms a sort of half-way language that's quite compact for what it gets done.
But, BE CAREFUL! Know that you are writing such things (I wrote one accidentally, myself), and DOCUMENT what you are doing. The original developers do NOT seem to have been conscious of what they were doing, so it's much harder to manage than it should be.
Optimizing is a popular term but often technically incorrect. It literally means to make optimal. Such a condition is never actually achieved for either speed or size. We can simply take measures to move toward optimization.
Many (but not all) of the techniques used to move toward minimum time to a computing result sacrifices memory requirement, and many (but not all) of the techniques used to move toward minimum memory requirement lengthens the time to result.
Reduction of memory requirements amounts to a fixed number of general techniques. It is difficult to find a specific technique that does not neatly fit into one or more of these. If you did all of them, you'd have something very close to the minimal space requirement for the program if not the absolute minimum possible. For a real application, it could take a team of experienced programmers a thousand years to do it.
1. Remove all redundancy from stored data, including intermediates.
2. Remove all need for storing data that could be streamed instead.
3. Allocate only the number of bytes needed, never a single byte more.
4. Remove all unused data.
5. Remove all unused variables.
6. Free data as soon as it is no longer possibly needed.
7. Remove all unused algorithms and branches within algorithms.
8. Find the algorithm that is represented in the minimally sized execution unit.
9. Remove all unused space between items.
This is a computer science view of the topic, not a developer's one.
For instance, packing a data structure is an effort that combines (3) and (9) above. Compressing data is a way to at least partly achieve (1) above. Reducing overhead of higher level programming constructs is a way to achieve some progress in (7) and (8). Dynamic allocation is an attempt to exploit a multitasking environment to employ (3). Compilation warnings, if turned on, can help with (5). Destructors attempt to assist with (6). Sockets, streams, and pipes can be used to accomplish (2). Simplifying a polynomial is a technique to gain ground in (8).
Understanding the meaning of these nine techniques and the various ways to achieve them is the result of years of learning and checking memory maps resulting from compilation. Embedded programmers often learn them more quickly because of the limited memory available.
Using the -Os option on a GNU compiler asks the compiler to find patterns that can be transformed to accomplish these; -Os is an aggregate flag that turns on a number of optimization features, each of which attempts to perform transformations to accomplish one of the 9 tasks above.
Compiler directives can produce results without programmer effort, but automated processes in the compiler rarely correct problems created by lack of awareness in the writers of the code.
Bear in mind the implementation cost of some C++ features, such as virtual function tables and overloaded operators that create temporary objects.
Along with what everyone else said, I'd just like to add: don't use virtual functions, because with virtual functions a vtable must be created, which can take up who knows how much space.
Also watch out for exceptions. With gcc, I don't believe the size grows for each try-catch block (except for 2 function calls per try-catch), but there is a fixed-size support function that must be linked in, which could be wasting precious bytes.