2 Arrays vs Array of Structures with 2 data members - c++

I was wondering which one is better, 2 Arrays or the Array of a structure with 2 data members.
I want insights regarding:
Is struct a kind of wrapper which takes extra memory? I am aware of padding in structures to fit it into word limits.
Which one is faster if I want to access both the data members together? I think its array of structures.
And does an array of a structure may occupy more memory than 2 arrays due to the possibility of structure padding.
Answers both in general context and language specific are welcome.
And please don't suggest having a look at SoA vs AoS questions, already done that.

It entirely depends on what you're trying to do, neither answer is always going to be "correct".
Outside of compiler-specific padding, structs do not take up any extra memory unless you make it virtual, in which case it'll get a vtable pointer but that's it.
As long as you're targeting a machine with a cache large enough to fit two pages (usually 4KB each, but check for your specific CPU), it doesn't matter and you should choose whichever is easier to work with and makes more sense in your code. The array of structs will use one page and cause a cache miss once for every 4KB of structs you load, while the array of values will load two pages that cause two cache misses half as often. If you do happen to be working with a dinky cache that only allows one cache for your program data, then yes, it'll be much faster to use an array of structs since the alternative would cause cache misses on every read.
Same answer as #1 - arrays will never have their own padding, but a struct might have padding built into it by your compiler.
Struct padding though depends entirely on your compiler, which probably has flags to turn it on or off or set the maximum pad size or whatever. Inspect the raw data of an array of your objects to see if they have padding, and if so, find out how to turn that off in your compiler if you need that memory.
What compiler are you using, and what are you trying to do with your project?
And perhaps more importantly: What stage is your project in, and are you running into speed issues already? Pre-optimization is the root of all evils, and you are likely wasting your time worrying about the issue.

Structures are padded to allow optimal access to the members by the CPU so they may take more memory. The fields may be already aligned so no padding needed. So they are not a wrapper in the sense that they are wrapped with data always. Think of structure padding/adjustment of optimizations by the compiler.
Structures will be faster together as the entire structure is likely to fit in cache together. If you have separate lists, they may fall out of the cache.
If padded, yes.
Don't forget one important reason to keep the data together: code readability. If you are planning to process each field independently a different thread. You may get performance improvements if you use arrays.

Related

Most efficient way to grow array C++

Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and it's size increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird the OS memory allocation should take care of most fragmentation issues and the fact that it's hard to find a 800MB free memory block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
Yet I tend to avoid this solution for the most part at these kinds of input scales, but for simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, and we end up mapping the allocated pages to DRAM immediately on reserving capacity for such a vector. Then we might end up leading to a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.

Cache locality performance

If I had a C or C++ program where I was using say 20 integers throughout the program, would it improve performance to create an array of size 20 to store the integers and then create aliases for each number?
Would this improve the cache locality (rather than just creating 20 normal ints) because the ints would be loaded into the cache together as part of the int array(or at least , improve the chances of this)?
The question is how do you allocate space for them? I doubt that you just randomly do new int 20 times here and there in the code. If they are local variables then they will get on stack and get cached.
The main question is is that worth bothering? Try to write your program in readable and elegant way first, and then try to remove major bottlenecks and only after start messing with microoptimizations. If you are processing 20 ints, should not they be array essentially?
Also is it theoretical question? If it is, then yes, array will likely be cached better then 20 random areas in memory. If it is practical question, then I doubt that this is really important unless you are writing supercritical performance code, and even then microoptimizations are last thing to deal with.
It might improve performance a bit, yes. It might also completely ruin your performance. Or it might have no impact whatsoever because the compiler already did something similar for you. Or it might have no impact because you're just not using those integers often enough for this to make a difference.
It also depends on whether one or multiple threads access these integers, and whether they just read, or also modify the numbers. (if you have multiple threads and you write to those integers, then putting them in an array will cause false sharing which will hurt your performance far more than anything you'd hoped to gain)
So why don't you just try it?
There is no simple, single answer. The only serious answer you're going to get is "it depends". If you want to know how it would behave in your case, then you have two options:
try it and see what happens, or
gain a thorough understanding of how your CPU works, gather data on exactly how often these values are accessed and in which patterns, so you can make an educated guess at how the change would affect your performance.
If you choose #2, you'll likely need to follow it up with #1 anyway, to verify that your guess was correct.
Performance isn't simple. There are few universal rules, and everything depends on context. A change which is an optimization in one case might slow everything down in another.
If you're serious about optimizing your code, then there's no substitute for the two steps above. And if you're not serious about it, don't do it. :)
Yes, the theoretical chance of the 20 integers ending up on the same cache line would be higher, although I think a good compiler would almost always be able to replicate the same performance for you even when not using an array.
So, you currently have int positionX, positionY, positionZ;then somewhere else int fuzzy; and int foo;, etc to make about 20 integers?
And you want to do something like this:
int arr[20];
#define positionX arr[0]
#define positionY arr[1]
#define positionZ arr[2]
#define fuzzy arr[3]
#define foo arr[4]
I would expect that if there is ANY performance difference, it may make it slower, because the compiler will notice that you are using arr in some other place, and thus can't use registers to store the value of foo, since it sees that you call update_position which touches arr[0]..arr[2]. It depends on how finegrained the compilers detection of "we're touching the same data" is. And I suspect it may quite often be based on "object" rather than individual fields of an object - particularly for arrays.
However, if you do have data that is used close together, e.g. position variables, it would probably help to have them next to each other.
But I seriously think that you are wasting your time trying to put variables next to each other, and using an array is almost certainly a BAD idea.
It would likely decrease performance. Modern compilers will move variables around in memory when you're not looking, and may store two variables at the same address when they're not used concurrently. With your array idea, those variables cannot overlap, and must use distinct cache lines.
Yes this may improve your performance, however it may not, as its really Variables that get used together that should be stored together.
So if they are used together then yes. Variables and objects should really be declared in the function in which they are used as they will be stored on the stack (level-1 cache in most cases).
So yes if you are going to use them together i.e they are relevant to each other, then this would probably be a little more efficient, provding you also take into consideration how you allocate them memory.

Alignment of data members and member functions for performance

Is it true aligning data members of a struct/class no longer yields the benefits it used to, especially on nehalem because of hardware improvements? If so, is it still the case that alignment will always make better performance, just very small noticeable improvements compared with on past CPUs?
Does alignment of member variables extend to member functions? I believe I once read (it could be on the wikibooks "C++ performance") that there are rules for "packing" member functions into various "units" (i.e. source files) for optimum loading into the instruction cache? (If I have got my terminology wrong here please correct me).
Processors are still much faster than what the RAM can deliver, so they still need caches. Caches still consist of fixed-size cache lines. Also, main memory is delivered in pages and pages are accessed using a translation lookaside buffer. This buffer, again, has a fixed size cache.
Which means that both spatial and temporal locality matter a lot (i.e. how you pack stuff, and how you access it). Packing structures well (sorted by padding/alignment requirements) as opposed to packing them in some haphazard order usually results in smaller structure sizes.
Smaller structure sizes mean, if you have loads of data:
more structures fit into one cache line (cache miss = 50-200 cycles)
fewer pages are needed (page fault = 10-20 million CPU cycles)
fewer TLB entries are needed, fewer TLB misses (TLB miss = 50-500 cycles)
Going linearly over a few gigabytes of tightly packed SoA data can be 3 orders of magnitude faster (or 8-10 orders of magnitude, if page faults are involved) than doing the same thing in a naive way with bad layout/packing.
Whether or not you hand-align individual 4-byte or 2-byte values (say, a typical int or short) to 2 or 4 bytes makes a very small difference on recent Intel CPUs (hardly noticeable). Insofar, it may seem tempting to "optimize" on that, but I strongly advise against doing so.
This is usually something one best doesn't worry about and leaves to the compiler to figure out. If for no other reason, then because the gains are marginal at best, but some other processor architectures will raise an exception if you get it wrong. Therefore, if you try to be too smart, you'll suddenly have unexplainable crashes once you compile on some other architecture. When that happens, you'll feel sorry.
Of course, if you don't have at least several dozen of megabytes of data to process, you need not care at all.
Aligning data to suit the processor will never hurt, but some processors will have more notable drawbacks than others, I think is the best way to answer this question.
Aligning functions into cache-line units seems a bit of a red herring to me. For small functions, what you really want is inlining if at all possible. If the code can't be inlined, then it's probably larger than a cache-line anyway. [Unless it's a virtual function, of course]. I don't think this has ever been a huge factor tho - either code is generally called often, and thus normally in the cache, or it's not called very often, and not very often in the cache. I'm sure it's possibe to come up with some code where calling one function, func1() will also drag in func2() into the cache, so if you always call func1() and func2() in short succession, it would have some benefit. But it's really not something that is that great of a benefit unless you have a lot of functions with pairs or groups of functions that are called close together. [By the way, I don't think the compiler is guaranteed to place your function code in any particular order, no matter which order you place it in the source file].
Cache-alignment is a slightly different matter, since cache-lines can still have a HUGE effect if you get it right vs. getting it wrong. This is more important for multithreading than general "loading data". The key here is to avoid sharing data in the same cache-line between processors. In a project I worked on some 10 or so years ago, a benchmark had a function that used an array of two integers to count up the number of iterations each thread did. When that got split into two separate cache-lines, the benchmark improved from 0.6x of running on a single processor to 1.98x of one processor. The same effect will happen on modern CPU's, even if they are much faster - the effect may not be exactly the same, but it will be a large slowdown (and the more processors sharing data, the more effect, so a quad-core system would be worse than a dual core, etc). This is because every time a processor updates something in a cache-line, all other processors that have read that cache-line must reload it from the processor that updated it [or from memory in the old days].

Higher dimensional array vs 1-D array efficiency in C++

I'm curious about the efficiency of using a higher dimensional array vs a one dimensional array. Do you lose anything when defining, and iterating through an array like this:
array[i][j][k];
or defining and iterating through an array like this:
array[k + j*jmax + i*imax];
My inclination is that there wouldn't be a difference, but I'm still learning about high efficiency programming (I've never had to care about this kind of thing before).
Thanks!
The only way to know for sure is to benchmark both ways (with optimization flags on in the compiler of course). The one think you lose for sure in the second method is the clarity of reading.
The former way and the latter way to access arrays are identical once you compile it. Keep in mind that accessing memory locations that are close to one another does make a difference in performance, as they're going to be cached differently. Thus, if you're storing a high-dimensional matrix, ensure that you store rows one after the other if you're going to be accessing them that way.
In general, CPU caches optimize for temporal and spacial ordering. That is, if you access memory address X, the odds of you accessing X+1 are higher. It's much more efficient to operate on values within the same cache line.
Check out this article on CPU caches for more information on how different storage policies affect performance: http://en.wikipedia.org/wiki/CPU_cache
If you can rewrite the indexing, so can the compiler. I wouldn't worry about that.
Trust your compiler(tm)!
It probably depends on implementation, but I'd say it more or less amounts to your code for one-dimensional array.
Do yourself a favor and care about such things after profiling the code. It is very unlikely that something like that will affect the performance of the application as a whole. Using the correct algorithms is much more important
And even if it does matter, it is most certainly only a single inner loop that needs attention.

Optimizing for space instead of speed in C++

When you say "optimization", people tend to think "speed". But what about embedded systems where speed isn't all that critical, but memory is a major constraint? What are some guidelines, techniques, and tricks that can be used for shaving off those extra kilobytes in ROM and RAM? How does one "profile" code to see where the memory bloat is?
P.S. One could argue that "prematurely" optimizing for space in embedded systems isn't all that evil, because you leave yourself more room for data storage and feature creep. It also allows you to cut hardware production costs because your code can run on smaller ROM/RAM.
P.P.S. References to articles and books are welcome too!
P.P.P.S. These questions are closely related: 404615, 1561629
My experience from an extremely constrained embedded memory environment:
Use fixed size buffers. Don't use pointers or dynamic allocation because they have too much overhead.
Use the smallest int data type that works.
Don't ever use recursion. Always use looping.
Don't pass lots of function parameters. Use globals instead. :)
There are many things you can do to reduce your memory footprints, I'm sure people have written books on the subject, but a few of the major ones are:
Compiler options to reduce code size (including -Os and packing/alignment options)
Linker options to strip dead code
If you're loading from flash (or ROM) to ram to execute (rather than executing from flash), then use a compressed flash image, and decompress it with your bootloader.
Use static allocation: a heap is an inefficient way to allocate limited memory, and if it might fail due to fragmentation if it is constrained.
Tools to find the stack high-watermark (typically they fill the stack with a pattern, execute the program, then see where the pattern remains), so you can set the stack size(s) optimally
And of course, optimising the algorithms you use for memory footprint (often at expense of speed)
A few obvious ones
If speed isn't critical, execute the code directly from flash.
Declare constant data tables using const. This will avoid the data being copied from flash to RAM
Pack large data tables tightly using the smallest data types, and in the correct order to avoid padding.
Use compression for large sets of data (as long as the compression code doesn't outweigh the data)
Turn off exception handling and RTTI.
Did anybody mention using -Os? ;-)
Folding knowledge into data
One of the rules of Unix philosophy can help make code more compact:
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
I can't count how many times I've seen elaborate branching logic, spanning many pages, that could've been folded into a nice compact table of rules, constants, and function pointers. State machines can often be represented this way (State Pattern). The Command Pattern also applies. It's all about the declarative vs imperative styles of programming.
Log codes + binary data instead of text
Instead of logging plain text, log event codes and binary data. Then use a "phrasebook" to reconstitute the event messages. The messages in the phrasebook can even contain printf-style format specifiers, so that the event data values are displayed neatly within the text.
Minimize the number of threads
Each thread needs it own memory block for a stack and TSS. Where you don't need preemption, consider making your tasks execute co-operatively within the same thread (cooperative multi-tasking).
Use memory pools instead of hoarding
To avoid heap fragmentation, I've often seen separate modules hoard large static memory buffers for their own use, even when the memory is only occasionally required. A memory pool could be used instead so the the memory is only used "on demand". However, this approach may require careful analysis and instrumentation to make sure pools are not depleted at runtime.
Dynamic allocation only at initialization
In embedded systems where only one application runs indefinitely, you can use dynamic allocation in a sensible way that doesn't lead to fragmentation: Just dynamically allocate once in your various initialization routines, and never free the memory. reserve() your containers to the correct capacity and don't let them auto-grow. If you need to frequently allocate/free buffers of data (say, for communication packets), then use memory pools. I once even extended the C/C++ runtimes so that it would abort my program if anything tried to dynamically allocate memory after the initialization sequence.
As with all optimization, first optimize algorithms, second optimize the code and data, finally optimize the compiler.
I don't know what your program does, so I can't advice on algorithms. Many others have written about the compiler. So, here's some advice on code and data:
Eliminate redundancy in your code. Any repeated code that's three or more lines long, repeated three times in your code, should be changed to a function call.
Eliminate redundancy in your data. Find the most compact representation: merge read-only data, and consider using compression codes.
Run the code through a regular profiler; eliminate all code that isn't used.
Generate a map file from your linker. It will show how the memory is allocated. This is a good start when optimizing for memory usage. It also will show all the functions and how the code-space is laid out.
Here's a book on the subject Small Memory Software: Patterns for systems with limited memory.
Compile in VS with /Os. Often times this is even faster than optimizing for speed anyway, because smaller code size == less paging.
Comdat folding should be enabled in the linker (it is by default in release builds)
Be careful about data structure packing; often time this results in the compiler generated more code (== more memory) to generate the assembly to access unaligned memory. Using 1 bit for a boolean flag is a classic example.
Also, be careful when choosing a memory efficient algorithm over an algorithm with a better runtime. This is where premature optimizations come in.
Ok most were mentioned already, but here is my list anyway:
Learn what your compiler can do. Read compiler documentation, experiment with code examples. Check settings.
Check generated code at target optimization level. Sometimes results are surprising and often it turns out optimization actually slows things down (or just take too much space).
choose suitable memory model. If you target really small tight system, large or huge memory model might not be the best choice (but usually easisest to program for...)
Prefer static allocation. Use dynamic allocation only on startup or over
statically allocated buffer (pool or maximum instance sized static buffer).
Use C99 style data types. Use smallest sufficient data type, for storage types. Local variables like loop variables are sometimes more efficient with "fast" data types.
Select inline candidates. Some parameter heavy function with relatively simple bodies are better off when inlined. Or consider passing structure of parameters. Globals are also option, but be careful - tests and maintenance can become difficult if anyone in them isn't disciplned enough.
Use const keyword well , be aware of array initialization implications.
Map file, ideally also with module sizes. Check also what is included from crt (is it really neccessary?).
Recursion just say no (limited stack space)
Floating point numbers - prefer fixed point math. Tends to include and call a lot of code (even for simple addition or multiplication).
C++ you should know C++ VERY WELL. If you don't, program constrainted embedded systems in C, please. Those who dare must be careful with all advanced C++ constructs (inheritance, templates, exceptions, overloading, etc.). Consider close to HW code to be
rather Super-C and C++ is used where it counts: in high level logic, GUI, etc.
Disable whatever you don't need in compiler settings (be it parts of libraries, language constructs, etc.)
Last but not least - while hunting for smallest possible code size - don't overdo it. Watch out also for performance and maintainability. Over-optimized code tends to decay very quickly.
Firstly, tell your compiler to optimize for code size. GCC has the -Os flag for this.
Everything else is at the algorithmic level - use similar tools that you would for finding memory leaks, but instead look for allocs and frees that you could avoid.
Also take a look at commonly used data structure packing - if you can shave a byte or two off them, you can cut down memory use substantially.
If you're looking for a good way to profile your application's heap usage, check out valgrind's massif tool. It will let you take snapshots of your app's memory usage profile over time, and you can then use that information to better see where the "low hanging fruit" is, and aim your optimizations accordingly.
Profiling code or data bloat can be done via map files: for gcc see here, for VS see here.
I have yet to see a useful tool for size profiling though (and don't have time to fix my VS AddIn hack).
on top what others suggest:
Limit use of c++ features, write like in ANSI C with minor extensions. Standard (std::) templates use a large system of dynamic allocation. If you can, avoid templates altogether. While not inherently harmful, they make it way too easy to generate lots and lots of machine code from just a couple simple, clean, elegant high-level instructions. This encourages writing in a way that - despite all the "clean code" advantages - is very memory hungry.
If you must use templates, write your own or use ones designed for embedded use, pass fixed sizes as template parameters, and write a test program so you can test your template AND check your -S output to ensure the compiler is not generating horrible assembly code to instantiate it.
Align your structures by hand, or use #pragma pack
{char a; long b; char c; long d; char e; char f; } //is 18 bytes,
{char a; char c; char d; char f; long b; long d; } //is 12 bytes.
For the same reason, use a centralized global data storage structure instead of scattered local static variables.
Intelligently balance usage of malloc()/new and static structures.
If you need a subset of functionality of given library, consider writing your own.
Unroll short loops.
for(i=0;i<3;i++){ transform_vector[i]; }
is longer than
transform_vector[0];
transform_vector[1];
transform_vector[2];
Don't do that for longer ones.
Pack multiple files together to let the compiler inline short functions and perform various optimizations Linker can't.
Don't be afraid to write 'little languages' inside your program. Sometimes a table of strings and an interpreter can get a LOT done. For instance, in a system I've worked on, we have a lot of internal tables, which have to be accessed in various ways (loop through, whatever). We've got an internal system of commands for referencing the tables that forms a sort of half-way language that's quite compact for what it gets donw.
But, BE CAREFUL! Know that you are writing such things (I wrote one accidentally, myself), and DOCUMENT what you are doing. The original developers do NOT seem to have been conscious of what they were doing, so it's much harder to manage than it should be.
Optimizing is a popular term but often technically incorrect. It literally means to make optimal. Such a condition is never actually achieved for either speed or size. We can simply take measures to move toward optimization.
Many (but not all) of the techniques used to move toward minimum time to a computing result sacrifices memory requirement, and many (but not all) of the techniques used to move toward minimum memory requirement lengthens the time to result.
Reduction of memory requirements amounts to a fixed number of general techniques. It is difficult to find a specific technique that does not neatly fit into one or more of these. If you did all of them, you'd have something very close to the minimal space requirement for the program if not the absolute minimum possible. For a real application, it could take a team of experienced programmers a thousand years to do it.
Remove all redundancy from stored data, including intermediates.
Remove all need for storing data that could be streamed instead.
Allocate only the number of bytes needed, never a single more.
Remove all unused data.
Remove all unused variables.
Free data as soon as it is no longer possibly needed.
Remove all unused algorithms and branches within algorithms.
Find the algorithm that is represented in the minimally sized execution unit.
Remove all unused space between items.
This is a computer science view of the topic, not a developer's one.
For instance, packing a data structure is an effort that combines (3) and (9) above. Compressing data is a way to at least partly achieve (1) above. Reducing overhead of higher level programming constructs is a way to achieve some progress in (7) and (8). Dynamic allocation is an attempt to exploit a multitasking environment to employ (3). Compilation warnings, if turned on, can help with (5). Destructors attempt to assist with (6). Sockets, streams, and pipes can be used to accomplish (2). Simplifying a polynomial is a technique to gain ground in (8).
Understanding of the meaning of nine and the various ways to achieve them is the result of years of learning and checking memory maps resulting from compilation. Embedded programmers often learn them more quickly because of limited memory available.
Using the -Os option on a gnu compiler makes a request to the compiler to attempt to find patterns that can be transformed to accomplish these, but the -Os is an aggregate flag that turns on a number of optimization features, each of which attempts to perform transformations to accomplish one of the 9 tasks above.
Compiler directives can produce results without programmer effort, but automated processes in the compiler rarely correct problems created by lack of awareness in the writers of the code.
Bear in mind the implementation cost of some C++ features, such as virtual function tables and overloaded operators that create temporary objects.
Along with that everyone else said, I'd just like to add don't use virtual functions because with virtual functions a VTable must be created which can take up who knows how much space.
Also watch out for exceptions. With gcc, I don't believe there is a growing size for each try-catch block(except for 2 function calls for each try-catch), but there is a fixed size function which must be linked in which could be wasting precious bytes