I'm wondering if taking care of memory sizes in C++ is a good or bad thing.
This question confused me (Why does mode_t use 4 bytes?).
So is it not performant to just use a char when I don't need to store a larger amount of data, because a modern CPU has to fill up the rest of the word?
So, thinking of performance and saving computing time, would the best approach be to always use size_t for every integer-typed variable I need?
Does a CPU still need more instructions to deal with a short value than with a size_t if I have a large array?
What about char arrays? Wouldn't they be slower, too?
All in all: what's the best practice? I'd like to save as much memory as possible, because my server does not have a lot of memory. On the other hand, I don't want to lose performance because I thought memory was more important.
Is there a good explanation somewhere of how all this works and what is faster under which circumstances?
There is no one answer to this question.
Reducing the size of the integer types that you use can increase locality and decrease the required memory bandwidth. So, that's a plus. (Note: the actual memory fetch does not cost less.)
Increasing the size of integer types that you use can decrease the number of conversions required. So, that's a plus.
So the questions are: how much memory do you save by choosing smaller types? How many conversions do you save by choosing larger ones?
The objective answer
In general, nothing less than whole-system profiling will tell you which is the better alternative. This is because answering questions about reduced memory pressure is incredibly difficult and system-specific. Reducing the memory usage of part of your program will typically increase the percentage of time your program spends in that part — and it may even increase the percentage of time that your program uses on the entire system — either due to the larger number of conversions necessary, or because the reduced memory pressure makes other parts of your system faster. Hence the need for whole system profiling.
This, unsurprisingly, is a real pain.
The subjective answer
However, my instinct tells me that it's almost never worth the effort to try to minimize the memory usage of individual fields this way. How many copies of mode_t do you think your program will have in memory at a time? A handful, at most. So I have a rule of thumb for this:
If it goes in an array, then use the smallest type that has sufficient range. E.g., a string is char[] instead of int[].
If it goes anywhere else, use int or larger.
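As a rough illustration of that rule of thumb (the names and sizes here are invented for the example):

#include <cstddef>
#include <cstdint>
#include <vector>

// Bulk data: millions of elements, so the element type drives the footprint.
// A pixel intensity fits in 8 bits, so store the array as uint8_t.
std::vector<std::uint8_t> pixels(1920 * 1080);

// A lone field: just use int. Shrinking it saves a few bytes at most.
int open_file_count = 0;

long long sumPixels() {
    long long sum = 0;
    for (std::size_t i = 0; i < pixels.size(); ++i)   // index type: size_t, not uint8_t
        sum += pixels[i];
    return sum;
}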
So my subjective answer is, spend your precious time elsewhere. Your time is valuable and you have better things to do than choose whether a field should be int or short.
This sounds like premature optimization. You are worried about running out of memory when it seems like it hasn't actually happened yet.
In general, accessing a small subsection of your CPU's native word size generates more CODE. So the space you save by putting data into only 8 bits is probably lost 50+ times over by the added CODE needed to manipulate only the specific 8 bits you care about. You could also end up in places where your "optimization" slows things down, too:
struct foo {
    char a1, a2, a3;
    short b1;
};
If the above structure is packed tightly, b1 crosses a 32-bit boundary, which on some architectures will throw exceptions and on other architectures will require two fetches to retrieve the data.
OR not. It depends on the CPU architecture, the computer's data architecture, the compiler, and your program's typical use patterns. I doubt there is a single "best practice" that is correct 99% of the time here.
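One way to take the guesswork out of this is to ask the compiler what it actually did; a minimal sketch (the printed values are typical for x86-64, not guaranteed):

#include <cstddef>
#include <cstdio>

struct foo {
    char  a1, a2, a3;
    short b1;   // with default alignment, 1 byte of padding is inserted before b1
};

int main() {
    std::printf("sizeof(foo)       = %zu\n", sizeof(foo));        // typically 6
    std::printf("offsetof(foo, b1) = %zu\n", offsetof(foo, b1));  // typically 4
}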
If space is really important, tell the compiler to optimize for size rather than speed and see if that helps. But unless you are sharing the data across a slow binary pipe, you should not generally care how big it is as long as it is big enough to hold all valid values for your application.
tl;dr? Just use size_t until you can prove that reducing the size of that specific variable will significantly improve server performance.
The answer is processor dependent: it depends on the processor for your target platform. Read its data sheet to find out how it handles single 8-bit fetches.
The ARM7TDMI processor likes to fetch 32-bit quantities. It is very efficient at that. It is labelled as an 8/32 processor and can handle 8-bit quantities as well.
The processor may be able to fetch 8-bit quantities directly depending on how it is wired up. Otherwise, it calculates the nearest 32-bit aligned address, reads 32 bits and discards the unused bits. This takes processing time.
So the trade-off is memory versus processing time:
Will compressing your application to use 8 bits significantly increase processing time?
Does your development schedule gain any time by this task? (a.k.a. Return On Investment, ROI)
Do your clients complain about the size of the application?
Is your application correct and error free before worrying about memory usage?
Related
I've been searching around a bit, and I haven't really come up with an answer for this.
When I'm programming on embedded devices with limited memory, I'm generally in the habit of using the smallest integral/floating point type that will do the job; for instance, if I know that a counter will always be between zero and 255, I'll declare it as a uint8_t.
However, in less memory-constrained environments, I'm used to just using int for everything, as per the Google C++ Style Guide. When I look at existing code, it often tends to be done this way.
To be clear, I get the rationale behind doing this, (Google explains it very well), but I'm not precisely clear on the rationale behind doing things the first way.
It seems to me that reducing the memory footprint of your program, even on a system where you don't care about memory usage, would be good for overall speed, since, logically, less overall data would mean more of it could fit in CPU cache.
Complicating matters, however, is the fact that compilers will automatically pad data and align it to boundaries such that it can be fetched in a single bus cycle. I guess, then, it comes down to whether or not compilers are smart enough to take, say, two 32-bit integers and stick them together in a single 64-bit block vs. individually padding each one to 64 bits.
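For what it's worth, a quick way to check this on your own platform (the asserts below hold on the mainstream 64-bit ABIs I know of, but they are exactly the thing being tested, so they may fire elsewhere):

#include <cstdint>

struct pair32 {
    std::int32_t a;
    std::int32_t b;
};

// On the common 64-bit ABIs, members are aligned to their own requirement
// (4 bytes here), not padded out to the machine word, so both ints share
// one 64-bit block.
static_assert(sizeof(pair32) == 8,  "no per-member padding to 64 bits");
static_assert(alignof(pair32) == 4, "alignment follows the widest member");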
I suppose whether or not the CPU itself could take advantage of this also depends on its exact internals, but the idea that optimizing memory size improves performance, particularly on newer processors, is evidenced in the fact that the Linux kernel relied for a while on GCC's -Os option for an overall performance boost.
So I guess that my question is why the Google method seems to be so much more prevalent in actual code. Is there a hidden cost here that I'm missing?
The usual reasons that the "google method" is commonly used are that int is often good enough, and it is typically the first option taught in beginners' material. It also takes more effort (man hours, etc.) to optimise nontrivial code for "limited memory" - effort which is pointless if not actually needed.
If the "actual code" is written for portability, then int is a good default choice.
Whether written for portability or not, a lot of programs are only ever run on hosts with sufficient memory resources and with an int type that can represent the required range of values. This means it is not necessary to worry about memory usage (e.g. optimising size of variables based on the specific range of values they need to support) and the program just works.
Programming for "limited memory" is certainly common, but not typically why most code is written. Quite a few modern embedded systems have more than enough memory and other resources, so the techniques are not always needed for them.
A lot of code written for what you call "limited memory" also does not actually need to be. There is a point, as programmers learn more, that a significant number start indulging in premature optimisation - worrying about performance or memory usage, even when there is no demonstrated need for them to do so. While there is certainly a significant body of code written for "limited memory" because of a genuine need, there is a lot more such code written due to premature optimisation.
"embedded devices ... counter between zero and 255, I'll declare it as a uint8_t"
That might be counterproductive. Especially on embedded systems, 8 bit fetches might be slower. Besides, a counter is likely in a register, and there's no benefit in using half a register.
The chief reason to use uint8_t is when you have a contiguous set of them. Could be an array, but also adjacent members in a class.
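A small sketch of that distinction (the types and counts are invented for the example):

#include <cstdint>
#include <vector>

// A lone counter: it lives in a full-width register anyway, so int is fine.
int lone_counter = 0;

// A contiguous set: element width scales the footprint and how many values
// fit per cache line, so the small type pays off here.
std::vector<std::uint8_t> samples(1000000);        // ~1 MB
std::vector<int>          samples_as_int(1000000); // ~4 MB for the same values

// Adjacent small members inside a class also pack together.
struct Header {
    std::uint8_t  flags;
    std::uint8_t  version;
    std::uint16_t length;
    std::uint32_t checksum;
};   // typically 8 bytes; with int members it would be 16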
As the comments already note, -Os is unrelated - its benefit is that with smaller code, the memory bus has more bandwidth left for data.
From my experience, 90% of all code in a bigger project does not need particular optimization, since 95% of all memory consumption and all execution time is spent in less than 10% of the code you write. In the rest of the code, try to emphasize simplicity and maintainability. Mostly, that means using int or size_t as the integer types. Usually there is no need to optimize the size of local variables, but it can make sense if you have a lot of instances of a type in a large array. Item 6 in the excellent book C++ Coding Standards: 101 Rules, Guidelines and Best Practices (C++ In-Depth) by Herb Sutter and Andrei Alexandrescu says:
"Correctness, simplicity, and clarity come first."
Most importantly, understand where that less-than-10% of the code that really needs optimization is. Otherwise, keep interfaces simple and uniform.
Nice discussion! But I wonder why nobody talks about CPU register size, memory bus architecture, CPU architecture and so on. Saying "int is best" is not general advice at all. If you have a small embedded system like an 8-bit AVR, int is a very bad choice for a counter running from 0..255.
And using int on an ARM where you may only have a 16-bit bus interface can also be a very bad idea if you really only need 16 bits or less.
As for all optimizations: look at the code the compiler produces, measure how long actions really take, and look at memory consumption on the heap/stack if necessary. It makes no sense to hand-craft unmaintainable code to save 8 bits somewhere if your hardware still has megabytes left.
Using tools like valgrind and the profiling supported by the target/compiler gives much more insight than any theoretical discussion here.
There is no general "best integer type"! It always depends on the CPU architecture, memory bus, caches and more.
What are the basic tips and tricks that a C++ programmer should know when trying to optimize his code in the context of Caching?
Here's something to think about:
For instance, I know that reducing a function's footprint would make the code run a bit faster, since there would be fewer instructions overall competing for the processor's instruction cache.
When trying to allocate an std::array<char, <size>>, what would be the ideal size to make your reads and writes to the array faster?
How big can an object be to decide to put it on the heap instead of the stack?
In most cases, knowing the correct answer to your question will gain you less than 1% overall performance.
Some (data-)cache optimizations that come to my mind are:
For arrays: use less RAM. Try shorter data types or a simple compression algorithm like RLE. This can also save CPU at the same time, or, on the contrary, waste CPU cycles on data type conversions. Floating point to integer conversions in particular can be quite expensive.
Avoid access to the same cacheline (usually around 64 bytes) from different threads, unless all access is read-only.
Group members that are often used together next to each other. Prefer sequential access to random access.
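A minimal sketch of the "group members used together" idea, using a hypothetical particle type (the hot/cold split is an assumption for the example):

#include <string>
#include <vector>

// Everything interleaved: a sequential sweep over positions also drags the
// rarely used fields through the cache.
struct ParticleAoS {
    float x, y, z;
    float debug_color[4];   // rarely touched
    std::string name;       // rarely touched
};

// Hot data grouped and stored contiguously, cold data kept separate: the
// per-frame sweep now reads only the bytes it actually uses, sequentially.
struct Particles {
    std::vector<float> x, y, z;       // hot: iterated every frame
    std::vector<std::string> names;   // cold: looked up occasionally
};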
If you really want to know all about caches, read What Every Programmer Should Know About Memory. While I disagree with the title, it's a great in-depth document.
Because your question suggests that you actually expect gains from just following the tips above (in which case you will be disappointed), here are some general optimization tips:
Tip #1: About 90% of your code should be optimized for readability, not performance. If you decide to attempt an optimization for performance, make sure you actually measure the gain. When it is below 5%, I usually go back to the more readable version.
Tip #2: If you have an existing codebase, profile it first. If you don't profile it, you will miss some very effective optimizations. Usually there are some calls to time-consuming functions that can be completely eliminated, or the result cached.
If you don't want to use a profiler, at least print the current time in a couple of places, or interrupt the program with a debugger a couple of times to check where it is most often spending its time.
Is it true that aligning data members of a struct/class no longer yields the benefits it used to, especially on Nehalem, because of hardware improvements? If so, is it still the case that alignment always improves performance, just with much smaller noticeable improvements compared to past CPUs?
Does alignment of member variables extend to member functions? I believe I once read (it could be on the wikibooks "C++ performance") that there are rules for "packing" member functions into various "units" (i.e. source files) for optimum loading into the instruction cache? (If I have got my terminology wrong here please correct me).
Processors are still much faster than what the RAM can deliver, so they still need caches. Caches still consist of fixed-size cache lines. Also, main memory is delivered in pages and pages are accessed using a translation lookaside buffer. This buffer, again, has a fixed size cache.
Which means that both spatial and temporal locality matter a lot (i.e. how you pack stuff, and how you access it). Packing structures well (sorted by padding/alignment requirements) as opposed to packing them in some haphazard order usually results in smaller structure sizes.
Smaller structure sizes mean, if you have loads of data:
more structures fit into one cache line (cache miss = 50-200 cycles)
fewer pages are needed (page fault = 10-20 million CPU cycles)
fewer TLB entries are needed, fewer TLB misses (TLB miss = 50-500 cycles)
Going linearly over a few gigabytes of tightly packed SoA data can be 3 orders of magnitude faster (or 8-10 orders of magnitude, if page faults are involved) than doing the same thing in a naive way with bad layout/packing.
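A hedged illustration of the "sorted by padding/alignment requirements" point (the member names and the quoted sizes assume a typical 64-bit ABI):

#include <cstdint>

// Haphazard member order: padding after the small members inflates the struct.
struct RecordUnsorted {
    std::uint8_t  kind;     // 1 byte + 7 bytes padding before the double
    double        value;
    std::uint16_t flags;    // 2 bytes + 2 bytes padding before the uint32
    std::uint32_t id;
};                          // typically 24 bytes

// Same members sorted by alignment requirement, widest first.
struct RecordSorted {
    double        value;
    std::uint32_t id;
    std::uint16_t flags;
    std::uint8_t  kind;     // 1 byte + 1 byte tail padding for array stride
};                          // typically 16 bytes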
Whether or not you hand-align individual 4-byte or 2-byte values (say, a typical int or short) to 2 or 4 bytes makes a very small difference on recent Intel CPUs (hardly noticeable). Insofar, it may seem tempting to "optimize" on that, but I strongly advise against doing so.
This is usually something one best doesn't worry about and leaves to the compiler to figure out. If for no other reason, then because the gains are marginal at best, but some other processor architectures will raise an exception if you get it wrong. Therefore, if you try to be too smart, you'll suddenly have unexplainable crashes once you compile on some other architecture. When that happens, you'll feel sorry.
Of course, if you don't have at least several dozen of megabytes of data to process, you need not care at all.
Aligning data to suit the processor will never hurt, but some processors will have more notable drawbacks from misalignment than others; I think that is the best way to answer this question.
Aligning functions into cache-line units seems a bit of a red herring to me. For small functions, what you really want is inlining if at all possible. If the code can't be inlined, then it's probably larger than a cache line anyway. [Unless it's a virtual function, of course]. I don't think this has ever been a huge factor though - either code is generally called often, and thus normally in the cache, or it's not called very often, and not very often in the cache. I'm sure it's possible to come up with some code where calling one function, func1(), will also drag func2() into the cache, so if you always call func1() and func2() in short succession, it would have some benefit. But it's really not that great of a benefit unless you have a lot of functions with pairs or groups of functions that are called close together. [By the way, I don't think the compiler is guaranteed to place your function code in any particular order, no matter which order you place it in the source file].
Cache alignment is a slightly different matter, since cache lines can still have a HUGE effect if you get it right vs. getting it wrong. This is more important for multithreading than for general "loading data". The key here is to avoid sharing data in the same cache line between processors. In a project I worked on some 10 or so years ago, a benchmark had a function that used an array of two integers to count the number of iterations each thread did. When that got split into two separate cache lines, the benchmark improved from 0.6x the speed of a single processor to 1.98x of one processor. The same effect will happen on modern CPUs, even if they are much faster - the effect may not be exactly the same, but it will be a large slowdown (and the more processors sharing data, the bigger the effect, so a quad-core system would be worse than a dual core, etc.). This is because every time a processor updates something in a cache line, all other processors that have read that cache line must reload it from the processor that updated it [or from memory in the old days].
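A minimal sketch of the fix described above, giving each thread's counter its own cache line (the 64-byte line size is an assumption that matches most current x86 parts):

#include <atomic>
#include <thread>

// One counter per cache line: two threads hammering their own counters no
// longer invalidate each other's line on every increment.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[2];

void count(int idx) {
    for (int i = 0; i < 10000000; ++i)
        counters[idx].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t0(count, 0), t1(count, 1);
    t0.join();
    t1.join();
}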
I'm running intensive numerical simulations. I often use long integers, but I realized it would be safe to use integers instead. Would this improve the speed of my simulations significantly?
Depends. If you have lots and lots of numbers consecutively in memory, they're more likely to fit in the L2 cache, so you have fewer cache misses. L2 cache misses - depending on your platform - can be a significant performance impact, so it's definitely a good idea to make things fit in cache as much as possible (and prefetch if you can). But don't expect your code to fly all of a sudden because your types are smaller.
EDIT: One more thing - if you choose an awkward type (like a 16-bit integer on a 32-bit or 64-bit platform), you may end up with worse performance because the CPU will have to surgically extract the 16-bit value and turn it into something it can work with. But typically, ints are a good choice.
Depends on your data set sizes. Obviously, halving the size of your integers could double the amount of data that fits into the CPU caches, and thus access to data would be faster. For more details I suggest you read Ulrich Drepper's famous paper What Every Programmer Should Know About Memory.
This is why typedef is your friend. :-)
If mathematically possible, try using floats instead of integers. I read somewhere that floating point arithmetic (esp. multiplication) can actually be faster on some processors.
The best thing is to experiment and benchmark. It's damn near impossible to figure out analytically which micro-optimizations work best.
EDIT: This post discusses the performance difference between integer and float.
All the answers have already treated the CPU cache issue: if your data is two times smaller, then in some cases it can fit into L2 cache completely, yielding performance boost.
However, there is another very important and more general thing: memory bandwidth. If your algorithm is linear (aka O(N) complexity) and accesses memory sequentially, then it may be memory-bound. That means memory reads/writes are the bottleneck, and the CPU is simply wasting a lot of cycles waiting for memory operations to complete. In such a case, reducing the total memory size by a factor of two would yield a reliable 2x performance boost.
Moreover, in such cases switching to bytes may yield an even bigger performance boost, despite the fact that CPU computations may be slower with bytes, as one of the other answerers has already mentioned.
In general, the answer depends on several things: the total size of the data your algorithm works with, the memory access pattern (random/sequential), the algorithm's asymptotic complexity, and the computation-per-memory ratio (mostly for linear algorithms).
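As a rough sketch of the memory-bound case (element widths chosen for the example; the effect only appears when the data is far larger than the caches and access is sequential):

#include <cstdint>
#include <numeric>
#include <vector>

// A linear, sequential sum over data far larger than the caches is limited by
// memory bandwidth, not arithmetic, so halving the element width can roughly
// halve the runtime.
long long sum_wide(const std::vector<std::int64_t>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);   // ~8 bytes read per element
}

long long sum_narrow(const std::vector<std::int32_t>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);   // ~4 bytes read per element
}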
Context:
A while ago, I stumbled upon this 2001 DDJ article by Alexandrescu:
http://www.ddj.com/cpp/184403799
It's about comparing various ways to initialize a buffer to some value, like what "memset" does for single-byte values. He compared various implementations (memcpy, explicit "for" loop, Duff's device) and did not really find the best candidate across all data set sizes and all compilers.
Quote:
There is a very deep, and sad, realization underlying all this. We are in 2001, the year of the Spatial Odyssey. (...) Just step out of the box and look at us — after 50 years, we're still not terribly good at filling and copying memory.
Question:
Does anyone have more recent information about this problem? Do recent GCC and Visual C++ implementations perform significantly better than they did seven years ago?
I'm writing code that has a lifetime of 5+ (probably 10+) years and that will process arrays ranging in size from a few bytes to hundreds of megabytes. I can't assume that my choices now will still be optimal in 5 years. What should I do:
a) use the system's memset (or equivalent) and forget about optimal performance or assume the runtime and compiler will handle this for me.
b) benchmark once and for all on various array sizes and compilers and switch at runtime between several routines.
c) run the benchmark at program initialization and switch at runtime based on accurate (?) data.
Edit: I'm working on image processing software. My array items are PODs and every millisecond counts !
Edit 2: Thanks for the first answers, here is some additional information:
Buffer initialization may represent 20%-40% of the total runtime of some algorithms.
The platform may vary over the next 5+ years, although it will stay in the "fastest CPU money can buy from DELL" category.
Compilers will be some form of GCC and Visual C++. No embedded stuff or exotic architectures on the radar.
I'd like to hear from people who had to update their software when MMX and SSE appeared, since I'll have to do the same when "SSE2015" becomes available... :)
The DDJ article acknowledges that memset is the best answer, and much faster than what he was trying to achieve:
There is something sacrosanct about C's memory manipulation functions memset, memcpy, and memcmp. They are likely to be highly optimized by the compiler vendor, to the extent that the compiler might detect calls to these functions and replace them with inline assembler instructions — this is the case with MSVC.
So, if memset works for you (ie. you are initializing with a single byte) then use it.
Whilst every millisecond may count, you should establish what percentage of your execution time is lost to setting memory. It is likely very low (1 or 2%?) given that you have useful work to do as well. Given that, the optimization effort would likely have a much better rate of return elsewhere.
The MASM Forum has a lot of incredible assembly language programmers/hobbyists who have beaten this issue completely to death (have a look through The Laboratory). The results were much like Christopher's response: SSE is incredible for large, aligned, buffers, but going down you will eventually reach such a small size that a basic for loop is just as quick.
Memset/memcpy are mostly written with a basic instruction set in mind, and so can be outperformed by specialized SSE routines, which on the other hand enforce certain alignment constraints.
But to reduce it to a list:
For data-sets <= several hundred kilobytes memcpy/memset perform faster than anything you could mock up.
For data-sets > megabytes use a combination of memcpy/memset to get the alignment and then use your own SSE optimized routines/fallback to optimized routines from Intel etc.
Enforce the alignment at startup and use your own SSE routines.
This list only comes into play for things where you need the performance. Data sets that are too small, or that are only initialized once, are not worth the hassle.
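For concreteness, here is a hedged sketch of the kind of hand-rolled SSE routine the list above refers to (SSE2 only, untuned, and certainly not a drop-in replacement for the library memset):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>   // SSE2 intrinsics

// Plain memset handles the unaligned head and the tail; aligned 16-byte SSE2
// stores handle the bulk. A real routine would also consider non-temporal
// stores (_mm_stream_si128) for buffers much larger than the cache.
void fill_sse2(void* dst, unsigned char value, std::size_t n) {
    unsigned char* p = static_cast<unsigned char*>(dst);

    std::size_t head = (16 - reinterpret_cast<std::uintptr_t>(p) % 16) % 16;
    if (head > n) head = n;
    std::memset(p, value, head);
    p += head;
    n -= head;

    const __m128i v = _mm_set1_epi8(static_cast<char>(value));
    for (; n >= 16; p += 16, n -= 16)
        _mm_store_si128(reinterpret_cast<__m128i*>(p), v);

    std::memset(p, value, n);   // remaining tail bytes
}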
Here is an implementation of memcpy from AMD; I can't find the article that described the concept behind the code.
d) Accept that trying to play "jedi mind tricks" with the initialization will lead to more lost programmer hours than the cumulative milliseconds difference between some obscure but fast method versus something obvious and clear.
It depends what you're doing. If you have a very specific case, you can often vastly outperform the system libc (and/or compiler inlining) of memset and memcpy.
For example, for the program I work on, I wrote a 16-byte-aligned memcpy and memset designed for small data sizes. The memcpy was made for multiple-of-16 sizes greater than or equal to 64 only (with data aligned to 16), and memset was made for multiple-of-128 sizes only. These restrictions allowed me to get enormous speed, and since I controlled the application, I could tailor the functions specifically to what was needed, and also tailor the application to align all necessary data.
The memcpy performed at about 8-9x the speed of the Windows native memcpy, knocking a 460-byte copy down to a mere 50 clock cycles. The memset was about 2.5x faster, filling a stack array of zeros extremely quickly.
If you're interested in these functions, they can be found here; drop down to around line 600 for the memcpy and memset. They're rather trivial. Note they're designed for small buffers that are supposed to be in cache; if you want to initialize enormous amounts of data in memory while bypassing cache, your issue may be more complex.
You can take a look at liboil; they (try to) provide different implementations of the same function and choose the fastest one at initialization. Liboil has a pretty liberal licence, so you can use it for proprietary software as well.
http://liboil.freedesktop.org/
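The idea boils down to something like the following sketch: benchmark the candidates once at startup and publish the winner through a function pointer (the candidates and timing loop here are simplified assumptions, not liboil's actual code):

#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

using FillFn = void (*)(void*, unsigned char, std::size_t);

void fill_memset(void* p, unsigned char v, std::size_t n) { std::memset(p, v, n); }

void fill_loop(void* p, unsigned char v, std::size_t n) {
    unsigned char* b = static_cast<unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i) b[i] = v;
}

FillFn g_fill = fill_memset;   // default until measured

// Time each candidate once at startup and keep the winner.
void pick_fastest_fill() {
    std::vector<unsigned char> buf(1 << 20);
    const FillFn candidates[] = { fill_memset, fill_loop };
    auto best = std::chrono::steady_clock::duration::max();
    for (FillFn f : candidates) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 64; ++i) f(buf.data(), 0, buf.size());
        const auto dt = std::chrono::steady_clock::now() - t0;
        if (dt < best) { best = dt; g_fill = f; }
    }
}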
Well, this all depends on your problem domain and your specifications. Have you run into performance issues, failed to meet a timing deadline, and pinpointed memset as the root of all evil? If so, you're in the one and only case where you could consider some memset tuning.
Then you should also keep in mind that memset performance will anyhow vary with the hardware platform it is run on. During those five years, will the software run on the same platform? On the same architecture? Once you come to that conclusion you can try to 'roll your own' memset, typically playing with the alignment of buffers and making sure you zero 32-bit values at once, depending on what is most performant on your architecture.
I once ran into the same thing for memcmp, where the alignment overhead caused some problems, but typically this will not result in miracles, only a small improvement, if any. If you're missing your requirements by an order of magnitude, then this won't get you any further.
If memory is not a problem, then precreate a static buffer of the size you need, initialized to your value(s). As far as I know, both these compilers are optimizing compilers, so if you use a simple for loop, the compiler should generate the optimum assembler commands to copy the buffer across.
If memory is a problem, use a smaller buffer and copy it across at sizeof(..) offsets into the new buffer.
HTH
I would always choose an initialization method that is part of the runtime or OS (memset) I am using (worst case, pick one that is part of a library that I am using).
Why: If you are implementing your own initialization, you might end up with a marginally better solution now, but it is likely that in a couple of years the runtime has improved. And you don't want to do the same work that the guys maintaining the runtime do.
All this stands if the improvement in the runtime is marginal. If you have a difference of an order of magnitude between memset and your own initialization, then it makes sense to use your own code, but I really doubt that will be the case.
If you have to allocate your memory as well as initialize it, I would:
Use calloc instead of malloc
Change as many of my default values as possible to be zero (e.g., let my default enumeration value be zero; or if a boolean variable's default value is 'true', store its inverse in the structure)
The reason for this is that calloc zero-initializes memory for you. While this still involves the overhead of zeroing memory, most compilers and runtimes are likely to have this routine highly optimized - more optimized than malloc/new followed by a call to memset.
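A minimal sketch of that suggestion (the Pixel type and count are invented for the example):

#include <cstdlib>
#include <cstring>

struct Pixel { unsigned char r, g, b, a; };

int main() {
    const std::size_t count = 1000000;

    // malloc + explicit zeroing: every byte is touched by memset.
    Pixel* a = static_cast<Pixel*>(std::malloc(count * sizeof(Pixel)));
    std::memset(a, 0, count * sizeof(Pixel));

    // calloc: the allocator returns zeroed memory; for large blocks it can
    // often hand out pages the OS has already zeroed, skipping the extra pass.
    Pixel* b = static_cast<Pixel*>(std::calloc(count, sizeof(Pixel)));

    std::free(a);
    std::free(b);
}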
As always with these types of questions, the problem is constrained by factors outside of your control, namely the bandwidth of the memory. And if the host OS decides to start paging the memory, then things get far worse. On Win32 platforms, memory is paged and pages are only allocated on first use, which will generate a big pause at every page boundary whilst the OS finds a page to use (this may require another process' page to be paged out to disk).
This, however, is the absolute fastest memset ever written:
void memset (void *memory, size_t size, byte value)
{
}
Not doing something is always the fastest way. Is there any way the algorithms can be written to avoid the initial memset? What are the algorithms you're using?
The year isn't 2001 anymore. Since then, new versions of Visual Studio have appeared. I've taken the time to study the memset in those. They will use SSE for memset (if available, of course). If your old code was correct, statistically it will now be faster. But you might hit an unfortunate corner case.
I expect the same from GCC, although I haven't studied the code. It's a fairly obvious improvement, and it's an open-source compiler. Someone will have created the patch.