Which is more efficient: Bit, byte or int? - c++

Let's say that you were to have a structure similar to the following:
struct Person {
int gender; // betwwen 0-1
int age; // between 0-200
int birthmonth; // between 0-11
int birthday; // between 1-31
int birthdayofweek; // between 0-6
}
In terms of performance, which would be the best data type to store each of the fields? (e.g. bitfield, int, char, etc.)
It will be used on an x86 processor and stored entirely in RAM. A fairly large number will need to be stored (50,000+), so processor caches and such will need to be taken into account.
Edit: OK, let me rephrase the question. If memory usage is not important, and the entire dataset will not fit into the cache no matter which datatypes are used, is it generally better to use smaller datatypes to fit more of the data into the CPU cache, or is it better to use larger datatypes to allow the CPU to perform faster operations? I am asking this for reference only, so code readability and such should not be considered.

Don't worry about it; use what is semantically correct, and most readable
There is no answer that is correct all the time. It depends on the platform and compiler. If you really care, then you have to test.
In general, I would stay stick with ints... except for gender which should probably be an enum.

int_fast#_t from <stdint.h> or <boost/cstdint.hpp>.
That said, you'll give up simplicity and consistency (these types may be a character type, for example, which are integer types in C/C++, and that might lead to surprising function resolutions) instead of just using an int.
You'll see much more significant performance benefits by concentrating on other areas, like algorithmic complexity and access patterns.
It will be used on an x86 processor and stored entirely in RAM. A fairly large number will need to be stored (50,000+), so processor caches and such will need to be taken into account.
You still have to worry about cache (after you're at that level of optimization), even if the whole data won't be cached. For example, do you access every item in sequence? unpredictable? or just one field from every item in sequence? Compare struct { int a, b; } data[N]; to int data_a[N], data_b[N];. (Imagine you need all the 'a' at once, but can ignore the other, which way is more cache friendly?) Again, this doesn't sound like the main area on which you should focus.

Amount of bits used:
gender; 1/2 (2 if you want to include intersexuality :))
age; 8 (0-255)
birthmonth; 4 (16)
birthday; 5 (32)
birthdayofweek; 3 (8)
Bits at all: less than 22.
Knowing that it is running on x86 we have the int datatype with 32 bits.
So build your own routines which can accept an int
read(gender, int* pValue);
write(gender, int* pValue);
by using shift and bit mask operators to store and
retrieve the information. For the fields you can
use typesafe enums.
That is very fast and has an extremely low memory footprint.

it depends. Are you running out of memory? Then memory efficiency becomes paramount. Is it taking too long? Then CPU time, or at least perceived user response time, becomes paramount.

It depends. Many processors can directly access words, but require additional instructions to access octets or bits. Depending on the the compiler and processor, int might be the same as the addressable word. But is speed really going to matter? Readability and maintainability is likely to be more important.

In General
Each type has advantages and disadvantages, and specifically there are scenarios where each one will have the highest performance.
Addressable types (byte, char, short, int, and on x86-64 "long int") can all be loaded from memory in a single operation and so they have the least CPU overhead on a per-operation basis.
But, bit fields or flags packed into one or more bits might result in an overall faster program because:
they use the cache more efficiently, and this is a huge win, easily paying for a few extra cpu ops needed to unpack each item
they require fewer I/O operations to read in from disk, and this additional huge win easily pays for more CPU ops, even tho once again the cpu ops must be paid per item
Processor speeds have been advancing faster than disk and network speeds for decades, and now individual CPU ops are rarely a concern, particularly in your C/C++ case. You are already using the fastest code generator in the arsenal.
The in-RAM/not-in-cache scenario you mentioned
As it happens there is a still a cache factor to consider. Because the CPU is so fast, it is likely that execution time will be dominated by DRAM access on cache loads. If this is true, there is still an advantage to packing the data but it is dimished somewhat for a linear scan through the table. As it happens, modern DRAM is far more efficiently read in order, so you can fill an entire cache block in not much more time than is required to randomly read a single address. If execution time is dominated by an in-order traversal of the data structure, this works in your favor and would tend to flatten the performance difference between using addressable units and packed data structures.
Worry about important things
Finally, it's probably obvious but I will say it anyway: the data structure in terms of maps like hashes and trees, and the choice of algorithm typically has much more influence than machine ops tuning, which gives only an essentially linear optimization.
Worrying about memory bloat does matter, and it matters a lot if there is any possibility that your app won't fit in memory. Virtual storage turned out to be really important for protection and OS-kernel-level memory management, but one thing it never managed to do was allow programs to grow bigger than available RAM without bogging everything down.

Int is the fastest.
If you're using it in an array, you'll waste more memory however, so you may want to stick with byte in that case.

What Chris said. If this is a hypothetical program you're designing, trying to pick int versus uint8 at this stage isn't going to help you one bit in the long run. Focus your effort elsewhere.
If, when it comes down to it, you have a giant complex system you've made several rounds of optimizations on, and you're curious what the effect is, switching int to uint8 is probably (should be anyway) pretty easily anyway. At that stage you can make a statistically valid comparison in a real-world use case - not before.

enum Gender { Female, Male };
struct Date {...};
class Person
{
Gender gender;
Date date_of_birth;
//no age field - can be computed from date_of_birth
};

There are many reasons why either version could be faster.
Packing the structure by making each member into a minimal number of bits means less memory traffic and cache usage, which will cause a speedup. It also means more work to extract and access/modify individual fields, which will cost you performance. It will probably mean marginally more code as well, which will take up more of the instruction cache and so cost a bit of performance.
It's really impossible to say for sure which factor will be dominant. In most cases, it will be most efficient to use ints and bytes as needed. Don't bother with bitfields.
But in some cases, that won't be true. So measure, benchmark, profile. Find out how each implementation performs in your code.

In most cases you should stick to the default word size supported by the CPU. In many situations most CPUs will pad or require the compiler to pad values that are smaller than the default word size (so that structures can be word aligned), so any memory usage gain from a smaller data type can be rendered moot.
The main exception to this where you may have a large array, for example loading a file into an array of byte[] is preferable to int[].
In general model the problem cleanly and do the simple obvious thing first, then profile to see if you have memory issues. Looking at the case supplied, it is worth noting a couple of points:
The field age is a de-normalisation of the model and is a function of birthmonth, birthday, birthdayofweek, which could be represented as a single long birthtimestamp. So, just by looking at the model you can trim 8 bytes easily.
4 (bytes per int) * 5 (number of fields) * 50 000 (number of instances) = ~1MB. My laptop has 4MB L2 cache, you're unlikely to have memory issues.

This depends on exactly one thing:
What is the ratio of blindly copying the data and actually reading the data?
If you treat the data mostly as cookies, and seldom have to access an actual record, use bit fields which consume fewer space and are more IO-friendly.
If you access it a lot, and seldom copy it (as a part of container reallocation) or serialize it to disk, use larger integers that are CPU-friendly.

The processors word size is the smallest unit the computer can read/write to/from memory. If you read or write a bit, char, short or int the CPU, northbridge overhead is the same in all cases given reasonable system/compiler behavior. There is obviously a sram cache benefit to smaller field sizes. The one gotcha for x86 is to ensure proper alignment at word sized multiples. Other architectures such as sparc are much less forgiving and will actually crash if memory access is unaligned.
I tend to shy away from data types smaller than the computers word size in performant multi-threaded apps as changes to one field within a word shared with other fields in the same word can have unintended dependancies if access to all fields within the word are not externally synchronized.

struct Person
{
uint8_t gender; // betwwen 0-1
uint8_t age; // between 0-200
uint8_t birthmonth; // between 0-11
uint8_t birthday; // between 1-31
uint8_t birthdayofweek; // between 0-6
}
If you need to access them sequentially, cache optimisation might win. Thus, reducing the data structure to the strict minimum of information may be better.
One might argue that the CPU will have to fetch a 32 bits block in memory anyway, then need to extract the byte thus requiring more time. However that is an arbitrary choice of the CPU and it should not influence your data structures. Imagine if tomorrow CPUs started working with 64 bits blocks, your "optimisation" would immediately become futile because there would be an extraction anyway. However, it's almost certain that memory cache optimisation will always be relevant. This is why I like to privilege making the application as light as it can be, memory-wise, even when concerned with performance.
Furthermore, if you are very concerned with memory you can obviously go below bytes, by packing some of those variables together. But then you'll need to use bit operations, thus additional instructions, and your code could quickly become unreadable / unmaintainable unless you formalize the storage into some sort of generic container / structure.

Thing is, I don't see large fields of Person being evaluated. Therefore, speed is not the issue here.
What IS an issue is consistence. You have a structure with public members that one could easily set to the wrong data, being it out of range or off by an index.
I mean, you are not consistent yourself here, are you?
int birthmonth; // between 0-11
int birthday; // between 1-31
Why are months starting at zero but days are not?
And even if you had this consistent, maybe one guy who uses your structure goes with "months start by zero", one guy with "months start by one". This invites error. Why has age a restriction? Sure, we will not find a 200 years old guy, but why not say max of type?
Rather use things like an enum class, a special type that restricts your range like template<int from, int to> class RangedInt or an already existing type, like Date (especially since such a type can check how many days february has in that year et cetera).
Some people commented on the gender with gender theory stuff here, but in any case, zero is not a gender. You mean male and female, but while we have some intuition when you say that day is in range 0..31, how would we guess if zero is male or female? Shouldn't be governed by comments. (And if you actually want to address gender theory stuff, I'd say that the right type is string.)
Also, in most cases, you would not change anything about a person once it is created. Changes that happen in real life (like getting a different name through marriage) would happen at database level or the like, not in class level. Therefore, I'd made this a class in which you can set only by constructor. Although, of course, this is dependent on the usage obviously. But if you can do it immutable, do it immutable.
Edit: And it's easy to make age inconsistent with the date of birth. That's no good. Remove the redundant data, make it a method or the like. Although the year is missing? Strange design, I must say.

Related

is using an integer to store many bool worth the effort?

I was considering ways to reduce memory footprint, and it is constantly mentioned that a bool takes up more memory than it logically needs to, as a byproduct of processor design.
it is also sometimes mentioned that one could store several bool within an int.
I am wondering if this would actually be more memory efficient?
if we have a usecase where we can use a significant portion of 32 (or 64) bool. and we decide to store all of them in a single int. then on the surface we have saved
7 (bits) * 32 (size of int) = 224 (bits) or 28 (bytes)
but in order to get each of those bits from the int, we needed to use some method of masking
such as:
bit shifting the int both directions (int<<x)>>y here we need to load and store x,y which are probably an int, but you could get them smaller depending on the use case
masking the int: int & int2 here we also store an additional int, which is stored and loaded
even if these aren't stored as variables, and they are defined statically within the code, it still ends up using additional memory, as it will increase the memory footprint of the instructions. as well as the instructions for the masking steps.
is there any way to do this that isn't actually worse for memory usage than just taking the hit on 7 wasted bits?
You are describing a text book example of a trade-off.
Yes, several bools in one int is hugeley more memory efficient - in itself.
Yes, you need to spend code to use that.
Yes, for only a few bools (for different values of "few"), the code might take more space than you save.
However, you could look at the kind of memory which is used. In some environments, RAM (which is saved by your idea) is much more expensive than ROM (which has to be paid for your idea).
Also, the price to pay is mostly paid once for implementation and only paid a fraction for using, especially when the using code is reused, e.g. in loops.
Altogether, in case of many bools, you can save more than you pay.
The point of actually saving needs to be determined for the special case.
On the other hand, you have missed on "currency" on the price-tag for the idea. You not only pay in memory, you also pay in execution time. You focused your question on memory, so I won't elaborate here. But for anything time critical, you should take the longer execution time into conisderation. You might find that saving memory is quite achievable with your idea, but the whole thing gets unbearably slow.
Again from the other side, as Eric Postpischil points out in a comment, execution speed can also improve due to cache effects from better memory footprint.
I am wondering if this would actually be more memory efficient?
Potentially yes. Storing multiple bools inside a single object may use less storage compared to having distinct bool object for each, if the number of bools is great enough to offset the cost in memory use of the instructions.
Also consider that there are more considerations than space efficiency. Most typically, people are concerned about time efficiency as well. In this regard, compacting bools may more or less efficient depending on the details of the use case.
is using an integer to store many bool worth the effort?
It can be worth the effort. It can also be counter productive. The difference can be minuscule or significant. Both in terms of time and space efficiency. Most accurate way to find out is to measure it.
It's not necessary to implement this yourself though, since there are solutions in the standard library. std::vector<bool> and std::bitset both implement compact storage of bools. Using bitfields may also be an option (just remember to not rely on the internal representation).

Is there an order you should declare variables in C++? [duplicate]

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
"re-order the member variables of a
class into most used and least used."
I'm not familiar with C++, nor with how it compiles, but I was wondering if
This statement is accurate?
How/Why?
Does it apply to other (compiled/scripting) languages?
I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.
Two issues here:
Whether and when keeping certain fields together is an optimization.
How to do actually do it.
The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.
The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:
an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
followed by an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next smallest, and so on down to members with no aligment requirement. For example if I'm trying to write portable code I might come up with this:
struct some_stuff {
double d; // I expect double is 64bit IEEE, it might not be
uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
uint32_t i; // 4 bytes, usually 4-aligned
int32_t j; // same
short s; // usually 2 bytes, could be 2-aligned or unaligned, I don't know
char c[4]; // array 4 chars, 4 bytes big but "never" needs 4-alignment
char d; // 1 byte, any alignment
};
If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
Depending on the type of program you're running this advice may result in increased performance or it may slow things down drastically.
Doing this in a multi-threaded program means you're going to increase the chances of 'false-sharing'.
Check out Herb Sutters articles on the subject here
I've said it before and I'll keep saying it. The only real way to get a real performance increase is to measure your code, and use tools to identify the real bottle neck instead of arbitrarily changing stuff in your code base.
It is one of the ways of optimizing the working set size. There is a good article by John Robbins on how you can speed up the application performance by optimizing the working set size. Of course it involves careful selection of most frequent use cases the end user is likely to perform with the application.
We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons):
group by alignment requirements (or, for newbies, "group by size" usually does the trick)
smallest first
"group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):
8-bit bytes can be loaded in one instruction if they exist at this+0 through this+31
16-bit halfwords if they exist at this+0 through this+62;
32-bit machine words if they exist at this+0 through this+124.
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)
While locality of reference to improve the cache behavior of data accesses is often a relevant consideration, there are a couple other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used on many embedded systems do not even have a cache.
- Memory alignment of the fields in structures
Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.
On most CPU architectures, fields in a structure must be accessed at a native alignment for efficiency. This means that if you mix various sized fields the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields to keep the required padding to a minimum. If a structure is to be 'packed' to prevent padding, accessing unaligned fields comes at a high runtime cost as the compiler has to access unaligned fields using a series of accesses to smaller parts of the field along with shifts and masks to assemble the field value in a register.
- Offset of frequently used fields in a structure
Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.
Some architectures have a limited number of bits available in an instruction to encode an offset to a pointer access, so if you access a field whose offset exceeds that number of bits the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM's Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large structure an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a structure's layout.
Well the first member doesn't need an offset added to the pointer to access it.
In C#, the order of the member is determined by the compiler unless you put the attribute [LayoutKind.Sequential/Explicit] which forces the compiler to lay out the structure/class the way you tell it to.
As far as I can tell, the compiler seems to minimize packing while aligning the data types on their natural order (i.e. 4 bytes int start on 4 byte addresses).
I'm focusing on performance, execution speed, not memory usage.
The compiler, without any optimizing switch, will map the variable storage area using the same order of declarations in code.
Imagine
unsigned char a;
unsigned char b;
long c;
Big mess-up? without align switches, low-memory ops. et al, we're going to have an unsigned char using a 64bits word on your DDR3 dimm, and another 64bits word for the other, and yet the unavoidable one for the long.
So, that's a fetch per each variable.
However, packing it, or re-ordering it, will cause one fetch and one AND masking to be able to use the unsigned chars.
So speed-wise, on a current 64bits word-memory machine, aligns, reorderings, etc, are no-nos. I do microcontroller stuff, and there the differences in packed/non-packed are reallllly noticeable (talking about <10MIPS processors, 8bit word-memories)
On the side, it's long known that the engineering effort required to tweak code for performance other than what a good algorithm instructs you to do, and what the compiler is able to optimize, often results in burning rubber with no real effects. That and a write-only piece of syntaxically dubius code.
The last step-forward in optimization I saw (in uPs, don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (much more general view of speed/pointer resolution/memory packing, etc), and have the linker trash non-called library functions, methods, etc.
In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together so you have tighter memory packing.
I highly doubt that would have any bearing in CPU improvements - maybe readability. You can optimize the executable code if the commonly executed basic blocks that are executed within a given frame are in the same set of pages. This is the same idea but would not know how create basic blocks within the code. My guess is the compiler puts the functions in the order it sees them with no optimization here so you could try and place common functionality together.
Try and run a profiler/optimizer. First you compile with some profiling option then run your program. Once the profiled exe is complete it will dump some profiled information. Take this dump and run it through the optimizer as input.
I have been away from this line of work for years but not much has changed how they work.

Should I store numbers in chars to save memory?

The question is pretty simple. Should I store some numbers which will not surpass 255 in a char or uint_8t variable types to save memory?
Is it common or even worth it saving a few bytes of memory?
Depends on your processor and the amount of memory your platform has.
For 16-bit, 32-bit and 64-bit processors, it doesn't make a lot of sense. On a 32-bit processor, it likes 32-bit quantities, so you are making it work a little harder. These processors are more efficient with 32-bit numbers (including registers).
Remember that you are trading memory space for processing time. Packing values and unpacking will cost more execution time than not packing.
Some embedded systems are space constrained, so restricting on size makes sense.
In present computing, reliability (a.k.a. robustness) and quality are high on the priority list. Memory efficiency and program efficiency have gone lower on the priority ladder. Development costs are also probably higher than worrying about memory savings.
It depends mostly on how many numbers you have to store. If you have billions of them, then yes, it makes sense to store them as compactly as possible. If you only have a few thousand, then no, it probably does not make sense. If you have millions, it is debatable.
As a rule of thumb, int is the "natural size" of integers on your platform. You many need a few extra bytes of code to read and write variables with other sizes. Still, if you got thousands of small values, a few extra bytes of code is worth the savings in data.
So, std::vector<unsigned char> vc; makes a lot more sense than just a single unsigned char c; value.
If it's a local variable, it mostly doesn't matter (from my experience different compilers with different optimization options do (usually negligibly) better with different type choices).
If it's saved elsewhere (static memory / heap) because you have many of those entities, then uint_least8_t is probably the best choice (the least and fast types are generally guaranteed to exists; the exact-width types generally are not).
unsigned char will reliably provide enough bits too (UCHAR_MAX is guaranteed to be at least 255 and since sizeof(unsigned char) is guaranteed to be 1 and sizeof(AnyType) >= 1, there can be no unsigned integer type that's smaller).

Does int32_t have lower latency than int8_t, int16_t and int64_t?

(I'm referring to Intel CPUs and mainly with GCC, but poss ICC or MSVC)
Is it true using int8_t, int16_t or int64_t is less efficient compared with int32_tdue to additional instructions generated to to convert between the CPU word size and the chosen variable size?
I would be interested if anybody has any examples or best practices for this? I sometimes use smaller variable sizes to reduce cacheline loads, but say I only consumed 50 bytes of a cacheline with one variable being 8-bit int, it may be quicker processing by using the remaining cacheline space and promote the 8-bit int to a 32-bit int etc?
You can stuff more uint8_ts into a cache line, so loading N uint8_ts will be faster than loading N uint32_ts.
In addition, if you are using a modern Intel chip with SIMD instructions, a smart compiler will vectorize what it can. Again, using a small variable in your code will allow the compiler to stuff more lanes into a SIMD register.
I think it is best to use the smallest size you can, and leave the details up to the compiler. The compiler is probably smarter than you (and me) when it comes to stuff like this. For many operations (say unsigned addition), the compiler can use the same code for uint8, uint16 or uint32 (and just ignore the upper bits), so there is no speed difference.
The bottom line is that a cache miss is WAY more expensive than any arithmetic or logical operation, so it is nearly always better to worry about cache (and thus data size) than simple arithmetic.
(It used to be true a long time again that on Sun workstation, using double was significantly faster than float, because the hardware only supported double. I don't think that is true any more for modern x86, as the SIMD hardware (SSE, etc) have direct support for both single and double precision).
Mark Lakata answer points in the right direction.
I would like to add some points.
A wonderful resource for understanding and taking optimization decision are the Agner documents.
The Instruction Tables document has the latency for the most common instructions. You can see that some of them perform better in the native size version.
A mov for example may be eliminated, a mul have less latency.
However here we are talking about gaining 1 clock, we would have to execute a lot of instruction to compensate for a cache miss.
If this were the whole story it would have not worth it.
The real problems comes with the decoders.
When you use some length-changing prefixes (and you will by using non native size word) the decoder takes extra cycles.
The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes.
In, nowadays, no longer more recent (but still present) microarchs the penalty was severe, specially with some kind arithmetic instructions.
In later microarchs this has been mitigated but the penalty it is still present.
Another aspect to consider is that using non native size requires to prefix the instructions and thereby generating larger code.
This is the closest as possible to the statement "additional instructions [are] generated to to convert between the CPU word size and the chosen variable size" as Intel CPU can handle non native word sizes.
With other, specially RISC, CPUs this is not generally true and more instruction can be generated.
So while you are making an optimal use of the data cache, you are also making a bad use of the instruction cache.
It is also worth nothing that on the common x64 ABI the stack must be aligned on 16 byte boundary and that usually the compiler saves local vars in the native word size or a close one (e.g. a DWORD on 64 bit system).
Only if you are allocating a sufficient number of local vars or if you are using array or packed structs you can gain benefits from using small variable size.
If you declare a single uint16_t var, it will probably takes the same stack space of a single uint64_t, so it is best to go for the fastest size.
Furthermore when it come to the data cache it is the locality that matters, rather than the data size alone.
So, what to do?
Luckily, you don't have to decide between having small data or small code.
If you have a considerable quantity of data this is usually handled with arrays or pointers and by the use of intermediate variables. An example being this line of code.
t = my_big_data[i];
Here my approach is:
Keep the external representation of data, i.e. the my_big_data array, as small as possible. For example if that array store temperatures use a coded uint8_t for each element.
Keep the internal representation of data, i.e. the t variable, as close as possible to the CPU word size. For example t could be a uint32_t or uint64_t.
This way you program optimize both caches and use the native word size.
As a bonus you may later decide to switch to SIMD instructions without have to repack the my_big_data memory layout.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
D. Knuth
When you design your structures memory layout be problem driven. For example, age values need 8 bit, city distances in miles need 16 bits.
When you code the algorithms use the fastest type the compiler is known to have for that scope. For example integers are faster than floating point numbers, uint_fast8_t is no slower than uint8_t.
When then it is time to improve the performance start by changing the algorithm (by using faster types, eliminating redundant operations, and so on) and then if it is needed the data structures (by aligning, padding, packing and so on).

Optimizing member variable order in C++

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
"re-order the member variables of a
class into most used and least used."
I'm not familiar with C++, nor with how it compiles, but I was wondering if
This statement is accurate?
How/Why?
Does it apply to other (compiled/scripting) languages?
I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.
Two issues here:
Whether and when keeping certain fields together is an optimization.
How to do actually do it.
The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.
The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:
an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
followed by an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next smallest, and so on down to members with no aligment requirement. For example if I'm trying to write portable code I might come up with this:
struct some_stuff {
double d; // I expect double is 64bit IEEE, it might not be
uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
uint32_t i; // 4 bytes, usually 4-aligned
int32_t j; // same
short s; // usually 2 bytes, could be 2-aligned or unaligned, I don't know
char c[4]; // array 4 chars, 4 bytes big but "never" needs 4-alignment
char d; // 1 byte, any alignment
};
If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
Depending on the type of program you're running this advice may result in increased performance or it may slow things down drastically.
Doing this in a multi-threaded program means you're going to increase the chances of 'false-sharing'.
Check out Herb Sutters articles on the subject here
I've said it before and I'll keep saying it. The only real way to get a real performance increase is to measure your code, and use tools to identify the real bottle neck instead of arbitrarily changing stuff in your code base.
It is one of the ways of optimizing the working set size. There is a good article by John Robbins on how you can speed up the application performance by optimizing the working set size. Of course it involves careful selection of most frequent use cases the end user is likely to perform with the application.
We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons):
group by alignment requirements (or, for newbies, "group by size" usually does the trick)
smallest first
"group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):
8-bit bytes can be loaded in one instruction if they exist at this+0 through this+31
16-bit halfwords if they exist at this+0 through this+62;
32-bit machine words if they exist at this+0 through this+124.
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)
While locality of reference to improve the cache behavior of data accesses is often a relevant consideration, there are a couple other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used on many embedded systems do not even have a cache.
- Memory alignment of the fields in structures
Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.
On most CPU architectures, fields in a structure must be accessed at a native alignment for efficiency. This means that if you mix various sized fields the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields to keep the required padding to a minimum. If a structure is to be 'packed' to prevent padding, accessing unaligned fields comes at a high runtime cost as the compiler has to access unaligned fields using a series of accesses to smaller parts of the field along with shifts and masks to assemble the field value in a register.
- Offset of frequently used fields in a structure
Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.
Some architectures have a limited number of bits available in an instruction to encode an offset to a pointer access, so if you access a field whose offset exceeds that number of bits the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM's Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large structure an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a structure's layout.
Well the first member doesn't need an offset added to the pointer to access it.
In C#, the order of the member is determined by the compiler unless you put the attribute [LayoutKind.Sequential/Explicit] which forces the compiler to lay out the structure/class the way you tell it to.
As far as I can tell, the compiler seems to minimize packing while aligning the data types on their natural order (i.e. 4 bytes int start on 4 byte addresses).
I'm focusing on performance, execution speed, not memory usage.
The compiler, without any optimizing switch, will map the variable storage area using the same order of declarations in code.
Imagine
unsigned char a;
unsigned char b;
long c;
Big mess-up? without align switches, low-memory ops. et al, we're going to have an unsigned char using a 64bits word on your DDR3 dimm, and another 64bits word for the other, and yet the unavoidable one for the long.
So, that's a fetch per each variable.
However, packing it, or re-ordering it, will cause one fetch and one AND masking to be able to use the unsigned chars.
So speed-wise, on a current 64bits word-memory machine, aligns, reorderings, etc, are no-nos. I do microcontroller stuff, and there the differences in packed/non-packed are reallllly noticeable (talking about <10MIPS processors, 8bit word-memories)
On the side, it's long known that the engineering effort required to tweak code for performance other than what a good algorithm instructs you to do, and what the compiler is able to optimize, often results in burning rubber with no real effects. That and a write-only piece of syntaxically dubius code.
The last step-forward in optimization I saw (in uPs, don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (much more general view of speed/pointer resolution/memory packing, etc), and have the linker trash non-called library functions, methods, etc.
In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together so you have tighter memory packing.
I highly doubt that would have any bearing in CPU improvements - maybe readability. You can optimize the executable code if the commonly executed basic blocks that are executed within a given frame are in the same set of pages. This is the same idea but would not know how create basic blocks within the code. My guess is the compiler puts the functions in the order it sees them with no optimization here so you could try and place common functionality together.
Try and run a profiler/optimizer. First you compile with some profiling option then run your program. Once the profiled exe is complete it will dump some profiled information. Take this dump and run it through the optimizer as input.
I have been away from this line of work for years but not much has changed how they work.