Bit Aligning for Space and Performance Boosts - c++

In the book Game Coding Complete, 3rd Edition, the author mentions a technique to both reduce data structure size and increase access performance. In essence it relies on the fact that you gain performance when member variables are memory aligned. This is an obvious potential optimization that compilers would take advantage of, but by making sure each variable is aligned they end up bloating the size of the data structure.
Or that was his claim at least.
The real performance increase, he states, comes from using your brain and ensuring that your structure is properly designed to take advantage of speed increases while preventing the compiler bloat. He provides the following code snippet:
#pragma pack( push, 1 )
struct SlowStruct
{
char c;
__int64 a;
int b;
char d;
};
struct FastStruct
{
__int64 a;
int b;
char c;
char d;
char unused[ 2 ]; // fill to 8-byte boundary for array use
};
#pragma pack( pop )
Using the above struct objects in an unspecified test he reports a performance increase of 15.6% (222ms compared to 192ms) and a smaller size for the FastStruct. This all makes sense on paper to me, but it fails to hold up under my testing:
Same time results and size (counting for the char unused[ 2 ])!
Now if the #pragma pack( push, 1 ) is isolated only to FastStruct (or removed completely) we do see a difference:
So, finally, here lies the question: Do modern compilers (VS2010 specifically) already optimize for the bit alignment, hence the lack of performance increase (but increase the structure size as a side effect, like Mike McShaffry stated)? Or is my test not intensive enough/inconclusive to return any significant results?
For the tests I did a variety of tasks from math operations, column-major multi-dimensional array traversing/checking, matrix operations, etc. on the unaligned __int64 member. None of which produced different results for either structure.
In the end, even if there was no performance increase, this is still a useful tidbit to keep in mind for keeping memory usage to a minimum. But I would love it if there was a performance boost (no matter how minor) that I am just not seeing.

It is highly dependent on the hardware.
Let me demonstrate:
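#include <cstring> // memset
#include <ctime> // clock, CLOCKS_PER_SEC
#include <iostream>
using namespace std;
#pragma pack( push, 1 )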
struct SlowStruct
{
char c;
__int64 a;
int b;
char d;
};
struct FastStruct
{
__int64 a;
int b;
char c;
char d;
char unused[ 2 ]; // fill to 8-byte boundary for array use
};
#pragma pack( pop )
int main (void){
int x = 1000;
int iterations = 10000000;
SlowStruct *slow = new SlowStruct[x];
FastStruct *fast = new FastStruct[x];
// Warm the cache.
memset(slow,0,x * sizeof(SlowStruct));
clock_t time0 = clock();
for (int c = 0; c < iterations; c++){
for (int i = 0; i < x; i++){
slow[i].a += c;
}
}
clock_t time1 = clock();
cout << "slow = " << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
// Warm the cache.
memset(fast,0,x * sizeof(FastStruct));
time1 = clock();
for (int c = 0; c < iterations; c++){
for (int i = 0; i < x; i++){
fast[i].a += c;
}
}
clock_t time2 = clock();
cout << "fast = " << (double)(time2 - time1) / CLOCKS_PER_SEC << endl;
// Print to avoid Dead Code Elimination
__int64 sum = 0;
for (int c = 0; c < x; c++){
sum += slow[c].a;
sum += fast[c].a;
}
cout << "sum = " << sum << endl;
delete[] slow;
delete[] fast;
return 0;
}
Core i7 920 @ 3.5 GHz
slow = 4.578
fast = 4.434
sum = 99999990000000000
Okay, not much difference. But it's still consistent over multiple runs. So the alignment makes a small difference on Nehalem Core i7.
Intel Xeon X5482 Harpertown @ 3.2 GHz (Core 2 - generation Xeon)
slow = 22.803
fast = 3.669
sum = 99999990000000000
Now take a look...
6.2x faster!!!
Conclusion:
You see the results. You decide whether or not it's worth your time to do these optimizations.
EDIT :
Same benchmarks but without the #pragma pack:
Core i7 920 @ 3.5 GHz
slow = 4.49
fast = 4.442
sum = 99999990000000000
Intel Xeon X5482 Harpertown @ 3.2 GHz
slow = 3.684
fast = 3.717
sum = 99999990000000000
The Core i7 numbers didn't change. Apparently it can handle misalignment without trouble for this benchmark.
The Core 2 Xeon now shows the same times for both versions. This confirms that misalignment is a problem on the Core 2 architecture.
Taken from my comment:
If you leave out the #pragma pack, the compiler will keep everything aligned so you don't see this issue. So this is actually an example of what could happen if you misuse #pragma pack.
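To see what the pragma actually changes in the layout, here is a minimal sketch. It assumes a compiler that accepts #pragma pack and a 4-byte int, and it uses std::int64_t in place of the MSVC-specific __int64. The names PackedSlow, PackedFast, NaturalSlow and a_is_aligned are made up for illustration.

```cpp
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>   // std::int64_t

#pragma pack(push, 1)
struct PackedSlow {        // member order of SlowStruct, no padding allowed
    char c;
    std::int64_t a;        // forced to offset 1, so it is misaligned
    int b;
    char d;
};
struct PackedFast {        // member order of FastStruct
    std::int64_t a;        // offset 0
    int b;
    char c;
    char d;
    char unused[2];        // manual fill to a multiple of 8 bytes
};
#pragma pack(pop)

struct NaturalSlow {       // SlowStruct without the pragma: compiler pads
    char c;
    std::int64_t a;
    int b;
    char d;
};

// True if a member at this offset sits on the natural boundary of a 64-bit integer.
constexpr bool a_is_aligned(std::size_t offset) {
    return offset % alignof(std::int64_t) == 0;
}
```

With packing, PackedSlow is 14 bytes and its 64-bit member sits at offset 1, which is the misaligned access the Core 2 Xeon struggles with. Without the pragma, the compiler pads NaturalSlow so that a_is_aligned(offsetof(NaturalSlow, a)) holds, at the cost of a larger struct.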

Such hand-optimizations are generally long dead. Alignment is only a serious consideration if you're packing for space, or if you have an enforced-alignment type like SSE types. The compiler's default alignment and packing rules are intentionally designed to maximize performance, obviously, and whilst hand-tuning them can be beneficial, it's not generally worth it.
Probably, in your test program, the compiler never stored any structure on the stack and just kept the members in registers, which do not have alignment, which means that it's fairly irrelevant what the structure size or alignment is.
Here's the thing: There can be aliasing and other nasties with sub-word accessing, and it's no slower to access a whole word than to access a sub-word. So in general, it's no more efficient, in time, to pack more tightly than word size if you're only accessing, say, one member.

Visual Studio is a great compiler when it comes to optimization. However, bear in mind that the current "Optimization War" in game development is not being fought in the PC arena. While such optimizations may quite well be dead on the PC, on the console platforms it's a completely different pair of shoes.
That said, you might want to repost this question on the specialized gamedev stackexchange site, you might get some answers directly from "the field".
Finally, your results are exactly the same up to the microsecond which is dead impossible on a modern multithreaded system -- I'm pretty sure you either use a very low resolution timer, or your timing code is broken.

Modern compilers align members on different byte boundaries depending on the size of the member. See the bottom of this.
Normally you really shouldn't care about structure padding, but if you have an object that is going to have 1000000 instances or something, the rule of thumb is simply to order your members from biggest to smallest. I wouldn't recommend messing with the padding with #pragma directives.

The compiler is going to optimize for either size or speed, and unless you explicitly tell it which, you won't know what you get. But if you follow the advice of that book you will win on both counts on most compilers. Put the biggest, aligned things first in your struct, then the half-size stuff, then the single-byte stuff if any, and add some dummy variables to align. Using bytes for things that don't have to be bytes can be a performance hit anyway; as a compromise, use ints for everything (you have to know the pros and cons of doing that).
The x86 has made for a lot of bad programmers and compilers because it allows unaligned accesses, making it hard for many folks to move to other platforms (that are taking over). Although unaligned accesses work on an x86, you can take a serious performance hit, which is why it is important to know how compilers work, both in general and for the particular one you are using.
Modern computer platforms rely on caches to get any kind of performance, so you want your data to be both aligned and packed. The simple rule being taught gives you both... in general. It is very good advice. Adding compiler-specific pragmas is not nearly as good: it makes the code non-portable, and it doesn't take much searching through SO or googling to find out how often the compiler ignores the pragma or doesn't do what you really wanted.

On some platforms the compiler doesn't have an option: objects of types bigger than char often have strict requirements to be at a suitably aligned address. Typically the alignment requirements are identical to the size of the object, up to the size of the biggest word supported by the CPU natively. That is, short typically has to be at an even address, long at an address divisible by 4, double at an address divisible by 8, and e.g. SIMD vectors at an address divisible by 16.
Since C and C++ require ordering of members in the order they are declared, the size of structures will differ quite a bit on the corresponding platforms. Since bigger structures effectively cause more cache misses, page misses, etc., there will be a substantial performance degradation when creating bigger structures.
Since I saw a claim that it doesn't matter: it matters on most (if not all) systems I'm using. Here is a simple example showing different sizes. How much this affects the performance obviously depends on how the structures are to be used.
#include <iostream>
struct A
{
char a;
double b;
char c;
double d;
};
struct B
{
double b;
double d;
char a;
char c;
};
int main()
{
std::cout << "sizeof(A) = " << sizeof(A) << "\n";
std::cout << "sizeof(B) = " << sizeof(B) << "\n";
}
./alignment.tsk
sizeof(A) = 32
sizeof(B) = 24

The C standard specifies that fields within a struct must be allocated at increasing addresses. A struct which has eight variables of type 'int8' and seven variables of type 'int64', stored in that order, will take 64 bytes (pretty much regardless of a machine's alignment requirements). If the fields were ordered 'int8', 'int64', 'int8', ... 'int64', 'int8', the struct would take 120 bytes on a platform where 'int64' fields are aligned on 8-byte boundaries. Reordering the fields yourself will allow them to be packed more tightly. Compilers, however, will not reorder fields within a struct absent explicit permission to do so, since doing so could change program semantics.
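The two sizes quoted above can be checked at compile time. This is only a sketch: the struct names Grouped and Interleaved are made up, and the 120-byte figure is asserted only when 64-bit integers are aligned on 8-byte boundaries, as the text assumes.

```cpp
#include <cstdint>

// Eight 1-byte fields first, then seven 8-byte fields.
struct Grouped {
    std::int8_t  b0, b1, b2, b3, b4, b5, b6, b7;
    std::int64_t q0, q1, q2, q3, q4, q5, q6;
};

// The same fields, alternating 1-byte and 8-byte, starting and ending with a byte.
struct Interleaved {
    std::int8_t b0; std::int64_t q0;
    std::int8_t b1; std::int64_t q1;
    std::int8_t b2; std::int64_t q2;
    std::int8_t b3; std::int64_t q3;
    std::int8_t b4; std::int64_t q4;
    std::int8_t b5; std::int64_t q5;
    std::int8_t b6; std::int64_t q6;
    std::int8_t b7;
};

// 8 * 1 + 7 * 8 bytes; the bytes fill exactly one 8-byte slot, so no padding is needed.
static_assert(sizeof(Grouped) == 64, "grouped layout should need no padding");

// Each byte field drags 7 bytes of padding behind it when int64 is 8-byte aligned.
static_assert(alignof(std::int64_t) != 8 || sizeof(Interleaved) == 120,
              "interleaved layout pads every 1-byte field to 8 bytes");
```

Both structs hold exactly the same data; only the declaration order differs, and the compiler is not allowed to fix that for you.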

Related

Why does this piece of code written using uint8_t run faster than analogous code written with uint32_t or uint64_t on a 64bit machine?

Isn't the common knowledge that math operations on 64bit systems run faster on 32/64 bit datatypes than the smaller datatypes like short due to implicit promotion? Yet while testing my bitset implementation(where the majority of the time depends on bitwise operations), I found I got a ~40% improvement using uint8_t over uint32_t. I'm especially surprised because there is hardly any copying going on that would justify the difference. The same thing occurred regardless of the clang optimisation level.
8bit:
#define mod8(x) x&7
#define div8(x) x>>3
template<unsigned long bits>
struct bitset{
private:
uint8_t fill[8] = {};
uint8_t clear[8];
uint8_t band[(bits/8)+1] = {};
public:
template<typename T>
inline bool operator[](const T ind) const{
return band[div8(ind)]&fill[mod8(ind)];
}
template<typename T>
inline void store_high(const T ind){
band[div8(ind)] |= fill[mod8(ind)];
}
template<typename T>
inline void store_low(const T ind){
band[div8(ind)] &= clear[mod8(ind)];
}
bitset(){
for(uint8_t ii = 0, val = 1; ii < 8; ++ii){
fill[ii] = val;
clear[ii] = ~fill[ii];
val*=2;
}
}
};
32bit:
#define mod32(x) x&31
#define div32(x) x>>5
template<unsigned long bits>
struct bitset{
private:
uint32_t fill[32] = {};
uint32_t clear[32];
uint32_t band[(bits/32)+1] = {};
public:
template<typename T>
inline bool operator[](const T ind) const{
return band[div32(ind)]&fill[mod32(ind)];
}
template<typename T>
inline void store_high(const T ind){
band[div32(ind)] |= fill[mod32(ind)];
}
template<typename T>
inline void store_low(const T ind){
band[div32(ind)] &= clear[mod32(ind)];
}
bitset(){
for(uint32_t ii = 0, val = 1; ii < 32; ++ii){
fill[ii] = val;
clear[ii] = ~fill[ii];
val*=2;
}
}
};
And here is the benchmark I used(just moves a single 1 from position 0 till the end iteratively):
const int len = 1000000;
bitset<len> bs;
{
auto start = std::chrono::high_resolution_clock::now();
bs.store_high(0);
for (int ii = 1; ii < len; ++ii) {
bs.store_high(ii);
bs.store_low(ii-1);
}
auto stop = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>((stop-start)).count()<<std::endl;
}
TL;DR: large "buckets" for a bitset mean you access the same one repeatedly when you iterate linearly, creating longer dependency chains that out-of-order exec can't overlap as effectively.
Smaller buckets give instruction-level parallelism, making operations on bits in separate bytes independent of each other.
One possible reason is that you iterate linearly over bits, so all the operations within the same band[] element form one long dependency chain of &= and |= operations, plus store and reload (if the compiler doesn't manage to optimize that away with loop unrolling).
For uint32_t band[], that's a chain of 2x 32 operations, since ii>>5 will give the same index for that long.
Out-of-order exec can only partially overlap execution of these long chains if their latency and instruction-count is too large for the ROB (ReOrder Buffer) and RS (Reservation Station, aka Scheduler). With 64 operations probably including store/reload latency (4 or 5 cycles on modern x86), that's a dep chain length of probably 6 x 64 = 384 cycles, composed of probably at least 128 uops, with some parallelism for loading (or better calculating) 1U<<(n&31) or rotl(-1U, n&31) masks that can "use up" some of the wasted execution slots in the pipeline.
But for uint8_t band[], you're moving to a new element 4x as frequently, after only 2x 8 = 16 operations, so the dep chains are 1/4 the length.
See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another case of a modern x86 CPU overlapping two long dependency chains (a simple chain of imul with no other instruction-level parallelism), especially the part about a single dep chain becoming longer than the RS (scheduler for un-executed uops) being the point at which we start to lose some of the overlap of execution of the independent work. (For the case without lfence to artificially block overlap.)
See also Modern Microprocessors: A 90-Minute Guide! and https://www.realworldtech.com/sandy-bridge/ for some background on how modern OoO exec CPUs decode and look at instructions.
Small vs. large buckets
Large buckets are only useful when scanning through for the first non-zero bit, or filling the whole thing or something. Of course, really you'd want to vectorize that with SIMD, checking 16 or 32 bytes at once to see if there's a non-zero element anywhere in that. Current compilers will vectorize for you in loops that fill the whole array, but not search loops (or anything with a trip-count that can't be calculated ahead of the first iteration), except for ICC which can handle that. Re: using fast operations over bit-vectors, see Howard Hinnant's article (in the context of vector<bool>, which is an unfortunate name for a sometimes-useful data structure.)
C++ unfortunately doesn't make it easy in general to use different sized accesses to the same data, unless you compile with g++ -O3 -fno-strict-aliasing or something like that.
Although unsigned char can always alias anything else, so you could use that for your single-bit accesses, only using uintptr_t (which is likely to be as wide as a register, except on ILP32-on-64bit ISAs) for init or whatever. Or in this case, uint_fast32_t being a 64-bit type on many x86-64 C++ implementations would make it useful for this, unlike usual when that sucks, wasting cache footprint when you're only using the value-range of a 32-bit number and being slower for non-constant division on some CPUs.
On x86 CPUs, a byte store is naturally fully efficient, but even on an ARM or something, coalescing in the store buffer could still make adjacent byte RMWs fully efficient. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). And you'd still gain ILP; a slower commit to cache is still not as bad as coupling loads to stores that could have been independent if narrower. Especially important on lower-end CPUs with smaller out-of-order scheduler buffers.
(x86 byte loads need to use movzx to zero-extend to avoid false dependencies, but most compilers know that. Clang is reckless about it which can occasionally hurt.)
(Different sized accesses close to each other can lead to store-forwarding stalls, e.g. a byte store and an unsigned long reload that overlaps that byte will have extra latency: What are the costs of failed store-to-load forwarding on x86?)
Code review:
Storing an array of masks is probably worse than just computing uint32_t(1) << (n&31) as needed, on most CPUs. If you're really lucky, a smart compiler might manage constant propagation from the constructor into the benchmark loop, and realize that it can rotate or shift inside the loop to generate the bitmask instead of indexing memory in a loop that already does other memory operations.
(Some non-x86 ISAs have better bit-manipulation instructions and can materialize 1<<n cheaply, although x86 can do that in 2 instructions as well if compilers are smart. xor eax,eax / bts eax, esi, with the BTS implicitly masking the shift count by the operand-size. But that only works so well for 32-bit operand-size, not 8-bit. Without BMI2 shlx, x86 variable-count shifts run as 3-uops on Intel CPUs, vs. 1 on AMD.)
Almost certainly not worth it to store both fill[] and clear[] constants. Some ISAs even have an andn instruction that can NOT one of the operands on the fly, i.e. implements (~x) & y in one instruction. For example, x86 with BMI1 extensions has andn. (gcc -march=haswell).
Also, your macros are unsafe: wrap the expression in () so operator precedence doesn't bite you if you use foo[div8(x) - 1].
As in #define div8(x) ((x)>>3)
But really, you shouldn't be using CPP macros for stuff like this anyway. Even in modern C, just define static const int shift = 3; shift counts and masks. In C++, do that inside the struct/class scope, and use band[idx >> shift] or something. (When I was typing ind, my fingers wanted to type int; idx is probably a better name.)
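Putting the code-review points together, a byte-bucket version without macros or stored mask tables could look roughly like this. It is a sketch, not a drop-in replacement: the class name ByteBitset and the helper bit() are made up, and the interface simply mirrors the question's store_high / store_low / operator[].

```cpp
#include <cstddef>
#include <cstdint>

template <std::size_t Bits>
class ByteBitset {
    static constexpr unsigned shift = 3;       // log2(bits per bucket)
    static constexpr std::size_t mask = 7;     // bits per bucket - 1
    std::uint8_t band[(Bits >> shift) + 1] = {};

    // Compute the single-bit mask on the fly instead of loading it from a table.
    static constexpr std::uint8_t bit(std::size_t idx) {
        return static_cast<std::uint8_t>(1u << (idx & mask));
    }

public:
    bool operator[](std::size_t idx) const {
        return (band[idx >> shift] & bit(idx)) != 0;
    }
    void store_high(std::size_t idx) {
        band[idx >> shift] |= bit(idx);
    }
    void store_low(std::size_t idx) {
        // ~bit(idx) is computed as needed, so no separate clear[] table is required.
        band[idx >> shift] &= static_cast<std::uint8_t>(~bit(idx));
    }
};
```

Usage is the same as in the benchmark: ByteBitset<1000000> bs; bs.store_high(ii); bs.store_low(ii - 1);. Whether it is actually faster than the table version is something to measure, not assume.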
Isn't the common knowledge that math operations on 64bit systems run faster on 32/64 bit datatypes than the smaller datatypes like short due to implicit promotion?
This isn't a universal truth. As always, it depends on details.
Why does this piece of code written using uint_8 run faster than analogous code written with uint_32 or uint_64 on a 64bit machine?
The title doesn't match the question. There are no such types as uint_X and you aren't using uintX_t. You are using uint_fastX_t. uint_fastX_t is an alias for an integer type that is at least X bits wide and that is deemed by the language implementers to provide the fastest operations.
If we were to take your earlier mentioned assumption for granted, then it should logically follow that the language implementers would have chosen to use 32/64 bit type as uint_fast8_t. That said, you cannot assume that they have done so and whatever generic measurement (if any) has been used to make that choice doesn't necessarily apply to your case.
That said, regardless of which type uint_fast8_t is an alias of, your test isn't fair for comparing the relative speeds of calculation of potentially different integer types:
uint_fast8_t fill[8] = {};
uint_fast8_t clear[8];
uint_fast8_t band[(bits/8)+1] = {};
uint_fast32_t fill[32] = {};
uint_fast32_t clear[32];
uint_fast32_t band[(bits/32)+1] = {};
Not only are the types (potentially) different, but the sizes of the arrays are too. This can certainly have an effect on the efficiency.
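To put numbers on the array sizes, here is a small sketch that computes the footprint of band[] for a given bucket type. The helper name band_bytes is made up; the formula is the (bits/N)+1 element count from the declarations above multiplied by the element size.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes occupied by band[] when each bucket of type Bucket holds bits_per_bucket bits.
template <typename Bucket>
constexpr std::size_t band_bytes(std::size_t bits, std::size_t bits_per_bucket) {
    return (bits / bits_per_bucket + 1) * sizeof(Bucket);
}
```

For the 1000000-bit benchmark, band_bytes<std::uint8_t>(1000000, 8) is 125001 and band_bytes<std::uint32_t>(1000000, 32) is 125004, so with exact-width types the two band arrays are nearly the same size. With the fast types the result depends on what the implementation picks, so band_bytes<std::uint_fast8_t>(1000000, 8) can come out several times larger.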

Simple C++ Loop Not Benefitting from Multithreading

I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.
There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.
One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).
Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.
Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?
#include <chrono>
#include <iostream>
#include <thread>
void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);
namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;
int main() {
int* x = new int[itemCount];
int* y = new int[itemCount];
// Initialize arrays
for (int i = 0; i < itemCount; i++) {
x[i] = 1;
y[i] = 2;
}
// Call add() on multiple threads
std::thread threads[threadCount];
startTimer();
for (int i = 0; i < threadCount; ++i) {
threads[i] = std::thread(add, x, y, i);
}
for (auto& thread : threads) {
thread.join();
}
stopTimer();
// Verify results
for (int i = 0; i < itemCount; ++i) {
if (y[i] != 3) {
std::cout << "Error!";
}
}
delete[] x;
delete[] y;
}
void add(int* x, int* y, int threadIdx) {
int firstIdx = threadIdx * itemsPerThread;
int lastIdx = firstIdx + itemsPerThread - 1;
for (int i = firstIdx; i <= lastIdx; ++i) {
y[i] = x[i] + y[i];
}
}
void startTimer() {
start = ch::steady_clock::now();
}
void stopTimer() {
auto end = ch::steady_clock::now();
auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
std::cout << duration << " ms\n";
}
You may be simply hitting the memory transfer rate of your machine, you are doing 8GB of reads and 4GB of writes.
On my machine your test completes in about 500ms which is 24GB/s (which is similar to the results given by a memory bandwidth tester).
As you hit each memory address with a single read and a single write the caches aren't much use as you aren't reusing memory.
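The arithmetic behind those figures, as a sketch. It assumes a 4-byte int and counts GB as 2^30 bytes; the Traffic struct and both function names are made up for illustration.

```cpp
#include <cstdint>

struct Traffic {
    std::uint64_t bytes_read;
    std::uint64_t bytes_written;
};

// y[i] = x[i] + y[i] reads x[i] and y[i] and writes y[i], once per element.
constexpr Traffic add_traffic(std::uint64_t items, std::uint64_t bytes_per_item) {
    return Traffic{2 * items * bytes_per_item, items * bytes_per_item};
}

// Average transfer rate in GiB/s for the given traffic and elapsed time.
constexpr double gib_per_second(Traffic t, double seconds) {
    return static_cast<double>(t.bytes_read + t.bytes_written)
           / (1024.0 * 1024.0 * 1024.0) / seconds;
}
```

With itemCount = 1 << 30 and 4-byte items, add_traffic gives 8 GiB read and 4 GiB written, and gib_per_second over 0.5 s comes out at 24. No number of threads will move data faster than the memory system allows.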
Your problem is not the processor. You ran up against RAM read and write latency: your cache can hold a few megabytes of data, and you exceed that by far. Multi-threading only helps as long as you can shovel data into your processor fast enough. The cache in your processor is incredibly fast compared to your RAM, and once you exceed its capacity, this turns into a RAM latency test.
If you want to see the advantages of multi-threading, you have to choose data sizes in range of your cache size.
EDIT
Another thing to do, would be to create a higher workload for the cores, so the storage latency goes unrecognized.
Side note: keep in mind that your core has several execution units, one or more for each type of operation (integer, float, shift and so on). That means one core can execute more than one instruction per step; in particular, one operation per execution unit. You can keep the data size of the test data and do more stuff with it - be creative =) Filling the queue with integer operations only will give you an advantage in multi-threading. If you can vary when and where your code does different operations, do it; this will also have an impact on the speedup. Or avoid it, if you want to see a nice speedup on multi-threading.
To avoid any kind of optimization, you should use randomized test data, so neither the compiler nor the processor itself can predict what the outcome of your operation is.
Also avoid doing branches like if and while. Each decision the processor has to predict and execute, will slow you down and alter the result. With branch-prediction, you will never get a deterministic result. Later in a "real" program, be my guest and do what you want. But when you want to explore the multi-threading world, this could lead you to wrong conclusions.
BTW
Please use a delete for every new you use, to avoid memory leaks. Even better, avoid plain pointers, new and delete altogether and use RAII. I advise using std::array or std::vector, or simply any STL container. This will save you tons of debugging time and headaches.
Speedup from parallelization is limited by the portion of the task that remains serial. This is called Amdahl's law. In your case, a decent amount of that serial time is spent initializing the array.
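As a quick sketch of the formula (the function and parameter names are made up; parallel_fraction is the share of the run time that parallelizes perfectly and workers is the thread count):

```cpp
// Amdahl's law: overall speedup when only part of the work can run in parallel.
constexpr double amdahl_speedup(double parallel_fraction, unsigned workers) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

For example, if only half of the run time parallelizes, amdahl_speedup(0.5, 8) is about 1.78, however many cores you have beyond that.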
Are you compiling the code with -O3? If so, the compiler might be able to unroll and/or vectorize some of the loops. The loop strides are predictable, so hardware prefetching might help as well.
You might also want to explore whether using all 8 hyperthreads is useful or if it's better to run 1 thread per core (I am going to guess that since the problem is memory-bound, you'll likely benefit from all 8 hyperthreads).
Nevertheless, you'll still be limited by memory bandwidth. Take a look at the roofline model. It'll help you reason about the performance and what speedup you can theoretically expect. In your case, you're hitting the memory bandwidth wall that effectively limits the ops/sec achievable by your hardware.
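The roofline bound itself is a one-liner. In this sketch the function and parameter names are made up, and any peak compute figure you plug in is your own assumption about the hardware:

```cpp
#include <algorithm>

// Roofline model: throughput is capped either by compute or by memory traffic.
// ops_per_byte is the arithmetic intensity of the loop.
inline double attainable_ops_per_sec(double peak_ops_per_sec,
                                     double bytes_per_sec,
                                     double ops_per_byte) {
    return std::min(peak_ops_per_sec, bytes_per_sec * ops_per_byte);
}
```

The add() loop does one addition per 12 bytes moved (two 4-byte reads and one 4-byte write, assuming 4-byte int). At 24 GB/s that caps it at roughly 2 billion additions per second, regardless of how high the compute peak is.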

Hash function: Is there a way to optimize my code further?

Above is the hash function.
I wrote the code below. I am not sure if I can use another clever way to make this more efficient. I am using the understanding that I do not need to do the mod at all since unsigned int takes care of that through overflow.
int myHash(string s)
{
unsigned int hash = 0;
long long int multiplier = 1;
for(int i = s.size()-1;i>-1;i--)
{
hash += (multiplier * s[i]);
multiplier *= 31;
}
return hash;
}
I would avoid using long long for multiplier. At least if you don't know 100% that your processor does 64-bit multiplies in the same amount of time as a 32-bit multiply. Really modern top of the range processors probably do, older & smaller processors almost certainly take longer to do 64-bit mul operations than 32-bit ones.
Multiplying by 31 can actually be quite fast even on processors that aren't good at multiplying, because x *= 31 can be converted to x = x * 32 - x; or x = (x << 5) - x; - in fact it may be worth trying that [if you haven't compiled the code to assembler and seen that the compiler already does that].
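A minimal sketch of that identity on 32-bit unsigned values, where both forms wrap around modulo 2^32 in the same way (the function names are made up):

```cpp
#include <cstdint>

// Plain multiply.
constexpr std::uint32_t times31_mul(std::uint32_t x) {
    return x * 31u;
}

// Shift-and-subtract form: x * 32 - x.
constexpr std::uint32_t times31_shift(std::uint32_t x) {
    return (x << 5) - x;
}
```

Both return the same value for every input, so which one is faster is purely a question of what your compiler and CPU do with them; check the generated assembly before hand-writing the shift.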
Beyond that, it would be processor or compiler-specific optimisations that come to mind. Loop unrolling for example. Or using inline assembler or intrinsics to make use of vector instructions (subject to availability for different processor architectures and different generations). Modern compilers like recent versions of gcc or clang will probably vectorize this code, subject to being given the "right" options.
As with all optimisation projects, measure the time, using a representative workload, keep records of what you changed. Look at the generated code, try to figure out if there's a better way to do it. And don't lose track of the fact that it's the OVERALL program's performance that matters. If you spend 80% of the time in this function, by all means, optimize the heck out of it. If you spend 20% of the time, optimize it a bit, if you spend 2% of the time in it, unless there's OBVIOUS things you can do to improve it, it's not going to give you much. I've seen the results of people writing code to save a few clock-cycles in some code that takes several million cycles in the loop two lines further on. And using bit-fiddling tricks to save 2 bytes in something that takes half a megabyte. It just creates mess, not really worth doing.
I guess you could make the argument not have to copy the string for the function call: make s a const string &s instead, or use std::string_view if you happen to be using C++17. Otherwise it looks fast to the point where you should leave the rest to the compiler. Try making it optimize with -O2 or your compiler's equivalent.
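For what it's worth, a sketch of the pass-by-reference version. Note that it swaps in Horner's rule (walking the string front to back) instead of the question's back-to-front loop with a separate multiplier; it evaluates the same polynomial modulo 2^32. The name hornerHash is made up. Keeping everything in unsigned int also avoids overflowing a signed long long multiplier on longer strings.

```cpp
#include <string>

// Takes the string by const reference, so nothing is copied per call.
// hash = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], wrapping as unsigned int.
unsigned int hornerHash(const std::string& s) {
    unsigned int hash = 0;
    for (char ch : s) {
        hash = hash * 31u + static_cast<unsigned int>(ch);
    }
    return hash;
}
```

On an ASCII system hornerHash("abc") is 97*961 + 98*31 + 99 = 96354.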
Let me preface this by saying it's probably not worth doing -- it's unlikely that your hash function is going to be the bottleneck in your program, so making the hash function more elaborate in an attempt to make it more efficient will probably just make it harder to understand and maintain while not making your program measurably faster. So don't do this unless you've actually determined that your program spends a significant percentage of its time computing string hashes, and make sure you have a good benchmark routine that you can run "before" and "after" this change to verify that it actually did speed things up significantly, otherwise you might just be chasing rainbows.
That said, one potential way to hash long strings more quickly would be to process the string a word at a time rather than a character at a time, something like this:
#include <cstring> // memcpy
unsigned int aSlightlyFasterHash(const string & s)
{
const size_t numWordsInString = s.size()/sizeof(unsigned int);
const size_t numExtraBytesInString = s.size()%sizeof(unsigned int);
// Compute the bulk of the hash by reading the string a word at a time.
// memcpy avoids the alignment and strict-aliasing problems of casting to unsigned int *.
unsigned int hash = 0;
const char * ptr = s.data();
for (size_t i=0; i<numWordsInString; i++)
{
unsigned int word;
memcpy(&word, ptr, sizeof(word));
hash += word;
ptr += sizeof(word);
}
// Then any "leftover" bytes at the end we will mix in to the hash the old way
const unsigned char * cptr = reinterpret_cast<const unsigned char *>(ptr);
unsigned int multiplier = 1;
for(size_t i=0; i<numExtraBytesInString; i++)
{
hash += (multiplier * *cptr);
cptr++;
multiplier *= 31;
}
return hash;
}
Note that the above function will return different hash values than the hash function you provided.
That cuts down on the number of loop iterations by a factor of four (assuming a 4-byte unsigned int); of course it's likely that the execution of the function is limited by RAM bandwidth rather than CPU cycles anyway, so don't be too surprised if this doesn't go noticeably faster on a modern CPU. If RAM bandwidth is indeed the bottleneck, then there's not too much you can do about it, since you have to read the contents of the string in order to compute a hash code for the string; there's no getting around that (except perhaps by precomputing the hash code in advance and storing it somewhere, but that only works if you know all the strings you are going to use in advance).

Strange size of class containing Eigen vectors

A C++ class containing two Eigen vectors has a strange size. I have a MWE of my problem here:
#include <iostream>
#include "Eigen/Core"
class test0 {
Eigen::Matrix<double,4,1> R;
Eigen::Matrix<double,4,1> T;
};
class test1 {
Eigen::Matrix<double,4,1> R;
Eigen::Matrix<double,3,1> T;
};
class test2 {
Eigen::Matrix<double,4,1> R;
Eigen::Matrix<double,2,1> T;
};
class test3 {
Eigen::Matrix<double,7,1> T;
};
class test4 {
Eigen::Matrix<double,3,1> T;
};
int main(int argc, char *argv[])
{
std::cout << sizeof(test0) << ", " << sizeof(test1) << ", " << sizeof(test2) << ", " << sizeof(test3) << ", " << sizeof(test4) << std::endl;
return 0;
}
The output I get on my system (MacBook Pro, Xcode Clang++ compiler) is:
64, 64, 48, 56, 24
The class "test1" has some bizarre extra padding - I would have expected it to have size 56. I don't understand the reason for it, especially given that none of the other classes have any padding. Can anyone explain, or is this an error?
This happens because of how the Eigen library is implemented, and it is not related to compiler tricks. The backing storage for Eigen::Matrix<double, 4, 1> has the EIGEN_ALIGN_TO_BOUNDARY(16) tag on it, which has compiler-specific definitions that ask the type to be aligned on a 16-byte boundary. To ensure this, the compiler has to add 8 bytes of padding at the end of the structure, since otherwise the first matrix field would not be aligned on a 16-byte boundary if you had an array of test1.
Eigen simply does not try to impose similar requirements to the backing storage of Eigen::Matrix<double, 7, 1>.
This happens in Eigen/src/Core/DenseStorage.
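The effect can be reproduced without Eigen. In the sketch below, Vec4, Vec3 and Vec7 are hypothetical stand-ins, not Eigen's real definitions: only the 4-element one is given 16-byte alignment, mirroring the description above. It assumes an 8-byte double.

```cpp
// Hypothetical stand-ins for the fixed-size vector storage.
struct Vec4 { alignas(16) double d[4]; };  // 32 bytes, 16-byte aligned
struct Vec3 { double d[3]; };              // 24 bytes, 8-byte aligned
struct Vec7 { double d[7]; };              // 56 bytes, 8-byte aligned

struct Test1 { Vec4 R; Vec3 T; };  // 56 bytes of data
struct Test3 { Vec7 T; };
struct Test4 { Vec3 T; };

// A struct inherits the strictest alignment of its members...
static_assert(alignof(Test1) == 16, "Vec4 forces 16-byte alignment on Test1");
// ...and its size must be a multiple of that alignment, so 56 is rounded up to 64.
static_assert(sizeof(Test1) == 64, "8 bytes of tail padding");
// Without an over-aligned member there is nothing to round up to.
static_assert(sizeof(Test3) == 56, "no tail padding");
static_assert(sizeof(Test4) == 24, "no tail padding");
```

The tail padding exists so that in an array of Test1 every element's R member still lands on a 16-byte boundary.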
Padding requirements aren't mandated by the language, they're actually mandated by your processor architecture. Your class is being padded so it's 64 bytes wide. You can override this of course, but it's done so that structures sit neatly in memory and can be read efficiently, aligning to cache lines.
In what circumstances a structure is padded is a complex question, but generally speaking "memory is cheap, cycles are not". Modern computers have loads of memory, and performance gains are becoming harder to find as we approach the limits of copper, so trading some memory for performance is usually a good idea.
Some additional reading is available here.
Following up on the discussion in the comments, it's worth noting that the compiler isn't your god. Not every optimisation is a good idea, and even trivial changes to your code can have vast implications for some optimisations. If you don't like what your toolchain is producing and think you can do better, then do it! Take some benchmarks, make your changes and then measure again. As you do all of that, take note of how long you spend on it and then ask yourself: was that a good use of your time, or your employer's? :)
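For the "benchmark, change, measure again" loop, a minimal timing helper could look like the sketch below. `average_ms` is a hypothetical name; it uses `std::chrono::steady_clock` and reports the average wall-clock time per call, which is only a rough measure and no substitute for a proper profiler.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `reps` times and returns the average
// wall-clock time per run in milliseconds.
template <typename F>
double average_ms(F&& work, int reps) {
    assert(reps > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < reps; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / reps;
}
```

Whatever is passed as `work` should produce a result that is actually used (for example, accumulated into a variable that is printed afterwards), otherwise the optimizer may remove the very code being measured.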

Cache Optimization Theory

I am thinking about heavy memory cache optimization and would like some feedback.
Consider this example:
class example
{
float phase1;
float phaseInc;
float factor;
public:
void process(float* buffer,unsigned int iSamples)//<-high prio audio thread
{
for(unsigned int i = 0; i < iSamples; i++)// mostly iSamples is 32
{
phase1 += phaseInc;
float f1 = sinf(phase1);//<-sinf is just an example!
buffer[i] = f1*factor;
}
}
};
optimization idea:
void example::process(float* buffer,unsigned int iSamples)
{
float stackMemory[3];// should fit in L1
memcpy(stackMemory,&phase1,sizeof(float)*3);// get all memory at once
for(unsigned int i = 0; i < iSamples; i++)
{
stackMemory[0] += stackMemory[1];
float f1 = sinf(stackMemory[0]);
buffer[i] = f1*stackMemory[2];
}
memcpy(&phase1,stackMemory,sizeof(float)*1);// write back only changed memory
}
Note that the real sample loop will contain thousands of operations.
So the stackMemory can become quite big.
I think it will be no more than 32 KB (are there any smaller L1s out there?).
Does the order of the variables in this stack memory matter?
I hope not, because I'd like to order them so that I can reduce the write-back size.
Or does the L1 cache have the same cache-line behaviour that RAM has?
I have the feeling that I am somehow doing what prefetch is made for, but everything I have read about prefetch is relatively vague about how to use it efficiently. Trial and error is not an option with 5000+ lines of code.
Code will run on Win, Mac and iOS.
Any ARM<->Intel issues to expect?
Is it possible that this kind of optimization is useless, since all the memory is accessed and transferred to L1 on the first iteration of the loop anyway?
Thanks for any hints and ideas.
At first I thought there was a good chance that the second one could be slower as a result of additional memory access and instructions required for memcpy, while the first could simply work directly with these three class members already loaded into registers.
Nevertheless, I tried fiddling with the code in GCC 5.2 with both -O2 and -O3 and found that, no matter what I tried, I got identical assembly instructions for both. This is pretty amazing considering all the extra conceptual work that memcpy typically has to do that apparently got squashed away to zilch.
The one case I can think of where your second version might be faster, in some scenario on some compiler, is if the aliasing involved in accessing this->data_member interfered with an optimization and caused redundant loads and stores to/from registers (the compiler cannot always prove that writes through buffer leave the members untouched).
It would have nothing to do with the L1 cache in that case and everything to do with register allocation on the compiler side. Caches are largely irrelevant here, because both versions touch the same small, contiguous chunk of memory (the member variables) either way; any difference comes down entirely to registers. Nevertheless, I couldn't find a single scenario where the compiler did a worse job with one version than with the other: every case I tested yielded identical results. In a sufficiently complex real-world case, perhaps there might be a difference.
Then again, in such a case, it should be on the safer side to simply do:
void process(float* buffer,unsigned int iSamples)
{
const float pi = phaseInc;// phase increment (not the constant pi)
float p1 = phase1;// running phase, kept in a local
const float fact = factor;
for(unsigned int i = 0; i < iSamples; i++)
{
p1 += pi;
float f1 = sinf(p1);
buffer[i] = f1*fact;
}
phase1 = p1;// write the updated phase back once
}
There's no need to jump through hoops with memcpy to store the values into an array and back. That puts additional strain on the optimizer even if, in my findings, the optimizer managed to eliminate the overhead typically associated with it.
I realize your example is simplified, but there should not be a need to reduce the structure down to such a primitive array no matter how many data members you're dealing with (unless such an array actually is the most convenient representation). From a performance standpoint, a compiler will have an "easier" time (even if optimizers today are pretty amazing and can handle this) optimizing if you just use local variables instead of an array to which you memcpy aggregate data members in and out.
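To make the aliasing scenario mentioned earlier concrete, here is a small sketch. It uses free functions with a reference parameter as a stand-in for member access through `this`; all names are hypothetical. When the value read through the reference overlaps the output buffer, the two versions give different results, which is exactly why the compiler may have to re-load such a value on every iteration unless it has been copied into a local first.

```cpp
#include <cassert>

// Reads `factor` through a reference: a store to out[i] might change it,
// so the compiler has to assume it must be re-loaded on each iteration.
inline void scale_by_ref(float* out, const float& factor, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        out[i] = static_cast<float>(i + 2) * factor;
    }
}

// Copies `factor` into a local first: the local cannot alias out[],
// so it can stay in a register for the whole loop.
inline void scale_by_local(float* out, const float& factor, unsigned n) {
    const float f = factor;
    for (unsigned i = 0; i < n; ++i) {
        out[i] = static_cast<float>(i + 2) * f;
    }
}

inline void aliasing_demo() {
    float a[3] = {2.0f, 0.0f, 0.0f};
    scale_by_ref(a, a[0], 3);    // factor aliases a[0]
    // i=0: a[0] = 2*2 = 4 (factor is now 4); i=1: a[1] = 3*4 = 12; i=2: a[2] = 4*4 = 16
    assert(a[0] == 4.0f && a[1] == 12.0f && a[2] == 16.0f);

    float b[3] = {2.0f, 0.0f, 0.0f};
    scale_by_local(b, b[0], 3);  // factor was copied before the first store
    assert(b[0] == 4.0f && b[1] == 6.0f && b[2] == 8.0f);
}
```

In normal use the buffer never overlaps the object, so both forms compute the same thing; the local copies simply tell the compiler so.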