OK, here's my situation :
I have a function - let's say U64 calc (U64 x) - which takes a 64-bit integer parameter, performs some CPU-intensive operation, and returns a 64-bit value
Now, given that I know ALL possible inputs (the xs) of that function beforehand (there are some 16000 though), I thought it might be better to pre-calculate them and then fetch them on demand (from an array-like structure).
The ideal situation would be to store them all in some array U64 CALC[] and retrieve them by index (the x again)
And here's the issue: I may know what the possible inputs for my calc function are, but they are most definitely NOT consecutive (e.g. not from 1 to 16000, but values that may go as low as 0 and as high as some trillions, always within a 64-bit range)
E.G.
X CALC[X]
-----------------------
123123 123123123
12312 12312312
897523 986123
etc.
And here comes my question :
How would you store them?
What workaround would you prefer?
Now, given that these values (from CALC) will have to be accessed some thousands-to-millions of times, per sec, which would be the best solution performance-wise?
Note: I'm not mentioning anything I've thought of or tried, so as not to turn the answers into some A-vs-B type of debate, and mostly so as not to influence anyone...
I would use some sort of hash function that creates an index to a u64 pair where one is the value the key was created from and the other the replacement value. Technically the index could be three bytes long (assuming 16 million -"16000 thousand" - pairs) if you need to conserve space but I'd use u32s. If the stored value does not match the value computed on (hash collision) you'd enter an overflow handler.
You need to determine a custom hashing algorithm to fit your data
Since you know the size of the data you don't need algorithms that allow the data to grow.
I'd be wary of using some standard algorithm because they seldom fit specific data
I'd be wary of using a C++ method unless you are sure the code is WYSIWYG (doesn't generate a lot of invisible calls)
Your index should be 25% larger than the number of pairs
Run through all possible inputs and determine min, max, average and standard deviation for the number of collisions and use these to determine the acceptable performance level. Then profile to achieve the best possible code.
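To make the measurement step concrete, here is a minimal sketch of how such collision statistics could be gathered for one candidate table size. It assumes a plain `key % slots` hash, which is only a placeholder since the right hash depends on the data; the names `CollisionStats` and `collision_stats` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: collision statistics for one table size.
struct CollisionStats {
    std::size_t min = 0;   // fewest extra keys in any occupied slot
    std::size_t max = 0;   // most extra keys in any occupied slot
    double mean = 0.0;     // average extra keys per occupied slot
    double stddev = 0.0;   // population standard deviation of the above
};

// Hash every key with key % slots (an assumed, deliberately simple hash)
// and count how many extra keys land in each occupied slot.
CollisionStats collision_stats(const std::vector<std::uint64_t>& keys,
                               std::size_t slots) {
    assert(slots > 0 && !keys.empty());
    std::vector<std::size_t> load(slots, 0);
    for (std::uint64_t k : keys) ++load[k % slots];

    CollisionStats s;
    s.min = keys.size();
    std::size_t occupied = 0;
    double sum = 0.0, sum_sq = 0.0;
    for (std::size_t n : load) {
        if (n == 0) continue;              // empty slots do not count
        const std::size_t c = n - 1;       // collisions in this slot
        s.min = std::min(s.min, c);
        s.max = std::max(s.max, c);
        sum += static_cast<double>(c);
        sum_sq += static_cast<double>(c) * static_cast<double>(c);
        ++occupied;
    }
    s.mean = sum / static_cast<double>(occupied);
    const double variance = sum_sq / static_cast<double>(occupied) - s.mean * s.mean;
    s.stddev = std::sqrt(std::max(0.0, variance));  // clamp rounding noise
    return s;
}
```

Running this for several candidate sizes and comparing `mean` and `max` gives a first impression before any profiling is done.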
The required memory space (using a u32 index) comes out to (4*1.25)+8+8 = 21 bytes per pair, or 336 MB for 16 million pairs; no problem on a typical PC.
________ EDIT________
I have been challenged by "RocketRoy" to put my money where my mouth is. Here goes:
The problem has to do with collision handling in a (fixed size) hash index. To set the stage:
I have a list of n entries where a field in the entry contains the value v that the hash is computed from
I have a vector of n*1.25 (approximately) indices such that the number of indices x is a prime number
A prime number y is computed which is a fraction of x
The vector is initialized to say -1 to denote unoccupied
Pretty standard stuff I think you'll agree.
The entries in the list are processed and the hash value h computed and modulo'd and used as an index into the vector and the index to the entry is placed there.
Eventually I encounter the situation where the vector entry pointed to by the index is occupied (doesn't contain -1) - voilà, a collision.
So what do I do? I keep the original h as ho, add y to h and take modulo x and get a new index into the vector. If the entry is unoccupied I use it, otherwise I continue with add y modulo x until I reach ho. In theory, this will happen because both x and y are prime numbers. In practice x is larger than n so it won't.
So the "re-hash" that RocketRoy claims is very costly is no such thing.
The tricky part with this method - as with all hashing methods - is to:
Determine a suitable value for x (becomes less tricky the larger x finally used)
Determine the algorithm a for h=a(v)%x such that the h values index reasonably evenly ("randomly") into the index vector with as few collisions as possible
Determine a suitable value for y such that collisions index reasonably evenly ("randomly") into the index vector
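The scheme described above (index vector of x slots initialised to -1, probe step y, "add y modulo x" on collision) can be sketched in a few lines. This is an illustration only, not the measured program: the hash is assumed to be a plain `key % x`, and the names `Entry`, `FixedHashIndex` and `kEmpty` are invented for the sketch. Duplicate keys are not detected here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Entry { std::uint64_t key, value; };

constexpr std::int32_t kEmpty = -1;   // marks an unoccupied slot

class FixedHashIndex {
public:
    // x = number of slots (prime, larger than the entry count),
    // y = probe step (prime, 0 < y < x).
    FixedHashIndex(std::vector<Entry> entries, std::size_t x, std::size_t y)
        : entries_(std::move(entries)), slots_(x, kEmpty), step_(y) {
        assert(x > entries_.size() && y > 0 && y < x);
        for (std::size_t i = 0; i < entries_.size(); ++i) insert(i);
    }

    // Returns true and sets out if key is present.
    bool find(std::uint64_t key, std::uint64_t& out) const {
        const std::size_t home = key % slots_.size();
        std::size_t h = home;
        do {
            const std::int32_t i = slots_[h];
            if (i == kEmpty) return false;            // unused slot: not stored
            if (entries_[i].key == key) { out = entries_[i].value; return true; }
            h = (h + step_) % slots_.size();          // collision: add y modulo x
        } while (h != home);
        return false;                                 // wrapped around
    }

private:
    // An empty slot always exists because x > n, and with x prime the
    // probe sequence visits every slot before returning to its start.
    void insert(std::size_t i) {
        std::size_t h = entries_[i].key % slots_.size();
        while (slots_[h] != kEmpty) h = (h + step_) % slots_.size();
        slots_[h] = static_cast<std::int32_t>(i);
    }

    std::vector<Entry> entries_;
    std::vector<std::int32_t> slots_;
    std::size_t step_;
};
```

A lookup is then `std::uint64_t v; if (idx.find(x, v)) { /* use v */ }`.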
________ EDIT________
I'm sorry I've taken so long to produce this code ... other things have had higher priorities.
Anyway, here is the code which proves that hashing has better prospects for quick lookups than a binary tree. It runs through a bunch of hashing index sizes and algorithms to aid in finding the most suitable combo for the specific data. For every algorithm the code will print the first index size such that no lookup takes longer than fourteen searches (worst case for binary searching) and an average lookup takes less than 1.5 searches.
I have a fondness for prime numbers in these types of applications, in case you haven't noticed.
There are many ways of creating a hashing algorithm with its mandatory overflow handling. I opted for simplicity, assuming it will translate into speed ... and it does. On my laptop with an i5 M 480 @ 2.67 GHz an average lookup requires between 55 and 60 clock cycles (comes out to around 45 million lookups per second). I implemented a special get operation with a constant number of indices and ditto rehash value, and the cycle count dropped to 40 (65 million lookups per second). If you look at the line calling getOpSpec, the index i is xor'ed with 0x444 to exercise the caches to achieve more "real world"-like results.
I must again point out that the program suggests suitable combinations for the specific data. Other data may require a different combo.
The source code contains both the code for generating the 16000 unsigned long long pairs and for testing different constants (index sizes and rehash values):
#include <windows.h>
#define i8 signed char
#define i16 short
#define i32 long
#define i64 long long
#define id i64
#define u8 unsigned char
#define u16 unsigned short
#define u32 unsigned long
#define u64 unsigned long long
#define ud u64
#include <string.h>
#include <stdio.h>
u64 prime_find_next (const u64 value);
u64 prime_find_previous (const u64 value);
static inline volatile unsigned long long rdtsc_to_rax (void)
{
unsigned long long lower,upper;
asm volatile( "rdtsc\n"
: "=a"(lower), "=d"(upper));
return lower|(upper<<32);
}
static u16 index[65536];
static u64 nindeces,rehshFactor;
static struct PAIRS {u64 oval,rval;} pairs[16000] = {
#include "pairs.h"
};
struct HASH_STATS
{
u64 ninvocs,nrhshs,nworst;
} getOpStats,putOpStats;
i8 putOp (u16 index[], const struct PAIRS data[], const u64 oval, const u64 ci)
{
u64 nworst=1,ho,h,i;
i8 success=1;
++putOpStats.ninvocs;
ho=oval%nindeces;
h=ho;
do
{
i=index[h];
if (i==0xffff) /* unused position */
{
index[h]=(u16)ci;
goto added;
}
if (oval==data[i].oval) goto duplicate;
++putOpStats.nrhshs;
++nworst;
h+=rehshFactor;
if (h>=nindeces) h-=nindeces;
} while (h!=ho);
exhausted: /* should not happen */
duplicate:
success=0;
added:
if (nworst>putOpStats.nworst) putOpStats.nworst=nworst;
return success;
}
i8 getOp (u16 index[], const struct PAIRS data[], const u64 oval, u64 *rval)
{
u64 ho,h,i;
i8 success=1;
ho=oval%nindeces;
h=ho;
do
{
i=index[h];
if (i==0xffffu) goto not_found; /* unused position */
if (oval==data[i].oval)
{
*rval=data[i].rval; /* fetch the replacement value */
goto found;
}
h+=rehshFactor;
if (h>=nindeces) h-=nindeces;
} while (h!=ho);
exhausted:
not_found: /* should not happen */
success=0;
found:
return success;
}
volatile i8 stop = 0;
int main (int argc, char *argv[])
{
u64 i,rval,mulup,divdown,start;
double ave;
SetThreadAffinityMask (GetCurrentThread(), 0x00000004ull);
divdown=5; //5
while (divdown<=100)
{
mulup=3; // 3
while (mulup<divdown)
{
nindeces=16000;
while (nindeces<65500)
{
nindeces= prime_find_next (nindeces);
rehshFactor=nindeces*mulup/divdown;
rehshFactor=prime_find_previous (rehshFactor);
memset (index, 0xff, sizeof(index));
memset (&putOpStats, 0, sizeof(struct HASH_STATS));
i=0;
while (i<16000)
{
if (!putOp (index, pairs, pairs[i].oval, (u16) i)) stop=1;
++i;
}
ave=(double)(putOpStats.ninvocs+putOpStats.nrhshs)/(double)putOpStats.ninvocs;
if (ave<1.5 && putOpStats.nworst<15)
{
start=rdtsc_to_rax ();
i=0;
while (i<16000)
{
if (!getOp (index, pairs, pairs[(i^0x0444)%16000].oval, &rval)) stop=1; /* %16000: i^0x0444 can reach 16383, past the end of pairs[] */
++i;
}
start=rdtsc_to_rax ()-start+8000; /* 8000 is half of 16000 (pairs), for rounding */
printf ("%u;%u;%u;%u;%1.3f;%u;%u\n", (u32)mulup, (u32)divdown, (u32)nindeces, (u32)rehshFactor, ave, (u32) putOpStats.nworst, (u32) (start/16000ull));
goto found;
}
nindeces+=2;
}
printf ("%u;%u\n", (u32)mulup, (u32)divdown);
found:
mulup=prime_find_next (mulup);
}
divdown=prime_find_next (divdown);
}
SetThreadAffinityMask (GetCurrentThread(), 0x0000000fu);
return 0;
}
It was not possible to include the generated pairs file (an answer is apparently limited to 30000 characters). But send a message to my inbox and I'll mail it.
And these are the results:
3;5;35569;21323;1.390;14;73
3;7;33577;14389;1.435;14;60
5;7;32069;22901;1.474;14;61
3;11;35107;9551;1.412;14;59
5;11;33967;15427;1.446;14;61
7;11;34583;22003;1.422;14;59
3;13;34253;7901;1.439;14;61
5;13;34039;13063;1.443;14;60
7;13;32801;17659;1.456;14;60
11;13;33791;28591;1.436;14;59
3;17;34337;6053;1.413;14;59
5;17;32341;9511;1.470;14;61
7;17;32507;13381;1.474;14;62
11;17;33301;21529;1.454;14;60
13;17;34981;26737;1.403;13;59
3;19;33791;5333;1.437;14;60
5;19;35149;9241;1.403;14;59
7;19;33377;12289;1.439;14;97
11;19;34337;19867;1.417;14;59
13;19;34403;23537;1.430;14;61
17;19;33923;30347;1.467;14;61
3;23;33857;4409;1.425;14;60
5;23;34729;7547;1.429;14;60
7;23;32801;9973;1.456;14;61
11;23;33911;16127;1.445;14;60
13;23;33637;19009;1.435;13;60
17;23;34439;25453;1.426;13;60
19;23;33329;27529;1.468;14;62
3;29;32939;3391;1.474;14;62
5;29;34543;5953;1.437;13;60
7;29;34259;8263;1.414;13;59
11;29;34367;13033;1.409;14;60
13;29;33049;14813;1.444;14;60
17;29;34511;20219;1.422;14;60
19;29;33893;22193;1.445;13;61
23;29;34693;27509;1.412;13;92
3;31;34019;3271;1.441;14;60
5;31;33923;5449;1.460;14;61
7;31;33049;7459;1.442;14;60
11;31;35897;12721;1.389;14;59
13;31;35393;14831;1.397;14;59
17;31;33773;18517;1.425;14;60
19;31;33997;20809;1.442;14;60
23;31;34841;25847;1.417;14;59
29;31;33857;31667;1.426;14;60
3;37;32569;2633;1.476;14;61
5;37;34729;4691;1.419;14;59
7;37;34141;6451;1.439;14;60
11;37;34549;10267;1.410;13;60
13;37;35117;12329;1.423;14;60
17;37;34631;15907;1.429;14;63
19;37;34253;17581;1.435;14;60
23;37;32909;20443;1.453;14;61
29;37;33403;26177;1.445;14;60
31;37;34361;28771;1.413;14;59
3;41;34297;2503;1.424;14;60
5;41;33587;4093;1.430;14;60
7;41;34583;5903;1.404;13;59
11;41;32687;8761;1.440;14;60
13;41;34457;10909;1.439;14;60
17;41;34337;14221;1.425;14;59
19;41;32843;15217;1.476;14;62
23;41;35339;19819;1.423;14;59
29;41;34273;24239;1.436;14;60
31;41;34703;26237;1.414;14;60
37;41;33343;30089;1.456;14;61
3;43;34807;2423;1.417;14;59
5;43;35527;4129;1.413;14;60
7;43;33287;5417;1.467;14;61
11;43;33863;8647;1.436;14;60
13;43;34499;10427;1.418;14;78
17;43;34549;13649;1.431;14;60
19;43;33749;14897;1.429;13;60
23;43;34361;18371;1.409;14;59
29;43;33149;22349;1.452;14;61
31;43;34457;24821;1.428;14;60
37;43;32377;27851;1.482;14;81
41;43;33623;32057;1.424;13;59
3;47;33757;2153;1.459;14;61
5;47;33353;3547;1.445;14;61
7;47;34687;5153;1.414;13;59
11;47;34519;8069;1.417;14;60
13;47;34549;9551;1.412;13;59
17;47;33613;12149;1.461;14;61
19;47;33863;13687;1.443;14;60
23;47;35393;17317;1.402;14;59
29;47;34747;21433;1.432;13;60
31;47;34871;22993;1.409;14;59
37;47;34729;27337;1.425;14;59
41;47;33773;29453;1.438;14;60
43;47;31253;28591;1.487;14;62
3;53;33623;1901;1.430;14;59
5;53;34469;3229;1.430;13;60
7;53;34883;4603;1.408;14;59
11;53;34511;7159;1.412;13;59
13;53;32587;7963;1.453;14;60
17;53;34297;10993;1.432;13;80
19;53;33599;12043;1.443;14;64
23;53;34337;14897;1.415;14;59
29;53;34877;19081;1.424;14;61
31;53;34913;20411;1.406;13;59
37;53;34429;24029;1.417;13;60
41;53;34499;26683;1.418;14;59
43;53;32261;26171;1.488;14;62
47;53;34253;30367;1.437;14;79
3;59;33503;1699;1.432;14;61
5;59;34781;2939;1.424;14;60
7;59;35531;4211;1.403;14;59
11;59;34487;6427;1.420;14;59
13;59;33563;7393;1.453;14;61
17;59;34019;9791;1.440;14;60
19;59;33967;10937;1.447;14;60
23;59;33637;13109;1.438;14;60
29;59;34487;16943;1.424;14;59
31;59;32687;17167;1.480;14;61
37;59;35353;22159;1.404;14;59
41;59;34499;23971;1.431;14;60
43;59;34039;24799;1.445;14;60
47;59;32027;25471;1.499;14;62
53;59;34019;30557;1.449;14;61
3;61;35059;1723;1.418;14;60
5;61;34351;2803;1.416;13;60
7;61;35099;4021;1.412;14;59
11;61;34019;6133;1.442;14;60
13;61;35023;7459;1.406;14;88
17;61;35201;9803;1.414;14;61
19;61;34679;10799;1.425;14;101
23;61;34039;12829;1.441;13;60
29;61;33871;16097;1.446;14;60
31;61;34147;17351;1.427;14;61
37;61;34583;20963;1.412;14;59
41;61;32999;22171;1.452;14;62
43;61;33857;23857;1.431;14;98
47;61;34897;26881;1.431;14;60
53;61;33647;29231;1.434;14;60
59;61;32999;31907;1.454;14;60
3;67;32999;1471;1.455;14;61
5;67;35171;2621;1.403;14;59
7;67;33851;3533;1.463;14;61
11;67;34607;5669;1.437;14;60
13;67;35081;6803;1.416;14;61
17;67;33941;8609;1.417;14;60
19;67;34673;9829;1.427;14;60
23;67;35099;12043;1.415;14;60
29;67;33679;14563;1.452;14;61
31;67;34283;15859;1.437;14;60
37;67;32917;18169;1.460;13;61
41;67;33461;20443;1.441;14;61
43;67;34313;22013;1.426;14;60
47;67;33347;23371;1.452;14;61
53;67;33773;26713;1.434;14;60
59;67;35911;31607;1.395;14;58
61;67;34157;31091;1.431;14;63
3;71;34483;1453;1.423;14;59
5;71;34537;2423;1.428;14;59
7;71;33637;3313;1.428;13;60
11;71;32507;5023;1.465;14;79
13;71;35753;6529;1.403;14;59
17;71;33347;7963;1.444;14;61
19;71;35141;9397;1.410;14;59
23;71;32621;10559;1.475;14;61
29;71;33637;13729;1.429;14;60
31;71;33599;14657;1.443;14;60
37;71;34361;17903;1.396;14;59
41;71;33757;19489;1.435;14;61
43;71;34583;20939;1.413;14;59
47;71;34589;22877;1.441;14;60
53;71;35353;26387;1.418;14;59
59;71;35323;29347;1.406;14;59
61;71;35597;30577;1.401;14;59
67;71;34537;32587;1.425;14;59
3;73;34613;1409;1.418;14;59
5;73;32969;2251;1.453;14;62
7;73;33049;3167;1.448;14;61
11;73;33863;5101;1.435;14;60
13;73;34439;6131;1.456;14;60
17;73;33629;7829;1.455;14;61
19;73;34739;9029;1.421;14;60
23;73;33071;10399;1.469;14;61
29;73;33359;13249;1.460;14;61
31;73;33767;14327;1.422;14;59
37;73;32939;16693;1.490;14;62
41;73;33739;18947;1.438;14;60
43;73;33937;19979;1.432;14;61
47;73;33767;21739;1.422;14;59
53;73;33359;24203;1.435;14;60
59;73;34361;27767;1.401;13;59
61;73;33827;28229;1.443;14;60
67;73;34421;31583;1.423;14;71
71;73;33053;32143;1.447;14;60
3;79;35027;1327;1.410;14;60
5;79;34283;2161;1.432;14;60
7;79;34439;3049;1.432;14;60
11;79;34679;4817;1.416;14;59
13;79;34667;5701;1.405;14;59
17;79;33637;7237;1.428;14;60
19;79;34469;8287;1.417;14;60
23;79;34439;10009;1.433;14;60
29;79;33427;12269;1.448;13;61
31;79;33893;13297;1.445;14;61
37;79;33863;15823;1.439;14;60
41;79;32983;17107;1.450;14;60
43;79;34613;18803;1.431;14;60
47;79;33457;19891;1.457;14;61
53;79;33961;22777;1.435;14;61
59;79;32983;24631;1.465;14;60
61;79;34337;26501;1.428;14;60
67;79;33547;28447;1.458;14;61
71;79;32653;29339;1.473;14;61
73;79;34679;32029;1.429;14;64
3;83;35407;1277;1.405;14;59
5;83;32797;1973;1.451;14;60
7;83;33049;2777;1.443;14;61
11;83;33889;4483;1.431;14;60
13;83;35159;5503;1.409;14;59
17;83;34949;7151;1.412;14;59
19;83;32957;7541;1.467;14;61
23;83;32569;9013;1.470;14;61
29;83;33287;11621;1.474;14;61
31;83;33911;12659;1.448;13;60
37;83;33487;14923;1.456;14;62
41;83;33587;16573;1.438;13;60
43;83;34019;17623;1.435;14;60
47;83;31769;17987;1.483;14;62
53;83;33049;21101;1.451;14;61
59;83;32369;23003;1.465;14;61
61;83;32653;23993;1.469;14;61
67;83;33599;27109;1.437;14;61
71;83;33713;28837;1.452;14;61
73;83;33703;29641;1.454;14;61
79;83;34583;32911;1.417;14;59
3;89;34147;1129;1.415;13;60
5;89;32797;1831;1.461;14;61
7;89;33679;2647;1.443;14;73
11;89;34543;4261;1.427;13;60
13;89;34603;5051;1.419;14;60
17;89;34061;6491;1.444;14;60
19;89;34457;7351;1.422;14;79
23;89;33529;8663;1.450;14;61
29;89;34283;11161;1.431;14;60
31;89;35027;12197;1.411;13;59
37;89;34259;14221;1.403;14;59
41;89;33997;15649;1.434;14;60
43;89;33911;16127;1.445;14;60
47;89;34949;18451;1.419;14;59
53;89;34367;20443;1.434;14;60
59;89;33791;22397;1.430;14;59
61;89;34961;23957;1.404;14;59
67;89;33863;25471;1.433;13;60
71;89;35149;28031;1.414;14;79
73;89;33113;27143;1.447;14;60
79;89;32909;29209;1.458;14;61
83;89;33617;31337;1.400;14;59
3;97;34211;1051;1.448;14;60
5;97;34807;1789;1.430;14;60
7;97;33547;2417;1.446;14;60
11;97;35171;3967;1.407;14;89
13;97;32479;4349;1.474;14;61
17;97;34319;6011;1.444;14;60
19;97;32381;6337;1.491;14;64
23;97;33617;7963;1.421;14;59
29;97;33767;10093;1.423;14;59
31;97;33641;10739;1.447;14;60
37;97;34589;13187;1.425;13;60
41;97;34171;14437;1.451;14;60
43;97;31973;14159;1.484;14;62
47;97;33911;16127;1.445;14;61
53;97;34031;18593;1.448;14;80
59;97;32579;19813;1.457;14;61
61;97;34421;21617;1.417;13;60
67;97;33739;23297;1.448;14;60
71;97;33739;24691;1.435;14;60
73;97;33863;25471;1.433;13;60
79;97;34381;27997;1.419;14;59
83;97;33967;29063;1.446;14;60
89;97;33521;30727;1.441;14;60
Cols 1 and 2 are used to calculate a rough relationship between the rehash value and the index size. The next two are the first index size/rehash factor combination which averages less than 1.5 searches for a lookup with a worst case of 14 searches. Then average and worst case. Finally, the last column is the average number of clock cycles per lookup. It does not take into account the time required to read the time stamp register.
The actual memory space for the best constants (# of indices = 31253 and rehash factor = 28591) comes out to more than I initially indicated (16000*2*8 + 1.25*16000*2 => 296000 bytes). The actual size is 16000*2*8+31253*2 => 318506.
The fastest combination is an approximate ratio of 11/31 with an index size of 35897 and rehash value of 12721. This will average 1.389 (1 initial hash + 0.389 rehashes) with a maximum of 14 (1+13).
________ EDIT________
I removed the "goto found;" in main () to show all combinations and it shows that much better performance is possible, of course at the expense of a larger index size. For example the combination 57667 and 33797 yields an average of 1.192 and a maximum rehash of 6. The combination 44543 and 23399 yields a 1.249 average and 10 maximum rehashes (it saves (57667-44543)*2=26248 bytes of index table compared to 57667/33797).
Specialized functions with hard-coded hash index size and rehash factor will execute in 60-70% of the time compared to variables. This is probably due to the compiler (gcc 64-bit) substituting the modulo with multiplications and not having to fetch the values from memory locations as they will be coded as immediate values.
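One way to get such hard-coded constants without writing a separate function per combination is to pass them as template parameters. The sketch below is an assumption about how that could look, not the specialized routine that was measured; `ProbeSequence` and `BestProbe` are invented names, and the constants are the 35897/12721 pair from the results above.

```cpp
#include <cstdint>

// Table size and probe step as compile-time constants: the compiler can
// turn "% Slots" into multiply/shift code and embed Step as an immediate.
template <std::uint64_t Slots, std::uint64_t Step>
struct ProbeSequence {
    static_assert(Step > 0 && Step < Slots, "step must be in (0, Slots)");

    // First slot to try for a key.
    static constexpr std::uint64_t home(std::uint64_t key) {
        return key % Slots;
    }

    // Next slot after a collision: add Step, wrap with one subtraction.
    static constexpr std::uint64_t next(std::uint64_t h) {
        return h + Step >= Slots ? h + Step - Slots : h + Step;
    }
};

// Index size and rehash value taken from the result table.
using BestProbe = ProbeSequence<35897, 12721>;
```

A lookup loop would call `BestProbe::home(key)` once and `BestProbe::next(h)` after each mismatch.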
________ EDIT________
On the subject of caches I see two issues.
The first is data caching, which I don't think will be possible because the lookup will just be a small step in some larger process, and you run the risk of the table data's cache lines being invalidated to a lesser or (probably) greater degree, if not entirely, by other data accesses in other steps of the larger process. I.e. the more code executed and data accessed in the process as a whole, the less likely it will be that any pertinent lookup data will remain in the caches (this may or may not be pertinent to the OP's situation). To find an entry using (my) hashing you will encounter two cache misses (one to load the correct part of the index, and the other to load the area containing the entry itself) for every comparison that needs to be performed. Finding an entry on the first try will have cost two misses, the second try four, etc. In my example the 60 clock cycle average cost per lookup implies that the table probably resided entirely in the L2 cache, with a majority of the accesses served from L1 without having to go to L2. My x86-64 CPU has wait states of approximately 4, 10, 40 and 100 for L1, L2, L3 and RAM respectively, which to me shows that RAM was completely kept out and L3 mostly.
The second is code caching, which will have a more significant impact if the code is small, tight, in-lined and with few control transfers (jumps and calls). My hash routine probably resides entirely in the L1 code cache. For more normal cases, the fewer the number of code cache line loads, the faster it will be.
Make an array of structures of key val pairs.
Sort the array by key and put it in your program as a static array; it would only be about 256,000 bytes (16000 pairs of two 64-bit values).
Then in your program a simple binary lookup by key will need at most about 14 key comparisons (log2 of 16000) to find the right value. It should be able to approach speeds of 300 million lookups per second on a modern PC.
You can sort with qsort and search with bsearch, both std lib functions.
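A minimal sketch of that approach, using `std::qsort` and `std::bsearch` from `<cstdlib>`. The names `KeyVal`, `compare_keys`, `sort_table` and `lookup` are made up for the example; in a real program the table would be the pre-sorted static array.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct KeyVal { std::uint64_t key, val; };

// Comparison callback shared by qsort and bsearch. Comparing instead of
// subtracting avoids overflow when the 64-bit keys are far apart.
int compare_keys(const void* a, const void* b) {
    const std::uint64_t ka = static_cast<const KeyVal*>(a)->key;
    const std::uint64_t kb = static_cast<const KeyVal*>(b)->key;
    return (ka > kb) - (ka < kb);
}

// Sort the table by key (only needed once, or offline).
void sort_table(KeyVal* table, std::size_t n) {
    std::qsort(table, n, sizeof(KeyVal), compare_keys);
}

// Binary search by key; returns nullptr if x is not in the table.
const KeyVal* lookup(const KeyVal* table, std::size_t n, std::uint64_t x) {
    const KeyVal probe = {x, 0};
    return static_cast<const KeyVal*>(
        std::bsearch(&probe, table, n, sizeof(KeyVal), compare_keys));
}
```

Because both functions go through a function pointer for every comparison, a hand-written or `std::lower_bound` based search may well be faster; this only shows the library route.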
Perform memoization, or in simple terms, cache the values you've computed already and calculate only the new ones. You should hash the input and check the cache for that result. You can even start off with a set of cached values that you think the function will get called for more often. Besides that, I don't think you need to go to any extreme as the other answer suggests. Keep things simple, and when you are done with your application you can use a profiling tool to find bottlenecks.
EDIT: Some code
#include <iostream>
#include <ctime>
using namespace std;
const int MAX_SIZE = 16000;
int preCalcData[MAX_SIZE] = {};
int getPrecalculatedResult(int x){
return preCalcData[x];
}
void setupPreCalcDataCache(){
for(int i = 0; i < MAX_SIZE; ++i){
preCalcData[i] = i*i; //or whatever calculation
}
}
int main(){
setupPreCalcDataCache();
cout << getPrecalculatedResult(0) << endl;
cout << getPrecalculatedResult(15999) << endl;
return 0;
}
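Note that the array above is indexed by x directly, so it only fits inputs in the range 0..15999. For the sparse 64-bit inputs in the question, the same caching idea might look like the following sketch built on `std::unordered_map`. Here `slow_calc` is a hypothetical stand-in for the expensive function and `cached_calc` is an invented name; the function-local cache is not safe for concurrent callers.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for the expensive calc() from the question.
std::uint64_t slow_calc(std::uint64_t x) { return x * x + 1; }

// Returns slow_calc(x), computing it only the first time x is seen.
std::uint64_t cached_calc(std::uint64_t x) {
    static std::unordered_map<std::uint64_t, std::uint64_t> cache;
    const auto it = cache.find(x);
    if (it != cache.end()) return it->second;   // cache hit
    const std::uint64_t result = slow_calc(x);  // cache miss: compute once
    cache.emplace(x, result);
    return result;
}
```

Since all 16000 inputs are known beforehand, the cache could also be filled once at startup so that no lookup ever pays for the computation.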
I wouldn't worry about performance too much. This simple example, using an array and binary search lower_bound
#include <stdint.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <memory>
const int N = 16000;
typedef std::pair<uint64_t, uint64_t> CALC;
CALC calc[N];
static inline bool cmp_calcs(const CALC &c1, const CALC &c2)
{
return c1.first < c2.first;
}
int main(int argc, char **argv)
{
std::iostream::sync_with_stdio(false);
for (int i = 0; i < N; ++i)
calc[i] = std::make_pair(i, i);
std::sort(&calc[0], &calc[N], cmp_calcs);
for (long i = 0; i < 10000000; ++i) {
int r = rand() % 16000;
CALC *p = std::lower_bound(&calc[0], &calc[N], std::make_pair(r, 0), cmp_calcs);
if (p->first == r)
std::cout << "found\n";
}
return 0;
}
and compiled with
g++ -O2 example.cpp
does, including setup, 10,000,000 searches in about 2 seconds on my 5 year old PC.
You need to store 16 thousand values efficiently, preferably in memory. We are assuming that the computation of these values is more time consuming than accessing them from storage.
You have at your disposal many different data structures to get the job done, including databases. If you access these values in queryable chunks, then the DB overhead may very well be absorbed and spread across your processing.
You mentioned map and hashmap (or hashtable) already in your question tags, but these are probably not the best possible answers for your problem, although they could do a fair job, provided that the hashing function isn't more expensive than the direct computation of the target UINT64 value, which has to be your reference benchmark.
Van Emde Boas Trees
Many variants of B-Trees (used extensively in database engines, high performance filesystems),
Tries
These are probably much better suited. Having some experience with them, I would probably go for a B-tree: they support serialization fairly well. That should let you prepare your dataset in advance in a different program. VEB trees have a very good access time (O(log log u), where u is the size of the key universe), but I don't know how easily they can be serialized.
Later on, if you need even more performance, it would also be interesting to know usage patterns of your "database" to figure out what caching techniques you could implement on top of the store.
Using std::pair is better than any kind of map for speed.
But if I were you, I would first use a std::list to store the data; once I had them all, I would move them into a simple vector. Retrieval then goes very fast if you implement a simple binary search yourself.
I have been using https://github.com/google/benchmark and g++ 9.4.0 to check the performance of data access in different scenarios (compilation with "-O3"). The result has been surprising to me.
My baseline is accessing longs in an std::array ("reduced data"). I want to add an additional byte datum. One time I create an additional container ("split data") and one time I store a struct in the arrays ("combined data").
This is the code:
#include <benchmark/benchmark.h>
#include <array>
#include <random>
constexpr int width = 640;
constexpr int height = 480;
std::array<std::uint64_t, width * height> containerWithReducedData;
std::array<std::uint64_t, width * height> container1WithSplitData;
std::array<std::uint8_t, width * height> container2WithSplitData;
struct CombinedData
{
std::uint64_t first;
std::uint8_t second;
};
std::array<CombinedData, width * height> containerWithCombinedData;
void fillReducedData(const benchmark::State& state)
{
// Variable is intentionally unused
static_cast<void>(state);
// Generate pseudo-random numbers (no seed, therefore always the same numbers)
// NOLINTNEXTLINE
auto engine = std::mt19937{};
auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
for (int row = 0; row < height; ++row)
{
for (int column = 0; column < width; ++column)
{
const std::uint64_t number = longsDistribution(engine);
containerWithReducedData.at(static_cast<unsigned int>(row * width + column)) = number;
}
}
}
std::uint64_t accessReducedData()
{
std::uint64_t value = 0;
for (int row = 0; row < height; ++row)
{
for (int column = 0; column < width; ++column)
{
value += containerWithReducedData.at(static_cast<unsigned int>(row * width + column));
}
}
return value;
}
static void BM_AccessReducedData(benchmark::State& state)
{
// Perform setup here
for (auto _ : state)
{
// Variable is intentionally unused
static_cast<void>(_);
// This code gets timed
benchmark::DoNotOptimize(accessReducedData());
}
}
BENCHMARK(BM_AccessReducedData)->Setup(fillReducedData);
void fillSplitData(const benchmark::State& state)
{
// Variable is intentionally unused
static_cast<void>(state);
// Generate pseudo-random numbers (no seed, therefore always the same numbers)
// NOLINTNEXTLINE
auto engine = std::mt19937{};
auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};
for (int row = 0; row < height; ++row)
{
for (int column = 0; column < width; ++column)
{
const std::uint64_t number = longsDistribution(engine);
container1WithSplitData.at(static_cast<unsigned int>(row * width + column)) = number;
const std::uint8_t additionalNumber = bytesDistribution(engine);
container2WithSplitData.at(static_cast<unsigned int>(row * width + column)) = additionalNumber;
}
}
}
std::uint64_t accessSplitData()
{
std::uint64_t value = 0;
for (int row = 0; row < height; ++row)
{
for (int column = 0; column < width; ++column)
{
value += container1WithSplitData.at(static_cast<unsigned int>(row * width + column));
value += container2WithSplitData.at(static_cast<unsigned int>(row * width + column));
}
}
return value;
}
static void BM_AccessSplitData(benchmark::State& state)
{
// Perform setup here
for (auto _ : state)
{
// Variable is intentionally unused
static_cast<void>(_);
// This code gets timed
benchmark::DoNotOptimize(accessSplitData());
}
}
BENCHMARK(BM_AccessSplitData)->Setup(fillSplitData);
void fillCombinedData(const benchmark::State& state)
{
// Variable is intentionally unused
static_cast<void>(state);
// Generate pseudo-random numbers (no seed, therefore always the same numbers)
// NOLINTNEXTLINE
auto engine = std::mt19937{};
auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};
for (int row = 0; row < height; ++row)
{
for (int column = 0; column < width; ++column)
{
const std::uint64_t number = longsDistribution(engine);
containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first = number;
const std::uint8_t additionalNumber = bytesDistribution(engine);
containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second = additionalNumber;
}
}
}
std::uint64_t accessCombinedData()
{
std::uint64_t value = 0;
for (int row = 0; row < height; ++row)
{
for (int column = 0; column < width; ++column)
{
value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first;
value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second;
}
}
return value;
}
static void BM_AccessCombinedData(benchmark::State& state)
{
// Perform setup here
for (auto _ : state)
{
// Variable is intentionally unused
static_cast<void>(_);
// This code gets timed
benchmark::DoNotOptimize(accessCombinedData());
}
}
BENCHMARK(BM_AccessCombinedData)->Setup(fillCombinedData);
Live demo
And this is the result:
Run on (12 X 4104.01 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.33, 1.82, 1.06
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
BM_AccessReducedData 55133 ns 55133 ns 12309
BM_AccessSplitData 64089 ns 64089 ns 10439
BM_AccessCombinedData 170470 ns 170470 ns 3827
I am not surprised by the long running times of BM_AccessCombinedData. There is additional effort (compared to "reduced data") to add the bytes. My interpretation is that the added byte does not fit into the cache line anymore, which makes the access much more expensive. (Might there be even another effect?)
But why is it so fast to access different containers ("split data")? There the data is located at different positions in memory and there is alternating access to it. Shouldn't this be even slower? But it is almost three times faster than the access of the combined data! Isn't this surprising?
Preface: This answer was written only for the example/scenario you provided in your benchmark link: a summing reduction over interleaved vs non-interleaved collections of differently sized integers. Summing is an unsequenced operation. You can visit elements of the collections and add them to the accumulating result in any order. And whether you "combine" (via struct) or "split" (via separate arrays), the order of accumulation doesn't matter.
Note: It would help if you provided some information about what you already know about optimization techniques and what processors/memory are usually capable of. Your comments show you know about caching, but I have no idea what else you know, or what exactly you know about caching.
Terminology
This choice of "combined" vs "split" has other well known names:
parallel array (wikipedia article)
structure of arrays vs array of structures (wikipedia article)
This question fits into the area of concerns people have when talking about data-oriented design.
For the rest of this answer, I will stay consistent with your terminology.
Alignment, Padding, and Structs
quoting from CppReference,
The C++ language has this requirement:
Every complete object type has a property called alignment requirement, which is an integer value of type size_t representing the number of bytes between successive addresses at which objects of this type can be allocated. The valid alignment values are non-negative integral powers of two.
"Every complete object" includes instances of structs in memory. Reading on...
In order to satisfy alignment requirements of all members of a struct, padding may be inserted after some of its members.
One of its examples demonstrates:
// objects of struct X must be allocated at 4-byte boundaries
// because X.n must be allocated at 4-byte boundaries
// because int's alignment requirement is (usually) 4
struct X {
int n; // size: 4, alignment: 4
char c; // size: 1, alignment: 1
// three bytes padding
}; // size: 8, alignment: 4
This is what Peter Cordes mentioned in the comments. Because of this requirement/property/feature of the C++ language, there is inserted padding for your "combined" collection.
I am not sure if there is a significant detriment to cache performance resulting from the padding here, since the sum only visits each element of the arrays once. In a scenario where elements are frequently revisited, this is more likely to matter: the padding of the combined representation results in "wasted" bytes of the cache when compared with the split representation, and that wastage is more likely to have a significant impact on cache performance. But the degree to which this matters depends on the patterns of revisiting the data.
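To make the padding concrete, here is a minimal sketch. The element types (one 64-bit and one 8-bit integer) and the names `Combined`, `split_bytes_per_element` and `padding_bytes_per_element` are assumptions for illustration; the types in the benchmark may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "combined" element: one large and one small integer.
struct Combined {
    std::uint64_t big;   // size 8, alignment typically 8
    std::uint8_t  small; // size 1, alignment 1
    // the compiler typically appends padding here so that `big` stays
    // correctly aligned in every element of an array of Combined
};

// Bytes per logical element when the two fields live in separate arrays.
constexpr std::size_t split_bytes_per_element() {
    return sizeof(std::uint64_t) + sizeof(std::uint8_t);
}

// Bytes per element that carry no data in the combined layout.
constexpr std::size_t padding_bytes_per_element() {
    return sizeof(Combined) - split_bytes_per_element();
}

static_assert(sizeof(Combined) % alignof(Combined) == 0,
              "a struct's size is a multiple of its alignment");
```

On a typical x86-64 build `sizeof(Combined)` is 16, so 7 of every 16 bytes in the combined layout would be padding, while the split layout uses 9 bytes per element.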
SIMD
wikipedia article
SIMD instructions are specialized CPU machine instructions for performing an operation on multiple pieces of data in memory, such as summing a group of same-sized integers which are laid out next to each other in memory (which is exactly what can be done in the "split"-representation version of your scenario).
Compared to machine code that doesn't use SIMD, SIMD usage can provide a constant-factor improvement (the value of the constant factor depends on the SIMD instruction). For example, a SIMD instruction that adds 8 bytes at once can, in the ideal case, be up to 8 times faster than a scalar loop (unrolled or not) that does the same thing.
Other keywords: vectorization, parallelized code.
Peter Cordes mentioned relevant examples (psadbw, paddq). Here's a list of Intel SSE instructions for arithmetic.
As Peter mentioned, a degree of SIMD usage is still possible in the "combined" representation, but not as much as is possible with the "split" representation. It comes down to what the target machine architecture's instruction set provides. I don't think there's a dedicated SIMD instruction for your example's "combined" representation.
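As a rough illustration of why the split layout is friendlier to vectorization, compare the two loops below. This is only a sketch: the element types and the names `Combined`, `sum_split` and `sum_combined` are assumptions, and whether a given compiler actually vectorizes either loop depends on the compiler and its flags.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical element type for the "combined" layout.
struct Combined {
    std::uint64_t big;
    std::uint8_t  small;
};

// "Split": each loop walks one densely packed array of same-sized
// integers, a shape that auto-vectorizers generally handle well.
std::uint64_t sum_split(const std::vector<std::uint64_t>& bigs,
                        const std::vector<std::uint8_t>& smalls) {
    std::uint64_t total = 0;
    for (std::uint64_t b : bigs)  total += b;
    for (std::uint8_t s : smalls) total += s;
    return total;
}

// "Combined": the same arithmetic, but the fields are interleaved with
// padding, so a vectorized version typically needs strided loads or
// shuffles to separate them.
std::uint64_t sum_combined(const std::vector<Combined>& items) {
    std::uint64_t total = 0;
    for (const Combined& c : items) total += c.big + c.small;
    return total;
}
```

Both functions compute the same sum; only the memory layout they read from differs.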
The Code
For the "split" representation, I would do something like:
// ...
#include <cstdint> // for `std::uint64_t`
#include <numeric> // for `std::reduce`
#include <execution> // for `std::execution` (`unseq` requires C++20)
#include <functional> // for `std::plus`
std::uint64_t accessSplitData()
{
// no semicolon after the first reduce, so that both partial sums are added
return std::reduce(std::execution::unseq, container1WithSplitData.cbegin(), container1WithSplitData.cend(), std::uint64_t{0}, std::plus{})
+ std::reduce(std::execution::unseq, container2WithSplitData.cbegin(), container2WithSplitData.cend(), std::uint64_t{0}, std::plus{});
}
}
// ...
It's a much more direct way to communicate (to readers of the code and to a compiler) an unsequenced sum of collections of integers.
CppReference for std::reduce
CppReference for std::execution::<...>
Execution policies allow you to convey how an algorithm can and is desired to be performed (whether it is safe/still-correct and desirable to use SIMD or multiple threads). Many of the algorithms in the C++ standard library have a similar overload to accept an execution policy argument.
CppReference for std::plus
But What about the Different Positions?
There the data is located at different positions in memory and there is alternating access to it. Shouldn't this be even slower?
As I showed in the code above, for your specific scenario, there doesn't need to be alternating access. But if the scenario is changed to require alternating access, I don't think there would usually be much cache impact.
There is the possible problem of conflict misses if the corresponding entries of the split arrays map to the same cache sets. I don't know how likely this is to be encountered, or if there are techniques in C++ to prevent that. If anyone knows, please edit this answer. If a cache has N-way set associativity, and the access pattern to the "split" representation data only accesses N or fewer arrays in the hot loop (ie. doesn't access any other memory), I believe it should be impossible to run into this.
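To illustrate when such conflict misses can occur, the sketch below computes which cache set an address maps to. The geometry (64-byte lines and 64 sets, as in a 32 KiB 8-way L1 data cache) is an assumption, real caches differ, and the names `cache_set` and `same_set` are hypothetical.

```cpp
#include <cstdint>

// Assumed cache geometry; adjust for the actual target CPU.
constexpr std::uint64_t kLineBytes = 64;
constexpr std::uint64_t kNumSets   = 64;

// Set index of an address: drop the offset-within-line bits, then wrap
// around the number of sets.
constexpr std::uint64_t cache_set(std::uint64_t address) {
    return (address / kLineBytes) % kNumSets;
}

// Two addresses compete for the same set when their set indices match.
constexpr bool same_set(std::uint64_t a, std::uint64_t b) {
    return cache_set(a) == cache_set(b);
}
```

With this geometry, addresses that differ by a multiple of `kLineBytes * kNumSets` (4096 bytes) map to the same set, so split arrays whose base addresses are a multiple of 4 KiB apart would contend for the same sets when walked in lockstep.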
Misc Notes
I would recommend that you keep your benchmark link in your question unchanged, and if you want to update it, add a new link, so people viewing the discussion will be able to see older versions being referred to.
Out of curiosity, is there a reason why you're not using newer compiler versions for the benchmark like gcc 11?
I highly recommend the usage I showed of std::reduce. It's a widely recommended practice to use a dedicated C++ standard algorithm instead of a raw loop where the algorithm fits. See the reasons cited in the CppCoreGuidelines link. The code may be long (and in that sense, ugly), but it clearly conveys the intent to perform a sum where the reduction operator (plus) is unsequenced.
Your question is specifically about speed, but it's noteworthy that in C++, the choice of struct-of-array vs array-of-struct can be important where space costs matter, precisely because of alignment and padding.
There are more considerations in choosing struct-of-array vs array-of-struct that I haven't listed. Memory access patterns are the main consideration for performance. Readability and simplicity are also important; you can alleviate problems by building good abstractions, but there's a limit to that, and the abstraction itself has maintenance, readability, and simplicity costs.
You may also find this video by Matt Godbolt on "Memory and Caches" useful.
In a self-educational project I measure the bandwidth of the memory with help of the following code (here paraphrased, the whole code follows at the end of the question):
unsigned int doit(const std::vector<unsigned int> &mem){
const size_t BLOCK_SIZE=16;
size_t n = mem.size();
unsigned int result=0;
for(size_t i=0;i<n;i+=BLOCK_SIZE){
result+=mem[i];
}
return result;
}
//... initialize mem, result and so on
int NITER = 200;
//... measure time of
for(int i=0;i<NITER;i++)
result+=doit(mem);
BLOCK_SIZE is chosen in such a way that a whole 64-byte cache line is fetched per single integer addition. My machine (an Intel Broadwell) needs about 0.35 nanoseconds per integer addition, so the code above could saturate a bandwidth as high as 182 GB/s (this value is just an upper bound and is probably quite off; what is important is the ratio of bandwidths for different sizes). The code is compiled with g++ and -O3.
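The 182 GB/s figure follows from dividing the bytes fetched per step by the time per step. A small sketch of that arithmetic (the function name is hypothetical):

```cpp
// Upper-bound bandwidth in GB/s if every addition pulls in one cache line.
// 1 byte per nanosecond equals 1 GB/s (10^9 bytes per second).
constexpr double bandwidth_gb_per_s(double bytes_per_step, double ns_per_step) {
    return bytes_per_step / ns_per_step;
}
```

With 64 bytes per step and 0.35 ns per step this gives roughly 182.9 GB/s.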
Varying the size of the vector, I can observe expected bandwidths for L1(*)-, L2-, L3-caches and the RAM-memory:
However, there is an effect I'm really struggling to explain: the collapse of the measured bandwidth of L1-cache for sizes around 2 kB, here in somewhat higher resolution:
I could reproduce the results on all machines I have access to (which have Intel-Broadwell and Intel-Haswell processors).
My question: What is the reason for the performance-collapse for memory-sizes around 2 KB?
(*) I hope I understand correctly that for the L1 cache not 64 bytes but only 4 bytes per addition are read/transferred (there is no further faster cache where a cache line must be filled), so the plotted bandwidth for L1 is only the upper limit and not the bandwidth itself.
Edit: When the step size in the inner for-loop is chosen to be
8 (instead of 16) the collapse happens for 1KB
4 (instead of 16) the collapse happens for 0.5KB
i.e. when the inner loop consists of about 31-35 steps/reads. That means the collapse isn't due to the memory-size but due to the number of steps in the inner loop.
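The three observations can be checked with a little arithmetic. The sketch below assumes `sizeof(unsigned int)` is 4, as on the machines discussed; the function name is hypothetical.

```cpp
#include <cstddef>

// Number of iterations of `for (i = 0; i < n; i += step)` over a buffer of
// `size_bytes` bytes holding 4-byte elements: ceil((size_bytes / 4) / step).
constexpr std::size_t inner_loop_steps(std::size_t size_bytes, std::size_t step) {
    return (size_bytes / 4 + step - 1) / step;
}
```

All three cases (2 KB with step 16, 1 KB with step 8, 0.5 KB with step 4) come out at 32 iterations, consistent with the collapse depending on the trip count rather than on the memory size.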
It can be explained with branch misses, as shown in @user10605163's great answer.
Listing for reproducing the results
bandwidth.cpp:
#include <vector>
#include <chrono>
#include <iostream>
#include <algorithm>
//returns minimal time needed for one execution in seconds:
template<typename Fun>
double timeit(Fun&& stmt, int repeat, int number)
{
std::vector<double> times;
for(int i=0;i<repeat;i++){
auto begin = std::chrono::high_resolution_clock::now();
for(int i=0;i<number;i++){
stmt();
}
auto end = std::chrono::high_resolution_clock::now();
double time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9/number;
times.push_back(time);
}
return *std::min_element(times.begin(), times.end());
}
const int NITER=200;
const int NTRIES=5;
const size_t BLOCK_SIZE=16;
struct Worker{
std::vector<unsigned int> &mem;
size_t n;
unsigned int result;
void operator()(){
for(size_t i=0;i<n;i+=BLOCK_SIZE){
result+=mem[i];
}
}
Worker(std::vector<unsigned int> &mem_):
mem(mem_), n(mem.size()), result(1)
{}
};
double PREVENT_OPTIMIZATION=0.0;
double get_size_in_kB(int SIZE){
return SIZE*sizeof(int)/(1024.0);
}
double get_speed_in_GB_per_sec(int SIZE){
std::vector<unsigned int> vals(SIZE, 42);
Worker worker(vals);
double time=timeit(worker, NTRIES, NITER);
PREVENT_OPTIMIZATION+=worker.result;
return get_size_in_kB(SIZE)/(1024*1024)/time;
}
int main(){
int size=BLOCK_SIZE*16;
std::cout<<"size(kB),bandwidth(GB/s)\n";
while(size<10e3){
std::cout<<get_size_in_kB(size)<<","<<get_speed_in_GB_per_sec(size)<<"\n";
size=(static_cast<int>(size+BLOCK_SIZE)/BLOCK_SIZE)*BLOCK_SIZE;
}
//ensure that nothing is optimized away:
std::cerr<<"Sum: "<<PREVENT_OPTIMIZATION<<"\n";
}
create_report.py:
import sys
import pandas as pd
import matplotlib.pyplot as plt
input_file=sys.argv[1]
output_file=input_file[0:-3]+'png'
data=pd.read_csv(input_file)
labels=list(data)
plt.plot(data[labels[0]], data[labels[1]], label="my laptop")
plt.xlabel(labels[0])
plt.ylabel(labels[1])
plt.savefig(output_file)
plt.close()
Building/running/creating report:
>>> g++ -O3 -std=c++11 bandwidth.cpp -o bandwidth
>>> ./bandwidth > report.txt
>>> python create_report.py report.txt
# image is in report.png
I changed the values slightly: NITER = 100000 and NTRIES=1 to get a less noisy result.
I don't have a Broadwell available right now, however I tried your code on my Coffee-Lake and got a performance drop, not at 2KB, but around 4.5KB. In addition I find erratic behavior of the throughput slightly above 2KB.
The blue line in the graph corresponds to your measurement (left axis):
The red line here is the result from perf stat -e branch-instructions,branch-misses, giving the fraction of branches that were not correctly predicted (in percent, right axis). As you can see there is a clear anti-correlation between the two.
Looking into the more detailed perf report, I found that basically all of these branch mispredictions happen in the innermost loop in Worker::operator(). If the taken/not-taken pattern for the loop branch becomes too long, the branch predictor will not be able to keep track of it, and so the exit branch of the inner loop will be mispredicted, leading to the sharp drop in throughput. As the number of iterations increases further, the impact of this single mispredict becomes less significant, leading to the slow recovery of the throughput.
For further information on the erratic behavior before the drop, see the comments made by @PeterCordes below.
In any case the best way to avoid branch mispredictions is to avoid branches and so I manually unrolled the loop in Worker::operator(), like e.g.:
void operator()(){
for(size_t i=0;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4){
result+=mem[i];
result+=mem[i+BLOCK_SIZE];
result+=mem[i+2*BLOCK_SIZE];
result+=mem[i+3*BLOCK_SIZE];
}
}
Unrolling 2, 3, 4, 6 or 8 iterations gives the results below. Note that I did not correct for the blocks at the end of the vector which were ignored due to the unrolling. Therefore the periodic peaks in the blue line should be ignored, the lower bound base line of the periodic pattern is the actual bandwidth.
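For completeness, here is a sketch of how the ignored blocks at the end could be handled with a scalar tail loop. This is not the code that was measured; `sum_unrolled4` and `sum_plain` are hypothetical names, and the tail loop reintroduces a short branch of its own.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK_SIZE = 16;

// 4x-unrolled strided sum with a scalar tail loop, so that the blocks
// skipped by a purely unrolled version are still visited.
unsigned int sum_unrolled4(const std::vector<unsigned int>& mem) {
    const std::size_t n = mem.size();
    unsigned int result = 0;
    std::size_t i = 0;
    for (; i + 3 * BLOCK_SIZE < n; i += 4 * BLOCK_SIZE) {
        result += mem[i];
        result += mem[i + BLOCK_SIZE];
        result += mem[i + 2 * BLOCK_SIZE];
        result += mem[i + 3 * BLOCK_SIZE];
    }
    for (; i < n; i += BLOCK_SIZE) {  // remaining 0 to 3 blocks
        result += mem[i];
    }
    return result;
}

// Reference: the original, non-unrolled loop.
unsigned int sum_plain(const std::vector<unsigned int>& mem) {
    unsigned int result = 0;
    for (std::size_t i = 0; i < mem.size(); i += BLOCK_SIZE) {
        result += mem[i];
    }
    return result;
}
```

Both functions visit the same elements, so they return the same sum for any vector size.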
As you can see the fraction of branch mispredictions didn't really change, but because the total number of branches is reduced by the factor of unrolled iterations, they will not contribute strongly to the performance anymore.
There is also an additional benefit of the processor being more free to do the calculations out-of-order if the loop is unrolled.
If this is supposed to have a practical application, I would suggest trying to give the hot loop a compile-time fixed number of iterations or some guarantee on divisibility, so that (maybe with some extra hints) the compiler can decide on the optimal number of iterations to unroll.
This might be unrelated, but your Linux machine might be playing with the CPU frequency. I know Ubuntu 18 has a governor that balances power and performance. You may also want to set the process affinity to make sure the process does not get migrated to a different core while running.
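One way to give the loop a compile-time trip count is a template parameter. This is only a sketch with hypothetical names (`sum_fixed`, `Steps`, `Stride`); whether the compiler then unrolls fully, partially, or not at all depends on the compiler and its flags.

```cpp
#include <cstddef>

// Strided sum whose trip count `Steps` is a compile-time constant, so the
// compiler knows exactly how many iterations there are.
// The caller must guarantee at least (Steps - 1) * Stride + 1 elements.
template <std::size_t Steps, std::size_t Stride = 16>
unsigned int sum_fixed(const unsigned int* mem) {
    unsigned int result = 0;
    for (std::size_t k = 0; k < Steps; ++k) {
        result += mem[k * Stride];
    }
    return result;
}
```

For example, `sum_fixed<32>(data)` would cover a 2 KB buffer of 4-byte integers with the default stride of 16.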
I am C++ student and I am working on creating a random number generator.
In fact, I should say my algorithm selects a number within a defined range.
I am writing this just because of my curiosity.
I am not challenging existing library functions.
I always use library functions when writing applications based on randomness but I am again stating that I just want to make it because of my curiosity.
I would also like to know if there is something wrong with my algorithm or with my approach.
I googled how PRNGs work, and some sites said that there is a mathematical algorithm and a predefined series of numbers; a seed just sets the pointer to a different point in the series, and after some interval the sequence repeats itself.
My algorithm just starts moving to and fro in the array of possible values and the seed breaks the loop with different values each time. I don't think this approach is wrong. I got answers suggesting a different algorithm, but they didn't explain what's wrong with my current algorithm.
Yes, there was a problem with my seed: it was not precise and made the results a little predictable, as here:
cout << rn(50,100);
The results in running four times are 74,93,56,79.
See the pattern of "increasing order".
And for large ranges, patterns could be seen easily. I got an answer on getting good seeds, but that too recommended a new algorithm (without saying why).
An alternative could be to shuffle my array randomly, generating a new sequence every time, so that the pattern of increasing order goes away. Any help with that rearranging would also be good. Here is the code below. If my function cannot work, please let me know.
Thanking you in anticipation.
int rn(int lowerlt, int upperlt)
{
/* Over short ranges, results are satisfactory.
* I want to make it effective for big ranges.
*/
const int size = upperlt - lowerlt; // Constant size of the integer array.
int ar[size]; // Array to store all possible values within defined range.
int i, x, ret; // Variables to control loops and return value.
long pointer = 0; //pointer variable. The one which breaks the main loop.
// Loop to initialize the array with possible values..
for (i=0, x=lowerlt; x <= upperlt; i++, x++)
ar[i]=x;
long seed = time(0);
//Main loop . To find the random number.
for (i=0; pointer <= seed; i++, pointer++)
{
ret = ar[i];
if (i == size-1)
{
// Reverse loop.
for (; i >= 0; i--)
{
ret=ar[i];
}
}
}
return ret;
}
Caveat: From your post, aside from your random generator algorithm, one of your problems is getting a good seed value, so I'll address that part of it.
You could use /dev/random to get a seed value. That would be a great place to start [and would be sufficient on its own], but might be considered "cheating" from some perspective.
So, here are some other sources of "entropy":
Use a higher resolution time of day clock source: gettimeofday or clock_gettime(CLOCK_REALTIME,...) call it "cur_time". Use only the microsecond or nanosecond portion respectively, call it "cur_nano". Note that cur_nano is usually pretty random all by itself.
Do a getpid(2). This has a few unpredictable bits because between invocations other programs are starting and we don't know how many.
Create a new temp file and get the file's inode number [then delete it]. This varies slowly over time. It may be the same on each invocation [or not]
Get the high resolution value for the system's time of day clock when the system was booted, call it "sysboot".
Get the high resolution value for the start time of your "session": When your program's parent shell was started, call it "shell_start".
If you were using Linux, you could compute a checksum of /proc/interrupts as that's always changing. For other systems, get some hash of the number of interrupts of various types [should be available from some type of syscall].
Now, create some hash of all of the above (e.g.):
dev_random * cur_nano * (cur_time - sysboot) * (cur_time - shell_start) *
getpid * inode_number * interrupt_count
That's a simple equation. You could enhance it with some XOR and/or sum operations. Experiment until you get one that works for you.
Note: This only gives you the seed value for your PRNG. You'll have to create your PRNG from something else (e.g. earl's linear algorithm)
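The mixing step described above could be sketched as follows. Instead of a plain product (which collapses to zero as soon as one source is zero), this sketch XORs each source in and scrambles the result with the splitmix64 finalizer. That choice is a swapped-in technique, not the formula above, and `mix64` and `combine_seed` are hypothetical names.

```cpp
#include <cstdint>
#include <initializer_list>

// splitmix64 finalizer: a bijective 64-bit mixing function.
inline std::uint64_t mix64(std::uint64_t z) {
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Fold any number of entropy sources into one seed.
// Each source is XORed in and then mixed, so a source that happens to be
// zero does not wipe out the contribution of the others.
inline std::uint64_t combine_seed(std::initializer_list<std::uint64_t> sources) {
    std::uint64_t h = 0;
    for (std::uint64_t s : sources) {
        h = mix64(h ^ s);
    }
    return h;
}
```

The caller would pass whatever sources are available, for example `combine_seed({cur_nano, pid, inode_number})`, where those values are gathered as described in the list above.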
unsigned int Random::next() {
s = (1664525 * s + 1013904223);
return s;
}
's' is growing with every call of that function.
Correct is
unsigned int Random::next() {
s = (1664525 * s + 1013904223) % xxxxxx;
return s;
}
Maybe use this function
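long long Factor = 279470273LL, Divisor = 4294967291LL;
long long seed; // must be set to a value in [1, Divisor-1] before the first call
long long next()
{
seed = (seed * Factor) % Divisor; // the product stays below 2^63, so it cannot overflow
return seed;
}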
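Putting the pieces together, a generator for a range like `rn(50,100)` might look like the sketch below. It uses the constants 1664525 and 1013904223 quoted above and assumes a 32-bit unsigned state, in which case overflow wraps modulo 2^32 and serves as the modulus. The names `Lcg32` and `in_range` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Linear congruential generator with a 32-bit state.
// Unsigned overflow wraps modulo 2^32, which acts as the modulus here.
class Lcg32 {
public:
    explicit Lcg32(std::uint32_t seed) : s_(seed) {}

    std::uint32_t next() {
        s_ = 1664525u * s_ + 1013904223u;
        return s_;
    }

    // Value in [lo, hi] (inclusive). Scales by the high bits, which are
    // better distributed than the low bits of an LCG; a slight bias
    // remains when the span does not divide 2^32.
    int in_range(int lo, int hi) {
        assert(lo <= hi);
        const std::uint64_t span =
            static_cast<std::uint64_t>(static_cast<std::int64_t>(hi) - lo) + 1;
        const std::uint64_t offset =
            (static_cast<std::uint64_t>(next()) * span) >> 32;
        return static_cast<int>(lo + static_cast<std::int64_t>(offset));
    }

private:
    std::uint32_t s_;
};
```

Unlike the array-walking approach in the question, this needs no storage proportional to the range and does a constant amount of work per call.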
During optimizing my connect four game engine I reached a point where further improvements only can be minimal because much of the CPU-time is used by the instruction TableEntry te = mTable[idx + i] in the following code sample.
TableEntry getTableEntry(unsigned __int64 lock)
{
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
for (int i = 0; i < BUCKETSIZE; i++)
{
TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
if (te.height == NOTSET || lock == te.lock)
return te;
}
return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried to replace the vector by allocating the table with new, without any speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist Hashing function) could be expensive, but really that much? Do you have suggestions to improve the function?
Thank you!
Edit: BUCKETSIZE has a value of 4. It's used as collision strategy. The size of one TableEntry is 16 Bytes, the struct looks like following:
struct TableEntry
{ // Old New
unsigned __int64 lock; // 8 8
enum { VALID, UBOUND, LBOUND }flag; // 4 4
short score; // 4 2
char move; // 4 1
char height; // 4 1
// -------
// 24 16 Bytes
TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
Summary: The function originally needed 39 seconds. After making the changes jdehaan suggested, the function now needs 33 seconds (the program stops after 100 seconds). It's better but I think Konrad Rudolph is right and the main reason why it's that slow are the cache misses.
You are making copies of your table entry; what about using const TableEntry& as the type? For the default value at the bottom, a static default TableEntry() will also do. I suppose that is where you lose much time.
const TableEntry& getTableEntry(unsigned __int64 lock)
{
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
for (int i = 0; i < BUCKETSIZE; i++)
{
// hopefully now less than 35% of CPU usage :-)
const TableEntry& te = mTable[idx + i];
if (te.height == NOTSET || lock == te.lock)
return te;
}
return DEFAULT_TABLE_ENTRY;
}
How big is a table entry? I suspect it's the copy that is expensive not the memory lookup.
Memory accesses are quicker if they are contiguous because of cache hits, but it seems you are already doing this.
The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses which can be up to three orders of magnitude slower than access to memory in cache. Three orders of magnitude – that’s a factor 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.
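Since BUCKETSIZE is 4 and an entry is 16 bytes, one bucket is exactly 64 bytes. One possible way to make sure each bucket costs at most one cache miss is to align buckets to the cache line size. The sketch below assumes 64-byte cache lines; the fixed-width member types and the `Bucket` wrapper are assumptions, not the layout from the question.

```cpp
#include <cstdint>

// Entry layout following the "New" column of the struct in the question
// (fixed-width types are an assumption made here for portability).
struct TableEntry {
    std::uint64_t lock;   // 8 bytes
    std::int32_t  flag;   // 4 bytes (VALID / UBOUND / LBOUND)
    std::int16_t  score;  // 2 bytes
    std::int8_t   move;   // 1 byte
    std::int8_t   height; // 1 byte
};

constexpr int BUCKETSIZE = 4;

// One bucket = 4 entries = 64 bytes. Aligning it to 64 bytes keeps every
// bucket inside a single cache line (64-byte lines assumed).
struct alignas(64) Bucket {
    TableEntry entries[BUCKETSIZE];
};

static_assert(sizeof(TableEntry) == 16, "entry is expected to be 16 bytes");
static_assert(sizeof(Bucket) == 64, "bucket is expected to fill one line");
```

Without such alignment, a bucket in a plain std::vector<TableEntry> may straddle two cache lines. Note that standard containers are only guaranteed to honor over-aligned types like this from C++17 on.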
I have seen this exact problem with hash tables before. The problem is that continuous random access to the hash table touches all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size, you will thrash. This manifests as the exact problem you are encountering: the instruction which first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.
Use pointers
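TableEntry* getTableEntry(unsigned __int64 lock)
{
int idx = (lock & 0xFFFFF) * BUCKETSIZE;
TableEntry* first = &mTable[idx];
TableEntry* max = first + BUCKETSIZE; // one past the bucket, without indexing past the end of the vector
for (TableEntry* te = first; te < max; te++)
{
if (te->height == NOTSET || lock == te->lock)
return te;
}
// a pointer is returned, so take the address (assumes DEFAULT_TABLE_ENTRY is a non-const static TableEntry)
return &DEFAULT_TABLE_ENTRY;
}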
I have a program that does a block nested loop join. Basically, it reads contents from a file (say a 10 GB file) into buffer1 (say 400 MB) and puts them into a hash table. It then reads the contents of the second file (say a 10 GB file) into buffer2 (say 100 MB) and checks whether the elements in buffer2 are present in the hash table. Outputting the result doesn't matter; I'm just concerned with the efficiency of the program for now. In this program, I need to read 8 bytes at a time from both files, so I use long long int.
The problem is my program is very inefficient. How can I make it efficient ?
// I compile using g++ -o hash hash.c -std=c++0x
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdint.h>
#include <math.h>
#include <limits.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <unordered_map>
using namespace std;
typedef std::unordered_map<unsigned long long int, unsigned long long int> Mymap;
int main()
{
uint64_t block_size1 = (400*1024*1024)/sizeof(long long int); //block size of Table A - division operator used to make the block size 1 mb - refer line 26,27 malloc statements.
uint64_t block_size2 = (100*1024*1024)/sizeof(long long int); //block size of table B
int i=0,j=0, k=0;
uint64_t x,z,l=0;
unsigned long long int *buffer1 = (unsigned long long int *)malloc(block_size1 * sizeof(long long int));
unsigned long long int *buffer2 = (unsigned long long int *)malloc(block_size2 * sizeof(long long int));
Mymap c1 ; // Hash table
//Mymap::iterator it;
FILE *file1 = fopen64("10G1.bin","rb"); // Input is a binary file of 10 GB
FILE *file2 = fopen64("10G2.bin","rb");
printf("size of buffer1 : %llu \n", block_size1 * sizeof(long long int));
printf("size of buffer2 : %llu \n", block_size2 * sizeof(long long int));
while(!feof(file1))
{
k++;
printf("Iterations completed : %d \n",k);
fread(buffer1, sizeof(long long int), block_size1, file1); // Reading the contents into the memory block from first file
for ( x=0;x< block_size1;x++)
c1.insert(Mymap::value_type(buffer1[x], x)); // inserting values into the hash table
// std::cout << "The size of the hash table is" << c1.size() * sizeof(Mymap::value_type) << "\n" << endl;
/* // display contents of the hash table
for (Mymap::const_iterator it = c1.begin();it != c1.end(); ++it)
std::cout << " [" << it->first << ", " << it->second << "]";
std::cout << std::endl;
*/
while(!feof(file2))
{
i++; // Counting the number of iterations
// printf("%d\n",i);
fread(buffer2, sizeof(long long int), block_size2, file2); // Reading the contents into the memory block from second file
for ( z=0;z< block_size2;z++)
c1.find(buffer2[z]); // finding the element in hash table
// if((c1.find(buffer2[z]) != c1.end()) == true) //To check the correctness of the code
// l++;
// printf("The number of elements equal are : %llu\n",l); // If input files have exactly same contents "l" should print out the block_size2
// l=0;
}
rewind(file2);
c1.clear(); //clear the contents of the hash table
}
free(buffer1);
free(buffer2);
fclose(file1);
fclose(file2);
}
Update :
Is it possible to directly read a chunk (say 400 MB) from a file and directly put it into hash table using C++ stream readers? I think that can further reduce the overhead.
If you're using fread, then try using setvbuf(). The default buffers used by the standard lib file I/O calls are tiny (often of the order of 4kB). When processing large amounts of data quickly, you will be I/O bound and the overhead of fetching many small buffer-fuls of data can become a significant bottleneck. Set this to a larger size (e.g. 64kB or 256kB) and you can reduce that overhead and may see significant improvements - try out a few values to see where you get the best gains as you will get diminishing returns.
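A minimal sketch of that call is shown below. The helper name `use_large_buffer` is hypothetical; `std::setvbuf` has to be called after the stream is opened and before any other operation on it.

```cpp
#include <cstddef>
#include <cstdio>

// Ask the C library to use a larger, fully buffered stream buffer.
// Passing nullptr lets the library allocate the buffer itself.
// Returns true on success.
bool use_large_buffer(std::FILE* stream, std::size_t bytes) {
    return std::setvbuf(stream, nullptr, _IOFBF, bytes) == 0;
}
```

In the program from the question this would be called right after each fopen64, for example `use_large_buffer(file1, 256 * 1024)`, and the size would then be tuned by measurement.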
The running time for your program is (l1 x bs1 x l2 x bs2) (where l1 is the number of lines in the first file, bs1 is the block size for the first buffer, l2 is the number of lines in the second file, and bs2 is the block size for the second buffer) since you have four nested loops. Since your block sizes are constant, you can say that your order is O(n x 400 x m x 400) or O(160000mn), or in the worst case O(160000n^2), which essentially ends up being O(n^2).
You can have an O(n) algorithm if you do something like this (pseudocode follows):
map = new Map();
duplicate = new List();
unique = new List();
for each line in file1
map.put(line, true)
end for
for each line in file2
if(map.get(line))
duplicate.add(line)
else
unique.add(line)
fi
end for
Now duplicate will contain a list of duplicate items and unique will contain a list of unique items.
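In C++, the pseudocode above could be sketched like this for 8-byte keys. The names `Keys` and `classify` are hypothetical, and the sketch assumes everything fits in memory.

```cpp
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

using Keys = std::vector<std::uint64_t>;

// Returns {duplicate, unique}: the keys of `second` that do and do not
// occur in `first`, in the order they appear in `second`.
std::pair<Keys, Keys> classify(const Keys& first, const Keys& second) {
    // Build the hash set once from the first input.
    std::unordered_set<std::uint64_t> seen(first.begin(), first.end());
    Keys duplicate, unique;
    // Single pass over the second input with O(1) expected lookups.
    for (std::uint64_t key : second) {
        if (seen.count(key) != 0) duplicate.push_back(key);
        else                      unique.push_back(key);
    }
    return {duplicate, unique};
}
```

Each input is traversed exactly once, which is where the O(n) behavior comes from.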
In your original algorithm, you needlessly traverse the second file for every line in the first file. So you actually end up losing the benefit of the hash (which gives you O(1) lookup time). The trade-off in this case, of course, is that you have to store the entire 10GB in memory which probably is not that helpful. Usually in cases like these the trade-off is between run-time and memory.
There is probably a better way to do this. I need to think about it some more. If not, I'm pretty sure someone will come up with a better idea :).
UPDATE
You can probably reduce memory-usage if you can find a good way to hash the line (that you read in from the first file) so that you get a unique value (i.e., a 1-to-1 mapping between the line and the hash value). Essentially you would do something like this:
for each line in file1
map.put(hash(line), true)
end for
for each line in file2
if(map.get(hash(line)))
duplicate.add(line)
else
unique.add(line)
fi
end for
Here the hash function is the one that performs the hashing. This way you don't have to store all the lines in memory; you only have to store their hashed values. This might help you a little bit. Even so, in the worst case (where you are comparing two files that are either identical or entirely different) you can still end up with 10 GB in memory for either the duplicate or the unique list. You can get around this, at the cost of losing some information, if you simply store a count of unique or duplicate items instead of the items themselves.
long long int *ptr = mmap() your files, then compare them with memcmp() in chunks.
Once a discrepancy is found, step back one chunk and compare them in more detail. (More detail means long long int in this case.)
If you expect to find discrepancies often, do not bother with memcmp(), just write your own loop comparing the long long ints to each other.
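The coarse-then-fine comparison could look like the sketch below. It works on plain pointers, which in the scenario above would come from mmap(); the function name and the default chunk size are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Index of the first differing 64-bit word, or `count` if the ranges match.
std::size_t first_mismatch(const std::uint64_t* a, const std::uint64_t* b,
                           std::size_t count, std::size_t chunk_words = 4096) {
    assert(chunk_words > 0);
    for (std::size_t start = 0; start < count; start += chunk_words) {
        const std::size_t len = std::min(chunk_words, count - start);
        // Coarse check: compare a whole chunk at once.
        if (std::memcmp(a + start, b + start,
                        len * sizeof(std::uint64_t)) == 0) {
            continue;
        }
        // Fine check: locate the exact word inside the differing chunk.
        for (std::size_t i = start; i < start + len; ++i) {
            if (a[i] != b[i]) return i;
        }
    }
    return count;
}
```

Note that this compares the two inputs position by position, so it only applies when matching elements are expected at the same offsets.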
The only way to know is to profile it, eg with gprof. Create a benchmark of your current implementation and then experiment with other modifications methodically and re-run the benchmark.
I'd bet if you read in larger chunks you'd get better performance: fread() and process multiple blocks per pass.
The problem I see is that you are reading the second file n-times. Really slow.
The best way to make this faster is to pre-sort the files then do a Sort-merge join. The sort is almost always worth it, in my experience.
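A minimal in-memory sketch of the merge phase is shown below. The function name is hypothetical, and for inputs larger than RAM the two `std::sort` calls would have to be replaced by an external sort, which is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts the keys of `b` that also occur in `a` (a key repeated in `b` is
// counted each time). Both vectors are sorted in place, then scanned once
// in lockstep.
std::size_t sort_merge_count(std::vector<std::uint64_t>& a,
                             std::vector<std::uint64_t>& b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::size_t i = 0, j = 0, matches = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            ++matches;
            ++j;  // keep i so that repeated keys in b still match
        }
    }
    return matches;
}
```

After sorting, each file is read sequentially exactly once, instead of re-reading the second file for every block of the first.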