Pre-compute cos() and sin() in tables once - C++

I'd like to improve performance of my Dynamic Linked Library (DLL).
For that I want to use lookup tables of cos() and sin() as I use a lot of them.
As I want maximum performance, I want to create a table from 0 to 2*PI that contains the resulting cos and sin computations.
For a good result in terms of precision, I think a table of 1 MB for each function is a good trade-off between size and precision.
I would like to know how to create and use these tables without using an external file (as it is a DLL): I want to keep everything within one file.
Also, I don't want to compute the sin and cos functions when the plugin starts: they have to be computed once and put in a standard vector.
But how do I do that in C++?
EDIT1: jons34yp's code below is very good for creating the vector files.
I did a small benchmark and found that if you need good precision and good speed, you can build a 250,000-entry vector and linearly interpolate between entries. You get a maximum error of 7.89E-11 (!), and it is the fastest of all the approximations I tried: more than 12x faster than sin() (13.296x faster, exactly).
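For reference, a minimal sketch of that interpolated lookup might look like the following (the names and the runtime table construction are my own illustration; the actual goal above is a compile-time table):

#include <cmath>
#include <cstddef>
#include <vector>

namespace {
    const std::size_t kTableSize = 250000;
    const double kTwoPi = 6.283185307179586;

    // One extra entry so sin_table[i + 1] is always valid at the top end.
    std::vector<double> make_sin_table() {
        std::vector<double> t(kTableSize + 1);
        for (std::size_t i = 0; i <= kTableSize; ++i)
            t[i] = std::sin(kTwoPi * i / kTableSize);
        return t;
    }
    const std::vector<double> sin_table = make_sin_table();
}

// x is assumed to already lie in [0, 2*pi).
inline double lerp_sin(double x) {
    double pos = x * (kTableSize / kTwoPi);       // fractional table position
    std::size_t i = static_cast<std::size_t>(pos);
    if (i >= kTableSize) i = kTableSize - 1;      // guard against edge rounding
    double frac = pos - i;
    return sin_table[i] + frac * (sin_table[i + 1] - sin_table[i]);
}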

The easiest solution is to write a separate program that creates a .cc file with the definition of your vector.
For example:
#include <cmath>
#include <fstream>   // std::ofstream
#include <iomanip>   // std::setprecision

int main()
{
    std::ofstream out("values.cc");
    out << "#include \"static_values.h\"\n";
    out << "#include <vector>\n";
    out << "std::vector<float> pi_values = {\n";
    out << std::setprecision(10);

    // We only need to compute the range from 0 to PI/2, and use trigonometric
    // transformations for values outside this range.
    double range = 3.14159265358979 / 2;
    unsigned num_results = 250000;
    for (unsigned i = 0; i < num_results; i++) {
        double value = (range / num_results) * i;
        double res = std::sin(value);
        out << "    " << res << ",\n";
    }
    out << "};\n";
    out.close();
}
Note that this is unlikely to improve performance, since a table of this size probably won't fit in your L2 cache. This means a large percentage of trigonometric computations will need to access RAM; each such access costs roughly several hundred CPU cycles.
By the way, have you looked at approximate SSE SIMD trigonometric libraries? This looks like a good use case for them.

You can compute the table at startup instead of storing it precomputed in the executable:
#include <cmath>

double precomputed_sin[65536];

struct table_filler {
    table_filler() {
        // Fill the table once, before main() runs, via this global's constructor.
        for (int i = 0; i < 65536; i++) {
            precomputed_sin[i] = std::sin(i * 2 * 3.141592654 / 65536);
        }
    }
} table_filler_instance;
This way the table is computed just once at program startup and it's still at a fixed memory address. After that tsin and tcos can be implemented inline as
inline double tsin(int x) { return precomputed_sin[x & 65535]; }
inline double tcos(int x) { return precomputed_sin[(x + 16384) & 65535]; }
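A caller working in radians first needs to convert the angle into this 16-bit fixed-point representation; a small sketch of such a wrapper (the helper name and truncating conversion are my own choices):

#include <cmath>

// Map a radian angle onto the 16-bit table index; the "& 65535" in
// tsin/tcos then wraps the angle by 2*pi for free.
inline int to_index(double radians) {
    return static_cast<int>(radians * (65536.0 / (2.0 * 3.141592653589793)));
}

// Usage: double s = tsin(to_index(angle));   // instead of std::sin(angle)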

The usual answer to this sort of question is to write a small
program which generates a C++ source file with the values in
a table, and compile it into your DLL. If you're thinking of
tables with 128000 entries (128000 doubles are 1MB), however,
you might run up against some internal limits in your compiler.
In that case, you might consider writing the values out to
a file as a memory dump, and mmapping this file when you load
the DLL. (Under Windows, I think you could even put this second
file into an alternate data stream of your DLL file, so you wouldn't
have to distribute a second file.)
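A minimal sketch of the raw-dump half of that idea, with plain file I/O standing in for the actual memory mapping (the function names are mine):

#include <cstddef>
#include <fstream>
#include <vector>

// Write the table once as raw bytes...
void dump_table(const std::vector<double>& t, const char* path) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(t.data()),
              static_cast<std::streamsize>(t.size() * sizeof(double)));
}

// ...and slurp it back at DLL load time. A real implementation would map
// the file (CreateFileMapping/MapViewOfFile, or mmap) instead of reading,
// so pages load lazily and can be shared between processes.
std::vector<double> load_table(const char* path, std::size_t count) {
    std::vector<double> t(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(t.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    return t;
}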

Related

How can one improve performance in a neural network?

I am implementing a basic NeuroEvolution program. The Data Set I am attempting is the "Cover Type" data set. This set has 15,120 records with 56 inputs (numerical data on patches of forest land) and 1 output (the cover type). As recommended, I am using 150 hidden neurons. The fitness function attempts to iterate through all 15,120 records to calculate an error, and use that error to calculate the fitness. The method is shown below.
double getFitness() {
    double error = 0.0;
    // for every record in the dataset.
    for (int record = 0; record < targets.size(); record++) {
        // evaluate output vector for this record of inputs.
        vector<double> outputs = eval(inputs[record]);
        // for every value in those records.
        for (int value = 0; value < outputs.size(); value++) {
            // add to error the difference.
            error += abs(outputs[value] - targets[record][value]);
        }
    }
    return 1 - (error / targets.size());
}
"inputs" and "targets" are 2-D vectors read-in from the CSV file. The entire program uses ~40 MB of memory at run-time. That is not a problem. The program mutates off of a parent network, both are evaluated for fitness, and the most fit is kept to be mutated-upon. In the entire process, the getFitness() function is taking the most time. The program is written in Visual Studio 2017 (on a 2.6GHz i7) in Windows 10.
It took ~7 minutes to evaluate ONE network's fitness, using 21% of the CPU. Smaller problems have required hundreds of thousands of such evaluations.
What methods are available to get that number down?
The program, which (apparently) makes significant use of vectors, cannot be directly offloaded to a GPU using OpenCL or CUDA without significant modifications. This makes OpenMP the most viable option. As few as 2 lines of code need to be added to "go parallel." In addition, Visual Studio must be set to use OpenMP (Project -> Properties -> C/C++ -> Language -> OpenMP Support).
#include <omp.h>
#include <numeric>
// libraries, variables, functions, etc...
double getFitness() {
    double err = 0.0;
    vector<double> error;
    error.resize(targets.size());
    // for every record in the dataset.
    #pragma omp parallel for
    for (int record = 0; record < (int)targets.size(); record++) {
        // evaluate output vector for this record of inputs.
        vector<double> outputs = eval(inputs[record]);
        // accumulate this record's error; each thread writes only to its
        // own slot of the error vector, so no synchronization is needed.
        for (int value = 0; value < (int)outputs.size(); value++) {
            error[record] += abs(outputs[value] - targets[record][value]);
        }
    }
    // 0.0 (not 0), so accumulate sums as double rather than truncating to int.
    err = accumulate(error.begin(), error.end(), 0.0);
    return 1 - (err / targets.size());
}
In the above code, OpenMP divides the records among a pool of worker threads (not one thread per record). However, unless you run this on a machine with many cores, it will eat your CPU. Expect a ~4x improvement on a quad-core CPU.
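An alternative worth considering (my addition, not part of the answer above) is OpenMP's built-in reduction clause, which removes the per-record error vector entirely; a sketch, assuming the same eval/inputs/targets as in the code above:

double getFitnessReduction() {
    double err = 0.0;
    // reduction(+:err) gives each thread a private copy of err and
    // sums the copies when the loop ends, so no error vector is needed.
    #pragma omp parallel for reduction(+:err)
    for (int record = 0; record < (int)targets.size(); record++) {
        vector<double> outputs = eval(inputs[record]);
        for (int value = 0; value < (int)outputs.size(); value++) {
            err += abs(outputs[value] - targets[record][value]);
        }
    }
    return 1 - (err / targets.size());
}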

C++ stack efficient for multicore application

I am trying to code a multicore Markov chain in C++, and while I am trying to take advantage of the many CPUs (up to 24) to run a different chain on each one, I have a problem picking the right container to gather the results of the numerical evaluations on each CPU. What I am trying to measure is basically the average value of an array of boolean variables. I have tried coding a wrapper around a `std::vector` object looking like that:
struct densityStack {
    vector<int> density; // will store the sum of boolean variables
    int card;            // will store the number of elements we summed over, for normalizing at the end

    densityStack(int size){ // constructor taking as only parameter the size of the array, usually size = 30
        density = vector<int>(size, 0);
        card = 0;
    }

    void push_back(vector<int> & toBeAdded){ // method summing a new array (of measurements) onto our stack
        for(auto valStack = density.begin(), newVal = toBeAdded.begin(); valStack != density.end(); ++valStack, ++newVal)
            *valStack += *newVal;
        card++;
    }

    void savef(const char * fname){ // method outputting into a file
        ofstream out(fname);
        out.precision(10);
        out << card << "\n"; // saving the cardinal in the first line
        for(auto val = density.begin(); val != density.end(); ++val)
            out << (double) *val / card << "\n";
        out.close();
    }
};
Then, in my code I use a single densityStack object, and every time a CPU core has data (which can be 100 times per second) it calls push_back to send the data back to the densityStack.
My issue is that this seems to be slower than the first, raw approach, where each core stored each array of measurements in a file and I then used a Python script to average and clean up (I was unhappy with it because it stored too much information and put too much useless stress on the hard drives).
Do you see where I could be losing a lot of performance? I mean, is there an obvious source of overhead? Copying back the vector, even at frequencies of 1000 Hz, should not be too much.
How are you synchronizing your shared densityStack instance?
From the limited info here my guess is that the CPUs are blocked waiting to write data every time they have a tiny chunk of data. If that is the issue, a simple technique to improve performance would be to reduce the number of writes. Keep a buffer of data for each CPU and write to the densityStack less frequently.
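A hedged sketch of that buffering idea (the recorder type, flush threshold, and single shared mutex are my own illustration, assuming the densityStack from the question):

#include <cstddef>
#include <mutex>
#include <vector>

// Each worker thread owns one of these and only touches the shared
// densityStack (under the mutex) once per kFlushEvery measurements.
struct BufferedRecorder {
    static constexpr std::size_t kFlushEvery = 256;  // tune to taste
    std::vector<std::vector<int>> buffer;

    void record(densityStack& shared, std::mutex& m, std::vector<int>& sample) {
        buffer.push_back(sample);
        if (buffer.size() >= kFlushEvery) flush(shared, m);
    }

    void flush(densityStack& shared, std::mutex& m) {
        std::lock_guard<std::mutex> lock(m);
        for (auto& s : buffer) shared.push_back(s);
        buffer.clear();
    }
};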

large loop for timing tests gets somehow optimized to nothing?

I am trying to test a series of libraries for matrix-vector computations. For that I just make a large loop, and inside it I call the routine I want to time. Very simple. However, I sometimes see that when I increase the compiler's optimization level, the time drops to zero no matter how large the loop is. See the example below, where I try to time a C macro that computes cross products. What is the compiler doing? How can I avoid it while still allowing maximum optimization of floating-point arithmetic? Thank you in advance.
The example below was compiled using g++ 4.7.2 on a computer with an i5 intel processor.
Using optimization level 1 (-O1) it takes 0.35 seconds. For level two or higher it drops down to zero. Remember, I want to time this so I want the computations to actually happen even if, for this simple test, unnecessary.
#include <ctime>    // clock(), CLOCKS_PER_SEC
#include <iostream>
using namespace std;

typedef double Vector[3];

#define VecCross(A,assign_op,B,dummy_op,C) \
( A[0] assign_op (B[1] * C[2]) - (B[2] * C[1]), \
  A[1] assign_op (B[2] * C[0]) - (B[0] * C[2]), \
  A[2] assign_op (B[0] * C[1]) - (B[1] * C[0]) \
)

double get_time(){
    return clock()/(double)CLOCKS_PER_SEC;
}

int main()
{
    unsigned long n = 1000000000u;
    double start;
    { // C macro cross product
        Vector u = {1,0,0};
        Vector v = {1,1,0};
        Vector w = {1.2,1.2,1.2};
        start = get_time();
        for(unsigned long i=0;i<n;i++){
            VecCross (w, =, u, X, v);
        }
        cout << "C macro cross product: " << get_time()-start << endl;
    }
    return 0;
}
Ask yourself, what does your program actually do, in terms of what is visible to the end-user?
It displays the result of a calculation: get_time()-start. The contents of your loop have no bearing on the outcome of that calculation, because you never actually use the variables being modified inside the loop.
Therefore, the compiler optimises out the entire loop since it is irrelevant.
One solution is to output the final state of the variables being modified in the loop, as part of your cout statement, thus forcing the compiler to compute the loop. However, a smart compiler could also figure out that the loop always calculates the same thing, and it can simply insert the result directly into your cout statement, because there's no need to actually calculate it at run-time. As a workaround to this, you could for example require that one of the inputs to the loop be provided at run-time (e.g. read it in from a file, command line argument, cin, etc.).
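For example, a sketch of both workarounds applied to the loop above (the volatile sink is my addition; the caveat just mentioned still applies, since the compiler may hoist the loop-invariant computation and keep only the stores, so a run-time input remains the robust fix):

// Either make the loop's result observable in the output...
cout << "C macro cross product: " << get_time()-start
     << " (w = " << w[0] << ", " << w[1] << ", " << w[2] << ")" << endl;

// ...or funnel each iteration through a volatile sink the optimizer must honor:
volatile double sink;
for (unsigned long i = 0; i < n; i++) {
    VecCross (w, =, u, X, v);
    sink = w[2];   // volatile store: the loop itself cannot be deleted
}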
For more (and possibly better) solutions, check out this duplicate thread: Force compiler to not optimize side-effect-less statements

The fastest way to retrieve 16k Key-Value pairs?

OK, here's my situation :
I have a function - let's say U64 calc (U64 x) - which takes a 64-bit integer parameter, performs some CPU-intensive operation, and returns a 64-bit value
Now, given that I know ALL possible inputs (the xs) of that function beforehand (there are some 16,000 of them), I thought it might be better to pre-calculate them and then fetch them on demand (from an array-like structure).
The ideal situation would be to store them all in some array U64 CALC[] and retrieve them by index (the x again)
And here's the issue: I may know what the possible inputs for my calc function are, but they are most definitely NOT consecutive (e.g. not from 1 to 16000, but values that may go as low as 0 and as high as some trillions - always within the 64-bit range).
E.g.:

X          CALC[X]
-----------------------
123123     123123123
12312      12312312
897523     986123
etc.
And here comes my question :
How would you store them?
What workaround would you prefer?
Now, given that these values (from CALC) will have to be accessed some thousands to millions of times per second, which would be the best solution performance-wise?
Note: I'm not mentioning anything I've thought of or tried, so as not to turn the answers into an A-vs-B debate, and mostly so as not to influence anyone...
I would use some sort of hash function that creates an index to a u64 pair, where one element is the value the key was created from and the other is the replacement value. Technically the index could be three bytes long (assuming 16 million - "16000 thousand" - pairs) if you need to conserve space, but I'd use u32s. If the stored value does not match the value the hash was computed from (a hash collision), you'd enter an overflow handler.
You need to determine a custom hashing algorithm to fit your data
Since you know the size of the data you don't need algorithms that allow the data to grow.
I'd be wary of using some standard algorithm because they seldom fit specific data
I'd be wary of using a C++ method unless you are sure the code is WYSIWYG (doesn't generate a lot of invisible calls)
Your index should be 25% larger than the number of pairs
Run through all possible inputs and determine min, max, average and standard deviation for the number of collisions and use these to determine the acceptable performance level. Then profile to achieve the best possible code.
The required memory space (using a u32 index) comes out to (4*1.25)+8+8 = 21 bytes per pair, or 336 MB for 16 million pairs; no problem on a typical PC.
________ EDIT________
I have been challenged by "RocketRoy" to put my money where my mouth is. Here goes:
The problem has to do with collision handling in a (fixed size) hash index. To set the stage:
I have a list of n entries where a field in the entry contains the value v that the hash is computed from
I have a vector of approximately n*1.25 indices, such that the number of indices x is a prime number
A prime number y is computed which is a fraction of x
The vector is initialized to say -1 to denote unoccupied
Pretty standard stuff I think you'll agree.
Each entry in the list is processed: the hash value h is computed, taken modulo x, and used as an index into the vector, and the index of the entry is placed there.
Eventually I encounter the situation where the vector entry pointed to by the index is occupied (doesn't contain -1) - voilà, a collision.
So what do I do? I keep the original h as ho, add y to h, and take it modulo x to get a new index into the vector. If the entry is unoccupied I use it; otherwise I continue adding y modulo x until I arrive back at ho. In theory the probe sequence visits every slot before returning to ho, because x and y are both prime. In practice x is larger than n, so the table never fills and this won't happen.
So the "re-hash" that RocketRoy claims is very costly is no such thing.
The tricky part with this method - as with all hashing methods - is to:
Determine a suitable value for x (this becomes less tricky the larger the final x is)
Determine the algorithm a for h = a(v) % x such that the h's index reasonably evenly ("randomly") into the index vector, with as few collisions as possible
Determine a suitable value for y such that collision probes index reasonably evenly ("randomly") into the index vector
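In code, the probe loop described above is essentially the following (a sketch: the plain modulo stands in for the author's hash a(v), the slots simply store the keys, and slots.size() is assumed to equal x):

#include <vector>

// Open addressing with a prime stride: starting from h0, add y modulo x
// until an empty slot, a match, or a full cycle back to h0. Because x
// and y are both prime, the probe sequence visits every slot before cycling.
long probe(const std::vector<long long>& slots,   // -1 marks an empty slot
           unsigned long x, unsigned long y, long long key) {
    unsigned long h0 = static_cast<unsigned long>(key) % x;
    unsigned long h = h0;
    do {
        if (slots[h] == -1) return -1;                    // empty: key absent
        if (slots[h] == key) return static_cast<long>(h); // found
        h += y;                                           // rehash with the prime stride
        if (h >= x) h -= x;
    } while (h != h0);
    return -1;                                            // table exhausted
}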
________ EDIT________
I'm sorry I've taken so long to produce this code ... other things have had higher priorities.
Anyway, here is the code which proves that hashing has better prospects for quick lookups than a binary tree. It runs through a bunch of hashing index sizes and algorithms to aid in finding the most suitable combo for the specific data. For every algorithm the code will print the first index size such that no lookup takes longer than fourteen searches (worst case for binary searching) and an average lookup takes less than 1.5 searches.
I have a fondness for prime numbers in these types of applications, in case you haven't noticed.
There are many ways of creating a hashing algorithm with its mandatory overflow handling. I opted for simplicity, assuming it will translate into speed ... and it does. On my laptop with an i5 M 480 @ 2.67 GHz, an average lookup requires between 55 and 60 clock cycles (comes out to around 45 million lookups per second). I implemented a special get operation with a constant number of indices and a ditto rehash value, and the cycle count dropped to 40 (65 million lookups per second). If you look at the line calling getOpSpec, the index i is xor'ed with 0x444 to exercise the caches and achieve more "real world"-like results.
I must again point out that the program suggests suitable combinations for the specific data. Other data may require a different combo.
The source code contains both the code for generating the 16000 unsigned long long pairs and for testing different constants (index sizes and rehash values):
#include <windows.h>
#define i8 signed char
#define i16 short
#define i32 long
#define i64 long long
#define id i64
#define u8 char
#define u16 unsigned short
#define u32 unsigned long
#define u64 unsigned long long
#define ud u64
#include <string.h>
#include <stdio.h>
u64 prime_find_next (const u64 value);
u64 prime_find_previous (const u64 value);
static inline volatile unsigned long long rdtsc_to_rax (void)
{
  unsigned long long lower,upper;

  asm volatile( "rdtsc\n"
                : "=a"(lower), "=d"(upper));
  return lower|(upper<<32);
}
static u16 index[65536];
static u64 nindeces,rehshFactor;
static struct PAIRS {u64 oval,rval;} pairs[16000] = {
#include "pairs.h"
};
struct HASH_STATS
{
  u64 ninvocs,nrhshs,nworst;
} getOpStats,putOpStats;
i8 putOp (u16 index[], const struct PAIRS data[], const u64 oval, const u64 ci)
{
  u64 nworst=1,ho,h,i;
  i8 success=1;

  ++putOpStats.ninvocs;
  ho=oval%nindeces;
  h=ho;
  do
  {
    i=index[h];
    if (i==0xffff)               /* unused position */
    {
      index[h]=(u16)ci;
      goto added;
    }
    if (oval==data[i].oval) goto duplicate;
    ++putOpStats.nrhshs;
    ++nworst;
    h+=rehshFactor;
    if (h>=nindeces) h-=nindeces;
  } while (h!=ho);
exhausted:                       /* should not happen */
duplicate:
  success=0;
added:
  if (nworst>putOpStats.nworst) putOpStats.nworst=nworst;
  return success;
}
i8 getOp (u16 index[], const struct PAIRS data[], const u64 oval, u64 *rval)
{
  u64 ho,h,i;
  i8 success=1;

  ho=oval%nindeces;
  h=ho;
  do
  {
    i=index[h];
    if (i==0xffffu) goto not_found;   /* unused position */
    if (oval==data[i].oval)
    {
      *rval=data[i].rval;             /* fetch the replacement value */
      goto found;
    }
    h+=rehshFactor;
    if (h>=nindeces) h-=nindeces;
  } while (h!=ho);
exhausted:
not_found:                            /* should not happen */
  success=0;
found:
  return success;
}
volatile i8 stop = 0;
int main (int argc, char *argv[])
{
  u64 i,rval,mulup,divdown,start;
  double ave;

  SetThreadAffinityMask (GetCurrentThread(), 0x00000004ull);
  divdown=5;   //5
  while (divdown<=100)
  {
    mulup=3;   // 3
    while (mulup<divdown)
    {
      nindeces=16000;
      while (nindeces<65500)
      {
        nindeces=prime_find_next (nindeces);
        rehshFactor=nindeces*mulup/divdown;
        rehshFactor=prime_find_previous (rehshFactor);
        memset (index, 0xff, sizeof(index));
        memset (&putOpStats, 0, sizeof(struct HASH_STATS));
        i=0;
        while (i<16000)
        {
          if (!putOp (index, pairs, pairs[i].oval, (u16) i)) stop=1;
          ++i;
        }
        ave=(double)(putOpStats.ninvocs+putOpStats.nrhshs)/(double)putOpStats.ninvocs;
        if (ave<1.5 && putOpStats.nworst<15)
        {
          start=rdtsc_to_rax ();
          i=0;
          while (i<16000)
          {
            if (!getOp (index, pairs, pairs[i^0x0444].oval, &rval)) stop=1;
            ++i;
          }
          start=rdtsc_to_rax ()-start+8000;   /* 8000 is half of 16000 (pairs), for rounding */
          printf ("%u;%u;%u;%u;%1.3f;%u;%u\n", (u32)mulup, (u32)divdown, (u32)nindeces, (u32)rehshFactor, ave, (u32)putOpStats.nworst, (u32)(start/16000ull));
          goto found;
        }
        nindeces+=2;
      }
      printf ("%u;%u\n", (u32)mulup, (u32)divdown);
found:
      mulup=prime_find_next (mulup);
    }
    divdown=prime_find_next (divdown);
  }
  SetThreadAffinityMask (GetCurrentThread(), 0x0000000fu);
  return 0;
}
It was not possible to include the generated pairs file (an answer is apparently limited to 30000 characters). But send a message to my inbox and I'll mail it.
And these are the results:
3;5;35569;21323;1.390;14;73
3;7;33577;14389;1.435;14;60
5;7;32069;22901;1.474;14;61
3;11;35107;9551;1.412;14;59
5;11;33967;15427;1.446;14;61
7;11;34583;22003;1.422;14;59
3;13;34253;7901;1.439;14;61
5;13;34039;13063;1.443;14;60
7;13;32801;17659;1.456;14;60
11;13;33791;28591;1.436;14;59
3;17;34337;6053;1.413;14;59
5;17;32341;9511;1.470;14;61
7;17;32507;13381;1.474;14;62
11;17;33301;21529;1.454;14;60
13;17;34981;26737;1.403;13;59
3;19;33791;5333;1.437;14;60
5;19;35149;9241;1.403;14;59
7;19;33377;12289;1.439;14;97
11;19;34337;19867;1.417;14;59
13;19;34403;23537;1.430;14;61
17;19;33923;30347;1.467;14;61
3;23;33857;4409;1.425;14;60
5;23;34729;7547;1.429;14;60
7;23;32801;9973;1.456;14;61
11;23;33911;16127;1.445;14;60
13;23;33637;19009;1.435;13;60
17;23;34439;25453;1.426;13;60
19;23;33329;27529;1.468;14;62
3;29;32939;3391;1.474;14;62
5;29;34543;5953;1.437;13;60
7;29;34259;8263;1.414;13;59
11;29;34367;13033;1.409;14;60
13;29;33049;14813;1.444;14;60
17;29;34511;20219;1.422;14;60
19;29;33893;22193;1.445;13;61
23;29;34693;27509;1.412;13;92
3;31;34019;3271;1.441;14;60
5;31;33923;5449;1.460;14;61
7;31;33049;7459;1.442;14;60
11;31;35897;12721;1.389;14;59
13;31;35393;14831;1.397;14;59
17;31;33773;18517;1.425;14;60
19;31;33997;20809;1.442;14;60
23;31;34841;25847;1.417;14;59
29;31;33857;31667;1.426;14;60
3;37;32569;2633;1.476;14;61
5;37;34729;4691;1.419;14;59
7;37;34141;6451;1.439;14;60
11;37;34549;10267;1.410;13;60
13;37;35117;12329;1.423;14;60
17;37;34631;15907;1.429;14;63
19;37;34253;17581;1.435;14;60
23;37;32909;20443;1.453;14;61
29;37;33403;26177;1.445;14;60
31;37;34361;28771;1.413;14;59
3;41;34297;2503;1.424;14;60
5;41;33587;4093;1.430;14;60
7;41;34583;5903;1.404;13;59
11;41;32687;8761;1.440;14;60
13;41;34457;10909;1.439;14;60
17;41;34337;14221;1.425;14;59
19;41;32843;15217;1.476;14;62
23;41;35339;19819;1.423;14;59
29;41;34273;24239;1.436;14;60
31;41;34703;26237;1.414;14;60
37;41;33343;30089;1.456;14;61
3;43;34807;2423;1.417;14;59
5;43;35527;4129;1.413;14;60
7;43;33287;5417;1.467;14;61
11;43;33863;8647;1.436;14;60
13;43;34499;10427;1.418;14;78
17;43;34549;13649;1.431;14;60
19;43;33749;14897;1.429;13;60
23;43;34361;18371;1.409;14;59
29;43;33149;22349;1.452;14;61
31;43;34457;24821;1.428;14;60
37;43;32377;27851;1.482;14;81
41;43;33623;32057;1.424;13;59
3;47;33757;2153;1.459;14;61
5;47;33353;3547;1.445;14;61
7;47;34687;5153;1.414;13;59
11;47;34519;8069;1.417;14;60
13;47;34549;9551;1.412;13;59
17;47;33613;12149;1.461;14;61
19;47;33863;13687;1.443;14;60
23;47;35393;17317;1.402;14;59
29;47;34747;21433;1.432;13;60
31;47;34871;22993;1.409;14;59
37;47;34729;27337;1.425;14;59
41;47;33773;29453;1.438;14;60
43;47;31253;28591;1.487;14;62
3;53;33623;1901;1.430;14;59
5;53;34469;3229;1.430;13;60
7;53;34883;4603;1.408;14;59
11;53;34511;7159;1.412;13;59
13;53;32587;7963;1.453;14;60
17;53;34297;10993;1.432;13;80
19;53;33599;12043;1.443;14;64
23;53;34337;14897;1.415;14;59
29;53;34877;19081;1.424;14;61
31;53;34913;20411;1.406;13;59
37;53;34429;24029;1.417;13;60
41;53;34499;26683;1.418;14;59
43;53;32261;26171;1.488;14;62
47;53;34253;30367;1.437;14;79
3;59;33503;1699;1.432;14;61
5;59;34781;2939;1.424;14;60
7;59;35531;4211;1.403;14;59
11;59;34487;6427;1.420;14;59
13;59;33563;7393;1.453;14;61
17;59;34019;9791;1.440;14;60
19;59;33967;10937;1.447;14;60
23;59;33637;13109;1.438;14;60
29;59;34487;16943;1.424;14;59
31;59;32687;17167;1.480;14;61
37;59;35353;22159;1.404;14;59
41;59;34499;23971;1.431;14;60
43;59;34039;24799;1.445;14;60
47;59;32027;25471;1.499;14;62
53;59;34019;30557;1.449;14;61
3;61;35059;1723;1.418;14;60
5;61;34351;2803;1.416;13;60
7;61;35099;4021;1.412;14;59
11;61;34019;6133;1.442;14;60
13;61;35023;7459;1.406;14;88
17;61;35201;9803;1.414;14;61
19;61;34679;10799;1.425;14;101
23;61;34039;12829;1.441;13;60
29;61;33871;16097;1.446;14;60
31;61;34147;17351;1.427;14;61
37;61;34583;20963;1.412;14;59
41;61;32999;22171;1.452;14;62
43;61;33857;23857;1.431;14;98
47;61;34897;26881;1.431;14;60
53;61;33647;29231;1.434;14;60
59;61;32999;31907;1.454;14;60
3;67;32999;1471;1.455;14;61
5;67;35171;2621;1.403;14;59
7;67;33851;3533;1.463;14;61
11;67;34607;5669;1.437;14;60
13;67;35081;6803;1.416;14;61
17;67;33941;8609;1.417;14;60
19;67;34673;9829;1.427;14;60
23;67;35099;12043;1.415;14;60
29;67;33679;14563;1.452;14;61
31;67;34283;15859;1.437;14;60
37;67;32917;18169;1.460;13;61
41;67;33461;20443;1.441;14;61
43;67;34313;22013;1.426;14;60
47;67;33347;23371;1.452;14;61
53;67;33773;26713;1.434;14;60
59;67;35911;31607;1.395;14;58
61;67;34157;31091;1.431;14;63
3;71;34483;1453;1.423;14;59
5;71;34537;2423;1.428;14;59
7;71;33637;3313;1.428;13;60
11;71;32507;5023;1.465;14;79
13;71;35753;6529;1.403;14;59
17;71;33347;7963;1.444;14;61
19;71;35141;9397;1.410;14;59
23;71;32621;10559;1.475;14;61
29;71;33637;13729;1.429;14;60
31;71;33599;14657;1.443;14;60
37;71;34361;17903;1.396;14;59
41;71;33757;19489;1.435;14;61
43;71;34583;20939;1.413;14;59
47;71;34589;22877;1.441;14;60
53;71;35353;26387;1.418;14;59
59;71;35323;29347;1.406;14;59
61;71;35597;30577;1.401;14;59
67;71;34537;32587;1.425;14;59
3;73;34613;1409;1.418;14;59
5;73;32969;2251;1.453;14;62
7;73;33049;3167;1.448;14;61
11;73;33863;5101;1.435;14;60
13;73;34439;6131;1.456;14;60
17;73;33629;7829;1.455;14;61
19;73;34739;9029;1.421;14;60
23;73;33071;10399;1.469;14;61
29;73;33359;13249;1.460;14;61
31;73;33767;14327;1.422;14;59
37;73;32939;16693;1.490;14;62
41;73;33739;18947;1.438;14;60
43;73;33937;19979;1.432;14;61
47;73;33767;21739;1.422;14;59
53;73;33359;24203;1.435;14;60
59;73;34361;27767;1.401;13;59
61;73;33827;28229;1.443;14;60
67;73;34421;31583;1.423;14;71
71;73;33053;32143;1.447;14;60
3;79;35027;1327;1.410;14;60
5;79;34283;2161;1.432;14;60
7;79;34439;3049;1.432;14;60
11;79;34679;4817;1.416;14;59
13;79;34667;5701;1.405;14;59
17;79;33637;7237;1.428;14;60
19;79;34469;8287;1.417;14;60
23;79;34439;10009;1.433;14;60
29;79;33427;12269;1.448;13;61
31;79;33893;13297;1.445;14;61
37;79;33863;15823;1.439;14;60
41;79;32983;17107;1.450;14;60
43;79;34613;18803;1.431;14;60
47;79;33457;19891;1.457;14;61
53;79;33961;22777;1.435;14;61
59;79;32983;24631;1.465;14;60
61;79;34337;26501;1.428;14;60
67;79;33547;28447;1.458;14;61
71;79;32653;29339;1.473;14;61
73;79;34679;32029;1.429;14;64
3;83;35407;1277;1.405;14;59
5;83;32797;1973;1.451;14;60
7;83;33049;2777;1.443;14;61
11;83;33889;4483;1.431;14;60
13;83;35159;5503;1.409;14;59
17;83;34949;7151;1.412;14;59
19;83;32957;7541;1.467;14;61
23;83;32569;9013;1.470;14;61
29;83;33287;11621;1.474;14;61
31;83;33911;12659;1.448;13;60
37;83;33487;14923;1.456;14;62
41;83;33587;16573;1.438;13;60
43;83;34019;17623;1.435;14;60
47;83;31769;17987;1.483;14;62
53;83;33049;21101;1.451;14;61
59;83;32369;23003;1.465;14;61
61;83;32653;23993;1.469;14;61
67;83;33599;27109;1.437;14;61
71;83;33713;28837;1.452;14;61
73;83;33703;29641;1.454;14;61
79;83;34583;32911;1.417;14;59
3;89;34147;1129;1.415;13;60
5;89;32797;1831;1.461;14;61
7;89;33679;2647;1.443;14;73
11;89;34543;4261;1.427;13;60
13;89;34603;5051;1.419;14;60
17;89;34061;6491;1.444;14;60
19;89;34457;7351;1.422;14;79
23;89;33529;8663;1.450;14;61
29;89;34283;11161;1.431;14;60
31;89;35027;12197;1.411;13;59
37;89;34259;14221;1.403;14;59
41;89;33997;15649;1.434;14;60
43;89;33911;16127;1.445;14;60
47;89;34949;18451;1.419;14;59
53;89;34367;20443;1.434;14;60
59;89;33791;22397;1.430;14;59
61;89;34961;23957;1.404;14;59
67;89;33863;25471;1.433;13;60
71;89;35149;28031;1.414;14;79
73;89;33113;27143;1.447;14;60
79;89;32909;29209;1.458;14;61
83;89;33617;31337;1.400;14;59
3;97;34211;1051;1.448;14;60
5;97;34807;1789;1.430;14;60
7;97;33547;2417;1.446;14;60
11;97;35171;3967;1.407;14;89
13;97;32479;4349;1.474;14;61
17;97;34319;6011;1.444;14;60
19;97;32381;6337;1.491;14;64
23;97;33617;7963;1.421;14;59
29;97;33767;10093;1.423;14;59
31;97;33641;10739;1.447;14;60
37;97;34589;13187;1.425;13;60
41;97;34171;14437;1.451;14;60
43;97;31973;14159;1.484;14;62
47;97;33911;16127;1.445;14;61
53;97;34031;18593;1.448;14;80
59;97;32579;19813;1.457;14;61
61;97;34421;21617;1.417;13;60
67;97;33739;23297;1.448;14;60
71;97;33739;24691;1.435;14;60
73;97;33863;25471;1.433;13;60
79;97;34381;27997;1.419;14;59
83;97;33967;29063;1.446;14;60
89;97;33521;30727;1.441;14;60
Cols 1 and 2 are used to calculate a rough relationship between the rehash value and the index size. The next two are the first index size/rehash factor combination which averages less than 1.5 searches for a lookup with a worst case of 14 searches. Then average and worst case. Finally, the last column is the average number of clock cycles per lookup. It does not take into account the time required to read the time stamp register.
The actual memory space for the best constants (# of indices = 31253 and rehash factor = 28591) comes out to more than I initially indicated (16000*2*8 + 1.25*16000*2 => 296000 bytes). The actual size is 16000*2*8 + 31253*2 => 318506 bytes.
The fastest combination is an approximate ratio of 11/31 with an index size of 35897 and rehash value of 12721. This will average 1.389 (1 initial hash + 0.389 rehashes) with a maximum of 14 (1+13).
________ EDIT________
I removed the "goto found;" in main() to show all combinations, and it shows that much better performance is possible, of course at the expense of a larger index size. For example, the combination 57667 and 33797 yields an average of 1.192 and a maximum rehash of 6. The combination 44543 and 23399 yields a 1.249 average and 10 maximum rehashes (it saves (57667-44543)*2 = 26248 bytes of index table compared to 57667/33797).
Specialized functions with hard-coded hash index size and rehash factor will execute in 60-70% of the time compared to variables. This is probably due to the compiler (gcc 64-bit) substituting the modulo with multiplications and not having to fetch the values from memory locations as they will be coded as immediate values.
________ EDIT________
On the subject of caches I see two issues.
The first is data caching, which I don't think will be possible, because the lookup will just be a small step in some larger process, and you run the risk of the table data's cache lines being invalidated to a lesser or (probably) greater degree - if not entirely - by other data accesses in other steps of the larger process. I.e. the more code executed and data accessed in the process as a whole, the less likely it is that any pertinent lookup data will remain in the caches (this may or may not be pertinent to the OP's situation). To find an entry using (my) hashing, you will encounter two cache misses (one to load the correct part of the index, and the other to load the area containing the entry itself) for every comparison that needs to be performed. Finding an entry on the first try will cost two misses, on the second try four, etc. In my example, the 60-clock-cycle average cost per lookup implies that the table probably resided entirely in the L2 cache, with L1 serving a majority of the accesses. My x86-64 CPU has L1-3 and RAM wait states of approximately 4, 10, 40 and 100 cycles, which to me shows that RAM was completely kept out and L3 mostly.
The second is code caching, which will have a more significant impact if the code is small, tight, inlined and has few control transfers (jumps and calls). My hash routine probably resides entirely in the L1 code cache. For more normal cases, the fewer code cache line loads, the faster it will be.
Make an array of structures of key-value pairs.
Sort the array by key and put it in your program as a static array; at 16 bytes per pair it would only be 256 KB.
Then a simple binary lookup by key in your program will need at most about 14 key comparisons (log2 16000 ≈ 14) to find the right value. You should be able to approach speeds of 300 million lookups per second on a modern PC.
You can sort with qsort and search with bsearch, both std lib functions.
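A sketch of that approach with those C standard library functions (the Pair type and helper names are my own illustration):

#include <cstdlib>

struct Pair { unsigned long long key, val; };

// Three-way comparison on the key; safe for the full 64-bit range
// (no subtraction, so no overflow).
int cmp_pair(const void* a, const void* b) {
    const Pair* pa = static_cast<const Pair*>(a);
    const Pair* pb = static_cast<const Pair*>(b);
    return (pa->key > pb->key) - (pa->key < pb->key);
}

// Call qsort(pairs, n, sizeof(Pair), cmp_pair) once at startup, then:
const Pair* find_pair(const Pair* pairs, std::size_t n, unsigned long long key) {
    Pair probe = { key, 0 };
    return static_cast<const Pair*>(
        std::bsearch(&probe, pairs, n, sizeof(Pair), cmp_pair));
}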
Perform memoization, or in simple terms, cache the values you've already computed and calculate only the new ones. You should hash the input and check the cache for that result. You can even start off with a set of cache values that you think the function will be called with more often. Besides that, I don't think you need to go to any extreme as the other answers suggest. Do things simply, and when you are done with your application you can use a profiling tool to find bottlenecks.
EDIT: Some code
#include <iostream>
using namespace std;

const int MAX_SIZE = 16000;
int preCalcData[MAX_SIZE] = {};

int getPrecalculatedResult(int x){
    return preCalcData[x];
}

void setupPreCalcDataCache(){
    for(int i = 0; i < MAX_SIZE; ++i){
        preCalcData[i] = i*i; //or whatever calculation
    }
}

int main(){
    setupPreCalcDataCache();
    cout << getPrecalculatedResult(0) << endl;
    cout << getPrecalculatedResult(15999) << endl;
    return 0;
}
I wouldn't worry about performance too much. This simple example, using an array and binary search (std::lower_bound),
#include <stdint.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>

const int N = 16000;

typedef std::pair<uint64_t, uint64_t> CALC;
CALC calc[N];

static inline bool cmp_calcs(const CALC &c1, const CALC &c2)
{
    return c1.first < c2.first;
}

int main(int argc, char **argv)
{
    std::iostream::sync_with_stdio(false);
    for (int i = 0; i < N; ++i)
        calc[i] = std::make_pair(i, i);
    std::sort(&calc[0], &calc[N], cmp_calcs);

    for (long i = 0; i < 10000000; ++i) {
        int r = rand() % 16000;
        CALC *p = std::lower_bound(&calc[0], &calc[N], std::make_pair(r, 0), cmp_calcs);
        if (p->first == r)
            std::cout << "found\n";
    }
    return 0;
}
and compiled with
g++ -O2 example.cpp
does, including setup, 10,000,000 searches in about 2 seconds on my 5 year old PC.
You need to store 16 thousand values efficiently, preferably in memory. We are assuming that the computation of these values is more time consuming than accessing them from storage.
You have at your disposal many different data structures to get the job done, including databases. If you access these values in queriable chunks, then the DB overhead may very well be absorbed and spread across your processing.
You mentioned map and hashmap (or hashtable) already in your question tags, but these are probably not the best possible answers to your problem, although they could do a fair job, provided that the hashing function isn't more expensive than directly computing the target UINT64 value, which has to be your reference benchmark.
Van Emde Boas Trees
Many variants of B-Trees (used extensively in database engines, high performance filesystems),
Tries
are probably much better suited. Having some experience with them, I would probably go for a B-tree: they support serialization fairly well. That should let you prepare your dataset in advance in a different program. VEB trees have very good access time (O(log log n)), but I don't know how easily they can be serialized.
Later on, if you need even more performance, it would also be interesting to know usage patterns of your "database" to figure out what caching techniques you could implement on top of the store.
Using a sorted array of std::pair is better than any map for speed.
But if I were you, I would first use a std::list to store the data; once I had it all, I would move it into a simple vector. Then retrieval goes very fast if you implement a simple binary search yourself.

When does a map become better than two vectors?

A map does binary search on all its elements, which has logarithmic complexity; this means that for a small enough collection of objects, a map will perform worse than two vectors using linear search.
How large should the object (key) pool be for a map to start performing better than two vectors?
Edit: A more generalized version of the question: how large should the object pool be for binary search to perform better than linear search?
I'm using strings as keys and the values are pointers, but my particular use case probably shouldn't matter. I'm more curious to understand how to use the two tools properly.
If you'll forgive my saying so, most of the answers sound to me like various ways of saying: "I don't know", without actually admitting that they don't know. While I generally agree with the advice they've given, none of them seems to have attempted to directly address the question you asked: what is the break-even point.
To be fair, when I read the question, I didn't really know either. It's one of those things where we all know the basics: for a small enough collection, a linear search will probably be faster, and for a large enough collection, a binary search will undoubtedly be faster. I, however, have never really had much reason to investigate where the break-even point would actually be. Your question got me curious, however, so I decided to write a bit of code to get at least some idea.
This code is definitely a very quick hack (lots of duplication, only currently supports one type of key, etc.) but at least it might give some idea of what to expect:
#include <set>
#include <vector>
#include <string>
#include <cstdlib>   // rand()
#include <time.h>
#include <iostream>
#include <algorithm>

int main() {
    static const int reps = 100000;
    std::vector<int> data_vector;
    std::set<int> data_set;
    std::vector<int> search_keys;

    for (int size=10; size<100; size += 10) {
        data_vector.clear();
        data_set.clear();
        search_keys.clear();

        int num_keys = size / 10;

        for (int i=0; i<size; i++) {
            int key = rand();
            if (i % num_keys == 0)
                search_keys.push_back(key);
            data_set.insert(key);
            data_vector.push_back(key);
        }

        // Search for a few keys that probably aren't present.
        for (int i=0; i<10; i++)
            search_keys.push_back(rand());

        long total_linear =0, total_binary = 0;

        clock_t start_linear = clock();
        for (int k=0; k<reps; k++) {
            for (int j=0; j<search_keys.size(); j++) {
                std::vector<int>::iterator pos = std::find(data_vector.begin(), data_vector.end(), search_keys[j]);
                if (pos != data_vector.end())
                    total_linear += *pos;
            }
        }
        clock_t stop_linear = clock();

        clock_t start_binary = clock();
        for (int k=0; k<reps; k++) {
            for (int j=0; j<search_keys.size(); j++) {
                std::set<int>::iterator pos = data_set.find(search_keys[j]);
                if (pos != data_set.end())
                    total_binary += *pos;
            }
        }
        clock_t stop_binary = clock();

        std::cout << "\nignore: " << total_linear << " ignore also: " << total_binary << "\n";
        std::cout << "For size = " << size << "\n";
        std::cout << "\tLinear search time = " << stop_linear - start_linear << "\n";
        std::cout << "\tBinary search time = " << stop_binary - start_binary << "\n";
    }
    return 0;
}
Here are the results I get running this on my machine:
ignore: 669830816 ignore also: 669830816
For size = 10
Linear search time = 37
Binary search time = 45
ignore: 966398112 ignore also: 966398112
For size = 20
Linear search time = 60
Binary search time = 47
ignore: 389263520 ignore also: 389263520
For size = 30
Linear search time = 83
Binary search time = 63
ignore: -1561901888 ignore also: -1561901888
For size = 40
Linear search time = 106
Binary search time = 65
ignore: -1741869184 ignore also: -1741869184
For size = 50
Linear search time = 127
Binary search time = 69
ignore: 1130798112 ignore also: 1130798112
For size = 60
Linear search time = 155
Binary search time = 75
ignore: -1949669184 ignore also: -1949669184
For size = 70
Linear search time = 173
Binary search time = 83
ignore: -1991069184 ignore also: -1991069184
For size = 80
Linear search time = 195
Binary search time = 90
ignore: 1750998112 ignore also: 1750998112
For size = 90
Linear search time = 217
Binary search time = 79
Obviously that's not the only possible test (or even close to the best one possible), but it seems to me that even a little hard data is better than none at all.
Edit: I would note for the record that I see no reason code using two vectors (or a vector of pairs) can't be just as clean as code using a set or map. Obviously you'd want to put the code for it into a small class of its own, but I see no reason at all that it couldn't present precisely the same interface to the outside world that map does. In fact, I'd probably just call it a "tiny_map" (or something on that order).
One of the basic points of OO programming (and it remains the case in generic programming, to at least some degree) is to separate the interface from the implementation. In this case, you're talking about purely an implementation detail that need not affect the interface at all. In fact, if I were writing a standard library, I'd be tempted to incorporate this as a "small map optimization" analogous to the common small string optimization. Basically, just allocate an array of 10 (or so) objects of value_type directly in the map object itself, and use them when/if the map is small, then move the data to a tree if it grows large enough to justify it. The only real question is whether people use tiny maps often enough to justify the work.
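To make that idea concrete, here is a minimal sketch of such a "tiny_map" (the class and its details are my own illustration, not a standard component):

#include <cstddef>
#include <utility>
#include <vector>

// A map-like interface over a vector of pairs searched linearly:
// fine for small element counts, where linear search beats a tree.
template <typename K, typename V>
class tiny_map {
    std::vector<std::pair<K, V>> items_;
public:
    V* find(const K& key) {
        for (auto& kv : items_)
            if (kv.first == key) return &kv.second;
        return nullptr;                  // not found
    }
    V& operator[](const K& key) {
        if (V* v = find(key)) return *v;
        items_.emplace_back(key, V());   // insert default, like std::map
        return items_.back().second;
    }
    std::size_t size() const { return items_.size(); }
};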
The code for map will be much cleaner than the code for two vectors; that should be your primary concern. Only once you've determined that the performance of map is a problem in your own application should you worry about the difference, and at that point you should benchmark it yourself.
A map will never equal a sorted vector's search speed, because the vector's memory is better packed and memory access is more straightforward.
However, that's only half the tale. As soon as you start modifying the container, node-based containers (which do not need to move their elements around to accommodate insertions in the middle) gain a theoretical advantage. As with searching, better cache locality still gives the vector an edge for a small number of elements.
Exact measurements are hard to provide and depend on specific optimizations, usage context and machine, so I would advise simply going by functionality. A map has a natural interface for searching by key, so using it will make your life easier.
If you really find yourself measuring and need to really squeeze performance out of the use case, then you probably need a more specialized data structure. Look up B-Trees.