Random memory accesses are expensive? - c++

While optimizing my Connect Four game engine I reached a point where further improvements can only be minimal, because much of the CPU time is spent on the statement TableEntry te = mTable[idx + i] in the following code sample.
TableEntry getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried replacing the vector by allocating the table with new, without any speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist Hashing function) could be expensive, but really that much? Do you have suggestions to improve the function?
Thank you!
Edit: BUCKETSIZE has a value of 4. It's used as the collision strategy. The size of one TableEntry is 16 bytes; the struct looks like this:
struct TableEntry
{                                         // Old  New
    unsigned __int64 lock;                //   8    8
    enum { VALID, UBOUND, LBOUND } flag;  //   4    4
    short score;                          //   4    2
    char move;                            //   4    1
    char height;                          //   4    1
                                          // ---------
                                          //  24   16 bytes
    TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
Summary: The function originally needed 39 seconds. After making the changes jdehaan suggested, the function now needs 33 seconds (the program stops after 100 seconds). It's better but I think Konrad Rudolph is right and the main reason why it's that slow are the cache misses.

You are making copies of your table entry; what about using TableEntry& as the type? For the default value at the bottom, a static default TableEntry() will also do. I suppose that is where you lose much time.
const TableEntry& getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        // hopefully now less than 35% of CPU usage :-)
        const TableEntry& te = mTable[idx + i];
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return DEFAULT_TABLE_ENTRY;
}
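The snippet above leaves DEFAULT_TABLE_ENTRY and the table itself undefined. A minimal self-contained sketch of the same idea might look like the following. The table is passed in as a parameter here, fixed-width standard types replace unsigned __int64, and NOTSET is assumed to be the -127 that the question's constructor stores in height; none of that is confirmed by the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumptions: BUCKETSIZE = 4 comes from the question; NOTSET = -127 is a guess
// based on the default height in the question's constructor.
const int BUCKETSIZE = 4;
const signed char NOTSET = -127;

struct TableEntry {
    std::uint64_t lock;
    enum Flag { VALID, UBOUND, LBOUND } flag;
    short score;
    signed char move;
    signed char height;
    TableEntry() : lock(0), flag(VALID), score(0), move(0), height(NOTSET) {}
};

// One shared "not found" entry, constructed once instead of on every miss.
static const TableEntry DEFAULT_TABLE_ENTRY;

// Returns a reference into the table, so no entry is copied per probe.
const TableEntry& getTableEntry(const std::vector<TableEntry>& table, std::uint64_t lock)
{
    const std::size_t idx = static_cast<std::size_t>(lock & 0xFFFFF) * BUCKETSIZE;
    assert(idx + BUCKETSIZE <= table.size()); // the whole bucket must exist
    for (int i = 0; i < BUCKETSIZE; i++) {
        const TableEntry& te = table[idx + i];
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return DEFAULT_TABLE_ENTRY;
}
```

The returned reference stays valid only as long as the table is not resized, which holds for a fixed-size transposition table like the one described.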

How big is a table entry? I suspect it's the copy that is expensive, not the memory lookup.
Memory accesses are quicker if they are contiguous because of cache hits, but it seems you are already doing this.

The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses which can be up to three orders of magnitude slower than access to memory in cache. Three orders of magnitude – that’s a factor 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.

I have seen this exact problem with hash tables before. The problem is that continuous random access to the hash table touches all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size you will thrash. This manifests as the exact problem you are encountering: the instruction that first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.

Use pointers
TableEntry* getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    TableEntry* first = &mTable[idx];
    TableEntry* max = first + BUCKETSIZE; // one past the end of the bucket
    for (TableEntry* te = first; te < max; te++)
    {
        if (te->height == NOTSET || lock == te->lock)
            return te;
    }
    return &DEFAULT_TABLE_ENTRY; // needs a non-const static TableEntry
}

Related

c++ stack efficient for multicore application

I am trying to code a multicore Markov chain in C++, and while I am trying to take advantage of the many CPUs (up to 24) to run a different chain on each one, I have a problem picking the right container to gather the results of the numerical evaluations on each CPU. What I am trying to measure is basically the average value of an array of boolean variables. I have tried coding a wrapper around a std::vector object that looks like this:
struct densityStack {
vector<int> density; //will store the sum of boolean variables
int card; //will store the amount of elements we summed over for normalizing at the end
densityStack(int size){ //constructor taking as only parameter the size of the array, usually size = 30
density = vector<int> (size, 0);
card = 0;
}
void push_back(vector<int> & toBeAdded){ //method summing a new array (of measurements) to our stack
for(auto valStack = density.begin(), newVal = toBeAdded.begin(); valStack != density.end(); ++valStack, ++ newVal)
*valStack += *newVal;
card++;
}
void savef(const char * fname){ //method outputting into a file
ofstream out(fname);
out.precision(10);
out << card << "\n"; //saving the cardinal in first line
for(auto val = density.begin(); val != density.end(); ++val)
out << (double) *val/card << "\n";
out.close();
}
};
Then, in my code I use a single densityStack object and every time a CPU core has data (can be 100 times per second) it will call push_back to send the data back to densityStack.
My issue is that this seems to be slower than the first, raw approach, where each core stored each array of measurements in a file and I then used a Python script to average and clean them (I was unhappy with it because it stored too much information and put too much useless stress on the hard drives).
Do you see where I could be losing a lot of performance? I mean, is there an obvious source of overhead? Because to me, copying back the vector even at frequencies of 1000Hz should not be too much.
How are you synchronizing your shared densityStack instance?
From the limited info here my guess is that the CPUs are blocked waiting to write data every time they have a tiny chunk of data. If that is the issue, a simple technique to improve performance would be to reduce the number of writes. Keep a buffer of data for each CPU and write to the densityStack less frequently.
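The buffering idea can be sketched as follows. This is only an illustration under assumptions: the struct mirrors the question's densityStack but adds a hypothetical merge() method, runChains is an invented driver, and the constant sample stands in for a real chain measurement. Each thread sums into its own private accumulator and takes the lock once, at the end, instead of once per sample.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Running sums of boolean measurements, modelled on the question's densityStack.
struct DensityStack {
    std::vector<long long> density; // per-position sum
    long long card;                 // number of samples summed

    explicit DensityStack(std::size_t size) : density(size, 0), card(0) {}

    // Add one measurement array. No locking: meant for a thread-private instance.
    void add(const std::vector<int>& sample) {
        assert(sample.size() == density.size());
        for (std::size_t i = 0; i < density.size(); ++i)
            density[i] += sample[i];
        ++card;
    }

    // Fold another accumulator into this one (hypothetical helper, not in the question).
    void merge(const DensityStack& other) {
        assert(other.density.size() == density.size());
        for (std::size_t i = 0; i < density.size(); ++i)
            density[i] += other.density[i];
        card += other.card;
    }
};

// Hypothetical driver: every thread accumulates locally, then merges once under a mutex.
DensityStack runChains(unsigned nThreads, int samplesPerThread, std::size_t size)
{
    DensityStack total(size);
    std::mutex totalMutex;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&total, &totalMutex, samplesPerThread, size]() {
            DensityStack local(size);
            const std::vector<int> sample(size, 1); // placeholder for one chain measurement
            for (int s = 0; s < samplesPerThread; ++s)
                local.add(sample);
            std::lock_guard<std::mutex> guard(totalMutex); // one lock per thread, not per sample
            total.merge(local);
        });
    }
    for (std::size_t t = 0; t < workers.size(); ++t)
        workers[t].join();
    return total;
}
```

A long-running chain could merge every few thousand samples rather than only at the end; the point is that the shared object is touched rarely.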

Forwards vs Backwards array walking

Let me first preface this with the fact that I know these kinds of micro-optimisations are rarely cost-effective. I'm curious about how stuff works though. For all cacheline numbers etc, I am thinking in terms of an x86-64 i5 Intel CPU. The numbers would obviously differ for different CPUs.
I've often been under the impression that walking an array forwards is faster than walking it backwards. This is, I believed, due to the fact that pulling in large amounts of data is done in a forward-facing manner - that is, if I read byte 128, then the cacheline (assuming 64 bytes in length) will read in bytes 128-191 inclusive. Consequently, if the next byte I wanted to access was at 129, it would already be in the cache.
However, after reading a bit, I'm now under the impression that it actually wouldn't matter? Because cache line alignment will pick the starting point at the closest 64-divisible boundary, if I pick byte 127 to start with, I will load bytes 64-127 inclusive, and consequently will have the data in the cache for my backwards walk. I will suffer a cache miss when transitioning from 128 to 127, but that's a consequence of where I've picked the addresses for this example more than any real-world consideration.
I am aware that the cachelines are read in as 8-byte chunks, and as such the full cacheline would have to be loaded before the first operation could begin if we were walking backwards, but I doubt it would make a hugely significant difference.
Could somebody clear up if I'm right here, and old me is wrong? I've searched for a full day and still not been able to get a final answer on this.
tl;dr : Is the direction in which we walk an array really that important? Does it actually make a difference? Did it make a difference in the past? (To 15 years back or so)
I have tested with the following basic code, and see the same results forwards and backwards:
#include <windows.h>
#include <cstring>
#include <iostream>
// Size of dataset
#define SIZE_OF_ARRAY (1024*1024*256)
// Are we walking forwards or backwards?
#define FORWARDS 1
int main()
{
// Timer setup
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
int* intArray = new int[SIZE_OF_ARRAY];
// Memset - shouldn't affect the test because my cache isn't 256MB!
// memset counts bytes, so scale by sizeof(int) to clear the whole array
memset(intArray, 0, SIZE_OF_ARRAY * sizeof(int));
// Arbitrary numbers for break points
intArray[SIZE_OF_ARRAY - 1] = 55;
intArray[0] = 15;
int* backwardsPtr = &intArray[SIZE_OF_ARRAY - 1];
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Actual code
if (FORWARDS)
{
while (true)
{
if (*(intArray++) == 55)
break;
}
}
else
{
while (true)
{
if (*(backwardsPtr--) == 15)
break;
}
}
// Cleanup
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
std::cout << ElapsedMicroseconds.QuadPart << std::endl;
// So I can read the output
char a;
std::cin >> a;
return 0;
}
I apologise for A) Windows code, and B) Hacky implementation. It's thrown together to test a hypothesis, but doesn't prove the reasoning.
Any information about how the walking direction could make a difference, not just with cache but also other aspects, would be greatly appreciated!
Just as your experimentation shows, there is no difference. Unlike the interface between the processor and L1 cache, the memory system transacts on full cachelines, not bytes. As @user657267 pointed out, processor-specific prefetchers exist. These might prefer forward over backward, but I heavily doubt it. All modern prefetchers detect direction rather than assuming it. Furthermore, they detect stride as well. They involve incredibly complex logic, and something as easy as direction isn't going to be their downfall.
Short answer: go in either direction you want and enjoy the same performance for both!
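For anyone who wants to repeat the measurement without the Windows timer API, here is a portable sketch using std::chrono. The function names are made up for illustration, and the timings it reports will vary with hardware, compiler flags and array size, so treat it as a starting point rather than a rigorous benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Walk the array from the first element to the last.
std::uint64_t sumForward(const std::vector<int>& v) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += static_cast<std::uint64_t>(v[i]);
    return sum;
}

// Walk the array from the last element to the first.
std::uint64_t sumBackward(const std::vector<int>& v) {
    std::uint64_t sum = 0;
    for (std::size_t i = v.size(); i-- > 0; )
        sum += static_cast<std::uint64_t>(v[i]);
    return sum;
}

// Time one walk in microseconds; the sum is handed back so the work cannot be discarded.
template <typename Walk>
long long timeMicros(Walk walk, const std::vector<int>& v, std::uint64_t& result) {
    const auto start = std::chrono::steady_clock::now();
    result = walk(v);
    const auto stop = std::chrono::steady_clock::now();
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
}

// Returns {forward microseconds, backward microseconds} for the same data.
std::pair<long long, long long> compareDirections(const std::vector<int>& v) {
    std::uint64_t forward = 0, backward = 0;
    const long long tf = timeMicros(sumForward, v, forward);
    const long long tb = timeMicros(sumBackward, v, backward);
    assert(forward == backward); // both walks must have visited the same elements
    return std::make_pair(tf, tb);
}
```

Calling compareDirections on a vector much larger than the last-level cache, several times in a row, gives a fairer picture than a single run.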

How do you calculate memory access time?

I create a large boolean 2D array (5000x5000, for a total of 25 million elements at 23MB). Then I loop through and initialize every element with a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100ms.
23MB is too big to fit in the CPU's cache and I think my program is too simple to benefit from any type of compiler optimization so am I right to conclude that the program is reading 25 million elements from RAM in ~100ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/
As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it is not stripped by the compiler) is about 250 MBps, which is a very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking.) And here are the problems I see:
For any benchmark such as this, even in the simplest form, you need to distinguish random-access from sequential-access. Memory is not a random-access device (despite its name) and performs very poorly here. Your code seems to be accessing memory randomly, so you add that to your conclusion as a qualifier: that you are "reading 25 million elements from random locations from RAM in ~100ms."
Another aspect of this sort of benchmark is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware of what exactly you are measuring.
You are counting memory accesses incorrectly. Depending on the exact code your compiler is generating, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere from 4 to 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array,) read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, with the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back,) one for idx, one for locs and one for val. I don't know, you might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot.)
You need to be aware that memory is never accessed in single bytes or even words. Memory is always read and written in units of cache lines. The cache line size can differ from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64 bytes (or more) from RAM. If the data you are reading straddles a cache line boundary (some of the bytes in one cache line and some in the next), then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it might. So you have to at least multiply your calculated bandwidth by the size of your cache line.
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
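As a rough worked example of that multiplication, take the question's own figures at face value (25 million one-byte reads in 137 ms) and assume a 64-byte cache line with every single read missing the cache:

```latex
\[
\frac{25 \times 10^{6}\ \text{B}}{0.137\ \text{s}} \approx 182\ \text{MB/s}
\qquad\Longrightarrow\qquad
182\ \text{MB/s} \times 64 \approx 11.7\ \text{GB/s}
\]
```

The second number is only an upper bound under those assumptions. Every read that hits the cache pulls nothing from RAM, so the real traffic lies somewhere below it.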
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means that you have to look at your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to be able to know what your code is actually doing: whether and which of your memory accesses are optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
Portions (blocks) of memory will be stored in the processor cache at a time, which allows the processor to quickly access those items. However, that speed is perfectly reasonable for modern memory. Even the slowest DDR3 ram can transfer data at about 6 GB/s.
Cache usage is independent from program's complexity. Whenever data is read from RAM it goes into cache. Since cache has a certain size, there's always that amount of data available. If you access a memory location next to the previous, there is a good chance it will be cached already. In such case RAM is not accessed.
I would suggest reading CPU cache wikipedia entry to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; are you certain that this is evaluated from left to right? I am not. Before C++17 this is undefined behavior, because i is read and modified without sequencing. I'd suggest putting the ++i below the accessor line.
//EDIT:
Nothing is done with the value read from memory. It is quite possible that these instructions are not executed at all! Check the generated assembly, or actually use the value (for example, accumulate it and print the result) so that the compiler cannot discard the reads.
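Both suggestions can be combined in a small sketch: the two indices are read in separate statements, so there is no ++i inside the subscript, and each value read is added to a counter that the caller receives, so the loads cannot be optimized away. The names are illustrative only, and the grid is stored as one flat array instead of the question's array of row pointers, which is a deliberate simplification.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct ReadResult {
    std::size_t trueCount; // how many of the visited cells were true
    long long micros;      // elapsed wall-clock time in microseconds
};

// grid holds side*side cells (0 or 1); idx holds row/column pairs to visit.
ReadResult timedRandomReads(const std::vector<unsigned char>& grid,
                            std::size_t side,
                            const std::vector<std::size_t>& idx)
{
    assert(grid.size() == side * side);
    std::size_t trueCount = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i + 1 < idx.size(); i += 2) {
        const std::size_t row = idx[i];     // first index, read on its own line
        const std::size_t col = idx[i + 1]; // second index, no ++i in the subscript
        trueCount += grid[row * side + col]; // using the value keeps the load alive
    }
    const auto stop = std::chrono::steady_clock::now();
    ReadResult result;
    result.trueCount = trueCount;
    result.micros = static_cast<long long>(
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
    return result;
}
```

Printing trueCount after the timed section is what ties the loop to an observable result.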
No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks from memory to get pulled into the cache. Because you are storing the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.

Wrangling memory for a highly iterative c++ program

tl;dr: I need a way to better manage memory in C++ while retaining large datasets.
I am currently creating a program that outputs a database that I need for a later project, and I am struggling with memory control. I have the program written to a functional level that outputs the dataset that I am needing on a small scale, but to ramp up the size to where I need it and keep it realistic, I need to increase the number of iterations. Problem is when I do that I end up running out of memory on my computer (4gb) and it has to start pagefiling, which slows the processing considerably.
The basic outline is that I am creating stores, then creating a year's worth of transactional data for said store. When the store is created, a list of numbers is generated that represents the daily sales goals for the transactions, then transactions are randomly generated until that number is reached. This method gives some nicely organic results that I am quite happy with. Unfortunately all of those transactions have to be stored in memory until they are output to my file.
When the transactions are created they are temporarily stored in a vector, which I execute .clear() on after I store a copy of the vector in my permanent storage location.
I have started to try to move to unique_ptr's for my temporary storage, but I am unsure if they are even being deleted properly upon returning from the functions that are generating my data.
The code is something like this (I cut some superfluous code that wasn't pertinent to the question at hand):
void store::populateTransactions() {
vector<transaction> tempVec;
int iterate=0, month=0;
double dayTotal=0;
double dayCost=0;
int day=0;
for(int i=0; i<365; i++) {
if(i==dsf[month]) {
month++;
day=0;
}
while(dayTotal<dailySalesTargets[i]) {
tempVec.push_back(transaction(2013, month+1, day+1, 1.25, 1.1));
dayTotal+=tempVec[iterate].returnTotal();
dayCost+=tempVec[iterate].returnCost();
iterate++;
}
day++;
dailyTransactions.push_back(tempVec);
dailyCost.push_back(dayCost);
dailySales.push_back(dayTotal);
tempVec.clear();
dayTotal = 0;
dayCost = 0;
iterate = 0;
}
}
transaction::transaction(int year, int month, int day, double avg, double dev) {
rng random;
transTime = &testing;
testing = random.newTime(year, month, day);
itemCount = round(random.newNum('l', avg, dev,0));
if(itemCount <= 0) {
itemCount = 1;
}
for(int i=0; i<itemCount; i++) {
int select = random.newNum(0,libs::products.products.size());
items.push_back(libs::products.products[select]);
transTotal += items[i].returnPrice();
transCost += items[i].returnCost();
}
}
The reason you are running into memory issues is that, as you add elements to the vector, it eventually has to resize its internal buffer. This entails allocating a new block of memory, copying the existing data to the new block and then deleting the old buffer.
Since you know the number of elements the vector will hold beforehand, you can call the vector's reserve() member function to allocate the memory ahead of time. This will eliminate the constant resizing that you are no doubt encountering now.
For instance in the constructor for transaction you would do the following before the loop that adds data to the vector.
items.reserve(itemCount);
In store::populateTransactions() you should calculate the total number of elements the vector will hold and call tempVec.reserve() in the same way as described above. Also keep in mind that if you are using a local variable to populate the vector you will eventually need to copy it. This will cause the same issues, as the destination vector will need to allocate memory before the contents can be copied (unless you use the move semantics available in C++11). If the data needs to be returned to the calling function (as opposed to being a member variable of store) you should take it by reference as a parameter.
void store::populateTransactions(vector<transaction>& tempVec)
{
//....
}
If it is not practical to determine the number of elements ahead of time you should consider using std::deque instead. From cppreference.com
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The storage of a deque is automatically expanded and contracted as needed. Expansion of a deque is cheaper than the expansion of a std::vector because it does not involve copying of the existing elements to a new memory location.
In regard to the comment by Rafael Baptista about how the resize operation allocates memory, the following example should give you a better idea of what is going on. The amount of memory listed is the amount required during the resize.
#include <iostream>
#include <vector>
int main ()
{
std::vector<int> data;
for(int i = 0; i < 10000001; i++)
{
size_t oldCap = data.capacity();
data.push_back(1);
size_t newCap = data.capacity();
if(oldCap != newCap)
{
std::cout
<< "resized capacity from "
<< oldCap
<< " to "
<< newCap
<< " requiring " << (oldCap + newCap) * sizeof(int)
<< " total bytes of memory"
<< std::endl;
}
}
return 0;
}
When compiled with VC++10 the following results are generated when adding 10,000,001 elements to a vector. These results are specific to VC++10 and can vary between implementations of std::vector.
resized capacity from 0 to 1 requiring 4 total bytes of memory
resized capacity from 1 to 2 requiring 12 total bytes of memory
resized capacity from 2 to 3 requiring 20 total bytes of memory
resized capacity from 3 to 4 requiring 28 total bytes of memory
resized capacity from 4 to 6 requiring 40 total bytes of memory
resized capacity from 6 to 9 requiring 60 total bytes of memory
resized capacity from 9 to 13 requiring 88 total bytes of memory
resized capacity from 13 to 19 requiring 128 total bytes of memory
...snip...
resized capacity from 2362204 to 3543306 requiring 23622040 total bytes of memory
resized capacity from 3543306 to 5314959 requiring 35433060 total bytes of memory
resized capacity from 5314959 to 7972438 requiring 53149588 total bytes of memory
resized capacity from 7972438 to 11958657 requiring 79724380 total bytes of memory
This is fun! Some quick comments I can think of.
a. STL clear() does not always free the memory immediately. Instead you can use std::vector<transaction>().swap(tempVec);.
b. If you are using a compiler which has C++11 vector::emplace_back then you should remove the push_back and use it. It should be a big boost both in memory and speed. With push_back you basically have two copies of the same data floating around and you are at the mercy of allocator to return it back to the OS.
c. Any reason you cannot flush dailyTransactions to disk every once in a while? You can always serialize the vector and write it out to disk, clear the memory and you should be good again.
d. As pointed out by others, reserve should also help a lot.
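Points a and d are easy to check with a small experiment. The helper names below are invented for this sketch, and the exact capacities depend on the standard library implementation, so the code only compares capacities rather than expecting particular values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Capacity still held after clear(): the elements are gone, the buffer usually is not.
std::size_t capacityAfterClear(std::size_t n) {
    std::vector<int> v(n, 1);
    v.clear();
    return v.capacity();
}

// Capacity after swapping with an empty temporary: the old buffer dies with the temporary.
std::size_t capacityAfterSwap(std::size_t n) {
    std::vector<int> v(n, 1);
    std::vector<int>().swap(v);
    return v.capacity();
}

// Counts how often the capacity changes while pushing n elements,
// with or without a reserve(n) call up front.
std::size_t reallocationsWhileFilling(std::size_t n, bool reserveFirst) {
    std::vector<int> v;
    if (reserveFirst)
        v.reserve(n);
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = v.capacity();
        v.push_back(1);
        if (v.capacity() != before)
            ++reallocations;
    }
    assert(v.size() == n);
    return reallocations;
}
```

With reserve(n) the fill loop should never reallocate, while without it the count grows roughly with the logarithm of n on typical implementations.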

The fastest way to retrieve 16k Key-Value pairs?

OK, here's my situation :
I have a function - let's say U64 calc (U64 x) - which takes a 64-bit integer parameter, performs some CPU-intensive operation, and returns a 64-bit value
Now, given that I know ALL possible inputs (the xs) of that function beforehand (there are some 16000 though), I thought it might be better to pre-calculate them and then fetch them on demand (from an array-like structure).
The ideal situation would be to store them all in some array U64 CALC[] and retrieve them by index (the x again)
And here's the issue : I may know what the possible inputs for my calc function are, but they are most definitely NOT consecutive (e.g. not from 1 to 16000, but values that may go as low as 0 and as high as some trillions - always within a 64-bit range)
E.G.
X CALC[X]
-----------------------
123123 123123123
12312 12312312
897523 986123
etc.
And here comes my question :
How would you store them?
What workaround would you prefer?
Now, given that these values (from CALC) will have to be accessed some thousands-to-millions of times, per sec, which would be the best solution performance-wise?
Note : I'm not mentioning anything I've thought of or tried so as not to turn the answers into some debate of A vs B type-of-thing, and mostly not to influence anyone...
I would use some sort of hash function that creates an index to a u64 pair where one is the value the key was created from and the other the replacement value. Technically the index could be three bytes long (assuming 16 million -"16000 thousand" - pairs) if you need to conserve space but I'd use u32s. If the stored value does not match the value computed on (hash collision) you'd enter an overflow handler.
You need to determine a custom hashing algorithm to fit your data
Since you know the size of the data you don't need algorithms that allow the data to grow.
I'd be wary of using some standard algorithm because they seldom fit specific data
I'd be wary of using a C++ method unless you are sure the code is WYSIWYG (doesn't generate a lot of invisible calls)
Your index should be 25% larger than the number of pairs
Run through all possible inputs and determine min, max, average and standard deviation for the number of collisions and use these to determine the acceptable performance level. Then profile to achieve the best possible code.
The required memory space (using a u32 index) comes out to (4*1.25)+8+8 = 21 bytes per pair or 336 MeB, no problem on a typical PC.
________ EDIT________
I have been challenged by "RocketRoy" to put my money where my mouth is. Here goes:
The problem has to do with collision handling in a (fixed size) hash index. To set the stage:
I have a list of n entries where a field in the entry contains the value v that the hash is computed from
I have a vector of n*1.25 (approximately) indices such that the number of indices x is a prime number
A prime number y is computed which is a fraction of x
The vector is initialized to say -1 to denote unoccupied
Pretty standard stuff I think you'll agree.
The entries in the list are processed and the hash value h computed and modulo'd and used as an index into the vector and the index to the entry is placed there.
Eventually I encounter the situation where the vector entry pointed to by the index is occupied (doesn't contain -1) - voilà, a collision.
So what do I do? I keep the original h as ho, add y to h and take modulo x and get a new index into the vector. If the entry is unoccupied I use it, otherwise I continue with add y modulo x until I reach ho. In theory, this will happen because both x and y are prime numbers. In practice x is larger than n so it won't.
So the "re-hash" that RocketRoy claims is very costly is no such thing.
The tricky part with this method - as with all hashing methods - is to:
Determine a suitable value for x (becomes less tricky the larger x finally used)
Determine the algorithm a for h=a(v)%x such that the resulting h values index reasonably evenly ("randomly") into the index vector with as few collisions as possible
Determine a suitable value for y such that collisions index reasonably evenly ("randomly") into the index vector
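The scheme described above (a fixed-size index of x slots, a hash reduced modulo x, and collisions resolved by stepping y slots at a time) can be sketched compactly. This is an illustration, not the code from this answer: the class and member names are invented, the hash is simply the key itself, and the caller is assumed to pass suitable primes for slots (x) and stride (y).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pair { std::uint64_t key, value; };

// Marker for an unoccupied slot in the index.
const std::size_t kEmptySlot = static_cast<std::size_t>(-1);

class FixedHashIndex {
public:
    // Assumption: slots and stride are distinct primes with stride < slots,
    // and there are fewer pairs than slots, so every probe sequence terminates.
    FixedHashIndex(const std::vector<Pair>& pairs, std::size_t slots, std::size_t stride)
        : pairs_(pairs), index_(slots, kEmptySlot), stride_(stride)
    {
        assert(stride > 0 && stride < slots && pairs.size() < slots);
        for (std::size_t i = 0; i < pairs_.size(); ++i)
            insert(i);
    }

    // Looks up key; on success stores the replacement value and returns true.
    bool find(std::uint64_t key, std::uint64_t& value) const {
        const std::size_t home = static_cast<std::size_t>(key % index_.size());
        std::size_t h = home;
        do {
            const std::size_t i = index_[h];
            if (i == kEmptySlot) return false;          // unused position: key is absent
            if (pairs_[i].key == key) { value = pairs_[i].value; return true; }
            h += stride_;                               // collision: step y slots on
            if (h >= index_.size()) h -= index_.size(); // wrap around, modulo x
        } while (h != home);
        return false;
    }

private:
    void insert(std::size_t i) {
        const std::size_t home = static_cast<std::size_t>(pairs_[i].key % index_.size());
        std::size_t h = home;
        do {
            if (index_[h] == kEmptySlot) { index_[h] = i; return; }
            if (pairs_[index_[h]].key == pairs_[i].key) return; // duplicate: keep the first
            h += stride_;
            if (h >= index_.size()) h -= index_.size();
        } while (h != home);
    }

    std::vector<Pair> pairs_;
    std::vector<std::size_t> index_; // slot -> position in pairs_, or kEmptySlot
    std::size_t stride_;
};
```

A lookup that lands on its home slot costs one index read and one key comparison; each collision adds one more step of the same kind.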
________ EDIT________
I'm sorry I've taken so long to produce this code ... other things have had higher priorities.
Anyway, here is the code which proves that hashing has better prospects for quick lookups than a binary tree. It runs through a bunch of hashing index sizes and algorithms to aid in finding the most suitable combo for the specific data. For every algorithm the code will print the first index size such that no lookup takes longer than fourteen searches (worst case for binary searching) and an average lookup takes less than 1.5 searches.
I have a fondness for prime numbers in these types of applications, in case you haven't noticed.
There are many ways of creating a hashing algorithm with its mandatory overflow handling. I opted for simplicity assuming it will translate into speed ... and it does. On my laptop with an i5 M 480 @ 2.67 GHz an average lookup requires between 55 and 60 clock cycles (comes out to around 45 million lookups per second). I implemented a special get operation with a constant number of indices and ditto rehash value and the cycle count dropped to 40 (65 million lookups per second). If you look at the line calling getOpSpec the index i is xor'ed with 0x444 to exercise the caches to achieve more "real world"-like results.
I must again point out that the program suggests suitable combinations for the specific data. Other data may require a different combo.
The source code contains both the code for generating the 16000 unsigned long long pairs and for testing different constants (index sizes and rehash values):
#include <windows.h>
#define i8 signed char
#define i16 short
#define i32 long
#define i64 long long
#define id i64
#define u8 char
#define u16 unsigned short
#define u32 unsigned long
#define u64 unsigned long long
#define ud u64
#include <string.h>
#include <stdio.h>
u64 prime_find_next (const u64 value);
u64 prime_find_previous (const u64 value);
static inline volatile unsigned long long rdtsc_to_rax (void)
{
unsigned long long lower,upper;
asm volatile( "rdtsc\n"
: "=a"(lower), "=d"(upper));
return lower|(upper<<32);
}
static u16 index[65536];
static u64 nindeces,rehshFactor;
static struct PAIRS {u64 oval,rval;} pairs[16000] = {
#include "pairs.h"
};
struct HASH_STATS
{
u64 ninvocs,nrhshs,nworst;
} getOpStats,putOpStats;
i8 putOp (u16 index[], const struct PAIRS data[], const u64 oval, const u64 ci)
{
u64 nworst=1,ho,h,i;
i8 success=1;
++putOpStats.ninvocs;
ho=oval%nindeces;
h=ho;
do
{
i=index[h];
if (i==0xffff) /* unused position */
{
index[h]=(u16)ci;
goto added;
}
if (oval==data[i].oval) goto duplicate;
++putOpStats.nrhshs;
++nworst;
h+=rehshFactor;
if (h>=nindeces) h-=nindeces;
} while (h!=ho);
exhausted: /* should not happen */
duplicate:
success=0;
added:
if (nworst>putOpStats.nworst) putOpStats.nworst=nworst;
return success;
}
i8 getOp (u16 index[], const struct PAIRS data[], const u64 oval, u64 *rval)
{
u64 ho,h,i;
i8 success=1;
ho=oval%nindeces;
h=ho;
do
{
i=index[h];
if (i==0xffffu) goto not_found; /* unused position */
if (oval==data[i].oval)
{
*rval=data[i].rval; /* fetch the replacement value */
goto found;
}
h+=rehshFactor;
if (h>=nindeces) h-=nindeces;
} while (h!=ho);
exhausted:
not_found: /* should not happen */
success=0;
found:
return success;
}
volatile i8 stop = 0;
int main (int argc, char *argv[])
{
u64 i,rval,mulup,divdown,start;
double ave;
SetThreadAffinityMask (GetCurrentThread(), 0x00000004ull);
divdown=5; //5
while (divdown<=100)
{
mulup=3; // 3
while (mulup<divdown)
{
nindeces=16000;
while (nindeces<65500)
{
nindeces= prime_find_next (nindeces);
rehshFactor=nindeces*mulup/divdown;
rehshFactor=prime_find_previous (rehshFactor);
memset (index, 0xff, sizeof(index));
memset (&putOpStats, 0, sizeof(struct HASH_STATS));
i=0;
while (i<16000)
{
if (!putOp (index, pairs, pairs[i].oval, (u16) i)) stop=1;
++i;
}
ave=(double)(putOpStats.ninvocs+putOpStats.nrhshs)/(double)putOpStats.ninvocs;
if (ave<1.5 && putOpStats.nworst<15)
{
start=rdtsc_to_rax ();
i=0;
while (i<16000)
{
if (!getOp (index, pairs, pairs[(i^0x0444)%16000].oval, &rval)) stop=1; /* %16000: i^0x0444 can reach 16383, past the end of pairs */
++i;
}
start=rdtsc_to_rax ()-start+8000; /* 8000 is half of 16000 (pairs), for rounding */
printf ("%lu;%lu;%lu;%lu;%1.3f;%lu;%lu\n", (u32)mulup, (u32)divdown, (u32)nindeces, (u32)rehshFactor, ave, (u32) putOpStats.nworst, (u32) (start/16000ull)); /* u32 is unsigned long, hence %lu */
goto found;
}
nindeces+=2;
}
printf ("%lu;%lu\n", (u32)mulup, (u32)divdown);
found:
mulup=prime_find_next (mulup);
}
divdown=prime_find_next (divdown);
}
SetThreadAffinityMask (GetCurrentThread(), 0x0000000fu);
return 0;
}
It was not possible to include the generated pairs file (an answer is apparently limited to 30000 characters). But send a message to my inbox and I'll mail it.
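The listing also declares prime_find_next and prime_find_previous without defining them, so it will not link as shown. Below is a minimal sketch of what they could look like. It is not the original implementation: the loops in main only require that prime_find_next returns a prime strictly greater than its argument, and treating prime_find_previous as "largest prime strictly smaller" is an assumption, as is the is_prime helper.

```cpp
// Hypothetical stand-ins for the two helpers the listing declares but omits.

// Trial division; adequate for the sub-65536 values used in the benchmark.
static bool is_prime(unsigned long long n)
{
    if (n < 2) return false;
    if (n < 4) return true;                 // 2 and 3
    if (n % 2 == 0) return false;
    for (unsigned long long d = 3; d <= n / d; d += 2)
        if (n % d == 0) return false;
    return true;
}

// ASSUMPTION: returns the smallest prime strictly greater than value.
unsigned long long prime_find_next(const unsigned long long value)
{
    unsigned long long v = value;
    do { ++v; } while (!is_prime(v));
    return v;
}

// ASSUMPTION: returns the largest prime strictly smaller than value,
// or 0 if there is none.
unsigned long long prime_find_previous(const unsigned long long value)
{
    unsigned long long v = value;
    while (v > 2) {
        --v;
        if (is_prime(v)) return v;
    }
    return 0;
}
```

With these definitions the mulup and divdown loops step through 3, 5, 7, 11, ... as the first two result columns suggest. The exact rehash factors may differ from the ones printed if the original helpers treated an already-prime argument differently.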
And these are the results:
3;5;35569;21323;1.390;14;73
3;7;33577;14389;1.435;14;60
5;7;32069;22901;1.474;14;61
3;11;35107;9551;1.412;14;59
5;11;33967;15427;1.446;14;61
7;11;34583;22003;1.422;14;59
3;13;34253;7901;1.439;14;61
5;13;34039;13063;1.443;14;60
7;13;32801;17659;1.456;14;60
11;13;33791;28591;1.436;14;59
3;17;34337;6053;1.413;14;59
5;17;32341;9511;1.470;14;61
7;17;32507;13381;1.474;14;62
11;17;33301;21529;1.454;14;60
13;17;34981;26737;1.403;13;59
3;19;33791;5333;1.437;14;60
5;19;35149;9241;1.403;14;59
7;19;33377;12289;1.439;14;97
11;19;34337;19867;1.417;14;59
13;19;34403;23537;1.430;14;61
17;19;33923;30347;1.467;14;61
3;23;33857;4409;1.425;14;60
5;23;34729;7547;1.429;14;60
7;23;32801;9973;1.456;14;61
11;23;33911;16127;1.445;14;60
13;23;33637;19009;1.435;13;60
17;23;34439;25453;1.426;13;60
19;23;33329;27529;1.468;14;62
3;29;32939;3391;1.474;14;62
5;29;34543;5953;1.437;13;60
7;29;34259;8263;1.414;13;59
11;29;34367;13033;1.409;14;60
13;29;33049;14813;1.444;14;60
17;29;34511;20219;1.422;14;60
19;29;33893;22193;1.445;13;61
23;29;34693;27509;1.412;13;92
3;31;34019;3271;1.441;14;60
5;31;33923;5449;1.460;14;61
7;31;33049;7459;1.442;14;60
11;31;35897;12721;1.389;14;59
13;31;35393;14831;1.397;14;59
17;31;33773;18517;1.425;14;60
19;31;33997;20809;1.442;14;60
23;31;34841;25847;1.417;14;59
29;31;33857;31667;1.426;14;60
3;37;32569;2633;1.476;14;61
5;37;34729;4691;1.419;14;59
7;37;34141;6451;1.439;14;60
11;37;34549;10267;1.410;13;60
13;37;35117;12329;1.423;14;60
17;37;34631;15907;1.429;14;63
19;37;34253;17581;1.435;14;60
23;37;32909;20443;1.453;14;61
29;37;33403;26177;1.445;14;60
31;37;34361;28771;1.413;14;59
3;41;34297;2503;1.424;14;60
5;41;33587;4093;1.430;14;60
7;41;34583;5903;1.404;13;59
11;41;32687;8761;1.440;14;60
13;41;34457;10909;1.439;14;60
17;41;34337;14221;1.425;14;59
19;41;32843;15217;1.476;14;62
23;41;35339;19819;1.423;14;59
29;41;34273;24239;1.436;14;60
31;41;34703;26237;1.414;14;60
37;41;33343;30089;1.456;14;61
3;43;34807;2423;1.417;14;59
5;43;35527;4129;1.413;14;60
7;43;33287;5417;1.467;14;61
11;43;33863;8647;1.436;14;60
13;43;34499;10427;1.418;14;78
17;43;34549;13649;1.431;14;60
19;43;33749;14897;1.429;13;60
23;43;34361;18371;1.409;14;59
29;43;33149;22349;1.452;14;61
31;43;34457;24821;1.428;14;60
37;43;32377;27851;1.482;14;81
41;43;33623;32057;1.424;13;59
3;47;33757;2153;1.459;14;61
5;47;33353;3547;1.445;14;61
7;47;34687;5153;1.414;13;59
11;47;34519;8069;1.417;14;60
13;47;34549;9551;1.412;13;59
17;47;33613;12149;1.461;14;61
19;47;33863;13687;1.443;14;60
23;47;35393;17317;1.402;14;59
29;47;34747;21433;1.432;13;60
31;47;34871;22993;1.409;14;59
37;47;34729;27337;1.425;14;59
41;47;33773;29453;1.438;14;60
43;47;31253;28591;1.487;14;62
3;53;33623;1901;1.430;14;59
5;53;34469;3229;1.430;13;60
7;53;34883;4603;1.408;14;59
11;53;34511;7159;1.412;13;59
13;53;32587;7963;1.453;14;60
17;53;34297;10993;1.432;13;80
19;53;33599;12043;1.443;14;64
23;53;34337;14897;1.415;14;59
29;53;34877;19081;1.424;14;61
31;53;34913;20411;1.406;13;59
37;53;34429;24029;1.417;13;60
41;53;34499;26683;1.418;14;59
43;53;32261;26171;1.488;14;62
47;53;34253;30367;1.437;14;79
3;59;33503;1699;1.432;14;61
5;59;34781;2939;1.424;14;60
7;59;35531;4211;1.403;14;59
11;59;34487;6427;1.420;14;59
13;59;33563;7393;1.453;14;61
17;59;34019;9791;1.440;14;60
19;59;33967;10937;1.447;14;60
23;59;33637;13109;1.438;14;60
29;59;34487;16943;1.424;14;59
31;59;32687;17167;1.480;14;61
37;59;35353;22159;1.404;14;59
41;59;34499;23971;1.431;14;60
43;59;34039;24799;1.445;14;60
47;59;32027;25471;1.499;14;62
53;59;34019;30557;1.449;14;61
3;61;35059;1723;1.418;14;60
5;61;34351;2803;1.416;13;60
7;61;35099;4021;1.412;14;59
11;61;34019;6133;1.442;14;60
13;61;35023;7459;1.406;14;88
17;61;35201;9803;1.414;14;61
19;61;34679;10799;1.425;14;101
23;61;34039;12829;1.441;13;60
29;61;33871;16097;1.446;14;60
31;61;34147;17351;1.427;14;61
37;61;34583;20963;1.412;14;59
41;61;32999;22171;1.452;14;62
43;61;33857;23857;1.431;14;98
47;61;34897;26881;1.431;14;60
53;61;33647;29231;1.434;14;60
59;61;32999;31907;1.454;14;60
3;67;32999;1471;1.455;14;61
5;67;35171;2621;1.403;14;59
7;67;33851;3533;1.463;14;61
11;67;34607;5669;1.437;14;60
13;67;35081;6803;1.416;14;61
17;67;33941;8609;1.417;14;60
19;67;34673;9829;1.427;14;60
23;67;35099;12043;1.415;14;60
29;67;33679;14563;1.452;14;61
31;67;34283;15859;1.437;14;60
37;67;32917;18169;1.460;13;61
41;67;33461;20443;1.441;14;61
43;67;34313;22013;1.426;14;60
47;67;33347;23371;1.452;14;61
53;67;33773;26713;1.434;14;60
59;67;35911;31607;1.395;14;58
61;67;34157;31091;1.431;14;63
3;71;34483;1453;1.423;14;59
5;71;34537;2423;1.428;14;59
7;71;33637;3313;1.428;13;60
11;71;32507;5023;1.465;14;79
13;71;35753;6529;1.403;14;59
17;71;33347;7963;1.444;14;61
19;71;35141;9397;1.410;14;59
23;71;32621;10559;1.475;14;61
29;71;33637;13729;1.429;14;60
31;71;33599;14657;1.443;14;60
37;71;34361;17903;1.396;14;59
41;71;33757;19489;1.435;14;61
43;71;34583;20939;1.413;14;59
47;71;34589;22877;1.441;14;60
53;71;35353;26387;1.418;14;59
59;71;35323;29347;1.406;14;59
61;71;35597;30577;1.401;14;59
67;71;34537;32587;1.425;14;59
3;73;34613;1409;1.418;14;59
5;73;32969;2251;1.453;14;62
7;73;33049;3167;1.448;14;61
11;73;33863;5101;1.435;14;60
13;73;34439;6131;1.456;14;60
17;73;33629;7829;1.455;14;61
19;73;34739;9029;1.421;14;60
23;73;33071;10399;1.469;14;61
29;73;33359;13249;1.460;14;61
31;73;33767;14327;1.422;14;59
37;73;32939;16693;1.490;14;62
41;73;33739;18947;1.438;14;60
43;73;33937;19979;1.432;14;61
47;73;33767;21739;1.422;14;59
53;73;33359;24203;1.435;14;60
59;73;34361;27767;1.401;13;59
61;73;33827;28229;1.443;14;60
67;73;34421;31583;1.423;14;71
71;73;33053;32143;1.447;14;60
3;79;35027;1327;1.410;14;60
5;79;34283;2161;1.432;14;60
7;79;34439;3049;1.432;14;60
11;79;34679;4817;1.416;14;59
13;79;34667;5701;1.405;14;59
17;79;33637;7237;1.428;14;60
19;79;34469;8287;1.417;14;60
23;79;34439;10009;1.433;14;60
29;79;33427;12269;1.448;13;61
31;79;33893;13297;1.445;14;61
37;79;33863;15823;1.439;14;60
41;79;32983;17107;1.450;14;60
43;79;34613;18803;1.431;14;60
47;79;33457;19891;1.457;14;61
53;79;33961;22777;1.435;14;61
59;79;32983;24631;1.465;14;60
61;79;34337;26501;1.428;14;60
67;79;33547;28447;1.458;14;61
71;79;32653;29339;1.473;14;61
73;79;34679;32029;1.429;14;64
3;83;35407;1277;1.405;14;59
5;83;32797;1973;1.451;14;60
7;83;33049;2777;1.443;14;61
11;83;33889;4483;1.431;14;60
13;83;35159;5503;1.409;14;59
17;83;34949;7151;1.412;14;59
19;83;32957;7541;1.467;14;61
23;83;32569;9013;1.470;14;61
29;83;33287;11621;1.474;14;61
31;83;33911;12659;1.448;13;60
37;83;33487;14923;1.456;14;62
41;83;33587;16573;1.438;13;60
43;83;34019;17623;1.435;14;60
47;83;31769;17987;1.483;14;62
53;83;33049;21101;1.451;14;61
59;83;32369;23003;1.465;14;61
61;83;32653;23993;1.469;14;61
67;83;33599;27109;1.437;14;61
71;83;33713;28837;1.452;14;61
73;83;33703;29641;1.454;14;61
79;83;34583;32911;1.417;14;59
3;89;34147;1129;1.415;13;60
5;89;32797;1831;1.461;14;61
7;89;33679;2647;1.443;14;73
11;89;34543;4261;1.427;13;60
13;89;34603;5051;1.419;14;60
17;89;34061;6491;1.444;14;60
19;89;34457;7351;1.422;14;79
23;89;33529;8663;1.450;14;61
29;89;34283;11161;1.431;14;60
31;89;35027;12197;1.411;13;59
37;89;34259;14221;1.403;14;59
41;89;33997;15649;1.434;14;60
43;89;33911;16127;1.445;14;60
47;89;34949;18451;1.419;14;59
53;89;34367;20443;1.434;14;60
59;89;33791;22397;1.430;14;59
61;89;34961;23957;1.404;14;59
67;89;33863;25471;1.433;13;60
71;89;35149;28031;1.414;14;79
73;89;33113;27143;1.447;14;60
79;89;32909;29209;1.458;14;61
83;89;33617;31337;1.400;14;59
3;97;34211;1051;1.448;14;60
5;97;34807;1789;1.430;14;60
7;97;33547;2417;1.446;14;60
11;97;35171;3967;1.407;14;89
13;97;32479;4349;1.474;14;61
17;97;34319;6011;1.444;14;60
19;97;32381;6337;1.491;14;64
23;97;33617;7963;1.421;14;59
29;97;33767;10093;1.423;14;59
31;97;33641;10739;1.447;14;60
37;97;34589;13187;1.425;13;60
41;97;34171;14437;1.451;14;60
43;97;31973;14159;1.484;14;62
47;97;33911;16127;1.445;14;61
53;97;34031;18593;1.448;14;80
59;97;32579;19813;1.457;14;61
61;97;34421;21617;1.417;13;60
67;97;33739;23297;1.448;14;60
71;97;33739;24691;1.435;14;60
73;97;33863;25471;1.433;13;60
79;97;34381;27997;1.419;14;59
83;97;33967;29063;1.446;14;60
89;97;33521;30727;1.441;14;60
Columns 1 and 2 are used to calculate a rough relationship between the rehash value and the index size. Columns 3 and 4 are the first index size/rehash factor combination which averages less than 1.5 searches per lookup with a worst case of 14 searches. Columns 5 and 6 are the average and the worst case. Finally, the last column is the average number of clock cycles per lookup. It does not take into account the time required to read the time stamp register.
The actual memory space for the most compact constants (number of indices = 31253 and rehash factor = 28591) comes out to more than I initially indicated (16000*2*8 + 1.25*16000*2 => 296000 bytes). The actual size is 16000*2*8+31253*2 => 318506 bytes.
The combination with the fewest searches per lookup is an approximate ratio of 11/31 with an index size of 35897 and a rehash value of 12721. It averages 1.389 searches (1 initial hash + 0.389 rehashes) with a maximum of 14 (1+13).
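For anyone who wants to redo this arithmetic for other index sizes, here is a small sketch. The function name tableBytes is made up for illustration, and it assumes 8-byte unsigned long long and 2-byte unsigned short, as on the usual x86-64 compilers.

```cpp
#include <cstddef>

// Hypothetical helper: bytes used by the pair array plus the u16 index.
// ASSUMPTION: sizeof(unsigned long long) == 8, sizeof(unsigned short) == 2.
constexpr std::size_t tableBytes(std::size_t nPairs, std::size_t nIndices)
{
    return nPairs * 2 * sizeof(unsigned long long)   // two u64 values per pair
         + nIndices * sizeof(unsigned short);        // one u16 per index slot
}

// 1.25 * 16000 = 20000 index slots in the initial estimate.
static_assert(tableBytes(16000, 20000) == 296000, "initial estimate");
static_assert(tableBytes(16000, 31253) == 318506, "size with 31253 index slots");
```

The pair array dominates in both cases (256000 bytes), so even a noticeably larger index changes the total only moderately.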
________ EDIT________
I removed the "goto found;" in main () to show all combinations, and it shows that much better performance is possible, of course at the expense of a larger index size. For example, the combination 57667 and 33797 yields an average of 1.192 and a maximum of 6 rehashes. The combination 44543 and 23399 yields a 1.249 average and a maximum of 10 rehashes (it saves (57667-44543)*2=26248 bytes of index table compared to 57667/33797).
Specialized functions with a hard-coded hash index size and rehash factor will execute in 60-70% of the time needed by the version that reads them from variables. This is probably due to the compiler (gcc 64-bit) substituting the modulo with multiplications and not having to fetch the values from memory locations, as they will be coded as immediate values.
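To illustrate what such a specialized function might look like, here is a sketch that turns the index size and rehash factor into compile-time constants via template parameters. This is not the getOpSpec used for the measurements (that code is not shown); getOpFixed and the Pair struct are hypothetical names, and Pair merely mirrors the layout of struct PAIRS.

```cpp
#include <cstdint>

// Stand-in for struct PAIRS from the listing (same two 64-bit members).
struct Pair { std::uint64_t oval, rval; };

// Hypothetical variant of getOp with the index size and rehash stride known
// at compile time, so the compiler sees them as constants.
template <std::uint64_t NIndices, std::uint64_t Rehash>
bool getOpFixed(const std::uint16_t *index, const Pair *data,
                std::uint64_t oval, std::uint64_t *rval)
{
    const std::uint64_t home = oval % NIndices;  // modulo by a constant
    std::uint64_t h = home;
    do {
        const std::uint16_t i = index[h];
        if (i == 0xffffu) return false;          // unused slot: key is absent
        if (data[i].oval == oval) {
            *rval = data[i].rval;                // fetch the replacement value
            return true;
        }
        h += Rehash;                             // same probing scheme as getOp
        if (h >= NIndices) h -= NIndices;
    } while (h != home);
    return false;                                // probed every slot
}
```

A call would then look like getOpFixed<35897, 12721>(index, pairs, key, &value). Whether this reproduces the 60-70% figure depends on the compiler and optimization flags, so it is worth measuring rather than assuming.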
________ EDIT________
On the subject of caches I see two issues.
The first is data caching, which I don't think can be relied on, because the lookup will just be a small step in some larger process and you run the risk of the table data's cache lines being invalidated to a lesser or (probably) greater degree, if not entirely, by other data accesses in other steps of the larger process. I.e., the more code executed and data accessed in the process as a whole, the less likely it is that any pertinent lookup data will remain in the caches (this may or may not be pertinent to the OP's situation). To find an entry using (my) hashing you will encounter two cache misses (one to load the correct part of the index, and the other to load the area containing the entry itself) for every comparison that needs to be performed. Finding an entry on the first try will have cost two misses, on the second try four, etc. In my example the 60 clock cycle average cost per lookup implies that the table probably resided entirely in the L2 cache, with L1 serving a majority of the accesses without having to go there. My x86-64 CPU has L1, L2, L3 and RAM wait states of approximately 4, 10, 40 and 100 clock cycles, which to me shows that RAM was completely kept out and L3 mostly.
The second is code caching, which will have a more significant impact if the code is small, tight, inlined and has few control transfers (jumps and calls). My hash routine probably resides entirely in the L1 code cache. For more normal cases, the fewer the code cache line loads, the faster it will be.
Make an array of structures of key/value pairs.
Sort the array by key and put it in your program as a static array; it would only be 128 kbyte.
Then a simple binary lookup by key will need at most about 14 key comparisons (log2 of 16000 is just under 14) to find the right value. It should be able to approach speeds of 300 million lookups per second on a modern PC.
You can sort with qsort and search with bsearch, both C standard library functions (declared in stdlib.h).
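A minimal sketch of the qsort/bsearch approach, assuming 64-bit keys and values. The names KeyVal, cmp_keyval, sort_table and find_key are made up for illustration.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical record type: one precomputed key/value pair.
struct KeyVal { std::uint64_t key, val; };

// Comparison callback shared by qsort and bsearch. It compares instead of
// subtracting, because a 64-bit difference does not fit in an int.
static int cmp_keyval(const void *a, const void *b)
{
    const KeyVal *x = static_cast<const KeyVal *>(a);
    const KeyVal *y = static_cast<const KeyVal *>(b);
    return (x->key > y->key) - (x->key < y->key);
}

// Sort once, e.g. at startup or in a generator that emits the static array.
void sort_table(KeyVal *table, std::size_t n)
{
    std::qsort(table, n, sizeof table[0], cmp_keyval);
}

// Returns a pointer to the matching entry, or a null pointer if key is absent.
const KeyVal *find_key(const KeyVal *table, std::size_t n, std::uint64_t key)
{
    const KeyVal probe = { key, 0 };
    return static_cast<const KeyVal *>(
        std::bsearch(&probe, table, n, sizeof table[0], cmp_keyval));
}
```

Both functions call the comparison through a function pointer, so a hand-written or templated binary search may well be faster. This version mainly shows how little code the idea needs.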
Perform memoization, or in simple terms, cache the values you've computed already and calculate only the new ones. You should hash the input and check the cache for that result. You can even start off with a set of cached values that you think the function will be called with most often. Besides that, I don't think you need to go to any extreme as the other answers suggest. Keep things simple, and when you are done with your application you can use a profiling tool to find bottlenecks.
EDIT: Some code
#include <iostream>
#include <ctime>
using namespace std;
const int MAX_SIZE = 16000;
int preCalcData[MAX_SIZE] = {};
int getPrecalculatedResult(int x){
return preCalcData[x];
}
void setupPreCalcDataCache(){
for(int i = 0; i < MAX_SIZE; ++i){
preCalcData[i] = i*i; //or whatever calculation
}
}
int main(){
setupPreCalcDataCache();
cout << getPrecalculatedResult(0) << endl;
cout << getPrecalculatedResult(15999) << endl;
return 0;
}
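The code above fills the whole table up front. If the valid inputs are not a dense range 0..N-1, the "hash the input and check the cache" variant could look roughly like the sketch below. expensiveCalc is a placeholder for the real computation, and calcCalls exists only to make the caching visible.

```cpp
#include <cstdint>
#include <unordered_map>

int calcCalls = 0;  // counts real computations, for demonstration only

// Hypothetical stand-in for the expensive calculation.
std::uint64_t expensiveCalc(std::uint64_t x)
{
    ++calcCalls;
    return x * x + 1;
}

// Computes each distinct input once and serves repeats from the cache.
// Note: the function-local static cache is not safe for concurrent callers.
std::uint64_t memoizedCalc(std::uint64_t x)
{
    static std::unordered_map<std::uint64_t, std::uint64_t> cache;
    const auto it = cache.find(x);
    if (it != cache.end()) return it->second;   // cache hit
    const std::uint64_t result = expensiveCalc(x);
    cache.emplace(x, result);                   // remember it for next time
    return result;
}
```

This only pays off when the calculation costs clearly more than a hash lookup, which is something to confirm with the profiler.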
I wouldn't worry about performance too much. This simple example, using an array and binary search lower_bound
#include <stdint.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <memory>
const int N = 16000;
typedef std::pair<uint64_t, uint64_t> CALC;
CALC calc[N];
static inline bool cmp_calcs(const CALC &c1, const CALC &c2)
{
return c1.first < c2.first;
}
int main(int argc, char **argv)
{
std::iostream::sync_with_stdio(false);
for (int i = 0; i < N; ++i)
calc[i] = std::make_pair(i, i);
std::sort(&calc[0], &calc[N], cmp_calcs);
for (long i = 0; i < 10000000; ++i) {
int r = rand() % 16000;
CALC *p = std::lower_bound(&calc[0], &calc[N], std::make_pair(r, 0), cmp_calcs);
if (p->first == r)
std::cout << "found\n";
}
return 0;
}
and compiled with
g++ -O2 example.cpp
does, including setup, 10,000,000 searches in about 2 seconds on my 5 year old PC.
You need to store 16 thousand values efficiently, preferably in memory. We are assuming that the computation of these values is more time consuming than accessing them from storage.
You have at your disposal many different data structures to get the job done, including databases. If you access these values in queryable chunks, then the DB overhead may very well be absorbed and spread across your processing.
You mentioned map and hashmap (or hashtable) already in your question tags, but these are probably not the best possible answers for your problem, although they could do a fair job, provided that the hashing function isn't more expensive than the direct computation of the target UINT64 value, which has to be your reference benchmark.
The following are probably much better suited:
Van Emde Boas trees
Many variants of B-trees (used extensively in database engines and high-performance filesystems)
Tries
Having some experience with them, I would probably go for a B-tree: B-trees support serialization fairly well, which should let you prepare your dataset in advance in a different program. VEB trees have a very good access time (O(log log n)), but I don't know how easily they can be serialized.
Later on, if you need even more performance, it would also be interesting to know usage patterns of your "database" to figure out what caching techniques you could implement on top of the store.
Storing the data as plain std::pair elements is better for speed than any kind of map.
If I were you, I would first use a std::list to collect the data. Once I had all of it, I would move it into a simple vector sorted by key; retrieval is then very fast if you implement a simple binary search yourself.
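A rough sketch of that idea follows. The names Entry, freeze and lookup are invented for illustration, and the explicit sort is an added assumption, since a binary search only works on sorted data.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <list>
#include <utility>
#include <vector>

typedef std::pair<std::uint64_t, std::uint64_t> Entry;  // (key, value)

// Copy the collected entries into contiguous storage and sort them by key.
std::vector<Entry> freeze(const std::list<Entry> &collected)
{
    std::vector<Entry> table(collected.begin(), collected.end());
    std::sort(table.begin(), table.end());  // pair compares .first, then .second
    return table;
}

// Hand-written binary search; returns true and stores the value if key exists.
bool lookup(const std::vector<Entry> &table, std::uint64_t key,
            std::uint64_t *value)
{
    std::size_t lo = 0, hi = table.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (table[mid].first < key) lo = mid + 1;
        else hi = mid;
    }
    if (lo == table.size() || table[lo].first != key) return false;
    *value = table[lo].second;
    return true;
}
```

If the number of entries is known in advance, pushing straight into a vector with reserve() would avoid the intermediate list altogether.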