Let me first preface this with the fact that I know these kinds of micro-optimisations are rarely cost-effective; I'm just curious about how this stuff works. For all the cache-line numbers etc. I am thinking in terms of an x86-64 Intel i5 CPU. The numbers would obviously differ for different CPUs.
I've often been under the impression that walking an array forwards is faster than walking it backwards. This is, I believed, because pulling in large amounts of data is done in a forward-facing manner - that is, if I read byte 128, then the cache line (assuming 64 bytes in length) will read in bytes 128-191 inclusive. Consequently, if the next byte I wanted to access was byte 129, it would already be in the cache.
However, after reading a bit, I'm now under the impression that it actually wouldn't matter. Because cache-line alignment picks the starting point at the closest 64-divisible boundary, if I pick byte 127 to start with, I will load bytes 64-127 inclusive, and consequently will have the data in the cache for my backwards walk. I will suffer a cache miss when transitioning from byte 128 to byte 127, but that's a consequence of where I've picked the addresses for this example more than any real-world consideration.
I am aware that cache lines are read in as 8-byte chunks, and as such the full cache line would have to be loaded before the first operation could begin if we were walking backwards, but I doubt it would make a hugely significant difference.
Could somebody clear up whether I'm right here and my old belief is wrong? I've searched for a full day and still haven't been able to get a final answer on this.
tl;dr: Is the direction in which we walk an array really that important? Does it actually make a difference? Did it make a difference in the past (up to 15 years back or so)?
I have tested with the following basic code, and see the same results forwards and backwards:
#include <windows.h>
#include <cstring> // memset
#include <iostream>
// Size of dataset
#define SIZE_OF_ARRAY 1024*1024*256
// Are we walking forwards or backwards?
#define FORWARDS 1
int main()
{
    // Timer setup
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
    LARGE_INTEGER Frequency;

    int* intArray = new int[SIZE_OF_ARRAY];
    // Memset - shouldn't affect the test because my cache isn't 1GB!
    memset(intArray, 0, SIZE_OF_ARRAY * sizeof(int));

    // Arbitrary numbers for break points
    intArray[SIZE_OF_ARRAY - 1] = 55;
    intArray[0] = 15;
    int* backwardsPtr = &intArray[SIZE_OF_ARRAY - 1];

    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);

    // Actual code
    if (FORWARDS)
    {
        while (true)
        {
            if (*(intArray++) == 55)
                break;
        }
    }
    else
    {
        while (true)
        {
            if (*(backwardsPtr--) == 15)
                break;
        }
    }

    // Cleanup
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    std::cout << ElapsedMicroseconds.QuadPart << std::endl;

    // So I can read the output
    char a;
    std::cin >> a;
    return 0;
}
I apologise for A) Windows code, and B) Hacky implementation. It's thrown together to test a hypothesis, but doesn't prove the reasoning.
Any information about how the walking direction could make a difference, not just with cache but also other aspects, would be greatly appreciated!
Just as your experiment shows, there is no difference. Unlike the interface between the processor and the L1 cache, the memory system transacts in full cache lines, not bytes. As @user657267 pointed out, processor-specific prefetchers exist. These might prefer forward over backward, but I heavily doubt it: all modern prefetchers detect the direction rather than assuming it, and they detect the stride as well. They involve incredibly complex logic, and something as simple as direction isn't going to be their downfall.
Short answer: go in either direction you want and enjoy the same performance for both!
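For anyone who wants to reproduce the experiment without the Windows-specific timer, here is a minimal portable sketch of my own (not the asker's code) that times a forward and a backward sweep over a large array; on current hardware the two runs should come out essentially equal.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> data(256 * 1024 * 1024, 1);   // ~1 GB, far larger than any cache

    auto time_sum = [&data](bool forwards) {
        long long sum = 0;
        auto t0 = std::chrono::high_resolution_clock::now();
        if (forwards)
            for (std::size_t i = 0; i < data.size(); ++i) sum += data[i];
        else
            for (std::size_t i = data.size(); i-- > 0; ) sum += data[i];
        auto t1 = std::chrono::high_resolution_clock::now();
        std::printf("%s: sum=%lld, %lld ms\n",
            forwards ? "forwards " : "backwards", sum,
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    };

    time_sum(true);    // forward sweep
    time_sum(false);   // backward sweep
    return 0;
}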
Related
I have implemented a pixel mask class used for checking for pixel-perfect collision. I am using SFML, so the implementation is fairly straightforward: loop through each pixel of the image and decide whether it's true or false based on its transparency value. Here is the code I have used:
// Create an Image from the given texture
sf::Image image(texture.copyToImage());

// Measure the time this function takes
sf::Clock clock;
sf::Time time = sf::Time::Zero;
clock.restart();

// Reserve memory for the pixelMask vector to avoid repeated allocation
pixelMask.reserve(image.getSize().x);

// Loop through every pixel of the texture
for (unsigned int i = 0; i < image.getSize().x; i++)
{
    // Create the mask for one line
    std::vector<bool> tempMask;
    // Reserve memory for the tempMask vector to avoid repeated allocation
    tempMask.reserve(image.getSize().y);

    for (unsigned int j = 0; j < image.getSize().y; j++)
    {
        // If the pixel is not transparent
        if (image.getPixel(i, j).a > 0)
            // Some part of the texture is there --> push back true
            tempMask.push_back(true);
        else
            // The user can't see this part of the texture --> push back false
            tempMask.push_back(false);
    }
    pixelMask.push_back(tempMask);
}

time = clock.restart();
std::cout << std::endl << "The creation of the pixel mask took: " << time.asMicroseconds() << " microseconds (" << time.asSeconds() << ")";
I have used an instance of sf::Clock to measure the time.
My problem is that this function takes ages (e.g. 15 seconds) for larger images (e.g. 1280x720) - interestingly, only in debug mode. When compiling the release version, the same texture/image only takes 0.1 seconds or less.
I have tried to reduce memory allocations by using the resize() method, but it didn't change much. I know that looping through almost 1 million pixels is slow, but it should not be 15 seconds slow, should it?
Since I want to test my code in debug mode (for obvious reasons) and I don't want to wait 5 minutes until all the pixel masks have been created, what I am looking for is basically a way to:
Either optimise the code (have I missed something obvious?), or
Get something similar to the release performance in debug mode.
Thanks for your help!
Optimizing For Debug
Optimizing for debug builds is generally a very counter-productive idea. It can even lead you to optimize for debug in a way that not only makes the code harder to maintain, but may even slow down release builds. Debug builds in general are going to be much slower to run. Even with the flattest kind of C code I write, which doesn't give an optimizer much to do beyond reasonable register allocation and instruction selection, it's normal for the debug build to take 20 times longer to finish an operation. That's just something to accept rather than change too much.
That said, I can understand the temptation to do so at times. Sometimes you want to debug a certain part of the code, only for the other operations in the software to take ages, requiring you to wait a long time before you can even get to the code you are interested in tracing through. I find in those cases that it's helpful, if you can, to separate debug-mode input sizes from release mode (e.g. have the debug mode work with an input that is 1/10th of the original size). That does cause discrepancies between release and debug, which is a negative, but the positives sometimes outweigh the negatives from a productivity standpoint. Another strategy is to build parts of your code in release and only debug the parts you're interested in, like building a plugin in debug against a host application built in release.
Approach at Your Own Peril
With that aside, if you really want to make your debug builds run faster and accept all the risks associated with that, then the main way is to pose less work for your compiler to optimize away. That's typically going to be flatter code with more plain old data types, fewer function calls, and so forth.
First and foremost, you might be spending a lot of time on debug mode assertions for safety. See things like checked iterators and how to disable them:
https://msdn.microsoft.com/en-us/library/aa985965.aspx
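As a hedged sketch of what that looks like (assuming a reasonably recent MSVC, where _ITERATOR_DEBUG_LEVEL controls checked iterators and iterator debugging): the macro must be defined before any standard header is included, and consistently across every translation unit and library you link, or you will get mismatch errors at link time. Normally you would set it project-wide (e.g. /D_ITERATOR_DEBUG_LEVEL=0) rather than in a source file.
#define _ITERATOR_DEBUG_LEVEL 0   // 0 = no checked iterators / iterator debugging
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v(1000, 1);
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)   // element access without per-access debug checks
        sum += v[i];
    std::cout << sum << "\n";
}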
For your case, you can easily flatten your nested loop into a single loop. There's no need to create these pixel masks with separate containers per scanline, since you can always get at your scanline data with some basic arithmetic (y*image_width or y*image_stride). So initially I'd flatten the loop. That might even help modestly for release mode. I don't know the SFML API so I'll illustrate with pseudocode.
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
for (int j = 0; j < num_pixels; ++j)
    pixelMask[j] = image.pixelAlpha(j) > 0;
Just that already might help a lot. Hopefully SFML lets you access pixels with a single index without having to specify column and row (x and y). If you want to go even further, it might help to grab the pointer to the array of pixels from SFML (also hopefully possible) and use that:
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
const unsigned int* pixels = image.getPixels();
for (int j = 0; j < num_pixels; ++j)
{
    // Assuming 32-bit pixels (should probably use uint32_t).
    // Note that no right shift is necessary when you just want
    // to check for non-zero values.
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelMask[j] = alpha > 0;
}
Also vector<bool> stores each boolean as a single bit. That saves memory but translates to some more instructions for random-access. Sometimes you can get a speed up even in release by just using more memory. I'd test both release and debug and time carefully, but you can try this:
const int num_pixels = image.w * image.h;
vector<char> pixelMask(num_pixels);
const unsigned int* pixels = image.getPixels();
char* pixelUsed = &pixelMask[0];
for (int j = 0; j < num_pixels; ++j)
{
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelUsed[j] = alpha > 0;
}
Loops are faster if they work with constants:
1. In for (unsigned int i = 0; i < image.getSize().x; i++), fetch image.getSize() into a local before the loop instead of calling it on every iteration.
2. Move the mask for one line (std::vector<bool> tempMask;) out of the loop and reuse it - the lines are all of the same length, I assume.
This should speed you up a bit; a sketch of both changes is below.
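A sketch of both suggestions applied to the loop from the question, assuming the same SFML calls (getSize(), getPixel()):
const unsigned int width  = image.getSize().x;
const unsigned int height = image.getSize().y;
pixelMask.reserve(width);

std::vector<bool> tempMask;
tempMask.reserve(height);

for (unsigned int i = 0; i < width; i++)
{
    tempMask.clear();                      // keeps the already-allocated capacity
    for (unsigned int j = 0; j < height; j++)
        tempMask.push_back(image.getPixel(i, j).a > 0);
    pixelMask.push_back(tempMask);         // copies the finished row into the 2D mask
}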
Note that compiling for debugging produces quite different machine code.
I create a large boolean 2D array (5000x5000, for a total of 25 million elements at ~23MB). Then I loop through and set every element to a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100ms.
23MB is too big to fit in the CPU's cache, and I think my program is too simple to benefit from any kind of compiler optimization, so am I right to conclude that the program is reading 25 million elements from RAM in ~100ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/
As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it is not stripped by the compiler) is about 250 MB/s, which is a very, very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking). Here are the problems I see:
For any benchmark such as this, even in the simplest form, you need to distinguish random access from sequential access. Memory is not a random-access device (despite its name) and performs very poorly here. Your code seems to be accessing memory randomly, so you should add that to your conclusion as a qualifier: that you are "reading 25 million elements from random locations in RAM in ~100ms."
Another aspect of this sort of benchmark is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware of exactly what you are measuring.
You are counting memory accesses incorrectly. Depending on the exact code your compiler is generating, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere from 4 to 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array,) read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, with the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back,) one for idx, one for locs and one for val. I don't know, you might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot.)
You need to be aware that memory is never accessed in single bytes or even words. Memory is always read and written in units of a cache line, and the cache line size can differ from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64 bytes (or more) from RAM. If the memory locations you are reading straddle a cache line boundary (some of the bytes in one cache line and some in the next) then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it might. So you have to at least multiply your calculated bandwidth by the size of your cache line.
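As a rough worked example of that last point (assuming every single access missed and each miss pulled in a full 64-byte line): 25,000,000 accesses x 64 B is about 1.6 GB of traffic, and 1.6 GB in 137 ms is on the order of 11-12 GB/s of memory bandwidth, rather than the ~250 MB/s you get by counting one byte per access.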
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means that you have to look at your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to know what your code is actually doing: whether and which of your memory accesses are optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
Portions (blocks) of memory will be stored in the processor cache at a time, which allows the processor to quickly access those items. However, that speed is perfectly reasonable for modern memory. Even the slowest DDR3 ram can transfer data at about 6 GB/s.
Cache usage is independent of a program's complexity. Whenever data is read from RAM it goes into the cache. Since the cache has a certain size, there's always that amount of data available. If you access a memory location next to the previous one, there is a good chance it will already be cached, in which case RAM is not accessed.
I would suggest reading the CPU cache Wikipedia entry to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; - are you certain that this is evaluated from left to right? I am not; this is undefined behavior. I'd suggest moving the ++i out of the accessor line.
//EDIT:
There is also nothing done with the value read from memory. It is quite possible that these instructions are not executed at all! Check the generated assembly, or add a (void) val; statement, which should force it to be generated; see the sketch below.
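For concreteness, a sketch of both fixes applied to the timing loop from the question (variable names reused from there; the loop still advances by two idx entries per iteration, so memAccesses stays comparable):
long long sink = 0;
for (int i = 0; i + 1 < 5000*5000; i += 2) {
    // Both index reads are now sequenced before the use - no undefined behavior.
    sink += locs[*(idx + i)][*(idx + i + 1)];
    memAccesses += 2;
}
std::cout << "sink: " << sink << std::endl; // using the result keeps the reads from being optimized away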
No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks of memory to get pulled into the cache. Because you are storing the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.
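A rough, hedged way to see the cache-vs-RAM cost difference this answer relies on: sweep a buffer that fits in cache many times and compare the per-element cost against a single sweep over one that has to stream from RAM. The sizes below are guesses for typical cache sizes and the gap depends heavily on the machine and on hardware prefetching.
#include <chrono>
#include <cstdio>
#include <vector>

static double perElementNs(const std::vector<char>& buf, int sweeps)
{
    long long sum = 0;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int s = 0; s < sweeps; ++s)
        for (char c : buf) sum += c;
    auto t1 = std::chrono::high_resolution_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    if (sum == 42) std::puts("");            // cheap way to keep the loop from being optimized away
    return ns / ((double)buf.size() * sweeps);
}

int main()
{
    std::vector<char> small(256 * 1024, 1);        // should fit comfortably in L2
    std::vector<char> big(256 * 1024 * 1024, 1);   // far larger than any cache
    std::printf("small (cached): %.3f ns/elem\n", perElementNs(small, 1024));
    std::printf("big   (RAM):    %.3f ns/elem\n", perElementNs(big, 1));
    return 0;
}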
OK, here's my situation :
I have a function - let's say U64 calc (U64 x) - which takes a 64-bit integer parameter, performs some CPU-intensive operation, and returns a 64-bit value
Now, given that I know ALL possible inputs (the xs) of that function beforehand (there are some 16000 though), I thought it might be better to pre-calculate them and then fetch them on demand (from an array-like structure).
The ideal situation would be to store them all in some array U64 CALC[] and retrieve them by index (the x again)
And here's the issue: I may know what the possible inputs for my calc function are, but they are most definitely NOT consecutive (e.g. not from 1 to 16000, but values that may go as low as 0 and as high as some trillions - always within the 64-bit range).
E.G.
X          CALC[X]
--------------------
123123     123123123
12312      12312312
897523     986123
etc.
And here comes my question :
How would you store them?
What workaround would you prefer?
Now, given that these values (from CALC) will have to be accessed some thousands-to-millions of times, per sec, which would be the best solution performance-wise?
Note: I'm not mentioning anything I've thought of or tried, so as not to turn the answers into some A vs. B type of debate, and mostly not to influence anyone...
I would use some sort of hash function that creates an index into an array of u64 pairs, where one value is the key the index was computed from and the other is the replacement value. Technically the index could be three bytes long (assuming 16 million - "16000 thousand" - pairs) if you need to conserve space, but I'd use u32s. If the stored key does not match the key you hashed (a hash collision) you enter an overflow handler.
You need to determine a custom hashing algorithm to fit your data
Since you know the size of the data you don't need algorithms that allow the data to grow.
I'd be wary of using some standard algorithm because they seldom fit specific data
I'd be wary of using a C++ method unless you are sure the code is WYSIWYG (doesn't generate a lot of invisible calls)
Your index should be about 25% larger than the number of pairs
Run through all possible inputs and determine min, max, average and standard deviation for the number of collisions and use these to determine the acceptable performance level. Then profile to achieve the best possible code.
The required memory space (using a u32 index) comes out to (4*1.25)+8+8 = 21 bytes per pair, or 336 MB, no problem on a typical PC.
________ EDIT________
I have been challenged by "RocketRoy" to put my money where my mouth is. Here goes:
The problem has to do with collision handling in a (fixed size) hash index. To set the stage:
I have a list of n entries where a field in the entry contains the value v that the hash is computed from
I have a vector of n*1.25 (approximately) indices such that the number of indices, x, is a prime number
A prime number y is computed which is a fraction of x
The vector is initialized to say -1 to denote unoccupied
Pretty standard stuff I think you'll agree.
The entries in the list are processed: for each entry the hash value h is computed from v, taken modulo x, and used as an index into the vector, and the index of that entry is placed there.
Eventually I encounter the situation where the vector entry pointed to by the index is occupied (doesn't contain -1) - voilà, a collision.
So what do I do? I keep the original h as ho, add y to h, take it modulo x, and get a new index into the vector. If that entry is unoccupied I use it; otherwise I keep adding y modulo x until I arrive back at ho. In theory that wrap-around will eventually happen, because x and y are both prime numbers so the probe sequence visits every slot; in practice x is larger than n, so a free slot is found before it does.
So the "re-hash" that RocketRoy claims is very costly is no such thing.
The tricky part with this method - as with all hashing methods - is to:
Determine a suitable value for x (this becomes less tricky the larger the final x is)
Determine the hash algorithm a, for h = a(v) % x, such that the h values spread reasonably evenly ("randomly") across the index vector, with as few collisions as possible
Determine a suitable value for y such that collision probes also spread reasonably evenly ("randomly") across the index vector
________ EDIT________
I'm sorry I've taken so long to produce this code ... other things have had higher priorities.
Anyway, here is the code which proves that hashing has better prospects for quick lookups than a binary tree. It runs through a bunch of hashing index sizes and algorithms to aid in finding the most suitable combo for the specific data. For every algorithm the code will print the first index size such that no lookup takes longer than fourteen searches (worst case for binary searching) and an average lookup takes less than 1.5 searches.
I have a fondness for prime numbers in these types of applications, in case you haven't noticed.
There are many ways of creating a hashing algorithm with its mandatory overflow handling. I opted for simplicity, assuming it would translate into speed ... and it does. On my laptop with an i5 M 480 @ 2.67 GHz an average lookup requires between 55 and 60 clock cycles (which comes out to around 45 million lookups per second). I implemented a special get operation with a constant number of indices and ditto rehash value, and the cycle count dropped to 40 (65 million lookups per second). If you look at the line calling getOpSpec, the index i is xor'ed with 0x444 to exercise the caches and achieve more "real-world"-like results.
I must again point out that the program suggests suitable combinations for the specific data. Other data may require a different combo.
The source code contains both the code for generating the 16000 unsigned long long pairs and for testing different constants (index sizes and rehash values):
#include <windows.h>
#define i8 signed char
#define i16 short
#define i32 long
#define i64 long long
#define id i64
#define u8 char
#define u16 unsigned short
#define u32 unsigned long
#define u64 unsigned long long
#define ud u64
#include <string.h>
#include <stdio.h>
u64 prime_find_next (const u64 value);
u64 prime_find_previous (const u64 value);
static inline volatile unsigned long long rdtsc_to_rax (void)
{
unsigned long long lower,upper;
asm volatile( "rdtsc\n"
: "=a"(lower), "=d"(upper));
return lower|(upper<<32);
}
static u16 index[65536];
static u64 nindeces,rehshFactor;
static struct PAIRS {u64 oval,rval;} pairs[16000] = {
#include "pairs.h"
};
struct HASH_STATS
{
u64 ninvocs,nrhshs,nworst;
} getOpStats,putOpStats;
i8 putOp (u16 index[], const struct PAIRS data[], const u64 oval, const u64 ci)
{
u64 nworst=1,ho,h,i;
i8 success=1;
++putOpStats.ninvocs;
ho=oval%nindeces;
h=ho;
do
{
i=index[h];
if (i==0xffff) /* unused position */
{
index[h]=(u16)ci;
goto added;
}
if (oval==data[i].oval) goto duplicate;
++putOpStats.nrhshs;
++nworst;
h+=rehshFactor;
if (h>=nindeces) h-=nindeces;
} while (h!=ho);
exhausted: /* should not happen */
duplicate:
success=0;
added:
if (nworst>putOpStats.nworst) putOpStats.nworst=nworst;
return success;
}
i8 getOp (u16 index[], const struct PAIRS data[], const u64 oval, u64 *rval)
{
u64 ho,h,i;
i8 success=1;
ho=oval%nindeces;
h=ho;
do
{
i=index[h];
if (i==0xffffu) goto not_found; /* unused position */
if (oval==data[i].oval)
{
*rval=data[i].rval; /* fetch the replacement value */
goto found;
}
h+=rehshFactor;
if (h>=nindeces) h-=nindeces;
} while (h!=ho);
exhausted:
not_found: /* should not happen */
success=0;
found:
return success;
}
volatile i8 stop = 0;
int main (int argc, char *argv[])
{
u64 i,rval,mulup,divdown,start;
double ave;
SetThreadAffinityMask (GetCurrentThread(), 0x00000004ull);
divdown=5; //5
while (divdown<=100)
{
mulup=3; // 3
while (mulup<divdown)
{
nindeces=16000;
while (nindeces<65500)
{
nindeces= prime_find_next (nindeces);
rehshFactor=nindeces*mulup/divdown;
rehshFactor=prime_find_previous (rehshFactor);
memset (index, 0xff, sizeof(index));
memset (&putOpStats, 0, sizeof(struct HASH_STATS));
i=0;
while (i<16000)
{
if (!putOp (index, pairs, pairs[i].oval, (u16) i)) stop=1;
++i;
}
ave=(double)(putOpStats.ninvocs+putOpStats.nrhshs)/(double)putOpStats.ninvocs;
if (ave<1.5 && putOpStats.nworst<15)
{
start=rdtsc_to_rax ();
i=0;
while (i<16000)
{
if (!getOp (index, pairs, pairs[i^0x0444]. oval, &rval)) stop=1;
++i;
}
start=rdtsc_to_rax ()-start+8000; /* 8000 is half of 16000 (pairs), for rounding */
printf ("%u;%u;%u;%u;%1.3f;%u;%u\n", (u32)mulup, (u32)divdown, (u32)nindeces, (u32)rehshFactor, ave, (u32) putOpStats.nworst, (u32) (start/16000ull));
goto found;
}
nindeces+=2;
}
printf ("%u;%u\n", (u32)mulup, (u32)divdown);
found:
mulup=prime_find_next (mulup);
}
divdown=prime_find_next (divdown);
}
SetThreadAffinityMask (GetCurrentThread(), 0x0000000fu);
return 0;
}
It was not possible to include the generated pairs file (an answer is apparently limited to 30000 characters). But send a message to my inbox and I'll mail it.
And these are the results:
3;5;35569;21323;1.390;14;73
3;7;33577;14389;1.435;14;60
5;7;32069;22901;1.474;14;61
3;11;35107;9551;1.412;14;59
5;11;33967;15427;1.446;14;61
7;11;34583;22003;1.422;14;59
3;13;34253;7901;1.439;14;61
5;13;34039;13063;1.443;14;60
7;13;32801;17659;1.456;14;60
11;13;33791;28591;1.436;14;59
3;17;34337;6053;1.413;14;59
5;17;32341;9511;1.470;14;61
7;17;32507;13381;1.474;14;62
11;17;33301;21529;1.454;14;60
13;17;34981;26737;1.403;13;59
3;19;33791;5333;1.437;14;60
5;19;35149;9241;1.403;14;59
7;19;33377;12289;1.439;14;97
11;19;34337;19867;1.417;14;59
13;19;34403;23537;1.430;14;61
17;19;33923;30347;1.467;14;61
3;23;33857;4409;1.425;14;60
5;23;34729;7547;1.429;14;60
7;23;32801;9973;1.456;14;61
11;23;33911;16127;1.445;14;60
13;23;33637;19009;1.435;13;60
17;23;34439;25453;1.426;13;60
19;23;33329;27529;1.468;14;62
3;29;32939;3391;1.474;14;62
5;29;34543;5953;1.437;13;60
7;29;34259;8263;1.414;13;59
11;29;34367;13033;1.409;14;60
13;29;33049;14813;1.444;14;60
17;29;34511;20219;1.422;14;60
19;29;33893;22193;1.445;13;61
23;29;34693;27509;1.412;13;92
3;31;34019;3271;1.441;14;60
5;31;33923;5449;1.460;14;61
7;31;33049;7459;1.442;14;60
11;31;35897;12721;1.389;14;59
13;31;35393;14831;1.397;14;59
17;31;33773;18517;1.425;14;60
19;31;33997;20809;1.442;14;60
23;31;34841;25847;1.417;14;59
29;31;33857;31667;1.426;14;60
3;37;32569;2633;1.476;14;61
5;37;34729;4691;1.419;14;59
7;37;34141;6451;1.439;14;60
11;37;34549;10267;1.410;13;60
13;37;35117;12329;1.423;14;60
17;37;34631;15907;1.429;14;63
19;37;34253;17581;1.435;14;60
23;37;32909;20443;1.453;14;61
29;37;33403;26177;1.445;14;60
31;37;34361;28771;1.413;14;59
3;41;34297;2503;1.424;14;60
5;41;33587;4093;1.430;14;60
7;41;34583;5903;1.404;13;59
11;41;32687;8761;1.440;14;60
13;41;34457;10909;1.439;14;60
17;41;34337;14221;1.425;14;59
19;41;32843;15217;1.476;14;62
23;41;35339;19819;1.423;14;59
29;41;34273;24239;1.436;14;60
31;41;34703;26237;1.414;14;60
37;41;33343;30089;1.456;14;61
3;43;34807;2423;1.417;14;59
5;43;35527;4129;1.413;14;60
7;43;33287;5417;1.467;14;61
11;43;33863;8647;1.436;14;60
13;43;34499;10427;1.418;14;78
17;43;34549;13649;1.431;14;60
19;43;33749;14897;1.429;13;60
23;43;34361;18371;1.409;14;59
29;43;33149;22349;1.452;14;61
31;43;34457;24821;1.428;14;60
37;43;32377;27851;1.482;14;81
41;43;33623;32057;1.424;13;59
3;47;33757;2153;1.459;14;61
5;47;33353;3547;1.445;14;61
7;47;34687;5153;1.414;13;59
11;47;34519;8069;1.417;14;60
13;47;34549;9551;1.412;13;59
17;47;33613;12149;1.461;14;61
19;47;33863;13687;1.443;14;60
23;47;35393;17317;1.402;14;59
29;47;34747;21433;1.432;13;60
31;47;34871;22993;1.409;14;59
37;47;34729;27337;1.425;14;59
41;47;33773;29453;1.438;14;60
43;47;31253;28591;1.487;14;62
3;53;33623;1901;1.430;14;59
5;53;34469;3229;1.430;13;60
7;53;34883;4603;1.408;14;59
11;53;34511;7159;1.412;13;59
13;53;32587;7963;1.453;14;60
17;53;34297;10993;1.432;13;80
19;53;33599;12043;1.443;14;64
23;53;34337;14897;1.415;14;59
29;53;34877;19081;1.424;14;61
31;53;34913;20411;1.406;13;59
37;53;34429;24029;1.417;13;60
41;53;34499;26683;1.418;14;59
43;53;32261;26171;1.488;14;62
47;53;34253;30367;1.437;14;79
3;59;33503;1699;1.432;14;61
5;59;34781;2939;1.424;14;60
7;59;35531;4211;1.403;14;59
11;59;34487;6427;1.420;14;59
13;59;33563;7393;1.453;14;61
17;59;34019;9791;1.440;14;60
19;59;33967;10937;1.447;14;60
23;59;33637;13109;1.438;14;60
29;59;34487;16943;1.424;14;59
31;59;32687;17167;1.480;14;61
37;59;35353;22159;1.404;14;59
41;59;34499;23971;1.431;14;60
43;59;34039;24799;1.445;14;60
47;59;32027;25471;1.499;14;62
53;59;34019;30557;1.449;14;61
3;61;35059;1723;1.418;14;60
5;61;34351;2803;1.416;13;60
7;61;35099;4021;1.412;14;59
11;61;34019;6133;1.442;14;60
13;61;35023;7459;1.406;14;88
17;61;35201;9803;1.414;14;61
19;61;34679;10799;1.425;14;101
23;61;34039;12829;1.441;13;60
29;61;33871;16097;1.446;14;60
31;61;34147;17351;1.427;14;61
37;61;34583;20963;1.412;14;59
41;61;32999;22171;1.452;14;62
43;61;33857;23857;1.431;14;98
47;61;34897;26881;1.431;14;60
53;61;33647;29231;1.434;14;60
59;61;32999;31907;1.454;14;60
3;67;32999;1471;1.455;14;61
5;67;35171;2621;1.403;14;59
7;67;33851;3533;1.463;14;61
11;67;34607;5669;1.437;14;60
13;67;35081;6803;1.416;14;61
17;67;33941;8609;1.417;14;60
19;67;34673;9829;1.427;14;60
23;67;35099;12043;1.415;14;60
29;67;33679;14563;1.452;14;61
31;67;34283;15859;1.437;14;60
37;67;32917;18169;1.460;13;61
41;67;33461;20443;1.441;14;61
43;67;34313;22013;1.426;14;60
47;67;33347;23371;1.452;14;61
53;67;33773;26713;1.434;14;60
59;67;35911;31607;1.395;14;58
61;67;34157;31091;1.431;14;63
3;71;34483;1453;1.423;14;59
5;71;34537;2423;1.428;14;59
7;71;33637;3313;1.428;13;60
11;71;32507;5023;1.465;14;79
13;71;35753;6529;1.403;14;59
17;71;33347;7963;1.444;14;61
19;71;35141;9397;1.410;14;59
23;71;32621;10559;1.475;14;61
29;71;33637;13729;1.429;14;60
31;71;33599;14657;1.443;14;60
37;71;34361;17903;1.396;14;59
41;71;33757;19489;1.435;14;61
43;71;34583;20939;1.413;14;59
47;71;34589;22877;1.441;14;60
53;71;35353;26387;1.418;14;59
59;71;35323;29347;1.406;14;59
61;71;35597;30577;1.401;14;59
67;71;34537;32587;1.425;14;59
3;73;34613;1409;1.418;14;59
5;73;32969;2251;1.453;14;62
7;73;33049;3167;1.448;14;61
11;73;33863;5101;1.435;14;60
13;73;34439;6131;1.456;14;60
17;73;33629;7829;1.455;14;61
19;73;34739;9029;1.421;14;60
23;73;33071;10399;1.469;14;61
29;73;33359;13249;1.460;14;61
31;73;33767;14327;1.422;14;59
37;73;32939;16693;1.490;14;62
41;73;33739;18947;1.438;14;60
43;73;33937;19979;1.432;14;61
47;73;33767;21739;1.422;14;59
53;73;33359;24203;1.435;14;60
59;73;34361;27767;1.401;13;59
61;73;33827;28229;1.443;14;60
67;73;34421;31583;1.423;14;71
71;73;33053;32143;1.447;14;60
3;79;35027;1327;1.410;14;60
5;79;34283;2161;1.432;14;60
7;79;34439;3049;1.432;14;60
11;79;34679;4817;1.416;14;59
13;79;34667;5701;1.405;14;59
17;79;33637;7237;1.428;14;60
19;79;34469;8287;1.417;14;60
23;79;34439;10009;1.433;14;60
29;79;33427;12269;1.448;13;61
31;79;33893;13297;1.445;14;61
37;79;33863;15823;1.439;14;60
41;79;32983;17107;1.450;14;60
43;79;34613;18803;1.431;14;60
47;79;33457;19891;1.457;14;61
53;79;33961;22777;1.435;14;61
59;79;32983;24631;1.465;14;60
61;79;34337;26501;1.428;14;60
67;79;33547;28447;1.458;14;61
71;79;32653;29339;1.473;14;61
73;79;34679;32029;1.429;14;64
3;83;35407;1277;1.405;14;59
5;83;32797;1973;1.451;14;60
7;83;33049;2777;1.443;14;61
11;83;33889;4483;1.431;14;60
13;83;35159;5503;1.409;14;59
17;83;34949;7151;1.412;14;59
19;83;32957;7541;1.467;14;61
23;83;32569;9013;1.470;14;61
29;83;33287;11621;1.474;14;61
31;83;33911;12659;1.448;13;60
37;83;33487;14923;1.456;14;62
41;83;33587;16573;1.438;13;60
43;83;34019;17623;1.435;14;60
47;83;31769;17987;1.483;14;62
53;83;33049;21101;1.451;14;61
59;83;32369;23003;1.465;14;61
61;83;32653;23993;1.469;14;61
67;83;33599;27109;1.437;14;61
71;83;33713;28837;1.452;14;61
73;83;33703;29641;1.454;14;61
79;83;34583;32911;1.417;14;59
3;89;34147;1129;1.415;13;60
5;89;32797;1831;1.461;14;61
7;89;33679;2647;1.443;14;73
11;89;34543;4261;1.427;13;60
13;89;34603;5051;1.419;14;60
17;89;34061;6491;1.444;14;60
19;89;34457;7351;1.422;14;79
23;89;33529;8663;1.450;14;61
29;89;34283;11161;1.431;14;60
31;89;35027;12197;1.411;13;59
37;89;34259;14221;1.403;14;59
41;89;33997;15649;1.434;14;60
43;89;33911;16127;1.445;14;60
47;89;34949;18451;1.419;14;59
53;89;34367;20443;1.434;14;60
59;89;33791;22397;1.430;14;59
61;89;34961;23957;1.404;14;59
67;89;33863;25471;1.433;13;60
71;89;35149;28031;1.414;14;79
73;89;33113;27143;1.447;14;60
79;89;32909;29209;1.458;14;61
83;89;33617;31337;1.400;14;59
3;97;34211;1051;1.448;14;60
5;97;34807;1789;1.430;14;60
7;97;33547;2417;1.446;14;60
11;97;35171;3967;1.407;14;89
13;97;32479;4349;1.474;14;61
17;97;34319;6011;1.444;14;60
19;97;32381;6337;1.491;14;64
23;97;33617;7963;1.421;14;59
29;97;33767;10093;1.423;14;59
31;97;33641;10739;1.447;14;60
37;97;34589;13187;1.425;13;60
41;97;34171;14437;1.451;14;60
43;97;31973;14159;1.484;14;62
47;97;33911;16127;1.445;14;61
53;97;34031;18593;1.448;14;80
59;97;32579;19813;1.457;14;61
61;97;34421;21617;1.417;13;60
67;97;33739;23297;1.448;14;60
71;97;33739;24691;1.435;14;60
73;97;33863;25471;1.433;13;60
79;97;34381;27997;1.419;14;59
83;97;33967;29063;1.446;14;60
89;97;33521;30727;1.441;14;60
Cols 1 and 2 are used to calculate a rough relationship between the rehash value and the index size. The next two are the first index size / rehash factor combination which averages less than 1.5 probes per lookup with a worst case of 14 probes. Then the average and worst case. Finally, the last column is the average number of clock cycles per lookup. It does not take into account the time required to read the time stamp counter.
The actual memory space for the best constants (number of indices = 31253 and rehash factor = 28591) comes out to more than I initially indicated (16000*2*8 + 1.25*16000*2 => 296000 bytes). The actual size is 16000*2*8 + 31253*2 => 318506 bytes.
The fastest combination has an approximate ratio of 11/31, with an index size of 35897 and a rehash value of 12721. This averages 1.389 probes (1 initial hash + 0.389 rehashes) with a maximum of 14 (1+13).
________ EDIT________
I removed the "goto found;" in main () to show all combinations and it shows that much better performance is possible, of course at the expense of a larger index size. For example the combination 57667 and 33797 yields and average of 1.192 and a maximum rehash of 6. The combination 44543 and 23399 yields a 1.249 average and 10 maximum rehashes (it saves (57667-44543)*2=26468 bytes of index table compared to 57667/33797).
Specialized functions with a hard-coded hash index size and rehash factor will execute in 60-70% of the time compared to using variables. This is probably due to the compiler (gcc, 64-bit) replacing the modulo with multiplications and not having to fetch the values from memory locations, since they are coded as immediate values.
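A hedged sketch of what that specialization looks like (my own names, not from the code above; the two constants are simply one row picked from the results table):
static const unsigned long long NINDECES_C = 35897ULL;
static const unsigned long long REHASH_C   = 12721ULL;

static inline unsigned long long firstProbe(unsigned long long oval)
{
    return oval % NINDECES_C;                        // constant divisor: strength-reduced, no div needed
}

static inline unsigned long long nextProbe(unsigned long long h)
{
    h += REHASH_C;
    return (h >= NINDECES_C) ? h - NINDECES_C : h;   // same wrap-around as the general putOp/getOp
}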
________ EDIT________
On the subject of caches I see two issues.
The first is data caching, which I don't think will be possible, because the lookup will just be a small step in some larger process and you run the risk of the table data's cache lines being invalidated to a lesser or (probably) greater degree - if not entirely - by other data accesses in other steps of the larger process. I.e. the more code executed and data accessed in the process as a whole, the less likely it is that any pertinent lookup data will remain in the caches (this may or may not be pertinent to the OP's situation). To find an entry using (my) hashing you will encounter two cache misses (one to load the correct part of the index, and the other to load the area containing the entry itself) for every comparison that needs to be performed. Finding an entry on the first try will have cost two misses, the second try four, etc. In my example the 60-clock-cycle average cost per lookup implies that the table probably resided entirely in the L2 cache, with L1 hitting often enough that it didn't have to go there in a majority of the cases. My x86-64 CPU has L1-L3 and RAM wait states of approximately 4, 10, 40 and 100 cycles, which tells me that RAM was kept out of it completely and L3 mostly.
The second is code caching, which will have a more significant impact if the code is small, tight, inlined and has few control transfers (jumps and calls). My hash routine probably resides entirely in the L1 code cache. For more normal cases, the fewer code cache line loads, the faster it will be.
Make an array of structures of key/value pairs.
Sort the array by key and put it in your program as a static array; at 16 bytes per pair it would only be about 256 KB.
Then in your program a simple binary lookup by key will need on average only about 14 key comparisons to find the right value. That should be able to approach speeds of 300 million lookups per second on a modern PC.
You can sort with qsort and search with bsearch, both standard library functions; a sketch is below.
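A minimal sketch of that suggestion, using qsort and bsearch as mentioned; the struct and function names are made up for illustration and the sample rows are taken from the question's example table (the real program would hold the full pre-computed set of ~16000 pairs):
#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct Entry { uint64_t key; uint64_t val; };

// In the real program this would be the full pre-computed table, ideally shipped already sorted.
static Entry table[] = {
    { 12312ULL,  12312312ULL  },
    { 123123ULL, 123123123ULL },
    { 897523ULL, 986123ULL    },
};

static int cmpEntry(const void* a, const void* b)
{
    const Entry* ea = static_cast<const Entry*>(a);
    const Entry* eb = static_cast<const Entry*>(b);
    if (ea->key < eb->key) return -1;
    if (ea->key > eb->key) return  1;
    return 0;
}

int main()
{
    const size_t n = sizeof(table) / sizeof(table[0]);
    qsort(table, n, sizeof(Entry), cmpEntry);   // sort once at startup (or pre-sort offline)

    Entry probe = { 123123ULL, 0 };
    const Entry* hit = static_cast<const Entry*>(
        bsearch(&probe, table, n, sizeof(Entry), cmpEntry));
    if (hit)
        std::printf("CALC[%llu] = %llu\n",
            (unsigned long long)hit->key, (unsigned long long)hit->val);
    return 0;
}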
Perform memoization or, in simple terms, cache the values you've already computed and calculate only the new ones. You should hash the input and check the cache for that result. You can even start off with a set of cached values that you think the function will be called with most often. Besides that, I don't think you need to go to any extreme as the other answers suggest. Do things simply, and when you are done with your application you can use a profiling tool to find bottlenecks.
EDIT: Some code
#include <iostream>
#include <ctime>
using namespace std;

const int MAX_SIZE = 16000;
int preCalcData[MAX_SIZE] = {};

int getPrecalculatedResult(int x){
    return preCalcData[x];
}

void setupPreCalcDataCache(){
    for(int i = 0; i < MAX_SIZE; ++i){
        preCalcData[i] = i*i; //or whatever calculation
    }
}

int main(){
    setupPreCalcDataCache();
    cout << getPrecalculatedResult(0) << endl;
    cout << getPrecalculatedResult(15999) << endl;
    return 0;
}
I wouldn't worry about performance too much. This simple example, using a sorted array and binary search (std::lower_bound),
#include <stdint.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <memory>

const int N = 16000;

typedef std::pair<uint64_t, uint64_t> CALC;
CALC calc[N];

static inline bool cmp_calcs(const CALC &c1, const CALC &c2)
{
    return c1.first < c2.first;
}

int main(int argc, char **argv)
{
    std::iostream::sync_with_stdio(false);

    for (int i = 0; i < N; ++i)
        calc[i] = std::make_pair(i, i);
    std::sort(&calc[0], &calc[N], cmp_calcs);

    for (long i = 0; i < 10000000; ++i) {
        int r = rand() % 16000;
        CALC *p = std::lower_bound(&calc[0], &calc[N], std::make_pair(r, 0), cmp_calcs);
        if (p->first == r)
            std::cout << "found\n";
    }

    return 0;
}
and compiled with
g++ -O2 example.cpp
does, including setup, 10,000,000 searches in about 2 seconds on my 5 year old PC.
You need to store 16 thousand values efficiently, preferably in memory. We are assuming that computing these values is more time-consuming than fetching them from storage.
You have many different data structures at your disposal to get the job done, including databases. If you access these values in queryable chunks, then the DB overhead may very well be absorbed and spread across your processing.
You mentioned map and hashmap (or hashtable) already in your question tags, but these are probably not the best possible answers for your problem, although they could do a fair job, provided that the hashing function isn't more expensive than the direct computation of the target UINT64 value - which has to be your reference benchmark.
The following are probably much better suited:
Van Emde Boas trees
Many variants of B-trees (used extensively in database engines and high-performance filesystems)
Tries
Having some experience with them, I would probably go for a B-tree: they support serialization fairly well, which should let you prepare your dataset in advance in a different program. VEB trees have very good access time (O(log log n)), but I don't know how easily they can be serialized.
Later on, if you need even more performance, it would also be interesting to know usage patterns of your "database" to figure out what caching techniques you could implement on top of the store.
Using an array of std::pair is better than any kind of map for speed.
But if I were you, I would first use a std::list to store the data; once I had it all, I would move it into a simple vector. Retrieval then goes very fast if you implement a simple binary search yourself.
In my application 20% of the CPU time is spent reading bits (skipping) through my bit reader. Does anyone have any idea how one might make the following code faster? At any given time, I do not need more than 20 valid bits (which is why I can, in some situations, use fast_skip).
Bits are read in big-endian order, which is why the byte swap is needed.
class bit_reader
{
    std::uint32_t* m_data;
    std::size_t m_pos;
    std::uint64_t m_block;

public:
    bit_reader(void* data)
        : m_data(reinterpret_cast<std::uint32_t*>(data))
        , m_pos(0)
        , m_block(_byteswap_uint64(*reinterpret_cast<std::uint64_t*>(data)))
    {
    }

    std::uint64_t value(std::size_t n_bits = 64)
    {
        assert(m_pos + n_bits < 64);
        return (m_block << m_pos) >> (64 - n_bits);
    }

    void skip(std::size_t n_bits) // 20% cpu time
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
        if(m_pos > 31)
        {
            m_block = _byteswap_uint64(reinterpret_cast<std::uint64_t*>(++m_data)[0]);
            m_pos -= 32;
        }
    }

    void fast_skip(std::size_t n_bits)
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
    }
};
Target hardware is x64.
I see from an earlier comment that you are unpacking Huffman/arithmetic-coded streams in JPEG.
skip() and value() are really simple enough to be inlined. There's a chance that the compiler will keep the shift register and buffer pointers in registers the whole time. Marking all pointers here and in the caller with the restrict modifier might help by telling the compiler that you won't be writing the results of Huffman decoding into the bit buffer, thus allowing further optimisation.
The average length of each Huffman/arithmetic symbol is short - so, ~7 times out of 8, you won't need to top up the 64-bit shift register. Investigate giving the compiler a branch-prediction hint (see the sketch after this list).
It's unusual for any symbol in the JPEG bitstream to be longer than 32 bits. Does this allow further optimization?
One very logical reason that skip() is a heavy path is that you're calling it a lot. You are consuming an entire symbol at once rather than bit by bit here, aren't you? There are some clever tricks you can do by counting leading 0s or 1s in symbols and using table lookups.
You might consider arranging your shift register such that the next bit in the stream is the LSB. This will avoid the shifts in value().
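A hedged sketch of the branch-prediction hint around a rare "refill" path, mirroring the structure of the question's skip(). __builtin_expect is GCC/Clang-specific; on MSVC (which the _byteswap_uint64 intrinsic suggests) the closest options are profile-guided optimization or C++20 [[likely]]/[[unlikely]]. The struct and its members are made up to keep the sketch self-contained.
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

struct bit_cursor
{
    std::uint64_t block = 0;
    std::size_t   pos   = 0;

    void skip(std::size_t n_bits)
    {
        pos += n_bits;
        if (UNLIKELY(pos > 31))   // the refill branch is taken on a minority of calls
        {
            // ... refill 'block' from the stream here, as in the question's skip() ...
            pos -= 32;
        }
    }
};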
Shifting by 64 bits is definitely not a good idea. On many CPUs a shift is a slow operation.
I would advise you to change your code to byte addressing. This will limit the shift to 8 bits maximum.
In many cases you really do not need the bit value by itself, but rather to check whether it is present or not. This can be done with code like:
if (data[bit_inx/64] & mask[bit_inx % 64])
{
....
}
Try substituting this line in skip:
m_block = (m_block << 32) | _byteswap_uint32(*++m_data);
I don't know if it's the cause, or what the underlying implementation of _byteswap_uint64 looks like, but you should read Rob Pike's article on byte order. Maybe that's your answer.
Abstract: endianness is less of a problem than it's often made out to be, and the implementations of byte order swapping often come with issues. But there's a simple alternative.
[EDIT] I've got a better theory. Pasted from my comment below:
Maybe it's misalignment. 64-bit architectures love to align data on 64-bit boundaries, and when you read across alignment boundaries, it gets pretty slow. So it could be the (++m_data)[0] part: x64 is 64-bit aligned, and when you reinterpret_cast a uint32_t* to uint64_t*, you are crossing alignment boundaries about half of the time.
If your source buffers are not huge, then you should pre-process them: byte-swap the buffers before you access them through the bit_reader!
Reading from your bit_reader will be much faster then, because:
you will save some conditional instructions
the CPU caches can be used more efficiently: the reader can read straight from memory, which most probably is already loaded into the CPU cache, instead of reading from memory that gets modified after reading each 64-bit chunk, destroying the benefit of having had it in the cache
EDIT
Oh wait, you do not modify the source buffer. However, putting the byteswap into a pre-processing stage should at least be worth a try.
Another point: make sure those assert() calls only exist in the debug version.
EDIT 2
(deleted)
EDIT 3
Your code is definitely flawed, check the following usage scenario:
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x77665544
br.skip(16);
br.value(16); // -> 0x33221100
br.skip(16); // -> triggers reading more bits
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA9988
br.skip(16);
br.value(16); // -> 0x77665544
// that's not what you expect, right ???
EDIT 4
Well, no, EDIT 3 was wrong, but I can not help, the code is flawed. Isn't it?
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x7766
br.skip(16);
br.value(16); // -> 0x5544
br.skip(16); // -> triggers reading more bits (because m_pos=32, which is: m_pos>31)
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA --> not what you expect, right?
Here is another version I tried, which didn't give any performance improvements.
class bit_reader
{
public:
    const std::uint64_t* m_data64;
    std::size_t m_pos64;
    std::uint64_t m_block0;
    std::uint64_t m_block1;

    bit_reader(const void* data)
        : m_pos64(0)
        , m_data64(reinterpret_cast<const std::uint64_t*>(data))
        , m_block0(byte_swap(*m_data64++))
        , m_block1(byte_swap(*m_data64++))
    {
    }

    std::uint64_t value(std::size_t n_bits = 64)
    {
        return __shiftleft128(m_block1, m_block0, m_pos64) >> (64 - n_bits);
    }

    void skip(std::size_t n_bits)
    {
        m_pos64 += n_bits;
        if(m_pos64 > 63)
        {
            m_block0 = m_block1;
            m_block1 = byte_swap(*m_data64++);
            m_pos64 -= 64;
        }
    }

    void fast_skip(std::size_t n_bits)
    {
        skip(n_bits);
    }
};
If possible it would be best to do this in multiple passes. Multiple passes can each be optimized better and involve less branching.
In general it is best to do something like:
uint64_t* arr = data;   // data: pointer to the source buffer, len: its size in bytes
for(uint64_t* i = arr; i != &arr[len/sizeof(uint64_t)]; i++)
{
    *i = _byteswap_uint64(*i);
    //no more operations here
}
// another similar for loop
Such code can reduce the run time by a huge factor.
At worst you can do it in runs of, say, 100k blocks, to keep cache misses to a minimum and load the data from RAM only once.
In your case you do it in a streaming way, which is good for keeping memory usage low and for faster responses from a slow data source, but not for speed.
While optimizing my Connect Four game engine I reached a point where further improvements can only be minimal, because much of the CPU time is used by the instruction TableEntry te = mTable[idx + i] in the following code sample.
TableEntry getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried to replace the vector by allocating the table with new, without any speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist hashing function) could be expensive, but really that much? Do you have suggestions for improving the function?
Thank you!
Edit: BUCKETSIZE has a value of 4. It's used as the collision strategy. The size of one TableEntry is 16 bytes; the struct looks like the following:
struct TableEntry
{                                          // Old New
    unsigned __int64 lock;                 //  8   8
    enum { VALID, UBOUND, LBOUND } flag;   //  4   4
    short score;                           //  4   2
    char move;                             //  4   1
    char height;                           //  4   1
                                           // -------
                                           // 24  16 Bytes
    TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
Summary: The function originally needed 39 seconds. After making the changes jdehaan suggested, the function now needs 33 seconds (the program stops after 100 seconds). It's better, but I think Konrad Rudolph is right and the main reason why it's that slow is the cache misses.
You are making copies of your table entries; what about using const TableEntry& as the type? For the default value at the bottom, a static default TableEntry will also do. I suppose that is where you lose much of the time.
const TableEntry& getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        // hopefully now less than 35% of CPU usage :-)
        const TableEntry& te = mTable[idx + i];
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return DEFAULT_TABLE_ENTRY;
}
How big is a table entry? I suspect it's the copy that is expensive, not the memory lookup.
Memory accesses are quicker if they are contiguous because of cache hits, but it seems you are already doing that.
The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses which can be up to three orders of magnitude slower than access to memory in cache. Three orders of magnitude – that’s a factor 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.
I have seen this exact problem with hash tables before. The problem is that continuous random access to the hash table touches all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size, you will thrash. This manifests as exactly the problem you are encountering: the instruction which first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.
Use pointers:
TableEntry* getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    TableEntry* end = &mTable[idx + BUCKETSIZE];
    for (TableEntry* te = &mTable[idx]; te < end; te++)
    {
        if (te->height == NOTSET || lock == te->lock)
            return te;
    }
    return &DEFAULT_TABLE_ENTRY; // address of a static default entry, as in the other answer
}