CUDA lane ID vs threadIdx.x based computation

It's easiest to explain via cub::LaneId() or a function like the following:

inline __device__ unsigned get_lane_id() {
    unsigned ret;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(ret));
    return ret;
}

Versus computing the lane ID as threadIdx.x & 31.
Do these 2 approaches produce the same value in a 1D grid?
The __ballot_sync() documentation speaks of lane IDs for its mask parameter, and as I understand it the returned value has one bit set per lane ID. So would the following asserts never fail?
int nWarps = /*...*/;
bool condition = /*...*/;
if (threadIdx.x < nWarps) {
    assert(__activemask() == ((1u << nWarps) - 1));
    uint32_t res = __ballot_sync(__activemask(), condition);
    assert(bool(res & (1 << threadIdx.x)) == condition);
}

From the PTX ISA documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-laneid
%laneid: A predefined, read-only special register that returns the thread's lane within the warp. The lane identifier ranges from zero to WARP_SZ-1.
This register will always contain the correct value, whereas threadIdx.x & 31 assumes that the warp size is 32. However, for all GPU generations to date the warp size has been 32, so on both old and current architectures the computed lane will be identical. There is no guarantee that this will always be the case, however.
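As an aside (my sketch, not part of the original answer): if you want to drop the hard-coded 32 without inline PTX, CUDA's built-in warpSize variable can be used, at the cost of a runtime modulo, since warpSize is not a compile-time constant:

__device__ unsigned lane_id_portable() {
    // warpSize is a built-in device-side variable, so this stays correct
    // even if a future architecture changes the warp size.
    return threadIdx.x % warpSize;
}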
Regarding your question about the assertions: with independent thread scheduling, there is no guarantee that all threads in a warp will execute __activemask() at the same time, so I think the assertions may fail.
Quoting from the programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x
Note that threads within a warp can diverge even within a single code path. As a result, __activemask() and __ballot(1) may return only a subset of the threads on the current code path.
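To illustrate, here is a sketch of the pattern that stays correct under independent thread scheduling (my example, not from the guide; it assumes a 1D block whose size is a multiple of 32 and no divergence before the ballot): have every lane reach the ballot and pass an explicit full mask rather than one derived from __activemask():

__global__ void ballot_demo(const int* data, unsigned* out) {
    // All 32 lanes of each warp evaluate the condition and participate,
    // so the compile-time full mask is valid here.
    bool condition = data[threadIdx.x] > 0;
    unsigned res = __ballot_sync(0xFFFFFFFFu, condition);
    if ((threadIdx.x & 31) == 0)
        out[threadIdx.x / 32] = res;  // one ballot word per warp
}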

Do these 2 approaches produce the same value in a 1D grid?
Yes (while CUDA's warp size is 32). See also this question:
What's the most efficient way to calculate the warp id / lane id in a 1-D grid?
But I'd write it this way:
enum { warp_size = 32 };
// ...
inline __device__ unsigned lane_id() {
    constexpr auto lane_id_mask = warp_size - 1;
    return threadIdx.x & lane_id_mask;
}
and if you want to be extra-pedantic, you could always static-assert to ensure the warp size is a power of 2 :-P
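For instance, a sketch of that pedantic variant (the same function as above, plus the assertion):

enum { warp_size = 32 };

inline __device__ unsigned lane_id() {
    // The mask trick below is only correct for power-of-two warp sizes.
    static_assert((warp_size & (warp_size - 1)) == 0,
                  "warp_size must be a power of 2");
    constexpr auto lane_id_mask = warp_size - 1;
    return threadIdx.x & lane_id_mask;
}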
So would the following asserts never fail?
That code looks weird. Why would you left-shift by the thread ID or the number of warps? I don't see why those assertions shouldn't fail.

Related

Why is summing over members of this struct of arrays much faster than summing over an array of structs?

I have been using https://github.com/google/benchmark and g++ 9.4.0 to check the performance of data access in different scenarios (compilation with "-O3"). The result has been surprising to me.
My baseline is accessing longs in an std::array ("reduced data"). I want to add an additional byte datum. In one case I create an additional container ("split data"), and in the other I store structs in a single array ("combined data").
This is the code:
#include <benchmark/benchmark.h>

#include <array>
#include <random>

constexpr int width = 640;
constexpr int height = 480;

std::array<std::uint64_t, width * height> containerWithReducedData;

std::array<std::uint64_t, width * height> container1WithSplitData;
std::array<std::uint8_t, width * height> container2WithSplitData;

struct CombinedData
{
    std::uint64_t first;
    std::uint8_t second;
};

std::array<CombinedData, width * height> containerWithCombinedData;

void fillReducedData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};
    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);
            containerWithReducedData.at(static_cast<unsigned int>(row * width + column)) = number;
        }
    }
}

std::uint64_t accessReducedData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += containerWithReducedData.at(static_cast<unsigned int>(row * width + column));
        }
    }

    return value;
}

static void BM_AccessReducedData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessReducedData());
    }
}

BENCHMARK(BM_AccessReducedData)->Setup(fillReducedData);

void fillSplitData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};
    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
    auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);
            container1WithSplitData.at(static_cast<unsigned int>(row * width + column)) = number;

            const std::uint8_t additionalNumber = bytesDistribution(engine);
            container2WithSplitData.at(static_cast<unsigned int>(row * width + column)) = additionalNumber;
        }
    }
}

std::uint64_t accessSplitData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += container1WithSplitData.at(static_cast<unsigned int>(row * width + column));
            value += container2WithSplitData.at(static_cast<unsigned int>(row * width + column));
        }
    }

    return value;
}

static void BM_AccessSplitData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessSplitData());
    }
}

BENCHMARK(BM_AccessSplitData)->Setup(fillSplitData);

void fillCombinedData(const benchmark::State& state)
{
    // Variable is intentionally unused
    static_cast<void>(state);

    // Generate pseudo-random numbers (no seed, therefore always the same numbers)
    // NOLINTNEXTLINE
    auto engine = std::mt19937{};
    auto longsDistribution = std::uniform_int_distribution<std::uint64_t>{};
    auto bytesDistribution = std::uniform_int_distribution<std::uint8_t>{};

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            const std::uint64_t number = longsDistribution(engine);
            containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first = number;

            const std::uint8_t additionalNumber = bytesDistribution(engine);
            containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second = additionalNumber;
        }
    }
}

std::uint64_t accessCombinedData()
{
    std::uint64_t value = 0;

    for (int row = 0; row < height; ++row)
    {
        for (int column = 0; column < width; ++column)
        {
            value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).first;
            value += containerWithCombinedData.at(static_cast<unsigned int>(row * width + column)).second;
        }
    }

    return value;
}

static void BM_AccessCombinedData(benchmark::State& state)
{
    // Perform setup here
    for (auto _ : state)
    {
        // Variable is intentionally unused
        static_cast<void>(_);

        // This code gets timed
        benchmark::DoNotOptimize(accessCombinedData());
    }
}

BENCHMARK(BM_AccessCombinedData)->Setup(fillCombinedData);
Live demo
And this is the result:
Run on (12 X 4104.01 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.33, 1.82, 1.06
----------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
----------------------------------------------------------------
BM_AccessReducedData   55133 ns        55133 ns        12309
BM_AccessSplitData     64089 ns        64089 ns        10439
BM_AccessCombinedData 170470 ns       170470 ns         3827
I am not surprised by the long running times of BM_AccessCombinedData. There is additional effort (compared to "reduced data") to add the bytes. My interpretation is that the added byte no longer fits into the cache line, which makes the access much more expensive. (Might there even be another effect?)
But why is it so fast to access different containers ("split data")? There the data is located at different positions in memory, and there is alternating access to it. Shouldn't this be even slower? But it is almost three times faster than accessing the combined data! Isn't this surprising?
Preface: This answer was written only for the example/scenario you provided in your benchmark link: a summing reduction over interleaved vs non-interleaved collections of differently sized integers. Summing is an unsequenced operation. You can visit elements of the collections and add them to the accumulating result in any order. And whether you "combine" (via struct) or "split" (via separate arrays), the order of accumulation doesn't matter.
Note: It would help if you provided some information about what you already know about optimization techniques and what processors/memory are usually capable of. Your comments show you know about caching, but I have no idea what else you know, or what exactly you know about caching.
Terminology
This choice of "combined" vs "split" has other well known names:
parallel array (wikipedia article)
structure of arrays vs array of structures (wikipedia article)
This question fits into the area of concerns people have when talking about data-oriented design.
For the rest of this answer, I will stay consistent with your terminology.
Alignment, Padding, and Structs
Quoting from CppReference, the C++ language has this requirement:
Every complete object type has a property called alignment requirement, which is an integer value of type size_t representing the number of bytes between successive addresses at which objects of this type can be allocated. The valid alignment values are non-negative integral powers of two.
"Every complete object" includes instances of structs in memory. Reading on...
In order to satisfy alignment requirements of all members of a struct, padding may be inserted after some of its members.
One of its examples demonstrates:
// objects of struct X must be allocated at 4-byte boundaries
// because X.n must be allocated at 4-byte boundaries
// because int's alignment requirement is (usually) 4
struct X {
int n; // size: 4, alignment: 4
char c; // size: 1, alignment: 1
// three bytes padding
}; // size: 8, alignment: 4
This is what Peter Cordes mentioned in the comments. Because of this requirement/property/feature of the C++ language, there is inserted padding for your "combined" collection.
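To make this concrete for the question's struct, here is a sketch; the exact numbers are implementation-defined, and the values below assume a typical 64-bit target:

#include <cstdint>

struct CombinedData
{
    std::uint64_t first;  // size: 8, alignment: 8
    std::uint8_t second;  // size: 1, alignment: 1
    // seven bytes padding, so the next array element's `first`
    // is again 8-byte aligned
};

// 9 bytes of payload occupy 16 bytes per array element:
static_assert(sizeof(CombinedData) == 16, "7/16 of each element is padding");
static_assert(alignof(CombinedData) == 8, "alignment follows the widest member");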
I am not sure if there is a significant detriment to cache performance resulting from the padding here, since the sum only visits each element of the arrays once. In a scenario where elements are frequently revisited, this is more likely to matter: the padding of the combined representation results in "wasted" bytes of the cache when compared with the split representation, and that wastage is more likely to have a significant impact on cache performance. But the degree to which this matters depends on the patterns of revisiting the data.
SIMD
wikipedia article
SIMD instructions are specialized CPU machine instructions for performing an operation on multiple pieces of data in memory, such as summing a group of same-sized integers which are laid out next to each other in memory (which is exactly what can be done in the "split"-representation version of your scenario).
Compared to machine code that doesn't use SIMD, SIMD usage can provide a constant-factor improvement (the value of the constant factor is based on the SIMD instruction). For example, a SIMD instruction that adds 8 bytes at once should be roughly 8 times faster than a loop (rolled or unrolled) which adds them one at a time.
Other keywords: vectorization, parallelized code.
Peter Cordes mentioned relevant examples (psadbw, paddq). Here's a list of intel SSE instructions for arithmetic.
As Peter mentioned, a degree of SIMD usage is still possible in the "combined" representation, but not as much as is possible with the "split" representation. It comes down to what the target machine architecture's instruction set provides. I don't think there's a dedicated SIMD instruction for your example's "combined" representation.
The Code
For the "split" representation, I would do something like:
// ...
#include <numeric>    // for `std::reduce`
#include <execution>  // for `std::execution`
#include <functional> // for `std::plus`

std::uint64_t accessSplitData()
{
    return std::reduce(std::execution::unseq, container1WithSplitData.cbegin(), container1WithSplitData.cend(), std::uint64_t{0}, std::plus{})
         + std::reduce(std::execution::unseq, container2WithSplitData.cbegin(), container2WithSplitData.cend(), std::uint64_t{0}, std::plus{});
}
// ...
It's a much more direct way to communicate (to readers of the code and to a compiler) an unsequenced sum of collections of integers.
CppReference for std::reduce
CppReference for std::execution::<...>
Execution policies allow you to convey how an algorithm can and is desired to be performed (whether it is safe/still-correct and desirable to use SIMD or multiple threads). Many of the algorithms in the C++ standard library have a similar overload to accept an execution policy argument.
CppReference for std::plus
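The same idea extends to the "combined" representation via std::transform_reduce; the following is my sketch against the question's containers, not code from the question:

#include <numeric>     // for `std::transform_reduce`
#include <execution>   // for `std::execution`
#include <functional>  // for `std::plus`

std::uint64_t accessCombinedData()
{
    return std::transform_reduce(
        std::execution::unseq,
        containerWithCombinedData.cbegin(), containerWithCombinedData.cend(),
        std::uint64_t{0},
        std::plus{},
        // Sum both members of each element; the per-element reduction
        // order still doesn't matter.
        [](const CombinedData& d) { return d.first + d.second; });
}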
But What about the Different Positions?
There the data is located at different positions in memory and there is alternating access to it. Shouldn't this be even slower?
As I showed in the code above, for your specific scenario there doesn't need to be alternating access. But if the specific scenario were changed to require alternating access, I usually wouldn't expect much cache impact on average.
There is the possible problem of conflict misses if the corresponding entries of the split arrays map to the same cache sets. I don't know how likely this is to be encountered, or if there are techniques in C++ to prevent it. If anyone knows, please edit this answer. If a cache has N-way set associativity, and the access pattern to the "split" representation data only accesses N or fewer arrays in the hot loop (i.e. doesn't access any other memory), I believe it should be impossible to run into this.
Misc Notes
I would recommend that you keep your benchmark link in your question unchanged, and if you want to update it, add a new link, so people viewing the discussion will be able to see older versions being referred to.
Out of curiosity, is there a reason why you're not using newer compiler versions for the benchmark like gcc 11?
I highly recommend the usage I showed of std::reduce. It's a widely recommended practice to use a dedicated C++ standard algorithm instead of a raw loop where a suitable algorithm exists. See the reasons cited in the CppCoreGuidelines link. The code may be long (and in that sense, ugly), but it clearly conveys the intent to perform a sum where the reduction operator (plus) is unsequenced.
Your question is specifically about speed, but it's noteworthy that in C++, the choice of struct-of-array vs array-of-struct can be important where space costs matter, precisely because of alignment and padding.
There are more considerations in choosing struct-of-arrays vs array-of-structs that I haven't listed: memory access patterns are the main consideration for performance. Readability and simplicity are also important; you can alleviate problems by building good abstractions, but there's still a limit to that, and the abstraction itself has maintenance, readability, and simplicity costs.
You may also find this video by Matt Godbolt on "Memory and Caches" useful.

Trying to understand prefix sum execution

I am trying to understand the scan implementation scan-then-fan mentioned in the book: The CUDA Handbook.
Can someone explain the device function scanWarp? Why the negative indexes? Could you please give a numerical example?
I have the same question for the line warpPartials[16+warpid] = sum. How does the assignment happen?
What is the contribution of the line if ( warpid==0 ) {scanWarp<T,bZeroPadded>( 16+warpPartials+tid ); }?
Could someone please explain sum += warpPartials[16+warpid-1];? A numerical example would be highly appreciated.
Finally, a more C++-oriented question: how do we know the indexes that are used in *sPartials = sum; to store values in sPartials?
PS: A numerical example that demonstrates the whole execution would be very helpful.
template < class T, bool bZeroPadded >
inline __device__ T
scanBlock( volatile T *sPartials )
{
    extern __shared__ T warpPartials[];
    const int tid = threadIdx.x;
    const int lane = tid & 31;
    const int warpid = tid >> 5;

    //
    // Compute this thread's partial sum
    //
    T sum = scanWarp<T,bZeroPadded>( sPartials );
    __syncthreads();

    //
    // Write each warp's reduction to shared memory
    //
    if ( lane == 31 ) {
        warpPartials[16+warpid] = sum;
    }
    __syncthreads();

    //
    // Have one warp scan reductions
    //
    if ( warpid==0 ) {
        scanWarp<T,bZeroPadded>( 16+warpPartials+tid );
    }
    __syncthreads();

    //
    // Fan out the exclusive scan element (obtained
    // by the conditional and the decrement by 1)
    // to this warp's pending output
    //
    if ( warpid > 0 ) {
        sum += warpPartials[16+warpid-1];
    }
    __syncthreads();

    //
    // Write this thread's scan output
    //
    *sPartials = sum;
    __syncthreads();

    //
    // The return value will only be used by caller if it
    // contains the spine value (i.e. the reduction
    // of the array we just scanned).
    //
    return sum;
}

template < class T >
inline __device__ T
scanWarp( volatile T *sPartials )
{
    const int tid = threadIdx.x;
    const int lane = tid & 31;

    if ( lane >=  1 ) sPartials[0] += sPartials[- 1];
    if ( lane >=  2 ) sPartials[0] += sPartials[- 2];
    if ( lane >=  4 ) sPartials[0] += sPartials[- 4];
    if ( lane >=  8 ) sPartials[0] += sPartials[- 8];
    if ( lane >= 16 ) sPartials[0] += sPartials[-16];
    return sPartials[0];
}
The scan-then-fan strategy is applied at two levels. For the grid-level scan (which operates on global memory), partials are written to the temporary global memory buffer allocated in the host code, scanned by recursively calling the host function, then added to the eventual output with a separate kernel invocation. For the block-level scan (which operates on shared memory), partials are written to the base of shared memory (warpPartials[]), scanned by one warp, then added to the eventual output of the block-level scan. The code that you are asking about is doing the block-level scan.
The implementation of scanWarp that you are referencing is called with a shared memory pointer that has already had threadIdx.x added to it, so each thread's version of sPartials points to a different shared memory element. Using a fixed index on sPartials causes adjacent threads to operate on adjacent shared memory elements. Negative indices are okay as long as they do not result in out-of-bounds array indexing. This implementation borrowed from the optimized version that pads shared memory with zeros, so every thread can unconditionally use a fixed negative index and threads below a certain index just read zeros. (Listing 13.14) It could just as easily have predicated execution on the lowest threads in the warp and used positive indices.
The 31st thread of each 32-thread warp contains that warp's partial sum, which has to be stored somewhere in order to be scanned and then added to the output. warpPartials[] aliases shared memory from the first element, so can be used to hold each warp's partial sum. You could use any part of shared memory to do this calculation, because each thread already has its own scan value in registers (the assignment T sum = scanWarp...).
Some warp (it could be any warp, so it might as well be warp 0) has to scan the partials that were written to warpPartials[]. At most one warp is needed because of the hardware limit of 1024 threads per block, i.e. 1024/32 = 32 warps. So this code is taking advantage of the coincidence that the maximum number of threads per block, divided by the warp size, is no larger than the number of threads per warp.
This code is adding the scanned per-warp partials to each output element. The first warp already has the correct values, so the addition is done only by the second and subsequent warps. Another way to look at this is that it's adding the exclusive scan of the warp partials to the output.
scanBlock is a device function - the address arithmetic gets done by its caller, scanAndWritePartials: volatile T *myShared = sPartials+tid;
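Since the question asks for a numerical example, here is a worked trace of scanWarp that I shrank to 8 lanes (so the offsets are 1, 2 and 4 rather than 1, 2, 4, 8, 16), assuming the warp-synchronous lockstep the code relies on. Each lane's sPartials[0] starts as its own input:

initial values:                                 [3, 1,  7,  0,  4,  1,  6,  3]
after offset 1 (lane >= 1 adds sPartials[-1]):  [3, 4,  8,  7,  4,  5,  7,  9]
after offset 2 (lane >= 2 adds sPartials[-2]):  [3, 4, 11, 11, 12, 12, 11, 14]
after offset 4 (lane >= 4 adds sPartials[-4]):  [3, 4, 11, 11, 15, 16, 22, 25]

The final row is exactly the inclusive prefix sum of the inputs, and the last lane (the one that executes warpPartials[16+warpid] = sum) holds the warp total, 25.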
(Answer rewritten now that I have more time.)
Here's an example (based on an implementation I wrote in C++ AMP, not CUDA). To make the diagram smaller, each warp is 4 elements wide and a block is 16 elements.
The following papers are also pretty useful: Efficient Parallel Scan Algorithms for GPUs, and Parallel Scan for Stream Architectures.

Optimize check for a bit-vector being a proper subset of another?

I would like some help optimizing the most computationally intensive function of my program.
Currently, I am finding that the basic (non-SSE) version is significantly faster (up to 3x). I would thus request your help in rectifying this.
The function looks for subsets in unsigned integer vectors, and reports if they exist or not. For your convenience I have included the relevant code snippets only.
First up is the basic variant. It checks to see if blocks_ is a proper subset of x.blocks_. (Not exactly equal.) These are bitmaps, aka bit vectors or bitsets.
// Check for self comparison
if (this == &x)
    return false;

// A subset is equal to or smaller.
if (no_bits_ > x.no_bits_)
    return false;

int i;
bool equal = false;

// Pointers should not change.
const unsigned int *tptr = blocks_;
const unsigned int *xptr = x.blocks_;

for (i = 0; i < no_blocks_; i++, tptr++, xptr++) {
    if ((*tptr & *xptr) != *tptr)
        return false;
    if (*tptr != *xptr)
        equal = true;
}
return equal;
Then comes the SSE variant, which alas does not perform according to my expectations. Both of these snippets should look for the same things.
// Starting pointers.
const __m128i* start = (__m128i*)&blocks_;
const __m128i* xstart = (__m128i*)&x.blocks_;

__m128i block;
__m128i xblock;

// Unsigned ints are 32 bits, meaning 4 can fit in a register.
for (i = 0; i < no_blocks_; i += 4) {
    block = _mm_load_si128(start + i);
    xblock = _mm_load_si128(xstart + i);

    // Equivalent to (block & xblock) != block
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(_mm_and_si128(block, xblock), block)) != 0xffff)
        return false;

    // Equivalent to block != xblock
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(block, xblock)) != 0xffff)
        equal = true;
}
return equal;
Do you have any suggestions as to how I may improve upon the performance of the SSE version? Am I doing something wrong? Or is this a case where optimization should be done elsewhere?
I have not yet added in the leftover calculations for no_blocks_ % 4 != 0, but there is little purpose in doing so until the performance increases, and it would only clutter up the code at this point.
There are three possibilities I see here.
First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr within the first few blocks, the plain C++ version will almost certainly always be faster. In that instance, your SSE will run through more code & data to accomplish the same thing.
Second, your SSE code may be incorrect. It's not totally clear here. If no_blocks_ is identical between the two samples, then start + i is probably having the unwanted behavior of indexing into 128-bit elements, not 32-bit as in the first sample.
Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.
Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int to 0.25.
bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
    __m128i eq1 = _mm_setzero_si128();
    __m128i eq2 = _mm_setzero_si128();

    for (unsigned i = 0; i != count; i += 8)
    {
        __m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
        __m128i xb1 = _mm_load_si128((__m128i const*)(b + i));

        eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
        xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));

        __m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
        __m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));

        eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
        xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));

        if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
            return false;
    }

    return _mm_movemask_epi8(_mm_or_si128(eq1, eq2)) != 0;
}
If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.
It seems that your problem is memory-bandwidth bound:
Asymptotically, you need about 2 operations to process each pair of integers scanned from memory. There is not enough arithmetic complexity to take advantage of the higher arithmetic throughput of the CPU's SSE instructions. In fact, the CPU spends most of its time waiting for data transfers.
Using SSE instructions in this situation also adds instruction overhead, and the resulting code is not well optimized by the compiler.
There are some alternative strategies to improve performance in bandwidth-bound problems:
Use multiple threads to hide memory access latency behind concurrent arithmetic operations, e.g. in a hyper-threading context.
Fine-tune the size of the data loaded at a time to improve memory bandwidth.
Improve pipeline continuity by adding supplementary independent operations in the loop, e.g. scanning two different sets of data at each step of your "for" loop (see the sketch below).
Keep more data in cache or in registers (some iterations of your code may need the same set of data many times).
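As a scalar illustration of the third point, here is a sketch (untested, with invented names) that processes two elements per iteration so the loads of one stream can overlap the compares of the other:

bool is_proper_subset(const unsigned* t, const unsigned* x, int n)
{
    bool differs = false;
    int i = 0;

    // Two independent streams of work per iteration.
    for (; i + 1 < n; i += 2) {
        if ((t[i]   & x[i])   != t[i])   return false;
        if ((t[i+1] & x[i+1]) != t[i+1]) return false;
        differs |= (t[i] != x[i]) | (t[i+1] != x[i+1]);
    }

    // Leftover element when n is odd.
    for (; i < n; ++i) {
        if ((t[i] & x[i]) != t[i]) return false;
        differs |= (t[i] != x[i]);
    }

    return differs;  // proper subset: contained and not equal
}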

The fastest way to retrieve 16k Key-Value pairs?

OK, here's my situation:
I have a function - let's say U64 calc (U64 x) - which takes a 64-bit integer parameter, performs some CPU-intensive operation, and returns a 64-bit value
Now, given that I know ALL possible inputs (the xs) of that function beforehand (there are some 16000, though), I thought it might be better to pre-calculate them and then fetch them on demand (from an array-like structure).
The ideal situation would be to store them all in some array U64 CALC[] and retrieve them by index (the x again)
And here's the issue: I may know what the possible inputs for my calc function are, but they are most definitely NOT consecutive (e.g. not from 1 to 16000, but values that may go as low as 0 and as high as some trillions - always within a 64-bit range).
E.g.:

     X       CALC[X]
-----------------------
123123     123123123
 12312      12312312
897523        986123
etc.
And here comes my question:
How would you store them?
What workaround would you prefer?
Now, given that these values (from CALC) will have to be accessed some thousands to millions of times per second, which would be the best solution performance-wise?
Note: I'm not mentioning anything I've thought of or tried, so as not to turn the answers into some A-vs-B debate, and mostly to not influence anyone...
I would use some sort of hash function that creates an index to a u64 pair where one is the value the key was created from and the other the replacement value. Technically the index could be three bytes long (assuming 16 million - "16000 thousand" - pairs) if you need to conserve space, but I'd use u32s. If the stored value does not match the value computed on (a hash collision) you'd enter an overflow handler.
You need to determine a custom hashing algorithm to fit your data
Since you know the size of the data you don't need algorithms that allow the data to grow.
I'd be wary of using some standard algorithm because they seldom fit specific data
I'd be wary of using a C++ method unless you are sure the code is WYSIWYG (doesn't generate a lot of invisible calls)
Your index should be 25% larger than the number of pairs
Run through all possible inputs and determine min, max, average and standard deviation for the number of collisions and use these to determine the acceptable performance level. Then profile to achieve the best possible code.
The required memory space (using a u32 index) comes out to (4*1.25)+8+8 = 21 bytes per pair, or 336 MB, no problem on a typical PC.
________ EDIT________
I have been challenged by "RocketRoy" to put my money where my mouth is. Here goes:
The problem has to do with collision handling in a (fixed size) hash index. To set the stage:
I have a list of n entries where a field in the entry contains the value v that the hash is computed from
I have a vector of n*1.25 (approximately) indices such that the number of indices x is a prime number
A prime number y is computed which is a fraction of x
The vector is initialized to, say, -1 to denote unoccupied
Pretty standard stuff I think you'll agree.
The entries in the list are processed, the hash value h is computed (modulo x) and used as an index into the vector, and the index of the entry is placed there.
Eventually I encounter the situation where the vector entry pointed to by the index is occupied (doesn't contain -1) - voilà, a collision.
So what do I do? I keep the original h as ho, add y to h and take it modulo x to get a new index into the vector. If that entry is unoccupied I use it; otherwise I keep adding y modulo x until I reach ho again. In theory this will happen because both x and y are prime numbers. In practice x is larger than n, so it won't.
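A tiny numeric illustration with made-up constants (far smaller than the real ones): with x = 7 slots and step y = 3, a value with ho = 2 probes slots 2, 5, 1, 4, 0, 3, 6 - adding 3 modulo 7 each time - before arriving back at 2. Because x and y have no common factor, the probe sequence visits every slot exactly once.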
So the "re-hash" that RocketRoy claims is very costly is no such thing.
The tricky part with this method - as with all hashing methods - is to:
Determine a suitable value for x (this becomes less tricky the larger the final x)
Determine the algorithm a for h = a(v) % x such that the h's index reasonably evenly ("randomly") into the index vector, with as few collisions as possible
Determine a suitable value for y such that collisions index reasonably evenly ("randomly") into the index vector
________ EDIT________
I'm sorry I've taken so long to produce this code ... other things have had higher priorities.
Anyway, here is the code which proves that hashing has better prospects for quick lookups than a binary tree. It runs through a bunch of hashing index sizes and algorithms to aid in finding the most suitable combo for the specific data. For every algorithm the code will print the first index size such that no lookup takes longer than fourteen searches (worst case for binary searching) and an average lookup takes less than 1.5 searches.
I have a fondness for prime numbers in these types of applications, in case you haven't noticed.
There are many ways of creating a hashing algorithm with its mandatory overflow handling. I opted for simplicity, assuming it will translate into speed ... and it does. On my laptop with an i5 M 480 @ 2.67 GHz, an average lookup requires between 55 and 60 clock cycles (which comes out to around 45 million lookups per second). I implemented a special get operation with a constant number of indices and ditto rehash value, and the cycle count dropped to 40 (65 million lookups per second). If you look at the line calling getOpSpec, the index i is xor'ed with 0x444 to exercise the caches and achieve more "real world"-like results.
I must again point out that the program suggests suitable combinations for the specific data. Other data may require a different combo.
The source code contains both the code for generating the 16000 unsigned long long pairs and for testing different constants (index sizes and rehash values):
#include <windows.h>

#define i8  signed char
#define i16 short
#define i32 long
#define i64 long long
#define id  i64
#define u8  char
#define u16 unsigned short
#define u32 unsigned long
#define u64 unsigned long long
#define ud  u64

#include <string.h>
#include <stdio.h>

u64 prime_find_next     (const u64 value);
u64 prime_find_previous (const u64 value);

static inline volatile unsigned long long rdtsc_to_rax (void)
{
    unsigned long long lower,upper;

    asm volatile( "rdtsc\n"
                  : "=a"(lower), "=d"(upper));
    return lower|(upper<<32);
}

static u16 index[65536];
static u64 nindeces,rehshFactor;

static struct PAIRS {u64 oval,rval;} pairs[16000] = {
#include "pairs.h"
};

struct HASH_STATS
{
    u64 ninvocs,nrhshs,nworst;
} getOpStats,putOpStats;

i8 putOp (u16 index[], const struct PAIRS data[], const u64 oval, const u64 ci)
{
    u64 nworst=1,ho,h,i;
    i8 success=1;

    ++putOpStats.ninvocs;
    ho=oval%nindeces;
    h=ho;
    do
    {
        i=index[h];
        if (i==0xffff) /* unused position */
        {
            index[h]=(u16)ci;
            goto added;
        }
        if (oval==data[i].oval) goto duplicate;
        ++putOpStats.nrhshs;
        ++nworst;
        h+=rehshFactor;
        if (h>=nindeces) h-=nindeces;
    } while (h!=ho);
exhausted: /* should not happen */
duplicate:
    success=0;
added:
    if (nworst>putOpStats.nworst) putOpStats.nworst=nworst;
    return success;
}

i8 getOp (u16 index[], const struct PAIRS data[], const u64 oval, u64 *rval)
{
    u64 ho,h,i;
    i8 success=1;

    ho=oval%nindeces;
    h=ho;
    do
    {
        i=index[h];
        if (i==0xffffu) goto not_found; /* unused position */
        if (oval==data[i].oval)
        {
            *rval=data[i].rval; /* fetch the replacement value */
            goto found;
        }
        h+=rehshFactor;
        if (h>=nindeces) h-=nindeces;
    } while (h!=ho);
exhausted:
not_found: /* should not happen */
    success=0;
found:
    return success;
}

volatile i8 stop = 0;

int main (int argc, char *argv[])
{
    u64 i,rval,mulup,divdown,start;
    double ave;

    SetThreadAffinityMask (GetCurrentThread(), 0x00000004ull);

    divdown=5; //5
    while (divdown<=100)
    {
        mulup=3; // 3
        while (mulup<divdown)
        {
            nindeces=16000;
            while (nindeces<65500)
            {
                nindeces= prime_find_next (nindeces);
                rehshFactor=nindeces*mulup/divdown;
                rehshFactor=prime_find_previous (rehshFactor);
                memset (index, 0xff, sizeof(index));
                memset (&putOpStats, 0, sizeof(struct HASH_STATS));

                i=0;
                while (i<16000)
                {
                    if (!putOp (index, pairs, pairs[i].oval, (u16) i)) stop=1;
                    ++i;
                }
                ave=(double)(putOpStats.ninvocs+putOpStats.nrhshs)/(double)putOpStats.ninvocs;
                if (ave<1.5 && putOpStats.nworst<15)
                {
                    start=rdtsc_to_rax ();
                    i=0;
                    while (i<16000)
                    {
                        if (!getOp (index, pairs, pairs[i^0x0444].oval, &rval)) stop=1;
                        ++i;
                    }
                    start=rdtsc_to_rax ()-start+8000; /* 8000 is half of 16000 (pairs), for rounding */
                    printf ("%u;%u;%u;%u;%1.3f;%u;%u\n", (u32)mulup, (u32)divdown, (u32)nindeces, (u32)rehshFactor, ave, (u32) putOpStats.nworst, (u32) (start/16000ull));
                    goto found;
                }
                nindeces+=2;
            }
            printf ("%u;%u\n", (u32)mulup, (u32)divdown);
found:
            mulup=prime_find_next (mulup);
        }
        divdown=prime_find_next (divdown);
    }
    SetThreadAffinityMask (GetCurrentThread(), 0x0000000fu);
    return 0;
}
It was not possible to include the generated pairs file (an answer is apparently limited to 30000 characters). But send a message to my inbox and I'll mail it.
And these are the results:
3;5;35569;21323;1.390;14;73
3;7;33577;14389;1.435;14;60
5;7;32069;22901;1.474;14;61
3;11;35107;9551;1.412;14;59
5;11;33967;15427;1.446;14;61
7;11;34583;22003;1.422;14;59
3;13;34253;7901;1.439;14;61
5;13;34039;13063;1.443;14;60
7;13;32801;17659;1.456;14;60
11;13;33791;28591;1.436;14;59
3;17;34337;6053;1.413;14;59
5;17;32341;9511;1.470;14;61
7;17;32507;13381;1.474;14;62
11;17;33301;21529;1.454;14;60
13;17;34981;26737;1.403;13;59
3;19;33791;5333;1.437;14;60
5;19;35149;9241;1.403;14;59
7;19;33377;12289;1.439;14;97
11;19;34337;19867;1.417;14;59
13;19;34403;23537;1.430;14;61
17;19;33923;30347;1.467;14;61
3;23;33857;4409;1.425;14;60
5;23;34729;7547;1.429;14;60
7;23;32801;9973;1.456;14;61
11;23;33911;16127;1.445;14;60
13;23;33637;19009;1.435;13;60
17;23;34439;25453;1.426;13;60
19;23;33329;27529;1.468;14;62
3;29;32939;3391;1.474;14;62
5;29;34543;5953;1.437;13;60
7;29;34259;8263;1.414;13;59
11;29;34367;13033;1.409;14;60
13;29;33049;14813;1.444;14;60
17;29;34511;20219;1.422;14;60
19;29;33893;22193;1.445;13;61
23;29;34693;27509;1.412;13;92
3;31;34019;3271;1.441;14;60
5;31;33923;5449;1.460;14;61
7;31;33049;7459;1.442;14;60
11;31;35897;12721;1.389;14;59
13;31;35393;14831;1.397;14;59
17;31;33773;18517;1.425;14;60
19;31;33997;20809;1.442;14;60
23;31;34841;25847;1.417;14;59
29;31;33857;31667;1.426;14;60
3;37;32569;2633;1.476;14;61
5;37;34729;4691;1.419;14;59
7;37;34141;6451;1.439;14;60
11;37;34549;10267;1.410;13;60
13;37;35117;12329;1.423;14;60
17;37;34631;15907;1.429;14;63
19;37;34253;17581;1.435;14;60
23;37;32909;20443;1.453;14;61
29;37;33403;26177;1.445;14;60
31;37;34361;28771;1.413;14;59
3;41;34297;2503;1.424;14;60
5;41;33587;4093;1.430;14;60
7;41;34583;5903;1.404;13;59
11;41;32687;8761;1.440;14;60
13;41;34457;10909;1.439;14;60
17;41;34337;14221;1.425;14;59
19;41;32843;15217;1.476;14;62
23;41;35339;19819;1.423;14;59
29;41;34273;24239;1.436;14;60
31;41;34703;26237;1.414;14;60
37;41;33343;30089;1.456;14;61
3;43;34807;2423;1.417;14;59
5;43;35527;4129;1.413;14;60
7;43;33287;5417;1.467;14;61
11;43;33863;8647;1.436;14;60
13;43;34499;10427;1.418;14;78
17;43;34549;13649;1.431;14;60
19;43;33749;14897;1.429;13;60
23;43;34361;18371;1.409;14;59
29;43;33149;22349;1.452;14;61
31;43;34457;24821;1.428;14;60
37;43;32377;27851;1.482;14;81
41;43;33623;32057;1.424;13;59
3;47;33757;2153;1.459;14;61
5;47;33353;3547;1.445;14;61
7;47;34687;5153;1.414;13;59
11;47;34519;8069;1.417;14;60
13;47;34549;9551;1.412;13;59
17;47;33613;12149;1.461;14;61
19;47;33863;13687;1.443;14;60
23;47;35393;17317;1.402;14;59
29;47;34747;21433;1.432;13;60
31;47;34871;22993;1.409;14;59
37;47;34729;27337;1.425;14;59
41;47;33773;29453;1.438;14;60
43;47;31253;28591;1.487;14;62
3;53;33623;1901;1.430;14;59
5;53;34469;3229;1.430;13;60
7;53;34883;4603;1.408;14;59
11;53;34511;7159;1.412;13;59
13;53;32587;7963;1.453;14;60
17;53;34297;10993;1.432;13;80
19;53;33599;12043;1.443;14;64
23;53;34337;14897;1.415;14;59
29;53;34877;19081;1.424;14;61
31;53;34913;20411;1.406;13;59
37;53;34429;24029;1.417;13;60
41;53;34499;26683;1.418;14;59
43;53;32261;26171;1.488;14;62
47;53;34253;30367;1.437;14;79
3;59;33503;1699;1.432;14;61
5;59;34781;2939;1.424;14;60
7;59;35531;4211;1.403;14;59
11;59;34487;6427;1.420;14;59
13;59;33563;7393;1.453;14;61
17;59;34019;9791;1.440;14;60
19;59;33967;10937;1.447;14;60
23;59;33637;13109;1.438;14;60
29;59;34487;16943;1.424;14;59
31;59;32687;17167;1.480;14;61
37;59;35353;22159;1.404;14;59
41;59;34499;23971;1.431;14;60
43;59;34039;24799;1.445;14;60
47;59;32027;25471;1.499;14;62
53;59;34019;30557;1.449;14;61
3;61;35059;1723;1.418;14;60
5;61;34351;2803;1.416;13;60
7;61;35099;4021;1.412;14;59
11;61;34019;6133;1.442;14;60
13;61;35023;7459;1.406;14;88
17;61;35201;9803;1.414;14;61
19;61;34679;10799;1.425;14;101
23;61;34039;12829;1.441;13;60
29;61;33871;16097;1.446;14;60
31;61;34147;17351;1.427;14;61
37;61;34583;20963;1.412;14;59
41;61;32999;22171;1.452;14;62
43;61;33857;23857;1.431;14;98
47;61;34897;26881;1.431;14;60
53;61;33647;29231;1.434;14;60
59;61;32999;31907;1.454;14;60
3;67;32999;1471;1.455;14;61
5;67;35171;2621;1.403;14;59
7;67;33851;3533;1.463;14;61
11;67;34607;5669;1.437;14;60
13;67;35081;6803;1.416;14;61
17;67;33941;8609;1.417;14;60
19;67;34673;9829;1.427;14;60
23;67;35099;12043;1.415;14;60
29;67;33679;14563;1.452;14;61
31;67;34283;15859;1.437;14;60
37;67;32917;18169;1.460;13;61
41;67;33461;20443;1.441;14;61
43;67;34313;22013;1.426;14;60
47;67;33347;23371;1.452;14;61
53;67;33773;26713;1.434;14;60
59;67;35911;31607;1.395;14;58
61;67;34157;31091;1.431;14;63
3;71;34483;1453;1.423;14;59
5;71;34537;2423;1.428;14;59
7;71;33637;3313;1.428;13;60
11;71;32507;5023;1.465;14;79
13;71;35753;6529;1.403;14;59
17;71;33347;7963;1.444;14;61
19;71;35141;9397;1.410;14;59
23;71;32621;10559;1.475;14;61
29;71;33637;13729;1.429;14;60
31;71;33599;14657;1.443;14;60
37;71;34361;17903;1.396;14;59
41;71;33757;19489;1.435;14;61
43;71;34583;20939;1.413;14;59
47;71;34589;22877;1.441;14;60
53;71;35353;26387;1.418;14;59
59;71;35323;29347;1.406;14;59
61;71;35597;30577;1.401;14;59
67;71;34537;32587;1.425;14;59
3;73;34613;1409;1.418;14;59
5;73;32969;2251;1.453;14;62
7;73;33049;3167;1.448;14;61
11;73;33863;5101;1.435;14;60
13;73;34439;6131;1.456;14;60
17;73;33629;7829;1.455;14;61
19;73;34739;9029;1.421;14;60
23;73;33071;10399;1.469;14;61
29;73;33359;13249;1.460;14;61
31;73;33767;14327;1.422;14;59
37;73;32939;16693;1.490;14;62
41;73;33739;18947;1.438;14;60
43;73;33937;19979;1.432;14;61
47;73;33767;21739;1.422;14;59
53;73;33359;24203;1.435;14;60
59;73;34361;27767;1.401;13;59
61;73;33827;28229;1.443;14;60
67;73;34421;31583;1.423;14;71
71;73;33053;32143;1.447;14;60
3;79;35027;1327;1.410;14;60
5;79;34283;2161;1.432;14;60
7;79;34439;3049;1.432;14;60
11;79;34679;4817;1.416;14;59
13;79;34667;5701;1.405;14;59
17;79;33637;7237;1.428;14;60
19;79;34469;8287;1.417;14;60
23;79;34439;10009;1.433;14;60
29;79;33427;12269;1.448;13;61
31;79;33893;13297;1.445;14;61
37;79;33863;15823;1.439;14;60
41;79;32983;17107;1.450;14;60
43;79;34613;18803;1.431;14;60
47;79;33457;19891;1.457;14;61
53;79;33961;22777;1.435;14;61
59;79;32983;24631;1.465;14;60
61;79;34337;26501;1.428;14;60
67;79;33547;28447;1.458;14;61
71;79;32653;29339;1.473;14;61
73;79;34679;32029;1.429;14;64
3;83;35407;1277;1.405;14;59
5;83;32797;1973;1.451;14;60
7;83;33049;2777;1.443;14;61
11;83;33889;4483;1.431;14;60
13;83;35159;5503;1.409;14;59
17;83;34949;7151;1.412;14;59
19;83;32957;7541;1.467;14;61
23;83;32569;9013;1.470;14;61
29;83;33287;11621;1.474;14;61
31;83;33911;12659;1.448;13;60
37;83;33487;14923;1.456;14;62
41;83;33587;16573;1.438;13;60
43;83;34019;17623;1.435;14;60
47;83;31769;17987;1.483;14;62
53;83;33049;21101;1.451;14;61
59;83;32369;23003;1.465;14;61
61;83;32653;23993;1.469;14;61
67;83;33599;27109;1.437;14;61
71;83;33713;28837;1.452;14;61
73;83;33703;29641;1.454;14;61
79;83;34583;32911;1.417;14;59
3;89;34147;1129;1.415;13;60
5;89;32797;1831;1.461;14;61
7;89;33679;2647;1.443;14;73
11;89;34543;4261;1.427;13;60
13;89;34603;5051;1.419;14;60
17;89;34061;6491;1.444;14;60
19;89;34457;7351;1.422;14;79
23;89;33529;8663;1.450;14;61
29;89;34283;11161;1.431;14;60
31;89;35027;12197;1.411;13;59
37;89;34259;14221;1.403;14;59
41;89;33997;15649;1.434;14;60
43;89;33911;16127;1.445;14;60
47;89;34949;18451;1.419;14;59
53;89;34367;20443;1.434;14;60
59;89;33791;22397;1.430;14;59
61;89;34961;23957;1.404;14;59
67;89;33863;25471;1.433;13;60
71;89;35149;28031;1.414;14;79
73;89;33113;27143;1.447;14;60
79;89;32909;29209;1.458;14;61
83;89;33617;31337;1.400;14;59
3;97;34211;1051;1.448;14;60
5;97;34807;1789;1.430;14;60
7;97;33547;2417;1.446;14;60
11;97;35171;3967;1.407;14;89
13;97;32479;4349;1.474;14;61
17;97;34319;6011;1.444;14;60
19;97;32381;6337;1.491;14;64
23;97;33617;7963;1.421;14;59
29;97;33767;10093;1.423;14;59
31;97;33641;10739;1.447;14;60
37;97;34589;13187;1.425;13;60
41;97;34171;14437;1.451;14;60
43;97;31973;14159;1.484;14;62
47;97;33911;16127;1.445;14;61
53;97;34031;18593;1.448;14;80
59;97;32579;19813;1.457;14;61
61;97;34421;21617;1.417;13;60
67;97;33739;23297;1.448;14;60
71;97;33739;24691;1.435;14;60
73;97;33863;25471;1.433;13;60
79;97;34381;27997;1.419;14;59
83;97;33967;29063;1.446;14;60
89;97;33521;30727;1.441;14;60
Cols 1 and 2 are used to calculate a rough relationship between the rehash value and the index size. The next two are the first index size/rehash factor combination which averages less than 1.5 searches for a lookup with a worst case of 14 searches. Then average and worst case. Finally, the last column is the average number of clock cycles per lookup. It does not take into account the time required to read the time stamp register.
The actual memory space for the best constants (# of indices = 31253 and rehash factor = 28591) comes out to more than I initially indicated (16000*2*8 + 1.25*16000*2 => 296000 bytes). The actual size is 16000*2*8 + 31253*2 => 318506.
The fastest combination is an approximate ratio of 11/31 with an index size of 35897 and rehash value of 12721. This will average 1.389 (1 initial hash + 0.389 rehashes) with a maximum of 14 (1+13).
________ EDIT________
I removed the "goto found;" in main() to show all combinations, and it shows that much better performance is possible, of course at the expense of a larger index size. For example the combination 57667 and 33797 yields an average of 1.192 and a maximum rehash of 6. The combination 44543 and 23399 yields a 1.249 average and 10 maximum rehashes (it saves (57667-44543)*2=26468 bytes of index table compared to 57667/33797).
Specialized functions with hard-coded hash index size and rehash factor will execute in 60-70% of the time compared to variables. This is probably due to the compiler (gcc 64-bit) substituting the modulo with multiplications and not having to fetch the values from memory locations as they will be coded as immediate values.
________ EDIT________
On the subject of caches I see two issues.
The first is data caching, which I don't think will be possible because the lookup will just be a small step in some larger process, and you run the risk of the table data's cache lines being invalidated to a lesser or (probably) greater degree - if not entirely - by other data accesses in other steps of the larger process. I.e. the more code executed and data accessed in the process as a whole, the less likely it is that any pertinent lookup data will remain in the caches (this may or may not be pertinent to the OP's situation).

To find an entry using (my) hashing you will encounter two cache misses (one to load the correct part of the index, and the other to load the area containing the entry itself) for every comparison that needs to be performed. Finding an entry on the first try will have cost two misses, the second try four, etc. In my example the 60 clock cycle average cost per lookup implies that the table probably resided entirely in the L2 cache, with L1 not having to go there in a majority of the cases. My x86-64 CPU has L1-3, RAM wait states of approximately 4, 10, 40 and 100, which to me shows that RAM was completely kept out, and L3 mostly.

The second is code caching, which will have a more significant impact if the code is small, tight, in-lined and with few control transfers (jumps and calls). My hash routine probably resides entirely in the L1 code cache. For more normal cases, the fewer the number of code cache line loads, the faster it will be.
Make an array of structures of key/value pairs.
Sort the array by key and put it in your program as a static array; at 16 bytes per pair it would only be about 256 KB.
Then in your program a simple binary lookup by key will need on average only 14 key comparisons to find the right value. That should be able to approach speeds of 300 million lookups per second on a modern PC.
You can sort with qsort and search with bsearch, both std lib functions.
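A sketch of that approach (PAIR, table and lookup are my illustrative names, not from the question):

#include <cstdint>
#include <cstdlib>

struct PAIR { std::uint64_t key, value; };

static int cmp_pair(const void* a, const void* b) {
    const std::uint64_t ka = static_cast<const PAIR*>(a)->key;
    const std::uint64_t kb = static_cast<const PAIR*>(b)->key;
    return (ka > kb) - (ka < kb);  // avoids overflow from subtraction
}

// Call std::qsort(table, n, sizeof(PAIR), cmp_pair) once at startup, then:
std::uint64_t lookup(const PAIR table[], std::size_t n, std::uint64_t x) {
    PAIR probe{x, 0};
    const void* hit = std::bsearch(&probe, table, n, sizeof(PAIR), cmp_pair);
    return hit ? static_cast<const PAIR*>(hit)->value : 0;  // 0 = "not found" stand-in
}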
Perform memoization or, in simple terms, cache the values you've already computed and calculate only the new ones. You should hash the input and check the cache for that result. You can even start off with a set of cache values that you think the function would be called for more often. Besides that, I don't think you need to go to any extreme as the other answers suggest. Keep things simple, and when you are done with your application you can use a profiling tool to find bottlenecks.
EDIT: Some code
#include <iostream>
#include <ctime>

using namespace std;

const int MAX_SIZE = 16000;
int preCalcData[MAX_SIZE] = {};

int getPrecalculatedResult(int x){
    return preCalcData[x];
}

void setupPreCalcDataCache(){
    for(int i = 0; i < MAX_SIZE; ++i){
        preCalcData[i] = i*i; //or whatever calculation
    }
}

int main(){
    setupPreCalcDataCache();

    cout << getPrecalculatedResult(0) << endl;
    cout << getPrecalculatedResult(15999) << endl;

    return 0;
}
I wouldn't worry about performance too much. This simple example, using an array and binary search lower_bound
#include <stdint.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <memory>

const int N = 16000;

typedef std::pair<uint64_t, uint64_t> CALC;
CALC calc[N];

static inline bool cmp_calcs(const CALC &c1, const CALC &c2)
{
    return c1.first < c2.first;
}

int main(int argc, char **argv)
{
    std::iostream::sync_with_stdio(false);

    for (int i = 0; i < N; ++i)
        calc[i] = std::make_pair(i, i);
    std::sort(&calc[0], &calc[N], cmp_calcs);

    for (long i = 0; i < 10000000; ++i) {
        int r = rand() % 16000;
        CALC *p = std::lower_bound(&calc[0], &calc[N], std::make_pair(r, 0), cmp_calcs);
        if (p->first == r)
            std::cout << "found\n";
    }

    return 0;
}
and compiled with
g++ -O2 example.cpp
does, including setup, 10,000,000 searches in about 2 seconds on my 5 year old PC.
You need to store 16 thousand values efficiently, preferably in memory. We are assuming that the computation of these values is more time consuming than accessing them from storage.
You have at your disposal many different data structures to get the job done, including databases. If you access these values in queryable chunks, then the DB overhead may very well be absorbed and spread across your processing.
You mentioned map and hashmap (or hashtable) already in your question tags, but these are probably not the best possible answers for your problem, although they could do a fair job, provided that the hashing function isn't more expensive than the direct computation of the target UINT64 value, which has to be your reference benchmark.
Van Emde Boas trees
Many variants of B-trees (used extensively in database engines and high-performance filesystems)
Tries
are probably much better suited. Having some experience with them, I would probably go for a B-tree: they support serialization fairly well. That should let you prepare your dataset in advance in a different program. VEB trees have a very good access time (O(log log n)), but I don't know how easily they can be serialized.
Later on, if you need even more performance, it would also be interesting to know usage patterns of your "database" to figure out what caching techniques you could implement on top of the store.
Using std::pair is better than any kind of map for speed.
But if I were you, I'd first use a std::list to store the data; once I had it all, I'd move it into a simple vector; retrieval then goes very fast if you implement a simple binary search yourself.

64-bit NoBarrier_Store() not implemented on this platform

"64-bit NoBarrier_Store() not implemented on this platform"
I use tcmalloc on Windows 7 with VS2005.
There are two threads in my app: one does malloc(), the other does free(). tcmalloc prints this message when my app starts. After debugging, I found that the following function doesn't work on _WIN32:
// Return a suggested delay in nanoseconds for iteration number "loop"
static int SuggestedDelayNS(int loop) {
    // Weak pseudo-random number generator to get some spread between threads
    // when many are spinning.
    static base::subtle::Atomic64 rand;
    uint64 r = base::subtle::NoBarrier_Load(&rand);
    r = 0x5deece66dLL * r + 0xb;   // numbers from nrand48()
    base::subtle::NoBarrier_Store(&rand, r);

    r <<= 16;   // 48-bit random number now in top 48-bits.
    if (loop < 0 || loop > 32) {   // limit loop to 0..32
        loop = 32;
    }
    // loop>>3 cannot exceed 4 because loop cannot exceed 32.
    // Select top 20..24 bits of lower 48 bits,
    // giving approximately 0ms to 16ms.
    // Mean is exponential in loop for first 32 iterations, then 8ms.
    // The futex path multiplies this by 16, since we expect explicit wakeups
    // almost always on that path.
    return r >> (44 - (loop >> 3));
}
I want to know how to avoid this on Win32. Thanks very much.
It seems to be using atomic loads and stores without memory barriers, which might make this work a bit faster on some multi-CPU systems.
On x86, we don't have those types of operations: loads and stores are always visible to the other cores in the system. Cache synchronization is implemented in the hardware and can't be controlled by the program.
Perhaps the atomic library used has Load and Store operations without the NoBarrier prefix? Use those instead.
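If porting the code is an option, here is a sketch of the same function on top of C++11 std::atomic, which expresses the no-barrier intent as relaxed ordering (my rewrite, not tcmalloc's actual fix):

#include <atomic>
#include <cstdint>

static std::atomic<std::uint64_t> rand_state{0};

// Same weak PRNG as above, with relaxed atomics instead of
// base::subtle::NoBarrier_Load/Store.
static int SuggestedDelayNS(int loop) {
    std::uint64_t r = rand_state.load(std::memory_order_relaxed);
    r = 0x5deece66dULL * r + 0xb;            // numbers from nrand48()
    rand_state.store(r, std::memory_order_relaxed);

    r <<= 16;                                // 48-bit random number now in top 48 bits
    if (loop < 0 || loop > 32) loop = 32;    // limit loop to 0..32
    return static_cast<int>(r >> (44 - (loop >> 3)));
}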