Checksumming large swathes of prime numbers? (for verification) - primes

Are there any clever algorithms for computing high-quality checksums on millions or billions of prime numbers? I.e. with maximum error-detection capability and perhaps segmentable?
Motivation:
Small primes - up to 64 bits in size - can be sieved on demand to the tune of millions per second, by using a small bitmap for sieving potential factors (up to 2^32-1) and a second bitmap for sieving the numbers in the target range.
Algorithm and implementation are reasonably simple and straightforward but the devil is in the details: values tend to push against - or exceed - the limits of builtin integral types everywhere, boundary cases abound (so to speak) and even differences in floating point strictness can cause breakage if programming is not suitably defensive. Not to mention the mayhem that an optimising compiler can wreak, even on already-compiled, already-tested code in a static lib (if link-time code generation is used). Not to mention that faster algorithms tend to be a lot more complicated and thus even more brittle.
This has two consequences: test results are basically meaningless unless the tests are performed using the final executable image, and it becomes highly desirable to verify proper operation at runtime, during normal use.
Checking against pre-computed values would give the highest degree of confidence but the required files are big and clunky. A text file with 10 million primes has on the order of 100 MB uncompressed and more than 10 MB compressed; storing byte-encoded differences requires one byte per prime and entropy coding can at best reduce the size to half (5 MB for 10 million primes). Hence even a file that covers only the small factors up to 2^32 would weigh in at about 100 MB, and the complexity of the decoder would exceed that of the windowed sieve itself.
This means that checking against files is not feasible except as a final release check for a newly-built executable. Not to mention that the trustworthy files are not easy to come by. The Prime Pages offer files for the first 50 million primes, and even the amazing primos.mat.br goes only up to 1,000,000,000,000. This is unfortunate since many of the boundary cases (== need for testing) occur between 2^62 and 2^64-1.
This leaves checksumming. That way the space requirements would be marginal, and only proportional to the number of test cases. I don't want to require that a decent checksum like MD5 or SHA-256 be available, and with the target numbers all being prime it should be possible to generate a high-quality, high-resolution checksum with some simple ops on the numbers themselves.
This is what I've come up with so far. The raw digest consists of four 64-bit numbers; at the end it can be folded down to the desired size.
for (unsigned i = 0; i < ELEMENTS(primes); ++i)
{
    digest[0] *= primes[i];             // running product (must be initialised to 1)
    digest[1] += digest[0];             // sum of sequence of running products
    digest[2] += primes[i];             // running sum
    digest[3] += digest[2] * primes[i]; // Hornerish sum
}
At two (non-dependent) muls per prime the speed is decent enough, and except for the simple sum each of the components has always uncovered all errors I tried to sneak past the digest. However, I'm not a mathematician, and empirical testing is not a guarantee of efficacy.
Are there some mathematical properties that can be exploited to design - rather than 'cook' as I did - a sensible, reliable checksum?
Is it possible to design the checksum in a way that makes it steppable, in the sense that subranges can be processed separately and then the results combined with a bit of arithmetic to give the same result as if the whole range had been checksummed in one go? Same thing as all advanced CRC implementations tend to have nowadays, to enable parallel processing.
EDIT The rationale for the current scheme is this: the count, the sum and the product do not depend on the order in which primes are added to the digest; they can be computed on separate blocks and then combined. The checksum does depend on the order; that's its raison d'être. However, it would be nice if the two checksums of two consecutive blocks could be combined somehow to give the checksum of the combined block.
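To make the combinability argument concrete, here is a minimal sketch (hypothetical struct name, 64-bit words assumed) of how the order-independent components of two adjacent blocks would be merged; since all arithmetic wraps modulo 2^64, the identities hold exactly:

#include <cstdint>

struct partial_digest               // order-independent components only
{
    uint64_t count;                 // number of primes in the block
    uint64_t sum;                   // running sum
    uint64_t product;               // running product (initialised to 1)
};

// Valid regardless of block boundaries or processing order (mod 2^64).
partial_digest combine (const partial_digest &a, const partial_digest &b)
{
    return { a.count + b.count, a.sum + b.sum, a.product * b.product };
}

The order-dependent checksum is deliberately absent here because, as described above, it cannot be merged this way.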
The count and the sum can sometimes be verified against external sources, like certain sequences on oeis.org, or against sources like the batches of 10 million primes at primos.mat.br (the index gives first and last prime, the number == 10 million is implied). No such luck for product and checksum, though.
Before I throw major time and computing horsepower at the computation and verification of digests covering the whole range of small factors up to 2^64 I'd like to hear what the experts think about this...
The scheme I'm currently test-driving in 32-bit and 64-bit variants looks like this:
template<typename word_t>
struct digest_t
{
    word_t count;
    word_t sum;
    word_t product;
    word_t checksum;
    // ...

    void add_prime (word_t n)
    {
        count    += 1;
        sum      += n;
        product  *= n;
        checksum += n * sum + product;
    }
};
This has the advantage that the 32-bit digest components are equal to the lower halves of the corresponding 64-bit values, meaning only 64-bit digests need to be computed and stored even if fast 32-bit verification is desired. A 32-bit version of the digest can be found in this simple sieve test program at pastebin, for hands-on experimentation. The full Monty in a revised, templated version can be found in a newer paste for a sieve that works up to 2^64-1.

I've done a good bit of work parallelizing operations on Cell architectures. This has a similar feel.
In this case, I would use a hash function that's fast and possibly incremental (e.g. xxHash or MurmurHash3) and a hash list (which is a less flexible specialization of a Merkle Tree).
These hashes are extremely fast. It's going to be surprisingly hard to get better with some simple set of operations. The hash list affords parallelism -- different blocks of the list can be handled by different threads, and then you hash the hashes. You could also use a Merkle Tree, but I suspect that'd just be more complex without much benefit.
Virtually divide your range into aligned blocks -- we'll call these microblocks. (e.g. a microblock is a range such as [n<<15, (n+1)<<15) )
To handle a microblock, compute what you need to compute, add it to a buffer, hash the buffer. (An incremental hash function will afford a smaller buffer. The buffer doesn't have to be filled with the same length of data every time.)
Each microblock hash will be placed in a circular buffer.
Divide the circular buffer into hashable blocks ("macroblocks"). Incrementally hash these macroblocks in the proper order as they become available or when there are no more microblocks left.
The resulting hash is the one you want.
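A rough single-threaded sketch of the hash-list idea; 64-bit FNV-1a is used purely as a stand-in for a fast hash such as xxHash or MurmurHash3, and the names and microblock handling are illustrative only:

#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a fast hash (xxHash, MurmurHash3, ...): 64-bit FNV-1a.
static uint64_t fnv1a(const void *data, size_t len, uint64_t h = 14695981039346656037ULL)
{
    const unsigned char *p = static_cast<const unsigned char *>(data);
    for (size_t i = 0; i < len; ++i)
        h = (h ^ p[i]) * 1099511628211ULL;
    return h;
}

// Hash one microblock worth of primes.
static uint64_t hash_microblock(const std::vector<uint64_t> &primes_in_block)
{
    return fnv1a(primes_in_block.data(), primes_in_block.size() * sizeof(uint64_t));
}

// Hash list: the final hash is a hash over the ordered microblock hashes.
static uint64_t hash_list(const std::vector<uint64_t> &microblock_hashes)
{
    return fnv1a(microblock_hashes.data(), microblock_hashes.size() * sizeof(uint64_t));
}

In the threaded version each microblock hash is produced independently and dropped into its slot of the (circular) buffer before the macroblock pass runs over the ordered hashes.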
Some additional notes:
I recommend a design where threads reserve a range of pending microblocks that the circular buffer has space for, process them, dump the values in the circular buffer, and repeat.
This has the added benefit that you can decide how many threads you want to use on the fly. e.g. when requesting a new range of microblocks, each thread could detect if there are too many/too few threads running and adjust.
I personally would have the thread adding the last microblock hash to a macroblock clean up that macroblock. Fewer parameters to tune this way.
Maintaining a circular buffer isn't as hard as it sounds -- the lowest order macroblock still unhandled defines what portion of the "macroblock space" the circular buffer represents. All you need is a simple counter that increments when appropriate to express this.
Another benefit is that since the threads go through a reserve/work/reserve/work cycle on a regular basis, a thread that is unexpectedly slow won't hinder the running time nearly as badly.
If you're looking to make something less robust but easier, you could forgo a good bit of the work by using a "striped" pattern -- decide on the max number of threads (N), and have each thread handle every N-th microblock (offset by its thread "ID") and hash the resulting macroblocks per thread instead. Then at the end, hash the macroblock hashes from the N threads. If you have fewer than N threads, you can divide the work up amongst the number of threads you do want. (e.g. 64 max threads, but three real threads: thread 0 handles 21 virtual threads, thread 1 handles 21 virtual threads, and thread 2 handles 22 virtual threads -- not ideal, but not terrible.) This is essentially a shallow Merkle tree instead of a hash list.

Kaganar's excellent answer demonstrates how to make things work even if the digests for adjacent blocks cannot be combined mathematically to give the same result as if the combined block had been digested instead.
The only drawback of his solution is that the resulting block structure is by necessity rather rigid, rather like PKI with its official all-encompassing hierarchy of certifications vs. 'guerrilla style' PGP whose web of trust covers only the few subjects who are of interest. In other words, it requires devising a global addressing structure/hierarchy.
This is the digest in its current form; the change is that the order-dependent part has been simplified to its essential minimum:
void add_prime (word_t n)
{
    count    += 1;
    sum      += n;
    product  *= n;
    checksum += n * count;
}
Here are the lessons learnt from practical work with that digest:
count, sum and product (i.e. partial primorial modulo word size) turned out to be exceedingly useful because of the fact that they relate to things also found elsewhere in the world, like certain lists at OEIS
count and sum were very useful because the first tends to be naturally available when manipulating (generating, using, comparing) batches of primes, and the sum is easily computed on the fly with zero effort; this allows partial verification against existing results without going the whole hog of instantiating and updating a digest, and without the overhead of two - comparatively slow - multiplications
count is also exceedingly useful as it must by necessity be part of any indexing superstructure built on systems of digests, and conversely it can guide the search straight to the block (range) containing the nth prime, or to the blocks overlapped by the nth through (n+k)th primes
the order dependency of the fourth component (checksum) turned out to be less of a hindrance than anticipated, since small primes tend to 'occur' (be generated or used) in order, in situations where verification might be desired
the order dependency of the checksum - and lack of combinability - made it perfectly useless outside of the specific block for which it was generated
fixed-size auxiliary program structures - like the ubiquitous small factor bitmaps - are best verified as raw memory for startup self-checks, instead of running a primes digest on them; this drastically reduces complexity and speeds things up by several orders of magnitude
For many practical purposes the order-dependent checksum could simply be dropped, leaving you with a three-component digest that is trivially combinable for adjacent ranges.
For verification of fixed ranges (like in self-tests) the checksum component is still useful. Any other kind of checksum - the moral equivalent of a CRC - would be just as useful for that and probably faster. It would be even more useful if an order-independent (combinable) way of supplementing the resolution of the first three components could be found. Extending the resolution beyond the first three components is most relevant for bigger computing efforts, like sieving, verifying and digesting trillions of primes for posterity.
One such candidate for an order-independent, combinable fourth component is the sum of squares.
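As a sketch (sq_sum being a hypothetical extra member of the digest), that would simply be:

void add_prime (word_t n)
{
    count   += 1;
    sum     += n;
    product *= n;
    sq_sum  += n * n;   // order-independent; blocks combine by addition, like count and sum
}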
Overall the digest turned out to be quite useful as is, despite the drawbacks concerning the checksum component. The best way of looking at the digest is probably as consisting of a 'characteristic' part (the first three components, combinable) and a checksum part that is only relevant for the specific block. The latter could just as well be replaced with a hash of any desired resolution. Kaganar's solution indicates how this checksum/hash can be integrated into a system that extends beyond a single block, despite its inherent non-combinability.
The summary of prime number sources seems to have fallen by the wayside, so here it is:
up to 1,000,000,000,000 available as files from sites like primos.mat.br
up to 2^64 - 10*2^32 in super-fast bulk via the primesieve.org console program (pipe)
up to 2^64-1 - and beyond - via the gp/PARI program (pipe, about 1 million primes/minute)

I'm answering this question again in a second answer since this is a very different and hopefully better tack:
It occurred to me that what you're doing is basically looking for a checksum, not over a list of primes, but over a range of a bitfield where a number is prime (bit is set to 1) or it's not (bit is set to 0). You're going to have a lot more 0's than 1's for any interesting range, so you hopefully only have to do an operation for the 1's.
Typically the problem with using a trivial in-any-order hash is that it handles multiplicity poorly and is oblivious to order. But you don't care about either of these problems -- every bit can only be set or unset once.
From that point of view, a bitwise-exclusive-or or addition should be just fine if combined with a good hashing function of the index of the bit -- that is, the found prime. (If your primes are 64-bit you could go with some of the functions here.)
So, for the ultimate simplicity that will give you the same value for any set of ranges of inputs, yes, stick to hashing and combining it with a simple operation like you are. But change to a traditional hash function which appears "random" given its input -- hash64shift on the linked page is likely what you're looking for. The probability of a meaningful collision is remote. Most hash functions stink, however -- make sure you pick one that is known to have good properties. (Avalanches well, etc.) Thomas Wang's are usually not so bad. (Bob Jenkins's are fantastic, but he sticks mostly to 32-bit functions; his mix function on the linked page is very good, but probably overkill.)
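For illustration, a sketch of that scheme using one widely circulated variant of Thomas Wang's 64-bit mix (any well-avalanching 64-bit hash would do); the accumulator is order-independent, so block results combine with a single addition (or XOR):

#include <cstdint>

// One widely circulated variant of Thomas Wang's 64-bit mix function.
static uint64_t hash64shift(uint64_t key)
{
    key = (~key) + (key << 21);            // key = (key << 21) - key - 1
    key = key ^ (key >> 24);
    key = (key + (key << 3)) + (key << 8); // key * 265
    key = key ^ (key >> 14);
    key = (key + (key << 2)) + (key << 4); // key * 21
    key = key ^ (key >> 28);
    key = key + (key << 31);
    return key;
}

// Order-independent digest over the set of primes found:
// blocks can be processed in any order and combined with '+' (or '^').
struct set_digest
{
    uint64_t acc = 0;
    void add_prime(uint64_t p)            { acc += hash64shift(p); }
    void combine(const set_digest &other) { acc += other.acc; }
};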
Parallelizing the check is obviously trivial, the code size and effort is vastly reduced from my other answer, and there's much less synchronization and almost no buffering that needs to occur.

Related

High performance table structure for really small tables (<10 items usually) where once the table is created it doesn't change?

I am searching for a high performance C++ structure for a table. The table will have void* as keys and uint32 as values.
The table itself is very small and will not change after creation. The first idea that came to my mind is using something like ska::flat_hash_map<void*, int32_t> or std::unordered_map<void*, int32_t>. However that would be overkill and would not give me the performance I want (those tables are suited for a high number of items too).
So I thought about using std::vector<std::pair<void*, int32_t>>, sorting it upon creation and linearly probing it. The next idea is to use SIMD instructions, if that is possible with the current structure.
Another solution which I will shortly evaluate is like that:
struct Group
{
    void*   items[5];  // search using SIMD
    int32_t values[5];
};  // fits in cache line

struct Table
{
    Group* groups;
    size_t capacity;
};
Are there any better options? I need only 1 operation: finding values by keys, not modifying them, not anything. Thanks!
EDIT: another thing I think I should mention are the access patterns: suppose I have an array of those hash tables, each time I will look up from a random one in the array.
Linear probing is likely the fastest solution in this case on common mainstream architectures, especially since the number of elements is very small and bounded (i.e. <10). Sorting the items should not speed up the probing with so few items (it would only be useful for a binary search, which is much more expensive in this case).
If you want to use SIMD instructions, then you need to use a structure of arrays instead of an array of structures for the sake of performance. This means you should use std::pair<std::vector<void*>, std::vector<int32_t>> instead of std::vector<std::pair<void*, int32_t>> (which alternates void* keys and int32_t values in memory, with some padding overhead due to the alignment constraints of void* on 64-bit architectures). Having two std::vector is not great either because you pay their overhead twice. As mentioned by @JorgeBellon in the comments, you can simply use a std::array instead of std::vector, assuming the number of items is known or bounded.
A possible optimization with SIMD instructions is to compact the key pointers on 64-bit architectures by splitting them into a 32-bit lower and upper part. Indeed, it is very unlikely that two pointers have the same lower part (least significant bits) while having a different upper part. This trick lets you check twice as many pointers at a time.
Note that using SIMD instructions may not be so great in this case in practice. This is especially true if the number of items is smaller than the number fitting in a SIMD vector. For example, with AVX2 (on x86-64 processors), you can work on 4 64-bit values at a time (or 8 32-bit values), but if you have fewer than 8 values, then you need to mask the unwanted values to check (or even not load them if the memory buffer does not contain some padding). This introduces an additional overhead. This is not much of a problem with AVX-512 and SVE (only available on a small fraction of processors yet) since they provide advanced masking operations. Moreover, some processors lower their frequency when they execute SIMD instructions (especially with AVX-512, although the down-clocking is not so strong with integer instructions). SIMD instructions also introduce some additional latency compared to a scalar version (which can be better pipelined), and modern processors tend to be able to execute more scalar instructions in parallel than SIMD ones. For all these reasons, it is certainly a good idea to try to write a scalar branchless implementation (possibly unrolled for better performance if the number of items is known at compile time).
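For illustration, a minimal structure-of-arrays table with a scalar, nearly branchless linear scan; the fixed capacity of 8 and the names are assumptions, not a drop-in implementation:

#include <array>
#include <cstdint>

// Structure of arrays: the keys are packed together so the scan typically
// touches a single cache line; values are only read once the index is known.
struct SmallTable
{
    std::array<void*, 8>   keys{};    // unused slots stay nullptr
    std::array<int32_t, 8> values{};
    int                    size = 0;

    // Returns fallback if the key is absent.
    int32_t find(void *key, int32_t fallback = -1) const
    {
        int found = -1;
        for (int i = 0; i < size; ++i)          // small, bounded trip count
            found = (keys[i] == key) ? i : found;
        return found >= 0 ? values[found] : fallback;
    }
};

The conditional inside the loop can usually be compiled to a conditional move, so the scan itself contains no unpredictable branches.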
You may want to look into perfect hashing -- not too difficult, and can provide simple constant time lookups. It can take technically unbounded time to create the table, though, and it's not as fast as a regular hash table when the regular hash table gets lucky.
I think a nice alternative is an optimization of your simple linear probing idea.
Your lookup procedure would look like this:
Slot *s = &table[hash(key)];
Slot *e = s + s->max_extent;

for (; s < e; ++s) {
    if (s->key == key) {
        return s->value;
    }
}
return NOT_FOUND;
table[h].max_extent is the maximum number of elements you may have to look at if you're looking for an element with hash code h. You would pre-calculate this when you generate the table, so your lookup doesn't have to iterate until it gets a null. This greatly reduces the amount of probing you have to do for misses.
Of course you want max_extent to be as small as possible. Pick a hash result size (at least 2n) to make it <= 1 in most cases, and try a few different hash functions before picking the one that produces the best results by whatever metric you like. Your hash can be as simple as key % P, where trying different hashes means trying different P values. Fill your hash table in hash(key) order to produce the best result.
NOTE that we do not wrap around from the end to the start of the table while probing. Just allocate however many extra slots you need to avoid it.
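A sketch of how such a table and its max_extent fields might be built; Slot, HASH_RANGE and the hash signature are assumptions, and the hash is presumed to already return a value in [0, HASH_RANGE):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Slot {
    void    *key = nullptr;
    int32_t  value = 0;
    int32_t  max_extent = 0;
};

constexpr size_t HASH_RANGE = 16;   // at least 2x the item count, as suggested above

std::vector<Slot> build(std::vector<std::pair<void*, int32_t>> items,
                        size_t (*hash)(void*))
{
    // Extra slack at the end avoids wrap-around during probing.
    std::vector<Slot> table(HASH_RANGE + items.size());

    // Fill in hash(key) order so every key lands as close to its home slot as possible.
    std::sort(items.begin(), items.end(),
              [&](const auto &a, const auto &b) { return hash(a.first) < hash(b.first); });

    size_t next_free = 0;
    for (const auto &kv : items) {
        size_t home = hash(kv.first);
        size_t pos  = std::max(home, next_free);   // first free slot at or after home
        table[pos].key   = kv.first;
        table[pos].value = kv.second;
        next_free = pos + 1;
        // Record the furthest slot a lookup for this home bucket has to examine.
        table[home].max_extent = std::max(table[home].max_extent,
                                          int32_t(pos - home + 1));
    }
    return table;
}

With the table filled this way, the lookup shown above never probes past max_extent slots, and a miss in an untouched bucket costs zero probes since max_extent stays 0.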

Why does LLVM choose open-addressing hash table to implement llvm::StringMap?

Many sources say open-addressing, the hash collision handling approach used in llvm::StringMap, is not stable. Open-addressing is said to be inferior to chaining when the load factor is high (which is imaginable).
But if the load factor is low, there will be a huge memory waste for open-addressing, because I have to allocate Bucket_number * sizeof(Record) bytes in memory, even if the majority of buckets do not hold a record.
So my question is, what is the reason for LLVM to choose open-addressing over separate-chaining? Is it merely because of the advantage in speed gained by cache locality (records are stored in buckets themselves)?
Thanks:)
Edit: The C++11 standard's requirements on std::unordered_set and std::unordered_map imply the chaining approach, not open-addressing. Why does LLVM choose a hash collision handling method that can't even satisfy the C++ standard? Are there any special use cases for llvm::StringMap that warrant this deviation? (Edit: this slide deck compares the performance of several LLVM data structures with that of STL counterparts)
Another question, by the way:
How does llvm::StringMap guarantee that keys' hash values are not recomputed when growing?
The manual says:
hash table growth does not recompute the hash values for strings already in the table.
Let us look at the implementation. Here we see that the table is stored as parallel arrays: indirect pointers to the records alongside an array of cached 32-bit hash codes; that is, separate arrays rather than an array of structures.
Effectively:
struct StringMap {
    uint32_t hashcode[CAPACITY];
    StringMapEntry *hashptr[CAPACITY];
};
Except that the capacity is dynamic and the load factor would appear to be maintained at between 37.5% and 75% of capacity.
For N records and a load factor F this yields N/F pointers plus N/F integers for the open-addressed implementation, as compared to N*(1+1/F) pointers plus N integers for the equivalent chained implementation. On a typical 64-bit system the open-address version is between ~4% larger and ~30% smaller.
However, as you rightly suspected, the main advantage lies in cache effects. Beyond reducing cache pressure by shrinking the data, filtering out collisions boils down to a linear reprobe over consecutive 32-bit hash codes, without examining any further information. Rejecting a collision is therefore much faster than in the chained case, in which the link must be followed into what is likely uncached storage, and therefore a significantly higher load factor can be used. On the flip side, one additional likely cache miss must be taken on the pointer lookup table, but this is a constant that does not degrade with load, equivalent to one chained collision.
Effectively:
StringMapEntry *StringMap::lookup(const char *text) {
    uint32_t hash_value = hash_function(text);
    for (uint32_t *scan = &hashcode[hash_value % CAPACITY]; *scan != SENTINEL; ++scan) {
        if (hash_value == *scan) {
            StringMapEntry *entry = hashptr[scan - hashcode];
            if (!std::strcmp(entry->text, text))
                return entry;
        }
    }
    return nullptr;  // not found
}
Plus subtleties such as wrapping.
As for your second question, the optimization is to precalculate and store the hash keys. This wastes some storage but prevents the expensive operation of examining potentially long variable-length strings unless a match is almost certain. In degenerate cases (complex template name mangling) the keys may be hundreds of characters long.
A further optimization in RehashTable is to use a power-of-two instead of a prime table size. This ensures that growing the table effectively brings one additional hash code bit into play and de-interleaves the doubled table into two consecutive target arrays, effectively rendering the operation a cache-friendly linear sweep.

CUDA computing a histogram with shared memory

I'm following a Udacity problem set lesson to compute a histogram of numBins elements out of a long series of numElems values. In this simple case each element's value is also its own bin in the histogram, so generating the histogram with CPU code is as simple as
for (i = 0; i < numElems; ++i)
    histo[val[i]]++;
I don't get the video explanation for a "fast histogram computation" according to which I should sort the values by a 'coarse bin id' and then compute the final histogram.
The question is:
why should I sort the values by 'coarse bin indices'?
why should I sort the values by 'coarse bin indices'?
This is an attempt to break down the work into pieces that can be handled by a single threadblock. There are several considerations here:
On a GPU, it's desirable to have multiple threadblocks so that all SMs can be engaged in solving the problem.
A given threadblock lives and operates on a single SM, so it is confined to the resources available on that SM, the primary limits being the number of threads and the size of available shared memory.
Since shared memory especially is limited, the division of work creates a smaller-sized histogram operation for each threadblock, which may fit in the SM shared memory whereas the overall histogram range may not. For example if I am histogramming over a range of 4 decimal digits, that would be 10,000 bins total. Each bin would probably need an int value, so that is 40Kbytes, which would just barely fit into shared memory (and might have negative performance implications as an occupancy limiter). A histogram over 5 decimal digits probably would not fit. On the other hand, with a "coarse bin sort" of a single decimal digit, I could reduce the per-block shared memory requirement from 40Kbytes to 4Kbytes (approximately).
Shared memory atomics are often considerably faster than global memory atomics, so breaking down the work this way allows for efficient use of shared memory atomics, which may be a useful optimization.
so I will have to sort all the values first? Isn't that more expensive than reading and doing an atomicAdd into the right bin?
Maybe. But the idea of a coarse bin sort is that it may be computationally much less expensive than a full sort. A radix sort is a commonly used, relatively fast sorting operation that can be done in parallel on a GPU. Radix sort has the characteristic that the sorting operation begins with the most significant "digit" and proceeds iteratively to the least significant digit. However a coarse bin sort implies that only some subset of the most significant digits need actually be "sorted". Therefore, a "coarse bin sort" using a radix sort technique could be computationally substantially less expensive than a full sort. If you sort only on the most significant digit out of 3 digits as indicated in the udacity example, that means your sort is only approximately 1/3 as expensive as a full sort.
I'm not suggesting that this is a guaranteed recipe for faster performance in every case. The specifics matter (e.g. size of histogram, range, final number of bins, etc.) The specific GPU you use may impact the tradeoff also. For example, Kepler and newer devices will have substantially improved global memory atomics, so the comparison will be substantially impacted by that. (OTOH, Pascal has substantially improved shared memory atomics, which will once again affect the comparison in the other direction.)
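Purely to illustrate the idea on the CPU side (this is not GPU code), here is a sketch in which values are first partitioned by a coarse bin, the leading decimal digit, and each partition is then histogrammed with a small local array standing in for the per-block shared-memory histogram:

#include <cstdint>
#include <vector>

// Histogram values in [0, 10000) (4 decimal digits).
// Coarse bin = leading digit, so each partition only needs 1000 local bins
// (the shared-memory analogue) instead of all 10000.
std::vector<uint32_t> histogram_coarse(const std::vector<uint32_t> &vals)
{
    std::vector<std::vector<uint32_t>> parts(10);      // "coarse bin sort"
    for (uint32_t v : vals)
        parts[v / 1000].push_back(v);

    std::vector<uint32_t> histo(10000, 0);
    for (uint32_t coarse = 0; coarse < 10; ++coarse)
    {
        std::vector<uint32_t> local(1000, 0);          // stands in for shared memory
        for (uint32_t v : parts[coarse])
            ++local[v % 1000];
        for (uint32_t fine = 0; fine < 1000; ++fine)
            histo[coarse * 1000 + fine] += local[fine];
    }
    return histo;
}

On the GPU each coarse partition would map to a threadblock that builds its 1000-bin histogram in shared memory with atomicAdd before writing the result back out.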

boost::flat_map and its performance compared to map and unordered_map

It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map which is a vector based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map so I haven't been able to find any performance comparisons. How does it compare and what are the best use cases for it?
Thanks!
I have run a benchmark on different data structures very recently at my company so I feel I need to drop a word. It is very complicated to benchmark something correctly.
Benchmarking
On the web, we rarely find (if ever) a well-engineered benchmark. Until today I only found benchmarks that were done the journalist way (pretty quickly and sweeping dozens of variables under the carpet).
1) You need to consider cache warming
Most people running benchmarks are afraid of timer discrepancy, therefore they run their stuff thousands of times and take the whole time; they are just careful to run every operation the same number of times, and then consider that comparable.
The truth is, in the real world it makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things by calling them once only.
Intel has made a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).
2) RDTSC accuracy measure
I also recommend doing this:
u64 g_correctionFactor;  // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // flush OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warmup cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer overhead with the measurer (crazy he..)
    u64 minDiff = LLONG_MAX;
    u64 maxDiff = 0;  // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);  // make many takes, take the smallest that ever come.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif
This is a discrepancy measurer, and it takes the minimum of all measured values, to avoid getting a value like -10^18 (among the first negative 64-bit values) from time to time.
Notice the use of intrinsics and not inline assembly. Inline assembly is rarely supported by compilers nowadays, but much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside, so this is a problem when benchmarking real-world stuff, especially when calling stuff just once. An intrinsic is suited here because it doesn't break the compiler's free re-ordering of instructions.
3) parameters
The last problem is people usually test for too few variations of the scenario.
A container's performance is affected by:
1. the allocator
2. the size of the contained type
3. the cost of the copy operation, assignment operation, move operation and construction operation of the contained type
4. the number of elements in the container (size of the problem)
5. whether the type has trivial versions of the operations in point 3
6. whether the type is POD
Point 1 is important because containers do allocate from time to time, and it matters a lot if they allocate using the CRT "new" or some user-defined operation, like pool allocation or freelist or other...
(for people interested about pt 1, join the mystery thread on gamedev about system allocator performance impact)
Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.
Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.
Point 4 is a question of big O mixed with cache issues. Some bad-complexity containers can largely outperform low-complexity containers for a small number of elements (like vector vs. map: the vector's cache locality is good, whereas map fragments the memory). And then at some crossing point they will lose, because the contained overall size starts to "leak" to main memory and cause cache misses, plus the fact that the asymptotic complexity can start to be felt.
Point 5 is about compilers being able to elide stuff that is empty or trivial at compile time. This can greatly optimize some operations because the containers are templated, therefore each type will have its own performance profile.
Point 6 same as point 5, PODs can benefit from the fact that copy construction is just a memcpy, and some containers can have a specific implementation for these cases, using partial template specializations, or SFINAE to select algorithms according to traits of T.
About the flat map
Apparently, the flat map is a sorted vector wrapper, like Loki AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insert and delete of single elements.
This is still an ordered container. Most people usually don't need the ordering part, therefore the existence of unordered...
Have you considered that maybe you need a flat_unorderedmap? which would be something like google::sparse_map or something like that—an open address hash map.
The problem of open address hash maps is that at the time of rehash they have to copy everything around to the new extended flat land, whereas a standard unordered map just has to recreate the hash index, while the allocated data stays where it is. The disadvantage of course is that the memory is fragmented like hell.
The criterion of a rehash in an open address hash map is when the capacity exceeds the size of the bucket vector multiplied by the load factor.
A typical load factor is 0.8; therefore, you need to care about that, if you can pre-size your hash map before filling it, always pre-size to: intended_filling * (1/0.8) + epsilon this will give you a guarantee of never having to spuriously rehash and recopy everything during filling.
The advantage of closed address maps (std::unordered..) is that you don't have to care about those parameters.
But the boost::flat_map is an ordered vector; therefore, it will always have a log(N) asymptotic complexity, which is less good than the open address hash map (amortized constant time). You should consider that as well.
Benchmark results
This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.
tested types information:
typeid=__int64 . sizeof=8 . ispod=yes
typeid=struct MediumTypePod . sizeof=184 . ispod=yes
Insertion
EDIT:
My previous results included a bug: they actually tested ordered insertion, which exhibited a very fast behavior for the flat maps.
I left those results later down this page because they are interesting.
This is the correct test:
I have checked the implementation, there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, therefore this benchmark exhibits the asymptotic tendencies:
map: O(N * log(N))
hashmaps: O(N)
vector and flatmaps: O(N * N)
Warning: hereafter the 2 tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs random insertion for other containers. yes it's confusing sorry):
We can see that ordered insertion results in back pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimality for a back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector, which gives us 3 million cycles; we observe 4.8M here for the ordered insertion into the flat_map (therefore 160% of the optimal).
Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half the data (on average) upward, one element at a time, at each insertion.
Random search of 3 elements (clocks renormalized to 1): charts for container sizes 100 and 10,000.
Iteration: charts for container sizes 100 and 10,000 (MediumPod type only).
Final grain of salt
In the end, I wanted to come back on "Benchmarking §3 Pt1" (the system allocator). In a recent experiment on the performance of an open-address hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.
From the docs it seems this is analogous to Loki::AssocVector which I'm a fairly heavy user of. Since it's based on a vector it has the characteristics of a vector, that is to say:
Iterators get invalidated whenever the size grows beyond capacity.
When it grows beyond capacity it needs to reallocate and move objects over, i.e. insertion isn't guaranteed constant time except for the special case of inserting at the end when capacity > size
Lookup is faster than std::map due to cache locality; it's a binary search, which otherwise has the same performance characteristics as std::map
Uses less memory because it isn't a linked binary tree
It never shrinks unless you forcibly tell it to (since that triggers reallocation)
The best use is when you know the number of elements in advance ( so you can reserve upfront ), or when insertion / removal is rare but lookup is frequent. Iterator invalidation makes it a bit cumbersome in some use cases so they're not interchangeable in terms of program correctness.

Entropy and parallel random number generator seeding

I have a loop where I am adding noise to some points; these are being later used as the basis for some statistical tests.
The datasets involved are quite large, so I would like to parallelise it using OpenMP to speed things up. The issue comes up when I want to have multiple PRNGs. I have my own PRNG class based upon NR's modulo method (rand4 I think), but I am unsure how to seed the PRNGs correctly to ensure appropriate entropy.
Normally I would do something like this
prng.initTimer();
But if I have an array of prngs, one per worker thread, then I cannot simply call initTimer on each instance -- the timer may not change, and the timers being close may introduce correlation.
I need to protect against natural correlations, not against malicious attackers (this is experimental data), so I need to have a safe way of seeding the rng array.
I thought of simply using
prng[0].initTimer();
for (int i = 1; i < numRNGs; i++)
    prng[i].init(prng[0].getRandNum());
Then calling my loop, but am unsure if this will introduce correlations in the modulo method.
Seeding PRNGs doesn't necessarily create independent streams. You should seed only the first instance (call it the reference) and initialise the remaining instances by fast-forwarding the reference instance. This only works if you know how many random numbers each thread will consume and a fast-forwarding algorithm is available.
I don't know much about your rand4 (googled it, but nothing specific came up), but you shouldn't assume that it is possible to create independent streams just by seeding. You probably want to use a different (better) PRNG. Take a look at WELL. It is fast, has good statistical properties and was developed by well-known experts. WELL 512 and 1024 are among the fastest PRNGs available and both have huge periods. You can initialise several WELL instances with distinct seeds in order to create independent streams. Thanks to the huge period there is almost zero chance that your PRNGs will generate overlapping streams of random numbers.
If your PRNGs are called frequently, beware of false sharing. This article by Herb Sutter explains how false sharing can kill multi-core performance. Packing multiple PRNGs into a contiguous array is almost a perfect recipe for false sharing. In order to avoid false sharing either add padding between PRNGs or allocate PRNGs on the heap/free store. In the latter case each RNG should be allocated individually using some sort of aligned allocator. Your compiler should provide a version of aligned malloc. Check the docs (well, googling is actually faster than reading manuals). Visual C++ has _aligned_malloc, GCC has memalign and posix_memalign. The alignment value must be a multiple of the CPU's cache line size. The common practice is to align along 128-byte boundaries. For a portable solution you can use TBB's cache aligned allocator.
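A sketch of the padding approach; MyPRNG is a placeholder for the questioner's rand4-based class (only its size matters here), the 64-byte line size is an assumption (use 128 to follow the practice above), and C++17 is assumed so that std::vector honours the over-alignment:

#include <cstdint>
#include <vector>

// Placeholder for the questioner's PRNG class; any small state works here.
struct MyPRNG
{
    uint64_t state = 0;
    void     init(uint64_t seed) { state = seed; }
    uint64_t getRandNum()        { return state = state * 6364136223846793005ULL
                                                 + 1442695040888963407ULL; }
};

// alignas(64) rounds each element up to a presumed 64-byte cache line, so
// adjacent generators in the array never share a line.
struct alignas(64) PaddedPRNG
{
    MyPRNG rng;
};

std::vector<PaddedPRNG> prngs(8);   // one padded generator per worker thread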
I think it depends on the properties of your PRNG. Usual PRNG weaknesses are lower entropy in the lower bits and lower entropy for the first n values. So I think you should check your PRNG for such weaknesses and change your code accordingly.
Perhaps some of the diehard tests give useful information, but you can also check the first n values and their statistical properties like sum and variance yourself and compare them to the expected values.
For example, seed the PRNG and sum up the first 100 values modulo 11 of your PRNG, repeat this R times. If the total sum is very different from the expected (5*100*R), your PRNG suffers from one or both weaknesses mentioned above.
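A sketch of that check, written against the question's prng interface (the repetition count, the per-repetition seeds and the 5% tolerance are arbitrary placeholders):

// Sum the first 100 values modulo 11, repeated R times with fresh seeds.
// Values mod 11 are roughly uniform on 0..10 (mean ~5), so the total
// should land near 5 * 100 * R for a well-behaved generator.
const int R = 1000;
unsigned long long total = 0;
for (int r = 0; r < R; ++r)
{
    prng.init(r + 1);                       // placeholder: distinct seed per repetition
    for (int i = 0; i < 100; ++i)
        total += prng.getRandNum() % 11;
}
const double expected = 5.0 * 100 * R;
if (total < 0.95 * expected || total > 1.05 * expected)
    printf("suspicious PRNG: sum %llu, expected ~%.0f\n", total, expected);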
Knowing nothing about the PRNG, I'd feel safer using something like this:
prng[0].initTimer();

// Throw the first 100 values away
for (int i = 0; i < 100; i++)
    prng[0].getRandNum();

// Use only higher bits for seed values (assuming 32 bit size)
for (int i = 1; i < numRNGs; i++)
    prng[i].init(((prng[0].getRandNum() >> 16) << 16)
                 + (prng[0].getRandNum() >> 16));
But of course, these are speculations about the PRNG. With an ideal PRNG, your approach should work fine as it is.
If you seed your PRNGs using a sequence of numbers from the same type of PRNG, they will all be producing the same sequence of numbers, offset by one from each other. If you want them to produce different numbers, you will need to seed them with a sequence of pseudorandom numbers from a different PRNG.
Alternatively, if you are on a unix-like system with a /dev/random, you can just read from that device to get a sequence of random numbers to use as your seeds.