Is there a way to simulate cache locality when benchmarking? - c++

I'm trying to figure out the best approach to benchmarking C++ programs, and I would like to simulate both the scenario where the data used by the benchmarked section is already present in the cache and the one where it is cold.
Is there a reliable way to enforce good and bad cache locality on x86-64 machines as a form of preparation for a test run, assuming the data it will involve is known?

Presumably you are benchmarking an algorithm that performs an operation over a range of objects, and you care about the locality of those objects in memory (and thus cache).
To "simulate" locality: Create locality. You can create a linked list with high locality as well as linked list with low locality:
Allocate the nodes in an array. To create a list with high locality, make sure that first element of the array points to the second, and so on. To create list with lower locality, create a random permutation of the order so that each node points to another in a random position of the array.
Make sure that number of elements at least a magnitude greater than the largest cache.
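A minimal sketch of that recipe (sizes and names are illustrative; the timing harness is left to whatever benchmarking framework you use):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    struct Node {
        Node* next = nullptr;
        long payload = 1;
    };

    // Links nodes[order[0]] -> nodes[order[1]] -> ... and returns the head.
    Node* link_in_order(std::vector<Node>& nodes,
                        const std::vector<std::size_t>& order) {
        for (std::size_t i = 0; i + 1 < order.size(); ++i)
            nodes[order[i]].next = &nodes[order[i + 1]];
        nodes[order.back()].next = nullptr;
        return &nodes[order.front()];
    }

    long traverse(const Node* head) {      // the operation to benchmark
        long sum = 0;
        for (; head; head = head->next) sum += head->payload;
        return sum;
    }

    int main() {
        const std::size_t n = 1 << 23;     // 8M nodes (~128 MB), far beyond any cache
        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), std::size_t{0});

        std::vector<Node> hot(n);          // high locality: linked in array order
        Node* hot_head = link_in_order(hot, order);

        std::shuffle(order.begin(), order.end(), std::mt19937{42});
        std::vector<Node> cold(n);         // low locality: linked in shuffled order
        Node* cold_head = link_in_order(cold, order);

        traverse(hot_head);                // wrap each call in your timer of choice
        traverse(cold_head);
    }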

Related

Linked List vs Open Addressing in Hash Table

I have a data set with 130000 elements, and I have two different data structures: a doubly-linked list and a hash table. When inserting data set elements into the linked list, I place the node at the end of the list using a tail pointer. When inserting data set elements into the hash table, I use the open addressing method with a probe function. I encounter 110000 collisions for the last 10 elements in the data set.
However, the difference between the total insertion times of the two data structures is only about 0.0916 seconds:
Linked list = 0.028521 seconds
Hash table = 0.120102 seconds
Are pointer operations slow, or is the probing method very fast?
Insertion at the end of a doubly-linked list via the tail pointer is O(1), as you can read here too.
Insertion into a hash table with open addressing can also be constant, as you can read here too.
However, a correct and efficient implementation of a hash table with open addressing is quite tricky, since a lot of things can go wrong (probing, load factor, hash function, etc.). Even Wikipedia mentions that.
I face 110000 collisions (in the hash table) for last 10 elements
This is a strong indication that something is not good in your hash table implementation.
This explains why the time measurements you made, if they are correct, show the doubly-linked list as faster than your hash table.
Are pointer operations slow, or is the probing method very fast?
Since no actual code is shown, the answer will be rather theoretical.
In terms of performance, the general answer is: cache misses are expensive. With DDR memory at 60 ns latency and a 3.2 GHz CPU, a last-level cache miss stalls the CPU for 60 × 3.2 ≈ 192, or roughly 200, cycles.
Doubly-Linked List
For the doubly-linked list, you have to access the tail pointer, the tail element, and the new element. If you just add elements in a loop, there is a high probability that all those accesses will hit the CPU cache.
In a real-life application, if you do something between the additions, you might incur up to 3 cache misses (tail pointer, tail element, new element).
Hash Table
For a hash table with open addressing, the situation is a bit different. The hash function produces an effectively random index into the table, so the first access to the hash table is usually a cache miss. In your case the hash table is not that big (130K pointers), so it might fit into the L3 cache, but even an access served from L3 stalls the CPU for about 30 cycles.
But what happens next? You just put a pointer into the table; there is no need to update a tail element or the new element. So no cache misses here.
If the hash table slot is occupied, you just check the next one. Such sequential access is easily predicted by the CPU prefetcher, so those accesses usually do not produce cache misses either: the CPU prefetches the next hash table entries into the L1 cache.
So, in a real application, the hash table usually costs one cache miss per insertion, and since the hash index is unpredictable, that first cache miss is unavoidable.
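To make the access pattern concrete, here is a minimal, purely illustrative sketch of open-addressing insertion with linear probing (not the OP's code):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // The first slot access lands at a hash-determined, effectively random
    // index (the likely cache miss); subsequent probes walk sequentially,
    // which the hardware prefetcher handles well.
    struct ProbingSet {
        std::vector<uint32_t> slots;            // 0 means "empty" in this sketch
        explicit ProbingSet(std::size_t capacity) : slots(capacity, 0) {}

        bool insert(uint32_t key) {             // precondition: key != 0
            // Note: std::hash on integers is often the identity; a real
            // table would mix the bits more thoroughly.
            std::size_t i = std::hash<uint32_t>{}(key) % slots.size();
            for (std::size_t probes = 0; probes < slots.size(); ++probes) {
                if (slots[i] == key) return false;           // already present
                if (slots[i] == 0) { slots[i] = key; return true; }
                i = (i + 1) % slots.size();                  // sequential probe
            }
            return false;                       // table is full
        }
    };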
The Answer
To get a practical answer about what is going on in your application, you should use a tool that analyses CPU performance counters, like perf on Linux or VTune on Windows. The tool will show you exactly what your CPU spends its time on.
Real Application
One more theoretical disclaimer here. I suspect that if you fix your hash table (say, use a few elements per bucket rather than pure open addressing) and effectively reduce the number of collisions, the performance could be on par.
Should you use the doubly-linked list over the hash table, or vice versa? It depends on your application. A hash table is good for random access, i.e. you can access any element in O(1) time. With a doubly-linked list you have to traverse the list, so the estimate would be O(n).
On the other hand, just adding elements at the end of the list is not only a cheaper operation but also much easier to implement: you do not have to care about collisions or hash table overflows.
So, in some cases a doubly-linked list has huge advantages over the hash table, and which one suits best depends on the application.

Efficiently insert integers into a growing array if no duplicates

There is a data structure which acts like a growing array. An unknown number of integers will be inserted into it one by one; an integer is inserted if and only if it has no duplicate already in the data structure.
Initially I thought a std::set would suffice: it automatically grows as new integers come in and guarantees no duplicates.
But as the set grows large, the insertion speed goes down. So, any other idea for doing this job besides hashing?
PS: I wonder whether tricks such as XOR-ing all the elements together or building a Sparse Table (as for RMQ) would apply here?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
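A minimal sketch of the bit-field idea (the 2^26 word count is just 2^32 bits / 64):

    #include <cstdint>
    #include <vector>

    // One bit per possible 32-bit value: 2^32 bits = 512 MB.
    struct BitField {
        std::vector<uint64_t> words = std::vector<uint64_t>(uint64_t(1) << 26, 0);

        // Returns true if x was newly inserted, false if already present.
        bool insert(uint32_t x) {
            uint64_t& w = words[x >> 6];                 // word holding bit x
            uint64_t mask = uint64_t(1) << (x & 63);     // bit within the word
            if (w & mask) return false;
            w |= mask;
            return true;
        }
    };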
This site lists all the standard containers and lays out their running times for each operation, so maybe it will be useful:
http://en.cppreference.com/w/cpp/container
It seems that unordered_set, as suggested, is your best way.
You could try a std::unordered_set, which should be implemented as a hash table (incidentally, I do not understand why you write "besides hash"; std::set is normally implemented as a balanced tree, which is likely the reason for the insufficient insertion performance).
If there is some range the numbers fall in, then you can create several std::set as buckets.
EDIT: For the range you have specified, std::set should be fast enough. O(log n) is fast enough for most purposes, unless you have made measurements and found it slow for your case.
You can also use the pigeonhole principle along with sets to reject possible duplicates (applicable when the set grows large).
A bit vector can be useful to detect duplicates
Even more requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
32-bit integers, with about 10,000,000 elements (i.e. any 10M out of the 2^32)
It is a BST (binary search tree) where every node stores two values: the beginning and the end of a continuous region. The first value stores the number where a region starts, the second the number where it ends. This arrangement favours big regions, in the hope that you reach your 10M limit with a very small tree height, making searches cheap. The data structure takes up 8 bytes per node, plus the links (2 × 4 bytes), with at most two children per node. So that makes about 80M for all the 10M elements. And of course, if it is more common for elements to be inserted, you can instead keep track of the ones which are not.
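A minimal sketch of that node layout and the corresponding lookup (illustrative; insertion and region merging are omitted):

    #include <cstdint>

    // One node covers a continuous region [first, last] of inserted integers.
    // Inserting first-1 or last+1 extends a region instead of adding a node;
    // adjacent regions can then be merged, keeping the tree small.
    struct RegionNode {
        uint32_t first, last;          // inclusive region bounds
        RegionNode* left = nullptr;    // regions entirely below `first`
        RegionNode* right = nullptr;   // regions entirely above `last`
    };

    // Membership test: walk the tree like an ordinary BST on region bounds.
    bool contains(const RegionNode* n, uint32_t x) {
        while (n) {
            if (x < n->first)      n = n->left;
            else if (x > n->last)  n = n->right;
            else return true;      // x falls inside this region
        }
        return false;
    }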
Now, if you need to be very careful with space, and after running simulations and/or statistical checks you find that there are lots of small regions (fewer than 32 elements in length), you may want to change your node type to one number which starts the region, plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have continuous chunks of about 8 elements, then your memory requirement becomes 4+1 bytes for the node and 4+4 bytes for the children. Hope this helps.

How to keep a large priority queue with the most relevant items?

In an optimization problem I keep in a queue a lot of candidate solutions which I examine according to their priority.
Each time I handle one candidate, it is removed from the queue, but it produces several new candidates, making the number of candidates grow exponentially. To handle this I assign a relevancy to each candidate; whenever a candidate is added to the queue and there is no more space available, I replace (if appropriate) the least relevant candidate currently in the queue with the new one.
In order to do this efficiently I keep a large (fixed size) array with the candidates and two linked indirect binary heaps: one handles the candidates in decreasing priority order, and the other in ascending relevancy.
This is efficient enough for my purposes, and the supplementary space needed is about 4 ints per candidate, which is also reasonable. However, it is complicated to code and doesn't seem optimal.
My question is whether you know of a more suitable data structure, or a more natural way, to perform this task without losing efficiency.
Here's an efficient solution that doesn't change the time or space complexity over a normal heap:
In a min-heap, every node is less than both its children. In a max-heap, every node is greater than its children. Let's alternate between a min and a max property at each level: every node on an odd row is less than its children and its grandchildren, and the inverse holds for even rows. Then finding the smallest node is the same as usual, and finding the largest node requires looking at the children of the root and taking the larger. Bubbling nodes up (for insertion) becomes a bit trickier, but it's still the same O(log N) complexity.
Keeping track of capacity and popping the smallest (least relevant) node is the easy part.
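For the capacity-and-eviction half alone, here is a hedged sketch using a plain min-heap keyed on relevancy (illustrative only; it does not provide the second, priority-ordered view that a min-max heap or the two linked heaps give):

    #include <cstddef>
    #include <queue>
    #include <vector>

    struct Candidate {
        double priority;   // order in which candidates are examined
        double relevancy;  // decides which candidate gets evicted
    };

    // Comparator that puts the LEAST relevant candidate at top() (a min-heap).
    struct LessRelevant {
        bool operator()(const Candidate& a, const Candidate& b) const {
            return a.relevancy > b.relevancy;
        }
    };

    // Keeps only the `capacity` most relevant candidates ever offered.
    class BoundedPool {
        std::size_t capacity_;
        std::priority_queue<Candidate, std::vector<Candidate>, LessRelevant> heap_;
    public:
        explicit BoundedPool(std::size_t capacity) : capacity_(capacity) {}

        void offer(const Candidate& c) {
            if (heap_.size() < capacity_) { heap_.push(c); return; }
            if (c.relevancy > heap_.top().relevancy) {  // c beats the weakest
                heap_.pop();
                heap_.push(c);
            }
        }
    };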
EDIT: This appears to be a standard min-max heap! See here for a description. There's a C implementation: header, source and example. [Example graph omitted; source: chonbuk.ac.kr]
"Optimal" is hard to judge (near impossible) without profiling.
Sometimes a 'dumb' algorithm can be the fastest, because Intel CPUs are incredibly fast at dumb array scans over contiguous blocks of memory, especially if the loop and the data can fit on-chip. By contrast, jumping around following pointers in a larger block of memory that doesn't fit on-chip can be tens or hundreds of times slower.
You may also run into issues when you try to parallelize your code if the 'clever' data structure introduces locking, preventing multiple threads from progressing simultaneously.
I'd recommend profiling your current approach, the min-max heap, and a simple array scan (no linked lists = less memory) to see which performs best. Odd as it may seem, I have often seen 'clever' algorithms with linked lists beaten by simple array scans in practice, because the simpler approach uses less memory, has a tighter loop, and benefits more from CPU optimizations. You also potentially avoid memory allocations and garbage-collection issues with a fixed-size array holding the candidates.
One option you might want to consider, whatever the solution, is to prune less frequently and remove more elements each time. For example, removing 100 elements on each prune operation means you only need to prune one hundredth as often. That allows a more asymmetric approach to adding and removing elements, as sketched below.
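A hedged sketch of that batch-pruning idea (names and thresholds are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Candidate { double priority; double relevancy; };

    // Instead of evicting one candidate per insert, drop the `batch` least
    // relevant candidates once the pool overflows its capacity.
    void prune(std::vector<Candidate>& pool, std::size_t capacity, std::size_t batch) {
        if (pool.size() <= capacity) return;
        std::size_t keep = capacity > batch ? capacity - batch : 0;
        // Partition so the `keep` most relevant candidates come first; O(n).
        std::nth_element(pool.begin(), pool.begin() + keep, pool.end(),
                         [](const Candidate& a, const Candidate& b) {
                             return a.relevancy > b.relevancy;
                         });
        pool.resize(keep);
    }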
But overall, just bear in mind that the computer-science approach to optimization isn't always the practical approach to the highest performance on today's and tomorrow's hardware.
If you use skip lists instead of heaps, you'll have O(1) time for dequeuing elements while still doing searches in O(log n).
On the other hand a skip list is harder to implement and uses more space than a binary heap.

Most efficient tree structure for what I'm trying to do

I'm wondering what the most generally efficient tree structure would be for a collection that has the following requirements:
The tree will hold anywhere between 0 and 2^32 - 1 items.
Each item will be a simple structure, containing one 32-bit unsigned integer (the item's unique ID, which will be used as the tree value) and two pointers.
Items will be inserted and removed from the tree very often; some items in the tree will remain there for the duration of the program, while others will only be in the tree very briefly before being removed.
Once an item is removed, its unique ID (that 32-bit unsigned integer) will be recycled and reused for a new item.
The tree structure needs to support efficient inserts and deletions, as well as quick lookups by the unique ID. Also, finding the first available unused unique ID needs to be a fast operation.
What sort of tree would be best-suited for these requirements?
EDIT: This tree is going to be held only in memory; at no point will it be persisted to disk. I don't need to worry about hitting the disk, or disk caching, or anything of the sort. This is also why I'm not looking into using something like SQLite.
Depending on how fast you need this to be, you might just treat the whole thing as a single in-memory table mmap-ed onto a file. Addressing is by direct computation. You can simply chain the free slots so you always know exactly where the next free one is. Most lookups will cost at most 1 or 2 disk accesses (depending on the underlying filesystem's requirements). Put a buttload of memory on the machine and you might not hit the disk at all.
I know this sounds pretty brute force, but you'd be amazed how fast it can be.
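A minimal sketch of the chained-free-slot layout (shown purely in memory here; the same flat layout works for an mmap-ed file):

    #include <cstdint>
    #include <vector>

    const uint32_t NO_SLOT = UINT32_MAX;   // marks the end of the free list

    struct SlotTable {
        struct Slot { uint32_t next_free; /* payload fields go here */ };
        std::vector<Slot> slots;
        uint32_t free_head;

        explicit SlotTable(uint32_t n) : slots(n), free_head(0) {
            for (uint32_t i = 0; i < n; ++i)   // chain every slot together
                slots[i].next_free = (i + 1 < n) ? i + 1 : NO_SLOT;
        }
        uint32_t allocate() {                  // O(1): pop the free list
            uint32_t id = free_head;
            if (id != NO_SLOT) free_head = slots[id].next_free;
            return id;                         // NO_SLOT means "table full"
        }
        void release(uint32_t id) {            // O(1): push onto the free list
            slots[id].next_free = free_head;
            free_head = id;
        }
    };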
Update in response to: "I'm not looking for a disk-persistable solution"
Well, if you truly are going to have as many as 2^32 items in this structure (times how big it is) then you either need enough memory on the machine to hold this puppy or the kernel will start to swap things in and out of memory for you. This still translates to hitting the disk. If you let it swap, don't forget to check the size of the swap area, there's a good chance you'll have to bump it. Using mmap (or something similar) is sort of like creating your own private swap area and it will probably have less impact on other processes running on the same system.
I'll note that once this thing exceeds your available physical memory (whether you are using swap space, or mmap, or B-trees, or red-black trees, or extensible hashing, or whatever) it becomes critical to understand your access pattern. If you are hopscotching all over the place, you're going to hit the disk a lot. One of the primary reasons for using a structure like a B-tree (or any of several similar structures) is that the top level of the tree (containing the index) tends to stay in memory (because most paging algorithms use LRU), and you only eat a disk access when you touch a leaf page.
Bottom line: it either fits in memory or it doesn't. If it doesn't then your 10^-9 sec memory access turns into a 10^-3 disk access. I.e. 1 million times slower.
TANSTAAFL!
Have you considered something like a trie? Lookup is linear in key length, which in your case means essentially constant, and storage can be more compact due to nodes sharing common substrings.
Keep in mind, though, that if your data set is actually filling large amounts of your key space your bigger efficiency concern is likely to be caching and disk access, not lookups.
I would go for a red-black tree, because it balances the tree on insertion to ensure optimal insertion/deletion/retrieval. An AVL tree is an option, but it's slightly slower for insertions because it's more rigid about balancing on insertions.
http://en.wikipedia.org/wiki/Red-black_tree
http://en.wikipedia.org/wiki/AVL_tree
My reflex would be to reach for a standard implementation, such as the one in the STL. But supposing you have reasons to implement your own, I would typically go for red-black trees, which perform well on all operations. Alternatively, I would try splay trees, which can be really fast but have amortized complexity, i.e. some individual operations might take a little longer.
Stay away from AVL trees if you need to do a lot of updates. AVL trees are good when you have many lookups but few updates, as the updates can be fairly slow.
Do you expect your tree to really hold 2^32 - 1 entries? Even at half that, I would definitely try this with SQLite. You may be able to fit it all in memory, but if you page once, a database will be faster. Databases are meant to handle huge data sets efficiently, especially when the whole set won't fit in memory at once.
If you do intend to do this yourself, look at some database code and use a B-tree. A red-black tree will be faster with smaller data sets, but with that much data your bottleneck isn't going to be processor speed but memory and hard drive speed.
All that said, I can't imagine a map of pointers that large being useful. You'll be pushing the limits of modern memory just storing the map. You won't have anything left over for the map to point to.
boost::unordered_map has amortized constant time insertions, deletions and lookups. It's the best data structure for what you described.
Its only downside is that it's, well, unordered, as the name says. Also, if you're really unlucky, it could end up taking linear time if every single hash collides. However, that can easily be avoided by using Boost's default boost::hash function, and hashing integers is trivial, so that worst-case scenario will not happen to you.
(Note: it's not a tree but a hash table, and you asked specifically for a "tree". Maybe you assumed that the most efficient approach was some sort of tree? It's not.)
Why a tree at all?
To me it seems you need a database. If you expect a lower node count, a hash table could be enough.
A warning about memory: if you fill the whole tree (2^32 items), you will need 4GB for the values themselves and another 8GB for the pointers. Consider the database if this is likely.
Each item is represented by a 32-bit identity, which is its key, and two pointers. Are the pointers associated with the tree, or do they have to do with the identity?
If they're just part of implementing the tree, ditch them. You don't need them. Represent whether a number is there or not as a bit in a really big bitmap. Finding the lowest unused bit isn't fast, but I don't think it can be. It's only about 512M of main memory, which isn't that bad.
If the pointers are meaningful data, use an array. You're going to have to allocate space for four giganodes plus pointers to make up the map anyway, so allocate space for four giganodes plus one indicator each for whether the node is active or not. Use memset() to set the whole thing to zero, and keep a lowest-unused-node pointer. Use that to add a node. When you delete a node, mark it as unused, and use the pointers to maintain a two-way linked free list. You'll have to find the next lower unused node, and that might take a while, but again I don't see how to keep this fast. (If you just need an unused node, not the lowest one, just put the released node on the free list somewhere.)
This is likely to take about 64G or 96G of RAM, but that's less than a map solution.
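For the bitmap variant, a hedged sketch of the "lowest unused ID" scan (__builtin_ctzll is a GCC/Clang builtin; C++20's std::countr_zero from <bit> is the portable equivalent):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Scan for the first word that is not all-ones, then take its lowest
    // zero bit. Linear in the worst case, but it covers 64 IDs per word
    // and walks memory sequentially (prefetch-friendly).
    uint32_t lowest_unused(const std::vector<uint64_t>& bitmap) {
        for (std::size_t w = 0; w < bitmap.size(); ++w) {
            if (bitmap[w] != ~uint64_t(0)) {
                // Lowest zero bit of a word == lowest set bit of its inverse.
                return uint32_t(w * 64 + __builtin_ctzll(~bitmap[w]));
            }
        }
        return UINT32_MAX;   // every ID is in use
    }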

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations: inserting keys from a very sparse domain (all 32-bit integers, with approx. 100 set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (say 500k insertions, but only 100 distinct keys).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
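A tiny sketch of that run-skipping wrapper (std::set stands in for whatever container you are actually using):

    #include <set>

    // Remembers the last key inserted and short-circuits on an exact repeat,
    // saving a full lookup for every element of a run beyond the first.
    struct RunFilteredSet {
        std::set<int> data;
        int last = 0;
        bool has_last = false;

        void insert(int x) {
            if (has_last && x == last) return;   // repeat of the previous insert
            data.insert(x);
            last = x;
            has_last = true;
        }
    };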
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest approach would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
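For concreteness, a self-contained stand-in: rank in the sorted candidate list serves as the collision-free hash here, at O(log 100) per lookup, where a true perfect hash would be O(1).

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct FlagSet {
        std::vector<uint32_t> candidates;  // sorted list of the ~100 possible values
        std::vector<uint8_t> flags;

        explicit FlagSet(std::vector<uint32_t> values)
            : candidates(std::move(values)), flags(candidates.size(), 0) {
            std::sort(candidates.begin(), candidates.end());
        }
        // "Hash": rank of v among the candidates; valid only for candidate values.
        std::size_t hash(uint32_t v) const {
            return std::lower_bound(candidates.begin(), candidates.end(), v)
                   - candidates.begin();
        }
        void insert(uint32_t v) { flags[hash(v)] = 1; }

        template <typename Visitor>
        void for_each(Visitor visit) const {
            for (std::size_t i = 0; i < flags.size(); ++i)
                if (flags[i]) visit(candidates[i]);  // candidates[] inverts the hash
        }
    };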
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 whenever it gets more than 70% full. Cuckoo hashing has been discussed on Stack Overflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler; on linear data structures this is very simple, so an assembler implementation shouldn't be unduly hard to maintain.
You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list, though I don't know of any decent C++ implementation of it. I intended to submit one to Boost but never got around to doing it.
Maybe a set with a B-tree (instead of a binary tree) as the internal data structure. I found this article on CodeProject which implements one.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? One bit per 32-bit value takes "only" 2^32 / 8 bytes, which comes to 512MB, not much for a high-end server. That would make your insertions O(1). But it could make the iteration slow, although skipping over words that are all zeroes would optimize away most of the iteration. If your 100 numbers fall in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
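A hedged sketch of that brute-force bitmap with the zero-word skip (__builtin_ctzll is GCC/Clang; use std::countr_zero from <bit> in C++20):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct BruteBitmap {
        // 2^26 x 64-bit words = 2^32 bits = 512 MB.
        std::vector<uint64_t> words = std::vector<uint64_t>(uint64_t(1) << 26, 0);

        void insert(uint32_t x) { words[x >> 6] |= uint64_t(1) << (x & 63); }

        template <typename Visitor>
        void for_each(Visitor visit) const {
            for (std::size_t w = 0; w < words.size(); ++w) {
                // All-zero words (the overwhelming majority) cost one test each.
                for (uint64_t bits = words[w]; bits; bits &= bits - 1)
                    visit(uint32_t(w * 64 + __builtin_ctzll(bits)));
            }
        }
    };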
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure whose insertion algorithm causes a page fault will do you no good. In fact, a data structure whose insert merely causes a cache miss would likely be really bad for performance.
Have you made sure that a naive unordered set of elements, packed in a fixed array with a simple swap-to-front when an insert collides, is too slow? It's a simple experiment that might show you have memory locality issues rather than algorithmic issues.
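A minimal sketch of that experiment (the capacity of 128 is an assumption based on the ~100 distinct keys mentioned in the question):

    #include <cstdint>
    #include <utility>

    struct TinySet {
        uint32_t keys[128];
        int count = 0;

        void insert(uint32_t k) {
            for (int i = 0; i < count; ++i) {
                if (keys[i] == k) {
                    std::swap(keys[0], keys[i]);   // move the hot key to the front
                    return;
                }
            }
            if (count < 128) keys[count++] = k;    // new key; capacity assumed ample
        }
        // Iteration is a plain scan over keys[0..count).
    };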