How to keep a large priority queue with the most relevant items? - c++

In an optimization problem I keep in a queue a lot of candidate solutions which I examine according to their priority.
Each time I handle one candidate, it is removed form the queue but it produces several new candidates making the number of cadidates to grow exponentially. To handle this I assign a relevancy to each candidate, whenever a candidate is added to the queue, if there is no more space avaliable, I replace (if appropiate) the least relevant candidate currently in the queue with the new one.
In order to do this efficiently I keep a large (fixed size) array with the candidates and two linked indirect binary heaps: one handles the candidates in decreasing priority order, and the other in ascending relevancy.
This is efficient enough for my purposes and the supplementary space needed is about 4 ints/candidate which is also reasonable. However it is complicated to code, and it doesn't seem optimal.
My question is if you know of a more adequate data structure or of a more natural way to perform this task without losing efficiency.

Here's an efficient solution that doesn't change the time or space complexity over a normal heap:
In a min-heap, every node is less than both its children. In a max-heap, every node is greater than its children. Let's alternate between a min and max property for each level making it: every odd row is less than its children and its grandchildren, and the inverse for even rows. Then finding the smallest node is the same as usual, and finding the largest node requires that we look at the children of the root and take the largest. Bubbling nodes (for insertion) becomes a bit tricker, but it's still the same O(logN) complexity.
Keeping track of capacity and popping the smallest (least relevant) node is the easy part.
EDIT: This appears to be a standard min-max heap! See here for a description. There's a C implementation: header, source and example. Here's an example graph:
(source: chonbuk.ac.kr)

"Optimal" is hard to judge (near impossible) without profiling.
Sometimes a 'dumb' algorithm can be the fastest because intel CPUs are incredibly fast at dumb array scans on contiguous blocks of memory especially if the loop and the data can fit on-chip. By contrast, jumping around following pointers in a larger block of memory that doesn't fit on-chip can be tens or hundreds or times slower.
You may also have the issues when you try to parallelize your code if the 'clever' data structure introduces locking thus preventing multiple threads from progressing simultaneously.
I'd recommend profiling both your current, the min-max approach and a simple array scan (no linked lists = less memory) to see which performs best. Odd as it may seem, I have seen 'clever' algorithms with linked lists beaten by simple array scans in practice often because the simpler approach uses less memory, has a tighter loop and benefits more from CPU optimizations. You also potentially avoid memory allocations and garbage collection issues with a fixed size array holding the candidates.
One option you might want to consider whatever the solution is to prune less frequently and remove more elements each time. For example, removing 100 elements on each prune operation means you only need to prune 100th of the time. That may allow a more asymmetric approach to adding and removing elements.
But overall, just bear in mind that the computer-science approach to optimization isn't always the practical approach to the highest performance on today and tomorrow's hardware.

If you use skip-lists instead of heaps you'll have O(1) time for dequeuing elements while still doing searches in O(logn).
On the other hand a skip list is harder to implement and uses more space than a binary heap.

Related

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?
I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power of 2 sizes, and traditionally resizes at 75% occupation. Since I know I am going to be filling up the hash table, I pre-emptively double it's capacity to keep it sparse enough. This might be less efficient when adding small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my red/black tree implementation before. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would help make a cheaper hash.
Billy ONeal suggested, given a small number of elements (1024) that a simple linear search in a fixed array would be faster. I followed his advice and implemented one for side by side comparison. On my target hardware (roughly first generation iPhone) the hash table outperformed a linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependant. Cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice and try and profile it first.

Approximate sort (array/vector), predictable runtime

Background:
I need to process some hundred thousand events (producing results) given a hard time limit. The clock is literally ticking, and when the timer fires, whatever is done at that point must be flushed out.
What isn't ready by that time is either discarded (depending on an importance metric) or processed during the next time quantum (with an "importance boost", i.e. adding a constant to the importance metric).
Now ideally, the CPU is much faster than needed, and the whole set is ready a long time before the end of the time slice. Unluckily, the world is rarely ever ideal, and "hundred thousands" becomes "tens of millions" before you know.
Events are added to the back of a queue (which is really a vector) as they come in, and are processed from the front during the respective next quantum (so the program always processes the last quantum's input).
However, not all events are equally important. In case the available time is not sufficient, it would be preferrable to drop unimportant events rather than important ones (this is not a strict requirement, since important events will be copied to the next time quantum's queue, but doing so further adds to the load so it isn't a perfect solution).
The obvious thing to use would be, of course, a priority queue / heap. Unluckily, heapifying 100k elements isn't precisely a free operation either (or parallel), and then I end up with objects being in some non-obvious and not necessarily cache-friendly memory locations, and pulling elements from a priority queue doesn't parallelize nicely.
What I would really like is somewhat like a vector that is sorted or at least "somewhat approximately sorted", which one can traverse sequentially afterwards. This would trivially allow me to create e.g. 12 threads (or any other number, one per CPU) that process e.g. 1/64 of the range (or another size) each, slowly advancing from the front to the end, and eventually dropping/postponing what's left over -- which will be events of little importantance that can be discarded.
Simply sorting the complete range using std::sort would be the easiest, most straightforward solution. However, the time it takes to sort items reduces the time available to actually process elements within the fixed time budget, and sorting time is for the most part single-CPU time (and parallel sort isn't that great either).
Also, doing a perfect sort (which isn't really needed) may bring forth worst case complexity whereas an approximate sort should ideally perform at its optimum and have a very predictable cost.
tl;dr
So, what I'm looking for is a way to sort an array/vector only approximately, but fast, and with a predictable (or guaranteed) runtime.
The sort key would be a small integer typically between 10 and 1000. Being postponed to the next time quantum might increase ("priority boost") that value by a small amount, e.g. 100 or 200.
In a different question where humans are supposed to do an approximate sort using "subjective compare"(?) shell sort was proposed. On various sorting demo applets, it seems like at least for the "random shuffle" input that's typical in these, shell sort can indeed do an "approximate sort" that doesn't look too bad with 3-4 passes over the data (and at least the read-tap is strictly sequential). Unluckily it seems to be somewhat of a black art to choose gap values that work well, and runtime estimates seem to involve a lot of looking into the crystal ball as well.
Comb sort with a relatively large shrink factor (such as 2 or 3?) seems tempting as well, since it visits memory strictly sequentially (on both taps) and is able to move far out elements by a great distance quickly. Again, judging from sorting demo applets, it seems like 3-4 passes already give a rather reasonable "approximate sort".
MSD radix sort comes to mind, though I am not sure how it would perform given typical 16/32bit integers in which most of the most significant bits are all zero! One would probably have to do an initial pass to find the most significant bit in the whole set, followed by 2-3 actual sort passes?
Is there a better algorithm or a well-known working approach with one of the algorithms I mentioned?
What comes to mind is to iterate over the vector and if some event is less important, don't process it but put it aside. As soon as the entire vector has been read, have a look at the events put aside. Of course you can use several buckets with different priorities. And only store references there, you don't want to move megabytes of data. (posted as an answer now as requested by Damon)
Use a separate vector for each priority. Then you don't need to sort them.
Sounds like a nice example where near-sort algorithms can be useful.
Back a decade Chazelle has developed a nice data-structure that somewhat works like a heap. The key difference is the time complexity though. It has constant time for all important operations, e.g. insert, remove, find lowest element etc.
The trick of this data-structure is, that it breaks the O(n*log n) complexity barrier by allowing for some error in the sort order.
To me that sounds pretty much what you need. The data-structure is called soft heap and explained on wikipedia:
https://en.wikipedia.org/wiki/Soft_heap
There are other algorithms that allow for some error in favor to speed as well. You'll find them if you google for Near Sort Algorithms
If you try that algorithm please give some feedback how it works in practice. I'm really eager to hear from you how the algorithm performs in practice.
Sounds like you want to use std::partition: move the part that interests you to the front, and the others to the back. Its complexity is in the order of O(n), but it is cache-friendly, so it's probably a lot faster than sorting.
If you have limited "bandwidth" in processing events (say a 128K per time quantum), you could use std::nth_element to select the 128K (minus some percentage lost due to making that computation) most promising events (assuming you have an operator< that compares priorities) in O(N) time. Then you process those in parallel, and when you are done, you reprioritize the remainder (again in O(N) time).
std::vector<Event> events;
auto const guaranteed_bandwidth = 1<<17; // 128K is always possible to process
if (events.size() <= guaranteed_bandwidth) {
// let all N workers loose on [begin(events), end(events)) range
} else {
auto nth = guaranteed_bandwidth * loss_from_nth_element;
std::nth_element(begin(events), begin(events) + nth);
// let all N workers loose on [begin(events), nth) range
// reprioritize [nth, end(events)) range and append to events for next time quantum
}
This guarantees that in the case that your bandwith threshold is reached, you process the most valuable elements first. You could even speed up the nth_element by a poor man's parallelization (e.g. let each of N workers compute M*128K/N best elements for small M in parallel, and then do a final merge and another nth_element on the M*128K elements).
The only weakness is that in case your system is really overloaded (billions of events, maybe due to some DOS attack) it could take more than the entire quantum to run nth_element (even when quasi-parallized) and you actually process nothing. But if the processing time per event is much larger (say a few 1,000 cycles) than a priority comparison (say a dozen cycles), this should not happen under regular loads.
NOTE: for performance reasons, it's of course better to sort pointers/indices into the main event vector, this is not shown for brevity.
If you have N worker threads, give each worker thread 1/Nth of the original unsorted array. The first thing the worker will do is your approximate fast sorting algorithm of preference on it's individual piece of the array. Then, they can each process their array peice in order - roughly performing higher priority items first, and also being very cache friendly. This way, you don't take a hit for trying to sort the entire array, or even trying to approximately sort the entire array; and what little sorting there is, is entirely parallelized. Sorting 10 pieces individually is much cheaper than sorting the whole thing.
This would work best if the priorities of items to process are randomly distributed. If there is some ordering to them you'll wind up with a thread being flooded by or starved of high priority items to process.

Efficiently insert integers into a growing array if no duplicates

There is a data structure which acts like a growing array. Unknown amount of integers will be inserted into it one by one, if and only if these integers has no dup in this data structure.
Initially I thought a std::set suffices, it will automatically grow as new integers come in and make sure no dups.
But, as the set grows large, the insertion speed goes down. So any other idea to do this job besides hash?
Ps
I wonder any tricks such as xor all the elements or build a Sparse Table (just like for rmq) would apply?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
This site includes all the possible containers and layout their running time for each action ,
so maybe this will be useful :
http://en.cppreference.com/w/cpp/container
Seems like unordered_set as suggested is your best way.
You could try a std::unordered_set, which should be implemented as a hash table (well, I do not understand why you write "besides hash"; std::set normally is implemented as a balanced tree, which should be the reason for insufficient insertion performance).
If there is some range the numbers fall in, then you can create several std::set as buckets.
EDIT- According to the range that you have specified, std::set, should be fast enough. O(log n) is fast enough for most purposes, unless you have done some measurements and found it slow for your case.
Also you can use Pigeonhole Principle along with sets to reject any possible duplicate, (applicable when set grows large).
A bit vector can be useful to detect duplicates
Even more requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
Alcott 32 bit integers, with about 10.000.000 elements (ie any 10m out of 2^32)
It is a BST (binary search tree) where every node stores two values, the beginning and the end of a continuous region. The first element stores the number where a region starts, the second the last. This arrangement allows big regions in the hope that you reach you 10M limit with a very small tree height, so cheap search. The data structure with 10m elements would take up 8 bytes per node, plus the links (2x4bytes) maximum two children per node. So that make 80M for all the 10M elements. And of course, if there are usually more elements inserted you can keep track of the once which are not.
Now if you need to be very careful with space and after running simulations and/or statistical checks you find that there are lots of small regions (less than 32 bit in length), you may want to change your node type to one number which starts the region, plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have continuous chunks with only 8 elements, then your memo requirement becuse 4+1 for the node and 4+4 bytes for the children. Hope this helps.

Most efficient tree structure for what I'm trying to do

I'm wondering what the most generally efficient tree structure would be for a collection that has the following requirements:
The tree will hold anywhere between 0 and 232 - 1 items.
Each item will be a simple structure, containing one 32-bit unsigned integer (the item's unique ID, which will be used as the tree value) and two pointers.
Items will be inserted and removed from the tree very often; some items in the tree will remain there for the duration of the program, while others will only be in the tree very briefly before being removed.
Once an item is removed, its unique ID (that 32-bit unsigned integer) will be recycled and reused for a new item.
The tree structure needs to support efficient inserts and deletions, as well as quick lookups by the unique ID. Also, finding the first available unused unique ID needs to be a fast operation.
What sort of tree would be best-suited for these requirements?
EDIT: This tree is going to be held only in memory; at no point will it be persisted to disk. I don't need to worry about hitting the disk, or disk caching, or anything of the sort. This is also why I'm not looking into using something like SQLite.
Depending on how fast you need this to be you might just treat the whole thing as a single, in-memory table mmap-ed onto a file. Addressing is by direct computation. You can simply chain the free slots so you always know exactly where the next free one is. Most accesses will have a max of 1 or 2 disk accesses (depending on underlying filesystem requirements). Put a buttload of memory on the machine and you might not hit the disk at all.
I know this sounds pretty brute force, but you'd be amazed how fast it can be.
Update in response to: "I'm not looking for a disk-persistable solution"
Well, if you truly are going to have as many as 2^32 items in this structure (times how big it is) then you either need enough memory on the machine to hold this puppy or the kernel will start to swap things in and out of memory for you. This still translates to hitting the disk. If you let it swap, don't forget to check the size of the swap area, there's a good chance you'll have to bump it. Using mmap (or something similar) is sort of like creating your own private swap area and it will probably have less impact on other processes running on the same system.
I'll note that once this thing exceeds your available physical memory (whether you are using swap space or mmap or B-trees or Black-Red or extensible hashing or whatever) it becomes critical to understand
your access pattern. If you are hopscotching all over the place you're going to be hitting the disk a lot. One of the primary reasons for using a structure like a B-tree (or any one of several similar structures) is that the top level of the tree (containing the index) tends to stay in memory (because most paging algorithms use LRU) and you only eat a disk access when you touch a leaf page.
Bottom line: it either fits in memory or it doesn't. If it doesn't then your 10^-9 sec memory access turns into a 10^-3 disk access. I.e. 1 million times slower.
TANSTAAFL!
Have you considered something like a trie? Lookup is linear in key length, which in your case means essentially constant, and storage can be more compact due to nodes sharing common substrings.
Keep in mind, though, that if your data set is actually filling large amounts of your key space your bigger efficiency concern is likely to be caching and disk access, not lookups.
I would go for a red-black tree, because it balances the tree on insertion to ensure optimal insertion/deletion/retrieval. An AVL tree is an option, but it's slightly slower for insertions because it's more rigid about balancing on insertions.
http://en.wikipedia.org/wiki/Red-black_tree
http://en.wikipedia.org/wiki/AVL_tree
My reflex would tell me to reach for a standard implementation, such as the one in stl. But suppose you have reasons to implement your own I would typically go for either Red-Black Trees, which performs well on all operations. Alternatively I would try splay trees which can be really fast but have amortized complexity, i.e. some individual operations might take a little longer.
Stay away from AVL trees as you need to do a lot of updates. AVL trees are good for when you have a lot of lookups but few updates as the updated can be fairly slow.
Do you expect your tree to really hold 2^32-1 entries? Even half that and I would definitely try this with SQLite. You may be able to fit it all in memory, but if you page once, a database will be faster. Database are meant to handle huge data sets efficiently, especially when the whole set won't fit in memory at once.
I you do intend to do this yourself, look at some database code and use a BTree. A red-black will be faster with smaller datasets but with that much data your bottle neck isn't going to be processor speed but memory and harddrive speed.
All that said I can't imagine a map of pointers that large being useful. You'll be pushing the limits of modern memory just storing the map. You won't have anything left over for the map to point to.
boost::unordered_map has amortized constant time insertions, deletions and lookups. It's the best data structure for what you described.
Its only downside is that it's, well, unordered as the name says.. And also if you're REALLY unlucky it could end up being linear time if every single hash clashes. However that can be easily avoided using boost's default boost::hash function. Additionally hashing integers is trivial; so that worst case scenario will not happen to you.
(Note: it's not a tree but a hash table, and you asked specifically for a "Tree".. Maybe you thought that the most efficient way was some sort of tree (it's not)?)
Why a tree at all?
To me it seems you need a database. If you expect lower count of nodes, Hash Table could be enough.
I'm going to warn you about the memory. If you fill up whole tree (2^32 items) you will need 4 gigabytes for the values themselves another 8GB for the pointers. Consider the database, if this is likely.
Each item is represented by a 32-bit identity, which is its key, and two pointers. Are the pointers associated with the tree, or do they have to do with the identity?
If they're just part of implementing the tree, ditch them. You don't need them. Represent whether a number is there or not as a bit in a really big bitmap. Finding the lowest unused bit isn't fast, but I don't think it can be. It's only about 512M of main memory, which isn't that bad.
If the pointers are meaningful data, use an array. You're going to have to allocate space for four giganodes plus pointers to make up the map anyway, so allocate space for four giganodes plus one indicator each for whether the node is active or not. Use memset() to set the whole thing to zero, and keep a lowest-unused-node pointer. Use that to add a node. When you delete a node, mark it as unused, and use the pointers to maintain a two-way linked free list. You'll have to find the next lower unused node, and that might take a while, but again I don't see how to keep this fast. (If you just need an unused node, not the lowest one, just put the released node on the free list somewhere.)
This is likely to take about 64G or 96G of RAM, but that's less than a map solution.

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations. Inserting keys on from a very sparse domain (all 32bit integers, and approx. 100 are set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (like, 500k, but only 100 different ones).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stackoverflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this will be very simple so the coding and maintenance effort for an assembler implementation shouldn't be unduly hard to maintain.
You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list – though I don't know any decend C++ implementation of it. I intended to submit one to Boost but never got around to do it.
Maybe a set with a b-tree (instead of binary tree) as internal data structure. I found this article on codeproject which implements this.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? 32-bits take "only" 4GB/8 bytes, which comes to 512MB, not much for a high-end server. That would make your insertions O(1). But that could make the iteration slow. Although skipping all words with only zeroes would optimize away most iterations. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure a naive unordered set of elements packed in a fixed array with a simple swap to front when an insert collisides is too slow? Its a simple experiment that might show you have memory locality issues rather than algorithmic issues.