boost circular_buffer or heap sorting performance - c++

I need to sort objects that arrive at a process X before handling them.
Objects are sorted by a time-stamp - a 64-bit number.
When objects are timed out (a few milliseconds) and sorted, process X starts to look at them.
Most of the time the objects arrive already sorted; between 3% and 5% of them are out of order.
So I need a structure that allows me to:
- insert elements quickly
- remove timed-out elements quickly
Which structure would be best for sorting them, in terms of performance?
I started to implement this with boost::circular_buffer.
If boost::heap is better for this, which boost::heap should I use? There are several (fibonacci, binomial, priority queue...).
I am using boost 1_49, but I can use a newer version as well.
With the circular buffer I insert most of the elements at the beginning of the buffer, but in some cases this can be O(n).
Taking timed-out elements, on the other hand, is O(1).

You will need to measure, but I would guess that a std::priority_queue<T> has a fair chance of being the most efficient. Using any of the other heaps won't do you much good: you don't really need the additional operations (changing the priority of an element), but being able to use these operations increases the overhead quite significantly.
Especially when the size of T is a bit bigger - but even if it is just an int - you may want to use a d-heap with d == 8: although this results in more comparisons, it reduces the number of times an object is moved.
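For the original timed-out-events use case, a minimal sketch could look like the following: a min-heap on the timestamp, so the oldest pending event is always at the top. The Event fields, the timeout arithmetic and the function names are illustrative assumptions, not part of the answer.
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Event {
    uint64_t timestamp;   // 64-bit time stamp used as the sort key
    // ... payload ...
};

struct LaterThan {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;   // inverted, so top() is the oldest event
    }
};

// Min-heap on timestamp: top() is always the oldest pending event.
std::priority_queue<Event, std::vector<Event>, LaterThan> pending;

void on_arrival(const Event& e) {
    pending.push(e);                        // O(log n) insert
}

template <class Handler>
void drain_timed_out(uint64_t now, uint64_t timeout, Handler handle) {
    while (!pending.empty() && pending.top().timestamp + timeout <= now) {
        handle(pending.top());              // events come out in timestamp order
        pending.pop();                      // O(log n) removal
    }
}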

Related

What is the optimal data structure for storing data points acquired during last n seconds

I need a data structure (in C++), to store (integer,double) pair values acquired over last N seconds. The integer is a relative milliseconds time stamp (guaranteed to be monotonic) and the double is the actual data sample.
Constraints:
The number of points per second is not known a priori, but it is not expected to vary much once the application has started. A typical value is 10 points per second.
The duration (i.e. N seconds) is also not known a priori and can change during execution. But when it is changed, I'm OK with flushing all data and starting fresh. A typical value is 60 seconds.
In every iteration, a new point is added to the end of the set and old points (i.e. older than N seconds from current time) are to be removed from the set.
I don't need fast random access, but fast insertions (tail) and deletions (head).
I am using std::deque at the moment, but I have a feeling that adding points at the tail end and deleting from the head end will cause frequent re-allocations.
Is there a standard way to do this? Or should I roll my own 'circular list' wrapper around std::vector?
For "fast insertions (tail) and deletions (head)" std::deque is optimal, given that it's amortized O(1) both insertion and deletion at both ends. You could also use std::queue. The standard library doesn't provide anything more "efficient" with the given requirements.
It seems that you can use the circular buffer from boost.
There is no equivalent in the C++ standard library.
If you don't want to depend on boost, implement a non-shrinking circular buffer yourself over std::vector; that's not hard.
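A minimal sketch of such a hand-rolled, non-shrinking ring buffer over std::vector might look like this (the interface and the doubling growth policy are assumptions; boost::circular_buffer itself overwrites old elements instead of growing):
#include <cstddef>
#include <utility>
#include <vector>

// Non-shrinking ring buffer over std::vector: push_back at the tail,
// pop_front at the head; doubles (and linearizes) its storage when full.
template <class T>
class RingBuffer {
    std::vector<T> buf;
    std::size_t head = 0, count = 0;
public:
    explicit RingBuffer(std::size_t initial_capacity = 64) : buf(initial_capacity) {}

    void push_back(const T& v) {
        if (count == buf.size()) grow();
        buf[(head + count) % buf.size()] = v;
        ++count;
    }
    void pop_front() { head = (head + 1) % buf.size(); --count; }   // assumes !empty()
    const T& front() const { return buf[head]; }
    std::size_t size() const { return count; }
    bool empty() const { return count == 0; }
private:
    void grow() {                              // double the capacity, keep elements in order
        std::vector<T> bigger(buf.size() * 2);
        for (std::size_t i = 0; i < count; ++i)
            bigger[i] = std::move(buf[(head + i) % buf.size()]);
        buf.swap(bigger);
        head = 0;
    }
};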

Approximate sort (array/vector), predictable runtime

Background:
I need to process some hundred thousand events (producing results) given a hard time limit. The clock is literally ticking, and when the timer fires, whatever is done at that point must be flushed out.
What isn't ready by that time is either discarded (depending on an importance metric) or processed during the next time quantum (with an "importance boost", i.e. adding a constant to the importance metric).
Now ideally, the CPU is much faster than needed, and the whole set is ready a long time before the end of the time slice. Unluckily, the world is rarely ever ideal, and "hundred thousands" becomes "tens of millions" before you know it.
Events are added to the back of a queue (which is really a vector) as they come in, and are processed from the front during the respective next quantum (so the program always processes the last quantum's input).
However, not all events are equally important. In case the available time is not sufficient, it would be preferable to drop unimportant events rather than important ones (this is not a strict requirement, since important events will be copied to the next time quantum's queue, but doing so further adds to the load, so it isn't a perfect solution).
The obvious thing to use would be, of course, a priority queue / heap. Unluckily, heapifying 100k elements isn't exactly a free operation either (nor is it parallel), I end up with objects in non-obvious and not necessarily cache-friendly memory locations, and pulling elements from a priority queue doesn't parallelize nicely.
What I would really like is something like a vector that is sorted, or at least "somewhat approximately sorted", which one can then traverse sequentially. This would trivially allow me to create e.g. 12 threads (or any other number, one per CPU) that each process e.g. 1/64 of the range (or another size), slowly advancing from the front to the end, and eventually dropping/postponing what's left over -- which will be events of little importance that can be discarded.
Simply sorting the complete range using std::sort would be the easiest, most straightforward solution. However, the time it takes to sort items reduces the time available to actually process elements within the fixed time budget, and sorting time is for the most part single-CPU time (and parallel sort isn't that great either).
Also, doing a perfect sort (which isn't really needed) may bring forth worst case complexity whereas an approximate sort should ideally perform at its optimum and have a very predictable cost.
tl;dr
So, what I'm looking for is a way to sort an array/vector only approximately, but fast, and with a predictable (or guaranteed) runtime.
The sort key would be a small integer typically between 10 and 1000. Being postponed to the next time quantum might increase ("priority boost") that value by a small amount, e.g. 100 or 200.
In a different question where humans are supposed to do an approximate sort using "subjective compare"(?) shell sort was proposed. On various sorting demo applets, it seems like at least for the "random shuffle" input that's typical in these, shell sort can indeed do an "approximate sort" that doesn't look too bad with 3-4 passes over the data (and at least the read-tap is strictly sequential). Unluckily it seems to be somewhat of a black art to choose gap values that work well, and runtime estimates seem to involve a lot of looking into the crystal ball as well.
Comb sort with a relatively large shrink factor (such as 2 or 3?) seems tempting as well, since it visits memory strictly sequentially (on both taps) and is able to move far out elements by a great distance quickly. Again, judging from sorting demo applets, it seems like 3-4 passes already give a rather reasonable "approximate sort".
MSD radix sort comes to mind, though I am not sure how it would perform given typical 16/32bit integers in which most of the most significant bits are all zero! One would probably have to do an initial pass to find the most significant bit in the whole set, followed by 2-3 actual sort passes?
Is there a better algorithm or a well-known working approach with one of the algorithms I mentioned?
What comes to mind is to iterate over the vector and, if some event is less important, not process it but put it aside. As soon as the entire vector has been read, have a look at the events put aside. Of course you can use several buckets with different priorities. And only store references there; you don't want to move megabytes of data. (posted as an answer now as requested by Damon)
Use a separate vector for each priority. Then you don't need to sort them.
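A minimal sketch of that idea: one linear pass that only records indices in per-importance buckets, so nothing large is copied. The Event type and the threshold are illustrative assumptions.
#include <cstddef>
#include <vector>

struct Event { int importance; /* ... payload ... */ };

// One linear pass: indices of important events go into 'now',
// the rest into 'later', to be looked at only if time remains.
void bucket_by_importance(const std::vector<Event>& events, int threshold,
                          std::vector<std::size_t>& now,
                          std::vector<std::size_t>& later) {
    for (std::size_t i = 0; i < events.size(); ++i) {
        if (events[i].importance >= threshold)
            now.push_back(i);
        else
            later.push_back(i);
    }
}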
Sounds like a nice example where near-sort algorithms can be useful.
About a decade ago, Chazelle developed a nice data structure that works somewhat like a heap. The key difference is the time complexity, though: it has constant (amortized) time for all important operations, e.g. insert, remove, find lowest element, etc.
The trick of this data structure is that it breaks the O(n*log n) complexity barrier by allowing some error in the sort order.
To me that sounds pretty much like what you need. The data structure is called a soft heap and is explained on Wikipedia:
https://en.wikipedia.org/wiki/Soft_heap
There are other algorithms that trade some error for speed as well. You'll find them if you google for "near sort algorithms".
If you try it, please give some feedback; I'm really eager to hear how this data structure performs in practice.
Sounds like you want to use std::partition: move the part that interests you to the front, and the rest to the back. Its complexity is on the order of O(n), and it is cache-friendly, so it's probably a lot faster than sorting.
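A minimal sketch of that idea, assuming a hypothetical Event type and an importance threshold (both illustrative, not part of the answer):
#include <algorithm>
#include <vector>

struct Event { int importance; /* ... payload ... */ };

// One O(n), cache-friendly pass: events at or above the threshold end up in
// front of the returned iterator; relative order within each half is not kept.
std::vector<Event>::iterator split_by_importance(std::vector<Event>& events, int threshold) {
    return std::partition(events.begin(), events.end(),
                          [threshold](const Event& e) { return e.importance >= threshold; });
}
// Process [events.begin(), split) first; look at [split, events.end()) only if time remains.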
If you have limited "bandwidth" in processing events (say a 128K per time quantum), you could use std::nth_element to select the 128K (minus some percentage lost due to making that computation) most promising events (assuming you have an operator< that compares priorities) in O(N) time. Then you process those in parallel, and when you are done, you reprioritize the remainder (again in O(N) time).
std::vector<Event> events;
auto const guaranteed_bandwidth = std::size_t(1) << 17;  // 128K is always possible to process
double const loss_from_nth_element = 0.95;               // fraction kept after paying for the selection itself (illustrative value)

if (events.size() <= guaranteed_bandwidth) {
    // let all N workers loose on the [begin(events), end(events)) range
} else {
    auto nth = static_cast<std::ptrdiff_t>(guaranteed_bandwidth * loss_from_nth_element);
    // assumes operator< orders more promising events first
    std::nth_element(begin(events), begin(events) + nth, end(events));
    // let all N workers loose on the [begin(events), begin(events) + nth) range
    // reprioritize the [begin(events) + nth, end(events)) range and append it to the next quantum's events
}
This guarantees that in case your bandwidth threshold is reached, you process the most valuable elements first. You could even speed up the nth_element with a poor man's parallelization (e.g. let each of the N workers compute the M*128K/N best elements for small M in parallel, then do a final merge and another nth_element on the M*128K elements).
The only weakness is that in case your system is really overloaded (billions of events, maybe due to some DOS attack) it could take more than the entire quantum to run nth_element (even when quasi-parallelized) and you actually process nothing. But if the processing time per event is much larger (say a few 1,000 cycles) than a priority comparison (say a dozen cycles), this should not happen under regular loads.
NOTE: for performance reasons, it's of course better to sort pointers/indices into the main event vector, this is not shown for brevity.
If you have N worker threads, give each worker thread 1/Nth of the original unsorted array. The first thing each worker does is run your approximate fast sorting algorithm of preference on its individual piece of the array. Then each can process its array piece in order - roughly performing higher-priority items first, and also being very cache-friendly. This way, you don't take a hit for trying to sort the entire array, or even trying to approximately sort the entire array; and what little sorting there is, is entirely parallelized. Sorting 10 pieces individually is much cheaper than sorting the whole thing.
This would work best if the priorities of items to process are randomly distributed. If there is some ordering to them you'll wind up with a thread being flooded by or starved of high priority items to process.
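A minimal sketch of that split-then-sort-per-slice scheme, using std::thread and plain std::sort as the per-slice sort (the Event type and the processing step are placeholders):
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Event { int importance; /* ... payload ... */ };

// Each of n_threads workers gets its own contiguous slice, sorts it by
// decreasing importance, and then processes it front to back.
void sort_and_process_in_slices(std::vector<Event>& events, unsigned n_threads) {
    std::vector<std::thread> workers;
    std::size_t chunk = (events.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t lo = std::min<std::size_t>(t * chunk, events.size());
        std::size_t hi = std::min<std::size_t>(lo + chunk, events.size());
        workers.emplace_back([&events, lo, hi] {
            std::sort(events.begin() + lo, events.begin() + hi,
                      [](const Event& a, const Event& b) { return a.importance > b.importance; });
            // ... process [lo, hi) in (roughly) decreasing importance, until the timer fires ...
        });
    }
    for (auto& w : workers) w.join();
}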

Efficiently insert integers into a growing array if no duplicates

There is a data structure which acts like a growing array. An unknown number of integers will be inserted into it one by one, each if and only if it has no duplicate already in the data structure.
Initially I thought a std::set would suffice: it automatically grows as new integers come in and makes sure there are no duplicates.
But as the set grows large, the insertion speed goes down. So is there any other idea for doing this job, besides hashing?
PS:
I wonder whether any tricks such as XOR-ing all the elements or building a sparse table (as for RMQ) would apply?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
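A minimal sketch of that bit field, assuming you can afford the ~512 MB up front (names are illustrative):
#include <cstdint>
#include <vector>

// One bit per possible 32-bit value: ~512 MB allocated up front, then O(1) per insert.
std::vector<uint64_t> seen((1ull << 32) / 64, 0);

// Returns true if x was newly inserted, false if it was already present.
bool insert_if_absent(uint32_t x) {
    uint64_t& word = seen[x >> 6];
    uint64_t bit = 1ull << (x & 63);
    if (word & bit) return false;   // duplicate, reject
    word |= bit;
    return true;
}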
This site lists all the standard containers and lays out the running time of each operation,
so maybe it will be useful:
http://en.cppreference.com/w/cpp/container
It seems that unordered_set, as already suggested, is your best bet.
You could try a std::unordered_set, which should be implemented as a hash table (well, I do not understand why you write "besides hash"; std::set is normally implemented as a balanced tree, which should be the reason for the insufficient insertion performance).
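A minimal usage sketch: std::unordered_set already gives you the "insert only if not a duplicate" behaviour, since insert() reports whether the element was new (the function name is illustrative).
#include <unordered_set>

std::unordered_set<int> values;

// Returns true if x was newly inserted, false if it was already present.
bool add_if_new(int x) {
    return values.insert(x).second;   // average O(1) per insertion
}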
If there is some range the numbers fall in, then you can create several std::sets as buckets.
EDIT: According to the range that you have specified, std::set should be fast enough. O(log n) is fast enough for most purposes, unless you have done some measurements and found it slow for your case.
Also, you can use the pigeonhole principle along with the sets to reject any possible duplicate (applicable when a set grows large).
A bit vector can be useful to detect duplicates
Even more requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
32-bit integers, with about 10,000,000 elements (i.e. any 10M out of 2^32)
It is a BST (binary search tree) where every node stores two values, the beginning and the end of a continuous region. The first value stores the number where a region starts, the second the last. This arrangement allows big regions, in the hope that you reach your 10M limit with a very small tree height, so searching is cheap. The data structure with 10M elements would take up 8 bytes per node, plus the links (2x4 bytes), with a maximum of two children per node. So that makes 80M for all the 10M elements. And of course, if there are usually more elements inserted, you can instead keep track of the ones which are not.
Now if you need to be very careful with space, and after running simulations and/or statistical checks you find that there are lots of small regions (less than 32 values in length), you may want to change your node type to one number which starts the region, plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have continuous chunks with only 8 elements, then your memory requirement becomes 4+1 bytes for the node and 4+4 bytes for the children. Hope this helps.
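One way to sketch that region tree is with a std::map (itself a balanced BST) keyed by region start and storing the region end; this is a simplified illustration of the idea, not the exact node layout described above:
#include <cstdint>
#include <iterator>
#include <map>

// Interval set over uint32_t: maps region start -> region end (both inclusive).
std::map<uint32_t, uint32_t> regions;

// Insert a single value, merging it with an adjacent or containing region.
void insert(uint32_t x) {
    auto next = regions.upper_bound(x);              // first region starting after x
    if (next != regions.begin()) {
        auto prev = std::prev(next);
        if (x <= prev->second) return;               // already covered by a region
        if (prev->second + 1 == x) {                 // extends the previous region
            prev->second = x;
            if (next != regions.end() && x != UINT32_MAX && next->first == x + 1) {
                prev->second = next->second;         // the gap closed: merge both regions
                regions.erase(next);
            }
            return;
        }
    }
    if (next != regions.end() && x != UINT32_MAX && next->first == x + 1) {
        uint32_t end = next->second;                 // grow the next region downward
        regions.erase(next);
        regions[x] = end;
        return;
    }
    regions[x] = x;                                  // isolated value: new singleton region
}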

How to keep a large priority queue with the most relevant items?

In an optimization problem I keep in a queue a lot of candidate solutions which I examine according to their priority.
Each time I handle one candidate, it is removed from the queue, but it produces several new candidates, making the number of candidates grow exponentially. To handle this I assign a relevancy to each candidate; whenever a candidate is added to the queue and there is no more space available, I replace (if appropriate) the least relevant candidate currently in the queue with the new one.
In order to do this efficiently I keep a large (fixed size) array with the candidates and two linked indirect binary heaps: one handles the candidates in decreasing priority order, and the other in ascending relevancy.
This is efficient enough for my purposes and the supplementary space needed is about 4 ints/candidate which is also reasonable. However it is complicated to code, and it doesn't seem optimal.
My question is whether you know of a more suitable data structure, or of a more natural way to perform this task, without losing efficiency.
Here's an efficient solution that doesn't change the time or space complexity over a normal heap:
In a min-heap, every node is less than both its children. In a max-heap, every node is greater than its children. Let's alternate between a min and a max property for each level: every node on an odd level is less than its children and grandchildren, and the inverse holds on even levels. Then finding the smallest node is the same as usual, and finding the largest node requires that we look at the children of the root and take the larger. Bubbling nodes up (for insertion) becomes a bit trickier, but it's still the same O(log N) complexity.
Keeping track of capacity and popping the smallest (least relevant) node is the easy part.
EDIT: This appears to be a standard min-max heap! See here for a description; there is a C implementation (header, source and example).
"Optimal" is hard to judge (near impossible) without profiling.
Sometimes a 'dumb' algorithm can be the fastest, because Intel CPUs are incredibly fast at dumb array scans over contiguous blocks of memory, especially if the loop and the data can fit on-chip. By contrast, jumping around following pointers in a larger block of memory that doesn't fit on-chip can be tens or hundreds of times slower.
You may also have the issues when you try to parallelize your code if the 'clever' data structure introduces locking thus preventing multiple threads from progressing simultaneously.
I'd recommend profiling your current approach, the min-max approach, and a simple array scan (no linked lists = less memory) to see which performs best. Odd as it may seem, I have often seen 'clever' algorithms with linked lists beaten by simple array scans in practice, because the simpler approach uses less memory, has a tighter loop and benefits more from CPU optimizations. You also potentially avoid memory allocations and garbage collection issues with a fixed-size array holding the candidates.
One option you might want to consider, whatever the solution, is to prune less frequently and remove more elements each time. For example, removing 100 elements on each prune operation means you only need to prune 1/100th of the time. That may allow a more asymmetric approach to adding and removing elements.
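A minimal sketch of that batch pruning on a plain array of candidates (the Candidate fields, the cap and the batch size of 100 are illustrative assumptions):
#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate { double priority; double relevancy; /* ... */ };

// Once the pool has grown 100 past its cap, drop the 100 least relevant
// candidates in one O(n) selection pass instead of evicting one per insert.
void prune(std::vector<Candidate>& pool, std::size_t cap) {
    if (pool.size() < cap + 100) return;
    std::nth_element(pool.begin(), pool.begin() + cap, pool.end(),
                     [](const Candidate& a, const Candidate& b) {
                         return a.relevancy > b.relevancy;    // most relevant first
                     });
    pool.erase(pool.begin() + cap, pool.end());               // discard the least relevant tail
}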
But overall, just bear in mind that the computer-science approach to optimization isn't always the practical approach to the highest performance on today and tomorrow's hardware.
If you use skip-lists instead of heaps you'll have O(1) time for dequeuing elements while still doing searches in O(log n).
On the other hand a skip list is harder to implement and uses more space than a binary heap.

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations: inserting keys from a very sparse domain (all 32-bit integers, with approx. 100 set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (like 500k, but only 100 different ones).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
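A minimal sketch of that "remember the last value" filter in front of a std::set (the wrapper name is illustrative):
#include <set>

// Cheap filter in front of any container: skip the insert when the value
// repeats the previous one, the common case when the input arrives in runs.
struct RunFilteredSet {
    std::set<int> values;
    int  last = 0;
    bool has_last = false;

    void insert(int x) {
        if (has_last && x == last) return;   // same run, nothing to do
        values.insert(x);
        last = x;
        has_last = true;
    }
};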
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stack Overflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this is very simple, so the coding and maintenance effort for an assembler implementation shouldn't be unduly high.
You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list - though I don't know of any decent C++ implementation of it. I intended to submit one to Boost but never got around to doing it.
Maybe a set with a B-tree (instead of a binary tree) as the internal data structure. I found this article on CodeProject which implements this.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? Covering all 32-bit values takes "only" 2^32 bits, which comes to 512MB - not much for a high-end server. That would make your insertions O(1). It could make the iteration slow, although skipping all words that are entirely zero would optimize away most of the iteration. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
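A minimal sketch of that brute-force bitmap, with min/max tracking and zero-word skipping for iteration (__builtin_ctzll is a GCC/Clang builtin; the names are illustrative):
#include <cstdint>
#include <vector>

// One bit per possible 32-bit value (~512 MB). Iteration scans only the
// occupied [min, max] sub-range and skips words that are entirely zero.
struct BitmapSet {
    std::vector<uint64_t> bits = std::vector<uint64_t>((1ull << 32) / 64, 0);
    uint32_t lo = UINT32_MAX, hi = 0;

    void insert(uint32_t x) {
        bits[x >> 6] |= 1ull << (x & 63);        // O(1), idempotent for duplicates
        if (x < lo) lo = x;
        if (x > hi) hi = x;
    }

    template <class F>
    void for_each(F f) const {
        if (lo > hi) return;                     // nothing inserted yet
        for (uint64_t w = lo >> 6; w <= (hi >> 6); ++w) {
            uint64_t word = bits[w];
            while (word) {                       // zero words cost only one test
                unsigned b = __builtin_ctzll(word);          // index of lowest set bit
                f(static_cast<uint32_t>((w << 6) + b));
                word &= word - 1;                // clear that bit and continue
            }
        }
    }
};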
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an insertion algorithm that causes a page fault will do you no good. In fact, a data structure with an insert that merely causes a cache miss would likely be really bad for performance.
Have you made sure that a naive unordered set of elements packed into a fixed array, with a simple swap to the front when an insert collides, is too slow? It's a simple experiment that might show you have memory locality issues rather than algorithmic issues.
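A minimal sketch of that naive packed-array experiment (the type name is illustrative):
#include <algorithm>
#include <cstdint>
#include <vector>

// ~100 distinct keys in one contiguous array: linear scan on insert, and the
// hit is swapped to the front so hot keys are found almost immediately.
struct SmallFlatSet {
    std::vector<uint32_t> keys;   // stays tiny, so it fits in a few cache lines

    void insert(uint32_t x) {
        auto it = std::find(keys.begin(), keys.end(), x);
        if (it == keys.end()) { keys.push_back(x); return; }
        std::iter_swap(keys.begin(), it);   // move-to-front on a repeated insert
    }
    // Iterating the inserted keys is just: for (uint32_t k : set.keys) ...
};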