How to improve Boost Fibonacci Heap performance

How to improve Boost Fibonacci Heap performance - c++

I am implementing the Fast Marching algorithm, which is some kind of continuous Dijkstra. As I read in many papers, the Fibonacci heap is the most adequate heap for this purpose.
However, when profiling with callgrind my code I see that the following function is taking 58% of the execution time:
int popMinIdx () {
const int idx = heap_.top()->getIndex();
heap_.pop();
return idx;
}
Concretely, the pop() is taking 57.67% of the whole execution time.
heap_is defined as follows:
boost::heap::fibonacci_heap<const FMCell *, boost::heap::compare<compare_cells>> heap_;
Is it normal that it takes "that much" time or is there something I can do to improve performance?
Sorry if not enough information is given. I tried to be as brief as possible. I will add more info if needed.
Thank you!

The other answers aren't mentioning the big part: of course pop() takes the majority of your time: it's the only function that performs any real work!
As you may have read, the bounds on the operations of a Fibonacci Heap are amortized bounds. This means that if you perform enough operations in a good sequence, the bounds will average out to that. However, the actual costs are completely hidden.
Every time you insert an element, nothing happens. It is just thrown into the root list. Boom, O(1) time. Every time you merge two trees, its root is just linked into the root list. Boom, O(1) time. But hold on, your structure is not a valid Fibonacci Heap! That's where pop() (or extract-root) comes in: every time this operation is called, the entire Heap is restructured back into a correct shape. The Root is removed, its children are cut to the root list, and then we start merging trees in the root list so that no two trees with the same degree (number of children) exist in the root list.
So all of the work of Insert(e) and Merge(t) is actually delayed until Pop() is called, which then does all the work. What about the other operations?
Delete(e) is beautiful. We perform Decrease-Key(e, -inf) to make the element e become the root. And now we perform Pop()! Again, the work is done by Pop().
Decrease-Key(e, v) does its work by itself: it cuts e to the root list and starts a cutting cascade to put its children into the root list as well (which can cut their childlists too). So Decrease-Key puts a whole lot of elements into the root list. Can you guess which function has to fix that?
TL;DR: Pop() is the work horse of the Fibonacci Heap. All other operations are done efficiently because they create work for the Pop() operation. Pop() gathers the work and performs it in one go (which can take up to O(n)). This is actually really efficient because the "grouped" work can be done faster than each operation separately.
So yes, it is natural that Pop() takes up the majority of your time!

The Fibanacci Heap's pop() has an amortized runtime of O(log n) and worst case of O(n). If your heap is large, it could easily be consuming a majority of the CPU time in your algorithm, especially since most of the other operations you're likely using have O(1) runtimes (insert, top, etc.)
One thing I'd recommend is to try callgrind with your preferred optimization level (such as -O3) with debug info (-g), because the templatized datastructures/containers such as the fibonacci_heap are heavy on the inlined function usage. It could be that most of the CPU cycles you're measuring don't even exist in your optimized executable.

Related

Why is it beneficial to remove tail recursion?

Just an overall data structures question: Why is it beneficial to remove tail recursion specifically in a quick sort even if it doesn't change complexity?

Quicksort takes O(N2) time in the worst case.
In that worst case, if each call recurses separately on both halves, it also takes O(N) stack space.
That is too much space to waste on the call stack.
Practical implementations of quicksort recurse on the smaller half, and then loop to sort the larger half. This limits the worst case stack consumption to O(log N) in the worst case, which is quite reasonable.
If you get called to sort an array with 100000000 items, it's OK if your calls go 27 levels deep, but it's definitely not OK of they go 10000000 levels deep.

#Matt Timmermans’ answer is great and goes into a lot of detail. I wanted to add one more bit to the answer. You asked this:
Why is it beneficial to remove tail recursion specifically in a quick sort even if it doesn't change complexity?
Tail call elimination doesn’t change (asymptotic) time complexity, but it does change space complexity. While the big-O runtime of the code won’t change, the amount of space needed will shrink pretty significantly, and that’s important because space is limited (especially for the call stack).
One other detail: the cost of calling a function on many processor architectures is higher than that of a normal loop due to the cost of passing parameters, saving information about the return address, etc. So tail call elimination sometimes does improve the runtime, though only by a constant factor and not in a big-O sense.

C++ std::sort implementation

I am wondering as to the implementation of std::sort in c++11. I have an MPI-managed parallel code, where each rank reads data from a file into a vector A that needs to be sorted. Each rank does calls std::sort to do this.
When I run this with ~100 ranks, there is sometimes one rank which hangs at this call to std::sort. Eventually, I realized, it's not hanging, the sort just takes very long. That is, one rank will take ~200 times longer to sort than all of the others.
At first I suspected it was a load-balancing issue. Nope, I've checked thoroughly that the size of A per rank is as balanced as possible.
I've concluded that it may just simply be that one rank has an initial condition of A such that something like the worst-case performance of quicksort is realized (or at least a non-ideal-case).
Why do I think this?
If I change the MPI configuration (thereby perturbing the content of A per rank, since it comes from a file read), the issue disappears, or it can move to other ranks.
If I change std::sort to std::stable_sort (no longer using the quicksort algorithm), then all is fine.
However, it seems that it would be most sensible to implement a quicksort by choosing a random pivot point on each iteration. If that were the case with std::sort, then it would be overwhelmingly unlikely to choose a worst-case value randomly from A on many iterations (which would be required to result in a 200x performance hit).
Thus, my observations suggest that std::sort implements a fixed quicksort pivot value (e.g. always choose the first value in the array, or something like that). This is the only way that the behavior I'm seeing would be likely, and also give consistent results when re-running on the same MPI configuration (which it does).
Am I correct in that conclusion? I did manage to find the std source, but the sort function is totally unreadable, and makes a plethora of calls to various helper functions, and I'd rather avoid a rabbit hole. Aside from that, I'm running on an HPC system, and it's not even clear to me how to be sure what exactly mpicxx is linking to. I can't find any documentation which describe the algorithm implementation

std::sort is implementation specific.
And since C++11, regular quicksort is no longer a valid implementation as required complexity move from O(N log N) on average to O(N log N).

Haskell :: Slower Function as List Getting Bigger

I'm working with finding a maximum value of a list (by using maximum) and I know that this function need to process the entire list to reach its goal and this obviously gets slower as the list gets bigger. Unfortunately, my list is huge (hundreds of million).
Is this the only way? Or there are faster ways to do this? I found that Haskell is fast but at this point (getting slower), I'm wondering is there any other option to find the maximum.

I know that this function need to process the entire list to reach its goal and this obviously gets slower as the list gets bigger.
Finding a maximum is O(n) in Haskell (and presumably every other language as well). Unless you have some concrete benchmarks and code this seems totally expected.

Since you need to "look" at each value and pick the highest O(n) is the best solution (without any other information that could be used). If you have to perform the function multiple times, then you might want to sort your list (ascending or descending) and use the function head or last, that have a time complexity of O(1), while haskells sort is a mergesort with O(n log n) worst-case and average-case performance.

Approximate sort (array/vector), predictable runtime

Background:
I need to process some hundred thousand events (producing results) given a hard time limit. The clock is literally ticking, and when the timer fires, whatever is done at that point must be flushed out.
What isn't ready by that time is either discarded (depending on an importance metric) or processed during the next time quantum (with an "importance boost", i.e. adding a constant to the importance metric).
Now ideally, the CPU is much faster than needed, and the whole set is ready a long time before the end of the time slice. Unluckily, the world is rarely ever ideal, and "hundred thousands" becomes "tens of millions" before you know.
Events are added to the back of a queue (which is really a vector) as they come in, and are processed from the front during the respective next quantum (so the program always processes the last quantum's input).
However, not all events are equally important. In case the available time is not sufficient, it would be preferrable to drop unimportant events rather than important ones (this is not a strict requirement, since important events will be copied to the next time quantum's queue, but doing so further adds to the load so it isn't a perfect solution).
The obvious thing to use would be, of course, a priority queue / heap. Unluckily, heapifying 100k elements isn't precisely a free operation either (or parallel), and then I end up with objects being in some non-obvious and not necessarily cache-friendly memory locations, and pulling elements from a priority queue doesn't parallelize nicely.
What I would really like is somewhat like a vector that is sorted or at least "somewhat approximately sorted", which one can traverse sequentially afterwards. This would trivially allow me to create e.g. 12 threads (or any other number, one per CPU) that process e.g. 1/64 of the range (or another size) each, slowly advancing from the front to the end, and eventually dropping/postponing what's left over -- which will be events of little importantance that can be discarded.
Simply sorting the complete range using std::sort would be the easiest, most straightforward solution. However, the time it takes to sort items reduces the time available to actually process elements within the fixed time budget, and sorting time is for the most part single-CPU time (and parallel sort isn't that great either).
Also, doing a perfect sort (which isn't really needed) may bring forth worst case complexity whereas an approximate sort should ideally perform at its optimum and have a very predictable cost.
tl;dr
So, what I'm looking for is a way to sort an array/vector only approximately, but fast, and with a predictable (or guaranteed) runtime.
The sort key would be a small integer typically between 10 and 1000. Being postponed to the next time quantum might increase ("priority boost") that value by a small amount, e.g. 100 or 200.
In a different question where humans are supposed to do an approximate sort using "subjective compare"(?) shell sort was proposed. On various sorting demo applets, it seems like at least for the "random shuffle" input that's typical in these, shell sort can indeed do an "approximate sort" that doesn't look too bad with 3-4 passes over the data (and at least the read-tap is strictly sequential). Unluckily it seems to be somewhat of a black art to choose gap values that work well, and runtime estimates seem to involve a lot of looking into the crystal ball as well.
Comb sort with a relatively large shrink factor (such as 2 or 3?) seems tempting as well, since it visits memory strictly sequentially (on both taps) and is able to move far out elements by a great distance quickly. Again, judging from sorting demo applets, it seems like 3-4 passes already give a rather reasonable "approximate sort".
MSD radix sort comes to mind, though I am not sure how it would perform given typical 16/32bit integers in which most of the most significant bits are all zero! One would probably have to do an initial pass to find the most significant bit in the whole set, followed by 2-3 actual sort passes?
Is there a better algorithm or a well-known working approach with one of the algorithms I mentioned?

What comes to mind is to iterate over the vector and if some event is less important, don't process it but put it aside. As soon as the entire vector has been read, have a look at the events put aside. Of course you can use several buckets with different priorities. And only store references there, you don't want to move megabytes of data. (posted as an answer now as requested by Damon)

Use a separate vector for each priority. Then you don't need to sort them.

Sounds like a nice example where near-sort algorithms can be useful.
Back a decade Chazelle has developed a nice data-structure that somewhat works like a heap. The key difference is the time complexity though. It has constant time for all important operations, e.g. insert, remove, find lowest element etc.
The trick of this data-structure is, that it breaks the O(n*log n) complexity barrier by allowing for some error in the sort order.
To me that sounds pretty much what you need. The data-structure is called soft heap and explained on wikipedia:
https://en.wikipedia.org/wiki/Soft_heap
There are other algorithms that allow for some error in favor to speed as well. You'll find them if you google for Near Sort Algorithms
If you try that algorithm please give some feedback how it works in practice. I'm really eager to hear from you how the algorithm performs in practice.

Sounds like you want to use std::partition: move the part that interests you to the front, and the others to the back. Its complexity is in the order of O(n), but it is cache-friendly, so it's probably a lot faster than sorting.

If you have limited "bandwidth" in processing events (say a 128K per time quantum), you could use std::nth_element to select the 128K (minus some percentage lost due to making that computation) most promising events (assuming you have an operator< that compares priorities) in O(N) time. Then you process those in parallel, and when you are done, you reprioritize the remainder (again in O(N) time).
std::vector<Event> events;
auto const guaranteed_bandwidth = 1<<17; // 128K is always possible to process
if (events.size() <= guaranteed_bandwidth) {
// let all N workers loose on [begin(events), end(events)) range
} else {
auto nth = guaranteed_bandwidth * loss_from_nth_element;
std::nth_element(begin(events), begin(events) + nth);
// let all N workers loose on [begin(events), nth) range
// reprioritize [nth, end(events)) range and append to events for next time quantum
}
This guarantees that in the case that your bandwith threshold is reached, you process the most valuable elements first. You could even speed up the nth_element by a poor man's parallelization (e.g. let each of N workers compute M*128K/N best elements for small M in parallel, and then do a final merge and another nth_element on the M*128K elements).
The only weakness is that in case your system is really overloaded (billions of events, maybe due to some DOS attack) it could take more than the entire quantum to run nth_element (even when quasi-parallized) and you actually process nothing. But if the processing time per event is much larger (say a few 1,000 cycles) than a priority comparison (say a dozen cycles), this should not happen under regular loads.
NOTE: for performance reasons, it's of course better to sort pointers/indices into the main event vector, this is not shown for brevity.

If you have N worker threads, give each worker thread 1/Nth of the original unsorted array. The first thing the worker will do is your approximate fast sorting algorithm of preference on it's individual piece of the array. Then, they can each process their array peice in order - roughly performing higher priority items first, and also being very cache friendly. This way, you don't take a hit for trying to sort the entire array, or even trying to approximately sort the entire array; and what little sorting there is, is entirely parallelized. Sorting 10 pieces individually is much cheaper than sorting the whole thing.
This would work best if the priorities of items to process are randomly distributed. If there is some ordering to them you'll wind up with a thread being flooded by or starved of high priority items to process.

How to keep a large priority queue with the most relevant items?

In an optimization problem I keep in a queue a lot of candidate solutions which I examine according to their priority.
Each time I handle one candidate, it is removed form the queue but it produces several new candidates making the number of cadidates to grow exponentially. To handle this I assign a relevancy to each candidate, whenever a candidate is added to the queue, if there is no more space avaliable, I replace (if appropiate) the least relevant candidate currently in the queue with the new one.
In order to do this efficiently I keep a large (fixed size) array with the candidates and two linked indirect binary heaps: one handles the candidates in decreasing priority order, and the other in ascending relevancy.
This is efficient enough for my purposes and the supplementary space needed is about 4 ints/candidate which is also reasonable. However it is complicated to code, and it doesn't seem optimal.
My question is if you know of a more adequate data structure or of a more natural way to perform this task without losing efficiency.

Here's an efficient solution that doesn't change the time or space complexity over a normal heap:
In a min-heap, every node is less than both its children. In a max-heap, every node is greater than its children. Let's alternate between a min and max property for each level making it: every odd row is less than its children and its grandchildren, and the inverse for even rows. Then finding the smallest node is the same as usual, and finding the largest node requires that we look at the children of the root and take the largest. Bubbling nodes (for insertion) becomes a bit tricker, but it's still the same O(logN) complexity.
Keeping track of capacity and popping the smallest (least relevant) node is the easy part.
EDIT: This appears to be a standard min-max heap! See here for a description. There's a C implementation: header, source and example. Here's an example graph:
(source: chonbuk.ac.kr)

"Optimal" is hard to judge (near impossible) without profiling.
Sometimes a 'dumb' algorithm can be the fastest because intel CPUs are incredibly fast at dumb array scans on contiguous blocks of memory especially if the loop and the data can fit on-chip. By contrast, jumping around following pointers in a larger block of memory that doesn't fit on-chip can be tens or hundreds or times slower.
You may also have the issues when you try to parallelize your code if the 'clever' data structure introduces locking thus preventing multiple threads from progressing simultaneously.
I'd recommend profiling both your current, the min-max approach and a simple array scan (no linked lists = less memory) to see which performs best. Odd as it may seem, I have seen 'clever' algorithms with linked lists beaten by simple array scans in practice often because the simpler approach uses less memory, has a tighter loop and benefits more from CPU optimizations. You also potentially avoid memory allocations and garbage collection issues with a fixed size array holding the candidates.
One option you might want to consider whatever the solution is to prune less frequently and remove more elements each time. For example, removing 100 elements on each prune operation means you only need to prune 100th of the time. That may allow a more asymmetric approach to adding and removing elements.
But overall, just bear in mind that the computer-science approach to optimization isn't always the practical approach to the highest performance on today and tomorrow's hardware.

If you use skip-lists instead of heaps you'll have O(1) time for dequeuing elements while still doing searches in O(logn).
On the other hand a skip list is harder to implement and uses more space than a binary heap.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js