construct boost priority queue based on iterators - c++

I have a list of binomial_heaps, and in each iteration of the algorithm I have to update the priority of an element in some of them. For this I use the update function of the boost binomial_heap. However, one of the binomial_heaps has to be removed and rebuilt completely (as all of its priorities change). Instead of using push every time (which, if I understand correctly, would have a complexity of O(n log n)), I would like to construct it from iterators of an underlying container (a kind of heapify or make_heap operation, which would be linear time). This seems possible with the standard priority_queue, but not with the boost implementation. On the other hand, the standard one does not provide an update function. Is there a way around this where I can have both, or another library that supports both? Or maybe my reasoning, that pushing all elements onto an empty priority queue is slower, is not correct?
Some might say there is something seriously wrong with the fact that I need to rebuild an entire priority queue, which would make the use of the priority queue completely superfluous. The algorithm I want to implement is from "Finding community structure in very large networks" by Aaron Clauset et al., in which the authors do exactly that (unless I didn't interpret it correctly).
(Sorry couldn't post the link to the paper as I don't have enough reputation to post more than 2 links)

The "fast modularity" algorithm by Clauset et al. (paper here, code here) uses a pair of linked data structures: a sparse matrix (which is really just an adjacency list in which, instead of storing the elements hanging off a particular array element as a linked list, we store them in a balanced binary tree) and a max-heap. All the values in the sparse matrix (which are really the dQ_ij values for the potential merges in the algorithm) are also stored in the max-heap.
So, the max-heap is just an efficient way of finding the edge in the sparse matrix with the most positive value. Once you have the ij pair for that edge, you want to "insert" the elements of column (row) i into the elements of column (row) j, and then you want to delete column (row) i. So, you're not going to rebuild the entire max-heap after each pop from the max-heap. Instead, you want to delete some elements from it (the ones in the row/column that you delete from the sparse matrix) and update the values of others (the ones in the updated row/column for j).
This is where the linked data structure is helpful -- in the original implementation, each element in the sparse matrix stores a pointer to its corresponding entry in the max-heap so that if you update the value in the sparse matrix, you can then find the corresponding element in the max-heap and update its value. Once you do this, you need to re-heapify the updated heap element, by letting it move (recursively) up or down in the heap. Similarly, if you delete an element in the sparse matrix, you can find its entry in the heap and call a delete function on it.
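The linking described above can be sketched with a binary max-heap that remembers where each item currently sits. This is only a minimal illustration under assumptions: items are identified by the integers 0..n-1, and the names IndexedMaxHeap, update, erase and kGone are made up here, not taken from the original implementation or from Boost. The constructor also shows the linear-time bottom-up build the question asks about.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Marks an item that is no longer stored in the heap (illustrative name).
const std::size_t kGone = static_cast<std::size_t>(-1);

// Max-heap over items 0..n-1. pos_[id] is the "pointer" from an item to
// its heap slot, so a value can be changed or removed in O(log n).
class IndexedMaxHeap {
public:
    explicit IndexedMaxHeap(const std::vector<double>& values)
        : value_(values), heap_(values.size()), pos_(values.size()) {
        for (std::size_t i = 0; i < heap_.size(); ++i) { heap_[i] = i; pos_[i] = i; }
        // Bottom-up construction: O(n) in total.
        for (std::size_t i = heap_.size() / 2; i-- > 0;) sift_down(i);
    }

    bool empty() const { return heap_.empty(); }
    std::size_t top() const { assert(!heap_.empty()); return heap_[0]; }

    // Change the value of item `id`, then let it move up or down.
    void update(std::size_t id, double v) {
        assert(pos_[id] != kGone);
        value_[id] = v;
        sift_up(pos_[id]);
        sift_down(pos_[id]);
    }

    // Remove item `id` from wherever it currently sits in the heap.
    void erase(std::size_t id) {
        const std::size_t i = pos_[id];
        assert(i != kGone);
        const std::size_t last = heap_.back();
        heap_[i] = last;          // move the last item into the hole
        pos_[last] = i;
        heap_.pop_back();
        pos_[id] = kGone;
        if (last != id) { sift_up(pos_[last]); sift_down(pos_[last]); }
    }

private:
    void swap_nodes(std::size_t a, std::size_t b) {
        std::swap(heap_[a], heap_[b]);
        pos_[heap_[a]] = a;
        pos_[heap_[b]] = b;
    }
    void sift_up(std::size_t i) {
        while (i > 0) {
            const std::size_t parent = (i - 1) / 2;
            if (value_[heap_[i]] <= value_[heap_[parent]]) break;
            swap_nodes(i, parent);
            i = parent;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t largest = i;
            for (std::size_t c = 2 * i + 1; c <= 2 * i + 2 && c < heap_.size(); ++c)
                if (value_[heap_[c]] > value_[heap_[largest]]) largest = c;
            if (largest == i) break;
            swap_nodes(i, largest);
            i = largest;
        }
    }

    std::vector<double> value_;      // current value of each item
    std::vector<std::size_t> heap_;  // heap_[slot] = item id
    std::vector<std::size_t> pos_;   // pos_[item id] = slot, or kGone
};
```

In the setting above, the sparse matrix entry for a pair (i, j) would store the item id, so that changing dQ_ij turns into one update call and deleting a row turns into one erase call per entry.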

Related

Instant sort when put new value in array C++

I have a dynamically allocated array containing structs with a key-value pair. I need to write an update(key,value) function that puts a new struct into the array or, if a struct with the same key is already in the array, updates its value. Insert and update are combined in one function.
The problem is:
Before adding a struct I need to check if struct with this key already existing.
I can go through all elements of array and compare key (very slow)
Or I can use binary search, but (!) array must be sorted.
So I tried to sort the array with each update (sloooow) or to sort it when calling the binary search function.....which is on every update
Finally, I thought that there must be a way of inserting a struct into array so it would be placed in a right place and be always sorted.
However, I couldn't think of an algorithm like that so I came here to ask for some help because google refuses to read my mind.
I need to make my code faster because my array accepts more than 50 000 structs and I'm using bubble sort (because I'm dumb).
Take a look at Red Black Trees: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree
They will ensure the data is always sorted, with O(log n) complexity for inserts.
A binary heap will not suffice, because it does not guarantee a full sort order; the only guarantee is that the top element is either the min or the max.
One possible approach is to use a different data structure. As there is no genuine need to keep the structs ordered, only a need to detect whether a struct with the same key exists, the costs of maintaining order in a balanced tree (for instance by using std::map) are excessive. A more suitable data structure would be a hash table. C++11 provides one in the standard library under the obscure name std::unordered_map (http://en.cppreference.com/w/cpp/container/unordered_map).
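A minimal sketch of the combined insert-or-update on a hash table could look like the following. The key and value types (int and std::string) are assumptions, since the question does not say what the struct holds.

```cpp
#include <string>
#include <unordered_map>

// Inserts the pair, or overwrites the value if the key is already present.
// Returns true when a new key was inserted. Average cost is O(1).
bool update(std::unordered_map<int, std::string>& table,
            int key, const std::string& value) {
    auto result = table.emplace(key, value);   // does nothing if key exists
    if (!result.second) result.first->second = value;
    return result.second;
}
```

Calling update(table, 1, "a") and then update(table, 1, "b") leaves a single entry whose value is "b".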
If you insist on using an array, a possible approach might be to combine these algorithms:
Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter)
Partial sort (http://en.cppreference.com/w/cpp/algorithm/partial_sort)
Binary search
Maintain two ranges in the array -- first goes a range that is already sorted, then goes a range that is not yet. When you insert a struct, first check with the bloom filter if a matching struct might already exist. If the bloom filter gives a negative answer, then just insert the struct at the end of the array. After that the sorted range does not change, the unsorted range grows by one.
If the bloom filter gives a positive answer, then apply the partial sort algorithm to make the entire array sorted and then use binary search to check if such an object actually exists. If so, replace this element. After that the sorted range is the entire array, and the unsorted range is empty.
If the binary search has shown that the bloom filter was wrong, and the matching struct is not there, then you just put the new struct at the end of the array. After that the sorted range is the entire array minus one, and the unsorted range is the last element in the array.
Each time you insert an element, binary search to find if it exists. If it doesn't exist, the binary search will give you the index at which you can insert it.
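As a rough sketch of this idea, std::lower_bound does the binary search and returns the position where a missing key belongs. The Entry struct and the function name update_sorted are assumptions made for the example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Entry {          // hypothetical layout of the stored struct
    int key;
    std::string value;
};

// Keeps `entries` sorted by key. Returns true if a new entry was inserted,
// false if an existing one was updated.
bool update_sorted(std::vector<Entry>& entries, int key, const std::string& value) {
    // Binary search: first position whose key is not less than `key`.
    auto it = std::lower_bound(entries.begin(), entries.end(), key,
                               [](const Entry& e, int k) { return e.key < k; });
    if (it != entries.end() && it->key == key) {
        it->value = value;
        return false;
    }
    entries.insert(it, Entry{key, value});   // shifts the tail to the right
    return true;
}
```

The search is O(log n), but the insertion still shifts up to n elements, so a single insert remains O(n) in the worst case. It is much cheaper than re-sorting after every update.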
You could use std::set, which does not allow duplicate elements and places elements in sorted position. This assumes that you are storing the key and value in a struct, and not separately. In order for the sorting to work properly, you will need to define a comparison function for the structs.

Is this how I combine two min-heaps together?

I am currently creating a source code to combine two heaps that satisfy the min heap property with the shape invariant of a complete binary tree. However, I'm not sure if what I'm doing is the correct accepted method of merging two heaps satisfying the requirements I laid out.
Here is what I think:
Given two priority queues represented as min heaps, I insert the nodes of the second tree one by one into the first tree and fix the heap property. Then I continue this until all of the nodes in the second tree are in the first tree.
From what I see, this feels like an O(n log n) algorithm, since I have to go through all the elements in the second tree and every insert takes about log n time because the height of a complete binary tree is at most log n. But I think there is a faster way; however, I'm not sure what other possible method there is.
I was thinking that I could just insert the entire tree in, but that would break the shape invariant and the order invariant. Is my method the only way?
In fact building a heap is possible in linear time and standard function std::make_heap guarantees linear time. The method is explained in Wikipedia article about binary heap.
This means that you can simply merge heaps by calling std::make_heap on a range containing the elements from both heaps. This is asymptotically optimal if the heaps are of similar size. There might be a way to exploit the preexisting structure to reduce the constant factor, but I find that unlikely.
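A short sketch of that merge, assuming both heaps are stored as std::vector<int> in the usual array layout (the function name merge_min_heaps is made up for the example):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Concatenates both heaps and rebuilds the heap property in linear time.
std::vector<int> merge_min_heaps(const std::vector<int>& a,
                                 const std::vector<int>& b) {
    std::vector<int> merged;
    merged.reserve(a.size() + b.size());
    merged.insert(merged.end(), a.begin(), a.end());
    merged.insert(merged.end(), b.begin(), b.end());
    // std::greater puts the smallest element at the root (min-heap).
    std::make_heap(merged.begin(), merged.end(), std::greater<int>());
    return merged;
}
```

The result is again a complete binary tree in array form, so both the shape invariant and the order invariant hold.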

Algorithm for merging short lists into a long vector

I have a sparse matrix class whose non-zeros and corresponding column indices are stored, in row-order, in what are basically STL-vector-like containers. They may have unused capacity, like vectors; and to insert/remove elements, existing elements must be moved.
Say I have an operation, insert_erase_replace, or ier for short. ier can do the following, given a position p, a column index j, and a value v:
if v==0, ier removes the entry at p and left-shifts all subsequent entries.
if v!=0, and j is already present at p, ier replaces the cell contents at p with v.
if v!=0, and j is not present at p, ier inserts the entry v and column index j at p after right-shifting all subsequent entries.
So all of that is trivial.
Now let's say I have ier2, which does the same thing, except that it takes a list containing multiple column indices j and corresponding values v. It also has a size n, which indicates how many index/value pairs are present in the list. But because the vector only stores non-zeros, sometimes the actual insertion size is smaller than n.
Still trivial.
But now let's say I have ier3, which takes not just one list like ier2, but multiple lists. This represents editing a slice of the sparse matrix.
At some point, it becomes more efficient to iterate through the vectors, copying them piece by piece and inserting/replacing/erasing the list indices/values ier2-style as we arrive at each insertion point. And if the total insertion size would cause my vector to need a resize anyway, then we do that.
Given that my vector is much, much larger than the total length of the lists, is there an algorithm for efficiently merging the lists into the vector?
So far, here's what I have:
Each list passed to ier3 represents either a net deletion of entries (a left shift), a net replacement (no movement, therefore cheap), or a net insertion of entries (a right shift). There may also be some re-arrangement of elements in there, but the expensive parts are the net deletions and net insertions.
It's not hard to figure out an algorithm for efficiently doing ONLY net insertions or net deletions.
It's harder when either of the two may be happening.
The only thing I can think to do is to handle it in two passes:
Erase/replace
Insert/replace
We erase first because it makes it more likely that any insertions will require fewer copies.
Is this the right approach? Does anyone know of a better one?
Okay, so I'm going to suppose the intervals covered in each list in ier3 are disjoint and given to you in order. If it's meant for editing slices of a matrix, this seems reasonable. I'm also assuming that you don't need to resize the vector, because that case is easily detectable and solvable.
Initialise a read pointer and a write pointer to the start of the vector you're editing. There'll be an instruction pointer into ier3 too, but I'll ignore that here for clarity's sake. You'll also need a queue. At each step, one of several things can happen:
Default: Neither read nor write are at a position detailed by ier3. In this case, add the element under read to the back of the queue and write the element at the front of the queue to the cell under write. Move both pointers forward one.
read is over a cell that needs to be deleted. In this case, simply move read forward one without adding anything to the queue.
read passes from one cell to the next such that an insertion should happen between them. In this case, add the insertion to the back of the queue and then continue with the next relevant case.
read is at a cell that needs to be modified. In this case, insert the modified cell at the back of the queue, write whatever's at the front of the queue to write, and step them both forwards.
read has arrived at the unused capacity of the vector. In which case just write whatever's left in the queue.
That's the basic outline, but a couple of optimizations can be made: first, if the queue's empty, step both pointers forward to the next position detailed by ier3 without doing anything. Second, minimize the buffer by doing extra writing steps whenever read is ahead of write and the queue is nonempty.
I'd go with your plan with a few important points highlighted.
The erase/replace step should start from the left and only move points within the affected range - it can leave a "gap". It should determine the size of the final vector. At the end of this step, use the determined size to shift the "tail" of the vector as needed, leaving the exact amount of space required for insertions free.
The insertions should start from the right and fill up the gap we left in step 1 by copying each point to its final position.
This will shift the main vector at most once and never copy any point (from the existing slice or insertion set) more than twice, so it's essentially linear.
Other data structures might be helpful too - reserving space at both the front and end, or building it out of multiple sections so a resize doesn't force a full copy.
One further optimisation would be to allow some insertions during step 1. If you've erased some, completing any insertion you come across immediately until it balances will prevent you needing to move any points until you reach another erase.
Let n be the size of the list and m be the size of the vector. It sounds like ier does a binary search for j every time, so the searching part is O(n*log(m)).
Assuming the elements in the list are sorted, once you find the first element, it's faster to just navigate up the vector to find the next one. That way searching becomes O(log(m) + n) = O(n).
Also, do a dry pass first to count net deletions/insertions, and a second pass to actually apply the changes. I think these two passes will run faster than the two you describe.
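To make the sorted walk concrete, here is a simplified sketch that applies a sorted edit list to one sorted row in a single linear pass. It is deliberately out of place (it builds a new vector) rather than the in-place two-pass scheme discussed in this thread, and the names Cell and apply_edits are assumptions.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<unsigned, double> Cell;   // (column index, value)

// Applies `edits` to `row`. Both must be sorted by column index and hold
// each column at most once. An edit with value 0 erases that column;
// any other value inserts or replaces it.
std::vector<Cell> apply_edits(const std::vector<Cell>& row,
                              const std::vector<Cell>& edits) {
    std::vector<Cell> out;
    out.reserve(row.size() + edits.size());
    std::size_t r = 0, e = 0;
    while (r < row.size() || e < edits.size()) {
        if (e == edits.size() ||
            (r < row.size() && row[r].first < edits[e].first)) {
            out.push_back(row[r++]);                  // untouched cell
        } else {
            if (r < row.size() && row[r].first == edits[e].first)
                ++r;                                  // old cell is replaced or erased
            if (edits[e].second != 0.0)
                out.push_back(edits[e]);              // insert or replace
            ++e;
        }
    }
    return out;
}
```

Each element of the row and of the edit list is visited once, so the pass costs O(m + n) with no binary searches at all.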
I can suggest a different design for a sparse matrix that should help you achieve performance and a low memory footprint for large sparse matrices.
Instead of a vector, why not use a 2D hash table? Something like this (std:: omitted for brevity):
typedef unordered_map< unsigned /* index */, int /* value */ > col_type;
typedef unordered_map< unsigned /* index */, col_type* > matrix_type; // illustrative name; only the key needs a hash function, col_type does not
The outer class (sparse_matrix) searches in O(1) for a column. If it is not found, it allocates a new column.
Then the column type is searched for the column index in O(1) and either delete/replace or insert is done based on the original logic. It can see if the column is now empty and delete it from the 'row' hash map.
All basic operations (add/delete/replace) are O(1) on average.
If you need a fast ordered iteration of the matrix, you can replace the unordered_map with 'map'. If the matrix is very sparse, the O(log(n)) cost per operation will be similar to the hash_map's.
BTW I used a pointer to the col_type on purpose; the outer hash map grows much (much!) faster this way.
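A compact sketch of this design is shown below. It is an illustration under assumptions: the class and method names are invented, the outer table is keyed by row, and the inner tables are stored by value instead of through the col_type pointer suggested above, which keeps the example free of manual memory management.

```cpp
#include <cstddef>
#include <unordered_map>

// Sparse matrix as a hash table of hash tables. Zeros are never stored.
class SparseMatrix {
public:
    // v == 0 erases the cell, anything else inserts or replaces it.
    void set(unsigned row, unsigned col, int v) {
        if (v != 0) {
            rows_[row][col] = v;          // creates the row on demand
            return;
        }
        auto r = rows_.find(row);
        if (r == rows_.end()) return;     // nothing stored in this row
        r->second.erase(col);
        if (r->second.empty()) rows_.erase(r);   // drop empty rows
    }

    // Missing cells read as 0.
    int get(unsigned row, unsigned col) const {
        auto r = rows_.find(row);
        if (r == rows_.end()) return 0;
        auto c = r->second.find(col);
        return c == r->second.end() ? 0 : c->second;
    }

    std::size_t stored_rows() const { return rows_.size(); }

private:
    std::unordered_map<unsigned, std::unordered_map<unsigned, int>> rows_;
};
```

Every operation is two hash lookups, so the cost is O(1) on average regardless of how many non-zeros the matrix holds.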

Prim's algorithm for dynamic locations

Suppose you have an input file:
<total vertices>
<x-coordinate 1st location><y-coordinate 1st location>
<x-coordinate 2nd location><y-coordinate 2nd location>
<x-coordinate 3rd location><y-coordinate 3rd location>
...
How can Prim's algorithm be used to find the MST for these locations? I understand this problem is typically solved using an adjacency matrix. Any references would be great if applicable.
If you already know Prim's algorithm, it is easy. Create an adjacency matrix where adj[i][j] = distance between location i and location j.
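Building that matrix might look like the sketch below, assuming the locations have already been read into a vector of (x, y) pairs and that "distance" means the Euclidean distance. The names Point and distance_matrix are made up for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<double, double> Point;   // (x, y)

// adj[i][j] is the straight-line distance between location i and location j.
std::vector<std::vector<double>> distance_matrix(const std::vector<Point>& pts) {
    const std::size_t n = pts.size();
    std::vector<std::vector<double>> adj(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            const double d = std::hypot(pts[i].first - pts[j].first,
                                        pts[i].second - pts[j].second);
            adj[i][j] = d;   // the matrix is symmetric
            adj[j][i] = d;
        }
    }
    return adj;
}
```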
I'm just going to describe some implementations of Prim's and hopefully that gets you somewhere.
First off, your question doesn't specify how edges are input to the program. You have a total number of vertices and the locations of those vertices. How do you know which ones are connected?
Assuming you have the edges and the weights of those edges (like @doomster said above, the weight may be the planar distance between the points, since they are coordinates), we can start thinking about our implementation. Wikipedia describes three different data structures that result in three different run times: http://en.wikipedia.org/wiki/Prim's_algorithm#Time_complexity
The simplest is the adjacency matrix. As you might guess from the name, the matrix describes nodes that are "adjacent". To be precise, there are |v| rows and columns (where |v| is the number of vertices). The value at adjacencyMatrix[i][j] varies depending on the usage. In our case it's the weight of the edge (i.e. the distance) between node i and j (this means that you need to index the vertices in some way. For instance, you might add the vertices to a list and use their position in the list).
Now using this adjacency matrix our algorithm is as follows:
Create a dictionary which contains all of the vertices and maps each vertex to its "distance". Initially the distance of all of the nodes is infinity.
Create another dictionary to keep track of "parents". We use this to generate the MST. It's more natural to keep track of edges, but it's actually easier to implement by keeping track of "parents". Note that if you root a tree (i.e. designate some node as the root), then every node (other than the root) has precisely one parent. So by producing this dictionary of parents we'll have our MST!
Create a new list with a randomly chosen node v from the original list.
Remove v from the distance dictionary and add it to the parent dictionary with a null as its parent (i.e. it's the "root").
Go through the row in the adjacency matrix for that node. For any node w that is connected (for non-connected nodes you have to set their adjacency matrix value to some special value: 0, -1, int max, etc.), update its "distance" in the dictionary to adjacencyMatrix[v][w] and record v as its parent. The idea is that it's not "infinitely far away" anymore... we know we can get there from v.
While the dictionary is not empty (i.e. while there are nodes we still need to connect to)
Look over the dictionary and find the vertex x with the smallest distance
Remove x from the distance dictionary and add it to our new list of vertices
For each neighbor of x that is still in the distance dictionary: if adjacencyMatrix[x][neighbor] is smaller than distance[neighbor], update its distance to adjacencyMatrix[x][neighbor] and update its parent to x. Basically, if there is a faster way to get to neighbor then the distance dictionary should be updated to reflect that; and if we then add neighbor to the new list we know which edge we actually added (because the parent dictionary says that its parent was x).
We're done. Output the MST however you want (everything you need is contained in the parents dictionary)
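The steps above can be sketched as follows. This is a minimal version under assumptions: it uses plain arrays instead of dictionaries, always starts from vertex 0 instead of a random vertex, and expects a symmetric matrix in which every pair of vertices is connected (as is the case for point locations). The name prim_dense is made up.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Returns parent[v] for every vertex of the MST rooted at vertex 0.
// The root has parent -1.
std::vector<int> prim_dense(const std::vector<std::vector<double>>& adj) {
    const std::size_t n = adj.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> dist(n, inf);      // cheapest known edge into the tree
    std::vector<int> parent(n, -1);
    std::vector<bool> in_tree(n, false);
    if (n == 0) return parent;
    dist[0] = 0.0;

    for (std::size_t step = 0; step < n; ++step) {
        // Linear scan for the closest vertex that is not in the tree yet.
        std::size_t x = n;
        for (std::size_t v = 0; v < n; ++v)
            if (!in_tree[v] && (x == n || dist[v] < dist[x])) x = v;
        in_tree[x] = true;

        // Relax the edges from x; only change the parent on an improvement.
        for (std::size_t w = 0; w < n; ++w) {
            if (!in_tree[w] && adj[x][w] < dist[w]) {
                dist[w] = adj[x][w];
                parent[w] = static_cast<int>(x);
            }
        }
    }
    return parent;
}
```

The MST edges are then the pairs (parent[v], v) for every v other than the root.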
I admit there is a bit of a leap from the wikipedia page to the actual implementation as outlined above. I think the best way to approach this gap is to just brute force the code. By that I mean, if the pseudocode says "find the min [blah] such that [foo] is true" then write whatever code you need to perform that, and stick it in a separate method. It'll definitely be inefficient, but it'll be a valid implementation. The issue with graph algorithms is that there are 30 ways to implement them and they are all very different in performance; the wikipedia page can only describe the algorithm conceptually. The good thing is that once you implement it some way, you can find optimizations quickly ("oh, if I keep track of this state in this separate data structure, I can make this lookup way faster!"). By the way, the runtime of this is O(|V|^2). I'm too lazy to detail that analysis, but loosely it's because:
1. All initialization is O(|V|) at worst
2. We do the loop O(|V|) times and take O(|V|) time to look over the dictionary to find the minimum node. So basically the total time to find the minimum node multiple times is O(|V|^2).
3. The time it takes to update the distance dictionary is O(|E|) because we only process each edge once. Since |E| is O(|V|^2) this is also O(|V|^2)
4. Keeping track of the parents is O(|V|)
5. Outputting the tree is O(|V| + |E|) = O(|E|) at worst
Adding all of these (none of them should be multiplied except within (2)) we get O(|V|^2)
The implementation with a heap is O(|E|log|V|) and it's very very similar to the above. The only difference is that updating the distance is O(log|V|) instead of O(1) (because it's a heap), BUT finding/removing the min element is O(log|V|) instead of O(|V|) (because it's a heap). The time complexity is quite similar in analysis and you end up with something like O(|V|log|V| + |E|log|V|) = O(|E|log|V|) as desired.
Actually... I'm a bit confused why the adjacency matrix implementation cares about it being an adjacency matrix. It could just as well be implemented using an adjacency list. I think the key part is how you store the distances. I could be way off in my implementation outlined above, but I am pretty sure it implements Prim's algorithm and satisfies the time complexity constraints outlined by wikipedia.
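A heap-based variant might be sketched like this. Note the swapped-in technique: std::priority_queue has no operation for updating a distance, so instead of updating, this version pushes a new entry and skips stale ones when they are popped. That gives O(|E|log|E|), which is the same as O(|E|log|V|). The adjacency-list layout and the names Graph, Item and prim_heap are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

typedef std::pair<double, int> Item;   // (edge weight, vertex)
// g[v] lists the (neighbor, weight) pairs of vertex v.
typedef std::vector<std::vector<std::pair<int, double>>> Graph;

// Returns the total weight of the MST of the component containing vertex 0.
double prim_heap(const Graph& g) {
    std::vector<bool> in_tree(g.size(), false);
    // std::greater turns the queue into a min-heap on the edge weight.
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    double total = 0.0;
    if (!g.empty()) pq.push(Item(0.0, 0));

    while (!pq.empty()) {
        const Item top = pq.top();
        pq.pop();
        const int x = top.second;
        if (in_tree[x]) continue;      // stale entry, vertex already added
        in_tree[x] = true;
        total += top.first;
        for (const auto& edge : g[x])
            if (!in_tree[edge.first])
                pq.push(Item(edge.second, edge.first));
    }
    return total;
}
```

For point locations the graph is complete, so |E| is about |V|^2 and the array-based O(|V|^2) version is likely the better fit; the heap version pays off on sparse graphs.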

Hashmap to implement adjacency lists

I've implemented an adjacency list using the vector of vectors approach, where the nth element of the vector of vectors refers to the friend list of node n.
I was wondering whether a hash map would be more useful. I still have hesitations because I cannot identify the difference between them: for example, if I want to perform an operation (search, delete) on the neighbors of the nth element, how could a hash map be more efficient than the vector of vectors approach?
A vector<vector<ID>> is a good approach if the set of nodes is fixed. If however you suddenly decide to remove a node, you'll be annoyed. You cannot shrink the vector because it would displace the elements stored after the node and you would lose the references. On the other hand, if you keep a list of free (reusable) IDs on the side, you can just "nullify" the slot and then reuse later. Very efficient.
An unordered_map<ID, vector<ID>> allows you to delete nodes much more easily. You can go ahead and assign new IDs to the newly created nodes and you will not be losing empty slots. It is not as compact, especially on collisions, but not so bad either. With older compilers there can be some slowdowns on rehashing, when a vector needs to be moved.
Finally, an unordered_multimap<ID, ID> is probably one of the easiest to manage. It also scatters memory to the wind, but hey :)
Personally, I would start prototyping with a unordered_multimap<ID, ID> and switch to another representation only if it proves too slow for my needs.
Note: if the adjacency relationship is symmetric, you can cut the number of stored entries in half by establishing that the relation (x, y) is stored for min(x, y) only.
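For prototyping, the unordered_multimap variant could be sketched as below. ID is assumed to be an int, and the helper names add_edge, neighbours and remove_node are made up for the example.

```cpp
#include <unordered_map>
#include <vector>

typedef int ID;
typedef std::unordered_multimap<ID, ID> Adjacency;   // one entry per edge

void add_edge(Adjacency& g, ID from, ID to) {
    g.emplace(from, to);
}

// Collects all neighbours of `node` (in no particular order).
std::vector<ID> neighbours(const Adjacency& g, ID node) {
    std::vector<ID> out;
    auto range = g.equal_range(node);   // all entries with this key
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    return out;
}

// Drops every edge that leaves or enters `node`.
void remove_node(Adjacency& g, ID node) {
    g.erase(node);                      // outgoing edges: one call
    for (auto it = g.begin(); it != g.end();) {
        if (it->second == node) it = g.erase(it);   // incoming edges: full scan
        else ++it;
    }
}
```

Removing the outgoing edges is a single erase, but finding the incoming ones needs a scan over the whole map unless the reverse edges are stored as well.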
Vector of vectors
A vector of vectors is a good solution when you don't need to delete edges.
You can add an edge in amortized O(1), and you can iterate over a node's neighbours in O(N).
You can delete an edge with vector[node].erase(edge), but it will be slow: the complexity is O(number of vertices).
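A small sketch of deleting an edge from a vector of vectors, assuming nodes are numbered from 0 (the names AdjList and erase_edge are made up). The linear search for the neighbour is what costs O(number of vertices) in the worst case. Overwriting the found slot with the last element avoids the additional shifting that erase would do, at the price of not preserving the neighbour order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<std::size_t>> AdjList;

// Removes the edge from -> to if it exists. Returns true if it was removed.
bool erase_edge(AdjList& g, std::size_t from, std::size_t to) {
    std::vector<std::size_t>& nbrs = g[from];
    auto it = std::find(nbrs.begin(), nbrs.end(), to);   // linear scan
    if (it == nbrs.end()) return false;
    *it = nbrs.back();   // overwrite with the last neighbour...
    nbrs.pop_back();     // ...and drop the duplicate at the end
    return true;
}
```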
Hash map
I am not sure how you want to use hash map. If inserting edge means setting hash_map[edge] = 1 then notice that you are unable to iterate over node's neighbours.