Efficiently manage handle for an array binary heap in C++? - c++

Is there a way to efficiently keep track of handles in an array binary heap?
Since there's no fast lookups built into traditional binary heaps, users need a 'handle_type' for delete or decrease_key on an arbitrary element in the heap.
You can't use a pointer or an index into the array, because heap operations will move the element's location around. I can't think of a simple solution using only stack-like structures the way a traditional heap implementation does. Otherwise, I'd have to use 'new/delete' and that feels inefficient.
I don't want to get too obsessed with premature optimization, but I want this to be a part of my standard library of utilities, so I'd like to put in a little effort to learn what is considered best practice for this sort of thing.
Maybe just a naive implementation using 'new/delete' is the way to go here. But I'd like to get some advice from people smarter than me first, if that's ok.
The priority queue implementation in the C++ standard library seems to side-step this issue entirely by simply not supporting a 'decrease_key' operation. I've been leafing through CLRS, and they mention this issue but don't really discuss it:
we shall not pursue them here, other than noting that... these handles
do need to be correctly maintained.
Is there a simple approach here I'm overlooking? What do "serious" libraries do when they need a general purpose array heap that needs a 'decrease_key' operation?

Is there a way to efficiently keep track of handles in an array binary tree?
It's hypothetically possible (but not very pretty). Internally, instead of storing an array of elements, the data structure would store an array of pointers (or smart pointers) to a struct containing a pair: an index and an element.
When an element is first inserted into position i in the array, the data structure would initialize the index of this struct to i.
As the data structure moves elements around the array, it should modify the index to reflect the new position.
The result of push can be a pointer to this struct (possibly wrapped in an opaque class). In order to access a specific element (e.g., for decrease_key), you would call some method of the heap with this return value. The heap would then
Know the address of the array (it is its member, after all)
Know the index within the array, through the struct you just sent it.
It could thereby implement decrease_key, for example.
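A minimal sketch of the scheme described above, assuming C++14 and int keys. The names HandleHeap, Node, Handle and handle_heap_demo are made up for illustration and are not taken from any library. Each array slot owns a small struct that records its own index, and every swap rewrites that index, so a handle can always find its element.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Min-heap of ints whose push() returns a stable handle (hypothetical API).
class HandleHeap {
public:
    struct Node {
        std::size_t index;  // current position in the array, kept up to date
        int key;
    };
    using Handle = Node*;   // stays valid until the node is popped

    Handle push(int key) {
        slots_.push_back(std::make_unique<Node>(Node{slots_.size(), key}));
        Handle h = slots_.back().get();
        sift_up(h->index);
        return h;
    }

    bool empty() const { return slots_.empty(); }
    int top() const { return slots_.front()->key; }  // requires !empty()

    void pop() {  // requires !empty()
        swap_slots(0, slots_.size() - 1);
        slots_.pop_back();  // destroys the old minimum's node
        if (!slots_.empty()) sift_down(0);
    }

    // O(log n): the handle already knows where its element lives.
    void decrease_key(Handle h, int new_key) {
        assert(new_key <= h->key);
        h->key = new_key;
        sift_up(h->index);
    }

private:
    void swap_slots(std::size_t a, std::size_t b) {
        std::swap(slots_[a], slots_[b]);
        slots_[a]->index = a;  // every move rewrites the stored index
        slots_[b]->index = b;
    }
    void sift_up(std::size_t i) {
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;
            if (slots_[parent]->key <= slots_[i]->key) break;
            swap_slots(i, parent);
            i = parent;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t smallest = i, l = 2 * i + 1, r = l + 1;
            if (l < slots_.size() && slots_[l]->key < slots_[smallest]->key) smallest = l;
            if (r < slots_.size() && slots_[r]->key < slots_[smallest]->key) smallest = r;
            if (smallest == i) break;
            swap_slots(i, smallest);
            i = smallest;
        }
    }
    std::vector<std::unique_ptr<Node>> slots_;
};

// Usage: keep the handle returned by push() and pass it back later.
int handle_heap_demo() {
    HandleHeap heap;
    heap.push(5);
    HandleHeap::Handle h = heap.push(9);
    heap.push(7);
    heap.decrease_key(h, 1);  // 9 becomes 1 and rises to the root
    return heap.top();
}
```

Each push still costs one dynamic allocation, which is the "worse constants" part; a pool allocator could hide that, but that is outside this sketch.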
However, there are probably better (and less cumbersome) solutions. Note that the above solution wouldn't alter the asymptotic complexity of the array heap, but the constants would be worse. Conversely, if you look at this summary of heap running times, you can see that a binary heap doesn't really have good performance for decrease_key operations: it takes O(log n), whereas a Fibonacci heap does it in O(1) amortized time. If you need that, you're probably better off with a Fibonacci heap (or some other data structure). This leads to your final question:
What do "serious" libraries do when they need a general purpose array heap that needs a 'decrease_key' operation?
Libraries such as boost::heap usually indeed implement other data structures more suited for more advanced operations (e.g., decrease_key). These data structures are naturally node based, and naturally support return values which are not invalidated so easily as in the array.

Related

How to keep track of visited points in C++

I am doing a problem in c++ that has to keep track of points that are visited in a traversal. The point is basically,
struct Point {
    int x;
    int y;
};
My first thought to solving something like this would be to use something like
std::set<Point> visited_points;
or maybe
std::map<Point, bool> visited_points;
However, I am a beginner in c++, and I realized you have to implement a Compare, which I didn't know how to do. When I asked someone, I was told that using a map was "overkill" in a problem like this. He said the better solution was to do something like
std::vector<std::vector<bool>> visited_points;
He said std::map was not the best solution, since using a vector was faster.
I'm wondering why using a double vector is better in terms of style and performance. Is it because implementing a Compare is hard for a Point? A double vector feels hacky to me, and I also think it looks uglier than using a set or map. Is it really the best way to approach this problem, or is there a better solution I don't know about?
If someone asks you, in abstract, "What is the best way of keeping track of objects I've visited?", then you would be forgiven for replying "Use an std::unordered_set<Object>" (usually called a hash table for languages other than C++). That's a nice simple answer and it is often correct if you don't know anything at all about the objects. After all, a hash lookup is (expected) O(1), and in practice is usually quite fast.
There are a few caveats, the biggest one being that you will need to be able to compute a hash for each object. The C++ standard library does not (yet) come with a framework for computing hashes of arbitrary objects, not even PODs, and rendering an object as a string in order to be able to take advantage of std::hash<std::basic_string> is usually way too much work (unless the object is already a string, of course).
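For a small struct like the Point in the question, a hand-written hash is short. The following is only a sketch: PointHash, VisitedSet and hash_set_demo are names invented here, and the mixing formula is one common ad hoc choice, not something the standard library provides.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point {
    int x;
    int y;
};

// unordered containers need equality as well as a hash.
bool operator==(const Point& a, const Point& b) {
    return a.x == b.x && a.y == b.y;
}

// Hypothetical hand-rolled hash: mixes the two coordinate hashes.
struct PointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t hx = std::hash<int>()(p.x);
        std::size_t hy = std::hash<int>()(p.y);
        return hx ^ (hy + 0x9e3779b9u + (hx << 6) + (hx >> 2));
    }
};

using VisitedSet = std::unordered_set<Point, PointHash>;

// Usage: insert() reports whether the point was new.
void hash_set_demo() {
    VisitedSet visited;
    assert(visited.insert(Point{3, 4}).second);   // first visit
    assert(!visited.insert(Point{3, 4}).second);  // already there
}
```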
If you can't figure out how to write a hash function for your object, you might then think about using an ordered associative container (aka a balanced BST). However, that is not a good idea. Not because it is difficult to write a comparison function. Writing comparison functions is usually trivial, particularly for PODs; you can leverage the fact that std::tuple implements a comparison function for every tuple whose element types are all comparable.
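To show how little work the comparison function is, here is a sketch for the Point from the question using std::tie, which builds a tuple of references and compares lexicographically. The helper names mark_visited and ordered_set_demo are made up for this example.

```cpp
#include <cassert>
#include <set>
#include <tuple>

struct Point {
    int x;
    int y;
};

// Lexicographic ordering: compare x first, then y.
bool operator<(const Point& a, const Point& b) {
    return std::tie(a.x, a.y) < std::tie(b.x, b.y);
}

// Returns true the first time a point is seen, false on repeats.
bool mark_visited(std::set<Point>& visited, Point p) {
    return visited.insert(p).second;
}

void ordered_set_demo() {
    std::set<Point> visited;
    assert(mark_visited(visited, Point{1, 2}));
    assert(!mark_visited(visited, Point{1, 2}));
}
```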
The real problem with ordered associative containers is that they are high overhead. Element access is slow: O(log n), not O(1), and the constant is not small either. And the bookkeeping data required to maintain the balanced tree is much larger than the two-pointer hash-table node (and even that is quite big for small objects). So ordered associative containers really only make sense if you need to be able to traverse them in order. Generally, "visited" maps don't need to be traversed at all -- they are just used for lookup.
Both ordered and unordered containers have another problem: the objects in the container are individual dynamic memory allocations (the API requires that references to the objects in the container must be stable), so over time the individual objects end up getting scattered across dynamic memory, leading to a lot of cache misses.
But, really, even before you start thinking about how easy (or difficult) it will be to hash your objects in order to keep them in a hash-set, you should think about the nature of the objects you are tracking. In particular, can they be easily indexed with a small(-ish) integer? If so, you could just use a vector of bits, one bit per possible object. That's an efficient representation, both for access speed (definitely O(1)) and for space, and it is optimal for memory caching.
If your objects are easily numbered then bit-vectors will be an attractive alternative. One bit per object is (literally) two orders of magnitude less space than a hash-map, so unless you expect your visited map to be extremely sparse (rarely the case in algorithms which need a visited map), it's going to be a big win.
In the case of your problem, which I gather has to do with keeping track of points visited in a rectangular array such as a gameboard or an image, it is clear that the bit vector approach is going to work out well. It's true that you require two levels of indexing (unless you reduce the two indices into a single integer, which is quite easy if you know the dimensions), but that doesn't add much overhead.
Although there are doubts about how good an idea it was, the C++ standard library special cases std::vector<bool> to really be a bit vector. That makes it impossible to create a native pointer to a single element of the vector (which is why many people consider std::vector<bool> to be a hack), and creates some other odd issues when you try to use it as a vector. But if all you want is a bitmask -- as in the case of a visited map -- then it is a pretty good solution.
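As a rough sketch of that approach for a rectangular board, assuming the dimensions are known when the grid is created: the two indices are reduced to one with y * width + x, and a single std::vector<bool> holds one bit per cell. VisitedGrid is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One bit per cell of a width x height board (hypothetical helper class).
class VisitedGrid {
public:
    VisitedGrid(std::size_t width, std::size_t height)
        : width_(width), height_(height), bits_(width * height, false) {}

    bool visited(std::size_t x, std::size_t y) const { return bits_[index(x, y)]; }
    void mark(std::size_t x, std::size_t y) { bits_[index(x, y)] = true; }

private:
    // Row-major flattening of the two coordinates into one index.
    std::size_t index(std::size_t x, std::size_t y) const {
        assert(x < width_ && y < height_);
        return y * width_ + x;
    }
    std::size_t width_;
    std::size_t height_;
    std::vector<bool> bits_;
};
```

Compared with std::vector<std::vector<bool>>, the flat version makes one allocation instead of one per row, at the cost of doing the index arithmetic yourself.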
C++ also offers real bit vectors -- std::bitset -- but unfortunately these need to have their size known at compile time. Boost offers dynamic_bitset, which is a kind of std::vector<bool> written with hindsight, so it's also worth looking at.

Fast data structure that supports finding the minimum element and accessing, inserting, removing and updating data at any index

I'm looking for ideas to implement a templatized sequence container data structure which can beat the performance of std::vector in as many features as possible and potentially perform much faster. It should support the following:
Finding the minimum element (and returning its index)
Insertion at any index
Removal at any index
Accessing and updating any element by index (via operator[])
What would be some good ways to implement such a structure in C++?
You can generally be pretty sure that the STL implementations of all containers are very good at the range of tasks they were designed for. That is to say, you're unlikely to be able to build a container that is as robust as std::vector and quicker for all applications. However, generally speaking, it is almost always possible to beat a generic tool when optimizing for a specific application.
First, let's think about what a vector actually is. You can think of it as a pointer to a C-style array, except that its elements are stored on the heap. Unlike a C array, it also provides a bunch of methods that make it a little bit more convenient to manipulate. But like a C array, all of its data is stored contiguously in memory, so lookups are extremely cheap, but changing its size may require the entire array to be moved elsewhere in memory to make room for the new elements.
Here are some ideas for how you could do each of the things you're asking for better than a vanilla std::vector:
1. Finding the minimum element: Search is typically O(N) for many containers, and certainly for a vector (because you need to iterate through all elements to find the lowest). You can make it O(1), or very close to free, by simply keeping track of the smallest element at all times, and only updating it when the container is changed.
2. Insertion at any index: If your elements are small and there are not many, I wouldn't bother tinkering here, just do what the vector does and keep elements contiguously next to each other to keep lookups quick. If you have large elements, store pointers to the elements instead of the elements themselves (boost's stable vector will do this for you). Keep in mind that this makes lookup more expensive, because you now need to dereference the pointer, so whether you want to do this will depend on your application. If you know the number of elements you are going to insert, std::vector provides the reserve method which preallocates some memory for you, but what it doesn't do is allow you to decide how the size of the allocated memory grows. So if your application warrants lots of push_back operations without enough information to intelligently call reserve, you might be able to beat the standard std::vector implementation by tailoring the growth function of your container to your particular needs. Another option is using a linked list (e.g. std::list), which will beat an std::vector in insertions for larger containers. However, the cost here is that lookup (see 4.) will now become vastly slower (O(N) instead of O(1) for vectors), so you're unlikely to want to go down this path unless you plan to do more insertions/erasures than lookups.
3. Removal at any index: Similar considerations as for 2.
4. Accessing and updating any element by index (via operator[]): The only way you can beat std::vector in this regard is by making sure your data is in the cache when you try to access it. This is because lookup for a vector is essentially an array lookup, which is really just some pointer arithmetic and a pointer dereference. If you don't access your vector often you might be able to squeeze out a few clock cycles by using a custom allocator (see boost pools) and placing your pool close to the stack pointer.
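The idea of tracking the minimum could look roughly like the sketch below, for int elements. MinTrackingVector and its member names are invented for illustration. Reading the minimum's index is O(1); the price is a full rescan whenever the current minimum is erased or overwritten, and updates go through set() rather than a reference-returning operator[] so the container can see them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// std::vector<int> wrapper that caches the index of its smallest element.
class MinTrackingVector {
public:
    void insert(std::size_t pos, int value) {
        assert(pos <= data_.size());
        data_.insert(data_.begin() + static_cast<std::ptrdiff_t>(pos), value);
        if (data_.size() == 1) { min_index_ = 0; return; }
        if (pos <= min_index_) ++min_index_;  // old minimum shifted right
        if (value < data_[min_index_]) min_index_ = pos;
    }

    void erase(std::size_t pos) {
        assert(pos < data_.size());
        data_.erase(data_.begin() + static_cast<std::ptrdiff_t>(pos));
        if (pos == min_index_) rescan();  // lost the minimum: O(n)
        else if (pos < min_index_) --min_index_;
    }

    void set(std::size_t pos, int value) {
        assert(pos < data_.size());
        data_[pos] = value;
        if (value < data_[min_index_]) min_index_ = pos;
        else if (pos == min_index_) rescan();  // the minimum may have grown
    }

    int operator[](std::size_t pos) const { return data_[pos]; }
    std::size_t size() const { return data_.size(); }
    std::size_t min_index() const { return min_index_; }  // meaningful only when size() > 0

private:
    void rescan() {
        min_index_ = static_cast<std::size_t>(
            std::min_element(data_.begin(), data_.end()) - data_.begin());
    }
    std::vector<int> data_;
    std::size_t min_index_ = 0;
};
```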
I stopped writing mainly because there are dozens of ways in which you could approach this problem.
At the end of the day, this is probably more of an exercise in teaching you that the implementation of std::vector is likely to be extremely efficient for most compilers. All of these suggestions are essentially micro-optimizations (which are the root of all evil), so please don't blindly apply these in important code, as they're highly likely to end up costing you a lot of time and headache.
However, that's not to say you shouldn't tinker and learn for yourself, so by all means go ahead and try to beat it for your application and let us know how you go! Good luck :)

Pointers or Indexes?

I have a network-like data structure, composed of nodes linked together.
The nodes, whose number will change, will be stored in a std::vector<Node> in no particular order, where Node is an appropriate class.
I want to keep track of the links between nodes. Again, the number of these links will change, and I was thinking about using again a std::vector<Link>. The Link class has to contain the information about the two nodes it's connecting, as well as other link features.
Should Link contain
two pointers to the two nodes?
two integers, to be used as indexes into the std::vector<Node>?
or should I adopt a different system (why?)
The first approach, although probably better, is problematic as the pointers will have to be regenerated every time I add or remove nodes from the network, but on the other hand it will free me from e.g. storing nodes in a random-access container.
This is difficult to answer in general. There are various performance and ease-of-use trade-offs.
Using pointers can provide a more convenient usage for some operations. For example
link.first->value
vs.
nodes[link.first].value
Using pointers may provide better or worse performance than indices. This depends on various factors. You would need to measure to determine which is better in your case.
Using indices can save space if you can guarantee that there are only a certain number of nodes. You can then use a smaller data type for the indices, whereas with pointers you always need to use the full pointer size no matter how many nodes you have. Using a smaller data type can have a performance benefit by allowing more links to fit within a single cache line.
Copying the network data structure will be easier with indices, since you don't have to recreate the link pointers.
Having pointers to elements of a std::vector can be error-prone, since the vector may move the elements to another place in memory after an insert.
Using indices will allow you to do bounds checking, which may make it easier to find some bugs.
Using indices makes serialization more straightforward.
All that being said, I often find indices to be the best choice overall. Many of the syntactical inconveniences of indices can be overcome by using convenience methods, and you can switch the indices to pointers during certain operations where pointers have better performance.
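Here is a sketch of what the index-based layout with convenience methods might look like. The Network struct and the first_node/second_node accessors are names I made up, and the 32-bit index type is an assumption that the network never holds more than about four billion nodes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Node {
    int value;
};

struct Link {
    std::uint32_t first;   // index into Network::nodes
    std::uint32_t second;  // index into Network::nodes
};

struct Network {
    std::vector<Node> nodes;
    std::vector<Link> links;

    // Convenience accessors hide the nodes[link.first] indirection
    // and give a natural place for bounds checks.
    Node& first_node(const Link& link) {
        assert(link.first < nodes.size());
        return nodes[link.first];
    }
    Node& second_node(const Link& link) {
        assert(link.second < nodes.size());
        return nodes[link.second];
    }
};
```

Because the links hold no addresses, a plain copy of the Network is immediately valid, and growing the nodes vector does not invalidate anything.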
Specify the interface for the class you want to use or create. Write unit tests. Do the most simple thing to fulfill the unit tests.
So it depends on the interface of the class. For example, if a Link doesn't export information about the nodes, then it doesn't really matter which approach you choose. On the other hand, if you go for pointers, consider std::shared_ptr.
I would add a (or a number of) link pointer to your Node class and then hand maintain the links. This will save you having to use an additional container.
If you are looking for something a bit more structured you can try using Boost Intrusive. This effectively does the same thing in a more generalized fashion.
You can avoid the Link class altogether if you use:
struct Node
{
    std::vector<Node*> parents;
    std::vector<Node*> children;
};
With this approach,
You avoid creating another class.
Your memory requirements are reduced.
You have to make fewer pointer traversals to traverse the network of Nodes.
Downside. You have to make sure that:
When creating or removing a link you have to update two objects.
When you delete a Node, you have to remove pointers to it from its parents and children.
You could make it a std::vector<Node *> instead of std::vector<Node> and allocate the nodes with new.
Then:
You can store the pointers to the nodes in the Link class without fear of them becoming invalidated
You can still randomly access them in the nodes vector.
Downside is that you will need to remember to delete them when they are removed from the node list.
My personal experience with vectors in graph-like structures has brought up these invariants.
Don't store data in vectors, where other classes hold a pointer/reference
You have a graph like data structure. If the code is not performance critical (this is something different to performance sensitive!) you should not consider cache compacting your data structures.
If you don't know how large your graph will be and you keep your Node data in a vector, all iterators and pointers are invalidated once the vector reallocates its storage. This means that you somehow have to regenerate your whole data structure; perhaps you have to create a copy of all of it and use DFS or similar to adjust the pointers. The same thing will happen if you want to remove data in the middle of one of your vectors.
If you know how large your data will be you'll be set in stone to keep it that way or you'll have huge headaches once you reconsider.
Don't use shared pointers to keep track of what needs to be freed
If you have a graph-like data structure and you delete on performance-critical paths, it's unwise to call delete whenever your algorithm decides it doesn't need the data anymore. One possibility is to keep data on the heap (if it is performance critical, consider a pool allocator) and either mark objects you don't need anymore during your performance-critical sections (if you really, really need to save space you can consider pointer tagging) or use some simple mark-and-sweep algorithm afterwards to find items no longer needed (yes, graph algorithms are one of those cases where Sutter says garbage collection is faster than smart pointers).
Be aware that deferred destruction of objects means that you lose all RAII-like features in your Node classes.

Are there versions of the C++ STL's associative data structures optimized for numerous partial copies?

I have a large tree that grows as my algorithm progresses. Each node contains a set, which I suppose is implemented as a balanced binary search tree. Each node's set shall remain fixed after that node's creation, before its use in creating that node's children.
I fear however that copying each set is prohibitively expensive. Instead, I would prefer that each newly created node's set utilize all appropriate portions of the parent node's set. In short, I'm happy copying O(log n) of the set but not O(n).
Are there any variants of the STL's associative data structures that offer such a partial copy optimization? Perhaps in Boost? Such a data structure would be trivial to implement in Haskell or OCaml of course, but it'd require more effort in C++.
I know it's not generally productive to suggest a different language, but Haskell's standard container libraries do exactly this. I remember seeing a video (was it Simon Peyton Jones?) talking about this exact problem, and how a Haskell solution ended up being much faster than a C++ solution for the given programmer effort. Of course, this was for a specific problem that had a lot of sets with a lot of shared elements.
There is a fair amount of research into this subject. If you are looking for keywords, I suggest searching for "functional data structures" instead of "immutable data structures", since most functional paradigms benefit from immutability in general. Structures such as finger trees were developed to solve exactly this problem.
I know of no C++ library that implements these data structures. There is nothing stopping you from reading the relevant papers (or the Haskell source code, which is about 1k lines for Data.Set including tests) and implementing it yourself, but I know that is not what you'd want to hear. You'd also need to do some kind of reference counting for the shared nodes, which for such deep structures can have a higher overhead than even simple garbage collectors.
It's practically impossible in C++, since the notion of an immutable
container doesn't exist. You may know that you'll be making no changes,
and that some sort of shared representation would be preferable, but the
compiler and the library don't, and there's no way of communicating this
to them.
Each node contains set, which I suppose is implemented as balanced
binary search tree. Each node's set shall remain fixed after that
node's creation, before its use in creating that node's children.
That's a pretty unique case. I would recommend using std::vector instead. (No really!) The code that is creating the node can still use a set, switching to a vector at the last second. However, the vector is smaller and takes only a tiny number of memory allocations (one if you use reserve), making the algorithm much faster.
#include <map>
#include <set>
#include <vector>
typedef unsigned int treekeytype;
typedef std::vector<unsigned int> minortreetype;
typedef std::pair<treekeytype, minortreetype> majornode;
typedef std::map<treekeytype, minortreetype> majortype;
majortype majortree;
void func(majortype::iterator perform) {
    // Build the child's contents in a set while they are still changing.
    std::set<unsigned int> results(perform->second.begin(), perform->second.end());
    // Freeze them into a sorted vector.
    majortree[perform->first + 1].assign(results.begin(), results.end()); //the only change is here
    majortype::iterator next = majortree.find(perform->first + 1);
    func(next); // a real algorithm needs a stopping condition before recursing
}
You can even use std::lower_bound and std::upper_bound to still get O(log(n)) memory accesses since it's still sorted the same as the set was, so you won't lose any efficiency. It's pure gain as long as you don't need to insert/remove frequently.
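A small sketch of that lookup, assuming unsigned int keys as in the snippet above. The function names freeze and contains are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Copy a finished set into a vector; iteration order of std::set is
// ascending, so the vector comes out already sorted.
std::vector<unsigned int> freeze(const std::set<unsigned int>& s) {
    return std::vector<unsigned int>(s.begin(), s.end());
}

// Membership test on the frozen vector using binary search.
bool contains(const std::vector<unsigned int>& sorted, unsigned int key) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    return it != sorted.end() && *it == key;
}
```

The is_sorted check is O(n) and only there to document the precondition; it disappears when NDEBUG is defined.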
I fear however that copying each set is prohibitively expensive.
If this fear arises because each set contains mostly the same nodes as its parent's, and the data is costly (to copy or in memory, whichever), with only a few nodes changed, make the subtrees contain std::shared_ptrs instead of the data themselves. This means the data itself will not get copied, only the pointers.
I realize this isn't what you were aiming at with the question, but as JamesKanze said, I know of no such container. Other than possibly a bizarre and dangerous use of the STL's rope class. Note that I said and meant STL, not the standard C++ library. They're different.

What is the C++ equivalent of C# Collection<T> and how do you use it?

I have the need to store a list/collection/array of dynamically created objects of a certain base type in C++ (and I'm new to C++). In C# I'd use a generic collection, what do I use in C++?
I know I can use an array:
SomeBase* _anArrayOfBase = new SomeBase[max];
But I don't get anything 'for free' with this - in other words, I can't iterate over it, it doesn't expand automatically and so on.
So what other options are there?
Thanks
There is std::vector, which is a wrapper around an array that can expand and will do so automatically. However, expanding is an expensive operation (the vector allocates a new block and moves every element into it), so if you are going to do a lot of insertion or removal operations, don't use a vector. (You can use the reserve function to reserve a certain amount of space up front.)
std::list is a linked list, which has far faster insertion and removal times, but iteration is slower as the values are not stored in contiguous memory, which means that address calculation is far more complex and you can't take advantage of the processor's cache when iterating over the list.
The major upside compared to the vector or deque is that elements can be added or removed from anywhere in the list fairly cheaply.
As a compromise, there is std::deque, which externally works in a similar way to a vector, but internally they are very different. The deque's storage doesn't have to be contiguous, so it can be divided up into blocks, meaning that when the deque grows, it doesn't have to reallocate the storage space for its entire contents. Access is slightly slower and you can't do pointer arithmetic to get an element.
You should use a vector.
#include <vector>
int main()
{
    std::vector<SomeBase*> baseVector;
    baseVector.push_back(new SomeBase());
}
C++ contains a collection of data containers within the STL.
Check it out here.
You should use one of the containers
std::vector<SomeBase>
std::list<SomeBase>
and if you really need dynamically allocated objects
std::vector<boost::shared_ptr<SomeBase>>
std::list<boost::shared_ptr<SomeBase>>
Everyone has mentioned the common SC++L containers, but there is another important caveat when doing this in C++ (that Chaoz has included in his example).
In C++, your collection will need to be templated on SomeBase*, not on SomeBase. If you try to assign an instance of the derived type to an instance of the base type, you will end up causing what is called object slicing. This is almost definitely not what you are trying to do.
Since you are coming from C#, just remember that "SomeBase MyInstance" means something very different in the two languages. The C++ equivalent of the C# declaration is usually "SomeBase* MyPointer" or "SomeBase& MyReference".
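To make the slicing problem concrete, here is a small sketch. Derived and the function names are invented for the example, and I've used std::unique_ptr (C++14) rather than the raw pointers or boost::shared_ptr shown in the other answers, purely so the example cleans up after itself.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct SomeBase {
    virtual ~SomeBase() = default;
    virtual std::string name() const { return "base"; }
};

struct Derived : SomeBase {  // hypothetical derived type
    std::string name() const override { return "derived"; }
};

// A container of values keeps only the SomeBase part of what you put in.
std::string name_after_slicing() {
    std::vector<SomeBase> by_value;
    by_value.push_back(Derived());  // sliced: the Derived part is discarded
    return by_value[0].name();
}

// A container of pointers keeps the whole object and its dynamic type.
std::string name_through_pointer() {
    std::vector<std::unique_ptr<SomeBase>> by_pointer;
    by_pointer.push_back(std::make_unique<Derived>());
    return by_pointer[0]->name();
}

void slicing_demo() {
    assert(name_after_slicing() == "base");
    assert(name_through_pointer() == "derived");
}
```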
Use a vector. Have a look here.
I am a huge fan of std::deque. If you want things for free, the deque gives them to you. Fast access from the head and tail of the list. iterators, reverse_iterators, fast insertion at the head and the tail. It's not super specialized, but you wanted free stuff. ;-)
Also, I will link a great STL reference. The STL is where you get all the standard "free" stuff in C++. Standard Template Library. Enjoy!
Use STL. std::vector and std::set for instance. Plenty of examples out there.