c++ set versus vector + heap operations for an A* priority queue - c++

When is using a std::set more efficient (w.r.t. time) than using a std::vector along with make_heap/push_/pop_ for the priority queue in an A* operation? My guess is that if the vertices in the open list are small, using a vector is a better option. But does anyone have experience with this?

If i had to venture a guess? I'd guess that the vector version is probably a good choice because once it grows to a certain size, there won't be very many allocs.
But I don't like guessing. I prefer hard numbers. Try both, profile!

Use a priority queue. A binary heap based one is fine (like the vector based std priority queue). You can build the heap in O(n) time and all relevent operations take O(logn). In addition to that you can implement the decrease key operation which is useful for a*. It might be tricky to implement for the std queue however. The only way to do that with a set is to remove the element and reinsert it with a different priority.
Edit: you might want to look into using
std::make_heap
(and related functions). That way you get access to the vector and can easily implement decrease_key.
Edit 2: I see that you were intending on using make heap, so all you'd have to do is implement decrease key.

For A* search, I would go with a
std::vector-based priority queue.
However, the change in the
implementation from std::vector to another STL container should be
quite trivial, so I would experiment with different versions and see how
does it affect the algorithm
performance. In addition to stl::map, I would definitely try stl::deque.

If you're doing A* pathfinding work, check out the articles in AI Wisdom by Dan Higgins. In there is a description of how to get data structures fast. He mentions a "cheap list" which is like having a hot cache for recent nodes and avoiding a lot of the penalties for pathfinding data structures.
http://introgamedev.com/resource_aiwisdom.html

I don't think a vector based data structure for a priority queue for an A* search is a good idea because you're going to be constantly adding a single element somewhere in the list. If the fringe (I assume this is how you're doing it) is large and the element is to be added in the middle, this is highly inefficient.
When I implemented A* in Java a few weeks ago, I used the Java PriorityQueue which apparently is based on a priority heap, and that seems like a good way to do it. I recommend using set in C++.
EDIT: thanks Niki. I now understand how binary heaps are implemented (with an array), and I understand what you're actually asking in your question. I suspect a priority_queue is the best option, although (as Igor said) it wouldn't be hard to swap it over to a set to check the performance of that. I'm guessing there's a reason why priority queues (in Java and C++ at least) are implemented using binary heaps.

Related

What is are the advantages of a custom data structure?

What's the need to go for defining and implementing data structures (e.g. stack) ourselves if they are already available in C++ STL?
What are the differences between the two implementations?
First, implementing by your own an existing data structure is a useful exercise. You understand better what it does (so you can understand better what the standard containers do). In particular, you understand better why time complexity is so important.
Then, there is a quality of implementation issue. The standard implementation might not be suitable for you.
Let me give an example. Indeed, std::stack is implementing a stack. It is a general-purpose implementation. Have you measured sizeof(std::stack<char>)? Have you benchmarked it, in the case of a million of stacks of 3.2 elements on average with a Poisson distribution?
Perhaps in your case, you happen to know that you have millions of stacks of char-s (never NUL), and that 99% of them have less than 4 elements. With that additional knowledge, you probably should be able to implement something "better" than what the standard C++ stack provides. So std::stack<char> would work, but given that extra knowledge you'll be able to implement it differently. You still (for readability and maintenance) would use the same methods as in std::stack<char> - so your WeirdSmallStackOfChar would have a push method, etc. If (later during the project) you realize or that bigger stack might be useful (e.g. in 1% of cases) you'll reimplement your stack differently (e.g. if your code base grow to a million lines of C++ and you realize that you have quite often bigger stacks, you might "remove" your WeirdSmallStackOfChar class and add typedef std::stack<char> WeirdSmallStackOfChar; ....)
If you happen to know that all your stacks have less than 4 char-s and that \0 is not valid in them, representing such "stack"-s as a char w[4] field is probably the wisest approach. It is fast and easy to code.
So, if performance and memory space matters, you might perhaps code something as weird as
class MyWeirdStackOfChars {
bool small;
union {
std::stack<char>* bigstack;
char smallstack[4];
}
Of course, that is very incomplete. When small is true your implementation uses smallstack. For the 1% case where it is false, your implemention uses bigstack. The rest of MyWeirdStackOfChars is left as an exercise (not that easy) to the reader. Don't forget to follow the rule of five.
Ok, maybe the above example is not convincing. But what about std::map<int,double>? You might have millions of them, and you might know that 99.5% of them are smaller than 5. You obviously could optimize for that case. It is highly probable that representing small maps by an array of pairs of int & double is more efficient both in terms of memory and in terms of CPU time.
Sometimes, you even know that all your maps have less than 16 entries (and std::map<int,double> don't know that) and that the key is never 0. Then you might represent them differently. In that case, I guess that I am able to implement something much more efficient than what std::map<int,double> provides (probably, because of cache effects, an array of 16 entries with an int and a double is the fastest).
That is why any developer should know the classical algorithms (and have read some Introduction to Algorithms), even if in many cases he would use existing containers. Be also aware of the as-if rule.
STL implementation of Data Structures is not perfect for every possible use case.
I like the example of hash tables. I have been using STL implementation for a while, but I use it mainly for Competitive Programming contests.
Imagine that you are Google and you have billions of dollars in resources destined to storing and accessing hash tables. You would probably like to have the best possible implementation for the company use cases, since it will save resources and make search faster in general.
Oh, and I forgot to mention that you also have some of the best engineers on the planet working for you (:
(This video is made by Kulukundis talking about the new hash table made by his team at Google )
https://www.youtube.com/watch?v=ncHmEUmJZf4
Some other reasons that justify implementing your own version of Data Structures:
Test your understanding of a specific structure.
Customize part of the structure to some peculiar use case.
Seek better performance than STL for a specific data structure.
Hating STL errors.
Benchmarking STL against some simple implementation.

Can most of the data structures be implemented using vectors?

I used C++ vectors to implement stacks, queue, heaps, priority queue and directed weighted graphs. In the books and references, I have seen big classes for these data structures, all of which can be implemented in short using vectors. (May be there is more flexibility in using pointers)
Can we also implement even advanced data structures using vectors ?
If yes, why do C++ books still explain concepts with the long classes using pointers ?
Is it to keep in mind the lower level idea, if it is more vivid that way or it makes students equipped with such usage of pointers ?
It's true that many data structures can be implemented on top of a vector (array, for the sake of this answer), essentially all of them can, since every computation task can be implemented to run on a turing-machine which has a far more basic data access capability (or, in the real world, you may say that any program you implement with pointers eventually runs on a CPU with a simply array-like virtual memory space, so you could just call that a huge array). However, it's not always clever. Two main reasons :
performance / time complexity - a vector simply can't provide all basic operations that in O(1). There's a solution for fast initialization, but try to randomly insert values into a large vector and see how bad you perform - that's because you have to move all the elements by one place over and over. A list could do that in a single operation. Of course other structures have their own performance shortcomings, but that's the beauty of designing complicated data structures with these basic building blocks.
structural complexity - you can think of a list along the same line of a vector as an ordered container, and perhaps extend this into multidimensional matrices that can be implemented on top of them since they still retain some basic ordering, but there are more complicated structures. Take for e.g. a tree, a simple full binary tree one can be implemented with a vector very easily since the parent-child relations can be easily converted to index arithmetics, but what if the tree isn't full and has varying number of children per node? Now, you may say it can still be done (any graph can be implemented with vectors either through adjacency matrix or adjacency list for e.g.), but there's almost no sense in doing so when you can have a much simpler implementation using pointer links. Just think of doing an AVL roll with an array. :shudder:
Mind you that the second argument may very well boil down to performance ("hey, it's an awkward approach but I still managed to use a vector!"), but it's more than that - it would complicate your code, clutter your data structure design, and could make it far more prone to bugs.
Now, here comes the "but" - even though there's much sense in using all the possible tools the language provides you, it's very widely accepted to use vector-based structures for performance critical tasks. See almost all scientific CPU benchmarks, most of them ultimately rely on vectors (uncited, but I can elaborate further if anyone is interested. Suffice to say that even the well-known *graph*500 does that).
The reason is not that it's best programming practice, but that it's more suited with CPU internal structure and gets more "juice" out of the HW. That's due to spatial locality - CPUs are very fond of that as it allows the memory unit to parallelize accesses (in an array you always know where's the next element, in a list you have to wait until the current one is fetched), and also issue stream/stride prefetches that reduce latency of future requests.
I can't say this is always a good practice, when you run through a graph the accesses are still pretty irregular even if you use an array implementation, but it's still a very common practice.
To summarize, taking the question literally - most of them can, of sorts (for a given definition of "most", ok?), but if the intention was "why teach pointers", I believe you can see that in order to understand your limits and what you can and should use - you need to know a great deal more than just arrays and even pointers. A good programmer should know a bit about everything - OS design, CPU design, etc. You can't do anything decent unless you really understand the fabric you're running on, and that unfortunately (or not) includes lots of pointers
You can implement a kind of allocator using an std::vector as the backing store. If you do that, all the standard data structures from elementary computer science can be implemented on top of vectors. It will hardly free you from using pointers, though: vectors are really just chunks of memory with a few useful additional operations, most notably the ability to expand.
More to the point: if you don't understand pointers, you won't understand how to do use vector for advanced data structures either. vector is a useful abstraction, but it follows the C++ rule that "you don't get what you don't pay for", so it's also a very "thin" abstraction, and you do pay for the cost of abstraction in terms of the amount of code you have to write.
(Jonathan Wakely points out, in the comments, that you won't get the exact guarantees that the C++ standard library requires of allocators data structures when you implement them on top of vector. Put in principle, vectors are just a way of handling blocks of memory.)
If you are learning C++ you need to be familiar with pointers and how to use them even if there are more higher level concepts that does that job for you.
Yes, it is possible to implement most data structures with vectors or lists and if you just started learning programming it's probably a good idea that you'll know how to write these data structures yourself.
With that being said, production code should always use the standard library unless there is a good reason not to do so.

Fastest way to speed up map<string,int> .find() in c++ . Where the keys are in alphabetical order

I have a map with about 100,000 pairs . Is there any way that i can speed up searching when using find(), given that the keys are in alphabetical order. Also how should i go about doing it. I know that you can specify a new comparator when you create the map. But will that speed up the find() function at all?
Thanks in advance.
[solved] Thanks a bunch guys i have decided to go with a vector and use lower and upperbound to "snip" some of the searching.
Also i am new here is there any way to mark this question as answered , or pick a best answer?
A different comparator will only speed up find if it manages to do the comparison faster (which, for strings will usually be pretty difficult).
If you're basically inserting all the data in order, then doing the searching, it may be faster to use a std::vector with std::lower_bound or std::upper_bound.
If you don't really care about ordering, and just want to find the data as quickly as possible, you might find that std::unordered_map works better for you.
Edit: Just for the record: the way you "might find" or "may find" those things is normally by profiling. Depending on the situation, it might be enough faster that it's pretty obvious even in simple testing, so profiling isn't really necessary, but if there's (much) doubt, or you want to quantify the effect, a profiler is probably the right way to do it.
std::map is already taking advantage of the fact the keys are in alphabetical order - it guarantees that itself. You aren't going to be able to improve it by changing the comparator (one assumes it's already a reasonably efficient string comparison).
Have you considered using unordered_map (aka hash_map in various implementations pre C++11? It should be able to search in O(1) instead of O(log(n)) for std::map.
You could also look into something slightly more exotic, like a trie, but that's not part of the standard library so you'd either have to find one elsewhere or roll your own, so I'd suggest unordered_map is a good place to start.
If you're using std::find to find elements, you should switch to using map::find (you don't really say in your question.) map::find uses the fact that the map is ordered to search much faster.
If that's still not good enough, you might look into a hash container such as unordered_map rather than map.
I've put in a vote for unordered_map but I wanted to also make another point.
One of the things that can hurt performance on modern machines is poor use of the cache. A map is going to have nodes allocated all over the place and there won't be much locality of reference. Also since it has to store a bunch of pointers between nodes it will use up more memory.
At the recent Going Native 2012 conference Bjarne Stroustroup gave an interesting talk that touched on this topic. He compared vector and list performance at a task involving a lot of random insertions and deletions, where it might seem list ought to have dominated, but because of the memory size and layout issue vector was in fact the fastest by far. Take a look at his slides, starting at slide 43.
unordered_map gives you direct access to the element and so it probably means even less hopping around in memory than trying to stick your data in a vector (and thus better performance than vector) so my comment is simply an admonishment to always keep your memory access pattern in mind for performance

What is the best single-source shortest path algorithm for programming contests?

I was working on this graph problem from the UVa problem set. It's a single-source-shortest-paths problem with no negative edge weights. From what I've gathered, the algorithm with the best big-O running time for such problems is Dijkstra with a Fibonacci heap as the priority queue, although practically speaking a binary heap is easier to implement and works pretty well too.
However, it would seem that even a binary heap takes quite some time to roll, and in a competition time is limited. I am aware that the STL provides some heap algorithms and priority queues, but they don't seem to provide a decrease-key function which Dijkstra's needs. Or am I wrong here?
It seems that another possibility is to simply not use Dijkstra's. This forum thread has people claiming that they solved the above problem with breadth-first search / Bellman-Ford, which are much easier to code up. (Edit: OTOH, Dijkstra's with an unsorted array for the priority queue timed out.) That BFS/Bellman-Ford worked surprised me a little as I thought that the input size was quite large. I guess different problems will require solutions of different complexity, but my question is, how often would I need to use Dijkstra's in such competitions? Should I practice more on the simpler-but-slower algorithms instead?
If you can come up with a good best-first heuristics, I would try using A*
Based on my own experience, I never needed to implement Dijkstra algorithm with a heap in a programming contest. You can get away most of the time, using a slower but efficient enough algorithm. You might use a best Dijkstra implementation to solve a problem which expects a different/simpler algorithm, but this is rare the case.
You can implement Dijkstra using heaps/priority queues without decrease-key in (I think) O((E+V)log V). If you want to decrease key simply add the new entry to your priority queue (leaving the old entry still in the queue) and update your array with distances. When you take the minimum element out of your queue first check that it is equal to your distance array, if it isn't then it was a key you wanted to decrease so just ignore it.
The Boost Graph Library appears to have implementations for both Dijkstra and Bellman-Ford.
Dijkstra and a simple priority queue should do nicely even for large datasets. If you're practicing, you could try it also with a binary heap and compare performance. Certainly, I think doing a fibonacci heap is a little fringe and would choose to practice on other data structures and algorithms first.
Interestingly, using a priority queue is equivalent to breadth-first search with the heuristic of exploring the current best solution first.

Is there already some std::vector based set/map implementation?

For small sets or maps, it's usually much faster to just use a sorted vector, instead of the tree-based set/map - especially for something like 5-10 elements. LLVM has some classes in that spirit, but no real adapter that would provide a std::map like interface backed up with a std::vector.
Any (free) implementation of this out there?
Edit: Thanks for all the alternative ideas, but I'm really interested in a vector based set/map. I do have specific cases where I tend to create huge amounts of sets/maps which contain usually less than 10 elements, and I do really want to have less memory pressure. Think about for example neighbor edges for a vertex in a triangle mesh, you easily wind up with 100k sets of 3-4 elements each.
I just stumbled upon your question, hope its not too late.
I recommend a great (open source) library named Loki.
It has a vector based implementation of an associative container that is a drop-in replacement for std::map, called AssocVector.
It offers better performance for accessing elements (and worst performance for insertions/deletions).
The library was written by Andrei Alexandrescu author of Modern C++ Design.
It also contains some other really nifty stuff.
If you can't find anything suitable, I would just wrap a std::vector to do sort() on insert, and implement find() using lower_bound(). It should be straight forward, and just as efficient as a custom solution.
Old post, I know, but for more recent visitors, Boost's flat_set and flat_map look like what you need. See https://theboostcpplibraries.com/boost.container for more information.
I don't know any such implementation, but there are some functions that help working with sorted vectors already in STL, such as lower_bound and upper_bound.
If the set or map truly is small, the performance gained by micro-optimizing the data structure will have little to no noticeable effects. You'll save maybe one or two memory (read: cache) lookups when searching a tiny tree vs tiny vector, which in the big picture is insignificant.
Having said that, you could give hash_map a try. Lookups by key are guaranteed to run in constant time.
Maybe you're looking for unordered map's and unordered set's. Try taking a look at the TR1 unordered containers that rely on hashing, or the Boost.Unordered container library. Underneath the interface, I'm not sure if they really do use std::vector, but I'd wager it's worth taking a look at.