A variation of priority queue - c++

I need some kind of priority queue to store pairs <key, value>. Values are unique, but keys aren't. I will be performing the following operations (most common first):
random insertion;
retrieving (and removing) all elements with the least key;
random removal (by value).
I can't use std::priority_queue because it only supports removing the head.
For now, I'm using an unsorted std::list. Insertion is performed by just pushing new elements to the back (O(1)). Operation 2 sorts the list with list::sort (O(N*logN)), before performing the actual retrieval. Removal, however, is O(n), which is a bit expensive.
Any idea of a better data structure?

When you need order, use an ordered container. There is no point in paying the cost of sorting later on.
Your current solution is:
Insertion O(1)
Retrieval O(N log N)
Removal O(N) (which is as good as you can get without keeping another index there)
Simply using a std::multimap you could have:
Insertion O(log N)
Retrieval O(log N) <-- much better isn't it ? We need to find the end of the range
Removal O(N)
Now, you could do slightly better with a std::map< key, std::vector<value> >:
Insertion O(log M) where M is the number of distinct keys
Retrieval O(1) (begin() is guaranteed to be constant time)
Removal O(N)
You can't really push the random removal... unless you're willing to keep another index there. For example:
typedef std::vector<value_type> data_value_t;
typedef std::map<key_type, data_value_t> data_t;
typedef std::pair<data_t::iterator,size_t> index_value_t;
// where iterator gives you the right vector and size_t is an index in it
typedef std::unordered_map<value_type, index_value_t> index_t;
But keeping this second index up to date is error prone... and will be done at the expense of the other operations! For example, with this structure, you would have:
Insertion O(log M) --> complexity of insertion in hash map is O(1)
Retrieval O(N/M) --> need to de-index all the values in the vector, there are N/M on average
Removal O(N/M) --> finding in hash map O(1), dereferencing O(1), removing from the vector O(N/M) because we need to shift approximately half the content of the vector. Using a list would yield O(1)... but might not be faster (depends on the number of elements because of the memory tradeoff).
Also bear in mind that hash map complexities are amortized. Trigger a rehash because you exceeded the maximum load factor, and that particular insertion will take a very long time.
I'd really go with the std::map<key_type, std::vector<value_type> > in your stead. That's the best bang for the buck.

Can you reverse the order of the collection, i.e. store them in <value, key> order?
Then you could just use std::map, with O(log n) time for insertion, O(n) for retrieving the elements with the least key (traversing the whole collection) and O(log n) for random removal by value (which would be the key of said map).
If you could find a map implementation based on hashes instead of trees (std::map is tree-based), the times would be even better: O(1), O(n), O(1).

If you're using Visual Studio, it has hash_multimap. I should also add that Boost has an unordered multimap. If you need an ordered container, use the STL multimap or the STL multiset.

std::multimap seems to be what you are searching for.
It will store your objects ordered by key, allow you to retrieve the lowest/highest key value (begin(), rbegin()) and all the objects with a given key (equal_range, lower_bound, upper_bound).
(EDIT: if you have just a few items, say less than 30, you should also test the performance of just using a deque or a vector)
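The retrieval of all elements with the least key can be sketched with equal_range on the first key. This is only an illustration: the function name extract_min_group and the int/std::string types are assumptions, not part of the question.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Removes and returns every value stored under the smallest key.
// An empty container yields an empty vector.
std::vector<std::string> extract_min_group(std::multimap<int, std::string>& q) {
    std::vector<std::string> out;
    if (q.empty()) return out;
    // equal_range gives the [first, second) span of entries sharing the least key.
    auto range = q.equal_range(q.begin()->first);
    assert(range.first == q.begin());  // the least key always starts the container
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    q.erase(range.first, range.second);
    return out;
}
```

Random removal by value would still require a linear scan with this container alone.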

If I understood well, your performance target is to have fast (1) and (3), and (2) is not that important. In this case, and given that values are unique, why not just have a std::set<value>, and do a sequential search for (2)? You'd have O(log n) for (1) and (3), and O(n) for (2). Better yet, if your STL has std::hash_set, you'd have close to O(1) for (1) and (3).
If you need something better than O(n) for (2), one alternative would be to have a set of priority queues.

Ok, so I've tested many options and ended up with something based on the idea of Matthieu M.. I'm currently using a std::map<key_type, std::list<value_type> >, where the value_type contains a std::list<value_type>::iterator to itself, which is useful for removal.
Removal must check if the list is empty, which implies a map query and possibly a call to erase. Worst-case complexity is when keys are distinct: O(log N) for insertion, O(1) for retrieval and O(log N) for removal. I've got very good experimental results compared to other alternatives on my test machine.
Using a std::vector is less efficient both in terms of theoretical complexity (O(N) worst-case for removal when keys are identical) and experimentation I've been doing.
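A minimal sketch of the map-of-lists idea, for illustration only. It keeps a separate hash index from value to position (the variant suggested in the earlier answer) rather than storing the iterator inside the value itself; the class name BucketQueue and the int key/value types are assumptions.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <map>
#include <unordered_map>
#include <utility>

// Hypothetical wrapper; int keys and int values are chosen only for illustration.
class BucketQueue {
    typedef std::list<int> Bucket;
    typedef std::map<int, Bucket> Buckets;
    struct Slot { Buckets::iterator bucket; Bucket::iterator pos; };

    Buckets buckets_;                      // key -> values sharing that key
    std::unordered_map<int, Slot> index_;  // value -> where it lives

public:
    // O(log M) for the map plus average O(1) for the hash index.
    void insert(int key, int value) {
        assert(index_.count(value) == 0);  // values must be unique
        Buckets::iterator b = buckets_.insert(std::make_pair(key, Bucket())).first;
        b->second.push_back(value);
        Slot s = { b, std::prev(b->second.end()) };
        index_.insert(std::make_pair(value, s));
    }

    // Removes and returns all values with the least key; empty if none.
    Bucket pop_min() {
        if (buckets_.empty()) return Bucket();
        Buckets::iterator b = buckets_.begin();
        Bucket out = std::move(b->second);
        for (int v : out) index_.erase(v);
        buckets_.erase(b);
        return out;
    }

    // Removes one value; returns false if it is not present.
    bool remove(int value) {
        auto it = index_.find(value);
        if (it == index_.end()) return false;
        Slot s = it->second;
        s.bucket->second.erase(s.pos);  // O(1) list erase
        if (s.bucket->second.empty()) buckets_.erase(s.bucket);
        index_.erase(it);
        return true;
    }

    bool empty() const { return buckets_.empty(); }
};
```

Both map and list iterators stay valid when other elements are inserted or erased, which is what makes storing them in the index safe.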

Related

STL container to select and remove random item?

The algorithm I'm implementing has the structure:
while C is not empty
select a random entry e from C
if some condition on e
append some new entries to C (I don't care where)
else
remove e from C
It's important that each iteration of the loop e is chosen at random (with uniform probability).
Ideally the select, append and remove steps are all O(1).
If I understand correctly, using std::list the append and remove steps will be O(1) but the random selection will be O(n) (e.g., using std::advance as in this solution).
And std::deque and std::vector seem to have complementary O(1) and O(n) operations.
I'm guessing that std::set will introduce some O(log n) complexity.
Is there any stl container that supports all three operations that I need in constant time (or amortized constant time)?
If you don't care about order and uniqueness of elements in your container, you can use the following:
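std::vector<int> C;
while (!C.empty()) {
    // pos selects the random entry e = C[pos]
    size_t pos = some_function_returning_a_number_between_zero_and_C_size_minus_one();
    if (condition())
        C.push_back(new_entry);
    else {
        // overwrite the selected entry with the last one, then drop the last slot
        C[pos] = std::move(C.back());
        C.pop_back();
    }
}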
No such container exists if element order should be consistent. You can get O(1) selection and (amortized) append with vector or deque, but removal is O(n). You can get O(1) (average case) insertion and removal with unordered_map, but selection is O(n). list gets you O(1) for append and removal, but O(n) selection. There is no container that will get you O(1) for all three operations. Figure out the least commonly used one, choose a container which works for the others, and accept the one operation will be slower.
If the order of the container doesn't matter, per user3365922's comment, the removal step could be done in O(1) on a vector/deque by swapping the element to be removed with the final element, then performing a pop_back.
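The swap-and-pop removal together with uniform selection can be sketched as follows. The helper names random_index and swap_remove are made up for this example, and the random engine is supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Picks a uniformly random index into a non-empty vector.
template <class T, class Rng>
std::size_t random_index(const std::vector<T>& c, Rng& rng) {
    assert(!c.empty());
    std::uniform_int_distribution<std::size_t> dist(0, c.size() - 1);
    return dist(rng);
}

// Removes c[pos] in O(1) by overwriting it with the last element.
// The relative order of the remaining elements is not preserved.
template <class T>
void swap_remove(std::vector<T>& c, std::size_t pos) {
    assert(pos < c.size());
    if (pos + 1 != c.size()) c[pos] = std::move(c.back());  // skip self-move
    c.pop_back();
}
```

With these two helpers, select, append (push_back) and remove are all constant time, amortized in the case of append.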
I'm guessing that std::set will introduce some O(log n) complexity.
Not quite. Random selection in a set has linear complexity.
Is there any stl container that supports all three operations that I need in constant time (or amortized constant time)?
Strictly speaking no.
However, if you don't care about the order of the elements, then you can remove from a vector or deque in constant time. With this relaxation of requirements, all operations would have constant complexity.
In case you did need to keep the order between operations, constant complexity would still be possible as long as the order of the elements doesn't need to affect the random distribution (i.e. you want even distribution). The solution is to use a hybrid approach:
Store the values in a linked list.
Store an iterator to each element in a vector.
Use the vector for random selection.
Erase the element from the list using the iterator, which keeps the order of the elements.
Erase the iterator from the vector without maintaining the order of the iterators.
When adding elements to the list, remember to add the iterator to the vector.
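A rough sketch of this hybrid approach, assuming int elements. The class name OrderedRandomBag and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <random>
#include <vector>

// Hypothetical container: the list keeps insertion order, the vector of
// list iterators gives O(1) uniform random selection.
class OrderedRandomBag {
    std::list<int> items_;
    std::vector<std::list<int>::iterator> handles_;

public:
    void append(int v) {
        items_.push_back(v);
        handles_.push_back(std::prev(items_.end()));
    }

    // Uniformly random slot in [0, number of elements).
    template <class Rng>
    std::size_t pick(Rng& rng) const {
        assert(!handles_.empty());
        std::uniform_int_distribution<std::size_t> dist(0, handles_.size() - 1);
        return dist(rng);
    }

    int at(std::size_t slot) const { return *handles_[slot]; }

    // Erases the element behind slot. The list keeps its order,
    // the handle vector does not.
    void remove(std::size_t slot) {
        assert(slot < handles_.size());
        items_.erase(handles_[slot]);
        handles_[slot] = handles_.back();
        handles_.pop_back();
    }

    const std::list<int>& items() const { return items_; }
    bool empty() const { return items_.empty(); }
};
```

This works because erasing one list node does not invalidate iterators to the other nodes.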

Best container for ordered elements

I am developing a time critical application and am looking for the best container to handle a collection of elements of the following type:
class Element
{
int weight;
Data data;
};
Considering that the time critical steps of my application, periodically performed in a unique thread, are the following:
the Element with the lowest weight is extracted from the container, and data is processed;
a number n>=0 of new Element, with random(*) weight, are inserted into the container.
Some Element of the container may have the same weight. The total number of elements in the container at any time is quite high and almost stationary in average (several hundreds of thousands). The time needed for the extract/process/insert sequence described above must be as low as possible. (Note(*): new weight is actually computed from data but is considered as random here to simplify.)
After some searches and tries of different STL containers, I ended up using std::multiset container, which performed about 5 times faster than ordered std::vector and 16 times faster than ordered std::list. But still, I am wondering if I could achieve even better performance, considering that the bottleneck of my application remains in the extract/insert operations.
Notice that, though I only tried ordered containers, I did not mention "ordered container" in my requirements. I do not need the Element to be ordered in the container, I only need to perform the "extract lowest weighted element"/"insert new elements" operations as fast as possible. I am not limited to STL containers and can go for boost, or any other implementation, if suited.
Thanks for help.
I do not need the Element to be ordered in the container, I only need to perform the "extract lowest weighted element"/"insert new elements" operations as fast as possible.
Then you should try priority_queue<T>, or use make_heap/push_heap/pop_heap operations on a vector<T>.
Since you are looking for min heap, not max heap, you would need to supply a custom comparator that orders your Element objects in reverse order.
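A sketch of the reversed comparator with std::priority_queue. The comparator name HigherWeight, the alias MinQueue and the int stand-in for Data are assumptions made for this example.

```cpp
#include <cassert>
#include <queue>
#include <vector>

struct Element {
    int weight;
    int data;  // stand-in for the question's Data type
};

// Reversed comparison: the element with the lowest weight ends up on top.
struct HigherWeight {
    bool operator()(const Element& a, const Element& b) const {
        return a.weight > b.weight;
    }
};

typedef std::priority_queue<Element, std::vector<Element>, HigherWeight> MinQueue;

// Removes and returns the lowest-weight element. Precondition: !q.empty().
Element extract_lowest(MinQueue& q) {
    assert(!q.empty());
    Element e = q.top();  // O(1) peek
    q.pop();              // O(log N) removal
    return e;
}
```

Elements with equal weight come out in an unspecified relative order.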
I think that within the STL, a lazily sorted std::vector will give the best results.
Suggested pseudocode may look like this:
emplace_back new elements at the end of the vector
only when you want the smallest element, sort the array and get the first element
In this way, you get the amortized insertion time of vector, a relatively small number of memory allocations and good cache locality.
It is instructive to consider different candidates and how your assumptions would impact the final selection. When your requirements change, it then becomes easier to switch containers.
Generally, the containers of size N have roughly 3 complexity categories for their basic access/modification operations: (amortized) O(1), O(log N) and O(N).
Your first requirement (finding the lowest weight element) gives you roughly three candidates with O(1) complexity, and one candidate with O(N) complexity per element:
O(1) for std::priority_queue<Element, std::vector<Element>, LowestWeightCompare>
O(1) for std::multiset<Element, LowestWeightCompare>
O(1) for boost::flat_multiset<Element, LowestWeightCompare>
O(N) for std::unordered_multiset<Element>
Your second requirement (randomized insertion of new elements) gives you the following complexity per element for each of the above four choices
O(log N) for std::priority_queue
O(log N) for std::multiset
O(N) for boost::flat_multiset
amortized O(1) for std::unordered_multiset
Among the first three choices, boost::flat_multiset should be dominated by the other two for large N. Among the remaining two, the better caching behavior of std::priority_queue over std::multiset might prevail. However: measure, measure, measure.
It is a priori ambiguous whether std::unordered_multiset is competitive with the other three. Depending on the number n of randomly inserted elements, total cost per batch of find(1)-insert(n) would be O(N) search + O(n) insertion for std::unordered_multiset and O(1) search + O(n log N) insertion for std::multiset. Again, measure, measure, measure.
How robust are these considerations with respect to your requirements? The story would change as follows if you would have to find the k lowest weight elements in each batch. Then you'd have to compare the costs of find(k)-insert(n). The search costs would scale roughly as
O(k log N) for std::priority_queue
O(k) for std::multiset
O(k) for boost::flat_multiset
O(k N) for std::unordered_multiset
Note that a priority_queue can only efficiently access the top element, not its k top elements without actually calling pop() on them, which has O(log N) complexity per call. If you expect that your code would likely change from a find(1)-insert(n) batch-mode to a find(k)-insert(n), then it might be a good idea to choose std::multiset, or at least document what kind of interface changes it would require.
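To illustrate the interface difference, here is a sketch with plain int weights. The function names peek_k_lowest and pop_k_lowest and the alias MinHeap are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <vector>

// Reads (without removing) up to k smallest weights by iterating from begin().
std::vector<int> peek_k_lowest(const std::multiset<int>& s, std::size_t k) {
    std::vector<int> out;
    for (auto it = s.begin(); it != s.end() && out.size() < k; ++it)
        out.push_back(*it);
    assert(out.size() <= k);
    return out;
}

typedef std::priority_queue<int, std::vector<int>, std::greater<int> > MinHeap;

// A priority_queue only exposes top(), so reaching the k smallest weights
// means popping them, at O(log N) per pop.
std::vector<int> pop_k_lowest(MinHeap& q, std::size_t k) {
    std::vector<int> out;
    while (!q.empty() && out.size() < k) {
        out.push_back(q.top());
        q.pop();
    }
    return out;
}
```

The multiset version leaves the container untouched, while the heap version has to modify it.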
Bonus: the best of both worlds?! You might also want to experiment a bit with Boost.MultiIndex and use something like (check the documentation to get the syntax correct)
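boost::multi_index_container<
    Element,
    indexed_by<
        // ordered index on weight; member<> takes the class, the member type and the member pointer
        ordered_non_unique<member<Element, int, &Element::weight>, std::less<>>,
        // the hashed index also needs a key extractor, which this sketch leaves unspecified
        hashed_non_unique<>
    >
>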
The above code will create a node-based container that implements two pointer structures, one to keep track of the ordering by Element weight and one to allow quick hashed access. This will allow O(1) lookup of the lowest weight Element. Bear in mind, though, that every insertion into a multi-index container updates all of its indices, so the ordered index still costs O(log N) per inserted element.
Whether it beats the four previously mentioned containers for large N is therefore not obvious, and for moderate N, cache effects induced by pointer chasing into random memory might spoil any theoretical advantage over std::priority_queue. Did I mention the mantra of measure, measure, measure?
Try either of these:
std::map<int,std::vector<Data>>
or
std::unordered_map<int,std::vector<Data>>
The int above is the weight.
Their speeds for find, remove and add differ depending on many factors, such as whether the element is present. (If it is, unordered_map's find is faster; if not, map's find is faster.)

C++ - List with logarithmic read, insertion at given position

I'm looking for data structure that behaves like a list, where we can insert an element at ANY given position and then read an element at ANY given position, where insertion and reading should be in logarithmic time. Is there something like this in the standard library or maybe I'm stuck with having to write this on my own (I know it can be implemented as a tree)?
std::multiset behaves pretty much like the logarithmic std::list that you are looking for
iteration is bidirectional
insertion / reading are O(log N)
Note however (as pointed out by @SergeRogatch) that the "price" you pay for O(log N) lookup (instead of O(N) for list) is that multiset will order elements as they are inserted. This behaves differently from std::list, where you choose the position yourself. It also means that your elements need to be comparable using std::less<> or you need to provide your own comparator.
An alternative would be to use std::unordered_multiset (i.e. a hash table), which has amortized O(1) element access, but then there is no deterministic order either. And again, your elements then need to be usable with std::hash<> or you need to write your own hash function.

What is the most efficient std container for non-duplicated items?

What is the most efficient way of adding non-repeated elements into an STL container, and what kind of container is the fastest? I have a large amount of data and I'm afraid that checking each time whether an element is new takes a lot of time. I am hoping map will be very fast.
// 1- Map
map<int, int> Map;
...
if(Map.find(Element)==Map.end()) Map[Element]=ID; // add only if not already present
// 2-Vector
vector<int> Vec;
...
if(find(Vec.begin(), Vec.end(), Element)==Vec.end()) Vec.push_back(Element); // add only if not already present
// 3-Set
// Edit: I made a mistake: set::find is O(LogN) not O(N)
Both set and map have O(log(N)) performance for looking up keys. vector has O(N).
The difference between set and map, as far as you should be concerned, is whether you need to associate a key with a value, or just store a value directly. If you need the former, use a map, if you need the latter, use a set.
In both cases, you should just use insert() instead of doing a find().
The reason is insert() will insert the value into the container if and only if the container does not already contain that value (in the case of map, if the container does not contain that key). This might look like
Map.insert(std::make_pair(Element, ID));
for a map or
Set.insert(Element);
for a set.
You can consult the return value to determine whether or not an insertion was actually performed.
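As a small sketch of consulting that return value: insert() on std::set and std::map returns a pair whose second member is true only when an insertion happened. The helper name add_unique is an assumption for this example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>

// Returns true if the element was new and has been added.
bool add_unique(std::set<int>& s, int element) {
    return s.insert(element).second;
}

// Returns true if the key was new; an existing mapping is left untouched.
bool add_unique(std::map<int, int>& m, int element, int id) {
    bool inserted = m.insert(std::make_pair(element, id)).second;
    assert(m.count(element) == 1);  // the key is present either way
    return inserted;
}
```

Note that, unlike Map[Element]=ID, insert() never overwrites the value of a key that is already present.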
If you're using C++11, you have two more choices, which are std::unordered_map and std::unordered_set. These both have amortized O(1) performance for insertions and lookups. However, they also require that the key (or value, in the case of set) be hashable, which means you'll need to specialize std::hash<> (or supply your own hash functor) if your key is a user-defined type. Conversely, std::map and std::set require that your key (or value, in the case of set) respond to operator<().
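For a user-defined key, the hashing requirement might be met as in the sketch below. The Point type, the add_point helper and the particular way of combining the two member hashes are illustrative assumptions, not a recommended hash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point {  // hypothetical key type
    int x;
    int y;
    bool operator==(const Point& o) const { return x == o.x && y == o.y; }
};

namespace std {
// Specializing std::hash for a user-defined type is permitted.
template <>
struct hash<Point> {
    size_t operator()(const Point& p) const {
        // Simple combination of the member hashes; good enough for a sketch.
        size_t h = hash<int>()(p.x);
        return h ^ (hash<int>()(p.y) + 0x9e3779b9u + (h << 6) + (h >> 2));
    }
};
}  // namespace std

// Returns true if the point was new and has been added.
bool add_point(std::unordered_set<Point>& s, const Point& p) {
    bool inserted = s.insert(p).second;
    assert(s.count(p) == 1);
    return inserted;
}
```

The unordered containers also need operator== (or a custom equality predicate) on the key, which is why Point defines it.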
If you're using C++11, you can use std::unordered_set. That would allow you O(1) existence-checking (technically amortized O(1) -- O(n) in the worst case).
std::set would probably be your second choice with O(lg n).
Basically, std::unordered_set is a hash table and std::set is a tree structure (a red-black tree in every implementation I've ever seen) [1].
Depending on how well your hashes distribute and how many items you have, a std::set might actually be faster. If it's truly performance critical, then as always, you'll want to do benchmarking.
[1] Technically speaking, I don't believe either is required to be implemented as a hash table or as a balanced BST. If I remember correctly, the standard just mandates the run time bounds, not the implementation; it just turns out that those are the only viable implementations that fit the bounds.
You should use a std::set; it is a container designed to hold a single (equivalent) copy of an object and is implemented as a binary search tree. Therefore, it is O(log N), not O(N), in the size of the container.
std::set and std::map often share a large part of their underlying implementation; you should check out your local STL implementation.
Having said all this, complexity is only one measure of performance. You may have better performance using a sorted vector, as it keeps the data local to one another and is, therefore, more likely to hit the caches. Cache locality is a large part of data structure design these days.
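A sorted vector with duplicate-free insertion could be sketched like this. The function name insert_sorted_unique is made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Inserts value into a sorted vector unless it is already present.
// The lookup is O(log N); the insert itself shifts elements, so it is O(N).
bool insert_sorted_unique(std::vector<int>& v, int value) {
    assert(std::is_sorted(v.begin(), v.end()));  // precondition: v is sorted
    std::vector<int>::iterator pos = std::lower_bound(v.begin(), v.end(), value);
    if (pos != v.end() && *pos == value) return false;  // duplicate
    v.insert(pos, value);
    return true;
}
```

Whether the contiguous layout outweighs the O(N) shifting depends on the data size and access pattern, so this is something to benchmark.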
Sounds like you want to use a std::set. Its elements are unique, so you don't need to care about uniqueness when adding elements, and a.find(k) (where a is an std::set and k is a value) is defined as being logarithmic in complexity.
If your elements can be hashed in O(1), then it is better to use an unordered_map or unordered_set (not a map/set, because they use a red-black tree in their implementation, which has O(log N) find complexity).
Your examples show a definite pattern:
check if the value is already in container
if not, add the value to the container.
Both of these operation would potentially take some time. First, looking up an element can be done in O(N) time (linear search) if the elements are not arranged in any particular manner (e.g., just a plain std::vector), it could be done in O(logN) time (binary search) if the elements are sorted (e.g., either std::map or std::set), and it could be done in O(1) time if the elements are hashed (e.g., either std::unordered_map or std::unordered_set).
The insertion will be O(1) (amortized) for a plain vector or an unordered container (hash container), although the hash container will be a bit slower. For a sorted container like set or map, you'll have log-time insertions because it needs to look for the place to insert it before inserting it.
So, the conclusion, use std::unordered_set or std::unordered_map (if you need the key-value feature). And you don't need to check before doing the insertion, these are unique-key containers, they don't allow duplicates.
If std::unordered_set / std::unordered_map (from C++11) or std::tr1::unordered_set / std::tr1::unordered_map (since 2007) are not available to you (or any equivalent), then the next best alternative is std::set / std::map.

Difference between std::set and std::priority_queue

Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
While I know that the underlying structures are different, I am not as much interested in the difference in their implementation as I am in the comparison of their performance and suitability for various uses.
Note: I know about the no-duplicates in a set. That's why I also mentioned std::multiset since it has the exactly same behavior as the std::set but can be used where the data stored is allowed to compare as equal elements. So please, don't comment on single/multiple keys issue.
A priority queue only gives you access to one element in sorted order -- i.e., you can get the highest priority item, and when you remove that, you can get the next highest priority, and so on. A priority queue also allows duplicate elements, so it's more like a multiset than a set. [Edit: As @Tadeusz Kopec pointed out, building a heap is also linear on the number of items in the heap, where building a set is O(N log N) unless it's being built from a sequence that's already ordered (in which case it is also linear).]
A set allows you full access in sorted order, so you can, for example, find two elements somewhere in the middle of the set, then traverse in order from one to the other.
std::priority_queue allows you to do the following:
Insert an element O(log n)
Get the smallest element O(1)
Erase the smallest element O(log n)
while std::set has more possibilities:
Insert any element O(log n) and the constant is greater than in std::priority_queue
Find any element O(log n)
Find the first element >= the one you are looking for O(log n) (lower_bound)
Erase any element O(log n)
Erase any element by its iterator, amortized O(1)
Move to previous/next element in sorted order, amortized O(1)
Get the smallest element O(1)
Get the largest element O(1)
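A few of these std::set operations, sketched with int elements. The function names values_between and erase_first_at_least are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Collects every element in [lo, hi] by walking the set in sorted order.
std::vector<int> values_between(const std::set<int>& s, int lo, int hi) {
    assert(lo <= hi);
    std::vector<int> out;
    for (std::set<int>::const_iterator it = s.lower_bound(lo);
         it != s.end() && *it <= hi; ++it)
        out.push_back(*it);
    return out;
}

// Erases the first element >= x, if any, and reports whether one was erased.
bool erase_first_at_least(std::set<int>& s, int x) {
    std::set<int>::iterator it = s.lower_bound(x);  // O(log n) search
    if (it == s.end()) return false;
    s.erase(it);  // amortized O(1) given the iterator
    return true;
}
```

Neither operation is available on std::priority_queue, which only exposes its top element.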
set/multiset are generally backed by a binary tree. http://en.wikipedia.org/wiki/Binary_tree
priority_queue is generally backed by a heap. http://en.wikipedia.org/wiki/Heap_(data_structure)
So the question is really when should you use a binary tree instead of a heap?
Both structures are laid out in a tree; however, the rules about the relationship between ancestors are different.
We will call the positions P for parent, L for left child, and R for right child.
In a binary tree L < P < R.
In a heap P < L and P < R
So binary trees sort "sideways" and heaps sort "upwards".
So if we look at this as a triangle, then in the binary tree L, P, R are completely sorted, whereas in the heap the relationship between L and R is unknown (only their relationship to P is known).
This has the following effects:
If you have an unsorted array and want to turn it into a binary tree it takes O(nlogn) time. If you want to turn it into a heap it only takes O(n) time, (as it just compares to find the extreme element)
Heaps are more efficient if you only need the extreme element (lowest or highest by some comparison function). Heaps only do the comparisons (lazily) necessary to determine the extreme element.
Binary trees perform the comparisons necessary to order the entire collection, and keep the entire collection sorted all-the-time.
Heaps have constant-time lookup (peek) of the extreme element; a plain binary search tree needs logarithmic time to reach its lowest element, although std::set provides begin() in constant time.
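The heap side of this comparison can be tried directly with the standard heap algorithms on a vector. The function names below are made up for this sketch, and std::greater is used to get a min-heap.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Turns v into a min-heap in O(n) and returns its smallest element.
int heapify_and_peek_min(std::vector<int>& v) {
    assert(!v.empty());
    std::make_heap(v.begin(), v.end(), std::greater<int>());
    return v.front();  // the extreme element sits at the root
}

// Removes the smallest element of a min-heap in O(log n).
int pop_min(std::vector<int>& heap) {
    assert(!heap.empty());
    std::pop_heap(heap.begin(), heap.end(), std::greater<int>());
    int smallest = heap.back();
    heap.pop_back();
    return smallest;
}
```

After make_heap the vector satisfies the heap property but is generally not sorted, which is the "lazy" ordering described above.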
Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
Even though insert and erase operations for both containers have the same complexity O(log n), these operations are slower for std::set than for std::priority_queue. That's because std::set makes many memory allocations: every element of std::set is stored in its own allocation. std::priority_queue (with the underlying std::vector container by default) uses a single allocation to store all elements. On the other hand, std::priority_queue uses many swap operations on its elements, whereas std::set only swaps pointers. So if swapping is a very slow operation for the element type, using std::set may be more efficient. Moreover, the element type may not be swappable at all.
Memory overhead for std::set is also much bigger, because it has to store many pointers between its nodes.