I need a data structure that satisfies the following:
stores an arbitrary number of elements, where each element is described by 10 numeric metrics
allows fast (log n) search of elements by any of the metrics
allows fast (log n) insertion of new elements
allows fast (log n) removal of elements
And let's assume that the elements are expensive to construct.
I came up with the following plan:
store all elements in a vector called DATA.
use 10 std::sets, one for each of the 10 metrics. Each std::set is light-weight: it contains only integers, which are indexes into the vector DATA. The comparison operators 'look up' the appropriate element in DATA and then select the appropriate metric.
template< int C >
struct Cmp
{
    // const-qualified so the set can call it through a const comparator
    bool operator() (int const a, int const b) const
    {
        // order by metric C; break ties by index so distinct elements never compare equal
        return ( DATA[a].coords[C] != DATA[b].coords[C] )
            ? ( DATA[a].coords[C] < DATA[b].coords[C] )
            : ( a < b );
    }
};
Elements are never modified in or removed from the vector. A new element is pushed back to DATA and then its index (DATA.size()-1) is inserted into the sets (set<int, Cmp<..> >). To remove an element, I set a flag in the element saying that it is deleted (without actually removing it from the DATA vector) and then remove the element index from all ten std::sets.
This works fine as long as DATA is a global variable. (It also somewhat abuses the type system by making the templated struct Cmp dependent on a global variable.)
However, I was not able to enclose the DATA vector and the std::sets (set<int, Cmp<...> >) inside a class and then 'index' DATA with those std::sets. For starters, the comparison operator Cmp defined inside an outer class has no access to the outer class' fields (so it cannot access DATA). I also cannot pass the vector to the Cmp constructor because Cmp is being constructed by std::set and std::set expects a comparison operator with a constructor that has no arguments.
I have a feeling I'm working against the C++ type system and trying to achieve something that the type system is purposely preventing me from doing. (I'm trying to make std::set depend on a variable that is going to be constructed only at runtime.) And while I understand why the type system might not like what I do, I think this is a legitimate use case.
Is there a way to implement the data structure/class I described above without providing a re-implementation of std::set/red-black tree? I hope there may be a trick I have not thought of yet. (And yes, I know that boost has something, but I'd like to stick to the standard library.)
When I read something like "look up foo by a value bar", my initial reaction is to use a map<> or something similar. There are some implications to this though:
Keys in an std::map (or values in an std::set) are unique, so no two elements can share the same key and accordingly no two data objects would be able to have the same metric. If multiple data objects can have the same metric (this isn't clear from your question), using an std::multimap (or std::multiset) would work though.
If the keys are constant and stored within the elements themselves, using a set<data*,cmp> is a common approach. The comparator then just retrieves the corresponding field from the objects and compares them. Lookup then requires creating a temporary object and using find() with it. Some implementations offered an extension that allows searching with a different type; since C++14 a comparator that declares is_transparent allows this portably, which makes lookup much easier.
It is important that the fields used as keys remain constant though, because if you modify them, you implicitly change the order of the set<>. This is the reason that a set<>'s elements are effectively constant, i.e. even a plain iterator has a constant as value type. If you store pointers though, you can easily get around that, because a constant pointer is something different than a pointer to a constant. Don't shoot yourself in the foot with that!
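As a rough sketch of the set<data*,cmp> approach with lookup by a bare key (no temporary object), assuming C++14 or later: Data, ByMetricPtr, DataSet and find_by_metric are made-up names for illustration, and a multiset is used here so that several objects may share a metric value.

```cpp
#include <cassert>
#include <set>

// Hypothetical element type with a single metric used as the key.
struct Data {
    int metric;
};

// Orders Data pointers by their metric. Declaring is_transparent (C++14)
// lets find() and lower_bound() accept a bare int as well as a pointer.
struct ByMetricPtr {
    using is_transparent = void;
    bool operator()(const Data* a, const Data* b) const { return a->metric < b->metric; }
    bool operator()(const Data* a, int key) const { return a->metric < key; }
    bool operator()(int key, const Data* b) const { return key < b->metric; }
};

using DataSet = std::multiset<const Data*, ByMetricPtr>;

// Returns some stored element whose metric equals key, or nullptr if none.
inline const Data* find_by_metric(const DataSet& s, int key) {
    auto it = s.find(key);
    return it == s.end() ? nullptr : *it;
}
```

The metric of an object must not change while its pointer is stored in the set, for the reasons given above.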
If the metrics are not so much a property of the objects themselves (or you don't mind redundantly storing them), using an std::map would be a natural choice. Storing the same object under multiple keys, depending on the metric, can be done in separate containers (map<int,data*> c[10];). However, you can do that in a single map using e.g. a pair<metric,value> as key (map<pair<int,int>,data*> c;).
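A minimal sketch of the single-container variant, keyed by (metric index, metric value). It assumes integer metrics and uses std::multimap rather than std::map so that several objects can share a value; Data, Index and find_all are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct Data { int id; };   // hypothetical payload

// Key is (metric index, metric value); all metrics share one container.
using Index = std::multimap<std::pair<int, int>, Data*>;

// Collects every element whose metric number `metric` equals `value`.
inline std::vector<Data*> find_all(const Index& idx, int metric, int value) {
    std::vector<Data*> out;
    auto range = idx.equal_range(std::make_pair(metric, value));
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    return out;
}
```

Because pairs compare lexicographically, all entries for one metric sit next to each other, so range queries within a metric can use lower_bound and upper_bound on the same container.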
Using a vector<> to store the actual elements and only referencing them as either pointers or indices in a map surely works. I'd take the pointers though, as this is what allows the above approaches using a set or map to work. Without that, the comparator would have to store a reference to the container, where at the moment it just uses the global DATA container. Getting this to work with a vector is tricky though, since it reallocates its elements when growing, as you correctly pointed out. I'd consider a different container type, like std::list or std::deque. The former would allow erasing elements, too, but it has a higher per-element overhead. The latter has a relatively low per-element overhead, only slightly above std::vector. You could then even go so far as to store iterators instead of pointers, which helps debugging provided you use a "checked STL" for that. Still, you will have to do some manual bookkeeping which object is still referenced somewhere and which one isn't.
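For what it's worth, std::set has a constructor that takes a comparator instance, so the comparator can carry a pointer to the storage instead of reading a global. Below is a sketch under that assumption; Element, ByMetric, MetricIndex and min_by are illustrative names, not anything from the question. Indices stay valid when the vector reallocates, because the comparator points at the vector object rather than at its buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical element type: 10 numeric metrics, as in the question.
struct Element {
    std::array<double, 10> coords{};
    bool deleted = false;
};

// Comparator holding the storage and the metric index as runtime state.
struct ByMetric {
    const std::vector<Element>* data;
    std::size_t metric;
    bool operator()(std::size_t a, std::size_t b) const {
        const double x = (*data)[a].coords[metric];
        const double y = (*data)[b].coords[metric];
        return x != y ? x < y : a < b;   // break ties by index
    }
};

class MetricIndex {
public:
    MetricIndex() {
        // One set per metric, each built from its own comparator instance.
        for (std::size_t m = 0; m < 10; ++m)
            sets_.emplace_back(ByMetric{&data_, m});
    }
    // The comparators point at data_, so copies would refer to the wrong object.
    MetricIndex(const MetricIndex&) = delete;
    MetricIndex& operator=(const MetricIndex&) = delete;

    // Appends the element and indexes it under all metrics; returns its index.
    std::size_t insert(const Element& e) {
        data_.push_back(e);
        const std::size_t id = data_.size() - 1;
        for (auto& s : sets_) s.insert(id);
        return id;
    }
    // Removes the index from every set and flags the element as deleted.
    void erase(std::size_t id) {
        for (auto& s : sets_) s.erase(id);
        data_[id].deleted = true;
    }
    // Index of the element with the smallest value for the given metric.
    std::size_t min_by(std::size_t metric) const {
        assert(!sets_[metric].empty());
        return *sets_[metric].begin();
    }

private:
    std::vector<Element> data_;
    std::vector<std::set<std::size_t, ByMetric>> sets_;
};
```

Searching for a given metric value would still need either a temporary element or a transparent comparator; the sketch leaves that part out.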
Instead of using a separate container, you could also allocate the elements dynamically, although that itself has some overhead. If the overhead per element is not an issue, you could then use reference-counted smart pointers. If the application is a one-shot process, you could also use raw pointers and let the OS reclaim the memory on exit.
Note that I assume that storing multiple copies of the data objects is not an option. If that was the case, you could just as well have a map<int,data> m[10];, where each map stores its own copy of the data objects. All the bookkeeping issues would then be resolved, but at the price of a 10x overhead.
I have an std::list<MyObject*> objectList container that I need to sort and maintain in the following scenario:
Each object has a certain field that supplies a cost (a float value for example). That cost value is used to compare two objects as if they were floating point numbers
The collection must be ordered (ascending) and must quickly find the correct position for a newly inserted element.
It is possible to delete the lowest element (in terms of cost) and it is also possible to update the cost of several arbitrarily positioned elements. The list must be then reordered as fast as possible, taking advantage of its already sorted nature.
Could I use any other stl container/mechanism to allow for the three behavioral properties? It pretty much resembles a heap and I thought using make_heap could be a good way to sort the list. I need to have a container of pointers, since there are several other data structures that rely on these pointers.
How then can I choose a better container that's also pointer friendly and allows sorting by looking at the comparison operators of the pointed types?
CLARIFICATION: I need an stl container that best fits the scenario and can successfully wrap pointers or references for that matter. (For example, I read briefly that the std::set container could be a good candidate, but I have no experience with it).
A current implementation, based on the below answers:
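struct SHafleEdgeComparatorFunctor
{
    // const-qualified so the container can call it through a const comparator
    bool operator()(SHEEdge* lhs, SHEEdge* rhs) const
    {
        // uses SHEEdge::operator<(SHEEdge*), which compares collapse errors
        return (*lhs) < rhs;
    }
};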
std::multiset<SHEEdge*, SHafleEdgeComparatorFunctor> m_edges;
Of course, the SHEEdge data structure has an overloaded operator:
bool operator<(SHEEdge* rhs)
{
    return this->GetCollapseError() < rhs->GetCollapseError();
}
I would indeed use std::set. The tricky bit in your requirements is to update existing elements.
A std::set is always sorted. You will have to either wrap your pointers in a class with a useful compare operator or you have to pass a comparison predicate to the set.
Then you get the sorted property automatically and you get constant time removal of the lowest element.
You also get updating of the cost value in log complexity: Simply remove the object from the set and re-add it. This will be as fast as it can be for a sorted container.
Inserting and deleting are both fast in a set (logarithmic in the number of elements).
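The remove-and-re-add update can be sketched roughly as follows. Edge is a hypothetical stand-in for SHEEdge, and ByCost, EdgeQueue, update_cost and pop_lowest are made-up names; a std::multiset is assumed so that equal costs are allowed.

```cpp
#include <cassert>
#include <set>

struct Edge {            // hypothetical stand-in for SHEEdge
    float cost;
};

struct ByCost {
    bool operator()(const Edge* a, const Edge* b) const { return a->cost < b->cost; }
};

using EdgeQueue = std::multiset<Edge*, ByCost>;

// Changes e's cost while keeping the container ordered: the pointer is
// removed before the key changes and re-inserted afterwards.
inline void update_cost(EdgeQueue& q, Edge* e, float new_cost) {
    auto range = q.equal_range(e);              // all entries with the same cost
    for (auto it = range.first; it != range.second; ++it) {
        if (*it == e) { q.erase(it); break; }   // erase only this pointer
    }
    e->cost = new_cost;
    q.insert(e);
}

// Removes and returns the cheapest element; q must not be empty.
inline Edge* pop_lowest(EdgeQueue& q) {
    assert(!q.empty());
    Edge* e = *q.begin();
    q.erase(q.begin());
    return e;
}
```

Changing the cost while the pointer is still in the container would silently break the ordering, which is why the erase comes first.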
I'd start using a smart pointer like shared_ptr instead of a raw pointer (raw pointers are good e.g. if they are observing pointers, like pointers passed as function parameters, but when you have ownership semantics, like in this case, it's better to use a smart pointer).
Then, I'd start with std::vector as a container.
So, try making it vector<shared_ptr<MyObject>>.
You can measure performance of it compared to list<shared_ptr<MyObject>>.
(Note also that std::list has more overhead than std::vector, since it's a node-based container and each node has some overhead; std::vector instead allocates a contiguous chunk of memory to store its data, in this case the shared_ptrs, so std::vector is also more "cache-friendly", etc.)
In general, std::vector offers very good performance, and it's a good option as a "first choice" container. In any case, your mileage may vary, and the best thing is to measure performance (speed) to get a better understanding of your particular case.
If I understand correctly what you are asking, you are looking for the correct container to use.
Indeed, std::set seems to be the correct container for what you want to do, but it depends on all the use cases:
Do you need to have O(1) access to the elements?
What is the operation used that will have the most important cost?
std::set uses a key to sort the elements and doesn't allow duplicates (if you want duplicates, have a look at std::multiset). When you add an element, it will automatically be inserted in the correct position. Generally you don't want to use raw pointers as the key, as pointers can be null.
Another alternative could be to use a std::vector<std::shared_ptr<MyObject>>. As #MikePro said, it is good practice to hold the pointers in smart pointers, so that you don't have to delete them manually (and to avoid memory leaks, for example when an exception is thrown). If you use a vector, you will have to use functions such as std::sort and std::find from the <algorithm> header, or std::vector::insert.
Generally this image helps finding your container. It's not perfect (as you have to know a bit more than what is displayed) but it usually does its job well:
I'm looking for an alternative to std::set. I need it to support more operations than std::set:
Move elements from one set to another without 'create new->copy->remove old'.
Split set at some position to get two sets (similar behaviour can be obtained using std::list splice)
Set operations (like union) without unnecessary copying. std::set_union will copy elements from sets A and B to set C which is inefficient if I only need set C and don't need A and B anymore.
Are there any implementations which support these operations, or do I need to write one myself?
I have the same problem as you with std::set, and there doesn't seem to be any sane way around it in C++11 or C++14. However, C++17 adds two new members to std::set which look very promising.
std::set::extract allows extracting an entire node from the set. The removed node gives access to a non-const reference to the underlying value, effectively allowing elements to be moved out of the set. It can also be inserted into another set without copying or moving the underlying value. std::set::merge allows merging two sets without copying or moving any elements, only updating internal pointers.
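A short sketch of both members, assuming a C++17 compiler; transfer and unite are hypothetical helper names, and std::string is just an example value type.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Moves the element equal to key from src to dst by relinking its node.
// Returns true if the element was found and dst accepted it.
inline bool transfer(std::set<std::string>& src, std::set<std::string>& dst,
                     const std::string& key) {
    auto node = src.extract(key);        // empty node handle if key is absent
    if (node.empty()) return false;
    auto result = dst.insert(std::move(node));
    if (!result.inserted) {
        // dst already held an equal element; give the node back to src
        src.insert(std::move(result.node));
        return false;
    }
    return true;
}

// Union without copying: nodes of b are spliced into a. Elements that a
// already contains stay behind in b.
inline void unite(std::set<std::string>& a, std::set<std::string>& b) {
    a.merge(b);
}
```

Neither member splits a set at a position, so the second requirement from the question is not covered by this.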
The problem with trying to do what you suggest with a std::set is that I do not believe you can move a value out of one. This is due to set iterators only returning const references to stop you changing the value and breaking the internal structure. You could probably const_cast your way around this but I wouldn't recommend it. Even if you adopt this approach then you still have nodes in the tree being allocated and there is nothing you can do to avoid this overhead.
If you decide to implement your own set which supports moving values around, you should be able to get the Boost::Intrusive library (http://www.boost.org/doc/libs/1_52_0/doc/html/intrusive.html) to do the heavy lifting of keeping a set of sorted values. You would need to implement the code for managing the object lifetimes but that is easier than building a RB-tree implementation.
I implemented something similar for maps which stored the nodes in a std::list. This allowed for moving elements between maps without copying either the node structure or the values being stored. If I get time I will try and tidy it up and post it here.
For points 1 & 3, you can use set<shared_ptr<T>> (or some other smart pointer) instead of storing the objects directly in the set. In this case, you should implement
bool operator<(shared_ptr<T> const & a, shared_ptr<T> const & b)
For point 2, there won't be such a method determined just by index, because there's no index in a set. However, you can use something like filter in Ruby. Here the predicate can be a function, a function object or a closure.
remove_copy_if(foo.begin(), foo.end(), back_inserter(bar), some_predicate);
I have about 70-150 different structs X with an unsigned integral ID field. They are read in and initialized on initialization of program and never modified thereafter. What would be the fastest way to access them (which happens a lot) among the following (or some other method?):
Use a std::vector<X> v; where v[X.id] = X; and access by doing X& x = v[id]; (this makes a copy at the beginning, but later on it is merely a lookup by id in what is essentially a flat array).
Same as above but std::vector<X*> v; with X* x = v[id]; I am wary about this one because it has one extra level of indirection.
a std::map - feels like overkill compared to above?
same as above but unordered_map - again given 70-150 occurrences might not even beat suggestion 3.
Anything more clever? One problem I see with 1 is it might be a bit sparse in access patterns but not sure how to address that if that's the fastest way.
Using vector will definitely be the fastest approach:
Vector has complexity O(1). No need to search or hash, you will immediately find your instance.
Map has complexity O(log N). The map needs to compare your index with logN other entries to find your instance.
Unordered_map has complexity O(1) on average, but with some overhead for calculating the hash value (although for simple numbers it will not be much). However, std::unordered_map can still put multiple entries in one bucket, so instead of comparing one index, it may have to compare several (the default max_load_factor is 1.0, i.e. about one entry per bucket on average).
I think that for a small number of items there is no faster way to access than using array or vector data types because it provides constant time access. In case of many objects this approach is also fastest but also the most memory expensive.
If you don't mind pre-allocating the space for all the structures in advance, and the IDs can be ordered sequentially so they are also indexes, just use approach 1.
If the structures are expensive to construct, delay construction either by not actually placing the construction code in the default constructor (and doing it explicitly via a method call later) or by using a raw array and placement new on memory "slots" within that array.
If IDs are not sequential, use std::unordered_map to prevent waste of space on "holes".
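A sketch of the flat-table idea for IDs that are small but may have gaps. It stores pointers (the question's option 2) so that a hole costs one pointer rather than a whole struct; X's fields, build_table and lookup are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct X {                 // hypothetical record with an integral ID
    unsigned id;
    int payload;
};

// Builds a table indexed directly by ID. Slots for unused IDs stay null.
// The pointers refer into `items`, which must outlive the table and must
// not be modified afterwards.
inline std::vector<const X*> build_table(const std::vector<X>& items) {
    unsigned max_id = 0;
    for (const X& x : items)
        if (x.id > max_id) max_id = x.id;
    std::vector<const X*> table(items.empty() ? 0 : max_id + 1u, nullptr);
    for (const X& x : items) table[x.id] = &x;
    return table;
}

// O(1) lookup; returns nullptr for an unknown or out-of-range ID.
inline const X* lookup(const std::vector<const X*>& table, unsigned id) {
    return id < table.size() ? table[id] : nullptr;
}
```

With only 70-150 entries either variant fits in a few cache lines, so measuring is the only way to tell whether the extra indirection matters.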
I am confused. I don't know which container I should use. Let me first tell you what I need. Basically I need a container that can store X objects (the number of objects is unknown; it could be 1 - 50k).
I read a lot. Over here, array vs list, it says an array needs to be resized if the number of objects is unknown (I am not sure how to resize an array in C++), and it also states that with a linked list, searching for a certain item means looping through (iterating) from first to last (or vice versa), while with an array you can specify "array object at index".
Then I looked at other solutions: map, vector, etc. Like this one: array vs vector. One responder says never use array.
I am new to C++; I have only used array, vector, list and map before. Now, for my case, what kind of container would you recommend? Let me rephrase my requirements:
Need to be a container
The number of objects stored is unknown but is huge (1 - 40k maybe)
I need to loop through the containers to find specific object
std::vector is what you need.
You have to consider 2 things when selecting a stl container.
Data you want to store
Operations you want to perform on the stored data
There was a good diagram in a question here on SO which depicts this. I cannot find the link to it, but I saved it a long time ago; here it is:
You cannot resize an array in C++, not sure where you got that one from. The container you need is std::vector.
The general rule is: use std::vector until it doesn't work, then shift to something that does. There are all sorts of theoretical rules about which one is better, depending on the operations, but I've regularly found that std::vector outperforms the others, even when the most frequent operations are things where std::vector is supposedly worse. Locality seems more important than most of the theoretical considerations on a modern machine.
The one reason you might shift from std::vector is iterator validity. Inserting into a std::vector may invalidate iterators; inserting into a std::list never does.
Do you need to loop through the container, or do you have a key or ID for your objects?
If you have a key or ID - you can use map to be able to quickly access the object by it, if the id is the simple index - then you can use vector.
Otherwise you can iterate through any container (they all have iterators) but list would be the best if you want to be memory efficient, and vector if you want to be performance oriented.
You can use vector. But if you need to find objects in the container, then consider using set, multiset or map.
I have created a container for generic, weak-type data which is accessible through the subscript operator.
The std::map container allows both data access and element insertion through operator[], whereas std::vector, I think, doesn't.
What is the best (C++ style) way to proceed? Should I allow allocation through the subscript operator or have a separate insert method?
EDIT
I should say, I'm not asking if I should use vector or map, I just wanted to know what people thought about accessing and inserting being combined in this way.
In the case of Vectors: Subscript notation does not insert -- it overwrites.
The rest of this post distils the information from Items 1-5 of Effective STL.
If you know the range of your data beforehand, the size is fixed, and you won't insert at locations which have data above them, then you can insert into vectors without unpleasant side-effects.
However, in the general case, ad hoc vector insertions have implications such as shifting members upward and growing (typically doubling) the memory when capacity is exhausted, which causes a flood of copies from the old vector's objects to locations in the new vector. Vectors are designed for when you know the locality characteristics of your data.
Vectors come with an insert member function, and in most implementations this function is very clever in that it can infer optimizations from the iterators you supply. Can't you just use this?
If you want to do ad hoc insertions of data, you should use a list. Perhaps you can use a list to collect the data and then, once it's finalized, populate a vector using the range-based insert or the range-based constructor?
It depends what you want. A map can be significantly slower than a vector if you wish to use the thing like an array. A map is very helpful if the index you want to use is non-sequential and you have LOADS of them. It's usually quicker to just use a vector, sort it and do a binary search to find what you are after. I've used this method to replace maps in tonnes of software and I still haven't found something where it was slower to do this with a vector.
So, IMO, std::vector is the better way, though a map MIGHT be useful if you are using it properly.
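The sort-then-binary-search approach could look roughly like this. Entry, finalize and find_value are made-up names, and int keys and values are assumed for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Entry = std::pair<int, int>;   // (key, value); hypothetical payload type

// Sort once after filling; lookups then use binary search.
inline void finalize(std::vector<Entry>& v) {
    std::sort(v.begin(), v.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
}

// Returns a pointer to the value stored under key, or nullptr if absent.
// Precondition: v has been sorted by key, e.g. with finalize().
inline const int* find_value(const std::vector<Entry>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
                               [](const Entry& e, int k) { return e.first < k; });
    return (it != v.end() && it->first == key) ? &it->second : nullptr;
}
```

This pays off when the data is filled once and read many times; every later insertion either has to keep the vector sorted or trigger another sort.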
Separate insert method, definitely. The operator[] on std::map is just stupid and makes the code hard to read and debug.
Also, you can't access data from a const context if you're using operator[] to insert (which will lead to un-const-cancer, the even-more evil cousin of const-cancer).
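To illustrate the point, here is a sketch of a container with separate read and write paths, next to the std::map behaviour being criticised. Registry and size_after_subscript_read are hypothetical names, and string keys with int values are just an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical container with an explicit insert and a const read path.
class Registry {
public:
    // Explicit insertion; returns false if the key already existed.
    bool insert(const std::string& key, int value) {
        return items_.emplace(key, value).second;
    }
    // Read-only access; never creates entries, so it can be const.
    const int* find(const std::string& key) const {
        auto it = items_.find(key);
        return it == items_.end() ? nullptr : &it->second;
    }
    std::size_t size() const { return items_.size(); }

private:
    std::map<std::string, int> items_;
};

// Reading a missing key through std::map::operator[] silently inserts a
// value-initialized entry, so the map grows as a side effect of the read.
inline std::size_t size_after_subscript_read() {
    std::map<std::string, int> m;
    int v = m["missing"];   // inserts {"missing", 0}
    (void)v;
    return m.size();
}
```

Since std::map::operator[] has no const overload, a const Registry can only offer lookups through something like find().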