C++ map allocator stores items in a vector? - c++

Here is the problem I would like to solve: in C++, iterators for map, multimap, etc are missing two desirable features: (1) they can't be checked at run-time for validity, and (2) there is no operator< defined on them, which means that they can't be used as keys in another associative container. (I don't care whether the operator< has any relationship to key ordering; I just want there to be some < available at least for iterators to the same map.)
Here is a possible solution to this problem: convince map, multimap, etc to store their key/data pairs in a vector, and then have the iterators be a small struct that contain a pointer to the vector itself and a subscript index. Then two iterators, at least for the same container, could be compared (by comparing their subscript indices), and it would be possible to test at run time whether an iterator is valid.
Is this solution achievable in standard C++? In particular, could I define the 'Allocator' for the map class to actually put the items in a vector, and then define the Allocator::pointer type to be the small struct described in the last paragraph? How is the iterator for a map related to the Allocator::pointer type? Does the Allocator::pointer have to be an actual pointer, or can it be anything that supports a dereference operation?
UPDATE 2013-06-11: I'm not understanding the responses. If the (key,data) pairs are stored in a vector, then it is O(1) to obtain the items given the subscript, only slightly worse than if you had a direct pointer, so there is no change in the asymptotics. Why does a responder say map iterators are "not kept around"? The standard says that iterators remain valid as long as the item to which they refer is not deleted. As for the 'real problem': say I use a multimap for a symbol table (variable name->storage location; it is a multimap rather than map because the variables names in an inner scope may shadow variables with the same name), and say now I need a second data structure keyed by variables. The apparently easiest solution is to use as key for the second map an iterator to the specific instance of the variable's name in the first map, which would work if only iterators had an operator<.

I think not.
If you were somehow able to "convince" map to store its pairs in a vector, you would fundamentally change certain (at least two) guarantees on the map:
insert, erase and find would no longer be logarithmic in complexity.
insert would no longer be able to guarantee the validity of unaffected iterators, as the underlying vector would sometimes need to be reallocated.
Taking a step back though, two things suggest to me that you are trying to "solve" the wrong problem.
First, it is unusual to need to have a vector of iterators.
Second, it is unusual to need to check an iterator for validity, as iterators are not generally kept around.
I wonder what the real problem is that you are trying to solve?

Related

How to retrieve the elements from map in the order of insertion?

I have a map which stores <int, char *>. Now I want to retrieve the elements in the order they have been inserted. std::map returns the elements ordered by the key instead. Is it even possible?
If you are not concerned about the order based on the int (IOW you only need the order of insertion and you are not concerned with the normal key access), simply change it to a vector<pair<int, char*>>, which is by definition ordered by the insertion order (assuming you only insert at the end).
If you want to have two indices simultaneously, you need, well, Boost.MultiIndex or something similar. You'd probably need to keep a separate variable that would only count upwards (would be a steady counter), though, because you could use .size()+1 as new "insertion time key" only if you never erased anything from your map.
Now I want to retrieve the elements in the order they have been inserted. [...] Is it even possible?
No, not with std::map. std::map inserts the element pair into an already ordered tree structure (and after the insertion operation, the std::map has no way of knowing when each entry was added).
You can solve this in multiple ways:
use a std::vector<std::pair<int,char*>>. This will work, but not provide the automatic sorting that a map does.
use Boost stuff (#BartekBanachewicz suggested Boost.MultiIndex)
Use two containers and keep them synchronized: one with a sequential insert (e.g. std::vector) and one with indexing by key (e.g. std::map).
use/write a custom container yourself, so that supports both type of indexing (by key and insert order). Unless you have very specific requirements and use this a lot, you probably shouldn't need to do this.
A couple of options (other than those that Bartek suggested):
If you still want key-based access, you could use a map, along with a vector which contains all the keys, in insertion order. This gets inefficient if you want to delete elements later on, though.
You could build a linked list structure into your values: instead of the values being char*s, they're structs of a char* and the previously- and nextly-inserted keys**; a separate variable stores the head and tail of the list. You'll need to do the bookkeeping for that on your own, but it gives you efficient insertion and deletion. It's more or less what boost.multiindex will do.
** It would be nice to store map iterators instead, but this leads to circular definition problems.

Questions about STL containers in C++

How often do std::multimap and std::unordered_multimap shuffle entries around? I'm asking because my code passes around references to distinguish between entries with the same hash, and I want to know when to run the reference redirection function on them.
What happens if I do this:
std::multimap atable; //Type specification stuff left out
//Code that pus in two entries with the same key, call that key foo
int bar = atable[foo];
Is the result different if it's an unordered_multimap instead?
Back to passing references around to distinguish entries with the same hash. Is there a safer way to do that?
Do the entries move around if I remove one of the entries (That's what's suggested by a reading of the documentation for std::vector)?
NO, no elements will be harmed during any operation.
As is explained in this famous Q&A, for associative containers, there is no iterator invalidation upon insertions / erasure (except for the element being erased of course). For unordered associative containers, there is iterator invalidation during rehashing, about which the Standard says (emphasize mine)
23.2.5 Unordered associative containers [unord.req]
9 The elements of an unordered associative container are organized into
buckets. Keys with the same hash code appear in the same bucket. The
number of buckets is automatically increased as elements are added to
an unordered associative container, so that the average number of
elements per bucket is kept below a bound. Rehashing invalidates
iterators, changes ordering between elements, and changes which
buckets elements appear in, but does not invalidate pointers or
references to elements. For unordered_multiset and unordered_multimap,
rehashing preserves the relative ordering of equivalent elements.
Again, this does not entail the reshuflling of the actually stored elements (the Key and Value types in unordered_map<Key, Value>), because unordered maps have buckets that are organized as linked lists, and iterators to stored elements (key-value pairs) have both an element pointer and a bucket pointer. The rehashing shuffles buckets, which invalidates the iterators (because their bucket pointer is invalidated) but not pointers or references to the elements itself. This is explained in detail in another Q&A
How often do std::multimap and std::unordered_multimap shuffle entries around?
Never. The iterators that point to elements of any associative container (including sets, maps, and their unordered or "multi" versions) are never invalidated (unless the specific element they point to is deleted). In other words, the actual elements are never "shuffled around". These are required to be implemented as linked structures (e.g., linked-tree), meaning they can be re-structured just by changing a few pointers, without having to physically move any element.
EDIT: Apparently (see TemplateRex' comment), this is not the case for unordered containers. In that case, the iterators can get invalidated, but the elements themselves do not move around. These requirements imply an indirect container with no back-pointer, which I guess is a reasonable choice, but not one I would have expected.
What happens if I do this: ... (get [] of multimap) ...
The operator[] is not defined for std::multimap (or unordered version). So, what would happen? A compiler error would happen.
Is the result different if it's an unordered_multimap instead?
No, it's the same, the operator[] does not exist.
Back to passing references around to distinguish entries with the same hash. Is there a safer way to do that?
Yes, the recommended practice is to refer to elements of the map / set / whatever using iterators, not references. The iterators to elements are guaranteed to remain valid, and they are copyable and have the right const-ness protection on them, and that makes them the perfect objects to "refer to an entry".
EDIT: As per the same comment, I would have to recommend using pointers to the elements if dealing with a hashed container (unordered containers), because iterators are not guaranteed (by the standard) to remain valid.
All of the associative containers in the C++ standard library are node based, i.e., their elements stay put. However, whether the hash is computed on the object after copying it or on a temporary object passed to the container isn't specified. I would guess, that generally the hash is computed before the object is copied/moved.
To distinguish elements with the same hash you need to have an equality function anyway: if the location of the object causes it to be different it would mean that all objects are different and you wouldn't be able to look them up at all. You need to have an equality function for the elements in an unordered container which defines equivalence of keys. For the ordered associative the equivalent class is based on the strict weak ordering, i.e., on an expression like this (using < rather than a binary predicate for readability; any binary predicate defining a strict weak order would work, too):
bool equivalent = !(a < b) && !(b < a);

Is there a linked hash set in C++?

Java has a LinkedHashSet, which is a set with a predictable iteration order. What is the closest available data structure in C++?
Currently I'm duplicating my data by using both a set and a vector. I insert my data into the set. If the data inserted successfully (meaning data was not already present in the set), then I push_back into the vector. When I iterate through the data, I use the vector.
If you can use it, then a Boost.MultiIndex with sequenced and hashed_unique indexes is the same data structure as LinkedHashSet.
Failing that, keep an unordered_set (or hash_set, if that's what your implementation provides) of some type with a list node in it, and handle the sequential order yourself using that list node.
The problems with what you're currently doing (set and vector) are:
Two copies of the data (might be a problem when the data type is large, and it means that your two different iterations return references to different objects, albeit with the same values. This would be a problem if someone wrote some code that compared the addresses of the "same" elements obtained in the two different ways, expecting the addresses to be equal, or if your objects have mutable data members that are ignored by the order comparison, and someone writes code that expects to mutate via lookup and see changes when iterating in sequence).
Unlike LinkedHashSet, there is no fast way to remove an element in the middle of the sequence. And if you want to remove by value rather than by position, then you have to search the vector for the value to remove.
set has different performance characteristics from a hash set.
If you don't care about any of those things, then what you have is probably fine. If duplication is the only problem then you could consider keeping a vector of pointers to the elements in the set, instead of a vector of duplicates.
To replicate LinkedHashSet from Java in C++, I think you will need two vanilla std::map (please note that you will get LinkedTreeSet rather than the real LinkedHashSet instead which will get O(log n) for insert and delete) for this to work.
One uses actual value as key and insertion order (usually int or long int) as value.
Another ones is the reverse, uses insertion order as key and actual value as value.
When you are going to insert, you use std::map::find in the first std::map to make sure that there is no identical object exists in it.
If there is already exists, ignore the new one.
If it does not, you map this object with the incremented insertion order to both std::map I mentioned before.
When you are going to iterate through this by order of insertion, you iterate through the second std::map since it will be sorted by insertion order (anything that falls into the std::map or std::set will be sorted automatically).
When you are going to remove an element from it, you use std::map::find to get the order of insertion. Using this order of insertion to remove the element from the second std::map and remove the object from the first one.
Please note that this solution is not perfect, if you are planning to use this on the long-term basis, you will need to "compact" the insertion order after a certain number of removals since you will eventually run out of insertion order (2^32 indexes for unsigned int or 2^64 indexes for unsigned long long int).
In order to do this, you will need to put all the "value" objects into a vector, clear all values from both maps and then re-insert values from vector back into both maps. This procedure takes O(nlogn) time.
If you're using C++11, you can replace the first std::map with std::unordered_map to improve efficiency, you won't be able to replace the second one with it though. The reason is that std::unordered map uses a hash code for indexing so that the index cannot be reliably sorted in this situation.
You might wanna know that std::map doesn't give you any sort of (log n) as in "null" lookup time. And using std::tr1::unordered is risky business because it destroys any ordering to get constant lookup time.
Try to bash a boost multi index container to be more freely about it.
The way you described your combination of std::set and std::vector sounds like what you should be doing, except by using std::unordered_set (equivalent to Java's HashSet) and std::list (doubly-linked list). You could also use std::unordered_map to store the key (for lookup) along with an iterator into the list where to find the actual objects you store (if the keys are different from the objects (or only a part of them)).
The boost library does provide a number of these types of combinations of containers and look-up indices. For example, this bidirectional list with fast look-ups example.

Should I return an iterator or a pointer to an element in a STL container?

I am developing an engine for porting existing code to a different platform. The existing code has been developed using a third party API, and my engine will redefine those third party API functions in terms of my new platform.
The following definitions come from the API:
typedef unsigned long shape_handle;
shape_handle make_new_shape( int type );
I need to redefine make_new_shape and I have the option to redefine shape_handle.
I have defined this structure ( simplified ):
struct Shape
{
int type
};
The Caller of make_new_shape doesn't care about the underlying structure of Shape, it just needs a "handle" to it so that it can call functions like:
void `set_shape_color( myshape, RED );`
where myshape is the handle to the shape.
My engine will manage the memory for the Shape objects and other requirements dictate that the engine should be storing Shape objects in a list or other iterable container.
My question is, what is the safest way to represent this handle - if the Shape itself is going to be stored in a std::list - an iterator, a pointer, an index?
Both an iterators or a pointers will do bad stuff if you try to access them after the object has been deleted so neither is intrinsically safer. The advantage of an iterator is that it can be used to access other members of your collection.
So, if you just want to access your Shape then a pointer will be simplest. If you want to iterate through your list then use an iterator.
An index is useless in a list since std::list does not overload the [] operator.
The answer depends on your representation:
for std::list, use an iterator (not a pointer), because an iterator allows you to remove the element without walking the whole list.
for std::map or boost::unordered_map, use the Key (of course)
Your design would be much strong if you used an associative container, because associative containers give you the ability to query for the presence of the object, rather than invoking Undefined Behavior.
Try benchmarking both map and unordered_map to see which one is faster in your case :)
IIF the internal representation will be a list of Shapes, then pointers and iterators are safe. Once an element is allocated, no relocation will ever occur. I wouldn't recommend an index for obvious access performance reasons. O(n) in case of lists.
If you were using a vector, then don't use iterators or pointers, because elements can be relocated when you exceed the vectors capacity, and your pointers/iterators would become invalid.
If you want a representation that is safe regardless of the internal container, then create a container (list/vector) of pointers to your shapes, and return the shape pointer to your client. Even if the container is moved around in memory, the Shape objects will stay in the same location.
Iterators aren't safer than pointers, but they have much better diagnostics than raw pointers if you're using a checked STL implementation!
For example, in a debug build, if you return a pointer to a list element, then erase that list element, you have a dangling pointer. If you access it you get a crash and all you can see is junk data. That can make it difficult to work out what went wrong.
If you use an iterator and you have a checked STL implementation, as soon as you access the iterator to an erased element, you get a message something like "iterator was invalidated". That's because you erased the element it points to. Boom, you just saved yourself potentially a whole lot of debugging effort.
So, not indices for O(n) performance. Between pointers and iterators - always iterators!

Best c++ container to strip items away from?

I have a list of files (stored as c style strings) that I will be performing a search on and I will remove those files that do not match my parameters. What is the best container to use for this purpose? I'm thinking Set as of now. Note the list of files will never be larger than when it is initialized. I'll only be deleting from the container.
I would definitely not use a set - you don't need to sort it so no point in using a set. Set is implemented as a self-balancing tree usually, and the self-balancing algorithm is unnecessary in your case.
If you're going to be doing this operation once, I would use a std::vector with remove_if (from <algorithm>), followed by an erase. If you haven't used remove_if before, what it does is go through and shifts all the relevant items down, overwriting the irrelevant ones in the process. You have to follow it with an erase to reduce the size of the vector. Like so:
std::vector<const char*> files;
files.erase(remove_if(files.begin(), files.end(), RemovePredicate()), files.end());
Writing the code to do the same thing with a std::list would be a little bit more difficult if you wanted to take advantage of its O(1) deletion time property. Seeing as you're just doing this one-off operation which will probably take so little time you won't even notice it, I'd recommend doing this as it's the easiest way.
And to be honest, I don't think you'll see that much difference in terms of speed between the std::list and std::vector approaches. The vector approach only copies each value once so it's actually quite fast, yet takes much less space. In my opinion, going up to a std::list and using three times the space is only really justified if you're doing a lot of addition and deletion throughout the entire application's lifetime.
Elements in a std::set must be unique, so unless the filenames are globally unique this won't suit your needs.
I would probably recommend a std::list.
From SGI:
A vector is a Sequence that supports random access to elements, constant time insertion and removal of elements at the end, and linear time insertion and removal of elements at the beginning or in the middle.
A list is a doubly linked list. That is, it is a Sequence that supports both forward and backward traversal, and (amortized) constant time insertion and removal of elements at the beginning or the end, or in the middle.
An slist is a singly linked list: a list where each element is linked to the next element, but not to the previous element. That is, it is a Sequence that supports forward but not backward traversal, and (amortized) constant time insertion and removal of elements.
Set is a Sorted Associative Container that stores objects of type Key. Set is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Unique Associative Container, meaning that no two elements are the same.
Multiset is a Sorted Associative Container that stores objects of type Key. Multiset is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Multiple Associative Container, meaning that two or more elements may be identical.
Hash_set is a Hashed Associative Container that stores objects of type Key. Hash_set is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Unique Associative Container, meaning that no two elements compare equal using the Binary Predicate EqualKey.
Hash_multiset is a Hashed Associative Container that stores objects of type Key. Hash_multiset is a simple associative container, meaning that its value type, as well as its key type, is Key. It is also a Multiple Associative Container, meaning that two or more elements may compare equal using the Binary Predicate EqualKey.
(Some containers have been omitted.)
I would go with hash_set if all you care to have is a container which is fast and doesn't contain multiple identical keys. hash_multiset if you do, set or multiset if you want the strings to be sorted, or list or slist if you want the strings to keep their insertion order.
After you've built your list/set, use remove_if to filter out your items based on your criteria.
I will start by tossing out vector since it is a sequential container. Set, I believe is close to being sequential or hashed. I would avoid that. A doublely-linked list, the stl list is one of these, has two pointers and the node. Basically, to remove an item, it breaks the chain then rejoins the two parts with the pointers.
Assuming your search criteria does not depend on the filename (ie. you search for content, file sizes etc.), so you cannot make use of a set, I'd go with a list. It will take you O(N) to construct the whole list, and O(1) per one delete.
If you wanted to make it even faster, and didn't insist on using ready-made STL containers, I would:
use a vector
delete using false delete, ie. marking an item as deleted
when the ratio of deleted/all items raises above certain threshold, I would filter the items with remove_if
This should give you the best space/time/cache performance. (Although you should profile it to be sure)
You can use two lists/vectors/whatever:
using namespace std;
vector<const char *> files;
files.push_back("foo.bat");
files.push_back("bar.txt");
vector<const char *> good_files; // Maybe reserve elements given files.size()?
for(vector<const char *>::const_iterator i = files.begin(); i != files.end(); ++i) {
if(file_is_good(*i)) {
new_files.push_back(*i);
}
}