In other words, if I fill two unordered_map, or unordered_set, objects with exactly the same content and the same hashing function, will iterating over them give the same sequence of key/value pairs?
If so, then what are the conditions for this to hold (e.g. same hashing function, same keys, not necessarily same values).
No. There is no requirement, for example, that objects that have the same hash be placed in any particular order. In fact, in general it's impossible for an unordered map to do this because the only information it has access to is the hash value.
The order in this case is unspecified. In some situations the sequence will be the same, in others it will differ. You can't be sure of anything. The types you mentioned are named unordered for a reason. Relying on their iteration order is very bad and extremely dangerous style.
You may find that your compiler behaves in some particular way that you would like to rely on. But you can't be sure. You mustn't be sure! You do not know what conditions cause that behavior, and you can never be sure that a change of compiler version will not change the behavior you need.
What is simply forbidden in other languages, in C/C++ is not specified. But you should take it as forbidden, too.
See the C FAQ on the problem of undefined behavior. The concept is common to both C and C++.
Well first I will quote MSDN:
The actual order of elements in the controlled sequence depends on the hash function, the comparison function, the order of insertion, the maximum load factor, and the current number of buckets. You cannot in general predict the order of elements in the controlled sequence. You can always be assured, however, that any subset of elements that have equivalent ordering are adjacent in the controlled sequence.
The above quote is the same for unordered_map and unordered_set and as the name implies the sequence is unspecified, however, they do mention that equivalent keys are adjacent.
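To make this concrete, here is a minimal sketch (the helper names make_set, sorted_elements and demo_same_content are made up for illustration). It fills two std::unordered_set<int> objects with the same values, but in different insertion orders and with different initial bucket counts. operator== on unordered containers compares contents regardless of order, so the two sets compare equal, yet nothing guarantees that iterating them yields the same sequence. If a reproducible sequence is needed, one option is to copy the elements out and sort them:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Build a set from the given values, inserted in the given order,
// with an explicit initial bucket count (hypothetical helper).
std::unordered_set<int> make_set(const std::vector<int>& values,
                                 std::size_t buckets) {
    std::unordered_set<int> s(buckets);
    for (int v : values) s.insert(v);
    return s;
}

// Copy the elements into a vector and sort them. The result does not
// depend on the hash function, bucket count, or insertion order.
std::vector<int> sorted_elements(const std::unordered_set<int>& s) {
    std::vector<int> out(s.begin(), s.end());
    std::sort(out.begin(), out.end());
    return out;
}

void demo_same_content() {
    std::unordered_set<int> a = make_set({1, 2, 3, 4, 5}, 8);
    std::unordered_set<int> b = make_set({5, 4, 3, 2, 1}, 64);

    // Same content: equality holds whatever the iteration order is.
    assert(a == b);

    // The raw iteration order of a and b may or may not match;
    // only the sorted copies are guaranteed to agree.
    assert(sorted_elements(a) == sorted_elements(b));
}
```

Whether the raw iteration orders happen to match on a given standard library is exactly the part you should not rely on.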
If you need guaranteed ordering, this is not the data structure for you. In fact, if you're mainly going to be doing iteration, unordered_map/set is not the data structure for you, either.
For iteration, a std::map will prove to be the better data structure, as going from one node to the next is less algorithmically complex. And the order of iteration for the objects in std::map is guaranteed by the spec (and is actually a defining property of the structure itself). (This is all assuming you're using the same comparison operator, obviously). No hashing is involved in std::map.
Suffice it to say, it sounds like you're barking up the wrong tree here. unordered_map should generally be used for benefits such as O(1) lookup, not for storing a list of objects and then iterating over them. It most definitely should not be used if you're trying to get a deterministic order of iteration.
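As a small sketch of that guarantee (the function names here are invented for the example), two std::map objects built with the default comparison and filled in different insertion orders iterate in the same key order:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Collect the keys of a std::map in the order iteration visits them.
std::vector<std::string> keys_in_iteration_order(
        const std::map<std::string, int>& m) {
    std::vector<std::string> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;
}

void demo_map_order() {
    std::map<std::string, int> a;
    a["pear"] = 1;
    a["apple"] = 2;
    a["fig"] = 3;

    std::map<std::string, int> b;
    b["fig"] = 3;
    b["apple"] = 2;
    b["pear"] = 1;

    // Iteration follows the comparison function (std::less here),
    // not the order in which the keys were inserted.
    assert(keys_in_iteration_order(a) == keys_in_iteration_order(b));
    assert(keys_in_iteration_order(a).front() == "apple");
}
```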
I don't know how a red-black tree works with string keys. I've already seen it with numbers on YouTube and it baffled me a lot. However, I know very well how unordered_map works (the internals of hash maps). std::map stays esoteric for me, but I read and tested that if we don't have many changes in the std::map, it could beat hash maps.
My case is simple, I have a std::map of <std::string,bool>. Keys contains paths to XML elements (example of a key: "Instrument_Roots/Instrument_Root/Rating_Type"), and I use the boolean value in my SAX parser to know if we reached a particular element.
I build this map "only once"; after that, all I do is use std::find to check whether a particular "key" ("path") exists in order to set its boolean value to true, or search for the first element that has "true" as its associated value and use its corresponding "key". Finally, I set all the boolean values to false to guarantee that only a single "key" has a "true" boolean value.
You shouldn't need to understand how red-black trees work in order to understand how to use a std::map. It's simply an associative array where the keys are in order (lexicographical order, in the case of string keys, at least with the default comparison function). That means that you can not only look keys up in a std::map, you can also make queries which depend on order. For example, you can find the largest key in the map which is not greater than the key you have. You can find the next larger key. Or (again in the case of strings) you can find all keys which start with the same prefix.
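The order-dependent queries mentioned above can be sketched with lower_bound and upper_bound. The helpers keys_with_prefix and floor_key are hypothetical names, and the keys in the demo are shortened stand-ins for real XML paths:

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using PathMap = std::map<std::string, bool>;

// All keys that start with `prefix`. With the default comparison they
// form one contiguous run beginning at lower_bound(prefix).
std::vector<std::string> keys_with_prefix(const PathMap& m,
                                          const std::string& prefix) {
    std::vector<std::string> out;
    for (auto it = m.lower_bound(prefix);
         it != m.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(it->first);
    }
    return out;
}

// Largest key that is not greater than `key`, or m.end() if none.
PathMap::const_iterator floor_key(const PathMap& m, const std::string& key) {
    auto it = m.upper_bound(key);  // first key strictly greater than `key`
    if (it == m.begin()) return m.end();
    return std::prev(it);
}

void demo_ordered_queries() {
    PathMap m;
    m["a"] = false;
    m["a/b"] = false;
    m["a/c"] = false;
    m["b/a"] = false;

    assert(keys_with_prefix(m, "a/").size() == 2);
    assert(floor_key(m, "a/bb")->first == "a/b");
    assert(floor_key(m, "") == m.end());
}
```

Neither query has a direct equivalent on std::unordered_map, which is the functionality you pay for with the ordered container.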
If you iterate over all the key-value pairs in a std::map, you will see them in order by key. That can be very useful, sometimes.
The extra functionality comes at a price. std::map is usually slower than std::unordered_map (though not always; for large string keys, the overhead of computing the hash function might be noticeable), and the underlying data structure has a certain amount of overhead, so they may occupy more space. The usual advice is to use a std::map if you find the fact that the keys are ordered to be essential or even useful.
But if you've benchmarked and concluded that for your application, a std::map is also faster, then go ahead and use it :)
It is occasionally useful to have a map whose mapped type is bool, but only if you need to distinguish between keys whose corresponding value is false and keys which are not present in the map at all. In effect, a std::map<T, bool> (or std::unordered_map<T, bool>) provides a ternary choice for each possible key.
If you don't need to distinguish between the two false cases, and you don't frequently change a key's value, then you may well be better off with a std::set (or std::unordered_set), which is exactly the same datastructure but without the overhead of the bool in each element. (Although only one bit of the bool is useful, alignment considerations may end up using 8 additional bytes for each entry.) Other than storage space, there won't be much (if any) performance difference, though.
If you do really need a ternary case, then you would be well-advised to make the value an enum rather than a bool. What do true and false mean in the context of your usage? My guess is that they don't mean "true" and "false". Instead, they mean something like "is an attribute path" and "is an element path". That distinction could be made much clearer (and therefore less accident-prone) by using enum PathType {ATTRIBUTE_PATH, ELEMENT_PATH};. That will not involve any additional resources, since the bool is occupying eight bytes of storage in any case (because of alignment).
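A minimal sketch of the enum variant, using the PathType enum suggested above. The describe function and the "Instrument_Roots/@version" key are invented for illustration, and the meaning assigned to each path is only a guess at the intended usage:

```cpp
#include <cassert>
#include <map>
#include <string>

enum PathType { ATTRIBUTE_PATH, ELEMENT_PATH };

// Three-way answer for a key: not in the map at all, or present with
// one of the two enumerators.
std::string describe(const std::map<std::string, PathType>& paths,
                     const std::string& key) {
    auto it = paths.find(key);
    if (it == paths.end()) return "absent";
    return it->second == ATTRIBUTE_PATH ? "attribute path" : "element path";
}

void demo_path_types() {
    std::map<std::string, PathType> paths;
    paths["Instrument_Roots/Instrument_Root/Rating_Type"] = ELEMENT_PATH;
    paths["Instrument_Roots/@version"] = ATTRIBUTE_PATH;  // hypothetical key

    assert(describe(paths, "Instrument_Roots/@version") == "attribute path");
    assert(describe(paths, "No/Such/Path") == "absent");
}
```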
By the way, there is no guarantee that the underlying data structure is precisely a red-black tree, although the performance guarantees would be difficult to achieve without some kind of self-balancing tree. I don't know of such an implementation, but it would be possible to use k-ary trees (for some small k) to take advantage of SIMD vector comparison operations, for example. Of course, that would need to be customized for appropriate key types.
If you do want to understand red-black trees, you could do worse than Robert Sedgewick's standard textbook on Algorithms. On the book's website, you'll find a brief illustrated explanation in the chapter on balanced trees.
I would recommend you to use std::unordered_set because you really don't need to store this boolean flag and you also don't need to keep these xml tags in sorted order so std::unordered_set seems to me as logical and the most efficient choice.
Is it safe to say that if I don't want duplicates in my container, and I don't care about element position as I only want to iterate through the container, then I should use an unordered_set instead of vector?
No, it is not. It depends on many factors. For example if you seldom add new elements but iterate over container quite often it would be preferable to use std::vector and maintain uniqueness manually. There also could be other factors affecting your decision. But normally yes you may prefer std::unordered_set as it simplifies your program.
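One way to "maintain uniqueness manually" is to keep the vector sorted and insert only when a binary search does not find the value. This is a sketch under that assumption (insert_unique is a made-up name), not the only approach:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Insert `value` into an already sorted vector unless it is present.
// Returns true when the value was actually inserted.
bool insert_unique(std::vector<int>& sorted, int value) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));  // precondition
    auto pos = std::lower_bound(sorted.begin(), sorted.end(), value);
    if (pos != sorted.end() && *pos == value) return false;  // duplicate
    sorted.insert(pos, value);  // O(n) shift, acceptable if inserts are rare
    return true;
}

void demo_insert_unique() {
    std::vector<int> v;
    assert(insert_unique(v, 3));
    assert(insert_unique(v, 1));
    assert(!insert_unique(v, 3));  // rejected: already present
    assert(v.size() == 2);
}
```

Iteration then walks contiguous memory, which is the case where the vector tends to pay off.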
Not entirely. unordered_sets are not required to be contiguous containers; if you frequently need to read all of the many values contained in the set, you may prefer std::vector in a time-critical application.
std::unordered_set:
Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its value. This allows fast access to individual elements, since once a hash is computed, it refers to the exact bucket the element is placed into.
But in the general case, I'd say Yes.
I generally prefer vector or map. (or in your case, std::set).
Hash tables can be faster than maps/sets (red-black trees), but red-black trees have guaranteed performance 100% of the time. And logarithmic performance is REALLY fast! A hash table can kill performance when it starts rehashing.
std::vector is the workhorse of the STL and should be your default choice. Vector is very straightforward, and is very cache-friendly.
This article by Matt Austern is related to this topic and it is worth reading:
Why you shouldn't use set (and what you should use instead) by Matt Austern
This thread is trying to identify conditions under which unordered_set is preferable over vectors. Similarly, in the above article, the author clearly identifies four conditions, which all need to be satisfied in order to prefer set over a custom but simpler data structure called sorted_vector (last section: What is set good for?). It will be interesting to clearly state a set of conditions for preferring unordered_set over vector.
Also, the last paragraph of the article summarizes a useful rule to keep in mind:
Every component in the standard C++ library is there because it's useful for some purpose, but sometimes that purpose is narrowly defined and rare. As a general rule you should always use the simplest data structure that meets your needs. The more complicated a data structure, the more likely that it's not as widely useful as it might seem.
Of course yes. If you do not want duplicates, you have to use a key-aware container, and since unordered_* totally win over their tree-based counterparts, this is pretty much your only choice.
When inserting elements into an std::unordered_set, is it worth calling std::unordered_set::find prior to std::unordered_set::insert? From my understanding, I should always just call insert, as it returns an std::pair which contains a bool that tells whether the insertion succeeded.
Calling find before insert is essentially an anti-pattern, which is typically observed in poorly designed custom set implementations. Namely, it might be necessary in implementations that do not tell the caller whether the insertion actually occurred. std::set does provide you with this information, meaning that it is normally not necessary to perform this find-before-insert dance.
A typical implementation of insert will contain the full implementation of find, meaning that the find-before-insert approach performs the search twice for no meaningful reason.
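A short sketch of using the return value directly (count_repeats is a made-up helper): the bool in the returned pair already says whether the element was new, so a single lookup does both jobs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Count how many items were already present when we tried to insert them.
std::size_t count_repeats(const std::vector<std::string>& items) {
    std::unordered_set<std::string> seen;
    std::size_t repeats = 0;
    for (const auto& item : items) {
        // insert() returns a pair of (iterator, bool). The bool is false
        // when an equal element was already stored, so no find() is needed.
        if (!seen.insert(item).second) ++repeats;
    }
    return repeats;
}

void demo_count_repeats() {
    std::vector<std::string> items;
    items.push_back("a");
    items.push_back("b");
    items.push_back("a");
    assert(count_repeats(items) == 1);
}
```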
However, some other shortcomings of std::set design do sometimes call for a find-before-insert sequence. For example, your set elements might contain some fields that need to be modified if (and only if) the actual insertion occurred. You might have to allocate "permanent" memory for some pointer fields instead of the "temporary" (local) memory these fields were pointing to before the insertion. Unfortunately, this is impossible to do after the insertion, since std::set only provides you with non-modifying access to its elements. One workaround is to do a find first, thus "predicting" whether an actual insertion will occur, and then setting up the new element accordingly (like allocating "permanent" memory for all fields) before doing the insert. This is ugly from the performance point of view, but it is acceptable in non-performance-critical code. That's just how things are with standard containers.
It's best to just attempt the insert, otherwise the effort of hashing and iterating over any elements that have collided in the hash bucket is unnecessarily repeated.
If your set is thread-safe and accessed concurrently, then calling find first does very little, as insert would be atomic but a check-then-act sequence would be susceptible to a race condition.
So in general and especially in a multithreaded context, just insert.
is there, in the c++ "Standard Library", any "Associative" (i.e. "Key-Value") Container/Data Structure, that has the ability, to preserve order, by order of insertion?
I have seen several topics on this, however, it seems, most before C++11.
Some suggest using "boost::multi_index", but, if at all possible, I would "rather" use standard containers/structures.
I see that C++11 has several, apparently, "unordered" associative containers.
Are any of these, by some way, "configurable", such that they are only sorted by insertion order?
Thanks!
No.
You are mixing linear access with random. Not very good bed fellows.
Just use both a vector/list (i.e. order of insertion) along with a map using an index into the former.
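A rough sketch of that combination follows. The class name InsertionOrderedMap and its interface are invented for this example; it is not a standard container, and it deliberately leaves out erase, which would require fixing up the stored indices:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class InsertionOrderedMap {
public:
    // Insert a new key at the end, or overwrite the value of an existing
    // key without changing its position.
    void set(const std::string& key, int value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            entries_[it->second].second = value;
            return;
        }
        index_.emplace(key, entries_.size());
        entries_.emplace_back(key, value);
    }

    // Pointer to the value for `key`, or nullptr if the key is absent.
    const int* find(const std::string& key) const {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &entries_[it->second].second;
    }

    // Entries in insertion order.
    const std::vector<std::pair<std::string, int> >& entries() const {
        return entries_;
    }

private:
    std::vector<std::pair<std::string, int> > entries_;   // insertion order
    std::unordered_map<std::string, std::size_t> index_;  // key -> position
};

void demo_insertion_order() {
    InsertionOrderedMap m;
    m.set("zeta", 1);
    m.set("alpha", 2);
    m.set("zeta", 3);  // overwrite keeps the original position

    assert(m.entries().size() == 2);
    assert(m.entries()[0].first == "zeta");
    assert(*m.find("zeta") == 3);
    assert(m.find("missing") == nullptr);
}
```

Storing indices rather than iterators in the map keeps the lookups valid when the vector reallocates.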
No; such capability was apparently sacrificed in the name of performance.
The order of equivalent items is required to be preserved across operations including rehashes, but there's no way to specify the original order. You could, in theory, use std::rotate or the like to permute the objects into the desired order after each insertion. Obviously impractical, but it proves the lack of capability is a little arbitrary.
Your best bet is to keep the subsequences in inner containers. You may use an iterator adaptor to iterate over such a "deep" container as if it were a single sequence. Such a utility can probably be found in Boost.
No.
In unordered maps, too, elements are not stored according to the order of insertion.
You can use a vector to keep track of the keys!
I want to use multiset to count some custom defined keys. The keys are not comparable numerically, comparing two keys does not mean anything, but their equality can be checked.
I see that multiset template wants a Compare to order the multiset. The order is not important to me, only the counts are important. If I omit Compare completely what happens? Does multiset work without any problems for my custom keys? If I cannot use std::multiset what are my alternatives?
If you can only compare keys for equality then you cannot use std::multiset. For associative containers your key type must have a strict weak ordering imposed by a comparison operation.
The strict weak ordering doesn't necessarily have to be numerical.
[For use in an associative container, you don't actually need an equality comparison. Key equivalence is determined by !compare(a, b) && !compare(b, a).]
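As an illustration, here is a sketch with a made-up Key type that has no meaningful numeric order. The comparator KeyLess simply compares the fields lexicographically via std::tie, and equivalence falls out of the comparator exactly as described, with no operator== defined at all:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>

// Hypothetical key type: the ordering below carries no meaning of its
// own, it only has to be a consistent strict weak ordering.
struct Key {
    std::string category;
    int id;
};

struct KeyLess {
    bool operator()(const Key& a, const Key& b) const {
        // Lexicographic comparison: category first, then id.
        return std::tie(a.category, a.id) < std::tie(b.category, b.id);
    }
};

// Equivalence as the associative containers see it.
bool equivalent(const Key& a, const Key& b) {
    KeyLess less;
    return !less(a, b) && !less(b, a);
}

void demo_multiset_count() {
    Key x1 = {"x", 1};
    Key x1_again = {"x", 1};
    Key y2 = {"y", 2};

    std::multiset<Key, KeyLess> bag;
    bag.insert(x1);
    bag.insert(x1_again);
    bag.insert(y2);

    assert(equivalent(x1, x1_again));
    assert(!equivalent(x1, y2));
    assert(bag.count(x1) == 2);
}
```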
If you really can't define an ordering for your keys, then your only option is to use a sequence container of key-value pairs and a linear search for lookup. Needless to say, this will be less efficient for set-like operations than a multiset, so you should probably try hard to create an ordering if at all possible.
You cannot use std::multiset if you don't have a strict weak ordering. Your options are:
Impose a strict-weak ordering on your data. If your key is a "linear" data structure, it is usually a good idea to compare it lexicographically.
Use an unordered container equivalent, e.g., boost::unordered_multiset. For that, you will need to make your custom data-type hash-able, which is often-times easier than imposing some kind of order.
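For the second option, here is a sketch using std::unordered_multiset, which is part of the standard library since C++11 and plays the same role as the Boost container. The Tag type, TagHash and TagEqual are made up for the example, and the hash-combining formula is just one common choice, not a requirement:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical key type that supports equality but has no natural order.
struct Tag {
    std::string name;
    int kind;
};

struct TagEqual {
    bool operator()(const Tag& a, const Tag& b) const {
        return a.name == b.name && a.kind == b.kind;
    }
};

struct TagHash {
    std::size_t operator()(const Tag& t) const {
        std::size_t h1 = std::hash<std::string>()(t.name);
        std::size_t h2 = std::hash<int>()(t.kind);
        // Mix the two hashes. Keys that compare equal must hash equally.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

typedef std::unordered_multiset<Tag, TagHash, TagEqual> TagBag;

void demo_tag_counts() {
    Tag red = {"red", 1};
    Tag blue = {"blue", 1};

    TagBag bag;
    bag.insert(red);
    bag.insert(blue);
    bag.insert(red);

    assert(bag.count(red) == 2);
    assert(bag.count(blue) == 1);
}
```

If only the counts matter, a std::unordered_map from the key to an integer counter with the same hash and equality functors would store each distinct key once instead of once per occurrence.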
If you omit the Compare completely, it will get the default value, which is less (which gives the result of the < operator applied to your key) - which may or may not even compile for your key.
The reason for having an ordering is that it allows the implementation to look up elements more quickly by their key (when inserting, deleting, etc.). To understand why, imagine looking words up in a dictionary. Traditional dictionaries use alphabetical order, which makes words easy to look up. If you were preparing a dictionary for a language that isn't easily ordered - say a pictographic language - then either it would be very hard to find words in it at all (you'd have to search the whole dictionary), or you'd try to find a logical way to order them (e.g. by putting all the pictures that can be drawn with one pen stroke first, then two strokes, etc.) - because even if this order was completely arbitrary, it would make finding entries in the dictionary far more efficient.
Similarly, even if your keys don't need to be ordered for your own purposes, and don't have any natural order, you can usually define an ordering that is good enough to address these concerns. The ordering must be transitive (if a<b and b<c then a<c), strict (never return true for a<a), and asymmetric (a<b and b<a are never both true). Ideally it should order all elements (if a & b are different then either a<b or b<a), though you can get away with that not being true (i.e. a strict weak ordering), in which case two elements that are not ordered relative to each other are treated as equivalent - though that's rather technical.
Indeed, perhaps the most obvious use for it is the rare case where it is completely impossible to order the items - in which case you can supply a comparison operator which always returns false. This will very likely result in poor performance, but it is at least a valid strict weak ordering. Note, however, that every key then compares equivalent to every other key, so the container cannot tell distinct keys apart.
So you have two important criteria which you listed: you don't care about order, and comparison of keys does not mean anything.
And there is one which I assume: the fact that you are using multiset implies that there are many instances.
So, why not use std::vector or std::deque or std::list? then you can take advantage of the various algorithms that can use the equality check (such as count_if etc.)
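A minimal sketch of that approach, assuming a made-up Shape key type that only provides operator==. The occurrences helper is likewise hypothetical; it wraps std::count, which needs nothing more than equality:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical key: equality is defined, ordering is not.
struct Shape {
    int sides;
    bool filled;
};

bool operator==(const Shape& a, const Shape& b) {
    return a.sides == b.sides && a.filled == b.filled;
}

// Number of elements equal to `key`. This is a linear scan, O(n) per query.
std::size_t occurrences(const std::vector<Shape>& items, const Shape& key) {
    return static_cast<std::size_t>(
        std::count(items.begin(), items.end(), key));
}

void demo_occurrences() {
    Shape triangle = {3, false};
    Shape square = {4, true};

    std::vector<Shape> items;
    items.push_back(triangle);
    items.push_back(square);
    items.push_back(triangle);

    assert(occurrences(items, triangle) == 2);
    assert(occurrences(items, square) == 1);
}
```

Each count costs a full pass over the container, so this is reasonable for small collections or infrequent queries.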