index or position in std::set - c++

I have a std::set of std::string. I need the "index" or "position" of each string in the set. Is this a meaningful concept in this context?
I guess find() will return an iterator to the string, so my question might be better phrased as: "How do I convert an iterator to a number?"

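std::distance is what you need. You will want, I guess, std::distance(set.begin(), find_result). Note that std::set iterators are bidirectional rather than random-access, so this call takes time linear in the position.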
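A minimal sketch of that suggestion follows. The helper name position_of is made up for illustration; only find(), begin() and std::distance come from the answer.

```cpp
#include <cstddef>
#include <iterator>
#include <set>
#include <string>

// Hypothetical helper: zero-based position of `key` in the set's sorted
// order, or s.size() if the key is absent (find() returns end() then).
// std::distance has to walk the nodes one by one, so this is linear time.
std::size_t position_of(const std::set<std::string>& s, const std::string& key) {
    auto it = s.find(key);
    return static_cast<std::size_t>(std::distance(s.begin(), it));
}
```

The result is only valid until the next insertion or erasure, since either can shift the positions of other elements.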

I don't think it is meaningful: sets are 'self-keyed' and sorted, so the 'index' would be invalidated whenever the set is modified.
Of course it depends upon how you intend to use the index and if the set is essentially static (say a dictionary).

Despite what others have written here, I don't think that "index" or "position" has meaning with respect to a set. In mathematical terms, a set exposes only its members and maybe its cardinality. The only meaningful operations involve testing whether an item is a member of the set, and combining or subtracting sets to yield new sets.
Some people talk about sets as data structures in looser terms, by facets of being "ordered" or "unordered", and whether they permit duplicates or enforce uniqueness. The former facet distinguishes an array with an O(n) insertion guard, where an attempt to insert an item first scans the existing members to see if the new item exists and, if not, inserts the new item at the end, and a hash table, that might retain such order only within a bucket's chain. A tree such as the Red-Black Tree used by std::set is somewhere in between; its traversal order is deterministic with respect to the strict weak order imposed by the comparator predicate, but, unlike the array sketched above, it doesn't retain insertion order.
The other facet, whether the set permits duplicate elements, is meaningless in mathematics; a collection that permits duplicates is more accurately described as a bag. Such a structure acknowledges the difference between identity and value-based "sameness."
Your problem may involve caring about some position; it's not clear what that position means, but I expect you're going to need some data structure separate from std::set to model this properly. Perhaps a std::map mapping from your set of elements to each position would do. That would not guarantee that the positions are unique.
It may also help clarify the problem to think how you'd model it as relations, such as in a relational database. What comprises the key? What portions of the entities can vary independently?
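One way to read the std::map suggestion above is as a separate lookup table from each element to a position. The sketch below assumes "position" means rank in sorted order and that the set is effectively static; the function name index_snapshot is hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical helper: builds a table from each element to its rank in the
// set's sorted order. The table is a snapshot and goes stale as soon as the
// set is modified, at which point it has to be rebuilt.
std::map<std::string, std::size_t> index_snapshot(const std::set<std::string>& s) {
    std::map<std::string, std::size_t> positions;
    std::size_t i = 0;
    for (const auto& item : s) {
        positions.emplace(item, i++);
    }
    return positions;
}
```

If the position is meant to be something other than sorted rank (insertion order, for instance), the table would have to be filled from that other source instead.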

Related

Why doesn't multiset act like set?

Why is multiset a set, when a set can contain only distinct elements while a multiset can contain the same element more than once? It could have just been called sortedArray or sortedList. Even if it just wants a sorted "collection", why is it a set?
Why is multiset a set
In mathematics there are two distinct concepts: set and multiset. The standard library has two containers that model these concepts: std::set and std::multiset. The concepts are not the same, so the container names differ as well.
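The difference is easy to see by feeding the same values to both containers. This is a small sketch; the struct Sizes and the function insert_into_both are names invented for the example.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical result type: element counts after inserting the same values.
struct Sizes {
    std::size_t set_size;
    std::size_t multiset_size;
};

// std::set drops values that compare equivalent to one already stored;
// std::multiset keeps every value it is given.
Sizes insert_into_both(const std::vector<int>& values) {
    std::set<int> unique_values(values.begin(), values.end());
    std::multiset<int> bag(values.begin(), values.end());
    return {unique_values.size(), bag.size()};
}
```

For the input {3, 1, 3, 2, 1}, the set ends up with three elements and the multiset with five.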
Why is multiset a set [...]
It's not. The word "set" does appear in "multiset", but that does not make a multiset a set. A multiset is a generalization of a set, not necessarily itself a set. This linguistic setup is similar to a hypergraph, which is a generalization of a graph but not necessarily a graph, and to a hyperplane, which is a generalization of a plane but not necessarily a plane.
A less mathematical example would be penultimate, which is not "ultimate", or any other word with a prefix that changes the meaning of the root.
Perhaps "butterfly" and "dragonfly" would be apropos examples. Neither is a fly, despite the word "fly" appearing in both names. (For that matter, neither is buttery or draconic.) Sometimes a name is just a name.

which container from std::map or std::unordered_map is suitable for my case

I don't know how a red-black tree works with string keys. I've already seen it with numbers on YouTube and it baffled me a lot. However, I know very well how unordered_map works (the internals of hash maps). std::map stays esoteric for me, but I read and tested that if we don't make many changes to the std::map, it can beat hash maps.
My case is simple, I have a std::map of <std::string,bool>. Keys contains paths to XML elements (example of a key: "Instrument_Roots/Instrument_Root/Rating_Type"), and I use the boolean value in my SAX parser to know if we reached a particular element.
I build this map "only once"; after that, all I do is use std::find to check whether a particular "key" ("path") exists in order to set its Boolean value to true, or search for the first element that has "true" as its associated value and use its corresponding "key". Finally, I set all the Boolean values to false to guarantee that only a single "key" has a "true" value.
You shouldn't need to understand how red-black trees work in order to understand how to use a std::map. It's simply an associative array where the keys are in order (lexicographical order, in the case of string keys, at least with the default comparison function). That means that you can not only look keys up in a std::map, you can also make queries which depend on order. For example, you can find the largest key in the map which is not greater than the key you have. You can find the next larger key. Or (again in the case of strings) you can find all keys which start with the same prefix.
If you iterate over all the key-value pairs in a std::map, you will see them in order by key. That can be very useful, sometimes.
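The order-dependent queries mentioned above can be sketched with upper_bound and lower_bound. The alias PathMap and the function names floor_key and keys_with_prefix are assumptions made for this example; the default lexicographical comparison is assumed as well.

```cpp
#include <iterator>
#include <map>
#include <string>
#include <vector>

using PathMap = std::map<std::string, bool>;  // hypothetical alias

// Largest key that is not greater than `key`; m.end() if there is none.
PathMap::const_iterator floor_key(const PathMap& m, const std::string& key) {
    auto it = m.upper_bound(key);  // first key strictly greater than `key`
    return it == m.begin() ? m.end() : std::prev(it);
}

// All keys that start with `prefix`. They are contiguous in sorted order,
// so the scan starts at lower_bound(prefix) and stops at the first mismatch.
std::vector<std::string> keys_with_prefix(const PathMap& m, const std::string& prefix) {
    std::vector<std::string> out;
    for (auto it = m.lower_bound(prefix);
         it != m.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(it->first);
    }
    return out;
}
```

Neither query has a direct equivalent on std::unordered_map, which is the kind of case where the ordered container earns its overhead.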
The extra functionality comes at a price. std::map is usually slower than std::unordered_map (though not always; for large string keys, the overhead of computing the hash function might be noticeable), and the underlying data structure has a certain amount of overhead, so they may occupy more space. The usual advice is to use a std::map if you find the fact that the keys are ordered to be essential or even useful.
But if you've benchmarked and concluded that for your application, a std::map is also faster, then go ahead and use it :)
It is occasionally useful to have a map whose mapped type is bool, but only if you need to distinguish between keys whose corresponding value is false and keys which are not present in the map at all. In effect, a std::map<T, bool> (or std::unordered_map<T, bool>) provides a ternary choice for each possible key.
If you don't need to distinguish between the two false cases, and you don't frequently change a key's value, then you may well be better off with a std::set (or std::unordered_set), which is exactly the same datastructure but without the overhead of the bool in each element. (Although only one bit of the bool is useful, alignment considerations may end up using 8 additional bytes for each entry.) Other than storage space, there won't be much (if any) performance difference, though.
If you do really need a ternary case, then you would be well-advised to make the value an enum rather than a bool. What do true and false mean in the context of your usage? My guess is that they don't mean "true" and "false". Instead, they mean something like "is an attribute path" and "is an element path". That distinction could be made much clearer (and therefore less accident-prone) by using enum PathType {ATTRIBUTE_PATH, ELEMENT_PATH};. That will not involve any additional resources, since the bool is occupying eight bytes of storage in any case (because of alignment).
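As a rough illustration of the ternary case, the sketch below uses the PathType enum from the paragraph above. The function describe and the meaning attached to each enumerator are guesses, as the answer itself says.

```cpp
#include <map>
#include <string>

enum PathType { ATTRIBUTE_PATH, ELEMENT_PATH };

// Hypothetical helper showing the three possible outcomes for a path:
// not present in the map, present as an attribute path, or present as
// an element path.
std::string describe(const std::map<std::string, PathType>& paths,
                     const std::string& path) {
    auto it = paths.find(path);
    if (it == paths.end()) return "unknown";
    return it->second == ATTRIBUTE_PATH ? "attribute" : "element";
}
```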
By the way, there is no guarantee that the underlying data structure is precisely a red-black tree, although the performance guarantees would be difficult to achieve without some kind of self-balancing tree. I don't know of such an implementation, but it would be possible to use k-ary trees (for some small k) to take advantage of SIMD vector comparison operations, for example. Of course, that would need to be customized for appropriate key types.
If you do want to understand red-black trees, you could do worse than Robert Sedgewick's standard textbook on Algorithms. On the book's website, you'll find a brief illustrated explanation in the chapter on balanced trees.
I would recommend std::unordered_set: you don't really need to store the boolean flag, and you don't need to keep these XML paths in sorted order, so std::unordered_set seems to me the logical and most efficient choice.
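One possible shape for the std::unordered_set approach is sketched below. The class PathTracker and its members are hypothetical; the sketch assumes that at most one path is "reached" at a time and that no real path is the empty string.

```cpp
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical tracker: the set holds the known element paths, and a single
// string records the path reached last (empty means "none"). This stands in
// for the map of bools where only one value is ever true.
class PathTracker {
public:
    explicit PathTracker(std::unordered_set<std::string> known)
        : known_(std::move(known)) {}

    // Marks `path` as the active one if it is known; returns whether it was.
    bool reach(const std::string& path) {
        if (known_.find(path) == known_.end()) return false;
        active_ = path;
        return true;
    }

    const std::string& active() const { return active_; }
    void reset() { active_.clear(); }

private:
    std::unordered_set<std::string> known_;
    std::string active_;
};
```

With this layout, finding the active path and resetting it no longer require a pass over every entry.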

Is there a linked hash set in C++?

Java has a LinkedHashSet, which is a set with a predictable iteration order. What is the closest available data structure in C++?
Currently I'm duplicating my data by using both a set and a vector. I insert my data into the set. If the insertion succeeds (meaning the data was not already present in the set), then I push_back into the vector. When I iterate through the data, I use the vector.
If you can use it, then a Boost.MultiIndex with sequenced and hashed_unique indexes is the same data structure as LinkedHashSet.
Failing that, keep an unordered_set (or hash_set, if that's what your implementation provides) of some type with a list node in it, and handle the sequential order yourself using that list node.
The problems with what you're currently doing (set and vector) are:
Two copies of the data (might be a problem when the data type is large, and it means that your two different iterations return references to different objects, albeit with the same values. This would be a problem if someone wrote some code that compared the addresses of the "same" elements obtained in the two different ways, expecting the addresses to be equal, or if your objects have mutable data members that are ignored by the order comparison, and someone writes code that expects to mutate via lookup and see changes when iterating in sequence).
Unlike LinkedHashSet, there is no fast way to remove an element in the middle of the sequence. And if you want to remove by value rather than by position, then you have to search the vector for the value to remove.
set has different performance characteristics from a hash set.
If you don't care about any of those things, then what you have is probably fine. If duplication is the only problem then you could consider keeping a vector of pointers to the elements in the set, instead of a vector of duplicates.
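The "vector of pointers" variant could look roughly like this. The class name InsertionOrderedSet is invented, and removal is left out, since it would still need a linear search of the vector as noted above.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical wrapper: a std::set for uniqueness plus a vector of pointers
// for insertion order. std::set never relocates its nodes, so a pointer
// stays valid until the element it points to is erased.
class InsertionOrderedSet {
public:
    InsertionOrderedSet() = default;
    // A copy's pointers would still refer to the source's set, so copying
    // is disabled in this sketch.
    InsertionOrderedSet(const InsertionOrderedSet&) = delete;
    InsertionOrderedSet& operator=(const InsertionOrderedSet&) = delete;

    // Returns true if `value` was not already present.
    bool insert(const std::string& value) {
        auto result = items_.insert(value);
        if (result.second) order_.push_back(&*result.first);
        return result.second;
    }

    const std::vector<const std::string*>& in_order() const { return order_; }

private:
    std::set<std::string> items_;
    std::vector<const std::string*> order_;
};
```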
To replicate Java's LinkedHashSet in C++, I think you will need two plain std::map objects. (Note that this gives you something closer to a LinkedTreeSet than a real LinkedHashSet, with O(log n) insertion and deletion.)
One uses the actual value as the key and the insertion order (usually an int or long int) as the value.
The other is the reverse: it uses the insertion order as the key and the actual value as the value.
To insert, use std::map::find on the first std::map to check whether an identical object already exists in it.
If it already exists, ignore the new one.
If it does not, add the object with the next insertion order to both of the std::map objects mentioned above.
To iterate in insertion order, iterate through the second std::map, since it is sorted by insertion order (anything stored in a std::map or std::set is kept sorted automatically).
To remove an element, use std::map::find on the first std::map to get its insertion order. Use that insertion order to remove the element from the second std::map, then remove the object from the first one.
Please note that this solution is not perfect. If you plan to use it long term, you will need to "compact" the insertion order after a certain number of removals, since you will eventually run out of insertion-order values (2^32 for unsigned int or 2^64 for unsigned long long int).
To do this, put all the "value" objects into a vector, clear both maps, and then re-insert the values from the vector into both maps. This procedure takes O(n log n) time.
If you're using C++11, you can replace the first std::map with std::unordered_map to improve efficiency. You won't be able to replace the second one, though, because std::unordered_map indexes by hash code and therefore cannot be iterated in sorted order.
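The two-map scheme described in this answer can be sketched as follows. The class name LinkedTreeSet and its member names are made up, a 64-bit counter is assumed, and the compaction step is omitted.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of the two-map idea: one map from value to insertion order and
// one from insertion order back to value.
class LinkedTreeSet {
public:
    // Returns true if `value` was not already present.
    bool insert(const std::string& value) {
        if (order_of_.find(value) != order_of_.end()) return false;
        const std::uint64_t order = next_order_++;
        order_of_.emplace(value, order);
        value_at_.emplace(order, value);
        return true;
    }

    // Looks up the insertion order first, then erases from both maps.
    bool erase(const std::string& value) {
        auto it = order_of_.find(value);
        if (it == order_of_.end()) return false;
        value_at_.erase(it->second);
        order_of_.erase(it);
        return true;
    }

    // The second map is sorted by insertion order, so a plain traversal
    // yields the values in the order they were inserted.
    std::vector<std::string> in_insertion_order() const {
        std::vector<std::string> out;
        for (const auto& entry : value_at_) out.push_back(entry.second);
        return out;
    }

private:
    std::map<std::string, std::uint64_t> order_of_;
    std::map<std::uint64_t, std::string> value_at_;
    std::uint64_t next_order_ = 0;
};
```

Each value is stored twice here, which is the same duplication cost as the set-plus-vector approach from the question.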
You might want to know that std::map does not give you constant lookup time; its lookups are O(log n). And using the std::tr1::unordered containers is risky business because they give up any ordering to get constant lookup time.
Try a Boost multi-index container if you want more flexibility.
The way you described your combination of std::set and std::vector sounds like what you should be doing, except by using std::unordered_set (equivalent to Java's HashSet) and std::list (doubly-linked list). You could also use std::unordered_map to store the key (for lookup) along with an iterator into the list where to find the actual objects you store (if the keys are different from the objects (or only a part of them)).
The boost library does provide a number of these types of combinations of containers and look-up indices. For example, this bidirectional list with fast look-ups example.
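A standard-library-only sketch of the std::unordered_map plus std::list combination described above might look like this. The class name LinkedHashSet and its interface are assumptions for the example, and the sketch stores each value twice (once as a map key, once in the list).

```cpp
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// The list keeps insertion order; the hash map finds an element's list
// node in expected constant time. std::list iterators stay valid until
// their own element is erased, so they are safe to store.
class LinkedHashSet {
public:
    LinkedHashSet() = default;
    // A copy's stored iterators would still refer to the source's list,
    // so copying is disabled in this sketch.
    LinkedHashSet(const LinkedHashSet&) = delete;
    LinkedHashSet& operator=(const LinkedHashSet&) = delete;

    bool insert(const std::string& value) {
        if (index_.count(value) != 0) return false;
        order_.push_back(value);
        index_.emplace(value, std::prev(order_.end()));
        return true;
    }

    // Removal from the middle needs no search of the sequence.
    bool erase(const std::string& value) {
        auto it = index_.find(value);
        if (it == index_.end()) return false;
        order_.erase(it->second);
        index_.erase(it);
        return true;
    }

    bool contains(const std::string& value) const {
        return index_.count(value) != 0;
    }

    const std::list<std::string>& in_order() const { return order_; }

private:
    std::list<std::string> order_;
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```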

Is the order of two same unordered_maps the same?

In other words, if I fill two unordered_map, or unordered_set, objects with exactly the same content and the same hashing function, will iterating over them give the same sequence of key/value pairs?
If so, then what are the conditions for this to hold (e.g. same hashing function, same keys, not necessarily same values).
No. There is no requirement, for example, that objects that have the same hash be placed in any particular order. In fact, in general it's impossible for an unordered map to do this because the only information it has access to is the hash value.
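If the underlying goal is to compare or traverse the contents in a repeatable way, that can be done without relying on iteration order. In the sketch below, the helper sorted_elements is a made-up name; operator== on unordered containers is standard and compares contents regardless of order.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: copies the elements into a vector and sorts it,
// giving a traversal order that depends only on the contents.
std::vector<std::string> sorted_elements(const std::unordered_set<std::string>& s) {
    std::vector<std::string> out(s.begin(), s.end());
    std::sort(out.begin(), out.end());
    return out;
}
```

Two sets filled with the same strings in different orders compare equal with ==, and sorted_elements returns the same vector for both, whatever their internal bucket layout.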
The order in this case is unspecified. So in some situations the sequence will be the same, and in others it will differ. You can't be sure of anything. The types you mentioned are named unordered for a reason. Relying on them as if they were ordered is very bad and extremely dangerous style.
You may find that your compiler and its standard library behave in some particular way that you would like to rely on. But you can't be sure. You mustn't be sure! You do not know what conditions cause that behavior, and you can never be sure that a new compiler version will not change the behavior you need.
What is simply forbidden in other languages is often left unspecified in C and C++. But you should treat it as forbidden, too.
See the C FAQ on the problem of undefined behavior. The concept is common to C and C++.
Well first I will quote MSDN:
The actual order of elements in the controlled sequence depends on the hash function, the comparison function, the order of insertion, the maximum load factor, and the current number of buckets. You cannot in general predict the order of elements in the controlled sequence. You can always be assured, however, that any subset of elements that have equivalent ordering are adjacent in the controlled sequence.
The above quote is the same for unordered_map and unordered_set and as the name implies the sequence is unspecified, however, they do mention that equivalent keys are adjacent.
If you need guaranteed ordering, this is not the data structure for you. In fact, if you're mainly going to be doing iteration, unordered_map/set is not the data structure for you, either.
For iteration, a std::map will prove to be the better data structure, as going from one node to the next is less algorithmically complex. And the order of iteration for the objects in std::map is guaranteed by the spec (and is actually a defining property of the structure itself). (This is all assuming you're using the same comparison operator, obviously.) No hashing is involved in std::map.
Suffice it to say, it sounds like you're barking up the wrong tree here. unordered_map should generally be used for benefits such as O(1) lookup, not for storing a list of objects and then iterating over them. It most definitely should not be used if you're trying to get a deterministic order of iteration.
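To illustrate the std::map guarantee: two maps with the same keys and the same comparison iterate in the same order no matter how they were filled. The helper keys_in_order below is a hypothetical name used only for this sketch.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: collects the keys in iteration order. For std::map
// that order is determined by the comparison (std::less here), not by the
// order in which the entries were inserted.
std::vector<std::string> keys_in_order(const std::map<std::string, int>& m) {
    std::vector<std::string> keys;
    for (const auto& entry : m) keys.push_back(entry.first);
    return keys;
}
```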

Design Pattern for Data Structure

This question has been answered, so what follows below is an explanation of what I wanted to achieve.
I wanted to create a tabular data structure designed to allow efficient access to any row entry through a primary column that could possibly be hashed. I thought that the best way to go about this would be to maintain a vector of doubly-linked lists, each of which would represent one column, and a map that would contain mappings of primary column entry hashes to nodes. Now, the first mistake I made is in thinking that I would need to create my own implementation of a doubly-linked list in order to be able to store pointers to nodes, when in fact the standard states that iterators to std::list do not get invalidated as a result of insertion or splicing (see larsmans's answer). Here's some pseudocode to illustrate what I wanted to do previously. Assume the existence of a typename T representing the entry type and the existence of a dlist and node class, as described previously.
typedef dlist<T> column_type;
typedef vector<T> row_type;
typedef ptr_unordered_map<int32_t, row_type> hash_type;
shared_ptr<ptr_vector<column_type> > columns;
shared_ptr<hash_type> hashes;
Now, after reading larsmans's answer, I learned that I wouldn't need any of this since Boost.MultiIndex fulfills all of my needs as it is. Even if I did, Boost.Intrusive offers more efficient data structures to accomplish what I describe.
Thanks to all who took interest in the question or offered help! If you have any more questions, add another comment and I'll do my best to clarify the question further.
front() should return a reference to a node containing the value_type
Sounds like you're thinking of begin instead of front, in STL/Boost terms, except that begin methods usually return iterators instead of references.
How would I be able to use a map of key hashes to std::list::iterator types and allow for addition of rows without having the entries in the map get outdated
Just do it: "lists have the important property that insertion and splicing do not invalidate iterators to list elements, and that even removal invalidates only the iterators that point to the elements that are removed" (STL docs).
If you wanted, you could maintain a single std::list for the entire table and a vector of iterators into it to represent the starting points of rows.
Besides, have you looked at Boost.Intrusive and Boost.MultiIndex? And did you know that an std::map (red-black tree) of hashes is a very suboptimal way of representing a hash table?
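The iterator-stability property quoted above is what makes a map of keys to std::list iterators workable. Here is a minimal sketch, assuming integer keys (standing in for the question's primary-column hashes) that are unique, and string rows; the class RowIndex is hypothetical.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Rows live in a std::list; the map stores an iterator to each row.
// Adding rows at either end, or removing other rows, leaves the stored
// iterators valid.
class RowIndex {
public:
    RowIndex() = default;
    // A copy's iterators would still refer to the source's list.
    RowIndex(const RowIndex&) = delete;
    RowIndex& operator=(const RowIndex&) = delete;

    // Both insertion functions assume `key` is not already in use.
    void append(int key, const std::string& row) {
        rows_.push_back(row);
        by_key_[key] = std::prev(rows_.end());
    }

    void prepend(int key, const std::string& row) {
        rows_.push_front(row);
        by_key_[key] = rows_.begin();
    }

    // Only the iterator of the removed row becomes invalid, and it is
    // dropped from the map at the same time.
    void remove(int key) {
        auto it = by_key_.find(key);
        if (it == by_key_.end()) return;
        rows_.erase(it->second);
        by_key_.erase(it);
    }

    // Returns nullptr if the key is unknown.
    const std::string* find(int key) const {
        auto it = by_key_.find(key);
        return it == by_key_.end() ? nullptr : &*it->second;
    }

private:
    std::list<std::string> rows_;
    std::unordered_map<int, std::list<std::string>::iterator> by_key_;
};
```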