Multiset without Compare? - c++

I want to use multiset to count some custom defined keys. The keys are not comparable numerically, comparing two keys does not mean anything, but their equality can be checked.
I see that multiset template wants a Compare to order the multiset. The order is not important to me, only the counts are important. If I omit Compare completely what happens? Does multiset work without any problems for my custom keys? If I cannot use std::multiset what are my alternatives?

If you can only compare keys for equality then you cannot use std::multiset. For associative containers your key type must have a strict weak ordering imposed by a comparison operation.
The strict weak ordering doesn't necessarily have to be numerical.
[For use in an associative container, you don't actually need an equality comparison. Key equivalence is determined by !compare(a, b) && !compare(b, a).]
If you really can't define an ordering for your keys, then your only option is to use a sequence container of key-value pairs and a linear search for lookup. Needless to say, this will be less efficient for set-like operations than a multiset, so you should try hard to create an ordering if at all possible.

You cannot use std::multiset if you don't have a strict weak ordering. Your options are:
Impose a strict-weak ordering on your data. If your key is a "linear" data structure, it is usually a good idea to compare it lexicographically.
Use an unordered container equivalent, e.g., boost::unordered_multiset (or std::unordered_multiset since C++11). For that, you will need to make your custom data type hashable, which is often easier than imposing some kind of order.
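A minimal sketch of both options, assuming a hypothetical key type `Key` with two fields (the type, its fields, and the helper names `KeyLess`, `KeyEqual`, and `KeyHash` are illustrative, not from the question). The ordering is arbitrary but consistent, which is all std::multiset needs. The hash combination is a simple one chosen for brevity.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <tuple>
#include <unordered_set>

// Hypothetical key type, used only for illustration.
struct Key {
    std::string name;
    int variant;
};

// Option 1: an arbitrary but consistent lexicographic order over the fields.
struct KeyLess {
    bool operator()(const Key& a, const Key& b) const {
        return std::tie(a.name, a.variant) < std::tie(b.name, b.variant);
    }
};

// Option 2: equality plus a hash, for the unordered container.
struct KeyEqual {
    bool operator()(const Key& a, const Key& b) const {
        return a.name == b.name && a.variant == b.variant;
    }
};

struct KeyHash {
    std::size_t operator()(const Key& k) const {
        // Simple hash combination; adequate for a sketch.
        std::size_t h1 = std::hash<std::string>{}(k.name);
        std::size_t h2 = std::hash<int>{}(k.variant);
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

using OrderedCounts = std::multiset<Key, KeyLess>;
using HashedCounts = std::unordered_multiset<Key, KeyHash, KeyEqual>;
```

With either alias, `insert(key)` records an occurrence and `count(key)` returns how many times that key was inserted.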

If you omit the Compare completely, it will get the default value, which is std::less (which gives the result of the < operator applied to your key). That will only compile if operator< is defined for your key.
The reason for having an ordering is that it allows the implementation to look up elements more quickly by their key (when inserting, deleting, etc.). To understand why, imagine looking words up in a dictionary. Traditional dictionaries use alphabetical order, which makes words easy to look up. If you were preparing a dictionary for a language that isn't easily ordered, say a pictographic language, then either it would be very hard to find words in it at all (you'd have to search the whole dictionary), or you'd try to find a logical way to order them (e.g. by putting all the pictures that can be drawn with one pen stroke first, then two strokes, etc.), because even if this order was completely arbitrary, it would make finding entries in the dictionary far more efficient.
Similarly, even if your keys don't need to be ordered for your own purposes, and don't have any natural order, you can usually define an ordering that is good enough to address these concerns. The ordering must be transitive (if a<b and b<c then a<c), irreflexive (a<a is never true), and asymmetric (a<b and b<a are never both true). Ideally it should order all distinct elements (if a and b are different then either a<b or b<a), though you can get away with that not being true (that is, with a strict weak ordering rather than a total one), although that's rather technical.
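Indeed, perhaps the most obvious use of that latitude is the rare case where it is completely impossible to order the items, in which case you can supply a comparison function which always returns false. That is a valid strict weak ordering, but be aware that it makes every key equivalent to every other key, so count() and find() can no longer tell keys apart. You would have to scan the container with your own equality check, which is no better than a linear search over a sequence container.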

So you have two important criteria which you listed.
You don't care about order
comparison of keys does not mean anything
and one assumed,
the fact that you are using multiset implies that there are many instances
So why not use std::vector, std::deque or std::list? Then you can take advantage of the various algorithms that rely only on an equality check (such as std::count or std::count_if).
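As a rough illustration of this approach, the sketch below counts occurrences in a std::vector using only operator==. The `Tag` type and the `count_key` helper are hypothetical names introduced for the example. Each count is a linear scan, so this suits small collections.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical key type that supports equality but no ordering.
struct Tag {
    std::string label;
};

bool operator==(const Tag& a, const Tag& b) {
    return a.label == b.label;
}

// Counts how many stored keys compare equal to `wanted` (linear time).
std::size_t count_key(const std::vector<Tag>& keys, const Tag& wanted) {
    return static_cast<std::size_t>(
        std::count(keys.begin(), keys.end(), wanted));
}
```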

Related

How to implement set::find() to match only the key of a pair?

I have a container class which stores data in a std::set. I don't need or use the extended facilities provided by std::map. There is a method values() which returns a const reference to the private set so if I were to use a map instead then I would have to copy the entire container. I want to keep it as a std::set.
The set contains objects of a class similar to std::pair with a key and a value and implements operator < for use in a set.
I have a method in the container which accepts the 'key' portion of the pair for the purpose of searching the set and returning a complete pair while only matching the key.
I can iterate through the set sequentially but then I lose the O(log N).
Also note that the set needs to be sorted, which removes the option of using an unordered_set.
It's not clear exactly what your operator< actually compares, but the long and the short of it is that with a std::set, the only way to efficiently search the set is by using its defined comparison function.
Based on your question, I am assuming that your set is
std::set<std::pair<firstType, secondType>, ComparisonClass>
With ComparisonClass implementing the strict weak ordering. Or you could be using a:
std::set<PairClass>
With PairClass being a subclass of std::pair that implements an operator< for the strict weak ordering. Your question appears to describe one or the other; either way, both alternatives are logically equivalent for the purpose of the following answer:
If your operator< implements strict weak ordering based on both the value pair's first and second, then that's pretty much it. You can only execute the set's built-in logarithmic search by searching for the same first and second.
There's no easy way to do anything other than that. So, what now?
Well, the root problem seems to be that you might not be using the right container. Consider the following container that, with a little bit of work, will be equivalent to your set:
std::multimap<firstType, std::set<secondType>>
That is, your container is a multimap keyed by your pair's first, with the value of your multimap being a std::set of all the secondType that are paired up with a given firstType.
The only thing you have to be careful about here is to define the insert and remove operations on this container so that you never end up with a firstType whose std::set value is empty. As long as this condition is met, this should be logically equivalent to a std::set of your std::pairs. Furthermore:
1) You can still implement a logarithmic search for a firstType+secondType by first doing a logarithmic search on the firstType, grabbing the value std::set, and then doing a logarithmic search on that. Logically equivalent.
2) You can implement a logarithmic search for just the firstType by doing only the first half of the full search. This gives you the value std::set, which provides the equivalent of all pairs that have the same firstType.
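The idea can be sketched roughly as follows. This is only an illustration under a few assumptions: `First` and `Second` are placeholder types standing in for firstType and secondType, the `PairIndex` class and its member names are hypothetical, and the sketch uses std::map rather than std::multimap because a single entry per first value is enough when the mapped value is already a set.

```cpp
#include <map>
#include <set>
#include <string>

// Placeholder types standing in for firstType and secondType.
using First = std::string;
using Second = int;

class PairIndex {
public:
    void insert(const First& f, const Second& s) {
        data_[f].insert(s);
    }

    void erase(const First& f, const Second& s) {
        auto it = data_.find(f);
        if (it == data_.end()) return;
        it->second.erase(s);
        // Maintain the invariant: never keep a key with an empty set.
        if (it->second.empty()) data_.erase(it);
    }

    // Full search: logarithmic on first, then logarithmic on second.
    bool contains(const First& f, const Second& s) const {
        auto it = data_.find(f);
        return it != data_.end() && it->second.count(s) != 0;
    }

    // Key-only search: returns nullptr when no pair has this first.
    const std::set<Second>* find_by_first(const First& f) const {
        auto it = data_.find(f);
        return it == data_.end() ? nullptr : &it->second;
    }

private:
    std::map<First, std::set<Second>> data_;
};
```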

which container from std::map or std::unordered_map is suitable for my case

I don't know how a red-black tree works with string keys. I've seen it explained with numbers on YouTube, and it baffled me. However, I know very well how unordered_map works (the internals of hash maps). std::map remains esoteric to me, but I have read and tested that if the map doesn't change much, it can beat hash maps.
My case is simple: I have a std::map<std::string, bool>. The keys contain paths to XML elements (example of a key: "Instrument_Roots/Instrument_Root/Rating_Type"), and I use the boolean value in my SAX parser to know whether we have reached a particular element.
I build this map only once. After that, all I do is use std::find to check whether a particular key (path) exists in order to set its boolean value to true, or search for the first element whose associated value is true and use its corresponding key. Finally, I set all the boolean values to false to guarantee that only a single key has a true value.
You shouldn't need to understand how red-black trees work in order to understand how to use a std::map. It's simply an associative array where the keys are in order (lexicographical order, in the case of string keys, at least with the default comparison function). That means that you can not only look keys up in a std::map, you can also make queries which depend on order. For example, you can find the largest key in the map which is not greater than the key you have. You can find the next larger key. Or (again in the case of strings) you can find all keys which start with the same prefix.
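To make the prefix query concrete, here is a small sketch. The helper name `keys_with_prefix` is hypothetical, and it assumes the default std::less<std::string> comparison. Because keys sharing a prefix are contiguous in sorted order, lower_bound finds the first candidate and the loop stops at the first key that no longer matches.

```cpp
#include <map>
#include <string>
#include <vector>

// Collects every key in `paths` that starts with `prefix`.
std::vector<std::string> keys_with_prefix(
        const std::map<std::string, bool>& paths,
        const std::string& prefix) {
    std::vector<std::string> result;
    // lower_bound returns the first key that is not less than the prefix.
    for (auto it = paths.lower_bound(prefix);
         it != paths.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        result.push_back(it->first);
    }
    return result;
}
```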
If you iterate over all the key-value pairs in a std::map, you will see them in order by key. That can be very useful, sometimes.
The extra functionality comes at a price. std::map is usually slower than std::unordered_map (though not always; for large string keys, the overhead of computing the hash function might be noticeable), and the underlying data structure has a certain amount of overhead, so they may occupy more space. The usual advice is to use a std::map if you find the fact that the keys are ordered to be essential or even useful.
But if you've benchmarked and concluded that for your application, a std::map is also faster, then go ahead and use it :)
It is occasionally useful to have a map whose mapped type is bool, but only if you need to distinguish between keys whose corresponding value is false and keys which are not present in the map at all. In effect, a std::map<T, bool> (or std::unordered_map<T, bool>) provides a ternary choice for each possible key.
If you don't need to distinguish between the two false cases, and you don't frequently change a key's value, then you may well be better off with a std::set (or std::unordered_set), which is exactly the same data structure but without the overhead of the bool in each element. (Although only one bit of the bool is useful, alignment considerations may end up using 8 additional bytes for each entry.) Other than storage space, there won't be much (if any) performance difference, though.
If you do really need a ternary case, then you would be well-advised to make the value an enum rather than a bool. What do true and false mean in the context of your usage? My guess is that they don't mean "true" and "false". Instead, they mean something like "is an attribute path" and "is an element path". That distinction could be made much clearer (and therefore less accident-prone) by using enum PathType {ATTRIBUTE_PATH, ELEMENT_PATH};. That will not involve any additional resources, since the bool is occupying eight bytes of storage in any case (because of alignment).
By the way, there is no guarantee that the underlying data structure is precisely a red-black tree, although the performance guarantees would be difficult to achieve without some kind of self-balancing tree. I don't know of such an implementation, but it would be possible to use k-ary trees (for some small k) to take advantage of SIMD vector comparison operations, for example. Of course, that would need to be customized for appropriate key types.
If you do want to understand red-black trees, you could do worse than Robert Sedgewick's standard textbook on Algorithms. On the book's website, you'll find a brief illustrated explanation in the chapter on balanced trees.
I would recommend std::unordered_set: you don't really need to store the boolean flag, and you don't need to keep these XML paths in sorted order, so std::unordered_set seems to me the logical and most efficient choice.

C++ container for storing sorted unique values with different predicates for sorting and uniqueness

I have a record with 2 fields (say, A and B). 2 instances of the record should be considered equal, if their As are equal. On the other hand, a collection of the record instances should be sorted by the B field.
Is there a container like std::set which can be defined with two different predicates, one for sorting and one for uniqueness, so that I could avoid explicit sorting and just append elements? If not, how can I work around it?
There is nothing in the standard library which would support your use case directly. You could use Boost.MultiIndex (multi_index_container) for this purpose, though. Something like this:
typedef multi_index_container<
    Record,
    indexed_by<
        ordered_non_unique<member<Record, decltype(Record::B), &Record::B>>,
        hashed_unique<member<Record, decltype(Record::A), &Record::A>>
    >
> RecordContainer;
(Code assuming correct headers and using namespace directives for brevity).
The idea is to create a container with two indices, one which will guarantee the ordering based on B and the other which will guarantee uniqueness based on A. decltype() in the code can of course be replaced by the actual types of A and B which you know, but I don't.
The order of the indices matters slightly, since for convenience, the container itself offers the same interface as its first index. You can always access any index by using container.get<N>(), though.
The code is not intended as a copy&paste solution, but as a starting point. You can add customisations, index tags etc. Refer to Boost documentation for details.
"Is there a container like std::set, which can be defined with two different predicates, one for sorting and one for uniqueness?"
std::set decides whether a particular element is unique in terms of the sorting criterion you provide to it (by default it uses std::less<>): two elements are duplicates when neither orders before the other. You cannot pass it a separate criterion for checking the equality of elements.
With that said, however, you can use a predicate with algorithms to check for equality of elements of std::set.
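If Boost is not available, one possible standard-library-only workaround for the original question is to pair two containers: one that tracks which A values have been seen, and one that keeps the records sorted by B. This is only a sketch of that idea, not the Boost.MultiIndex approach; the `Record` layout (two int fields), `OrderByB`, and `RecordSet` are all assumptions made for illustration.

```cpp
#include <set>
#include <unordered_set>

// Hypothetical record: `a` decides uniqueness, `b` decides ordering.
struct Record {
    int a;
    int b;
};

struct OrderByB {
    bool operator()(const Record& x, const Record& y) const {
        return x.b < y.b;
    }
};

class RecordSet {
public:
    // Returns false (and stores nothing) if a record with the same `a` exists.
    bool insert(const Record& r) {
        if (!seen_a_.insert(r.a).second) return false;
        by_b_.insert(r);
        return true;
    }

    // Records in ascending order of `b`; equal `b` values are allowed.
    const std::multiset<Record, OrderByB>& sorted() const {
        return by_b_;
    }

private:
    std::unordered_set<int> seen_a_;
    std::multiset<Record, OrderByB> by_b_;
};
```

Removal is omitted here; it would have to update both containers to keep them consistent.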

When do you call stable_sort() on scalars?

Is it ever good to call stable_sort instead of sort on scalar types (i.e. int, long, etc.) with the default comparator?
If so, when should you do this?
If not, then why don't standard libraries just forward such calls to sort? Wouldn't that be much faster?
Stable sorts are really only useful when the items you are sorting have satellite information.
From CLRS (Introduction to Algorithms, 3rd Ed.):
"In practice, the numbers to be sorted are rarely isolated values. Each is usually part of a collection of data called a record. Each record contains a key, which is the value to be sorted. The remainder of the record consists of satellite data, which are usually carried around with the key. In practice, when a sorting algorithm permutes the keys, it must permute the satellite data as well."
When a sort is stable, it means that ties are broken in the sorted array by the items' original ordering. If you are only sorting int and long types, you don't need a stable sort.
There should be no difference (with the possible exception of things like -0.0 and 0.0). However, I do not think there is any need to forward such calls, because std::sort and std::stable_sort should not need to know what they are sorting, so long as the comparison operation compiles. These functions don't need to be too smart.
With the default comparator specifically (implying the natural strict ordering)? I don't see any use for stable sorting on scalars in that case. Stable sorting can't provide any additional benefits in situations when equivalent values (according to the comparator) are indistinguishable. (Although @Andrey Tuganov in his answer makes an interesting and relevant remark about negative zeros.)
Nevertheless stable sorting on scalars might be useful when the ordering criterion is weaker than the natural strict ordering. For example, you can write a comparison predicate that will say that any odd number is greater than any even number. In that case the resultant ordering will simply partition the array into contiguous blocks of even and odd numbers (in that order). If you are interested in keeping the relative order of these numbers unchanged, you need stable sorting algorithm.
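The even/odd case can be sketched like this. The names `even_before_odd` and `partition_by_parity` are made up for the example. The predicate treats all numbers of the same parity as equivalent, so only a stable sort is guaranteed to keep their original relative order.

```cpp
#include <algorithm>
#include <vector>

// Weak ordering: every even number sorts before every odd number;
// two numbers of the same parity are equivalent.
bool even_before_odd(int a, int b) {
    return (a % 2 == 0) && (b % 2 != 0);
}

// Returns a copy with evens first and odds second, each group
// keeping the relative order it had in the input.
std::vector<int> partition_by_parity(std::vector<int> values) {
    std::stable_sort(values.begin(), values.end(), even_before_odd);
    return values;
}
```

For the input {5, 2, 7, 4, 1, 6} this yields {2, 4, 6, 5, 7, 1}. With std::sort, the evens would still come first, but the order within each group would not be guaranteed.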

Is the order of two same unordered_maps the same?

In other words, if I fill two unordered_map, or unordered_set, objects with exactly the same content and the same hashing function, will iterating over them give the same sequence of key/value pairs?
If so, then what are the conditions for this to hold (e.g. same hashing function, same keys, not necessarily same values).
No. There is no requirement, for example, that objects that have the same hash be placed in any particular order. In fact, in general it's impossible for an unordered map to do this because the only information it has access to is the hash value.
The order in this case is unspecified. So in some situations the sequence will be the same, and in others it will be different. You can't be sure of anything. The types you mentioned are named unordered for a reason. Using them as ordered ones is very bad and extremely dangerous style.
You may find that your implementation behaves in some particular way that you would like to rely on. But you can't be sure. You mustn't be sure! You do not know what conditions cause such behavior. You can never be sure that a change of compiler or library version will not change the behavior you need.
What is simply forbidden in other languages is often left unspecified in C/C++. But you should treat it as forbidden, too.
See the C FAQ on undefined and unspecified behavior. The concept is common to C and C++.
Well first I will quote MSDN:
The actual order of elements in the controlled sequence depends on the hash function, the comparison function, the order of insertion, the maximum load factor, and the current number of buckets. You cannot in general predict the order of elements in the controlled sequence. You can always be assured, however, that any subset of elements that have equivalent ordering are adjacent in the controlled sequence.
The above quote is the same for unordered_map and unordered_set, and as the name implies, the sequence is unspecified. However, they do mention that equivalent keys are adjacent.
If you need guaranteed ordering, this is not the data structure for you. In fact, if you're mainly going to be doing iteration, unordered_map/set is not the data structure for you, either.
For iteration, a std::map will prove to be the better data structure, as going from one node to the next is less algorithmically complex. And the order of iteration for the objects in std::map is guaranteed by the spec (and is actually a defining property of the structure itself). (This is all assuming you're using the same comparison operator, obviously.) No hashing is involved in std::map.
Suffice to say, it sounds like you're barking up the wrong tree here. unordered_map should generally be used for benefits such as O(1) lookup, and not for storing a list of objects and then iterating over them. It most definitely should not be used if you're trying to get a deterministic order of iteration.
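If the real goal is to compare two unordered maps, or to walk one in a reproducible order, neither requires relying on iteration order. The sketch below assumes std::string keys and int values; the helper name `sorted_items` is hypothetical. operator== on std::unordered_map compares contents regardless of order, and copying the entries into a sorted vector gives a deterministic sequence.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Copies the entries into a vector and sorts them by key (then value),
// giving an order that does not depend on hashing or insertion order.
std::vector<std::pair<std::string, int>> sorted_items(
        const std::unordered_map<std::string, int>& m) {
    std::vector<std::pair<std::string, int>> items(m.begin(), m.end());
    std::sort(items.begin(), items.end());
    return items;
}
```

Two maps with the same content compare equal with `a == b` even when they were filled in different orders, and `sorted_items(a) == sorted_items(b)` then holds as well.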