Creating unordered_set of unordered_set - c++

I want to create a container that will store unique sets of integers inside.
I want to create something similar to
std::unordered_set<std::unordered_set<unsigned int>>
But g++ does not let me do that and says:
invalid use of incomplete type 'struct std::hash<std::unordered_set<unsigned int> >'
What I want to achieve is to have unique sets of unsigned ints.
How can I do that?

I'm adding yet another answer to this question as currently no one has touched upon a key point.
Everyone is telling you that you need to create a hash function for unordered_set<unsigned>, and this is correct. You can do so by specializing std::hash<unordered_set<unsigned>>, or you can create your own functor and use it like this:
unordered_set<unordered_set<unsigned>, my_unordered_set_hash_functor> s;
Either way is fine. However there is a big problem you need to watch out for:
For any two unordered_set<unsigned> that compare equal (x == y), they must hash to the same value: hash(x) == hash(y). If you fail to follow this rule, you will get run time errors. Also note that the following two unordered_sets compare equal (using pseudo code here for clarity):
{1, 2, 3} == {3, 2, 1}
Therefore hash({1, 2, 3}) must equal hash({3, 2, 1}). Said differently, the unordered containers have an equality operator where order does not matter. So however you construct your hash function, its result must be independent of the order of the elements in the container.
Alternatively you can replace the equality predicate used in the unordered_set such that it does respect order:
unordered_set<unordered_set<unsigned>, my_unordered_set_hash_functor,
              my_unordered_equal> s;
The burden of getting all of this right makes:
unordered_set<set<unsigned>, my_set_hash_functor>
look fairly attractive. You still have to create a hash functor for set<unsigned>, but now you don't have to worry about getting the same hash code for {1, 2, 3} and {3, 2, 1}. Instead you have to make sure these hash codes are different.
I note that Walter's answer gives a hash functor that has the right behavior: it ignores order in computing the hash code. But then his answer (currently) tells you that this is not a good solution. :-) It actually is a good solution for unordered containers. An even better solution would be to return the sum of the individual hashes instead of hashing the sum of the elements.
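To illustrate, here is a minimal sketch of an order-independent functor for the my_unordered_set_hash_functor slot above (the body is my own illustration, not code from the answer): it sums the hashes of the individual elements, so permutations of the same elements hash identically.
#include <cstddef>
#include <functional>
#include <unordered_set>
struct my_unordered_set_hash_functor
{
    std::size_t operator()(const std::unordered_set<unsigned>& s) const
    {
        std::size_t h = 0;
        for (unsigned e : s)
            h += std::hash<unsigned>()(e);  // '+' is commutative, so element order cannot matter
        return h;
    }
};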

You can do this, but like every unordered_set/map element type, the inner unordered_set now needs a Hash function to be defined. It does not have one by default, but you can write one yourself.

What you have to do is define an appropriate hash for keys of type std::unordered_set<unsigned int> (since operator== is already defined for this key, you will not need to also provide the EqualKey template parameter for std::unordered_set<std::unordered_set<unsigned int>, Hash, EqualKey>).
One simple (albeit inefficient) option is to hash on the total sum of all elements of the set. This would look similar to this:
#include <functional>
#include <numeric>
#include <unordered_set>

template<typename T>
struct hash_on_sum
: private std::hash<typename T::value_type>
{
    typedef typename T::value_type count_type;
    typedef std::hash<count_type> base;
    std::size_t operator()(T const& obj) const
    {
        // hash the sum of all elements (order-independent, but collision-prone)
        return base::operator()(std::accumulate(obj.begin(), obj.end(), count_type()));
    }
};
typedef std::unordered_set<unsigned int> inner_type;
typedef std::unordered_set<inner_type, hash_on_sum<inner_type>> set_of_unique_sets;
However, while simple, this is not good, since it does not satisfy the following requirement: for two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
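For completeness, a short usage sketch (my own addition, building on the typedefs above) shows that {1, 2, 3} and {3, 2, 1} collapse to a single entry, since they compare equal and, with this functor, also hash equally:
#include <iostream>
int main()
{
    set_of_unique_sets s;
    s.insert(inner_type{1, 2, 3});
    s.insert(inner_type{3, 2, 1});   // equal to the first set, so not inserted again
    std::cout << s.size() << '\n';   // prints 1
}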

std::unordered_set<unsigned int> does not meet the requirements to be an element of a std::unordered_set, since there is no default hash function (i.e. std::hash<> is not specialized for std::unordered_set<unsigned int>).
You can provide one (it should be fast, and avoid collisions as much as possible):
class MyHash
{
public:
    std::size_t operator()(const std::unordered_set<unsigned int>& s) const
    {
        // return some meaningful hash of the set elements; one simple,
        // order-independent possibility is the sum of the elements' own hashes
        std::size_t h = 0;
        for (unsigned int e : s)
            h += std::hash<unsigned int>()(e);
        return h;
    }
};
int main() {
    std::unordered_set<std::unordered_set<unsigned int>, MyHash> u;
}
You can see very good examples of hash functions in this answer.
You should really provide both a Hash and an Equality function meeting the standard requirements of an Unordered Associative Container.

Hash(), the default function used to create hashes of your set's elements, does not know how to deal with an entire set as an element. Create a hash function that creates a unique value for every unique set and you're good to go.
This is the constructor for an unordered_set
explicit unordered_set( size_type bucket_count = /*implementation-defined*/,
                        const Hash& hash = Hash(),
                        const KeyEqual& equal = KeyEqual(),
                        const Allocator& alloc = Allocator() );
http://en.cppreference.com/w/cpp/container/unordered_set/unordered_set
Perhaps the simplest thing for you to do is create a hash function for your unordered_set<unsigned int>:
std::size_t my_hash(const std::unordered_set<unsigned int>& element)
{
    std::size_t h = 0;
    for (unsigned int e : element)
    {
        // some sort of math to create the same hash for every equal set;
        // summing the per-element hashes is one simple, order-independent choice
        h += std::hash<unsigned int>()(e);
    }
    return h;
}
edit: as seen in another answer, which I had completely forgotten, the hashing function must be supplied as a Hash function object, at least according to the constructor I pasted in my answer.

There's a reason there is no hash for unordered_set. An unordered_set is a mutable container by default. A hash must hold the same value for as long as the object is in the unordered_set, so your elements must be immutable. This is not guaranteed by using the modifier const&, as it only guarantees that the main unordered_set and its methods will not modify the sub-unordered_set. Not using a reference could be a safe solution (you'd still have to write the hash function), but do you really want the overhead of moving/copying unordered_sets?
You could instead use some kind of pointer. This is fine; a pointer is only a memory address, and your unordered_set itself does not relocate (it might reallocate its element pool, but who cares?). Therefore your pointer is constant and it can hold the same hash for its lifetime in the unordered_set.
(EDIT: as Howard pointed out, you must ensure that, regardless of the order in which your elements are stored, two sets with the same elements are considered equal. By enforcing an order on how you store your integers, you get for free that two equal sets correspond to two equal vectors.)
As a bonus, you now can use a smart pointer within the main set itself to manage the memory of sub-unordered_set if you allocated them on the heap.
Note that this is still not your most efficient implementation for a collection of sets of int. To make your sub-sets, you could write a quick wrapper around std::vector that stores the ints ordered by value. ints are small and cheap to compare, and a dichotomic (binary) search is only O(log n) in complexity. A std::unordered_set is a heavy structure, and what you lose by going from O(1) to O(log n), you gain back by having compact memory for each set. This shouldn't be too hard to implement, but is almost guaranteed to perform better.
A harder-to-implement solution would involve a trie.
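For illustration only, here is a minimal sketch of the sorted-vector wrapper described above (the class name sorted_int_set and its interface are my own invention, not from the answer):
#include <algorithm>
#include <cstddef>
#include <vector>
// A set of ints stored as a sorted std::vector: compact memory, O(log n) lookup,
// and equal sets are stored identically, so they compare and hash alike.
class sorted_int_set {
public:
    bool contains(int x) const {
        return std::binary_search(v_.begin(), v_.end(), x);
    }
    void insert(int x) {
        auto pos = std::lower_bound(v_.begin(), v_.end(), x);
        if (pos == v_.end() || *pos != x)
            v_.insert(pos, x);  // keep the vector sorted and duplicate-free
    }
    bool operator==(const sorted_int_set& other) const { return v_ == other.v_; }
    const std::vector<int>& data() const { return v_; }
private:
    std::vector<int> v_;
};
A hash functor for this type can then simply combine the elements in their stored (sorted) order, for example with boost::hash_range over data(), since equal sets are stored identically.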

Related

Does std::sort of a non-primitive data type always return the same sorted array if it contains duplicates as per the comparator

Suppose I have a non-primitive data type with duplicates as per the comparator, and I attempt to sort it using std::sort... does it give the same sorted array every time (if we compare the sorted arrays across runs, will they be the same)? I know it is not stable (it may change the order of equal elements), but is the result for the same input array guaranteed to be deterministic (reliable and reproducible)?
#include <algorithm>
#include <string>
#include <vector>

struct Data {
    std::string str;
    int data;
};

struct {
    bool operator()(Data a, Data b) const { return a.data > b.data; }
} customLess;

int main() {
    std::vector<Data> v = {
        {"Rahul", 100},
        {"Sachin", 200},
        {"Saurav", 200},
        {"Rohit", 300},
        // .....
    };
    for (unsigned k = 0; k < 1000; k++) {
        auto v2 = v;
        std::sort(v2.begin(), v2.end(), customLess);
    }
}
If I read you correctly, you're asking whether, despite the lack of stability, std::sort guarantees repeatability; if the same input is provided in the same order, and there are elements that are equal on the compared components, but unequal on others, will said elements always get sorted the same relative to one another?
The answer is No, std::sort makes no such guarantees. Doing so would impose restrictions on implementations that might cause them to perform worse (e.g. implementations based on quicksort couldn't use a random pivot to minimize the occurrence of quicksort's pathological case, where performance is O(n²) rather than the average case O(n log n)). While a plain quicksort of that design is banned in C++11 (where std::sort now requires O(n log n) comparisons period, not merely O(n log n) average case), it can still form the top-level sort for an introsort-based std::sort implementation (a reasonable strategy when the inputs are received from possibly malicious sources and you want to reduce their ability to force excessive recursion followed by slower heapsort), so requiring repeatability would prevent implementations from using a random pivot (or any other sorting strategy with a random component), for a benefit virtually no one cares about.
Using std::sort means you don't care about the relative order of elements that compare equal according to the comparator, even if they differ in other respects; implementations are not going to limit potential optimizations to provide a useless guarantee. Many implementations might, in practice, have a repeatable sort order in this scenario, but it's not something code should rely on; if you need repeatability, either:
Use std::stable_sort (and get an ordering for equal inputs that is repeatable across implementations, where std::sort, being implemented differently by different vendors, would almost certainly not be repeatable across implementations that chose different algorithms), or
Expand your custom comparator to perform fallback sorting that encompasses all fields in the input elements, so it's impossible to have any uncertainty unless the fields are 100% equal, not merely equivalent based on the main comparison. That gets you not only repeatability for equal inputs, but repeatability for inputs with the same elements in different order. The actual results might put two completely equal elements in a different order (e.g. you might be able to check .data() on a std::string, and discover that two strings with the same characters end up sorting in different orders), but that's almost never important (and if it is, again, use std::stable_sort). In this case, you'd change your comparator to (adding #include <tuple> if you're not already using it):
struct {
    bool operator()(const Data& a, const Data& b) const {
        return std::tie(a.data, a.str) > std::tie(b.data, b.str);
    }
} customLess;
so all fields are compared. Note that I changed the arguments to be const references (so you're not copying two Data objects for each comparison) and I used std::tie to make the fallback comparison efficient and easy to code (std::tie lets you use std::tuple's lexicographic sort without having to reimplement lexicographic sorting from scratch, an error-prone process, while still sticking to reference semantics to avoid copies).

Hashing an unordered container without needing to implement a comparison operator for the type

I'm looking to hash an unordered container, such as an unordered_map and unordered_set.
For an ordered type, like a vector, boost::hash_range(v.begin(), v.end()) works well, but it is also dependent on order, e.g.
#include <boost/functional/hash.hpp>
#include <functional>
#include <vector>
namespace std {
    template<>
    struct hash<std::vector<int>> {
        size_t operator ()(const std::vector<int>& v) const noexcept {
            return boost::hash_range(v.begin(), v.end());
        }
    };
}
Example of this working: https://coliru.stacked-crooked.com/a/0544c1b146ebeaa0
boost.org says
If you are calculating a hash value for data where the order of the
data doesn't matter in comparisons (e.g. a set) you will have to
ensure that the data is always supplied in the same order.
Okay, so that would seem easy - just sort the data in some way, but I don't want to do this every time I hash it. Using a normal map or set could work but I would need to do a fair bit of re-writing.
Additionally, this would require every type I use to have either >, <, <= or >= defined, as well as == and std::hash.
How can I hash a container so that the order does not matter?
The requirement seems rather logical: since the hash function combines the hash of the previous elements with the hash of the current element somehow, the order is important, because
H(A, B, C) is then computed as H(H(H(A), B), C), so that each intermediate result is used as input for the next element (think of a block cipher).
To hash a sequence of elements without caring about ordering, you'd need a commutative hash function, so you'd be restricted to commutative operations (e.g. XOR). I'm not sure how strong such a hash function could be, but for your specific scenario it could be sufficient.
After sorting the hash values of individual container elements, the sorted list of hash values can be hashed again to obtain a hash value for the unordered container.
Assume H1 is the hash function for a single element and H2 is the hash function for a list of hash values, then the hash value for some unordered container with elements A, B, and C could be calculated as H2(SORT(H1(A), H1(B), H1(C))). By construction, the resulting hash value will be independent of the order. In this way, you will also get a stronger hash value compared to combining the individual hash values using commutative operations.
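As a sketch of this H2(SORT(H1(...))) idea (the helper name unordered_hash is my own, and boost::hash_range from the question serves as H2):
#include <boost/functional/hash.hpp>
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>
// Hash each element (H1), sort the hashes (SORT), then hash the sorted sequence (H2).
// The result is independent of the container's iteration order.
std::size_t unordered_hash(const std::unordered_set<int>& s) {
    std::vector<std::size_t> hashes;
    hashes.reserve(s.size());
    for (int x : s)
        hashes.push_back(std::hash<int>()(x));
    std::sort(hashes.begin(), hashes.end());
    return boost::hash_range(hashes.begin(), hashes.end());
}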

Call to implicitly-deleted default constructor of 'unordered_set< vector<int> >'

It seems like when I try to define an unordered_set of vector, I get an error saying: "Call to implicitly-deleted default constructor of unordered_set< vector<int> > ." This doesn't happen when I define a regular (ordered) set: set< vector<int> >. It seems like I need to define hash<vector<int>> in order to get rid of the error.
Does anyone know why I get this error only when I use unordered_set? Shouldn't both data structures use hashing, so why would an unordered_set need a custom-defined hash function? In fact, shouldn't a regular (ordered) set need some custom-defined comparator as well in order to order the vector<int> data structures?
It's because unordered_set uses the std::hash template to compute hashes for its entries, and there is no std::hash specialization for std::vector<int>. You have to define a custom hash to use unordered_set.
struct vector_hash
{
    template <class T>
    std::size_t operator () (const std::vector<T>& v) const
    {
        // hash only the vector's size -- legal but weak; see the note below
        return std::hash<std::size_t>()(v.size());
    }
};
and then declare your unordered_set as:
std::unordered_set<std::vector<int>, vector_hash> set;
This hash function is not good. It's just an example.
Shouldn't both data structures use hashing
No. This is documented, you can always look it up yourself:
std::set
std::set is an associative container that contains a sorted set of unique objects of type Key. Sorting is done using the key comparison function Compare. Search, removal, and insertion operations have logarithmic complexity. Sets are usually implemented as red-black trees
Note that Compare defaults to std::less<Key>, and std::vector overloads operator<.
std::unordered_set, for comparison
Unordered set is an associative container that contains a set of unique objects of type Key. Search, insertion, and removal have average constant-time complexity.
Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its value
and the Hash type parameter defaults to std::hash<Key>. This has a list of specializations for standard library types, and std::vector is not included on that list.
Unfortunately, there is no general specialization of std::hash for std::vector. There is only a specialization for std::vector<bool>, which is not very helpful:
#include <functional>
#include <vector>
int main()
{
    std::hash<std::vector<bool>> hash_bool; // ok
    std::hash<std::vector<int>> hash_int;   // error: no specialization for std::vector<int>
}
The reason, IMHO, is that there is no standard way to combine hashes, and you have to combine the hashes of all elements in order to construct the hash for the whole vector. Why there is no standard way to combine hashes is the real mystery.
You can use boost::hash or absl::Hash instead, e.g.:
unordered_set<vector<int>, boost::hash<vector<int>>>
Be aware that the hash codes computed by absl::Hash are not guaranteed to be stable across different runs of your program. Which might be either a benefit (if you aim for security), or a disadvantage (if you aim for reproducibility).
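Put together, a small usage example with boost::hash as the hasher (a sketch; absl::Hash would slot in the same way):
#include <boost/functional/hash.hpp>
#include <iostream>
#include <unordered_set>
#include <vector>
int main()
{
    std::unordered_set<std::vector<int>, boost::hash<std::vector<int>>> s;
    s.insert({1, 2, 3});
    s.insert({1, 2, 3});            // duplicate, not inserted again
    std::cout << s.size() << '\n';  // prints 1
}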
My answer is not related to the std::vector issue directly, but in my case I got the "Call to implicitly-deleted default constructor" error with std::string, just because I did not include the <string> header file.
I mean that this error can happen when unordered_map is used with a forward-declared type:
#include <unordered_map>
#include <string_view>
//#include <string>
int main()
{
    auto m = std::unordered_map<std::string, int>{};
}
set uses self-balancing trees, whereas unordered_set uses pure hashing.

How does the following comparator even work while building up the min heap?

I know that if I build a heap using the STL, it makes a max_heap. And if I want to make a min_heap, I will have to write my own custom comparator. Now, consider the following comparator:
#include <algorithm>
#include <iostream>
#include <vector>

struct greater1 {
    bool operator()(const long& a, const long& b) const {
        return a > b;
    }
};

int main() {
    std::vector<long> humble;
    humble.push_back(15);
    humble.push_back(15);
    humble.push_back(9);
    humble.push_back(25);
    std::make_heap(humble.begin(), humble.end(), greater1());
    while (humble.size()) {
        std::pop_heap(humble.begin(), humble.end(), greater1());
        long min = humble.back();
        humble.pop_back();
        std::cout << min << std::endl;
    }
    return 0;
}
I got the above code off the internet. I just have one doubt: how is the comparator actually working? As far as I understand, shouldn't it be something like return a < b, because we want the minimum element to be at the front and the bigger elements behind it in the heap? Why is it return a > b? Doesn't that mean that, if (a > b), this will return true and a will be put in the heap before b, and therefore a bigger element is put in front of a smaller element?
I think you're reading too much into a connection between the comparator semantics and the heap semantics. Remember, the internal details and structure of containers are deliberately abstracted away from you so, the moment you started trying to rationalise about this in terms of how you think the internal structure of a max_heap should look, you got carried away.
In the standard library, default comparators are always less-than. If the relationship between elements for sorting within the particular container/algorithm is not less-than, the container/algorithm will be clever enough to make that transformation (in this case, on usual implementations, by simply passing the operands in the reverse order, like cmp(b,a)!). But, fundamentally, it will always start with a less-than ordering because that is the consistent convention adopted.
So, to invert the ordering of your container, you would take a less-than comparator and turn it into a greater-than comparator, no matter what the physical layout of the container's implementation may (in your opinion) turn out to be.
Furthermore, as an aside, and to echo the Croissant's comments, I would take longs by value … and, in fact, just use std::greater rather than recreating it.
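For example, the same loop rewritten with std::greater (a sketch of that suggestion, using only standard components):
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>
int main() {
    std::vector<long> humble{15, 15, 9, 25};
    // std::greater inverts the default less-than ordering, giving a min-heap.
    std::make_heap(humble.begin(), humble.end(), std::greater<long>());
    while (!humble.empty()) {
        std::pop_heap(humble.begin(), humble.end(), std::greater<long>());
        std::cout << humble.back() << '\n';
        humble.pop_back();
    }
    return 0;  // the loop prints 9, 15, 15, 25
}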
A standard heap is built in such a way that for every element a and its child b, cmp(a, b) is false, where cmp is the comparator supplied: a parent never compares "less than" its child. (Or, abstracting from the internal representation, cmp(top, other) is false for the first element top and any other element other.)
This is obviously done to make the default comparator ("less") build a max-heap.
So you need to provide a comparator that is false whenever its first argument is the element you want on top. For a min-heap, this is "greater": the smallest element is not greater than anything else, so it ends up at the front.

Avoid extra process in unordered_map insertion

I have an std::unordered_map, and I want both to increment the first value in a std::pair, hashed by key, and to create a reference to key. For example:
std::unordered_map<int, std::pair<int, int> > hash;
hash[key].first++;
auto it(hash.find(key));
int& my_ref(it->first);
I could, instead of using the [] operator, insert the data with insert(), but then I'd allocate a pair even if it were to be deallocated later, since hash may already contain key -- I'm not sure of that, though. Making it clearer:
// If "key" is already inserted, the pair(s) will be allocated
// and then deallocated, right?
auto it(hash.insert(std::make_pair(key, std::make_pair(0, 0))).first);
it->second.first++;
// Here I can have my reference, with extra memory operations,
// but without an extra search in `hash`
int& my_ref(it->first);
I'm pretty much inclined to use the first option, but I can't seem to decide which one is the best. Any better solution to this?
P.S.: an ideal solution for me would be something like an insertion that does not require an initial, possibly useless, allocation of the value.
As others have pointed out, "allocating" a std::pair<int,int> is really nothing more than copying two integers (on the stack). For the map<int,pair<int,int>>::value_type, which is pair<int const, pair<int, int>>, you are at three ints, so there is no significant overhead in using your second approach. You can slightly optimize by using emplace instead of insert, i.e.:
// Here an `int` and a struct containing two `int`s are passed as arguments (by value)
auto it(hash.emplace(key, std::make_pair(0, 0)).first);
it->second.first++;
// You get your reference, without an extra search in `hash`
// Not sure what "extra memory operations" you worry about
int const& my_ref(it->first);
Your first approach, using both hash[key] and hash.find(key) is bound to be more expensive, because an element search will certainly be more expensive than an iterator dereference.
Premature copying of arguments on their way to construction of the unordered_map<...>::value_type is a negligible problem, when all arguments are just ints. But if instead you have a heavyweight key_type or a pair of heavyweight types as mapped_type, you can use the following variant of the above to forward everything by reference as far as possible (and use move semantics for rvalues):
// Here key and arguments to construct mapped_type
// are forwarded as tuples of universal references
// There is no copying of key or value nor construction of a pair
// unless a new map element is needed.
auto it(hash.emplace(std::piecewise_construct,
                     std::forward_as_tuple(key),  // one-element tuple
                     std::forward_as_tuple(0, 0)  // args to construct mapped_type
        ).first);
it->second.first++;
// As in all solutions, get your reference from the iterator we already have
int const& my_ref(it->first);
How about this:
auto it = hash.find(key);
if (it == hash.end()) { it = hash.emplace(key, std::make_pair(0, 0)).first; }
++it->second.first;
int const & my_ref = it->first; // must be const
(If it were an ordered map, you'd use lower_bound and hinted insertion to recycle the tree walk.)
If I understand correctly, what you want is an operator[] that returns an iterator, not a mapped_type. The current interface of unordered_map does not provide such a feature, and the operator[] implementation relies on private members (at least the boost implementation does; I don't have access to the C++11 std files in my environment).
I suppose that JoergB's answer will be faster and Kerrek SB's one will have a smaller memory footprint. It's up to you to decide what is more critical for your project.