tr1::unordered_set union and intersection - c++

How to do intersection and union for sets of the type tr1::unordered_set in c++? I can't find much reference about it.
Any reference and code will be highly appreciated. Thank you very much.
Update: I just guessed the tr1::unordered_set should provide the function for intersection, union, difference.. Since that's the basic operation of sets.
Of course I can write a function by myself, but I just wonder if there are built in function from tr1.
Thank you very much.

I see that set_intersection() et al. from the algorithm header won't work as they explicitly require their inputs to be sorted -- guess you ruled them out already.
It occurs to me that the "naive" approach of iterating through hash A and looking up every element in hash B should actually give you near-optimal performance, since successive lookups in hash B will be going to the same hash bucket (assuming that both hashes are using the same hash function). That should give you decent memory locality, even though these buckets are almost certainly implemented as linked lists.
Here's some code for unordered_set_difference(), you can tweak it to make the versions for set union and set difference:
template <typename InIt1, typename InIt2, typename OutIt>
OutIt unordered_set_intersection(InIt1 b1, InIt1 e1, InIt2 b2, InIt2 e2, OutIt out) {
while (!(b1 == e1)) {
if (!(std::find(b2, e2, *b1) == e2)) {
*out = *b1;
++out;
}
++b1;
}
return out;
}
Assuming you have two unordered_sets, x and y, you can put their intersection in z using:
unordered_set_intersection(
x.begin(), x.end(),
y.begin(), y.end(),
inserter(z, z.begin())
);
Unlike bdonlan's answer, this will actually work for any key types, and any combination of container types (although using set_intersection() will of course be faster if the source containers are sorted).
NOTE: If bucket occupancies are high, it's probably faster to copy each hash into a vector, sort them and set_intersection() them there, since searching within a bucket containing n elements is O(n).

There's nothing much to it - for intersect, just go through every element of one and ensure it's in the other. For union, add all items from both input sets.
For example:
void us_isect(std::tr1::unordered_set<int> &out,
const std::tr1::unordered_set<int> &in1,
const std::tr1::unordered_set<int> &in2)
{
out.clear();
if (in2.size() < in1.size()) {
us_isect(out, in2, in1);
return;
}
for (std::tr1::unordered_set<int>::const_iterator it = in1.begin(); it != in1.end(); it++)
{
if (in2.find(*it) != in2.end())
out.insert(*it);
}
}
void us_union(std::tr1::unordered_set<int> &out,
const std::tr1::unordered_set<int> &in1,
const std::tr1::unordered_set<int> &in2)
{
out.clear();
out.insert(in1.begin(), in1.end());
out.insert(in2.begin(), in2.end());
}

based on the previous answer:
C++11 version, if the set supports a fast look up function find()
(return values are moved efficiently)
/** Intersection and union function for unordered containers which support a fast lookup function find()
* Return values are moved by move-semantics, for c++11/c++14 this is efficient, otherwise it results in a copy
*/
namespace unorderedHelpers {
template<typename UnorderedIn1, typename UnorderedIn2,
typename UnorderedOut = UnorderedIn1>
UnorderedOut makeIntersection(const UnorderedIn1 &in1, const UnorderedIn2 &in2)
{
if (in2.size() < in1.size()) {
return makeIntersection<UnorderedIn2,UnorderedIn1,UnorderedOut>(in2, in1);
}
UnorderedOut out;
auto e = in2.end();
for(auto & v : in1)
{
if (in2.find(v) != e){
out.insert(v);
}
}
return out;
}
template<typename UnorderedIn1, typename UnorderedIn2,
typename UnorderedOut = UnorderedIn1>
UnorderedOut makeUnion(const UnorderedIn1 &in1, const UnorderedIn2 &in2)
{
UnorderedOut out;
out.insert(in1.begin(), in1.end());
out.insert(in2.begin(), in2.end());
return out;
}
}

Related

Checking vector equality by comparing each element doesn't match with `==` operator

I tried to check if two std::vectors are equivalent, by implementing custom functions.
But the results don't match with that of the == operator.
(Even though my functions say the two vectors are the same, vector1 == vector2 sometimes says the two vectors are not the same.)
Below is the code I wrote. (removed some std::couts)
Implementation 1:
template <typename T1, typename T2>
bool isVecSame(std::vector<T1> v1, std::vector<T2> v2) {
bool bSame = true;
if (!(v1.size() == v2.size()))
return (bSame = false);
for (size_t i = 0, sz = v1.size(); i < sz; ++i) {
if (!(v1.at(i) == v2.at(i)))
bSame = false;
}
return bSame;
}
Implementation 2: (quite similar to std::equal implementation of cppreference.com)
template <typename T1, typename T2>
bool isVecSameV2(std::vector<T1> v1, std::vector<T2> v2) {
bool bSame = true;
if (!(v1.size() == v2.size()))
return (bSame = false);
auto itBeg_v1 = v1.begin();
auto itEnd_v1 = v1.end();
auto itBeg_v2 = v2.begin();
for ( ; itBeg_v1 != itEnd_v1; ++itBeg_v1, ++itBeg_v2) {
if ( !((*itBeg_v1) == (*itBeg_v2)) )
bSame = false;
}
return bSame;
}
(The results of these two implementations are the same, as far as I tested.)
I thought the std::vector's == operator works similarly: Compares the size of the two vectors and then compares each element by ==, if the elements implemented the == operator.
Then what can be the cause of the difference between the result of vector1 == vector2 and the result of the above functions?
P.S. the types T of std::vector<T> are: uint64_t, std::string, int, float. And I didn't overload any operators.
Context: I'm trying to read a text file line by line and parse it into the std::vector, and check if the two vectors from the different read(of the same text file) are the same by using vector1 == vector2. Strangely, sometimes vectors were not the same, so that's why I tried to implement custom functions to compare two vectors - which tells me the vectors are the same all the time. (not easy to write a minimal reproducible example because it involves reading a huge text file)
There must be only one template parameter . Types are checked by the standard implementation with the compat and getcont function. Their iterators must be compatible.
The vectors should be passed as const reference. You are making copies. Potentially with side effects when you use std::string. Very unlikely, but possible.
The function must be defined as const. Otherwise they will not be compatible with some algorithms.
template <typename T>
bool isVecSameV2(const std::vector<T> &v1, const std::vector<T>& v2) const {
However. The problem is most likely somewhere else in the code. But we cannot see it . . .

std::hash variations of object with arbitrary number of attributes of fundamental type

Discussion:
Let's say I have a struct/class with an arbitrary number of attributes that I want to use as key to a std::unordered_map e.g.,:
struct Foo {
int i;
double d;
char c;
bool b;
};
I know that I have to define a hasher-functor for it e.g.,:
struct FooHasher {
std::size_t operator()(Foo const &foo) const;
};
And then define my std::unordered_map as:
std::unordered_map<Foo, MyValueType, FooHasher> myMap;
What bothers me though, is how to define the call operator for FooHasher. One way to do it, that I also tend to prefer, is with std::hash. However, there are numerous variations e.g.,:
std::size_t operator()(Foo const &foo) const {
return std::hash<int>()(foo.i) ^
std::hash<double>()(foo.d) ^
std::hash<char>()(foo.c) ^
std::hash<bool>()(foo.b);
}
I've also seen the following scheme:
std::size_t operator()(Foo const &foo) const {
return std::hash<int>()(foo.i) ^
(std::hash<double>()(foo.d) << 1) ^
(std::hash<char>()(foo.c) >> 1) ^
(std::hash<bool>()(foo.b) << 1);
}
I've seen also some people adding the golden ratio:
std::size_t operator()(Foo const &foo) const {
return (std::hash<int>()(foo.i) + 0x9e3779b9) ^
(std::hash<double>()(foo.d) + 0x9e3779b9) ^
(std::hash<char>()(foo.c) + 0x9e3779b9) ^
(std::hash<bool>()(foo.b) + 0x9e3779b9);
}
Questions:
What are they trying to achieve by adding the golden ration or shifting bits in the result of std::hash.
Is there an "official scheme" to std::hash an object with arbitrary number of attributes of fundamental type?
A simple xor is symmetric and behaves badly when fed the "same" value multiple times (hash(a) ^ hash(a) is zero). See here for more details.
This is the question of combining hashes. boost has a hash_combine that is pretty decent. Write a hash combiner, and use it.
There is no "official scheme" to solve this problem.
Myself, I typically write a super-hasher that can take anything and hash it. It hash combines tuples and pairs and collections automatically, where it first hashes the count of elements in the collection, then the elements.
It finds hash(t) via ADL first, and if that fails checks if it has a manually written hash in a helper namespace (used for std containers and types), and if that fails does a std::hash<T>{}(t).
Then my hash for Foo support looks like:
struct Foo {
int i;
double d;
char c;
bool b;
friend auto mytie(Foo const& f) {
return std::tie(f.i, f.d, f.c, f.b);
}
friend std::size_t hash(Foo const& f) {
return hasher::hash(mytie(f));
}
};
where I use mytie to move Foo into a tuple, then use the std::tuple overload of hasher::hash to get the result.
I like the idea of hashes of structurally similar types having the same hash. This lets me act as if my hash is transparent in some cases.
Note that hashing unordered meows in this manner is a bad idea, as an asymmetric hash of an unordered meow may generate spurious misses.
(Meow is the generic name for map and set. Do not ask me why: Ask the STL.)
The standard hash framework is lacking in respect of combining hashes. Combining hashes using xor is sub-optimal.
A better solution is proposed in N3980 "Types Don't Know #".
The main idea is using the same hash function and its state to hash more than one value/element/member.
With that framework your hash function would look:
template <class HashAlgorithm>
void hash_append(HashAlgorithm& h, Foo const& x) noexcept
{
using std::hash_append;
hash_append(h, x.i);
hash_append(h, x.d);
hash_append(h, x.c);
hash_append(h, x.b);
}
And the container:
std::unordered_map<Foo, MyValueType, std::uhash<>> myMap;

Fast hash function for `std::vector`

I implemented this solution for getting an hash value from vector<T>:
namespace std
{
template<typename T>
struct hash<vector<T>>
{
typedef vector<T> argument_type;
typedef std::size_t result_type;
result_type operator()(argument_type const& in) const
{
size_t size = in.size();
size_t seed = 0;
for (size_t i = 0; i < size; i++)
//Combine the hash of the current vector with the hashes of the previous ones
hash_combine(seed, in[i]);
return seed;
}
};
}
//using boost::hash_combine
template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
But this solution doesn't scale at all: with a vector<double> of 10 millions elements it's gonna take more than 2.5 s (according to VS).
Does exists a fast hash function for this scenario?
Notice that creating an hash value from the vector reference is not a feasible solution, since the related unordred_map will be used in different runs and in addition two vector<double> with the same content but different addresses will be mapped differently (undesired behavior for this application).
NOTE: As per the comments, you get a 25-50x speed-up by compiling with optimizations. Do that, first. Then, if it's still too slow, see below.
I don't think there's much you can do. You have to touch all the elements, and that combination function is about as fast as it gets.
One option may be to parallelize the hash function. If you have 8 cores, you can run 8 threads to each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may be worth it for very large vectors.
The approach that MSVC's old hashmap used was to sample less often.
This means that isolated changes won't show up in your hash, but the thing you are trying to avoid is reading and processing the entire 80 mb of data in order to hash your vector. Not reading some characters is pretty unavoidable.
The second thing you should do is not specialize std::hash on all vectors, this may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least is a bad plan (as the std is sure to permit itself to add hash combining and hashing of vectors).
When I write a custom hash, I usually use ADL (Koenig Lookup) to make it easy to extend.
namespace my_utils {
namespace hash_impl {
namespace details {
namespace adl {
template<class T>
std::size_t hash(T const& t) {
return std::hash<T>{}(t);
}
}
template<class T>
std::size_t hasher(T const& t) {
using adl::hash;
return hash(t);
}
}
struct hash_tag {};
template<class T>
std::size_t hash(hash_tag, T const& t) {
return details::hasher(t);
}
template<class T>
std::size_t hash_combine(hash_tag, std::size_t seed, T const& t) {
seed ^= hash(t) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
template<class Container>
std::size_t fash_hash_random_container(hash_tag, Container const& c ) {
std::size_t size = c.size();
std::size_t stride = 1 + size/10;
std::size_t r = hash(hash_tag{}, size);
for(std::size_t i = 0; i < size; i += stride) {
r = hash_combine(hash_tag{}, r, c.data()[i])
}
return r;
}
// std specializations go here:
template<class T, class A>
std::size_t hash(hash_tag, std::vector<T,A> const& v) {
return fash_hash_random_container(hash_tag{}, v);
}
template<class T, std::size_t N>
std::size_t hash(hash_tag, std::array<T,N> const& a) {
return fash_hash_random_container(hash_tag{}, a);
}
// etc
}
struct my_hasher {
template<class T>
std::size_t operator()(T const& t)const {
return hash_impl::hash(hash_impl::hash_tag{}, t);
}
};
}
now my_hasher is a universal hasher. It uses either hashes declared in my_utils::hash_impl (for std types), or free functions called hash that will hash a given type, to hash things. Failing that, it tries to use std::hash<T>. If that fails, you get a compile-time error.
Writing a free hash function in the namespace of the type you want to hash tends to be less annoying than having to go off and open std and specialize std::hash in my experience.
It understands vectors and arrays, recursively. Doing tuples and pairs requires a bit more work.
It samples said vectors and arrays at about 10 times.
(Note: hash_tag is both a bit of a joke, and a way to force ADL and prevent having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)
The price of sampling is that you could get more collisions.
Another approach if you have a huge amount of data is to hash them once, and keep track of when they are modified. To do this approach, use a copy-on-write monad interface for your type that keeps track of if the hash is up to date. Now a vector gets hashed once; if you modify it, the hash is discarded.
One can go futher and have a random-access hash (where it is easy to predict what happens when you edit a given value hash-wise), and mediate all access to the vector. That is tricky.
You could also multi-thread the hashing, but I would guess that your code is probably memory-bandwidth bound, and multi-threading won't help much there. Worth trying.
You could use a fancier structure than a flat vector (something tree like), where changes to the values bubble-up in a hash-like way to a root hash value. This would add a lg(n) overhead to all element access. Again, you'd have to wrap the raw data up in controls that keep the hashing up to date (or, keep track of what ranges are dirty and needs to be updated).
Finally, because you are working with 10 million elements at a time, consider moving over to a strong large-scale storage solution, like databases or what have you. Using 80 megabyte keys in a map seems strange to me.

Sequence iterator? Isn't there one in boost?

From time to time I am feeling the need for a certain kind of iterator (for which I can't make up a good name except the one prefixed to the title of this question).
Suppose we have a function (or function object) that maps an integer to type T. That is, we have a definition of a mathematical sequence, but we don't actually have it stored in memory. I want to make an iterator out of it. The iterator class would look something like this:
template <class F, class T>
class sequence_iterator : public std::iterator<...>
{
int i;
F f;
public:
sequence_iterator (F f, int i = 0):f(f), i(i){}
//operators ==, ++, +, -, etc. will compare, increment, etc. the value of i.
T operator*() const
{
return f(i);
}
};
template <class T, class F>
sequence_iterator<F, T> make_sequence_iterator(F f, int i)
{
return sequence_iterator<F, T>(f, i);
}
Maybe I am being naive, but I personally feel that this iterator would be very useful. For example, suppose I have a function that checks whether a number is prime or not. And I want to count the number of primes in the interval [a,b]. I'd do this;
int identity(int i)
{
return i;
}
count_if(make_sequence_iterator<int>(identity, a), make_sequence_iterator<int>(identity, b), isPrime);
Since I have discovered something that would be useful (at least IMHO) I am definitely positive that it exists in boost or the standard library. I just can't find it. So, is there anything like this in boost?. In the very unlikely event that there actually isn't, then I am going to write one - and in this case I'd like to know your opinion whether or not should I make the iterator_category random_access_iterator_tag. My concern is that this isn't a real RAI, because operator* doesn't return a reference.
Thanks in advance for any help.
boost::counting_iterator and boost::transform_iterator should do the trick:
template <typename I, typename F>
boost::transform_iterator<
F,
boost::counting_iterator<I>>
make_sequence_iterator(I i, F f)
{
return boost::make_transform_iterator(
boost::counting_iterator<I>(i), f);
}
Usage:
std::copy(make_sequence_iterator(0, f), make_sequence_iterator(n, f), out);
I would call this an integer mapping iterator, since it maps a function over a subsequence of the integers. And no, I've never encountered this in Boost or in the STL. I'm not sure why that is, since your idea is very similar to the concept of stream iterators, which also generate elements by calling functions.
Whether you want random access iteration is up to you. I'd try building a forward or bidirectional iterator first, since (e.g.) repeated binary searches over a sequence of integers may be faster if they're generated and stored in one go.
Does the boost::transform_iterator fills your needs? there are several useful iterator adaptors in boost, the doc is here.
I think boost::counting_iterator is what you are looking for, or atleast comes the closest. Is there something you are looking for it doesn't provide? One could do, for example:
std::count_if(boost::counting_iterator<int>(0),
boost::counting_iterator<int>(10),
is_prime); // or whatever ...
In short, it is an iterator over a lazy sequence of consecutive values.
Boost.Utility contains a generator iterator adaptor. An example from the documentation:
#include <iostream>
#include <boost/generator_iterator.hpp>
class my_generator
{
public:
typedef int result_type;
my_generator() : state(0) { }
int operator()() { return ++state; }
private:
int state;
};
int main()
{
my_generator gen;
boost::generator_iterator_generator<my_generator>::type it =
boost::make_generator_iterator(gen);
for (int i = 0; i < 10; ++i, ++it)
std::cout << *it << std::endl;
}

performance of find in an unordered_map

I know this is probably a stupid question, but I wanted to make sure and I couldn't readily find this information.
What is the performance characteristic of find() in an unordered map? Is it as fast/nearly as fast as a normal lookup?
I.e.
std::string defaultrow = sprite.attribute("defaultrow").value();
auto rclassIt = Rows::NameRowMap.find(defaultrow);
if (rclassIt != Rows::NameRowMap.end())
defRow = rclassIt->second;
vs.
std::string defaultrow = sprite.attribute("defaultrow").value();
defRow = Rows::NameRowMap[defaultrow];
where Rows::NameRowMap is a unordered map mapping a string index to an int.
In my case, I don't know for certain if the key will exist before hand, so the first solution seemed safer to me, but if I can guarantee existence, is the second case faster (ignoring the extra checks I'm doing)? and if so, why? If it matters, I'm using the 1.46 boost implementation
find and operator[] on an unordered container are O(1) average, O(n) worst-case -- it depends on the quality of your hash function.
It's pretty possible that operator[] uses find and insert internally. For example, IIRC that's the case with Miscrosoft's std::map implementation.
EDIT:
What I was trying to say is that operator[] is not magical, it still has to do a lookup first. From what I see in Boost 1.46.0 both find and said operator use find_iterator internally.
Usually it's better to use find for lookups, because your code will be more reusable and robust (e.g. you will never insert something accidentally), especially in some kind of generic code.
They have the same amortized complexity of O(1), but the operator also creates a new element when the value is not found. If the value is found, performance difference should be minor. My boost is a little old - version 1.41, but hopefully it does not matter. Here is the code:
// find
//
// strong exception safety, no side effects
template <class H, class P, class A, class G, class K>
BOOST_DEDUCED_TYPENAME hash_table<H, P, A, G, K>::iterator_base
hash_table<H, P, A, G, K>::find(key_type const& k) const
{
if(!this->size_) return this->end();
bucket_ptr bucket = this->get_bucket(this->bucket_index(k));
node_ptr it = find_iterator(bucket, k);
if (BOOST_UNORDERED_BORLAND_BOOL(it))
return iterator_base(bucket, it);
else
return this->end();
}
// if hash function throws, basic exception safety
// strong otherwise
template <class H, class P, class A, class K>
BOOST_DEDUCED_TYPENAME hash_unique_table<H, P, A, K>::value_type&
hash_unique_table<H, P, A, K>::operator[](key_type const& k)
{
typedef BOOST_DEDUCED_TYPENAME value_type::second_type mapped_type;
std::size_t hash_value = this->hash_function()(k);
bucket_ptr bucket = this->bucket_ptr_from_hash(hash_value);
if(!this->buckets_) {
node_constructor a(*this);
a.construct_pair(k, (mapped_type*) 0);
return *this->emplace_empty_impl_with_node(a, 1);
}
node_ptr pos = this->find_iterator(bucket, k);
if (BOOST_UNORDERED_BORLAND_BOOL(pos)) {
return node::get_value(pos);
}
else {
// Side effects only in this block.
// Create the node before rehashing in case it throws an
// exception (need strong safety in such a case).
node_constructor a(*this);
a.construct_pair(k, (mapped_type*) 0);
// reserve has basic exception safety if the hash function
// throws, strong otherwise.
if(this->reserve_for_insert(this->size_ + 1))
bucket = this->bucket_ptr_from_hash(hash_value);
// Nothing after this point can throw.
return node::get_value(add_node(a, bucket));
}
}