I am dealing with a dataset in which I want to remove duplicates. Duplicates are defined by having the same value for three fields stored as int64.
I am using C++17. I want my code to be as fast as possible (memory is less of a constraint). I do not care about ordering. I know nothing about the distribution of the int64 values.
My idea is to use an unordered_set with a hash of the three int64 as a key.
Here are my questions:
Is the unordered_set the best option? How about a map?
Which hash function should I use?
Is it a good idea to put the three int64 into a string then hash that string?
Thanks for your help.
I would use:
std::unordered_map<uint64_t, std::unordered_map<uint64_t, std::unordered_set<uint64_t>>>
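For illustration, a minimal dedup sketch with that nested structure (the field names a, b, c are mine, standing for the three values of a record):
std::unordered_map<uint64_t, std::unordered_map<uint64_t, std::unordered_set<uint64_t>>> seen;
// insert() on the innermost set returns {iterator, bool}; the bool is false for a duplicate
bool is_new = seen[a][b].insert(c).second;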
Is the unordered_set the best option? How about a map?
Anything unordered_ will (I believe) use hash tables; anything ordered, some kind of binary tree.
Which hash function should I use?
Whatever std:: provides for uint64_t, unless you have a reason to believe that you can do better.
Is it a good idea to put the three int64 into a string then hash that string?
What can you do with strings that you can't with integers? It most likely will be longer...
Thanks, Vlad!
I also found an interesting implementation here: How do I combine hash values in C++0x?
inline void hash_combine(std::size_t& seed) { }
template <typename T, typename... Rest>
inline void hash_combine(std::size_t& seed, const T& v, Rest... rest) {
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
hash_combine(seed, rest...);
}
Usage:
std::size_t h=0;
hash_combine(h, obj1, obj2, obj3);
which I then plug into a flat std::unordered_set<std::size_t>.
This is roughly 2.5x faster than your proposal on my machine. On the other hand, your solution is simpler to read and does not require a hand-crafted hasher.
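Note that a flat std::unordered_set<std::size_t> stores only the combined hashes, so two distinct triples whose hashes collide would be silently merged. If that matters, one option (a sketch of mine, not a benchmark; needs <array>, <cstdint>, <unordered_set>) is to store the actual triples and use hash_combine only as the set's hasher:
struct TripleHash {
    std::size_t operator()(const std::array<int64_t, 3>& t) const {
        std::size_t h = 0;
        hash_combine(h, t[0], t[1], t[2]);   // reuses the hash_combine above
        return h;
    }
};
std::unordered_set<std::array<int64_t, 3>, TripleHash> seen;   // exact dedup, no false merges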
And then this solution is even faster (note: std::unordered_set of std::pair needs a custom hasher, e.g. one built from hash_combine above, passed as its second template argument; also, contains() is C++20, so count() is used here for C++17):
bool insert_key_and_exists(const int64_t &a, const int64_t &b,
                           std::unordered_set<std::pair<int64_t, int64_t>> *dict)
{
    auto c = std::make_pair(a, b);
    if (dict->count(c))   // already present
        return true;
    dict->insert(c);
    return false;
}
Hash tables used in things like unordered_map are excellent for large objects. Especially if the key is smaller. But for this you would probably get the best speed using a vector. Sort it. Then run std::unique
#include <cstdint>
#include <vector>
#include <array>
#include <algorithm>
#include <iostream>
// sort and remove duplicates
void remove_dups(std::vector<std::array<int64_t, 3>>& v)
{
std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end()), v.end());
}
int main()
{
std::vector<std::array<int64_t, 3>> v{ {1,2,3}, {3,2,1}, {1,2,3}, {2,3,4} };
remove_dups(v);
for (const auto& i3 : v)
{
for (const auto& i : i3)
std::cout << i << " ";
std::cout << '\n';
}
}
Related
An std::map<K,V> m, viewed mathematically, is a function fm whose graph is the set of pairs (x,y) ∈ K × V such that fm(x) = y.
So, I want to get the domain of fm, i.e. the set of all keys (or perhaps the range - the set of all values). I can do this procedurally with C++11, like so:
std::unordered_set<K> keys;
for (const auto& kv_pair : m) { keys.insert(kv_pair.first); }
right? But - I want to do it functionally (read: In a fancy way which makes me feel superior). How would I do that?
Notes:
I do not necessarily need the result to be an std::unordered_set; something which could replace such a set would probably work too (e.g. a set Facade).
Readability, (reasonable) terseness and avoiding gratuitous copying of data are all considerations.
Boost.Range provides exactly that, with the adaptor map_keys. Look at this example from the documentation.
You can write:
auto keys = m | boost::adaptors::map_keys;
// keys is a range view to the keys in your map, no copy involved
// you can use keys.begin() and keys.end() to iterate over it
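If you do need a materialized set rather than a view, you can construct one directly from the range (assuming K is the map's key type):
std::unordered_set<K> key_set(keys.begin(), keys.end());   // copies the keys out of the view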
EDIT : I'll leave my old answer below, it uses iterators instead of ranges. Notice that
the range represented by the two boost::transform_iterator still represents the set of keys in your map.
IMO the functional way to do that would require an iterator that points to the keys of the map, so that you can simply use std::copy.
It makes sense because you are not transforming or accumulating anything, you are just copying the keys.
Unfortunately the standard does not provide iterator adaptors, but you can use those provided by Boost.Iterator.
#include <algorithm>
#include <map>
#include <unordered_set>
#include <boost/iterator/transform_iterator.hpp>
struct get_first
{
template<class A, class B>
const A & operator()(const std::pair<A,B> & val) const
{
return val.first;
}
};
int main()
{
std::map<int, std::string> m;
std::unordered_set<int> r;
// ...
std::copy(boost::make_transform_iterator(m.begin(), get_first{}),
boost::make_transform_iterator(m.end(), get_first{}),
std::inserter(r, r.end()) );
}
It would be more expressive to have an iterator that dereferences the Kth element of a tuple/pair, but transform_iterator will do the job fine.
IMHO, an important characteristic for intuitive functional code is that the algorithm actually return the result, rather than setting some variable elsewhere as a side effect. This can be done with std::accumulate, e.g.:
#include <iostream>
#include <set>
#include <map>
#include <algorithm>
int main()
{
typedef std::map<int, int> M;
M m { {1, -1}, {2, -2}, {3, -3}, {4, -4} };
auto&& x = std::accumulate(std::begin(m), std::end(m), std::set<int>{},
    [](std::set<int>& s, const M::value_type& e)
    {
        return s.insert(e.first), std::move(s);   // .first is key, .second is value
    });
for (auto& i : x)
std::cout << i << ' ';
std::cout << '\n';
}
Output:
1 2 3 4
See it run here
The std::begin(m), std::end(m) bit is actually a big headache, as it frustrates chaining of such operations. For example, it'd be ideal if we could chain "functional" operations like our "GET KEYS" above alongside others...
x = m. GET KEYS . SQUARE THEM ALL . REMOVE THE ODD ONES
...or at least...
x = f(f(f(m, GET KEYS), SQUARE THEM ALL), REMOVE THE ODD ONES)
...but you'll have to write some trivial code yourself to get there or pick up a library supporting functional "style".
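For what it's worth, C++20 ranges (later than this answer assumes) do allow roughly that chaining style; a hedged sketch assuming C++20 and a map<int, int> m:
#include <ranges>
auto result = m
            | std::views::keys                                          // GET KEYS
            | std::views::transform([](int k) { return k * k; })        // SQUARE THEM ALL
            | std::views::filter([](int k) { return k % 2 == 0; });     // REMOVE THE ODD ONES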
There's a number of ways you could write this. One slightly more 'functional' way is:
vector<string> keys;
transform(begin(m), end(m), back_inserter(keys), [](const auto& p){ return p.first; });
But to really improve on this and enable a more functional style using the standard library we need something like Eric Niebler's Range Proposal to be standardized. In the meantime, there are a number of non-standard range based libraries like Eric's range-v3 and boost Range you can use to get a more functional style.
std::map<int, int> m;
std::unordered_set<int> keys;
std::for_each(m.begin(), m.end(), [&keys](decltype(*m.begin()) kv)-> void {keys.insert(kv.first);});
How can I create a std::map<int, float> from a vector<float>, so that the map contains the k highest values from the vector, with the keys being the index of the value in the vector?
A naive approach would be to traverse the vector (O(n)) and extract and erase (O(n)) the highest element, repeated k times, leading to a complexity of O(k*n), which is suboptimal, I guess.
Alternatively, one could just copy the vector (O(n)) and remove the smallest element until the size is k, which would lead to O((n-k)*n). Still polynomial...
Any ideas?
The following should do the job:
#include <cstdint>
#include <algorithm>
#include <iostream>
#include <map>
#include <tuple>
#include <vector>
// Compare: greater T2 first.
struct greater_by_second
{
template <typename T1, typename T2>
bool operator () (const std::pair<T1, T2>& lhs, const std::pair<T1, T2>& rhs)
{
return std::tie(lhs.second, lhs.first) > std::tie(rhs.second, rhs.first);
}
};
std::map<std::size_t, float> get_index_pairs(const std::vector<float>& v, int k)
{
std::vector<std::pair<std::size_t, float>> indexed_floats;
indexed_floats.reserve(v.size());
for (std::size_t i = 0, size = v.size(); i != size; ++i) {
indexed_floats.emplace_back(i, v[i]);
}
std::nth_element(indexed_floats.begin(),
indexed_floats.begin() + k,
indexed_floats.end(), greater_by_second());
return std::map<std::size_t, float>(indexed_floats.begin(), indexed_floats.begin() + k);
}
Let's test it:
int main(int argc, char *argv[])
{
const std::vector<float> fs {45.67f, 12.34f, 67.8f, 4.2f, 123.4f};
for (const auto& elem : get_index_pairs(fs, 2)) {
std::cout << elem.first << " " << elem.second << std::endl;
}
return 0;
}
Output:
2 67.8
4 123.4
You can keep a list of the k-highest values so far, and update it for each of the values in your vector, which takes you down to O(n*log k) (assuming log k for each update of the list of highest values) or, for a naive list, O(kn).
You can probably get closer to O(n), but assuming k is probably pretty small, may not be worth the effort.
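A hedged sketch of that bounded-heap approach (my own helper; it keeps the k largest (value, index) pairs in a min-heap, O(n log k) overall):
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>
std::map<int, float> top_k(const std::vector<float>& v, std::size_t k)
{
    // min-heap on value: the smallest of the current "top k" sits on top
    using Entry = std::pair<float, int>;   // (value, index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        heap.emplace(v[i], i);
        if (heap.size() > k)
            heap.pop();                    // drop the current minimum
    }
    std::map<int, float> result;
    while (!heap.empty()) {
        result.emplace(heap.top().second, heap.top().first);   // key = index, value = element
        heap.pop();
    }
    return result;
}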
Your optimal solution will have a complexity of O(n+k*log(k)), since sorting the k elements can be reduced to this, and you will have to look at each of the elements at least once.
Two possible solutions come to mind:
Iterate through the vector while adding all elements to a bounded (size k) priority-queue/heap, also keeping their indices.
Create a copy of your vector that includes the original indices, i.e. std::vector<std::pair<float, std::size_t>>, and use std::nth_element to move the k highest values to the front using a comparator that compares only the first element. Then insert those elements into your target map. Ironically, that last step adds the k*log(k) to the overall complexity, while nth_element is linear (but will permute your indices).
Maybe I did not get it, but in case the incremental approach is not an option, why not use std::partial_sort (rather than a full std::sort)?
That should be O(n log k), and since k is very likely to be small, that is practically O(n).
Edit: thanks to Mike Seymour for the update.
Edit (bis):
The idea is to use an intermediate vector for sorting, and then put it into the map. Trying to reduce the order of the computation would only be justified for a significant amount of data, so I guess the copy time (in O(n)) could be lost in the background noise.
Edit (bis):
That's actually what the selected answer does, without the theoretical explanations :).
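For completeness, a minimal std::partial_sort sketch of the same idea (my own code, analogous to the accepted answer's nth_element version but leaving the top k sorted by value):
#include <algorithm>
#include <functional>
#include <map>
#include <utility>
#include <vector>
std::map<std::size_t, float> top_k_partial_sort(const std::vector<float>& v, std::size_t k)
{
    std::vector<std::pair<float, std::size_t>> indexed;   // (value, original index)
    indexed.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        indexed.emplace_back(v[i], i);
    k = std::min(k, indexed.size());
    // Only the first k positions end up sorted (descending by value); roughly O(n log k).
    std::partial_sort(indexed.begin(), indexed.begin() + k, indexed.end(),
                      std::greater<std::pair<float, std::size_t>>());
    std::map<std::size_t, float> result;
    for (std::size_t i = 0; i < k; ++i)
        result.emplace(indexed[i].second, indexed[i].first);
    return result;
}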
I have a std::map with both key and value as integers. Now I want to randomly shuffle the map, so keys point to a different value at random. I tried random_shuffle but it doesn't compile. Note that I am not trying to shuffle the keys, which makes no sense for a map. I'm trying to randomise the values.
I could push the values into a vector, shuffle that and then copy back. Is there a better way?
You can push all the keys in a vector, shuffle the vector and use it to swap the values in the map.
Here is an example:
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <algorithm>
#include <random>
#include <ctime>
using namespace std;
int myrandom (int i) { return std::rand()%i;}
int main ()
{
srand(time(0));
map<int,string> m;
vector<int> v;
for(int i=0; i<10; i++)
m.insert(pair<int,string>(i,("v"+to_string(i))));
for(auto i: m)
{
cout << i.first << ":" << i.second << endl;
v.push_back(i.first);
}
random_shuffle(v.begin(), v.end(),myrandom);
vector<int>::iterator it=v.begin();
cout << endl;
for(auto& i:m)
{
string ts=i.second;
i.second=m[*it];
m[*it]=ts;
it++;
}
for(auto i: m)
{
cout << i.first << ":" << i.second << endl;
}
return 0;
}
The complexity of your proposal is O(N) (both the copies and the shuffle have linear complexity), which seems optimal (looking at fewer elements would introduce non-randomness into your shuffle).
If you want to repeatedly shuffle your data, you could maintain a map of type <Key, size_t> (i.e. the proverbial level of indirection) that indexes into a std::vector<Value> and then just shuffle that vector repeatedly. That saves you all the copying in exchange for O(N) space overhead. If the Value type itself is expensive, you have an extra vector<size_t> of indices into the real data on which you do the shuffling.
For convenience's sake, you could encapsulate the map and vector inside one class that exposes a shuffle() member function. Such a wrapper would also need to expose the basic lookup / insertion / erase functionality of the underlying map.
EDIT: As pointed out by #tmyklebu in the comments, maintaining (raw or smart) pointers to secondary data can be subject to iterator invalidation (e.g. when inserting new elements at the end that causes the vector's capacity to be resized). Using indices instead of pointers solves the "insertion at the end" problem. But when writing the wrapper class you need to make sure that insertions of new key-value pairs never cause "insertions in the middle" for your secondary data because that would also invalidate the indices. A more robust library solution would be to use Boost.MultiIndex, which is specifically designed to allow multiple types of view over a data structure.
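A minimal sketch of that key-to-index indirection (all names are mine; insert only ever appends to the vector, per the caveat above, and it is not a complete wrapper):
#include <algorithm>
#include <map>
#include <random>
#include <vector>
template <class Key, class Value>
class shuffled_map {
public:
    void insert(const Key& k, const Value& v) {
        if (index_.emplace(k, values_.size()).second)   // only append when the key is new
            values_.push_back(v);
    }
    const Value& at(const Key& k) const { return values_[index_.at(k)]; }
    void shuffle() {                                    // O(N); the map itself is never touched
        std::shuffle(values_.begin(), values_.end(), rng_);
    }
private:
    std::map<Key, std::size_t> index_;                  // Key -> index into values_
    std::vector<Value> values_;
    std::mt19937 rng_{std::random_device{}()};
};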
Well, using only the map, I can think of this:
Make a flag array with one cell per element of the map; repeatedly generate two random indices i, j with 0 <= i, j < map size, swap the values at those positions, and mark the cells as swapped; iterate until all cells are done.
EDIT: the array is allocated with the size of the map, and is a local array.
I doubt it...
But... Why not write a quick class that has 2 vectors in it: a sorted std::vector of keys and a std::random_shuffle'd std::vector of values? Look up the key using std::lower_bound and use std::distance and std::advance to get the value. Easy!
Without thinking too deeply, this should have similar complexity to std::map and possibly better locality of reference.
Some untested and unfinished code to get you started.
#include <algorithm>
#include <iterator>
#include <stdexcept>
#include <vector>
template <class Key, class T>
class random_map
{
public:
    T& at(Key const& key);
    void shuffle();
private:
    std::vector<Key> d_keys;   // Hold the keys of the *map*; MUST be sorted.
    std::vector<T> d_values;   // d_values[i] is the value paired with d_keys[i].
};
template <class Key, class T>
T& random_map<Key, T>::at(Key const& key)
{
    auto lb = std::lower_bound(d_keys.begin(), d_keys.end(), key);
    if (lb == d_keys.end() || key < *lb) {
        throw std::out_of_range("random_map::at");
    }
    auto delta = std::distance(d_keys.begin(), lb);
    auto it = std::next(d_values.begin(), delta);
    return *it;
}
template <class Key, class T>
void random_map<Key, T>::shuffle()
{
    // Shuffle the values, not the keys; d_keys must stay sorted for lower_bound to work.
    std::random_shuffle(d_values.begin(), d_values.end());
}
If you want to shuffle the map in place, you can implement your own version of random_shuffle for your map. The solution still requires placing the keys into a vector, which is done below using transform:
typedef std::map<int, std::string> map_type;
map_type m;
m[10] = "hello";
m[20] = "world";
m[30] = "!";
std::vector<map_type::key_type> v(m.size());
std::transform(m.begin(), m.end(), v.begin(),
[](const map_type::value_type &x){
return x.first;
});
srand48(time(0));
auto n = m.size();
for (auto i = n-1; i > 0; --i) {
map_type::size_type r = drand48() * (i+1);
std::swap(m[v[i]], m[v[r]]);
}
I used drand48()/srand48() for a uniform pseudo random number generator, but you can use whatever is best for you.
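If you prefer the <random> facilities over drand48, the same Fisher-Yates loop could look like this (a sketch, assuming #include <random> and the m, v, n defined above):
std::mt19937 gen{std::random_device{}()};
for (auto i = n - 1; i > 0; --i) {
    std::uniform_int_distribution<map_type::size_type> dist(0, i);
    std::swap(m[v[i]], m[v[dist(gen)]]);   // swap the current value with a random earlier one
}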
Alternatively, you can shuffle v, and then rebuild the map, such as:
std::random_shuffle(v.begin(), v.end());
map_type m2 = m;
int i = 0;
for (auto &x : m) {
x.second = m2[v[i++]];
}
But, I wanted to illustrate that implementing shuffle on the map in place isn't overly burdensome.
Here is my solution using std::reference_wrapper of C++11.
First, let's make a version of std::random_shuffle that shuffles references. It is a small modification of version 1 from here: using the get method to get to the referenced values.
template< class RandomIt >
void shuffleRefs( RandomIt first, RandomIt last ) {
typename std::iterator_traits<RandomIt>::difference_type i, n;
n = last - first;
for (i = n-1; i > 0; --i) {
using std::swap;
swap(first[i].get(), first[std::rand() % (i+1)].get());
}
}
Now it's easy:
template <class MapType>
void shuffleMap(MapType &map) {
std::vector<std::reference_wrapper<typename MapType::mapped_type>> v;
for (auto &el : map) v.push_back(std::ref(el.second));
shuffleRefs(v.begin(), v.end());
}
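Usage would be along these lines (shuffleRefs uses std::rand, so seed it first; assumes <cstdlib>, <ctime>, <map>, <string>):
std::srand(static_cast<unsigned>(std::time(nullptr)));
std::map<int, std::string> m{{1, "one"}, {2, "two"}, {3, "three"}};
shuffleMap(m);   // keys keep their order; only the mapped values are permuted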
The following program does not compile with an unordered set of pairs of integers, but it does with an unordered set of integers. Can unordered_set and its member functions be used on user-defined types, and how can I define it?
#include <unordered_set>
...
class A{
...
private:
std::unordered_set< std::pair<int, int> > u_edge_;
};
Compiler error:
error: no matching function for call to 'std::unordered_set<std::pair<int, int> >::unordered_set()'
There is no standard way of computing a hash on a pair. Add this definition to your file:
struct pair_hash {
inline std::size_t operator()(const std::pair<int,int> & v) const {
return v.first*31+v.second;
}
};
Now you can use it like this:
std::unordered_set< std::pair<int, int>, pair_hash> u_edge_;
This works, because pair<T1,T2> defines equality. For custom classes that do not provide a way to test equality you may need to provide a separate function to test if two instances are equal to each other.
Of course this solution is limited to a pair of two integers. Here is a link to an answer that helps you define a more general way of making hash for multiple objects.
Your code compiles on VS2010 SP1 (VC10), but it fails to compile with GCC g++ 4.7.2.
However, you may want to consider boost::hash from Boost.Functional to hash a std::pair (with this addition, your code compiles also with g++).
#include <unordered_set>
#include <boost/functional/hash.hpp>
class A
{
private:
std::unordered_set<
std::pair<int, int>,
boost::hash< std::pair<int, int> >
> u_edge_;
};
The problem is that std::unordered_set is using std::hash template to compute hashes for its entries and there is no std::hash specialization for pairs. So you will have to do two things:
Decide what hash function you want to use.
Specialize std::hash for your key type (std::pair<int, int>) using that function.
Here is a simple example:
#include <unordered_set>
namespace std {
template <> struct hash<std::pair<int, int>> {
inline size_t operator()(const std::pair<int, int> &v) const {
std::hash<int> int_hasher;
return int_hasher(v.first) ^ int_hasher(v.second);
}
};
}
int main()
{
std::unordered_set< std::pair<int, int> > edge;
}
As already mentioned in most of the other answers on this question, you need to provide a hash function for std::pair<int, int>. However, since C++11, you can also use a lambda expression instead of defining a hash function. The following code takes the solution given by Sergey as basis:
auto hash = [](const std::pair<int, int>& p){ return p.first * 31 + p.second; };
std::unordered_set<std::pair<int, int>, decltype(hash)> u_edge_(8, hash);
Code on Ideone
I'd like repeat Sergey's disclaimer: This solution is limited to a pair of two integers. This answer provides the idea for a more general solution.
OK, here is a simple solution in which distinct pairs are guaranteed to map to distinct keys. Simply reduce your problem to an existing solution, i.e. convert your pair of int to a string like so:
auto stringify = [](const pair<int, int>& p, string sep = "-")-> string{
    return to_string(p.first) + sep + to_string(p.second);
};
unordered_set<string> myset;
myset.insert(stringify(make_pair(1, 2)));
myset.insert(stringify(make_pair(3, 4)));
myset.insert(stringify(make_pair(5, 6)));
Enjoy!
You need to provide a specialization for std::hash<> that works with std::pair<int, int>. Here is a very simple example of how you could define the specialization:
#include <utility>
#include <unordered_set>
namespace std
{
template<>
struct hash<std::pair<int, int>>
{
size_t operator () (std::pair<int, int> const& p) const
{
// A bad example of computing the hash,
// rather replace with something more clever
return (std::hash<int>()(p.first) + std::hash<int>()(p.second));
}
};
}
class A
{
private:
// This won't give you problems anymore
std::unordered_set< std::pair<int, int> > u_edge_;
};
The other answers here all suggest building a hash function that somehow combines your two integers.
This will work, but produces non-unique hashes. Though this is fine for your use of unordered_set, for some applications it may be unacceptable. In your case, if you happen to choose a bad hash function, it may lead to many unnecessary collisions.
But you can produce unique hashes!
int is usually 4 bytes. You could make this explicit by using int32_t.
The hash's datatype is std::size_t. On most machines, this is 8 bytes. You can check this upon compilation.
Since a pair consists of two int32_t types, you can put both numbers into an std::size_t to make a unique hash.
That looks like this (I can't recall offhandedly how to force the compiler to treat a signed value as though it were unsigned for bit-manipulation, so I've written the following for uint32_t.):
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <utility>
struct IntPairHash {
std::size_t operator()(const std::pair<uint32_t, uint32_t> &p) const {
assert(sizeof(std::size_t)>=8); //Ensure that std::size_t, the type of the hash, is large enough
//Shift first integer over to make room for the second integer. The two are
//then packed side by side.
return (((uint64_t)p.first)<<32) | ((uint64_t)p.second);
}
};
int main(){
std::unordered_set< std::pair<uint32_t, uint32_t>, IntPairHash> uset;
uset.emplace(10,20);
uset.emplace(20,30);
uset.emplace(10,20);
assert(uset.size()==2);
}
You are missing a hash function for std::pair<int, int>. For example,
struct bad_hash
{
std::size_t operator()(const std::pair<int,int>& p) const
{
return 42;
}
};
....
std::unordered_set< std::pair<int, int>, bad_hash> u_edge_;
You can also specialize std::hash<T> for std::hash<std::pair<int,int>>, in which case you can omit the second template parameter.
To make an unordered_set of pairs, you can either create a custom hash function or you can make an unordered_set of strings.
Create a custom hash function: Creating the custom hash depends on the data, so there is no one-size-fits-all hash function. A good hash function should produce few collisions, so you need to consider the collision count while designing it.
Using strings: Using strings is very simple and takes less time. It also guarantees that distinct pairs map to distinct keys. Instead of using an unordered_set<pair<int, int>> we use an unordered_set<string>. We can represent the pair by separating the numbers with a separator (character or string). The example given below shows how you can insert pairs of integers with the separator (";").
auto StringPair = [](const pair<int, int>& x){return to_string(x.first) + ";" + to_string(x.second);};
unordered_set<string> Set;
vector<pair<int, int>> Nums = {{1,2}, {2, 3}, {4, 5}, {1,2}};
for(auto & pair: Nums)
{
Set.insert(StringPair(pair));
}
Just to add my 2 cents here: it's weird that to use unordered_set you need to specify an external hash function. The encapsulation principle would prefer that your class have a 'hash()' member function that returns the hash, and that the unordered_set call that. You would have a Hashable interface, and your class, in this case std::pair, would implement that interface.
I think this is the approach followed by languages like Java. Unfortunately, C++ doesn't follow this logic. The closest you can get to mimicking that is:
derive a class from std::pair (this allows you to have more readable code anyway)
pass the hash function to the unordered_set template
Code Sample
class Point : public pair<int, int> {
public:
int &x = this->first; // allows to use mypoint.x for better readability
int &y = this->second; // allows to use mypoint.y for better readability
Point() {};
Point(int first, int second) : pair{first, second}{};
class Hash {
public:
auto operator()(const Point &p) const -> size_t {
return ((size_t)p.first) << 32 | ((size_t)p.second);
}
};
};
int main()
{
unordered_set< Point, Point::Hash > us;
Point mypoint(1000000000,1);
size_t res = Point::Hash()(mypoint);
cout<<"Hello World " << res << " " << mypoint.x;
return 0;
}
The simple hash function used works if size_t is 64 bits and int is 32 bits; in that case it guarantees no collisions for non-negative values (a negative second member sign-extends into the upper bits when cast to size_t, so it could collide), which makes it close to ideal.
I'm using an std::unordered_map<key,value> in my implementation. I will be using any of the STL containers as the key. I was wondering if it is possible to create a generic hash function for any container being used.
This question on SO offers a generic print function for all STL containers. If you can have that, why can't you have something like a hash function that covers everything? And yeah, a big concern is also that it needs to be fast and efficient.
I was considering writing a simple hash function that converts the values of the key to a size_t and combines them with a simple function like this.
Can this be done?
PS : Please don't use boost libraries. Thanks.
We can get an answer by mimicking Boost and combining hashes.
Warning: Combining hashes, i.e. computing a hash of many things from many hashes of the things, is not a good idea generally, since the resulting hash function is not "good" in the statistical sense. A proper hash of many things should be built from the entire raw data of all the constituents, not from intermediate hashes. But there currently isn't a good standard way of doing this.
Anyway:
First off, we need the hash_combine function. For reasons beyond my understanding it's not been included in the standard library, but it's the centrepiece for everything else:
template <class T>
inline void hash_combine(std::size_t & seed, const T & v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
Using this, we can hash everything that's made up from hashable elements, in particular pairs and tuples (exercise for the reader).
However, we can also use this to hash containers by hashing their elements. This is precisely what Boost's "range hash" does, but it's straight-forward to make that yourself by using the combine function.
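For example, a minimal range hash built on hash_combine (my own version of the my_range_hash used below, not Boost's hash_range):
template <typename It>
std::size_t my_range_hash(It first, It last)
{
    std::size_t seed = 0;
    for (; first != last; ++first)
        hash_combine(seed, *first);   // fold each element's hash into the running seed
    return seed;
}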
Once you're done writing your range hasher, just specialize std::hash and you're good to go:
namespace std
{
template <typename T, class Comp, class Alloc>
struct hash<std::set<T, Comp, Alloc>>
{
inline std::size_t operator()(const std::set<T, Comp, Alloc> & s) const
{
return my_range_hash(s.begin(), s.end());
}
};
/* ... ditto for other containers */
}
If you want to mimic the pretty printer, you could even do something more extreme and specialize std::hash for all containers, but I'd probably be more careful with that and make an explicit hash object for containers:
template <typename C> struct ContainerHasher
{
typedef typename C::value_type value_type;
inline size_t operator()(const C & c) const
{
size_t seed = 0;
for (typename C::const_iterator it = c.begin(), end = c.end(); it != end; ++it)
{
hash_combine<value_type>(seed, *it);
}
return seed;
}
};
Usage:
std::unordered_map<std::set<int>, std::string, ContainerHasher<std::set<int>>> x;