Generic Hash function for all STL-containers - c++

I'm using a std::unordered_map<key,value> in my implementation, and I will be using any of the STL containers as the key. I was wondering whether it is possible to create a generic hash function that works for any container.
This question on SO offers a generic print function for all STL containers. If that is possible, why can't there be a hash function that covers every container? A big concern is also that it needs to be fast and efficient.
I was considering a simple hash function that converts the values of the key to a size_t and applies a simple function like this.
Can this be done?
PS : Please don't use boost libraries. Thanks.

We can get an answer by mimicking Boost and combining hashes.
Warning: Combining hashes, i.e. computing a hash of many things from many hashes of the things, is not a good idea generally, since the resulting hash function is not "good" in the statistical sense. A proper hash of many things should be built from the entire raw data of all the constituents, not from intermediate hashes. But there currently isn't a good standard way of doing this.
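To make the warning concrete: one way to hash "the entire raw data" is to feed the bytes of every constituent into a single running hash state. The sketch below uses 64-bit FNV-1a purely as an illustration. RawByteHasher and hash_triple are hypothetical names that do not appear in this answer, and the approach only makes sense for trivially copyable types without padding bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical incremental hasher: one FNV-1a state is fed with the raw
// bytes of every constituent instead of combining per-element hashes.
class RawByteHasher
{
public:
    // Feeds the object representation of v. Assumption: T is trivially
    // copyable and has no padding bytes (e.g. a fixed-width integer).
    template <class T>
    RawByteHasher& add(const T& v)
    {
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&v);
        for (std::size_t i = 0; i < sizeof(T); ++i)
        {
            state_ ^= p[i];
            state_ *= 1099511628211ULL;              // 64-bit FNV prime
        }
        return *this;
    }

    std::uint64_t value() const { return state_; }

private:
    std::uint64_t state_ = 14695981039346656037ULL;  // 64-bit FNV offset basis
};

// Usage sketch: hash three 64-bit fields as one byte stream.
inline std::uint64_t hash_triple(std::int64_t a, std::int64_t b, std::int64_t c)
{
    return RawByteHasher().add(a).add(b).add(c).value();
}

inline void raw_byte_hasher_example()
{
    // Equal inputs give equal hashes; changing a single byte changes the result.
    assert(hash_triple(1, 2, 3) == hash_triple(1, 2, 3));
    assert(hash_triple(1, 2, 3) != hash_triple(1, 2, 4));
}
```

Whether this is statistically better than combining hashes depends on the byte hash chosen; FNV-1a is used here only because it is short.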
Anyway:
First off, we need the hash_combine function. For reasons beyond my understanding it's not been included in the standard library, but it's the centrepiece for everything else:
template <class T>
inline void hash_combine(std::size_t & seed, const T & v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
Using this, we can hash everything that's made up from hashable elements, in particular pairs and tuples (exercise for the reader).
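For the exercise mentioned above, one possible sketch for pairs and tuples follows. PairHasher and TupleHasher are hypothetical names, not part of the answer, and the tuple version assumes C++17 (std::apply and a fold expression).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <tuple>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// hash_combine as given in the answer.
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical helper: hashes a pair by combining both members in order.
struct PairHasher
{
    template <class A, class B>
    std::size_t operator()(const std::pair<A, B>& p) const
    {
        std::size_t seed = 0;
        hash_combine(seed, p.first);
        hash_combine(seed, p.second);
        return seed;
    }
};

// Hypothetical helper: hashes a tuple by combining every element in order.
struct TupleHasher
{
    template <class... Ts>
    std::size_t operator()(const std::tuple<Ts...>& t) const
    {
        std::size_t seed = 0;
        std::apply([&seed](const Ts&... xs) { (hash_combine(seed, xs), ...); }, t);
        return seed;
    }
};

inline void pair_tuple_example()
{
    std::unordered_map<std::pair<int, int>, std::string, PairHasher> by_pair;
    by_pair[std::make_pair(1, 2)] = "a";
    assert(by_pair.count(std::make_pair(1, 2)) == 1);

    std::unordered_set<std::tuple<int, std::string>, TupleHasher> by_tuple;
    by_tuple.insert(std::make_tuple(1, std::string("x")));
    assert(by_tuple.count(std::make_tuple(1, std::string("x"))) == 1);
}
```

Passing the hasher as a template argument, rather than specializing std::hash for std::pair, keeps the sketch out of namespace std.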
However, we can also use this to hash containers by hashing their elements. This is precisely what Boost's "range hash" does, but it's straight-forward to make that yourself by using the combine function.
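A minimal sketch of such a range hasher, assuming the hash_combine shown above. The name my_range_hash is chosen to match the specialization in this answer; the body is an assumption, not Boost's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <set>
#include <vector>

// hash_combine as given in the answer.
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Folds hash_combine over the elements of [first, last), in iteration order.
template <class It>
std::size_t my_range_hash(It first, It last)
{
    typedef typename std::iterator_traits<It>::value_type value_type;
    std::size_t seed = 0;
    for (; first != last; ++first)
        hash_combine<value_type>(seed, *first);
    return seed;
}

inline void range_hash_example()
{
    std::vector<int> v{1, 2, 3};
    std::set<int> s{3, 1, 2};   // iterates as 1, 2, 3
    std::vector<int> empty;
    // Same elements in the same order give the same hash.
    assert(my_range_hash(v.begin(), v.end()) == my_range_hash(s.begin(), s.end()));
    // An empty range leaves the seed untouched.
    assert(my_range_hash(empty.begin(), empty.end()) == 0);
}
```

Because the result depends on iteration order, this suits ordered containers such as std::set or std::vector. Two equal unordered containers may iterate in different orders, so they would need an order-independent combination instead.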
Once you're done writing your range hasher, just specialize std::hash and you're good to go:
namespace std
{
template <typename T, class Comp, class Alloc>
struct hash<std::set<T, Comp, Alloc>>
{
inline std::size_t operator()(const std::set<T, Comp, Alloc> & s) const
{
return my_range_hash(s.begin(), s.end());
}
};
/* ... ditto for other containers */
}
If you want to mimic the pretty printer, you could even do something more extreme and specialize std::hash for all containers, but I'd probably be more careful with that and make an explicit hash object for containers:
template <typename C> struct ContainerHasher
{
typedef typename C::value_type value_type;
inline size_t operator()(const C & c) const
{
size_t seed = 0;
for (typename C::const_iterator it = c.begin(), end = c.end(); it != end; ++it)
{
hash_combine<value_type>(seed, *it);
}
return seed;
}
};
Usage:
std::unordered_map<std::set<int>, std::string, ContainerHasher<std::set<int>>> x;

Related

Removing duplicates with three int64 as a key in C++

I am dealing with a dataset in which I want to remove duplicates. Duplicates are defined by having the same value for three fields stored as int64.
I am using C++17. I want my code to be as fast as possible (memory is less of a constraint). I do not care about ordering. I know nothing about the distribution of the int64 values.
My idea is to use an unordered_set with a hash of the three int64 as a key.
Here are my questions:
Is the unordered_set the best option? How about a map?
Which hash function should I use?
Is it a good idea to put the three int64 into a string then hash that string?
Thanks for your help.
I would use:
std::unordered_map<uint64_t, std::unordered_map<uint64_t, std::unordered_set<uint64_t>>>
Is the unordered_set the best option? How about a map?
Anything unordered_ (I believe) will use hash tables, ordered - some kind of binary tree.
Which hash function should I use?
Whatever std:: provides for uint64_t, unless you have a reason to believe that you can do better.
Is it a good idea to put the three int64 into a string then hash that string?
What can you do with strings that you can't with integers? It most likely will be longer...
Thanks, Vlad!
I also found an interesting implementation here: How do I combine hash values in C++0x?
inline void hash_combine(std::size_t& seed) { }
template <typename T, typename... Rest>
inline void hash_combine(std::size_t& seed, const T& v, Rest... rest) {
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
hash_combine(seed, rest...);
}
Usage:
std::size_t h=0;
hash_combine(h, obj1, obj2, obj3);
which I then plug into a flat std::unordered_set<std::size_t>.
This is roughly 2.5x faster than your proposal on my machine. On the other hand, your solution is simpler to read and does not require a hand-crafted hasher.
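One caveat with storing only the combined hash: two different triples that happen to collide would be treated as duplicates. A sketch that keeps the full key in the set and uses the combined hash only for bucketing could look like this. Key, KeyHash and remove_duplicates are hypothetical names, and no speed claim is made for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical record key: the three int64 fields that define a duplicate.
struct Key
{
    std::int64_t a, b, c;
    bool operator==(const Key& o) const { return a == o.a && b == o.b && c == o.c; }
};

// Hypothetical hasher using the hash_combine formula quoted above.
struct KeyHash
{
    static void combine(std::size_t& seed, std::int64_t v)
    {
        seed ^= std::hash<std::int64_t>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }

    std::size_t operator()(const Key& k) const
    {
        std::size_t seed = 0;
        combine(seed, k.a);
        combine(seed, k.b);
        combine(seed, k.c);
        return seed;
    }
};

// Keeps the first occurrence of every key, preserving input order.
inline std::vector<Key> remove_duplicates(const std::vector<Key>& in)
{
    std::unordered_set<Key, KeyHash> seen;
    seen.reserve(in.size());
    std::vector<Key> out;
    for (const Key& k : in)
        if (seen.insert(k).second)   // .second is true only for a new key
            out.push_back(k);
    return out;
}

inline void remove_duplicates_example()
{
    std::vector<Key> in{{1, 2, 3}, {3, 2, 1}, {1, 2, 3}, {2, 3, 4}};
    std::vector<Key> out = remove_duplicates(in);
    assert(out.size() == 3);
    assert(out[0] == (Key{1, 2, 3}));
    assert(out[2] == (Key{2, 3, 4}));
}
```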
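And then this solution is even faster (std::hash has no specialization for std::pair, so the set needs a user-supplied hasher, passed here as the PairHash template parameter):
template <class PairHash>
bool insert_key_and_exists (const int64_t &a, const int64_t &b, std::unordered_set< std::pair<int64_t, int64_t>, PairHash > *dict)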
{
auto c = std::make_pair(a,b);
if (dict->contains(c))
return true;
dict->insert(c);
return false;
}
Hash tables used in things like unordered_map are excellent for large objects, especially if the key is smaller. But for this you would probably get the best speed using a vector: sort it, then run std::unique.
#include <cstdint>
#include <vector>
#include <array>
#include <algorithm>
#include <iostream>
// sort and remove duplicates
void remove_dups(std::vector<std::array<int64_t, 3>>& v)
{
std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end()), v.end());
}
int main()
{
std::vector<std::array<int64_t, 3>> v{ {1,2,3}, {3,2,1}, {1,2,3}, {2,3,4} };
remove_dups(v);
for (const auto& i3 : v)
{
for (const auto& i : i3)
std::cout << i << " ";
std::cout << '\n';
}
}

C++ fixed-capacity associative container

I am looking for a container like std::unordered_map that does not use any dynamic allocation. I believe this would be the case for any associative container with a fixed number of keys, or even keys that have to be chosen at compile time.
I am not looking for a constexpr or compile time hash map because I would like to be able to update the values in the map.
Example use case:
FixedCapacityMap<std::string_view, int> fruits {
{"n_apples", 0},
{"n_pairs", 0}
};
fruits["n_apples"] += 1;
fruits["n_pairs"] += 1;
Does anyone know if such a library exists, and if not how something like this could be implemented?
A necessary consequence of the "no dynamic allocation" rule is that the underlying data is embedded in your type, so you need to specify the number of keys as a template parameter as well.
If the keys are known at compile time you can construct a fixed-size hash table over that.
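When the keys are only known at run time but their number is bounded, a fixed-size hash table can still live entirely inside the object. The sketch below uses open addressing with linear probing over std::array. ProbingMap is a hypothetical name, and the sketch assumes K and V are default-constructible and that entries are never erased.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string_view>
#include <utility>

// Hypothetical fixed-capacity map: all N slots are embedded in the object,
// so nothing is allocated dynamically.
template <class K, class V, std::size_t N>
class ProbingMap
{
    static_assert(N > 0, "capacity must be positive");

public:
    // Returns the value for key, inserting a default-constructed one if absent.
    // Throws std::length_error when all N slots are taken by other keys.
    V& operator[](const K& key)
    {
        std::size_t i = std::hash<K>()(key) % N;
        for (std::size_t probes = 0; probes < N; ++probes)
        {
            if (!used_[i])
            {
                used_[i] = true;
                slots_[i] = std::make_pair(key, V());
                ++size_;
                return slots_[i].second;
            }
            if (slots_[i].first == key)
                return slots_[i].second;
            i = (i + 1) % N;   // linear probing: try the next slot
        }
        throw std::length_error("ProbingMap is full");
    }

    std::size_t size() const { return size_; }

private:
    std::array<std::pair<K, V>, N> slots_{};
    std::array<bool, N> used_{};
    std::size_t size_ = 0;
};

inline void probing_map_example()
{
    ProbingMap<std::string_view, int, 4> fruits;
    fruits["n_apples"] += 1;
    fruits["n_pairs"] += 1;
    fruits["n_apples"] += 1;
    assert(fruits["n_apples"] == 2);
    assert(fruits["n_pairs"] == 1);
    assert(fruits.size() == 2);
}
```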
In general, the next best thing is either chained hashing or binary search. Here is a small implementation that uses binary search over a std::array<std::pair<K,V>, N>:
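#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <stdexcept>
#include <utility>
template <class K, class V, std::size_t N>
class FixedCapacityMap {
public:
using value_type = std::pair<K,V>;
FixedCapacityMap(std::initializer_list<value_type> init) {
assert(init.size() == N);
std::copy(init.begin(), init.end(), store.begin());
// binary search requires the entries to be sorted by key
std::sort(store.begin(), store.end(),
[](const value_type& a, const value_type& b) { return a.first < b.first; });
}
V& operator[](const K& key) {
// compare on the key only, so the stored value does not affect the search
auto it = std::lower_bound(store.begin(), store.end(), key,
[](const value_type& e, const K& k) { return e.first < k; });
if (it == store.end() || it->first != key)
throw std::out_of_range("key not found");
return it->second;
}
private:
std::array<value_type, N> store;
};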
I was able to find a library with this functionality:
https://github.com/serge-sans-paille/frozen
It allows for constexpr and constinit ordered and unordered maps with (newly added by me) the ability to update values at run time.

Why doesn’t std::map provide key_iterator and value_iterator?

I am working in a C++03 environment, and applying a function to every key of a map is a lot of code:
const std::map<X,Y>::const_iterator end = m_map.end();
for (std::map<X,Y>::const_iterator element = m_map.begin(); element != end; ++element)
{
func( element->first );
}
If a key_iterator existed, the same code could take advantage of std::for_each:
std::for_each( m_map.key_begin(), m_map.key_end(), &func );
So why isn’t it provided? And is there a way to adapt the first pattern to the second one?
Yes, it is a silly shortcoming. But it's easily rectified: you can write your own generic key_iterator class which can be constructed from the map (pair) iterator. I've done it, it's only a few lines of code, and it's then trivial to make value_iterator too.
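As an illustration of how small such a class can be, here is a sketch of a read-only key iterator. The names key_iterator, key_begin and key_end are hypothetical, and only the forward-iterator operations are provided.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

// Hypothetical adapter: wraps a map's const_iterator and dereferences to the key.
template <class Map>
class key_iterator
{
public:
    typedef std::forward_iterator_tag iterator_category;
    typedef typename Map::key_type value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const value_type* pointer;
    typedef const value_type& reference;

    key_iterator() : it_() {}
    explicit key_iterator(typename Map::const_iterator it) : it_(it) {}

    reference operator*() const { return it_->first; }
    pointer operator->() const { return &it_->first; }
    key_iterator& operator++() { ++it_; return *this; }
    key_iterator operator++(int) { key_iterator tmp(*this); ++it_; return tmp; }
    bool operator==(const key_iterator& o) const { return it_ == o.it_; }
    bool operator!=(const key_iterator& o) const { return it_ != o.it_; }

private:
    typename Map::const_iterator it_;
};

// Free functions standing in for the m_map.key_begin() / key_end() of the question.
template <class Map>
key_iterator<Map> key_begin(const Map& m) { return key_iterator<Map>(m.begin()); }

template <class Map>
key_iterator<Map> key_end(const Map& m) { return key_iterator<Map>(m.end()); }

inline void key_iterator_example()
{
    std::map<int, char> m;
    m[3] = 'c';
    m[1] = 'a';
    m[2] = 'b';
    std::vector<int> keys(key_begin(m), key_end(m));   // keys arrive in sorted order
    assert(keys.size() == 3);
    assert(keys[0] == 1 && keys[1] == 2 && keys[2] == 3);
}
```

With this in place, the loop from the question could be written as std::for_each(key_begin(m_map), key_end(m_map), &func). A value iterator would differ only in returning it_->second.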
There is no need for std::map<K, V> to provide iterators for the keys and/or the values: such an iterator can easily be built based on the existing iterator(s). Well, it isn't as easy as it should/could be but it is certainly doable. I know that Boost has a library of iterator adapters.
The real question could be: why doesn't the standard C++ library provide iterator adapters to project iterators? The short answer is, in my opinion: because, in general, you don't want to modify the iterator to choose the property accessed! You rather want to project or, more generally, transform the accessed value but still keep the same notion of position. Put differently, I think it is necessary to separate the notion of positioning (i.e., advancing iterators and testing whether their position is valid) from accessing properties at a given position. The approach I envision would look like this:
std::for_each(m_map.key_pm(), m_map.begin(), m_map.end(), &func);
or, if you know the underlying structure obtained from the map's iterator is a std::pair<K const, V> (as is the case for std::map<K, V> but not necessarily for other containers similar to associative containers; e.g., an associative container based on a b-tree would benefit from splitting the key and the value into separate entities):
std::for_each(_1st, m_map.begin(), m_map.end(), &func);
My STL 2.0 page is an [incomplete] write-up with a bit more details on how I think the standard C++ library algorithms should be improved, including the above separation of iterators into positioning (cursors) and property access (property maps).
So why isn’t it provided?
I don't know.
And is there a way to adapt the first pattern to the second one?
Alternatively to making a “key iterator” (cf. my comment and other answers), you can write a small wrapper around func, e.g.:
class FuncOnFirst { // (maybe find a better name)
public:
void operator()(std::map<X,Y>::value_type const& e) const { func(e.first); }
};
then use:
std::for_each( m_map.begin(), m_map.end(), FuncOnFirst() );
Slightly more generic wrapper:
class FuncOnFirst { // (maybe find a better name)
public:
template<typename T, typename U>
void operator()(std::pair<T, U> const& p) const { func(p.first); }
};
There is no need for key_iterator or value_iterator, as the value_type of a std::map is a std::pair<const X, Y>, and this is what the function (or functor) called by for_each() will operate on. There is no performance gain to be had from individual iterators, as the pair is stored together in the underlying node of the binary tree used by the map.
Accessing the key and value through a std::pair is hardly strenuous.
#include <algorithm>
#include <iostream>
#include <map>
typedef std::map<unsigned, unsigned> Map;
void F(const Map::value_type &v)
{
std::cout << "Key: " << v.first << " Value: " << v.second << std::endl;
}
int main(int argc, const char * argv[])
{
Map map;
map.insert(std::make_pair(10, 20));
map.insert(std::make_pair(43, 10));
map.insert(std::make_pair(5, 55));
std::for_each(map.begin(), map.end(), F);
return 0;
}
Which gives the output:
Key: 5 Value: 55
Key: 10 Value: 20
Key: 43 Value: 10
Program ended with exit code: 0

Simple customized iterator with lambdas in C++

Suppose I have a container which contains int, a function that works over containers containing Point, and that I have a function that given some int gives me the corresponding Point it represents (imagine that I have indexed all the points in my scene in some big std::vector<Point>). How do I create a simple (and efficient) wrapper to use my first container without copying its content?
The code I want to type is something like that:
template<typename InputIterator>
double compute_area(InputIterator first, InputIterator beyond) {
// Do stuff
}
template<typename InputIterator, typename OutputIterator>
void convex_hull(InputIterator first, InputIterator beyond, OutputIterator result) {
// Do stuff
}
struct Scene {
std::vector<Point> vertices;
void foo(const std::vector<int> &polygon) {
// Create a simple wrapper with a limited amount of mumbo-jumbo
auto functor = [this, &polygon](int i) -> Point& { return vertices[polygon[i]]; };
MagicIterator polyBegin(0, functor);
MagicIterator polyEnd(polygon.size(), functor);
// NOTE: I want them to act as random access iterator
// And then use it directly
double a = compute_area(polyBegin, polyEnd);
// Bonus: create custom inserter similar to std::back_inserter
std::vector<int> result;
convex_hull(polyBegin, polyEnd, MagicInserter(result));
}
};
So, as you've seen, I'm looking for something a bit generic. I thought about using lambdas as well, but I'm getting a bit mixed up on how to proceed to keep it simple and user-friendly.
I suggest Boost's Transform Iterator. Here's an example usage:
#include <boost/iterator/transform_iterator.hpp>
#include <vector>
#include <cassert>
#include <functional>
struct Point { int x, y; };
template<typename It>
void compute(It begin, It end)
{
while (begin != end) {
begin->x = 42;
begin->y = 42;
++begin;
}
}
int main()
{
std::vector<Point> vertices(5);
std::vector<int> polygon { 2, 3, 4 };
std::function<Point&(int)> functor = [&](int i) -> Point& { return vertices[i]; };
auto polyBegin = boost::make_transform_iterator(polygon.begin(), functor);
auto polyEnd = boost::make_transform_iterator(polygon.end(), functor);
compute(polyBegin, polyEnd);
assert(vertices[2].y == 42);
}
I didn't quite get the part about custom back_inserter. If the type stored in result vector is the same as what the functor returns, the one from standard library will do. Otherwise you can just wrap it in transform_iterator, too.
Note that the functor is stored in a std::function. Boost relies on the functor to have a typedef result_type defined and lambdas don't have it.
I see two methods. Either start with boost::iterator_facade, then write the "functional iterator" type.
Or, use boost::counting_iterator or write your own (they are pretty easy), then use boost::transform_iterator to map that Index iterator over to your Point iterator.
All of the above can also be written directly. I'd write it as a random access iterator: which requires a number of typedefs, ++, --, a number of +=, -=, -, +s, the comparisons, and * and -> to be defined properly. It is a bit of boilerplate, the boost libraries above just make it a touch less boilerplate (by having the boilerplate within itself).
I've written myself a version of this that takes the function type as an argument, then stores the function alongside the index. It advances/compares/etc using the index, and dereferences using the function type. By making the function type std::function<blah()> I get the type-erased version of it, and by making it a decltype of a lambda argument, or the type of a functor, I get a more efficient version.
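A sketch of what such an index-plus-function iterator might look like, without Boost. The names index_iterator and make_index_iterator are hypothetical. One deliberate deviation from the description above: the sketch stores a pointer to the callable rather than a copy, so that the iterator stays assignable even when the callable is a lambda. The callable therefore has to outlive the iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical random access iterator: holds an index and a pointer to a
// callable; dereferencing calls f(index). Positioning uses the index only.
template <class F>
class index_iterator
{
public:
    typedef decltype(std::declval<const F&>()(std::size_t())) reference;
    typedef typename std::decay<reference>::type value_type;
    typedef std::ptrdiff_t difference_type;
    typedef typename std::remove_reference<reference>::type* pointer;
    typedef std::random_access_iterator_tag iterator_category;

    index_iterator() : f_(nullptr), i_(0) {}
    index_iterator(const F& f, std::size_t i) : f_(&f), i_(static_cast<difference_type>(i)) {}

    reference operator*() const { return (*f_)(static_cast<std::size_t>(i_)); }
    pointer operator->() const { return &**this; }   // needs f to return a reference
    reference operator[](difference_type n) const { return *(*this + n); }

    index_iterator& operator++() { ++i_; return *this; }
    index_iterator operator++(int) { index_iterator t(*this); ++i_; return t; }
    index_iterator& operator--() { --i_; return *this; }
    index_iterator operator--(int) { index_iterator t(*this); --i_; return t; }
    index_iterator& operator+=(difference_type n) { i_ += n; return *this; }
    index_iterator& operator-=(difference_type n) { i_ -= n; return *this; }

    friend index_iterator operator+(index_iterator a, difference_type n) { return a += n; }
    friend index_iterator operator+(difference_type n, index_iterator a) { return a += n; }
    friend index_iterator operator-(index_iterator a, difference_type n) { return a -= n; }
    friend difference_type operator-(const index_iterator& a, const index_iterator& b) { return a.i_ - b.i_; }

    friend bool operator==(const index_iterator& a, const index_iterator& b) { return a.i_ == b.i_; }
    friend bool operator!=(const index_iterator& a, const index_iterator& b) { return a.i_ != b.i_; }
    friend bool operator<(const index_iterator& a, const index_iterator& b) { return a.i_ < b.i_; }
    friend bool operator>(const index_iterator& a, const index_iterator& b) { return a.i_ > b.i_; }
    friend bool operator<=(const index_iterator& a, const index_iterator& b) { return a.i_ <= b.i_; }
    friend bool operator>=(const index_iterator& a, const index_iterator& b) { return a.i_ >= b.i_; }

private:
    const F* f_;          // not owned; must outlive the iterator
    difference_type i_;
};

template <class F>
index_iterator<F> make_index_iterator(const F& f, std::size_t i)
{
    return index_iterator<F>(f, i);
}

inline void index_iterator_example()
{
    std::vector<int> vertices{10, 20, 30, 40, 50};
    std::vector<std::size_t> polygon{4, 0, 2};
    // Maps a position in polygon to the vertex it refers to, without copying.
    auto at = [&](std::size_t i) -> int& { return vertices[polygon[i]]; };
    auto first = make_index_iterator(at, 0);
    auto last = make_index_iterator(at, polygon.size());
    assert(last - first == 3);
    assert(std::accumulate(first, last, 0) == 90);   // 50 + 10 + 30
    *first = 7;                                      // writes through to vertices[4]
    assert(vertices[4] == 7);
}
```

Using the lambda's own type as F avoids the indirection of std::function; passing a std::function as F gives the type-erased variant mentioned above.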

C++ algorithms that create their output-storage instead of being applied to existing storage?

The C++ std algorithms define a number of algorithms that take an input and an output sequence, and create the elements of the output sequence from the elements of the input sequence. (Best example being std::transform.)
The std algorithms obviously take iterators, so there's no question that the container for the OutputIterator has to exist prior to the algorithm being invoked.
That is:
std::vector<int> v1; // e.g. v1 := {1, 2, 3, 4, 5};
std::vector<int> squared;
squared.reserve(v1.size()); // not strictly necessary
std::transform(v1.begin(), v1.end(), std::back_inserter(squared),
[](int x) { return x*x; } ); // λ for convenience, needn't be C++11
And this is fine as far as the std library goes. When I find iterators too cumbersome, I often look to Boost.Range to simplify things.
In this case however, it seems that the mutating algorithms in Boost.Range also use OutputIterators.
So I'm currently wondering whether there's any convenient library out there, that allows me to write:
std::vector<int> const squared = convenient::transform(v1, [](int x) { return x*x; });
-- and if there is none, whether there is a reason that there is none?
Edit: example implementation (not sure if this would work in all cases, and whether this is the most ideal one):
template<typename C, typename F>
C transform(C const& input, F fun) {
C result;
std::transform(input.begin(), input.end(), std::back_inserter(result), fun);
return result;
}
(Note: I think convenient::transform will have the same performance characteristics as the handwritten one, as the returned vector won't be copied due to (N)RVO. Anyway, I think performance is secondary for this question.)
Edit/Note: Of the answers(comments, really) given so far, David gives a very nice basic generic example.
And Luc mentions a possible problem with std::back_inserter wrt. genericity.
Both just go to show why I'm hesitating to whip this up myself and why a "proper" (properly tested) library would be preferable to coding this myself.
My question phrased in bold above, namely is there one, or is there a reason there is none remains largely unanswered.
This is not meant as an answer to the question itself, it's a complement to the other answers -- but it wouldn't fit in the comments.
"... well - what if you wanted list or deque or some other sequence type container - it's pretty limiting."
namespace detail {
template<typename Iter, typename Functor>
struct transform {
Iter first, last;
Functor functor;
template<typename Container> // SFINAE is also available here
operator Container()
{
Container c;
std::transform(first, last, std::back_inserter(c), std::forward<Functor>(functor));
return c;
}
};
} // detail
template<typename Iter, typename Functor>
detail::transform<Iter, typename std::decay<Functor>::type>
transform(Iter first, Iter last, Functor&& functor)
{ return { first, last, std::forward<Functor>(functor) }; }
While this would work with a handful of containers, it's still not terribly generic since it requires that the container be 'compatible' with std::back_inserter(c) (BackInsertable?). Possibly you could use SFINAE to instead use std::inserter with c.begin() if c.push_back() is not available (left as an exercise to the reader).
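The exercise could be approached roughly as follows: expression SFINAE picks std::back_inserter when push_back exists and falls back to std::inserter otherwise. This is a sketch under assumptions. The names make_inserter and transform_to are hypothetical, and the fallback inserts at c.end() rather than c.begin(), which is equivalent for a freshly created empty container.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

namespace sketch {

// Preferred overload (int matches the literal 0 exactly): only viable when
// c.push_back(value) is a valid expression.
template <class C>
auto make_inserter(C& c, int)
    -> decltype(c.push_back(std::declval<typename C::value_type>()), std::back_inserter(c))
{
    return std::back_inserter(c);
}

// Fallback (needs an int -> long conversion, so it ranks lower): used when
// the overload above is removed by SFINAE.
template <class C>
auto make_inserter(C& c, long) -> decltype(std::inserter(c, c.end()))
{
    return std::inserter(c, c.end());
}

// Container-returning transform that picks the inserter automatically.
template <class Out, class In, class F>
Out transform_to(const In& in, F f)
{
    Out out;
    std::transform(in.begin(), in.end(), make_inserter(out, 0), f);
    return out;
}

} // namespace sketch

inline void transform_to_example()
{
    std::vector<int> v{1, 2, 3};
    auto square = [](int x) { return x * x; };
    std::vector<int> a = sketch::transform_to<std::vector<int>>(v, square);  // push_back path
    std::set<int> b = sketch::transform_to<std::set<int>>(v, square);        // inserter path
    assert((a == std::vector<int>{1, 4, 9}));
    assert(b.size() == 3 && b.count(9) == 1);
}
```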
All of this also assumes that the container is DefaultConstructible -- consider containers that make use of scoped allocators. Presumably that loss of genericity is a feature, as we're only trying to cover the 'simplest' uses.
And this is in fact why I would not use such a library: I don't mind creating the container just outside, next to the algorithm, to separate the concerns. (I suppose this can be considered my answer to the question.)
IMHO, the point of such an algorithm is to be generic, i.e. mostly container agnostic. What you are proposing is that the transform function be very specific, and return a std::vector, well - what if you wanted list or deque or some other sequence type container - it's pretty limiting.
Why not wrap if you find it so annoying? Create your own little utilities header which does this - after all, it's pretty trivial...
The Boost.Range.Adaptors can be kind of seen as container-returning algorithms. Why not use them?
The only thing that needs to be done is to define a new range adaptor converted<T> that can be piped into the adapted ranges and produces the desired result container:
template<class T> struct converted{}; // dummy tag class
template<class FwdRange, class T>
T operator|(FwdRange const& r, converted<T>){
return T(r.begin(), r.end());
}
Yep, that's it. No need for anything else. Just pipe that at the end of your adaptor list.
Here could be a live example on Ideone. Alas, it isn't, because Ideone doesn't provide Boost in C++0x mode.. meh. In any case, here's main and the output:
int main(){
using namespace boost::adaptors;
auto range = boost::irange(1, 10);
std::vector<int> v1(range.begin(), range.end());
auto squared = v1 | transformed([](int i){ return i * i; });
boost::for_each(squared, [](int i){ std::cout << i << " "; });
std::cout << "\n========================\n";
auto modded = squared | reversed
| filtered([](int i){ return (i % 2) == 0; })
| converted<std::vector<int>>(); // gimme back my vec!
modded.push_back(1);
boost::for_each(modded, [](int i){ std::cout << i << " "; });
}
Output:
1 4 9 16 25 36 49 64 81
========================
64 36 16 4 1
There is no single correct way of enabling
std::vector<int> const squared =
convenient::transform(v1, [](int x) { return x*x; });
without a potential performance cost. You either need an explicit
std::vector<int> const squared =
convenient::transform<std::vector> (v1, [](int x) { return x*x; });
Note the explicit mention of the container type: iterators don't tell anything about the container they belong to. This becomes obvious if you recall that a container's iterator is allowed by the standard to be an ordinary pointer.
Letting the algorithm take a container instead of iterators is not a solution either, because the algorithm then cannot know how to obtain the first and last element. For example, a plain long int array has no begin(), end() or length() methods, not all containers have random access iterators, and not all define operator[]. So there is no truly generic way to take containers.
Another possibility that allows for container-agnostic, container-returning algorithms would be some kind of generic factory (see live at http://ideone.com/7d4E2):
// (not production code; is even lacking allocator-types)
//-- Generic factory. -------------------------------------------
#include <list>
template <typename ElemT, typename CacheT=std::list<ElemT> >
struct ContCreator {
CacheT cache; // <-- Temporary storage.
// Conversion to target container type.
template <typename ContT>
operator ContT () const {
// can't even move ...
return ContT (cache.begin(), cache.end());
}
};
Not so much magic there apart from the templated cast operator. You then return that thing from your algorithm:
//-- A generic algorithm, like std::transform :) ----------------
ContCreator<int> some_ints () {
ContCreator<int> cc;
for (int i=0; i<16; ++i) {
cc.cache.push_back (i*4);
}
return cc;
}
And finally use it like this to write magic code:
//-- Example. ---------------------------------------------------
#include <vector>
#include <iostream>
int main () {
typedef std::vector<int>::iterator Iter;
std::vector<int> vec = some_ints();
for (Iter it=vec.begin(), end=vec.end(); it!=end; ++it) {
std::cout << *it << '\n';
}
}
As you see, in operator T there's a range copy.
A move might be possible by means of template specialization in case the target and source containers are of the same type.
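One way this could look, using a ref-qualified non-template conversion operator instead of a template specialization (a swapped-in technique, requiring C++11). The cache is handed over only when the creator is an rvalue and the target has the cache's own type. The example functions are hypothetical.

```cpp
#include <cassert>
#include <list>
#include <utility>
#include <vector>

template <typename ElemT, typename CacheT = std::list<ElemT> >
struct ContCreator
{
    CacheT cache;   // temporary storage, as in the original

    // Target has the cache's own type and the creator is an rvalue:
    // hand the storage over instead of copying it.
    operator CacheT() && { return std::move(cache); }

    // Any other case: range copy, as in the original.
    template <typename ContT>
    operator ContT() const& { return ContT(cache.begin(), cache.end()); }
};

inline ContCreator<int> some_ints()
{
    ContCreator<int> cc;
    for (int i = 0; i < 4; ++i)
        cc.cache.push_back(i * 4);
    return cc;
}

inline void cont_creator_example()
{
    std::vector<int> vec = some_ints();     // different type: elements are copied
    assert(vec.size() == 4 && vec[3] == 12);

    ContCreator<int> cc = some_ints();
    const int* first = &cc.cache.front();
    std::list<int> lst = std::move(cc);     // same type: the list nodes are taken over
    assert(&lst.front() == first);
}
```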
Edit: As David points out, you can of course do the real work inside the conversion operator, which will probably come at no extra cost (with some more work it can be done more conveniently; this is just for demonstration):
#include <list>
template <typename ElemT, typename Iterator>
struct Impl {
Impl(Iterator it, Iterator end) : it(it), end(end) {}
Iterator it, end;
// "Conversion" + Work.
template <typename ContT>
operator ContT () {
ContT ret;
for ( ; it != end; ++it) {
ret.push_back (*it * 4);
}
return ret;
}
};
template <typename Iterator>
Impl<int,Iterator> foo (Iterator begin, Iterator end) {
return Impl<int,Iterator>(begin, end);
}
#include <vector>
#include <iostream>
int main () {
typedef std::vector<int>::iterator Iter;
const int ints [] = {1,2,4,8};
std::vector<int> vec = foo (ints, ints + sizeof(ints) / sizeof(int));
for (Iter it=vec.begin(), end=vec.end(); it!=end; ++it) {
std::cout << *it << '\n';
}
}
The one requirement is that the target has a push_back method. Using std::distance to reserve a size may lead to sub-optimal performance if the target-container-iterator is not a random-access one.
Again, not an answer, but rather a follow-up to the comments on another answer.
On the genericity of the returned type in the question's code
The code as it stands does not allow the conversion of the return type, but that is easily solved by providing two templates:
template <typename R, typename C, typename F>
R transform( C const & c, F f ) {
R res;
std::transform( c.begin(), c.end(), std::back_inserter(res), f );
return res;
}
template <typename C, typename F>
C transform( C const & c, F f ) {
return transform<C,C,F>(c,f);
}
std::vector<int> src;
std::vector<int> v = transform( src, functor );
std::deque<int> d = transform<std::deque<int> >( src, functor );