Let's say I would like to create an unordered set of unordered multisets of unsigned int. For this, I need to create a hash function to calculate a hash of the unordered multiset. In fact, it has to be good for CRC as well.
One obvious solution is to put the items in a vector, sort it, and return a hash of the result. This seems to work, but it is expensive.
Another approach is to XOR the values, but obviously if an item appears twice or not at all, the result will be the same - which is not good.
Any ideas how I can implement this more cheaply? I have an application that will be doing this for thousands of sets, and relatively big ones.
Since it is a multiset, you would like the hash value to be the same for identical multisets, whose representations might have the same elements presented, added, or deleted in a different order. You would then like the hash value to be commutative, easy to update, and to change for each change in elements. You would also like two changes to not readily cancel their effect on the hash.
One operation that meets all but the last criterion is addition. Just sum the elements. To keep the sum bounded, take it modulo the size of your hash value (e.g. modulo 2^64 for a 64-bit hash). To make sure that inserting or deleting zero values changes the hash, add one to each value first.
A drawback of the sum is that two changes can readily cancel, e.g. replacing 1 3 with 2 2. To address that, you can use the same approach and sum a polynomial of the entries, still retaining commutativity. E.g. instead of summing x+1, you can sum x^2+x+1. Now it is more difficult to contrive sets of changes with the same sum.
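A minimal sketch of that scheme, assuming a 64-bit hash where unsigned wrap-around supplies the mod-2^64 reduction (the function name is illustrative):

#include <cstdint>
#include <unordered_set>

// Commutative multiset hash: sum x^2 + x + 1 over all elements.
// Unsigned overflow gives the mod-2^64 reduction for free.
std::uint64_t multiset_hash(const std::unordered_multiset<unsigned>& ms) {
    std::uint64_t h = 0;
    for (unsigned v : ms) {
        std::uint64_t x = v;
        h += x * x + x + 1; // the +1 makes inserting/removing 0 visible
    }
    return h;
}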
Here's a reasonable hash function for std::unordered_multiset<int>. It would be better if the computations were taken mod a large prime, but the idea stands.
#include <iostream>
#include <unordered_set>

namespace std {
template<>
struct hash<unordered_multiset<int>> {
    typedef unordered_multiset<int> argument_type;
    typedef std::size_t result_type;

    const result_type BASE = static_cast<result_type>(0xA67);

    // Computes BASE^ex by binary exponentiation (implicitly mod 2^64).
    result_type log_pow(result_type ex) const {
        result_type res = 1;
        result_type base = BASE;
        while (ex > 0) {
            if (ex % 2) {
                res = res * base;
            }
            base *= base;
            ex /= 2;
        }
        return res;
    }

    // Sum BASE^element over all elements; addition is commutative,
    // so the result is independent of iteration order.
    result_type operator()(argument_type const& val) const {
        result_type h = 0;
        for (const int& el : val) {
            h += log_pow(el);
        }
        return h;
    }
};
}

int main() {
    std::unordered_set<std::unordered_multiset<int>> mySet;
    std::unordered_multiset<int> set1{1, 2, 3, 4};
    std::unordered_multiset<int> set2{1, 1, 2, 2, 3, 3, 4, 4};
    std::cout << "Hash 1: " << std::hash<std::unordered_multiset<int>>()(set1)
              << std::endl;
    std::cout << "Hash 2: " << std::hash<std::unordered_multiset<int>>()(set2)
              << std::endl;
    return 0;
}
Output:
Hash 1: 2290886192
Hash 2: 286805088
When the modulus is a prime p, the number of collisions is proportional to 1/p. I'm not sure what the analysis is for powers of two. You can make updates to the hash efficient by adding/subtracting BASE^x when you insert/remove the integer x.
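A sketch of that incremental update under the same BASE^x scheme (the wrapper class and its member names are made up for illustration):

#include <cstddef>
#include <unordered_set>

// Hypothetical wrapper that keeps the BASE^x sum up to date on each change,
// so querying the hash is O(1) instead of a full rescan.
class HashedMultiset {
    std::unordered_multiset<int> data_;
    std::size_t hash_ = 0;

    static std::size_t pow_base(std::size_t ex) { // BASE^ex mod 2^64
        std::size_t res = 1, base = 0xA67;
        while (ex > 0) {
            if (ex % 2) res *= base;
            base *= base;
            ex /= 2;
        }
        return res;
    }

public:
    void insert(int x) { data_.insert(x); hash_ += pow_base(x); }
    void erase_one(int x) {
        auto it = data_.find(x);
        if (it != data_.end()) { data_.erase(it); hash_ -= pow_base(x); }
    }
    std::size_t hash() const { return hash_; }
};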
Implement the inner multiset as a value->count hash map.
This avoids the problem that an even number of equal elements cancels out via XOR, in the following way: instead of XOR-ing each element, you construct a new number from the count and the value (e.g. by multiplying them), and then you can build the full hash using XOR.
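A minimal sketch of that idea (the per-entry mixing step is illustrative; any function that combines value and count will do):

#include <cstddef>
#include <unordered_map>

// Multiset represented as value -> count. XOR is safe here because each
// distinct value contributes exactly one term, with its count folded in.
using CountedSet = std::unordered_map<unsigned, std::size_t>;

std::size_t counted_set_hash(const CountedSet& ms) {
    std::size_t h = 0;
    for (const auto& entry : ms) {
        std::size_t mixed = (entry.first + 1) * (entry.second + 0x9E3779B9u);
        h ^= mixed; // XOR over distinct values is order-independent
    }
    return h;
}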
Related
I tried to implement an unordered map for a class called Pair, which stores an integer and a bitset. Then I found out that there isn't a hash function for this class.
Now I want to create my own hash function. But instead of using XOR or comparable functions, I wanted to have a hash function like the following approach:
The bitsets in my class have a fixed size, so I wanted to do the following:
Example: for an instance of Pair with the bitset<6> = 101101 and the integer 6:
create a string = "1011016"
and now use the default hash function on this string.
Because the bitsets have fixed size, each key would be unique.
How could I implement this approach?
Thank you in advance.
To expand on a comment, as requested:
Converting to string and then hashing that string would be somewhat slow. At least slower than it needs to be. A faster approach would be to combine the bit patterns, e.g. like this:
#include <bitset>
#include <cstddef>
#include <functional>

struct Pair
{
    std::bitset<6> bits;
    int intval;
};

template<>
struct std::hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::size_t rtrn = static_cast<std::size_t>(pair.intval);
        // Make room for the bitset's bits, then append them.
        rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
        return rtrn;
    }
};
This works on two assumptions:
The upper bits of the integer are generally not interesting
The size of the bitset is always small compared to size_t
I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.
Possible improvements
We can preserve the upper bits by rotating instead of shifting.
#include <limits>

template<>
struct std::hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::size_t rtrn = static_cast<std::size_t>(pair.intval);
        std::size_t intdigits = std::numeric_limits<decltype(pair.intval)>::digits;
        std::size_t bitdigits = pair.bits.size();
        // Can be simplified to std::rotl(rtrn, bitdigits) in C++20.
        rtrn = (rtrn << bitdigits) | (rtrn >> (intdigits - bitdigits));
        rtrn ^= pair.bits.to_ulong();
        return rtrn;
    }
};
Nothing will change for small integers (except some bitflips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as the integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...
If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't simply XOR them, in order to avoid hash collisions on small integers.
template<>
struct std::hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::hash<decltype(pair.intval)> ihash;
        std::hash<decltype(pair.bits)> bhash;
        return ihash(pair.intval) * 31 + bhash(pair.bits);
    }
};
I picked the simple polynomial hash approach common in Java. I believe GCC uses the same one internally for string hashing. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen because it is a prime number one off a power of two, so multiplication by it can be computed quickly as (x << 5) - x.
struct Object {
    int16_t order = 0;
};
I have a std::list of Object instances, which I want to sort based on the 'order' member variable.
Smaller order values are placed earlier in the list.
While looping, if the current order value is the same as an existing one, I think it should be placed before that existing one, so that I don't have to keep looking at the remaining elements of the list.
The list can have a maximum of 1024 items.
I'm looking for an algorithm that will allow me to sort the list in the fewest iterations, or something close to that. A naive approach that I have now results in a triangular number of iterations, which for 1024 is:
(1024 × (1024 + 1)) / 2 = 524,800
Use the member sort method - std::list::sort - with an appropriate comparator:
#include <cstdint>
#include <list>

int main() {
    std::list<Object> objects{
        Object{4}, Object{2}, Object{6}, Object{7}, Object{42}
    };
    // std::list::sort performs O(N log N) comparisons and is stable.
    objects.sort([](const auto& lhs, const auto& rhs) {
        return lhs.order < rhs.order;
    });
}
I am unable to explain the number of comparisons that std::set does while inserting a new element. Here is an example:
For this code
#include <iostream>
#include <set>
using namespace std;

struct A {
    int i = 0; // counts how many times the comparator was invoked
    bool operator()(int a, int b)
    {
        ++i;
        return a < b;
    }
};

int main()
{
    A a;
    set<int, A> s1(a);
    s1.insert(1);
    cout << s1.key_comp().i << endl;
    s1.insert(2);
    cout << s1.key_comp().i << endl;
}
The output is
0
3
Why does inserting a second element require 3 comparisons? o_O
This is a side effect of using a red-black tree to implement std::set, which requires more comparisons initially than a plain binary search tree would.
I don't know the particulars, as they will depend on your std::set implementation; however, determining the equivalence of two items requires two comparisons, since it is based on the fact that not (x < y) and not (y < x) implies x == y.
Depending on how the tree is optimized, you might thus be paying a first comparison to determine whether it should go left or right, and then two comparisons to check whether it's equal or not.
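In code, the equivalence test boils down to two comparator calls (a sketch, not any particular library's source):

// Ordered containers only have the comparator, so "equal" is derived
// from two calls: neither element is less than the other.
template <typename T, typename Compare>
bool equivalent(const T& x, const T& y, Compare comp) {
    return !comp(x, y) && !comp(y, x);
}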
The Standard has no requirement except that the number of comparisons be O(log N) where N is the number of items already in the set. Constant factors are a quality of implementation issue.
I want to know why the 3rd parameter of std::accumulate (aka reduce) is needed. For those who do not know what accumulate is, it's used like so:

vector<int> V{1,2,3};
int sum = accumulate(V.begin(), V.end(), 0);
// sum == 6

The call to accumulate is equivalent to:

sum = 0; // 0 - value of the 3rd param
for (auto x : V) sum += x;

There is also an optional 4th parameter, which allows replacing addition with any other operation.
The rationale that I've heard is that if you need, let's say, not to add up but to multiply the elements of a vector, you need a different (non-zero) initial value:

vector<int> V{1,2,3};
int product = accumulate(V.begin(), V.end(), 1, multiplies<int>());

But why not do it like Python does - set the initial value to *V.begin(), and use the range starting from V.begin()+1? Something like this:

int sum = accumulate(V.begin()+1, V.end(), *V.begin());

This will work for any op. Why is the 3rd parameter needed at all?
You're making a mistaken assumption: that the type T is the same as the InputIterator's value type.
But std::accumulate is generic, and allows all different kinds of creative accumulations and reductions.
Example #1: Accumulate salary across Employees
Here's a simple example: an Employee class, with many data fields.
class Employee {
    /** All kinds of data: name, ID number, phone, email address... */
public:
    int monthlyPay() const;
};
You can't meaningfully "accumulate" a set of employees. That makes no sense; it's undefined. But, you can define an accumulation regarding the employees. Let's say we want to sum up all the monthly pay of all employees. std::accumulate can do that:
/** Simple lambda defining how to add a single Employee's
 * monthly pay to our existing tally */
auto accumulate_func = [](int accumulator, const Employee& emp) {
    return accumulator + emp.monthlyPay();
};

// And here's how you call the actual calculation:
int TotalMonthlyPayrollCost(const vector<Employee>& V)
{
    return std::accumulate(V.begin(), V.end(), 0, accumulate_func);
}
So in this example, we're accumulating an int value over a collection of Employee objects. Here, the accumulation sum isn't the same type of variable that we're actually summing over.
Example #2: Accumulating an average
You can use accumulate for more complex types of accumulations as well - maybe want to append values to a vector; maybe you have some arcane statistic you're tracking across the input; etc. What you accumulate doesn't have to be just a number; it can be something more complex.
For example, here's a simple example of using accumulate to calculate the average of a vector of ints:
// This time our accumulator isn't an int -- it's a structure that lets us
// accumulate an average.
struct average_accumulate_t
{
    int sum;
    size_t n;
    double GetAverage() const { return ((double)sum)/n; }
};

// Here's HOW we add a value to the average:
auto func_accumulate_average =
    [](average_accumulate_t accAverage, int value) {
        return average_accumulate_t(
            {accAverage.sum + value, // value is added to the total sum
             accAverage.n + 1});     // increment number of values seen
    };

double CalculateAverage(const vector<int>& V)
{
    average_accumulate_t res =
        std::accumulate(V.begin(), V.end(), average_accumulate_t({0,0}),
                        func_accumulate_average);
    return res.GetAverage();
}
Example #3: Accumulate a running average
Another reason you need the initial value is because that value isn't always the default/neutral value for the calculation you're making.
Let's build on the average example we've already seen. But now, we want a class that can hold a running average -- that is, we can keep feeding in new values, and check the average so far, across multiple calls.
class RunningAverage
{
    average_accumulate_t _avg;
public:
    RunningAverage() : _avg({0,0}) {} // initialize to an empty average
    double AverageSoFar() const { return _avg.GetAverage(); }
    void AddValues(const vector<int>& v)
    {
        _avg = std::accumulate(v.begin(), v.end(),
                               _avg, // NOT the default initial {0,0}!
                               func_accumulate_average);
    }
};

int main()
{
    RunningAverage r;
    r.AddValues(vector<int>({1,1,1}));
    std::cout << "Running Average: " << r.AverageSoFar() << std::endl; // 1.0
    r.AddValues(vector<int>({-1,-1,-1}));
    std::cout << "Running Average: " << r.AverageSoFar() << std::endl; // 0.0
}
This is a case where we absolutely rely on being able to set that initial value for std::accumulate - we need to be able to initialize the accumulation from different starting points.
In summary, std::accumulate is good for any time you're iterating over an input range, and building up one single result across that range. But the result doesn't need to be the same type as the range, and you can't make any assumptions about what initial value to use -- which is why you must have an initial instance to use as the accumulating result.
The way things are, it is annoying for code that knows for sure a range isn't empty and wants to start accumulating from the first element of the range onward. Depending on the operation used to accumulate with, it's not always obvious what the 'zero' value to use is.
If, on the other hand, you only provide a version that requires non-empty ranges, it's annoying for callers that don't know for sure that their ranges aren't empty. An additional burden is put on them.
One perspective is that the best of both worlds is of course to provide both pieces of functionality. As an example, Haskell provides both foldl1 and foldr1 (which require non-empty lists) alongside foldl and foldr (which mirror std::accumulate).
Another perspective is that since the one can be implemented in terms of the other with a trivial transformation (as you've demonstrated: std::accumulate(std::next(b), e, *b, f) -- std::next is C++11, but the point still stands), it is preferable to make the interface as minimal as it can be with no real loss of expressive power.
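A sketch of that trivial transformation wrapped up as a helper (the name accumulate1 is made up here; it mirrors Haskell's foldl1 and assumes a non-empty range):

#include <iterator>
#include <numeric>

// Hypothetical helper: fold a non-empty range using its first element
// as the initial value, implemented in terms of std::accumulate.
template <typename InputIt, typename BinaryOp>
auto accumulate1(InputIt first, InputIt last, BinaryOp op)
    -> typename std::iterator_traits<InputIt>::value_type
{
    auto init = *first; // undefined behavior if the range is empty
    return std::accumulate(std::next(first), last, init, op);
}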
Because standard library algorithms are supposed to work for arbitrary ranges of (compatible) iterators. So the first argument to accumulate doesn't have to be begin(); it could be any iterator between begin() and one before end(). It could also use reverse iterators.
The whole idea is to decouple algorithms from data. Your suggestion, if I understand it correctly, requires a certain structure in the data.
If you wanted accumulate(V.begin()+1, V.end(), *V.begin()) you could just write that. But what if you thought v.begin() might be v.end() (i.e. v is empty)? What if v.begin() + 1 is not implemented (because v only implements ++, not generalized addition)? What if the type of the accumulator is not the type of the elements? E.g.

std::accumulate(v.begin(), v.end(), 0L, [](long count, char c){
    return isalpha(c) ? count + 1 : count;
});
It's indeed not strictly needed. Our codebase has 2- and 3-argument overloads which use a T{} value.
However, std::accumulate is pretty old; it comes from the original STL. Our codebase has fancy std::enable_if logic to distinguish between "2 iterators and an initial value" and "2 iterators and a reduction operator". That requires C++11. Our code also uses a trailing return type (auto accumulate(...) -> ...) to calculate the return type, another C++11 feature.
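A sketch of what such overloads might look like (hypothetical; the actual codebase's version surely differs in details):

#include <iterator>
#include <numeric>
#include <type_traits>

// 2-argument form: default-construct the initial value.
template <typename InputIt>
auto accumulate(InputIt first, InputIt last)
    -> typename std::iterator_traits<InputIt>::value_type
{
    using T = typename std::iterator_traits<InputIt>::value_type;
    return std::accumulate(first, last, T{});
}

// 3-argument form taking an operation instead of an initial value;
// enable_if keeps it from colliding with the (first, last, init) form.
template <typename InputIt, typename BinaryOp,
          typename = typename std::enable_if<
              !std::is_convertible<
                  BinaryOp,
                  typename std::iterator_traits<InputIt>::value_type>::value>::type>
auto accumulate(InputIt first, InputIt last, BinaryOp op)
    -> typename std::iterator_traits<InputIt>::value_type
{
    using T = typename std::iterator_traits<InputIt>::value_type;
    return std::accumulate(first, last, T{}, op);
}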
I have a function, f(a,b), that accepts two inputs. I do not know ahead of time which values of a and b will be used. I'm okay with being a little wasteful on memory (I care about speed). I want to be able to check if the output of f(a,b) has already been delivered, and if so, deliver that output again without re-running through the f(a,b) process.
Trivially easy to do in Python with decorators, but C++ is way over my head here.
I would use a std::map (or maybe a std::unordered_map) whose key is a std::pair, or perhaps use a map of maps.
C++11 improvements are probably helpful in that case. Or maybe some Boost thing.
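For instance, a sketch of a generic C++11 memoizer in the spirit of a Python decorator (the name memoize and the std::function signature are illustrative):

#include <functional>
#include <map>
#include <memory>
#include <tuple>

// Hypothetical decorator-style wrapper: returns a function that caches
// the results of a two-argument function in a std::map.
template <typename R, typename A, typename B>
std::function<R(A, B)> memoize(std::function<R(A, B)> f) {
    auto cache = std::make_shared<std::map<std::tuple<A, B>, R>>();
    return [f, cache](A a, B b) -> R {
        auto key = std::make_tuple(a, b);
        auto it = cache->find(key);
        if (it != cache->end()) return it->second; // cache hit
        R result = f(a, b);
        (*cache)[key] = result;
        return result;
    };
}

Note that going through std::function costs an indirection per call; taking the callable as a template parameter would avoid that.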
The poster asks:
I want to be able to check if the output of f(a,b) has already been delivered, and if so, deliver that output again without re-running through the f(a,b) process.
It's pretty easy in C++ using a std::map. The fact that the function has exactly two parameters means that we can use std::pair to describe them.
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

uint64_t real_f(int a, int b) {
    std::cout << "*";
    // Do something tough:
    return (uint64_t)a * b;
}

uint64_t memo_f(int a, int b) {
    typedef std::pair<int, int> key;
    typedef std::map<key, uint64_t> map;
    static map m;
    key k(a, b);
    map::iterator it = m.find(k);
    if (it == m.end()) {
        return m[k] = real_f(a, b); // cache miss: compute and store
    }
    return it->second; // cache hit
}

int main() {
    std::cout << memo_f(1, 2) << "\n";
    std::cout << memo_f(3, 4) << "\n";
    std::cout << memo_f(1, 2) << "\n";
    std::cout << memo_f(3, 4) << "\n";
    std::cout << memo_f(5, 6) << "\n";
}
The output of the above program is:
*2
*12
2
12
*30
The lines without asterisks represent cached results.
With C++11, you could use tasks and futures. Let f be your function:
int f(int a, int b)
{
    // Do hard work.
}
Then you would schedule the function execution, which returns you a handle to the return value. This handle is called a future:
#include <future>

template <typename F>
std::future<typename std::result_of<F()>::type>
schedule(F f)
{
    typedef typename std::result_of<F()>::type result_type;
    std::packaged_task<result_type()> task(std::move(f));
    auto future = task.get_future();
    tasks_.push_back(std::move(task)); // Queue the task; execute it later.
    return future;
}
Then, you could use this mechanism as follows:
auto future = schedule(std::bind(&f, 42, 43)); // Via std::bind.
auto future = schedule([&] { return f(42, 43); }); // Lambda alternative.

if (future.valid())
{
    auto x = future.get(); // Blocks if the result of f(a,b) is not yet available.
    g(x);
}
Disclaimer: my compiler does not support tasks/futures, so the code may have some rough edges.
The main point of this question is the relative expense, in CPU and RAM, between calculating f(a,b) and keeping some sort of lookup table to cache results.
Since an exhaustive table with a 128-bit index is not (yet) feasible, we need to reduce the lookup space to a manageable size - this can't be done without some considerations inside your app:
How big is the actually used space of function inputs? Is there a pattern in it?
What about the temporal component? Do you expect repeated calculations to be close to one another or distributed along the timeline?
What about the distribution? Do you assume a tiny part of the index space consumes the majority of function calls?
I would simply start with a fixed-size array of (a, b, f(a,b)) tuples and a linear search (see the sketch after this list). Depending on your pattern as asked above, you might want to
window-slide it (drop the oldest entry on a cache miss): this is good for localized reoccurrences
have (a, b, f(a,b), count) tuples, with the smallest-count tuple being expelled - this is good for non-localized occurrences
have some key function determine a position in the cache (this is good for tiny index space usage)
whatever else Knuth or Google might have thought of
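A minimal sketch of the fixed-size array with linear search and oldest-entry replacement (the sizes and names are illustrative):

#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical tiny cache: fixed-size array, linear search,
// round-robin replacement of the oldest entry on a miss.
struct Entry { int a, b; uint64_t result; bool used = false; };

class SmallCache {
    std::array<Entry, 64> entries_{};
    std::size_t next_ = 0; // next slot to overwrite (the oldest)
public:
    const uint64_t* lookup(int a, int b) const {
        for (const Entry& e : entries_)
            if (e.used && e.a == a && e.b == b) return &e.result;
        return nullptr; // cache miss
    }
    void store(int a, int b, uint64_t result) {
        entries_[next_] = Entry{a, b, result, true};
        next_ = (next_ + 1) % entries_.size();
    }
};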
You might also want to benchmark repeated calculation against the lookup mechanism if the latter becomes more and more complex: std::map and friends don't come for free, even if they are high-quality implementations.
The easiest way is to use std::map. std::unordered_map does not work out of the box, because the standard library provides no hash for std::pair; std::map only needs operator< on the key, which std::pair already has. You can do the following (an unordered_map sketch follows after the code),
std::map<std::pair<int, int>, int> mp;

int func(int a, int b)
{
    auto it = mp.find({a, b});
    if (it != mp.end()) return it->second; // already computed
    int value = 0; // compute f(a, b) here
    mp[{a, b}] = value;
    return value;
}
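For completeness, a sketch of the unordered_map variant with a hand-rolled pair hash (the mixing constant is illustrative; any reasonable combiner works):

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Illustrative hash for std::pair<int, int>.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        std::size_t h1 = std::hash<int>()(p.first);
        std::size_t h2 = std::hash<int>()(p.second);
        return h1 ^ (h2 + 0x9E3779B9u + (h1 << 6) + (h1 >> 2)); // boost-style combine
    }
};

std::unordered_map<std::pair<int, int>, int, PairHash> cache;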