Background
I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:
1. it is of length 12;
2. the order of the sequence elements does not matter;
3. no element appears twice in the same sequence;
4. all elements are smaller than about 300.
Note that properties 2 and 3 imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.
I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.
Question
I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.
Any suggestions?
http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/
Here's a basic idea; feel free to modify it at will.
Hashing an integer is just the identity.
We use the formula from boost::hash_combine to combine hashes.
We sort the array to get a unique representative.
Code:
#include <algorithm>
#include <cstddef>   // std::size_t

std::size_t array_hash(int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    std::size_t result = 0;
    for (int * p = a; p != a + 12; ++p)
    {
        std::size_t const h = *p; // the "identity hash"
        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
    }
    return result;
}
Update: scratch that. You just edited the question to be something completely different.
If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you a factor of 12!, which is about 29 bits, so it doesn't really make a difference.
You can either look for a 128 bit unsigned integral type and store the sorted, packed set of integers in that directly. Or you can split that range up into two 64-bit integers and compute the hash as above:
uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
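To make that concrete, here is a minimal sketch of the two-64-bit-halves variant; the packing layout (six 9-bit values per half) and the name packed_hash are illustrative assumptions, not part of the original suggestion:

#include <algorithm>
#include <stdint.h>

// Sketch only: sort a copy, pack the twelve 9-bit values into two 64-bit
// halves (6 values * 9 bits = 54 bits each), then combine the halves with
// the hash_combine-style formula given above.
uint64_t packed_hash(int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    uint64_t lower_part = 0, upper_part = 0;
    for (int i = 0; i < 6; ++i)
        lower_part |= static_cast<uint64_t>(a[i]) << (9 * i);
    for (int i = 6; i < 12; ++i)
        upper_part |= static_cast<uint64_t>(a[i]) << (9 * (i - 6));

    return lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
}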
Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
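As a rough sketch of the multiply-and-XOR variant described above (the multiplier below is just a commonly used odd constant picked for illustration; the answer suggests finding a good factor experimentally):

#include <cstddef>

// Sketch only: XOR is commutative, so the result does not depend on the
// order of the elements; equal elements would cancel out, but the question
// guarantees there are none.
std::size_t smear_xor_hash(int (&array)[12])
{
    std::size_t result = 0;
    for (int i = 0; i < 12; ++i)
        result ^= static_cast<std::size_t>(array[i]) * 2654435761u; // Knuth's multiplicative constant
    return result;
}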
I would just use the sum function as the hash and see how far you come with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that they are all < 300. On the other hand, it’s blazingly fast.
#include <numeric>   // std::accumulate

std::size_t hash(int (&arr)[12]) {
    return std::accumulate(arr, arr + 12, 0);
}
Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a number of if…else statements) to sort the 12 values in-place (but I have no idea what a sorting network for 12 values would look like or even whether it’s practical).
EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:
#include <algorithm>   // std::transform
#include <cstddef>     // std::size_t
#include <numeric>     // std::accumulate

struct pow2 {
    int operator ()(int n) const { return n * n; }
};

std::size_t hash(int (&arr)[12]) {
    int raised[12];
    std::transform(arr, arr + 12, raised, pow2());
    return std::accumulate(raised, raised + 12, 0);
}
You could toggle the bits corresponding to each of the 12 integers in a bitset of size 300, then use the formula from boost::hash_combine to combine the ten 32-bit integers implementing this bitset.
This gives a commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
This approach may be generalized if we choose an arbitrary bitset size and if we set or toggle an arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or using a pre-computed lookup table). This results in a Bloom filter or related structures.
We can choose a Bloom filter of size 32 or 64 bits. In this case, there is no need to combine pieces of a large bit vector into a single hash value. For a classical implementation of a Bloom filter of size 32, the optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.
If, instead of the "or" operation of the classical Bloom filter, we choose "xor" and make half the bits of each lookup-table value non-zero, we get the solution mentioned by Jim Balter.
If, instead of the "or" operation, we choose "+" and make approximately half the bits of each lookup-table value non-zero, we get a solution similar to the one suggested by Konrad Rudolph.
I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.
Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.
import itertools
import random
import sys

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(xrange(n), r))
    return tuple(pool[i] for i in indices)

mask_size = 64
mask_size_over_2 = mask_size/2
nmasks = 300
suffix = 'UL'

print 'HashType mask[' + str(nmasks) + '] = {'
for i in range(nmasks):
    combo = random_combination(xrange(mask_size), mask_size_over_2)
    mask = 0
    for j in combo:
        mask |= (1 << j)
    if i < nmasks - 1:
        print '\t' + str(mask) + suffix + ','
    else:
        print '\t' + str(mask) + suffix + ' };'
The C++ array generated by the script is used as follows:
#include <numeric>    // std::accumulate
#include <stdint.h>   // int_least64_t

typedef int_least64_t HashType;

const int maxTableSize = 300;
HashType mask[maxTableSize] = {
    // generated array goes here
};

inline HashType xorrer(HashType const &l, HashType const &r) {
    return l ^ mask[r];
}

HashType hashConfig(HashType *sequence, int n) {
    return std::accumulate(sequence, sequence + n, (HashType)0, xorrer);
}
This algorithm is by far the fastest of those that I have tried (the sort-and-hash-combine approach, the sum of powers using cubes, and the bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.
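For completeness, a hypothetical usage sketch along the lines described in the question: keep the hashes in a std::set and treat "hash already present" as "sequence already present", accepting the rare risk of a collision.

#include <set>

// Sketch only: returns true if the sequence's hash was not seen before
// (and records it), false if an equal hash was already in the set.
bool insert_if_new(std::set<HashType> &seen, HashType *sequence, int n)
{
    return seen.insert(hashConfig(sequence, n)).second;
}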
Related
I currently have a solution but I feel it's not as efficient as it could be for this problem, so I want to see if there is a faster method.
I have two arrays (std::vectors for example). Both arrays contain only unique integer values that are sorted but sparse in value, i.e. 1,4,12,13... What I want to ask is: is there a fast way to find the INDEX into one of the arrays where the values are the same? For example, array1 has values 1,4,12,13 and array2 has values 2,12,14,16. The first matching value's index is 1 in array2. The index into the array is what is important, as I have other arrays that contain data that will use this index that "matches".
I am not confined to using arrays; maps are possible too. I am only comparing the two arrays once; they will not be reused again after the first matching pass. There can be a small to large number of values (300,000+) in either array, but the arrays DO NOT always have the same number of values (that would make things much easier).
Worst case is a linear search, O(N^2). Using a map would get me a better O(log N) lookup, but I would still have to convert an array into a map of value/index pairs.
What I currently have, to avoid any container type conversions, is this: loop over the smaller of the two arrays. Compare the current element of the small array (array1) with the current element of the large array (array2). If the array1 element value is larger than the array2 element value, increment the index for array2 until the array2 element is no longer smaller than the array1 element value (while loop). Then, if the array1 element value is smaller than the array2 element, go to the next loop iteration and begin again. Otherwise they must be equal and I have my index into either array for the matching value.
So in this loop, I am at best O(N) if all values have matches and at worst O(2N) if none match. So I am wondering if there is something faster out there? It's hard to know for sure how often the two arrays will match, but I would say I lean more toward most of the arrays mostly having matches than not.
I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.
Code example:
#include <vector>

int main()
{
    std::vector<int> array1 = {4,6,12,34};
    std::vector<int> array2 = {1,3,6,34,40};

    for (unsigned int i = 0, z = 0; i < array1.size(); i++)
    {
        int value1 = array1[i];
        // check the bound on z before reading array2[z]
        while (z < array2.size() && value1 > array2[z])
            z++;
        if (z >= array2.size())
            break; // reached end of array2
        if (value1 < array2[z])
            continue;
        // we have a match, i and z indices have same value
    }
}
Result will be matching indexes for array1 = [1,3] and for array2= [2,3]
I wrote an implementation of this function using an algorithm that performs better with sparse distributions than the trivial linear merge.
For distributions that are similar†, it has O(n) complexity, but for ranges where the distributions are greatly different, it should perform below linear, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case isn't better than O(n log n). On the other hand, I haven't been able to find that worst case either.
I templated it so that any type of ranges can be used, such as sub-ranges or raw arrays. Technically it works with non-random access iterators as well, but the complexity is much greater, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.
† By similar distribution, I mean that the pair of arrays have many crossings. By crossing, I mean a point where you would switch from one array to another if you were to merge the two arrays together in sorted order.
#include <algorithm>
#include <iterator>
#include <utility>

// helper structure for the search
template<class Range, class Out>
struct search_data {
    // is there any clearer way to get an iterator that might be either
    // a Range::const_iterator or const T*?
    using iterator = decltype(std::cbegin(std::declval<Range&>()));
    iterator curr;
    const iterator begin, end;
    Out out;
};

template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
    return search_data<Range, Out>{
        std::begin(range),
        std::begin(range),
        std::end(range),
        out,
    };
}

template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
    auto search_data1 = init_search_data(in1, out1);
    auto search_data2 = init_search_data(in2, out2);

    // initial order is arbitrary
    auto lesser = &search_data1;
    auto greater = &search_data2;

    // if either range is exhausted, we are finished
    while(lesser->curr != lesser->end
          && greater->curr != greater->end) {
        // difference of first values in each range
        auto delta = *greater->curr - *lesser->curr;
        if(!delta) { // matching value was found
            // store both results and increment the iterators
            *lesser->out++ = std::distance(lesser->begin, lesser->curr++);
            *greater->out++ = std::distance(greater->begin, greater->curr++);
            continue; // then start a new iteration
        }
        if(delta < 0) { // set the order of ranges by their first value
            std::swap(lesser, greater);
            delta = -delta; // delta is always positive after this
        }

        // next crossing cannot be farther than the delta
        // this assumption has the following pre-requisites:
        // range is sorted, values are integers, values in the range are unique
        auto range_left = std::distance(lesser->curr, lesser->end);
        auto upper_limit =
            std::min(range_left, static_cast<decltype(range_left)>(delta));

        // exponential search for a sub-range where the value at the upper bound
        // is greater than the target, and the value at the lower bound is lesser
        auto target = *greater->curr;
        auto lower = lesser->curr;
        auto upper = std::next(lower, upper_limit);
        for(int i = 1; i < upper_limit; i *= 2) {
            auto guess = std::next(lower, i);
            if(*guess >= target) {
                upper = guess;
                break;
            }
            lower = guess;
        }

        // skip all values in lesser
        // that are less than the least value in greater
        lesser->curr = std::lower_bound(lower, upper, target);
    }
}
#include <iostream>
#include <vector>

int main() {
    std::vector<int> array1 = {4,6,12,34};
    std::vector<int> array2 = {1,3,6,34};

    std::vector<std::size_t> indices1;
    std::vector<std::size_t> indices2;

    match_indices(array1, array2,
                  std::back_inserter(indices1),
                  std::back_inserter(indices2));

    std::cout << "indices in array1: ";
    for(std::size_t i : indices1)
        std::cout << i << ' ';
    std::cout << "\nindices in array2: ";
    for(std::size_t i : indices2)
        std::cout << i << ' ';
    std::cout << std::endl;
}
Since the arrays are already sorted you can just use something very much like the merge step of mergesort. This just looks at the head element of each array, and discards the lower element (the next element becomes the head). Stop when you find a match (or when either array becomes exhausted, indicating no match).
This is O(n) and the fastest you can do for arbitrary distributions. With certain clustered distributions a "skip ahead" approach could be used rather than always looking at the next element. This could result in better than O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14 an algorithm could determine there were no matches to be found in as few as one comparison (5 < 10).
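A minimal sketch of that merge step, stopping at the first match (the function name and the (-1, -1) "no match" sentinel are illustrative choices):

#include <cstddef>
#include <utility>
#include <vector>

// Advance whichever side currently has the smaller head element; stop at the
// first equal pair (or when either array is exhausted).
std::pair<long, long> first_match(const std::vector<int>& a, const std::vector<int>& b)
{
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j]) return std::make_pair((long)i, (long)j);
        if (a[i] < b[j]) ++i; else ++j;
    }
    return std::make_pair(-1L, -1L);
}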
What is the range of the stored numbers?
I mean, you say that the numbers are integers, sorted, and sparse (i.e. non-sequential), and that there may be more than 300,000 of them, but what is their actual range?
The reason that I ask is that, if there is a reasonably small upper limit, u, (say, u=500,000), the fastest and most expedient solution might be to just use the values as indices. Yes, you might be wasting memory, but is 4*u really a lot of memory? This depends on your application and your target platform (i.e. if this is for a memory-constrained embedded system, it's less likely to be a good idea than if you have a laptop with 32GiB RAM).
Of course, if the values are more-or-less evenly spread over 0 to 2^31-1, this crude idea isn't attractive, but maybe there are properties of the input values that you can exploit other than simply the range. You might be able to hand-write a fairly simple hash function.
Another thing worth considering is whether you actually need to be able to retrieve the index quickly or if it helps just be able to tell if the index exists in the other array quickly. Whether or not a value exists at a particular index requires only one bit, so you could have a bitmap of the range of the input values using 32x less memory (i.e. mask off 5 LSBs and use that as a bit position, then shift the remaining 27 bits 5 places right and use that as an array index).
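A rough sketch of that bitmap idea (the Bitmap name and the vector-of-uint32_t layout are illustrative assumptions):

#include <stdint.h>
#include <vector>

// One bit per possible value: the low 5 bits select the bit position,
// the remaining bits select the 32-bit word.
struct Bitmap {
    std::vector<uint32_t> words;
    explicit Bitmap(uint32_t max_value) : words((max_value >> 5) + 1, 0) {}
    void set(uint32_t v)            { words[v >> 5] |= uint32_t(1) << (v & 31); }
    bool contains(uint32_t v) const { return (words[v >> 5] >> (v & 31)) & 1; }
};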
Finally, a hybrid approach might be worth considering, where you decide how much memory you're prepared to use (say you decide 256KiB, which corresponds to 64Ki 4-byte integers) and then use that as a lookup-table into much smaller sub-problems. Say you have 300,000 values whose LSBs are pretty evenly distributed. Then you could use the 16 LSBs as indices into a lookup-table of lists that are (on average) only 4 or 5 elements long, which you can then search by other means. A couple of years ago, I worked on some simulation software that had ~200,000,000 cells, each with a cell id; some utility functionality used a binary search to identify cells by id. We were able to speed it up significantly and non-intrusively with this strategy. Not a perfect solution, but a great improvement. (If the LSBs are not evenly distributed, maybe that's a property that you can exploit, or maybe you can choose a range of bits that are, or do a bit of hashing.)
I guess the upshot is “consider some kind of hashing”, even the “identity hash” or simple masking/modulo with a little “your solution doesn't have to be perfectly general” on the side and some “your solution doesn't have to be perfectly space efficient” sauce on top.
This is in C++. I need to keep a count for every pair of numbers. The two numbers are of type "int". I sort the two numbers, so (n1 n2) pair is the same as (n2 n1) pair. I'm using the std::unordered_map as the container.
I have been using the elegant pairing function by Matthew Szudzik of Wolfram Research, Inc. In my implementation, the function gives me a unique number of type "long" (64 bits on my machine) for every pair of two numbers of type "int". I use this long as my key for the unordered_map (std::unordered_map). Is there a better way to keep count of such pairs? By better I mean faster and, if possible, with lower memory usage.
Also, I don't need all the bits of long. Even though you can assume that the two numbers can range up to max value for 32 bits, I anticipate the max possible value of my pairing function to require at most 36 bits. If nothing else, at least is there a way to have just 36 bits as key for the unordered_map? (some other data type)
I thought of using bitset, but I'm not exactly sure if the std::hash will generate a unique key for any given bitset of 36 bits, which can be used as key for unordered_map.
I would greatly appreciate any thoughts, suggestions etc.
First of all, I think you came in with a wrong assumption. For std::unordered_map and std::unordered_set the hash does not have to be unique (and it cannot be in principle for data types like std::string, for example); there should just be a low probability that 2 different keys generate the same hash value. And if there is a collision it would not be the end of the world, access would merely be slower. I would generate a 32-bit hash from the 2 numbers and, if you have an idea of typical values, test the probability of hash collision and choose the hash function accordingly.
For that to work you should use a pair of 32-bit numbers as the key in std::unordered_map and provide a proper hash function. Calculating a unique 64-bit key and using it with the hash map is questionable, as the hash map will then calculate another hash of this key, so it is possible you are making it slower.
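A sketch of that suggestion: use the (sorted) pair itself as the unordered_map key with a small hash functor. The 64-bit mixing constant below is MurmurHash3's finalizer constant, used here only as an example; any reasonable mix will do.

#include <cstdint>
#include <cstddef>
#include <unordered_map>
#include <utility>

struct PairHash {
    std::size_t operator()(const std::pair<uint32_t, uint32_t>& p) const {
        uint64_t h = (uint64_t(p.first) << 32) | p.second;
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        return static_cast<std::size_t>(h);
    }
};

typedef std::unordered_map<std::pair<uint32_t, uint32_t>, int, PairHash> PairCount;

// Usage (with n1, n2 the two ints): store the pair in sorted order, e.g.
//   counts[std::make_pair(std::min(n1, n2), std::max(n1, n2))] += 1;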
About the 36-bit key: this is not a good idea unless you have a special CPU that handles 36-bit data. Your data will either be aligned on a 64-bit boundary, in which case you get no benefit from saving memory, or you will pay the penalty of unaligned data access. In the first case you would just have extra code to extract 36 bits from 64-bit data (if the processor supports it). In the second case your code will be slower than with a 32-bit hash, even if the latter has some collisions.
If that hash map is a bottleneck you may consider a different hash map implementation, such as goog-sparsehash.sourceforge.net
Just my two cents: the pairing functions that you've got in the article are WAY more complicated than you actually need. Mapping two 32-bit unsigned values to 64 bits uniquely is easy. The following does that, and even handles the non-pair states, without hitting the math peripheral too heavily (if at all).
#include <stdint.h>
#include <stdlib.h>   // abs

uint64_t map(uint32_t a, uint32_t b)
{
    uint64_t x = a + b;                   // sum (see the overflow note below)
    uint64_t y = abs((int32_t)(a - b));   // absolute difference
    uint64_t ans = (x << 32) | (y);
    return ans;
}

void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
    uint64_t x = map >> 32;
    uint64_t y = map & 0xFFFFFFFFL;
    *a = (x + y) >> 1;
    *b = (x - *a);
}
Another alternative:
uint64_t map(uint32_t a, uint32_t b)
{
    bool bb = a > b;
    uint64_t x = ((uint64_t)a) << (32 * (bb));
    uint64_t y = ((uint64_t)b) << (32 * !(bb));
    uint64_t ans = x | y;
    return ans;
}

void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
    *a = map >> 32;
    *b = map & 0xFFFFFFFF;
}
That works as a unique key. You can easily modify that to be a hash function provider for unordered map, though whether or not that will be faster than std::map is dependent on the number of values you've got.
NOTE: the first version will fail if the sum a+b exceeds 32 bits.
I would like to have a user-defined key in a C++ std::map. The key is a binary representation of an integer set with maximum value 2^V so I can't represent all 2^V possible values. I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Now the problem is that to put this user-defined bitset as a key in a std::map, I need to define a valid comparison between bitset values. But if I have a maximum size of, say, V=1000, then I cannot get a single number I can compare, let alone aggregate them all, i.e., 2^1000 is not representable.
Therefore my question is, suppose I have two different sets (by setting the right bits in my bitset representation) and I cannot represent the final number because it will overflow:
id_1 = 2^0 + 2^1 + ... + 2^V
id_2 = 2^0 + 2^1 + ... + 2^V
Is there a suitable transformation that would lead to a value I can compare? I need to be able to say id_1 < id_2 so I would like to transform a sum of exponentials to a value that is representable BUT maintaining the invariant of the "less than". I was thinking along the lines of e.g. applying a log transformation in a clever way to preserve "less than".
Here is an example:
set_1 = {2,3,4}; set_2 = {8}
id(set_1) = 2^2 + 2^3 + 2^4 = 28; id(set_2) = 2^8 = 256
id(set_1) < id(set_2)
Perfect! How about a general set that can have {1,...,V}, and thus 2^V possible subsets?
I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Supposing that this array is accessed via a data member ra of the key type Key, and both arrays are of length N, then you want a comparator something like this:
bool operator<(const Key &lhs, const Key &rhs) {
return std::lexicographical_compare(lhs.ra, &lhs.ra[N], rhs.ra, &rhs.ra[N]);
}
This implicitly considers the array to be big-endian, i.e. the first uint64_t is the most significant. If you don't like that, that's fair enough, since you might already have in mind some relative significance for whatever order you've stored your V bits into your array. There's no great mystery to lexicographical_compare, so just look at an example implementation and modify as required.
This is called "lexicographical order". Other than the facts that I've used uint64_t instead of char and both arrays are the same length, it is how strings are compared[*] -- in fact the use of uint64_t isn't important, you could just use std::memcmp in your comparator instead of comparing 64-bit chunks. operator< for strings doesn't work by converting the whole string to an integer, and neither should your comparator.
[*] until you bring locale-specific collation rules into play.
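A hypothetical end-to-end sketch (assuming V = 1000, which rounds up to N = 16 words of 64 bits; the names are illustrative):

#include <stdint.h>
#include <algorithm>
#include <map>

const int N = 16;   // e.g. V = 1000 bits rounds up to 16 uint64_t words

struct Key {
    uint64_t ra[N];
};

bool operator<(const Key &lhs, const Key &rhs) {
    return std::lexicographical_compare(lhs.ra, &lhs.ra[N], rhs.ra, &rhs.ra[N]);
}

// std::map only needs the operator< above; no numeric id is ever computed.
typedef std::map<Key, int> SetMap;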
So I have two different field types, a char* of length n and an int. I want to generate a hash value using both as keys. I add the last 16 bits of the int variable; we'll call the sum integer x. Then I use collate::hash to generate a hash value for the char*; we'll call it integer y. I then add x+y together, then use hash with the sum to generate a hash value. Let's say I want to limit the hash values to a range of [1,4]. Can I just use hashvalue%4 to get what I want? Also, if there is a better way of generating a hash value from the two keys, let me know.
For the range [1,4] you will have to add 1 to hashvalue%4. However, a hash range of 4 is very small. That will give you a lot of collisions, limiting the effectiveness of the hash (that is, many different values of the fields will give you the same hash value).
I recommend that you add more bits to the hash, maybe 64K values (a 16-bit hash). That will give you fewer collisions. Also, why not use std::unordered_map, which already implements a hash table?
Finally, as for the hashing function, it depends on the meaning of each of the fields. For example, if in your implementation only the low 16 bits of the integers count, then the hash should be based only on those bits. There are general hashing functions for strings and for integers, so you could use any of them. Finally, for combining the hash values of both fields, summing (or xor-ing) them is a common approach. Just ensure that the generated hash values are spread as evenly over the range as possible.
So, what you describe in many words is written:
struct noname {
    int ifield;
    char cfield[N];   // N = length of the char array
};

int hash(const noname &n) {
    int x = n.ifield;
    int y = ???(n.cfield);   // some string hash of cfield
    return x + y;
    // return (x + y) & 3;   // if the result must be limited to [0,3]
}
Whether this hash function is good depends on the data. For example, if the ifield is always a multiple of 4, it is clearly bad. If the values of the fields are roughly evenly distributed, everything is fine.
Well, except for your requirement to limit the hash range to [1;4]. First, [0;3] is easier to compute, second, such a small range would be appropriate if you only have two or three different things that will have their hash code generated. The range should be at least twice as large as the number of expected different elements.
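One possible way to fill in the string-hash placeholder above, using FNV-1a purely as an example (any decent string hash would do); the reduction to the question's small range is shown commented out:

#include <cstddef>

// 32-bit FNV-1a over a NUL-terminated string.
std::size_t fnv1a(const char *s) {
    std::size_t h = 2166136261u;
    while (*s) {
        h ^= static_cast<unsigned char>(*s++);
        h *= 16777619u;
    }
    return h;
}

std::size_t hash_fields(int ifield, const char *cfield) {
    return static_cast<std::size_t>(ifield) + fnv1a(cfield);
    // return (static_cast<std::size_t>(ifield) + fnv1a(cfield)) & 3;   // range [0,3]
}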
I want to hash a char array in to an int or a long. The resulting value has to adhere to a given precision value.
The function I've been using is given below:
#include <cmath>   // pow

int GetHash(const char* zKey, int iPrecision /*= 6*/)
{
    ///// FROM : http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp
    unsigned long h = 0;
    long M = pow(10, iPrecision);
    while (*zKey)
    {
        h = (h << 4) + *zKey++;
        unsigned long g = h & 0xF0000000L;
        if (g) h ^= g >> 24;
        h &= ~g;
    }
    return (int)(h % M);
}
The string to be hashed is similar to "SAEUI1210.00000010_1".
However, this produces duplicate values in some cases.
Are there any good alternatives which wouldn't produce the same hash for different string values?
The very definition of a hash is that it produces duplicate values for some inputs, because the hash value range is smaller than the space of the hashed data.
In theory, a 32-bit hash has enough range to hash all strings of up to about 5 characters (A-Z, a-z, 0-9 only) without causing a collision. In practice, hashes are not a perfect permutation of the input. Given a 32-bit hash, you can expect to get hash collisions after hashing around 2^16 random inputs, due to the birthday paradox.
Given a static set of data values, it's always possible to construct a hash function designed specifically for them, which will never collide with itself (of course, the size of its output will be at least log(|data set|)). However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.
That being said, here are a few alternatives which should get you started (they are designed to minimize collisions)
Every hash will have collisions. Period. How often they occur is described by the birthday problem.
You may want to check cryptographic hash functions like MD5 (relatively fast, if you don't care that it's insecure), but they will also have collisions.
Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have this additional constraint of precision (0-5?), you are going to hit collisions far more often.
MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.