I have a struct called Point. Point is pretty simple:
struct Point
{
Row row;
Column column;
// some other code for addition and subtraction of points is there too
};
Row and Column are basically glorified ints, but I got sick of accidentally transposing the input arguments to functions and gave them each a wrapper class.
Right now I use a set of points, but repeated lookups are really slowing things down. I want to switch to an unordered_set.
So, I want to have an unordered_set of Points. Typically this set might contain, for example, every point on an 80x24 terminal = 1920 points. I need a good hash function. I just came up with the following:
struct PointHash : public std::unary_function<Point, std::size_t>
{
result_type operator()(const argument_type& val) const
{
return val.row.value() * 1000 + val.column.value();
}
};
However, I'm not sure that this is really a good hash function. I wanted something fast, since I need to do many lookups very quickly. Is there a better hash function I can use, or is this OK?
Following the technique given in Effective Java (2nd edition), and quoted from there in Programming in Scala: have a prime constant (we'll say 53, but you may find something larger gives a more even distribution here), and perform multiplication and addition as follows:
(53 + int_hash(row)) * 53 + int_hash(col)
For more values (say you add a z coordinate), just keep nesting, like
((53 + int_hash(row)) * 53 + int_hash(col)) * 53 + int_hash(z)
Where int_hash is a function for hashing a single integer. You can visit this page to find a bunch of good hash functions for single integers.
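As a concrete sketch for the Point struct above (PointHash53 is a made-up name, and I'm assuming Row::value() and Column::value() return int, with std::hash<int> standing in for int_hash):

struct PointHash53
{
    std::size_t operator()(const Point& p) const
    {
        std::hash<int> int_hash;                        // any good single-integer hash works here
        std::size_t h = 53 + int_hash(p.row.value());   // (53 + int_hash(row))
        return h * 53 + int_hash(p.column.value());     // ... * 53 + int_hash(col)
    }
};
// Usage: std::unordered_set<Point, PointHash53> points;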
With a small enough domain, you might be able to come up with a perfect hash function. Or perhaps just use a two-dimensional array. For larger amounts of data, use a prime-number-based multiplication and mod to your table size; if your table size is a power of two, you can replace the mod with a bitmask, which eliminates the divide/mod that can be costly on smaller, embedded-type systems.
Or use any of the many integer-based hash functions that already exist. Make sure you measure any hash function you create for collisions. Enough collisions will eliminate any gains over O(log n) structures such as maps/trees.
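For the 80x24 terminal case in the question, a perfect hash is easy to sketch (assuming 0 <= row < 24 and 0 <= column < 80, so every point maps to a distinct value in [0, 1920) and there are no collisions at all; the struct name is made up):

struct PointPerfectHash
{
    std::size_t operator()(const Point& p) const
    {
        return static_cast<std::size_t>(p.row.value()) * 80
             + static_cast<std::size_t>(p.column.value());
    }
};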
I guess doing a bit shift by 10 (i.e., multiplying by 1024) instead would be more efficient than multiplying by 1000:
return (val.row.value() << 10) + val.column.value();
Related
If the input data entries are around 10 raised to the power of 9, do we keep the size of the hash table the same as the input size or reduce the size? How do we decide the table size?
If we are using numbers in the range of 10 raised to the power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Kindly explain how these two things work. It's getting quite confusing. Thanks!!
I tried to make the table size around 75% of the input data size, which you can call X. Then I did key % X to get the hash code. But I am not sure if this is correct.
If the input data entries are around 10 raised to the power of 9, do we keep the size of the hash table the same as the input size or reduce the size? How do we decide the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements, have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to more collisions, longer lists and longer search times. At load factors under 5 or 10, with an OK hash function, the slowdown will still be roughly linear on average (so 5 or 10x slower than at load factor 1).
One important decision you should make is whether to pick a number around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious; anyway, which works best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not, a prime number is the safer bet).
If we are using numbers in the range of 10 raised to the power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which usually is expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
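A minimal sketch of that separation (the function names are placeholders): % works for any bucket count, while the AND mask is only valid when the bucket count is a power of two.

#include <cstddef>

std::size_t fold_general(std::size_t hash_value, std::size_t bucket_count)
{
    return hash_value % bucket_count;
}

std::size_t fold_power_of_two(std::size_t hash_value, std::size_t bucket_count)
{
    // e.g. bucket_count == 256: hash_value & 255 is the same as hash_value % 256
    return hash_value & (bucket_count - 1);
}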
I tried to make the table size around 75% of the input data size, which you can call X.
So that's a load factor around 1.33, which is ok.
Then I did key % X to get the hash code. But I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by all major C++ compiler Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two they'd tend to only map to every 16th bucket) then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
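If your keys do turn out to be collision-prone, one well-known stronger mix you could drop in (shown purely as an illustration; the constants are the splitmix64 finalizer, not anything the C++ Standard specifies) is:

#include <cstdint>

// splitmix64 finalizer: a widely used 64-bit mixing step that spreads
// clustered keys (e.g. all multiples of 16) across the full output range.
std::uint64_t mix64(std::uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}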
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Implementing support for that can easily be done by first writing a hash table class that doesn't support rehashing, simply taking the number of buckets to use as a constructor argument. Then write an outer rehashing-capable hash table class with a data member of the above type. When an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), construct an additional inner hash table object, telling the constructor a new, larger bucket count to use, then iterate over the smaller hash table inserting (or, better, moving; see below) the elements into the new hash table, then swap the two hash tables so the data member ends up with the new, larger content and the smaller one is destructed. By "moving" I mean simply relinking the linked-list elements from the smaller hash table into the lists of the larger one, instead of deep-copying the elements, which is dramatically faster and uses less memory momentarily while rehashing.
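A deliberately compressed sketch of the rebuild-and-swap idea (collapsed into a single class rather than the outer/inner split described above, and copying elements rather than relinking list nodes; all names are made up for illustration):

#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

template <typename Key, typename Hash = std::hash<Key>>
class ChainedSet
{
    std::vector<std::list<Key>> buckets_;
    std::size_t size_ = 0;
    float max_load_factor_ = 1.0f;
    Hash hash_;

    std::size_t bucket_of(const Key& k) const { return hash_(k) % buckets_.size(); }

public:
    explicit ChainedSet(std::size_t bucket_count = 16) : buckets_(bucket_count) {}

    bool contains(const Key& k) const
    {
        for (const Key& x : buckets_[bucket_of(k)])
            if (x == k) return true;
        return false;
    }

    void insert(const Key& k)
    {
        if (contains(k)) return;
        // Rehash when this insert would push the load factor past the limit.
        if (static_cast<float>(size_ + 1) / buckets_.size() > max_load_factor_)
        {
            ChainedSet bigger(buckets_.size() * 2);
            for (const auto& bucket : buckets_)
                for (const Key& x : bucket)
                    bigger.insert(x);              // copied for brevity; relinking nodes is faster
            std::swap(buckets_, bigger.buckets_);  // keep the larger bucket array
        }
        buckets_[bucket_of(k)].push_back(k);
        ++size_;
    }

    std::size_t size() const { return size_; }
};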
I have been looking for an answer to this for quite a while, but I am not able to find one.
The problem:
I have an n-dimensional (e.g. n = 9) function which is extremely computationally burdensome to evaluate, but for which I need a huge number of evaluations. I want to use interpolation for this case.
However, k < n dimensions (e.g. k = 7) are discrete (mostly binary), so I do not need to interpolate over these, which leaves me with m dimensions (e.g. m = 2) over which I want to interpolate. I am mostly interested in basic linear interpolation, similar to http://rncarpio.github.io/linterp/.
The question:
(Option A) Should I invoke d1 x d2 x ... x dk interpolation functions (e.g. 2^7 = 128) which then only interpolate over the two dimensions I need, but where I have to look up the right interpolation function every time I need a value, ...
... (Option B) or should I invoke one interpolation function which could possibly interpolate between all dimensions, but which I will then only use to interpolate across the two dimensions I need (for all others I fully provide the grid with function values)?
I think it is important to emphasize that I am really interested in linear interpolation and that the answer will most likely differ in other cases. Furthermore, in the application I want to use this, I need not 128 functions but rather over 10,000 functions.
Additionally, should Option A be the answer, how should I store the interpolation functions in C++? That is, should I use a map with a tuple as a key (drawing on the Boost library), or a multidimensional array (again, drawing on the Boost library), or is there an easier way?
I'd likely choose Option A, but not maps. If you have binary data, just use an array of size 2 (this is one of the rare cases when using a plain array is right); if you have a small domain, consider having two vectors, one for keys and one for values. This is because vector search can be made extremely efficient, at least on x86/x64 architectures. Be sure to hide this implementation detail by providing an accessor function (i.e., const value& T::lookup(const key&)). I'd vote against using a tuple as a map key, as it makes things both slower and more complicated. If you need to optimize heavily and your domains are so small that the product of their cardinalities fits within 64 bits, you might just manually create an index key (like: (1<<3) * key4bits + (1<<2) * keyBinary + key2bits), in this case you'll use a map (or two vectors).
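A hedged sketch of that last idea, equivalent to the expression above (field names and widths are invented, following the "4-bit + binary + 2-bit" shape of the example); one possible way to use the packed key is as a direct index into a flat container:

#include <cstdint>
#include <vector>

// Pack a 4-bit field, a binary field and a 2-bit field into one small index.
// 4 + 1 + 2 = 7 bits, so the table needs 2^7 = 128 slots.
inline std::uint32_t packed_key(std::uint32_t key4bits,
                                std::uint32_t keyBinary,
                                std::uint32_t key2bits)
{
    return (key4bits << 3) | (keyBinary << 2) | key2bits;
}

// Usage sketch, with Interpolator standing in for whatever 2-D interpolation
// object you end up using:
// std::vector<Interpolator> table(1u << 7);
// Interpolator& f = table[packed_key(k4, kb, k2)];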
First of all, a disclaimer; hash is a somewhat inaccurate term for what I'm aiming for, please, feel free to suggest a better title.
At any rate, I'm currently attempting to program a complex spatial algorithm running in real-time. In order to save cycles, I've decided to generate a lookup table that contains all of the 32,000 possibilities.
If I were to do this conventionally, the values (inclusive ranges and field counts: two fields covering +0 to +15 and three fields covering -2 to +2) would be mapped to two four-bit and three three-bit values respectively, giving me a lookup-table size of 2^(2*4 + 3*3) = 131,072 entries, over four times the 32,000 I actually need.
Given the nature of the algorithm, collisions would absolutely cripple its functionality (so no traditional hash functions unless I could guarantee no collisions with all relevant values). Beyond that, the structure I'm working with is rather large (ie, I would /really/ like to avoid allocating any more than 200% of what I need). Finally, since this table will be referenced so often, I'd like to avoid the overhead of a traditional hash-table in both bucket lookups and an excessively complex hash function.
Having taken a more traditional computer-science approach, I'm beginning to strongly believe the solution lies in some mathematics of base-conversion I'm completely ignorant of. Any idea if this is the case?
You can calculate an index the same way you calculated the maximum number of combinations: treat the fields as digits of a mixed-radix number. Take each element from most significant to least significant, add a constant so it ranges from 0 to n-1, and multiply by the number of combinations remaining in the less significant fields.
Given your 0 to 15 values of a, b (range of 16) and -2 to +2 values of c, d, e (range of 5):
index = a * 16*5*5*5 + b * 5*5*5 + (c+2) * 5*5 + (d+2) * 5 + (e+2);
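A hedged sketch of that calculation in C++ (the field names a..e follow the example above):

// Mixed-radix index for two fields in [0, 15] and three fields in [-2, +2].
// The result covers exactly 16 * 16 * 5 * 5 * 5 = 32,000 distinct values.
inline int make_index(int a, int b, int c, int d, int e)
{
    return a * (16 * 5 * 5 * 5)
         + b * (5 * 5 * 5)
         + (c + 2) * (5 * 5)
         + (d + 2) * 5
         + (e + 2);
}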
I have around 400,000 "items".
Each "item" consists of 16 double values.
At runtime I need to compare items with each other. To do this I multiply their double values, which is quite time-consuming.
I have made some tests, and I found out that there are only 40,000 possible return values, no matter which items I compare with each other.
I would like to store these values in a look-up table so that I can easily retrieve them without doing any real calculation at runtime.
My question would be how to efficiently store the data in a look-up table.
The problem is that if I create a look-up table, it gets amazingly huge, for example like this:
item-id, item-id, compare return value
1 1 499483,49834
1 2 -0.0928
1 3 499483,49834
(...)
It would sum up to around 120 million combinations.
That just looks too big for a real-world application.
But I am not sure how to avoid that.
Can anybody please share some cool ideas?
Thank you very much!
Assuming I understand you correctly, you have two inputs with 400K possibilities each, so 400K * 400K = 160B entries. Assuming you index them sequentially and store your 40K possible results in a way that allows 2 octets each, you're looking at a table size of roughly 300 GB, which is pretty clearly beyond current everyday computing. So you might instead research whether there is any correlation between the 400K "items", and if so, whether you can assign some kind of function to that correlation that gives you a clue (read: hash function) as to which of the 40K results might/could/should come out. Clearly your hash function and lookup need to be cheaper than just doing the multiplication in the first place. Or maybe you can reduce the comparison time with some kind of intelligent reduction, like knowing the result under certain scenarios. Or perhaps some of your math can be optimized using integer math or boolean comparisons. Just a few thoughts...
To speed things up, you should probably compute all of the possible answers once, and store the inputs that produce each answer.
Then I would recommend making some sort of look-up table that uses the answer as the key (since the answers will all be unique), and stores all of the possible inputs that give that result.
To help visualize:
Say you had the table 'Table'. Inside Table you have keys, and associated with those keys are values. You make the keys whatever type your answers are (the keys will be all of your answers). Now, give each of your 400k inputs a unique identifier. You then store the pair of identifiers for a multiplication as one value associated with that particular key. When you compute that same answer again, you just add it as another set of inputs that can produce that key.
Example:
std::map<AnswerType, std::vector<Input>> table;   // the "Table" above; any associative container works
Define Input like:
struct Input { IDType one; IDType two; };
Where one 'Input' might have ID's 12384, 128, meaning that the objects identified by 12384 and 128, when multiplied, will give the answer.
So, in your lookup, you'll have something that looks like:
bool contains(const std::vector<Input>& inputs, IDType first, IDType second);

AnswerType lookup(IDType first, IDType second)
{
    for (const auto& entry : table)              // entry.first is the answer (key)
    {
        if (contains(entry.second, first, second))
            return entry.first;
    }
    return AnswerType{};                         // not found; pick whatever sentinel suits you
}

// Defined elsewhere
bool contains(const std::vector<Input>& inputs, IDType first, IDType second)
{
    for (const Input& i : inputs)
    {
        if ((i.one == first && i.two == second) ||
            (i.two == first && i.one == second))
            return true;
    }
    return false;
}
This is only a rough cut as-is, but it might be a place to start.
While the outer loop over the table is still a linear search, you can make the contains check run as a binary search by sorting how the inputs are stored (see the sketch below).
In all, you're looking at a run-once precomputation that runs in O(n^2) time, and a lookup that runs in roughly n log(n). I'm not entirely sure how the memory will look after all of that, though. Of course, I don't know much about the math behind it, so you might be able to speed up the linear search if you can somehow sort the keys as well.
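A hedged sketch of the sorted contains check (types are placeholders): store each input pair in a canonical (min, max) order, keep the vector sorted, and let std::binary_search do the work.

#include <algorithm>
#include <utility>
#include <vector>

using IDType = unsigned int;                 // stand-in for your real ID type
using Input  = std::pair<IDType, IDType>;

// Normalize so (a, b) and (b, a) compare equal.
inline Input canonical(IDType a, IDType b)
{
    return a <= b ? Input{a, b} : Input{b, a};
}

// inputs must be sorted (std::sort) after all pairs have been inserted.
inline bool contains_sorted(const std::vector<Input>& inputs,
                            IDType first, IDType second)
{
    return std::binary_search(inputs.begin(), inputs.end(),
                              canonical(first, second));
}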
I'm writing a program right now which produces four unsigned 32-bit integers as output from a certain function. I want to hash these four integers, so I can compare the output of this function to future outputs.
I'm having trouble writing a decent hashing function, though. When I originally wrote this code, I threw in a simple addition of the four integers, which I knew would not suffice. I've tried several other techniques, such as shifting and adding, to no avail. I get a hash, but it's of poor quality, and the function generates a ton of collisions.
The hash output can be either a 32-bit or 64-bit integer. The function in question generates many billions of hashes, so collisions are a real problem here, and I'm willing to use a larger variable to ensure that there are as few collisions as possible.
Can anyone help me figure out how to write a quality hash function?
Why don't you store the four integers in a suitable data structure and compare them all? The benefit of hashing them in this case appears dubious to me, unless storage is a problem.
If storage is the issue, you can use one of the hash functions analyzed here.
Here's a fairly reasonable hash function from 4 integers to 1 integer:
unsigned int hash = in[0];
hash *= 37;
hash += in[1];
hash *= 37;
hash += in[2];
hash *= 37;
hash += in[3];
With uniformly-distributed input it gives uniformly-distributed output. All bits of the input participate in the output, and every input value (although not every input bit) can affect every output bit. Chances are it's faster than the function which produces the output, in which case there are no performance concerns.
There are other hashes with other characteristics, but accumulate-with-multiplication-by-prime is a good start until proven otherwise. You could try accumulating with xor instead of addition if you like. Either way, it's easy to generate collisions (for example {1, 0, a, b} collides with {0, 37, a, b} for all a, b), so you might want to pick a prime which you think has nothing to do with any plausible implementation bug in your function. So if your function has a lot of modulo-37 arithmetic in it, maybe use 1000003 instead.
Because hashing can generate collisions, you have to keep the keys in memory anyway in order to detect those collisions. Hash maps and other standard data structures do this in their internal bookkeeping.
As the key is so small, just use the key directly rather than hashing. This will be faster and will ensure no collisions.
I fully agree with Vinko: just compare them all. If you still want a good hashing function, you need to analyse the distribution of your 4 unsigned integers. Then you have to craft your hashing function in such a way that the result is evenly distributed over the whole range of the 32-bit hash value.
A simple example: let's assume that most of the time, the result from each function is in the range from 0 to 255. Then you could easily blend the lower 8 bits from each function into your hash. Most of the time you'd find the result directly; only sometimes (when one function returns a larger result) would you get a collision.
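For instance (purely illustrative, and only collision-free while every value really does fit in 0..255):

#include <cstdint>

// Blend the low 8 bits of each of the four values into one 32-bit hash.
std::uint32_t blend_low_bytes(std::uint32_t a, std::uint32_t b,
                              std::uint32_t c, std::uint32_t d)
{
    return (a & 0xFFu)
         | ((b & 0xFFu) << 8)
         | ((c & 0xFFu) << 16)
         | ((d & 0xFFu) << 24);
}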
To sum it up - without information how the results of the 4 functions are distributed, we can't help you with a good hashing function.
Why a hash? It seems like a std::set or std::multiset would be better suited to store this kind of output. All you'd need to do is wrap the four integers up in a struct and write a simple compare function.
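A minimal sketch of that (the struct name is made up):

#include <cstdint>
#include <set>
#include <tuple>

// Wrap the four integers and give them an ordering so std::set can store them.
struct Quad
{
    std::uint32_t a, b, c, d;

    bool operator<(const Quad& o) const
    {
        return std::tie(a, b, c, d) < std::tie(o.a, o.b, o.c, o.d);
    }
};

// Usage: std::set<Quad> seen;  seen.insert(Quad{w, x, y, z});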
Try using CRC or FNV. FNV is nice because it is fast and has a defined method of folding bits to get "smaller" hash values (i.e. 12-bit / 24-bit / etc).
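As a hedged illustration of the FNV suggestion, here is 64-bit FNV-1a over the four values, processed a byte at a time, low byte first (the constants are the standard FNV-1a offset basis and prime):

#include <cstdint>

std::uint64_t fnv1a_64(const std::uint32_t (&in)[4])
{
    std::uint64_t h = 14695981039346656037ULL;   // FNV offset basis
    for (std::uint32_t word : in)
        for (int i = 0; i < 4; ++i)
        {
            h ^= (word >> (8 * i)) & 0xFFu;
            h *= 1099511628211ULL;               // FNV prime
        }
    return h;
}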
Also the benefit of generating a 64-bit hash from a 128-bit (4 X 32-bit) number is a bit questionable because as other people have suggested, you could just use the original value as a key in a set. You really want the number of bits in the hash to represent the number of values you originally have. For example, if your dataset has 100,000 4X32-bit values, you probably want a 17-bit or 18-bit hash value, not a 64-bit hash.
Might be a bit overkill, but consider Boost.Hash. Generates very simple code and good values.