Hash that returns the same value for all numbers in range? - c++

I'm working on a problem where I have an entire table from a database in memory at all times, with a low range and high range of 9-digit numbers. I'm given a 9-digit number that I need to use to lookup the rest of the columns in the table based on whether that number falls in the range. For example, if the range was 100,000,000 to 125,000,000 and I was given a number 117,123,456, then I would know that I'm in the 100-125 mil range, and whatever vector of data that points to is what I will be using.
Now the best I can think of for lookup time is O(log n). This is OK, at best, but still pretty slow. The table has at least 100,000 entries and I will need to look up values in this table tens of thousands, if not hundreds of thousands, of times per execution of this application (10+ times/day).
So I was wondering if it was possible to use an unordered_set instead, writing my own Hash function that ALWAYS returns the same hash-value for every number in range. Using the same example above, 100,000,000 through 125,000,000 will always return, for example, a hash value of AB12CD. Then when I use the lookup value of 117,123,456, I will get that same AB12CD hash and have a lookup time of O(1).
Is this possible, and if so, any ideas how?
Thanks in advance.

Yes. Assuming that you can number your intervals in order, you could fit a polynomial to your cutoff values, and receive an index value from the polynomial. For instance, with cutoffs of 100,000,000, 125,000,000, 250,000,000, and 327,000,000, you could use points (100, 0), (125, 1), (250, 2), and (327, 3), restricting the first derivative to [0, 1]. Assuming that you have decently-behaved intervals, you'll be able to fit this with an (N+2)th-degree polynomial for N cutoffs.
Have a table of desired hash values; use floor[polynomial(i)] for the index into the table.

Can you write such a hash function? Yes. Will evaluating it be slower than a search? Well there's the catch...
I would personally solve this problem as follows. I'd have a sorted vector of all values. And then I'd have a jump table of indexes into that vector based on the value of n >> 8.
So now your logic is that you look in the jump table to figure out where you are jumping to and how many values you should consider. (Just look at where you land versus the next index to see the size of the range.) If the whole range goes to the same vector, you're done. If there are only a few entries, do a linear search to find where you belong. If there are a lot of entries, do a binary search. Experiment with your data to find when binary search beats a linear search.
A vague memory suggests that the tradeoff is around 100 or so because predicting a branch wrong is expensive. But that is a vague memory from many years ago, so run the experiment for yourself.
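Here is a minimal sketch of the jump-table idea above, not a tuned implementation. It assumes the ranges are contiguous, so only the sorted lower bounds are stored and range i covers everything from lows[i] up to the next lower bound. The class name RangeIndex, its members, and the configurable shift are made up for illustration. For brevity it always uses std::upper_bound inside a bucket; swapping in a linear scan for small buckets is the experiment suggested above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: maps a key to the index of the range containing it.
// Assumption: ranges are contiguous, so range i is [lows[i], lows[i + 1]).
class RangeIndex {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    // `lows` must be sorted ascending; `shift` picks the bucket width (2^shift).
    RangeIndex(std::vector<std::uint32_t> lows, unsigned shift)
        : lows_(std::move(lows)), shift_(shift) {
        if (lows_.empty()) return;
        const std::size_t max_bucket = lows_.back() >> shift_;
        jump_.assign(max_bucket + 2, 0);
        // Count the lower bounds per bucket, stored one slot to the right...
        for (std::uint32_t low : lows_) ++jump_[(low >> shift_) + 1];
        // ...so the prefix sum makes jump_[b] the number of bounds in buckets < b.
        std::partial_sum(jump_.begin(), jump_.end(), jump_.begin());
    }

    // Returns the index of the range containing `key`, or npos if the key is
    // below the first range.
    std::size_t find(std::uint32_t key) const {
        if (lows_.empty() || key < lows_.front()) return npos;
        const std::size_t bucket = key >> shift_;
        if (bucket + 1 >= jump_.size()) return lows_.size() - 1;  // past the last bound
        // Only the bounds that fall in this bucket need to be searched.
        const auto first = lows_.begin() + jump_[bucket];
        const auto last = lows_.begin() + jump_[bucket + 1];
        const auto it = std::upper_bound(first, last, key);
        return static_cast<std::size_t>(it - lows_.begin()) - 1;
    }

private:
    std::vector<std::uint32_t> lows_;
    std::vector<std::size_t> jump_;
    unsigned shift_;
};
```

With 9-digit keys and a shift of 8 the jump table has a few million entries, so a larger shift trades memory for slightly longer in-bucket searches.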

Related

Hash table: determining table size and which hash function to use

If the input data entries are around 10 raised to power of 9, do we keep the size of the hash table the same as input size or reduce the size? how to decide the table size?
If we are using numbers in the range of 10 raised to power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Kindly explain how these two things work. It's getting quite confusing. Thanks!!
I tried to make the table size around 75% of the input data size, that you can call as X. Then I did key%(X) to get the hash code. But I am not sure if this is correct.
If the input data entries are around 10 raised to power of 9, do we keep the size of the hash table the same as input size or reduce the size? how to decide the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate chaining implementation, I'd suggest doing what std::unordered_set et al do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to increased collisions, longer lists and search times, but at load factors under 5 or 10 with an OK hash function the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
One important decision you should make is whether to pick a number around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious, and anyway - which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not - a prime number is the safer bet).
if we are using numbers in the range of 10 raised to power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multi_map can, but it'll be easier for you to model that kind of thing as being a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which usually is expected to generate a size_t result), from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
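To make the hashing/folding split concrete, here is a small sketch. The function names (hash_key, fold_mod, fold_mask, is_power_of_two) are invented for this example, and std::hash stands in for whatever hash function you end up choosing.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Step 1: hashing. Produces a full-width size_t, independent of the table size.
inline std::size_t hash_key(std::uint64_t key) {
    return std::hash<std::uint64_t>{}(key);
}

// Step 2a: folding with %, valid for any non-zero bucket count.
inline std::size_t fold_mod(std::size_t hash, std::size_t bucket_count) {
    return hash % bucket_count;
}

// Step 2b: folding with a bitmask. Only valid when bucket_count is a power of two.
inline std::size_t fold_mask(std::size_t hash, std::size_t bucket_count) {
    return hash & (bucket_count - 1);
}

// Guard for choosing between the two folds.
inline bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}
```

A bucket index is then fold_mod(hash_key(k), buckets), or fold_mask(hash_key(k), buckets) when is_power_of_two(buckets) holds.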
I tried to make the table size around 75% of the input data size, that you can call as X.
So that's a load factor around 1.33, which is ok.
Then I did key%(X) to get the hash code. But I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by some major C++ Standard Library implementations (libstdc++ and libc++, for example), though no particular hash functions are specified in the C++ Standard. It tends to work OK, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two they'd tend to only map to every 16th bucket) then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Implementing support for that can easily be done by first writing a hash table class that doesn't support rehashing - simply taking the number of buckets to use as a constructor argument. Then write an outer rehashing-capable hash table class with a data member of the above type, and when an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), you can construct an additional inner hash table object telling the constructor a new larger bucket count to use, then iterate over the smaller hash table inserting (or - better - moving, see below) the elements to the new hash table, then swap the two hash tables so the data member ends up with the new larger content and the smaller one is destructed. By "moving" above I mean simply relinking the linked-list elements from the smaller hash table into the lists in the larger one, instead of deep copying the elements, which will be dramatically faster and use less memory momentarily while rehashing.
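A rough sketch of that two-class design follows, assuming unsigned integer keys, an identity hash, and separate chaining. FixedTable, GrowingTable, and absorb are names made up for this example. The relinking is done with std::forward_list::splice_after, which moves a node between lists without copying the element.

```cpp
#include <cstddef>
#include <forward_list>
#include <utility>
#include <vector>

// Inner table: the bucket count is fixed at construction, no rehashing.
class FixedTable {
public:
    explicit FixedTable(std::size_t bucket_count) : buckets_(bucket_count) {}

    bool contains(unsigned key) const {
        for (unsigned k : buckets_[key % buckets_.size()])
            if (k == key) return true;
        return false;
    }

    bool insert(unsigned key) {
        if (contains(key)) return false;
        buckets_[key % buckets_.size()].push_front(key);
        ++size_;
        return true;
    }

    // Relink every node of `other` into this table; no element is copied.
    // Assumes the two tables share no keys, which holds during a rehash.
    void absorb(FixedTable& other) {
        for (auto& chain : other.buckets_) {
            while (!chain.empty()) {
                auto& dest = buckets_[chain.front() % buckets_.size()];
                dest.splice_after(dest.before_begin(), chain, chain.before_begin());
            }
        }
        size_ += other.size_;
        other.size_ = 0;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

private:
    std::vector<std::forward_list<unsigned>> buckets_;
    std::size_t size_ = 0;
};

// Outer table: doubles the bucket count when an insert would exceed the
// maximum load factor (1.0 here, mirroring the Standard Library default).
class GrowingTable {
public:
    bool insert(unsigned key) {
        if (table_.contains(key)) return false;
        if (table_.size() + 1 > max_load_factor_ * table_.bucket_count()) {
            FixedTable bigger(table_.bucket_count() * 2);
            bigger.absorb(table_);
            std::swap(table_, bigger);  // the old, now empty table dies with `bigger`
        }
        return table_.insert(key);
    }

    bool contains(unsigned key) const { return table_.contains(key); }
    std::size_t size() const { return table_.size(); }
    std::size_t bucket_count() const { return table_.bucket_count(); }

private:
    FixedTable table_{8};  // arbitrary initial bucket count
    float max_load_factor_ = 1.0f;
};
```

Doubling is just one growth policy; picking the next prime above twice the old count would fit the prime-bucket-count option discussed earlier.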

How to efficiently gather the repeating elements in a given array?

I'd like to gather the duplicates in a given array. For example, I have an array like this:
{1,5,3,1,5,6,3}
and I want the result to be:
{3,3,1,1,5,5,6}
In my case, the number of clusters is unknown before the calculation, and the order does not matter.
I achieved this by using the built-in sort function in C++. However, the ordering is not actually necessary, so I guess there are probably more efficient methods to accomplish it.
Thanks in advance.
First, construct a histogram noting frequencies of each number. You can use a dictionary to accomplish this in O(n) time and space.
Next, loop over the dictionary's keys (order is unimportant here) and for each one, write a number of instances of that key equal to the corresponding value.
Example:
{1,5,3,1,5,6,3} input
{1->2,5->2,3->2,6->1} histogram dictionary
{1,1,5,5,3,3,6} wrote two 1s, two 5s, two 3s, then one 6
This whole thing is O(n) time and space. Certainly you can't do better than O(n) time. Whether you can do better than O(n) space or not while maintaining O(n) time I cannot say.
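In C++ the two passes might look like the sketch below, with std::unordered_map playing the role of the dictionary. The function name group_duplicates is made up, and the order of the groups in the result is whatever the map's iteration order happens to be.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Returns the input's values with equal values placed next to each other.
// Group order is unspecified (it follows the unordered_map's iteration order).
std::vector<int> group_duplicates(const std::vector<int>& input) {
    // Pass 1: histogram of value -> number of occurrences.
    std::unordered_map<int, std::size_t> counts;
    for (int value : input) ++counts[value];

    // Pass 2: write each value as many times as it occurred.
    std::vector<int> output;
    output.reserve(input.size());
    for (const auto& entry : counts)
        output.insert(output.end(), entry.second, entry.first);
    return output;
}
```

Both passes are O(n) on average, since each unordered_map operation is expected O(1).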

c++ discrete distribution sampling with frequently changing probabilities

Problem: I need to sample from a discrete distribution constructed of certain weights e.g. {w1,w2,w3,..}, and thus probability distribution {p1,p2,p3,...}, where pi=wi/(w1+w2+...).
Some of the wi's change very frequently, but only a very low proportion of all wi's. But the distribution itself thus has to be renormalised every time it happens, and therefore I believe the Alias method does not work efficiently because one would need to build the whole distribution from scratch every time.
The method I am currently thinking of is a binary tree (heap method), where all wi's are saved in the lowest level, and then the sum of each two in the next higher level and so on. The sum of all of them will be in the highest level, which is also the normalisation constant. Thus in order to update the tree after a change in wi, one needs to do log(n) changes, as well as the same amount of work to get a sample from the distribution.
Question:
Q1. Do you have a better idea on how to achieve it faster?
Q2. The most important part: I am looking for a library which has already done this.
explanation: I have done this myself several years ago, by building heap structure in a vector, but since then I have learned many things including discovering libraries ( :) ), and containers such as map... Now I need to rewrite that code with higher functionality, and I want to make it right this time:
so Q2.1 is there a nice way to make a c++ map ordered and searched not by index, but by a cumulative sum of its elements (this is how we sample, right?..). (that is my current theory how I would like to do it, but it doesn't have to be this way...)
Q2.2 Maybe there is some even nicer way to do the same? I would believe this problem is so frequent that I am very surprised I could not find some sort of library which would do it for me...
Thank you very much, and I am very sorry if this has been asked in some other form, please direct me towards it, but I have spent a good while looking...
-z
Edit: There is a possibility that I might need to remove or add the elements as well, but I think I could avoid it, if that makes a huge difference, thus leaving only changing the value of the weights.
Edit2: weights are reals in general, I would have to think if I could make them integers...
I would actually use a hash set of strings (don't remember the C++ container for it, you might need to implement your own though). Put wi elements for each i, with the values "w1_1", "w1_2",... all through "w1_[w1]" (that is, w1 elements starting with "w1_").
When you need to sample, pick an element at random using a uniform distribution. If you picked w5_*, say you picked element 5. Because of the number of elements in the hash, this will give you the distribution you were looking for.
Now, when wi changes from A to B, just add B-A elements to the hash (if B>A), or remove the last A-B elements of wi (if A>B).
Adding new elements and removing old elements is trivial in this case.
Obviously the problem is 'pick an element at random'. If your hash is a closed hash, you pick an array cell at random, if it's empty - just pick one at random again. If you keep your hash 3 or 4 times larger than the total sum of weights, your complexity will be pretty good: O(1) for retrieving a random sample, O(|A-B|) for modifying the weights.
Another option, since only a small part of your weights change, is to split the weights into two - the fixed part and the changed part. Then you only need to worry about changes in the changed part, and the difference between the total weight of changed parts and the total weight of unchanged parts. Then for the fixed part your hash becomes a simple array of numbers: 1 appears w1 times, 2 appears w2 times, etc..., and picking a random fixed element is just picking a random number.
Updating your normalisation factor when you change a value is trivial. This might suggest an algorithm.
w_sum = w_sum_old - w_i_old + w_i_new;
If you leave p_i as a computed property p_i = w_i / w_sum you would avoid recalculating the entire p_i array at the cost of calculating p_i every time they are needed. You would, however, be able to update many statistical properties without recalculating the entire sum
expected_something = (something_1 * w_1 + something_2 * w_2 + ...) / w_sum;
With a bit of algebra you can update expected_something by subtracting the contribution with the old weight and add the contribution with the new weight, multiplying and dividing with the normalization factors as required.
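As an illustration of that algebra, the sketch below keeps two running totals, the sum of the weights and the sum of weight times value, and adjusts both when a single weight changes. The class and member names are hypothetical, and it assumes the weights and values vectors have the same length.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Keeps w_sum and sum(w_i * x_i) up to date so that p_i and the expected
// value can be read in O(1) after any single-weight change.
class WeightedMean {
public:
    WeightedMean(std::vector<double> weights, std::vector<double> values)
        : w_(std::move(weights)), x_(std::move(values)) {
        for (std::size_t i = 0; i < w_.size(); ++i) {
            w_sum_ += w_[i];
            wx_sum_ += w_[i] * x_[i];
        }
    }

    // O(1): remove the old contribution and add the new one.
    void set_weight(std::size_t i, double new_weight) {
        const double delta = new_weight - w_[i];
        w_sum_ += delta;
        wx_sum_ += delta * x_[i];
        w_[i] = new_weight;
    }

    // p_i is computed on demand instead of being stored.
    double probability(std::size_t i) const { return w_[i] / w_sum_; }
    double expected() const { return wx_sum_ / w_sum_; }

private:
    std::vector<double> w_;
    std::vector<double> x_;
    double w_sum_ = 0.0;
    double wx_sum_ = 0.0;
};
```

With real-valued weights the running totals can drift after many updates, so recomputing them from scratch every so often is probably wise.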
If, during sampling, you keep track of which outcomes are part of the sample, it would be possible to propagate the probability updates to the generated sample. Would this make it possible for you to update rather than recalculate values related to the sample? I think a bitmap could provide an efficient way to store an index of which outcomes were used to build the sample.
One way of storing the probabilities together with the sums is to start with all the probabilities. In the next N/2 positions you store the sums of the pairs, after that the N/4 sums of those pairs, and so on. Where the sums are located can, obviously, be calculated in O(1) time. This data structure is sort of a heap, but upside down.
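A sketch of such a sum tree is below. Note that it uses the more common root-first array layout (total at index 1, children of node k at 2k and 2k+1, weights in the last level) rather than the leaves-first layout just described; the idea is the same. The class name SumTree and its members are invented here, weights are assumed non-negative, and sample() assumes the total is positive. Both update and find are O(log n).

```cpp
#include <cstddef>
#include <random>
#include <vector>

class SumTree {
public:
    explicit SumTree(const std::vector<double>& weights) {
        // Pad the leaf level to a power of two; padding leaves have weight 0.
        while (leaves_ < weights.size()) leaves_ *= 2;
        tree_.assign(2 * leaves_, 0.0);
        for (std::size_t i = 0; i < weights.size(); ++i) tree_[leaves_ + i] = weights[i];
        for (std::size_t node = leaves_ - 1; node >= 1; --node)
            tree_[node] = tree_[2 * node] + tree_[2 * node + 1];
    }

    double total() const { return tree_[1]; }  // the normalisation constant

    // Change one weight and refresh the sums on the path up to the root.
    void update(std::size_t i, double weight) {
        std::size_t node = leaves_ + i;
        tree_[node] = weight;
        for (node /= 2; node >= 1; node /= 2)
            tree_[node] = tree_[2 * node] + tree_[2 * node + 1];
    }

    // Index whose cumulative-weight interval contains r, for 0 <= r < total().
    std::size_t find(double r) const {
        std::size_t node = 1;
        while (node < leaves_) {
            const double left = tree_[2 * node];
            // Never step into an empty right subtree (guards against rounding).
            if (r < left || tree_[2 * node + 1] <= 0.0) {
                node = 2 * node;
            } else {
                r -= left;
                node = 2 * node + 1;
            }
        }
        return node - leaves_;
    }

    // Draw one index with probability w_i / total().
    template <class Rng>
    std::size_t sample(Rng& rng) const {
        std::uniform_real_distribution<double> dist(0.0, total());
        return find(dist(rng));
    }

private:
    std::size_t leaves_ = 1;
    std::vector<double> tree_;
};
```

Adding or removing elements is not covered; with this layout it would mean rebuilding once the padded capacity is exceeded.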

How to get partition where value belongs in partitioned interval?

I have an interval that is partitioned into a large number of smaller partitions.
There aren't any gaps, and there aren't any overlapping intervals.
E.g: (0;600) is separated into:
(0;10>
(10;25>
(25;100>
(100;125>
(125;550>
(550;600)
Now I have a large number of values and I need to get the partition id for each of them.
I can store an array of the values that split this interval into smaller intervals.
But if all values belong to the last partition, it'll need to pass through the whole array.
So I'm searching for a better way to store these intervals. I want a simple algorithm, at most about 150 lines long, and I don't want to use any library other than std.
Since there are no "empty spaces" in your partitioning, the end of each partition is redundant (it's the same as the start of the next partition).
And since you have the partition list sorted, you can simply use binary search, with std::upper_bound.
See it in action.
Edit: Correction (upper_bound, not lower_bound).
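A minimal sketch of the std::upper_bound approach, storing only the start of each partition. The function name partition_of is made up, and the sketch assumes the value is not below the first start. As written, a value equal to a boundary is assigned to the partition that starts there; using std::lower_bound instead would assign it to the partition that ends there.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// `starts` holds the sorted start of every partition, e.g. {0, 10, 25, 100, 125, 550}.
// Returns the zero-based id of the partition containing `value`.
// Assumes value >= starts.front().
std::size_t partition_of(const std::vector<double>& starts, double value) {
    // First start strictly greater than value; the partition is the one before it.
    const auto it = std::upper_bound(starts.begin(), starts.end(), value);
    return static_cast<std::size_t>(it - starts.begin()) - 1;
}
```

Each lookup is O(log n) in the number of partitions, regardless of which partition the value falls in.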
You could just improve your search algorithm.
Put all the ranges in the array, and then use Binary Search Algorithm to search for the right range.
It will cost O(log n), and it's really easy to implement.

Choosing N random numbers from a set

I have a sorted set (std::set to be precise) that contains elements with an assigned weight. I want to randomly choose N elements from this set, while the elements with higher weight should have a bigger probability of being chosen. Any element can be chosen multiple times.
I want to do this as efficiently as possible - I want to avoid any copying of the set (it might get very large) and run at O(N) time if it is possible. I'm using C++ and would like to stick to a STL + Boost only solution.
Does anybody know if there is a function in STL/Boost that performs this task? If not, how to implement one?
You need to calculate (and possibly cache, if you think of performance) the sum of all weights in your set. Then, generate N random numbers ranging up to this value. Finally, iterate your set, counting the sum of the weights you encountered so far. Inspect all the (remaining) random numbers. If the number falls between the previous and the next value of the sum, insert the value from the set and remove your random number. Stop when your list of random numbers is empty or you've reached the end of the set.
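One way this could look in code is sketched below. The Item struct and both function names are assumptions for the example. The random numbers are sorted first, which the description above does not require, so that a single pass over the set can consume them in order; that makes the cost O(N log N) for the sort plus one pass over the set, and the result comes out in set order rather than in random order.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <set>
#include <utility>
#include <vector>

// Hypothetical element type: ordered by id, carrying its own weight.
struct Item {
    int id;
    double weight;
    bool operator<(const Item& other) const { return id < other.id; }
};

// Walks the set once, keeping a running sum of weights, and emits an item's
// id once for every target that falls inside that item's slice of the sum.
std::vector<int> pick_by_targets(const std::set<Item>& items, std::vector<double> targets) {
    std::sort(targets.begin(), targets.end());
    std::vector<int> picked;
    picked.reserve(targets.size());
    double running = 0.0;
    std::size_t next = 0;
    for (const Item& item : items) {
        running += item.weight;
        while (next < targets.size() && targets[next] < running) {
            picked.push_back(item.id);
            ++next;
        }
        if (next == targets.size()) break;  // all random numbers consumed
    }
    return picked;
}

// Draws n items with replacement, with probability proportional to weight.
// Assumes the set is non-empty and the total weight is positive.
std::vector<int> sample_weighted(const std::set<Item>& items, std::size_t n, std::mt19937& rng) {
    double total = 0.0;
    for (const Item& item : items) total += item.weight;
    std::uniform_real_distribution<double> dist(0.0, total);
    std::vector<double> targets(n);
    for (double& t : targets) t = dist(rng);
    return pick_by_targets(items, std::move(targets));
}
```

If the order of the picks matters, shuffling the result afterwards (for example with std::shuffle) restores a random order.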
I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms: http://www.cse.unr.edu/~banerjee/selection.htm
As for "as efficiently as possible," that would depend on some characteristics of the data. In the application of the weighted roulette wheel, when searching for the index you could consider a binary search instead. However, it is not the case that each slot of the roulette wheel is equally likely, so it may make sense to examine them in order of their weights.
A lot depends on the amount of extra storage you're willing to expend to make the selection faster.
If you're not willing to use any extra storage, @Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing to use some extra storage (and possibly a different data structure than std::set) you could create a tree (like a set uses) but at each node of the tree, you'd also store the (weighted) number of items to the left of that node. This will let you map from a generated number to the correct associated value with logarithmic (rather than linear) complexity.