So my professor just assigned this homework assignment. I know my fair share of hashing techniques, but I have absolutely no idea how to go about not losing a lot of points due to collisions, because 1 million strings will literally brute-force collisions into my hash table.
What should I focus on?
Creating a really good re-hashing technique to detect when a collision occurs and appropriately re-hash?
Focusing on how to convert the strings into unique integers so as to avoid collisions, using some kind of prime-number-based modulus?
Or maybe I'm just misunderstanding the assignment completely. How would you guys go about solving this? Any ideas would be really helpful.
The task is to create a hash function with zero collisions. TonyD just calculated the expected number of collisions to be 116. According to the grading you will get zero points for a hash function with 116 collisions.
The professor gave a hint to use unordered_map, which doesn't help for designing hash functions. It may be a trick question...
How would you design a function which returns a repeatable, unique number for 1 million inputs?
Your teacher's asking you to hash 1 million strings and you have 2^32 = 4,294,967,296 distinct 32-bit integer values available.
With 20-character random strings, there are massively more possible strings than hash values, so you can't map specific strings onto specific hash values in a way that limits the collision potential (if, say, you had <= 2^32 potential strings because the string length was shorter, or the values each character was allowed to take were restricted, you'd have a chance at a perfect hash function: a formula mapping each string to a known distinct number).
So, you're basically left having to try to randomly but repeatably map from strings to hash values. The "Birthday Paradox" then kicks in, meaning you must expect quite a lot of collisions. How many? Well - this answer provides the formula - for m buckets (2^32) and n inserts (1,000,000):
expected collisions = n - m * (1 - ((m-1)/m)^n)
= 1,000,000 - 2^32 * (1 - ((2^32 - 1) / 2^32) ^ 1,000,000)
= 1,000,000 - 2^32 * (1 - 0.99976719645926983712557804052625)
~= 1,000,000 - 999883.6
~= 116.4
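If you want to sanity-check that arithmetic, a quick sketch (plain C++, long double just to keep enough precision in the ((m-1)/m)^n term):

#include <cmath>
#include <iostream>

int main()
{
    const long double m = 4294967296.0L;  // 2^32 buckets
    const long double n = 1000000.0L;     // inserts
    long double expected = n - m * (1.0L - std::pow((m - 1.0L) / m, n));
    std::cout << "expected collisions ~= " << expected << '\n';  // prints ~116.4
}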
Put another way, the very best possible hash function would on average - for random string inputs - still have 116 collisions.
Your teacher says:
final score for you is max{0, 200 – 5*T}
So, there's no point doing the assignment: you're more likely to have a 24 carat gold meteor land in your front garden than get a positive score.
That said, if you want to achieve the lowest number of collisions for the class, a lowish-performance (not particularly cache friendly) but minimal collision option is simply to have an array of random data...
uint32_t data[20][256] = { ... };
Download some genuinely random data from an Internet site to populate it with. Discard any duplicate numbers (in C++, you can use a std::set<> to find them). Index by character position (0..19) then character value, generating your hash by XORing the values.
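A minimal sketch of that table-lookup (tabulation) hashing idea; here the table is filled from std::random_device purely for illustration, where the answer suggests genuinely random downloaded data:

#include <array>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>

// 20 character positions x 256 character values of random 32-bit words.
static std::array<std::array<uint32_t, 256>, 20> table = [] {
    std::array<std::array<uint32_t, 256>, 20> t{};
    std::random_device rd;              // stand-in for genuinely random data
    for (auto& row : t)
        for (auto& v : row)
            v = rd();
    return t;
}();

uint32_t hash20(const std::string& s)   // strings of up to 20 characters
{
    uint32_t h = 0;
    for (std::size_t i = 0; i < s.size() && i < 20; ++i)
        h ^= table[i][static_cast<unsigned char>(s[i])];
    return h;
}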
Illustration of collisions
If unconvinced by the information above, you can generate a million random 32-bit values - as if they were hashes of distinct strings - and see how often the hash values repeat. Any given run should produce output not too far from the 116 collision average calculated above.
#include <iostream>
#include <map>
#include <random>

int main()
{
    std::random_device rd;
    std::map<unsigned, int> count;
    for (int i = 0; i < 1000000; ++i)
        ++count[rd()];
    std::map<int, int> histogram;
    for (auto& c : count)
        ++histogram[c.second];
    for (auto& h : histogram)
        std::cout << h.second << " hash values generated by " << h.first << " key(s)\n";
}
A few runs produced output...
$ ./poc
999752 hash values generated by 1 key(s)
124 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999796 hash values generated by 1 key(s)
102 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999784 hash values generated by 1 key(s)
108 hash values generated by 2 key(s)
$ ./poc
999744 hash values generated by 1 key(s)
128 hash values generated by 2 key(s)
Related
I need to generate a partly random sequence of numbers such that the sequence overall has a certain entropy level.
E.g. if I fed the generated data into gzip it would be able to compress it. In fact, this would be the exact application for the code: testing data compressors.
I'm programming this in C++ and the first idea that came to my mind would be to initialize a bunch of std::mt19937 PRNGs with random seeds, choose one randomly, and make a random-length pattern with it.
The std::mt19937 is reset each time with the same seed so that it always generates the same pattern:
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::random_device rd;
    std::vector<std::mt19937> rngs;
    std::vector<int> seeds;
    std::uniform_int_distribution<int> patternrg(0,31);
    std::uniform_int_distribution<int> lenghtrg(1,64);
    std::uniform_int_distribution<int> valuerg(0,255);
    for(int i = 0; i < 32; ++i) {
        seeds.push_back(rd());
        rngs.emplace_back(seeds.back());
    }
    for(;;) {
        // Choose generator and pattern length randomly.
        auto gen = patternrg(rd);
        auto len = lenghtrg(rd);
        rngs[gen].seed(seeds[gen]);
        for(int i = 0; i < len; ++i) {
            std::cout << valuerg( rngs[gen] ) << "\n";
        }
    }
}
The above code meets the first requirement of generating compressible randomness, but the second is harder: how to control the level of entropy/randomness?
Let me write a few sentences which you could find useful. Suppose we want to sample one bit with a given entropy. So it is either 0 or 1, and the entropy you want is equal to e.
H(bit|p) = -p log2(p) - (1 - p) log2(1 - p), where p is the probability to get 1. Simple test - in the case of p=1/2 one gets an entropy of 1, the maximum. So you
pick e equal to some value below 1, solve the equation
-p log2(p) - (1 - p) log2(1 - p) = e
and get back p, and then you could start sampling using a Bernoulli distribution. A simple demo is here. And in C++ one could use the standard library routine.
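For instance, a minimal sketch (my illustration, not part of the original answer): bisect for p on (0, 1/2], where H(p) is increasing, then feed the result to std::bernoulli_distribution:

#include <cmath>
#include <iostream>
#include <random>

// Binary entropy H(p) = -p log2(p) - (1-p) log2(1-p)
double H(double p)
{
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// Solve H(p) = e for the root in (0, 1/2] by bisection (H is increasing there).
double p_for_entropy(double e)
{
    double lo = 0.0, hi = 0.5;
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        if (H(mid) < e) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

int main()
{
    double p = p_for_entropy(0.5);   // ask for 0.5 bits of entropy per bit
    std::mt19937 gen(12345);
    std::bernoulli_distribution bit(p);
    for (int i = 0; i < 32; ++i) std::cout << bit(gen);
    std::cout << "  (p = " << p << ")\n";
}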
Ok, suppose you want to sample one byte with given entropy. It has 256 values, and entropy
H(byte | p_1..p_256) = -Sum_{i=1..256} p_i log2(p_i).
Again, if all combinations are equiprobable (p_i = 1/256), you'll get -256 * (1/256) * log2(1/256) = 8, which is the maximum entropy. If you now fix your entropy (say, I want it to be 7), then there will be an infinite number of solutions for the p_i; there is no single unique realization of a given entropy.
You could simplify the problem a bit - let's consider again the one-parameter case, where the probability to find a 1 is p and the probability to find a 0 is (1-p). Thus, from 256 outcomes we now keep 9 of them - 00000000, 00000001, 00000011, 00000111, 00001111, 00011111, 00111111, 01111111, 11111111.
For each of those cases we could write the probability, compute the entropy, set it equal to whatever value you want and solve back to find p.
Sampling would be relatively easy - the first step would be to sample one of the 9 combinations via a discrete distribution, and the second step would be to shuffle the bits inside the byte using a Fisher-Yates shuffle.
The same approach might be used for, say, 32 bits or 64 bits - you have 33 or 65 cases; construct the entropy, set it to whatever you want, find p, sample one of the cases and then
shuffle the bits inside the sampled value.
No code right now, but I could probably write some code later if there is an interest...
UPDATE
Keep in mind another peculiar property of fixing the entropy. Even for the simple case of a single bit, if you try to solve
-p log2(p) - (1 - p) log2(1 - p) = e
for a given e, you'll get two answers, and it is easy to understand why - the equation is symmetric with respect to p and 1-p (i.e., replacing 0s with 1s and 1s with 0s). In other words, for the entropy it is irrelevant whether you transfer information using mostly zeros or mostly ones. That is not true for things like natural text.
The entropy rate (in terms of the output byte values, not the human-readable characters) of your construction has several complications, but (for a number of generators much smaller than 256) it’s a good approximation to say that it’s the entropy of each choice (5 bits to pick the sequence plus 6 for its length) divided by the average length of the subsequences (65/2), or 0.338 bits out of a possible 8 per byte. (This is significantly lower than normal English text.) You can increase the entropy rate by defining more sequences or reducing the typical length of the subsequence drawn from each. (If the subsequence is often just one character, or the sequences number in the hundreds, collisions will necessarily reduce the entropy rate below this estimate and limit it to 8 bits per byte.)
Another easily tunable sequence class involves drawing single bytes from [0,n] with a probability p<1/(n+1) for 0 and the others equally likely. This gives an entropy rate H=(1-p)ln (n/(1-p))-p ln p which lies on [ln n,ln (n+1)), so any desired rate can be selected by choosing n and then p appropriately. (Remember to use lg instead of ln if you want bits of entropy.)
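A quick sketch of that byte-drawing scheme (my illustration, with assumed values n = 15 and p = 0.02 < 1/16), sampled with std::discrete_distribution:

#include <iostream>
#include <random>
#include <vector>

int main()
{
    const int n = 15;        // values drawn from [0, n]
    const double p = 0.02;   // probability of 0; must be < 1/(n+1)

    std::vector<double> weights(n + 1, (1.0 - p) / n);  // the other n values equally likely
    weights[0] = p;

    std::mt19937 gen(42);
    std::discrete_distribution<int> value(weights.begin(), weights.end());
    for (int i = 0; i < 64; ++i)
        std::cout << value(gen) << ' ';
    std::cout << '\n';
}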
I have 100 values in my configuration file.
Each value is built of 2 chars, e.g. 90 or AA or 04 or TR or FE.
I want to generate a hash code for each value - and store them in an array that contains 100 elements - each of the values from the configuration will be saved at its hash code index in the array.
The question:
How do I create a hash code from 2 chars such that the hash code is limited to between 0 and 99?
What you need in your specific case (mapping a fixed set of 2-byte sequences to consecutive numbers) is called perfect hashing.
While you could implement it yourself, there's an open-source tool called gperf which can generate the code for you:
There are options for generating C or C++ code, for emitting switch statements or nested ifs instead of a hash table, and for tuning the algorithm employed by gperf.
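If pulling in a code generator feels like overkill for 100 fixed two-character values, a hand-rolled alternative (my sketch, not gperf output) is a direct lookup table indexed by the two characters, built once from the configuration:

#include <string>
#include <vector>

// Build a 256*256 table once from the known two-character values;
// table[c1*256 + c2] holds the index 0..99, or -1 for unknown pairs.
std::vector<int> build_table(const std::vector<std::string>& values)
{
    std::vector<int> table(256 * 256, -1);
    for (int i = 0; i < static_cast<int>(values.size()); ++i) {
        unsigned char c1 = values[i][0], c2 = values[i][1];
        table[c1 * 256 + c2] = i;
    }
    return table;
}

int hash_code(const std::vector<int>& table, const std::string& v)
{
    unsigned char c1 = v[0], c2 = v[1];
    return table[c1 * 256 + c2];   // 0..99, or -1 if not a configured value
}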
I came across this question from a colleague.
Q: Given a huge list (say some thousands) of positive integers, with many values repeating in the list, how do you find the values occurring an odd number of times?
Like 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1...
Here,
1 occurs 8 times
2 occurs 7 times (must be listed in output)
3 occurs 6 times
4 occurs 5 times (must be listed in output)
& so on... (the above set of values is only for explaining the problem, but really there could be any positive numbers in the list, in any order).
Originally we were looking at deriving the logic (to be based on C).
I suggested the following:
Using a hash table with the values from the list as index/key into the table, keep updating the count at the corresponding index every time the value is encountered while walking through the list; however, how do you decide on the size of the hash table? I couldn't say for sure, though it might require a hash table as big as the list.
Once the list is walked through and the hash table is populated (with the count of occurrences for each value/index), is the only way to find/list the values occurring an odd number of times to walk through the table? Is that the only way to do it?
This might not be the best solution given this scenario.
Can you please suggest any other efficient way of doing it?
I searched SO, but there were questions/replies about finding a single value occurring an odd number of times, but none like the one I have mentioned.
The relevance of this question is not known, but it seems to have been asked in his interview...
Please suggest.
Thank You,
If the values to be counted are bounded by even a moderately reasonable limit then you can just create an array of counters, and use the values to be counted as the array indices. You don't need a tight bound, and "reasonable" is somewhat a matter of platform. I would not hesitate to take this approach for a bound (and therefore array size) sufficient for all uint16_t values, and that's not a hard limit:
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define UPPER_BOUND 65536

uint64_t count[UPPER_BOUND];

void count_values(size_t num_values, uint16_t values[num_values]) {
    size_t i;

    memset(count, 0, sizeof(count));
    for (i = 0; i < num_values; i += 1) {
        count[values[i]] += 1;
    }
}
Since you only need to track even vs. odd counts, though, you really only need one bit per distinct value in the input. Squeezing it that far is a bit extreme, but this isn't so bad:
#define UPPER_BOUND 65536

uint8_t odd[UPPER_BOUND];

void count_values(size_t num_values, uint16_t values[num_values]) {
    size_t i;

    memset(odd, 0, sizeof(odd));
    for (i = 0; i < num_values; i += 1) {
        odd[values[i]] ^= 1;
    }
}
At the end, odd[i] contains 1 if the value i appeared an odd number of times, and it contains 0 if i appeared an even number of times.
On the other hand, if the values to be counted are so widely distributed that an array would require too much memory, then the hash table approach seems reasonable. In that case, however, you are asking the wrong question. Rather than
how to decide on the size of the hash table?
you should be asking something along the lines of "what hash table implementation doesn't require me to manage the table size manually?" There are several. Personally, I have used UTHash successfully, though as of recently it is no longer maintained.
You could also use a linked list maintained in order, or a search tree. No doubt there are other viable choices.
You also asked
Once the list is walked through and the hash table is populated (with the count of occurrences for each value/index), is the only way to find/list the values occurring an odd number of times to walk through the table? Is that the only way to do it?
If you perform the analysis via the general approach we have discussed so far then yes, the only way to read out the result is to iterate through the counts. I can imagine alternative, more complicated, approaches wherein you switch numbers between lists of those having even counts and those having odd counts, but I'm having trouble seeing how whatever efficiency you might gain in readout could fail to be swamped by the efficiency loss at the counting stage.
In your specific case, you can walk the list and toggle the value's existence in a set. The resulting set will contain all of the values that appeared an odd number of times. However, this only works for that specific predicate, and the more generic count-then-filter algorithm you describe will be required if you wanted, say, all of the entries that appear an even number of times.
Both algorithms should be O(N) time and worst-case O(N) space, and the constants will probably be lower for the set-based algorithm, but you'll need to benchmark it against your data. In practice, I'd run with the more generic algorithm unless there was a clear performance problem.
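A sketch of that set-toggle approach (assuming int values):

#include <unordered_set>
#include <vector>

// Returns the values that occur an odd number of times in the list.
std::unordered_set<int> odd_occurrences(const std::vector<int>& list)
{
    std::unordered_set<int> odd;
    for (int v : list) {
        // Toggle membership: insert if absent, erase if already present.
        auto result = odd.insert(v);
        if (!result.second)
            odd.erase(result.first);
    }
    return odd;
}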
In my program I have a set of sets that are stored in a proprietary hash table. Like all hash tables, I need two functions for each element. First, I need the hash value to use for insertion. Second, I need a compare function when there are conflicts. It occurs to me that a checksum function would be perfect for this. I could use the value in both functions. There's no shortage of checksum functions, but I would like to know if there are any commonly available ones that I wouldn't need to bring in a library for (my company is a PIA when it comes to that). A system library would be OK.
But I have an additional, more complicated requirement. I need the checksum to be incrementally calculable. That is, if a set contains A B C D E F and I subtract D from the set, it should be able to return a new checksum value without iterating over all the elements in the set again. The reason for this is to prevent non-linearity in my code. Ideally, I'd like the checksum to be order independent, but I can sort the elements first if needed. Does such an algorithm exist?
Simply store a dictionary of items in your set, and their corresponding hash value. The hash value of the set is the hash value of the concatenated, sorted hashes of the items. In Python:
# Dictionary of hashes in string (hex) representation, e.g.:
hashes = { item: hashlib.sha384(item.encode()).hexdigest() for item in items }
sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join(sorted_hashes)
hash_of_the_set = hashlib.sha384(concatenated_hashes.encode()).hexdigest()
As hash function I would use sha384, but you might want to try Keccak-384.
Because there are (of course) no cryptographic hash functions with a length of only 32 bits, you have to use a checksum instead, like Adler-32 or CRC-32. The idea remains the same. Best use Adler-32 on the items and CRC-32 on the concatenated hashes:
hashes = { item: zlib.adler32(item.encode()) for item in items }
sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join('%08x' % h for h in sorted_hashes)
hash_of_the_set = zlib.crc32(concatenated_hashes.encode())
In C++ you can use the Adler-32 and CRC-32 implementations from Botan.
A CRC is a set of bits that are calculated from an input.
If your input is the same size (or less) as the CRC (in your case - 32 bits), you can find the input that created this CRC - in effect reversing it.
If your input is larger than 32 bits, but you know all the input except for 32 bits, you can still reverse the CRC to find the missing bits.
If, however, the unknown part of the input is larger than 32 bits, you can't find it as there is more than one solution.
Why am I telling you this? Imagine you have the CRC of the set
{A,B,C}
Say you know what B is, and you can now calculate easily the CRC of the set
{A,C}
(by "easily" I mean - without going over the entire A and C inputs - like you wanted)
Now you have 64 bits describing A and C! And since we didn't have to go over the entirety of A and C to do it - it means we can do it even if we're missing information about A and C.
So it looks like IF such a method exists, we can magically fix more than 32 unknown bits from an input if we have the CRC of it.
This obviously is wrong. Does that mean there's no way to do what you want? Of course not. But it does give us constraints on how it can be done:
Option 1: we don't gain more information from CRC({A,C}) that we didn't have in CRC({A,B,C}). That means that the (relative) effect of A and C on the CRC doesn't change with the removal of B. Basically - it means that when calculating the CRC we use some "order not important" function when adding new elements:
We can use, for example:
CRC({A,B,C}) = CRC(A) ^ CRC(B) ^ CRC(C) (not very good, as if A appears twice it's the same CRC as if it never appeared at all), or
CRC({A,B,C}) = CRC(A) + CRC(B) + CRC(C), or
CRC({A,B,C}) = CRC(A) * CRC(B) * CRC(C) (make sure CRC(X) is odd, so it's actually just 31 bits of CRC), or
CRC({A,B,C}) = g^CRC(A) * g^CRC(B) * g^CRC(C) (where ^ is power - useful if you want it to be cryptographically secure),
etc.
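To make Option 1 concrete, here is a minimal sketch (my illustration, using addition modulo 2^64 of per-element hashes), where adding or removing one element costs only the hash of that element, independent of the rest of the set:

#include <cstdint>
#include <functional>
#include <string>

// Order-independent, incrementally updatable "checksum" of a set of strings:
// keep the sum (mod 2^64) of the elements' individual hashes.
struct SetChecksum {
    uint64_t value = 0;

    static uint64_t elem_hash(const std::string& s)
    {
        return std::hash<std::string>{}(s);   // any fixed per-element hash works here
    }

    void add(const std::string& s)    { value += elem_hash(s); }  // no rescan of the set
    void remove(const std::string& s) { value -= elem_hash(s); }  // no rescan of the set
};

Adding A..F in any order and then removing D yields the same value as building the checksum of the set without D.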
Option 2: we do need all of A and C to calculate CRC({A,C}), but we have a data structure that makes it less than linear in time to do so if we already calculated CRC({A,B,C}).
This is useful if you want specifically CRC32, and don't mind remembering more information in addition to the CRC after the calculation (the CRC is still 32 bit, but you remember a data structure that's O(len(A,B,C)) that you will later use to calculate CRC{A,C} more efficiently)
How will that work? Many CRCs are just the application of a polynomial on the input.
Basically, if you divide the input into n chunks of 32 bit each - X_1...X_n - there is a matrix M such that
CRC(X_1...X_n) = M^n * X_1 + ... + M^1 * X_n
(where ^ here is power)
How does that help? This sum can be calculated in a tree-like fashion:
CRC(X_1...X_n) = M^(n/2) * CRC(X_1...X_n/2) + CRC(X_(n/2+1)...X_n)
So you begin with all the X_i on the leaves of the tree, start by calculating the CRC of each consecutive pair, then combine them in pairs until you get the combined CRC of all your input.
If you remember all the partial CRCs on the nodes, you can then easily remove (or add) an item anywhere in the list by doing just O(log(n)) calculations!
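For a taste of Option 2 without writing the matrix math yourself: zlib already exposes the combining primitive, crc32_combine, so given the partial CRCs and the chunk lengths you can get the CRC of A followed by C without re-reading either buffer - exactly the operation the tree nodes above need. A minimal illustration, assuming zlib is available (link with -lz):

#include <cstring>
#include <iostream>
#include <string>
#include <zlib.h>

int main()
{
    const char* A = "first chunk ";
    const char* C = "third chunk";

    // Partial CRCs, computed once per chunk.
    uLong crcA = crc32(0L, reinterpret_cast<const Bytef*>(A), static_cast<uInt>(std::strlen(A)));
    uLong crcC = crc32(0L, reinterpret_cast<const Bytef*>(C), static_cast<uInt>(std::strlen(C)));

    // CRC of A followed by C, combined without touching the data again.
    uLong combined = crc32_combine(crcA, crcC, std::strlen(C));

    // Check against a direct computation over the concatenation.
    std::string AC = std::string(A) + C;
    uLong direct = crc32(0L, reinterpret_cast<const Bytef*>(AC.data()), static_cast<uInt>(AC.size()));
    std::cout << std::hex << combined << " == " << direct << '\n';
}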
So there - as far as I can tell, those are your two options. I hope this wasn't too much of a mess :)
I'd personally go with option 1, as it's just simpler... but the resulting CRC isn't standard, and is less... good. Less "CRC"-like.
Cheers!
Random question.
I am attempting to create a program which would generate a pseudo-random distribution. I am trying to find the right pseudo-random algorithm for my needs. These are my concerns:
1) I need one input to generate the same output every time it is used.
2) It needs to be random enough that a person who looks at the output from input 1 sees no connection between that and the output from input 2 (etc.), but there is no need for it to be cryptographically secure or truly random.
3) Its output should be a number between 0 and (29^3200)-1, with every possible integer in that range a possible and equally (or close to it) likely output.
4) I would like to be able to guarantee that every possible permutation of sequences of 410 outputs is also a potential output of consecutive inputs. In other words, all the possible groupings of 410 integers between 0 and (29^3200)-1 should be potential outputs of sequential inputs.
5) I would like the function to be invertible, so that I could take an integer, or a series of integers, and say which input or series of inputs would produce that result.
The method I have developed so far is to run the input through a simple Halton sequence:
boost::multiprecision::mpz_int denominator = 1;
boost::multiprecision::mpz_int numerator = 0;
while (input > 0) {
    denominator *= 3;
    numerator = numerator * 3 + (input % 3);
    input = input / 3;
}
and multiply the result by 29^3200. It meets requirements 1-3, but not 4. And it is invertible only for single integers, not for series (since not all sequences can be produced by it). I am working in C++, using boost multiprecision.
Any advice someone can give me concerning a way to generate a random distribution meeting these requirements, or just a class of algorithms worth researching towards this end, would be greatly appreciated. Thank you in advance for considering my question.
----UPDATE----
Since multiple commenters have focused on the size of the numbers in question, I just wanted to make clear that I recognize the practical problems that working with such sets poses, but in asking this question I'm interested only in the theoretical or conceptual approach to the problem. For example, imagine working with a much smaller set of integers, like 0 to 99, and the permutations of output sequences of length 10. How would you design an algorithm to meet these five conditions: 1) input is deterministic, 2) appears random (at least to the human eye), 3) every integer in the range is a possible output, 4) not only all values, but also all permutations of value sequences are possible outputs, 5) the function is invertible.
---second update---
With many thanks to @Severin Pappadeux I was able to invert an LCG. I thought I'd add a little bit about what I did, to hopefully make it easier for anyone seeing this in the future. First of all, these are excellent sources on inverting modular functions:
https://www.khanacademy.org/computing/computer-science/cryptography/modarithmetic/a/modular-inverses
https://www.khanacademy.org/computer-programming/discrete-reciprocal-mod-m/6253215254052864
If you take the equation next = (a*x + c) % m, using the following code with your values for a and m will print out the Euclidean equations you need to find ainverse, as well as the value of ainverse:
// Extended Euclidean algorithm: prints the division steps and the
// modular inverse of a mod m. (qarray holds the extended-Euclid
// coefficients; size it for the number of division steps.)
int qarray[12];
qarray[0] = 0;
qarray[1] = 1;
int i = 2;
int reset = m;
while (m % a > 0) {
    int remainder = m % a;
    int quotient = m / a;
    std::cout << m << " = " << quotient << "*" << a << " + " << remainder << "\n";
    qarray[i] = qarray[i-2] - (qarray[i-1] * quotient);
    m = a;
    a = remainder;
    i++;
}
if (qarray[i-1] < 0) { qarray[i-1] += reset; }
std::cout << qarray[i-1] << "\n";
The other thing it took me a while to figure out is that if you get a negative result, you should add m to it. You should add a similar correction to your inversion equation:
prev = (ainverse * (next - c)) % m;
if (prev < 0) { prev += m; }
I hope that helps anyone who ventures down this road in the future.
OK, I'm not sure if there is a general answer, so I would concentrate on a random number generator having, say, a 64-bit internal state/seed, producing 64-bit output, and having a 2^64-1 period. In particular, I would look at the linear congruential generator (aka LCG) in the form of
next = (a * prev + c) mod m
where a and m are relatively prime to each other
So:
1) Check
2) Check
3) Check (well, for 64bit space of course)
4) Check (again, except 0 I believe, but each and every permutation of 64bits is output of LCG starting with some seed)
5) Check. LCG is known to be reversible, i.e. one could get
prev = (next - c) * a_inv mod m
where a_inv could be computed from a, m using Euclid's algorithm
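A small round-trip demo (illustrative constants, not from the answer): compute a_inv with the extended Euclidean algorithm, step the LCG forward, then step it back:

#include <cstdint>
#include <iostream>
#include <utility>

// Modular inverse of a mod m via the extended Euclidean algorithm
// (m chosen prime here, so the inverse always exists).
int64_t mod_inverse(int64_t a, int64_t m)
{
    int64_t t = 0, new_t = 1, r = m, new_r = a;
    while (new_r != 0) {
        int64_t q = r / new_r;
        t -= q * new_t;  std::swap(t, new_t);
        r -= q * new_r;  std::swap(r, new_r);
    }
    return t < 0 ? t + m : t;
}

int main()
{
    const int64_t m = 1000000007, a = 48271, c = 12345;  // illustrative values
    const int64_t a_inv = mod_inverse(a, m);

    int64_t x = 424242;
    int64_t next = (a * x + c) % m;                   // forward step
    int64_t prev = ((next - c + m) % m) * a_inv % m;  // inverse step: prev == x
    std::cout << x << " -> " << next << " -> " << prev << "\n";
}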
Well, if it looks OK to you, you could try to implement an LCG in your 15546-bit space.
UPDATE
And quick search shows reversible LCG discussion/code here
Reversible pseudo-random sequence generator
In your update, "appears random (to the human eye)" is the phrasing you use. The definition of "appears random" is not a well agreed upon topic. There are varying degrees of tests for "randomness."
However, if you're just looking to make it appear random to the human eye, you can just use ring multiplication.
Start with the idea of generating N! values between 0 and M (N>=410, M>=29^3200)
Group this together into one big number. We're going to generate a single number ranging from 0 to M^N!. If we can show that the pseudorandom number generator generates every value from 0 to M^N!, we guarantee your permutation rule.
Now we need to make it "appear random." To the human eye, Linear Congruential Generators are enough. Pick an LCG with a period greater than or equal to 410!*M^N satisfying the rules to ensure a complete period. The easiest way to ensure fairness is to pick an LCG in the form x' = (ax+c) mod M^N!
That'll do the trick. Now, the hard part is proving that what you did was worth your time. Consider that the period of just a 29^3200-long sequence is outside the realm of physical reality. You'll never actually use it all. Ever. Consider that a supercomputer made of Josephson junctions (10^-12 kg processing 10^11 bits/s) weighing the mass of the entire universe (3*10^52 kg) could process roughly 10^75 bits/s. A number that can count to 29^3200 is roughly 15545 bits long, so that supercomputer could process roughly 6.5*10^71 numbers/s. This means it would take roughly 10^4600 s merely to count that high, or somewhere around 10^4592 years. Somewhere around 10^12 years from now, the stars are expected to wink out, permanently, so it could be a while.
There are M**N sequences of N numbers between 0 and M-1.
You can imagine writing all of them one after the other in a (pseudorandom) sequence and placing your read pointer randomly in the resulting loop of N*(M**N) numbers between 0 and M-1...
def output(input):
    # shuffle(i, k) stands for a deterministic pseudorandom permutation of range(k).
    total_length = N * (M**N)
    index = input % total_length
    permutation_index = shuffle(index // N, M**N)
    element = index % N
    return (permutation_index // (M**element)) % M
Of course, for every permutation of N elements between 0 and M-1 there is a sequence of N consecutive inputs that produces it (just un-shuffle the permutation index). I'd also say (just using symmetry reasoning) that given any starting input, the next N outputs are equally probable (each number and each sequence of N numbers is equally represented in the total period).