If I want to use extendible hashing to store a maximum of 100 records, then what is the minimum array size that I need?
I am guessing that an array of 100 would be sufficient, but I could be wrong. I also suspect that I can use a smaller array.
What do you know about your hash function?
You mentioned extendible hashing.
With extendible hashing you treat the hash as a bit string, and the bucket lookup is typically implemented via a trie. Instead of a trie-based lookup, though, I assume you are converting the leading bits of the hash into an index into your array.
You mentioned you will have at most 100 elements. If you wanted every element to get its own slot, you'd need 128 possibilities, since 2^7 = 128 is the smallest power of two that covers 100.
If your hashing function gives every element a distinct value in its first 7 bits, then you have the optimal solution with a bucket size of 1. That leaves 128 leaf nodes, i.e. an array of size 128.
If your hashing function distributes elements so that no more than two of them share the same first 6 bits, then a bucket size of 2 suffices, giving 64 leaf nodes/combinations, i.e. an array of size 64.
If your hashing function distributes elements so that no more than four of them share the same first 5 bits, then a bucket size of 4 suffices, giving 32 leaf nodes/combinations, i.e. an array of size 32.
Since you said you want a bucket size of 4, I think your answer would be 32, with the hard requirement that you have a good hashing function that distributes elements so that no more than four of them share the same first 5 bits.
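As a concrete illustration of the arithmetic above, here is a minimal sketch of deriving the array (directory) index, assuming 32-bit hashes and a directory indexed by the most significant bits; the names are my own:

    #include <cstdint>
    #include <iostream>

    // Directory index = the first `globalDepth` bits of the hash.
    uint32_t directoryIndex(uint32_t hash, unsigned globalDepth) {
        return hash >> (32u - globalDepth);
    }

    int main() {
        // globalDepth = 5 gives a directory (array) of 2^5 = 32 entries;
        // with a bucket capacity of 4 that covers 32 * 4 = 128 >= 100 records.
        const unsigned globalDepth = 5;
        std::cout << directoryIndex(0xDEADBEEFu, globalDepth) << '\n';  // prints 27
    }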
I think it depends on whether you need high performance or want to save storage. You can just save the elements into an array of 100. I don't know a lot about extendible hashing, but my general understanding of hashing is that it will have some kind of collisions, and if you use a bigger array to store the elements, the number of collisions can be reduced and adding/deleting and querying will also be faster. I think you should use at least 128 (just to be 2^k; I am not an expert in hashing) :)
Related
If the input data entries are around 10^9, do we keep the size of the hash table the same as the input size or reduce the size? How do we decide the table size?
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Kindly explain how these two things work. It's getting quite confusing. Thanks!!
I tried to make the table size around 75% of the input data size, which you can call X. Then I did key % X to get the hash code. But I am not sure if this is correct.
If the input data entries are around 10^9, do we keep the size of the hash table the same as the input size or reduce the size? How do we decide the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate-chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to more collisions, longer lists and longer search times. But at load factors under 5 or 10, with an OK hash function, the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
One important decision you should make is whether to pick a number around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious; in any case, which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not, a prime number is the safer bet).
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/std::unordered_multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which is usually expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
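To make the hash-then-fold split concrete, here is a small sketch (the function names are mine, not from any library):

    #include <cstddef>
    #include <functional>
    #include <string>

    // Step 1: hash the key to a size_t; step 2: fold it over the bucket count.
    std::size_t bucketFor(const std::string& key, std::size_t bucketCount) {
        std::size_t h = std::hash<std::string>{}(key);  // hashing
        return h % bucketCount;                         // folding (general case)
    }

    // With a power-of-two bucket count the fold can be a bitwise AND:
    // e.g. for 256 buckets, (h & 255) == (h % 256).
    std::size_t bucketForPow2(const std::string& key, std::size_t bucketCountPow2) {
        std::size_t h = std::hash<std::string>{}(key);
        return h & (bucketCountPow2 - 1);
    }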
I tried to make the table size around 75% of the input data size, which you can call X.
So that's a load factor around 1.33, which is ok.
Then I did key % X to get the hash code. But I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by all major C++ compiler Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two they'd tend to only map to every 16th bucket) then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
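If the identity hash does collide too much for your keys, one widely used stronger mix for 64-bit integers is a splitmix64-style finalizer; here is a sketch (not tied to any particular library):

    #include <cstdint>

    // splitmix64-style mixing: multiplications and xor-shifts spread the
    // input bits across the whole 64-bit result, so nearby or stride-N keys
    // no longer land in the same few buckets.
    uint64_t mix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Usage: bucket = mix64(key) % bucket_count;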
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Implementing support for that can easily be done by first writing a hash table class that doesn't support rehashing, simply taking the number of buckets to use as a constructor argument. Then write an outer rehashing-capable hash table class with a data member of that type. When an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), construct an additional inner hash table object, telling its constructor a new larger bucket count to use, then iterate over the smaller hash table inserting (or, better, moving; see below) the elements into the new hash table, then swap the two hash tables so the data member ends up with the new larger content and the smaller one is destructed.
By "moving" above I mean simply relinking the linked-list elements from the smaller hash table into the lists in the larger one, instead of deep copying the elements, which will be dramatically faster and momentarily use less memory while rehashing.
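Here is a rough structural sketch of that two-class design; everything in it is hypothetical: InnerTable and its size(), bucket_count(), move_nodes_into() and insert() members stand in for the non-rehashing class described above.

    #include <cstddef>
    #include <utility>

    // Outer, rehashing-capable table wrapping a fixed-bucket-count inner table.
    template <typename InnerTable>
    class RehashingTable {
        InnerTable inner_;
        float max_load_factor_ = 1.0f;   // mirrors the Standard containers' default
    public:
        explicit RehashingTable(std::size_t buckets) : inner_(buckets) {}

        template <typename Value>
        void insert(Value&& v) {
            // Would this insert push the load factor past the limit?
            if ((inner_.size() + 1.0f) / inner_.bucket_count() > max_load_factor_) {
                InnerTable bigger(inner_.bucket_count() * 2);
                inner_.move_nodes_into(bigger);  // relink list nodes, no deep copies
                std::swap(inner_, bigger);       // old, smaller table is destructed
            }                                    // when `bigger` goes out of scope
            inner_.insert(std::forward<Value>(v));
        }
    };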
I need to map an array of sorted integers, with the length varying from 1 to 4 at max, to an index of a global array. Like [13,24,32] becoming a number in range 0..n, and no other array mapping to that same number.
The quantity of arrays is a few millions, and the mapping has to be "unique" (or at least with very few collisions for smaller arrays), because these arrays represent itemsets and I use the k-1 smaller itemset to build the one of size k.
My current implementation uses an efficient hash function that produces a double between 0 and 1 for an array, and I store the itemsets in an STL map, with the double as the key. I got it from this article:
N. D. Atreas and C. Karanikas, "A faster pattern matching algorithm based on prime numbers and hashing approximation", 2007
I'm going to implement a parallel version of this in CUDA, so I can't use something like an STL map. I could easily build a self-balancing binary search tree as a map in GPU global memory, but that would be really slow. So, in order to reduce global memory accesses to a minimum, I need to map each itemset to a huge array in global memory.
I've tried casting the double to a long integer and hashing it with a 64-bit hash function, but it produces some collisions, as expected.
So I ask: is there a "unique" hash function for doubles between 0 and 1, or for arrays of integers of size 1 to 4, that gives a unique index into a table of size N?
If I make this assumption about your arrays:
each item of your arrays (such as 13) is a 32-bit integer.
Then what you ask is impossible.
You have at least 2^(32*4) possible values, i.e. 128 bits' worth, and you are trying to pack them into an array of much smaller size (about 20 bits of index for one million entries). You cannot do this without collisions (or without some agreement amongst the elements, such as each element choosing the "next available index", but that's not a hash).
Say I decided that my hasher for a hash_set of integers is the integer itself. And also say my integer values span a very large range: 1-20, then 1000-1200, then 10000-12000.
e.g.: 1, 2, 5, 7, 1111, 1102, 1000, 10003, 10005
Wouldn't that be a very bad hashing function? How would the data be stored by hash_set in this case, in, say, the GCC implementation, if anyone knows?
Thanks
EDIT:
Thank you for both replies. I should note that I have already specified my hasher to return the input value, e.g. the hash for 1001 would be 1001. So I'm asking whether the implementation would take the liberty of doing another round of hashing, or whether it would see 1001 and grow the array to size 1001.
Even if your data is clumped in certain ranges, typically only the least significant bits of each hash value will be used to place it. This means that if, say, the bits covering 0-128 are evenly distributed, your hash function will still behave well regardless of how the hash values themselves are distributed. It does mean, however, that if your values are all multiples of some binary value, e.g. eight, the lower bits won't be so evenly distributed; the values will clump in the hash table, causing excessive chaining and slowing down operations.
The hash table would start small, occasionally rehashing to grow when the load factor gets high enough. Just because the hash value is 12000 does not mean there would be 12000 buckets, of course--the hash_set will do something like "mod" the hash function's output to make it fit within the number of buckets.
The identity function you describe is not a bad hash function for many hash table implementations (including GCC's). In fact it is what many people use, and obviously it is efficient. What it would be a bad example of is a cryptographic hash function, but that has a different purpose.
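To see this in practice, here is a small sketch using the modern equivalent of hash_set, std::unordered_set (the exact bucket count and placements are implementation details and will vary):

    #include <functional>
    #include <iostream>
    #include <unordered_set>

    int main() {
        std::unordered_set<int> s{1, 2, 5, 7, 1111, 1102, 1000, 10003, 10005};

        // With GCC's libstdc++ (and libc++), std::hash<int> is the identity,
        // so the hash of 1001 is simply 1001.
        std::cout << "hash(1001)   = " << std::hash<int>{}(1001) << '\n';

        // The table does NOT grow to 12000 slots; each value lands in bucket
        // (hash % bucket_count), whatever bucket_count currently is.
        std::cout << "bucket_count = " << s.bucket_count() << '\n';
        for (int v : {1, 1111, 10005})
            std::cout << v << " -> bucket " << s.bucket(v) << '\n';
    }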
There are two integer arrays, each in very large files (the size of each is larger than RAM). How would you find the common elements in the arrays in linear time?
I can't find a decent solution to this problem. Any ideas?
In one pass over one file, build a bitmap (or a Bloom filter if the integer range is too large for a bitmap in memory).
In one pass over the other file, find the duplicates (or candidates, if using a Bloom filter).
If you use a Bloom filter, the result is probabilistic. Additional passes can reduce the false positives (Bloom filters don't have false negatives).
Assuming integer size is 4 bytes.
We can then have a maximum of 2^32 distinct integers, i.e. I can have a bit vector of 2^32 bits (512 MB) to represent all integers, where each bit represents one integer.
1. Initialize this vector with all zeroes.
2. Now go through one file and set the corresponding bit to 1 for each integer you find.
3. Now go through the other file and check whether the corresponding bit is set in the bit vector (see the sketch below).
Time complexity: O(n + m)
Space complexity: 512 MB
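A minimal sketch of the numbered steps above, assuming 32-bit unsigned integers stored in binary files (the file names are placeholders):

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        // One bit per possible 32-bit value: 2^32 bits, roughly 512 MB.
        std::vector<bool> seen(1ULL << 32, false);

        uint32_t x;
        std::ifstream a("file_a.bin", std::ios::binary);
        while (a.read(reinterpret_cast<char*>(&x), sizeof x))
            seen[x] = true;                      // pass 1: mark every value in A

        std::ifstream b("file_b.bin", std::ios::binary);
        while (b.read(reinterpret_cast<char*>(&x), sizeof x))
            if (seen[x]) {
                std::cout << x << '\n';          // pass 2: value occurs in both
                seen[x] = false;                 // avoid reporting it twice
            }
    }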
You can obviously use a hash table to find the common elements with O(n) time complexity.
First, create a hash table from the first array, then look up each element of the second array in that hash table.
Let's say enough RAM is available to hold a hash of 5% of either given file-array (FA).
So, I can split the file arrays (FA1 and FA2) into 20 chunks each, say by taking the contents MOD 20. We get FA1(0)...FA1(19) and FA2(0)...FA2(19). This can be done in linear time.
Hash FA1(0) in memory and compare the contents of FA2(0) against this hash. Hashing and checking for existence are constant-time operations.
Destroy this hash and repeat for FA1(1)...FA1(19). This is also linear, so the whole operation is linear.
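A rough sketch of that partitioning scheme, assuming 32-bit unsigned integers in binary files (the file names, the ".bin" suffix and the 20-way split are placeholders):

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_set>

    // Split one file-array into 20 partition files by value % 20.
    void partition(const std::string& in, const std::string& prefix) {
        std::ifstream f(in, std::ios::binary);
        std::ofstream out[20];
        for (int i = 0; i < 20; ++i)
            out[i].open(prefix + std::to_string(i) + ".bin", std::ios::binary);
        uint32_t x;
        while (f.read(reinterpret_cast<char*>(&x), sizeof x))
            out[x % 20].write(reinterpret_cast<const char*>(&x), sizeof x);
    }

    int main() {
        partition("fa1.bin", "fa1_");
        partition("fa2.bin", "fa2_");

        // For each partition: hash FA1's chunk, probe with FA2's chunk.
        for (int i = 0; i < 20; ++i) {
            std::unordered_set<uint32_t> h;
            uint32_t x;
            std::ifstream a("fa1_" + std::to_string(i) + ".bin", std::ios::binary);
            while (a.read(reinterpret_cast<char*>(&x), sizeof x))
                h.insert(x);
            std::ifstream b("fa2_" + std::to_string(i) + ".bin", std::ios::binary);
            while (b.read(reinterpret_cast<char*>(&x), sizeof x))
                if (h.count(x))
                    std::cout << x << '\n';      // common element
        }
    }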
Assuming you are talking about integers of the same size, written to the files in binary mode, you first sort the two files (use a quicksort, but reading and writing at file offsets rather than keeping everything in memory).
Then you just move from the start of the two files and check for matches; if you have a match, write the output to another file (assuming you can't store the result in memory either) and keep moving through the files until EOF.
Sort the files. With fixed-length integers this can be done in O(n) time:
Read some part of a file, sort it with radix sort, and write it to a temporary file. Repeat until all the data has been processed. This part is O(n).
Merge the sorted parts. This is O(n) too. You can even skip repeated numbers.
On the sorted files, find the common subset of integers: compare the current numbers, write one down if they are equal, then step one number ahead in the file with the smaller number. This is O(n).
All operations are O(n), so the final algorithm is O(n) too.
EDIT: the bitmap method is much faster if you have enough memory for the bitmap. It works for any fixed-size integers, 64-bit for example, but a bitmap for 64-bit integers (about 2^41 MB) will not be practical for at least a few years :)
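The final merge-intersection step over the two sorted files might look like the sketch below (file names are placeholders, integers are assumed to be 32-bit unsigned, and repeated numbers are assumed to have been skipped during the merge, as suggested above):

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    int main() {
        std::ifstream a("sorted_a.bin", std::ios::binary);
        std::ifstream b("sorted_b.bin", std::ios::binary);

        uint32_t x, y;
        bool okA = static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof x));
        bool okB = static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof y));

        while (okA && okB) {
            if (x == y) {                        // common element: output it,
                std::cout << x << '\n';          // then advance both files
                okA = static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof x));
                okB = static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof y));
            } else if (x < y) {                  // step ahead in the file holding
                okA = static_cast<bool>(a.read(reinterpret_cast<char*>(&x), sizeof x));
            } else {                             // the smaller number
                okB = static_cast<bool>(b.read(reinterpret_cast<char*>(&y), sizeof y));
            }
        }
    }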
Say you have a List of 32-bit Integers and the same collection of 32-bit Integers in a Multiset (a set that allows duplicate members)
Since Sets don't preserve order but Lists do, does this mean we can encode a Multiset in fewer bits than the List?
If so how would you encode the Multiset?
If this is true what other examples are there where not needing to preserve order saves bits?
Note, I just used 32-bit Integers as an example. Does the data type matter in the encoding? Does the data type need to be fixed length and comparable for you to get the savings?
EDIT
Any solution should work well for collections that have low duplication as well as high duplication. It's obvious that with high duplication, encoding a Multiset by simply counting duplicates is very easy, but this takes more space if there is no duplication in the collection.
In the multiset, each entry would be a pair of numbers: The integer value, and a count of how many times it is used in the set. This means additional repeats of each value in the multiset do not cost any more to store (you just increment the counter).
However (assuming both values are ints), this is only more space-efficient than a simple list if each list item is repeated twice or more on average. There could be more efficient or higher-performance ways of implementing this, depending on the range, sparsity, and repetitiveness of the numbers being stored. (For example, if you know there won't be more than 255 repeats of any value, you could use a byte rather than an int to store the counter.)
This approach would work with any types of data, as you are just storing the count of how many repeats there are of each data item. Each data item needs to be comparable (but only to the point where you know that two items are the same or different). There is no need for the items to take the same amount of storage each.
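A minimal sketch of that value-to-count representation (assuming int values, as above):

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    int main() {
        std::vector<int32_t> list{7, 3, 7, 7, 3, 42};

        // Each distinct value is stored once, next to its repeat count.
        std::map<int32_t, uint32_t> counts;
        for (int32_t v : list)
            ++counts[v];

        // Six list entries collapse to three (value, count) pairs here; with no
        // duplicates at all the pairs would take MORE space than the plain list.
        for (const auto& [value, count] : counts)
            std::cout << value << " x" << count << '\n';
    }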
If there are duplicates in the multiset, it can be compressed to a smaller size than a naive list. You may want to have a look at run-length encoding, which can be used to store duplicates efficiently (it is a very simple algorithm).
Hope that is what you meant...
Data compression is a fairly complicated subject, and there are redundancies in data that are hard to use for compression.
It is fundamentally ad hoc, since a non-lossy scheme (one where you can recover the input data) that shrinks some data sets has to enlarge others. A collection of integers with lots of repeats will do very well in a multimap, but if there's no repetition you're using a lot of space on repeating counts of 1. You can test this by running compression utilities on different files. Text files have a lot of redundancy, and can typically be compressed a lot. Files of random numbers will tend to grow when compressed.
I don't know that there really is an exploitable advantage in losing the order information. It depends on what the actual numbers are, primarily if there's a lot of duplication or not.
In principle, this is the equivalent of sorting the values and storing the first entry and the ordered differences between subsequent entries.
In other words, for a sparsely populated set only a little saving can be had, but for a denser set, or one with clustered entries, more significant compression is possible (i.e. fewer bits need to be stored per entry, possibly less than one in the case of many duplicates). So compression is possible, but the level depends on the actual data.
Sorting followed by delta-encoding the list results in a serialized form that is easier to compress.
E.g. [ 2 12 3 9 4 4 0 11 ] -> [ 0 2 3 4 4 9 11 12 ] -> [ 0 2 1 1 0 5 2 1 ], which weighs about half as much.
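A small sketch of that sort-then-delta transform, reproducing the example above (plain C++, nothing library-specific):

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> values{2, 12, 3, 9, 4, 4, 0, 11};
        std::sort(values.begin(), values.end());   // [0 2 3 4 4 9 11 12]

        std::vector<int> deltas;
        int prev = 0;
        for (int v : values) {
            deltas.push_back(v - prev);            // ordered differences
            prev = v;
        }

        // deltas == [0 2 1 1 0 5 2 1]: small numbers that compress well.
        for (int d : deltas)
            std::cout << d << ' ';
        std::cout << '\n';
    }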