I'm doing a lab for my introduction to C++ and we've started on a username and password database where my professor wants us to implement as hashmap with a dynamically allocated array of LinkedLists. I just want some confirmation on what I'm doing so that I know I'm doing it correctly...
1) Buckets are where the information will be stored. I presume each bucket is a singly linked list.
2) A hash function % number of buckets will determine which index I use in my array to store the user and password information.
3) Key-Value ... I'm a little confused by this. Is the key my username, and the value my password?
4) Load factor is the number of keys stored divided by the number of buckets. So in my case, if I had 50 users stored in a hashmap with 100 buckets, would it be 50/100? I'm having a hard time wrapping my head around this concept. Does this mean that sometimes not every bucket will be used?
1) Correct. Ideally each "bucket" would only hold one value. If there are collisions in the hash algorithm, then multiple values are stored in the same bucket, hence the use of a linked list.
2) Correct. The hash algorithm is what allows you to know where to store/retrieve data in the hashmap.
3) Correct.
4) Correct. You do not want the load factor of the hashmap to be too high, otherwise the running time for inserting/retrieving begins to approach O(N). The useful aspect of hashing is that it (ideally) allows for insertion and retrieval in O(1) time when the load factor is low.
Typically once the load factor reaches a certain level, the size of the hashmap is increased and rehashed in order to lower the load factor. A hashmap uses more space than a typical array would, but this is generally offset by the speed of inserting/retrieving data from it.
1) Yes. Each bucket would hold a linked list. Singly linked is common.
2) Yep, sounds typical.
3) Yes.
4) Yep. If you have 100 buckets and 50 entries, then you have an average linked list length of 0.5. By necessity that would mean at least half of the buckets have no entries.
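For concreteness, here is a minimal sketch of the structure both answers describe, assuming std::string usernames and passwords, std::hash<std::string> as the hash function, and a bucket count fixed at construction. The names (Node, HashMap) are illustrative, not a required interface:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Node: one key/value entry in a bucket's singly linked list.
struct Node {
    std::string username;   // key
    std::string password;   // value
    Node*       next;       // next entry that hashed to the same bucket
};

class HashMap {
public:
    explicit HashMap(std::size_t bucketCount)
        : bucketCount_(bucketCount), buckets_(new Node*[bucketCount]()) {}
    // Copying omitted for brevity; a real class would delete or implement it.

    ~HashMap() {
        for (std::size_t i = 0; i < bucketCount_; ++i) {
            for (Node* n = buckets_[i]; n != nullptr; ) {
                Node* next = n->next;
                delete n;
                n = next;
            }
        }
        delete[] buckets_;
    }

    void insert(const std::string& user, const std::string& pass) {
        std::size_t index = std::hash<std::string>{}(user) % bucketCount_;
        buckets_[index] = new Node{user, pass, buckets_[index]};  // push onto the bucket's list
        ++size_;
    }

    const std::string* find(const std::string& user) const {
        std::size_t index = std::hash<std::string>{}(user) % bucketCount_;
        for (Node* n = buckets_[index]; n != nullptr; n = n->next)
            if (n->username == user) return &n->password;
        return nullptr;   // username not stored
    }

    double loadFactor() const {
        // e.g. 50 users in 100 buckets -> 0.5
        return static_cast<double>(size_) / static_cast<double>(bucketCount_);
    }

private:
    std::size_t bucketCount_;
    Node**      buckets_;   // dynamically allocated array of list heads, all nullptr initially
    std::size_t size_ = 0;  // number of stored key/value pairs
};
```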
Related
If there are around 10^9 input data entries, do we keep the size of the hash table the same as the input size or reduce it? How do we decide on the table size?
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Kindly explain how these two things work. It's getting quite confusing. Thanks!
I tried to make the table size around 75% of the input data size, which we can call X. Then I did key % X to get the hash code. But I am not sure if this is correct.
If there are around 10^9 input data entries, do we keep the size of the hash table the same as the input size or reduce it? How do we decide on the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate-chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements, have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to more collisions, longer lists and longer search times. But at load factors under 5 or 10, with an ok hash function, the slowdown will be roughly linear on average (so 5 or 10 times slower than at load factor 1).
One important decision you should make is whether to pick a number around this magnitude that is a power of two, or one that is a prime number. Explaining the implications is tedious; in any case, which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not, a prime number is the safer bet).
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/unordered_multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which is usually expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
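A minimal sketch of that separation, with made-up function names (fold_mod, fold_mask) and example bucket counts:

```cpp
#include <cstddef>

// Folding an already-computed hash value into a bucket index.
std::size_t fold_mod(std::size_t hash, std::size_t bucket_count) {
    return hash % bucket_count;               // works for any bucket count
}

std::size_t fold_mask(std::size_t hash, std::size_t bucket_count_pow2) {
    // Only valid when bucket_count_pow2 is a power of two:
    // e.g. for 256 buckets, hash & 255 equals hash % 256.
    return hash & (bucket_count_pow2 - 1);
}
```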
I tried to make the table size around 75% of the input data size, which we can call X.
So that's a load factor around 1.33, which is ok.
Then I did key % X to get the hash code. But I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and it is the implementation used for integers by all major C++ compilers' Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two, they'd tend to map only to every 16th bucket), then it'd be better to use a stronger hash function. There are other questions about that - e.g. "What integer hash function are good that accepts an integer hash key?"
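If you do find you need something stronger than the identity function, one widely used option is a mixing function such as the splitmix64 finalizer; this is only an example, not something the C++ Standard prescribes:

```cpp
#include <cstdint>

// The mixing steps of splitmix64 - one widely used way to scramble an
// integer key before folding it into the bucket count.
std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Usage: bucket = mix64(key) % bucket_count;
```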
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Support for that can easily be implemented by first writing a hash table class that doesn't support rehashing, simply taking the number of buckets to use as a constructor argument. Then write an outer, rehashing-capable hash table class with a data member of that type; when an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), you can construct an additional inner hash table object, telling its constructor a new, larger bucket count to use, iterate over the smaller hash table inserting (or, better, moving - see below) the elements into the new hash table, then swap the two hash tables so the data member ends up with the new, larger content and the smaller one is destructed.
By "moving" above I mean simply relinking the linked-list nodes from the smaller hash table into the lists of the larger one, instead of deep-copying the elements, which will be dramatically faster and use less memory momentarily while rehashing.
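Here is a rough sketch of that relinking step, assuming each node caches its key's hash; Node, Table and rehash are illustrative names and the surrounding bookkeeping (size, load factor checks) is omitted:

```cpp
#include <cstddef>
#include <vector>

// One node of a bucket's singly linked list, caching the key's hash so it
// doesn't need to be recomputed when rehashing.
struct Node {
    std::size_t hash;
    int         key;
    Node*       next;
};

struct Table {
    std::vector<Node*> buckets;   // heads of the per-bucket lists
};

void rehash(Table& table, std::size_t new_bucket_count) {
    std::vector<Node*> bigger(new_bucket_count, nullptr);
    for (Node* head : table.buckets) {
        while (head != nullptr) {
            Node* next = head->next;                        // remember the rest of the old list
            std::size_t index = head->hash % new_bucket_count;
            head->next = bigger[index];                     // relink the node - no copy, no allocation
            bigger[index] = head;
            head = next;
        }
    }
    table.buckets.swap(bigger);   // the table now owns the larger bucket array
}
```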
std::unordered_set should provide O(1) lookup time given a good hash function.
Each bucket contains items with the same hash value.
Let's assume that our hash function is ideal and gives no collisions at all.
Hash values can vary from 0 to max(std::size_t).
How can unordered_set be organized so that it provides O(1) lookup without allocating a contiguous memory range for the buckets?
We cannot allocate a contiguous memory range, because if we did, we would use a lot of memory for only a few hash values, 0 and 1000000 for example. The values in between are not used at all, but we would have allocated memory for them.
Each bucket contains items with the same hash value.
Wrong. Each bucket "contains" items with the same value of:
hash(key) % current number of buckets.
Let's assume that our hash function is ideal and gives no collisions at all.
Not having collisions in pre-mod space isn't necessarily ideal: what matters is collisions after modding (or masking if there's a power-of-2 bucket count, e.g. Visual C++) into the number of buckets.
And even then, having no collisions is not (normally) the design goal for a hash function used with a hash table. That's known as perfect hashing, and it's usually only practical when the keys are all known up front and pre-inserted, and fast lookup is wanted for the rest of the application's lifetime. In more common scenarios, the aim is for it to be roughly as likely, when inserting a new value, that you'll collide with an already-inserted value as it would be if you picked a bucket at random. That's because a high-quality hash function effectively produces a random-but-repeatable value. Hash functions work well because it's acceptable to have a certain level of collisions, which is statistically related to the current load factor (and not the size() of the container).
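If you want to see that statistical relationship for yourself, a rough simulation like the following (with an arbitrary bucket count and seed) shows collisions climbing as the load factor does:

```cpp
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t B = 100000;          // buckets
    std::vector<bool> occupied(B, false);
    std::mt19937_64 rng(42);
    std::size_t collisions = 0;

    for (std::size_t n = 1; n <= B; ++n) { // load factor climbs from 0 towards 1
        std::size_t bucket = rng() % B;    // stand-in for hash(key) % bucket_count
        if (occupied[bucket]) ++collisions;
        occupied[bucket] = true;
        if (n % 25000 == 0)
            std::cout << "load factor " << static_cast<double>(n) / B
                      << ": " << collisions << " collisions so far\n";
    }
}
```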
Hash values can vary from 0 to max(std::size_t).
How can unordered_set be organized so that it provides O(1) lookup without allocating a contiguous memory range for the buckets?
You typically have somewhere between 1 and ~2 buckets per value/element at the table's largest size(). You can customise this somewhat by calling max_load_factor(float) to determine when the table resizes, but you can't customise by how much it resizes - that's left to the implementation. GCC will usually pick a prime a bit larger than twice the current size. Visual C++ will usually double the buckets.
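You can observe this on your own implementation with something like the sketch below; the exact sequence of bucket counts printed is implementation-specific, as described above:

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    s.max_load_factor(1.0f);                 // 1.0 is already the default
    std::size_t last = s.bucket_count();
    for (int i = 0; i < 100000; ++i) {
        s.insert(i);
        if (s.bucket_count() != last) {      // a rehash happened
            last = s.bucket_count();
            std::cout << "size " << s.size()
                      << " -> bucket_count " << last << '\n';
        }
    }
}
```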
We cannot allocate a contiguous memory range, because if we did, we would use a lot of memory for only a few hash values, 0 and 1000000 for example. The values in between are not used at all, but we would have allocated memory for them.
This ignores the modding of the hash value into the bucket count. (It also ignores the viability of sparse arrays, which can be practical because virtual address space can be massively larger than the physical RAM backing it, but that's not the point of hash tables.)
If we know that we're going to hash between m and n items, where m and n are relatively large, what's a reasonable strategy for setting the number of initial buckets for std::unordered_set? If it helps, in my case m=n/2. In general, I'd like to optimize for speed, but can't afford an unreasonable amount of memory. Thanks in advance.
tl;dr There's no simple answer. Either measure, or let the container manage the bucket size automatically.
As I tried to say in the comments, there are too many variables, and you don't seem to realise how vague you're being. It took an hour for you to even say which implementation you're interested in.
m and n are "relatively large" ... relative to what?
"These are the only two operations and I want them to be fast." Define fast? What's fast enough? What's too slow? Have you measured?
If you want to minimize the load factor, so that there is on average no more than one element per bucket (and so no iteration through a bucket is needed once the right bucket is known), then you'll need at least n buckets. But that doesn't guarantee one bucket per element, because the function used to determine the bucket from a hash code might return the same value for every pointer you put in the container. Knowing whether that's likely depends on the hash function being used, the function that maps hash codes to buckets, and the pointer values themselves.
For GCC the hash function for pointers is the identity function. For the default unordered_map implementation the mapping to buckets is hash_function(x) % bucket_count() and the bucket count is always a prime number, to reduce the likelihood of collisions. If the addresses you're storing in the hash map tend to be separated by multiples of the bucket count then they're going to end up in the same bucket. Knowing how likely that is depends on the number of buckets used for n (which you haven't stated) and the distribution of pointer values you're using (which you haven't stated).
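Rather than guessing, you can measure the distribution you actually get using the standard bucket interface; the element count here is an arbitrary example:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> storage(10000);          // the pointees; their addresses are the keys
    std::unordered_set<const int*> set;
    for (const int& x : storage) set.insert(&x);

    std::size_t largest = 0;
    for (std::size_t b = 0; b < set.bucket_count(); ++b)
        largest = std::max(largest, set.bucket_size(b));

    std::cout << "buckets: " << set.bucket_count()
              << ", elements: " << set.size()
              << ", largest bucket: " << largest << '\n';
}
```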
If you use a custom hash function that has knowledge of the pointer values you expect to store then you could use a perfect hash function that uniformly distributes between [0, n) and then set the bucket_count() to n and ensure no collisions.
But it's not obvious that ensuring only a single element per bucket is worth it, because it uses more memory. Iterating through a bucket containing two or three elements is not going to be a bottleneck in most programs. Maybe it will be in yours; it's impossible to know, because you haven't said what you want except that it has to be fast, which is so vague it's meaningless.
The only way to answer these questions is for you to measure the real-world performance; nobody can give you a magic number that will make your code faster based on your vague requirements. If there were an easy answer that always made things faster for a "relatively large" number of elements, then the standard library implementation would already be doing that, and you'd just be wasting your time doing the same thing by hand.
As an alternative, if you can live with logarithmic performance (usually not a problem), use a std::map instead. Then you have guaranteed lookup complexity 100% of the time, no re-hashing. A very useful property e.g. in hard real-time systems.
This is an interview question.
Suppose that there are 1 million elements in the table and 997 buckets of unordered lists. Further suppose that the hash function distributes keys with equal probability (i.e., each bucket has 1000 elements).
What is the worst case time to find an element which is not in the table? To find one which is in the table? How can you improve this?
My solution:
The worst-case times for finding an element that is not in the table and for finding one that is in the table are both O(1000), where 1000 is the length of the unsorted list.
Improve it:
(0) Straightforward: increase the number of buckets to more than 1 million.
(1) Each bucket holds a second hash table, which uses a different hash function to compute the hash value for the second table. It will be O(1).
(2) Each bucket holds a binary search tree. It will be O(lg n).
Is it possible to make a trade-off between space and time, keeping both of them in a reasonable range?
Any better ideas? Thanks!
The simplest and most obvious improvement would be to increase the number of buckets in the hash table to something like 1.2 million -- at least assuming your hash function can generate numbers in that range (which it typically will).
Obviously, increasing the number of buckets improves performance. Assuming this is not an option (for whatever reason), I suggest the following:
Normally a hash table consists of buckets, each of which holds a linked list (i.e. a pointer to its head). You may, however, create a hash table whose buckets hold a binary search tree (a pointer to its root) rather than a list.
So you'll have a hybrid of a hash table and a binary tree. I once implemented such a thing: I didn't have a limit on the number of buckets in the hash table, but I didn't know the number of elements in advance and had no information about the quality of the hash function. Hence, I created a hash table with a reasonable number of buckets, and the rest of the ambiguity was handled by the binary tree.
If N is the number of elements and M is the number of buckets, then the complexity grows as O(log(N/M)), assuming an equal distribution.
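As a sketch of that hybrid, here std::map (a balanced binary search tree) stands in for a hand-written tree; TreeBucketTable and its interface are illustrative only:

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

template <typename Key, typename Value>
class TreeBucketTable {
public:
    explicit TreeBucketTable(std::size_t bucket_count) : buckets_(bucket_count) {}

    void insert(const Key& key, const Value& value) {
        buckets_[index(key)][key] = value;      // O(log(N/M)) within the bucket's tree
    }

    const Value* find(const Key& key) const {
        const auto& bucket = buckets_[index(key)];
        auto it = bucket.find(key);
        return it == bucket.end() ? nullptr : &it->second;
    }

private:
    std::size_t index(const Key& key) const {
        return std::hash<Key>{}(key) % buckets_.size();
    }
    std::vector<std::map<Key, Value>> buckets_; // one balanced tree per bucket
};
```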
If you can't use another data structure or a larger table there are still options:
If the active set of elements is closer to 1000 than 1M you can improve the average successful lookup time by moving each element you find to the front of its list. That will allow it to be found quickly when it is reused.
Similarly if there is a set of misses that happens frequently you can cache the negative result (this can just be a special kind of entry in the hash table).
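A sketch of the move-to-front idea for a single bucket's list; Node and find_and_move_to_front are illustrative names:

```cpp
// One node of a bucket's singly linked list.
struct Node {
    int   key;
    Node* next;
};

// Search one bucket's list; on a hit, unlink the node and splice it to the
// front so a repeated lookup for the same key finds it immediately.
Node* find_and_move_to_front(Node*& head, int key) {
    Node* prev = nullptr;
    for (Node* cur = head; cur != nullptr; prev = cur, cur = cur->next) {
        if (cur->key == key) {
            if (prev != nullptr) {        // not already at the front
                prev->next = cur->next;
                cur->next  = head;
                head       = cur;
            }
            return cur;
        }
    }
    return nullptr;   // miss - could also be cached as a special negative entry
}
```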
Suppose that there are 1 million elements in the table and 997 buckets of unordered lists. Further suppose that the hash function distributes keys with equal probability (i.e., each bucket has 1000 elements).
That doesn't quite add up, but let's run with it....
What is the worst case time to find an element which is not in the table? To find one which is in the table? How can you improve this?
The worst (and best, i.e. only) case for missing elements is that you hash to a bucket, then go through inspecting all the elements in that specific list (i.e. 1000 of them), then fail. If they want big-O notation: by definition that describes how performance varies with the number of elements N, so we have to make an assumption about how the number of buckets varies with N too. My guess is that the 997 buckets is a fixed amount and is not going to be increased if the number of elements increases. The number of comparisons is therefore N/997, which - being a linear factor - is still O(N).
My solution: The worst-case times for finding an element that is not in the table and for finding one that is in the table are both O(1000), where 1000 is the length of the unsorted list.
Nope - you're thinking of the number of comparisons - but big-O notation is about scalability.
Improve it: (0) Straightforward: increase the number of buckets to more than 1 million. (1) Each bucket holds a second hash table, which uses a different hash function to compute the hash value for the second table. It will be O(1). (2) Each bucket holds a binary search tree. It will be O(lg n).
Is it possible to make a trade-off between space and time, keeping both of them in a reasonable range?
Well yes - average collisions relate to the number of entries and buckets. If you want very few collisions, you'd need well over 1 million buckets in the table, but that gets wasteful of memory, though for large objects you can store an index or pointer to the actual object. An alternative is to look at faster collision-handling mechanisms, such as trying a series of offsets from the hashed-to bucket (using % to map the displacements back into the table size), rather than resorting to heap-allocated linked lists. Rehashing is another alternative: it gives lower re-collision rates but typically needs more CPU, and having an arbitrarily long list of good hashing algorithms is problematic.
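A sketch of that offset-probing idea (open addressing); Slot, Table and probe_find are illustrative, and insertion/deletion handling is omitted for brevity:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct Slot {
    bool   occupied = false;
    int    key      = 0;
    double value    = 0.0;
};

struct Table {
    std::vector<Slot> slots;   // sized so the load factor stays well below 1
};

// Look up a key by probing successive offsets from the hashed-to slot,
// wrapping around with %. Stops at the first empty slot (a miss).
std::optional<double> probe_find(const Table& t, std::size_t hash, int key) {
    const std::size_t n = t.slots.size();
    for (std::size_t offset = 0; offset < n; ++offset) {
        const Slot& s = t.slots[(hash + offset) % n];
        if (!s.occupied)  return std::nullopt;   // empty slot: the key isn't here
        if (s.key == key) return s.value;
    }
    return std::nullopt;                         // table full and key not found
}
```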
Hash tables within hash tables are totally pointless and remarkably wasteful of memory. It is much better to use a fraction of that space to reduce collisions in the outer hash table.
I'm using leveldb to store records (key-value), where the key is a 64-bit hash and the value is a double. To make an analogy: think of the 64-bit hash as a unique ID for a customer and the double as their account balance (i.e. how much money they have in their account). I want to sort the database by account balance and list the customers with the highest account balance first. However, the database cannot fit into memory, so I have to use some other method to sort it by account balance.
I'm considering using STXXL, but it requires that I make a copy of the database into a single flat file, then I can use STXXL to do an external sort (which would make a bunch of smaller files, sort them and then merge them back into another single flat file). Is there a better approach to sorting the data or should I go with the STXXL sort?
How many entries do you have? Could an unsigned 32-bit integer be used as an index (it would allow 4,294,967,296 indexes) to identify how to sort the original data?
I.e. create pairs of 32-bit indexes and account balances, sort on the balances, then use the 32-bit indexes to work out what order the original data should be in.
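Something like the sketch below, assuming the smaller (balance, index) pairs fit in memory even though the full records do not; the sample values are made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    // (balance, index into the original data) - sample values only.
    std::vector<std::pair<double, std::uint32_t>> byBalance = {
        {10.5, 0}, {99.0, 1}, {3.25, 2},
    };

    // Highest balance first.
    std::sort(byBalance.begin(), byBalance.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });

    for (const auto& entry : byBalance)
        std::cout << "record " << entry.second
                  << " has balance " << entry.first << '\n';
}
```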