How to sort LevelDB by value - c++

I'm using LevelDB to store records (key-value), where the key is a 64-bit hash and the value is a double. To make an analogy: think of the 64-bit hash as a unique customer ID and the double as an account balance (i.e. how much money they have in their account). I want to sort the database by account balance and list the customers with the highest balance first. However, the database cannot fit into memory, so I have to use some other method to sort it by account balance.
I'm considering using STXXL, but it requires that I copy the database into a single flat file, after which I can use STXXL to do an external sort (which would split it into a bunch of smaller files, sort them, and merge them back into another single flat file). Is there a better approach to sorting the data, or should I go with the STXXL sort?

How many entries do you have? Could an unsigned 32-bit integer be used as an index (which would allow 4,294,967,296 of them) to identify how to reorder the original data?
i.e. create pairs of 32-bit indexes and account balances, sort on the balances, then use the 32-bit indexes to work out what order the original data should be in?
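A rough sketch of that idea, assuming fewer than 4 billion records, that the roughly 12 bytes per (balance, index) pair fit in RAM even though the full database doesn't, and that each value is stored as the raw 8 bytes of a double (the database path here is a placeholder):

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <utility>
#include <vector>

#include "leveldb/db.h"

int main() {
    leveldb::DB* db = nullptr;
    leveldb::Options options;
    leveldb::Status s = leveldb::DB::Open(options, "/path/to/db", &db);  // hypothetical path
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    // One (balance, sequence index) pair per record; the index records the
    // position of the record in scan order so it can be revisited later.
    std::vector<std::pair<double, std::uint32_t>> pairs;

    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    std::uint32_t index = 0;
    for (it->SeekToFirst(); it->Valid(); it->Next(), ++index) {
        double balance;
        std::memcpy(&balance, it->value().data(), sizeof balance);  // raw 8-byte double assumed
        pairs.emplace_back(balance, index);
    }
    delete it;

    // Highest balance first.
    std::sort(pairs.begin(), pairs.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });

    // pairs[i].second now gives the scan position of the i-th richest customer.
    delete db;
}

The sorted indexes then tell you the order in which to revisit the records in a second scan; if even the pairs don't fit in memory, you're back to an external sort (e.g. STXXL) over the pair file.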

Related

HashTable: Determining table size and which hash function to use

If the input data has around 10^9 entries, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Kindly explain how these two things work. It's getting quite confusing. Thanks!!
I tried to make the table size around 75% of the input data size; call that X. Then I did key % X to get the hash code, but I am not sure if this is correct.
If the input data has around 10^9 entries, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0, so for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to more collisions, longer lists and longer search times, but at load factors under 5 or 10 with an OK hash function the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
One important decision you should make is whether to pick a number around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious, and in any case, which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not, a prime number is the safer bet).
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which usually is expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
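A small sketch of keeping those two steps separate, assuming 64-bit unsigned keys and using std::hash as a stand-in for whatever hash function you pick:

#include <cstddef>
#include <cstdint>
#include <functional>

// Fold a hash value over the bucket count: % works for any count,
// the bitmask form only works when the bucket count is a power of two.
std::size_t bucket_for(std::uint64_t key, std::size_t bucket_count) {
    std::size_t h = std::hash<std::uint64_t>{}(key);  // hashing step
    return h % bucket_count;                          // folding step (general case)
}

std::size_t bucket_for_pow2(std::uint64_t key, std::size_t bucket_count_pow2) {
    std::size_t h = std::hash<std::uint64_t>{}(key);
    return h & (bucket_count_pow2 - 1);               // folding step (power-of-two count)
}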
I tried to make the table size around 75% of the input data size; call that X.
So that's a load factor around 1.33, which is ok.
Then I did key % X to get the hash code, but I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of it as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and it's what all the major C++ Standard Library implementations use for integers, though no particular hash functions are specified in the C++ Standard. It tends to work OK, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two, they'd only map to every 16th bucket), then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
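If you do need a stronger hash for patterned integer keys, one widely used mixer is the splitmix64 finalizer, shown here just as an illustration rather than the only good choice:

#include <cstdint>

// splitmix64 finalizer: a reversible mix that spreads patterned integer keys
// (e.g. all multiples of 16) across the full 64-bit range before folding.
std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}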
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Implementing support for that can easily be done by first writing a hash table class that doesn't support rehashing - simply taking the number of buckets to use as a constructor argument. Then write an outer rehashing-capable hash table class with a data member of the above type. When an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), construct an additional inner hash table object, telling the constructor a new larger bucket count to use, iterate over the smaller hash table inserting (or - better - moving, see below) the elements into the new hash table, then swap the two hash tables so the data member ends up with the new larger content and the smaller one is destructed. By "moving" above, I mean simply relinking the linked list nodes from the smaller hash table into the lists in the larger one instead of deep copying the elements, which will be dramatically faster and momentarily use less memory while rehashing.
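A minimal sketch of that "moving" step using std::forward_list buckets, where splice_after relinks each node into the larger table in O(1) instead of copying it (the Entry type here is just a placeholder):

#include <cstdint>
#include <forward_list>
#include <functional>
#include <utility>
#include <vector>

using Entry = std::pair<std::uint64_t, double>;          // key, value (illustrative)
using Buckets = std::vector<std::forward_list<Entry>>;

// Rebuild into new_bucket_count buckets by relinking list nodes, not copying.
void rehash(Buckets& buckets, std::size_t new_bucket_count) {
    Buckets bigger(new_bucket_count);
    for (auto& bucket : buckets) {
        while (!bucket.empty()) {
            std::size_t i =
                std::hash<std::uint64_t>{}(bucket.front().first) % new_bucket_count;
            // Relink the front node of this bucket into its new bucket in O(1).
            bigger[i].splice_after(bigger[i].before_begin(), bucket, bucket.before_begin());
        }
    }
    buckets.swap(bigger);  // the table now uses the larger bucket array; the old one is freed
}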

Easiest primary key for main table?

My main table, Users, stores information about users. I plan to have a UserId field as the primary key of the table. I have full control of creation and assignment of these keys, and I want to ensure that I assign keys in a way that provides good performance. What should I do?
You have a few options:
The most generic solution is to use UUIDs, as specified in RFC 4122.
For example, you could have a STRING(36) that stores UUIDs. Or you could store the UUID as a pair of INT64s or as BYTES(16). There are some pitfalls to using UUIDs, so read the details of this answer.
If you want to save a bit of space and are absolutely sure that you will have fewer than a few billion users, then you could use an INT64 and assign UserIds using a random number generator. The reason you want to be sure you have fewer than a few billion users is the Birthday Problem: the odds of getting at least one collision are about 50% once you have ~4 billion users, and they increase very quickly from there. If you assign a UserId that has already been assigned to a previous user, your insertion transaction will fail, so you'll need to be prepared for that (by retrying the transaction after generating a new random number).
If there's some column, MyColumn, in the Users table that you would like to have as the primary key (possibly because you know you'll want to look up entries using this column frequently), but you're not sure about the tendency of this column to cause hotspots (say, because it's generated sequentially or based on timestamps), then you have two other options:
3a) You could "encrypt" MyColumn and use that as your primary key. In mathematical terms, you apply an automorphism (a reversible one-to-one mapping) to the key values, which has the effect of chaotically scrambling them while still never assigning the same value to two different keys. In this case, you wouldn't need to store MyColumn separately at all; you would only store/use the encrypted version and could decrypt it when necessary in your application code. Note that this encryption doesn't need to be secure; it just needs to guarantee that the bits of the original value are sufficiently scrambled in a reversible way. For example: if your values of MyColumn are integers assigned sequentially, you could just reverse the bits of MyColumn to create a sufficiently scrambled primary key. If you have a more interesting use case, you could use an encryption algorithm like XTEA. (See the sketch after option 3b.)
3b) Have a compound primary key where the first part is a ShardId, computed as hash(MyColumn) % numShards, and the second part is MyColumn. The hash function ensures that you don't create a hotspot by allocating all your rows to a single split. More information on this approach can be found here. Note that you do not need to use a cryptographic hash, although md5 or sha512 are fine functions; SpookyHash is a good option too. Picking the right number of shards is an interesting question and can depend on the number of nodes in your instance; it's effectively a trade-off between hotspot-avoiding power (more shards) and read/scan efficiency (fewer shards). If you only have 3 nodes, then 8 shards is probably fine. If you have 100 nodes, then 32 shards is a reasonable value to try.
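Hedged sketches of both ideas, with std::hash standing in for whatever hash function (md5, SpookyHash, ...) you actually choose:

#include <cstdint>
#include <functional>
#include <utility>

// 3a) Reverse the bits of a sequentially assigned 64-bit id.
// The mapping is its own inverse, so reversing again recovers the original id.
std::uint64_t reverse_bits(std::uint64_t x) {
    std::uint64_t r = 0;
    for (int i = 0; i < 64; ++i) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

// 3b) Compound key (ShardId, MyColumn); std::hash is only a placeholder hash here.
std::pair<std::uint64_t, std::uint64_t> shard_key(std::uint64_t my_column,
                                                  std::uint64_t num_shards) {
    std::uint64_t shard_id = std::hash<std::uint64_t>{}(my_column) % num_shards;
    return {shard_id, my_column};
}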

Some confirmation on hashmaps and terminology C++

I'm doing a lab for my introduction to C++ course, and we've started on a username and password database, which my professor wants us to implement as a hashmap with a dynamically allocated array of linked lists. I just want some confirmation on what I'm doing so that I know I'm doing it correctly...
1) Buckets are where the information is stored. I presume each bucket is a singly linked list.
2) A hash function % number of buckets will determine which index I use in my array to store the user and password information.
3) Key-Value ... I'm a little confused by this. Is the key my username, and the value my password?
4) Load Factor is the number of keys stored divided by the number of buckets. So in my case, if I had 100 buckets and stored 50 users in my hashmap, would it be 50/100? My head has a hard time wrapping around this concept. Does this mean not every bucket will be used sometimes?
1) Correct. Ideally each "bucket" would only hold one value. If there are collisions in the hash function, then multiple values are stored in the same bucket, hence the use of a linked list.
2) Correct. The hash algorithm is what allows you to know where to store/retrieve data in the hashmap.
3) Correct.
4) Correct. You do not want the load factor of the hashmap to be too high, otherwise the running time for inserting/retrieving begins to approach O(N). The useful aspect of hashing is that it (ideally) allows for insertion and retrieval in O(1) time when the load factor is low.
Typically once the load factor reaches a certain level, the size of the hashmap is increased and rehashed in order to lower the load factor. A hashmap uses more space than a typical array would, but this is generally offset by the speed of inserting/retrieving data from it.
1) Yes. Each bucket would hold a linked list. Singly linked is common.
2) Yep, sounds typical.
3) Yes.
4) Yep. If you have 100 buckets and 50 entries, then you have an average linked list length of 0.5. By necessity that means at least half the buckets will have no entries.
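For reference, a minimal sketch of the structure the lab describes - a dynamically allocated array of singly linked lists, keyed by username - with all names chosen here purely for illustration:

#include <functional>
#include <string>

class UserTable {
    struct Node {
        std::string username;   // key
        std::string password;   // value
        Node* next;             // next node in this bucket's singly linked list
    };

    std::size_t bucketCount;
    Node** buckets;             // dynamically allocated array of list heads

    std::size_t bucketFor(const std::string& username) const {
        return std::hash<std::string>{}(username) % bucketCount;  // hash % number of buckets
    }

public:
    explicit UserTable(std::size_t count)
        : bucketCount(count), buckets(new Node*[count]()) {}  // () zero-initializes the heads

    ~UserTable() {
        for (std::size_t i = 0; i < bucketCount; ++i) {
            for (Node* n = buckets[i]; n != nullptr; ) {
                Node* next = n->next;
                delete n;
                n = next;
            }
        }
        delete[] buckets;
    }

    void insert(const std::string& username, const std::string& password) {
        std::size_t i = bucketFor(username);
        buckets[i] = new Node{username, password, buckets[i]};  // push onto the bucket's list
    }

    // Returns a pointer to the password for username, or nullptr if not found.
    const std::string* find(const std::string& username) const {
        for (Node* n = buckets[bucketFor(username)]; n != nullptr; n = n->next)
            if (n->username == username) return &n->password;
        return nullptr;
    }
};

A real submission would also need a remove operation and copy control (or deleted copy operations); this only shows the bucket indexing and chaining.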

Vector or Map or Hash map for C++?

I have a large number of records, say around 4,000,000, that I want to access repeatedly and link to information held in a class associated with each record. I am not sure what kind of data structure I should use: vectors, maps, or hash maps? I don't need to insert records, but I need to read a table containing sets of these record numbers (or names), then grab some of the data linked to each record and do some processing on it. Is lookup in a map fast enough that I don't need to go for hash maps in this case? The records have a class as their structure, and I have not done anything before with a map or hashmap that has a class as its value (if that is possible).
Thanks in advance guys.
Edited:
I do not need to have all the records in memory at the same time for now. I need to give the data a structure first and then grab information from some of the records. The total number of raw records is around 20 million, and I want to read each of them; if a record's basic info doesn't already exist in the new map or vector I'm building, I add it and store the rest of the data there as a vector. Because I have 20 million records, it would be excruciating to scan through 4 million entries for every record just to check whether its basic info already exists. I have around 4 million types of packages, and each package can have more than one type of service (roughly 5 per package, i.e. 20/4). I want to read each record, and if its package ID does not already exist in the vector (or whatever container I use), push the basic info into it, and then save the services related to that package in a vector inside the package class.
These three data structures have each a different purpose.
A vector is basically a dynamic array, which is good for indexed values.
A map is a sorted data-structure with O(log(n)) retrieval and insertion time (implemented using a balanced binary tree, usually Red-Black). This is best if you can't find an efficient hash method.
A hash_map uses hashes to retrieve objects. If you have a well-defined hash function with a low collision rate, you will get constant retrieval and insertion time on average. hash_maps are usually faster than map, but not always; it is highly dependent on the hash function.
For your example, I think that it is best to use a hash_map where the key would be the record number (assuming record numbers are unique).
If these record numbers are dense (meaning there are few or no gaps between the indexes, say: 1,2,4,5,8,9,10...), you can use a vector. If your records come from a database with an auto-increment primary key and not many deletions, this will usually be the case.
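A sketch of the hash-map approach applied to the packages/services description above, with illustrative field names (the real class can of course hold whatever data you need):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative record structure: basic package info plus its services.
struct Package {
    std::string basicInfo;
    std::vector<std::string> services;
};

int main() {
    std::unordered_map<std::uint64_t, Package> packages;  // key: package ID
    packages.reserve(4000000);  // avoid rehashing while loading ~4M packages

    // For each raw record (packageId, info, service) read from the table:
    std::uint64_t packageId = 42;        // placeholder values
    std::string info = "basic info";
    std::string service = "service A";

    // operator[] inserts an empty Package the first time an ID is seen,
    // so there is no need for a separate existence scan.
    Package& p = packages[packageId];
    if (p.basicInfo.empty()) p.basicInfo = info;
    p.services.push_back(service);
}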

STL Map versus Static Array

I have to store information about contents in a lookup table such that it can be accessed very quickly. I might need to refer to some of the elements in the lookup table recursively to get complete information about the contents. Which will be the better data structure to use:
A map, with a parameter that is unique across all the entries in the lookup table as the key, and the rest of the information as the value; or
A static array with an entry for each unique key, accessed when needed according to the key (the same one used in the map).
I want my software to be robust, as any crash would be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly ( O(1) ). As a dumb example, for a substitution cipher, you could have a char cipher[256] and simply index with the ASCII code of a character to get the substitution character. If the keys are complex objects or simply too many, you're probably stuck with a map.
You might also consider a hashtable (see unordered_map).
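For illustration, here is that directly indexed kind of lookup table with a ROT13-style substitution; each lookup is a single O(1) array index:

#include <iostream>
#include <string>

int main() {
    // Build a 256-entry substitution table indexed by character code:
    // identity for everything except letters, which are rotated by 13 (ROT13).
    unsigned char cipher[256];
    for (int c = 0; c < 256; ++c) cipher[c] = static_cast<unsigned char>(c);
    for (int c = 'a'; c <= 'z'; ++c) cipher[c] = static_cast<unsigned char>('a' + (c - 'a' + 13) % 26);
    for (int c = 'A'; c <= 'Z'; ++c) cipher[c] = static_cast<unsigned char>('A' + (c - 'A' + 13) % 26);

    // Lookup is a single array index per character.
    for (unsigned char ch : std::string("Hello")) std::cout << cipher[ch];
    std::cout << "\n";  // prints "Uryyb"
}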
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If however your keys are themselves between say 0..10000, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of your elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, O(log n) means around 12 accesses; a hash table should be pretty much one or two accesses tops.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
The lookup time in a std::map is O(log n), while a linear search in a static array is O(n) in the worst case.
I'd strongly opt for a std::map, even if it has a larger memory footprint (which should not matter in most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;
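For illustration, such a nested map could be used like this (assuming int keys and std::string values just for the example):

#include <iostream>
#include <map>
#include <string>

typedef int MyKeyType;
typedef std::string MyValueType;
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;

int main() {
    MyDoubleMapType m;
    m[1][42] = "nested value";  // operator[] creates the inner map on first use

    MyDoubleMapType::const_iterator outer = m.find(1);
    if (outer != m.end()) {
        std::map<MyKeyType, MyValueType>::const_iterator inner = outer->second.find(42);
        if (inner != outer->second.end())
            std::cout << inner->second << "\n";  // prints "nested value"
    }
}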