Hashing algorithm for pair of integers [duplicate] - c++

The problem: storing a dynamic adjacency list of a graph in a file while retaining O(1) algorithmic complexity of the operations.
I am trying to store a dynamic bidirectional graph in a file (or files). Both nodes and edges can be added and removed, and the operations must be O(1). My current design is:
File 1 - Nodes
Stores two integers per node (inserts are appends and removals use a free list):
number of incoming edges
number of outgoing edges
File 2 - Edges
Stores 4 integers per edge (inserts are appends; removals use a free list and swap with the node's last edge to update its new index):
from node (index into File 1)
from index (e.g. the third incoming edge)
to node (index into File 1)
to index (e.g. the second outgoing edge)
File 3 - Links
Serves as an open-addressed hash table of the locations of edges in File 2. Basically, when you read a node from File 1, you know it has x incoming edges and y outgoing edges. With that you can go to File 3 to get the position of each of these edges in File 2. The key is thus:
index of node in File 1 (i.e. 0 for first node, 1 for second node)
0 <= index of edge < number of outgoing/incoming edges
Example of File 3 keys if represented as a chained hash table (which is unfortunately not suitable for files, but would not require hashing...):
Keys (index from `File 1` + 0 <= index < number of edges from `File 1`, not actually stored)
1 | 0 1 2
2 | 0 1
3 |
4 | 0
5 | 0 1 2
I am using qHash and QPair to hash these at the moment; however, the number of collisions is very high, especially when I compare it to single-int hashing, which is very efficient with qHash. Since the values stored are indices into yet another file, probing is rather expensive, so I would like to cut the number of collisions down.
Is there a specialized hashing algorithm or approach for a pair of ints that could perform better in this situation? Or, of course, a different approach that would avoid this problem, such as how to implement a chained hash table in a file (I can only think of using buffers, but I believe that would be overkill for sparse graphs like mine)?

If you read through the comments on this answer, they claim qHash of an int just returns that int unchanged (which is a fairly common way to hash integers for undemanding use in in-memory hash tables). So, using a strong general-purpose hash function will achieve a dramatic reduction in collisions, though you may lose out on some incidental caching benefits of having nearby keys more likely to hash to the same area on disk, so do measure rather than taking it for granted that fewer collisions means better performance. I also suggest trying boost::hash_combine to create an overall hash from multiple hash values (just using + or XOR is a very bad idea).
Then, if you're reading from disk, there's probably some kind of page size - e.g. 4k, 8k - which you'll have to read in to access any data anywhere on that page, so if there's a collision it'd still be better to look elsewhere on the already-loaded page rather than waiting to load another page from disk. Simple linear probing manages that a lot of the time, but you could improve on that further by wrapping back to the start of the page to ensure you've searched all of it before probing elsewhere.
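As a concrete example, here is a small sketch of hashing a (node index, edge index) pair with a combine step modeled on boost::hash_combine's classic formula; if Boost is available you can simply call boost::hash_combine itself, and the names below are illustrative:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>

// Combine step modeled on boost::hash_combine's well-known formula.
inline void hash_combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash a (node index, edge index) key such as the File 3 key described above.
std::size_t hash_pair(std::uint32_t node, std::uint32_t edge) {
    std::size_t seed = 0;
    hash_combine(seed, std::hash<std::uint32_t>{}(node));
    hash_combine(seed, std::hash<std::uint32_t>{}(edge));
    return seed;
}

int main() {
    // Unlike plain XOR or addition, swapping the two components gives different hashes.
    std::cout << hash_pair(1, 0) << "\n" << hash_pair(0, 1) << "\n";
}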

Related

Efficiently uniquize lines of huge text file in C++

Suppose I have a very huge text file, with lines of arbitrary, differing lengths. I want to remove duplicate lines; how do I do this in C++?
Equal duplicates can be very far apart from each other, and I want to keep only the first occurrence.
The file is so huge that it can be 10-50 times bigger than the size of RAM.
There is the Linux command uniq, but it removes only adjacent equal lines, while I need to remove far-apart duplicates as well.
Originally this question was asked here, but it has since been deleted, and attempts to undelete or reopen it failed. I had already implemented half of a huge answer to it when it was deleted.
I'm asking this question only to share my own answer below; still, here is a very tiny solution, which doesn't scale because it uses only an in-memory unordered set.
Simplest in-memory-only solution using std::unordered_set:
Try it online!
#include <random>
#include <iostream>
#include <unordered_set>
#include <string>
#include <fstream>

int main() {
    size_t constexpr n = 15;
    std::mt19937_64 rng{125};
    {
        // Generate a small test file with guaranteed duplicates.
        std::ofstream f("test.txt");
        std::cout << "Input:" << std::endl;
        for (size_t i = 0; i < n; ++i) {
            auto const x = rng() % (n * 3 / 4);
            f << x << std::endl;
            std::cout << x << std::endl;
        }
    }
    std::ofstream fw("test.txt.out");
    std::ifstream f("test.txt");
    std::string line;
    std::unordered_set<std::string> set;
    std::cout << std::endl << "Output:" << std::endl;
    // Copy only the first occurrence of each line to the output file.
    while (std::getline(f, line)) {
        if (set.count(line))
            continue;
        fw << line << std::endl;
        std::cout << line << std::endl;
        set.insert(line);
    }
}
Console Output:
Input:
2
10
6
10
7
6
3
2
6
2
3
7
8
1
10
Output:
2
10
6
7
3
8
1
Note: although in the above code I used only very short lines, in my real files lines can be of any length. In other words, the deduplicated set of all lines exceeds memory size many times over, which means in-memory-only solutions are not suitable.
Below I implemented from scratch 3 totally different solutions to the OP's task:
My own implementation of a disk-based hash set. It stores an 80-96-bit hash of each line and solves the task similarly to the code in the OP's question. This set is similar in the design of its algorithm to absl::flat_hash_set. This algorithm is the 2nd fastest of the three.
A sorting-based algorithm, which sorts hashes of all lines in portions equal to the free memory size, using std::sort(), then merges them all using an N-way merge sort. This algorithm is the fastest of the three.
A disk-based B-tree implemented from scratch, which supports any arbitrary branching degree. This tree keeps sorted hashes of all lines, which makes it possible to exclude duplicates. This is the slowest of the three algorithms.
I'll provide some details about all the algorithms:
The disk-based hash set uses a single huge file that stores entries consisting of a value and a partial hash. The partial hash stored in an entry consists of the high bits of the line's hash; the lower bits are stored implicitly as the index of the bucket.
This hash set is similar to absl::flat_hash_set from the Abseil library.
Similar in the sense that it stores part of the higher bits of the hash next to the value inside the bucket. When the hash set searches for an existing value it first reads a bucket from disk, where the bucket index is equal to hash % hash_table_size.
After the bucket is read from disk, it is checked whether the hash of the searched key has the same higher bits. If so, the value is checked to see whether its key equals the searched one. If not, the following few buckets are read from disk (actually they are cached from the previous read) and checked the same way. If during this scan we meet an empty bucket, we conclude that the searched key is not present.
To add a value to the hash set we search for the key as described above, then write the key/value to the first empty bucket.
Reads and writes in the hash set are done through random reads and writes on disk. It is best to use an SSD instead of an HDD, because random reads and writes are then very fast.
Of course, even on an SSD a write touches 8 KB at a time, even if you change only 16 bytes, because the SSD flash cluster size is 8 KB. Reading, though, is fast. Hence this disk hash set is not too fast.
This algorithm is the 2nd fastest among my three algorithms.
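To make the probing scheme concrete, here is a minimal in-memory sketch; a std::vector stands in for the on-disk bucket file, and the field widths (a 16-bit partial hash and a 64-bit value, with 0 reserved to mean "empty") are assumptions rather than the actual on-disk layout:

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// In-memory stand-in for one entry of the on-disk bucket file described above.
struct Entry {
    std::uint16_t partial; // high bits of the line's hash
    std::uint64_t value;   // e.g. line index; 0 is reserved to mean "empty"
};

class DiskLikeHashSet {
public:
    explicit DiskLikeHashSet(std::size_t buckets) : table_(buckets) {}

    // Returns true if the line was newly inserted, false if it looks like a duplicate.
    bool insert(const std::string& line, std::uint64_t value) {
        auto const h = std::hash<std::string>{}(line);
        auto const partial = static_cast<std::uint16_t>(h >> 48);
        for (std::size_t i = 0, pos = h % table_.size(); i < table_.size();
             ++i, pos = (pos + 1) % table_.size()) {
            if (table_[pos].value == 0) {        // empty bucket: key absent, insert here
                table_[pos] = Entry{partial, value};
                return true;
            }
            // Partial hash matches: treat as a duplicate (the real set keeps 80-96 hash
            // bits in total, so a match means a duplicate with high probability).
            if (table_[pos].partial == partial)
                return false;
        }
        return false;                            // table full
    }

private:
    std::vector<Entry> table_;
};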
The second, sorting-based algorithm does the following.
First it accumulates into memory as many hashes of lines of the text file as free memory allows. Then it sorts them in memory through std::sort using std::execution::par_unseq, which sorts the in-memory array in a multi-threaded fashion.
The sorted in-memory portion of hashes is then written to disk into a first file. Another portion of in-memory hashes is sorted and written into a second file, and so on, until all the hashes have been stored into many files.
Together with the hash, each entry of a sorted file keeps the index of the line in the source file. We sort these pairs.
Then we merge all the files using an N-way merge sort; to do this I utilize a heap, emulated in C++ through std::make_heap(), std::push_heap() and std::pop_heap().
The merged sequence of pairs is stored into one huge file.
After the sorting of pairs is done, we deduplicate the pairs by scanning them sequentially and removing adjacent pairs that have duplicate hashes but different line indices, keeping only the first line index. The deduplicated data is stored into another huge file; we store only the second elements of the pairs, i.e. the indices of the lines.
The deduplicated file is then sorted again. To remind you, it contains only indices of lines. The sorting is done as described at the start of this algorithm: we split into many sorted in-memory portions written to files, then N-way merge them into a single huge file.
Finally, we scan the last deduplicated and sorted huge file sequentially, together with the original text file. Scanning them as a pair we do a 2-way merge, meaning that we skip lines whose indices are absent.
Done. Now our original text file has only unique lines.
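For illustration, here is a small sketch of the N-way merge step using the std::make_heap()/std::push_heap()/std::pop_heap() helpers mentioned above; in-memory vectors stand in for the sorted run files, and the names are illustrative:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Cursor {
    const std::vector<std::uint64_t>* run;  // one sorted run (a "file")
    std::size_t pos;                        // next unread element in that run
};

// std heaps are max-heaps, so compare with "greater" to pop the smallest element first.
static bool cursor_greater(const Cursor& a, const Cursor& b) {
    return (*a.run)[a.pos] > (*b.run)[b.pos];
}

std::vector<std::uint64_t> n_way_merge(const std::vector<std::vector<std::uint64_t>>& runs) {
    std::vector<Cursor> heap;
    for (const auto& r : runs)
        if (!r.empty()) heap.push_back(Cursor{&r, 0});
    std::make_heap(heap.begin(), heap.end(), cursor_greater);

    std::vector<std::uint64_t> merged;
    while (!heap.empty()) {
        std::pop_heap(heap.begin(), heap.end(), cursor_greater);
        Cursor c = heap.back();
        heap.pop_back();
        merged.push_back((*c.run)[c.pos]);
        if (++c.pos < c.run->size()) {      // more data left in this run: push cursor back
            heap.push_back(c);
            std::push_heap(heap.begin(), heap.end(), cursor_greater);
        }
    }
    return merged;
}

int main() {
    auto out = n_way_merge({{1, 4, 9}, {2, 3, 10}, {5}});
    for (auto v : out) std::cout << v << ' ';   // 1 2 3 4 5 9 10
}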
Third algorithm is based on B-Tree, which allows to store any sequence of sorted elements. In our case we store sorted hashes of lines.
B-Tree is quite complex to explain, better to read Wiki Article.
In short B-Tree is constructed following way. Tree has branching degree equal to some B, lets say B = 10, it means at most 10 children has each intermediate node except leaves.
Each intermediate node has pointers to 10 children plus 10 smallest keys of a child subtree.
If we insert into tree something then from root we descend down to leaves, while on the way we check if searched key is contained in child sub-tree. It is contained in child sub-tree only if left child has smaller or equal key, while right child has bigger key.
Now we add new entry to leaf. If leaf overflows in size, i.e. contains more than 10 elements, then it is split into two nodes of equal number of elements. Then inside its parent instead of single pointer to child we try to add two pointers to children. This parent child count may overflow 10 elements, then we split it into two equal nodes too.
Same way all nodes on the way from leaf to root may be split if necessary. Until we meet a node that has less than 10 pointers, then we don't need to split it and process finishes.
Also till root we need to update new smallest sub-tree key, because it may have changed if inserted into leaf value was at the first position.
If we need to search in a tree, then we do same as described above, meaning that we search from root till leaf for given key. Value inside a leaf either contains our key, then we found something, or non-equal key, then we didn't find a key.
In my B-Tree algorithm I didn't implement deletion, only insertion or search. Deletion is more complex, but is not needed for our task, it can be implemented later in our spare time if needed.
This 3rd algorithm is slowest, because it does around 4-5 random SSD reads and writes on every added or read element.
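A rough in-memory sketch of the node layout and search descent described above, with branching degree B = 10; the field names and the in-memory representation are assumptions, since the real structure lives in a file:

#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

constexpr std::size_t B = 10;   // branching degree used in the description above

struct BTreeNode {
    bool leaf = true;
    std::vector<std::uint64_t> keys;                  // sorted; in an inner node, keys[i] is the
                                                      // smallest key of children[i]'s subtree
    std::vector<std::unique_ptr<BTreeNode>> children; // empty for leaves, at most B otherwise
};

// Descend from the root to a leaf and report whether `key` is present.
bool contains(const BTreeNode* node, std::uint64_t key) {
    while (!node->leaf) {
        // Follow the rightmost child whose smallest subtree key is <= key.
        std::size_t i = 0;
        while (i + 1 < node->keys.size() && node->keys[i + 1] <= key) ++i;
        node = node->children[i].get();
    }
    return std::binary_search(node->keys.begin(), node->keys.end(), key);
}

int main() {
    BTreeNode leaf;
    leaf.keys = {3, 7, 11};
    return contains(&leaf, 7) ? 0 : 1;
}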
Now, below I present the whole C++ code that implements all 3 algorithms. This code should be compiled with GCC; Clang can also compile it. Right now it is not suitable for MSVC, but I can probably tweak it to support MSVC too; the only obstacle is that MSVC doesn't have the __int128 type that GCC/Clang have.
This program is fully self-contained: just compile it and run. It runs speed tests of single operations, plus full tests on pre-generated data. You may change nums_cnt = 1ULL << 17; to some bigger value if you need to generate more lines of text. nums_cnt signifies how many lines there are.
Try it online! (GodBolt)
SOURCE CODE HERE. The post together with the full code is so large that it can't fit StackOverflow's 30,000-character post size limit; the code alone is 46 KB. Hence I provide the code separately via the GitHub Gist link below. You may also click on Try it online! above; this redirects you to the code on the GodBolt server, where you can try it live.
GitHub Gist full code
Console Output:
Insert Time HS 174.1 ns/num, BT 5304.9 ns/num, UM 278.5 ns/num, Boost HS 1.599x, Boost BT 0.052x
Erase Time HS 217.4 ns/num, UM 286.9 ns/num, Boost HS 1.320x
Find Time HS 113.7 ns/num, BT 1954.6 ns/num, UM 61.8 ns/num, Boost HS 0.544x, Boost BT 0.032x
Algo HashSet 61.34 sec
Algo Sort 13.8 sec
Algo BTree 256.9 sec

How to track changes to a list

I have an immutable base list of items that I want to perform a number of operations on: edit, add, delete, read. The actual operations will be queued up and performed elsewhere (sent up to the server and a new list will be sent down), but I want a representation of what the list would look like with the current set of operations applied to the base list.
My current implementation keeps a vector of ranges and where they map to. So an unedited list has one range from 0 to length that maps directly to the base list. If an add is performed at index 5, then we have 3 ranges: 0-4 maps to base list 0-4, 5 maps to the new item, and 6-(length+1) maps to 5-length. This works; however, with a lot of adds and deletes, reads degrade to O(n).
I've thought of using hashmaps, but the shifts in ranges that can occur with inserts and deletes present a challenge. Is there some way to achieve this so that reads are still around O(1)?
If you had a roughly balanced tree of ranges, where each node kept a record of the total number of elements below it in the tree, you could answer reads in worst-case time proportional to the depth of the tree, which should be about log(n). Perhaps a treap (https://en.wikipedia.org/wiki/Treap) would be one of the easier balanced trees to implement for this.
If you had a lot of repetitious reads and few modifications, you might gain by keeping a hashmap of the results of reads since the last modification, clearing it on modification.
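As an illustration, here is a minimal sketch of the read path in such a tree of ranges, assuming the tree is kept balanced elsewhere (e.g. via treap rotations); the field names are illustrative:

#include <cstddef>
#include <iostream>
#include <memory>

// Each node is one range and records the total number of list elements in its
// subtree, so a read by logical index descends in O(depth).
struct RangeNode {
    std::size_t base_start = 0;   // where this range maps to in the base list (or new items)
    std::size_t length = 0;       // number of elements in this range
    std::size_t subtree_size = 0; // length + sizes of both subtrees
    std::unique_ptr<RangeNode> left, right;
};

// Returns the mapped position of the element at logical index i (0 <= i < subtree_size).
std::size_t read(const RangeNode* node, std::size_t i) {
    std::size_t left_size = node->left ? node->left->subtree_size : 0;
    if (i < left_size)
        return read(node->left.get(), i);
    i -= left_size;
    if (i < node->length)
        return node->base_start + i;            // falls inside this node's range
    return read(node->right.get(), i - node->length);
}

int main() {
    // Two ranges: base[0..4] followed by base[10..12].
    auto root = std::make_unique<RangeNode>();
    root->base_start = 0; root->length = 5;
    root->right = std::make_unique<RangeNode>();
    root->right->base_start = 10; root->right->length = 3;
    root->right->subtree_size = 3;
    root->subtree_size = 8;
    std::cout << read(root.get(), 6) << '\n';   // prints 11
}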

Performance specification for handling duplicate keys in a binary search tree

I was going through the book Introduction to Algorithms looking for the best ways to handle duplicate keys in a binary search tree.
There are several ways mentioned for this use case:
Keep a boolean flag x.b at node x, and set x to either x.left or x.right based on the value of x.b, which alternates between FALSE and TRUE each time we visit x while inserting a node with the same key as x.
Keep a list of nodes with equal keys at x, and insert the new node into the list.
Randomly set x to either x.left or x.right.
I understand each implementation has its own performance hits/misses, and the STL may implement it differently from Boost Containers.
Is there a performance bound mentioned in the C++11 specification for the worst-case time of handling duplicate keys, say for multimap?
In terms of insertion/deletion time, 2 is always better because it doesn't increase the size of the tree and doesn't require elaborate structural changes when you insert or delete a duplicate.
Option 3 is space-optimal if there is a small number of duplicates.
Option 1 requires storing 1 extra bit of information (which, in most implementations, takes 2 bytes), but the height of the tree will be better balanced than with option 3.
TL;DR: Implementing 2 is slightly more difficult, but worthwhile if the number of duplicates is large. Otherwise use 3. I wouldn't use 1.
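To make option 2 concrete, a tiny sketch of the idea - every node keeps all values that share its key, so duplicates never deepen the tree; a std::map of vectors stands in for a hand-written BST here:

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::map<int, std::vector<std::string>> tree;   // key -> list of values with that key
    tree[42].push_back("first");
    tree[42].push_back("second");                   // duplicate key: the list grows, the tree doesn't
    tree[7].push_back("other");

    for (const auto& [key, values] : tree)
        for (const auto& v : values)
            std::cout << key << " -> " << v << '\n';
}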

How to improve performance of a hashtable with 1 million elements and 997 buckets?

This is an interview question.
Suppose that there are 1 million elements in the table and 997 buckets of unordered lists. Further suppose that the hash function distributes keys with equal probability (i.e., each bucket has 1000 elements).
What is the worst case time to find an element which is not in the table? To find one which is in the table? How can you improve this?
My solution:
The worst-case times for finding an element not in the table and for finding one in the table are both O(1000), where 1000 is the length of the unsorted list.
To improve it:
(0) Straightforward: increase the number of buckets to more than 1 million.
(1) Each bucket holds a second hash table, which uses a different hash function to compute the hash value for the second table. It will be O(1).
(2) Each bucket holds a binary search tree. It will be O(lg n).
Is it possible to make a trade-off between space and time, keeping both of them in a reasonable range?
Any better ideas? Thanks!
The simplest and most obvious improvement would be to increase the number of buckets in the hash table to something like 1.2 million -- at least assuming your hash function can generate numbers in that range (which it typically will).
Obviously, increasing the number of buckets improves the performance. Assuming this is not an option (for whatever reason), I suggest the following:
Normally the hash table consists of buckets, each of which holds a linked list (a pointer to its head). You may, however, create a hash table whose buckets hold a binary search tree (a pointer to its root) rather than a list.
So you'll have a hybrid of a hash table and a binary tree. I once implemented such a thing: I didn't have a limitation on the number of buckets in the hash table, but I didn't know the number of elements from the beginning, and I had no information about the quality of the hash function. Hence, I created a hash table with a reasonable number of buckets, and the rest of the ambiguity was handled by the binary trees.
If N is the number of elements and M is the number of buckets, then the complexity grows as O(log(N/M)) in the case of an even distribution.
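A minimal sketch of such a hybrid - a fixed number of buckets, each backed by a balanced search tree (std::map) rather than a linked list; the bucket count and key/value types are illustrative assumptions:

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

class HybridHashTable {
public:
    explicit HybridHashTable(std::size_t buckets) : buckets_(buckets) {}

    void insert(const std::string& key, int value) {
        bucket_for(key)[key] = value;
    }

    const int* find(const std::string& key) const {
        const auto& b = buckets_[std::hash<std::string>{}(key) % buckets_.size()];
        auto it = b.find(key);                  // O(log(N/M)) within the bucket
        return it == b.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, int>& bucket_for(const std::string& key) {
        return buckets_[std::hash<std::string>{}(key) % buckets_.size()];
    }
    std::vector<std::map<std::string, int>> buckets_;
};

int main() {
    HybridHashTable t(997);
    t.insert("alice", 1);
    t.insert("bob", 2);
    if (const int* v = t.find("bob")) std::cout << *v << '\n';   // 2
}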
If you can't use another data structure or a larger table there are still options:
If the active set of elements is closer to 1000 than 1M you can improve the average successful lookup time by moving each element you find to the front of its list. That will allow it to be found quickly when it is reused.
Similarly if there is a set of misses that happens frequently you can cache the negative result (this can just be a special kind of entry in the hash table).
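A small sketch of the move-to-front idea for one bucket's chain, using std::list::splice so the found node is moved without reallocation; the types and names are illustrative:

#include <iostream>
#include <list>
#include <string>
#include <utility>

using Bucket = std::list<std::pair<std::string, int>>;

// Find `key` in the bucket; on a hit, move that entry to the front so the next
// lookup of a hot key terminates after one comparison.
int* find_and_promote(Bucket& bucket, const std::string& key) {
    for (auto it = bucket.begin(); it != bucket.end(); ++it) {
        if (it->first == key) {
            bucket.splice(bucket.begin(), bucket, it);   // move the node to the front
            return &bucket.front().second;
        }
    }
    return nullptr;   // a real table could also cache this miss as a negative entry
}

int main() {
    Bucket b{{"cold", 1}, {"hot", 2}};
    if (int* v = find_and_promote(b, "hot")) std::cout << *v << '\n';   // 2
    std::cout << b.front().first << '\n';                               // hot
}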
Suppose that there are 1 million elements in the table and 997 buckets of unordered lists. Further suppose that the hash function distributes keys with equal probability (i.e., each bucket has 1000 elements).
That doesn't quite add up, but let's run with it....
What is the worst case time to find an element which is not in the table? To find one which is in the table? How can you improve this?
The worst (and best = only) case for missing elements is that you hash to a bucket, then go through inspecting all the elements in that specific list (i.e. 1000), then fail. If they want big-O notation, by definition that describes how performance varies with the number of elements N, so we have to make an assumption about how the number of buckets varies with N too: my guess is that the 997 buckets is a fixed amount and is not going to be increased if the number of elements increases. The number of comparisons is therefore N/997, which, being a linear factor, is still O(N).
My solution: The worst case time of finding an element not in table and in table are all O(1000). 1000 is the length of the unsorted list.
Nope - you're thinking of the number of comparisons - but big-O notation is about scalability.
Improve it : (0) straightforward, increase bucket numbers > 1 million. (1) each bucket holds a second hashtable, which use a different hash function to compute hash value for the second table. it will be O(1) (2) each bucket holds a binary search tree. It will be O(lg n).
is it possible to make a trade-off between space and time. Keep both of them in a reasonable range.
Well, yes - average collisions relate to the number of entries and buckets. If you want very few collisions, you'd want well over 1 million buckets in the table, but that gets wasteful of memory, though for large objects you can store an index or pointer to the actual object. An alternative is to look for faster collision-handling mechanisms, such as trying a series of offsets from the hashed-to bucket (using % to map the displacements back into the table size), rather than resorting to heap-allocated linked lists. Rehashing is another alternative, giving lower re-collision rates but typically needing more CPU, and having an arbitrarily long list of good hashing algorithms is problematic.
Hash tables within hash tables are totally pointless and remarkably wasteful of memory. It is much better to use a fraction of that space to reduce collisions in the outer hash table.

remove elements: but which container to prefer

I am keeping the nonzeros of a sparse matrix representation as triplets, known in the numerical community as Compressed Sparse Row storage. Entries are stored row-wise; for instance, a 4x4 matrix is represented as
r:0 0 1 1 2 2 3 3 3
c:0 3 2 3 2 3 1 2 3
v:1 5 2 2 4 1 5 4 5
so 'r' gives row indices, 'c' gives column indices and 'v' are the values associated with the two indices above that value.
I would like to delete some rows and columns from my matrix representation, say rows and columns 1 and 3, so I should remove 1s and 3s from the 'r' and 'c' arrays. I am also trying to learn more about the performance of the STL containers and have been reading a bit. As a first try, I created a multimap and deleted the items by looping over them with multimap's find method. This removes the found keys but might leave some of the searched values in the 'c' array, so I then swapped the key/value pairs and did the same operation for this second map. That did not seem like a very good solution to me, though it is pretty fast (on a problem with 50,000 entries). So the question is: what would be the most efficient way to do this with standard containers?
You could use a map between a pair of (row, column) indices and the value, something like map<pair<int,int>, int>.
If you then want to delete a row, you iterate over the elements and erase those with the to-be-deleted row. The same can be done for columns.
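A minimal sketch of this suggestion - the matrix as map<pair<int,int>, int> and a single pass that erases every entry whose row or column is marked for deletion:

#include <iostream>
#include <map>
#include <utility>

int main() {
    // The 4x4 example matrix from the question, keyed by (row, column).
    std::map<std::pair<int, int>, int> m = {
        {{0, 0}, 1}, {{0, 3}, 5}, {{1, 2}, 2}, {{1, 3}, 2},
        {{2, 2}, 4}, {{2, 3}, 1}, {{3, 1}, 5}, {{3, 2}, 4}, {{3, 3}, 5}};

    int const row = 1, col = 3;                  // row/column to delete
    for (auto it = m.begin(); it != m.end(); ) {
        if (it->first.first == row || it->first.second == col)
            it = m.erase(it);                    // erase returns the next valid iterator
        else
            ++it;
    }

    for (const auto& [rc, v] : m)
        std::cout << rc.first << ',' << rc.second << " -> " << v << '\n';
}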
How are you accessing the matrix? Do you look up particular rows/columns and do things with them that way, or do you use the whole matrix at a time for operations like matrix-vector multiplications or factorization routines? If you're not normally indexing by row/column, then it may be more efficient to store your data in std::vector containers.
Your deletion operation is then a matter of iterating straight through the container, sliding down subsequent elements in place of the entries you wish to delete. Obviously, there are tradeoffs involved here. Your map/multimap approach will take something like O(k log n) time to delete k entries, but whole-matrix operations in that representation will be very inefficient (though hopefully still O(n) and not O(n log n)).
Using the array representation, deleting a single row or column would take O(n) time, but you could delete an arbitrary number of rows or columns in the same single pass, by keeping their indices in a pair of hash tables or splay trees and doing a lookup for each entry. After the deletion scan, you could either resize the vectors down to the number of elements you have left, which saves memory but might entail a copy, or just keep an explicit count of how many entries are valid, trading dead memory for saving time.
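To make the single-pass compaction concrete, here is a minimal sketch over the three parallel arrays, with the rows/columns to delete kept in unordered_sets so each entry needs only O(1) lookups; variable names are illustrative:

#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
    // The triplet arrays from the question.
    std::vector<int> r{0, 0, 1, 1, 2, 2, 3, 3, 3};
    std::vector<int> c{0, 3, 2, 3, 2, 3, 1, 2, 3};
    std::vector<int> v{1, 5, 2, 2, 4, 1, 5, 4, 5};

    std::unordered_set<int> dead_rows{1, 3}, dead_cols{1, 3};

    std::size_t keep = 0;                           // next free slot in the compacted arrays
    for (std::size_t i = 0; i < r.size(); ++i) {
        if (dead_rows.count(r[i]) || dead_cols.count(c[i]))
            continue;                               // entry belongs to a deleted row/column
        r[keep] = r[i]; c[keep] = c[i]; v[keep] = v[i];
        ++keep;
    }
    r.resize(keep); c.resize(keep); v.resize(keep); // trim the dead tail (or keep a count instead)

    for (std::size_t i = 0; i < keep; ++i)
        std::cout << r[i] << ' ' << c[i] << ' ' << v[i] << '\n';
}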