Suppose I have a very huge text file with lines of arbitrary, differing lengths. I want to remove duplicate lines; how do I do this in C++?
Duplicate lines can be very far apart from each other, and I want to keep only the first occurrence.
The file is so huge that it can be 10-50 times bigger than the amount of RAM.
There is the Linux command uniq, but it removes only adjacent equal lines, while I need to remove duplicates that can be far apart.
Originally this question was asked here, but now it is deleted, and attempts to undelete or reopen it failed. I had already implemented half of a huge answer to it by the time it was deleted.
I'm asking this question only to share my own answer below; here I provide just a very tiny solution, which doesn't scale because it uses only an in-memory unordered set.
The simplest in-memory-only solution, using std::unordered_set:
Try it online!
#include <random>
#include <iostream>
#include <unordered_set>
#include <string>
#include <fstream>
int main() {
    size_t constexpr n = 15;
    std::mt19937_64 rng{125};
    {
        // Generate a small test file containing duplicate lines.
        std::ofstream f("test.txt");
        std::cout << "Input:" << std::endl;
        for (size_t i = 0; i < n; ++i) {
            auto const x = rng() % (n * 3 / 4);
            f << x << std::endl;
            std::cout << x << std::endl;
        }
    }
    std::ofstream fw("test.txt.out");
    std::ifstream f("test.txt");
    std::string line;
    std::unordered_set<std::string> set;
    std::cout << std::endl << "Output:" << std::endl;
    // Copy a line to the output only if it has not been seen before.
    while (std::getline(f, line)) {
        if (set.count(line))
            continue;
        fw << line << std::endl;
        std::cout << line << std::endl;
        set.insert(line);
    }
}
Console Output:
Input:
2
10
6
10
7
6
3
2
6
2
3
7
8
1
10
Output:
2
10
6
7
3
8
1
Note: although in the code above I used only very short lines, in my real files lines can be of any length. In other words, the deduplicated set of all lines exceeds the memory size many times over. This means in-memory-only solutions are not suitable.
Below I implemented from scratch 3 totally different solutions to the OP's task:
My own implementation of a disk-based hash set. It stores an 80-96-bit hash of each line and solves the task in a way similar to the code in the OP's question. By the design of its algorithm this set is similar to absl::flat_hash_set. This algorithm is the 2nd fastest of the three.
A sorting-based algorithm, which sorts hashes of all lines in portions that fit into free memory, using std::sort(), then merges them all using an N-way merge sort. This algorithm is the fastest of the three.
A disk-based B-Tree implemented from scratch, which supports any arbitrary branching degree. This tree keeps sorted hashes of all lines, which allows duplicates to be excluded. This is the slowest of the three algorithms.
I'll provide some details about all algorithms:
The disk-based hash set uses a single huge file that stores entries consisting of a value and a partial hash. The partial hash stored in an entry consists of the high bits of the line's hash; the low bits of the hash are stored indirectly, as the index of the bucket.
This hash set is similar to absl::flat_hash_set from the ABSEIL library.
It is similar in the sense that it stores part of the higher bits of the hash next to the value inside the bucket. When the hash set searches for an existing value, it first reads a bucket from disk, where the bucket index equals hash % hash_table_size.
After the bucket is read from disk, it checks whether the hash of the searched key has the same higher bits. If so, the value is checked to see whether its key equals the searched one. If not, the following few buckets are read from disk (actually they are cached from the previous read) and checked the same way. If during this scan we meet an empty bucket, we conclude that the searched key is not present.
To add a value to the hash set, we search for the key as described above, then write the key/value to the first empty bucket.
Reads and writes in the hash set are done through random reads and writes on disk. It is best to use an SSD instead of an HDD, because then random reads and writes are very fast.
Of course, even on an SSD, a write touches 8 KB at a time even if you change only 16 bytes, because the SSD flash cluster size is 8 KB, although reading stays fast. Hence this disk hash set is not too fast.
This algorithm is the 2nd fastest of my three algorithms.
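To make the lookup/insert logic concrete, here is a minimal sketch of the probing scheme described above, with an in-memory std::vector standing in for the on-disk bucket file. The names (DiskHashSet, Entry, the tag width) are illustrative only and are not taken from the actual implementation linked below.
#include <cstdint>
#include <vector>

struct Entry {
    uint64_t tag = 0;    // high bits of the line's hash; 0 marks an empty bucket
};

class DiskHashSet {
    std::vector<Entry> buckets_;   // on disk this would be one huge file of fixed-size entries
public:
    explicit DiskHashSet(size_t table_size) : buckets_(table_size) {}

    // Returns true if the hash was already present, false if it was inserted now.
    bool contains_or_insert(uint64_t hash) {
        uint64_t const start = hash % buckets_.size();   // low bits choose the bucket
        uint64_t const tag   = (hash >> 16) | 1;         // high bits are stored in the entry (never 0)
        for (size_t i = 0; i < buckets_.size(); ++i) {
            // Linear probing; on disk the following buckets come from the cached read.
            Entry & e = buckets_[(start + i) % buckets_.size()];
            if (e.tag == 0) {        // empty bucket: key is absent, insert here
                e.tag = tag;
                return false;
            }
            if (e.tag == tag)        // stored partial hash matches: treat as a duplicate
                return true;         // (the real code keeps a much wider 80-96-bit hash,
        }                            //  so false positives are negligible)
        return false;                // table full; not handled in this sketch
    }
};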
The second, sorting-based algorithm does the following.
First it accumulates into memory as many hashes of the text file's lines as possible, as long as there is free memory. Then it sorts them in memory through std::sort, using std::execution::par_unseq, which sorts the in-memory array in multi-threaded fashion.
The sorted in-memory portion of hashes is then stored to disk into a first file. Another portion of in-memory hashes is sorted and stored into a second file, and so on, until all hashes are stored into many files.
Together with each hash, every entry of a sorted file keeps the index of the line in the source file; we sort these pairs.
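As a rough sketch of this step (not the actual code from the Gist), producing one sorted run could look like the following; note that with GCC the parallel std::sort typically requires linking against TBB (-ltbb).
#include <algorithm>
#include <cstdint>
#include <execution>
#include <fstream>
#include <string>
#include <vector>

struct HashIdx {
    uint64_t hash;     // hash of the line
    uint64_t line_idx; // index of the line in the source file
};

// Sort as many (hash, line index) pairs as fit in memory, then dump them to a run file.
void write_sorted_run(std::vector<HashIdx> & chunk, std::string const & run_path) {
    // Multi-threaded in-memory sort by hash, ties broken by line index.
    std::sort(std::execution::par_unseq, chunk.begin(), chunk.end(),
              [](HashIdx const & a, HashIdx const & b) {
                  return a.hash != b.hash ? a.hash < b.hash : a.line_idx < b.line_idx;
              });
    std::ofstream f(run_path, std::ios::binary);
    f.write(reinterpret_cast<char const *>(chunk.data()),
            std::streamsize(chunk.size() * sizeof(HashIdx)));
}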
Then we merge all files using an N-way merge sort. To do this I utilize a heap, which is emulated in C++ through std::make_heap(), std::push_heap() and std::pop_heap().
The merged sequence of pairs is stored into one huge file.
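Here is a minimal sketch of the N-way merge using those std heap helpers; RunReader is a hypothetical wrapper, not a type from the actual code, and HashIdx has the same layout as in the previous sketch.
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <memory>
#include <string>
#include <vector>

struct HashIdx { uint64_t hash; uint64_t line_idx; };

// Reads (hash, line index) pairs sequentially from one sorted run file.
struct RunReader {
    std::ifstream file;
    HashIdx current{};
    bool ok = false;
    explicit RunReader(std::string const & path) : file(path, std::ios::binary) { advance(); }
    void advance() { ok = bool(file.read(reinterpret_cast<char *>(&current), sizeof current)); }
};

void merge_runs(std::vector<std::string> const & run_paths, std::string const & out_path) {
    std::vector<std::unique_ptr<RunReader>> readers;
    for (auto const & p : run_paths) {
        auto r = std::make_unique<RunReader>(p);
        if (r->ok) readers.push_back(std::move(r));
    }
    // std heaps are max-heaps, so compare with "greater" to pop the smallest hash first.
    auto cmp = [](std::unique_ptr<RunReader> const & a, std::unique_ptr<RunReader> const & b) {
        return a->current.hash > b->current.hash;
    };
    std::make_heap(readers.begin(), readers.end(), cmp);
    std::ofstream out(out_path, std::ios::binary);
    while (!readers.empty()) {
        std::pop_heap(readers.begin(), readers.end(), cmp);   // smallest element moves to the back
        RunReader & r = *readers.back();
        out.write(reinterpret_cast<char const *>(&r.current), sizeof r.current);
        r.advance();
        if (r.ok)
            std::push_heap(readers.begin(), readers.end(), cmp);  // re-insert with its next pair
        else
            readers.pop_back();                                   // this run is exhausted
    }
}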
After the pairs are sorted, we deduplicate them by scanning them sequentially and removing adjacent pairs that have duplicate hashes but different line indices, keeping only the first line index. The result is stored into another huge file; we store only the second elements of the pairs, i.e. the line indices.
The deduplicated file is then sorted again. To recap, it contains only line indices. Sorting is done the same way as described at the start of this algorithm: we split into many in-memory portions, sort them, and N-way merge sort them into a single huge file.
Finally, we scan the last deduplicated and sorted huge file sequentially, together with the original text file. Scanning them as a pair, we do a 2-way merge, meaning we skip lines whose indices are absent.
Done. Now our original text file contains only unique lines.
The third algorithm is based on a B-Tree, which allows storing any sorted sequence of elements. In our case we store sorted hashes of lines.
A B-Tree is quite complex to explain; it's better to read the Wiki article.
In short, a B-Tree is constructed the following way. The tree has a branching degree equal to some B, let's say B = 10; this means each intermediate node, except the leaves, has at most 10 children.
Each intermediate node has pointers to its 10 children plus, for each child, the smallest key of that child's subtree.
If we insert something into the tree, then from the root we descend down to the leaves, and on the way we check whether the searched key belongs to a child's subtree: it belongs to a given child's subtree only if that child's smallest key is smaller than or equal to the key, while the next child's smallest key is bigger.
Now we add the new entry to a leaf. If the leaf overflows in size, i.e. contains more than 10 elements, it is split into two nodes with an equal number of elements. Then inside its parent, instead of a single pointer to the child, we try to add two pointers to the two children. The parent's child count may then overflow 10 elements, in which case we split it into two equal nodes too.
In the same way, all nodes on the path from the leaf to the root may be split if necessary, until we meet a node that has fewer than 10 pointers; then we don't need to split it and the process finishes.
Also, on the way up to the root we need to update the smallest subtree key, because it may have changed if the value inserted into the leaf was at the first position.
If we need to search the tree, we do the same as described above, meaning we search from the root down to a leaf for the given key. A value inside the leaf either contains our key, in which case we found something, or a non-equal key, in which case we didn't find the key.
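As an illustration only (the real tree stores nodes in a disk file and addresses children by file offset rather than by pointer), a hypothetical node layout and the search descent could look like this:
#include <cstdint>
#include <vector>

struct BTreeNode {
    bool leaf = false;
    std::vector<uint64_t>    keys;      // sorted; for an inner node, keys[i] is the
                                        // smallest key in the subtree of children[i]
    std::vector<BTreeNode *> children;  // empty for leaves; at most B entries otherwise
};

// Descend from the root to a leaf: at every inner node pick the last child whose
// smallest key is <= the searched key, then check the leaf itself.
bool contains(BTreeNode const * node, uint64_t key) {
    while (!node->leaf) {
        size_t i = node->keys.size();
        while (i > 1 && node->keys[i - 1] > key)   // children[0] is the fallback
            --i;
        node = node->children[i - 1];
    }
    for (uint64_t k : node->keys)
        if (k == key)
            return true;
    return false;
}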
In my B-Tree algorithm I didn't implement deletion, only insertion and search. Deletion is more complex, but it is not needed for our task; it can be implemented later if needed.
This 3rd algorithm is slowest, because it does around 4-5 random SSD reads and writes on every added or read element.
Now, below I present the whole C++ code that implements all 3 algorithms. This code can be compiled with GCC, and Clang can compile it too. Right now it is not suitable for MSVC, though I could probably tweak it to support MSVC as well; the only blocker is that MSVC doesn't have the __int128 type that GCC/Clang have.
This program is fully self-contained: just compile it and run. It runs speed tests of single operations, plus full tests on pre-generated data. You may change nums_cnt = 1ULL << 17; to some bigger value if you need to generate more lines of text; nums_cnt signifies how many lines there are.
Try it online! (GodBolt)
SOURCE CODE HERE. The post together with the full code is so large that it can't fit into the 30,000-character limit on StackOverflow post size; the code alone is 46 KB. Hence I provide the code separately through the GitHub Gist link below. You may also click on Try it online! above, which will redirect you to the code on the GodBolt server, where you can try it live.
GitHub Gist full code
Console Output:
Insert Time HS 174.1 ns/num, BT 5304.9 ns/num, UM 278.5 ns/num, Boost HS 1.599x, Boost BT 0.052x
Erase Time HS 217.4 ns/num, UM 286.9 ns/num, Boost HS 1.320x
Find Time HS 113.7 ns/num, BT 1954.6 ns/num, UM 61.8 ns/num, Boost HS 0.544x, Boost BT 0.032x
Algo HashSet 61.34 sec
Algo Sort 13.8 sec
Algo BTree 256.9 sec
The problem: Storing dynamic adjacency list of a graph in a file while retaining O(1) algorithmic complexity of operations.
I am trying to store a dynamic bidirectional graph in a file (or several files). Both nodes and edges can be added and removed, and the operations must be O(1). My current design is:
File 1 - Nodes
Stores two integers per node (inserts are appends and removals use free list):
number of incoming edges
number of outgoing edges
File 2 - Edges
Stores 4 integers per edge (inserts are appends and removals use free list + swap with last edge for a node to update its new index):
from node (indice to File 1)
from index (i.e. third incoming edge)
to node (indice to File 1)
to index (i.e. second outgoing edge).
File 3 - Links
Serves as openly addressed hash table of locations of edges in File 2. Basically when you read a node from File 1 you know there are x incoming edges and y outgoing edges. With that you can go to File 3 to get the position of each of these edges in File 2. The key thus being:
index of node in File 1 (i.e. 0 for first node, 1 for second node)
0 <= index of edge < number of outgoing/incoming edges
Example of File 3 keys if represented as chained hash table (that is unfortunately not suitable for files but would not require hashing...):
Keys (indices from `File 1` + 0 <= index < number of edges from `File 1`, not actually stored)
1 | 0 1 2
2 | 0 1
3 |
4 | 0
5 | 0 1 2
I am using qHash and QPair to hash these at the moment; however, the number of collisions is very high, especially when I compare it to single-int hashing, which is very efficient with qHash. Since the values stored are indices into yet another file, probing is rather expensive, so I would like to cut the number of collisions down.
Is there a specialized hashing algorithm or approach for a pair of ints that could perform better in this situation? Or, of course, a different approach that would avoid this problem, such as how to implement a chained hash table in a file (I can only think of using buffers, but I believe that would be overkill for sparse graphs like mine)?
If you read through the comments on this answer, they claim qHash of an int just returns that int unchanged (which is a fairly common way to hash integers for undemanding use in in-memory hash tables). So, using a strong general-purpose hash function will achieve a dramatic reduction in collisions, though you may lose out on some incidental caching benefits of having nearby keys more likely to hash to the same area on disk, so do measure rather than taking it for granted that fewer collisions means better performance.
I also suggest trying boost::hash_combine to create an overall hash from multiple hash values (just using + or XOR is a very bad idea).
Then, if you're reading from disk, there's probably some kind of page size - e.g. 4k, 8k - which you'll have to read in to access any data anywhere on that page, so if there's a collision it'd still be better to look elsewhere on the already-loaded page, rather than waiting to load another page from disk. Simple linear probing manages that a lot of the time, but you could improve on that further by wrapping back to the start of the page to ensure you've searched all of it before probing elsewhere.
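For example, a pair hash built on roughly the classic boost::hash_combine formula might look like the sketch below; the function names are illustrative, and whether it actually beats qHash for your key distribution is something to measure, as noted above.
#include <cstddef>
#include <functional>

// Roughly the classic boost::hash_combine mixing step.
inline void hash_combine(std::size_t & seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Combine a (node index, edge index) pair into one hash value.
inline std::size_t hash_pair(int node_index, int edge_index) {
    std::size_t seed = std::hash<int>{}(node_index);
    hash_combine(seed, std::hash<int>{}(edge_index));
    return seed;
}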
I have a simple requirement: I need a map of type map<int,int>; however, I need the theoretically fastest possible retrieval time.
I used both map and the newly proposed unordered_map from TR1.
I found that, at least while parsing a file and creating the map by inserting one element at a time,
map took only 2 minutes while unordered_map took 5 minutes.
As it is going to be part of code executed on a Hadoop cluster and will contain ~100 million entries, I need the smallest possible retrieval time.
Another piece of helpful information:
currently the data (keys) being inserted is a range of integers from 1, 2, ... to ~10 million.
I can also require the user to specify a maximum value and to insert in increasing order as above; will that significantly affect my implementation? (I heard map is based on red-black trees, and inserting in increasing order leads to better performance (or worse?))
Here is the code:
map<int,int> Label // this is being changed to unordered_map
fstream LabelFile("Labels.txt");
// Creating the map from the Label.txt
if (LabelFile.is_open())
{
while (! LabelFile.eof() )
{
getline (LabelFile,inputLine);
try
{
curnode=inputLine.substr(0,inputLine.find_first_of("\t"));
nodelabel=inputLine.substr(inputLine.find_first_of("\t")+1,inputLine.size()-1);
Label[atoi(curnode.c_str())]=atoi(nodelabel.c_str());
}
catch(char* strerr)
{
failed=true;
break;
}
}
LabelFile.close();
}
Tentative solution: after reviewing the comments and answers, I believe a dynamic C++ array would be the best option, since the implementation will use dense keys. Thanks.
Insertion for unordered_map should be O(1) and retrieval should be roughly O(1), since it's essentially a hash table.
Your timings as a result are way OFF, or there is something WRONG with your implementation or usage of unordered_map.
You need to provide some more information, and possibly how you are using the container.
As per section 6.3 of n1836, the complexities for insertion/retrieval are given:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf
One issue you should consider is that your implementation may need to continually be rehashing the structure, as you say you have 100mil+ items. In that case when instantiating the container, if you have a rough idea about how many "unique" elements will be inserted into the container, you can pass that in as a parameter to the constructor and the container will be instantiated accordingly with a bucket-table of appropriate size.
The extra time loading the unordered_map is due to dynamic array resizing. The resizing schedule doubles the number of cells each time the table exceeds its load factor. So, from an empty table, expect O(lg n) copies of the entire data table. You can eliminate these extra copies by sizing the hash table upfront. Specifically:
Label.reserve(expected_number_of_entries / Label.max_load_factor());
Dividing by the max_load_factor is to account for the empty cells that are necessary for the hash table to operate.
unordered_map (at least in most implementations) gives fast retrieval, but relatively poor insertion speed compared to map. A tree is generally at its best when the data is randomly ordered, and at its worst when the data is ordered (you constantly insert at one end of the tree, increasing the frequency of re-balancing).
Given that it's ~10 million total entries, you could just allocate a large enough array, and get really fast lookups -- assuming enough physical memory that it didn't cause thrashing, but that's not a huge amount of memory by modern standards.
Edit: yes, a vector is basically a dynamic array.
Edit 2: The code you've added has some problems. Your while (! LabelFile.eof() ) is broken. You normally want to do something like while (LabelFile >> inputdata) instead. You're also reading the data somewhat inefficiently -- what you're apparently expecting is two numbers separated by a tab. That being the case, I'd write the loop something like:
while (LabelFile >> node >> label)
Label[node] = label;
I am writing a program that will do a basic compression using a lookup table. To create the table, I will read in a text file (size 2 MB), then find the 255 most common words and store them into another text file. I am trying to use a vector now, but the runtime is slow, at about a minute to insert into the vector, sort it, and then output the top 255 elements to another text file. The insertion appears to be the problem, since I have to check whether an element already exists inside the vector and then increment a counter if it does, or add the element to the end of the vector if it doesn't. I need to find an efficient way of inserting elements into a data structure only when they are not already inside of it (no duplicates).
std::unordered_map is likely to be the best for your purpose, no guarantees. You can "add a key if and only if not already present" just by using operator[].
You'll make one pass over the 2MB splitting into words and counting the frequencies (one lookup in the structure per word). Then use std::partial_sort_copy (the version that takes a comparator) to get the top 255 by frequency count from the unordered_map. You should partial_sort_copy into a vector or array and then use that to write the file.
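A minimal sketch of that two-step approach could look like this; the file names are placeholders, and only the words (not the counts) are written out, matching the table described in the question.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // One pass over the input: count word frequencies. operator[] inserts a
    // zero-initialized counter the first time a word is seen.
    std::unordered_map<std::string, std::size_t> freq;
    std::ifstream in("input.txt");          // hypothetical input name
    std::string word;
    while (in >> word)
        ++freq[word];

    // Copy the 255 most frequent words (or fewer, if there aren't that many)
    // into a vector, ordered by descending count.
    std::vector<std::pair<std::string, std::size_t>> top(std::min<std::size_t>(255, freq.size()));
    std::partial_sort_copy(freq.begin(), freq.end(), top.begin(), top.end(),
                           [](auto const & a, auto const & b) { return a.second > b.second; });

    std::ofstream out("table.txt");         // hypothetical output name
    for (auto const & entry : top)
        out << entry.first << '\n';
}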
For 2 MB of data, anything over a few seconds is certainly slower than it "should" be, and a few seconds is still slower than it could be. So you're right to be concerned about your vector, but you should also profile your code to make sure that it really is the vector costing you the time, not some other issue.
Try using an STL map or set; it is much faster than a vector for this: see here.
I have two files:
One file, "mapping.txt", is 10 GB and stores names:
1 "First string"
2 "Second string"
3 "Third string"
...
199000000 "199000000th string"
And the other file stores integers from mapping.txt in some arbitrary order (stored in file.txt):
88 76 23 1 5 7 9 10 78 12 99 12 15 16 77 89 90 51
Now I want to sort "mapping.txt" in the order specified by the integers above like:
88 "88th string"
76 "76th string"
23 "23rd string"
1 "1st string"
5 "5th string"
7 "7th string"
How do I accomplish this using C++?
I know that for every integer in the file one can perform a binary search in "mapping.txt", but since the overall time complexity is then O(n log n), it is not very efficient for large files.
I'd like a way to do this that is more performant than O(n log n).
Here's what I would do. This may not be the most efficient way, but I can't think of a better one.
First, you pass over the big file once to build an index of the offset at which each line starts. This index should fit into memory, if your lines are long enough.
Then you pass over the small file, read each index, jump to the corresponding location in the big file, and copy that line to the target file.
Because your index is continuous and indexed by integer, lookup is constant time. Any in-memory lookup time will be completely overshadowed by disk seek time anyway, though.
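A minimal sketch of that approach, using the file names from the question plus a placeholder output name, could look like this:
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

int main() {
    // Pass 1: record the byte offset at which every line of the big file starts.
    std::ifstream big("mapping.txt");
    std::vector<std::uint64_t> offsets;
    offsets.push_back(0);
    std::string line;
    while (std::getline(big, line))
        offsets.push_back(static_cast<std::uint64_t>(big.tellg()));
    offsets.pop_back();  // the last entry is the end-of-file position, not a line start
    big.clear();         // clear EOF/fail flags so we can seek again

    // Pass 2: for every index in the small file, jump straight to that line and copy it.
    std::ifstream order("file.txt");
    std::ofstream out("sorted.txt");     // hypothetical output name
    std::uint64_t idx;
    while (order >> idx) {               // indices in the question are 1-based
        big.seekg(static_cast<std::streamoff>(offsets.at(idx - 1)));
        std::getline(big, line);
        out << line << '\n';
    }
}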
I know that for every integer in file.txt one can perform a binary search in "mapping.txt"
As you said, binary search is not useful here; besides the reason you gave, you also have the challenge that mapping.txt is not in a friendly format for searching or indexing.
If possible, I would recommend changing the format of the mapping file to one more suitable for direct seek calls. For instance, you could use a file containing fixed-length strings, so you can calculate the position of each entry (that would be constant in the number of seek calls, but keep in mind that the seek itself isn't necessarily constant-time).
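For example, with a hypothetical fixed record width, reading the k-th entry reduces to a single seek; the width and helper below are illustrative, not a prescribed format.
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical fixed-width layout: every record of mapping.txt is padded to
// RECORD_SIZE bytes, so record k (1-based) starts at byte (k - 1) * RECORD_SIZE.
constexpr std::streamoff RECORD_SIZE = 128;

std::string read_record(std::ifstream & mapping, std::uint64_t index_1_based) {
    std::string buf(RECORD_SIZE, '\0');
    mapping.seekg(static_cast<std::streamoff>(index_1_based - 1) * RECORD_SIZE);
    mapping.read(&buf[0], RECORD_SIZE);
    return buf;
}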
[EDIT]:
Other thing you could do to minimize access to the mapping.txt is the following:
Load the "order" file into an array in memory but in a way where the position is the actual line on mapping.txt and the element is the desired position on the new file, for instance the first element of that array would be 4 because 1 is on the 4th position (in your example).
For convenience split the new array into N buckets files so if an element would go to the 200th position that would be the first position on the 4th bucket (for example).
Now you can access the mapping file in a sequential fashion, you would for each line check on your array for the actual position in your new file and put then in the corresponding bucket.
Once you passed the whole mapping file (you only have to checked it once), you only need to append the N buckets into your desired file.
As Sebastian suggested, try
creating an index over the mapping file ("mapping.txt") with the offset (and optionally length) of each string in the file.
Then access that index for each entry in the ordering file ("file.txt") and seek to the stored position in the text file.
This has linear time complexity depending on the size of the two files, and linear space complexity with a small factor depending on the line count of "mapping.txt".
For fast and memory-efficient sequential read access to large regular files use mmap(2) and madvise(2) or their corresponding constructs in the Windows API. If the file is larger than your address space, mmap it in chunks as large as possible. Don't forget to madvise the kernel on the different access pattern in step 2 (random vs. sequential).
Please don't copy that much stuff from a file onto the heap, if you don't need it later and your system has memory maps!
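A POSIX-only sketch of that mmap/madvise usage, with error handling reduced to early returns, might look like this:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

int main() {
    int fd = open("mapping.txt", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st{};
    if (fstat(fd, &st) != 0) return 1;
    size_t const len = static_cast<size_t>(st.st_size);

    // Map the whole file read-only (map it in chunks if it exceeds the address space).
    void * p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    // Step 1 (building the offset index) scans the file front to back:
    madvise(p, len, MADV_SEQUENTIAL);
    // ... scan the mapped bytes here ...

    // Step 2 (resolving the entries of file.txt) jumps around:
    madvise(p, len, MADV_RANDOM);
    // ... read the mapped bytes at the stored offsets here ...

    munmap(p, len);
    close(fd);
}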
Given you have a list of exactly how you want the data output, I'd try an array
You would be best served by splitting this problem into smaller problems:
Split mapping.txt and file.txt into n- and m-entry chunks respectively (n and m could be the same size or different).
Take your normal map-sorting routine and modify it to take a chunk number (the chunk being which m-offset of file.txt you're operating on) and perform the map-sorting on those indices from the various mapping.txt chunks.
Once complete, you will have m output-X.txt files, which you can merge into your actual output file.
Since your data is ASCII, it will be a pain to map fixed windows into either file, so splitting both into smaller files will be helpful.
This is a pretty good candidate for mergesort.
This will be O(n log n) but most algorithms will not beat that.
You just need to use the index file to alter the key comparison.
You will find merge sort in any decent algorithms textbook, and it is well suited to doing an external sort to disk, for when the file to be sorted is bigger than memory.
If you really must beat O(n log n), pass over the file and build a hash table, indexed by the key, of where every line is. Then read the index file and use the hash table to locate each line.
In theory this would be O(n + big constant).
I see some problems with this, however: what is n? That will be a big hash table. The implementation may very well be slower than the O(n log n) solution due to the "big constant" being really big. Even if you mmap the file for efficient access, you may get a lot of paging.