Remove duplicates from two large text files using unordered_map


I am new to a lot of these C++ libraries, so please forgive me if my question comes across as naive.
I have two large text files, about 160 MB each (about 700000 lines each). I need to remove from file2 all of the duplicate lines that appear in file1. To achieve this, I decided to use unordered_map with a 32 character string as my key. The 32 character string is the first 32 chars of each line (this is enough to uniquely identify the line).
Anyway, I basically just go through the first file and insert the 32-char substring of each line into the unordered_map. Then I go through the second file and check whether each line in file2 exists in my unordered_map. If it doesn't exist, then I write the full line to a new text file.
This works fine for the smaller files (40 MB each), but for these 160 MB files it takes very long to insert into the hash table (before I even start looking at file2). At around 260,000 inserts it seems to have halted or is going very slowly. Is it possible that I have reached my memory limitations? If so, can anybody explain how to calculate this? If not, is there something else that I could be doing to make it faster? Maybe choosing a custom hash function, or specifying some parameters that would help optimize it?
My key-value pair in the hash table is (string, int), where the string is always 32 chars long and the int is a count I use to handle duplicates.
I am running a 64 bit Windows 7 OS w/ 12 GB RAM.
Any help would be greatly appreciated.. thanks guys!!

You don't need a map because you don't have any associative data. An unordered set will do the job. Also, I'd go with some memory efficient hash set implementation like Google's sparse_hash_set. It is very memory efficient and is able to store contents on disk.
Aside from that, you can work on smaller chunks of data. For example, split your files into 10 blocks, remove duplicates from each, then combine them until you reach a single block with no duplicates. You get the idea.
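The set-based idea can be sketched as follows. This is a minimal illustration in Python rather than the asker's C++; the function name `remove_seen_lines` and the `key_len` parameter are hypothetical, and the 32-character key is taken from the question.

```python
import io


def remove_seen_lines(file1, file2, out, key_len=32):
    """Copy to `out` every line of `file2` whose first `key_len` characters
    do not begin any line of `file1`. Returns the number of lines written."""
    # Pass 1: keep only the keys in memory, not the full lines.
    seen = {line.rstrip("\n")[:key_len] for line in file1}

    written = 0
    # Pass 2: stream file2 and keep the lines whose key was never seen.
    for line in file2:
        if line.rstrip("\n")[:key_len] not in seen:
            out.write(line)
            written += 1
    return written


# Tiny in-memory demo with a 3-character key instead of 32.
demo_out = io.StringIO()
remove_seen_lines(io.StringIO("aaa 1\nbbb 2\n"),
                  io.StringIO("bbb 9\nccc 3\n"),
                  demo_out, key_len=3)
print(demo_out.getvalue(), end="")
```

Only the keys of file1 are stored, so memory use grows with the number of distinct keys rather than with the size of either file.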

I would not write a C++ program to do this, but use some existing utilities.
In Linux, Unix and Cygwin, perform the following:
cat the two files into 1 large file:
# cat file1 file2 > file3
Use sort -u to extract the unique lines:
# sort -u file3 > file4
Prefer to use operating system utilities rather than (re)writing your own.

Related

Grep pattern match between very large files is way too slow

I've spent way too much time on this and am looking for suggestions. I have two very large files (FASTQ files from an Illumina sequencing run, for those interested). What I need to do is match a pattern common to both files and print that line plus the 3 lines below it into two separate files, without duplicates (which exist in the original files). Grep does this just fine, but the files are ~18GB and matching between them is ridiculously slow. An example of what I need to do is below.
FileA:
@DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
@DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
@DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
TGTTCAAAGCAGGCGTATTGCTCGAATATATTAGCATGGAATAATAGAAT
+DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
__\^c^ac]ZeaWdPb_e`KbagdefbZb[cebSZIY^cRaacea^[a`c
You can see 3 unique headers starting with @, each followed by 3 additional lines.
FileB:
@DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
_[_ceeeefffgfdYdffed]e`gdghfhiiihdgcghigffgfdceffh
@DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
@DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
@DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
There are 4 headers here, but only 2 are unique, as one of them is repeated 3 times.
I need the common headers between the two files, without duplicates, plus the 3 lines below them, in the same order in each file.
Here's what I have so far:
grep -E '@DLZ38V1.*/' --only-matching FileA | sort -u -o FileA.sorted
grep -E '@DLZ38V1.*/' --only-matching FileB | sort -u -o FileB.sorted
comm -12 FileA.sorted FileB.sorted > combined
combined
@DLZ38V1_0262:8:1101:1369:2106#ATAGCG/
@DLZ38V1_0262:8:1101:1430:2087#ATAGCG/
These are only the common headers between the two files, without duplicates. This is what I want.
Now I need to match these headers to the original files and grab the 3 lines below them, but only once.
If I use grep, I can get what I want for each file:
while read -r line; do
    grep -A3 -m1 -F "$line" FileA
done < combined > FileA.Final
FileA.Final
@DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
@DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
The while loop is repeated to generate FileB.Final
@DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
@DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
_[_ceeeefffgfdYdffed]e`gdghfhiiihdgcghigffgfdceffh
This works but FileA and FileB are ~18GB and my combined file is around ~2GB. Does anyone have any suggestions on how I can dramatically speed up the last step?

Depending on how often you need to run this:
you could dump your data into a Postgres (sqlite?) database (you'll probably want bulk inserts, with the index built afterwards), build an index on it, and enjoy the fruits of 40 years of research into efficient implementations of relational databases with practically no investment from you.
you could mimic a relational database with the Unix utility 'join'. There wouldn't be much joy, since that doesn't give you an index, yet it is likely to be faster than 'grep'. You might hit physical limitations; I never tried to join two 18 GB files.
you could write a bit of C code (put your favourite compiled-to-machine-code language here) that converts your strings (four letters only, right?) into binary and builds an index (or more) based on it. This could be made lightning fast with a small memory footprint, as your fifty-character string would take up only two 64-bit words.
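The third option can be illustrated with a small sketch, in Python rather than C for brevity. It packs each base into 2 bits, so a 50-base read needs at most 100 bits, which fits in two 64-bit words. The names `pack_bases` and `unpack_bases` are hypothetical, and the sketch assumes only A, C, G and T occur; reads containing N (as in the sample data) would need a wider code or separate handling.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(seq):
    """Pack a string over A/C/G/T into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        # Raises KeyError for N or any other symbol (assumption: not present).
        value = (value << 2) | CODE[base]
    return value


def unpack_bases(value, length):
    """Inverse of pack_bases; the read length must be stored separately."""
    out = []
    for _ in range(length):
        out.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(out))


read = "TGTTCAAAGCAGGCGTATTGCTCGAATATATTAGCATGGAATAATAGAAT"
packed = pack_bases(read)
print(packed.bit_length() <= 2 * len(read))  # the packed value never needs more than 2 bits per base
```

The packed integers can then serve as compact keys in whatever index is built on top of them.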

Thought I should post the fix I came up with for this. Once I obtained the combined file (above) I used a perl hash reference to read them into memory and scan file A. Matches in file A were hashed and used to scan file B. This still takes a lot of memory but works very fast. From 20+ days with grep to ~20 minutes.
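A rough sketch of this hash-based approach, written in Python rather than Perl; it is not the poster's actual script. The helper names (`read_records`, `header_keys`, `write_common`) are hypothetical, and it assumes every record is exactly four lines and that the header text up to the last '/' identifies a read pair.

```python
import io


def read_records(handle):
    """Yield (key, record_text) for each 4-line record in `handle`.
    The key is the header line up to and including its last '/'."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # end of file
        header = lines[0].rstrip("\n")
        yield header[: header.rfind("/") + 1], "".join(lines)


def header_keys(handle):
    """Collect the set of distinct header keys in a file."""
    return {key for key, _ in read_records(handle)}


def write_common(handle, wanted, out):
    """Write the first record for each key in `wanted`, in file order."""
    emitted = set()
    for key, record in read_records(handle):
        if key in wanted and key not in emitted:
            out.write(record)
            emitted.add(key)
    return len(emitted)
```

Typical use would be two passes per file: first `header_keys` on each file and an intersection of the two sets, then `write_common` on each file with that intersection. Like the Perl version, this holds all common keys in memory.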

perl regex using too much memory?

I have a perl routine that is causing me frequent "out of memory" issues in the system.
The script does 3 things
1> get the output of a command to an array (@arr = `$command` --> array will hold about 13 MB of data after the command)
2> Use a large regex to match the contents of individual array elements -->
The regex is something like this
if($new_element =~ m|([A-Z0-9\-\._\$]+);\d+\s+([0-9]+)-([A-Z][A-Z][A-Z])-([0-9][0-9][0-9][0-9])\s+([0-9]+)\:([0-9]+)\:([0-9]+)|io)
<put to hash>
3> Put the array in a persistent hash map.
$hash_var{arr[0]} = "Some value"
edit:
Sample data processed by regex are
Z4:[newuser.newdir]TESTOPEN_ERROR.COM;4
8-APR-2014 11:14:12.58
Z4:[newuser.newdir]TEST_BOC.CFG;5
5-APR-2014 10:43:11.70
Z4:[newuser.newdir]TEST_BOC.COM;20
5-APR-2014 10:41:01.63
Z4:[newuser.newdir]TEST_NEWRT.COM;17
4-APR-2014 10:30:56.11
About 10000 lines like these
I started by suspecting that the array and hash together may be consuming too much memory.
However, I have started to think this regex might have something to do with the out-of-memory errors as well.
Is the Perl regex (with the 'io' options!) really the main culprit causing out of memory?

This has nothing to do with regexes.
If you are operating in a memory-constrained environment, you should process data records one at a time rather than fetching all of them at once. Let's assume you pull your data like:
my @data = `some command`;
for my $line (@data) {
... # process the line
}
This is incredibly wasteful because you need storage for the data, and for the output of your processing (in your case: the hash).
Instead, process the input line by line. We can use the open function instead of backticks for this:
open my $cmd, '-|', 'some', 'command' or die "Can't run some command: $!";
while (my $line = <$cmd>) {
... # process the line
}
There is no need for an array here, which saves us 13MB of memory which we can now put to use otherwise.
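For comparison, the same streaming idea can be sketched in Python (not the Perl from this answer). The function name `stream_command` is hypothetical; it reads the child's output through a pipe one line at a time instead of collecting it into a list first.

```python
import subprocess
import sys


def stream_command(argv):
    """Yield the command's standard output one line at a time,
    without ever holding the whole output in memory."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, argv)


# Stand-in for the real command: a child Python process printing two lines.
demo_cmd = [sys.executable, "-c", "print('FILE_A;1'); print('FILE_B;2')"]
versions = {}
for line in stream_command(demo_cmd):
    name, _, version = line.partition(";")
    versions[name] = version  # only the processed result is kept
```

Only the current line and the resulting dictionary are alive at any time, which mirrors the `while (my $line = <$cmd>)` loop.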

What problem are you really trying to solve?
Use your words... not Perl.
Something like: "The script is picking apart the output from an OpenVMS DIRECTORY command, and the objective is to report the number of files and the oldest date, ordered by directory"
The first question is WHY keep the array. Will the script 'walk' it again?
If not, just process each line there and then in a for loop.
The regex seems to pick out a file name and a date. That's been done before.
It is not hard, and can be simplified by trusting the OpenVMS directory format.
Something like this reads better, IMHO:
if($new_element =~ m|](.*);\d+\s+(\d+)-(\w+)-(\d+)\s+(\d+):(\d+):(\d+)|)
: $hash_var{arr[0]} =
Hmmm, that suggests to me that a whole line from the array is used as the key, with all 50+ spaces. So those 10,000 lines turn into 1,000,000+ bytes just for raw key bytes. A lot, but not crazy. Now, we know that the first word on the line MUST be unique, so why not exploit that:
$hash_var{$1} = xxx if /(\S+)/l;
The program may also want to exploit that the leading strings are highly repetitive, and substitute everything before the "]" with an ever increasing directory number, maintained in a 'look-a-side' array and/or hash.
Personally I would drop /NOHEAD from the command, and use a regex to pick up the directories as they come by on their own lines.
Or use a SUBSTR or whatever... of course you'd need to construct a similar key on re-access.
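To make the simplified pattern concrete, here is a small Python sketch (the answer itself is about Perl). It follows the pattern suggested in this answer, with every numeric group written as \d+; the function name `parse_listing_line` is hypothetical, and the input format is assumed to match the sample data in the question.

```python
import re

# File name after "]", version number after ";", then
# day-month-year and hour:minute:second.
LINE = re.compile(r"\](.*);\d+\s+(\d+)-(\w+)-(\d+)\s+(\d+):(\d+):(\d+)")


def parse_listing_line(line):
    """Return (file_name, timestamp_text), or None if the line does not match."""
    m = LINE.search(line)
    if m is None:
        return None
    name, day, month, year, hour, minute, second = m.groups()
    return name, f"{day}-{month}-{year} {hour}:{minute}:{second}"


sample = "Z4:[newuser.newdir]TEST_BOC.CFG;5      5-APR-2014 10:43:11.70"
print(parse_listing_line(sample))
```

Using the short file name as the hash key, instead of the whole padded line, is what keeps the key storage small.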
In the related topic, there is debugging output printed.
Perhaps include the line number in the array for your own understanding?
Perl encounters "out of memory" in openvms system
Good luck!
Hein

LINUX / C++ Remove strings in first file from the second file

I am trying to compare two files of strings and remove everything that is in file 1 from file 2, if it's there, and save the result in a third output file. I was going to write a C++ program for this, but the best I could come up with was O(N^2). Are there any commands in Linux to do this? If not, what is the most efficient way to do it with C++? These files have up to 1 billion strings in one and 10 million in the other, so O(N^2) is extremely inefficient.
ex f1
hello
josh
cory
sam
don
f2
jack
josh
joey
sam
neda
etc
outputfile:
jack
joey
neda
etc
To be clear, I am NOT trying to merge them and then remove duplicates; I only want strings that appear in file 1 removed from file 2.
thanks

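fgrep is handy for this: it will grep one file for a set of fixed strings.
fgrep -f f1 -v f2 will print out all lines in f2 that contain none of the strings in f1. Add -x (fgrep -x -f f1 -v f2) to compare whole lines only; without it, a line such as joshua would also be dropped because it contains josh.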

You can solve this task by using the Aho-Corasick string matching algorithm. It is used for multiple-keyword search across text and its time complexity is linear.
There are some C++ implementations of this algorithm on the net. For example this.
In addition, there is a nice-looking python library for this.
However, I'm not sure if the memory complexity is OK when using those sources/libraries. You may have to read the input from the first file in chunks (as it may have billions of characters).

You could write a C++ (or OCaml) program that reads all the words of the first file and stores them in a set of strings (using std::set<std::string> in C++, or module SS = Set.Make(String);; in OCaml). Filling that set is O(n log n), where n is the number of words (i.e. the cardinality of the set). Testing whether each of the m words in the second file belongs to that set is O(m log n).
Sets are implemented as balanced trees with a logarithmic membership test time.
However, you should probably use a database system to store (and fill) the data (e.g. PostgreSQL, MariaDB, MongoDB, CouchDB, ...).
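The cost argument can be illustrated with a short Python sketch. Python has no built-in balanced-tree set, so this substitutes a sorted list with binary search, which has the same O(log n) lookup cost; it is not the std::set code this answer has in mind, and the function names are hypothetical.

```python
from bisect import bisect_left


def build_index(words):
    """Sort and de-duplicate: O(n log n), comparable to filling an ordered set."""
    return sorted(set(words))


def contains(index, word):
    """Binary search in the sorted list: O(log n) per lookup."""
    i = bisect_left(index, word)
    return i < len(index) and index[i] == word


def difference(f2_words, index):
    """Words of f2 that are not in the index, in f2 order: O(m log n) overall."""
    return [w for w in f2_words if not contains(index, w)]


index = build_index(["hello", "josh", "cory", "sam", "don"])
print(difference(["jack", "josh", "joey", "sam", "neda", "etc"], index))
```

With the example from the question, the words josh and sam are filtered out of f2 and the remaining words keep their original order.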

Search Large Text File for Thousands of strings

I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.
I have a list of 20,000 unique strings. I want to know the offset for each string each time it appears in the file. Currently, my output looks like this:
netloader.cc found at offset: 46350917
netloader.cc found at offset: 48138591
netloader.cc found at offset: 50012089
netloader.cc found at offset: 51622874
netloader.cc found at offset: 52588949
...
360doc.com found at offset: 26411474
360doc.com found at offset: 26411508
360doc.com found at offset: 26483662
360doc.com found at offset: 26582000
I am loading the 20,000 strings into a std::set (to ensure uniqueness), then reading a 128MB chunk from the file, and then using string::find to search for the strings (start over by reading another 128MB chunk). This works and completes in about 4 days. I'm not concerned about a read boundary potentially breaking a string I'm searching for. If it does, that's OK.
I'd like to make it faster. Completing the search in 1 day would be ideal, but any significant performance improvement would be nice. I prefer to use standard C++ with Boost (if necessary) while avoiding other libraries.
So I have two questions:
Does the 4 day time seem reasonable considering the tools I'm using and the task?
What's the best approach to make it faster?
Thanks.
Edit: Using the Trie solution, I was able to shorten the run-time to 27 hours. Not within one day, but certainly much faster now. Thanks for the advice.

Algorithmically, I think the best way to approach this problem is to use a tree (a trie) that stores the strings you want to search for one character at a time. For example, suppose you have the following patterns you would like to look for:
hand, has, have, foot, file
The resulting tree would look something like this:
The generation of the tree is worst case O(n), and has a sub-linear memory footprint generally.
Using this structure, you can begin processing your file by reading one character at a time from your huge file and walking the tree.
If you get to a leaf node (the ones shown in red), you have found a match, and can store it.
If there is no child node corresponding to the letter you have read, you can discard the current line and begin checking the next line, starting from the root of the tree.
This technique checks for matches in linear time, O(n), and scans the huge 20 GB file only once.
Edit
The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode of the complete version of the algorithm
tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
# Keeps track of where I currently am in the tree
nodes = []
for character in huge_file:
    for node in nodes:
        if node.has_child(character):
            node.follow_edge(character)
            if node.isLeaf():
                # You found a match!!
        else:
            nodes.delete(node)
    if tree.has_child(character):
        nodes.add(tree.get_child(character))
Note that the list of nodes that has to be checked each time, is at most the length of the longest word that has to be checked against. Therefore it should not add much complexity.
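A runnable sketch of this idea in Python follows. It is one possible reading of the pseudocode, not the answerer's own implementation: the trie is built from nested dictionaries, and the names `build_trie`, `find_all` and `END` are hypothetical. Unlike the pseudocode, it keeps following a branch after a match, so terms that share a root (go and gone) are both reported.

```python
END = object()  # marker key stored at nodes where a complete word ends


def build_trie(words):
    """Nested dicts, one level per character; END marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def find_all(text, root):
    """Return (start_offset, word) for every occurrence of every word."""
    matches = []
    active = []  # (start_offset, node) for each partial match in progress
    for pos, ch in enumerate(text):
        active.append((pos, root))  # a match may start at any position
        advanced = []
        for start, node in active:
            child = node.get(ch)
            if child is None:
                continue  # this partial match is dead, drop it
            if END in child:
                matches.append((start, child[END]))
            advanced.append((start, child))
        active = advanced
    return matches


trie = build_trie(["hand", "has", "have", "foot", "file"])
print(find_all("he has a hand", trie))
```

The list of active partial matches is rebuilt on every character instead of being modified while it is iterated, and its length never exceeds the length of the longest search term.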

The problem you describe looks more like a problem with the selected algorithm, not with the technology of choice. 20000 full scans of 20GB in 4 days doesn't sound too unreasonable, but your target should be a single scan of the 20GB and another single scan of the 20K words.
Have you considered looking at some string matching algorithms? Aho–Corasick comes to mind.

Rather than searching 20,000 times for each string separately, you can try to tokenize the input and do lookup in your std::set with strings to be found, it will be much faster. This is assuming your strings are simple identifiers, but something similar can be implemented for strings being sentences. In this case you would keep a set of first words in each sentence and after successful match verify that it's really beginning of the whole sentence with string::find.
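A minimal Python sketch of the tokenize-and-look-up idea (the question uses C++ with std::set). It assumes the search strings are whitespace-delimited tokens; the names `find_tokens` and `base_offset` are hypothetical, with `base_offset` standing for the file position of the chunk being scanned.

```python
import re

TOKEN = re.compile(r"\S+")  # assumption: search strings contain no whitespace


def find_tokens(text, wanted, base_offset=0):
    """Yield (token, offset) for each whitespace-delimited token found in `wanted`."""
    for m in TOKEN.finditer(text):
        if m.group() in wanted:  # one hash lookup per token
            yield m.group(), base_offset + m.start()


wanted = {"netloader.cc", "360doc.com"}
chunk = "x netloader.cc y\n360doc.com netloader.cc"
for token, offset in find_tokens(chunk, wanted):
    print(f"{token} found at offset: {offset}")
```

Each chunk is scanned once regardless of how many strings are being searched for, because the per-token cost is a single set lookup.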

How to identify compressed/uncompressed bit groups?

I'm using a static dictionary file with some words and values for these words. These values are not fixed-size; for example, the is 1, love is 01, kill is 101, etc. When I try to compress a group of words, I traverse every word and look it up in the dictionary to see if a value exists for that word. If one exists, I replace the word with the value; if it doesn't, I encode the word as bytes. After compression I have a chunk of bits, and because the dictionary values and uncompressed words are not fixed-size, I cannot group the bits and decode them.
I have thought about using a 1-bit flag for every group of bits to mark it as compressed or uncompressed, but I can't locate the flag bit because the length of a codeword or regular word is unknown.
If I use a 1-byte delimiter, it still has problems. Let's say my delimiter is 00000000, and before the delimiter I have 100 and after the delimiter I have 001, so we have 10000000000001. How am I supposed to know which group of these bits is my delimiter?
Can I use some other method to group these compressed/uncompressed bits to decode them? Thank you.

First off, what language and system are you intending to deploy this on? Many languages provide their own libraries and tools for compression, which may suit your needs without major low-level design effort.
The answer here is to establish some more rigorous bookkeeping and file formatting to be able to undo the compression. Most compression systems have some amount of overhead in their file format which is why when you compress something twice you don't necessarily save anything and can actually increase the size of the file.
Often file formats take advantage of a header at the start of the file to provide key information, which would be a good place to define any rules that are specific to the compressed file.
create a fixed-size delimiter to use between code words only. This can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than using a fixed known value, include it as one of your header items.
keep your header in a simple ASCII format so that you can easily extract it with standard tools like sscanf and fscanf.
if you want to have a header that can contain extra information you may need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and still easily identifiable.
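The header suggestions can be sketched as a tiny container format. This is an illustrative Python sketch under stated assumptions, not a format prescribed by this answer: the `key=value` line layout, the field names in the demo, and the functions `pack` and `unpack` are hypothetical, and header keys and values are assumed never to contain the marker text.

```python
HEADER_END = b"ENDHEADER\n"


def pack(fields, payload):
    """ASCII 'key=value' header lines, then the end marker, then raw payload bytes."""
    header = "".join(f"{key}={value}\n" for key, value in fields.items())
    return header.encode("ascii") + HEADER_END + payload


def unpack(blob):
    """Split a blob produced by pack() back into (fields, payload)."""
    # Only the first marker ends the header, so the payload may contain it too.
    head, marker, payload = blob.partition(HEADER_END)
    if not marker:
        raise ValueError("missing ENDHEADER marker")
    fields = dict(line.split("=", 1) for line in head.decode("ascii").splitlines())
    return fields, payload


blob = pack({"delimiter_bits": 8, "word_count": 2}, b"\x80\x01")
fields, payload = unpack(blob)
print(fields, payload)
```

Header values come back as strings, so the reader converts them to numbers as needed before decoding the payload.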
