Search Large Text File for Thousands of strings - c++

I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.
I have a list of 20,000 unique strings. I want to know the offset for each string each time it appears in the file. Currently, my output looks like this:
netloader.cc found at offset: 46350917
netloader.cc found at offset: 48138591
netloader.cc found at offset: 50012089
netloader.cc found at offset: 51622874
netloader.cc found at offset: 52588949
...
360doc.com found at offset: 26411474
360doc.com found at offset: 26411508
360doc.com found at offset: 26483662
360doc.com found at offset: 26582000
I am loading the 20,000 strings into a std::set (to ensure uniqueness), then reading a 128MB chunk from the file, and then using string::find to search for the strings (start over by reading another 128MB chunk). This works and completes in about 4 days. I'm not concerned about a read boundary potentially breaking a string I'm searching for. If it does, that's OK.
I'd like to make it faster. Completing the search in 1 day would be ideal, but any significant performance improvement would be nice. I prefer to use standard C++ with Boost (if necessary) while avoiding other libraries.
So I have two questions:
Does the 4 day time seem reasonable considering the tools I'm using and the task?
What's the best approach to make it faster?
Thanks.
Edit: Using the Trie solution, I was able to shorten the run-time to 27 hours. Not within one day, but certainly much faster now. Thanks for the advice.

Algorithmically, I think that the best way to approach this problem, would be to use a tree in order to store the lines you want to search for a character at a time. For example if you have the following patterns you would like to look for:
hand, has, have, foot, file
The resulting tree would look something like this:
The generation of the tree is worst case O(n), and has a sub-linear memory footprint generally.
Using this structure, you can begin process your file by reading in a character at a time from your huge file, and walk the tree.
If you get to a leaf node (the ones shown in red), you have found a match, and can store it.
If there is no child node, corresponding to the letter you have red, you can discard the current line, and begin checking the next line, starting from the root of the tree
This technique would result in linear time O(n) to check for matches and scan the huge 20gb file only once.
Edit
The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode of the complete version of the algorithm
tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
# Keeps track of where I'm currently in the tree
nodes = []
for character in huge_file:
foreach node in nodes:
if node.has_child(character):
node.follow_edge(character)
if node.isLeaf():
# You found a match!!
else:
nodes.delete(node)
if tree.has_child(character):
nodes.add(tree.get_child(character))
Note that the list of nodes that has to be checked each time, is at most the length of the longest word that has to be checked against. Therefore it should not add much complexity.

The problem you describe looks more like a problem with the selected algorithm, not with the technology of choice. 20000 full scans of 20GB in 4 days doesn't sound too unreasonable, but your target should be a single scan of the 20GB and another single scan of the 20K words.
Have you considered looking at some string matching algorithms? Aho–Corasick comes to mind.

Rather than searching 20,000 times for each string separately, you can try to tokenize the input and do lookup in your std::set with strings to be found, it will be much faster. This is assuming your strings are simple identifiers, but something similar can be implemented for strings being sentences. In this case you would keep a set of first words in each sentence and after successful match verify that it's really beginning of the whole sentence with string::find.

Related

Rope data structure & Lines

I'm using a Rope to store a large amount (GB's) of text. The text can be tens of millions of lines long.
The rope itself is extremely fast inserting at any position, and is also fast getting a character at a specific position.
However, how would I get where a specific line (\n for this case) starts? For example, how would I get where line 15 starts? There are a couple options that I can see.
Don't have any extra data. Whenever you want say the 15th line, you iterate through all the characters in the Rope, find the newlines, and when you reach the 15th newline, then you stop.
Store the start and length of each line in a vector. So you would have your Rope data structure containing all the characters, and then a separate std::vector<line>. The line structure would just consist of 2 fields; start and length. Start represents where the line starts inside of the Rope, and length is the length of the line. To get where the 15th line starts, just do lines[14].start
Problems:
#1 is a horrible way to do it. It's extremely slow because you have to go through all of the characters.
#2 is also not good. Although getting where a line starts is extremely fast (O(1)), every time you insert a line, you have to move all the lines ahead of it, which is O(N). Also, storing this means that for every line you have, it takes up an extra 16 bytes of data. (assuming start and length are 8 bytes each). That means if you have 13,000,000 lines, it would take up 200MB of extra memory. You could use a linked list, but it just makes the access slow.
Is there any better & more efficient way of storing the line positions for quick access & insert? (Preferably O(log(n)) for inserting & accessing lines)
I was thinking of using a BST, and more specifically a RB-Tree, but I'm not entirely sure how that would work with this. I saw VSCode do this but with a PieceTable instead.
Any help would be greatly appreciated.
EDIT:
The answer that #interjay provided seems good, but how would I handle CRLF if the CR and LF were split between 2 leaf nodes?
I also noticed ropey, which is a rust library for the Rope. I was wondering if there was something similar but for C++.
In each rope node (both leaves and internal nodes), in addition to holding the number of characters in that subtree, you can also put the total number of newlines contained in the subtree.
Then finding a specific newline will work exactly the same way as finding the node holding a specific character index. You would look at the "number of newlines" field instead of the "number of characters" field.
All rope operations will work mostly the same. When creating a new internal node, you just need to add its children's number of newlines. Complexity of all operations is the same.

regex to match max of input numeric string

I have incoming input entries.
Like these
750
1500
1
100
25
55
And There is an lookup table like given below
25
7
5
75
So when I will receive my first entry, in this case its 750. So this will look up into lookup entry table will try to match with a string which having max match from left to right.
So for 750, max match case would be 75.
I was wondering, Is that possible if we could write a regex for this kind of scenario. Because if I choose using startsWith java function It can get me output of 7 as well.
As input entries will be coming from text file one by one and all lookup entries present in file different text file.
I'm using java language.
May I know how can I write a regex for this flavor..?
This doesn't seem like a regex problem at first, but you could actually solve it with a regex, and the result would be pretty efficient.
A regex for your example lookup table would be:
/^(75?|5|25)/
This will do what you want, and it will avoid the repeated searches of a naive "check every one" approach.
The regex would get complicated,though, as your lookup table grew. Adding a couple of terms to your lookup table:
25
7
5
75
750
72
We now have:
/^(7(50?|2)?|5|25)/
This is obviously going to get complicated quickly. The trick would be programmatically constructing the appropriate regex for arbitrary data--not a trivial problem, but not insurmountable either.
That said, this would be an..umm...unusual thing to implement in production code.
I would be hesitant to do so.
In most cases, I would simply do this:
Find all the strings that match.
Find the longest one.
(?: 25 | 5 | 75? )
There is free software that automatically makes full blown regex trie for you.
Just put the output regex into a text file and load it instead.
If your values don't change very much, this is a very fast way to do a lookup.
If it does change, generate another one.
Whats good about a full blown trie here, is that it never takes more than 8
steps to match.
The one I just did http://imgur.com/a/zwMhL
App screenshot
Even a 175,000 Word Dictionary takes no more than 8 steps.
Internally the app initially makes a ternary tree from the input
then converts it into a full blown regex trie.

Clojure dictionary of words

I want a dictionary of English words available, to pick random english words. I have a dictionary text file that I downloaded form the internet which has almost 1 million words, what's the best way to go about using this list in Clojure, given that most of the time I'll only need 1 randomly selected word?
Edit:
To answer the comments, this is for some tests which I may turn into load tests which is why I want a decent number of random words and I guess access speed is the most important thing. I do not want to use a database for this. I originally thought of a dictionary just because that's the first thing that popped into my mind but I think a random sequence of letters and numbers would be good enough, perhaps I will just use a UUID as a string.
Read all the words into a Vector and then call rand-nth , e.g.
(rand-nth all-words)
rand-nth uses the nth function on the underlying data structure and Clojure Vectors have log32N performance for index based retrieval.
Edit: This is assuming that it is for a test environment as you described in your question. A more memory efficient method would be to use RandomAccessFile and seek to a random location in the file of words, read until you find the first word delimiter (e.g. comma, EOL) and then read the following bytes until the next delimiter which will give you a random word.

eval failing to match regex after sometime

I get first input from user which is a tree (having significant height and depth) of nodes. Each of the node contains a regex and modifiers. This tree gets saved in memory. This is taken only once at the application startup.
The second input is a value which is matched starting at the root node of the tree till an exact matching leaf node is found (Depth First Search). The match is determined as follows :
my $evalstr = <<EOEVAL;
if(\$input_value =~ /\$node_regex/$node_modifiers){
1;
}else{
-1;
}
EOEVAL
no strict 'refs';
my $return_value = eval "no strict;$evalstr";
The second input is provided continuously throughout the application's life time by a source.
problem:
The above code works very well for some time (approx. 10 hours), but after continuous input for this time, the eval continuously starts failing and I get -1 in $return_value. All other features of the application work very fine including other comparison statements.If I restart the application, the matching again starts and gives proper results.
Observations:
1) I get deep recursion warning many times, but I read somewhere it is normal as stack size for me would be more than 100 many a times, considering the size of the input tree.
2) If I use simple logic for regex match without eval as above, I don't get any issue for any continuous run of the application.
if($input_value =~ /$node_regex/){
$return_value=1;
}else{
$return_value=-1;
}
but then I have to sacrifice dynamic modifiers, as per Dynamic Modifiers
Checks:
1) I checked $# but it is empty.
2) Also printed the respective values of $input_value,$node_regex and $node_modifiers, they are correct and should have matched the value with regex at the failure point.
3) I checked for memory usage, but it's fairly constant over the time for the perl process.
4) Was using perl 5.8.8 then updated it to 5.12, but still face the same issue.
Question :
What could be the cause of above issue? Why it fails after some time, but works well when the application is restarted?
A definitive answer would require more knowledge of perl internals than I have. But given what you are doing, continuous parsing of large trees, it seems safe to assume that some limit is being reached, some resource is exhausted. I would take a close look at things and make sure that all resources are being released between each iteration of a parse. I would be especially concerned with circular references in the complex structures, and making sure that there are none.

Remove duplicates from two large text files using unordered_map

I am new to a lot of these C++ libraries, so please forgive me if my questions comes across as naive.
I have two large text files, about 160 MB each (about 700000 lines each). I need to remove from file2 all of the duplicate lines that appear in file1. To achieve this, I decided to use unordered_map with a 32 character string as my key. The 32 character string is the first 32 chars of each line (this is enough to uniquely identify the line).
Anyway, so I basically just go through the first file and push the 32 char substring of each line into the unordered_map. Then I go through the second file and check whether the line in file2 exists in my unordered_map. If it doesn't exist, the I write the full line to a new text file.
This works fine for the smaller files.. (40 MB each), but for this 160 MB files.. it takes very long to insert into the hashtable (before I even start looking at file2). At around 260,000 inserts.. it seems to have halted or is going very slow. Is it possible that I have reached my memory limitations? If so, can anybody explain how to calculate this? If not, is there something else that I could be doing to make it faster? Maybe choosing a custom hash function, or specifying some parameters that would help optimize it?
My key object pair into the hash table is (string, int), where the string is always 32 chars long, and int is a count I use to handle duplicates.
I am running a 64 bit Windows 7 OS w/ 12 GB RAM.
Any help would be greatly appreciated.. thanks guys!!
You don't need a map because you don't have any associative data. An unordered set will do the job. Also, I'd go with some memory efficient hash set implementation like Google's sparse_hash_set. It is very memory efficient and is able to store contents on disk.
Aside from that, you can work on smaller chunks of data. For example, split your files into 10 blocks, remove duplicates from each, then combine them until you reach the a single block with no duplicates. You get the idea.
I would not write a C++ program to do this, but use some existing utilities.
In Linux, Unix and Cygwin, perform the following:
cat the two files into 1 large file:
# cat file1 file2 > file3
Use sort -u to extract the unique lines:
# sort -u file3 > file4
Prefer to use operating system utilities rather than (re)writing your own.