How to speed up regex searching for a large quantity of potentially large files in C++?

I'm trying to make a program that reads user-supplied wildcard file paths and wildcard search strings, using an Excel document as a configuration file. For example, the user may enter C:\Read*.txt, and any text files on the C drive whose names start with Read, followed by any characters, will be included in the search.
They could search for Message: * and all strings beginning with "Message: " and ending with any sequence of characters would get matched.
So far it is a working program, but it is very slow, and I need it to be able to search very large files. I'm using a file stream and the regex class to do so, and I'm not sure what is taking so much time.
The bulk of the time in my code is being spent in the following loop (I've included the lines above the while loop only so you can better understand what I'm trying to do):
smatch matches;
vector<regex> expressions;
for (int i = 0; i < regex_patterns.size(); i++){expressions.emplace_back(regex_patterns.at(i));}
auto startTimer = high_resolution_clock::now();
// Open file and begin reading
ifstream stream1(filePath);
if (stream1.is_open())
{
int count = 0;
while (getline(stream1, line))
{
// Continue to next step if line is empty, no point in searching it.
if (line.empty())
{
continue;
}
// Loop through each search string, if match, save line number and line text,
for (int i = 0; i < expressions.size(); i++)
{
// regex_search returns a bool, so test it directly.
if (regex_search(line, matches, expressions.at(i)))
{
lineNumb.push_back(count);
lineTextToSave.push_back(line);
}
}
count = count + 1;
}
}
auto stopTimer = high_resolution_clock::now();
auto duration2 = duration_cast<milliseconds>(stopTimer - startTimer);
cout << "Time to search file: " << duration2.count() << "\n";
Is there a better method of searching files than this? I tried looking up many things but haven't found a programmatic example that I've understood thus far.

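Some ideas, in order of priority:
You could join all the regex patterns into a single regex instead of matching r regexes against each line, for example (R1)|(R2)|(...)|(Rr). Each line then needs one regex_search call rather than r, which can speed up the search by up to a factor of r; the actual gain depends on the regex implementation.
Ensure you compile each regex once, before the loop, rather than on every use.
Do not add the final .* to your regex pattern. regex_search already matches anywhere in the line, so a trailing .* only adds work.
Some further ideas, which are non-portable:
Memory-map the file instead of reading it through iostreams.
Consider whether it is worth reimplementing grep instead of calling grep through popen().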
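As a rough illustration of the first tip, the patterns can be joined into one alternation before the loop. This is only a sketch: combine_patterns and line_matches are hypothetical helper names, every pattern is assumed to be valid ECMAScript regex syntax on its own, and the non-capturing groups plus the nosubs flag assume that only a yes/no answer per line is needed.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: joins the patterns into (?:P1)|(?:P2)|... and compiles
// the result once. nosubs asks the library not to record sub-matches.
// Note: an empty pattern list yields an empty regex, which matches every line.
std::regex combine_patterns(const std::vector<std::string>& patterns)
{
    std::string joined;
    for (std::size_t i = 0; i < patterns.size(); ++i)
    {
        if (i != 0) joined += '|';
        joined += "(?:" + patterns[i] + ")";
    }
    return std::regex(joined, std::regex::ECMAScript | std::regex::optimize | std::regex::nosubs);
}

// True if any of the combined patterns occurs somewhere in the line.
bool line_matches(const std::string& line, const std::regex& combined)
{
    return std::regex_search(line, combined);
}
```

In the loop from the question, the inner for loop over expressions would then shrink to a single line_matches(line, combined) call. Whether this is actually faster should be measured, since it depends on the standard library's regex engine.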

Related

Parsing data from a file

I have this project due; however, I am unsure how to parse the data into the word, its part of speech, and its definition. I know that I should make use of the tab spacing to read it, but I have no idea how to implement it. Here is an example of the file:
Recollection n. The power of recalling ideas to the mind, or the period within which things can be recollected; remembrance; memory; as, an event within my recollection.
Nip n. A pinch with the nails or teeth.
Wodegeld n. A geld, or payment, for wood.
Xiphoid a. Of or pertaining to the xiphoid process; xiphoidian.
NB: Each word and part of speech and definition is one line in a text file.
If you can be sure that the definition always follows the first period on a line, you could use an implementation like this. It will break if a line ever contains more than two periods.
string str;
vector<pair<string,string>> v; // <word, definition>
while (getline(fileStream, str, '.')) { // grab text up to the first '.', e.g. "Nip n"
    str.erase(0, str.find_first_not_of(" \t\r\n")); // drop the newline left from the previous record
    if (str.empty()) continue; // only whitespace left, e.g. at end of file
    size_t lastSpace = str.find_last_of(" \t");
    if (lastSpace != string::npos) str.erase(lastSpace); // get rid of n, v, etc. from word
    v.push_back(make_pair(str, string())); // push the word
    getline(fileStream, str, '.'); // grab the next part of the line
    v.back().second = str; // store definition in last added element
}
for (const auto& x : v) { // check your results
    cout << "word -> " << x.first << endl;
    cout << "definition -> " << x.second << endl << endl;
}
The better solution would be to learn Regular Expressions. It's a complicated topic but absolutely necessary if you want to learn how to parse text efficiently and properly:
http://www.cplusplus.com/reference/regex/
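For what the regular expression route might look like, here is a minimal sketch with std::regex. It assumes each line has the shape shown in the question: a word without whitespace, a part-of-speech token ending in a period, then the definition, separated by spaces or tabs. The names Entry and parse_entry are made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical record type for one dictionary line.
struct Entry
{
    std::string word;
    std::string partOfSpeech;
    std::string definition;
};

// Splits "Nip n. A pinch with the nails or teeth." into its three parts.
// Returns false if the line does not have the assumed shape.
bool parse_entry(const std::string& line, Entry& out)
{
    // word, whitespace, token ending in '.', whitespace, rest of the line
    static const std::regex pattern(R"(^(\S+)\s+(\S+\.)\s+(.*)$)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return false;
    out.word = m[1].str();
    out.partOfSpeech = m[2].str();
    out.definition = m[3].str();
    return true;
}
```

Calling parse_entry on every line read with getline keeps the periods inside the definition intact, which the period-delimited approach above cannot do.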

Notepad++ or UltraEdit: regex remove special duplicates

I need to remove duplicates if a line has the form
key = anything
but NOT if it has the form
key=anything
The key can be anything, too.
E.g.
edit_home=home must stay in place
while
edit_home = home (or any other string) must be removed IF edit_home is a duplicate,
for all the lines of the document.
Thank you.
p.s. clearer example:
one=you are
two=we are
three_why=8908908
one = good
two = fine
three_4 = best
three_why = win
from that list i only need to keep:
one=you are
two=we are
three_why=8908908
three_4 = best // because three_4 doesn't have a duplicate
I found a method to do it, but I would need better search-list support via regex, a plugin, or a direct regex (which I don't know).
That is: I have two files to compare.
One has the full set of keys, the other an incomplete set.
I merge all the keys from the first file with those of the second into a new file, in groups (because the keys come in groups, e.g. many keys titled one, many titled two, and so on). Then I regex-replace all the keys in the new file:
find (.*)(\s\=\s) replace with \1\=
So they all become key=anything
Then I replace everything after = with empty to isolate the keys.
Then remove the duplicates.
At this point I have trouble to do something like
^.*(^keyone\b|^keytwo\b|^keythree\b).*$
to find all the keys I need in the document, so that I can then select them all and replace them with the correct keys.
Why? Because in this example there are only 3 keys, BUT in reality there are many, and the find field breaks at a certain point.
How do I do it right?
Update: I found the Toolbucket plugin, which allows searching for many strings, but another issue is that in addition to the duplicate, I also have to remove the original.
That is, if I find the same key "one" 2 times, I have to remove all the lines containing one.
Ctrl + F
Find tab
Find what: ^.*\S=\S.*$
Find All in Current Document
Copy result from result window to a new window (the list of Line 1: Line 2: Line 3: ...)
Ctrl + F
Replace tab
(the following will remove the leading "Line number:" from every line)
Find what: ^.*?\d:\s
Replace with: Empty
OK, after all that I wrote, one solution (once I have the merged keys) could be
(?m)^(.*)$(?=\r?\n^(?!\1).*(?s).*?\1)
With this I can mark/highlight all the duplicated keys :-) so I can then deal with those only, removing them from the first list and adding what remains to the second file...
A solution with a direct regex would be really appreciated.
Here is a commented UltraEdit script for this task.
// Note: This script does not work for large files, because for fast
// processing it loads the entire file content into scripting memory,
// which is very limited even with multiple GB of RAM installed.
if (UltraEdit.document.length > 0) // Is any file opened?
{
// Define environment for this script and select entire file content.
UltraEdit.insertMode();
UltraEdit.columnModeOff();
UltraEdit.activeDocument.selectAll();
// Determine line termination used currently in active file.
var sLineTerm = "\r\n";
if (typeof(UltraEdit.activeDocument.lineTerminator) == "number")
{
// The two lines below require UE v16.00 or UES v10.00 or later.
if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
}
else // This version of UE/UES does not offer line terminator property.
{
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\n"; // Not DOS, perhaps UNIX.
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\r"; // Also not UNIX, perhaps MAC.
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\r\n"; // No line terminator, use DOS.
}
}
}
}
// Get all lines of active file into an array of strings
// with each string being one line from active file.
var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
var nTotalLines = asLines.length;
// Process each line in the array.
for(var nCurrentLine = 0; nCurrentLine < asLines.length; nCurrentLine++)
{
// Skip all lines not containing or starting with an equal sign.
if (asLines[nCurrentLine].indexOf('=') < 1) continue;
// Get string left to equal sign with tabs/spaces trimmed.
var sKey = asLines[nCurrentLine].replace(/^[\t ]*([^\t =]+).*$/,"$1");
// Skip lines beginning with just tabs/spaces left to equal sign.
if (sKey.length == asLines[nCurrentLine].length) continue;
// Build the regular expression for the search in all other lines.
var rRegSearch = new RegExp("^[\\t ]*"+sKey+"[\\t ]*=","g");
// Check all remaining lines for a line also starting with
// this key string (case-sensitive) left of an equal sign.
var nLineCompare = nCurrentLine + 1;
while(nLineCompare < asLines.length)
{
// Does this line also have this key left of the equal
// sign with or without surrounding spaces/tabs?
if (asLines[nLineCompare].search(rRegSearch) < 0)
{
nLineCompare++; // No, continue on next line.
}
else // Yes, remove this line from array.
{
asLines.splice(nLineCompare,1);
}
}
}
// Was any line removed from the array?
if (nTotalLines == asLines.length)
{
UltraEdit.activeDocument.top(); // Cancel the selection.
UltraEdit.messageBox("Nothing found to remove!");
}
else
{
// If this version of UE/UES supports direct write to the clipboard,
// use user clipboard 9 to paste the lines into the file, overwriting
// everything, as this is much faster than using the write command
// in older versions of UE/UES.
if (typeof(UltraEdit.clipboardContent) == "string")
{
var nActiveClipboard = UltraEdit.clipboardIdx;
UltraEdit.selectClipboard(9);
UltraEdit.clipboardContent = asLines.join(sLineTerm);
UltraEdit.activeDocument.paste();
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(nActiveClipboard);
}
else UltraEdit.activeDocument.write(asLines.join(sLineTerm));
var nRemoved = nTotalLines - asLines.length;
UltraEdit.activeDocument.top();
UltraEdit.messageBox("Removed " + nRemoved + " line" + ((nRemoved != 1) ? "s" : "") + " on updated file.");
}
}
Copy this code and paste it into a new ASCII file using DOS line terminators in UltraEdit.
Next use command File - Save As to save the script file for example with name RemoveDuplicateKeys.js into %AppData%\IDMComp\UltraEdit\MyScripts or wherever you want to have saved your UltraEdit scripts.
Open Scripting - Scripts and add the just saved UltraEdit script to the list of scripts. You can enter a description for this script, too.
Open the file with the list, or make this file active if it is already opened in UltraEdit.
Run the script by clicking on it in menu Scripting, or by opening Views - Views/Lists - Script List and double clicking on the script.
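For comparison outside the editor, the rule from the question's "clearer example" (keep every key=value line, and drop a key = value line when its key occurs more than once) could be sketched in C++ roughly as follows. This is an illustration only: trim and drop_spaced_duplicates are hypothetical names, a line counts as the spaced form when a space or tab sits directly before the equal sign, and keys are compared case-sensitively.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Removes leading and trailing spaces/tabs.
std::string trim(const std::string& s)
{
    const std::size_t first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Keeps all lines except "key = value" lines whose key appears more than once.
std::vector<std::string> drop_spaced_duplicates(const std::vector<std::string>& lines)
{
    // First pass: count how often each key occurs, in either form.
    std::map<std::string, int> occurrences;
    for (const std::string& line : lines)
    {
        const std::size_t eq = line.find('=');
        if (eq != std::string::npos) ++occurrences[trim(line.substr(0, eq))];
    }

    // Second pass: drop the spaced form of every key seen more than once.
    std::vector<std::string> kept;
    for (const std::string& line : lines)
    {
        const std::size_t eq = line.find('=');
        const bool spaced = eq != std::string::npos && eq > 0 &&
                            (line[eq - 1] == ' ' || line[eq - 1] == '\t');
        if (spaced && occurrences[trim(line.substr(0, eq))] > 1) continue;
        kept.push_back(line);
    }
    return kept;
}
```

Applied to the seven example lines from the question, this keeps the three key=value lines plus three_4 = best. It does not cover the later update, where the original line has to go as well.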

Is there a faster way to split a text file when a certain token is found?

I have a 7GB text file comprised of multi line records that are delimited with a line that only contains the token "$$$$".
I wrote a method to split it by parsing a line at a time, testing for the token, and splitting accordingly. The idea is to write each multi line record to different output files in round robin fashion. My code is below:
// Open all temp files for writing
int nThreads = threadData.size();
std::vector<ofstream*> ostrms(nThreads);
for (int i = 0; i < nThreads; ++i)
{
ostrms[i] = new ofstream(threadData[i].InFileName);
if (! ostrms[i]->is_open() )
return(false);
}
// parse mol records into temp files in round-robin fashion
std::vector<std::string> molRecord;
std::string line;
const std::string MOL_END_OF_RECORD = "$$$$";
int curOutfileNo = 0;
while (std::getline(strm, line)) // stop as soon as a read fails instead of testing eof() beforehand
{
if (line.find(MOL_END_OF_RECORD) != std::string::npos)
{
for (int i = 0; i < molRecord.size(); ++i)
*(ostrms[curOutfileNo]) << molRecord[i] << "\n";
(*ostrms[curOutfileNo]) << line << "\n";
curOutfileNo = (curOutfileNo+1) % nThreads;
molRecord.clear();
}
else
molRecord.push_back(line);
}
for (int i = 0; i < nThreads; ++i)
delete ostrms[i];
This runs very slowly (several minutes). Is there a faster way?
The 7GB text file has 245,634,858 lines and 466,537 unique records delimited by "$$$$".
If you are absolutely sure that your splitting lines contain exactly $$$$ without any prefix or suffix characters (e.g. spaces), you might replace
if (line.find(MOL_END_OF_RECORD) != std::string::npos)
with
if (line == MOL_END_OF_RECORD)
but I don't think it matters a lot.
If spending a day on improving the coding is worth the effort (I believe that it is not), and assuming a Linux system, you could use with care some clever combination of low-level syscalls like read(2) with a large buffer of at least 64 Kbytes, mmap(2) on multi-megabyte ranges, posix_fadvise(2), readahead(2) (in a separate thread), ...
If you access the same file (with a constant content) several times, you might consider preprocessing (or pre-digesting) it e.g. to fill some GDBM indexed file, or some Sqlite (or other) "database", and have your real application use these. You could also simply compute some "index" file containing the offset of every $$$$ delimiter.
As I commented, you should consider the time(1) taken by utilities like wc(1) as a reasonable lower bound on execution time. I guess they would show you that (on your particular system) the program is in fact I/O bound.
BTW, if your machine has more than e.g. 10Gbytes of RAM, you could simply wc yourhugefile before running your program. The wc process will fill the file system RAM cache with your file's data. See http://www.linuxatemyram.com/
We can't help much more unless you explain what the huge data is, how often it changes, and what your application does with it.
You could also buy more RAM and/or some SSD...

Unknown reason behind out_of_range error for substring

UPDATE: Yes, answered and solved. I also then managed to find the issue with the output that was the real problem I was having. I had thought the substring error was behind it, but I was wrong, as when that had been fixed, the output issue persisted. I found that it was a simple mix up in the calculations. I had been subtracting 726 instead of 762. I could've had this done hours ago... Lulz. That's all I can say... Lulz.
I am teaching myself C++ (with the tutorial from their website). I have jumped ahead from time to time when I needed to do something I could not do with what I had learned so far. Additionally, I wrote this relatively quickly. So, if my code looks inelegant or otherwise unacceptable at a professional level, please do excuse that for now. My only current purpose is to get this question answered.
This program takes each line of a text file I have. Note that the text file's lines look like this:
.123.456.789
It has 366 lines. The program I first wrote to deal with this had me input each of the three numbers for each line manually. As I'm sure you can imagine, that was extremely inefficient. This program's purpose is to take each number out of the text file and perform functions and output the results to another text file. It does this per line until it reaches the end of the file.
I have read up more on what could cause this error, but I cannot find the cause of it in my case. Here is the bit of the code that I believe to contain the cause of the problem:
int main()
{
double a;
double b;
double c;
double d;
double e;
string search; //The string for lines fetched from the text file
string conversion;
string searcha; //Characters 1-3 of search are inserted to this string.
string searchb; //Characters 5-7 of search are inserted to this string.
string searchc; //Characters 9-11 of search are inserted to this string.
string subsearch; //Used with the substring to fetch individual characters.
string empty;
fstream convfil;
convfil.open("/home/user/Documents/MPrograms/filename.txt", ios::in);
if (convfil.is_open())
{
while (convfil.good())
{
getline(convfil,search); //Fetch line from text file
searcha = empty;
searchb = empty;
searchc = empty;
/*From here to the end seems to be the problem.
I provided code from the beginning of the program
to make sure that if I were erring earlier in the code,
someone would be able to catch that.*/
for (int i=1; i<4; ++i)
{
subsearch = search.substr(i,1);
searcha.insert(searcha.length(),subsearch);
a = atof(searcha.c_str());
}
for (int i=5; i<8; ++i)
{
subsearch = search.substr(i,1);
searchb.insert(searchb.length(),subsearch);
b = atof(searchb.c_str());
}
for (int i=9; i<search.length(); ++i)
{
subsearch = search.substr(i,1);
searchc.insert(searchc.length(),subsearch);
c = atof(searchc.c_str());
}
I usually teach myself how to get around these issues when they come up by looking at references and problems other people may have had, but I couldn't find anything that helped me in this instance. I have tried numerous variations upon this, but as the issue has something to do with the substring and I couldn't get rid of the substring in any of these variations, all returned the same error and the same result in the output file.
This is a problem:
while (convfil.good()) {
getline(convfil,search); //Fetch line from text file
You test for failure before you do the operation that can fail. When getline does fail, you're already inside the loop.
As a result, your code tries to process an invalid record at the end.
Instead try
while (getline(convfil,search)) { //Fetch line from text file
or even
while (getline(convfil,search) && search.length() > 9) {
which will also stop without error if there's a blank line at the end of the file.
It's possible you are reading a blank line at the end of the file and trying to process it.
Test for an empty string before processing it.
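Putting both answers together, a guarded version of the per-line parsing might look like the sketch below. It assumes every valid line has the fixed layout .123.456.789 from the question (a period at positions 0, 4, and 8, at least 12 characters); parse_triple is a hypothetical name.

```cpp
#include <cstdlib>
#include <string>

// Parses a line shaped like ".123.456.789" into three numbers.
// Returns false (leaving a, b, c untouched) for blank or malformed lines,
// so substr is never called with an out-of-range position.
bool parse_triple(const std::string& line, double& a, double& b, double& c)
{
    if (line.size() < 12 || line[0] != '.' || line[4] != '.' || line[8] != '.')
        return false;
    a = std::atof(line.substr(1, 3).c_str());
    b = std::atof(line.substr(5, 3).c_str());
    c = std::atof(line.substr(9).c_str()); // rest of the line, as in the original loop
    return true;
}
```

Used as while (getline(convfil, search)) { if (!parse_triple(search, a, b, c)) continue; ... }, a blank line at the end of the file is skipped instead of throwing out_of_range.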

Regular expression slow

I am trying to parse a build log file to get some information, using regular expressions. I am trying to use a regular expression like ("( {9}time)(.+)(c1xx\\.dll+)(.+)s") to match a line like time(D:\Program Files\Microsoft Visual Studio 11.0\VC\bin\c1xx.dll)=0.047s
This takes about 120 s to complete on a file of about 19,000 lines, some of which are pretty long. The basic problem is that when I cut the number of lines roughly in half (to about 9,000) using some conditions, it did not change anything; it actually made it worse. I do not understand this: if I remove the regular expressions altogether, just scanning the file takes about 6 s, which means the regular expressions are the main time consumer here. So why does the time not go down at least somewhat when I remove half of the lines?
Also, can anyone tell me which kind of regular expression is faster, a more generic one or a more specific one? For example, I can also match this line time(D:\Program Files\Microsoft Visual Studio 11.0\VC\bin\c1xx.dll)=0.047s uniquely in the file using the regex ("(.+)(c1xx.dll)(.+)"), but that makes the whole thing run even slower, whereas something like ("( {9}time)(.+)(c1xx\\.dll+)(.+)") makes it run slightly faster.
I am using the C++11 regex library, mostly the regex_match function.
regex c1xx("( {9}time)(.+)(c1xx\\.dll+)(.+)s");
auto start = system_clock::now();
int linecount = 0;
while (getline(inFile, currentLine))
{
if (regex_match(currentLine, c1xx)) // use the regex declared above
{
linecount++;
// Do something, just insert it into a vector
}
}
auto end = system_clock::now();
auto elapsed = duration_cast<milliseconds>(end - start);
cout << "Time taken for parsing first log = " << elapsed.count() << " ms" << " lines = " << linecount << endl;
Output:
Time taken for parsing first log = 119416 ms lines = 19617
regex c1xx("( {9}time)(.+)(c1xx\\.dll+)(.+)s");
auto start = system_clock::now();
int linecount = 0;
while (getline(inFile, currentLine))
{
if (currentLine.size() > 200)
{
continue;
}
if (regex_match(currentLine, c1xx)) // use the regex declared above
{
linecount++;
// Do something, just insert it into a vector
}
}
auto end = system_clock::now();
auto elapsed = duration_cast<milliseconds>(end - start);
cout << "Time taken for parsing first log = " << elapsed.count() << " ms" << " lines = " << linecount << endl;
Output:
Time taken for parsing first log = 131613 ms lines = 9216
Why is it taking more time in the second case?
On "So why the does not go at least some amount lower when I removed half of the lines." and "Why its taking more time in the second case ?":
It is conceivable that the regex library is somehow able to reject lines more efficiently than your size check. It is also possible that the additional branch in your while loop is upsetting the CPU's branch prediction, so you are not getting optimal instruction pipelining/prefetching. Both are guesses; profiling would tell.
On "Also, can anyone tell me what kind of regular expression is faster, more generic one or more specific one.":
If the expression ("(.+)(c1xx.dll)(.+)") would work, I believe (".+c1xx\\.dll.+") would also work, and without the capture groups regex won't bother saving match positions for you.
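A small sketch of that suggestion follows. It uses the group-free pattern from above, compiles it once with the nosubs flag, and adds a plain substring check in front of the regex. The pre-check is an addition of this sketch, not something from the question, and matches_c1xx is a made-up name.

```cpp
#include <regex>
#include <string>

// True if the whole line matches .+c1xx\.dll.+ (no capture groups).
bool matches_c1xx(const std::string& line)
{
    // Compiled once; nosubs asks the library not to record sub-matches.
    static const std::regex pattern(".+c1xx\\.dll.+",
                                    std::regex::ECMAScript | std::regex::optimize | std::regex::nosubs);
    // Cheap rejection: a line without the literal text cannot match.
    if (line.find("c1xx.dll") == std::string::npos) return false;
    return std::regex_match(line, pattern);
}
```

In a build log where most lines do not mention c1xx.dll, the substring check means the regex only runs on the few candidate lines. Whether that helps in practice is something to measure on the real log.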