Regular expression slow - c++

I am trying to parse a build log file to extract some information using regular expressions. I use a regular expression like ("( {9}time)(.+)(c1xx\\.dll+)(.+)s") to match a line like time(D:\Program Files\Microsoft Visual Studio 11.0\VC\bin\c1xx.dll)=0.047s
This takes about 120 s to complete on a file of about 19,000 lines, some of which are quite long. The basic problem is that when I cut the number of lines roughly in half using some conditions, nothing improved; it actually got worse. I do not understand this. If I remove the regular expressions altogether, just scanning the file takes about 6 s, so the regular expressions are the main time consumer here. So why does the time not go down at least somewhat when I remove half of the lines?
Also, can anyone tell me which kind of regular expression is faster: a more generic one or a more specific one? For example, I can also match the line time(D:\Program Files\Microsoft Visual Studio 11.0\VC\bin\c1xx.dll)=0.047s uniquely in the file using the regex ("(.+)(c1xx.dll)(.+)"), but that makes the whole thing run even slower, whereas something like ("( {9}time)(.+)(c1xx\\.dll+)(.+)") makes it run slightly faster.
I am using the C++11 regex library, mostly the regex_match function.
regex c1xx("( {9}time)(.+)(c1xx\\.dll+)(.+)s");
auto start = system_clock::now();
int linecount = 0;
while (getline(inFile, currentLine))
{
if (regex_match(currentLine.c_str(), c1xx)) // use the regex defined above
{
linecount++;
// Do something, just insert it into a vector
}
}
auto end = system_clock::now();
auto elapsed = duration_cast<milliseconds>(end - start);
cout << "Time taken for parsing first log = " << elapsed.count() << " ms" << " lines = " << linecount << endl;
Output:
Time taken for parsing first log = 119416 ms lines = 19617
regex c1xx("( {9}time)(.+)(c1xx\\.dll+)(.+)s");
auto start = system_clock::now();
int linecount = 0;
while (getline(inFile, currentLine))
{
if (currentLine.size() > 200)
{
continue;
}
if (regex_match(currentLine.c_str(), c1xx)) // use the regex defined above
{
linecount++;
// Do something, just insert it into a vector
}
}
auto end = system_clock::now();
auto elapsed = duration_cast<milliseconds>(end - start);
cout << "Time taken for parsing first log = " << elapsed.count() << " ms" << " lines = " << linecount << endl;
Output:
Time taken for parsing first log = 131613 ms lines = 9216
Why is it taking more time in the second case?

So why does the time not go down at least somewhat when I remove half of the lines?
Why is it taking more time in the second case?
It is conceivable that the regex library is somehow able to reject lines more efficiently than your size check does. It is also possible that the additional branch in your while loop is hurting the processor's branch prediction, so that you are not getting optimal instruction pipelining/prefetching.
Also, can anyone tell me which kind of regular expression is faster: a more generic one or a more specific one?
If the expression ("(.+)(c1xx.dll)(.+)") works, I believe (".+c1xx\\.dll.+") would also work, and without capture groups the regex engine won't bother saving match positions for you.
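As a rough sketch of that suggestion, the pattern below drops the capture groups, and a plain substring check rejects lines that cannot match before the regex engine is involved at all. The function name countC1xxLines and the vector-of-lines input are assumptions made for illustration; whether this helps in practice depends on the standard library implementation, so it is worth measuring.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: counts the lines that look like c1xx.dll timing entries.
std::size_t countC1xxLines(const std::vector<std::string>& lines)
{
    // Same shape as the pattern in the question, but without capture groups.
    // The regex is constructed once, not once per line.
    static const std::regex pattern(R"( {9}time.+c1xx\.dll.+s)",
                                    std::regex::ECMAScript | std::regex::optimize);
    std::size_t count = 0;
    for (const auto& line : lines)
    {
        // Cheap literal pre-filter: a line without "c1xx.dll" can never match.
        if (line.find("c1xx.dll") == std::string::npos)
            continue;
        if (std::regex_match(line, pattern))
            ++count;
    }
    return count;
}
```

The pre-filter is only valid because every match must contain the literal text c1xx.dll; a pattern without such a fixed substring would need a different shortcut.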

Related

How to speed up regex searching for large quantity of potentially large files in C++?

I'm trying to make a program that reads user-supplied wildcard file names and wildcard strings, using an Excel document as a configuration file. For example, the user may enter C:\Read*.txt, and any text files on the C drive whose names start with Read, followed by any characters, will be included in the search.
They could search for Message: * and all strings beginning with "Message: " and ending with any sequence of characters would get matched.
So far it is a working program, but the performance is quite poor and I need it to be able to search very large files. I'm using a file stream and the regex class to do so, and I'm not sure what is taking so much time.
The bulk of the time in my code is being spent in the following loop (I've only included the lines above the while loop so you can better understand what I'm trying to do):
smatch matches;
vector<regex> expressions;
for (int i = 0; i < regex_patterns.size(); i++){expressions.emplace_back(regex_patterns.at(i));}
auto startTimer = high_resolution_clock::now();
// Open file and begin reading
ifstream stream1(filePath);
if (stream1.is_open())
{
int count = 0;
while (getline(stream1, line))
{
// Continue to next step if line is empty, no point in searching it.
if (line.size() == 0)
{
continue;
}
// Loop through each search string, if match, save line number and line text,
for (int i = 0; i < expressions.size(); i++)
{
size_t found = regex_search(line, matches, expressions.at(i));
if (found == 1)
{
lineNumb.push_back(count);
lineTextToSave.push_back(line);
}
}
count = count + 1;
}
}
auto stopTimer = high_resolution_clock::now();
auto duration2 = duration_cast<milliseconds>(stopTimer - startTimer);
cout << "Time to search file: " << duration2.count() << "\n";
Is there a better method of searching files than this? I tried looking up many things but haven't found a programmatic example that I've understood thus far.
Some ideas, in order of priority:
You could join all the regex patterns into a single regex instead of matching r regexes against each line, e.g. (R1)|(R2)|(...)|(Rr). Each line is then searched once rather than r times, which can speed up the program by up to a factor of r.
Ensure you compile each regex once, before the loop, rather than on every use.
Do not add a trailing .* to your regex patterns; regex_search already finds a match anywhere in the line.
Some further ideas that are not portable:
Memory-map the file instead of reading it through iostreams.
Consider whether it is worth reimplementing grep instead of calling grep through popen().
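The first idea above (joining the patterns) could look roughly like the sketch below. The helper name combinePatterns is made up for illustration, and the code assumes each pattern is a valid ECMAScript regex on its own. Each pattern is wrapped in a non-capturing group so that the alternation does not change its meaning.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: builds one regex of the form (?:p1)|(?:p2)|...|(?:pr).
std::regex combinePatterns(const std::vector<std::string>& patterns)
{
    std::string joined;
    for (const auto& p : patterns)
    {
        if (!joined.empty())
            joined += '|';
        joined += "(?:" + p + ")"; // non-capturing group keeps each pattern self-contained
    }
    // Note: an empty pattern list yields an empty regex, which matches every line.
    return std::regex(joined, std::regex::ECMAScript | std::regex::optimize);
}
```

With this, the inner loop over expressions collapses to a single regex_search(line, combined) call per line. The trade-off is that you no longer know which of the original patterns matched; if that matters, capture groups or a second pass over the matching lines would be needed.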

Parsing data from a file

I have this project due, but I am unsure how to parse the data into the word, its part of speech, and its definition. I know that I should make use of the tab spacing to read it, but I have no idea how to implement that. Here is an example of the file:
Recollection n. The power of recalling ideas to the mind, or the period within which things can be recollected; remembrance; memory; as, an event within my recollection.
Nip n. A pinch with the nails or teeth.
Wodegeld n. A geld, or payment, for wood.
Xiphoid a. Of or pertaining to the xiphoid process; xiphoidian.
NB: Each word and part of speech and definition is one line in a text file.
If you can be sure that the definition will always follow the first period on a line, you could use an implementation like this. But it will break if there are ever more than 2 periods on a single line.
string head, definition;
vector<pair<string,string>> v; // <word, definition>
// Read up to the first '.', then up to the second '.'; ws skips the newline left from the previous line
while (getline(fileStream >> ws, head, '.') && getline(fileStream >> ws, definition, '.')) {
    // head is e.g. "Recollection n"; drop the part-of-speech tag (n, v, etc.) after the last whitespace
    size_t tagPos = head.find_last_of(" \t");
    v.emplace_back(head.substr(0, tagPos), definition);
}
for (const auto& x : v) { // check your results
    cout << "word -> " << x.first << endl;
    cout << "definition -> " << x.second << endl << endl;
}
The better solution would be to learn Regular Expressions. It's a complicated topic but absolutely necessary if you want to learn how to parse text efficiently and properly:
http://www.cplusplus.com/reference/regex/
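As a minimal sketch of the regex route, assuming each line has the form "word, whitespace or tab, part-of-speech abbreviation ending in a period, whitespace or tab, definition" as in the sample above, one line could be split like this. The names Entry and parseEntry are invented for the example, and headwords containing spaces would not be handled.

```cpp
#include <regex>
#include <string>

// Hypothetical record type for one dictionary line.
struct Entry
{
    std::string word;
    std::string partOfSpeech;
    std::string definition;
};

// Hypothetical helper: fills `out` and returns true if the line has the expected shape.
bool parseEntry(const std::string& line, Entry& out)
{
    // group 1: word, group 2: abbreviation such as "n." or "a.", group 3: rest of the line
    static const std::regex pattern(R"(\s*(\S+)\s+([A-Za-z]+\.)\s+(.+))");
    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return false;
    out.word = m[1].str();
    out.partOfSpeech = m[2].str();
    out.definition = m[3].str();
    return true;
}
```

Calling this on each line obtained from getline(fileStream, line) keeps periods inside the definition intact, because the split happens at the abbreviation rather than at every period.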

Formatting text in C++

So I'm trying to make a little text-based game for my first real project in C++. When the game calls for a big block of text, several paragraphs' worth, I'm having trouble getting it to look nice. I want to have uniform line lengths, and to do that I'm having to manually enter line breaks in the appropriate places. Is there a command to do this for me? I have seen the setw() command, but that doesn't make the text wrap if it goes past the width. Any advice? Here is what I'm doing right now, if that helps.
cout << " This is essentially what I do with large blocks of" << '\n';
cout << " descriptive text. It lets me see how long each of " << '\n';
cout << " the lines will be, but it's really a hassle, see? " << '\n';
Getting or writing a library function to automatically insert leading whitespace and newlines suitable for a particular width of output would be a good idea. This website considers library recommendations off topic, but I've included some code below - not particularly efficient but easy to understand. The basic logic should be to jump forwards in the string to the maximum width, then move backwards until you find whitespace (or perhaps a hyphen) at which you're prepared to break the line... then print the leading whitespace and the remaining part of the line. Continue until done.
#include <cctype>   // isalnum
#include <iostream>
#include <string>
std::string fmt(size_t margin, size_t width, std::string text)
{
std::string result;
while (!text.empty()) // while more text to move into result
{
result += std::string(margin, ' '); // add margin for this line
if (width >= text.size()) // rest of text can fit... nice and easy
return (result += text) += '\n';
size_t n = width - 1; // start by assuming we can fit n characters
while (n > width / 2 && isalnum(text[n]) && isalnum(text[n - 1]))
--n; // between characters; reduce n until word breaks or 1/2 width left
// move n characters from text to result...
(result += text.substr(0, n)) += '\n';
text.erase(0, n);
}
return result;
}
int main()
{
std::cout << fmt(5, 70,
"This is essentially what I do with large blocks of "
"descriptive text. It lets me see how long each of "
"the lines will be, but it's really a hassle, see?");
}
That hasn't been tested very thoroughly, but seems to work.
BTW, your original code can be simplified to...
cout << " This is essentially what I do with large blocks of\n"
" descriptive text. It lets me see how long each of\n"
" the lines will be, but it's really a hassle, see?\n";
...as C++ considers the statement unfinished until it hits the semicolon, and double-quoted string literals that appear next to each other in the code are concatenated automatically, as if the inner double quotes and whitespace were removed.
There is a nice library for text formatting, cppformat/cppformat on GitHub.
With it, you can manipulate text formatting easily.
Alternatively, you can write a very simple function:
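#include <iostream>
#include <string>

// Prints the text in fixed-size chunks of sizeline characters
void print(const std::string& brute, std::size_t sizeline)
{
    for (std::size_t i = 0; i < brute.length(); i += sizeline)
        std::cout << brute.substr(i, sizeline) << std::endl;
}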
int main()
{
std::string mytext = "This is essentially what I do with large blocks of"
"descriptive text. It lets me see how long each of "
"1the lines will be, but it's really a hassle, see?";
print(mytext, 20);
return 0;
}
but words will be cut
output:
This is essentially
what I do with large
blocks ofdescriptiv
e text. It lets me s
ee how long each of
1the lines will be,
but it's really a ha
ssle, see?
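To avoid cutting words, one option is to build each line word by word and start a new line whenever the next word would not fit. The sketch below is one way to do that; the name wrapWords is made up for illustration, runs of whitespace collapse to a single space, and a word longer than the width simply gets a line of its own.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: greedy word wrap to at most `width` characters per line.
std::string wrapWords(const std::string& text, std::size_t width)
{
    std::istringstream words(text);
    std::string word, line, result;
    while (words >> word) // operator>> splits on whitespace
    {
        // Flush the current line if adding " word" would exceed the width.
        if (!line.empty() && line.size() + 1 + word.size() > width)
        {
            result += line + '\n';
            line.clear();
        }
        if (!line.empty())
            line += ' ';
        line += word;
    }
    if (!line.empty())
        result += line + '\n';
    return result;
}
```

For example, std::cout << wrapWords(mytext, 20); prints the same text as above with breaks only between words.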

Reading text file by scanning for keywords

As part of a bigger application I am working on a class for reading input from a text file for use in the initialization of the program. Now I am myself fairly new to programming, and I only started to learn C++ in December, so I would be very grateful for some hints and ideas on how to get started! I apologise in advance for a rather long wall of text.
The text file format is "keyword-driven" in the following way:
There are a rather small number of main/section keywords (currently 8) that need to be written in a given order. Some of them are optional, but if they are included they should adhere to the given ordering.
Example:
Suppose there are 3 potential keywords ordered as follows:
"KEY1" (required)
"KEY2" (optional)
"KEY3" (required)
If the input file only includes the required ones, the ordering should be:
"KEY1"
"KEY3"
Otherwise it should be:
"KEY1"
"KEY2"
"KEY3"
If all the required keywords are present, and the total ordering is ok, the program should proceed by reading each section in the sequence given by the ordering.
Each section will include a (possibly large) amount of subkeywords, some of which are optional and some of which are not, but here the order does NOT matter.
Lines starting with characters '*' or '--' signify commented lines, and they should be ignored (as well as empty lines).
A line containing a keyword should (preferably) include nothing else than the keyword. At the very least, the keyword must be the first word appearing there.
I have already implemented parts of the framework, but I feel my approach so far has been rather ad hoc. Currently I have manually created one method per section/main keyword, and the first task of the program is to scan the file to locate these keywords and pass the necessary information on to the methods.
I first scan through the file using an std::ifstream object, removing empty and/or commented lines and storing the remaining lines in an object of type std::vector<std::string>.
Do you think this is an ok approach?
Moreover, I store the indices where each of the keywords start and stop (in two integer arrays) in this vector. This is the input to the above-mentioned methods, and it would look something like this:
bool readMAINKEY(int start, int stop);
Now I have already done this, and even though I do not find it very elegant, I guess I can keep it for the time being.
However, I feel that I need a better approach for handling the reading inside of each section, and my main issue is how should I store the keywords here? Should they be stored as arrays within a local namespace in the input class or maybe as static variables in the class? Or should they be defined locally inside relevant functions? Should I use enums? The questions are many!
Now I've started by defining the sub-keywords locally inside each readMAINKEY() method, but I found this to be less than optimal. Ideally I want to reuse as much code as possible inside each of these methods, calling upon a common readSECTION() method, and my current approach seems to lead to much code duplication and potential for error in programming. I guess the smartest thing to do would simply be to remove all the (currently 8) different readMAINKEY() methods, and use the same function for handling all kinds of keywords. There is also the possibility for having sub-sub-keywords etc. as well (i.e. a more general nested approach), so I think maybe this is the way to go, but I am unsure on how it would be best to implement it?
Once I've processed a keyword at the "bottom level", the program will expect a particular format of the following lines depending on the actual keyword. In principle each keyword will be handled differently, but here there is also potential for some code reuse by defining different "types" of keywords depending on what the program expects to do after triggering the reading of it. Common task include e.g. parsing an integer or a double array, but in principle it could be anything!
If a keyword for some reason cannot be correctly processed, the program should attempt as far as possible to use default values instead of terminating the program (if reasonable), but an error message should be written to a logfile. For optional keywords, default values will of course also be used.
In order to summarise, therefore, my main questions are the following:
1. Do you think my approach of storing the relevant lines in a std::vector<std::string> is reasonable?
This will of course require me to do a lot of "indexing work" to keep track of where in the vector the different keywords are located. Or should I work more "directly" with the original std::ifstream object? Or something else?
2. Given such a vector storing the lines of the text file, how can I best go about detecting the keywords and start reading the information following them?
Here I will need to take account of possible ordering and whether a keyword is required or not. Also, I need to check whether the lines following each "bottom level" keyword are in the format expected in each case.
One idea I've had is to store the keywords in different containers depending on whether they are optional or not (or maybe use object(s) of type std::map<std::string,bool>), and then remove them from the container(s) if correctly processed, but I am not sure exactly how I should go about it..
I guess there is really a thousand different ways one could answer these questions, but I would be grateful if someone more experienced could share some ideas on how to proceed. Is there e.g. a "standard" way of doing such things? Of course, a lot of details will also depend on the concrete application, but I think the general format indicated here can be used in a lot of different applications without a lot of tinkering if programmed in a good way!
UPDATE
Ok, so let me try to be more concrete. My current application is supposed to be a reservoir simulator, so as part of the input I need information about the grid/mesh, about rock and fluid properties, about wells/boundary conditions throughout the simulation and so on. At the moment I've been thinking about using (almost) the same set-up as the commercial Eclipse simulator when it comes to input; for details see
http://petrofaq.org/wiki/Eclipse_Input_Data.
However, I will probably change things a bit, so nothing is set in stone. Also, I am interested in making a more general "KeywordReader" class that, with slight modifications, can be adapted for use in other applications as well, or at least adapted in a reasonable amount of time.
As an example, I can post the current code that does the initial scan of the text file and locates the positions of the main keywords. As I said, I don't really like my solution very much, but it seems to work for what it needs to do.
At the top of the .cpp file I have the following namespace:
//Keywords used for reading input:
namespace KEYWORDS{
/*
* Main keywords and corresponding boolean values to signify whether or not they are required as input.
*/
enum MKEY{RUNSPEC = 0, GRID = 1, EDIT = 2, PROPS = 3, REGIONS = 4, SOLUTION = 5, SUMMARY =6, SCHEDULE = 7};
std::string mainKeywords[] = {std::string("RUNSPEC"), std::string("GRID"), std::string("EDIT"), std::string("PROPS"),
std::string("REGIONS"), std::string("SOLUTION"), std::string("SUMMARY"), std::string("SCHEDULE")};
bool required[] = {true,true,false,true,false,true,false,true};
const int n_key = 8;
}//end KEYWORDS namespace
Then further down I have the following function. I am not sure how understandable it is though..
bool InputReader::scanForMainKeywords(){
logfile << "Opening file.." << std::endl;
std::ifstream infile(filename);
//Test if file was opened. If not, write error message:
if(!infile.is_open()){
logfile << "ERROR: Could not open file! Unable to proceed!" << std::endl;
std::cout << "ERROR: Could not open file! Unable to proceed!" << std::endl;
return false;
}
else{
logfile << "Scanning for main keywords..." << std::endl;
int nkey = KEYWORDS::n_key;
//Initially no keywords have been found:
startIndex = std::vector<int>(nkey, -1);
stopIndex = std::vector<int>(nkey, -1);
//Variable used to control that the keywords are written in the correct order:
int foundIndex = -1;
//STATISTICS:
int lineCount = 0;//number of non-comment lines in text file
int commentCount = 0;//number of commented lines in text file
int emptyCount = 0;//number of empty lines in text file
//Create lines vector:
lines = std::vector<std::string>();
//Remove comments and empty lines from text file and store the result in the variable file_lines:
std::string str;
while(std::getline(infile,str)){
if(str.size()>=1 && str.at(0)=='*'){
commentCount++;
}
else if(str.size()>=2 && str.at(0)=='-' && str.at(1)=='-'){
commentCount++;
}
else if(str.size()==0){
emptyCount++;
}
else{
//Found a non-empty, non-comment line.
lines.push_back(str);//store in std::vector
//Start by checking if the first word of the line is one of the main keywords. If so, store the location of the keyword:
std::string fw = IO::getFirstWord(str);
for(int i=0;i<nkey;i++){
if(fw.compare(KEYWORDS::mainKeywords[i])==0){
if(i > foundIndex){
//Found a valid keyword!
foundIndex = i;
startIndex[i] = lineCount;//store where the keyword was found!
//logfile << "Keyword " << fw << " found at line " << lineCount << " in lines array!" << std::endl;
//std::cout << "Keyword " << fw << " found at line " << lineCount << " in lines array!" << std::endl;
break;//fw cannot equal several different keywords at the same time!
}
else{
//we have found a keyword, but in the wrong order... Terminate program:
std::cout << "ERROR: Keywords have been entered in the wrong order or been repeated! Cannot continue initialisation!" << std::endl;
logfile << "ERROR: Keywords have been entered in the wrong order or been repeated! Cannot continue initialisation!" << std::endl;
return false;
}
}
}//end for loop
lineCount++;
}//end else (found non-comment, non-empty line)
}//end while (reading ifstream)
logfile << "\n";
logfile << "FILE STATISTICS:" << std::endl;
logfile << "Number of commented lines: " << commentCount << std::endl;
logfile << "Number of non-commented lines: " << lineCount << std::endl;
logfile << "Number of empty lines: " << emptyCount << std::endl;
logfile << "\n";
/*
Print lines vector to screen:
for(int i=0;i<lines.size();i++){
std:: cout << "Line nr. " << i << " : " << lines[i] << std::endl;
}*/
/*
* So far, no keywords have been entered in the wrong order, but have all the necessary ones been found?
* Otherwise return false.
*/
for(int i=0;i<nkey;i++){
if(KEYWORDS::required[i] && startIndex[i] == -1){
logfile << "ERROR: Incorrect input of required keywords! At least " << KEYWORDS::mainKeywords[i] << " is missing!" << std::endl;;
logfile << "Cannot proceed with initialisation!" << std::endl;
std::cout << "ERROR: Incorrect input of required keywords! At least " << KEYWORDS::mainKeywords[i] << " is missing!" << std::endl;
std::cout << "Cannot proceed with initialisation!" << std::endl;
return false;
}
}
//If everything is in order, we also initialise the stopIndex array correctly:
int counter = 0;
//Find first existing keyword:
while(counter < nkey && startIndex[counter] == -1){
//Keyword doesn't exist. Leave stopindex at -1!
counter++;
}
//Store stop index of each keyword:
while(counter<nkey){
int offset = 1;
//Find next existing keyword:
while(counter+offset < nkey && startIndex[counter+offset] == -1){
offset++;
}
if(counter+offset < nkey){
stopIndex[counter] = startIndex[counter+offset]-1;
}
else{
//reached the end of array!
stopIndex[counter] = lines.size()-1;
}
counter += offset;
}//end while
/*
//Print out start/stop-index arrays to screen:
for(int i=0;i<nkey;i++){
std::cout << "Start index of " << KEYWORDS::mainKeywords[i] << " is : " << startIndex[i] << std::endl;
std::cout << "Stop index of " << KEYWORDS::mainKeywords[i] << " is : " << stopIndex[i] << std::endl;
}
*/
return true;
}//end else (file opened properly)
}//end scanForMainKeywords()
You say your purpose is to read initialization data from a text file.
It seems you need to parse (syntactically analyze) this file and store the data under the right keys.
If the syntax is fixed and each construct starts with a keyword, you could write a recursive descent (LL(1)) parser that builds a tree (each node holding an STL vector of sub-branches) to store your data.
If the syntax is free, you might pick JSON or XML and use an existing parsing library.
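Short of a full parser, the ordering rule from the question (required and optional keywords in a fixed order) can be expressed as a small table plus one generic check, which avoids one hand-written method per keyword. This is only a sketch under that assumption; KeywordSpec and validateOrder are names invented here, and `found` stands for the main keywords in the order they appear in the file.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical table entry: a keyword and whether it must be present.
struct KeywordSpec
{
    std::string name;
    bool required;
};

// Returns true if `found` follows the order given by `spec`, contains every
// required keyword, and contains no unknown, repeated or misplaced keywords.
bool validateOrder(const std::vector<KeywordSpec>& spec,
                   const std::vector<std::string>& found)
{
    std::size_t next = 0; // index of the next keyword in `found` still to be matched
    for (const auto& kw : spec)
    {
        if (next < found.size() && found[next] == kw.name)
            ++next;       // keyword present in the expected position
        else if (kw.required)
            return false; // required keyword missing or out of order
    }
    return next == found.size(); // leftovers are unknown, repeated or misplaced
}
```

The same table idea extends to sub-keywords: each section could own its own list of KeywordSpec entries (without the ordering check), so a single readSECTION-style function can serve all sections.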

How to extract formatted text in C++?

This might have appeared before, but I couldn't understand how to extract formatted data. Below is my code to extract all text between string "[87]" and "[90]" in a text file.
Apparently, the position of [87] and [90] is the same as indicated in the output.
void ExtractWebContent::filterContent(){
    string str, str1;
    string positionOfCurrency1 = "[87]";
    string positionOfCurrency2 = "[90]";
    size_t positionOfText1, positionOfText2;
    ifstream reading;
    reading.open("file_Currency.txt");
    // Loop on getline itself; testing eof() before reading would process a failed final read
    while (getline(reading, str)){
        positionOfText1 = str.find(positionOfCurrency1);
        positionOfText2 = str.find(positionOfCurrency2);
        cout << "positionOfCurrency1 " << positionOfText1 << endl;
        cout << "positionOfCurrency2 " << positionOfText2 << endl;
        //str1= str.substr (positionOfText);
        cout << "String" << str1 << endl;
    }
    reading.close();
}
An Update on the currency file:
[79]More »Brent slips to $102 on worries about euro zone economy
Market Data
* Currencies
CAPTION: Currencies
Name Price Change % Chg
[80]USD/SGD
1.2606 -0.00 -0.13%
USD/SGD [81]USDSGD=X
[82]EUR/SGD
1.5242 0.00 +0.11%
EUR/SGD [83]EURSGD=X
That really depends on what 'extracting data' means. In simple cases you can just read the file into a string and then use string member functions (especially find and substr) to extract the segment you are interested in. If you are interested in data per line, getline is the way to go for line extraction. Apply find and substr as before to get the segment.
Sometimes a simple find won't get you far, and you will need a regular expression to get easily to the parts you are interested in.
Often simple parsers evolve and soon outgrow even regular expressions. This often signals that it is time for the very large hammer of C++ parsing: Boost.Spirit.
Boost.Tokenizer can be helpful for parsing out a string, but it gets a little trickier if the delimiters have to be bracketed numbers like the ones you have. With the delimiters as described, a regex is probably adequate.
All that does is concatenate the output of reading and the strings "[1]" and "[2]". I'm guessing this code resulted from a rather literal extrapolation of similar code using scanf. scanf (as well as the rest of C) still works in C++, so if that works for you I would use it.
That said, there are various levels of sophistication at which you can do this. Using regexes is one of the most powerful/flexible ways, but it might be overkill. The quickest way in my opinion is just to do something like:
Find index of substring "[1]", i1
Find index of substring "[2]", i2
get substring between i1+3 and i2.
In code, supposing std::string line has the text:
size_t i1 = line.find("[1]");
size_t i2 = line.find("[2]");
std::string out(line.substr(i1 + 3, i2 - (i1 + 3))); // second argument of substr is a length, not an end index
Warning: no error checking.
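A version with the missing error checking might look like the sketch below. The helper name extractBetween is an assumption for illustration; it uses the marker's own length instead of a hard-coded 3, so it also works for four-character markers such as "[87]".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the text between the first `open` marker and the
// next `close` marker into `out`. Returns false if either marker is missing.
bool extractBetween(const std::string& text, const std::string& open,
                    const std::string& close, std::string& out)
{
    std::size_t begin = text.find(open);
    if (begin == std::string::npos)
        return false;
    begin += open.size();                      // skip past the opening marker
    std::size_t end = text.find(close, begin); // search only after the opening marker
    if (end == std::string::npos)
        return false;
    out = text.substr(begin, end - begin);     // substr takes (position, length)
    return true;
}
```

If the two markers can sit on different lines, as in the currency file, the whole file would have to be read into one string first and passed as `text`.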