Why exception with this regular expression pattern (tr1::regex)? - c++

I came across a very strange problem with tr1::regex (VS2008) and I can't figure out the reason for it. The code at the end of the post compiles fine but throws an exception when reaching the 4th regular expression definition during execution:
Microsoft C++ exception: std::tr1::regex_error at memory location 0x0012f5f4..
However, the only difference I can see (maybe I am blind) between the 3rd and 4th one is the 'NumberOfComponents' instead of 'SchemeVersion'. At first I thought maybe both (3rd and 4th) are wrong and the error from the 3rd is just triggered in the 4th. That seems not to be the case as I moved both of them around and put multiple other regex definitions between them. The line in question always triggers the exception.
Does anyone have any idea why that line
std::tr1::regex rxNumberOfComponents("\\NumberOfComponents:(\\s*\\d+){1}");
triggers an exception but
std::tr1::regex rxSchemeVersion("\\SchemeVersion:(\\s*\\d+){1}");
doesn't? Is the runtime just messing with me?
Thanks for the time to read this and for any insights.
T
PS: I am totally sure the solution is so easy I have to hit my head against the nearest wall to even out the 'stupid question' karma ...
#include <regex>
int main(void)
{
std::tr1::regex rxSepFileIdent("Scanner Separation Configuration");
std::tr1::regex rxScannerNameIdent("\\ScannerName:((\\s*\\w+)+)");
std::tr1::regex rxSchemeVersion("\\SchemeVersion:(\\s*\\d+){1}");
std::tr1::regex rxNumberOfComponents("\\NumberOfComponents:(\\s*\\d+){1}");
std::tr1::regex rxConfigStartIdent("Configuration Start");
std::tr1::regex rxConfigEndIdent("Configuration End");
return 0;
}

You need to double-escape your backslashes - once for the regex itself, a second time for the string they're in.
The one that starts with S works because \S is a valid regex escape (non-whitespace characters). The one that starts with N does not (because \N is not a valid regex escape).
Instead, use "\\\\SchemeVersion: et cetera.
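To make the two levels of escaping concrete, here is a minimal sketch. It assumes a C++11 compiler with std::regex rather than the std::tr1::regex from the question, and the helper name firstMatchOffset and the sample input are made up for illustration. It reports where each pattern starts matching, so it does not depend on how a particular library treats an unknown escape such as \N.

```cpp
#include <regex>
#include <string>

// Returns the offset at which `pattern` first matches in `text`, or -1 if there is no match.
// (Hypothetical helper, not part of the original post.)
long firstMatchOffset(const std::string& pattern, const std::string& text)
{
    std::regex rx(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, rx))
        return -1;
    return static_cast<long>(m.position(0));
}

// Two backslashes in the source become ONE in the regex: \S means "any non-whitespace character".
const std::string singleEscaped = "\\SchemeVersion:(\\s*\\d+){1}";
// Four backslashes in the source become TWO in the regex: \\ means "a literal backslash".
const std::string doubleEscaped = "\\\\SchemeVersion:(\\s*\\d+){1}";
// The same pattern as a C++11 raw string literal, which needs no source-level escaping.
const std::string rawLiteral = R"(\\SchemeVersion:(\s*\d+){1})";
```

Against the input \SchemeVersion: 3, the double-escaped pattern matches from offset 0 (the backslash itself), while the single-escaped one only matches from offset 1, because \S consumes the S and the rest of the pattern is chemeVersion. That is why the original line appeared to work.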

Related

Stack overflow during std::regex_replace

I'm trying to execute the following C++ STL-based code to replace text in a relatively large SQL script (~8MB):
std::basic_regex<TCHAR> reProc("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$\n((^(?![ \t]*go[ \t]*).*$\n)+)^[ \t]*go[ \t]*$");
std::basic_string<TCHAR> replace = _T("ALTER $1 $2\n$3\ngo");
return std::regex_replace(strInput, reProc, replace);
The result is a stack overflow, and it's hard to find information about that particular error on this particular site since that's also the name of the site.
Edit: I am using Visual Studio 2013 Update 5
Edit 2: The original file is over 23,000 lines. I cut the file down to 3,500 lines and still get the error. When I cut it by another ~50 lines down to 3,456 lines, the error goes away. If I put just those cut lines into the file, the error is still gone. This suggests that the error is not related to specific text, but just too much of it.
Edit 3: A full working example is demonstrated operating properly here:
https://regex101.com/r/iD1zY6/1
It doesn't work in that STL code, though.
The following trimmed-down version of your regex saves about 20% of processing steps according to regex101 (see here).
\\bcreate[ \t]+(view|procedure|proc)[ \t]+(.+)\n(((?![ \t]*go[ \t]*).*\n)+)[ \t]*go[ \t]*
Modifications:
inline anchors removed: you are expressly testing for newline characters
repetition operator for the db object keywords removed - a repetition at this point would make the original script syntactically invalid.
initial whitespace pattern replaced by word boundary (note the double backslash - the escape sequence is for the regex engine, not for the compiler)
If you can be sure that ...
the create ... statements do not occur in string literals, and
you do not need to distinguish between create ... statements followed by a go or not (e.g. because all statements are trailed by a go)
...it might even be easier to just replace these strings:
std::basic_regex<TCHAR> reProc(_T("\\bcreate[ \t]+(view|procedure|proc)"));
std::basic_string<TCHAR> replace = _T("ALTER $1");
return std::regex_replace(strInput, reProc, replace);
(Here is a demo for the latter approach; it reduces the steps to a little more than a quarter.)
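As a rough, self-contained sketch of that simpler replacement: the version below assumes plain std::string input instead of TCHAR, and adds case-insensitive matching, which is my assumption about what the script needs. The function name createToAlter is illustrative only.

```cpp
#include <regex>
#include <string>

// Rewrites "create view|procedure|proc" to "ALTER ..." wherever "create" starts a word.
// Assumes narrow std::string input; the function name is made up for this example.
std::string createToAlter(const std::string& script)
{
    // "\\b" reaches the regex engine as \b (word boundary);
    // a bare "\b" in a C++ string literal would be a backspace character.
    static const std::regex reCreate(
        "\\bcreate[ \t]+(view|procedure|proc)",
        std::regex::ECMAScript | std::regex::icase | std::regex::optimize);
    // $1 re-inserts the captured object keyword exactly as it was written.
    return std::regex_replace(script, reCreate, "ALTER $1");
}
```

Because the pattern no longer spans lines or repeats a group per line, the engine has very little to backtrack over, which is the point of the simplification.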
It turns out that STL regular expressions are tragic under-performers versus Perl (about 100 times slower if you can believe https://stackoverflow.com/a/37016671/78162), so it's apparently necessary to absolutely minimize the use of regular expressions in STL/C++ when performance is a serious concern. (The degree to which C++/STL under-performs here blew my mind considering I presume C++ to generally be one of the more performant languages). I ended up passing the file stream to read one line at a time and only run the expression on lines that needed processing like this:
std::basic_string<TCHAR> result;
std::basic_string<TCHAR> line;
std::basic_regex<TCHAR> reProc(_T("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$"), std::regex::optimize);
std::basic_string<TCHAR> replace = _T("ALTER $1 $2");
// Loop on getline itself, so a failed read at end of file is not appended as an extra empty line.
while (std::getline(input, line)) {
    // Only run the regex on lines whose first non-blank text is "create" (case-insensitive).
    std::basic_string<TCHAR>::size_type pos = line.find_first_not_of(_T(" \t"));
    if ((pos != std::basic_string<TCHAR>::npos)
        && (_tcsnicmp(line.substr(pos, 6).data(), _T("create"), 6)==0))
        result.append(std::regex_replace(line, reProc, replace));
    else
        result.append(line);
    result.append(_T("\n"));
}
return result;

D: split string by comma, but not quoted string

I need to split a string by commas that are not inside quotes, like:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;
void main()
{
auto str = `foo,bar,"hello, user",baz`;
foreach (row; csvReader(str))
{
writeln(row);
}
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data, because std.csv wouldn't parse it correctly with a space before the first quote (").
You can use the following snippet to complete this task:
import std.stdio; // File, writeln
import std.regex; // regex, split
File fileContent;
string fileFullName = `D:\code\test\example.csv`;
fileContent = File (fileFullName, "r");
auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
foreach(line;fileContent.byLine)
{
auto result = split(line, r);
writeln(result);
}
If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates compared to just reading the characters because you might then have to recombine lines later when you find one ends with a closing quote! And then your beautiful byLine loop suddenly isn't so beautiful.
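To show what "one character at a time with a few flags" can look like, here is a small sketch in C++ (not D, and not the csv.d linked above). The function name parseCsv and the choice to skip blank lines are my own assumptions; it only covers the three quoting rules listed above.

```cpp
#include <string>
#include <vector>

// Parses CSV text one character at a time, tracking whether we are inside quotes.
// Hypothetical helper for illustration; it is neither std.csv nor csv.d.
std::vector<std::vector<std::string>> parseCsv(const std::string& text)
{
    std::vector<std::vector<std::string>> rows;
    std::vector<std::string> row;
    std::string field;
    bool inQuotes = false;
    bool rowHasContent = false; // true once the current row has seen any character

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (inQuotes) {
            if (c == '"' && i + 1 < text.size() && text[i + 1] == '"') {
                field += '"';      // rule 3: "" inside quotes is one literal quote
                ++i;
            } else if (c == '"') {
                inQuotes = false;  // closing quote
            } else {
                field += c;        // rules 1 and 2: commas and newlines are content here
            }
        } else if (c == '"' && field.empty()) {
            inQuotes = true;       // opening quote at the start of a field
            rowHasContent = true;
        } else if (c == ',') {
            row.push_back(field);  // an unquoted comma ends the field
            field.clear();
            rowHasContent = true;
        } else if (c == '\n') {
            if (rowHasContent) {   // an unquoted newline ends the row; blank lines are skipped
                row.push_back(field);
                rows.push_back(row);
            }
            row.clear();
            field.clear();
            rowHasContent = false;
        } else if (c != '\r') {
            field += c;
            rowHasContent = true;
        }
    }
    if (rowHasContent) {           // last row without a trailing newline
        row.push_back(field);
        rows.push_back(row);
    }
    return rows;
}
```

Fed the three-line sample above, this yields two rows, and the first field of the second row keeps its embedded newline, comma and quote mark. Note that it does not trim the space before a quote, so the original question's input would still need that space removed or handled separately.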
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.

Why does regex_match throw "complexity exception"?

I am trying to test (using boost::regex) whether a line in a file contains only numeric entries separated by spaces. I encountered an exception which I do not understand (see below). It would be great if someone could explain why it is thrown. Maybe I am doing something stupid here in my way of defining the patterns? Here is the code:
// regex_test.cpp
#include <string>
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
using namespace boost;
int main(){
// My basic pattern to test for a single numeric expression
const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+\\.?[[:d:]]*";
// pattern for the full line
const string numeric_sequence_pattern = "([[:s:]]*"+numeric_value_pattern+"[[:s:]]*)+";
regex r(numeric_sequence_pattern);
string line= "1 2 3 4.444444444444";
bool match = regex_match(line, r);
cout<<match<<endl;
//...
}
I compile that successfully with
g++ -std=c++11 -L/usr/lib64/ -lboost_regex regex_test.cpp
The resulting program worked fine so far and match == true as I wanted. But then I test an input line like
string line= "1 2 3 4.44444444e-16";
Of course, my pattern isn't built to recognise the format 4.44444444e-16 and I would expect that match == false. However, instead I get the following runtime error:
terminate called after throwing an instance of
'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
what(): The complexity of matching the regular expression exceeded predefined bounds.
Try refactoring the regular expression to make each choice made by the state machine unambiguous.
This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
Why is that?
Note: the example I gave is extremal in the sense that putting one digit less after the dot works ok. That means
string line= "1 2 3 4.4444444e-16";
just results in match == false as expected. So, I'm baffled. What is happening here?
Thanks already!
Update:
Problem seems to be solved. Given the hint of alejrb I refactored the pattern to
const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+(?:\\.[[:d:]]*)?";
That seems to work as it should. Somehow, the isolated optional \\. inside the original pattern [[:d:]]+\\.?[[:d:]]* left too many possibilities to match a long sequence of digits in different ways.
I hope the pattern is safe now. However, if someone finds a way to use it for a blow up in the new form, let me know! It's not so obvious for me whether that might still be possible...
I'd say that your regex is probably exponentially backtracking. To protect you from a loop that would become entirely unworkable if the input were any longer, the regex engine just aborts the attempt.
One of the patterns that often causes this problem is anything of the form (x+x+)+ - which you build up here when you place the first pattern inside the second.
There's a good discussion at http://www.regular-expressions.info/catastrophic.html
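On the "is it safe now?" question: the outer group still allows zero whitespace between two numbers, so a long run of digits can in principle be split between repetitions in many ways. One possible tightening, sketched here with std::regex rather than boost::regex and with a pattern of my own (not from the thread), is to require at least one whitespace character between numbers, so each digit run can only belong to one number.

```cpp
#include <regex>
#include <string>

// True if `line` consists only of whitespace-separated numbers such as "1 -2 3.5".
// Sketch only: the function name and the exact pattern are assumptions.
bool isNumericLine(const std::string& line)
{
    // One number: optional sign, digits, optional fractional part.
    const std::string number = "[-+]?[0-9]+(?:\\.[0-9]*)?";
    // Numbers must be separated by at least one whitespace character, so a run of
    // digits belongs to exactly one number and cannot be divided up in several ways.
    static const std::regex rx("\\s*" + number + "(?:\\s+" + number + ")*\\s*");
    return std::regex_match(line, rx);
}
```

With this shape, a failing line such as 1 2 3 4.44444444e-16 is rejected after roughly one backtracking step per digit instead of one per possible split.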

Using Regex to find function containing a specific method or variable

This is my first post on stackoverflow, so please be gentle with me...
I am still learning regex - mostly because I have finally discovered how useful they can be and this is in part through using Sublime Text 2. So this is Perl regex (I believe)
I have done searching on this and other sites but I am now genuinely stuck. Maybe I am trying to do something that can't be done
I would like to find a regex (pattern) that will let me find the function or method or procedure etc that contains a given variable or method call.
I have tried a number of expressions and they seem to get part of the way but not all the way. Particularly when searching in Javascript I pick up multiple function declarations instead of the one nearest to the call/variable that I am looking for.
for example:
I am looking for the function that calls the method savedata()
I have learnt, from this excellent site that I can use (?s) to switch . to include newlines
function.*(?=(?s).*?savedata\(\))
however, that will find the first instance of the word function and then all the text up to and including savedata()
if there are multiple procedures then it will start at the next function and repeat until it gets to savedata() again
function(?s).*?savedata\(\) does something similar
I have tried asking it to ignore the second function (I believe) by using something like:
function(?s).*?(?:(?!function).*?)*savedata\(\)
But that doesn't work.
I have done some investigation with look forwards and look backwards but either I am doing it wrong (highly possible) or they are not the right thing.
In summary (I guess), how do I go backwards, from a given word to the nearest occurrence of a different word.
At the moment I am using this to search through some javascript files to try and understand the structure/calls etc but ultimately I am hoping to use on c# files and some vb.net files
Many thanks in advance
Thanks for the swift responses and sorry for not adding an example block of code - which I will do now (modified but still sufficient to show the issue)
if I have a simple block of javascript like the following:
function a_CellClickHandler(gridName, cellId, button){
var stuffhappenshere;
var and here;
if(something or other){
if (anothertest) {
event.returnValue=false;
event.cancelBubble=true;
return true;
}
else{
event.returnValue=false;
event.cancelBubble=true;
return true;
}
}
}
function a_DblClickHandler(gridName, cellId){
var userRow = rowfromsomewhere;
var userCell = cellfromsomewhereelse;
//this will need to save the local data before allowing any inserts to ensure that they are inserted in the correct place
if (checkforarangeofthings){
if (differenttest) {
InsSeqNum = insertnumbervalue;
InsRowID = arow.getValue()
blnWasInsert = true;
blnWasDoubleClick = true;
SaveData();
}
}
}
Running the regex against this (including the second one, which was identified as one that should work), Sublime Text 2 selects everything from the first function through to SaveData()
I would like to be able to get to just the dblClickHandler in this case - not both.
Hopefully this code snippet will add some clarity and sorry for not posting originally as I hoped a standard code file would suffice.
This regex will find every Javascript function containing the SaveData method:
(?<=[\r\n])([\t ]*+)function[^\r\n]*+[\r\n]++(?:(?!\1\})[^\r\n]*+[\r\n]++)*?[^\r\n]*?\bSaveData\(\)
It will match all the lines in the function up to, and including, the first line containing the SaveData method.
Caveat:
The source code must have well-formed indentation for this to work, as the regex uses matching indentations to detect the end of functions.
Will not match a function if it starts on the first line of the file.
Explanation:
(?<=[\r\n]) Start at the beginning of a line
([\t ]*+) Capture the indentation of that line in Capture Group 1
function[^\r\n]*+[\r\n]++ Match the rest of the declaration line of the function
(?:(?!\1\})[^\r\n]*+[\r\n]++)*? Match more lines (lazily) which are not the last line of the function, until:
[^\r\n]*?\bSaveData\(\) Match the first line of the function containing the SaveData method call
Note: The *+ and ++ are possessive quantifiers, only used to speed up execution.
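Outside an editor, the underlying question ("go backwards from a given word to the nearest occurrence of a different word") can also be approached without a regex. This matters in C++ in particular, because std::regex's default ECMAScript grammar has neither lookbehind nor possessive quantifiers, so the pattern above would not carry over as written. A naive sketch with plain string searches, where the function name is invented for this example:

```cpp
#include <string>

// Returns the text from the nearest "function" keyword before the first occurrence
// of `call`, up to and including `call`; returns an empty string if either is missing.
// Naive sketch: it ignores comments, string literals and nested functions.
std::string enclosingFunctionText(const std::string& source, const std::string& call)
{
    const std::string::size_type callPos = source.find(call);
    if (callPos == std::string::npos)
        return "";
    // rfind searches backwards from callPos, so it lands on the closest preceding keyword.
    const std::string::size_type funcPos = source.rfind("function", callPos);
    if (funcPos == std::string::npos)
        return "";
    return source.substr(funcPos, callPos + call.size() - funcPos);
}
```

Unlike the indentation-based regex, this does not check that the call is really inside that function's braces, so treat it as a starting point for exploring a code base rather than a parser.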
EDIT:
Fixed two minor problems with the regex.
EDIT:
Fixed another minor problem with the regex.

Why is this regular expression faster?

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
if (rTest.IsMatch(formattedString))
return rTest.Replace(formattedString, string.Empty);
else
return formattedString;
}
I'm new to regular expressions and it was suggested that I use this:
static Regex rText = new Regex(@"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster and reduced the impact on my text rendering. Can someone explain to a regexp newbie, why? :)
Do you really want to run the regexp twice? Without having checked (bad me) I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
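For comparison, the same single-pass idea can be sketched in C++ with std::regex. This is an illustration under my own assumptions, not a port of the .NET code: the pattern requires the ESC character (unlike the fully optional pattern in the question, which can also match the empty string), and the function name simply mirrors the C# method.

```cpp
#include <regex>
#include <string>

// Removes ANSI colour/format escape sequences in a single replace pass.
// Assumes narrow std::string input; the name mirrors the C# method for readability only.
std::string stripStringFormatting(const std::string& formatted)
{
    // ESC, optionally followed by "[", parameters, and a final "m" or "z",
    // so sequences that were cut off by the server are removed too.
    // "\033" is the ESC character written as an octal escape.
    static const std::regex reAnsi("\033(?:\\[[0-9;]*[mz]?)?");
    return std::regex_replace(formatted, reAnsi, "");
}
```

There is no separate "is there a match?" test: if nothing matches, regex_replace just copies the input through, so the text is scanned once instead of twice.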
The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? gives you lazy quantifying. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.
I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]
Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)