Why does regex_match throw "complexity exception"? - c++

I am trying to test (using boost::regex) whether a line in a file contains only numeric entries seperated by spaces. I encountered an exception which I do not understand (see below). It would be great if someone could explain why it is thrown. Maybe I am doing something stupid here in my way of defining the patterns? Here is the code:
// regex_test.cpp
#include <string>
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
using namespace boost;
int main(){
// My basic pattern to test for a single numeric expression
const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+\\.?[[:d:]]*";
// pattern for the full line
const string numeric_sequence_pattern = "([[:s:]]*"+numeric_value_pattern+"[[:s:]]*)+";
regex r(numeric_sequence_pattern);
string line= "1 2 3 4.444444444444";
bool match = regex_match(line, r);
cout<<match<<endl;
//...
}
I compile that successfully with
g++ -std=c++11 -L/usr/lib64/ -lboost_regex regex_test.cpp
The resulting program worked fine so far and match == true as I wanted. But then I test an input line like
string line= "1 2 3 4.44444444e-16";
Of course, my pattern isn't built to recognise the format 4.44444444e-16 and I would expect that match == false. However, instead I get the following runtime error:
terminate called after throwing an instance of
'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
what(): The complexity of matching the regular expression exceeded predefined bounds.
Try refactoring the regular expression to make each choice made by the state machine unambiguous.
This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
Why is that?
Note: the example I gave is extremal in the sense that putting one digit less after the dot works ok. That means
string line= "1 2 3 4.4444444e-16";
just results in match == false as expected. So, I'm baffled. What is happening here?
Thanks already!
Update:
Problem seems to be solved. Given the hint of alejrb I refactored the pattern to
const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+(?:\\.[[:d:]]*)?";
That seems to work as it should. Somehow, the isolated optional \\. inside the original pattern [[:d:]]+\\.?[[:d:]]* left to many possibilities to match a long sequence of digits in different ways.
I hope the pattern is safe now. However, if someone finds a way to use it for a blow up in the new form, let me know! It's not so obvious for me whether that might still be possible...

I'd say that your regex is probably exponentially backtracking. To protect you from a loop that would become entirely unworkable if the input were any longer, the regex engine just aborts the attempt.
One of the patterns that often causes this problem is anything of the form (x+x+)+ - which you build up here when you place the first pattern inside the second.
There's a good discussion at http://www.regular-expressions.info/catastrophic.html

Related

Matlab: What's the most efficient approach to parse a large table or cell array with regexp when sometimes there is no match?

I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(#(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(#(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1 so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code will change every empty cell array in out with the string Unknown. But this leads to the second problem. I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out) as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking on the MATLAB help page here I can see a 'emptymatch' option, perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
As a workaround, I have found that using regexprep can be used to find entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(#(x) regexprep(x,missingexpr,replace),mycell);
out=cellfun(#(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
cellfun can be replaced with a plain old for loop. Your code will either be equally fast, or maybe even faster. cellfun is implemented with a loop anyway, there is no advantage of using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp, and build your output array any way you like.

using ^ (caret) inside the states in lex/flex

I'll put up my lex code first(lex body only).
%%
ps {BEGIN STATE1;}
. ;
<STATE1>^[0-9] print("number after ps".)
with this code I'm trying to match a number right after the letters "ps". Thats why I used ^ character.
But the code doesn't match any correct strings such as ps3, ps4fd,ps554 etc.
Then I removed the ^ and tried but then it worked but also matches strings like pserd7, psfh45,psfhdjh4er etc.
I know that I can solve the problem without using states (ps[0-9].*). But I have to do this with states. How can I fix this? thanks....
with this code I'm trying to match a number right after the letters "ps". Thats why I used ^ character
But ^ doesn't mean that. It means 'beginning of line'.
I know that I can solve the problem without using states (ps[0-9].*). But I have to do this with states.
Why? Very strange requirement.
You need to add more rules to cover the other possibilities. For example:
<STATE1>. { BEGIN INITIAL; }
But this depends on what else if anything is legal after 'ps'.

C++ Simple use of regex

I'm just trying to mess around and get familiar with using regex in c++.
Let's say I want the user to input the following: ###-$$-###, make #=any number between 0-9 and $=any number between 0-5. This is my idea for accomplishing this:
regex rx("[0-9][0-9][0-9]""\\-""[0-5][0-5]")
That's not the exact code however that's the general idea to check whether or not the user's input is a valid string of numbers. However, let's say i won't allow numbers starting with a 0 so: 099-55-999 is not acceptable. How can I check something like that and output invalid? Thanks
[0-9]{3}-[0-5]{2}-[0-9]{3}
matches a string that starts with three digits between 0 and 9, followed by a dash, followed by two digits between 0 and 5, followed by a dash, followed by three digits between 0 and 9.
Is that what you're looking for? This is very basic regex stuff. I suggest you look at a good tutorial.
EDIT: (after you changed your question):
[1-9][0-9]{2}-[0-5]{2}-[0-9]{3}
would match the same as above except for not allowing a 0 as the first character.
std::tr1::regex rx("[0-9]{3}-[0-5]{2}-[0-9]{3}");
Your talking about using tr1 regex in c++ right and not the managed c++? If so, go here where it explains this stuff.
Also, you should know that if your using VS2010 that you don't need the boost library anymore for regex.
Try this:
#include <regex>
#include <iostream>
#include <string>
int main()
{
std::tr1::regex rx("\\d{3}-[0-5]{2}-\\d{3}");
std::string s;
std::getline(std::cin,s);
if(regex_match(s.begin(),s.end(),rx))
{
std::cout << "Matched!" << std::endl;
}
}
For explanation check #Tim's answer. Do note the double \ for the digit metacharacter.

Why exception with this regular expression pattern (tr1::regex)?

I came accross a very strange problem with tr1::regex (VS2008) that I can't figure out the reason for. The code at the end of the post compiles fine but throws an exception when reaching the 4th regular expression definition during execution:
Microsoft C++ exception: std::tr1::regex_error at memory location 0x0012f5f4..
However, the only difference I can see (maybe I am blind) between the 3rd and 4th one is the 'NumberOfComponents' instead of 'SchemeVersion'. At first I thought maybe both (3rd and 4th) are wrong and the error from the 3rd is just triggered in the 4th. That seems not to be the case as I moved both of them around and put multiple other regex definitions between them. The line in question always triggers the exception.
Does anyone have any idea why that line
std::tr1::regex rxNumberOfComponents("\\NumberOfComponents:(\\s*\\d+){1}");
triggers an exception but
std::tr1::regex rxSchemeVersion("\\SchemeVersion:(\\s*\\d+){1}");
doesn't? Is the runtime just messing with me?
Thanks for the time to read this and for any insights.
T
PS: I am totally sure the solution is so easy I have to hit my head against the nearest wall to even out the 'stupid question' karma ...
#include <regex>
int main(void)
{
std::tr1::regex rxSepFileIdent("Scanner Separation Configuration");
std::tr1::regex rxScannerNameIdent("\\ScannerName:((\\s*\\w+)+)");
std::tr1::regex rxSchemeVersion("\\SchemeVersion:(\\s*\\d+){1}");
std::tr1::regex rxNumberOfComponents("\\NumberOfComponents:(\\s*\\d+){1}");
std::tr1::regex rxConfigStartIdent("Configuration Start");
std::tr1::regex rxConfigEndIdent("Configuration End");
return 0;
}
You need to double-escape your backslashes - once for the regex itself, a second time for the string they're in.
The one that starts with S works because \S is a valid regex escape (non-whitespace characters). The one that starts with N does not (because \N is not a valid regex escape).
Instead, use "\\\\SchemeVersion: et cetera.

Why is this regular expression faster?

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
if (rTest.IsMatch(formattedString))
return rTest.Replace(formattedString, string.Empty);
else
return formattedString;
}
I'm new to regular expressions and I was suggested to use this:
static Regex rText = new Regex(#"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
new Regex(#"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster to and reduced the impact on my text rendering. Can someone explain to a regexp newbie, why? :)
Do you really want to do run the regexp twice? Without having checked (bad me) I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? is going to do lazy quantifing. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.
I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]
Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)