Is this regular expression exponential? - regex

I would like to know if:
/.*(Set-Cookie: (.*))?;.*(<\?xml.*)/
is an exponential regexp.
Thanks

It really depends on the regex engine, but in most engines, that pattern is probably polynomial of a high degree (maybe cubic or higher) when there's no match.
You can use e.g. RegexBuddy to see how many steps it takes to match, and more importantly, to not match certain input. You can use this to benchmark how complex the backtracking process may be in certain engines.
It's not clear exactly what you are trying to do, but that pattern really doesn't do much with the Set-Cookie subpattern allowed to be optional (e.g. the group may not match that string even if it exists, since it's optional to begin with).
If you are trying to parse XML, then please, please, please do not use regular expressions. There are many XML parsers available in most modern languages, and they would not only be appropriate for the job, but they'd also be correct and much more pleasant to work with than regex.
References
regular-expressions.info/Catastrophic Backtracking
Related questions
Detect if a regexp is exponential
The pattern, debunked
To point out why that pattern doesn't "work" (which would render irrelevant whether or not it's fast or slow), consider the following input:
Set-Cookie: NOMNOMNOM;<?xml
With the pattern /.*(Set-Cookie: (.*))?;.*(<\?xml.*)/, the entire string is a match, but group 1 doesn't capture Set-Cookie: NOMNOMNOM, and group 2 doesn't capture NOMNOMNOM (as seen on rubular.com). That's because the leading .* gobbled up the cookie, and since the cookie subpattern is optional, it's still a match anyway.
We can try to "fix" this by making the leading .* reluctant as .*?. Now, group 1 can match Set-Cookie, which is perhaps the intent all along (as seen on rubular.com).
However, this is hardly an improvement. You really do not want to go down this direction. There are still many problems with the regex, and just getting it to work right will be very difficult, if not nearly impossible.
It should be noted that the pattern as given does match ";<?xml" (as seen on rubular.com). That is, as long as there's a ; anywhere in the string, and then later a <?xml, the pattern will match. It's not clear if this pattern really does anything useful.
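The three observations above can be reproduced with a short script. This is a minimal sketch using Python's `re` module rather than the Ruby engine behind rubular.com, on the assumption that both engines treat this particular pattern the same way; the variable names are illustrative only.

```python
import re

text = "Set-Cookie: NOMNOMNOM;<?xml"

# The pattern as given, and the variant with a reluctant leading .*?
greedy = re.compile(r".*(Set-Cookie: (.*))?;.*(<\?xml.*)")
reluctant = re.compile(r".*?(Set-Cookie: (.*))?;.*(<\?xml.*)")

greedy_match = greedy.search(text)
reluctant_match = reluctant.search(text)

# Greedy leading .*: the whole string matches, but the cookie groups stay unset.
print(greedy_match.group(0), greedy_match.groups())

# Reluctant leading .*?: the optional cookie group is tried before .*? grows.
print(reluctant_match.group(0), reluctant_match.groups())

# A bare ";<?xml" is already enough for the original pattern to match.
bare_match = greedy.search(";<?xml")
print(bare_match is not None)
```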
Related questions
Difference between .*? and .* for regex

What's the most sensible way to emulate lookbehind behavior in Rust regex?

The Rust regex crate states:
This crate provides a native implementation of regular expressions that is heavily based on RE2 both in syntax and in implementation. Notably, backreferences and arbitrary lookahead/lookbehind assertions are not provided.
As of this writing, "rust regex lookbehind" comes back with no results from DuckDuckGo.
I've never had to work around this before, but I can think of two approaches:
Approach 1 (forward)
Iterate over .captures() for the pattern I want to use as lookbehind.
Match the thing I actually wanted to match between captures. (forward)
Approach 2 (reverse)
Match the pattern I really want to match.
For each match, look for the lookbehind pattern until the end byte of a previous capture or the beginning of the string.
Not only does this seem like a huge pain, it also seems like a lot of edge cases are going to trip me up. Is there a better way to go about this?
Example
Given a string like:
"Fish33-Tiger2Hyena4-"
I want to extract ["33-", "2", "4-"] iff each one follows a string like "Fish".
Without a motivating example, it's hard to usefully answer your question in a general way. In many cases, you can substitute lookaround operators with two regexes---one to search for candidates and another to produce the actual match you're interested in. However, this approach isn't always feasible.
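For the example in the question, the two-regex substitution could look roughly like the sketch below. It is written in Python for brevity instead of Rust, and it assumes that "a string like Fish" means a run of ASCII letters and that the wanted part is a run of digits with an optional trailing dash; both patterns are illustrative guesses, not something the question specifies.

```python
import re

text = "Fish33-Tiger2Hyena4-"

# Step 1: find candidates, i.e. the "lookbehind" word plus the wanted suffix.
# (Assumption: the prefix is a run of ASCII letters.)
candidate_re = re.compile(r"[A-Za-z]+\d+-?")

# Step 2: from each candidate, keep only the trailing digits and optional dash.
wanted_re = re.compile(r"\d+-?$")

candidates = candidate_re.findall(text)
extracted = [wanted_re.search(c).group(0) for c in candidates]

print(candidates)
print(extracted)
```

The same two-pass structure should carry over to any engine without lookbehind, since neither pattern uses lookaround.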
If you're truly stuck, then your only option is to use a regex library that supports these features. Rust has bindings to a couple of them:
PCRE
PCRE2
Oniguruma
There is also a more experimental library, fancy-regex, which is built on top of the regex crate.
If you have a regex application where you have a known consistent pattern that you want to use as lookbehind, another workaround is to use .split() with the lookbehind-matching pattern as the argument (similar to the idea mentioned in the other answer). That will at least give you strings expressed by their adjacency to the match you want to lookbehind.
I don't know about performance guarantees regex-wise but this at least means that you can do a lookbehind-free regex match on the split result either N times (for N splits), or once on the concatenated result as needed.
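A rough sketch of the split idea, again in Python for brevity and with the same assumed patterns as in the question's example (letters as the lookbehind, digits plus an optional dash as the target). The first piece is skipped because it is the text before the first lookbehind match, so nothing precedes it.

```python
import re

text = "Fish33-Tiger2Hyena4-"

# Pattern that would have been the lookbehind (assumed: a run of ASCII letters).
prefix_re = re.compile(r"[A-Za-z]+")
# Lookbehind-free pattern for the part we actually want.
wanted_re = re.compile(r"\d+-?")

pieces = prefix_re.split(text)

# pieces[0] is whatever came before the first prefix, so it has no lookbehind.
# Every later piece starts right after a prefix match; anchor the match there.
extracted = []
for piece in pieces[1:]:
    match = wanted_re.match(piece)
    if match:
        extracted.append(match.group(0))

print(pieces)
print(extracted)
```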

Why don't regex engines ensure all required characters are in the string?

For example, look at this email validating regex:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$. If you look carefully, there are three parts: stuff, the # character, and more stuff. So the regex requires an email address to have an #, thus the string aaaaaaaaaaaaaaaaaaaaaa! will not match.
Yet most regex engines will catastrophically backtrack given this combination. (PCRE, which powers Regex101, is smarter than most, but other regex/string combinations can cause catastrophic backtracking.)
Without needing to know much about Big O, I can tell that combinatorial things are exponential, while searching is linear. So why don't regex engines ensure the string contains required characters (so they can quit early)?
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Coming back after getting more experience, I know that regexes are declarative, meaning the execution plan is determined by the computer, not the programmer. Optimization is one of the ways that regex engines differ the most.
While PCRE and Perl have challenged the declarative status-quo with the introduction of backtracking control verbs, it is other engines, without the verbs, which are most likely to catastrophically backtrack.
I think you're taking this the wrong way, really:
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Well, if you write a regex, your regex engine will need to follow that program you've written.
If you write a complex program, then there's nothing the engine can do about that; this regex explicitly specifies that you'll need to match "stuff" first, before looking for the #.
Now, not being too involved in writing compilers, I agree, in this case, it might be possible to first identify all the "static" elements, which here are only said #, and look for them. Sadly, in the general case, this won't really help you, because there might either be more than one static element or none at all…
If you cared about speed, you'd actually just first search for the # with plain linear search, and then do your regex thing after you've found one.
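As a minimal sketch of that suggestion, the check below does a plain substring search for the # before handing the string to the regex from the question. The function name is made up for illustration, and the pattern is copied as it appears in the question.

```python
import re

# The validating regex from the question, unchanged.
EMAIL_RE = re.compile(
    r"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$"
)


def looks_like_address(candidate: str) -> bool:
    """Hypothetical helper: cheap linear pre-check, then the full regex."""
    # Linear scan for the one required literal; quit early if it is absent.
    if "#" not in candidate:
        return False
    return EMAIL_RE.match(candidate) is not None


print(looks_like_address("john.doe#example.com"))
print(looks_like_address("a" * 22 + "!"))
```

Note that this only short-circuits inputs with no # at all. A string that does contain a # but still cannot match may backtrack just as badly as before, so the pre-check is not a general cure.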
Regexes were never meant to be as fast as linear search engines, because they were rather meant to be much, much more powerful.
So, not only are you taking the wrong person to the judge (the regex engine rather than the regex, which is a program with a complexity), you're also blaming the victim for the crime (you want to harvest the speed of just looking for the # character, but still use a regex).
By the way, don't validate email addresses with regexes. It's the wrong tool:
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

Simplifying my Eclipse regex

So, I'm fairly new to regex. That being said, I'm looking for help. I've got this regex to do what I want, but this is as simple as I can make it with my current understanding.
(\w+\s*?\w+\s*?\-*?/*?\w+)\s*?(\(\w+\))
What this needs to match are the following configurations of strings:
word
word word
word-word
word/word
word word/word
word word/LL
word word (word)
word-word word/word
I kind of feel like I'm abusing *? but I saw an example that used that and it seemed to do what I needed. I've also seen that just * will do the same? Or just ?? Anyway there it is. Thanks in advance!
Also, the grouping is there because this regex is actually significantly longer with other groups. Please keep that in mind. I need the group to still work with others (4 in total).
EDIT: Sorry everyone. I'm actually trying to convert text being copy pasted from a pdf into python syntax using the built in find/replace (using regex) in the Eclipse IDE. That's why I didn't specify what I was using. I thought it was just plain ol' regex.
Also, my current regex works perfectly. What I'm asking for here is a lesson on simplicity (and the * and ? better explained). I just felt my current regex was long and ugly.
? after other RegEx quantifiers makes them reluctant, meaning that they will match more input only when the remainder of the RegEx has not been able to match.
The reluctant ? is superfluous when the set of characters it applies to has no common character with the following set. For example in:
[0-9]*?[A-Z]
there is no way [A-Z] will match unless all previous [0-9]s have been matched. Then why make [0-9]* reluctant? On the contrary, make it greedy by removing the ?.
[0-9]*[A-Z]
There is a second case where ? is abused. For example, you know that certain text contains, say, a colon followed by an uppercase word. There are no other possible occurrences of a colon.
.*?:[A-Z]+
would do the job. However,
[^:]*:[A-Z]+
represents better the fact that a colon will always initiate what you want to match. In this case, we "created" the first condition (of character commonality) by realizing that, in fact, there never was need for one. IOW that we never needed a .* matching also :s, but just [^:]*.
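To make both cases concrete, here is a small sketch (Python's `re` instead of Java's engine, with sample strings invented for illustration) showing that the reluctant and the rewritten forms produce the same matches on such input:

```python
import re

# Case 1: [0-9] and [A-Z] share no characters, so reluctance changes nothing.
samples = ["123A", "A", "42Z7"]
lazy_digits = [re.match(r"[0-9]*?[A-Z]", s).group(0) for s in samples]
greedy_digits = [re.match(r"[0-9]*[A-Z]", s).group(0) for s in samples]
print(lazy_digits, greedy_digits)

# Case 2: the only colon in the text introduces the uppercase word.
line = "label:VALUE rest"
lazy_colon = re.match(r".*?:[A-Z]+", line).group(0)
negated_colon = re.match(r"[^:]*:[A-Z]+", line).group(0)
print(lazy_colon, negated_colon)
```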
I'm reluctant to use the reluctant operator because it sometimes tends to obscure patterns instead of clarifying them, and also because of its performance implications, since it can increase the amount of backtracking enormously (and without a reason).
Applying these principles,
(\w+\s*\w+\s*\-*/*\w+)\s*(\(\w+\))
seems a better option. Also, at some point you use \-*/*. It's hard to know what you really want without as many counter-examples as (positive) examples (and this is extremely important while developing and testing any RegEx!), but do you really want to accept perhaps many -s followed by perhaps many /s? My impression is that what you are looking for is one - or one / or one space. [ \-/] would do much better. Or perhaps \s*[\-/]?\s* if you want to accept multiple spaces, even before and/or after the [\-/]
(\w+\s*\w+\s*[\-/]?\s*\w+)\s*(\(\w+\))
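A quick way to compare the three variants is to run them against a few strings. The sketch below uses Python's `re` rather than Eclipse's Java engine, and the test strings (including the deliberately odd `word--//word (x)`) are made up for illustration, in the spirit of the counter-examples mentioned above.

```python
import re

original = re.compile(r"(\w+\s*?\w+\s*?\-*?/*?\w+)\s*?(\(\w+\))")
greedy = re.compile(r"(\w+\s*\w+\s*\-*/*\w+)\s*(\(\w+\))")
single_sep = re.compile(r"(\w+\s*\w+\s*[\-/]?\s*\w+)\s*(\(\w+\))")

samples = ["word word (word)", "word-word (x)", "word/word (x)"]

# Dropping the reluctant ? does not change what the groups capture here.
original_groups = [original.fullmatch(s).groups() for s in samples]
greedy_groups = [greedy.fullmatch(s).groups() for s in samples]
single_groups = [single_sep.fullmatch(s).groups() for s in samples]
print(original_groups == greedy_groups == single_groups)

# Counter-example: several -s followed by several /s.
odd = "word--//word (x)"
print(greedy.fullmatch(odd) is not None)      # accepted by \-*/*
print(single_sep.fullmatch(odd) is not None)  # rejected by [\-/]?
```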
See the Java documentation on Regular Expressions to find out more.
p.s.w.g was correct in pointing out that (.*) is the simplest form of what I needed. The other 3 grouping of my regular expression are specific enough that this works. Thank you p.s.w.g.
PS still don't know why I was down-voted

Can a regular expression be tested to see if it reduces to .*

I'm developing an application where users enter a regular expression as a filter criterion, however I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), then this could be easily sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if it is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could see if the expression is one or more repetitions of .* (i.e. if it matches (\.\*)+; quotations/escapes may not be entirely accurate, but you get the idea). The problem with this is that there may be other forms of writing a global match (e.g. with $ and ^) that are too numerous to even think of upfront, let alone test.
I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)
Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test. Just generate some random strings to see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
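A minimal sketch of such a probabilistic test follows, in Python rather than the Java the application uses. The function name, the alphabet, the trial count, and the choice to always include the empty string are all assumptions made for illustration; newlines are left out of the alphabet because a plain .* does not match them.

```python
import random
import re
import string


def looks_like_match_all(pattern: str, trials: int = 200, seed: int = 0) -> bool:
    """Hypothetical helper: True if the pattern matched every sample in full."""
    rng = random.Random(seed)  # seeded so repeated runs give the same verdict
    regex = re.compile(pattern)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "

    # Always try the empty string, then a batch of random short strings.
    samples = [""]
    for _ in range(trials):
        length = rng.randint(1, 20)
        samples.append("".join(rng.choice(alphabet) for _ in range(length)))

    return all(regex.fullmatch(s) is not None for s in samples)


for candidate in [r".*", r".*.*", r"^.*$", r"(a|[^a])*", r"\d+", r"foo"]:
    print(candidate, looks_like_match_all(candidate))
```

A True result is only evidence, not proof, that the filter is too broad, and a user-supplied pattern that backtracks badly could still make the test itself slow.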
There are many, many ways to achieve something equivalent to .*. For example, just put any class of characters and its counterpart into a class or an alternation and it will match anything.
So I think it's not possible to use a regular expression to test another regular expression for equivalence to .*.
These are some examples that would match the same as .* (they will additionally match newline characters):
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.
Thanks everyone,
I did miss the testing for equivalence entry on the wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.

improving Perl regex performance by adding +

I have some regexes in a Perl script that are correct but slow. I am considering trying to improve performance by adding extra + operators (i.e. *+ instead of * and ++ instead of +) to disable backtracking. I tried replacing all of them and the regexes stopped working... so much for the simple solution. How do I know where I can add them without breaking the regex?
If the regexes stopped working, you either aren't using a version of perl that supports them, or you actually do need backtracking in those cases.
Identify sections of the regex that won't ever need backtracking (that is, that if asked to match starting at a given point, there will never be more than one length you might want them to match), and surround them with (?> ). This has the same effect as ++/*+ and is supported even pre-5.10.
Note that restricting backtracking is often not "optimization", since it changes what will and will not be matched. The idea is that you use it to better describe what you actually want matched. Borrowing from the article linked in the OP's answer, something like ^(.*?,){11}P (the twelfth comma-separated field starts with P) is not just inefficient, it is incorrect, since backtracking will cause it to actually match even when only a field after the twelfth starts with P. By correcting it to ^(?>.*?,){11}P you are restricting it to actually matching the correct number of leading fields. (In this trivial case, ^([^,]*,){11}P also does the job, but if you add in support for escaped or quoted commas within fields using alternation, (?> becomes the easier choice.)
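The difference in what gets matched can be checked with two made-up inputs. This sketch uses Python's `re` instead of Perl, and it compares the backtracking pattern with the negated-class variant only, because atomic groups are, as far as I know, available in Python's `re` only from version 3.11 onward.

```python
import re

backtracking = re.compile(r"^(.*?,){11}P")
negated_class = re.compile(r"^([^,]*,){11}P")

# Eleven leading fields, so the twelfth field starts with P.
twelfth_starts_with_p = "a," * 11 + "Pxx"
# Twelve leading fields, so only the thirteenth field starts with P.
thirteenth_starts_with_p = "a," * 12 + "Pxx"

results = {
    "backtracking/12th": backtracking.match(twelfth_starts_with_p) is not None,
    "negated/12th": negated_class.match(twelfth_starts_with_p) is not None,
    # Backtracking lets one .*? swallow a comma, so this matches by mistake.
    "backtracking/13th": backtracking.match(thirteenth_starts_with_p) is not None,
    "negated/13th": negated_class.match(thirteenth_starts_with_p) is not None,
}
print(results)
```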
Hmmm... once I posted the question, looking at the "Related" column led me to this which has some pretty good ideas.... http://www.regular-expressions.info/catastrophic.html