Improving Perl regex performance by adding +

I have some regexes in a Perl script that are correct but slow. I am considering trying to improve performance by adding extra + operators (i.e. *+ instead of * and ++ instead of +) to disable backtracking. I tried replacing all of them and the regexes stopped working... so much for the simple solution. How do I know where I can add them without breaking the regex?

If the regexes stopped working, either you aren't using a version of Perl that supports possessive quantifiers (they require 5.10 or later), or you actually do need backtracking in those cases.
Identify sections of the regex that won't ever need backtracking (that is, sections where, if asked to match starting at a given point, there is never more than one length you might want them to match), and surround them with (?> ). This has the same effect as ++/*+ and is supported even pre-5.10.
Note that restricting backtracking is often not "optimization", since it changes what will and will not be matched. The idea is that you use it to better describe what you actually want matched. Borrowing from the article linked in the OP's answer, something like ^(.*?,){11}P (twelfth comma separated field starts P) is not just inefficient, it is incorrect, since backtracking will cause it to actually match even when only a field after the twelfth starts with P. By correcting it to ^(?>.*?,){11}P you are restricting it to actually matching the correct number of leading fields. (In this trivial case, ^([^,]*,){11}P also does the job, but if you add in support for escaped or quoted commas within fields using alternation, (?> becomes the easier choice.)
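To see the difference concretely, here is a small Perl sketch (the CSV line is made up so that only a field after the twelfth starts with P):
use strict;
use warnings;

# Made-up CSV line: fields 1-11 are "a", field 12 is "x",
# and only field 14 starts with "P".
my $line = join ',', ('a') x 11, 'x', 'y', 'Pz';

# Backtracking version: each .*? can be stretched to swallow extra commas,
# so this matches even though field 12 does not start with "P".
print "backtracking version matches\n" if $line =~ /^(.*?,){11}P/;

# Atomic version: each (?>.*?,) is committed to exactly one field,
# so "P" must begin field 12; this one does not match.
print "atomic version matches\n" if $line =~ /^(?>.*?,){11}P/;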

Hmmm... once I posted the question, looking at the "Related" column led me to this article, which has some pretty good ideas: http://www.regular-expressions.info/catastrophic.html

Related

Can this regex be made memory efficient

I get XML as a plain, unformatted text blob. I have to make some replacements, and I use regex find and replace.
For example:
<MeasureValue><Text value="StartCalibration" /></MeasureValue>
has to be converted to
<MeasureValue type="Text" value="StartCalibration"/>
The regex I wrote was
<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>
And the replacement part was:
<MeasureValue type="$2" value="$3"/>
Here is a link showing the same.
The issue is that in a file with 370 such occurrences, I get an out-of-memory error. I have heard of so-called greedy regex patterns and am wondering if that could be what is plaguing me. If this is already memory efficient, then I will leave it as it is and try to increase the server memory. I have to process thousands of such documents.
EDIT: This is part of a script for Logstash from Elasticsearch. As per the documentation, Elasticsearch uses Apache Lucene internally to parse regular expressions. Not sure if that helps.
As a rule of thumb, specificity is positively correlated with efficiency in regex.
So, know your data and build something to surgically match it.
The more specifically you write your regex, even if that means literally spelling out the pattern (and usually ending up with a monstrous-looking expression), the fewer resources it will take, because there are fewer "possibilities" it can match in your data.
To be more precise, imagine we are trying to match a string
2014-08-26 app[web.1]: 50.0.134.125
Approaches such as
(.*) (.*) (.*)
leaves things too open and prone to matching MANY different patterns and combinations, and thus takes a LOT more work to process all the possibilities. Check here: https://regex101.com/r/GvmPOC/1
On the other hand, you could spend a little more time building a more elaborate expression such as:
^[0-9]{4}\-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]\: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$
And I agree, it is horrible, but it is much more precise. It won't waste your precious resources finding unnecessary stuff. Check here: https://regex101.com/r/quz7fo/1
Another thing to keep in mind: operators such as * or + perform a scan, which, depending on the size of your string, can take some time. Also, whenever possible, specifying the anchors ^ and $ helps the engine avoid hunting for a match at too many positions within the same string.
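To illustrate the point with a small Perl sketch (the junk string is invented; the specific pattern is the one above, minus the redundant escapes):
use strict;
use warnings;

my $log  = '2014-08-26 app[web.1]: 50.0.134.125';
my $junk = 'hello world this is not a log line';

my $loose    = qr/(.*) (.*) (.*)/;
my $specific = qr/^[0-9]{4}-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/;

for my $s ($log, $junk) {
    print "loose    ", ($s =~ $loose    ? "matches" : "does not match"), ": $s\n";
    print "specific ", ($s =~ $specific ? "matches" : "does not match"), ": $s\n";
}
The loose pattern happily accepts the junk string; the specific, anchored pattern rejects it immediately.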
Bringing it to your reality...
If we have to use regex, the million-dollar question is: how can we turn your regex into something more precise?
Since there is no limit to tag name lengths in XML... there is no way to make it utterly specific :(
We could try to specify exactly which characters to match instead of . and \w; substituting something more like [a-zA-Z] is preferable. Making use of negated character classes ([^...]) also helps to narrow down the range of possibilities.
Avoid * and ? where you can and try to use a bounded quantifier {} instead (although I don't know your data well enough to make that call, and as I stated above, XML sets no limit here).
I didn't understand precisely what the ? in your pattern does, so removing it is one thing less to process.
I ended up with something like
<(([a-zA-Z]+) value="([^"]*)"[^<>]*)>
Not many changes though. You can try to measure it to see if there was any improvement.
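If it helps to experiment outside of Logstash, here is a rough Perl sketch of the same substitution, combining the question's <MeasureValue> anchoring with the tightened classes (Logstash itself would apply this through its own filter configuration, not a Perl script):
use strict;
use warnings;

my $blob = '<MeasureValue><Text value="StartCalibration" /></MeasureValue>';

$blob =~ s{<MeasureValue><(([a-zA-Z]+)\s+value="([^"]*)"[^<>]*)></MeasureValue>}
          {<MeasureValue type="$2" value="$3"/>}g;

print "$blob\n";   # <MeasureValue type="Text" value="StartCalibration"/>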
But perhaps the best approach is not to use regex at all :(
I don't know the language you are working with, but if processing time is becoming a problem, I would suggest not using regex and trying some alternative.
If there is any possibility of using an XML parser, that would be preferable.
https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
Sorry if this isn't as conclusive as you might have expected, but the problem itself is quite open-ended.

Simplifying my Eclipse regex

So, I'm fairly new to regex. That being said, I'm looking for help. I've got this regex to do what I want, but this is as simple as I can make it with my current understanding.
(\w+\s*?\w+\s*?\-*?/*?\w+)\s*?(\(\w+\))
What this needs to match are the following configurations of strings:
word
word word
word-word
word/word
word word/word
word word/LL
word word (word)
word-word word/word
I kind of feel like I'm abusing *? but I saw an example that used that and it seemed to do what I needed. I've also seen that just * will do the same? Or just ?? Anyway there it is. Thanks in advance!
Also, the grouping is there because this regex is actually significantly longer with other groups. Please keep that in mind. I need the group to still work with others (4 in total).
EDIT: Sorry everyone. I'm actually trying to convert text being copy pasted from a pdf into python syntax using the built in find/replace (using regex) in the Eclipse IDE. That's why I didn't specify what I was using. I thought it was just plain ol' regex.
Also, my current regex works perfectly. What I'm asking for here is a lesson in simplicity (and a better explanation of * and ?). I just felt my current regex was long and ugly.
A ? after other RegEx quantifiers makes them reluctant, meaning that they consume more input only when the remainder of the RegEx has not been able to match.
The reluctant ? is superfluous when the set of characters it applies to has no common character with the following set. For example in:
[0-9]*?[A-Z]
there is no way [A-Z] will match unless all previous [0-9]s have been matched. Then why make [0-9]* reluctant? On the contrary, make it greedy by removing the ?.
[0-9]*[A-Z]
There is a second case where ? is abused. For example, you know that certain text contains, say, a colon followed by an uppercase word. There are no other possible occurrences of a colon.
.*?:[A-Z]+
would do the job. However,
[^:]*:[A-Z]+
better represents the fact that a colon will always initiate what you want to match. In this case, we "created" the first condition (no character in common) by realizing that, in fact, there never was a need for one. In other words, we never needed a .* that also matches :s, just [^:]*.
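A quick Perl sketch of both points (the sample strings are invented):
use strict;
use warnings;

# No character in common: reluctant and greedy match exactly the same thing.
for my $re (qr/[0-9]*?[A-Z]/, qr/[0-9]*[A-Z]/) {
    print "matched '$&' with $re\n" if '123A' =~ $re;
}

# [^:]*: states directly that the colon we stop at is the first one,
# which .*?: only achieves through reluctance.
for my $re (qr/.*?:[A-Z]+/, qr/[^:]*:[A-Z]+/) {
    print "matched '$&' with $re\n" if 'prefix:FOO' =~ $re;
}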
I'm reluctant to use the reluctant operator because it sometimes tends to obscure patterns instead of clarifying them, and also because of its performance implications: it can increase the amount of backtracking enormously (and for no reason).
Applying these principles,
(\w+\s*\w+\s*\-*/*\w+)\s*(\(\w+\))
seems a better option. Also, at some point you use \-*/*. It's hard to know what you really want without as many counter-examples as (positive) examples (and this is extremely important while developing and testing any RegEx!), but do you really want to accept perhaps many -s followed by perhaps many /s? My impression is that what you are looking for is one - or one / or one space. [ \-/] would do much better. Or perhaps \s*[\-/]?\s* if you want to accept multiple spaces, even before and/or after the [\-/]
(\w+\s*\w+\s*[\-/]?\s*\w+)\s*(\(\w+\))
See the Java documentation on Regular Expressions to find out more.
p.s.w.g was correct in pointing out that (.*) is the simplest form of what I needed. The other 3 groupings of my regular expression are specific enough that this works. Thank you p.s.w.g.
PS still don't know why I was down-voted

Should I create one complex RegEx or multiple and less complex ones?

Should I create one complex RegEx to tackle all the cases at hand, or should I break it into multiple, less complex ones?
I'm concerned regarding performance using complex Regex.
Will breaking the complex Regex into smaller simple regex perform better?
If you want a meaningful answer to the performance question, you need to benchmark both cases.
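In Perl, for instance, the Benchmark module makes such a comparison easy; this is only a sketch with made-up patterns and input, so substitute your real ones:
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $input = 'id=42; name=alice; role=admin';   # made-up sample input

cmpthese(-2, {
    one_complex => sub {
        $input =~ /^id=(\d+); name=(\w+); role=(\w+)$/;
    },
    several_simple => sub {
        $input =~ /id=(\d+)/;
        $input =~ /name=(\w+)/;
        $input =~ /role=(\w+)/;
    },
});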
Regarding readability/maintainability, you can write unreadable code in any language, and the same goes for regular expressions. If you write a big one, be sure to use the x modifier (IgnorePatternWhitespace in C#) and use comments to build up your regex.
A randomly chosen example from one of my past answers in C#:
MatchCollection result = Regex.Matches
(testingString,
@"
(?<=\$)   # Ensure there is a $ before the string
[^|]*     # Match any character that is not a |
(?=\|)    # Till a | is ahead
"
, RegexOptions.IgnorePatternWhitespace);
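For reference, the same technique in Perl is the /x modifier; this is just a sketch, and $testing_string and its contents are invented for illustration:
use strict;
use warnings;

# Hypothetical input: values are delimited by a leading $ and a trailing |
my $testing_string = 'noise $first| more noise $second| end';

my @values = $testing_string =~ /
    (?<=\$)   # Ensure there is a $ before the string
    [^|]*     # Match any character that is not a |
    (?=\|)    # Till a | is ahead
/gx;

print "$_\n" for @values;   # prints "first" and "second"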
I don't think there would be much of a difference nowadays because of compiler optimization; however, using a simple one makes your code easier to understand, which in turn makes maintenance easier.
Complex regular expressions can be VERY slow, but it depends on your regular expression and your environment. Take the case of string.trim(). It can be trivially implemented with regular expressions. You might use one regex or two (removing front and back whitespace separately). Here is somebody who took 11 different JavaScript trim implementations and benchmarked them in different browsers: http://blog.stevenlevithan.com/archives/faster-trim-javascript. In that case, one regex loses big time in most situations.

Can a regular expression be tested to see if it reduces to .*

I'm developing an application where users enter a regular expression as a filter criterion, however I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), then this could be easily sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if it is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could see if the expression is one or more repetitions of .* (i.e. if it matches (\.\*)+; the quoting/escaping may not be entirely accurate, but you get the idea). The problem with this is that there may be other forms of writing a global match (e.g. with $ and ^) that are too exhaustive to even think of upfront, let alone test.
I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)
Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test. Just generate some random strings and see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
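Here is a rough sketch of that probabilistic test in Perl (the OP's application is Java, but the idea carries over; the trial count, string lengths, and character pool are arbitrary choices):
use strict;
use warnings;

# Returns true if the user's pattern matched every random string we threw at it,
# which is a strong hint (not a proof) that it is essentially .*
sub looks_like_match_anything {
    my ($user_pattern, $trials) = @_;
    my $re = eval { qr/$user_pattern/ } or return 0;   # reject invalid patterns
    my @pool = ('a' .. 'z', 'A' .. 'Z', '0' .. '9', ' ', '.', '-', '_');
    for (1 .. $trials) {
        my $len = 1 + int rand 40;
        my $candidate = join '', map { $pool[int rand @pool] } 1 .. $len;
        return 0 if $candidate !~ $re;   # one rejection is enough to accept the filter
    }
    return 1;
}

print looks_like_match_anything('.*.*', 100)        ? "too broad\n" : "looks ok\n";
print looks_like_match_anything('^order-\d+$', 100) ? "too broad\n" : "looks ok\n";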
There are many, many possibilities for achieving something equivalent to .*; e.g. just put any class of characters and its counterpart into a class or an alternation and it will match anything.
So I think it's not possible to test another regular expression for equivalence to .* with a regular expression.
These are some examples that would match the same as .* (they will additionally match newline characters):
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.
Thanks everyone,
I did miss the testing-for-equivalence entry on Wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.

Is this regular expression exponential?

I would like to know if:
/.*(Set-Cookie: (.*))?;.*(<\?xml.*)/
is an exponential regexp.
Thanks
It really depends on the regex engine, but in most engines, that pattern is probably polynomial of a high degree (maybe cubic or higher) when there's no match.
You can use e.g. RegexBuddy to see how many steps it takes to match, and more importantly, to not match certain input. You can use this to benchmark how complex the backtracking process may be in certain engines.
It's not clear exactly what you are trying to do, but that pattern really doesn't do much with the Set-Cookie subpattern allowed to be optional (e.g. the group may not match that string even if it exists, since it's optional to begin with).
If you are trying to parse XML, then please, please, please do not use regular expressions. There are many XML parsers available in most modern languages, and they would not only be appropriate for the job, but they'd also be correct and much more pleasant to work with than regex.
References
regular-expressions.info/Catastrophic Backtracking
Related questions
Detect if a regexp is exponential
The pattern, debunked
To point out why that pattern doesn't "work" (which makes the question of whether it's fast or slow irrelevant), consider the following input:
Set-Cookie: NOMNOMNOM;<?xml
With the pattern /.*(Set-Cookie: (.*))?;.*(<\?xml.*)/, the entire string is a match, but group 1 doesn't capture Set-Cookie: NOMNOMNOM, and group 2 doesn't capture NOMNOMNOM (as seen on rubular.com). That's because the leading .* gobbled up the cookie, and since the cookie subpattern is optional, it's still a match anyway.
We can try to "fix" this by making the leading .* reluctant as .*?. Now, group 1 can match Set-Cookie, which is perhaps the intent all along (as seen on rubular.com).
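You can see the same thing quickly in Perl (using the input string above; this is just a sketch to show the capture groups, not a fix):
use strict;
use warnings;

my $s = 'Set-Cookie: NOMNOMNOM;<?xml';

if ($s =~ /.*(Set-Cookie: (.*))?;.*(<\?xml.*)/) {
    # Greedy leading .*: the whole string matches, but the optional
    # cookie group never participates.
    print "greedy:    group1=", $1 // 'undef', " group2=", $2 // 'undef', "\n";
}

if ($s =~ /.*?(Set-Cookie: (.*))?;.*(<\?xml.*)/) {
    # Reluctant leading .*?: now group 1 gets to capture the cookie.
    print "reluctant: group1=", $1 // 'undef', " group2=", $2 // 'undef', "\n";
}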
However, this is hardly an improvement. You really do not want to go down this direction. There are still many problems with the regex, and just getting it to work right will be very difficult, if not nearly impossible.
It should be noted that the pattern as given does match ";<?xml" (as seen on rubular.com). That is, as long as there's a ; anywhere in the string, and then later a <?xml, the pattern will match. It's not clear if this pattern really does anything useful.
Related questions
Difference between .*? and .* for regex