Regex query efficient? - regex

I came up with the regex below to look for terms like Password, Passphrase, Pass001, etc., and the word following them. Is it efficient, or can it be made better? Thanks for the help.
"([Pp][aA][sS][Ss]([wW][oO][rR][dD][sS]?|[Pp][hH][rR][aA][sS][eE])?|[Pp]([aA][sS]([sS])?)?[wW][Dd])[0-9]?[0-9]?[0-9]?[\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*"
I will be using it to scan files up to 300K for these terms. When I currently scan a whole C: drive with this expression, it takes 5 hours or, in the worst case I have encountered, 5 days.

You may use the following enhancement:
(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*
See the regex demo
Instead of [sS], you may make the regex case insensitive by adding (?i) case insensitive modifier. Use corresponding option in your software if it does not work like this.
Make sure your alternations do not match at the same location in the string. It is not quite easy here, but p at the start of each alternative in the first group decreases the regex efficiency. So, move it outside (e.g. (?:pass|port) => p(ass|ort)).
Use non-capturing groups rather than capturing ones if you are not going to access submatches, that also has a slight impact on performance.
Use limiting quantifiers instead of repeating ? quantified patterns. Instead of a?a?a?, use a{0,3}.
Do not overescape chars inside the character class. I only left \/, \] and \[ as I am not sure what regex flavor you are using, it might appear you can avoid escaping at all.
Note that the performance penalty is big if you have consecutive non-fixed-width patterns that may match the same type of chars. You have [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*: here [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+ matches 1 or more special chars, and \S* matches 0 or more non-whitespace chars, which include some of the chars matched by the preceding pattern. Remove the + from the preceding subpattern.
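To make the points above concrete, here is a minimal sketch in Python (the question does not say which regex flavor or language is in use, so Python's re module is an assumption, and find_password_terms is a hypothetical helper name). It compiles the enhanced pattern once and reuses it, which matters when scanning many files.

```python
import re

# The enhanced pattern from the answer above, compiled once and reused.
PATTERN = re.compile(
    r"(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*"
)

def find_password_terms(text):
    """Return every substring of `text` matched by PATTERN (hypothetical helper)."""
    return [m.group(0) for m in PATTERN.finditer(text)]

# Assumed sample inputs, not taken from the question.
samples = ["Password=hunter2", "passwd-abc", "Pass001 secret", "Password: hunter2"]
for sample in samples:
    print(repr(sample), "->", find_password_terms(sample))
```

One thing the sketch makes visible: with the + removed, only a single separator character is consumed, so for "Password: hunter2" the match stops at "Password:" because the space after the colon is not matched by \S*. If the following word is needed in that case, the separator part has to be revisited.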

Related

is this regex vulnerable to REDOS attacks

Regex :
^\d+(\.\d+)*$
I tried to break it with :
1234567890.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1x]
that is 200x".1"
I have read about ReDos attacks from :
Preventing Regular Expression Denial of Service (ReDoS)
Runaway Regular Expressions: Catastrophic Backtracking
However, I am not too confident in my skills to prepare a ReDos attack on an expression. I tried to trigger catastrophic backtracking due to "Nested Quantifiers".
Is that expression breakable? What input should be used for that and, if yes, how did you come up with it?
"Nested quantifiers" isn't inherently a problem. It's just a simple way to refer to a problem which is actually quite a bit more complicated. The problem is "quantifying over a sub-expression which can, itself, match in many ways at the same position". It just turns out that you almost always need a quantifier in the inner sub-expression to provide a rich enough supply of matches, and so quantifiers inside quantifiers serve as a red flag that indicates the possibility of trouble.
(.*)* is problematic because .* has maximum symmetry — it can match anything between zero and all of the remaining characters at any point of the input. Repeating this leads to a combinatorial explosion.
([0-9a-f]+\d+)* is problematic because at any point in a string of digits, there will be many possible ways to allocate those digits between an initial substring of [0-9a-f]+ and a final substring of \d+, so it has the same exact issue as (.*)*.
(\.\d+)* is not problematic because \. and \d match completely different things. A digit isn't a dot and a dot isn't a digit. At any given point in the input there is only one possible way to match \., and only one possible way to match \d+ that leaves open the possibility of another repetition (consume all of the digits, because if we stop before a digit, the next character is certainly not a dot). Therefore (\.\d+)* is no worse, backtracking-wise, than a \d* would be in the same context, even though it contains nested quantifiers.
Your regex is safe, but only because of "\."
Testing on regex101.com shows that there are no combinations of inputs that create runaway checks - but your regex is VERY close to being vulnerable, so be careful when modifying it.
As you've read, catastrophic backtracking happens when two quantifiers are right next to each other. In your case, the regex expands to \d+\.\d+\.\d+\.\d+\. ... and so on. Because you make the dot required for every single match between \d+, your regex grows by only three steps for each period-number you add. (This translates to 4 steps per period-number if you put an invalid character at the end.) That's a linear growth rate, so your regex is fine. Demo
However, if you make the \. optional, accidentally forget the escape character so that it becomes a plain ol' ., or remove it altogether, then you're in trouble. Such a regex would allow catastrophic backtracking; with an invalid character at the end, the runtime approximately doubles with every additional digit you add before it. That's an exponential growth rate, and it's enough to time out the regex101 engine at its default settings with just 18 digits and 1 invalid character. Demo
As written, your regex is fine, and will remain so as long as you ensure there's something "solid" between the first \d+ and the second \d+, as well as something "solid" between the second \d+ and the * outside its capture group.
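A rough way to see the difference yourself, sketched in Python (an assumption, since the question names no language). SAFE is the regex from the question; RISKY is a hypothetical variant with the dot made optional, as described above. Timings depend on the machine and engine, so no particular numbers are claimed here.

```python
import re
import time

SAFE = re.compile(r"^\d+(\.\d+)*$")    # the regex from the question
RISKY = re.compile(r"^\d+(\.?\d+)*$")  # hypothetical variant: the dot made optional

def time_match(pattern, text):
    """Return (match object or None, elapsed seconds) for one match attempt."""
    started = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - started

# Digits followed by one invalid character force the engine to try every split.
for digits in (10, 14, 18):
    bad_input = "1" * digits + "x"
    _, safe_seconds = time_match(SAFE, bad_input)
    _, risky_seconds = time_match(RISKY, bad_input)
    print(f"{digits} digits: safe {safe_seconds:.6f}s, risky {risky_seconds:.6f}s")
```

Keep the digit count small when experimenting with the risky variant, since each extra digit roughly doubles the work.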

Why regular expression .* is slower at one place and faster at other

Lately I am using a lot of regular expressions in java/groovy. For testing I routinely use regex101.com. Obviously I am looking at the regular expressions performance too.
One thing I noticed is that using .* properly can significantly improve overall performance. Primarily, using .* in the middle of the regular expression, or better to say anywhere but at the end, is a performance killer.
For example, in this regular expression the required number of steps is 27:
If I change first .* to \s*, it will reduce the steps required significantly to 16:
However, if I change second .* to \s*, it does not reduce the steps any further:
I have few questions:
Why the above? I don't want to compare \s and .*; I know the difference. I want to know why \s and .* cost differently depending on their position in the complete regex, and which characteristics of a regex make its cost depend on position within the overall regex (or on any other aspect besides position, if there is one).
Does the step counter on this site really give any indication of regex performance?
What other simple or similar (position-related) regex performance observations do you have?
The following is output from the debugger.
The big reason for the difference in performance is that .* will consume everything until the end of the string (except the newline). The pattern will then continue, forcing the regex to backtrack (as seen in the first image).
The reason that \s and .* perform equally well at the end of the pattern is that the greedy pattern vs. consuming whitespace makes no difference if there's nothing else to match (besides WS).
If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern - the regex would be forced to backtrack.
EDIT
You can see the performance difference if you end with something besides whitespace:
Bad:
^myname.*mahesh.*hiworld
Better:
^myname.*mahesh\s*hiworld
Even better:
^myname\s*mahesh\s*hiworld
The way regex engines work with the * quantifier, aka greedy quantifier, is to consume everything in the input that matches, then:
1. try the next term in the regex. If it matches, proceed on
2. "unconsume" one character (move the pointer back one), aka backtrack, and go to step 1.
Since . matches anything (almost), the first state after encountering .* is to move the pointer to the end of input, then start moving back through the input one char at a time trying the next term until there's a match.
With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be - no backtracking required to match the next term.
Something you should try is using the reluctant quantifier .*?, which will consume one char at a time until the next term matches, which should have the same time complexity as \s*, but be slightly more efficient because no check of the current char is required.
\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end of the input that matches, which leaves the pointer in the same position for both expressions.
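The consume-then-backtrack loop described above can be sketched as a toy model in Python. This is not how any real engine is implemented; greedy_then_literal is a hypothetical helper and the sample text is assumed. It only counts how many positions are tried for the term that follows a greedy X*.

```python
def greedy_then_literal(text, start, accepts, literal):
    """Model `X*literal` beginning at `start`: consume greedily, then back off.

    Returns (position where `literal` matched or None, number of attempts).
    """
    end = start
    while end < len(text) and accepts(text[end]):
        end += 1                       # greedy phase: take everything X accepts
    attempts = 0
    for pos in range(end, start - 1, -1):
        attempts += 1                  # try the next term, else give one char back
        if text.startswith(literal, pos):
            return pos, attempts
    return None, attempts

text = "myname   mahesh"               # assumed sample input
start = len("myname")
any_char = lambda ch: ch != "\n"       # what . accepts
whitespace = lambda ch: ch.isspace()   # what \s accepts

print(greedy_then_literal(text, start, any_char, "mahesh"))    # (9, 7)
print(greedy_then_literal(text, start, whitespace, "mahesh"))  # (9, 1)
```

In this model .* runs to the end of the input and needs seven attempts to find mahesh, while \s* stops exactly where the next term begins and needs one.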

Why is a character class faster than alternation?

It seems that using a character class is faster than the alternation in an example like:
[abc] vs (a|b|c)
I have heard about it being recommended and with a simple test using Time::HiRes I verified it (~10 times slower).
Also using (?:a|b|c) in case the capturing parenthesis makes a difference does not change the result.
But I can not understand why. I think it is because of backtracking but the way I see it at each position there are 3 character comparison so I am not sure how backtracking hits in affecting the alternation. Is it a result of the implementation's nature of alternation?
This is because the "OR" construct | backtracks between the alternatives: if the first alternative does not match, the engine has to return to the position the pointer was at before that attempt in order to try the next alternative, whereas the character class can advance sequentially. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
Pattern: [rf]at
Match string: carat
But to be short, the fact that the PCRE engine optimizes this away (single literal characters -> character class) is already a decent hint that alternations are inefficient.
Because a character class like [abc] is irreducible and can be optimised, whereas an alternation like (?:a|b|c) may also be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t).
The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.
There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".
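If you want to repeat the measurement outside Perl, here is a minimal sketch using Python's re and timeit (an assumption; the question used Perl with Time::HiRes, and the sample text is made up). Results vary by engine: engines that rewrite single-character alternations into a set, as mentioned above for PCRE, may show little or no difference.

```python
import re
import timeit

text = "xyz" * 2000 + "abc"            # assumed sample: mostly non-matching chars
char_class = re.compile(r"[abc]")
alternation = re.compile(r"(?:a|b|c)")

# Both forms must find exactly the same matches; only the cost may differ.
assert char_class.findall(text) == alternation.findall(text)

for name, pattern in (("[abc]", char_class), ("(?:a|b|c)", alternation)):
    seconds = timeit.timeit(lambda: pattern.findall(text), number=200)
    print(f"{name:12} {seconds:.4f}s")
```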

When to choose [^x]* or .*?

Assume I have a substring in a longer string like (...)aaabaacaaaaaXaaaadaeaa(...) and I want to match or replace the aaabaacaaaaa, with the X as delimiter.
I can now use (.*?)X to find the string before the X, or I can use ([^X]*) to find it. I could also use a negative look-ahead, but I don't think it is necessary in this case.
So which one of the two (or three) options is the better technique to get the group I want to match in this context?
Take this very simple example:
www\..*?\.com
www\.[^.]*\.com
The first one matches any input that contains a www. and a .com with anything in between. The second matches a www. and a .com that does not have a . in-between.
The first would match: www.google.something.com
The second would not.
Only use the negated class if that section absolutely cannot contain the character.
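The difference between the two patterns can be checked quickly. A small sketch in Python (the language is an assumption; the patterns and the first test string are the ones from this answer, www.google.com is an added sample):

```python
import re

lazy_dot = re.compile(r"www\..*?\.com")      # anything between www. and .com
negated = re.compile(r"www\.[^.]*\.com")     # no dot allowed in between

for url in ("www.google.com", "www.google.something.com"):
    print(url, bool(lazy_dot.search(url)), bool(negated.search(url)))
```

Both patterns accept www.google.com, but only the lazy version accepts www.google.something.com.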
.*? is called a lazy quantifier.
[^X]* is a greedy quantifier applied to a negated character class.
Wherever possible, use the negated class, i.e. [^X], since it doesn't cause backtracking. Of course, if the text you need to match can contain the letter X, then you have no choice but to use .*?
I am copying this text from a recent comment by #ridgerunner:
The expression: [^X)]* is certainly more efficient than .*? in
every language except possibly Perl (whose regex engine is highly
optimized for the lazy dot star expression). The expression .*? must
stop and backtrack once at every character position as it
"bumps-along", whereas the greedy quantifier applied to the negated
character class expression can consume the entire chunk in a single
step, with no backtracking.
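For the string in the question, both approaches capture the same group. A minimal check in Python (an assumed language; the sample string is the one from the question without the surrounding (...) parts):

```python
import re

s = "aaabaacaaaaaXaaaadaeaa"

lazy = re.search(r"(.*?)X", s)        # expands one char at a time until X
negated = re.search(r"([^X]*)X", s)   # consumes the whole run of non-X chars

print(lazy.group(1))
print(negated.group(1))
```

Since the result is identical here, the choice comes down to the efficiency argument quoted above and to whether the captured part may ever contain the delimiter.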

What does ?: do in regex

I have a regex that looks like this
/^(?:\w+\s)*(\w+)$*/
What is the ?:?
It indicates that the subpattern is a non-capture subpattern. That means whatever is matched in (?:\w+\s), even though it's enclosed by () it won't appear in the list of matches, only (\w+) will.
You're still looking for a specific pattern (in this case, a single whitespace character following at least one word), but you don't care what's actually matched.
It means only group but do not remember the grouped part.
By default ( ) tells the regex engine to remember the part of the string that matches the pattern between it. But at times we just want to group a pattern without triggering the regex memory, to do that we use (?: in place of (
Further to the excellent answers provided, its usefulness is also to simplify the code required to extract groups from the matched results. For example, your (\w+) group is known as group 1 without having to be concerned about any groups that appear before it. This may improve the maintainability of your code.
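A short illustration of the group numbering, sketched in Python (an assumption, since the regex in the question uses /.../ delimiters from another language; the pattern below omits the stray * after $ and the sample string is made up):

```python
import re

text = "hello big world"   # assumed sample input

# Non-capturing prefix: only the last word is stored, as group 1.
with_noncapturing = re.match(r"^(?:\w+\s)*(\w+)$", text)
print(with_noncapturing.groups())

# Capturing prefix: the prefix takes group 1 (holding its last repetition)
# and the last word moves to group 2.
with_capturing = re.match(r"^(\w+\s)*(\w+)$", text)
print(with_capturing.groups())
```

Both patterns match the same text; the only difference is what ends up in the list of groups.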
Let's understand it with an example.
Say we are given the string s = "a eeee".
With the regex /^(\w+\s)(\w+)$/ (no ?:), the match starts at the beginning of the string, finds 'a' followed by a whitespace character, and stores 'a ' (a with the trailing whitespace) as group 1; 'eeee' becomes group 2.
If you don't want that part in the results, write (?:\w+\s) instead: the 'a ' is still matched, but it is not stored, so the only captured group is 'eeee'. In other words, ?: groups part of the pattern for matching without adding it to the list of captured groups.
PS: I am a beginner in regex. This is what I have understood about ?:. Feel free to point out anything I explained wrongly.