Why is a character class faster than alternation? - regex

It seems that using a character class is faster than the alternation in an example like:
[abc] vs (a|b|c)
I have heard about it being recommended and with a simple test using Time::HiRes I verified it (~10 times slower).
Also using (?:a|b|c) in case the capturing parenthesis makes a difference does not change the result.
But I can not understand why. I think it is because of backtracking but the way I see it at each position there are 3 character comparison so I am not sure how backtracking hits in affecting the alternation. Is it a result of the implementation's nature of alternation?

This is because the "OR" construct | backtracks between the alternation: If the first alternation is not matched, the engine has to return before the pointer location moved during the match of the alternation, to continue matching the next alternation; Whereas the character class can advance sequentially. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
Pattern: [rf]at
Match string: carat
But to be short, the fact that pcre engine optimizes this (single literal characters -> character class) away is already a decent hint that alternations are inefficient.

Because a character class like [abc] is irreducable and can be optimised, whereas an alternation like (?:a|b|c) may also be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t).
The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.
There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".

Related

Why regular expression .* is slower at one place and faster at other

Lately I am using a lot of regular expressions in java/groovy. For testing I routinely use regex101.com. Obviously I am looking at the regular expressions performance too.
One thing I noticed that using .* properly can significantly improve the overall performance. Primarily, using .* in between, or better to say not at the end of the regular expression is performance kill.
For example, in this regular expression the required number of steps is 27:
If I change first .* to \s*, it will reduce the steps required significantly to 16:
However, if I change second .* to \s*, it does not reduce the steps any further:
I have few questions:
Why the above? I dont want to compare \s and .*. I know the difference. I want to know why \s and .* costs different based on their position in the complete regex. And then the characteristics of the regex which may cost different based on their position in the overall regex (or based on any other aspect other than position, if there is any).
Does the steps counter given in this site really gives any indication about regex performance?
what other simple or similar (position related) regex performance observations you have?
The following is output from the debugger.
The big reason for the difference in performance is that .* will consume everything until the end of the string (except the newline). The pattern will then continue, forcing the regex to backtrack (as seen in the first image).
The reason that \s and .* perform equally well at the end of the pattern is that the greedy pattern vs. consuming whitespace makes no difference if there's nothing else to match (besides WS).
If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern - the regex would be forced to backtrack.
EDIT
You can see the performance difference if you end with something besides whitespace:
Bad:
^myname.*mahesh.*hiworld
Better:
^myname.*mahesh\s*hiworld
Even better:
^myname\s*mahesh\s*hiworld
The way regex engines work with the * quantifier, aka greedy quantifier, is to consume everything in the input that matches, then:
try the next term in the regex. If it matches, proceed on
"unconsume" one character (move the pointer back one), aka backtrack and goto step 1.
Since . matches anything (almost), the first state after encountering .* is to move the pointer to the end of input, then start moving back through the input one char at a time trying the next term until there's a match.
With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be - no backtracking required to match the next term.
Something you should try is using the reluctant quantifier .*?, which will consume one char at a time until the next term matches, which should have the same time complexity as \s*, but be slightly more efficient because no check of the current char is required.
\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end f input that matches, which leaves the pointer is the same position for both expressions.

Regex query efficient?

I came up with the below regex expression to look for terms like Password,Passphrase,Pass001 etc and the word following it. Is it efficient or can it be made better? Thanks for the help
"([Pp][aA][sS][Ss]([wW][oO][rR][dD][sS]?|[Pp][hH][rR][aA][sS][eE])?|[Pp]([aA][sS]([sS])?)?[wW][Dd])[0-9]?[0-9]?[0-9]?[\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*"
I will be using it to scan files upto 300K for these terms. When I try now to scan with these expression a whole C: drive it takes 5 hours or worse case I have encountered, 5 days
You may use the following enhancement:
(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*
See the regex demo
Instead of [sS], you may make the regex case insensitive by adding (?i) case insensitive modifier. Use corresponding option in your software if it does not work like this.
Make sure your alternations do not match at the same location in the string. It is not quite easy here, but p at the start of each alternative in the first group decreases the regex efficiency. So, move it outside (e.g. (?:pass|port) => p(ass|ort)).
Use non-capturing groups rather than capturing ones if you are not going to access submatches, that also has a slight impact on performance.
Use limiting quantifiers instead of repeating ? quantified patterns. Instead of a?a?a?, use a{0,3}.
Do not overescape chars inside the character class. I only left \/, \] and \[ as I am not sure what regex flavor you are using, it might appear you can avoid escaping at all.
Note that a performance penalty is big if you have consecutive non-fixed width patterns that may match the same type of chars. You have [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*: [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+ matches 1 or more special chars and \S* matches 0 or more chars other than whitespace that also matches some chars matched by the preceding pattern. Remove the + from the preceding subpattern.

When to choose [^x]* or .*?

Assume i have a substring in a longer string like (...)aaabaacaaaaaXaaaadaeaa(...) and i want to match or replace the aaabaacaaaaa with the X as delimiter.
I can now use (.*?)X to find the string before the X or i can use ([^X]*) to find it. I could also use negative look-ahead but i don't think it is necessary in this case.
So which one of the two (or three) options is the better technique to get the group i want to match in this context?
Take this very simple example:
www\..*?\.com
www\.[^.]*\.com
The first one matches any input that contains a www. and a .com with anything in between. The second matches a www. and a .com that does not have a . in-between.
The first would match: www.google.something.com
The second would not.
Only use the negated class if that section absolutely cannot contain the character.
.*? is called lazy quantifier.
[^X]* is called greedy negation quantifier
Wherever possible use negation i.e. [^X] since it doesn't cause backtracking. Ofcourse if your input text can contain letter X then you have no choice but to use .*?
I am copying this text from one of the recent comment from #ridgerunner:
The expression: [^X)]* is certainly more efficient than .*? in
every language except possibly Perl (whose regex engine is highly
optimized for the lazy dot star expression). The expression .*? must
stop and backtrack once at every character position as it
"bumps-along", whereas the greedy quantifier applied to the negated
character class expression can consume the entire chunk in a single
step, with no backtracking.

Regexp Question - Negating a captured character

I'm looking for a regular expression that allows for either single-quoted or double-quoted strings, and allows the opposite quote character within the string. For example, the following would both be legal strings:
"hello 'there' world"
'hello "there" world'
The regexp I'm using uses negative lookahead and is as follows:
(['"])(?:(?!\1).)*\1
This would work I think, but what about if the language didn't support negative lookahead. Is there any other way to do this? Without alternation?
EDIT:
I know I can use alternation. This was more of just a hypothetical question. Say I had 20 different characters in the initial character class. I wouldn't want to write out 20 different alternations. I'm trying to actually negate the captured character, without using lookahead, lookbehind, or alternation.
This is actually much simpler than you may have realized. You don't really need the negative look-ahead. What you want to do is a non-greedy (or lazy) match like this:
(['"]).*?\1
The ? character after the .* is the important part. It says, consume the minimum possible characters before hitting the next part of the regex. So, you get either kind of quote, and then you go after 0-M characters until you encounter a character matching whichever quote you first ran into. You can learn more about greedy matching vs. non-greedy here and here.
Sure:
'([^']*)'|"([^"]*)"
On a successful match, the $+ variable will hold the contents of whichever alternate matched.
In the general case, regexps are not really the answer. You might be interested in something like Text::ParseWords, which tokenizes text, accounting for nested quotes, backslashed quotes, backslashed spaces, and other oddities.

Regex to *not* match any characters

I know it is quite some weird goal here but for a quick and dirty fix for one of our system we do need to not filter any input and let the corruption go into the system.
My current regex for this is "\^.*"
The problem with that is that it does not match characters as planned ... but for one match it does work. The string that make it not work is ^#jj (basically anything that has ^ ... ).
What would be the best way to not match any characters now ? I was thinking of removing the \  but only doing this will transform the "not" into a "start with" ...
The ^ character doesn't mean "not" except inside a character class ([]). If you want to not match anything, you could use a negative lookahead that matches anything: (?!.*).
A simple and cheap regex that will never match anything is to match against something that is simply unmatchable, for example: \b\B.
It's simply impossible for this regex to match, since it's a contradiction.
References
regular-expressions.info\Word Boundaries
\B is the negated version of \b. \B matches at every position where \b does not.
Another very well supported and fast pattern that would fail to match anything that is guaranteed to be constant time:
$unmatchable pattern $anything goes here etc.
$ of course indicates the end-of-line. No characters could possibly go after $ so no further state transitions could possibly be made. The additional advantage are that your pattern is intuitive, self-descriptive and readable as well!
tldr; The most portable and efficient regex to never match anything is $- (end of line followed by a char)
Impossible regex
The most reliable solution is to create an impossible regex. There are many impossible regexes but not all are as good.
First you want to avoid "lookahead" solutions because some regex engines don't support it.
Then you want to make sure your "impossible regex" is efficient and won't take too much computation steps to match... nothing.
I found that $- has a constant computation time ( O(1) ) and only takes two steps to compute regardless of the size of your text (https://regex101.com/r/yjcs1Z/3).
For comparison:
$^ and $. both take 36 steps to compute -> O(1)
\b\B takes 1507 steps on my sample and increase with the number of character in your string -> O(n)
Empty regex (alternative solution)
If your regex engine accepts it, the best and simplest regex to never match anything might be: an empty regex .
Instead of trying to not match any characters, why not just match all characters? ^.*$ should do the trick. If you have to not match any characters then try ^\j$ (Assuming of course, that your regular expression engine will not throw an error when you provide it an invalid character class. If it does, try ^()$. A quick test with RegexBuddy suggests that this might work.
^ is only not when it's in class (such as [^a-z] meaning anything but a-z). You've turned it into a literal ^ with the backslash.
What you're trying to do is [^]*, but that's not legal. You could try something like
" {10000}"
which would match exactly 10,000 spaces, if that's longer than your maximum input, it should never be matched.
((?iLmsux))
Try this, it matches only if the string is empty.
Interesting ... the most obvious and simple variant:
~^
.
https://regex101.com/r/KhTM1i/1
requiring usually only one computation step (failing directly at the start and being computational expensive only if the matched string begins with a long series of ~) is not mentioned among all the other answers ... for 12 years.
You want to match nothing at all? Neg lookarounds seems obvious, but can be slow, perhaps ^$ (matches empty string only) as an alternative?