A Question of Greedy vs. Negated Character Classes in Regex

I have a very large file that looks like this (see below). I have two basic choices of regex to use on it (I know there may be others, but I'm really trying to compare the greedy and negated char class methods).
ftp: [^\D]{1,}
ftp: (\d)+
ftp: \d+
Note: what if I took off the parentheses around the \d?
Now, + is greedy, which forces backtracking, but the negated char class requires a char-by-char comparison. Which is more efficient? Assume the file is very, very large, so minute differences in processor usage will become exaggerated due to the length of the file.
Now that you've answered that, What if my Negated Char Class was very large, say 18 different characters? Would that change your answer?
Thanks.
ftp: 1117 bytes
ftp: 5696 bytes
ftp: 3207 bytes
ftp: 5696 bytes
ftp: 7200 bytes

[^\D]{1,} and \d+ are exactly the same. The regex parser will compile [^\D] and \d into character classes with the same content, and + is just short for {1,}.
If you want lazy repetition you can add a ? at the end.
\d+?
The character classes are usually compiled into bitmaps for ASCII-characters. For Unicode (>=256) it is implementation dependent. One way could be to create a list of ranges, and use binary search on it.
For ASCII the lookup time is constant over the size. For Unicode it is logarithmic or linear.
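One way to check the equivalence claim is to run both patterns over the sample lines. This is only a minimal sketch in Python (the question does not name a language, so Python and the `sample` variable are assumptions made for illustration):

```python
import re

# Sample lines copied from the question.
sample = """ftp: 1117 bytes
ftp: 5696 bytes
ftp: 3207 bytes"""

negated = re.compile(r"ftp: ([^\D]{1,})")  # double negation: "not a non-digit"
shorthand = re.compile(r"ftp: (\d+)")      # plain digit shorthand

# Both patterns should capture the same digit runs.
print(negated.findall(sample))
print(shorthand.findall(sample))
```

This says nothing about speed, only that the two patterns accept the same text.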

Both your expressions have the same greediness. As others have said here, except for the capturing group they will execute in the same way.
Also, in this case greediness won't matter much for execution speed, since nothing follows the \d+ in your patterns. The expression will simply consume all the digits it can find and stop when the space is encountered. No backtracking should occur with these expressions.
To make it more explicit, backtracking should occur if you have an expression like this:
\d*123
In this case the parser will engulf all the digits, then backtrack to match the three following digits.
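The effect of that backtracking can be observed from the outside by capturing what is left to \d* once the match succeeds. A small Python sketch (the capture group and the test strings are added here purely for illustration):

```python
import re

# Greedy \d* first takes all six digits, then gives back three
# so that the literal 123 can still match.
m = re.fullmatch(r"(\d*)123", "999123")
print(m.group(1))  # what \d* kept after backtracking

# With nothing after the quantifier there is nothing to give back.
m2 = re.match(r"\d+", "1117 bytes")
print(m2.group())
```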

Yeah, I agree with MizardX... these two expressions are semantically equivalent. Although the grouping could require additional resources. That's not what you were asking about.

My initial tests show that [^\D]{1,} is a bit slower than \d+: on a 184M file the former takes 9.6 seconds while the latter takes 8.2.
Without capturing (the ()'s) both are about 1 second faster, but the difference between the two is about the same.
I also did a more extensive test where the captured value is printed to /dev/null, with a third version splitting on the space, results:
([^\D]{1,}): ~18s
(\d+): ~17s
(split / /)[1]: ~17s
Edit: split version improved and time decreased to be the same or lower than (\d+)
Fastest version so far (can anyone improve?):
while (<>)
{
    if ($foo = (split / /)[1])
    {
        print $foo . "\n";
    }
}

This is kind of a trick question as written because (\d)+ takes slightly longer due to the overhead of the capturing parentheses. If you change it to \d+ they take the same amount of time in my Perl / system.

Not a direct answer to the question, but why not a different approach altogether, since you know the format of the lines already? For example, you could use a regex on the whitespace between the fields, or avoid regex altogether and split() on the whitespace, which is generally going to be faster than any regular expression, depending on the language you're using.
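For comparison, here is roughly what the split-based approach looks like in Python. It is a sketch only: the helper name `second_fields` is made up for this example, and `io.StringIO` stands in for the real file.

```python
import io

def second_fields(lines):
    """Yield the second whitespace-separated field of each line, if present."""
    for line in lines:
        parts = line.split()
        if len(parts) > 1:
            yield parts[1]

# io.StringIO stands in for the (very large) input file.
data = io.StringIO("ftp: 1117 bytes\nftp: 5696 bytes\n")
print(list(second_fields(data)))
```

Whether this beats a regex depends on the language and runtime, so it is worth timing on real data.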

Related

is this regex vulnerable to REDOS attacks

Regex :
^\d+(\.\d+)*$
I tried to break it with :
1234567890.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1x]
that is 200x".1"
I have read about ReDos attacks from :
Preventing Regular Expression Denial of Service (ReDoS)
Runaway Regular Expressions: Catastrophic Backtracking
However, I am not too confident in my skills to prepare a ReDos attack on an expression. I tried to trigger catastrophic backtracking due to "Nested Quantifiers".
Is that expression breakable? What input should be used for that and, if yes, how did you come up with it?
"Nested quantifiers" isn't inherently a problem. It's just a simple way to refer to a problem which is actually quite a bit more complicated. The problem is "quantifying over a sub-expression which can, itself, match in many ways at the same position". It just turns out that you almost always need a quantifier in the inner sub-expression to provide a rich enough supply of matches, and so quantifiers inside quantifiers serve as a red flag that indicates the possibility of trouble.
(.*)* is problematic because .* has maximum symmetry — it can match anything between zero and all of the remaining characters at any point of the input. Repeating this leads to a combinatorial explosion.
([0-9a-f]+\d+)* is problematic because at any point in a string of digits, there will be many possible ways to allocate those digits between an initial substring of [0-9a-f]+ and a final substring of \d+, so it has the same exact issue as (.*)*.
(\.\d+)* is not problematic because \. and \d match completely different things. A digit isn't a dot and a dot isn't a digit. At any given point in the input there is only one possible way to match \., and only one possible way to match \d+ that leaves open the possibility of another repetition (consume all of the digits, because if we stop before a digit, the next character is certainly not a dot). Therefore (\.\d+)* is no worse, backtracking-wise, than a \d* would be in the same context, even though it contains nested quantifiers.
Your regex is safe, but only because of "\."
Testing on regex101.com shows that there are no combinations of inputs that create runaway checks - but your regex is VERY close to being vulnerable, so be careful when modifying it.
As you've read, catastrophic backtracking happens when two quantifiers are right next to each other. In your case, the regex expands to \d+\.\d+\.\d+\.\d+\. ... and so on. Because you make the dot required for every single match between \d+, your regex grows by only three steps for each period-number you add. (This translates to 4 steps per period-number if you put an invalid character at the end.) That's a linear growth rate, so your regex is fine.
However, if you make the \. optional, accidentally forget the escape character to make it plain ol' ., or remove it altogether, then you're in trouble. Such a regex would allow catastrophic backtracking; an invalid character at the end approximately doubles the runtime with every additional number you add before it. That's an exponential growth rate, and it's enough to time out the regex101 engine at its default settings with just 18 digits and 1 invalid character.
As written, your regex is fine, and will remain so as long as you make sure there's something "solid" between the first \d+ and the second \d+, as well as something "solid" between the second \d+ and the * outside its capture group.
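To see the difference on your own machine, something like the following Python sketch can help. The `RISKY` pattern is a hypothetical edit (dot made optional) and is not from the question; the `timed` helper is also made up for this example. Timings will vary between engines and machines, so none are quoted here.

```python
import re
import time

SAFE = re.compile(r"^\d+(\.\d+)*$")     # the regex from the question
RISKY = re.compile(r"^\d+(\.?\d+)*$")   # hypothetical edit: dot made optional

def timed(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# 200 repetitions of ".1" plus an invalid trailing character.
attack = "1234567890" + ".1" * 200 + "x"
print(timed(SAFE, attack))

# Keep n small: the risky pattern's work grows rapidly with each extra digit.
for n in (10, 12, 14, 16):
    _, secs = timed(RISKY, "1" * n + "x")
    print(n, f"{secs:.4f}s")
```

Both patterns reject the invalid input; the point is how the time for the second one changes as n grows.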

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
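The trailing-newline effect and the workaround can be checked with a short script. This is a Python sketch (the question used Perl; the `check` helper is invented for this example):

```python
import re

def check(text):
    """Return results for: q[^u], q(?!u), and the lookahead-free workaround."""
    return (
        bool(re.search(r"q[^u]", text)),        # needs some character after q
        bool(re.search(r"q(?!u)", text)),       # only needs "no u next"
        bool(re.search(r"q(?:[^u]|$)", text)),  # non-u character or end of string
    )

for text in ("Iraq", "Iraq\n", "quit"):
    print(repr(text), check(text))
```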
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
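A quick Python check of this pattern, with a made-up test string that contains single and double hyphens inside the delimiters:

```python
import re

# Consume one character at a time, but only if it does not start "---".
tempered = re.compile(r"---(?:(?!---).)*---")
# The naive negated class forbids every hyphen inside the text.
naive = re.compile(r"---[^-]*---")

text = "x ---a-b--c--- y"
print(tempered.findall(text))
print(naive.findall(text))
```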
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
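Here is the same pattern tried in Python on an invented sample that mixes both quote styles:

```python
import re

# Group 1 remembers the opening quote; (?!\1). refuses to consume that same quote.
pattern = re.compile(r"""(['"])(?:(?!\1).)*\1""")

text = """say "it's" and 'a "b" c'"""
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```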
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
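The two approaches can be compared directly in Python. One caveat: Python's standard `re` module only accepts fixed-width lookbehind, so as far as I know it rejects (?<=^|,); the sketch therefore uses the double-negation form.

```python
import re

line = "aaa,111,bbb,222,333,ccc"

# Capturing version: each match consumes its delimiters, so 333 is skipped.
consuming = re.findall(r"(?:^|,)(\d+)(?:,|$)", line)

# Lookaround version: the delimiters are only asserted, never consumed.
asserting = re.findall(r"(?<![^,])\d+(?![^,])", line)

print(consuming)
print(asserting)
```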
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
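Both styles side by side, as a rough Python sketch. The names `COMBINED`, `CHECKS` and `problems`, and the error messages, are invented for this example; the minimum length of 8 follows the text above.

```python
import re

# Single-regex version: stacked lookaheads, including the minimum length.
COMBINED = re.compile(
    r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)(?=.{8})"
)

# One regex per condition, so each failure can be reported by name.
CHECKS = [
    (re.compile(r"\d"), "needs a digit"),
    (re.compile(r"[a-z]"), "needs a lower case letter"),
    (re.compile(r"[A-Z]"), "needs an upper case letter"),
    (re.compile(r"[^0-9a-zA-Z]"), "needs a symbol"),
    (re.compile(r"\A\S*\Z"), "must not contain whitespace"),
    (re.compile(r".{8}"), "needs at least 8 characters"),
]

def problems(password):
    """Return the messages for every condition the password fails."""
    return [msg for rx, msg in CHECKS if not rx.search(password)]

print(bool(COMBINED.match("Abcdef1!")), problems("Abcdef1!"))
print(bool(COMBINED.match("abcdef1!")), problems("abcdef1!"))
```

The separate checks are longer, but they can say which rule failed.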
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Well, another thing along with what others mentioned: with the negative lookahead you can match consecutive characters, e.g. you can negate ui, while with [^...] you cannot negate ui, only either u or i; and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.
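A tiny Python illustration of that difference (the word "qatar" is just a convenient test input):

```python
import re

# Lookahead: the character after q is checked but left for the group.
with_lookahead = re.search(r"q(?!u)([a-z])", "qatar").group(1)

# Negated class: the character after q is consumed, so the group gets the next one.
with_class = re.search(r"q[^u]([a-z])", "qatar").group(1)

print(with_lookahead, with_class)
```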

Regex matching numbers and decimals

I need a regex expression that will match the following:
.5
0.5
1.5
1234
but NOT
0.5.5
absnd (any letter character or space)
I have this that satisfies all but 0.5.5
^[.?\d]+$
This is a fairly common task. The simplest way I know of to deal with it is this:
^[+-]?(\d*\.)?\d+$
There are also other complications, such as whether you want to allow leading zeroes or commas or things like that. This can be as complicated as you want it to be. For example, if you want to allow the 1,234,567.89 format, you can go with this:
^[+-]?(\d*|\d{1,3}(,\d{3})*)(\.\d+)?\b$
That \b there is a word break, but I'm using it as a sneaky way to require at least one numeral at the end of the string. This way, an empty string or a single + won't match.
However, be advised that regexes are not the ideal way to parse numeric strings. All modern programming languages I know of have fast, simple, built-in methods for doing that.
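If you do want both, one option is to use the regex as a gate and let the language do the conversion. A sketch in Python (the helper name `parse_number` is made up; note that Python's `float()` on its own accepts more than this regex does, for example exponent notation):

```python
import re

NUMBER = re.compile(r"^[+-]?(\d*\.)?\d+$")
WITH_COMMAS = re.compile(r"^[+-]?(\d*|\d{1,3}(,\d{3})*)(\.\d+)?\b$")

def parse_number(text):
    """Return a float if text passes the simple regex, else None."""
    if NUMBER.match(text) is None:
        return None
    return float(text)

for text in (".5", "0.5", "1.5", "1234", "0.5.5", "absnd"):
    print(repr(text), parse_number(text))
```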
Here's a much simpler solution that doesn't use any look-aheads or look-behinds:
^\d*\.?\d+$
To clearly understand why this works, read it from right to left:
At least one digit is required at the end.
7 works
77 works
.77 works
0.77 works
0. doesn't work
empty string doesn't work
A single period preceding the digit is optional.
.77 works
77 works
..77 doesn't work
Any number of digits preceding the (optional) period.
.77 works
0.77 works
0077.77 works
0077 works
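The cases listed above are easy to turn into a table-driven check. A Python sketch, with the expected values copied from the list:

```python
import re

pattern = re.compile(r"^\d*\.?\d+$")

# Inputs and expected outcomes, taken from the walkthrough above.
cases = {
    "7": True, "77": True, ".77": True, "0.77": True,
    "0077.77": True, "0077": True,
    "0.": False, "": False, "..77": False,
}

for text, expected in cases.items():
    actual = pattern.match(text) is not None
    print(repr(text), actual, "ok" if actual == expected else "MISMATCH")
```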
Not using look-aheads and look-behinds has the added benefit of not having to worry about RegEx-based DOS attacks.
HTH
Nobody seems to be accounting for negative numbers. Also, some are creating a capture group which is unnecessary. This is the most thorough solution IMO.
^[+-]?(?:\d*\.)?\d+$
The following should work:
^(?!.*\..*\.)[.\d]+$
This uses a negative lookahead to make sure that there are fewer than two . characters in the string.
http://www.rubular.com/r/N3jl1ifJDX
This could work:
^(?:\d*\.)?\d+$

Regex to *not* match any characters

I know this is quite a weird goal, but as a quick and dirty fix for one of our systems we need to not filter any input and let the corruption go into the system.
My current regex for this is "\^.*"
The problem is that it does not match characters as planned... but for one match it does work. The string that makes it fail is ^#jj (basically anything that has ^ ...).
What would be the best way to not match any characters now ? I was thinking of removing the \  but only doing this will transform the "not" into a "start with" ...
The ^ character doesn't mean "not" except inside a character class ([]). If you want to not match anything, you could use a negative lookahead that matches anything: (?!.*).
A simple and cheap regex that will never match anything is to match against something that is simply unmatchable, for example: \b\B.
It's simply impossible for this regex to match, since it's a contradiction.
References
regular-expressions.info\Word Boundaries
\B is the negated version of \b. \B matches at every position where \b does not.
Another very well supported and fast pattern that would fail to match anything that is guaranteed to be constant time:
$unmatchable pattern $anything goes here etc.
$ of course indicates the end-of-line. No characters could possibly go after $ so no further state transitions could possibly be made. An additional advantage is that your pattern is intuitive, self-descriptive and readable as well!
tldr; The most portable and efficient regex to never match anything is $- (end of line followed by a char)
Impossible regex
The most reliable solution is to create an impossible regex. There are many impossible regexes but not all are as good.
First you want to avoid "lookahead" solutions because some regex engines don't support it.
Then you want to make sure your "impossible regex" is efficient and won't take too many computation steps to match... nothing.
I found that $- has a constant computation time ( O(1) ) and only takes two steps to compute regardless of the size of your text (https://regex101.com/r/yjcs1Z/3).
For comparison:
$^ and $. both take 36 steps to compute -> O(1)
\b\B takes 1507 steps on my sample and increases with the number of characters in your string -> O(n)
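Step counts are engine-specific, but the "never matches" property itself is easy to sanity-check. A Python sketch over a handful of invented inputs (this cannot prove the patterns are unmatchable in every engine or mode, it only shows they miss these samples):

```python
import re

# Candidate never-matching patterns taken from the answers above.
NEVER = [r"(?!.*)", r"\b\B", r"$-"]
samples = ["", "-", "$-", "a b", "^#jj", "line\n-", "-\n"]

results = {}
for source in NEVER:
    # Collect every sample the pattern unexpectedly matches.
    results[source] = [s for s in samples if re.search(source, s)]
    print(source, results[source])
```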
Empty regex (alternative solution)
If your regex engine accepts it, the best and simplest regex to never match anything might be: an empty regex .
Instead of trying to not match any characters, why not just match all characters? ^.*$ should do the trick. If you have to not match any characters then try ^\j$ (assuming, of course, that your regular expression engine will not throw an error when you provide it an invalid character class). If it does, try ^()$. A quick test with RegexBuddy suggests that this might work.
^ only means "not" when it's inside a character class (such as [^a-z], meaning anything but a-z). You've turned it into a literal ^ with the backslash.
What you're trying to do is [^]*, but that's not legal. You could try something like
" {10000}"
which would match exactly 10,000 spaces, if that's longer than your maximum input, it should never be matched.
((?iLmsux))
Try this, it matches only if the string is empty.
Interesting ... the most obvious and simple variant:
~^
.
https://regex101.com/r/KhTM1i/1
requiring usually only one computation step (failing directly at the start and being computational expensive only if the matched string begins with a long series of ~) is not mentioned among all the other answers ... for 12 years.
You want to match nothing at all? Negative lookarounds seem obvious, but can be slow; perhaps ^$ (matches empty string only) as an alternative?

How can I match a quote-delimited string with a regex?

If I'm trying to match a quote-delimited string with a regex, which of the following is "better" (where "better" means both more efficient and less likely to do something unexpected):
/"[^"]+"/ # match quote, then everything that's not a quote, then a quote
or
/".+?"/ # match quote, then *anything* (non-greedy), then a quote
Assume for this question that empty strings (i.e. "") are not an issue. It seems to me (no regex newbie, but certainly no expert) that these will be equivalent.
Update: Upon reflection, I think changing the + characters to * will handle empty strings correctly anyway.
You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:
"[^"]*"!
or:
".*?"!
The difference appears when you have the subject:
"one" "two"!
The first regex matches:
"two"!
while the second regex matches:
"one" "two"!
Always be as specific as you can. Use the negated character class when you can.
Another difference is that [^"]* can span across lines, while .* doesn't unless you use single line mode. [^"\n]* excludes the line breaks too.
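Both differences can be reproduced in a few lines. A Python sketch using the subject string from this answer plus an invented two-line string:

```python
import re

subject = '"one" "two"!'
specific = re.search(r'"[^"]*"!', subject).group()
lazy = re.search(r'".*?"!', subject).group()
print(specific)
print(lazy)

multiline = '"a\nb"'
print(bool(re.search(r'"[^"]*"', multiline)))           # class crosses the newline
print(bool(re.search(r'".*?"', multiline)))             # dot stops at the newline
print(bool(re.search(r'".*?"', multiline, re.DOTALL)))  # single-line mode
```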
As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which they backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.
More complicated, but it handles escaped quotes and also escaped backslashes (escaped backslashes followed by a quote is not a problem)
/(["'])((\\{2})*|(.*?[^\\](\\{2})*))\1/
Examples:
"hello\"world" matches "hello\"world"
"hello\\"world" matches "hello\\"
I would suggest:
([\"'])(?:\\\1|.)*?\1
But only because it handles escaped quote chars and allows both the ' and " to be the quote char. I would also suggest looking at this article that goes into this problem in depth:
http://blog.stevenlevithan.com/archives/match-quoted-string
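A quick Python check of the suggested pattern on an invented sample with an escaped quote. The character class is written as ["'] here, which is equivalent to [\"'] once any host-language string escaping is removed.

```python
import re

# \\\1 matches a backslash followed by whatever quote opened the string.
quoted = re.compile(r"""(["'])(?:\\\1|.)*?\1""")

text = r'say "hello\"world" and "plain"'
found = [m.group(0) for m in quoted.finditer(text)]
print(found)
```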
However, unless you have a serious performance issue or cannot be sure of embedded quotes, go with the simpler and more readable:
/".*?"/
I must admit that non-greedy patterns are not the basic Unix-style 'ed' regular expression, but they are getting pretty common. I still am not used to group operators like (?:stuff).
I'd say the second one is better, because it fails faster when the terminating " is missing. The first one will backtrack over the string, a potentially expensive operation. An alternative regexp if you are using perl 5.10 would be /"[^"]++"/. It conveys the same meaning as version 1 does, but is as fast as version two.
I'd go for number two since it's much easier to read. But I'd still like to match empty strings so I would use:
/".*?"/
From a performance perspective (extremely heavy, long-running loop over long strings), I could imagine that
"[^"]*"
is faster than
".*?"
because the latter would do an additional check for each step: peeking at the next character. The former would be able to mindlessly roll over the string.
As I said, in real-world scenarios this would hardly be noticeable. Therefore I would go with number two (if my current regex flavor supports it, that is) because it is much more readable. Otherwise with number one, of course.
Using the negated character class prevents matching when the boundary character (doublequotes, in your example) is present elsewhere in the input.
Your example #1:
/"[^"]+"/ # match quote, then everything that's not a quote, then a quote
matches only the smallest pair of matched quotes -- excellent, and most of the time that's all you'll need. However, if you have nested quotes, and you're interested in the largest pair of matched quotes (or in all the matched quotes), you're in a much more complicated situation.
Luckily Damian Conway is ready with the rescue: Text::Balanced is there for you, if you find that there are multiple matched quote marks. It also has the virtue of matching other paired punctuation, e.g. parentheses.
I prefer the first regex, but it's certainly a matter of taste.
The first one might be more efficient?
Search for double-quote
add double-quote to group
for each char:
    if double-quote:
        break
    add to group
add double-quote to group
Vs something a bit more complicated involving back-tracking?
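The loop sketched above translates fairly directly into Python. This is only an illustration of the idea (the function name `first_quoted` is made up, and a real regex engine does not literally work this way):

```python
import re

def first_quoted(text):
    """Return the first "..." substring found by a manual scan, or None."""
    start = text.find('"')          # search for the opening double-quote
    if start == -1:
        return None
    for i in range(start + 1, len(text)):
        if text[i] == '"':          # closing double-quote ends the group
            return text[start:i + 1]
    return None                     # no closing quote was found

text = 'a "b c" d "e"'
print(first_quoted(text))
print(re.search(r'"[^"]*"', text).group())
```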
Considering that I didn't even know about the "*?" thing until today, and I've been using regular expressions for 20+ years, I'd vote in favour of the first. It certainly makes it clear what you're trying to do - you're trying to match a string that doesn't include quotes.