Perl Regex Negative Lookbehind Incorrect Match (SAS) - regex

In SAS, I am setting up PRXPARSE functions to extract meaningful information from free text answers from a survey. For the most part, I have done this without issue. However, I've started needing lookarounds and now I am getting an incorrect match despite my best efforts.
Here is the expression that is being evaluated:
hlhx=PRXPARSE('/yes|(?<!no).*homeless.*(for|in|year|age)|at\sage|couch|was\shomeless|multiple|
lived.*streets|(?<!\bnot).*at\srisk|has\sbeen|high\srisk|currently\shomeless|
liv(es|ing|ed).*car|many|(?<!\bno).*(hx|history|h.?o)|(?<!\bno)(?<!low).+risk/ox');
A couple of responses should not match this expression, but do:
no hx of homelessness and low risk of homelessness
owns home, no h/o homelessness; low risk for homelessness
no and little risk
Obviously I have not properly specified my lookbehinds. Any help would be greatly appreciated.
EDIT: To put a finer point on it, what part of the expression is causing a match with entries like those in the list?
Best,
Lauren

Here's how your regex matches no and little risk:
One of the branches in your regex is ...|(?<!\bno)(?<!low).+risk.
The regex engine starts by attempting a match at every position within the target string, starting at the beginning:
no and little risk
^
The first constraint is that the current position cannot be preceded by a word boundary followed by "no" (due to (?<!\bno)). This condition is satisfied: The beginning of the target string is not preceded by anything.
The second constraint is that the current position cannot be preceded by "low" (due to (?<!low)). This condition is also satisfied (see above).
Then we match one or more non-newline characters, but as many as possible of them (this is the .+ part). Here we initially consume the whole string:
no and little risk
------------------^
But then the regex requires a match of risk, which fails (there are no more characters left in the target string). This causes .+ to backtrack and consume fewer and fewer characters, until this happens:
no and little risk
--------------^
At this point, risk successfully matches and the regex finishes.
The basic problem is that what you want to do is (?<!\bno.+)(?<!low.+)risk, but what you wrote is (?<!\bno)(?<!low).+risk. These are two very different things!
The former means "match 'risk', but only if it's not preceded by 'no' or 'low' anywhere in the string (up to 1 character before 'risk')". The latter means "match any non-empty substring followed by 'risk', as long as it's not preceded by either 'no' or 'low'". This gives the regex engine the freedom to look for any matching position in the string, as long as it's not immediately preceded by "no" or "low" and is followed by ".+risk" somewhere.
Unfortunately (?<!\bno.+) is not a valid regex because look-behind assertions must have a fixed length.
One possible workaround is to do the following:
^(?!.*(?:\bno|low).+risk).*risk
This says: Starting from the beginning of the string, first make sure there is no "no" or "low" followed by "risk" anywhere, then match "risk" anywhere within the string.
This is not quite equivalent to the (hypothetical) variable-width look-behind version, because that one would have matched
risk no risk
^^^^
due to the presence of "risk" without "no" preceding it, whereas this workaround first finds
risk no risk
^^^^^^^
and immediately rejects the whole string.
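To see the difference in practice, here is a small sketch in Python's re module (an assumption on my part: SAS uses Perl-style regexes, and these two patterns behave the same way in Python). The sample strings other than the one from the question are made up for illustration.

```python
import re

# The problematic branch from the question, and the anchored workaround.
original = re.compile(r"(?<!\bno)(?<!low).+risk")
workaround = re.compile(r"^(?!.*(?:\bno|low).+risk).*risk")

samples = [
    "no and little risk",         # from the question
    "low risk for homelessness",  # made-up sample
    "high risk of eviction",      # made-up sample
    "risk no risk",               # the edge case discussed above
]

for text in samples:
    m = original.search(text)
    matched_text = m.group(0) if m else None
    accepted = workaround.search(text) is not None
    print(f"{text!r}: original matched {matched_text!r}, workaround accepts: {accepted}")
```

The original branch matches every sample starting at position 0, because nothing precedes the start of the string. The workaround only accepts the "high risk" sample.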

Related

is this regex vulnerable to REDOS attacks

Regex :
^\d+(\.\d+)*$
I tried to break it with :
1234567890.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1x]
that is 200x".1"
I have read about ReDos attacks from :
Preventing Regular Expression Denial of Service (ReDoS)
Runaway Regular Expressions: Catastrophic Backtracking
However, I am not too confident in my skills to prepare a ReDos attack on an expression. I tried to trigger catastrophic backtracking due to "Nested Quantifiers".
Is that expression breakable? What input should be used for that and, if yes, how did you come up with it?
"Nested quantifiers" isn't inherently a problem. It's just a simple way to refer to a problem which is actually quite a bit more complicated. The problem is "quantifying over a sub-expression which can, itself, match in many ways at the same position". It just turns out that you almost always need a quantifier in the inner sub-expression to provide a rich enough supply of matches, and so quantifiers inside quantifiers serve as a red flag that indicates the possibility of trouble.
(.*)* is problematic because .* has maximum symmetry — it can match anything between zero and all of the remaining characters at any point of the input. Repeating this leads to a combinatorial explosion.
([0-9a-f]+\d+)* is problematic because at any point in a string of digits, there will be many possible ways to allocate those digits between an initial substring of [0-9a-f]+ and a final substring of \d+, so it has the same exact issue as (.*)*.
(\.\d+)* is not problematic because \. and \d match completely different things. A digit isn't a dot and a dot isn't a digit. At any given point in the input there is only one possible way to match \., and only one possible way to match \d+ that leaves open the possibility of another repetition (consume all of the digits, because if we stop before a digit, the next character is certainly not a dot). Therefore (\.\d+)* is no worse, backtracking-wise, than a \d* would be in the same context, even though it contains nested quantifiers.
Your regex is safe, but only because of "\."
Testing on regex101.com shows that there are no combinations of inputs that create runaway checks - but your regex is VERY close to being vulnerable, so be careful when modifying it.
As you've read, catastrophic backtracking happens when two quantifiers are right next to each other. In your case, the regex expands to \d+\.\d+\.\d+\.\d+\. ... and so on. Because you make the dot required for every single match between \d+, your regex grows by only three steps for each period-number you add. (This translates to 4 steps per period-number if you put an invalid character at the end.) That's a linear growth rate, so your regex is fine. Demo
However, if you make the \. optional, accidentally forget the escape character to make it plain ol' ., or remove it altogether, then you're in trouble. Such a regex would allow catastrophic backtracking; with an invalid character at the end, every additional digit you add before it approximately doubles the runtime. That's an exponential growth rate, and it's enough to time out the regex101 engine at its default settings with just 18 digits and 1 invalid character. Demo
As written, your regex is fine, and will remain so as long as you make sure there's something "solid" between the first \d+ and the second \d+, as well as something "solid" between the second \d+ and the * outside its capture group.
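If you want to reproduce this outside regex101, a rough sketch in Python follows. The "unsafe" pattern is a hypothetical variant with the \. removed, and timed_match is a helper name I made up. Timings depend on the machine and the regex engine, so treat the numbers you get as indicative only.

```python
import re
import time

safe = re.compile(r"^\d+(\.\d+)*$")
# Hypothetical broken variant: the escaped dot was dropped from the group.
unsafe = re.compile(r"^\d+(\d+)*$")

def timed_match(pattern, text):
    """Return (matched, seconds) for one anchored match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

# The asker's attack string: 200 repetitions of ".1" and a trailing invalid character.
attack_on_safe = "1234567890" + ".1" * 200 + "x"
print("safe:", timed_match(safe, attack_on_safe))

# Keep n small here: the time for the unsafe variant grows very quickly with n.
for n in (10, 14, 18):
    print("unsafe, n =", n, timed_match(unsafe, "1" * n + "x"))
```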

Efficient Regex for requiring a specific sentence pattern but allowing html etc

(As is often the case, while writing this, I think I fixed the expression itself so it now works for my purposes, so efficiency is now my main concern - but I would still like input as to whether the expression can be improved or will let through way more than it should, so I have left the entire explanation in.)
I am trying to write a regular expression which will validate that user-submitted text matches a length requirement. Users must write 7 or more full sentences of 4 or more words. We are defining this as follows:
- 4 words means 3 or more sections of '1 or more non-space characters followed by 1 or more spaces', then 1 instance of '1 or more non-space characters optionally followed by a space' (because some people like to put spaces before their punctuation marks I guess)
- A sentence is ended with a punctuation mark (.?!)
- Zero or more spaces are allowed after each sentence
- (Repeat 7 times)
This definition can be changed to anything sensible, but that's what I came up with so far. Which gives me the following RegEx:
((\S+\s+){3,}\S+[.?!]\s*){7,}
This seems to work, but I have obviously fudged many things and wonder if anyone has a better idea. (It has to allow for html at any point, and a lot of other quirks from users' writing. I am not too concerned about people gaming the system - there are still manual checks, this is just a first-stage check to lighten the load.)
My other main concern is efficiency - I'm new to regex and don't know what is a 'normal' calculation time, but the debugger(s) I'm using are struggling at times when I paste in a block of text to check, and I don't know if this is caused by my RegEx or the debugger. It is often timing out on longer sections of text where there is no match. Is there a more efficient way to do what I'm wanting...?
First, when doing full text match, always surround the regex with ^...$. ^ anchors the start of the regex to the start of the validation string, and $ anchors the end of the regex to the end of the string. Otherwise, if it fails to match, it will repeat the validation attempt starting on every single character (Which, at minimum (4 words * 3 spaces) * 7 sentences = excessive amount of work).
Second, always use mutually exclusive groups where you can. \S (anything not white-space) includes the characters .?!, so on failing to find punctuation, the engine has to backtrack and retry each \S it matched (because the first pass marked the punctuation as part of a word). So I would recommend replacing \S with the more mutually exclusive "anything not white-space or punctuation", [^\s.?!]. Note that the [] contains a lowercase s instead of an uppercase one. [^...] means "match any character NOT in this group".
Those 2 things will drop you from catastrophic backtracking to a reasonable ~1-3k steps depending on paragraph length.
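As a sketch of what those two changes could look like together, here is one possible reading in Python. The exact pattern is my assumption (it is the question's regex with anchors added and \S swapped for [^\s.?!]), and has_seven_sentences is a made-up helper name.

```python
import re

# Assumption: the question's pattern with ^...$ added and \S replaced by [^\s.?!].
SENTENCES = re.compile(r"^(?:(?:[^\s.?!]+\s+){3,}[^\s.?!]+[.?!]\s*){7,}$")

def has_seven_sentences(text):
    """True if the whole text is 7+ sentences of 4+ words, each ending in . ? or !"""
    return SENTENCES.match(text.strip()) is not None

seven = " ".join(["This is a simple sentence."] * 7)
six = " ".join(["This is a simple sentence."] * 6)
print(has_seven_sentences(seven), has_seven_sentences(six))
```

Because a word can no longer contain sentence punctuation, the engine has only one way to split each sentence into words, which is what keeps the failing case cheap.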
UPDATE:
If you would allow a small alteration to the validation logic, making it so that multiple short sentences can count together as one sentence, then the following regex should do.
^(\s*(\S+\s+){3}([.?!]\s*)?([^\s.?!]+\s+)*\S+\s*[.?!]){7,}$
This hybrid version will allow short sentences without causing catastrophic backtracking. Without the small rule change, you will need to nest a variable length pattern inside a variable length pattern; which is catastrophic when the pattern isn't fully mutually exclusive. (updated demo)
Also, technically you can replace {7,}$ with just {7} if once 7 sentences have been found, you don't care what comes after that. (That will let the regex stop as soon as minimal viability is found, which would be more accepting of some extreme edge cases)
(You can play with it here on regex101.com)

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$), so that we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
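A quick way to check the trailing-newline explanation and the workaround, sketched in Python (the dictionary name results is just for illustration):

```python
import re

results = {
    # No character follows the q, so the character class has nothing to match.
    "q[^u] on 'Iraq'": bool(re.search(r"q[^u]", "Iraq")),
    # A trailing newline is a non-u character, so this one matches.
    "q[^u] on 'Iraq\\n'": bool(re.search(r"q[^u]", "Iraq\n")),
    # The lookahead only asserts that no u follows; end of string is fine.
    "q(?!u) on 'Iraq'": bool(re.search(r"q(?!u)", "Iraq")),
    # The workaround without lookahead.
    "q(?:[^u]|$) on 'Iraq'": bool(re.search(r"q(?:[^u]|$)", "Iraq")),
    "q(?!u) on 'quit'": bool(re.search(r"q(?!u)", "quit")),
}

for label, matched in results.items():
    print(label, "->", matched)
```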
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
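For instance, a minimal Python sketch of this pattern on a made-up input string:

```python
import re

# Consume one character at a time, but only while "---" does not start here.
DELIMITED = re.compile(r"---(?:(?!---).)*---")

text = "before ---some-te--xt--- after ---second---"
found = DELIMITED.findall(text)
print(found)  # ['---some-te--xt---', '---second---']
```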
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is, after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
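Here is a short Python sketch of the same pattern; the sample sentence is made up for illustration.

```python
import re

# Group 1 remembers the opening quote; (?!\1) stops at the same kind of quote.
QUOTED = re.compile(r"""(['"])(?:(?!\1).)*\1""")

text = """He said "don't" and 'say "hi"'."""
found = [m.group(0) for m in QUOTED.finditer(text)]
print(found)
```

The first match keeps the apostrophe inside the double-quoted string, and the second keeps the double quotes inside the single-quoted string.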
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
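To illustrate, here is a Python sketch comparing the consuming version with the lookaround version. Note that Python's re module only accepts fixed-width lookbehinds, so (?<=^|,) is rejected there. The double-negation form is used instead.

```python
import re

row = "aaa,111,bbb,222,333,ccc"

# Consuming the delimiters: the comma after 222 is used up, so 333 is missed.
consuming = re.findall(r"(?:^|,)(\d+)(?:,|$)", row)

# Lookarounds leave the delimiters unconsumed, so adjacent fields both match.
lookaround = re.findall(r"(?<![^,])\d+(?![^,])", row)

print(consuming)   # ['111', '222']
print(lookaround)  # ['111', '222', '333']
```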
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
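Both approaches can be sketched side by side in Python. The rule list, the messages and the function name problems are made up for illustration; only the combined pattern comes from the text above (with the length check appended).

```python
import re

# The single-regex version, including the minimum length of 8.
COMBINED = re.compile(
    r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)(?=.{8})"
)

# One regex per condition: (pattern, should_match, message).
RULES = [
    (r"\d", True, "needs a digit"),
    (r"[a-z]", True, "needs a lower case letter"),
    (r"[A-Z]", True, "needs an upper case letter"),
    (r"[^0-9a-zA-Z]", True, "needs a symbol"),
    (r"\s", False, "must not contain whitespace"),
    (r".{8}", True, "needs at least 8 characters"),
]

def problems(password):
    """Return the messages of all rules the password violates."""
    return [msg for pat, should_match, msg in RULES
            if bool(re.search(pat, password)) != should_match]

print(bool(COMBINED.match("Abcdef1!")), problems("Abcdef1!"))
print(bool(COMBINED.match("abcdefgh")), problems("abcdefgh"))
```

The combined regex can only say yes or no, while the per-rule version can tell the user exactly what is missing.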
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Another thing, along with what others mentioned: with a negative lookahead you can exclude a sequence of consecutive characters (e.g. you can negate ui). With [^...] you cannot negate ui, only u or i individually, and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.
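A tiny Python sketch of that difference, using the made-up input "qat":

```python
import re

# The lookahead checks the 'a' but leaves it for the capturing group.
with_lookahead = re.search(r"q(?!u)([a-z])", "qat").group(1)

# The character class consumes the 'a', so the group gets the next letter.
with_class = re.search(r"q[^u]([a-z])", "qat").group(1)

print(with_lookahead, with_class)  # a t
```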

Regex to Match All Except a String

Given the string beginend where begin and end are both optional, I want to match the whole string and back-reference only begin. Begin is unknown but alpha-numeric; end is literally end. How would I go about doing this?
In case it matters, I'd be using this in a Textpad macro to replace "beginend" with something else including "begin".
To match a string of "alpha-numeric" characters that do not contain "end" you can use something like:
(?:(?!end)[A-Za-z\d])+
An expression like this would do what you ask:
^((?:(?!end)[A-Za-z0-9])+)(?:end)?\z
EDITED (see after blockquote)
I don't have commenting privileges, so I can't comment on his
solution, but Qtax's solution will not work because it assumes that
begin will never contain the substring "end", e.g., it wouldn't
match the string "sendingend".
My solution:
^([A-Za-z0-9]*)(?:end)?$
Of course, it also depends on what you mean by alphanumeric. My
example has the strictest definition, i.e., just upper- and lower-case
letters plus digits. You'd need to add in other characters if you want
them. If you want to include the underscore as well as those
characters, you can replace the whole bulky [A-Za-z0-9] with \w
(equivalent to [A-Za-z0-9_]). Add \s if you want whitespace.
Since you said your regex knowledge is limited, I'll explain the rest
of the solution to you and whoever else comes along.
^ and $ match the beginning and the end of the string, respectively. By including the $ in particular, you're
guaranteeing that the last "end" you encounter is really at the end.
For example, without them, it would still match the string
"sendingsending" and the rest of your program would think it's found
that "end" at the end. With these, it's still going to match
"sendingsending" because any characters are allowed (see below), but
other steps in your script will recognize the presence of
"end". It actually doesn't matter much for this current
string, because the ([A-Za-z0-9]*) will capture the entire string if
"end" is not present. However, you therefore need another regex to
ensure the presence or absence of "end"...so you'd do something like
(end)$ to locate it.
([A-Za-z0-9]*): the square brackets contain the specific characters that are allowed (you should definitely read up on this if
you don't know). The * means it will match one of those characters 0
or more times, so this allows for no string (i.e., just "end") as well
as super-long strings. The parentheses are capturing that pattern so
you can back-reference it.
(?:end)?: the last ? makes it match this pattern 0 or 1 times (i.e., makes it optional). The (?:string) structure allows you to
group characters together as you would with parentheses but the ?:
makes it not save that pattern, so it uses less memory. In your
case, that memory would be negligible, but it's nice to know for
future use.
If you need more help, try Googling 'regex'. There's tons of good
references. You can also test them out. My personal favourite tester
is called My Regex Tester.
Good luck!
I just tried looking up TextPad macros, and you might run into a problem. As I've explained above, to verify the presence of "end" at the end of the string, you'll need something separate. I was envisioning some kind of conditional, something like IF (end)$ THEN replace with ^([A-Za-z0-9]*)(?:end)?$ ELSE use the whole string. However, I don't know if you can do that with these macros...it's hard to say, because I'm not a TextPad user and there's next to no documentation. If you can't, then I think you're going to have to put some restrictions on it. One idea is to not allow "end" to be anywhere in the begin substring (which is how Qtax's solution did it). But now I'm wondering...if "end" is going to be optional, and if conditionals aren't allowed, what's the point of having it at all? ...perhaps I'm overthinking things. I await your reply.
Try using a positive look-ahead. This is a zero-width assertion so won't be included in the match. It also allows for the substring end to be present within the alpha-numeric string
([a-z0-9]*)(?=end)
What this is saying is: Match an alpha-numeric string only if it is immediately followed by end
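To compare what each suggestion in this thread actually captures, here is a sketch in Python rather than TextPad (so treat it as an approximation of TextPad's behaviour). Python's re spells the end-of-string anchor \Z. The lazy pattern and the helper begin_part are my own additions for illustration and do not come from any of the answers.

```python
import re

no_end_inside = re.compile(r"^((?:(?!end)[A-Za-z0-9])+)(?:end)?\Z")
greedy = re.compile(r"^([A-Za-z0-9]*)(?:end)?$")
lookahead = re.compile(r"([a-z0-9]*)(?=end)")
# Hypothetical variant: a lazy quantifier lets the optional "end" claim the suffix.
lazy = re.compile(r"^([A-Za-z0-9]*?)(?:end)?$")

def begin_part(pattern, text):
    """Return what group 1 captured, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None

for text in ("beginend", "sendingend", "begin"):
    print(text, {
        "no_end_inside": begin_part(no_end_inside, text),
        "greedy": begin_part(greedy, text),
        "lookahead": begin_part(lookahead, text),
        "lazy": begin_part(lazy, text),
    })
```

The greedy version captures the trailing "end" as part of begin, the lookahead version requires "end" to be present, and the first version fails when begin itself contains "end".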

what's wrong with this regex for password rules

I'm trying for at least 2 letters, at least 2 non letters, and at least 6 characters in length:
^.*(?=.{6,})(?=[a-zA-Z]*){2,}(?=[0-9##$%^&+=]*){2,}.*$
but that misses the mark on many levels, yet I'm not sure why. Any suggestions?
While this type of test can be done with a regex, it may be easier and more maintainable to do a non-regex check. The regex to achieve this is fairly complex and a bit unreadable. But the code to run this test is fairly straightforward. For example take the following method as an implementation of your requirements (language C#)
// requires: using System.Linq;
public bool IsValid(string password) {
    // arg null check omitted
    // "at least 2" means >= 2, not > 2
    return password.Length >= 6 &&
        password.Where(Char.IsLetter).Count() >= 2 &&
        password.Where(x => !Char.IsLetter(x)).Count() >= 2;
}
To answer the question in the title, here's what's wrong with your regex:
First, the .* (dot-star) at the beginning consumes the whole string. Then the first lookahead, (?=.{6,}) is applied and fails because the match position is at the end of the string. So the regex engine starts backtracking, "taking back" characters by moving the match position backward one character at a time and reapplying the lookahead. When it's taken back six characters, the first lookahead succeeds and the next one is applied.
The second lookahead is (?=[a-zA-Z]*), which means "at the current match position, try to match zero or more ASCII letters." The match position is still six characters back from the end of the string, but it doesn't matter; the lookahead will always succeed no matter where you apply it, because it can legally match zero characters. Also, the letters can be anywhere in the string, so the lookahead has to accommodate whatever intervening non-letters there might be.
Then you have {2,}. It's not part of the lookahead subexpression because it's outside the parentheses. In that position, it means the lookahead has to succeed two or more times, which makes no sense. If it succeeded once, it will succeed any number of times, because it's being applied at the same position every time. Some regex flavors treat it as an error when you apply a quantifier to a lookahead (or to any other zero-width assertion, eg, lookbehind, word boundary, line anchors). Most flavors seem to ignore the quantifier.
Then you have another lookahead that will always succeed, and another useless quantifier. Finally, the dot-star at the end re-consumes the six characters the first dot-star had to relinquish.
I think this is what you were trying for:
^
(?=.{6})
(?=(?:[^A-Za-z]*[A-Za-z]){2})
(?=(?:[^0-9##$%^&+=]*[0-9##$%^&+=]){2})
.*$
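As a quick sanity check, here is the same pattern joined onto one line and run in Python. The character set is copied as written above, and the sample passwords and the helper name is_valid are made up for illustration.

```python
import re

# Same pattern as above, written on a single line.
PASSWORD = re.compile(
    r"^(?=.{6})"
    r"(?=(?:[^A-Za-z]*[A-Za-z]){2})"
    r"(?=(?:[^0-9##$%^&+=]*[0-9##$%^&+=]){2})"
    r".*$"
)

def is_valid(password):
    return PASSWORD.match(password) is not None

for candidate in ("ab12cd", "ab#$xy", "abcdef", "a1b2", "123456"):
    print(candidate, is_valid(candidate))
```

Only the first two candidates pass: "abcdef" has no characters from the non-letter set, "a1b2" is too short, and "123456" has no letters.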
If you really want to use regular expressions, try this one:
(?=.{6})(?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])(?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])^.+$
This matches anything that is at least six characters long ((?=.{6})), contains at least two alphabetic characters ((?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])), and contains at least two characters from the set [0-9##$%^&+=] ((?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])).