Invalid Boost Regex Lookbehind with OR and ^ - regex

I'm having an issue with boost regex and suspect it's a bug, but knew someone here would know for sure and whether there's a workaround.
I'm checking the start of a selection for start of string, white-space or an underscore using
(?<=^|\s|_)
However under boost this creates an error:
ERROR: Bad regular expression at char 0. Invalid lookbehind assertion encountered in the regular expression.
Without the ^, all is well and similarly with just the ^ its fine.
Any help getting around this would be greatly received.
Cheers

Brief
The code you presented (?<=^|\s|_) is a lookbehind using 3 possibilities:
1. ^ Assert position at start of the line
2. \s Match any whitespace character
3. _ Match the underscore character literally
Note that 2. and 3. each match exactly one character, while 1. matches zero characters (it is a position assertion).
Since 1. is of width 0, and 2. and 3. are of width 1, this causes the lookbehind to be of variable width. Some regex flavours will permit subtleties such as assertions to be used alongside fixed width matches, while others will not.
Typically, in lookbehinds, any quantifier or alternation whose possible matches don't all share the same length (variable length) causes an error like the one you've seen.
Solution
Some regex flavours will permit your code to run, while others will not. For regex flavours that do not permit this sort of behaviour, workarounds should be used.
For your specific case, you can likely use the following regex to solve your issue
(?:^|(?<=\s|_))
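As a rough check of this workaround, here is a minimal sketch in Python, whose re module applies a similar fixed-width restriction to lookbehinds. Boost itself is not exercised here, and the word foo and the sample string are made up for illustration:

```python
import re

# The ^ alternative sits outside the lookbehind, so the lookbehind
# only contains the two width-1 alternatives \s and _.
pattern = re.compile(r'(?:^|(?<=\s|_))foo')

sample = 'foo _foo xfoo foo'   # hypothetical test string
starts = [m.start() for m in pattern.finditer(sample)]
print(starts)   # start offsets of every accepted "foo"
```

The occurrence inside xfoo is skipped because it is preceded by x, which is neither the start of the string, whitespace, nor an underscore.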

Boost regex, like Python re, does not allow you to use alternatives of different length in a lookbehind (^ matches zero chars, while \s and _ match 1 char both). See the Boost reference:
(?<=pattern) consumes zero characters, only if pattern could be matched against the characters preceding the current position (pattern must be of fixed length).
In these cases, it is a good idea to use a negative lookbehind with a negated character class matching any char but the ones you need. The (?<=^|\s|_) pattern will change into
(?<![^\s_])
It will match any location that is not immediately preceded with a char other than whitespace or _ (i.e. it will match the start of string (^), after a whitespace or _, just what you need).
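A small sketch of the same idea in Python, which rejects the original lookbehind for the same reason Boost does. The word id and the sample string are assumptions chosen only for illustration:

```python
import re

# The original form mixes a zero-width ^ with width-1 alternatives,
# so Python's re refuses to compile it.
try:
    re.compile(r'(?<=^|\s|_)id')
    rejected = False
except re.error:
    rejected = True

# A negated class inside a negative lookbehind is always exactly one
# character wide, and it also succeeds at the very start of the string.
pattern = re.compile(r'(?<![^\s_])id')
sample = 'id user_id grid id'   # hypothetical test string
starts = [m.start() for m in pattern.finditer(sample)]
print(rejected, starts)
```

The id inside grid is not matched, since the character before it is an ordinary letter.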

Related

Negative lookbehind in regex

(Note: not a duplicate of Why can't you use repetition quantifiers in zero-width look behind assertions; see end of post.)
I'm trying to write a grep -P (Perl) regex that matches B, when it is not preceded by A -- regardless of whether there is intervening whitespace.
So, I tried this negative lookbehind, and tested it in regex101.com:
(?<!A)\s*B
This causes "AB" not to be matched, which is good, but "A B" does result in a match, which is not what I want.
I am not exactly sure why this is. It has something to do with the fact that \s* matches the empty string "", and you can say that there are, as such, infinity matches of \s* between A and B. But why does this affect "A B" but not "AB"?
Is the following regex a proper solution, and if so, why exactly does it fix the problem?
(?<![A\s])\s*B
I posted this before and it was incorrectly marked as a duplicate question. The variable-length thing I'm looking for is part of the match, not part of the negative lookbehind itself, so this is quite different from the other question. Yes, I could put the \s* inside the negative lookbehind, but I haven't done so (and doing so is not supported, as the other question explains). Also, I am particularly interested in why the alternate regex I post above works, since I know it works but I'm not exactly sure why. The other question did not help answer that.
But why does this affect "A B" but not "AB"?
Regexes match at a position, which it is helpful to think of as being between characters. In "A B" there is a position (after the space and before the B) where (?<!A) succeeds (because there isn't an A immediately preceding; there's a space instead), and \s*B succeeds (\s* matches the empty string, and B matches B), so the entire pattern succeeds.
In "AB" there is no such position. The only place where \s*B can match (immediately before the B), is also immediately after the A, so (?<!A) cannot succeed. There are no positions that satisfy both, so the pattern as a whole can't succeed.
Is the following regex a proper solution, and if so, why exactly does it fix the problem?
(?<![A\s])\s*B
This works because (?<![A\s]) will not succeed immediately after an A or after a space. So now the lookbehind forbids any match position that has spaces before it. If there are any spaces before the B, they have to be consumed by the \s* portion of the pattern, and the match position must be before them. If that position also doesn't have an A before it, the lookbehind can succeed and the pattern as a whole can match.
This is a trick that's made possible by the fact that \s is a fixed-width pattern that matches at every position inside of a non-empty \s* match. It can't be extended to the general case of any pattern between the (non-)A and the B.
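The position-based explanation can be checked with a short sketch. It uses Python's re rather than grep -P, on the assumption that both engines treat these two patterns the same way; the test strings are made up:

```python
import re

naive = re.compile(r'(?<!A)\s*B')      # lookbehind only forbids a preceding A
fixed = re.compile(r'(?<![A\s])\s*B')  # lookbehind also forbids preceding whitespace

for text in ['AB', 'A B', 'B', 'x  B']:
    n = naive.search(text)
    f = fixed.search(text)
    # Print the span of each match, or None when there is no match.
    print(repr(text), n and n.span(), f and f.span())
```

For 'A B' the naive pattern matches just the B (the match starts after the space), while the fixed pattern finds no valid start position at all. For 'x  B' the fixed pattern is forced to start before the spaces and consume them with \s*.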

What does (^?)* mean in this regex?

I have this regex:
^(^?)*\?(.*)$
If I understand correctly, this is the breakdown of what it does:
^ - start matching from the beginning of the string
(^?)* - I don't know, but it stores it in $1
\? - matches a question mark
(.*)$ - matches anything until the end of the string
So what does (^?)* mean?
The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a grouping match []. When used outside those 2 positions, the ^ is interpreted literally, meaning it looks for the ^ character in the input string.
Note: Whether or not ^ outside of the first and grouping position is interpreted literally is regex engine specific. I'm not familiar enough with Lua to state which it does.
Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.
Interpreted as a Lua pattern, the example will surprise a longtime regexp user since so many details are different.
Lua patterns are described in PiL, and at a first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably the lack of an alternation operator |, parentheses are only used to mark captures, quantifiers (?, -, +, and *) only apply to a character or character class, and % is the escape character, not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \? which smells like a conventional regexp to match a single literal ?.
The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:
^(^?)*\?(.*)$
Handed to string.match(), it would be interpreted as follows:
^ anchors the match to the beginning of the string.
( marks the beginning of the first capture.
^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.
? matches exactly zero or one of the previous character.
) marks the end of the first capture.
* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.
\ in a pattern matches itself, it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, making the following character not special to the string literal parser, which in this case is moot because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, then the \ would be absorbed by string parsing. If written in a long string (as [[^(^?)*\?(.*)$]]), the backslash would survive the string parser, to appear in the pattern.
? matches exactly zero or one of the previous character.
( marks the beginning of the second capture.
. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).
* matches zero or more of the previous character, greedily.
) marks the end of the second capture.
$ anchors the pattern to the end of the string.
So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
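To make that reading concrete, here is a hypothetical translation of the analysis into a conventional regex, written in Python rather than Lua. It is only a sketch of the interpretation described above (with the backslash surviving into the pattern, as in the long-string case); it does not run Lua's matcher, and the sample inputs are invented:

```python
import re

# Assumed regex equivalent of the Lua pattern, piece by piece:
#   ^      anchor at start        -> ^
#   (^?)   optional literal caret -> (\^?)
#   *      literal asterisk       -> \*
#   \?     optional backslash     -> \\?
#   (.*)$  rest of the string     -> (.*)\Z  (DOTALL: Lua's . matches any character)
lua_like = re.compile(r'^(\^?)\*\\?(.*)\Z', re.DOTALL)

def captures(s):
    """Return the two captures as a tuple, or None when there is no match."""
    m = lua_like.match(s)
    return m.groups() if m else None

print(captures('^*\\rest'))  # caret, star, backslash, then the rest
print(captures('*?a=b'))     # no caret, no backslash
print(captures('?a=b'))      # no literal star, so no match
```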
Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols lose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.
Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.
In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.
For all your Regex needs: http://regexlib.com/CheatSheet.aspx
It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.

A pattern matching an expression that doesn't end with specific sequence

I need a regex pattern which matches such strings that DO NOT end with such a sequence:
\.[A-z0-9]{2,}
by which I mean the examined string must not have at its end a sequence of a dot and then two or more alphanumeric characters.
For example, a string
/home/patryk/www
and also
/home/patryk/www/
should match desired pattern and
/home/patryk/images/DSC002.jpg should not.
I suppose this has something to do with lookarounds (look aheads) but still I have no idea how to make it.
Any help appreciated.
Old Answer
You can use a negative lookbehind at the end if your regex flavor supports it:
^.*+(?<!\.\w{2,})$
This will match a string that has an end anchor not preceded by the icky sequence you don't want.
Note that as m.buettner has pointed out, this uses an indefinite length lookbehind, which is a feature unique to .NET
New Answer
After a bit of digging around, however, I've found that variable length look-aheads are pretty widely supported, so here is a version that uses those:
^(?:(?!\.\w{2,}$).)++$
In a comment on an answer, you have stated you wanted to not match strings with forward slashes at the end, which is accomplished by simply adding a forward slash to the lookahead.
^(?:(?!(\.\w{2,}|/)$).)++$
Note that I am using \w for succinctness, but it lets underscores through. If this is important, you could replace it with [^\W_].
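A quick sketch of the lookahead approach in Python, using the paths from the question. One assumption to flag: the possessive ++ is replaced by a plain +, because Python's re only gained possessive quantifiers in version 3.11; the set of strings accepted should be the same.

```python
import re

# Every character is consumed only if an extension-like tail
# (dot plus two or more word characters, then end) does not start there.
no_ext = re.compile(r'^(?:(?!\.\w{2,}$).)+$')

# Same idea, additionally refusing a slash as the final character.
no_ext_or_slash = re.compile(r'^(?:(?!(?:\.\w{2,}|/)$).)+$')

paths = ['/home/patryk/www', '/home/patryk/www/', '/home/patryk/images/DSC002.jpg']
for p in paths:
    print(p, bool(no_ext.match(p)), bool(no_ext_or_slash.match(p)))
```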
Asad's version is very convenient, but only .NET's regex engine supports variable-length lookbehinds (which is one of the many reasons why every regex question should include the language or tool used).
We can reduce this to a fixed-length lookbehind (which is supported in most engines except for JavaScript) if we think about the possible cases which should match. That would be either one or zero letters/digits at the end (whether preceded by . or not) or two or more letters/digits that are not preceded by a dot.
^.*(?:(?<![a-zA-Z0-9])[a-zA-Z0-9]?|(?<![a-zA-Z0-9.])[a-zA-Z0-9]{2,})$
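Because both lookbehinds here are exactly one character wide, the pattern also compiles in engines with the fixed-width restriction. A minimal check in Python, where the file names beyond those in the question are made-up examples:

```python
import re

fixed_width = re.compile(
    r'^.*(?:(?<![a-zA-Z0-9])[a-zA-Z0-9]?'       # zero or one trailing letter/digit
    r'|(?<![a-zA-Z0-9.])[a-zA-Z0-9]{2,})$'      # two or more, not preceded by a dot
)

samples = [
    '/home/patryk/www',                 # accepted: trailing run not preceded by a dot
    '/home/patryk/www/',                # accepted: ends with a slash
    '/home/patryk/images/DSC002.jpg',   # rejected: dot plus three alphanumerics
    'a.b',                              # accepted: only one character after the dot
    'file.tar.gz',                      # rejected: dot plus two alphanumerics
]
for s in samples:
    print(s, bool(fixed_width.match(s)))
```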
This should do it:
^(?:[^.]+|\.(?![A-Za-z0-9]{2,}$))+$
It alternates between matching one or more of anything except a dot, or a dot if it's not followed by two or more alphanumeric characters and the end of the string.
EDIT: Upgrading it to meet the new requirement is just more of the same:
^(?:[^./]+|/(?=.)|\.(?![A-Za-z0-9]{2,}$))+$
Breaking that down, we have:
[^./]+ # one or more of any characters except . or /
/(?=.) # a slash, as long as there's at least one character following it
\.(?![A-Za-z0-9]{2,}$) # a dot, unless it's followed by two or more alphanumeric characters followed by the end of the string
On another note: [A-z] is an error. It matches all the uppercase and lowercase ASCII letters, but it also matches the characters [, ], ^, _, backslash and backtick, whose code points happen to lie between Z and a.
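The stray characters are easy to list with a short Python sketch that walks the code points between Z and a:

```python
import re

# All characters whose code points lie strictly between 'Z' and 'a'.
between = ''.join(chr(c) for c in range(ord('Z') + 1, ord('a')))

matched = re.findall(r'[A-z]', between)      # the faulty range accepts all of them
strict = re.findall(r'[A-Za-z]', between)    # the intended range accepts none

print(between, matched, strict)
```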
Variable length look behinds are rarely supported, but you don't need one:
^.*(?<!\.[A-z0-9][A-z0-9]?)$

Simple regex negation

Trying to match ONLY the first character in the sample below.
Sample string: C/C++/Objective C/Objective-C/ObjectiveC/objectiveC
My faulty regex: (?![O|o]bjective[ |-]?)C(?!\+\+)
Doh.
Try this:
(?<![Oo]bjective[ -]?)C(?!\+\+)
Corrections are:
Use negative lookbehind instead of negative lookahead (the (?<!...) bit).
Removed pipe character from character classes (the [...] bits).
It might also be worth adding a pair of \bs either side of the C, since your current regex will match Coconut, BBC, CFML and so on
Also worth pointing out that, inside character classes, the - is special if not the first or last character. Some people prefer to escape it even in these situations, i.e. [ \-], in case a later character is accidentally added after it.
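The pattern above has an optional [ -]? inside the lookbehind, which only works in engines that allow variable-width lookbehinds. As a sketch for engines that do not (Python's re is assumed here), the lookbehind can be split into two fixed-width ones. This is an adaptation rather than the pattern from the answer, and it also adds the suggested \b on both sides:

```python
import re

plain_c = re.compile(
    r'(?<![Oo]bjective)'       # width 9: rejects "ObjectiveC" / "objectiveC"
    r'(?<![Oo]bjective[ -])'   # width 10: rejects "Objective C" / "Objective-C"
    r'\bC\b'
    r'(?!\+\+)'                # rejects "C++"
)

sample = 'C/C++/Objective C/Objective-C/ObjectiveC/objectiveC'
starts = [m.start() for m in plain_c.finditer(sample)]
others = plain_c.findall('Coconut BBC')   # words that merely contain a C
print(starts, others)
```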

what's wrong with this regex for password rules

I'm trying for at least 2 letters, at least 2 non letters, and at least 6 characters in length:
^.*(?=.{6,})(?=[a-zA-Z]*){2,}(?=[0-9##$%^&+=]*){2,}.*$
but that misses the mark on many levels, yet I'm not sure why. Any suggestions?
While this type of test can be done with a regex, it may be easier and more maintainable to do a non-regex check. The regex to achieve this is fairly complex and a bit unreadable, but the code to run this test is fairly straightforward. For example, take the following method as an implementation of your requirements (language: C#; it needs using System.Linq; for Where and Count):
public bool IsValid(string password) {
    // Null check on the argument omitted for brevity.
    return password.Length >= 6 &&
        password.Where(Char.IsLetter).Count() >= 2 &&            // at least 2 letters
        password.Where(x => !Char.IsLetter(x)).Count() >= 2;     // at least 2 non-letters
}
To answer the question in the title, here's what's wrong with your regex:
First, the .* (dot-star) at the beginning consumes the whole string. Then the first lookahead, (?=.{6,}) is applied and fails because the match position is at the end of the string. So the regex engine starts backtracking, "taking back" characters by moving the match position backward one character at a time and reapplying the lookahead. When it's taken back six characters, the first lookahead succeeds and the next one is applied.
The second lookahead is (?=[a-zA-Z]*), which means "at the current match position, try to match zero or more ASCII letters." The match position is still six characters back from the end of the string, but it doesn't matter; the lookahead will always succeed no matter where you apply it, because it can legally match zero characters. Also, the letters can be anywhere in the string, so the lookahead has to accommodate whatever intervening non-letters there might be.
Then you have {2,}. It's not part of the lookahead subexpression because it's outside the parentheses. In that position, it means the lookahead has to succeed two or more times, which makes no sense. If it succeeded once, it will succeed any number of times, because it's being applied at the same position every time. Some regex flavors treat it as an error when you apply a quantifier to a lookahead (or to any other zero-width assertion, eg, lookbehind, word boundary, line anchors). Most flavors seem to ignore the quantifier.
Then you have another lookahead that will always succeed, and another useless quantifier. Finally, the dot-star at the end re-consumes the six characters the first dot-star had to relinquish.
I think this is what you were trying for:
^
(?=.{6})
(?=(?:[^A-Za-z]*[A-Za-z]){2})
(?=(?:[^0-9##$%^&+=]*[0-9##$%^&+=]){2})
.*$
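A hedged sketch of this corrected pattern in Python, written on one logical line instead of free-spacing mode. The doubled ## in the character class is reduced to a single #, which matches the same set of characters, and the test passwords are invented:

```python
import re

PASSWORD_RE = re.compile(
    r'^(?=.{6})'                                 # at least six characters
    r'(?=(?:[^A-Za-z]*[A-Za-z]){2})'             # at least two ASCII letters
    r'(?=(?:[^0-9#$%^&+=]*[0-9#$%^&+=]){2})'     # at least two of the listed non-letters
    r'.*$'
)

def is_valid(password):
    """True when all three lookaheads succeed from the start of the string."""
    return PASSWORD_RE.match(password) is not None

for candidate in ['ab12cd', 'abcdef', 'a1b2', '123456', 'abcde1']:
    print(candidate, is_valid(candidate))
```

Each lookahead is anchored at the start and skips over the characters it is not counting, which is what lets the required letters and non-letters appear anywhere in the string.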
If you really want to use regular expressions, try this one:
(?=.{6})(?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])(?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])^.+$
This matches anything that is at least six characters long ((?=.{6})), contains at least two alphabetic characters ((?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])), and contains at least two characters from the character set [0-9##$%^&+=] ((?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])).