Regex lookbehind: is `>` a shortcut for `<=`?

Through some convoluted testing, I ended up potentially discovering a shortcut.
A lookbehind in PowerShell is supposed to use the <= syntax, which is referenced in various other places when googling for lookbehinds in PowerShell, e.g., this Microsoft blog.
Take this simple example Regex:
(?>^[^x]*)$
(?> begins the lookbehind
^[^x]* tests that the character x is not present since the beginning of the string
) closes the lookbehind
$ anchors the end of the line
When I test it:
'sample = x' -match '(?>^[^x]*)$'
False
'sample = ' -match '(?>^[^x]*)$'
True
The first block returns false: the pattern does not match a string that contains x.
The second block returns true: the lookbehind matches a string without x.
It seems to work!
Now if I try to use the <= syntax:
'sample = x' -match '(?<=^[^x]*)$'
False
'sample = ' -match '(?<=^[^x]*)$'
True
It has the same behavior.
Is this a "shortcut" for RegEx lookbehinds in PowerShell, or why is > working at all?
435|PS(7.2.1) C:\Users\User\Documents [230211-15:03:27]> $PSVersionTable
Name                           Value
----                           -----
PSVersion                      7.2.1
PSEdition                      Core
GitCommitId                    7.2.1
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

tl;dr
Regex grouping constructs (?<=…) and (?>…) serve different purposes and only happen to work the same in your particular scenario; neither is called for in your scenario.
Use '...' -notmatch 'x' to test if a given string contains any instances of 'x' (returns $true if not).
Background information:
The two grouping constructs you reference serve different purposes (enclosed subexpressions are represented with placeholder … below):
(?<=…) is a (zero-width, positive) lookbehind assertion:
It is a non-capturing grouping construct that must match the enclosed subexpression immediately before (to the left, i.e. "looking behind") where the remaining expression matches, without capturing what the subexpression matched.
In essence, this means: When what follows this construct matches, also make sure (assert) that what comes before it matches the subexpression inside (?<=…); if it doesn't, there's no overall match; if it does, don't capture (include in the results) its match.
Therefore, this construct only makes sense if placed before a capturing construct; e.g.:
# Matches only 'unheard of', because only in it is the match
# for 'hear.' preceded by 'un'
# Captures only 'heard' from the matching string, not the 'un'
'heard from', 'unheard of' -match '(?<=un)hear.'
(?>…) is an atomic group, aka non-backtracking subexpression:
It is a non-capturing grouping construct - similar to a non-capturing group, (?:…), in that it groups a subexpression without creating a capture - except that it never backtracks.
In essence, this means: once the subexpression has found a match, it won't allow backtracking based on the remainder of the expression; this construct is mostly used as a performance optimization when it is known that backtracking wouldn't succeed.
# Atomic group:
# -> $false, because the atomic group fully consumes the string,
# so there's nothing for '.' to match *after* the group.
'abc!' -match '(?>.+).'
# Regular capture group:
# -> $true, with backtracking; the capture group captures 'abc'
'abc!' -match '(.+).'
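For readers who want to try the two ideas outside PowerShell, here is a minimal sketch in Python. Using Python's re module is an assumption made for illustration; it is not something the question uses. It shows that a lookbehind asserts without consuming, and that a regular group backtracks. The atomic-group variant is left out, because Python's re module only supports (?>…) from version 3.11 on.

```python
import re

# Lookbehind: 'un' must precede the match, but is not part of it.
hit = re.search(r'(?<=un)hear.', 'unheard of')
miss = re.search(r'(?<=un)hear.', 'heard from')
print(hit.group(0), hit.start())   # heard 2
print(miss)                        # None

# Regular capture group: '.+' first grabs 'abc!', then gives back '!'
# so that the trailing '.' has something to match.
m = re.search(r'(.+).', 'abc!')
print(m.group(1))                  # abc
```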
What you tried:
(?<=^[^x]*)$ - your regex with a lookbehind assertion
As noted above, there's no good reason to use a lookbehind assertion without following it with a capturing expression.
Your regex will by definition not capture anything ($ is itself an assertion).
Since you're matching the whole string, the immediate simplification would be not to use a grouping construct at all (but see the bottom section):
^[^x]*$
As an optimization, if you explicitly want to prevent the capturing that happens by default, use a noncapturing group, (?:…):
(?:^[^x]*$)
(?>^[^x]*)$ - your regex with an atomic group
Since you're matching the whole string, there is no reason to use an atomic group, given that there's no backtracking that needs preventing. An atomic group doesn't capture anything either, so this regex is in effect the same as (?:^[^x]*)$, i.e. a non-capturing group (followed by $).
As with the lookbehind variant, the grouping construct can simply be dropped: ^[^x]*$.
In short:
Both your regexes match the input string in full, and therefore require no grouping construct (except, optionally, to explicitly prevent capturing).
Read on for a much simpler solution.
Taking a step back:
The conceptually simplest and most efficient solution is:
'...' -notmatch 'x'
That is, you can let -notmatch, the negated form of PowerShell's -match operator, look for (at most one) x and negate the Boolean result, so that not finding any x returns $true.
In other words: the test succeeds if no x is present in the input string.
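The same "search for the unwanted character and negate" idea can be sketched in Python, purely as an illustration (the function names has_no_x and has_no_x_anchored are made up for this example). One difference to keep in mind: PowerShell's -match and -notmatch are case-insensitive by default, whereas Python's re module is case-sensitive unless re.IGNORECASE is passed.

```python
import re

def has_no_x(s: str) -> bool:
    # Rough counterpart of:  $s -notmatch 'x'   (but case-sensitive)
    return re.search('x', s) is None

def has_no_x_anchored(s: str) -> bool:
    # Rough counterpart of:  $s -match '^[^x]*$'
    return re.fullmatch(r'[^x]*', s) is not None

for sample in ('sample = x', 'sample = '):
    print(repr(sample), has_no_x(sample), has_no_x_anchored(sample))
```

Both functions agree on every input; the first one simply states the intent more directly.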

Related

Matching Double String, how do I only match with 1 result?

I'm able to match the string I need for regex but it is matching twice.
https://regex101.com/r/KmgGwS/7
if ( $_.PSPath -match ("(?<=\::)(.*?)(?=\\)+")) {
$matches.Values
}
For example, the input string is something like:
'Microsoft.PowerShell.Security\Certificate::CurrentUser\Root\A43489159A520F0D93D032CCAF37E7FE20A8B419'
I expect it to get:
CurrentUser
With the current code, it gets that string twice:
CurrentUser
CurrentUser
tl;dr
$Matches[0] contains what your regex matched overall, $Matches[1] contains what the 1st (and only) capture group (parenthesized sub-expression, (...)) matched - in your case, both values happen to be the same.
Unless you need to explicitly enumerate the overall match as well as capture-group matches, don't use $Matches.Values.
To enumerate capture-group values only, i.e. to enumerate all matches except the overall match, use:
# Enumerate all matches except the overall one (key 0)
$Matches.GetEnumerator().Where({ $_.Key -as [int] -ne 0 }).Value
The automatic $Matches variable, in which PowerShell reflects the results of the most recent -match operation[1], is a hashtable (System.Collections.Hashtable):
Entry 0 ($Matches[0]) contains what the regex matched in full.
All other entries, if any, contain the substrings that capture groups (parenthesized subexpressions, (...)) matched, with entry 1 representing the 1st capture group's match, 2 the 2nd, and so on.
If you use named capture groups (e.g. (?<foo>...)), the entries use that name (e.g., $Matches['foo'] or, alternatively, $Matches.foo).
If you use non-capturing groups ((?:...)), they result in no entry in $Matches.
(Similarly, look-around assertions - (?<=...), (?<!...), (?=...), and (?!...) - do not result in entries.)
As for what you tried:
$Matches.Values outputs a collection of the values of all entries in the hashtable, meaning the overall match (entry 0) as well as any capture-group matches.
Since your regex contains a capture group that effectively captures the same as the regex as a whole, (.*?), $Matches.Values outputs a collection of values that is the equivalent of array 'CurrentUser', 'CurrentUser', which, when output to the console, yields the result shown in the question.
Note that if your regex happens not to contain any capture groups, as suggested in sln's answer, $Matches.Values may appear to return a single string, but in reality it returns an ICollection instance that just happens to have only one element.
Now, that distinction between a single-element collection and a scalar may often be irrelevant in PowerShell, but it's something to be aware of, because there are cases where it matters.
[1] Caveats:
* If the regex didn't match at all, $Matches isn't updated, so the previous value may linger.
* If the LHS of -match is an array (collection), -match acts as a filter, and $Matches isn't updated.
Note that $Matches is also set in the branch handlers of a switch -Regex statement.
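The split between the overall match and the capture-group matches is not specific to PowerShell. As a rough analogy only, here is a Python sketch in which group(0) plays the role of $Matches[0] and group(1) that of $Matches[1]. It uses a simplified form of the question's regex, without the quantifier on the lookahead.

```python
import re

path = r'Microsoft.PowerShell.Security\Certificate::CurrentUser\Root\A43489159A520F0D93D032CCAF37E7FE20A8B419'

m = re.search(r'(?<=::)(.*?)(?=\\)', path)
print(m.group(0))   # overall match, comparable to $Matches[0]
print(m.group(1))   # first capture group, comparable to $Matches[1]
print(m.groups())   # capture groups only, without the overall match
```

Here both group(0) and group(1) are CurrentUser, which is the same duplication the question ran into.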
What you're seeing is the match value and the group 1 value, both of which contain the same thing.
If you want to see only a single value, remove the capture group.
(?<= :: )
.*?
(?= \\ )
Or, like this (?<=::).*?(?=\\)

Why do substrings prevent match with negative lookahead?

Consider the following test data:
x.foo,x.bar
y.foo,y.bar
yy.foo,yy.bar
x.foo,y.bar
y.foo,x.bar
yy.foo,x.bar
x.foo,yy.bar
yy.foo,y.bar
y.foo,yy.bar
I'm attempting to write a regular expression where the string before .foo and the string before .bar are different from each other. The first three items should not match. The other six should.
This mostly works:
^(.+?)\.foo,(?!\1)(.+?)\.bar$
However, it misses on the last one, because y is in match group 1, and thus yy is not matched in match group 2.
Interactive: https://regex101.com/r/Pv5062/1
How can I modify the negative lookahead pattern such that the last item matches as well?
Inline backreferences do not store the context information, they only keep the text captured. You need to specify the context yourself.
You may add a dot after \1:
^(.+?)\.foo,(?!\1\.)(.+?)\.bar$
                 ^^
Or, even repeat the part after the second (.+?):
^(.+?)\.foo,(?!\1\.bar$)(.+?)\.bar$
Or, if the bar part cannot contain ., you may make it more "generic":
^(.+?)\.foo,(?!\1\.[^.]+$)(.+?)\.bar$
See the regex demo and another regex demo.
The point is: your (?!\1) is not "anchored" and will fail the match in case the text stored in Group 1 appears immediately to the right of the current location regardless of the context. To solve this, you need to provide this context. As the value that can be matched with .+? can contain virtually anything all you can rely on is the "hardcoded" bits after the lookahead.
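To see the effect of anchoring the lookahead, the question's test data can be run through both patterns. This is a minimal sketch in Python (the variable names are made up for the example); the two patterns are taken verbatim from the question and from the first suggestion above.

```python
import re

lines = [
    'x.foo,x.bar', 'y.foo,y.bar', 'yy.foo,yy.bar',   # should NOT match
    'x.foo,y.bar', 'y.foo,x.bar', 'yy.foo,x.bar',    # should match
    'x.foo,yy.bar', 'yy.foo,y.bar', 'y.foo,yy.bar',  # should match
]

unanchored = re.compile(r'^(.+?)\.foo,(?!\1)(.+?)\.bar$')
anchored = re.compile(r'^(.+?)\.foo,(?!\1\.)(.+?)\.bar$')

for line in lines:
    print(line, bool(unanchored.match(line)), bool(anchored.match(line)))

# Lines that only the anchored lookahead accepts:
fixed = [l for l in lines if anchored.match(l) and not unanchored.match(l)]
print(fixed)
```

The only line whose result changes is y.foo,yy.bar: with (?!\1) the lookahead sees the y at the start of yy and rejects it, while (?!\1\.) only rejects a y that is followed directly by the dot.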

Regex - date format not using mixed separators

I've written a regex that identifies dates in the form dd/mm/yyyy or dd.mm.yyyy. It currently also accepts dd/mm.yyyy, but I don't want mixed separators to be accepted as valid. How would I modify my regex to fix this issue?
My Regex is:
^(0[1-9]|[12][0-9]|3[01])[/|./.](0[1-9]|1[012])[/./.](19|20)\d\d$
Use a look-ahead to require that the same separator is used:
^(?=.*([/.]).*\1)<your regex here>$
The expression (?=.*([/.]).*\1) is a look ahead that contains a back reference \1 to the first separator [/.], meaning it must be repeated later in the input.
The whole regex would be (simplifying the separator expression to just [/.]):
^(?=.*([/.]).*\1)(0[1-9]|[12][0-9]|3[01])[/.](0[1-9]|1[012])[/.](19|20)\d\d$
Try
^(?:0[1-9]|[12][0-9]|3[01])(\/|\.)(?:0[1-9]|1[012])\1(19|20)\d\d$
This will match
01/02/2018
or
01.02.2018
But will not match
01/02.2018
\1 matches the same text that the first pair of parentheses, (\/|\.) in this case, matched. This is called a "back reference". So the second separator has to repeat whatever the first one matched.
Using (?:) instead of plain () keeps a group from being numbered for back references, which makes the numbering easier to follow. It can also help performance, because anything matched by plain parentheses is stored so that it can be referenced later. So use (?:) whenever parentheses are only there for grouping.
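The back-reference version can be checked quickly in Python. This is only a sketch: the pattern is the one from this answer, and the sample dates are the ones listed above.

```python
import re

# Group 1 is the first separator; \1 forces the second one to be identical.
date_re = re.compile(
    r'^(?:0[1-9]|[12][0-9]|3[01])(\/|\.)(?:0[1-9]|1[012])\1(19|20)\d\d$'
)

samples = ['01/02/2018', '01.02.2018', '01/02.2018']
results = {s: bool(date_re.match(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)
```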
Solution for PHP and Python.
Regex: ^(?:[0-2][0-9]|3[01])(?:(\/)|\.)(?:0[1-9]|1[0-2])(?(1)\/|\.)(?:19|20)\d{2}$
Details:
(?:) Non capturing group
() Capturing group
[] Match a single character present in the list
| Or
(?(1)) If Clause, Group 1.
Output:
01/12/1999 true
25.12.1999 true
23.12/1999 false
23/12.1999 false
23,12/1999 false
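Since this answer targets PHP and Python, the conditional can be tried directly with Python's re module, which supports the (?(1)yes|no) syntax. A minimal sketch, using the pattern and the sample inputs from this answer:

```python
import re

# Group 1 is set only when the first separator is '/'.
# (?(1)\/|\.) then demands '/' if group 1 matched, and '.' otherwise.
date_re = re.compile(
    r'^(?:[0-2][0-9]|3[01])(?:(\/)|\.)(?:0[1-9]|1[0-2])(?(1)\/|\.)(?:19|20)\d{2}$'
)

samples = ['01/12/1999', '25.12.1999', '23.12/1999', '23/12.1999', '23,12/1999']
results = {s: bool(date_re.match(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)
```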

Pattern backreference to an optional capturing subexpression

In an attempt to use Bash's built-in regular expression matching to parse the following types of strings, which are to be converted to Perl substitution expressions (quotes are not part of data)
'~#A#B#'
#^ ^ ^-- Replacement string.
#| +---- Pattern string.
#+------ Regular expression indicator (no need to escape strings A and B),
# which is only allowed if strings A and B are surrounded with ##.
# Strings A and B may not contain #, but are allowed to have ~.
'#A#B#'
#^------ When regex indicator is missing, strings A and B will be escaped.
'A#B'
# Simplified form of '#A#B#', i. e. without the enclosing ##.
# Still none of the strings A and B is allowed to contain # at any position,
# but can have ~, so leading ~ should be treated as part of string A.
I tried the following pattern (again, without quotes):
'^((~)?(#))?([^#]+)#([^#]+)\3$'
That is, it declares the leading ~# optional (and ~ in it even more optional), then captures parts A and B, and requires the trailing # to be present only if it was present in the leader. The leading # is captured for backreference matching only — it is not needed elsewhere, while ~ is captured to be inspected by script afterwards.
However, that pattern only works as expected with the most complete types of input data:
'~#A#B#'
'#A#B#'
but not for
'A#B'
I. e., whenever the leading part is missing, \3 fails to match. But if \3 is replaced with .*, the match succeeds and it can be seen that ${BASH_REMATCH[3]} is an empty string. This is something that I do not understand, provided that unset variables are treated as empty strings in Bash. How do I match a backreference with optional content then?
As a workaround, I could write an alternative pattern
'^(~?)#([^#]+)#([^#]+)#$|^([^#]+)#([^#]+)$'
but it results in distinct capture groups for each possible case, which makes the code less intuitive.
Important note. As @anubhava mentioned in his comment, backreference matching may not be available in some Bash builds (perhaps it is a matter of build options rather than of version number, or even of some external library). This question is of course targeted at those Bash environments that support such functionality.
There are two ways to deal with this problem:
Instead of making the group optional (in other words, allowing it to not match at all), make it mandatory but match the empty string. In other words, change constructs like (#)? to (#?).
Use a conditional to match the backreference \3 only if group 3 matched. To do this, change \3 to (?(3)#|).
Generally, the first option is preferable because of its better readability. Also, bash's regular expressions don't seem to support conditional constructs, so we need to make option 1 work. This is difficult because of the additional condition that ~ is only allowed if # is also present. If bash supported lookaheads, we could do something like ((~)(?=#))?(#?). But since it doesn't, we need to get creative. I've come up with the following pattern:
^((~(#))|(#?))([^#]+)#([^#]+)(\3|\4)$
Demo.
The idea is to make use of the alternation operator | to handle two different cases: Either the text starts with ~#, or it doesn't. ((~(#))|(#?)) captures ~# in group 2 and # in group 3 if possible, but if there's no ~ then it just captures # (if present) in group 4. Then we can use (\3|\4) at the end to match the closing #, if there was an opening one (remember, group 3 captured # if the text started with ~#, and group 4 captured # or the empty string if the text did not start with ~#).
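The group logic can be explored in Python, whose re module also fails a backreference to a group that did not take part in the match. This is only a sketch of the idea: Bash's [[ =~ ]] relies on the platform's regex library, so behaviour there may differ. The last three inputs are extra cases added for illustration.

```python
import re

# Pattern from the question: \3 refers to a group that may not participate.
optional_group = re.compile(r'^((~)?(#))?([^#]+)#([^#]+)\3$')
# Pattern from this answer: group 4 always participates, possibly matching ''.
alternation = re.compile(r'^((~(#))|(#?))([^#]+)#([^#]+)(\3|\4)$')

inputs = ['~#A#B#', '#A#B#', 'A#B', '~A#B', '#A#B', 'A#B#']
for text in inputs:
    m = alternation.match(text)
    if m:
        is_regex = m.group(2) is not None   # leading '~#' was present
        print(text, '->', is_regex, m.group(5), m.group(6))
    else:
        print(text, '-> no match')

print(optional_group.match('A#B'))          # None: \3 has nothing to refer to
```

Strings A and B always end up in groups 5 and 6, so the calling script does not need separate handling per input form.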

PCRE regex backreference works, but subroutines do not

I am trying to match the texts:
1. "HeyHey HeyHey"
2. "HeyHey HeyHeyy"
with the regexes:
a /(\w+) \1\w/
b /(\w+) (\w+)\w/
c /(\w+) (?1)\w/
Regex a matches 1 completely, and 2 completely but the last 'y'.
Regex b completely matches 1 and 2.
Regex c does not match 1 or 2.
Following http://www.rexegg.com/regex-disambiguation.html#subroutines I thought b and c are equivalent. But apparently, they are not.
What is the difference? Why is the subroutine not working, while copying the same regex works?
experimented here: https://regex101.com/#pcre
It is because with PCRE, the reference to a subpattern ((?1) here) is atomic by default.
(Note that this behaviour is particular to PCRE and Perl doesn't share it.)
The subpattern is \w+ (with a greedy quantifier), all the word characters are matched (HeyHeyy in the second string), but since (?1) is atomic, the regex engine can't backtrack and give back the last y to make \w succeed.
You can obtain the same result with this pattern:
/(\w+) (?>\w+)\w/
# ^-----^-- atomic group
That pattern doesn't match the string, whereas without the atomic group the pattern succeeds:
/(\w+) \w+\w/
More about atomic groups: http://regular-expressions.info/atomic.html
This particularity is also described here (but only in a recursive context): http://www.rexegg.com/regex-recursion.html (see "Recursion Depths are Atomic")
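The effect of atomicity can be reproduced without PCRE. Python's re module has no subroutine calls, so the sketch below emulates the atomic \w+ with the lookahead-plus-backreference idiom (?=(\w+))\2, which works because the engine does not backtrack into a lookahead once it has succeeded. This is an illustration of the behaviour described above, not of how PCRE implements (?1).

```python
import re

texts = ['HeyHey HeyHey', 'HeyHey HeyHeyy']

# Plain greedy \w+ : can give back its last character to the trailing \w.
backtracking = re.compile(r'(\w+) \w+\w')
# Emulated atomic \w+ : the lookahead captures the whole word, \2 consumes it,
# and nothing is left for the trailing \w.
atomic_like = re.compile(r'(\w+) (?=(\w+))\2\w')

for text in texts:
    print(text, bool(backtracking.search(text)), bool(atomic_like.search(text)))
```

Both strings match the backtracking pattern and neither matches the atomic-like one, mirroring the difference between regexes b and c in the question.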