Matching Double String, how do I only match with 1 result? - regex

I'm able to match the string I need for regex but it is matching twice.
https://regex101.com/r/KmgGwS/7
if ( $_.PSPath -match ("(?<=\::)(.*?)(?=\\)+")) {
$matches.Values
}
For example, the input string is something like:
'Microsoft.PowerShell.Security\Certificate::CurrentUser\Root\A43489159A520F0D93D032CCAF37E7FE20A8B419'
I expect to get:
CurrentUser
With the current code, it gets that string twice:
CurrentUser
CurrentUser

tl;dr
$Matches[0] contains what your regex matched overall; $Matches[1] contains what the 1st (and only) capture group (parenthesized sub-expression, (...)) matched. In your case, both values happen to be the same.
Unless you need to explicitly enumerate the overall match as well as capture-group matches, don't use $Matches.Values.
To enumerate capture-group values only, i.e. to enumerate all matches except the overall match, use:
# Enumerate all matches except the overall one (key 0)
$Matches.GetEnumerator().Where({ $_.Key -as [int] -ne 0 }).Value
The automatic $Matches variable, in which PowerShell reflects the results of the most recent -match operation[1], is a hashtable (System.Collections.Hashtable):
Entry 0 ($Matches[0]) contains what the regex matched in full.
All other entries, if any, contain the substrings that capture groups (parenthesized subexpressions, (...)) matched, with entry 1 representing the 1st capture group's match, 2 the 2nd, and so on.
If you use named capture groups (e.g. (?<foo>...)), the entries use that name (e.g., $Matches['foo'] or, alternatively, $Matches.foo).
If you use non-capturing groups ((?:...)), they result in no entry in $Matches.
(Similarly, look-around assertions - (?<=...), (?<!...), (?=...), and (?!...) - do not result in entries.)
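The same split between the overall match and capture-group entries shows up in other regex engines too. As a rough illustration only, here is a sketch in JavaScript rather than PowerShell (so there is no $Matches; the array returned by String.prototype.match plays a similar role). The group name store is a hypothetical name chosen for this example.

```javascript
// Illustration only: JavaScript's match array, not PowerShell's $Matches.
const path =
  "Microsoft.PowerShell.Security\\Certificate::CurrentUser\\Root\\A43489159A520F0D93D032CCAF37E7FE20A8B419";

// With a capture group: index 0 is the overall match, index 1 is the group.
const withGroup = path.match(/(?<=::)(.*?)(?=\\)/);

// Without a capture group: only the overall match (index 0) exists.
const withoutGroup = path.match(/(?<=::).*?(?=\\)/);

// Named group (hypothetical name "store"): reported under its name.
// The non-capturing group (?:\\) and the lookbehind add no entries.
const named = path.match(/(?<=::)(?<store>.*?)(?:\\)/);

console.log(withGroup[0], withGroup[1]); // CurrentUser CurrentUser
console.log(withoutGroup.length);        // 1
console.log(named.groups.store);         // CurrentUser
```

Here the duplicate value comes from reading both index 0 and index 1, which mirrors what enumerating every entry of $Matches does.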
As for what you tried:
$Matches.Values outputs a collection of the values of all entries in the hashtable, meaning the overall match (entry 0) as well as any capture-group matches.
Since your regex contains a capture group, (.*?), that effectively captures the same text as the regex as a whole, $Matches.Values outputs a collection equivalent to the array 'CurrentUser', 'CurrentUser', which, when output to the console, yields the result shown in the question.
Note that if your regex happens not to contain any capture groups, as suggested in sln's answer, $Matches.Values may appear to return a single string, but in reality it returns an ICollection instance that just happens to have only one element.
Now, that distinction between a single-element collection and a scalar may often be irrelevant in PowerShell, but it's something to be aware of, because there are cases where it matters.
[1] Caveats:
* If the regex didn't match at all, $Matches isn't updated, so the previous value may linger.
* If the LHS of -match is an array (collection), -match acts as a filter, and $Matches isn't updated.
Note that $Matches is also set in the branch handlers of a switch -Regex statement.

What you're seeing is the match value and the group 1 value, both of which contain the same thing.
If you want to see only a single value, remove the capture group.
(?<= :: )
.*?
(?= \\ )
Or, like this (?<=::).*?(?=\\)

Related

regex lookbehind is `>` a shortcut for `<=`?

Through some convoluted testing, I ended up potentially discovering a shortcut.
A lookbehind in PowerShell is supposed to use the <= syntax, which is referenced in various other places when googling for lookbehinds in PowerShell, e.g., this Microsoft blog.
Take this simple example Regex:
(?>^[^x]*)$
(?> begins the lookbehind
^[^x]* tests that the character x is not present since the beginning of the string
) closes the lookbehind
$ anchors the end of the line
When I test it:
'sample = x' -match '(?>^[^x]*)$'
False
'sample = ' -match '(?>^[^x]*)$'
True
The first block returns False: the lookbehind does not match a string that contains x.
The second block returns True: the lookbehind matches a string without x.
It seems to work!
Now if I try to use the <= syntax:
'sample = x' -match '(?<=^[^x]*)$'
False
'sample = ' -match '(?<=^[^x]*)$'
True
It has the same behavior.
Is this a "shortcut" for RegEx lookbehinds in PowerShell, or why is > working at all?
435|PS(7.2.1) C:\Users\User\Documents [230211-15:03:27]> $PSVersionTable
Name                           Value
----                           -----
PSVersion                      7.2.1
PSEdition                      Core
GitCommitId                    7.2.1
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
tl;dr
Regex grouping constructs (?<=…) and (?>…) serve different purposes and only happen to work the same in your particular scenario; neither is called for in your scenario.
Use '...' -notmatch 'x' to test if a given string contains any instances of 'x' (returns $true if not).
Background information:
The two grouping constructs you reference serve different purposes (enclosed subexpressions are represented with placeholder … below):
(?<=…) is a (zero-width, positive) lookbehind assertion:
It is a non-capturing grouping construct that must match the enclosed subexpression immediately before (to the left, i.e. "looking behind") where the remaining expression matches, without capturing what the subexpression matched.
In essence, this means: When what follows this construct matches, also make sure (assert) that what comes before it matches the subexpression inside (?<=…); if it doesn't, there's no overall match; if it does, don't capture (include in the results) its match.
Therefore, this construct only makes sense if placed before a capturing construct; e.g.:
# Matches only 'unheard of', because only in it is the match
# for 'hear.' preceded by 'un'
# Captures only 'heard' from the matching string, not the 'un'
'heard from', 'unheard of' -match '(?<=un)hear.'
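For comparison, a minimal sketch of the same lookbehind in JavaScript (an assumption made for illustration; the thread itself is about PowerShell). Array.prototype.filter stands in for -match acting as a filter on an array, and the variable names are hypothetical.

```javascript
// Illustration only: JavaScript, not PowerShell.
const inputs = ["heard from", "unheard of"];
const lookbehind = /(?<=un)hear./;

// Keep only the strings in which 'hear.' is preceded by 'un'.
const matching = inputs.filter((s) => lookbehind.test(s));

// The matched text excludes the 'un' that the lookbehind asserted.
const matchedText = "unheard of".match(lookbehind)[0];

console.log(matching);    // [ 'unheard of' ]
console.log(matchedText); // heard
```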
(?>…) is an atomic group, aka non-backtracking subexpression:
It is a non-capturing grouping construct in .NET; unlike a regular capture group (matched subexpression), (…), it never backtracks into the subexpression once that has matched.
In essence, this means: once the subexpression has found a match, it won't allow backtracking based on the remainder of the expression; this construct is mostly used as a performance optimization when it is known that backtracking wouldn't succeed.
# Atomic group:
# -> $false, because the atomic group fully consumes the string,
# so there's nothing for '.' to match *after* the group.
'abc!' -match '(?>.+).'
# Regular capture group:
# -> $true, with backtracking; the capture group captures 'abc'
'abc!' -match '(.+).'
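JavaScript has no (?>…) syntax, so the sketch below swaps in a common emulation: a lookahead containing a capture group, followed by a backreference, (?=(.+))\1. This works because JavaScript lookaheads are themselves atomic. It is an approximation for illustration, not the .NET construct itself, and the variable names are hypothetical.

```javascript
// Illustration only: emulating an atomic group in JavaScript.
// (?=(.+))\1 consumes what .+ matched and never gives any of it back.
const atomicLike = /(?=(.+))\1./;

// Regular capture group: .+ can backtrack by one character so '.' can match.
const regular = /(.+)./;

console.log(atomicLike.test("abc!")); // false
console.log(regular.test("abc!"));    // true
console.log(regular.exec("abc!")[1]); // abc
```

The atomic-like pattern fails at every start position because the greedy .+ always runs to the end of the string and nothing remains for the final dot.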
What you tried:
(?<=^[^x]*)$ - your regex with a lookbehind assertion
As noted above, there's no good reason to use a lookbehind assertion without following it with a capturing expression.
Your regex will by definition not capture anything ($ is itself an assertion).
Since you're matching the whole string, the immediate simplification would be not to use a grouping construct at all (but see the bottom section):
^[^x]*$
As an optimization, if you explicitly want to prevent the capturing that happens by default, use a noncapturing group, (?:…):
(?:^[^x]*$)
(?>^[^x]*)$ - your regex with an atomic group
Since you're matching the whole string, there is no reason to use an atomic group, given that there's no backtracking that needs preventing, so this regex is in effect the same as ^[^x]*$ without any grouping construct.
As noted, there's no reason to capture anything here; if you do want a group, (?:^[^x]*$) makes the non-capturing intent explicit.
In short:
Both your regexes match the input string in full, and therefore require no grouping construct (except, optionally, to explicitly prevent capturing).
Read on for a much simpler solution.
Taking a step back:
The conceptually simplest and most efficient solution is:
'...' -notmatch 'x'
That is, you can let -notmatch, the negated form of PowerShell's -match operator, look for (at most one) x and negate the Boolean result, so that not finding any x returns $true.
In other words: the test succeeds if no x is present in the input string.
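The idea of testing for the absence of a character rather than matching the whole string can be sketched in JavaScript as follows. Two assumptions apply: the function names are hypothetical, and this sketch is case-sensitive, whereas PowerShell's -match and -notmatch are case-insensitive by default.

```javascript
// Illustration only: JavaScript, case-sensitive (unlike PowerShell's default).

// Direct test: succeed when no 'x' occurs anywhere in the string.
const hasNoX = (s) => !/x/.test(s);

// Whole-string variant from the question, without any grouping construct.
const hasNoXAnchored = (s) => /^[^x]*$/.test(s);

for (const s of ["sample = x", "sample = "]) {
  console.log(JSON.stringify(s), hasNoX(s), hasNoXAnchored(s));
}
```

Both functions should agree on every input; the first stops at the first x it finds instead of scanning for a full-string match.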

Why do substrings prevent match with negative lookahead?

Consider the following test data:
x.foo,x.bar
y.foo,y.bar
yy.foo,yy.bar
x.foo,y.bar
y.foo,x.bar
yy.foo,x.bar
x.foo,yy.bar
yy.foo,y.bar
y.foo,yy.bar
I'm attempting to write a regular expression where the string before .foo and the string before .bar are different from each other. The first three items should not match. The other six should.
This mostly works:
^(.+?)\.foo,(?!\1)(.+?)\.bar$
However, it misses on the last one, because y is in match group 1, and thus yy is not matched in match group 2.
Interactive: https://regex101.com/r/Pv5062/1
How can I modify the negative lookahead pattern such that the last item matches as well?
Inline backreferences do not store the context information, they only keep the text captured. You need to specify the context yourself.
You may add a dot after \1:
^(.+?)\.foo,(?!\1\.)(.+?)\.bar$
                 ^^
Or, even repeat the part after the second (.+?):
^(.+?)\.foo,(?!\1\.bar$)(.+?)\.bar$
Or, if the bar part cannot contain ., you may make it more "generic":
^(.+?)\.foo,(?!\1\.[^.]+$)(.+?)\.bar$
See the regex demo and another regex demo.
The point is: your (?!\1) is not "anchored" and will fail the match in case the text stored in Group 1 appears immediately to the right of the current location, regardless of the context. To solve this, you need to provide this context. As the value that can be matched with .+? can contain virtually anything, all you can rely on is the "hardcoded" bits after the lookahead.
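To see the difference on the question's test data, here is a small sketch that runs both the original pattern and the variant with the added dot. JavaScript is used for illustration; the helper name matchingRows is hypothetical.

```javascript
// Test data from the question.
const rows = [
  "x.foo,x.bar", "y.foo,y.bar", "yy.foo,yy.bar",
  "x.foo,y.bar", "y.foo,x.bar", "yy.foo,x.bar",
  "x.foo,yy.bar", "yy.foo,y.bar", "y.foo,yy.bar",
];

// Original: the lookahead rejects any text that merely starts with group 1.
const unanchored = /^(.+?)\.foo,(?!\1)(.+?)\.bar$/;

// Fixed: the trailing \. gives the backreference its context.
const anchored = /^(.+?)\.foo,(?!\1\.)(.+?)\.bar$/;

// Hypothetical helper: return the rows a pattern matches.
const matchingRows = (re) => rows.filter((row) => re.test(row));

console.log(matchingRows(unanchored).length); // 5 (misses 'y.foo,yy.bar')
console.log(matchingRows(anchored).length);   // 6
```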

Repeated variable length regexp matching

I have an expression
AA-BB/CC/DD
I want to convert this to
<AA-BB> <AA-CC> <AA-DD>
All I can do is configure this as a regexp substitution. I can't figure it out.
AA should match at the beginning of a line. - and / are literal characters; BB, CC and DD are numbers, i.e. \d+
So a first draft is ...
^(\w+)([\-/]\d+)+
but I want all matches, not just the greedy one.
(actually this one matches AA-BB-CC-DD too, but that's ok although it's not according to spec)
No, you can't do that with regex. Probably with .net, because there you can access all intermediate results of repeated capturing groups ...
Repeating a Capturing Group vs. Capturing a Repeated Group
That is the problem, if you do something like ^(\w+)([\-/]\w+)+ the value stored in group2 is always only the last pattern it matched. Your task is not possible with regex/replace.
I would do something like:
^(\w+)-([\w\/]+)
Then split the content of group 2 by "/" and combine group1 with each element of the array resulting from the split.
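A minimal sketch of that split-and-combine step, assuming a host language is available (JavaScript here; the function name expand is hypothetical). The pattern uses \d+ for the numeric parts, following the question's description rather than the looser \w.

```javascript
// Hypothetical helper: turn 'AA-11/22/33' into '<AA-11> <AA-22> <AA-33>'.
function expand(line) {
  // Group 1: the prefix; group 2: the whole slash-separated list of numbers.
  const m = /^(\w+)-(\d+(?:\/\d+)*)$/.exec(line);
  if (m === null) return null; // line does not have the expected shape

  const [, prefix, rest] = m;
  // Split group 2 on '/' and pair the prefix with each element.
  return rest
    .split("/")
    .map((part) => `<${prefix}-${part}>`)
    .join(" ");
}

console.log(expand("AA-11/22/33")); // <AA-11> <AA-22> <AA-33>
```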

Matching quote contents

I am trying to remove quotes from a string. Example:
"hello", how 'are "you" today'
returns
hello, how are "you" today
I am using php preg_replace.
I've got a couple of solutions at the moment:
(\'|")(.*)\1
Problem with this is it matches all characters (including quotes) in the middle, so the result ($2) is
hello", how 'are "you today'
Backreferences cannot be used in character classes, so I can't use something like
(\'|")([^\1\r\n]*)\1
to not match the first backreference in the middle.
Second solution:
(\'[^\']*\'|"[^"]*")
Problem is, this includes the quotes in the back reference so doesn't actually do anything at all. The result ($1):
"hello", how 'are "you" today'
Instead of:
(\'[^\']*\'|"[^"]*")
Simply write:
\'([^\']*)\'|"([^"]*)"
  \______/    \_____/
     1           2
Now one of the groups will match the quoted content.
In most flavors, when a group that failed to match is referred to in a replacement string, the empty string gets substituted in, so you can simply replace with $1$2: one will be the successful capture (depending on the alternate) and the other will substitute in the empty string.
Here's a PHP implementation (as seen on ideone.com):
$text = <<<EOT
"hello", how 'are "you" today'
EOT;
print preg_replace(
'/\'([^\']*)\'|"([^"]*)"/',
'$1$2',
$text
);
# hello, how are "you" today
A closer look
Let's use 1 and 2 for the quotes (for clarity). Whitespaces will also be added (for clarity).
Before, you have, as your second solution, this pattern:
(  1[^1]*1  |  2[^2]*2  )
\_______________________/
   capture whole thing
   content and quotes
As you correctly pointed out, this matches a pair of quotes correctly (assuming that you can't escape quotes), but it doesn't capture the content part.
This may not be a problem depending on context (e.g. you can simply trim one character from the beginning and end to get the content), but at the same time, it's also not that hard to fix the problem: simply capture the content from the two possibilities separately.
1([^1]*)1  |  2([^2]*)2
 \_____/       \_____/
 capture contents from
 each alternate separately
Now either group 1 or group 2 will capture the content, depending on which alternate was matched. As a "bonus", you can check which quote was used, i.e. if group 1 succeeded, then 1 was used.
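The "bonus" of checking which group participated can be sketched like this. JavaScript is used for illustration instead of PHP, and the function name quotedParts is hypothetical. In JavaScript a group that did not participate in the match is undefined.

```javascript
// Hypothetical helper: list each quoted run with the quote character used.
function quotedParts(text) {
  const re = /'([^']*)'|"([^"]*)"/g;
  return Array.from(text.matchAll(re), (m) =>
    m[1] !== undefined
      ? { quote: "'", content: m[1] }  // first alternate matched
      : { quote: '"', content: m[2] }  // second alternate matched
  );
}

const sample = `"hello", how 'are "you" today'`;
console.log(quotedParts(sample));
// [ { quote: '"', content: 'hello' },
//   { quote: "'", content: 'are "you" today' } ]
```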
Appendix
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
(…) is used for grouping. (pattern) is a capturing group and creates a backreference. (?:pattern) is non-capturing.
References
regular-expressions.info/Brackets for capturing, Alternation, Character class, Repetition
Regarding:
Backreferences cannot be used in character classes, so I can't use something like
(\'|")([^\1\r\n]*)\1
(\'|")(((?!(\1|\r|\n)).)*)\1
(where (?!...) is a negative lookahead for ...) should work.
I don't know whether this solves your main problem, but it does solve the "match a character iff it doesn't match a backref" part.
Edit:
Missed a parenthesis, fixed.
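A quick check of the lookahead-with-backreference approach on the question's sample, sketched in JavaScript for illustration (the question uses PHP's preg_replace). The inner groups are written as non-capturing here, which is a small restructuring of the pattern above so that the content lands in group 2; the variable names are hypothetical.

```javascript
// Group 1: the opening quote. Group 2: every character that is not
// that same quote (or a line break), tested one position at a time.
const quoted = /('|")((?:(?!\1|\r|\n).)*)\1/g;

const input = `"hello", how 'are "you" today'`;

// Replace each quoted run with its content only.
const unquoted = input.replace(quoted, "$2");

console.log(unquoted); // hello, how are "you" today
```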
You cannot do this with a regular expression. This requires an internal state to keep track of (among other things)
Whether or not a previous quote of a certain type has been encountered
Whether or not the "outer" level of quotes is the current level
Whether an "inner" set of quotes has been descended into, and if so, where that set of quotes begins in the string
This requires a grammar-aware parser to do correctly. A regular expression engine does not keep state because it is a finite-state automaton, which only operates on the current input regardless of previous circumstances.
It's the same reason you cannot reliably match sets of nested parentheses or XML elements.

using a matched expression as a starting point for a match

I'm using http://regexpal.com/ and some of the look-ahead and look-behind are not supported in JavaScript.
Is it still possible to use one matched criterion to signal the beginning of a match and another to signal the end, without either being included in the match?
for example, if I'm using [tag] to only get the tag,
or if I have {abc 1,def} to match abc 1 and def
it's easy enough to get this when the string is short, but I would like it to find this from a longer string, only when this group is surrounded by { } and individual items surrounded by the ` character
If you don't have lookbehind as in JavaScript, you can use a non-capturing group (or no group at all) instead of the (positive) lookbehind. Of course this will then become a part of the overall match, so you need to enclose the part you actually want to match in capturing parentheses, and then evaluate not the entire match but just that capturing group.
So instead of
(?<=foo)bar
you could use
foo(bar)
In the first version, the match result bar would be in backreference $0. In the second version $0 would equal foobar, but $1 will contain bar.
This will fail, though, if a match and the lookbehind of the next match would overlap. For example, if you want to match digits that are surrounded by letters.
(?<=[a-z])[0-9](?=[a-z])
will match all numbers in a1b2c3d, but
[a-z]([0-9])[a-z]
will only match 1 and 3.
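The overlap problem can be reproduced with a short sketch. It assumes a JavaScript engine that supports lookbehind (newer engines do; the one discussed in this thread did not). The third pattern, which keeps only the lookahead, is an extra variation added here for illustration and is not part of the answer above.

```javascript
const text = "a1b2c3d";

// Look-arounds consume nothing, so neighbouring matches can share letters.
const withLookarounds = text.match(/(?<=[a-z])[0-9](?=[a-z])/g);

// Consuming both letters: 'a1b' uses up the 'b' that '2' would need.
const consuming = Array.from(text.matchAll(/[a-z]([0-9])[a-z]/g), (m) => m[1]);

// Variation: consume only the leading letter, look ahead for the trailing one.
const lookaheadOnly = Array.from(
  text.matchAll(/[a-z]([0-9])(?=[a-z])/g),
  (m) => m[1]
);

console.log(withLookarounds); // [ '1', '2', '3' ]
console.log(consuming);       // [ '1', '3' ]
console.log(lookaheadOnly);   // [ '1', '2', '3' ]
```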
I suppose you can always use grouping:
m = "blah{abc 1, def}blah".match(/\{(.*?)\}/)
Where
m[0] = "{abc 1, def}"
m[1] = "abc 1, def"
That regexpal page doesn't show the resulting subgroups, if any.
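Building on the grouping idea, one possible way to get at the individual items is to split the captured group afterwards. This is only a sketch: the function name itemsInBraces is hypothetical, and it assumes items are comma-separated and never contain a comma or a closing brace themselves.

```javascript
// Hypothetical helper: return the comma-separated items inside the first {...}.
function itemsInBraces(text) {
  const m = text.match(/\{(.*?)\}/);
  if (m === null) return []; // no braces found

  // m[0] is the whole '{...}', m[1] only the content between the braces.
  return m[1].split(",").map((item) => item.trim());
}

console.log(itemsInBraces("blah{abc 1,def}blah")); // [ 'abc 1', 'def' ]
```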