Regex: capture multiple matches in a group

I'm not sure if this is possible. I'm looking for a way to capture multiple matches in a group.
This works perfectly fine:
"Catch me if you can" -match "(?=.*(Catch))"
Result: Catch
I would like to have the result of two matches in the group:
"Catch me if you can" -match "(?=.*Catch)(?=.*me)"
Expected Result: Catch me

Note: If hard-coding the result of both regex subexpressions matching is sufficient, simply use:
if ('Catch me if you can' -match '(?=.*Catch)(?=.*me)') { 'Catch me' }
You're trying to:
match two separate regex subexpressions,
and report what specific strings they matched only if BOTH matched.
Note:
While it is possible to use a variation of your regex, which concatenates two look-ahead assertions ((?=.*(Catch))(?=.*(me))), to extract what the two subexpressions of interest captured, the captured substrings would be reported in the order in which the subexpressions are specified in the regex, not in the order in which the substrings appear in the input string. E.g., input string 'me Catch if you can' would also result in output string 'Catch me'
The following solution uses the [regex]::Match() .NET API to preserve the input order of the captured substrings, by sorting the captured groups by their starting position in the input string (the Index property):
$match = [regex]::Match('me Catch if you can', '(?=.*(Catch))(?=.*(me))', 'IgnoreCase')
if ($match.Success) { ($match.Groups | Select-Object -Skip 1 | Sort-Object Index).Value -join ' ' }
Note the use of the IgnoreCase option, so as to match PowerShell's default behavior of case-insensitive matching.
The above outputs 'me Catch', i.e. the captured substrings in the order in which they appear in the input string.
If instead you prefer that the captured substrings be reported in the order in which the subexpressions that matched them appear in the regex ('Catch me'), simply omit | Sort-Object Index from the command above.
Alternatively, you could make your -match operation work as follows: enclose the subexpressions of interest in (...) to form capture groups, then access the captured substrings via the automatic $Matches variable. Note that no information about matching positions is then available:
if ('me Catch if you can' -match '(?=.*(Catch))(?=.*(me))') {
$Matches[1..2] -join ' ' # -> 'Catch me'
}
Note that this only works because a single match result captures both substrings of interest, due to the concatenation of two look-ahead assertions ((?=...)); because -match only ever looks for one match, the simpler 'Catch|me' regex would not work, as it would stop matching once either subexpression is found.
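As a rough cross-check of the idea above, here is a minimal sketch in Python, with Python's re module standing in for the .NET regex engine. The helper name captures_in_input_order is made up for illustration; it runs the concatenated look-aheads once and then orders the captured groups by their start offset.

```python
import re

def captures_in_input_order(text, pattern):
    """Join the captured substrings, ordered by where they start in text."""
    m = re.search(pattern, text, re.IGNORECASE)
    if m is None:
        return None  # at least one look-ahead failed, so nothing is reported
    # Pair each group's start offset with its text, then sort by offset.
    found = sorted((m.start(i), m.group(i)) for i in range(1, m.re.groups + 1))
    return ' '.join(value for _, value in found)

print(captures_in_input_order('me Catch if you can', r'(?=.*(Catch))(?=.*(me))'))
```

Dropping the sorted() call would report the groups in regex order instead, which corresponds to omitting the sort step in the PowerShell command.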
See also:
GitHub issue #7867, which suggests introducing a -matchall operator that returns all matches found in the input string.

The (?=...) construct is a lookahead, but in your regex it isn't placed after anything to look ahead from. In this example, the lookahead follows "Catch" and checks whether ".*me" can be found after it.
Catch(?=.*me)
Also, do you really want to match "catchABCme"? I would think you would want to match "catch ABC me", but not "catchABCme", "catchABC me", or "catch ABCme".
Here is some test code to play with:
# Input lines to test
$Lines = @(
    'catch ABC me if you can',
    'catch ABCme if you can',
    'catchABC me if you can'
)
# Candidate patterns, from least to most strict
$RegExCheckers = @(
    'Catch(?=.*me)',
    'Catch(?=.*\s+me)',
    'Catch\s(?=(.*\s+)?me)'
)
foreach ($RegEx in $RegExCheckers) {
    $RegExOut = "`"$RegEx`"".PadLeft(22,' ')
    foreach ($Line in $Lines) {
        $LineOut = "`"$Line`"".PadLeft(26,' ')
        if ($Line -match $RegEx) {
            Write-Host "$RegExOut matches $LineOut"
        } else {
            Write-Host "$RegExOut didn't match $LineOut"
        }
    }
    Write-Host
}
And here is the output:
"Catch(?=.*me)" matches "catch ABC me if you can"
"Catch(?=.*me)" matches "catch ABCme if you can"
"Catch(?=.*me)" matches "catchABC me if you can"
"Catch(?=.*\s+me)" matches "catch ABC me if you can"
"Catch(?=.*\s+me)" didn't match "catch ABCme if you can"
"Catch(?=.*\s+me)" matches "catchABC me if you can"
"Catch\s(?=(.*\s+)?me)" matches "catch ABC me if you can"
"Catch\s(?=(.*\s+)?me)" didn't match "catch ABCme if you can"
"Catch\s(?=(.*\s+)?me)" didn't match "catchABC me if you can"
As you can see, the last RegEx expression requires a space after "catch" and before "me".
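The same three patterns can be checked outside PowerShell too. The following is only a sketch in Python (its re module standing in for the .NET engine, with IGNORECASE to mimic -match); the names LINES, PATTERNS and match_table are illustrative.

```python
import re

LINES = [
    'catch ABC me if you can',
    'catch ABCme if you can',
    'catchABC me if you can',
]
PATTERNS = [
    r'Catch(?=.*me)',          # "me" anywhere after "Catch"
    r'Catch(?=.*\s+me)',       # "me" must be preceded by whitespace
    r'Catch\s(?=(.*\s+)?me)',  # whitespace after "Catch" and before "me"
]

def match_table(patterns, lines):
    """Map each (pattern, line) pair to True if the pattern is found in the line."""
    return {(p, line): re.search(p, line, re.IGNORECASE) is not None
            for p in patterns for line in lines}

for (pattern, line), ok in match_table(PATTERNS, LINES).items():
    verdict = 'matches' if ok else "didn't match"
    print(f'{pattern!r} {verdict} {line!r}')
```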
Also, a great place to test RegEx is regex101.com. You can place the RegEx at the top and the lines you want to test it against in the box in the middle.

Related

Powershell: Can't Get RegEx to work on multiple lines

I am getting notes from a ticket that come in the form of:
[Employee ID]:
[First Name]: Test
[Last Name]: User
[Middle Initial]:
[Email]:
[Phone]:
[* Last 4 of SSN]: 1234
I've tried the following code to get the first name (in this example it would be 'Test'):
if ($incNotes -match '(^\[First Name\]:)(. * ?$)')
{
Write-Host $_.matches.groups[0].value
Write-Host $_.matches.groups[1].value
}
But I get nothing. Is there a way I can use just one long regex pattern to get the information I need? The information stays in the same format on every ticket that comes through.
How would I get the information after the [First Name]: and so on....
You can use
if ($incNotes -match '(?m)^\[First Name]: *(\S+)') {
Write-Host $matches[1]
}
If any kind of horizontal whitespace can appear between : and the name, replace the space with [\p{Zs}\t], or with a character class subtraction such as [\s-[\r\n]].
Details:
(?m) - a RegexOptions.Multiline option that makes ^ match start of any line position, and $ match end of lines
^ - start of a line
\[First Name]: - a [First Name]: string
" *" - a space followed by the * quantifier, i.e. zero or more spaces
(\S+) - Capturing group 1: one or more non-whitespace chars (replace with \S.* or \S[^\n\r]* to match any text up to the end of the line).
Note that -match is a case insensitive regex matching operator, use -cmatch if you need a case sensitive behavior. Also, it only finds the first match and $matches[1] returns the Group 1 value.
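To see the multiline anchor at work on the whole ticket, here is a small sketch in Python (re.MULTILINE plays the role of (?m); the helper field_value and the NOTES constant are hypothetical names, and the sample text is the one from the question).

```python
import re

NOTES = """[Employee ID]:
[First Name]: Test
[Last Name]: User
[Middle Initial]:
[Email]:
[Phone]:
[* Last 4 of SSN]: 1234"""

def field_value(notes, field):
    """Return the first non-whitespace token after '[field]:' on the same line, or None."""
    # re.escape protects characters such as * in the field name.
    pattern = rf'^\[{re.escape(field)}\]: *(\S+)'
    m = re.search(pattern, notes, re.MULTILINE | re.IGNORECASE)
    return m.group(1) if m else None

print(field_value(NOTES, 'First Name'))
```

Because only spaces (not \s) are allowed between the colon and the value, an empty field such as [Email]: does not pick up text from the next line.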

PowerShell Regex - word with wildcards and commas

Trying to do a replace on what I understand to be a simple operation but hitting a wall.
I can replace a word with a comma on the end:
$firstval = 'ssonp,RDPNP,LanmanWorkstation,webclient,MfeEpePcNP,PRNetworkProvider'
($firstval) -replace 'webclient+,',''
ssonp,RDPNP,LanmanWorkstation,MfeEpePcNP,PRNetworkProvider
But I haven't been able to work out how to add a wildcard to the word, or how to handle multiple words with wildcards, each followed by a comma, e.g.:
w* client+,* fee*, etc
(spaces added to stop being interpreted as formatting within the question)
I played with a few permutations and tried examples from other questions without any luck.
The -replace operator takes a regular expression as its first parameter. You seem to be confusing wildcards and regular expressions. Your pattern w*client+,*fee*,, though a valid regular expression, seems to be intended to use wildcards.
The regular expression equivalent of the * wildcard is .*, where . means "any character" and * means "0 or more occurrences". Thus, the regular expression equivalent of w*client, would be w.*client,, and, similarly the regular expression equivalent of *fee*, would be .*fee.*,. Since the string to be searched has comma-separated values, however, we don't want our patterns to include "any character" (.*) but rather "any character but comma" ([^,]*). Therefore, the patterns to use become w[^,]*client, and [^,]*fee[^,]*,, respectively.
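The translation described above can be sketched mechanically. This is an illustration in Python rather than PowerShell, and wildcard_to_csv_regex is a made-up helper: it escapes the wildcard literally and then turns each * into [^,]*, so a match can never run across a comma.

```python
import re

def wildcard_to_csv_regex(wildcard):
    """Translate a *-wildcard into a regex whose * cannot cross a comma."""
    # re.escape turns * into \*, which is then swapped for "any non-comma run".
    return re.escape(wildcard).replace(r'\*', '[^,]*')

pattern = wildcard_to_csv_regex('w*client') + ','
print(pattern)
print(re.sub(pattern, '', 'window,webclient,last', flags=re.IGNORECASE))
# For comparison, the naive .* translation swallows the neighbouring value too:
print(re.sub(r'w.*client,', '', 'window,webclient,last', flags=re.IGNORECASE))
```

The first substitution leaves window,last, while the .* version removes window, as well because .* is free to run across the comma.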
To search for both words in a string, separate the two patterns with |. The following builds such a pattern and tests it against strings with a match in various locations:
# Match w*client or *fee*
$wordPattern = 'w[^,]*client|[^,]*fee[^,]*';
# Match $wordPattern and at most one comma before or after
$wordWithAdjacentCommaPattern = '({0}),?|,({0})$' -f $wordPattern;
"`$wordWithAdjacentCommaPattern: $wordWithAdjacentCommaPattern";
# Replace single value
'webclient', `
# Replace first value
'webclient,middle,last', `
# Replace middle value
'first,webclient,last', `
# Replace last value
'first,middle,webclient' `
| ForEach-Object -Process { '"{0}" => "{1}"' -f $_, ($_ -replace $wordWithAdjacentCommaPattern); };
This outputs the following:
$wordWithAdjacentCommaPattern: (w[^,]*client|[^,]*fee[^,]*),?|,(w[^,]*client|[^,]*fee[^,]*)$
"webclient" => ""
"webclient,middle,last" => "middle,last"
"first,webclient,last" => "first,last"
"first,middle,webclient" => "first,middle"
A non-regex alternative you might consider would be to split your input string into individual values, filter out values that match certain wildcards, and reassemble what's left into comma-separated values:
(
'ssonp,RDPNP,LanmanWorkstation,webclient,MfeEpePcNP,PRNetworkProvider' -split ',', -1, 'SimpleMatch' `
| Where-Object { $_ -notlike 'w*client' -and $_ -notlike '*fee*'; } `
) -join ',';
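The split, filter and re-join approach looks much the same in other languages. A minimal sketch in Python, assuming case-insensitive wildcard matching like -notlike (hence the lower-casing); drop_wildcard_matches is an illustrative name, and fnmatchcase comes from the standard fnmatch module.

```python
from fnmatch import fnmatchcase

def drop_wildcard_matches(values, wildcards):
    """Remove comma-separated values that match any of the given wildcards."""
    kept = [v for v in values.split(',')
            if not any(fnmatchcase(v.lower(), w.lower()) for w in wildcards)]
    return ','.join(kept)

providers = 'ssonp,RDPNP,LanmanWorkstation,webclient,MfeEpePcNP,PRNetworkProvider'
print(drop_wildcard_matches(providers, ['w*client', '*fee*']))
```

No comma bookkeeping is needed here, because the separators are only added back between the values that survive the filter.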
By the way, you used the regular expression webclient+, to match and remove the text webclient, from your string (which looks like the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\NetworkProvider\Order\ProviderOrder registry value). Note that, with the +, the pattern searches for the literal text webclien followed by one or more occurrences of t, followed by the literal text ,. Thus, it matches webclientt,, webclienttt,, webclientttttttttt,, etc. as well as webclient,. If you are only interested in matching webclient, you can just use the pattern webclient, (no +).
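A quick way to convince yourself of what the + does, sketched in Python (the quantifier behaves the same way there; matches_fully is just an illustrative helper around re.fullmatch):

```python
import re

def matches_fully(pattern, text):
    """True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

for candidate in ['webclient,', 'webclienttt,', 'webclien,']:
    print(candidate, matches_fully(r'webclient+,', candidate), matches_fully(r'webclient,', candidate))
```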

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                       Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
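For comparison, the same pattern in Python, as a sketch only. Python has no list-context rule like Note 1; the capture is read from the match object instead. The helper name word_after is made up.

```python
import re

def word_after(text, keyword):
    """Return the first run of non-whitespace after keyword, or None."""
    m = re.search(re.escape(keyword) + r'\s+(\S+)', text)
    return m.group(1) if m else None

print(word_after("Today i'm not going anywhere except to office.", 'anywhere'))
```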
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to put parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also put parentheses around the =~ binding expression to improve readability, but that is not necessary because =~ has fairly high precedence.
Use the word character class \w:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note that I'm using the x modifier, so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience: you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest), but the longer form also lets you control the scope of each variable, as in (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
Here [^\s]+ captures the word between the two runs of whitespace.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);
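A sketch of the punctuation-tolerant variant in Python. Python's re module has no [[:punct:]] class, so this assumes ASCII punctuation from string.punctuation as a substitute; word_after_punct and PUNCT_OR_SPACE are illustrative names.

```python
import re
import string

# One or more ASCII punctuation or whitespace characters (stand-in for [[:punct:]\s]+).
PUNCT_OR_SPACE = '[' + re.escape(string.punctuation) + r'\s]+'

def word_after_punct(text, keyword):
    """Return the word after keyword, skipping punctuation and whitespace in between."""
    m = re.search(re.escape(keyword) + PUNCT_OR_SPACE + r'(\S+)', text)
    return m.group(1) if m else None

print(word_after_punct("Today i'm not going anywhere; except to office.", 'anywhere'))
```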

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the start of the string along with $1 in this case.
Try this pattern
foreach (@list) {
    s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
    print $_ . "\n";
}
Explanations:
Here I use the branch reset feature (i.e. (?|...()...|...()...)), which lets capturing groups in different alternatives share the same group number ($1 here). That way you avoid a second substitution to trim the leading zero from the capture.
To remove everything before the number, I use:
.*?   # any characters, zero or more times
      # (the ? makes the * quantifier lazy, so it matches as little as possible)
_?    # an optional underscore
Note that you can make sure there are only 2 digits by adding a negative lookahead that checks that no digit follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
    if (/(\d{2})/) {
        $_ = "$1.txt";
    }
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
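To make the difference concrete, here is a sketch in Python (not the original Perl). partial_sub reproduces the behaviour from the question, where only the matched part of the string is replaced, and two_digit_name is the "just read two digits" approach, with int() playing the role of adding zero. All names are illustrative.

```python
import re

NAMES = ['R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt']

def partial_sub(name):
    """Substitution from the question: text before the digits is never matched, so it stays."""
    return re.sub(r'(\d{2}).*\.txt$', r'\1.txt', name)

def two_digit_name(name):
    """Build the new name from the first two consecutive digits; int() drops a leading zero."""
    m = re.search(r'\d{2}', name)
    return f'{int(m.group())}.txt' if m else name

print([partial_sub(n) for n in NAMES])
print([two_digit_name(n) for n in NAMES])
```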
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/.*(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The problem is that the match part of your s/// matches what you think it does, but the substitution does not replace what you think it should: s/// only replaces the text that was actually matched. Thus, to replace something like T3_ you have to match it too.
s/.*(\d{2}).*\.txt$/$1.txt/;
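The effect of the leading .* can be tried out in Python as well; this is a sketch with a made-up helper name. One thing worth knowing is that the leading .* is greedy, so if a name contains several two-digit runs, the last one is captured.

```python
import re

def strip_prefix_and_zero(name):
    """Match the whole name so the prefix is replaced too, then trim leading zeros."""
    name = re.sub(r'.*(\d{2}).*\.txt$', r'\1.txt', name)
    return re.sub(r'^0+', '', name)

for n in ['R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt', 'a11_b22.txt']:
    print(n, '->', strip_prefix_and_zero(n))
```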

Negative lookahead assertion with the * modifier in Perl

I have the (what I believe to be) negative lookahead assertion <#> *(?!QQQ) that I expect to match if the tested string is a <#> followed by any number of spaces (including zero) and then not followed by QQQ.
Yet, if the tested string is <#> QQQ the regular expression matches.
I fail to see why this is the case and would appreciate any help on this matter.
Here's a test script
use warnings;
use strict;
my @strings = ('something <#> QQQ',
               'something <#> RRR',
               'something <#>QQQ' ,
               'something <#>RRR' );
print "$_\n" for map {$_ . " --> " . rep($_) } (@strings);
sub rep {
    my $string = shift;
    $string =~ s,<#> *(?!QQQ),at w/o ,;
    $string =~ s,<#> *QQQ,at w/ QQQ,;
    return $string;
}
This prints
something <#> QQQ --> something at w/o QQQ
something <#> RRR --> something at w/o RRR
something <#>QQQ --> something at w/ QQQ
something <#>RRR --> something at w/o RRR
And I'd have expected the first line to be something <#> QQQ --> something at w/ QQQ.
It matches because zero is included in "any number". So no spaces, followed by a space, matches "any number of spaces not followed by a Q".
You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):
<#> *(?!QQQ)(?! )
ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one less space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc) take a back seat - if it can match more than one way, they determine which way is chosen. But matching always wins over not matching.
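You can watch the engine give back spaces by printing exactly what was matched. A small sketch in Python (the backtracking behaviour is the same as in Perl for these patterns; matched_text is an illustrative helper):

```python
import re

def matched_text(pattern, text):
    """Return the text the pattern actually matched, or None if it did not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Zero spaces are taken, so the look-ahead sees " QQQ" and succeeds.
print(repr(matched_text(r'<#> *(?!QQQ)', 'something <#> QQQ')))
# With + and two spaces, the engine simply keeps one space and still succeeds.
print(repr(matched_text(r'<#> +(?!QQQ)', 'something <#>  QQQ')))
# The extra look-ahead forbids stopping in front of a space.
print(repr(matched_text(r'<#> *(?!QQQ)(?! )', 'something <#> QQQ')))
```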
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
One problem of yours here is that you are viewing the two regexes separately. You first ask to replace the string without QQQ, and then to replace the string with QQQ. This is actually checking the same thing twice, in a sense. For example: if (X==0) { ... } elsif (X!=0) { ... }. In other words, the code may be better written:
unless ($string =~ s,<#> *QQQ,at w/ QQQ,) {
    $string =~ s,<#> *,at w/o ,;
}
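The "try the QQQ case first, otherwise fall back" structure, sketched in Python with re.subn, which returns the number of replacements made. The function name rep mirrors the question's sub; the replacement strings, including the trailing space in 'at w/o ', are taken from the question's script.

```python
import re

def rep(string):
    """Replace '<#> QQQ' first; only if that did not apply, replace a plain '<#>'."""
    new, count = re.subn(r'<#> *QQQ', 'at w/ QQQ', string)
    if count == 0:
        new = re.sub(r'<#> *', 'at w/o ', string)
    return new

for s in ['something <#> QQQ', 'something <#> RRR', 'something <#>QQQ', 'something <#>RRR']:
    print(s, '-->', rep(s))
```

No look-ahead is needed at all this way, since the second substitution only runs when QQQ is known to be absent.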
You always have to be careful with the * quantifier. Since it matches zero or more times, it can also match the empty string, which basically means: it can match any place in any string.
A negative look-around assertion has a similar quality, in the sense that it needs to only find a single thing that differs in order to match. In this case, it matches the part "<#> " as <#> + no space + space, where space is of course "not" QQQ. You are more or less at a logical impasse here, because the * quantifier and the negative look-ahead counter each other.
I believe the correct way to solve this is to separate the regexes, like I showed above. There is no sense in allowing the possibility of both regexes being executed.
However, for theoretical purposes, a working regex that allows both any number of spaces and a negative look-ahead would need to be anchored, much like Mark Reed has shown. This one might be the simplest:
<#>(?! *QQQ) # Add the spaces to the look-ahead
The difference is that now the spaces and Qs are anchored to each other, whereas before they could match separately. To drive home the point of the * quantifier, and also solve a minor problem of removing additional spaces, you can use:
<#> *(?! *QQQ)
This will work because either of the quantifiers can match the empty string. Theoretically, you can add as many of these as you want, and it will make no difference (except in performance): / * * * * * * */ is functionally equivalent to / */. The difference here is that spaces combined with Qs may not exist.
The regex engine will backtrack until it finds a match, or until finding a match is impossible. In this case, it found the following match:
                         +--------------- Matches "<#>".
                         |   +----------- Matches "" (empty string).
                         |   |       +--- Doesn't match " QQQ".
                         |   |       |
                        --- ---- -------
'something <#> QQQ' =~ /<#> [ ]* (?!QQQ)/x
All you need to do is shuffle things around. Replace
/<#>[ ]*(?!QQQ)/
with
/<#>(?![ ]*QQQ)/
Or you can make it so the regex will only match all the spaces:
/<#>[ ]*+(?!QQQ)/
/<#>[ ]*(?![ ]|QQQ)/
/<#>[ ]*(?![ ])(?!QQQ)/
PS — Spaces are hard to see, so I use [ ] to make them more visible. It gets optimised away anyway.
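A quick check of the non-possessive alternatives against the four test strings, sketched in Python (the possessive *+ form is left out of the sketch). VARIANTS, STRINGS and accepts are illustrative names.

```python
import re

VARIANTS = [
    r'<#>(?![ ]*QQQ)',         # spaces moved into the look-ahead
    r'<#>[ ]*(?![ ]|QQQ)',     # may not stop before a space or QQQ
    r'<#>[ ]*(?![ ])(?!QQQ)',  # same condition, as two look-aheads
]
STRINGS = ['something <#> QQQ', 'something <#> RRR',
           'something <#>QQQ', 'something <#>RRR']

def accepts(pattern):
    """For each test string, report whether the pattern matches somewhere in it."""
    return [re.search(pattern, s) is not None for s in STRINGS]

for v in VARIANTS:
    print(v, accepts(v))
```

All three agree on which strings match. They differ in what is consumed: the first matches only <#>, while the other two also take the spaces that follow, which matters when the match is used in a substitution.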