RegEx improvement recommendations

Given a string like
Some text and [A~Token] and more text and [not a token] and
[another~token]
I need to extract the "tokens" for later replacement. The tokens are defined as two identifiers separated by a ~ and enclosed in [ ]. What I have been doing is using $string -match "\[.*?~.*?\]", which works. And, as I understand it I am escaping both brackets, doing any character zero or more times and forced lazy, then the ~ and then the same any character sequence. So, my first improvement was to replace .*? with .+?, as I want 1 or more, not zero or more. Then I moved to $string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]", which limits the two identifiers to alpha numerics, which is a big improvement.
So, first question is:
Is this last solution the best approach, or is there further improvements to be made?
Also, currently I only get a single token returned, so I am looping through the string, replacing tokens as they are found, and looping till there are no tokens. But, my understanding is that RegEx is greedy by default, and so I would have expected this last version to return two tokens, and I could loop through the dictionary rather than using a While loop.
So, second question is:
What am I doing wrong that I am only getting one match back? Or am I misunderstanding how greedy matching works?
EDIT:
to clarify, I am using $matches, as shown here, and still only getting a count of 1.
if ($string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]") {
Write-Host "new2: $($matches.count)"
foreach ($key in $matches.keys) {
Write-Host "$($matches.$key)"
}
}
Also, I can't really use a direct replace at the point of identifying the token, because there are a TON of potential replacements. I take the token, strip the square brackets, then split on the ~ to arrive at prefix and suffix values, which then identify a specific replacement value, which I can do with a dedicated -replace.
And one last clarification, the number of tokens is variable. It could just be one, it could be three or four. So my solution has to be pretty flexible.

To list all tokens and use the values you can use code like this:
$tokenMatches = Select-String '\[(\w+)~(\w+)\]' -InputObject $string -AllMatches | ForEach-Object { $_.Matches }
foreach ($value in $tokenMatches) {
    $fullToken = $value.Value
    $firstPart = $value.Groups[1].Value
    $secondPart = $value.Groups[2].Value
    echo "full token found: '$fullToken' first part: '$firstPart' second part: '$secondPart'"
}
Note that the parts of the regex grouped with () are capture groups; they give you access to the individual parts of your token.
In this loop you can use firstPart and secondPart to look up the value that you want to insert instead of fullToken.
As for getting only one match: -match stops at the first match it finds. Greediness only controls how much text a single quantifier consumes, not how many matches are returned, which is why -AllMatches is used above.
As for \[.*?~.*?\] not working properly: . also matches ] and [, so the pattern can start at the [ of [not a token] and run on through [another~token] until it finds a ~ and a closing ]. \[[^\]\[]*?~[^\]\[]*?\] (^ negates the class, so it reads: any character except ] and [) would also work, but it is not very readable with all the brackets; if \w is good enough, you should use it.
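The lookup step described for the loop above could also be done in a single pass with a match evaluator. This is only a sketch: the $replacements table and its values are made up for illustration, and tokens without an entry are left untouched.

```powershell
# Hypothetical lookup table: keys are "prefix~suffix", values are the replacement text.
$replacements = @{
    'A~Token'       = 'first value'
    'another~token' = 'second value'
}

$string = 'Some text and [A~Token] and more text and [not a token] and [another~token]'

# The script block is called once per match and returns the text to substitute.
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($match)
    $key = '{0}~{1}' -f $match.Groups[1].Value, $match.Groups[2].Value
    if ($replacements.ContainsKey($key)) { $replacements[$key] } else { $match.Value }
}

$result = [regex]::Replace($string, '\[(\w+)~(\w+)\]', $evaluator)
$result
```

With this approach there is no need for a While loop, and the number of tokens in the string does not matter.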

You can use \w to match a word character (letter, digit, underscore).
That results in the pattern \[\w+~\w+\].
Now you can create a regex object with that pattern:
$rgx = [Regex]::new($pattern)
and replace all occurrences of that pattern with the Replace method:
$rgx.Replace($inputstring, $replacement)
Maybe it's also worth noting that a regex object has a .Match method, which returns the first occurrence of the pattern, and a .Matches method, which returns all occurrences of the pattern.
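As a rough illustration of the difference between the two methods, using the sample string from the question (the '<token>' replacement text is just a placeholder):

```powershell
$inputstring = 'Some text and [A~Token] and more text and [not a token] and [another~token]'
$rgx = [regex]::new('\[\w+~\w+\]')

$first = $rgx.Match($inputstring)     # a single Match object: the first occurrence only
$all   = $rgx.Matches($inputstring)   # a MatchCollection with every occurrence

$first.Value                          # [A~Token]
$all.Count                            # 2
$all | ForEach-Object { $_.Value }    # [A~Token], then [another~token]

# Replace substitutes every occurrence, not just the first one.
$replaced = $rgx.Replace($inputstring, '<token>')
$replaced
```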

Taking your example line
$String = "Some text and [A~Token] and more text and [not a token] and [another~token]"
This RegEx with capture groups
$RegEx = [RegEx]"\[(\w+~\w+)\][^\[]+\[[^\]]+\][^\[]+\[(\w+~\w+)\]"
if ($string -match $RegEX){
"First token={0} Second token={1}" -f $matches[1],$matches[2]
}
returns:
First token=A~Token Second token=another~token
See the above RegEx explained on https://regex101.com/r/tp6b9e/1
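The text between the two tokens is matched by alternating negated character classes ([^\[] and [^\]]) with the literal characters [ and ]. Note that this only works when the string has exactly this layout: a token, one bracketed non-token, then another token.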

Related

Powershell Regex - Replace between string A and string B only if contains string C

I have a file which looks like this
ABC01|01
Random data here 2131233154542542542
More random data
STRING-C
A bit more random stuff
&(%+
ABC02|01
Random data here 88888888
More random data 22222
STRING-D
A bit more random stuff
&(%+
I'm trying to make a script to Find everything between ABC01 and &(%+ ONLY if it contains STRING-C
I came up with this for regex ABC([\s\S]*?)STRING-C(?s)(.*?)&\(%\+
I'm getting this content from a text file with get-content.
$bad_content = gc $bad_file -raw
I want to do something like $bad_content.replace($pattern,"") to remove the regex match.
How can I replace my matches in the file with nothing? I'm not even sure if my regex is correct but on regex101 it seems to find the strings I'm needing.
Your regex works with the sample input given, but not robustly, because if the order of blocks were reversed, it would mistakenly match across the blocks and remove both.
Tim Biegeleisen's helpful answer shows a regex that fixes the problem, via a negative lookahead assertion ((?!...)).
Let me show how to make it work from PowerShell:
To apply it, you need to use the regex-based -replace operator, not the literal-substring-based .Replace() method.[1]
To read the input string from a file, use Get-Content's -Raw switch to ensure that the file is read as a single, multi-line string; by default, Get-Content returns an array (stream) of lines, which would cause the -replace operation to be applied to each line individually.
(Get-Content -Raw file.txt) -replace '(?s)ABC01(?:(?!&\(%\+).)*?STRING-C.*?&\(%\+'
Not specifying replacement text (as the optional 2nd RHS operand to -replace) replaces the match with the empty string and therefore effectively removes what was matched.
The regex borrowed from Tim's answer is simplified a bit, by using the inline method of specifying matching options to turn on the single-line option ((?s)) at the start of the expression, which makes subsequent . instances match newlines too (a shorter and more efficient alternative to [\s\S]).
[1] See this answer for the juxtaposition of the two, including guidance on when to use which.
We can use a tempered dot trick when matching between the two markers to ensure that we don't cross the ending marker before matching STRING-C:
ABC01(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+
Here is an explanation of the regex pattern:
ABC01                        match the starting marker
(?:(?!&\(%\+)[\s\S])*?       without crossing the ending marker
STRING-C                     match the nearest STRING-C marker
[\s\S]*?                     then match all content, across lines, until reaching
&\(%\+                       the ending marker
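To see the difference between the pattern from the question and the tempered one, both can be tried on an inline sample in which the block order is reversed. The sample text below is a shortened, made-up variant of the file from the question, not real data.

```powershell
# Sample data with the block order reversed (the STRING-D block comes first).
$text = @'
ABC02|01
Random data here 88888888
STRING-D
&(%+
ABC01|01
Random data here 2131233154542542542
STRING-C
&(%+
'@

# Pattern from the question: starts at the first "ABC" and runs on until STRING-C.
$naive = $text -replace 'ABC([\s\S]*?)STRING-C(?s)(.*?)&\(%\+'

# Tempered pattern: cannot step over an ending marker before reaching STRING-C.
$fixed = $text -replace '(?s)ABC01(?:(?!&\(%\+).)*?STRING-C.*?&\(%\+'

$naive -match 'STRING-D'   # False: the unrelated block was removed as well
$fixed -match 'STRING-D'   # True:  the unrelated block survives
$fixed -match 'STRING-C'   # False: only the block containing STRING-C is gone
```

To write the cleaned text back, the result can be piped to Set-Content, as long as the file has been fully read first (which the parentheses around Get-Content -Raw ensure).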

Regex for string following a string and forward slash

I'm trying to extract the resource group name in Azure via az cli.
The full path to a certain resource group looks like this:
/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup
I'm trying to only extract "Test_ResourceGroup" out of the full string (which is stored in a variable), so i think the code would be something like
$scope = '/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup'
$resourcegroup = $scope -match 'regex'
But I'm terrible at regex. The additional challenge is that sometimes there are more strings or integers after the resource group name, e.g.
/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup/specificnameofresource/blahblah
But again, I just want the resource group name.
while a pure regex solution is likely faster, this is simpler & easier to understand/modify. well, it is for me. [grin]
the neat thing is that the final .Split() will silently fail, so it works for "end of string" and for any position in the string.
$TestList = @(
'/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup/specificnameofresource/blahblah'
'/subscriptions/b049-1234-1256-125456-125/resourceGroups/ResourceGroupZiggity'
)
foreach ($TL_Item in $TestList)
{
($TL_Item -split 'resourcegroups/')[-1].Split('/')[0]
}
output ...
Test_ResourceGroup
ResourceGroupZiggity
You can use the regex-based -replace operator:
$resourcegroup = $scope -replace '.+/resourceGroups/([^/]+).*', '$1'
.+/resourceGroups/ matches one or more (+) characters (.), followed by /resourceGroups/, i.e. everything up to and including /resourceGroups/.
[^/]+ matches one or more characters that are not (^) in the character set ([...]) comprising /, i.e. everything up to, but not including, the next /, if any.
(...) is a capturing subexpression (a so-called capture group), whose captured text can be referred to in the replacement operand (substitution text) as $1.
.* matches zero or more (*) remaining characters, i.e. whatever characters are left.
Since the regex matches the entire input string, replacing what it matched with '$1' in effect extracts just the token of interest.
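One caveat with the -replace approach is that a string that doesn't match comes back unchanged. If that matters, a -match test with the same capture group is a possible alternative; this is a sketch, and the $null fallback is just one way to handle a missing resource group.

```powershell
$scope = '/subscriptions/b049-1234-1256-125456-125/resourceGroups/Test_ResourceGroup/specificnameofresource/blahblah'

# -match only needs to find the relevant part; $Matches[1] holds the capture group.
if ($scope -match '/resourceGroups/([^/]+)') {
    $resourcegroup = $Matches[1]
} else {
    $resourcegroup = $null   # no resource group segment in the string
}

$resourcegroup
```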

How to filter unwanted parts of a PowerShell string with Regex and replace?

I am confused about the workings of PowerShell's -replace operator in regards to its use with regex. I've looked for documentation online but can't find any that goes into more detail than basic use: it looks for a string, and replaces that string with either another string (if defined) or nothing. Great.
I want to do the same thing as the person in this question where the user wants to extract a simple program name from a complex string. Here is the code that I am trying to replicate:
$string = '% O0033(SUB RAD MSD 50R III) G91G1X-6.4Z-2.F500 G3I6.4Z-8.G3I6.4 G3R3.2X6.4F500 G91G0Z5. G91G1X-10.4 G3I10.4 G3R5.2X10.4 G90G0Z2. M99 %'
$program = $string -replace '^%\sO\d{4}\((.+?)\).+$','$1'
$program
SUB RAD MSD 50R III
As you can see the output string is the string that the user wants, and everything else is filtered out. The only difference for me is that I want a string that is composed of six digits and nothing else. However when I attempt to do it on a string with my regex, I get this:
$string2 = '1_123456_1'
$program2 = $string2 -replace '(\d{6})','$1'
$program2
1_123456_1
There is no change. Why is this happening? What should my code be instead? Furthermore, what is the $1 used for in the code?
The -replace operator only replaces the part of the string that matches. A capture group matches some subset of the match (or all of it), and the capture group can be referenced in the replace string as you've seen.
Your second example only ever matches that part you want to extract. So you need to ensure that you match the whole string but only capture the part you want to keep, then make the replacement string match your capture:
$string2 = '1_123456_1'
$program2 = $string2 -replace '\d_(\d{6})_\d','$1'
$program2
How you match "the rest of the string" is up to you; it depends on what could be contained in it. So what I did above is just one possible way. Other possible patterns:
1_(\d{6})_1
[^_]*_(\d{6})_[^_]*
^.*?(\d{6}).*?$
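A quick way to check the patterns listed above against the sample string (a sketch; the $patterns and $results variable names are only for illustration):

```powershell
$string2 = '1_123456_1'

# Each pattern matches the whole string but captures only the six digits.
$patterns = @(
    '\d_(\d{6})_\d'
    '1_(\d{6})_1'
    '[^_]*_(\d{6})_[^_]*'
    '^.*?(\d{6}).*?$'
)

$results = foreach ($pattern in $patterns) {
    $string2 -replace $pattern, '$1'
}

$results   # 123456, four times
```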
Capturing groups (pairs of unescaped parentheses) in the pattern are used to allow easy access to parts of a match. When you use -replace on a string, all non-overlapping substrings are matched, and these substrings are replaced/removed.
In your case, -replace '(\d{6})', '$1' means you replace the whole match (that is equal to the first capture, since you enclosed the whole pattern with a capturing group) with itself.
Use -match in cases like yours when you want to get a part of the string:
PS> $string2 = '1_123456_1'
PS> $string2 -match '[0-9]{6}'
True
PS> $Matches[0]
123456
The -match will get you the first match, just what you want.
Use -replace when you need to get a modified string back (reformatting a string, inserting/removing chars and suchlike).

Regular expression using powershell

Here is the scenario: I have the lines shown below, and I want to extract only the middle part between the two dots.
"scvmm.new.resources" --> This after a regular expression match should return only "new"
"sc.new1.rerces" --> This after a regular expression match should return only "new1"
My basic requirement is to extract whatever is between the two dots; anything can come before and after it
(.*).<required code>.(.*)
Could anyone please help me out??
You can do that without using regex. Split the string on '.' and grab the middle element:
PS> "scvmm.new.resources".Split('.')[1]
new
Or this
'scvmm.new.resources' -replace '.*\.(.*)\..*', '$1'
Like this:
([regex]::Match("scvmm.new1.resources", '(?<=\.)([^\.]*)(?=\.)' )).value
You don't actually need regular expressions for such a trivial substring extraction. Like Shay's Split('.') one can use IndexOf() for similar effect like so,
$s = "scvmm.new.resources"
$l = $s.IndexOf(".")+1
$r = $s.IndexOf(".", $l)
$s.Substring($l, $r-$l) # Prints new
$s = "sc.new1.rerces"
$l = $s.IndexOf(".")+1
$r = $s.IndexOf(".", $l)
$s.Substring($l, $r-$l) # Prints new1
This looks for the first occurrence of a dot, then for the first occurrence of a dot after that hit, and then extracts the characters between the two positions. This is useful in, say, scenarios in which the separation characters are not the same (though the Split() way would work in many cases too).
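The approaches above agree on the sample strings, but they can differ once a string contains more than two dots. A small comparison, using a made-up input:

```powershell
$s = 'a.b.c.d'   # hypothetical input with more than two dots

$viaSplit   = $s.Split('.')[1]                                   # second element
$viaReplace = $s -replace '.*\.(.*)\..*', '$1'                   # greedy .* pushes the capture to the right
$viaLook    = [regex]::Match($s, '(?<=\.)([^\.]*)(?=\.)').Value  # first segment enclosed by dots

$viaSplit     # b
$viaReplace   # c
$viaLook      # b
```

So if the input may contain extra dots, it is worth deciding first which segment counts as "the middle".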

How to return the first five digits using Regular Expressions

How do I return the first 5 digits of a string of characters in Regular Expressions?
For example, if I have the following text as input:
15203 Main Street
Apartment 3 63110
How can I return just "15203".
I am using C#.
This isn't really the kind of problem that's ideally solved by a single-regex approach -- the regex language just isn't especially meant for it. Assuming you're writing code in a real language (and not some ill-conceived embedded use of regex), you could do perhaps (examples in perl)
# Capture all the digits into an array
my @digits = $str =~ /(\d)/g;
# Then take the first five and put them back into a string
my $first_five_digits = join "", @digits[0..4];
or
# Copy the string, removing all non-digits
(my $digits = $str) =~ tr/0-9//cd;
# And cut off all but the first five
$first_five_digits = substr $digits, 0, 5;
If for some reason you really are stuck doing a single match, and you have access to the capture buffers and a way to put them back together, then wdebeaum's suggestion works just fine, but I have a hard time imagining a situation where you can do all that, but don't have access to other language facilities :)
it would depend on your flavor of Regex and coding language (C#, PERL, etc.) but in C# you'd do something like
string rX = @"\D+";
input = Regex.Replace(input, rX, "");
return input.Substring(0, 5);
Note: I'm not sure about that Regex match (others here may have a better one), but basically since Regex itself doesn't "replace" anything, only match patterns, you'd have to look for any non-digit characters; once you'd matched that, you'd need to replace it with your language's version of the empty string (string.Empty or "" in C#), and then grab the first 5 characters of the resulting string.
You could capture each digit separately and put them together afterwards, e.g. in Perl:
$str =~ /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)/;
$digits = $1 . $2 . $3 . $4 . $5;
I don't think a regular expression is the best tool for what you want.
Regular expressions are to match patterns... the pattern you are looking for is "a(ny) digit"
Your logic external to the pattern is "five matches".
Thus, you either want to loop over the first five digit matches, or capture five digits and merge them together.
But look at that Perl example -- that's not one pattern -- it's one pattern repeated five times.
Can you do this via a regular expression? Just like parsing XML -- you probably could, but it's not the right tool.
Not sure this is best solved by regular expressions since they are used for string matching and usually not for string manipulation (in my experience).
However, you could make a call to:
strInput = Regex.Replace(strInput, @"\D+", "");
to remove all non number characters and then just return the first 5 characters.
If you are wanting just a straight regex expression which does all this for you I am not sure it exists without using the regex class in a similar way as above.
A different approach -
#copy over
$temp = $str;
#Remove non-numbers
$temp =~ s/\D//g;
#Get the first 5 numbers, exactly.
$temp =~ /(\d{5})/;
#Grab the match- ASSUMES that there will be a match.
$first_digits = $1;
result =~ s/^(\d{5}).*/$1/
This replaces a string that starts with exactly five digits (\d{5}), followed by any number of other characters (.*), with $1, i.e. whatever was captured inside the (), which here is the first five digits. It only applies when the digits are at the very start of the string.
If you want the first 5 characters of any kind:
result =~ s/^(.{5}).*/$1/
Use whatever programming language you are using to evaluate this.
ie.
regex.replace(text, "^(.{5}).*", "$1");
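For comparison with the PowerShell threads above, the "strip non-digits, then take five" idea could be sketched as follows. The question itself is about C#, so this is only an illustration of the technique; the Math.Min guard is an addition to avoid an exception when fewer than five digits are present.

```powershell
$text = "15203 Main Street`nApartment 3 63110"

# Remove every non-digit character, across all lines.
$digits = $text -replace '\D'

# Take at most the first five digits; Substring would throw on shorter input.
$firstFive = $digits.Substring(0, [Math]::Min(5, $digits.Length))

$firstFive   # 15203
```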