Say I have a regular expression like the following, but I loaded it from a file into a variable $regex, and so have no idea at design time what its contents are, but at runtime I can discover that it includes the "version1", "version2", "version3" and "version4" named groups:
"Version (?<version1>\d),(?<version2>\d),(?<version3>\d),(?<version4>\d)"
...and I have these variables:
$version1 = "3"
$version2 = "2"
$version3 = "1"
$version4 = "0"
...and I come across the following string in a file:
Version 7,7,0,0
...which is stored in a variable $input, so that ($input -match $regex) evaluates to $true.
How can I replace the named groups from $regex in the string $input with the values of $version1, $version2, $version3, $version4 if I do not know the order in which they appear in $regex (I only know that $regex includes these named groups)?
I can't find any references describing the syntax for replacing a named group with the value of a variable by using the group name as an index to the match - is this even supported?
EDIT:
To clarify - the goal is to replace templated version strings in any kind of text file where the version string in a given file requires replacement of a variable number of version fields (could be 2, 3, or all 4 fields). For example, the text in a file could look like any of these (but is not restricted to these):
#define SOME_MACRO(4, 1, 0, 0)
Version "1.2.3.4"
SomeStruct vs = { 99,99,99,99 }
Users can specify a file set and a regular expression to match the line containing the fields, with the original idea being that the individual fields would be captured by named groups. The utility has the individual version field values that should be substituted in the file, but has to preserve the original format of the line that will contain the substitutions, and substitute only the requested fields.
EDIT-2:
I think I can get the result I need with substring calculations based on the position and extent of each of the matches, but was hoping Powershell's replace operation was going to save me some work.
EDIT-3:
So, as Ansgar correctly and succinctly describes below, there isn't a way (using only the original input string, a regular expression about which you only know the named groups, and the resulting matches) to use the "-replace" operation (or other regex operations) to perform substitutions of the captures of the named groups, while leaving the rest of the original string intact. For this problem, if anybody's curious, I ended up using the solution below. YMMV, other solutions possible. Many thanks to Ansgar for his feedback and options provided.
In the following code block:
$input is a line of text on which substitution is to be performed
$regex is a regular expression (of type [string]) read from a file that has been verified to contain at least one of the supported named groups
$regexToGroupName is a hash table that maps a regex string to an array of group names ordered according to the order of the array returned by [regex]::GetGroupNames(), which matches the left-to-right order in which they appear in the expression
$groupNameToVersionNumber is a hash table that maps a group name to a version number.
Constraints on the named groups within $regex are only (I think) that the expression within the named groups cannot be nested, and should match at most once within the input string.
# This will give us the index and extent of each substring
# that we will be replacing (the parts that we will not keep)
$matchResults = ([regex]$regex).match($input)
# This will hold substrings from $input that were not captured
# by any of the supported named groups, as well as the replacement
# version strings, properly ordered, but will omit substrings captured
# by the named groups
$lineParts = #()
$startingIndex = 0
foreach ($groupName in $regexToGroupName.$regex)
{
# Excise the substring leading up to the match for this group...
$lineParts = $lineParts + $input.Substring($startingIndex, $matchResults.groups[$groupName].Index - $startingIndex)
# Instead of the matched substring, we'll use the substitution
$lineParts = $lineParts + $groupNameToVersionNumber.$groupName
# Set the starting index of the next substring that we will keep...
$startingIndex = $matchResults.groups[$groupName].Index + $matchResults.groups[$groupName].Length
}
# Keep the end of the original string (if there's anything left)
$lineParts = $lineParts + $input.Substring($startingIndex, $input.Length - $startingIndex)
$newLine = ""
foreach ($part in $lineParts)
{
$newLine = $newLine + $part
}
$input= $newLine
Simple Solution
In the scenario where you simply want to replace a version number found somewhere in your $input text, you could simply do this:
$input -replace '(Version\s+)\d+,\d+,\d+,\d+',"`$1$Version1,$Version2,$Version3,$Version4"
Using Named Captures in PowerShell
Regarding your question about named captures, that can be done by using curly brackets. i.e.
'dogcatcher' -replace '(?<pet>dog|cat)','I have a pet ${pet}. '
Gives:
I have a pet dog. I have a pet cat. cher
Issue with multiple captures & solution
You can't replace multiple values in the same replace statement, since the replacement string is used for everything. i.e. if you did this:
'dogcatcher' -replace '(?<pet>dog|cat)|(?<singer>cher)','I have a pet ${pet}. I like ${singer}''s songs. '
You'd get:
I have a pet dog. I like 's songs. I have a pet cat. I like 's songs. I have a pet . I like cher's songs.
...which is probably not what you're hoping for.
Rather, you'd have to do a match per item:
'dogcatcher' -replace '(?<pet>dog|cat)','I have a pet ${pet}. ' -replace '(?<singer>cher)', 'I like ${singer}''s songs. '
...to get:
I have a pet dog. I have a pet cat. I like cher's songs.
More Complex Solution
Bringing this back to your scenario, you're not actually using the captured values; rather you're hoping to replace the spaces they were in with new values. For that, you'd simply want this:
$input = 'I''m running Programmer''s Notepad version 2.4.2.1440, and am a big fan. I also have Chrome v 56.0.2924.87 (64-bit).'
$version1 = 1
$version2 = 3
$version3 = 5
$version4 = 7
$v1Pattern = '(?<=\bv(?:ersion)?\s+)\d+(?=\.\d+\.\d+\.\d+)'
$v2Pattern = '(?<=\bv(?:ersion)?\s+\d+\.)\d+(?=\.\d+\.\d+)'
$v3Pattern = '(?<=\bv(?:ersion)?\s+\d+\.\d+\.)\d+(?=\.\d+)'
$v4Pattern = '(?<=\bv(?:ersion)?\s+\d+\.\d+\.\d+\.)\d+'
$input -replace $v1Pattern, $version1 -replace $v2Pattern, $version2 -replace $v3Pattern,$version3 -replace $v4Pattern,$version4
Which would give:
I'm running Programmer's Notepad version 1.3.5.7, and am a big fan. I also have Chrome v 1.3.5.7 (64-bit).
NB: The above could be written as a 1 liner, but I've broken it down to make it simpler to read.
This takes advantage of regex lookarounds; a way of checking the content before and after the string you're capturing, without including those in the match. i.e. so when we select what to replace we can say "match the number that appears after the word version" without saying "replace the word version".
More info on those here: http://www.regular-expressions.info/lookaround.html
Your Example
Adapting the above to work for your example (i.e. where versions may be separated by commas or dots, and there's no consistency to their format beyond being 4 sets of numbers:
$input = #'
#define SOME_MACRO(4, 1, 0, 0)
Version "1.2.3.4"
SomeStruct vs = { 99,99,99,99 }
'#
$version1 = 1
$version2 = 3
$version3 = 5
$version4 = 7
$v1Pattern = '(?<=\b)\d+(?=\s*[\.,]\s*\d+\s*[\.,]\s*\d+\s*[\.,]\s*\d+\b)'
$v2Pattern = '(?<=\b\d+\s*[\.,]\s*)\d+(?=\s*[\.,]\s*\d+\s*[\.,]\s*\d+\b)'
$v3Pattern = '(?<=\b\d+\s*[\.,]\s*\d+\s*[\.,]\s*)\d+(?=\s*[\.,]\s*\d+\b)'
$v4Pattern = '(?<=\b\d+\s*[\.,]\s*\d+\s*[\.,]\s*\d+\s*[\.,]\s*)\d+\b'
$input -replace $v1Pattern, $version1 -replace $v2Pattern, $version2 -replace $v3Pattern,$version3 -replace $v4Pattern,$version4
Gives:
#define SOME_MACRO(1, 3, 5, 7)
Version "1.3.5.7"
SomeStruct vs = { 1,3,5,7 }
Regular expressions don't work that way, so you can't. Not directly, that is. What you can do (short of using a more appropriate regular expression that groups the parts you want to keep) is to extract the version string and then in a second step replace that substring with the new version string:
$oldver = $input -replace $regexp, '$1,$2,$3,$4'
$newver = $input -replace $oldver, "$Version1,$Version2,$Version3,$Version4"
Edit:
If you don't even know the structure, you must extract that from the regular expression as well.
$version = #($version1, $version2, $version3, $version4)
$input -match $regexp
$oldver = $regexp
$newver = $regexp
for ($i = 1; $i -le 4; $i++) {
$oldver = $oldver -replace "\(\?<version$i>\\d\)", $matches["version$i"]
$newver = $newver -replace "\(\?<version$i>\\d\)", $version[$i-1]
}
$input -replace $oldver, $newver
Related
Trying to replace a number (20 with a variable $cntr=120) in a string using replace operator. But getting stuck with $cntr in the output. Where I am doing wrong? Any better solutions please.
Input string
myurl.com/search?project=ABC&startAt=**20**&maxResults=100&expand=log
Desired Output string
myurl.com/search?project=ABC&startAt=**120**&maxResults=100&expand=log
Actual Output string
myurl.com/search?project=ABC&startAt=**$cntr**&maxResults=100&expand=log
Code:
$str='myurl.com/search?project=ABC&startAt=20&maxResults=100&expand=log'
$cntr=120
$str = $str -replace '^(.+&startAt=)(\d+)(&.+)$', '$1$cntr$3'
$str
You need to
Use double quotes to be able to use string interpolation
Use the unambiguous backreference syntax, ${n}, where n is the group ID.
In this case, you can use
PS C:\Users\admin> $str='myurl.com/search?project=ABC&startAt=20&maxResults=100&expand=log'
PS C:\Users\admin> $cntr=120
PS C:\Users\admin> $str = $str -replace '^(.+&startAt=)(\d+)(&.+)$', "`${1}$cntr`$3"
PS C:\Users\admin> $str
myurl.com/search?project=ABC&startAt=120&maxResults=100&expand=log
See the .NET regex "Substituting a Numbered Group" documentation:
All digits that follow $ are interpreted as belonging to the number group. If this is not your intent, you can substitute a named group instead. For example, you can use the replacement string ${1}1 instead of $11 to define the replacement string as the value of the first captured group along with the number "1".
A couple things here:
If you just add the "12" you end up with $112$3 which isn't what you want. What I did was appended a slash in front and then removed it on the backend, so the replace becomes $1\12$3.
$str='myurl.com/search?project=ABC&startAt=20&maxResults=100&expand=log'
$cntr=12
$str = ($str -replace '^(.+&startAt=)(\d+)(&.+)$', ('$1\' + $cntr.ToString() +'$3')).Replace("\", "")
$str
Looking to see if there's another way to add the literal "12" in the replace section with the extra character, but this does work.
Here's another way to do it where you have a literal string between the $1 and $3 and then replace that at the end.
$str='myurl.com/search?project=ABC&startAt=20&maxResults=100&expand=log'
$cntr=12
$str = ($str -replace '^(.+&startAt=)(\d+)(&.+)$', ('$1REPLACECOUNTER$3')).Replace("REPLACECOUNTER", "$cntr")
$str
I am trying to use Powershell to replace a semicolon ; with a pipe | that is in a file that is semicolon separated, so it's a specific set of semicolons that occurs between double-quotes ". Here's a sample of the file with the specific portion in bold:
Camp;Brazil;AI;BCS GRU;;MIL-32011257;172-43333640;;"1975995;1972871;1975";FAC0088/21;3;20.000;24.8;25.000;.149;GLASSES SPARE PARTS,;EXW;C;.00;EUR;
I've tried using -replace, as follows:
(Get-Content $file.PSPath) |
Foreach-Object { $_ -replace '".*(;).*"',"|" } |
However, the regex does not replace the semicolon between the quotes with a pipe. I've tried several other Regex to no avail. What would I do to accomplish this?
You can use a Regex.Replace method with a callback as the replacement argument:
$s = 'Camp;Brazil;AI;BCS GRU;;MIL-32011257;172-43333640;;"1975995;1972871;1975";FAC0088/21;3;20.000;24.8;25.000;.149;GLASSES SPARE PARTS,;EXW;C;.00;EUR;'
$rx = [regex]'"[^"]*"'
$rx.Replace($s, { param($m) $m.value.Replace(';','|') })
# => Camp;Brazil;AI;BCS GRU;;MIL-32011257;172-43333640;;"1975995|1972871|1975";FAC0088/21;3;20.000;24.8;25.000;.149;GLASSES SPARE PARTS,;EXW;C;.00;EUR;
That is, match any substring between two " chars, and replace all ; chars with | inside the matches only.
Also, here is PowerShell Core v6.1+ version where you can pass a script block as the -replace replacement operand where the match is represented as an automatic $_ variable:
(Get-Content $file.PSPath) |
Foreach-Object { $_ -replace '"[^"]*"', { $_.Value.Replace(';', '|') } }
Why not use lookarounds?
Since the left- and right-hand delimiters are identical single chars, ", any lookaround-based solution will either be erroneous or too long and still prone to errors. It would happen because lookarounds do not consume the texts they match, and each " thus could be matched separately as the initial ". Have a look at the (?<="[^"]*);(?=[^"]*") regex, where "b;c;d";1;23;"45;677777;z" turns into "b|c|d"|1|23|"45|677777|z" because the ; between 1 and 23, and 23 and " are found between two double quotation marks.
Similar problem is also with the \G-based patterns that can be used to match multiple match occurrences between two different delimiters, and that are usually not used in .NET regex as the latter supports infinite-width lookbehinds.
when I run the below regex match command, either:
'abc123' -match '(\d+)|(\w+)|(abc123)|(25)'
or
[regex]::matches('abc123', '(\d+)|(\w+)|(abc123)|(25)')
is there a way for me to extract the matching sub-pattern? In this case it would be the third capture block: 'abc123'
You can't get the exact regex part that matched your string as far as I'm aware, if you use a smart constructor for the Regex you can easily automate it though.
$ToMatch = 'abc123FOO'
$PossibleMatches = #('\d+','\w+','abc123.+','25')
$JoinOn = ')|('
$Regex = "($($PossibleMatches -join $JoinOn))"
$CaughtGroup = [Regex]::Matches($ToMatch,$Regex).Groups | ? {$_.Success -and $_.Name -ne '0'}
$CaughtIndex = [int]$CaughtGroup.Name
$CaughtMatch = $PossibleMatches[$CaughtIndex]
"Matched Group $($CaughtIndex) '$($CaughtMatch)'"
will give you
Matched Group 2 'abc123.+'
if this isn't ok for you (i.e. you have wildly varied regex etc.) you might want to do break up the program flow and try match it against an array of possible ones first?
I'm trying to replace 600 different strings in a very large text file 30Mb+. I'm current building a script that does this; following this Question:
Script:
$string = gc $filePath
$string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times and there are well over 150,000+ lines in the text file this means there’s a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = #{
'something0' = 'somethingelse0'
'something1' = 'somethingelse1'
'something2' = 'somethingelse2'
'something3' = 'somethingelse3'
'something4' = 'somethingelse4'
'something5' = 'somethingelse5'
'X:\Group_14\DACU' = '\\DACU$'
'.*[^xyz]' = 'oO{xyz}'
'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = #($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
foreach { $r.Replace( $_, $matchEval ) } |
Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/#{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
'something0' => 'somethingelse0',
'something1' => 'somethingelse1',
'something2' => 'somethingelse2',
'something3' => 'somethingelse3',
'something4' => 'somethingelse4',
'something5' => 'somethingelse5',
'X:\Group_14\DACU' => '\\DACU$',
'.*[^xyz]' => 'oO{xyz}',
'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
push #strings, qr/\Q$_\E/;
$replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', #strings;
while (<INPUT>) {
s/($pattern)/$replacements{$1}/g;
print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and create an array called #strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified #{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so #{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in powershell, but I do know how to solve it in Bash and that is by using a tool called sed. Luckily, there is also Sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere then this command will do the trick for you
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in windows. If the first command complains you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the powershell switch statement:
$string = gc $filePath
$string | % {
switch -regex ($_) {
'something0' { 'somethingelse0' }
'something1' { 'somethingelse1' }
'something2' { 'somethingelse2' }
'something3' { 'somethingelse3' }
'something4' { 'somethingelse4' }
'something5' { 'somethingelse5' }
'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
...
(600 More Lines...)
...
default { $_ }
}
} | ac "C:\log.txt"
I'm having to replace fqdn's inside a SQL dump for website migration purposes. I've written a perl filter that's supposed to take STDIN, replace the serialized strings containing the domain name that's supposed to be replaced, replace it with whatever argument is passed into the script, and output to STDOUT.
This is what I have so far:
my $search = $ARGV[0];
my $replace = $ARGV[1];
my $offset_s = length($search);
my $offset_r = length($replace);
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
while (<STDIN>) {
my #fs = split(';', $_);
foreach (#fs) {
chomp;
if (m#$regex#g) {
my ( $len, $extra, $str ) = ( $1, $2, $3 );
my $new_len = $len - $offset_s + $offset_r;
$str =~ eval { s/$search/$replace/ };
print 's:' . $new_len . ':' . $extra . $str . '\"'."\n";
}
}
}
The filter gets passed data that may look like this (this is taken from a wordpress dump, but we're also supposed to accommodate drupal dumps:
INSERT INTO `wp_2_options` VALUES (1,'siteurl','http://to.be.replaced.com/wordpress/','yes'),(125,'dashboard_widget_options','
a:2:{
s:25:\"dashboard_recent_comments\";a:1:{
s:5:\"items\";i:5;
}
s:24:\"dashboard_incoming_links\";a:2:{
s:4:\"home\";s:31:\"http://to.be.replaced.com/wordpress\";
s:4:\"link\";s:107:\"http://blogsearch.google.com/blogsearch?scoring=d&partner=wordpress&q=link:http://to.be.replaced.com/wordpress/\";
}
}
','yes'),(148,'theme_175','
a:1:{
s:13:\"courses_image\";s:37:\"http://to.be.replaced.com/files/image.png\";
}
','yes')
The regex works if I don't have any periods in my $search. I've tried escaping the periods, i.e. domain\.to\.be\.replaced, but that didn't work. I'm probably doing this either in a very roundabout way or missing something obvious. Any help would be greatly appreciated.
There is no need to evaluate (eval) your regular expression because of including variables in them. Also, to avoid the special meaning of metacharacters of those variables like $search, escape them using quotemeta() function or including the variable between \Q and \E inside the regexp. So instead of:
my $regex = eval { "s\:([0-9]+)\:\\\"(https?\://.*)($search.*)\\\"" };
Use:
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)(\Q$search\E.*)\\\"};
or
my $quoted_search = quotemeta $search;
my $regex = qr{s\:([0-9]+)\:\\\"(https?\://.*)($quoted_search.*)\\\"};
And the same advice for this line:
$str =~ eval { s/$search/$replace/ };
you have to double the escape char \ in your $search variable for the interpolated string to contain the escaped periods.
i.e. domain\.to\.be\.replaced -> domain.to.be.replaced (not wanted)
while domain\\.to\\.be\\.replaced -> domain\.to\.be\.replaced (correct).
I'm not sure your perl regex would replace the DNS in string matching several times the old DNS (in the same serialized string).
I made a gist with a script using bash, sed and one big perl regex for this same problem. You may give it a try.
The regex I use is something like that (exploded for lisibility, and having -7 as the known difference between domain names lengths):
perl -n -p -i -e '1 while s#
([;|{]s:)
([0-9]+)
:\\"
(((?!\\";).)*?)
(domain\.to\.be\.replaced)
(.*?)
\\";#"$1".($2-7).":\\\"$3new.domain.tld$6\\\";"#ge;' file
Which is maybe not the best one but at least it seems to de the job. The g option manages lines containing several serialized strings to cleanup and the while loop redo the whole job until no replacement occurs in serilized strings (for strings containing several occurences of the DNS). I'm not fan enough of regex to try a recursive one.