Related
I have a PS script that pulls some status information from a switch. The output looks like this:
8 Auto Unknown -1 Class 4 On Good 3.29 47.75 68.96
I now need to assign these strings to variables, for further processing. I'm guessing RegEx would be the best (only?) way to do it, but I don't have the first clue on how to achieve that, so any suggestion will be gratefully received.
Cheers,
B.
Since the fields seem to be separated by varying numbers of spaces, it is simplest to use the unary form of -split, the string splitting operator:
# Sample input.
$line = #'
8 Auto Unknown -1 Class 4 On Good 3.29 47.75 68.96
'#
# Split the line into an array of fields by whitespace.
$fields = -split $line
# Output the result.
$fields
If you additionally want to infer the data types of the fields, simply by seeing if they can be converted to an integer ([int]) or a floating-point value ([double]):
foreach ($i in 0..($fields.Count-1)) {
if ($newValue = $fields[$i] -as [double]) { $fields[$i] = $newValue }
if ($newValue = $fields[$i] -as [int]) { $fields[$i] = $newValue }
}
I have a long string in the format
id1:2014-08-05 11:24;Does this work?,id2:2014-08-04 13:22; Does this work,too?,id3:2014-07-25 16:56 ...
I am trying to extract the 'date' and 'comment' part out of this, based on the id, which is the input.
For example, if the input is id2, I'd want the comment as 'Does this work, too?' and date as '2014-08-04 13:22'. Here is the regex I have so far.
if($string =~ m/\b$id:(.*?);(.*,?)/){
my $date = $1;
my $comment = substr($2,0,-1); #to remove the last ,
}
Now since there is a ',' as part of the string itself, my regex treats it as a delimiter and just returns 'Does this work' as the comment, leaving out the ',too?' part.
Any help would really help as to how to handle when my string has the delimiter within itself.
I think the best way to do this is to form a hash out of the string. If you start by splitting the string on any comma that's immediately followed by some alphanumeric characters and a colon then the commas within the comments will be ignored and most of your work is done.
Then just use a regex to divide each split into three chunks: the ID, the date/time, and the comment, and put them into a hash. After that you can get the date/time for an ID as $data{id1}[0] and the comment as $data{id1}[1]
This program demonstrates
use strict;
use warnings;
my $s = 'id1:2014-08-05 11:24;Does this work?,id2:2014-08-04 13:22; Does this work,too?,id3:2014-07-25 16:56 ...';
my %data;
for (split /,(?=\w+:)/, $s) {
my #fields = /([^:]+):([^;]+);(.+)/g;
$data{$1} = [ $2, $3 ];
}
print $data{id2}[1], "\n";
output
Does this work,too?
$str = "id1:2014-08-05 11:24;Does this work?,id2:2014-08-04 13:22; Does this work,too?,id3:2014-07-25 16:56; bla";
$id = "id2";
# I need comma to set the end of the last "record"
$str = $str . ",";
if ($str =~ /$id:([\d\-\: ]+);([ \w\?\,]+)\,/) {
print "date = $1\n";
print "comment = $2\n";
}
I'm trying to replace 600 different strings in a very large text file 30Mb+. I'm current building a script that does this; following this Question:
Script:
$string = gc $filePath
$string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times and there are well over 150,000+ lines in the text file this means there’s a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = #{
'something0' = 'somethingelse0'
'something1' = 'somethingelse1'
'something2' = 'somethingelse2'
'something3' = 'somethingelse3'
'something4' = 'somethingelse4'
'something5' = 'somethingelse5'
'X:\Group_14\DACU' = '\\DACU$'
'.*[^xyz]' = 'oO{xyz}'
'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = #($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
foreach { $r.Replace( $_, $matchEval ) } |
Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/#{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
'something0' => 'somethingelse0',
'something1' => 'somethingelse1',
'something2' => 'somethingelse2',
'something3' => 'somethingelse3',
'something4' => 'somethingelse4',
'something5' => 'somethingelse5',
'X:\Group_14\DACU' => '\\DACU$',
'.*[^xyz]' => 'oO{xyz}',
'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
push #strings, qr/\Q$_\E/;
$replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', #strings;
while (<INPUT>) {
s/($pattern)/$replacements{$1}/g;
print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and create an array called #strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified #{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so #{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in powershell, but I do know how to solve it in Bash and that is by using a tool called sed. Luckily, there is also Sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere then this command will do the trick for you
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in windows. If the first command complains you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the powershell switch statement:
$string = gc $filePath
$string | % {
switch -regex ($_) {
'something0' { 'somethingelse0' }
'something1' { 'somethingelse1' }
'something2' { 'somethingelse2' }
'something3' { 'somethingelse3' }
'something4' { 'somethingelse4' }
'something5' { 'somethingelse5' }
'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
...
(600 More Lines...)
...
default { $_ }
}
} | ac "C:\log.txt"
Say I have a regular expression like the following, but I loaded it from a file into a variable $regex, and so have no idea at design time what its contents are, but at runtime I can discover that it includes the "version1", "version2", "version3" and "version4" named groups:
"Version (?<version1>\d),(?<version2>\d),(?<version3>\d),(?<version4>\d)"
...and I have these variables:
$version1 = "3"
$version2 = "2"
$version3 = "1"
$version4 = "0"
...and I come across the following string in a file:
Version 7,7,0,0
...which is stored in a variable $input, so that ($input -match $regex) evaluates to $true.
How can I replace the named groups from $regex in the string $input with the values of $version1, $version2, $version3, $version4 if I do not know the order in which they appear in $regex (I only know that $regex includes these named groups)?
I can't find any references describing the syntax for replacing a named group with the value of a variable by using the group name as an index to the match - is this even supported?
EDIT:
To clarify - the goal is to replace templated version strings in any kind of text file where the version string in a given file requires replacement of a variable number of version fields (could be 2, 3, or all 4 fields). For example, the text in a file could look like any of these (but is not restricted to these):
#define SOME_MACRO(4, 1, 0, 0)
Version "1.2.3.4"
SomeStruct vs = { 99,99,99,99 }
Users can specify a file set and a regular expression to match the line containing the fields, with the original idea being that the individual fields would be captured by named groups. The utility has the individual version field values that should be substituted in the file, but has to preserve the original format of the line that will contain the substitutions, and substitute only the requested fields.
EDIT-2:
I think I can get the result I need with substring calculations based on the position and extent of each of the matches, but was hoping Powershell's replace operation was going to save me some work.
EDIT-3:
So, as Ansgar correctly and succinctly describes below, there isn't a way (using only the original input string, a regular expression about which you only know the named groups, and the resulting matches) to use the "-replace" operation (or other regex operations) to perform substitutions of the captures of the named groups, while leaving the rest of the original string intact. For this problem, if anybody's curious, I ended up using the solution below. YMMV, other solutions possible. Many thanks to Ansgar for his feedback and options provided.
In the following code block:
$input is a line of text on which substitution is to be performed
$regex is a regular expression (of type [string]) read from a file that has been verified to contain at least one of the supported named groups
$regexToGroupName is a hash table that maps a regex string to an array of group names ordered according to the order of the array returned by [regex]::GetGroupNames(), which matches the left-to-right order in which they appear in the expression
$groupNameToVersionNumber is a hash table that maps a group name to a version number.
Constraints on the named groups within $regex are only (I think) that the expression within the named groups cannot be nested, and should match at most once within the input string.
# This will give us the index and extent of each substring
# that we will be replacing (the parts that we will not keep)
$matchResults = ([regex]$regex).match($input)
# This will hold substrings from $input that were not captured
# by any of the supported named groups, as well as the replacement
# version strings, properly ordered, but will omit substrings captured
# by the named groups
$lineParts = #()
$startingIndex = 0
foreach ($groupName in $regexToGroupName.$regex)
{
# Excise the substring leading up to the match for this group...
$lineParts = $lineParts + $input.Substring($startingIndex, $matchResults.groups[$groupName].Index - $startingIndex)
# Instead of the matched substring, we'll use the substitution
$lineParts = $lineParts + $groupNameToVersionNumber.$groupName
# Set the starting index of the next substring that we will keep...
$startingIndex = $matchResults.groups[$groupName].Index + $matchResults.groups[$groupName].Length
}
# Keep the end of the original string (if there's anything left)
$lineParts = $lineParts + $input.Substring($startingIndex, $input.Length - $startingIndex)
$newLine = ""
foreach ($part in $lineParts)
{
$newLine = $newLine + $part
}
$input= $newLine
Simple Solution
In the scenario where you simply want to replace a version number found somewhere in your $input text, you could simply do this:
$input -replace '(Version\s+)\d+,\d+,\d+,\d+',"`$1$Version1,$Version2,$Version3,$Version4"
Using Named Captures in PowerShell
Regarding your question about named captures, that can be done by using curly brackets. i.e.
'dogcatcher' -replace '(?<pet>dog|cat)','I have a pet ${pet}. '
Gives:
I have a pet dog. I have a pet cat. cher
Issue with multiple captures & solution
You can't replace multiple values in the same replace statement, since the replacement string is used for everything. i.e. if you did this:
'dogcatcher' -replace '(?<pet>dog|cat)|(?<singer>cher)','I have a pet ${pet}. I like ${singer}''s songs. '
You'd get:
I have a pet dog. I like 's songs. I have a pet cat. I like 's songs. I have a pet . I like cher's songs.
...which is probably not what you're hoping for.
Rather, you'd have to do a match per item:
'dogcatcher' -replace '(?<pet>dog|cat)','I have a pet ${pet}. ' -replace '(?<singer>cher)', 'I like ${singer}''s songs. '
...to get:
I have a pet dog. I have a pet cat. I like cher's songs.
More Complex Solution
Bringing this back to your scenario, you're not actually using the captured values; rather you're hoping to replace the spaces they were in with new values. For that, you'd simply want this:
$input = 'I''m running Programmer''s Notepad version 2.4.2.1440, and am a big fan. I also have Chrome v 56.0.2924.87 (64-bit).'
$version1 = 1
$version2 = 3
$version3 = 5
$version4 = 7
$v1Pattern = '(?<=\bv(?:ersion)?\s+)\d+(?=\.\d+\.\d+\.\d+)'
$v2Pattern = '(?<=\bv(?:ersion)?\s+\d+\.)\d+(?=\.\d+\.\d+)'
$v3Pattern = '(?<=\bv(?:ersion)?\s+\d+\.\d+\.)\d+(?=\.\d+)'
$v4Pattern = '(?<=\bv(?:ersion)?\s+\d+\.\d+\.\d+\.)\d+'
$input -replace $v1Pattern, $version1 -replace $v2Pattern, $version2 -replace $v3Pattern,$version3 -replace $v4Pattern,$version4
Which would give:
I'm running Programmer's Notepad version 1.3.5.7, and am a big fan. I also have Chrome v 1.3.5.7 (64-bit).
NB: The above could be written as a 1 liner, but I've broken it down to make it simpler to read.
This takes advantage of regex lookarounds; a way of checking the content before and after the string you're capturing, without including those in the match. i.e. so when we select what to replace we can say "match the number that appears after the word version" without saying "replace the word version".
More info on those here: http://www.regular-expressions.info/lookaround.html
Your Example
Adapting the above to work for your example (i.e. where versions may be separated by commas or dots, and there's no consistency to their format beyond being 4 sets of numbers:
$input = #'
#define SOME_MACRO(4, 1, 0, 0)
Version "1.2.3.4"
SomeStruct vs = { 99,99,99,99 }
'#
$version1 = 1
$version2 = 3
$version3 = 5
$version4 = 7
$v1Pattern = '(?<=\b)\d+(?=\s*[\.,]\s*\d+\s*[\.,]\s*\d+\s*[\.,]\s*\d+\b)'
$v2Pattern = '(?<=\b\d+\s*[\.,]\s*)\d+(?=\s*[\.,]\s*\d+\s*[\.,]\s*\d+\b)'
$v3Pattern = '(?<=\b\d+\s*[\.,]\s*\d+\s*[\.,]\s*)\d+(?=\s*[\.,]\s*\d+\b)'
$v4Pattern = '(?<=\b\d+\s*[\.,]\s*\d+\s*[\.,]\s*\d+\s*[\.,]\s*)\d+\b'
$input -replace $v1Pattern, $version1 -replace $v2Pattern, $version2 -replace $v3Pattern,$version3 -replace $v4Pattern,$version4
Gives:
#define SOME_MACRO(1, 3, 5, 7)
Version "1.3.5.7"
SomeStruct vs = { 1,3,5,7 }
Regular expressions don't work that way, so you can't. Not directly, that is. What you can do (short of using a more appropriate regular expression that groups the parts you want to keep) is to extract the version string and then in a second step replace that substring with the new version string:
$oldver = $input -replace $regexp, '$1,$2,$3,$4'
$newver = $input -replace $oldver, "$Version1,$Version2,$Version3,$Version4"
Edit:
If you don't even know the structure, you must extract that from the regular expression as well.
$version = #($version1, $version2, $version3, $version4)
$input -match $regexp
$oldver = $regexp
$newver = $regexp
for ($i = 1; $i -le 4; $i++) {
$oldver = $oldver -replace "\(\?<version$i>\\d\)", $matches["version$i"]
$newver = $newver -replace "\(\?<version$i>\\d\)", $version[$i-1]
}
$input -replace $oldver, $newver
I've got some data that I'm parsing in Perl, and will be adding more and more differently formatted data in the near future. What I would like to do is write an easy-to-use function, that I could pass a string and a regex to, and it would return anything in parentheses. It would work something like this (pseudocode):
sub parse {
$data = shift;
$regex = shift;
$data =~ eval ("m/$regex/")
foreach $x ($1...$n)
{
push (#ra, $x);
}
return \#ra;
}
Then, I could call it like this:
#subs = parse ($data, '^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');
As you can see, there's a couple of issues with this code. I don't know if the eval would work, the 'foreach' definitely wouldn't work, and without knowing how many parentheses there are, I don't know how many times to loop.
This is too complicated for split, so if there's another function or possibility that I'm overlooking, let me know.
Thanks for your help!
In list context, a regular expression will return a list of all the parenthesized matches.
So all you have to do is:
my #matches = $string =~ /regex (with) (parens)/;
And assuming that it matched, #matches will be an array of the two capturing groups.
So using your regex:
my #subs = $data =~ /^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/;
Also, when you have long regexes, Perl has the x modifier, which goes after the closing regex delimiter. The x modifier allows you to put white-space and newlines inside the regex for increased readability.
If you are worried about the capturing groups that might be zero length, you can pass the matches through #subs = grep {length} #subs to filter them out.
Then, I could call it like this:
#subs = parse($data,
'^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');
Instead, call it like:
parse($data,
qr/^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/);
Further, your task would be made simpler if you can use named captures (i.e. Perl 5.10 and later). Here is an example:
#!/usr/bin/perl
use strict; use warnings;
my %re = (
id => '(?<id> [0-9]+ )',
name => '(?<name> \w+ )',
value => '(?<value> [0-9]+ )',
);
my #this = (
'123,one:12',
'456,two:21',
);
my #that = (
'one:[12],123',
'two:[21],456',
);
my $this_re = qr/$re{id} , $re{name} : $re{value}/x;
my $that_re = qr/$re{name} : \[$re{value}\] , $re{id} /x;
use YAML;
for my $d ( #this ) {
print Dump [ parse($d, $this_re) ];
}
for my $d ( #that ) {
print Dump [ parse($d, $that_re) ];
}
sub parse {
my ($d, $re) = #_;
return unless $d =~ $re;
return my #result = #+{qw(id name value)};
}
Output:
---
- 123
- one
- 12
---
- 456
- two
- 21
---
- 123
- one
- 12
---
- 456
- two
- 21
You are trying to parse a complex expression with a regex - which is an insufficient tool for the job. Recall that regular expressions cannot parse higher grammars. For intuition, any expression which might be nested cannot be parsed with regex.
When you want to find text inside of pairs of parenthesis, you want to use Text::Balanced.
But, that is not what you want to do, so it will not help you.