PowerShell regex does not match near newline - regex

I have an exe output in form
Compression : CCITT Group 4
Width : 3180
and try to extract CCITT Group 4 to $var with PowerShell script
$var = [regex]::match($exeoutput,'Compression\s+:\s+([\w\s]+)(?=\n)').Groups[1].Value
The http://regexstorm.net/tester say, the regexp Compression\s+:\s+([\w\s]+)(?=\n) is correct but not PowerShell. PowerShell does not match. How can I write the regexp correctly?

You want to get all text from some specific pattern till the end of the line. So, you do not even need the lookahead (?=\n), just use .+, because . matches any char but a newline (LF) char:
$var = [regex]::match($exeoutput,'Compression\s+:\s+(.+)').Groups[1].Value
Or, you may use a -match operator and after the match is found access the captured value using $matches[1]:
$exeoutput -match 'Compression\s*:\s*(.+)'
$var = $matches[1]

Wiktor Stribiżew's helpful answer simplifies your regex and shows you how to use PowerShell's -match operator as an alternative.
Your follow-up comment about piping to Out-String fixing your problem implies that your problem was that $exeOutput contained an array of lines rather than a single, multiline string.
This is indeed what happens when you capture the output from a call to an external program (*.exe): PowerShell captures the stdout output lines as an array of strings (the lines without their trailing newline).
As an alternative to converting array $exeOutput to a single, multiline string with Out-String (which, incidentally, is slow[1]), you can use a switch statement to operate on the array directly:
# Stores 'CCITT Group 4' in $var
$var = switch -regex ($exeOutput) { 'Compression\s+:\s+(.+)' { $Matches[1]; break } }
Alternatively, given the specific format of the lines in $exeOutput, you could leverage the ConvertFrom-StringData cmdlet, which can perform parsing the lines into key-value pairs for you, after having replaced the : separator with =:
$var = ($exeoutput -replace ':', '=' | ConvertFrom-StringData).Compression
[1] Use of a cmdlet is generally slower than using an expression; with a string array $array as input, you can achieve what $array | Out-String does more efficiently with $array -join "`n", though note that Out-String also appends a trailing newline.

Related

Regular Expressions in powershell split

I need to strip out a UNC fqdn name down to just the name or IP depending on the input.
My examples would be
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
I want to end up with just tom or 123.43.234.23
I have the following code in my array which is striping out the domain name perfect, but Im still left with \\tom
-Split '\.(?!\d)')[0]
Your regex succeeds in splitting off the tokens of interest in principle, but it doesn't account for the leading \\ in the input strings.
You can use regex alternation (|) to include the leading \\ at the start as an additional -split separator.
Given that matching a separator at the very start of the input creates an empty element with index 0, you then need to access index 1 to get the substring of interest.
In short: The regex passed to -split should be '^\\\\|\.(?!\d)' instead of '\.(?!\d)', and the index used to access the resulting array should be [1] instead of [0]:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '^\\\\|\.(?!\d)')[1] }
The above yields:
tom
123.43.234.23
Alternatively, you could remove the leading \\ in a separate step, using -replace:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '\.(?!\d)')[0] -replace '^\\\\' }
Yet another alternative is to use a single -replace operation, which does not require a ForEach-Object call (doesn't require explicit iteration):
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' -replace
'?(x) ^\\\\ (.+?) \.\D .+', '$1'
Inline option (?x) (IgnoreWhiteSpace) allows you to make regexes more readable with insignificant whitespace: any unescaped whitespace can be used for visual formatting.
^\\\\ matches the \\ (escaped with \) at the start (^) of each string.
(.+?) matches one or more characters lazily.
\.\D matches a literal . followed by something other than a digit (\d matches a digit, \D is the negation of that).
.+ matches one or more remaining characters, i.e., the rest of the input.
$1 as the replacement operand refers to what the 1st capture group ((...)) in the regex matched, and, given that the regex was designed to consume the entire string, replaces it with just that.
I'm stealing Lee_Daileys $InSTuff
but appending a RegEx I used recently
$InStuff = -split #'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'#
$InStuff |ForEach-Object {($_.Trim('\\') -split '\.(?!\d{1,3}(\.|$))')[0]}
Sample Output:
tom
123.43.234.23
As you can see here on RegEx101 the dots between the numbers are not matched
The Select-String function uses regex and populates a MatchInfo object with the matches (which can then be queried).
The regex "(\.?\d+)+|\w+" works for your particular example.
"\\tom.overflow.corp.com", "\\123.43.234.23.overflow.corp.com" |
Select-String "(\.?\d+)+|\w+" | % { $_.Matches.Value }
while this is NOT regex, it does work. [grin] i suspect that if you have a really large number of such items, then you will want a regex. they do tend to be faster than simple text operators.
this will get rid of the leading \\ and then replace the domain name with .
# fake reading in a text file
# in real life, use Get-Content
$InStuff = -split #'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'#
$DomainName = '.overflow.corp.com'
$InStuff.ForEach({
$_.TrimStart('\\').Replace($DomainName, '')
})
output ...
tom
123.43.234.23

Using powershell to search for a pattern

I am trying to write a powershell script to search for a pattern in a text file. Specifically I am looking at reading a file line by line and returning any line that has a space at the 32nd character position.
I have this so far but it just returns all lines that have white space. I need to narrow it down to the 32nd position
Get-Content -path C:\VM_names.txt | Where-Object {$_ -match "\s+"}
Use this pattern:
-match '^.{31} '
Explanation:
^ - beginning of the string
. - any character
{31} - repeated 31 times
- a space
This is actually really easy to do. By default, Get-Content reads a text file as an array of strings (individual lines), unless you use the -Raw parameter, which reads it as a single string. You can use the -match PowerShell operator to "match" the lines that meet your regular expression.
(Get-Content -Path c:\VM_names.txt) -match '^.{31}\s'
The result of the above command is an array of lines that match the desired regular expression.
NOTE: The call to Get-Content must be wrapped in parentheses, otherwise the PowerShell parser will think that -match is a parameter on that command.
NOTE2: As a good practice, use single quotes around strings, unless you specifically know that you need double quotes. You'll save yourself from accidental interpolation.

Trim More than 20 Characters

I am working on a script that will generate AD usernames based off of a csv file. Right now I have the following line working.
Select-Object #{n=’Username’;e={$_.FirstName.ToLower() + $_.LastName.ToLower() -replace "[^a-zA-Z]" }}
As of right now this takes the name and combines it into a AD friendly name. However I need to name to be shorted to no more than 20 characters. I have tried a few different methods to shorten the username but I haven't had any luck.
Any ideas on how I can get the username shorted?
Probably the most elegant approach is to use a positive lookbehind in your replacement:
... -replace '(?<=^.{20}).*'
This expression matches the remainder of the string only if it is preceded by 20 characters at the beginning of the string (^.{20}).
Another option would be a replacement with a capturing group on the first 20 characters:
... -replace '^(.{20}).*', '$1'
This captures at most 20 characters at the beginning of the string and replaces the whole string with just the captured group ($1).
$str[0..19] -join ''
e.g.
PS C:\> 'ab'[0..19]
ab
PS C:\> 'abcdefghijklmnopqrstuvwxyz'[0..19] -join ''
abcdefghijklmnopqrst
Which I would try in your line as:
Select-Object #{n=’Username’;e={(($_.FirstName + $_.LastName) -replace "[^a-z]").ToLower()[0..19] -join '' }}
([a-z] because PowerShell regex matches are case in-senstive, and moving .ToLower() so you only need to call it once).
And if you are using Strict-Mode, then why not check the length to avoid going outside the bounds of the array with the delightful:
$str[0..[math]::Min($str.Length, 19)] -join ''
To truncate a string in PowerShell, you can use the .NET String::Substring method. The following line will return the first $targetLength characters of $str, or the whole string if $str is shorter than that.
if ($str.Length -gt $targetLength) { $str.Substring(0, $targetLength) } else { $str }
If you prefer a regex solution, the following works (thanks to #PetSerAl)
$str -replace "(?<=.{$targetLength}).*"
A quick measurement shows the regex method to be about 70% slower than the substring method (942ms versus 557ms on a 200,000 line logfile)

PowerShell multiple string replacement efficiency

I'm trying to replace 600 different strings in a very large text file 30Mb+. I'm current building a script that does this; following this Question:
Script:
$string = gc $filePath
$string | % {
$_ -replace 'something0','somethingelse0' `
-replace 'something1','somethingelse1' `
-replace 'something2','somethingelse2' `
-replace 'something3','somethingelse3' `
-replace 'something4','somethingelse4' `
-replace 'something5','somethingelse5' `
...
(600 More Lines...)
...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times and there are well over 150,000+ lines in the text file this means there’s a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = #{
'something0' = 'somethingelse0'
'something1' = 'somethingelse1'
'something2' = 'somethingelse2'
'something3' = 'somethingelse3'
'something4' = 'somethingelse4'
'something5' = 'somethingelse5'
'X:\Group_14\DACU' = '\\DACU$'
'.*[^xyz]' = 'oO{xyz}'
'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = #($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
foreach { $r.Replace( $_, $matchEval ) } |
Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/#{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
'something0' => 'somethingelse0',
'something1' => 'somethingelse1',
'something2' => 'somethingelse2',
'something3' => 'somethingelse3',
'something4' => 'somethingelse4',
'something5' => 'somethingelse5',
'X:\Group_14\DACU' => '\\DACU$',
'.*[^xyz]' => 'oO{xyz}',
'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
push #strings, qr/\Q$_\E/;
$replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', #strings;
while (<INPUT>) {
s/($pattern)/$replacements{$1}/g;
print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and create an array called #strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified #{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so #{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in powershell, but I do know how to solve it in Bash and that is by using a tool called sed. Luckily, there is also Sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere then this command will do the trick for you
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in windows. If the first command complains you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the powershell switch statement:
$string = gc $filePath
$string | % {
switch -regex ($_) {
'something0' { 'somethingelse0' }
'something1' { 'somethingelse1' }
'something2' { 'somethingelse2' }
'something3' { 'somethingelse3' }
'something4' { 'somethingelse4' }
'something5' { 'somethingelse5' }
'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
...
(600 More Lines...)
...
default { $_ }
}
} | ac "C:\log.txt"

Split string on character, except that doubled up character is an escape

Input:
$string = "this-is-just-an--example"
Output:
this
is
just
an-example
Tried various things centered around Regex.Split and "-[^-]" or "-([^])".
Example of things that did not work:
[regex]::Split( $string, "-[^-]" )
[regex]::Split( $string, "-([^-])" )
Of course I can use the String.Split and iterate, and realize that empty string means I ran into escaped character... But it's ugly code.
P.S. tried searching for duped for a few minutes, didn't find any.
Use the lookahead and lookbehind assertions, and then do a replace to eliminate the remaining doubled characters:
$string = "this-is-just-an--example"
$string -split '(?<!-)-(?!-)' -replace '--','-'
this
is
just
an-example
Replacing the escape sequence with another value first eliminates the need for a complex split condition. All that is needed is a unique replacement value that does not already appear in the string.
For example, using # as the replacement value for $string = 'this-is-just-an--example', the following line will get the desired result:
$string -replace '--','#' -split '-' -replace '#','-'
-replace '--','#' eliminates the escape sequence (giving this-is-just-an#example),
-split '-' then separates the result (giving an array containing this, is, just, and an#example),
and finally -replace '#','-' restores the escaped value (giving this, is, just, an-example).
Both -split and -replace are built-in PowerShell operators that work on strings using regular expressions (equivalent to the Regex.Split and Regex.Replace methods in .NET).