Powershell regex to remove everything except key - regex

Given a string like
'The fingerprint is: ABCDEFGHIJKLMNOPQRSTUVWXYZ12345678910111'
How would you remove all text that isn't a 40 character string consisting of A-Z 0-9 ?
Currently I'm looking for the string 'The fingerprint is: ' and removing it, but I feel it would be safer to look for a 40 character alphanumeric.
$foo = $foo -replace 'The fingerprint is: ',''
I expect something like this to work, but no luck.
$foo = $foo -creplace '^[A-Z0-9]{40}',''
I've also tried just looking for the characters that match
$foo = $foo -match '[A-Z0-9]{40}'

Depends a bit, but if it's 40 contiguous and it's the only 40 character string you could use replace:
"The fingerprint is: ABCDEFGHIJKLMNOPQRSTUVWXYZ12345678910111" -replace '.*([A-Z0-9]{40}).*', '$1'
Note: The replacement, $1, is a reference to the match group. It is not a PowerShell variable and is deliberately written in single quotes so it will not expand.

To match the 40 character alphanumeric with no replacement, this
$foo = 'The fingerprint is: ABCDEFGHIJKLMNOPQRSTUVWXYZ12345678910111';
$foo -Match '[A-Z0-9]{40}' | Out-Null;
Write-Output $matches[0];
prints
ABCDEFGHIJKLMNOPQRSTUVWXYZ12345678910111
[A-Z0-9] is a bracket expression matching any of the contained characters (- denotes a range of values)
{40} matches the previous element 40 times
Out-Null suppresses the boolean return value of the -match operator

Related

PowerShell regex does not match near newline

I have an exe output in form
Compression : CCITT Group 4
Width : 3180
and try to extract CCITT Group 4 to $var with PowerShell script
$var = [regex]::match($exeoutput,'Compression\s+:\s+([\w\s]+)(?=\n)').Groups[1].Value
The http://regexstorm.net/tester say, the regexp Compression\s+:\s+([\w\s]+)(?=\n) is correct but not PowerShell. PowerShell does not match. How can I write the regexp correctly?
You want to get all text from some specific pattern till the end of the line. So, you do not even need the lookahead (?=\n), just use .+, because . matches any char but a newline (LF) char:
$var = [regex]::match($exeoutput,'Compression\s+:\s+(.+)').Groups[1].Value
Or, you may use a -match operator and after the match is found access the captured value using $matches[1]:
$exeoutput -match 'Compression\s*:\s*(.+)'
$var = $matches[1]
Wiktor Stribiżew's helpful answer simplifies your regex and shows you how to use PowerShell's -match operator as an alternative.
Your follow-up comment about piping to Out-String fixing your problem implies that your problem was that $exeOutput contained an array of lines rather than a single, multiline string.
This is indeed what happens when you capture the output from a call to an external program (*.exe): PowerShell captures the stdout output lines as an array of strings (the lines without their trailing newline).
As an alternative to converting array $exeOutput to a single, multiline string with Out-String (which, incidentally, is slow[1]), you can use a switch statement to operate on the array directly:
# Stores 'CCITT Group 4' in $var
$var = switch -regex ($exeOutput) { 'Compression\s+:\s+(.+)' { $Matches[1]; break } }
Alternatively, given the specific format of the lines in $exeOutput, you could leverage the ConvertFrom-StringData cmdlet, which can perform parsing the lines into key-value pairs for you, after having replaced the : separator with =:
$var = ($exeoutput -replace ':', '=' | ConvertFrom-StringData).Compression
[1] Use of a cmdlet is generally slower than using an expression; with a string array $array as input, you can achieve what $array | Out-String does more efficiently with $array -join "`n", though note that Out-String also appends a trailing newline.

Trim More than 20 Characters

I am working on a script that will generate AD usernames based off of a csv file. Right now I have the following line working.
Select-Object #{n=’Username’;e={$_.FirstName.ToLower() + $_.LastName.ToLower() -replace "[^a-zA-Z]" }}
As of right now this takes the name and combines it into a AD friendly name. However I need to name to be shorted to no more than 20 characters. I have tried a few different methods to shorten the username but I haven't had any luck.
Any ideas on how I can get the username shorted?
Probably the most elegant approach is to use a positive lookbehind in your replacement:
... -replace '(?<=^.{20}).*'
This expression matches the remainder of the string only if it is preceded by 20 characters at the beginning of the string (^.{20}).
Another option would be a replacement with a capturing group on the first 20 characters:
... -replace '^(.{20}).*', '$1'
This captures at most 20 characters at the beginning of the string and replaces the whole string with just the captured group ($1).
$str[0..19] -join ''
e.g.
PS C:\> 'ab'[0..19]
ab
PS C:\> 'abcdefghijklmnopqrstuvwxyz'[0..19] -join ''
abcdefghijklmnopqrst
Which I would try in your line as:
Select-Object #{n=’Username’;e={(($_.FirstName + $_.LastName) -replace "[^a-z]").ToLower()[0..19] -join '' }}
([a-z] because PowerShell regex matches are case in-senstive, and moving .ToLower() so you only need to call it once).
And if you are using Strict-Mode, then why not check the length to avoid going outside the bounds of the array with the delightful:
$str[0..[math]::Min($str.Length, 19)] -join ''
To truncate a string in PowerShell, you can use the .NET String::Substring method. The following line will return the first $targetLength characters of $str, or the whole string if $str is shorter than that.
if ($str.Length -gt $targetLength) { $str.Substring(0, $targetLength) } else { $str }
If you prefer a regex solution, the following works (thanks to #PetSerAl)
$str -replace "(?<=.{$targetLength}).*"
A quick measurement shows the regex method to be about 70% slower than the substring method (942ms versus 557ms on a 200,000 line logfile)

Regex greedyness REasking

I have this text $line = "config.txt.1", and I want to match it with regex and extract the number
part of it. I am using two versions:
$line = "config.txt.1";
(my $result) = $line =~ /(\d*).*/; #ver 1, matched, but returns nothing
(my $result) = $line =~ /(\d).*/; #ver 2, matched, returns 1
(my $result) = $line =~ /(\d+).*/; #ver 3, matched, returns 1
I think the * was sort of messing things around, I have been looking at this, but still
don't the greedy mechanism in the regex engine. If I start from left of the regex, and potentially there might be no digits in the text, so for ver 1, it will match too. But for
ver 3, it won't match. Can someone give me an explanation for why it is that and how
I should write for what I want? (potentially with a number, not necessarily single digit)
Edit
Requirement: potentially with a number, not necessarily single digit, and match can not capture anything, but should not fail
The output must be as follows (for the above example):
config.txt 1
The regex /(\d*).*/ always matches immediately, because it can match zero characters. It translates to match as many digits at this position as possible (zero or more). Then, match as many non-newline characters as possible. Well, the match starts looking at the c of config. Ok, it matches zero digits.
You probably want to use a regex like /\.(\d+)$/ -- this matches an integer number between a period . and the end of string.
Use the literal '.' as a reference to match before the number:
#!/usr/bin/perl
use strict;
use warnings;
my #line = qw(config.txt file.txt config.txt.1 config.foo.2 config.txt.23 differentname.fsdfsdsdfasd.2444);
my (#capture1, #capture2);
foreach (#line){
my (#filematch) = ($_ =~ /(\w+\.\w+)/);
my (#numbermatch) = ($_ =~ /\w+\.\w+\.?(\d*)/);
my $numbermatch = $numbermatch[0] // $numbermatch[1];
push #capture1, #filematch;
push #capture2, #numbermatch;
}
print "$capture1[$_]\t$capture2[$_]\n" for 0 .. $#capture1;
Output:
config.txt
file.txt
config.txt 1
config.foo 2
config.txt 23
differentname.fsdfsdsdfasd 2444
Thanks guys, I think I figured out myself what I want:
my ($match) = $line =~ /\.(\d+)?/; #this will match and capture any digit
#number if there was one, and not fail
#if there wasn't one
To capture all digits following a final . and not fail the match if the string doesn't end with digits, use /(?:\.(\d+))?$/
perl -E 'if ("abc.123" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched 123
perl -E 'if ("abc" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched
You do not need .* at all. These two statements assign the exact same number:
my ($match1) = $str =~ /(\d+).*/;
my ($match1) = $str =~ /(\d+)/;
A regex by default matches partially, you do not need to add wildcards.
The reason your first match does not capture a number is because * can match zero times as well. And since it does not have to match your number, it does not. Which is why .* is actually detrimental in that regex. Unless something is truly optional, you should use + instead.

Trying to match this using regular expressions in PowerShell

I am trying to use regular expressions to match certain lines in a file, but I am having some trouble.
The file contains text like this:
Mario, 123456789
Luigi, 234-567-890
Nancy, 345 5666 77533
Bowser, 348759823745908732589
Peach, 534785
Daisy, 123-456-7890
I'm trying to match just the numbers as either XXX-XXX-XXX or XXX XXX XXX pattern.
I've tried a few different ways, but it always expects something I don't want it to or it tell me everything is false.
I'm using PowerShell to do this.
At first I tried:
{$match = $i -match "\d{3}\-\d{3}\-\d{3}|\d{3}\ \d{3}\ \d{3}"
Write-Host $match}
But when I do that it matches the long strong of numbers and XXX-XXX-XXXXX.
I read something saying that n would match the exact quantity, so I tried that...
{$match = $i -match "\d{n3}\-\d{n3}\-\d{n3}|\d{n3}\ \d{n3}\ \{n3}"
Write-Host $match}
That made everything false...
So I tried
{$match = $i -match "\d\n{3}\-\d\n{3}\-\d\n{3}|\d\n{3}\ \d\n{3}\ \d\n{3}"
I also tried the lazy quantifier, ?:
{$match = $i -match "\d{3?}\-\d{3?}\-\d{3?}|\d{3?}\ \{3?}\ \{3?}"
Write-Host $match}
Still false...
The final thing I tried was this...
{$match = $i -match "\d[0-9\{3\}\-\d[0-9]\{3\}\-\d[0-9]{3\}|\d[0-9]\{3\}\ \d[0-9]\{3}\ \d[0-9]\{3\}"<br>
Write-Host $match}
Still no luck...
The following pattern gives two matches:
Get-Content .\test.txt | Where-Object {$_ -match '\d{3}[-|\s]\d{3}[-|\s]\d{3}'}
Luigi, 234-567-890
Daisy,
123-456-7890
If you want to exclude the last match, add the '$' anchor (represents the end of the string:
Get-Content .\test.txt | Where-Object {$_ -match '\d{3}[-|\s]\d{3}[-|\s]\d{3}$'}
Luigi, 234-567-890
If you want to be very specific and match lines from start to end (use the ^ anchor, denotes the start of the string):
Get-Content .\test.txt | Where-Object {$_ -match '^\w+,\s+\d{3}[-|\s]\d{3}[-|\s]\d{3}$'}
Luigi, 234-567-890
Your first answer is the closest. The {3} matches exactly 3 characters. I think the n you saw was supposed to represent any number, not an actual n character. The reason it matches the long strings is that you only specified that the match must find 3 digits, dash or space, 3 digits, dash or space, then 3 more digits. You did not specify that it doesn't count if there are more digits after that.
To not match when there is a number after, you can use a negative lookahead.
(\d{3}-\d{3}-\d{3}|\d{3}\ \d{3}\ \d{3})(?!\d)
Alternatively, if you want to only match at the end of the line, possibly with trailing space
(\d{3}-\d{3}-\d{3}|\d{3}\ \d{3}\ \d{3})\s*$
As Gideon said, your first is the best place to start.
"\b\d{3}\-\d{3}\-\d{3}\b|\b\d{3}\ \d{3}\ \d{3}\b"
The \b special character added before and after each statement is a word boundary - basically a space or newline or punctuation like a period or comma. This ensures that 9999 doesn't match, but 999. does.
Try this:
/(\d+[- ])+\d+/
It's better not to have so rigid regular expressions, unless you are absolutely sure there that your input will not change.
So this regex matches at least a digit, then greedily searches for more digits followed by a space or a dash. This is also repeated as much as possible then followed by at least another digit.
When manipulating data in PowerShell, it usually is a good idea to create objects representing the data (after all, PowerShell is all about objects). Filtering based on object properties is usually easier and more robust. Your problem is a good example.
Here is what we are after:
the persons: $persons
where: where
the number of that person: $_.number
matches: -match
the pattern
starting with three digits: ^\d{3}
followed by three digits between dashes or spaces: (-\d{3}-|\ \d{3}\ )
ending on three digits: \d{3}$
Below is the entire script:
$persons = import-csv -Header "name", "number" -delimiter "," data.csv
$persons | where {$_.number -match "^\d{3}(\-\d{3}\-|\ \d{3}\ )\d{3}$"}
You can also use Select-String:
Select-String '(\d{3}[ -]){2}\d{3}$' .\file.txt | % {$_.Line}

Regular expression to match any character being repeated more than 10 times

I'm looking for a simple regular expression to match the same character being repeated more than 10 or so times. So for example, if I have a document littered with horizontal lines:
=================================================
It will match the line of = characters because it is repeated more than 10 times. Note that I'd like this to work for any character.
The regex you need is /(.)\1{9,}/.
Test:
#!perl
use warnings;
use strict;
my $regex = qr/(.)\1{9,}/;
print "NO" if "abcdefghijklmno" =~ $regex;
print "YES" if "------------------------" =~ $regex;
print "YES" if "========================" =~ $regex;
Here the \1 is called a backreference. It references what is captured by the dot . between the brackets (.) and then the {9,} asks for nine or more of the same character. Thus this matches ten or more of any single character.
Although the above test script is in Perl, this is very standard regex syntax and should work in any language. In some variants you might need to use more backslashes, e.g. Emacs would make you write \(.\)\1\{9,\} here.
If a whole string should consist of 9 or more identical characters, add anchors around the pattern:
my $regex = qr/^(.)\1{9,}$/;
In Python you can use (.)\1{9,}
(.) makes group from one char (any char)
\1{9,} matches nine or more characters from 1st group
example:
txt = """1. aaaaaaaaaaaaaaa
2. bb
3. cccccccccccccccccccc
4. dd
5. eeeeeeeeeeee"""
rx = re.compile(r'(.)\1{9,}')
lines = txt.split('\n')
for line in lines:
rxx = rx.search(line)
if rxx:
print line
Output:
1. aaaaaaaaaaaaaaa
3. cccccccccccccccccccc
5. eeeeeeeeeeee
. matches any character. Used in conjunction with the curly braces already mentioned:
$: cat > test
========
============================
oo
ooooooooooooooooooooooo
$: grep -E '(.)\1{10}' test
============================
ooooooooooooooooooooooo
={10,}
matches = that is repeated 10 or more times.
use the {10,} operator:
$: cat > testre
============================
==
==============
$: grep -E '={10,}' testre
============================
==============
You can also use PowerShell to quickly replace words or character reptitions. PowerShell is for Windows. Current version is 3.0.
$oldfile = "$env:windir\WindowsUpdate.log"
$newfile = "$env:temp\newfile.txt"
$text = (Get-Content -Path $oldfile -ReadCount 0) -join "`n"
$text -replace '/(.)\1{9,}/', ' ' | Set-Content -Path $newfile
PHP's preg_replace example:
$str = "motttherbb fffaaattther";
$str = preg_replace("/([a-z])\\1/", "", $str);
echo $str;
Here [a-z] hits the character, () then allows it to be used with \\1 backreference which tries to match another same character (note this is targetting 2 consecutive characters already), thus:
mother father
If you did:
$str = preg_replace("/([a-z])\\1{2}/", "", $str);
that would be erasing 3 consecutive repeated characters, outputting:
moherbb her
A slightly more generic powershell example. In powershell 7, the match is highlighted including the last space (can you highlight in stack?).
'a b c d e f ' | select-string '([a-f] ){6,}'
a b c d e f