Select-String pattern finds only partial string match of -cmatch - regex

I am trying to put together a string replacement routine. I have got as far as isolating the substring matches for two strings stored in an array of strings, $lines. Except there is a problem:
[string[]]$lines = "160 FROG Kermit 164 Big Bird_Road, Wellsville Singer","161 PIGGY Miss Pretty 1640 Really Long Main_Road, Whathellville Prima Donna"
# match string from last number to comma
foreach ($line in $lines) {
    if ($line -cmatch '\d\s\w[a-z]*\s.*,') {
        Write-Host "Found match!"
        $line | Select-String -Pattern '\d\s\w[a-z]*\s.' -AllMatches |
            ForEach-Object {
                $x = $_.Matches[1].Value
                Write-Host "x is:" $x
            }
    }
}
The first regex, in $line -cmatch '\d\s\w[a-z]*\s.*,', is correct according to testing in Expresso. I want the address part of the string, from the last street number to the comma. I am looking to replace the spaces in the street base name with underscores, e.g. Big Bird_Road with Big_Bird_Road and Really Long Main_Road with Really_Long_Main_Road.
The problem is that the second regex, in $line | Select-String -Pattern '\d\s\w[a-z]*\s.' -AllMatches |, cannot be completed. As it stands, the output is:
Found match!
x is: 4 Big B
Found match!
x is: 0 Really L
The substring has not been captured yet! And if I add the remaining *, I get no output at all for x is:
Why doesn't the first regex (used with -cmatch) work in the same way when used as a Select-String pattern?

If you want to do a replacement for that format in the strings, you can use -replace with a pattern that matches only the spaces to be replaced with an underscore:
(?<=\d\s+\w[a-zA-Z\s_]*)\s(?=[^\d,]*,)
Explanation
(?<= Positive lookbehind to assert what to the left is
\d\s+\w[a-zA-Z\s_]* Match a digit, 1+ whitespace chars, a word char and optionally repeat the listed characters in the character class
) Close the lookbehind
\s Match a whitespace char (or \s+ to match 1 or more)
(?=[^\d,]*,) Assert a comma to the right after matching optional chars other than a digit or comma
[string[]]$lines = "160 FROG Kermit 164 Big Bird_Road, Wellsville Singer","161 PIGGY Miss Pretty 1640 Really Long Main_Road, Whathellville Prima Donna"
foreach ($line in $lines) {
$line -replace "(?<=\d\s+\w[a-zA-Z\s_]*)\s(?=[^\d,]*,)","_"
}
Output
160 FROG Kermit 164 Big_Bird_Road, Wellsville Singer
161 PIGGY Miss Pretty 1640 Really_Long_Main_Road, Whathellville Prima Donna
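On the original question of why the same pattern behaves differently in the two places, one likely explanation (not stated in the answer above) is case sensitivity: -cmatch is case-sensitive, while Select-String is case-insensitive unless -CaseSensitive is given. Case-insensitively, [a-z]* also matches FROG, so the match starts at the first number and the greedy .* leaves only a single match, which makes Matches[1] empty. A minimal sketch using the first sample line (the variable names are illustrative only):

```powershell
$line    = '160 FROG Kermit 164 Big Bird_Road, Wellsville Singer'
$pattern = '\d\s\w[a-z]*\s.*,'

# Default Select-String: case-insensitive, so [a-z]* also consumes "ROG".
$insensitive = @(($line | Select-String -Pattern $pattern -AllMatches).Matches)

# With -CaseSensitive the pattern behaves like it does with -cmatch.
$sensitive = @(($line | Select-String -Pattern $pattern -AllMatches -CaseSensitive).Matches)

$insensitive[0].Value   # 0 FROG Kermit 164 Big Bird_Road,
$sensitive[0].Value     # 4 Big Bird_Road,
```

Note that in both cases there is only one match, so the wanted value is at Matches[0], not Matches[1].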

Related

Negative Lookbehind Works in Editor But Not in Powershell Script

I am attempting to replace spaces with comma-space for all instances in a string, while avoiding repeating commas already present in the string.
Test string:
'186 ATKINS, Cindy Maria 25 Every Street Smalltown, Student'
Using the following code:
Get-Content -Path $filePath |
ForEach-Object {
$match = ($_ | Select-String $regexPlus).Matches.Value
$c = ($_ | Get-Content)
$c = $c -replace $match,', '
$c
}
The output is:
'186, ATKINS,, Cindy, Maria, 25, Every, Street, Smalltown,, Student'
My $regexPlus value is:
$regexPlus = '(?s)(?<!,)\s'
I have tested the negative lookbehind assertion in my editor and it works. Why does it not work in this Powershell script? The regex 101 online editor produces this curious mention of case sensitivity:
Negative Lookbehind (?<!,)
Assert that the Regex below does not match
, matches the character , with index 44 in decimal (2C in hex or 54 in octal) literally (case sensitive)
I have tried editing to:
$match = ($_ | Select-String $regexPlus -CaseSensitive).Matches.Value
But still not working. Any ideas are welcome.
Part of the problem here is that you are trying to force the regex match to do the replacement when, as @WiktorStribiżew mentions, you can simply use -replace the way it is meant to be used; -replace does all the hard work for you.
When you do this:
$match = ($_ | Select-String $regexPlus).Matches.Value
You are right, you are trying to find Regex matches. Congratulations! It found a space character, but when you do this:
$c = $c -replace $match,', '
It interprets $match as a space character like this:
$c = $c -replace ' ',', '
And not as a regular expression that you might have been expecting. That's why it's not seeing the negative lookbehind for the commas, because all it is searching for are spaces, and it is dutifully replacing all the spaces with comma spaces.
The solution is simple: use the regex text itself as the pattern in the -replace operation:
$regexPlus = '(?s)(?<!,)\s'
$c = $c -replace $regexPlus,', '
e.g. The negative lookbehind working as advertised:
PS C:\> $str = '186 ATKINS, Cindy Maria 25 Every Street Smalltown, Student'
PS C:\> $regexPlus = '(?s)(?<!,)\s'
PS C:\> $str -replace $regexPlus,', '
186, ATKINS, Cindy, Maria, 25, Every, Street, Smalltown, Student
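The difference between the two calls can be reproduced side by side. This is a small sketch using the test string from the question; $matchedText, $wrong and $right are names introduced here for illustration:

```powershell
$str       = '186 ATKINS, Cindy Maria 25 Every Street Smalltown, Student'
$regexPlus = '(?s)(?<!,)\s'

# What the question's code does: extract the matched TEXT (a single space)
# and reuse it as the pattern, which throws the lookbehind away.
$matchedText = ($str | Select-String $regexPlus).Matches.Value
$wrong = $str -replace $matchedText, ', '

# Passing the regex itself keeps the negative lookbehind in effect.
$right = $str -replace $regexPlus, ', '

$wrong   # 186, ATKINS,, Cindy, Maria, 25, Every, Street, Smalltown,, Student
$right   # 186, ATKINS, Cindy, Maria, 25, Every, Street, Smalltown, Student
```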
You can use
(Get-Content -Path $filePath) -replace ',*\s+', ', '
This code replaces zero or more commas, together with the one or more whitespace characters that follow them, with a single comma + space.
More details:
,* - zero or more commas
\s+ - one or more whitespace chars.
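As a quick check of the ,*\s+ approach, here is a sketch run against the question's test string and against a made-up string with irregular spacing (the second input is an assumption added for illustration):

```powershell
$str   = '186 ATKINS, Cindy Maria 25 Every Street Smalltown, Student'
$messy = 'A,  B   C'   # hypothetical input with runs of spaces

# Existing commas are consumed by ,* and re-emitted once in the replacement.
$result1 = $str   -replace ',*\s+', ', '
$result2 = $messy -replace ',*\s+', ', '

$result1   # 186, ATKINS, Cindy, Maria, 25, Every, Street, Smalltown, Student
$result2   # A, B, C
```

Unlike the lookbehind version, this variant also collapses runs of whitespace into a single separator.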

Replacing a Matched Character in Powershell

I have a text file of 3 name entries:
# dot_test.txt
001 AALTON, Alan .....25 Every Street
006 JOHNS, Jason .... 3 Steep Street
002 BROWN. James .... 101 Browns Road
My task is to find instances of NAME. when it should be NAME, using the following:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive |
ForEach-Object { if($_.Matches.Value -match '\.$'){$_.Matches.Value -replace '\,$'} }
The output is:
BROWN.
The conclusion is this script block identifies the instance of NAME. but fails to make the replacement.
Any suggestions on how to achieve this would be appreciated.
$_.Matches.Value -replace '\,$'
This attempts to replace a , (which you needn't escape as \,) at the end of ($) your match with the empty string (due to the absence of a second, replacement operand), i.e. it would effectively remove a trailing ,.
However, given that your match contains no , and that you instead want to replace its trailing . with ,, use the following:
$_.Matches.Value -replace '\.$', ',' # -> 'BROWN,'
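The behaviour of -replace with one operand versus two can be seen in a short sketch. The values below are taken from the sample file; the variable names are illustrative:

```powershell
# One operand: the match is replaced with the empty string (i.e. removed).
$removed   = 'JOHNS,' -replace ',$'        # JOHNS
# No trailing comma in the input, so nothing changes.
$unchanged = 'BROWN.' -replace ',$'        # BROWN.
# Two operands: the trailing dot is replaced with a comma.
$fixed     = 'BROWN.' -replace '\.$', ','  # BROWN,
```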
You can use -replace directly, and if you need to replace both a comma and dot at the end of the string, use [.,]$ regex:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive | % {$_.Matches.Value -replace '\.$', ','}
Details:
(?s)[A-Z]{3}.*?\D(?=\s|$) - matches
(?s) - RegexOptions.Singleline mode on and . can match line breaks
[A-Z]{3} - three uppercase ASCII letters
.*? - any zero or more chars as few as possible
\D - any non-digit char
(?=\s|$) - a positive lookahead that matches a location either immediately followed with a whitespace or end of string.
The \.$ pattern matches a . at the end of string.

Powershell Regex Multiline Regex

I'm trying to regex a file. I have tried these but I'm not good with regex.
((\|\n.*|\n))\d.*\n\s.*[0-9]{1,3}\s
((\|\n.*|\n))\d\d\d\d\d\d\d\n\s\s\s\s\s\s\s\s\s\s[0-9]{1,3}\s
((\|\n.*|\n))\d{7,8}\n\s.*[0-9]{1,3}\s
\|\n\s.*\d{7}\n\s.*[0-9]{1,3}\s
^.*\|\r?\n.*\r?\n[0-9]{1,3}$
I have a file that has lines like these
$00.00|0.00|0.00|||
8360657
68694
What I'm trying to do is figure out if the 3rd line is between 1 and 3 digits. If it's longer than 3 digits I don't care about it.
There is a lot more data in this file, and for each occurrence of the above 3 lines I want to know all matches if the 3rd line in my example is 3 digits or less. How can I modify my regex to work?
Here is my example code of what I've tried:
$file = "C:\Users\user\Desktop\del2\file.le"
$content = gc $file -raw
$gRegex = "((\|\n.*|\n))\d{7,8}\n\s.*[0-9]{1,3}\s"
$content -match $gRegex
I have got these to match using regex101.com however I'm not getting this to work in powershell...
What worked for me in the end:
$file = "C:\Users\user\Desktop\del2\D2341202.le"
$content = gc $file -raw
$guarantorRegex = "\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}\s"
$content | select-string -Pattern $guarantorRegex -AllMatches | % { $_.Matches } | % { $_.Value } > "C:\Users\user\Desktop\matches.txt"
If you want to match 10 spaces, you could match a space with a quantifier [ ]{10}
(The square brackets are for clarity only)
(?m)^[ ]{10}.*\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}$
(?m) Inline modifier to enable multiline
^ Start of line
[ ]{10}.*\| Match 10 spaces, then 0+ times any char except a newline, then |
\r?\n[ ]{10}.* Match a newline, 10 spaces, then 0+ times any char except a newline
\r?\n[ ]{10}[0-9]{1,3} Match a newline, 10 spaces, then 1 to 3 digits 0-9
$ End of line
Note that \s will also match a newline.
If you want to match whitespaces except a newline you could use [^\S\r\n]{10}
If you don't want to use anchors and there is a whitespace char at the end, you could use the pattern that worked for you
\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}\s
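To see the last pattern in action, here is a sketch on made-up data. It assumes each line in the file is indented by exactly 10 spaces and ends with CRLF, which is what the working regex implies but the excerpt in the question does not show; the second record (ending in 42) is invented for the example:

```powershell
# Build two three-line records, each line indented by 10 spaces (assumed layout).
$lines = @(
    '$00.00|0.00|0.00|||'
    '8360657'
    '68694'                 # 5 digits: should not match
    '$00.00|0.00|0.00|||'
    '8360658'
    '42'                    # 2 digits: should match
) | ForEach-Object { (' ' * 10) + $_ }
$content = ($lines -join "`r`n") + "`r`n"

$pattern = '\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}\s'
$found   = [regex]::Matches($content, $pattern)

$found.Count              # 1
$found[0].Value.Trim()    # ends with the 2-digit third line
```

The \r?\n parts are what make the pattern work on files with Windows line endings, where a bare \n in the earlier attempts failed to line up.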

Trying to match this using regular expressions in PowerShell

I am trying to use regular expressions to match certain lines in a file, but I am having some trouble.
The file contains text like this:
Mario, 123456789
Luigi, 234-567-890
Nancy, 345 5666 77533
Bowser, 348759823745908732589
Peach, 534785
Daisy, 123-456-7890
I'm trying to match just the numbers as either XXX-XXX-XXX or XXX XXX XXX pattern.
I've tried a few different ways, but it always accepts something I don't want it to, or it tells me everything is false.
I'm using PowerShell to do this.
At first I tried:
{$match = $i -match "\d{3}\-\d{3}\-\d{3}|\d{3}\ \d{3}\ \d{3}"
Write-Host $match}
But when I do that it matches the long string of numbers and XXX-XXX-XXXXX.
I read something saying that n would match the exact quantity, so I tried that...
{$match = $i -match "\d{n3}\-\d{n3}\-\d{n3}|\d{n3}\ \d{n3}\ \{n3}"
Write-Host $match}
That made everything false...
So I tried
{$match = $i -match "\d\n{3}\-\d\n{3}\-\d\n{3}|\d\n{3}\ \d\n{3}\ \d\n{3}"
I also tried the lazy quantifier, ?:
{$match = $i -match "\d{3?}\-\d{3?}\-\d{3?}|\d{3?}\ \{3?}\ \{3?}"
Write-Host $match}
Still false...
The final thing I tried was this...
{$match = $i -match "\d[0-9\{3\}\-\d[0-9]\{3\}\-\d[0-9]{3\}|\d[0-9]\{3\}\ \d[0-9]\{3}\ \d[0-9]\{3\}"
Write-Host $match}
Still no luck...
The following pattern gives two matches:
Get-Content .\test.txt | Where-Object {$_ -match '\d{3}[-|\s]\d{3}[-|\s]\d{3}'}
Luigi, 234-567-890
Daisy, 123-456-7890
If you want to exclude the last match, add the '$' anchor (which represents the end of the string):
Get-Content .\test.txt | Where-Object {$_ -match '\d{3}[-|\s]\d{3}[-|\s]\d{3}$'}
Luigi, 234-567-890
If you want to be very specific and match lines from start to end (use the ^ anchor, denotes the start of the string):
Get-Content .\test.txt | Where-Object {$_ -match '^\w+,\s+\d{3}[-|\s]\d{3}[-|\s]\d{3}$'}
Luigi, 234-567-890
Your first attempt is the closest. The {3} matches exactly 3 characters. I think the n you saw was supposed to represent any number, not an actual n character. The reason it matches the longer strings is that you only specified that the match must find 3 digits, dash or space, 3 digits, dash or space, then 3 more digits. You did not specify that it doesn't count if there are more digits after that.
To not match when there is a number after, you can use a negative lookahead.
(\d{3}-\d{3}-\d{3}|\d{3}\ \d{3}\ \d{3})(?!\d)
Alternatively, if you want to only match at the end of the line, possibly with trailing space
(\d{3}-\d{3}-\d{3}|\d{3}\ \d{3}\ \d{3})\s*$
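A short sketch comparing the question's first attempt with the negative-lookahead version on the sample data (the variable names are illustrative, and the escaped spaces are written as plain spaces, which is equivalent here):

```powershell
$lines = @(
    'Mario, 123456789'
    'Luigi, 234-567-890'
    'Nancy, 345 5666 77533'
    'Bowser, 348759823745908732589'
    'Peach, 534785'
    'Daisy, 123-456-7890'
)

$loose  = '\d{3}-\d{3}-\d{3}|\d{3} \d{3} \d{3}'
$strict = '(\d{3}-\d{3}-\d{3}|\d{3} \d{3} \d{3})(?!\d)'

# With an array on the left, -match returns the elements that match.
$looseHits  = @($lines -match $loose)    # Luigi and Daisy
$strictHits = @($lines -match $strict)   # Luigi only
```

The lookahead only guards the right-hand side; a number with extra leading digits would still need a guard such as (?<!\d) or \b in front.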
As Gideon said, your first attempt is the best place to start.
"\b\d{3}\-\d{3}\-\d{3}\b|\b\d{3}\ \d{3}\ \d{3}\b"
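The \b special character added before and after each alternative is a word boundary: a position between a word character (letter, digit or underscore) and a non-word character such as a space, newline, or punctuation like a period or comma, or the start or end of the string. This ensures that 9999 doesn't match, but 999. does.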
Try this:
/(\d+[- ])+\d+/
It's better not to have such rigid regular expressions, unless you are absolutely sure that your input will not change.
This regex matches one or more digits followed by a space or a dash, repeats that group as many times as possible, and then requires at least one more digit.
When manipulating data in PowerShell, it usually is a good idea to create objects representing the data (after all, PowerShell is all about objects). Filtering based on object properties is usually easier and more robust. Your problem is a good example.
Here is what we are after:
the persons: $persons
where: where
the number of that person: $_.number
matches: -match
the pattern
starting with three digits: ^\d{3}
followed by three digits between dashes or spaces: (-\d{3}-|\ \d{3}\ )
ending on three digits: \d{3}$
Below is the entire script:
$persons = import-csv -Header "name", "number" -delimiter "," data.csv
$persons | where {$_.number -match "^\d{3}(\-\d{3}\-|\ \d{3}\ )\d{3}$"}
You can also use Select-String:
Select-String '(\d{3}[ -]){2}\d{3}$' .\file.txt | % {$_.Line}

Regular expression to match any character being repeated more than 10 times

I'm looking for a simple regular expression to match the same character being repeated more than 10 or so times. So for example, if I have a document littered with horizontal lines:
=================================================
It will match the line of = characters because it is repeated more than 10 times. Note that I'd like this to work for any character.
The regex you need is /(.)\1{9,}/.
Test:
#!perl
use warnings;
use strict;
my $regex = qr/(.)\1{9,}/;
print "NO" if "abcdefghijklmno" =~ $regex;
print "YES" if "------------------------" =~ $regex;
print "YES" if "========================" =~ $regex;
Here the \1 is called a backreference. It references what is captured by the dot . between the brackets (.) and then the {9,} asks for nine or more of the same character. Thus this matches ten or more of any single character.
Although the above test script is in Perl, this is very standard regex syntax and should work in any language. In some variants you might need to use more backslashes, e.g. Emacs would make you write \(.\)\1\{9,\} here.
If a whole string should consist of 10 or more identical characters, add anchors around the pattern:
my $regex = qr/^(.)\1{9,}$/;
In Python you can use (.)\1{9,}
(.) makes group from one char (any char)
\1{9,} matches nine or more characters from 1st group
example:
import re

txt = """1. aaaaaaaaaaaaaaa
2. bb
3. cccccccccccccccccccc
4. dd
5. eeeeeeeeeeee"""
rx = re.compile(r'(.)\1{9,}')  # one char followed by 9+ copies of itself
for line in txt.split('\n'):
    if rx.search(line):
        print(line)
Output:
1. aaaaaaaaaaaaaaa
3. cccccccccccccccccccc
5. eeeeeeeeeeee
. matches any character. Used in conjunction with the curly braces already mentioned:
$: cat > test
========
============================
oo
ooooooooooooooooooooooo
$: grep -E '(.)\1{10}' test
============================
ooooooooooooooooooooooo
={10,}
matches = that is repeated 10 or more times.
use the {10,} operator:
$: cat > testre
============================
==
==============
$: grep -E '={10,}' testre
============================
==============
You can also use PowerShell to quickly replace words or character repetitions. PowerShell is for Windows; the current version at the time of writing is 3.0.
$oldfile = "$env:windir\WindowsUpdate.log"
$newfile = "$env:temp\newfile.txt"
$text = (Get-Content -Path $oldfile -ReadCount 0) -join "`n"
$text -replace '(.)\1{9,}', ' ' | Set-Content -Path $newfile
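In PowerShell the pattern is written without the surrounding / delimiters used in Perl or PHP; inside a PowerShell string the slashes would be treated as literal characters to match. A minimal sketch on an in-memory string (the sample text is made up for illustration):

```powershell
$pattern = '(.)\1{9,}'   # any character followed by 9+ copies of itself
$text    = "Title`n" + ('=' * 49) + "`nbody text`n" + ('-' * 5)

$hasRun      = $text -match $pattern          # True
$run         = $Matches[0]                    # the run of 49 '=' characters
$withSlashes = $text -match '/(.)\1{9,}/'     # False: the slashes are literal here

# Collapse each long run to a single instance of the repeated character.
$collapsed = $text -replace $pattern, '$1'
```

Using '$1' as the replacement keeps one copy of the repeated character; use ' ' or '' instead to blank the run out or remove it.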
PHP's preg_replace example:
$str = "motttherbb fffaaattther";
$str = preg_replace("/([a-z])\\1/", "", $str);
echo $str;
Here [a-z] hits the character, () then allows it to be used with the \\1 backreference, which tries to match another copy of the same character (note this is targeting 2 consecutive characters already), thus:
mother father
If you did:
$str = preg_replace("/([a-z])\\1{2}/", "", $str);
that would be erasing 3 consecutive repeated characters, outputting:
moherbb her
A slightly more generic PowerShell example. In PowerShell 7, the match is highlighted in the output, including the last space.
'a b c d e f ' | select-string '([a-f] ){6,}'
a b c d e f