Powershell Regex Multiline Regex - regex

I'm trying to regex a file. I have tried these but I'm not good with regex.
((\|\n.*|\n))\d.*\n\s.*[0-9]{1,3}\s
((\|\n.*|\n))\d\d\d\d\d\d\d\n\s\s\s\s\s\s\s\s\s\s[0-9]{1,3}\s
((\|\n.*|\n))\d{7,8}\n\s.*[0-9]{1,3}\s
\|\n\s.*\d{7}\n\s.*[0-9]{1,3}\s
^.*\|\r?\n.*\r?\n[0-9]{1,3}$
I have a file that has lines like these
$00.00|0.00|0.00|||
8360657
68694
What I'm trying to do is figure out whether the 3rd line is between 1 and 3 digits long. If it's longer than 3 digits, I don't care about it.
There is a lot more data in this file, and for each occurrence of the above 3 lines I want all matches where the 3rd line in my example is 3 digits or fewer. How can I modify my regex to work?
Here is my example code of what I've tried:
$file = "C:\Users\user\Desktop\del2\file.le"
$content = gc $file -raw
$gRegex = "((\|\n.*|\n))\d{7,8}\n\s.*[0-9]{1,3}\s"
$content -match $gRegex
I have got these to match using regex101.com however I'm not getting this to work in powershell...
What worked for me in the end:
$file = "C:\Users\user\Desktop\del2\D2341202.le"
$content = gc $file -raw
$guarantorRegex = "\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}\s"
$content | select-string -Pattern $guarantorRegex -AllMatches | % { $_.Matches } | % { $_.Value } > "C:\Users\user\Desktop\matches.txt"

If you want to match 10 spaces, you could match a space with a quantifier [ ]{10}
(The square brackets are for clarity only)
(?m)^[ ]{10}.*\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}$
(?m) Inline modifier to enable multiline
^ Start of line
[ ]{10}.*\| Match 10 spaces, 0+ times any char except a newline, and |
\r?\n[ ]{10}.* Match a newline, 10 spaces, 0+ times any char except a newline
\r?\n[ ]{10}[0-9]{1,3} Match a newline, 10 spaces, and 1 to 3 digits 0-9
$ End of line
Regex demo
Note that \s will also match a newline.
If you want to match whitespaces except a newline you could use [^\S\r\n]{10}
If you don't want to use anchors and there is a whitespace char at the end, you could use the pattern that worked for you
\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}\s
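As a rough illustration of the anchored pattern, here is a minimal sketch that runs it against an in-memory sample instead of a file. The sample records, the 10-space indentation, and the variable names are assumptions made up for this example. The `\r?` before `$` is also an addition: in .NET, `$` in multiline mode matches only before `\n`, so without it the pattern would fail on CRLF line endings.

```powershell
# Assumed sample data: two 3-line records, each line indented by 10 spaces.
$indent = ' ' * 10
$content = @(
    $indent + '$00.00|0.00|0.00|||'
    $indent + '8360657'
    $indent + '68694'               # 5 digits: should NOT match
    $indent + '$11.00|0.00|0.00|||'
    $indent + '8360658'
    $indent + '42'                  # 2 digits: should match
) -join "`r`n"

# Anchored pattern from above, with \r? added before $ to tolerate CRLF.
$pattern = '(?m)^[ ]{10}.*\|\r?\n[ ]{10}.*\r?\n[ ]{10}[0-9]{1,3}\r?$'
$found = [regex]::Matches($content, $pattern)

# Keep only the third line of each matched record, without the indentation.
$thirdLines = foreach ($m in $found) { ($m.Value -split '\r?\n')[-1].Trim() }
$thirdLines
```

Only the second record is reported, because the third line of the first record has more than 3 digits.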

Related

Select-String pattern finds only partial string match of -cmatch

I am trying to put together a string replacement routine. I have got as far as isolating the substring matches for two strings stored in array of strings $lines. Except there is a problem:
[string[]]$lines = "160 FROG Kermit 164 Big Bird_Road, Wellsville Singer","161 PIGGY Miss Pretty 1640 Really Long Main_Road, Whathellville Prima Donna"
# match string from last number to comma
foreach ($line in $lines) {
    if ($line -cmatch '\d\s\w[a-z]*\s.*,') {
        Write-Host "Found match!"
        $line | Select-String -Pattern '\d\s\w[a-z]*\s.' -AllMatches |
            ForEach-Object {
                $x = $_.Matches[1].Value
                Write-Host "x is:" $x
            }
    }
}
The first regex in $line -cmatch '\d\s\w[a-z]*\s.*,' is correct according to testing in Expresso. I want the address part of the string from last street number to comma. I am looking to replace the street basename spaces with underscores eg Big Bird_Road with Big_Bird_Road and Really Long Main_Road with Really_Long_Main_Road
The problem is that the second regex, the one in $line | Select-String -Pattern '\d\s\w[a-z]*\s.' -AllMatches |,
cannot be completed. As it stands, the output is:
Found match!
x is: 4 Big B
Found match!
x is: 0 Really L
The substring has not been captured yet! And if I add the remaining *, I get no output at all for x is:
Why doesn't the first regex (used with -cmatch) work in the same way when used as a Select-String pattern?
If you want to do a replacement for that format in the strings, you can use -replace with a pattern that matches only the spaces, and replace them with an underscore:
(?<=\d\s+\w[a-zA-Z\s_]*)\s(?=[^\d,]*,)
Explanation
(?<= Positive lookbehind to assert what to the left is
\d\s+\w[a-zA-Z\s_]* Match a digit, 1+ whitespace chars, a word char and optionally repeat the listed characters in the character class
) Close the lookbehind
\s Match a whitespace char (or \s+ to match 1 or more)
(?=[^\d,]*,) Assert a comma to the right after matching optional chars other than a digit or comma
Regex demo
[string[]]$lines = "160 FROG Kermit 164 Big Bird_Road, Wellsville Singer","161 PIGGY Miss Pretty 1640 Really Long Main_Road, Whathellville Prima Donna"
foreach ($line in $lines) {
$line -replace "(?<=\d\s+\w[a-zA-Z\s_]*)\s(?=[^\d,]*,)","_"
}
Output
160 FROG Kermit 164 Big_Bird_Road, Wellsville Singer
161 PIGGY Miss Pretty 1640 Really_Long_Main_Road, Whathellville Prima Donna

Replacing a Matched Character in Powershell

I have a text file of 3 name entries:
# dot_test.txt
001 AALTON, Alan .....25 Every Street
006 JOHNS, Jason .... 3 Steep Street
002 BROWN. James .... 101 Browns Road
My task is to find instances of NAME. when it should be NAME, using the following:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive |
ForEach-Object { if($_.Matches.Value -match '\.$'){$_.Matches.Value -replace '\,$'} }
The output is:
BROWN.
The conclusion is this script block identifies the instance of NAME. but fails to make the replacement.
Any suggestions on how to achieve this would be appreciated.
$_.Matches.Value -replace '\,$'
This attempts to replace a , (which you needn't escape as \,) at the end of ($) your match with the empty string (due to the absence of a second, replacement operand), i.e. it would effectively remove a trailing ,.
However, given that your match contains no , and that you instead want to replace its trailing . with ,, use the following:
$_.Matches.Value -replace '\.$', ',' # -> 'BROWN,'
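To see the difference between the two operations side by side, here is a small sketch; the variable names are illustrative only, and the input is the single match value from the question.

```powershell
$match = 'BROWN.'

# Original attempt: removes a trailing comma. There is none, so nothing changes.
$unchanged = $match -replace ',$'

# Suggested fix: replace the trailing dot with a comma.
$fixed = $match -replace '\.$', ','

$unchanged
$fixed
```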
You can use -replace directly, and if you need to replace both a comma and dot at the end of the string, use [.,]$ regex:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive | % {$_.Matches.Value -replace '\.$', ','}
Details:
(?s)[A-Z]{3}.*?\D(?=\s|$) - matches
(?s) - RegexOptions.Singleline mode on and . can match line breaks
[A-Z]{3} - three uppercase ASCII letters
.*? - any zero or more chars as few as possible
\D - any non-digit char
(?=\s|$) - a positive lookahead that matches a location either immediately followed with a whitespace or end of string.
The \.$ pattern matches a . at the end of string.

Regular expression to locate one string appearing anywhere after another but before something

I have an EDI file. This is the piece in question:
N1*ST*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST#GMAIL.COM
N1*BY*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST2#GMAIL.COM
I am using powershell
Get-ChildItem 'C:\Temp\*.edi' | Where-Object {(Select-String -InputObject $_ -Pattern 'PER\*EM\*\w+#\w+\.\w+' -List)}
I want to find the email address that appears after the N1*ST, but before the N1*BY. I have the expression that works for an email address but I am stuck on how to only get the one value. The real issue is sometimes the email is there and other times it is not. So I really do want to ignore that second email after the N1*BY.
Thanks in advance for the help.
You can use
(?s)(?<=N1\*ST.*)PER\*EM\*\w+#\w+\.\w+(?=.*N1\*BY)
See the .NET regex demo.
Details
(?s) - a DOTALL (RegexOptions.Singleline in .NET) regex inline modifier making . match newline chars, too
(?<=N1\*ST.*) - a positive lookbehind that matches a location preceded by N1*ST and then any zero or more chars
PER\*EM\* - a PER*EM* string
\w+#\w+ - 1+ word chars, #, and 1+ word chars
\. - a dot
\w+ - 1+ word chars
(?=.*N1\*BY) - a positive lookahead that matches a location followed by any zero or more chars and then the literal string N1*BY.
NOTE: You need to read in the file contents with Get-Content $filepath -Raw in order to find the proper match.
Something like
Get-ChildItem 'C:\Temp\*.edi' | % { Get-Content $_ -Raw | Select-String -Pattern '(?s)(?<=N1\*ST.*)PER\*EM\*\w+#\w+\.\w+(?=.*N1\*BY)' } | % { $_.Matches.value }
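The same pattern can be tried without touching the file system. The following is a minimal sketch that applies it to the sample segment from the question held in a here-string; the variable names are assumptions for this example, and it presumes the text contains a single N1*ST ... N1*BY pair.

```powershell
# Sample EDI segment from the question, as one multi-line string.
$edi = @'
N1*ST*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST#GMAIL.COM
N1*BY*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST2#GMAIL.COM
'@

$pattern = '(?s)(?<=N1\*ST.*)PER\*EM\*\w+#\w+\.\w+(?=.*N1\*BY)'

# Only the PER segment located between N1*ST and N1*BY satisfies both lookarounds.
$found = [regex]::Matches($edi, $pattern) | ForEach-Object { $_.Value }
$found
```

The second address is skipped because no N1*BY follows it, so the lookahead fails there.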

Remove formatting from US phone number and their extension number

Hi, I need help getting the phone number and its extension using either -replace or a regex.
phone
(123) 455-6789 --> 1234556789
(123) 577-2145 ext81245 --> 1235772145
extension
(123) 455-6789 -->
(123) 577-2145 ext81245 --> 81245
"(123) 455-6789" -replace "[()\s\s-]+|Ext\S+", ""
"(123) 455-6789 Ext 2445" -replace "[()\s\s-]+|Ext\S+", ""
This solves phone number but not extension.
You may try:
^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$
Explanation of the above regex:
^, $ - Represents start and end of the line respectively.
\((\d{3})\) - Represents first capturing group matching the digits inside ().
\s* - Matches a white-space character zero or more times.
(\d{3})- - Represents second capturing group capturing exactly 3 digits followed by a -.
(\d{4}) - Represents third capturing group matching the digits exactly 4 times.
(?: ext(\d{5}))? -
(?: Represents a non capturing group
ext - Followed by a space and literal ext.
(\d{5}) - Represents digits exactly 5 times.
) - Closing of the non-captured group.
? - Represents the quantifier making the whole non-captured group optional.
You can find the sample demo of the above regex in here.
Powershell Commands:
PS C:\Path\To\MyDesktop> $input_path='C:\Path\To\MyDesktop\InputFile.txt'
PS C:\Path\To\MyDesktop> $output_path='C:\Path\To\MyDesktop\outFile.txt'
PS C:\Path\To\MyDesktop> $regex='^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$'
PS C:\Path\To\MyDesktop> select-string -Path $input_path -Pattern $regex -AllMatches | % { "Phone Number: $($_.matches.groups[1])$($_.matches.groups[2])$($_.matches.groups[3]) Extension: $($_.matches.groups[4])" } > $output_path
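For a quick check without input and output files, the same regex can be applied to the two sample strings from the question. This is only a sketch: the variable names and the PSCustomObject shape are assumptions for illustration, not part of the commands above.

```powershell
$regex = '^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$'
$samples = '(123) 455-6789', '(123) 577-2145 ext81245'

$parsed = foreach ($s in $samples) {
    if ($s -match $regex) {
        [PSCustomObject]@{
            # Groups 1 to 3 hold the area code, prefix and line number.
            Phone     = $Matches[1] + $Matches[2] + $Matches[3]
            # Group 4 is absent from $Matches when there is no extension.
            Extension = $Matches[4]
        }
    }
}
$parsed
```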
After you've replaced all characters, you could split the result to get two numbers
Applied to your example
@(
    '(123) 455-6789'
    , '(123) 577-2145 ext81245'
) | % {
    $elements = $_ -replace '[()\s\s-]+' -split 'ext'
    [PSCustomObject]@{
        phone = $elements[0]
        extension = $elements[1]
    }
}
returns
phone extension
------ ---------
1234556789
1235772145 81245
Try out this pattern. It will match phone numbers with and without parentheses, spaces and hyphens.
((?:\(?)(\d{3})(?:\)?\s?)(\d{3})(?:-?)(\d{4}))
So alternatively, you could use two replace functions in a single go. Say your original data sits in File1.txt and you want to output to File2.txt then you could use:
$content = Get-Content -Path 'C:\File1.txt'
$newContent = $content -replace '[^\d\n]', '' -replace '^(.{10})(.*)', 'Phone: $1 Extension: $2'
$newContent | Set-Content -Path 'C:\File2.txt'

Regular Expressions in powershell split

I need to strip out a UNC fqdn name down to just the name or IP depending on the input.
My examples would be
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
I want to end up with just tom or 123.43.234.23
I have the following code in my array, which strips out the domain name perfectly, but I'm still left with \\tom
-Split '\.(?!\d)')[0]
Your regex succeeds in splitting off the tokens of interest in principle, but it doesn't account for the leading \\ in the input strings.
You can use regex alternation (|) to include the leading \\ at the start as an additional -split separator.
Given that matching a separator at the very start of the input creates an empty element with index 0, you then need to access index 1 to get the substring of interest.
In short: The regex passed to -split should be '^\\\\|\.(?!\d)' instead of '\.(?!\d)', and the index used to access the resulting array should be [1] instead of [0]:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '^\\\\|\.(?!\d)')[1] }
The above yields:
tom
123.43.234.23
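To make the empty first element visible, here is a small sketch that inspects the whole array returned by -split for one of the inputs; the variable name is made up for this example.

```powershell
# Splitting on the leading \\ yields an empty string as element 0.
$parts = '\\tom.overflow.corp.com' -split '^\\\\|\.(?!\d)'

$parts.Count   # number of elements, including the empty first one
$parts[0]      # empty string produced by the separator at the very start
$parts[1]      # the host name
```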
Alternatively, you could remove the leading \\ in a separate step, using -replace:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '\.(?!\d)')[0] -replace '^\\\\' }
Yet another alternative is to use a single -replace operation, which does not require a ForEach-Object call (doesn't require explicit iteration):
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' -replace
'(?x) ^\\\\ (.+?) \.\D .+', '$1'
Inline option (?x) (IgnoreWhiteSpace) allows you to make regexes more readable with insignificant whitespace: any unescaped whitespace can be used for visual formatting.
^\\\\ matches the \\ (escaped with \) at the start (^) of each string.
(.+?) matches one or more characters lazily.
\.\D matches a literal . followed by something other than a digit (\d matches a digit, \D is the negation of that).
.+ matches one or more remaining characters, i.e., the rest of the input.
$1 as the replacement operand refers to what the 1st capture group ((...)) in the regex matched, and, given that the regex was designed to consume the entire string, replaces it with just that.
I'm stealing Lee_Dailey's $InStuff
but appending a RegEx I used recently
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$InStuff |ForEach-Object {($_.Trim('\\') -split '\.(?!\d{1,3}(\.|$))')[0]}
Sample Output:
tom
123.43.234.23
As you can see here on RegEx101 the dots between the numbers are not matched
The Select-String cmdlet uses regex and populates a MatchInfo object with the matches (which can then be queried).
The regex "(\.?\d+)+|\w+" works for your particular example.
"\\tom.overflow.corp.com", "\\123.43.234.23.overflow.corp.com" |
Select-String "(\.?\d+)+|\w+" | % { $_.Matches.Value }
while this is NOT regex, it does work. [grin] i suspect that if you have a really large number of such items, then you will want a regex. they do tend to be faster than simple text operators.
this will get rid of the leading \\ and then replace the domain name with an empty string.
# fake reading in a text file
# in real life, use Get-Content
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$DomainName = '.overflow.corp.com'
$InStuff.ForEach({
$_.TrimStart('\\').Replace($DomainName, '')
})
output ...
tom
123.43.234.23