Replacing a Matched Character in Powershell - regex

I have a text file of 3 name entries:
# dot_test.txt
001 AALTON, Alan .....25 Every Street
006 JOHNS, Jason .... 3 Steep Street
002 BROWN. James .... 101 Browns Road
My task is to find instances of NAME. when it should be NAME, using the following:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive |
ForEach-Object { if($_.Matches.Value -match '\.$'){$_.Matches.Value -replace '\,$'} }
The output is:
BROWN.
The conclusion is this script block identifies the instance of NAME. but fails to make the replacement.
Any suggestions on how to achieve this would be appreciated.

$_.Matches.Value -replace '\,$'
This attempts to replace a , (which you needn't escape as \,) at the end of ($) your match with the empty string (due to the absence of a second, replacement operand), i.e. it would effectively remove a trailing ,.
However, given that your match contains no , and that you instead want to replace its trailing . with ,, use the following:
$_.Matches.Value -replace '\.$', ',' # -> 'BROWN,'
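The difference between the one-operand and two-operand forms of -replace can be checked quickly in a console. This is a minimal sketch; the variable names are illustrative and not part of the original script:

```powershell
# Illustrative value standing in for $_.Matches.Value (an assumption).
$match = 'BROWN.'

# One operand: a trailing comma would be removed, but there is none here,
# so the string comes back unchanged.
$unchanged = $match -replace ',$'

# Two operands: the trailing dot is replaced with a comma.
$fixed = $match -replace '\.$', ','

$unchanged   # BROWN.
$fixed       # BROWN,
```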

You can apply -replace directly to the match values. If you need to handle both a trailing comma and a trailing dot, use the [.,]$ regex instead of \.$:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive | % {$_.Matches.Value -replace '\.$', ','}
Details:
(?s)[A-Z]{3}.*?\D(?=\s|$) - matches
(?s) - RegexOptions.Singleline mode on and . can match line breaks
[A-Z]{3} - three uppercase ASCII letters
.*? - any zero or more chars as few as possible
\D - any non-digit char
(?=\s|$) - a positive lookahead that matches a location either immediately followed with a whitespace or end of string.
The \.$ pattern matches a . at the end of string.
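To try the pipeline without a file, the three sample lines can be piped in from memory. This is a sketch under that assumption; $lines, $names and $fixedLines are hypothetical names. The last statement is a different technique from the one above: it corrects the whole line with a case-sensitive -creplace and a lookbehind, assuming a name is at least three uppercase letters.

```powershell
# Sample lines from dot_test.txt, held in memory instead of read via -Path.
$lines = @(
    '001 AALTON, Alan .....25 Every Street'
    '006 JOHNS, Jason .... 3 Steep Street'
    '002 BROWN. James .... 101 Browns Road'
)
$pattern = '(?s)[A-Z]{3}.*?\D(?=\s|$)'

# Extract each NAME token and turn a trailing dot into a comma.
$names = $lines |
    Select-String -AllMatches -Pattern $pattern -CaseSensitive |
    ForEach-Object { $_.Matches.Value -replace '\.$', ',' }

# Variant (assumption): fix the dot in place and keep the rest of the line.
$fixedLines = $lines -creplace '(?<=[A-Z]{3})\.(?=\s)', ','

$names          # AALTON,  JOHNS,  BROWN,  (one per line)
$fixedLines[2]  # 002 BROWN, James .... 101 Browns Road
```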

Related

Regular expression to locate one string appearing anywhere after another but before something

I have an EDI file. This is the piece in question:
N1*ST*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST#GMAIL.COM
N1*BY*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST2#GMAIL.COM
I am using powershell
Get-ChildItem 'C:\Temp\*.edi' | Where-Object {(Select-String -InputObject $_ -Pattern 'PER\*EM\*\w+#\w+\.\w+' -List)}
I want to find the email address that appears after the N1*ST, but before the N1*BY. I have the expression that works for an email address but I am stuck on how to only get the one value. The real issue is sometimes the email is there and other times it is not. So I really do want to ignore that second email after the N1*BY.
Thanks in advance for the help.
You can use
(?s)(?<=N1\*ST.*)PER\*EM\*\w+#\w+\.\w+(?=.*N1\*BY)
See the .NET regex demo.
Details
(?s) - a DOTALL (RegexOptions.Singleline in .NET) regex inline modifier making . match newline chars, too
(?<=N1\*ST.*) - a positive lookbehind that matches a location preceded by N1*ST anywhere earlier in the text (the .* allows any characters in between)
PER\*EM\* - a literal PER*EM* string
\w+#\w+ - 1+ word chars, #, and 1+ word chars
\. - a dot
\w+ - 1+ word chars
(?=.*N1\*BY) - a positive lookahead that matches a location followed by the literal string N1*BY anywhere later in the text.
NOTE: You need to read in the file contents with Get-Content $filepath -Raw in order to find the proper match.
Something like
Get-ChildItem 'C:\Temp\*.edi' | % { Get-Content $_ -Raw | Select-String -Pattern '(?s)(?<=N1\*ST.*)PER\*EM\*\w+#\w+\.\w+(?=.*N1\*BY)' } | % { $_.Matches.value }
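The effect of the lookbehind and lookahead can be checked against the sample segment without touching the file system. This is a sketch that keeps the sample text in a here-string (standing in for Get-Content -Raw) and uses [regex]::Matches; the variable names are hypothetical.

```powershell
# Sample EDI segment from the question, as one multi-line string.
$edi = @'
N1*ST*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST#GMAIL.COM
N1*BY*TEST
N3*ADDRESS
N4*CITY*ST*POSTAL
PER*EM*TEST2#GMAIL.COM
'@

$pattern = '(?s)(?<=N1\*ST.*)PER\*EM\*\w+#\w+\.\w+(?=.*N1\*BY)'

# Only the PER segment between N1*ST and N1*BY satisfies both lookarounds;
# the second PER segment has no N1*BY after it.
$found = [regex]::Matches($edi, $pattern) | ForEach-Object { $_.Value }

# Strip the segment prefix to keep just the address.
$email = $found -replace '^PER\*EM\*'

$found   # PER*EM*TEST#GMAIL.COM
$email   # TEST#GMAIL.COM
```

If the segment between N1*ST and N1*BY carries no PER line, $found is simply empty, which covers the case where the email is sometimes missing.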

Remove formatting from US phone number and their extension number

Hi, I need help getting a phone number and its extension using either -replace or a regex.
phone
(123) 455-6789 --> 1234556789
(123) 577-2145 ext81245 --> 1235772145
extension
(123) 455-6789 -->
(123) 577-2145 ext81245 --> 81245
"(123) 455-6789" -replace "[()\s\s-]+|Ext\S+", ""
"(123) 455-6789 Ext 2445" -replace "[()\s\s-]+|Ext\S+", ""
This solves phone number but not extension.
You may try:
^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$
Explanation of the above regex:
^, $ - Represents start and end of the line respectively.
\((\d{3})\) - Represents first capturing group matching the digits inside ().
\s* - Matches a white-space character zero or more times.
(\d{3})- - Represents second capturing group capturing exactly 3 digits followed by a -.
(\d{4}) - Represents third capturing group matching the digits exactly 4 times.
(?: ext(\d{5}))? - an optional non-capturing group:
    (?: - opens the non-capturing group.
     ext - a space followed by the literal ext.
    (\d{5}) - fourth capturing group, matching exactly 5 digits.
    ) - closes the non-capturing group.
    ? - quantifier making the whole non-capturing group optional.
You can find the sample demo of the above regex in here.
Powershell Commands:
PS C:\Path\To\MyDesktop> $input_path='C:\Path\To\MyDesktop\InputFile.txt'
PS C:\Path\To\MyDesktop> $output_path='C:\Path\To\MyDesktop\outFile.txt'
PS C:\Path\To\MyDesktop> $regex='^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$'
PS C:\Path\To\MyDesktop> select-string -Path $input_path -Pattern $regex -AllMatches | % { "Phone Number: $($_.matches.groups[1])$($_.matches.groups[2])$($_.matches.groups[3]) Extension: $($_.matches.groups[4])" } > $output_path
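The same regex can also be tried on in-memory strings with the -match operator, which fills the automatic $Matches variable. This is a sketch using the two sample numbers from the question; $rows is a hypothetical name.

```powershell
$regex = '^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$'

$rows = '(123) 455-6789', '(123) 577-2145 ext81245' | ForEach-Object {
    if ($_ -match $regex) {
        [PSCustomObject]@{
            # Groups 1-3 are the three digit blocks of the phone number.
            Phone     = $Matches[1] + $Matches[2] + $Matches[3]
            # Group 4 is absent from $Matches when there is no extension,
            # so this evaluates to $null for the first input.
            Extension = $Matches[4]
        }
    }
}

$rows[0].Phone      # 1234556789
$rows[1].Phone      # 1235772145
$rows[1].Extension  # 81245
```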
After you've replaced all characters, you could split the result to get two numbers
Applied to your example
@(
    '(123) 455-6789'
    '(123) 577-2145 ext81245'
) | % {
    # Strip parentheses, whitespace and hyphens, then split on the literal 'ext'.
    $elements = $_ -replace '[()\s\s-]+' -split 'ext'
    [PSCustomObject]@{
        phone     = $elements[0]
        extension = $elements[1]
    }
}
returns
phone      extension
-----      ---------
1234556789
1235772145 81245
Try out this pattern. It will match phone numbers with and without parentheses, spaces and hyphens.
((?:\(?)(\d{3})(?:\)?\s?)(\d{3})(?:-?)(\d{4}))
So alternatively, you could use two replace functions in a single go. Say your original data sits in File1.txt and you want to output to File2.txt then you could use:
$content = Get-Content -Path 'C:\File1.txt'
$newContent = $content -replace '[^\d\n]', '' -replace '^(.{10})(.*)', 'Phone: $1 Extension: $2'
$newContent | Set-Content -Path 'C:\File2.txt'
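The chained -replace can be checked on in-memory strings before pointing it at real files. This is a sketch with the two sample numbers standing in for the file contents (an assumption):

```powershell
# Stand-in for Get-Content: one string per line of the input file.
$content = '(123) 455-6789', '(123) 577-2145 ext81245'

# First -replace keeps only digits; the second splits the digit run into
# the first 10 characters (phone) and whatever follows (extension).
$newContent = $content -replace '[^\d\n]', '' -replace '^(.{10})(.*)', 'Phone: $1 Extension: $2'

$newContent[1]   # Phone: 1235772145 Extension: 81245
```

Note that this relies on the phone number always being exactly 10 digits; anything after that is treated as the extension.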

Regular Expressions in powershell split

I need to strip out a UNC fqdn name down to just the name or IP depending on the input.
My examples would be
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
I want to end up with just tom or 123.43.234.23
I have the following code in my array, which strips out the domain name perfectly, but I'm still left with \\tom
-Split '\.(?!\d)')[0]
Your regex succeeds in splitting off the tokens of interest in principle, but it doesn't account for the leading \\ in the input strings.
You can use regex alternation (|) to include the leading \\ at the start as an additional -split separator.
Given that matching a separator at the very start of the input creates an empty element with index 0, you then need to access index 1 to get the substring of interest.
In short: The regex passed to -split should be '^\\\\|\.(?!\d)' instead of '\.(?!\d)', and the index used to access the resulting array should be [1] instead of [0]:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '^\\\\|\.(?!\d)')[1] }
The above yields:
tom
123.43.234.23
Alternatively, you could remove the leading \\ in a separate step, using -replace:
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' |
ForEach-Object { ($_ -Split '\.(?!\d)')[0] -replace '^\\\\' }
Yet another alternative is to use a single -replace operation, which does not require a ForEach-Object call (doesn't require explicit iteration):
'\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com' -replace
'(?x) ^\\\\ (.+?) \.\D .+', '$1'
Inline option (?x) (RegexOptions.IgnorePatternWhitespace) allows you to make regexes more readable with insignificant whitespace: any unescaped whitespace in the pattern is ignored and can be used for visual formatting.
^\\\\ matches the \\ (escaped with \) at the start (^) of each string.
(.+?) matches one or more characters lazily.
\.\D matches a literal . followed by something other than a digit (\d matches a digit, \D is the negation of that).
.+ matches one or more remaining characters, i.e., the rest of the input.
$1 as the replacement operand refers to what the 1st capture group ((...)) in the regex matched, and, given that the regex was designed to consume the entire string, replaces it with just that.
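The -split and single -replace approaches can be compared side by side. This is a minimal sketch using the two sample names from the question; the variable names are illustrative:

```powershell
$names = '\\tom.overflow.corp.com', '\\123.43.234.23.overflow.corp.com'

# -split: the leading \\ is treated as a separator too, so the
# host part ends up at index 1 (index 0 is the empty string).
$viaSplit = $names | ForEach-Object { ($_ -split '^\\\\|\.(?!\d)')[1] }

# -replace: operates on the whole array, no explicit loop needed.
$viaReplace = $names -replace '(?x) ^\\\\ (.+?) \.\D .+', '$1'

$viaSplit     # tom, 123.43.234.23
$viaReplace   # tom, 123.43.234.23
```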
I'm stealing Lee_Dailey's $InStuff
but appending a regex I used recently
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$InStuff |ForEach-Object {($_.Trim('\\') -split '\.(?!\d{1,3}(\.|$))')[0]}
Sample Output:
tom
123.43.234.23
As you can see here on RegEx101 the dots between the numbers are not matched
The Select-String cmdlet uses regex and populates a MatchInfo object with the matches (which can then be queried).
The regex "(\.?\d+)+|\w+" works for your particular example.
"\\tom.overflow.corp.com", "\\123.43.234.23.overflow.corp.com" |
Select-String "(\.?\d+)+|\w+" | % { $_.Matches.Value }
while this is NOT regex, it does work. [grin] i suspect that if you have a really large number of such items, then you will want a regex. they do tend to be faster than simple text operators.
this will get rid of the leading \\ and then replace the domain name with an empty string.
# fake reading in a text file
# in real life, use Get-Content
$InStuff = -split @'
\\tom.overflow.corp.com
\\123.43.234.23.overflow.corp.com
'@
$DomainName = '.overflow.corp.com'
$InStuff.ForEach({
$_.TrimStart('\\').Replace($DomainName, '')
})
output ...
tom
123.43.234.23

Remove all characters except regex pattern in array

So I'm creating an array of all the versions of a particular pkg in a directory.
What I want to do is strip out all the characters except the version numbers.
The first array has info such as
GoogleChrome.45.45.34.nupkg
GoogleChrome.34.28.34.nupkg
So the output I need is
45.45.34
34.28.34
$dirList = Get-ChildItem $sourceDir -Recurse -Include "*.nupkg" -Exclude $pkgExclude |
Foreach-Object {$_.Name}
$reg = '.*[0-9]*.nupkg'
$appName ='GoogleChrome'
$ouText = $dirList | Select-String $appName$reg -AllMatches | % {
$_.Matches.Value }
$ouText
$verReg='(\d+)(.)(?!nupkg)'
The last regex matches the pattern of what I want to keep, but I can't figure out how to strip out what I don't need.
You do not need to post-process matches if you apply the right pattern from the start.
In order to extract . separated digits in between GoogleChrome. and .nupkg you may use
Select-String '(?<=GoogleChrome\.)[\d.]+(?=\.nupkg)' -AllMatches
See the regex demo
Details
(?<=GoogleChrome\.) - the location should be preceded with GoogleChrome. substring
[\d.]+ - one or more digits or/and .
(?=\.nupkg) - there must be .nupkg immediately to the right of the current location.
If .nupkg should not be relied upon, use
Select-String '(?<=GoogleChrome\.)\d+(?:\.\d+)+' -AllMatches
Here, \d+(?:\.\d+)+ will match 1 or more digits followed with 1 or more occurrences of a . and 1+ digits only if preceded with GoogleChrome..
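Both lookaround patterns can be tried on the two sample file names directly. This is a sketch with hypothetical variable names; the final step, casting to [version] to pick the newest package, is an extra assumption about what the version list is for:

```powershell
$names = 'GoogleChrome.45.45.34.nupkg', 'GoogleChrome.34.28.34.nupkg'

# Version digits between 'GoogleChrome.' and '.nupkg'.
$versions = $names |
    Select-String '(?<=GoogleChrome\.)[\d.]+(?=\.nupkg)' -AllMatches |
    ForEach-Object { $_.Matches.Value }

# Same result without relying on the .nupkg suffix.
$versionsNoSuffix = $names |
    Select-String '(?<=GoogleChrome\.)\d+(?:\.\d+)+' -AllMatches |
    ForEach-Object { $_.Matches.Value }

# Optional: compare numerically rather than as text.
$newest = $versions | Sort-Object { [version]$_ } | Select-Object -Last 1

$versions   # 45.45.34, 34.28.34
$newest     # 45.45.34
```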
(\d+.?)+(?!nupkg)
This pattern would also give you the desired output in the match; check the regex demo.

I don't understand why Select-String isn't matching this

$lines = '<string>D:\home\bob\utility.mdb</string>'
[String[]]$varr = $lines | Select-String -AllMatches -Pattern "<string>*.mdb" |
Select-Object -ExpandProperty Matches |
Select-Object -ExpandProperty Value
This is returning null.
Ultimately I want the entire line, from the <string> to the </string>.
But apparently I don't know how to express this in powershell.
In your initial example:
<string>[\s]*.mdb[\s]*
the first [\s]* will not match anything. You perhaps intended:
<string>[\S]*.mdb[\s]*
However, I think the pattern has to cover the whole string from start to end if you want Matches to return all of it, so you need to spell out the closing tag as well. The dot also needs to be escaped, because an unescaped . is a wildcard that matches any character:
<string>[\S]*\.mdb[\s]*<\/string>
And I think you can remove some unneeded parts (I'm not too familiar with powershell's regex, but I haven't seen any where you have to put character classes written like \S within square brackets):
<string>\S*\.mdb<\/string>
Little explanation:
\s matches any whitespace character, including a space, a tab (\t) and a newline (\n).
\S will match everything that \s doesn't match.
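A quick way to see why the original pattern returns nothing, and that the final pattern returns the whole line, is to run both against the sample string. This is a sketch with illustrative variable names. In the original pattern, * is a regex quantifier applied to the preceding >, not a file-system style wildcard, so the pattern asks for <string, any number of >, one arbitrary character, and then mdb immediately.

```powershell
$lines = '<string>D:\home\bob\utility.mdb</string>'

# Original pattern: no match, so nothing comes back.
$broken = $lines | Select-String -AllMatches -Pattern '<string>*.mdb'

# Final pattern: \S* covers the path (it contains no whitespace here),
# \. is a literal dot, and the closing tag is spelled out.
[string[]]$varr = $lines |
    Select-String -AllMatches -Pattern '<string>\S*\.mdb</string>' |
    ForEach-Object { $_.Matches.Value }

$varr   # <string>D:\home\bob\utility.mdb</string>
```

Because \S excludes whitespace, a path containing spaces would need a different class, for example [^<]* (an assumption about what such paths may contain).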