Remove formatting from US phone number and their extension number


Hi, I need help getting the phone number and its extension using either -replace or a regex.
phone
(123) 455-6789 --> 1234556789
(123) 577-2145 ext81245 --> 1235772145
extension
(123) 455-6789 -->
(123) 577-2145 ext81245 --> 81245
"(123) 455-6789" -replace "[()\s\s-]+|Ext\S+", ""
"(123) 455-6789 Ext 2445" -replace "[()\s\s-]+|Ext\S+", ""
This solves phone number but not extension.

You may try:
^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$
Explanation of the above regex:
^, $ - Represents start and end of the line respectively.
\((\d{3})\) - Represents first capturing group matching the digits inside ().
\s* - Matches a white-space character zero or more times.
(\d{3})- - Represents second capturing group capturing exactly 3 digits followed by a -.
(\d{4}) - Represents third capturing group matching the digits exactly 4 times.
(?: ext(\d{5}))? - An optional non-capturing group:
(?: - Opens the non-capturing group.
 ext - Matches a space followed by the literal ext.
(\d{5}) - Represents the fourth capturing group, matching exactly 5 digits.
) - Closes the non-capturing group.
? - Quantifier that makes the whole non-capturing group optional.
You can find the sample demo of the above regex in here.
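For comparison, here is a minimal Python sketch of the same pattern; Python's re module accepts it unchanged. The function name parse_phone is made up for illustration, and re.IGNORECASE is an assumption added so that both ext and Ext are accepted, mirroring PowerShell's case-insensitive default.

```python
import re

# Same pattern as above: three digit groups plus an optional 5-digit extension.
PHONE_RE = re.compile(
    r'^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$',
    re.IGNORECASE,
)

def parse_phone(line):
    """Return (phone, extension) or None if the line does not match."""
    m = PHONE_RE.match(line.strip())
    if m is None:
        return None
    area, prefix, number, ext = m.groups()
    # The optional group yields None when no extension is present.
    return area + prefix + number, ext or ""

print(parse_phone("(123) 455-6789"))           # ('1234556789', '')
print(parse_phone("(123) 577-2145 ext81245"))  # ('1235772145', '81245')
```

Lines in any other shape (for example an extension that is not exactly 5 digits) return None rather than a partial result.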
Powershell Commands:
PS C:\Path\To\MyDesktop> $input_path='C:\Path\To\MyDesktop\InputFile.txt'
PS C:\Path\To\MyDesktop> $output_path='C:\Path\To\MyDesktop\outFile.txt'
PS C:\Path\To\MyDesktop> $regex='^\((\d{3})\)\s*(\d{3})-(\d{4})(?: ext(\d{5}))?$'
PS C:\Path\To\MyDesktop> select-string -Path $input_path -Pattern $regex -AllMatches | % { "Phone Number: $($_.matches.groups[1])$($_.matches.groups[2])$($_.matches.groups[3]) Extension: $($_.matches.groups[4])" } > $output_path

After you've replaced all characters, you could split the result to get two numbers
Applied to your example
@(
    '(123) 455-6789'
    , '(123) 577-2145 ext81245'
) | % {
    $elements = $_ -replace '[()\s\s-]+' -split 'ext'
    [PSCustomObject]@{
        phone = $elements[0]
        extension = $elements[1]
    }
}
returns
phone      extension
-----      ---------
1234556789
1235772145 81245
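The same strip-then-split idea can be sketched in Python. The helper name split_phone is hypothetical, and the case-insensitive split is an assumption so that Ext works as well as ext.

```python
import re

def split_phone(raw):
    """Strip formatting characters, then split on the literal 'ext'."""
    cleaned = re.sub(r'[()\s-]+', '', raw)
    parts = re.split(r'ext', cleaned, maxsplit=1, flags=re.IGNORECASE)
    phone = parts[0]
    # No 'ext' marker means there is only one element after the split.
    extension = parts[1] if len(parts) > 1 else ''
    return phone, extension

for sample in ('(123) 455-6789', '(123) 577-2145 ext81245'):
    print(split_phone(sample))
```

Because whitespace is removed before the split, an input such as "(123) 455-6789 Ext 2445" also yields the extension 2445.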

Try out this pattern. It will match phone numbers with and without parentheses, spaces and hyphens.
((?:\(?)(\d{3})(?:\)?\s?)(\d{3})(?:-?)(\d{4}))

So alternatively, you could use two replace functions in a single go. Say your original data sits in File1.txt and you want to output to File2.txt then you could use:
$content = Get-Content -Path 'C:\File1.txt'
$newContent = $content -replace '[^\d\n]', '' -replace '^(.{10})(.*)', 'Phone: $1 Extension: $2'
$newContent | Set-Content -Path 'C:\File2.txt'
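A rough Python equivalent of this digits-only approach, assuming every phone number has exactly 10 digits and anything after them is the extension (the function name digits_split is made up for illustration):

```python
import re

def digits_split(raw):
    """Drop every non-digit, then treat the first 10 digits as the phone."""
    digits = re.sub(r'\D', '', raw)
    return digits[:10], digits[10:]

print(digits_split('(123) 577-2145 ext81245'))  # ('1235772145', '81245')
```

This is the most forgiving variant: it never rejects a line, so malformed input silently produces a short phone number instead of an error.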

Related

Replacing a Matched Character in Powershell

I have a text file of 3 name entries:
# dot_test.txt
001 AALTON, Alan .....25 Every Street
006 JOHNS, Jason .... 3 Steep Street
002 BROWN. James .... 101 Browns Road
My task is to find instances of NAME. when it should be NAME, using the following:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive |
ForEach-Object { if($_.Matches.Value -match '\.$'){$_.Matches.Value -replace '\,$'} }
The output is:
BROWN.
The conclusion is this script block identifies the instance of NAME. but fails to make the replacement.
Any suggestions on how to achieve this would be appreciated.

$_.Matches.Value -replace '\,$'
This attempts to replace a , (which you needn't escape as \,) at the end of ($) your match with the empty string (due to the absence of a second, replacement operand), i.e. it would effectively remove a trailing ,.
However, given that your match contains no , and that you instead want to replace its trailing . with ,, use the following:
$_.Matches.Value -replace '\.$', ',' # -> 'BROWN,'

You can use -replace directly, and if you need to replace both a comma and dot at the end of the string, use [.,]$ regex:
Select-String -AllMatches -Path $input_path -Pattern '(?s)[A-Z]{3}.*?\D(?=\s|$)' -CaseSensitive | % {$_.Matches.Value -replace '\.$', ','}
Details:
(?s)[A-Z]{3}.*?\D(?=\s|$) - matches
(?s) - RegexOptions.Singleline mode on and . can match line breaks
[A-Z]{3} - three uppercase ASCII letters
.*? - any zero or more chars as few as possible
\D - any non-digit char
(?=\s|$) - a positive lookahead that matches a location either immediately followed with a whitespace or end of string.
The \.$ pattern matches a . at the end of string.
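As a small illustration in Python, the trailing-dot replacement looks like this. The second function is an assumption about the file layout (a numeric ID, then an uppercase surname) and rewrites the whole line instead of just the extracted name; both function names are hypothetical.

```python
import re

def fix_name(name):
    """Replace a trailing '.' with ',' in an extracted name such as 'BROWN.'."""
    return re.sub(r'\.$', ',', name)

def fix_line(line):
    """Replace the '.' that directly follows the surname in a full line."""
    return re.sub(r'^(\d+\s+[A-Z]+)\.', r'\1,', line)

print(fix_name('BROWN.'))                                 # BROWN,
print(fix_line('002 BROWN. James .... 101 Browns Road'))  # 002 BROWN, James .... 101 Browns Road
```

Lines that already use a comma after the surname are left unchanged, because the pattern requires a literal dot right after the uppercase letters.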

Regex for multiple app versions

I'm trying to get a list of versions from my custom attribute in a PowerShell script. The attribute looks like this:
[assembly: CompatibleVersions("1.7.1.0","1.7.1.1","1.2.2.3")]
I ended up with a regex like this, but it doesn't work at all:
'\(\"([^\",?]*)\"+\)'

You should do this as a two-step process: first parse out the CompatibleVersions attribute, then split out the version numbers. Otherwise you will have difficulty finding the version numbers individually without also picking up other version-like numbers.
$s = '[assembly: CompatibleVersions("1.7.1.0","1.7.1.1","1.2.2.3")]'
$versions = ($s | Select-String -Pattern 'CompatibleVersions\(([^)]+)\)' | % { $_.Matches }).Groups[1].Value
$versions.Split(',') | % { $_.Trim('"') } | Write-Host
# 1.7.1.0
# 1.7.1.1
# 1.2.2.3
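The two-step process can also be sketched in Python. The function name compatible_versions is made up for illustration; the pattern is the one used above.

```python
import re

def compatible_versions(text):
    """Step 1: isolate the attribute's argument list. Step 2: split it."""
    m = re.search(r'CompatibleVersions\(([^)]+)\)', text)
    if m is None:
        return []
    # Each piece looks like "1.7.1.0" including the quotes, so trim them.
    return [part.strip().strip('"') for part in m.group(1).split(',')]

s = '[assembly: CompatibleVersions("1.7.1.0","1.7.1.1","1.2.2.3")]'
print(compatible_versions(s))  # ['1.7.1.0', '1.7.1.1', '1.2.2.3']
```

Anchoring on the attribute name first is what keeps unrelated version-like numbers elsewhere in the file out of the result.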

Start by grabbing the parentheses pair and everything inside:
$string = '[assembly: CompatibleVersions("1.7.1.0","1.7.1.1","1.2.2.3")]'
if($string -match '\(([^)]+)\)'){
# Remove the parentheses themselves, split by comma and then trim the "
$versionList = $Matches[0].Trim("()") -split ',' |ForEach-Object Trim '"'
}

You may use
$s | select-string -pattern "\d+(?:\.\d+)+" -AllMatches | Foreach {$_.Matches} | ForEach-Object {$_.Value}
The \d+(?:\.\d+)+ pattern will match:
\d+ - 1 or more digits
(?:\.\d+)+ - 1 or more sequences of a . and 1+ digits.
See the regex demo on RegexStorm.

'"([.\d]+)"' will match any substring composed of dots and digits (\d) and comprised into double quotes (")

Each number between the dots can be 0, but cannot be 00, 01 or similar (no leading zeros).
Pay attention to the opening [, which has to be escaped.
This is a regex for the check:
^\[assembly: CompatibleVersions\("(?:[1-9]\d*|0)(?:\.(?:[1-9]\d*|0)){3}"(?:,"(?:[1-9]\d*|0)(?:\.(?:[1-9]\d*|0)){3}")*\)]$
Here is the regex with tests.
But if you are reading a list, you should use instead:
^\[assembly: CompatibleVersions\("((?:[1-9]\d*|0)(?:\.(?:[1-9]\d*|0)){3}"(?:,"(?:[1-9]\d*|0)(?:\.(?:[1-9]\d*|0)){3}")*)\)]$
With it you extract the "...","..." sequence from inside the parentheses.
After that, split the captured string by '","' into a list and strip any remaining " from the first and the last element. Now you have a list of valid version strings.
Alas, a regex alone cannot produce the list; you still need a split() step.
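A minimal Python sketch of this validate-then-split approach follows. The names VERSION, STRICT_RE and strict_versions are hypothetical, and unlike the regex above the capturing group here includes the opening quote, so both ends can be trimmed the same way.

```python
import re

# One version: four dot-separated numbers without leading zeros.
VERSION = r'(?:[1-9]\d*|0)(?:\.(?:[1-9]\d*|0)){3}'

STRICT_RE = re.compile(
    r'^\[assembly: CompatibleVersions\(("' + VERSION + r'"(?:,"' + VERSION + r'")*)\)\]$'
)

def strict_versions(line):
    """Return the list of versions, or None if the line is not well formed."""
    m = STRICT_RE.match(line)
    if m is None:
        return None
    # Captured text looks like "a","b","c": trim outer quotes, split on ",".
    return m.group(1).strip('"').split('","')

print(strict_versions('[assembly: CompatibleVersions("1.7.1.0","1.7.1.1","1.2.2.3")]'))
print(strict_versions('[assembly: CompatibleVersions("1.07.1.0")]'))  # None: leading zero
```

Returning None for a malformed line keeps validation and extraction in one step, which is the main reason to prefer the strict pattern over a loose one.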

Program-Name Detection

This is what the lines look like:
//| Vegas.c |
and I would like to get the name, here Vegas.c
This works in PS' regex:
$found = $body -match '.+?\s+(\w.+?\.c[\+]+)[\s\|]+'
But what if the name does not start with a-zA-Z0-9 (=\w) but e.g. ~ or other none-word-chars?
The first char of the name must be different from a blank so I tried:
$found = $body -match '.+?\s+(\S+.+?\.c[\+]+)[\s\|]+'
$found = $body -match '.+?\s+([^\ ]+.+?\.c[\+]+)[\s\|]+'
$found = $body -match '.+?\s+([^\s]+.+?\.c[\+]+)[\s\|]+'
None of them works reliably. In most cases they capture the whole line!
Any ideas?

How about this?
\/\/\| *([^ ]*)
\/ matches the character /
\/ matches the character /
\| matches the character |
 * (a space followed by *) matches zero or more space characters
the round brackets ( ) form the first capture group
[^ ]* captures a run of characters that are not (^) a space, so this works as long as none of your file names contain spaces

I think you made your question more basic than you needed to, judging by your comments, but this worked with your test string.
$string = @"
//| Vegas.c |
"@
Just look for the data between the pipes and the whitespace that borders them. I am not sure how it will perform with your real data, but it should still work if there are spaces in the program names.
[void]($string -match "\|\s+(.+)\s+\|")
$Matches[1]
Vegas.c
You could also use named captures in PowerShell
[void]($string -match "\|\s+(?<Program>.+)\s+\|")
$Matches.Program
Vegas.c
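The between-the-pipes idea translates to Python as sketched below. The function name program_name is made up, and the lazy .+? is a small variation on the greedy (.+) above, chosen so that extra padding before the closing pipe is not captured.

```python
import re

def program_name(line):
    """Return the text between the two pipes, without surrounding whitespace."""
    m = re.search(r'\|\s+(?P<program>.+?)\s+\|', line)
    return m.group('program') if m else None

print(program_name('//| Vegas.c |'))          # Vegas.c
print(program_name('//| ~my prog.c++    |'))  # ~my prog.c++
```

Since the pattern only relies on the pipes, names that start with a non-word character or contain spaces are handled the same way as plain ones.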

Trying to match this using regular expressions in PowerShell

I am trying to use regular expressions to match certain lines in a file, but I am having some trouble.
The file contains text like this:
Mario, 123456789
Luigi, 234-567-890
Nancy, 345 5666 77533
Bowser, 348759823745908732589
Peach, 534785
Daisy, 123-456-7890
I'm trying to match just the numbers as either XXX-XXX-XXX or XXX XXX XXX pattern.
I've tried a few different ways, but it always accepts something I don't want it to, or it tells me everything is false.
I'm using PowerShell to do this.
At first I tried:
{$match = $i -match "\d{3}\-\d{3}\-\d{3}|\d{3}\ \d{3}\ \d{3}"
Write-Host $match}
But when I do that it matches the long string of numbers and XXX-XXX-XXXXX.
I read something saying that n would match the exact quantity, so I tried that...
{$match = $i -match "\d{n3}\-\d{n3}\-\d{n3}|\d{n3}\ \d{n3}\ \{n3}"
Write-Host $match}
That made everything false...
So I tried
{$match = $i -match "\d\n{3}\-\d\n{3}\-\d\n{3}|\d\n{3}\ \d\n{3}\ \d\n{3}"
I also tried the lazy quantifier, ?:
{$match = $i -match "\d{3?}\-\d{3?}\-\d{3?}|\d{3?}\ \{3?}\ \{3?}"
Write-Host $match}
Still false...
The final thing I tried was this...
{$match = $i -match "\d[0-9\{3\}\-\d[0-9]\{3\}\-\d[0-9]{3\}|\d[0-9]\{3\}\ \d[0-9]\{3}\ \d[0-9]\{3\}"
Write-Host $match}
Still no luck...

The following pattern gives two matches:
Get-Content .\test.txt | Where-Object {$_ -match '\d{3}[-\s]\d{3}[-\s]\d{3}'}
Luigi, 234-567-890
Daisy, 123-456-7890
If you want to exclude the last match, add the $ anchor (which represents the end of the string):
Get-Content .\test.txt | Where-Object {$_ -match '\d{3}[-\s]\d{3}[-\s]\d{3}$'}
Luigi, 234-567-890
If you want to be very specific and match lines from start to end, also use the ^ anchor (which denotes the start of the string):
Get-Content .\test.txt | Where-Object {$_ -match '^\w+,\s+\d{3}[-\s]\d{3}[-\s]\d{3}$'}
Luigi, 234-567-890

Your first attempt is the closest. The {3} quantifier matches exactly three of the preceding token (here \d). I think the n you saw was meant to stand for any number, not a literal n character. It matches the long strings because you only required 3 digits, a dash or space, 3 digits, a dash or space, then 3 more digits. Nothing in the pattern says the match should fail if more digits follow.
To not match when there is a number after, you can use a negative lookahead.
(\d{3}-\d{3}-\d{3}|\d{3}\ \d{3}\ \d{3})(?!\d)
Alternatively, if you want to only match at the end of the line, possibly with trailing space
(\d{3}-\d{3}-\d{3}|\d{3}\ \d{3}\ \d{3})\s*$
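To see the negative lookahead at work on the sample data from the question, here is a small Python sketch. The names SAMPLE_LINES, NUMBER_RE and matching_lines are made up for illustration.

```python
import re

SAMPLE_LINES = [
    'Mario, 123456789',
    'Luigi, 234-567-890',
    'Nancy, 345 5666 77533',
    'Bowser, 348759823745908732589',
    'Peach, 534785',
    'Daisy, 123-456-7890',
]

# XXX-XXX-XXX or XXX XXX XXX, not followed by a further digit.
NUMBER_RE = re.compile(r'(\d{3}-\d{3}-\d{3}|\d{3} \d{3} \d{3})(?!\d)')

def matching_lines(lines):
    """Keep only the lines that contain a number in one of the two shapes."""
    return [line for line in lines if NUMBER_RE.search(line)]

print(matching_lines(SAMPLE_LINES))  # ['Luigi, 234-567-890']
```

Note that the lookahead only guards the right-hand side. If extra digits could also appear directly before the number, a negative lookbehind such as (?<!\d) at the start of the pattern would be needed as well.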

As Gideon said, your first is the best place to start.
"\b\d{3}\-\d{3}\-\d{3}\b|\b\d{3}\ \d{3}\ \d{3}\b"
The \b added before and after each alternative is a word boundary: a zero-width position between a word character (letter, digit or underscore) and a non-word character such as a space, newline, period or comma, or the start or end of the string. This ensures that 9999 doesn't match, but 999. does.

Try this:
/(\d+[- ])+\d+/
It's better not to use such rigid regular expressions unless you are absolutely sure that your input will not change.
This regex matches one or more digits followed by a space or a dash, repeats that group as many times as possible, and then requires at least one more digit. The surrounding slashes are only delimiters; leave them out in PowerShell.

When manipulating data in PowerShell, it usually is a good idea to create objects representing the data (after all, PowerShell is all about objects). Filtering based on object properties is usually easier and more robust. Your problem is a good example.
Here is what we are after:
the persons: $persons
where: where
the number of that person: $_.number
matches: -match
the pattern
starting with three digits: ^\d{3}
followed by three digits between dashes or spaces: (-\d{3}-|\ \d{3}\ )
ending on three digits: \d{3}$
Below is the entire script:
$persons = import-csv -Header "name", "number" -delimiter "," data.csv
$persons | where {$_.number -match "^\d{3}(\-\d{3}\-|\ \d{3}\ )\d{3}$"}
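The object-based approach can be sketched in Python too: parse each line into a name and a number first, then filter on the number alone. The names DATA, parse_people and NUMBER_RE are hypothetical, and stripping the whitespace around each field is an assumption about how the file should be read.

```python
import re

DATA = """Mario, 123456789
Luigi, 234-567-890
Nancy, 345 5666 77533
Bowser, 348759823745908732589
Peach, 534785
Daisy, 123-456-7890"""

# fullmatch() plays the role of the ^ and $ anchors.
NUMBER_RE = re.compile(r'\d{3}(-\d{3}-| \d{3} )\d{3}')

def parse_people(text):
    """Turn 'name, number' lines into dictionaries."""
    people = []
    for line in text.splitlines():
        name, _, number = line.partition(',')
        people.append({'name': name.strip(), 'number': number.strip()})
    return people

matching = [p for p in parse_people(DATA) if NUMBER_RE.fullmatch(p['number'])]
print(matching)  # [{'name': 'Luigi', 'number': '234-567-890'}]
```

Matching against the isolated number field means the pattern can be fully anchored without having to describe the name part of the line.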

You can also use Select-String:
Select-String '(\d{3}[ -]){2}\d{3}$' .\file.txt | % {$_.Line}

Regular expression to match any character being repeated more than 10 times

I'm looking for a simple regular expression to match the same character being repeated more than 10 or so times. So for example, if I have a document littered with horizontal lines:
=================================================
It will match the line of = characters because it is repeated more than 10 times. Note that I'd like this to work for any character.

The regex you need is /(.)\1{9,}/.
Test:
#!perl
use warnings;
use strict;
my $regex = qr/(.)\1{9,}/;
print "NO" if "abcdefghijklmno" =~ $regex;
print "YES" if "------------------------" =~ $regex;
print "YES" if "========================" =~ $regex;
Here the \1 is called a backreference. It references what is captured by the dot . between the brackets (.) and then the {9,} asks for nine or more of the same character. Thus this matches ten or more of any single character.
Although the above test script is in Perl, this is very standard regex syntax and should work in any language. In some variants you might need to use more backslashes, e.g. Emacs would make you write \(.\)\1\{9,\} here.
If a whole string should consist of 10 or more identical characters, add anchors around the pattern:
my $regex = qr/^(.)\1{9,}$/;

In Python you can use (.)\1{9,}
(.) makes group from one char (any char)
\1{9,} matches nine or more characters from 1st group
example:
import re

txt = """1. aaaaaaaaaaaaaaa
2. bb
3. cccccccccccccccccccc
4. dd
5. eeeeeeeeeeee"""
rx = re.compile(r'(.)\1{9,}')
lines = txt.split('\n')
for line in lines:
    # search() finds a run of 10+ identical characters anywhere in the line
    rxx = rx.search(line)
    if rxx:
        print(line)
Output:
1. aaaaaaaaaaaaaaa
3. cccccccccccccccccccc
5. eeeeeeeeeeee

. matches any character. Used in conjunction with the curly braces already mentioned:
$: cat > test
========
============================
oo
ooooooooooooooooooooooo
$: grep -E '(.)\1{10}' test
============================
ooooooooooooooooooooooo

={10,}
matches = that is repeated 10 or more times.

use the {10,} operator:
$: cat > testre
============================
==
==============
$: grep -E '={10,}' testre
============================
==============

You can also use PowerShell to quickly replace words or character repetitions. PowerShell originated on Windows; at the time of writing, the current version was 3.0.
$oldfile = "$env:windir\WindowsUpdate.log"
$newfile = "$env:temp\newfile.txt"
$text = (Get-Content -Path $oldfile -ReadCount 0) -join "`n"
$text -replace '(.)\1{9,}', ' ' | Set-Content -Path $newfile
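The same replacement, sketched in Python for comparison. The function name blank_long_runs and its min_run parameter are made up for illustration; note that the pattern is written without surrounding slashes, since those would be treated as literal characters.

```python
import re

def blank_long_runs(text, min_run=10):
    """Replace every run of min_run or more identical characters with a space."""
    # One captured character plus (min_run - 1) or more repeats of it.
    pattern = r'(.)\1{%d,}' % (min_run - 1)
    return re.sub(pattern, ' ', text)

print(blank_long_runs('Title\n' + '=' * 20 + '\nBody'))
```

Because . does not match a newline by default, runs are only detected within a single line.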

PHP's preg_replace example:
$str = "motttherbb fffaaattther";
$str = preg_replace("/([a-z])\\1/", "", $str);
echo $str;
Here [a-z] matches the character, the () makes it available to the \\1 backreference, which then tries to match the same character again (so this already targets 2 consecutive characters), thus:
mother father
If you did:
$str = preg_replace("/([a-z])\\1{2}/", "", $str);
that would be erasing 3 consecutive repeated characters, outputting:
moherbb her
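For reference, the same two substitutions in Python, plus a variation (not in the PHP example) that collapses each run to a single character instead of deleting part of it. All three function names are hypothetical.

```python
import re

def drop_pairs(text):
    """Delete each pair of identical consecutive lowercase letters."""
    return re.sub(r'([a-z])\1', '', text)

def drop_triples(text):
    """Delete each run of exactly three identical consecutive letters."""
    return re.sub(r'([a-z])\1{2}', '', text)

def collapse_runs(text):
    """Variation: keep one character from every run of repeats."""
    return re.sub(r'([a-z])\1+', r'\1', text)

s = 'motttherbb fffaaattther'
print(drop_pairs(s))     # mother father
print(drop_triples(s))   # moherbb her
print(collapse_runs(s))  # motherb father
```

The difference between deleting and collapsing matters for even-length runs: drop_pairs removes "bb" entirely, while collapse_runs leaves a single "b".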

A slightly more generic PowerShell example. In PowerShell 7, the match is highlighted in the output, including the last space.
'a b c d e f ' | select-string '([a-f] ){6,}'
a b c d e f
