I'm trying to format phone numbers in a large CSV directory. I will need to re-format this periodically as it changes so this is not a one-off solution. I have used Notepad++'s regex replace feature successfully in the past and would like to use this tool if possible. However, I'm open to better/faster methods including scripting like PowerShell, which I am familiar with.
Sample of number formats in the database:
XXX-XXXX
XXXXXXX
XXXXXXXXXX
1XXXXXXXXXX
(XXX) XXX-XXXX
1(XXX) XXX-XXXX
(1XXX) XXX-XXXX
XXX-XXX-XXXX
That last one is what I want all phone numbers to look like in the final output. For the ones lacking an area code, I would add a default value. For the ones with an extra country code, I would need to strip it.
Here are some of the regex searches I've used:
FIND: 1-(\d{3})-(\d{3})-(\d{4})
REPLACE: \1-\2-\3
This works!
FIND: 1\((\d{3})\)\s(\d{3})-(\d{4})
REPLACE: \1-\2-\3
This works!
FIND: (\d{11})
REPLACE: ???
This finds the correct string, but I don't know how to format the output.
FIND: (\d{3})-(\d{4})
REPLACE: XXX-\1-\2 (here the XXX is my standard area code that I will add)
This finds the correct substring in XXX-XXX-XXXX as well as XXX-XXXX and zip codes with +4 appended (XXXXX-XXXX). Need to just find the XXX-XXXX without anything preceding it and just from phone numbers. Because this is a CSV file, the actual character before each field is a comma.
My problem is twofold. 1) I don't know how to break up a found string into the parts I need for the replace. I need to convert blocks of digits (7, 10 and 11 digits) and format them to fit the pattern XXX-XXX-XXXX. 2) I don't know how to select just the string I'm searching for (i.e. only XXX-XXXX)
Provided you have a sample list of numbers like
Current            Expected
---------------------------------
123-1234           XXX-123-1234
1234567            XXX-123-4567
1234567890         123-456-7890
10123456789        012-345-6789
(123) 456-1234     123-456-1234
1(123) 123-1234    123-123-1234
1-123-123-1234     123-123-1234
(1999) 999-1234    999-999-1234
123-123-1234       123-123-1234
You may use
Find What: ^(?:1-?)?(?|\(1?(\d{3})\)|(\d{3}))[-\s]?(\d{3})[-\s]?(\d{4})$|^(\d{3})[-\s]?(\d{4})$
Replace With: (?1$1-$2-$3:XXX-$4-$5)
Details:
^ - start of line
(?:1-?)? - optional sequence of 1 and an optional -
(?|\(1?(\d{3})\)|(\d{3})) - a branch reset group (syntax is (?|...), all groups inside alternative branches receive same IDs) matching either:
\(1?(\d{3})\) - ( + an optional 1 + Group 1 capturing 3 digits + )
| - or
(\d{3}) - Group 1 (still! because of a branch reset group) capturing 3 digits
[-\s]? - an optional - or whitespace
(\d{3}) - Group 2 capturing 3 digits
[-\s]? - an optional - or whitespace
(\d{4}) - Group 3 capturing 4 digits
$ - end of line
| - OR
^ - start of line
(\d{3}) - Group 4 capturing 3 digits
[-\s]? - an optional - or whitespace
(\d{4}) - Group 5 capturing 4 digits
$ - end of line
The replacement pattern:
(?1 - If Group 1 matched, then use
$1-$2-$3 - Backreference to Group 1, 2 and 3 with hyphens in between
: - or else
XXX-$4-$5 - XXX (or whatever your default area code is), and Group 4 and 5 separated with a hyphen.
) - end of the if-then block.
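If you end up scripting this instead, here is a minimal Python sketch of the same logic. Python's re module has neither branch reset groups nor conditional replacement, so the two area-code alternatives get their own groups and the if/else moves into code. The function name normalize_phone and the default area code 555 are placeholders of mine, not anything from the question.

```python
import re

DEFAULT_AREA_CODE = "555"  # hypothetical stand-in for the XXX default

# Optional leading 1 (and optional hyphen), area code with or without
# parentheses (the 1 may sit inside them), then 3 digits and 4 digits.
FULL = re.compile(
    r"^(?:1-?)?(?:\(1?(\d{3})\)|(\d{3}))[-\s]?(\d{3})[-\s]?(\d{4})$"
)
# Seven-digit number with no area code.
LOCAL = re.compile(r"^(\d{3})[-\s]?(\d{4})$")


def normalize_phone(raw, default_area=DEFAULT_AREA_CODE):
    """Return raw as XXX-XXX-XXXX, or unchanged if it fits neither pattern."""
    m = FULL.match(raw)
    if m:
        # Only one of the two area-code groups can have participated.
        area = m.group(1) or m.group(2)
        return f"{area}-{m.group(3)}-{m.group(4)}"
    m = LOCAL.match(raw)
    if m:
        return f"{default_area}-{m.group(1)}-{m.group(2)}"
    return raw


print(normalize_phone("(1999) 999-1234"))  # 999-999-1234
print(normalize_phone("123-1234"))         # 555-123-1234
```

Values that match neither pattern are returned untouched, so unexpected formats stay visible instead of being silently mangled.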
I'm not familiar with PowerShell, but yes, it would be a good idea to make a small script to do this for you.
For the Notepad++ approach though, I'd try running the replace twice:
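FIND: (^|,)(\d{3})[ -]?(\d{4})(?=,|$)
REPLACE: \1XXX-\2-\3 where the XXX is your input area code (\1 puts back the leading comma, and the lookahead leaves the trailing comma in place, so neighbouring fields are not merged)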
FIND: \(?1?\(?(\d{3})\)?[ -]?(\d{3})[ -]?(\d{4})
REPLACE: \1-\2-\3
I don't think the order matters. Try it out in a test file first.
I'm not sure what you mean by your second question. Are the regexes selecting numbers from the wrong column in the CSV? (If so, that's another reason why a script would be better.)
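To illustrate why a script helps with the column problem, here is a rough Python sketch that only touches one named column, so ZIP+4 values elsewhere in the row never reach the phone logic. It takes a different route from the regexes above (strip the non-digits, then decide by length), and the column name phone, the default area code 555 and the sample rows are all made up for the example.

```python
import csv
import io
import re


def normalize(raw, default_area="555"):  # "555" is a hypothetical default
    """Strip non-digits, then rebuild the number as XXX-XXX-XXXX."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]              # drop the leading country code
    if len(digits) == 7:
        digits = default_area + digits   # add the default area code
    if len(digits) != 10:
        return raw                       # leave anything unexpected alone
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def normalize_column(csv_text, column="phone"):
    """Rewrite only the given column of a CSV that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = normalize(row[column])
        writer.writerow(row)
    return out.getvalue()


sample = (
    "name,zip,phone\n"
    "Ann,12345-6789,123-4567\n"
    "Bob,54321,1(123) 123-1234\n"
)
print(normalize_column(sample))
```

For a real file you would open it with newline="" as the csv module documentation recommends, rather than going through io.StringIO.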
I'm trying to match an exact pattern to do some data cleanup for ISSNs using the code below:
select case when REGEXP_REPLACE('1234-5678 ÿþT(zlsd?k+j''fh{l}x[a]j).,~!##$%^&*()_+{}|:<>?`"\;''/-', '([0-9]{4}[\-]?[Xx0-9]{4})(.*)', '$1') not similar to '[0-9]{4}[\-]?[Xx0-9]{4}' then 'NOT' else 'YES' end
The pattern I want should match any 8-digit group with an optional dash in the middle and an optional X at the end.
The code above works for most cases, but if capture group 1 is the following example: 123456789 then it also returns positive because it matches the first 8 digits, and I don't want it to.
I tried surrounding capture group 1 with ^...$ but that doesn't work either.
So I would like to match exactly these examples and similar ones:
1234-5678
1234-567X
12345678
1234567X
BUT NOT THESE (and similar):
1234567899
1234567899x
What am I missing?
You may use
^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$
Details
^ - start of string
([0-9]{4}-?[Xx0-9]{4}) - Capturing group 1 ($1): four digits, an optional -, and then four x / X or digits
([^0-9].*)? - an optional Capturing group 2: any char other than a digit and then any 0+ chars as many as possible
$ - end of string.
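For a quick check outside the database, the same pattern can be tried in Python. This is only a sketch: the helper name extract_issn is mine, and Python's re module stands in for the SQL regex functions used in the question.

```python
import re

# Same pattern as above: group 1 is the ISSN, group 2 is optional trailing
# junk that must start with a non-digit.
ISSN = re.compile(r"^([0-9]{4}-?[Xx0-9]{4})([^0-9].*)?$")


def extract_issn(text):
    """Return the leading ISSN, or None when more digits follow it."""
    m = ISSN.match(text)
    return m.group(1) if m else None


print(extract_issn("1234-567X"))    # 1234-567X
print(extract_issn("1234567899"))   # None
```

The second capturing group is what rejects 1234567899: after eight characters the next one is a digit, so neither the optional group nor $ can match.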
I have some cases, which I need to filter with a regex. The values which need to be filtered are listed below:
// These should be caught
123456_Test.pdf
123456 Test.pdf
123456.pdf
// These shouldn't be caught
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
The current regEx looks like this:
(\d{6,7})((\_| ){0,1})(.*)\..*
The problem here is that the latter 3 are also matched. To give you a short overview of what's wrong with the wrongly matched strings:
The 1st capture group has to consist of 6-7 digits (the capture group is also needed in the end). If there are letters after these digits, there has to be a whitespace or underscore between them. The 1st example of the "shouldn't be caught" list shows this: the entry is invalid, since there are letters after 123456 without the required separator.
The last entry isn't really important; it's just there for convenience.
What am I missing? How do I adjust my regex so that it checks for the separator only if letters follow the digit sequence?
You may use
^(\d{6,7})([_ ][A-Za-z].*)?\..*$
Details
^ - start of a string
(\d{6,7}) - Group 1: 6 or 7 digits
([_ ][A-Za-z].*)? - an optional capturing group #2: a _ or space followed by a letter and then any 0+ chars, as many as possible, up to the last . on the line
\. - a literal .
.* - the rest of the line
$ - end of string.
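Here is a small Python sketch that runs this pattern over the six sample names from the question. The helper name leading_number is my own; the regex is used unchanged.

```python
import re

PATTERN = re.compile(r"^(\d{6,7})([_ ][A-Za-z].*)?\..*$")


def leading_number(filename):
    """Return the 6-7 digit prefix (Group 1) if the name is valid, else None."""
    m = PATTERN.match(filename)
    return m.group(1) if m else None


names = [
    "123456_Test.pdf", "123456 Test.pdf", "123456.pdf",
    "123456Abcasd.pdf", "123456-Abcasd.pdf", "123456_.pdf",
]
for name in names:
    print(name, "->", leading_number(name))
```

The first three names yield 123456 and the last three yield None, which matches the split the question asks for.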
Check if this Perl solution works for you.
> cat regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
> perl -ne ' print if m/\d+(([ _])[a-zA-Z]+| [a-zA-Z]*)?\.pdf/ ' regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
>
Trying to parse a string using SQL, and have not found any solutions online (apologies, maybe I'm looking for the wrong thing).
I have a string field with a series of numbers I need to pull out and sum. Delimiter is "\r\n".
Example: '\r\n - 1234 somenumbersandtext123 \r\n -5678 sometextmorenumbers123'
So in this example, I want to sum 1234 and 5678.
The strings are all different lengths, and I need to eventually sum the numbers within the string. The string details documents tied to a project, and the numbers represent the size of the file (trying to determine the total file size per project).
Thanks in advance for any guidance.
You may use
regexp_matches(col,'(?:^|\n)\s*-\s*(\d*\.?\d+)','g')
The part captured with (...) will be the output of the regexp_matches function.
Details
(?:^|\n) - start of string or newline
\s*-\s* - a hyphen enclosed with 0+ whitespaces
(\d*\.?\d+) - Capturing group 1 (what will be returned):
\d* - 0+ digits
\.? - 1 or 0 dots
\d+ - 1+ digits.
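To see what the pattern captures and how the sum would come together, here is a sketch in Python rather than SQL. It assumes the field holds real carriage return / line feed characters (not a literal backslash followed by r and n), and the function name sum_sizes is made up for the example.

```python
import re

# Same pattern as above: a number right after a hyphen at the start of
# the string or at the start of a new line.
SIZE = re.compile(r"(?:^|\n)\s*-\s*(\d*\.?\d+)")


def sum_sizes(text):
    """Add up every number that directly follows a leading hyphen."""
    return sum(float(n) for n in SIZE.findall(text))


sample = "\r\n - 1234 somenumbersandtext123 \r\n -5678 sometextmorenumbers123"
print(SIZE.findall(sample))  # ['1234', '5678']
print(sum_sizes(sample))     # 6912.0
```

The trailing 123 in each line is ignored because it is not preceded by a line start and a hyphen.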
This seems to work:
SELECT
  REGEXP_MATCHES(
    string::text,
    '\Br\Bn ?- ?([0-9]+)',
    'g')
FROM test_table
The example text is as follows:
01MAR2015 01MAR2015 Example Example
02MAR2015 Example Example Example
03MAR2015 Example Example $2.45
I want to select all the text from the third date (second row) all the way to the dollar amount. I don't know how to skip the first two dates. Thanks for any help.
Expected output:
02MAR2015 Example Example Example
03MAR2015 Example Example $2.45
What I have for now:
([0-9]{2}[A-Z]{3}[0-9]{4}) # to match the date
((\d)*\.(\d){2}) # to match the dollar amount
(?<=([0-9]{2}[A-Z]{3}[0-9]{4}){2})\1.*((\d)*\.(\d){2}) # my attempt
You seem to need to match the text starting at the second line. In AHK, you may use PCRE compatible patterns.
Use
(?<=\n)[0-9]{2}[A-Z]{3}[0-9]{4}[\w\W]*
Details
(?<=\n) - matching will start after a newline
[0-9]{2} - 2 digits
[A-Z]{3} - 3 uppercase letters
[0-9]{4} - 4 digits
[\w\W]* - any 0+ chars as many as possible.
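The pattern is not specific to AHK. As a rough check, Python's re module accepts it unchanged, so the sketch below runs it against the sample text from the question (the variable names are mine).

```python
import re

text = (
    "01MAR2015 01MAR2015 Example Example\n"
    "02MAR2015 Example Example Example\n"
    "03MAR2015 Example Example $2.45"
)

# The lookbehind requires a newline just before the date, so the two
# dates on the first line can never start the match.
pattern = re.compile(r"(?<=\n)[0-9]{2}[A-Z]{3}[0-9]{4}[\w\W]*")

match = pattern.search(text)
result = match.group(0) if match else None
print(result)
```

This prints the second and third lines. Note that [\w\W]* runs to the very end of the input, so any text after the dollar amount would be included as well.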
I am trying to create a regex to validate a field where the user can enter a 5 digit number with the option of adding a / followed by 3 letters. I have tried quite a few variations of the following code:
^(\d{5})+?([/]+[A-Z]{1,3})?
But I just can't seem to get what I want.
For instance, I would like the user to enter either a 5-digit number such as 12345, or that number with a forward slash followed by any 3 letters, such as 12345/WFE.
You probably want:
^\d{5}(?:/[A-Z]{3})?$
You might have to escape that forward slash depending on your regex flavor.
Explanation:
^ - start of string anchor
\d{5} - 5 digits
(?:/[A-Z]{3}) - non-capturing group consisting of a literal / followed by 3 uppercase letters (depending on your needs you could consider making this a capturing group by removing the ?:).
? - 0 or 1 of what precedes (in this case that's the non-capturing group directly above).
$ - end of string anchor
All in all, the regex looks like this:
^\d{5}(?:/[A-Z]{3})?$
You can use this regex
/^\d{5}(?:\/[a-zA-Z]{3})?$/
Here it is in practice (this is a great site to test your regexes):
http://regexr.com?36h9m
^(\d{5})(\/[A-Z]{3})?
Tested in rubular
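For completeness, a minimal validation sketch in Python using the anchored, uppercase-only variant from the answers above. The name is_valid is my own, and re.fullmatch takes the place of the ^ and $ anchors.

```python
import re

# re.ASCII keeps \d limited to 0-9; fullmatch requires the whole input to fit.
CODE = re.compile(r"\d{5}(?:/[A-Z]{3})?", re.ASCII)


def is_valid(value):
    """True for 5 digits, optionally followed by / and 3 uppercase letters."""
    return CODE.fullmatch(value) is not None


print(is_valid("12345"))       # True
print(is_valid("12345/WFE"))   # True
print(is_valid("12345/WF"))    # False
print(is_valid("12345/WFEX"))  # False
```

Without the end anchor (or fullmatch), inputs such as 12345/WFEX or 123456 would still be accepted, because the pattern would only need to match a prefix.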