Regex Select & Replace to Clean Up US phone numbers

Regex Select & Replace to Clean Up US phone numbers - regex

We pull in a list of phone numbers as part of a datafeed. They are all for North America based companies. I would like to remove any leading "1" or "+1" and any trailing information like "x100", " EXT400", etc. They are stored in MariaDB so I would like to do
UPDATE `CompanyPhone` SET `number`= REGEXP_SUBSTR(`number`,pattern)
to remove the unwanted stuff, I just need the REGEX to select the correct part of the phone number.
"1 (555) 555-5555 x100" -> "(555) 555-5555"
"+15555555555 EXT400" -> "5555555555"
" 555-555-5555" -> "555-555-5555" (remove leading space)
Basically, I need just the first 10 digits, ignoring the first digit if it is a 1, and the formatting currently in the first 10 digits ("()" or " " or "-") if it is possible to keep it.
If everything could be reformatted to (555) 555-5555 that would be a bonus but is not required. I could do this a 2nd query if needed.

You could use REGEXP_REPLACE for this. Assuming you are using MariaDB 10.0.5 or later, you can use PCRE regular expressions. For your sample expressions, this regexp will give you the desired results (demo on Regex101). It looks for 3 groups of numbers (3 digits, 3 digits and then 4 digits) possibly preceded by a 1, and with other non-digit characters (e.g. +, -) around them.
^(?:\D*)1?(?:\D*)(\d{3})(?:\D*)(\d{3})(?:\D*)(\d{4}).*$
So your UPDATE statement will become
UPDATE `CompanyPhone` SET `number`= REGEXP_REPLACE(`number`, '^(?:\\D*)1?(?:\\D*)(\\d{3})(?:\\D*)(\\d{3})(?:\\D*)(\\d{4}).*$', '(\\1) \\2-\\3')

Related

matching numbers after nth occurence of a certain symbol in a line

I'm not sure if using regex is the correct way to go about this here, but I wanted to try solving this with regex first (if it's possible)
I have an edifact file, where the data (in bold) in certain fields in some segments need to be substituted (with different dates, same format)
UNA:+,? '
UNB+UNOC:3+000000000+000000000+20190801:1115+00001+DDMP190001'
UNH+00001+BRKE:01+00+0'
INV+ED Format 1+Brustkrebs+19880117+E000000001+**20080702**+++1+0'
FAL+087897044+0000000++name+000000000+0+**20080702**++1+++J+N+N+N+N+N+++0'
INL+181095200+385762115+++0'
BEE+20080702++++0'
BAA+++J+J++++++J+++++++J++0'
BBA++++++++J++++++J+J++++++J+++++J+++J+J++++++++J+0'
BHP+J+++++J+++++J+++++0'
BLA+++J+++++++++0'
BFA++++++++++++J++0'
BSA++J+++J+J+++0'
BAT+20190801+0'
DAT+**20080702**++++0'
UNT+000014+00001'
UNZ+00001+00001'
at first I was able to match those fields using a positive lookahead and a lookbehind (I had different expressions for matching each date).
Here, for example is the expression I intially used to match the date in the "FAL" segment: (?<=\+[\d]{1}\+)\d{8}(?=\+\+), but then i saw that this date is sometimes preceeded by 9 digits, and sometimes by 1 (based on version) and followed by a either ++ or a + and a date so I added a logiacl OR like this: (?<=\+[\d]{9}\+|\+[\d]{1}\+)\d{8}(?=\+[\d]{8}\+|\+\+)and quickly realized it's not sustainable because I saw that these edifact files vary (far beyond only either 9 and 1 digits)
(I have 6 versions for each type, and i have 6 types total)
Because I have a scheme/map indicating what each version should be built like and I know on what position (based on the + separator) the date is written in each version, I thought about maybe matching the date based on the +, so after the 7th occurence (say in the FAL segment) of plus in a certain line, match the next 8 digits.
is this possible to achieve with regex? and if yes, could someone please tell me how?

I suggest using a pattern like
^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)
where {7} can be adjusted to the value you need for each type of segments, and replace with the backreference to Group 1. In Python, it is \g<1>20200101 (where 20200101 is your new date), in PHP/.NET, it is ${1}20200101. In JS, it will be just $1.
To run on a multiline text, use m flag. In Python regex, you may embed it like (?m)^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+).
See the Python regex demo
Details
^ - start of string/line
((?:[^+\n]*\+){7}) - Group 1: 7 repetitions of any chars other than + and newline, and then a +
\d{8} - 8 digits
(?=\+(?:\d{8})?\+) - that are followed with +, and optional chunk of 8 digits and a +.

Regex for validation of a street number

I'm using an online tool to create contests. In order to send prizes, there's a form in there asking for user information (first name, last name, address,... etc).
There's an option to use regular expressions to validate the data entered in this form.
I'm struggling with the regular expression to put for the street number (I'm located in Belgium).
A street number can be the following:
1234
1234a
1234a12
begins with a number (max 4 digits)
can have letters as well (max 2 char)
Can have numbers after the letter(s) (max3)
I came up with the following expression:
^([0-9]{1,4})([A-Za-z]{1,2})?([0-9]{1,3})?$
But the problem is that as letters and second part of numbers are optional, it allows to enter numbers with up to 8 digits, which is not optimal.
1234 (first group)(no letters in the second group) 5678 (third group)
If one of you can tip me on how to achieve the expected result, it would be greatly appreciated !

You might use this regex:
^\d{1,4}([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|)$
where:
\d{1,4} - 1-4 digits
([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|) - optional group, which can be
[a-zA-Z]{1,2}\d{1,3} - 1-2 letters + 1-3 digits
or
[a-zA-Z]{1,2} - 1-2 letters
or
empty

\d{0,4}[a-zA-Z]{0,2}\d{0,3}
\d{0,4} The first groupe matches a number with 4 digits max
[a-zA-Z]{0,2} The second groupe matches a char with 2 digit in max
\d{0,3} The first groupe matches a number with 3 digits max

You have to keep the last two groups together, not allowing the last one to be present, if the second isn't, e.g.
^\d{1,4}(?:[a-zA-z]{1,2}\d{0,3})?$
or a little less optimized (but showing the approach a bit better)
^\d{1,4}(?:[a-zA-z]{1,2}(?:\d{1,3})?)?$
As you are using this for a validation I assumed that you don't need the capturing groups and replaced them with non-capturing ones.
You might want to change the first number check to [1-9]\d{0,3} to disallow leading zeros.

Thank you so much for your answers ! I tried Sebastian's solution :
^\d{1,4}(?:[a-zA-z]{1,2}\d{0,3})?$
And it works like a charm ! I still don't really understand what the ":" stand for, but I'll try to figure it out next time i have to fiddle with Regex !
Have a nice day,
Stan

The first digit cannot be 0.
There shouldn't be other symbols before and after the number.
So:
^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$
The ?: combination means that the () construction does not create a matching substring.
Here is the regex with tests for it.

Trying to create a regex that allowes following format yyyy[: -][VW]Week number

My regex currently looks like this
\b(19|20)\d{2}\b[- :][VW][0-5]{1}(?(?=[5])[0-2]{1}|[0-9]{1})
It doesn't quite do what I want as I'm trying to get this part
(?(?=[5])[0-2]{1}|[0-9]{1})
to say "If the previous number was 5 then you may only choose between 0-2, and if it's another number 0-4 then choosing between 0-9 is allowed
Currently it allowes 00-59 with an exclusion of 05,15,25,35 etc.
Essentially I want it to look like this for example 2016-W25.

You need to replace [5] with a positive lookbehind (?<=5) in order to check a char to the left of the current location:
\b(19|20)\d{2}[- :][VW][0-5](?(?=(?<=5))[0-2]|[0-9])
^^^^^
See the regex demo
Also, you may get rid of the conditional pattern at all using a mere alternation group:
\b(19|20)\d{2}[- :][VW](?:[0-4][0-9]|5[0-2])
^^^^^^^^^^^^^^^^^^^^^
See this regex demo
The (?:[0-4][0-9]|5[0-2]) matches either a digit from 0 to 4 and then any digit (see [0-4][0-9]), or (see |) a 5 followed with 0, 1 or 2 (see 5[0-2]).
NOTE: Since the number of weeks can amount to 53, the [0-2] at the end might be replaced with [0-3] to also match 53 values.

RegEx: Uk Landlines, Mobile phone numbers

I've been struggling with finding a suitable solution :-
I need an regex expression that will match all UK phone numbers and mobile phones.
So far this one appears to cover most of the UK numbers:
^0\d{2,4}[ -]{1}[\d]{3}[\d -]{1}[\d -]{1}[\d]{1,4}$
However mobile numbers do not work with this regex expression or phone-numbers written in a single solid block such as 01234567890.
Could anyone help me create the required regex expression?

[\d -]{1}
is blatently incorrect: a digit OR a space OR a hyphen.
01000 123456
01000 is not a valid UK area code. 123456 is not a valid local number.
It is important that test data be real area codes and real number ranges.
^\s*(?(020[7,8]{1})?[ ]?[1-9]{1}[0-9{2}[ ]?[0-9]{4})|(0[1-8]{1}[0-9]{3})?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{3})\s*|[0-9]+[ ]?[0-9]+$
The above pattern is garbage for many different reasons.
[7,8] matches 7 or comma or 8. You don't need to match a comma.
London numbers also begin with 3 not just 7 or 8.
London 020 numbers aren't the only 2+8 format numbers; see also 023, 024, 028 and 029.
[1-9]{1} simplifies to [1-9]
[ ]? simplifies to \s?
Having found the intial 0 once, why keep searching for it again and again?
^(0....|0....|0....|0....)$ simplifies to ^0(....|....|....|....)$
Seriously. ([1]|[2]|[3]|[7]){1} simplifies to [1237] here.
UK phone numbers use a variety of formats: 2+8, 3+7, 3+6, 4+6, 4+5, 5+5, 5+4. Some users don't know which format goes with which number range and might use the wrong one on input. Let them do that; you're interested in the DIGITS.
Step 1: Check the input format looks valid
Make sure that the input looks like a UK phone number. Accept various dial prefixes, +44, 011 44, 00 44 with or without parentheses, hyphens or spaces; or national format with a leading 0. Let the user use any format they want for the remainder of the number: (020) 3555 7788 or 00 (44) 203 555 7788 or 02035-557-788 even if it is the wrong format for that particular number. Don't worry about unbalanced parentheses. The important part of the input is making sure it's the correct number of digits. Punctuation and spaces don't matter.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$
The above pattern matches optional opening parentheses, followed by 00 or 011 and optional closing parentheses, followed by an optional space or hyphen, followed by optional opening parentheses. Alternatively, the initial opening parentheses are followed by a literal + without a following space or hyphen. Any of the previous two options are then followed by 44 with optional closing parentheses, followed by optional space or hyphen, followed by optional 0 in optional parentheses, followed by optional space or hyphen, followed by optional opening parentheses (international format). Alternatively, the pattern matches optional initial opening parentheses followed by the 0 trunk code (national format).
The previous part is then followed by the NDC (area code) and the subscriber phone number in 2+8, 3+7, 3+6, 4+6, 4+5, 5+5 or 5+4 format with or without spaces and/or hyphens. This also includes provision for optional closing parentheses and/or optional space or hyphen after where the user thinks the area code ends and the local subscriber number begins. The pattern allows any format to be used with any GB number. The display format must be corrected by later logic if the wrong format for this number has been used by the user on input.
The pattern ends with an optional extension number arranged as an optional space or hyphen followed by x, ext and optional period, or #, followed by the extension number digits. The entire pattern does not bother to check for balanced parentheses as these will be removed from the number in the next step.
At this point you don't care whether the number begins 01 or 07 or something else. You don't care whether it's a valid area code. Later steps will deal with those issues.
Step 2: Extract the NSN so it can be checked in more detail for length and range
After checking the input looks like a GB telephone number using the pattern above, the next step is to extract the NSN part so that it can be checked in greater detail for validity and then formatted in the right way for the applicable number range.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$
Use the above pattern to extract the '44' from $1 to know that international format was used, otherwise assume national format if $1 is null.
Extract the optional extension number details from $3 and store them for later use.
Extract the NSN (including spaces, hyphens and parentheses) from $2.
Step 3: Validate the NSN
Remove the spaces, hyphens and parentheses from $2 and use further RegEx patterns to check the length and range and identify the number type.
These patterns will be much simpler, since they will not have to deal with various dial prefixes or country codes.
The pattern to match valid mobile numbers is therefore as simple as
^7([45789]\d{2}|624)\d{6}$
Premium rate is
^9[018]\d{8}$
There will be a number of other patterns for each number type: landlines, business rate, non-geographic, VoIP, etc.
By breaking the problem into several steps, a very wide range of input formats can be allowed, and the number range and length for the NSN checked in very great detail.
Step 4: Store the number
Once the NSN has been extracted and validated, store the number with country code and all the other digits with no spaces or punctuation, e.g. 442035557788.
Step 5: Format the number for display
Another set of simple rules can be used to format the number with the requisite +44 or 0 added at the beginning.
The rule for numbers beginning 03 is
^44(3\d{2})(\d{3])(\d{4})$
formatted as
0$1 $2 $3 or as +44 $1 $2 $3
and for numbers beginning 02 is
^44(2\d)(\d{4})(\d{4})$
formatted as
(0$1) $2 $3 or as +44 $1 $2 $3
The full list is quite long. I could copy and paste it all into this thread, but it would be hard to maintain that information in multiple places over time. For the present the complete list can be found at: http://aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_GB_Telephone_Numbers

Given that people sometimes write their numbers with spaces in random places, you might be better off ignoring the spaces all together - you could use a regex as simple as this then:
^0(\d ?){10}$
This matches:
01234567890
01234 234567
0121 3423 456
01213 423456
01000 123456
But it would also match:
01 2 3 4 5 6 7 8 9 0
So you may not like it, but it's certainly simpler.

Would this regex do?
// using System.Text.RegularExpressions;
/// <summary>
/// Regular expression built for C# on: Wed, Sep 8, 2010, 06:38:28
/// Using Expresso Version: 3.0.2766, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// [1]: A numbered capture group. [\+44], zero or one repetitions
/// \+44
/// Literal +
/// 44
/// [2]: A numbered capture group. [\s+], zero or one repetitions
/// Whitespace, one or more repetitions
/// [3]: A numbered capture group. [\(?]
/// Literal (, zero or one repetitions
/// [area_code]: A named capture group. [(\d{1,5}|\d{4}\s+?\d{1,2})]
/// [4]: A numbered capture group. [\d{1,5}|\d{4}\s+?\d{1,2}]
/// Select from 2 alternatives
/// Any digit, between 1 and 5 repetitions
/// \d{4}\s+?\d{1,2}
/// Any digit, exactly 4 repetitions
/// Whitespace, one or more repetitions, as few as possible
/// Any digit, between 1 and 2 repetitions
/// [5]: A numbered capture group. [\)?]
/// Literal ), zero or one repetitions
/// [6]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// [tel_no]: A named capture group. [(\d{1,4}(\s+|-)?\d{1,4}|(\d{6}))]
/// [7]: A numbered capture group. [\d{1,4}(\s+|-)?\d{1,4}|(\d{6})]
/// Select from 2 alternatives
/// \d{1,4}(\s+|-)?\d{1,4}
/// Any digit, between 1 and 4 repetitions
/// [8]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// Any digit, between 1 and 4 repetitions
/// [9]: A numbered capture group. [\d{6}]
/// Any digit, exactly 6 repetitions
///
///
/// </summary>
public Regex MyRegex = new Regex(
"(\\+44)?\r\n(\\s+)?\r\n(\\(?)\r\n(?<area_code>(\\d{1,5}|\\d{4}\\s+"+
"?\\d{1,2}))(\\)?)\r\n(\\s+|-)?\r\n(?<tel_no>\r\n(\\d{1,4}\r\n(\\s+|-"+
")?\\d{1,4}\r\n|(\\d{6})\r\n))",
RegexOptions.IgnoreCase
| RegexOptions.Singleline
| RegexOptions.ExplicitCapture
| RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
//// Replace the matched text in the InputText using the replacement pattern
// string result = MyRegex.Replace(InputText,MyRegexReplace);
//// Split the InputText wherever the regex matches
// string[] results = MyRegex.Split(InputText);
//// Capture the first Match, if any, in the InputText
// Match m = MyRegex.Match(InputText);
//// Capture all Matches in the InputText
// MatchCollection ms = MyRegex.Matches(InputText);
//// Test to see if there is a match in the InputText
// bool IsMatch = MyRegex.IsMatch(InputText);
//// Get the names of all the named and numbered capture groups
// string[] GroupNames = MyRegex.GetGroupNames();
//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = MyRegex.GetGroupNumbers();
Notice how the spaces and dashes are optional and can be part of it.. also it is now divided into two capture groups called area_code and tel_no to break it down and easier to extract.

Strip all whitespace and non-numeric characters and then do the test. It'll be musch , much easier than trying to account for all the possible options around brackets, spaces, etc.
Try the following:
#"^(([0]{1})|([\+][4]{2}))([1]|[2]|[3]|[7]){1}\d{8,9}$"
Starts with 0 or +44 (for international) - I;m sure you could add 0044 if you wanted.
It then has a 1, 2, 3 or 7.
It then has either 8 or 9 digits.
If you want to be even smarter, the following may be a useful reference: http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom

It's not a single regex, but there's sample code from Braemoor Software that is simple to follow and fairly thorough.
The JS version is probably easiest to read. It strips out spaces and hyphens (which I realise you said you can't do) then applies a number of positive and negative regexp checks.

Start by stripping the non-numerics, excepting a + as the first character.
(Javascript)
var tel=document.getElementById("tel").value;
tel.substr(0,1).replace(/[^+0-9]/g,'')+tel.substr(1).replace(/[^0-9]/g,'')
The regex below allows, after the international indicator +, any combination of between 7 and 15 digits (the ITU maximum) UNLESS the code is +44 (UK). Otherwise if the string either begins with +44, +440 or 0, it is followed by 2 or 7 and then by nine of any digit, or it is followed by 1, then any digit except 0, then either seven or eight of any digit. (So 0203 is valid, 0703 is valid but 0103 is not valid). There is currently no such code as 025 (or in London 0205), but those could one day be allocated.
/(^\+(?!44)[0-9]{7,15}$)|(^(\+440?|0)(([27][0-9]{9}$)|(1[1-9][0-9]{7,8}$)))/
Its primary purpose is to identify a correct starting digit for a non-corporate number, followed by the correct number of digits to follow. It doesn't deduce if the subscriber's local number is 5, 6, 7 or 8 digits. It does not enforce the prohibition on initial '1' or '0' in the subscriber number, about which I can't find any information as to whether those old rules are still enforced. UK phone rules are not enforced on properly formatted international phone numbers from outside the UK.

After a long search for valid regexen to cover UK cases, I found that the best way (if you're using client side javascript) to validate UK phone numbers is to use libphonenumber-js along with custom config to reduce bundle size:
If you're using NodeJS, generate UK metadata by running:
npx libphonenumber-metadata-generator metadata.custom.json --countries GB --extended
then import and use the metadata with libphonenumber-js/core:
import { isValidPhoneNumber } from "libphonenumber-js/core";
import data from "./metadata.custom.json";
isValidPhoneNumber("01234567890", "GB", data);
CodeSandbox Example

Custom RegEx expression for validating different possibilities of phone number entries?

I'm looking for a custom RegEx expression (that works!) to will validate common phone number with area code entries (no country code) such as:
111-111-1111
(111) 111-1111
(111)111-1111
111 111 1111
111.111.1111
1111111111
And combinations of these / anything else I may have forgotton.
Also, is it possible to have the RegEx expression itself reformat the entry? So take the 1111111111 and put it in 111-111-1111 format. The regex will most likely be entered in a Joomla / some type of CMS module, so I can't really add code to it aside from the expression itself.

\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})
will match all your examples; after a match, backreference 1 will contain the area code, backreference 2 and 3 will contain the phone number.
I hope you don't need to handle international phone numbers, too.
If the phone number is in a string by itself, you could also use
^\s*\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})\s*$
allowing for leading/trailing whitespace and nothing else.

Why not just remove spaces, parenthesis, dashes, and periods, then check that it is a number of 10 digits?

Depending on the language in question, you might be better off using a replace-like statement to replace non-numeric characters: ()-/. with nothing, and then just check if what is left is a 10-digit number.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex Select & Replace to Clean Up US phone numbers - regex

Related

matching numbers after nth occurence of a certain symbol in a line

Regex for validation of a street number

Trying to create a regex that allowes following format yyyy[: -][VW]Week number

RegEx: Uk Landlines, Mobile phone numbers

Custom RegEx expression for validating different possibilities of phone number entries?

Categories

Resources