Removing numbers before - regex

I have a dataset of a customer list. The first column of type factor (Kunden.Nr..Kurzname) has always a number (ranges from 1 to 4 digits) before the actual customer name, that I would like to remove. The data set currently looks like this:
Kunden.Nr..Kurzname Name..Vorname Adresse Postfach PLZ
1 1529 33ER TAXI AG 33er Taxi AG Jägerstrasse 5 <NA> 4016
2 2384 4EYES GMBH 4eyes GmbH Grubenweg 25 <NA> 4153
3 1548 A. SCHULMANN AG A. Schulmann AG Kernstrasse 10 <NA> 8004
4 3427 AAA DENT AG AAA Dent AG Die Zahnärzte.ch Centralbahnstrasse 20 4051
5 555 AARE SEELAND MOB Aare Seeland mobil AG Hauptstrasse 93 <NA> 2560
6 856 AASTRA TELECOM S Aastra Telecom Schweiz AG Schulhausgasse 24 <NA> 3113
And I would like to have it like this:
Kunden.Nr..Kurzname Name..Vorname Adresse Postfach PLZ
1 33ER TAXI AG 33er Taxi AG Jägerstrasse 5 <NA> 4016
2 4EYES GMBH 4eyes GmbH Grubenweg 25 <NA> 4153
3 A. SCHULMANN AG A. Schulmann AG Kernstrasse 10 <NA> 8004
4 AAA DENT AG AAA Dent AG Die Zahnärzte.ch Centralbahnstrasse 20 4051
5 AARE SEELAND MOB Aare Seeland mobil AG Hauptstrasse 93 <NA> 2560
6 AASTRA TELECOM S Aastra Telecom Schweiz AG Schulhausgasse 24 <NA> 3113
Basically, I would need to remove everything before and including the first space. Figured out that I probably have to use "gsub", but unfortunately I haven't used R for a long time. Help is highly appreciated.

I would like to suggest making use of groups:
gsub("^(\\d+)([[:space:]])(.+)$","\\3",x)
For example:
> x <- c("1529 33ER TAXI AG", "2384 4EYES GMBH")
> gsub("^(\\d+)([[:space:]])(.+)$","\\3",x)
[1] "33ER TAXI AG" "4EYES GMBH"
Demos
Regex101
Ideone
Explanation
Courtesy of regex101.com.

All the answers before are kind of overloaded. Here is a suggestion, that is somewhat straightforward and does everything like you asked.
DF <- #your data.frame
FindFirstSpace <- regexpr(" ", DF$Kunden.Nr..Kurzname, fixed = TRUE)
DF$Kunden.Nr..Kurzname <- substr(DF$Kunden.Nr..Kurzname, FindFirstSpace + 1, 1000)
regexpr returns the first instance of " " from your character vector. Note that regexpr is made for finding expressions "like" your pattern. But fixed = TRUE makes the search specific.
Then take the Substring from after the first space. For stop value you can take any number big enough.

You can simply do gsub("^[0-9]{1,4}\\s","",df$Kunden.Nr..Kurzname)

Related

Regex for converting spaces to tabs but leaving word items in the middle alone?

I have a problem that my Googling tells me can be solved with Regex, but I'm completely unfamiliar and I tried following some tutorials but I'm entirely lost. I have this sample data set:
59 65 21366 CLEMENTINES 4.89 2.00 9.78
59 61 22384 PORK BACK RIBS 6.50 2.40 15.59
59 65 30669 BANANAS 1.89 1.00 1.89
59 13 391314 KODIAK POWER CAKES 14.69 1.00 14.69
59 65 392373 BAJA CHOPPED SALAD KIT 2.99 1.00 2.99
59 39 429227 FILA MENS ANKLE SOCK 6PK 9.99 1.00 9.99
59 65 1056187 ASIAN CASHEW SALAD KIT 2.99 1.00 2.99
59 28 1159696 SHOPKINS GG/TWOZIES ASST 5.97 1.00 5.97
59 13 1221327 KODIAK POWER CAKES -3.00 -3.00 COUPON
59 14 1270070 KLEENEX ULTRA SOFT 12 PCK 16.49 1.00 16.49
59 21 5221111 10 DRAWER STORAGE CART 29.99 1.00 29.99
59 17 1019 HALF + HALF 1 L 1.99 1.00 1.99
I want to import it into a spreadsheet. Visually I can see what I want (3 numeric columns at the beginning, then a description that may or may not contain spaces, then usually 3 numeric columns, but sometimes 2 + a word (see the line that ends in "coupon").
But because of the spaces and lack of quotes, my Excel skills (which are also marginal) don't allow me to import this in a sensible way.
I thought of doing multiple processes: pull off the 3 columns at the left and then 3 columns at the right... but in Excel I see no way to operate "from the right".
Any help appreciated. Thanks.
[edit] I realize from the comments that my ignorance has resulted in a poor question.
I didn't realize "Regex" was specific to language, etc. I am trying to import a csv into Excel, but I was using Notepad++ to perform the regex operations. I don't know what "flavor" that uses but the answer below helped greatly.
You can match this with:
^(\S*) (\S*) (\S*) (.*) (\S*) (\S*) (\S*)$
^ matches the start of a line
\S* matches one or more non-whitespace characters
.* matches anything, including spaces
the parentheses capture the matches into capture groups
$ matches the end of a line.
You haven't said what tool you intend to use to do this.
One way is with a Perl one-liner:
perl -pe 's/^(\S*) (\S*) (\S*) (.*) (\S*) (\S*) (\S*)$/"\1","\2","\3","\4","\5","\6","\7"/' input.txt
Returning:
"59","65","21366","CLEMENTINES","4.89","2.00","9.78"
...
"59","13","1221327","KODIAK POWER CAKES","-3.00","-3.00","COUPON"
... etc.

Replace Value & Shift Data Frame If Certain Condition Met

I've scraped data from a source online to create a data frame (df1) with n rows of information pertaining to individuals. It comes in as a single string, and I split the words apart into appropriate columns.
90% of the information is correctly formatted to the proper number of columns in a data frame (6) - however, once in a while there is a row of data with an extra word that is located in the spot of the 4th word from the start of the string. Those lines now have 7 columns and are off-set from everything else in the data frame.
Here is an example:
Num Last-Name First-Name Cat. DOB Location
11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA
df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
"2 Pearl, Sam R 1986-11-04 UK",
"5 Livingston, Steph LL 1983-12-12 USA",
"7 Thornton, Mark LR 1982-03-26 USA",
"10 Silver, John RED LL 1983-09-14 USA")
You can see item #10 has an extra input added, the color "RED" is inserted into the middle of the string.
I started to run code that used stringr to evaluate how many characters were present in the 4th word, and if it was 3 or greater (every value that will be in the Cat. column is is 1-2 characters), I created a new column at the end of the data frame, assigned the value to it, and if there was no value (i.e. it evaluates to FALSE), input NA. I'm sure I could likely create a massive nested ifelse statement in a dplyr mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:
Num Last-Name First-Name Cat. DOB Location Color
11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED
I want to find the instances where the 4th word from the start of the string is 3 characters or longer, assign that word or value to a new column at the end of the data frame, and shift the corresponding values in the row to the left to properly align with the others rows of data.
here's a simpler way:
input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\\n", "\\1 NA\n", input, TRUE)
the first gsub transposes colors to the end of the string. the second gsub makes use of the fact that unchanged lines will now end with a date and country-code (not a country-code and a color), and simply adds an "NA" to them.
IDEone demo
We could use gsub to remove the extra substrings
v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
"\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE,
col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
d1$Color <- trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
"", trimws(df1)))
d1
# Num LastName FirstName Cat DOB Location Color
#1 11 Jackson Adam L 1982-06-15 USA
#2 2 Pearl Sam R 1986-11-04 UK
#3 5 Livingston Steph LL 1983-12-12 USA
#4 7 Thornton Mark LR 1982-03-26 USA
#5 10 Silver John LL 1983-09-14 USA RED
Using strsplit instead of regex:
# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = T)
# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
# if number of items in row is 6, insert NA, else rearrange
if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))
# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")
df2
# Num Last-Name First-Name Cat. DOB Location Color
# 1 11 Jackson Adam L 1982-06-15 USA <NA>
# 2 2 Pearl Sam R 1986-11-04 UK <NA>
# 3 5 Livingston Steph LL 1983-12-12 USA <NA>
# 4 7 Thornton Mark LR 1982-03-26 USA <NA>
# 5 10 Silver John LL 1983-09-14 USA RED

Separate numbers and words in a string

I have a string variable in Stata, that contains entries such as:
110xyz 43 abc
110xyz 44 abc
111 xyz 56 abc
The key is that sometimes the first set of numbers (anything from 1 to 5 digits long) is not followed by a space before the next word begins (and that can be anything from 1 to 50 letters long). Is there a neat way to insert a blank to separate number and word? (110 xyz would be the desired result in the above example) I triedregexr(), but that doesn't help.
What went wrong with regexr()? It seems exactly the tool to use, so perhaps it's regular expressions you're having trouble with.
My Stata syntax might be a little off, but try something like this:
gen string = "110xyz 43 abc"
gen number = regexs(1) if regexm(string, "^([0-9]+) *([a-zA-Z].*)")
gen remain = regexs(2) if regexm(string, "^([0-9]+) *([a-zA-Z].*)")
gen fixed = number + " " + remain
This can be simplified, but I'm starting you off with what I think has the highest probability of working, e.g. guaranteed match, because I don't know what kind of null value results if regexm() didn't match.
moss from SSC offers a convenience wrapper for Stata's regular expression functions. Demonstration:
clear
input str16 test
"110xyz 43 abc"
"110xyz 44 abc"
"111 xyz 56 abc"
end
ssc inst moss
help moss
moss test, match("([0-9]+)") regex pre(num)
moss test, match("([a-z]+)") regex pre(word)
list test *match*
+------------------------------------------------------------+
| test nummat~1 nummat~2 wordma~1 wordma~2 |
|------------------------------------------------------------|
1. | 110xyz 43 abc 110 43 xyz abc |
2. | 110xyz 44 abc 110 44 xyz abc |
3. | 111 xyz 56 abc 111 56 xyz abc |
+------------------------------------------------------------+

How to replace specific characters of a string with tab in R

Having a data frame with a string in each row, I need to replace n'th character into tab. Moreover, there are an inconstant number of spaces before m'th character that I need to convert to tab as well.
For instance having following row:
"00001 000 0 John Smith"
I need to replace the 6th character (space) into tab and replace the spaces between John and Smith into tab as well. For all the rows the last word (Smith) starts from 75th character. So, basically I need to replace all spaces before 78th character into tab.
I need the above row as follows:
"00001<Tab>000 0 John<Tab>Smith"
Thanks for the help.
You could use gsub here.
x <- c('00001 000 0 John Smith',
'00002 000 1 Josh Black',
'00003 000 2 Jane Smith',
'00004 000 3 Jeff Smith')
x <- gsub("(?<=[0-9]{5}) |(?<!\\d) +(?=(?i:[a-z]))", "\t", x, perl=T)
Output
[1] "00001\t000 0 John\tSmith" "00002\t000 1 Josh\tBlack"
[3] "00003\t000 2 Jane\tSmith" "00004\t000 3 Jeff\tSmith"
To actually see the \t in output use cat(x)
00001 000 0 John Smith
00002 000 1 Josh Black
00003 000 2 Jane Smith
00004 000 3 Jeff Smith
Here's one solution if it always starts at 75. First some sample data
#sample data
a <- "00001 000 0 John Smith"
b <- "00001 000 0 John Smith"
Now since you know positions, i'll use substr. To extract the parts, then i'll trim the middle, then you can paste in the tabs.
#extract parts
part1<-substr(c(a,b), 1, 5)
part2<-gsub("\\s*$","",substr(c(a,b), 7, 74))
part3<-substr(c(a,b), 75, 10000L)
#add in tabs
paste(part1, part2, part3, sep="\t")

Regular expression to match standard 10 digit phone number

I want to write a regular expression for a standard US type phone number that supports the following formats:
###-###-####
(###) ###-####
### ### ####
###.###.####
where # means any number. So far I came up with the following expressions
^[1-9]\d{2}-\d{3}-\d{4}
^\(\d{3}\)\s\d{3}-\d{4}
^[1-9]\d{2}\s\d{3}\s\d{4}
^[1-9]\d{2}\.\d{3}\.\d{4}
respectively. I am not quite sure if the last one is correct for the dotted check. I also want to know if there is any way I could write a single expression instead of the 4 different ones that cater to the different formats I mentioned. If so, I am not sure how do I do that. And also how do I modify the expression/expressions so that I can also include a condition to support the area code as optional component. Something like
+1 ### ### ####
where +1 is the area code and it is optional.
^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Matches the following
123-456-7890
(123) 456-7890
123 456 7890
123.456.7890
+91 (123) 456-7890
If you do not want a match on non-US numbers use
^(\+0?1\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Update :
As noticed by user Simon Weaver below, if you are also interested in matching on unformatted numbers just make the separator character class optional as [\s.-]?
^(\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
https://regex101.com/r/j48BZs/2
There are many variations possible for this problem. Here is a regular expression similar to an answer I previously placed on SO.
^\s*(?:\+?(\d{1,3}))?[-. (]*(\d{3})[-. )]*(\d{3})[-. ]*(\d{4})(?: *x(\d+))?\s*$
It would match the following examples and much more:
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
800 555 1234x5678
8005551234 x5678
1 800 555-1234
1----800----555-1234
Regardless of the way the phone number is entered, the capture groups can be used to breakdown the phone number so you can process it in your code.
Group1: Country Code (ex: 1 or 86)
Group2: Area Code (ex: 800)
Group3: Exchange (ex: 555)
Group4: Subscriber Number (ex: 1234)
Group5: Extension (ex: 5678)
Here is a breakdown of the expression if you're interested:
^\s* #Line start, match any whitespaces at the beginning if any.
(?:\+?(\d{1,3}))? #GROUP 1: The country code. Optional.
[-. (]* #Allow certain non numeric characters that may appear between the Country Code and the Area Code.
(\d{3}) #GROUP 2: The Area Code. Required.
[-. )]* #Allow certain non numeric characters that may appear between the Area Code and the Exchange number.
(\d{3}) #GROUP 3: The Exchange number. Required.
[-. ]* #Allow certain non numeric characters that may appear between the Exchange number and the Subscriber number.
(\d{4}) #Group 4: The Subscriber Number. Required.
(?: *x(\d+))? #Group 5: The Extension number. Optional.
\s*$ #Match any ending whitespaces if any and the end of string.
To make the Area Code optional, just add a question mark after the (\d{3}) for the area code.
^(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
Matches these phone numbers:
1-718-444-1122
718-444-1122
(718)-444-1122
17184441122
7184441122
718.444.1122
1718.444.1122
1-123-456-7890
1 123-456-7890
1 (123) 456-7890
1 123 456 7890
1.123.456.7890
+91 (123) 456-7890
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
18001234567
1 800 123 4567
1-800-123-4567
+18001234567
+1 800 123 4567
+1 (800) 123 4567
1(800)1234567
+1800 1234567
1.8001234567
1.800.123.4567
+1 (800) 123-4567
18001234567
1 800 123 4567
+1 800 123-4567
+86 800 123 4567
1-800-123-4567
1 (800) 123-4567
(800)123-4567
(800) 123-4567
(800)1234567
800-123-4567
800.123.4567
1231231231
123-1231231
123123-1231
123-123 1231
123 123-1231
123-123-1231
(123)123-1231
(123)123 1231
(123) 123-1231
(123) 123 1231
+99 1234567890
+991234567890
(555) 444-6789
555-444-6789
555.444.6789
555 444 6789
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1.800.555.1234
+1.800.555.1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
(003) 555-1212
(103) 555-1212
(911) 555-1212
18005551234
1 800 555 1234
+86 800-555-1234
1 (800) 555-1234
See regex101.com
Regex pattern to validate a regular 10 digit phone number plus optional international code (1 to 3 digits) and optional extension number (any number of digits):
/(\+\d{1,3}\s?)?((\(\d{3}\)\s?)|(\d{3})(\s|-?))(\d{3}(\s|-?))(\d{4})(\s?(([E|e]xt[:|.|]?)|x|X)(\s?\d+))?/g
Demo: https://www.regextester.com/103299
Valid entries:
/* Full number */
+999 (999) 999-9999 Ext. 99999
/* Regular local phone number (XXX) XXX-XXXX */
1231231231
123-1231231
123123-1231
123-123 1231
123 123-1231
123-123-1231
(123)123-1231
(123)123 1231
(123) 123-1231
(123) 123 1231
/* International codes +XXX (XXX) XXX-XXXX */
+99 1234567890
+991234567890
/* Extensions (XXX) XXX-XXXX Ext. XXX... */
1234567890 Ext 1123123
1234567890Ext 1123123
1234567890 Ext1123123
1234567890Ext1123123
1234567890 Ext: 1123123
1234567890Ext: 1123123
1234567890 Ext:1123123
1234567890Ext:1123123
1234567890 Ext. 1123123
1234567890Ext. 1123123
1234567890 Ext.1123123
1234567890Ext.1123123
1234567890 ext 1123123
1234567890ext 1123123
1234567890 ext1123123
1234567890ext1123123
1234567890 ext: 1123123
1234567890ext: 1123123
1234567890 ext:1123123
1234567890ext:1123123
1234567890 X 1123123
1234567890X1123123
1234567890X 1123123
1234567890 X1123123
1234567890 x 1123123
1234567890x1123123
1234567890 x1123123
1234567890x 1123123
Here's a fairly compact one I created.
Search: \+?1?\s*\(?-*\.*(\d{3})\)?\.*-*\s*(\d{3})\.*-*\s*(\d{4})$
Replace: +1 \($1\) $2-$3
Tested against the following use cases.
18001234567
1 800 123 4567
1-800-123-4567
+18001234567
+1 800 123 4567
+1 (800) 123 4567
1(800)1234567
+1800 1234567
1.8001234567
1.800.123.4567
1--800--123--4567
+1 (800) 123-4567
Adding up an example using above mentioned solutions on jsfiddle.
I have modified the code a bit as per my clients requirement. Hope this also helps someone.
/^\s*(?:\+?(\d{1,3}))?[- (]*(\d{3})[- )]*(\d{3})[- ]*(\d{4})(?: *[x/#]{1}(\d+))?\s*$/
See Example Here
Phone number regex that I use:
/^[+]?(\d{1,2})?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$/
Covers:
18001234567
1 800 123 4567
+1 800 123-4567
+86 800 123 4567
1-800-123-4567
1 (800) 123-4567
(800)123-4567
(800) 123-4567
(800)1234567
800-123-4567
800.123.4567
try this for Pakistani users .Here's a fairly compact one I created.
((\+92)|0)[.\- ]?[0-9][.\- ]?[0-9][.\- ]?[0-9]
Tested against the following use cases.
+92 -345 -123 -4567
+92 333 123 4567
+92 300 123 4567
+92 321 123 -4567
+92 345 - 540 - 5883
Starting with #Ravi's answer, I also applied some validation rules for the NPA (Area) Code.
In particular:
It should start with a 2 (or higher)
It cannot have "11" as the second and third digits (N11).
There are a couple other restrictions, including reserved blocks (N9X, 37X, 96X) and 555, but I left those out, particularly because the reserved blocks may see future use, and 555 is useful for testing.
This is what I came up with:
^((\+\d{1,2}|1)[\s.-]?)?\(?[2-9](?!11)\d{2}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
Alternately, if you also want to match blank values (if the field isn't required), you can use:
(^((\+\d{1,2}|1)[\s.-]?)?\(?[2-9](?!11)\d{2}\)?[\s.-]?\d{3}[\s.-]?\d{4}$|^$)
My test cases for valid numbers (many from #Francis' answer) are:
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1.800.555.1234
+1.800.555.1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
My invalid test cases include:
(003) 555-1212 // Area code starts with 0
(103) 555-1212 // Area code starts with 1
(911) 555-1212 // Area code ends with 11
180055512345 // Too many digits
1 800 5555 1234 // Prefix code too long
+1 800 555x1234 // Invalid delimiter
+867 800 555 1234 // Country code too long
1-800-555-1234p // Invalid character
1 (800) 555-1234 // Too many spaces
800x555x1234 // Invalid delimiter
86 800 555 1212 // Non-NA country code doesn't have +
My regular expression does not include grouping to extract the digit groups, but it can be modified to include those.
I find this regular expression most useful for me for 10 digit contact number :
^(?:(?:\+|0{0,2})91(\s*[\-]\s*)?|[0]?)?[789]\d{9}$
Reference: https://regex101.com/r/QeQewP/1
Explanation:
Perhaps the easiest one compare to several others.
\(?\d+\)?[-.\s]?\d+[-.\s]?\d+
It matches the following:
(555) 444-6789
555-444-6789
555.444.6789
555 444 6789
The expressions for 1, 3 and 4 are quite similar, so you can use:
^([1-9]\d{2})([- .])(\d{3})$2(\d{4})$
Note that, depending on the language and brand of regexes used, you might need to put \2 instead of $2 or such matching might not be supported at all.
I see no good way to combine this with the format 2, apart from the obvious ^(regex for 1,3,4|regex for 2)$ which is ugly, clumsy and makes it hard to get out the parts of the numbers.
As for the area code, you can add (\+\d)? to the beginning to capture a single-digit area code (sorry, I don't know the format of your area codes).
How about this?
^(\+?[01])?[-.\s]?\(?[1-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}
EDIT: I forgot about the () one.
EDIT 2: Got the first 3 digit part wrong.
This code will match a US or Canadian phone number, and will also make sure that it is a valid area code and exchange:
^((\+1)?[\s-]?)?\(?[2-9]\d\d\)?[\s-]?[2-9]\d\d[\s-]?\d\d\d\d
Test on Regex101.com
This is my Regex the worked on US numbers in the FreeCodeCamp phone number challenge:
/^\d{3}(-|\s)\d{3}(-|\s)\d{4}$|^\d{10}$|^1\s\d{3}(-|\s)\d{3}(-|\s)\d{4}$|^(1\s?)?\(\d{3}\)(\s|\-)?\d{3}\-\d{4}$/
Matches:
555-555-5555
(555)555-5555
(555) 555-5555
555 555 5555
5555555555
1 555 555 5555 etc
Above regex is a slight modification of #Francis Gagnon.
Objective : To detect any possible pattern a user can share their US phone number
Version 1:
^\s*(?:\+?(\d{1,3}))?[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d[\W\D\s]*?\d)(?: *x(\d+))?\s*$
Test it over here Codepen: https://codepen.io/kiranbhattarai/pen/NWKMXQO
Explanation of the regex : https://regexr.com/4kt5j
Version 2:
\s*(?:\+?(\d{1,3}))?[\W\D\s]^|()*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d[\W\D\s]*?\d)(?: *x(\d+))?\s*$
What is in it: The test cases can be a part of the string. In version one the test cases should be a start of a line to work.
Codepen: https://codepen.io/kiranbhattarai/pen/GRKGNGG
Explanation of the regex : https://regexr.com/4kt9n
If you can find a pattern that can fail please do comment i will fix it.
Test Cases: Pass
8 0 0 4 4 4 5 55 5
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
800 555 1234x5678
8005551234 x5678
1 800 555-1234
1----800----555-1234
800 (555) 1234
800(555)1234
8 0 0 5 5 5 1 2 3 4
8.0.0.5.5.5.1.2.3.4
8-0-0-5-5-5-1-2-3-4
(8)005551234
(80)05551234
8(00)5551234
8#0#0#5551234
8/0/0/5/5/5/1/2/3/4
8*0*0*5*5*5*1*2*3*4
8:0:0:5:5:5:1:2:3:4
8,0,0,5,5,5,1,2,3,4
800,555,1234
800:555:1234
1-718-444-1122
718-444-1122
(718)-444-1122
17184441122
7184441122
718.444.1122
1718.444.1122
1-123-456-7890
1 123-456-7890
1 (123) 456-7890
1 123 456 7890
1.123.456.7890
+91 (123) 456-7890
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
18001234567
1 800 123 4567
1-800-123-4567
+18001234567
+1 800 123 4567
+1 (800) 123 4567
1(800)1234567
+1800 1234567
1.8001234567
1.800.123.4567
+1 (800) 123-4567
18001234567
1 800 123 4567
+1 800 123-4567
+86 800 123 4567
1-800-123-4567
1 (800) 123-4567
(800)123-4567
(800) 123-4567
(800)1234567
800-123-4567
800.123.4567
1231231231
123-1231231
123123-1231
123-123 1231
123 123-1231
123-123-1231
(123)123-1231
(123) 123-1231
(123) 123 1231
+99 1234567890
+991234567890
(555) 444-6789
555-444-6789
555.444.6789
555 444 6789
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1.800.555.1234
+1.800.555.1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
(003) 555-1212
(103) 555-1212
(911) 555-1212
18005551234
1 800 555 1234
+86 800-555-1234
1 (800) 555-1234
I'm just throwing this answer in there since it solves a problem of mine, it's based off of #stormy's answer, but includes 3 digit country codes and more importantly can be used anywhere in a string, but won't match is it's not preceded by a space/start of the string and ending with a word boundary. This is useful so that it won't match random numbers in the middle of a URL or something
((?:\s|^)(?:\+\d{1,3}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})(?:\b)
Here's a regex that matches North American numbers as well as international numbers such as for middle east.
^((\+|0{0,2})([0-9]){1,3})?[-.●\s]?\(?([0-9]{2,3})\)?[-.●\s]?([0-9]{3})[-.●\s]?([0-9]{4})$
I know this doesn't answer OP's question directly but if you are asking the same question as OP there is a good chance your are looking for a way to validate and store a phone number in either state or a database. Instead of trying to detect every possible combination of character that could be a phone number you might find it easier to break this task into multiple steps.
strip out all none numbers
strip out leading 1s
make sure the number is at most 10 digits
Javascript pseudo example assuming "phone" is user input stored as a string:
phone.replace(/\D/g, "")
phone.replace(/^1+/g, "")
phone.slice(0, 10)
phone.length === 10 ? "do something" : "don't do something"
Code above will need to be tweaked for your purposes and is left as simple as possible for none javascript readers.
For presentation purposes you can always layer dashes and leading 1s back in later but for storage you should probable only keep the actual numbers. This approach also has the added advantage of leaving you with some easy to digest regular expressions.
^(\+1)?\s?(\([1-9]\d{2}\)|[1-9]\d{2})(-|\s|.)\d{3}(-|\s|.)\d{4}
This is a more comprehensive version that will match as much as I can think of as well as give you group matching for country, region, first, and last.
(?<number>(\+?(?<country>(\d{1,3}))(\s|-|\.)?)?(\(?(?<region>(\d{3}))\)?(\s|-|\.)?)((?<first>(\d{3}))(\s|-|\.)?)((?<last>(\d{4}))))
what about multiple numbers with "+" and seperate them with ";" "," "-" or " " characters?
I ended up with
const regexBase = '(?:\\+?(\\d{1,3}))?[-. (]*(\\d{3})?[-. )]*(\\d{3})[-. ]*(\\d{4,5})(?: *x(\\d+))?';
const phoneRegex = new RegExp('\\s*' + regexBase + '\\s*', 'g');
this was to allow for things like dutch numbers, for example
+358 300 20200