How to replace specific characters of a string with tab in R - regex

I have a data frame with a string in each row, and I need to replace the n'th character with a tab. In addition, there is a varying number of spaces before the m'th character that I need to convert to a tab as well.
For instance, given the following row:
"00001 000 0 John Smith"
I need to replace the 6th character (a space) with a tab, and replace the spaces between John and Smith with a tab as well. In all rows the last word (Smith) starts at the 75th character. So, basically, I need to replace all the spaces before the 78th character with a tab.
I need the above row as follows:
"00001<Tab>000 0 John<Tab>Smith"
Thanks for the help.

You could use gsub here.
x <- c('00001 000 0 John Smith',
'00002 000 1 Josh Black',
'00003 000 2 Jane Smith',
'00004 000 3 Jeff Smith')
x <- gsub("(?<=[0-9]{5}) |(?<!\\d) +(?=(?i:[a-z]))", "\t", x, perl=TRUE)
Output
[1] "00001\t000 0 John\tSmith" "00002\t000 1 Josh\tBlack"
[3] "00003\t000 2 Jane\tSmith" "00004\t000 3 Jeff\tSmith"
To actually see the tabs in the output, one string per line, use cat(x, sep = "\n")
00001 000 0 John Smith
00002 000 1 Josh Black
00003 000 2 Jane Smith
00004 000 3 Jeff Smith

Here's one solution if it always starts at 75. First some sample data
#sample data
a <- "00001 000 0 John Smith"
b <- "00001 000 0 John Smith"
Since you know the positions, I'll use substr to extract the parts, then trim the middle part; after that you can paste the pieces back together with tabs.
#extract parts
part1<-substr(c(a,b), 1, 5)
part2<-gsub("\\s*$","",substr(c(a,b), 7, 74))
part3<-substr(c(a,b), 75, 10000L)
#add in tabs
paste(part1, part2, part3, sep="\t")

Related

Replace Value & Shift Data Frame If Certain Condition Met

I've scraped data from a source online to create a data frame (df1) with n rows of information pertaining to individuals. It comes in as a single string, and I split the words apart into appropriate columns.
90% of the information is correctly formatted to the proper number of columns in a data frame (6) - however, once in a while there is a row of data with an extra word that is located in the spot of the 4th word from the start of the string. Those lines now have 7 columns and are off-set from everything else in the data frame.
Here is an example:
Num Last-Name First-Name Cat. DOB Location
11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA
df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
"2 Pearl, Sam R 1986-11-04 UK",
"5 Livingston, Steph LL 1983-12-12 USA",
"7 Thornton, Mark LR 1982-03-26 USA",
"10 Silver, John RED LL 1983-09-14 USA")
You can see item #10 has an extra input added, the color "RED" is inserted into the middle of the string.
I started to write code that used stringr to evaluate how many characters were present in the 4th word, and if it was 3 or greater (every value that will be in the Cat. column is 1-2 characters), I created a new column at the end of the data frame and assigned the value to it; if there was no value (i.e. it evaluates to FALSE), I input NA. I'm sure I could create a massive nested ifelse statement in a dplyr mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:
Num Last-Name First-Name Cat. DOB Location Color
11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED
I want to find the instances where the 4th word from the start of the string is 3 characters or longer, assign that word to a new column at the end of the data frame, and shift the remaining values in the row to the left so they align with the other rows of data.
Here's a simpler way:
input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\\n", "\\1 NA\n", input, TRUE)
The first gsub moves the color to the end of the string. The second gsub makes use of the fact that unchanged lines will now end with a date and a country code (not a country code and a color), and simply appends "NA" to them.
IDEone demo
We could use gsub to remove the extra substrings
v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
"\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE,
col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
d1$Color <- trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
"", trimws(df1)))
d1
# Num LastName FirstName Cat DOB Location Color
#1 11 Jackson Adam L 1982-06-15 USA
#2 2 Pearl Sam R 1986-11-04 UK
#3 5 Livingston Steph LL 1983-12-12 USA
#4 7 Thornton Mark LR 1982-03-26 USA
#5 10 Silver John LL 1983-09-14 USA RED
Using strsplit instead of regex:
# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = TRUE)
# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
# if number of items in row is 6, insert NA, else rearrange
if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))
# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")
df2
# Num Last-Name First-Name Cat. DOB Location Color
# 1 11 Jackson Adam L 1982-06-15 USA <NA>
# 2 2 Pearl Sam R 1986-11-04 UK <NA>
# 3 5 Livingston Steph LL 1983-12-12 USA <NA>
# 4 7 Thornton Mark LR 1982-03-26 USA <NA>
# 5 10 Silver John LL 1983-09-14 USA RED

Regex extraction of text data between 2 commas in R

I have a bunch of text in a dataframe (df) that usually contains three lines of an address in 1 column and my goal is to extract the district (central part of the text), eg:
73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London
Fortunately for me, in 95% of cases the person inputting the data has used commas to separate the text I want, which 100% of the time ends with ", London" (i.e. comma, space, London). To state things clearly, my goal is to extract the text BEFORE ", London" and AFTER the preceding comma.
My desired output is:
Wandsworth
Lambeth
I can manage to extract the part before:
df$extraction <- sub('.*,\\s*','',address)
and after
df$extraction <- sub('.*,\\s*','',address)
But not the middle part that I need. Can someone please help?
Many Thanks!
You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.
address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London"
)
read.csv(text = address, colClasses = c("NULL", "character", "NULL"),
header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Or we could use fread(). Its select argument is nice and it strips white space automatically.
data.table::fread(paste(address, collapse = "\n"),
select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Here are a couple of approaches:
# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"
Or
# target the whole string, but use a capture group
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"
Here are two options that aren't dependent on the city name being the same. The first uses a regex pattern with stringr::str_extract():
raw_address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London",
"Street, District, City"
)
df <- data.frame(raw_address, stringsAsFactors = FALSE)
df$district <- stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)')
> df
raw_address district
1 73 Greenhill Gardens, Wandsworth, London Wandsworth
2 22 Acacia Heights, Lambeth, London Lambeth
3 Street, District, City District
The second uses strsplit() and makes getting the other elements of the address easier:
df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1)
df$district <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)
> df
raw_address address district city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2 22 Acacia Heights, Lambeth, London 22 Acacia Heights Lambeth London
3 Street, District, City Street District City
The split is done on ,\\s* in case there is no space or are multiple spaces after a comma.
You could try this:
(?<=, )(.+?),
It works with any data set; the location doesn't have to be in London.

Regular expression to match standard 10 digit phone number

I want to write a regular expression for a standard US type phone number that supports the following formats:
###-###-####
(###) ###-####
### ### ####
###.###.####
where # means any number. So far I came up with the following expressions
^[1-9]\d{2}-\d{3}-\d{4}
^\(\d{3}\)\s\d{3}-\d{4}
^[1-9]\d{2}\s\d{3}\s\d{4}
^[1-9]\d{2}\.\d{3}\.\d{4}
respectively. I am not quite sure if the last one is correct for the dotted check. I also want to know if there is any way I could write a single expression instead of the 4 different ones that cater to the different formats I mentioned. If so, I am not sure how to do that. Also, how do I modify the expression(s) so that the area code is supported as an optional component? Something like
+1 ### ### ####
where +1 is the area code and it is optional.
^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Matches the following
123-456-7890
(123) 456-7890
123 456 7890
123.456.7890
+91 (123) 456-7890
If you do not want a match on non-US numbers use
^(\+0?1\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Update :
As noticed by user Simon Weaver below, if you are also interested in matching on unformatted numbers just make the separator character class optional as [\s.-]?
^(\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
https://regex101.com/r/j48BZs/2
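As a quick way to see the difference between the two variants, here is a minimal sketch in JavaScript. The two patterns are copied from the answer above; the variable names and the sample list are only illustrative assumptions.

```javascript
// Patterns copied from the answer above; names and samples are illustrative.
const strictPattern = /^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$/;
const relaxedPattern = /^(\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$/;

const samples = [
  "123-456-7890",
  "(123) 456-7890",
  "123 456 7890",
  "123.456.7890",
  "+91 (123) 456-7890",
  "1234567890", // unformatted: only the relaxed pattern accepts it
];

for (const s of samples) {
  console.log(s, strictPattern.test(s), relaxedPattern.test(s));
}
```

Note that both patterns treat each parenthesis as independently optional, so an input such as "(123-456-7890" is accepted as well; whether that matters depends on how strict the validation needs to be.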
There are many variations possible for this problem. Here is a regular expression similar to an answer I previously placed on SO.
^\s*(?:\+?(\d{1,3}))?[-. (]*(\d{3})[-. )]*(\d{3})[-. ]*(\d{4})(?: *x(\d+))?\s*$
It would match the following examples and much more:
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
800 555 1234x5678
8005551234 x5678
1 800 555-1234
1----800----555-1234
Regardless of the way the phone number is entered, the capture groups can be used to break down the phone number so you can process it in your code.
Group1: Country Code (ex: 1 or 86)
Group2: Area Code (ex: 800)
Group3: Exchange (ex: 555)
Group4: Subscriber Number (ex: 1234)
Group5: Extension (ex: 5678)
Here is a breakdown of the expression if you're interested:
^\s* #Line start, match any whitespaces at the beginning if any.
(?:\+?(\d{1,3}))? #GROUP 1: The country code. Optional.
[-. (]* #Allow certain non numeric characters that may appear between the Country Code and the Area Code.
(\d{3}) #GROUP 2: The Area Code. Required.
[-. )]* #Allow certain non numeric characters that may appear between the Area Code and the Exchange number.
(\d{3}) #GROUP 3: The Exchange number. Required.
[-. ]* #Allow certain non numeric characters that may appear between the Exchange number and the Subscriber number.
(\d{4}) #Group 4: The Subscriber Number. Required.
(?: *x(\d+))? #Group 5: The Extension number. Optional.
\s*$ #Match any ending whitespaces if any and the end of string.
To make the Area Code optional, just add a question mark after the (\d{3}) for the area code.
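To illustrate how the capture groups might be consumed, here is a small JavaScript sketch. The pattern is the one from this answer; the function name parsePhone and the shape of the returned object are assumptions made for the example.

```javascript
// Pattern copied from the answer above.
const phonePattern = /^\s*(?:\+?(\d{1,3}))?[-. (]*(\d{3})[-. )]*(\d{3})[-. ]*(\d{4})(?: *x(\d+))?\s*$/;

// Hypothetical helper: returns the five groups as named fields, or null when
// the input does not match. Optional groups that did not take part in the
// match come back as undefined.
function parsePhone(input) {
  const match = phonePattern.exec(input);
  if (match === null) return null;
  const [, countryCode, areaCode, exchange, subscriber, extension] = match;
  return { countryCode, areaCode, exchange, subscriber, extension };
}

console.log(parsePhone("+1 800 555-1234"));
console.log(parsePhone("800 555 1234x5678"));
```

Because the country code is optional and greedy, the engine backtracks on inputs such as "18005551234" until the remaining digits fit the 3-3-4 layout, which is why the leading 1 ends up in the country code group.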
^(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
Matches these phone numbers:
1-718-444-1122
718-444-1122
(718)-444-1122
17184441122
7184441122
718.444.1122
1718.444.1122
1-123-456-7890
1 123-456-7890
1 (123) 456-7890
1 123 456 7890
1.123.456.7890
+91 (123) 456-7890
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
18001234567
1 800 123 4567
1-800-123-4567
+18001234567
+1 800 123 4567
+1 (800) 123 4567
1(800)1234567
+1800 1234567
1.8001234567
1.800.123.4567
+1 (800) 123-4567
18001234567
1 800 123 4567
+1 800 123-4567
+86 800 123 4567
1-800-123-4567
1 (800) 123-4567
(800)123-4567
(800) 123-4567
(800)1234567
800-123-4567
800.123.4567
1231231231
123-1231231
123123-1231
123-123 1231
123 123-1231
123-123-1231
(123)123-1231
(123)123 1231
(123) 123-1231
(123) 123 1231
+99 1234567890
+991234567890
(555) 444-6789
555-444-6789
555.444.6789
555 444 6789
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1.800.555.1234
+1.800.555.1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
(003) 555-1212
(103) 555-1212
(911) 555-1212
18005551234
1 800 555 1234
+86 800-555-1234
1 (800) 555-1234
See regex101.com
Regex pattern to validate a regular 10 digit phone number plus optional international code (1 to 3 digits) and optional extension number (any number of digits):
/(\+\d{1,3}\s?)?((\(\d{3}\)\s?)|(\d{3})(\s|-?))(\d{3}(\s|-?))(\d{4})(\s?(([E|e]xt[:|.|]?)|x|X)(\s?\d+))?/g
Demo: https://www.regextester.com/103299
Valid entries:
/* Full number */
+999 (999) 999-9999 Ext. 99999
/* Regular local phone number (XXX) XXX-XXXX */
1231231231
123-1231231
123123-1231
123-123 1231
123 123-1231
123-123-1231
(123)123-1231
(123)123 1231
(123) 123-1231
(123) 123 1231
/* International codes +XXX (XXX) XXX-XXXX */
+99 1234567890
+991234567890
/* Extensions (XXX) XXX-XXXX Ext. XXX... */
1234567890 Ext 1123123
1234567890Ext 1123123
1234567890 Ext1123123
1234567890Ext1123123
1234567890 Ext: 1123123
1234567890Ext: 1123123
1234567890 Ext:1123123
1234567890Ext:1123123
1234567890 Ext. 1123123
1234567890Ext. 1123123
1234567890 Ext.1123123
1234567890Ext.1123123
1234567890 ext 1123123
1234567890ext 1123123
1234567890 ext1123123
1234567890ext1123123
1234567890 ext: 1123123
1234567890ext: 1123123
1234567890 ext:1123123
1234567890ext:1123123
1234567890 X 1123123
1234567890X1123123
1234567890X 1123123
1234567890 X1123123
1234567890 x 1123123
1234567890x1123123
1234567890 x1123123
1234567890x 1123123
Here's a fairly compact one I created.
Search: \+?1?\s*\(?-*\.*(\d{3})\)?\.*-*\s*(\d{3})\.*-*\s*(\d{4})$
Replace: +1 \($1\) $2-$3
Tested against the following use cases.
18001234567
1 800 123 4567
1-800-123-4567
+18001234567
+1 800 123 4567
+1 (800) 123 4567
1(800)1234567
+1800 1234567
1.8001234567
1.800.123.4567
1--800--123--4567
+1 (800) 123-4567
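The search/replace pair above can be tried out in JavaScript roughly as follows. This is only a sketch: the function name toCanonical is made up, and the replacement string drops the backslashes before the parentheses because JavaScript replacement strings do not need them.

```javascript
// Search pattern copied from the answer above.
const searchPattern = /\+?1?\s*\(?-*\.*(\d{3})\)?\.*-*\s*(\d{3})\.*-*\s*(\d{4})$/;

// Hypothetical helper: rewrites a matching number as "+1 (AAA) EEE-SSSS".
// Input that does not match is returned unchanged.
function toCanonical(input) {
  return input.replace(searchPattern, "+1 ($1) $2-$3");
}

console.log(toCanonical("1-800-123-4567"));
console.log(toCanonical("1.800.123.4567"));
```

The pattern is anchored only at the end of the string, so any text before the number is left in place; trimming or validating the prefix is up to the caller.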
Adding an example on jsfiddle that uses the above-mentioned solutions.
I have modified the code a bit as per my client's requirement. Hope this also helps someone.
/^\s*(?:\+?(\d{1,3}))?[- (]*(\d{3})[- )]*(\d{3})[- ]*(\d{4})(?: *[x/#]{1}(\d+))?\s*$/
See Example Here
Phone number regex that I use:
/^[+]?(\d{1,2})?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$/
Covers:
18001234567
1 800 123 4567
+1 800 123-4567
+86 800 123 4567
1-800-123-4567
1 (800) 123-4567
(800)123-4567
(800) 123-4567
(800)1234567
800-123-4567
800.123.4567
Try this for Pakistani users. Here's a fairly compact one I created.
((\+92)|0)[.\- ]?[0-9][.\- ]?[0-9][.\- ]?[0-9]
Tested against the following use cases.
+92 -345 -123 -4567
+92 333 123 4567
+92 300 123 4567
+92 321 123 -4567
+92 345 - 540 - 5883
Starting with @Ravi's answer, I also applied some validation rules for the NPA (Area) Code.
In particular:
It should start with a 2 (or higher)
It cannot have "11" as the second and third digits (N11).
There are a couple other restrictions, including reserved blocks (N9X, 37X, 96X) and 555, but I left those out, particularly because the reserved blocks may see future use, and 555 is useful for testing.
This is what I came up with:
^((\+\d{1,2}|1)[\s.-]?)?\(?[2-9](?!11)\d{2}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
Alternately, if you also want to match blank values (if the field isn't required), you can use:
(^((\+\d{1,2}|1)[\s.-]?)?\(?[2-9](?!11)\d{2}\)?[\s.-]?\d{3}[\s.-]?\d{4}$|^$)
My test cases for valid numbers (many from @Francis' answer) are:
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1.800.555.1234
+1.800.555.1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
My invalid test cases include:
(003) 555-1212 // Area code starts with 0
(103) 555-1212 // Area code starts with 1
(911) 555-1212 // Area code ends with 11
180055512345 // Too many digits
1 800 5555 1234 // Prefix code too long
+1 800 555x1234 // Invalid delimiter
+867 800 555 1234 // Country code too long
1-800-555-1234p // Invalid character
1 (800) 555-1234 // Too many spaces
800x555x1234 // Invalid delimiter
86 800 555 1212 // Non-NA country code doesn't have +
My regular expression does not include grouping to extract the digit groups, but it can be modified to include those.
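A compact way to re-check a few of these cases is sketched below in JavaScript. The pattern is the first one from this answer; the array names and the selection of cases are assumptions for illustration, taken from the valid and invalid lists above.

```javascript
// Pattern copied from the answer above (area code must start with 2-9 and
// must not end in 11).
const nanpPattern = /^((\+\d{1,2}|1)[\s.-]?)?\(?[2-9](?!11)\d{2}\)?[\s.-]?\d{3}[\s.-]?\d{4}$/;

const shouldPass = ["18005551234", "+1 800 555-1234", "(800)555-1234", "800.555.1234", "+86 800 555 1234"];
const shouldFail = ["(003) 555-1212", "(103) 555-1212", "(911) 555-1212", "180055512345", "800x555x1234", "86 800 555 1212", "+867 800 555 1234"];

// Collect every case whose result differs from the expectation.
const surprises = [
  ...shouldPass.filter((s) => !nanpPattern.test(s)),
  ...shouldFail.filter((s) => nanpPattern.test(s)),
];

console.log(surprises.length === 0 ? "all cases behave as described" : surprises);
```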
I find this regular expression the most useful for a 10-digit contact number:
^(?:(?:\+|0{0,2})91(\s*[\-]\s*)?|[0]?)?[789]\d{9}$
Reference: https://regex101.com/r/QeQewP/1
Perhaps the easiest one compared to several others.
\(?\d+\)?[-.\s]?\d+[-.\s]?\d+
It matches the following:
(555) 444-6789
555-444-6789
555.444.6789
555 444 6789
The expressions for 1, 3 and 4 are quite similar, so you can use:
^([1-9]\d{2})([- .])(\d{3})$2(\d{4})$
Note that, depending on the language and brand of regexes used, you might need to put \2 instead of $2 or such matching might not be supported at all.
I see no good way to combine this with the format 2, apart from the obvious ^(regex for 1,3,4|regex for 2)$ which is ugly, clumsy and makes it hard to get out the parts of the numbers.
As for the area code, you can add (\+\d)? to the beginning to capture a single-digit area code (sorry, I don't know the format of your area codes).
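In JavaScript the backreference is written \2 inside the pattern, so the combined expression for formats 1, 3 and 4 could be sketched like this (the variable name is just for the example):

```javascript
// Group 2 captures the first separator; \2 forces the second separator to be
// the same character.
const sameSeparator = /^([1-9]\d{2})([- .])(\d{3})\2(\d{4})$/;

console.log(sameSeparator.test("123-456-7890")); // consistent dashes
console.log(sameSeparator.test("123-456.7890")); // mixed separators
```

The benefit of the backreference over a plain character class such as [- .] in both positions is that mixed separators are rejected.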
How about this?
^(\+?[01])?[-.\s]?\(?[1-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}
EDIT: I forgot about the () one.
EDIT 2: Got the first 3 digit part wrong.
This code will match a US or Canadian phone number, and will also make sure that it is a valid area code and exchange:
^((\+1)?[\s-]?)?\(?[2-9]\d\d\)?[\s-]?[2-9]\d\d[\s-]?\d\d\d\d
Test on Regex101.com
This is my regex that worked on US numbers in the FreeCodeCamp phone number challenge:
/^\d{3}(-|\s)\d{3}(-|\s)\d{4}$|^\d{10}$|^1\s\d{3}(-|\s)\d{3}(-|\s)\d{4}$|^(1\s?)?\(\d{3}\)(\s|\-)?\d{3}\-\d{4}$/
Matches:
555-555-5555
(555)555-5555
(555) 555-5555
555 555 5555
5555555555
1 555 555 5555 etc
The regex here is a slight modification of @Francis Gagnon's.
Objective: to detect any possible pattern in which a user can share their US phone number.
Version 1:
^\s*(?:\+?(\d{1,3}))?[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d[\W\D\s]*?\d)(?: *x(\d+))?\s*$
Test it over here Codepen: https://codepen.io/kiranbhattarai/pen/NWKMXQO
Explanation of the regex : https://regexr.com/4kt5j
Version 2:
\s*(?:\+?(\d{1,3}))?[\W\D\s]^|()*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d)[\W\D\s]*(\d[\W\D\s]*?\d[\D\W\s]*?\d[\W\D\s]*?\d)(?: *x(\d+))?\s*$
What is different: the test cases can be part of a longer string. In version 1 the test case has to start at the beginning of a line to match.
Codepen: https://codepen.io/kiranbhattarai/pen/GRKGNGG
Explanation of the regex : https://regexr.com/4kt9n
If you can find a pattern that fails, please comment and I will fix it.
Test Cases: Pass
8 0 0 4 4 4 5 55 5
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
800 555 1234x5678
8005551234 x5678
1 800 555-1234
1----800----555-1234
800 (555) 1234
800(555)1234
8 0 0 5 5 5 1 2 3 4
8.0.0.5.5.5.1.2.3.4
8-0-0-5-5-5-1-2-3-4
(8)005551234
(80)05551234
8(00)5551234
8#0#0#5551234
8/0/0/5/5/5/1/2/3/4
8*0*0*5*5*5*1*2*3*4
8:0:0:5:5:5:1:2:3:4
8,0,0,5,5,5,1,2,3,4
800,555,1234
800:555:1234
1-718-444-1122
718-444-1122
(718)-444-1122
17184441122
7184441122
718.444.1122
1718.444.1122
1-123-456-7890
1 123-456-7890
1 (123) 456-7890
1 123 456 7890
1.123.456.7890
+91 (123) 456-7890
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
18001234567
1 800 123 4567
1-800-123-4567
+18001234567
+1 800 123 4567
+1 (800) 123 4567
1(800)1234567
+1800 1234567
1.8001234567
1.800.123.4567
+1 (800) 123-4567
18001234567
1 800 123 4567
+1 800 123-4567
+86 800 123 4567
1-800-123-4567
1 (800) 123-4567
(800)123-4567
(800) 123-4567
(800)1234567
800-123-4567
800.123.4567
1231231231
123-1231231
123123-1231
123-123 1231
123 123-1231
123-123-1231
(123)123-1231
(123) 123-1231
(123) 123 1231
+99 1234567890
+991234567890
(555) 444-6789
555-444-6789
555.444.6789
555 444 6789
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1.800.555.1234
+1.800.555.1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
(003) 555-1212
(103) 555-1212
(911) 555-1212
18005551234
1 800 555 1234
+86 800-555-1234
1 (800) 555-1234
I'm just throwing this answer in since it solves a problem of mine. It's based on @stormy's answer, but it includes 3-digit country codes and, more importantly, it can be used anywhere in a string: it won't match if the number is not preceded by whitespace or the start of the string and followed by a word boundary. This is useful so that it won't match random numbers in the middle of a URL or something.
((?:\s|^)(?:\+\d{1,3}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})(?:\b)
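Here is a rough JavaScript sketch of using this pattern to pull numbers out of free text. The g flag, the helper name findPhoneNumbers and the sample sentence are assumptions added for the example; the pattern itself is the one above.

```javascript
// Pattern copied from the answer above, with the g flag added so that
// matchAll can return every occurrence.
const embeddedPattern = /((?:\s|^)(?:\+\d{1,3}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})(?:\b)/g;

// Hypothetical helper: the capture group includes the leading whitespace
// character, so each hit is trimmed before it is returned.
function findPhoneNumbers(text) {
  return Array.from(text.matchAll(embeddedPattern), (m) => m[1].trim());
}

const sampleText = "Call 800-555-1234 or +1 (800) 555-9999; see http://example.com/item/8005551234";
console.log(findPhoneNumbers(sampleText));
```

The digits inside the URL are skipped because they are preceded by a slash rather than by whitespace or the start of the string.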
Here's a regex that matches North American numbers as well as international numbers such as for middle east.
^((\+|0{0,2})([0-9]){1,3})?[-.●\s]?\(?([0-9]{2,3})\)?[-.●\s]?([0-9]{3})[-.●\s]?([0-9]{4})$
I know this doesn't answer OP's question directly, but if you are asking the same question as OP, there is a good chance you are looking for a way to validate and store a phone number in either state or a database. Instead of trying to detect every possible combination of characters that could be a phone number, you might find it easier to break the task into multiple steps:
Strip out all non-digits.
Strip out leading 1s.
Make sure the number is at most 10 digits.
JavaScript pseudo-example, assuming "phone" is user input stored as a string:
// strings are immutable, so keep the result of each step
let digits = phone.replace(/\D/g, "");
digits = digits.replace(/^1+/, "");
digits = digits.slice(0, 10);
digits.length === 10 ? "do something" : "don't do something";
The code above will need to be tweaked for your purposes and is kept as simple as possible for non-JavaScript readers.
For presentation purposes you can always layer dashes and leading 1s back in later, but for storage you should probably keep only the actual digits. This approach also has the added advantage of leaving you with some easy-to-digest regular expressions.
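The steps could be wrapped into two small functions along these lines. This is a sketch with made-up names (normalizeUsPhone, formatUsPhone); unlike the snippet above, which truncates with slice, it rejects input that still has more than 10 digits after stripping, which is an assumption about the desired behaviour.

```javascript
// Hypothetical helper: returns the bare 10-digit string, or null when the
// input does not reduce to exactly 10 digits.
function normalizeUsPhone(input) {
  const digits = String(input)
    .replace(/\D/g, "")  // strip out everything that is not a digit
    .replace(/^1+/, ""); // strip out leading 1s
  return digits.length === 10 ? digits : null;
}

// Hypothetical helper for presentation: expects the stored 10-digit form.
function formatUsPhone(digits) {
  return `(${digits.slice(0, 3)}) ${digits.slice(3, 6)}-${digits.slice(6)}`;
}

const stored = normalizeUsPhone("+1 (800) 555-1234");
console.log(stored, stored && formatUsPhone(stored));
```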
^(\+1)?\s?(\([1-9]\d{2}\)|[1-9]\d{2})(-|\s|\.)\d{3}(-|\s|\.)\d{4}
This is a more comprehensive version that will match as much as I can think of as well as give you group matching for country, region, first, and last.
(?<number>(\+?(?<country>(\d{1,3}))(\s|-|\.)?)?(\(?(?<region>(\d{3}))\)?(\s|-|\.)?)((?<first>(\d{3}))(\s|-|\.)?)((?<last>(\d{4}))))
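Named groups of this form are supported by JavaScript regular expressions (ES2018 and later), so the parts can be read by name. A minimal sketch, where the helper name splitPhone is an assumption and the pattern is copied from above:

```javascript
// Pattern copied from the answer above; it is not anchored, so it finds the
// first phone-like run inside the input.
const namedPattern = /(?<number>(\+?(?<country>(\d{1,3}))(\s|-|\.)?)?(\(?(?<region>(\d{3}))\)?(\s|-|\.)?)((?<first>(\d{3}))(\s|-|\.)?)((?<last>(\d{4}))))/;

// Hypothetical helper: returns the named parts, or null when nothing matches.
function splitPhone(input) {
  const match = namedPattern.exec(input);
  if (match === null) return null;
  const { country, region, first, last } = match.groups;
  return { country, region, first, last };
}

console.log(splitPhone("+1 (800) 555-1234"));
console.log(splitPhone("800.555.1234"));
```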
What about multiple numbers with "+", separated by ";", ",", "-" or " " characters?
I ended up with
const regexBase = '(?:\\+?(\\d{1,3}))?[-. (]*(\\d{3})?[-. )]*(\\d{3})[-. ]*(\\d{4,5})(?: *x(\\d+))?';
const phoneRegex = new RegExp('\\s*' + regexBase + '\\s*', 'g');
This was to allow for things like Finnish numbers (+358 is Finland's country code), for example
+358 300 20200

Phone validation regex

I'm using this pattern to check the validation of a phone number
^[0-9\-\+]{9,15}$
It works for 0771234567 and +0771234567,
but I want it to work for 077-1234567 and +077-1234567 and +077-1-23-45-67 and +077-123-45-6-7
What should I change in the pattern?
Please refer to this SO Post
example of a regular expression in jquery for phone numbers
/\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})/
(123) 456 7899
(123).456.7899
(123)-456-7899
123-456-7899
123 456 7899
1234567899
are supported
This solution actually validates the numbers and the format. For example: 123-456-7890 is a valid format but is NOT a valid US number and this answer bears that out where others here do not.
If you do not want the extension capability remove the following including the parenthesis:
(?:\s*(?:#|x.?|ext.?|extension)\s*(\d+)\s*)? :)
edit (addendum) I needed this in a client side only application so I converted it. Here it is for the javascript folks:
var myPhoneRegex = /(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})\s*(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+)\s*)?$/i;
if (myPhoneRegex.test(phoneVar)) {
// Successful match
} else {
// Match attempt failed
}
hth.
end edit
This allows extensions or not and works with .NET
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
To validate with or without trailing spaces. Perhaps when using .NET validators and trimming server side use this slightly different regex:
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})\s*(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+)\s*)?$
All valid:
1 800 5551212
800 555 1212
8005551212
18005551212
+1800 555 1212 extension65432
800 5551212 ext3333
Invalid #s
234-911-5678
314-159-2653
123-234-5678
EDIT: Based on Felipe's comment I have updated this for international.
Based on what I could find out from here and here regarding valid global numbers
This is tested as a first line of defense of course. An overarching element of the international number is that it is no longer than 15 characters. I did not write a replace for all the non digits and sum the result. It should be done for completeness. Also, you may notice that I have not combined the North America regex with this one. The reason is that this international regex will match North American numbers, however, it will also accept known invalid # such as +1 234-911-5678. For more accurate results you should separate them as well.
Pauses and other dialing instruments are not mentioned and therefore invalid per E.164
\(?\+[0-9]{1,3}\)? ?-?[0-9]{1,3} ?-?[0-9]{3,5} ?-?[0-9]{4}( ?-?[0-9]{3})?
With 1-10 letter word for extension and 1-6 digit extension:
\(?\+[0-9]{1,3}\)? ?-?[0-9]{1,3} ?-?[0-9]{3,5} ?-?[0-9]{4}( ?-?[0-9]{3})? ?(\w{1,10}\s?\d{1,6})?
Valid international numbers (the country name is for reference only; it is not part of the match):
+55 11 99999-5555 Brazil
+593 7 282-3889 Ecuador
(+44) 0848 9123 456 UK
+1 284 852 5500 BVI
+1 345 9490088 Grand Cayman
+32 2 702-9200 Belgium
+65 6511 9266 Asia Pacific
+86 21 2230 1000 Shanghai
+9124 4723300 India
+821012345678 South Korea
And for your extension pleasure
+55 11 99999-5555 ramal 123 Brazil
+55 11 99999-5555 foo786544 Brazil
Enjoy
I have a more generic regex to allow the user to enter only numbers, +, -, whitespace and (). It respects the parenthesis balance and there is always a number after a symbol.
^([+]?[\s0-9]+)?(\d{3}|[(]?[0-9]+[)])?([-]?[\s]?[0-9])+$
false, ""
false, "+48 504 203 260##"
false, "+48.504.203.260"
false, "+55(123) 456-78-90-"
false, "+55(123) - 456-78-90"
false, "504.203.260"
false, " "
false, "-"
false, "()"
false, "() + ()"
false, "(21 7777"
false, "+48 (21)"
false, "+"
true, " 1"
true, "1"
true, "555-5555-555"
true, "+48 504 203 260"
true, "+48 (12) 504 203 260"
true, "+48 (12) 504-203-260"
true, "+48(12)504203260"
true, "+4812504203260"
true, "4812504203260"
Consider:
^\+?[0-9]{3}-?[0-9]{6,12}$
This only allows + at the beginning; it requires 3 digits, followed by an optional dash, followed by 6-12 more digits.
Note that the original regex allows 'phone numbers' such as 70+12---12+92, which is a bit more liberal than you probably had in mind.
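The difference between the two patterns is easy to confirm with a short JavaScript sketch (the variable names are only for this example):

```javascript
// The asker's original pattern and the stricter one suggested above.
const originalPattern = /^[0-9\-\+]{9,15}$/;
const stricterPattern = /^\+?[0-9]{3}-?[0-9]{6,12}$/;

for (const s of ["0771234567", "+077-1234567", "70+12---12+92"]) {
  console.log(s, originalPattern.test(s), stricterPattern.test(s));
}
```

The stricter pattern still rejects inputs with dashes between the later digits, such as +077-1-23-45-67, which is what the amended question asks for.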
The question was amended to add:
+077-1-23-45-67 and +077-123-45-6-7
You now probably need to be using a regex system that supports alternatives:
^\+?[0-9]{3}-?([0-9]{7}|[0-9]-[0-9]{2}-[0-9]{2}-[0-9]{2}|[0-9]{3}-[0-9]{2}-[0-9]-[0-9])$
The first alternative is seven digits; the second is 1-23-45-67; the third is 123-45-6-7. These all share the optional plus + followed by 3 digits and an optional dash - prefix.
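As a quick check of the three alternatives, here is a JavaScript sketch; the variable name is made up and the sample inputs come from the question.

```javascript
// Pattern copied from the answer above: a shared "+077-" style prefix followed
// by one of three fixed digit layouts.
const alternativesPattern = /^\+?[0-9]{3}-?([0-9]{7}|[0-9]-[0-9]{2}-[0-9]{2}-[0-9]{2}|[0-9]{3}-[0-9]{2}-[0-9]-[0-9])$/;

for (const s of ["0771234567", "+077-1234567", "+077-1-23-45-67", "+077-123-45-6-7", "+077-12-34-567"]) {
  console.log(s, alternativesPattern.test(s));
}
```

Only layouts that are listed explicitly are accepted, so any other grouping of the seven digits is rejected until a further alternative is added.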
The comment below mentions another pattern:
+077-12-34-567
It is not at all clear what the general pattern should be - maybe one or more digits separated by dashes; digits at front and back?
^\+?[0-9]{3}-?[0-9]+(-[0-9]+)+$
This will allow the '+077-' prefix, followed by any sequence of digits alternating with dashes, with at least one digit between each dash and no dash at the end.
/^[0-9\+]{1,}[0-9\-]{3,15}$/
So first comes a digit or a +, then some digits or -.
First test the length of the string to see if it is between 9 and 15.
Then use this regex to validate:
^\+?\d+(-\d+)*$
This is yet another variation of the normal* (special normal*)* pattern, with normal being \d and special being -.
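Put together, the two-step check might look like the following JavaScript sketch. The function name isValidPhone is an assumption; the length bounds and the pattern are the ones given above.

```javascript
// Step 1: overall length between 9 and 15 characters.
// Step 2: optional "+", then digit runs separated by single dashes.
function isValidPhone(input) {
  const lengthOk = input.length >= 9 && input.length <= 15;
  return lengthOk && /^\+?\d+(-\d+)*$/.test(input);
}

for (const s of ["077-1234567", "+077-1-23-45-67", "+077-123-45-6-7", "077--1234567"]) {
  console.log(s, isValidPhone(s));
}
```

Keeping the length test outside the regex keeps the pattern short and makes the two rules easy to change independently.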
I tried:
^(1[ \-\+]{0,3}|\+1[ -\+]{0,3}|\+1|\+)?((\(\+?1-[2-9][0-9]{1,2}\))|(\(\+?[2-8][0-9][0-9]\))|(\(\+?[1-9][0-9]\))|(\(\+?[17]\))|(\([2-9][2-9]\))|([ \-\.]{0,3}[0-9]{2,4}))?([ \-\.][0-9])?([ \-\.]{0,3}[0-9]{2,4}){2,3}$
I took care of special country codes like 1-97... as well. Here are the numbers I tested against (from Puneet Lamba and MCattle):
***** PASS *****
18005551234
1 800 555 1234
+1 800 555-1234
+86 800 555 1234
1-800-555-1234
1.800.555.1234
+1.800.555.1234
1 (800) 555-1234
(800)555-1234
(800) 555-1234
(800)5551234
800-555-1234
800.555.1234
(+230) 5 911 4450
123345678
(1) 345 654 67
+1 245436
1-976 33567
(1-734) 5465654
+(230) 2 345 6568
***** CORRECTLY FAILING *****
(003) 555-1212
(103) 555-1212
(911) 555-1212
1-800-555-1234p
800x555x1234
+1 800 555x1234
***** FALSE POSITIVES *****
180055512345
1 800 5555 1234
+867 800 555 1234
1 (800) 555-1234
86 800 555 1212
Originally posted here: Regular expression to match standard 10 digit phone number
Here is the regex for Ethiopian phone numbers (EthioTelecom and Safaricom). For my fellow Ethiopian developers ;)
phoneExp = /^(^\+251|^251|^0)?(9|7)\d{8}$/;
It matches the following (restrict any unwanted character in start and end position)
+251912345678
251912345678
0912345678
912345678
+251712345678
251712345678
0712345678
712345678
You can test it on this site regexr.
^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Matches the following cases:
123-456-7890
(123) 456-7890
123 456 7890
123.456.7890
+91 (123) 456-7890
Try this
\+?\(?([0-9]{3})\)?[-.]?\(?([0-9]{3})\)?[-.]?\(?([0-9]{4})\)?
It matches the following cases
+123-(456)-(7890)
+123.(456).(7890)
+(123).(456).(7890)
+(123)-(456)-(7890)
+123(456)(7890)
+(123)(456)(7890)
123-(456)-(7890)
123.(456).(7890)
(123).(456).(7890)
(123)-(456)-(7890)
123(456)(7890)
(123)(456)(7890)
For further explanation on the pattern CLICKME
The following regex matches a '+' followed by n digits
var mobileNumber = "+18005551212";
var regex = new RegExp("^\\+[0-9]*$");
var OK = regex.test(mobileNumber);
if (OK) {
console.log("is a phone number");
} else {
console.log("is NOT a phone number");
}
^\+?\d{3}-?\d{2}-?\d{2}-?\d{3}$
You may try this.
How about this one? Hope this helps.
^(\\+?)\d{3,3}-?\d{2,2}-?\d{2,2}-?\d{3,3}$
^[0-9\-\+]{9,15}$
would match 0+0+0+0+0+0, or 000000000, etc.
(\-?[0-9]){7}
would match a specific number of digits with optional hyphens in any position among them.
What is this +077 format supposed to be?
It's not a valid format. No country codes begin with 0.
The digits after the + should usually be a country code, 1 to 3 digits long.
Allowing for "+" then country code CC, then optional hyphen, then "0" plus two digits, then hyphens and digits for next seven digits, try:
^\+CC\-?0[1-9][0-9](\-?[0-9]){7}$
Oh, and {3,3} is redundant; it simplifies to {3}.
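The CC placeholder in the pattern above has to be replaced with a real country code before the regex can be used. A minimal sketch of that substitution follows; the helper name `buildNationalPattern` and the code "44" are hypothetical choices for illustration, not part of the original answer.

```javascript
// Hypothetical helper: fills the CC placeholder for one country code.
function buildNationalPattern(countryCode) {
  // Country codes are 1 to 3 digits long and never begin with 0.
  if (!/^[1-9][0-9]{0,2}$/.test(countryCode)) {
    throw new Error("invalid country code: " + countryCode);
  }
  // "+", country code, optional hyphen, "0" plus two digits,
  // then seven digits with optional hyphens before each.
  return new RegExp("^\\+" + countryCode + "-?0[1-9][0-9](-?[0-9]){7}$");
}

const pattern44 = buildNationalPattern("44"); // "44" is only an example value

console.log(pattern44.test("+44-077-123-4567")); // true
console.log(pattern44.test("+44077-1234567"));   // true
console.log(pattern44.test("+44-007-1234567"));  // false: digit after the 0 must be 1-9
console.log(pattern44.test("+44-077-123456"));   // false: only six trailing digits
```

Validating the country code before building the regex also keeps regex metacharacters out of the generated pattern.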
This regex matches any number in the common format 1-(999)-999-9999 and variations of it. The area code may appear with or without parentheses, and the separators may be a period, space or dash. "^([01][- .])?(\(\d{3}\)|\d{3})[- .]?\d{3}[- .]\d{4}$"
Adding to @Joe Johnston's answer, this will also accept:
+16444444444,,241119933
(Required for Apple's special character support for dial-ins - https://support.apple.com/kb/PH18551?locale=en_US)
\(?\+[0-9]{1,3}\)? ?-?[0-9]{1,3} ?-?[0-9]{3,5} ?-?[0-9]{4}( ?-?[0-9]{3})? ?([\w\,\#\^]{1,10}\s?\d{1,10})?
Note: Accepts up to 10 digits for the extension code
/^(([+]{0,1}\d{2})|\d?)[\s-]?[0-9]{2}[\s-]?[0-9]{3}[\s-]?[0-9]{4}$/gm
https://regexr.com/4n3c4
Tested for
+94 77 531 2412
+94775312412
077 531 2412
0775312412
77 531 2412
// Not matching
77-53-12412
+94-77-53-12412
077 123 12345
77123 12345
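The tested cases above can be reproduced with a short script. This is a sketch with one deliberate change: the `g` and `m` flags are dropped. With the `g` flag, `RegExp.prototype.test` advances `lastIndex` after each successful match, so reusing one regex object across different strings can report false negatives. The variable names are assumptions.

```javascript
// Same pattern as above, without the g and m flags (see the note on lastIndex).
const lkPhone = /^(([+]{0,1}\d{2})|\d?)[\s-]?[0-9]{2}[\s-]?[0-9]{3}[\s-]?[0-9]{4}$/;

const expectedMatches = [
  "+94 77 531 2412",
  "+94775312412",
  "077 531 2412",
  "0775312412",
  "77 531 2412",
];

const expectedMisses = [
  "77-53-12412",
  "+94-77-53-12412",
  "077 123 12345",
  "77123 12345",
];

// Collect any input whose result disagrees with the lists above.
const wrong = [
  ...expectedMatches.filter((s) => !lkPhone.test(s)),
  ...expectedMisses.filter((s) => lkPhone.test(s)),
];

console.log(wrong.length === 0 ? "all cases behave as listed" : wrong);
```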
JS code:
function checkIfValidPhoneNumber(input){
"use strict";
if(/^((\+?\d{1,3})?[\(\- ]?\d{3,5}[\)\- ]?)?(\d[.\- ]?\d)+$/.test(input)&&input.replace(/\D/g,"").length<=15){
return true;
} else {
return false;
}
}
It may be primitive as phone number checks go, but it checks that the input text is compliant with the E.164 recommendation:
Maximum phone length is 15 digits
Country code consists of 1 to 3 digits, could be preceded with plus (could be omitted)
Region (network) code consists of 3 to 5 digits (could be omitted but only if country code is omitted)
It allows some delimiters in phone number and around region code (.- )
For example:
+7(918)000-12-34
911
1-23456-789.10.11.12
all are compliant with E.164 and validated
For all phone number formats:
/^\+?([87](?!95[5-7]|99[08]|907|94[^09]|336)([348]\d|9[0-6789]|7[01247])\d{8}|[1246]\d{9,13}|68\d{7}|5[1-46-9]\d{8,12}|55[1-9]\d{9}|55[138]\d{10}|55[1256][14679]9\d{8}|554399\d{7}|500[56]\d{4}|5016\d{6}|5068\d{7}|502[345]\d{7}|5037\d{7}|50[4567]\d{8}|50855\d{4}|509[34]\d{7}|376\d{6}|855\d{8,9}|856\d{10}|85[0-4789]\d{8,10}|8[68]\d{10,11}|8[14]\d{10}|82\d{9,10}|852\d{8}|90\d{10}|96(0[79]|17[0189]|181|13)\d{6}|96[23]\d{9}|964\d{10}|96(5[569]|89)\d{7}|96(65|77)\d{8}|92[023]\d{9}|91[1879]\d{9}|9[34]7\d{8}|959\d{7,9}|989\d{9}|971\d{8,9}|97[02-9]\d{7,11}|99[^4568]\d{7,11}|994\d{9}|9955\d{8}|996[2579]\d{8}|998[3789]\d{8}|380[345679]\d{8}|381\d{9}|38[57]\d{8,9}|375[234]\d{8}|372\d{7,8}|37[0-4]\d{8}|37[6-9]\d{7,11}|30[69]\d{9}|34[679]\d{8}|3459\d{11}|3[12359]\d{8,12}|36\d{9}|38[169]\d{8}|382\d{8,9}|46719\d{10})$/

How can I split a line when some fields contain spaces?

I have a text file that I extracted from a PDF file. It's arranged in a tabular format; this is part of it:
DATE SESS PROF1 PROF2 COURSE SEC GRADE COUNT
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A 3
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A- 2
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B 4
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B+ 2
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B- 1
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 WU 1
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
2007/09 1 FUENTES TANIA DACSB 06500 002 A 3
2007/09 1 FUENTES TANIA DACSB 06500 002 A- 8
2007/09 1 FUENTES ALEXA DACSB 06500 002 B 5
2007/09 1 FUENTES ALEXA DACSB 06500 002 B+ 3
2007/09 1 FUENTES ALEXA DACSB 06500 002 B- 1
2007/09 1 FUENTES ALEXA DACSB 06500 002 C 1
2007/09 1 FUENTES ALEXA DACSB 06500 002 C+ 1
2007/09 1 LIGGINS FREDER DACSB 06500 003 A 1
Where the first line is the columns names, and the rest of the lines are the data.
There are 8 columns which I want to get. At first it seemed very easy: just use split(/\s+/, ...) on each line I read. But then I noticed that some lines contain additional spaces, for example:
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
Sometimes the data for a certain column is optional, as you can see.
The problem is complex, but it's not unsolvable. It seems to me that course will always contain a space between the alpha code and the numeric code and that the prof names will also always contain a space. But then you're pretty much screwed if somebody has a two-part last name like "VAN DYKE".
A regex would describe this record:
my $record_exp
= qr{ ^ \s*
(\d{4}/\d{2}) # yyyy/mm date
\s+
(\d+) # any number of digits
\s+
(\S+ \s \S+) # non-space cluster, single space, non-space cluster
\s+
# same as last, possibly not there; separating spaces are included
# in the conditional, because we have to make sure it will start
# right at the next rule.
(?:(\S+ \s \S+)\s+)?
# a cluster of alpha, single space, cluster of digits
(\p{Alpha}+ \s \d+)
\s+ # any number of spaces
(\S+) # any number of non-space
\s+ # ditto..
(\S+)
\s+
(\S+)
}x;
Which makes the loop a lot easier:
while ( <$input> ) {
    my @fields = m{$record_exp};
    # ... list of semantic actions here...
}
But you could also store it into structures, knowing that the only variable part of the data is the profs:
use strict;
use warnings;
my @records;
<$input>; # bleed the first line
while ( <$input> ) {
    my @fields = split; # split on white-space
    my $record = { date => shift @fields };
    $record->{session} = shift @fields;
    $record->{profs} = [ join( ' ', splice( @fields, 0, 2 )) ];
    # anything beyond the last five fields is another two-word prof name
    while ( @fields > 5 ) {
        push @{ $record->{profs} }, join( ' ', splice( @fields, 0, 2 ));
    }
    # join both course parts; splice in scalar context returns only the last one
    $record->{course} = join( ' ', splice( @fields, 0, 2 ));
    @$record{ qw<sec grade count> } = @fields;
    push @records, $record;
}
I believe it is ambiguous:
if PROF1 can contain spaces, how do you know where it ends and where PROF2 begins? What if PROF2 also contains a space? Or 3 spaces?
You probably can't even tell yourself, and if you can, it's because you can tell the difference between a first name and a surname.
If you're on Linux/Unix, try running pdftotext on the PDF; it might give you better results.
It looks to me like the first four columns and the last five columns are always present, and the 5th and 6th (prof2) columns are optional.
So split the line as you were attempting and pull off the first four and last five elements from the resulting array; whatever remains is your 5th and 6th columns.
If, however, either the prof1 or the prof2 entry can be missing, you're stuck: your file format is ambiguous.
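The "first four, last five, the rest is prof2" idea can be sketched as follows. The sketch is in JavaScript rather than Perl, and the function and field names are hypothetical. It assumes prof1 and the course are always exactly two words each, as in the sample data.

```javascript
// Hypothetical parser for one data line of the table in the question.
function parseGradeLine(line) {
  const tokens = line.trim().split(/\s+/);
  if (tokens.length < 9) return null; // fewer than 4 leading + 5 trailing fields

  const head = tokens.slice(0, 4);                    // date, sess, prof1 (2 words)
  const tail = tokens.slice(-5);                      // course (2 words), sec, grade, count
  const middle = tokens.slice(4, tokens.length - 5);  // whatever is left: prof2

  return {
    date: head[0],
    sess: head[1],
    prof1: head.slice(2).join(" "),
    prof2: middle.join(" "), // empty string when prof2 is absent
    course: tail.slice(0, 2).join(" "),
    sec: tail[2],
    grade: tail[3],
    count: Number(tail[4]),
  };
}

const oneProf = parseGradeLine("2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A 3");
const twoProfs = parseGradeLine("2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1");

console.log(oneProf.prof2 === "");  // true
console.log(twoProfs.prof2);        // JOSH ROGER
console.log(twoProfs.course);       // DBIOM 10000
```

The same slicing maps directly onto Perl array slices with negative indices.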
There is nothing that says you must use only a single regex. You can go prune off bits of your line in chunks if that makes it easier to handle the weird parts.
I would probably still use split(), but then access the data thusly:
my @values = split /\s+/, $string;
my $date = $values[0];
my $sess = $values[1];
my $count = $values[-1];
my $grade = $values[-2];
my $sec = $values[-3];
my $course = join ' ', @values[-5, -4]; # alpha code plus numeric code
my @profs = @values[2..($#values-5)];
With this construct you don't have to worry about how many profs you have. Even if you have none, the other values will all work fine (and you'll get an empty array for your profs).