Extract phone number regex - regex

How can I extract phone numbers from a text file?
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
This is a question that's been answered for other languages (see php abd general regex) but doesn't seem to have been tackled on SO for R.
I have searched and found what appears to be possible regexes to find phone numbers (In addition to the regexes from other languages above): http://regexlib.com/Search.aspx?k=phone but have not been able to use gsub within R with these to extract all of these numbers in the example.
Ideally, we'd get something like:
[[1]]
[1] "2-613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "1.575.555.5555"
[[5]]
[1] "7164347566"

This is the best I've been able to do- you have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, brackets, hyphens or spaces":
library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")
Output:
[[1]]
[1] " 2-613-213-4567 " " 5555555555 "
[[2]]
[1] " 613 213 4567"
[[3]]
[1] " (613)2134567"
[[4]]
[1] "1.575.555.5555 "
[[5]]
[1] "7164347566"
The leading/trailing spaces could probably be fixed with some additional complexity, or you could just fix it in post.
Update: a bit of searching lead me to this answer, which I slightly modified to allow periods. A bit stricter in terms of requiring a valid (US?) phone number, but seems to cover all your examples:
str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
Output:
[[1]]
[1] "613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "575.555.5555"
[[5]]
[1] "7164347566"
The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:
huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"

The qdapRegex now has the rm_phone specifically designed for this task:
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
library(qdapRegex)
ex_phone(x)
## [[1]]
## [1] "613-213-4567" "5555555555"
##
## [[2]]
## [1] "613 213 4567"
##
## [[3]]
## [1] "(613)2134567"
##
## [[4]]
## [1] "1.575.555.5555"
##
## [[5]]
## [1] "7164347566"

You would need a complex regex to cover all rules for matching phone numbers, but to cover your examples.
> library(stringi)
> unlist(stri_extract_all_regex(x, '(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b'))
# [1] "2-613-213-4567" "5555555555" "613 213 4567" "(613)2134567"
# [5] "1.575.555.5555" "7164347566"

Related

Cleaning up dates (years, specifically) with regex

I have database with an non-validated year field. Most of the entries are 4-digit years but about 10% of the entries are "whatever." This has led me down the rabbit hole of regular expressions to little avail. Getting better results than what I have is progress, even if I don't extract 100%.
#what a mess
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
#does a good job with any string containing a 4-digit year
as.numeric(sub('\\D*(\\d{4}).*', '\\1', yearEntries))
#does a good job with any string containing a 2-digit year, nought else
as.numeric(sub('\\D*(\\d{2}).*', '\\1', yearEntries))
The desired output is to grab the first readable year, so 1992-1993 would be 1992 and "the 70s" would be 1970.
How can I improve my parsing accuracy? Thanks!
EDIT: Pursuant to garyh's answer this gets me much closer:
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}).*","\\1",yearEntries,perl=TRUE)
# [1] "79" "07-2608" "07-262008" "96" "70" "93" "70" "15" "60" "70" NA "2013" "1992"
but note that, while the dates with dashes in them work with garyh's regex101.com demo, they don't work with R, keeping the month and day values, and the first dash.
Also I realize I didn't include an example date with slashes rather dashes. Another term in the regex should handle that but again, with R, it doesn't not produce the same (correct) result that regex101.com does.
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\/|\\d)|\\d{4}).*","\\1","07/09/13",perl=TRUE)
# [1] "07/0913"
These negative lookbacks and lookaheads are very powerful but stretch my feeble brain.
Not sure what flavour of regex R uses but this seems to get all the years in the string
/((?<!\d)\d{2}(?!\-|\d)|\d{4})/g
This is matching any 4 digits or any 2 digits provided they're not followed by a dash - or digit, or preceded by another digit
see demo here
You're going to need some elbow grease and do something like:
library(lubridate)
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
x <- yearEntries
x <- gsub("(late|early)", "", x, ignore.case=TRUE)
x <- gsub("[']*[s]*", "", x)
x <- gsub(",.*$", "", x)
x <- gsub(" ", "", x)
x <- ifelse(nchar(x)==9 | nchar(x)<8, gsub("[-/]+[[:digit:]]+$", "", x), x)
x <- ifelse(nchar(x)==4, gsub("^[[:digit:]]{2}", "", x), x)
y <- format(parse_date_time(x, "%m-%d-%y!"), "%y")
yearEntries <-ifelse(!is.na(y), y, x)
yearEntries
## [1] "79" "08" "08" "96" "70" "93" "70" "15" "60" "70" NA "13" "92"
We have no idea which year you need from ranged entries, but this should get you started.
I found a very simple way to get a good result (though I would not claim it is bullet proof). It grabs the last readable year, which is okay too.
yearEntries <- c("79, 80, 99","07/26/08","07-26-2008","'96 ","Early 70's","93/95","15",NA,"2013","1992-1993","ongoing")
# assume last two digits present in any string represent a 2-digit year
a<-sub(".*(\\d{2}).*$","\\1",yearEntries)
# [1] "99" "08" "08" "96" "70" "95" "15" "ongoing" NA "13" "93"
# change to numeric, strip NAs and add 2000
b<-na.omit(as.numeric(a))+2000
# [1] 2099 2008 2008 2096 2070 2095 2015 2013 2093
# assume any greater than present is last century
b[b>2015]<-b[b>2015]-100
# [1] 1999 2008 2008 1996 1970 1995 2015 2013 1993
...and Bob's your uncle!
#garyth's regex work well actually if you use the regmatches/grexprcombo to extract the pattern instead of sub:
regmatches(yearEntries,
gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE))
[[1]]
[1] "79" "80" "99"
[[2]]
[1] "08"
[[3]]
[1] "2008"
[[4]]
[1] "96"
[[5]]
[1] "70"
[[6]]
[1] "95"
[[7]]
[1] "70"
[[8]]
[1] "15"
[[9]]
[1] "60"
[[10]]
[1] "70"
[[11]]
character(0)
[[12]]
[1] "2013"
[[13]]
[1] "1992" "1993"
To only keep the first matching pattern:
sapply(regmatches(yearEntries,gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE)),`[`,1)
[1] "79" "08" "2008" "96" "70" "95" "70" "15" "60" "70" NA "2013" "1992"
regmatches("07/09/13",gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}","07/09/13",perl=TRUE))
[[1]]
[1] "13"

regex capture repeated phrases

I can capture repeated words pretty easily using:
"(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b" but this regex does not seem to extend to mutipe words (and why should it in its current state). How could I find repeated phrases using regex?
Here I extract repeated terms (regardless of case) but the same regex doesn't word to extract a repeated phrase:
library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
I would hope for a regex that would return:
"is a big is a Big"
for:
x <- "this is a big is a Big deal"
To cover corner cases here's a larger desired test and output...
"this is a big is a Big deal",
"I want want to see",
"I want, want to see",
"I want...want to see see how",
"this is a big is a Big deal for those of, those of you who are.",
"I like it. It is cool",
)
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
[[6]]
[1] NA
My current regex only gets me to:
rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "want want"
##
## [[3]]
## [1] "want, want"
##
## [[4]]
## [1] "want...want" "see see"
##
## [[5]]
## [1] NA
I think this does what you want (note we only allow a single space, ..., or , as separators, but you should be able to tweak that easily):
pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)
Produces:
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
Try this:
> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
Here is a visualization (except there is an error in the visualization - the \S parts should be within the group.
(?i)\b(\S.*\S)[ ,.]*\b(\1)
Debuggex Demo
You might want to replace [ ,.] with [ [:punct:]]. I did not do that since debuggex does not support POSIX character groups.

splitting a string in which upper case follows lower case in stringr

I have a string vector that looks like this and I'd like to split it up:
str <- c("Fruit LoopsJalapeno Sandwich", "Red Bagel", "Basil LeafBarbeque SauceFried Beef")
str_split(str, '[a-z][A-Z]', n = 3)
[[1]]
[1] "Fruit Loop" "alapeno Sandwich"
[[2]]
[1] "Red Bagel"
[[3]]
[1] "Basil Lea" "arbeque Sauc" "ried Beef"
But I need to keep those letters at the end and start of the words.
Here's 2 approaches in base (you can generalize to stringr if you want).
This one subs out this place with a placeholder and then splits on that.
strsplit(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE")
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
This method uses lookaheads and lookbehinds:
strsplit(str, "(?<=[a-z])(?=[A-Z])", perl=TRUE)
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
EDIT Generalized to stringr so you can grab 3 pieces if you want
stringr::str_split(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE", 3)
You could also match instead of splitting based off your string.
unlist(regmatches(str, gregexpr('[A-Z][a-z]+ [A-Z][a-z]+', str)))
# [1] "Fruit Loops" "Jalapeno Sandwich" "Red Bagel"
# [4] "Basil Leaf" "Barbeque Sauce" "Fried Beef"

Regex to extract US zip codes but not faux codes

Using the XML package and XPath to scrape addresses from websites, I sometimes can get only a string that has embedded in it the zip code I want. It is straightforward to extract the zip code, but sometimes there are other five-digit strings that show up.
Here are some variations on the problem in a df.
zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345"))
The R statement to extract zip codes (both 5 digit and plus 4 digits) is below, but it is tricked by the faux zip codes of the street number and the suite number (and there may be other possibilities in other address strings).
regmatches(zips$address, gregexpr("\\d{5}([-]?\\d{4})?", zips$address, perl = TRUE))
An answer to a previous SO question suggested that a "regex will return the last consecutive five digit string. It uses a negative look-ahead to ensure the absence of 5-digit strings after the one being returned."
Extracting a zip code from an address string
\b\d{5}\b(?!.*\b\d{5}\b)
But that question and answer deals with PHP and offers an if loop with preg_matches()` I am not familiar with those languages and tools, but the idea might be right.
My question: what R code will find real zip codes and ignore false lookalikes?
This is my first regex answer (I am still learning) so hopefully I don't say anything wrong to lead you in the wrong direction.
Basically, this regex looks for, as you hinted in your question, the last string that looks like a zip code which is not followed by a string that looks like a zip code
the basic syntax is pattern(?!.*pattern) which says to match pattern only if it is not followed (a negative look-ahead assertion, syntax: (?! )) by anything .* and pattern
so we can replace pattern with what you are interested in finding:
[0-9]{5}(-[0-9]{4})?
that is, a digit string [0-9] of exactly 5 characters {5} (which may optionally be followed ? by another group defined as a hyphen and another digit string of length four (-[0-9]{4})
put it all together with gregexpr to search for the matches and regmatches to interpret the results for me, I get:
zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345"))
regmatches(zips$address,
gregexpr('[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)', zips$address, perl = TRUE))
# [[1]]
# [1] "12345"
#
# [[2]]
# [1] "12345-0000"
#
# [[3]]
# [1] "12345"
#
# [[4]]
# [1] "12345"
#
# [[5]]
# [1] "12345"
The qdapRegex package has the rm_zip function for this:
zips <- data.frame(id = seq(1, 5),
address = c("Company, 18540 Main Ave., City, ST 12345",
"Company 18540 Main Ave. City ST 12345-0000",
"Company 18540 Main Ave. City State 12345",
"Company, 18540 Main Ave., City, ST 12345 USA",
"Company, One Main Ave Suite 18540, City, ST 12345")
)
lapply(rm_zip(zips$address, extract=TRUE), tail, 1)
## [[1]]
## [1] "12345"
##
## [[2]]
## [1] "12345-0000"
##
## [[3]]
## [1] "12345"
##
## [[4]]
## [1] "12345"
##
## [[5]]
## [1] "12345"
EDIT Per #lawyeR's comments:
I think that you want some regex that is more specific than the dictionary system used by qdapRegex. The current implementation of rm_zip allows for validation purposes and thus I wouldn't alter the regular expression it uses to be more flexible. I also wouldn't alter the function rm_zip to have additional parameters/arguments as qdapRegex attempts to have consistently operating functions.
That being said you could create your own function using the rm_ function and supply your own regular expression. I have done this using both of the parameters specified in your comment:
More complex data set:
zips <- data.frame(id = seq(1, 6),
address = c("Company, 18540 Main Ave., City, ST 12345",
"Company 18540 Main Ave. City ST 12345-0000",
"Company 18540 Main Ave. City State 12345",
"Company, 18540 Main Ave., City, ST 12345 USA",
"Company, One Main Ave Suite 18540m, City, ST 12345",
"company 12345678")
)
Function to grab even if a character follows the zip
## paste together a more flexible regular expression
pat <- pastex(
"#rm_zip",
"(?<!\\d)\\d{5}(?!\\d)",
"(?<!\\d)\\d{5}-\\d{4}(?!\\d)"
)
## Create your own function that extract is set to TRUE
rm_zip2 <- rm_(pattern=pat, extract=TRUE)
rm_zip2(zips$address)
## [[1]]
## [1] "18540" "12345"
##
## [[2]]
## [1] "18540" "12345-0000"
##
## [[3]]
## [1] "18540" "12345"
##
## [[4]]
## [1] "18540" "12345"
##
## [[5]]
## [1] "18540" "12345"
##
## [[6]]
## [1] NA
Function to extract just 5 digit zips
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract=TRUE)
rm_zip3(zips$address)
## [[1]]
## [1] "18540" "12345"
##
## [[2]]
## [1] "18540" "12345"
##
## [[3]]
## [1] "18540" "12345"
##
## [[4]]
## [1] "18540" "12345"
##
## [[5]]
## [1] "18540" "12345"
##
## [[6]]
## [1] NA

How to separate POS from words

Need to create an text sparce matrix (DTM) for classification. To prepare the text, first I need to eliminate (separate) the POS-tags the text. My guess was to do it like below. I'm new to R and don't now how to negate a REGEX (see below NOT!).
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
My guess how it could work:
(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS" "/VVFIN" "./$."
[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."
[[3]]
[1] "-/TRUNC" "/APPR" "/KON"
[[4]]
[1] "/PIS" "/ADJD" "./$."
[[5]]
[1] "/NN" "!!!/NE"
But don't konw how to negate the expression like:
# VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
One possibility is to eliminate the tags by, searching for POS-tags and replacing them with '' (i.e. empty text):
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
(textlist <- strsplit(paste(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*','', text), sep=' '), " "))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "-RRB" "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
With the friendly help of rawr