Matching special character in R - regex

Hi I have the following data.
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2",
"appple+20gfree",
"BELI HG MSWAT ALA +VAT T 100g BAR WR",
"TOOLAIT CASSE+LSST+SSSRE 40g SAC MDC")
In my second step I remove all whitespace in shopping_list.
require(stringr)
shopping_list_trim <- str_replace_all(shopping_list, fixed(" "), "")
print(shopping_list_trim)
[1] "applesx4" "bagofflour" "bagofsugar"
[4] "milkx2" "appple+20gfree" "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
If I want to extract the strings that do not contain a plus sign, I use the following code.
str_extract(shopping_list_trim, "^[^+]+$")
[1] "applesx4" "bagofflour" "bagofsugar" "milkx2" NA NA NA
I would like help extracting the strings that contain a plus sign.
I would like the output to be the following one.
NA NA NA NA "appple+20gfree"
"BELIHGMSWATALA+VATT100gBARWR" "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Does anybody have an idea how to extract only the strings that contain a plus sign?

This will do the trick
> str_extract(shopping_list_trim, "^(?=.*\\+)(.+)$")
[1] NA
[2] NA
[3] NA
[4] NA
[5] "appple+20gfree"
[6] "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Regex Breakdown
^(?=.*\\+) #Lookahead to check that there is at least one plus sign
(.+)$ #Capture the string if the above is true

If you can't/don't want to use look-arounds, try
^.*\+.*$
It matches anything followed by a + followed by anything :)
See it work here at regex101.
Regards
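For comparison, here is a minimal base R sketch of the same idea without stringr or look-arounds. It assumes a shortened version of the trimmed vector from the question; the names has_plus and with_plus are made up for illustration. Note that inside an R string the backslash in the pattern above has to be doubled ("^.*\\+.*$"), or the plus can be matched literally with fixed = TRUE as done here.

```r
# Shortened stand-in for shopping_list_trim from the question
shopping_list_trim <- c("applesx4", "bagofflour", "appple+20gfree")

# fixed = TRUE treats "+" as a literal character, so no escaping is needed
has_plus <- grepl("+", shopping_list_trim, fixed = TRUE)

# Keep the strings that contain a plus sign, NA for the rest
with_plus <- ifelse(has_plus, shopping_list_trim, NA_character_)
print(with_plus)
```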

Related

How to allow for arbitrary number of wildcards in regexes?

I have a list of character strings:
> head(g_patterns_clean_strings)
[[1]]
[1] "1FAFA"
[[2]]
[1] "FA,TRFA"
[[3]]
[1] "FAEX"
I am trying to identify specific patterns in these character strings, as such:
library(devtools)
g_patterns_clean <- source_gist("164f798524fd6904236a")[[1]]
g_patterns_clean_strings <- source_gist("af70a76691aacf05c1bb")[[1]]
FA_EX_logic_vector <- grepl(g_patterns_clean_strings, pattern = "(FAEX|EXFA)+")
FA_EX_cluster <- subset(g_patterns_clean, FA_EX_logic_vector)
Let's now say that I want to allow for an arbitrary number of other characters in between FA and EX (or EX and FA), how can I specify that in the regex above?
This is a flexible generalization of @eipi10's answer:
(FA.{0,2}EX|EX.{0,2}FA)
The . matches any character, and the {0,2} quantifier matches between 0 and 2 occurrences of .
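A small sketch of how the bounded quantifier compares with a truly arbitrary gap may help. The test strings below are invented for illustration (only the first three come from the question), and replacing {0,2} with * is one possible way to allow any number of characters in between, not something the answer above prescribes.

```r
# First three strings are from the question; the last two are made up
strings <- c("1FAFA", "FA,TRFA", "FAEX", "FA1EX", "EXabcFA")

# At most two arbitrary characters between FA and EX (or EX and FA)
bounded <- grepl("(FA.{0,2}EX|EX.{0,2}FA)", strings)

# Any number of arbitrary characters in between
unbounded <- grepl("(FA.*EX|EX.*FA)", strings)

print(data.frame(strings, bounded, unbounded))
```

Here "EXabcFA" has three characters between EX and FA, so only the unbounded pattern matches it.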

How to extract a part from a string in R

I have a problem when I try to extract a numeric part of a string in R. The original string is, for example, "buy 1000 shares of Google at 1100 GBP".
I need to extract the number of shares (1000) and the price (1100) separately. Besides, I need to extract the name of the stock, which always appears after "shares of".
I know that sub and gsub can replace parts of a string, but what commands should I use to extract part of a string?
1) This extracts all numbers in order:
s <- "buy 1000 shares of Google at 1100 GBP"
library(gsubfn)
strapplyc(s, "[0-9.]+", simplify = as.numeric)
giving:
[1] 1000 1100
2) If the numbers can be in any order but if the number of shares is always followed by the word "shares" and the price is always followed by GBP then:
strapplyc(s, "(\\d+) shares", simplify = as.numeric) # 1000
strapplyc(s, "([0-9.]+) GBP", simplify = as.numeric) # 1100
The portion of the string matched by the part of the regular expression within parens is returned.
3) If the string is known to be of the form: X shares of Y at Z GBP then X, Y and Z can be extracted like this:
strapplyc(s, "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = c)
ADDED Modified pattern to allow either digits or a dot. Also added (3) above and the following:
strapply(c(s, s), "[0-9.]+", as.numeric)
strapply(c(s, s), "[0-9.]+", as.numeric, simplify = rbind) # if each string has the same number of matches
strapply(c(s, s), "(\\d+) shares", as.numeric, simplify = c)
strapply(c(s, s), "([0-9.]+) GBP", as.numeric, simplify = c)
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP")
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = rbind)
You can use the sub function:
s <- "buy 1000 shares of Google at 1100 GBP"
# the number of shares
sub(".* (\\d+) shares.*", "\\1", s)
# [1] "1000"
# the stock
sub(".*shares of (\\w+) .*", "\\1", s)
# [1] "Google"
# the price
sub(".* at (\\d+) .*", "\\1", s)
# [1] "1100"
You can also use gregexpr and regmatches to extract all substrings at once:
regmatches(s, gregexpr("\\d+(?= shares)|(?<=shares of )\\w+|(?<= at )\\d+",
s, perl = TRUE))
# [[1]]
# [1] "1000" "Google" "1100"
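If the string is known to follow the fixed form "X shares of Y at Z GBP", another base R option is regexec, which returns the capture groups along with the full match. This is only a sketch under that assumption; the variable names shares, stock and price are illustrative.

```r
s <- "buy 1000 shares of Google at 1100 GBP"

# regexec() records the full match plus each parenthesised group
m <- regmatches(s, regexec("([0-9]+) shares of ([A-Za-z]+) at ([0-9]+) GBP", s))[[1]]

# m[1] is the whole match; m[2], m[3], m[4] are the three groups
shares <- as.numeric(m[2])
stock  <- m[3]
price  <- as.numeric(m[4])
print(list(shares = shares, stock = stock, price = price))
```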
I feel compelled to include the obligatory stringr solution as well.
library(stringr)
s <- "buy 1000 shares of Google at 1100 GBP"
str_match(s, "([0-9]+) shares")[2]
[1] "1000"
str_match(s, "([0-9]+) GBP")[2]
[1] "1100"
If you want to extract all digits from text, use this function from the stringi package.
"Nd" is the class of decimal digits.
stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
[[1]]
[1] "123"
[[2]]
[1] "43"
[[3]]
[1] "66" "123"
[[4]]
[1] NA
Please note that here the numbers 66 and 123 are extracted separately.

Extract phone number regex

How can I extract phone numbers from a text file?
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
This is a question that's been answered for other languages (see PHP and general regex) but doesn't seem to have been tackled on SO for R.
I have searched and found what appears to be possible regexes to find phone numbers (In addition to the regexes from other languages above): http://regexlib.com/Search.aspx?k=phone but have not been able to use gsub within R with these to extract all of these numbers in the example.
Ideally, we'd get something like:
[[1]]
[1] "2-613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "1.575.555.5555"
[[5]]
[1] "7164347566"
This is the best I've been able to do- you have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, brackets, hyphens or spaces":
library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")
Output:
[[1]]
[1] " 2-613-213-4567 " " 5555555555 "
[[2]]
[1] " 613 213 4567"
[[3]]
[1] " (613)2134567"
[[4]]
[1] "1.575.555.5555 "
[[5]]
[1] "7164347566"
The leading/trailing spaces could probably be fixed with some additional complexity, or you could just fix it in post.
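One way to "fix it in post" is to strip the surrounding whitespace from each match with base R's trimws. The list below is a hand-copied subset of the output shown above, used here as a stand-in for the real str_extract_all result.

```r
# Stand-in for part of the str_extract_all() output shown above
matches <- list(c(" 2-613-213-4567 ", " 5555555555 "), " 613 213 4567")

# trimws() removes leading and trailing whitespace but keeps inner spaces
cleaned <- lapply(matches, trimws)
print(cleaned)
```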
Update: a bit of searching led me to this answer, which I slightly modified to allow periods. It is a bit stricter in terms of requiring a valid (US?) phone number, but seems to cover all your examples:
str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
Output:
[[1]]
[1] "613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "575.555.5555"
[[5]]
[1] "7164347566"
The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:
huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"
The qdapRegex package now has rm_phone, specifically designed for this task; its extraction counterpart ex_phone is used below:
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
library(qdapRegex)
ex_phone(x)
## [[1]]
## [1] "613-213-4567" "5555555555"
##
## [[2]]
## [1] "613 213 4567"
##
## [[3]]
## [1] "(613)2134567"
##
## [[4]]
## [1] "1.575.555.5555"
##
## [[5]]
## [1] "7164347566"
You would need a complex regex to cover all the rules for matching phone numbers, but this one covers your examples:
> library(stringi)
> unlist(stri_extract_all_regex(x, '(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b'))
# [1] "2-613-213-4567" "5555555555" "613 213 4567" "(613)2134567"
# [5] "1.575.555.5555" "7164347566"
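The same pattern can also be tried without extra packages, using gregexpr and regmatches. This is a sketch on two shortened example strings (adapted from the question, not the full vector), with perl = TRUE assumed so that \\d and \\b behave as in the stringi call above.

```r
# Two shortened strings adapted from the question
x <- c("call 2-613-213-4567 or 5555555555",
       "Please contact Mr. Bean (613)2134567")

# Same pattern as in the stringi call above
pat <- "(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b"

# gregexpr() finds every match; regmatches() pulls the substrings out
phones <- regmatches(x, gregexpr(pat, x, perl = TRUE))
print(phones)
```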

Regex matching everything that's not a 4 digit number

I match and replace 4-digit numbers preceded and followed by white space with:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
sub("\\s\\d{4}\\s", "", str12)
[1] "coihr&/()= jngm 34 ljd"
but every attempt to invert this and extract the number instead fails.
I want:
[1] 1234
Does anyone have a clue?
PS: I know how to do it with {stringr}, but I am wondering whether it's possible with {base} only.
require(stringr)
gsub("\\s", "", str_extract(str12, "\\s\\d{4}\\s"))
[1] "1234"
regmatches(), only available since R-2.14.0, allows you to "extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec"
Here are examples of how you could use regmatches() to extract either the first whitespace-cushioned 4-digit substring in your input character string, or all such substrings.
## Example strings and pattern
x <- "coihr 1234 &/()= jngm 34 ljd" # string with 1 matching substring
xx <- "coihr 1234 &/()= jngm 3444 6789 ljd" # string with >1 matching substring
pat <- "(?<=\\s)(\\d{4})(?=\\s)"
## Use regexpr() to extract *1st* matching substring
as.numeric(regmatches(x, regexpr(pat, x, perl=TRUE)))
# [1] 1234
as.numeric(regmatches(xx, regexpr(pat, xx, perl=TRUE)))
# [1] 1234
## Use gregexpr() to extract *all* matching substrings
as.numeric(regmatches(xx, gregexpr(pat, xx, perl=TRUE))[[1]])
# [1] 1234 3444 6789
(Note that this will return numeric(0) for character strings not containing a substring matching your criteria).
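If an NA is preferred over numeric(0) for strings without a match, the call can be wrapped in a small helper. The function name extract_first is hypothetical and this is just one way to handle the empty case.

```r
pat <- "(?<=\\s)(\\d{4})(?=\\s)"

# Hypothetical helper: first whitespace-cushioned 4-digit number, or NA
extract_first <- function(x) {
  m <- regmatches(x, regexpr(pat, x, perl = TRUE))
  # regmatches() returns character(0) when there is no match
  if (length(m) > 0) as.numeric(m) else NA_real_
}

print(extract_first("coihr 1234 &/()= jngm 34 ljd"))
print(extract_first("no four digit number here"))
```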
It's possible to capture a group in a regex using (). Taking the same example:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
gsub(".*\\s(\\d{4})\\s.*", "\\1", str12)
[1] "1234"
I'm pretty naive about regex in general, but here's an ugly way to do it in base:
# if it's always in the same spot as in your example
unlist(strsplit(str12, split = " "))[2]
# or if it can occur in various places
str13 <- unlist(strsplit(str12, split = " "))
str13[!is.na(as.integer(str13)) & nchar(str13) == 4] # issues warning

split on last occurrence of digit, take 2nd part

If I have a string and want to split on the last digit and keep the last part of the split, how can I do that?
x <- c("ID", paste0("X", 1:10, state.name[1:10]))
I'd like
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
But would settle for:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
I can get the first part by:
unlist(strsplit(x, "[^0-9]*$"))
But want the second part.
Thank you in advance.
You can do this in one easy step with a regular expression:
gsub("(^.*\\d+)(\\w*)", "\\2", x)
Results in:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
[9] "Delaware" "Florida" "Georgia"
What the regex does:
"(^.*\\d+)(\\w*)": Look for two groups of characters.
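The first group (^.*\\d+) matches any characters from the start of the string, followed by at least one digit. Because .* is greedy, it runs up to the last digit in the string.
The second group (\\w*) matches zero or more word characters (letters, digits or underscore) after that digit.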
The "\\2" as the second argument to gsub() means to replace the original string with the second group that the regex found.
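To see the greedy matching at work, here is a small sketch on a shortened, hand-written version of x. It uses sub with a single capture group and a trailing $ instead of the two-group gsub call above, which is a simplification made for illustration.

```r
# Shortened stand-in for x from the question
x <- c("ID", "X1Alabama", "X10Georgia")

# Greedy .* consumes everything up to the last digit;
# the group keeps whatever word characters follow it
tail_part <- sub("^.*\\d+(\\w*)$", "\\1", x)
print(tail_part)
```

"ID" contains no digit, so the pattern does not match and the string is returned unchanged.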
library(stringr)
unlist(lapply(str_split(x, "[0-9]"), tail,n=1))
gives
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut" "Delaware"
[10] "Florida" "Georgia"
I would look at the stringr documentation for what is quite possibly an even better approach.
This seems a bit clunky, but it works:
state.pt2 <- unlist(strsplit(x,"^.[0-9]+"))
state.pt2[state.pt2!=""]
It would be nice to remove the ""'s generated by the match at the start of the string but I can't figure that out.
Here's another method using substr and gregexpr too that avoids having to subset the results:
substr(x,unlist(lapply(gregexpr("[0-9]",x),max))+1,nchar(x))
gsubfn
Try this gsubfn solution:
> library(gsubfn)
> strapply(x, ".*\\d(\\w*)|$", ~ if (nchar(z)) z else NA, simplify = TRUE)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
It matches the last digit followed by word characters and returns the word characters or if that fails it matches the end of line (to ensure that it matches something). If the first match succeeded then return it; otherwise, the back reference will be empty so return NA.
Note that the formula is a shorthand way of writing the function function(z) if (nchar(z)) z else NA and that function could alternately replace the formula at the expense of slightly more keystrokes.
gsub
A similar strategy could also work using just straight gsub but requires two lines and a marginally more complex regular expression. Here we use the second alternative to slurp up non-matches from the first alternative:
> s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
> ifelse(nchar(s), s, NA)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
EDIT: minor improvements