R programming: regexpr next occurence - regex

I have the following code and results:
> x <- c("ABCDE CDEFG FGHIJ")
> x
[1] "ABCDE CDEFG FGHIJ"
> regexpr("D", x)
[1] 4
attr(,"match.length")
[1] 1
regexpr only returns the first occurrence of "D", how can I get it to return all occurrences of "D"

You were so close -- just a couple of line down from regexpr in the help file...
gregexpr("D", x)
# [[1]]
# [1] 4 8
# attr(,"match.length")
# [1] 1 1
# attr(,"useBytes")
# [1] TRUE

You can also make use strsplit like this :
which(unlist(strsplit(x,split=""))=="D")
[1] 4 8
This way you can also have exact match for D.

I'm sure there will be a pure regex approach. However stringr::str_locate_all will suffice
library(stringr)
unique(unlist(str_locate_all(x, 'D')))
## [1] 4 8

Related

How to extract a part from a string in R

I have a problem when I tried to obtain a numeric part in R. The original strings, for example, is "buy 1000 shares of Google at 1100 GBP"
I need to extract the number of the shares (1000) and the price (1100) separately. Besides, I need to extract the number of the stock, which always appears after "shares of".
I know that sub and gsub can replace string, but what commands should I use to extract part of a string?
1) This extracts all numbers in order:
s <- "buy 1000 shares of Google at 1100 GBP"
library(gsubfn)
strapplyc(s, "[0-9.]+", simplify = as.numeric)
giving:
[1] 1000 1100
2) If the numbers can be in any order but if the number of shares is always followed by the word "shares" and the price is always followed by GBP then:
strapplyc(s, "(\\d+) shares", simplify = as.numeric) # 1000
strapplyc(s, "([0-9.]+) GBP", simplify = as.numeric) # 1100
The portion of the string matched by the part of the regular expression within parens is returned.
3) If the string is known to be of the form: X shares of Y at Z GBP then X, Y and Z can be extracted like this:
strapplyc(s, "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = c)
ADDED Modified pattern to allow either digits or a dot. Also added (3) above and the following:
strapply(c(s, s), "[0-9.]+", as.numeric)
strapply(c(s, s), "[0-9.]+", as.numeric, simplify = rbind) # if ea has same no of matches
strapply(c(s, s), "(\\d+) shares", as.numeric, simplify = c)
strapply(c(s, s), "([0-9.]+) GBP", as.numeric, simplify = c)
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP")
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = rbind)
You can use the sub function:
s <- "buy 1000 shares of Google at 1100 GBP"
# the number of shares
sub(".* (\\d+) shares.*", "\\1", s)
# [1] "1000"
# the stock
sub(".*shares of (\\w+) .*", "\\1", s)
# [1] "Google"
# the price
sub(".* at (\\d+) .*", "\\1", s)
# [1] "1100"
You can also use gregexpr and regmatches to extract all substrings at once:
regmatches(s, gregexpr("\\d+(?= shares)|(?<=shares of )\\w+|(?<= at )\\d+",
s, perl = TRUE))
# [[1]]
# [1] "1000" "Google" "1100"
I feel compelled to include the obligatory stringr solution as well.
library(stringr)
s <- "buy 1000 shares of Google at 1100 GBP"
str_match(s, "([0-9]+) shares")[2]
[1] "1000"
str_match(s, "([0-9]+) GBP")[2]
[1] "1100"
If you want to extract all digits from text use this function from stringi package.
"Nd" is the class of decimal digits.
stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
[[1]]
[1] "123"
[[2]]
[1] "43"
[[3]]
[1] "66" "123"
[[4]]
[1] NA
Please note that here 66 and 123 numbers are extracted separatly.

Split column label by number of letters/characters in R

I have a large dataset where all column headers are individual IDS, each 8 characters in length. I would like to split those individual IDs into 2 rows, where the first row of IDs contains the first 7 characters, and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique way to separate the column name (ie: there's a comma separating the 2 components, or a period).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ",")
Any help would be appreciated.
Here's a regular expression for a solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
res <- strsplit(ID1, "(?<=.{7})", perl = TRUE)
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
Explanation of the regex pattern:
(?<=.{7})
The (?<=) is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}. The dot (.) matches any character. {7} means 7 times. Hence, the regex matches the position that is preceded by exactly 7 characters.
Here is a gsubfn solution:
library(gsubfn)
strapplyc(ID1, "(.*)(.)", simplify = cbind)
which gives this matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
[2,] "A" "B" "A" "B" "A" "B"
or use rbind in place of cbind if you want two columns (rather than two rows).
There are a couple of ways you could go about this.
To extract the final character
First, with substr:
new.vec <- sapply(old.vec, function(x) substr(x, nchar(x), nchar(x)))
or, with sub:
new.vec <- sub('.*(.)', '\\1', old.vec)
where old.vec is the vector of strings that you want to split.
For interest, the latter option uses a regular expression that translates to: "capture (indicating by surrounding with parentheses) the single character (.) that follows zero or more other characters (.*), and replace matches with the captured content (\\1)". For more info, see ?gsub, and here.
The above options allow for varying string lengths. However, if you do always want to split after 7 characters, and the second part of the string always has just a single character, then the following should work:
new.vec <- substr(old.vec, 8, 8)
(Edited to include method for extracting the first part of the string.)
To extract all but the final character
The process is similar.
new.vec <- sapply(old.vec, function(x) substr(x, 1, nchar(x) - 1))
new.vec <- sub('(.*).', '\\1', old.vec)
new.vec <- substr(old.vec, 1, 7)

Extract phone number regex

How can I extract phone numbers from a text file?
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
This is a question that's been answered for other languages (see php abd general regex) but doesn't seem to have been tackled on SO for R.
I have searched and found what appears to be possible regexes to find phone numbers (In addition to the regexes from other languages above): http://regexlib.com/Search.aspx?k=phone but have not been able to use gsub within R with these to extract all of these numbers in the example.
Ideally, we'd get something like:
[[1]]
[1] "2-613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "1.575.555.5555"
[[5]]
[1] "7164347566"
This is the best I've been able to do- you have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, brackets, hyphens or spaces":
library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")
Output:
[[1]]
[1] " 2-613-213-4567 " " 5555555555 "
[[2]]
[1] " 613 213 4567"
[[3]]
[1] " (613)2134567"
[[4]]
[1] "1.575.555.5555 "
[[5]]
[1] "7164347566"
The leading/trailing spaces could probably be fixed with some additional complexity, or you could just fix it in post.
Update: a bit of searching lead me to this answer, which I slightly modified to allow periods. A bit stricter in terms of requiring a valid (US?) phone number, but seems to cover all your examples:
str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
Output:
[[1]]
[1] "613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "575.555.5555"
[[5]]
[1] "7164347566"
The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:
huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"
The qdapRegex now has the rm_phone specifically designed for this task:
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
library(qdapRegex)
ex_phone(x)
## [[1]]
## [1] "613-213-4567" "5555555555"
##
## [[2]]
## [1] "613 213 4567"
##
## [[3]]
## [1] "(613)2134567"
##
## [[4]]
## [1] "1.575.555.5555"
##
## [[5]]
## [1] "7164347566"
You would need a complex regex to cover all rules for matching phone numbers, but to cover your examples.
> library(stringi)
> unlist(stri_extract_all_regex(x, '(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b'))
# [1] "2-613-213-4567" "5555555555" "613 213 4567" "(613)2134567"
# [5] "1.575.555.5555" "7164347566"

Find location of $ in string

Based on the advice here: Find location of character in string, I tried this:
> gregexpr(pattern ='$',"data.frame.name$variable.name")
[[1]]
[1] 30
attr(,"match.length")
[1] 0
attr(,"useBytes")
[1] TRUE
But it didn't work; note:
> nchar("data.frame.name$variable.name")
[1] 29
How do you find the location of $ in this string?
The problem is that $ is the end-of-string marker in the regex. Try this instead:
> gregexpr(pattern ='\\$',"data.frame.name$variable.name")
[[1]]
[1] 16
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
... which gives the right answer - i.e. 16.
Here's one solution using strsplit and which
> which(strsplit("data.frame.name$variable.name", "")[[1]]=="$")
[1] 16

Regex matching everything that's not a 4 digit number

I match and replace 4-digit numbers preceded and followed by white space with:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
sub("\\s\\d{4}\\s", "", str12)
[1] "coihr&/()= jngm 34 ljd"
but, every try to invert this and extract the number instead fails.
I want:
[1] 1234
does someone has a clue?
ps: I know how to do it with {stringr} but am wondering if it's possible with {base} only..
require(stringr)
gsub("\\s", "", str_extract(str12, "\\s\\d{4}\\s"))
[1] "1234"
regmatches(), only available since R-2.14.0, allows you to "extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec"
Here are examples of how you could use regmatches() to extract either the first whitespace-cushioned 4-digit substring in your input character string, or all such substrings.
## Example strings and pattern
x <- "coihr 1234 &/()= jngm 34 ljd" # string with 1 matching substring
xx <- "coihr 1234 &/()= jngm 3444 6789 ljd" # string with >1 matching substring
pat <- "(?<=\\s)(\\d{4})(?=\\s)"
## Use regexpr() to extract *1st* matching substring
as.numeric(regmatches(x, regexpr(pat, x, perl=TRUE)))
# [1] 1234
as.numeric(regmatches(xx, regexpr(pat, xx, perl=TRUE)))
# [1] 1234
## Use gregexpr() to extract *all* matching substrings
as.numeric(regmatches(xx, gregexpr(pat, xx, perl=TRUE))[[1]])
# [1] 1234 3444 6789
(Note that this will return numeric(0) for character strings not containing a substring matching your criteria).
It's possible to capture group in regex using (). Taking the same example
str12 <- "coihr 1234 &/()= jngm 34 ljd"
gsub(".*\\s(\\d{4})\\s.*", "\\1", str12)
[1] "1234"
I'm pretty naive about regex in general, but here's an ugly way to do it in base:
# if it's always in the same spot as in your example
unlist(strsplit(str12, split = " "))[2]
# or if it can occur in various places
str13 <- unlist(strsplit(str12, split = " "))
str13[!is.na(as.integer(str13)) & nchar(str13) == 4] # issues warning