str_extract specific patterns (example) - regex

I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
my approaches so far:
vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))
or
str_extract(str2, "[A-Z][0-9]{5}[A-Z]")
The expected outputs are
A1234B
A12345B
A12345
Thanks!

You can try
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by a capital letter [A-Z] ?
Or you can use stringi which would be faster
library(stringi)
stri_extract(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
data
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')

vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub and this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"

Using rex to construct the regular expression may make it more understandable.
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1, assumes always is between the second underscores.
re_matches(x,
rex(
"_",
anything,
"_",
capture(anything),
"_"
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
re_matches(x,
rex(
capture(
alpha,
between(digit, 4, 5),
maybe(alpha)
)
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345

You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed=T), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl=T))

Related

Regexp in R to match everything in between first and last occurene of some specified character

I'd like to match everything between the first and last underscore. I use R.
What I have until now is this:
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
sub('[^_]*_(.*)_[^_]*', x = p.subject, replacement = '\\1', perl = T)
Where 'bla' is any character except an underscore...
The result I'd like would be something like this:
c(NA, NA, bla, bla_bla)
I can't figure it out! Why does the first pattern match? It shouldn't because the pattern must have 2 underscores! Or do I have to use some kind of lookahead expression?
Your help is very welcome!
You can use gsub:
vec <- gsub("(^[^_]+)_?|_?([^_]+$)", "", p.subject)
vec <- ifelse(nchar(vec) == 0 , NA, vec)
vec
[1] NA NA "bla" "bla_bla"
Data:
dput(p.subject)
c("bla_bla", "bla", "bla_bla_bla", "bla_bla_bla_bla")
Here is another option using str_extract. We use regex lookarounds to extract the pattern between the first and the last occurrence of a specified character i.e. _.
library(stringr)
str_extract(p.subject, "(?<=[^_]{1,30}_).*(?=_[^_]+)")
#[1] NA NA "bla" "bla_bla"
NOTE: We didn't use any ifelse.
data
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')

R remove first character from string

I can remove the last character from a string:
listfruit <- c("aapplea","bbananab","oranggeo")
gsub('.{1}$', '', listfruit)
But I am having problems trying to remove the first character from a string.
And also the first and last character.
I would be grateful for your help.
If we need to remove the first character, use sub, match one character (. represents a single character), replace it with ''.
sub('.', '', listfruit)
#[1] "applea" "bananab" "ranggeo"
Or for the first and last character, match the character at the start of the string (^.) or the end of the string (.$) and replace it with ''.
gsub('^.|.$', '', listfruit)
#[1] "apple" "banana" "rangge"
We can also capture it as a group and replace with the backreference.
sub('^.(.*).$', '\\1', listfruit)
#[1] "apple" "banana" "rangge"
Another option is with substr
substr(listfruit, 2, nchar(listfruit)-1)
#[1] "apple" "banana" "rangge"
library(stringr)
str_sub(listfruit, 2, -2)
#[1] "apple" "banana" "rangge"
Removing first and last characters.
For me performance was important so I run a quick benchmark with the available solutions.
library(magrittr)
comb_letters = combn(letters,5) %>% apply(2, paste0,collapse = "")
bench::mark(
gsub = {gsub(pattern = '^.|.$',replacement = '',x = comb_letters)},
substr = {substr(comb_letters,start = 2,stop = nchar(comb_letters) - 1)},
str_sub = {stringr::str_sub(comb_letters,start = 2,end = -2)}
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 gsub 32.9ms 33.7ms 29.7 513.95KB 0
#> 2 substr 15.07ms 15.84ms 62.7 1.51MB 2.09
#> 3 str_sub 5.08ms 5.36ms 177. 529.34KB 2.06
Created on 2019-12-30 by the reprex package (v0.3.0)

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occuring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub Try sub with the indicated regular expression which takes the shortest string until a digit, a digit and then everything following and replaces it with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same or use this which gives the same answer again but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]]>=0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
Using rex may make this type of task a little simpler.
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
rex(
capture(name = "first_number", digit)
)
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So:
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Capturing a repeating group

I am trying to write a regex that would match and capture the following for me ...
String: 17+18+19+5+21
Numbers to be captured here (separately) are present in the array - [17,18,21].
Please note that the string can be n character long (following the same pattern of \d+) and the order of these numbers in the string are not fixed.
Thanks in advance
Given this setup:
library(gsubfn)
s <- "17+18+19+5+21"
a <- c(17, 18, 21)
1) Try this:
L <- as.list(c(setNames(a, a), NA))
strapply(s, "\\d+", L, simplify = na.omit)
giving:
[1] 17 18 21
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"
2) or this:
pat <- paste(a, collapse = "|")
strapplyc(s, pat, simplify = as.numeric)
giving:
[1] 17 18 21
3) or this non-regexp solution
intersect(scan(text = s, what = 0, sep = "+", quiet = TRUE), a)
giving
[1] 17 18 21
ADDED additional solution.
How about simply:
(17|18|21)
It needs to be a global match, so in Pearl it would be like this:
$string =~ m/(17|18|21)/g
Example string:
21+18+19+5+21+18+19+17
Matches:
"21", "18", "21", "18", "17"
Working regex example:
http://regex101.com/r/jL8iF7
Use can use gregexpr and regmatches:
vec <- "17+18+19+5+21"
a <- c(17, 18, 21)
pattern <- paste0("\\b(", paste(a, collapse = "|"), ")\\b")
# [1] "\\b(17|18|21)\\b"
regmatches(vec, gregexpr(pattern, vec))[[1]]
# [1] "17" "18" "21"
Note that this matches the exact number, i.e., 17 does not match 177.

How to prevent regmatches drop non matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
use regexec instead, since it returns a list which will allow you to catch the character(0)'s before unlisting
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list, typically, in most cases of interest, (matching a single pattern), regmatches with this argument will return a list with elements of either length 3 or 1. 1 is the case of where no matches are found and 3 is the case with a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It will return NA if there is no matches. It has simpler function interface than regmatches() as well.