Capturing a repeating group - regex

I am trying to write a regex that would match and capture the following for me ...
String: 17+18+19+5+21
Numbers to be captured here (separately) are present in the array - [17,18,21].
Please note that the string can be of any length (following the same \d+ pattern), and the order of these numbers in the string is not fixed.
Thanks in advance

Given this setup:
library(gsubfn)
s <- "17+18+19+5+21"
a <- c(17, 18, 21)
1) Try this:
L <- as.list(c(setNames(a, a), NA))
strapply(s, "\\d+", L, simplify = na.omit)
giving:
[1] 17 18 21
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"
2) or this:
pat <- paste(a, collapse = "|")
strapplyc(s, pat, simplify = as.numeric)
giving:
[1] 17 18 21
3) or this non-regexp solution
intersect(scan(text = s, what = 0, sep = "+", quiet = TRUE), a)
giving
[1] 17 18 21
ADDED additional solution.

How about simply:
(17|18|21)
It needs to be a global match, so in Perl it would look like this:
$string =~ m/(17|18|21)/g
Example string:
21+18+19+5+21+18+19+17
Matches:
"21", "18", "21", "18", "17"
Working regex example:
http://regex101.com/r/jL8iF7

You can use gregexpr and regmatches:
vec <- "17+18+19+5+21"
a <- c(17, 18, 21)
pattern <- paste0("\\b(", paste(a, collapse = "|"), ")\\b")
# [1] "\\b(17|18|21)\\b"
regmatches(vec, gregexpr(pattern, vec))[[1]]
# [1] "17" "18" "21"
Note that this matches the exact number, i.e., 17 does not match 177.
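To see what the word boundaries buy you, here is a minimal sketch comparing the bare alternation with the \b-wrapped one. The test string s2 and the names loose and strict are made up for illustration and are not from the question.

```r
# Hypothetical input: 177 and 217 contain "17" but are not wanted
s2 <- "177+17+217+18"
a <- c(17, 18)
alt <- paste(a, collapse = "|")   # "17|18"

# Without boundaries, "17" is also found inside 177 and 217
loose <- regmatches(s2, gregexpr(alt, s2))[[1]]

# With \\b on both sides, only whole numbers match
strict <- regmatches(s2, gregexpr(paste0("\\b(", alt, ")\\b"), s2))[[1]]

print(loose)   # "17" "17" "17" "18"
print(strict)  # "17" "18"
```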

Related

str_extract specific patterns (example)

I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
my approaches so far:
vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))
or
str_extract(str2, "[A-Z][0-9]{5}[A-Z]")
The expected outputs are
A1234B
A12345B
A12345
Thanks!
You can try
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by an optional capital letter [A-Z]?
Or you can use stringi which would be faster
library(stringi)
stri_extract(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
data
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
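As a quick check of why the {5} attempt from the question misses two of the inputs, the sketch below compares it with the relaxed pattern using base R only. The variable names strict and relaxed are mine, not the asker's.

```r
str2 <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")

strict  <- "[A-Z][0-9]{5}[A-Z]"     # exactly 5 digits and a mandatory trailing letter
relaxed <- "[A-Z][0-9]{4,5}[A-Z]?"  # 4 or 5 digits, trailing letter optional

# Only the second string satisfies the strict pattern
strict_hit <- grepl(strict, str2)
print(strict_hit)   # FALSE TRUE FALSE

# The relaxed pattern finds a match in all three
relaxed_match <- regmatches(str2, regexpr(relaxed, str2))
print(relaxed_match)  # "A1234B" "A12345B" "A12345"
```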
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub and this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"
Using rex to construct the regular expression may make it more understandable.
library(rex)
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1: assumes the target always sits between the second and third underscores.
re_matches(x,
rex(
"_",
anything,
"_",
capture(anything),
"_"
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
re_matches(x,
rex(
capture(
alpha,
between(digit, 4, 5),
maybe(alpha)
)
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed=TRUE), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl=TRUE))
# [1] "A1234B" "A12345B" "A12345"

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub: Try sub with a regular expression that matches the shortest prefix up to the first digit, then the digit itself, then everything after it, and replaces the whole match with that digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same result. Alternatively, use the following, which gives the same answer but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
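One caveat with the sub approach in 1) is that an input without any digit comes back unchanged rather than as NA. A minimal sketch follows, with a made-up third element to show that case; perl = TRUE is my choice here so that the lazy quantifier follows PCRE rules.

```r
# The third element is a hypothetical input with no digit at all
strings <- c("ABC3JFD456", "ARST4DS324", "NODIGITS")

first <- sub(".*?(\\d).*", "\\1", strings, perl = TRUE)
print(first)  # "3" "4" "NODIGITS"

# Mark inputs that contain no digit as missing
first[!grepl("\\d", strings)] <- NA
print(first)  # "3" "4" NA
```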
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]] > 0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
Using rex may make this type of task a little simpler.
library(rex)
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
rex(
capture(name = "first_number", digit)
)
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So:
> splstr <- strsplit(string, "")
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Regex/ Substring

I have a sequence like this in a list "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
I would like to create a substring like wherever a "K" is present it needs to pull out 6 characters before and 6 characters after "K"
Ex : MSGSRRKATPASR , here -6..K..+6
for the whole sequence..I tried the substring function in R but we need to specify the start and end position. Here the positions are unknown
Thanks
.{6}K.{6}
Try this. It should give the desired result.
See demo.
http://regex101.com/r/dM0rS8/4
use this:
\w{7}(?<=K)\w{6}
This matches seven word characters, uses a positive lookbehind to assert that the seventh one is K, and then matches six more word characters.
demo here: http://regex101.com/r/pK3jK1/2
Using rex may make this type of task a little simpler.
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
library(rex)
re_matches(x,
rex(
capture(name = "amino_acids",
n(any, 6),
"K",
n(any, 6)
)
),
global = TRUE)[[1]]
#> amino_acids
#>1 MSGSRRKATPASR
#>2 GEGSFAKVKYAKN
#>3 GDQAAIKILDREK
#>4 KMVEQLKREISTM
#>5 IEVMASKTKIYIV
#>6 GGELFDKIAQQGR
#>7 VYHRDLKPENLIL
#>8 DANGVLKVSDFGL
#>9 PEVLSDKGYDGAA
#>10 NLMTLYKRICKAE
#>11 WFSQGAKRVIKRI
#>12 LEDEWFKKGYKPP
#>13 AAFSNSKECLVTE
#>14 LENLFEKQAQLVK
#>15 ASEIMSKMEETAK
#>16 LGFNVRKDNYKIK
#>17 GDKSGRKGQLSVA
#>18 HVVELRKTGGDTL
#>19 VCDSFYKNFSSGL
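However, the matches above cannot overlap, so a K that falls inside an earlier match is not reported as the centre of its own window.
If you want a window for every K that has at least 6 characters on each side: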
library(rex)
locs <- re_matches(x,
rex(
"K" %if_prev_is% n(any, 6) %if_next_is% n(any, 6)
),
global = TRUE, locations = TRUE)[[1]]
substring(x, locs$start - 6, locs$end + 6)
#> [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN" "GSFAKVKYAKNTV" "AKVKYAKNTVTGD"
#> [5] "GDQAAIKILDREK" "KILDREKVFRHKM" "EKVFRHKMVEQLK" "KMVEQLKREISTM"
#> [9] "REISTMKLIKHPN" "STMKLIKHPNVVE" "IEVMASKTKIYIV" "VMASKTKIYIVLE"
#>[13] "GGELFDKIAQQGR" "AQQGRLKEDEARR" "VYHRDLKPENLIL" "DANGVLKVSDFGL"
#>[17] "PEVLSDKGYDGAA" "NLMTLYKRICKAE" "LYKRICKAEFSCP" "WFSQGAKRVIKRI"
#>[21] "GAKRVIKRILEPN" "LEDEWFKKGYKPP" "EDEWFKKGYKPPS" "WFKKGYKPPSFDQ"
#>[25] "AAFSNSKECLVTE" "ECLVTEKKEKPVS" "CLVTEKKEKPVSM" "VTEKKEKPVSMNA"
#>[29] "LENLFEKQAQLVK" "KQAQLVKKETRFT" "QAQLVKKETRFTS" "ASEIMSKMEETAK"
#>[33] "KMEETAKPLGFNV" "LGFNVRKDNYKIK" "VRKDNYKIKMKGD" "KDNYKIKMKGDKS"
#>[37] "NYKIKMKGDKSGR" "IKMKGDKSGRKGQ" "GDKSGRKGQLSVA" "HVVELRKTGGDTL"
#>[41] "DTLEFHKVCDSFY" "VCDSFYKNFSSGL" "NFSSGLKDVVWNT"
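The same contrast between non-overlapping matches and one window per K can be sketched in base R, without rex. This is only an illustration under my own assumptions: x below is the first 44 residues of the sequence so the result is easy to inspect, and the names non_overlapping, k_pos and windows are mine.

```r
# First 44 residues of the sequence from the question
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQ"

# Plain pattern: matches cannot overlap, so nearby K's are skipped
non_overlapping <- regmatches(x, gregexpr(".{6}K.{6}", x))[[1]]

# Lookarounds consume only the K itself, so every K with
# 6 characters on each side is found
k_pos <- as.integer(gregexpr("(?<=.{6})K(?=.{6})", x, perl = TRUE)[[1]])
windows <- substring(x, k_pos - 6, k_pos + 6)

print(non_overlapping)  # "MSGSRRKATPASR" "GEGSFAKVKYAKN"
print(windows)          # four windows, one per K
```

The first four windows agree with the first four entries of the rex output above.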

R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches)

I have a question involving conditional replace.
I essentially want to find every run of digits and, in each run, replace every digit after the fourth with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient solution):
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)
Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookbehind: it matches a position that is not preceded by a digit. The string (\\d{4}) captures 4 consecutive digits. Finally, \\d* matches any number of further digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
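Note that this replacement drops the surplus digits rather than padding them, so the strings get shorter. A small sketch on made-up inputs (not the question's data) shows the effect and that NA passes through untouched:

```r
# Hypothetical test values: a 6-digit run, a 10-digit run, and a missing value
input <- c("123456 12", "1234567890", NA)

out <- gsub("(?<!\\d)(\\d{4})\\d*", "\\1", input, perl = TRUE)
print(out)  # "1234 12" "1234" NA

# Number of characters removed from each non-missing string
removed <- nchar(input[1:2]) - nchar(out[1:2])
print(removed)  # 2 6
```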
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
  # m holds the match starts for one string (NA for NA input, -1 for no match)
  if (!is.na(m[1L]) && m[1L] != -1L) {
    len <- attr(m, "match.length")
    for (i in seq_along(m)) {
      # overwrite the i-th run of surplus digits with the same number of spaces
      substr(d, m[i], m[i] + len[i] - 1L) <- strrep(" ", len[i])
    }
  }
  return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
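To make the breakdown above concrete, here is a minimal sketch on two made-up inputs. Each surplus digit is replaced by exactly one space, so the string lengths should be unchanged; the inputs and the names input and out are assumptions for illustration only.

```r
# Hypothetical inputs: one with a 6-digit run plus a short run, one 10-digit run
input <- c("123456 12", "1234567890")

# \\d{4} starts a chain at the fifth digit; \\G(?!\\A) continues it digit by digit
out <- gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", input, perl = TRUE)
print(out)

# One space per removed digit, so the lengths are preserved
print(nchar(out) == nchar(input))
```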
Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=TRUE), zz))
The difference here is that it turns the NA value into "". We could add other checks to prevent that, or just turn all zero-length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.

How to prevent regmatches drop non matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
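The same idea can be wrapped in a small helper. This is only a sketch: first_match is a hypothetical name, not a base R function, and the extra is.na() check is my addition so that NA inputs also come back as NA.

```r
# Hypothetical helper: first match per element, NA where nothing matches
first_match <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  hit <- !is.na(m) & m != -1L          # elements that actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches returns only the hits
  out
}

res <- first_match("a+", c("abc", "def", "cba a", "aa"))
print(res)  # "a" NA "a" "aa"

res_na <- first_match("a+", c("abc", NA, "aa"))
print(res_na)  # "a" NA "aa"
```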
use regexec instead, since it returns a list which will allow you to catch the character(0)'s before unlisting
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In most cases of interest (matching a single pattern with regexpr), each element has length 3 or 1: length 1 when no match was found and length 3 when there was a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
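The type difference shows up most clearly when nothing matches at all. In the sketch below, no_match stands in for what regmatches(..., invert=NA) returns when no element matches; the names no_match, plain and typed are made up for the illustration.

```r
# Stand-in for the invert = NA result when no element matches
no_match <- list("def", "xyz")

plain <- sapply(no_match, function(x) if (length(x) == 1) NA else x[2])
typed <- sapply(no_match, function(x) if (length(x) == 1) NA_character_ else x[2])

print(typeof(plain))  # "logical"
print(typeof(typed))  # "character"
```

With plain NA the result silently becomes a logical vector, which can surprise code downstream that expects character data.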
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA if there is no match, and it has a simpler interface than regmatches().