Replace matches according to the pattern that was matched - regex

Given a set of regular expressions, is there a simple way to match multiple patterns, and replace the matched text according to the pattern that was matched?
For example, for the following data x, each element begins with either a number or a letter, and ends with either a number or a letter. Let's call these patterns num_num (for begins with number, ends with number), num_let (begins with number, ends with letter), let_num, and let_let.
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
num_let='^\\d.*[[:alpha:]]$',
num_num='^\\d(.*\\d)?$',
let_num='^[[:alpha:]].*\\d$',
let_let='^[[:alpha:]](.*[[:alpha:]])$'
)
To replace each string with the name of the pattern it follows, we could do:
m <- lapply(type, grep, x)
rep(names(type), sapply(m, length))[order(unlist(m))]
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
Is there a more efficient approach?
gsubfn?
I know that with gsubfn we can simultaneously replace different matches, e.g.:
library(gsubfn)
gsubfn('.*', list('1p33'='foo', '123abc'='bar'), x)
## [1] "bar" "78fdsaq" "aq12111" "foo" "123" "pzv"
but I'm not sure whether the replacements can be made dependent on the pattern that was matched rather than on the match itself.
stringr?
str_replace_all doesn't play nicely with this example, since matches are replaced for patterns iteratively, and we end up with everything being overwritten with let_let:
library(stringr)
str_replace_all(x, setNames(names(type), unlist(type)))
## [1] "let_let" "let_let" "let_let" "let_let" "let_let" "let_let"
Reordering type so the pattern corresponding to let_let appears first solves the problem, but needing to do this makes me nervous.
type2 <- rev(type)
str_replace_all(x, setNames(names(type2), unlist(type2)))
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"

Perhaps one of these.
# base R method
mm2 <- character(length(x))
for (n in seq_along(type)) mm2 <- replace(mm2, grep(type[[n]], x), names(type)[n])
# purrr 0.2.0 method
library(purrr)
mm3 <- map(grep, .x=type, x = x) %>% (function(z) replace(x, flatten_int(z), rep(names(type), lengths(z))))
The base R method is somewhat faster than the posted code for both small and larger data sets. The purrr method is slower than the posted code for small data sets but about the same as the base R method for larger data sets.
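Another base R option, offered as a sketch rather than something taken from the answers here: build a logical matrix with one column per pattern and take the first matching column for each element. The names hits, first_hit, labels and n_hits are illustrative only.

```r
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)

# One row per element of x, one column per pattern
hits <- sapply(type, grepl, x)

# Index of the first matching pattern in each row (NA when nothing matches)
first_hit <- apply(hits, 1, function(r) which(r)[1])
labels <- names(type)[first_hit]

# How many patterns each element matched; anything other than 1 deserves a look
n_hits <- rowSums(hits)
```

Because the result is indexed by row, it stays in the original order of x, and n_hits makes it easy to spot elements that match no pattern or more than one.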

stringr
We can use str_replace_all if we alter the replacements so they are no longer matched by any of the regular expressions and then add an additional replacement to return them to their original form. For example
library(stringr)
type2 <- setNames(c(str_replace(names(type), "(.*)", "__\\1__"), "\\1"),
c(unlist(type), "^__(.*)__$"))
str_replace_all(x, type2)
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
grepl and tidyr
Another approach is to match first and then replace. One way to do this is to use grepl and tidyr:
library(plyr)
library(dplyr)
library(tidyr)
out <- data.frame(t(1*aaply(type, 1, grepl, x)))
out[out == 0] <- NA
out <- out %>%
mutate(id = 1:nrow(.)) %>%
gather(name,value, -id, na.rm = T) %>%
select(name)
as.character(out[,1])
## [1] "num_let" "num_let" "num_num" "num_num" "let_num" "let_let"
While this approach doesn't look as efficient, it makes it easy to find rows where there are more or fewer than one match. Note that the result above is grouped by pattern rather than in the original order of x.
From what I understand, substitution matching is implemented in pcre2, and I believe it allows this type of problem to be solved directly in the regex. Unfortunately, it seems that no one has built a pcre2 package for R yet.

Related

Use lapply on a subset of list elements and return list of same length as original in R

I want to apply a regex operation to a subset of list elements (which are character strings) using lapply and return a list of same length as the original. The list elements are long strings (derived from reading in long text files and collapsing paragraphs into a single string). The regex operation is valid only for the subset of list elements/strings. I want the non-subsetted list elements (character strings) to be returned in their original state.
The regex operation is str_extract from the stringr package, i.e. I want to extract a substring from a longer string. I subset the list elements based on a regex pattern in the filename.
An example with simplified data:
library(stringr)
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
filenames <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
names(texts) <- filenames
regexp <- "abcdef"
I know in advance to which strings I want to apply the regex operation, and hence I want to subset these strings. That is, I don't want to run the regex over all elements in the list, as doing so will return some invalid results (which is not apparent in this simplified example).
I've made a few naive efforts, e.g.:
x <- lapply(texts[str_detect(names(texts), "1997")], str_extract, regexp)
> x
$AB1997R.txt
[1] "abcdef"
$DC1997S.txt
[1] "abcdef"
which returns a reduced-length list containing just the substrings found.
But the results I want to get are:
> x
$AB1997R.txt
[1] "abcdef"
$BG2000S.txt
[1] "mnopqrstuvwxyz"
$MN1999R.txt
[1] "ghijklmnopqrs"
$DC1997S.txt
[1] "abcdef"
where the strings not containing the regex pattern are returned in their original state.
I have read up on stringr, lapply and llply (in the plyr package), but many operations are illustrated using data frames as examples, not lists, and don't involve regex operations on character strings. I can achieve my goal using a for loop, but I'm trying to get away from that, as is generally advised, and get better at using the apply family of functions.
You can use the subset operator [<-:
x <- texts
is1997 <- str_detect(names(texts), "1997")
x[is1997] <- lapply(texts[is1997], str_extract, regexp)
x
# $AB1997R.txt
# [1] "abcdef"
#
# $BG2000S.txt
# [1] "mnopqrstuvwxyz"
#
# $MN1999R.txt
# [1] "ghijklmnopqrs"
#
# $DC1997S.txt
# [1] "abcdef"
#
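If this comes up often, the same idea can be wrapped in a small helper. This is only a sketch in base R: apply_where and extract_first are made-up names, and regmatches/regexpr stand in for str_extract.

```r
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
names(texts) <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")

# Hypothetical helper: apply f only where keep is TRUE, leave the rest untouched
apply_where <- function(lst, keep, f, ...) {
  lst[keep] <- lapply(lst[keep], f, ...)
  lst
}

# First match of pattern in a single string (base R stand-in for str_extract)
extract_first <- function(s, pattern) regmatches(s, regexpr(pattern, s))

res <- apply_where(texts, grepl("1997", names(texts)), extract_first, "abcdef")
```

The result has the same length and names as texts; only the elements whose names contain 1997 are changed.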
You can try sub
sub(paste0('.*(', regexp, ').*'), '\\1', texts)
# AB1997R.txt BG2000S.txt MN1999R.txt DC1997S.txt
# "abcdef" "mnopqrstuvwxyz" "ghijklmnopqrs" "abcdef"
Also, if you need to match the names of 'texts' with 1997, we can use grep
indx <- grep('1997', names(texts))
texts[indx] <- sub(paste0('.*(', regexp, ').*'), '\\1', texts[indx])
as.list(texts)

regular expression in R-- new lines

I'm trying to use regular expressions in R with the regexpr function. I have multiple conditions to match, so my regular expression is very long, for example "A\s+(\d+)|(\d+)\s+A". I want to put each separate expression on a different line, like
"A\\s+(\\d+)|
(\\d+)\\s+A|"
But it's not working. The parentheses tell R that I want to extract the digits. Can anyone give suggestions?
1) paste Try using paste:
paste("A\\s+(\\d+)",
"(\\d+)\\s+A",
sep = "|")
2) rex Another possibility is to use the rex package
library(rex)
rex(group("A", spaces, capture(digits)) %or%
group(capture(digits), spaces, "A"))
which gives:
(?:(?:A[[:space:]]+([[:digit:]]+))|(?:([[:digit:]]+)[[:space:]]+A))
3) rebus The rebus package is similar in intent:
library(rebus)
literal("A") %R% one_or_more(space()) %R% capture(one_or_more(ascii_digit())) %|%
capture(one_or_more(digit())) %R% one_or_more(space()) %R% literal("A")
which emits:
<regex> \QA\E[[:space:]]+([0-9]+)|([[:digit:]]+)[[:space:]]+\QA\E
If you want to break a string literal up onto several lines in your script, one solution is to use paste0:
my_expr <- paste0('partone',
'parttwo',
'partthree')
Then you get the desired result:
> my_expr
[1] "partoneparttwopartthree"
You can't just break it up onto several lines between the quotes, because then the newline character becomes part of the expression.
If you are also trying to troubleshoot your regular expression, you'll need to post a sample of the data you are trying to work with and the desired result.
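To make the difference concrete, here is a small sketch using the pattern from the question. The strings vector is made-up test data, and pulling the digits out with gsub is just one possible way to do it.

```r
# Build the alternation from separate pieces, one per line
pattern <- paste("A\\s+(\\d+)",
                 "(\\d+)\\s+A",
                 sep = "|")

strings <- c("A 12", "34 A", "no match")   # made-up test data
matched <- grepl(pattern, strings)

# Pull out the matched text, then keep only the digits
found <- regmatches(strings, regexpr(pattern, strings))
digits <- gsub("\\D", "", found)

# A literal line break inside the quotes ends up in the pattern itself
broken <- "A\\s+(\\d+)|
(\\d+)\\s+A"
has_newline <- grepl("\n", broken, fixed = TRUE)
```

Here matched is TRUE for the first two strings only, and has_newline is TRUE, which is why the multi-line literal no longer behaves like the one-line pattern.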
Just use the x modifier with perl = TRUE in whatever function you're using. Place the x modifier, (?x), at the beginning of the expression and white space is ignored. Additionally, comments (from # to the end of the line) are ignored as well.
pat <- "(?x)\\\\ ## Grab a backslash followed by...
[a-zA-Z0-9]*cite[a-zA-Z0-9]* ## A word that contains 'cite'
(\\[([^]]+)\\]){0,2}\\** ## Look for 0-2 square brackets w/ content
\\{([a-zA-Z0-9 ,]+)\\}" ## Look for curly braces with viable bibkey
tex <- c(
"Many \\parencite*{Ted2005, Moe1999} say graphs \\textcite{Few2010}.",
"But \\authorcite{Ware2013} said perception good too.",
"Random words \\pcite[see][p. 22]{Get9999c}.",
"Still more \\citep[p. 22]{Foo1882c}?"
)
gsub(pat, "", tex, perl=TRUE)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
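Applied to the pattern from the question, the same trick might look like the sketch below. The strings vector is made-up test data; the layout of the pattern is a matter of taste.

```r
pat <- "(?x)            # free-spacing mode: whitespace and comments are ignored
  A \\s+ (\\d+)         # 'A', whitespace, then digits
  |                     # or
  (\\d+) \\s+ A         # digits, whitespace, then 'A'"

strings <- c("A 12", "34 A", "B 5")   # made-up test data
matched <- grepl(pat, strings, perl = TRUE)
```

Only the first two strings match. Remember that a literal space or # inside the pattern has to be escaped (or written as \\s, [ ], [#]) once (?x) is on.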
A second approach: I maintain a package called regexr that attempts to enable maintainers of regular expression libraries:
to write regular expressions in a way that is similar to the ways R code is written.
This may be overkill if you aren't planning long-term maintenance of the expression, but you could do the same thing with regexr (no need for perl = TRUE). Note the minimal comments, as the meaning is shared with the sub-expression names. The %:)% is a comment operator (commented code is happy code), but you need not use the leading names or comments; just construct:
library(regexr)
pat2 <- construct(
backslash = "\\\\" %:)% "\\",
cite_command = "[a-zA-Z0-9]*cite[a-zA-Z0-9]*" %:)% "parencite",
square_brack = "(\\[([^]]+)\\]){0,2}\\**" %:)% "[e.g.][p. 12]",
bibkeys = "\\{([a-zA-Z0-9 ,]+)\\}" %:)% "{Rinker2014}"
)
gsub(pat2, "", tex)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
The regexr framework requires a bit of upfront time, but the "code" is much easier to maintain, more modular, and easier for others to understand without learning a new "language". This is one approach of many, and I tend to use a combination of standard regex, regexr and rebus (which works within the regexr framework). So, for example, we can grab any of the sub expressions from pat2 with the subs function as follows:
subs(pat2)
## $backslash
## [1] "\\\\"
##
## $cite_command
## [1] "[a-zA-Z0-9]*cite[a-zA-Z0-9]*"
##
## $square_brack
## [1] "(\\[([^]]+)\\]){0,2}\\**"
##
## $bibkeys
## [1] "\\{([a-zA-Z0-9 ,]+)\\}"
I also included a simple way to test the main and sub expressions for perl validity as follows:
test(pat2)
## $regex
## [1] TRUE
##
## $subexpressions
## backslash cite_command square_brack bibkeys
## TRUE TRUE TRUE TRUE

Extract substring in R from string with fixed start position and end point as a character found

I want to do the following extraction in R.
I have a column which has links like these
http://www.imdb.com/title/tt2569314/companycredits
I want to extract the tt2569314 out of this and store it in a new column.
The way I want to do it is, say, take substring of column where start position is LEN(http://www.imdb.com/) and end position is dynamic based on when the first '/' is found after the start position.
I want this to be kind of a mixture of SUBSTR and INSTR in SQL.
Please advise.
You could try this:
a<-"http://www.imdb.com/title/tt2569314/companycredits"
sub("http://www.imdb.com/.+/(.+)/.+","\\1" ,a)
#[1] "tt2569314"
If all the links are similar in path structure, you can use the dirname
x <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("(.*)[/]", "", dirname(x))
# [1] "tt2569314"
Or you can paste together a regular expression with the base URL
y <- "http://www.imdb.com"
sub(paste0(y, "[/](.*)[/](.*)[/](.*)"), "\\2", x)
# [1] "tt2569314"
Or you may even be able to get away with this:
basename(dirname(x))
# [1] "tt2569314"
It's a bit more drawn out if you use the substring. But stringr has a couple of helpful functions.
library(stringr)
s1 <- str_locate_all(x, "[/]")[[1]]
s2 <- str_locate(x, "http://www.imdb.com/title")
m <- match(s2[,2]+1, s1[,1])
substr(x, s1[m,1]+1, s1[m+1,1]-1)
# [1] "tt2569314"
You could try:
str1 <- "http://www.imdb.com/title/tt2569314/companycredits"
library(httr)
gsub("^[^/]*\\/|\\/[^/]*", "", parse_url(str1)$path)
#[1] "tt2569314"
You may try this also,
> x <- "http://www.imdb.com/title/tt2569314/companycredits"
> m <- regexpr("^http://www.imdb.com/[^/]*/\\K[^/]+", x, perl=TRUE)
> regmatches(x, m)
[1] "tt2569314"
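For something closer to the SUBSTR/INSTR idea in the question, here is a base R sketch. It assumes every link starts with the same prefix; the links data frame, the second URL and the column name id are made up for illustration.

```r
links <- data.frame(
  url = c("http://www.imdb.com/title/tt2569314/companycredits",
          "http://www.imdb.com/title/tt0111161/fullcredits"),   # second URL is invented
  stringsAsFactors = FALSE
)

prefix <- "http://www.imdb.com/title/"   # assumed common prefix

# Like SUBSTR: drop the fixed-length prefix
rest <- substring(links$url, nchar(prefix) + 1)

# Like INSTR: position of the first "/" in what is left
slash <- as.integer(regexpr("/", rest, fixed = TRUE))

# Keep everything before that slash and store it in a new column
links$id <- substr(rest, 1, slash - 1)
```

If a link has no further slash, regexpr returns -1 and this sketch yields an empty string for that row, so such rows would need separate handling.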

Easy way to find and replace dynamic values ( {{example}} ) via regex in R

I have some dynamic values obtained from json of the format {{example_value}}. I have some R code which calculates the actual value. However, the only solution I have found to replace the placeholder with the actual value is very long and ugly.
Does anyone have any neat solutions?
Example of replacing {{example_value}} with 5.5:
> gsub( gsub("\\}","\\\\}",gsub("\\{","\\\\{","{{example_value}}")),
5.5, "{{example_value}}")
[1] "5.5"
Another example which explains why I wrote the nested gsub:
dictionary = "{{example_value}}"
> gsub( gsub("\\}","\\\\}",gsub("\\{","\\\\{",dictionary)),
5.5, "{{example_value}}")
[1] "5.5"
Typically dictionary is a list which contains all the dynamic values I expect to replace.
You can use this:
gsub("{{example_value}}", "5.5", subject, perl=TRUE);
While @zx81's suggestion seems most appropriate for a direct replace, you could also work with regular expressions to pull out tags in braces.
a<-"The total is {{example}} dollars less"
m <- regexpr("{{([^}]+)}}", a, perl=T)
regmatches(a, m)
# [1] "{{example}}"
And then regmatches has a nice feature where you can easily replace matches
regmatches(a, m) <- 5.5
a
# [1] "The total is 5.5 dollars less"
Which is kind of a neat trick.
EDIT: Perhaps this may lead you to what you're looking for.
re <- c('{{foo}}', '{{bar}}')
val <- c('5.5', '1.1')
recurse <- function(pattern, repl, x) {
for (i in 1:length(pattern))
x <- gsub(pattern[i], repl[i], x, perl=T)
x
}
x <- 'I have {{foo}} and {{bar}}'
recurse(re, val, x)
# [1] "I have 5.5 and 1.1"
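Since the placeholders are literal text, another option is to skip regex escaping altogether with fixed = TRUE. The sketch below assumes the dictionary is a named character vector; fill_placeholders and values are made-up names.

```r
# Hypothetical dictionary: placeholder name -> computed value
values <- c(foo = "5.5", bar = "1.1")

# Replace every {{name}} with its value, treating the placeholder as literal text
fill_placeholders <- function(template, values) {
  for (key in names(values)) {
    placeholder <- paste0("{{", key, "}}")
    template <- gsub(placeholder, values[[key]], template, fixed = TRUE)
  }
  template
}

filled <- fill_placeholders("I have {{foo}} and {{bar}}", values)
```

Placeholders that are not in the dictionary are simply left in place, which can help when checking for values that were never computed.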

Extract info inside all parenthesis in R

I have a character string and want to extract the information inside multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How would I do it so that it extracts the contents of all parentheses and returns them as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"
Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
but the result includes the parentheses... why?
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks @MartinMorgan for the comment.
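As for the "why": a lookahead only checks what comes next without consuming it, so a match that starts with (?=\\() begins at the opening parenthesis and includes it. A lookbehind checks what came before, so the match starts just after it. A side-by-side sketch (the variable names are illustrative):

```r
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"

# Lookahead for "(" and lookbehind for ")": the match starts AT "(" and ends AT ")"
with_lookahead <- regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl = TRUE))[[1]]

# Lookbehind for "(" and lookahead for ")": the match starts AFTER "(" and ends BEFORE ")"
with_lookbehind <- regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl = TRUE))[[1]]
```

The first call keeps the parentheses in each match; the second returns only the text between them.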
Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)
@kohske uses regmatches, but I'm currently using R 2.13, so I don't have access to that function at the moment. This adds a dependency on stringr, but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.
I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.
I like @kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a list of matrices, one per input string.
str_match_all(j, "(?<=\\().+?(?=\\))")
[,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note that perl = TRUE is needed here, because the default regex engine does not support lookarounds.
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.
Using rex may make this type of task a little simpler.
library(rex)
matches <- re_matches(j,
rex(
"(",
capture(name = "text", except_any_of(")")),
")"),
global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"