dynamic regex in R - regex

The code below works as long as the before and after strings contain no characters that are special in a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' #parentheses cause problem in regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?

In Perl, there is http://perldoc.perl.org/functions/quotemeta.html for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
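For completeness, here is a minimal sketch that runs the escaped pattern end to end. The question never shows x, so the input string below (including the name "Jane Doe") is made up purely for illustration.

```r
# Escape every ASCII non-word character, as in the answer above.
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)

before <- 'Name of your Manager (note "self" if you are the Manager)'
after  <- 'CURRENT FOCUS'

# Hypothetical input: not taken from the question.
x <- paste0(before, " Jane Doe ", after)

# Same lookbehind/lookahead pattern as above, with both literals escaped.
pattern <- paste0("(?<=", quotemeta(before), ").*?(?=", quotemeta(after), ")")
ex <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(ex)
```

The extracted text keeps its surrounding spaces; wrap the result in trimws() if they are not wanted.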

Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
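To pull out the matched text rather than just test for it, the same pattern can be handed to regmatches. This is only a sketch on the test data above. One caveat to keep in mind: a literal that itself contains \E would end the quoted span early, so this approach assumes the before and after strings never contain that sequence.

```r
before <- "A."
after  <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")

# \Q...\E makes PCRE treat the enclosed text literally.
pattern <- sprintf("(?<=\\Q%s\\E).*?(?=\\Q%s\\E)", before, after)

# regexpr + regmatches returns only the elements that matched.
hits <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(hits)
```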

dnagirl, such a function exists and is glob2rx
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
tt <- glob2rx(a)
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"
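One thing to watch with this approach: glob2rx translates wildcard (glob) patterns, so "*" and "?" in the input become regex wildcards instead of being escaped. A short sketch of that behaviour (the file name "report.doc" is just an example):

```r
# glob2rx() is a glob-to-regex translator, not a general-purpose escaper.
rx <- utils::glob2rx("*.doc")
print(rx)

# The "*" became ".*", so a string with no literal "*" still matches.
print(grepl(rx, "report.doc"))
```

So it only works as an escaping function when the strings are known not to contain "*" or "?".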

Related

Extract the last word between | |

I have the following dataset
> head(names$SAMPLE_ID)
[1] "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
[2] "Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
[3] "Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|"
[4] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[5] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[6] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
I want to extract the last word between || as a new variable i.e.
Acinetobacter
Bacillus
Haemophilus
I have tried using
library(stringr)
names$sample2 <- str_match(names$SAMPLE_ID, "|.*?|")
We can use
library(stringi)
stri_extract_last_regex(v1, '\\w+')
#[1] "Acinetobacter"
data
v1 <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
Using just base R:
myvar <- gsub("^..*\\|(\\w+)\\|$", "\\1", names$SAMPLE_ID)
^.*\\|\\K.*?(?=\\|)
Use \K to drop everything matched so far from the final match. See the demo. Also use perl=TRUE.
https://regex101.com/r/fM9lY3/45
x <- c("Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|",
"Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|" )
unlist(regmatches(x, gregexpr('^.*\\|\\K.*?(?=\\|)', x, perl = TRUE)))
# [1] "Streptococcus" "Streptococcus"
The ending is all you need: [^|]+(?=\|$)
Per @RichardScriven:
Which in R would be regmatches(x, regexpr("[^|]+(?=\\|$)", x, perl = TRUE))
You can use the "stringr" package as well in this case. Here is the code:
library(stringr)
v <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
v1 <- str_replace_all(v, "\\|", " ")
word(v1, -2)
Here I used v as the string. The basic idea is to replace every | with a space and then pick a word counting from the end with word(). Because the string ends in |, v1 ends in a space, which is why the index is -2 rather than -1.
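If no extra package is wanted, splitting on the literal | is another option. This is only a sketch on two of the sample IDs from the question; it relies on strsplit not returning a trailing empty field, and it assumes every element has at least one non-empty field.

```r
ids <- c(
  "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|",
  "Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
)

# fixed = TRUE treats "|" as a plain character, so no escaping is needed.
parts <- strsplit(ids, "|", fixed = TRUE)
last_field <- vapply(parts, function(p) p[length(p)], character(1))
print(last_field)

# Cross-check against the lookahead regex quoted above.
via_regex <- regmatches(ids, regexpr("[^|]+(?=\\|$)", ids, perl = TRUE))
print(identical(last_field, via_regex))
```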

String split in R skipping first delimiter if multiple delimiters are present

I have "elephant_giraffe_lion" and "monkey_tiger" strings.
The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at that delimiter. So the results I want to get in this example are "elephant_giraffe" and "monkey".
mystring<-c("elephant_giraffe_lion", "monkey_tiger")
result
"elephant_giraffe" "monkey"
You can anchor your split to the end of the string using $,
unlist(strsplit(mystring, "_[a-z]+$"))
# [1] "elephant_giraffe" "monkey"
Edit
The above only matches the last "_", not accounting for cases where there are more than two "_". For the more general case, you could try
mystring<-c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
tmp <- gsub("([^_]+_[^_]+).*", "\\1", mystring)
tmp[tmp==mystring] <- sapply(strsplit(tmp[tmp==mystring], "_"), `[[`, 1)
tmp
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
You could also use gsubfn, to process the match with a function
library(gsubfn)
f <- function(x,y) if (y==x) strsplit(y, "_")[[1]][[1]] else y
gsubfn("([^_]+_[^_]+).*", f, mystring, backref=1)
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
As I posted an answer on your other related question, a base R solution:
x <- c('elephant_giraffe_lion', 'monkey_tiger', 'foo_bar_baz_bap')
sub('^(?|([^_]*_[^_]*)_.*|([^_]*)_[^_]*)$', '\\1', x, perl=TRUE)
# [1] "elephant_giraffe" "monkey" "foo_bar"
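The rule from the question can also be written without a regex at all, which some may find easier to read. A minimal sketch, where split_head is a made-up helper name: keep the first two pieces when there are at least two delimiters, otherwise keep only the first piece.

```r
# Hypothetical helper: split on "_" and keep the head as described in the question.
split_head <- function(s) {
  vapply(strsplit(s, "_", fixed = TRUE), function(p) {
    if (length(p) >= 3) paste(p[1:2], collapse = "_") else p[1]
  }, character(1))
}

mystring <- c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
res <- split_head(mystring)
print(res)
```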

R regular expression: isolate a string between quotes

I have a string myFunction(arg1=\"hop\",arg2=TRUE). I want to isolate what is in between quotes (\"hop\" in this example)
I have tried so far with no success:
gsub(pattern="(myFunction)(\\({1}))(.*)(\\\"{1}.*\\\"{1})(.*)(\\){1})",replacement="//4",x="myFunction(arg1=\"hop\",arg2=TRUE)")
Any help by a regex guru would be welcome!
Try
sub('[^\"]+\"([^\"]+).*', '\\1', x)
#[1] "hop"
Or
sub('[^\"]+(\"[^\"]+.).*', '\\1', x)
#[1] "\"hop\""
The \" is not needed as " would work too
sub('[^"]*("[^"]*.).*', '\\1', x)
#[1] "\"hop\""
If there are multiple matches, as @AvinashRaj mentioned in his post, sub may not be that useful. An option using stringi would be
library(stringi)
stri_extract_all_regex(x1, '"[^"]*"')[[1]]
#[1] "\"hop\"" "\"hop2\""
data
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
x1 <- "myFunction(arg1=\"hop\",arg2=TRUE arg3=\"hop2\", arg4=TRUE)"
You could also use the regmatches function. sub or gsub only works for a particular input; for the general case you must extract the matches instead of removing everything else.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> regmatches(x, gregexpr('"[^"]*"', x))[[1]]
[1] "\"hop\""
To get only the text inside the quotes, pass the result of the above call to gsub, which removes the quotes.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop"
> x <- "myFunction(arg1=\"hop\",arg2=\"TRUE\")"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop" "TRUE"
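The two steps above can be wrapped into one small function that works on a whole character vector. This is a sketch only; extract_quoted is a made-up name and the three input strings are invented test cases.

```r
# Hypothetical helper: return the text inside each pair of double quotes.
extract_quoted <- function(s) {
  hits <- regmatches(s, gregexpr('"[^"]*"', s))
  # Strip the surrounding quotes from every match.
  lapply(hits, function(h) gsub('"', "", h, fixed = TRUE))
}

res <- extract_quoted(c('f(a="hop", b=TRUE)', 'g(a="x", b="y")', 'h()'))
print(res)
```

The result is a list with one element per input string, and an input without quotes yields an empty character vector.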
You can try:
str='myFunction(arg1=\"hop\",arg2=TRUE)'
gsub('.*(\\".*\\").*','\\1',str)
#[1] "\"hop\""
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
unlist(strsplit(x,'"'))[2]
# [1] "hop"

How to fill gap between two characters with regex

I have a data set like below. I would like to replace all dots between two 1's with 1's, as shown in the desired.result. Can I do this with regex in base R?
I tried:
regexpr("^1\\.1$", my.data$my.string, perl = TRUE)
Here is a solution in C#: "Characters between two exact characters"
Thank you for any suggestions.
my.data <- read.table(text='
my.string state
................1...............1. A
......1..........................1 A
.............1.....2.............. B
......1.................1...2..... B
....1....2........................ B
1...2............................. C
..........1....................1.. C
.1............................1... C
.................1...........1.... C
........1....2.................... C
......1........................1.. C
....1....1...2.................... D
......1....................1...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
desired.result <- read.table(text='
my.string state
................11111111111111111. A
......1111111111111111111111111111 A
.............1.....2.............. B
......1111111111111111111...2..... B
....1....2........................ B
1...2............................. C
..........1111111111111111111111.. C
.111111111111111111111111111111... C
.................1111111111111.... C
........1....2.................... C
......11111111111111111111111111.. C
....111111...2.................... D
......1111111111111111111111...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
Below is an option using gsub with the \G feature and lookaround assertions.
> gsub('(?:1|\\G(?<!^))\\K\\.(?=\\.*1)', '1', my.data$my.string, perl = TRUE)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."
# [7] "..........1111111111111111111111.." ".111111111111111111111111111111..."
# [9] ".................1111111111111...." "........1....2...................."
# [11] "......11111111111111111111111111.." "....111111...2...................."
# [13] "......1111111111111111111111......" ".................1...2............"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position where the previous match ended. Since you want to leave the dots at the start of the string alone, we add the lookbehind (?<!^) after \G to exclude the start-of-string position.
The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included.
You can find an overall breakdown that explains the regular expression here.
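To see the two pieces in isolation, here is a small sketch on made-up strings (neither comes from the question's data). The first call shows \K on its own; the second runs the full pattern on one short string.

```r
# \K: "foo" must be present, but only "bar" is reported and replaced.
k_demo <- sub("foo\\Kbar", "X", "foobar", perl = TRUE)
print(k_demo)

# Full pattern on a short string: only the dots between the two 1's change.
s <- "..1...1."
filled <- gsub("(?:1|\\G(?<!^))\\K\\.(?=\\.*1)", "1", s, perl = TRUE)
print(filled)
```

Each replacement consumes a single dot. The first dot after a 1 is reached through the "1" alternative, and every following dot through \G, as long as a 1 still lies ahead with only dots in between.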
Using gsubfn, the first argument is a regular expression which matches the 1's and the characters between them. The second argument is a function, expressed in formula notation, which uses gsub to replace each character in the match with 1:
library(gsubfn)
transform(my.data, my.string = gsubfn("1.*1", ~ gsub(".", 1, x), my.string))
If there can be multiple pairs of 1's in a string then use "1.*?1" as the regular expression instead.
Here is an option that uses a relatively simple regex and the standard combination of gregexpr(), regmatches(), and regmatches<-() to identify, extract, operate on, and then replace substrings matching that regex.
## Copy the character vector
x <- my.data$my.string
## Find sequences of "."s bracketed on either end by a "1"
m <- gregexpr("(?<=1)\\.+(?=1)", x, perl=TRUE)
## Standard template for operating on and replacing matched substrings
regmatches(x,m) <- sapply(regmatches(x,m), function(X) gsub(".", "1", X))
## Check that it worked
head(x)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."

regex in R - combining two patterns in "OR" fashion. Can't spot the error.

I want to find pat1 OR pat2 in vec
vec <- c("(and) i.e.", "(and) ie", "(and)ie", "and i.e.", "and ie", "and) i.e.")
pat1 <- "\\(and) i\\.e\\."
pat2 <- "\\(and) ie"
I attempt to combine the two patterns using (pat1|pat2)
# combine the two patterns
pat1or2 <- paste0("(", pat1, "|", pat2, ")")
# [1] "(\\(and) i\\.e\\.|\\(and) ie)"
grep(pat1, vec, value=TRUE)
# [1] "(and) i.e."
grep(pat2, vec, value=TRUE)
# [1] "(and) ie"
grep(pat1or2, vec, value=TRUE)
# character(0)
Clearly, I am missing something and I cannot spot it.
(Tried messing with perl and fixed, but that wasn't it)
Can you point out my error in combining the two patterns?
You forgot to backslash all of your parentheses. Your two patterns should be:
pat1 <- "\\(and\\) i\\.e\\."
pat2 <- "\\(and\\) ie"
After that, everything should be fine, with or without perl = TRUE. What could have put you on track to finding the error is using perl = TRUE with your old (wrong) patterns:
grep(pat1, vec, value=TRUE, perl = TRUE)
# Error in grep(pat1, vec, value = TRUE, perl = TRUE) :
# invalid regular expression '\(and) i\.e\.'
making it clear you had unbalanced parentheses.
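With the corrected patterns the combined form from the question behaves as expected. The sketch below reuses the question's own vec. One plausible reading of the original failure, assuming the default engine treats an unmatched ")" as a literal character (which the working single-pattern calls suggest): once the patterns are wrapped in an outer "(...)", the unescaped ")" inside pat1 closes that outer group early, so the alternation no longer splits where intended.

```r
vec <- c("(and) i.e.", "(and) ie", "(and)ie", "and i.e.", "and ie", "and) i.e.")

# Both parentheses escaped in each pattern.
pat1 <- "\\(and\\) i\\.e\\."
pat2 <- "\\(and\\) ie"

# Combine the two patterns exactly as in the question.
pat1or2 <- paste0("(", pat1, "|", pat2, ")")
res <- grep(pat1or2, vec, value = TRUE)
print(res)
```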
It can be simplified a bit like this:
pat1 <- "(and) i.e."
pat2 <- "(and) ie"
ok <- grepl(pat1, vec, fixed = TRUE) | grepl(pat2, vec, fixed = TRUE)
vec[ ok ]
This could alternately be written in this form which generalizes to more than two patterns:
pats <- c(pat1, pat2)
ok <- Reduce(function(x, y) x | grepl(y, vec, fixed = TRUE), pats, FALSE)
vec[ ok ]