R regular expression: isolate a string between quotes - regex

I have a string myFunction(arg1=\"hop\",arg2=TRUE). I want to isolate what is in between quotes (\"hop\" in this example)
I have tried so far with no success:
gsub(pattern="(myFunction)(\\({1}))(.*)(\\\"{1}.*\\\"{1})(.*)(\\){1})",replacement="//4",x="myFunction(arg1=\"hop\",arg2=TRUE)")
Any help by a regex guru would be welcome!

Try
sub('[^\"]+\"([^\"]+).*', '\\1', x)
#[1] "hop"
Or
sub('[^\"]+(\"[^\"]+.).*', '\\1', x)
#[1] "\"hop\""
The \" is not needed as " would work too
sub('[^"]*("[^"]*.).*', '\\1', x)
#[1] "\"hop\""
If there are multiple matches, as #AvinashRaj mentioned in his post, sub may not be that useful. An option using stringi would be
library(stringi)
stri_extract_all_regex(x1, '"[^"]*"')[[1]]
#[1] "\"hop\"" "\"hop2\""
data
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
x1 <- "myFunction(arg1=\"hop\",arg2=TRUE arg3=\"hop2\", arg4=TRUE)"

You could use regmatches function also. Sub or gsub only works for a particular input , for general case you must do grabing instead of removing.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> regmatches(x, gregexpr('"[^"]*"', x))[[1]]
[1] "\"hop\""
To get only the text inside quotes then pass the result of above function to a gsub function which helps to remove the quotes.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop"
> x <- "myFunction(arg1=\"hop\",arg2=\"TRUE\")"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop" "TRUE"

You can try:
str='myFunction(arg1=\"hop\",arg2=TRUE)'
gsub('.*(\\".*\\").*','\\1',str)
#[1] "\"hop\""

x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
unlist(strsplit(x,'"'))[2]
# [1] "hop"

Related

How to split a string before the delimiter?

I have a character string like the below.
a <- "T,2016,07,T,2016,07,22,T,2016,07"
I would like to split it to get this,
b <- c("T,2016,07", "T,2016,07", "T,2016,07")
Could you tell me the way? Many thanks.
Or use regular expression to split:
strsplit(a, ",(?=T)", perl = T)
# [[1]]
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"
You can do
x <- gsub("T", "%T", a)
y <- unlist(strsplit(x, "%"))[-1]
a <- "T,2016,07,T,2016,07,22,T,2016,07"
paste0("T", Filter(nzchar, strsplit(a, ",?T")[[1]]))
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"

Extract the last word between | |

I have the following dataset
> head(names$SAMPLE_ID)
[1] "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
[2] "Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
[3] "Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|"
[4] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[5] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[6] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
I want to extract the last word between || as a new variable i.e.
Acinetobacter
Bacillus
Haemophilus
I have tried using
library(stringr)
names$sample2 <- str_match(names$SAMPLE_ID, "|.*?|")
We can use
library(stringi)
stri_extract_last_regex(v1, '\\w+')
#[1] "Acinetobacter"
data
v1 <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
Using just base R:
myvar <- gsub("^..*\\|(\\w+)\\|$", "\\1", names$SAMPLE_ID)
^.*\\|\\K.*?(?=\\|)
Use \K to remove rest from the final matche.See demo.Also use perl=T
https://regex101.com/r/fM9lY3/45
x <- c("Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|",
"Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|" )
unlist(regmatches(x, gregexpr('^.*\\|\\K.*?(?=\\|)', x, perl = TRUE)))
# [1] "Streptococcus" "Streptococcus"
The ending is all you need [^|]+(?=\|$)
Per #RichardScriven :
Which in R would be regmatches(x, regexpr("[^|]+(?=\\|$)", x, perl = TRUE)
You can use package "stringr" as well in this case. Here is the code:
v<- "Bacteria|
Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
v1<- str_replace_all(v, "\\|", " ")
word(v1,-2)
Here I used v as the string. The basic theory is to replace all the | with spaces, and then get the last word in the string by using function word().

String split in R skipping first delimiter if multiple delimiters are present

I have "elephant_giraffe_lion" and "monkey_tiger" strings.
The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at that delimiter. So the results I want to get in this example are "elephant_giraffe" and "monkey".
mystring<-c("elephant_giraffe_lion", "monkey_tiger")
result
"elephant_giraffe" "monkey"
You can anchor your split to the end of the string using $,
unlist(strsplit(mystring, "_[a-z]+$"))
# [1] "elephant_giraffe" "monkey"
Edit
The above only matches the last "_", not accounting for cases where there are more than two "_". For the more general case, you could try
mystring<-c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
tmp <- gsub("([^_]+_[^_]+).*", "\\1", mystring)
tmp[tmp==mystring] <- sapply(strsplit(tmp[tmp==mystring], "_"), `[[`, 1)
tmp
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
You could also use gsubfn, to process the match with a function
library(gsubfn)
f <- function(x,y) if (y==x) strsplit(y, "_")[[1]][[1]] else y
gsubfn("([^_]+_[^_]+).*", f, mystring, backref=1)
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
As I posted an answer on your other related question, a base R solution:
x <- c('elephant_giraffe_lion', 'monkey_tiger', 'foo_bar_baz_bap')
sub('^(?|([^_]*_[^_]*)_.*|([^_]*)_[^_]*)$', '\\1', x, perl=TRUE)
# [1] "elephant_giraffe" "monkey" "foo_bar"

dynamic regex in R

The below code works so long as before and after strings have no characters that are special to a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' #parentheses cause problem in regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is http://perldoc.perl.org/functions/quotemeta.html for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
dnagirl, such a function exists and is glob2rx
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"

inverse of gsub

I have some html code I'm working with. I want to extract certain strings.
I want to extract this from string x preferred using base R: coleman_l, SMOG4
Here is what I have:
x <- "<code>(hi)auto</code>(coleman_l, SMOG4)<br />Read</li>"
#remove the string (this works)
gsub("a></code>(.+?)<br", "a></code><br", x)
#> gsub("a></code>(.+?)<br", "a></code><br", x)
#[1] "<code>(hi)auto</code><br />Read</li>"
#attempt to extract that information (doesn't work)
re <- "(?<=a></code>().*?(?=)<br)"
regmatches(x, gregexpr(re, x, perl=TRUE))
Error message:
> regmatches(x, gregexpr(re, x, perl=TRUE))
Error in gregexpr(re, x, perl = TRUE) :
invalid regular expression '(?<=a></code>().*?(?=)<br)'
In addition: Warning message:
In gregexpr(re, x, perl = TRUE) : PCRE pattern compilation error
'lookbehind assertion is not fixed length'
at ')'
enter code here
NOTE: Tagged as regex but this is R specific regex.
For these types of problems, I would use backreferences to extract the portion I want.
x <-
"<code>(hi)auto</code>(coleman_l, SMOG4)<br />Read</li>"
gsub(".*a></code>(.+?)<br.*", "\\1", x)
# [1] "(coleman_l, SMOG4)"
If the parentheses should also be removed, add them to the "plain text" part that you are tying to match, but remember that they would need to be escaped:
gsub(".*a></code>\\((.+?)\\)<br.*", "\\1", x)
# [1] "coleman_l, SMOG4"
FWIW, OP's original approach could have worked with little tweak.
> x
[1] "<code>(hi)auto</code>(coleman_l, SMOG4)<br />Read</li>"
> re <- "(?<=a></code>\\().*?(?=\\)<br)"
> regmatches(x, gregexpr(re, x, perl=TRUE))
[[1]]
[1] "coleman_l, SMOG4"
An advantage of doing it this way compared to other suggested solution is that if there is possibility of multiple matches, then all of them will show up.
> x <- '<code>(hi)auto</code>(coleman_l, SMOG4)<br />Read</li><code>(hi)auto</code>(coleman_l_2, SMOG4_2)<br />Read</li>'
> regmatches(x, gregexpr(re, x, perl=TRUE))
[[1]]
[1] "coleman_l, SMOG4" "coleman_l_2, SMOG4_2"
This will work, despite being ugly.
x<-"<code>(hi)auto</code>(coleman_l, SMOG4)<br />Read</li>"
x2 <- gsub("^.+(\\(.+\\)).+\\((.+)\\).+$","\\2",x)
x2
[1] "coleman_l, SMOG4"