How to split a string before the delimiter? - regex

I have a character string like the below.
a <- "T,2016,07,T,2016,07,22,T,2016,07"
I would like to split it to get this,
b <- c("T,2016,07", "T,2016,07", "T,2016,07")
Could you tell me the way? Many thanks.

Or use a regular expression with a lookahead to split:
strsplit(a, ",(?=T)", perl = TRUE)
# [[1]]
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"
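To see why the lookahead matters, here is a small sketch comparing it with a plain ",T" split, which consumes the T along with the comma. The variable names consumed and kept are only illustrative.

```r
a <- "T,2016,07,T,2016,07,22,T,2016,07"

# Plain pattern: the matched ",T" is removed, so later pieces lose their "T"
consumed <- strsplit(a, ",T")[[1]]
consumed
# [1] "T,2016,07"  "2016,07,22" "2016,07"

# Lookahead: only the comma is matched; the "T" is checked but stays in the piece
kept <- strsplit(a, ",(?=T)", perl = TRUE)[[1]]
kept
# [1] "T,2016,07"    "T,2016,07,22" "T,2016,07"
```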

You can do
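# mark each ",T" boundary with a placeholder that does not occur in the data, then split on it
x <- gsub(",T", "%T", a)
y <- unlist(strsplit(x, "%", fixed = TRUE))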

a <- "T,2016,07,T,2016,07,22,T,2016,07"
paste0("T", Filter(nzchar, strsplit(a, ",?T")[[1]]))
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"

Related

Extract the last word between | |

I have the following dataset
> head(names$SAMPLE_ID)
[1] "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
[2] "Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
[3] "Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|"
[4] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[5] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[6] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
I want to extract the last word between || as a new variable i.e.
Acinetobacter
Bacillus
Haemophilus
I have tried using
library(stringr)
names$sample2 <- str_match(names$SAMPLE_ID, "|.*?|")
We can use
library(stringi)
stri_extract_last_regex(v1, '\\w+')
#[1] "Acinetobacter"
data
v1 <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
Using just base R:
myvar <- gsub("^..*\\|(\\w+)\\|$", "\\1", names$SAMPLE_ID)
^.*\\|\\K.*?(?=\\|)
Use \K to drop everything matched before it from the final match. See the demo. Also use perl = TRUE.
https://regex101.com/r/fM9lY3/45
x <- c("Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|",
"Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|" )
unlist(regmatches(x, gregexpr('^.*\\|\\K.*?(?=\\|)', x, perl = TRUE)))
# [1] "Streptococcus" "Streptococcus"
The ending is all you need [^|]+(?=\|$)
Per @RichardScriven:
Which in R would be regmatches(x, regexpr("[^|]+(?=\\|$)", x, perl = TRUE))
You can use the "stringr" package as well in this case. Here is the code:
library(stringr)
v <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
v1 <- str_replace_all(v, "\\|", " ")
word(v1, -2)
Here v is the string. The idea is to replace every | with a space and then pick the last word with word(). Because the trailing | becomes a trailing space, the word wanted is at index -2 rather than -1.
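As a side note, here is a base R sketch of two points from this thread. First, it shows what is probably wrong with the pattern tried in the question: an unescaped | means alternation, so "|.*?|" can match the empty string. Second, it gives a regex-free alternative that splits on a literal |. The helper name last_field is hypothetical, not from any package.

```r
v <- c("Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|",
       "Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|")

# Unescaped pipes: the first (empty) alternative matches at position 1 with length 0
m <- regexpr("|.*?|", v[1], perl = TRUE)
empty_match <- as.integer(m) == 1L && attr(m, "match.length") == 0L

# Regex-free alternative: split on a literal | and keep the last piece
# (strsplit does not return a trailing empty piece after the final |)
last_field <- function(x) {
  vapply(strsplit(x, "|", fixed = TRUE), function(p) p[length(p)], character(1))
}
last_field(v)
# [1] "Bacillus"    "Haemophilus"
```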

R regular expression: isolate a string between quotes

I have a string myFunction(arg1=\"hop\",arg2=TRUE). I want to isolate what is in between quotes (\"hop\" in this example)
I have tried so far with no success:
gsub(pattern="(myFunction)(\\({1}))(.*)(\\\"{1}.*\\\"{1})(.*)(\\){1})",replacement="//4",x="myFunction(arg1=\"hop\",arg2=TRUE)")
Any help by a regex guru would be welcome!
Try
sub('[^\"]+\"([^\"]+).*', '\\1', x)
#[1] "hop"
Or
sub('[^\"]+(\"[^\"]+.).*', '\\1', x)
#[1] "\"hop\""
The \" is not needed as " would work too
sub('[^"]*("[^"]*.).*', '\\1', x)
#[1] "\"hop\""
If there are multiple matches, as @AvinashRaj mentioned in his post, sub may not be that useful. An option using stringi would be
library(stringi)
stri_extract_all_regex(x1, '"[^"]*"')[[1]]
#[1] "\"hop\"" "\"hop2\""
data
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
x1 <- "myFunction(arg1=\"hop\",arg2=TRUE arg3=\"hop2\", arg4=TRUE)"
You could also use the regmatches function. sub or gsub only works for a particular input; for the general case you should extract the matches rather than remove everything around them.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> regmatches(x, gregexpr('"[^"]*"', x))[[1]]
[1] "\"hop\""
To get only the text inside the quotes, pass the result of the above call to gsub, which removes the quotes.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop"
> x <- "myFunction(arg1=\"hop\",arg2=\"TRUE\")"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop" "TRUE"
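The two steps above (grab every quoted span, then strip the quotes) could be wrapped in a small vectorized helper. This is only a sketch; the name extract_quoted is made up for illustration.

```r
# Hypothetical helper: for each input string, return the text found inside double quotes
extract_quoted <- function(x) {
  hits <- regmatches(x, gregexpr('"[^"]*"', x))              # grab every "..." span
  lapply(hits, function(s) gsub('"', "", s, fixed = TRUE))   # then drop the quotes
}

x1 <- "myFunction(arg1=\"hop\",arg2=TRUE arg3=\"hop2\", arg4=TRUE)"
extract_quoted(x1)[[1]]
# [1] "hop"  "hop2"
```

It returns a list with one element per input string, so a string without any quotes simply yields an empty character vector.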
You can try:
str='myFunction(arg1=\"hop\",arg2=TRUE)'
gsub('.*(\\".*\\").*','\\1',str)
#[1] "\"hop\""
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
unlist(strsplit(x,'"'))[2]
# [1] "hop"

regex in R - combining two patterns in "OR" fashion. Can't spot the error.

I want to find pat1 OR pat2 in vec
vec <- c("(and) i.e.", "(and) ie", "(and)ie", "and i.e.", "and ie", "and) i.e.")
pat1 <- "\\(and) i\\.e\\."
pat2 <- "\\(and) ie"
I attempt to combine the two patterns using (pat1|pat2)
# combine the two patterns
pat1or2 <- paste0("(", pat1, "|", pat2, ")")
# [1] "(\\(and) i\\.e\\.|\\(and) ie)"
grep(pat1, vec, value=TRUE)
# [1] "(and) i.e."
grep(pat2, vec, value=TRUE)
# [1] "(and) ie"
grep(pat1or2, vec, value=TRUE)
# character(0)
Clearly, I am missing something and I cannot spot it.
(Tried messing with perl and fixed, but that wasn't it)
Can you point out my error in combining the two patterns?
You forgot to backslash all of your parentheses. Your two patterns should be:
pat1 <- "\\(and\\) i\\.e\\."
pat2 <- "\\(and\\) ie"
After that, everything should be fine, with or without perl = TRUE. What could have put you on track to finding the error is using perl = TRUE with your old (wrong) patterns:
grep(pat1, vec, value=TRUE, perl = TRUE)
# Error in grep(pat1, vec, value = TRUE, perl = TRUE) :
# invalid regular expression '\(and) i\.e\.'
making it clear you had unbalanced parentheses.
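For completeness, a quick check that the combined pattern works once both parentheses are escaped. It reuses vec from the question; matched is just an illustrative name.

```r
vec <- c("(and) i.e.", "(and) ie", "(and)ie", "and i.e.", "and ie", "and) i.e.")

# Both the opening and the closing parenthesis are escaped
pat1 <- "\\(and\\) i\\.e\\."
pat2 <- "\\(and\\) ie"

# The only unescaped parentheses left are the ones that build the group
pat1or2 <- paste0("(", pat1, "|", pat2, ")")
matched <- grep(pat1or2, vec, value = TRUE)
matched
# [1] "(and) i.e." "(and) ie"
```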
It can be simplified a bit like this:
pat1 <- "(and) i.e."
pat2 <- "(and) ie"
ok <- grepl(pat1, vec, fixed = TRUE) | grepl(pat2, vec, fixed = TRUE)
vec[ ok ]
This could alternately be written in this form which generalizes to more than two patterns:
pats <- c(pat1, pat2)
ok <- Reduce(function(x, y) x | grepl(y, vec, fixed = TRUE), pats, FALSE)
vec[ ok ]
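Another way to write the same idea, sketched here as an alternative rather than a replacement, is to compute one logical vector per pattern with lapply and then OR them together.

```r
vec <- c("(and) i.e.", "(and) ie", "(and)ie", "and i.e.", "and ie", "and) i.e.")
pats <- c("(and) i.e.", "(and) ie")

# One logical vector per literal pattern (fixed = TRUE: no regex escaping needed)
hits <- lapply(pats, grepl, x = vec, fixed = TRUE)

# Element-wise OR across all patterns
ok <- Reduce(`|`, hits)
vec[ok]
# [1] "(and) i.e." "(and) ie"
```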

Removing repeating substrings from within a string in R

Is there any way (using regular expressions such as gsub or other means) to remove repetitions from a string?
Essentially:
a = c("abc, def, def, abc")
f(a)
#[1] "abc, def"
One obvious way is to strsplit the string, get unique strings and stitch them together.
paste0(unique(strsplit(a, ",[ ]*")[[1]]), collapse=", ")
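If a is a vector with several such strings, the same split/unique/paste idea can be applied element by element. A minimal sketch, with a made-up helper name dedupe_tokens:

```r
# Hypothetical helper: remove repeated comma-separated tokens within each string
dedupe_tokens <- function(x, sep = ", ") {
  parts <- strsplit(x, ",[ ]*")  # split on a comma plus optional spaces
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

dedupe_tokens(c("abc, def, def, abc", "x, x, y"))
# [1] "abc, def" "x, y"
```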
You can also use stringr::str_extract_all
require(stringr)
unique(unlist(str_extract_all(a, '\\w+')))
You can also use this function based on gsub. I was not able to do it directly with a single regular expression.
f <- function(x) {
  # collapse one round of repeats: "X...X" becomes "X..."
  x <- gsub("(.+)(.+)?\\1", "\\1\\2", x, perl = TRUE)
  # recurse until no repeated substring is left
  if (grepl("(.+)(.+)?\\1", x, perl = TRUE)) f(x) else x
}
b <- f(a)
b
# [1] "abc, def"
hth

dynamic regex in R

The below code works so long as before and after strings have no characters that are special to a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' #parentheses cause problem in regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is quotemeta (http://perldoc.perl.org/functions/quotemeta.html) for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
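Here is a short end-to-end sketch of the escaped pattern in use. The sample text x and the shortened before string are invented for illustration, since the question does not show its data.

```r
# Same definition as in the answer above
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)

# Invented sample data (not from the question)
before <- 'Manager (note "self")'
after <- 'CURRENT FOCUS'
x <- 'Manager (note "self") Jane Doe CURRENT FOCUS'

pattern <- paste0('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')')

# perl = TRUE is required for the lookbehind and lookahead
between <- trimws(regmatches(x, regexpr(pattern, x, perl = TRUE)))
between
# [1] "Jane Doe"
```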
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
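To pull out the text itself rather than just test for a match, the same pattern can be passed to regmatches. A small sketch reusing the test data above; found is an illustrative name.

```r
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")

# \Q...\E makes the dots in before and after literal
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)

# Only the first string contains the literal "A." and ".Z", so one match comes back
found <- regmatches(x, regexpr(pattern, x, perl = TRUE))
found
# [1] "xyz"
```

One thing to keep in mind with this approach: a before or after string that itself contains \E would end the quoted section early.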
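dnagirl, such a function exists: glob2rx. Note that it is meant for wildcard (glob) patterns, so it also translates * and ? and, as the output below shows, leaves ")" unescaped.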
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
tt <- glob2rx(a)
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"
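A word of caution, shown as a small sketch: glob2rx treats * as a wildcard rather than as a literal character, so it is not a drop-in regex escaper for arbitrary strings.

```r
# glob2rx translates the glob wildcard * into the regex .*
rx <- utils::glob2rx("a*b")
rx
# [1] "^a.*b$"

# The translated pattern matches far more than the literal text "a*b"
wildcard_hit <- grepl(rx, "axxb")                       # TRUE
literal_hit <- grepl("a*b", "axxb", fixed = TRUE)       # FALSE
```

For purely literal matching, fixed = TRUE or one of the escaping approaches above is the safer choice.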