How to get pattern between first occurrence of two characters in R? - regex

I am trying to match a pattern: anything that is between VD= and the first occurrence of | from a character string, say tmp, like this:
tmp <- "PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
gene <- sub("^.*VD=([A-Za-z0-9]+)[|].*", "\\1", tmp)
gene
# [1] "SMO"
But when there is no VD= or | in the string, it grabs the whole string:
tmp <- "PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del"
gene <- sub("^.*VD=([A-Za-z0-9]+)[|].*", "\\1", tmp)
gene
# [1] "PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del"
I don't understand why it is grabbing the whole string instead of NA even when there are no VD= or | characters present. Is there a way to grab a pattern between the first occurrence of two characters and print it, or print NA if the pattern is not found?
Any help would be much appreciated.
Thanks!

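Your regex is more complicated than the task needs. Also, sub() returns its input unchanged when the pattern does not match, which is why you get the whole string back. A simpler regex is sufficient:
Regex: VD=([^|]+), with \\1 as the back-reference.
Explanation: ([^|]+) captures everything after VD= up to, but not including, the first |. In the gsub() call below, the |. alternative matches every other character and replaces it with the (empty) group, so only the captured text survives.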
tmp <- c("PC=D;RS=72450731;RE=72450735;LEN=1;S1=72;S2=802.939;REP=3;VT=Del", "PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627")
gsub('VD=([^|]+)|.', '\\1', tmp)
# [1] "" "SMO"
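If NA is wanted rather than an empty string when VD= is absent, one option is to extract the capture group with regexec() and regmatches() instead of substituting. This is only a minimal base-R sketch; the helper name extract_vd is made up for illustration and is not part of the answer above.

```r
# Hypothetical helper: return the text between "VD=" and the first "|",
# or NA when the string has no such field.
extract_vd <- function(x) {
  # regexec() reports the full match plus each capture group, per element
  hits <- regmatches(x, regexec("VD=([^|]+)", x))
  # a non-matching element yields character(0), which is mapped to NA
  vapply(hits,
         function(h) if (length(h) >= 2L) h[2L] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

tmp <- c("PC=D;RS=72450731;VT=Del",
         "PC=I;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-;VC=intronic")
extract_vd(tmp)
```

For these two shortened inputs the call gives NA for the first element and "SMO" for the second.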

It looks to me like you're effectively trying to parse a multilevel delimited string. I recommend not trying to use a single regex to extract the information you want, but rather using a more rigorous stepwise breakdown of the elements of the syntax.
First, you can split on semicolon to get the top-level pieces that look like variable assignments:
tmp <- 'PC=I;RS=128850544;RE=128850566;LEN=6;S1=36;S2=499.417;REP=2;VT=Ins;VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627;VC=intronic;VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627';
specs <- strsplit(fixed=T,tmp,';')[[1L]];
specs;
## [1] "PC=I"
## [2] "RS=128850544"
## [3] "RE=128850566"
## [4] "LEN=6"
## [5] "S1=36"
## [6] "S2=499.417"
## [7] "REP=2"
## [8] "VT=Ins"
## [9] "VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
## [10] "VC=intronic"
## [11] "VW=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
Next you can search for the LHS of interest, extracting just the first occurrence (in case there are multiple matches):
vdspec <- grep(perl=T,value=T,'^VD=',specs)[1L];
vdspec;
## [1] "VD=SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
You can drill down into the RHS and then split that into the pipe-delimited fields:
vd <- sub(perl=T,'^VD=','',vdspec);
vd;
## [1] "SMO|CCDS5811.1|r.?|-|-|protein_coding:CDS:intron:insertion:intron_variant|SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
vdfields <- strsplit(fixed=T,vd,'|')[[1L]];
vdfields;
## [1] "SMO"
## [2] "CCDS5811.1"
## [3] "r.?"
## [4] "-"
## [5] "-"
## [6] "protein_coding:CDS:intron:insertion:intron_variant"
## [7] "SO:0000010:SO:0000316:SO:0000188:SO:0000667:SO:0001627"
Now you can easily get the value you're looking for:
vdfields[1L];
## [1] "SMO"
If your target LHS does not match, you'll get NA from the grep()[1L] call:
xxspec <- grep(perl=T,value=T,'^XX=',specs)[1L];
xxspec;
## [1] NA
Thus you can branch on the result of the grep()[1L] call to handle the case of a missing LHS.
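The stepwise breakdown above can be wrapped into a single function that branches on the NA result. This is a sketch under assumptions: get_field is a hypothetical name, and the key is assumed to contain no regex metacharacters.

```r
# Hypothetical helper: value of the idx-th pipe-delimited field for a key,
# or NA when the key is absent. Assumes `key` has no regex metacharacters.
get_field <- function(s, key, idx = 1L) {
  specs <- strsplit(s, ";", fixed = TRUE)[[1L]]   # top-level KEY=VALUE pieces
  prefix <- paste0("^", key, "=")
  spec <- grep(prefix, specs, value = TRUE)[1L]   # first matching piece, or NA
  if (is.na(spec)) return(NA_character_)          # missing LHS
  rhs <- sub(prefix, "", spec)                    # drop the "KEY=" part
  strsplit(rhs, "|", fixed = TRUE)[[1L]][idx]     # pick the requested field
}

s <- "PC=I;VT=Ins;VD=SMO|CCDS5811.1|r.?;VC=intronic"
get_field(s, "VD")      # first VD field
get_field(s, "VD", 2L)  # second VD field
get_field(s, "XX")      # key not present
```

With this shortened input the three calls give "SMO", "CCDS5811.1" and NA respectively.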

Related

Substring content between quotation marks

In a DF I have column entries of different length as the following:
tmp_ezg.\"dr_HE_10691\" , tmp_ezg.\"dr_MV_0110200016\" , tmp_ezg.\"dr_MV_0111290017\" etc.
How can I best substring what's in between the quotation marks?
My idea:
substring(DF$name, 10)
Since the content of the quotation marks has different lengths I cannot provide substring() a value where to stop.
Is there a possibility to substring only between certain symbols (i.e. quotation marks)?
To separate the content between the quotation marks (assuming there are exactly two in each entry), just split the string on the quotation mark, written here as "\\\"" (an escaped backslash followed by an escaped quotation mark, which the regex engine reads as \"):
y <- strsplit(x, split = "\\\"")
If all entries end with a quotation mark, this will give you a list of entries with two values, and the second value in each entry is your string.
[[1]]
[1] "tmp_ezg." "dr_HE_10691"
[[2]]
[1] "tmp_ezg." "dr_MV_0110200016"
[[3]]
[1] "tmp_ezg." "dr_MV_0111290017"
For example
x <- c('tmp_ezg.\"dr_HE_10691\"' ,
'tmp_ezg.\"dr_MV_0110200016\"' ,
'tmp_ezg.\"dr_MV_0111290017\"')
res <- sub('.*?"([^"]+)"', "\\1", x)
print(res, quote=F)
# [1] dr_HE_10691      dr_MV_0110200016 dr_MV_0111290017
... if I'm not mistaken.
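Applied to a data frame column, the split-based idea might look like the sketch below. The data frame and the new column name inner are made up for illustration; an entry without quotation marks is included to show that it comes out as NA.

```r
# Illustrative data; the third entry deliberately has no quotation marks
DF <- data.frame(name = c('tmp_ezg."dr_HE_10691"',
                          'tmp_ezg."dr_MV_0110200016"',
                          'no_quotes_here'),
                 stringsAsFactors = FALSE)

# Split each entry on the literal quotation mark
pieces <- strsplit(DF$name, '"', fixed = TRUE)

# The second piece is the quoted content; indexing past the end gives NA
DF$inner <- vapply(pieces, function(p) p[2L], character(1))
DF$inner
```

Using fixed = TRUE avoids any regex escaping of the quotation mark.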

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to replace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
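The two changes described above can be seen separately in a small comparison. This is just an illustrative sketch; the variable names are made up.

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# Original attempt: exactly one letter between the brackets, so nothing matches
one_letter <- gsub("\\(([[:alpha:]])\\)", "\\1", tst)

# One-or-more letters: the brackets go, but the leading "86 " / "87 " stays
brackets_only <- gsub("\\(([[:alpha:]]+)\\)", "\\1", tst)

# Adding ".*" consumes the leading part too, so only the letters remain
whole_string <- gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)

one_letter
brackets_only
whole_string
```

The first result equals the input, the second is "85" "86 TBA" "87 LAST", and the third is the desired "85" "TBA" "LAST".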
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this. See the demo. Replace with \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain, as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parentheses and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdapRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"

Split on first/nth occurrence of delimiter

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcomed for completeness). I want to split on n occurrences of a delimiter.
Here is some data:
x <- "I like_to see_how_too"
pat <- "_"
Desired outcome
Say I want to split on first occurrence of _:
[1] "I like" "to see_how_too"
Say I want to split on second occurrence of _:
[1] "I like_to see" "how_too"
Ideally the solution is a regex one-liner that generalizes to the nth occurrence and uses strsplit with a single regex.
Here's a solution that doesn't fit my parameters of single regex that works with strsplit
x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
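The substr-based approach above generalizes to any n with a small wrapper. This is a sketch only: split_at_nth is a hypothetical name, the delimiter is treated as a fixed string, and returning the input unchanged when there are fewer than n delimiters is an assumption about the desired behaviour.

```r
# Hypothetical helper: split x in two at the n-th occurrence of delim
split_at_nth <- function(x, delim, n) {
  locs <- gregexpr(delim, x, fixed = TRUE)[[1L]]   # start of every delimiter
  # no delimiter at all, or fewer than n of them: nothing to split on
  if (locs[1L] == -1L || n > length(locs)) return(x)
  loc <- locs[n]
  c(substr(x, 1L, loc - 1L),
    substr(x, loc + nchar(delim), nchar(x)))
}

x <- "I like_to see_how_too"
split_at_nth(x, "_", 1)
split_at_nth(x, "_", 2)
```

The two calls reproduce the desired outcomes listed in the question.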
Here is another solution using the gsubfn package and some regex-fu. To change the nth occurrence of the delimiter, you can simply swap the number that is placed inside of the range quantifier — {n}.
library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like" "to see_how_too"
If you would like the nth occurrence to be user defined, you could use the following:
n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too"
Non-Solution
Since R can use PCRE (with perl=TRUE), you can use \K to remove everything that matches the pattern before \K from the main match result.
Below is the regex to split the string at the 3rd _
^[^_]*(?:_[^_]*){2}\K_
If you want to split at the nth occurrence of _, just change 2 to (n - 1).
That was the plan. However, strsplit seems to think differently.
Actual execution
x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
It still fails to work on a stronger assertion \A
strsplit(x, "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
Explanation?
This behavior hints at the fact that strsplit finds the first match, takes a substring to extract the first token and the remainder, and then looks for the next match in the remainder.
This removes all the states from the previous matches, and leaves us with a clean state when it tries to match the regex on the remainder. This makes the task of stopping the strsplit function at first match and achieving the task at the same time impossible. There is not even a parameter in strsplit to limit the number of splits.
Rather than splitting, you can match to get the two pieces.
Try this regex:
^((?:[^_]*_){1}[^_]*)_(.*)$
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
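In R, the match-instead-of-split regex above could be applied with regexec() and regmatches(). This is a sketch under assumptions: split_by_match is a made-up name, the perl argument of regexec() requires R 3.3.0 or later, and returning the input unchanged when there are fewer than n underscores is a design choice, not part of the answer.

```r
# Hypothetical helper: split x into two pieces at the n-th underscore
# by matching both sides instead of calling strsplit().
split_by_match <- function(x, n) {
  # {n-1} underscores stay inside the first group; the n-th is consumed
  re <- sprintf("^((?:[^_]*_){%d}[^_]*)_(.*)$", n - 1L)
  m <- regmatches(x, regexec(re, x, perl = TRUE))[[1L]]
  if (length(m) == 0L) return(x)  # fewer than n underscores
  m[-1L]                          # drop the full match, keep both groups
}

x <- "I like_to see_how_too"
split_by_match(x, 1)
split_by_match(x, 2)
```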
Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:
^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice way around the restriction that you cannot use a variable-length lookbehind in the above regex.
x <- "I like_to see_how_too"
strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## > strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how" "too"
## > strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too"
This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.
It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:
library(gsubfn)
k <- c(2, 4) # split at 2nd and 4th _
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")
giving:
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"
If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.
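For comparison, splitting at a chosen set of delimiter occurrences can also be sketched in base R by working with the delimiter positions directly. This is not the gsubfn approach above but a swapped-in alternative; split_at is a hypothetical name and the delimiter is treated as a fixed string.

```r
# Hypothetical helper: split x at the k-th occurrences of delim (k may be a vector)
split_at <- function(x, k, delim = "_") {
  pos <- gregexpr(delim, x, fixed = TRUE)[[1L]]   # start of every delimiter
  if (pos[1L] == -1L) return(x)                   # delimiter not present
  k <- sort(unique(k))
  chosen <- pos[k[k <= length(pos)]]              # positions of the chosen delimiters
  starts <- c(1L, chosen + nchar(delim))          # each piece starts after a chosen delimiter
  ends <- c(chosen - 1L, nchar(x))                # and ends just before the next one
  substring(x, starts, ends)
}

split_at("aa_bb_cc_dd_ee_ff", c(2, 4))
```

Because the pieces are cut by position, empty fields need no special placeholder sequence here.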

Exception handling for regular expressions in R

I've found several related questions, but haven't found one that solves my problem yet; please let me know if I'm missing a question that addresses this.
Essentially I want to use a regular expression to find a pattern but with an exception based on the preceding characters. For example, I have the following text object ("muffins") as a vector and I want to match the names ("Sarah", "Muffins", and "Bob"):
muffins
[1] "Dear Sarah,"
[2] "I love your dog, Muffins, who is adorable and very friendly. However, I cannot say I enjoy the \"muffins\" he regularly leaves in my front yard. Please consider putting him on a leash outside and properly walking him like everyone else in the neighborhood."
[3] "Sincerely,"
[4] "Bob"
My approach was to search for capitalized words and then exclude words capitalized for grammatical reasons, such as the beginning of a sentence.
pattern = "\\b[[:upper:]]\\w+\\b"
m = gregexpr(pattern,muffins)
regmatches(muffins,m)
This pattern gets me most of the way, returning:
[[1]]
[1] "Dear" "Sarah"
[[2]]
[1] "Muffins" "However" "Please"
[[3]]
[1] "Sincerely"
[[4]]
[1] "Bob"
and I can identify some of the sentence beginnings with:
pattern2 = "[.]\\s[[:upper:]]\\w+\\b"
m = gregexpr(pattern2,muffins)
regmatches(muffins,m)
but I can't seem to do both simultaneously, that is, match pattern except where pattern2 applies.
I've tried several combinations that I thought would work, but with little success. A few of the ones I tried:
pattern2 = "(?<![.]\\s[[:upper:]]\\w+\\b)(\\b[[:upper:]]\\w+\\b)"
pattern2 = "(^[.]\\s[[:upper:]]\\w+\\b)(\\b[[:upper:]]\\w+\\b)"
Any advice or insight would be greatly appreciated!
You may be looking for a negative look-behind.
pattern = "(?<!\\.\\s)\\b[[:upper:]]\\w+\\b"
m = gregexpr(pattern,muffins, perl=TRUE)
regmatches(muffins,m)
# [[1]]
# [1] "Dear" "Sarah"
#
# [[2]]
# [1] "Muffins"
#
# [[3]]
# [1] "Sincerely"
#
# [[4]]
# [1] "Bob"
The look behind part (?<!\\.\\s) makes sure there's not a period and a space immediately before the match.
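The effect of the look-behind can be checked by running the pattern with and without it on a shortened version of the text. This is an illustrative sketch with made-up sample sentences. Note that perl = TRUE is needed for look-behind, and that the look-behind itself has to be of bounded length, which is one reason an attempt containing \\w+ inside (?<!...) does not work.

```r
txt <- c("Dear Sarah,",
         "I love your dog, Muffins, who is friendly. However, I am annoyed. Please help.",
         "Bob")

plain  <- "\\b[[:upper:]]\\w+\\b"              # any capitalized word of 2+ characters
guarded <- "(?<!\\.\\s)\\b[[:upper:]]\\w+\\b"  # same, unless preceded by ". "

without_lb <- regmatches(txt, gregexpr(plain, txt, perl = TRUE))
with_lb    <- regmatches(txt, gregexpr(guarded, txt, perl = TRUE))

without_lb[[2]]
with_lb[[2]]
```

For the second element the plain pattern returns "Muffins", "However" and "Please", while the guarded pattern keeps only "Muffins".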
The below regex would match only the names Bob, Sarah and Muffins,
(?<=^)[A-Z][a-z]+(?=$)|(?<!\. )[A-Z][a-z]+(?=,[^\n])|(?<= )[A-Z][a-z]+(?=,$)
Trying to identify names with regular expressions is inherently a problem; there is no hope of it working reliably, because matching names in arbitrary data is very complicated. If extracting these names is your goal, you need a different approach than simply matching an uppercase letter followed by word characters.
Considering your vector is as you posted in your question:
x <- c('Dear Sarah,',
'I love your dog, Muffins, who is adorable and very friendly. However, I cannot say I enjoy the "muffins" he regularly leaves in my front yard. Please consider putting him on a leash outside and properly walking him like everyone else in the neighborhood.',
'Sincerely',
'Bob')
m = regmatches(x, gregexpr('(?<!\\. )[A-Z][a-z]{1,7}\\b(?! [A-Z])', x, perl=T))
Filter(length, m)
# [[1]]
# [1] "Sarah"
# [[2]]
# [1] "Muffins"
# [[3]]
# [1] "Bob"

Regular Expression to anonymize emails

In R, I use the regular expression
regexp <- "(^|[^([:alnum:]|.|_)])abc#abc.de($|[^[:alnum:]])"
to find the email address abc#abc.de in a specific text and replace it with an anonymous email address.
tmp <- c("aaaaabc#abc.debbbb", ## <- should not be matched
"aaaa abc#abc.de bbbb", ## <- should be matched
"abc#abc.de", ## <- should be matched
"aaa.abc#abc.de", ## <- should not be matched
"aaaa_abc#abc.de", ## <- should not be matched
"(abc#abc.de)", ## <- should be matched
"aaaa (abc#abc.de) bbbb") ## <- should be matched
replacement <- paste("\\1", "anonym#anonym.de", "\\2", sep="")
gsub(regexp, replacement, tmp, ignore.case=TRUE)
As a result I get:
> gsub(regexp, replacement, tmp, ignore.case=TRUE)
[1] "aaaaabc#abc.debbbb" "aaaa anonym#anonym.de bbbb"
[3] "anonym#anonym.de" "aaa.abc#abc.de"
[5] "aaaa_abc#abc.de" "(abc#abc.de)"
[7] "aaaa (abc.abc.de) bbbb"
I don't know why the last two elements of the array are not matched?
Thank you and best regards.
How about this?
gsub("^(abc#abc)|(?<=[ (])(abc#abc)", "anonym#anonym", tmp, perl=T)
The pattern before |: ^(abc#abc) checks for beginning with abc#abc, of course.
The pattern after | uses positive lookbehind and searches for abc#abc preceded by a space or ( (left parenthesis) and, if found, replaces it with anonym#anonym.
This is what I get: (Note: I replaced abc.abc in the last string with abc#abc)
[1] "aaaaabc#abc.debbbb" "aaaa anonym#anonym.de bbbb"
[3] "anonym#anonym.de" "aaa.abc#abc.de"
[5] "aaaa_abc#abc.de" "(anonym#anonym.de)"
[7] "aaaa (anonym#anonym.de) bbbb"
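A variation on the lookaround idea is to state what must not surround the address, which also covers the trailing side. The sketch below is an assumption-laden illustration: anonymize is a hypothetical helper, it relies on PCRE's \Q...\E quoting so the dot in the address is literal, and it assumes the address does not itself contain \E and the replacement contains no backslashes.

```r
# Hypothetical helper: replace `email` only when it is not glued to
# other address characters on the left or alphanumerics on the right.
anonymize <- function(x, email, replacement) {
  pat <- paste0("(?<![[:alnum:]._])",   # not preceded by letter, digit, "." or "_"
                "\\Q", email, "\\E",    # the address, taken literally
                "(?![[:alnum:]])")      # not followed by a letter or digit
  gsub(pat, replacement, x, perl = TRUE, ignore.case = TRUE)
}

tmp <- c("aaaaabc#abc.debbbb", "aaaa abc#abc.de bbbb", "abc#abc.de",
         "aaa.abc#abc.de", "aaaa_abc#abc.de", "(abc#abc.de)",
         "aaaa (abc#abc.de) bbbb")
anonymize(tmp, "abc#abc.de", "anonym#anonym.de")
```

Because lookarounds consume no characters, no capture groups are needed to put the surrounding characters back.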
Edit: To explain the problem with your regexp, it seems like a problem with the part:
[^([:alnum:]|.|_)]
Inside a bracket expression, (, | and ) are literal characters rather than grouping or alternation, so this class also excludes ( and ) themselves, which is why the addresses wrapped in parentheses are not matched. Also, outside a bracket expression you should use [.] instead of . as the latter matches any character. The class can be condensed by removing all the unnecessary | and parentheses:
[^.[:alpha:]_] # not a ., a _ or a letter (use [:alnum:] to exclude digits as well)
# using gsub on it:
gsub("(^|[^.[:alpha:]_])abc#abc", " anonym#anonym", tmp)
# [1] "aaaaabc#abc.debbbb" "aaaa anonym#anonym.de bbbb"
# [3] " anonym#anonym.de" "aaa.abc#abc.de"
# [5] "aaaa_abc#abc.de" " anonym#anonym.de)"
# [7] "aaaa anonym#anonym.de) bbbb"
You get every abc#abc replaced. But you'll lose the character before abc#abc every time, because you're matching it in the pattern as well. So you'll have to use a capture group. That is, if you wrap a regular expression with () then you can refer to that "capture" using special variables such as \\1, \\2 etc. Here, we have captured (^|[^.[:alpha:]_]), i.e., the part before abc#abc. Since it is the first capture, we'll refer to it as \\1 to recover the missing character in the previous result:
gsub("(^|[^.[:alpha:]_])abc#abc", "\\1anonym#anonym", tmp)
# [1] "aaaaabc#abc.debbbb" "aaaa anonym#anonym.de bbbb"
# [3] "anonym#anonym.de" "aaa.abc#abc.de"
# [5] "aaaa_abc#abc.de" "(anonym#anonym.de)"
# [7] "aaaa (anonym#anonym.de) bbbb"
This is the result you needed. And this is the same as my initial answer using positive look-behind. In that case, since it just checks if it is preceded by something, you don't have to capture anything special. Only the abc#abc part got replaced. Hope this helps.
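To see what the bracket expression from the question actually excludes, both classes can be tested against single characters. This is a small illustrative check; the variable names are made up.

```r
chars <- c(" ", "(", "|", ".", "a")

# Class from the question: "(", "|", "." , "_" and ")" are all literal members
orig_class <- "[^([:alnum:]|.|_)]"
# Condensed class: only ".", "_" and alphanumerics are excluded
condensed_class <- "[^.[:alnum:]_]"

orig_hits <- grepl(orig_class, chars)
condensed_hits <- grepl(condensed_class, chars)

data.frame(char = chars, original = orig_hits, condensed = condensed_hits)
```

The original class rejects "(" and "|" as well as "." and letters, whereas the condensed class accepts "(" and "|", so an address preceded by a left parenthesis can match.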