strsplit by parentheses [duplicate] - regex

This question already has answers here:
Regular Expression to get a string between parentheses in Javascript
(10 answers)
Closed 7 years ago.
Suppose I have a string like "A B C (123-456-789)", I'm wondering what's the best way to retrieve "123-456-789" from it.
strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C" "123-456-789)"

If we want to extract the digits with - between the braces, one option is str_extract. If there are multiple patterns within a string, use str_extract_all
library(stringr)
str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
#[1] "123-456-789"
str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')
In the above codes, we are using regex lookarounds to extract the numbers and the -. The positive lookbehind (?<=\\()[0-9-]+ matches numbers along with - ([0-9-]+) in (123-456-789 and not in 123-456-789. Similarly the lookahead ('[0-9-]+(?=\)') matches numbers along with - in 123-456-789) and not in 123-456-798. Taken together it matches all the cases that satisfy both the conditions (123-456-789) and extract those in between the lookarounds and not with cases like (123-456-789 or 123-456-789)
With strsplit you can specify the split as [()]. We keep the () inside the square brackets to [] to treat it as characters or else we have to escape the parentheses ('\\(|\\)').
strsplit(str1, '[()]')[[1]][2]
#[1] "123-456-789"
If there are multiple substrings to extract from a string, we could loop with lapply and extract the numeric split parts with grep
lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))
Or we can use stri_split from stringi which has the option to remove the empty strings as well (omit_empty=TRUE).
library(stringi)
stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
#[1] "123-456-789"
stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)
Another option is rm_round from qdapRegex if we are interested in extracting the contents inside the brackets.
library(qdapRegex)
rm_round(str1, extract=TRUE)[[1]]
#[1] "123-456-789"
rm_round(str2, extract=TRUE)
data
str1 <- "A B C (123-456-789)"
str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
"(123-423-498) ABCDD",
"(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")

or with sub from base R:
sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"
Explanation:
[^(]+ : matches anything except an opening bracket
\\( : matches an opening bracket, which is just before what you want
([^)]+) : matches the pattern you want to capture (which is then retrieved in replacement="\\1"), which is anything except a closing bracket
\\).* matches a closing bracket followed by anything, 0 or more times
Another option with look-ahead and look-behind
sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"

The capture groups in sub will target your desired output:
sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"
Extra check to make sure I pass #akrun's extended example:
sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"

You may try these gsub functions.
> gsub("[^\\d-]", "", x, perl=T)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"
Few more...
> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=T)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=T)
[1] "123-456-789"

Try this also:
k<-"A B C (123-456-789)"
regmatches(k,gregexpr("*.(\\d+).*",k))[[1]]
[1] "(123-456-789)"
With suggestion from #Arun:
regmatches(k, gregexpr('(?<=\\()[^A-Z ]+(?=\\))', k, perl=TRUE))[[1]]
With suggestion from #akrun:
regmatches(k, gregexpr('[0-9-]+', k))[[1]]

Related

extracting multiple overlapping substrings

i have strings of amino-acids like this:
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
and i would like to extract all non-overlapping substrings starting with M and finishing with *. so, for the above example i would need:
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
as the output. predictably regexpr gives me the greedy solution:
regmatches(x, regexpr("M.+\\*", x))
#[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
i have also tried things suggested here, as this is the question that resembles my problem the most (but not quite), but to no avail.
any help would be appreciated.
I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:
regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:
R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"
#"MQLPSSFAALAAQFDQL*"
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
M[^*]+\\*
use negated character class.See demo.Also use perl=True option.
https://regex101.com/r/tD0dU9/6

Return the first occurrence of a character in a string

I have been trying to extract a portion of string after the occurrence of a first ^ sign. For example, the string looks like abc^28092015^def^1234. I need to extract 28092015 sandwiched between the 1st two ^ signs.
So, I need to extract 8 characters from the occurrence of the 1st ^ sign. I have been trying to extract the position of the first ^ sign and then use it as an argument in the substr function.
I tried to use this:
x=abc^28092015^def^1234 `rev(gregexpr("\\^", x)[[1]])[1]`
Referring the answer discussed here.
But it continues to return the last position. Can anyone please help me out?
I would use sub.
x <- "^28092015^def^1234"
sub("^.*?\\^(.*?)\\^.*", "\\1", x)
# [1] "28092015"
Since ^ is a special char in regex, you need to escape that in-order to match literal ^ symbols.
or
Do splitting on ^ and get the value of second index.
strsplit(x,"^", fixed=-T)[[1]][2]
# [1] "28092015"
or
You may use gsub aslo.
gsub("^.*?\\^|\\^.*", "", x, perl=T)
# [1] "28092015"
Here's one option with base R:
x <- "abc^28092015^def^1234"
m <- regexpr("(?<=\\^)(.+?)(?=\\^)", x, perl = TRUE)
##
R> regmatches(x, m)
#[1] "28092015"
Another option is stri_extract_first from library(stringi)
library(stringi)
stri_extract_first_regex(str1, '(?<=\\^)\\d+(?=\\^)')
#[1] "28092015"
If it is any character between two ^
stri_extract(str1, regex='(?<=\\^)[^^]+')
#[1] "28092015"
data
str1 <- 'abc^28092015^def^1234'
x <- 'abc^28092015^def^1234'
library(qdapRegex)
unlist(rm_between(x, '^', '^', extract=TRUE))[1]
# [1] "28092015"
It would be better if you split it using ^. But if you still want the pattern, you can try this.
^\S+\^(\d+)(?=\^)
Then match group 1.
OUTPUT
28092015
See DEMO

Combining regex with a literal string

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches to the regex that are following immediately to a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as in the regex above, but only selecting one of the 5 matches, because it needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
There should be something more intuitive but i think this will do the job
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x){
a[[x]][[1]][1]
})
print(b)
Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
Syntax would be like,
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
DEMO
Here's is another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
CODE
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
Here's how to implement it:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"

Split on first/nth occurrence of delimiter

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcomed for completeness). I want to split on n occurrences of a delimiter.
Here is some data:
x <- "I like_to see_how_too"
pat <- "_"
Desired outcome
Say I want to split on first occurrence of _:
[1] "I like" "to see_how_too"
Say I want to split on second occurrence of _:
[1] "I like_to see" "how_too"
Ideally, if the solution is a regex one liner generalizable to nth occurrence; the solution will use strsplit with a single regex.
Here's a solution that doesn't fit my parameters of single regex that works with strsplit
x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
Here is another solution using the gsubfn package and some regex-fu. To change the nth occurrence of the delimiter, you can simply swap the number that is placed inside of the range quantifier — {n}.
library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like" "to see_how_too"
If you would like the nth occurrence to be user defined, you could use the following:
n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too"
Non-Solution
Since R is using PCRE, you can use \K to remove everything that matches the pattern before \K from the main match result.
Below is the regex to split the string at the 3rd _
^[^_]*(?:_[^_]*){2}\K_
If you want to split at the nth occurrence of _, just change 2 to (n - 1).
Demo on regex101
That was the plan. However, strsplit seems to think differently.
Actual execution
Demo on ideone.com
x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
It still fails to work on a stronger assertion \A
strsplit(x, "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
Explanation?
This behavior hints at the fact that strsplit find the first match, do a substring to extract the first token and the remainder part, and find the next match in the remainder part.
This removes all the states from the previous matches, and leaves us with a clean state when it tries to match the regex on the remainder. This makes the task of stopping the strsplit function at first match and achieving the task at the same time impossible. There is not even a parameter in strsplit to limit the number of splits.
Rather than split you do match to get your split strings.
Try this regex:
^((?:[^_]*_){1}[^_]*)_(.*)$
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
RegEx Demo
Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:
^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice alternative of restriction that you cannot have a variable length lookbehind in above regex.
RegEx Demo2
x <- "I like_to see_how_too"
strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## > strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how" "too"
## > strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too"
This uses gsubfn to to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.
It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:
library(gsubfn)
k <- c(2, 4) # split at 2nd and 4th _
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")
giving:
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"
If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
See section 4 of the gusbfn vignette for more info on using gusbfn with proto objects to retain state between matches.

remove leading zeroes from timestamp %j%Y %H:%M

My timestamp is in the form
0992006 09:00
I need to remove the leading zeros to get this form:
992006 9:00
Here's the code I'm using now, which doesn't remove leading zeros:
prediction$TIMESTAMP <- as.character(format(prediction$TIMESTAMP, '%j%Y %H:%M'))
Simplest way is to create your own boundary that asserts either the start of the string or a space precedes.
gsub('(^| )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
You could do the same making the replacement exempt using a trick. \K resets the starting point of the reported match and any previously consumed characters are no longer included.
gsub('(^| )\\K0+', '', '0992006 09:00', perl=T)
# [1] "992006 9:00"
Or you could use sub and match until the second set of leading zeros.
sub('^0+([0-9]+ )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
And to cover all possibilities, if you know that you will ever have a format like 0992006 00:00, simply remove the + quantifier from zero in the regular expression so it only removes the first leading zero.
str1 <- "0992006 09:00"
gsub("(?<=^| )0+", "", str1, perl=TRUE)
#[1] "992006 9:00"
For situations like below, it could be:
str2 <- "0992006 00:00"
gsub("(?<=^| )0", "", str2, perl=TRUE)
#[1] "992006 0:00"
Explanation
Here the idea is to use look behind (?<=^| )0+ to match 0s
if it occurs either at the beginning of the string
(?<=^
or |
if it follows after a space )0+
and replace those matched 0s by "" in the second part of the gsub argument.
In the second string, the hour and minutes are all 0's. So, using the first code would result in:
gsub("(?<=^| )0+", "", str2, perl=TRUE)
#[1] "992006 :00"
Here, it is unclear what the OP would accept as a result. So, I thought, instead of removing the whole 0s before the :, it would be better if one 0 was left. So, I replaced the multiple 0+ code to just one 0 and replace that by "".
Here's another option using a lookbehind
gsub("(^0)|(?<=\\s)0", "", "0992006 09:00", perl = TRUE)
## [1] "992006 9:00"
With sub:
sub("^[0]+", "", prediction$TIMESTAMP)
[1] "992006 09:00"
You can also use stringr without a regular expression, by using the substrings.
> library(stringr)
> str_c(str_sub(word(x, 1:2), 2), collapse = " ")
# [1] "992006 9:00"
Some more Perl regexes,
> gsub("(?<!:)\\b0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"
> gsub("(?<![\\d:])0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"