Extract characters within brackets "[" and "]" including brackets - regex

I have a character string like this:
GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT
I realize that I can use substing for this particular case. However, the position of the [X/Y] differs among strings and the content between the brackets varies in length.
I would like to extract the [X/Y].

stringr is useful for these types of operations,
library(stringr)
str_extract(x, '\\[.*\\]')
#[1] "[A/C]"
or str_extract_all if you have more than one patterns in your strings

We can use bracketXtract from qdap
library(qdap)
unname(bracketXtract(dat, "square", with = TRUE))
#[1] "[A/C]"
Or using base R
gsub
gsub("^[^[]+|[^]]+$", '', dat)
#[1] "[A/C]"
strsplit
strsplit(dat, "[^[]+(?=\\[)|(?<=])[^]]+", perl=TRUE)[[1]][2]
#[1] "[A/C]"
data
dat <- "GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT"

provided that there's only 1 pair of "[]" per string, use grepexpr:
dat<-c("GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT")
substring(dat, gregexpr("\\[", dat), gregexpr("\\]", dat))

Related

R - Extracting number from string with regular expression

I want to extract a number with decimals from a string with one single expression if possible.
For example transform "2,123.02" to "2123.02" - my current solution is:
paste(unlist(str_extract_all("2,123.02","\\(?[0-9.]+\\)?",simplify=F)),collapse="")
But what I'm looking for is the expression in str_extract_all to just bind it together as a vector by themself. Is this possible to achieve with an regular expression?
You can try replacing the comma by an empty string:
gsub(",", "", "2,123.02")
#[1] "2123.02"
NB: If you need to replace only commas in between numbers, you can use lookarounds:
gsub("(?<=[0-9]),(?=[0-9])", "", "this, this is my number 2,123.02", perl=TRUE)
#[1] "this, this is my number 2123.02"
I edited with sub instead of gsub in case you have strings with more than one number with a comma. In case you only have one, sub is "sufficient".
NB2: You can call str_extrac_all on the result from gsub, e.g.:
str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", "","first number: 2,123.02, second number: 3,456", perl=T), "\\d+\\.*\\d*", simplify=F)
#[[1]]
#[1] "2123.02" "3456"
Another option is extract_numeric in the tidyr package.
library(tidyr)
extract_numeric("2,123.02")
[1] 2123.02

extracting multiple overlapping substrings

i have strings of amino-acids like this:
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
and i would like to extract all non-overlapping substrings starting with M and finishing with *. so, for the above example i would need:
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
as the output. predictably regexpr gives me the greedy solution:
regmatches(x, regexpr("M.+\\*", x))
#[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
i have also tried things suggested here, as this is the question that resembles my problem the most (but not quite), but to no avail.
any help would be appreciated.
I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:
regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:
R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"
#"MQLPSSFAALAAQFDQL*"
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
M[^*]+\\*
use negated character class.See demo.Also use perl=True option.
https://regex101.com/r/tD0dU9/6

Trying to use a regular expression in R to capture some data

So I have a table in R, and an example of of the string I am trying to capture is this:
C.Hale (79-83)
I want to write a regular expression to extract the (79-83).
How do I go about doing this?
We can use sub. We match one or more characters that are not a space ([^ ]+) from the beginning of the string (^) , followed by a space (\\s) and replace it with a ''.
sub('^[^ ]+\\s', '', str1)
#[1] "(79-83)"
Or another option is stri_extract_all from stringi
library(stringi)
stri_extract_all_regex(str1, '\\([^)]+\\)')[[1]]
#[1] "(79-83)"
data
str1 <- 'C.Hale (79-83)'
One possibility using the qdapRegex package I maintain:
x <- "C.Hale (79-83)"
library(qdapRegex)
rm_round(x, extract = TRUE, include.markers = TRUE)
## [[1]]
## [1] "(79-83)"

Return the first occurrence of a character in a string

I have been trying to extract a portion of string after the occurrence of a first ^ sign. For example, the string looks like abc^28092015^def^1234. I need to extract 28092015 sandwiched between the 1st two ^ signs.
So, I need to extract 8 characters from the occurrence of the 1st ^ sign. I have been trying to extract the position of the first ^ sign and then use it as an argument in the substr function.
I tried to use this:
x=abc^28092015^def^1234 `rev(gregexpr("\\^", x)[[1]])[1]`
Referring the answer discussed here.
But it continues to return the last position. Can anyone please help me out?
I would use sub.
x <- "^28092015^def^1234"
sub("^.*?\\^(.*?)\\^.*", "\\1", x)
# [1] "28092015"
Since ^ is a special char in regex, you need to escape that in-order to match literal ^ symbols.
or
Do splitting on ^ and get the value of second index.
strsplit(x,"^", fixed=-T)[[1]][2]
# [1] "28092015"
or
You may use gsub aslo.
gsub("^.*?\\^|\\^.*", "", x, perl=T)
# [1] "28092015"
Here's one option with base R:
x <- "abc^28092015^def^1234"
m <- regexpr("(?<=\\^)(.+?)(?=\\^)", x, perl = TRUE)
##
R> regmatches(x, m)
#[1] "28092015"
Another option is stri_extract_first from library(stringi)
library(stringi)
stri_extract_first_regex(str1, '(?<=\\^)\\d+(?=\\^)')
#[1] "28092015"
If it is any character between two ^
stri_extract(str1, regex='(?<=\\^)[^^]+')
#[1] "28092015"
data
str1 <- 'abc^28092015^def^1234'
x <- 'abc^28092015^def^1234'
library(qdapRegex)
unlist(rm_between(x, '^', '^', extract=TRUE))[1]
# [1] "28092015"
It would be better if you split it using ^. But if you still want the pattern, you can try this.
^\S+\^(\d+)(?=\^)
Then match group 1.
OUTPUT
28092015
See DEMO

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...),... and so) would do it. Cannot get it to work though!
Sorry if this is similar to other SO questions eg here or here).
This is a bit different then what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) characters .* ending in class-
2) characters .
3) Characters starting with _sample and characters afterwords .*
From those you want to keep the second "set" \\2.
Or another maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters that end in class- and the string _sample followed by any number of characters, and substitute them with the NULL character ""
We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters that are between the two lookarounds ((?<=class-) and (?=_sample)). We extract characters that is not a _ (based on the example) preceded by class- and succeded by _sample.
gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"