Extract string between parenthesis in R - regex

I have to extract values between a very peculiar feature in R. For eg.
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
This is my example string and I wish to extract text between {[0-9]: and } such that my output for the above string looks like
## output should be
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098", "{112:123123214321}" "20:asdasd3214213"

This is a horrible hack and probably breaks on your real data. Ideally you could just use a parser but if you're stuck with regex... well... it's not pretty
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
# split based on }{ allowing for newlines and spaces
out <- strsplit(a, "\\}[[:space:]]*\\{")
# Make a single vector
out <- unlist(out)
# Have an excess open bracket in first
out[1] <- substring(out[1], 2)
# Have an excess closing bracket in last
n <- length(out)
out[length(out)] <- substring(out[n], 1, nchar(out[n])-1)
# Remove the number colon at the beginning of the string
answer <- gsub("^[0-9]*\\:", "", out)
which gives
> answer
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
You could wrap something like that in a function if you need to do this for multiple strings.

Using PERL. This way is a bit more robust.
a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}"
foohacky = function(str){
#remove opening bracket
pt1 = gsub('\\{+[0-9]:', '##',str)
#remove a closing bracket that is preceded by any alphanumeric character
pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1',pt1, perl=TRUE)
#split up and hack together the result
pt3 = strsplit(pt2, "##")[[1]][-1]
pt3
}
For example
> foohacky(a)
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
It also works with nesting
> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}"
> foohacky(a)
[1] "0987617820" "{112:123123214321}" "{20:asdasd3214213}"

Here's a more general way, which returns any pattern between {[0-9]: and } allowing for a single nest of {} inside the match.
regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE)
a_parse <- regmatches(a, regPattern)
a <- unlist(a_parse)

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. You can use it with perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from ther first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* wil match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and substitute it with empty characters string (effectively removing that part). Don't forget that gsub returns modified string, not modifies it in place. You won't see any change until you reassign return value back to some variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ([.](.*)$ | [-](.*)$'), if not for unnecessary spaces, would remove everything from first dot (.) or dash (-) to end of text. This might be what you want, but not what you said you want.

Extract the last word between | |

I have the following dataset
> head(names$SAMPLE_ID)
[1] "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
[2] "Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
[3] "Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|"
[4] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[5] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[6] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
I want to extract the last word between || as a new variable i.e.
Acinetobacter
Bacillus
Haemophilus
I have tried using
library(stringr)
names$sample2 <- str_match(names$SAMPLE_ID, "|.*?|")
We can use
library(stringi)
stri_extract_last_regex(v1, '\\w+')
#[1] "Acinetobacter"
data
v1 <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
Using just base R:
myvar <- gsub("^..*\\|(\\w+)\\|$", "\\1", names$SAMPLE_ID)
^.*\\|\\K.*?(?=\\|)
Use \K to remove rest from the final matche.See demo.Also use perl=T
https://regex101.com/r/fM9lY3/45
x <- c("Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|",
"Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|" )
unlist(regmatches(x, gregexpr('^.*\\|\\K.*?(?=\\|)', x, perl = TRUE)))
# [1] "Streptococcus" "Streptococcus"
The ending is all you need [^|]+(?=\|$)
Per #RichardScriven :
Which in R would be regmatches(x, regexpr("[^|]+(?=\\|$)", x, perl = TRUE)
You can use package "stringr" as well in this case. Here is the code:
v<- "Bacteria|
Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
v1<- str_replace_all(v, "\\|", " ")
word(v1,-2)
Here I used v as the string. The basic theory is to replace all the | with spaces, and then get the last word in the string by using function word().

String split in R skipping first delimiter if multiple delimiters are present

I have "elephant_giraffe_lion" and "monkey_tiger" strings.
The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at that delimiter. So the results I want to get in this example are "elephant_giraffe" and "monkey".
mystring<-c("elephant_giraffe_lion", "monkey_tiger")
result
"elephant_giraffe" "monkey"
You can anchor your split to the end of the string using $,
unlist(strsplit(mystring, "_[a-z]+$"))
# [1] "elephant_giraffe" "monkey"
Edit
The above only matches the last "_", not accounting for cases where there are more than two "_". For the more general case, you could try
mystring<-c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
tmp <- gsub("([^_]+_[^_]+).*", "\\1", mystring)
tmp[tmp==mystring] <- sapply(strsplit(tmp[tmp==mystring], "_"), `[[`, 1)
tmp
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
You could also use gsubfn, to process the match with a function
library(gsubfn)
f <- function(x,y) if (y==x) strsplit(y, "_")[[1]][[1]] else y
gsubfn("([^_]+_[^_]+).*", f, mystring, backref=1)
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
As I posted an answer on your other related question, a base R solution:
x <- c('elephant_giraffe_lion', 'monkey_tiger', 'foo_bar_baz_bap')
sub('^(?|([^_]*_[^_]*)_.*|([^_]*)_[^_]*)$', '\\1', x, perl=TRUE)
# [1] "elephant_giraffe" "monkey" "foo_bar"

Sequentially replace multiple places matching single pattern in a string with different replacements

Using stringr package, it is easy to perform regex replacement in a vectorized manner.
Question: How can I do the following:
Replace every word in
hello,world??your,make|[]world,hello,pos
to different replacements, e.g. increasing numbers
1,2??3,4|[]5,6,7
Note that simple separators cannot be assumed, the practical use case is more complicated.
stringr::str_replace_all does not seem to work because it
str_replace_all(x, "(\\w+)", 1:7)
produces a vector for each replacement applied to all words, or it has
uncertain and/or duplicate input entries so that
str_replace_all(x, c("hello" = "1", "world" = "2", ...))
will not work for the purpose.
Here's another idea using gsubfn. The pre function is run before the substitutions and the fun function is run for each substitution:
library(gsubfn)
x <- "hello,world??your,make|[]world,hello,pos"
p <- proto(pre = function(t) t$v <- 0, # replace all matches by 0
fun = function(t, x) t$v <- v + 1) # increment 1
gsubfn("\\w+", p, x)
Which gives:
[1] "1,2??3,4|[]5,6,7"
This variation would give the same answer since gsubfn maintains a count variable for use in proto functions:
pp <- proto(fun = function(...) count)
gsubfn("\\w+", pp, x)
See the gsubfn vignette for examples of using count.
I would suggest the "ore" package for something like this. Of particular note would be ore.search and ore.subst, the latter of which can accept a function as the replacement value.
Examples:
library(ore)
x <- "hello,world??your,make|[]world,hello,pos"
## Match all and replace with the sequence in which they are found
ore.subst("(\\w+)", function(i) seq_along(i), x, all = TRUE)
# [1] "1,2??3,4|[]5,6,7"
## Create a cool ore object with details about what was extracted
ore.search("(\\w+)", x, all = TRUE)
# match: hello world your make world hello pos
# context: , ?? , |[] , ,
# number: 1==== 2==== 3=== 4=== 5==== 6==== 7==
Here a base R solution. It should also be vectorized.
x="hello,world??your,make|[]world,hello,pos"
#split x into single chars
x_split=strsplit(x,"")[[1]]
#find all char positions and replace them with "a"
x_split[gregexpr("\\w", x)[[1]]]="a"
#find all runs of "a"
rle_res=rle(x_split)
#replace run lengths by 1
rle_res$lengths[rle_res$values=="a"]=1
#replace run values by increasing number
rle_res$values[rle_res$values=="a"]=1:sum(rle_res$values=="a")
#use inverse.rle on the modified rle object and collapse string
paste0(inverse.rle(rle_res),collapse="")
#[1] "1,2??3,4|[]5,6,7"

How to fill gap between two characters with regex

I have a data set like below. I would like to replace all dots between two 1's with 1's, as shown in the desired.result. Can I do this with regex in base R?
I tried:
regexpr("^1\\.1$", my.data$my.string, perl = TRUE)
Here is a solution in c#
Characters between two exact characters
Thank you for any suggestions.
my.data <- read.table(text='
my.string state
................1...............1. A
......1..........................1 A
.............1.....2.............. B
......1.................1...2..... B
....1....2........................ B
1...2............................. C
..........1....................1.. C
.1............................1... C
.................1...........1.... C
........1....2.................... C
......1........................1.. C
....1....1...2.................... D
......1....................1...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
desired.result <- read.table(text='
my.string state
................11111111111111111. A
......1111111111111111111111111111 A
.............1.....2.............. B
......1111111111111111111...2..... B
....1....2........................ B
1...2............................. C
..........1111111111111111111111.. C
.111111111111111111111111111111... C
.................1111111111111.... C
........1....2.................... C
......11111111111111111111111111.. C
....111111...2.................... D
......1111111111111111111111...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
Below is an option using gsub with the \G feature and lookaround assertions.
> gsub('(?:1|\\G(?<!^))\\K\\.(?=\\.*1)', '1', my.data$my.string, perl = TRUE)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."
# [7] "..........1111111111111111111111.." ".111111111111111111111111111111..."
# [9] ".................1111111111111...." "........1....2...................."
# [11] "......11111111111111111111111111.." "....111111...2...................."
# [13] "......1111111111111111111111......" ".................1...2............"
The \G feature is an anchor that can match at one of two positions; the start of the string position or the position at the end of the last match. Since it seems you want to avoid the dots at the start of the string position we use a lookaround assertion \G(?<!^) to exclude the start of the string.
The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included.
You can find an overall breakdown that explains the regular expression here.
Using gsubfn, the first argument is a regular expression which matches the 1's and the characters between the 1's and captures the latter. The second argument is a function, expressed in formula notation, which uses gsub to replace each character in the captured string with 1:
library(gsubfn)
transform(my.data, my.string = gsubfn("1(.*)1", ~ gsub(".", 1, x), my.string))
If there can be multiple pairs of 1's in a string then use "1(.*?)1" as the regular expression instead.
Visualization The regular expression here is simple enough that it can be directly understood but here is a debuggex visualization anwyays:
1(.*)1
Debuggex Demo
Here is an option that uses a relatively simple regex and the standard combination of gregexpr(), regmatches(), and regmatches<-() to identify, extract, operate on, and then replace substrings matching that regex.
## Copy the character vector
x <- my.data$my.string
## Find sequences of "."s bracketed on either end by a "1"
m <- gregexpr("(?<=1)\\.+(?=1)", x, perl=TRUE)
## Standard template for operating on and replacing matched substrings
regmatches(x,m) <- sapply(regmatches(x,m), function(X) gsub(".", "1", X))
## Check that it worked
head(x)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."