R regular expression - regex

I have a string like below.
testSampe <- "Old:windows\r\nNew:linux\r\n"
I want to erase the string between ":" and "\".
Like this "Old\r\nNew\r\n".
How can I construct the regex for this?
I tried the gsub function with the regex ":.*\\\\", but it doesn't work.
gsub(":.*\\\\", "\\\\r", testSampe)

> testSampe <- "Old:windows\r\nNew:linux\r\n"
> gsub(":[^\r\n]*", "", testSampe)
[1] "Old\r\nNew\r\n"
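The original attempt fails because "\r" inside an R string literal is a single carriage-return character, so the text contains no backslash for ":.*\\\\" to match. A minimal sketch that checks this and then applies the negated character class from the answer above (base R only; the names has_backslash, attempt and cleaned are made up for illustration):

```r
testSampe <- "Old:windows\r\nNew:linux\r\n"

# "\r" is one character (carriage return), not a backslash followed by "r",
# so a fixed-string search for a backslash finds nothing
has_backslash <- grepl("\\", testSampe, fixed = TRUE)

# The attempted pattern requires a literal backslash, so nothing is replaced
attempt <- gsub(":.*\\\\", "\\\\r", testSampe)

# Remove ":" plus everything up to (but not including) the line ending
cleaned <- gsub(":[^\r\n]*", "", testSampe)
```

Because the character class excludes both "\r" and "\n", the match stops at the end of each line and the line endings are kept.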

You have a choice of a few different regular expressions that will match. See falsetru's answer or use:
rx <- ":[[:alnum:]]*(?=\\r)"
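As a more readable alternative to gsub, use str_replace_all in the stringr package. (The perl() wrapper was deprecated in stringr 1.0.0 in favour of regex(), which also supports lookahead.)
library(stringr)
str_replace_all(testSampe, regex(rx), "")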
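If stringr is not available, the same lookahead pattern should also work in base R, provided gsub is switched to the PCRE engine with perl = TRUE. A short sketch under that assumption (base_result is a made-up name):

```r
testSampe <- "Old:windows\r\nNew:linux\r\n"
rx <- ":[[:alnum:]]*(?=\\r)"

# perl = TRUE selects PCRE, which supports the (?=...) lookahead;
# the lookahead checks for a carriage return without consuming it
base_result <- gsub(rx, "", testSampe, perl = TRUE)
```

The lookahead only asserts that "\r" follows, so the carriage return stays in the result.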

Related

Extract characters within brackets "[" and "]" including brackets

I have a character string like this:
GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT
I realize that I can use substring for this particular case. However, the position of the [X/Y] differs among strings and the content between the brackets varies in length.
I would like to extract the [X/Y].
stringr is useful for these types of operations,
library(stringr)
str_extract(x, '\\[.*\\]')
#[1] "[A/C]"
or str_extract_all if you have more than one pattern in your strings.
We can use bracketXtract from qdap
library(qdap)
unname(bracketXtract(dat, "square", with = TRUE))
#[1] "[A/C]"
Or using base R
gsub
gsub("^[^[]+|[^]]+$", '', dat)
#[1] "[A/C]"
strsplit
strsplit(dat, "[^[]+(?=\\[)|(?<=])[^]]+", perl=TRUE)[[1]][2]
#[1] "[A/C]"
data
dat <- "GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT"
Provided that there's only one pair of "[]" per string, use gregexpr:
dat<-c("GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT")
substring(dat, gregexpr("\\[", dat), gregexpr("\\]", dat))
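When a string can contain more than one bracketed marker, a greedy "\\[.*\\]" runs from the first "[" to the last "]". A base R sketch of one way around that, using a negated character class with gregexpr and regmatches (the shortened two-marker sequence and the names greedy and each are made up for illustration):

```r
# Shortened, hypothetical sequence containing two bracketed markers
dat <- "GATC[A/C]AGGT[G/T]TTAG"

# Greedy match: spans from the first "[" to the last "]"
greedy <- regmatches(dat, regexpr("\\[.*\\]", dat))

# Negated class: "[^]]*" cannot cross a "]", so each marker is matched separately
each <- regmatches(dat, gregexpr("\\[[^]]*\\]", dat))[[1]]
```

Here greedy holds the whole span "[A/C]AGGT[G/T]", while each holds the two markers as separate elements.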

extracting multiple overlapping substrings

I have strings of amino acids like this:
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
and I would like to extract all non-overlapping substrings starting with M and finishing with *. So, for the above example I would need:
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
as the output. Predictably, regexpr gives me the greedy solution:
regmatches(x, regexpr("M.+\\*", x))
#[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
I have also tried things suggested here, as this is the question that resembles my problem the most (but not quite), but to no avail.
Any help would be appreciated.
I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:
regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:
R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"
#"MQLPSSFAALAAQFDQL*"
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
M[^*]+\\*
Use a negated character class. See the demo. Also use the perl=TRUE option.
https://regex101.com/r/tD0dU9/6
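The patterns suggested in these answers differ in small ways. The sketch below compares them on a short made-up sequence; perl = TRUE is used throughout so that the behaviour is that of PCRE, and all variable names are hypothetical:

```r
x <- "MAB*MCD*EF*GMH*"   # short, made-up sequence

# Greedy: one match from the first M to the last *
greedy <- regmatches(x, regexpr("M.+\\*", x, perl = TRUE))

# Lazy quantifier: each match stops at the first * after its M
lazy <- regmatches(x, gregexpr("M.+?\\*", x, perl = TRUE))[[1]]

# Negated class: "[^*]+" cannot run past a *, so the result is the same here
negated <- regmatches(x, gregexpr("M[^*]+\\*", x, perl = TRUE))[[1]]

# The variants differ when a second M appears before the first *
y <- "MAMB*"
lazy_y  <- regmatches(y, gregexpr("M.+?\\*", y, perl = TRUE))[[1]]    # starts at the first M
inner_y <- regmatches(y, gregexpr("M[^M]+?\\*", y, perl = TRUE))[[1]] # starts at the last M
```

Which behaviour is wanted for the "MAMB*" case depends on whether a match should start at the first M or at the M closest to the terminating *.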

R capture everything from pattern until pattern

I am trying to extract a substring in between the two patterns BB and </p>:
require("stringr")
str = "<notes>\n <p>AA:</p>\n <p>BB: word, otherword</p>\n <p>Number:</p>\n <p>Level: 1</p>\n"
str_extract(str, "BB.*?:</p>")
The extracted substring should be "word, otherword", but I capture too much:
[1] "BB: word, otherword</p>\n <p>Number:</p>"
Maybe something like this?
> gsub(".*BB: (.*?)</p>.*$", "\\1", str)
# [1] "word, otherword"
This is a job for Perl-style regular expressions, namely lookahead and lookbehind assertions. Older versions of stringr required wrapping the regex in perl(); since stringr 1.0.0 that wrapper is deprecated in favour of regex():
str_extract(str, regex("(?<=BB: ).*?(?=</p>)"))
[1] "word, otherword"
You can also do this with base:
regmatches(str, regexpr("(?<=BB: ).*?(?=</p>)", str, perl=TRUE))
[1] "word, otherword"
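The original pattern "BB.*?:</p>" over-captures because it requires a ":" directly before "</p>", and the first place that occurs after "BB" is in the "Number:" line. A base R sketch of two ways to pull out only the wanted text, assuming R 3.3.0 or later for the perl argument of regexec (the names txt, via_lookaround and via_group are made up; txt is used instead of str to avoid masking the str() function):

```r
txt <- "<notes>\n <p>AA:</p>\n <p>BB: word, otherword</p>\n <p>Number:</p>\n <p>Level: 1</p>\n"

# Lookbehind/lookahead: the delimiters are checked but not part of the match
via_lookaround <- regmatches(txt, regexpr("(?<=BB: ).*?(?=</p>)", txt, perl = TRUE))

# Capture group: regexec returns the full match first, then each group
m <- regmatches(txt, regexec("BB: (.*?)</p>", txt, perl = TRUE))
via_group <- m[[1]][2]
```

Both approaches should return the text between "BB: " and the next "</p>".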

Gsub to get part matched strings in R regular expression?

gsub('[a-zA-Z]+([0-9]{5})','\\1','htf84756.iuy')
[1] "84756.iuy"
I want to get 84756. How can I do that?
Using gregexpr() with regmatches() has the advantage of only requiring that your pattern match the bit that you actually want to extract:
string <- 'htf84756.iuy'
pat <- "(\\d){5}"
regmatches(string, gregexpr(pat, string))[[1]]
# [1] "84756"
(In practice, these functions are more useful when a supplied string might contain more than one substring matching pat.)
Try this:
R> gsub('[a-zA-Z]+([0-9]{5}).*','\\1','htf84756.iuy')
[1] "84756"
R>
You need the added .* at the end of the regexp: it greedily matches everything after the 5 digits, so that trailing text is part of the match and gets replaced away too.
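The underlying point is that sub and gsub replace only the matched part of the string and leave everything else untouched. A small base R sketch of the difference (the names s, partial, whole and digits are made up for illustration):

```r
s <- "htf84756.iuy"

# The match ends after the 5 digits, so ".iuy" is never matched and survives
partial <- sub("[a-zA-Z]+([0-9]{5})", "\\1", s)

# The trailing ".*" pulls the rest of the string into the match,
# so the whole string is replaced by the captured digits
whole <- sub("[a-zA-Z]+([0-9]{5}).*", "\\1", s)

# Extracting instead of replacing needs only the part that is wanted
digits <- regmatches(s, regexpr("[0-9]{5}", s))
```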
This could work as well (though I like Dirk's answer better), building on your original pattern:
gsub('[a-zA-Z]+([0-9]{5})\\.([a-zA-Z])+','\\1','htf84756.iuy')
If you just want the numeric string this may be helpful as well:
gsub('[^0-9]','','htf84756.iuy')
With stringr, you can use str_extract:
library(stringr)
str_extract("htf84756.iuy", "[0-9]+")

R: Extract data from string using POSIX regular expression

How to extract only DATABASE_NAME from this string using POSIX-style regular expressions?
st <- "MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME."
First of all, this generates an error
Error: '\I' is an unrecognized escape in character string starting "MICROSOFT_SQL_SERVER.DATABASE\I"
I was thinking something like
sub(".*\\.", st, "")
The first problem is that you need to escape the \ in your string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
As for the main problem, this will return the bit you want from the string you gave:
> sub("\\.$", "", sub("[A-Za-z0-9\\._]*\\\\[A-Za-z]*\\.", "", st))
[1] "DATABASE_NAME"
But a simpler solution would be to split on the \\. and select the last chunk:
> strsplit(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"
or slightly more automated
> sst <- strsplit(st, "\\.")[[1]]
> tail(sst, 1)
[1] "DATABASE_NAME"
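Taking the last chunk works here because base R's strsplit does not return a trailing empty string when the input ends with the separator. A sketch of the same idea using fixed = TRUE, which treats "." as a literal character and avoids the regex escape (the names parts and db_name are made up):

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# fixed = TRUE splits on a literal ".", so no "\\." escaping is needed
parts <- strsplit(st, ".", fixed = TRUE)[[1]]

# The final "." does not produce an empty last element,
# so the last chunk is the database name
db_name <- tail(parts, 1)
```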
Other answers provided some really good alternative ways of cracking the problem using strsplit or str_split.
However, if you really want to use a regex and gsub, this solution substitutes the first two occurrences of a (string followed by a period) with an empty string.
Note the use of the ? quantifier modifier to tell the regex not to be greedy, as well as the {2} quantifier to tell it to repeat the expression in parentheses two times.
gsub("\\.", "", gsub("(.+?\\.){2}", "", st))
[1] "DATABASE_NAME"
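To see what each step of this approach contributes, the sketch below splits it into its two substitutions and adds a single-call alternative that captures the last chunk directly. It assumes perl = TRUE for the lazy quantifier so that the behaviour is that of PCRE; step1, db and one_step are made-up names:

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# Step 1: drop the first two "text followed by a period" groups
step1 <- sub("(.+?\\.){2}", "", st, perl = TRUE)

# Step 2: drop the trailing period
db <- sub("\\.$", "", step1)

# Alternative: capture the text between the last two periods in one call
one_step <- sub(".*\\.([^.]+)\\.$", "\\1", st)
```

The alternative relies on the string ending with a period, as in the example given.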
An alternative approach is to use str_split in package stringr. The idea is to split st into strings at each period, and then to isolate the third string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
library(stringr)
str_split(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"