Grab from beginning to first occurrence of character with gsub - regex

I have the following string, and I'd like to grab everything from the beginning of the sentence up to the first ##. I could use strsplit, as I demonstrate below, but I would prefer a gsub solution. If gsub is not the correct tool (I think it is, though), I'd prefer a base solution because I want to learn the base regex tools.
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
strsplit(x, "##")[[c(1, 1)]] #works
gsub("(.*)(##.*)", "\\1", x) #I want to work

Just add one character, putting a ? after the first quantifier to make it "non-greedy":
gsub("(.*?)(##.*)", "\\1", x)
# [1] "gfd gdr tsvfvetrv erv tevgergre "
Here's the relevant documentation, from ?regex:
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to 'minimal' by appending
'?' to the quantifier.
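As a quick check of the greedy versus minimal behaviour quoted above, here is a small sketch that runs both patterns on the question's string. The variable names greedy and lazy are only for illustration.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Greedy: (.*) grabs as much as it can, so it stops at the LAST "##"
greedy <- gsub("(.*)(##.*)", "\\1", x)

# Minimal: (.*?) grabs as little as it can, so it stops at the FIRST "##"
lazy <- gsub("(.*?)(##.*)", "\\1", x)

greedy
# [1] "gfd gdr tsvfvetrv erv tevgergre ## vev fe "
lazy
# [1] "gfd gdr tsvfvetrv erv tevgergre "
```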

I'd say:
sub("##.*", "", x)
This removes everything from the first occurrence of ## onward.

In this case, I'd do the inverse, i.e. replace everything from the first # onward with an empty string:
gsub("#.*$", "", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
But you can also use the non-greedy modifier ? to make your regex work in the way you suggested:
gsub("(.*?)#.*$", "\\1", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "

Here's another approach that uses more string tools instead of a more complicated regular expression. It first finds the location of the first ## and then extracts the substring up to that point:
library(stringr)
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
loc <- str_locate(x, "##")
str_sub(x, 1, loc[, "start"] - 1)
Generally, I think this sort of step-by-step approach is more maintainable than complex regular expressions.
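For readers who want the same locate-then-extract idea with base functions only, a rough equivalent might look like the sketch below. It assumes the marker is the literal string ## (hence fixed = TRUE) and that a string without a marker should be returned unchanged; that fallback is an assumption, not something stated in the question.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Position of the first literal "##"; regexpr() returns -1 when there is no match
pos <- as.integer(regexpr("##", x, fixed = TRUE))

# Keep everything before the marker, or the whole string if no marker was found
# (the fallback is an assumption made for this sketch)
prefix <- if (pos > 0) substr(x, 1, pos - 1) else x

prefix
# [1] "gfd gdr tsvfvetrv erv tevgergre "
```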

Try this as your regex
^[^#]+
This starts at the beginning of the string and matches everything that is not a #, up to the first #.
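Since this pattern describes the part to keep rather than the part to remove, one way to apply it in base R is with regexpr and regmatches, as in this sketch. Note that it stops at the first single #, not specifically at ##; the second string y is a made-up example to show that difference.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
y <- "a # b ## c"   # hypothetical input containing a lone "#"

# Extract the leading run of non-# characters
from_x <- regmatches(x, regexpr("^[^#]+", x))
from_y <- regmatches(y, regexpr("^[^#]+", y))

from_x
# [1] "gfd gdr tsvfvetrv erv tevgergre "
from_y
# [1] "a "
```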

There are several simpler answers already here, but since you indicated in your question that you'd like to learn about regex support in base R, here's another way, using positive lookahead assertion (?=#) and non-greedy option (?U).
regmatches(x, regexpr('(?U)^.+(?=#)', x, perl=TRUE))
[1] "gfd gdr tsvfvetrv erv tevgergre "
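One practical difference between the extraction approach and the substitution approach shows up when some inputs have no marker at all. The sketch below uses a made-up two-element vector: regmatches silently drops the element that does not match, whereas sub returns it unchanged.

```r
x <- c("abc ## def", "no marker here")   # hypothetical inputs

# Extraction: only elements that match are returned
m <- regexpr("(?U)^.+(?=#)", x, perl = TRUE)
extracted <- regmatches(x, m)

# Substitution: every element is returned, matched or not
stripped <- sub("##.*", "", x)

extracted
# [1] "abc "
stripped
# [1] "abc "           "no marker here"
```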

Related

How to replace square brackets with curly brackets using R's regex?

Due to conversions between pandoc-citeproc and latex I'd like to replace this
[#Fotheringham1981]
with this
\cite{Fotheringham1981}
.
The issue with treating each bracket separately is illustrated in the reproducible example below.
x <- c("[#Fotheringham1981]", "df[1,2]")
x1 <- gsub("\\[#", "\\\\cite{", x)
x2 <- gsub("\\]", "\\}", x1)
x2[1] # good
## [1] "\\cite{Fotheringham1981}"
x2[2] # bad
## [1] "df[1,2}"
Seen a similar issue solved for C#, but not using R's perly regex - any ideas?
Edit:
It should be able to handle long documents, e.g.
old_rmd <- "$p = \alpha e^{\beta d}$ [#Wilson1971] and $p = \alpha d^{\beta}$
[#Fotheringham1981]."
new_rmd1 <- gsub("\\[#([^\\]]*)\\]", "\\\\cite{\\1}", old_rmd, perl = T)
new_rmd2 <- gsub("\\[#([^]]*)]", "\\\\cite{\\1}", old_rmd)
new_rmd1
## "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n \\cite{Fotheringham1981}."
new_rmd2
## [1] "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n\\cite{Fotheringham1981}."
You can use
gsub("\\[#([^]]*)]", "\\\\cite{\\1}", x)
Regex breakdown:
\\[# - a literal [# symbol sequence
([^]]*) - a capture group 1 that matches 0 or more occurrences of any symbol but a ] (note that if ] appears at the beginning of a character class, it does not need escaping)
] - a literal ] symbol
You do not need to use perl=T with this one because the ] inside a character class is not escaped. Otherwise, it would require using that option.
Also, I believe we should only escape what must be escaped. If there is a way to avoid backslash hell, we should. Thus, you can even use
gsub("[[]#([^]]*)]", "\\\\cite{\\1}", x)
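To see that the different ways of writing the brackets are interchangeable here, this sketch runs three of the variants on the question's vector and compares the results. The names p1 to p3 are only for illustration.

```r
x <- c("[#Fotheringham1981]", "df[1,2]")

# Default (TRE) engine, "]" placed first inside the negated class
p1 <- gsub("\\[#([^]]*)]", "\\\\cite{\\1}", x)

# Same idea, with "[" written as a one-character class instead of escaped
p2 <- gsub("[[]#([^]]*)]", "\\\\cite{\\1}", x)

# PCRE, with "]" escaped inside the class
p3 <- gsub("\\[#([^\\]]*)\\]", "\\\\cite{\\1}", x, perl = TRUE)

identical(p1, p2) && identical(p1, p3)
# [1] TRUE

# writeLines() shows the actual characters, with a single backslash
writeLines(p1[1])
# \cite{Fotheringham1981}
```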
Why TRE-based regex works better than the PCRE one:
In R 2.10.0 and later, the default regex engine is a modified version of Ville Laurikari's TRE engine. The library's author states that the time spent matching grows linearly with the length of the input text, while memory requirements are almost constant (tens of kilobytes). TRE is also said to have predictable and modest memory consumption and a quadratic worst-case time in the length of the regular expression. That is why it seems best to rely on TRE rather than on PCRE when dealing with larger documents.
You need to use a capturing group.
x <- c("[#Fotheringham1981]", "df[1,2]")
gsub("\\[#([^\\]]*)\\]", "\\\\cite{\\1}", x, perl=T)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"
or
gsub("\\[#(.*?)\\]", "\\\\cite{\\1}", x)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"
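This matches [ and then sets up a capture group, i.e. everything within (...), where # followed by .*? matches the shortest string up to ]. Note that the # sits inside the group here, so it is kept in the output (writing \\[#(.*?)\\] instead would drop it):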
gsub("\\[(#.*?)\\]", "\\\\cite{\\1}", x)
## [1] "\\cite{#Fotheringham1981}" "df[1,2]"
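The position of the # relative to the capture group decides whether it survives the replacement. A minimal side-by-side sketch, with the names inside and outside chosen only for illustration:

```r
x <- c("[#Fotheringham1981]", "df[1,2]")

# "#" inside the group: it becomes part of \\1 and ends up in the output
inside <- gsub("\\[(#.*?)\\]", "\\\\cite{\\1}", x)

# "#" outside the group: it is matched but not copied into the replacement
outside <- gsub("\\[#(.*?)\\]", "\\\\cite{\\1}", x)

inside
# [1] "\\cite{#Fotheringham1981}" "df[1,2]"
outside
# [1] "\\cite{Fotheringham1981}" "df[1,2]"
```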

extracting multiple overlapping substrings

I have strings of amino acids like this:
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
and I would like to extract all non-overlapping substrings starting with M and finishing with *. So, for the above example, I would need:
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
as the output. Predictably, regexpr gives me the greedy solution:
regmatches(x, regexpr("M.+\\*", x))
#[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
I have also tried the things suggested here, as this is the question that most resembles my problem (but not quite), but to no avail.
Any help would be appreciated.
I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:
regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
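On the question's string, a plain non-greedy match happens to give the same three results, so the effect of excluding M is easier to see on a small made-up input where a second M appears before the first *. The string y below is hypothetical.

```r
y <- "MAMB*"   # hypothetical: a second M occurs before the terminating "*"

# Non-greedy "anything": starts at the first M and runs to the first "*"
lazy_any <- regmatches(y, gregexpr("M.+?\\*", y))[[1]]

# Excluding M from the body: the match must start at the last M before "*"
no_inner_m <- regmatches(y, gregexpr("M[^M]+?\\*", y))[[1]]

lazy_any
# [1] "MAMB*"
no_inner_m
# [1] "MB*"
```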
Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:
R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"
#"MQLPSSFAALAAQFDQL*"
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
M[^*]+\\*
Use a negated character class. See the demo. Also use the perl=TRUE option.
https://regex101.com/r/tD0dU9/6

Return the first occurrence of a character in a string

I have been trying to extract a portion of string after the occurrence of a first ^ sign. For example, the string looks like abc^28092015^def^1234. I need to extract 28092015 sandwiched between the 1st two ^ signs.
So, I need to extract 8 characters from the occurrence of the 1st ^ sign. I have been trying to extract the position of the first ^ sign and then use it as an argument in the substr function.
I tried to use this:
x <- "abc^28092015^def^1234"
rev(gregexpr("\\^", x)[[1]])[1]
following the answer discussed here.
But it continues to return the last position. Can anyone please help me out?
I would use sub.
x <- "abc^28092015^def^1234"
sub("^.*?\\^(.*?)\\^.*", "\\1", x)
# [1] "28092015"
Since ^ is a special character in regex, you need to escape it in order to match a literal ^ symbol.
or
Split on ^ and take the second element.
strsplit(x, "^", fixed=TRUE)[[1]][2]
# [1] "28092015"
or
You may use gsub also.
gsub("^.*?\\^|\\^.*", "", x, perl=T)
# [1] "28092015"
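The position-based idea from the question also works once the first match position is used instead of the reversed one. A minimal sketch, assuming the string always contains at least two ^ characters (the name carets is only for illustration):

```r
x <- "abc^28092015^def^1234"

# Positions of every literal "^"; fixed = TRUE avoids having to escape it
carets <- gregexpr("^", x, fixed = TRUE)[[1]]

# Take the text strictly between the first and second "^"
between <- substr(x, carets[1] + 1, carets[2] - 1)

between
# [1] "28092015"
```

Dropping the rev() call is the key change: gregexpr already returns the positions in left-to-right order, so element 1 is the first ^.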
Here's one option with base R:
x <- "abc^28092015^def^1234"
m <- regexpr("(?<=\\^)(.+?)(?=\\^)", x, perl = TRUE)
##
R> regmatches(x, m)
#[1] "28092015"
Another option is stri_extract_first_regex from library(stringi):
library(stringi)
stri_extract_first_regex(str1, '(?<=\\^)\\d+(?=\\^)')
#[1] "28092015"
If it is any character between two ^
stri_extract(str1, regex='(?<=\\^)[^^]+')
#[1] "28092015"
data
str1 <- 'abc^28092015^def^1234'
x <- 'abc^28092015^def^1234'
library(qdapRegex)
unlist(rm_between(x, '^', '^', extract=TRUE))[1]
# [1] "28092015"
It would be better if you split it using ^. But if you still want the pattern, you can try this.
^\S+\^(\d+)(?=\^)
Then match group 1.
OUTPUT
28092015
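In base R, one way to "match group 1" for a pattern like this is regexec together with regmatches, as sketched below. This assumes a version of R in which regexec accepts perl = TRUE (needed for the lookahead); the backslashes are doubled because the pattern is written as an R string.

```r
x <- "abc^28092015^def^1234"

# regexec() reports the full match and each capture group
m <- regexec("^\\S+\\^(\\d+)(?=\\^)", x, perl = TRUE)
groups <- regmatches(x, m)[[1]]

groups[1]   # the full match
# [1] "abc^28092015"
groups[2]   # capture group 1
# [1] "28092015"
```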

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to replace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
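The two changes described above can be checked separately. This sketch compares the original pattern, a version with only the + added, and the full fix; the variable names are only for illustration.

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# Original attempt: exactly one letter between the parentheses, so nothing matches
one_letter <- gsub("\\(([[:alpha:]])\\)", "\\1", tst)

# One-or-more letters, but no leading ".*": the number in front is left in place
no_prefix <- gsub("\\(([[:alpha:]]+)\\)", "\\1", tst)

# One-or-more letters plus ".*": the whole string is replaced by the group
full_fix <- gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)

one_letter
# [1] "85"        "86 (TBA)"  "87 (LAST)"
no_prefix
# [1] "85"      "86 TBA"  "87 LAST"
full_fix
# [1] "85"   "TBA"  "LAST"
```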
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this. See the demo. Replace by \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain, as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parentheses and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdapRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"

R: Extract data from string using POSIX regular expression

How to extract only DATABASE_NAME from this string using POSIX-style regular expressions?
st <- "MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME."
First of all, this generates an error
Error: '\I' is an unrecognized escape in character string starting "MICROSOFT_SQL_SERVER.DATABASE\I"
I was thinking something like
sub(".*\\.", st, "")
The first problem is that you need to escape the \ in your string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
As for the main problem, this will return the bit you want from the string you gave:
> sub("\\.$", "", sub("[A-Za-z0-9\\._]*\\\\[A-Za-z]*\\.", "", st))
[1] "DATABASE_NAME"
But a simpler solution would be to split on the \\. and select the last chunk:
> strsplit(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"
or slightly more automated
> sst <- strsplit(st, "\\.")[[1]]
> tail(sst, 1)
[1] "DATABASE_NAME"
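Because the separator is a literal period, the split can also be written without any regex escaping by passing fixed = TRUE. A small sketch (the name parts is only for illustration); strsplit does not produce a trailing empty element for the final period, so the last element is the database name.

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# Split on the literal "." with no regex interpretation
parts <- strsplit(st, ".", fixed = TRUE)[[1]]

length(parts)
# [1] 3
tail(parts, 1)
# [1] "DATABASE_NAME"
```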
Other answers provided some really good alternative ways of cracking the problem using strsplit or str_split.
However, if you really want to use a regex and gsub, this solution substitutes the first two occurrences of a (string followed by a period) with an empty string.
Note the use of the ? modifier to tell the regex not to be greedy, as well as the {2} quantifier to tell it to repeat the parenthesized expression two times.
gsub("\\.", "", gsub("(.+?\\.){2}", "", st))
[1] "DATABASE_NAME"
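Splitting the nested call into two steps may make it easier to follow what each gsub contributes. This is only a sketch of the same expression with an intermediate variable; step1 and step2 are illustrative names.

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# Step 1: remove the first two "text followed by a period" chunks
step1 <- gsub("(.+?\\.){2}", "", st)

# Step 2: remove the remaining trailing period
step2 <- gsub("\\.", "", step1)

step1
# [1] "DATABASE_NAME."
step2
# [1] "DATABASE_NAME"
```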
An alternative approach is to use str_split in package stringr. The idea is to split st into strings at each period, and then to isolate the third string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
library(stringr)
str_split(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"