How to replace square brackets with curly brackets using R's regex? - regex

Due to conversions between pandoc-citeproc and latex I'd like to replace this
[#Fotheringham1981]
with this
\cite{Fotheringham1981}
.
The issue with treating each bracket separately is illustrated in the reproducible example below.
x <- c("[#Fotheringham1981]", "df[1,2]")
x1 <- gsub("\\[#", "\\\\cite{", x)
x2 <- gsub("\\]", "\\}", x1)
x2[1] # good
## [1] "\\cite{Fotheringham1981}"
x2[2] # bad
## [1] "df[1,2}"
Seen a similar issue solved for C#, but not using R's perly regex - any ideas?
Edit:
It should be able to handle long documents, e.g.
old_rmd <- "$p = \alpha e^{\beta d}$ [#Wilson1971] and $p = \alpha d^{\beta}$
[#Fotheringham1981]."
new_rmd1 <- gsub("\\[#([^\\]]*)\\]", "\\\\cite{\\1}", old_rmd, perl = T)
new_rmd2 <- gsub("\\[#([^]]*)]", "\\\\cite{\\1}", old_rmd)
new_rmd1
## "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n \\cite{Fotheringham1981}."
new_rmd2
## [1] "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n\\cite{Fotheringham1981}."

You can use
gsub("\\[#([^]]*)]", "\\\\cite{\\1}", x)
Regex breakdown:
\\[# - a literal [# symbol sequence
([^]]*) - a capture group 1 that matches 0 or more occurrences of any symbol but a ] (note that if ] appears at the beginning of a character class, it does not need escaping)
] - a literal ] symbol
You do not need to use perl=T with this one because the ] inside a character class is not escaped. Otherwise, it would require using that option.
Also, I believe we should only escape what must be escaped. If there is a way to avoid backslash hell, we should. Thus, you can even use
gsub("[[]#([^]]*)]", "\\\\cite{\\1}", x)
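As a quick check of the point about the unescaped ] inside the bracket expression, here is a minimal sketch that runs the same pattern through both engines. The variable names pat, res_tre and res_pcre are only for illustration.

```r
# Sample input from the question
x <- c("[#Fotheringham1981]", "df[1,2]")

# "]" placed first in the negated bracket expression is taken literally,
# so no backslash is needed inside the class
pat <- "\\[#([^]]*)]"

res_tre  <- gsub(pat, "\\\\cite{\\1}", x)               # default (TRE) engine
res_pcre <- gsub(pat, "\\\\cite{\\1}", x, perl = TRUE)  # PCRE engine

# cat() shows the real string contents (single backslash)
cat(res_tre, sep = "\n")
```

Both calls should leave df[1,2] untouched, since the pattern only fires on a [# opener.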
Why TRE-based regex works better than the PCRE one:
In R 2.10.0 and later, the default regex engine is a modified version of Ville Laurikari's TRE engine [source]. According to the library's author, matching time grows linearly with the length of the input text, while memory requirements stay almost constant (tens of kilobytes). TRE is also said to have predictable and modest memory consumption and a quadratic worst-case time in the length of the regular expression. That is why it seems best to rely on TRE rather than PCRE when dealing with larger documents.

You need to use a capturing group.
x <- c("[#Fotheringham1981]", "df[1,2]")
gsub("\\[#([^\\]]*)\\]", "\\\\cite{\\1}", x, perl=T)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"
or
gsub("\\[#(.*?)\\]", "\\\\cite{\\1}", x)
# [1] "\\cite{Fotheringham1981}" "df[1,2]"

This matches [#, then sets up a capture group (everything within (...)) in which .*? matches the shortest string up to the next ]:
gsub("\\[#(.*?)\\]", "\\\\cite{\\1}", x)
## [1] "\\cite{Fotheringham1981}" "df[1,2]"
Here is a railroad diagram of the regular expression:
\[#(.*?)\]

Related

Combining regex with a literal string

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
library(stringr)  # provides str_extract_all
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches to the regex that are following immediately to a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as in the regex above, but only selecting one of the 6 matches, because it needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
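The cab / bed / debt example can be checked directly in base R. This is only a small sketch; it assumes perl = TRUE, because lookbehind is a PCRE feature and is not available in the default engine. The names words and hits are illustrative.

```r
words <- c("cab", "bed", "debt")

# (?<=a)b matches a "b" only when the character just before it is "a"
hits <- grepl("(?<=a)b", words, perl = TRUE)
print(hits)

# Only the "b" is part of the match, not the preceding "a"
print(regmatches(words, regexpr("(?<=a)b", words, perl = TRUE)))
```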
There should be something more intuitive, but I think this will do the job:
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x){
a[[x]][[1]][1]
})
print(b)
Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
Syntax would be like,
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
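For readers without the stringr perl() wrapper, the same (*SKIP)(*F) idea can be sketched in base R with perl = TRUE. The shortened input s below is an assumption made for readability (it is the start of the question's string after the "-1-" replacement), and the names s, pat and res are illustrative.

```r
# Shortened, already-cleaned input (assumed for illustration)
s <- "FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1"

# Left alternative: consume the unwanted part and everything after it, then
# force a failure and skip past it. Right alternative: the pattern we want.
pat <- "FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*"

res <- regmatches(s, gregexpr(pat, s, perl = TRUE))[[1]]
print(res)
```

Only matches that occur before the skipped part survive, so the I3-I1 at the end is not returned.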
Here is another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
Here's how to implement it:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"
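A base R variant of the lookbehind approach might look like the sketch below. It assumes PCRE (perl = TRUE), where the lookbehind has to be fixed-length, so the optional -? is replaced by a mandatory hyphen. The shortened input s and the names pat and res are illustrative assumptions.

```r
# Shortened, already-cleaned input (assumed for illustration)
s <- "1-FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1"

# The match must start immediately after the literal "FA-I2-I2-I2-EX-"
pat <- "(?<=FA-I2-I2-I2-EX-)(?:I\\d-?)*I3(?:-?I\\d)*"

res <- regmatches(s, gregexpr(pat, s, perl = TRUE))[[1]]
print(res)
```

The later I3-I1 is ignored because it is preceded by a different string.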

Extracting variable name from B-spline coefficient in R with regex

In a linear model, I have some splines, using the bs function from the splines package (like lm(y ~ bs(x, 3))).
In the model summary and model data frame (colnames(fit$model)) these terms appear as, e.g., bs(efc17age, 3).
Now I would like to extract the variable name using regular expressions. However, I just don't understand regex syntax.
This is how far I came:
x <- "bs(e17age, 3)1"
sub("bs\\((*?)", "", x)
> [1] "e17age, 3)1"
I just want to have "e17age"... It must be so easy, if you understand regex...
You can use the following snippet:
x <- "bs(e17age, 3)1"
sub("^bs\\(([^,]*).*", "\\1", x)
Regex ^bs\\(([^,]*).* matches bs( at the start of the string, then captures any number of characters other than , with ([^,]*), and then matches any character up to the end. With the replacement string \\1, we get our captured text back.
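Since sub() is vectorised, the same pattern can be applied to a whole vector of term names at once. A minimal sketch, with a made-up vector term_names standing in for something like colnames(fit$model):

```r
# Hypothetical term names as they might appear in a model summary
term_names <- c("bs(e17age, 3)1", "bs(efc17age, 3)", "bs(x, 3)2", "y")

# Capture everything between "bs(" and the first comma; names that do not
# start with "bs(" are returned unchanged
var_names <- sub("^bs\\(([^,]*).*", "\\1", term_names)
print(var_names)
```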

Retrieve digits after specific string in R

I have a bunch of strings that contain the word "radius" followed by one or two digits. They also contain a lot of other letters, digits, and underscores. For example, one is "inflow100_radius6_distance12". I want a regex that will just return the one or two digits following "radius." If R recognized \K, then I would just use this:
radius\K[0-9]{1,2}
and be done. But R doesn't allow \K, so I ended up with this instead (which selects radius and the following numbers, and then cuts off "radius"):
result <- regmatches(input_string, gregexpr("radius[0-9]{1,2}", input_string))
result <- substr(unlist(result), 7, 8)
I'm pretty new to regex, so I'm sure there's a better way. Any ideas?
\K is recognized by the PCRE engine. You can solve the problem by passing perl = TRUE.
result <- regmatches(x, gregexpr('radius\\K\\d+', x, perl=T))
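As a small worked sketch of the \K approach with the {1,2} quantifier from the question: the second input string and the names x, m and radius are made up for illustration.

```r
x <- c("inflow100_radius6_distance12", "inflow50_radius12_distance3")

# \K drops "radius" from the reported match, so only the digits are returned
m <- regmatches(x, regexpr("radius\\K[0-9]{1,2}", x, perl = TRUE))

radius <- as.integer(m)
print(radius)
```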
1) Match the entire string replacing it with the digits after radius:
sub(".*radius(\\d+).*", "\\1", "inflow100_radius6_distance12")
## [1] "6"
The regular expression can be visualized as follows:
.*radius(\d+).*
2) This also works, involves a simpler regular expression and converts it to numeric at the same time:
library(gsubfn)
strapply("inflow100_radius6_distance12", "radius(\\d+)", as.numeric, simplify = TRUE)
## [1] 6
Here is a visualization of the regular expression:
radius(\d+)

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed in the function to dynamically create the pattern to be searched? How can this be done in R? The text file has number and text in every line
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)
In order to build a regular expression from variables in R, in the current scenario, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
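To see the escaping in action, here is a minimal sketch on a few made-up lines (the vector lines_in stands in for the file contents, which are not shown in the question). It also shows fixed = TRUE as an alternative when no regex features are needed at all.

```r
n <- 3
# Hypothetical file contents
lines_in <- c("{3} number of apples", "{13} number of pears", "3 number")

# Regex version: both braces escaped so they are matched literally
pat <- paste0("\\{", n, "\\} number")
by_regex <- grep(pat, lines_in, value = TRUE)

# Literal version: fixed = TRUE turns off regex interpretation entirely
by_fixed <- grep(paste0("{", n, "} number"), lines_in, value = TRUE, fixed = TRUE)

print(by_regex)
```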
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting regex, Ben likes (bananas|mangoes|plums)\., will match "Ben likes bananas.", "Ben likes mangoes." or "Ben likes plums."
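A short usage sketch of the alternation pattern; the test sentences are made up for illustration.

```r
words <- c("bananas", "mangoes", "plums")
regex <- paste0("Ben likes (", paste(words, collapse = "|"), ")\\.")

# Hypothetical inputs: only sentences naming a listed fruit should match
sentences <- c("Ben likes bananas.", "Ben likes kiwis.", "Ben likes plums.")
matched <- grepl(regex, sentences)
print(matched)
```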
NOTE: PCRE (when you pass perl=TRUE to base R regex functions) and ICU (stringr/stringi regex functions) have proved to handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
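examples <- 'Ben likes bananas, mangoes and plums.'  # assumed sample input, not from the original post
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"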
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana.
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. prepend a \ to each metacharacter:
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use one of the following:
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
A boundary you build yourself using a lookbehind/lookahead combination and a custom character class / bracket expression, or even more sophisticated patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."
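The third option, a custom boundary, is not shown above, so here is one possible sketch. The choice of boundary is an assumption for illustration: a match must not be adjacent to a word character or a slash, so C++/CLI is rejected while C++ followed by a comma is still accepted.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')

# Custom boundaries: no word character and no "/" directly before or after
regex <- paste0('(?<![\\w/])(', paste(regex.escape(words), collapse='|'), ')(?![\\w/])')
res <- gsub(regex, "<<\\1>>", examples, perl=TRUE)
print(res)
```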

Grab from beginning to first occurrence of character with gsub

I have the following string, and I'd like to grab everything from the beginning of the sentence until the first ##. I could use strsplit, as I demonstrate below, but I would prefer a gsub solution. If gsub is not the correct tool (I think it is, though), I'd prefer a base solution because I want to learn the base regex tools.
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
strsplit(x, "##")[[c(1, 1)]] #works
gsub("(.*)(##.*)", "\\1", x) #I want to work
Just add one character, putting a ? after the first quantifier to make it "non-greedy":
gsub("(.*?)(##.*)", "\\1", x)
# [1] "gfd gdr tsvfvetrv erv tevgergre "
Here's the relevant documentation, from ?regex
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to 'minimal' by appending
'?' to the quantifier.
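To make the greedy versus minimal difference concrete, here is a small side-by-side sketch. It passes perl = TRUE purely as an assumption, to pin the behaviour to one engine; the names greedy and lazy are illustrative.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Greedy: (.*) runs as far as it can, so group 1 ends at the LAST "##"
greedy <- sub("(.*)(##.*)", "\\1", x, perl = TRUE)

# Minimal: (.*?) stops as early as it can, so group 1 ends at the FIRST "##"
lazy <- sub("(.*?)(##.*)", "\\1", x, perl = TRUE)

print(greedy)
print(lazy)
```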
I'd say:
sub("##.*", "", x)
Removes everything including and after the first occurrence of ##.
In this case, I'd do the inverse, i.e. replace everything following # with an empty string:
gsub("#.*$", "", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
But you can also use the non-greedy modifier ? to make your regex work in the way you suggested:
gsub("(.*?)#.*$", "\\1", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
Here's another approach that uses more string tools instead of a more complicated regular expression. It first finds the location of the first ## and then extracts the substring up to that point:
library(stringr)
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
loc <- str_locate(x, "##")
str_sub(x, 1, loc[, "start"] - 1)
Generally, I think this sort of step-by-step approach is more maintainable than complex regular expressions.
Try this as your regex
^[^#]+
starts at the beginning of the string and matches anything not a # up to the first #
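In base R this pattern can be used for extraction rather than replacement, for example with regexpr and regmatches. A minimal sketch (the name prefix is illustrative); note that it stops at the first single #, not only at ##.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Extract the leading run of characters that are not "#"
prefix <- regmatches(x, regexpr("^[^#]+", x))
print(prefix)
```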
There are several simpler answers already here, but since you indicated in your question that you'd like to learn about regex support in base R, here's another way, using positive lookahead assertion (?=#) and non-greedy option (?U).
regmatches(x, regexpr('(?U)^.+(?=#)', x, perl=TRUE))
[1] "gfd gdr tsvfvetrv erv tevgergre "