I'm trying to use regular expressions in R with the regexpr function. I have multiple conditions to match, so my regular expression ends up quite long, for example "A\s+(\d+)|(\d+)\s+A". I want to put each alternative on its own line, like
"A\\s+(\\d+)|
(\\d+)\\s+A|"
But it's not working. The parentheses are there to capture the digits I want to extract. Can anyone give suggestions?
1) paste. Try using paste:
paste("A\\s+(\\d+)",
"(\\d+)\\s+A",
sep = "|")
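For example, the assembled pattern could then be used with regexpr and regmatches (a minimal sketch; the sample string x is made up):
pat <- paste("A\\s+(\\d+)",
             "(\\d+)\\s+A",
             sep = "|")
x <- "A 42 and 7 A"   # made-up sample input
m <- regexpr(pat, x)  # first match of either alternative
regmatches(x, m)
## [1] "A 42"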
2) rex. Another possibility is to use the rex package:
library(rex)
rex(group("A", spaces, capture(digits)) %or%
group(capture(digits), spaces, "A"))
which gives:
(?:(?:A[[:space:]]+([[:digit:]]+))|(?:([[:digit:]]+)[[:space:]]+A))
3) rebus. The rebus package is similar in intent:
library(rebus)
literal("A") %R% one_or_more(space()) %R% capture(one_or_more(ascii_digit())) %|%
capture(one_or_more(digit())) %R% one_or_more(space()) %R% literal("A")
which emits:
<regex> \QA\E[[:space:]]+([0-9]+)|([[:digit:]]+)[[:space:]]+\QA\E
If you want to break a string literal up onto several lines in your script, one solution is to use paste0:
my_expr <- paste0('partone',
                  'parttwo',
                  'partthree')
Then you get the desired result:
> my_expr
[1] "partoneparttwopartthree"
You can't just break it up onto several lines between quotes, because then the newline character becomes part of the string.
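A quick check (minimal sketch) showing why:
my_expr <- 'partone
parttwo'
my_expr
## [1] "partone\nparttwo"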
If you are also trying to troubleshoot your regular expression, you'll need to post a sample of the data you are working with and the desired result.
Just use the x modifier with perl = TRUE in whatever function you're using. Place the x modifier ((?x)) at the beginning of the expression and whitespace is ignored. Additionally, anything from a # to the end of a line is treated as a comment and ignored as well.
pat <- "(?x)\\\\ ## Grab a backslash followed by...
[a-zA-Z0-9]*cite[a-zA-Z0-9]* ## A word that contains ‘cite‘
(\\[([^]]+)\\]){0,2}\\** ## Look for 0-2 square brackets w/ content
\\{([a-zA-Z0-9 ,]+)\\}" ## Look for curly braces with viable bibkey
tex <- c(
"Many \\parencite*{Ted2005, Moe1999} say graphs \\textcite{Few2010}.",
"But \\authorcite{Ware2013} said perception good too.",
"Random words \\pcite[see][p. 22]{Get9999c}.",
"Still more \\citep[p. 22]{Foo1882c}?"
)
gsub(pat, "", tex, perl=TRUE)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
A second approach: I maintain a package called regexr that attempts to enable maintainers of regular expression libraries to write regular expressions in a way that is similar to the way R code is written.
This may be overkill if you aren't planning long-term maintenance of the expression, but you could do the same thing with regexr as follows (no need for perl = TRUE). Note the minimal comments, as the meaning is carried by the subexpression names. The %:)% is a comment operator (commented code is happy code), but you need not use the leading names or comments, just construct():
library(regexr)
pat2 <- construct(
    backslash    = "\\\\"                         %:)% "\\",
    cite_command = "[a-zA-Z0-9]*cite[a-zA-Z0-9]*" %:)% "parencite",
    square_brack = "(\\[([^]]+)\\]){0,2}\\**"     %:)% "[e.g.][p. 12]",
    bibkeys      = "\\{([a-zA-Z0-9 ,]+)\\}"       %:)% "{Rinker2014}"
)
gsub(pat2, "", tex)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
The regexr framework requires a bit of upfront time, but the "code" is much easier to maintain, more modular, and easier for others to understand without learning a new "language". This is one approach of many, and I tend to use a combination of standard regex, regexr, and rebus (which works within the regexr framework). For example, we can grab any of the subexpressions from pat2 with the subs function as follows:
subs(pat2)
## $backslash
## [1] "\\\\"
##
## $cite_command
## [1] "[a-zA-Z0-9]*cite[a-zA-Z0-9]*"
##
## $square_brack
## [1] "(\\[([^]]+)\\]){0,2}\\**"
##
## $bibkeys
## [1] "\\{([a-zA-Z0-9 ,]+)\\}"
I also included a simple way to test the main and sub-expressions for perl validity as follows:
test(pat2)
## $regex
## [1] TRUE
##
## $subexpressions
## backslash cite_command square_brack bibkeys
## TRUE TRUE TRUE TRUE
Karl Broman's post https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this, but I am interested in the regex logic (i.e., it was a self-challenge that I failed to meet).
Below I give a minimal example and the desired output. The problem with my attempt is two-fold:
1. The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (e.g., I want "like" to be available for "like toast" after it has already been consumed in "I like".)
2. I couldn't make the space between words non-captured (notice the trailing whitespace in my output even though I used (?:\\s*)). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I don't consider underscores and numbers word parts, though I do consider ' a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
x,
pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
Here's one way using base R regex. It can easily be extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern)).
x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)
# [[1]]
# [1] "I like" "like toast" "toast and" "and jam"
Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.
install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like" "like_toast" "toast_and" "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
No painful regexes required!
Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position at the end of the previous match. \K resets the starting point of the reported match, so any previously consumed characters are no longer included, which is similar to a look-behind.
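A tiny illustration of what \K does on its own (a sketch with a made-up string):
# only "bar" is part of the reported match, so "foo" is left untouched
sub("foo\\Kbar", "BAZ", "foobar", perl = TRUE)
# [1] "fooBAZ"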
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.
I have the following string: "PRODUCT colgate good but not goodOKAY"
I want to extract all the words between PRODUCT and OKAY.
This can be done with sub:
s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)
giving:
[1] "colgate good but not good"
No packages are needed.
x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")
(?<=PRODUCT) -- look behind the match for PRODUCT.
.* -- match everything except newlines.
(?=OKAY) -- look ahead to match OKAY.
I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, detecting, etc., the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.
(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
You can use gsub:
vec <- "PRODUCT colgate good but not goodOKAY"
gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"
You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:
x <- "PRODUCT colgate good but not goodOKAY"
library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)
## [[1]]
## [1] "colgate good but not good"
You could use the package unglue :
library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"
I have a series of expressions such as:
"<i>the text I need to extract</i></b></a></div>"
I need to extract the text between the <i> and </i> "symbols". This is, the result should be:
"the text I need to extract"
At the moment I am using gsub in R to manually remove all the symbols that are not text. However, I would like to use a regular expression to do the job. Does anyone know a regular expression to extract the text between <i> and </i>?
Thanks.
If there is only one <i>...</i>, as in the example, then match everything up to <i> and everything from </i> onward, and replace both with the empty string:
x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)
giving:
[1] "the text I need to extract"
If there could be multiple occurrences in the same string then try:
library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)
giving the same in this example.
This approach uses a package I maintain, qdapRegex, which isn't regex-based but may be of use to you or future searchers. The function rm_between allows the user to extract text between a left and right bound and optionally include them. This approach is easy in that you don't have to think of a specific regex, just the exact left and right boundaries:
library(qdapRegex)
x <- "<i>the text I need to extract</i></b></a></div>"
rm_between(x, "<i>", "</i>", extract=TRUE)
## [[1]]
## [1] "the text I need to extract"
I would point out that it may be more reliable to use an html parser for this job.
If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:
library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"
On an entire HTML document, you can use:
doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)
You can use the following approach with gregexpr and regmatches if you don't know the number of matches in a string.
vec <- c("<i>the text I need to extract</i></b></a></div>",
"abc <i>another text</i> def <i>and another text</i> ghi")
regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
# [[1]]
# [1] "the text I need to extract"
#
# [[2]]
# [1] "another text" "and another text"
<i>((?:(?!<\/i>).)*)<\/i>
This should do it for you.
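In R the forward slashes don't need escaping; here is a minimal sketch of using that pattern (perl = TRUE is needed for the look-ahead, and the sample string is taken from the question):
x <- "<i>the text I need to extract</i></b></a></div>"
m <- regmatches(x, gregexpr("<i>((?:(?!</i>).)*)</i>", x, perl = TRUE))[[1]]
gsub("</?i>", "", m)   # strip the surrounding tags to keep only the inner text
## [1] "the text I need to extract"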
Say I have the following vector
x <- c('One', 'TWO', 'THREE / FOUR')
I want to convert TWO and THREE / FOUR to Two and Three / Four, respectively. I've taken a look into casefold() and the whole chartr() help page but couldn't figure this out.
In my real problem, I have a vector of 1500 strings in which I intend to detect entries written in all caps (I know many of them include a slash just like the one in the example above) and convert them to start case.
One thing I can do is run grepl('^[A-Z]+$', x) (as suggested by tenub), but it doesn't detect the THREE / FOUR as being all caps (it yields [1] FALSE TRUE FALSE). From what I've seen, just the presence of a space is enough to have this return FALSE.
Removing the anchor grepl('[A-Z]+$', x) (as suggested by TheGreatCO) works for the example above, but fails in the next:
y <- "Imposto Territorial Rural - ITR"
grepl('[A-Z]+', y)
[1] TRUE
Moreover, elements containing accents are always left out, no matter what I try:
z <- c('Á')
grepl('[A-Z]+', z)
[1] FALSE
Part of this is a demo example in the gsubfn package. You can run it after installing the package with demo("gsubfn-lower", package = "gsubfn").
x <- c('One', 'TWO', 'THREE / FOUR', 'ÁÁÁ')
library(gsubfn)
## find indices of vector where there are no lowercase letters
## (therefore all letters must be uppercase)
idx <- grep("[[:lower:]]", x, invert = TRUE)
## in these indices, run tolower on characters
## that do not follow a word boundary \\B
x[idx] <- gsubfn("\\B.", tolower, x[idx], perl = TRUE)
# [1] "One" "Two" "Three / Four" "Ááá"
Both \B and [:lower:] are locale-dependent; check yours with Sys.getlocale("LC_CTYPE"). Mine is "English_United States.1252". Your mileage may vary.
I don't know R very well, but I'm basing this answer on the documentation for gsub and R's regular expression support.
Note that the replacement argument of gsub is a plain string, not R code, so you can't call tolower() or paste() inside it. With perl = TRUE, however, the \L escape lowercases whatever follows it:
gsub("([A-Z])([[:alpha:]]*)", "\\1\\L\\2", x, perl = TRUE)
## with the x from the question: "One" "Two" "Three / Four"