Regular expression for condition until next space - regex

How would I write a regular expression that grabs a capital letter located anywhere then any subsequent character until a space?
Input:
cake pietypeAPPLE CRUMBLE tart toastTexas price
For example, I would want to grab "APPLE" despite it not being preceded by a space. I want "CRUMBLE". I also want "Texas" even though not all of its components are upper case.
I will use gsub(pattern, replacement = "", x = string) to get the following output
Output:
cake pietype tart toast price
Thanks!

You can use regmatches to extract these substrings.
> x <- 'cake pietypeAPPLE CRUMBLE tart toastTexas price'
> regmatches(x, gregexpr('[A-Z]\\S+', x))[[1]]
# [1] "APPLE" "CRUMBLE" "Texas"
Alternatively, if you want to be strict on matching letter characters only.
> regmatches(x, gregexpr('[A-Z][A-Za-z]+', x))[[1]]
If you want to replace them, I would use the following to avoid excess space left in between the words.
> gsub('[A-Z][A-Za-z]+( [A-Z][A-Za-z]+)*', '', x)
# [1] "cake pietype tart toast price"

Here's an approach using the qdapRegex package:
x <- 'cake pietypeAPPLE CRUMBLE tart toastTexas price'
library(qdapRegex)
rm_default(x, pattern="[A-Z][A-Za-z]*")
## [1] "cake pietype tart toast price"
If you want to extract these terms:
rm_default(x, pattern="[A-Z][A-Za-z]*", extract=TRUE)
## [[1]]
## [1] "APPLE" "CRUMBLE" "Texas"

Related

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to erplace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this.See demo.Replace by \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parenthesis and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdpRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"

Can I use an OR statement to indicate the pattern in stringr's str_extract_all function?

I'm looking at a number of cells in a data frame and am trying to extract any one of several sequences of characters; there's only one of these sequences per per cell.
Here's what I mean:
dF$newColumn = str_extract_all(string = "dF$column1", pattern ="sequence_1|sequence_2")
Am I screwing the syntax up here? Can I pull this sort of thing with stringr? Please rectify my ignorance!
Yes, you can use | since it denotes logical or in regex. Here's an example:
vec <- c("abc text", "text abc", "def text", "text def text")
library(stringr)
str_extract_all(string = vec, pattern = "abc|def")
The result:
[[1]]
[1] "abc"
[[2]]
[1] "abc"
[[3]]
[1] "def"
[[4]]
[1] "def"
However, in your command, you should replace "dF$column1" with dF$column1 (without quotes).

Split on first/nth occurrence of delimiter

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcomed for completeness). I want to split on n occurrences of a delimiter.
Here is some data:
x <- "I like_to see_how_too"
pat <- "_"
Desired outcome
Say I want to split on first occurrence of _:
[1] "I like" "to see_how_too"
Say I want to split on second occurrence of _:
[1] "I like_to see" "how_too"
Ideally, if the solution is a regex one liner generalizable to nth occurrence; the solution will use strsplit with a single regex.
Here's a solution that doesn't fit my parameters of single regex that works with strsplit
x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
Here is another solution using the gsubfn package and some regex-fu. To change the nth occurrence of the delimiter, you can simply swap the number that is placed inside of the range quantifier — {n}.
library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like" "to see_how_too"
If you would like the nth occurrence to be user defined, you could use the following:
n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too"
Non-Solution
Since R is using PCRE, you can use \K to remove everything that matches the pattern before \K from the main match result.
Below is the regex to split the string at the 3rd _
^[^_]*(?:_[^_]*){2}\K_
If you want to split at the nth occurrence of _, just change 2 to (n - 1).
Demo on regex101
That was the plan. However, strsplit seems to think differently.
Actual execution
Demo on ideone.com
x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
It still fails to work on a stronger assertion \A
strsplit(x, "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
Explanation?
This behavior hints at the fact that strsplit find the first match, do a substring to extract the first token and the remainder part, and find the next match in the remainder part.
This removes all the states from the previous matches, and leaves us with a clean state when it tries to match the regex on the remainder. This makes the task of stopping the strsplit function at first match and achieving the task at the same time impossible. There is not even a parameter in strsplit to limit the number of splits.
Rather than split you do match to get your split strings.
Try this regex:
^((?:[^_]*_){1}[^_]*)_(.*)$
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
RegEx Demo
Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:
^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice alternative of restriction that you cannot have a variable length lookbehind in above regex.
RegEx Demo2
x <- "I like_to see_how_too"
strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## > strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how" "too"
## > strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too"
This uses gsubfn to to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.
It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:
library(gsubfn)
k <- c(2, 4) # split at 2nd and 4th _
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")
giving:
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"
If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
See section 4 of the gusbfn vignette for more info on using gusbfn with proto objects to retain state between matches.

Using regexes in grep function in R

Could anyone maybe know how to extract x and y from this character: "x and y" using grep function (not using stringi package) if x and y are random characters?
I am so not skilled in regular expressions.
Thanks for any response.
The regex here matches any chars "and" chars and then extracts them with regmatches:
txt <- c("x and y", "a and b", " C and d", "qq and rr")
matches <- regexec("([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)", txt)
regmatches(txt, matches)[[1]][2:3]
## [1] "x" "y"
regmatches(txt, matches)[[2]][2:3]
## [1] "a" "b"
regmatches(txt, matches)[[3]][2:3]
## [1] "C" "d"
regmatches(txt, matches)[[4]][2:3]
## [1] "qq" "rr"
([[:alpha:]]+) matches one or more alpha characters and places it in a match group. [[:blank:]]+ matches one or more "whitespace" characters. There are less verbose ways to write these regexes but the expanded ones (to me) help make it easier to grok if there will be folks reading the code that aren't familiar with regexes.
I also didn't need to call regmatches 4x, but it was faster to cut/paste for a toy example.
As #MrFlick commented, grep is not the right function to extract these substrings.
You can use regmatches and do something like this:
> x <- c('x and y', 'abc and def', 'foo and bar')
> regmatches(x, gregexpr('and(*SKIP)(*F)|\\w+', x, perl=T))
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"
Or if " and " is always constant, then use strsplit as suggested in the comments.
> x <- c('x and y', 'abc and def', 'foo and bar')
> strsplit(x, ' and ', fixed=T)
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"

Why does strsplit use positive lookahead and lookbehind assertion matches differently?

Common sense and a sanity-check using gregexpr() indicate that the look-behind and look-ahead assertions below should each match at exactly one location in testString:
testString <- "text XX text"
BB <- "(?<= XX )"
FF <- "(?= XX )"
as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
# [1] 5
strsplit(), however, uses those match locations differently, splitting testString at one location when using the lookbehind assertion, but at two locations -- the second of which seems incorrect -- when using the lookahead assertion.
strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"
strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text" " " "XX text"
I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit() to be better behaved?
Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).
I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
Note that this means that if there is a match at the beginning of
a (non-empty) string, the first element of the output is ‘""’, but
if there is a match at the end of the string, the output is the
same as with the match removed.
The problem is that lookahead (and lookbehind) assertions are zero-length. So for example in this case:
FF <- "(?=funky)"
testString <- "take me to funky town"
gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
What happens is that the lonely lookahead (?=funky) matches at position 12. So the first split includes the string up to position 11 (left of the match), and it is removed from the string, together with the match, which -however- has zero length.
Now the remaining string is funky town, and the lookahead matches at position 1. However there's nothing to remove, because there's nothing at the left of the match, and the match itself has zero length. So the algorithm is stuck in an infinite loop. Apparently R resolves this by splitting a single character, which incidentally is the documented behaviour when strspliting with an empty regex (when argument split=""). After this the remaining string is unky town, which is returned as the last split since there's no match.
Lookbehinds are no problem, because each match is split and removed from the remaining string, so the algorithm is never stuck.
Admittedly this behaviour looks weird at first glance. Behaving otherwise however would violate the assumption of zero length for lookaheads. Given that the strsplit algorithm is documented, I belive this does not meet the definition of a bug.
Based on Theodore Lytras' careful explication of substr()'s behavior, a reasonably clean workaround is to prefix the to-be-matched lookahead assertion with a positive lookbehind assertion that matches any single character:
testString <- "take me to funky town"
FF2 <- "(?<=.)(?=funky)"
strsplit(testString, FF2, perl=TRUE)
# [[1]]
# [1] "take me to " "funky town"
Looks like a bug to me. This doesn't appear to just be related to spaces, specifically, but rather any lonely lookahead (positive or negative):
FF <- "(?=funky)"
testString <- "take me to funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
FF <- "(?=funky)"
testString <- "funky take me to funky funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "f" "unky take me to " "f" "unky "
# [5] "f" "unky town"
FF <- "(?!y)"
testString <- "xxxyxxxxxxx"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "xxx" "y" "xxxxxxx"
Seems to work fine if given something to capture along with the zero-width assertion, such as:
FF <- " (?=XX )"
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
FF <- "(?= XX ) "
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
Perhaps something like that might function as a workaround.