Split string by final space in R - regex

I have a vector of strings, each containing a number of spaces. I would like to split this into two vectors, split at the final space. For example:
vec <- c('This is one', 'And another', 'And one more again')
Should become
vec1 = c('This is', 'And', 'And one more')
vec2 = c('one', 'another', 'again')
Is there a quick and easy way to do this? I have done similar things before using gsub and regex, and have managed to get the second vector using the following
vec2 <- gsub(".* ", "", vec)
But can't work out how to get vec1.
Thanks in advance

Here is one way using a lookahead assertion:
do.call(rbind, strsplit(vec, ' (?=[^ ]+$)', perl=TRUE))
#      [,1]           [,2]
# [1,] "This is"      "one"
# [2,] "And"          "another"
# [3,] "And one more" "again"

Related

Can I use an OR statement to indicate the pattern in stringr's str_extract_all function?

I'm looking at a number of cells in a data frame and am trying to extract any one of several sequences of characters; there's only one of these sequences per cell.
Here's what I mean:
dF$newColumn = str_extract_all(string = "dF$column1", pattern ="sequence_1|sequence_2")
Am I screwing the syntax up here? Can I pull this sort of thing with stringr? Please rectify my ignorance!
Yes, you can use | since it denotes logical or in regex. Here's an example:
vec <- c("abc text", "text abc", "def text", "text def text")
library(stringr)
str_extract_all(string = vec, pattern = "abc|def")
The result:
[[1]]
[1] "abc"
[[2]]
[1] "abc"
[[3]]
[1] "def"
[[4]]
[1] "def"
However, in your command, you should replace "dF$column1" with dF$column1 (without quotes).
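For instance, here is a minimal sketch with a made-up dF (the data frame and its contents are assumptions for illustration only); and since there is only one sequence per cell, str_extract also works and returns a plain character vector:
library(stringr)
# hypothetical example data
dF <- data.frame(column1 = c("foo sequence_1 bar", "baz sequence_2"), stringsAsFactors = FALSE)
dF$newColumn <- str_extract(dF$column1, pattern = "sequence_1|sequence_2")
dF$newColumn
# [1] "sequence_1" "sequence_2"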

Extracting values from a string in R using regex

I'm trying to extract the first and second numbers of this string and store them in separate variables.
(User20,10.25)
I can't figure out how to get the user number and then his value.
What I have managed to do so far is this, but I don't know how to remove the rest of the string and get only the number.
gsub("\\(User", "", string)
Try
str1 <- '(User20,10.25)'
scan(text=gsub('[^0-9.-]+', ' ', str1),quiet=TRUE)
#[1] 20.00 10.25
In case the string is
str2 <- '(User20-ht,-10.25)'
scan(text=gsub('-(?=[^0-9])|[^0-9.-]+', " ", str2, perl=TRUE), quiet=TRUE)
#[1] 20.00 -10.25
Or
library(stringr)
str_extract_all(str1, '[0-9.-]+')[[1]]
#[1] "20" "10.25"
Or using stringi
library(stringi)
stri_extract_all_regex(str1, '[0-9.-]+')[[1]]
#[1] "20" "10.25"
Tyler Rinker's "qdapRegex" package has some functions that are useful for this kind of stuff.
In this case, you would most likely be interested in rm_number:
library(qdapRegex)
x <- '(User20,10.25)'
rm_number(x, extract = TRUE)
# [[1]]
# [1] "20" "10.25"
You can use strsplit with sub ...
> x <- '(User20,10.25)'
> sub('\\(User|\\)', '', strsplit(x, ',')[[1]])
[1] "20" "10.25"
It would probably be easier to match the context that you want instead.
> regmatches(x, gregexpr('[0-9.]+', x))[[1]]
[1] "20" "10.25"
The following is one approach:
[^,\)\([A-Z]]

Split on first/nth occurrence of delimiter

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcomed for completeness). I want to split on n occurrences of a delimiter.
Here is some data:
x <- "I like_to see_how_too"
pat <- "_"
Desired outcome
Say I want to split on first occurrence of _:
[1] "I like" "to see_how_too"
Say I want to split on second occurrence of _:
[1] "I like_to see" "how_too"
Ideally, the solution would be a regex one-liner generalizable to the nth occurrence, used with strsplit and a single regex.
Here's a solution that doesn't fit my requirement of a single regex that works with strsplit:
x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
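For reference, here is a sketch that wraps the substr approach into a small helper generalized to n (the function name is just illustrative):
split_at_nth <- function(x, n, delim = "_") {
  # position of the nth delimiter, then cut around it
  loc <- gregexpr(delim, x, fixed = TRUE)[[1]][n]
  c(substr(x, 1, loc - 1), substr(x, loc + 1, nchar(x)))
}
split_at_nth("I like_to see_how_too", 2)
# [1] "I like_to see" "how_too"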
Here is another solution using the gsubfn package and some regex-fu. To change the nth occurrence of the delimiter, you can simply swap the number that is placed inside of the range quantifier — {n}.
library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like" "to see_how_too"
If you would like the nth occurrence to be user defined, you could use the following:
n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too"
Non-Solution
Since R supports PCRE (with perl=TRUE), you can use \K to remove everything that matches the pattern before \K from the main match result.
Below is the regex to split the string at the 3rd _
^[^_]*(?:_[^_]*){2}\K_
If you want to split at the nth occurrence of _, just change 2 to (n - 1).
Demo on regex101
That was the plan. However, strsplit seems to think differently.
Actual execution
Demo on ideone.com
x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
It still fails to work with the stronger assertion \A:
strsplit(x, "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
Explanation?
This behavior suggests that strsplit finds the first match, takes a substring to extract the first token and the remainder, and then looks for the next match within the remainder.
That discards all state from the previous matches and leaves a clean slate when the regex is matched against the remainder, which makes it impossible to stop strsplit at the first match while still achieving the task. There is not even a parameter in strsplit to limit the number of splits.
Rather than splitting, you can match to get your split strings.
Try this regex:
^((?:[^_]*_){1}[^_]*)_(.*)$
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
RegEx Demo
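For example, a minimal sketch applying this pattern in R with regexec and regmatches (here n = 2, so the quantifier is {1}):
x <- "I like_to see_how_too"
m <- regexec("^((?:[^_]*_){1}[^_]*)_(.*)$", x)
regmatches(x, m)[[1]][2:3]
# [1] "I like_to see" "how_too"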
Update: It seems R also supports PCRE (via perl=TRUE), and in that case you can do the split as well using this PCRE regex:
^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice alternative to the restriction that you cannot have a variable-length lookbehind in the above regex.
RegEx Demo2
x <- "I like_to see_how_too"
strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## > strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how" "too"
## > strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too"
This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.
It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:
library(gsubfn)
k <- c(2, 4) # split at 2nd and 4th _
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")
giving:
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"
If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.
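Here is a minimal sketch of that "\01" variant, under the assumption that the byte \01 never occurs in the input:
library(gsubfn)
k <- c(2, 4)   # split at 2nd and 4th _
sep <- "\01"   # sentinel assumed to be absent from the data
p <- proto(fun = function(., x) if (count %in% k) sep else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), sep, fixed = TRUE)
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"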

Using regexes in grep function in R

Does anyone know how to extract x and y from the character string "x and y" using the grep function (not the stringi package), if x and y are arbitrary characters?
I am so not skilled in regular expressions.
Thanks for any response.
The regex here matches alphabetic characters, then "and", then more alphabetic characters, and extracts the pieces with regmatches:
txt <- c("x and y", "a and b", " C and d", "qq and rr")
matches <- regexec("([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)", txt)
regmatches(txt, matches)[[1]][2:3]
## [1] "x" "y"
regmatches(txt, matches)[[2]][2:3]
## [1] "a" "b"
regmatches(txt, matches)[[3]][2:3]
## [1] "C" "d"
regmatches(txt, matches)[[4]][2:3]
## [1] "qq" "rr"
([[:alpha:]]+) matches one or more alphabetic characters and places them in a capture group. [[:blank:]]+ matches one or more whitespace characters. There are less verbose ways to write these regexes, but the expanded forms (to me) are easier to grok for readers who aren't familiar with regexes.
I also didn't need to call regmatches 4x, but it was faster to cut/paste for a toy example.
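For completeness, a sketch of doing the extraction in a single call instead of four, reusing txt and matches from above:
sapply(regmatches(txt, matches), `[`, 2:3)
##      [,1] [,2] [,3] [,4]
## [1,] "x"  "a"  "C"  "qq"
## [2,] "y"  "b"  "d"  "rr"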
As #MrFlick commented, grep is not the right function to extract these substrings.
You can use regmatches and do something like this:
> x <- c('x and y', 'abc and def', 'foo and bar')
> regmatches(x, gregexpr('and(*SKIP)(*F)|\\w+', x, perl=T))
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"
Or if " and " is always constant, then use strsplit as suggested in the comments.
> x <- c('x and y', 'abc and def', 'foo and bar')
> strsplit(x, ' and ', fixed=T)
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"

R regex just not quite right

I'm having a problem with regex in R. Maybe I've just been looking at it too long. I've got strings of the form
'thing1 - thing2'
'thingA - thingB'
where the first is separated from the second by a space, a dash, and another space. The first thing is a combination of letters, digits, slashes, and periods; the second can be the same, or not exist (in which case there is also no separating dash). I want to use regmatches with gregexpr to find patterns matching the first and second parts. That's something like
regmatches(
  'thing1 - thing2',
  gregexpr('^(\\w|\\s|\\.|/)+(\\s-\\s){0,1}', 'thing1 - thing2', perl=T)
)
Fine and well. But sometimes thing1 is tricky, with a dash with no spaces (eg 10-43), or it can be the exact string Blue - MC, which obviously messes up the "separate by \\s-\\s" rule. And I just can't seem to get the regex right! I tried
regmatches(
  c('10-43', 'Blue - MC'),
  gregexpr(
    '^\\w(\\w|\\s|\\.|/\\S-\\S)+\\s-\\s{0,1}|^Blue\\s-\\sMC',
    c('10-43', 'Blue - MC'), perl=T
  )
)
and I get c('10', 'Blue'). Help? Thanks!
I know you said you want to use gregexpr and regmatches, but why not strsplit since all you're doing is splitting the strings that will "always be separated by a space-dash-space"?
Per your comment, you can split at space-dash-space, but keep Blue - MC by simply removing Blue - MC from the list before applying the split. Then you can add it back in afterward.
> things <- c('thing1 - thing2', 'thingA - thingB', 'thingC', 'Blue - MC')
> w <- which(things == 'Blue - MC')
> ( s <- c(strsplit(things[-w], " - ", fixed = TRUE), things[w]) )
#[[1]]
#[1] "thing1" "thing2"
#[[2]]
#[1] "thingA" "thingB"
#[[3]]
#[1] "thingC"
#[[4]]
#[1] "Blue - MC"
Then if you only want the first of each of those,
> sapply(s, "[", 1)
#[1] "thing1" "thingA" "thingC" "Blue - MC"
When I want to capture parts of a message, I like to use the regcapturedmatches.R helper function. I would use it like this
v <- c("thing1 - thing2", "thingalone","Blue-MC","1 - 2")
m <- gregexpr('^(.*?)(?:\\s-\\s(.*))?$', v, perl=T)
regmatches(v, m)
do.call(rbind, regcapturedmatches(v,m))
That returns
     [,1]         [,2]
[1,] "thing1"     "thing2"
[2,] "thingalone" ""
[3,] "Blue-MC"    ""
[4,] "1"          "2"
Which I believe satisfies your expectations.
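If you'd rather not rely on that external helper, a base-R sketch with regexec pulls out the same capture groups (same v and pattern as above):
m2 <- regexec('^(.*?)(?:\\s-\\s(.*))?$', v, perl=TRUE)
t(sapply(regmatches(v, m2), `[`, 2:3))
which should return essentially the same matrix of first and second parts.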