R split on delimiter (split) keep the delimiter (split) - regex

In R you can use the strsplit function to split a vector on a delimiter(split) as follows:
x <- "What is this? It's an onion. What! That's| Well Crazy."
unlist(strsplit(x, "[\\?\\.\\!\\|]", perl=TRUE))
## [1] "What is this" " It's an onion" " What" " That's"
## [5] " Well Crazy"
I'd like to keep the delimiter(split) using R. So the desired output would be:
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."

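You can use a lookbehind assertion, "(?<=DELIMITERS)", which matches the empty position just after a delimiter (this requires perl=TRUE):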
unlist(strsplit(x, "(?<=[?.!|])", perl=TRUE))
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
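As an alternative sketch (not part of the answer above), the pieces can also be extracted rather than split, by matching a run of non-delimiter characters followed by an optional delimiter. The pattern and the names `pieces` and `trimmed` are illustrative assumptions; trimws() is only needed if the leading spaces are unwanted.

```r
x <- "What is this? It's an onion. What! That's| Well Crazy."

# Match "one or more non-delimiters, then an optional delimiter".
# Inside a bracket expression ?, ., ! and | are literal characters.
pieces <- regmatches(x, gregexpr("[^?.!|]+[?.!|]?", x))[[1]]
print(pieces)

# Optionally drop the leading space each piece inherits from the original text.
trimmed <- trimws(pieces)
print(trimmed)
```

Extracting avoids relying on how strsplit handles zero-length matches, at the cost of a slightly longer pattern.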

Related

R: How to replace space (' ') in string with a *single* backslash and space ('\ ')

I've searched many times, and haven't found the answer here or elsewhere. I want to replace each space ' ' in variables containing file names with a '\ '. (A use case could be for shell commands, with the spaces escaped, so each file name doesn't appear as a list of arguments.) I have looked through the StackOverflow question "how to replace single backslash in R", and find that many combinations do work as advertised:
> gsub(" ", "\\\\", "a b")
[1] "a\\b"
> gsub(" ", "\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
but try these with a single-slash version, and R ignores it:
> gsub(" ", "\\ ", "a b")
[1] "a b"
> gsub(" ", "\ ", "a b", fixed = TRUE)
[1] "a b"
For the opposite direction, removing slashes from a string, it works with two:
> gsub("\\\\", " ", "a\\b")
[1] "a b"
> gsub("\\", " ", "a\\b", fixed = TRUE)
[1] "a b"
However, for single slashes some inner perversity in R prevents me from even attempting to remove them:
> gsub("\\", " ", "a\\b")
Error in gsub("\\", " ", "a\\b") :
invalid regular expression '\', reason 'Trailing backslash'
> gsub("\", " ", "a\b", fixed = TRUE)
Error: unexpected string constant in "gsub("\", " ", ""
The 'invalid regular expression' is telling us something, but I don't see what. (Note too that the perl = TRUE option does not help.)
Even with three back slashes R fails to notice even one:
> gsub(" ", "\\\ ", "a b")
[1] "a b"
The pattern extends too! Even multiples of two work:
> gsub(" ", "\\\\\\\\", "a b")
[1] "a\\\\b"
but not odd multiples (should get '\\\ '):
> gsub(" ", "\\\\\\ ", "a b")
[1] "a\\ b"
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
(I would expect 3 slashes, not two.)
My two questions are:
How can my goal of replacing a ' ' with a '\ ' be accomplished?
Why did the odd number-slash variants of the replacements fail, while the even number-slash replacements worked?
For shell commands a simple work-around is to quote the file names, but part of my interest is just wanting to understand what is going on with R's regex engine.
Get ready for a face-palm, because this:
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
is actually working.
The two backslashes you see are just the R console's way of displaying a single backslash, which is escaped when printed to the screen.
To confirm the replacement with a single backslash is indeed working, try writing the output to a text file and inspect yourself:
f <- file("C:\\output.txt")
writeLines(gsub(" ", "\\", "a b", fixed = TRUE), f)
close(f)
In output.txt you should see the following:
a\b
Very helpful discussion! (I've been Googling the heck out of this for 2 days.)
Another way to see the difference (rather than writing to a file) is to compare the contents of the string using print and cat.
> z <- gsub(" ", "\\ ", "a b", fixed = TRUE)
> print(z)
[1] "a\\ b"
> cat(z)
a\ b
So, by using cat instead of print we can confirm that the gsub line is doing what was intended when we're trying to add single backslashes to a string.
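To tie the thread together, here is a minimal sketch of both working variants. It assumes the usual explanation: R's parser consumes one level of backslashes in a string literal, and in regex mode gsub also treats a backslash in the replacement as an escape character, so a literal backslash needs four in the source. The names `a` and `b` and the file name are illustrative only.

```r
# Regex mode: the replacement needs an escaped backslash, i.e. four in the R literal.
a <- gsub(" ", "\\\\ ", "a b")
# Fixed mode: the replacement is taken literally, so two in the R literal suffice.
b <- gsub(" ", "\\ ", "a b", fixed = TRUE)

identical(a, b)  # both variables hold the same string
nchar(a)         # counts the actual characters: a, backslash, space, b
print(a)         # print() shows the escaped form
cat(a, "\n")     # cat() shows the raw characters
writeLines(a)    # so does writeLines()

# For shell use, quoting is usually simpler than escaping each space.
shQuote("my file.txt", type = "sh")
```

nchar() is a quick check that does not depend on how the console displays the string.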

replacing values in selected columns of a dataframe using RegExp

Assume I have a dataframe
mydata <- c("10 stack"," 10 stack and x" , "10 stack / dd" ," 10 stackxx")
R>mydata
[1] " 10 stack"
[2] " 10 stack and x"
[3] " 10 stack / dd"
[4] " 10 stackxx"
What I want to do is replace any word beginning with 10 stack[anything] with another word, without removing the rest of the string. I also want to replace the slash with "and" or a comma. The desired output is:
[1] " new"
[2] " new and x"
[3] " new and dd"
[4] " new"
My code is:
mydata[mydata == "10 stack"] <- "new" # I can replace one type this way, but I need a faster, more general operation.
mydata[mydata == "///"] <- "and" # for replacing the slash with "and"
I found another method that can solve the problem:
mydata <- as.data.frame(sapply(mydata, gsub, pattern = "/", replacement = ","))
Try
library(stringi)
stri_replace_all_regex(mydata, c("10 stack", "\\/"), c("new", "and"), vectorize_all=FALSE)
Which gives:
#[1] "new" " new and x" "new and dd" " newxx"
As mentioned by #rock321987 in the comments, if you want to replace 10 stack[anything], you could use the pattern \\b10 stack[^\\s]* instead:
stri_replace_all_regex(mydata, c("\\b10 stack[^\\s]*", "\\/"), c("new", "and"),
vectorize_all=FALSE)
Which gives:
#[1] "new" " new and x" "new and dd" " new"
You need to use the sub() function, which matches a pattern and substitutes the replacement for the first match in each string.
sub("10 stack", " new", mydata)
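For readers without stringi, a base R sketch of the same two-step replacement might look like the following. It assumes the input vector exactly as defined in the question; the names `step1` and `result` are illustrative.

```r
mydata <- c("10 stack", " 10 stack and x", "10 stack / dd", " 10 stackxx")

# Step 1: replace "10 stack" plus any trailing non-space characters with "new".
step1 <- gsub("\\b10 stack\\S*", "new", mydata, perl = TRUE)
# Step 2: replace the forward slash with "and" (fixed = TRUE: no regex needed).
result <- gsub("/", "and", step1, fixed = TRUE)
print(result)
```

Chaining two gsub calls keeps each pattern simple, which tends to be easier to debug than one combined expression.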

Why does strsplit return a list

Consider
text <- "who let the dogs out"
fooo <- strsplit(text, " ")
fooo
[[1]]
[1] "who" "let" "the" "dogs" "out"
The output of strsplit is a list. The list's first element is then a vector that contains the words above.
Why does the function behave that way? Is there any case in which it would return a list with more than one element?
I can access the words using
fooo[[1]][1]
[1] "who"
but is there no simpler way?
To your first question, one reason that comes to mind is so that it can keep different length result vectors in the same object, since it is vectorized over x:
text <- "who let the dogs out"
vtext <- c(text, "who let the")
##
> strsplit(text, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
> strsplit(vtext, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
[[2]]
[1] "who" "let" "the"
If this were to be returned as a data.frame, matrix, etc... instead of a list, it would have to be padded with additional elements.
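On the second question, a short sketch of common ways to work with the list result. The variable names here are illustrative, not from the original answer.

```r
vtext <- c("who let the dogs out", "who let the")
words <- strsplit(vtext, " ")

# Number of words per input string.
word_counts <- lengths(words)
print(word_counts)

# First word of each element, returned as a plain character vector.
first_words <- vapply(words, `[`, character(1), 1)
print(first_words)

# For a single string, unlist() (or [[1]]) gives a plain character vector.
single <- unlist(strsplit("who let the dogs out", " "))
single[1]
```

When only one string is split, unlisting once up front avoids the repeated `[[1]]` indexing.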

Why does gsub not work as expected

I have this string:
c <- "thethirsty thirsty itthirsty (thirsty) is"
I want the output to be as
"thethirsty thirsty itthirsty no is"
This is what I am trying.
gsub(" (thirsty) ", " no ", c)
This is what I am getting. Why doesn't it work? Please suggest an alternative.
"thethirsty no itthirsty (thirsty) is"
By default gsub interprets the first parameter as a regular expression. You don't want that and should set fixed=TRUE:
gsub(" (thirsty) ", " no ", c, fixed=TRUE)
#[1] "thethirsty thirsty itthirsty no is"
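To make the cause concrete: in a regular expression, unescaped parentheses form a group rather than matching literal parentheses, so the original pattern effectively searched for " thirsty ". The sketch below compares the three variants; the variable names are illustrative, and `s` is used instead of `c` to avoid masking the base function.

```r
s <- "thethirsty thirsty itthirsty (thirsty) is"

# Unescaped parentheses form a group, so the pattern is really " thirsty ".
as_regex <- gsub(" (thirsty) ", " no ", s)
# Escaping the parentheses makes them literal in regex mode.
escaped <- gsub(" \\(thirsty\\) ", " no ", s)
# fixed = TRUE skips regex interpretation altogether.
literal <- gsub(" (thirsty) ", " no ", s, fixed = TRUE)

print(as_regex)
print(escaped)
print(literal)
```

Escaping is the option to reach for when the rest of the pattern still needs regex features.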

split string with regex

I'm looking to split a string of a generic form, where the square brackets denote the "sections" of the string. Ex:
x <- "[a] + [bc] + 1"
And return a character vector that looks like:
"[a]" " + " "[bc]" " + 1"
EDIT: Ended up using this:
x <- "[a] + [bc] + 1"
x <- gsub("\\[",",[",x)
x <- gsub("\\]","],",x)
strsplit(x,",")
I've seen TylerRinker's code and suspect it may be clearer than this, but this may serve as a way to learn a different set of functions. (I liked his better before I noticed that it split on spaces.) I tried adapting this to work with strsplit, but that function always removes the separators.
Maybe this could be adapted to make a newstrsplit that splits at the separators but leaves them in? Probably need to not split at first or last position and distinguish between opening and closing separators.
scan(text= # use scan to separate after insertion of commas
gsub("\\]", "],", # put commas in after "]"'s
gsub(".\\[", ",[", x)) , # add commas before "[" unless at first position
what="", sep=",") # tell scan this character argument and separators are ","
#Read 4 items
#[1] "[a]" " +" "[bc]" " + 1"
This is one lazy approach:
FUN <- function(x) {
all <- unlist(strsplit(x, "\\s+"))
last <- paste(c(" ", tail(all, 2)), collapse="")
c(head(all, -2), last)
}
x <- "[a] + [bc] + 1"
FUN(x)
## [1] "[a]" "+" "[bc]" " +1"
You can compute the split points manually and use substring:
split.pos <- gregexpr('\\[.*?]',x)[[1]]
split.length <- attr(split.pos, "match.length")
split.start <- sort(c(split.pos, split.pos+split.length))
split.end <- c(split.start[-1]-1, nchar(x))
substring(x,split.start,split.end)
# [1] "[a]" " + " "[bc]" " + 1"
And here's a version that splits on the brackets AND keeps them in the result, using positive lookahead and lookbehind:
splitme <- function(x) {
x <- unlist(strsplit(x, "(?=\\[)", perl=TRUE))
x <- unlist(strsplit(x, "(?<=\\])", perl=TRUE))
for (i in which(x=="[")) {
x[i+1] <- paste(x[i], x[i+1], sep="")
}
x[-which(x=="[")]
}
splitme(x)
#[1] "[a]" " + " "[bc]" " + 1"
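One more sketch, offered as an alternative to the answers above rather than a replacement for them: instead of splitting, extract the tokens directly, where a token is either a bracketed section or a run of characters containing no opening bracket. The pattern and the name `tokens` are assumptions for illustration.

```r
x <- "[a] + [bc] + 1"

# Either a bracketed section "[...]" or a run of characters containing no "[".
tokens <- regmatches(x, gregexpr("\\[[^\\]]*\\]|[^\\[]+", x, perl = TRUE))[[1]]
print(tokens)
```

This sketch does not handle nested brackets; it assumes each "[" is closed by the next "]".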