Why does gsub not work as expected - regex

I have this string:
c <- "thethirsty thirsty itthirsty (thirsty) is"
I want the output to be as
"thethirsty thirsty itthirsty no is"
This is what I am trying.
gsub(" (thirsty) ", " no ", c)
This is what I am getting. Why doesn't it work? Please suggest an alternative way to do this.
"thethirsty no itthirsty (thirsty) is"

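By default, gsub interprets its first argument as a regular expression, so the parentheses in " (thirsty) " form a group instead of matching literal parentheses, and the pattern actually matches " thirsty ". You don't want that here, so set fixed=TRUE: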
gsub(" (thirsty) ", " no ", c, fixed=TRUE)
#[1] "thethirsty thirsty itthirsty no is"
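As a minimal sketch of the difference, the three calls below compare regex matching, literal matching, and a regex with escaped parentheses. The variable name s is an assumption, chosen here so that base R's c() is not masked; the result names are illustrative only.

```r
s <- "thethirsty thirsty itthirsty (thirsty) is"

# Regex mode: ( ) form a group, so the pattern matches " thirsty " without parentheses.
regex_result <- gsub(" (thirsty) ", " no ", s)

# Literal matching, as in the answer above.
fixed_result <- gsub(" (thirsty) ", " no ", s, fixed = TRUE)

# Alternative: stay in regex mode and escape the parentheses.
escaped_result <- gsub(" \\(thirsty\\) ", " no ", s)

print(regex_result)                      # "thethirsty no itthirsty (thirsty) is"
print(fixed_result)                      # "thethirsty thirsty itthirsty no is"
identical(fixed_result, escaped_result)  # TRUE
```

Escaping is mainly useful when the rest of the pattern still needs regex features. For a purely literal search, fixed = TRUE is simpler.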


Google Sheets RegexpReplace with computable replacers

I'm trying to replace a pattern with some string computed with other GSheets functions. For example, I want to make all the int numbers in the string ten times larger: "I want to multiply 2 numbers in this string by 10" should turn into "I want to multiply 20 numbers in this string by 100".
Assuming, for brevity, that my string is in cell A1, I've tried the construction
REGEXREPLACE(A1, "([0-9]+)", TEXT(10*VALUE("$1"),"###"))
But it seems REGEXREPLACE first evaluates its arguments and only then applies the regular expression. So it converts the 3rd argument
TEXT(10*VALUE("$1"),"###") => TEXT(10*1,"###") => "10"
and then just replaces all integers in the string with 10.
So I need the group $1 to be substituted BEFORE the outer functions in the 3rd argument are evaluated. Is there any way to do such a thing?
Maybe there's another way. See if this works
=join(" ", ArrayFormula(if(isnumber(split(A1, " ")), split(A1, " ")*10, split(A1, " "))))
try:
=ARRAYFORMULA(JOIN(" ", IFERROR(SPLIT(A1, " ")*10, SPLIT(A1, " "))))
or:
=ARRAYFORMULA(JOIN(" ", IF(ISNUMBER(SPLIT(A1, " ")), SPLIT(A1, " ")*10, SPLIT(A1, " "))))

R: How to replace space (' ') in string with a *single* backslash and space ('\ ')

I've searched many times, and haven't found the answer here or elsewhere. I want to replace each space ' ' in variables containing file names with a '\ '. (A use case could be for shell commands, with the spaces escaped, so each file name doesn't appear as a list of arguments.) I have looked through the StackOverflow question "how to replace single backslash in R", and find that many combinations do work as advertised:
> gsub(" ", "\\\\", "a b")
[1] "a\\b"
> gsub(" ", "\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
but try these with a single-slash version, and R ignores it:
> gsub(" ", "\\ ", "a b")
[1] "a b"
> gsub(" ", "\ ", "a b", fixed = TRUE)
[1] "a b"
For the opposite direction, removing slashes from a string, it works with two:
> gsub("\\\\", " ", "a\\b")
[1] "a b"
> gsub("\\", " ", "a\\b", fixed = TRUE)
[1] "a b"
However, for single slashes some inner perversity in R prevents me from even attempting to remove them:
> gsub("\\", " ", "a\\b")
Error in gsub("\\", " ", "a\\b") :
invalid regular expression '\', reason 'Trailing backslash'
> gsub("\", " ", "a\b", fixed = TRUE)
Error: unexpected string constant in "gsub("\", " ", ""
The 'invalid regular expression' is telling us something, but I don't see what. (Note too that the perl = TRUE option does not help.)
Even with three back slashes R fails to notice even one:
> gsub(" ", "\\\ ", "a b")
[1] "a b"
The pattern extends too! Even multiples of two work:
> gsub(" ", "\\\\\\\\", "a b")
[1] "a\\\\b"
but not odd multiples (should get '\\\ '):
> gsub(" ", "\\\\\\ ", "a b")
[1] "a\\ b"
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
(I would expect 3 slashes, not two.)
My two questions are:
How can my goal of replacing a ' ' with a '\ ' be accomplished?
Why did the odd number-slash variants of the replacements fail, while the even number-slash replacements worked?
For shell commands a simple work-around is to quote the file names, but part of my interest is just wanting to understand what is going on with R's regex engine.
Get ready for a face-palm, because this:
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
is actually working.
The two backslashes you see are just the R console's way of displaying a single backslash, which is escaped when printed to the screen.
To confirm that the replacement with a single backslash is indeed working, try writing the output to a text file and inspect it yourself:
f <- file("C:\\output.txt")
writeLines(gsub(" ", "\\", "a b", fixed = TRUE), f)
close(f)
In output.txt you should see the following:
a\b
Very helpful discussion! (I've been Googling the heck out of this for 2 days.)
Another way to see the difference (rather than writing to a file) is to compare the contents of the string using print and cat.
> z <- gsub(" ", "\\ ", "a b", fixed = TRUE)
> print(z)
[1] "a\\ b"
> cat(z)
a\ b
So, by using cat instead of print we can confirm that the gsub line is doing what was intended when we're trying to add single backslashes to a string.
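A further check that does not depend on how the console displays the string is to count characters. The sketch below also shows, as an assumption about what the asker wanted, the regex-mode equivalent: without fixed = TRUE the replacement string is itself escape-processed, so the backslash has to be doubled once more. The names z_regex and z_fixed are illustrative.

```r
# Regex mode: the replacement is escape-processed by gsub,
# so a literal backslash is written as "\\\\" in R source.
z_regex <- gsub(" ", "\\\\ ", "a b")

# Fixed mode: the replacement is taken literally, so "\\ " is enough.
z_fixed <- gsub(" ", "\\ ", "a b", fixed = TRUE)

identical(z_regex, z_fixed)  # TRUE
nchar(z_fixed)               # 4: a, backslash, space, b
cat(z_fixed, "\n")           # a\ b
```

This also explains why "\\ " without fixed = TRUE appeared to do nothing: in regex mode the single backslash merely escapes the following space, so the space is replaced by a space.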

R split on delimiter (split) keep the delimiter (split)

In R you can use the strsplit function to split a vector on a delimiter(split) as follows:
x <- "What is this? It's an onion. What! That's| Well Crazy."
unlist(strsplit(x, "[\\?\\.\\!\\|]", perl=TRUE))
## [1] "What is this" " It's an onion" " What" " That's"
## [5] " Well Crazy"
I'd like to keep the delimiter(split) using R. So the desired output would be:
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
You can use "(?<=DELIMITERS)":
unlist(strsplit(x, "(?<=[?.!|])", perl=TRUE))
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
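An alternative sketch, not taken from the answer above, extracts the pieces instead of splitting: each match is a run of non-delimiter characters followed by an optional delimiter. The names by_split and by_match are illustrative, and the trimws() call is an optional extra for dropping the leading spaces.

```r
x <- "What is this? It's an onion. What! That's| Well Crazy."

# Lookbehind split, as in the answer above.
by_split <- unlist(strsplit(x, "(?<=[?.!|])", perl = TRUE))

# Extraction: a run of non-delimiters plus an optional trailing delimiter.
by_match <- regmatches(x, gregexpr("[^?.!|]+[?.!|]?", x))[[1]]

identical(by_split, by_match)  # TRUE
trimws(by_match)               # same pieces without the leading spaces
```

Inside a bracket expression the characters ?, ., ! and | are literal, which is why no backslashes are needed there.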

String rearrangement in R

I am on the lookout for two R functions that would perform the following string rearrangements:
(1) place the parts following a ", " in a string at the start of a string, e.g.
name="2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-"
should yield
"(E)-3,7-dimethyl-2,6-Octadien-1-ol"
(Note that there could be any number of ", " in a string, or none at all, and that the parts after each ", " should be placed at the start of the string successively, starting from the end of the string.) What would be the most efficient way of achieving this in R (without using loops etc.)?
(2) place the parts between "<" and ">" at the start of a string and remove any ", ".
E.g.
name="Pyrazine <2-acetyl-, 3-ethyl->"
should yield
"2-acetyl-3-ethyl-Pyrazine"
(this is a simpler gsub problem, right?)
The part between the "<" and ">" could be in any place in the string though.
E.g.
name="Cyclohexanol <4-tertbutyl-> acetate"
should yield
"4-tertbutyl-Cyclohexanol acetate"
Any thoughts would be welcome!
cheers,
Tom
For the first problem:
name <- c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-",
"2,6-Octadien-1-ol,3,7-dimethyl-,(E)-")
sapply(strsplit(name, "(?<!\\d), ?", perl = TRUE), function(x)
paste(rev(x), collapse = ""))
# [1] "(E)-3,7-dimethyl-2,6-Octadien-1-ol" "(E)-3,7-dimethyl-2,6-Octadien-1-ol"
For the second problem:
name <- c("Pyrazine <2-acetyl-, 3-ethyl->",
"Cyclohexanol <4-tertbutyl-> acetate")
inside <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name))
outside <- sub("^(.*) <.*>(.*)$" , "\\1\\2", name)
paste0(inside, outside)
# [1] "2-acetyl-3-ethyl-Pyrazine" "4-tertbutyl-Cyclohexanol acetate"
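Since the question asks for two functions, the snippets above could be wrapped roughly as follows. The function names and the "Limonene" test value are hypothetical, and the guard for names without "<...>" is an addition: without it, the second approach would paste the whole name twice. The sketch assumes, like the answer, that a space precedes the "<".

```r
# Hypothetical wrapper for problem (1): reverse the parts separated by ", ".
reverse_comma_parts <- function(name) {
  parts <- strsplit(name, "(?<!\\d), ?", perl = TRUE)
  vapply(parts, function(x) paste(rev(x), collapse = ""), character(1))
}

# Hypothetical wrapper for problem (2): move the "<...>" part to the front.
move_bracketed_first <- function(name) {
  has_brackets <- grepl("<.+>", name)
  inside  <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name))
  outside <- sub("^(.*) <.*>(.*)$", "\\1\\2", name)
  # Names without brackets are returned unchanged.
  ifelse(has_brackets, paste0(inside, outside), name)
}

reverse_comma_parts(c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-", "Limonene"))
move_bracketed_first(c("Pyrazine <2-acetyl-, 3-ethyl->",
                       "Cyclohexanol <4-tertbutyl-> acetate",
                       "Limonene"))
```

Both wrappers are vectorised over name, so no explicit loop is needed in the calling code.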

How does "gsub" handle spaces?

I have a character string "ab b cde", i.e. "ab[space]b[space]cde". I want to replace "space-b" and "space-c" with blank spaces, so that the output string is "ab[space][space][space][space]de". I can't figure out how to get rid of the second "b" without deleting the first one. I have tried:
gsub("[\\sb,\\sc]", " ", "ab b cde", perl=T)
but this is giving me "a[spaces]de". Any pointers? Thanks.
Edit: Consider a more complicated problem: I want to convert the string "akui i ii", i.e. "akui[space]i[space]ii", to "akui[spaces]" by removing the "space-i" and "space-ii".
[\sb,\sc] means "one character among space, b, ,, space, c".
You probably want something like (\sb|\sc), which means "space followed by b, or space followed by c"
or \s[bc] which means "space followed by b or c".
s <- "ab b cde"
gsub( "(\\sb|\\sc)", "  ", s, perl=TRUE ) # Replacement is two spaces
gsub( "\\s[bc]", "  ", s, perl=TRUE )
gsub( "[[:space:]][bc]", "  ", s, perl=TRUE ) # No backslashes
To remove multiple instances of a letter (as in the second example) include a + after the letter to be removed.
s2 <- "akui i ii"
gsub("\\si+", " ", s2)
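To make the difference between the character class and the sequence visible, the sketch below builds the expected strings with strrep(), so the number of spaces is explicit. The variable names class_result and seq_result are illustrative, and the two-space replacement is an assumption based on the four spaces requested in the question.

```r
s <- "ab b cde"

# Character class: each single whitespace, "b", "," or "c" is replaced on its own.
class_result <- gsub("[\\sb,\\sc]", " ", s, perl = TRUE)

# Sequence: whitespace followed by "b" or "c" is replaced as a two-character unit.
seq_result <- gsub("\\s[bc]", strrep(" ", 2), s, perl = TRUE)

identical(class_result, paste0("a", strrep(" ", 5), "de"))  # TRUE
identical(seq_result, paste0("ab", strrep(" ", 4), "de"))   # TRUE
```

The first "b" survives in seq_result because it is not preceded by whitespace, whereas the character class removes every "b" regardless of position.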
There is a simple solution to this.
gsub("\\s[bc]", " ", "ab b cde", perl=T)
This will give you what you want.
You can use lookbehind matching like this:
gsub("(?<=\\s)i+", " ", "akui i ii", perl=T)
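The practical difference between consuming the space and only looking behind it shows up in the length of the result. A small sketch, with the illustrative names consumed and kept:

```r
s2 <- "akui i ii"

# "\\si+" consumes the space together with the i's.
consumed <- gsub("\\si+", " ", s2)

# "(?<=\\s)i+" only checks for the space, so the space stays in the result.
kept <- gsub("(?<=\\s)i+", " ", s2, perl = TRUE)

nchar(consumed)  # 6: "akui" plus 2 spaces
nchar(kept)      # 8: "akui" plus 4 spaces
```

Which one is wanted depends on whether the original separators should be preserved. Lookbehind requires perl = TRUE.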
Edit:
Lookbehind is still the way to go, demonstrated with another example from your original post. Hope this helps.
x<-"ab b cde"
gsub(" b| c", "  ", x)
Note the double spaces in the 2nd argument.