Replacing values in selected columns of a dataframe using RegExp

Assume I have a dataframe
mydata <- c("10 stack"," 10 stack and x" , "10 stack / dd" ," 10 stackxx")
R>mydata
[1] " 10 stack"
[2] " 10 stack and x"
[3] " 10 stack / dd"
[4] " 10 stackxx"
What I want to do is replace any word beginning with 10 stack[anything] with another word throughout the dataframe, without removing the rest of the string. I also want to replace the slash (/) with "and" or a comma.
The desired output:
[1] " new"
[2] " new and x"
[3] " new and dd"
[4] " new"
My code is
mydata[mydata == "10 stack"] <- "new" # I can replace one type, but I need a faster operation.
mydata[mydata == "///"] <- "and" # for replacing the slash with "and"
I found another method that can solve the problem:
mydata <- as.data.frame(sapply(mydata, gsub, pattern = "/", replacement = ","))

Try
library(stringi)
stri_replace_all_regex(mydata, c("10 stack", "\\/"), c("new", "and"), vectorize_all=FALSE)
Which gives:
#[1] "new" " new and x" "new and dd" " newxx"
As mentioned by @rock321987 in the comments, if you want to replace 10 stack[anything], you could use the pattern \\b10 stack[^\\s]* instead:
stri_replace_all_regex(mydata, c("\\b10 stack[^\\s]*", "\\/"), c("new", "and"),
vectorize_all=FALSE)
Which gives:
#[1] "new" " new and x" "new and dd" " new"
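For readers who would rather avoid the stringi dependency, the same two replacements can probably be done in base R by chaining two gsub() calls. This is a minimal sketch and not part of the original answer: the names step1 and result are made up for illustration, and \\S* is used as an equivalent of [^\\s]*.

```r
mydata <- c("10 stack", " 10 stack and x", "10 stack / dd", " 10 stackxx")

# Replace "10 stack" plus any trailing non-space characters with "new"
step1 <- gsub("\\b10 stack\\S*", "new", mydata, perl = TRUE)

# Replace the literal slash with "and" (fixed = TRUE: no regex interpretation)
result <- gsub("/", "and", step1, fixed = TRUE)

print(result)
# [1] "new"        " new and x" "new and dd" " new"
```

Chaining keeps each pattern simple; the order matters only if one replacement could create text that the other pattern matches.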

You need to use the sub() function, which matches a pattern and substitutes the replacement for it.
sub("10 stack", " new", mydata)
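One detail worth checking before relying on this: sub() replaces only the first match in each element, while gsub() replaces every match. A small sketch with a made-up string (x is not from the question):

```r
x <- "10 stack / 10 stack"

first_only  <- sub("10 stack", "new", x)   # first match only
every_match <- gsub("10 stack", "new", x)  # all matches

print(first_only)   # [1] "new / 10 stack"
print(every_match)  # [1] "new / new"
```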

Related

R: How to replace space (' ') in string with a *single* backslash and space ('\ ')

I've searched many times, and haven't found the answer here or elsewhere. I want to replace each space ' ' in variables containing file names with a '\ '. (A use case could be for shell commands, with the spaces escaped, so each file name doesn't appear as a list of arguments.) I have looked through the StackOverflow question "how to replace single backslash in R", and find that many combinations do work as advertised:
> gsub(" ", "\\\\", "a b")
[1] "a\\b"
> gsub(" ", "\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
but try these with a single-slash version, and R ignores it:
> gsub(" ", "\\ ", "a b")
[1] "a b"
> gsub(" ", "\ ", "a b", fixed = TRUE)
[1] "a b"
For the case going in the opposite direction — removing slashes from a string, it works for two:
> gsub("\\\\", " ", "a\\b")
[1] "a b"
> gsub("\\", " ", "a\\b", fixed = TRUE)
[1] "a b"
However, for single slashes some inner perversity in R prevents me from even attempting to remove them:
> gsub("\\", " ", "a\\b")
Error in gsub("\\", " ", "a\\b") :
invalid regular expression '\', reason 'Trailing backslash'
> gsub("\", " ", "a\b", fixed = TRUE)
Error: unexpected string constant in "gsub("\", " ", ""
The 'invalid regular expression' is telling us something, but I don't see what. (Note too that the perl = TRUE option does not help.)
Even with three back slashes R fails to notice even one:
> gsub(" ", "\\\ ", "a b")
[1] "a b"
The pattern extends too! Even multiples of two work:
> gsub(" ", "\\\\\\\\", "a b")
[1] "a\\\\b"
but not odd multiples (should get '\\\ '):
> gsub(" ", "\\\\\\ ", "a b")
[1] "a\\ b"
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
(I would expect 3 slashes, not two.)
My two questions are:
How can my goal of replacing a ' ' with a '\ ' be accomplished?
Why did the odd number-slash variants of the replacements fail, while the even number-slash replacements worked?
For shell commands a simple work-around is to quote the file names, but part of my interest is just wanting to understand what is going on with R's regex engine.
Get ready for a face-palm, because this:
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
is actually working.
The two backslashes you see are just the R console's way of displaying a single backslash, which is escaped when printed to the screen.
To confirm the replacement with a single backslash is indeed working, try writing the output to a text file and inspect it yourself:
f <- file("C:\\output.txt")
writeLines(gsub(" ", "\\", "a b", fixed = TRUE), f)
close(f)
In output.txt you should see the following:
a\b
Very helpful discussion! (I've been Googling the heck out of this for 2 days.)
Another way to see the difference (rather than writing to a file) is to compare the contents of the string using print and cat.
z <- gsub(" ", "\\ ", "a b", fixed = TRUE)
> print(z)
[1] "a\\ b"
> cat(z)
a\ b
So, by using cat instead of print we can confirm that the gsub line is doing what was intended when we're trying to add single backslashes to a string.
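To tie this back to the two questions, here is a sketch of both ways to get a single backslash followed by a space, with nchar() as another way to confirm the real length of the string. The comments reflect my understanding of how the escaping layers interact (R's string parser consumes one level of backslashes, and in regex mode the replacement argument consumes another); the names z_fixed and z_regex are only illustrative.

```r
# fixed = TRUE: the replacement is used literally, so the R string "\\ "
# (one backslash + space after parsing) is inserted as-is.
z_fixed <- gsub(" ", "\\ ", "a b", fixed = TRUE)

# Regex mode: backslash is also an escape character inside the replacement,
# so it must be doubled again. "\\\\ " parses to two backslashes + space,
# which gsub() then reduces to one literal backslash + space.
z_regex <- gsub(" ", "\\\\ ", "a b")

identical(z_fixed, z_regex)  # TRUE
nchar(z_fixed)               # 4: a, backslash, space, b
cat(z_fixed, "\n")           # a\ b
```

This would also explain why a lone escaped space in regex mode appears to be ignored: the replacement's backslash escapes the space and leaves just a space.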

Clean character vector and strsplit into dataframe

I have a character vector I want to transform into a data frame.
It's mostly clean but I can't figure out how to finish the cleaning. Notice that the real data are a Date column as yyyy-mm-dd and a Variable column as a number (in this case four digits but not always) separated by a comma.
class(myvec)
[1] "character"
myvec
[1] " \"2016-01-01,8631n\" " " \"2016-01-02,8577n\" "
[3] " \"2016-01-03,8476n\" " " \"2016-01-04,8365n\" "
[5] " \"2016-01-05,8331n\" " " \"2016-01-06,8801n\" "
[7] " \"2016-01-07,5020n\""
The leading space and escaped quote (' \"') should be removed, and likewise the trailing n\".
The expected output should be a data frame like this
Date Variable
[1,] "2016-01-01" "8631"
[2,] "2016-01-02" "8577"
[3,] "2016-01-03" "8476"
[4,] "2016-01-04" "8365"
[5,] "2016-01-05" "8331"
[6,] "2016-01-06" "8801"
[7,] "2016-01-07" "5020"
Once the vector is clean, I think this does the job
do.call(rbind,strsplit(clean_vector,","))
I think I can convert to date with lubridate and the var to numeric with as.numeric on my own, the question is about getting the character vector clean and in the correct format.
You can remove the offending characters by enumerating them:
# example
x = " \"2016-01-01,8631n\" "
gsub("[n \"]","",x)
# "2016-01-01,8631"
This works because [xyz] identifies any single character from the list xyz.
Or you can take a substring, since the formatting is fixed-width, with bad chars at the start and end:
substr(x,3,17)
# "2016-01-01,8631"
If the var part of the string varies in length, nchar(x)-3 should work in place of 17.
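Putting the cleaning step together with the strsplit() call from the question, the whole pipeline might look like the sketch below. It uses a shortened copy of the sample data, and the conversion with as.Date() and as.numeric() is my assumption (the question mentions lubridate for the date step); parts and df are illustrative names.

```r
myvec <- c(" \"2016-01-01,8631n\" ", " \"2016-01-02,8577n\" ",
           " \"2016-01-07,5020n\"")

# Drop every space, double quote and letter n
clean_vector <- gsub("[n \"]", "", myvec)

# Split on the comma and stack the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(clean_vector, ",", fixed = TRUE))

# Build the data frame with typed columns
df <- data.frame(Date = as.Date(parts[, 1]),
                 Variable = as.numeric(parts[, 2]))
str(df)
```

Note that the character class removes every n, space and quote anywhere in the string, which is safe here only because the wanted part contains none of them.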

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, but I was wondering if there was a way to do it with just one use of the gsub function
trim <- function (x) {
x <- gsub("^\\s+|\\s+$", "", x)
gsub("\\s+", " ", x)
}
testString<- " This is a test. "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind :
gsub("^ *|(?<= ) | *$",'',testString,perl=TRUE)
# "This is a test."
Explanation :
## "^ *" matches any leading space
## "(?<= ) " The general form is (?<=a)b :
## matches a "b"( a space here)
## that is preceded by "a" (another space here)
## " *$" matches trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=T)
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString<- " This is a test. "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also use nested gsub. Less elegant than the previous answers, though:
> gsub("\\s+"," ",gsub("^\\s+|\\s+$","",testString))
[1] "This is a test."
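As a quick sanity check, the single-call variants above can be compared against the two-call version on a string with extra leading, internal and trailing spaces. This is only a sketch: the input x is made up, and the variable names are illustrative.

```r
x <- "  This   is a    test.  "

# One call, capturing group keeps a single internal space
one_call_capture <- gsub("^ +| +$|( ) +", "\\1", x)

# One call, lookahead removes whitespace that is followed by more whitespace
one_call_lookahead <- gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl = TRUE)

# Two calls: trim the ends, then collapse internal runs
two_calls <- gsub("\\s+", " ", gsub("^\\s+|\\s+$", "", x))

print(one_call_capture)  # [1] "This is a test."
identical(one_call_capture, one_call_lookahead)  # TRUE
identical(one_call_capture, two_calls)           # TRUE
```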

R split on delimiter (split) keep the delimiter (split)

In R you can use the strsplit function to split a vector on a delimiter(split) as follows:
x <- "What is this? It's an onion. What! That's| Well Crazy."
unlist(strsplit(x, "[\\?\\.\\!\\|]", perl=TRUE))
## [1] "What is this" " It's an onion" " What" " That's"
## [5] " Well Crazy"
I'd like to keep the delimiter(split) using R. So the desired output would be:
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
You can use "(?<=DELIMITERS)":
unlist(strsplit(x, "(?<=[?.!|])", perl=TRUE))
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
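Because the lookbehind is zero-width, the split happens right after each delimiter and the space that follows it stays at the start of the next piece. If that leading space is unwanted, one option (a sketch, assuming R 3.2.0 or later for trimws(); pieces and trimmed are illustrative names) is to trim afterwards:

```r
x <- "What is this? It's an onion. What! That's| Well Crazy."

# Split after each delimiter, keeping the delimiter with the left-hand piece
pieces <- unlist(strsplit(x, "(?<=[?.!|])", perl = TRUE))

# Remove the leading space left on every piece but the first
trimmed <- trimws(pieces)
print(trimmed)
```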

How does "gsub" handle spaces?

I have a character string "ab b cde", i.e. "ab[space]b[space]cde". I want to replace "space-b" and "space-c" with blank spaces, so that the output string is "ab[space][space][space][space]de". I can't figure out how to get rid of the second "b" without deleting the first one. I have tried:
gsub("[\\sb,\\sc]", " ", "ab b cde", perl=T)
but this is giving me "a[spaces]de". Any pointers? Thanks.
Edit: Consider a more complicated problem: I want to convert the string "akui i ii" i.e. "akui[space]i[space]ii" to "akui[spaces]" by removing the "space-i" and "space-ii".
[\sb,\sc] means "one character among space, b, ,, space, c".
You probably want something like (\sb|\sc), which means "space followed by b, or space followed by c"
or \s[bc] which means "space followed by b or c".
s <- "ab b cde"
gsub( "(\\sb|\\sc)", "  ", s, perl=TRUE )
gsub( "\\s[bc]", "  ", s, perl=TRUE )
gsub( "[[:space:]][bc]", "  ", s, perl=TRUE ) # No backslashes
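The difference between the character class and the sequence is easier to see if the spaces in the result are made visible. In this sketch chartr() swaps each space for an underscore purely for display, and strrep() (R 3.3.0 or later) builds the two-space replacement so that the width of the string is preserved; as_class and as_sequence are illustrative names.

```r
s <- "ab b cde"

# Character class: every single space, "b", "c" or "," is replaced on its own
as_class <- gsub("[\\sb,\\sc]", " ", s, perl = TRUE)

# Sequence: a whitespace character followed by "b" or "c" is replaced as a pair
as_sequence <- gsub("\\s[bc]", strrep(" ", 2), s, perl = TRUE)

chartr(" ", "_", as_class)     # "a_____de"  (the first "b" is lost too)
chartr(" ", "_", as_sequence)  # "ab____de"
```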
To remove multiple instances of a letter (as in the second example) include a + after the letter to be removed.
s2 <- "akui i ii"
gsub("\\si+", " ", s2)
There is a simple solution to this.
gsub("\\s[bc]", "  ", "ab b cde", perl=T)
This will give you what you want.
You can use lookbehind matching like this:
gsub("(?<=\\s)i+", " ", "akui i ii", perl=T)
Edit:
lookbehind is still the way to go, demonstrated with another example from your original post. Hope this helps.
x<-"ab b cde"
gsub(" b| c", "  ",x)
Note the double spaces in the 2nd argument.