R: gsub with fixed=T or F and special cases - regex

Building on top of two questions I previously asked:
R: How to prevent memory overflow when using mgsub in vector mode?
gsub speed vs pattern length
I like @Tyler's suggestion to use fixed=TRUE, as it speeds up the calculations significantly. However, it's not always applicable. I need to substitute, say, caps as a stand-alone word, with or without surrounding punctuation. It's not known a priori what can follow or precede the word, but it must be a regular punctuation sign (, . ! - + etc.). It cannot be a number or a letter. Example below; capsule must stay as is.
i = "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
orig = "caps"
change = "cap"
gsub_FixedTrue <- function(i) {
i = paste0(" ", i, " ")
orig = paste0(" ", orig, " ")
change = paste0(" ", change, " ")
i = gsub(orig,change,i,fixed=TRUE)
i = gsub("^\\s|\\s$", "", i, perl=TRUE)
return(i)
}
#Second fastest, doesn't clog memory
gsub_FixedFalse <- function(i) {
i = gsub(paste0("\\b",orig,"\\b"),change,i)
return(i)
}
print(gsub_FixedTrue(i)) #wrong
print(gsub_FixedFalse(i)) #correct
Results. The second output is the desired one:
[1] "Here is the capsule, cap key, and two caps, or two caps. or even three caps-"
[1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"

Using parts of your previous question as test data, I think we can wrap each punctuation mark in placeholders as follows, without slowing things down too much:
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key",
"Here is the capsule, caps key, and two caps, or two caps. or even three caps-" )
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")
line <- rep(line, 1700000/length(line))
line <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", line, perl=TRUE)
## Start
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")
for (i in seq_along(e2)) {
line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}
gsub("^\\s|\\s$| <DEL>|<DEL> ", "", line2, perl=TRUE)
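As a sanity check on a small input, the placeholder steps above can be wrapped in a function and compared against the \\b version. This is only a sketch: replace_words_fixed is a made-up name, and it assumes the text never contains the literal marker <DEL>.

```r
# Hypothetical wrapper around the placeholder approach (name is illustrative).
replace_words_fixed <- function(x, from, to) {
  # Surround every punctuation mark with markers so neighbouring words
  # end up delimited by spaces on both sides
  x <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", x, perl = TRUE)
  x <- paste0(" ", x, " ")
  for (k in seq_along(from)) {
    x <- gsub(paste0(" ", from[k], " "), paste0(" ", to[k], " "), x, fixed = TRUE)
  }
  # Strip the padding and the markers again
  gsub("^\\s|\\s$| <DEL>|<DEL> ", "", x, perl = TRUE)
}

s <- "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
res_fixed <- replace_words_fixed(s, "caps", "cap")
res_regex <- gsub("\\bcaps\\b", "cap", s)
identical(res_fixed, res_regex)
```

One caveat: because each match consumes its surrounding spaces, the second of two directly adjacent targets ("caps caps") shares a space with the first and is not replaced in a single pass.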

Related

Regex for extracting all words between word and character

I know the basics of using regex in R. But here I have a file like:
[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981
[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767
I want to extract the timestamp along with all the SERVICE_ID entries in that line.
So, my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981
[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134
The code which I tried was only extracting one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
We replace runs of two or more commas with " " to get 'str2'. Then, using a regex lookbehind, we match one or more whitespace characters (\\s+) that follow the ], together with everything after them (.*) up to the end of the string, and replace the match with "" so that only the [2016-04-28 14:00:06,603] part remains ('str3'). From 'str2' we extract the substrings "SERVICE_ID=" followed by digits (\\d+) into a list, paste each list element together, and finally paste the result onto 'str3'.
library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
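For comparison, roughly the same result can be had in base R without stringr. This is a sketch, not the answer's method: extract_ids is a made-up name, and the two log lines are shortened copies of the sample data.

```r
# Shortened copies of the sample log lines (illustrative data)
log_lines <- c(
  "[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet: SERVICE_ID=541,SERVICE_ID=9981",
  "[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767"
)

# Hypothetical helper: timestamp plus every SERVICE_ID=<digits> on the line
extract_ids <- function(x) {
  # Keep only the leading [ ... ] block
  stamp <- sub("^(\\[[^]]*\\]).*$", "\\1", x)
  # gregexpr finds all matches per line; regmatches returns them as a list
  ids <- regmatches(x, gregexpr("SERVICE_ID=[0-9]+", x))
  paste(stamp, vapply(ids, paste, character(1), collapse = " "))
}

out <- extract_ids(log_lines)
out
```

A line without any SERVICE_ID would come back as the timestamp followed by a trailing space, so filter those out first if that matters.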
data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str2 <- gsub(",{2,}", " ", str1)
str4 <- sub("\\].*","",str2,perl = TRUE)
str5 <- sub("\\[","",str4,perl = T)
service_ids <- sapply(str_extract_all(str2,"SERVICE_ID=\\d+"), function(x){paste(x,collapse = " ")})
net <- cbind(str5,service_ids)
Output:

R: How to replace space (' ') in string with a *single* backslash and space ('\ ')

I've searched many times, and haven't found the answer here or elsewhere. I want to replace each space ' ' in variables containing file names with a '\ '. (A use case could be for shell commands, with the spaces escaped, so each file name doesn't appear as a list of arguments.) I have looked through the StackOverflow question "how to replace single backslash in R", and find that many combinations do work as advertised:
> gsub(" ", "\\\\", "a b")
[1] "a\\b"
> gsub(" ", "\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
but try these with a single-slash version, and R ignores it:
> gsub(" ", "\\ ", "a b")
[1] "a b"
> gsub(" ", "\ ", "a b", fixed = TRUE)
[1] "a b"
For the case going in the opposite direction — removing slashes from a string, it works for two:
> gsub("\\\\", " ", "a\\b")
[1] "a b"
> gsub("\\", " ", "a\\b", fixed = TRUE)
[1] "a b"
However, for single slashes some inner perversity in R prevents me from even attempting to remove them:
> gsub("\\", " ", "a\\b")
Error in gsub("\\", " ", "a\\b") :
invalid regular expression '\', reason 'Trailing backslash'
> gsub("\", " ", "a\b", fixed = TRUE)
Error: unexpected string constant in "gsub("\", " ", ""
The 'invalid regular expression' is telling us something, but I don't see what. (Note too that the perl = TRUE option does not help.)
Even with three back slashes R fails to notice even one:
> gsub(" ", "\\\ ", "a b")
[1] "a b"
The pattern extends too! Even multiples of two work:
> gsub(" ", "\\\\\\\\", "a b")
[1] "a\\\\b"
but not odd multiples (should get '\\\ '):
> gsub(" ", "\\\\\\ ", "a b")
[1] "a\\ b"
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
(I would expect 3 slashes, not two.)
My two questions are:
How can my goal of replacing a ' ' with a '\ ' be accomplished?
Why did the odd number-slash variants of the replacements fail, while the even number-slash replacements worked?
For shell commands a simple work-around is to quote the file names, but part of my interest is just wanting to understand what is going on with R's regex engine.
Get ready for a face-palm, because this:
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
is actually working.
The two backslashes you see are just the R console's way of displaying a single backslash, which is escaped when printed to the screen.
To confirm the replacement with a single backslash is indeed working, try writing the output to a text file and inspect yourself:
f <- file("C:\\output.txt")
writeLines(gsub(" ", "\\", "a b", fixed = TRUE), f)
close(f)
In output.txt you should see the following:
a\b
Very helpful discussion! (I've been Googling the heck out of this for 2 days.)
Another way to see the difference (rather than writing to a file) is to compare the contents of the string using print and cat.
z <- gsub(" ", "\\ ", "a b", fixed = TRUE)
> print(z)
[1] "a\\ b"
> cat(z)
a\ b
So, by using cat instead of print we can confirm that the gsub line is doing what was intended when we're trying to add single backslashes to a string.
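To make the point concrete without relying on how the console prints, here is a small sketch that counts characters instead. The variable names and the example file names are made up for illustration.

```r
# fixed = TRUE: the replacement is taken literally, so "\\ " in source
# (one backslash plus a space in memory) is inserted as is
z <- gsub(" ", "\\ ", "a b", fixed = TRUE)
nchar(z)          # counts real characters: a, backslash, space, b
print(z)          # escaped display form
cat(z, "\n")      # raw contents

# Regex mode: the replacement string is parsed a second time, where a
# backslash escapes whatever follows it. A literal backslash therefore
# needs two backslashes in memory, i.e. four in the source code.
z_regex <- gsub(" ", "\\\\ ", "a b")
identical(z, z_regex)

# Applied to a vector of (hypothetical) file names
files <- c("my file.txt", "another long name.csv")
escaped <- gsub(" ", "\\ ", files, fixed = TRUE)
cat(escaped, sep = "\n")
```

This also covers the second question: in regex mode a lone backslash in the replacement escapes the next character and disappears, which is why the replacement "\\ " produced "a b" in the question while "\\\\" produced a backslash.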

Combining lines in character vector in R

I have a character vector (content) of about 50,000 lines in R. However, some of the lines when read in from a text file are on separate lines and should not be. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for(i in 1:length(content)){
if(content[i+1] != ""){
content[i+1] <- c(content[i], content[i+1])
}
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are all many thousands lines each. Any ideas on how to translate these solutions into a function that can be applied to the content of each of the documents?
You don't need a loop to do that:
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
c("\n", sample(letters, 20, replace = TRUE), "\n"),
collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
###identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars=="")
#split vector (and filter out "" by using the select vector)
select <- chars!=""
splitted <- split(chars[select], ids[select])
#paste the groups together
res <- sapply(splitted,paste, collapse=" ")
#remove names(if necessary, probably not)
res <- unname(res) #thanks @Roland
> res
[1] "hello, world" "how are you"
Here's a different approach using data.table which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace the "" with something you can later split on, and then collapse the characters together, and then use strsplit(). Here I have used the newline character since if you were to just paste it you could get the different lines on the output, e.g. cat(txt3) will output each phrase on a separate line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
Another way to add to the mix:
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
# 1 2 3
#"hello, world" "how are you" "more text"
Group by non-empty strings. And paste the elements together by group.
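Regarding the EDIT about applying this to every document in a corpus: one option is to wrap the cumsum/split idea in a function and lapply it over the documents. A minimal sketch, assuming each document is available as a plain character vector of lines (combine_lines and docs are made-up names; how you get the lines out of your corpus object depends on the package you use):

```r
# Hypothetical helper: merge consecutive non-empty lines, using "" as separator
combine_lines <- function(content) {
  keep   <- content != ""
  groups <- cumsum(content == "")   # group id increases at every empty line
  parts  <- split(content[keep], groups[keep])
  unname(vapply(parts, paste, character(1), collapse = " "))
}

# Illustrative stand-in for a corpus: a named list of character vectors
docs <- list(
  doc1 = c("hello,", "world", "", "how", "are", "you", ""),
  doc2 = c("more", "text", "")
)
combined <- lapply(docs, combine_lines)
combined
```

As for the original loop, it fails because on the last iteration content[i+1] is NA, so the if() condition is NA rather than TRUE or FALSE.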

String rearrangement in R

I am on the lookout for two R functions that would perform the following string rearrangements:
(1) place the parts following a ", " in a string at the start of a string, e.g.
name="2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-"
should yield
"(E)-3,7-dimethyl-2,6-Octadien-1-ol"
(note that there could be any number of ", " in a string, or none at all, and that the parts after the ", " should be placed at the start of the string successively, starting from the end of the string). What would be the most efficient way of achieving this in R (without using loops etc.)?
(2) place the parts between "<" and ">" at the start of a string and remove any ", ".
E.g.
name="Pyrazine <2-acetyl-, 3-ethyl->"
should yield
"2-acetyl-3-ethyl-Pyrazine"
(this is a simpler gsub problem, right?)
The part between the "<" and ">" could be in any place in the string though.
E.g.
name="Cyclohexanol <4-tertbutyl-> acetate"
should yield
"4-tertbutyl-Cyclohexanol acetate"
Any thoughts would be welcome!
cheers,
Tom
For the first problem:
name <- c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-",
"2,6-Octadien-1-ol,3,7-dimethyl-,(E)-")
sapply(strsplit(name, "(?<!\\d), ?", perl = TRUE), function(x)
paste(rev(x), collapse = ""))
# [1] "(E)-3,7-dimethyl-2,6-Octadien-1-ol" "(E)-3,7-dimethyl-2,6-Octadien-1-ol"
For the second problem:
name <- c("Pyrazine <2-acetyl-, 3-ethyl->",
"Cyclohexanol <4-tertbutyl-> acetate")
inside <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name))
outside <- sub("^(.*) <.*>(.*)$" , "\\1\\2", name)
paste0(inside, outside)
# [1] "2-acetyl-3-ethyl-Pyrazine" "4-tertbutyl-Cyclohexanol acetate"
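Since the question asked for two functions, the snippets above could be wrapped up roughly like this. It is a sketch with made-up function names. The second one adds a guard for names without a <...> part, which the plain sub() version would otherwise duplicate, and it drops the space before "<" with a slightly different pattern than the answer uses.

```r
# (1) Move the ", "-separated parts to the front, last part first.
# The lookbehind keeps commas that follow a digit (as in "2,6") intact.
move_suffixes_first <- function(name) {
  parts <- strsplit(name, "(?<!\\d), ?", perl = TRUE)
  vapply(parts, function(p) paste(rev(p), collapse = ""), character(1))
}

# (2) Move the text between "<" and ">" to the front and drop any ", " in it.
move_bracketed_first <- function(name) {
  has_brackets <- grepl("<.+>", name)
  inside  <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name), fixed = TRUE)
  outside <- sub(" ?<.*>", "", name)
  ifelse(has_brackets, paste0(inside, outside), name)
}

r1 <- move_suffixes_first(c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-", "Hexanal"))
r2 <- move_bracketed_first(c("Pyrazine <2-acetyl-, 3-ethyl->",
                             "Cyclohexanol <4-tertbutyl-> acetate",
                             "Hexanal"))
r1
r2
```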

How to trim and replace a string

string<-c("       this is a string  ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- "    this is a string  "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
collapse = "")
paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c("    Four at the start, 2 at the end  ",
"   three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", "    this is a string  ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. substitute) all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", " this is a string ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find zero or more spaces at the start of the string
( *$): Find zero or more spaces at the end of the string
|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- " this is a string "
foo <- function(x, new="~"){
lead <- gsub("(^ *).*", "\\1", x)
last <- gsub(".*?( *$)", "\\1", x)
mid <- gsub("(^ *)|( *$)", "", x)
paste0(
gsub(" ", new, lead),
mid,
gsub(" ", new, last)
)
}
> foo("    this is a string  ")
[1] "~~~~this is a string~~"
> foo(" And another one        ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", "    this is a string  " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with @AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses positive and negative lookahead and lookbehind assertions:
\\s(?!\\b) - match a space, \\s not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get
"~~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s that is preceded by another space, (?<=\\s) and is followed by a word boundary, (?=\\b).
And it is gsub so it tries to make the maximal number of matches that it can.
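Yet another way to look at it, sketched here with a made-up function name, is to skip the clever regex: count the leading and trailing spaces, strip them, and rebuild the ends with strrep(). It assumes plain spaces (not tabs) and R >= 3.3.0 for strrep().

```r
# Hypothetical helper: replace leading/trailing spaces one-for-one with `new`
pad_ends <- function(x, new = "~") {
  lead  <- nchar(x) - nchar(sub("^ +", "", x))   # number of leading spaces
  trail <- nchar(x) - nchar(sub(" +$", "", x))   # number of trailing spaces
  core  <- gsub("^ +| +$", "", x)                # text without the ends
  paste0(strrep(new, lead), core, strrep(new, trail))
}

p1 <- pad_ends("    this is a string  ")
p2 <- pad_ends(c(" And another one        ", "no padding"), new = "*")
p1
p2
```

Everything here is vectorised, so no Vectorize() wrapper is needed. Note that a string consisting only of spaces is counted once as leading and once as trailing, so it comes back twice as long.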