String rearrangement in R - regex

I am on the lookout for two R functions that would perform the following string rearrangements:
(1) place the parts following a ", " in a string at the start of a string, e.g.
name="2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-"
should yield
"(E)-3,7-dimethyl-2,6-Octadien-1-ol"
(note that there could be any number of ", " in a string, or none at all, and that the parts after the ", " should be placed at the start of the string successively, starting from the end of the string. What would be the most efficient way of achieving this in R (without using loops etc)?
(2) place the parts between "<" and ">" at the start of a string and remove any ", ".
E.g.
name="Pyrazine <2-acetyl-, 3-ethyl->"
should yield
"2-acetyl-3-ethyl-Pyrazine"
(this is a simpler gsub problem, right?)
The part between the "<" and ">" could be in any place in the string though.
E.g.
name="Cyclohexanol <4-tertbutyl-> acetate"
should yield
"4-tertbutyl-Cyclohexanol acetate"
Any thoughts would be welcome!
cheers,
Tom

For the first problem:
name <- c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-",
"2,6-Octadien-1-ol,3,7-dimethyl-,(E)-")
sapply(strsplit(name, "(?<!\\d), ?", perl = TRUE), function(x)
paste(rev(x), collapse = ""))
# [1] "(E)-3,7-dimethyl-2,6-Octadien-1-ol" "(E)-3,7-dimethyl-2,6-Octadien-1-ol"
For the second problem:
name <- c("Pyrazine <2-acetyl-, 3-ethyl->",
"Cyclohexanol <4-tertbutyl-> acetate")
inside <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name))
outside <- sub("^(.*) <.*>(.*)$" , "\\1\\2", name)
paste0(inside, outside)
# [1] "2-acetyl-3-ethyl-Pyrazine" "4-tertbutyl-Cyclohexanol acetate"

Related

Regex for extracting all words between word and character

i know basic of regex performing with R. But here i have a file like :
**[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981
[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767**
I wanted to extract timestamp alongwith all the SERVICE_ID in that line.
So, my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981
[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134
The code which I tried was only extracting one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
We replace the 2 or more , with " " to get 'str2', then using regex lookarounds, we match one or more space (\\s+) that follows the ]) followed by characters (.*) till the end of the string, replace it with "" so that we can extract the [2016-04..,03] part. From the 'str2', we extract the substrings "SERVICE_ID=" followed by numbers (\\d+) into a list, paste them together and finally paste it with the 'str3'.
library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str2 <- gsub(",{2,}", " ", str1)
str4 <- sub("\\].*","",str2,perl = TRUE)
str5 <- sub("\\[","",str4,perl = T)
service_ids <- sapply(str_extract_all(str2,"SERVICE_ID=\\d+"), function(x){paste(x,collapse = " ")})
net <- cbind(str5,service_ids)
Output:

Combining lines in character vector in R

I have a character vector (content) of about 50,000 lines in R. However, some of the lines when read in from a text file are on separate lines and should not be. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for(i in 1:length(content)){
if(content[i+1] != ""){
content[i+1] <- c(content[i], content[i+1])
}
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are all many thousands lines each. Any ideas on how to translate these solutions into a function that can be applied to the content of each of the documents?
you don't need a loop to do that
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
c("\n", sample(letters, 20, replace = TRUE), "\n"),
collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
###identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars=="")
#split vector (an filter out "" by using the select vector)
select <- chars!=""
splitted <- split(chars[select], ids[select])
#paste the groups together
res <- sapply(splitted,paste, collapse=" ")
#remove names(if necessary, probably not)
res <- unname(res) #thanks #Roland
> res
[1] "hello, world" "how are you"
Here's a different approach using data.table which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace the "" with something you can later split on, and then collapse the characters together, and then use strsplit(). Here I have used the newline character since if you were to just paste it you could get the different lines on the output, e.g. cat(txt3) will output each phrase on a separate line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
Another way to add to the mix:
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
# 1 2 3
#"hello, world" "how are you" "more text"
Group by non-empty strings. And paste the elements together by group.

R: gsub with fixed=T or F and special cases

Building on top of two questions I previously asked:
R: How to prevent memory overflow when using mgsub in vector mode?
gsub speed vs pattern length
I do like suggestions on usage of fixed=TRUE by #Tyler as it speeds up calculations significantly. However, it's not always applicable. I need to substitute, say, caps as a stand-alone word w/ or w/o punctuation that surrounds it. A priori it's not know what can follow or precede the word, but it must be any of regular punctuation signs (, . ! - + etc). It cannot be a number or a letter. Example below. capsule must stay as is.
i = "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
orig = "caps"
change = "cap"
gsub_FixedTrue <- function(i) {
i = paste0(" ", i, " ")
orig = paste0(" ", orig, " ")
change = paste0(" ", change, " ")
i = gsub(orig,change,i,fixed=TRUE)
i = gsub("^\\s|\\s$", "", i, perl=TRUE)
return(i)
}
#Second fastest, doesn't clog memory
gsub_FixedFalse <- function(i) {
i = gsub(paste0("\\b",orig,"\\b"),change,i)
return(i)
}
print(gsub_FixedTrue(i)) #wrong
print(gsub_FixedFalse(i)) #correct
Results. Second output is desired
[1] "Here is the capsule, cap key, and two caps, or two caps. or even three caps-"
[1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"
Using parts from your previous question to test I think we can put a place holder in front of punctuation as follows, without slowing it down too much:
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key",
"Here is the capsule, caps key, and two caps, or two caps. or even three caps-" )
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")
line <- rep(line, 1700000/length(line))
line <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", line, perl=TRUE)
## Start
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")
for (i in seq_along(e2)) {
line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}
gsub("^\\s|\\s$| <DEL>|<DEL> ", "", line2, perl=TRUE)

How to remove a specific pattern from a string in R?

I have this string (for example).
str <- "T gwed is atyrt mtt yfdgfg grter effgf y"
I want to remove lone occurring alphabets from this string ('T' at the start and 'y' at the end in this case) and output should be
"gwed is atyrt mtt yfdgfg grter effgf"
I used this
str <- gsub("[A-Za-z] ", "", str)
But it gives this as a result.
[1] "gweiatyrmtyfdgfgrtey"
Here it considers cases like "gwed " also and hence it merges every word of the string.
How do i achieve my aim?
Also, I have this huge text with thousands of strings (not just a single string), so keep this in mind while providing an answer.
str <- "T gwed is atyrt mtt yfdgfg grter effgf y"
gsub(" ?\\<[[:alpha:]]\\> ?", "", str)
## [1] "gwed is atyrt mtt yfdgfg grter effgf"
You need to use the special character to denote word boundaries, i.e., \\< and \\>. The _? (where _ is a space) denotes that you also want to remove single spaces around the single letters (if present). See ?regex for more.
Another option wthout using regular expressions:
xx <- unlist(strsplit(str, " "))
paste(xx[nchar(xx)>1],collapse=' ')
[1] "gwed is atyrt mtt yfdgfg grter effgf"

How to trim and replace a string

string<-c(" this is a string ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- " this is a string "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
collapse = "")
paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c(" Four at the start, 2 at the end ",
" three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", " this is a string ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. sub), all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", " this is a string ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find one or more spaces at the start of the string
( *$): Find one or more spaces at the end of the string
`|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- " this is a string "
foo <- function(x, new="~"){
lead <- gsub("(^ *).*", "\\1", x)
last <- gsub(".*?( *$)", "\\1", x)
mid <- gsub("(^ *)|( *$)", "", x)
paste0(
gsub(" ", new, lead),
mid,
gsub(" ", new, last)
)
}
> foo(" this is a string ")
[1] "~~~~this is a string~~"
> foo(" And another one ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", " this is a string " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with #AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses the positive and negative lookahead and look behind assertions:
\\s(?!\\b) - match a space, \\s not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get
"~~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s that is preceded by another space, (?<=\\s) and is followed by a word boundary, (?=\\b).
And it is gsub so it tries to make the maximal number of matches that it can.