Split string by words in R - regex

I would like to split a string by two words:
s <- "PCB153 treated HepG2 cells at T18"
strsplit(s, split = <treated><at>)
What should I write instead of <>?
I would get:
"PCB153" "HepG2 cells" "T18"

strsplit(s, split="treated|at")
#[[1]]
#[1] "PCB153 " " HepG2 cells " " T18"

You have to enter it as a string. To split on treated:
s <- "PCB153 treated HepG2 cells at T18"
s2 <- strsplit(s,split="treated")
unlist(s2)
To split on treated and at:
unlist(strsplit(unlist(s2),split="at"))

Related

Regex for extracting all words between word and character

i know basic of regex performing with R. But here i have a file like :
**[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981
[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767**
I wanted to extract timestamp alongwith all the SERVICE_ID in that line.
So, my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981
[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134
The code which I tried was only extracting one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
We replace the 2 or more , with " " to get 'str2', then using regex lookarounds, we match one or more space (\\s+) that follows the ]) followed by characters (.*) till the end of the string, replace it with "" so that we can extract the [2016-04..,03] part. From the 'str2', we extract the substrings "SERVICE_ID=" followed by numbers (\\d+) into a list, paste them together and finally paste it with the 'str3'.
library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str2 <- gsub(",{2,}", " ", str1)
str4 <- sub("\\].*","",str2,perl = TRUE)
str5 <- sub("\\[","",str4,perl = T)
service_ids <- sapply(str_extract_all(str2,"SERVICE_ID=\\d+"), function(x){paste(x,collapse = " ")})
net <- cbind(str5,service_ids)
Output:

Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?
If you're looking to utilize gsub to perform this task:
> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
# file text
# 1 1 Words longe than five chara shoul be trunc
# 2 2 Words short than five chara shoul not be modif
You were on the right track. In order for your idea to work, however, you have to do the split/trim/combine for each row separated. Here's a way to do it. I was very verbose on purpose, to make it clear, but you can obviously use less lines.
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- strtrim(str, 5)
str <- paste(str, collapse = " ")
str
})
And the output:
> df
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
The short version is
df$text <- sapply(df$text, function(str) {
paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")
})
Edit:
I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
str <- paste(str, collapse = " ")
str
})
The regex matches anything that appears after 5 characters and replaces those with nothing. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).

String rearrangement in R

I am on the lookout for two R functions that would perform the following string rearrangements:
(1) place the parts following a ", " in a string at the start of a string, e.g.
name="2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-"
should yield
"(E)-3,7-dimethyl-2,6-Octadien-1-ol"
(note that there could be any number of ", " in a string, or none at all, and that the parts after the ", " should be placed at the start of the string successively, starting from the end of the string. What would be the most efficient way of achieving this in R (without using loops etc)?
(2) place the parts between "<" and ">" at the start of a string and remove any ", ".
E.g.
name="Pyrazine <2-acetyl-, 3-ethyl->"
should yield
"2-acetyl-3-ethyl-Pyrazine"
(this is a simpler gsub problem, right?)
The part between the "<" and ">" could be in any place in the string though.
E.g.
name="Cyclohexanol <4-tertbutyl-> acetate"
should yield
"4-tertbutyl-Cyclohexanol acetate"
Any thoughts would be welcome!
cheers,
Tom
For the first problem:
name <- c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-",
"2,6-Octadien-1-ol,3,7-dimethyl-,(E)-")
sapply(strsplit(name, "(?<!\\d), ?", perl = TRUE), function(x)
paste(rev(x), collapse = ""))
# [1] "(E)-3,7-dimethyl-2,6-Octadien-1-ol" "(E)-3,7-dimethyl-2,6-Octadien-1-ol"
For the second problem:
name <- c("Pyrazine <2-acetyl-, 3-ethyl->",
"Cyclohexanol <4-tertbutyl-> acetate")
inside <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name))
outside <- sub("^(.*) <.*>(.*)$" , "\\1\\2", name)
paste0(inside, outside)
# [1] "2-acetyl-3-ethyl-Pyrazine" "4-tertbutyl-Cyclohexanol acetate"

How to trim and replace a string

string<-c(" this is a string ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- " this is a string "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
collapse = "")
paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c(" Four at the start, 2 at the end ",
" three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", " this is a string ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. sub), all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", " this is a string ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find one or more spaces at the start of the string
( *$): Find one or more spaces at the end of the string
`|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- " this is a string "
foo <- function(x, new="~"){
lead <- gsub("(^ *).*", "\\1", x)
last <- gsub(".*?( *$)", "\\1", x)
mid <- gsub("(^ *)|( *$)", "", x)
paste0(
gsub(" ", new, lead),
mid,
gsub(" ", new, last)
)
}
> foo(" this is a string ")
[1] "~~~~this is a string~~"
> foo(" And another one ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", " this is a string " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with #AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses the positive and negative lookahead and look behind assertions:
\\s(?!\\b) - match a space, \\s not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get
"~~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s that is preceded by another space, (?<=\\s) and is followed by a word boundary, (?=\\b).
And it is gsub so it tries to make the maximal number of matches that it can.

split string with regex

I'm looking to split a string of a generic form, where the square brackets denote the "sections" of the string. Ex:
x <- "[a] + [bc] + 1"
And return a character vector that looks like:
"[a]" " + " "[bc]" " + 1"
EDIT: Ended up using this:
x <- "[a] + [bc] + 1"
x <- gsub("\\[",",[",x)
x <- gsub("\\]","],",x)
strsplit(x,",")
I've seen TylerRinker's code and suspect it may be more clear than this but this may serve as way to learn a different set of functions. (I liked his better before I noticed that it split on spaces.) I tried adapting this to work with strsplit but that function always removes the separators.
Maybe this could be adapted to make a newstrsplit that splits at the separators but leaves them in? Probably need to not split at first or last position and distinguish between opening and closing separators.
scan(text= # use scan to separate after insertion of commas
gsub("\\]", "],", # put commas in after "]"'s
gsub(".\\[", ",[", x)) , # add commas before "[" unless at first position
what="", sep=",") # tell scan this character argument and separators are ","
#Read 4 items
#[1] "[a]" " +" "[bc]" " + 1"
This is one lazy approach:
FUN <- function(x) {
all <- unlist(strsplit(x, "\\s+"))
last <- paste(c(" ", tail(all, 2)), collapse="")
c(head(all, -2), last)
}
x <- "[a] + [bc] + 1"
FUN(x)
## > FUN(x)
## [1] "[a]" "+" "[bc]" " +1"
You can compute the split points manually and use substring :
split.pos <- gregexpr('\\[.*?]',x)[[1]]
split.length <- attr(split.pos, "match.length")
split.start <- sort(c(split.pos, split.pos+split.length))
split.end <- c(split.start[-1]-1, nchar(x))
substring(x,split.start,split.end)
# [1] "[a]" " + " "[bc]" " + 1"
And here's a version that splits on the brackets AND keeps them in the result, using positive lookahead and lookbehind:
splitme <- function(x) {
x <- unlist(strsplit(x, "(?=\\[)", perl=TRUE))
x <- unlist(strsplit(x, "(?<=\\])", perl=TRUE))
for (i in which(x=="[")) {
x[i+1] <- paste(x[i], x[i+1], sep="")
}
x[-which(x=="[")]
}
splitme(x)
#[1] "[a]" " + " "[bc]" " + 1"