split string with regex - regex

I'm looking to split a string of a generic form, where the square brackets denote the "sections" of the string. Ex:
x <- "[a] + [bc] + 1"
And return a character vector that looks like:
"[a]" " + " "[bc]" " + 1"
EDIT: Ended up using this:
x <- "[a] + [bc] + 1"
x <- gsub("\\[",",[",x)
x <- gsub("\\]","],",x)
strsplit(x,",")

I've seen TylerRinker's code and suspect it may be clearer than this, but this may serve as a way to learn a different set of functions. (I liked his better before I noticed that it split on spaces.) I tried adapting this to work with strsplit, but that function always removes the separators.
Maybe this could be adapted to make a newstrsplit that splits at the separators but leaves them in? Probably need to not split at first or last position and distinguish between opening and closing separators.
scan(text = # use scan to separate the pieces after inserting commas
  gsub("\\]", "],", # put a comma after each "]"
    gsub(".\\[", ",[", x)), # replace the character before each "[" with a comma (skips a "[" at the first position)
  what = "", sep = ",") # read character fields separated by ","
#Read 4 items
#[1] "[a]" " +" "[bc]" " + 1"
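The "newstrsplit" idea above could be sketched with gregexpr and regmatches, which can return both the matches and the text between them. This is only one possible sketch: the name split_keep and the bracket pattern are assumptions made for illustration, not something from the original answer.

```r
# Hypothetical helper: split a single string at every match of `pattern`,
# keeping the matched separators as their own elements.
split_keep <- function(x, pattern) {
  m <- gregexpr(pattern, x)
  matched <- regmatches(x, m)[[1]]                  # the separators
  between <- regmatches(x, m, invert = TRUE)[[1]]   # text around them
  # interleave: between[1], matched[1], between[2], matched[2], ...
  out <- character(length(matched) + length(between))
  out[seq(1, by = 2, length.out = length(between))] <- between
  out[seq(2, by = 2, length.out = length(matched))] <- matched
  out[nzchar(out)]                                  # drop empty pieces
}

x <- "[a] + [bc] + 1"
split_keep(x, "\\[[^]]*\\]")
# [1] "[a]"  " + "  "[bc]" " + 1"
```

Because the bracketed sections are matched as whole units, there is no need to distinguish opening from closing separators.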

This is one lazy approach:
FUN <- function(x) {
  all <- unlist(strsplit(x, "\\s+"))
  last <- paste(c(" ", tail(all, 2)), collapse="")
  c(head(all, -2), last)
}
x <- "[a] + [bc] + 1"
FUN(x)
## > FUN(x)
## [1] "[a]" "+" "[bc]" " +1"

You can compute the split points manually and use substring :
split.pos <- gregexpr('\\[.*?]',x)[[1]]
split.length <- attr(split.pos, "match.length")
split.start <- sort(c(split.pos, split.pos+split.length))
split.end <- c(split.start[-1]-1, nchar(x))
substring(x,split.start,split.end)
# [1] "[a]" " + " "[bc]" " + 1"
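As an alternative to computing positions by hand, the pieces can also be extracted directly as tokens: either a bracketed section or a run of characters containing no "[". This is a sketch that assumes brackets are never nested; the variable name tokens is made up for the example.

```r
x <- "[a] + [bc] + 1"

# A token is either "[...]" or a maximal run of non-"[" characters.
tokens <- regmatches(x, gregexpr("\\[[^]]*\\]|[^[]+", x))[[1]]
tokens
# [1] "[a]"  " + "  "[bc]" " + 1"
```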

And here's a version that splits on the brackets AND keeps them in the result, using positive lookahead and lookbehind:
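splitme <- function(x) {
  # split before each "[" and after each "]"
  x <- unlist(strsplit(x, "(?=\\[)", perl=TRUE))
  x <- unlist(strsplit(x, "(?<=\\])", perl=TRUE))
  # strsplit leaves each "[" as its own element; glue it back onto the next piece
  for (i in which(x=="[")) {
    x[i+1] <- paste(x[i], x[i+1], sep="")
  }
  # drop the lone "[" elements (also safe when there are none)
  x[x != "["]
}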
splitme(x)
#[1] "[a]" " + " "[bc]" " + 1"

Related

Regex for extracting all words between word and character

I know the basics of using regex in R, but here I have a file like this:
[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981
[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767
I want to extract the timestamp along with all the SERVICE_ID values in each line.
So, my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981
[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134
The code which I tried was only extracting one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
We replace runs of two or more commas with " " to get 'str2'. Then, using a regex lookbehind, we match one or more whitespace characters (\\s+) that follow the ] ((?<=\\])), together with all remaining characters (.*) up to the end of the string, and replace that with "" so that only the [2016-04..,03] part is left as 'str3'. From 'str2', we extract every substring "SERVICE_ID=" followed by digits (\\d+) into a list, paste the elements of each list entry together, and finally paste the result onto 'str3'.
library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
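For comparison, roughly the same result can be had with base R only, by pulling the timestamp and the IDs out with regmatches instead of deleting what is not wanted. This is a sketch; the variable names log_lines, stamp and ids are assumptions for the example.

```r
log_lines <- c(
  "[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
  "[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767"
)

# leading "[...]" timestamp of each line (assumes every line has one)
stamp <- regmatches(log_lines, regexpr("^\\[[^]]*\\]", log_lines))
# all SERVICE_ID=<digits> occurrences, one list element per line
ids <- regmatches(log_lines, gregexpr("SERVICE_ID=[0-9]+", log_lines))

paste(stamp, vapply(ids, paste, character(1), collapse = " "))
# [1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
# [2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
```

Note that regmatches with regexpr silently drops lines without a match, so the paste step only lines up if every line starts with a timestamp.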
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str2 <- gsub(",{2,}", " ", str1)
str4 <- sub("\\].*","",str2,perl = TRUE)
str5 <- sub("\\[","",str4,perl = T)
service_ids <- sapply(str_extract_all(str2,"SERVICE_ID=\\d+"), function(x){paste(x,collapse = " ")})
net <- cbind(str5,service_ids)

Better Strategy for pulling elements from string

I have a string that looks like this:
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"
What I need is this:
list(Ticker = "RBO", Assets = 36.26, Shares = 1800000)
I've tried splitting, regex, etc. But I feel my string manipulation skills are not up to snuff.
Here's my "best" attempt so far.
x <- unlist(strsplit(unlist(strsplit(x, "\r\n\t") ),"\r\n"))
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
x <- trim(x)
gsub("[A-Z]+$","\\2",x[2]) # bad attempt to get RBO
Update/better answer:
A look at cat(x) and readLines(textConnection(x)) helps a lot here.
> cat(x)
#
# Ticker Symbol: RBO
# Exchange: TSX
# Assets ($mm) 36.26 #
# Units Outstanding: 1,800,000
# Mgmt. Fee** 0.25
# 2013 MER* n/a
# CUSIP: 74932K103
> readLines(textConnection(x))
# [1] "" " Ticker Symbol: RBO"
# [3] " \t Exchange: TSX " "\t Assets ($mm) 36.26 "
# [5] "\t Units Outstanding: 1,800,000 " "\t Mgmt. Fee** 0.25 "
# [7] " 2013 MER* n/a " "\t CUSIP: 74932K103"
Now we know a few things. One, we don't need the first line, and we do want the second line. Dropping the first line makes things easier, because then the first line of the result is the first line we want. Next, it would be easier if your list names matched the names in the string. I chose these.
> nm <- c("Symbol", "Assets", "Units")
Now all we have to do is use grep with sapply, and we'll get back a named vector of matches. Setting value = TRUE in grep returns the matching strings rather than their indices.
> (y <- sapply(nm, grep, x = readLines(textConnection(x))[-1], value = TRUE))
# Symbol Assets
# " Ticker Symbol: RBO" "\t Assets ($mm) 36.26 "
# Units
# "\t Units Outstanding: 1,800,000 "
Then we strsplit that on "[: ]", take the last element in each split, and we're done.
> lapply(strsplit(y, "[: ]"), tail, 1)
$Symbol
[1] "RBO"
$Assets
[1] "36.26"
$Units
[1] "1,800,000"
You could achieve the same result with
> g <- gsub("[[:cntrl:]]", "", capture.output(cat(x))[-1])
> m <- mapply(grep, nm, MoreArgs = list(x = g, value = TRUE))
> lapply(strsplit(m, "[: ]"), tail, 1)
Hope that helps.
Original Answer:
It looks like if you're pulling these from a large table, that they'd all be in the same element "slot" each time, so maybe this might be a little easier.
> s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]]
Explained:
- [: ] a character class that matches either a ":" character or a space character
- | or
- [[:cntrl:]] any control character, which in this case is any of \r, \t, and \n. This is probably better explained here
Then, nzchar looks in the above result for non-zero length character strings, and returns TRUE if matched, FALSE otherwise. So we can look at the result of the first line, determine where the matches are, and subset based on that.
> as.list(s[nzchar(s)][c(3, 8, 11)])
[[1]]
[1] "RBO"
[[2]]
[1] "36.26"
[[3]]
[1] "1,800,000"
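To see why the nzchar step is needed, here is a small made-up example (the string toy is an assumption, not data from the question). Whenever two delimiters sit next to each other, such as ":" followed by a space, strsplit leaves an empty string between them.

```r
toy <- "A: 1\r\nB: 2"

pieces <- strsplit(toy, "[: ]|[[:cntrl:]]")[[1]]
pieces
# [1] "A" ""  "1" ""  "B" ""  "2"

# keep only the non-empty pieces
pieces[nzchar(pieces)]
# [1] "A" "1" "B" "2"
```

The hard-coded positions c(3, 8, 11) above are indices into this kind of filtered vector, so they only hold as long as the layout of the input string does not change.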
You could put this into one line by assigning s in the inner call. Since function calls are evaluated from the inside out, s is assigned before R reaches the outer s subset. This is a bit less readable though.
s[nzchar(s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]])][c(3,8,11)]
So this would go s <- strsplit(...) -> [[ -> nzchar -> s[...] -> [c(3,8,11)]
Perhaps:
sub( "\\\r\\\n.+$", "", sub( "^.+Ticker Symbol: ", "", x) )
[1] "RBO"
I suppose you might do it all in one pattern with parentheses and a backreference.
> sub( "^.+Ticker Symbol: ([[:alpha:]]{1,})\\\r\\\n.+$", "\\1", x)
[1] "RBO"
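The capture-group idea could be extended to all three fields with regexec, which returns the captured group alongside the full match. This is a sketch under the assumption that each label appears once in the string; the names patterns, grab and fields are made up for the example.

```r
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"

# one pattern per field; the parentheses capture the value after the label
patterns <- c(
  Ticker = "Ticker Symbol:[[:space:]]*([[:alnum:].]+)",
  Assets = "Assets \\(\\$mm\\)[[:space:]]*([0-9.,]+)",
  Shares = "Units Outstanding:[[:space:]]*([0-9.,]+)"
)

# hypothetical helper: return the first capture group, or NA if no match
grab <- function(pattern, text) {
  m <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

fields <- lapply(patterns, grab, text = x)
fields$Ticker
# [1] "RBO"
as.numeric(gsub(",", "", fields$Shares))
# [1] 1800000
```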
If you just want to extract different parts of the string, you can use regexpr to find phrases and extract the contents after the phrase. For example
extr <- list(
  "Ticker" = "Ticker Symbol: ",
  "Assets" = "Assets ($mm) ",
  "Shares" = "Units Outstanding: "
)
lines <- strsplit(x, "\r\n")[[1]]
Map(function(p) {
  # locate the literal phrase p in each line
  m <- regexpr(p, lines, fixed = TRUE)
  if (length(w <- which(m != -1)) == 1) {
    # keep everything after the phrase and trim surrounding whitespace
    gsub("^\\s+|\\s+$", "",
         substr(lines[w], m[w] + attr(m, "match.length")[w], nchar(lines[w])))
  } else {
    NA
  }
}, extr)
Which returns the named list as desired
$Ticker
[1] "RBO"
$Assets
[1] "36.26"
$Shares
[1] "1,800,000"
Here extr is a list where the name of the element is the name that will be used in the final list, and the element value is the exact string that will be matched in the text. I added in a gsub as well to trim off any whitespace.
The stringr package is good for scraping data from strings. Here are the steps I use every time. You can always make the rules as specific or robust as you see fit.
require(stringr)
## take out annoying characters
x <- gsub("\r\n", "", x)
x <- gsub("\t", "", x)
x <- gsub("\\(\\$mm\\) ", "", x)
## define character index positions of interest
tickerEnd <- str_locate(x, "Ticker Symbol: ")[[1, "end"]]
assetsEnd <- str_locate(x, "Assets ")[[1, "end"]]
unitsStart <- str_locate(x, "Units Outstanding: ")[[1, "start"]]
unitsEnd <- str_locate(x, "Units Outstanding: ")[[1, "end"]]
mgmtStart <- str_locate(x, "Mgmt")[[1, "start"]]
## get substrings based on indices
tickerTxt <- substr(x, tickerEnd + 1, tickerEnd + 4) # allows 4-character symbols
assetsTxt <- substr(x, assetsEnd + 1, unitsStart - 1)
sharesTxt <- substr(x, unitsEnd + 1, mgmtStart - 1)
## cut out extraneous characters
ticker <- gsub(" ", "", tickerTxt)
assets <- gsub(" ", "", assetsTxt)
shares <- gsub(" |,", "", sharesTxt)
## add data to data frame
df <- data.frame(ticker, as.numeric(assets), as.numeric(shares), stringsAsFactors = FALSE)

Split string by words in R

I would like to split a string by two words:
s <- "PCB153 treated HepG2 cells at T18"
strsplit(s, split = <treated><at>)
What should I write instead of <>?
I would like to get:
"PCB153" "HepG2 cells" "T18"
strsplit(s, split="treated|at")
#[[1]]
#[1] "PCB153 " " HepG2 cells " " T18"
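The result above still carries the surrounding spaces, and a bare "at" would also split inside any other word that contains it. One possible refinement, sketched here, is to require whitespace on both sides of the splitting words; the variable name parts is an assumption.

```r
s <- "PCB153 treated HepG2 cells at T18"

# split on " treated " or " at ", consuming the surrounding whitespace
parts <- strsplit(s, "\\s+(treated|at)\\s+")[[1]]
parts
# [1] "PCB153"      "HepG2 cells" "T18"
```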
You have to enter it as a string. To split on treated:
s <- "PCB153 treated HepG2 cells at T18"
s2 <- strsplit(s,split="treated")
unlist(s2)
To split on treated and at:
unlist(strsplit(unlist(s2),split="at"))

String rearrangement in R

I am on the lookout for two R functions that would perform the following string rearrangements:
(1) place the parts following a ", " in a string at the start of a string, e.g.
name="2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-"
should yield
"(E)-3,7-dimethyl-2,6-Octadien-1-ol"
(Note that there could be any number of ", " in a string, or none at all, and that the parts after the ", " should be placed at the start of the string successively, starting from the end of the string.) What would be the most efficient way of achieving this in R (without using loops etc.)?
(2) place the parts between "<" and ">" at the start of a string and remove any ", ".
E.g.
name="Pyrazine <2-acetyl-, 3-ethyl->"
should yield
"2-acetyl-3-ethyl-Pyrazine"
(this is a simpler gsub problem, right?)
The part between the "<" and ">" could be in any place in the string though.
E.g.
name="Cyclohexanol <4-tertbutyl-> acetate"
should yield
"4-tertbutyl-Cyclohexanol acetate"
Any thoughts would be welcome!
cheers,
Tom
For the first problem:
name <- c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-",
"2,6-Octadien-1-ol,3,7-dimethyl-,(E)-")
sapply(strsplit(name, "(?<!\\d), ?", perl = TRUE), function(x)
paste(rev(x), collapse = ""))
# [1] "(E)-3,7-dimethyl-2,6-Octadien-1-ol" "(E)-3,7-dimethyl-2,6-Octadien-1-ol"
For the second problem:
name <- c("Pyrazine <2-acetyl-, 3-ethyl->",
"Cyclohexanol <4-tertbutyl-> acetate")
inside <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name))
outside <- sub("^(.*) <.*>(.*)$" , "\\1\\2", name)
paste0(inside, outside)
# [1] "2-acetyl-3-ethyl-Pyrazine" "4-tertbutyl-Cyclohexanol acetate"
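One caveat with the code for the second problem: for a name without any "<...>" part, neither sub call matches, so both inside and outside come back as the whole name and paste0 doubles it. A guarded version might look like the following sketch; the function name move_bracketed is made up for the example.

```r
move_bracketed <- function(name) {
  has_brackets <- grepl("<.+>", name)
  # text between "<" and ">", with any ", " removed
  inside  <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name), fixed = TRUE)
  # everything except the " <...>" part
  outside <- sub("^(.*) <.*>(.*)$", "\\1\\2", name)
  # leave names without a bracketed part untouched
  ifelse(has_brackets, paste0(inside, outside), name)
}

move_bracketed(c("Pyrazine <2-acetyl-, 3-ethyl->",
                 "Cyclohexanol <4-tertbutyl-> acetate",
                 "Ethanol"))
# [1] "2-acetyl-3-ethyl-Pyrazine"        "4-tertbutyl-Cyclohexanol acetate"
# [3] "Ethanol"
```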

How to trim and replace a string

string<-c("       this is a string  ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- "    this is a string  "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
  pattern <- "^ +?\\b|\\b? +$"
  startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
  text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
                collapse = "")
  paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c("    Four at the start, 2 at the end  ",
               "   three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", "    this is a string  ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. sub), all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", " this is a string ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find zero or more spaces at the start of the string
( *$): Find zero or more spaces at the end of the string
|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- "    this is a string  "
foo <- function(x, new="~"){
  lead <- gsub("(^ *).*", "\\1", x)
  last <- gsub(".*?( *$)", "\\1", x)
  mid <- gsub("(^ *)|( *$)", "", x)
  paste0(
    gsub(" ", new, lead),
    mid,
    gsub(" ", new, last)
  )
}
> foo("    this is a string  ")
[1] "~~~~this is a string~~"
> foo(" And another one        ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
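Another way to get the same effect is to count the leading and trailing spaces and rebuild them with strrep. This is a sketch, assuming only plain spaces need replacing and R 3.3.0 or later for strrep; the function name pad_ends is made up.

```r
pad_ends <- function(x, new = "~") {
  core  <- gsub("^ +| +$", "", x)                    # text without outer spaces
  lead  <- nchar(x) - nchar(sub("^ +", "", x))       # number of leading spaces
  trail <- nchar(x) - nchar(sub(" +$", "", x))       # number of trailing spaces
  paste0(strrep(new, lead), core, strrep(new, trail))
}

pad_ends(c("    this is a string  ", " And another one        "))
# [1] "~~~~this is a string~~"   "~And another one~~~~~~~~"
```

This version is vectorized over x. A string made up entirely of spaces would have its spaces counted on both sides, so it would need special handling.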
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", "    this is a string  " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with @AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses the positive and negative lookahead and look behind assertions:
\\s(?!\\b) - match a space, \\s, not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get
"~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s that is preceded by another space, (?<=\\s) and is followed by a word boundary, (?=\\b).
And it is gsub so it tries to make the maximal number of matches that it can.