Better Strategy for pulling elements from string - regex

I have a string that looks like this:
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"
What I need is this:
list(Ticker = "RBO", Assets = 36.26, Shares = 1,800,000)
I've tried splitting, regex, etc. But I feel my string manipulation skills are not up to snuff.
Here's my "best" attempt so far.
x <- unlist(strsplit(unlist(strsplit(x, "\r\n\t") ),"\r\n"))
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
x <- trim(x)
gsub("[A-Z]+$","\\2",x[2]) # bad attempt to get RBO

Update/better answer:
A look at cat(x) and readLines(textConnection(x)) helps a lot here.
> cat(x)
#
# Ticker Symbol: RBO
# Exchange: TSX
# Assets ($mm) 36.26
# Units Outstanding: 1,800,000
# Mgmt. Fee** 0.25
# 2013 MER* n/a
# CUSIP: 74932K103
> readLines(textConnection(x))
# [1] "" " Ticker Symbol: RBO"
# [3] " \t Exchange: TSX " "\t Assets ($mm) 36.26 "
# [5] "\t Units Outstanding: 1,800,000 " "\t Mgmt. Fee** 0.25 "
# [7] " 2013 MER* n/a " "\t CUSIP: 74932K103"
Now we know a few things. First, we don't need the first (empty) line, so dropping it makes the Ticker line the first element. Next, it would be easier if your list names matched the names in the string. I chose these.
> nm <- c("Symbol", "Assets", "Units")
Now all we have to do is use grep with sapply, and we'll get back a named vector of matches. Setting value = TRUE in grep returns the matching strings rather than their indices.
> (y <- sapply(nm, grep, x = readLines(textConnection(x))[-1], value = TRUE))
# Symbol Assets
# " Ticker Symbol: RBO" "\t Assets ($mm) 36.26 "
# Units
# "\t Units Outstanding: 1,800,000 "
Then we strsplit that on "[: ]", take the last element in each split, and we're done.
> lapply(strsplit(y, "[: ]"), tail, 1)
$Symbol
[1] "RBO"
$Assets
[1] "36.26"
$Units
[1] "1,800,000"
You could achieve the same result with
> g <- gsub("[[:cntrl:]]", "", capture.output(cat(x))[-1])
> m <- mapply(grep, nm, MoreArgs = list(x = g, value = TRUE))
> lapply(strsplit(m, "[: ]"), tail, 1)
Hope that helps.
Original Answer:
It looks like if you're pulling these from a large table, that they'd all be in the same element "slot" each time, so maybe this might be a little easier.
> s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]]
Explained:
- [: ] a character class that matches a single ":" character or a single space character
- | or
- [[:cntrl:]] any control character, which in this case is any of \r, \t, and \n. This is probably better explained here
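To see what that pattern does, here is a small sketch on a made-up string (toy is a hypothetical input, not taken from the question). A ":" followed by a space produces an empty string between the two split points, which is why empty strings have to be filtered out afterwards.

```r
# toy is a made-up input used only for illustration
toy <- "a: b\tc"

# split on a single ":" OR a single space OR any control character
parts <- strsplit(toy, "[: ]|[[:cntrl:]]")[[1]]
parts
# [1] "a" ""  "b" "c"

# nzchar() is TRUE for non-empty strings, so this drops the ""
kept <- parts[nzchar(parts)]
kept
# [1] "a" "b" "c"
```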
Then, nzchar looks in the above result for non-zero length character strings, and returns TRUE if matched, FALSE otherwise. So we can look at the result of the first line, determine where the matches are, and subset based on that.
> as.list(s[nzchar(s)][c(3, 8, 11)])
[[1]]
[1] "RBO"
[[2]]
[1] "36.26"
[[3]]
[1] "1,800,000"
You could put this into one line by assigning s as the inner call. Since functions and calls are evaluated from the inside out, s is assigned before R reaches the outside s subset. This is a bit less readable though.
s[nzchar(s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]])][c(3,8,11)]
So this would go s <- strsplit(...) -> [[ -> nzchar -> s[..] -> [c(3,8,11)]

Perhaps:
sub( "\\\r\\\n.+$", "", sub( "^.+Ticker Symbol: ", "", x) )
[1] "RBO"
I suppose you might do it all in one pattern with parentheses and a backreference.
> sub( "^.+Ticker Symbol: ([[:alpha:]]{1,})\\\r\\\n.+$", "\\1", x)
[1] "RBO"

If you just want to extract different parts of the string, you can use regexpr to find phrases and extract the contents after the phrase. For example
extr<-list(
"Ticker" = "Ticker Symbol: ",
"Assets" = "Assets ($mm) ",
"Shares" = "Units Outstanding: "
)
lines<-strsplit(x,"\r\n")[[1]]
Map(function(p) {
m <- regexpr(p, lines, fixed=TRUE)
if(length( w<- which(m!=-1))==1) {
gsub("^\\s+|\\s+$", "",
substr(lines[w], m[w] + attr(m,"match.length")[w], nchar(lines[w])))
} else {
NA
}
}, extr)
Which returns the named list as desired
$Ticker
[1] "RBO"
$Assets
[1] "36.26"
$Shares
[1] "1,800,000"
Here extr is a list where the name of the element is the name that will be used in the final list, and the element value is the exact string that will be matched in the text. I added in a gsub as well to trim off any whitespace.
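If the numeric types from the question matter, a variation along the same lines could look like the sketch below. It is only an illustration: grab is a hypothetical helper name, and it assumes each value is a single run of non-whitespace characters directly after its label. The \\Q...\\E quoting (PCRE, hence perl = TRUE) keeps "(", "$" and ")" in the label literal.

```r
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"

# grab() is a hypothetical helper: returns the first whitespace-delimited
# token that follows a literal label, or NA if the label is not found
grab <- function(label, text) {
  pat <- paste0("\\Q", label, "\\E\\s*(\\S+)")
  m <- regmatches(text, regexec(pat, text, perl = TRUE))[[1]]
  if (length(m) == 2) m[2] else NA_character_
}

res <- list(
  Ticker = grab("Ticker Symbol:", x),
  Assets = as.numeric(grab("Assets ($mm)", x)),
  # strip the thousands separators before converting
  Shares = as.numeric(gsub(",", "", grab("Units Outstanding:", x)))
)
str(res)
```

Here Ticker stays a character string while Assets and Shares come back as numbers.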

The stringr package is good for scraping data from strings. Here are the steps I use every time. You can always make the rules as specific or robust as you see fit.
require(stringr)
## take out annoying characters
x <- gsub("\r\n", "", x)
x <- gsub("\t", "", x)
x <- gsub("\\(\\$mm\\) ", "", x)
## define character index positions of interest
tickerEnd <- str_locate(x, "Ticker Symbol: ")[[1, "end"]]
assetsEnd <- str_locate(x, "Assets ")[[1, "end"]]
unitsStart <- str_locate(x, "Units Outstanding: ")[[1, "start"]]
unitsEnd <- str_locate(x, "Units Outstanding: ")[[1, "end"]]
mgmtStart <- str_locate(x, "Mgmt")[[1, "start"]]
## get substrings based on indices
tickerTxt <- substr(x, tickerEnd + 1, tickerEnd + 4) # allows 4-character symbols
assetsTxt <- substr(x, assetsEnd + 1, unitsStart - 1)
sharesTxt <- substr(x, unitsEnd + 1, mgmtStart - 1)
## cut out extraneous characters
ticker <- gsub(" ", "", tickerTxt)
assets <- gsub(" ", "", assetsTxt)
shares <- gsub(" |,", "", sharesTxt)
## add data to data frame
df <- data.frame(ticker, as.numeric(assets), as.numeric(shares), stringsAsFactors = FALSE)

Related

Selecting the word immediately after a keyword

I'm trying to extract the word immediately after a keyword using R. I don't have a lot of experience with regular expressions so everything I've found so far doesn't help me much. If I could get the function to return multiple instances that would be ideal.
For example if my keyword was the and my string was:
The yellow log is in the stream
It would return yellow and stream.
I found this solution for c# and it seems exactly like what I want but I'm having trouble implementing it in R.
You can try
library(stringr)
str_extract_all(str1, regex('(?<=\\bthe )\\w+', ignore_case = TRUE))[[1]]
#[1] "yellow" "stream"
Or using stringi
library(stringi)
stri_extract_all_regex(str1, '(?<=\\b(?i)The )\\w+')[[1]]
#[1] "yellow" "stream"
EDIT: Changed based on @Roland's suggestion in the comments.
data
str1 <- 'The yellow log is in the stream'
Assign key to whatever string you want and use
key <- 'the'
p <- "The yellow log is in the stream"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
or as @Roland points out, it would be safer to use a word boundary around your keyword to avoid this:
key <- 'the'
p <- "The yellow log is in the stream drinking absinthe and beer"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream" "and"
regmatches(p, gregexpr(sprintf('(?i)(?<=\\b%s )\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
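As a rough sketch, the word-boundary version can be wrapped in a small function that also works on a character vector. The name words_after is made up for illustration, and the sketch assumes key is a plain word with no regex metacharacters in it.

```r
# words_after() is a hypothetical wrapper around the pattern above;
# it assumes `key` contains no regex metacharacters
words_after <- function(text, key) {
  pat <- sprintf("(?i)(?<=\\b%s\\s)\\w+", key)
  regmatches(text, gregexpr(pat, text, perl = TRUE))
}

p <- c("The yellow log is in the stream drinking absinthe and beer",
       "no keyword here")
out <- words_after(p, "the")
out
# [[1]]
# [1] "yellow" "stream"
#
# [[2]]
# character(0)
```

Returning a list keeps one element per input string, so strings without the keyword show up as character(0) rather than disappearing.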
Here is a non-regex solution:
mytext <- "The yellow log is in the stream"
mykey <- "the"
x <- unlist(strsplit(mytext," "))
x[which(tolower(x)==mykey)+1]
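Try this: it returns 'The yellow' and 'the stream', i.e. each keyword together with the word after it, because regmatches returns the whole match rather than the capture group.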
x <- "The yellow log is in the stream"
regmatches(x, gregexpr("(?:(?:T|t)he)\\s(\\w+)", x, perl = TRUE))[[1]]
## [1] "The yellow" "the stream"
The qdapRegex package I maintain has a regular expression after_ in the regex_supplement dictionary that is perfect for this. You can use rm_ to make your own after_the function:
library(qdapRegex)
x<- "The yellow log is in the stream"
after_the <- rm_(pattern = S("@after_", "[Tt]he"), extract = TRUE)
after_the(x)
## [[1]]
## [1] "yellow" "stream"
The S function is a wrapper for sprintf that allows you to easily pass elements (like the word "the" in this case) to the base regex producing:
S("@after_", "the", "The")
## [1] "(?<=\\b(the|The)\\s)(\\w+)"
EDIT
library(qdapRegex)
x<- c("The yellow log is in the stream", "I like the one box for a pack")
after_the(x)
after_ <- rm_(extract = TRUE)
words <- c("the", "a", "one")
setNames(lapply(words, function(y){
after_(x, pattern = S("@after_", y, TC(y)))
}), words)
## $the
## $the[[1]]
## [1] "yellow" "stream"
##
## $the[[2]]
## [1] "one"
##
##
## $a
## $a[[1]]
## [1] NA
##
## $a[[2]]
## [1] "pack"
##
##
## $one
## $one[[1]]
## [1] NA
##
## $one[[2]]
## [1] "box"

Insert a character at multiple positions in a string at once

Let us say I have a string
"ABCDEFGHI56dfsdfd"
What I want to do is insert a space character at multiple positions at once.
For example, I want to insert a space character at two randomly chosen positions, say 4 and 8.
So the output should be
"ABCD EFGH I56dfsdfd"
What is the most effective way of doing this? Given the string can have any type of characters in it (not just the alphabets).
Here's a solution based on regular expressions:
vec <- "ABCDEFGHI56dfsdfd"
# sample two random positions
pos <- sample(nchar(vec), 2)
# [1] 6 4
# generate regex pattern
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
# [1] "(?=.{11}$)|(?=.{13}$)"
# insert spaces at (after) positions
gsub(pat, " ", vec, perl = TRUE)
# [1] "ABCD EF GHI56dfsdfd"
This approach is based on positive lookaheads, e.g., (?=.{11}$). In this example, a space is inserted at 11 characters before the end of the string ($).
A bit more brute-force-y than Sven's:
randomSpaces <- function(txt) {
pos <- sort(sample(nchar(txt), 2))
paste(substr(txt, 1, pos[1]), " ",
substr(txt, pos[1]+1, pos[2]), " ",
substr(txt, pos[2]+1, nchar(txt)), collapse="", sep="")
}
for (i in 1:10) print(randomSpaces("ABCDEFGHI56dfsdfd"))
## [1] "ABCDEFG HI56 dfsdfd"
## [1] "ABC DEFGHI5 6dfsdfd"
## [1] "AB CDEFGHI56dfsd fd"
## [1] "ABCDEFGHI 5 6dfsdfd"
## [1] "ABCDEF GHI56dfsdf d"
## [1] "ABC DEFGHI56dfsdf d"
## [1] "ABCD EFGHI56dfsd fd"
## [1] "ABCDEFGHI56d fsdfd "
## [1] "AB CDEFGH I56dfsdfd"
## [1] "A BCDE FGHI56dfsdfd"
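The same substring idea extends to any number of positions without regular expressions. The sketch below is one possible way to write it; insert_after is a made-up name, and "position" is taken to mean "insert after this many characters", as in the question's desired output.

```r
# insert_after() is a hypothetical helper: cut x after each position in
# `pos` and glue the pieces back together with `ins` between them
insert_after <- function(x, pos, ins = " ") {
  pos <- sort(unique(pos))
  starts <- c(1, pos + 1)
  ends <- c(pos, nchar(x))
  paste(substring(x, starts, ends), collapse = ins)
}

res <- insert_after("ABCDEFGHI56dfsdfd", c(4, 8))
res
# [1] "ABCD EFGH I56dfsdfd"
```

Because nothing is interpreted as a pattern, this works for strings containing any characters, including regex metacharacters.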
Based on the accepted answer, here's a function that simplifies this approach:
##insert pattern in string at position
substrins <- function(ins, x, ..., pos=NULL, offset=0){
stopifnot(is.numeric(pos),
is.numeric(offset),
!is.null(pos))
offset <- offset[1]
pat <- paste0("(?=.{", nchar(x) - pos - (offset-1), "}$)", collapse = "|")
gsub(pattern = pat, replacement = ins, x = x, ..., perl = TRUE)
}
# insert space at position 10
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10)
##[1] "ABCDEFGHI 56dfsdfd"
# insert pattern before position 10 (i.e. at position 9)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=-1)
##[1] "ABCDEFGH I56dfsdfd"
# insert pattern after position 10 (i.e. at position 11)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=1)
##[1] "ABCDEFGHI5 6dfsdfd"
Now to do what the OP wanted:
# insert space at position 4 and 8
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8))
##[1] "ABC DEFG HI56dfsdfd"
# insert space after position 4 and 8 (as per OP's desired output)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8), offset=1)
##[1] "ABCD EFGH I56dfsdfd"
To replicate the other, more brute-force-y answer one would do:
set.seed(123)
x <- "ABCDEFGHI56dfsdfd"
for (i in 1:10) print(substrins(" ", x, pos = sample(nchar(x), 2)))
##[1] "ABCD EFGHI56d fsdfd"
##[1] "ABCDEF GHI56dfs dfd"
##[1] " ABCDEFGHI56dfsd fd"
##[1] "ABCDEFGH I56dfs dfd"
##[1] "ABCDEFG HI 56dfsdfd"
##[1] "ABCDEFG HI56dfsdf d"
##[1] "ABCDEFGHI 56 dfsdfd"
##[1] "A BCDEFGHI56dfs dfd"
##[1] " ABCD EFGHI56dfsdfd"
##[1] "ABCDE FGHI56dfsd fd"

How to trim and replace a string

string<-c("       this is a string  ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- "    this is a string  "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
collapse = "")
paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c("    Four at the start, 2 at the end  ",
"   three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", "    this is a string  ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. sub), all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", "    this is a string  ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find zero or more spaces at the start of the string
( *$): Find zero or more spaces at the end of the string
|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- "    this is a string  "
foo <- function(x, new="~"){
lead <- gsub("(^ *).*", "\\1", x)
last <- gsub(".*?( *$)", "\\1", x)
mid <- gsub("(^ *)|( *$)", "", x)
paste0(
gsub(" ", new, lead),
mid,
gsub(" ", new, last)
)
}
> foo("    this is a string  ")
[1] "~~~~this is a string~~"
> foo(" And another one        ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", "    this is a string  " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with @AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses the positive and negative lookahead and look behind assertions:
\\s(?!\\b) - match a space, \\s not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get
"~~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s that is preceded by another space, (?<=\\s) and is followed by a word boundary, (?=\\b).
And it is gsub so it tries to make the maximal number of matches that it can.
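For comparison, here is a sketch of a different route that avoids lookarounds: match only the leading and trailing runs of spaces, then rewrite those matches in place with the replacement form of regmatches. The function name tilde_ends is hypothetical, and the sketch assumes the padding consists of plain spaces only.

```r
# tilde_ends() is a hypothetical helper: replace every space in the
# leading and trailing runs with `new`, leaving inner spaces alone
tilde_ends <- function(x, new = "~") {
  m <- gregexpr("^ +| +$", x)
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(s) gsub(" ", new, s, fixed = TRUE))
  x
}

res <- tilde_ends(c("    this is a string  ", " And another one        "))
res
# [1] "~~~~this is a string~~"   "~And another one~~~~~~~~"
```

Since the runs are replaced character for character, the number of padding characters on each side is preserved.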

split string with regex

I'm looking to split a string of a generic form, where the square brackets denote the "sections" of the string. Ex:
x <- "[a] + [bc] + 1"
And return a character vector that looks like:
"[a]" " + " "[bc]" " + 1"
EDIT: Ended up using this:
x <- "[a] + [bc] + 1"
x <- gsub("\\[",",[",x)
x <- gsub("\\]","],",x)
strsplit(x,",")
I've seen TylerRinker's code and suspect it may be more clear than this but this may serve as way to learn a different set of functions. (I liked his better before I noticed that it split on spaces.) I tried adapting this to work with strsplit but that function always removes the separators.
Maybe this could be adapted to make a newstrsplit that splits at the separators but leaves them in? Probably need to not split at first or last position and distinguish between opening and closing separators.
scan(text= # use scan to separate after insertion of commas
gsub("\\]", "],", # put commas in after "]"'s
gsub(".\\[", ",[", x)) , # add commas before "[" unless at first position
what="", sep=",") # tell scan this character argument and separators are ","
#Read 4 items
#[1] "[a]" " +" "[bc]" " + 1"
This is one lazy approach:
FUN <- function(x) {
all <- unlist(strsplit(x, "\\s+"))
last <- paste(c(" ", tail(all, 2)), collapse="")
c(head(all, -2), last)
}
x <- "[a] + [bc] + 1"
FUN(x)
## > FUN(x)
## [1] "[a]" "+" "[bc]" " +1"
You can compute the split points manually and use substring :
split.pos <- gregexpr('\\[.*?]',x)[[1]]
split.length <- attr(split.pos, "match.length")
split.start <- sort(c(split.pos, split.pos+split.length))
split.end <- c(split.start[-1]-1, nchar(x))
substring(x,split.start,split.end)
# [1] "[a]" " + " "[bc]" " + 1"
And here's a version that splits on the brackets AND keeps them in the result, using positive lookahead and lookbehind:
splitme <- function(x) {
x <- unlist(strsplit(x, "(?=\\[)", perl=TRUE))
x <- unlist(strsplit(x, "(?<=\\])", perl=TRUE))
for (i in which(x=="[")) {
x[i+1] <- paste(x[i], x[i+1], sep="")
}
x[-which(x=="[")]
}
splitme(x)
#[1] "[a]" " + " "[bc]" " + 1"
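Another way to keep the separators is to extract the pieces instead of splitting at all. The sketch below assumes the brackets are never nested: each match is either one complete bracketed section or a run of characters that contains no opening bracket.

```r
x <- "[a] + [bc] + 1"

# first alternative: "[" ... "]" with no "]" inside
# second alternative: one or more characters that are not "["
pieces <- regmatches(x, gregexpr("\\[[^\\]]*\\]|[^\\[]+", x, perl = TRUE))[[1]]
pieces
# [1] "[a]"  " + "  "[bc]" " + 1"
```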

R: removing the '$' symbols

I have downloaded some data from a web server, including prices that are formatted for humans, including $ and thousand separators.
> head(m)
[1] $129,900 $139,900 $254,000 $260,000 $290,000 $295,000
I was able to get rid of the commas, using
m <- sub(',','',m)
but
m <- sub('$','',m)
does not remove the dollar sign. If I try mn <- as.numeric(m) or as.integer I get a warning:
Warning message: NAs introduced by coercion
and the result is:
> head(m)
[1] NA NA NA NA NA NA
How can I remove the $ sign? Thanks
dat <- gsub('[$]','',dat)
dat <- as.numeric(gsub(',','',dat))
> dat
[1] 129900 139900 254000 260000 290000 295000
In one step
gsub('[$]([0-9]+)[,]([0-9]+)','\\1\\2',dat)
[1] "129900" "139900" "254000" "260000" "290000" "295000"
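The reason the attempt in the question has no effect is that an unescaped $ in a regular expression is the end-of-string anchor, so sub('$', '', m) replaces an empty match at the end with nothing. Note also that sub only replaces the first match, which matters once a price has more than one comma. A small sketch with made-up prices (the second value is hypothetical, chosen to show the two-comma case):

```r
# made-up prices; the second one has two thousands separators
m <- c("$129,900", "$1,139,900")

unchanged <- sub("$", "", m)       # "$" is an anchor here: nothing is removed
first_only <- sub(",", "", m)      # sub() drops only the first comma
cleaned <- as.numeric(gsub("[$,]", "", m))  # "$" is literal inside [ ]

unchanged
# [1] "$129,900"   "$1,139,900"
first_only
# [1] "$129900"   "$1139,900"
cleaned
# [1]  129900 1139900
```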
Try this. It means replace anything that is not a digit with the empty string:
as.numeric(gsub("\\D", "", dat))
or to remove anything that is neither a digit nor a decimal point:
as.numeric(gsub("[^0-9.]", "", dat))
UPDATE: Added a second similar approach in case the data in the question is not representative.
You could also use:
x <- c("$129,900", "$139,900", "$254,000", "$260,000", "$290,000", "$295,000")
library(qdap)
as.numeric(mgsub(c("$", ","), "", x))
yielding:
> as.numeric(mgsub(c("$", ","), "", x))
[1] 129900 139900 254000 260000 290000 295000
If you wanted to stay in base use the fixed = TRUE argument to gsub:
x <- c("$129,900", "$139,900", "$254,000", "$260,000", "$290,000", "$295,000")
as.numeric(gsub("$", "", gsub(",", "", x), fixed = TRUE))