R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches) - regex

I have a question involving conditional replace.
I essentially want to find every run of digits and, within each run, replace every digit after the fourth with a space.
I need the solution to be vectorized, and speed is essential.
Here is a working (but inefficient) solution:
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else { x <- nchar(x) }
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)

Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookbehind: it matches a position that is not preceded by a digit. The string (\\d{4}) matches 4 consecutive digits and captures them as the first group. Finally, \\d* matches any number of further digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
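Since the update to the question asks for the cutoff as an input, here is a minimal sketch of the same gsub call with the digit count built via sprintf. The wrapper name trim_digits and its cutoff argument are made up for illustration; they are not part of the answer above.

```r
# Hypothetical wrapper: keep the first `cutoff` digits of every run of digits
# and drop the rest (this shortens the string, like the gsub call above).
trim_digits <- function(x, cutoff = 4L) {
  pattern <- sprintf("(?<!\\d)(\\d{%d})\\d*", cutoff)
  gsub(pattern, "\\1", x, perl = TRUE)
}

res_default <- trim_digits(c("123456 098765 1111", NA, "12 098"))
res_two <- trim_digits("1234567890", cutoff = 2L)
```

NA inputs come back as NA because gsub passes missing values through unchanged.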
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
if (!anyNA(m) && m[1] != -1L) {
for (i in seq_along(m)) {
len <- attr(m, "match.length")[i]
substr(d, m[i], m[i] + len - 1L) <- strrep(" ", len)
}
}
return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
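The same length-preserving idea can probably be written without the explicit loop by using the replacement form of regmatches. This is only a sketch: the function name blank_extra_digits and the cutoff argument are my own, and it assumes the input contains no NA values.

```r
# Hypothetical helper: blank out every digit after the first `cutoff` digits
# of a run, keeping the string length unchanged. Assumes x contains no NA.
blank_extra_digits <- function(x, cutoff = 4L) {
  pattern <- sprintf("(?<=\\d{%d})\\d+", cutoff)
  m <- gregexpr(pattern, x, perl = TRUE)
  # Replace each matched run of surplus digits with as many spaces.
  regmatches(x, m) <- lapply(regmatches(x, m), function(s) strrep(" ", nchar(s)))
  x
}

x <- c("123456 12", "1234567890", "abc")
y <- blank_extra_digits(x)
```

Every element of y has the same number of characters as the matching element of x, which is the point of this variant.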

You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit

Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=TRUE), zz))
The difference here is that it turns the NA value into "". We could add in other checks to prevent that or just turn all zero-length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.
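One way to add that safety check is to run the regmatches steps only on the non-missing elements and leave NA in place. The sketch below is an assumption-laden variant, not the code above: the name trim_pad_na_safe is invented, and it pads with strrep instead of formatC.

```r
# Hypothetical NA-safe wrapper around the gregexpr/regmatches idea above.
trim_pad_na_safe <- function(input, cutoff = 4L) {
  out <- rep(NA_character_, length(input))
  ok <- !is.na(input)
  x <- input[ok]
  # Find runs of more than `cutoff` digits.
  m <- gregexpr(sprintf("\\d{%d,}", cutoff + 1L), x)
  # Keep the first `cutoff` digits and pad with spaces to the original width.
  padded <- lapply(regmatches(x, m), function(num) {
    paste0(substr(num, 1, cutoff), strrep(" ", nchar(num) - cutoff))
  })
  gaps <- regmatches(x, m, invert = TRUE)
  # Interleave the untouched text with the padded numbers.
  out[ok] <- vapply(seq_along(x), function(i) {
    paste0(gaps[[i]], c(padded[[i]], ""), collapse = "")
  }, character(1))
  out
}

res <- trim_pad_na_safe(c("123456 12", NA, "abc", "1234567890"))
```

Making cutoff an argument here also covers the case raised in the question's update.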

Related

merge strings among rows by id

I wish to merge strings among rows by an id variable. I know how to do that with the R code below. However, my code seems vastly overly complex.
In the present case each string has two elements that are not dots. Each pair of consecutive rows within an id have one element in common. So, only one of those elements remains after the two rows are merged.
The desired result is shown and the R code below returns the desired result. Thank you for any suggestions. Sorry my R code is so long and convoluted, but it does work and my goal is to obtain more efficient code in base R.
my.data <- read.table(text = '
id my.string
2 11..................
2 .1...2..............
2 .....2...3..........
5 ....................
6 ......2.....2.......
6 ............2...4...
7 .1...2..............
7 .....2....3.........
7 ..........3..3......
7 .............34.....
8 ....1.....1.........
8 ..........12........
8 ...........2....3...
9 ..................44
10 .2.......2..........
11 ...2...2............
11 .......2.....2......
11 .............2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text = '
id my.string
2 11...2...3..........
5 ....................
6 ......2.....2...4...
7 .1...2....3..34.....
8 ....1.....12....3...
9 ..................44
10 .2.......2..........
11 ...2...2.....2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
# obtain position of first and last non-dot
# from: http://stackoverflow.com/questions/29229333/position-of-first-and-last-non-dot-in-a-string-with-regex
first.last.dot <- data.frame(my.data, do.call(rbind, gregexpr("^\\.*\\K[^.]|[^.](?=\\.*$)", my.data[,2], perl=TRUE)))
# obtain non-dot elements
first.last.dot$first.element <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X1, first.last.dot$X1))
first.last.dot$last.element <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X2, first.last.dot$X2))
# obtain some book-keeping variables
first.last.dot$number.within.group <- sequence(rle(first.last.dot$id)$lengths)
most.records.per.id <- max(first.last.dot$number.within.group)
n.ids <- length(unique(first.last.dot$id))
# create matrices for recording data
positions.per.id <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))
values.per.id <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))
# use nested for-loops to fill matrices with data
positions.per.id[1,1] = first.last.dot$X1[1]
values.per.id[1,1] = first.last.dot$first.element[1]
positions.per.id[1,2] = first.last.dot$X2[1]
values.per.id[1,2] = first.last.dot$last.element[1]
j = 1
for(i in 2:nrow(first.last.dot)) {
if(first.last.dot$id[i] != first.last.dot$id[i-1]) j = j + 1
positions.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$X1[i]
positions.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$X2[i]
values.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$first.element[i]
values.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$last.element[i]
}
# convert matrix data into new strings using nested for-loops
new.strings <- matrix(0, nrow = nrow(positions.per.id), ncol = nchar(my.data$my.string[1]))
for(i in 1:nrow(positions.per.id)) {
for(j in 1:ncol(positions.per.id)) {
new.strings[i,positions.per.id[i,j]] <- values.per.id[i,j]
}
}
# format new strings
new.strings[is.na(new.strings)] <- 0
new.strings[new.strings==0] <- '.'
new.strings2 <- data.frame(id = unique(first.last.dot$id), my.string = (do.call(paste0, as.data.frame(new.strings))), stringsAsFactors = FALSE)
new.strings2
all.equal(desired.result, new.strings2)
# [1] TRUE
Dude, this was tough. Please don't make me explain what I did.
data.frame(id=unique(my.data$id), my.string=sapply(lapply(unique(my.data$id), function(id) gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1)) ), function(x) paste0(x,collapse='') ), stringsAsFactors=F );
Ok, I'll explain it:
It begins with this lapply() call:
lapply(unique(my.data$id), function(id) ... )
As you can see, the above basically iterates over the unique ids in the data.frame, processing each one in turn. Here's the contents of the function:
gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1))
Let's take that in pieces, starting with the innermost subexpression:
strsplit(my.data[my.data$id==id,'my.string'],'')
The above indexes all my.string cells for the current id value, and splits each string using strsplit(). This produces a list of character vectors, with each list component containing a vector of character strings, where the whole vector corresponds to the input string which was split. The use of the empty string as the delimiter causes each individual character in each input string to become an element in the output vector in the list component corresponding to said input string.
Here's an example of what the above expression generates (for id==2):
[[1]]
[1] "1" "1" "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "."
[[2]]
[1] "." "1" "." "." "." "2" "." "." "." "." "." "." "." "." "." "." "." "." "." "."
[[3]]
[1] "." "." "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."
The above strsplit() call is wrapped in the following (with the ... representing the previous expression):
do.call(paste0,...)
That calls paste0() once, passing the output vectors that were generated by strsplit() as arguments. This does a kind of element-wise pasting of all vectors, so you end up with a single vector like this, for each unique id:
[1] "1.." "11." "..." "..." "..." ".22" "..." "..." "..." "..3" "..." "..." "..." "..." "..." "..." "..." "..." "..." "..."
The above paste0() call is wrapped in the following:
gsub('\\.','',...)
That strips all literal dots from all elements, resulting in something like this, for each unique id:
[1] "1" "11" "" "" "" "22" "" "" "" "3" "" "" "" "" "" "" "" "" "" ""
The above gsub() call is wrapped in the following:
substr(...,1,1)
That extracts just the first character of each element, which, if it exists, is the desired character in that position. Empty elements are acceptable, as that just means the id had no non-dot characters in any of its input strings at that position.
The above substr() call is wrapped in the following:
gsub('^$','.',...)
That simply replaces empty elements with a literal dot, which is obviously necessary before we put the string back together. So we have, for id==2:
[1] "1" "1" "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."
That completes the function that was given to the lapply() call. Thus, coming out of that call will be a list of character vectors representing the desired output strings. All that remains is collapsing the elements of those vectors back into a single string, which is why we then need this:
sapply(..., function(x) paste0(x,collapse='') )
Using sapply() (simplify-apply) is appropriate because it automatically combines all desired output strings into a single character vector, rather than leaving them as a list:
[1] "11...2...3.........." "...................." "......2.....2...4..." ".1...2....3..34....." "....1.....12....3..." "..................44" ".2.......2.........." "...2...2.....2...2.."
Thus, all that remains is producing the full output data.frame, similar to the input data.frame:
data.frame(id=unique(my.data$id), my.string=..., stringsAsFactors=F )
Resulting in:
id my.string
1 2 11...2...3..........
2 5 ....................
3 6 ......2.....2...4...
4 7 .1...2....3..34.....
5 8 ....1.....12....3...
6 9 ..................44
7 10 .2.......2..........
8 11 ...2...2.....2...2..
And we're done!
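For readers who find the one-liner hard to follow, the same column-wise idea can be spelled out with a character matrix. This is a sketch on made-up toy data; the helper name merge_strings is hypothetical, and it assumes all strings within an id have the same length.

```r
# Hypothetical helper: merge equal-length strings position by position,
# keeping the first non-dot character found in each position.
merge_strings <- function(strings) {
  chars <- do.call(rbind, strsplit(strings, ""))  # one row per string
  merged <- apply(chars, 2, function(column) {
    hit <- column[column != "."]
    if (length(hit) > 0) hit[1] else "."
  })
  paste(merged, collapse = "")
}

# Toy data in the same shape as my.data (made up for this sketch).
toy <- data.frame(id = c(2, 2, 2, 5),
                  my.string = c("11....", ".1..2.", "....23", "......"),
                  stringsAsFactors = FALSE)
merged <- vapply(split(toy$my.string, toy$id), merge_strings, character(1))
result <- data.frame(id = as.numeric(names(merged)), my.string = unname(merged),
                     stringsAsFactors = FALSE)
```

split() groups the strings by id, so merged is a named character vector with one merged string per id.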
Doing this in base R is a bit masochistic, so I won't do that, but with some perseverance you can do it yourself. Here's a data.table version instead (you'll need to install the latest 1.9.5 version from github to get tstrsplit):
library(data.table)
dt = as.data.table(my.data) # or setDT to convert in place
dt[, paste0(lapply(tstrsplit(my.string, ""),
function(i) {
res = i[i != "."];
if (length(res) > 0)
res[1]
else
'.'
}), collapse = "")
, by = id]
# id V1
#1: 2 11...2...3..........
#2: 5 ....................
#3: 6 ......2.....2...4...
#4: 7 .1...2....3..34.....
#5: 8 ....1.....12....3...
#6: 9 ..................44
#7: 10 .2.......2..........
#8: 11 ...2...2.....2...2..
Here's a possibility using functions from stringi and dplyr packages:
library(stringi)
library(dplyr)
# split my.string
m <- stri_split_boundaries(my.data$my.string, type = "character", simplify = TRUE)
df <- data.frame(id = my.data$id, m)
# function to apply to each column - select "." or unique "number"
myfun <- function(x) if(all(x == ".")) "." else unique(x[x != "."])
df %>%
# for each id...
group_by(id) %>%
# ...and each column, apply function
summarise_each(funs(myfun)) %>%
# for each row...
rowwise() %>%
#...concatenate strings
do(data.frame(id = .[1], mystring = paste(.[-1], collapse = "")))
# id mystring
# 1 2 11...2...3..........
# 2 5 ....................
# 3 6 ......2.....2...4...
# 4 7 .1...2....3..34.....
# 5 8 ....1.....12....3...
# 6 9 ..................44
# 7 10 .2.......2..........
# 8 11 ...2...2.....2...2..

Insert a character at multiple positions in a string at once

Let us say I have a string
"ABCDEFGHI56dfsdfd"
What I want to do is insert a space character at multiple positions at once.
For example, I want to insert a space character at two randomly chosen positions, say 4 and 8.
So the output should be
"ABCD EFGH I56dfsdfd"
What is the most effective way of doing this? Given the string can have any type of characters in it (not just the alphabets).
Here's a solution based on regular expressions:
vec <- "ABCDEFGHI56dfsdfd"
# sample two random positions
pos <- sample(nchar(vec), 2)
# [1] 6 4
# generate regex pattern
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
# [1] "(?=.{11}$)|(?=.{13}$)"
# insert spaces at (after) positions
gsub(pat, " ", vec, perl = TRUE)
# [1] "ABCD EF GHI56dfsdfd"
This approach is based on positive lookaheads, e.g., (?=.{11}$). In this example, a space is inserted at 11 characters before the end of the string ($).
A bit more brute-force-y than Sven's:
randomSpaces <- function(txt) {
pos <- sort(sample(nchar(txt), 2))
paste(substr(txt, 1, pos[1]), " ",
substr(txt, pos[1]+1, pos[2]), " ",
substr(txt, pos[2]+1, nchar(txt)), collapse="", sep="")
}
for (i in 1:10) print(randomSpaces("ABCDEFGHI56dfsdfd"))
## [1] "ABCDEFG HI56 dfsdfd"
## [1] "ABC DEFGHI5 6dfsdfd"
## [1] "AB CDEFGHI56dfsd fd"
## [1] "ABCDEFGHI 5 6dfsdfd"
## [1] "ABCDEF GHI56dfsdf d"
## [1] "ABC DEFGHI56dfsdf d"
## [1] "ABCD EFGHI56dfsd fd"
## [1] "ABCDEFGHI56d fsdfd "
## [1] "AB CDEFGH I56dfsdfd"
## [1] "A BCDE FGHI56dfsdfd"
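The substr idea extends to any number of positions if the pieces are cut with the vectorised substring and glued back together. A rough sketch, assuming a single input string and positions between 1 and nchar(x); insert_at is an invented name.

```r
# Hypothetical helper: insert `ins` after each position in `pos`.
insert_at <- function(x, pos, ins = " ") {
  pos <- sort(unique(pos))
  starts <- c(1, pos + 1)      # where each piece begins
  ends <- c(pos, nchar(x))     # where each piece ends
  paste(substring(x, starts, ends), collapse = ins)
}

two_spaces <- insert_at("ABCDEFGHI56dfsdfd", c(4, 8))
dashes <- insert_at("abcdef", c(5, 1, 3), ins = "-")
```

Because no regular expression is involved, the content of the string (digits, punctuation, and so on) does not matter.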
Based on the accepted answer, here's a function that simplifies this approach:
##insert pattern in string at position
substrins <- function(ins, x, ..., pos=NULL, offset=0){
stopifnot(is.numeric(pos),
is.numeric(offset),
!is.null(pos))
offset <- offset[1]
pat <- paste0("(?=.{", nchar(x) - pos - (offset-1), "}$)", collapse = "|")
gsub(pattern = pat, replacement = ins, x = x, ..., perl = TRUE)
}
# insert space at position 10
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10)
##[1] "ABCDEFGHI 56dfsdfd"
# insert pattern before position 10 (i.e. at position 9)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=-1)
##[1] "ABCDEFGH I56dfsdfd"
# insert pattern after position 10 (i.e. at position 11)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=1)
##[1] "ABCDEFGHI5 6dfsdfd"
Now to do what the OP wanted:
# insert space at position 4 and 8
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8))
##[1] "ABC DEFG HI56dfsdfd"
# insert space after position 4 and 8 (as per OP's desired output)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8), offset=1)
##[1] "ABCD EFGH I56dfsdfd"
To replicate the other, more brute-force-y answer one would do:
set.seed(123)
x <- "ABCDEFGHI56dfsdfd"
for (i in 1:10) print(substrins(" ", x, pos = sample(nchar(x), 2)))
##[1] "ABCD EFGHI56d fsdfd"
##[1] "ABCDEF GHI56dfs dfd"
##[1] " ABCDEFGHI56dfsd fd"
##[1] "ABCDEFGH I56dfs dfd"
##[1] "ABCDEFG HI 56dfsdfd"
##[1] "ABCDEFG HI56dfsdf d"
##[1] "ABCDEFGHI 56 dfsdfd"
##[1] "A BCDEFGHI56dfs dfd"
##[1] " ABCD EFGHI56dfsdfd"
##[1] "ABCDE FGHI56dfsd fd"

How to give space between 2 words after removing Punctuation and Numbers text mining in R

In the example below, after removing the number 3054 and the punctuation mark "-" from the string "BG3054-suhas B-DC chr 23.7-22.8.13", the two words are joined into bgsuhas, but I need a space between them: bg suhas. The same thing happens with bdc and bbxsh. Can you help me add a space between these words for text mining?
I need these terms in the output matrix:
bg suhas b dc chr rashmi
library(tm)
Newcol <- c("BG3054-suhas B-DC chr 23.7-22.8.13","BBXSH0030 Rashmi S 23.4.13to22.5.13")
text.corp <- Corpus(VectorSource(Newcol))
text.corp <- tm_map(text.corp, content_transformer(tolower))
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
dtm <- DocumentTermMatrix(text.corp)
dtm.mat <- as.matrix(dtm)
dtm.mat
OUTPUT
Terms
Docs bbxsh bdc bgsuhas chr rashmi
1 0 1 1 1 0
2 1 0 0 0 1
I would just replace anything that's not an a-z letter with a space as a preprocessing step using gsub:
Newcol <- gsub("[^a-zA-Z]+", " ", Newcol)
Newcol
# [1] "BG suhas B DC chr " "BBXSH Rashmi S to "
Then your tm code should work fine for processing Newcol.
Define your own content transformer:
replacePunctuation <- content_transformer(function(x) {return (gsub("[[:punct:]]"," ", x))})
And then use it:
text.corp <- tm_map(text.corp, replacePunctuation )
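To see why substituting a space (rather than deleting) keeps the words apart, here is a small base R sketch that does not use tm at all. The function name clean_text is made up, and it treats digits the same way as punctuation.

```r
# Hypothetical helper: lower-case, turn runs of punctuation/digits into a
# single space, then squeeze and trim whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:digit:]]+", " ", x)
  trimws(gsub("[[:space:]]+", " ", x))
}

Newcol <- c("BG3054-suhas B-DC chr 23.7-22.8.13",
            "BBXSH0030 Rashmi S 23.4.13to22.5.13")
cleaned <- clean_text(Newcol)
```

The cleaned vector could then be handed to Corpus(VectorSource(...)) as in the question.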

Better Strategy for pulling elements from string

I have a string that looks like this:
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"
What I need is this:
list(Ticker = "RBO", Assets = 36.26, Shares = 1800000)
I've tried splitting, regex, etc. But I feel my string manipulation skills are not up to snuff.
Here's my "best" attempt so far.
x <- unlist(strsplit(unlist(strsplit(x, "\r\n\t") ),"\r\n"))
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
x <- trim(x)
gsub("[A-Z]+$","\\2",x[2]) # bad attempt to get RBO
Update/better answer:
A look at cat(x) and readLines(textConnection(x)) helps a lot here
> cat(x)
#
# Ticker Symbol: RBO
# Exchange: TSX
# Assets ($mm) 36.26 #
# Units Outstanding: 1,800,000
# Mgmt. Fee** 0.25
# 2013 MER* n/a
# CUSIP: 74932K103
> readLines(textConnection(x))
# [1] "" " Ticker Symbol: RBO"
# [3] " \t Exchange: TSX " "\t Assets ($mm) 36.26 "
# [5] "\t Units Outstanding: 1,800,000 " "\t Mgmt. Fee** 0.25 "
# [7] " 2013 MER* n/a " "\t CUSIP: 74932K103"
Now we know a few things. First, we don't need the first (empty) line, so we drop it; the second line then becomes the first line we want. Next, it would be easier if your list names matched the names in the string. I chose these.
> nm <- c("Symbol", "Assets", "Units")
Now all we have to do is use grep with sapply, and we'll get back a named vector of matches. Setting value = TRUE in grep returns the matching strings rather than their indices.
> (y <- sapply(nm, grep, x = readLines(textConnection(x))[-1], value = TRUE))
#                 Symbol                  Assets
# " Ticker Symbol: RBO" "\t Assets ($mm) 36.26 "
# Units
# "\t Units Outstanding: 1,800,000 "
Then we strsplit that on "[: ]", take the last element in each split, and we're done.
> lapply(strsplit(y, "[: ]"), tail, 1)
$Symbol
[1] "RBO"
$Assets
[1] "36.26"
$Units
[1] "1,800,000"
You could achieve the same result with
> g <- gsub("[[:cntrl:]]", "", capture.output(cat(x))[-1])
> m <- mapply(grep, nm, MoreArgs = list(x = g, value = TRUE))
> lapply(strsplit(m, "[: ]"), tail, 1)
Hope that helps.
Original Answer:
It looks like if you're pulling these from a large table, that they'd all be in the same element "slot" each time, so maybe this might be a little easier.
> s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]]
Explained:
- [: ] matches a single character that is either a ":" or a space
- | or
- [[:cntrl:]] any control character, which in this case is any of \r, \t, and \n (see ?regex for the POSIX character classes)
Then, nzchar looks in the above result for non-zero length character strings, and returns TRUE if matched, FALSE otherwise. So we can look at the result of the first line, determine where the matches are, and subset based on that.
> as.list(s[nzchar(s)][c(3, 8, 11)])
[[1]]
[1] "RBO"
[[2]]
[1] "36.26"
[[3]]
[1] "1,800,000"
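To see why the nzchar step is needed, it may help to run the same split on a tiny made-up string: adjacent delimiters (a colon followed by a space) leave an empty string between them.

```r
# Toy input (not from the question): a colon, a space and a tab as delimiters.
toy <- "a: b\tc"
parts <- strsplit(toy, "[: ]|[[:cntrl:]]")[[1]]
kept <- parts[nzchar(parts)]  # drop the empty pieces
```

The positions c(3, 8, 11) used above are indices into the filtered vector, so they depend on the exact layout of the input.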
You could put this into one line by assigning s in the inner call, but be careful: as far as I can tell, R evaluates the outer s before the index expression, so this only works when s already exists from the previous step. It is also a bit less readable.
s[nzchar(s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]])][c(3,8,11)]
The nesting reads s <- strsplit(...) -> [[ -> nzchar -> s[...] -> [c(3,8,11)]
Perhaps:
sub( "\\\r\\\n.+$", "", sub( "^.+Ticker Symbol: ", "", x) )
[1] "RBO"
I suppose you might do it all in one pattern with parentheses and a backreference.
> sub( "^.+Ticker Symbol: ([[:alpha:]]{1,})\\\r\\\n.+$", "\\1", x)
[1] "RBO"
If you just want to extract different parts of the string, you can use regexpr to find phrases and extract the contents after the phrase. For example
extr<-list(
"Ticker" = "Ticker Symbol: ",
"Assets" = "Assets ($mm) ",
"Shares" = "Units Outstanding: "
)
lines<-strsplit(x,"\r\n")[[1]]
Map(function(p) {
m <- regexpr(p, lines, fixed=TRUE)
if(length( w<- which(m!=-1))==1) {
gsub("^\\s+|\\s+$", "",
substr(lines[w], m[w] + attr(m,"match.length")[w], nchar(lines[w])))
} else {
NA
}
}, extr)
Which returns the named list as desired
$Ticker
[1] "RBO"
$Assets
[1] "36.26"
$Shares
[1] "1,800,000"
Here extr is a list where the name of the element is the name that will be used in the final list, and the element value is the exact string that will be matched in the text. I added in a gsub as well to trim off any whitespace.
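The question asks for numeric Assets and Shares, so a conversion step is still needed. Below is a self-contained sketch of one way to do that; value_after is a hypothetical helper (not the Map call above) that takes the first non-space token after a literal label, and x is a shortened copy of the question's string.

```r
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n"

# Hypothetical helper: first non-space token that follows a literal label.
value_after <- function(label, text) {
  m <- regexpr(label, text, fixed = TRUE)
  if (m[1] == -1) return(NA_character_)
  start <- as.integer(m) + attr(m, "match.length")
  rest <- substring(text, start)
  token <- regmatches(rest, regexpr("[^[:space:]]+", rest))
  if (length(token) == 1) token else NA_character_
}

result <- list(
  Ticker = value_after("Ticker Symbol:", x),
  Assets = as.numeric(value_after("Assets ($mm)", x)),
  # Strip the thousands separators before converting.
  Shares = as.numeric(gsub(",", "", value_after("Units Outstanding:", x), fixed = TRUE))
)
```

Using fixed = TRUE for the label avoids having to escape the parentheses and dollar sign in "Assets ($mm)".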
The stringr package is good for scraping data from strings. Here are the steps I use every time. You can always make the rules as specific or robust as you see fit.
require(stringr)
## take out annoying characters
x <- gsub("\r\n", "", x)
x <- gsub("\t", "", x)
x <- gsub("\\(\\$mm\\) ", "", x)
## define character index positions of interest
tickerEnd <- str_locate(x, "Ticker Symbol: ")[[1, "end"]]
assetsEnd <- str_locate(x, "Assets ")[[1, "end"]]
unitsStart <- str_locate(x, "Units Outstanding: ")[[1, "start"]]
unitsEnd <- str_locate(x, "Units Outstanding: ")[[1, "end"]]
mgmtStart <- str_locate(x, "Mgmt")[[1, "start"]]
## get substrings based on indices
tickerTxt <- substr(x, tickerEnd + 1, tickerEnd + 4) # allows 4-character symbols
assetsTxt <- substr(x, assetsEnd + 1, unitsStart - 1)
sharesTxt <- substr(x, unitsEnd + 1, mgmtStart - 1)
## cut out extraneous characters
ticker <- gsub(" ", "", tickerTxt)
assets <- gsub(" ", "", assetsTxt)
shares <- gsub(" |,", "", sharesTxt)
## add data to data frame
df <- data.frame(ticker, as.numeric(assets), as.numeric(shares), stringsAsFactors = FALSE)

Capturing a repeating group

I am trying to write a regex that would match and capture the following for me ...
String: 17+18+19+5+21
Numbers to be captured here (separately) are present in the array - [17,18,21].
Please note that the string can be n characters long (following the same pattern of \d+) and the order of these numbers in the string is not fixed.
Thanks in advance
Given this setup:
library(gsubfn)
s <- "17+18+19+5+21"
a <- c(17, 18, 21)
1) Try this:
L <- as.list(c(setNames(a, a), NA))
strapply(s, "\\d+", L, simplify = na.omit)
giving:
[1] 17 18 21
attr(,"na.action")
[1] 3 4
attr(,"class")
[1] "omit"
2) or this:
pat <- paste(a, collapse = "|")
strapplyc(s, pat, simplify = as.numeric)
giving:
[1] 17 18 21
3) or this non-regexp solution
intersect(scan(text = s, what = 0, sep = "+", quiet = TRUE), a)
giving
[1] 17 18 21
ADDED additional solution.
How about simply:
(17|18|21)
It needs to be a global match, so in Perl it would be like this:
$string =~ m/(17|18|21)/g
Example string:
21+18+19+5+21+18+19+17
Matches:
"21", "18", "21", "18", "17"
Working regex example:
http://regex101.com/r/jL8iF7
You can use gregexpr and regmatches:
vec <- "17+18+19+5+21"
a <- c(17, 18, 21)
pattern <- paste0("\\b(", paste(a, collapse = "|"), ")\\b")
# [1] "\\b(17|18|21)\\b"
regmatches(vec, gregexpr(pattern, vec))[[1]]
# [1] "17" "18" "21"
Note that this matches the exact number, i.e., 17 does not match 177.
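A quick way to check that claim is to compare the pattern with and without the \\b anchors on a string that contains longer numbers. The test string below is made up, and perl = TRUE is added here only to be explicit about the regex flavour.

```r
a <- c(17, 18, 21)
vec <- "177+18+217+21"   # 177 and 217 merely contain 17 and 21

strict <- paste0("\\b(", paste(a, collapse = "|"), ")\\b")
loose <- paste(a, collapse = "|")

strict_hits <- regmatches(vec, gregexpr(strict, vec, perl = TRUE))[[1]]
loose_hits <- regmatches(vec, gregexpr(loose, vec, perl = TRUE))[[1]]
```

With the boundaries, only the standalone 18 and 21 are returned; without them, the 17 inside 177 and the 21 inside 217 are picked up as well.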