Insert a character at multiple positions in a string at once - regex

Let us say I have a string
"ABCDEFGHI56dfsdfd"
What I want to do is insert a space character at multiple positions at once.
For eg. I want to insert space character at randomly chosen two positions say 4 and 8.
So the output should be
"ABCD EFGH I56dfsdfd"
What is the most effective way of doing this? Given the string can have any type of characters in it (not just the alphabets).

Here's a solution based on regular expressions:
vec <- "ABCDEFGHI56dfsdfd"
# sample two random positions
pos <- sample(nchar(vec), 2)
# [1] 6 4
# generate regex pattern
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
# [1] "(?=.{11}$)|(?=.{13}$)"
# insert spaces at (after) positions
gsub(pat, " ", vec, perl = TRUE)
# [1] "ABCD EF GHI56dfsdfd"
This approach is based on positive lookaheads, e.g., (?=.{11}$). In this example, a space is inserted at 11 characters before the end of the string ($).

A bit more brute-force-y than Sven's:
randomSpaces <- function(txt) {
pos <- sort(sample(nchar(txt), 2))
paste(substr(txt, 1, pos[1]), " ",
substr(txt, pos[1]+1, pos[2]), " ",
substr(txt, pos[2]+1, nchar(txt)), collapse="", sep="")
}
for (i in 1:10) print(randomSpaces("ABCDEFGHI56dfsdfd"))
## [1] "ABCDEFG HI56 dfsdfd"
## [1] "ABC DEFGHI5 6dfsdfd"
## [1] "AB CDEFGHI56dfsd fd"
## [1] "ABCDEFGHI 5 6dfsdfd"
## [1] "ABCDEF GHI56dfsdf d"
## [1] "ABC DEFGHI56dfsdf d"
## [1] "ABCD EFGHI56dfsd fd"
## [1] "ABCDEFGHI56d fsdfd "
## [1] "AB CDEFGH I56dfsdfd"
## [1] "A BCDE FGHI56dfsdfd"

Based on the accepted answer, here's a function that simplifies this approach:
##insert pattern in string at position
substrins <- function(ins, x, ..., pos=NULL, offset=0){
stopifnot(is.numeric(pos),
is.numeric(offset),
!is.null(pos))
offset <- offset[1]
pat <- paste0("(?=.{", nchar(x) - pos - (offset-1), "}$)", collapse = "|")
gsub(pattern = pat, replacement = ins, x = x, ..., perl = TRUE)
}
# insert space at position 10
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10)
##[1] "ABCDEFGHI 56dfsdfd"
# insert pattern before position 10 (i.e. at position 9)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=-1)
##[1] "ABCDEFGH I56dfsdfd"
# insert pattern after position 10 (i.e. at position 11)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=1)
##[1] "ABCDEFGHI5 6dfsdfd"
Now to do what the OP wanted:
# insert space at position 4 and 8
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8))
##[1] "ABC DEFG HI56dfsdfd"
# insert space after position 4 and 8 (as per OP's desired output)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8), offset=1)
##[1] "ABCD EFGH I56dfsdfd"
To replicate the other, more brute-force-y answer one would do:
set.seed(123)
x <- "ABCDEFGHI56dfsdfd"
for (i in 1:10) print(substrins(" ", x, pos = sample(nchar(x), 2)))
##[1] "ABCD EFGHI56d fsdfd"
##[1] "ABCDEF GHI56dfs dfd"
##[1] " ABCDEFGHI56dfsd fd"
##[1] "ABCDEFGH I56dfs dfd"
##[1] "ABCDEFG HI 56dfsdfd"
##[1] "ABCDEFG HI56dfsdf d"
##[1] "ABCDEFGHI 56 dfsdfd"
##[1] "A BCDEFGHI56dfs dfd"
##[1] " ABCD EFGHI56dfsdfd"
##[1] "ABCDE FGHI56dfsd fd"

Related

Regex for extracting all words between word and character

i know basic of regex performing with R. But here i have a file like :
**[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981
[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767**
I wanted to extract timestamp alongwith all the SERVICE_ID in that line.
So, my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981
[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134
The code which I tried was only extracting one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
We replace the 2 or more , with " " to get 'str2', then using regex lookarounds, we match one or more space (\\s+) that follows the ]) followed by characters (.*) till the end of the string, replace it with "" so that we can extract the [2016-04..,03] part. From the 'str2', we extract the substrings "SERVICE_ID=" followed by numbers (\\d+) into a list, paste them together and finally paste it with the 'str3'.
library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str2 <- gsub(",{2,}", " ", str1)
str4 <- sub("\\].*","",str2,perl = TRUE)
str5 <- sub("\\[","",str4,perl = T)
service_ids <- sapply(str_extract_all(str2,"SERVICE_ID=\\d+"), function(x){paste(x,collapse = " ")})
net <- cbind(str5,service_ids)
Output:

removing consecutive duplicates in strings R

I'd like to collapse two strings s1 = "word1 word2 word3" and s2 = "word2 word3 word4" but removing the extra (future) consecutive overlap/duplicate ("word2 word3"). That is, I should obtain s = "word1 word2 word3 word4" rather than s = "word1 word2 word3 word2 word3 word4".
More simply, it should also work for single-word overlaps: s1 = "word1 word2" and s2 = "word2 word3" should give me s = word1 word2 word3" rather than s = "word1 word2 word2 word3".
I am using wordnumber for illustration purposes but of course it should work for any word...
Use unique on the result, that should remove all the duplicates.
And perhaps also use sort?
EDIT: Sorry, my first answer did miss the point completely. Here's a revised solution based on the stringr-package, that I think should work. The idea is to first split the strings into vectors, then compare the vectors and check if an overlap is present - finally join the vectors based on whether or not an overlap was detected.
s1 = "word1 word2 word3"
s2 = "word2 word3 word4"
library(stringr)
.s1_splitted <- str_split(
string = s1,
pattern = "\ +")[[1]]
.s2_splitted <- str_split(
string = s2,
pattern = "\ +")[[1]]
.matches12 <- charmatch(
x = .s1_splitted,
table = .s2_splitted)
If the last number is different from NA, and shorter than the
length of .s1_splitted, then check if the end of the vector
looks like it ought to do.
.last_element <- tail(.matches12, n = 1)
if (! is.na(.last_element)) {
if (.last_element <= length(.s1_splitted)) {
.overlap <- identical(
x = 1:.last_element,
y = tail(x = .matches12,
n = .last_element))
}
} else
.overlap <- FALSE
Join the components, based on overlap.
if (.overlap) {
.joined <- c(
head(x = .s1_splitted,
n = - .last_element),
.s2_splitted)
} else
.joined <- c(.s1_splitted,
.s2_splitted)
Convert back to a string
.result <- paste(.joined, collapse = " ")
This was surprisingly difficult, but I believe I have a solution:
sjoin <- function(s1,s2) {
ss1 <- strsplit(s1,'\\s+')[[1L]];
ss2 <- strsplit(s2,'\\s+')[[1L]];
if (length(ss1)==0L) return(s2);
if (length(ss2)==0L) return(s1);
n <- 0L; for (i in seq(min(length(ss1),length(ss2)),1L))
if (all(ss1[seq(to=length(ss1),len=i)]==ss2[seq(1L,len=i)])) {
n <- i;
break;
}; ## end if
paste(collapse=' ',c(ss1,if (n==0L) ss2 else ss2[-1:-n]));
}; ## end sjoin()
sjoin('1 2 3','2 3 4');
## [1] "1 2 3 4"
sjoin('1 2 3 x','2 3 4');
## [1] "1 2 3 x 2 3 4"
sjoin('1 2 3','x 2 3 4');
## [1] "1 2 3 x 2 3 4"
sjoin('','')
## [1] ""
sjoin('a','');
## [1] "a"
sjoin('','a');
## [1] "a"
sjoin('a','a')
## [1] "a"
sjoin('a b c','a b c');
## [1] "a b c"
sjoin('a b c','c');
## [1] "a b c"
sjoin('a b c','c d');
## [1] "a b c d"
sjoin('b','b c d');
## [1] "b c d"
sjoin('a b','b c d');
## [1] "a b c d"

R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches)

I have a question involving conditional replace.
I essentially want to find every string of numbers and, for every consecutive digit after 4, replace it with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient solution):
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers afther the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)
Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookahead: A position that is not preceded by a digit. The string (\\d{4}) means 4 consecutive digits. Finally, \\d* represents any number of digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
if (!is.na(m) && m != -1L) {
for (i in seq_along(m)) {
substr(d, m[i], m[i] + attr(m, "match.length") - 1L) <- paste(rep(" ", attr(m, "match.length")[i]), collapse = "")
}
}
return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers afther the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=T), zz))
The different here is that it turns the NA value into "". We could add in other checks to prevent that or just turn all zero length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.

Better Strategy for pulling elements from string

I have a string that looks like this:
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"
What I need is this:
list(Ticker = "RBO", Assets = 36.26, Shares = 1,800,000)
I've tried splitting, regex, etc. But I feel my string manipulation skills are not up to snuff.
Here's my "best" attempt so far.
x <- unlist(strsplit(unlist(strsplit(x, "\r\n\t") ),"\r\n"))
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
x <- trim(x)
gsub("[A-Z]+$","\\2",x[2]) # bad attempt to get RBO
Update/better answer:
A look at cat(x) and readLines(x) helps a lot here
> cat(x)
#
# Ticker Symbol: RBO
# Exchange: TSX
# Assets ($mm) 36.26 #
# Units Outstanding: 1,800,000
# Mgmt. Fee** 0.25
# 2013 MER* n/a
# CUSIP: 74932K103
> readLines(textConnection(x))
# [1] "" " Ticker Symbol: RBO"
# [3] " \t Exchange: TSX " "\t Assets ($mm) 36.26 "
# [5] "\t Units Outstanding: 1,800,000 " "\t Mgmt. Fee** 0.25 "
# [7] " 2013 MER* n/a " "\t CUSIP: 74932K103"
Now we know a few things. One, we don't need the first line, and we do want the second line. That makes things easier because now the first line matches our desired first line. Next, it would be easier your list names matched the names in the string. I chose these.
> nm <- c("Symbol", "Assets", "Units")
Now all we have to do use grep with sapply, and we'll get back a named vector of matches. Setting value = TRUE in grep will return us the strings.
> (y <- sapply(nm, grep, x = readLines(textConnection(x))[-1], value = TRUE))
# b Symbol Assets
# " Ticker Symbol: RBO" "\t Assets ($mm) 36.26 "
# Units
# "\t Units Outstanding: 1,800,000 "
Then we strsplit that on "[: ]", take the last element in each split, and we're done.
> lapply(strsplit(y, "[: ]"), tail, 1)
$Symbol
[1] "RBO"
$Assets
[1] "36.26"
$Units
[1] "1,800,000
You could achieve the same result with
> g <- gsub("[[:cntrl:]]", "", capture.output(cat(x))[-1])
> m <- mapply(grep, nm, MoreArgs = list(x = g, value = TRUE))
> lapply(strsplit(m, "[: ]"), tail, 1)
Hope that helps.
Original Answer:
It looks like if you're pulling these from a large table, that they'd all be in the same element "slot" each time, so maybe this might be a little easier.
> s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]]
Explained:
- [: ] match a ":" character followed by a space character
- | or
- [[:cntrl:]] any control character, which in this case is any of \r, \t, and \n. This is probably better explained here
Then, nzchar looks in the above result for non-zero length character strings, and returns TRUE if matched, FALSE otherwise. So we can look at the result of the first line, determine where the matches are, and subset based on that.
> as.list(s[nzchar(s)][c(3, 8, 11)])
[[1]]
[1] "RBO"
[[2]]
[1] "36.26"
[[3]]
[1] "1,800,000"
You could put is into one line by assigning s as the inner call. Since functions and calls are evaluated from the inside out, s is assigned before R reaches the outside s subset. This is a bit less readable though.
s[nzchar(s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]])][c(3,8,11)]
So this would go s <- strsplit(...) -> [[ -> nzchar -> s[.. >- [c(3,8,11)]
Perhaps:
sub( "\\\r\\\n.+$", "", sub( "^.+Ticker Symbol: ", "", x) )
[1] "RBO"
I suppose you might do it all in one pattern with parentheses. and backreference.
> sub( "^.+Ticker Symbol: ([[:alpha:]]{1,})\\\r\\\n.+$", "\\1", x)
[1] "RBO"
If you just want to extract different parts of the string, you can use regexpr to find phrases and extract the contents after the phrase. For example
extr<-list(
"Ticker" = "Ticker Symbol: ",
"Assets" = "Assets ($mm) ",
"Shares" = "Units Outstanding: "
)
lines<-strsplit(x,"\r\n")[[1]]
Map(function(p) {
m <- regexpr(p, lines, fixed=TRUE)
if(length( w<- which(m!=-1))==1) {
gsub("^\\sw+|\\s$", "",
substr(lines[w], m[w] + attr(m,"match.length")[w], nchar(lines[w])))
} else {
NA
}
}, extr)
Which returns the named list as desired
$Ticker
[1] "RBO"
$Assets
[1] "36.26"
$Shares
[1] "1,800,000"
Here extr is a list where the name of the element is the name that will be used in the final list, and the element value is the exact string that will be matched in the text. I added in a gsub as well to trim off any whitespace.
The stringr package is good for scraping data from strings. Here are the steps I use every time. You can always make the rules as specific or robust as you see fit.
require(stringr)
## take out annoying characters
x <- gsub("\r\n", "", x)
x <- gsub("\t", "", x)
x <- gsub("\\(\\$mm\\) ", "", x)
## define character index positions of interest
tickerEnd <- str_locate(x, "Ticker Symbol: ")[[1, "end"]]
assetsEnd <- str_locate(x, "Assets ")[[1, "end"]]
unitsStart <- str_locate(x, "Units Outstanding: ")[[1, "start"]]
unitsEnd <- str_locate(x, "Units Outstanding: ")[[1, "end"]]
mgmtStart <- str_locate(x, "Mgmt")[[1, "start"]]
## get substrings based on indices
tickerTxt <- substr(x, tickerEnd + 1, tickerEnd + 4) # allows 4-character symbols
assetsTxt <- substr(x, assetsEnd + 1, unitsStart - 1)
sharesTxt <- substr(x, unitsEnd + 1, mgmtStart - 1)
## cut out extraneous characters
ticker <- gsub(" ", "", tickerTxt)
assets <- gsub(" ", "", assetsTxt)
shares <- gsub(" |,", "", sharesTxt)
## add data to data frame
df <- data.frame(ticker, as.numeric(assets), as.numeric(shares), stringsAsFactors = FALSE)

split string with regex

I'm looking to split a string of a generic form, where the square brackets denote the "sections" of the string. Ex:
x <- "[a] + [bc] + 1"
And return a character vector that looks like:
"[a]" " + " "[bc]" " + 1"
EDIT: Ended up using this:
x <- "[a] + [bc] + 1"
x <- gsub("\\[",",[",x)
x <- gsub("\\]","],",x)
strsplit(x,",")
I've seen TylerRinker's code and suspect it may be more clear than this but this may serve as way to learn a different set of functions. (I liked his better before I noticed that it split on spaces.) I tried adapting this to work with strsplit but that function always removes the separators.
Maybe this could be adapted to make a newstrsplit that splits at the separators but leaves them in? Probably need to not split at first or last position and distinguish between opening and closing separators.
scan(text= # use scan to separate after insertion of commas
gsub("\\]", "],", # put commas in after "]"'s
gsub(".\\[", ",[", x)) , # add commas before "[" unless at first position
what="", sep=",") # tell scan this character argument and separators are ","
#Read 4 items
#[1] "[a]" " +" "[bc]" " + 1"
This is one lazy approach:
FUN <- function(x) {
all <- unlist(strsplit(x, "\\s+"))
last <- paste(c(" ", tail(all, 2)), collapse="")
c(head(all, -2), last)
}
x <- "[a] + [bc] + 1"
FUN(x)
## > FUN(x)
## [1] "[a]" "+" "[bc]" " +1"
You can compute the split points manually and use substring :
split.pos <- gregexpr('\\[.*?]',x)[[1]]
split.length <- attr(split.pos, "match.length")
split.start <- sort(c(split.pos, split.pos+split.length))
split.end <- c(split.start[-1]-1, nchar(x))
substring(x,split.start,split.end)
# [1] "[a]" " + " "[bc]" " + 1"
And here's a version that splits on the brackets AND keeps them in the result, using positive lookahead and lookbehind:
splitme <- function(x) {
x <- unlist(strsplit(x, "(?=\\[)", perl=TRUE))
x <- unlist(strsplit(x, "(?<=\\])", perl=TRUE))
for (i in which(x=="[")) {
x[i+1] <- paste(x[i], x[i+1], sep="")
}
x[-which(x=="[")]
}
splitme(x)
#[1] "[a]" " + " "[bc]" " + 1"