removing consecutive duplicates in strings R - regex

I'd like to collapse two strings s1 = "word1 word2 word3" and s2 = "word2 word3 word4" but removing the extra (future) consecutive overlap/duplicate ("word2 word3"). That is, I should obtain s = "word1 word2 word3 word4" rather than s = "word1 word2 word3 word2 word3 word4".
More simply, it should also work for single-word overlaps: s1 = "word1 word2" and s2 = "word2 word3" should give me s = "word1 word2 word3" rather than s = "word1 word2 word2 word3".
I am using wordnumber for illustration purposes but of course it should work for any word...

Use unique on the result, that should remove all the duplicates.
And perhaps also use sort?
EDIT: Sorry, my first answer did miss the point completely. Here's a revised solution based on the stringr-package, that I think should work. The idea is to first split the strings into vectors, then compare the vectors and check if an overlap is present - finally join the vectors based on whether or not an overlap was detected.
s1 = "word1 word2 word3"
s2 = "word2 word3 word4"
library(stringr)
.s1_splitted <- str_split(
string = s1,
pattern = " +")[[1]]
.s2_splitted <- str_split(
string = s2,
pattern = " +")[[1]]
.matches12 <- charmatch(
x = .s1_splitted,
table = .s2_splitted)
If the last number is different from NA, and shorter than the
length of .s1_splitted, then check if the end of the vector
looks like it ought to do.
.last_element <- tail(.matches12, n = 1)
.overlap <- FALSE
if (! is.na(.last_element)) {
if (.last_element <= length(.s1_splitted)) {
.overlap <- identical(
x = 1:.last_element,
y = tail(x = .matches12,
n = .last_element))
}
}
Join the components, based on overlap.
if (.overlap) {
.joined <- c(
head(x = .s1_splitted,
n = - .last_element),
.s2_splitted)
} else
.joined <- c(.s1_splitted,
.s2_splitted)
Convert back to a string
.result <- paste(.joined, collapse = " ")

This was surprisingly difficult, but I believe I have a solution:
sjoin <- function(s1,s2) {
ss1 <- strsplit(s1,'\\s+')[[1L]];
ss2 <- strsplit(s2,'\\s+')[[1L]];
if (length(ss1)==0L) return(s2);
if (length(ss2)==0L) return(s1);
n <- 0L; for (i in seq(min(length(ss1),length(ss2)),1L))
if (all(ss1[seq(to=length(ss1),len=i)]==ss2[seq(1L,len=i)])) {
n <- i;
break;
}; ## end if
paste(collapse=' ',c(ss1,if (n==0L) ss2 else ss2[-1:-n]));
}; ## end sjoin()
sjoin('1 2 3','2 3 4');
## [1] "1 2 3 4"
sjoin('1 2 3 x','2 3 4');
## [1] "1 2 3 x 2 3 4"
sjoin('1 2 3','x 2 3 4');
## [1] "1 2 3 x 2 3 4"
sjoin('','')
## [1] ""
sjoin('a','');
## [1] "a"
sjoin('','a');
## [1] "a"
sjoin('a','a')
## [1] "a"
sjoin('a b c','a b c');
## [1] "a b c"
sjoin('a b c','c');
## [1] "a b c"
sjoin('a b c','c d');
## [1] "a b c d"
sjoin('b','b c d');
## [1] "b c d"
sjoin('a b','b c d');
## [1] "a b c d"
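Since the title asks about regex, here is a rough sketch of a regex-only variant. It is a different technique from the loop above, not a drop-in replacement: it glues the two strings together around a seam character and lets a backreference find the longest whole-word overlap across that seam. The name overlap_join and the seam character "\001" are my own assumptions, and the sketch assumes words are separated by single spaces.

```r
# Hypothetical helper (not from the answers above).
# Assumes single spaces between words and that "\001" never occurs in the input.
overlap_join <- function(s1, s2) {
  seam <- "\001"
  x <- paste0(s1, " ", seam, " ", s2)
  # group 2 is a run of whole words that ends s1 and is repeated at the start of s2;
  # the leftmost match is tried first, which corresponds to the longest overlap
  pattern <- paste0("(^| )(.+) ", seam, " \\2( |$)")
  x <- sub(pattern, "\\1\\2\\3", x, perl = TRUE)
  # no overlap found: the seam is still there, so replace it with a plain space
  x <- sub(paste0(" ", seam, " "), " ", x, fixed = TRUE)
  trimws(x)
}

overlap_join("word1 word2 word3", "word2 word3 word4")
## [1] "word1 word2 word3 word4"
overlap_join("1 2 3 x", "2 3 4")
## [1] "1 2 3 x 2 3 4"
```

If the input can contain runs of whitespace, normalising it first (for example with gsub("\\s+", " ", ...)) would probably be needed before calling this.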

Related

Find patterns in strings and concatenate unique parts in R

Consider two strings s1 and s2. How is it possible to concatenate only those parts of these strings that are unique, and leave the non-unique parts as they are? Let / be the separator between concatenated parts. The desired result looks like this:
s1 <- "very big house"
s2 <- "very small house"
some_function(s1,s2)
"very big/small house" #// desired result.
Usually the leading and trailing characters are the same for both strings and the differences appear only in the middle.
More examples of desired result:
# a
s1 <- "1b"; s2 <- "2b"
"1/2b" # <- new string
# b
s1 <- "a_1_b"; s2 <- "a_2_b"
"a_1/2_b" # <- new string
# c
s1 <- "a"; s2 <- "b"
"a/b" # <- new string
Here's a solution:
pasteMergePrefixAndSuffix <- function(vs1,vs2,sep=' ') {
## cycle string vectors to same length
vsl <- max(length(vs1),length(vs2));
vs1 <- rep(vs1,len=vsl);
vs2 <- rep(vs2,len=vsl);
## precompute character splits
ss1 <- strsplit(vs1,'');
ss2 <- strsplit(vs2,'');
## iterate over each pair of strings
sapply(seq_along(vs1),function(si) {
s1 <- vs1[si];
s2 <- vs2[si];
s1l <- length(ss1[[si]]);
s2l <- length(ss2[[si]]);
slmin <- min(s1l,s2l);
## handle trivial case of exact equality
if (s1==s2) return(s1);
## get prefix and suffix lengths
if (slmin==0L) { ## empty string cannot result in a prefix or suffix
pl <- sl <- 0L;
} else {
eq <- ss1[[si]][seq_len(slmin)]==ss2[[si]][seq_len(slmin)];
pl <- if (all(eq)) slmin else if (eq[1L]==T) which(!eq)[1L]-1L else 0L;
eq <- rev(ss1[[si]])[seq_len(slmin)]==rev(ss2[[si]])[seq_len(slmin)];
sl <- if (all(eq)) slmin else if (eq[1L]==T) which(!eq)[1L]-1L else 0L;
}; ## end if
## paste together prefix, sep-pasted middles, and suffix
m1 <- substr(s1,pl+1L,s1l-sl);
m2 <- substr(s2,pl+1L,s2l-sl);
paste0(substr(s1,1L,pl),paste(sep=sep,m1,m2),substr(s1,s1l-sl+1L,s1l));
});
}; ## end pasteMergePrefixAndSuffix()
Demo:
pasteMergePrefixAndSuffix('a','b');
## [1] "a b"
pasteMergePrefixAndSuffix('a','b','/');
## [1] "a/b"
s1 <- 'very big house'; s2 <- 'very small house'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "very big/small house"
s1 <- '1b'; s2 <- '2b'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "1/2b"
s1 <- 'a_1_b'; s2 <- 'a_2_b'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "a_1/2_b"
s1 <- 'ab'; s2 <- 'ab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "ab"
s1 <- 'xab'; s2 <- 'ab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "x/ab"
s1 <- 'ab'; s2 <- 'abx'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "ab/x"
s1 <- 'abx'; s2 <- 'ab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "abx/"
s1 <- 'ab'; s2 <- 'xab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "/xab"
s1 <- ''; s2 <- 'x'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "/x"
s1 <- 'x'; s2 <- ''; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "x/"

Combining lines in character vector in R

I have a character vector (content) of about 50,000 lines in R. However, some of the lines when read in from a text file are on separate lines and should not be. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for(i in 1:length(content)){
if(content[i+1] != ""){
content[i+1] <- c(content[i], content[i+1])
}
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are all many thousands lines each. Any ideas on how to translate these solutions into a function that can be applied to the content of each of the documents?
You don't need a loop to do that:
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
c("\n", sample(letters, 20, replace = TRUE), "\n"),
collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
###identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars=="")
#split vector (and filter out "" by using the select vector)
select <- chars!=""
splitted <- split(chars[select], ids[select])
#paste the groups together
res <- sapply(splitted,paste, collapse=" ")
#remove names(if necessary, probably not)
res <- unname(res) #thanks #Roland
> res
[1] "hello, world" "how are you"
Here's a different approach using data.table which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace the "" with something you can later split on, and then collapse the characters together, and then use strsplit(). Here I have used the newline character since if you were to just paste it you could get the different lines on the output, e.g. cat(txt3) will output each phrase on a separate line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
Another way to add to the mix:
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
# 1 2 3
#"hello, world" "how are you" "more text"
Group by non-empty strings. And paste the elements together by group.
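Regarding the EDIT in the question (applying this to every document in a corpus), one possible sketch is to wrap the cumsum/split idea in a function and lapply it over the documents. The function name combine_lines and the list docs are made up for illustration, and I am assuming each document's content is available as a plain character vector with no NA lines.

```r
# Hypothetical wrapper around the cumsum/split approach shown above.
# Assumes `lines` is a character vector without NA values.
combine_lines <- function(lines) {
  keep  <- lines != ""                  # drop the empty separator lines
  group <- cumsum(lines == "")[keep]    # group id increases at every ""
  unname(vapply(split(lines[keep], group), paste, character(1), collapse = " "))
}

# Illustrative data: one character vector per document
docs <- list(
  doc1 = c("hello,", "world", "", "how", "are", "you", ""),
  doc2 = c("more", "text", "")
)
lapply(docs, combine_lines)
## $doc1
## [1] "hello, world" "how are you"
##
## $doc2
## [1] "more text"
```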

Insert a character at multiple positions in a string at once

Let us say I have a string
"ABCDEFGHI56dfsdfd"
What I want to do is insert a space character at multiple positions at once.
For example, I want to insert a space character at two randomly chosen positions, say 4 and 8.
So the output should be
"ABCD EFGH I56dfsdfd"
What is the most effective way of doing this? Given the string can have any type of characters in it (not just the alphabets).
Here's a solution based on regular expressions:
vec <- "ABCDEFGHI56dfsdfd"
# sample two random positions
pos <- sample(nchar(vec), 2)
# [1] 6 4
# generate regex pattern
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
# [1] "(?=.{11}$)|(?=.{13}$)"
# insert spaces at (after) positions
gsub(pat, " ", vec, perl = TRUE)
# [1] "ABCD EF GHI56dfsdfd"
This approach is based on positive lookaheads, e.g., (?=.{11}$). In this example, a space is inserted at 11 characters before the end of the string ($).
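To make the arithmetic concrete, here is the same pattern built for the fixed positions 4 and 8 from the question instead of random ones. The string has 17 characters, so "after position 4" means "13 characters before the end" and "after position 8" means "9 characters before the end".

```r
vec <- "ABCDEFGHI56dfsdfd"
pos <- c(4, 8)  # insert a space after the 4th and the 8th character

# one lookahead per position, counting characters from the end of the string
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
pat
# [1] "(?=.{13}$)|(?=.{9}$)"

res <- gsub(pat, " ", vec, perl = TRUE)
res
# [1] "ABCD EFGH I56dfsdfd"
```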
A bit more brute-force-y than Sven's:
randomSpaces <- function(txt) {
pos <- sort(sample(nchar(txt), 2))
paste(substr(txt, 1, pos[1]), " ",
substr(txt, pos[1]+1, pos[2]), " ",
substr(txt, pos[2]+1, nchar(txt)), collapse="", sep="")
}
for (i in 1:10) print(randomSpaces("ABCDEFGHI56dfsdfd"))
## [1] "ABCDEFG HI56 dfsdfd"
## [1] "ABC DEFGHI5 6dfsdfd"
## [1] "AB CDEFGHI56dfsd fd"
## [1] "ABCDEFGHI 5 6dfsdfd"
## [1] "ABCDEF GHI56dfsdf d"
## [1] "ABC DEFGHI56dfsdf d"
## [1] "ABCD EFGHI56dfsd fd"
## [1] "ABCDEFGHI56d fsdfd "
## [1] "AB CDEFGH I56dfsdfd"
## [1] "A BCDE FGHI56dfsdfd"
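The same substr idea can be generalised to any number of positions by cutting the string into pieces and pasting them back together. This is only a sketch; insert_after is a name I made up. One possible advantage over the lookahead pattern is that it does not rely on "." matching every character (by default "." does not match a newline in perl-style regex).

```r
# Hypothetical helper: insert `ins` after each position in `pos`.
insert_after <- function(x, pos, ins = " ") {
  pos    <- sort(unique(pos))
  starts <- c(1, pos + 1)       # where each piece begins
  ends   <- c(pos, nchar(x))    # where each piece ends
  paste(substring(x, starts, ends), collapse = ins)
}

insert_after("ABCDEFGHI56dfsdfd", c(4, 8))
## [1] "ABCD EFGH I56dfsdfd"
```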
Based on the accepted answer, here's a function that simplifies this approach:
##insert pattern in string at position
substrins <- function(ins, x, ..., pos=NULL, offset=0){
stopifnot(is.numeric(pos),
is.numeric(offset),
!is.null(pos))
offset <- offset[1]
pat <- paste0("(?=.{", nchar(x) - pos - (offset-1), "}$)", collapse = "|")
gsub(pattern = pat, replacement = ins, x = x, ..., perl = TRUE)
}
# insert space at position 10
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10)
##[1] "ABCDEFGHI 56dfsdfd"
# insert pattern before position 10 (i.e. at position 9)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=-1)
##[1] "ABCDEFGH I56dfsdfd"
# insert pattern after position 10 (i.e. at position 11)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=1)
##[1] "ABCDEFGHI5 6dfsdfd"
Now to do what the OP wanted:
# insert space at position 4 and 8
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8))
##[1] "ABC DEFG HI56dfsdfd"
# insert space after position 4 and 8 (as per OP's desired output)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8), offset=1)
##[1] "ABCD EFGH I56dfsdfd"
To replicate the other, more brute-force-y answer one would do:
set.seed(123)
x <- "ABCDEFGHI56dfsdfd"
for (i in 1:10) print(substrins(" ", x, pos = sample(nchar(x), 2)))
##[1] "ABCD EFGHI56d fsdfd"
##[1] "ABCDEF GHI56dfs dfd"
##[1] " ABCDEFGHI56dfsd fd"
##[1] "ABCDEFGH I56dfs dfd"
##[1] "ABCDEFG HI 56dfsdfd"
##[1] "ABCDEFG HI56dfsdf d"
##[1] "ABCDEFGHI 56 dfsdfd"
##[1] "A BCDEFGHI56dfs dfd"
##[1] " ABCD EFGHI56dfsdfd"
##[1] "ABCDE FGHI56dfsd fd"

R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches)

I have a question involving conditional replace.
I essentially want to find every string of numbers and, for every consecutive digit after 4, replace it with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient solution):
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)
Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookbehind: a position that is not preceded by a digit. The string (\\d{4}) means 4 consecutive digits. Finally, \\d* represents any number of digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
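A small self-contained check of that pattern, on a made-up vector x rather than the question's data frame, shows the three cases: runs longer than 4 digits are truncated, shorter runs are left alone, and NA stays NA.

```r
# Illustrative input (not the question's data)
x <- c("123456 098765 1111", "12 098 111", NA, "1234567890")

trimmed <- gsub("(?<!\\d)(\\d{4})\\d*", "\\1", x, perl = TRUE)
trimmed
# [1] "1234 0987 1111" "12 098 111"     NA               "1234"
```

Note that this variant shortens the strings; it does not pad with spaces.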
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
if (!is.na(m[1L]) && m[1L] != -1L) {
for (i in seq_along(m)) {
substr(d, m[i], m[i] + attr(m, "match.length")[i] - 1L) <- paste(rep(" ", attr(m, "match.length")[i]), collapse = "")
}
}
return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=T), zz))
The difference here is that it turns the NA value into "". We could add other checks to prevent that, or just turn all zero-length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.

Look for specific character in string and place it at different positions after a defined separator in the same string

let's define the following string s:
s <- "$ A; B; C;"
I need to translate s into:
"$ A; $B; $C;"
The semicolon is the separator. However, $ is only one of 3 special characters which can appear in the string. The data frame m holds all 3 special characters:
m <- data.frame(sp = c("$", "%", "&"))
I first used strsplit to split the string using the semicolon as the separator
> strsplit(s, ";")
[[1]]
[1] "$ A" " B" " C"
I think the next step would be to use grep or match to check if the first string contains any of the 3 special characters defined in data frame m. If so, maybe use gsub to insert the matched special character into the remaining substrings. Then simply use paste with collapse = "" to merge the substrings together again. Does that make sense?
Cheers
What about something like this:
getmeout = gsub("[$|%|& ]", "", unlist(strsplit(s, ";")))
whatspecial = unique(gsub("[^$|%|&]", "", s))
whatspecial
# [1] "$"
getmeout
# [1] "A" "B" "C"
paste0(whatspecial, getmeout, ";", collapse="")
# [1] "$A;$B;$C;"
Here is one method:
library(stringr)
separator <- '; '
# extract the first part
first.part <- str_split(s, separator)[[1]][1]
first.part
# [1] "$ A"
# try to identify your special character
special <- m$sp[str_detect(first.part, fixed(as.character(m$sp)))]
special
# [1] $
# Levels: $ & %
# make sure you only matched one of them
stopifnot(length(special) == 1)
# search and replace
gsub(separator, paste(separator, special, sep=""), s)
# [1] "$ A; $B; $C;"
Let me know if I missed some of your assumptions.
Back-referencing turns it into a one-liner:
s <- c( "$ A; B; C;", "& A; B; C;", "% A; B; C;" )
ms = c("$", "%", "&")
s <- gsub( paste0("([", paste(ms,collapse="") ,"]) ([A-Z]); ([A-Z]); ([A-Z]);") , "\\1 \\2; \\1 \\3; \\1 \\4" , s)
> s
[1] "$ A; $ B; $ C" "& A; & B; & C" "% A; % B; % C"
You can then make the regular expression appropriately generic (match more than one space, more than one alphanumeric character, etc.) if you need to.
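As one possible way to make this generic without hard-coding three items, the sketch below reads the leading special character and re-inserts it after every "; " that is followed by another item. The function name add_special and its specials argument are my own; it assumes the special character is the first non-blank character and that items are separated by "; ".

```r
# Hypothetical helper: copy the leading special character in front of
# every later item. Strings without a leading special are returned unchanged.
add_special <- function(s, specials = c("$", "%", "&")) {
  first <- substr(trimws(s), 1, 1)
  if (!first %in% specials) return(s)
  # "; " followed by a non-space character marks the start of a later item
  gsub("; (?=\\S)", paste0("; ", first), s, perl = TRUE)
}

add_special("$ A; B; C;")
# [1] "$ A; $B; $C;"

# apply to several strings
vapply(c("$ A; B; C;", "& A; B; C;", "% A; B; C;"), add_special,
       character(1), USE.NAMES = FALSE)
# [1] "$ A; $B; $C;" "& A; &B; &C;" "% A; %B; %C;"
```

Comparing with %in% rather than building a regex from the special characters sidesteps the fact that "$" is itself a regex metacharacter.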