How do Capture Groups Beyond \9 get Referenced in R? - regex

Is it possible in R to capture groups >9 in a regular expression?
sub("(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)", "\\1 & \\9",
"abc-02-03-04-05-06-07-08-09")
gives
[1] "abc & 09"
which is the expected result, but
sub("(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)", "\\1 & \\10",
"abc-02-03-04-05-06-07-08-09-10")
[1] "abc & abc0"
fails, as the expected result would have been
[1] "abc & 10"
I need this for a function like the following, which works fine for up to 9 formats but no more:
x <- as.Date(c("2005-09-02", "2012-04-08"))
fmt <- "dddd, d.m.yy"
fmt <- gsub(pattern = "dddd", replacement = "\\\\1", x = fmt)
fmt <- gsub(pattern = "ddd", replacement = "\\\\2", x = fmt)
fmt <- gsub(pattern = "dd", replacement = "\\\\3", x = fmt)
fmt <- gsub(pattern = "d", replacement = "\\\\4", x = fmt)
fmt <- gsub(pattern = "mmmm", replacement = "\\\\5", x = fmt)
fmt <- gsub(pattern = "mmm", replacement = "\\\\6", x = fmt)
fmt <- gsub(pattern = "mm", replacement = "\\\\7", x = fmt)
fmt <- gsub(pattern = "m", replacement = "\\\\8", x = fmt)
fmt <- gsub(pattern = "yyyy", replacement = "\\\\9", x = fmt)
fmt <- gsub(pattern = "yy", replacement = "\\\\10", x = fmt)
fmt <- gsub(pattern = "y", replacement = "\\\\11", x = fmt)
fmt
sub("(.+)-(.+)-(.+)-0?(.+)-(.+)-(.+)-(.+)-0?(.+)-(.+)-(.+)-0?(.+)", fmt,
format(x, "%A-%a-%d-%d-%B-%b-%m-%m-%Y-%y-%y"))

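It's important to note that the limit is on backreferences, not on captures: only \\1 to \\9 are recognized in the replacement, so "\\10" is read as \\1 followed by a literal 0 (hence "abc0" above), while the number of captures is unlimited. By using str_match from stringr (or, more clunkily, regmatches from base R), you can always restructure your code to avoid having to use backreferences.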
library(stringr)
(matches <- str_match(
"abc-02-03-04-05-06-07-08-09-10",
"(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)")
)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] "abc-02-03-04-05-06-07-08-09-10" "abc" "02" "03" "04" "05" "06" "07" "08" "09" "10"
paste(matches[, 2], matches[, 11], sep = " & ")
## [1] "abc & 10"
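For comparison, the base R route mentioned above might look like the following minimal sketch. It assumes regexec() to locate the groups and regmatches() to extract them; the variable names are only illustrative.

```r
# Base R sketch: extract all capture groups, then index them directly
x <- "abc-02-03-04-05-06-07-08-09-10"
pattern <- "(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)-(.+)"

# regexec() records the position of every group, not just the first nine;
# regmatches() then returns the full match followed by the ten groups
groups <- regmatches(x, regexec(pattern, x))[[1]]

# Element 1 is the whole match, so group k sits at index k + 1
result <- paste(groups[2], groups[11], sep = " & ")
result
```

Because the groups are ordinary vector elements here, the tenth one is reached by indexing rather than by a backreference.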

As Dason indicates, you're better off splitting the string and taking the desired elements.
elements <- c(1,10)
paste(strsplit("abc-02-03-04-05-06-07-08-09-10", '-')[[1]][elements], collapse=' & ')
## [1] "abc & 10"
This can be vectorized with sapply, if needed:
sapply(strsplit("abc-02-03-04-05-06-07-08-09-10", '-'), function(x) paste(x[elements], collapse=' & '))
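Applied to the date-formatting function in the question, one way to avoid backreferences altogether is to translate the format tokens straight into strftime codes. This is only a sketch under stated assumptions: translate_format is a hypothetical helper, and it covers only the tokens that have a direct strftime equivalent (the unpadded d, m and y tokens are left untouched).

```r
# Hypothetical helper: map Excel-style tokens to strftime codes in one pass
translate_format <- function(fmt) {
  codes <- c(dddd = "%A", ddd = "%a", dd = "%d",
             mmmm = "%B", mmm = "%b", mm = "%m",
             yyyy = "%Y", yy = "%y")
  # Find every run of d, m or y in the format string
  m <- gregexpr("d+|m+|y+", fmt)
  toks <- regmatches(fmt, m)[[1]]
  # Replace only the runs that have a known strftime equivalent
  known <- toks %in% names(codes)
  toks[known] <- unname(codes[toks[known]])
  regmatches(fmt, m) <- list(toks)
  fmt
}

x <- as.Date(c("2005-09-02", "2012-04-08"))
format(x, translate_format("dd.mm.yyyy"))
```

Since each token is replaced in a single pass, the d in "%d" is never rescanned, which is the problem the numbered placeholders were working around.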

Related

Find patterns in strings and concatenate unique parts in R

Consider two strings, s1 and s2. How can I concatenate only those parts of the strings that differ, and leave the parts they share as they are? Let / be the separator between the concatenated parts. The desired result looks like this:
s1 <- "very big house"
s2 <- "very small house"
some_function(s1,s2)
"very big/small house" #// desired result.
Usually the leading and trailing characters are the same for both strings and the differences appear only in the middle.
More examples of desired result:
# a
s1 <- "1b"; s2 <- "2b"
"1/2b" # <- new string
# b
s1 <- "a_1_b"; s2 <- "a_2_b"
"a_1/2_b" # <- new string
# c
s1 <- "a"; s2 <- "b"
"a/b" # <- new string
Here's a solution:
pasteMergePrefixAndSuffix <- function(vs1,vs2,sep=' ') {
    ## cycle string vectors to same length
    vsl <- max(length(vs1),length(vs2));
    vs1 <- rep(vs1,len=vsl);
    vs2 <- rep(vs2,len=vsl);
    ## precompute character splits
    ss1 <- strsplit(vs1,'');
    ss2 <- strsplit(vs2,'');
    ## iterate over each pair of strings
    sapply(seq_along(vs1),function(si) {
        s1 <- vs1[si];
        s2 <- vs2[si];
        s1l <- length(ss1[[si]]);
        s2l <- length(ss2[[si]]);
        slmin <- min(s1l,s2l);
        ## handle trivial case of exact equality
        if (s1==s2) return(s1);
        ## get prefix and suffix lengths
        if (slmin==0L) { ## empty string cannot result in a prefix or suffix
            pl <- sl <- 0L;
        } else {
            eq <- ss1[[si]][seq_len(slmin)]==ss2[[si]][seq_len(slmin)];
            pl <- if (all(eq)) slmin else if (eq[1L]==T) which(!eq)[1L]-1L else 0L;
            eq <- rev(ss1[[si]])[seq_len(slmin)]==rev(ss2[[si]])[seq_len(slmin)];
            sl <- if (all(eq)) slmin else if (eq[1L]==T) which(!eq)[1L]-1L else 0L;
        }; ## end if
        ## paste together prefix, sep-pasted middles, and suffix
        m1 <- substr(s1,pl+1L,s1l-sl);
        m2 <- substr(s2,pl+1L,s2l-sl);
        paste0(substr(s1,1L,pl),paste(sep=sep,m1,m2),substr(s1,s1l-sl+1L,s1l));
    });
}; ## end pasteMergePrefixAndSuffix()
Demo:
pasteMergePrefixAndSuffix('a','b');
## [1] "a b"
pasteMergePrefixAndSuffix('a','b','/');
## [1] "a/b"
s1 <- 'very big house'; s2 <- 'very small house'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "very big/small house"
s1 <- '1b'; s2 <- '2b'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "1/2b"
s1 <- 'a_1_b'; s2 <- 'a_2_b'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "a_1/2_b"
s1 <- 'ab'; s2 <- 'ab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "ab"
s1 <- 'xab'; s2 <- 'ab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "x/ab"
s1 <- 'ab'; s2 <- 'abx'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "ab/x"
s1 <- 'abx'; s2 <- 'ab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "abx/"
s1 <- 'ab'; s2 <- 'xab'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "/xab"
s1 <- ''; s2 <- 'x'; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "/x"
s1 <- 'x'; s2 <- ''; pasteMergePrefixAndSuffix(s1,s2,'/');
## [1] "x/"
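The core step in the function above is finding how many leading (or, after reversing, trailing) characters two strings share. Isolated from the rest, that step might be sketched as follows; common_prefix_len is a hypothetical name used only for illustration and is not part of the answer's code.

```r
# Hypothetical helper: length of the shared prefix of two single strings
common_prefix_len <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb))
  if (n == 0L) return(0L)
  # Positions (within the shorter length) where the characters differ
  neq <- which(ca[seq_len(n)] != cb[seq_len(n)])
  # The prefix ends just before the first difference, or spans all n characters
  if (length(neq)) neq[1L] - 1L else n
}

common_prefix_len("very big house", "very small house")
```

The suffix length can be obtained the same way by reversing both character vectors first, as the function above does with rev().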

Split or substitute strings with wildcards in R [duplicate]

This question already has answers here:
Split data frame string column into multiple columns
I have the following vector:
a <- c("abc_lvl1", "def_lvl2")
I basically want to split it into two vectors:
("abc", "def") and ("lvl1", "lvl2"). I know how to substitute with sub:
sub(".*_", "", a)
[1] "lvl1" "lvl2"
I think this translates into "search for any number of any characters before "_" and replace them with nothing." Accordingly, I thought, this should give me the other desired vector:
sub("_*.", "", a)
but it removes just the leading character:
[1] "bc_lvl1" "ef_lvl2"
Where did I mess up?
This is essentially the equivalent for the "text-to-columns"-function in excel.
There are several ways to do this. Here are a few, some using packages, and others with base R.
Given:
a <- c("abc_lvl1", "def_lvl2")
Here are some options:
do.call(rbind, strsplit(a, "_", TRUE))
matrix(scan(what = "", text = a, sep = "_"), ncol = 2, byrow = TRUE)
scan(text = a, sep = "_", what = list("", "")) ## a list
library(splitstackshape)
cSplit(data.table(a), "a", "_")
library(data.table)
setDT(tstrsplit(a, "_"))[]
library(dplyr)
library(tidyr)
tibble(a) %>%
separate(a, into = c("this", "that"))
library(reshape2)
colsplit(a, "_", c("this", "that"))
library(stringi)
t(stri_split_fixed(a, "_", simplify = TRUE))
library(iotools)
mstrsplit(a, "_") # Matrix
dstrsplit(a, col_types = c("character", "character"), "_") # data.frame
library(gsubfn)
read.pattern(text = a, pattern = "(.*)_(.*)")
We can use read.csv/read.table and specify the sep="_". It will split the strings into two columns.
read.csv(text=a, sep="_", header=FALSE)
Just to build on the initial comments
a <- c("abc_lvl1", "def_lvl2")
a1 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][1]}))
a2 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][2]}))
a1
[1] "abc" "def"
a2
[1] "lvl1" "lvl2"
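As for where the original attempt went wrong: in "_*." the star applies to the underscore, so the pattern reads "zero or more underscores, then any one character", and at the start of each string that is just the first letter. A minimal sketch of the corrected pattern, reusing the vector from the question (first and second are illustrative names):

```r
a <- c("abc_lvl1", "def_lvl2")

# "_*." = zero or more underscores followed by any single character,
# so only the first character of each string is removed
sub("_*.", "", a)

# "_.*" = an underscore followed by everything after it
first <- sub("_.*", "", a)
# ".*_" = everything up to and including the last underscore
second <- sub(".*_", "", a)

first
second
```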

Removing words from a corpus of documents with a tailored list of words

The tm package has the ability to let the user 'prune' the words and punctuation in a corpus of documents:
tm_map( corpusDocs, removeWords, stopwords("english") )
Is there a way to supply tm_map with a tailored list of words that is read in from a csv file and used in place of stopwords("english")?
Thank you.
BSL
Let's take a file (wordMappings):
"from"|"to"
###Words######
"this"|"ThIs"
"is"|"Is"
"a"|"A"
"sample"|"SamPle"
First, the removal of words:
readFile <- function(fileName, seperator) {
  read.csv(paste0("data\\", fileName, ".txt"),
           sep=seperator, #"\t",
           quote = "\"",
           comment.char = "#",
           blank.lines.skip = TRUE,
           stringsAsFactors = FALSE,
           encoding = "UTF-8")
}
library(tm)
kelimeler <- c("this is a sample")  # "kelimeler" is Turkish for "words"
corpus <- Corpus(VectorSource(kelimeler))
seperatorOfTokens <- ' '
words <- readFile("wordMappings", "|")
# the pattern matches the word when it is delimited by the separator or by the start/end of the string
toSpace <- content_transformer(function(x, from) gsub(sprintf("(^|%s)%s(%s%s)", seperatorOfTokens, from, '$|', seperatorOfTokens), sprintf(" %s%s", ' ', seperatorOfTokens), x))
for (word in words$from) {
  corpus <- tm_map(corpus, toSpace, word)
}
If you want a more flexible solution, for example replacing words rather than just removing them:
#Specific Transformations
toMyToken <- content_transformer(function(x, from, to)
  gsub(sprintf("(^|%s)%s(%s%s)", seperatorOfTokens, from, '$|', seperatorOfTokens), sprintf(" %s%s", to, seperatorOfTokens), x))
for (i in seq_len(nrow(words))) {
  print(sprintf("%s -> %s ", words$from[i], words$to[i]))
  corpus <- tm_map(corpus, toMyToken, words$from[i], words$to[i])
}
A sample run:
[1] "this -> ThIs "
[1] "is -> Is "
[1] "a -> A "
[1] "sample -> SamPle "
> content(corpus[[1]])
[1] " ThIs Is A SamPle "
My solution, which may be cumbersome and inelegant:
#read in items to be removed
removalList = as.matrix( read.csv( listOfWordsAndPunc, header = FALSE ) )
#
#create document term matrix
termListing = colnames( corpusFileDocs_dtm )
#
#find intersection of terms in removalList and termListing
commonWords = intersect( removalList, termListing )
removalIndxs = match( commonWords, termListing )
#
#create m for term frequency, etc.
m = as.matrix( corpusFileDocs_dtm )
#
#use removalIndxs to drop irrelevant columns from m
allColIndxs = 1 : length( termListing )
keepColIndxs = setdiff( allColIndxs, removalIndxs )
m = m[ ,keepColIndxs ]
#
#thence to tf-idf analysis with revised m
Any stylistic or computational suggestions for improvement are gratefully sought.
BSL
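On the original question itself: removeWords accepts any character vector, so a word list read from a CSV file can stand in for stopwords("english"). The sketch below is hedged accordingly: the inline CSV text and the names csv_text and custom_words are assumptions for illustration, the tm call is shown only as a comment, and the base R gsub merely imitates the effect so the example runs without tm.

```r
# Hypothetical word list; in practice it would come from a file, e.g.
# read.csv("mywords.csv", header = FALSE, stringsAsFactors = FALSE)
csv_text <- "this\nis\na"
custom_words <- read.csv(text = csv_text, header = FALSE,
                         stringsAsFactors = FALSE)[[1]]

# With tm loaded, the vector can be passed where stopwords("english") was:
# corpusDocs <- tm_map(corpusDocs, removeWords, custom_words)

# Base R illustration of the same idea: drop whole-word matches only
pattern <- paste0("\\b(", paste(custom_words, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", "this is a sample", perl = TRUE)
trimws(cleaned)
```

The word boundaries (\\b) keep "a" from being stripped out of "sample"; words containing regex metacharacters would need escaping first.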

Insert a character at multiple positions in a string at once

Let us say I have a string
"ABCDEFGHI56dfsdfd"
What I want to do is insert a space character at multiple positions at once.
For example, I want to insert a space after two randomly chosen positions, say 4 and 8.
So the output should be
"ABCD EFGH I56dfsdfd"
What is the most effective way of doing this, given that the string can contain any type of character (not just letters)?
Here's a solution based on regular expressions:
vec <- "ABCDEFGHI56dfsdfd"
# sample two random positions
pos <- sample(nchar(vec), 2)
# [1] 6 4
# generate regex pattern
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
# [1] "(?=.{11}$)|(?=.{13}$)"
# insert spaces at (after) positions
gsub(pat, " ", vec, perl = TRUE)
# [1] "ABCD EF GHI56dfsdfd"
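This approach is based on positive lookaheads, e.g., (?=.{11}$). A lookahead matches a position without consuming any characters, so (?=.{11}$) matches the position that has exactly 11 characters left before the end of the string ($), and the space is inserted there.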
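With fixed rather than sampled positions the same pattern construction is easier to follow. A minimal sketch, using positions 4 and 8 from the question (result is an illustrative name):

```r
vec <- "ABCDEFGHI56dfsdfd"
pos <- c(4, 8)

# One lookahead per position: "exactly nchar(vec) - pos characters remain"
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
pat

# Each zero-width match is replaced by a space, i.e. a space is inserted
result <- gsub(pat, " ", vec, perl = TRUE)
result
```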
A bit more brute-force-y than Sven's:
randomSpaces <- function(txt) {
  pos <- sort(sample(nchar(txt), 2))
  paste(substr(txt, 1, pos[1]), " ",
        substr(txt, pos[1]+1, pos[2]), " ",
        substr(txt, pos[2]+1, nchar(txt)), collapse="", sep="")
}
for (i in 1:10) print(randomSpaces("ABCDEFGHI56dfsdfd"))
## [1] "ABCDEFG HI56 dfsdfd"
## [1] "ABC DEFGHI5 6dfsdfd"
## [1] "AB CDEFGHI56dfsd fd"
## [1] "ABCDEFGHI 5 6dfsdfd"
## [1] "ABCDEF GHI56dfsdf d"
## [1] "ABC DEFGHI56dfsdf d"
## [1] "ABCD EFGHI56dfsd fd"
## [1] "ABCDEFGHI56d fsdfd "
## [1] "AB CDEFGH I56dfsdfd"
## [1] "A BCDE FGHI56dfsdfd"
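The same substring idea can be generalized to any number of positions without regular expressions. This is a sketch only; insert_at is a hypothetical helper, not part of either answer, and it relies on substring() being vectorized over its start and end arguments.

```r
# Hypothetical helper: insert `ins` after each position in `pos`
insert_at <- function(txt, pos, ins = " ") {
  pos <- sort(unique(pos))
  # Cut the string into consecutive pieces ending at each position
  starts <- c(1, pos + 1)
  ends <- c(pos, nchar(txt))
  # Glue the pieces back together with the inserted text in between
  paste(substring(txt, starts, ends), collapse = ins)
}

insert_at("ABCDEFGHI56dfsdfd", c(4, 8))
```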
Based on the accepted answer, here's a function that simplifies this approach:
##insert pattern in string at position
substrins <- function(ins, x, ..., pos=NULL, offset=0){
  stopifnot(is.numeric(pos),
            is.numeric(offset),
            !is.null(pos))
  offset <- offset[1]
  pat <- paste0("(?=.{", nchar(x) - pos - (offset-1), "}$)", collapse = "|")
  gsub(pattern = pat, replacement = ins, x = x, ..., perl = TRUE)
}
# insert space at position 10
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10)
##[1] "ABCDEFGHI 56dfsdfd"
# insert pattern before position 10 (i.e. at position 9)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=-1)
##[1] "ABCDEFGH I56dfsdfd"
# insert pattern after position 10 (i.e. at position 11)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=1)
##[1] "ABCDEFGHI5 6dfsdfd"
Now to do what the OP wanted:
# insert space at position 4 and 8
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8))
##[1] "ABC DEFG HI56dfsdfd"
# insert space after position 4 and 8 (as per OP's desired output)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8), offset=1)
##[1] "ABCD EFGH I56dfsdfd"
To replicate the other, more brute-force-y answer one would do:
set.seed(123)
x <- "ABCDEFGHI56dfsdfd"
for (i in 1:10) print(substrins(" ", x, pos = sample(nchar(x), 2)))
##[1] "ABCD EFGHI56d fsdfd"
##[1] "ABCDEF GHI56dfs dfd"
##[1] " ABCDEFGHI56dfsd fd"
##[1] "ABCDEFGH I56dfs dfd"
##[1] "ABCDEFG HI 56dfsdfd"
##[1] "ABCDEFG HI56dfsdf d"
##[1] "ABCDEFGHI 56 dfsdfd"
##[1] "A BCDEFGHI56dfs dfd"
##[1] " ABCD EFGHI56dfsdfd"
##[1] "ABCDE FGHI56dfsd fd"

regex grab from beginning to n occurrence of character

I'm really putting time into learning regex and I'm playing with different toy scenarios. One setup I can't get to work is to grab from the beginning of a string to the nth occurrence of a character, where n > 1.
Here I can grab from the beginning of the string to the first underscore but I can't generalize this to the second or third underscore.
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")
gsub("_.*$", "", x)
Here's what I'm trying to achieve with regex. (`sub`/`gsub`):
## > sapply(lapply(strsplit(x, "_"), "[", 1:2), paste, collapse="_")
## [1] "a_b" "1_2" "<_?"
#or
## > sapply(lapply(strsplit(x, "_"), "[", 1:3), paste, collapse="_")
## [1] "a_b_c" "1_2_3" "<_?_."
Related post: regex from first character to the end of the string
Here's a start. To make this safe for general use, you'll need it to properly escape regular expressions' special characters:
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:", "", "abcd", "____abcd")
matchToNth <- function(char, n) {
  others <- paste0("[^", char, "]*") ## matches "[^_]*" if char is "_"
  mainPat <- paste0(c(rep(c(others, char), n-1), others), collapse="")
  paste0("(^", mainPat, ")", "(.*$)")
}
gsub(matchToNth("_", 2), "\\1", x)
# [1] "a_b" "1_2" "<_?" "" "abcd" "_"
gsub(matchToNth("_", 3), "\\1", x)
# [1] "a_b_c" "1_2_3" "<_?_." "" "abcd" "__"
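The escaping caveat mentioned above could be handled along these lines. This is a sketch under assumptions, not the answer's own code: match_to_nth and escape_char are hypothetical names, the delimiter is assumed to be a single character, and perl = TRUE is used because in PCRE a backslash before a non-alphanumeric character makes it literal both inside and outside a character class.

```r
# Hypothetical helper: make a single character literal for a PCRE pattern
escape_char <- function(char) {
  stopifnot(nchar(char) == 1)
  # Letters, digits and underscore are left alone; anything else gets a backslash
  if (grepl("^[[:alnum:]_]$", char)) char else paste0("\\", char)
}

# Keep everything from the start of x up to (not including) the nth delimiter
match_to_nth <- function(x, char, n) {
  ch <- escape_char(char)
  others <- paste0("[^", ch, "]*")
  main <- paste0(c(rep(c(others, ch), n - 1), others), collapse = "")
  sub(paste0("^(", main, ").*$"), "\\1", x, perl = TRUE)
}

match_to_nth(c("a.b.c.d", "1.2.3.4"), ".", 2)
match_to_nth("a|b|c", "|", 2)
```

Strings with fewer than n - 1 delimiters do not match and are returned unchanged, as in the original function.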
How about:
gsub('^(.+_.+?).*$', '\\1', x)
# [1] "a_b" "1_2" "<_?"
Alternatively you can use {} to indicate the number of repeats...
sub('((.+_){1}.+?).*$', '\\1', x) # {0} will give "a", {1} - "a_b", {2} - "a_b_c" and so on
So you don't have to repeat yourself if you wanted to match the nth one...
Up to the second underscore, in Perl-style regex:
/^(.*?_.*?_)/
and the third:
/^(.*?_.*?_.*?_)/
Maybe something like this
x
## [1] "a_b_c_d" "1_2_3_4" "<_?_._:"
gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){1}", x)))
## [1] "a" "1" "<"
gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){2}", x)))
## [1] "a_b" "1_2" "<_?"
gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){3}", x)))
## [1] "a_b_c" "1_2_3" "<_?_."
Using Justin's approach this was what I devised:
beg2char <- function(text, char = " ", noc = 1, include = FALSE) {
  inc <- ifelse(include, char, "?")
  specchar <- c(".", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
  if(char %in% specchar) {
    char <- paste0("\\", char)
  }
  ins <- paste(rep(paste0(char, ".+"), noc - 1), collapse="")
  rep <- paste0("^(.+", ins, inc, ").*$")
  gsub(rep, "\\1", text)
}
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")
beg2char(x, "_", 1)
beg2char(x, "_", 2)
beg2char(x, "_", 3)
beg2char(x, "_", 4)
beg2char(x, "_", 3, include=TRUE)