Removing repeating substrings from within a string in R - regex

Is there any way (using regular expressions such as gsub or other means) to remove repetitions from a string?
Essentially:
a = c("abc, def, def, abc")
f(a)
#[1] "abc, def"

One obvious way is to strsplit the string, get unique strings and stitch them together.
paste0(unique(strsplit(a, ",[ ]*")[[1]]), collapse=", ")

You can also use stringr::str_extract_all
require(stringr)
unique(unlist(str_extract_all(a, '\\w+')))

you can also use this function based on gsub. I was not able to directly do it with a single regular expression.
f <- function(x) {
x <- gsub("(.+)(.+)?\\1", "\\1\\2", x, perl=T)
if (grepl("(.+)(.+)?\\1", x, perl=T))
x <- f(x)
else
return(x)
}
b <- f(a)
b
[1] "abc, def"
hth

Related

How to split a string before the delimiter?

I have a character string like the below.
a <- "T,2016,07,T,2016,07,22,T,2016,07"
I would like to split it to get this,
b <- c("T,2016,07", "T,2016,07", "T,2016,07")
Could you tell me the way? Many thanks.
Or use regular expression to split:
strsplit(a, ",(?=T)", perl = T)
# [[1]]
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"
You can do
x <- gsub("T", "%T", a)
y <- unlist(strsplit(x, "%"))[-1]
a <- "T,2016,07,T,2016,07,22,T,2016,07"
paste0("T", Filter(nzchar, strsplit(a, ",?T")[[1]]))
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"

Extracting hashtags AND attached string elements (IF ANY) with regular expressions AND positive lookarounds and lookbehinds in r

I'd like to create a function in r using regular expressions that extracts hashtags (and one for #'s as well) but checks to see if its a part of a string and return those parts of that string. I'm still picking up hashtags (and #'s) and so I'm assuming that I'm not picking up pure hashtag strings (#word) because this is after using a function to remove URLs, emails, hashtags, and #'s via:
clean.text <- function(x){
x <- gsub("http[^[:space:]]+"," ", x)
x <- gsub("([_+A-Za-z0-9-]+(\\.[_+A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,14}))","", x)
x <- gsub("\\s#[[:alnum:]_]+"," ", x)
x <- gsub("\\s#[^[:space:]]+"," ", x)
x
}
So I'd like to know what parts of the string are attached to the hashtags (and #'s) because I'm still getting hashtags (and #'s) when use the following on my cleaned text.
findHash2 <- function(x){
m <- gregexpr("(#\\w+)", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
findAT2 <- function(x){
m <- gregexpr("#(\\w+)", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
Note: again, this is after I apply my clean.text function to my text. Would it be something like this?
findHash1 <- function(x){
m <- gregexpr("(?<=^)#\\w+(?=$)", x, perl=TRUE)
w <- unlist(regmatches(x, m))
return(paste(w, collapse=" "))
}
UPDATE Example
x <- "yp#MonicaSarkar: RT #saultracey: Sun kissed .....#olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl
x <-I don'nt#know #It would %%%%#be #best if#you just.idk#provided/a fewexample#character! strings# #my#&^( 160,000+posts#in #of text #my) #data is#so huge!# (some# that #should match#and some that #shouldn't) and post# the desired#output.#We'll take it from there."
As for the desired output, I guess something like:
[1] yp#MonicaSarkar: #saultracey: .....#olmpicrings
Or in the second example:
[1] don'nt#know if#you 160,000+posts#in %%%%#be fewexample#character!
Ultimately, I'd like to see what's attached to the hash tags.
I'd like to use a function or functions that would extract a hashtag (another function or set of functions for #'s) if part of a string in three scenarios and presented my attempt at the first: one that says it must be preceded and followed by one or more characters, another that matches if only followed by one or more characters and a third that matches only if preceded by one or more characters. That is: one that would match the string hashtag only if it's at the middle not if it's present at the start or at the end of a string, one that would match the string only if it's present at the start, and one that would match the string if it's present at the end.
Would three functions like I discussed need to be created for that type of procedure or could it be combined into one?

String split in R skipping first delimiter if multiple delimiters are present

I have "elephant_giraffe_lion" and "monkey_tiger" strings.
The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at that delimiter. So the results I want to get in this example are "elephant_giraffe" and "monkey".
mystring<-c("elephant_giraffe_lion", "monkey_tiger")
result
"elephant_giraffe" "monkey"
You can anchor your split to the end of the string using $,
unlist(strsplit(mystring, "_[a-z]+$"))
# [1] "elephant_giraffe" "monkey"
Edit
The above only matches the last "_", not accounting for cases where there are more than two "_". For the more general case, you could try
mystring<-c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
tmp <- gsub("([^_]+_[^_]+).*", "\\1", mystring)
tmp[tmp==mystring] <- sapply(strsplit(tmp[tmp==mystring], "_"), `[[`, 1)
tmp
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
You could also use gsubfn, to process the match with a function
library(gsubfn)
f <- function(x,y) if (y==x) strsplit(y, "_")[[1]][[1]] else y
gsubfn("([^_]+_[^_]+).*", f, mystring, backref=1)
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
As I posted an answer on your other related question, a base R solution:
x <- c('elephant_giraffe_lion', 'monkey_tiger', 'foo_bar_baz_bap')
sub('^(?|([^_]*_[^_]*)_.*|([^_]*)_[^_]*)$', '\\1', x, perl=TRUE)
# [1] "elephant_giraffe" "monkey" "foo_bar"

String split with conditions in R

I have this mystring with the delimiter _. The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at ".Recal" and get the result as shown below.
mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")
result
"MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
You can do this using gsubfn
library(gsubfn)
f <- function(x,y,z) if (z=="_") y else strsplit(x, ".ReCal", fixed=T)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
This allows for cases when you have more than two "_", and you want to split on the second one, for example,
mystring<-c("MODY_60.2.ReCal.sort.bam",
"MODY_116.21_C4U.ReCal.sort.bam",
"MODY_116.3_C2RX-1-10.ReCal.sort.bam",
"MODY_116.4.ReCal.sort.bam",
"MODY_116.4_asdfsadf_1212_asfsdf",
"MODY_116.5.ReCal_asdfsadf_1212_asfsdf", # split by second "_", leaving ".ReCal"
"MODY")
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
In the function, f, x is the original string, y and z are the next matches. So, if z is not a "_", then it proceeds with the splitting by the alternative string.
With the stringr package:
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
It also works with more than two delimiters.
Perl/PCRE has the branch reset feature that lets you reuse a group number when you have capturing groups in different alternatives, and is considered as one capturing group.
IMO, this feature is elegant when you want to supply different alternatives.
x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')
sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl=T)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
gsub('^(.*\\.\\d+).*','\\1',mystring)
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$
You can simply do using gsub without using any complex regex.Just replace by \\1.See demo.
https://regex101.com/r/wL4aB6/1
A little longer, but needs less regular expression knowledge:
library(stringr)
indx <- str_locate_all(mystring, "_")
for (i in seq_along(indx)) {
if (nrow(indx[[i]]) == 1) {
mystring[i] <- strsplit(mystring[i], ".ReCal")[[1]][1]
} else {
mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2] - 1)
}
}
gregexpr can search for a pattern in strings and give the location.
First, we use gregexpr to find the location of all _ in each element of mystring. Then, we loop through that output and extract the index of second _ within each element of mystring. If there is no second _, it'll return an NA (check inds in the example below).
After that, we can either extract the relevant part using substr based on the extracted index or, if there is NA, we can split the string at .ReCal and keep only the first part.
inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
substr(mystring, 1, inds - 1),
sapply(strsplit(mystring, ".ReCal"), '[', 1))
#[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

dynamic regex in R

The below code works so long as before and after strings have no characters that are special to a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' #parentheses cause problem in regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is http://perldoc.perl.org/functions/quotemeta.html for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
dnagirl, such a function exists and is glob2rx
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"