Replace element in string after first occurrence - regex

I wish to replace all 2's in a string after the first occurrence of a 2, ideally using regex in base R. This seems like it must be a duplicate, but I cannot locate the answer.
Here is an example:
my.data <- read.table(text='
my.string
.1.222.2.2
..1..1..2.
1.1.2.2...
.222.232..
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text='
my.string
.1.2......
..1..1..2.
1.1.2.....
.2....3...
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
desired.result
my.last.2 <- c(4, 9, 5, 2, NA)
my.last.2
Thank you for any assistance.

This appears to match your desired output:
> gsub(pattern = "(?<=2)(.*?)2",
replacement = "\\1\\.",
x = my.data$my.string,
perl = TRUE)
[1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
This is literally a directly modification from this answer to a very similar question to make it R specific. I'll be honest, I don't quite understand this regex, so use (and up-vote) with caution.

This works, but is probably inefficient:
with(my.data, gsub("#", "2", gsub("2", ".", sub("2", "#", my.string))))
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
Approach: Use sub to only match the first occurrence and change it to # (or some other placeholder character which doesn't show up elsewhere in my.string, then use gsub to replace all remaining 2s, then gsub # back into 2.

Related

Regexp in R to match everything in between first and last occurene of some specified character

I'd like to match everything between the first and last underscore. I use R.
What I have until now is this:
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
sub('[^_]*_(.*)_[^_]*', x = p.subject, replacement = '\\1', perl = T)
Where 'bla' is any character except an underscore...
The result I'd like would be something like this:
c(NA, NA, bla, bla_bla)
I can't figure it out! Why does the first pattern match? It shouldn't because the pattern must have 2 underscores! Or do I have to use some kind of lookahead expression?
Your help is very welcome!
You can use gsub:
vec <- gsub("(^[^_]+)_?|_?([^_]+$)", "", p.subject)
vec <- ifelse(nchar(vec) == 0 , NA, vec)
vec
[1] NA NA "bla" "bla_bla"
Data:
dput(p.subject)
c("bla_bla", "bla", "bla_bla_bla", "bla_bla_bla_bla")
Here is another option using str_extract. We use regex lookarounds to extract the pattern between the first and the last occurrence of a specified character i.e. _.
library(stringr)
str_extract(p.subject, "(?<=[^_]{1,30}_).*(?=_[^_]+)")
#[1] NA NA "bla" "bla_bla"
NOTE: We didn't use any ifelse.
data
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')

Using regular expressions in R to extract information from string

I searched the stack overflow a little and all I found was, that regex in R are a bit tricky and not convenient compared to Perl or Python.
My problem is the following. I have long file names with informations inside. The look like the following:
20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw
20150416_QEP1_EXT_GR_1237_hs_IP_NON_060
I want to extract the parts from the filename and convert them conveniently into values, for example the first part is a date, the second is machine abbreviation, the next an institute abbreviation, group abbreviation, sample number(s) etc...
What I do at the moment is constructing a regex, to make (almost) sure, I grep the correct part of the string:
regex <- '([:digit:]{8})_([:alnum:]{1,4})_([:upper:]+)_ etc'
Then I use sub to save each snipped into a variable:
date <- sub(regex, '\\1', filename)
machine <- sub(regex, '\\2', filename)
etc
This works, if the filename has the correct convention. It is overall very hard to read and I am search for a more convenient way of doing the work. I thought about splitting the filename by _ and accessing the string by index might be a good solution. But sometimes, since the filenames often get created by hand, there are terms missing or additional information in the names and I am looking for a better solution to this.
Can anyone suggest a better way of doing so?
EDIT
What I want to create is an object, which has all the information of the filenames extracted and accessible... such as my_object$machine or so....
The help page for ?regex actually gives an example that is exactly equivalent to Python's re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") (as per your comment):
## named capture
notables <- c(" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore")
#name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
parse.one(notables, parsed)
The normal way (i.e. the R way) to extract from a string is the following:
text <- "Malcolm Reynolds"
x <- gregexpr("\\w+", text) #Don't forget to escape the backslash
regmatches(text, x)
[[1]]
[1] "Malcolm" "Reynolds"
You can use however Perl-style group naming by using argument perl=TRUE:
regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
However regmatches does not support it, hence the need to create your own function to handle that, which is given in the help page:
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
Applied to your example:
text <- "Malcolm Reynolds"
x <- regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
parse.one(text, x)
first_name last_name
[1,] "Malcolm" "Reynolds"
To go back to your initial problem:
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw", "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")
regex <- '(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<whatev>[[:upper:]]+)'
x <- regexpr(regex,filenames,perl=TRUE)
parse.one(filenames,x)
date machine whatev
[1,] "20150416" "QEP1" "EXT"
[2,] "20150416" "QEP1" "EXT"
[3,] "20150416" "QEP1" "EXT"
[4,] "20150416" "QEP1" "EXT"

Replace parts of string using package stringi (regex)

I have some string
string <- "abbccc"
I want to replace the chains of the same letter to just one letter and number of occurance of this letter. So I want to have something like this:
"ab2c3"
I use stringi package to do this, but it doesn't work exactly like I want. Let's say I already have vector with parts for replacement:
vector <- c("b2", "c3")
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector)
The output:
[1] "ab2b2" "ac3c3"
The output I want: [1] "ab2c3"
I also tried this way
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all=FALSE)
but i get error
Error in stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all = FALSE) :
vector length not consistent with other arguments
Not regex but astrsplit and rle with some paste magic:
string <- c("abbccc", "bbaccc", "uffff", "aaabccccddd")
sapply(lapply(strsplit(string, ""), rle), function(x) {
paste(x[[2]], ifelse(x[[1]] == 1, "", x[[1]]), sep="", collapse="")
})
## [1] "ab2c3" "b2ac3" "uf4" "a3bc4d3"
Not a stringi solution and not a regex either, but you can do it by splitting the string and using rle:
string <- "abbccc"
res<-paste(collapse="",do.call(paste0,rle(strsplit(string,"",fixed=TRUE)[[1]])[2:1]))
gsub("1","",res)
#[1] "ab2c3"

Eliminating the characters that are not a date in R

I have some data frame, df with a column with dates that are in the following format:
pv$day
01/01/13 00:00:00
03/01/13 00:02:03
04/03/13 00:10:15
....
I would like to eliminate the timestamp, just leaving the date (e.g. 01/01/13 for the first row). I have tried both using sapply() to apply the strsplit() function, and tried to filter the content using a regex, but don't seem to have quite gotten it right in either case. This:
sapply(pv$day, function(x) strsplit(toString(x), ' '))
gives me the column with the correct split, but indexing with either [1] or [[1]] does not return the first element of the split.
What is the best way to go about this?
You can use sub:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
sub(" .+", "", vec)
# [1] "01/01/13" "03/01/13" "04/03/13"
A simple, flexible solution is to use strptime and strftime. Here is an example that uses your dates from the example above:
# Your dates
t <- c("01/01/13 00:00:00","03/01/13 00:02:03", "04/03/13 00:10:15")
# Convert character strings to dates
z <- strptime(t, "%d/%m/%y %H:%M:%OS")
# Convert dates to string, omitting the time
z.date <- strftime(z,"%d/%m/%y")
# Print the first date
z.date[1]
Here's a nice way to use sapply, it uses strsplit to split at the space
> d <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
> sapply(strsplit(d, " "), `[`, 1)
# [1] "01/01/13" "03/01/13" "04/03/13"
And also, you could use stringr::word if you just want a character vector.
> library(stringr)
> word(d)
# [1] "01/01/13" "03/01/13" "04/03/13"
Here is an approach using a look around assertion:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE)
[1] "01/01/13 " "03/01/13 " "04/03/13 "
The pattern looks for anything at the end of a string that begins with double 00, and removes it.

dynamic regex in R

The below code works so long as before and after strings have no characters that are special to a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' #parentheses cause problem in regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is http://perldoc.perl.org/functions/quotemeta.html for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
dnagirl, such a function exists and is glob2rx
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"