Replace parts of string using package stringi (regex)

I have some string
string <- "abbccc"
I want to replace each run of the same letter with that letter followed by the number of times it occurs. So I want to have something like this:
"ab2c3"
I use the stringi package to do this, but it doesn't work exactly as I want. Let's say I already have a vector with the replacement parts:
vector <- c("b2", "c3")
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector)
The output:
[1] "ab2b2" "ac3c3"
The output I want: [1] "ab2c3"
I also tried this:
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all=FALSE)
but I get an error:
Error in stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all = FALSE) :
vector length not consistent with other arguments

Not regex, but strsplit and rle with some paste magic:
string <- c("abbccc", "bbaccc", "uffff", "aaabccccddd")
sapply(lapply(strsplit(string, ""), rle), function(x) {
  paste(x[[2]], ifelse(x[[1]] == 1, "", x[[1]]), sep="", collapse="")
})
## [1] "ab2c3" "b2ac3" "uf4" "a3bc4d3"
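To see why this works, it can help to look at what rle returns for a single string. The sketch below is only an illustration; the variable names chars and runs are not from the answer above.

```r
# Split one string into single characters and run-length encode them
chars <- strsplit("abbccc", "")[[1]]
runs <- rle(chars)

runs$values   # the distinct letters, in order of appearance
runs$lengths  # how many times each letter repeats consecutively

# Paste each letter and its count together, leaving out counts of 1
paste0(runs$values, ifelse(runs$lengths == 1, "", runs$lengths), collapse = "")
```

The sapply/lapply version does the same thing for every element of a character vector.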

Not a stringi solution and not a regex either, but you can do it by splitting the string and using rle:
string <- "abbccc"
res <- paste(collapse="", do.call(paste0, rle(strsplit(string, "", fixed=TRUE)[[1]])[2:1]))
gsub("(?<![0-9])1(?![0-9])", "", res, perl=TRUE)  # drop counts of exactly 1, but keep 10, 11, 21, ...
#[1] "ab2c3"
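For a regex-based route closer to what the question attempts, one option is to compute the replacement from each match instead of supplying a fixed vector. The output in the question shows the problem with the fixed vector: each replacement is applied to every match, giving one result string per replacement. The following is a minimal sketch in base R (not stringi), assuming the input contains only lowercase letters:

```r
string <- c("abbccc", "uffff")

# Locate every run of two or more identical lowercase letters
m <- gregexpr("([a-z])\\1+", string, perl = TRUE)

# Replace each run with its letter followed by the length of the run
regmatches(string, m) <- lapply(regmatches(string, m), function(runs) {
  paste0(substr(runs, 1, 1), nchar(runs))
})

string
```

Because the count comes from nchar of the matched run, no replacement vector has to be prepared in advance.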

Related

Regexp in R to match everything in between first and last occurrence of some specified character

I'd like to match everything between the first and last underscore. I use R.
What I have until now is this:
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
sub('[^_]*_(.*)_[^_]*', x = p.subject, replacement = '\\1', perl = TRUE)
Where 'bla' is any character except an underscore...
The result I'd like would be something like this:
c(NA, NA, 'bla', 'bla_bla')
I can't figure it out! Why does the first pattern match? It shouldn't because the pattern must have 2 underscores! Or do I have to use some kind of lookahead expression?
Your help is very welcome!
You can use gsub:
vec <- gsub("(^[^_]+)_?|_?([^_]+$)", "", p.subject)
vec <- ifelse(nchar(vec) == 0 , NA, vec)
vec
[1] NA NA "bla" "bla_bla"
Data:
dput(p.subject)
c("bla_bla", "bla", "bla_bla_bla", "bla_bla_bla_bla")
Here is another option using str_extract. We use regex lookarounds to extract the pattern between the first and the last occurrence of a specified character i.e. _.
library(stringr)
str_extract(p.subject, "(?<=[^_]{1,30}_).*(?=_[^_]+)")
#[1] NA NA "bla" "bla_bla"
NOTE: We didn't use any ifelse.
data
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
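On the question of why 'bla_bla' seems to match: sub returns its input unchanged when the pattern does not match, so the first two elements come back as they went in rather than as NA. A small sketch that keeps the original idea and makes the non-matches explicit, assuming an anchored version of the asker's pattern is acceptable:

```r
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')

# Anchored pattern: requires at least two underscores in the whole string
pattern <- "^[^_]*_(.*)_[^_]*$"

# Test for a match first; only then pull out the captured middle part
result <- ifelse(grepl(pattern, p.subject),
                 sub(pattern, "\\1", p.subject),
                 NA_character_)
result
```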

Using regular expressions in R to extract information from string

I searched Stack Overflow a little, and all I found was that regexes in R are a bit tricky and less convenient than in Perl or Python.
My problem is the following. I have long file names with information inside. They look like the following:
20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw
20150416_QEP1_EXT_GR_1237_hs_IP_NON_060
I want to extract the parts from the filename and convert them conveniently into values, for example the first part is a date, the second is machine abbreviation, the next an institute abbreviation, group abbreviation, sample number(s) etc...
What I do at the moment is construct a regex to make (almost) sure I grab the correct part of the string:
regex <- '([[:digit:]]{8})_([[:alnum:]]{1,4})_([[:upper:]]+)_ etc'
Then I use sub to save each snippet into a variable:
date <- sub(regex, '\\1', filename)
machine <- sub(regex, '\\2', filename)
etc
This works if the filename follows the correct convention. It is overall very hard to read, and I am searching for a more convenient way of doing the work. I thought that splitting the filename by _ and accessing the parts by index might be a good solution. But since the filenames are often created by hand, sometimes terms are missing or there is additional information in the names, and I am looking for a better solution to this.
Can anyone suggest a better way of doing so?
EDIT
What I want to create is an object, which has all the information of the filenames extracted and accessible... such as my_object$machine or so....
The help page for ?regex actually gives an example that is exactly equivalent to Python's re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") (as per your comment):
## named capture
notables <- c(" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore")
#name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if(result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}
parse.one(notables, parsed)
The normal way (i.e. the R way) to extract from a string is the following:
text <- "Malcolm Reynolds"
x <- gregexpr("\\w+", text) #Don't forget to escape the backslash
regmatches(text, x)
[[1]]
[1] "Malcolm" "Reynolds"
However, you can use Perl-style named groups by passing the argument perl=TRUE:
regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
However regmatches does not support it, hence the need to create your own function to handle that, which is given in the help page:
parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if(result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}
Applied to your example:
text <- "Malcolm Reynolds"
x <- regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
parse.one(text, x)
     first_name last_name
[1,] "Malcolm"  "Reynolds"
To go back to your initial problem:
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw", "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")
regex <- '(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<whatev>[[:upper:]]+)'
x <- regexpr(regex,filenames,perl=TRUE)
parse.one(filenames,x)
     date       machine whatev
[1,] "20150416" "QEP1"  "EXT"
[2,] "20150416" "QEP1"  "EXT"
[3,] "20150416" "QEP1"  "EXT"
[4,] "20150416" "QEP1"  "EXT"
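To get closer to the my_object$machine access asked for in the edit, the captured groups can be collected into a data frame. The following is a rough sketch using regexec with regmatches, which returns all capture groups without a custom helper. The column name institute and the conversion of the date column are assumptions based on the question's description, not part of either answer.

```r
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw",
               "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw",
               "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw",
               "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")

pattern <- "^([[:digit:]]{8})_([[:alnum:]]{1,4})_([[:upper:]]+)_"

# regexec keeps the capture groups; element 1 of each result is the full match
groups <- regmatches(filenames, regexec(pattern, filenames))

# Drop the full match; fill with NA when a filename does not fit the pattern
fields <- lapply(groups, function(g) {
  if (length(g)) g[-1] else rep(NA_character_, 3)
})

parsed <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(parsed) <- c("date", "machine", "institute")  # hypothetical column names
parsed$date <- as.Date(parsed$date, format = "%Y%m%d")

parsed$machine
```

Filenames that break the convention show up as rows of NA, which makes hand-typed deviations easy to spot.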

Remove all characters before a period in a string

This keeps everything before a period:
gsub("\\..*","", data$column )
How do I keep everything after the period?
To remove everything up to and including the last period in a string:
gsub("^.*\\.","", data$column )
Example:
> data <- 'foobar.barfoo'
> gsub("^.*\\.","", data)
[1] "barfoo"
To remove everything up to and including the first period:
> data <- 'foo.bar.barfoo'
> gsub("^.*?\\.","", data)
[1] "bar.barfoo"
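The two patterns above differ only in greedy versus lazy matching. As a small side-by-side sketch, the same effect for the first period can be had with a negated character class instead of the lazy quantifier (the variable names here are illustrative):

```r
x <- c("foobar.barfoo", "foo.bar.barfoo", "foobar")

# Greedy: .* runs to the last period, so only the final piece remains
after_last <- sub("^.*\\.", "", x)

# Negated class: [^.]* cannot cross a period, so it stops at the first one
after_first <- sub("^[^.]*\\.", "", x)

after_last
after_first
```

A string without any period is returned unchanged in both cases, because sub leaves non-matching input as it is.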
You could use stringi with lookbehind regex
library(stringi)
stri_extract_first_regex(data1, "(?<=\\.).*")
#[1] "bar.barfoo"
stri_extract_first_regex(data, "(?<=\\.).*")
#[1] "barfoo"
If the string doesn't have a ., this returns NA (the question does not say how this case should be handled):
stri_extract_first_regex(data2, "(?<=\\.).*")
#[1] NA
###data
data <- 'foobar.barfoo'
data1 <- 'foo.bar.barfoo'
data2 <- "foobar"
If you don't want to think about the regex for this, the qdap package has the char2end function, which grabs everything from a particular character to the end of the string.
data <- c("foo.bar", "foo.bar.barfoo")
library(qdap)
char2end(data, ".")
## [1] "bar" "bar.barfoo"
Use this:
gsub(".*\\.","", data$column )
This keeps everything after the last period.
I run a course on Data Analysis and the students came up with this solution:
require(stringr)
get_after_period <- function(my_vector) {
  # Return a string vector without the characters
  # before a period (excluding the period)
  # my_vector, a string vector
  str_sub(my_vector, str_locate(my_vector, "\\.")[,1]+1)
}
Now, just call the function :
my_vector <- c('foobar.barfoo', 'amazing.point')
get_after_period(my_vector)
[1] "barfoo" "point"

Eliminating the characters that are not a date in R

I have a data frame with a column of dates in the following format:
pv$day
01/01/13 00:00:00
03/01/13 00:02:03
04/03/13 00:10:15
....
I would like to eliminate the timestamp, just leaving the date (e.g. 01/01/13 for the first row). I have tried both using sapply() to apply the strsplit() function, and tried to filter the content using a regex, but don't seem to have quite gotten it right in either case. This:
sapply(pv$day, function(x) strsplit(toString(x), ' '))
gives me the column with the correct split, but indexing with either [1] or [[1]] does not return the first element of the split.
What is the best way to go about this?
You can use sub:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
sub(" .+", "", vec)
# [1] "01/01/13" "03/01/13" "04/03/13"
A simple, flexible solution is to use strptime and strftime. Here is an example that uses your dates from the example above:
# Your dates
t <- c("01/01/13 00:00:00","03/01/13 00:02:03", "04/03/13 00:10:15")
# Convert character strings to dates
z <- strptime(t, "%d/%m/%y %H:%M:%OS")
# Convert dates to string, omitting the time
z.date <- strftime(z,"%d/%m/%y")
# Print the first date
z.date[1]
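If the time of day is not needed at all, a shorter variant of the same idea is to parse straight to the Date class. This is a minimal sketch, assuming the strings are in day/month/year order as in the strptime call above; the name dates is illustrative.

```r
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")

# Parse only the date part; text after the matched format is ignored
dates <- as.Date(vec, format = "%d/%m/%y")

# Format back to the original style when a character vector is wanted
date_strings <- format(dates, "%d/%m/%y")
date_strings
```

Keeping the Date vector around has the advantage that sorting and date arithmetic work as expected.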
Here's a nice way to use sapply: strsplit splits each string at the space, and [ picks the first piece.
> d <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
> sapply(strsplit(d, " "), `[`, 1)
# [1] "01/01/13" "03/01/13" "04/03/13"
And also, you could use stringr::word if you just want a character vector.
> library(stringr)
> word(d)
# [1] "01/01/13" "03/01/13" "04/03/13"
Here is an approach using a look around assertion:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE)
[1] "01/01/13 " "03/01/13 " "04/03/13 "
The pattern matches from the first occurrence of 00 to the end of the string and removes it. Note that the space before the time is kept, and that this relies on every time starting with 00.

Convert string of accounting data into integers (positive and negative)

I have data formatted in the "accounting" style from Excel that looks like ($317.40) or $13,645.48. As a regexp newbie, I'm looking for a more efficient way of removing all useless symbols and converting strings with parentheses into negative numbers.
Here's how I've been doing it so far:
spending$Amount <- gsub("^[(]", "-", spending$Amount)
spending$Amount <- gsub("[$]", "", spending$Amount)
spending$Amount <- gsub("[)]", "", spending$Amount)
spending$Amount <- as.numeric(gsub("[,]", "", spending$Amount))
Can I do this in one line? Is there a specialized R package that can do it?
A nested gsub solution
x <- c("($317.40)", "$13,645.48")
as.numeric(gsub("\\(", "-", gsub("\\)|\\$|,", "", x)))
## [1] -317.40 13645.48
## A really convoluted (bad) way of doing it
mapply(FUN = function(x,y) ifelse(x,-1,1)*as.numeric(paste(y,collapse="")), grepl('\\(',x) ,regmatches(x, gregexpr('[0-9\\.]+',x)) )
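If this conversion is needed in more than one place, the logic can be wrapped in a small helper. The function name parse_accounting is hypothetical, and the sketch assumes that negative amounts are always wrapped in parentheses and that the period is the decimal separator:

```r
parse_accounting <- function(x) {
  # A value counts as negative when it is wrapped in parentheses
  negative <- grepl("^\\(.*\\)$", x)
  # Strip everything except digits and the decimal point
  value <- as.numeric(gsub("[^0-9.]", "", x))
  ifelse(negative, -value, value)
}

amounts <- parse_accounting(c("($317.40)", "$13,645.48"))
amounts
```

Applied to the question's data, this would be spending$Amount <- parse_accounting(spending$Amount).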