I have a data frame, pv, with a column of dates in the following format:
pv$day
01/01/13 00:00:00
03/01/13 00:02:03
04/03/13 00:10:15
....
I would like to eliminate the timestamp and keep just the date (e.g. 01/01/13 for the first row). I have tried using sapply() to apply strsplit(), and I have also tried filtering the content with a regex, but I don't seem to have gotten it quite right in either case. This:
sapply(pv$day, function(x) strsplit(toString(x), ' '))
gives me the column with the correct split, but indexing with either [1] or [[1]] does not return the first element of the split.
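To illustrate, the result of that call is a list whose elements are the whole split vectors, so a single index only ever returns a complete split; getting the date seems to need two levels of indexing (a sketch using the sample values above):
res <- sapply(pv$day, function(x) strsplit(toString(x), ' '))
res[[1]]     # the whole split for the first row: "01/01/13" "00:00:00"
res[[1]][1]  # just the date: "01/01/13"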
What is the best way to go about this?
You can use sub:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
sub(" .+", "", vec)
# [1] "01/01/13" "03/01/13" "04/03/13"
A simple, flexible solution is to use strptime and strftime. Here is an example using the dates from your question:
# Your dates
t <- c("01/01/13 00:00:00","03/01/13 00:02:03", "04/03/13 00:10:15")
# Convert character strings to dates
z <- strptime(t, "%d/%m/%y %H:%M:%OS")
# Convert dates to string, omitting the time
z.date <- strftime(z,"%d/%m/%y")
# Print the first date
z.date[1]
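If you would rather end up with real Date objects than character strings, as.Date accepts the same format string, and the trailing time text is simply ignored once the date part has been matched (a small sketch along the same lines):
# Parse straight to Date class, dropping the time
z.date2 <- as.Date(t, format = "%d/%m/%y")
z.date2[1]
# [1] "2013-01-01"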
Here's a nice way to use sapply; it uses strsplit to split at the space:
> d <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
> sapply(strsplit(d, " "), `[`, 1)
# [1] "01/01/13" "03/01/13" "04/03/13"
Alternatively, you could use stringr::word if you just want a character vector:
> library(stringr)
> word(d)
# [1] "01/01/13" "03/01/13" "04/03/13"
Here is an approach using a lookahead assertion:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE)
[1] "01/01/13 " "03/01/13 " "04/03/13 "
The pattern finds everything from the first "00" to the end of the string and removes it; note that the separating space is left behind, as you can see in the output above.
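If the trailing space is unwanted, one variant is to anchor the match on the space itself (a sketch that, like the original, assumes the time portion always starts with 00):
gsub(" (?=00).*$", "", vec, perl = TRUE)
# [1] "01/01/13" "03/01/13" "04/03/13"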
I have a list of US presidents with speeches about various topics (although some don't have topic labels). In the column filename I have values in a format like
1981_Reagan, 1982_economy_Reagan... 1994_Clinton, 1994_criminal_justice_Clinton
(each in a separate row), and I'd like to extract which president spoke. I was going to use a function like str_sub, but I'm not sure how to go about extracting just the name - obviously the different lengths of the names are a consideration, as is not picking up unwanted information such as the year or other words.
Here is a simple way using strsplit, assuming the president's name is always at the end of the string, separated from the rest by "_":
vec <- c("1981_Reagan",
"1982_economy_Reagan",
"1994_Clinton",
"1994_criminal_justice_Clinton")
sapply(strsplit(vec, "_"), function(x) x[length(x)])
#output
"Reagan" "Reagan" "Clinton" "Clinton"
Basically: split the strings by "_" and extract the last element from each resulting vector.
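To see what sapply is iterating over: strsplit(vec, "_") returns a list of character vectors, e.g. c("1982", "economy", "Reagan") for the second string. tail offers an equivalent way to grab the last piece (a sketch):
sapply(strsplit(vec, "_"), tail, 1)
# [1] "Reagan"  "Reagan"  "Clinton" "Clinton"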
Another way using regex:
sub(".+_", "", vec)
This replaces all characters up to "_" with nothing. The match is greedy, so it consumes everything up to the last "_".
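For contrast, a lazy quantifier would stop at the first "_" instead, which is not what you want here (a quick sketch):
sub(".+?_", "", vec)
# [1] "Reagan" "economy_Reagan" "Clinton" "criminal_justice_Clinton"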
You can also:
vec <- c("1981_Reagan",
"1982_economy_Reagan",
"1994_Clinton",
"1994_criminal_justice_Clinton")
sub(".*_(\\w+)","\\1",vec,perl=T)
#[1] "Reagan" "Reagan" "Clinton" "Clinton"
My solution seems to be the fastest because it uses Perl-compatible regular expressions (perl = TRUE).
vec <- c("1981_Reagan",
"1982_economy_Reagan",
"1994_Clinton",
"1994_criminal_justice_Clinton")
vec <- rep(vec,99999)
f1 <- function(vec) {sub(".*_", "", vec)}
f2 <- function(vec) {sub(".*_(\\w+)","\\1",vec,perl=T)}
f3 <- function(vec) {gsub(".+_", "", vec)}
microbenchmark::microbenchmark( f1(vec), f2(vec), f3(vec),times=100)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(vec) 212.8052 213.9725 215.5334 215.1973 216.5564 222.4681 100 b
# f2(vec) 133.7839 134.6375 136.0296 135.0752 136.3612 142.8160 100 a
# f3(vec) 290.8456 293.4051 295.5549 294.5525 295.5341 338.8277 100 c
One regularity, at least, in the example input is that the presidents' names (and only their names) are capitalised.
You can capitalise on that...
library(stringr)
str_extract(original_string, "(?<=_)[A-Z][^_]+")
[1] "Reagan" "Reagan" "Clinton" "Clinton"
Where
original_string <- c(
"1981_Reagan",
"1982_economy_Reagan",
"1994_Clinton",
"1994_criminal_justice_Clinton"
)
I wish to replace all 2's in a string after the first occurrence of a 2, ideally using regex in base R. This seems like it must be a duplicate, but I cannot locate the answer.
Here is an example:
my.data <- read.table(text='
my.string
.1.222.2.2
..1..1..2.
1.1.2.2...
.222.232..
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text='
my.string
.1.2......
..1..1..2.
1.1.2.....
.2....3...
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
desired.result
my.last.2 <- c(4, 9, 5, 2, NA)
my.last.2
Thank you for any assistance.
This appears to match your desired output:
> gsub(pattern = "(?<=2)(.*?)2",
replacement = "\\1\\.",
x = my.data$my.string,
perl = TRUE)
[1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
This is literally a direct modification of this answer to a very similar question, adapted to make it R-specific. I'll be honest: I don't quite understand this regex, so use (and up-vote) with caution.
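One way to read the pattern (a commented sketch, not the original answerer's explanation):
# (?<=2)  lookbehind: the match must start right after a 2, so the first 2
#         is only ever looked at, never consumed, and therefore survives
# (.*?)   lazily capture whatever sits between that 2 and the next one
# 2       the next 2 is the only character actually removed
# "\\1."  puts the captured characters back and swaps the matched 2 for a dot;
#         gsub repeats this left to right, so every 2 after the first is replaced
gsub("(?<=2)(.*?)2", "\\1.", ".1.222.2.2", perl = TRUE)
# [1] ".1.2......"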
This works, but is probably inefficient:
with(my.data, gsub("#", "2", gsub("2", ".", sub("2", "#", my.string))))
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
Approach: Use sub to match only the first occurrence and change it to # (or some other placeholder character that doesn't show up elsewhere in my.string), then use gsub to replace all remaining 2s with dots, and finally gsub the # back into a 2.
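A sketch of the intermediate steps, using the first string as an example:
s <- ".1.222.2.2"
step1 <- sub("2", "#", s)       # first 2 becomes the placeholder: ".1.#22.2.2"
step2 <- gsub("2", ".", step1)  # all remaining 2s become dots:    ".1.#......"
gsub("#", "2", step2)           # placeholder back to a 2:         ".1.2......"
# [1] ".1.2......"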
I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggle with regex.
Right now, I have a data.frame with a column, df$id, whose values look something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 13.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A, and so on for the entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\.", df$id), 99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame, you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit. Using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
"23-45-6A" "1B" "A-A"
You could remove the leading characters, up to and including the ., using:
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
data
x <- c("13.23-45-6A", "0.1B", "4.A-A")
I have a string:
string <- "abbccc"
I want to replace each run of the same letter with a single letter followed by the number of occurrences of that letter. So I want to get something like this:
"ab2c3"
I use the stringi package to do this, but it doesn't work exactly the way I want. Let's say I already have a vector with the replacement parts:
vector <- c("b2", "c3")
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector)
The output:
[1] "ab2b2" "ac3c3"
The output I want: [1] "ab2c3"
I also tried it this way:
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all=FALSE)
but I get an error:
Error in stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all = FALSE) :
vector length not consistent with other arguments
Not regex, but strsplit and rle with some paste magic:
string <- c("abbccc", "bbaccc", "uffff", "aaabccccddd")
sapply(lapply(strsplit(string, ""), rle), function(x) {
paste(x[[2]], ifelse(x[[1]] == 1, "", x[[1]]), sep="", collapse="")
})
## [1] "ab2c3" "b2ac3" "uf4" "a3bc4d3"
Not a stringi solution and not a regex either, but you can do it by splitting the string and using rle:
string <- "abbccc"
res<-paste(collapse="",do.call(paste0,rle(strsplit(string,"",fixed=TRUE)[[1]])[2:1]))
gsub("1","",res)
#[1] "ab2c3"
This keeps everything before the first period:
gsub("\\..*","", data$column )
How do I keep everything after the period?
To remove all the characters up to and including the last period in a string:
gsub("^.*\\.","", data$column )
Example:
> data <- 'foobar.barfoo'
> gsub("^.*\\.","", data)
[1] "barfoo"
To remove all the characters up to and including the first period:
> data <- 'foo.bar.barfoo'
> gsub("^.*?\\.","", data)
[1] "bar.barfoo"
You could use stringi with a lookbehind regex:
library(stringi)
stri_extract_first_regex(data1, "(?<=\\.).*")
#[1] "bar.barfoo"
stri_extract_first_regex(data, "(?<=\\.).*")
#[1] "barfoo"
If the string doesn't contain a ., this returns NA (the question doesn't say how that case should be handled):
stri_extract_first_regex(data2, "(?<=\\.).*")
#[1] NA
data
data <- 'foobar.barfoo'
data1 <- 'foo.bar.barfoo'
data2 <- "foobar"
If you don't want to think about the regex for this, the qdap package has the char2end function, which grabs everything from a particular character until the end of the string.
data <- c("foo.bar", "foo.bar.barfoo")
library(qdap)
char2end(data, ".")
## [1] "bar" "bar.barfoo"
Use this:
gsub(".*\\.","", data$column )
This will keep everything after the last period.
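Since .* is greedy, a quick check on a value with two periods confirms it strips everything up to the last one:
gsub(".*\\.", "", "foo.bar.barfoo")
# [1] "barfoo"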
I run a course on Data Analysis and the students came up with this solution:
require(stringr)
get_after_period <- function(my_vector) {
# Return a string vector without the characters
# before a period (excluding the period)
# my_vector, a string vector
str_sub(my_vector, str_locate(my_vector, "\\.")[,1]+1)
}
Now, just call the function:
my_vector <- c('foobar.barfoo', 'amazing.point')
get_after_period(my_vector)
[1] "barfoo" "point"