Correct wrongly formatted dates

Correct wrongly formatted dates - regex

I have some incorrect dates between good formatted dates, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the first dashes, but also the it requires to remove the last 3 characters -01 or -1. So that the corrected values are:
desired <- c("1.1.11","1.1.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05"))
What I'm strangling with is the -01 part, since by removing these, would also remove part of the correct formatted dates.
EDIT: The format is mm.dd.yy

Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"

Just remove all the non-word characters present at the start or -01 or -1 present at the end which was not preceded by -+ two digits.
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"

A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of data.frame.
The solution itself is just matching one pattern or the other pattern and then dropping the rest by replacing it with the subpattern.

I here observe that if the prefix of a date has an entry as -1 or --1 then only there exists a illegal suffix i.e -01.
You could first take all the values in array.
So you will have an array of "--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01"
Now you can check for the prefix if is it -1 or --1. if there exists any such thing then you can mark it as to remove the suffix -01 as well .
According to the input pattern above I feel that the above strategy would work.
Please let me know if the strategy works

Related

Length of location sequence (given as a string) in R

I have a list of geolocation sequences. Each element of my list is of the form :
> "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
I would like to be able to get the length of the sequence (4 in the example above). Could you please help me? I've tried to use regular expressions but I couldn't succeed.
Thank you in advance!

As a few folks suggested in the comments, you can count the # of times a specific character appears in your sequences. This assumes the data are well formed and consistent. For example:
library(stringr)
x <- "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
str_count(x, "\\[") - 1 #subtract 1 since there are two opening [
yields:
> str_count(x, "\\[") - 1
[1] 4

In case you do not want to load a library
str <- "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
nchar(str)-nchar(gsub("\\]", "", str))-1

You can use:
\[[^[]
And then count the matches.
Demo
In R:
> x <- "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
> length(gregexpr("\\[[^\\[]",x)[[1]])
[1] 4

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.

If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions; the start of the string position or the position at the end of the last match. \K resets the starting point of the reported match and any previously consumed characters are no longer included which is similar to a lookbehind.
Regular Expression Explanation

Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
how should I do this?
Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...

Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")

You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
# value Policy-cost Reed
#1 1 cost
#2 2 cost
#3 3 reed cost
#4 4 reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for characters followed by word boundary. The third and fourth elements does not match, so it was not replaced.

Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
# value Policy-cost Reed
# 1 1 cost
# 2 2 cost
# 3 3 reed cost
# 4 4 reed cost

Separate strings into groups of 2 characters separated by colon (1330 to 13:30) in R

How do I turn "1330" into "13:30", or "133000" into "13:30:00"? Essentially, I want to insert a colon between every pair of numbers. I'm trying to convert characters into times.
It seems like there should be a really elegant way to do this, but I can't think of it. I was thinking of using some combination of paste() and substr(), but an elegant solution is escaping me.
EDIT: example string that needs to be converted:
X <- c("120000", "120500", "121000", "121500", "122000", "122500", "123000") #example of noon to 12:30pm

This replaces each sequence of two characters not followed by a boundary with those same characters followed by a colon:
gsub("(..)\\B", "\\1:", X)
On the sample string it gives:
[1] "12:00:00" "12:05:00" "12:10:00" "12:15:00" "12:20:00" "12:25:00" "12:30:00"

You can use a regular expression with a positive lookahead:
gsub("(\\d{2})(?=\\d{2})", "\\1:", X, perl = TRUE)
# [1] "12:00:00" "12:05:00" "12:10:00" "12:15:00" "12:20:00" "12:25:00" "12:30:00"

Using substring:
test <- "1330"
paste(substring(test,seq(1,nchar(test)-1,2),seq(2,nchar(test),2)),collapse=":")
#[1] "13:30"
test <- "133000"
paste(substring(test,seq(1,nchar(test)-1,2),seq(2,nchar(test),2)),collapse=":")
#[1] "13:30:00"
Or if you want an actual time representation you could do:
test <- "1330"
as.POSIXct(test,format="%H%M")
#[1] "2013-05-09 13:30:00 EST"
Which you can reformat like:
format(as.POSIXct(test,format="%H%M"),"%H:%M")
#[1] "13:30"

Can do it with strptime in one step:
strptime(X, format="%H%M%S")
[1] "2013-05-08 12:00:00" "2013-05-08 12:05:00" "2013-05-08 12:10:00" "2013-05-08 12:15:00" "2013-05-08 12:20:00"
[6] "2013-05-08 12:25:00" "2013-05-08 12:30:00"
After the complaint about the dates in date-time objects, one can suppress that "artificial" reality with:
strftime( strptime(X, format="%H%M%S"), "%H:%M:%S" )
[1] "12:00:00" "12:05:00" "12:10:00" "12:15:00" "12:20:00" "12:25:00" "12:30:00"

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.

1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index

It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]

If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect #Wojciech's comment.

The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Correct wrongly formatted dates - regex

Here is a pretty simple solution using sub ... sub('^-+([^-]+).+', '\\1', df$col) # [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" # [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"

Related

Length of location sequence (given as a string) in R

R: how to convert part of a string to variable name and return its value in the same string?

conditional string splitting in R (using tidyr)

Separate strings into groups of 2 characters separated by colon (1330 to 13:30) in R

Regular expressions in R to erase all characters after the first space?

Categories

Resources