R use gsub as substr - regex

I'm using H2O for some distributed computing work (via the h2o package in R). Many of the base R functions are present but I'm unable to find a suitable substitute for the substr function. I do have access to the sub and gsub functions and was hoping to possibly use some form of regex as a workaround.
I'm using the following code but not having any luck:
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1$var2 <- substr(df1$var1, 1,6)
df1$var3 <- gsub('\\d{1,8}','\\d{1,6}', df1$var1)
df1
The output in df1$var2 is what I'm looking for. Any suggestions?
EDIT:
Running this code:
library(h2o)
localH2O = h2o.init(nthreads = 2)
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1.hex <- as.h2o(localH2O , df1)
df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Gets this message:
> df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Error in as.character.default(x) :
no method for coercing this S4 class to a vector

Use capture groups:
gsub('(.+)..','\\1', df1$var1)
This regex matches (.+).. against df1$var1 and replaces the match with the substring captured by the first group (.+). Since there is .. at the end of the regex, the last two characters are not matched by the .+, so they are not in the result.
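One caveat is worth illustrating: this pattern drops the last two characters rather than keeping the first six, so it agrees with substr only when every value has exactly eight characters. A minimal sketch, using a hypothetical vector x that is not part of the question's data:

```r
# Hypothetical input: one 8-character and one 6-character string
x <- c("14102201", "141022")

# Drops the last two characters, whatever the length of the string
drop_last_two <- gsub("(.+)..", "\\1", x)

# Keeps the first six characters, whatever the length of the string
first_six <- substr(x, 1, 6)

print(drop_last_two)
print(first_six)
```

For the eight-digit values in df1$var1 the two approaches give the same result.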

Capture the first 6 characters like so, using a pattern that matches the whole string:
gsub('^(.{6}).*$','\\1', df1$var1)
A slightly more general replacement for substr(x, start, stop) is
if(start > 1)
gsub('^(.{*start-1*})(.{*stop-start+1*}).*$','\\2', 'asdfhjkl')
else
gsub('^(.{*stop*}).*$','\\1', 'asdfhjkl')
where the values between the * characters are the actual integer values of the expressions. (You'll have to make sure that nchar(x) is at least stop, otherwise the patterns won't match because the string is too short, and x is returned unchanged.)
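The placeholder patterns above can be filled in programmatically with sprintf. The helper below is only a sketch: the name substr_via_sub is made up for illustration, it has only been reasoned through for base R's sub with perl = TRUE, and whether H2O's own sub/gsub accept the same pattern syntax is not verified here.

```r
# Hypothetical helper: emulate substr(x, start, stop) with a single sub() call.
# Assumes start >= 1, stop >= start, and nchar(x) >= stop for every element.
substr_via_sub <- function(x, start, stop) {
  skip <- as.integer(start) - 1L                   # characters to discard
  keep <- as.integer(stop) - as.integer(start) + 1L  # characters to keep
  pattern <- sprintf("^.{%d}(.{%d}).*$", skip, keep)
  sub(pattern, "\\1", x, perl = TRUE)
}

df1 <- data.frame(id = 1:10, var1 = seq(14102201, 14103200, 100))
var3 <- substr_via_sub(df1$var1, 1, 6)
middle <- substr_via_sub("asdfhjkl", 3, 5)
print(var3)
print(middle)
```

Strings shorter than stop do not match the pattern and come back unchanged, which differs from substr, so that case needs separate handling if it can occur.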

The regex (?<=^.{6}).*$ matches all characters after the first 6. If you want to replace substr(df1$var1, 1, 6) with sub, you can use this command:
sub('(?<=^.{6}).*$', '', df1$var1, perl = TRUE)
# [1] "141022" "141023" "141024" "141025" "141026" "141027" "141028" "141029"
# [9] "141030" "141031"
This command replaces all digits after the first 6 ones with the empty string.

Related

Correct wrongly formatted dates

I have some incorrect dates between good formatted dates, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the leading dashes, but the trailing -01 or -1 also needs to be removed, so that the corrected values are:
desired <- c("1.1.11","1.11.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05")
What I'm struggling with is the -01 part, since removing it would also remove part of the correctly formatted dates.
EDIT: The format is mm.dd.yy
Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
Just remove all the non-word characters at the start, together with a trailing -01 or -1 that is not preceded by a dash and two digits (which would make it part of a correctly formatted date).
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of data.frame.
The solution itself is just matching one pattern or the other pattern and then dropping the rest by replacing it with the subpattern.
I observe that an illegal suffix (-01 or -1) appears only when the date also has a prefix of one or more dashes, e.g. -1 or --1.
You could first put all the values in a vector.
So you will have a vector of "--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01"
Now check each value for a leading dash. If there is one, mark the value so that its suffix -01 or -1 is removed as well.
Given the input pattern above, I think this strategy should work.
Please let me know if the strategy works
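The prefix-based strategy described above could look roughly like this in base R. It is a sketch under the assumption stated in that answer, namely that only values starting with a dash carry the unwanted suffix; the variable names has_prefix and cleaned are made up for illustration.

```r
x <- c("--1.1.11-01", "--1.11.12-1", "--1.1.13-01", "--1.1.14-01",
       "--1.10.10-01", "-1.10.11-01", "---1.10.12-01",
       "2010-03-31", "2010-04-01", "2010-04-05")

# Flag the values that start with at least one dash
has_prefix <- grepl("^-", x)

# For flagged values only: strip the leading dashes, then the trailing -01 or -1
cleaned <- x
cleaned[has_prefix] <- sub("-0?1$", "", sub("^-+", "", x[has_prefix]))

print(cleaned)
```

Because the correctly formatted dates are never touched, their trailing -01 is safe without any lookbehind.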

Replace repeating character with another repeated character

I would like to replace 3 or more consecutive 0s in a string by consecutive 1s. Example: '1001000001' becomes '1001111111'.
In R, I wrote the following code:
gsub("0{3,}","1",reporting_line_string)
but obviously it replaces the 5 0s by a single 1. How can I get 5 1s ?
Thanks,
You can use the gsubfn function, which lets you supply a replacement function that is applied to the content matched by the regex.
require(gsubfn)
gsubfn("0{3,}", function (x) paste(replicate(nchar(x), "1"), collapse=""), input)
You can replace paste(replicate(nchar(x), "1"), collapse="") with stri_dup("1", nchar(x)) if you have stringi package installed.
Or a more concise solution, as G. Grothendieck suggested in the comment:
gsubfn("0{3,}", ~ gsub(".", 1, x), input)
Alternatively, you can use the following regex in Perl mode to replace:
gsub("(?!\\A)\\G0|(?=0{3,})0", "1", input, perl=TRUE)
It is extensible to any number of consecutive 0 by changing the 0{3,} part.
I personally don't endorse the use of this solution, though, since it is less maintainable.
Here's an option that builds on your approach, but makes use of gregexpr and regmatches. There's probably a more DRY way to do this, but it's not coming to my mind right now....
x <- c("1001000001", "120000siw22000100")
x
# [1] "1001000001" "120000siw22000100"
a <- regmatches(x, gregexpr("0{3,}", x))
regmatches(x, gregexpr("0{3,}", x)) <- lapply(a, function(x) gsub("0", "1", x))
x
# [1] "1001111111" "121111siw22111100"
For regex ignorants (like me), try some brute force. Split the string into single characters using strsplit, find consecutive runs of "0" using rle, create a vector of relevant indices (run lengths of "0" > 2) using rep, insert a "1" at the indices, paste to a single string.
x2 <- strsplit(x = "1001000001", split = "")[[1]]
r <- rle(x2 == "0")
idx <- rep(x = r$values & r$lengths > 2, times = r$lengths)  # r$values keeps only runs of "0"
x2[idx] <- "1"
paste(x2, collapse = "")
# [1] "1001111111"
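The rle steps above can be wrapped into a function that works on a whole character vector. This is only a sketch: the name replace_zero_runs and the min_run argument are made up for illustration, and the function also checks that a run really consists of zeros, so long runs of other characters are left alone.

```r
# Hypothetical helper: replace every run of at least `min_run` zeros with ones.
replace_zero_runs <- function(x, min_run = 3) {
  vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    r <- rle(chars == "0")
    # TRUE only for positions inside a sufficiently long run of "0"
    idx <- rep(r$values & r$lengths >= min_run, times = r$lengths)
    chars[idx] <- "1"
    paste(chars, collapse = "")
  }, character(1))
}

res <- replace_zero_runs(c("1001000001", "120000siw22000100"))
print(res)
```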
0(?=00)|(?<=00)0|(?<=0)0(?=0)
You can try this. Replace by 1. See demo.
http://regex101.com/r/dP9rO4/5
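In R, the lookaround pattern above needs perl = TRUE. A minimal sketch on the same test strings used in the earlier answer; note that the pattern is hard-wired to runs of three or more and would have to be rewritten for a different threshold.

```r
x <- c("1001000001", "120000siw22000100")

# A zero is replaced if it is followed by two zeros, preceded by two zeros,
# or sits between two zeros, i.e. if it belongs to a run of length >= 3.
res <- gsub("0(?=00)|(?<=00)0|(?<=0)0(?=0)", "1", x, perl = TRUE)
print(res)
```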

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
how should I do this?
Edit: Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost), so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
#  value Policy-cost Reed
#1     1             cost
#2     2             cost
#3     3        reed cost
#4     4        reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for letters followed by another word boundary. The third and fourth elements do not match (the underscore counts as a word character, so no boundary follows "reed"), so they are not changed.
Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
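For comparison, the "check for _ and build the columns manually" fallback mentioned in the question can be written in a few lines of base R. This is a sketch only; it reuses the column names Type and Model from the question's tidyr example and assumes at most one meaningful split at the first underscore.

```r
X <- data.frame(value = c(1, 2, 3, 4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"),
                stringsAsFactors = FALSE)

has_sep <- grepl("_", X$variable, fixed = TRUE)

# Text before the first "_" if there is one, otherwise an empty string
X$Type <- ifelse(has_sep, sub("_.*$", "", X$variable), "")

# Text after the first "_"; labels without "_" are returned unchanged
X$Model <- sub("^[^_]*_", "", X$variable)

X$variable <- NULL
print(X)
```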

extract partial string based on pattern in r

I would like to extract a partial string from each element of a list. I don't know how to define the pattern of the strings. Thank you for your help.
library(stringr)
names = c("GAPIT..flowerdate.GWAS.Results.csv","GAPIT..flwrcolor.GWAS.Results.csv",
"GAPIT..height.GWAS.Results.csv","GAPIT..matdate.GWAS.Results.csv")
# I want to extract out "flowerdate", "flwrcolor", "height" and "matdate"
traits <- str_extract_all(string = names, pattern = "..*.")
# the result is not what I want.
You can also use regmatches
> regmatches(c, regexpr("[[:lower:]]+", c))
[1] "flowerdate" "flwrcolor" "height" "matdate"
I encourage you not to use c as a variable name, because it shadows the name of the base c function.
I borrow the answer from Roman Luštrik for my previous question “How to extract out a partial name as new column name in a data frame”
traits <- unlist(lapply(strsplit(names, "\\."), "[[", 3))
Use sub:
sub(".*\\.{2}(.+?)\\..*", "\\1", names)
# [1] "flowerdate" "flwrcolor" "height" "matdate"
Here are a few solutions. The first two do not use regular expressions at all. The last one uses a single gsub:
1) read.table. This assumes the desired string is always the 3rd field:
read.table(text = names, sep = ".", as.is = TRUE)[[3]]
2) strsplit This assumes the desired string has more than 3 characters and is lower case:
sapply(strsplit(names, "[.]"), Filter, f = function(x) nchar(x) > 3 & tolower(x) == x)
3) gsub This assumes that two dots precede the string and one dot plus junk not containing two successive dots comes afterwards:
gsub(".*[.]{2}|[.].*", "", names)
REVISED Added additional solutions.
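The pattern in the question fails because an unescaped dot matches any character, so "..*." matches almost the entire string. Here is a sketch of a direct extraction with escaped dots and a lookbehind; the vector is called file_names here (a made-up name) to avoid shadowing base names(), and the approach assumes the trait always follows the first pair of dots.

```r
file_names <- c("GAPIT..flowerdate.GWAS.Results.csv", "GAPIT..flwrcolor.GWAS.Results.csv",
                "GAPIT..height.GWAS.Results.csv", "GAPIT..matdate.GWAS.Results.csv")

# (?<=\.\.) requires two literal dots just before the match;
# [^.]+ then takes everything up to the next dot.
traits <- regmatches(file_names,
                     regexpr("(?<=\\.\\.)[^.]+", file_names, perl = TRUE))
print(traits)
```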

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
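For readers coming from the Excel formula in the question, a fairly literal translation of MID/FIND is also possible with substr and regexpr. This is a sketch that assumes every element contains at least one space; when there is none, regexpr returns -1 and the result is an empty string.

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# regexpr() gives the position of the first space, like FIND(" ", ...);
# substr() then plays the role of MID(..., 1, position - 1).
first_space <- regexpr(" ", x, fixed = TRUE)
ids <- substr(x, 1, first_space - 1)
print(ids)
```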
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
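A rough way to time the candidates is system.time on a larger input. The sketch below repeats the three example strings to get a vector of hypothetical size; the timings depend on the machine and R version, so no particular result is claimed.

```r
# Hypothetical benchmark input: the three example strings repeated many times
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), times = 10000)

t_sub <- system.time(res_sub <- sub(" .*", "", x))
t_split <- system.time(
  res_split <- vapply(strsplit(x, " ", fixed = TRUE), `[`, character(1), 1)
)

# Both approaches should agree before their timings are compared
stopifnot(identical(res_sub, res_split))
print(c(sub = t_sub[["elapsed"]], strsplit = t_split[["elapsed"]]))
```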