Replace repeating character with another repeated character - regex

I would like to replace 3 or more consecutive 0s in a string by consecutive 1s. Example: '1001000001' becomes '1001111111'.
In R, I wrote the following code:
gsub("0{3,}","1",reporting_line_string)
but obviously it replaces the 5 0s by a single 1. How can I get 5 1s ?
Thanks,

You can use the gsubfn function, which accepts a replacement function that is applied to the content matched by the regex.
require(gsubfn)
gsubfn("0{3,}", function (x) paste(replicate(nchar(x), "1"), collapse=""), input)
You can replace paste(replicate(nchar(x), "1"), collapse="") with stri_dup("1", nchar(x)) if you have the stringi package installed.
Or a more concise solution, as G. Grothendieck suggested in the comment:
gsubfn("0{3,}", ~ gsub(".", 1, x), input)
Alternatively, you can use the following regex in Perl mode to replace:
gsub("(?!\\A)\\G0|(?=0{3,})0", "1", input, perl=TRUE)
It is extensible to any minimum number of consecutive 0s by changing the 0{3,} part.
I personally don't endorse the use of this solution, though, since it is less maintainable.
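If the minimum run length needs to vary, one option is to build the Perl-mode pattern with sprintf. This is only a sketch: replace_zero_runs and min_run are illustrative names, not part of any package.

```r
# Hypothetical helper: replace every "0" that belongs to a run of at least
# `min_run` zeros with "1". Function and argument names are illustrative.
replace_zero_runs <- function(x, min_run = 3) {
  # (?=0{n,})0  matches the first 0 of a long-enough run;
  # (?!\A)\G0   matches each 0 that directly follows the previous match.
  pattern <- sprintf("(?!\\A)\\G0|(?=0{%d,})0", min_run)
  gsub(pattern, "1", x, perl = TRUE)
}

replace_zero_runs("1001000001")
# [1] "1001111111"
```

Keeping the threshold in one argument at least avoids editing the regex by hand each time.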

Here's an option that builds on your approach, but makes use of gregexpr and regmatches. There's probably a more DRY way to do this, but it's not coming to my mind right now....
x <- c("1001000001", "120000siw22000100")
x
# [1] "1001000001" "120000siw22000100"
a <- regmatches(x, gregexpr("0{3,}", x))
regmatches(x, gregexpr("0{3,}", x)) <- lapply(a, function(x) gsub("0", "1", x))
x
# [1] "1001111111" "121111siw22111100"
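A possibly more DRY variant of the same idea, assuming R 3.3.0 or later (where base strrep is available): compute the match data once and build each replacement from the length of the match.

```r
x <- c("1001000001", "120000siw22000100")

# Compute the match positions once and reuse them on both sides.
m <- gregexpr("0{3,}", x)

# Each matched run of zeros is replaced by a run of ones of the same length.
regmatches(x, m) <- lapply(regmatches(x, m), function(s) strrep("1", nchar(s)))
x
# [1] "1001111111" "121111siw22111100"
```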

For regex ignorants (like me), try some brute force. Split the string into single characters using strsplit, find consecutive runs of "0" using rle, create a vector of relevant indices (run lengths of "0" > 2) using rep, insert a "1" at the indices, paste to a single string.
x2 <- strsplit(x = "1001000001", split = "")[[1]]
r <- rle(x2 == "0")
idx <- rep(x = r$values & r$lengths > 2, times = r$lengths) # r$values keeps only runs of "0"
x2[idx] <- "1"
paste(x2, collapse = "")
# [1] "1001111111"
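The same brute-force steps can be wrapped so they work on a whole character vector. This is a sketch only; replace_runs and min_run are made-up names.

```r
# Hypothetical wrapper around the strsplit/rle steps, applied per string.
replace_runs <- function(s, min_run = 3) {
  vapply(strsplit(s, ""), function(chars) {
    r <- rle(chars == "0")
    # TRUE only for positions inside a run of "0" that is long enough
    idx <- rep(r$values & r$lengths >= min_run, times = r$lengths)
    chars[idx] <- "1"
    paste(chars, collapse = "")
  }, character(1))
}

replace_runs(c("1001000001", "abc0def000"))
# [1] "1001111111" "abc0def111"
```

Checking r$values as well as r$lengths matters here: without it, long runs of characters other than "0" would also be overwritten.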

0(?=00)|(?<=00)0|(?<=0)0(?=0)
You can try this regex and replace each match with 1 (in R, pass perl=TRUE, since it uses lookarounds). See the demo:
http://regex101.com/r/dP9rO4/5
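For reference, this is roughly how that lookaround pattern could be used from R. The default regex engine does not support lookarounds, so perl = TRUE is needed; the input vector is reused from the other answers.

```r
x <- c("1001000001", "120000siw22000100")

# A 0 is replaced when it is followed by two 0s, preceded by two 0s,
# or sits between two 0s, i.e. when it is part of a run of 3 or more.
res <- gsub("0(?=00)|(?<=00)0|(?<=0)0(?=0)", "1", x, perl = TRUE)
res
# [1] "1001111111" "121111siw22111100"
```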

Related

R use gsub as substr

I'm using H2O for some distributed computing work (via the h2o package in R). Many of the base R functions are present but I'm unable to find a suitable substitute for the substr function. I do have access to the sub and gsub functions and was hoping to possibly use some form of regex as a workaround.
I'm using the following code but not having any luck:
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1$var2 <- substr(df1$var1, 1,6)
df1$var3 <- gsub('\\d{1,8}','\\d{1,6}', df1$var1)
df1
The output in df1$var2 is what I'm looking for. Any suggestions?
EDIT:
Running this code:
library(h2o)
localH2O = h2o.init(nthreads = 2)
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1.hex <- as.h2o(localH2O , df1)
df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Gets this message:
> df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Error in as.character.default(x) :
no method for coercing this S4 class to a vector
Use capture groups:
gsub('(.+)..','\\1', df1$var1)
The regex (.+).. is matched against df1$var1, and each match is replaced with the substring captured by the first group (.+). Since there is .. at the end of the regex, the last two characters are not matched by the .+, so they are not in the result.
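Note that this pattern drops the last two characters rather than keeping the first six, so it only agrees with substr(x, 1, 6) when the input is 8 characters long. A small illustration (the second value is a made-up 10-digit example):

```r
v <- c("14102201", "1410220155")  # second value is hypothetical

dropped_last_two <- gsub("(.+)..", "\\1", v)
dropped_last_two
# [1] "141022"   "14102201"

first_six <- substr(v, 1, 6)
first_six
# [1] "141022" "141022"
```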
Capture the first 6 characters using a pattern that matches the whole string:
gsub('^(.{6}).*$','\\1', df1$var1)
A slightly more general replacement for substr(x,start,stop) is
if(start > 1) {
  gsub('^(.{*start-1*})(.{*stop-start+1*}).*$','\\2', 'asdfhjkl')
} else {
  gsub('^(.{*stop*}).*$','\\1', 'asdfhjkl')
}
where the values between the * characters are the actual integer values of the expression. (You'll have to make sure that nchar(x) is at least stop, though; otherwise the patterns won't match because the string is too short.)
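A runnable sketch of that general pattern, building the regex with sprintf instead of writing the integers by hand. substr_via_sub is a made-up name, and whether sprintf is available inside H2O is not something this sketch assumes; it only shows the pattern construction in plain R.

```r
# Hypothetical helper emulating substr(x, start, stop) with sub().
substr_via_sub <- function(x, start, stop) {
  pattern <- if (start > 1) {
    # skip start-1 characters, then capture stop-start+1 characters
    sprintf("^.{%d}(.{%d}).*$", start - 1, stop - start + 1)
  } else {
    # capture the first `stop` characters
    sprintf("^(.{%d}).*$", stop)
  }
  # Strings shorter than `stop` do not match and are returned unchanged.
  sub(pattern, "\\1", x)
}

substr_via_sub("asdfhjkl", 3, 6)
# [1] "dfhj"
```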
The regex (?<=^.{6}).*$ matches all characters after the first 6. If you want to replace substr(df1$var1, 1, 6) with sub, you can use this command:
sub('(?<=^.{6}).*$', '', df1$var1, perl = TRUE)
# [1] "141022" "141023" "141024" "141025" "141026" "141027" "141028" "141029"
# [9] "141030" "141031"
This command replaces all digits after the first 6 ones with the empty string.

strsplit on first instance [duplicate]

I would like to write a strsplit command that grabs the first ")" and splits the string.
For example:
f("12)34)56")
"12" "34)56"
I have read over several other related regex SO questions but I am afraid I am not able to make heads or tails of this. Thank you for any assistance.
You could get the same list-type result as you would with strsplit if you used regexpr to get the first match, and then the inverted result of regmatches.
x <- "12)34)56"
regmatches(x, regexpr(")", x), invert = TRUE)
# [[1]]
# [1] "12" "34)56"
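To get the f() from the question, the same call could be wrapped as below. This is a sketch for a single input string; the parenthesis is escaped here to make explicit that it is meant literally.

```r
# Split a single string at the first ")" only.
f <- function(s) {
  regmatches(s, regexpr("\\)", s), invert = TRUE)[[1]]
}

f("12)34)56")
# [1] "12"    "34)56"
```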
Need speed? Then go for stringi functions.
library(stringi)
x <- "12)34)56"
stri_split_fixed(str = x, pattern = ")", n = 2)
It might be safer to identify where the character is and then substring either side of it:
x <- "12)34)56"
spl <- regexpr(")",x)
substring(x,c(1,spl+1),c(spl-1,nchar(x)))
#[1] "12" "34)56"
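If the character might be absent, regexpr returns -1 and the substring call above would give odd pieces. A guarded version could look like this (split_first is an illustrative name, and the sketch handles one string at a time):

```r
# Hypothetical helper: split one string at the first occurrence of `ch`.
split_first <- function(x, ch = ")") {
  pos <- regexpr(ch, x, fixed = TRUE)  # fixed = TRUE: treat `ch` literally
  if (pos == -1) return(x)             # no separator: return input unchanged
  substring(x, c(1, pos + 1), c(pos - 1, nchar(x)))
}

split_first("12)34)56")
# [1] "12"    "34)56"
split_first("1234")
# [1] "1234"
```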
Another option is to use str_split in the package stringr:
library(stringr)
f <- function(string)
{
unlist(str_split(string,"\\)",n=2))
}
> f("12)34)56")
[1] "12" "34)56"
Replace the first ) with the non-printing character "\01" and then strsplit on that. You can use any character you like in place of "\01" as long as it does not appear in the string.
strsplit(sub(")", "\01", "12)34)56"), "\01")

Extract substring in R from string with fixed start position and end point as a character found

I want to do the following extraction in R.
I have a column which has links like these
http://www.imdb.com/title/tt2569314/companycredits
I want to extract the tt2569314 out of this and store it in a new column.
The way I want to do it is, say, take substring of column where start position is LEN(http://www.imdb.com/) and end position is dynamic based on when the first '/' is found after the start position.
I want this to be kind of a mixture of SUBSTR and INSTR in SQL.
Please advise.
You could try this:
a<-"http://www.imdb.com/title/tt2569314/companycredits"
sub("http://www.imdb.com/.+/(.+)/.+","\\1" ,a)
#[1] "tt2569314"
If all the links are similar in path structure, you can use dirname:
x <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("(.*)[/]", "", dirname(x))
# [1] "tt2569314"
Or you can paste together a regular expression with the base URL
y <- "http://www.imdb.com"
sub(paste0(y, "[/](.*)[/](.*)[/](.*)"), "\\2", x)
# [1] "tt2569314"
Or you may even be able to get away with this:
basename(dirname(x))
# [1] "tt2569314"
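Since both functions are vectorised, this could be applied to a whole column to store the result in a new one. A sketch with a made-up data frame (the second URL and the column names are invented for illustration):

```r
links <- data.frame(
  url = c("http://www.imdb.com/title/tt2569314/companycredits",
          "http://www.imdb.com/title/tt0000001/fullcredits"),  # second URL is hypothetical
  stringsAsFactors = FALSE  # dirname() needs character input
)

# Drop the last path component, then keep the last component of what remains.
links$title_id <- basename(dirname(links$url))
links$title_id
# [1] "tt2569314" "tt0000001"
```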
It's a bit more drawn out if you use the substring. But stringr has a couple of helpful functions.
library(stringr)
s1 <- str_locate_all(x, "[/]")[[1]]
s2 <- str_locate(x, "http://www.imdb.com/title")
m <- match(s2[,2]+1, s1[,1])
substr(x, s1[m,1]+1, s1[m+1,1]-1)
# [1] "tt2569314"
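A base R sketch closer to the SUBSTR/INSTR idea from the question, assuming every link starts with the same known prefix (here taken to include "title/", which is an assumption about the data):

```r
x <- "http://www.imdb.com/title/tt2569314/companycredits"
prefix <- "http://www.imdb.com/title/"  # assumed common prefix

# Everything after the fixed-length prefix (like SUBSTR with a start position)
rest <- substring(x, nchar(prefix) + 1)

# Position of the first "/" in the remainder (like INSTR)
slash <- regexpr("/", rest, fixed = TRUE)

id <- substr(rest, 1, slash - 1)
id
# [1] "tt2569314"
```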
You could try:
str1 <- "http://www.imdb.com/title/tt2569314/companycredits"
library(httr)
gsub("^[^/]*\\/|\\/[^/]*", "", parse_url(str1)$path)
#[1] "tt2569314"
You may try this also,
> x <- "http://www.imdb.com/title/tt2569314/companycredits"
> m <- regexpr("^http://www.imdb.com/[^/]*/\\K[^/]+", x, perl=TRUE)
> regmatches(x, m)
[1] "tt2569314"

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
how should I do this?
Edit: Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
#  value Policy-cost Reed
#1     1             cost
#2     2             cost
#3     3        reed cost
#4     4        reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for letters followed by another word boundary. The third and fourth elements do not match, because the underscore counts as a word character and so no boundary follows "reed"; they are left unchanged.
Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
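The "grep for the presence of _ and construct the columns manually" route mentioned in the question might look like this in base R. It is a sketch only; the column names prefix and label are invented here and it assumes at most one underscore matters (everything before the first one is the prefix).

```r
X <- data.frame(value = c(1, 2, 3, 4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"),
                stringsAsFactors = FALSE)

has_sep <- grepl("_", X$variable, fixed = TRUE)

# Part before the first "_", or an empty string when there is no separator
X$prefix <- ifelse(has_sep, sub("_.*$", "", X$variable), "")

# Part after the first "_"; strings without "_" are left as they are
X$label <- sub("^[^_]*_", "", X$variable)

X[, c("value", "prefix", "label")]
```

Nothing here depends on the specific variable names, so it should carry over to other labels that follow the same prefix_label shape.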

Easy way to find and replace dynamic values ( {{example}} ) via regex in R

I have some dynamic values obtained from json of the format {{example_value}}. I have some R code which calculates the actual value. However, the only solution I have found to replace the placeholder with the actual value is very long and ugly.
Does anyone have any neat solutions?
Example of replacing {{example_value}} with 5.5:
> gsub( gsub("\\}","\\\\}",gsub("\\{","\\\\{","{{example_value}}")),
5.5, "{{example_value}}")
[1] "5.5"
Another example which explains why I wrote the nested gsub:
dictionary = "{{example_value}}"
> gsub( gsub("\\}","\\\\}",gsub("\\{","\\\\{",dictionary)),
5.5, "{{example_value}}")
[1] "5.5"
Typically dictionary is a list which contains all the dynamic values I expect to replace.
You can use this:
gsub("{{example_value}}", "5.5", subject, perl=TRUE);
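Another way to avoid escaping the braces, shown here as a sketch, is fixed = TRUE, which makes gsub treat the pattern as a literal string rather than a regex. The template text is made up for illustration.

```r
template <- "The total is {{example_value}} dollars"  # hypothetical input

# fixed = TRUE: the braces are matched literally, no escaping needed
filled <- gsub("{{example_value}}", "5.5", template, fixed = TRUE)
filled
# [1] "The total is 5.5 dollars"
```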
While @zx81's suggestion seems most appropriate for a direct replacement, you could also work with regular expressions to pull out tags in braces.
a<-"The total is {{example}} dollars less"
m <- regexpr("{{([^}]+)}}", a, perl=T)
regmatches(a, m)
# [1] "{{example}}"
And then regmatches has a nice feature where you can easily replace matches
regmatches(a, m) <- 5.5
a
# [1] "The total is 5.5 dollars less"
Which is kind of a neat trick.
EDIT: Perhaps this may lead you to what you're looking for.
re <- c('{{foo}}', '{{bar}}')
val <- c('5.5', '1.1')
recurse <- function(pattern, repl, x) {
for (i in 1:length(pattern))
x <- gsub(pattern[i], repl[i], x, perl=T)
x
}
x <- 'I have {{foo}} and {{bar}}'
recurse(re, val, x)
# [1] "I have 5.5 and 1.1"
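If the dictionary is kept as a named vector of bare keys, the placeholders can be built from the names. A sketch under that assumption (fill_placeholders and the key names are invented for illustration):

```r
# Hypothetical dictionary: names are the keys that appear inside {{ }}
values <- c(example_value = "5.5", other_value = "1.1")

fill_placeholders <- function(text, values) {
  for (key in names(values)) {
    placeholder <- paste0("{{", key, "}}")
    # fixed = TRUE so the braces are matched literally
    text <- gsub(placeholder, values[[key]], text, fixed = TRUE)
  }
  text
}

fill_placeholders("I have {{example_value}} and {{other_value}}", values)
# [1] "I have 5.5 and 1.1"
```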