Replace character string elements by indices efficiently in R - regex

I would like to efficiently replace elements in my character object with other particular elements in particular places (these places are indices which I know as they are results of the gregexpr function).
I would like some foo function that works like:
foo("qwerty", c(1,3,5), c("z", "x", "y"))
giving me:
[1] "zwxryy"
I searched the stringr package's CRAN PDF manual but found nothing suitable. Thank you in advance for any suggestions.

For example:
xx <- unlist(strsplit("qwerty",""))
xx[c(1,3,5)] <- c("z", "x", "y")
paste0(xx,collapse='')
[1] "zwxryy"
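The same idea can be wrapped in a small helper that mirrors the foo call from the question. This is only a sketch: the name replace_at is made up here, and it assumes a single input string with idx and repl of equal length.

```r
# Hypothetical helper (not from any package): replace the characters of one
# string at the given positions with the given single-character replacements.
replace_at <- function(s, idx, repl) {
  chars <- strsplit(s, "")[[1]]    # split into single characters
  chars[idx] <- repl               # overwrite the chosen positions
  paste0(chars, collapse = "")     # glue back into one string
}

replace_at("qwerty", c(1, 3, 5), c("z", "x", "y"))
# [1] "zwxryy"
```

For a vector of strings, the helper would have to be applied element by element, for example with mapply.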

You could also try the approach below if you don't have many characters to replace:
st1 <- "qwerty"
gsub("^.(.).(.).","z\\1x\\2y", st1)
#[1] "zwxryy"

The stringi package has a stri_sub function that works like this:
library(stringi)
a <- "12345"
stri_sub(a, from=c(1,3,5),len=1) <- letters[c(1,3,5)]
a
## [1] "a2345" "12c45" "1234e"
It's almost what you want. Just use it in a loop:
a <- "12345"
for(i in c(1,3,5)){
stri_sub(a, from=i,len=1) <- letters[i]
}
a
## [1] "a2c4e"
Be aware that this kind of function is on our TODO list, check:
https://github.com/Rexamine/stringi/issues?state=open
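If stringi is not available, the same loop can be sketched in base R with the replacement form of substr. This is an alternative to the stri_sub loop above, not part of the stringi answer, and it assumes each replacement is exactly one character long.

```r
a <- "12345"
for (i in c(1, 3, 5)) {
  # substr<- overwrites characters in place without changing the string length
  substr(a, i, i) <- letters[i]
}
a
# [1] "a2c4e"
```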


Replace repeating character with another repeated character

I would like to replace 3 or more consecutive 0s in a string by consecutive 1s. Example: '1001000001' becomes '1001111111'.
In R, I wrote the following code:
gsub("0{3,}","1",reporting_line_string)
but obviously it replaces the five 0s with a single 1. How can I get five 1s?
Thanks,
You can use the gsubfn function, which accepts a replacement function that is applied to the content matched by the regex.
require(gsubfn)
gsubfn("0{3,}", function (x) paste(replicate(nchar(x), "1"), collapse=""), input)
You can replace paste(replicate(nchar(x), "1"), collapse="") with stri_dup("1", nchar(x)) if you have the stringi package installed.
Or a more concise solution, as G. Grothendieck suggested in the comment:
gsubfn("0{3,}", ~ gsub(".", 1, x), input)
Alternatively, you can use the following regex in Perl mode to replace:
gsub("(?!\\A)\\G0|(?=0{3,})0", "1", input, perl=TRUE)
It can be extended to any minimum number of consecutive 0s by changing the 0{3,} part.
I personally don't endorse the use of this solution, though, since it is less maintainable.
Here's an option that builds on your approach, but makes use of gregexpr and regmatches. There's probably a more DRY way to do this, but it's not coming to my mind right now....
x <- c("1001000001", "120000siw22000100")
x
# [1] "1001000001" "120000siw22000100"
a <- regmatches(x, gregexpr("0{3,}", x))
regmatches(x, gregexpr("0{3,}", x)) <- lapply(a, function(x) gsub("0", "1", x))
x
# [1] "1001111111" "121111siw22111100"
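A slightly shorter variant of the same gregexpr/regmatches idea is sketched below. It swaps the inner gsub for strrep, which builds a run of 1s as long as each match; strrep is assumed to be available (it is in base R from version 3.3.0).

```r
x <- c("1001000001", "120000siw22000100")
m <- gregexpr("0{3,}", x)    # locate each run of three or more 0s once
# Replace every matched run with a run of 1s of the same length
regmatches(x, m) <- lapply(regmatches(x, m),
                           function(run) strrep("1", nchar(run)))
x
# [1] "1001111111"        "121111siw22111100"
```

Computing the match data once and reusing it avoids calling gregexpr twice.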
For regex ignorants (like me), try some brute force. Split the string into single characters using strsplit, find consecutive runs of "0" using rle, create a vector of relevant indices (run lengths of "0" > 2) using rep, insert a "1" at the indices, paste to a single string.
x2 <- strsplit(x = "1001000001", split = "")[[1]]
r <- rle(x2 == "0")
idx <- rep(x = r$lengths > 2, times = r$lengths)
x2[idx] <- "1"
paste(x2, collapse = "")
# [1] "1001111111"
0(?=00)|(?<=00)0|(?<=0)0(?=0)
You can try this. Replace each match with 1. See the demo:
http://regex101.com/r/dP9rO4/5
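The pattern above relies on lookahead and lookbehind, so in R it needs perl = TRUE. A minimal sketch of how it might be called, reusing the inputs from the other answers (the variable names are illustrative only):

```r
x <- c("1001000001", "120000siw22000100")
# A 0 is replaced when it is followed by 00, preceded by 00,
# or sits between two 0s, i.e. when it belongs to a run of at least three.
pattern <- "0(?=00)|(?<=00)0|(?<=0)0(?=0)"
out <- gsub(pattern, "1", x, perl = TRUE)
out
# [1] "1001111111"        "121111siw22111100"
```

Unlike 0{3,}, this pattern is tied to a minimum run length of three; a different threshold would need the lookarounds rewritten.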

strsplit on first instance [duplicate]

I would like to write a strsplit command that grabs the first ")" and splits the string.
For example:
f("12)34)56")
"12" "34)56"
I have read over several other related regex SO questions, but I am afraid I am not able to make heads or tails of this. Thank you for any assistance.
You could get the same list-type result as you would with strsplit if you used regexpr to get the first match, and then the inverted result of regmatches.
x <- "12)34)56"
regmatches(x, regexpr(")", x), invert = TRUE)
# [[1]]
# [1] "12" "34)56"
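To get the f() from the question, the call above could be wrapped as follows. This is a sketch: split_first is a made-up name, and fixed = TRUE is an added choice so that the separator is treated as a literal string rather than a regex.

```r
# Hypothetical helper: split each string at the first occurrence of sep only
split_first <- function(x, sep) {
  regmatches(x, regexpr(sep, x, fixed = TRUE), invert = TRUE)
}

res <- split_first(c("12)34)56", "a)b"), ")")
res
# [[1]]
# [1] "12"    "34)56"
#
# [[2]]
# [1] "a" "b"
```

Because regexpr and regmatches are vectorized, the helper accepts a whole character vector and returns one list element per input string.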
Need speed? Then go for stringi functions. See timings e.g. here.
library(stringi)
x <- "12)34)56"
stri_split_fixed(str = x, pattern = ")", n = 2)
It might be safer to identify where the character is and then substring either side of it:
x <- "12)34)56"
spl <- regexpr(")",x)
substring(x,c(1,spl+1),c(spl-1,nchar(x)))
#[1] "12" "34)56"
Another option is to use str_split in the package stringr:
library(stringr)
f <- function(string)
{
unlist(str_split(string,"\\)",n=2))
}
> f("12)34)56")
[1] "12" "34)56"
Replace the first ) with the non-printing character "\01" and then strsplit on that. You can use any character you like in place of "\01" as long as it does not appear in the string.
strsplit(sub(")", "\01", "12)34)56"), "\01")

Extract substring in R from string with fixed start position and end point as a character found

I want to do the following extraction in R.
I have a column which has links like these
http://www.imdb.com/title/tt2569314/companycredits
I want to extract the tt2569314 out of this and store it in a new column.
The way I want to do it is, say, take substring of column where start position is LEN(http://www.imdb.com/) and end position is dynamic based on when the first '/' is found after the start position.
I want this to be kind of a mixture of SUBSTR and INSTR in SQL.
Please advise.
You could try this:
a<-"http://www.imdb.com/title/tt2569314/companycredits"
sub("http://www.imdb.com/.+/(.+)/.+","\\1" ,a)
#[1] "tt2569314"
If all the links are similar in path structure, you can use the dirname
x <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("(.*)[/]", "", dirname(x))
# [1] "tt2569314"
Or you can paste together a regular expression with the base URL
y <- "http://www.imdb.com"
sub(paste0(y, "[/](.*)[/](.*)[/](.*)"), "\\2", x)
# [1] "tt2569314"
Or you may even be able to get away with this:
basename(dirname(x))
# [1] "tt2569314"
It's a bit more drawn out if you use the substring. But stringr has a couple of helpful functions.
library(stringr)
s1 <- str_locate_all(x, "[/]")[[1]]
s2 <- str_locate(x, "http://www.imdb.com/title")
m <- match(s2[,2]+1, s1[,1])
substr(x, s1[m,1]+1, s1[m+1,1]-1)
# [1] "tt2569314"
You could try:
str1 <- "http://www.imdb.com/title/tt2569314/companycredits"
library(httr)
gsub("^[^/]*\\/|\\/[^/]*", "", parse_url(str1)$path)
#[1] "tt2569314"
You may try this also,
> x <- "http://www.imdb.com/title/tt2569314/companycredits"
> m <- regexpr("^http://www.imdb.com/[^/]*/\\K[^/]+", x, perl=TRUE)
> regmatches(x, m)
[1] "tt2569314"
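For something closer to the SUBSTR/INSTR combination described in the question, a base R sketch might look like this. It assumes the fixed prefix also includes "title/" (otherwise the first "/" after the host would end the match at "title"); the variable names are illustrative only.

```r
x <- "http://www.imdb.com/title/tt2569314/companycredits"
prefix <- "http://www.imdb.com/title/"   # assumed fixed start of every link

# Drop the fixed prefix (the SUBSTR part)
rest <- substring(x, nchar(prefix) + 1)
# Find the first "/" in what remains (the INSTR part)
end <- regexpr("/", rest, fixed = TRUE) - 1
id <- substr(rest, 1, end)
id
# [1] "tt2569314"
```

All three functions are vectorized, so x can be a whole column. If a link has no "/" after the prefix, regexpr returns -1 and the result for that element is an empty string.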

Split on first comma in string

How can I efficiently split the following string on the first comma using base?
x <- "I want to split here, though I don't want to split elsewhere, even here."
strsplit(x, ???)
Desired outcome (2 strings):
[[1]]
[1] "I want to split here" "though I don't want to split elsewhere, even here."
Thank you in advance.
EDIT: I didn't think to mention this. It needs to generalize to a column or vector of strings like this, as in:
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
The outcome can be two columns, one long vector (from which I can take every other element), or a list of strings with each index ([[n]]) holding two strings.
Apologies for the lack of clarity.
Here's what I'd probably do. It may seem hacky, but since sub() and strsplit() are both vectorized, it will also work smoothly when handed multiple strings.
XX <- "SoMeThInGrIdIcUlOuS"
strsplit(sub(",\\s*", XX, x), XX)
# [[1]]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
From the stringr package:
str_split_fixed(x, pattern = ', ', n = 2)
# [,1]
# [1,] "I want to split here"
# [,2]
# [1,] "though I don't want to split elsewhere, even here."
(That's a matrix with one row and two columns.)
Here is yet another solution, with a regular expression to capture what is before and after the first comma.
x <- "I want to split here, though I don't want to split elsewhere, even here."
library(stringr)
str_match(x, "^(.*?),\\s*(.*)")[,-1]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
library(stringr)
str_sub(x,end = min(str_locate(string=x, ',')-1))
This will get the first bit you want. Change the start= and end= in str_sub to get whatever else you want.
Such as:
str_sub(x,start = min(str_locate(string=x, ',')+1 ))
and wrap in str_trim to get rid of the leading space:
str_trim(str_sub(x,start = min(str_locate(string=x, ',')+1 )))
This works but I like Josh Obrien's better:
y <- strsplit(x, ",")
sapply(y, function(x) data.frame(x = x[1],
    z = paste(x[-1], collapse = ",")), simplify = FALSE)
Inspired by chase's response.
A number of people gave non-base approaches, so I figured I'd add the one I usually use (though in this case I needed a base response):
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
library(reshape2)
colsplit(y, ",", c("x","z"))
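For the two-column outcome mentioned in the question, a base-only sketch is to call sub twice, once to keep what precedes the first comma and once to keep what follows it. The column names first and rest are arbitrary choices made here.

```r
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")

res <- data.frame(
  first = sub(",.*$", "", y),          # drop everything from the first comma on
  rest  = sub("^[^,]*,\\s*", "", y),   # drop everything up to the first comma
  stringsAsFactors = FALSE
)
res$first
# [1] "Here's comma 1"   "Here's 2nd sting"
res$rest
# [1] "and 2, see?"          "like it, not a lot."
```

A string without any comma is left unchanged by both calls, so it would appear in both columns; check for that case if it can occur in the data.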

conditional strsplit

I have a dataframe where one of the columns contains a set of names. I would like to stringsplit a portion of the column names and have done so as follows:
DF$newname <- sapply(strsplit(as.character(DF$oldname), "_"), '[', 5)
In this example the fifth part of the split contains the name part of the character string. The problem is that this dataset contains $oldname names in different formats. In the first format the name is as follows, where xxx are numbers:
xxx_xxx_xxx_xxx_name_xx (name is in fifth position)
and the second format the $oldname looks like this
xxx_xxx_xxx_xxx_xxx_name_xx (name is in sixth position)
I was thinking that I could use an ifelse command from within a function but am running into a little bit of trouble with the following code:
namesplit = function(df){
x <- strsplit(as.character(df$oldname), "_"), '[', 5)
y <- strsplit(as.character(df$oldname), "_"), '[', 6)
ifelse(is.character(x),x,y) }
DF$newname <- sapply(DF,namesplit)
This code doesn't work, as I know I can't use the [ in this way, but I am not sure of the best way. While I think I could get this working within a for loop, I would prefer to extract the names in a way that would allow me to use an apply.
thanks.
You can easily do this using gsub
names <- c('xxx_xxx_xxx_xxx_xxx_name1_xx', 'xxx_xxx_xxx_xxx_name2_xx')
gsub("^.*_([[:alnum:]]+)_.*$", "\\1", names)
[1] "name1" "name2"
If the name is always the penultimate portion, how about this:
x <- c("xxx_xxx_xxx_xxx_name_xx", "xxx_xxx_xxx_xxx_xxx_name_xx")
namesplit = function(x){
x <- strsplit(as.character(x), "_")
sapply(x, function(x) x[length(x)-1])
}
HTH
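If the position really has to be chosen per format, as in the question, one possible sketch is to count the pieces of each split and pick the fifth or sixth accordingly. The sample values below are invented for illustration, and the rule "six pieces means position 5, otherwise position 6" is an assumption based on the two formats described.

```r
oldname <- c("111_222_333_444_nameA_55", "111_222_333_444_555_nameB_66")

parts <- strsplit(oldname, "_", fixed = TRUE)
# Six pieces: name is the 5th; otherwise (seven pieces): name is the 6th
pos <- ifelse(lengths(parts) == 6, 5, 6)
newname <- mapply(function(p, i) p[i], parts, pos, USE.NAMES = FALSE)
newname
# [1] "nameA" "nameB"
```

With a data frame this would be assigned directly, for example DF$newname <- newname after building parts from as.character(DF$oldname).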