strsplit on first instance [duplicate] - regex

This question already has answers here:
Splitting a string on the first space
(7 answers)
Closed 4 years ago.
I would like to write a strsplit command that grabs the first ")" and splits the string.
For example:
f("12)34)56")
"12" "34)56"
I have read over several other related regex SO questions, but I am afraid I am not able to make heads or tails of them. Thank you for any assistance.

You could get the same list-type result as you would with strsplit if you used regexpr to get the first match, and then the inverted result of regmatches.
x <- "12)34)56"
regmatches(x, regexpr(")", x), invert = TRUE)
# [[1]]
# [1] "12" "34)56"

Need speed? Then go for stringi functions. See timings e.g. here.
library(stringi)
x <- "12)34)56"
stri_split_fixed(str = x, pattern = ")", n = 2)

It might be safer to identify where the character is and then substring either side of it:
x <- "12)34)56"
spl <- regexpr(")",x)
substring(x,c(1,spl+1),c(spl-1,nchar(x)))
#[1] "12" "34)56"
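Note that regexpr returns -1 when the character is absent, so the substring call above would return odd pieces for such input. A minimal sketch of a vectorized wrapper that guards against this, assuming a hypothetical helper name split_first (not part of any package) and that returning NA for the missing right-hand part is acceptable:

```r
# Hypothetical helper: split each string at the first occurrence of a literal character.
split_first <- function(x, ch = ")") {
  # fixed = TRUE treats ch literally; as.integer() drops the match attributes
  pos <- as.integer(regexpr(ch, x, fixed = TRUE))
  hit <- pos > 0                                  # FALSE where ch does not occur
  left  <- ifelse(hit, substring(x, 1, pos - 1), x)
  right <- ifelse(hit, substring(x, pos + 1), NA_character_)
  cbind(left, right)                              # one row per input string
}

res <- split_first(c("12)34)56", "no paren"))
res
```

The first row holds "12" and "34)56"; the second row keeps the whole string on the left and NA on the right.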

Another option is to use str_split in the package stringr:
library(stringr)
f <- function(string)
{
unlist(str_split(string,"\\)",n=2))
}
f("12)34)56")
# [1] "12" "34)56"

Replace the first ) with the non-printing character "\01" and then strsplit on that. You can use any character you like in place of "\01" as long as it does not appear in the string.
strsplit(sub(")", "\01", "12)34)56"), "\01")


Extract a string of words between two specific words in R [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed last year.
I have the following string : "PRODUCT colgate good but not goodOKAY"
I want to extract all the words between PRODUCT and OKAY
This can be done with sub:
s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)
giving:
[1] "colgate good but not good"
No packages are needed.
x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")
(?<=PRODUCT) -- look behind the match for PRODUCT
.* match everything except new lines.
(?=OKAY) -- look ahead to match OKAY.
I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, detecting, etc., the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.
(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
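For comparison, here is a rough base R sketch of the same lookaround pattern, using regexpr with perl = TRUE and regmatches. The trimws call is an addition here, since the lookbehind leaves the space after PRODUCT inside the match:

```r
x <- "PRODUCT colgate good but not goodOKAY"

# perl = TRUE enables lookbehind/lookahead in base R regex functions
m <- regmatches(x, regexpr("(?<=PRODUCT).*(?=OKAY)", x, perl = TRUE))

# The raw match still starts with the space that follows PRODUCT
result <- trimws(m)
result
# [1] "colgate good but not good"
```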
You can use gsub:
vec <- "PRODUCT colgate good but not goodOKAY"
gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"
You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:
x <- "PRODUCT colgate good but not goodOKAY"
library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)
## [[1]]
## [1] "colgate good but not good"
You could use the unglue package:
library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"

Replace repeating character with another repeated character

I would like to replace 3 or more consecutive 0s in a string by consecutive 1s. Example: '1001000001' becomes '1001111111'.
In R, I wrote the following code:
gsub("0{3,}","1",reporting_line_string)
but obviously it replaces the 5 0s by a single 1. How can I get 5 1s ?
Thanks,
You can use the gsubfn function, which accepts a replacement function that is applied to the content matched by the regex.
require(gsubfn)
gsubfn("0{3,}", function (x) paste(replicate(nchar(x), "1"), collapse=""), input)
You can replace paste(replicate(nchar(x), "1"), collapse="") with stri_dup("1", nchar(x)) if you have the stringi package installed.
Or a more concise solution, as G. Grothendieck suggested in the comment:
gsubfn("0{3,}", ~ gsub(".", 1, x), input)
Alternatively, you can use the following regex in Perl mode to replace:
gsub("(?!\\A)\\G0|(?=0{3,})0", "1", input, perl=TRUE)
It is extensible to any number of consecutive 0 by changing the 0{3,} part.
I personally don't endorse the use of this solution, though, since it is less maintainable.
Here's an option that builds on your approach, but makes use of gregexpr and regmatches. There's probably a more DRY way to do this, but it's not coming to my mind right now....
x <- c("1001000001", "120000siw22000100")
x
# [1] "1001000001" "120000siw22000100"
a <- regmatches(x, gregexpr("0{3,}", x))
regmatches(x, gregexpr("0{3,}", x)) <- lapply(a, function(x) gsub("0", "1", x))
x
# [1] "1001111111" "121111siw22111100"
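One possibly tidier variant of the same idea, sketched under the assumption that base R's strrep (available from R 3.3.0) is acceptable: compute the match data once and build each replacement from the length of the matched run.

```r
x <- c("1001000001", "120000siw22000100")

# Locate every run of three or more zeros once and reuse the match data
m <- gregexpr("0{3,}", x)

# Replace each run with a string of 1s of the same length
regmatches(x, m) <- lapply(regmatches(x, m),
                           function(run) strrep("1", nchar(run)))
x
# [1] "1001111111" "121111siw22111100"
```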
For regex ignorants (like me), try some brute force. Split the string into single characters using strsplit, find consecutive runs of "0" using rle, create a vector of relevant indices (run lengths of "0" > 2) using rep, insert a "1" at the indices, paste to a single string.
x2 <- strsplit(x = "1001000001", split = "")[[1]]
r <- rle(x2 == "0")
idx <- rep(x = r$lengths > 2, times = r$lengths)
x2[idx] <- "1"
paste(x2, collapse = "")
# [1] "1001111111"
0(?=00)|(?<=00)0|(?<=0)0(?=0)
You can try this regex, replacing each match with 1. See the demo:
http://regex101.com/r/dP9rO4/5
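In R, a lookaround pattern like this needs perl = TRUE. A minimal sketch of how it might be applied with gsub, reusing the two sample strings from the earlier answer: each 0 is replaced only if it sits inside a run of at least three zeros.

```r
x <- c("1001000001", "120000siw22000100")

# A zero matches if two zeros follow it, two zeros precede it,
# or it has a zero on each side
pattern <- "0(?=00)|(?<=00)0|(?<=0)0(?=0)"
out <- gsub(pattern, "1", x, perl = TRUE)
out
# [1] "1001111111" "121111siw22111100"
```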

Understanding `regexp` in R [duplicate]

This question already has answers here:
Extract all numbers from a single string in R
(4 answers)
Closed 8 years ago.
Understanding regular expressions can sometimes be troublesome, especially if you're not really familiar with writing them, like myself.
In R there are a couple of built-in functions (base package) which I would like to understand and be able to use, such as:
grep and gsub, which take the arguments (p, x), where p is a pattern and x is a character vector to search. The strsplit function also takes a regexp as an argument, like many others.
Anyway, I have an example such as:
string <- "39 22' 19'' N"
and I need to be able to extract the numbers from it. Using the stringr, iterators, and foreach libraries, I am trying to figure out an expression using either iter or foreach.
str_locate(string, "[0-9]+") locates and z <- str_extract(string, "[0-9]+") extracts only the first match in my string.
I have tried making something like
x <- iter(z)
nextElem(x)
but it doesn't work. Here is another attempt, which doesn't work either:
a <- foreach(iter(z))
a
How should I fix this using the above libraries?
Thanks.
Check http://cran.r-project.org/web/packages/stringr/stringr.pdf
str_extract_all(your_string, "[0-9]+")
You get the same numbers with base functions:
strsplit(gsub("(\\D+)"," ", string), " ")
This is another way to do it in base R:
string <- "39 22' 19'' N"
regmatches(string,gregexpr("[0-9]+",string))
# [[1]]
# [1] "39" "22" "19"
Note that regmatches(...) returns a list where each element is a char vector with the matches. So to get just the char vector you would use:
regmatches(string,gregexpr("[0-9]+",string))[[1]]
# [1] "39" "22" "19"
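The matches come back as character strings. If numeric values are needed, a short sketch (assuming every match is a plain run of digits, as in this example) is to wrap the result in as.numeric:

```r
string <- "39 22' 19'' N"

# Extract all digit runs, then convert the character matches to numbers
nums <- as.numeric(regmatches(string, gregexpr("[0-9]+", string))[[1]])
nums
# [1] 39 22 19
```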

Separate strings into groups of 2 characters separated by colon (1330 to 13:30) in R

How do I turn "1330" into "13:30", or "133000" into "13:30:00"? Essentially, I want to insert a colon between every pair of numbers. I'm trying to convert characters into times.
It seems like there should be a really elegant way to do this, but I can't think of it. I was thinking of using some combination of paste() and substr(), but an elegant solution is escaping me.
EDIT: example string that needs to be converted:
X <- c("120000", "120500", "121000", "121500", "122000", "122500", "123000") #example of noon to 12:30pm
This replaces each sequence of two characters not followed by a boundary with those same characters followed by a colon:
gsub("(..)\\B", "\\1:", X)
On the sample string it gives:
[1] "12:00:00" "12:05:00" "12:10:00" "12:15:00" "12:20:00" "12:25:00" "12:30:00"
You can use a regular expression with a positive lookahead:
gsub("(\\d{2})(?=\\d{2})", "\\1:", X, perl = TRUE)
# [1] "12:00:00" "12:05:00" "12:10:00" "12:15:00" "12:20:00" "12:25:00" "12:30:00"
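As a quick check, the same lookahead pattern can be tried on the two strings from the question. This is only a sketch and assumes the inputs contain an even number of digits:

```r
times <- c("1330", "133000")

# Insert a colon after each pair of digits that is followed by another pair
out <- gsub("(\\d{2})(?=\\d{2})", "\\1:", times, perl = TRUE)
out
# [1] "13:30"    "13:30:00"
```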
Using substring:
test <- "1330"
paste(substring(test,seq(1,nchar(test)-1,2),seq(2,nchar(test),2)),collapse=":")
#[1] "13:30"
test <- "133000"
paste(substring(test,seq(1,nchar(test)-1,2),seq(2,nchar(test),2)),collapse=":")
#[1] "13:30:00"
Or if you want an actual time representation you could do:
test <- "1330"
as.POSIXct(test,format="%H%M")
#[1] "2013-05-09 13:30:00 EST"
Which you can reformat like:
format(as.POSIXct(test,format="%H%M"),"%H:%M")
#[1] "13:30"
Can do it with strptime in one step:
strptime(X, format="%H%M%S")
[1] "2013-05-08 12:00:00" "2013-05-08 12:05:00" "2013-05-08 12:10:00" "2013-05-08 12:15:00" "2013-05-08 12:20:00"
[6] "2013-05-08 12:25:00" "2013-05-08 12:30:00"
After the complaint about the dates in date-time objects, one can suppress that "artificial" reality with:
strftime( strptime(X, format="%H%M%S"), "%H:%M:%S" )
[1] "12:00:00" "12:05:00" "12:10:00" "12:15:00" "12:20:00" "12:25:00" "12:30:00"

Split on first comma in string

How can I efficiently split the following string on the first comma using base?
x <- "I want to split here, though I don't want to split elsewhere, even here."
strsplit(x, ???)
Desired outcome (2 strings):
[[1]]
[1] "I want to split here" "though I don't want to split elsewhere, even here."
Thank you in advance.
EDIT: Didn't think to mention this. This needs to be able to generalize to a column, vector of strings like this, as in:
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
The outcome can be two columns, or one long vector (that I can take every other element of), or a list of strings with each index ([[n]]) having two strings.
Apologies for the lack of clarity.
Here's what I'd probably do. It may seem hacky, but since sub() and strsplit() are both vectorized, it will also work smoothly when handed multiple strings.
XX <- "SoMeThInGrIdIcUlOuS"
strsplit(sub(",\\s*", XX, x), XX)
# [[1]]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
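To illustrate the vectorized case, here is a sketch that applies the same placeholder trick to the vector y from the question and binds the pieces into a two-column matrix. The do.call(rbind, ...) step is an addition here and assumes every string contains at least one comma:

```r
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
XX <- "SoMeThInGrIdIcUlOuS"   # placeholder assumed not to occur in the data

# sub() swaps only the first comma (plus trailing spaces) for the placeholder
parts <- strsplit(sub(",\\s*", XX, y), XX)

# One row per input string, one column per piece
m <- do.call(rbind, parts)
m
```

Each row of m holds the text before the first comma and everything after it.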
From the stringr package:
str_split_fixed(x, pattern = ', ', n = 2)
# [,1]
# [1,] "I want to split here"
# [,2]
# [1,] "though I don't want to split elsewhere, even here."
(That's a matrix with one row and two columns.)
Here is yet another solution, with a regular expression to capture what is before and after the first comma.
x <- "I want to split here, though I don't want to split elsewhere, even here."
library(stringr)
str_match(x, "^(.*?),\\s*(.*)")[,-1]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
library(stringr)
str_sub(x,end = min(str_locate(string=x, ',')-1))
This will get the first bit you want. Change the start= and end= arguments in str_sub to get whatever else you want.
Such as:
str_sub(x,start = min(str_locate(string=x, ',')+1 ))
and wrap in str_trim to get rid of the leading space:
str_trim(str_sub(x,start = min(str_locate(string=x, ',')+1 )))
This works, but I like Josh O'Brien's better:
y <- strsplit(x, ",")
# For each split string: keep the first piece, re-join the rest with commas
sapply(y, function(x) data.frame(x = x[1],
z = paste(x[-1], collapse = ",")), simplify = FALSE)
Inspired by chase's response.
A number of people gave non-base approaches, so I figured I'd add the one I usually use (though in this case I needed a base response):
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
library(reshape2)
colsplit(y, ",", c("x","z"))