Strsplit() by second occurence of the delimiter - regex

I am trying to split by the second occurrence of a character in a string (return substring in string before second appearance of character x)
For the string:
s <-"a_b_c" , if delimiter is "_" , I need the substring : "a_b"
My function returns the substring by first occurence:
return_topic<-function(s)
{
if (length(grep("_",s))>0)
{ return (unlist(strsplit(s,"_"))[1])}
else return (" ")
}
> return_topic("a_b_c")
[1] "a"

You can use sub:
sub("(.*?_.*?)_.*", "\\1", s)
# [1] "a_b"

One way using strsplit
s <- c('a_b_c', '_b', '_bc_', 'abc__')
sapply(strsplit(s, '^[^_]*?[_][^_]*?(*SKIP)(*F)|_', perl=TRUE),`[`,1)
#[1] "a_b" "_b" "_bc" "abc_"

Related

String split in R skipping first delimiter if multiple delimiters are present

I have "elephant_giraffe_lion" and "monkey_tiger" strings.
The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at that delimiter. So the results I want to get in this example are "elephant_giraffe" and "monkey".
mystring<-c("elephant_giraffe_lion", "monkey_tiger")
result
"elephant_giraffe" "monkey"
You can anchor your split to the end of the string using $,
unlist(strsplit(mystring, "_[a-z]+$"))
# [1] "elephant_giraffe" "monkey"
Edit
The above only matches the last "_", not accounting for cases where there are more than two "_". For the more general case, you could try
mystring<-c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
tmp <- gsub("([^_]+_[^_]+).*", "\\1", mystring)
tmp[tmp==mystring] <- sapply(strsplit(tmp[tmp==mystring], "_"), `[[`, 1)
tmp
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
You could also use gsubfn, to process the match with a function
library(gsubfn)
f <- function(x,y) if (y==x) strsplit(y, "_")[[1]][[1]] else y
gsubfn("([^_]+_[^_]+).*", f, mystring, backref=1)
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
As I posted an answer on your other related question, a base R solution:
x <- c('elephant_giraffe_lion', 'monkey_tiger', 'foo_bar_baz_bap')
sub('^(?|([^_]*_[^_]*)_.*|([^_]*)_[^_]*)$', '\\1', x, perl=TRUE)
# [1] "elephant_giraffe" "monkey" "foo_bar"

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, but I was wondering if there was a way to do it with just one use of the gsub function
trim <- function (x) {
x <- gsub("^\\s+|\\s+$|", "", x)
gsub("\\s+", " ", x)
}
testString<- " This is a test. "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind :
gsub("^ *|(?<= ) | *$",'',testString,perl=TRUE)
# "This is a test."
Explanation :
## "^ *" matches any leading space
## "(?<= ) " The general form is (?<=a)b :
## matches a "b"( a space here)
## that is preceded by "a" (another space here)
## " *$" matches trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=T)
See DEMO
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString<- " This is a test. "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also use nested gsub. Less elegant than the previous answers tho
> gsub("\\s+"," ",gsub("^\\s+|\\s$","",testString))
[1] "This is a test."

strsplit inconsistent with gregexpr

A comment on my answer to this question which should give the desired result using strsplit does not, even though it seems to correctly match the first and last commas in a character vector. This can be proved using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
The theory of #Aprillion is exact, from R documentation:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ will match the begining of a new string (without the precedent items.)
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to #JoshO'Brien for the link.)

How to trim and replace a string

string<-c(" this is a string ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- " this is a string "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
collapse = "")
paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c(" Four at the start, 2 at the end ",
" three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", " this is a string ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. sub), all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", " this is a string ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find one or more spaces at the start of the string
( *$): Find one or more spaces at the end of the string
`|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- " this is a string "
foo <- function(x, new="~"){
lead <- gsub("(^ *).*", "\\1", x)
last <- gsub(".*?( *$)", "\\1", x)
mid <- gsub("(^ *)|( *$)", "", x)
paste0(
gsub(" ", new, lead),
mid,
gsub(" ", new, last)
)
}
> foo(" this is a string ")
[1] "~~~~this is a string~~"
> foo(" And another one ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", " this is a string " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with #AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses the positive and negative lookahead and look behind assertions:
\\s(?!\\b) - match a space, \\s not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get
"~~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s that is preceded by another space, (?<=\\s) and is followed by a word boundary, (?=\\b).
And it is gsub so it tries to make the maximal number of matches that it can.

Extract 2nd to last word in string

I know how to do it in Python, but can't get it to work in R
> string <- "this is a sentence"
> pattern <- "\b([\w]+)[\s]+([\w]+)[\W]*?$"
Error: '\w' is an unrecognized escape in character string starting "\b([\w"
> match <- regexec(pattern, string)
> words <- regmatches(string, match)
> words
[[1]]
character(0)
sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', string)
#[1] "a"
which reads - be non-greedy and look for anything until you get to the sequence - some word characters + some non-word characters + some word characters + optional non-word characters + end of string, then extract the first collection of word characters in that sequence
Non-regex solution:
string <- "this is a sentence"
split <- strsplit(string, " ")[[1]]
split[length(split)-1]
Python non regex version
spl = t.split(" ")
if len(spl) > 0:
s = spl[len(spl)-2]