How to add a "." after a String under Conditions in R - regex

Data <- c("My name is Ernst.","I love chicken","Hello, my name is Stan!","Who?","I Love you!","Winner")
The function should add a "." at the end of each sentence unless the sentence already ends with one of [.?!].
I was trying to build a function in R with the help of regex, but I had trouble looking only at the end of the string.

The gsub call below adds a dot at the end of a sentence only if the sentence does not already end with ., ? or !.
> Data <- c("My name is Ernst.","I love chicken","Hello, my name is Stan!","Who?","I Love you!","Winner")
> gsub("^(?!.*[.?!]$)(.*)$", "\\1.", Data, perl=TRUE)
[1] "My name is Ernst." "I love chicken."
[3] "Hello, my name is Stan!" "Who?"
[5] "I Love you!" "Winner."
In regex, lookaheads check a condition without consuming any characters. The negative lookahead (?!.*[.?!]$) checks whether the line ends in ., ? or !. If it does, the match fails and no replacement happens for that element. The replacement occurs only when the last character is none of ., ? or !.
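As a rough illustration of the lookahead acting as a guard, the same assertion can be passed to grepl to see which elements would be changed. This is only a sketch; the name needs_dot is made up for the example.

```r
Data <- c("My name is Ernst.", "I love chicken", "Hello, my name is Stan!",
          "Who?", "I Love you!", "Winner")

# TRUE where the string does NOT end in ., ? or ! (the negative lookahead succeeds)
needs_dot <- grepl("^(?!.*[.?!]$)", Data, perl = TRUE)
needs_dot
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
```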
Or, using a negative lookbehind and a positive lookahead:
> Data <- c("My name is Ernst.","I love chicken","Hello, my name is Stan!","Who?","I Love you!","Winner")
> sub("(?<![!?.])(?=$)", ".", Data, perl=TRUE)
[1] "My name is Ernst." "I love chicken."
[3] "Hello, my name is Stan!" "Who?"
[5] "I Love you!" "Winner."

Using stringi:
library(stringi)
stri_replace_all_regex(Data, "(?<![!?.])$", ".")
#[1] "My name is Ernst." "I love chicken."
#[3] "Hello, my name is Stan!" "Who?"
#[5] "I Love you!" "Winner."

Here is another solution.
x <- c('My name is Ernst.', 'I love chicken',
'Hello, my name is Stan!', 'Who?', 'I Love you!', 'Winner')
r <- sub('[^?!.]\\K$', '.', x, perl=TRUE)
## [1] "My name is Ernst." "I love chicken."
## [3] "Hello, my name is Stan!" "Who?"
## [5] "I Love you!" "Winner."

Here are some possible approaches:
1) If the last character is not dot, ? or ! then replace it with that character followed by dot:
sub("([^.!?])$", "\\1.", Data)
For the data in the question we get:
[1] "My name is Ernst." "I love chicken."
[3] "Hello, my name is Stan!" "Who?"
[5] "I Love you!" "Winner."
2) A gsubfn solution is even simpler. It replaces the empty () with a dot if the last character is not a dot, ! or ? .
library(gsubfn)
gsubfn("[^.!?]()$", ".", Data)
3) This one uses grepl. If dot, ! or ? is the last character then append the empty string and otherwise append dot.
paste0(Data, ifelse(grepl("[.!?]$", Data), "", "."))
4) This one does not use regular expressions at all. It picks off the last character and, if it's one of dot, ! or ?, appends the empty string; otherwise it appends a dot:
paste0(Data, ifelse(substring(Data, nchar(Data)) %in% c(".", "!", "?"), "", "."))
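Approach 3 could be wrapped in a small helper along these lines. This is a sketch only: the name add_period is hypothetical, and the handling of trailing whitespace, empty strings and NA is an assumption that the question does not specify.

```r
add_period <- function(x) {
  # assumption: trailing spaces are not meaningful, so drop them first
  x <- trimws(x, which = "right")
  ends_ok <- grepl("[.!?]$", x)
  # leave NA, empty strings and already-terminated sentences untouched
  ifelse(is.na(x) | ends_ok | !nzchar(x), x, paste0(x, "."))
}

add_period(c("Winner", "Who? ", "", NA))
# [1] "Winner." "Who?"    ""        NA
```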

Related

Can I use an OR statement to indicate the pattern in stringr's str_extract_all function?

I'm looking at a number of cells in a data frame and am trying to extract any one of several sequences of characters; there's only one of these sequences per cell.
Here's what I mean:
dF$newColumn = str_extract_all(string = "dF$column1", pattern ="sequence_1|sequence_2")
Am I screwing the syntax up here? Can I pull this sort of thing with stringr? Please rectify my ignorance!
Yes, you can use | since it denotes logical or in regex. Here's an example:
vec <- c("abc text", "text abc", "def text", "text def text")
library(stringr)
str_extract_all(string = vec, pattern = "abc|def")
The result:
[[1]]
[1] "abc"
[[2]]
[1] "abc"
[[3]]
[1] "def"
[[4]]
[1] "def"
However, in your command, you should replace "dF$column1" with dF$column1 (without quotes).
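Since there is only one sequence per cell, a plain character column may be more convenient than the list that str_extract_all returns. stringr's str_extract gives one match per element; the sketch below shows the same idea in base R with regexpr and regmatches, which is a swapped-in technique rather than what the answer used. The data frame contents are made up for illustration.

```r
# hypothetical data standing in for the asker's data frame
dF <- data.frame(column1 = c("abc text", "text def", "no match"),
                 stringsAsFactors = FALSE)

m <- regexpr("abc|def", dF$column1)   # first match per element, -1 if none
found <- m > 0

dF$newColumn <- NA_character_
dF$newColumn[found] <- regmatches(dF$column1, m)
dF$newColumn
# [1] "abc" "def" NA
```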

Split on first/nth occurrence of delimiter

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcomed for completeness). I want to split on n occurrences of a delimiter.
Here is some data:
x <- "I like_to see_how_too"
pat <- "_"
Desired outcome
Say I want to split on first occurrence of _:
[1] "I like" "to see_how_too"
Say I want to split on second occurrence of _:
[1] "I like_to see" "how_too"
Ideally, the solution is a regex one-liner that generalizes to the nth occurrence and uses strsplit with a single regex.
Here's a solution that doesn't meet my requirement of a single regex that works with strsplit:
x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
Here is another solution using the gsubfn package and some regex-fu. To change the nth occurrence of the delimiter, you can simply swap the number that is placed inside of the range quantifier — {n}.
library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like" "to see_how_too"
If you would like the nth occurrence to be user defined, you could use the following:
n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too"
Non-Solution
Since R can use PCRE (with perl=TRUE), you can use \K to remove everything that matches the pattern before \K from the main match result.
Below is the regex to split the string at the 3rd _
^[^_]*(?:_[^_]*){2}\K_
If you want to split at the nth occurrence of _, just change 2 to (n - 1).
That was the plan. However, strsplit seems to think differently.
Actual execution
x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
It still fails even with the stronger assertion \A:
strsplit(x, "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
Explanation?
This behavior suggests that strsplit finds the first match, cuts the string into the first token and the remainder, and then looks for the next match in the remainder only.
That discards all state from the previous matches, so the regex is matched against the remainder as if it were a fresh string (^ and \A match again at its start). As a result, the regex cannot make strsplit stop after the first match. strsplit does not even have a parameter to limit the number of splits.
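The \K pattern itself still works when it is only used to locate the delimiter, for example with regexpr, leaving the cutting to substr. The following is a minimal sketch under that assumption; split_at_nth_underscore is a hypothetical helper name, and it handles a single string only.

```r
split_at_nth_underscore <- function(x, n) {
  # {n - 1} skips the first n - 1 underscores; \K drops them from the reported match
  re <- sprintf("^[^_]*(?:_[^_]*){%d}\\K_", n - 1)
  loc <- as.integer(regexpr(re, x, perl = TRUE))
  if (loc == -1) return(x)              # fewer than n underscores: nothing to split
  c(substr(x, 1, loc - 1), substr(x, loc + 1, nchar(x)))
}

x <- "I like_to see_how_too"
split_at_nth_underscore(x, 1)
# [1] "I like"         "to see_how_too"
split_at_nth_underscore(x, 2)
# [1] "I like_to see" "how_too"
```

Because regexpr is called once on the whole string, the anchor ^ is evaluated only at the true start, which sidesteps the restart behavior described above.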
Rather than splitting, you can match the whole string and capture the two parts.
Try this regex:
^((?:[^_]*_){1}[^_]*)_(.*)$
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
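In R, one way to pull out the two capture groups is regexec together with regmatches. This is a sketch, not necessarily how the answer intended it to be used; the variable names are illustrative, and perl = TRUE in regexec assumes a reasonably recent R version.

```r
x <- "I like_to see_how_too"
n <- 2                                   # split on the 2nd underscore

re <- sprintf("^((?:[^_]*_){%d}[^_]*)_(.*)$", n - 1)
# element 1 of the result is the full match; drop it to keep the two groups
parts <- regmatches(x, regexec(re, x, perl = TRUE))[[1]][-1]
parts
# [1] "I like_to see" "how_too"
```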
Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:
^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a neat workaround for the restriction that a lookbehind cannot be variable-length, which is what the above regex would otherwise need.
x <- "I like_to see_how_too"
strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## > strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how" "too"
## > strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too"
This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.
It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:
library(gsubfn)
k <- c(2, 4) # split at 2nd and 4th _
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")
giving:
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"
If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.
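The same mark-then-split idea can also be sketched in base R without gsubfn, using gregexpr and the replacement form of regmatches. This is an alternative technique, not the answer's own, and it assumes the marker character "\001" never occurs in the input.

```r
s <- "aa_bb_cc_dd_ee_ff"
k <- c(2, 4)                              # split at the 2nd and 4th underscore

m <- gregexpr("_", s, fixed = TRUE)       # locate every underscore
repl <- rep("_", length(m[[1]]))          # by default keep each underscore as is
repl[k] <- "\001"                         # mark the chosen ones with a sentinel
regmatches(s, m) <- list(repl)

strsplit(s, "\001", fixed = TRUE)[[1]]
# [1] "aa_bb" "cc_dd" "ee_ff"
```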

Split keep repeated delimiter

I'm trying to use the stringi package to split on a delimiter (potentially the delimiter is repeated) yet keep the delimiter. This is similar to this question I asked moons ago: R split on delimiter (split) keep the delimiter (split) but the delimiter can be repeated. I don't think base strsplit can handle this type of regex. The stringi package can, but I can't figure out how to format the regex so that it splits on the delimiter when there are repeats and does not leave an empty string at the end.
Base R solutions, stringr, stringi etc. solutions all welcomed.
The latter problem occurs because I use a greedy * on the \\s, but the space isn't guaranteed, so I could only think to leave it in:
MWE
text.var <- c("I want to split here.But also||Why?",
"See! Split at end but no empty.",
"a third string. It has two sentences"
)
library(stringi)
stri_split_regex(text.var, "(?<=([?.!|]{1,10}))\\s*")
# Outcome
## [[1]]
## [1] "I want to split here." "But also|" "|" "Why?"
## [5] ""
##
## [[2]]
## [1] "See!" "Split at end but no empty." ""
##
## [[3]]
## [1] "a third string." "It has two sentences"
# Desired Outcome
## [[1]]
## [1] "I want to split here." "But also||" "Why?"
##
## [[2]]
## [1] "See!" "Split at end but no empty."
##
## [[3]]
## [1] "a third string." "It has two sentences"
Using strsplit
strsplit(text.var, "(?<=[.!|])( +|\\b)", perl=TRUE)
#[[1]]
#[1] "I want to split here." "But also||" "Why?"
#[[2]]
#[1] "See!" "Split at end but no empty."
#[[3]]
#[1] "a third string." "It has two sentences"
Or
library(stringi)
stri_split_regex(text.var, "(?<=[.!|])( +|\\b)")
#[[1]]
#[1] "I want to split here." "But also||" "Why?"
#[[2]]
#[1] "See!" "Split at end but no empty."
#[[3]]
#[1] "a third string." "It has two sentences"
Just use a pattern that finds inter-character locations that: (1) are preceded by one of ?.!|; and (2) are not followed by one of ?.!|. Tack on \\s* to match and eat up any number of consecutive space characters, and you're good to go.
## (look-behind)(look-ahead)(spaces)
strsplit(text.var, "(?<=([?.!|]))(?!([?.!|]))\\s*", perl=TRUE)
# [[1]]
# [1] "I want to split here." "But also||" "Why?"
#
# [[2]]
# [1] "See!" "Split at end but no empty."
#
# [[3]]
# [1] "a third string." "It has two sentences"
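A different route to the same result is to match the pieces instead of splitting: each piece is a run of non-delimiter characters followed by any run of delimiters. The sketch below uses base R's gregexpr and regmatches and then trims the leftover spaces; it is an alternative to the lookaround answers above, and the name pieces is made up for the example.

```r
text.var <- c("I want to split here.But also||Why?",
              "See! Split at end but no empty.",
              "a third string. It has two sentences")

# one piece = non-delimiters, then zero or more of the delimiters ? . ! |
m <- gregexpr("[^?.!|]+[?.!|]*", text.var)
pieces <- lapply(regmatches(text.var, m), trimws)
pieces[[1]]
# [1] "I want to split here." "But also||"            "Why?"
```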

Split string recursively

Say I have text like this:
pattern = "This_is some word/expression I'd like to parse:intelligently(using special symbols-like '.')"
The challenge is how to split it into words, using word separators from the
c(" ","-","/","\\","_",":","(",")",".",",")
family.
Desired result:
"This" "is" "some" "word" "expression" "I'd" "like" "to" "parse" "intelligently" "using" "special" "symbols" "like"
Methods:
I could do sapply or for loop using:
keywords = unlist(strsplit(pattern," "))
keywords = unlist(strsplit(keywords,"-"))
# etc.
Question:
But what's the solution using Reduce(f, x, init, accumulate=TRUE)?
You shouldn't need Reduce here. You should be able to do something like the following:
splitters <- c(" ","/","\\","_",":","(",")",".",",","-") # dash should come last
pattern <- paste0("[", paste(splitters, collapse = ""), "]")
string <- "This_is some word/expression I'd like to parse:intelligently(using special symbols-like '.')"
strsplit(string, pattern)[[1]]
# [1] "This" "is" "some" "word"
# [5] "expression" "I'd" "like" "to"
# [9] "parse" "intelligently" "using" "special"
# [13] "symbols" "like" "'" "'"
Note that a - in a regex character class should come first or last, so I've edited your vector of "splitters" accordingly. Also, you may want to add a + at the end of your "pattern" in case you want to collapse, say, multiple spaces into one.
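The effect of the trailing + is easiest to see on an input with repeated separators. The string below is a made-up example, and the character class is cut down to a space and a dash for brevity.

```r
messy <- "one  two--three"     # hypothetical input: double space, double dash

strsplit(messy, "[ -]")[[1]]   # every separator splits, leaving empty strings
# [1] "one"   ""      "two"   ""      "three"

strsplit(messy, "[ -]+")[[1]]  # a run of separators counts as one split point
# [1] "one"   "two"   "three"
```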
You can use option perl = TRUE and then split on punctuation or space
> strsplit(pattern, '[[:punct:]]|[[:space:]]', perl = TRUE)
[[1]]
[1] "This" "is" "some" "word" "expression"
[6] "I" "d" "like" "to" "parse"
[11] "intelligently" "using" "special" "symbols" "like"
[16] ""
I'd go with this (it keeps "I'd" together):
strsplit(pattern, "[^[:alnum:][:digit:]']")
## [[1]]
## [1] "This" "is" "some" "word" "expression" "I'd" "like" "to" "parse"
## [10] "intelligently" "using" "special" "symbols" "like" "'" "'"

R - remove anything after comma from column

I'd like to strip this column so that it just shows last name - if there is a comma I'd like to remove the comma and anything after it. I have data column that is a mix of just last names and last, first. The data looks as follows:
Last Name
Sample, A
Tester
Wilfred, Nancy
Day, Bobby Jean
Morris
You could use gsub() and some regex:
> x <- 'Day, Bobby Jean'
> gsub("(.*),.*", "\\1", x)
[1] "Day"
You can use gsub:
gsub(",.*", "", c("last only", "last, first"))
# [1] "last only" "last"
",.*" says: replace comma (,) and every character after that (.*), with nothing "".
str1 <- c("Sample, A", "Tester", "Wifred, Nancy", "Day, Bobby Jean", "Morris")
library(stringr)
str_extract(str1, '[A-Za-z]+(?=(,|\\b))')
#[1] "Sample" "Tester" "Wifred" "Day" "Morris"
Match a run of letters [A-Za-z]+ and extract the one that is followed by a comma or a word boundary.
This will work:
library(stringr)
a <- read.delim("C:\\Desktop\\a.csv", row.names = NULL, header = TRUE,
stringsAsFactors = FALSE, sep = ",")
a <- as.matrix(a)
# drop the comma and everything after it
Data <- str_replace_all(string = a, pattern = ",.*$", replacement = "")
Also try strsplit:
string <- c("Sample, A", "Tester", "Wifred, Nancy", "Day, Bobby Jean", "Morris")
sapply(strsplit(string, ","), "[", 1)
#[1] "Sample" "Tester" "Wifred" "Day" "Morris"
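Applied directly to a data frame column, the ",.*" replacement from the earlier answer might look like this. The data frame and the column name LastName are stand-ins for the asker's data, which is not shown in full; sub is enough here because at most one replacement per string is needed.

```r
# hypothetical data frame mirroring the sample in the question
df <- data.frame(LastName = c("Sample, A", "Tester", "Wilfred, Nancy",
                              "Day, Bobby Jean", "Morris"),
                 stringsAsFactors = FALSE)

# remove the first comma and everything after it
df$LastName <- sub(",.*", "", df$LastName)
df$LastName
# [1] "Sample"  "Tester"  "Wilfred" "Day"     "Morris"
```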