Split on first comma in string - regex

How can I efficiently split the following string on the first comma using base R?
x <- "I want to split here, though I don't want to split elsewhere, even here."
strsplit(x, ???)
Desired outcome (2 strings):
[[1]]
[1] "I want to split here" "though I don't want to split elsewhere, even here."
Thank you in advance.
EDIT: I didn't think to mention this. The solution needs to generalize to a column/vector of strings, as in:
y <- c("Here's comma 1, and 2, see?", "Here's 2nd string, like it, not a lot.")
The outcome can be two columns, or one long vector (that I can take every other element of), or a list of strings with each index ([[n]]) holding two strings.
Apologies for the lack of clarity.

Here's what I'd probably do. It may seem hacky, but since sub() and strsplit() are both vectorized, it will also work smoothly when handed multiple strings.
XX <- "SoMeThInGrIdIcUlOuS"
strsplit(sub(",\\s*", XX, x), XX)
# [[1]]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."

From the stringr package:
str_split_fixed(x, pattern = ', ', n = 2)
#      [,1]                   [,2]
# [1,] "I want to split here" "though I don't want to split elsewhere, even here."
(That's a matrix with one row and two columns.)
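str_split_fixed() is vectorized as well, so the vector y from the edit yields a two-row matrix (output as I'd expect it):
str_split_fixed(y, pattern = ', ', n = 2)
#      [,1]                [,2]
# [1,] "Here's comma 1"    "and 2, see?"
# [2,] "Here's 2nd string" "like it, not a lot."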

Here is yet another solution, with a regular expression to capture what is before and after the first comma.
x <- "I want to split here, though I don't want to split elsewhere, even here."
library(stringr)
str_match(x, "^(.*?),\\s*(.*)")[,-1]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."

library(stringr)
str_sub(x, end = min(str_locate(string = x, ',') - 1))
This will get the first piece you want. Change start = and end = in str_sub() to get whatever else you want, such as:
str_sub(x, start = min(str_locate(string = x, ',') + 1))
and wrap it in str_trim() to get rid of the leading space:
str_trim(str_sub(x, start = min(str_locate(string = x, ',') + 1)))
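A sketch combining the two calls to get both pieces of the single string at once (str_locate() finds only the first match, and returns a matrix with "start" and "end" columns):
loc <- str_locate(string = x, ',')
c(str_sub(x, end = loc[, "start"] - 1),
  str_trim(str_sub(x, start = loc[, "start"] + 1)))
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."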

This works, but I like Josh O'Brien's better:
y <- strsplit(x, ",")
sapply(y, function(x) data.frame(x = x[1],
                                 z = paste(x[-1], collapse = ",")),
       simplify = FALSE)
Inspired by chase's response.
A number of people gave non-base approaches, so I figured I'd add the one I usually use (though in this case I needed a base response):
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
library(reshape2)
colsplit(y, ",", c("x","z"))

Related

Performance - How to get those words in a list of words that match a given sentence in R

I am trying to get only those words from a list that are present in a given sentence. The words can include bigrams as well. For example,
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
The result should be:
"really good" "better"
I have 1000 sentences like this against which I need to compare the words, and the word list is also bigger. I tried a brute-force method using grep, but it took a lot of time (as expected). I am looking for a way to get the matching words with better performance.
require(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) paste(unigrams[i], unigrams[i + 1])))
# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)
# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE),
                      grams,
                      by = c('wordList' = 'grams'))
matches
#      wordList
# 1 really good
# 2      better
I was able to use @epi99's answer with a slight modification.
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) paste(unigrams[i], unigrams[i + 1])))
# .. and combine into a single vector
grams <- c(unigrams, bigrams)
# use match function to get the matching words
matches <- match(grams, wordList )
matches <- na.omit(matches)
matchingwords <- wordList[matches]
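Since the question mentions 1000 sentences, here is a minimal sketch that wraps the same idea in a function and applies it over a character vector (the match_words() helper and the sentences vector are my own names, purely for illustration):
match_words <- function(sentence, wordList) {
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
  # pair each word with its successor to form the bigrams
  bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))
  grams <- c(unigrams, bigrams)
  wordList[na.omit(match(grams, wordList))]
}
lapply(sentences, match_words, wordList = wordList)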
What about
unlist(sapply(wordList, function(x) grep(x, sentence)))

Combining fragmented sentences in an R dataframe

I have a dataframe which contains parts of whole sentences spread, in some cases, across multiple rows.
For example, head(mydataframe) returns
# 1 Do you have any idea what
# 2 they were arguing about?
# 3 Do--Do you speak
# 4 English?
# 5 yeah.
# 6 No, I'm sorry.
Assuming a sentence can be terminated by either
"." or "?" or "!" or "..."
are there any R library functions capable of outputting the following:
# 1 Do you have any idea what they were arguing about?
# 2 Do--Do you speak English?
# 3 yeah.
# 4 No, I'm sorry.
This should work for all the sentences ending with ".", "...", "?" or "!":
x <- paste0(foo$txt, collapse = " ")
trimws(unlist(strsplit(x, "(?<=[?.!|])(?=\\s)", perl=TRUE)))
Credits to @AvinashRaj for the pointers on the lookbehind.
Which gives:
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah..."
#[4] "No, I'm sorry."
Data
I modified the toy dataset to include a case where a string ends with "..." (as requested by the OP):
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)
Here is what I got; I am sure there are better ways to do this, but here I used base functions. I created a sample data frame called foo. First, I built a single string from all the texts in txt. toString() joins the pieces with ", ", so I removed those commas in the first gsub(). Then I took care of extra white space (more than 2 spaces) in the second gsub(). Then I split the string by the delimiters you specified; crediting Tyler Rinker for this post, I managed to leave the delimiters in with strsplit(). The final job was to remove white space at sentence-initial position, then unlist the list.
EDIT
Steven Beaupré revised my code. That is the way to go!
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
                  stringsAsFactors = FALSE)
library(magrittr)
toString(foo$txt) %>%
  gsub(pattern = ",", replacement = "", x = .) %>%
  strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
  lapply(., function(x) {
    gsub(pattern = "^ ", replacement = "", x = x)
  }) %>%
  unlist
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah."
#[4] "No I'm sorry."

r ngram extraction with regex

Karl Broman's post: https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this but am interested in the regex logic (i.e., it was a self challenge that I failed to meet).
Below I give a minimal example and the desired output. The problem in my attempt is two-fold:
1. The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (E.g., I want "like" to be available for "like toast" after it has already been consumed in "I like".)
2. I couldn't make the space between words non-captured (notice the trailing white space in my output even though I used (?:\\s*)). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I don't consider underscores and numbers to be word parts, though I do consider ' a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
  x,
  pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
Here's one way using base R regex. This can easily be extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern)).
x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)
# [[1]]
# [1] "I like" "like toast" "toast and" "and jam"
Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.
install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like" "like_toast" "toast_and" "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
No painful regexes required!
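(Note: ngrams() reflects an older quanteda API; in current releases the equivalent is, as far as I can tell, tokens_ngrams(tokens(x), n = 2, concatenator = " "), but verify against your installed version.)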

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl = TRUE)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position at the end of the last match. \K resets the starting point of the reported match, so any previously consumed characters are no longer included, which is similar to a lookbehind.
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.
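For completeness, the original embedding example can be handled with a plain fixed-string substitution; a minimal base R sketch (not necessarily the solution David offered):
marco <- 'polo'
x <- 'John plays water marco.'
gsub('marco', marco, x, fixed = TRUE)
# [1] "John plays water polo."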

Replace character string elements by indices efficiently in R

I would like to efficiently replace elements of my character string with other particular elements at particular places (indices I already know, as they are results of the gregexpr function).
I would like some foo function that works like:
foo("qwerty", c(1,3,5), c("z", "x", "y"))
giving me:
[1] "zwxryy"
I searched the stringr package's CRAN PDF, but nothing caught my eye. Thank you in advance for any suggestions.
For example:
xx <- unlist(strsplit("qwerty",""))
xx[c(1,3,5)] <- c("z", "x", "y")
paste0(xx,collapse='')
[1] "zwxryy"
You could also try the one below if you don't have that many characters to replace:
st1 <- "qwerty"
gsub("^.(.).(.).","z\\1x\\2y", st1)
#[1] "zwxryy"
In the stringi package there is a stri_sub() function that works like this:
a <- "12345"
stri_sub(a, from=c(1,3,5),len=1) <- letters[c(1,3,5)]
a
## [1] "a2345" "12c45" "1234e"
It's almost what you want; just use it in a loop:
a <- "12345"
for(i in c(1,3,5)){
stri_sub(a, from=i,len=1) <- letters[i]
}
a
## [1] "a2c4e"
Be aware that this kind of function is on our TODO list; check:
https://github.com/Rexamine/stringi/issues?state=open
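A base R analogue of the same loop, using substr<- instead of stri_sub<- (a sketch):
a <- "12345"
for (i in c(1, 3, 5)) {
  substr(a, i, i) <- letters[i]
}
a
## [1] "a2c4e"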