Combining fragmented sentences in an R dataframe - regex

I have a dataframe that contains parts of whole sentences spread, in some cases, across multiple rows.
For example, head(mydataframe) returns
# 1 Do you have any idea what
# 2 they were arguing about?
# 3 Do--Do you speak
# 4 English?
# 5 yeah.
# 6 No, I'm sorry.
Assuming a sentence can be terminated by any of
"." or "?" or "!" or "..."
are there any R library functions capable of outputting the following:
# 1 Do you have any idea what they were arguing about?
# 2 Do--Do you speak English?
# 3 yeah.
# 4 No, I'm sorry.

This should work for all the sentences ending with: . ... ? or !
x <- paste0(foo$txt, collapse = " ")
trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)))
Credit to @AvinashRaj for the pointers on the lookbehind
Which gives:
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah..."
#[4] "No, I'm sorry."
Data
I modified the toy dataset to include a case where a string ends with ... (as requested by the OP):
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

Here is what I got. I am sure there are better ways to do this; here I used base functions. I created a sample data frame called foo. First, I built one string from all the texts in txt. toString() inserts commas, so I removed them in the first gsub(). Then I took care of extra white space (two or more spaces) in the second gsub(). Next, I split the string on the delimiters you specified; crediting Tyler Rinker for this post, I managed to keep the delimiters in strsplit(). The final job was to remove the white space at each sentence-initial position, and then unlist the list.
EDIT
Steven Beaupré revised my code. That is the way to go!
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
                  stringsAsFactors = FALSE)
library(magrittr)
toString(foo$txt) %>%
  gsub(pattern = ",", replacement = "", x = .) %>%
  strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
  lapply(., function(x) {
    gsub(pattern = "^ ", replacement = "", x = x)
  }) %>%
  unlist
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah."
#[4] "No I'm sorry."

Related

Performance - How to get those words in a list of words that match a given sentence in R

I am trying to get only those words from a list that are present in a given sentence. The words can include bigrams as well. For example,
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
myresult should be:
"really good" "better"
I have 1000 sentences like this against which I need to compare the words, and the word list is also bigger. I tried a brute-force approach using grep(), but it took a lot of time (as expected). I am looking for a way to get the matching words with better performance.
require(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(1:(length(unigrams) - 1), function(i) paste(unigrams[i], unigrams[i + 1])))
# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)
# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE),
                      grams,
                      by = c('wordList' = 'grams'))
matches
wordList
1 really good
2 better
I was able to use @epi99's answer with a slight modification.
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(1:(length(unigrams) - 1), function(i) paste(unigrams[i], unigrams[i + 1])))
# .. and combine into a single vector
grams=c(unigrams, bigrams)
# use match function to get the matching words
matches <- match(grams, wordList )
matches <- na.omit(matches)
matchingwords <- wordList[matches]
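To run this over many sentences (the question mentions 1000), one sketch is to wrap the same match() logic in a small helper and lapply() over the vector; matching_words and the second example sentence are my own, not from the answers above:
matching_words <- function(sentence, wordList) {
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
  n <- length(unigrams)
  # vectorized bigrams: pair each word with its successor
  bigrams <- if (n > 1) paste(unigrams[-n], unigrams[-1]) else character(0)
  wordList[na.omit(match(c(unigrams, bigrams), wordList))]
}
lapply(c(sentence, "a happy and awesome result"), matching_words, wordList = wordList)
# [[1]]
# [1] "better"      "really good"
#
# [[2]]
# [1] "happy"   "awesome"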
What about
unlist(sapply(wordList, function(x) grep(x, sentence)))
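For a single sentence this returns a named vector, so the matching words are the names. Note that grep() does substring (regex) matching here, so a list word like "true" would also hit inside "construed". A quick look at what it returns:
hits <- unlist(sapply(wordList, function(x) grep(x, sentence)))
names(hits)
# [1] "really good" "better"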

R ngram extraction with regex

Karl Broman's post: https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this but am interested in the regex logic (i.e., it was a self challenge that I failed to meet).
Below I give a minimal example and the desired output. My attempt has two problems:
1. The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (For example, I want "like" to be available for "like toast" after it has already been consumed in "I like".)
2. I couldn't make the space between words non-captured (notice the trailing white space in my output, even though I used (?:\\s*)). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. P.S. I know about \\w, but I don't consider underscores and numbers word parts, though I do consider ' a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
  x,
  pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
Here's one way using base R regexes. This can easily be extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern))
x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)
# [[1]]
# [1] "I like" "like toast" "toast and" "and jam"
Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.
install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like" "like_toast" "toast_and" "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
No painful regexes required!

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1, 2, 3, 4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column in two: one column indicating whether the variable is a "cost" and another indicating whether it is "reed". I cannot seem to figure out the right regex for the split (e.g., using tidyr).
If my data were something nicer, say:
Y <- data.frame(value = c(1, 2, 3, 4),
                variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement: split on "_" if it is present, and otherwise split at the start of the string ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
How should I do this?
Edit: Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost), so I do not want to string-match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1, 2, 3, 4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for letters followed by another word boundary. The third and fourth elements do not match (their letters run into the underscore, which is itself a word character, so no boundary follows them), so they are not replaced.
Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
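On newer versions of tidyr, another sketch is extract() with an optional capture group, which sidesteps the conditional split altogether; note that unmatched groups come back as NA rather than an empty string:
library(tidyr)
X <- data.frame(value = c(1, 2, 3, 4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
# "(?:(.*)_)?" optionally captures everything before the last "_"
extract(X, variable, c("Policy-cost", "Reed"), regex = "(?:(.*)_)?(.*)")
#   value Policy-cost Reed
# 1     1        <NA> cost
# 2     2        <NA> cost
# 3     3        reed cost
# 4     4        reed cost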

How to convert a vector of strings to Title Case

I have a vector of strings in lower case. I'd like to change them to title case, meaning the first letter of every word would be capitalized. I've managed to do it with a double loop, but I'm hoping there's a more efficient and elegant way to do it, perhaps a one-liner with gsub and a regex.
Here's some sample data, along with the double loop that works, followed by other things I tried that didn't work.
strings = c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
# For each string in the strings vector, find the position of each
# instance of a space followed by a letter
matches = gregexpr("\\b[a-z]+", strings)
# For each string in the strings vector, convert the first letter
# of each word to upper case
for (i in 1:length(strings)) {
  # Extract the position of each regex match for the string in row i
  # of the strings vector.
  match.positions = matches[[i]][1:length(matches[[i]])]
  # Convert the letter in each match position to upper case
  for (j in 1:length(match.positions)) {
    substr(strings[i], match.positions[j], match.positions[j]) =
      toupper(substr(strings[i], match.positions[j], match.positions[j]))
  }
}
This worked, but it seems inordinately complicated. I resorted to it only after experimenting unsuccessfully with more straightforward approaches. Here are some of the things I tried, along with the output:
# Google search suggested \\U might work, but evidently not in R
gsub("(\\b[a-z]+)", "\\U\\1" ,strings)
[1] "Ufirst Uphrase" "Uanother Uphrase Uto Uconvert"
[3] "Uand Uhere'Us Uanother Uone" "Ulast-Uone"
# I tried this on a lark, but to no avail
gsub("(\\b[a-z]+)", toupper("\\1"), strings)
[1] "first phrase" "another phrase to convert"
[3] "and here's another one" "last-one"
The regex captures the correct positions in each string as shown by a call to gregexpr, but the replacement string is clearly not working as desired.
If you can't already tell, I'm relatively new to regexes and would appreciate help on how to get the replacement to work correctly. I'd also like to learn how to structure the regex so as to avoid capturing a letter after an apostrophe, since I don't want to change the case of those letters.
The main problem is that you're missing perl=TRUE (and your regex is slightly wrong, although that may be a result of flailing around to try to fix the first problem).
Using [:lower:] instead of [a-z] is slightly safer in case your code ends up being run in some weird (sorry, Estonians) locale where z is not the last letter of the alphabet ...
re_from <- "\\b([[:lower:]])([[:lower:]]+)"
strings <- c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
gsub(re_from, "\\U\\1\\L\\2" ,strings, perl=TRUE)
## [1] "First Phrase" "Another Phrase To Convert"
## [3] "And Here's Another One" "Last-One"
You may prefer to use \\E (stop capitalization) rather than \\L (start lowercase), depending on what rules you want to follow, e.g.:
string2 <- "using AIC for model selection"
gsub(re_from, "\\U\\1\\E\\2" ,string2, perl=TRUE)
## [1] "Using AIC For Model Selection"
Without using regex, the help page for tolower has two example functions that will do this.
The more robust version is
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if (strict) tolower(s) else s},
                           sep = "", collapse = " ")
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
capwords(c("using AIC for model selection"))
## -> [1] "Using AIC For Model Selection"
To get your regex approach (almost) working, you need to set perl = TRUE:
gsub("(\\b[a-z]{1})", "\\U\\1", strings, perl = TRUE)
[1] "First Phrase" "Another Phrase To Convert"
[3] "And Here'S Another One" "Last-One"
but you will need to deal with apostrophes slightly better perhaps
sapply(lapply(strsplit(strings, ' '),
              gsub, pattern = '^([[:alnum:]]{1})', replacement = '\\U\\1', perl = TRUE),
       paste, collapse = ' ')
A quick search of SO found https://stackoverflow.com/a/6365349/1385941
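An alternative sketch that keeps a single gsub() but skips letters preceded by an apostrophe, using a negative look-behind; this is my own variation, not from the answer above:
gsub("(?<!')\\b([a-z])", "\\U\\1", strings, perl = TRUE)
# [1] "First Phrase"              "Another Phrase To Convert"
# [3] "And Here's Another One"    "Last-One"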
Already excellent answers here. Here's one using a convenience function from the reports package:
strings <- c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
CA(strings)
## [1] "First Phrase" "Another Phrase To Convert"
## [3] "And Here's Another One" "Last-one"
Though it doesn't capitalize "one", as it didn't make sense to do so for my purposes.
Update: I maintain the qdapRegex package, which has the TC() (title case) function that does true title case:
TC(strings)
## [[1]]
## [1] "First Phrase"
##
## [[2]]
## [1] "Another Phrase to Convert"
##
## [[3]]
## [1] "And Here's Another One"
##
## [[4]]
## [1] "Last-One"
I'll throw one more into the mix for fun:
topropper(strings)
[1] "First Phrase" "Another Phrase To Convert" "And Here's Another One"
[4] "Last-one"
topropper <- function(x) {
  # Makes proper capitalization out of a string or collection of strings.
  sapply(x, function(strn) {
    s <- strsplit(strn, "\\s")[[1]]
    paste0(toupper(substring(s, 1, 1)),
           tolower(substring(s, 2)),
           collapse = " ")
  }, USE.NAMES = FALSE)
}
Here is another one-liner, based on the stringr package:
str_to_title(strings, locale = "en")
where strings is your vector of strings.
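For the sample strings this gives the following (output sketched from stringr's ICU-based title casing; exact handling of the hyphen can vary with the stringr/ICU version):
str_to_title(strings)
## [1] "First Phrase"              "Another Phrase To Convert"
## [3] "And Here's Another One"    "Last-One"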
The best way to convert from any case to any other case is the snakecase package in R. Simply load the package:
library(snakecase)
strings = c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
to_title_case(strings)
## [1] "First Phrase" "Another Phrase to Convert"
## [3] "And Here s Another One" "Last One"
Keep Coding!

Split on first comma in string

How can I efficiently split the following string on the first comma using base?
x <- "I want to split here, though I don't want to split elsewhere, even here."
strsplit(x, ???)
Desired outcome (2 strings):
[[1]]
[1] "I want to split here" "though I don't want to split elsewhere, even here."
Thank you in advance.
EDIT: I didn't think to mention this: it needs to generalize to a column (vector) of strings like this, as in:
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
The outcome can be two columns, or one long vector (of which I can take every other element), or a list of strings with each index ([[n]]) holding two strings.
Apologies for the lack of clarity.
Here's what I'd probably do. It may seem hacky, but since sub() and strsplit() are both vectorized, it will also work smoothly when handed multiple strings.
XX <- "SoMeThInGrIdIcUlOuS"
strsplit(sub(",\\s*", XX, x), XX)
# [[1]]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
From the stringr package:
str_split_fixed(x, pattern = ', ', n = 2)
# [,1]
# [1,] "I want to split here"
# [,2]
# [1,] "though I don't want to split elsewhere, even here."
(That's a matrix with one row and two columns.)
Here is yet another solution, with a regular expression to capture what is before and after the first comma.
x <- "I want to split here, though I don't want to split elsewhere, even here."
library(stringr)
str_match(x, "^(.*?),\\s*(.*)")[,-1]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
library(stringr)
str_sub(x, end = min(str_locate(string = x, ',') - 1))
This will get the first bit you want. Change the start = and end = arguments in str_sub() to get whatever else you want.
Such as:
str_sub(x, start = min(str_locate(string = x, ',') + 1))
and wrap in str_trim to get rid of the leading space:
str_trim(str_sub(x, start = min(str_locate(string = x, ',') + 1)))
This works, but I like Josh O'Brien's better:
y <- strsplit(x, ",")
sapply(y, function(x) data.frame(x = x[1],
                                 z = paste(x[-1], collapse = ",")),
       simplify = FALSE)
Inspired by Chase's response.
A number of people gave non-base approaches, so I figured I'd add the one I usually use (though in this case I needed a base response):
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
library(reshape2)
colsplit(y, ",", c("x","z"))