Splitting a string by space except when contained within quotes - regex

I've been trying to split a space-delimited string containing double-quoted phrases in R for some time, but without success. An example of such a string is as follows:
rainfall snowfall "Channel storage" "Rivulet storage"
It's important for us because these are column headings that must match the subsequent data. There are other suggestions on this site for how to go about this, but they don't seem to work in R. One example:
Regex for splitting a string using space when not surrounded by single or double quotes
Here is some code I've been trying:
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"']+|\"([^\"]*)\""
split <- strsplit(str, regex, perl=T)
what I would like is
[1] "rainfall" "snowfall" "Channel storage" "Rivulet storage"
but what I get is:
[1] "" " " " " " "
The vector is the right length (which is encouraging) but of course the strings are empty or contain a single space. Any suggestions?
Thanks in advance!

scan will do this for you
scan(text=str, what='character', quiet=TRUE)
[1] "rainfall" "snowfall" "Channel storage" "Rivulet storage"

As mplourde said, use scan. That's by far the cleanest solution (unless you want to keep the \", that is...).
If you want to use regexes for this (or for something not solved that easily by scan), you are still looking at it the wrong way: your regex matches exactly what you want to keep, so using it as the split pattern in strsplit cuts out everything you want to keep.
In these scenarios you should look at the function gregexpr, which returns the starting positions of your matches and adds the lengths of the matches as an attribute. The result of this can be passed to the function regmatches(), like this:
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"]+|\"([^\"]+)\""
regmatches(str,gregexpr(regex,str,perl=TRUE))
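If you also want to drop the surrounding quotes that this approach keeps, one option (my own post-processing sketch, not part of the original answer) is to strip them afterwards with gsub:

```r
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"]+|\"([^\"]+)\""
# extract the matches, then strip the retained double quotes
out <- regmatches(str, gregexpr(regex, str, perl = TRUE))[[1]]
gsub('"', "", out)
# [1] "rainfall" "snowfall" "Channel storage" "Rivulet storage"
```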
But if you just need the character vector that mplourde's solution returns, go for that. Most likely that's what you're after anyway.

You can use strapply from the gsubfn package. With strapply you define the string to match rather than the string to split on.
str <- "rainfall snowfall 'Channel storage' 'Rivulet storage'"
strapply(str,"\\w+|'\\w+ \\w+'",c)[[1]]
[1] "rainfall" "snowfall" "'Channel storage'" "'Rivulet storage'"
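The pattern above only matches quoted phrases of exactly two words; a more general pattern (my own variation, not from the answer) matches any run of characters between single quotes, or any bare token:

```r
library(gsubfn)
str <- "rainfall snowfall 'Channel storage' 'Rivulet storage'"
# a whole single-quoted run, or any run of non-space characters
strapply(str, "'[^']+'|\\S+", c)[[1]]
# [1] "rainfall" "snowfall" "'Channel storage'" "'Rivulet storage'"
```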

Related

Dealing with Spaces and NA's when Uniting Multiple Columns with Tidyr

So using the simple dataframe below, I want to create a new column that has all the days for each person, separated by a semi-colon.
For example, using Doug, it should look like - Monday; Wednesday; Friday
I would like to use Tidyr's unite function for this, but when I use it I get Monday;;Wednesday;;Friday because of the NA's (which could also be blank spaces), and sometimes there are semi-colons at the beginning and end as well. So I'm hoping there's a way to keep using unite, enhanced with a regular expression, so that I end up with each day of the week separated by a single semi-colon and no semi-colons at the beginning or end.
I would also like to stick with Tidyr, Dplyr, Stringr, etc.
Names<-c("Doug","Ken","Erin","Yuki","John")
Monday<-c("Monday"," "," ","Monday","Monday")
Tuesday<-c(" ","Tuesday","Tuesday"," ","Tuesday")
Wednesday<-c(" ","Wednesday","Wednesday","Wednesday"," ")
Thursday<-c(" "," "," "," ","Thursday")
Friday<-c(" "," "," "," ","Friday")
Days<-data.frame(Monday,Tuesday,Wednesday,Thursday,Friday)
Days<-Days%>%unite(BestDays,Monday,Tuesday,Wednesday,Thursday,Friday,sep="; ",remove=FALSE)
You can try:
Names<-c("Doug","Ken","Erin","Yuki","John")
Monday<-c("Monday",NA,NA,"Monday","Monday")
Tuesday<-c(NA,"Tuesday","Tuesday",NA,"Tuesday")
Wednesday<-c(NA,"Wednesday","Wednesday","Wednesday",NA)
Thursday<-c(NA,NA,NA,NA,"Thursday")
Friday<-c(NA,NA,NA,NA,"Friday")
Days<-data.frame(Monday,Tuesday,Wednesday,Thursday,Friday)
concat_str = function(str) str %>% na.omit %>% paste(collapse = "; ")
Days$BestDaysConcat = apply(Days[,c("Monday","Tuesday","Wednesday","Thursday","Friday")], 1, concat_str)
From getAnywhere("unite_.data.frame"), unite calls do.call("paste", c(data[from], list(sep = sep))) under the hood, and paste, as far as I know, doesn't provide a way to omit NAs unless you implement it manually.
Nevertheless, you can use a regular expression method as follows with gsub from base R to clean up the result column:
gsub("^\\s;\\s|;\\s{2}", "", Days$BestDays)
# [1] "Monday" "Tuesday; Wednesday"
# [3] "Tuesday; Wednesday" "Monday; Wednesday"
# [5] "Monday; Tuesday; Thursday; Friday"
This removes either the ^\\s;\\s pattern or the ;\\s{2} pattern: the former handles the case where the string starts with a blank, removing the blank and its following ;\\s; the latter removes ;\\s{2}, which covers blanks both in the middle of the string and at its end.
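A slightly more general cleanup along the same lines (my own sketch, with a hypothetical helper name) collapses any run of separators down to a single "; " and trims separators from both ends:

```r
clean_days <- function(x) {
  x <- gsub("(;\\s*)+", "; ", x, perl = TRUE)     # collapse repeated separators
  gsub("^[;\\s]+|[;\\s]+$", "", x, perl = TRUE)   # trim leading/trailing ones
}
clean_days(" ; Monday;  ; Wednesday; ")
# [1] "Monday; Wednesday"
```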

r ngram extraction with regex

Karl Broman's post: https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this but am interested in the regex logic (i.e., it was a self challenge that I failed to meet).
Below I give a minimal example and the desired output. The problem in my attempt is 2 fold:
The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (e.g., I want "like" to be available for "like toast" after it has already been consumed in "I like")
I couldn't make the space between words non-captured (notice the trailing white space in my output even though I used (?:\\s*)). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I don't consider underscores and numbers word parts, while I do consider ' a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
x,
pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
Here's one way using base R regex. This can be easily extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive lookahead assertion, e.g., (?=(my_overlapping_pattern))
x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl=TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)
# [[1]]
# [1] "I like" "like toast" "toast and" "and jam"
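To extend this to arbitrary n-grams, the lookahead pattern can be built programmatically. A sketch (the helper name ngram_pattern is mine, not from the answer):

```r
ngram_pattern <- function(n) {
  word <- "\\b[A-Za-z']+\\b"
  # n words separated by single spaces, all inside one lookahead capture group
  paste0("(?=(", paste(rep(word, n), collapse = " "), "))")
}

x <- "I like toast and jam."
m <- gregexpr(ngram_pattern(3), x, perl = TRUE)
attr(m[[1]], "match.length") <- as.vector(attr(m[[1]], "capture.length")[, 1])
regmatches(x, m)[[1]]
# [1] "I like toast" "like toast and" "toast and jam"
```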
Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.
install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like" "like_toast" "toast_and" "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
No painful regexes required!
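Note that in more recent quanteda releases the ngrams() function shown above has been superseded by tokens_ngrams(), which operates on a tokens object. Something like the following (a sketch against the newer API, not from the original answer) should be equivalent:

```r
library(quanteda)
x <- "I like toast and jam."
# tokenize first, then form overlapping 2-grams joined by a space
tokens_ngrams(tokens(x, remove_punct = TRUE), n = 2, concatenator = " ")
```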

Extract first instance of character and digit mix using regex (R)

I have a string from which I would like to extract the first instance of a character/digit mix, i.e. the first instance of the screen resolution below.
The string to match
scrn <- " dimensions: 1280x800 pixels (338x211 millimeters)"
And I would like to get either a vector or list with the entries c(1280, 800)
I can do this rather awkwardly with
strsplit(sapply(strsplit(scrn, " "), "[", 7), "x")
where I knew the 7 by reviewing the strsplit output.
But I am assuming there is a neat regular-expressions way to do this.
My attempt, FWIW (which I would then need to split a couple of times):
gsub("[[:alpha:]]{2,}|(\\:)*(\\s) ", "", scrn)
Is this what you mean?
sub('scrn\\s*<-\\s*"\\s*dimensions:\\s*(\\d+)x(\\d+)', "c(\\1,\\2)", subject, perl=TRUE);
Output:
c(1280,800)
Following zx81's hint of (\\d+)x(\\d+), this gets it done fairly neatly:
scrn <- " dimensions: 1280x800 pixels (338x211 millimeters)"
g <- regexec("(\\d+)x(\\d+)", scrn)
unlist(regmatches( scrn, g ))[-1]
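If the numbers are wanted as a numeric vector rather than strings, the same result can be wrapped in as.numeric (a small extension of the answer above):

```r
scrn <- " dimensions: 1280x800 pixels (338x211 millimeters)"
g <- regexec("(\\d+)x(\\d+)", scrn)
# drop the full match with [-1], keep the two capture groups as numbers
as.numeric(unlist(regmatches(scrn, g))[-1])
# [1] 1280  800
```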

How to convert a vector of strings to Title Case

I have a vector of strings in lower case. I'd like to change them to title case, meaning the first letter of every word would be capitalized. I've managed to do it with a double loop, but I'm hoping there's a more efficient and elegant way to do it, perhaps a one-liner with gsub and a regex.
Here's some sample data, along with the double loop that works, followed by other things I tried that didn't work.
strings = c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
# For each string in the strings vector, find the position of each
# instance of a space followed by a letter
matches = gregexpr("\\b[a-z]+", strings)
# For each string in the strings vector, convert the first letter
# of each word to upper case
for (i in 1:length(strings)) {
# Extract the position of each regex match for the string in row i
# of the strings vector.
match.positions = matches[[i]][1:length(matches[[i]])]
# Convert the letter in each match position to upper case
for (j in 1:length(match.positions)) {
substr(strings[i], match.positions[j], match.positions[j]) =
toupper(substr(strings[i], match.positions[j], match.positions[j]))
}
}
This worked, but it seems inordinately complicated. I resorted to it only after experimenting unsuccessfully with more straightforward approaches. Here are some of the things I tried, along with the output:
# Google search suggested \\U might work, but evidently not in R
gsub("(\\b[a-z]+)", "\\U\\1" ,strings)
[1] "Ufirst Uphrase" "Uanother Uphrase Uto Uconvert"
[3] "Uand Uhere'Us Uanother Uone" "Ulast-Uone"
# I tried this on a lark, but to no avail
gsub("(\\b[a-z]+)", toupper("\\1"), strings)
[1] "first phrase" "another phrase to convert"
[3] "and here's another one" "last-one"
The regex captures the correct positions in each string as shown by a call to gregexpr, but the replacement string is clearly not working as desired.
If you can't already tell, I'm relatively new to regexes and would appreciate help on how to get the replacement to work correctly. I'd also like to learn how to structure the regex so as to avoid capturing a letter after an apostrophe, since I don't want to change the case of those letters.
The main problem is that you're missing perl=TRUE (and your regex is slightly wrong, although that may be a result of flailing around to try to fix the first problem).
Using [:lower:] instead of [a-z] is slightly safer in case your code ends up being run in some weird (sorry, Estonians) locale where z is not the last letter of the alphabet ...
re_from <- "\\b([[:lower:]])([[:lower:]]+)"
strings <- c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
gsub(re_from, "\\U\\1\\L\\2" ,strings, perl=TRUE)
## [1] "First Phrase" "Another Phrase To Convert"
## [3] "And Here's Another One" "Last-One"
You may prefer to use \\E (stop capitalization) rather than \\L (start lowercase), depending on what rules you want to follow, e.g.:
string2 <- "using AIC for model selection"
gsub(re_from, "\\U\\1\\E\\2" ,string2, perl=TRUE)
## [1] "Using AIC For Model Selection"
Without using regex, the help page for tolower has two example functions that will do this.
The more robust version is
capwords <- function(s, strict = FALSE) {
cap <- function(s) paste(toupper(substring(s, 1, 1)),
{s <- substring(s, 2); if(strict) tolower(s) else s},
sep = "", collapse = " " )
sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
capwords(c("using AIC for model selection"))
## -> [1] "Using AIC For Model Selection"
To get your regex approach (almost) working, you need to set perl = TRUE:
gsub("(\\b[a-z]{1})", "\\U\\1" ,strings, perl=TRUE)
[1] "First Phrase" "Another Phrase To Convert"
[3] "And Here'S Another One" "Last-One"
but you will need to deal with apostrophes slightly better perhaps
sapply(lapply(strsplit(strings, ' '), gsub, pattern = '^([[:alnum:]]{1})', replacement = '\\U\\1', perl = TRUE), paste, collapse = ' ')
A quick search of SO found https://stackoverflow.com/a/6365349/1385941
Already excellent answers here. Here's one using a convenience function from the reports package:
strings <- c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
CA(strings)
## [1] "First Phrase" "Another Phrase To Convert"
## [3] "And Here's Another One" "Last-one"
Though it doesn't capitalize one as it didn't make sense to do so for my purposes.
Update: I maintain the qdapRegex package, which has the TC (title case) function that does true title case:
TC(strings)
## [[1]]
## [1] "First Phrase"
##
## [[2]]
## [1] "Another Phrase to Convert"
##
## [[3]]
## [1] "And Here's Another One"
##
## [[4]]
## [1] "Last-One"
I'll throw one more into the mix for fun:
topropper(strings)
[1] "First Phrase" "Another Phrase To Convert" "And Here's Another One"
[4] "Last-one"
topropper <- function(x) {
# Makes Proper Capitalization out of a string or collection of strings.
sapply(x, function(strn)
{ s <- strsplit(strn, "\\s")[[1]]
paste0(toupper(substring(s, 1,1)),
tolower(substring(s, 2)),
collapse=" ")}, USE.NAMES=FALSE)
}
Here is another one-liner, based on stringr package:
str_to_title(strings, locale = "en")
where strings is your vector of strings.
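For example (assuming stringr is installed):

```r
library(stringr)
strings <- c("first phrase", "another phrase to convert")
str_to_title(strings)
# [1] "First Phrase" "Another Phrase To Convert"
```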
A convenient way to convert between many different cases is the snakecase package in R.
Simply use the package
library(snakecase)
strings = c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
to_title_case(strings)
## [1] "First Phrase" "Another Phrase to Convert"
## [3] "And Here s Another One" "Last One"
Keep Coding!

gsub every other occurrence of a condition

Sometimes I use R for parsing text from PDFs for quotes when writing an article (I use LaTeX). One thing I'd like to do is change straight left and right quotes to LaTeX-style left and right quotes.
LaTeX would change "dog" to ``dog'' (so two ` for the left and two ' for the right)
Here's an example of what I have and what I'd like to get.
#currently
x <- c('I like "proper" cooking.', 'I heard him say, "I want some too" and "nice".')
[1] "I like \"proper\" cooking." "I heard him say, \"I want some too\" and \"nice\"."
#desired outcome
[1] "I like ``proper'' cooking." "I heard him say, ``I want some too'' and ``nice''."
EDIT: Thought I'd share the actual use for context. Using ttmaccer's solution (works on a Windows machine):
g <- function(){
require(qdap)
x <- readClipboard()
x <- clean(paste2(x, " "))
zz <- mgsub(c("- ", "“", "”"), c("", "``", "''"), x)
zz <- gsub("\"([^\"].*?)\"","``\\1''", zz)
writeClipboard(noquote(zz), format = 1)
}
Note: qdap can be downloaded HERE
A naive solution would be:
> gsub("\"([^\"].*?)\"","``\\1''",x)
[1] "I like ``proper'' cooking."
[2] "I heard him say, ``I want some too'' and ``nice''."
but I'm not sure how you would handle "some \"text\" with one \""
A two-stage solution:
stage 1: use "((?:[^\\"]|\\.)*)" to match a whole double-quoted string, allowing escaped characters inside
stage 2: use \\"([^\\"]*)\\" to replace the \" escapes left over in group 1 of stage 1
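A base-R sketch of those two stages (my own translation of the idea; it assumes \" is the only escape that matters inside quotes):

```r
x <- 'She said "I want \\"some\\" too" twice.'
# stage 1: match a whole double-quoted string, allowing \" escapes inside
m <- gregexpr('"((?:[^"\\\\]|\\\\.)*)"', x, perl = TRUE)
regmatches(x, m) <- lapply(regmatches(x, m), function(s) {
  inner <- substr(s, 2, nchar(s) - 1)   # drop the outer straight quotes
  inner <- gsub('\\\\"', '"', inner)    # stage 2: unescape \" inside
  paste0("``", inner, "''")
})
cat(x)
# She said ``I want "some" too'' twice.
```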