regex split on Boolean

I wish to split a string into vectors and lists. If there is an OR or an || I want to split into lists. If there is an AND or an && I want to split into a vector. With the word versions (AND/OR) it works, but not with | and &. Here is the code:
splitting <- function(x) {
  lapply(strsplit(x, "OR|[\\|\\|]"), function(y){
    strsplit(y, "AND|[\\&\\&]")
  })
}
splitting("3AND4AND5OR4OR6AND7") ## desired outcome for all three
splitting("3&&4&&5||4||6&&7")
splitting("3&&4&&5OR4||6&&7")
Here is the desired outcome:
> splitting("3AND4AND5OR4OR6AND7")
[[1]]
[[1]][[1]]
[1] "3" "4" "5"
[[1]][[2]]
[1] "4"
[[1]][[3]]
[1] "6" "7"
How can I set this regex appropriately? What am I doing incorrectly?

I'm not saying it's the best answer, but if you've already solved the problem using "AND" and "OR", why not reduce it to a problem you've already solved?
splitting <- function(x) {
  x <- gsub("&&", "AND", x, fixed = TRUE)
  x <- gsub("||", "OR", x, fixed = TRUE)
  lapply(strsplit(x, "OR|[\\|\\|]"), function(y){
    strsplit(y, "AND|[\\&\\&]")
  })
}
splitting("3AND4AND5OR4OR6AND7") ## desired outcome for all three
splitting("3&&4&&5||4||6&&7")
splitting("3&&4&&5OR4||6&&7")
This was just the first thing that popped into my head; I haven't really thought about whether there is a better way to do it.
Also, this appears to work. The bracketed forms are character classes, and a character class matches a single character, so "[\\|\\|]" splits at every individual |, producing empty strings between a pair of pipes; dropping the brackets makes "\\|\\|" match the two-character || as intended:
splitting <- function(x) {
  #x <- gsub("&&", "AND", x, fixed = TRUE)
  #x <- gsub("||", "OR", x, fixed = TRUE)
  lapply(strsplit(x, "OR|\\|\\|"), function(y){
    strsplit(y, "AND|\\&\\&")
  })
}
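With the bracket-free patterns, all three example calls should give the desired nesting; as a quick check (output sketched for the &&/|| case):
> splitting("3&&4&&5||4||6&&7")
[[1]]
[[1]][[1]]
[1] "3" "4" "5"
[[1]][[2]]
[1] "4"
[[1]][[3]]
[1] "6" "7"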

Related

Split or substitute strings with wildcards in R [duplicate]

I have the following vector:
a <- c("abc_lvl1", "def_lvl2")
I basically want to split into two vectors:
("abc", "def") and ("lvl1", "lvl2). I know how to substitute with sub:
sub(".*_", "", a)
[1] "lvl1" "lvl2"
I think this translates into "search for any number of any characters before _ and replace with nothing". Accordingly, I thought, this should give me the other desired vector:
sub("_*.", "", a), but it removes just the leading character:
[1] "bc_lvl1" "ef_lvl2"
Where do I mess up?
This is essentially the equivalent of the "text to columns" function in Excel.
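As an aside, the pattern the question seems to be reaching for is the mirror image of the working one: anchor on the underscore and remove everything from it onward. A quick sketch:
sub("_.*", "", a)
[1] "abc" "def"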
There are several ways to do this. Here are a few, some using packages, and others with base R.
Given:
a <- c("abc_lvl1", "def_lvl2")
Here are some options:
do.call(rbind, strsplit(a, "_", TRUE))
matrix(scan(what = "", text = a, sep = "_"), ncol = 2, byrow = TRUE)
scan(text = a, sep = "_", what = list("", "")) ## a list
library(splitstackshape)
cSplit(data.table(a), "a", "_")
library(data.table)
setDT(tstrsplit(a, "_"))[]
library(dplyr)
library(tidyr)
data_frame(a) %>%
  separate(a, into = c("this", "that"))
library(reshape2)
colsplit(a, "_", c("this", "that"))
library(stringi)
t(stri_split_fixed(a, "_", simplify = TRUE))
library(iotools)
mstrsplit(a, "_") # Matrix
dstrsplit(a, col_types = c("character", "character"), "_") # data.frame
library(gsubfn)
read.pattern(text = a, pattern = "(.*)_(.*)")
We can use read.csv/read.table and specify sep = "_"; it will split the strings into two columns.
read.csv(text=a, sep="_", header=FALSE)
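For this input the result is a two-column data frame (output sketched):
##    V1   V2
## 1 abc lvl1
## 2 def lvl2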
Just to build on the initial comments:
a <- c("abc_lvl1", "def_lvl2")
a1 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][1]}))
a2 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][2]}))
a1
[1] "abc" "def"
a2
[1] "lvl1" "lvl2"

Combining lines in character vector in R

I have a character vector (content) of about 50,000 lines in R. However, some of the lines when read in from a text file are on separate lines and should not be. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for(i in 1:length(content)){
  if(content[i+1] != ""){
    content[i+1] <- c(content[i], content[i+1])
  }
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are each many thousands of lines long. Any ideas on how to translate these solutions into a function that can be applied to the content of each of the documents?
You don't need a loop to do that:
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
c("\n", sample(letters, 20, replace = TRUE), "\n"),
collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
###identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars=="")
#split vector (and filter out "" by using the select vector)
select <- chars!=""
splitted <- split(chars[select], ids[select])
#paste the groups together
res <- sapply(splitted,paste, collapse=" ")
#remove names(if necessary, probably not)
res <- unname(res) #thanks #Roland
> res
[1] "hello, world" "how are you"
Here's a different approach using data.table which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace the "" with something you can later split on, and then collapse the characters together, and then use strsplit(). Here I have used the newline character since if you were to just paste it you could get the different lines on the output, e.g. cat(txt3) will output each phrase on a separate line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
Another way to add to the mix (using the longer sample from the previous answer as x, i.e. including "more text"):
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
# 1 2 3
#"hello, world" "how are you" "more text"
The idea is to group the non-empty strings by the runs between empty strings, then paste the elements together within each group.

How to prevent regmatches from dropping non-matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa".
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
Use regexec instead, since it returns a list, which lets you catch the character(0) entries before unlisting:
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In the typical case of interest here (matching a single pattern), regmatches with this argument returns a list whose elements have length either 3 or 1: length 1 when no match is found, and length 3 when there is a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version, NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours:
chars <- c("abc", "def", "cba a", "aa")
chars[
  regexpr("a+", chars, perl=TRUE) > 0
][1] # "abc"
chars[
  regexpr("q", chars, perl=TRUE) > 0
][1] # NA
# vector[
#   find all indices where regexpr returned a positive value, i.e. a match was found
# ][ return the first element of the above subset ]
Edit: It seems I misunderstood the question, but since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA if there is no match, and it has a simpler interface than regmatches() as well.
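For example, with the same vector as above:
library(stringr)
x <- c("abc", "def", "cba a", "aa")
str_extract(x, "a+")
# [1] "a"  NA   "a"  "aa"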

R: How to search for a regex in a vector across element boundaries?

Is it possible in R to search for a regex in a vector as if all the elements were collapsed into a single string? If we actually collapse all the elements into one to do this, it becomes impossible to put them back into their element-wise form after the search.
Here is a vector:
vector <- c("I", "met", "a", "cow")
Now, the search word is "meta" (elements 2 and 3 collapsed).
Let's say my task is to merge the two elements across which the search string lies.
So what I expect is this:
vector = "I", "meta", "cow"
Is it possible to do this? Please help.
If you'd like something that matches "meta" but not "taco", this will do the trick:
myFun <- function(vector, word) {
  D <- "UnLiKeLyStRiNg"
  ## Construct a string on which you'll perform regex-search
  xx <- paste0(paste0(D, vector, collapse=""), D)
  ## Construct the regex pattern
  start <- paste0("(?<=", D, ")")
  mid <- paste0(strsplit(word, "")[[1]], collapse=paste0("(", D, ")?"))
  end <- paste0("(?=", D, ")")
  pat <- paste0(start, mid, end)
  ## Use it
  strsplit(gsub(pat, word, xx, perl=TRUE), D)[[1]][-1]
}
vector <- c("I", "met", "a", "cow")
myFun(vector, "meta")
# [1] "I" "meta" "cow"
myFun(vector, "taco")
# [1] "I" "met" "a" "cow"
myFun(vector, "Imet")
# [1] "Imet" "a" "cow"
myFun(vector, "Ime")
# [1] "I" "met" "a" "cow"
If only complete elements should be merged, you could try this approach:
mergeRegExpr <- function(x, pattern) {
  str <- paste(x, sep="", collapse="")
  ## find the starting position of each word
  wordStart <- head(cumsum(c(1, nchar(x))), -1)
  ## look for the pattern
  rx <- regexpr(pattern=pattern, text=str, fixed=TRUE)
  ## end position of the match == rx + nchar(pattern) - 1
  rxEnd <- rx + attr(rx, "match.length") - 1
  ## which vector elements don't overlap the match
  sel <- wordStart < rx | wordStart > rxEnd
  ## insert the merged element
  return(append(x[sel], paste(x[!sel], collapse=""), rx-1))
}
vector <- c("I", "met", "a", "cow")
mergeRegExpr(vector, "meta")
# "I" "meta" "cow"
mergeRegExpr(vector, "acow")
# "I" "met" "acow"
mergeRegExpr(vector, "Imeta")
# "Imeta" "cow"
## partial matching doesn't work
mergeRegExpr(vector, "taco")
# "I" "metacow"
Building on Carl Witthoft's comment, my solution was not with regex, but with basic matching:
# A slightly longer vector
v = c("I", "met", "a", "cow", "today",
"You", "met", "a", "cow", "today")
# Create the combinations of each pair
temp1 = sapply(1:(length(v)-1),
function(x) paste0(v[x], v[x+1]))
# Grab the index of the desired search term
temp2 = which(temp1 %in% "meta")
# The following also works.
# Don't know what's faster/better.
# temp2 = grep("meta", temp1)
# Do some manual substitution and deletion
v[temp2] <- "meta"
v <- v[-(temp2+1)]
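With the longer sample vector above, v then ends up as (output sketched):
v
# [1] "I"     "meta"  "cow"   "today" "You"   "meta"  "cow"   "today"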
I don't think this is an ideal approach, though.

Fastest way to capitalize the first word in a string (base)

Using the base install functions, what is the fastest way to capitalize the first letter of each string in a vector of text strings?
I have provided a solution below but it seems to be a very inefficient approach (using substring and pasting it all together). I'm guessing there's a regex solution I'm not thinking of.
Once I have a few responses, I'll benchmark them with microbenchmark and report back the fastest solution.
Thank you in advance for your help.
x <- c("i like chicken.", "mmh so good", NA)
#desired output
[1] "I like chicken." "Mmh so good" NA
I didn't time it, but I bet this is pretty fast
capitalize <- function(string) {
  #substring(string, 1, 1) <- toupper(substring(string, 1, 1))
  substr(string, 1, 1) <- toupper(substr(string, 1, 1))
  string
}
capitalize(x)
#[1] "I like chicken." "Mmh so good" NA
I think this will be slowest, but let it race against other solutions:
capitalize <- function(string) {
  sub("^(.)", "\\U\\1", string, perl=TRUE)
}
x <- c("i like chicken.", "mmh so good", NA)
capitalize(x)
EDIT: Actually, on ideone it is faster than the substring version.
EDIT 2: Matching any lowercase letter turns out to be slightly slower:
sub("^(\\p{Ll})","\\U\\1", string, perl=TRUE)
The Hmisc package contains a capitalize function:
> require(Hmisc)
> capitalize(c("i like chicken.", "mmh so good", NA))
[1] "I like chicken." "Mmh so good" NA
(Although this appears to be slower than both the substring and regular expression versions.)
My solution using substring:
capitalize <- function(string) {
  cap <- function(x) {
    if (is.na(x)) {
      NA
    } else {
      nc <- nchar(x)
      paste0(toupper(substr(x, 1, 1)), substr(x, 2, nc))
    }
  }
  sapply(string, cap, USE.NAMES = FALSE)
}
x <- c("i like chicken.", "mmh so good", NA)
capitalize(x)
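Following up on the question's plan to benchmark: a minimal sketch using the microbenchmark package, comparing the substr and sub approaches from the answers above (the wrapper names are just labels, not from the original answers):
library(microbenchmark)

capitalize_substr <- function(string) {
  substr(string, 1, 1) <- toupper(substr(string, 1, 1))
  string
}

capitalize_sub <- function(string) {
  sub("^(.)", "\\U\\1", string, perl = TRUE)
}

x <- rep(c("i like chicken.", "mmh so good", NA), 1000)

microbenchmark(
  substr_version = capitalize_substr(x),
  sub_version    = capitalize_sub(x),
  times = 100
)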