Splitting a string without delimiters in R - regex

I have a character vector in R, with each element containing a string - let's use this example:
my.files <- c("AWCallibration#NoneBino-3", "AWExperiment1#NoneBino-1", "AWExperiment2#NonemonL-2"
)
I would like to extract certain information out of these strings -
First, two uppercase alpha characters (in this case, always "AW")
Whether the trial was for calibration ("Callibration") or data collection - if it was the latter, which condition was used ("Experiment1" or "Experiment2")
Which sub-condition was used on this particular trial ("Bino" or "monL")
The repetition of the sub-condition ("1" or "2")
I first tried using strsplit, but this only appears to work for cases with regular delimiters such as "_". substring appeared to suit my needs better, but did not actually work due to the fact that splits don't occur in regular places ("Experiment1" is eleven elements long, "Callibration" is twelve).
I suspect that use of regular expressions may be the answer here, but I don't know how to account for the different lengths between the splits.

You can extract the information one by one:
first <- substr(my.files, 1, 2)
# [1] "AW" "AW" "AW"
second <- sub("^..(.*)#.*", "\\1", my.files)
# [1] "Callibration" "Experiment1" "Experiment2"
third <- sub("^.*#None(.*)-\\d+$", "\\1", my.files)
# [1] "Bino" "Bino" "monL"
fourth <- sub(".*-(\\d+)$", "\\1", my.files)
# [1] "3" "1" "2"
All in one command:
strsplit(my.files, "(?<=^..)(?=[A-Z])|#None|-", perl = TRUE)
# [[1]]
# [1] "AW" "Callibration" "Bino" "3"
#
# [[2]]
# [1] "AW" "Experiment1" "Bino" "1"
#
# [[3]]
# [1] "AW" "Experiment2" "monL" "2"

Here are a few different solutions:
gsubfn::strapplyc Try this:
library(gsubfn)
pat <- "(..)(.*)#None(.*)-(.*)"
strapplyc(my.files, pat, simplify = rbind)
which gives:
[,1] [,2] [,3] [,4]
[1,] "AW" "Callibration" "Bino" "3"
[2,] "AW" "Experiment1" "Bino" "1"
[3,] "AW" "Experiment2" "monL" "2"
Note that in the development version of the gsubfn package there is a read.pattern command which could use the above pat like this: read.pattern(text = my.files, pattern = pat, as.is = TRUE) .
sub/strsplit Here is an alternate solution. It inserts a minus after the second character and then splits each strip by minus or #None:
my.files2 <- sub("(..)", "\\1-", my.files)
do.call(rbind, strsplit(my.files2, "-|#None"))
which gives:
[,1] [,2] [,3] [,4]
[1,] "AW" "Callibration" "Bino" "3"
[2,] "AW" "Experiment1" "Bino" "1"
[3,] "AW" "Experiment2" "monL" "2"
gsub/read.table Here we use gsub to insert a minus after the first two characters and also we replace #None with minus. Then we use read.table with a sep of minus to read it in:
withMinus <- gsub("^(..)|#None", "\\1-", my.files)
read.table(text = withMinus, sep = "-", as.is = TRUE)
V1 V2 V3 V4
1 AW Callibration Bino 3
2 AW Experiment1 Bino 1
3 AW Experiment2 monL 2
REVISION:
A correction and a second solution.
Third solution.

Related

Add leading zeros within string

I have a series of column names that I'm trying to standardize.
names <- c("apple", "banana", "orange", "apple1", "apple2", "apple10", "apple11", "banana2", "banana12")
I would like anything that has a one digit number to be padded by a zero, so
apple
banana
orange
apple01
apple02
apple10
apple11
banana02
...
I've been trying to use stringr
strdouble <- str_detect(names, "[0-9]{2}")
strsingle <- str_detect(names, "[0-9]")
str_detect(names[strsingle & !strdouble])
but unable to figure out how to selectively replace/prepend...
You can use sub("([a-z])([0-9])$","\\10\\2",names) :
[1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02"
[9] "banana12"
It only changes the names where there is a single digit following a letter (the $ is the end of the string).
The \\1 selects the first block in () : the letter. Then it puts a leading 0, then the second block in () : the digit.
Here's one option using negative look-ahead and look-behind assertions to identify single digits.
gsub('(?<!\\d)(\\d)(?!\\d)', '0\\1', names, perl=TRUE)
# [1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
str_pad from stringr:
library(stringr)
pad_if = function(x, cond, n, fill = "0") str_pad(x, n*cond, pad = fill)
s = str_split_fixed(names,"(?=\\d)",2)
# [,1] [,2]
# [1,] "apple" ""
# [2,] "banana" ""
# [3,] "orange" ""
# [4,] "apple" "1"
# [5,] "apple" "2"
# [6,] "apple" "10"
# [7,] "apple" "11"
# [8,] "banana" "2"
# [9,] "banana" "12"
paste0(s[,1], pad_if(s[,2], cond = nchar(s[,2]) > 0, n = max(nchar(s[,2]))))
# [1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
This also extends to cases like going from c("a","a2","a20","a202") to c("a","a002","a020","a202"), which the other approaches don't cover.
The stringr package is based on stringi, which has all the same functionality used here, I'm guessing.
sprintf from base, with a similar approach:
pad_if2 = function(x, cond, n, fill = "0")
replace(x, cond, sprintf(paste0("%",fill,n,"d"), as.numeric(x)[cond]))
s0 = strsplit(names,"(?<=\\D)(?=\\d)|$",perl=TRUE)
s1 = sapply(s0,`[`,1)
s2 = sapply(sapply(s0,`[`,-1), paste0, "")
paste0(s1, pad_if2(s2, cond = nchar(s2) > 0, n = max(nchar(s2))))
pad_if2 has less general use than pad_if, since it requires x be coercible to numeric. Pretty much every step here is clunkier than the corresponding code with the packages mentioned above.
Key is to identify single digit with $ and letter before digit. It could be tried:
gsub('[^0-9]([0-9])$','0\\1',names)
[1] "apple" "banana" "orange" "appl01" "appl02" "apple10" "apple11" "banan02" "banana12"
or look-ahead.
gsub('(?<=[a-z])(\\d)$','0\\1',names,perl=T)

str_extract specific patterns (example)

I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
my approaches so far:
vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))
or
str_extract(str2, "[A-Z][0-9]{5}[A-Z]")
The expected outputs are
A1234B
A12345B
A12345
Thanks!
You can try
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by a capital letter [A-Z] ?
Or you can use stringi which would be faster
library(stringi)
stri_extract(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
data
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub and this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"
Using rex to construct the regular expression may make it more understandable.
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1, assumes always is between the second underscores.
re_matches(x,
rex(
"_",
anything,
"_",
capture(anything),
"_"
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
re_matches(x,
rex(
capture(
alpha,
between(digit, 4, 5),
maybe(alpha)
)
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed=T), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl=T))

How to prevent regmatches drop non matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
use regexec instead, since it returns a list which will allow you to catch the character(0)'s before unlisting
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list, typically, in most cases of interest, (matching a single pattern), regmatches with this argument will return a list with elements of either length 3 or 1. 1 is the case of where no matches are found and 3 is the case with a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It will return NA if there is no matches. It has simpler function interface than regmatches() as well.

regex grab from beginning to n occurrence of character

I'm really putting time into learning regex and I'm playing with different toy scenarios. One setup I can't get to work is to grab from the beginning of a string to n occurrence of a character where n > 1.
Here I can grab from the beginning of the string to the first underscore but I can't generalize this to the second or third underscore.
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")
gsub("_.*$", "", x)
Here's what I'm trying to achieve with regex. (`sub`/`gsub`):
## > sapply(lapply(strsplit(x, "_"), "[", 1:2), paste, collapse="_")
## [1] "a_b" "1_2" "<_?"
#or
## > sapply(lapply(strsplit(x, "_"), "[", 1:3), paste, collapse="_")
## [1] "a_b_c" "1_2_3" "<_?_."
Related post: regex from first character to the end of the string
Here's a start. To make this safe for general use, you'll need it to properly escape regular expressions' special characters:
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:", "", "abcd", "____abcd")
matchToNth <- function(char, n) {
others <- paste0("[^", char, "]*") ## matches "[^_]*" if char is "_"
mainPat <- paste0(c(rep(c(others, char), n-1), others), collapse="")
paste0("(^", mainPat, ")", "(.*$)")
}
gsub(matchToNth("_", 2), "\\1", x)
# [1] "a_b" "1_2" "<_?" "" "abcd" "_"
gsub(matchToNth("_", 3), "\\1", x)
# [1] "a_b_c" "1_2_3" "<_?_." "" "abcd" "__"
How about:
gsub('^(.+_.+?).*$', '\\1', x)
# [1] "a_b" "1_2" "<_?"
Alternatively you can use {} to indicate the number of repeats...
sub('((.+_){1}.+?).*$', '\\1', x) # {0} will give "a", {1} - "a_b", {2} - "a_b_c" and so on
So you don't have to repeat yourself if you wanted to match the nth one...
second underscore in perl style regex:
/^(.?_.?_)/
and third:
/^(.*?_.*?_.*?_)/
Maybe something like this
x
## [1] "a_b_c_d" "1_2_3_4" "<_?_._:"
gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){1}", x)))
## [1] "a" "1" "<"
gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){2}", x)))
## [1] "a_b" "1_2" "<_?"
gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){3}", x)))
## [1] "a_b_c" "1_2_3" "<_?_."
Using Justin's approach this was what I devised:
beg2char <- function(text, char = " ", noc = 1, include = FALSE) {
inc <- ifelse(include, char, "?")
specchar <- c(".", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
if(char %in% specchar) {
char <- paste0("\\", char)
}
ins <- paste(rep(paste0(char, ".+"), noc - 1), collapse="")
rep <- paste0("^(.+", ins, inc, ").*$")
gsub(rep, "\\1", text)
}
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")
beg2char(x, "_", 1)
beg2char(x, "_", 2)
beg2char(x, "_", 3)
beg2char(x, "_", 4)
beg2char(x, "_", 3, include=TRUE)

Inserting line breaks in long strings

I'm writing a function to write automacilly tables in tex format. One problem I'm having is with tables with long strings. To solve this, I created a function that break long strings in more rows. My functions breaks in every space that have at leat len characters before (it doesn't break words). I want change this rule to: Break in every space that the next space has at least len characters (in other words, I don't want 'substrings' with more than len characters, except that cases where a word has more than 10 characters).
quebra <- function(text, len=30) {
trim <- function(x) gsub('^ *|(?<= ) | *$', '', x, perl=TRUE)
quebrado <- strsplit(trim(paste(text)),paste0('(?<=.{',len,'}) '), perl=T)
tam <- max(sapply(quebrado, length))
out <- sapply(quebrado, function(x, tam) x[1:tam], tam=tam)
out[is.na(out)] <- ''
out
}
Example:
quebra('1234567890 123456789 123456789', 10) is returning:
[,1]
[1,] "1234567890"
[2,] "123456789 123456789"
but I want:
[,1]
[1,] "1234567890"
[2,] "123456789"
[3,] "123456789"
I think this should work, but I could not adapt it to strsplit() format.
Don't reinvent the wheel. Just use strwrap.