Inserting line breaks in long strings - regex

I'm writing a function to write automacilly tables in tex format. One problem I'm having is with tables with long strings. To solve this, I created a function that break long strings in more rows. My functions breaks in every space that have at leat len characters before (it doesn't break words). I want change this rule to: Break in every space that the next space has at least len characters (in other words, I don't want 'substrings' with more than len characters, except that cases where a word has more than 10 characters).
quebra <- function(text, len=30) {
trim <- function(x) gsub('^ *|(?<= ) | *$', '', x, perl=TRUE)
quebrado <- strsplit(trim(paste(text)),paste0('(?<=.{',len,'}) '), perl=T)
tam <- max(sapply(quebrado, length))
out <- sapply(quebrado, function(x, tam) x[1:tam], tam=tam)
out[is.na(out)] <- ''
out
}
Example:
quebra('1234567890 123456789 123456789', 10) is returning:
[,1]
[1,] "1234567890"
[2,] "123456789 123456789"
but I want:
[,1]
[1,] "1234567890"
[2,] "123456789"
[3,] "123456789"
I think this should work, but I could not adapt it to strsplit() format.

Don't reinvent the wheel. Just use strwrap.

Related

Extracting and merging numbers from strings

I have strings with numbers as follow:
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
and would like to get an output like:
97226424979
815264627
4902022801986
0781482789
06643420034
060418728
I tried using:
as.numeric(gsub("([0-9]+).*$", "\\1", numbers))
but the numbers are separate in the output.
To get your exact output,
#to avoid scientific notation
options(scipen=999)
#find which have leading 0
ind <- which(substring(x, 1, 1) == 0)
y <- as.numeric(gsub("\\D", "", numbers))
y[ind] <- paste0('0', y[ind])
y
#[1] "97226424979" "815264627" "4902022801986" "0781482789" "06643420034" "060418728"
([0-9]+).*$ puts a number sequence until the first non-number into \\1. However, you want:
numbers <- readLines(n=6)
972 2 6424979
81|5264627
49-0202-2801986
07.81.48.27.89
0664/3420034
06041 - 8728
as.numeric(gsub("\\D", "", numbers))
This replaces all non-numbers by nothing.

Using regular expressions in R to extract information from string

I searched the stack overflow a little and all I found was, that regex in R are a bit tricky and not convenient compared to Perl or Python.
My problem is the following. I have long file names with informations inside. The look like the following:
20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw
20150416_QEP1_EXT_GR_1237_hs_IP_NON_060
I want to extract the parts from the filename and convert them conveniently into values, for example the first part is a date, the second is machine abbreviation, the next an institute abbreviation, group abbreviation, sample number(s) etc...
What I do at the moment is constructing a regex, to make (almost) sure, I grep the correct part of the string:
regex <- '([:digit:]{8})_([:alnum:]{1,4})_([:upper:]+)_ etc'
Then I use sub to save each snipped into a variable:
date <- sub(regex, '\\1', filename)
machine <- sub(regex, '\\2', filename)
etc
This works, if the filename has the correct convention. It is overall very hard to read and I am search for a more convenient way of doing the work. I thought about splitting the filename by _ and accessing the string by index might be a good solution. But sometimes, since the filenames often get created by hand, there are terms missing or additional information in the names and I am looking for a better solution to this.
Can anyone suggest a better way of doing so?
EDIT
What I want to create is an object, which has all the information of the filenames extracted and accessible... such as my_object$machine or so....
The help page for ?regex actually gives an example that is exactly equivalent to Python's re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") (as per your comment):
## named capture
notables <- c(" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore")
#name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
parse.one(notables, parsed)
The normal way (i.e. the R way) to extract from a string is the following:
text <- "Malcolm Reynolds"
x <- gregexpr("\\w+", text) #Don't forget to escape the backslash
regmatches(text, x)
[[1]]
[1] "Malcolm" "Reynolds"
You can use however Perl-style group naming by using argument perl=TRUE:
regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
However regmatches does not support it, hence the need to create your own function to handle that, which is given in the help page:
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
Applied to your example:
text <- "Malcolm Reynolds"
x <- regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
parse.one(text, x)
first_name last_name
[1,] "Malcolm" "Reynolds"
To go back to your initial problem:
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw", "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")
regex <- '(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<whatev>[[:upper:]]+)'
x <- regexpr(regex,filenames,perl=TRUE)
parse.one(filenames,x)
date machine whatev
[1,] "20150416" "QEP1" "EXT"
[2,] "20150416" "QEP1" "EXT"
[3,] "20150416" "QEP1" "EXT"
[4,] "20150416" "QEP1" "EXT"

str_extract specific patterns (example)

I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
my approaches so far:
vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))
or
str_extract(str2, "[A-Z][0-9]{5}[A-Z]")
The expected outputs are
A1234B
A12345B
A12345
Thanks!
You can try
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by a capital letter [A-Z] ?
Or you can use stringi which would be faster
library(stringi)
stri_extract(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
data
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub and this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"
Using rex to construct the regular expression may make it more understandable.
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1, assumes always is between the second underscores.
re_matches(x,
rex(
"_",
anything,
"_",
capture(anything),
"_"
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
re_matches(x,
rex(
capture(
alpha,
between(digit, 4, 5),
maybe(alpha)
)
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed=T), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl=T))

Splitting a string without delimiters in R

I have a character vector in R, with each element containing a string - let's use this example:
my.files <- c("AWCallibration#NoneBino-3", "AWExperiment1#NoneBino-1", "AWExperiment2#NonemonL-2"
)
I would like to extract certain information out of these strings -
First, two uppercase alpha characters (in this case, always "AW")
Whether the trial was for calibration ("Callibration") or data collection - if it was the latter, which condition was used ("Experiment1" or "Experiment2")
Which sub-condition was used on this particular trial ("Bino" or "monL")
The repetition of the sub-condition ("1" or "2")
I first tried using strsplit, but this only appears to work for cases with regular delimiters such as "_". substring appeared to suit my needs better, but did not actually work due to the fact that splits don't occur in regular places ("Experiment1" is eleven elements long, "Callibration" is twelve).
I suspect that use of regular expressions may be the answer here, but I don't know how to account for the different lengths between the splits.
You can extract the information one by one:
first <- substr(my.files, 1, 2)
# [1] "AW" "AW" "AW"
second <- sub("^..(.*)#.*", "\\1", my.files)
# [1] "Callibration" "Experiment1" "Experiment2"
third <- sub("^.*#None(.*)-\\d+$", "\\1", my.files)
# [1] "Bino" "Bino" "monL"
fourth <- sub(".*-(\\d+)$", "\\1", my.files)
# [1] "3" "1" "2"
All in one command:
strsplit(my.files, "(?<=^..)(?=[A-Z])|#None|-", perl = TRUE)
# [[1]]
# [1] "AW" "Callibration" "Bino" "3"
#
# [[2]]
# [1] "AW" "Experiment1" "Bino" "1"
#
# [[3]]
# [1] "AW" "Experiment2" "monL" "2"
Here are a few different solutions:
gsubfn::strapplyc Try this:
library(gsubfn)
pat <- "(..)(.*)#None(.*)-(.*)"
strapplyc(my.files, pat, simplify = rbind)
which gives:
[,1] [,2] [,3] [,4]
[1,] "AW" "Callibration" "Bino" "3"
[2,] "AW" "Experiment1" "Bino" "1"
[3,] "AW" "Experiment2" "monL" "2"
Note that in the development version of the gsubfn package there is a read.pattern command which could use the above pat like this: read.pattern(text = my.files, pattern = pat, as.is = TRUE) .
sub/strsplit Here is an alternate solution. It inserts a minus after the second character and then splits each strip by minus or #None:
my.files2 <- sub("(..)", "\\1-", my.files)
do.call(rbind, strsplit(my.files2, "-|#None"))
which gives:
[,1] [,2] [,3] [,4]
[1,] "AW" "Callibration" "Bino" "3"
[2,] "AW" "Experiment1" "Bino" "1"
[3,] "AW" "Experiment2" "monL" "2"
gsub/read.table Here we use gsub to insert a minus after the first two characters and also we replace #None with minus. Then we use read.table with a sep of minus to read it in:
withMinus <- gsub("^(..)|#None", "\\1-", my.files)
read.table(text = withMinus, sep = "-", as.is = TRUE)
V1 V2 V3 V4
1 AW Callibration Bino 3
2 AW Experiment1 Bino 1
3 AW Experiment2 monL 2
REVISION:
A correction and a second solution.
Third solution.

How to prevent regmatches drop non matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
use regexec instead, since it returns a list which will allow you to catch the character(0)'s before unlisting
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list, typically, in most cases of interest, (matching a single pattern), regmatches with this argument will return a list with elements of either length 3 or 1. 1 is the case of where no matches are found and 3 is the case with a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It will return NA if there is no matches. It has simpler function interface than regmatches() as well.