Add leading zeros within string - regex

I have a series of column names that I'm trying to standardize.
names <- c("apple", "banana", "orange", "apple1", "apple2", "apple10", "apple11", "banana2", "banana12")
I would like anything that has a one digit number to be padded by a zero, so
apple
banana
orange
apple01
apple02
apple10
apple11
banana02
...
I've been trying to use stringr
strdouble <- str_detect(names, "[0-9]{2}")
strsingle <- str_detect(names, "[0-9]")
str_detect(names[strsingle & !strdouble])
but I am unable to figure out how to selectively replace/prepend...

You can use sub("([a-z])([0-9])$","\\10\\2",names) :
[1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02"
[9] "banana12"
It only changes the names where there is a single digit following a letter (the $ is the end of the string).
The \\1 selects the first block in () : the letter. Then it puts a leading 0, then the second block in () : the digit.
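As a quick check of the explanation above, here is a minimal sketch that runs the same call on the question's data. It relies on R supporting only the backreferences \1 through \9, so "\\10" in the replacement is read as group 1 followed by a literal 0. The variable name padded is introduced here only for illustration.

```r
names <- c("apple", "banana", "orange", "apple1", "apple2",
           "apple10", "apple11", "banana2", "banana12")

# Group 1 is the letter before the digit, group 2 is the lone trailing digit.
# "\\10\\2" means: group 1, a literal "0", then group 2.
padded <- sub("([a-z])([0-9])$", "\\10\\2", names)
print(padded)
```

Names ending in two digits are left alone because the character before the final digit is not a letter.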

Here's one option using negative look-ahead and look-behind assertions to identify single digits.
gsub('(?<!\\d)(\\d)(?!\\d)', '0\\1', names, perl=TRUE)
# [1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
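One difference from the anchored sub approach is worth showing: because this pattern has no $, it pads every isolated digit, not only one at the end of the string. A small sketch with made-up inputs (the vector x and the name padded_all are hypothetical):

```r
# Hypothetical inputs with digits in the middle and at the end.
x <- c("a1b22c3", "x9", "y10")

# A digit is padded only if it has no digit directly before or after it.
padded_all <- gsub("(?<!\\d)(\\d)(?!\\d)", "0\\1", x, perl = TRUE)
print(padded_all)
# [1] "a01b22c03" "x09"       "y10"
```

If only a trailing number should be padded, adding $ after the look-ahead would be one way to restrict it.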

str_pad from stringr:
library(stringr)
pad_if = function(x, cond, n, fill = "0") str_pad(x, n*cond, pad = fill)
s = str_split_fixed(names,"(?=\\d)",2)
# [,1] [,2]
# [1,] "apple" ""
# [2,] "banana" ""
# [3,] "orange" ""
# [4,] "apple" "1"
# [5,] "apple" "2"
# [6,] "apple" "10"
# [7,] "apple" "11"
# [8,] "banana" "2"
# [9,] "banana" "12"
paste0(s[,1], pad_if(s[,2], cond = nchar(s[,2]) > 0, n = max(nchar(s[,2]))))
# [1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
This also extends to cases like going from c("a","a2","a20","a202") to c("a","a002","a020","a202"), which the other approaches don't cover.
The stringr package is based on stringi, which has all the same functionality used here, I'm guessing.
sprintf from base, with a similar approach:
pad_if2 = function(x, cond, n, fill = "0")
replace(x, cond, sprintf(paste0("%",fill,n,"d"), as.numeric(x)[cond]))
s0 = strsplit(names,"(?<=\\D)(?=\\d)|$",perl=TRUE)
s1 = sapply(s0,`[`,1)
s2 = sapply(sapply(s0,`[`,-1), paste0, "")
paste0(s1, pad_if2(s2, cond = nchar(s2) > 0, n = max(nchar(s2))))
pad_if2 has less general use than pad_if, since it requires x be coercible to numeric. Pretty much every step here is clunkier than the corresponding code with the packages mentioned above.
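For comparison, the same "pad to the widest number" idea can be sketched in base R with regexpr and the replacement form of regmatches. This assumes the digits only ever appear as a run at the end of each name; pad_numbers is a hypothetical helper name, not a function from any package.

```r
# Pad a trailing run of digits to the width of the longest run in the vector.
pad_numbers <- function(x) {
  m <- regexpr("[0-9]+$", x)          # position of the trailing digits, -1 if none
  if (all(m == -1)) return(x)         # nothing to pad
  digits <- regmatches(x, m)          # matched digit strings only
  width <- max(nchar(digits))
  fmt <- paste0("%0", width, "d")     # e.g. "%03d"
  regmatches(x, m) <- sprintf(fmt, as.integer(digits))
  x
}

print(pad_numbers(c("a", "a2", "a20", "a202")))
# [1] "a"    "a002" "a020" "a202"
```

Elements without digits are untouched because regmatches<- only writes to the matched positions.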

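The key is to match a single digit at the end of the string ($) that is preceded by a non-digit. The non-digit has to be captured and put back, otherwise it is dropped from the result:
gsub('([^0-9])([0-9])$','\\10\\2',names)
[1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
Or use a look-behind, which checks for the preceding letter without consuming it:
gsub('(?<=[a-z])(\\d)$','0\\1',names,perl=TRUE)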

Related

R remove first character from string

I can remove the last character from a string:
listfruit <- c("aapplea","bbananab","oranggeo")
gsub('.{1}$', '', listfruit)
But I am having problems trying to remove the first character from a string.
And also the first and last character.
I would be grateful for your help.
If we need to remove the first character, use sub, match one character (. represents a single character), replace it with ''.
sub('.', '', listfruit)
#[1] "applea" "bananab" "ranggeo"
Or for the first and last character, match the character at the start of the string (^.) or the end of the string (.$) and replace it with ''.
gsub('^.|.$', '', listfruit)
#[1] "apple" "banana" "rangge"
We can also capture it as a group and replace with the backreference.
sub('^.(.*).$', '\\1', listfruit)
#[1] "apple" "banana" "rangge"
Another option is with substr
substr(listfruit, 2, nchar(listfruit)-1)
#[1] "apple" "banana" "rangge"
library(stringr)
str_sub(listfruit, 2, -2)
#[1] "apple" "banana" "rangge"
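The substr approach also covers the "first character only" case without a regex. A short sketch on the question's data, using only base R (the result names no_first, no_last and no_ends are introduced here for illustration):

```r
listfruit <- c("aapplea", "bbananab", "oranggeo")

# Drop only the first character: substring() runs to the end of the string by default.
no_first <- substring(listfruit, 2)

# Drop only the last character.
no_last <- substr(listfruit, 1, nchar(listfruit) - 1)

# Drop both ends.
no_ends <- substr(listfruit, 2, nchar(listfruit) - 1)

print(no_first)
print(no_last)
print(no_ends)
```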
Removing first and last characters.
Performance was important for me, so I ran a quick benchmark of the available solutions.
library(magrittr)
comb_letters = combn(letters,5) %>% apply(2, paste0,collapse = "")
bench::mark(
gsub = {gsub(pattern = '^.|.$',replacement = '',x = comb_letters)},
substr = {substr(comb_letters,start = 2,stop = nchar(comb_letters) - 1)},
str_sub = {stringr::str_sub(comb_letters,start = 2,end = -2)}
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 gsub 32.9ms 33.7ms 29.7 513.95KB 0
#> 2 substr 15.07ms 15.84ms 62.7 1.51MB 2.09
#> 3 str_sub 5.08ms 5.36ms 177. 529.34KB 2.06
Created on 2019-12-30 by the reprex package (v0.3.0)

str_extract specific patterns (example)

I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
my approaches so far:
vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))
or
str_extract(str2, "[A-Z][0-9]{5}[A-Z]")
The expected outputs are
A1234B
A12345B
A12345
Thanks!
You can try
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, followed by an optional capital letter [A-Z]?.
Or you can use stringi which would be faster
library(stringi)
stri_extract(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
data
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
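The pattern above matches anywhere in the string, so a field with six digits would still yield a five-digit match. If the code must fill a whole underscore-delimited field, one possible tightening is to add look-arounds for the underscores. This is a sketch under that assumption; the name strict and the input "_A123456_" are made up for illustration.

```r
str2 <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")

# Require an underscore immediately before and after the code.
strict <- "(?<=_)[A-Z][0-9]{4,5}[A-Z]?(?=_)"
codes <- regmatches(str2, regexpr(strict, str2, perl = TRUE))
print(codes)

# Hypothetical input with six digits: the loose pattern still finds a match,
# the strict one finds none.
loose_hit  <- regmatches("_A123456_", regexpr("[A-Z][0-9]{4,5}[A-Z]?", "_A123456_"))
strict_hit <- regmatches("_A123456_", regexpr(strict, "_A123456_", perl = TRUE))
print(loose_hit)
print(strict_hit)
```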
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub and this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"
Using rex to construct the regular expression may make it more understandable.
library(rex)
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1, assumes always is between the second underscores.
re_matches(x,
rex(
"_",
anything,
"_",
capture(anything),
"_"
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
re_matches(x,
rex(
capture(
alpha,
between(digit, 4, 5),
maybe(alpha)
)
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed=T), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl=T))

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
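One caveat with the base R version: regmatches silently drops elements that have no match, so on a vector the result can be shorter than the input. A minimal sketch that keeps the positions aligned and returns NA where no digit exists (first_digit is a hypothetical helper name, and the inputs other than the question's string are made up):

```r
# Return the first digit of each string as an integer, NA if there is none.
first_digit <- function(x) {
  m <- regexpr("[0-9]", x)                 # -1 marks strings without a digit
  out <- rep(NA_integer_, length(x))
  out[m != -1] <- as.integer(regmatches(x, m))
  out
}

print(first_digit(c("ABC3JFD456", "nodigits", "7up")))
# [1]  3 NA  7
```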
1) sub Try sub with the indicated regular expression which takes the shortest string until a digit, a digit and then everything following and replaces it with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
This gives the same result. Alternatively, the following gives the same answer again and also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]]>=0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
Using rex may make this type of task a little simpler.
library(rex)
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
rex(
capture(name = "first_number", digit)
)
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So:
> splstr <- strsplit(string, "")
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Splitting a string without delimiters in R

I have a character vector in R, with each element containing a string - let's use this example:
my.files <- c("AWCallibration#NoneBino-3", "AWExperiment1#NoneBino-1", "AWExperiment2#NonemonL-2")
I would like to extract certain information out of these strings -
First, two uppercase alpha characters (in this case, always "AW")
Whether the trial was for calibration ("Callibration") or data collection - if it was the latter, which condition was used ("Experiment1" or "Experiment2")
Which sub-condition was used on this particular trial ("Bino" or "monL")
The repetition of the sub-condition ("1" or "2")
I first tried using strsplit, but this only appears to work for cases with regular delimiters such as "_". substring appeared to suit my needs better, but did not actually work due to the fact that splits don't occur in regular places ("Experiment1" is eleven elements long, "Callibration" is twelve).
I suspect that use of regular expressions may be the answer here, but I don't know how to account for the different lengths between the splits.
You can extract the information one by one:
first <- substr(my.files, 1, 2)
# [1] "AW" "AW" "AW"
second <- sub("^..(.*)#.*", "\\1", my.files)
# [1] "Callibration" "Experiment1" "Experiment2"
third <- sub("^.*#None(.*)-\\d+$", "\\1", my.files)
# [1] "Bino" "Bino" "monL"
fourth <- sub(".*-(\\d+)$", "\\1", my.files)
# [1] "3" "1" "2"
All in one command:
strsplit(my.files, "(?<=^..)(?=[A-Z])|#None|-", perl = TRUE)
# [[1]]
# [1] "AW" "Callibration" "Bino" "3"
#
# [[2]]
# [1] "AW" "Experiment1" "Bino" "1"
#
# [[3]]
# [1] "AW" "Experiment2" "monL" "2"
Here are a few different solutions:
gsubfn::strapplyc Try this:
library(gsubfn)
pat <- "(..)(.*)#None(.*)-(.*)"
strapplyc(my.files, pat, simplify = rbind)
which gives:
[,1] [,2] [,3] [,4]
[1,] "AW" "Callibration" "Bino" "3"
[2,] "AW" "Experiment1" "Bino" "1"
[3,] "AW" "Experiment2" "monL" "2"
Note that in the development version of the gsubfn package there is a read.pattern command which could use the above pat like this: read.pattern(text = my.files, pattern = pat, as.is = TRUE) .
sub/strsplit Here is an alternate solution. It inserts a minus after the second character and then splits each strip by minus or #None:
my.files2 <- sub("(..)", "\\1-", my.files)
do.call(rbind, strsplit(my.files2, "-|#None"))
which gives:
[,1] [,2] [,3] [,4]
[1,] "AW" "Callibration" "Bino" "3"
[2,] "AW" "Experiment1" "Bino" "1"
[3,] "AW" "Experiment2" "monL" "2"
gsub/read.table Here we use gsub to insert a minus after the first two characters and also we replace #None with minus. Then we use read.table with a sep of minus to read it in:
withMinus <- gsub("^(..)|#None", "\\1-", my.files)
read.table(text = withMinus, sep = "-", as.is = TRUE)
V1 V2 V3 V4
1 AW Callibration Bino 3
2 AW Experiment1 Bino 1
3 AW Experiment2 monL 2
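The capture-group pattern used with strapplyc can also be applied in base R through regexec, which returns the full match plus each group. This is a sketch rather than a drop-in replacement: the stricter pattern and the column names are assumptions made here for illustration.

```r
my.files <- c("AWCallibration#NoneBino-3", "AWExperiment1#NoneBino-1",
              "AWExperiment2#NonemonL-2")

# Four groups: two-letter prefix, condition, sub-condition, repetition.
pat <- "^([A-Z]{2})(.*)#None(.*)-([0-9]+)$"
parts <- regmatches(my.files, regexec(pat, my.files))

# Each list element is c(full match, group 1, ..., group 4); drop the full match.
mat <- do.call(rbind, lapply(parts, `[`, -1))
colnames(mat) <- c("prefix", "condition", "subcondition", "repetition")  # hypothetical names
print(mat)
```

Unlike regexpr, regexec keeps one list element per input string, so a file name that does not fit the pattern shows up as character(0) instead of being dropped.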

How to prevent regmatches from dropping non-matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
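The same result can be had without regmatches at all, by using the start positions and the match.length attribute that regexpr returns. A minimal sketch on the question's data (the variable names start, len and out2 are introduced here for illustration):

```r
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x)

start <- as.integer(m)              # start of the first match, -1 if none
len <- attr(m, "match.length")      # length of that match, -1 if none

# Cut out the matched span, then blank out the non-matches.
out2 <- substring(x, start, start + len - 1L)
out2[start == -1L] <- NA_character_
print(out2)
# [1] "a"  NA   "a"  "aa"
```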
Use regexec instead, since it returns a list, which lets you catch the character(0) elements before unlisting:
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In most cases of interest (matching a single pattern with regexpr), each element has length 3 or 1: length 1 when no match is found, and length 3 when there is a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA if there is no match. It also has a simpler interface than regmatches().