R remove first character from string - regex

I can remove the last character from a string:
listfruit <- c("aapplea","bbananab","oranggeo")
gsub('.{1}$', '', listfruit)
But I am having problems trying to remove the first character from a string.
And also the first and last character.
I would be grateful for your help.

To remove the first character, use sub. It replaces only the first match, so matching one character (. matches any single character) and replacing it with '' drops the first character.
sub('.', '', listfruit)
#[1] "applea" "bananab" "ranggeo"
Or for the first and last character, match the character at the start of the string (^.) or the end of the string (.$) and replace it with ''.
gsub('^.|.$', '', listfruit)
#[1] "apple" "banana" "rangge"
We can also capture it as a group and replace with the backreference.
sub('^.(.*).$', '\\1', listfruit)
#[1] "apple" "banana" "rangge"
Another option is with substr
substr(listfruit, 2, nchar(listfruit)-1)
#[1] "apple" "banana" "rangge"
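As a quick check of how these approaches behave on very short inputs, here is a minimal sketch. The helper name trim_ends and the vector short are made up for illustration; they are not part of the original answer.

```r
listfruit <- c("aapplea", "bbananab", "oranggeo")

# Hypothetical helper wrapping the substr approach.
trim_ends <- function(x) substr(x, 2, nchar(x) - 1)

trim_ends(listfruit)
#[1] "apple"  "banana" "rangge"

# Strings with fewer than two characters (illustrative input).
short <- c("ab", "a", "")
res_substr <- trim_ends(short)               # "" whenever stop < start
res_gsub   <- gsub('^.|.$', '', short)       # removes whatever characters exist
res_sub    <- sub('^.(.*).$', '\\1', short)  # needs two characters to match at all
res_sub
#[1] ""  "a" ""
```

The capture-group version leaves a one-character string untouched because its pattern requires both a first and a last character, while the other two return an empty string.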

library(stringr)
str_sub(listfruit, 2, -2)
#[1] "apple" "banana" "rangge"
Removing first and last characters.

Performance was important for me, so I ran a quick benchmark of the available solutions.
library(magrittr)
comb_letters = combn(letters,5) %>% apply(2, paste0,collapse = "")
bench::mark(
gsub = {gsub(pattern = '^.|.$',replacement = '',x = comb_letters)},
substr = {substr(comb_letters,start = 2,stop = nchar(comb_letters) - 1)},
str_sub = {stringr::str_sub(comb_letters,start = 2,end = -2)}
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 gsub 32.9ms 33.7ms 29.7 513.95KB 0
#> 2 substr 15.07ms 15.84ms 62.7 1.51MB 2.09
#> 3 str_sub 5.08ms 5.36ms 177. 529.34KB 2.06
Created on 2019-12-30 by the reprex package (v0.3.0)

Related

Add leading zeros within string

I have a series of column names that I'm trying to standardize.
names <- c("apple", "banana", "orange", "apple1", "apple2", "apple10", "apple11", "banana2", "banana12")
I would like anything that has a one digit number to be padded by a zero, so
apple
banana
orange
apple01
apple02
apple10
apple11
banana02
...
I've been trying to use stringr
strdouble <- str_detect(names, "[0-9]{2}")
strsingle <- str_detect(names, "[0-9]")
str_detect(names[strsingle & !strdouble])
but unable to figure out how to selectively replace/prepend...
You can use sub("([a-z])([0-9])$","\\10\\2",names) :
[1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02"
[9] "banana12"
It only changes the names where there is a single digit following a letter (the $ is the end of the string).
The \\1 selects the first block in () : the letter. Then it puts a leading 0, then the second block in () : the digit.
Here's one option using negative look-ahead and look-behind assertions to identify single digits.
gsub('(?<!\\d)(\\d)(?!\\d)', '0\\1', names, perl=TRUE)
# [1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
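The two answers above differ when a single digit is not at the end of the name. A small sketch of that difference, using a made-up vector nm that is not from the question:

```r
# Illustrative names with digits in the middle as well as at the end.
nm <- c("x1y", "x12y", "x1y2")

# Anchored version: pads only a single digit at the very end of the string.
anchored <- sub("([a-z])([0-9])$", "\\10\\2", nm)
anchored
# [1] "x1y"   "x12y"  "x1y02"

# Lookaround version: pads every digit that has no digit on either side.
lookaround <- gsub("(?<!\\d)(\\d)(?!\\d)", "0\\1", nm, perl = TRUE)
lookaround
# [1] "x01y"   "x12y"   "x01y02"
```

For the column names in the question both give the same result; which one fits depends on whether digits elsewhere in the name should be padded too.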
str_pad from stringr:
library(stringr)
pad_if = function(x, cond, n, fill = "0") str_pad(x, n*cond, pad = fill)
s = str_split_fixed(names,"(?=\\d)",2)
# [,1] [,2]
# [1,] "apple" ""
# [2,] "banana" ""
# [3,] "orange" ""
# [4,] "apple" "1"
# [5,] "apple" "2"
# [6,] "apple" "10"
# [7,] "apple" "11"
# [8,] "banana" "2"
# [9,] "banana" "12"
paste0(s[,1], pad_if(s[,2], cond = nchar(s[,2]) > 0, n = max(nchar(s[,2]))))
# [1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
This also extends to cases like going from c("a","a2","a20","a202") to c("a","a002","a020","a202"), which the other approaches don't cover.
The stringr package is based on stringi, which has all the same functionality used here, I'm guessing.
sprintf from base, with a similar approach:
pad_if2 = function(x, cond, n, fill = "0")
replace(x, cond, sprintf(paste0("%",fill,n,"d"), as.numeric(x)[cond]))
s0 = strsplit(names,"(?<=\\D)(?=\\d)|$",perl=TRUE)
s1 = sapply(s0,`[`,1)
s2 = sapply(sapply(s0,`[`,-1), paste0, "")
paste0(s1, pad_if2(s2, cond = nchar(s2) > 0, n = max(nchar(s2))))
pad_if2 has less general use than pad_if, since it requires x be coercible to numeric. Pretty much every step here is clunkier than the corresponding code with the packages mentioned above.
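A somewhat shorter base R sketch of the same idea is possible with regexpr and the replacement form of regmatches. The function name pad_digits is hypothetical, and the sketch assumes the digits form a trailing run that fits in an integer.

```r
# Hypothetical helper: zero-pad the trailing digits to the widest run found.
pad_digits <- function(x) {
  m <- regexpr("[0-9]+$", x)          # trailing run of digits, -1 if none
  digits <- regmatches(x, m)          # only the elements that matched
  if (length(digits) == 0) return(x)  # nothing to pad
  width <- max(nchar(digits))
  regmatches(x, m) <- sprintf(paste0("%0", width, "d"), as.integer(digits))
  x
}

nms <- c("apple", "banana", "orange", "apple1", "apple2",
         "apple10", "apple11", "banana2", "banana12")
padded <- pad_digits(nms)
wide <- pad_digits(c("a", "a2", "a20", "a202"))
wide
# [1] "a"    "a002" "a020" "a202"
```

Like pad_if2, this relies on the digit part being coercible to a number.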
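The key is to match a single digit at the end of the string ($) that is preceded by a non-digit. Capture the preceding character too, otherwise it is dropped from the result:
gsub('([^0-9])([0-9])$','\\10\\2',names)
[1] "apple" "banana" "orange" "apple01" "apple02" "apple10" "apple11" "banana02" "banana12"
Or use a look-behind, which checks the letter without consuming it:
gsub('(?<=[a-z])(\\d)$','0\\1',names,perl=TRUE)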

str_extract specific patterns (example)

I'm still a little confused by regex syntax. Can you please help me with these patterns:
_A00_A1234B_
_A00_A12345B_
_A1_A12345_
my approaches so far:
vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))
or
str_extract(str2, "[A-Z][0-9]{5}[A-Z]")
The expected outputs are
A1234B
A12345B
A12345
Thanks!
You can try
library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, optionally followed by a capital letter [A-Z]?
Or you can use stringi, which may be faster
library(stringi)
stri_extract_first(str2, regex="[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B" "A12345B" "A12345"
Or a base R option would be
regmatches(str2,regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B" "A12345B" "A12345"
data
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
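One difference between the base R option and the others shows up when some element has no match: regmatches with regexpr drops that element, while str_extract returns NA for it. A minimal sketch of keeping the result aligned in base R, using a made-up vector str3:

```r
# Illustrative input: the second element has no match.
str3 <- c("_A00_A1234B_", "_no_match_", "_A1_A12345_")
pat <- "[A-Z][0-9]{4,5}[A-Z]?"

m <- regexpr(pat, str3)          # -1 where there is no match
dropped <- regmatches(str3, m)   # length 2: the non-match is dropped

# Fill a result of the original length, leaving NA where nothing matched.
aligned <- rep(NA_character_, length(str3))
aligned[m > 0] <- regmatches(str3, m)
aligned
#[1] "A1234B" NA       "A12345"
```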
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
You can use sub and this regex:
sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B" "A12345B" "A12345"
Using rex to construct the regular expression may make it more understandable.
library(rex)
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")
# approach #1, assumes the target always sits between the second and third underscores.
re_matches(x,
rex(
"_",
anything,
"_",
capture(anything),
"_"
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
# approach #2, assumes an alpha, followed by 4 or 5 digits with a possible trailing alpha.
re_matches(x,
rex(
capture(
alpha,
between(digit, 4, 5),
maybe(alpha)
)
)
)
#> 1
#> 1 A1234B
#> 2 A12345B
#> 3 A12345
You can do this without using a regular expression ...
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed=T), '[', 3)
# [1] "A1234B" "A12345B" "A12345"
If you insist on using a regular expression, the following will suffice.
regmatches(x, regexpr('[^_]+(?=_$)', x, perl=T))

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub Try sub with the regular expression below. It lazily matches the shortest prefix up to the first digit, captures that digit, matches everything after it, and replaces the whole string with the captured digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same result. Alternatively, use the following, which gives the same answer again but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
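The lazy quantifier in the sub approach is what selects the first digit. A short sketch contrasting it with the greedy form; perl = TRUE is an addition here so that both calls run on the same engine.

```r
string <- "ABC3JFD456"

# Lazy prefix: stops at the first digit.
lazy <- sub(".*?(\\d).*", "\\1", string, perl = TRUE)

# Greedy prefix: swallows as much as possible, leaving only the last digit.
greedy <- sub(".*(\\d).*", "\\1", string, perl = TRUE)

# No digit at all: the pattern does not match, so the input comes back unchanged.
unchanged <- sub(".*?(\\d).*", "\\1", "ABC", perl = TRUE)

c(lazy, greedy, unchanged)
# [1] "3"   "6"   "ABC"
```

The last case is worth checking for in real data, since sub gives no signal that nothing matched.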
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]]>=0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
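The regexpr version extends naturally to a vector of strings. A minimal sketch, where first_digit is a hypothetical helper name and NA is an assumed choice for strings without any digit:

```r
# Hypothetical helper: first digit of each string as an integer, NA if none.
first_digit <- function(x) {
  pos <- regexpr("[0-9]", x)             # position of first digit, -1 if none
  out <- rep(NA_integer_, length(x))
  hit <- pos > 0
  out[hit] <- as.integer(substr(x[hit], pos[hit], pos[hit]))
  out
}

digits <- first_digit(c("ABC3JFD456", "ARST4DS324", "none"))
digits
# [1]  3  4 NA
```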
Using rex may make this type of task a little simpler.
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
rex(
capture(name = "first_number", digit)
)
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So, with splstr <- strsplit(string, ""):
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Regex/ Substring

I have a sequence like this in a list "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
I would like to extract a substring wherever a "K" is present: the 6 characters before the "K", the "K" itself, and the 6 characters after it.
Example: MSGSRRKATPASR, i.e. -6..K..+6
I need this for the whole sequence. I tried the substring function in R, but it needs explicit start and end positions, and here the positions are unknown.
Thanks
.{6}K.{6}
Try this. This will give the desired result.
See demo.
http://regex101.com/r/dM0rS8/4
Use this:
\w{7}(?<=K)\w{6}
This uses a positive lookbehind to ensure that the seventh matched character is K, so there are always 6 characters before it.
demo here: http://regex101.com/r/pK3jK1/2
Using rex may make this type of task a little simpler.
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
library(rex)
re_matches(x,
rex(
capture(name = "amino_acids",
n(any, 6),
"K",
n(any, 6)
)
),
global = TRUE)[[1]]
#> amino_acids
#>1 MSGSRRKATPASR
#>2 GEGSFAKVKYAKN
#>3 GDQAAIKILDREK
#>4 KMVEQLKREISTM
#>5 IEVMASKTKIYIV
#>6 GGELFDKIAQQGR
#>7 VYHRDLKPENLIL
#>8 DANGVLKVSDFGL
#>9 PEVLSDKGYDGAA
#>10 NLMTLYKRICKAE
#>11 WFSQGAKRVIKRI
#>12 LEDEWFKKGYKPP
#>13 AAFSNSKECLVTE
#>14 LENLFEKQAQLVK
#>15 ASEIMSKMEETAK
#>16 LGFNVRKDNYKIK
#>17 GDKSGRKGQLSVA
#>18 HVVELRKTGGDTL
#>19 VCDSFYKNFSSGL
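However, the matches above do not overlap: once a 13-character window has been matched, a K inside it cannot start a window of its own, so not every K gets a result.
If you want a window of amino acids for each K, use lookarounds so that the neighbours are checked but not consumed: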
library(rex)
locs <- re_matches(x,
rex(
"K" %if_prev_is% n(any, 6) %if_next_is% n(any, 6)
),
global = TRUE, locations = TRUE)[[1]]
substring(x, locs$start - 6, locs$end + 6)
#> [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN" "GSFAKVKYAKNTV" "AKVKYAKNTVTGD"
#> [5] "GDQAAIKILDREK" "KILDREKVFRHKM" "EKVFRHKMVEQLK" "KMVEQLKREISTM"
#> [9] "REISTMKLIKHPN" "STMKLIKHPNVVE" "IEVMASKTKIYIV" "VMASKTKIYIVLE"
#>[13] "GGELFDKIAQQGR" "AQQGRLKEDEARR" "VYHRDLKPENLIL" "DANGVLKVSDFGL"
#>[17] "PEVLSDKGYDGAA" "NLMTLYKRICKAE" "LYKRICKAEFSCP" "WFSQGAKRVIKRI"
#>[21] "GAKRVIKRILEPN" "LEDEWFKKGYKPP" "EDEWFKKGYKPPS" "WFKKGYKPPSFDQ"
#>[25] "AAFSNSKECLVTE" "ECLVTEKKEKPVS" "CLVTEKKEKPVSM" "VTEKKEKPVSMNA"
#>[29] "LENLFEKQAQLVK" "KQAQLVKKETRFT" "QAQLVKKETRFTS" "ASEIMSKMEETAK"
#>[33] "KMEETAKPLGFNV" "LGFNVRKDNYKIK" "VRKDNYKIKMKGD" "KDNYKIKMKGDKS"
#>[37] "NYKIKMKGDKSGR" "IKMKGDKSGRKGQ" "GDKSGRKGQLSVA" "HVVELRKTGGDTL"
#>[41] "DTLEFHKVCDSFY" "VCDSFYKNFSSGL" "NFSSGLKDVVWNT"
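The same two behaviours can be reproduced in base R. This is a sketch on a short made-up sequence (not the protein from the question) so that the result is easy to verify by hand.

```r
# Made-up 19-character sequence with K at positions 7, 10 and 13.
x <- "ABCDEFKGHKIJKLMNOPQ"

# Non-overlapping 13-character windows: the first match consumes the other Ks.
non_overlapping <- regmatches(x, gregexpr(".{6}K.{6}", x))[[1]]
non_overlapping
# [1] "ABCDEFKGHKIJK"

# One window per K: the lookarounds check 6 neighbours on each side
# without consuming them, so only the K itself is matched.
pos <- as.integer(gregexpr("(?<=.{6})K(?=.{6})", x, perl = TRUE)[[1]])
per_k <- substring(x, pos - 6, pos + 6)
per_k
# [1] "ABCDEFKGHKIJK" "DEFKGHKIJKLMN" "KGHKIJKLMNOPQ"
```

Note that gregexpr returns -1 when nothing matches, so real code should check for that before calling substring.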

strsplit inconsistent with gregexpr

A comment on my answer to this question points out that a regex which should give the desired result with strsplit does not, even though it seems to correctly match the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
@Aprillion's theory is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ matches the beginning of the remaining string (without the preceding items).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to @JoshO'Brien for the link.)
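The documented loop can be sketched in R to see why every comma ends up as a split point. This is only an illustration of the algorithm quoted above, not the actual implementation; split_like_strsplit is a hypothetical name, and zero-length matches are deliberately not handled.

```r
# Hypothetical re-implementation of the documented strsplit loop.
split_like_strsplit <- function(s, pattern) {
  out <- character(0)
  while (nchar(s) > 0) {
    m <- regexpr(pattern, s, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start > 0) {
      if (len == 0) stop("zero-length matches are not handled in this sketch")
      out <- c(out, substr(s, 1, start - 1))   # text left of the match
      s <- substr(s, start + len, nchar(s))    # drop the match and all before it
    } else {
      out <- c(out, s)                         # no match: keep the rest
      break
    }
  }
  out
}

x <- "123,34,56,78,90"
pat <- '^\\w+\\K,|,(?=\\w+$)'
sketch <- split_like_strsplit(x, pat)
sketch
# [1] "123" "34"  "56"  "78"  "90"
```

Because the matched part is removed before the next search, the ^ branch of the pattern finds a new "first comma" on every pass, which gregexpr, searching the intact string, never does.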