Remove numbers from alphanumeric characters - regex

I have a list of alphanumeric characters that looks like:
x <-c('ACO2', 'BCKDHB456', 'CD444')
I would like the following output:
x <-c('ACO', 'BCKDHB', 'CD')
Any suggestions?
# dput(tmp2)
structure(c(432L, 326L, 217L, 371L, 179L, 182L, 188L, 268L, 255L,...,
), class = "factor")

You can use gsub for this:
gsub('[[:digit:]]+', '', x)
gsub('[0-9]+', '', x)
# [1] "ACO" "BCKDHB" "CD"
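The dput output in the question suggests the real data may be a factor rather than a character vector. A minimal sketch for that case, assuming a small made-up factor `f` (not the asker's data), is to rewrite the levels instead of the values:

```r
# Hypothetical factor standing in for the asker's data
f <- factor(c("ACO2", "BCKDHB456", "CD444", "ACO2"))

# Strip digits from the level labels; the factor structure is preserved
levels(f) <- gsub("[0-9]+", "", levels(f))

cleaned <- as.character(f)
print(cleaned)
```

Calling gsub directly on a factor also works, but it returns a plain character vector, so editing the levels is only worth it if the result should stay a factor.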

If your goal is just to remove numbers, the removeNumbers() function from the tm package does exactly that. Using it reduces the risk of regex mistakes.
library(tm)
x <- c('ACO2', 'BCKDHB456', 'CD444')
x <- removeNumbers(x)
x
# [1] "ACO" "BCKDHB" "CD"

Using stringr:
Most stringr functions handle regex, and str_replace_all will do what you need:
library(stringr)
str_replace_all(c('ACO2', 'BCKDHB456', 'CD444'), "[:digit:]", "")

A solution using stringi:
library(stringi)
# your data
x <- c('ACO2', 'BCKDHB456', 'CD444')
# extract capital letters
x <- stri_extract_all_regex(x, "[A-Z]+")
# unlist, so that you have a vector
x <- unlist(x)
Solution in one line:
unlist(stri_extract_all_regex(c('ACO2', 'BCKDHB456', 'CD444'), "[A-Z]+"))


Regex of consecutive punctuation in R

I have a character vector that looks like this:
z <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
[9992] "./."
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9996] ",/,"
[9997] "wretched/JJ"
I want to remove all entries that consist of three consecutive punctuation marks, resulting in something like this:
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9997] "wretched/JJ"
I've tried different regex expressions:
sub("[:punct:]/[:punct:]", "", z)
sub("[:punct:]{3}", "", z)
with either single/double brackets, both yield:
[9992] "./."
[9993] "To"
[9994] "my$"
[9995] "starved"
[9996] ",/,"
[9997] "wretched"
Any ideas? And I apologize in advance if the question is dumb; I'm not very good at this!
Try this:
x <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
grep("[[:punct:]]{3}", x, value = TRUE, invert = TRUE)
## [1] "To/TO" "my/PRP$" "starved/VBN" "wretched/JJ"
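Note that the character class has to sit inside a bracket expression, i.e. `[[:punct:]]`. The pattern above also drops any longer entry that merely contains three punctuation marks in a row. A hedged variant, assuming the goal is to drop only entries that consist entirely of three punctuation marks, anchors the pattern at both ends:

```r
z <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ")

# TRUE for entries made up of exactly three punctuation characters
is_punct_only <- grepl("^[[:punct:]]{3}$", z)

kept <- z[!is_punct_only]
print(kept)
```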

R: Substring after finding a character position?

I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggles with regex.
Right now, I have a data.frame with a column, df$id that looks something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 12.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A and so on for an entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\.", df$id), 99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit, using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
# [1] "23-45-6A" "1B" "A-A"
You could remove everything up to and including the first . using
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
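For something closer to the `indexOf` idea in the question, here is a minimal sketch that finds the position of the first period and takes the substring after it. The data frame is a made-up stand-in for the asker's `df`:

```r
# Hypothetical data frame with the id values from the question
df <- data.frame(id = c("13.23-45-6A", "0.1B", "4.A-A"),
                 stringsAsFactors = FALSE)

# Position of the first literal "." in each string (-1 if there is none)
pos <- regexpr(".", df$id, fixed = TRUE)

# Everything after that position; regexpr and substring are both vectorized
df$new <- substring(df$id, pos + 1)
print(df)
```

With `fixed = TRUE` the period needs no escaping. When a string has no period, `pos + 1` is 0 and `substring` returns the whole string unchanged.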

Extracting numbers from a string including decimals and scientific notation

I have some strings that look like
x<-"p = 9.636e-05"
And I would like to extract just the number using gsub. So far I have
gsub("[[:alpha:]](?!-)|=|\\^2", "", x)
But that removes the 'e' from the scientific notation, giving me
" 9.636-05"
Which can't be converted to a number using as.numeric. I know that it would be possible to use a lookahead to match the "-", but I don't know exactly how to go about doing this.
You could try
sub('.* = ', '', x)
#[1] "9.636e-05"
You can use the following to initially remove all non-digit characters at the start of the string:
sub('^\\D+', '', x)
format(as.numeric(gsub("[^0-9e.-]", "", x)), scientific = FALSE)
# [1] "0.00009636"
Through the sub or regmatches functions:
> x<-"p = 9.636e-05"
> sub(".* ", "", x)
[1] "9.636e-05"
> regmatches(x, regexpr("\\S+$", x))
[1] "9.636e-05"
> library(stringi)
> stri_extract(x, regex="\\S+$")
[1] "9.636e-05"
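The answers above rely on the number being the last token in the string. A hedged alternative is to match the number itself, including an optional sign, decimal part, and exponent. The pattern and the extra test strings below are illustrative assumptions, not part of the original question:

```r
# First element is from the question; the other two are made-up test cases
x <- c("p = 9.636e-05", "t = -2.5", "n = 12")

# Optional sign, digits with optional decimal point, optional exponent
num_pattern <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"

# regexpr returns the first match in each string
num_strings <- regmatches(x, regexpr(num_pattern, x, perl = TRUE))
nums <- as.numeric(num_strings)
print(nums)
```

This takes the first number in each string, so a label such as `r^2` that contains a digit before the value would need a stricter pattern.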

Replace parts of string using package stringi (regex)

I have some string
string <- "abbccc"
I want to replace each run of the same letter with just one letter followed by the number of occurrences of that letter. So I want to have something like "ab2c3".
I use stringi package to do this, but it doesn't work exactly like I want. Let's say I already have vector with parts for replacement:
vector <- c("b2", "c3")
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector)
The output:
[1] "ab2b2" "ac3c3"
The output I want: [1] "ab2c3"
I also tried this way
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all=FALSE)
but I get an error:
Error in stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all = FALSE) :
vector length not consistent with other arguments
Not regex, but a strsplit and rle with some paste magic:
string <- c("abbccc", "bbaccc", "uffff", "aaabccccddd")
# rle() returns list(lengths, values), so x[[1]] is the run length and x[[2]] the letter
sapply(lapply(strsplit(string, ""), rle), function(x) {
  paste(x[[2]], ifelse(x[[1]] == 1, "", x[[1]]), sep="", collapse="")
})
## [1] "ab2c3" "b2ac3" "uf4" "a3bc4d3"
Not a stringi solution and not a regex either, but you can do it by splitting the string and using rle:
string <- "abbccc"
#[1] "ab2c3"
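If a regex-based route is preferred, one possible sketch in base R matches each run with a backreference and then builds the replacement from the run lengths. The helper name `compress_runs` is made up for illustration:

```r
# Hypothetical helper: collapse runs of a repeated character into letter + count
compress_runs <- function(s) {
  # Each match is one maximal run of the same character, e.g. "a", "bb", "ccc"
  runs <- regmatches(s, gregexpr("(.)\\1*", s, perl = TRUE))[[1]]
  n <- nchar(runs)
  # Keep the first character of each run; append the count only when it exceeds 1
  paste0(substr(runs, 1, 1), ifelse(n == 1, "", n), collapse = "")
}

result <- vapply(c("abbccc", "uffff"), compress_runs, character(1), USE.NAMES = FALSE)
print(result)
```

The function handles one string at a time, so `vapply` is used to apply it across a vector.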

dynamic regex in R

The below code works so long as before and after strings have no characters that are special to a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' #parentheses cause problem in regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is quotemeta for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
                  collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
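A self-contained sketch of the `\Q...\E` approach follows. The input vector here is a made-up example (one string that contains the literal delimiters, one that would only match if the periods were treated as wildcards), not the original test data:

```r
before <- "A."
after <- ".Z"

# Hypothetical inputs: only the first contains the literal "A." and ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")

# \Q...\E makes PCRE treat the enclosed text literally
pattern <- sprintf("(?<=\\Q%s\\E).*?(?=\\Q%s\\E)", before, after)

has_match <- grepl(pattern, x, perl = TRUE)
extracted <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
print(has_match)
print(extracted)
```

This requires `perl = TRUE`, and it breaks if `before` or `after` itself contains the sequence `\E`.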
dnagirl, such a function exists: glob2rx. Note that it treats * and ? as glob wildcards rather than escaping them.
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
tt <- glob2rx(a)
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"