Remove numbers from alphanumeric characters - regex

I have a list of alphanumeric characters that looks like:
x <-c('ACO2', 'BCKDHB456', 'CD444')
I would like the following output:
x <-c('ACO', 'BCKDHB', 'CD')
Any suggestions?
# dput(tmp2)
structure(c(432L, 326L, 217L, 371L, 179L, 182L, 188L, 268L, 255L,...,
), class = "factor")

You can use gsub for this:
gsub('[[:digit:]]+', '', x)
or
gsub('[0-9]+', '', x)
# [1] "ACO" "BCKDHB" "CD"
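A small sketch of how the two base functions differ. The fourth element, 'A1B2', is a made-up case that is not in the question; it shows that gsub removes every run of digits, while sub with an end anchor removes only trailing digits. Because the dput output in the question has class "factor", the last lines assume the real data may be a factor and clean its levels. That is a guess about the data, not something the question confirms.

```r
x <- c("ACO2", "BCKDHB456", "CD444", "A1B2")  # "A1B2" is a hypothetical extra case

all_digits <- gsub("[0-9]+", "", x)   # every run of digits is removed
all_digits
# [1] "ACO"    "BCKDHB" "CD"     "AB"

trailing <- sub("[0-9]+$", "", x)     # only a trailing run of digits is removed
trailing
# [1] "ACO"    "BCKDHB" "CD"     "A1B"

# Assumption: the real data is a factor, so clean the levels rather than the codes
f <- factor(x)
levels(f) <- gsub("[0-9]+", "", levels(f))
cleaned <- as.character(f)
```

Which variant is appropriate depends on whether digits can also appear in the middle of an identifier.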

If your goal is just to remove numbers, the removeNumbers() function from the tm package does exactly that. Using a ready-made function instead of a hand-written regex reduces the risk of mistakes.
library(tm)
x <-c('ACO2', 'BCKDHB456', 'CD444')
x <- removeNumbers(x)
x
[1] "ACO" "BCKDHB" "CD"

Using stringr: most stringr functions accept regular expressions, and str_replace_all will do what you need:
library(stringr)
str_replace_all(c('ACO2', 'BCKDHB456', 'CD444'), "[:digit:]", "")

A solution using stringi:
library(stringi)
# your data
x <-c('ACO2', 'BCKDHB456', 'CD444')
# extract capital letters
x <- stri_extract_all_regex(x, "[A-Z]+")
# unlist, so that you have a vector
x <- unlist(x)
Solution in one line:
unlist(stri_extract_all_regex(c('ACO2', 'BCKDHB456', 'CD444'), "[A-Z]+"))

Related

Regex of consecutive punctuation in R

I have a character vector that looks like this:
z <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
[9992] "./."
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9996] ",/,"
[9997] "wretched/JJ"
I want to remove all entries that consist of three consecutive punctuation marks, resulting in something like this:
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9997] "wretched/JJ"
I've tried different regex expressions:
sub("[:punct:]/[:punct:]", "", z)
and
sub("[:punct:]{3}", "", z)
with either single/double brackets, both yield:
[9992] "./."
[9993] "To"
[9994] "my$"
[9995] "starved"
[9996] ",/,"
[9997] "wretched"
Any ideas? And I apologize in advance if the question is dumb; I'm not very good at this!
Try this:
x <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
grep("[[:punct:]]{3}", x, value = TRUE, invert = TRUE)
## [1] "To/TO" "my/PRP$" "starved/VBN" "wretched/JJ"
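Two things go wrong in the attempts from the question: sub() edits strings instead of dropping elements, and a POSIX class such as [:punct:] must sit inside a bracket expression, i.e. [[:punct:]]. The sketch below shows the same filter written with logical subsetting, plus an anchored variant for the case where only elements made up entirely of three punctuation marks should go. The element "wait.../VB" is a hypothetical addition used to show the difference.

```r
z <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ",
       "wait.../VB")  # last element is a hypothetical extra case

# Unanchored: drops any element that contains three punctuation marks in a row
unanchored <- z[!grepl("[[:punct:]]{3}", z)]
unanchored
# [1] "To/TO"       "my/PRP$"     "starved/VBN" "wretched/JJ"

# Anchored: drops only elements that consist of exactly three punctuation marks
anchored <- z[!grepl("^[[:punct:]]{3}$", z)]
anchored
# [1] "To/TO"       "my/PRP$"     "starved/VBN" "wretched/JJ" "wait.../VB"
```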

R: Substring after finding a character position?

I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggles with regex.
Right now, I have a data.frame with a column, df$id, whose values look something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 12.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A and so on for an entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\\.", data.count$id),99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit, using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
"23-45-6A" "1B" "A-A"
You could remove the characters including the . using
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
data
x <- c("13.23-45-6A", "0.1B", "4.A-A")
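For completeness, the indexOf-style approach from the question can also be made to work on a whole column, since regexpr() and substring() are both vectorised. This is only a sketch: the data frame and the "noperiod" row are hypothetical sample data, added to show what happens when a value contains no period.

```r
df <- data.frame(id = c("13.23-45-6A", "0.1B", "4.A-A", "noperiod"),
                 stringsAsFactors = FALSE)  # hypothetical sample data

# regexpr() returns one position per row; fixed = TRUE treats "." literally
pos <- regexpr(".", df$id, fixed = TRUE)

# substring() with only a start argument runs to the end of each string;
# rows without a period have pos == -1, so the start is 0 and the string is kept whole
df$new <- substring(df$id, pos + 1)
df$new
# [1] "23-45-6A" "1B"       "A-A"      "noperiod"
```

Using fixed = TRUE sidesteps the question of how many backslashes are needed to escape the period.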

Extracting numbers from a string including decimals and scientific notation

I have some strings that look like
x<-"p = 9.636e-05"
And I would like to extract just the number using gsub. So far I have
gsub("[[:alpha:]](?!-)|=|\\^2", "", x)
But that removes the 'e' from the scientific notation, giving me
" 9.636-05"
Which can't be converted to a number using as.numeric. I know that it would be possible to use a lookahead to match the "-", but I don't know exactly how to go about doing this.
You could try
sub('.* = ', '', x)
#[1] "9.636e-05"
You can use the following to initially remove all non-digit characters at the start of the string:
sub('^\\D+', '', x)
Try
format(as.numeric(gsub("[^0-9e.-]", "", x)), scientific = FALSE)
# [1] "0.00009636"
Through the sub or regmatches functions:
> x<-"p = 9.636e-05"
> sub(".* ", "", x)
[1] "9.636e-05"
> regmatches(x, regexpr("\\S+$", x))
[1] "9.636e-05"
> library(stringi)
> stri_extract(x, regex="\\S+$")
[1] "9.636e-05"
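The answers above rely on the number being the last thing in the string. As an alternative sketch, the number itself can be matched, including an optional sign and exponent, and then converted. The pattern and the second and third inputs are illustrative assumptions, not taken from the question.

```r
x <- c("p = 9.636e-05", "t = -2.5E+3", "n = 42")  # last two are hypothetical inputs

# optional sign, digits with an optional decimal part, optional exponent
num_pattern <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"

num_text <- regmatches(x, regexpr(num_pattern, x, perl = TRUE))
num_text
# [1] "9.636e-05" "-2.5E+3"   "42"

nums <- as.numeric(num_text)
```

regexpr() returns only the first match per string, so a label that itself contains a digit (for example "r^2 = 0.87") would yield the 2 rather than the value. In that case one of the end-anchored patterns above is the safer choice.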

Replace parts of string using package stringi (regex)

I have some string
string <- "abbccc"
I want to replace each run of the same letter with a single letter followed by the number of occurrences of that letter. So I want to have something like this:
"ab2c3"
I use stringi package to do this, but it doesn't work exactly like I want. Let's say I already have vector with parts for replacement:
vector <- c("b2", "c3")
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector)
The output:
[1] "ab2b2" "ac3c3"
The output I want: [1] "ab2c3"
I also tried this way
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all=FALSE)
but I get an error:
Error in stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all = FALSE) :
vector length not consistent with other arguments
Not regex, but a strsplit and rle with some paste magic:
string <- c("abbccc", "bbaccc", "uffff", "aaabccccddd")
sapply(lapply(strsplit(string, ""), rle), function(x) {
paste(x[[2]], ifelse(x[[1]] == 1, "", x[[1]]), sep="", collapse="")
})
## [1] "ab2c3" "b2ac3" "uf4" "a3bc4d3"
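Not a stringi solution and not a regex either, but you can do it by splitting the string and using rle:
string <- "abbccc"
res <- paste(do.call(paste0, rle(strsplit(string, "", fixed = TRUE)[[1]])[2:1]), collapse = "")
# drop counts of exactly 1; a plain gsub("1", "", res) would also mangle counts such as 10 or 11
gsub("(?<![0-9])1(?![0-9])", "", res, perl = TRUE)
#[1] "ab2c3"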
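Since the question asked for a regex-based approach, here is a base R sketch that finds each run with a backreference and replaces it through the regmatches<- replacement function. The helper name rle_string is made up for this example, and the pattern assumes lowercase letters only, as in the question.

```r
# Hypothetical helper: collapse runs of the same lowercase letter to letter + count
rle_string <- function(s) {
  # "([a-z])\\1*" matches one letter followed by any repeats of that same letter
  m <- gregexpr("([a-z])\\1*", s, perl = TRUE)
  regmatches(s, m) <- lapply(regmatches(s, m), function(runs) {
    n <- nchar(runs)
    # keep the letter; append the run length only when it is greater than 1
    paste0(substr(runs, 1, 1), ifelse(n == 1, "", n))
  })
  s
}

rle_string(c("abbccc", "bbaccc", "uffff", "aaabccccddd"))
# [1] "ab2c3"   "b2ac3"   "uf4"     "a3bc4d3"
```

Because the count is built per run rather than stripped afterwards, runs of ten or more letters keep their full count.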

dynamic regex in R

The below code works so long as before and after strings have no characters that are special to a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' #parentheses cause problem in regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is http://perldoc.perl.org/functions/quotemeta.html for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
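Applied to the strings from the question, the \Q...\E approach might look like the sketch below. The question does not show the real x, so the input text here ("Jane Doe", "Q3 planning") is invented for illustration.

```r
before <- 'Name of your Manager (note "self" if you are the Manager)'
after  <- 'CURRENT FOCUS'

# Hypothetical input text; the question does not show the real x
x <- paste(before, "Jane Doe", after, "Q3 planning")

# \Q...\E makes PCRE treat everything in between as literal text
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)

hit <- regmatches(x, regexpr(pattern, x, perl = TRUE))
manager <- trimws(hit)
manager
# [1] "Jane Doe"
```

One caveat: a literal that itself contains the two characters \E would end the quoted section early, so for arbitrary input an escaping function such as quotemeta is the more robust option.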
dnagirl, such a function exists: glob2rx.
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"
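A caveat worth checking before relying on glob2rx: it converts wildcard (glob) patterns, so * and ? in the input are translated into regex wildcards rather than escaped. The sketch below compares it with the quotemeta helper defined earlier and with fixed = TRUE; the string "cost?" is a made-up example.

```r
a <- "cost?"   # hypothetical literal containing a glob wildcard

# glob2rx() translates "?" into "any single character", so this is not a literal match
glob_match <- grepl(glob2rx(a), "costs")

# escaping by hand (same definition as quotemeta above) keeps "?" literal
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
literal_match <- grepl(quotemeta(a), "costs")

# for a plain containment test, fixed = TRUE avoids regex escaping altogether
fixed_match <- grepl(a, "cost? unknown", fixed = TRUE)

c(glob_match, literal_match, fixed_match)
# [1]  TRUE FALSE  TRUE
```

fixed = TRUE cannot be combined with lookarounds, so it only helps when the literal is the whole pattern.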