Extracting numbers from a string including decimels and scientific notation - regex

I have some strings that look like
x<-"p = 9.636e-05"
And I would like to extract just the number using gsub. So far I have
gsub("[[:alpha:]](?!-)|=|\\^2", "", x)
But that removes the 'e' from the scientific notation, giving me
" 9.636-05"
Which can't be converted to a number using as.numeric. I know that it would be possible to use a lookahead to match the "-", but I don't know exactly how to go about doing this.

You could try
sub('.* = ', '', x)
#[1] "9.636e-05"

You can use the following to initially remove all non-digit characters at the start of the string:
sub('^\\D+', '', x)

Try
format(as.numeric(gsub("[^0-9e.-]", "", x)), scientific = FALSE)
# [1] "0.00009636"

Through sub or regmatches function.
> x<-"p = 9.636e-05"
> sub(".* ", "", x)
[1] "9.636e-05"
> regmatches(x, regexpr("\\S+$", x))
[1] "9.636e-05"
> library(stringi)
> stri_extract(x, regex="\\S+$")
[1] "9.636e-05"

Related

R regular expression: isolate a string between quotes

I have a string myFunction(arg1=\"hop\",arg2=TRUE). I want to isolate what is in between quotes (\"hop\" in this example)
I have tried so far with no success:
gsub(pattern="(myFunction)(\\({1}))(.*)(\\\"{1}.*\\\"{1})(.*)(\\){1})",replacement="//4",x="myFunction(arg1=\"hop\",arg2=TRUE)")
Any help by a regex guru would be welcome!
Try
sub('[^\"]+\"([^\"]+).*', '\\1', x)
#[1] "hop"
Or
sub('[^\"]+(\"[^\"]+.).*', '\\1', x)
#[1] "\"hop\""
The \" is not needed as " would work too
sub('[^"]*("[^"]*.).*', '\\1', x)
#[1] "\"hop\""
If there are multiple matches, as #AvinashRaj mentioned in his post, sub may not be that useful. An option using stringi would be
library(stringi)
stri_extract_all_regex(x1, '"[^"]*"')[[1]]
#[1] "\"hop\"" "\"hop2\""
data
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
x1 <- "myFunction(arg1=\"hop\",arg2=TRUE arg3=\"hop2\", arg4=TRUE)"
You could use regmatches function also. Sub or gsub only works for a particular input , for general case you must do grabing instead of removing.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> regmatches(x, gregexpr('"[^"]*"', x))[[1]]
[1] "\"hop\""
To get only the text inside quotes then pass the result of above function to a gsub function which helps to remove the quotes.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop"
> x <- "myFunction(arg1=\"hop\",arg2=\"TRUE\")"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop" "TRUE"
You can try:
str='myFunction(arg1=\"hop\",arg2=TRUE)'
gsub('.*(\\".*\\").*','\\1',str)
#[1] "\"hop\""
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
unlist(strsplit(x,'"'))[2]
# [1] "hop"

Replace parts of string using package stringi (regex)

I have some string
string <- "abbccc"
I want to replace the chains of the same letter to just one letter and number of occurance of this letter. So I want to have something like this:
"ab2c3"
I use stringi package to do this, but it doesn't work exactly like I want. Let's say I already have vector with parts for replacement:
vector <- c("b2", "c3")
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector)
The output:
[1] "ab2b2" "ac3c3"
The output I want: [1] "ab2c3"
I also tried this way
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all=FALSE)
but i get error
Error in stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all = FALSE) :
vector length not consistent with other arguments
Not regex but astrsplit and rle with some paste magic:
string <- c("abbccc", "bbaccc", "uffff", "aaabccccddd")
sapply(lapply(strsplit(string, ""), rle), function(x) {
paste(x[[2]], ifelse(x[[1]] == 1, "", x[[1]]), sep="", collapse="")
})
## [1] "ab2c3" "b2ac3" "uf4" "a3bc4d3"
Not a stringi solution and not a regex either, but you can do it by splitting the string and using rle:
string <- "abbccc"
res<-paste(collapse="",do.call(paste0,rle(strsplit(string,"",fixed=TRUE)[[1]])[2:1]))
gsub("1","",res)
#[1] "ab2c3"

Finding last character in R using regexpr function

I am having problem with finding the last character in a string. I am trying to use the regexpr function to check if the last character is equal to / forward slash.
But unfortunately it does work. Can anyone help me? Below is my code.
regexpr( pattern = ".$", text = /home/rexamine/archivist2/ex/// ) != "/"
You can avoid using regular expression and use substr to do this.
> x <- '/home/rexamine/archivist2/ex///'
> substr(x, nchar(x)-1+1, nchar(x)) == '/'
[1] TRUE
Or use str_sub from the stringr package:
> str_sub(x, -1) == '/'
[1] TRUE
You could use a simple grepl function,
> text = "/home/rexamine/archivist2/ex///"
> grepl("/$", text, perl=TRUE)
[1] TRUE
> text = "/home/rexamine/archivist2/ex"
> grepl("/$", text, perl=TRUE)
[1] FALSE
^.*\/$
You can use this.This will fail if last character is not /.

strsplit inconsistent with gregexpr

A comment on my answer to this question which should give the desired result using strsplit does not, even though it seems to correctly match the first and last commas in a character vector. This can be proved using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
The theory of #Aprillion is exact, from R documentation:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ will match the begining of a new string (without the precedent items.)
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to #JoshO'Brien for the link.)

Remove numbers from alphanumeric characters

I have a list of alphanumeric characters that looks like:
x <-c('ACO2', 'BCKDHB456', 'CD444')
I would like the following output:
x <-c('ACO', 'BCKDHB', 'CD')
Any suggestions?
# dput(tmp2)
structure(c(432L, 326L, 217L, 371L, 179L, 182L, 188L, 268L, 255L,...,
), class = "factor")
You can use gsub for this:
gsub('[[:digit:]]+', '', x)
or
gsub('[0-9]+', '', x)
# [1] "ACO" "BCKDHB" "CD"
If your goal is just to remove numbers, then the removeNumbers() function removes numbers from a text. Using it reduces the risk of mistakes.
library(tm)
x <-c('ACO2', 'BCKDHB456', 'CD444')
x <- removeNumbers(x)
x
[1] "ACO" "BCKDHB" "CD"
Using stringr
Most stringr functions handle regex
str_replace_all will do what you need
str_replace_all(c('ACO2', 'BCKDHB456', 'CD444'), "[:digit:]", "")
A solution using stringi:
# your data
x <-c('ACO2', 'BCKDHB456', 'CD444')
# extract capital letters
x <- stri_extract_all_regex(x, "[A-Z]+")
# unlist, so that you have a vector
x <- unlist(x)
Solution in one line: