Extracting values from a string in R using regex

I'm trying to extract the first and second numbers of this string and store them in separate variables.
(User20,10.25)
I can't figure out how to get the user number and then the value that follows it.
What I have managed to do so far is this, but I don't know how to remove the rest of the string and get only the number.
gsub("\\(User", "", string)

Try
str1 <- '(User20,10.25)'
scan(text=gsub('[^0-9.-]+', ' ', str1),quiet=TRUE)
#[1] 20.00 10.25
In case the string is
str2 <- '(User20-ht,-10.25)'
scan(text=gsub('-(?=[^0-9])|[^0-9.-]+', " ", str2, perl=TRUE), quiet=TRUE)
#[1] 20.00 -10.25
Or
library(stringr)
str_extract_all(str1, '[0-9.-]+')[[1]]
#[1] "20" "10.25"
Or using stringi
library(stringi)
stri_extract_all_regex(str1, '[0-9.-]+')[[1]]
#[1] "20" "10.25"
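Since the question asks for the two numbers in separate variables, here is a minimal base R sketch of that last step. The pattern and the variable names (num_pattern, user_number, user_value) are my own choices for illustration, not part of the answer above, and the pattern assumes plain decimal numbers with an optional minus sign.

```r
# Minimal sketch: extract the numbers and keep them as separate numeric variables.
str1 <- '(User20,10.25)'

# An optional minus sign, one or more digits, and an optional decimal part.
num_pattern <- '-?[0-9]+(\\.[0-9]+)?'
nums <- as.numeric(regmatches(str1, gregexpr(num_pattern, str1))[[1]])

user_number <- nums[1]  # 20
user_value  <- nums[2]  # 10.25
```

Because the minus sign is only accepted directly in front of a digit, the same pattern should also cope with a string like '(User20-ht,-10.25)' without a separate clean-up step.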

Tyler Rinker's "qdapRegex" package has some functions that are useful for this kind of stuff.
In this case, you would most likely be interested in rm_number:
library(qdapRegex)
x <- '(User20,10.25)'
rm_number(x, extract = TRUE)
# [[1]]
# [1] "20" "10.25"

You can use strsplit with sub ...
> sub('\\(User|\\)', '', strsplit(x, ',')[[1]])
[1] "20" "10.25"
It would probably be easier to match the context that you want instead.
> regmatches(x, gregexpr('[0-9.]+', x))[[1]]
[1] "20" "10.25"
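If you would rather describe the whole string and pick the pieces out by position, regexec returns the capture groups along with the full match. This is only a sketch that assumes every input looks like (User<digits>,<number>); the pattern and variable names are mine.

```r
# Sketch: match the full structure and read the two capture groups.
x <- '(User20,10.25)'
pattern <- '\\(User([0-9]+),(-?[0-9.]+)\\)'

# Element 1 is the whole match; elements 2 and 3 are the capture groups.
m <- regmatches(x, regexec(pattern, x))[[1]]

user_number <- as.integer(m[2])  # 20
user_value  <- as.numeric(m[3])  # 10.25
```

The advantage over a bare '[0-9.]+' is that a string with a different layout simply fails to match instead of silently returning stray digits.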

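The following is one approach: a negated character class that skips commas, parentheses, and letters, so that only the numbers match:
[^,()A-Za-z]+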

Related

Extract characters within brackets "[" and "]" including brackets

I have a character string like this:
GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT
I realize that I can use substring for this particular case. However, the position of the [X/Y] differs among strings and the content between the brackets varies in length.
I would like to extract the [X/Y].
stringr is useful for these types of operations,
library(stringr)
str_extract(x, '\\[.*\\]')
#[1] "[A/C]"
or str_extract_all if you have more than one match in your strings
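One thing to watch with more than one bracket pair per string: '.*' is greedy, so it runs from the first "[" to the last "]". A base R sketch of the difference, using a made-up sequence with two pairs (the string and variable names are assumptions for illustration):

```r
# Hypothetical input with two bracketed sites.
x <- "AAGT[A/C]GGTC[G/T]TTA"

# Greedy: spans from the first "[" to the last "]".
greedy <- regmatches(x, regexpr("\\[.*\\]", x))

# Negated class: stops at the first "]" after each "[".
each_pair <- regmatches(x, gregexpr("\\[[^]]*\\]", x))[[1]]

greedy     # "[A/C]GGTC[G/T]"
each_pair  # "[A/C]" "[G/T]"
```

The same '\\[[^]]*\\]' pattern should work unchanged in str_extract_all.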
We can use bracketXtract from qdap
library(qdap)
unname(bracketXtract(dat, "square", with = TRUE))
#[1] "[A/C]"
Or using base R
gsub
gsub("^[^[]+|[^]]+$", '', dat)
#[1] "[A/C]"
strsplit
strsplit(dat, "[^[]+(?=\\[)|(?<=])[^]]+", perl=TRUE)[[1]][2]
#[1] "[A/C]"
data
dat <- "GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT"
Provided that there's only 1 pair of "[]" per string, use gregexpr:
dat<-c("GATATATGGCACAGCAGTTGGATCCTTGAATGTCC[A/C]AGGTATATGTTATAGAAGCCTCGCAATTGTGTGTT")
substring(dat, gregexpr("\\[", dat), gregexpr("\\]", dat))
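A variation on the same idea, as a sketch: with exactly one pair per string, regexpr (first match only) returns a plain vector of positions, which makes the call easy to apply to several strings at once. The two short sequences below are made up for illustration.

```r
# Hypothetical inputs, one bracket pair each.
dat <- c("GATC[A/C]AGGT", "TT[G/T]CCAGA")

# fixed = TRUE treats the bracket as a literal character, so no escaping is needed.
open_pos  <- regexpr("[", dat, fixed = TRUE)
close_pos <- regexpr("]", dat, fixed = TRUE)

extracted <- substring(dat, open_pos, close_pos)
extracted  # "[A/C]" "[G/T]"
```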

Trying to use a regular expression in R to capture some data

So I have a table in R, and an example of the string I am trying to capture is this:
C.Hale (79-83)
I want to write a regular expression to extract the (79-83).
How do I go about doing this?
We can use sub. We match one or more characters that are not a space ([^ ]+) from the beginning of the string (^), followed by a whitespace character (\\s), and replace the match with ''.
sub('^[^ ]+\\s', '', str1)
#[1] "(79-83)"
Or another option is stri_extract_all_regex from stringi
library(stringi)
stri_extract_all_regex(str1, '\\([^)]+\\)')[[1]]
#[1] "(79-83)"
data
str1 <- 'C.Hale (79-83)'
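For comparison, a base R sketch that targets the parenthesised part directly. The second name is invented to show a case where the name itself contains a space, which the sub approach above would not handle (it would leave "Smith (101-64)").

```r
# Second element is a hypothetical name containing a space.
players <- c('C.Hale (79-83)', 'J. Smith (101-64)')

# "(" followed by anything that is not ")", then ")".
in_parens <- regmatches(players, regexpr('\\([^)]*\\)', players))
in_parens  # "(79-83)" "(101-64)"

# The space-based sub() only strips up to the first space.
sub('^[^ ]+\\s', '', players[2])  # "Smith (101-64)"
```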
One possibility using the qdapRegex package I maintain:
x <- "C.Hale (79-83)"
library(qdapRegex)
rm_round(x, extract = TRUE, include.markers = TRUE)
## [[1]]
## [1] "(79-83)"

Can I use an OR statement to indicate the pattern in stringr's str_extract_all function?

I'm looking at a number of cells in a data frame and am trying to extract any one of several sequences of characters; there's only one of these sequences per cell.
Here's what I mean:
dF$newColumn = str_extract_all(string = "dF$column1", pattern ="sequence_1|sequence_2")
Am I screwing the syntax up here? Can I pull this sort of thing with stringr? Please rectify my ignorance!
Yes, you can use | since it denotes alternation (logical OR) in regex. Here's an example:
vec <- c("abc text", "text abc", "def text", "text def text")
library(stringr)
str_extract_all(string = vec, pattern = "abc|def")
The result:
[[1]]
[1] "abc"
[[2]]
[1] "abc"
[[3]]
[1] "def"
[[4]]
[1] "def"
However, in your command, you should replace "dF$column1" with dF$column1 (without quotes).
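Since there is at most one sequence per cell, a plain character column may be more convenient than the list that str_extract_all returns. Here is a base R sketch under that assumption; the data frame contents are invented, and stringr's str_extract (without _all) should give the same shape.

```r
# Hypothetical data frame; the last row has no match.
dF <- data.frame(column1 = c("abc text", "text abc", "def text", "no match here"),
                 stringsAsFactors = FALSE)

# regexpr returns the position of the first match, or -1 if there is none.
m <- regexpr("abc|def", dF$column1)

# regmatches returns only the matched elements, so fill those rows and leave NA elsewhere.
dF$newColumn <- NA_character_
dF$newColumn[m > 0] <- regmatches(dF$column1, m)

dF$newColumn  # "abc" "abc" "def" NA
```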

remove leading zeroes from timestamp %j%Y %H:%M

My timestamp is in the form
0992006 09:00
I need to remove the leading zeros to get this form:
992006 9:00
Here's the code I'm using now, which doesn't remove leading zeros:
prediction$TIMESTAMP <- as.character(format(prediction$TIMESTAMP, '%j%Y %H:%M'))
The simplest way is to create your own boundary that asserts that either the start of the string or a space precedes the zeros.
gsub('(^| )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
You could do the same without a capture group in the replacement by using \K, which resets the starting point of the reported match so that any previously consumed characters are no longer included.
gsub('(^| )\\K0+', '', '0992006 09:00', perl=TRUE)
# [1] "992006 9:00"
Or you could use sub and match until the second set of leading zeros.
sub('^0+([0-9]+ )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
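To connect this back to the code in the question, here is a sketch that applies the first pattern to a whole formatted column. The prediction data frame and its two dates are made up; I am assuming TIMESTAMP is a POSIXct column, as the format() call in the question suggests.

```r
# Hypothetical data: 9 April 2006 is day 99 of the year, 25 December is day 359.
prediction <- data.frame(
  TIMESTAMP = as.POSIXct(c("2006-04-09 09:00", "2006-12-25 14:30"), tz = "UTC")
)

# Same format string as in the question.
ts <- format(prediction$TIMESTAMP, '%j%Y %H:%M')
ts  # "0992006 09:00" "3592006 14:30"

# Strip zeros that follow the start of the string or a space; gsub is vectorised.
cleaned <- gsub('(^| )0+', '\\1', ts)
cleaned  # "992006 9:00" "3592006 14:30"
```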
And to cover all possibilities: if you might ever have a timestamp like 0992006 00:00, simply remove the + quantifier after the zero in the regular expression so that it removes only the first leading zero.
str1 <- "0992006 09:00"
gsub("(?<=^| )0+", "", str1, perl=TRUE)
#[1] "992006 9:00"
For situations like below, it could be:
str2 <- "0992006 00:00"
gsub("(?<=^| )0", "", str2, perl=TRUE)
#[1] "992006 0:00"
Explanation
Here the idea is to use a lookbehind, (?<=^| )0+, to match a run of 0s only if it occurs either at the beginning of the string (the ^ alternative inside the lookbehind) or (|) immediately after a space (the second alternative), and to replace those matched 0s with "", the second argument of gsub.
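To see exactly which zeros the lookbehind picks up, you can inspect the match positions before replacing anything. This is just a diagnostic sketch; the variable names are mine.

```r
str1 <- "0992006 09:00"

# gregexpr reports where each match starts and how long it is.
m <- gregexpr("(?<=^| )0+", str1, perl = TRUE)

starts  <- as.integer(m[[1]])           # 1 9
lengths <- attr(m[[1]], "match.length") # 1 1
matched <- regmatches(str1, m)[[1]]     # "0" "0"
```

Only the zero at position 1 (start of string) and the zero at position 9 (right after the space) are matched; the zeros inside 2006 and after the colon are left alone.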
In the second string, the hour and minutes are all 0's. So, using the first code would result in:
gsub("(?<=^| )0+", "", str2, perl=TRUE)
#[1] "992006 :00"
Here, it is unclear what the OP would accept as a result. Rather than removing every 0 before the :, I thought it would be better to leave one 0 in place, so I replaced the 0+ in the pattern with a single 0 and replaced that with "".
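If you want one pattern that handles both cases, a lookahead is another option (a different technique from the lookbehind above): remove leading zeros only when a digit still follows them. This is a sketch, and strip_leading_zeros is a helper name I made up.

```r
# Hypothetical helper: drop zeros after the start or a space, but always keep one digit.
strip_leading_zeros <- function(x) {
  # (?=[0-9]) requires a digit after the removed zeros, so "00" becomes "0", not "".
  gsub("(^| )0+(?=[0-9])", "\\1", x, perl = TRUE)
}

result <- strip_leading_zeros(c("0992006 09:00", "0992006 00:00"))
result  # "992006 9:00" "992006 0:00"
```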
Here's another option using a lookbehind
gsub("(^0)|(?<=\\s)0", "", "0992006 09:00", perl = TRUE)
## [1] "992006 9:00"
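With sub (note that this removes only the zeros at the start of the string; the hour keeps its leading zero):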
sub("^[0]+", "", prediction$TIMESTAMP)
[1] "992006 09:00"
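You can also use stringr without a regular expression by taking substrings. Note that this drops the first character of each of the two space-separated parts, so it assumes each part starts with exactly one leading zero.
> library(stringr)
> x <- "0992006 09:00"
> str_c(str_sub(word(x, 1:2), 2), collapse = " ")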
# [1] "992006 9:00"
Some more Perl regexes,
> gsub("(?<!:)\\b0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"
> gsub("(?<![\\d:])0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"

Regex matches processing in R

I would like to extract the 2 matching groups using R.
Right now I've got this, but it is not working as intended:
Code:
str = '123abc'
vector <- gregexpr('(?<first>\\d+)(?<second>\\w+)', str, perl=TRUE)
regmatches(str, vector)
Result:
[[1]]
[1] "123abc"
I want the result to be something like this:
[1] "123"
[2] "abc"
I'm not sure if you have a specific reason for using regmatches, unless you are e.g. importing the expressions in that format. If well-defined groups are common to all your entries, you can match them in this way:
x <- "123abc"
sub("([[:digit:]]+)[[:alpha:]]+","\\1",x)
sub("[[:digit:]]+([[:alpha:]]+)","\\1",x)
Result
[1] "123"
[1] "abc"
I.e., match the entire structure of the string, then replace it with the part you want to retain by enclosing it in round brackets and referring to it with a backreference ("\\1").
I've renamed your string to s to avoid clobbering the base function str. Here is one approach:
library(stringr)
s <- '123abc'
reg <- '([[:digit:]]+)([[:alpha:]]+)'
complete <- unlist(str_extract_all(s, reg))
partials <- unlist(str_match_all(s, reg))
partials <- partials[!(partials %in% complete)]
partials
[1] "123" "abc"
Depending on how well structured your inputs are, you may want to use strsplit to split the string (see ?strsplit for the documentation).
Try this:
> library(gsubfn)
> strapplyc("123abc", '(\\d+)(\\w+)')[[1]]
[1] "123" "abc"
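The named groups from the question can also be recovered in base R without extra packages. With perl = TRUE, regexpr attaches the start and length of each capture group as attributes, and substring can cut them out. A sketch, with variable names of my own choosing:

```r
s <- '123abc'
m <- regexpr('(?<first>\\d+)(?<second>\\w+)', s, perl = TRUE)

# With perl = TRUE the match carries per-group start positions and lengths.
starts <- attr(m, "capture.start")
lens   <- attr(m, "capture.length")

groups <- substring(s, starts, starts + lens - 1)
names(groups) <- attr(m, "capture.names")

groups
#  first second
#  "123"  "abc"
```

This handles a single string; for a vector of strings the attributes are matrices with one row per string, so the extraction would need to be done row by row.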