R - Gsub return first match - regex

I want to extract the 12 and the 0 from the test vector. Every time I try it would either give me 120 or 12:0
TestVector <- c("12:0")
gsub("\\b[:numeric:]*",replacement = "\\1", x = TestVector, fixed = F)
What can I use to extract the 12 and the 0. Can we just have one where I just extract the 12 so I can change it to extract the 0. Can we do this exclusively with gsub?

One option, which doesn't involve using explicit regular expressions, would be to use strsplit() and split the timestamp on the colon:
TestVector <- c("12:0")
parts <- unlist(strsplit(TestVector, ":")))
> parts[1]
[1] "12"
> parts[2]
[1] "0"

Try this
gsub("\\b(\\d+):(\\d+)\\b",replacement = "\\1 \\2", x = TestVector, fixed = F)
Regex Breakdown
\\b #Word boundary
(\\d+) #Find all digits before :
: #Match literally colon
(\\d+) #Find all digits after :
\\b #Word boundary
I think there is no named class as [:numeric:] in R till I know, but it has named class [[:digit:]]. You can use it as
gsub("\\b([[:digit:]]+):([[:digit:]]+)\\b",replacement = "\\1 \\2", x = TestVector)
As suggested by rawr, a much simpler and intuitive way to do it would be to just simply replace : with space
gsub(":",replacement = " ", x = TestVector, fixed = F)

This can be done using scan from base R
scan(text=TestVector, sep=":", what=numeric(), quiet=TRUE)
#[1] 12 0
or with str_extract
library(stringr)
str_extract_all(TestVector, "[^:]+")[[1]]

Related

How to match regular expression exactly in R and pull out pattern

I want to get pattern from my vector of strings
string <- c(
"P10000101 - Przychody netto ze sprzedazy produktów" ,
"P10000102_PL - Przychody nettozy uslug",
"P1000010201_PL - Handlowych, marketingowych, szkoleniowych",
"P100001020101 - - Handlowych,, szkoleniowych - refaktury",
"- Handlowych, marketingowych,P100001020102, - pozostale"
)
As result I want to get exact match of regular expression
result <- c(
"P10000101",
"P10000102_PL",
"P1000010201_PL",
"P100001020101",
"P100001020102"
)
I tried with this pattern = "([PLA]\\d+)" and different combinations of value = T, fixed = T, perl = T.
grep(x = string, pattern = "([PLA]\\d+(_PL)?)", fixed = T)
We can try with str_extract
library(stringr)
str_extract(string, "P\\d+(_[A-Z]+)*")
#[1] "P10000101" "P10000102_PL" "P1000010201_PL" "P100001020101" "P100001020102"
grep is for finding whether the match pattern is present in a particular string or not. For extraction, either use sub or gregexpr/regmatches or str_extract
Using the base R (regexpr/regmatches)
regmatches(string, regexpr("P\\d+(_[A-Z]+)*", string))
#[1] "P10000101" "P10000102_PL" "P1000010201_PL" "P100001020101" "P100001020102"
Basically, the pattern to match is P followed by one more numbers (\\d+) followed by greedy (*) match of _ and one or more upper case letters.

Regexp in R to match everything in between first and last occurene of some specified character

I'd like to match everything between the first and last underscore. I use R.
What I have until now is this:
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
sub('[^_]*_(.*)_[^_]*', x = p.subject, replacement = '\\1', perl = T)
Where 'bla' is any character except an underscore...
The result I'd like would be something like this:
c(NA, NA, bla, bla_bla)
I can't figure it out! Why does the first pattern match? It shouldn't because the pattern must have 2 underscores! Or do I have to use some kind of lookahead expression?
Your help is very welcome!
You can use gsub:
vec <- gsub("(^[^_]+)_?|_?([^_]+$)", "", p.subject)
vec <- ifelse(nchar(vec) == 0 , NA, vec)
vec
[1] NA NA "bla" "bla_bla"
Data:
dput(p.subject)
c("bla_bla", "bla", "bla_bla_bla", "bla_bla_bla_bla")
Here is another option using str_extract. We use regex lookarounds to extract the pattern between the first and the last occurrence of a specified character i.e. _.
library(stringr)
str_extract(p.subject, "(?<=[^_]{1,30}_).*(?=_[^_]+)")
#[1] NA NA "bla" "bla_bla"
NOTE: We didn't use any ifelse.
data
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. You can use it with perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from ther first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* wil match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and substitute it with empty characters string (effectively removing that part). Don't forget that gsub returns modified string, not modifies it in place. You won't see any change until you reassign return value back to some variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ([.](.*)$ | [-](.*)$'), if not for unnecessary spaces, would remove everything from first dot (.) or dash (-) to end of text. This might be what you want, but not what you said you want.

Regex/ Substring

I have a sequence like this in a list "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
I would like to create a substring like wherever a "K" is present it needs to pull out 6 characters before and 6 characters after "K"
Ex : MSGSRRKATPASR , here -6..K..+6
for the whole sequence..I tried the substring function in R but we need to specify the start and end position. Here the positions are unknown
Thanks
.{6}K.{6}
Try this.This will give the desired result.
See demo.
http://regex101.com/r/dM0rS8/4
use this:
\w{7}(?<=K)\w{6}
this uses positive lookbehind to ensure that there are characters present before K.
demo here: http://regex101.com/r/pK3jK1/2
Using rex may make this type of task a little simpler.
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
library(rex)
re_matches(x,
rex(
capture(name = "amino_acids",
n(any, 6),
"K",
n(any, 6)
)
),
global = TRUE)[[1]]
#> amino_acids
#>1 MSGSRRKATPASR
#>2 GEGSFAKVKYAKN
#>3 GDQAAIKILDREK
#>4 KMVEQLKREISTM
#>5 IEVMASKTKIYIV
#>6 GGELFDKIAQQGR
#>7 VYHRDLKPENLIL
#>8 DANGVLKVSDFGL
#>9 PEVLSDKGYDGAA
#>10 NLMTLYKRICKAE
#>11 WFSQGAKRVIKRI
#>12 LEDEWFKKGYKPP
#>13 AAFSNSKECLVTE
#>14 LENLFEKQAQLVK
#>15 ASEIMSKMEETAK
#>16 LGFNVRKDNYKIK
#>17 GDKSGRKGQLSVA
#>18 HVVELRKTGGDTL
#>19 VCDSFYKNFSSGL
However the above is greedy, each K will only appear in one result.
If you want to output an AA for each K
library(rex)
locs <- re_matches(x,
rex(
"K" %if_prev_is% n(any, 6) %if_next_is% n(any, 6)
),
global = TRUE, locations = TRUE)[[1]]
substring(x, locs$start - 6, locs$end + 6)
#> [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN" "GSFAKVKYAKNTV" "AKVKYAKNTVTGD"
#> [5] "GDQAAIKILDREK" "KILDREKVFRHKM" "EKVFRHKMVEQLK" "KMVEQLKREISTM"
#> [9] "REISTMKLIKHPN" "STMKLIKHPNVVE" "IEVMASKTKIYIV" "VMASKTKIYIVLE"
#>[13] "GGELFDKIAQQGR" "AQQGRLKEDEARR" "VYHRDLKPENLIL" "DANGVLKVSDFGL"
#>[17] "PEVLSDKGYDGAA" "NLMTLYKRICKAE" "LYKRICKAEFSCP" "WFSQGAKRVIKRI"
#>[21] "GAKRVIKRILEPN" "LEDEWFKKGYKPP" "EDEWFKKGYKPPS" "WFKKGYKPPSFDQ"
#>[25] "AAFSNSKECLVTE" "ECLVTEKKEKPVS" "CLVTEKKEKPVSM" "VTEKKEKPVSMNA"
#>[29] "LENLFEKQAQLVK" "KQAQLVKKETRFT" "QAQLVKKETRFTS" "ASEIMSKMEETAK"
#>[33] "KMEETAKPLGFNV" "LGFNVRKDNYKIK" "VRKDNYKIKMKGD" "KDNYKIKMKGDKS"
#>[37] "NYKIKMKGDKSGR" "IKMKGDKSGRKGQ" "GDKSGRKGQLSVA" "HVVELRKTGGDTL"
#>[41] "DTLEFHKVCDSFY" "VCDSFYKNFSSGL" "NFSSGLKDVVWNT"

strsplit inconsistent with gregexpr

A comment on my answer to this question which should give the desired result using strsplit does not, even though it seems to correctly match the first and last commas in a character vector. This can be proved using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
The theory of #Aprillion is exact, from R documentation:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ will match the begining of a new string (without the precedent items.)
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to #JoshO'Brien for the link.)