Find all variants of word in R - regex

I have the following words.
words <- c("hail(0.75)", "hail0.75", "hail0.88", "hail075", "hail1.00", "hail1.75", "hail100", "hail125", "hail1.75)", "hail150", "hail175", "hail200", "hail225", "hail275", "hail450", "hail088", "hail75", "hail80", "hail88")
[1] "hail(0.75)" "hail0.75" "hail0.88" "hail075" "hail1.00" "hail1.75"
[7] "hail100" "hail125" "hail1.75)" "hail150" "hail175" "hail200"
[13] "hail225" "hail275" "hail450" "hail088" "hail75" "hail80"
[19] "hail88"
As you can see, hail(0.75) is repeated with various typos/formatting (e.g. hail075, hail0.75).
How can I find all occurrences of hail(0.75), including the variants described above?
I've tried
grep("hail[0,7,5]", words, value = T)
[1] "hail0.75" "hail0.88" "hail075" "hail088" "hail75"
to find instances of hail which contain the numbers 075.
However, it includes hail088 which is unwanted and excludes hail(0.75) which is wanted.

Another option is to remove all non-digit characters and use the result as an index:
idx <- gsub("[^[:digit:]]","",words)
words[idx=="075"]
[1] "hail(0.75)" "hail0.75" "hail075"
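The digit-stripping idea above can be wrapped in a small helper. This is only a sketch: find_variants is a hypothetical name, not a function from any package, and it assumes that the variants differ only in punctuation. Passing several digit strings also picks up "hail75", which has no leading zero.

```r
words <- c("hail(0.75)", "hail0.75", "hail0.88", "hail075", "hail1.00",
           "hail1.75", "hail100", "hail125", "hail1.75)", "hail150",
           "hail175", "hail200", "hail225", "hail275", "hail450",
           "hail088", "hail75", "hail80", "hail88")

# find_variants() is a hypothetical helper, introduced here for illustration
find_variants <- function(x, digits) {
  stripped <- gsub("[^[:digit:]]", "", x)  # keep only the digits of each word
  x[stripped %in% digits]                  # compare against the wanted digit strings
}

hits_strict <- find_variants(words, "075")
hits_loose  <- find_variants(words, c("075", "75"))
```

Whether "hail75" really is a variant of hail(0.75) depends on the data, so the second call is an assumption to check against the source.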

Is this what you are looking for?
> x <- c("hail(0.75)", "hail0.75", "hail0.88", "hail075", "hail1.00", "hail1.75", "hail100", "hail125", "hail1.75)", "hail150", "hail175", "hail200", "hail225", "hail275", "hail450", "hail088", "hail75", "hail80", "hail88")
> x
[1] "hail(0.75)" "hail0.75" "hail0.88" "hail075" "hail1.00"
[6] "hail1.75" "hail100" "hail125" "hail1.75)" "hail150"
[11] "hail175" "hail200" "hail225" "hail275" "hail450"
[16] "hail088" "hail75" "hail80" "hail88"
And you grep:
> x[grep("^hail[[:punct:]]*0[[:punct:]]*75.*", x)]
[1] "hail(0.75)" "hail0.75" "hail075"
This works presuming that the 7 and the 5 are always next to each other.
Quick explanation: ^ anchors the match at the beginning of the string, [[:punct:]] is any punctuation character, and * means the previous element (in this case the [[:punct:]]) is repeated 0 or more times.
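Because the pattern ends in .*, anything may follow the 75. If that is too permissive, a stricter variant can be sketched by anchoring the end with $. The input "hail0750" below is made up for illustration and is not in the original data.

```r
# "hail0750" is a made-up input used only to show the difference
words <- c("hail(0.75)", "hail0.75", "hail075", "hail088", "hail75", "hail0750")

loose  <- "^hail[[:punct:]]*0[[:punct:]]*75.*"             # pattern from the answer
strict <- "^hail[[:punct:]]*0[[:punct:]]*75[[:punct:]]*$"  # only punctuation may follow 75

loose_hits  <- grep(loose,  words, value = TRUE)
strict_hits <- grep(strict, words, value = TRUE)
```

Here loose_hits keeps "hail0750" while strict_hits drops it; both still miss "hail75", since the pattern requires the 0.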


How to modify string in R taking into account the number of symbols you want to modify [duplicate]

This question already has answers here:
How to add leading zeros?
(8 answers)
Closed 6 years ago.
This question is very easy to understand, but I can't wrap my head around how to get a solution. Let's say I have a vector and I want to modify it so it would have 5 integers at the end, and missing digits are replaced with zeros:
Smth1 Smth00001
Smth22 Smth00022
Smth333 Smth00333
Smth4444 Smth04444
Smth55555 Smth55555
I guess it can be done with regex and functions like gsub, but I don't understand how to take into account the length of the replaced string.
Here's an idea using stringi:
v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
library(stringi)
d <- stri_extract(v, regex = "[:digit:]+")
a <- stri_extract(v, regex = "[:alpha:]+")
paste0(a, stri_pad_left(d, 5, "0"))
Which gives:
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Using base R. Someone else can prettify the regex:
sprintf("%s%05d", gsub("^([^0-9]+)..*$", "\\1", v),
        as.numeric(gsub("^..*[^0-9]([0-9]+)$", "\\1", v)))
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Here is a simple one-line solution similar to Zelazny's, but using a replacement callback with gsubfn from the gsubfn library:
> library(gsubfn)
> v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")
> gsubfn('[0-9]+$', ~ sprintf("%05d",as.numeric(x)), v)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
The regex [0-9]+$ matches 1 or more digits only at the end of the string, because of the $ anchor. The matched digits are passed to the callback (~), and sprintf("%05d",as.numeric(x)) pads the number (parsed with as.numeric) with leading zeros.
To only modify strings that have 1+ non-digit symbols at the start and then 1 or more digits up to the end, just use this PCRE-based gsubfn:
> gsubfn('^[^0-9]+\\K([0-9]+)$', ~ sprintf("%05d",as.numeric(x)), v, perl=TRUE)
[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
where
^ - start of string
[^0-9]+\\K - matches 1+ non-digit symbols and \K will omit them
([0-9]+) - Group 1 passed to the callback
$ - end of string.
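For comparison, the same match-then-pad idea can be sketched in base R without a callback, using regexpr together with the replacement form of regmatches. It assumes every element ends in digits that fit into an integer; elements without trailing digits are simply left unchanged.

```r
v <- c("Smth1", "Smth22", "Smth333", "Smth4444", "Smth55555")

# Locate the trailing run of digits in each element
m <- regexpr("[0-9]+$", v)

# Replace each matched run with its zero-padded, 5-digit form
regmatches(v, m) <- sprintf("%05d", as.integer(regmatches(v, m)))
v
```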
Here a solution using the library stringr:
library(stringr)
library(dplyr)
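num <- str_extract(v, "[0-9]+")  # [0-9] rather than [1-9], so digits containing 0 are kept whole
padding <- 9 - nchar(num)        # 9 = 4 prefix characters + 5 digits; assumes a 4-character prefix
output <- paste0(str_extract(v, "[^0-9]+") %>%
  str_pad(width = padding, side = "right", pad = "0"), num)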
The output is:
"Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"
Or, more compactly with stringr:
library(stringr)
paste0(str_extract(v,'\\D+'),str_pad(str_extract(v,'\\d+'),5,'left', '0'))
#[1] "Smth00001" "Smth00022" "Smth00333" "Smth04444" "Smth55555"

Matching the first digits in R

Anyone know a good way to match and categorize the first n digits of a number in R?
For example,
123451
123452
123461
123462
In this case, if we match on the first n=1-4 digits, all numbers fall into the same group. If we match on n=5 digits, we get 2 groups.
I thought about doing this by making the numeric vector a character vector, splitting it so that each number is an element that can then be truncated to n digits, and matching based on those digits; however, I have a lot of numbers, and it seems there must be a better way to sort only the first n digits of a number in R. Any thoughts?
Thanks!
Here's a vectorised solution that does not involve conversion to character:
nums <- c(123451,
123452,
123461,
123462)
firstDigits <- function(x, n) {
  ndigits <- floor(log10(x)) + 1
  floor(x / 10^(ndigits - n))
}
factor(firstDigits(nums, 4))
## [1] 1234 1234 1234 1234
## Levels: 1234
factor(firstDigits(nums, 5))
## [1] 12345 12345 12346 12346
## Levels: 12345 12346
factor(firstDigits(nums, 6))
## [1] 123451 123452 123461 123462
## Levels: 123451 123452 123461 123462
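To actually collect the numbers into groups rather than just label them, the result of firstDigits can be handed to split. This is a sketch that repeats the function from the answer and assumes positive numbers with at least n digits.

```r
nums <- c(123451, 123452, 123461, 123462)

# Same function as in the answer: keep the first n digits of each number
firstDigits <- function(x, n) {
  ndigits <- floor(log10(x)) + 1
  floor(x / 10^(ndigits - n))
}

# One list element per distinct 5-digit prefix
groups <- split(nums, firstDigits(nums, 5))
```

groups is a named list, so groups[["12345"]] returns the numbers sharing that prefix.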

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub Try sub with the indicated regular expression which takes the shortest string until a digit, a digit and then everything following and replaces it with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same result. Alternatively, use this, which gives the same answer again but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp <- regexpr("[0-9]", string)
if (tmp[[1]] > 0) {  # regexpr returns -1 when there is no match
  iloc <- tmp[1]
  num <- as.numeric(substr(string, iloc, iloc))
}
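The regexpr approach extends to a whole vector, including strings with no digit at all. A minimal sketch follows; first_digit is a hypothetical helper name, and returning NA for strings without digits is an assumption about the desired behaviour.

```r
# first_digit() is a hypothetical helper, not part of base R
first_digit <- function(s) {
  m <- regexpr("[0-9]", s)               # position of the first digit, -1 if none
  out <- rep(NA_character_, length(s))
  hit <- !is.na(m) & m > 0               # elements that contain a digit
  out[hit] <- regmatches(s, m)           # regmatches returns only the matched elements
  as.integer(out)
}

res <- first_digit(c("ABC3JFD456", "nodigits", "ARST4DS324"))
```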
Using rex may make this type of task a little simpler.
library(rex)
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
  rex(
    capture(name = "first_number", digit)
  )
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So:
> splstr <- strsplit(string, "")
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Regex/ Substring

I have a sequence like this in a list "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
I would like to extract substrings such that, wherever a "K" is present, I get the 6 characters before it and the 6 characters after it.
Ex: MSGSRRKATPASR, here -6..K..+6
I need this for the whole sequence. I tried the substring function in R, but it requires the start and end positions, and here the positions are unknown.
Thanks
.{6}K.{6}
Try this. It will give the desired result.
See demo.
http://regex101.com/r/dM0rS8/4
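In R, a pattern like this can be applied with gregexpr and regmatches. The sketch below uses only the beginning of the sequence from the question to keep the example short. Note that the matches do not overlap, so a K that lies inside an earlier match is not reported as the center of its own window.

```r
# Shortened sequence: the beginning of the one in the question
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGD"

# All non-overlapping 13-character windows with K in the middle
windows <- regmatches(x, gregexpr(".{6}K.{6}", x))[[1]]
```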
Use this:
\w{7}(?<=K)\w{6}
Here \w{7} consumes seven characters and the positive lookbehind (?<=K) asserts that the seventh one is a K, so the match is 6 characters, then K, then 6 more characters.
demo here: http://regex101.com/r/pK3jK1/2
Using rex may make this type of task a little simpler.
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
library(rex)
re_matches(x,
rex(
capture(name = "amino_acids",
n(any, 6),
"K",
n(any, 6)
)
),
global = TRUE)[[1]]
#> amino_acids
#>1 MSGSRRKATPASR
#>2 GEGSFAKVKYAKN
#>3 GDQAAIKILDREK
#>4 KMVEQLKREISTM
#>5 IEVMASKTKIYIV
#>6 GGELFDKIAQQGR
#>7 VYHRDLKPENLIL
#>8 DANGVLKVSDFGL
#>9 PEVLSDKGYDGAA
#>10 NLMTLYKRICKAE
#>11 WFSQGAKRVIKRI
#>12 LEDEWFKKGYKPP
#>13 AAFSNSKECLVTE
#>14 LENLFEKQAQLVK
#>15 ASEIMSKMEETAK
#>16 LGFNVRKDNYKIK
#>17 GDKSGRKGQLSVA
#>18 HVVELRKTGGDTL
#>19 VCDSFYKNFSSGL
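However, the matches above do not overlap: each match consumes its 13 characters, so a K that falls inside an earlier match never becomes the center of its own match.
If you want one window for every K that has 6 characters on each side: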
library(rex)
locs <- re_matches(x,
rex(
"K" %if_prev_is% n(any, 6) %if_next_is% n(any, 6)
),
global = TRUE, locations = TRUE)[[1]]
substring(x, locs$start - 6, locs$end + 6)
#> [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN" "GSFAKVKYAKNTV" "AKVKYAKNTVTGD"
#> [5] "GDQAAIKILDREK" "KILDREKVFRHKM" "EKVFRHKMVEQLK" "KMVEQLKREISTM"
#> [9] "REISTMKLIKHPN" "STMKLIKHPNVVE" "IEVMASKTKIYIV" "VMASKTKIYIVLE"
#>[13] "GGELFDKIAQQGR" "AQQGRLKEDEARR" "VYHRDLKPENLIL" "DANGVLKVSDFGL"
#>[17] "PEVLSDKGYDGAA" "NLMTLYKRICKAE" "LYKRICKAEFSCP" "WFSQGAKRVIKRI"
#>[21] "GAKRVIKRILEPN" "LEDEWFKKGYKPP" "EDEWFKKGYKPPS" "WFKKGYKPPSFDQ"
#>[25] "AAFSNSKECLVTE" "ECLVTEKKEKPVS" "CLVTEKKEKPVSM" "VTEKKEKPVSMNA"
#>[29] "LENLFEKQAQLVK" "KQAQLVKKETRFT" "QAQLVKKETRFTS" "ASEIMSKMEETAK"
#>[33] "KMEETAKPLGFNV" "LGFNVRKDNYKIK" "VRKDNYKIKMKGD" "KDNYKIKMKGDKS"
#>[37] "NYKIKMKGDKSGR" "IKMKGDKSGRKGQ" "GDKSGRKGQLSVA" "HVVELRKTGGDTL"
#>[41] "DTLEFHKVCDSFY" "VCDSFYKNFSSGL" "NFSSGLKDVVWNT"
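The overlapping variant can also be sketched in base R without lookarounds: find every K, keep those with 6 characters on each side, and cut the windows with substring. As before, only the beginning of the sequence is used to keep the example short.

```r
# Shortened sequence: the beginning of the one in the question
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGD"

# Positions of every K in the sequence
k_pos <- gregexpr("K", x, fixed = TRUE)[[1]]

# Keep only the K's that have 6 characters available on both sides
ok <- k_pos > 6 & k_pos + 6 <= nchar(x)

# One 13-character window per remaining K; windows may overlap
windows <- substring(x, k_pos[ok] - 6, k_pos[ok] + 6)
```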

Extract with Specific Prefix and control no. of digits in Regex- R

I need to extract the below pattern in R
(10 digits), prefix with 3, 5, 9 (e.g. 3234567890, 5234567890, 9234567890)
(10 digits), prefix with 4 (e.g. 4234567890)
(10 digits), prefix with 8 (e.g. 8234567890)
and the below
TAM(5 digits) – e.g. TAM12345 (numbers starting with TAM and 5 digits)
E(7 digits) – e.g. E1234567 (numbers starting with E and only 7 digits)
A(5 digits) – e.g. A12345 (numbers starting with A and only 5 digits)
I use the stringr library.
I am able to extract numbers (with alpha)- not sure how to give specific Prefix and to restrict the digits
The email is below
These are the notice number - with high priority
3234567890 and 5234567890 and the long pending issue 9234567890 along with the discuused numbers 4234567890,8234567890.
Special messages from TAM12345,E1234567 and A12345
Required Output
3234567890, 5234567890, 9234567890
4234567890
8234567890
TAM12345
E1234567
A12345
You could try the below code which uses word boundary \b. Word boundary is used to match between a word character and a non-word character.
> library(stringr)
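> # perl() is deprecated since stringr 1.0.0; in newer versions pass the pattern directly or wrap it in regex()
> str_extract_all(x, perl('\\b(?:[35948]\\d{9}|TAM\\d{5}|E\\d{7}|A\\d{5})\\b'))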
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
Using the stringr library:
> library(stringr)
> # in stringr >= 1.0.0, drop the deprecated perl() wrapper and pass the pattern directly
> str_extract_all(x, perl('\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b'))
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
Using the gsubfn library:
> library(gsubfn)
> strapply(x, '\\b([3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', perl=T)
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
And base R which handles this just as well.
> regmatches(x, gregexpr('\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', x, perl=T))
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
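The answers above return everything in one vector, while the required output lists the categories separately. One way to sketch that in base R is to run one pattern per category. The names in patterns are made up for illustration, and x is assumed to hold the email text from the question as a single string.

```r
# x is assumed to be the email text from the question, as one string
x <- paste(
  "These are the notice number - with high priority",
  "3234567890 and 5234567890 and the long pending issue 9234567890",
  "along with the discuused numbers 4234567890,8234567890.",
  "Special messages from TAM12345,E1234567 and A12345"
)

# One pattern per category; the names are illustrative only
patterns <- c(
  prefix_359 = "\\b[359]\\d{9}\\b",
  prefix_4   = "\\b4\\d{9}\\b",
  prefix_8   = "\\b8\\d{9}\\b",
  tam        = "\\bTAM\\d{5}\\b",
  e          = "\\bE\\d{7}\\b",
  a          = "\\bA\\d{5}\\b"
)

# Named list with the matches for each category
result <- lapply(patterns, function(p) {
  regmatches(x, gregexpr(p, x, perl = TRUE))[[1]]
})
```

The word boundary before A keeps the A inside TAM12345 from being picked up by the last pattern.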