R: indexing result from a regular expression - regex

I am trying to use the indexes that were returned from searching through a string for every instance of a character. When I use gregexpr(pattern, text),
lookfor<-"n"
string<-"ATTnGGCnATTn"
gregexpr(pattern=lookfor,text=string)
I get the following:
[[1]]
[1] 4 8 12
attr(,"match.length")
[1] 1 1 1
attr(,"useBytes")
[1] TRUE
How do I index through the first line to be able to use those locations? Thank you in advance for your help!

Addition (2): After thinking about this for a while, I came to the conclusion that you could simply have used unlist on your original gregexpr call
> unlist(gregexpr("n", string))
# [1] 4 8 12
From your comment
I am looking for the position of each letter n
it follows that you could do any of these:
> which(strsplit(string, "")[[1]] == "n")
# [1] 4 8 12
> cumsum(nchar(strsplit(string, "n")[[1]])+1)
# [1] 4 8 12
> nc <- 1:nchar(string)
> which(substring(string, nc, nc) == "n")
# [1] 4 8 12
Addition (1), regarding the similar strings (comment in another answer): You could use strsplit again, and locate those values with one of the methods above
> string2 <- "ATTTGGCCATTG"
> w <- which(strsplit(string, "")[[1]] == "n")
> strsplit(string2, "")[[1]][w]
[1] "T" "C" "G"
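The positions can also be used directly, without splitting either string. This is only a minimal sketch, assuming the two strings have the same length; the pos > 0 filter is an added safeguard (not part of the answer above), since gregexpr returns -1 when nothing matches:

```r
string  <- "ATTnGGCnATTn"
string2 <- "ATTTGGCCATTG"

# Positions of every literal "n" in string
pos <- unlist(gregexpr("n", string, fixed = TRUE))

# Drop the -1 that gregexpr() returns when there is no match
pos <- pos[pos > 0]

# Characters of string2 at those same positions
chars <- substring(string2, pos, pos)
chars
# [1] "T" "C" "G"
```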

If you want to extract all the matches, you can use the builtin function regmatches()
m <- gregexpr(regexp,string)
regmatches(string,m)
This will return a list of character vectors, because string can have length greater than 1. If you're only passing one string in, you can get at the vector of matches directly, bypassing the list, with
regmatches(string,m)[[1]]
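As a concrete illustration (regexp above is a placeholder, so the pattern here is just an assumed example), this pulls out every run of bases between the "n" characters in the string from the question:

```r
string <- "ATTnGGCnATTn"

# Assumed example pattern: one or more of the letters A, C, G, T
m <- gregexpr("[ACGT]+", string)

# regmatches() returns a list with one element per input string
runs <- regmatches(string, m)[[1]]
runs
# [1] "ATT" "GGC" "ATT"
```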

Here is a step-by-step method to find the indices. I suspect there are more efficient ways to achieve the same result. The argument fixed = TRUE tells R to look for the literal lower case "n" rather than treat it as a regular expression.
The [[1]] at the end extracts the first (and only) element of the list that gregexpr returns, which holds the match positions.
Subsetting with 1:length(index) then returns all the positions as a plain integer vector, without the match attributes.
string <- "ATTnGGCnATTn"
index <- gregexpr(pattern = "n", text = string, fixed = TRUE)[[1]]
all.index <- index[1:length(index)]
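To see why fixed = TRUE can matter, here is a small sketch with a made-up string in which the character being searched for is a regex metacharacter. For a plain letter like "n" both calls give the same positions:

```r
s <- "a.b.c"  # hypothetical example string

# As a regular expression, "." matches every character
as_regex <- unlist(gregexpr(".", s))
as_regex
# [1] 1 2 3 4 5

# With fixed = TRUE, "." is matched literally
as_literal <- unlist(gregexpr(".", s, fixed = TRUE))
as_literal
# [1] 2 4
```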

I think I figured out my own answer!
Using the Biostrings package in Bioconductor:
string<-"ATTnGGCnATTn"
matches<-matchPattern(pattern="n", subject=string)
m<-as.vector(ranges(matches))
Now I can call up each position of "n" in all similar strings. Thank you to those who took the time to answer.

Related

Remove end of each number in a vector in R

I have a vector like this:
a <- c(11223344,55667788)
I would like to create a new vector, cutting off the last two digits of each entry in a:
[1] 112233 556677
Do I have to use regex to achieve this or is there a simple indexing trick that I'm not aware of?
Or you could use sub
as.numeric(sub('..$','', a))
#[1] 112233 556677
If they're integers:
> trunc(a / 100)
[1] 112233 556677
Only if they're strictly positive, you could use floor:
> floor(a / 100)
[1] 112233 556677
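The difference between the two only shows up for negative values. A quick sketch with a made-up negative entry:

```r
a <- c(11223344, -55667788)  # second value is a hypothetical negative entry

# trunc() rounds toward zero, so it simply drops the last two digits
t_res <- trunc(a / 100)
t_res   # 112233 -556677

# floor() rounds down, which changes the result for negative numbers
f_res <- floor(a / 100)
f_res   # 112233 -556678
```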
You can use substr (or substring), cutting the number of characters off at two before the end
a <- c(11223344, 55667788)
substr(a, 1, nchar(a)-2)
# [1] "112233" "556677"
wrapping in as.numeric if necessary.

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find what the first digit in the string is and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value, but does not tell me anything about where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub Try sub with the indicated regular expression which takes the shortest string until a digit, a digit and then everything following and replaces it with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc It would also be possible to use strapplyc from gsubfn in which case an even simpler regular expression could be used:
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same result. Alternatively, use this, which gives the same answer again but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]]>=0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
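The regexpr approach also extends to a vector of strings. This is only a sketch: the third string is a made-up example with no digit, and returning NA for it is an assumption about what you would want in that case:

```r
strings <- c("ABC3JFD456", "ARST4DS324", "NODIGITS")

# Position of the first digit in each string, or -1 if there is none
loc <- regexpr("[0-9]", strings)
hit <- loc > 0

# Start with NA everywhere, then fill in the strings that had a digit
first_digit <- rep(NA_real_, length(strings))
first_digit[hit] <- as.numeric(substr(strings[hit], loc[hit], loc[hit]))
first_digit
# [1]  3  4 NA
```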
Using rex may make this type of task a little simpler.
library(rex)
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
  rex(
    capture(name = "first_number", digit)
  )
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So, with the split string stored in splstr:
> splstr <- strsplit(string, "")
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Regex/ Substring

I have a sequence like this in a list "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
I would like to extract substrings so that, wherever a "K" is present, I pull out the 6 characters before and the 6 characters after the "K".
Ex: MSGSRRKATPASR, here -6..K..+6
for the whole sequence. I tried the substring function in R, but it needs the start and end positions to be specified, and here the positions are unknown.
Thanks
.{6}K.{6}
Try this. This will give the desired result.
See demo.
http://regex101.com/r/dM0rS8/4
Use this:
\w{7}(?<=K)\w{6}
This uses a positive lookbehind to ensure that the seventh matched character is a K, so that six characters come before it and six after.
demo here: http://regex101.com/r/pK3jK1/2
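To run the first pattern in R itself, something like the following should work in base R. It is only a sketch: x here holds just the first 44 residues of the sequence to keep the example short:

```r
# First 44 residues of the sequence from the question
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQ"

# gregexpr() finds every non-overlapping match; regmatches() extracts them
windows <- regmatches(x, gregexpr(".{6}K.{6}", x))[[1]]
windows
# [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN"
```

Note that matches found this way never overlap, so a K that sits inside an earlier match does not get its own window.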
Using rex may make this type of task a little simpler.
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
library(rex)
re_matches(x,
  rex(
    capture(name = "amino_acids",
      n(any, 6),
      "K",
      n(any, 6)
    )
  ),
  global = TRUE)[[1]]
#> amino_acids
#>1 MSGSRRKATPASR
#>2 GEGSFAKVKYAKN
#>3 GDQAAIKILDREK
#>4 KMVEQLKREISTM
#>5 IEVMASKTKIYIV
#>6 GGELFDKIAQQGR
#>7 VYHRDLKPENLIL
#>8 DANGVLKVSDFGL
#>9 PEVLSDKGYDGAA
#>10 NLMTLYKRICKAE
#>11 WFSQGAKRVIKRI
#>12 LEDEWFKKGYKPP
#>13 AAFSNSKECLVTE
#>14 LENLFEKQAQLVK
#>15 ASEIMSKMEETAK
#>16 LGFNVRKDNYKIK
#>17 GDKSGRKGQLSVA
#>18 HVVELRKTGGDTL
#>19 VCDSFYKNFSSGL
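However, the matches above do not overlap: each stretch of the sequence is consumed by at most one match, so a K that falls inside another match's window does not get a window of its own.
If you want to output a window of amino acids for each K, match only the K and use lookarounds for the flanking residues: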
library(rex)
locs <- re_matches(x,
  rex(
    "K" %if_prev_is% n(any, 6) %if_next_is% n(any, 6)
  ),
  global = TRUE, locations = TRUE)[[1]]
substring(x, locs$start - 6, locs$end + 6)
#> [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN" "GSFAKVKYAKNTV" "AKVKYAKNTVTGD"
#> [5] "GDQAAIKILDREK" "KILDREKVFRHKM" "EKVFRHKMVEQLK" "KMVEQLKREISTM"
#> [9] "REISTMKLIKHPN" "STMKLIKHPNVVE" "IEVMASKTKIYIV" "VMASKTKIYIVLE"
#>[13] "GGELFDKIAQQGR" "AQQGRLKEDEARR" "VYHRDLKPENLIL" "DANGVLKVSDFGL"
#>[17] "PEVLSDKGYDGAA" "NLMTLYKRICKAE" "LYKRICKAEFSCP" "WFSQGAKRVIKRI"
#>[21] "GAKRVIKRILEPN" "LEDEWFKKGYKPP" "EDEWFKKGYKPPS" "WFKKGYKPPSFDQ"
#>[25] "AAFSNSKECLVTE" "ECLVTEKKEKPVS" "CLVTEKKEKPVSM" "VTEKKEKPVSMNA"
#>[29] "LENLFEKQAQLVK" "KQAQLVKKETRFT" "QAQLVKKETRFTS" "ASEIMSKMEETAK"
#>[33] "KMEETAKPLGFNV" "LGFNVRKDNYKIK" "VRKDNYKIKMKGD" "KDNYKIKMKGDKS"
#>[37] "NYKIKMKGDKSGR" "IKMKGDKSGRKGQ" "GDKSGRKGQLSVA" "HVVELRKTGGDTL"
#>[41] "DTLEFHKVCDSFY" "VCDSFYKNFSSGL" "NFSSGLKDVVWNT"
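If you would rather not depend on rex, roughly the same idea can be sketched in base R with PCRE lookarounds (perl = TRUE). This is an assumed equivalent rather than what rex generates internally, and x here is only the first 44 residues of the sequence:

```r
# First 44 residues of the sequence from the question
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQ"

# Match only the K itself, requiring six characters on each side;
# lookarounds consume nothing, so neighbouring windows may overlap
pos <- unlist(gregexpr("(?<=.{6})K(?=.{6})", x, perl = TRUE))

# Cut a 13-character window around every K that was found
per_k <- substring(x, pos - 6, pos + 6)
per_k
# [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN" "GSFAKVKYAKNTV" "AKVKYAKNTVTGD"
```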

string manipulations using R

I am working with a string vector. Each element in the vector is in the format "MMYYPub". I want to switch the places of "MM" and "YY" in the string, from "MMYYPub" to "YYMMPub". Is this feasible in R?
Example:
vec(1)
'0100pub'
vec(2)
'0200pub'
The first two digits are the month, and the following two digits are the year. There are 10 years of data in total, from 1994 to 2013.
It might also be useful to know about the yearmon class to represent monthly data. The yearmon object can then be printed in a format of choice.
library(zoo)
ym <- as.yearmon("0414pub", format = "%m%ypub")
ym
# [1] "apr 2014"
format(ym, "%y%mpub")
# [1] "1404pub"
You need to read up on regular expressions. Here is one way:
R> val <- "0405pub"
R> gsub("(\\d\\d)(\\d\\d)(.*)", "\\2\\1\\3", val)
[1] "0504pub"
R>
We use the fact that
\d denotes a digit (but need to escape the backslash)
(...) groups parts of the pattern, so here we match group one (two digits), group two (also two digits) and group three (the remainder)
we then "simply" create the replacement string as "two before one followed by three"
There are other ways to do it, but this will accomplish it given the pattern you described.
Edit: Here is a shorter variant using \\d{2} to request two digits:
R> gsub("(\\d{2})(\\d{2})", "\\2\\1", val)
[1] "0504pub"
R>
One way would be using (gsub) to replace all occurrences in the vector.
> vec <- c('0100pub', '0200pub')
> gsub('([0-9]{2})([0-9]{2})', '\\2\\1', vec)
[1] "0001pub" "0002pub"
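If the rest of the string could ever contain digits too, it may be safer to anchor the pattern to the start and use sub, which replaces only the first match. A small sketch with made-up values in the MMYYpub format:

```r
vec <- c("0194pub", "1213pub")  # hypothetical MMYYpub values

# ^ anchors the two 2-digit groups to the start of each string
swapped <- sub("^(\\d{2})(\\d{2})", "\\2\\1", vec)
swapped
# [1] "9401pub" "1312pub"
```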
You might not need regex for this, if you just want to swap the characters around. substring and paste work just as well in this case:
> f <- function(x) paste0(substring(x,3,4), substring(x,1,2), substring(x,5))
> x
[1] "0103pub" "0204pub"
> f(x)
[1] "0301pub" "0402pub"

Getting distance between two words in R

Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating if the word know is within 6 words of help.
This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
x <- strsplit(string, " ")[[1]]
abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: As implemented, this can match multiple terms ("you" would match "you" and "your") leading to funny results. You'll need to determine how you want to deal with these cases to have a more useful function.
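One possible way to deal with the partial-match issue is to compare whole words instead of using grep. This is only a sketch: within_range is a hypothetical name, and stripping leading and trailing punctuation is an assumption about how words such as "along." should be treated:

```r
within_range <- function(string, term1, term2, threshold = 6) {
  # Split on whitespace, then strip punctuation at the edges of each word
  words <- strsplit(string, "\\s+")[[1]]
  words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)

  # Exact comparison, so "you" no longer matches "your"
  i <- which(words == term1)
  j <- which(words == term2)
  if (length(i) == 0 || length(j) == 0) return(FALSE)

  # TRUE if any pair of occurrences is close enough
  min(abs(outer(i, j, "-"))) <= threshold
}

string <- "thanks so much for your help all along. i'll let you know when...."
within_range(string, "help", "know")
# [1] TRUE
within_range(string, "thanks", "know")
# [1] FALSE
```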
You won't be able to get this from regex alone. I suggest splitting the string using a space as the delimiter, then using a loop or a built-in function to find the array positions of your two terms and taking the difference of those indexes.
Edit: Okay, I thought about it for a second, and perhaps this will work for you as a regex pattern:
\bhelp(\s+[^\s]+){1,5}+\s+know\b
This uses the same "space is the delimiter" concept. It first matches help, then greedily matches up to 5 occurrences of " word", then looks for " know" (since "know" would be the 6th).
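In R, the backslashes need doubling, and since {1,5}+ is a PCRE-style possessive quantifier this sketch passes perl = TRUE. The second sentence below is a made-up example to show one caveat: because the possessive group cannot backtrack, a "know" that appears early and is followed by more words can be missed, whereas dropping the trailing + lets the engine back off:

```r
string <- "thanks so much for your help all along. i'll let you know when...."

possessive <- "\\bhelp(\\s+[^\\s]+){1,5}+\\s+know\\b"
grepl(possessive, string, perl = TRUE)
# [1] TRUE

# Hypothetical sentence where "know" is only two words after "help"
other <- "help me know what to do now please"
grepl(possessive, other, perl = TRUE)
# [1] FALSE

# Same pattern without the possessive "+", so backtracking is allowed
backtracking <- "\\bhelp(\\s+[^\\s]+){1,5}\\s+know\\b"
grepl(backtracking, other, perl = TRUE)
# [1] TRUE
```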
Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an indices vector:
> indices <- 1:length(words)
Name indices:
> names(indices) <- words
Compute distance between words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT In a function
distance <- function(string, term1, term2) {
words <- strsplit(string, "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT Plus
There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.