Extracting position of pattern in a string using ifelse in R - regex

I have a set of strings x for example:
[1] "0000000000000000000000000000000000000Y" "9000000000D00000000000000000000Y"
[3] "0000000000000D00000000000000000000X" "000000000000000000D00000000000000000000Y"
[5] "000000000000000000D00000000000000000000Y" "000000000000000000D00000000000000000000Y"
[6]"000000000000000000000000D0000000011011D1X"
I want to extract the last position of a particular character like 1. I am running this code:
ifelse(grepl("1",x),rev(gregexpr("1",x)[[1]])[1],50)
But this is returning -1 for all elements. How do I correct this?

We can use stri_locate_last from stringi. If there are no matches, it will return NA.
library(stringi)
r1 <- stri_locate_last(v1, fixed=1)[,1]
r1
#[1] NA NA NA NA NA NA 40
nchar(v1)
#[1] 38 32 35 40 40 40 41
If we need to replace the NA values with number of characters
ifelse(is.na(r1), nchar(v1), r1)
data
v1 <- c("0000000000000000000000000000000000000Y",
"9000000000D00000000000000000000Y",
"0000000000000D00000000000000000000X",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000000000D0000000011011D1X")

In base R, the following returns the position of the last matched "1".
# Make some toy data
toydata <- c("001", "007", "00101111Y", "000AAAYY")
# Find last postion
last_pos <- sapply(gregexpr("1", toydata), function(m) m[length(m)])
print(last_pos)
#[1] 3 -1 8 -1
It returns -1 whenever the pattern is not matched.

Related

Subset all 3 digit numbers and collapse them with a separator in a data frame. R

I'm formating a data set so each entry has the adegenet format for codominant markers, such as:
Loci1
###/###
208/210
200/204
198/208
where the # represents any digit (the number is a allele size in basepairs). My data has some homozygous entries (all 3 digit integers with no separator) that have the the form of:
Loci1
###
208
198
I intend to paste the 3 digit string to itself with sep='/' to produce the first format. I've tried to use grep to subset these homozygous entries by finding all non ###/### and negating the match using the table matching such as:
a <- grep('\\b\\d{3}?[/]\\d{3}', score$Loci1, value =T ) # Subset all ###/###/
score[!(a %in% 1:nrow(score$Loci1)), ] # works but only on vectors...
After the subset I could paste. The problem arises when I apply this to a data frame. grep seems to treat the data frame as a list (which in part it is) and returns columns that have a match.
So in short how can I go from ### to ###/### in a data frame
self contained example of data:
score2 <- NULL
set.seed(9)
Loci1 <- NULL
Loci2 <- NULL
Loci3 <- NULL
for (i in 1:5) Loci1 <- append(Loci1, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci2 <- append(Loci2, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci3 <- append(Loci3, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
score2 <- data.frame(Loci1, Loci2, Loci3, stringsAsFactors = F)
score2[2,3] <- strsplit(score2[2,3], split = '/')[1]
score2[5,2] <- strsplit(score2[3,3], split = '/')[1]
score2[1,1] <- strsplit(score2[1,1], split = '/')[1]
score2[c(1, 4),c(2,3)] <- NA
score2
You could just replace the 3 digit items with the separator and a copy:
sub("^(...)$", "\\1/\\1", Loci1)
Use lapply with an anonymized function:
data.frame( lapply(score2, function(x) sub("^(...)$", "\\1/\\1", x) ) )
Loci1 Loci2 Loci3
1 251/251 <NA> <NA>
2 251/329 320/257 260/260
3 275/242 278/329 281/320
4 269/266 <NA> <NA>
5 296/326 281/281 326/314
(Not sure what the "paste-part" was supposed to refer to, but I think this was the intent of your question)
If the numeric values could have a varying number of digits then use a pattern argument like "^([0-9]{1,9})$"
An option using grep/paste,
m1 <- as.matrix(score2)
indx <- grep('^...$', m1)
m1[indx] <- paste(m1[indx], m1[indx], sep="/")
as.data.frame(m1)
# Loci1 Loci2 Loci3
#1 251/251 <NA> <NA>
#2 251/329 320/257 260/260
#3 275/242 278/329 281/320
#4 269/266 <NA> <NA>
#5 296/326 281/281 326/314
Or without converting to matrix, this can be done using lapply
score2[] <- lapply(score2, function(x) ifelse(grepl('^...$', x),
paste(x, x, sep="/"),x))

R regex find ranges in strings

I have a bunch of email subject lines and I'm trying to extract whether a range of values are present. This is how I'm trying to do it but am not getting the results I'd like:
library(stringi)
df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')
df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')
id string1 pctOff10_20
1 1 15% off NA
2 2 25% off NA
3 3 35% off NA
4 4 45% off NA
5 5 55% off NA
I'd like something like this:
id string1 pctOff10_20
1 1 15% off 1
2 2 25% off 0
3 3 35% off 0
4 4 45% off 0
5 5 55% off 0
Here is the way to go,
df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')
Explanation:
^ the beginning of the string
( group and capture to \1:
1 '1'
\d digits (0-9)
| OR
20 '20'
) end of \1
% '%'
1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternately we could make use of betweeen from data.table (or a number of other packages). This extracts the matches to the pattern, applies the function to it and returns the result simplifying it into a vector (rather than a list):
library(gsubfn)
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)
transform(df1, pctOff10_20 =
strapply(
X = string1,
pattern = "\\d+",
FUN = ~ btwn(x, 10, 20),
simplify = TRUE
)
)
2) A base solution using the same btwn function defined above is:
transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))

Regex to extract number after a certain string

How can I in R extract the number that always comes after the string -{any single letter}, e.g. from the vector:
c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
one should get:
(465, 5, 3216).
The -{any single letter} pattern occurs only once.
You could use gsub, e.g.:
x <- c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
as.numeric(gsub("^.*-[A-Z]+([0-9]+).*$", "\\1", x))
# [1] 465 5 3216
library(stringr)
v <- c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
as.numeric(sapply(str_match_all(v, "\\-[a-zA-Z]([0-9]+)"),"[")[2,])
## [1] 465 5 3216
> x <- c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
> as.numeric(gsub("[A-Z]|-", "", x))
## [1] 465 5 3216

Find the location of a character in string

I would like to find the location of a character in a string.
Say: string = "the2quickbrownfoxeswere2tired"
I would like the function to return 4 and 24 -- the character location of the 2s in string.
You can use gregexpr
gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired")
[[1]]
[1] 4 24
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
or perhaps str_locate_all from package stringr which is a wrapper for gregexpr stringi::stri_locate_all (as of stringr version 1.0)
library(stringr)
str_locate_all(pattern ='2', "the2quickbrownfoxeswere2tired")
[[1]]
start end
[1,] 4 4
[2,] 24 24
note that you could simply use stringi
library(stringi)
stri_locate_all(pattern = '2', "the2quickbrownfoxeswere2tired", fixed = TRUE)
Another option in base R would be something like
lapply(strsplit(x, ''), function(x) which(x == '2'))
should work (given a character vector x)
Here's another straightforward alternative.
> which(strsplit(string, "")[[1]]=="2")
[1] 4 24
You can make the output just 4 and 24 using unlist:
unlist(gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired"))
[1] 4 24
find the position of the nth occurrence of str2 in str1(same order of parameters as Oracle SQL INSTR), returns 0 if not found
instr <- function(str1,str2,startpos=1,n=1){
aa=unlist(strsplit(substring(str1,startpos),str2))
if(length(aa) < n+1 ) return(0);
return(sum(nchar(aa[1:n])) + startpos+(n-1)*nchar(str2) )
}
instr('xxabcdefabdddfabx','ab')
[1] 3
instr('xxabcdefabdddfabx','ab',1,3)
[1] 15
instr('xxabcdefabdddfabx','xx',2,1)
[1] 0
To only find the first locations, use lapply() with min():
my_string <- c("test1", "test1test1", "test1test1test1")
unlist(lapply(gregexpr(pattern = '1', my_string), min))
#> [1] 5 5 5
# or the readable tidyverse form
my_string %>%
gregexpr(pattern = '1') %>%
lapply(min) %>%
unlist()
#> [1] 5 5 5
To only find the last locations, use lapply() with max():
unlist(lapply(gregexpr(pattern = '1', my_string), max))
#> [1] 5 10 15
# or the readable tidyverse form
my_string %>%
gregexpr(pattern = '1') %>%
lapply(max) %>%
unlist()
#> [1] 5 10 15
You could use grep as well:
grep('2', strsplit(string, '')[[1]])
#4 24

How to expand a list with NULLs up to some length?

Given a list whose length <= N, what is the best / most efficient way to fill it up with trailing NULLs up to length (so that it has length N).
This is something which is a one-liner in any decent language, but I don't have a clue how to do it (efficiently) in a few lines in R so that it works for every corner case (zero length list etc.).
Let's keep it really simple:
tst<-1:10 #whatever, to get a vector of length 10
tst<-tst[1:15]
Try this :
> l = list("a",1:3)
> N = 5
> l[N+1]=NULL
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
>
How about this ?
> l = list("a",1:3)
> length(l)=5
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
Directly editing the list's length appears to be the fastest as far as I can tell:
tmp <- vector("list",5000)
sol1 <- function(x){
x <- x[1:10000]
}
sol2 <- function(x){
x[10001] <- NULL
}
sol3 <- function(x){
length(x) <- 10000
}
library(rbenchmark)
benchmark(sol1(tmp),sol2(tmp),sol3(tmp),replications = 5000)
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1(tmp) 5000 2.045 1.394952 1.327 0.727 0 0
2 sol2(tmp) 5000 2.849 1.943383 1.804 1.075 0 0
3 sol3(tmp) 5000 1.466 1.000000 0.937 0.548 0 0
But the differences aren't huge, unless you're doing this a lot on very long lists, I suppose.
I'm sure there are shorter ways, but I would be inclined to do:
l <- as.list(1:10)
N <- 15
l <- c(l, as.list(rep(NA, N - length(l) )))
Hi: I'm not sure if you were talking about an actual list but, if you were, below will work. It works because, once you access the element of a vector ( which is a list is ) that is not there, R expands the vector to that length.
length <- 10
temp <- list("a","b")
print(temp)
temp[length] <- NULL
print(temp)