Find the location of a character in string - regex

I would like to find the location of a character in a string.
Say: string = "the2quickbrownfoxeswere2tired"
I would like the function to return 4 and 24 -- the character locations of the 2s in string.

You can use gregexpr:
gregexpr(pattern = '2', "the2quickbrownfoxeswere2tired")
[[1]]
[1] 4 24
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
or perhaps str_locate_all from package stringr, which is a wrapper for stringi::stri_locate_all (as of stringr version 1.0):
library(stringr)
str_locate_all(pattern ='2', "the2quickbrownfoxeswere2tired")
[[1]]
     start end
[1,]     4   4
[2,]    24  24
Note that you could also use stringi directly:
library(stringi)
stri_locate_all(pattern = '2', "the2quickbrownfoxeswere2tired", fixed = TRUE)
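The result is a list of start/end matrices, so the positions alone could be pulled out with something like:
library(stringi)
res <- stri_locate_all(pattern = '2', "the2quickbrownfoxeswere2tired", fixed = TRUE)
res[[1]][, "start"]
# [1]  4 24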
Another option in base R, given a character vector x, would be something like:
lapply(strsplit(x, ''), function(x) which(x == '2'))
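Applied to the example string, this should give roughly:
lapply(strsplit("the2quickbrownfoxeswere2tired", ''), function(x) which(x == '2'))
# [[1]]
# [1]  4 24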

Here's another straightforward alternative.
> which(strsplit(string, "")[[1]]=="2")
[1] 4 24

You can make the output just 4 and 24 using unlist:
unlist(gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired"))
[1] 4 24

The following finds the position of the nth occurrence of str2 in str1 (same order of parameters as Oracle SQL INSTR), returning 0 if it is not found:
instr <- function(str1, str2, startpos = 1, n = 1) {
  # split on the needle, then add up the chunk lengths before the nth match
  aa <- unlist(strsplit(substring(str1, startpos), str2))
  if (length(aa) < n + 1) return(0)
  sum(nchar(aa[1:n])) + startpos + (n - 1) * nchar(str2)
}
instr('xxabcdefabdddfabx','ab')
[1] 3
instr('xxabcdefabdddfabx','ab',1,3)
[1] 15
instr('xxabcdefabdddfabx','xx',2,1)
[1] 0
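If you'd rather lean on regex matching directly, a similar nth-occurrence helper could be sketched with gregexpr (instr2 is just an illustrative name; this version treats str2 as a fixed string and returns 0 when there are fewer than n matches):
instr2 <- function(str1, str2, startpos = 1, n = 1) {
  # all match positions within the substring starting at startpos
  m <- gregexpr(str2, substring(str1, startpos), fixed = TRUE)[[1]]
  if (m[1] == -1 || length(m) < n) return(0)
  m[n] + startpos - 1
}
instr2('xxabcdefabdddfabx', 'ab', 1, 3)
# [1] 15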

To only find the first locations, use lapply() with min():
my_string <- c("test1", "test1test1", "test1test1test1")
unlist(lapply(gregexpr(pattern = '1', my_string), min))
#> [1] 5 5 5
# or the more readable tidyverse form (the %>% pipe comes from magrittr)
my_string %>%
  gregexpr(pattern = '1') %>%
  lapply(min) %>%
  unlist()
#> [1] 5 5 5
To only find the last locations, use lapply() with max():
unlist(lapply(gregexpr(pattern = '1', my_string), max))
#> [1] 5 10 15
# or the readable tidyverse form
my_string %>%
  gregexpr(pattern = '1') %>%
  lapply(max) %>%
  unlist()
#> [1] 5 10 15

You could use grep as well:
grep('2', strsplit(string, '')[[1]])
# [1]  4 24

Related

R regex extract rating out of 10 from string

I have some text strings that I would like to extract certain bits of information from. In particular, I would like to extract the rating out of 10.
I would like help in constructing a function func_to_extract_rating that does the following...
text_string_vec <- c('blah$2.94 blah blah 3/10 blah blah.',
                     'foo foo 8/10.',
                     '10/10 bar bar21/09/2010 bar bar',
                     'jdsfs1/10djflks5/10.')
func_to_extract_rating <- function(){}
output <- lapply(text_string_vec,func_to_extract_rating)
output
[[1]]
[1] 3 10
[[2]]
[1] 8 10
[[3]]
[1] 10 10
[[4]]
[[4]][[1]]
[1] 1 10
[[4]][[2]]
[1] 5 10
Something like this maybe:
library(stringr)
result = str_extract_all(text_string_vec, "[0-9]{1,2}/10")
result = lapply(result, function(x) gsub("/"," ", x))
[[1]]
[1] "3 10"
[[2]]
[1] "8 10"
[[3]]
[1] "10 10"
[[4]]
[1] "1 10" "5 10"
But since it's always out of 10, if you just want the numeric rating, you can do:
result = str_extract_all(text_string_vec, "[0-9]{1,2}/10")
result = lapply(result, function(x) as.numeric(gsub("/10","", x)))
Here is a base R option (where str1 is the input vector, i.e. text_string_vec):
lapply(strsplit(str1, "([0-9]{1,2}\\/10)(*SKIP)(*FAIL)|.", perl = TRUE),
       function(x) {
         lst <- lapply(strsplit(x[nzchar(x)], "/"), as.numeric)
         if (length(lst) == 1) unlist(lst) else lst
       })
#[[1]]
#[1] 3 10
#[[2]]
#[1] 8 10
#[[3]]
#[1] 10 10
#[[4]]
#[[4]][[1]]
#[1] 1 10
#[[4]][[2]]
#[1] 5 10

Extracting position of pattern in a string using ifelse in R

I have a set of strings x for example:
[1] "0000000000000000000000000000000000000Y" "9000000000D00000000000000000000Y"
[3] "0000000000000D00000000000000000000X" "000000000000000000D00000000000000000000Y"
[5] "000000000000000000D00000000000000000000Y" "000000000000000000D00000000000000000000Y"
[6]"000000000000000000000000D0000000011011D1X"
I want to extract the last position of a particular character like 1. I am running this code:
ifelse(grepl("1",x),rev(gregexpr("1",x)[[1]])[1],50)
But this is returning -1 for all elements. How do I correct this?
We can use stri_locate_last from stringi. If there are no matches, it will return NA.
library(stringi)
r1 <- stri_locate_last(v1, fixed=1)[,1]
r1
#[1] NA NA NA NA NA NA 40
nchar(v1)
#[1] 38 32 35 40 40 40 41
If we need to replace the NA values with number of characters
ifelse(is.na(r1), nchar(v1), r1)
data
v1 <- c("0000000000000000000000000000000000000Y",
"9000000000D00000000000000000000Y",
"0000000000000D00000000000000000000X",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000000000D0000000011011D1X")
In base R, the following returns the position of the last matched "1".
# Make some toy data
toydata <- c("001", "007", "00101111Y", "000AAAYY")
# Find last postion
last_pos <- sapply(gregexpr("1", toydata), function(m) m[length(m)])
print(last_pos)
#[1] 3 -1 8 -1
It returns -1 whenever the pattern is not matched.
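If, as in the question, you want a fallback value (there, the string width) when there is no match, the -1s can then be swapped out, e.g.:
ifelse(last_pos == -1, nchar(toydata), last_pos)
#[1] 3 3 8 8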

R Merging 4 Strings into 1 String

I'm searching for the locations of 4 different substrings in x and trying to merge these four outputs into one cumulative string:
x <- ("AAABBADSJALKACCWIEUADD")
outputA <- gregexpr(pattern = "AAA", x)
outputB <- gregexpr(pattern = "ABB", x)
outputC <- gregexpr(pattern = "ACC", x)
outputD <- gregexpr(pattern = "ADD", x)
I would like to merge these four outputs and write the merged result to a text file with each element on a new line.
merged_output
# 1
# 3
# 13
# 20
Thank you
Actually you can do it all at once using a lookahead (?=)
gregexpr("A(?=AA|BB|CC|DD)", x, perl=T)[[1]]
# [1] 1 3 13 20
# attr(,"match.length")
# [1] 1 1 1 1
# attr(,"useBytes")
# [1] TRUE
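To then write those positions to a text file with one value per line, as the question asks, something like writeLines() should do (the file name is just a placeholder):
pos <- gregexpr("A(?=AA|BB|CC|DD)", x, perl = TRUE)[[1]]
writeLines(as.character(pos), "merged_output.txt")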
For example:
library(stringi)
cat("merged_output",
    paste("#",
          stri_locate_first_fixed(pattern = c("AAA", "ABB", "ACC", "ADD"),
                                  "AAABBADSJALKACCWIEUADD")[, "start"]),
    file = tf <- tempfile(fileext = ".txt"),
    sep = "\n")
Now, the file named in tf contains
> merged_output
> # 1
> # 3
> # 13
> # 20
Not very automated, but
cat(paste(c(outputA[[1]][1], outputB[[1]][1], outputC[[1]][1], outputD[[1]][1]),
          collapse = "\n"),
    file = "outputfile.txt")
should do it.

split string without loss of characters

I wish to split strings at a certain character while retaining that character in the second resulting string. I can achieve almost all of the desired operation, except that I lose the characters I specify in strsplit, which I guess is called the delimiter.
Is there a way to request that strsplit retain the delimiter? Or must I use a regular expression of some kind? Thank you for any advice. This seems like a very basic question. Sorry if it is a duplicate. I prefer to use base R.
Here is an example showing what I have so far:
my.table <- read.table(text = '
model npar AICc
AA(~region+state+county+city)BB(~region+state+county+city)CC(~1) 17 11111.11
AA(~region+state+county)BB(~region+state+county)CC(~123) 14 22222.22
AA(~region+state)BB(~region+state)CC(~33) 13 33333.33
AA(~region)BB(~region)CC(~4321) 6 44444.44
', header = TRUE, stringsAsFactors = FALSE)
desired.result <- read.table(text = '
model CC npar AICc
AA(~region+state+county+city)BB(~region+state+county+city) CC(~1) 17 11111.11
AA(~region+state+county)BB(~region+state+county) CC(~123) 14 22222.22
AA(~region+state)BB(~region+state) CC(~33) 13 33333.33
AA(~region)BB(~region) CC(~4321) 6 44444.44
', header = TRUE, stringsAsFactors = FALSE)
split.model <- strsplit(my.table$model, 'CC\\(')
split.models <- matrix(unlist(split.model), ncol=2, byrow=TRUE, dimnames = list(NULL, c("model", "CC")))
desires.result2 <- data.frame(split.models, my.table[,2:ncol(my.table)])
desires.result2
# model CC npar AICc
# 1 AA(~region+state+county+city)BB(~region+state+county+city) ~1) 17 11111.11
# 2 AA(~region+state+county)BB(~region+state+county) ~123) 14 22222.22
# 3 AA(~region+state)BB(~region+state) ~33) 13 33333.33
# 4 AA(~region)BB(~region) ~4321) 6 44444.44
The basic idea is to use look-around operations from regular expressions with strsplit to get your desired result. However, it's a bit trickier than that with strsplit and positive lookahead. Read this excellent post from @JoshO'Brien for the explanation.
pattern <- "(?<=\\))(?=CC)"
strsplit(my.table$model, pattern, perl=TRUE)
# [[1]]
# [1] "AA(~region+state+county+city)BB(~region+state+county+city)"
# [2] "CC(~1)"
# [[2]]
# [1] "AA(~region+state+county)BB(~region+state+county)"
# [2] "CC(~123)"
# [[3]]
# [1] "AA(~region+state)BB(~region+state)" "CC(~33)"
# [[4]]
# [1] "AA(~region)BB(~region)" "CC(~4321)"
Of course, I leave the task of do.call(rbind, ...) and cbind to get the final desired.output to you.
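For what it's worth, that remaining step could look roughly like this (reusing pattern from above and naming the columns to match desired.result):
split.model <- strsplit(my.table$model, pattern, perl = TRUE)
out <- data.frame(do.call(rbind, split.model), my.table[, -1])
names(out)[1:2] <- c("model", "CC")
out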
Almost right after I posted I thought of using gsub to insert a space and then split on the space. I like Arun's answer better, though.
my.table <- read.table(text = '
model npar AICc
AA(~region+state+county+city)BB(~region+state+county+city)CC(~1) 17 11111.11
AA(~region+state+county)BB(~region+state+county)CC(~123) 14 22222.22
AA(~region+state)BB(~region+state)CC(~33) 13 33333.33
AA(~region)BB(~region)CC(~4321) 6 44444.44
', header = TRUE, stringsAsFactors = FALSE)
my.table$model <- gsub("CC", " CC", my.table$model)
split.model <- strsplit(my.table$model, ' ')
split.models <- matrix(unlist(split.model), ncol=2, byrow=TRUE, dimnames = list(NULL, c("model", "CC")))
desires.result <- data.frame(split.models, my.table[,2:ncol(my.table)])
desires.result
# model CC npar AICc
# 1 AA(~region+state+county+city)BB(~region+state+county+city) CC(~1) 17 11111.11
# 2 AA(~region+state+county)BB(~region+state+county) CC(~123) 14 22222.22
# 3 AA(~region+state)BB(~region+state) CC(~33) 13 33333.33
# 4 AA(~region)BB(~region) CC(~4321) 6 44444.44
... why not just tack the separator back on afterwards? Would seem to save a lot of trouble fiddling with regexes.
split.model <- lapply(strsplit(my.table$model, 'CC\\('), function(x) {
  x[2] <- paste0("CC(", x[2])
  x
})

How to expand a list with NULLs up to some length?

Given a list whose length <= N, what is the best / most efficient way to pad it with trailing NULLs so that it has length N?
This is something which is a one-liner in any decent language, but I don't have a clue how to do it (efficiently) in a few lines in R so that it works for every corner case (zero length list etc.).
Let's keep it really simple:
tst<-1:10 #whatever, to get a vector of length 10
tst<-tst[1:15]
Try this:
> l = list("a",1:3)
> N = 5
> l[N+1]=NULL
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
How about this?
> l = list("a",1:3)
> length(l)=5
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
Directly editing the list's length appears to be the fastest as far as I can tell:
tmp <- vector("list", 5000)
sol1 <- function(x){
  x <- x[1:10000]
}
sol2 <- function(x){
  x[10001] <- NULL
}
sol3 <- function(x){
  length(x) <- 10000
}
library(rbenchmark)
benchmark(sol1(tmp),sol2(tmp),sol3(tmp),replications = 5000)
       test replications elapsed relative user.self sys.self user.child sys.child
1 sol1(tmp)         5000   2.045 1.394952     1.327    0.727          0         0
2 sol2(tmp)         5000   2.849 1.943383     1.804    1.075          0         0
3 sol3(tmp)         5000   1.466 1.000000     0.937    0.548          0         0
But the differences aren't huge, unless you're doing this a lot on very long lists, I suppose.
I'm sure there are shorter ways, but I would be inclined to do:
l <- as.list(1:10)
N <- 15
l <- c(l, as.list(rep(NA, N - length(l) )))
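If the padding really has to be NULL rather than NA, one variation that should work is concatenating an empty list of the right length, since vector("list", n) is a list of n NULLs:
l <- as.list(1:10)
N <- 15
l <- c(l, vector("list", N - length(l)))
length(l)
# [1] 15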
Hi: I'm not sure if you were talking about an actual list but, if you were, the code below will work. It works because, once you assign to an element of a vector (which a list is) that is not there, R expands the vector to that length.
length <- 10
temp <- list("a","b")
print(temp)
temp[length] <- NULL
print(temp)