R regex extract rating out of 10 from string

I have some text strings from which I would like to extract certain bits of information. In particular, I would like to extract the rating out of 10 from each string.
I would like help in constructing a function func_to_extract_rating that does the following...
text_string_vec <- c('blah$2.94 blah blah 3/10 blah blah.',
'foo foo 8/10.',
'10/10 bar bar21/09/2010 bar bar',
'jdsfs1/10djflks5/10.')
func_to_extract_rating <- function(){}
output <- lapply(text_string_vec,func_to_extract_rating)
output
[[1]]
[1] 3 10
[[2]]
[1] 8 10
[[3]]
[1] 10 10
[[4]]
[[4]][[1]]
[1] 1 10
[[4]][[2]]
[1] 5 10

Something like this maybe:
library(stringr)
result = str_extract_all(text_string_vec, "[0-9]{1,2}/10")
result = lapply(result, function(x) gsub("/"," ", x))
[[1]]
[1] "3 10"
[[2]]
[1] "8 10"
[[3]]
[1] "10 10"
[[4]]
[1] "1 10" "5 10"
But since it's always out of 10, if you just want the numeric rating, you can do:
result = str_extract_all(text_string_vec, "[0-9]{1,2}/10")
result = lapply(result, function(x) as.numeric(gsub("/10","", x)))
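For the list-of-numeric-pairs shape shown in the question, the same pattern can be wrapped in a function using base R only. This is a minimal sketch, not the answerer's code: the lookarounds (which need perl = TRUE) are an added assumption so that a date such as 5/10/2010 or a score such as 3/100 is not picked up as a rating.

```r
# Hypothetical implementation of the function the question asks for.
func_to_extract_rating <- function(s) {
  # Assumption: a rating is 1-2 digits, "/10", not touching other digits or slashes
  pattern <- "(?<![0-9/])[0-9]{1,2}/10(?![0-9/])"
  hits <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
  # Split each hit on "/" and convert both halves to numbers
  lapply(strsplit(hits, "/", fixed = TRUE), as.numeric)
}

text_string_vec <- c('blah$2.94 blah blah 3/10 blah blah.',
                     'foo foo 8/10.',
                     '10/10 bar bar21/09/2010 bar bar',
                     'jdsfs1/10djflks5/10.')
output <- lapply(text_string_vec, func_to_extract_rating)
```

Unlike the output in the question, every element here is a list of pairs, even when a string holds only one rating, which tends to be easier to process uniformly.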

Here is a base R option
lapply(strsplit(text_string_vec, "([0-9]{1,2}\\/10)(*SKIP)(*FAIL)|.", perl = TRUE),
function(x) {
lst <- lapply(strsplit(x[nzchar(x)], "/"), as.numeric)
if(length(lst)==1) unlist(lst) else lst})
#[[1]]
#[1] 3 10
#[[2]]
#[1] 8 10
#[[3]]
#[1] 10 10
#[[4]]
#[[4]][[1]]
#[1] 1 10
#[[4]][[2]]
#[1] 5 10

Related

Detecting number repetition in R using regex

Shouldn't this code work for repeating number detection in R?
> grep(pattern = "\\d{2}", x = 1223)
[1] 1
> grep(pattern = "\\d{3}", x = 1223)
[1] 1
If we have 988 we should get true and if 123 we should get false.
Sounds like it isn't.
> grep(pattern = "\\d{2}", x = "1223")
[1] 1
> grep(pattern = "\\d{2}", x = "13")
[1] 1
You need to use backreferences:
> grep(pattern = "(\\d)\\1", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1{1,}", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1", x = "1234")
integer(0)
EDIT: It seems you need to figure out how this works: (\\d) creates a capture group for \\d, which can then be referred to with the backreference \\1. For example, if you have numbers of the form x2y and you want to find those where x is the same as y, then:
> grep(pattern = "(\\d)2\\1", x = "121")
[1] 1
> grep(pattern = "(\\d)2\\1", x = "124")
integer(0)
I'd strongly recommend that you read a basic tutorial on regular expressions.
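Since the question asks for true/false answers, the backreference pattern can be used with grepl rather than grep. A minimal sketch; has_repeat is a hypothetical helper name, and numbers are converted to character first because regular expressions operate on strings.

```r
# Hypothetical helper: TRUE when any digit is immediately followed by itself
has_repeat <- function(x) grepl("(\\d)\\1", as.character(x))

res <- has_repeat(c(988, 123, 1223))
res
# [1]  TRUE FALSE  TRUE
```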
I know the question explicitly says "using regex" in the title, but here is a non-regex method that could work, depending on what you want to do.
strings <- c("1223","1233","1234","113")
# detect consecutive repeat digits, or characters
(strings.rle <- lapply(strings, function(x)rle(unlist(strsplit(x,"")))))
[[1]]
Run Length Encoding
lengths: int [1:3] 1 2 1
values : chr [1:3] "1" "2" "3"
[[2]]
Run Length Encoding
lengths: int [1:3] 1 1 2
values : chr [1:3] "1" "2" "3"
[[3]]
Run Length Encoding
lengths: int [1:4] 1 1 1 1
values : chr [1:4] "1" "2" "3" "4"
[[4]]
Run Length Encoding
lengths: int [1:2] 2 1
values : chr [1:2] "1" "3"
Now you can work with strings.rle to do what you want
# which entries have consecutive repeat digits, or characters
strings[sapply(strings.rle, function(x) any(x$lengths > 1))]
[1] "1223" "1233" "113"
or
# which digits or characters are consecutively repeated?
lapply(strings.rle, function(x) x$values[which(x$lengths > 1)])
[[1]]
[1] "2"
[[2]]
[1] "3"
[[3]]
character(0)
[[4]]
[1] "1"
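The run-length idea above can also be condensed into one number per string: the length of its longest run. This is a sketch under the assumption that every string is non-empty; longest_run is a hypothetical helper name.

```r
strings <- c("1223", "1233", "1234", "113")

# Hypothetical helper: length of the longest run of identical characters
longest_run <- function(s) max(rle(strsplit(s, "")[[1]])$lengths)

run_lengths <- vapply(strings, longest_run, integer(1), USE.NAMES = FALSE)
run_lengths
# [1] 2 2 1 2

# strings containing a run of at least two identical characters
repeated <- strings[run_lengths >= 2]
```

Changing the threshold (for example run_lengths >= 3) would pick out longer repetitions without touching the regex.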

Regex to extract number after a certain string

How can I extract, in R, the number that always comes after the string -{any single letter}, e.g. from the vector:
c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
one should get:
(465, 5, 3216).
The -{any single letter} pattern occurs only once.
You could use gsub, e.g.:
x <- c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
as.numeric(gsub("^.*-[A-Z]+([0-9]+).*$", "\\1", x))
# [1] 465 5 3216
library(stringr)
v <- c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
as.numeric(sapply(str_match_all(v, "\\-[a-zA-Z]([0-9]+)"),"[")[2,])
## [1] 465 5 3216
> x <- c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")
> as.numeric(gsub("[A-Z]|-", "", x))
## [1] 465 5 3216
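Another way to express "the digits right after -{letter}" is a lookbehind with regmatches. This is a base R sketch that assumes every element contains exactly one such match; regmatches with regexpr silently drops elements that have none, so the result could be shorter than the input otherwise.

```r
x <- c("JFSDLKJ-H465", "FJSLKJHSD-Y5FSDLKJ", "DFSJLKJAAA-Z3216FJJ")

# (?<=-[A-Za-z]) requires a hyphen plus one letter just before the digits
m <- regmatches(x, regexpr("(?<=-[A-Za-z])[0-9]+", x, perl = TRUE))
nums <- as.numeric(m)
nums
# [1]  465    5 3216
```

Unlike stripping every letter and hyphen, this does not pick up digits that appear elsewhere in the string.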

Find the location of a character in string

I would like to find the location of a character in a string.
Say: string = "the2quickbrownfoxeswere2tired"
I would like the function to return 4 and 24 -- the character location of the 2s in string.
You can use gregexpr
gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired")
[[1]]
[1] 4 24
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
or perhaps str_locate_all from package stringr, which wraps stringi::stri_locate_all (as of stringr version 1.0)
library(stringr)
str_locate_all(pattern ='2', "the2quickbrownfoxeswere2tired")
[[1]]
start end
[1,] 4 4
[2,] 24 24
note that you could simply use stringi
library(stringi)
stri_locate_all(pattern = '2', "the2quickbrownfoxeswere2tired", fixed = TRUE)
Another option in base R would be something like
lapply(strsplit(x, ''), function(x) which(x == '2'))
should work (given a character vector x)
Here's another straightforward alternative.
> which(strsplit(string, "")[[1]]=="2")
[1] 4 24
You can make the output just 4 and 24 using unlist:
unlist(gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired"))
[1] 4 24
Find the position of the nth occurrence of str2 in str1 (same order of parameters as Oracle SQL INSTR); returns 0 if not found:
instr <- function(str1,str2,startpos=1,n=1){
aa=unlist(strsplit(substring(str1,startpos),str2))
if(length(aa) < n+1 ) return(0);
return(sum(nchar(aa[1:n])) + startpos+(n-1)*nchar(str2) )
}
instr('xxabcdefabdddfabx','ab')
[1] 3
instr('xxabcdefabdddfabx','ab',1,3)
[1] 15
instr('xxabcdefabdddfabx','xx',2,1)
[1] 0
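The strsplit-based version treats str2 as a regular expression and, because strsplit drops a trailing empty piece, misses a match at the very end of the string. A possible variant built on gregexpr with fixed = TRUE avoids both; instr2 is a hypothetical name, and matches are counted without overlap.

```r
# Hypothetical variant of instr(): literal matching via gregexpr
instr2 <- function(str1, str2, startpos = 1, n = 1) {
  # positions of all non-overlapping literal matches, relative to startpos
  pos <- gregexpr(str2, substring(str1, startpos), fixed = TRUE)[[1]]
  if (pos[1] == -1 || length(pos) < n) return(0)
  # shift back to a position in the full string
  as.numeric(pos[n]) + startpos - 1
}

instr2('xxabcdefabdddfabx', 'ab')        # 3
instr2('xxabcdefabdddfabx', 'ab', 1, 3)  # 15
instr2('xxabcdefabdddfabx', 'xx', 2, 1)  # 0
instr2('xxab', 'ab')                     # 3 (match at the end of the string)
```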
To only find the first locations, use lapply() with min():
my_string <- c("test1", "test1test1", "test1test1test1")
unlist(lapply(gregexpr(pattern = '1', my_string), min))
#> [1] 5 5 5
# or the readable tidyverse form (the %>% pipe needs library(magrittr) or library(dplyr))
my_string %>%
gregexpr(pattern = '1') %>%
lapply(min) %>%
unlist()
#> [1] 5 5 5
To only find the last locations, use lapply() with max():
unlist(lapply(gregexpr(pattern = '1', my_string), max))
#> [1] 5 10 15
# or the readable tidyverse form (the %>% pipe needs library(magrittr) or library(dplyr))
my_string %>%
gregexpr(pattern = '1') %>%
lapply(max) %>%
unlist()
#> [1] 5 10 15
You could use grep as well:
grep('2', strsplit(string, '')[[1]])
#4 24

How to expand a list with NULLs up to some length?

Given a list whose length <= N, what is the best / most efficient way to fill it up with trailing NULLs so that it has length N?
This is something which is a one-liner in any decent language, but I don't have a clue how to do it (efficiently) in a few lines in R so that it works for every corner case (zero length list etc.).
Let's keep it really simple:
tst<-1:10 #whatever, to get a vector of length 10
tst<-tst[1:15]
Try this :
> l = list("a",1:3)
> N = 5
> l[N+1]=NULL
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
>
How about this ?
> l = list("a",1:3)
> length(l)=5
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
Directly editing the list's length appears to be the fastest as far as I can tell:
tmp <- vector("list",5000)
sol1 <- function(x){
x <- x[1:10000]
}
sol2 <- function(x){
x[10001] <- NULL
}
sol3 <- function(x){
length(x) <- 10000
}
library(rbenchmark)
benchmark(sol1(tmp),sol2(tmp),sol3(tmp),replications = 5000)
       test replications elapsed relative user.self sys.self user.child sys.child
1 sol1(tmp)         5000   2.045 1.394952     1.327    0.727          0         0
2 sol2(tmp)         5000   2.849 1.943383     1.804    1.075          0         0
3 sol3(tmp)         5000   1.466 1.000000     0.937    0.548          0         0
But the differences aren't huge, unless you're doing this a lot on very long lists, I suppose.
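The length-assignment approach can be wrapped in a small function that also covers the zero-length case mentioned in the question. A minimal sketch; pad_list is a hypothetical name, and the function deliberately refuses to shorten a list.

```r
# Hypothetical helper: pad a list with trailing NULLs up to length n
pad_list <- function(x, n) {
  stopifnot(is.list(x), length(x) <= n)
  length(x) <- n  # new elements of a list are filled with NULL
  x
}

padded <- pad_list(list("a", 1:3), 5)
length(padded)               # 5
length(pad_list(list(), 3))  # 3, also works for an empty list
```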
I'm sure there are shorter ways, but I would be inclined to do:
l <- as.list(1:10)
N <- 15
l <- c(l, as.list(rep(NA, N - length(l) )))
Hi: I'm not sure if you were talking about an actual list but, if you were, the code below will work. It works because, when you assign to an element of a vector (which is what a list is) beyond its current end, R extends the vector.
length <- 10
temp <- list("a","b")
print(temp)
temp[length + 1] <- NULL  # assigning NULL at index length + 1 pads the list to `length` elements
print(temp)

converting a matrix to a list

Suppose I have a matrix foo as follows:
foo <- cbind(c(1,2,3), c(15,16,17))
> foo
[,1] [,2]
[1,] 1 15
[2,] 2 16
[3,] 3 17
I'd like to turn it into a list that looks like
[[1]]
[1] 1 15
[[2]]
[1] 2 16
[[3]]
[1] 3 17
You can do it as follows:
lapply(apply(foo, 1, function(x) list(c(x[1], x[2]))), function(y) unlist(y))
I'm interested in an alternative method that isn't as complicated. Note, if you just do apply(foo, 1, function(x) list(c(x[1], x[2]))), it returns a list within a list, which I'm hoping to avoid.
Here's a cleaner solution:
as.list(data.frame(t(foo)))
That takes advantage of the fact that a data frame is really just a list of equal length vectors (while a matrix is really a vector that is displayed with columns and rows...you can see this by calling foo[5], for instance).
You could also do this, although it isn't much of an improvement:
lapply(1:nrow(foo), function(i) foo[i,])
library(plyr)
alply(foo, 1)
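One more base R possibility, offered as a sketch rather than a recommendation: because a matrix is stored as a single vector, it can be split by row index with split() and row(). The unname() call only removes the "1", "2", "3" names that split() adds.

```r
foo <- cbind(c(1, 2, 3), c(15, 16, 17))

# row(foo) gives the row index of every cell; split groups the cells by it
rows <- unname(split(foo, row(foo)))
rows
# [[1]]
# [1]  1 15
# [[2]]
# [1]  2 16
# [[3]]
# [1]  3 17
```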