Extract all numbers from a single string in R - regex

Let's imagine you have a string:
strLine <- "The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)"
Is there a function that strips out the numbers into an array/vector producing the following required solution:
result <- c(0, 3000, -500, 0, 2.25, -1200)?
i.e.
result[3] = -500
Notice that the numbers are presented in accounting form, so negative numbers appear between (). Also, you can assume that only numbers appear to the right of the first occurrence of a number. I am not that good with regexp, so I would appreciate help if one is required. Also, I don't want to assume the string is always the same, so I am looking to strip out all words (and any special characters) before the location of the first number.

library(stringr)
x <- str_extract_all(strLine,"\\(?[0-9,.]+\\)?")[[1]]
x
[1] "0" "3,000" "(500)" "0" "2.25" "(1,200)"
Change the parens to negatives:
x <- gsub("\\((.+)\\)","-\\1",x)
x
[1] "0" "3,000" "-500" "0" "2.25" "-1,200"
And then as.numeric() or taRifx::destring to finish up (the next version of destring will support negatives by default so the keep option won't be necessary):
library(taRifx)
destring( x, keep="0-9.-")
[1] 0 3000 -500 0 2.25 -1200
OR:
as.numeric(gsub(",","",x))
[1] 0 3000 -500 0 2.25 -1200

Here's the base R way, for the sake of completeness...
x <- unlist(regmatches(strLine, gregexpr('\\(?[0-9,.]+', strLine)))
x <- as.numeric(gsub('\\(', '-', gsub(',', '', x)))
[1] 0.00 3000.00 -500.00 0.00 2.25 -1200.00
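The steps used in the answers above (extract the tokens, drop the thousands separators, turn parentheses into a minus sign) can also be wrapped into one base R helper. This is only a sketch: the name parse_accounting is made up here, and it assumes that every number starts with a digit and that "," is only ever a thousands separator.

```r
# Hypothetical helper (not from any package): parse accounting-style numbers.
parse_accounting <- function(s) {
  # A token is an optional "(", a digit, more digits/commas/dots, an optional ")"
  tokens <- regmatches(s, gregexpr("\\(?[0-9][0-9,.]*\\)?", s))[[1]]
  negative <- startsWith(tokens, "(")              # parentheses mark negatives
  values <- as.numeric(gsub("[(),]", "", tokens))  # strip "(", ")" and ","
  values[negative] <- -values[negative]
  values
}

strLine <- "The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)"
result <- parse_accounting(strLine)
result[3]  # -500
```

Requiring a leading digit keeps a stray "." or "," in the surrounding text from being picked up as a number.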

What worked perfectly for me when working with strings in a data frame (one string per row, all in the same column) was the following:
library(taRifx)
DataFrame$Numbers<-as.character(destring(DataFrame$Strings, keep="0-9.-"))
The results are in a new column from the same data frame.

Since this came up in another question, this is an uncrutched stringi solution (vs the stringr crutch):
as.numeric(
stringi::stri_replace_first_fixed(
stringi::stri_replace_all_regex(
unlist(stringi::stri_match_all_regex(
"The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)",
"\\(?[0-9,.]+\\)?"
)), "\\)$|,", ""
),
"(", "-"
)
)

Related

Extracting position of pattern in a string using ifelse in R

I have a set of strings x for example:
[1] "0000000000000000000000000000000000000Y" "9000000000D00000000000000000000Y"
[3] "0000000000000D00000000000000000000X" "000000000000000000D00000000000000000000Y"
[5] "000000000000000000D00000000000000000000Y" "000000000000000000D00000000000000000000Y"
[7] "000000000000000000000000D0000000011011D1X"
I want to extract the last position of a particular character like 1. I am running this code:
ifelse(grepl("1",x),rev(gregexpr("1",x)[[1]])[1],50)
But this is returning -1 for all elements. How do I correct this?
We can use stri_locate_last from stringi. If there are no matches, it will return NA.
library(stringi)
r1 <- stri_locate_last(v1, fixed = "1")[,1]
r1
#[1] NA NA NA NA NA NA 40
nchar(v1)
#[1] 38 32 35 40 40 40 41
If we need to replace the NA values with number of characters
ifelse(is.na(r1), nchar(v1), r1)
data
v1 <- c("0000000000000000000000000000000000000Y",
"9000000000D00000000000000000000Y",
"0000000000000D00000000000000000000X",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000000000D0000000011011D1X")
In base R, the following returns the position of the last matched "1".
# Make some toy data
toydata <- c("001", "007", "00101111Y", "000AAAYY")
# Find last position
last_pos <- sapply(gregexpr("1", toydata), function(m) m[length(m)])
print(last_pos)
#[1] 3 -1 8 -1
It returns -1 whenever the pattern is not matched.
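One likely reason the ifelse() call in the question misbehaves is that gregexpr("1", x)[[1]] only holds the match data for the first string, so that single value is recycled for every element that contains a "1". A minimal base R sketch that works element by element and swaps -1 for a fallback value follows; the function name last_position and its default argument are illustrative, not part of any package.

```r
# Illustrative helper: last position of a literal character in each string.
last_position <- function(x, ch, default = NA_integer_) {
  matches <- gregexpr(ch, x, fixed = TRUE)  # one match vector per string
  pos <- vapply(matches, function(m) m[length(m)], integer(1))
  pos[pos == -1L] <- default                # -1 means "no match"
  pos
}

toydata <- c("001", "007", "00101111Y", "000AAAYY")
last_position(toydata, "1")                # 3 NA 8 NA
last_position(toydata, "1", default = 50)  # 3 50 8 50
```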

How to count frequencies of certain character in a string?

If I have a run of characters such as "AABBABBBAAAABBAAAABBBAABBBBABABB".
Is there a way to get R to count the runs of A and state how many of each length ?
So I'd like to know how many instances of 3 A's in a row, how many instances of a single A, how many instances of 2 A's in a row, etc.
table(rle(strsplit("AABBABBBAAAABBAAAABBBAABBBBABABB","")[[1]]))
gives
       values
lengths A B
      1 3 1
      2 2 3
      3 0 2
      4 2 1
which (reading down the A column) means there were 3 A runs of length 1, 2 A runs of length 2 and 2 A runs of length 4.
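If only the A column is of interest, the rle() result can be filtered before tabulating. A small sketch, with variable names chosen purely for illustration:

```r
s <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
runs <- rle(strsplit(s, "")[[1]])           # run lengths plus the letter of each run
a_runs <- runs$lengths[runs$values == "A"]  # keep only the runs of "A"
a_counts <- table(a_runs)                   # how many runs of each length
a_counts
```

The names of a_counts are the run lengths and the values are how often each length occurs.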
Try
v1 <- scan(text=gsub('[^A]+', ',', str1), sep=',', what='', quiet=TRUE)
table(v1[nzchar(v1)])
# A AA AAAA
# 3 2 2
Or
library(stringi)
table(stri_extract_all_regex(str1, '[A]+')[[1]])
# A AA AAAA
# 3 2 2
Benchmarks
set.seed(42)
x1 <- stri_rand_strings(1,1e7, pattern='[A-G]')
system.time(table(stri_split_regex(x1, "[^A]+", omit_empty = TRUE)))
# user system elapsed
# 0.829 0.002 0.831
system.time(table(stri_extract_all_regex(x1, '[A]+')[[1]]))
# user system elapsed
# 0.790 0.002 0.791
system.time(table(rle(strsplit(x1,"")[[1]])) )
# user system elapsed
# 30.230 1.243 31.523
system.time(table(strsplit(x1, "[^A]+")))
# user system elapsed
# 4.253 0.006 4.258
system.time(table(attr(gregexpr("A+",x1)[[1]], 'match.length')))
# user system elapsed
# 1.994 0.004 1.999
library(microbenchmark)
microbenchmark(david=table(stri_split_regex(x1, "[^A]+", omit_empty = TRUE)),
akrun= table(stri_extract_all_regex(x1, '[A]+')[[1]]),
david2 = table(strsplit(x1, "[^A]+")),
glen = table(rle(strsplit(x1,"")[[1]])),
plannapus = table(attr(gregexpr("A+",x1)[[1]], 'match.length')),
times=20L, unit='relative')
#Unit: relative
# expr min lq mean median uq max neval cld
# david 1.0000000 1.000000 1.000000 1.000000 1.0000000 1.000000 20 a
# akrun 0.7908313 1.023388 1.054670 1.336510 0.9903384 1.004711 20 a
# david2 4.9325256 5.461389 5.613516 6.207990 5.6647301 5.374668 20 c
# glen 14.9064240 15.975846 16.672339 20.570874 15.8710402 15.465140 20 d
#plannapus 2.5077719 3.123360 2.836338 3.557242 2.5689176 2.452964 20 b
data
str1 <- 'AABBABBBAAAABBAAAABBBAABBBBABABB'
Here's an additional way using strsplit:
x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
table(strsplit(x, "[^A]+"))
# A AA AAAA
# 3 2 2
Or similarly with the stringi package
library(stringi)
table(stri_split_regex(x, "[^A]+", omit_empty = TRUE))
For completeness, here is another way, using the regmatches and gregexpr combo, to extract regexes:
x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
table(regmatches(x,gregexpr("A+",x))[[1]])
# A AA AAAA
# 3 2 2
Or in fact, since gregexpr keeps the length of the captured substring as attribute, one could even do, directly:
table(attr(gregexpr("A+",x)[[1]],'match.length'))
# 1 2 4
# 3 2 2

R regex find ranges in strings

I have a bunch of email subject lines and I'm trying to extract whether a range of values are present. This is how I'm trying to do it but am not getting the results I'd like:
library(stringi)
df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')
df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')
  id string1 pctOff10_20
1  1 15% off          NA
2  2 25% off          NA
3  3 35% off          NA
4  4 45% off          NA
5  5 55% off          NA
I'd like something like this:
  id string1 pctOff10_20
1  1 15% off           1
2  2 25% off           0
3  3 35% off           0
4  4 45% off           0
5  5 55% off           0
Here is the way to go,
df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')
Explanation:
^        the beginning of the string
(        group and capture to \1:
  1      '1'
  \d     digits (0-9)
 |       OR
  20     '20'
)        end of \1
%        '%'
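It may help to see why the pattern in the question finds nothing: a bracket expression matches a single character, so '[10-20]' is just the set "1", the range "0" to "2", and "0" again, not the numbers 10 to 20. The sketch below checks both patterns with base R's grepl; the vector name subjects is only for illustration.

```r
subjects <- c("15% off", "25% off", "35% off", "45% off", "55% off")

# "[10-20]%" means: one character out of 0, 1 or 2, followed by "%"
grepl("[10-20]%", subjects)   # all FALSE, since each "%" is preceded by "5"

# Alternation over two-digit numbers: 10-19 via "1[0-9]", or exactly 20
in_range <- as.integer(grepl("^(1[0-9]|20)%", subjects))
in_range                      # 1 0 0 0 0
```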
1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternatively, we could make use of between from data.table (or a number of other packages). This extracts the matches to the pattern, applies the function to them, and returns the result simplified into a vector (rather than a list):
library(gsubfn)
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)
transform(df1, pctOff10_20 =
strapply(
X = string1,
pattern = "\\d+",
FUN = ~ btwn(x, 10, 20),
simplify = TRUE
)
)
2) A base solution using the same btwn function defined above is:
transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))

How to calculate the number of occurrence of a given character in each row of a column of strings?

I have a data.frame in which certain variables contain a text string. I wish to count the number of occurrences of a given character in each individual string.
Example:
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
I wish to create a new column for q.data with the number of occurrences of "a" in string (i.e. c(2,1,0)).
The only convoluted approach I have managed is:
string.counter<-function(strings, pattern){
counts<-NULL
for(i in 1:length(strings)){
counts[i]<-length(attr(gregexpr(pattern,strings[i])[[1]], "match.length")[attr(gregexpr(pattern,strings[i])[[1]], "match.length")>0])
}
return(counts)
}
string.counter(strings=q.data$string, pattern="a")
number string number.of.a
1 1 greatgreat 2
2 2 magic 1
3 3 not 0
The stringr package provides the str_count function which seems to do what you're interested in
# Load your example data
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
library(stringr)
# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
If you don't want to leave base R, here's a fairly succinct and expressive possibility:
x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0
nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))
[1] 2 1 0
Notice that I coerce the factor variable to character, before passing to nchar. The regex functions appear to do that internally.
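The length-difference trick treats the pattern as a regular expression, so counting a character such as "." would need escaping. A small sketch with fixed = TRUE avoids that; count_fixed is a hypothetical helper name, and dividing by nchar(pattern) is an added assumption that lets it count non-overlapping multi-character substrings as well.

```r
# Hypothetical helper: count literal (non-regex) occurrences in each string.
count_fixed <- function(x, pattern) {
  x <- as.character(x)  # factors must be converted before nchar()
  removed <- gsub(pattern, "", x, fixed = TRUE)
  (nchar(x) - nchar(removed)) / nchar(pattern)
}

count_fixed(c("greatgreat", "magic", "not"), "a")  # 2 1 0
count_fixed("abc.d.aa", ".")                       # 2
```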
Here are benchmark results (with the test scaled up to 3000 rows):
q.data<-q.data[rep(1:NROW(q.data), 1000),]
str(q.data)
'data.frame': 3000 obs. of 3 variables:
$ number : int 1 2 3 1 2 3 1 2 3 1 ...
$ string : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
$ number.of.a: int 2 1 0 2 1 0 2 1 0 2 ...
benchmark( Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
Tim = {resT <- sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {resW <- nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))},
Josh = {x <- sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)}, replications=100)
#-----------------------
test replications elapsed relative user.self sys.self user.child sys.child
1 Dason 100 4.173 9.959427 2.985 1.204 0 0
3 DWin 100 0.419 1.000000 0.417 0.003 0 0
4 Josh 100 18.635 44.474940 17.883 0.827 0 0
2 Tim 100 3.705 8.842482 3.646 0.072 0 0
Another good option, using charToRaw:
sum(charToRaw("abc.d.aa") == charToRaw('.'))
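charToRaw() only accepts a single string, so applying this idea to a whole column needs a loop. A hedged sketch follows, assuming the character to count is a single byte such as an ASCII letter (multi-byte UTF-8 characters would need a different comparison); count_raw is a made-up name.

```r
# Illustrative helper: count a single-byte character via raw comparison.
count_raw <- function(x, ch) {
  target <- charToRaw(ch)
  stopifnot(length(target) == 1L)  # only single-byte characters are handled
  vapply(as.character(x), function(s) sum(charToRaw(s) == target),
         integer(1), USE.NAMES = FALSE)
}

count_raw(c("greatgreat", "magic", "not"), "a")  # 2 1 0
```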
The stringi package provides the functions stri_count and stri_count_fixed which are very fast.
stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0
benchmark
Compared to the fastest approach from @42-'s answer and to the equivalent function from the stringr package, for a vector with 30,000 elements.
library(microbenchmark)
benchmark <- microbenchmark(
stringi = stringi::stri_count(test.data$string, fixed = "a"),
baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
stringr = str_count(test.data$string, "a")
)
autoplot(benchmark)
data
q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]
A variation of https://stackoverflow.com/a/12430764/589165 is
> nchar(gsub("[^a]", "", q.data$string))
[1] 2 1 0
I'm sure someone can do better, but this works:
sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter)
})
greatgreat magic not
2 1 0
or in a function:
countLetter <- function(charvec, letter){
sapply(charvec, function(x, letter){
sum(unlist(strsplit(x, split = "")) == letter)
}, letter = letter)
}
countLetter(as.character(q.data$string),"a")
You could just use string division
require(roperators)
my_strings <- c('apple', 'banana', 'pear', 'melon')
my_strings %s/% 'a'
Which will give you 1, 3, 1, 0. You can also use string division with regular expressions and whole words.
The question below has been moved here, but it seems this page doesn't directly answer Farah El's question.
How to find number 1s in 101 in R
So, I'll write an answer here, just in case.
library(magrittr)
library(stringr) # provides str_count()
n %>% # n is a number you'd like to inspect
as.character() %>%
str_count(pattern = "1")
https://stackoverflow.com/users/8931457/farah-el
Yet another base R option could be:
lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))
[1] 2 1 0
The following expression does the job and also works for symbols, not only letters.
It works as follows:
1: It uses lapply on column 2 of the data frame q.data to iterate over the rows of that column ("lapply(q.data[,2],").
2: It applies the function "function(x){sum('a' == strsplit(as.character(x), '')[[1]])}" to each row of column 2.
The function takes each row value of column 2 (x), converts it to character (in case it is a factor, for example), and splits the string into individual characters ("strsplit(as.character(x), '')"). The result is a vector containing each character of the string, for each row of column 2.
3: Each element of that vector is compared with the character to be counted, in this case "a" (" 'a' == "). This comparison returns a vector of TRUE and FALSE values "c(TRUE,FALSE,TRUE,....)", with TRUE wherever the element matches the character to be counted.
4: The number of times 'a' appears in the row is the sum of all the TRUE values in the vector ("sum(....)").
5: Finally, "unlist" unpacks the result of "lapply" so that it can be assigned to a new column in the data frame ("q.data$number.of.a<-unlist(....").
q.data$number.of.a<-unlist(lapply(q.data[,2],function(x){sum('a' == strsplit(as.character(x), '')[[1]])}))
q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
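The numbered steps can be traced on a single value. This is just an illustrative walk-through with made-up variable names, using a factor to show why the as.character() conversion is there:

```r
x <- factor("greatgreat")                    # a single value, stored as a factor
chars <- strsplit(as.character(x), "")[[1]]  # step 2: split into single characters
is_a  <- "a" == chars                        # step 3: logical vector, TRUE where "a"
n_a   <- sum(is_a)                           # step 4: each TRUE counts as 1
n_a                                          # 2
```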
Another base R answer, not as good as those by @IRTFM and @Finn (or as those using stringi/stringr), but better than the others:
sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
q.data<-q.data[rep(1:NROW(q.data), 3000),]
library(rbenchmark)
library(stringr)
library(stringi)
benchmark( Dason = {str_count(q.data$string, "a") },
Tim = {sapply(q.data$string, function(x, letter = "a"){sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {nchar(q.data$string) -nchar( gsub("a", "", q.data$string, fixed=TRUE))},
Markus = {stringi::stri_count(q.data$string, fixed = "a")},
Finn={nchar(gsub("[^a]", "", q.data$string))},
tmmfmnk={lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))},
Josh1 = {sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)},
Josh2 = {lengths(regmatches(q.data$string, gregexpr("g",q.data$string )))},
Iago = {sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))},
replications =100, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
4 Markus 100 0.076 1.000 0.076 0.000 0 0
3 DWin 100 0.277 3.645 0.277 0.000 0 0
1 Dason 100 0.290 3.816 0.291 0.000 0 0
5 Finn 100 1.057 13.908 1.057 0.000 0 0
9 Iago 100 3.214 42.289 3.215 0.000 0 0
2 Tim 100 6.000 78.947 6.002 0.000 0 0
6 tmmfmnk 100 6.345 83.487 5.760 0.003 0 0
8 Josh2 100 12.542 165.026 12.545 0.000 0 0
7 Josh1 100 13.288 174.842 13.268 0.028 0 0
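The easiest and cleanest way, IMHO, is the following. Note that lengths(gregexpr('a', q.data$string)) on its own would report 1 rather than 0 for "not", because gregexpr returns -1 when there is no match, so only the positive positions are counted here:
q.data$number.of.a <- sapply(gregexpr('a', q.data$string, fixed = TRUE), function(m) sum(m > 0))
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0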
s <- "aababacababaaathhhhhslsls jsjsjjsaa ghhaalll"
p <- "a"
s2 <- gsub(p,"",s)
numOcc <- nchar(s) - nchar(s2)
It may not be the most efficient approach, but it serves my purpose.

How to expand a list with NULLs up to some length?

Given a list whose length is <= N, what is the best / most efficient way to pad it with trailing NULLs so that it has length N?
This is something which is a one-liner in any decent language, but I don't have a clue how to do it (efficiently) in a few lines in R so that it works for every corner case (zero length list etc.).
Let's keep it really simple:
tst<-1:10 #whatever, to get a vector of length 10
tst<-tst[1:15]
Try this :
> l = list("a",1:3)
> N = 5
> l[N+1]=NULL
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
How about this ?
> l = list("a",1:3)
> length(l)=5
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
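For the corner cases mentioned in the question (such as a zero-length list), the length assignment can be wrapped in a small function. This is a sketch only; pad_list is a hypothetical name, and refusing lists that are already longer than n is an assumption about the desired behaviour.

```r
# Hypothetical helper: pad a list with trailing NULLs up to length n.
pad_list <- function(x, n) {
  stopifnot(is.list(x), length(x) <= n)
  length(x) <- n   # new trailing elements of a list are NULL
  x
}

empty  <- pad_list(list(), 3)           # zero-length input
padded <- pad_list(list("a", 1:3), 5)
str(padded)
```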
Directly editing the list's length appears to be the fastest as far as I can tell:
tmp <- vector("list",5000)
sol1 <- function(x){
x <- x[1:10000]
}
sol2 <- function(x){
x[10001] <- NULL
}
sol3 <- function(x){
length(x) <- 10000
}
library(rbenchmark)
benchmark(sol1(tmp),sol2(tmp),sol3(tmp),replications = 5000)
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1(tmp) 5000 2.045 1.394952 1.327 0.727 0 0
2 sol2(tmp) 5000 2.849 1.943383 1.804 1.075 0 0
3 sol3(tmp) 5000 1.466 1.000000 0.937 0.548 0 0
But the differences aren't huge, unless you're doing this a lot on very long lists, I suppose.
I'm sure there are shorter ways, but I would be inclined to do:
l <- as.list(1:10)
N <- 15
l <- c(l, as.list(rep(NA, N - length(l) )))
Hi: I'm not sure if you were talking about an actual list but, if you were, the code below will work. It works because, once you assign to an element of a vector (which a list is) that is not there, R expands the vector to that length.
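n <- 10  # target length (avoids masking the base function length())
temp <- list("a","b")
print(temp)
temp[n + 1] <- NULL  # the list is extended to n + 1 elements, then the last one is dropped, leaving n
print(temp)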