R regex find ranges in strings - regex

I have a bunch of email subject lines and I'm trying to extract whether a range of values are present. This is how I'm trying to do it but am not getting the results I'd like:
library(stringi)
df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')
df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')
id string1 pctOff10_20
1 1 15% off NA
2 2 25% off NA
3 3 35% off NA
4 4 45% off NA
5 5 55% off NA
I'd like something like this:
id string1 pctOff10_20
1 1 15% off 1
2 2 25% off 0
3 3 35% off 0
4 4 45% off 0
5 5 55% off 0

Here is the way to go,
df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')
Explanation:
^ the beginning of the string
( group and capture to \1:
1 '1'
\d digits (0-9)
| OR
20 '20'
) end of \1
% '%'

1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternately we could make use of betweeen from data.table (or a number of other packages). This extracts the matches to the pattern, applies the function to it and returns the result simplifying it into a vector (rather than a list):
library(gsubfn)
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)
transform(df1, pctOff10_20 =
strapply(
X = string1,
pattern = "\\d+",
FUN = ~ btwn(x, 10, 20),
simplify = TRUE
)
)
2) A base solution using the same btwn function defined above is:
transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))

Related

Grabbing columns with special characters and upper case letters

I have a data frame and I'm trying to loop through the data frame to identify those columns which contain a special character or which are all capital letters.
I have tried a few things but nothing where I'm apple to catch the column names within the loop.
data = data.frame(one=c(1,3,5,1,3,5,1,3,5,1,3,5), two=c(1,3,5,1,3,5,1,3,5,1,3,5),
thr=c("A","B","D","E","F","G","H","I","J","H","I","J"),
fou=c("A","B","D","A","B","D","A","B","D","A","B","D"),
fiv=c(1,3,5,1,3,5,1,3,5,1,3,5),
six=c("A","B","D","E","F","G","H","I","J","H","I","J"),
sev=c("A","B","D","A","B","D","A","B","D","A","B","D"),
eig=c("A","B","D","A","B","D","A","B","D","A","B","D"),
nin=c(1.24,3.52,5.33,1.44,3.11,5.33,1.55,3.66,5.33,1.32,3.54,5.77),
ten=c(1:12),
ele=rep(1,12),
twe=c(1,2,1,2,1,2,1,2,1,2,1,2),
thir=c("THiS","THAT34","T(&*(", "!!!","#$#","$Q%J","who","THIS","this","this","this","this"),
stringsAsFactors = FALSE)
data
colls <- c()
spec=c("$","%","&")
for( col in names(data) ) {
if( length(strings[stringr::str_detect(data[,col], spec)]) >= 1 ){
print("HORRAY")
colls <- c(collls, col)
}
else print ("NOOOOOOOOOO")
}
for( col in names(data) ) {
if( any(data[,col]) %in% spec ){
print("HORRAY")
colls <- c(collls, col)
}
else print ("NOOOOOOOOOO")
}
Can anyone shed light on a good way to tackle this problem.
EDIT:
The end goal is to have a vector with a name of column names which meet that criteria. Sorry for my poor SO question, but hopefully this will help with what I'm trying to do
I would use grep() to search for the pattern you are interested in. See here.
[:upper:] Matches any upper case letters.
Combining it with anchors (^,$) and match one or more times (+) gives ^[[:upper:]]+$ and should only match entries completely in capitals.
The following would match the special characters in your toy data set (but is not guaranteed to match all special characters in your real data set i.e form feeds, carriage returns)
[:punct:] #Matches punctuation - ! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~.
Note that rather than use [:punct:] you could define your special characters manually.
We can try the resultant code on the first row of your data set:
#Using grepl() rather than grep() so that we return a list of logical values.
grepl(x= data[1,], pattern = "^[[:upper:]]+$|[[:punct:]]")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
This gives us our expected response except for column nine which has the value 1.24. Here the decimal point is being recognised as punctuation and is being flagged as a match.
We can add a "negative lookahead assertion" - (?!\\.) - to remove any periods from consideration, before they are even tested for being punctuation characters. Note we use \ to escape the period.
grepl(x= data[1,], perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
This returns a better response - it now no longer matches decimal places. NOTE: This might not be what you want as this pattern also won't match any fullstops in character fields. You would need to refine the pattern further.
Rather than use a 'for loop' to reiterate this code across every row in your dataframe I would use vectorization instead which is 'more R like'.
To do this we must convert our script into a function which we will call with apply()
myFunction <- function(x){
matches <- grepl(x= x, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
#Given a set of logical vectors 'matches', is at least one of the values true? using any()
return(any(matches))
}
apply(X = data, 1, myFunction)
The 1 above instructs apply() to reiterate across rows rather than columns.
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In your example data set all rows have an entry containing a special character or a string of all capital letters. This is unsurprising as many columns in your example data set are a list of single capital letters.
If you are just interested in which values in column thirteen fit the stated criteria you can use:
matches <- grepl(x= data$thir, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
matches
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
To subset your dataframe on matching rows:
data[matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
3 5 5 D D 5 D D D 5.33 3 1 1 T(&*(
4 1 1 E A 1 E A A 1.44 4 1 2 !!!
5 3 3 F B 3 F B B 3.11 5 1 1 #$#
6 5 5 G D 5 G D D 5.33 6 1 2 $Q%J
8 3 3 I B 3 I B B 3.66 8 1 2 THIS
To subset your dataframe on non-matching rows:
data[!matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
1 1 1 A A 1 A A A 1.24 1 1 1 THiS
2 3 3 B B 3 B B B 3.52 2 1 2 THAT34
7 1 1 H A 1 H A A 1.55 7 1 1 who
9 5 5 J D 5 J D D 5.33 9 1 1 this
10 1 1 H A 1 H A A 1.32 10 1 2 this
11 3 3 I B 3 I B B 3.54 11 1 1 this
12 5 5 J D 5 J D D 5.77 12 1 2 this
Note that the regular expression used doesn't match THAT34 as it isn't composed wholly of capitalised letters, having the number 34 at the end.
EDIT:
To get a list of column names identifying columns that fulfill the criteria in your edit use myFunction described above with:
colnames(data)[apply(X = data, 2, myFunction)]
"thr" "fou" "six" "sev" "eig" "thir"
The number in apply() changes from 1 to 2 to reiterate across columns rather than rows. We pass the output from apply(), a list of logical matches (TRUE or FALSE), to colnames(data) - this returns the matching column names via subsetting.
I would collapse the data into strings (one string per row)
strings = apply(data, 1, paste, collapse = "")
contains_only_caps = strings == toupper(strings)
strings[contains_only_caps]
# [1] "33BB3BBB3.52 212THAT34" "55DD5DDD5.33 311T(&*(" "11EA1EAA1.44 412!!!" "33FB3FBB3.11 511#$#"
# [5] "55GD5GDD5.33 612$Q%J" "33IB3IBB3.66 812THIS"
# escaping special characters
spec=c("\\$","%","\\&")
contains_spec = stringr::str_detect(strings, pattern = paste(spec, collapse = "|"))
strings[contains_spec]
# [1] "55DD5DDD5.33 311T(&*(" "33FB3FBB3.11 511#$#" "55GD5GDD5.33 612$Q%J"
You could also use which on contains_spec or contains_only_caps to get the corresponding row numbers for the original data frame. I think that using strings rather than row-wise data frame elements will by much faster - as long as you want to search the whole strings, not certain columns for certain conditions.

R separating out number and units from a column in a dataframe

I have a dataframe which contains a column that has numbers as well as variable units:
num <- c(1:5)
val <- c("5%","10K", "100.2mv","1.4g","1.007kbars")
df <- data.frame(num,val)
df
How can I create two new columns from df$val, one that contains just the number and one the units?
Thank you for your help.
Here's a solution using stringr:
library(stringr)
df$extr_nums <- str_extract(val, "\\d+\\.?\\d*")
df$extr_units <- str_replace(val, nums, "")
df
num val extr_nums extr_units
1 1 5% 5 %
2 2 10K 10 K
3 3 100.2mv 100.2 mv
4 4 1.4g 1.4 g
5 5 1.007kbars 1.007 kbars
The regexp is translated as: "at least 1 digit, followed by optional dot, followed by optional digits".

How do I count the number of words in a text (string)?

I have this string vector (for example):
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
To count the number of words in this vector I used this (as given here Count the number of words in a string in R?, which is a possible duplicate but with another issue)
No_words <- sapply(gregexpr("\\W+", str), length) + 1
but it returns
6 2 2 2
String has only 1 element in last two places (i.e. "tusla" and "laq")
so it should return
6 2 1 1
How do I get around this problem?
You can try
sapply(gregexpr("\\S+", x), length)
## [1] 6 2 1 1
Or as suggested in comments you can try
sapply(strsplit(x, "\\s+"), length)
## [1] 6 2 1 1
Use the stringi package and stri_count:
require(stringi)
str <- c(
"this is a string current trey",
"nospaces",
"multiple spaces",
" leadingspaces",
"trailingspaces ",
" leading and trailing ",
"just one space each")
> stri_count(str,regex="\\S+")
[1] 6 1 2 1 1 3 4
Use the wc-function from the qdap package.
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
library("qdap")
wc(str)
That returns:
wc(str)
[1] 6 2 1 1

How to properly manipulate a string column in a data frame in R?

I have a data.frame with a string column that contains periods e.g "a.b.c.X". I want to split out the string by periods and retain the third segment e.g. "c" in the example given. Here is what I'm doing.
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a.b.a.X 1
2 a.b.b.X 2
3 a.b.c.X 3
And what I want is
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a 1
2 b 2
3 c 3
I'm attempting to use within, but I'm getting strange results. The value in the first row in the first column is being repeated.
> get = function(x) { unlist(strsplit(x, "\\."))[3] }
> within(df, v <- get(as.character(v)))
v b
1 a 1
2 a 2
3 a 3
What is the best practice for doing this? What am I doing wrong?
Update:
Here is the solution I used from #agstudy's answer:
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> get = function(x) gsub(".*?[.].*?[.](.*?)[.].*", '\\1', x)
> within(df, v <- get(v))
v b
1 a 1
2 b 2
3 c 3
Using some regular expression you can do :
gsub(".*?[.].*?[.](.*?)[.].*", '\\1', df$v)
[1] "a" "b" "c"
Or more concise:
gsub("(.*?[.]){2}(.*?)[.].*", '\\2', v)
The problem is not with within but with your get function. It returns a single character ("a") which gets recycled when added to your data.frame. Your code should look like this:
get.third <- function(x) sapply(strsplit(x, "\\."), `[[`, 3)
within(df, v <- get.third(as.character(v)))
Here is one possible solution:
df[, "v"] <- do.call(rbind, strsplit(as.character(df[, "v"]), "\\."))[, 3]
## > df
## v b
## 1 a 1
## 2 b 2
## 3 c 3
The answer to "what am I doing wrong" is that the bit of code that you thought was extracting the third element of each split string was actually putting all the elements of all your strings in a single vector, and then returning the third element of that:
get = function(x) {
splits = strsplit(x, "\\.")
print("All the elements: ")
print(unlist(splits))
print("The third element:")
print(unlist(splits)[3])
# What you actually wanted:
third_chars = sapply(splits, function (x) x[3])
}
within(df, v2 <- get(as.character(v)))

How to calculate the number of occurrence of a given character in each row of a column of strings?

I have a data.frame in which certain variables contain a text string. I wish to count the number of occurrences of a given character in each individual string.
Example:
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
I wish to create a new column for q.data with the number of occurence of "a" in string (ie. c(2,1,0)).
The only convoluted approach I have managed is:
string.counter<-function(strings, pattern){
counts<-NULL
for(i in 1:length(strings)){
counts[i]<-length(attr(gregexpr(pattern,strings[i])[[1]], "match.length")[attr(gregexpr(pattern,strings[i])[[1]], "match.length")>0])
}
return(counts)
}
string.counter(strings=q.data$string, pattern="a")
number string number.of.a
1 1 greatgreat 2
2 2 magic 1
3 3 not 0
The stringr package provides the str_count function which seems to do what you're interested in
# Load your example data
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = F)
library(stringr)
# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
If you don't want to leave base R, here's a fairly succinct and expressive possibility:
x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0
nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))
[1] 2 1 0
Notice that I coerce the factor variable to character, before passing to nchar. The regex functions appear to do that internally.
Here's benchmark results (with a scaled up size of the test to 3000 rows)
q.data<-q.data[rep(1:NROW(q.data), 1000),]
str(q.data)
'data.frame': 3000 obs. of 3 variables:
$ number : int 1 2 3 1 2 3 1 2 3 1 ...
$ string : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
$ number.of.a: int 2 1 0 2 1 0 2 1 0 2 ...
benchmark( Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
Tim = {resT <- sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {resW <- nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))},
Josh = {x <- sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)}, replications=100)
#-----------------------
test replications elapsed relative user.self sys.self user.child sys.child
1 Dason 100 4.173 9.959427 2.985 1.204 0 0
3 DWin 100 0.419 1.000000 0.417 0.003 0 0
4 Josh 100 18.635 44.474940 17.883 0.827 0 0
2 Tim 100 3.705 8.842482 3.646 0.072 0 0
Another good option, using charToRaw:
sum(charToRaw("abc.d.aa") == charToRaw('.'))
The stringi package provides the functions stri_count and stri_count_fixed which are very fast.
stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0
benchmark
Compared to the fastest approach from #42-'s answer and to the equivalent function from the stringr package for a vector with 30.000 elements.
library(microbenchmark)
benchmark <- microbenchmark(
stringi = stringi::stri_count(test.data$string, fixed = "a"),
baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
stringr = str_count(test.data$string, "a")
)
autoplot(benchmark)
data
q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]
A variation of https://stackoverflow.com/a/12430764/589165 is
> nchar(gsub("[^a]", "", q.data$string))
[1] 2 1 0
I'm sure someone can do better, but this works:
sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter)
})
greatgreat magic not
2 1 0
or in a function:
countLetter <- function(charvec, letter){
sapply(charvec, function(x, letter){
sum(unlist(strsplit(x, split = "")) == letter)
}, letter = letter)
}
countLetter(as.character(q.data$string),"a")
You could just use string division
require(roperators)
my_strings <- c('apple', banana', 'pear', 'melon')
my_strings %s/% 'a'
Which will give you 1, 3, 1, 0. You can also use string division with regular expressions and whole words.
The question below has been moved here, but it seems this page doesn't directly answer to Farah El's question.
How to find number 1s in 101 in R
So, I'll write an answer here, just in case.
library(magrittr)
n %>% # n is a number you'd like to inspect
as.character() %>%
str_count(pattern = "1")
https://stackoverflow.com/users/8931457/farah-el
Yet another base R option could be:
lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))
[1] 2 1 0
The next expression does the job and also works for symbols, not only letters.
The expression works as follows:
1: it uses lapply on the columns of the dataframe q.data to iterate over the rows of the column 2 ("lapply(q.data[,2],"),
2: it apply to each row of the column 2 a function "function(x){sum('a' == strsplit(as.character(x), '')[[1]])}".
The function takes each row value of column 2 (x), convert to character (in case it is a factor for example), and it does the split of the string on every character ("strsplit(as.character(x), '')"). As a result we have a a vector with each character of the string value for each row of the column 2.
3: Each vector value of the vector is compared with the desired character to be counted, in this case "a" (" 'a' == "). This operation will return a vector of True and False values "c(True,False,True,....)", being True when the value in the vector matches the desired character to be counted.
4: The total times the character 'a' appears in the row is calculated as the sum of all the 'True' values in the vector "sum(....)".
5: Then it is applied the "unlist" function to unpack the result of the "lapply" function and assign it to a new column in the dataframe ("q.data$number.of.a<-unlist(....")
q.data$number.of.a<-unlist(lapply(q.data[,2],function(x){sum('a' == strsplit(as.character(x), '')[[1]])}))
>q.data
# number string number.of.a
#1 greatgreat 2
#2 magic 1
#3 not 0
Another base R answer, not so good as those by #IRTFM and #Finn (or as those using stringi/stringr), but better than the others:
sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
q.data<-q.data[rep(1:NROW(q.data), 3000),]
library(rbenchmark)
library(stringr)
library(stringi)
benchmark( Dason = {str_count(q.data$string, "a") },
Tim = {sapply(q.data$string, function(x, letter = "a"){sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {nchar(q.data$string) -nchar( gsub("a", "", q.data$string, fixed=TRUE))},
Markus = {stringi::stri_count(q.data$string, fixed = "a")},
Finn={nchar(gsub("[^a]", "", q.data$string))},
tmmfmnk={lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))},
Josh1 = {sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)},
Josh2 = {lengths(regmatches(q.data$string, gregexpr("g",q.data$string )))},
Iago = {sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))},
replications =100, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
4 Markus 100 0.076 1.000 0.076 0.000 0 0
3 DWin 100 0.277 3.645 0.277 0.000 0 0
1 Dason 100 0.290 3.816 0.291 0.000 0 0
5 Finn 100 1.057 13.908 1.057 0.000 0 0
9 Iago 100 3.214 42.289 3.215 0.000 0 0
2 Tim 100 6.000 78.947 6.002 0.000 0 0
6 tmmfmnk 100 6.345 83.487 5.760 0.003 0 0
8 Josh2 100 12.542 165.026 12.545 0.000 0 0
7 Josh1 100 13.288 174.842 13.268 0.028 0 0
The easiest and the cleanest way IMHO is :
q.data$number.of.a <- lengths(gregexpr('a', q.data$string))
# number string number.of.a`
#1 1 greatgreat 2`
#2 2 magic 1`
#3 3 not 0`
s <- "aababacababaaathhhhhslsls jsjsjjsaa ghhaalll"
p <- "a"
s2 <- gsub(p,"",s)
numOcc <- nchar(s) - nchar(s2)
May not be the efficient one but solve my purpose.