Reg expression subset work individually but when in function nothing happens

Good evening,
I have a list "a" which I successfully subset using regular expressions.
a=a[grep("Macy*|Nors*", a$Geography, perl=TRUE),]
a=a[grep("Levis*|Diesel*|Replay*", a$Brand.Name, perl=TRUE), ]
a=a[grep("Week*", a$Time, perl=TRUE), ]
I created the function clean below, but when I apply it to "a", nothing happens:
clean=function(x){
x=x[grep("Macy*|Nors*", x$Geography, perl=TRUE),]
x=x[grep("Levis*|Diesel*|Replay*", x$Brand.Name, perl=TRUE), ]
x=x[grep("Week*", x$Time, perl=TRUE), ]
return (x)
}
clean(a) just returns the original "a".
I tried printing each individual step, but literally nothing happens.
Thank you for your help

a = data.frame(Geography = c(paste('Macy',1:5, sep = " "),'john'),stringsAsFactors = FALSE)
a
Geography
1 Macy 1
2 Macy 2
3 Macy 3
4 Macy 4
5 Macy 5
6 john
clean = function(x){
x = x[grep("Macy*|Nors*", x[['Geography']], perl = TRUE),]
return(x)
}
clean(a)
[1] "Macy 1" "Macy 2" "Macy 3" "Macy 4" "Macy 5"
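One likely explanation, offered as an assumption because the question does not show how the result of clean was used: R functions work on a copy of their argument, so calling clean(a) leaves a itself unchanged unless the result is assigned. The sketch below uses made-up sample data with the question's column names and adds drop = FALSE so that a one-column data frame is not simplified to a vector, as happened in the output above.

```r
# Hypothetical sample data; only the column names come from the question.
a <- data.frame(
  Geography  = c("Macy 1", "Nors 2", "john"),
  Brand.Name = c("Levis", "Diesel", "Levis"),
  Time       = c("Week 1", "Month 1", "Week 3"),
  stringsAsFactors = FALSE
)

clean <- function(x) {
  # drop = FALSE keeps the result a data frame, even with a single column
  x <- x[grepl("Macy*|Nors*", x$Geography, perl = TRUE), , drop = FALSE]
  x <- x[grepl("Levis*|Diesel*|Replay*", x$Brand.Name, perl = TRUE), , drop = FALSE]
  x <- x[grepl("Week*", x$Time, perl = TRUE), , drop = FALSE]
  x
}

# The function returns a filtered copy; `a` keeps all of its rows
cleaned <- clean(a)
nrow(a)        # 3
nrow(cleaned)  # 1

# To overwrite the original, assign the result back: a <- clean(a)
```

Note that in a regular expression "Macy*" means "Mac" followed by zero or more "y", so it also matches "Mac"; a plain "Macy" is enough if the literal word is intended.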

Related

How do I count the number of words in a text (string)?

I have this string vector (for example):
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
To count the number of words in this vector I used this (as given here Count the number of words in a string in R?, which is a possible duplicate but with another issue)
No_words <- sapply(gregexpr("\\W+", str), length) + 1
but it returns
6 2 2 2
The last two strings ("tusla" and "laq") contain only one word each,
so it should return
6 2 1 1
How do I get around this problem?

You can try
sapply(gregexpr("\\S+", str), length)
## [1] 6 2 1 1
Or, as suggested in the comments, you can try
sapply(strsplit(str, "\\s+"), length)
## [1] 6 2 1 1
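Why the attempt in the question overcounts: when a string contains no match, gregexpr returns a single -1, so length() is still 1 and adding 1 gives 2. The sketch below illustrates this; words is an illustrative variable name, and regmatches is used here as one way to count only the real matches.

```r
words <- c("this is a string current trey", "feather rtttt", "tusla", "laq")

# "tusla" has no separator, so gregexpr reports one element, -1 ("no match")
m <- gregexpr("\\W+", words)
no_match_flag <- m[[3]][1]
no_match_flag                       # -1

# Counting separators and adding 1 therefore overcounts single words
sep_based <- sapply(m, length) + 1
sep_based                           # 6 2 2 2

# regmatches() keeps only the actual matches, so the word count is right
n_words <- lengths(regmatches(words, gregexpr("\\S+", words)))
n_words                             # 6 2 1 1
```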

Use the stringi package and stri_count:
require(stringi)
str <- c(
"this is a string current trey",
"nospaces",
"multiple spaces",
" leadingspaces",
"trailingspaces ",
" leading and trailing ",
"just one space each")
> stri_count(str,regex="\\S+")
[1] 6 1 2 1 1 3 4

Use the wc-function from the qdap package.
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
library("qdap")
wc(str)
That returns:
[1] 6 2 1 1

Create new column in dataframe based on partial string matching other column

I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
The dataframe is as follows:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
In all other cases I want KIND to be Other
I looked for similar examples on stackoverflow but could not find any, also looked in R for dummies on switch, grep, apply and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account but was unable to make it work.

Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials",
ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
EDIT 10/3/2016 (..after receiving more attention than expected)
A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is match, progressively reduce the amount of comparisons:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
stopifnot(length(patterns) == length(replacements))
ans = rep_len(as.character(fill), length(x))
empty = seq_along(x)
for(i in seq_along(patterns)) {
greps = grepl(patterns[[i]], x[empty], ...)
ans[empty[greps]] = replacements[[i]]
empty = empty[!greps]
}
return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
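For the data in the question itself, the same result can also be sketched with plain logical-index assignment. This is a variation, not the code from the answer above: the descriptions are retyped from the question, and fixed = TRUE is used because the keywords are literal strings.

```r
# Descriptions retyped from the question's table
GLDESC <- c("Payroll-Indir Salary Labor", "Payroll-Indir Compensated Absences",
            "Bulk Gas", "Area Charge In", "Repairs & Maint-Spare Parts",
            "Supplies-Operating", "Consumables")

# Start with the default, then overwrite the rows that match each keyword
KIND <- rep("Other", length(GLDESC))
KIND[grepl("Gas", GLDESC, fixed = TRUE)] <- "Materials"
KIND[grepl("Payroll", GLDESC, fixed = TRUE)] <- "Payroll"
KIND
```

If a description could match more than one keyword, the later assignment wins, so the order of the assignments sets the priority.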

I personally like matching by index. You can loop grep over your new labels to get the indices of your partial matches, then use these with a lookup table to reassign the values.
If you want to create new labels, use a named vector.
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c(
"gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"
), sample(letters, 10),
sep = " "
))
lu <- stack(sapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))
DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"
DF
#> GL GLDESC KIND
#> 1 6 x gas f Material
#> 2 3 t payroll12 q Payroll
#> 3 5 a GaSer h Material
#> 4 4 s asdf x Other
#> 5 1 m qweaa y Other
#> 6 10 y PayROll-12 r Payroll
#> 7 7 g asdfg a Other
#> 8 2 k GAS--2 i Material
#> 9 9 e fghfgh j Other
#> 10 8 l qweee p Other
Created on 2021-11-13 by the reprex package (v2.0.1)

How to calculate the number of occurrence of a given character in each row of a column of strings?

I have a data.frame in which certain variables contain a text string. I wish to count the number of occurrences of a given character in each individual string.
Example:
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
I wish to create a new column for q.data with the number of occurrences of "a" in string (i.e. c(2,1,0)).
The only convoluted approach I have managed is:
string.counter<-function(strings, pattern){
counts<-NULL
for(i in 1:length(strings)){
counts[i]<-length(attr(gregexpr(pattern,strings[i])[[1]], "match.length")[attr(gregexpr(pattern,strings[i])[[1]], "match.length")>0])
}
return(counts)
}
string.counter(strings=q.data$string, pattern="a")
number string number.of.a
1 1 greatgreat 2
2 2 magic 1
3 3 not 0

The stringr package provides the str_count function which seems to do what you're interested in
# Load your example data
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
library(stringr)
# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0

If you don't want to leave base R, here's a fairly succinct and expressive possibility:
x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0

nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))
[1] 2 1 0
Notice that I coerce the factor variable to character, before passing to nchar. The regex functions appear to do that internally.
Here are benchmark results (with the test scaled up to 3000 rows):
q.data<-q.data[rep(1:NROW(q.data), 1000),]
str(q.data)
'data.frame': 3000 obs. of 3 variables:
$ number : int 1 2 3 1 2 3 1 2 3 1 ...
$ string : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
$ number.of.a: int 2 1 0 2 1 0 2 1 0 2 ...
benchmark( Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
Tim = {resT <- sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {resW <- nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))},
Josh = {x <- sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)}, replications=100)
#-----------------------
test replications elapsed relative user.self sys.self user.child sys.child
1 Dason 100 4.173 9.959427 2.985 1.204 0 0
3 DWin 100 0.419 1.000000 0.417 0.003 0 0
4 Josh 100 18.635 44.474940 17.883 0.827 0 0
2 Tim 100 3.705 8.842482 3.646 0.072 0 0

Another good option, using charToRaw:
sum(charToRaw("abc.d.aa") == charToRaw('.'))

The stringi package provides the functions stri_count and stri_count_fixed which are very fast.
stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0
benchmark
Compared to the fastest approach from @42-'s answer and to the equivalent function from the stringr package, for a vector with 30,000 elements.
library(microbenchmark)
library(stringr) # for str_count
library(ggplot2) # for autoplot
benchmark <- microbenchmark(
stringi = stringi::stri_count(test.data$string, fixed = "a"),
baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
stringr = str_count(test.data$string, "a")
)
autoplot(benchmark)
data
q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]

A variation of https://stackoverflow.com/a/12430764/589165 is
> nchar(gsub("[^a]", "", q.data$string))
[1] 2 1 0

I'm sure someone can do better, but this works:
sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter)
})
greatgreat magic not
2 1 0
or in a function:
countLetter <- function(charvec, letter){
sapply(charvec, function(x, letter){
sum(unlist(strsplit(x, split = "")) == letter)
}, letter = letter)
}
countLetter(as.character(q.data$string),"a")

You could just use string division
require(roperators)
my_strings <- c('apple', 'banana', 'pear', 'melon')
my_strings %s/% 'a'
Which will give you 1, 3, 1, 0. You can also use string division with regular expressions and whole words.

The question below was redirected here, but this page does not seem to directly answer Farah El's question.
How to find number 1s in 101 in R
So I'll write an answer here, just in case.
library(magrittr)
library(stringr) # provides str_count
n %>% # n is a number you'd like to inspect
as.character() %>%
str_count(pattern = "1")
https://stackoverflow.com/users/8931457/farah-el

Yet another base R option could be:
lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))
[1] 2 1 0

The following expression does the job and also works for symbols, not only letters.
It works as follows:
1: lapply iterates over the values in column 2 of the data frame q.data ("lapply(q.data[,2],").
2: It applies the function "function(x){sum('a' == strsplit(as.character(x), '')[[1]])}" to each value of column 2.
The function takes each value of column 2 (x), converts it to character (in case it is a factor, for example), and splits the string into single characters ("strsplit(as.character(x), '')"). The result is a vector holding each character of the string.
3: Each element of that vector is compared with the character to be counted, in this case "a" (" 'a' == "). This returns a logical vector such as "c(TRUE,FALSE,TRUE,....)", which is TRUE wherever the element matches the character to be counted.
4: The number of times 'a' appears in the string is the sum of the TRUE values in that vector ("sum(....)").
5: Finally, "unlist" unpacks the result of "lapply" so it can be assigned to a new column in the data frame ("q.data$number.of.a<-unlist(....").
q.data$number.of.a<-unlist(lapply(q.data[,2],function(x){sum('a' == strsplit(as.character(x), '')[[1]])}))
>q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
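The steps above can be traced on a single string, as in the sketch below. The variable names (chars, hits, counts) are illustrative, and the vapply call at the end is a variation that returns an integer vector directly instead of going through unlist.

```r
x <- "greatgreat"

chars <- strsplit(as.character(x), "")[[1]]  # one element per character
hits  <- "a" == chars                        # TRUE where the character is "a"
single_count <- sum(hits)                    # each TRUE counts as 1
single_count                                 # 2

# The same idea applied to a whole vector of strings
strings <- c("greatgreat", "magic", "not")
counts <- vapply(strsplit(strings, ""), function(ch) sum(ch == "a"), integer(1))
counts                                       # 2 1 0
```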

Another base R answer, not as good as those by @IRTFM and @Finn (or as those using stringi/stringr), but better than the others:
sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
q.data<-q.data[rep(1:NROW(q.data), 3000),]
library(rbenchmark)
library(stringr)
library(stringi)
benchmark( Dason = {str_count(q.data$string, "a") },
Tim = {sapply(q.data$string, function(x, letter = "a"){sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {nchar(q.data$string) -nchar( gsub("a", "", q.data$string, fixed=TRUE))},
Markus = {stringi::stri_count(q.data$string, fixed = "a")},
Finn={nchar(gsub("[^a]", "", q.data$string))},
tmmfmnk={lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))},
Josh1 = {sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)},
Josh2 = {lengths(regmatches(q.data$string, gregexpr("g",q.data$string )))},
Iago = {sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))},
replications =100, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
4 Markus 100 0.076 1.000 0.076 0.000 0 0
3 DWin 100 0.277 3.645 0.277 0.000 0 0
1 Dason 100 0.290 3.816 0.291 0.000 0 0
5 Finn 100 1.057 13.908 1.057 0.000 0 0
9 Iago 100 3.214 42.289 3.215 0.000 0 0
2 Tim 100 6.000 78.947 6.002 0.000 0 0
6 tmmfmnk 100 6.345 83.487 5.760 0.003 0 0
8 Josh2 100 12.542 165.026 12.545 0.000 0 0
7 Josh1 100 13.288 174.842 13.268 0.028 0 0

The easiest and cleanest way, IMHO, is:
# regmatches() drops the -1 that gregexpr() returns when there is no match
q.data$number.of.a <- lengths(regmatches(q.data$string, gregexpr('a', q.data$string)))
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0

s <- "aababacababaaathhhhhslsls jsjsjjsaa ghhaalll"
p <- "a"
s2 <- gsub(p,"",s)
numOcc <- nchar(s) - nchar(s2)
This may not be the most efficient approach, but it serves my purpose.

Subset elements in a list based on a logical condition

How can I subset a list based on a condition (TRUE, FALSE) in another list? Please, see my example below:
l <- list(a=c(1,2,3), b=c(4,5,6,5), c=c(3,4,5,6))
l
$a
[1] 1 2 3
$b
[1] 4 5 6 5
$c
[1] 3 4 5 6
cond <- lapply(l, function(x) length(x) > 3)
cond
$a
[1] FALSE
$b
[1] TRUE
$c
[1] TRUE
> l[cond]
Error in l[cond] : invalid subscript type 'list'

This is what the Filter function was made for:
Filter(function(x) length(x) > 3, l)
$b
[1] 4 5 6 5
$c
[1] 3 4 5 6

Another way is to use sapply instead of lapply.
cond <- sapply(l, function(x) length(x) > 3)
l[cond]

[ is expecting a vector, so use unlist on cond:
l[unlist(cond)]
$b
[1] 4 5 6 5
$c
[1] 3 4 5 6
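The error in the question comes from the type of cond: lapply always returns a list, and [ cannot index with a list. A minimal sketch of the difference, using vapply as one way to get a logical vector directly (cond_list and cond_vec are illustrative names):

```r
l <- list(a = c(1, 2, 3), b = c(4, 5, 6, 5), c = c(3, 4, 5, 6))

# lapply() returns a list, which is not a valid subscript for `[`
cond_list <- lapply(l, function(x) length(x) > 3)

# vapply() returns a named logical vector and checks each result is one logical
cond_vec <- vapply(l, function(x) length(x) > 3, logical(1))

is.list(cond_list)    # TRUE
is.logical(cond_vec)  # TRUE

kept <- l[cond_vec]
names(kept)           # "b" "c"
```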

> l[as.logical(cond)]
$b
[1] 4 5 6 5
$c
[1] 3 4 5 6

I recently learned lengths(), which gets the length of each element of a list. This allows us to avoid making another list including logical values as the OP tried.
lengths(l)
#a b c
#3 4 4
Using this in a logical condition, we can subset list elements in l.
l[lengths(l) > 3]
$b
[1] 4 5 6 5
$c
[1] 3 4 5 6

I am very new to R, but since it is a functional language, I think the best solution building on the previous answers is something like:
filter <- function (inputList, selector) sapply(inputList, function (element) selector(element))
Assume you have a list like yours:
myList <- list(
a=c(1,2,3),
b=c(4,5,6,5),
c=c(3,4,5,6))
Then you can filter the elements like this:
selection <- myList[filter(myList, function (element) length(element) > 3)]
Of course, this also works for a list that contains only single values at the first level:
anotherList <- list(1, 2, 3, 4)
selection <- anotherList[filter(anotherList, function (element) element == 2)]
Or you can put it all together like this:
filter <- function (inputList, selector) inputList[sapply(inputList, function (element) selector(element))]

cond <- lapply(l, length) > 3
l[cond]

l <- list(a=c(1,2,3), b=c(4,5,6,5), c=c(3,4,5,6))
l[lengths(l) > 3]
$b
[1] 4 5 6 5
$c
[1] 3 4 5 6
If a condition on value is needed:
cond <- lapply(l, function(i) i > 3)
res <- Map(`[`, l, cond)
res
$a
numeric(0)
$b
[1] 4 5 6 5
$c
[1] 4 5 6

improve my code for collapsing a list of data.frames

Dear StackOverFlowers (flowers for short),
I have a list of data.frames (walk.sample) that I would like to collapse into a single (giant) data.frame. While collapsing, I would like to mark (by adding another column) which element of the list each row came from. This is what I've got so far.
This is the list that needs to be collapsed/stacked.
> walk.sample
[[1]]
walker x y
1073 3 228.8756 -726.9198
1086 3 226.7393 -722.5561
1081 3 219.8005 -728.3990
1089 3 225.2239 -727.7422
1032 3 233.1753 -731.5526
[[2]]
walker x y
1008 3 205.9104 -775.7488
1022 3 208.3638 -723.8616
1072 3 233.8807 -718.0974
1064 3 217.0028 -689.7917
1026 3 234.1824 -723.7423
[[3]]
[1] 3
[[4]]
walker x y
546 2 629.9041 831.0852
524 2 627.8698 873.3774
578 2 572.3312 838.7587
513 2 633.0598 871.7559
538 2 636.3088 836.6325
1079 3 206.3683 -729.6257
1095 3 239.9884 -748.2637
1005 3 197.2960 -780.4704
1045 3 245.1900 -694.3566
1026 3 234.1824 -723.7423
I have written a function that adds a column denoting which element each row came from and then appends the result to an existing data.frame.
collapseToDataFrame <- function(x) { # collapse list to a dataframe with a twist
walk.df <- data.frame()
for (i in 1:length(x)) {
n.rows <- nrow(x[[i]])
if (length(x[[i]])>1) {
temp.df <- cbind(x[[i]], rep(i, n.rows))
names(temp.df) <- c("walker", "x", "y", "session")
walk.df <- rbind(walk.df, temp.df)
} else {
cat("Empty list", "\n")
}
}
return(walk.df)
}
> collapseToDataFrame(walk.sample)
Empty list
Empty list
walker x y session
3 1 -604.5055 -123.18759 1
60 1 -562.0078 -61.24912 1
84 1 -594.4661 -57.20730 1
9 1 -604.2893 -110.09168 1
43 1 -632.2491 -54.52548 1
1028 3 240.3905 -724.67284 1
1040 3 232.5545 -681.61225 1
1073 3 228.8756 -726.91980 1
1091 3 209.0373 -740.96173 1
1036 3 248.7123 -694.47380 1
I'm curious whether this can be done more elegantly, with perhaps do.call() or some other more generic function?

I think this will work...
lengths <- sapply(walk.sample, function(x) if (is.null(nrow(x))) 0 else nrow(x))
cbind(do.call(rbind, walk.sample[lengths > 1]),
session = rep(1:length(lengths), ifelse(lengths > 1, lengths, 0)))

I'm not claiming this to be the most elegant approach, but I think it is working
library(plyr)
ldply(sapply(1:length(walk.sample), function(i)
if (length(walk.sample[[i]]) > 1)
cbind(walk.sample[[i]],session=rep(i,nrow(walk.sample[[i]])))
),rbind)
EDIT
After applying Marek's apt remarks
do.call(rbind,lapply(1:length(walk.sample), function(i)
if (length(walk.sample[[i]]) > 1)
cbind(walk.sample[[i]],session=i) ))
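A further sketch of the same idea, using Map to pair each data frame with its position in the list before stacking. The small walk.sample below is a made-up stand-in for the real data, and is_df and tagged are illustrative names.

```r
# Hypothetical stand-in for walk.sample: data frames plus one non-data-frame element
walk.sample <- list(
  data.frame(walker = 3, x = c(228.9, 226.7), y = c(-726.9, -722.6)),
  3,
  data.frame(walker = 2, x = c(629.9, 627.9, 572.3), y = c(831.1, 873.4, 838.8))
)

# Keep only the elements that really are data frames
is_df <- vapply(walk.sample, is.data.frame, logical(1))

# Tag each data frame with its position in the original list, then stack them
tagged <- Map(function(df, i) cbind(df, session = i),
              walk.sample[is_df], which(is_df))
walk.df <- do.call(rbind, tagged)
walk.df$session   # 1 1 3 3 3
```

Testing with is.data.frame rather than length(x) > 1 is a design choice: it also skips non-data-frame elements that happen to have more than one value.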
