R: How to group and aggregate list elements using regex?

I want to aggregate (sum up) the following product list by groups (see below):
prods <- list("101.2000"=data.frame(1,2,3),
"102.2000"=data.frame(4,5,6),
"103.2000"=data.frame(7,8,9),
"104.2000"=data.frame(1,2,3),
"105.2000"=data.frame(4,5,6),
"106.2000"=data.frame(7,8,9),
"101.2001"=data.frame(1,2,3),
"102.2001"=data.frame(4,5,6),
"103.2001"=data.frame(7,8,9),
"104.2001"=data.frame(1,2,3),
"105.2001"=data.frame(4,5,6),
"106.2001"=data.frame(7,8,9))
test= list("100.2000"=data.frame(2,3,5),
"100.2001"=data.frame(4,5,6))
names <- c("A", "B", "C")
prods <- lapply(prods, function (x) {colnames(x) <- names; return(x)})
Each element of the product list (prods) is named by combining the product number and the year (e.g. 101.2000: 101 = product number, 2000 = year). The groups used for the aggregation contain only product numbers.
group1 <- c(101, 106)
group2 <- c(102, 104)
group3 <- c(105, 103)
My expected result shows the aggregated product groups by year:
$group1.2000
A B C
1 8 10 12
$group2.2000
A B C
1 5 7 9
$group3.2000
A B C
1 11 13 15
$group1.2001
A B C
1 8 10 12
$group2.2001
A B C
1 5 7 9
$group3.2001
A B C
1 11 13 15
So far, I tried this way: First I decomposed the names of prods into product numbers:
prodnames <- names(prods)
prodnames_sub <- gsub("\\..*.","", prodnames)
And then I tried to aggregate using lapply:
lapply(prods, function(x) aggregate( ... , FUN = sum))
However, I couldn't work out how to use the extracted product numbers in the aggregation function. Any ideas? Thanks

Here are two approaches. No packages are used in either one.
1) Using lists. Create a two-column data.frame S from the groups, whose columns are the products (values column) and the associated groups (ind column). Then create By, the list to split by. In the code that produces By, sub("\\..*", "", names(prods)) extracts the product number and match is then used to find the associated group; sub(".*\\.", "", names(prods)) extracts the year. Next, perform the split and lapply over it to run the summations. The two components of By (group and year) can be reversed to change the order of the output, if desired.
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
By <- list(group = S$ind[match(sub("\\..*", "", names(prods)), S$values)],
year = sub(".*\\.", "", names(prods)))
lapply(split(prods, By), function(x) colSums(do.call(rbind, x)))
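As a quick check of the two regular expressions used to build By, here is a minimal sketch on a couple of element names. The vector nms and the two-group lookup are made up for illustration and are not part of the answer:

```r
# Illustrative names in the same "product.year" form as names(prods)
nms <- c("101.2000", "106.2001")

# Drop the first dot and everything after it: leaves the product number
prod <- sub("\\..*", "", nms)

# Drop everything up to and including the last dot: leaves the year
year <- sub(".*\\.", "", nms)

# Look up each product's group, in the same way the answer uses match()
S <- stack(list(group1 = c(101, 106), group2 = c(102, 104)))
grp <- as.character(S$ind[match(prod, S$values)])

print(prod)
print(year)
print(grp)
```

Both products here belong to group1, so split would place them in group1.2000 and group1.2001 respectively.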
2) Using data.frames. Convert the groups and prods each to a data frame, merge them, perform an aggregate and split the result back into a list. The output is the same as requested except for order. (Reverse the two right-hand variables in the aggregate formula to get the order shown in the question, but that will also reverse the two parts of each component name in the output list.)
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
DF0 <- do.call(rbind, prods)
DF <- cbind(do.call(rbind, strsplit(rownames(DF0), ".", fixed = TRUE)), DF0)
M <- merge(DF, S, all.x = TRUE, by = 1)
Ag <- aggregate(cbind(A, B, C) ~ ind + `2`, M, sum)
lapply(split(Ag, paste(Ag[[1]], Ag[[2]], sep = ".")), "[", 3:5)
giving:
$group1.2000
A B C
1 8 10 12
$group1.2001
A B C
4 8 10 12
$group2.2000
A B C
2 5 7 9
$group2.2001
A B C
5 5 7 9
$group3.2000
A B C
3 11 13 15
$group3.2001
A B C
6 11 13 15

Related

How to replicate column-names, split them at delimiter '/', into multiple column-names, in R?

I have this matrix (it's big in size) "mymat". I need to replicate the columns that have "/" in their column name matching at "/" and make a "resmatrix". How can I get this done in R?
mymat
a b IID:WE:G12D/V GH:SQ:p.R172W/G c
1 3 4 2 4
22 4 2 2 4
2 3 2 2 4
resmatrix
a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
1 3 4 4 2 2 4
22 4 2 2 2 2 4
2 3 2 2 2 2 4

Find out which columns have the "/" and replicate them, then rename. To calculate the new names, just split on / and replace the last letter for the second name.
# which columns have '/' in them?
# (names() assumes mymat is a data frame; for a matrix use colnames() instead)
which.slash <- grep('/', names(mymat), value=TRUE)
new.names <- unlist(lapply(strsplit(which.slash, '/'),
function (bits) {
# bits[1] is e.g. IID:WE:G12D and bits[2] is the V
# take bits[1] and replace the last letter for the second colname
c(bits[1], sub('.$', bits[2], bits[1]))
}))
# make resmat by copying the appropriate columns
resmat <- cbind(mymat, mymat[, which.slash])
# order the columns to make sure the names replace properly
resmat <- resmat[, order(names(resmat))]
# put the new names in
names(resmat)[grep('/', names(resmat))] <- sort(new.names)
resmat looks like this
# a b c GH:SQ:p.R172G GH:SQ:p.R172W IID:WE:G12D IID:WE:G12V
# 1 1 3 4 2 2 4 4
# 2 22 4 4 2 2 2 2
# 3 2 3 4 2 2 2 2
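The renaming step can be tried on a single column name in isolation. This is only a sketch; it assumes, as in the question, that the part after the slash is a single letter that replaces the last character of the first name:

```r
# One of the column names from the question
nm <- "IID:WE:G12D/V"

# Split at the slash: bits[1] is the full first name, bits[2] the alternative letter
bits <- strsplit(nm, "/", fixed = TRUE)[[1]]

# Second name: replace the last character of bits[1] with bits[2]
new_names <- c(bits[1], sub(".$", bits[2], bits[1]))
print(new_names)
```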

You could use grep to get the index of column names with / ('nm1'), replicate the column names in 'nm1' by using sub/scan to create 'nm2'. Then, cbind the columns that are not 'nm1', with the replicated columns ('nm1'), change the column names with 'nm2', and if needed order the columns.
#get the column index with grep
nm1 <- grepl('/', names(df1))
#used regex to rearrange the substrings in the nm1 column names
#removed the `/` and use `scan` to split at the space delimiter
nm2 <- scan(text=gsub('([^/]+)(.)/(.*)', '\\1\\2 \\1\\3',
names(df1)[nm1]), what='', quiet=TRUE)
#cbind the columns that are not in nm1, with the replicate nm1 columns
df2 <- cbind(df1[!nm1], setNames(df1[rep(which(nm1), each= 2)], nm2))
#create another index to find the starting position of nm1 columns
nm3 <- names(df1)[1:(which(nm1)[1L]-1)]
#we concatenate the nm3, nm2, and the rest of the columns to match
#the expected output order
df2N <- df2[c(nm3, nm2, setdiff(names(df1)[!nm1], nm3))]
df2N
# a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
#1 1 3 4 4 2 2 4
#2 22 4 2 2 2 2 4
#3 2 3 2 2 2 2 4
data
df1 <- structure(list(a = c(1L, 22L, 2L), b = c(3L, 4L, 3L),
`IID:WE:G12D/V` = c(4L,
2L, 2L), `GH:SQ:p.R172W/G` = c(2L, 2L, 2L), c = c(4L, 4L, 4L)),
.Names = c("a", "b", "IID:WE:G12D/V", "GH:SQ:p.R172W/G", "c"),
class = "data.frame", row.names = c(NA, -3L))
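To see what the gsub/scan step produces on its own, here is a small sketch using the two slash-containing names from the data above. In the pattern, the first group captures everything up to the last character before the slash, the second group captures that last character, and the third captures what follows the slash:

```r
nms <- c("IID:WE:G12D/V", "GH:SQ:p.R172W/G")

# Rebuild each name as "<stem><old letter> <stem><new letter>"
expanded <- gsub("([^/]+)(.)/(.*)", "\\1\\2 \\1\\3", nms)

# scan() splits on the space, giving one element per new column name
nm2 <- scan(text = expanded, what = "", quiet = TRUE)
print(nm2)
```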

Which pattern occurs the most in a matrix - R (UPDATE)

UPDATE 2
I've added some code (with an explanation) that I wrote myself at the end of this question. It is a suboptimal solution (both in coding efficiency and in the resulting output), but it more or less manages to select items that adhere to the constraints. If you have any ideas on how to improve it (again, in both efficiency and output), please let me know.
1. Updated Post
Please see below for the initial question and sample code. Thanks to alexis_laz's answer, the problem was solved for a small number of items. However, when the number of items becomes too large, the combn function in R can no longer compute the combinations and fails with the error invalid 'ncol' value (too large or NA). Since my dataset does have a lot of items, I was wondering whether replacing some of his code (shown below) with C++ would solve this and, if so, what code I should use. Thanks!
This is the code as provided by alexis_laz:
ff = function(x, No_items, No_persons)
{
do.call(rbind,
lapply(No_items:ncol(x),
function(n) {
col_combs = combn(seq_len(ncol(x)), n, simplify = F)
persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
}))
}
2. Initial Post
Currently I'm working on a set of data coming from adaptive measurement, which means that not all persons have made all of the same items. For my analysis however I need a dataset that contains only items that have been made by all persons (or a subset of these persons).
I have a matrix object in R with rows = persons (100000), and columns = items(220), and a 1 in a cell if the person has made the item and a 0 if the person has not made the item.
How can I use R to determine which combination of at least 15 items is made by the largest number of persons?
Hopefully the question is clear (if not please ask me for more details and I will gladly provide those).
Tnx in advance.
Joost
Edit:
Below is a sample matrix with the items (A:E) as columns and persons (1:5) as rows.
mat <- matrix(c(1,1,1,0,0,1,1,0,1,1,1,1,1,0,1,0,1,1,0,0,1,1,1,1,0),5,5,byrow=T)
colnames(mat) <- c("A","B","C","D","E")
rownames(mat) <- 1:5
> mat
A B C D E
"1" 1 1 1 0 0
"2" 1 1 0 1 1
"3" 1 1 1 0 1
"4" 0 1 1 0 0
"5" 1 1 1 1 0
mat[1,1] = 1 means that person 1 has given a response to item 1.
Now (in this example) I'm interested in finding out which set of at least 3 items is made by at least 3 people. So here I can just go through all possible combinations of 3, 4 and 5 items to check how many people have a 1 in the matrix for each item in a combination.
This will result in me choosing the item combination A, B and C, since it is the only combination of items that has been made by 3 people (namely persons 1, 3 and 5).
Now for my real dataset I want to do this but then for a combination of at least 10 items that a group of at least 75 people all responded to. And since I have a lot of data preferably not by hand as in the example data.
I'm thus looking for a function/code in R that lets me specify the minimum number of items and persons, and then gives me all combinations of items and persons that meet these constraints or exceed them.
Thus for the example matrix it would be something like;
f <- function(data,no.items,no.persons){
#code
}
> f(mat,3,3)
no.item no.pers items persons
1 3 3 A, B, C 1, 3, 5
Or in case of at least 2 items that are made by at least 3 persons;
> f(mat,2,3)
no.item no.pers items persons
1 2 4 A, B 1, 2, 3, 5
2 2 3 A, C 1, 3, 5
3 2 4 B, C 1, 3, 4, 5
4 3 3 A, B, C 1, 3, 5
Hopefully this clears up what my question actually is about. Tnx for the quick replies that I already received!
3. Written Code
Below is the code I've written today. It takes each item once as a starting point and then looks for the item that has been answered most often by the people who also responded to the start item. It then takes these two items and looks for a third item, and repeats this until the number of people who responded to all selected items drops below the given limit. One drawback of the code is that it takes some time to run (it goes up somewhat exponentially as the number of items grows). The second drawback is that it still does not evaluate all possible combinations of items. The start item and the subsequently chosen item may have many respondents in common, but if the chosen item has little overlap with the other (not yet chosen) items, the sample might shrink very fast. If instead an item had been chosen with somewhat fewer persons in common with the start item, but with a lot of overlap with other items, the final collection of selected items might be much bigger than the one produced by the code below. So again, suggestions and improvements in both directions are welcome!
set.seed(512)
mat <- matrix(rbinom(1000000, 1, .6), 10000, 100)
colnames(mat) <- 1:100
fff <- function(data,persons,items){
xx <- list()
for(j in 1:ncol(data)){
d <- matrix(c(j,length(which(data[,j]==1))),1,2)
colnames(d) <- c("item","n")
t = persons+1
a <- j
while(t >= persons){
b <- numeric(0)
for(i in 1:ncol(data)){
z <- c(a,i)
if(i %in% a){
b[i] = 0
} else {
b[i] <- length(which(rowSums(data[,z])==length(z)))
}
}
c <- c(which.max(b),max(b))
d <- rbind(d,c)
a <- c(a,c[1])
t <- max(b)
}
print(j)
xx[[j]] = d
}
x <- y <- z <- numeric(0)
zz <- matrix(c(0,0,rep(NA,ncol(data))),length(xx),ncol(data)+2,byrow=T)
colnames(zz) <- c("n.pers", "n.item", rep("I",ncol(data)))
for(i in 1:length(xx)){
zz[i,1] <- xx[[i]][nrow(xx[[i]])-1,2]
zz[i,2] <- length(unname(xx[[i]][1:nrow(xx[[i]])-1,1]))
zz[i,3:(zz[i,2]+2)] <- unname(xx[[i]][1:nrow(xx[[i]])-1,1])
}
zz <- zz[,colSums(is.na(zz))<nrow(zz)]
zz <- zz[which((rowSums(zz,na.rm=T)/rowMeans(zz,na.rm=T))-2>=items),]
zz <- as.data.frame(zz)
return(zz)
}
zz <- fff(mat,110,8)
> head(zz)
n.pers n.item I I I I I I I I I I
1 156 9 1 41 13 80 58 15 91 12 39 NA
2 160 9 2 27 59 13 81 16 15 6 92 NA
3 158 9 3 59 83 32 25 80 14 41 16 NA
4 160 9 4 24 27 71 32 10 63 42 51 NA
5 114 10 5 59 66 27 47 13 44 63 30 52
6 158 9 6 13 56 61 12 59 8 45 81 NA
#col 1 = number of persons in sample
#col 2 = number of items in sample
#col 3:12 = which items create this sample (NA if n.item is less than 10)
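The greedy idea described above can also be written more compactly for a single start item. The following is a simplified sketch, not the code above: the function name greedy_items and its arguments are hypothetical, it stops before adding an item that would push the number of persons below the limit, and it is shown on the small 5 x 5 example matrix from the initial post.

```r
# Sample matrix from the initial post: rows = persons, columns = items
mat <- matrix(c(1,1,1,0,0,
                1,1,0,1,1,
                1,1,1,0,1,
                0,1,1,0,0,
                1,1,1,1,0), 5, 5, byrow = TRUE)
colnames(mat) <- c("A", "B", "C", "D", "E")
rownames(mat) <- 1:5

# Hypothetical helper: grow an item set greedily from one start column
greedy_items <- function(x, start, min_persons) {
  chosen <- start
  keep <- x[, start] == 1                # persons who answered the start item
  repeat {
    candidates <- setdiff(seq_len(ncol(x)), chosen)
    if (length(candidates) == 0) break
    # For each remaining item, count the kept persons who also answered it
    counts <- colSums(x[keep, candidates, drop = FALSE])
    best <- which.max(counts)
    if (counts[best] < min_persons) break
    chosen <- c(chosen, candidates[best])
    keep <- keep & x[, candidates[best]] == 1
  }
  list(items = colnames(x)[chosen], persons = rownames(x)[keep])
}

res <- greedy_items(mat, start = 1, min_persons = 3)
print(res)
```

Using colSums on the rows that are still kept avoids recomputing rowSums over the whole matrix for every candidate item, which may help with run time, but the search remains greedy and can miss better item sets for the reason given above.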

To follow up on my comment, something like:
set.seed(1618)
mat <- matrix(rbinom(1000, 1, .6), 100, 10)
colnames(mat) <- sample(LETTERS, 10)
rownames(mat) <- sprintf('person%s', 1:100)
mat1 <- mat[rowSums(mat) > 5, ]
head(mat1)
# A S X D R E Z K P C
# person1 1 1 1 0 1 1 1 1 1 1
# person3 1 0 1 1 0 1 0 0 1 1
# person4 1 0 1 1 1 1 1 0 1 1
# person5 1 1 1 1 1 0 1 1 0 0
# person6 1 1 1 1 0 1 0 1 1 0
# person7 0 1 1 1 1 1 1 1 0 0
table(rowSums(mat1))
# 6 7 8 9
# 24 23 21 5
tab <- table(sapply(1:nrow(mat1), function(x)
paste(names(mat1[x, ][mat1[x, ] == 1]), collapse = ',')))
data.frame(tab[tab > 1])
# tab.tab...1.
# A,S,X,D,R,E,P,C 2
# A,S,X,D,R,E,Z,P,C 2
# A,S,X,R,E,Z,K,C 3
# A,S,X,R,E,Z,P,C 2
# A,S,X,Z,K,P,C 2

Here is another idea that matches your output:
ff = function(x, No_items, No_persons)
{
do.call(rbind,
lapply(No_items:ncol(x),
function(n) {
col_combs = combn(seq_len(ncol(x)), n, simplify = F)
persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
}))
}
ff(mat, 3, 3)
# persons items
#1 1, 3, 5 A, B, C
ff(mat, 2, 3)
# persons items
#1 1, 2, 3, 5 A, B
#2 1, 3, 5 A, C
#3 1, 3, 4, 5 B, C
#4 1, 3, 5 A, B, C
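The core test inside ff can be looked at for one item set at a time, and choose() gives a feel for how many sets there are to test. This is a rough sketch on the sample matrix; the figures 220 and 15 are the item count and minimum set size mentioned in the question:

```r
mat <- matrix(c(1,1,1,0,0,
                1,1,0,1,1,
                1,1,1,0,1,
                0,1,1,0,0,
                1,1,1,1,0), 5, 5, byrow = TRUE)
colnames(mat) <- c("A", "B", "C", "D", "E")
rownames(mat) <- 1:5

# A person answered every item in the set when their row sum equals the set size
items <- c("A", "B", "C")
answered_all <- rowSums(mat[, items, drop = FALSE]) == length(items)
persons <- rownames(mat)[answered_all]
print(persons)

# Number of distinct item sets of exactly 15 out of 220 items
n_sets <- choose(220, 15)
print(n_sets > .Machine$integer.max)
```

Because ff enumerates every set size from No_items up to ncol(x), the total number of sets is larger still, which suggests that exhaustive enumeration is unlikely to be feasible at that scale regardless of the language used.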

R separating out number and units from a column in a dataframe

I have a dataframe which contains a column that has numbers as well as variable units:
num <- c(1:5)
val <- c("5%","10K", "100.2mv","1.4g","1.007kbars")
df <- data.frame(num,val)
df
How can I create two new columns from df$val, one that contains just the number and one the units?
Thank you for your help.

Here's a solution using stringr:
library(stringr)
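# extract the first number (digits with an optional decimal part)
df$extr_nums <- str_extract(df$val, "\\d+\\.?\\d*")
# remove that same number; what is left is the unit
df$extr_units <- str_replace(df$val, "\\d+\\.?\\d*", "")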
df
num val extr_nums extr_units
1 1 5% 5 %
2 2 10K 10 K
3 3 100.2mv 100.2 mv
4 4 1.4g 1.4 g
5 5 1.007kbars 1.007 kbars
The regexp is translated as: "at least 1 digit, followed by optional dot, followed by optional digits".
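The same split can be sketched in base R without stringr. This assumes, as in the sample data, that every value contains a number; regmatches silently drops elements with no match, so the lengths would differ otherwise:

```r
val <- c("5%", "10K", "100.2mv", "1.4g", "1.007kbars")

# Same idea as the stringr pattern, written with a bracket expression
pattern <- "[0-9]+\\.?[0-9]*"

# First match in each string is the number; removing it leaves the unit
nums  <- as.numeric(regmatches(val, regexpr(pattern, val)))
units <- sub(pattern, "", val)

print(data.frame(val, nums, units))
```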

Create new column in dataframe based on partial string matching other column

I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
The dataframe is as follows:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
In all other cases I want KIND to be Other
I looked for similar examples on Stack Overflow but could not find any. I also looked in R for Dummies at switch, grep, apply and regular expressions to try to match only part of the GLDESC column and then fill the KIND column with the kind of account, but I was unable to make it work.

Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = T), "Materials",
ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
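Applied to the data from the question, the same nested ifelse might look like the sketch below. It uses case-sensitive fixed matching and tests Payroll first, following the order of the rules in the question; that choice is an assumption rather than part of the answer:

```r
DF <- data.frame(
  GL = c(515100, 515900, 532300, 539991, 551000, 551100, 551300),
  GLDESC = c("Payroll-Indir Salary Labor",
             "Payroll-Indir Compensated Absences",
             "Bulk Gas",
             "Area Charge In",
             "Repairs & Maint-Spare Parts",
             "Supplies-Operating",
             "Consumables"),
  stringsAsFactors = FALSE
)

# First matching rule wins; anything unmatched falls through to "Other"
DF$KIND <- ifelse(grepl("Payroll", DF$GLDESC, fixed = TRUE), "Payroll",
           ifelse(grepl("Gas", DF$GLDESC, fixed = TRUE), "Materials", "Other"))
print(DF)
```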
EDIT 10/3/2016 (..after receiving more attention than expected)
A possible solution for dealing with more patterns is to iterate over all patterns and, whenever there is a match, progressively reduce the number of comparisons:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
stopifnot(length(patterns) == length(replacements))
ans = rep_len(as.character(fill), length(x))
empty = seq_along(x)
for(i in seq_along(patterns)) {
greps = grepl(patterns[[i]], x[empty], ...)
ans[empty[greps]] = replacements[[i]]
empty = empty[!greps]
}
return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"

I personally like matching by index. You can loop grep over your new labels, in order to get the indices of your partial matches, then use this with a lookup table to simply reassign the values.
If you wanna create new labels, use a named vector.
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c(
"gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"
), sample(letters, 10),
sep = " "
))
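# lapply always returns a list; sapply would simplify to a matrix (and break stack)
# if every pattern happened to have the same number of matches
lu <- stack(lapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))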
DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"
DF
#> GL GLDESC KIND
#> 1 6 x gas f Material
#> 2 3 t payroll12 q Payroll
#> 3 5 a GaSer h Material
#> 4 4 s asdf x Other
#> 5 1 m qweaa y Other
#> 6 10 y PayROll-12 r Payroll
#> 7 7 g asdfg a Other
#> 8 2 k GAS--2 i Material
#> 9 9 e fghfgh j Other
#> 10 8 l qweee p Other
Created on 2021-11-13 by the reprex package (v2.0.1)

How to properly manipulate a string column in a data frame in R?

I have a data.frame with a string column that contains periods e.g "a.b.c.X". I want to split out the string by periods and retain the third segment e.g. "c" in the example given. Here is what I'm doing.
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a.b.a.X 1
2 a.b.b.X 2
3 a.b.c.X 3
And what I want is
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a 1
2 b 2
3 c 3
I'm attempting to use within, but I'm getting strange results. The value in the first row in the first column is being repeated.
> get = function(x) { unlist(strsplit(x, "\\."))[3] }
> within(df, v <- get(as.character(v)))
v b
1 a 1
2 a 2
3 a 3
What is the best practice for doing this? What am I doing wrong?
Update:
Here is the solution I used from @agstudy's answer:
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> get = function(x) gsub(".*?[.].*?[.](.*?)[.].*", '\\1', x)
> within(df, v <- get(v))
v b
1 a 1
2 b 2
3 c 3

Using some regular expression you can do :
gsub(".*?[.].*?[.](.*?)[.].*", '\\1', df$v)
[1] "a" "b" "c"
Or more concise:
gsub("(.*?[.]){2}(.*?)[.].*", '\\2', df$v)

The problem is not with within but with your get function. It returns a single character ("a") which gets recycled when added to your data.frame. Your code should look like this:
get.third <- function(x) sapply(strsplit(x, "\\."), `[[`, 3)
within(df, v <- get.third(as.character(v)))
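The difference between the two versions is easiest to see on the split list itself. A minimal sketch, using the three strings from the question:

```r
v <- c("a.b.a.X", "a.b.b.X", "a.b.c.X")

# strsplit returns a list with one character vector per input string
parts <- strsplit(v, ".", fixed = TRUE)

# unlist flattens all 12 pieces into one vector, so [3] is a single value
flat_third <- unlist(parts)[3]

# Indexing inside each list element gives one value per input string
each_third <- sapply(parts, `[[`, 3)

print(flat_third)
print(each_third)
```

When flat_third is assigned to a column, its single value is recycled down all rows, which is the repeated "a" seen in the question.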

Here is one possible solution:
df[, "v"] <- do.call(rbind, strsplit(as.character(df[, "v"]), "\\."))[, 3]
## > df
## v b
## 1 a 1
## 2 b 2
## 3 c 3

The answer to "what am I doing wrong" is that the bit of code that you thought was extracting the third element of each split string was actually putting all the elements of all your strings in a single vector, and then returning the third element of that:
get = function(x) {
splits = strsplit(x, "\\.")
print("All the elements: ")
print(unlist(splits))
print("The third element:")
print(unlist(splits)[3])
# What you actually wanted:
third_chars = sapply(splits, function (x) x[3])
}
within(df, v2 <- get(as.character(v)))
