This is a rather tricky question, and I'd appreciate any help.
What I'm trying to do is the following. I have a data frame in R containing every locality in a given state, scraped from Wikipedia. It looks something like this (top 10 rows). Let's call it NewHampshire.df:
Municipality County Population
1 Acworth Sullivan 891
2 Albany Carroll 735
3 Alexandria Grafton 1613
4 Allenstown Merrimack 4322
5 Alstead Cheshire 1937
6 Alton Belknap 5250
7 Amherst Hillsborough 11201
8 Andover Merrimack 2371
9 Antrim Hillsborough 2637
10 Ashland Grafton 2076
I've also created a variable called grep_term, which combines the values of Municipality and County into a single pattern that works as an or-statement, something like this:
Municipality County Population grep_term
1 Acworth Sullivan 891 "Acworth|Sullivan"
2 Albany Carroll 735 "Albany|Carroll"
and so on. Furthermore, I have another dataset, containing self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks a bit like this:
[1] "London" "Orleans village VT USA" "The World"
[4] "D M V Towson " "Playa del Sol Solidaridad" "Beautiful Downtown Burbank"
[7] NA "US" "Gaithersburg Md"
[10] NA "California " "Indy"
[13] "Florida" "exsnaveen com" "Houston TX"
I want to do two things:
1: Run grepl over every observation in location.df and save TRUE or FALSE in a new variable, depending on whether the self-disclosed location matches any entry in the first dataset.
2: Save the number of matches for each row of NewHampshire.df in a new variable. I.e., if there are 4 matches for Acworth in the Twitter location dataset, observation 1 in NewHampshire.df should have the value 4 in the newly created "matches" variable.
What I've done so far: I've solved task 1, as follows:
for(i in 1:234){
location.df$isRelevant <- sapply(location.df$location, function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same for loop?
Thanks in advance, any help would be greatly appreciated!
With regard to task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
sep = "|"),
collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
location isRelevant
1 Acworth TRUE
2 Hillsborough TRUE
3 California FALSE
4 Amherst TRUE
5 Grafton TRUE
6 Ashland TRUE
7 London FALSE
To get the number of matches in the location.df with the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
which gives:
> NewHampshire
Municipality County Population grep_term n.matches
1 Acworth Sullivan 891 Acworth|Sullivan 1
2 Albany Carroll 735 Albany|Carroll 0
3 Alexandria Grafton 1613 Alexandria|Grafton 1
4 Allenstown Merrimack 4322 Allenstown|Merrimack 0
5 Alstead Cheshire 1937 Alstead|Cheshire 0
6 Alton Belknap 5250 Alton|Belknap 0
7 Amherst Hillsborough 11201 Amherst|Hillsborough 2
8 Andover Merrimack 2371 Andover|Merrimack 0
9 Antrim Hillsborough 2637 Antrim|Hillsborough 1
10 Ashland Grafton 2076 Ashland|Grafton 2
I have this matrix "mymat" (it's large). For every column whose name contains "/", I need to replicate the column and split its name at the "/", producing a "resmatrix" like the one below. How can I get this done in R?
mymat
a b IID:WE:G12D/V GH:SQ:p.R172W/G c
1 3 4 2 4
22 4 2 2 4
2 3 2 2 4
resmatrix
a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
1 3 4 4 2 2 4
22 4 2 2 2 2 4
2 3 2 2 2 2 4
Find out which columns have the "/" and replicate them, then rename. To build the new names, split on / and, for the second name, replace the last letter of the first part with the letter after the slash.
# which columns have '/' in them?
# (names() assumes mymat is a data frame; use colnames() for a matrix)
which.slash <- grep('/', names(mymat), value = TRUE)
new.names <- unlist(lapply(strsplit(which.slash, '/'),
                           function (bits) {
                               # bits[1] is e.g. IID:WE:G12D and bits[2] is the V
                               # take bits[1] and replace the last letter for the second colname
                               c(bits[1], sub('.$', bits[2], bits[1]))
                           }))
# make resmat by copying the appropriate columns
resmat <- cbind(mymat, mymat[, which.slash])
# order the columns to make sure the names replace properly
resmat <- resmat[, order(names(resmat))]
# put the new names in
names(resmat)[grep('/', names(resmat))] <- sort(new.names)
resmat looks like this
# a b c GH:SQ:p.R172G GH:SQ:p.R172W IID:WE:G12D IID:WE:G12V
# 1 1 3 4 2 2 4 4
# 2 22 4 4 2 2 2 2
# 3 2 3 4 2 2 2 2
You could use grepl to get a logical index of the column names containing / ('nm1'), then build the replicated column names ('nm2') with gsub/scan. Then cbind the columns that are not in 'nm1' with the replicated 'nm1' columns, rename the latter with 'nm2', and, if needed, reorder the columns.
#get a logical index of the columns with `/` using grepl
nm1 <- grepl('/', names(df1))
#use a regex to rearrange the substrings in the nm1 column names:
#remove the `/` and use `scan` to split at the space delimiter
nm2 <- scan(text=gsub('([^/]+)(.)/(.*)', '\\1\\2 \\1\\3',
                      names(df1)[nm1]), what='', quiet=TRUE)
#cbind the columns that are not in nm1, with the replicate nm1 columns
df2 <- cbind(df1[!nm1], setNames(df1[rep(which(nm1), each= 2)], nm2))
#create another index to find the starting position of nm1 columns
nm3 <- names(df1)[1:(which(nm1)[1L]-1)]
#we concatenate the nm3, nm2, and the rest of the columns to match
#the expected output order
df2N <- df2[c(nm3, nm2, setdiff(names(df1)[!nm1], nm3))]
df2N
# a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
#1 1 3 4 4 2 2 4
#2 22 4 2 2 2 2 4
#3 2 3 2 2 2 2 4
data
df1 <- structure(list(a = c(1L, 22L, 2L), b = c(3L, 4L, 3L),
`IID:WE:G12D/V` = c(4L,
2L, 2L), `GH:SQ:p.R172W/G` = c(2L, 2L, 2L), c = c(4L, 4L, 4L)),
.Names = c("a", "b", "IID:WE:G12D/V", "GH:SQ:p.R172W/G", "c"),
class = "data.frame", row.names = c(NA, -3L))
UPDATE 2
*I've added some code (and an explanation) that I wrote myself at the end of this question. It is a suboptimal solution (both in coding efficiency and in the resulting output), but it more or less manages to select items that adhere to the constraints. If you have any ideas on how to improve it (again, in both efficiency and output), please let me know.
1. Updated Post
Please look below for the initial question and sample code. Thanks to alexis_laz's answer, the problem was solved for a small number of items. However, when the number of items becomes too large, the combn function in R can no longer compute the combinations and fails with the error "invalid 'ncol' value (too large or NA)". Since my dataset does have a lot of items, I was wondering whether replacing some of his code (shown below) with C++ would solve this, and if so, what code I should use. Thanks!
This is the code as provided by alexis_laz:
ff = function(x, No_items, No_persons)
{
    do.call(rbind,
        lapply(No_items:ncol(x),
            function(n) {
                col_combs = combn(seq_len(ncol(x)), n, simplify = F)
                persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
                keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
                data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
                           items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
            }))
}
2. Initial Post
Currently I'm working on a set of data coming from adaptive measurement, which means that not all persons have answered the same items. For my analysis, however, I need a dataset that contains only items that have been answered by all persons (or by a subset of these persons).
I have a matrix object in R with rows = persons (100000) and columns = items (220), with a 1 in a cell if the person has answered the item and a 0 if not.
How can I use R to determine which combination of at least 15 items has been answered by the largest number of persons?
Hopefully the question is clear (if not, please ask me for more details and I will gladly provide them).
Thanks in advance.
Joost
Edit:
Below is a sample matrix with the items (A:E) as columns and persons (1:5) as rows.
mat <- matrix(c(1,1,1,0,0,1,1,0,1,1,1,1,1,0,1,0,1,1,0,0,1,1,1,1,0),5,5,byrow=T)
colnames(mat) <- c("A","B","C","D","E")
rownames(mat) <- 1:5
> mat
A B C D E
"1" 1 1 1 0 0
"2" 1 1 0 1 1
"3" 1 1 1 0 1
"4" 0 1 1 0 0
"5" 1 1 1 1 0
mat[1,1] = 1 means that person 1 has given a response to item 1.
Now (in this example) I'm interested in finding out which set of at least 3 items has been answered by at least 3 people. So here I can just go through all possible combinations of 3, 4 and 5 items to check how many people have a 1 in the matrix for each item in a combination.
This will result in me choosing the item combination A, B and C, since it is the only combination of at least 3 items that has been answered by 3 people (namely persons 1, 3 and 5).
Now for my real dataset I want to do the same, but for a combination of at least 10 items that a group of at least 75 people all responded to. And since I have a lot of data, preferably not by hand as in the example data.
I'm thus looking for a function/code in R that lets me set the minimum number of items and persons, and then gives me all combinations of items and persons that meet these constraints or exceed them.
Thus for the example matrix it would be something like:
f <- function(data,no.items,no.persons){
#code
}
> f(mat,3,3)
no.item no.pers items persons
1 3 3 A, B, C 1, 3, 5
Or in the case of at least 2 items that have been answered by at least 3 persons:
> f(mat,2,3)
no.item no.pers items persons
1 2 4 A, B 1, 2, 3, 5
2 2 3 A, C 1, 3, 5
3 2 4 B, C 1, 3, 4, 5
4 3 3 A, B, C 1, 3, 5
Hopefully this clears up what my question is actually about. Thanks for the quick replies that I have already received!
3. Written Code
Below is the code I've written today. It takes each item once as a starting point and then looks for the item that has been answered most often by the people who also responded to the start item. It then takes these two items, looks for a third item, and repeats this until the number of people who responded to all selected items drops below the given limit.
One drawback of the code is that it takes some time to run (the run time goes up somewhat exponentially as the number of items grows). The second drawback is that it still does not evaluate all possible combinations of items. The start item and the subsequently chosen item may have a lot of respondents in common, but if the chosen item has almost no overlap with the other (not yet chosen) items, the sample might shrink very fast. If instead an item were chosen that has somewhat fewer respondents in common with the start item but a lot of overlap with other items, the final collection of selected items might be much bigger than the one produced by the code below. So again, suggestions and improvements in both directions are welcome!
set.seed(512)
mat <- matrix(rbinom(1000000, 1, .6), 10000, 100)
colnames(mat) <- 1:100
fff <- function(data,persons,items){
  xx <- list()
  for(j in 1:ncol(data)){
    d <- matrix(c(j,length(which(data[,j]==1))),1,2)
    colnames(d) <- c("item","n")
    t = persons+1
    a <- j
    while(t >= persons){
      b <- numeric(0)
      for(i in 1:ncol(data)){
        z <- c(a,i)
        if(i %in% a){
          b[i] = 0
        } else {
          b[i] <- length(which(rowSums(data[,z])==length(z)))
        }
      }
      c <- c(which.max(b),max(b))
      d <- rbind(d,c)
      a <- c(a,c[1])
      t <- max(b)
    }
    print(j)
    xx[[j]] = d
  }
  x <- y <- z <- numeric(0)
  zz <- matrix(c(0,0,rep(NA,ncol(data))),length(xx),ncol(data)+2,byrow=T)
  colnames(zz) <- c("n.pers", "n.item", rep("I",ncol(data)))
  for(i in 1:length(xx)){
    zz[i,1] <- xx[[i]][nrow(xx[[i]])-1,2]
    zz[i,2] <- length(unname(xx[[i]][1:nrow(xx[[i]])-1,1]))
    zz[i,3:(zz[i,2]+2)] <- unname(xx[[i]][1:nrow(xx[[i]])-1,1])
  }
  zz <- zz[,colSums(is.na(zz))<nrow(zz)]
  zz <- zz[which((rowSums(zz,na.rm=T)/rowMeans(zz,na.rm=T))-2>=items),]
  zz <- as.data.frame(zz)
  return(zz)
}
fff(mat,110,8)
> head(zz)
n.pers n.item I I I I I I I I I I
1 156 9 1 41 13 80 58 15 91 12 39 NA
2 160 9 2 27 59 13 81 16 15 6 92 NA
3 158 9 3 59 83 32 25 80 14 41 16 NA
4 160 9 4 24 27 71 32 10 63 42 51 NA
5 114 10 5 59 66 27 47 13 44 63 30 52
6 158 9 6 13 56 61 12 59 8 45 81 NA
#col 1 = number of persons in sample
#col 2 = number of items in sample
#col 3:12 = which items create this sample (NA if n.item is less than 10)
To follow up on my comment, something like:
set.seed(1618)
mat <- matrix(rbinom(1000, 1, .6), 100, 10)
colnames(mat) <- sample(LETTERS, 10)
rownames(mat) <- sprintf('person%s', 1:100)
mat1 <- mat[rowSums(mat) > 5, ]
head(mat1)
# A S X D R E Z K P C
# person1 1 1 1 0 1 1 1 1 1 1
# person3 1 0 1 1 0 1 0 0 1 1
# person4 1 0 1 1 1 1 1 0 1 1
# person5 1 1 1 1 1 0 1 1 0 0
# person6 1 1 1 1 0 1 0 1 1 0
# person7 0 1 1 1 1 1 1 1 0 0
table(rowSums(mat1))
# 6 7 8 9
# 24 23 21 5
tab <- table(sapply(1:nrow(mat1), function(x)
paste(names(mat1[x, ][mat1[x, ] == 1]), collapse = ',')))
data.frame(tab[tab > 1])
# tab.tab...1.
# A,S,X,D,R,E,P,C 2
# A,S,X,D,R,E,Z,P,C 2
# A,S,X,R,E,Z,K,C 3
# A,S,X,R,E,Z,P,C 2
# A,S,X,Z,K,P,C 2
Here is another idea that matches your output:
ff = function(x, No_items, No_persons)
{
    do.call(rbind,
        lapply(No_items:ncol(x),
            function(n) {
                col_combs = combn(seq_len(ncol(x)), n, simplify = F)
                persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
                keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
                data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
                           items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
            }))
}
ff(mat, 3, 3)
# persons items
#1 1, 3, 5 A, B, C
ff(mat, 2, 3)
# persons items
#1 1, 2, 3, 5 A, B
#2 1, 3, 5 A, C
#3 1, 3, 4, 5 B, C
#4 1, 3, 5 A, B, C
I am stuck on a fairly simple exercise looping through lists, and I am getting the error "TypeError: 'list' object is not callable".
I have three lists with n records each. I want to write the first record from each list on the same line and repeat this for all n records, which will result in n lines. These are the lists I want to use:
lst1 = ['1','2','4','5','3']
lst2 = ['3','4','3','4','3']
lst3 = ['0.52','0.91','0.18','0.42','0.21']
istring=""
lst=0
for i in range(0,10): # range is simply upper limit of number of records in lists
    entry = lst1(lst)
    istring = istring + entry.rjust(11) # first entry from each list will be cat here
    lst=lst+1
Any pointers to get started would be really helpful.
This works for any size of lists:
for i in zip(lst1, lst2, lst3):
    for j in i:
        print j.rjust(11),
    print
1 3 0.52
2 4 0.91
4 3 0.18
5 4 0.42
3 3 0.21
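For readers on Python 3, where print is a function, the same zip idea could be sketched as below. This is only an illustration under that assumption: format_rows and its width parameter are hypothetical names that do not appear in the answers, and the field width of 11 is taken from the question.

```python
lst1 = ['1', '2', '4', '5', '3']
lst2 = ['3', '4', '3', '4', '3']
lst3 = ['0.52', '0.91', '0.18', '0.42', '0.21']

def format_rows(*columns, width=11):
    """Hypothetical helper: one string per record, each field right-justified."""
    # zip(*columns) yields the i-th entry of every list together as one record.
    return ["".join(field.rjust(width) for field in record)
            for record in zip(*columns)]

rows = format_rows(lst1, lst2, lst3)
print("\n".join(rows))
```

Note that zip stops at the shortest list, so lists of unequal length are silently truncated rather than raising an error.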
>>> lst1 = ['1','2','4','5','3']
>>> lst2 = ['3','4','3','4','3']
>>> lst3 = ['0.52','0.91','0.18','0.42','0.21']
>>> a = zip(lst1, lst2, lst3)
>>> istring = ""
>>> for entry in a:
...     istring += entry[0].rjust(11)
...     istring += entry[1].rjust(11)
...     istring += entry[2].rjust(11) + "\n"
...
>>> print istring
1 3 0.52
2 4 0.91
4 3 0.18
5 4 0.42
3 3 0.21
Try entry = lst1[lst] instead of entry = lst1(lst)
() usually denotes calling a function, whereas
[] usually denotes accessing an element of something.
A list is not a function.
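A small sketch of the difference, written for Python 3 (an assumption; the question itself uses Python 2 syntax). It reproduces the error from the question on purpose and then shows the indexed access; error_message is just an illustrative variable name.

```python
lst1 = ['1', '2', '4', '5', '3']

try:
    entry = lst1(0)   # parentheses try to *call* the list, which fails
except TypeError as exc:
    error_message = str(exc)

entry = lst1[0]       # square brackets index into the list
print(error_message)
print(entry)
```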
Also, while you can keep your own index, a for loop makes this unnecessary:
x = [1,2,3,4,5,7,9,11,13,15]
y = [2,4,6,8,10,12,14,16,18,20]
z = [3,4,5,6,7,8,9,10,11,12]
for i in range(0,10):
    print x[i], y[i], z[i]
1 2 3
2 4 4
3 6 5
4 8 6
5 10 7
7 12 8
9 14 9
11 16 10
13 18 11
15 20 12
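The hardcoded range(0,10) only works when every list really has 10 entries; with the 5-element lists from the question it would run past the end. A possible variation, sketched for Python 3 (an assumption, as are the names n and lines), derives the bound from the lists themselves:

```python
x = [1, 2, 3, 4, 5, 7, 9, 11, 13, 15]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
z = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Use the shortest list as the upper bound so no index runs past the end.
n = min(len(x), len(y), len(z))

lines = []
for i in range(n):
    lines.append("{} {} {}".format(x[i], y[i], z[i]))
print("\n".join(lines))
```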
Dear StackOverFlowers (flowers in short),
I have a list of data.frames (walk.sample) that I would like to collapse into a single (giant) data.frame. While collapsing, I would like to mark (by adding another column) which element of the list each row came from. This is what I've got so far.
This is the list of data.frames that needs to be collapsed/stacked.
> walk.sample
[[1]]
walker x y
1073 3 228.8756 -726.9198
1086 3 226.7393 -722.5561
1081 3 219.8005 -728.3990
1089 3 225.2239 -727.7422
1032 3 233.1753 -731.5526
[[2]]
walker x y
1008 3 205.9104 -775.7488
1022 3 208.3638 -723.8616
1072 3 233.8807 -718.0974
1064 3 217.0028 -689.7917
1026 3 234.1824 -723.7423
[[3]]
[1] 3
[[4]]
walker x y
546 2 629.9041 831.0852
524 2 627.8698 873.3774
578 2 572.3312 838.7587
513 2 633.0598 871.7559
538 2 636.3088 836.6325
1079 3 206.3683 -729.6257
1095 3 239.9884 -748.2637
1005 3 197.2960 -780.4704
1045 3 245.1900 -694.3566
1026 3 234.1824 -723.7423
I have written a function that adds a column denoting which element the rows came from and then appends the result to an existing data.frame.
collapseToDataFrame <- function(x) { # collapse list to a dataframe with a twist
  walk.df <- data.frame()
  for (i in 1:length(x)) {
    n.rows <- nrow(x[[i]])
    if (length(x[[i]])>1) {
      temp.df <- cbind(x[[i]], rep(i, n.rows))
      names(temp.df) <- c("walker", "x", "y", "session")
      walk.df <- rbind(walk.df, temp.df)
    } else {
      cat("Empty list", "\n")
    }
  }
  return(walk.df)
}
> collapseToDataFrame(walk.sample)
Empty list
Empty list
walker x y session
3 1 -604.5055 -123.18759 1
60 1 -562.0078 -61.24912 1
84 1 -594.4661 -57.20730 1
9 1 -604.2893 -110.09168 1
43 1 -632.2491 -54.52548 1
1028 3 240.3905 -724.67284 1
1040 3 232.5545 -681.61225 1
1073 3 228.8756 -726.91980 1
1091 3 209.0373 -740.96173 1
1036 3 248.7123 -694.47380 1
I'm curious whether this can be done more elegantly, with perhaps do.call() or some other more generic function?
I think this will work...
lengths <- sapply(walk.sample, function(x) if (is.null(nrow(x))) 0 else nrow(x))
cbind(do.call(rbind, walk.sample[lengths > 1]),
session = rep(1:length(lengths), ifelse(lengths > 1, lengths, 0)))
I'm not claiming this to be the most elegant approach, but I think it is working:
library(plyr)
ldply(sapply(1:length(walk.sample), function(i)
  if (length(walk.sample[[i]]) > 1)
    cbind(walk.sample[[i]],session=rep(i,nrow(walk.sample[[i]])))
),rbind)
EDIT
After applying Marek's apt remarks:
do.call(rbind,lapply(1:length(walk.sample), function(i)
  if (length(walk.sample[[i]]) > 1)
    cbind(walk.sample[[i]],session=i) ))