I am trying to get the longest sublist within a list. I need a rule that recursively searches a list of lists and determines which list has the longest length.
For example:
input: [[1],[1,2],[],[1,2,3,4],[5,6]]
output: [1,2,3,4]
This is what I have so far:
max([H|T], Path, Length) :-
length(H, L),
(L #> Length ->
max(T, H, L) ;
max(T, Path, Length) ).
I would like max() to work like this:
?- max([[1],[1,2],[],[1,2,3,4],[5,6]], Path, Distance).
Path = [1,2,3,4]
Distance = 4
When I run a trace, this is the output:
{trace}
| ?- max([[1],[1,2],[],[1,2,3,4],[5,6]], Path, Distance).
1 1 Call: max([[1],[1,2],[],[1,2,3,4],[5,6]],_307,_308) ?
2 2 Call: length([1],_387) ?
2 2 Exit: length([1],1) ?
3 2 Call: 1#>_308 ?
3 2 Exit: 1#>_308 ?
4 2 Call: max([[1,2],[],[1,2,3,4],[5,6]],[1],1) ?
5 3 Call: length([1,2],_462) ?
5 3 Exit: length([1,2],2) ?
6 3 Call: 2#>1 ?
6 3 Exit: 2#>1 ?
7 3 Call: max([[],[1,2,3,4],[5,6]],[1,2],2) ?
8 4 Call: length([],_537) ?
8 4 Exit: length([],0) ?
9 4 Call: 0#>2 ?
9 4 Fail: 0#>2 ?
9 4 Call: max([[1,2,3,4],[5,6]],[1,2],2) ?
10 5 Call: length([1,2,3,4],_587) ?
10 5 Exit: length([1,2,3,4],4) ?
11 5 Call: 4#>2 ?
11 5 Exit: 4#>2 ?
12 5 Call: max([[5,6]],[1,2,3,4],4) ?
13 6 Call: length([5,6],_662) ?
13 6 Exit: length([5,6],2) ?
14 6 Call: 2#>4 ?
14 6 Fail: 2#>4 ?
14 6 Call: max([],[1,2,3,4],4) ?
14 6 Fail: max([],[1,2,3,4],4) ?
12 5 Fail: max([[5,6]],[1,2,3,4],4) ?
9 4 Fail: max([[1,2,3,4],[5,6]],[1,2],2) ?
7 3 Fail: max([[],[1,2,3,4],[5,6]],[1,2],2) ?
4 2 Fail: max([[1,2],[],[1,2,3,4],[5,6]],[1],1) ?
1 1 Fail: max([[1],[1,2],[],[1,2,3,4],[5,6]],_307,_308) ?
(2 ms) no
I believe the issue is that I am not handling the occurrence of an empty set "[]". However, I have attempted several different methods and am unable to get my desired output.
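You need to define a base clause that ends the recursion when the list is empty, and you need one more parameter to return the result. Note that the version below returns only the length; returning the longest list itself would need a further output argument handled the same way.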
max([], _, Length, Length).
max([H|T], Path, Length, RetLength) :-
length(H, L),
( L #> Length ->
max(T, H, L,RetLength) ;
max(T, Path, Length,RetLength)
).
Test:
?- max([[1],[1,2],[],[1,2,3,4],[5,6]], Path, Distance,Len).
Len = 4.
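The same accumulator idea can be sketched outside Prolog. The following is a minimal Python illustration, not a translation of the predicate above: the function name longest_sublist and the starting values best=None, best_len=-1 are assumptions made for this example. It shows the two ingredients the trace was missing, a base case for the empty list and a way to hand the accumulated result back to the caller.

```python
def longest_sublist(lists, best=None, best_len=-1):
    """Return (longest sublist, its length) using accumulator recursion."""
    # Base case: nothing left to inspect, so hand back the accumulators.
    if not lists:
        return best, best_len
    head, *tail = lists
    # Recursive case: keep the head if it is strictly longer than the best so far.
    if len(head) > best_len:
        return longest_sublist(tail, head, len(head))
    return longest_sublist(tail, best, best_len)


result = longest_sublist([[1], [1, 2], [], [1, 2, 3, 4], [5, 6]])
print(result)
```

Starting best_len at -1 means an empty sublist can still win when every sublist is empty; ties are resolved in favour of the first longest sublist.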
Today I wanted to run the TropFishR package. The problem (for me) is that all data must be arranged in a list. So I tried to reconstruct the alba dataset so that I can repeat the steps with my own data in the future. Here is what I have done:
library(TropFishR)
data("alba")
str(alba) #the list contain 4 variables
List of 4
$ sample.no : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
$ midLengths: num [1:14] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 ...
$ dates : Date[1:7], format: "1976-04-17" "1976-07-02" "1976-09-19" ...
$ catch : num [1:14, 1:7] 0 0 0 1 1 1 3 9 5 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:7] "1976.29315068493" "1976.50136986301" "1976.71780821918" "1976.95616438356" ...
- attr(*, "class")= chr "lfq"
And this is what I did:
#1 We create sample.no
sample.no <- c(1:14)
sample.no
#2 We create "midlengths"
midlengths <- seq(from = 1.5, to = 14.5, by = 1)
midlengths
#3 We create "dates"
dates <- as.Date(c("1976-04-17","1976-07-02", "1976-09-19", "1976-12-15", "1977-02-18",
"1977-04-30", "1977-06-24"))
dates
#4 We create "catch"
catch <- as.matrix(read.csv(file.choose(), header=T))
#I copied the alba length freq data, move it to excel and imported as csv file
colnames(catch)<-NULL
print(catch)
#5 create list files
synLFQb <- list(sample.no,midlengths,dates,catch)
synLFQb #just checked if it turned out to be as desired format
#6 create a name for the data list
names(synLFQb) <- c("sample.no","midlengths","dates","catch")
#Finally, we need to assign the class lfq to our new object in order to allow it to be recognized by other TropFishR functions, e.g. plot.lfq:
class(synLFQb) <- "lfq"
This produces a "similar" data list:
str(synLFQb)
List of 4
$ sample.no : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
$ midlengths: num [1:14] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 ...
$ dates : Date[1:7], format: "1976-04-17" "1976-07-02" "1976-09-19" ...
$ catch : int [1:14, 1:7] 0 0 0 1 1 1 3 9 5 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : NULL
- attr(*, "class")= chr "lfq"
However, every time I tried to run this simple command:
plot(synLFQb, Fname="catch", hist.sc = 1)
It resulted in error:
> plot(synLFQb, Fname="catch", hist.sc = 1)
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning messages:
1: In min(x, na.rm = na.rm) :
no non-missing arguments to min; returning Inf
2: In max(x, na.rm = na.rm) :
no non-missing arguments to max; returning -Inf
Any help will be much appreciated.
Please make sure that you name the mid-lengths vector in your list "midLengths", with a capital "L". R names are case-sensitive, and the alba object uses midLengths while your list uses midlengths. I hope that does the trick in your example.
I have a two dimensional list of values:
[
[[12.2],[5325]],
[[13.4],[235326]],
[[15.9],[235326]],
[[17.7],[53521]],
[[21.3],[42342]],
[[22.6],[6546]],
[[25.9],[34634]],
[[27.2],[523523]],
[[33.4],[235325]],
[[36.2],[235352]]
]
I would like to get a list of averages defined by a given step, so that for step=10 it would look like this:
[
[[10],[average of all 10-19]],
[[20],[average of all 20-29]],
[[30],[average of all 30-39]]
]
How can I achieve that? Please note that the number of 10s, 20s, 30s and so on is not always the same.
import pandas as pd
df = pd.DataFrame((q[0][0], q[1][0]) for q in thelist)
df['group'] = (df[0] / 10).astype(int)
Now df is:
0 1 group
0 12.2 5325 1
1 13.4 235326 1
2 15.9 235326 1
3 17.7 53521 1
4 21.3 42342 2
5 22.6 6546 2
6 25.9 34634 2
7 27.2 523523 2
8 33.4 235325 3
9 36.2 235352 3
Then:
df.groupby('group').mean()
Gives you the answers you seek:
0 1
group
1 14.80 132374
2 24.25 151761
3 34.80 235338
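If the result is needed back in the nested-list shape from the question, with each bin labelled by its lower edge (10, 20, 30) and a configurable step, the grouping above can be wrapped as follows. This is a sketch under assumptions: the function name bin_averages and the column names x, y and bin are made up for the example, and bins are taken to be half-open intervals [10, 20), [20, 30), and so on.

```python
import pandas as pd

thelist = [
    [[12.2], [5325]], [[13.4], [235326]], [[15.9], [235326]], [[17.7], [53521]],
    [[21.3], [42342]], [[22.6], [6546]], [[25.9], [34634]], [[27.2], [523523]],
    [[33.4], [235325]], [[36.2], [235352]],
]


def bin_averages(pairs, step=10):
    """Average the second value of each pair within bins of width `step`."""
    # Flatten [[x], [y]] pairs into two plain columns.
    df = pd.DataFrame([(q[0][0], q[1][0]) for q in pairs], columns=["x", "y"])
    # Label every row with the lower edge of its bin, e.g. 12.2 -> 10.
    df["bin"] = (df["x"] // step * step).astype(int)
    means = df.groupby("bin")["y"].mean()
    # Rebuild the nested-list layout used in the question.
    return [[[int(b)], [float(m)]] for b, m in means.items()]


averages = bin_averages(thelist, step=10)
print(averages)
```

Bins that contain no values simply do not appear in the result, so it does not matter that the number of 10s, 20s and 30s differs.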
Here is my minimal working example:
list1 = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] #len = 21
list2 = [1,1,1,0,1,0,0,1,0,1,1,0,1,0,1,0,0,0,1,1,0] #len = 21
list3 = [0,0,1,0,1,1,0,1,0,1,0,1,1,1,0,1,0,1,1,1,1] #len = 21
list4 = [1,0,0,1,1,0,0,0,0,1,0,1,1,1,1,0,1,0,1,0,1] #len = 21
I have four lists and I want to "clean" list1 using the following rule: if any of list2[i], list3[i] or list4[i] is equal to zero, then I want to eliminate item i from list1. So basically I only keep those elements of list1 for which the other lists all have ones at the same position.
Here is the function I wrote to solve this:
def clean(list1, list2,list3,list4):
for i in range(len(list2)):
if (list2[i]==0 or list3[i]==0 or list4[i]==0):
list1.pop(i)
return list1
However, it doesn't work. If you apply it, it gives the error:
Traceback (most recent call last):line 68, in clean list1.pop(i)
IndexError: pop index out of range
What am I doing wrong? Also, I was told Pandas is really good at dealing with data. Is there a way I can do it with Pandas? Each of these lists is actually a column (after removing the heading) of a csv file.
EDIT
For example, at the end I would like to get: list1 = [4,9,12,18]
I think the main problem is that at each iteration, when I pop an element, the indices of all its successors change. The overall length of the list also changes, so the index passed to pop() eventually becomes too large. So hopefully there is another strategy or function that I can use.
This is definitely a job for pandas:
import pandas as pd
df = pd.DataFrame({
'l1':list1,
'l2':list2,
'l3':list3,
'l4':list4
})
no_zeroes = df.loc[(df['l2'] != 0) & (df['l3'] != 0) & (df['l4'] != 0)]
Where df.loc[...] takes the full dataframe, then filters it by the criteria provided. In this example, your criteria are that you only keep the rows where l2, l3, and l4 are not zero (!= 0).
Gives you a pandas dataframe:
l1 l2 l3 l4
4 4 1 1 1
9 9 1 1 1
12 12 1 1 1
18 18 1 1 1
or, if you need just list1, take that column from the filtered dataframe:
list1 = no_zeroes['l1'].tolist()
If you want the criteria to be that all other columns are 1, then use:
all_ones = df.loc[(df['l2'] == 1) & (df['l3'] == 1) & (df['l4'] == 1)]
Note that I'm creating new dataframes for no_zeroes and all_ones and that the original dataframe stays intact if you want to further manipulate the data.
Update:
Per Divakar's answer (far more elegant than my original answer), much the same can be done in pandas:
df = pd.DataFrame([list1, list2, list3, list4])
list1 = df.loc[0, (df[1:] != 0).all()].astype(int).tolist()
Here's one approach with NumPy -
import numpy as np
mask = (np.asarray(list2)==1) & (np.asarray(list3)==1) & (np.asarray(list4)==1)
out = np.asarray(list1)[mask].tolist()
Here's another way with NumPy that stacks those lists into rows to form a 2D array and thus simplifies things quite a bit -
arr = np.vstack((list1, list2, list3, list4))
out = arr[0,(arr[1:] == 1).all(0)].tolist()
Sample run -
In [165]: arr = np.vstack((list1, list2, list3, list4))
In [166]: print(arr)
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
[ 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 1 1 0]
[ 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1]
[ 1 0 0 1 1 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 1]]
In [167]: arr[0,(arr[1:] == 1).all(0)].tolist()
Out[167]: [4, 9, 12, 18]
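For completeness, the original loop can also be repaired without any library. The IndexError comes from popping while iterating over the old length, so each pop shifts the remaining indices. A minimal sketch that avoids this by building a new list instead of mutating list1 (keeping the function name clean from the question):

```python
list1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
list2 = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0]
list3 = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1]
list4 = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1]


def clean(list1, list2, list3, list4):
    """Keep the items of list1 whose partners in the other lists are all non-zero."""
    # zip walks the four lists in lockstep, so no index ever goes stale.
    return [
        a
        for a, b, c, d in zip(list1, list2, list3, list4)
        if b != 0 and c != 0 and d != 0
    ]


cleaned = clean(list1, list2, list3, list4)
print(cleaned)
```

Because nothing is removed from list1 during the loop, the positions stay aligned across all four lists, and the original list1 is left untouched.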
UPDATE 2
*I've added some code (and an explanation) that I wrote myself at the end of this question. It is a suboptimal solution (both in coding efficiency and in the resulting output), but it more or less manages to select items that satisfy the constraints. If you have any ideas on how to improve it (again, both in efficiency and in resulting output), please let me know.
1. Updated Post
Please look below for the initial question and sample code. Thanks to alexis_laz's answer, the problem was solved for a small number of items. However, when the number of items becomes too large, the combn function in R can no longer compute the combinations and fails with the error invalid 'ncol' value (too large or NA). Since my dataset does have a lot of items, I was wondering whether replacing some of his code (shown after this) with C++ would solve this, and if so, what code I should use. Thanks!
This is the code as provided by alexis_laz;
ff = function(x, No_items, No_persons)
{
do.call(rbind,
lapply(No_items:ncol(x),
function(n) {
col_combs = combn(seq_len(ncol(x)), n, simplify = F)
persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
}))
}
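To get a feeling for why enumerating every combination stops working, it helps to count them. The sketch below uses Python's math.comb (available from Python 3.8) with the figures from this question: 5 items taken 3 at a time for the toy matrix, and 220 items taken 15 at a time for the real data. Comparing against R's largest integer (.Machine$integer.max, 2^31 - 1) is an assumption about where the 'ncol' error comes from, not something confirmed by the error message itself.

```python
import math

# R's largest representable integer, .Machine$integer.max.
R_INT_MAX = 2**31 - 1


def n_combinations(n_items, k):
    """Number of ways to choose k items out of n_items."""
    return math.comb(n_items, k)


small = n_combinations(5, 3)     # toy example: items A..E, sets of 3
large = n_combinations(220, 15)  # real data: 220 items, sets of 15

print(small)
print(large)
print(large > R_INT_MAX)
```

Whatever the exact limit inside combn is, a count of this size cannot be enumerated exhaustively in any language, so moving the same loop to C++ would not by itself make the full search feasible.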
2. Initial Post
Currently I'm working on a set of data coming from adaptive measurement, which means that not all persons have made all of the same items. For my analysis however I need a dataset that contains only items that have been made by all persons (or a subset of these persons).
I have a matrix object in R with rows = persons (100000), and columns = items(220), and a 1 in a cell if the person has made the item and a 0 if the person has not made the item.
How can I use R to determine which combination of at least 15 items, is made by the highest amount of persons?
Hopefully the question is clear (if not please ask me for more details and I will gladly provide those).
Tnx in advance.
Joost
Edit:
Below is a sample matrix with the items (A:E) as columns and persons (1:5) as rows.
mat <- matrix(c(1,1,1,0,0,1,1,0,1,1,1,1,1,0,1,0,1,1,0,0,1,1,1,1,0),5,5,byrow=T)
colnames(mat) <- c("A","B","C","D","E")
rownames(mat) <- 1:5
> mat
A B C D E
"1" 1 1 1 0 0
"2" 1 1 0 1 1
"3" 1 1 1 0 1
"4" 0 1 1 0 0
"5" 1 1 1 1 0
mat[1,1] = 1 means that person 1 has given a response to item 1.
Now (in this example) I'm interested in finding out which set of at least 3 items is made by at least 3 people. So here I can just go through all possible combinations of 3, 4 and 5 items to check how many people have a 1 in the matrix for each item in a combination.
This will result in me choosing the item combination A, B and C, since it is the only combination of items that has been made by 3 people (namely persons 1, 3 and 5).
Now for my real dataset I want to do this but then for a combination of at least 10 items that a group of at least 75 people all responded to. And since I have a lot of data preferably not by hand as in the example data.
I'm thus looking for a function/code in R that lets me specify the minimum number of items and the minimum number of persons, and then gives me all combinations of items and persons that meet these constraints, or that have more items/persons than the minimum.
Thus for the example matrix it would be something like;
f <- function(data,no.items,no.persons){
#code
}
> f(mat,3,3)
no.item no.pers items persons
1 3 3 A, B, C 1, 3, 5
Or in case of at least 2 items that are made by at least 3 persons;
> f(mat,2,3)
no.item no.pers items persons
1 2 4 A, B 1, 2, 3, 5
2 2 3 A, C 1, 3, 5
3 2 4 B, C 1, 3, 4, 5
4 3 3 A, B, C 1, 3, 5
Hopefully this clears up what my question actually is about. Tnx for the quick replies that I already received!
3. Written Code
Below is the code I've written today. It takes each item once as a starting point and then looks for the item that has been answered most often by people who also responded to the start item. It then takes these two items and looks for a third item, and repeats this until the number of people who responded to all selected items drops below the given limit. One drawback of the code is that it takes some time to run (the run time grows steeply as the number of items grows).
The second drawback is that it still does not evaluate all possible combinations of items. The start item and the item chosen next may have many respondents in common, but if the chosen item has almost no overlap with the other (not yet chosen) items, the sample may shrink very fast. If instead an item had been chosen with somewhat fewer persons in common with the start item, but with a lot of overlap with other items, the final collection of selected items might be much bigger than the one produced by the code below. So again, suggestions and improvements in both directions are welcome!
set.seed(512)
mat <- matrix(rbinom(1000000, 1, .6), 10000, 100)
colnames(mat) <- 1:100
fff <- function(data,persons,items){
xx <- list()
for(j in 1:ncol(data)){
d <- matrix(c(j,length(which(data[,j]==1))),1,2)
colnames(d) <- c("item","n")
t = persons+1
a <- j
while(t >= persons){
b <- numeric(0)
for(i in 1:ncol(data)){
z <- c(a,i)
if(i %in% a){
b[i] = 0
} else {
b[i] <- length(which(rowSums(data[,z])==length(z)))
}
}
c <- c(which.max(b),max(b))
d <- rbind(d,c)
a <- c(a,c[1])
t <- max(b)
}
print(j)
xx[[j]] = d
}
x <- y <- z <- numeric(0)
zz <- matrix(c(0,0,rep(NA,ncol(data))),length(xx),ncol(data)+2,byrow=T)
colnames(zz) <- c("n.pers", "n.item", rep("I",ncol(data)))
for(i in 1:length(xx)){
zz[i,1] <- xx[[i]][nrow(xx[[i]])-1,2]
zz[i,2] <- length(unname(xx[[i]][1:nrow(xx[[i]])-1,1]))
zz[i,3:(zz[i,2]+2)] <- unname(xx[[i]][1:nrow(xx[[i]])-1,1])
}
zz <- zz[,colSums(is.na(zz))<nrow(zz)]
zz <- zz[which((rowSums(zz,na.rm=T)/rowMeans(zz,na.rm=T))-2>=items),]
zz <- as.data.frame(zz)
return(zz)
}
fff(mat,110,8)
> head(zz)
n.pers n.item I I I I I I I I I I
1 156 9 1 41 13 80 58 15 91 12 39 NA
2 160 9 2 27 59 13 81 16 15 6 92 NA
3 158 9 3 59 83 32 25 80 14 41 16 NA
4 160 9 4 24 27 71 32 10 63 42 51 NA
5 114 10 5 59 66 27 47 13 44 63 30 52
6 158 9 6 13 56 61 12 59 8 45 81 NA
#col 1 = number of persons in sample
#col 2 = number of items in sample
#col 3:12 = which items create this sample (NA if n.item is less than 10)
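The greedy procedure described in this section can be written more compactly by keeping a boolean mask of the persons who answered every item chosen so far. The following Python/NumPy sketch is an illustration under assumptions, not a port of the R function above: the name greedy_items, the 0-based column indices, and the choice to stop before adding an item that would drop below the limit are all choices made for this example.

```python
import numpy as np


def greedy_items(mat, start, min_persons):
    """Greedily grow an item set from `start` while enough persons answered all items."""
    mat = np.asarray(mat, dtype=bool)
    chosen = [start]
    rows = mat[:, start].copy()  # persons who answered every chosen item
    while True:
        # For each item: how many of the remaining persons also answered it?
        counts = (mat & rows[:, None]).sum(axis=0)
        counts[chosen] = -1  # never pick an item twice
        best = int(counts.argmax())
        if counts[best] < min_persons:
            break  # adding any further item would leave too few persons
        chosen.append(best)
        rows &= mat[:, best]
    return chosen, int(rows.sum())


# The 5 x 5 example matrix from the question (columns A..E, persons 1..5).
example = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
]
picked = greedy_items(example, start=0, min_persons=3)
print(picked)
```

Each step costs one pass over the matrix instead of a fresh rowSums per candidate item, but the second drawback named above remains: the search is still greedy, so it can miss larger item sets that start from a locally worse choice.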
to follow up on my comment, something like:
set.seed(1618)
mat <- matrix(rbinom(1000, 1, .6), 100, 10)
colnames(mat) <- sample(LETTERS, 10)
rownames(mat) <- sprintf('person%s', 1:100)
mat1 <- mat[rowSums(mat) > 5, ]
head(mat1)
# A S X D R E Z K P C
# person1 1 1 1 0 1 1 1 1 1 1
# person3 1 0 1 1 0 1 0 0 1 1
# person4 1 0 1 1 1 1 1 0 1 1
# person5 1 1 1 1 1 0 1 1 0 0
# person6 1 1 1 1 0 1 0 1 1 0
# person7 0 1 1 1 1 1 1 1 0 0
table(rowSums(mat1))
# 6 7 8 9
# 24 23 21 5
tab <- table(sapply(1:nrow(mat1), function(x)
paste(names(mat1[x, ][mat1[x, ] == 1]), collapse = ',')))
data.frame(tab[tab > 1])
# tab.tab...1.
# A,S,X,D,R,E,P,C 2
# A,S,X,D,R,E,Z,P,C 2
# A,S,X,R,E,Z,K,C 3
# A,S,X,R,E,Z,P,C 2
# A,S,X,Z,K,P,C 2
Here is another idea that matches your output:
ff = function(x, No_items, No_persons)
{
do.call(rbind,
lapply(No_items:ncol(x),
function(n) {
col_combs = combn(seq_len(ncol(x)), n, simplify = F)
persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
}))
}
ff(mat, 3, 3)
# persons items
#1 1, 3, 5 A, B, C
ff(mat, 2, 3)
# persons items
#1 1, 2, 3, 5 A, B
#2 1, 3, 5 A, C
#3 1, 3, 4, 5 B, C
#4 1, 3, 5 A, B, C
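For readers following the Python threads above, the same exhaustive enumeration can be sketched with itertools.combinations. This mirrors the logic of ff but is not the author's code: the function name item_sets and its return format (a list of (items, persons) tuples) are assumptions made for the example, and it shares the same limitation of being feasible only for a small number of items.

```python
from itertools import combinations

import numpy as np


def item_sets(mat, names, min_items, min_persons):
    """List every item set of size >= min_items answered by >= min_persons persons."""
    mat = np.asarray(mat, dtype=bool)
    n_cols = mat.shape[1]
    out = []
    for n in range(min_items, n_cols + 1):
        for cols in combinations(range(n_cols), n):
            # Persons (1-based) with a 1 in every column of this combination.
            persons = np.flatnonzero(mat[:, list(cols)].all(axis=1)) + 1
            if len(persons) >= min_persons:
                out.append(([names[c] for c in cols], persons.tolist()))
    return out


# The 5 x 5 example matrix from the question.
example = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
]
three_by_three = item_sets(example, "ABCDE", 3, 3)
two_by_three = item_sets(example, "ABCDE", 2, 3)
print(three_by_three)
print(two_by_three)
```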
I have this string vector (for example):
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
To count the number of words in each element of this vector I used this (as given in "Count the number of words in a string in R?", of which this is a possible duplicate, but with a different issue):
No_words <- sapply(gregexpr("\\W+", str), length) + 1
but it returns
6 2 2 2
The last two elements ("tusla" and "laq") each contain only one word,
so it should return
6 2 1 1
How do I get around this problem?
You can try
sapply(gregexpr("\\S+", str), length)
## [1] 6 2 1 1
Or as suggested in comments you can try
sapply(strsplit(str, "\\s+"), length)
## [1] 6 2 1 1
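The reason the original expression over-counts is that it counts separators and adds one, and gregexpr still returns a length-one result (-1) when there is no separator at all. Counting the words themselves avoids the problem. The same idea, sketched in Python purely as an illustration (the variable names are made up for the example; strings holds the vector from the question):

```python
import re

strings = ["this is a string current trey", "feather rtttt", "tusla", "laq"]

# Count runs of non-whitespace characters, i.e. the words themselves.
by_regex = [len(re.findall(r"\S+", s)) for s in strings]

# str.split() with no argument splits on any run of whitespace
# and ignores leading and trailing spaces.
by_split = [len(s.split()) for s in strings]

print(by_regex)
print(by_split)
```

Both variants treat repeated, leading and trailing spaces as a single separator, which matches the behaviour of the \\S+ pattern used above.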
Use the stringi package and stri_count:
require(stringi)
str <- c(
"this is a string current trey",
"nospaces",
"multiple spaces",
" leadingspaces",
"trailingspaces ",
" leading and trailing ",
"just one space each")
> stri_count(str,regex="\\S+")
[1] 6 1 2 1 1 3 4
Use the wc function from the qdap package:
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
library("qdap")
wc(str)
That returns:
wc(str)
[1] 6 2 1 1