Variation on set cover problem in R / C++

Given a universe of elements U = {1, 2, 3,...,n} and a number of sets in this universe {S1, S2,...,Sm}, what is the smallest set we can create that will cover at least one element in each of the m sets?
For example, given the following elements U = {1,2,3,4} and sets S = {{4,3,1},{3,1},{4}}, the following sets will cover at least one element from each set:
{1,4}
or
{3,4}
so the minimum sized set required here is 2.
Any thoughts on how this can be scaled up to solve the problem for m=100 or m=1000 sets? Or thoughts on how to code this up in R or C++?
The sample data, from above, using R's library(sets).
s1 <- set(4, 3, 1)
s2 <- set(3, 1)
s3 <- set(4)
s <- set(s1, s2, s3)
Cheers

This is the hitting set problem, which is basically set cover with the roles of elements and sets interchanged. Letting A = {4, 3, 1} and B = {3, 1} and C = {4}, the element-set containment relation is
    A  B  C
1   +  +  -
2   -  -  -
3   +  +  -
4   +  -  +
so you basically want to solve the problem of covering {A, B, C} with sets 1 = {A, B} and 2 = {} and 3 = {A, B} and 4 = {A, C}.
Probably the easiest way to solve nontrivial instances of set cover in practice is to find an integer programming package with an interface to R or C++. Your example would be rendered as the following integer program, in LP format.
Minimize
obj: x1 + x2 + x3 + x4
Subject To
A: x1 + x3 + x4 >= 1
B: x1 + x3 >= 1
C: x4 >= 1
Binary
x1 x2 x3 x4
End
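If no integer programming package is at hand, the classic greedy heuristic for set cover can be applied to the transposed relation. The following is a minimal sketch of that different technique, not of the integer program above: it returns a valid hitting set but not necessarily a minimum one. The function name greedy_hitting_set is hypothetical, and the sketch assumes elements are numbered 1..n and every set is non-empty.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy heuristic (an approximation, not an exact solver): repeatedly pick
// the element contained in the largest number of sets that are not hit yet.
// Assumes elements are 1..n and that no set is empty.
std::vector<int> greedy_hitting_set(int n, const std::vector<std::vector<int>>& sets) {
    std::vector<bool> hit(sets.size(), false);
    std::size_t remaining = sets.size();
    std::vector<int> chosen;
    while (remaining > 0) {
        // gain[e] = number of un-hit sets that contain element e
        std::vector<int> gain(n + 1, 0);
        for (std::size_t j = 0; j < sets.size(); ++j) {
            if (hit[j]) continue;
            for (int e : sets[j]) {
                assert(e >= 1 && e <= n);
                ++gain[e];
            }
        }
        int best = 0;
        for (int e = 1; e <= n; ++e)
            if (gain[e] > gain[best]) best = e;
        assert(best != 0);  // fails only if an empty set is present
        chosen.push_back(best);
        // Mark every set containing the chosen element as hit
        for (std::size_t j = 0; j < sets.size(); ++j) {
            if (hit[j]) continue;
            for (int e : sets[j]) {
                if (e == best) { hit[j] = true; --remaining; break; }
            }
        }
    }
    return chosen;
}
```

On the example sets {4,3,1}, {3,1}, {4} with n = 4 this picks element 1 first (ties go to the smallest element) and then 4.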

At first I misunderstood the complexity of the problem and came up with a function that finds a set that covers the m sets - but I then realized that it isn't necessarily the smallest one:
cover <- function(sets, elements = NULL) {
  if (is.null(elements)) {
    # Build the union of all sets
    su <- integer()
    for (si in sets) su <- union(su, si)
  } else {
    su <- elements
  }
  s <- su
  for (i in seq_along(s)) {
    # create set candidate with one element removed
    sc <- s[-i]
    ok <- TRUE
    for (si in sets) {
      if (!any(match(si, sc, nomatch = 0L))) {
        ok <- FALSE
        break
      }
    }
    if (ok) {
      s <- sc
    }
  }
  # The resulting set
  s
}
sets <- list(s1=c(1,3,4), s2=c(1,3), s3=c(4))
> cover(sets) # [1] 3 4
Then we can time it:
n <- 100 # number of elements
m <- 1000 # number of sets
sets <- lapply(seq_len(m), function(i) sample.int(n, runif(1, 1, n)))
system.time( s <- cover(sets) ) # 0.53 seconds
Not too bad, but still not the smallest one.
The obvious solution: generate all permutations of the elements, pass each one to the cover function, and keep the smallest result. This will take close to "forever".
Another approach is to generate a limited number of random permutations - this way you get an approximation of the smallest set.
ns <- 10 # number of samples
elements <- seq_len(n)
smin <- elements # start from the full universe, which hits every non-empty set
for(i in seq_len(ns)) {
s <- cover(sets, sample(elements))
if (length(s) < length(smin)) {
smin <- s
}
}
length(smin) # approximate smallest length
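Since the question also asks about C++, the same idea (prune in a random order, repeat, keep the smallest) could be sketched as follows. This is an illustration under assumptions, not a port of the R code: the names hits_all, prune and best_of_random_orders are hypothetical, the elements are assumed distinct, and the loop iterates over the elements themselves rather than over positions in a shrinking vector, so every element gets exactly one removal attempt.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

using Set = std::vector<int>;

// True if every set shares at least one element with `candidate`.
bool hits_all(const std::vector<Set>& sets, const Set& candidate) {
    for (const Set& s : sets) {
        bool hit = false;
        for (int e : s) {
            if (std::find(candidate.begin(), candidate.end(), e) != candidate.end()) {
                hit = true;
                break;
            }
        }
        if (!hit) return false;
    }
    return true;
}

// Try to drop each element once, in the given order; keep the drop whenever
// the remainder still hits every set.
Set prune(const std::vector<Set>& sets, const Set& order) {
    Set current = order;
    for (int e : order) {
        Set trial;
        for (int x : current)
            if (x != e) trial.push_back(x);
        if (hits_all(sets, trial)) current = trial;
    }
    return current;
}

// Repeat the pruning on `samples` random orders and keep the smallest result.
// This only approximates the minimum.
Set best_of_random_orders(const std::vector<Set>& sets, Set elements,
                          int samples, unsigned seed) {
    assert(hits_all(sets, elements));  // the starting set must be a valid cover
    std::mt19937 rng(seed);
    Set best = elements;
    for (int i = 0; i < samples; ++i) {
        std::shuffle(elements.begin(), elements.end(), rng);
        Set s = prune(sets, elements);
        if (s.size() < best.size()) best = s;
    }
    return best;
}
```

Each pruned result is minimal only in the sense that no single element can be removed from it; it can still be larger than the true minimum.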

If you restrict each set to have 2 elements, you have the NP-complete vertex cover (node cover) problem, so the more general problem is at least as hard (for the exact version).

If you're just interested in an algorithm (rather than an efficient/feasible algorithm), you can simply generate subsets of the universe of increasing cardinality and check that the intersection with all the sets in S is non-empty. Stop as soon as you get one that works; the cardinality is the minimum possible.
The complexity of this is 2^|U| in the worst case, I think. Given Foo Bah's answer, I don't think you're going to get a polynomial-time answer...
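The enumeration by increasing cardinality described above might look like this in C++. It is a minimal sketch for small universes only: the name min_hitting_set and the limit of 20 elements are assumptions made here, and subsets are represented as bitmasks where bit e-1 stands for element e.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of set bits in x.
static int count_bits(std::uint32_t x) {
    int c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

// Exhaustive search over subsets of U = {1..n} in order of increasing size.
// Returns the first subset that intersects every set, which is therefore a
// minimum hitting set. Returns an empty vector if none exists (some set is
// empty) or if there are no sets at all. Exponential in n.
std::vector<int> min_hitting_set(int n, const std::vector<std::vector<int>>& sets) {
    assert(n >= 1 && n <= 20);
    std::vector<std::uint32_t> masks;
    for (const auto& s : sets) {
        std::uint32_t m = 0;
        for (int e : s) {
            assert(e >= 1 && e <= n);
            m |= std::uint32_t{1} << (e - 1);
        }
        masks.push_back(m);
    }
    const std::uint32_t limit = std::uint32_t{1} << n;
    for (int k = 0; k <= n; ++k) {
        for (std::uint32_t c = 0; c < limit; ++c) {
            if (count_bits(c) != k) continue;  // only subsets of size k
            bool ok = true;
            for (std::uint32_t m : masks) {
                if ((m & c) == 0) { ok = false; break; }  // set m is not hit
            }
            if (!ok) continue;
            std::vector<int> result;
            for (int e = 1; e <= n; ++e)
                if (c & (std::uint32_t{1} << (e - 1))) result.push_back(e);
            return result;
        }
    }
    return {};
}
```

For the example from the question this returns {1, 4}, the first size-2 subset in bitmask order that hits all three sets.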


Random slopes Cox Proportional Hazards

I have been trying to use coxme to extract random slopes for each of the covariates in my model.
library (coxme)
Start <- runif(5000, 1985, 2015)
Stop <- Start + runif(5000, 2, 10)
S <- data.frame (
X1 <- runif(5000, 5.0, 7.5),
X2 <- runif(5000, 5.0, 7.5),
D <- rbinom(5000, 1, 0.8),
Letters <- sample(LETTERS, 5000, replace = TRUE),
Start <- Start,
Stop <- Stop
)
S_ind1 <- Surv (time = S$Start, time2 = S$Stop, event = S$D)
a <- coxme (S_ind1 ~ X1 + X2 + (X1 + X2|Letters), data = S)
All I get is:
Error in gchol(kfun(theta, varlist, vparm, ntheta, ncoef)) :
NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning messages:
1: In sqrt(xvar * zvar) : NaNs produced
2: In sqrt(xvar * zvar) : NaNs produced
When using my own data I often get:
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
'Calloc' could not allocate memory (56076596 of 8 bytes)
Is it possible at all to include random slopes using coxme?
If not, is there any other alternative using other package?
Answer from Terry Therneau, author of the coxme package via email - he asked me to post this here.
Below is my rewrite of your example, removing the Surv indirection and using '=' inside the data.frame call (I'm a bit surprised that <- works in that context), and adding set.seed so that the example is reproducible.
library (coxme)
set.seed(1953)
time1 <- runif(5000, 1985, 2015)
time2 <- time1 + runif(5000, 2, 10)
test <- data.frame (
x1 = runif(5000, 5.0, 7.5),
x2 = runif(5000, 5.0, 7.5),
death = rbinom(5000, 1, 0.8),
letters = sample(LETTERS, 5000, replace = TRUE),
time1 = time1,
time2 = time2)
fit1 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1|letters), data=test)
fit2 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+x1 | letters), test)
fit3 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+x2 | letters), test)
fit4 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+ x1 + x2 | letters),
data=test, vinit= c(1e-6, 1e-8, 1e-8))
All the models work until fit4.
I find your model worrisome, since it has a random slope but no random intercept, in the same way that all regressions through the origin worry me: I have a hard time interpreting the results. Although lme puts intercept terms in by default, coxme does not.
I was hopeful that fit4 would work, and perhaps with better starting estimates it would. The underlying code for coxme is the hardest maximization problem that I have encountered in all my survival work, hard in the sense that the maximizer gets easily lost and never finds its way. This is a function that sometimes needs hand-holding, via limited iteration counts and/or starting estimates. I wish it were not so, and I have some long term plans to improve this by addition of an alternate MCMC based maximizer, which will in theory never get lost but at the expense of much longer computation time.
If any of the variances get too close to zero then the sqrt() message tends to arise as a function of round off error. In your test case, of course, the actual MLE is at a variance of 0. When this happens, I will often check for a zero variance directly by doing fits with a sequence of fixed variances (vfixed argument). If the likelihood is constant or increasing as the variance goes to values of 1e-6 or less, then I assume the MLE is zero and remove that random term from the model.
Terry T.

How to find the index of the minimum value between two specific columns of a matrix with Rcpp?

I have a 5x5 matrix and want to find the index of the smallest value within columns 1 and 3. In R I would do it like this:
set.seed(1984)
m <- matrix(sample.int(25,25), 5)
min <- which(m[,c(1,3)] == min(m[,c(1,3)]), arr.ind = TRUE)
What is the most efficient way of doing that with Rcpp?
I would opt to use RcppArmadillo over Rcpp as it has more robust matrix manipulation. For instance, you can find the index_min() or index_max() of a subset quickly and then translate it to subscript notation with ind2sub(). One thing to take note of is C++ uses 0-based indices and R uses 1-based indices so you should make sure to add 1 if the goal is to use the index subscript in R.
The following should work for your case:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::urowvec get_min_cols(const arma::mat& x, const arma::uvec& col_indexes) {
  // Obtain the minimum index from the selected columns
  arma::uword min_index = x.cols(col_indexes).index_min();
  // Obtain the subscript notation from index based on reduced dimensions
  arma::uvec min_subnot = arma::ind2sub(arma::size(x.n_rows, col_indexes.n_elem),
                                        min_index);
  // Transpose to row vector and
  // translate indices to _R_ from _C++_ by adding 1
  return min_subnot.t() + 1;
}
Test:
set.seed(1984)
m = matrix(sample.int(25,25), 5)
col_indexes = c(1, 3)
min_loc_r = which(m[, col_indexes] == min(m[, col_indexes]), arr.ind = TRUE)
# Note that the column indices have been translated to C++ (0-based)
min_loc_cpp = get_min_cols(m, col_indexes - 1)
min_loc_r
#      row col
# [1,]   5   2
min_loc_cpp
#      [,1] [,2]
# [1,]    5    2
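To make the index bookkeeping explicit, here is a dependency-free sketch of the same lookup in plain C++, without Armadillo or Rcpp. The function name min_in_cols is hypothetical, and the matrix is assumed to be stored column-major in a flat vector, which is how R lays out matrices. The column indices passed in are 0-based; the returned pair is 1-based. Unlike which(), which reports every tie, this returns only the first minimum in column-major order.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Find the minimum over the selected columns of a column-major matrix.
// `cols` holds 0-based column indices into the full matrix.
// Returns 1-based (row, position within `cols`), i.e. the subscript in the
// reduced matrix m[, cols] as R would report it.
std::pair<std::size_t, std::size_t> min_in_cols(const std::vector<double>& x,
                                                std::size_t n_rows,
                                                const std::vector<std::size_t>& cols) {
    assert(n_rows > 0 && !cols.empty() && x.size() % n_rows == 0);
    const std::size_t n_cols = x.size() / n_rows;
    assert(cols[0] < n_cols);
    std::size_t best_r = 0, best_c = 0;
    double best = x[cols[0] * n_rows];
    for (std::size_t c = 0; c < cols.size(); ++c) {
        assert(cols[c] < n_cols);
        for (std::size_t r = 0; r < n_rows; ++r) {
            const double v = x[cols[c] * n_rows + r];  // element (r, cols[c])
            if (v < best) {
                best = v;
                best_r = r;
                best_c = c;
            }
        }
    }
    // Translate from 0-based C++ indices to 1-based R indices
    return {best_r + 1, best_c + 1};
}
```

For a 2x3 matrix with columns (5, 1), (9, 0), (4, 2), selecting 0-based columns 0 and 2 gives row 2, column 1 of the reduced matrix.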

How to compute mutual information with MapReduce in R?

I want to compute the mutual information I(x, y) for all pairs of features x, y.
So I need to compute P(x), P(y) and P(x, y) from the data, for example:
X    Y
---  ---
yes  2
no   2
yes  2
no   1
yes  1
p(yes)=3/5 p(2)=3/5 p(yes,2)=2/5
Counting in MapReduce is easy and I did it for P(x) and P(y),
but for P(x, y) I want to count the co-occurrences within each object (row).
I wrote a map function:
mapper <- function(key, line) {
  fvec <- unlist(strsplit(line, split = " "))
  for (i in 1:55) {
    for (j in (i + 1):56) {
      fvec <- c(fvec, paste0(fvec[i], ",", fvec[j]))
    }
  }
  keyval(fvec, 1)
}
"fvec" is the vector of features of one row; for the first row of our example:
fvec = "yes" "2"
In the for loops I want to append the concatenated pairs of features so that
fvec = "yes" "2" "yes,2"
in order to count the occurrences of yes and 2 together in the reduce function:
reduce = function(k, v) {
  return(keyval(k, length(v)))
}
But because of the "LOOP PROBLEM" in Hadoop it does not work properly.
Please help me with an R solution to handle it :)
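For checking the counts independently of Hadoop, the quantity being asked for can be sketched on a single machine. The following C++ is not a MapReduce job and does not use any R or Hadoop API; it only illustrates the arithmetic I(X;Y) = sum over (x,y) of p(x,y) * log(p(x,y) / (p(x) p(y))) with empirical frequencies. The function name mutual_information is hypothetical, and the result is in nats (natural logarithm).

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Row = std::pair<std::string, std::string>;

// Empirical mutual information of two categorical columns, in nats.
// The three maps hold the same counts a MapReduce job would produce for the
// keys x, y and the pair (x, y).
double mutual_information(const std::vector<Row>& rows) {
    assert(!rows.empty());
    std::map<std::string, double> count_x, count_y;
    std::map<Row, double> count_xy;
    for (const Row& r : rows) {
        count_x[r.first] += 1.0;
        count_y[r.second] += 1.0;
        count_xy[r] += 1.0;  // co-occurrence of x and y in the same row
    }
    const double n = static_cast<double>(rows.size());
    double mi = 0.0;
    for (const auto& kv : count_xy) {
        const double pxy = kv.second / n;
        const double px = count_x[kv.first.first] / n;
        const double py = count_y[kv.first.second] / n;
        mi += pxy * std::log(pxy / (px * py));
    }
    return mi;
}
```

Only pairs that actually occur contribute to the sum, so no zero probabilities reach the logarithm.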

Product of a multi-dimensional array (or tensor) and vectors

I would like to ask for a fast way to perform the following operations, either in native Matlab, C++, or using toolboxes/libraries, whichever would give the fastest solutions.
Let M be a tensor of D dimensions: n1 x n2 x... x nD, and let v1, v2,..., vD be D vectors whose dimensions are respectively n1, n2,..., nD.
Compute the product M*vi (1 <= i <= D). The result is a multi-dimensional array of (D-1) dimensions.
Compute the product of M with all vectors, except vi.
For example, with D = 3:
The product of M and v1 is a tensor N of 2 dimensions (i.e. a matrix) where
N[i2][i3] = Sum_over_i1 of M[i1][i2][i3]*v1[i1]
The product of M and v2 is a matrix N where
N[i1][i3] = Sum_over_i2 of M[i1][i2][i3]*v2[i2]
The product of M and v2 and v3 is a vector v where
v[i1] = Sum_over_i2 of (Sum_over_i3 of M[i1][i2][i3]*v2[i2]*v3[i3])
A further question: the above but for sparse tensors.
An example of Matlab code is given below.
Thank you very much in advance for your help!!
n1 = 3;
n2 = 5;
n3 = 4;
M = randn(n1,n2,n3);
v1 = randn(n1,1);
v2 = randn(n2,1);
v3 = randn(n3,1);
%% N = M*v2
N = zeros(n1,n3);
for i1=1:n1
    for i3=1:n3
        for i2=1:n2
            N(i1,i3) = N(i1,i3) + M(i1,i2,i3)*v2(i2);
        end
    end
end
%% v = M*v2*v3
v = zeros(n1,1);
for i1=1:n1
    for i2=1:n2
        for i3=1:n3
            v(i1) = v(i1) + M(i1,i2,i3)*v2(i2)*v3(i3);
        end
    end
end
I've noticed that the operation you are describing takes the (D - 1)-dimensional slices of M, scales each by the corresponding entry of vi, and then sums the result over the index of vi. This code seems to work for getting N in your example:
N2 = squeeze(sum(M.*(v2)', 2));
To get v in your code, all you need to do is multiply N2 by v3:
v = N2*v3;
EDIT
On older versions of MATLAB (before R2016b, which introduced implicit expansion) the element-wise operator .* doesn't work the way I've used it above. One alternative is bsxfun:
N2 = squeeze(sum(bsxfun(@times, M, v2'), 2));
Just checked: In terms of performance, the bsxfun way seems as fast as the .* way for large arrays, at least on R2016b.
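Since the question also allows C++, the D = 3 case could be sketched with flat arrays as below. This is a plain-loop illustration, not a tuned or library-backed implementation, and it covers only the dense case. The names contract_mode2 and mat_vec are hypothetical, and the tensor is assumed to be stored column-major as in MATLAB, so M(i1,i2,i3) lives at M[i1 + n1*(i2 + n2*i3)].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// N(i1,i3) = sum over i2 of M(i1,i2,i3) * v2(i2).
// M is n1 x n2 x n3, column-major; N is n1 x n3, column-major.
std::vector<double> contract_mode2(const std::vector<double>& M,
                                   std::size_t n1, std::size_t n2, std::size_t n3,
                                   const std::vector<double>& v2) {
    assert(M.size() == n1 * n2 * n3 && v2.size() == n2);
    std::vector<double> N(n1 * n3, 0.0);
    // i1 is the innermost loop so that M is read contiguously
    for (std::size_t i3 = 0; i3 < n3; ++i3)
        for (std::size_t i2 = 0; i2 < n2; ++i2)
            for (std::size_t i1 = 0; i1 < n1; ++i1)
                N[i1 + n1 * i3] += M[i1 + n1 * (i2 + n2 * i3)] * v2[i2];
    return N;
}

// v(i1) = sum over i3 of N(i1,i3) * v3(i3), with N column-major n1 x n3.
std::vector<double> mat_vec(const std::vector<double>& N,
                            std::size_t n1, std::size_t n3,
                            const std::vector<double>& v3) {
    assert(N.size() == n1 * n3 && v3.size() == n3);
    std::vector<double> v(n1, 0.0);
    for (std::size_t i3 = 0; i3 < n3; ++i3)
        for (std::size_t i1 = 0; i1 < n1; ++i1)
            v[i1] += N[i1 + n1 * i3] * v3[i3];
    return v;
}
```

Contracting one index at a time, as in the MATLAB answer, avoids the triple product inside the innermost loop: the product with v2 and v3 is mat_vec(contract_mode2(M, n1, n2, n3, v2), n1, n3, v3).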

elem function on an infinite list

I have this list comprehension in Haskell:
paar = [(a,b) | a<-[a | a<-[1..], mod a 3 == 0], b<-[b*b | b<-[1..]]]
where a is divisible by 3 and b is a square.
The elements must be generated in a fair order, so that
the test elem (9, 9801) must be True.
My error:
Main> elem (9, 9801) test
ERROR - Garbage collection fails to reclaim sufficient space
How can I implement this with Cantor's diagonal argument?
Thanks
Not quite sure what your goal is here, but here's the reason why your code blows up.
Prelude> let paar = [(a,b) | a<-[a | a<-[1..], mod a 3 == 0], b<-[b*b | b<-[1..]]]
Prelude> take 10 paar
[(3,1),(3,4),(3,9),(3,16),(3,25),(3,36),(3,49),(3,64),(3,81),(3,100)]
Notice you're generating all the (3, ?) pairs before any other. The elem function works by searching this list linearly from the beginning. As there are an infinite number of (3, ?) pairs, you will never reach the (9, ?) ones.
In addition, your code is probably holding on to paar somewhere, preventing it from being garbage collected. This results in elem (9, 9801) paar taking not only infinite time but also infinite space, leading to the crash you described.
Ultimately, you probably need to take another approach to solving your problem. For example, something like this:
elemPaar :: (Integer, Integer) -> Bool
elemPaar (a, b) = mod a 3 == 0 && isSquare b
where isSquare = ...
Or alternatively figure out some other search strategy than straight up linear search through an infinite list.
Here's an alternate ordering of the same list (by hammar's suggestion):
-- the integer points along the diagonals of slope -1 on the cartesian plane,
-- organized by x-intercept
-- diagonals = [ (0,0), (1,0), (0,1), (2,0), (1,1), (0,2), ...
diagonals = [ (n-i, i) | n <- [0..], i <- [0..n] ]
-- the multiples of three paired with the squares
paar = [ (3*x, y^2) | (x,y) <- diagonals ]
and in action:
ghci> take 10 diagonals
[(0,0),(1,0),(0,1),(2,0),(1,1),(0,2),(3,0),(2,1),(1,2),(0,3)]
ghci> take 10 paar
[(0,0),(3,0),(0,1),(6,0),(3,1),(0,4),(9,0),(6,1),(3,4),(0,9)]
ghci> elem (9, 9801) paar
True
By using a diagonal path to iterate through all the possible values, we guarantee that we reach each finite point in finite time (though some points are still outside the bounds of memory).
As hammar points out in his comment, though, this isn't sufficient, as it will still take an infinite amount of time to get a False answer.
However, we have an order on the elements of paar, namely (3*a,b^2) comes before (3*c,d^2) when a + b < c + d. So to determine whether a given pair (x,y) is in paar, we only have to check pairs (p,q) while p/3 + sqrt q <= x/3 + sqrt y.
To avoid using floating-point numbers, we can use a slightly looser condition, that p <= x || q <= y.
Certainly p > x && q > y implies p/3 + sqrt q > x/3 + sqrt y, so this will still include any possible solutions, and it's guaranteed to terminate.
So we can build this check in
-- check only a finite number of elements so we can get a False result as well
isElem (p, q) = elem (p,q) $ takeWhile (\(a,b) -> a <= p || b <= q) paar
And use it:
ghci> isElem (9,9801)
True
ghci> isElem (9,9802)
False
ghci> isElem (10,9801)
False