Sampling from a boolean matrix in Eigen - c++

I have a matrix A of this form:
Eigen::Matrix<bool, n, m> A(n, m)
and I want to obtain a random element among the ones that are 'true'. The silly way to do that would be to obtain the number of 'true' elements t, generate a random number between 1 and t and iterate:
//r = random number
int k = 0;
for (int i = 0; i < A.rows(); ++i)
for (int j = 0; j < A.cols(); ++j)
{
if (A(i, j) && ++k == r) // count only 'true' entries and report the r-th one exactly once
std::cout << "(" << i << ", " << j << ")" << std::endl;
}
This solution is incredibly slow when multiple samples are needed and the matrix is big. Any suggestion as to how I should go about this?
In short: I'd like to find an efficient way to obtain the i-th 'true' element of the above matrix.

You could use Eigen::SparseMatrix instead.
Eigen::SparseMatrix<bool> A(n, m);
With its compressed column (or row) storage scheme, you could find the r-th non-zero element in O(m) (or O(n)) time by scanning the outer index array, or in O(log(m)) with a binary search over it.
Alternatively, store the non-zero coordinates in COO format, using a vector of Eigen::Triplet, and look up the r-th non-zero element by index in O(1) time.
std::vector<Eigen::Triplet<bool> > a(num_nonzeros);
And yes, since it's a bool matrix, storing the values is unnecessary too.
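A minimal sketch of the coordinate-list idea, written with standard-library containers instead of Eigen so that it stays self-contained. The helper names `collect_true_cells` and `sample_true_cell` are hypothetical, and the matrix is assumed to be a `std::vector<std::vector<bool>>`; with an Eigen matrix the same loop would read `A(i, j)`.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Cell = std::pair<int, int>;  // (row, column)

// One O(n*m) pass that records the coordinates of every 'true' entry.
std::vector<Cell> collect_true_cells(const std::vector<std::vector<bool>>& A) {
    std::vector<Cell> cells;
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[i].size(); ++j)
            if (A[i][j])
                cells.emplace_back(static_cast<int>(i), static_cast<int>(j));
    return cells;
}

// O(1) per sample: pick a uniformly random index into the coordinate list.
// Precondition: cells is not empty.
template <class Rng>
Cell sample_true_cell(const std::vector<Cell>& cells, Rng& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, cells.size() - 1);
    return cells[pick(rng)];
}
```

The list is built once and then reused for every sample, so drawing many samples from a large matrix costs one scan plus constant work per draw. If the matrix changes between draws, the list has to be rebuilt or updated.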

Related

How to calculate a mismatch score between n number of strings more efficiently?

Suppose I have a vector that contains n strings, where the strings can be length 5...n. Each string must be compared with each string character by character. If there is a mismatch, the score is increased by one. If there is a match, the score does not increase. Then I will store the resulting scores in a matrix.
I have implemented this in the following way:
for (auto i = 0u; i < vector.size(); ++i)
{
// vector.size() x vector.size() matrix
std::string first = vector[i]; //horrible naming convention
for (auto j = 0u; j < vector.size(); ++j)
{
std::string second = vector[j];
int score = 0;
for (auto k = 0u; k < sizeOfStrings; ++k)
{
if(first[k] == second[k])
{
score += 0;
}
else
{
score += 1;
}
}
//store score into matrix
}
}
I am not happy with this solution because it is O(n^3). So I have been trying to think of other ways to make this more efficient. I have thought about writing another function that would replace the innards of our j for loop, however, that would still be O(n^3) since the function would still need a k loop.
I have also thought about a queue, since I only care about string[0] compared to string[1] to string[n]. String[1] compared to string[2] to string[n]. String[2] compared to string[3] to string[n], etc. So my solutions have unnecessary computations since each string is comparing to every other string. The problem with this, is I am not really sure how to build my matrix out of this.
Finally, I have looked into the standard template library, but neither std::mismatch nor std::find seems to be what I am looking for. What other ideas do you have?
I don't think you can easily get away from O(n^3) comparisons, but you can easily implement the change you talk about. Since the comparisons only need to be done one way (i.e. comparing string[1] to string[2] is the same as comparing string[2] to string[1]), as you point out, you don't need to iterate through the entire array each time and can change the start value of your inner loop to be the current index of your outer loop:
for (auto i = 0u; i < vector.size(); ++i) {
    // vector.size() x vector.size() matrix
    const std::string& first = vector[i];
    for (auto j = i; j < vector.size(); ++j) {
        const std::string& second = vector[j];
        for (auto k = 0u; k < sizeOfStrings; ++k) {
            if (first[k] != second[k]) {
                M[i][j]++;
            }
        }
    }
}
To store the result in a matrix, set up an n x n matrix M (with n = vector.size()), initialize it to all zeroes, and increment M[i][j] on each mismatch, as the loop body above does.
If you have n strings each of length m, then no matter what (even with your queue idea), you have to do at least (n-1)+(n-2)+...+(1)=n(n-1)/2 string comparisons, so you'll have to do (n(n-1)/2)*m char comparisons. So no matter what, your algorithm is going to be O(mn^2).
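As a rough sketch of how the half-matrix loop can still fill the whole score matrix: compute each pair once and mirror the result. The function name `mismatch_matrix` is hypothetical, and the code assumes all strings have the same length.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fills a symmetric n x n matrix of mismatch counts.
// Assumption: every string has the same length.
std::vector<std::vector<int>> mismatch_matrix(const std::vector<std::string>& strings) {
    const std::size_t n = strings.size();
    std::vector<std::vector<int>> M(n, std::vector<int>(n, 0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {  // the diagonal stays 0
            int score = 0;
            for (std::size_t k = 0; k < strings[i].size(); ++k)
                if (strings[i][k] != strings[j][k]) ++score;
            M[i][j] = M[j][i] = score;  // the score is symmetric
        }
    }
    return M;
}
```

This performs n(n-1)/2 string comparisons, matching the count derived above, and leaves the diagonal at zero because a string never mismatches itself.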
General comment:
You don't have to compare a string with itself. More importantly, your second loop starts from the beginning each time even though those differences have already been computed, so change the second loop to start from i+1.
Doing so roughly halves the number of comparisons, since you no longer revisit pairs you have already checked or compare a string with itself.
Improvement
Sort the vector and remove duplicate entries; then, instead of wasting computation on identical strings, you only compare those that differ.
The other answers that say this is at least O(mn^2) or O(n^3) are incorrect. This can be done in O(mn) time where m is string size and n is number of strings.
For simplicity we'll start with the assumption that all characters are ascii.
You have a data structure:
int counts[m][256]
where counts[x][y] is the number of strings that have the character with code y at index x in the string.
Now, if you did not restrict to ascii, then you would need to use a std::map
std::map<CharT, int> counts[m] // CharT stands for whatever character type you use
But it works the same way: at index x in counts you have a map in which each entry (y, z) tells you how many strings z have character y at index x. You would also want to choose a map with constant time lookups and constant time insertions (for example std::unordered_map) to match the complexity.
Going back to ascii and the array
int counts[m][256] // start by initializing this array to all zeros
First initialize the data structure:
m is size of strings,
vec is a std::vector with the strings
for (int i = 0; i < vec.size(); i++) {
std::string str = vec[i];
for(int j = 0; j < m; j++) {
counts[j][static_cast<unsigned char>(str[j])]++; // cast so a negative char cannot index out of bounds
}
}
Now that you have this structure, you can calculate the scores easily:
for (int i = 0; i < vec.size(); i++) {
std::string str = vec[i];
int score = 0;
for(int j = 0; j < m; j++) {
score += counts[j][static_cast<unsigned char>(str[j])] - 1; //subtracting 1 gives how many other strings have that same char at that index
}
std::cout << "string \"" << str << "\" has score " << score;
}
As you can see by this code, this is O(m * n)
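A hedged sketch of how the counting table could be turned into mismatch totals, since the question scores mismatches rather than matches. The function name `total_mismatches` is hypothetical and all strings are assumed to have the same length m. Note that this yields one aggregate per string (the row sums of the score matrix), not the individual pairwise entries; a full n x n matrix has n^2 entries, so filling it cannot take fewer than n^2 steps.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// For each string, the total number of character mismatches against all
// other strings, computed in O(m * n).
// Assumption: every string has the same length m.
std::vector<long long> total_mismatches(const std::vector<std::string>& vec) {
    if (vec.empty()) return {};
    const std::size_t n = vec.size();
    const std::size_t m = vec[0].size();

    // counts[j][c] = number of strings with byte value c at position j
    std::vector<std::array<int, 256>> counts(m);
    for (auto& row : counts) row.fill(0);
    for (const auto& s : vec)
        for (std::size_t j = 0; j < m; ++j)
            ++counts[j][static_cast<unsigned char>(s[j])];

    std::vector<long long> totals(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            const int same = counts[j][static_cast<unsigned char>(vec[i][j])];
            // strings that differ at position j = all strings minus the matching ones
            totals[i] += static_cast<long long>(n) - same;
        }
    return totals;
}
```

For the strings "abcde", "abxde" and "zzzzz" the totals are 6, 6 and 10, which equal the sums of the corresponding rows of the pairwise matrix.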

Count number of sub-sequences of given array such that their sum is less or equal to given number?

I have an array of size n of integer values and a given number S.
1<=n<=30
I want to find the total number of sub-sequences such that, for each sub-sequence, the sum of its elements is less than or equal to S.
For example: let n=3, S=5, and let the array's elements be {1,2,3}. Then it has 7 non-empty sub-sequences in total:
{1},{2},{3},{1,2},{1,3},{2,3},{1,2,3}
but, required sub sequences is:
{1},{2},{3},{1,2},{1,3},{2,3}
That is, {1,2,3} is not taken because its element sum is (1+2+3)=6, which is greater than S (6>S). The others are taken because their element sums are at most S.
So the total number of valid sub-sequences is 6.
So my answer is the count, which is 6.
I have tried a recursive method, but its time complexity is O(2^n).
Please help me do it in polynomial time.
You can (probably) solve this in reasonable time using the pseudo-polynomial algorithm for the knapsack problem, provided the numbers are restricted to be non-negative (I'm going to assume strictly positive). It is called pseudo-polynomial because it runs in O(nS) time. This looks polynomial, but it is not, because the problem has two complexity parameters: the first is n, and the second is the "size" of S, i.e. the number of binary digits in S, call it M. So this algorithm is actually O(n 2^M).
To solve this problem, let's define a two-dimensional matrix A with n rows and S columns. We will say that A[i][j] is the number of non-empty sub-sequences that can be formed using the first i elements with a sum of at most j. Immediately observe that the bottom-right element of A is the solution, i.e. A[n][S] (yes, we are using 1-based indexing).
Now, we want a formula for A[i][j]. Observe that every subsequence using the first i elements either includes the ith element or does not. The number of subsequences that don't is just A[i-1][j]. The number of subsequences that combine the ith element with earlier elements is A[i-1][j-v[i]], where v[i] is the value of the ith element, because once the ith element is included the rest of the sum must stay at or below j-v[i]. Since A does not count the empty subsequence, the subsequence consisting of the ith element alone adds one more whenever v[i] <= j. Adding these together gives the total number. This leads to the following algorithm (note: I use zero-based indexing for elements and i, but 1-based for j):
std::vector<int> elements{1,2,3};
int S = 5;
auto N = elements.size();
std::vector<std::vector<int>> A;
A.resize(N);
for (auto& v : A) {
v.resize(S+1); // 1 based indexing for j/S, otherwise too annoying
}
// Number of subsequences using only first element is either 0 or 1
for (int j = 1; j != S+1; ++j) {
A[0][j] = (elements[0] <= j);
}
for (int i = 1; i != N; ++i) {
for (int j = 1; j != S+1; ++j) {
A[i][j] = A[i-1][j]; // sequences that don't use ith element
auto leftover = j - elements[i];
if (leftover >= 0) ++A[i][j]; // sequence with only ith element, if i fits
if (leftover >= 1) { // sequences with i and other elements
A[i][j] += A[i-1][leftover];
}
}
}
Running this program and then outputting A[N-1][S] yields 6 as required. If this program does not run fast enough you can significantly improve performance by using a single vector instead of a vector of vectors (and you can save a bit of space/perf by not wasting a column in order to 1-index, as I did).
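The single-vector variant mentioned above might look like the following sketch. It is not the answer's exact table: here `ways[s]` counts subsequences with sum exactly s (including the empty one), and the function name `count_subsequences` is hypothetical. It assumes all elements are positive and S is non-negative.

```cpp
#include <cstddef>
#include <vector>

// Counts non-empty subsequences whose element sum is at most S.
// Assumptions: all elements are positive and S >= 0.
long long count_subsequences(const std::vector<int>& elements, int S) {
    // ways[s] = number of subsequences (including the empty one) with sum exactly s
    std::vector<long long> ways(static_cast<std::size_t>(S) + 1, 0);
    ways[0] = 1;
    for (int v : elements) {
        // Iterate downwards so each element is used at most once.
        for (int s = S; s >= v; --s)
            ways[s] += ways[s - v];
    }
    long long total = 0;
    for (long long w : ways) total += w;
    return total - 1;  // drop the empty subsequence
}
```

Calling `count_subsequences({1, 2, 3}, 5)` returns 6, matching the example in the question. The memory use drops from O(nS) to O(S) while the running time stays O(nS).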
Yes. This problem can be solved in pseudo-polynomial time.
Let me redefine the problem statement as "Count the number of subsets that have SUM <= K".
Given below is a solution that works in O(N * K),
where N is the number of elements and K is the target value.
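#include <vector>

// Assumes all elements are non-negative and K >= 0.
int countSubsets(const std::vector<int>& set, int K) {
    if (set.empty()) return 0;
    const int N = static_cast<int>(set.size());
    // dp[i][k] = number of non-empty subsets of set[0..i] with sum exactly k
    std::vector<std::vector<int>> dp(N, std::vector<int>(K + 1, 0));
    //1. Iterate through all the elements in the set.
    for (int i = 0; i < N; i++) {
        if (set[i] <= K) dp[i][set[i]] = 1; // the subset {set[i]} on its own
        if (i == 0) continue;
        //2. Include the count of subsets that don't include the element set[i]
        for (int k = 0; k <= K; k++) {
            dp[i][k] += dp[i-1][k];
        }
        //3. Now count subsets that include element set[i] together with earlier elements
        for (int k = 0; k + set[i] <= K; k++) {
            dp[i][k+set[i]] += dp[i-1][k];
        }
    }
    //4. Return the sum of the last row of the dp table.
    int count = 0;
    for (int k = 0; k <= K; k++) {
        count += dp[N-1][k];
    }
    // the empty subset is never counted, so no correction is needed
    return count;
}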

What is the fastest way to sort these n^2 numbers?

Given a number 'n', I want to return a sorted array of n^2 numbers containing all the values of k1*k2 where k1 and k2 can range from 1 to n.
For example, for n=2 it would return: {1,2,2,4} (the numbers are basically 1*1, 1*2, 2*1, 2*2).
and for n=3 it would return : {1,2,2,3,3,4,6,6,9}.
(the numbers being : 1*1, 2*1, 1*2, 2*2, 3*1, 1*3, 3*2, 2*3, 3*3)
I tried it using sort function from c++ standard library, but I was wondering if it could be further optimized.
Well, first of all, you get n^2 entries, the largest of which will be n^2, and of the possible value range, only a tiny amount of values is used for large n. So, I'd suggest a counting approach:
Initialize an array counts[] of size n^2 with zeros.
Iterate through your array of values values[], and do counts[values[i]-1]++.
Reinitialize the values array by iterating through the counts array, dropping as many values of i+1 into the values array as counts[i] gives you.
That's all. It's O(n^2), so you'll hardly find a more performant solution.
vector<int> count(n*n+1);
for (int i = 1; i <= n; ++i)
for (int j = 1; j <= n; ++j)
++count[i*j];
for (int i = 1; i <= n*n; ++i)
for (int j = 0; j < count[i]; ++j)
cout << i << " ";
This is in essence the O(n*n) solution as described in cmaster's answer.

How to add a block matrix onto a sparse matrix in Eigen

For example I have a 10x10 SparseMatrix A, and I want to add a 3x3 identity matrix to the upper left corner of A.
A is known to be already non-zero in those 3 entries.
If I have to add the values one by one, that is OK too, but I didn't find a method for manipulating individual elements of a sparse matrix in Eigen.
Did I miss something?
If all you want is to apply an operation to a specific element at a time, you can use coeffRef like so:
typedef Eigen::Triplet<double> T;
std::vector<T> coefficients;
for (int i = 0; i < 9; i++) coefficients.push_back(T(i, i, (i % 3) + 1));
Eigen::SparseMatrix<double> A(10, 10);
A.setFromTriplets(coefficients.begin(), coefficients.end());
std::cout << A << "\n\n";
for (int i = 0; i < 3; i++) A.coeffRef(i,i) += 1;
std::cout << A << "\n\n";

N choose k for large n and k

I have n elements stored in an array and a subset size k (n choose k).
I have to find all the possible combinations of k elements in the array of length n and, for each set (of length k), make some calculations on the chosen elements.
I have written a recursive algorithm (in C++) that works fine, but for large numbers it crashes by running out of heap space.
How can I fix the problem? How can I calculate all the sets of n choose k for large n and k?
Is there any library for C++ that can help me?
I know it is an NP problem, but I would like to write the best code possible in order to handle the largest numbers possible.
Approximately what are the largest values of n and k beyond which it becomes infeasible?
I am only asking for the best algorithm, not for infeasible space/work.
Here my code
vector<int> people;
vector<int> combination;
void pretty_print(const vector<int>& v)
{
static int count = 0;
cout << "combination no " << (++count) << ": [ ";
for (int i = 0; i < v.size(); ++i) { cout << v[i] << " "; }
cout << "] " << endl;
}
void go(int offset, int k)
{
if (k == 0) {
pretty_print(combination);
return;
}
for (int i = offset; i <= people.size() - k; ++i) {
combination.push_back(people[i]);
go(i+1, k-1);
combination.pop_back();
}
}
int main() {
int n = #, k = #;
for (int i = 0; i < n; ++i) { people.push_back(i+1); }
go(0, k);
return 0;
}
Here is non recursive algorithm:
const int n = ###;
const int k = ###;
int currentCombination[k];
for (int i=0; i<k; i++)
currentCombination[i]=i;
currentCombination[k-1] = k-1-1; // start one step before the real first combination, because the loop increments the last number before printing
do
{
if (currentCombination[k-1] == (n-1) ) // the last number has reached its maximum
{
int i = k-1-1;
while (currentCombination[i] == (n-k+i))
i--;
currentCombination[i]++;
for (int j=(i+1); j<k; j++)
currentCombination[j] = currentCombination[i]+j-i;
}
else
currentCombination[k-1]++;
for (int i=0; i<k; i++)
printf("%d ", currentCombination[i]); // portable replacement for _tprintf; needs <cstdio>
printf("\n");
} while (! ((currentCombination[0] == (n-1-k+1)) && (currentCombination[k-1] == (n-1))) );
Your recursive algorithm might be blowing the stack. If you make it non-recursive, then that would help, but it probably won't solve the problem if your case is really 100 choose 10. You have two problems. First, few, if any, computers in the world have the 17+ terabytes of memory needed to hold all the combinations at once. Second, going through 17 trillion+ iterations to generate all the combinations will take way too long. You need to rethink the problem and either come up with an N choose K case that is more reasonable, or process only a certain subset of the combinations.
You probably do not want to be processing more than a billion or two combinations at the most - and even that will take some time. That translates to around 41 choose 10 to about 44 choose 10. Reducing either N or K will help. Try editing your question and posting the problem you are trying to solve and why you think you need to go through all of the combinations. There may be a way to solve it without going through all of the combinations.
If it turns out you do need to go through all those combinations, then maybe you should look into using a search technique like a genetic algorithm or simulated annealing. Both of these hill climbing search techniques provide the ability to search a large space in a relatively small time for a close to optimal solution, but neither guarantee to find the optimal solution.
You can use std::next_permutation() from <algorithm> to generate all possible combinations.
Here is some example code:
std::vector<bool> is_chosen(n, false);
fill(is_chosen.begin() + n - k, is_chosen.end(), true);
do
{
for(int i = 0; i < n; i++)
{
if(is_chosen[i])
cout << some_array[i] << " ";
}
cout << endl;
} while( next_permutation(is_chosen.begin(), is_chosen.end()) );
Don't forget to include <algorithm>.
As I said in a comment, it's not clear what you really want.
If you want to compute (n choose k) for relatively small values, say n,k < 100 or so, you may want to use a recursive method, using Pascal's triangle.
If n,k are large (say n=1000000, k=500000), you may be happy with an approximate result using Stirling's formula for the factorial: (n choose k) = exp(loggamma(n+1)-loggamma(k+1)-loggamma(n-k+1)), computing loggamma(x) via Stirling's formula.
If you want (n choose k) for all or many k but the same n, you can simply iterate over k and use (n choose k+1) = ((n choose k)*(n-k))/(k+1).
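A small sketch of the last two suggestions, useful for checking how many combinations a given n and k produce before trying to enumerate them. The names `binomial` and `log_binomial` are hypothetical. The exact version assumes the result and its intermediate product fit in 64 bits, and the approximate version uses `std::lgamma` from `<cmath>` in place of a hand-written Stirling expansion.

```cpp
#include <cmath>
#include <cstdint>

// Exact n choose k via C(n, i+1) = C(n, i) * (n - i) / (i + 1).
// Assumption: the result and the intermediate product fit in 64 bits.
std::uint64_t binomial(std::uint64_t n, std::uint64_t k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;  // symmetry keeps the loop short
    std::uint64_t c = 1;
    for (std::uint64_t i = 0; i < k; ++i)
        c = c * (n - i) / (i + 1);  // the division is exact at every step
    return c;
}

// Natural logarithm of n choose k, using lgamma(x + 1) = log(x!).
double log_binomial(double n, double k) {
    return std::lgamma(n + 1) - std::lgamma(k + 1) - std::lgamma(n - k + 1);
}
```

For instance, `binomial(100, 10)` gives 17310309456440 and `binomial(41, 10)` gives 1121099408, which lines up with the "17 trillion" and "a billion or two" figures quoted in the earlier answer.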