generate a truth table given an input? - c++

Is there a smart algorithm that takes a number of probabilities and generates the corresponding truth table inside a multi-dimensional array or container?
Example:
n = 3
N : [0 0 0
     0 0 1
     0 1 0
     ...
     1 1 1]
I can do it with for loops and ifs, but I know my way will be slow and time consuming. So I am asking if there is an advanced feature I can use to do this as efficiently as possible.

If we're allowed to fill the table with all zeroes to start, it should be possible to then perform exactly 2^n - 1 fills to set the 1 bits we desire. This may not be faster than writing a manual loop, though; it's totally unprofiled.
EDIT:
The line std::vector<std::vector<int> > output(n, std::vector<int>(1 << n)); declares a vector of vectors. The outer vector is length n, and the inner one is 2^n (the number of truth results for n inputs) but I do the power calculation by using left shift so the compiler can insert a constant rather than a call to, say, pow. In the case where n=3 we wind up with a 3x8 vector. I organize it in this way (rather than the customary 8x3 with row as the first index) because we're going to take advantage of a column-based pattern in the output data. Using the vector constructors in this way also ensures that each element of the vector of vectors is initialized to 0. Thus we only have to worry about setting the values we want to 1 and leave the rest alone.
The second set of nested for loops is just used to print out the resulting data when it's done, nothing special there.
The first set of for loops implements the real algorithm. We're taking advantage of a column-based pattern in the output data here. For a given truth table, the left-most column will have two pieces: The first half is all 0 and the second half is all 1. Since we pre-filled zeroes, a single fill of half the column height starting halfway down will apply all the 1s we need. The second column will have rows 1/4th 0, 1/4th 1, 1/4th 0, 1/4th 1. Thus two fills will apply all the 1s we need. We repeat this until we get to the rightmost column in which case every other row is 0 or 1.
We start out saying "I need to fill half the rows at once" (unsigned num_to_fill = 1U << (n - 1);). Then we loop over each column. The first column starts at the position to fill, and fills that many rows with 1. Then we increment the row and reduce the fill size by half (now we're filling 1/4th of the rows at once, but we then skip blank rows and fill a second time) for the next column.
For example:
#include <algorithm> // for std::fill_n
#include <iostream>
#include <vector>

int main()
{
    const unsigned n = 3;
    std::vector<std::vector<int> > output(n, std::vector<int>(1 << n));
    unsigned num_to_fill = 1U << (n - 1);
    for(unsigned col = 0; col < n; ++col, num_to_fill >>= 1U)
    {
        for(unsigned row = num_to_fill; row < (1U << n); row += (num_to_fill * 2))
        {
            std::fill_n(&output[col][row], num_to_fill, 1);
        }
    }

    // These loops just print out the results, nothing more.
    for(unsigned x = 0; x < (1 << n); ++x)
    {
        for(unsigned y = 0; y < n; ++y)
        {
            std::cout << output[y][x] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}

You can split this problem into two sections by noticing that each of the rows in the matrix represents an n-bit binary number, where n is the number of probabilities [sic].
So you need to:
iterate over all n-bit numbers
convert each number into a row of your 2d container
Edit:
if you are only worried about runtime then for constant n you could always precompute the table, but I think you are stuck with O(2^n) complexity for when it is computed

You want to write the numbers from 0 to 2^N - 1 in the binary numeral system. There is nothing smart in it. You just have to populate every cell of the two-dimensional array. You cannot do it faster than that.
You can do it without iterating directly over the numbers. Notice that the rightmost column repeats 0 1, the next column repeats 0 0 1 1, the next one 0 0 0 0 1 1 1 1, and so on. Every column repeats 2^columnIndex zeros followed by the same number of ones (counting columns from the right, starting at index 0).

To elaborate on jk's answer...
If you have n boolean values ("probabilities"?), then you need to
create a truth table array that's n by 2^n
loop i from 0 to (2^n-1)
inside that loop, loop j from 0 to n-1
inside THAT loop, set truthTable[i][j] = jth bit of i (i.e. (i >> j) & 1)
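As a minimal sketch of that recipe (my code, not from the original answers), using n = 3; note that (i >> j) & 1 yields the least significant bit first, so I shift by (n - 1 - j) to match the column order shown in the question:

#include <iostream>
#include <vector>

int main()
{
    const unsigned n = 3;
    std::vector<std::vector<int> > table(1U << n, std::vector<int>(n));
    for (unsigned i = 0; i < (1U << n); ++i)          // iterate over all n-bit numbers
        for (unsigned j = 0; j < n; ++j)              // extract each bit of i
            table[i][j] = (i >> (n - 1 - j)) & 1;     // question's left-to-right ordering

    for (unsigned i = 0; i < (1U << n); ++i)
    {
        for (unsigned j = 0; j < n; ++j)
            std::cout << table[i][j] << " ";
        std::cout << "\n";
    }
    return 0;
}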

Karnaugh map or Quine-McCluskey
http://en.wikipedia.org/wiki/Karnaugh_map
http://en.wikipedia.org/wiki/Quine%E2%80%93McCluskey_algorithm
That should head you in the right direction for minimizing the resulting truth table.

Related

How do I solve this making it more efficient?

So, I am trying to solve the following question: https://www.codechef.com/TSTAM15/problems/ACM14AM3
The Mars Orbiter Mission probe lifted-off from the First Launch Pad at Satish Dhawan Space Centre (Sriharikota Range SHAR), Andhra
Pradesh, using a Polar Satellite Launch Vehicle (PSLV) rocket C25 at
09:08 UTC (14:38 IST) on 5 November 2013.
The secret behind this successful launch was the launch pad that ISRO
used. An important part of the launch pad is the launch tower. It is
the long vertical structure which supports the rocket.
ISRO now wants to build a better launch pad for their next mission.
For this, ISRO has acquired a long steel bar, and the launch tower can
be made by cutting a segment from the bar. As part of saving the cost,
the bar they have acquired is not homogeneous.
The bar is made up of several blocks, where the ith block has
durability S[i], which is a number between 0 and 9. A segment is
defined as any contiguous group of one or more blocks.
If they cut out a segment of the bar from ith block to jth block
(i<=j), then the durability of the resultant segment is given by (S[i]*10^(j-i) + S[i+1]*10^(j-i-1) + S[i+2]*10^(j-i-2) + … + S[j]*10^0) % M. In other words, if W(i,j) is the base-10 number formed by
concatenating the digits S[i], S[i+1], S[i+2], …, S[j], then
the durability of the segment (i,j) is W(i,j) % M.
For technical reasons that ISRO will not disclose, the durability of
the segment used for building the launch tower should be exactly L.
Given S and M, find the number of ways ISRO can cut out a segment from
the steel bar whose durability is L.
Input
The first line contains a string S. The ith character of this string
represents the durability of the ith block. The next line contains a
single integer Q, denoting the number of queries. Each of the next Q
lines contains two space separated integers, denoting M and L.
Output
For each query, output the number of ways of cutting the bar on a
separate line.
Constraints
1 ≤ |S| ≤ 2 * 10^4
Q ≤ 5
0 < M < 500
0 ≤ L < M
Example
Input:
23128765
3
7 2
9 3
15 5
Output:
9
4
5
Explanation
For M=9, L=3, the substrings whose remainder is 3 when divided by
9 are: 3, 31287, 12 and 876.
Now, what I did was: I initially generated every possible substring of the given string and checked whether its value leaves the required remainder when divided by the given number, incrementing the answer when it did. My code for this was:
string s;
cin >> s;
int m, l;
long long ans = 0;
cin >> m >> l;
for (size_t i = 0; i < s.length(); i++)
{
    for (size_t j = i; j < s.length(); j++)
    {
        string p = s.substr(i, j - i + 1); // blocks i..j
        long long num = stoll(p);          // note: overflows for long substrings
        if (num % m == l)
            ans++;
    }
}
cout << ans << "\n";
return 0;
But obviously, since the input length is up to 10^4, this doesn't work in the required time. How can I make it more optimal?
A little advice I can give you is to store s.length() in a variable once, to avoid calling the function on every iteration of each for loop.
Ok, here goes, with a working program at the bottom
Major optimization #1
Do not (ever) work with strings when it comes to integer arithmetic. You're converting string => integer over and over and over again (this is an O(n^2) problem), which is painfully slow. Besides, it also misses the point.
Solution: first convert your array-of-characters (string) to array-of-numbers. Integer arithmetic is fast.
Major optimization #2
Use a smart conversion from "substring" to number. After transforming the characters to actual integers, they become the coefficients of the polynomial a_n * 10^n. To convert a substring of n segments into a number, it is enough to compute sum(a_i * 10^i) for 0 <= i < n.
And nicely enough, if the coefficients a_i are arranged the way they are in the problem's statement, you can use Horner's method (https://en.wikipedia.org/wiki/Horner%27s_method) to very quickly evaluate the numerical value of the substring.
In short: keep a running value of the current substring; growing it by one element is just value = value * 10 + new_element.
Example: string "128472373".
First substring = "1", value = 1.
For the second substring we need to
add the digit "2" as follows: value = value * 10 + "2", thus: value = 1 * 10 + 2 = 12.
For the 3rd substring we need to add the digit "8": value = value * 10 + "8", thus: value = 12 * 10 + 8 = 128.
Etcetera.
I had some issues with formatting the C++ code inline so I stuck it in IDEone: https://ideone.com/TbJiqK
The gist of the program:
In the main loop, loop over all possible start points:
// For all start points in the segments array ...
for (int* f = segments; f < segments + n_segments; f++)
    // ... add up the substrings that fulfill the question
    n += count_segments(f, segments + n_segments, m, l);
// Output the answer for this question
cout << n << endl;
Implementation of the count_segments() function:
// Find all substrings whose value % m == l.
// Use Horner's method to quickly evaluate sum(a_n*10^n), where the a_n
// are the segments' durabilities. The running value is reduced mod m at
// every step so it can never overflow; this leaves the remainder unchanged.
int count_segments(int* first, int* last, int m, int l) {
    int n = 0, number = 0;
    while (first < last) {
        number = (number * 10 + *first) % m; // Horner's method, kept mod m
        if (number == l) {
            n++;
            // If you don't believe it, enable this output and see the
            // matches for each combination of m and l:
            //cout << "[" << m << ", " << l << "]: match ending here" << endl;
        }
        first++;
    }
    return n;
}
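For completeness, here is an untested sketch of a main() tying the snippets together; it assumes the count_segments() defined above and performs the string-to-integer conversion once, per major optimization #1:

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int count_segments(int* first, int* last, int m, int l); // as defined above

int main() {
    string s;
    int q;
    cin >> s >> q;
    vector<int> segments(s.size());
    for (size_t i = 0; i < s.size(); ++i)
        segments[i] = s[i] - '0'; // string => integers, once
    while (q--) {
        int m, l, n = 0;
        cin >> m >> l;
        // For all start points in the segments array ...
        for (int* f = &segments[0]; f < &segments[0] + segments.size(); ++f)
            n += count_segments(f, &segments[0] + segments.size(), m, l);
        cout << n << "\n";
    }
    return 0;
}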

Efficient layout and reduction of virtual 2d data (abstract)

I use C++ and CUDA/C, and while writing code for a specific problem I ran into a quite tricky reduction problem.
My experience in parallel programming isn't negligible, but it is quite limited, and I cannot totally foresee the specifics of this problem.
I doubt there is a convenient or even "easy" way to handle the problems I am facing, but perhaps I am wrong.
If there are any resources (i.e. articles, books, web links, ...) or keywords covering this or similar problems, please let me know.
I tried to generalize the whole case as well as possible and to keep it abstract instead of posting too much code.
The Layout ...
I have a system of N initial elements and N result elements. (I'll use N=8 as an example, but N can be any integral value greater than three.)
static size_t const N = 8;
double init_values[N], result[N];
I need to calculate almost every (not all, I'm afraid) unique permutation of the init values without self-interference.
This means calculation f(init_values[0],init_values[1]), f(init_values[0],init_values[2]), ..., f(init_values[0],init_values[N-1]), f(init_values[1],init_values[2]), ..., f(init_values[1],init_values[N-1]), ... and so on.
This is in fact a virtual triangular matrix which has the shape seen in the following illustration.
P 0 1 2 3 4 5 6 7
|---------------------------------------
0| x
|
1| 0 x
|
2| 1 2 x
|
3| 3 4 5 x
|
4| 6 7 8 9 x
|
5| 10 11 12 13 14 x
|
6| 15 16 17 18 19 20 x
|
7| 21 22 23 24 25 26 27 x
Each element is a function of the respective column and row elements in init_values.
P[i] (= P[row(i)][col(i)]) = f(init_values[col(i)], init_values[row(i)])
i.e.
P[11] (= P[5][1]) = f(init_values[1], init_values[5])
There are (N*N-N)/2 = 28 possible, unique combinations (Note: P[1][5]==P[5][1], so we only have a lower (or upper) triangular matrix) using the example N = 8.
The basic problem
The result array is computed from P as a sum of the row elements minus the sum of the respective column elements.
For example the result at position 3 will be calculated as a sum of row 3 minus the sum of column three.
result[3] = (P[3]+P[4]+P[5]) - (P[9]+P[13]+P[18]+P[24])
result[3] = sum_elements_row(3) - sum_elements_column(3)
I tried to illustrate it in a picture with N = 4.
As a consequence the following is true:
N-1 operations (potential concurrent writes) will be performed on each result[i]
result[i] will have N-(i+1) writes from subtractions and i additions
Outgoing from each P[i][j] there will be a subtraction to r[j] and an addition to r[i]
This is where the main problems come into place:
Using one thread to compute each P and updating the result directly will result in multiple kernels trying to write to the same result location (N-1 threads each).
Storing the whole matrix P for a subsequent reduction step on the other hand is very expensive in terms of memory consumption and therefore impossible for very large systems.
The idea of having a unique, shared result vector for each thread-block is impossible, too.
(An N of 50k makes about 1.25 billion P elements and therefore [assuming a maximum number of 1024 threads per block] a minimal number of about 1.2 million blocks, consuming over 450 GiB of memory if each block has its own result array with 50k double elements.)
I think I could handle reduction for a more static behaviour but this problem is rather dynamic in terms of potential concurrent memory write-access.
(Or is it possible to handle it by some "basic" type of reduction?)
Adding some complications ...
Unfortunately, depending on (arbitrary user) input, which is independent of the initial values, some elements of P need to be skipped.
Let's assume we need to skip permutations P[6], P[14] and P[18]. Therefore we have 24 combinations left, which need to be calculated.
How to tell the kernel which values need to be skipped?
I came up with three approaches, each having notable downsides if N is very large (like several ten thousands of elements).
1. Store all combinations ...
... that need to be calculated, together with their respective row and column indices (struct combo { size_t row,col; };), in a vector<combo>, and operate on this vector. (Used by the current implementation.)
std::vector<combo> elements;
// somehow fill
size_t const M = elements.size();
for (size_t i=0; i<M; ++i)
{
    // do the necessary computations using elements[i].row and elements[i].col
}
This solution avoids the index computations and the lookup of removed elements for each element of P, which are the downsides of the second approach, but it consumes lots of memory: only "several" elements are removed (maybe tens of thousands, which is not much in contrast to several billion in total), so almost every combination still has to be stored.
2. Operate on all elements of P and find removed elements
If I want to operate on each element of P and avoid nested loops (which I couldn't reproduce very well in CUDA) I need to do something like this:
size_t M = (N*N-N)/2;
for (size_t i=0; i<M; ++i)
{
    // calculate row and column indices from `i`
    // (row r holds the elements r*(r-1)/2 ... r*(r+1)/2 - 1; needs <cmath>)
    size_t current_row = size_t((std::sqrt(8.0*double(i) + 1.0) + 1.0) / 2.0);
    size_t current_col = i - current_row*(current_row-1)/2;
    // check whether the current combo of row and col is not to be removed
    if (!removes[current_row].exists(current_col))
    {
        // do the necessary computations using current_row and current_col
    }
}
The vector removes is very small in contrast to the elements vector in the first example but the additional computations to obtain current_row, current_col and the if-branch are very inefficient.
(Remember we're still talking about billions of evaluations.)
3. Operate on all elements of P and remove elements afterwards
Another idea I had was to calculate all valid and invalid combinations independently.
But unfortunately, due to summation errors the following statement is true:
calc_non_skipped() != calc_all() - calc_skipped()
Is there a convenient, known, high performance way to get the desired results from the initial values?
I know that this question is rather complicated and perhaps limited in relevance. Nevertheless, I hope some illuminative answers will help me to solve my problems.
The current implementation
Currently this is implemented as CPU Code with OpenMP.
I first set up a vector of the above mentioned combos storing every P that needs to be computed and pass it to a parallel for loop.
Each thread is provided with a private result vector and a critical section at the end of the parallel region is used for a proper summation.
First, I was puzzled for a moment about why (N**2 - N)/2 yielded only 21 for N=7 when the picture clearly shows 28 elements ... then I noticed that the indices run from 0 to 7, so N=8 and there are indeed 28 elements in P. Shouldn't try to answer questions like this so late in the day. :-)
But on to a potential solution: Do you need to keep the array P for any other purpose? If not, I think you can get the result you want with just two intermediate arrays, each of length N: one for the sum of the rows and one for the sum of the columns.
Here's a quick-and-dirty example of what I think you're trying to do (subroutine direct_approach()) and how to achieve the same result using the intermediate arrays (subroutine refined_approach()):
#include <cstdlib>
#include <cstdio>

const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };
float P[N][N]; // Yes, I'm wasting half the array. This way I don't have to fuss with mapping the indices.
float result1[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float result2[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };

float f(float arg1, float arg2)
{
    // Arbitrary computation
    return (arg1 * arg2);
}

float compute_result(int index)
{
    float row_sum = 0.0F;
    float col_sum = 0.0F;
    int row;
    int col;
    // Compute the row sum
    for (col = (index + 1); col < N; col++)
    {
        row_sum += P[index][col];
    }
    // Compute the column sum
    for (row = 0; row < index; row++)
    {
        col_sum += P[row][index];
    }
    return (row_sum - col_sum);
}

void direct_approach()
{
    int row;
    int col;
    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            P[row][col] = f(input_values[row], input_values[col]);
        }
    }
    int index;
    for (index = 0; index < N; index++)
    {
        result1[index] = compute_result(index);
    }
}

void refined_approach()
{
    float row_sums[N];
    float col_sums[N];
    int index;
    // Initialize intermediate arrays
    for (index = 0; index < N; index++)
    {
        row_sums[index] = 0.0F;
        col_sums[index] = 0.0F;
    }
    // Compute the row and column sums
    // This can be parallelized by computing row and column sums
    // independently, instead of in nested loops.
    int row;
    int col;
    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            float computed = f(input_values[row], input_values[col]);
            row_sums[row] += computed;
            col_sums[col] += computed;
        }
    }
    // Compute the result
    for (index = 0; index < N; index++)
    {
        result2[index] = row_sums[index] - col_sums[index];
    }
}

void print_result(int n, float * result)
{
    int index;
    for (index = 0; index < n; index++)
    {
        printf(" [%d]=%f\n", index, result[index]);
    }
}

int main(int argc, char * * argv)
{
    printf("Data reduction test\n");
    direct_approach();
    printf("Result 1:\n");
    print_result(N, result1);
    refined_approach();
    printf("Result 2:\n");
    print_result(N, result2);
    return (0);
}
Parallelizing the computation is not so easy, since each intermediate value is a function of most of the inputs. You can compute the sums individually, but that would mean performing f(...) multiple times. The best suggestion I can think of for very large values of N is to use more intermediate arrays, computing subsets of the results, then summing the partial arrays to yield the final sums. I'd have to think about that one when I'm not so tired.
To cope with the skip issue: If it's a simple matter of "don't use input values x, y, and z", you can store x, y, and z in a do_not_use array and check for those values when looping to compute the sums. If the values to be skipped are some function of row and column, you can store those as pairs and check for the pairs.
Hope this gives you ideas for your solution!
Update, now that I'm awake: Dealing with "skip" depends a lot on what data needs to be skipped. Another possibility for the first case - "don't use input values x, y, and z" - a much faster solution for large data sets would be to add a level of indirection: create yet another array, this one of integer indices, and store only the indices of the good inputs. For instance, if invalid data is in inputs 2 and 5, the valid array would be:
int valid_indices[] = { 0, 1, 3, 4, 6 };
Iterate over the array valid_indices, and use those indices to retrieve the data from your input array to compute the result. On the other hand, if the values to skip depend on both indices of the P array, I don't see how you can avoid some kind of lookup.
Back to parallelizing - No matter what, you'll be dealing with (N**2 - N)/2 computations of f(). One possibility is to just accept that there will be contention for the sum arrays, which would not be a big issue if computing f() takes substantially longer than the two additions. When you get to very large numbers of parallel paths, contention will again be an issue, but there should be a "sweet spot" balancing the number of parallel paths against the time required to compute f().
If contention is still an issue, you can partition the problem several ways. One way is to compute a row or column at a time: for a row at a time, each column sum can be computed independently and a running total can be kept for each row sum.
Another approach would be to divide the data space and, thus, the computation into subsets, where each subset has its own row and column sum arrays. After each block is computed, the independent arrays can then be summed to produce the values you need to compute the result.
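To make the partial-sums idea concrete, here is a minimal OpenMP sketch (my code; f() is a stand-in for the real interaction): each thread accumulates into private row/column partials, so f() is evaluated once per pair and the only contention is the short merge step:

#include <vector>

double f(double a, double b) { return a * b; } // stand-in for the real interaction

void reduce(const std::vector<double>& init, std::vector<double>& result)
{
    const int N = static_cast<int>(init.size());
    std::vector<double> row_sums(N, 0.0), col_sums(N, 0.0);
    #pragma omp parallel
    {
        // Thread-private partial sums: no write contention inside the loop.
        std::vector<double> my_rows(N, 0.0), my_cols(N, 0.0);
        #pragma omp for schedule(dynamic)
        for (int row = 0; row < N; ++row)
            for (int col = row + 1; col < N; ++col)
            {
                const double v = f(init[row], init[col]);
                my_rows[row] += v; // contributes to the row sum
                my_cols[col] += v; // and to the column sum
            }
        #pragma omp critical // short merge step, as in the questioner's setup
        for (int i = 0; i < N; ++i)
        {
            row_sums[i] += my_rows[i];
            col_sums[i] += my_cols[i];
        }
    }
    for (int i = 0; i < N; ++i)
        result[i] = row_sums[i] - col_sums[i];
}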
This probably will be one of those naive and useless answers, but it also might help. Feel free to tell me that I'm utterly and completely wrong and I have misunderstood the whole affair.
So... here we go!
The Basic Problem
It seems to me that you can define your result function a little differently and it will lift at least some contention off your intermediate values. Let's suppose that your P matrix is lower-triangular. If you (virtually) fill the upper triangle with the negative of the lower values (and the main diagonal with all zeros,) then you can redefine each element of your result as the sum of a single row: (shown here for N=4, and where -i means the negative of the value in the cell marked as i)
P 0 1 2 3
|--------------------
0| x -0 -1 -3
|
1| 0 x -2 -4
|
2| 1 2 x -5
|
3| 3 4 5 x
If you launch independent threads (executing the same kernel) to calculate the sum of each row of this matrix, each thread will write a single result element. It seems that your problem size is large enough to saturate your hardware threads and keep them busy.
The caveat, of course, is that you'll be calculating each f(x, y) twice. I don't know how expensive that is, or how much the memory contention was costing you before, so I cannot judge whether this is a worthwhile trade-off or not. But unless f was really, really expensive, I think it might be.
Skipping Values
You mention that you might have tens of thousands of elements of the P matrix that you need to ignore in your calculations (effectively skip them).
To work with the scheme I've proposed above, I believe you should store the skipped elements as (row, col) pairs, and you have to add the transposed of each coordinate pair too (so you'll have twice the number of skipped values.) So your example skip list of P[6], P[14] and P[18] becomes P(4,0), P(5,4), P(6,3) which then becomes P(4,0), P(5,4), P(6,3), P(0,4), P(4,5), P(3,6).
Then you sort this list, first based on row and then column. This makes our list to be P(0,4), P(3,6), P(4,0), P(4,5), P(5,4), P(6,3).
If each row of your virtual P matrix is processed by one thread (or a single instance of your kernel or whatever,) you can pass it the values it needs to skip. Personally, I would store all these in a big 1D array and just pass in the first and last index that each thread would need to look at (I would also not store the row indices in the final array that I passed in, since it can be implicitly inferred, but I think that's obvious.) In the example above, for N = 8, the begin and end pairs passed to each thread will be: (note that the end is one past the final value needed to be processed, just like STL, so an empty list is denoted by begin == end)
Thread 0: 0..1
Thread 1: 1..1 (or 0..0 or whatever)
Thread 2: 1..1
Thread 3: 1..2
Thread 4: 2..4
Thread 5: 4..5
Thread 6: 5..6
Thread 7: 6..6
Now, each thread goes on to calculate and sum all the intermediate values in a row. While it is stepping through the indices of columns, it is also stepping through this list of skipped values and skipping any column number that comes up in the list. This is obviously an efficient and simple operation (since the list is sorted by column too. It's like merging.)
Pseudo-Implementation
I don't know CUDA, but I have some experience working with OpenCL, and I imagine the interfaces are similar (since the hardware they are targeting is the same.) Here's an implementation of the kernel that does the processing for a row (i.e. calculates one entry of result) in pseudo-C++:
double calc_one_result (
    unsigned my_id, unsigned N, double const init_values [],
    unsigned skip_indices [], unsigned skip_begin, unsigned skip_end
)
{
    double res = 0;
    for (unsigned col = 0; col < my_id; ++col)
        // "f" seems to take init_values[column] as its first arg
        res += f (init_values[col], init_values[my_id]);
    for (unsigned row = my_id + 1; row < N; ++row)
        res -= f (init_values[my_id], init_values[row]);
    // At this point, "res" is holding "result[my_id]",
    // including the values that should have been skipped
    unsigned i = skip_begin;
    // The second condition is to check whether we have reached the
    // middle of the virtual matrix or not
    for (; i < skip_end && skip_indices[i] < my_id; ++i)
    {
        unsigned col = skip_indices[i];
        res -= f (init_values[col], init_values[my_id]);
    }
    for (; i < skip_end; ++i)
    {
        unsigned row = skip_indices[i];
        res += f (init_values[my_id], init_values[row]);
    }
    return res;
}
Note the following:
The semantics of init_values and function f are as described by the question.
This function calculates one entry in the result array; specifically, it calculates result[my_id], so you should launch N instances of this.
The only shared variable it writes to is result[my_id]. Well, the above function doesn't write to anything, but if you translate it to CUDA, I imagine you'd have to write to that at the end. However, no one else writes to that particular element of result, so this write will not cause any contention or data race.
The two input arrays, init_values and skip_indices, are shared among all the running instances of this function.
All accesses to data are linear and sequential, except for the skipped values, which I believe is unavoidable.
skip_indices contains a list of indices that should be skipped in each row. Its contents and structure are as described above, with one small optimization. Since there was no need, I have removed the row numbers and left only the columns. The row number will be passed into the function as my_id anyway, and the slice of the skip_indices array that should be used by each invocation is determined using skip_begin and skip_end.
For the example above, the array that is passed into all invocations of calc_one_result will look like this: [4, 6, 0, 5, 4, 3].
As you can see, apart from the loops, the only conditional branch in this code is skip_indices[i] < my_id in the third for-loop. Although I believe this is innocuous and totally predictable, even this branch can be easily avoided in the code. We just need to pass in another parameter called skip_middle that tells us where the skipped items cross the main diagonal (i.e. for row #my_id, the index at skipped_indices[skip_middle] is the first that is larger than my_id.)
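As a supplement, here is an untested host-side sketch (my code, my names) of building skip_indices plus per-row begin/end offsets by the mirror-sort recipe described above; on the example skip set it reproduces the array [4, 6, 0, 5, 4, 3] and the per-thread ranges listed earlier:

#include <algorithm>
#include <utility>
#include <vector>

// skips holds the lower-triangle pairs (row, col) with row > col.
void build_skip_list(const std::vector<std::pair<unsigned, unsigned> >& skips,
                     unsigned N,
                     std::vector<unsigned>& skip_indices, // flattened column lists
                     std::vector<unsigned>& row_begin)    // size N+1; row r uses [row_begin[r], row_begin[r+1])
{
    std::vector<std::pair<unsigned, unsigned> > all(skips);
    for (std::size_t i = 0; i < skips.size(); ++i)   // add the transposed pairs
        all.push_back(std::make_pair(skips[i].second, skips[i].first));
    std::sort(all.begin(), all.end());               // by row, then by column

    skip_indices.clear();
    row_begin.assign(N + 1, 0);
    for (std::size_t i = 0; i < all.size(); ++i)
    {
        skip_indices.push_back(all[i].second);       // keep only the columns
        row_begin[all[i].first + 1]++;               // count entries per row ...
    }
    for (unsigned r = 0; r < N; ++r)
        row_begin[r + 1] += row_begin[r];            // ... then prefix-sum into offsets
}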
In Conclusion
I'm by no means an expert in CUDA and HPC. But if I have understood your problem correctly, I think this method might eliminate any and all contentions for memory. Also, I don't think this will cause any (more) numerical stability issues.
The cost of implementing this is:
Calling f twice as many times in total (and keeping track of when it is called for row < col so you can multiply the result by -1.)
Storing twice as many items in the list of skipped values. Since the size of this list is in the thousands (and not billions!) it shouldn't be much of a problem.
Sorting the list of skipped values; which again due to its size, should be no problem.
(UPDATE: Added the Pseudo-Implementation section.)

n-th or Arbitrary Combination of a Large Set

Say I have a set of numbers from [0, ....., 499]. Combinations are currently being generated sequentially using the C++ std::next_permutation. For reference, the size of each tuple I am pulling out is 3, so I am returning sequential results such as [0,1,2], [0,1,3], [0,1,4], ... [497,498,499].
Now, I want to parallelize the code that this is sitting in, so a sequential generation of these combinations will no longer work. Are there any existing algorithms for computing the ith combination of 3 from 500 numbers?
I want to make sure that each thread, regardless of the iterations of the loop it gets, can compute a standalone combination based on the i it is iterating with. So if I want the combination for i=38 in thread 1, I can compute [1,2,5] while simultaneously computing i=0 in thread 2 as [0,1,2].
EDIT Below statement is irrelevant, I mixed myself up
I've looked at algorithms that utilize factorials to narrow down each individual element from left to right, but I can't use these as 500! sure won't fit into memory. Any suggestions?
Here is my shot:
int k = 527; //The kth combination is calculated
int N = 500; //Number of elements you have
int a = 0, b = 1, c = 2; //a,b,c are the numbers you get out
while(k >= (N-a-1)*(N-a-2)/2){
    k -= (N-a-1)*(N-a-2)/2;
    a++;
}
b = a+1;
while(k >= N-1-b){
    k -= N-1-b;
    b++;
}
c = b+1+k;
cout << "[" << a << "," << b << "," << c << "]" << endl; //The result
I derived this by thinking about how many combinations there are before the next number increases. However, it only works for three elements, and I can't guarantee that it is correct. It would be cool if you compared it to your results and gave some feedback.
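One way to do that comparison is a brute-force check (my sketch): enumerate the 3-combinations in lexicographic order and confirm that the k-th one matches the formula. For k = 527 and N = 500 both parts print [0,2,32]:

#include <iostream>

int main() {
    const int N = 500, k_target = 527;

    // The formula from the answer above.
    int k = k_target;
    int a = 0, b = 1, c = 2;
    while (k >= (N - a - 1) * (N - a - 2) / 2) { k -= (N - a - 1) * (N - a - 2) / 2; a++; }
    b = a + 1;
    while (k >= N - 1 - b) { k -= N - 1 - b; b++; }
    c = b + 1 + k;
    std::cout << "formula: [" << a << "," << b << "," << c << "]\n";

    // Brute force: walk the combinations until index k_target is reached.
    int idx = 0;
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j)
            for (int l = j + 1; l < N; ++l, ++idx)
                if (idx == k_target) {
                    std::cout << "brute:   [" << i << "," << j << "," << l << "]\n";
                    return 0;
                }
    return 0;
}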
If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
The following tested code will iterate through each unique combination:
public void Test10Choose5()
{
    String S;
    int Loop;
    int N = 500; // Total number of elements in the set.
    int K = 3;   // Total number of elements in each group.
    // Create the bin coeff object required to get all
    // the combos for this N choose K combination.
    BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
    int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
    // The KIndexes array specifies the indexes for a lexicographic element.
    int[] KIndexes = new int[K];
    StringBuilder SB = new StringBuilder();
    // Loop thru all the combinations for this N choose K case.
    for (int Combo = 0; Combo < NumCombos; Combo++)
    {
        // Get the k-indexes for this combination.
        BC.GetKIndexes(Combo, KIndexes);
        // Verify that the KIndexes returned can be used to retrieve the
        // rank or lexicographic order of the KIndexes in the table.
        int Val = BC.GetIndex(true, KIndexes);
        if (Val != Combo)
        {
            S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
            Console.WriteLine(S);
        }
        SB.Remove(0, SB.Length);
        for (Loop = 0; Loop < K; Loop++)
        {
            SB.Append(KIndexes[Loop].ToString());
            if (Loop < K - 1)
                SB.Append(" ");
        }
        S = "KIndexes = " + SB.ToString();
        Console.WriteLine(S);
    }
}
You should be able to port this class over fairly easily to C++. You probably will not have to port over the generic part of the class to accomplish your goals. Your test case of 500 choose 3 yields 20,708,500 unique combinations, which will fit in a 4 byte int. If 500 choose 3 is simply an example case and you need to choose combinations greater than 3, then you will have to use longs or perhaps fixed point int.
You can describe a particular selection of 3 out of 500 objects as a triple (i, j, k), where i is a number from 0 to 499 (the index of the first number), j ranges from 0 to 498 (the index of the second, skipping over whichever number was first), and k ranges from 0 to 497 (index of the last, skipping both previously-selected numbers). Given that, it's actually pretty easy to enumerate all the possible selections: starting with (0,0,0), increment k until it gets to its maximum value, then increment j and reset k to 0, and so on, until j gets to its own maximum value; then increment i, reset both j and k, and continue.
If this description sounds familiar, it's because it's exactly the same way that incrementing a base-10 number works, except that the base is much funkier, and in fact the base varies from digit to digit. You can use this insight to implement a very compact version of the idea: for any integer n from 0 to 500*499*498 - 1, you can get:
#include <algorithm> // for std::min / std::max
#include <iostream>

struct triple {
    int i, j, k;
};

triple AsTriple(int n) {
    triple result;
    result.k = n % 498;
    n = n / 498;
    result.j = n % 499;
    n = n / 499;
    result.i = n % 500; // unnecessary, any legal n will already be between 0 and 499
    return result;
}

void PrintSelections(triple t) {
    int i = t.i;
    int j = t.j;
    if (j >= i) ++j;              // skip over the first selected number
    int k = t.k;
    if (k >= std::min(i, j)) ++k; // skip the smaller of the two selected numbers,
    if (k >= std::max(i, j)) ++k; // then the larger
    std::cout << "[" << i << "," << j << "," << k << "]" << std::endl;
}

void PrintRange(int start, int end) {
    for (int i = start; i < end; ++i) {
        PrintSelections(AsTriple(i));
    }
}
Now to shard, you can just take the numbers from 0 to 500*499*498, divide them into subranges in any way you'd like, and have each shard compute the permutation for each value in its subrange.
This trick is very handy for any problem in which you need to enumerate subsets.

Find the row(s) with maximum number of 0s in a 2-d matrix

Problem
Given a 2D 0/1 Matrix, Find the row(s) with maximum number of 0s.
Example
11111000
11111110
11100000
11000000
11110000
Output
11000000
My idea
If the 0s in each row are contiguous, we can scan from both ends of each row. Common sense says to scan everything in O(n^2).
Are there any O(n) solutions?
If every row is like 1....10...0, you could binary search for the first zero in each row.
That would be O(n*lg(n)).
For an arbitrary matrix, you must check every cell, so it must be O(n^2).
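For the sorted-rows case, here is a small sketch of that binary search (my code): a row of the form 1...10...0 is sorted descending, so std::lower_bound with a greater<char> comparator finds the first '0' in O(lg n) per row:

#include <algorithm>
#include <functional>
#include <string>

// Number of zeros in a row of the form "1...10...0".
std::size_t zeros_in_row(const std::string& row)
{
    // The row is descending ('1' before '0'), so search with greater<char>.
    std::string::const_iterator first_zero =
        std::lower_bound(row.begin(), row.end(), '0', std::greater<char>());
    return static_cast<std::size_t>(row.end() - first_zero);
}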
You can do it in O(N) as follows:
Start at A[i][j] with i=0 and j=No_of_columns-1.
If A[i][j] == 0, keep moving to the left by doing j--.
If A[i][j] == 1, move down to the next row by doing i++.
When you run off the bottom row or the left edge, C-1-j will be the maximum number of 0s.
Pseudo code:
Let R be the number of rows
Let C be the number of columns
Let i = 0
Let j = C-1
Let max1Row = 0
while ( i<R && j>=0 )
    if ( matrix[i][j] == 0 )
        j--
        max1Row = i
    else
        i++
end-while
print "Max 0's = C-1-j"
print "Row number with max 0's = max1Row"
As #amit says:
scanning a matrix is considered O(n). The standard big O notation stands for relationship between run time and the input size. Since your input is of size #rows*#cols, you should regard this number as n, and not to #rows.
Therefore, this is as O(n) as you can get. :)
std::vector<std::string> matrix;
std::vector<std::string>::iterator max = matrix.begin();
for(std::vector<std::string>::iterator i = matrix.begin(); i != matrix.end(); ++i)
{
    if(count_zeros(*i) > count_zeros(*max))
        max = i;
}
count_zeros() should look something like this:
size_t count_zeros(const std::string& s)
{
    size_t count = 0;
    for(std::string::const_iterator i = s.begin(); i != s.end(); ++i)
        if(*i == '0')
            ++count;
    return count;
}
If all the 0s in each row are guaranteed to be on the right, you can do it in O(sqrt(n)):
Put the cursor on (len, 0), i.e. just past the last column of the first row.
If the value to the left of the cursor is 0, move the cursor left. Else, move it down.
If the bottom row is passed, terminate. Else, go to step 2.
std::vector<std::string> matrix; // assume this has been filled with rows of '0'/'1'
std::vector<std::string>::iterator best = matrix.begin();
std::string::size_type x = best->size(); // cursor column, one past the last zero found
for(std::vector<std::string>::iterator y = matrix.begin(); y != matrix.end(); )
{
    if(x > 0 && (*y)[x - 1] == '0')
    {
        --x;      // another zero: the boundary moves left
        best = y; // this row has the most zeros seen so far
    }
    else
    {
        ++y;      // no zero at the boundary in this row: move down
    }
}
// best now points at a row with the maximum number of zeros;
// that count is best->size() - x.
With the given sample set (where every row starts with 1s and ends with 0s, with no mixing of 1 and 0), the set can simply be searched in a single pass with a test of A & (A xor B) to tell whether a row has more or fewer zeros than the previous one -- that is a loop of O(n)....
Here is a quick solution with just one if per row (not per element):
As your matrix just holds zeros and ones, add up the elements of each row and then return the index/indices of the minimum/minima.
P.S.: adding ones is really fast when using assembly inc or ++variable in C++
Edit: Here is another idea. If you really just need 0/1 matrices that do not exceed, say, 64 columns, you could implement them as bit matrices using ordinary unsigned 64-bit integers. By setting and clearing the respective bit you can define each entry (0 or 1). The effect: an O(n) check (let me know if I am wrong) as follows, where intXX is rowXX. The first idea is to extract differing bits via XOR:
SET tmpA to int1
FOR I = 2 to n
    XOR tmpA with intI, store result in tmpX
    AND tmpX with tmpA, store result in tmpM (ones tmpA has but intI lacks)
    AND tmpX with intI, store result in tmpN (ones intI has but tmpA lacks)
    IF (tmpN < tmpM)
        SET tmpA to intI
ENDFOR
tmpA should now hold the (last) row with the most zeros.

algorithm: find count of numbers within a given range

Given an unsorted number array where there can be duplicates, pre-process the array so that finding the count of numbers within a given range takes O(1) time.
For example, 7,2,3,2,4,1,4,6. The count of numbers both >= 2 and <= 5 is 5. (2,2,3,4,4).
Sort the array. For each element in the sorted array, insert that element into a hash table, with the value of the element as the key, and its position in the array as the associated value. Any values that are skipped, you'll need to insert as well.
To find the number of items in a range, look up the position of the value at each end of the range in the hash table, and subtract the lower from the upper to find the size of the range.
This sounds suspiciously like one of those clever interview questions some interviewers like to ask, which is usually associated with hints along the way to see how you think.
Regardless... one possible way of implementing this is to make a list of the counts of numbers equal to or less than the list index.
For example, from your list above, generate the list: 0, 1, 3, 4, 6, 6, 7, 8. Then you can count the numbers between 2 and 5 by subtracting list[1] from list[5].
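Here is a minimal sketch of that idea in C++ (my code), using the example list; after the prefix pass each range query is a single subtraction:

#include <cstdio>

int main()
{
    const int data[] = {7, 2, 3, 2, 4, 1, 4, 6};
    const int n = sizeof(data) / sizeof(data[0]);
    const int max_val = 7;            // assumed upper bound on the values
    int cum[max_val + 1] = {0};       // cum[v] = count of elements <= v
    for (int i = 0; i < n; ++i)
        cum[data[i]]++;
    for (int v = 1; v <= max_val; ++v)
        cum[v] += cum[v - 1];         // prefix sums: 0 1 3 4 6 6 7 8
    printf("%d\n", cum[5] - cum[1]);  // count in [2,5]: prints 5
    return 0;
}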
Since we need O(1) access, the data structure will be memory-intensive.
With a hash table, access would take O(n) in the worst case.
My Solution:
Build a 2D matrix.
array = {2,3,2,4,1,4,6} Range of numbers = 0 to 6 so n = 7
So we have to create an n x n matrix.
array[i][i] represents total count of element = i
so array[4][4] = 2 (since 4 appears 2 times in array)
array[5][5] = 0
array[5][2] = count of numbers both >= 2 and <= 5 = 5
//preprocessing stage 1: populate a[i][i] with the total count of elements equal to i
int a[n][n] = {0};
for(k = 0; k < array_length; k++){
    a[array[k]][array[k]]++;
}
//stage 2
for(i = 1; i < n; i++)
    for(j = 0; j < i; j++)
        a[i][j] = a[i-1][j] + a[i][i];
//we are just adding the count of element i to each value in row i-1, and we get row i.
Now (5,2) would query for a[5][2] and would give answer in O(1)
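A runnable sketch of this scheme (my code), using the small example array:

#include <cstdio>

int main()
{
    const int arr[] = {2, 3, 2, 4, 1, 4, 6};
    const int len = sizeof(arr) / sizeof(arr[0]);
    const int n = 7;                  // values range over 0..6
    int a[n][n] = {{0}};
    // stage 1: the diagonal holds the count of each value
    for (int k = 0; k < len; ++k)
        a[arr[k]][arr[k]]++;
    // stage 2: a[i][j] = count of elements x with j <= x <= i
    for (int i = 1; i < n; ++i)
        for (int j = 0; j < i; ++j)
            a[i][j] = a[i - 1][j] + a[i][i];
    printf("%d\n", a[5][2]);          // count in [2,5]: prints 5
    return 0;
}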
#include <cstdio>
#include <cstring>

int main()
{
    int arr[8] = {7,2,3,2,4,1,4,6};
    int count[9];
    int total = 0;
    memset(count, 0, sizeof(count));
    for(int i = 0; i < 8; i++)
        count[arr[i]]++;
    for(int k = 0; k < 9; k++)
    {
        if(k >= 2 && k <= 5 && count[k] > 0)
        {
            total = total + count[k];
        }
    }
    printf("%d:", total);
    return 0;
}