I have an array containing 100,000 sets. Each set contains natural numbers below 1,000,000. I have to find the number of ordered pairs {m, n}, where 0 < m < 1,000,000, 0 < n < 1,000,000 and m != n, which do not exist together in any of 100,000 sets. A naive method of searching through all the sets leads to 10^5 * (10^6 choose 2) number of searches.
For example, I have 2 sets: set1 = {1,2,4} and set2 = {1,3}. All possible ordered pairs of numbers below 5 are {1,2}, {1,3}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs of numbers below 5 which do not exist together in set1 are {1,3}, {2,3} and {3,4}. The ordered pairs below 5 missing from set2 are {1,2}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs which do not exist together in either set are {2,3} and {3,4}, so the count of missing ordered pairs is 2.
Can anybody point me to a clever way of organizing my data structure so that finding the number of missing pairs is faster? I apologize in advance if this question has been asked before.
Update:
Here is some information about the structure of my data set.
The number of elements in each set varies from 2 to 500,000. The median number of elements is around 10,000. The distribution peaks around 10,000 and tapers down in both direction. The union of the elements in the 100,000 sets is close to 1,000,000.
If you are looking for combinations across sets, there is a way to meaningfully condense your dataset, as shown in frenzykryger's answer. However, from your examples, what you're looking for is the number of combinations available within each set, meaning each set contains irreducible information. Additionally, you can't use combinatorics to simply obtain the number of combinations from each set either; you ultimately want to deduplicate combinations across all sets, so the actual combinations matter.
Knowing all this, it is difficult to think of any major breakthroughs you could make. Let's say you have i sets and a maximum of k items in each set. The naive approach would be:
If your sets are typically dense (i.e. contain most of the numbers between 1 and 1,000,000), replace them with the complement of the set instead
Create a set of 2 tuples (use a set structure that ensures insertion is idempotent)
For each set O(i):
Evaluate all combinations and insert into set of combinations: O(k choose 2)
The worst case complexity for this isn't great, but assuming you have scenarios where a set either contains most of the numbers between 0 and 1,000,000, or almost none of them, you should see a big improvement in performance.
Another approach would be to go ahead and use combinatorics to count the number of combinations from each set, then use some efficient approach to find the number of duplicate combinations among sets. I'm not aware of such an approach, but it is possible it exists.
First, let's solve the simpler task of counting the numbers not present in any of your sets. This task can be reworded in a simpler form: instead of 100,000 sets, you can think about 1 set which contains all your numbers. Then the number of elements not present in this set is x = 1000000 - len(set). Now you can use this number x to count the combinations. With repetitions: x * x; without repetitions: x * (x - 1). So the bottom line of my answer is to put all your numbers in one big set and use its length to find the number of combinations using combinatorics.
Update
So above we have a way to find the number of combinations where each element of the combination is not in any of the sets. But the question was to find the number of combinations where the combination itself is not present in any of the sets.
Let's try to solve a simpler problem first:
your sets have all numbers in them, none missing
each number is present in exactly one set, no duplicates across sets
How would you construct such combinations over such sets? You would simply pick two elements from different sets, and the resulting combination would not be in any of the sets. The number of such combinations can be counted using the following code (it accepts the sizes of the sets):
long long count_combinations(const vector<int>& buckets) {
    long long result = 0;  // the products can exceed the range of int
    for (size_t i = 0; i < buckets.size(); ++i) {
        for (size_t j = i + 1; j < buckets.size(); ++j) {
            result += static_cast<long long>(buckets[i]) * buckets[j];
        }
    }
    return result;
}
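The double loop above is quadratic in the number of sets. As a hedged aside, the same sum can be obtained in one pass from the identity sum over i < j of a_i * a_j = ((sum of a_i)^2 - sum of a_i^2) / 2. A minimal sketch, where the name count_combinations_fast is hypothetical and the sizes are assumed small enough that the squared total fits in a long long:

```cpp
#include <vector>

// Sum of sizes[i] * sizes[j] over all i < j, computed in a single pass.
// Uses (S^2 - sum of squares) / 2, where S is the total of all sizes.
long long count_combinations_fast(const std::vector<long long>& sizes) {
    long long total = 0;
    long long squares = 0;
    for (long long s : sizes) {
        total += s;
        squares += s * s;
    }
    return (total * total - squares) / 2;
}
```

For sizes {1, 2, 3} this gives (36 - 14) / 2 = 11, which matches 1*2 + 1*3 + 2*3.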
Now let's imagine that some numbers are missing. We can add those missing numbers to our sets as one additional, separate set. But we also need to account for the fact that, given n missing numbers, there are n * (n-1) combinations built only from these missing numbers. So the following code produces the total number of combinations, taking the missing numbers into account:
// the numbers range over 1 .. upper_bound - 1
long long missing_numbers = upper_bound - 1 - static_cast<long long>(all_numbers.size());
long long missing_combinations = missing_numbers * (missing_numbers - 1);
buckets.push_back(missing_numbers);  // the missing numbers act as one extra set
return missing_combinations + count_combinations(buckets);
Now let's imagine we have a duplicate across two sets: {a, b, c}, {a, d}.
What types of errors will it introduce? The following pairs: {a, a}, a repetition, and {a, d}, a combination which is present in the second set.
So how should we treat such duplicates? We need to eliminate them completely from all sets. Even a single instance of a duplicate will produce a combination present in some set, because we can pick any element from the set where the duplicate was removed and produce such a combination (in my example, if we keep a in the first set, then picking d from the second produces {a, d}; if we keep a in the second set, then picking b or c from the first produces {a, b} and {a, c}). So duplicates shall be removed.
Update
However, we can't simply remove all duplicates. Consider this counterexample:
{a, b} {a, c} {d}. If we simply remove a, we get {b} {c} {d} and lose the information about the non-existing combination {a, d}. Consider another counterexample:
{a, b} {a, b, c} {b, d}. If we simply remove duplicates, we get {c} {d} and lose the information about {a, d}.
Also, we can't simply apply such logic to pairs of sets. A simple counterexample for numbers < 3: {1, 2} {1} {2}. Here the number of missing combinations is 0, but we will incorrectly count {1, 2} if we apply duplicate removal to pairs of sets. The bottom line is that I can't come up with a good technique to correctly handle duplicate elements across sets.
What you can do, depending on memory requirements, is take advantage of the ordering of std::set and iterate over the values smartly, something like the code below (untested). You iterate over all of your sets, and for each set you iterate over its values. For each value, you pair it with all of the values after it in the set. The complexity is reduced to the number of sets times the square of their sizes. You can use a variety of methods to keep track of your found/unfound count, but using a set should be fine, since insertion is O(log(n)) where n is no more than 499998500001, the number of pairs of distinct values in 1..999999. In theory a map of sets (keyed on the first value) could be slightly faster, but in either case the cost is minimal.
#include <array>
#include <iterator>
#include <set>
#include <utility>

long long numMissing(const std::array<std::set<int>, 100000>& sets){
    std::set<std::pair<int, int> > found;
    for (const auto& s : sets){
        for (auto m = s.cbegin(); m != s.cend(); ++m){
            // n walks the values after m, so each pair is stored as (smaller, larger)
            for (auto n = std::next(m); n != s.cend(); ++n){
                found.emplace(*m, *n);
            }
        }
    }
    // (999999 choose 2) possible pairs, minus the pairs seen together
    return 499998500001LL - static_cast<long long>(found.size());
}
As an option, you can build Bloom filter(s) over your sets.
Before checking against all sets, you can do a quick lookup in your Bloom filter. Since it never produces false negatives, a negative answer means the pair is definitely not present in your sets and you can safely count it; only a positive answer needs the full check.
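To make the lookup concrete, here is a minimal Bloom filter sketch over 64-bit keys. Everything in it is an assumption for illustration: the class name, the double-hashing scheme, and the choice to pack a pair (m, n) into one key via the hypothetical helper pair_key. It is not tuned for the data sizes in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// A tiny Bloom filter. A "false" from may_contain is definitive;
// a "true" may be a false positive and still needs an exact check.
class BloomFilter {
public:
    // bits must be greater than zero.
    BloomFilter(std::size_t bits, int hashes) : bits_(bits, false), hashes_(hashes) {}

    void insert(std::uint64_t key) {
        for (int i = 0; i < hashes_; ++i) bits_[slot(key, i)] = true;
    }

    bool may_contain(std::uint64_t key) const {
        for (int i = 0; i < hashes_; ++i)
            if (!bits_[slot(key, i)]) return false;
        return true;
    }

private:
    // Double hashing: the i-th probe is h1 + i * h2, modulo the number of bits.
    std::size_t slot(std::uint64_t key, int i) const {
        std::uint64_t h1 = std::hash<std::uint64_t>{}(key);
        std::uint64_t h2 = (key * 0x9E3779B97F4A7C15ULL) | 1ULL;
        return static_cast<std::size_t>((h1 + static_cast<std::uint64_t>(i) * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    int hashes_;
};

// Packs a pair with m < n into a single key (assumes both fit in 32 bits).
std::uint64_t pair_key(std::uint32_t m, std::uint32_t n) {
    return (static_cast<std::uint64_t>(m) << 32) | n;
}
```

A filter could equally be built per set over its elements, in which case a pair can only be present in a set if both of its numbers may be in that set's filter.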
Physically storing each possible pair would take too much memory. We have 100k sets, and an average set with 10k numbers already yields about 50M pairs, i.e. 400 MB with int32 (and set<pair<int, int>> needs much more than 8 bytes per element).
My suggestion is based on two ideas:
don't store, only count the missing pairs
use interval set for compact storage and fast set operations (like boost interval set)
The algorithm is still quadratic on the number of elements in the sets but needs much less space.
Algorithm:
Create the union_set of the individual sets.
We also need a data structure, let's call it sets_for_number to answer this question: which sets contain a particular number? For the simplest case this could be unordered_map<int, vector<int>> (vector stores set indices 0..99999)
Also create the inverse sets for each set. Using interval sets this takes only 10k * 2 * sizeof(int) space per set on average.
// Interval types below follow Boost.ICL (boost/icl/interval_set.hpp); adapt them to your library.
dynamic_bitset<> union_set = ...; //union of individual sets (can be vector<bool>)
vector<interval_set<int>> inverse_sets = ...; // numbers 1..999999 not contained in each set
int64_t missing_count = 0;
for(int n = 1; n < 1000000; ++n) {
    // count the missing pairs whose first element is n
    if (!union_set[n]) {
        // all pairs are missing
        missing_count += (999999 - n);
    } else {
        // check which second elements are not present
        interval_set<int> missing_second_elements(interval<int>::right_open(n+1, 1000000));
        // iterate over all sets containing n
        for(int set_idx: sets_for_number.at(n)) {
            // operator&= is in-place intersection
            missing_second_elements &= inverse_sets[set_idx];
        }
        // counting the number of pairs (n, m) where m is a number
        // that is not present in any of the sets containing n
        for(const auto& gap: missing_second_elements)
            missing_count += cardinality(gap);
    }
}
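For readers without an interval-set library, the same "count, don't store" idea can be sketched with the standard library only. This swaps the interval sets for a stamp array, so it is a different technique from the one above: for each n it counts the distinct larger numbers that share a set with n and subtracts that from the number of candidates. The function name and the limit parameter are hypothetical; values are assumed to lie in [1, limit) with limit >= 1.

```cpp
#include <cstddef>
#include <vector>

// Counts pairs (n, m), 1 <= n < m < limit, that never occur together in any set.
long long count_missing_pairs(const std::vector<std::vector<int>>& sets, int limit) {
    // sets_for_number[v] lists the indices of the sets that contain v.
    std::vector<std::vector<std::size_t>> sets_for_number(limit);
    for (std::size_t i = 0; i < sets.size(); ++i)
        for (int v : sets[i]) sets_for_number[v].push_back(i);

    // stamp[m] == n means m was already counted as a partner of n.
    std::vector<int> stamp(limit, 0);
    long long missing = 0;
    for (int n = 1; n < limit; ++n) {
        long long partners = 0;  // distinct m > n sharing at least one set with n
        for (std::size_t idx : sets_for_number[n]) {
            for (int m : sets[idx]) {
                if (m > n && stamp[m] != n) {
                    stamp[m] = n;
                    ++partners;
                }
            }
        }
        missing += (limit - 1 - n) - partners;
    }
    return missing;
}
```

On the question's example, {1,2,4} and {1,3} with limit 5, this counts {2,3} and {3,4} and returns 2. It is still quadratic in the set sizes in the worst case, but needs only O(limit) extra memory beyond the index.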
If it is possible, keep a set of all numbers and remove each number from it when you insert it into your array of sets. This has O(n) space complexity.
Of course, if you don't want high space complexity, you could use a vector of ranges instead: each element in the vector is a pair of numbers giving the start and end of a range.
I've got multiple arrays and want to find the permutations of all the elements in these arrays. Each element also carries a weight, and these arrays are sorted in decreasing order of weight. For each array I've got a weight array that mirrors the array of the values themselves. I want my search to go through the permutations from the greatest weight to the lowest weight.
However, each element in an array has a weight associated with it so I want to run my search with those with the highest weight first.
Example:
arr0 = [A, B, C, D]
arr0_weight = [11, 7, 4, 3]
arr1 = [W, X, Y]
arr1_weight = [10, 9, 4]
Thus, the ideal output would be:
AW (11+10=21)
AX (11+9=20)
BW (7+10=17)
BX (7+9=16)
AY (11+4=15)
...
If I did just a for loop like this:
for (int i = 0; i < sizeof(arr0)/4; i++) {
    for (int j = 0; j < sizeof(arr1)/4; j++) {
        cout << arr0[i] << arr1[j] << endl;
    }
}
I would get:
AW (11+10=21)
AX (11+9=20)
AY (11+4=15)
BW (7+10=17)
BX (7+9=16)
BY (7+4=11)
Which isn't what I want because 17 > 15 and 16 > 15.
Also, what's a good way to do this for n arrays? If I don't know how many arrays I will have, and their size might not all be the same?
I've looked into putting the values into vectors but I can't find a way to do what I want (a sorted Cartesian product). Any help? Pseudo-code is fine if you don't have time - I'm just really stuck.
Thanks so much.
Your question is about the algorithm, not C++.
You want to sort all tuples in the Cartesian product from heaviest to lightest.
The easiest way is to find all tuples and sort them by their weight.
If you need sequential access, you should do the following. Since the weight of a tuple is the sum of the weights of its elements, I think greediness is optimal here. Let's move to an arbitrary number of arrays of arbitrary dimensions. Create a set of indices, one per array. Initially, it contains all zeros. The first tuple that it represents is obviously the heaviest. Find one of the indices to increment: choose the index that loses the least weight, i.e. the one that has the smallest difference with its next element. Don't forget to keep track of exhausted arrays. When all arrays are exhausted, you're done.
To implement it in C++, you should employ vector<pair<element_t, weight_t>> for the input data and set<pair<weight_difference_t, index_t>> as the set of indices. All types are probably integers, but I used custom types to show which data should be there. You should also know how pair is compared.
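One caveat worth checking by hand: on the question's example, a single greedy index vector appears to move from AX to BX (losing 4) rather than to AY (losing 5), and so never visits BW (17). The sketch below therefore swaps in a different technique, a best-first search with a max-heap over index tuples plus a visited set. The names sorted_product and Index are hypothetical, and each weight array is assumed to be sorted in descending order.

```cpp
#include <cstddef>
#include <queue>
#include <set>
#include <utility>
#include <vector>

using Index = std::vector<std::size_t>;  // one position per input array

// Returns all index tuples of the Cartesian product, heaviest total weight first.
std::vector<Index> sorted_product(const std::vector<std::vector<int>>& weights) {
    std::vector<Index> order;
    for (const auto& w : weights)
        if (w.empty()) return order;  // an empty array means an empty product

    auto total = [&weights](const Index& idx) {
        long long sum = 0;
        for (std::size_t a = 0; a < idx.size(); ++a) sum += weights[a][idx[a]];
        return sum;
    };

    std::priority_queue<std::pair<long long, Index>> heap;  // max-heap on weight
    std::set<Index> seen;                                   // tuples already queued
    Index start(weights.size(), 0);
    heap.push(std::make_pair(total(start), start));
    seen.insert(start);

    while (!heap.empty()) {
        Index idx = heap.top().second;
        heap.pop();
        order.push_back(idx);
        // Queue every tuple reachable by advancing exactly one index.
        for (std::size_t a = 0; a < idx.size(); ++a) {
            Index next = idx;
            if (++next[a] < weights[a].size() && seen.insert(next).second)
                heap.push(std::make_pair(total(next), next));
        }
    }
    return order;
}
```

With the weights {11, 7, 4, 3} and {10, 9, 4}, the first five tuples come out as AW, AX, BW, BX, AY. The visited set grows with the number of tuples produced, so for large products it may be worth stopping early once enough tuples have been emitted.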
I have a vector that looks like:
vector<int> A = {0, 1, 1, 0, 0, 1, 0, 1};
I'd like to select a random index from the non-zero values of A. Using this example A, I want to randomly select an element from the array {1,2,5,7}.
Currently I do this by creating another array
vector<int> b;
for(int i=0;i<A.size();i++)
if(A[i])
b.push_back(i);
Once b is created, I find the index by using this answer:
get random element from container
Is there a more STL-like (or C++11) way of doing this, perhaps one that does not create an intermediate array? In this example A is small, but in my production code this selection process is in an inner-loop and A is non-static and thousands of elements long.
A great way to do this is Reservoir Sampling.
In short, you walk your array until you find the first non-zero value, and record that index as the first possible answer you might return.
Then, you continue to walk the array. Every time you find another non-zero value, you may replace your possible answer with the new index, with decreasing probability: the k-th non-zero value you find replaces it with probability 1/k.
This algorithm also works great if you need M random index values from your array.
What's great about this, is that you walk each element only one time, and you don't need a separate memory structure to record the non-zero elements. It's O(N) in speed, and O(M) in memory, in your case it's O(1) in memory, since you only want 1 random value.
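A minimal sketch of the single-value case described above, assuming a standard random engine is passed in; the function name is hypothetical and -1 is used as an assumed "nothing found" result:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Returns the index of a uniformly chosen non-zero element, or -1 if there is none.
template <class Rng>
long pick_nonzero_index(const std::vector<int>& a, Rng& rng) {
    long chosen = -1;
    std::size_t seen = 0;  // number of non-zero values met so far
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == 0) continue;
        ++seen;
        // Replace the current candidate with probability 1 / seen.
        std::uniform_int_distribution<std::size_t> dist(0, seen - 1);
        if (dist(rng) == 0) chosen = static_cast<long>(i);
    }
    return chosen;
}
```

It could be called as `std::mt19937 rng(std::random_device{}()); long i = pick_nonzero_index(A, rng);`. One random draw is made per non-zero element, which is the cost the next paragraph warns about.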
On the flip side, random number generators are traditionally quite slow. So, you might want to performance test this against any other ideas people come up with here, to see if the trade-off of speed-vs-memory is worth it for you.
With a single pass through the array, you can determine how many false (or true) values there are. If you are doing this kind of thing often, you can even write a class to keep track of this for you.
Regardless, you can then pick a random number i with 0 <= i < num_false (or num_true). Then, with another pass through the array, you can return the index of the ith false (or true) value, counting from zero.
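The two passes can be sketched as follows, here for the non-zero (true) case. The function name is hypothetical and -1 is an assumed "nothing found" result:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Two-pass selection: count the non-zero entries, draw i, then walk to the i-th one.
template <class Rng>
long pick_nonzero_two_pass(const std::vector<int>& a, Rng& rng) {
    // First pass: count the non-zero entries.
    std::size_t count = static_cast<std::size_t>(
        std::count_if(a.begin(), a.end(), [](int v) { return v != 0; }));
    if (count == 0) return -1;

    // Draw i uniformly from 0 .. count - 1.
    std::uniform_int_distribution<std::size_t> dist(0, count - 1);
    std::size_t target = dist(rng);

    // Second pass: return the index of the i-th non-zero entry.
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == 0) continue;
        if (target == 0) return static_cast<long>(i);
        --target;
    }
    return -1;  // not reached when count > 0
}
```

This makes a single random draw per selection, at the price of reading the array twice.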
We can loop through each non-zero value and assign it a random number. The index with the largest random number is the one we select.
int value = -1;  // below anything rand() can return
int index = -1;  // stays -1 if A has no non-zero entry
for(int i = 0; i < A.size(); i++) {
    if(!A[i]) continue;
    auto j = rand();
    if(j > value) {
        index = i;
        value = j;
    }
}
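vector<int> A = {0,1,1,0,0,1,0,1};
// random_shuffle was removed in C++17; shuffle with an explicit engine replaces it
mt19937 gen(random_device{}());
shuffle(A.begin(),A.end(),gen);
auto it = find_if(A.begin(),A.end(),[](const int elem){return elem != 0;});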
I'm trying to create an equivalent of the excel VLOOKUP function for a two dimensional csv file I have. If given a number 5 I would like to be able to look at a column of a dynamic table I have and find the row with the highest number less than five in that column.
For example. If I used 5 from my example before:
2 6
3 7
4 11
6 2
9 4
Would return to me 11, the data paired with the highest entry below 5.
I have no idea how to go about doing this. If it helps, the entries in column one (the column I will be searching) will go from smallest to largest.
I am a beginner to C++ so I apologize if I'm missing some obvious method.
std::map can do this pretty easily:
You'd start by creating a map of the correct type, then populating it with your data:
std::map<int, int, std::greater<int> > data;
data[2] = 6;
data[3] = 7;
data[4] = 11;
data[6] = 2;
data[9] = 4;
Then you'd search for data with lower_bound or upper_bound:
std::cout << data.lower_bound(5)->second; // prints 11
A couple of notes: First, note the use of std::greater<T> as the comparison operator. This is necessary because lower_bound will normally return an iterator to the next item (instead of the previous) if the key you're looking for isn't present in the map. Using std::greater<T> sorts the map in reverse, so the "next" item is the smaller one instead of the larger.
Second, note that this automatically sorts the data based on the keys, so it depends only on the data you insert, not the order of insertion.
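Since the question says the first column is already sorted in ascending order, a plain sorted vector with std::lower_bound is an alternative to the map. This is a different technique from the one above, shown as a minimal sketch; the name lookup_below is hypothetical, and it returns false when no key is strictly below the limit (a case the one-liner above does not guard against).

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Finds the value paired with the largest key strictly below `limit`.
// `rows` must be sorted by key in ascending order.
bool lookup_below(const std::vector<std::pair<int, int>>& rows, int limit, int& value) {
    // First row whose key is >= limit; the row before it is the answer.
    auto it = std::lower_bound(
        rows.begin(), rows.end(), limit,
        [](const std::pair<int, int>& row, int key) { return row.first < key; });
    if (it == rows.begin()) return false;  // every key is >= limit
    value = std::prev(it)->second;
    return true;
}
```

With the rows (2,6), (3,7), (4,11), (6,2), (9,4) and a limit of 5, the lookup lands on key 4 and yields 11.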
I am working on a binary linear program problem.
I am not really familiar with any computer language(just learned Java and C++ for a few months), but I may have to use computer anyway since the problem is quite complicated.
The first step is to declare variables m_ij for every entry in (at least 8 X 8) a matrix M.
Then I assign corresponding values of each element of a matrix to each of these variables.
The next is to generate other sets of variables, x_ij1, x_ij2, x_ij3, x_ij4, and x_ij5, whenever the value of m_ij is not 0.
The value of x_ijk variable is either 0 or 1, and I do not have to assign values for x_ijk variables.
Probably the simplest way to do it is to declare and assign a value to each variable, e.g.
int m_11 = 5, m_12 = 2, m_13 = 0, ... m_1n = 1
int m_21 = 3, m_22 = 1, m_23 = 2, ... m_2n = 3
and then pick variables, the value of which is not 0, and declare x_ij1 ~ x_ij5 accordingly.
But this might be too much work, especially since I am going to consider many different matrices for this problem.
Is there any way to do this automatically?
I know a little bit of Java and C++, and I am considering using lp_solve package in C++(to solve binary integer linear program problem), but I am willing to use any other language or program if I could do this easily.
I am sure there must be some way to do this(probably using loops, I guess?), and this is a very simple task, but I just don't know about it because I do not have much programming language.
One of my cohort wrote a program for generating a random matrix satisfying some condition we need, so if I could use that matrix as my input, it might be ideal, but just any way to do this would be okay as of now.
Say, if there is a way to do it with MS excel, like putting matrix entries to the cells in an excel file, and import it to C++ and automatically generate variables and assign values to them, then this would simplify the task by a great deal!
Matlab indeed seems very suitable for the task. Though the example offered by @Dr_Sam will indeed create the matrices on the fly, I would recommend initializing them before you assign the values. This way your code still ends up with the right variable even if something with the same name already existed in the workspace, and your variable will always have the expected size.
Assuming you want to define a square 8x8 matrix:
m = zeros(8)
Now in general, if you want to initialize a three-dimensional matrix of size imax, jmax, kmax:
imax = 8;
jmax = 8;
kmax = 5;
x = zeros(imax,jmax,kmax);
Now assigning to or reading from these matrices is very easy. Note that the length and width of m have been chosen to match the first two dimensions of x:
m(3,4) = 4; %Assign a value
myvalue = m(3,4) %read the value
m(:,1) = 1:8 %Assign the values 1 through 8 to the first column
x(2,4,5) = 12; %Assign a single value to the three dimensional matrix
x(:,:,2) = m+1; %Assign the entire matrix plus one to one of the planes in x.
In C++ you could use a std::vector of vectors, like
std::vector<std::vector<int>> matrix;
You don't need to use separate variables for the matrix values, why would you when you have the matrix?
I don't understand the reason you need to get all values where you evaluate true or false. Instead just put directly into a std::vector the coordinates where your condition evaluates to true:
std::vector<std::pair<int, int>> true_values;
for (int i = 0; i < matrix.size(); i++)
{
    for (int j = 0; j < matrix[i].size(); j++)
    {
        if (some_condition_for_this_matrix_value(matrix[i][j], i, j))
            true_values.emplace_back(i, j);  // constructs the pair in place
    }
}
Now you have a vector of all matrix coordinates where your condition is true.
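Tied back to the question, the condition would be "m_ij is not 0", and each such cell gets five binary variables x_ij1 to x_ij5. A minimal sketch under those assumptions; the helper names and the x_i_j_k naming scheme are hypothetical, and how the names are handed to a solver is left open:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Collects the (row, column) positions of all non-zero entries, 0-based.
std::vector<std::pair<int, int>> nonzero_cells(const std::vector<std::vector<int>>& m) {
    std::vector<std::pair<int, int>> cells;
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[i].size(); ++j)
            if (m[i][j] != 0)
                cells.emplace_back(static_cast<int>(i), static_cast<int>(j));
    return cells;
}

// Builds the names x_i_j_k (1-based i and j, k = 1..per_cell) for each listed cell.
std::vector<std::string> binary_variable_names(
        const std::vector<std::pair<int, int>>& cells, int per_cell) {
    std::vector<std::string> names;
    for (const auto& cell : cells)
        for (int k = 1; k <= per_cell; ++k)
            names.push_back("x_" + std::to_string(cell.first + 1) + "_" +
                            std::to_string(cell.second + 1) + "_" + std::to_string(k));
    return names;
}
```

For the matrix rows {5, 2, 0} and {3, 1, 2}, five cells are non-zero, so per_cell = 5 yields 25 names, from x_1_1_1 to x_2_3_5.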
If you really want to have both true and false values, you could use a std::map with a std::pair containing the matrix coordinates as key and bool as value:
// Create a type alias, as this type will be used multiple times
typedef std::map<std::pair<int, int>, bool> bool_map_type;
bool_map_type bool_map;
Insert into this map all values from the matrix, with the coordinates of the matrix as the key, and the map value as true or false depending on whatever condition you have.
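To get a list of all true entries from the bool_map, erase the false entries. std::remove_if cannot be used on a std::map because its keys are const, so erase them in a loop instead:
for (auto it = bool_map.begin(); it != bool_map.end(); ) {
    if (it->second == false)
        it = bool_map.erase(it);  // erase returns the next valid iterator
    else
        ++it;
}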
Now you have a map containing only entries with their value as true. Iterate over this map to get the coordinates to the matrix
Of course, I may totally have misunderstood your problem, in which case you of course are free to disregard this answer. :)
I know both C++ and Matlab (not Python) and in your case, I would really go for Matlab because it's way easier to use when you start programming (but don't forget to come back to C++ when you will find the limitations to Matlab).
In Matlab, you can define matrices very easily: just type the name of the matrix and the index you want to set:
m(1,1) = 1
m(2,2) = 1
gives you a 2x2 identity matrix (indices start with 1 in Matlab and entries are 0 by default). You can also define 3d matrices the same way:
x(1,2,3) = 2
For the import from Excel, it is possible if you save your excel file in CSV format, you can use the function dlmread to read it in Matlab. You could also try later to implement your algorithm directly in Matlab.
Finally, if you want to solve your binary integer program, there is already a built-in function in Matlab called bintprog which can solve it for you (newer releases provide intlinprog for this instead).
Hope it helps!