I have a 1-D NumPy array, and I want to find, for each of its values, the index of the closed interval (specified by another 1-D array) that contains it. To be concrete, here is an example:
A= np.array([ 0.69452994, 3.4132039 , 6.46148658, 17.85754453,
21.33296454, 1.62110662, 8.02040621, 14.05814177,
23.32640469, 21.12391059])
b = np.array([ 0. , 3.5, 9.8, 19.8 , 50.0])
For each value in A, I want the index in b of the closed interval that contains it (b is always sorted, starting at 0 and ending at the maximum value A can ever take).
In this specific example, my output will be
indx = [0,0,1,2,3,0,1,2,3,3]
How can I do this? I tried np.where, without any success.
Given that b is sorted, we can simply use searchsorted/digitize to get the indices at which the elements of A could be inserted while keeping b sorted. In essence this gives the boundary index for each element of A; subtracting 1 from those indices yields the desired output.
Thus, assuming each interval is open on the left and closed on the right, i.e. b[i] < x <= b[i+1], the solution would be -
np.searchsorted(b,A)-1
np.digitize(A,b,right=True)-1
For intervals closed on the left and open on the right, i.e. b[i] <= x < b[i+1], use :
np.searchsorted(b,A,'right')-1
np.digitize(A,b,right=False)-1
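To see the two boundary conventions without NumPy, here is a minimal sketch using the standard library's bisect module, whose bisect_left and bisect_right break ties the same way as side='left' and side='right'. The helper names bin_right_closed and bin_left_closed are made up for this illustration.

```python
from bisect import bisect_left, bisect_right

A = [0.69452994, 3.4132039, 6.46148658, 17.85754453, 21.33296454,
     1.62110662, 8.02040621, 14.05814177, 23.32640469, 21.12391059]
b = [0.0, 3.5, 9.8, 19.8, 50.0]


def bin_right_closed(value, edges):
    # Tie-breaking of side='left': edges[i] < value <= edges[i + 1]
    return bisect_left(edges, value) - 1


def bin_left_closed(value, edges):
    # Tie-breaking of side='right': edges[i] <= value < edges[i + 1]
    return bisect_right(edges, value) - 1


indx = [bin_right_closed(a, b) for a in A]
print(indx)  # [0, 0, 1, 2, 3, 0, 1, 2, 3, 3]

# The two variants only differ for values that sit exactly on an edge.
print(bin_right_closed(3.5, b), bin_left_closed(3.5, b))  # 0 1
```

Note that with the right-closed variant a value exactly equal to b[0] maps to -1, so pick the variant that matches how edge values should be treated.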
Consider an arbitrary connected (acyclic/cyclic) undirected graph with N Vertices, with vertex numbered from 1 to N. Each vertex has some value assigned to it. Let the values be denoted by A1, A2, A3, ... AN, where A[i] denotes value of ith vertex. Let P be a permutation of A. Each operation, we can swap values of two adjacent vertices. Is it possible to achieve A = P, i.e. after all swapping operation A[i] = P[i] for all 1 <= i <= N. In other words, each vertex i should have value P[i] after the operations.
P.S - I was confused about where to ask this - stack overflow or math.stack exchange. Apologies in advance.
Edit 1: I think the answer should be yes, but I am only saying this on the basis of a case analysis of different types of graphs with 5 vertices. I tried to relabel the permutation as Q where Q1 < Q2 < ... < QN. This changes the problem a bit: the final state should now be A1 < A2 < A3 < ... < AN, so the question becomes whether the graph can be sorted. Please correct me if my assumption is wrong.
Indeed this is possible. Since we've got a connected graph, we can remove edges until we've got a spanning tree. Removing an edge simply means we won't use it for adjacent swaps. "Removing a node" simply means we'll never swap the value of that node again.
Now we can use the following algorithm to produce the permutation:
1. Choose a leaf and determine the current position of the value intended to be located there after the permutation. Repeatedly swap that value with the next one on the path to the leaf until the value reaches the leaf.
2. Remove the leaf from the tree; the resulting graph is still a tree.
3. Continue with 1. if there are any nodes left.
In each iteration we reduce the size of the graph by 1 by doing a number of swaps that is bounded from above by the number of nodes, so with a finite number of swaps (fewer than N^2 in total) we're able to produce the permutation. The algorithm may not yield a solution using the optimum number of swaps, but it shows that it can be done.
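The leaf-by-leaf procedure can be sketched as follows. This is only an illustration under stated assumptions: vertices are numbered 0..n-1 rather than 1..N, the graph is connected, target is a rearrangement of values, and the names swaps_to_permute and apply_swaps are invented here.

```python
from collections import deque


def swaps_to_permute(n, edges, values, target):
    """Return adjacent swaps (u, v) that turn `values` into `target`."""
    values = list(values)
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Keep only the edges of a BFS spanning tree.
    tree = {v: set() for v in range(n)}
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree[u].add(v)
                tree[v].add(u)
                queue.append(v)

    swaps = []
    remaining = set(range(n))
    while len(remaining) > 1:
        leaf = next(v for v in remaining if len(tree[v]) == 1)
        # BFS outward from the leaf to a vertex holding the wanted value.
        parent, queue = {leaf: None}, deque([leaf])
        while queue:
            u = queue.popleft()
            if values[u] == target[leaf]:
                break
            for v in tree[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        # Walk the value back along the path, one adjacent swap at a time.
        while parent[u] is not None:
            p = parent[u]
            values[u], values[p] = values[p], values[u]
            swaps.append((u, p))
            u = p
        # Detach the finished leaf; what is left is still a tree.
        (neighbour,) = tree[leaf]
        tree[neighbour].discard(leaf)
        tree[leaf].clear()
        remaining.discard(leaf)
    return swaps


def apply_swaps(values, swaps):
    values = list(values)
    for u, v in swaps:
        values[u], values[v] = values[v], values[u]
    return values


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
start = ["a", "b", "c", "d"]
goal = ["c", "a", "d", "b"]
swaps = swaps_to_permute(4, edges, start, goal)
print(apply_swaps(start, swaps) == goal)  # True
```

Every swap uses a tree edge, which is also an edge of the original graph, and a detached leaf is never touched again.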
I have an array containing 100,000 sets. Each set contains natural numbers below 1,000,000. I have to find the number of ordered pairs {m, n}, where 0 < m < 1,000,000, 0 < n < 1,000,000 and m != n, which do not exist together in any of 100,000 sets. A naive method of searching through all the sets leads to 10^5 * (10^6 choose 2) number of searches.
For example I have 2 sets set1 = {1,2,4} set2 = {1,3}. All possible ordered pairs of numbers below 5 are {1,2}, {1,3}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs of numbers below 5 which do not exist together in set 1 are {1,3},{2,3} and {3,4}. The ordered pairs below 5 missing in set 2 are {1,2},{1,4},{2,3},{2,4} and {3,4}. The ordered pairs which do not exist together in both the sets are {2,3} and {3,4}. So the count of number of ordered pairs missing is 2.
Can anybody point me to a clever way of organizing my data structure so that finding the number of missing pairs is faster? I apologize in advance if this question has been asked before.
Update:
Here is some information about the structure of my data set.
The number of elements in each set varies from 2 to 500,000. The median number of elements is around 10,000. The distribution peaks around 10,000 and tapers down in both direction. The union of the elements in the 100,000 sets is close to 1,000,000.
If you are looking for combinations across sets, there is a way to meaningfully condense your dataset, as shown in frenzykryger's answer. However, from your examples, what you're looking for is the number of combinations available within each set, meaning each set contains irreducible information. Additionally, you can't use combinatorics to simply obtain the number of combinations from each set either; you ultimately want to deduplicate combinations across all sets, so the actual combinations matter.
Knowing all this, it is difficult to think of any major breakthroughs you could make. Let's say you have i sets and a maximum of k items in each set. The naive approach would be:
1. If your sets are typically dense (i.e. contain most of the numbers between 1 and 1,000,000), replace them with the complement of the set instead.
2. Create a set of 2-tuples (use a set structure that ensures insertion is idempotent).
3. For each set, O(i):
   evaluate all combinations and insert them into the set of combinations, O(k choose 2).
The worst case complexity for this isn't great, but assuming you have scenarios where a set either contains most of the numbers between 0 and 1,000,000, or almost none of them, you should see a big improvement in performance.
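A minimal sketch of steps 2 and 3 on small inputs follows; the complement trick from step 1 is left out, and the name count_missing_pairs is invented for this illustration. It is only practical when the number of pairs fits in memory.

```python
from itertools import combinations


def count_missing_pairs(sets, upper_bound):
    """Count pairs m < n from 1..upper_bound-1 that never share a set."""
    found = set()  # set insertion is idempotent, so repeated pairs collapse
    for s in sets:
        # sorted() makes every pair come out as (smaller, larger)
        found.update(combinations(sorted(s), 2))
    total = (upper_bound - 1) * (upper_bound - 2) // 2
    return total - len(found)


# The example from the question: numbers below 5, two sets.
print(count_missing_pairs([{1, 2, 4}, {1, 3}], 5))  # 2
```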
Another approach would be to go ahead and use combinatorics to count the number of combinations from each set, then use some efficient approach to find the number of duplicate combinations among sets. I'm not aware of such an approach, but it is possible it exists.
First let's solve the simpler task of counting the numbers not present in any of your sets. This task can be reworded in a simpler form: instead of 100,000 sets you can think about 1 set which contains all your numbers. Then the number of elements not present in this set is x = 1000000 - len(set). Now you can use this number x to count the number of combinations. With repetitions: x * x, without repetitions: x * (x - 1). So the bottom line of my answer is to put all your numbers in one big set and use its length to find the number of combinations using combinatorics.
Update
So above we have a way to find the number of combinations where each element of the combination is not in any of the sets. But the question was to find the number of combinations where the combination itself is not present in any of the sets.
Let's try to solve a simpler problem first:
- your sets have all numbers in them, none missing
- each number is present in exactly one set, no duplicates across sets
How would you construct such combinations over such sets? You would simply pick two elements from different sets, and the resulting combination would not be in any of the sets. The number of such combinations can be counted using the following code (it accepts the sizes of the sets):
long long count_combinations(const vector<int>& buckets) {
    long long result = 0;  // products of set sizes can overflow int
    for (size_t i = 0; i < buckets.size(); ++i) {
        for (size_t j = i + 1; j < buckets.size(); ++j) {
            result += static_cast<long long>(buckets[i]) * buckets[j];
        }
    }
    return result;
}
Now let's imagine that some numbers are missing. Then we can just add an additional set containing those missing numbers to our sets (as a separate set). But we also need to account for the fact that, given n missing numbers, there are n * (n-1) combinations constructed using only these missing numbers. So the following code produces the total number of combinations, taking the missing numbers into account:
long long missing_numbers = upper_bound - all_numbers.size() - 1;
long long missing_combinations = missing_numbers * (missing_numbers - 1);
// treat the missing numbers as one more bucket
vector<int> all_buckets = buckets;
all_buckets.push_back(static_cast<int>(missing_numbers));
return missing_combinations + count_combinations(all_buckets);
Now let's imagine we have a duplicate across two sets: {a, b, c}, {a, d}.
What types of errors will it introduce? The following pairs: {a, a}, a repetition, and {a, d}, a combination which is present in the second set.
So how should such duplicates be treated? We need to eliminate them completely from all sets. Even a single remaining instance of a duplicate will produce a combination present in some set, because we can just pick any element from the set where the duplicate was removed and produce such a combination (in my example: if we keep a in the first set, then picking d from the second produces {a, d}; if we keep a in the second set, then picking b or c from the first produces {a, b} and {a, c}). So duplicates shall be removed.
Update
However, we can't simply remove all duplicates; consider this counterexample:
{a, b} {a, c} {d}. If we simply remove a we get {b} {c} {d} and lose the information about the non-existing combination {a, d}. Consider another counterexample:
{a, b} {a, b, c} {b, d}. If we simply remove duplicates we get {c} {d} and lose the information about {a, d}.
Also, we can't simply apply such logic to pairs of sets; a simple counterexample for numbers < 3: {1, 2} {1} {2}. Here the number of missing combinations is 0, but we will incorrectly count {1, 2} if we apply duplicate removal to pairs of sets. The bottom line is that I can't come up with a good technique that correctly handles duplicate elements across sets.
What you can do, depending on memory requirements, is take advantage of the ordering of Set, and iterate over the values smartly. Something like the code below (untested). You'll iterate over all of your sets, and then for each of your sets you'll iterate over their values. For each of these values, you'll check all of the values in the set after them. Our complexity is reduced to the number of sets times the square of their sizes. You can use a variety of methods to keep track of your found/unfound count, but using a set should be fine, since insertion is simply O(log(n)) where n is no more than 499999500000. In theory using a map of sets (mapping based on the first value) could be slightly faster, but in either case the cost is minimal.
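#include <array>
#include <iterator>
#include <set>
#include <utility>

long long numMissing(const std::array<std::set<int>, 100000>& sets){
    std::set<std::pair<int, int>> found;
    for (const auto& s : sets){
        // iterate with iterators so the inner loop can start just after m
        for (auto m = s.cbegin(); m != s.cend(); ++m){
            // std::set is ordered, so *m < *n for every pair visited here
            for (auto n = std::next(m); n != s.cend(); ++n){
                found.emplace(*m, *n);
            }
        }
    }
    return 499999500000LL - static_cast<long long>(found.size());
}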
As an option, you can build Bloom filter(s) over your sets.
Before checking a pair against all sets, you can quickly look it up in the Bloom filter. Since a Bloom filter never produces false negatives, a negative answer means the pair is definitely not present in your sets, so you can count it as missing right away; only a positive answer needs the full check.
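The answer does not say what exactly goes into the filter, so the following is just one possible reading: a small Bloom filter keyed on pairs. The class name PairBloomFilter, the bit-array size, and the SHA-256-based hashing are all assumptions made for this sketch, not a tuned design.

```python
import hashlib
from itertools import combinations


class PairBloomFilter:
    """Tiny Bloom filter over unordered pairs of ints (sizes are arbitrary)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, m, n):
        m, n = min(m, n), max(m, n)  # treat (m, n) and (n, m) alike
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{m}:{n}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, m, n):
        for pos in self._positions(m, n):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, m, n):
        # False means "definitely never added"; True means "possibly added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(m, n))


bloom = PairBloomFilter()
sets = [{1, 2, 4}, {1, 3}]
for s in sets:
    for m, n in combinations(sorted(s), 2):
        bloom.add(m, n)

# Every pair that was inserted is reported as possibly present.
print(all(bloom.might_contain(m, n)
          for s in sets for m, n in combinations(sorted(s), 2)))  # True
```

A False from might_contain settles the question for that pair; a True still has to be confirmed against the real sets because false positives are possible.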
Physically storing each possible pair would take too much memory. We have 100k sets, and an average set with 10k numbers already produces about 50M pairs, i.e. about 400 MB with two int32 values per pair (and set<pair<int, int>> needs much more than 8 bytes per element).
My suggestion is based on two ideas:
don't store, only count the missing pairs
use interval set for compact storage and fast set operations (like boost interval set)
The algorithm is still quadratic on the number of elements in the sets but needs much less space.
Algorithm:
Create the union_set of the individual sets.
We also need a data structure, let's call it sets_for_number to answer this question: which sets contain a particular number? For the simplest case this could be unordered_map<int, vector<int>> (vector stores set indices 0..99999)
Also create the inverse sets for each set. Using interval sets this takes only 10k * 2 * sizeof(int) space per set on average.
dynamic_bitset<> union_set = ...; // union of individual sets (can be vector<bool>)
vector<interval_set<int>> inverse_sets = ...; // numbers 1..999999 not contained in each set
int64_t missing_count = 0;
for (int n = 1; n < 1000000; ++n) {
    // count the missing pairs whose first element is n
    if (!union_set[n]) {
        // n is in no set, so all pairs (n, m) with m > n are missing
        missing_count += (999999 - n);
    } else {
        // check which second elements are not present
        // (pseudo-constructor: the half-open range [n+1, 1000000))
        interval_set<int> missing_second_elements = interval_set<int>(n+1, 1000000);
        // iterate over all sets containing n
        for (int set_idx : sets_for_number.at(n)) {
            // operator&= is in-place intersection
            missing_second_elements &= inverse_sets[set_idx];
        }
        // counting the number of pairs (n, m) where m is a number
        // that is not present in any of the sets containing n
        for (auto interval : missing_second_elements)
            missing_count += interval.size(); // number of integers in the interval
    }
}
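The same counting idea can be tried out on small inputs with plain Python sets. This sketch swaps the interval sets for ordinary sets and, instead of intersecting the inverse sets, subtracts the union of the partners of n, which is equivalent by De Morgan's law. The function name count_missing_pairs_by_first is invented here.

```python
from collections import defaultdict


def count_missing_pairs_by_first(sets, upper_bound):
    """Count pairs (n, m), n < m < upper_bound, that never share a set."""
    # sets_for_number[v] lists the indices of the sets that contain v
    sets_for_number = defaultdict(list)
    for idx, s in enumerate(sets):
        for value in s:
            sets_for_number[value].append(idx)

    missing = 0
    for n in range(1, upper_bound):
        candidates = upper_bound - 1 - n  # how many m satisfy n < m < upper_bound
        if n not in sets_for_number:
            missing += candidates  # n is in no set, so every pair is missing
            continue
        partners = set()  # all m > n that share at least one set with n
        for idx in sets_for_number[n]:
            partners.update(m for m in sets[idx] if m > n)
        missing += candidates - len(partners)
    return missing


print(count_missing_pairs_by_first([{1, 2, 4}, {1, 3}], 5))  # 2
```

Only one partner set is alive at a time, so the pairs themselves are never stored.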
If it is possible, keep a set of all numbers and remove each number from it when you insert that number into your array of sets. This has O(n) space complexity.
Of course, if you don't want such high space complexity, maybe you can use a vector of ranges: each element of the vector is a pair of numbers giving the start and end of a range.
I've got multiple arrays and want to find the permutations of all the elements in these arrays (one element from each array). Each element also carries a weight, and the arrays are sorted in decreasing order of weight. For each array I've got a parallel array of weights that mirrors the array with the values themselves. I want my search to visit the permutations from the greatest total weight to the lowest, so that those with the highest weight come first.
Example:
arr0 = [A, B, C, D]
arr0_weight = [11, 7, 4, 3]
arr1 = [W, X, Y]
arr1_weight = [10, 9, 4]
Thus, the ideal output would be:
AW (11+10=21)
AX (11+9=20)
BW (7+10=17)
BX (7+9=16)
AY (11+4=15)
...
If I did just a for loop like this:
for (int i = 0; i < sizeof(arr0)/sizeof(arr0[0]); i++) {
    for (int j = 0; j < sizeof(arr1)/sizeof(arr1[0]); j++) {
        cout << arr0[i] << arr1[j] << endl;
    }
}
I would get:
AW (11+10=21)
AX (11+9=20)
AY (11+4=15)
BW (7+10=17)
BX (7+9=16)
BY (7+4=11)
Which isn't what I want because 17 > 15 and 16 > 15.
Also, what's a good way to do this for n arrays? If I don't know how many arrays I will have, and their size might not all be the same?
I've looked into putting the values into vectors but I can't find a way to do what I want (a sorted Cartesian product). Any help? Pseudo-code is fine if you don't have time - I'm just really stuck.
Thanks so much.
Your question is about an algorithm, not C++.
You want to sort all tuples in the Cartesian product from heaviest to lightest.
The easiest way is to generate all tuples and sort them by their weight.
If you need sequential access, you should do the following. Since the weight of a tuple is the sum of the weights of its elements, I think greediness is optimal here. Let's move to an arbitrary number of arrays of arbitrary sizes. Create a set of indices. Initially, it contains zeros. The first tuple that it represents is obviously the heaviest. Find one of the indices to increment: choose the index that loses the least weight, i.e. the one with the smallest difference to its next element. Don't forget to keep track of exhausted arrays. When all arrays are exhausted, you're done.
To implement it in C++, you should employ vector<pair<element_t, weight_t>> for input data and set<pair<weight_difference_t, index_t>> as the set of indices. All types are probably integers but I used custom types to show which data should be there. You should also know how pair is compared.
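For sequential access, a related but different technique is a best-first search over index tuples with a max-heap. If the greedy step is read as advancing a single index tuple, it can skip tuples: in the question's example, after AX the cheapest increment leads to BX (16) and BW (17) is never visited. The sketch below keeps every not-yet-emitted successor in a heap instead. It is written in Python for brevity, the name tuples_by_weight is invented here, and each weight list is assumed to be non-empty and sorted in decreasing order.

```python
import heapq


def tuples_by_weight(weights):
    """Yield (total_weight, index_tuple) from heaviest to lightest."""
    start = (0,) * len(weights)
    total = sum(w[0] for w in weights)
    heap = [(-total, start)]  # heapq is a min-heap, so store negated totals
    seen = {start}
    while heap:
        neg_total, idx = heapq.heappop(heap)
        yield -neg_total, idx
        # Successors: advance exactly one index by one position.
        for k in range(len(weights)):
            if idx[k] + 1 < len(weights[k]):
                nxt = idx[:k] + (idx[k] + 1,) + idx[k + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    lost = weights[k][idx[k]] - weights[k][idx[k] + 1]
                    heapq.heappush(heap, (neg_total + lost, nxt))


arr0, arr0_weight = ["A", "B", "C", "D"], [11, 7, 4, 3]
arr1, arr1_weight = ["W", "X", "Y"], [10, 9, 4]
ordered = [(arr0[i] + arr1[j], total)
           for total, (i, j) in tuples_by_weight([arr0_weight, arr1_weight])]
print(ordered[:5])
# [('AW', 21), ('AX', 20), ('BW', 17), ('BX', 16), ('AY', 15)]
```

The generator works for any number of arrays of differing lengths and produces tuples lazily, so the search can stop early without building the full product.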
This question already has answers here:
How to find a duplicate element in an array of shuffled consecutive integers?
This is kind of related to this question, but a little tweaked.
We are given an array containing integers between 1 and 1000.
Every integer from 1 to 1000 is in the array once, except that one integer is missing and another is in the array twice (i.e. I remove a unique element from the list and introduce a duplicate of an element which is already in the list; remember, the size of the array is still 1000).
Determine which integer is in the array twice.
Can you do it while iterating through the array only once?
The link that I have posted is a different question altogether.
My solutions:
1. Sort the array and then check whether two adjacent elements are equal (average case O(n log n)).
2. Create a bit array with 1000 bits (won't take much memory), with 0 stored in each bit. Iterate through the array of 1000 elements and set the bit in the bit array at the index given by the array value,
i.e. if the 0th position of the array stores the value 548, we set the 548th bit in the bit array to 1.
A bit that is already set to 1 when we reach it marks the repeated element.
Solution 2 iterates the array only once.
Now, I was reading about 'telescoping series'; I haven't understood it fully. But is there a concept in there (or in discrete math) where we can just sum something and subtract something else to get the duplicate number?
Calculate the sum of the array; let it be S, and let the repeated element be x. If the array contains every integer from 1 to 1000 plus one extra copy of x, the repeated element is the difference between S and the sum of the array without the repeated element: x = S - (1000*1001)/2.
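Let's say x was replaced by y. The summation method tells us that
y - x = sum_actual - sum_expected
Of course you can't deduce two variables from a single equation; you need another. Calculate the sum of squares:
y^2 - x^2 = sum_squares_actual - sum_squares_expected
Since y^2 - x^2 = (y - x)(y + x), dividing the second equation by the first gives y + x, and together with y - x this determines both x and y. Now recall that the sum of squares 1^2 + ... + n^2 is n*(n+1)*(2*n + 1)/6.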
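The two equations can be solved in a single pass over the array. The sketch below assumes exactly one value x from 1..n was replaced by a copy of another value y; the function name find_replaced_pair is invented for this illustration.

```python
def find_replaced_pair(array, n):
    """Return (x, y): x is the missing value, y the value that appears twice."""
    sum_actual = 0
    sum_squares_actual = 0
    for value in array:  # single pass
        sum_actual += value
        sum_squares_actual += value * value

    sum_expected = n * (n + 1) // 2
    sum_squares_expected = n * (n + 1) * (2 * n + 1) // 6

    diff = sum_actual - sum_expected                         # y - x
    diff_squares = sum_squares_actual - sum_squares_expected  # y^2 - x^2
    total = diff_squares // diff                              # y + x
    y = (total + diff) // 2
    x = (total - diff) // 2
    return x, y


array = list(range(1, 1001))
array[547] = 17  # the value 548 is replaced by a second copy of 17
print(find_replaced_pair(array, 1000))  # (548, 17)
```

Because x and y differ, diff is never zero, so the division is safe under the stated assumption.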
The sum of 1...1000 = 1001 * 500, and is therefore zero modulo 1001. Thus, if the array holds every integer from 1 to 1000 plus one repeated element, finding the sum of the array modulo 1001 will give you the repeated element.
result = 0
for x in A:
    result = (result + x) % 1001
1000 is not that large. In addition to what other people have said, you could use a count array. For each number x you update the count count[x] = count[x] + 1 and check if this number is equal to 2.
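A minimal sketch of the count-array idea, stopping as soon as some count reaches 2. The function name find_duplicate is invented here, and the values are assumed to lie in 1..n.

```python
def find_duplicate(array, n):
    """Return the first value seen twice, or None if all values are distinct."""
    count = [0] * (n + 1)  # index 0 is unused because values start at 1
    for x in array:
        count[x] += 1
        if count[x] == 2:
            return x
    return None


array = list(range(1, 1001))
array[547] = 17  # the value 548 is replaced by a second copy of 17
print(find_duplicate(array, 1000))  # 17
```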
Given two sorted lists, each containing n real numbers, is there an O(log n) time algorithm to compute the element of rank i (where i corresponds to the index in increasing order) in the union of the two lists, assuming the elements of the two lists are distinct?
EDIT:
@Ben: This is what I have been doing, but I am still not getting it.
I have an example:
List A : 1, 3, 5, 7
List B : 2, 4, 6, 8
Find rank(i) = 4.
First Step : i/2 = 2;
List A now contains A: 1, 3
List B now contains B: 2, 4
Compare A[i] to B[i], i.e.
A[i] is less;
So the lists now become :
A: 3
B: 2,4
Second Step:
i/2 = 1
List A now contains A:3
List B now contains B:2
Now I have lost the value 4, which is actually the result...
I know I am missing something, but even after close to a day of thinking I just can't figure this one out...
Yes:
You know the element lies within either index [0,i] of the first list or [0,i] of the second list. Take element i/2 from each list and compare. Proceed by bisection.
I'm not including any code because this problem sounds a lot like homework.
EDIT: Bisection is the method behind binary search. It works like this:
Assume i = 10; (zero-based indexing, we're looking for the 11th element overall).
On the first step, you know the answer is either in list1(0...10) or list2(0...10). Take a = list1(5) and b = list2(5).
If a > b, then there are 5 elements in list1 which come before a, and at least 6 elements in list2 which come before a. So a is an upper bound on the result. Likewise there are 5 elements in list2 which come before b and fewer than 6 elements in list1 which come before b. So b is a lower bound on the result. Now we know that the result is either in list1(0..5) or list2(5..10). If a < b, then the result is either in list1(5..10) or list2(0..5). And if a == b we have our answer (but the problem said the elements were distinct, therefore a != b).
We just repeat this process, cutting the size of the search space in half at each step. Bisection refers to the fact that we choose the middle element (bisector) out of the range we know includes the result.
So the only difference between this and binary search is that in binary search we compare to a value we're looking for, but here we compare to a value from the other list.
NOTE: this is actually O(log i), which is better than (or at least no worse than) O(log n). Furthermore, for small i (perhaps i < 100), it would actually be fewer operations to merge the first i elements (linear search instead of bisection) because that is so much simpler. When you add in cache behavior and data locality, the linear search may well be faster for i up to several thousand.
Also, if i > n, then rely on the fact that the result has to be toward the end of either list: your initial candidate range in each list is ((i-n)..n).
Here is how you do it.
Let the first list be ListX and the second list be ListY. We need to find the right combination of ListX[x] and ListY[y] where x + y = i. Since x, y, i are natural numbers we can immediately constrain our problem domain to x*y. And by using the equations max(x) = len(ListX) and max(y) = len(ListY) we now have a subset of x*y elements in the form [x, y] that we need to search.
What you will do is order those elements like so: [i - max(y), max(y)], [i - max(y) + 1, max(y) - 1], ... , [max(x), i - max(x)]. You will then bisect this list by choosing the middle [x, y] combination. Since the lists are ordered and distinct you can test ListX[x] < ListY[y]. If true, then we bisect the upper half of our [x, y] combinations; if false, then we bisect the lower half. You will keep bisecting until you find the right combination.
There are a lot of details I left out, but that is the general gist of it. It is indeed O(log(n))!
Edit: As Ben pointed out, this is actually O(log(i)). If we let n = len(ListX) + len(ListY) then we know that i <= n.
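One way to fill in the details is to binary-search on x, the number of elements taken from the first list, with y determined by x. This is a sketch rather than the answerers' exact procedure: the name kth_of_union is invented here, the rank i is zero-based, and both lists are assumed to be sorted in increasing order.

```python
def kth_of_union(list_x, list_y, i):
    """Return the element of zero-based rank i in the merge of two sorted lists."""
    if not 0 <= i < len(list_x) + len(list_y):
        raise IndexError("rank out of range")
    k = i + 1  # the answer is the largest of the k smallest elements
    # x = how many of those k elements come from list_x; y = k - x from list_y
    lo = max(0, k - len(list_y))
    hi = min(k, len(list_x))
    while lo < hi:
        x = (lo + hi) // 2
        y = k - x
        if list_x[x] < list_y[y - 1]:
            lo = x + 1  # list_x[x] belongs to the k smallest: take more from list_x
        else:
            hi = x      # taking x elements from list_x is enough
    x, y = lo, k - lo
    candidates = []
    if x > 0:
        candidates.append(list_x[x - 1])
    if y > 0:
        candidates.append(list_y[y - 1])
    return max(candidates)


A = [1, 3, 5, 7]
B = [2, 4, 6, 8]
print(kth_of_union(A, B, 3))  # 4, the 4th smallest element overall
```

The search range has at most min(i + 1, len(list_x)) + 1 candidates, so the loop runs O(log i) times, and no elements are discarded from either list along the way.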
When merging two lists, you're going to have to touch every element in both lists. If you don't touch every element, some elements will be left behind. Thus your theoretical lower bound is O(n). So you can't do it that way.
You don't have to sort, since you have two lists that are already sorted, and you can maintain that ordering as part of the merge.
edit: oops, I misread the question. I thought given value, you want to find rank, not the other way around. If you want to find rank given value, then this is how to do it in O(log N):
Yes, you can do this in O(log N), if the list allows O(1) random access (i.e. it's an array and not a linked list).
Binary search on L1
Binary search on L2
Sum the indices
You'd have to work out the math, +1, -1, what to do if element isn't found, etc, but that's the idea.
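For this reverse direction (value to rank), a minimal sketch with the standard bisect module could look like this. The name rank_in_union is invented here, and the rank is defined as the number of elements strictly smaller than the value, which is one of several possible conventions.

```python
from bisect import bisect_left


def rank_in_union(l1, l2, value):
    """Zero-based rank of value: how many elements in both lists are smaller."""
    # Each bisect_left is one binary search; the two indices are then summed.
    return bisect_left(l1, value) + bisect_left(l2, value)


print(rank_in_union([1, 3, 5, 7], [2, 4, 6, 8], 4))  # 3
```

With this convention a value that is absent from both lists gets the rank it would have if it were inserted.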