thrust::exclusive_scan_by_key unexpected behavior - c++

int data[ 10 ] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
int keys[ 10 ] = { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
thrust::exclusive_scan_by_key( keys, keys + 10, data, data );
Based on the examples on the Thrust site, I expected 0,0,1,1,2,2,3,3,4,4 but got 0,0,0,0,0,0,0,0,0,0 instead. Is this a bug, or is there something somewhere that defines this behavior?
More importantly, assuming this is not a bug, is there a way to achieve this effect easily?

I don't think you understand what scan_by_key does. From the documentation:
"Specifically, consecutive iterators i and i+1 in the range [first1, last1) belong to the same segment if binary_pred(*i, *(i+1)) is true, and belong to different segments otherwise"
scan_by_key requires that your key array mark distinct segments using contiguous values:
keys: 0 0 0 1 1 1 0 0 0 1 1 1
seg#: 0 0 0 1 1 1 2 2 2 3 3 3
thrust compares adjacent keys to determine segments.
Your keys are producing a segment map like this:
int keys[ 10 ] = { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
seg#: 0 1 2 3 4 5 6 7 8 9
Since you are doing an exclusive scan, the correct answer to such a segment map (regardless of the data) would be all zeroes.
It's not entirely clear what "this effect" is that you want to achieve, but you may want to do back-to-back stable sort by key operations, reversing the sense of keys and values, to rearrange this data to group the segments (i.e. keys 1 and 2) together.
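As a rough illustration of the explanation above, here is a pure-Python model of a segmented exclusive scan. It is not Thrust code: the function name `exclusive_scan_by_key` is reused only for readability, and the "group, scan, scatter back" step is an assumption about the effect the asker wants (0,0,1,1,2,2,3,3,4,4).

```python
def exclusive_scan_by_key(keys, values):
    """Pure-Python model: restart a running sum whenever the key changes."""
    out = []
    running = 0
    for i, (k, v) in enumerate(zip(keys, values)):
        if i == 0 or k != keys[i - 1]:
            running = 0          # adjacent keys differ: a new segment starts here
        out.append(running)      # exclusive: emit the sum before adding v
        running += v
    return out

keys = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
data = [1] * 10

# Interleaved keys: every element is its own segment, so the result is all zeros.
interleaved = exclusive_scan_by_key(keys, data)

# Group equal keys first (Python's sort is stable), scan, then scatter back.
order = sorted(range(len(keys)), key=lambda i: keys[i])
scanned = exclusive_scan_by_key([keys[i] for i in order],
                                [data[i] for i in order])
grouped = [0] * len(keys)
for pos, i in enumerate(order):
    grouped[i] = scanned[pos]

print(interleaved)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(grouped)      # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

In Thrust the same idea would presumably be a stable sort by key, the scan, and a second sort (or scatter) to restore the original order.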


How to print the max sequence of a given vector (possible values 1 and 0) in which adjacent elements differ, and the number of such max sequences - c++

Let's say I have a vector v with random 1 and 0.
std::vector<int> v = {1,0,1,0,0,1,0,1};
I want to find the max sequence with the property v[i] != v[i-1]; basically, adjacent numbers need to be different. In this example the max sequence has length 4: (1, 0, 1, 0) from position v[0] to v[3]. There is also (0, 1, 0, 1) from position v[4] to v[7]. There are 2 max sequences, so the final output should look like this:
4 2
where 4 is the length of the max sequence and 2 is the number of max sequences.
Let's take another example:
std::vector<int> v2 = {1,0,1,1,1,0,1,0,1,0};
The output here should be:
6 1
The max sequence runs from v[4] to v[9]. There is only one max sequence, so it prints 1 this time.
I tried to solve this using a for loop:
n - number of integers in the vector
k - length of the current sequence of differing integers
maxk - length of the max sequence
many - how many max sequences there are
for(int i{1}; i < n; i++) {
    if(v[i] != v[i-1]) {
        k++;
        if(k > maxk) {
            maxk = k;
        }
    }
    else {
        if(k == maxk) {
            many++;
        }
        else {
            many = 1;
        }
        k = 1;
    }
}
But if you give it a vector like {1, 0, 0} it does not work. Can someone give me a tip on how to solve this problem? Sorry for my bad English.
First, sequence isn't the right word. A sequence can jump past elements. You mean a subarray.
Second, you talk about arrays with 0 and 1 in them, then give an example with 2. Do you want to count subarrays containing 2 or not? In other words, if the input is [1, 2, 2], are you expecting an answer of 1 1 or 2 1?
That said, just make an array of where the best current subarray begins. For your first example that array would look like this:
1, 0, 1, 0, 0, 1, 0, 1
0, 0, 0, 0, 4, 4, 4, 4
And then a linear scan finds that you have a group of 4 starting at index 0, and another group of 4 starting at index 4.
For your next example,
1, 0, 1, 1, 1, 0, 1, 0, 1, 0
0, 0, 0, 3, 4, 4, 4, 4, 4, 4
And you have a group of 3 starting at index 0, 1 starting at 3, and 6 starting at 4. So we've found the 1 group of 6.
For your last example, what you'd get would depend on the answer you want.
I'll leave coding this to you.
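For readers who want to see the idea spelled out, here is a minimal Python sketch of the "array of run starts" approach described above. The function name `longest_alternating_runs` is made up for illustration, and the sketch assumes that any two unequal neighbours count as "different" (so [1, 2, 2] gives 2 1).

```python
def longest_alternating_runs(v):
    """Return (length of the longest run with v[i] != v[i-1], number of such runs)."""
    if not v:
        return 0, 0
    # start[i] = index where the alternating run containing position i begins
    start = [0] * len(v)
    for i in range(1, len(v)):
        start[i] = start[i - 1] if v[i] != v[i - 1] else i
    best, count = 0, 0
    for i in range(len(v)):
        # a run ends at i if i is the last element or the next element starts a new run
        if i == len(v) - 1 or start[i + 1] != start[i]:
            length = i - start[i] + 1
            if length > best:
                best, count = length, 1
            elif length == best:
                count += 1
    return best, count

print(longest_alternating_runs([1, 0, 1, 0, 0, 1, 0, 1]))        # (4, 2)
print(longest_alternating_runs([1, 0, 1, 1, 1, 0, 1, 0, 1, 0]))  # (6, 1)
print(longest_alternating_runs([1, 0, 0]))                       # (2, 1)
```

Both passes are linear, and the start array could be replaced by a single running variable if memory matters.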

How to find smallest connected label in equivalency list

I have a list of numbers stored in a standard vector. Some of the numbers are children of other numbers. Here is an example
3, 4
3, 5
5, 6
7, 3
8, 9
8, 1
8, 2
9, 8
Or as a graph:
1 2 3-4 5-6 7 8-9
|-------------|
  |-----------|
    |---|
    |-------|
That is, there are two clusters: 3,4,5,6,7 and 1,2,8,9. The root number is the smallest number of a cluster, here 3 and 1. I would like to know which algorithms I can use to extract a list like this:
3, 4
3, 5
3, 6
3, 7
1, 2
1, 8
1, 9
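An algorithm similar to disjoint-set union (union-find) can help you:
Initialize N disjoint subsets, each holding exactly one number, so that the root of number i, r(i), is i.
For each edge (u, v), first follow r from u and from v until you reach numbers that are their own root (call them ru and rv), then assign:
t = min(ru, rv)
r(ru) = t
r(rv) = t
Finally, for each i whose root (found by following r in the same way) differs from i, write out the pair [root, i].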
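The steps above can be sketched in Python as follows. This is an illustrative union-find, not code from the answer: the names `smallest_root_pairs` and `find` are invented here, and roots are resolved by following links until a number is its own root, so that a later edge cannot leave an earlier member pointing at a stale root.

```python
def smallest_root_pairs(edges):
    """Return sorted (root, member) pairs, where root is the smallest number in the cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)             # unseen numbers start as their own root
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps chains short
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        t = min(ru, rv)                     # the smaller root wins
        parent[ru] = t
        parent[rv] = t

    return sorted((find(i), i) for i in list(parent) if find(i) != i)

edges = [(3, 4), (3, 5), (5, 6), (7, 3), (8, 9), (8, 1), (8, 2), (9, 8)]
for root, member in smallest_root_pairs(edges):
    print(root, member)
```

Because the smaller of the two roots always becomes the new root, the root of each set is its minimum element at every step.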

Combinations in ROOT

What does the Combinations function do in ROOT/C++?
I only found this documentation
https://root.cern.ch/doc/master/namespaceROOT_1_1VecOps.html#a6d1d00c2ccb769cc48c6813dbeb132db
But I am still not sure what it does exactly.
Can someone provide an example showing how the answers in the documentation examples are computed?
Here is an example of what Combinations is doing:
Suppose you have a vector v{1., 2., 3., 4.}
1, 2, 3, and 4 are the elements of the vector v
and 0, 1, 2, 3 are the indices of those elements.
If we write
Combinations(v, 2)
we get
{{ 0, 0, 0, 1, 1, 2} , { 1, 2, 3, 2, 3, 3}}.
That comes from looking at the different two-element combinations of the vector elements, which are:
1, 2
1, 3
1, 4
2, 3
2, 4
3, 4
which have the corresponding indices
0 1
0 2
0 3
1 2
1 3
2 3
Then, the left-side column makes the first vector in the answer and the right side column makes the second vector shown in the answer.
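To make the column-splitting step concrete, here is a small Python sketch that reproduces the index vectors described above with `itertools.combinations`. It is only a model of the result, not ROOT code, and the helper name `combination_indices` is invented for this example.

```python
from itertools import combinations

def combination_indices(n, k):
    """Index columns for all k-element combinations of n positions."""
    rows = list(combinations(range(n), k))   # (0, 1), (0, 2), (0, 3), (1, 2), ...
    # column j collects the j-th index of every combination
    return [[row[j] for row in rows] for j in range(k)]

v = [1.0, 2.0, 3.0, 4.0]
print(combination_indices(len(v), 2))
# [[0, 0, 0, 1, 1, 2], [1, 2, 3, 2, 3, 3]]
```

Indexing v with the first and second list respectively gives the left and right element of each pair.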

Row-wise Element Indexing in PyTorch for C++

I am using the C++ frontend for PyTorch and am struggling with a relatively basic indexing problem.
I have an 8 by 6 Tensor such as the one below:
[ Variable[CUDAFloatType]{8,6} ]
0 1 2 3 4 5
0 1.7107e-14 4.0448e-17 4.9708e-06 1.1664e-08 9.9999e-01 2.1857e-20
1 1.8288e-14 5.9356e-17 5.3042e-06 1.2369e-08 9.9999e-01 2.4799e-20
2 2.6828e-04 9.0390e-18 1.7517e-02 1.0529e-03 9.8116e-01 6.7854e-26
3 5.7521e-10 3.1037e-11 1.5021e-03 1.2304e-06 9.9850e-01 1.4888e-17
4 1.7811e-13 1.8383e-15 1.6733e-05 3.8466e-08 9.9998e-01 5.2815e-20
5 9.6191e-06 2.6217e-23 3.1345e-02 2.3024e-04 9.6842e-01 2.9435e-34
6 2.2653e-04 8.4642e-18 1.6085e-02 9.7405e-04 9.8271e-01 6.3059e-26
7 3.8951e-14 2.9903e-16 8.3518e-06 1.7974e-08 9.9999e-01 3.6993e-20
I have another Tensor with just 8 elements in it such as:
[ Variable[CUDALongType]{8} ]
0
3
4
4
4
4
4
4
I would like to index the rows of my first tensor using the second to produce:
0
0 1.7107e-14
1 1.2369e-08
2 9.8116e-01
3 9.9850e-01
4 9.9998e-01
5 9.6842e-01
6 9.8271e-01
7 9.9999e-01
I have tried a few different approaches including index_select but it seems to produce an output that has the same dimensions as the input (8x6).
In Python I think I could index with Python's built-in indexing as discussed here: https://github.com/pytorch/pytorch/issues/1080
Unfortunately, in C++ I can only index a Tensor with a scalar (zero-dimensional Tensor) so I don't think that approach works for me here.
How can I achieve my desired result without resorting to loops?
It turns out you can do this in a couple of different ways: one with gather and one with index. From the PyTorch discussions, where I asked the same question:
Using torch::gather
auto x = torch::randn({8, 6});
int64_t idx_data[8] = { 0, 3, 4, 4, 4, 4, 4, 4 };
auto idx = x.type().toScalarType(torch::kLong).tensorFromBlob(idx_data, 8);
auto result = x.gather(1, idx.unsqueeze(1));
Using the C++ specific torch::index
auto x = torch::randn({8, 6});
int64_t idx_data[8] = { 0, 3, 4, 4, 4, 4, 4, 4 };
auto idx = x.type().toScalarType(torch::kLong).tensorFromBlob(idx_data, 8);
auto rows = torch::arange(0, x.size(0), torch::kLong);
auto result = x.index({rows, idx});
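For intuition about what the gather call computes, here is a pure-Python model of gathering along dimension 1, where out[i][j] = x[i][idx[i][j]]. It uses nested lists rather than tensors, and `gather_dim1` is a made-up helper, not a PyTorch function.

```python
def gather_dim1(x, idx):
    """Pure-Python model of gather along dim 1: out[i][j] = x[i][idx[i][j]]."""
    return [[row[j] for j in cols] for row, cols in zip(x, idx)]

x = [[10, 11, 12],
     [20, 21, 22],
     [30, 31, 32]]
idx = [0, 2, 1]                               # one column index per row

# Wrapping each index in its own list mirrors idx.unsqueeze(1): shape (3,) -> (3, 1)
picked = gather_dim1(x, [[j] for j in idx])
print(picked)  # [[10], [22], [31]]
```

The result has one column per requested index, which is why the index tensor needs the extra dimension before the call.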

How to determine if two partitions (clusterings) of data points are identical?

I have n data points in some arbitrary space and I cluster them.
The result of my clustering algorithm is a partition represented by an int vector l of length n assigning each point to a cluster. The values of l range from 0 to (possibly) n-1.
Example:
l_1 = [ 1 1 1 0 0 2 6 ]
is a partition of n=7 points into 4 clusters: the first three points are clustered together, the fourth and fifth are together, and the last two points form two distinct singleton clusters.
My question:
Suppose I have two partitions l_1 and l_2. How can I efficiently determine whether they represent identical partitions?
Example:
l_2 = [ 2 2 2 9 9 3 1 ]
is identical to l_1 since it represents the same partitions of the points (despite the fact that the "numbers"/"labels" of the clusters are not identical).
On the other hand
l_3 = [ 2 2 2 9 9 3 3 ]
is no longer identical since it groups together the last two points.
I'm looking for a solution in either C++, python or Matlab.
Unwanted direction
A naive approach would be to compare the co-occurrence matrix
c1 = bsxfun( @eq, l_1, l_1' );
c2 = bsxfun( @eq, l_2, l_2' );
l_1_l_2_are_identical = all( c1(:)==c2(:) );
The co-occurrence matrix c1 is of size nxn with true if points k and m are in the same cluster and false otherwise (regardless of the cluster "number"/"label").
Therefore if the co-occurrence matrices c1 and c2 are identical then l_1 and l_2 represent identical partitions.
However, since the number of points, n, might be quite large I would like to avoid O(n^2) solutions...
Any ideas?
Thanks!
When are two partitions identical?
Probably when they have exactly the same members.
So if you just want to test for identity, you can do the following:
Substitute each partition ID with the smallest object ID in the partition.
Then two partitionings are identical if and only if this representation is identical.
In your example above, let's assume the vector indices 1 .. 7 are your object IDs. Then I would get the canonical form
[ 1 1 1 4 4 6 7 ]
  ^ first occurrence at pos 1 of 1 in l_1 / 2 in l_2
        ^ first occurrence at pos 4
for l_1 and l_2, whereas l_3 canonicalizes to
[ 1 1 1 4 4 6 6 ]
To make it more clear, here is another example:
l_4 = [ A B 0 D 0 B A ]
canonicalizes to
[ 1 2 3 4 3 2 1 ]
since the first occurrence of cluster "A" is at position 1, "B" at position 2, etc.
If you want to measure how similar two clusterings are, a good approach is to look at precision/recall/f1 of the object pairs, where the pair (a,b) exists if and only if a and b belong to the same cluster.
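A small sketch of that pair-based similarity measure, under the assumption that the first clustering is treated as the reference. The function name `pair_f1` is invented here, and the enumeration of pairs is deliberately naive (quadratic), so it is meant for measuring similarity on moderate n, not for the identity test.

```python
from itertools import combinations

def pair_f1(a, b):
    """Pair-counting precision, recall and F1 of clustering b against reference a."""
    n = len(a)
    # a pair (i, j) "exists" in a clustering iff i and j share a cluster there
    same_a = {(i, j) for i, j in combinations(range(n), 2) if a[i] == a[j]}
    same_b = {(i, j) for i, j in combinations(range(n), 2) if b[i] == b[j]}
    hits = len(same_a & same_b)
    precision = hits / len(same_b) if same_b else 1.0
    recall = hits / len(same_a) if same_a else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

l_1 = [1, 1, 1, 0, 0, 2, 6]
print(pair_f1(l_1, [2, 2, 2, 9, 9, 3, 1]))  # identical partitions: (1.0, 1.0, 1.0)
print(pair_f1(l_1, [2, 2, 2, 9, 9, 3, 3]))  # one extra pair: precision 0.8, recall 1.0
```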
Update: Since it was claimed that this is quadratic, I will further clarify.
To produce the canonical form, use the following approach (actual python code):
def canonical_form(li):
    """ Note, this implementation overwrites li """
    first = dict()  # label -> index of its first occurrence
    for i in range(len(li)):
        v = first.get(li[i])
        if v is None:
            first[li[i]] = i
            v = i
        li[i] = v
    return li

print(canonical_form([ 1, 1, 1, 0, 0, 2, 6 ]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([ 2, 2, 2, 9, 9, 3, 1 ]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([ 2, 2, 2, 9, 9, 3, 3 ]))
# [0, 0, 0, 3, 3, 5, 5]
print(canonical_form(['A','B',0,'D',0,'B','A']))
# [0, 1, 2, 3, 2, 1, 0]
print(canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,1]))
# True
print(canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,3]))
# False
If you relabel your partitions, as has been previously suggested, and look each label up by searching a plain list, you will potentially need to search through n labels for each of the n items, i.e. that approach is O(n^2). A hash-based lookup avoids this.
Here is my idea: Scan through both lists simultaneously, maintaining a counter for each partition label in each list.
You will need to be able to map partition labels to counter numbers.
If the counters for each list do not match, then the partitions do not match.
This would be O(n).
Here is a proof of concept in Python:
l_1 = [ 1, 1, 1, 0, 0, 2, 6 ]
l_2 = [ 2, 2, 2, 9, 9, 3, 1 ]
l_3 = [ 2, 2, 2, 9, 9, 3, 3 ]
d1 = dict()  # label in l_1 -> counter index
d2 = dict()  # label in l_2 -> counter index
c1 = []      # per-label counters for l_1
c2 = []      # per-label counters for l_2
# assume lists same length
match = True
for i in range(len(l_1)):
    if l_1[i] not in d1:
        x1 = len(c1)
        d1[l_1[i]] = x1
        c1.append(1)
    else:
        x1 = d1[l_1[i]]
        c1[x1] += 1
    if l_2[i] not in d2:
        x2 = len(c2)
        d2[l_2[i]] = x2
        c2.append(1)
    else:
        x2 = d2[l_2[i]]
        c2[x2] += 1
    # the same counter must be hit, with the same count, in both lists
    if x1 != x2 or c1[x1] != c2[x2]:
        match = False
print("match = {}".format(match))
In MATLAB:
function tf = isIdenticalClust( l_1, l_2 )
%
% checks if partitions l_1 and l_2 are identical or not
%
tf = all( accumarray( {l_1} , l_2 , [], @(x) all( x == x(1) ) ) == 1 ) &&...
     all( accumarray( {l_2} , l_1 , [], @(x) all( x == x(1) ) ) == 1 );
What this does:
It groups all elements of l_1 according to the partition of l_2 and checks whether the elements of l_1 in each cluster are all identical, then repeats the same for l_2 grouped according to l_1.
If both groupings yield homogeneous clusters, the partitions are identical.
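The same two-way homogeneity check can be sketched in Python with two dictionaries. This is an illustrative analogue rather than a translation of the accumarray call, and the name `is_identical_clustering` is hypothetical.

```python
def is_identical_clustering(l_1, l_2):
    """True if l_1 and l_2 induce the same partition (labels may differ)."""
    if len(l_1) != len(l_2):
        return False
    fwd, back = {}, {}   # l_1 label -> l_2 label, and the reverse mapping
    for a, b in zip(l_1, l_2):
        # setdefault stores the first partner seen and returns the stored one
        if fwd.setdefault(a, b) != b or back.setdefault(b, a) != a:
            return False
    return True

l_1 = [1, 1, 1, 0, 0, 2, 6]
print(is_identical_clustering(l_1, [2, 2, 2, 9, 9, 3, 1]))  # True
print(is_identical_clustering(l_1, [2, 2, 2, 9, 9, 3, 3]))  # False
```

Each element is visited once and dictionary lookups are expected constant time, so the check runs in expected O(n).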