How to determine if two partitions (clusterings) of data points are identical?


I have n data points in some arbitrary space and I cluster them.
The result of my clustering algorithm is a partition represented by an int vector l of length n assigning each point to a cluster. Values of l range from 0 to (possibly) n-1.
Example:
l_1 = [ 1 1 1 0 0 2 6 ]
This is a partition of n=7 points into 4 clusters: the first three points are clustered together, the fourth and fifth are together, and the last two points form two distinct singleton clusters.
My question:
Suppose I have two partitions l_1 and l_2. How can I efficiently determine whether they represent identical partitions?
Example:
l_2 = [ 2 2 2 9 9 3 1 ]
is identical to l_1 since it represents the same partition of the points (despite the fact that the "numbers"/"labels" of the clusters are not identical).
On the other hand
l_3 = [ 2 2 2 9 9 3 3 ]
is no longer identical since it groups together the last two points.
I'm looking for a solution in either C++, python or Matlab.
Unwanted direction
A naive approach would be to compare the co-occurrence matrix
c1 = bsxfun( @eq, l_1, l_1' );
c2 = bsxfun( @eq, l_2, l_2' );
l_1_l_2_are_identical = all( c1(:)==c2(:) );
The co-occurrence matrix c1 is of size nxn with true if points k and m are in the same cluster and false otherwise (regardless of the cluster "number"/"label").
Therefore if the co-occurrence matrices c1 and c2 are identical then l_1 and l_2 represent identical partitions.
However, since the number of points, n, might be quite large I would like to avoid O(n^2) solutions...
Any ideas?
Thanks!

When are two partitions identical?
Probably if they have the exact same members.
So if you just want to test for identity, you can do the following:
Substitute each partition ID with the smallest object ID in the partition.
Then two partitionings are identical if and only if this representation is identical.
In your example above, let's assume the vector index 1 .. 7 is your object ID. Then I would get the canonical form
[ 1 1 1 4 4 6 7 ]
  ^ first occurrence at pos 1 of 1 in l_1 / 2 in l_2
        ^ first occurrence at pos 4
for l_1 and l_2, whereas l_3 canonicalizes to
[ 1 1 1 4 4 6 6 ]
To make it more clear, here is another example:
l_4 = [ A B 0 D 0 B A ]
canonicalizes to
[ 1 2 3 4 3 2 1 ]
since the first occurrence of cluster "A" is at position 1, "B" at position 2 etc.
If you want to measure how similar two clusterings are, a good approach is to look at precision/recall/f1 of the object pairs, where the pair (a,b) exists if and only if a and b belong to the same cluster.
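The pair-based similarity mentioned above can be sketched without enumerating all O(n^2) pairs, by counting pairs per cluster from label frequencies. This is only an illustrative sketch: the function names are made up for this example, the choice of which partition is the "reference" and which is the "predicted" one is an assumption, and so is the convention of returning 1.0 when a partition has no same-cluster pairs at all.

```python
from collections import Counter


def pairs_within(counts):
    """Number of unordered pairs that can be formed inside groups of the given sizes."""
    return sum(c * (c - 1) // 2 for c in counts)


def pair_precision_recall_f1(reference, predicted):
    """Pairwise precision/recall/F1 of `predicted` against `reference`.

    A pair (a, b) "exists" in a partition iff a and b share a cluster there.
    """
    # Pairs that are together in BOTH partitions: count points per joint label.
    same_both = pairs_within(Counter(zip(reference, predicted)).values())
    same_ref = pairs_within(Counter(reference).values())
    same_pred = pairs_within(Counter(predicted).values())
    # Assumed convention: no pairs at all counts as a perfect score.
    precision = same_both / same_pred if same_pred else 1.0
    recall = same_both / same_ref if same_ref else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


l_1 = [1, 1, 1, 0, 0, 2, 6]
l_2 = [2, 2, 2, 9, 9, 3, 1]
l_3 = [2, 2, 2, 9, 9, 3, 3]
print(pair_precision_recall_f1(l_1, l_2))
print(pair_precision_recall_f1(l_1, l_3))
```

Identical partitions score 1.0 on all three measures; l_3 loses precision against l_1 because it adds one pair (the last two points) that l_1 does not have.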
Update: Since it was claimed that this is quadratic, I will further clarify.
To produce the canonical form, use the following approach (actual python code):
def canonical_form(li):
    """ Note, this implementation overwrites li """
    first = dict()  # cluster label -> index of its first occurrence
    for i in range(len(li)):
        v = first.get(li[i])
        if v is None:
            first[li[i]] = i
            v = i
        li[i] = v
    return li

print(canonical_form([ 1, 1, 1, 0, 0, 2, 6 ]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([ 2, 2, 2, 9, 9, 3, 1 ]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([ 2, 2, 2, 9, 9, 3, 3 ]))
# [0, 0, 0, 3, 3, 5, 5]
print(canonical_form(['A','B',0,'D',0,'B','A']))
# [0, 1, 2, 3, 2, 1, 0]
print(canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,1]))
# True
print(canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,3]))
# False

If you are going to relabel your partitions, as has been previously suggested, you will potentially need to search through n labels for each of the n items. I.e. the solutions are O(n^2).
Here is my idea: Scan through both lists simultaneously, maintaining a counter for each partition label in each list.
You will need to be able to map partition labels to counter numbers.
If the counters for each list do not match, then the partitions do not match.
This would be O(n).
Here is a proof of concept in Python:
l_1 = [ 1, 1, 1, 0, 0, 2, 6 ]
l_2 = [ 2, 2, 2, 9, 9, 3, 1 ]
l_3 = [ 2, 2, 2, 9, 9, 3, 3 ]
d1 = dict()  # label in l_1 -> counter number
d2 = dict()  # label in l_2 -> counter number
c1 = []      # counters for l_1
c2 = []      # counters for l_2
# assume lists same length
match = True
for i in range(len(l_1)):
    if l_1[i] not in d1:
        x1 = len(c1)
        d1[l_1[i]] = x1
        c1.append(1)
    else:
        x1 = d1[l_1[i]]
        c1[x1] += 1
    if l_2[i] not in d2:
        x2 = len(c2)
        d2[l_2[i]] = x2
        c2.append(1)
    else:
        x2 = d2[l_2[i]]
        c2[x2] += 1
    if x1 != x2 or c1[x1] != c2[x2]:
        match = False
print("match = {}".format(match))

In matlab:
function tf = isIdenticalClust( l_1, l_2 )
%
% checks if partitions l_1 and l_2 are identical or not
%
tf = all( accumarray( {l_1} , l_2 , [], @(x) all( x == x(1) ) ) == 1 ) && ...
     all( accumarray( {l_2} , l_1 , [], @(x) all( x == x(1) ) ) == 1 );
What this does:
It groups all elements of one labeling according to the partition given by the other and checks that the labels within each cluster are all identical, then repeats the same check with the roles of l_1 and l_2 swapped.
If both groupings yield homogeneous clusters, the partitions are identical.
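The same two-way homogeneity check can be sketched in Python with two dictionaries. This is not a translation of the accumarray call, just an illustration of the idea under the assumption that labels are hashable; the name same_partition is made up for this example.

```python
def same_partition(l_1, l_2):
    """Return True iff l_1 and l_2 describe the same partition of the points.

    Each label of l_1 must always co-occur with one fixed label of l_2,
    and vice versa (a one-to-one relabeling), checked in a single O(n) pass.
    """
    if len(l_1) != len(l_2):
        return False
    forward = {}   # label in l_1 -> the label it was first paired with in l_2
    backward = {}  # label in l_2 -> the label it was first paired with in l_1
    for a, b in zip(l_1, l_2):
        # setdefault stores the first pairing and returns the stored one afterwards
        if forward.setdefault(a, b) != b:
            return False  # a cluster of l_1 is split across clusters of l_2
        if backward.setdefault(b, a) != a:
            return False  # a cluster of l_2 is split across clusters of l_1
    return True


print(same_partition([1, 1, 1, 0, 0, 2, 6], [2, 2, 2, 9, 9, 3, 1]))
print(same_partition([1, 1, 1, 0, 0, 2, 6], [2, 2, 2, 9, 9, 3, 3]))
```

Unlike the canonical-form approach, this does not modify its inputs and can stop at the first mismatch.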

Related

Minimum number of iterations

We are given an array of numbers ranging from 1 to n (no duplicates), where n is the size of the array.
We are allowed to do the following operation :
arr[i] = arr[arr[i]-1] , 0 <= i < n
Now, one iteration is considered when we perform above operation on the entire array.
Our task is to find the number of iterations after which we encounter a previously seen sequence.
Constraints :
a) Array has no duplicates
b) 1 <= arr[i] <= n , 0 <= i < n
c) 1 <= n <= 10^6
Ex 1:
n = 5
arr[] = {5, 4, 2, 1, 3}
After 1st iteration array becomes : {3, 1, 4, 5, 2}
After 2nd iteration array becomes : {4, 3, 5, 2, 1}
After 3rd iteration array becomes : {2, 5, 1, 3, 4}
After 4th iteration array becomes : {5, 4, 2, 1, 3}
In the 4th iteration, the sequence obtained is already seen before
So the expected output is 4.
This question was asked in a job hiring test, so I don't have a link to the question.
There were 2 sample test cases given, of which I remember the one given above. I would really appreciate any help on this question.
P.S.
I was able to code the brute-force solution, wherein I stored all the results in a Set and kept advancing to the next permutation. But it gave TLE.

First, note that an array of length n containing 1, 2, ..., n with no duplicates is a permutation.
Next, observe that arr[i] := arr[arr[i] - 1] is squaring the permutation.
That is, consider permutations as elements of the symmetric group S_n, where multiplication is composition of permutations.
Then the above operation is arr := arr * arr.
So, in terms of permutations and their composition, the question is as follows:
You are given a permutation p (= arr).
Consider permutations p, p^2, p^4, p^8, p^16, ...
What is the number of distinct elements among them?
Now, to solve it, consider the cycle notation of the permutation.
Every permutation is a product of disjoint cycles.
For example, 6 1 4 3 5 2 is the product of the following cycles: (1 6 2) (3 4) (5).
In other words, every application of this permutation:
moves elements at positions 1, 6, 2 along the cycle;
moves elements at positions 4, 3 along the cycle;
leaves element at position 5 in place.
So, when we consider p^k (take an identity permutation and apply the permutation p to it k times), we actually process three independent actions:
move elements at positions 1, 6, 2 along the cycle, k times;
move elements at positions 4, 3 along the cycle, k times;
leave element at position 5 in place, k times.
Now, take into account that, after d applications of a cycle of length d, it just returns all the respective elements to their initial places.
So, we can actually formulate p^k as:
move elements at positions 1, 6, 2 along the cycle, (k mod 3) times;
move elements at positions 4, 3 along the cycle, (k mod 2) times;
leave element at position 5 in place.
We can now prove (using Chinese Remainder Theorem, or just using general knowledge of group theory) that the permutations p, p^2, p^3, p^4, p^5, ... are all distinct up to p^m, where m is the least common multiple of all cycle lengths.
In our example with p = 6 1 4 3 5 2, we have p, p^2, p^3, p^4, p^5, and p^6 all distinct.
But p^6 is the identity permutation: moving six times along a cycle of length 2 or 3 results in the items at their initial places.
So p^7 is the same as p^1, p^8 is the same as p^2, and so on.
Our question however is harder: we want to know the number of distinct permutations not among p, p^2, p^3, p^4, p^5, ..., but among p, p^2, p^4, p^8, p^16, ...: p to the power of a power of two.
To do that, consider all cycle lengths c_1, c_2, ..., c_r in our permutation.
For each c_i, find the pre-period and period of 2^k mod c_i:
For example, c_1 = 3, and 2^k mod 3 looks like 1, 2, 1, 2, 1, 2, ..., which is (1, 2) with pre-period 0 and period 2.
As another example, c_2 = 2, and 2^k mod 2 looks like 1, 0, 0, 0, ..., which is 1, (0) with pre-period 1 and period 1.
In this problem, this part can be done naively, by just marking visited numbers mod c_i in some array.
By Chinese Remainder Theorem again, after all pre-periods are considered, the period of the whole system of cycles will be the least common multiple of all individual periods.
What remains is to consider pre-periods.
These can be processed with your naive solution anyway, as the length of each pre-period here is at most log_2 n.
The answer is the least common multiple of all individual periods, calculated as above, plus the length of the longest pre-period.
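The steps above can be put together in a short Python sketch. The function names are made up for this illustration, and the sketch assumes the input really is a permutation of 1..n as stated in the question. It finds the cycle lengths, computes the pre-period and period of 2^k mod c for each distinct length c by marking visited residues, and combines them as described.

```python
from math import gcd


def pre_period_and_period(c):
    """Pre-period and period of the sequence 2^k mod c for k = 0, 1, 2, ..."""
    first_seen = {}  # residue -> first exponent k at which it appeared
    value = 1 % c    # 2^0 mod c (this is 0 when c == 1)
    k = 0
    while value not in first_seen:
        first_seen[value] = k
        value = (value * 2) % c
        k += 1
    pre_period = first_seen[value]
    return pre_period, k - pre_period


def iterations_until_repeat(arr):
    """Iterations of arr[i] = arr[arr[i] - 1] until a previously seen array recurs."""
    n = len(arr)
    visited = [False] * n
    cycle_lengths = set()
    for start in range(n):
        length = 0
        j = start
        while not visited[j]:
            visited[j] = True
            j = arr[j] - 1  # follow the permutation (values are 1-based)
            length += 1
        if length:
            cycle_lengths.add(length)
    longest_pre_period = 0
    combined_period = 1
    for c in cycle_lengths:
        pre, period = pre_period_and_period(c)
        longest_pre_period = max(longest_pre_period, pre)
        # least common multiple of all individual periods
        combined_period = combined_period * period // gcd(combined_period, period)
    return combined_period + longest_pre_period


print(iterations_until_repeat([5, 4, 2, 1, 3]))
```

For the sample from the question, the array is a single cycle of length 5, and 2^k mod 5 runs through 1, 2, 4, 3 before repeating, so the sketch returns 4, matching the expected output. Since the distinct cycle lengths sum to at most n, the residue marking stays linear in n overall.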

How can I write this algorithm that returns the count between x and y in a list?

I am given this algorithmic problem and need a way to return the number of entries in a list S, and in another list L, that lie between some x and some y, inclusive, with each query running in O(1) time:
I've issued a challenge against Jack. He will submit a list of his favorite years (from 0 to 2020). If Jack really likes a year,
he may list it multiple times. Since Jack comes up with this list on the fly, it is in no
particular order. Specifically, the list is not sorted, nor do years that appear in the list
multiple times appear next to each other in the list.
I will also submit such a list of years.
I then will ask Jack to pick a random year between 0 and 2020. Suppose Jack picks the year x.
At the same time, I will also then pick a random year between 0 and 2020. Suppose I
pick the year y. Without loss of generality, suppose that x ≤ y.
Once x and y are picked, Jack and I get a very short amount of time (perhaps 5
seconds) to decide if we want to re-do the process of selecting x and y.
If no one asks for a re-do, then we count the number of entries in Jack's list that are
between x and y inclusively and the number of entries in my list that are between x and
y inclusively.
More technically, here is the situation. You are given lists S and L of m and n integers,
respectively, in the range [0, k], representing the collections of years selected by Jack and
me. You may preprocess S and L in O(m+n+k) time. You must then give an algorithm
that runs in O(1) time – so that I can decide if I need to ask for a re-do – that solves the
following problem:
Input: Two integers, x as a member of [0,k] and y as a member of [0,k]
Output: the number of entries in S in the range [x, y], and the number of entries in L in [x, y].
For example, suppose S = {3, 1, 9, 2, 2, 3, 4}. Given x = 2 and y = 3, the returned count
would be 4.
I would prefer pseudocode; it helps me understand the problem a bit easier.

Implementing the approach of user3386109, taking care of the edge case x = 0.
user3386109 : Make a histogram, and then compute the accumulated sum for each entry in the histogram. Suppose S={3,1,9,2,2,3,4} and k is 9. The histogram is H={0,1,2,2,1,0,0,0,0,1}. After accumulating, H={0,1,3,5,6,6,6,6,6,7}. Given x=2 and y=3, the count is H[y] - H[x-1] = H[3] - H[1] = 5 - 1 = 4. Of course, x=0 is a corner case that has to be handled.
# INPUT
S = [3, 1, 9, 2, 2, 3, 4]
L = [2, 9, 4, 6, 8, 5, 3]
k = 9
x = 2
y = 3

# Histogram for S
S_hist = [0] * (k + 1)
for element in S:
    S_hist[element] = S_hist[element] + 1
# Storing prefix sum in S_hist
running_total = S_hist[0]
for index in range(1, k + 1):
    running_total = running_total + S_hist[index]
    S_hist[index] = running_total

# Similar approach for L
# Histogram for L
L_hist = [0] * (k + 1)
for element in L:
    L_hist[element] = L_hist[element] + 1
# Storing prefix sum in L_hist
running_total = L_hist[0]
for index in range(1, k + 1):
    running_total = running_total + L_hist[index]
    L_hist[index] = running_total

# Finding number of elements between x and y (inclusive) in S
print("number of elements between x and y (inclusive) in S:")
if x == 0:
    print(S_hist[y])
else:
    print(S_hist[y] - S_hist[x-1])

# Finding number of elements between x and y (inclusive) in L
print("number of elements between x and y (inclusive) in L:")
if x == 0:
    print(L_hist[y])
else:
    print(L_hist[y] - L_hist[x-1])
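One possible way to avoid the x = 0 special case is to shift the prefix sums by one position, so that entry i holds the number of values strictly less than i. The following is a sketch of that variant, not part of the original answer; the helper names build_prefix_counts and count_in_range are made up for this illustration.

```python
def build_prefix_counts(values, k):
    """O(len(values) + k) preprocessing: prefix[i] = number of values < i."""
    prefix = [0] * (k + 2)
    for v in values:
        prefix[v + 1] += 1           # histogram, shifted right by one
    for i in range(1, k + 2):
        prefix[i] += prefix[i - 1]   # accumulate in place
    return prefix


def count_in_range(prefix, x, y):
    """O(1) query: number of values v with x <= v <= y."""
    return prefix[y + 1] - prefix[x]


S = [3, 1, 9, 2, 2, 3, 4]
L = [2, 9, 4, 6, 8, 5, 3]
k = 9
S_prefix = build_prefix_counts(S, k)
L_prefix = build_prefix_counts(L, k)
print(count_in_range(S_prefix, 2, 3), count_in_range(L_prefix, 2, 3))
```

Because prefix[0] is always 0, the query formula works unchanged for x = 0.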

Unique combinations of 0 and 1 in list in prolog

I have a problem: I want to generate the permutations of a list (in Prolog) that contains n zeros and 24 - n ones, without repetitions. I've tried findall(L, permutation(L,P), Bag) and then sorting the result to remove repetitions, but it causes a stack overflow. Does anyone have an efficient way to do this?

Instead of thinking about lists, think about binary numbers. The list will have a length of 24 elements. If all those elements are 1's we have:
?- X is 0b111111111111111111111111.
X = 16777215.
The de facto standard predicate between/3 can be used to generate numbers in the interval [0, 16777215]:
?- between(0, 16777215, N).
N = 0 ;
N = 1 ;
N = 2 ;
...
Only some of these numbers satisfy your condition. Thus, you will need to filter/test them and then convert each number that passes into a list representation of its binary equivalent.

Select n random numbers between 0 and 23 in ascending order. These integers give you the indexes of the zeroes, and all the configurations are different. The key is generating these lists of indexes.
%
% We need N monotonically increasing integer numbers (to be used
% as indexes) from [From,To].
%
need_indexes(N,From,To,Sol) :-
   N>0,
   !,
   Delta is To-From+1,
   N=<Delta, % Still have a chance to generate them all
   N_less is N-1,
   From_plus is From+1,
   (
      % Case 1: "From" is selected into the collection of index values
      (need_indexes(N_less,From_plus,To,SubSol),Sol=[From|SubSol])
      ;
      % Case 2: "From" is not selected, which is only possible if N<Delta
      (N<Delta -> need_indexes(N,From_plus,To,Sol))
   ).
need_indexes(0,_,_,[]).
Now we can get lists of indexes picked from the available indexes.
For example:
Give me 5 indexes from 0 to 23 (inclusive):
?- need_indexes(5,0,23,Collected).
Collected = [0, 1, 2, 3, 4] ;
Collected = [0, 1, 2, 3, 5] ;
Collected = [0, 1, 2, 3, 6] ;
Collected = [0, 1, 2, 3, 7] ;
...
Give them all:
?- findall(Collected,need_indexes(5,0,23,Collected),L),length(L,LL).
L = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 3, 6], [0, 1, 2, 3, 7], [0, 1, 2, 3|...], [0, 1, 2|...], [0, 1|...], [0|...], [...|...]|...],
LL = 42504.
We are expecting: (24! / ((24-5)! * 5!)) solutions.
Indeed:
?- L is 20*21*22*23*24 / (1*2*3*4*5).
L = 42504.
Now the only problem is transforming every solution like [0, 1, 2, 3, 4] into a string of 0 and 1. This is left as an exercise!
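For comparison, the same "choose the positions of the zeroes" idea can be sketched in Python, including the final conversion to a string. This is an illustration in a different language, not a Prolog solution; the function name zero_one_strings and its length parameter are made up for this example.

```python
from itertools import combinations


def zero_one_strings(zero_count, length=24):
    """Yield every distinct string of the given length with exactly zero_count '0's.

    Each combination of ascending indexes marks where the zeroes go,
    so no string is produced twice.
    """
    for zero_positions in combinations(range(length), zero_count):
        chars = ['1'] * length
        for pos in zero_positions:
            chars[pos] = '0'
        yield ''.join(chars)


gen = zero_one_strings(5)
print(next(gen))
print(sum(1 for _ in zero_one_strings(5)))
```

The count printed for 5 zeroes in 24 positions agrees with the 42504 solutions found above.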

Here is an even simpler answer to generate strings directly. Very direct.
need_list(ZeroCount,OneCount,Sol) :-
   length(Zs,ZeroCount),maplist([X]>>(X='0'),Zs),
   length(Os,OneCount),maplist([X]>>(X='1'),Os),
   compose(Zs,Os,Sol).

compose([Z|Zs],[O|Os],[Z|More]) :- compose(Zs,[O|Os],More).
compose([Z|Zs],[O|Os],[O|More]) :- compose([Z|Zs],Os,More).
compose([],[O|Os],[O|More]) :- !,compose([],Os,More).
compose([Z|Zs],[],[Z|More]) :- !,compose(Zs,[],More).
compose([],[],[]).

rt(ZeroCount,Sol) :-
   ZeroCount >= 0,
   ZeroCount =< 24,
   OneCount is 24-ZeroCount,
   need_list(ZeroCount,OneCount,SolList),
   atom_chars(Sol,SolList).
?- rt(20,Sol).
Sol = '000000000000000000001111' ;
Sol = '000000000000000000010111' ;
Sol = '000000000000000000011011' ;
Sol = '000000000000000000011101' ;
Sol = '000000000000000000011110' ;
Sol = '000000000000000000100111' ;
Sol = '000000000000000000101011' ;
Sol = '000000000000000000101101' ;
Sol = '000000000000000000101110' ;
Sol = '000000000000000000110011' ;
Sol = '000000000000000000110101' ;
....
?- findall(Collected,rt(5,Collected),L),length(L,LL).
L = ['000001111111111111111111', '000010111111111111111111', '000011011111111111111111', '000011101111111111111111', '000011110111111111111111', '000011111011111111111111', '000011111101111111111111', '000011111110111111111111', '000011111111011111111111'|...],
LL = 42504.

Data clustering and comparison between two arrays

I have two collections of elements. How can I pick out those with duplicates and put them into each group with least amount of comparison? Preferably in C++.
For Example given
Array 1 = {1, 1, 2, 2, 3, 4, 5, 5, 1, 1, 2, 2, 4, 5, 8, …}
Array 2 = {2, 1, 1, 2, 2, 4, 7, 7, 8, 8, 2, 2, 4, 4, 8, …}.
At first, I want to cluster data.
Array 1 = { Group 1 = {1, 1, 1, 1, …}, Group 2 = {2, 2, 2, 2, …}, Group 3 = {3, …}, Group 4 = {4, 4, …}, Group 5 = {5, 5, 5, …}, Group 6 = {8, …} }.
Array 2 = { Group 1 = {1, 1, …}, Group 2 = {2, 2, 2, 2, 2 …}, Group 3 = {4, 4 ,4, …}, Group 4 = {7, 7, …}, Group 5 = {8, 8, 8 …} }.
And second, I want data matching.
Group 1 of Array 1 == Group 1 of Array 2
Group 2 of Array 1 == Group 2 of Array 2
Group 4 of Array 1 == Group 3 of Array 2
Group 6 of Array 1 == Group 5 of Array 2
How can I solve this problem in C++? Please give me your brilliant tips.
Additionally, I will explain my problem in detail. I have two data sets which are calculated from a stereo image. Array 1 is the data of the left camera, and Array 2 is the data of the right camera. My final goal is to match groups which have the same values, such as group 6 of array 1 and group 5 of array 2. The ordering of the data does not matter to me. I just want to find the values shared between groups in the two arrays. (Would you recommend sorting the data first to reduce the number of comparisons?)
In order to solve this problem, should I use 'std::map' for data clustering, and compare those N! times (N: no. of groups in array 1 or 2)? Is this the best way to do it?
I’d like to get your advice. Thank you for sharing my problems.
My conclusion
My approach is to use map container in C++ STL.
Make 2 map containers (Array1_map, Array2_map).
Insert each array value into the corresponding map as a key, and the index of that value in the array as the mapped value. (The data of each array are thus stored in a map in sorted order, without duplicate keys.)
Use find() member function of map container for data matching.
After data matching, I was able to get the indexes of each array which have the matched keys (corresponding keys).
Thank you for all your helpful answers!
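The map-based approach described in the conclusion can be sketched in Python, where a dict plays the role that std::map plays in C++. This is only an illustration of the idea: the function names are made up, the sample arrays are the visible prefixes of the arrays from the question (the elided "…" parts are left out), and each value keeps all of its indexes rather than a single one.

```python
def group_indexes(values):
    """Map each distinct value to the list of indexes where it occurs."""
    groups = {}
    for index, value in enumerate(values):
        groups.setdefault(value, []).append(index)
    return groups


def match_groups(array_1, array_2):
    """For every value present in both arrays, return its indexes in each array."""
    groups_1 = group_indexes(array_1)
    groups_2 = group_indexes(array_2)
    shared = sorted(groups_1.keys() & groups_2.keys())
    return {value: (groups_1[value], groups_2[value]) for value in shared}


array_1 = [1, 1, 2, 2, 3, 4, 5, 5, 1, 1, 2, 2, 4, 5, 8]
array_2 = [2, 1, 1, 2, 2, 4, 7, 7, 8, 8, 2, 2, 4, 4, 8]
for value, (idx_1, idx_2) in match_groups(array_1, array_2).items():
    print(value, idx_1, idx_2)
```

Grouping takes one pass over each array, and matching only looks at the distinct values, so no group is compared against every other group.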

The easiest way I can see to do this is to construct a histogram of each array and then compare those histograms. Converting each array to a histogram should be O(N log N), where N is the array size, and comparing the histograms is O(M), where M is the number of unique elements in the array (the size of the map). That would look like
#include <map>

int arr1[] = {...};
int arr2[] = {...};
std::map<int, int> arr1_histogram, arr2_histogram;
// count how often each value occurs in each array
for (auto e : arr1)
    arr1_histogram[e]++;
for (auto e : arr2)
    arr2_histogram[e]++;
if (arr1_histogram == arr2_histogram) {
    // true case
} else {
    // false case
}

thrust::exclusive_scan_by_key unexpected behavior

int data[ 10 ] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
int keys[ 10 ] = { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
thrust::exclusive_scan_by_key( keys, keys + 10, data, data );
Based on the examples on the Thrust site I expected 0,0,1,1,2,2,3,3,4,4, but got 0,0,0,0,0,0,0,0,0 instead. Is this a bug, or is this behavior defined somewhere?
More importantly, assuming this is not a bug, is there a way to achieve this effect easily?

I don't think you understand what scan_by_key does. From the documentation:
"Specifically, consecutive iterators i and i+1 in the range [first1, last1) belong to the same segment if binary_pred(*i, *(i+1)) is true, and belong to different segments otherwise"
scan_by_key requires that your key array mark distinct segments using contiguous values:
keys: 0 0 0 1 1 1 0 0 0 1 1 1
seg#: 0 0 0 1 1 1 2 2 2 3 3 3
thrust compares adjacent keys to determine segments.
Your keys are producing a segment map like this:
int keys[ 10 ] = { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
seg#: 0 1 2 3 4 5 6 7 8 9
Since you are doing an exclusive scan, the correct answer to such a segment map (regardless of the data) would be all zeroes.
It's not entirely clear what "this effect" is that you want to achieve, but you may want to do back-to-back stable sort by key operations, reversing the sense of keys and values, to rearrange this data to group the segments (i.e. keys 1 and 2) together.
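To make the segment semantics concrete, here is a plain-Python emulation. It does not call Thrust; both functions are hypothetical helpers written for this illustration. The first mimics a segmented exclusive scan in which a new segment starts whenever adjacent keys differ. The second groups equal keys with a stable sort, scans, and scatters the results back to the original positions, which is one way to read the sort-based suggestion above, assuming the desired effect is a running count per key value.

```python
def exclusive_scan_by_key(keys, data):
    """Emulate a segmented exclusive scan; adjacent equal keys form one segment."""
    out = []
    running = 0
    for i, (key, value) in enumerate(zip(keys, data)):
        if i == 0 or key != keys[i - 1]:
            running = 0          # a new segment starts here
        out.append(running)      # exclusive: emit the sum before adding value
        running += value
    return out


def scan_grouped_by_key(keys, data):
    """Stable-sort by key, scan each group, then restore the original order."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])  # sorted() is stable
    scanned = exclusive_scan_by_key([keys[i] for i in order],
                                    [data[i] for i in order])
    result = [0] * len(keys)
    for position, value in zip(order, scanned):
        result[position] = value
    return result


data = [1] * 10
keys = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
print(exclusive_scan_by_key(keys, data))
print(scan_grouped_by_key(keys, data))
```

With alternating keys every element is its own segment, so the direct scan yields only zeroes, while the grouped variant yields 0,0,1,1,2,2,3,3,4,4.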
