FAST unique combinations (from list with duplicates) WITHOUT LOOKUPS - c++

Although there are plenty of algorithms and functions available online for generating unique combinations of any size from a list of unique items, there is none for a list of non-unique items (i.e. a list containing repetitions of the same value).
The question is how to generate, ON-THE-FLY in a generator function, all the unique combinations from a non-unique list without the computationally expensive need of filtering out duplicates.
I consider a combination comboA to be unique if there is no other combination comboB for which the sorted lists of both combinations are the same. Here is an example of code checking for such uniqueness:
comboA = [1,2,2]
comboB = [2,1,2]
print("B is a duplicate of A" if sorted(comboA)==sorted(comboB) else "A is unique compared to B")
In the example above B is a duplicate of A, so print() prints B is a duplicate of A.
The problem of getting a generator function capable of providing unique combinations on-the-fly from a non-unique list is solved here: Getting unique combinations from a non-unique list of items, FASTER?, but the generator function provided there needs lookups and requires memory, which causes problems for huge amounts of combinations.
The function provided in the current version of the answer does the job without any lookups and appears to be the right answer here, BUT ...
The goal behind getting rid of lookups is to speed up the generation of unique combinations in case of a list with duplicates.
When writing the first version of this question I wrongly assumed that code which doesn't require creating a set for the lookups needed to assure uniqueness would have an advantage over code needing lookups. It is not the case, at least not always. The code in the answer provided so far does not use lookups, but it takes much more time to generate all the combinations if the list has no redundant items, or only a few of them.
Here are some timings to illustrate the current situation:
-----------------
k: 6 len(ls): 48
Combos Used Code Time
---------------------------------------------------------
12271512 len(list(combinations(ls,k))) : 2.036 seconds
12271512 len(list(subbags(ls,k))) : 50.540 seconds
12271512 len(list(uniqueCombinations(ls,k))) : 8.174 seconds
12271512 len(set(combinations(sorted(ls),k))): 7.233 seconds
---------------------------------------------------------
12271512 len(list(combinations(ls,k))) : 2.030 seconds
1 len(list(subbags(ls,k))) : 0.001 seconds
1 len(list(uniqueCombinations(ls,k))) : 3.619 seconds
1 len(set(combinations(sorted(ls),k))): 2.592 seconds
The timings above illustrate the two extremes: no duplicates and only duplicates. All other timings fall between these two.
My interpretation of the results is that a pure Python function (not using any C-compiled modules) can be much faster, but it can also be much slower, depending on how many duplicates are in the list. So there is probably no way around writing C/C++ code for a Python .so extension module providing the required functionality.
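For reference, the uniqueCombinations() generator timed above is the obvious (set + combinations) variant; it presumably looks roughly like the following sketch (my reconstruction for readability, the linked answer may differ in details):
from itertools import combinations

def uniqueCombinations(lst, k):
    # sort first so that duplicate combinations show up as identical tuples,
    # then filter them with a set lookup (the lookup this question wants to avoid)
    seen = set()
    for combo in combinations(sorted(lst), k):
        if combo not in seen:
            seen.add(combo)
            yield combo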

Instead of post-processing/filtering your output, you can pre-process your input list. This way, you can avoid generating duplicates in the first place. Pre-processing means either sorting the input or running a collections.Counter over it. One possible recursive realization is:
def subbags(bag, k):
    a = sorted(bag)
    n = len(a)
    sub = []

    def index_of_next_unique_item(i):
        j = i + 1
        while j < n and a[j] == a[i]:
            j += 1
        return j

    def combinate(i):
        if len(sub) == k:
            yield tuple(sub)
        elif n - i >= k - len(sub):
            sub.append(a[i])
            yield from combinate(i + 1)
            sub.pop()
            yield from combinate(index_of_next_unique_item(i))

    yield from combinate(0)


bag = [1, 2, 3, 1, 2, 1]
k = 3
i = -1

print(sorted(bag), k)
print('---')
for i, subbag in enumerate(subbags(bag, k)):
    print(subbag)
print('---')
print(i + 1)
Output:
[1, 1, 1, 2, 2, 3] 3
---
(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 2, 2)
(1, 2, 3)
(2, 2, 3)
---
6
Requires some stack space for the recursion, but this + sorting the input should use substantially less time + memory than generating and discarding repeats.
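The collections.Counter alternative mentioned above would look e.g. like this (my sketch, not part of the original answer; it recurses over (value, multiplicity) pairs instead of a sorted list and likewise never generates a duplicate):
from collections import Counter

def subbags_counter(bag, k):
    items = sorted(Counter(bag).items())   # [(value, multiplicity), ...]

    def combinate(idx, chosen):
        if len(chosen) == k:
            yield tuple(chosen)
            return
        if idx == len(items):
            return
        value, count = items[idx]
        # take min(count, k - len(chosen)) copies of this value, then fewer, down to none
        for take in range(min(count, k - len(chosen)), -1, -1):
            yield from combinate(idx + 1, chosen + [value] * take)

    yield from combinate(0, [])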

The current state of the art, inspired initially by a 50 and then by a 100 reputation bounty, is at the moment (instead of a Python extension module written entirely in C):
An efficient algorithm and implementation that is better than the obvious (set + combinations) approach in the best (and average) case, and is competitive with it in the worst case.
It seems possible to fulfill this requirement using a kind of "fake it before you make it" approach. There are currently two generator-function algorithms available for solving the problem of getting unique combinations from a non-unique list. The algorithm below combines both of them, which becomes possible because there appears to be a threshold value for the percentage of unique items in the list that can be used for appropriate switching between the two algorithms. The calculation of the percentage of uniqueness takes so little computation time that it doesn't even clearly show up in the final results, given the usual variation of the timings.
def iterFastUniqueCombos(lstList, comboSize, percUniqueThresh=60):

    lstListSorted = sorted(lstList)
    lenListSorted = len(lstListSorted)
    percUnique = 100.0 - 100.0*(lenListSorted-len(set(lstListSorted)))/lenListSorted
    lstComboCandidate = []
    setUniqueCombos = set()

    def idxNextUnique(idxItemOfList):
        idxNextUniqueCandidate = idxItemOfList + 1
        while (
                idxNextUniqueCandidate < lenListSorted
            and
                lstListSorted[idxNextUniqueCandidate] == lstListSorted[idxItemOfList]
        ): # while
            idxNextUniqueCandidate += 1
        idxNextUnique = idxNextUniqueCandidate
        return idxNextUnique

    def combinate(idxItemOfList):
        if len(lstComboCandidate) == comboSize:
            yield tuple(lstComboCandidate)
        elif lenListSorted - idxItemOfList >= comboSize - len(lstComboCandidate):
            lstComboCandidate.append(lstListSorted[idxItemOfList])
            yield from combinate(idxItemOfList + 1)
            lstComboCandidate.pop()
            yield from combinate(idxNextUnique(idxItemOfList))

    if percUnique > percUniqueThresh:
        from itertools import combinations
        allCombos = combinations(lstListSorted, comboSize)
        for comboCandidate in allCombos:
            if comboCandidate in setUniqueCombos:
                continue
            yield comboCandidate
            setUniqueCombos.add(comboCandidate)
    else:
        yield from combinate(0)
    #:if/else
#:def iterFastUniqueCombos()
The timings below show that the above iterFastUniqueCombos() generator function provides a clear advantage over the uniqueCombinations() variant if the list has less than 60 percent unique elements, and is no worse than the (set + combinations) based uniqueCombinations() generator function in the opposite case, where it gets much faster than the iterUniqueCombos() one (due to switching between the (set + combinations) and the (no lookups) variant at the 60% threshold for the amount of unique elements in the list):
=========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 1 percUnique 2
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.04968 seconds.
Combos: 1 print(len(list( iterUniqueCombos(lst,k)))) : 0.00011 seconds.
Combos: 1 print(len(list( iterFastUniqueCombos(lst,k)))) : 0.00008 seconds.
Combos: 1 print(len(list( uniqueCombinations(lst,k)))) : 3.61812 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 48 percUnique 100
Combos: 12271512 print(len(list(combinations(lst,k)))) : 1.99383 seconds.
Combos: 12271512 print(len(list( iterUniqueCombos(lst,k)))) : 49.72461 seconds.
Combos: 12271512 print(len(list( iterFastUniqueCombos(lst,k)))) : 8.07997 seconds.
Combos: 12271512 print(len(list( uniqueCombinations(lst,k)))) : 8.11974 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 27 percUnique 56
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.02774 seconds.
Combos: 534704 print(len(list( iterUniqueCombos(lst,k)))) : 1.60052 seconds.
Combos: 534704 print(len(list( iterFastUniqueCombos(lst,k)))) : 1.62002 seconds.
Combos: 534704 print(len(list( uniqueCombinations(lst,k)))) : 3.41156 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 31 percUnique 64
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.03539 seconds.
Combos: 1114062 print(len(list( iterUniqueCombos(lst,k)))) : 3.49330 seconds.
Combos: 1114062 print(len(list( iterFastUniqueCombos(lst,k)))) : 3.64474 seconds.
Combos: 1114062 print(len(list( uniqueCombinations(lst,k)))) : 3.61857 seconds.
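A minimal way to drive and time iterFastUniqueCombos() yourself (the list construction below is illustrative, not the original benchmark script):
from timeit import default_timer as timer

lst = [1, 2, 3, 4] * 12   # 48 items, only 4 unique values -> low percUnique, no-lookup branch
k = 6

start = timer()
combos = list(iterFastUniqueCombos(lst, k))
print("Combos:", len(combos), "in %.5f seconds" % (timer() - start))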

Related

Issue when Using Counter to count pairs Python

I took an online assessment where I was given the following question :
Input 1 : Integer - Length of the list
Input 2 : Integer List - Consists of different numbers
Output : Integer
Question : Find the total number of pairs possible from the list.
Example : 5, [10,23,2,10,23]
Since 10 and 23 each occur twice and 2 occurs only once, there are 2 pairs, so the result should be 2.
So I did the following and had one of the test cases fail. The test case wasn't shown, so I'm very confused as to where I went wrong. The code is:
dict = Counter(input2)
pairs = 0
count = []
for i in dict.values():
    count.append(i)
for j in count:
    pairs += j//2
return pairs
Except for that one test case, the other 7 all pass. Please help me out.
You can simply divide the value of each entry of the dict that collections.Counter returns by 2:
from collections import Counter
l = [10,10,10,20,20,20,45,46,45]
print({k: v // 2 for k, v in Counter(l).items()})
This outputs:
{10: 1, 20: 1, 45: 1, 46: 0}
Or if you only want the total number of pairs:
print(sum(v // 2 for v in Counter(l).values()))
This outputs:
3
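Putting the whole assessment task together (the function name and signature below are illustrative, not from the assessment):
from collections import Counter

def count_pairs(length, numbers):
    # 'length' is provided by the assessment but not needed for the computation
    return sum(v // 2 for v in Counter(numbers).values())

print(count_pairs(5, [10, 23, 2, 10, 23]))  # prints 2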

Python 2.7 - Count number of items in Output

I need to count the number of items in my output.
So, for example, I created this:
a = 1000000
while a >= 10:
    print a
    a = a/2
How would I count how many halving steps were carried out?
Thanks
You have 2 ways: the empirical way and the predictive way.
a = 1000000
import math
# math.log(x, 2) is used here because math.log2 does not exist on Python 2.7
print("theorical iterations {}".format(int(math.log(a//10, 2) + 0.5)))
counter = 0
while a >= 10:
    counter += 1
    a //= 2
print("real iterations {}".format(counter))
I get:
theorical iterations 17
real iterations 17
The experimental method just counts the iterations, whereas the predictive method relies on the rounded-up log2 of a//10 (which matches the complexity of the algorithm).
(It's rounded up because if the exact value is, say, a bit more than 16, you still need 17 iterations.)
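A quick check of the arithmetic behind that estimate (my addition):
import math

a = 1000000
# the loop halves a until it drops below 10, i.e. until 1000000 / 2**n < 10,
# which first happens for n just above log2(100000) ~= 16.6, hence 17 iterations
print(math.log(a // 10, 2))            # ~16.61
print(int(math.log(a // 10, 2)) + 1)   # 17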
c = 0
a = 1000000
while a >= 10:
    print a
    a = a / 2
    c = c + 1

fastest way to obtain cross product

It looks like calculating the cross-product of an array of vectors explicitly is a lot faster than using np.cross. I've tried vector-first and vector-last, it doesn't seem to make a difference, though that was proposed in an answer to a similar question. Am I using it wrong, or is it just slower?
The explicit calculation seems to take about 60ns per cross-product on a laptop. Is that ~roughly~ as fast as it's going to get? In this case, there doesn't seem to be any reason to go to Cython or PyPy or writing a special ufunc yet.
I also see references to the use of einsum, but I don't really understand how to use that, and suspect it is not faster.
import numpy as np
import timeit

a = np.random.random(size=300000).reshape(100000, 3)  # vector last
b = np.random.random(size=300000).reshape(100000, 3)
c, d = a.swapaxes(0, 1), b.swapaxes(0, 1)             # vector first

def npcross_vlast(): return np.cross(a, b)
def npcross_vfirst(): return np.cross(c, d, axisa=0, axisb=0)
def npcross_vfirst_axisc(): return np.cross(c, d, axisa=0, axisb=0, axisc=0)

def explicitcross_vlast():
    e = np.zeros_like(a)
    e[:,0] = a[:,1]*b[:,2] - a[:,2]*b[:,1]
    e[:,1] = a[:,2]*b[:,0] - a[:,0]*b[:,2]
    e[:,2] = a[:,0]*b[:,1] - a[:,1]*b[:,0]
    return e

def explicitcross_vfirst():
    e = np.zeros_like(c)
    e[0,:] = c[1,:]*d[2,:] - c[2,:]*d[1,:]
    e[1,:] = c[2,:]*d[0,:] - c[0,:]*d[2,:]
    e[2,:] = c[0,:]*d[1,:] - c[1,:]*d[0,:]
    return e

print "explicit"
print timeit.timeit(explicitcross_vlast, number=10)
print timeit.timeit(explicitcross_vfirst, number=10)
print "np.cross"
print timeit.timeit(npcross_vlast, number=10)
print timeit.timeit(npcross_vfirst, number=10)
print timeit.timeit(npcross_vfirst_axisc, number=10)

print all([npcross_vlast()[7,i] == npcross_vfirst()[7,i] ==
           npcross_vfirst_axisc()[i,7] == explicitcross_vlast()[7,i] ==
           explicitcross_vfirst()[i,7] for i in range(3)])  # check one
explicit
0.0582590103149
0.0560920238495
np.cross
0.399816989899
0.412983894348
0.411231040955
True
The performance of np.cross improved significantly in the 1.9.x release of numpy.
%timeit explicitcross_vlast()
%timeit explicitcross_vfirst()
%timeit npcross_vlast()
%timeit npcross_vfirst()
%timeit npcross_vfirst_axisc()
These are the timings I get for 1.8.0
100 loops, best of 3: 4.47 ms per loop
100 loops, best of 3: 4.41 ms per loop
10 loops, best of 3: 29.1 ms per loop
10 loops, best of 3: 29.3 ms per loop
10 loops, best of 3: 30.6 ms per loop
And these the timings for 1.9.0:
100 loops, best of 3: 4.62 ms per loop
100 loops, best of 3: 4.19 ms per loop
100 loops, best of 3: 4.05 ms per loop
100 loops, best of 3: 4.09 ms per loop
100 loops, best of 3: 4.24 ms per loop
I suspect that the speedup was introduced by merge request #4338.
First off, if you're looking to speed up your code, you should probably try to get rid of the cross products altogether. That's possible in many cases, e.g., when they are used in connection with dot products: <a x b, c x d> = <a, c><b, d> - <a, d><b, c>.
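A quick numerical check of that identity (the random test vectors are my own, not from the original answer):
import numpy as np

rng = np.random.RandomState(0)
a, b, c, d = rng.rand(4, 3)   # four random 3-vectors

lhs = np.dot(np.cross(a, b), np.cross(c, d))
rhs = np.dot(a, c) * np.dot(b, d) - np.dot(a, d) * np.dot(b, c)
print(np.allclose(lhs, rhs))  # True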
Anyways, in case you really need explicit cross products, check out
eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1
np.einsum('ijk,aj,ak->ai', eijk, a, b)
np.einsum('iak,ak->ai', np.einsum('ijk,aj->iak', eijk, a), b)
These two are equivalent to np.cross, where the second uses two einsums with two arguments each, a technique suggested in a similar question.
The results are disappointing, though: Both of these variants are slower than np.cross (except for tiny n):
The plot was created with
import numpy as np
import perfplot

eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1

b = perfplot.bench(
    setup=lambda n: np.random.rand(2, n, 3),
    n_range=[2 ** k for k in range(23)],
    kernels=[
        lambda X: np.cross(X[0], X[1]),
        lambda X: np.einsum("ijk,aj,ak->ai", eijk, X[0], X[1]),
        lambda X: np.einsum("iak,ak->ai", np.einsum("ijk,aj->iak", eijk, X[0]), X[1]),
    ],
    labels=["np.cross", "einsum", "double einsum"],
    xlabel="len(a)",
)
b.save("out.png")
Simply changing your vlast to
def stacked_vlast(a, b):
    x = a[:,1]*b[:,2] - a[:,2]*b[:,1]
    y = a[:,2]*b[:,0] - a[:,0]*b[:,2]
    z = a[:,0]*b[:,1] - a[:,1]*b[:,0]
    return np.array([x, y, z]).T
i.e. replacing the column assignment with stacking, as the (old) cross does, slows the speed by 5x.
When I use a local copy of the development cross function, I get a minor speed improvement over your explicit_vlast. That cross uses the out parameter in an attempt to cut down on temporary arrays, but my crude tests suggest that it doesn't make much difference in speed.
https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py
If your explicit version works, I wouldn't upgrade numpy just to get this new cross.
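For illustration, here is how the column-assignment approach can be combined with an out parameter to reuse a preallocated result array (my sketch, not the numpy development code):
import numpy as np

def explicit_cross_vlast(a, b, out=None):
    # a, b: (n, 3) arrays in vector-last layout; out must not alias a or b
    if out is None:
        out = np.empty_like(a)
    out[:, 0] = a[:, 1]*b[:, 2] - a[:, 2]*b[:, 1]
    out[:, 1] = a[:, 2]*b[:, 0] - a[:, 0]*b[:, 2]
    out[:, 2] = a[:, 0]*b[:, 1] - a[:, 1]*b[:, 0]
    return out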

How to compare 2 sparse matrix stored using scikit-learn library load_svmlight_file?

I am trying to compare the feature vectors present in the test and train data sets. These feature vectors are stored in sparse format using the scikit-learn function load_svmlight_file. The dimension of the feature vectors is the same for both datasets. However, I am getting this error: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()."
Why am I getting this error?
How can I resolve it?
Thanks in advance!
from sklearn.datasets import load_svmlight_file

pathToTrainData = "../train.txt"
pathToTestData = "../test.txt"
X_train, Y_train = load_svmlight_file(pathToTrainData)
X_test, Y_test = load_svmlight_file(pathToTestData)
for ele1 in X_train:
    for ele2 in X_test:
        if(ele1==ele2):
            print "same vector"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-c1f145f984a6> in <module>()
7 for ele1 in X_train:
8 for ele2 in X_test:
----> 9 if(ele1==ele2):
10 print "same vector"
/Users/rkasat/anaconda/lib/python2.7/site-packages/scipy/sparse/base.pyc in __bool__(self)
181 return True if self.nnz == 1 else False
182 else:
--> 183 raise ValueError("The truth value of an array with more than one "
184 "element is ambiguous. Use a.any() or a.all().")
185 __nonzero__ = __bool__
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
You can use this condition to check whether the two sparse arrays are exactly equal without needing to densify them:
if (ele1 - ele2).nnz == 0:
# Matched, do something ...
The nnz attribute gives the number of nonzero elements in the sparse array.
Some simple test runs to show the difference:
import numpy as np
from scipy import sparse

A = sparse.rand(10, 1000000).tocsr()

def benchmark1(A):
    for s1 in A:
        for s2 in A:
            if (s1 - s2).nnz == 0:
                pass

def benchmark2(A):
    for s1 in A:
        for s2 in A:
            if (s1.toarray() == s2).all() == 0:
                pass

%timeit benchmark1(A)
%timeit benchmark2(A)
Some results:
# Computer 1
10 loops, best of 3: 36.9 ms per loop # with nnz
1 loops, best of 3: 734 ms per loop # with toarray
# Computer 2
10 loops, best of 3: 28 ms per loop
1 loops, best of 3: 312 ms per loop
If your arrays are dense you can run into the same problem, and there the solution is straightforward. Replace
if(ele1==ele2):
with
if (ele1 == ele2).all():
However, since you are working with sparse matrices, this problem is actually not that easy in general. Notably, the functions all and any aren't implemented for sparse matrices (which, at least for all is understandable, because all can only return True if the matrix tested is densely filled with values that evaluate to True).
In your case, since you are only comparing rows of your sparse matrices, you may find it acceptable to densify them and then do the comparison. Try replacing the mentioned line with
if (ele1.toarray() == ele2).all(): # Densifying one of them casts the other to dense too
On a more general note, you seem to want to compare the rows of 2 matrices. Depending on the number of entries, this can be done a lot more efficiently by defining a vectorized comparison function, like this:
def compare(A, B):
    return zip(*np.where((np.array(A.multiply(A).sum(1)) +
                          np.array(B.multiply(B).sum(1)).T) - 2 * A.dot(B.T).toarray() == 0))
This function will return a list of pairs of indices, telling you which rows correspond to each other, and is a lot more efficient than the double for loop used in your code.
Explanation: The function compare calculates pairwise euclidean distances using the binomial formula (a - b) ** 2 == a ** 2 + b ** 2 - 2 * a * b. This formula also works for l2 norm and scalar products. If the matrices weren't sparse, the formula would become much simpler: squared_distances = (A ** 2).sum(axis=1) + (B ** 2).sum(axis=1) - 2 * A.dot(B.T). Then we check which of these entries are equal to 0 using np.where and return them as tuples.
Benchmarking this, we obtain:
import numpy as np
from scipy import sparse
rng = np.random.RandomState(42)
A = sparse.rand(10, 1000000, random_state=rng).tocsr()
In [12]: %timeit compare(A, A)
100 loops, best of 3: 10.2 ms per loop
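As a small usage example of compare() (the toy matrices below are mine, not from the original answer):
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 0]]))
B = sparse.csr_matrix(np.array([[0, 3, 0],
                                [1, 0, 2]]))
# row 0 of A equals row 1 of B, row 1 of A equals row 0 of B
print(list(compare(A, B)))   # [(0, 1), (1, 0)]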

Does opening a file related to the program also stop the program?

I have this program that is supposed to search for perfect numbers.
(X is a perfect number if the sum of all numbers that divide X, divided by 2, is equal to X)
sum/2 = x
Now it has found the first four, which were already known in Ancient Greece, so it's not really anything awesome.
The next one should be 33550336.
I know it is a big number, but the program has been going for about 50 minutes, and still hasn't found 33550336.
Is it because I opened the .txt file where I store all the perfect numbers while the program was running, or is it because I don't have a PC fast enough to run it*, or because I'm using Python?
*NOTE: This same PC factorized 500 000 in 10 minutes (while also running the perfect number program and Google Chrome with 3 YouTube tabs), also using Python.
Here is the code to the program:
i = 2
a = open("perfect.txt", 'w')
a.close()
while True:
    sum = 0
    for x in range(1, i+1):
        if i%x == 0:
            sum += x
    if sum / 2 == i:
        a = open("perfect.txt", 'a')
        a.write(str(i) + "\n")
        a.close()
    i += 1
The next one should be 33550336.
Your code (I fixed the indentation so that it does in principle what you want):
i = 2
a = open("perfect.txt", 'w')
a.close()
while True:
    sum = 0
    for x in range(1, i+1):
        if i%x == 0:
            sum += x
    if sum / 2 == i:
        a = open("perfect.txt", 'a')
        a.write(str(i) + "\n")
        a.close()
    i += 1
does i divisions to find the divisors of i.
So to find the perfect numbers up to n, it does
2 + 3 + 4 + ... + (n-1) + n = n*(n+1)/2 - 1
divisions in the for loop.
Now, for n = 33550336, that would be
Prelude> 33550336 * (33550336 + 1) `quot` 2 - 1
562812539631615
roughly 5.6 * 10^14 divisions.
Assuming your CPU could do 10^9 divisions per second (it most likely can't, 10^8 is a better estimate in my experience, but even that is for machine ints in C), that would take about 560,000 seconds. One day has 86400 seconds, so that would be roughly six and a half days (more than two months with the 10^8 estimate).
Your algorithm is just too slow to reach that in reasonable time.
If you don't want to use number-theory (even perfect numbers have a very simple structure, and if there are any odd perfect numbers, those are necessarily huge), you can still do better by dividing only up to the square root to find the divisors,
i = 2
a = open("perfect.txt", 'w')
a.close()
while True:
    sum = 1
    root = int(i**0.5)
    for x in range(2, root+1):
        if i%x == 0:
            sum += x + i/x
    if i == root*root:
        sum -= x # if i is a square, we have counted the square root twice
    if sum == i:
        a = open("perfect.txt", 'a')
        a.write(str(i) + "\n")
        a.close()
    i += 1
that only needs about 1.3 * 10^11 divisions and should find the fifth perfect number in a couple of hours.
Without resorting to the explicit formula for even perfect numbers (2^(p-1) * (2^p - 1) for primes p such that 2^p - 1 is prime), you can speed it up somewhat by finding the prime factorisation of i and computing the divisor sum from that. That will make the test faster for all composite numbers, and much faster for most,
def factorisation(n):
    facts = []
    multiplicity = 0
    while n%2 == 0:
        multiplicity += 1
        n = n // 2
    if multiplicity > 0:
        facts.append((2,multiplicity))
    d = 3
    while d*d <= n:
        if n % d == 0:
            multiplicity = 0
            while n % d == 0:
                multiplicity += 1
                n = n // d
            facts.append((d,multiplicity))
        d += 2
    if n > 1:
        facts.append((n,1))
    return facts

def divisorSum(n):
    f = factorisation(n)
    sum = 1
    for (p,e) in f:
        sum *= (p**(e+1) - 1)/(p-1)
    return sum

def isPerfect(n):
    return divisorSum(n) == 2*n

i = 2
count = 0
out = 10000
while count < 5:
    if isPerfect(i):
        print i
        count += 1
    if i == out:
        print "At",i
        out *= 5
    i += 1
would take an estimated 40 minutes on my machine.
Not a bad estimate:
$ time python fastperf.py
6
28
496
8128
33550336
real 36m4.595s
user 36m2.001s
sys 0m0.453s
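For reference, the explicit formula for even perfect numbers mentioned above (2^(p-1) * (2^p - 1) with 2^p - 1 prime) finds the same five numbers essentially instantly; here is a sketch with a naive trial-division primality test (my addition, not part of the original answer):
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d*d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

count = 0
p = 2
while count < 5:
    mersenne = 2**p - 1
    if is_prime(p) and is_prime(mersenne):
        print 2**(p - 1) * mersenne   # 6, 28, 496, 8128, 33550336
        count += 1
    p += 1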
It is very hard to deduce why this has happened. I would suggest that you run your program under a debugger and step through several iterations manually to check that the code is really correct (I know you have already calculated 4 numbers, but still). Alternatively, it would be good to run your program under a Python profiler, just to see whether it hasn't accidentally blocked on a lock or something.
It is possible, but not likely, that this is an issue related to you opening the file while the program is running. If it were an issue, there would probably have been some error message and/or the program would have closed or crashed.
I would edit the program to write log-type output to a file every so often. For example, every time you have processed a target number that is an even multiple of 1 million, write (open, append, close) the date-time, the current number, and the last successful number to a log file.
You could then type out the file once in a while to measure progress.