In my Python application I have a big list (currently with almost 9000 items). I need to find the two most similar items in this list. What I have now is something like:
aux1 = 0
aux2 = 1
min_distance = 0xffff
weights = get_weights()
for i in range(0, len(_list)):
    for j in range(i + 1, len(_list)):
        obj1 = _list[i]
        obj2 = _list[j]
        dist = 0
        for key in self.__fields:
            dist += weights[key] * (obj1[key] - obj2[key]) ** 2
        if dist < min_distance:
            min_distance = dist
            aux1 = i
            aux2 = j
return aux1, aux2, min_distance
In this code, weights is a dict, obj1 and obj2 are both objects that implement __getitem__ and return values from an internal dict, and self.__fields is a list of the selected fields (it currently has 9 items).
My problem is that this loop takes too long to complete. Even after 5 hours, the i variable is still within the first 100 items of the list.
With the following silly code, I came to the conclusion that the problem is not the size of the list (the silly code finishes in about 5 minutes).
count = 0
total = 9000
for i in range(0, total):
    for j in range(i + 1, total):
        for k in range(0, 10):
            count += 1
print("Count is " + str(count))
Therefore, the problem seems to be in the innermost loop of my code:
for key in self.__fields:
    dist += weights[key] * (obj1[key] - obj2[key]) ** 2
I know Python, but I'm not a Python specialist. My conclusion is that accessing the values of three objects by key is a slow operation. Some time ago, I read on a blog that list comprehensions and/or lambda operations can be faster.
So, my question is: how do I make this innermost loop faster using list comprehensions and/or a lambda? Feel free to give any other advice as well.
Not sure whether it's any faster, but you could rewrite that code using itertools.combinations and get the min using a key function calculating the "distance".
from itertools import combinations

weights = get_weights()
aux1, aux2 = min(combinations(_list, 2),
                 key=lambda pair: sum(weights[key] * (pair[0][key] - pair[1][key]) ** 2
                                      for key in self.__fields))
If this does not help, you might consider temporarily turning the dictionaries in _list into lists holding just the values of the relevant fields. Instead of using dictionary lookups, you can then just zip those lists together with the weights. Afterwards, turn them back into dicts.
weights_list = [weights[f] for f in self.__fields]
as_lists = [[d[f] for f in self.__fields] for d in _list]
aux1, aux2 = min(combinations(as_lists, 2),
                 key=lambda pair: sum(w * (x - y) ** 2
                                      for w, x, y in zip(weights_list, *pair)))
aux1, aux2 = (dict(zip(self.__fields, x)) for x in (aux1, aux2))
This should be a bit faster, but it will only work if the dicts have no fields other than those in self.__fields; otherwise the dicts cannot be reconstructed from the lists (at least not as easily). Alternatively, you might use tuples instead of lists and use another dictionary to map those tuples back to the original dictionaries...
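For instance, a sketch of that tuple-based variant (illustrative only; it assumes the field-value tuples are distinct, otherwise the back-mapping collapses duplicates):
as_tuples = [tuple(d[f] for f in self.__fields) for d in _list]
back = dict(zip(as_tuples, _list))   # maps each tuple back to its original dict
aux1, aux2 = min(combinations(as_tuples, 2),
                 key=lambda pair: sum(w * (x - y) ** 2
                                      for w, x, y in zip(weights_list, *pair)))
aux1, aux2 = back[aux1], back[aux2]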
Or try this, using the indices of the elements instead of the elements themselves (not tested):
idx1, idx2 = min(combinations(range(len(_list)), 2),
                 key=lambda pair: sum(w * (x - y) ** 2
                                      for w, x, y in zip(weights_list, as_lists[pair[0]], as_lists[pair[1]])))
aux1, aux2 = _list[idx1], _list[idx2]
As an introduction, I want to point out that if one has a matrix A consisting of 4 submatrices in a 2x2 pattern, where the diagonal blocks are square, then, denoting its inverse as X, the submatrix X22 = (A22 - A21*(A11^-1)*A12)^-1, which is quite easy to show by hand.
I was trying to do the same for a matrix of 4x4 submatrices, but it's quite tedious by hand. So I thought SymPy would be of some help. But I cannot figure out how (I have started by just trying to reproduce the 2x2 result).
I've tried:
import sympy as s

def blockmatrix(name, sizes, names=None):
    if names is None:
        names = sizes
    ll = []
    for i, (s1, n1) in enumerate(zip(sizes, names)):
        l = []
        for j, (s2, n2) in enumerate(zip(sizes, names)):
            l.append(s.MatrixSymbol(name + str(n1) + str(n2), s1, s2))
        ll.append(l)
    return ll

def eyes(*sizes):
    ll = []
    for i, s1 in enumerate(sizes):
        l = []
        for j, s2 in enumerate(sizes):
            if i == j:
                l.append(s.Identity(s1))
                continue
            l.append(s.ZeroMatrix(s1, s2))
        ll.append(l)
    return ll

n1, n2 = s.symbols("n1, n2", integer=True, positive=True, nonzero=True)
M = s.Matrix(blockmatrix("m", (n1, n2)))
X = s.Matrix(blockmatrix("x", (n1, n2)))
I = s.Matrix(eyes(n1, n2))

s.solve(M*X[:, 1:] - I[:, 1:], X[:, 1:])
but it just returns an empty list instead of the result.
I have also tried:
- Using M*X == I, but that just returns False (a boolean, not an Expression)
- Entering a list of equations
- Using 'ordinary' symbols with commutative=False instead of MatrixSymbols -- this gives an exception: GeneratorsError: non-commutative generators: (x12, x22)
but all without luck.
Can you show how to derive a result with SymPy similar to the one I gave as an example for X22?
The most similar other questions on solving with MatrixSymbols seem to have been solved by working around doing exactly that, by using an array of the inner symbols or some such instead. But since I am dealing with symbolically sized MatrixSymbols, that is not an option for me.
Is this what you mean by a matrix of 2x2 matrices?
>>> a = [MatrixSymbol(i,2,2) for i in symbols('a1:5')]
>>> A = Matrix(2,2,a)
>>> X = A.inv()
>>> print(X[1,1]) # [1,1] instead of [2,2] because indexing starts at 0
a1*(a1*a3 - a3*a1)**(-1)
[You indicated that this is not what you meant and pointed out that the result above is not correct -- that appears to be an issue that should be resolved.]
I am not sure why this isn't implemented, but we can do the solving manually as follows:
>>> n = 2
>>> v = symbols('b:%s'%n**2,commutative=False)
>>> A = Matrix(n,n,symbols('a:%s'%n**2,commutative=False))
>>> B = Matrix(n,n,v)
>>> eqs = list(A*B - eye(n))
>>> for i in range(n**2):
... s = solve(eqs[i],v[i])[0]
... eqs[i+1:] = [e.subs(v[i],s) for e in eqs[i+1:]]
...
>>> s # solution for v[3] which is B22
(-a2*a0**(-1)*a1 + a3)**(-1)
You can change n to 3 and see a modestly complicated expression. Change it to 4 and check the result by hand to give a new definition to the word "tedious" ;-)
The special structure of the equations to be solved can allow for a faster solution, too: the variable of interest is the last factor in each term containing it:
>>> for i in range(n**2):
... c,d = eqs[i].expand().as_independent(v[i])
... assert all(j.args[-1]==v[i] for j in Add.make_args(d))
... s = 1/d.subs(v[i], 1)*-c
... eqs[i+1:] = [e.subs(v[i], s) for e in eqs[i+1:]]
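For convenience, here is the faster variant gathered into one self-contained script (a sketch assembled from the snippets above, not independently verified):
from sympy import symbols, Matrix, eye, Add

n = 2
a = list(symbols('a:%s' % n**2, commutative=False))
v = list(symbols('b:%s' % n**2, commutative=False))
A = Matrix(n, n, a)
B = Matrix(n, n, v)
eqs = list(A*B - eye(n))
for i in range(n**2):
    c, d = eqs[i].expand().as_independent(v[i])
    # every term that contains v[i] has it as its last factor
    assert all(t.args[-1] == v[i] for t in Add.make_args(d))
    sol = 1/d.subs(v[i], 1) * -c
    eqs[i+1:] = [e.subs(v[i], sol) for e in eqs[i+1:]]
print(sol)   # solution for v[3], i.e. B22: (-a2*a0**(-1)*a1 + a3)**(-1)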
I tried to look at similar questions but I'm really not understanding how I can accomplish this using the methods mentioned in the other questions.
So my problem is: I have one list from which I want to remove certain values. For instance:
a = [[[0,0],[0,1]],[[0,0],[0,1]]]
for y in range(2):
    a[y][:] = [x for x in a[y] if not random.random() < s]
This removes the elements for which random.random() is below s (with s between 0 and 1). However, I only want this to happen if the second position of each element of the list (i.e., the second value in a pair like [0,0] or [0,1]) is equal to 1. I tried multiple solutions (suggested around here for other questions) and I can't get it to work. Does anyone have any suggestions?
Another condition could be added to check the value of the second "bit" of x (x[1] == 0):
a = [[[0,0],[0,1]],[[0,0],[0,1]]]
for y in range(2):
    a[y][:] = [x for x in a[y] if x[1] == 0 or random.random() >= 0.5]
This means that if x[1] == 0, then the pair is kept, regardless of a random value. Otherwise, it is kept only if random.random() >= 0.5.
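A quick way to see the effect (the seed and the longer sample list are made up for illustration):
import random

random.seed(1)   # fixed seed just so the run is reproducible
a = [[[0, 0], [0, 1], [1, 1], [2, 1]], [[0, 0], [3, 1], [4, 0]]]
for y in range(2):
    a[y][:] = [x for x in a[y] if x[1] == 0 or random.random() >= 0.5]
print(a)   # pairs ending in 0 always survive; pairs ending in 1 survive about half the time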
If I am given a list of integers/floats, how would I find the two closest numbers using sorting?
Such a method will do what you want:
>>> def minDistance(lst):
        lst = sorted(lst)
        index = -1
        distance = max(lst) - min(lst)
        for i in range(len(lst)-1):
            if lst[i+1] - lst[i] < distance:
                distance = lst[i+1] - lst[i]
                index = i
        for i in range(len(lst)-1):
            if lst[i+1] - lst[i] == distance:
                print lst[i], lst[i+1]
In the first for loop we find out the minimum distance, and in the second loop, we print all the pairs with this distance. Works as below:
>>> lst = (1,2,3,6,12,9,1.4,145,12,83,53,12,3.4,2,7.5)
>>> minDistance(lst)
2 2
12 12
12 12
>>>
There could be more than one possibility. Consider this list:
[0, 1, 20, 25, 30, 200, 201]
Here [0, 1] and [200, 201] are equally close.
Jose has a valid point. However, you could just consider these cases equal and not care about returning one or the other.
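If returning just one of the tied pairs is acceptable, the same sorting idea can be written more compactly (a sketch; closest_pair is a name introduced here):
def closest_pair(lst):
    # after sorting, the closest pair must be adjacent
    s = sorted(lst)
    return min(zip(s, s[1:]), key=lambda pair: pair[1] - pair[0])

print(closest_pair([0, 1, 20, 25, 30, 200, 201]))   # (0, 1); (200, 201) ties but only one pair is returned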
I don't think you need a sorting algorithm, per se, but maybe just a sort of 'champion' algorithm like this one:
import math
import sys

def smallestDistance(arr):
    championI = -1
    championJ = -1
    champDistance = sys.maxint
    i = 0
    while i < len(arr):
        j = i + 1
        while j < len(arr):
            if math.fabs(arr[i] - arr[j]) < champDistance:
                championI = i
                championJ = j
                champDistance = math.fabs(arr[i] - arr[j])
            j += 1
        i += 1
    r = [arr[championI], arr[championJ]]
    return r
This function will return a sub-array with the two values that are closest together. Note that this will only work given an array at least two elements long; otherwise it will raise an error.
I think the popular sorting algorithm known as bubble sort would do this quite well, though it runs in O(n^2) time in the worst case, if that kind of thing matters to you...
Here is a standard bubble sort that orders an array of integers by size.
def bubblesort(A):
    for i in range(len(A)):
        for k in range(len(A) - 1, i, -1):
            if A[k] < A[k - 1]:
                swap(A, k, k - 1)

def swap(A, x, y):
    tmp = A[x]
    A[x] = A[y]
    A[y] = tmp
You can just modify the algorithm slightly to fit your purposes if you insist on doing this using a sorting algorithm. However, I think the initial function works as well...
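For instance, a minimal sketch of that modification, reusing bubblesort from above (closest_after_sort is a name introduced here):
def closest_after_sort(A):
    # sort in place, then the two closest numbers must be neighbours
    bubblesort(A)
    best = (A[0], A[1])
    for i in range(len(A) - 1):
        if A[i + 1] - A[i] < best[1] - best[0]:
            best = (A[i], A[i + 1])
    return best

print(closest_after_sort([1, 2, 3, 6, 12, 9, 1.4, 145, 12, 83, 53, 12, 3.4, 2, 7.5]))   # (2, 2)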
Hope that helps.
I would like to try out the Mincemeat map/reduce Python application for matrix multiplication. I am using Python 2.7. I found several web pages that describe how to do matrix multiplication using Hadoop in Java, and I have been referring to this one http://importantfish.com/one-step-matrix-multiplication-with-hadoop/ both because it is simple and because the pseudocode that it displays is very close to Python code already.
I noticed in the Java code that is also included that the matrix dimensions are supplied to the map and reduce functions via an additional argument of type Context. Mincemeat doesn't provide such a thing, but I got a suggestion that I could provide these values to my map and reduce functions using closures. The map and reduce functions I wrote look like this:
def make_map_fn(num_rows_result, num_cols_result):
    m = num_rows_result
    p = num_cols_result
    def map_fn(key, value):
        # value is ('A', i, j, a_ij) or ('B', j, k, b_jk)
        if value[0] == 'A':
            i = value[1]
            j = value[2]
            a_ij = value[3]
            for k in xrange(1, p):
                yield ((i, k), ('A', j, a_ij))
        else:
            j = value[1]
            k = value[2]
            b_jk = value[3]
            for i in xrange(1, m):
                yield ((i, k), ('B', j, b_jk))
    return map_fn

def make_reduce_fn(inner_dim):
    n = inner_dim
    def reduce_fn(key, values):
        # key is (i, k)
        # values is a list of ('A', j, a_ij) and ('B', j, b_jk)
        hash_A = {j: a_ij for (x, j, a_ij) in values if x == 'A'}
        hash_B = {j: b_jk for (x, j, b_jk) in values if x == 'B'}
        result = 0
        for j in xrange(1, n):
            result += hash_A[j] * hash_B[j]
        return (key, result)
    return reduce_fn
Then I assign them to Mincemeat like this:
s = mincemeat.Server()
s.mapfn = make_map_fn(num_rows_A, num_cols_B)
s.reducefn = make_reduce_fn(num_cols_A)
When I run this in Mincemeat, I get this error message:
error: uncaptured python exception, closing channel <__main__.Client connected at 0x2ada4d0>
(<type 'exceptions.TypeError'>:arg 5 (closure) must be tuple
[/usr/lib/python2.7/asyncore.py|read|83]
[/usr/lib/python2.7/asyncore.py|handle_read_event|444]
[/usr/lib/python2.7/asynchat.py|handle_read|140]
[/usr/local/lib/python2.7/dist-packages/mincemeat.py|found_terminator|96]
[/usr/local/lib/python2.7/dist-packages/mincemeat.py|process_command|194]
[/usr/local/lib/python2.7/dist-packages/mincemeat.py|set_mapfn|159])
I searched around on the net with search terms like |python closure must be tuple|, and the things I found seemed to deal with cases where someone is trying to construct a function using lambda or function() and needs to make sure they don't omit certain things when defining them as closures. In my case, the map_fn and reduce_fn values returned by make_map_fn and make_reduce_fn look like valid function objects, and their func_closure values are tuples of cells containing the array dimensions that I want to supply, but something is still missing. What form do I need to pass these functions in to make them usable by Mincemeat?
I hate to be the bearer of bad news, but this is just the result of a few off-by-one errors in your code, plus two errors in the input file provided by the site you linked. It is unrelated to your usage of a closure, misleading error messages notwithstanding.
Off-by-one errors
Notice that the innermost loops in the pseudocode look like this:
for k = 1 to p:
for i = 1 to m:
for j = 1 to n:
In pseudocode, this typically indicates that the endpoint is included, i.e. for k = 1 to p means k = 1, 2, ..., p-1, p. On the other hand, the corresponding loops in your code look like this:
for k in xrange(1, p):
for i in xrange(1, m):
for j in xrange(1, n):
And of course, xrange(1, p) yields 1, 2, ..., p-2, p-1. Assuming you indexed the matrices from 0 (as they did on the site you linked), all your xranges should start at 0 (e.g. xrange(0, p)), as their equivalents in the Java code do (for (int k = 0; k < p; k++)). This fixes one of your problems.
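For reference, this is what the question's make_map_fn looks like with the 0-based ranges (make_reduce_fn changes analogously, with xrange(0, n)):
def make_map_fn(num_rows_result, num_cols_result):
    m = num_rows_result
    p = num_cols_result
    def map_fn(key, value):
        # value is ('A', i, j, a_ij) or ('B', j, k, b_jk)
        if value[0] == 'A':
            i, j, a_ij = value[1], value[2], value[3]
            for k in xrange(0, p):          # was xrange(1, p)
                yield ((i, k), ('A', j, a_ij))
        else:
            j, k, b_jk = value[1], value[2], value[3]
            for i in xrange(0, m):          # was xrange(1, m)
                yield ((i, k), ('B', j, b_jk))
    return map_fn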
Input file errors
In case you didn't catch this, the input file for A and B that the site provides is incorrect - they forgot the (0,0) entries of both matrices. In particular, you should add a line of the form A,0,0,0.0 at the beginning, and a line of the form B,0,0,0.0 between lines 9 and 10. (I guess where exactly you put them doesn't matter, but for consistency, you may as well put them where they naturally fit.)
Once I correct these two errors, mincemeat gives me the result we expect (formatted):
{(0, 1): ((0, 1), 100.0),
(1, 2): ((1, 2), 310.0),
(0, 0): ((0, 0), 90.0),
(0, 2): ((0, 2), 110.0),
(1, 0): ((1, 0), 240.0),
(1, 1): ((1, 1), 275.0)}
I haven't figured out exactly what's going on with the error message, but I think it boils down to the fact that the incorrect loop indices in the map function are resulting in garbage data being passed to the reduce nodes, which is why the error mentions the reduce function.
Basically, what happens is that hash_A and hash_B in the reduce function sometimes don't have the same keys, so when you try to multiply hash_A[j] * hash_B[j], you'll get a KeyError because j is not a key of one or the other, and this gets caught somewhere upstream and rethrown as a TypeError instead.
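A tiny illustration of that failure mode (made-up dictionaries, just to show the KeyError the reduce workers hit):
hash_A = {0: 1.0, 1: 2.0}
hash_B = {1: 3.0}        # key 0 is missing, as happens when the map step never emits it
try:
    result = sum(hash_A[j] * hash_B[j] for j in xrange(0, 2))
except KeyError as e:
    print "reduce would fail with KeyError:", e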
Let std::vector<int> counts be a vector of positive integers and let N := counts[0] + ... + counts[counts.size()-1] be the sum of the vector's components. Setting pi := counts[i]/N, I compute the entropy using the classic formula H = -(p0*log2(p0) + ... + pn*log2(pn)).
The counts vector is changing --- counts are incremented --- and every 200 changes I recompute the entropy. After a quick google and stackoverflow search I couldn't find any method for incremental entropy computation. So the question: Is there an incremental method, like the ones for variance, for entropy computation?
EDIT: Motivation for this question was usage of such formulas for incremental information gain estimation in VFDT-like learners.
Resolved: See this mathoverflow post.
I derived update formulas and algorithms for entropy and Gini index and made the note available on arXiv. (The working version of the note is available here.) Also see this mathoverflow answer.
For the sake of convenience I am including simple Python code, demonstrating the derived formulas:
from math import log
from random import randint

# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# update entropy H of a sample with sum S if a new example x comes in
def update(H, S, x):
    new_S = S + x
    return 1.0*H*S/new_S + h(1.0*x/new_S) + h(1.0*S/new_S)

# entropy of the union of two samples with entropies H1, H2 and sums S1, S2
# (named merge here so it does not shadow the update function above)
def merge(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

# compute entropy(L) using only the `update' function
def test(L):
    S = 0.0  # sum of the sample elements
    H = 0.0  # sample entropy
    for x in L:
        H = update(H, S, x)
        S = S + x
    return H

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])

# entry point
if __name__ == "__main__":
    L = [randint(1, 100) for k in range(100)]
    M = [randint(100, 1000) for k in range(100)]
    L_ent = entropy(L)
    L_sum = sum(L)
    M_ent = entropy(M)
    M_sum = sum(M)
    T = L + M
    print("Full = ", entropy(T))
    print("Update = ", merge(L_ent, L_sum, M_ent, M_sum))
You could re-compute the entropy from the updated counts, using a simple mathematical identity to simplify the entropy formula:
K = count.size();
N = count[0] + ... + count[K - 1];
H = -(count[0]/N * log2(count[0]/N) + ... + count[K - 1]/N * log2(count[K - 1]/N))
  = log2(N) - h/N
h = count[0] * log2(count[0]) + ... + count[K - 1] * log2(count[K - 1])
which holds because log2(a / b) == log2(a) - log2(b) and count[0] + ... + count[K - 1] == N.
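A quick numeric sanity check of this identity (a throwaway Python snippet with a made-up histogram):
from math import log

count = [3, 5, 2]                      # arbitrary example histogram
N = float(sum(count))
H_direct = -sum(c / N * log(c / N, 2) for c in count)
h = sum(c * log(c, 2) for c in count)
assert abs(H_direct - (log(N, 2) - h / N)) < 1e-12
print(H_direct)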
Now, given an old vector count of the observations so far and another vector of 200 new observations called batch, you can do the following in C++11:
#include <cmath>
#include <numeric>
#include <vector>

void update_H(double& H, std::vector<int>& count, int& N, std::vector<int> const& batch)
{
    N += batch.size();
    for (auto b : batch)
        ++count[b];
    // h = sum of count[i] * log2(count[i]) over all symbols; zero counts contribute nothing
    double h = std::accumulate(count.begin(), count.end(), 0.0,
                               [](double acc, int c) {
                                   return c > 0 ? acc + c * std::log2(double(c)) : acc;
                               });
    H = std::log2(double(N)) - h / N;   // H = log2(N) - h/N, as derived above
}
Here I assume that you have encoded your observations as int. If you have some kind of symbol, you would need a symbol table std::map<Symbol, int>, and do a lookup for each symbol in batch before you update the count.
This seems the quickest way of writing code for a general update. If you know that in every batch only a few counts actually change, you can do as #migdal suggests and keep track of the changing counts, subtract their old contribution to the entropy and add the new contribution.
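A minimal Python sketch of that bookkeeping (sparse_update, c_log2 and the state variables are names introduced here, not code from any answer):
from math import log

def c_log2(c):
    # contribution of one count to h = sum(count * log2(count)); zero counts contribute nothing
    return c * log(c, 2) if c > 0 else 0.0

def sparse_update(N, counts, h, changed_symbols):
    # counts is a dict symbol -> count; each symbol in changed_symbols gains one observation
    for sym in changed_symbols:
        old = counts.get(sym, 0)
        h -= c_log2(old)
        counts[sym] = old + 1
        h += c_log2(old + 1)
        N += 1
    H = log(N, 2) - h / float(N)   # entropy via the identity H = log2(N) - h/N
    return N, h, H

# usage: start from N = 0, counts = {}, h = 0.0 and feed in batches of observed symbols
N, counts, h = 0, {}, 0.0
N, h, H = sparse_update(N, counts, h, ['a', 'b', 'a', 'c'])
print(H)   # 1.5 for this toy batch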