Mincemeat: supply additional parameters to map and reduce functions with closures - python-2.7

I would like to try out the Mincemeat map/reduce Python application for matrix multiplication. I am using Python 2.7. I found several web pages that describe how to do matrix multiplication using Hadoop in Java, and I have been referring to this one http://importantfish.com/one-step-matrix-multiplication-with-hadoop/ both because it is simple and because the pseudocode that it displays is very close to Python code already.
I noticed in the Java code that is also included that the matrix dimensions are supplied to the map and reduce functions via an additional argument of type Context. Mincemeat doesn't provide such a thing, but I got a suggestion that I could provide these values to my map and reduce functions using closures. The map and reduce functions I wrote look like this:
def make_map_fn(num_rows_result, num_cols_result):
    m = num_rows_result
    p = num_cols_result
    def map_fn(key, value):
        # value is ('A', i, j, a_ij) or ('B', j, k, b_jk)
        if value[0] == 'A':
            i = value[1]
            j = value[2]
            a_ij = value[3]
            for k in xrange(1, p):
                yield ((i, k), ('A', j, a_ij))
        else:
            j = value[1]
            k = value[2]
            b_jk = value[3]
            for i in xrange(1, m):
                yield ((i, k), ('B', j, b_jk))
    return map_fn
def make_reduce_fn(inner_dim):
    n = inner_dim
    def reduce_fn(key, values):
        # key is (i, k)
        # values is a list of ('A', j, a_ij) and ('B', j, b_jk)
        hash_A = {j: a_ij for (x, j, a_ij) in values if x == 'A'}
        hash_B = {j: b_jk for (x, j, b_jk) in values if x == 'B'}
        result = 0
        for j in xrange(1, n):
            result += hash_A[j] * hash_B[j]
        return (key, result)
    return reduce_fn
Then I assign them to Mincemeat like this:
s = mincemeat.Server()
s.mapfn = make_map_fn(num_rows_A, num_cols_B)
s.reducefn = make_reduce_fn(num_cols_A)
When I run this in Mincemeat, I get this error message:
error: uncaptured python exception, closing channel <__main__.Client connected at 0x2ada4d0>
(<type 'exceptions.TypeError'>:arg 5 (closure) must be tuple
[/usr/lib/python2.7/asyncore.py|read|83]
[/usr/lib/python2.7/asyncore.py|handle_read_event|444]
[/usr/lib/python2.7/asynchat.py|handle_read|140]
[/usr/local/lib/python2.7/dist-packages/mincemeat.py|found_terminator|96]
[/usr/local/lib/python2.7/dist-packages/mincemeat.py|process_command|194]
[/usr/local/lib/python2.7/dist-packages/mincemeat.py|set_mapfn|159])
I searched around on the net with search terms like |python closure must be tuple| and the things that I found seemed to be dealing with cases where someone is trying to construct a function using lambda or function() and need to make sure they didn't omit certain things when defining them as closures. In my case, the map_fn and reduce_fn values returned by make_map_fn and make_reduce_fn look like valid function objects, their func_closure values are tuples of cells containing the array dimensions that I want to supply, but something is still missing. What form do I need to pass these functions in to be usable by Mincemeat?

I hate to be the bearer of bad news, but this is just the result of a few off-by-one errors in your code, plus two errors in the input file provided by the site you linked. It is unrelated to your usage of a closure, misleading error messages notwithstanding.
Off-by-one errors
Notice that the innermost loops in the pseudocode look like this:
for k = 1 to p:
for i = 1 to m:
for j = 1 to n:
In pseudocode, this typically indicates that the endpoint is included, i.e. for k = 1 to p means k = 1, 2, ..., p-1, p. On the other hand, the corresponding loops in your code look like this:
for k in xrange(1, p):
for i in xrange(1, m):
for j in xrange(1, n):
And of course, xrange(1, p) yields 1, 2, ..., p-2, p-1. Assuming you indexed the matrices from 0 (as they did on the site you linked), all your xranges should start at 0 (e.g. xrange(0, p)), as their equivalents in the Java code do (for (int k = 0; k < p; k++)). This fixes one of your problems.
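As a rough illustration of the corrected loop bounds, here is a minimal sketch that runs the zero-based map and reduce functions without Mincemeat. The run_locally helper is a hypothetical stand-in for the server/client machinery (it is not part of Mincemeat), the 2x2 matrices are made-up test data, and range is used so the sketch also runs on Python 3 (xrange would be the Python 2.7 equivalent).

```python
from collections import defaultdict


def make_map_fn(num_rows_result, num_cols_result):
    m, p = num_rows_result, num_cols_result

    def map_fn(key, value):
        # value is ('A', i, j, a_ij) or ('B', j, k, b_jk), indexed from 0
        if value[0] == 'A':
            _, i, j, a_ij = value
            for k in range(p):          # 0 .. p-1, i.e. every result column
                yield (i, k), ('A', j, a_ij)
        else:
            _, j, k, b_jk = value
            for i in range(m):          # 0 .. m-1, i.e. every result row
                yield (i, k), ('B', j, b_jk)

    return map_fn


def make_reduce_fn(inner_dim):
    n = inner_dim

    def reduce_fn(key, values):
        hash_a = {j: v for (x, j, v) in values if x == 'A'}
        hash_b = {j: v for (x, j, v) in values if x == 'B'}
        result = sum(hash_a[j] * hash_b[j] for j in range(n))
        return (key, result)

    return reduce_fn


def run_locally(records, map_fn, reduce_fn):
    # Hypothetical helper: group the mapper output by key, then reduce each group.
    grouped = defaultdict(list)
    for key, value in enumerate(records):
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}


# Made-up test data: A and B are both 2x2.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
records = [('A', i, j, A[i][j]) for i in range(2) for j in range(2)]
records += [('B', j, k, B[j][k]) for j in range(2) for k in range(2)]

product = run_locally(records, make_map_fn(2, 2), make_reduce_fn(2))
print(product[(0, 0)])
```

This only checks the map/reduce logic itself; it does not exercise how Mincemeat ships the functions to its clients.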
Input file errors
In case you didn't catch this, the input file for A and B that the site provides is incorrect - they forgot the (0,0) entries of both matrices. In particular, you should add a line to the beginning of the form A,0,0,0.0, and a line between 9 and 10 of the form B,0,0,0.0. (I guess where exactly you put it doesn't matter, but for consistency, you may as well put them where they naturally fit.)
Once I correct these two errors, mincemeat gives me the result we expect (formatted):
{(0, 1): ((0, 1), 100.0),
 (1, 2): ((1, 2), 310.0),
 (0, 0): ((0, 0), 90.0),
 (0, 2): ((0, 2), 110.0),
 (1, 0): ((1, 0), 240.0),
 (1, 1): ((1, 1), 275.0)}
I haven't figured out exactly what's going on with the error message, but I think it boils down to the fact that the incorrect loop indices in the map function are resulting in garbage data being passed to the reduce nodes, which is why the error mentions the reduce function.
Basically, what happens is that hash_A and hash_B in the reduce function sometimes don't have the same keys, so when you try to multiply hash_A[j] * hash_B[j], you'll get a KeyError because j is not a key of one or the other, and this gets caught somewhere upstream and rethrown as a TypeError instead.

Related

Solving 1 equation based on input variable

For example, Volume of a rectangular box can be calculated as V = L * W * H
Suppose we know V, L, W, then we can solve for H.
Suppose we know V, H, W, then we can solve for L.
Suppose we know L, W, H, then we can solve for V.
And so on.
Is there a way in Python (I am currently trying SymPy) to solve for the unknown based on the input given?
Sure I can use cases of ifs, but I will need to write 4 equations to solve and that is for a short equation.
Any suggestion is appreciated.
Kind regards,
Iwan
This answer to a similar question may help you. Basically, you can define the general equation, get values for all but one variable, substitute them into the general equation, and then pass that expression to solve (or else just pass all equations to solve, which I show below):
from sympy import Eq, solve, Tuple, S
from sympy.abc import v,l,w,h
eq = Eq(v, l*w*h)
variables = eq.free_symbols
got = []
vs = ', '.join(([str(i) for i in variables]))
print('enter 3 known values of {%s} as equality, e.g. h=2' % vs)
for i in range(3):
    if 0: # change to 1 to get real input
        e = input()
    else:
        e = ['v=20','w=5','h=1'][i]
    got.append(Eq(*[S(j) for j in e.split('=')]))
# the one symbol that was not given a value
x = ({v,l,w,h} - Tuple(*got).free_symbols).pop()
ans = solve([eq]+got)
print('consistent values: %s' % ans)
print('%s = %s' % (x.name, ans[0][x]))
gives
enter 3 known values of {v, h, w, l} as equality, e.g. h=2
consistent values: [{v: 20, h: 1, w: 5, l: 4}]
l = 4
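The substitution route mentioned in the answer (plug the known values into the general equation, then solve for whatever is left) can be sketched as a small function. The name solve_missing and the capitalised symbols are assumptions made for this illustration; they are not part of SymPy or of the answer's code.

```python
from sympy import Eq, solve, symbols

V, L, W, H = symbols('V L W H', positive=True)
box_volume = Eq(V, L * W * H)


def solve_missing(equation, known):
    """Solve `equation` for the single symbol that has no value in `known`."""
    unknowns = equation.free_symbols - set(known)
    if len(unknowns) != 1:
        raise ValueError('expected exactly one unknown, got %s' % sorted(map(str, unknowns)))
    target = unknowns.pop()
    # Substitute the known values, then solve the remaining one-variable equation.
    return target, solve(equation.subs(known), target)


target, solutions = solve_missing(box_volume, {V: 20, W: 5, H: 1})
print('%s = %s' % (target, solutions[0]))
```

The same call works for any of the four quantities, so no per-variable branching is needed; only the dictionary of known values changes.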

Solving a matrix equation containing MatrixSymbols of symbolic size in Sympy?

As an introduction, I want to point out that if one has a matrix A consisting of 4 submatrices in a 2x2 pattern, where the diagonal blocks are square, then if we denote its inverse as X, the submatrix X22 = (A22-A21(A11^-1)A12)^-1, which is quite easy to show by hand.
I was trying to do the same for a matrix of 4x4 submatrices, but it's quite tedious by hand. So I thought Sympy would be of some help. But I cannot figure out how (I have started by just trying to reproduce the 2x2 result).
I've tried:
import sympy as s
def blockmatrix(name, sizes, names=None):
    if names is None:
        names = sizes
    ll = []
    for i, (s1, n1) in enumerate(zip(sizes, names)):
        l = []
        for j, (s2, n2) in enumerate(zip(sizes, names)):
            l.append(s.MatrixSymbol(name+str(n1)+str(n2), s1, s2))
        ll.append(l)
    return ll
def eyes(*sizes):
    ll = []
    for i, s1 in enumerate(sizes):
        l = []
        for j, s2 in enumerate(sizes):
            if i==j:
                l.append(s.Identity(s1))
                continue
            l.append(s.ZeroMatrix(s1, s2))
        ll.append(l)
    return ll
n1, n2 = s.symbols("n1, n2", integer=True, positive=True, nonzero=True)
M = s.Matrix(blockmatrix("m", (n1, n2)))
X = s.Matrix(blockmatrix("x", (n1, n2)))
I = s.Matrix(eyes(n1, n2))
s.solve(M*X[:, 1:]-I[:, 1:], X[:, 1:])
but it just returns an empty list instead of the result.
I have also tried:
Using M*X==I but that just returns False (boolean, not an Expression)
Entering a list of equations
Using 'ordinary' symbols with commutative=False instead of MatrixSymbols -- this gives an exception with GeneratorsError: non-commutative generators: (x12, x22)
but all without luck.
Can you show how to derive a result with Sympy similar to the one I gave as an example for X22?
The most similar other questions on solving with MatrixSymbols seem to have been solved by working around doing exactly that, by using an array of the inner symbols or some such instead. But since I am dealing with symbolically sized MatrixSymbols, that is not an option for me.
Is this what you mean by a matrix of 2x2 matrices?
>>> a = [MatrixSymbol(i,2,2) for i in symbols('a1:5')]
>>> A = Matrix(2,2,a)
>>> X = A.inv()
>>> print(X[1,1]) # [1,1] instead of [2,2] because indexing starts at 0
a1*(a1*a3 - a3*a1)**(-1)
[You indicated not and pointed out that the above is not correct -- that appears to be an issue that should be resolved.]
I am not sure why this isn't implemented, but we can do the solving manually as follows:
>>> from sympy import Add, Matrix, eye, solve, symbols
>>> n = 2
>>> v = symbols('b:%s'%n**2,commutative=False)
>>> A = Matrix(n,n,symbols('a:%s'%n**2,commutative=False))
>>> B = Matrix(n,n,v)
>>> eqs = list(A*B - eye(n))
>>> for i in range(n**2):
...     s = solve(eqs[i],v[i])[0]
...     eqs[i+1:] = [e.subs(v[i],s) for e in eqs[i+1:]]
...
>>> s # solution for v[3] which is B22
(-a2*a0**(-1)*a1 + a3)**(-1)
You can change n to 3 and see a modestly complicated expression. Change it to 4 and check the result by hand to give a new definition to the word "tedious" ;-)
The special structure of the equations to be solved can allow for a faster solution, too: the variable of interest is the last factor in each term containing it:
>>> for i in range(n**2):
...     c,d = eqs[i].expand().as_independent(v[i])
...     assert all(j.args[-1]==v[i] for j in Add.make_args(d))
...     s = 1/d.subs(v[i], 1)*-c
...     eqs[i+1:] = [e.subs(v[i], s) for e in eqs[i+1:]]
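The 2x2 block result quoted in the question, X22 = (A22-A21(A11^-1)A12)^-1, has the same form as the symbolic solution above, and it can be sanity-checked numerically. The following is only a sketch using NumPy on one made-up matrix (block sizes 3 and 2, entries chosen to be strictly diagonally dominant so that A and A11 are invertible); it is a spot check, not a derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 2
size = n1 + n2

# Off-diagonal entries lie in [-1, 1] and the diagonal is shifted by 10,
# so the matrix (and its leading block A11) is strictly diagonally dominant.
A = rng.uniform(-1.0, 1.0, size=(size, size)) + 10.0 * np.eye(size)

A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]

X22 = np.linalg.inv(A)[n1:, n1:]                 # lower-right block of the inverse
schur = A22 - A21 @ np.linalg.inv(A11) @ A12     # Schur complement of A11 in A

print(np.allclose(X22, np.linalg.inv(schur)))
```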

Compress dict sum statement with Python

In my python application I have a big list (now with almost 9000 indexes). I need to find the two most similar items in this list. So, what I have now is something like:
aux1 = 0
aux2 = 1
min_distance = 0xffff
weights = get_weights()
for i in range(0, len(_list)):
    for j in range(i + 1, len(_list)):
        obj1 = _list[i]
        obj2 = _list[j]
        dist = 0
        for key in self.__fields:
            dist += weights[key] * (obj1[key] - obj2[key]) ** 2
        if dist < min_distance:
            min_distance = dist
            aux1 = i
            aux2 = j
return aux1, aux2, min_distance
In the code, weights is a dict, obj1 and obj2 are both objects in which the __getitem__ is implemented and the return value also comes from a dict. And self.__fields is a list with the selected fields (it has now 9 items).
My problem is that this loop takes too long to complete. Even after 5 hours, the i variable is still within the first 100 list items.
With this next silly code, I come to the conclusion that the problem is not the size of the list (the silly code finishes with 5 minutes of difference).
count = 0
total = 9000
for i in range(0, total):
    for j in range(i + 1, total):
        for k in range(0, 10):
            count += 1
print("Count is " + str(count))
Therefore, the problem seems to be in the most internal loop of my code:
for key in self.__fields:
    dist += weights[key] * (obj1[key] - obj2[key]) ** 2
I know Python, but I'm not a Python specialist. I conclude that the access to the values of three objects through their key is a slow operation. Some time ago, I saw in some blog that list comprehensions and/or lambda operations can be faster.
So, my question is: how do I make this most internal loop faster using list comprehensions and/or lambda? Feel free to give any other advice if you want.
Not sure whether it's any faster, but you could rewrite that code using itertools.combinations and get the min using a key function calculating the "distance".
from itertools import combinations
weights = get_weights()
aux1, aux2 = min(combinations(_list, 2),
                 key=lambda pair: sum(weights[key] * (pair[0][key] - pair[1][key]) ** 2
                                      for key in self.__fields))
If this does not help, you might consider temporarily turning the dictionaries in _list into lists, holding just the values of the relevant fields. Instead of using dictionary lookup, you can then just zip those lists together with the weights. Afterwards, turn them back into dicts.
weights_list = [weights[f] for f in self.__fields]
as_lists = [[d[f] for f in self.__fields] for d in _list]
aux1, aux2 = min(combinations(as_lists, 2),
                 key=lambda pair: sum(w * (x - y) ** 2
                                      for w, x, y in zip(weights_list, *pair)))
aux1, aux2 = (dict(zip(self.__fields, x)) for x in (aux1, aux2))
This should be a bit faster, but it will only work if the dicts do not have any other fields than those in self.__fields, otherwise the dicts can not be reconstructed from the lists (at least not as easily). Alternatively, you might use tuples instead of lists and use another dictionary to map those tuples to the original dictionaries...
Or try this, using the indices of the elements instead of the elements themselves (not tested):
idx1, idx2 = min(combinations(range(len(_list)), 2),
                 key=lambda pair: sum(w * (x - y) ** 2
                                      for w, x, y in zip(weights_list, as_lists[pair[0]], as_lists[pair[1]])))
aux1, aux2 = _list[idx1], _list[idx2]
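Since the index-based variant above is marked as untested, here is a self-contained sketch of the same idea that can be run as-is. The function name closest_pair, the field names and the sample data are all made up for the illustration; plain dicts stand in for the objects that implement __getitem__.

```python
from itertools import combinations


def closest_pair(items, fields, weights):
    """Return (i, j, distance) for the two most similar items."""
    # Pull the relevant values out once, so the inner loop only zips lists.
    weights_list = [weights[f] for f in fields]
    as_lists = [[item[f] for f in fields] for item in items]

    def distance(pair):
        i, j = pair
        return sum(w * (x - y) ** 2
                   for w, x, y in zip(weights_list, as_lists[i], as_lists[j]))

    i, j = min(combinations(range(len(items)), 2), key=distance)
    return i, j, distance((i, j))


# Made-up sample data.
items = [
    {'a': 0.0, 'b': 0.0},
    {'a': 5.0, 'b': 5.0},
    {'a': 0.5, 'b': 0.0},
    {'a': 9.0, 'b': 1.0},
]
result = closest_pair(items, ['a', 'b'], {'a': 2.0, 'b': 1.0})
print(result)
```

Working with indices keeps the original objects untouched, so nothing has to be converted back afterwards. The number of pairs still grows quadratically with the list length, so this mainly trims the per-pair cost.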

Python: Solving equation system (coefficients are arrays)

I can solve a system of equations (using NumPy) like this:
>>> a = np.array([[3,1], [1,2]])
>>> b = np.array([9,8])
>>> y = np.linalg.solve(a, b)
>>> y
array([ 2., 3.])
But, if I got something like this:
>>> x = np.linspace(1,10)
>>> a = np.array([[3*x,1-x], [1/x,2]])
>>> b = np.array([x**2,8*x])
>>> y = np.linalg.solve(a, b)
it doesn't work. Here the matrix coefficients are arrays, and I want to calculate the solution array "y" for each element of the array "x". Also, I can't calculate
>>> det(a)
The question is: how can I do that?
Check out the docs page. If you want to solve multiple systems of linear equations you can send in multiple arrays but they have to have shape (N,M,M). That will be considered a stack of N MxM arrays. A quote from the docs page below,
Several of the linear algebra routines listed above are able to compute results for several matrices at once, if they are stacked into the same array. This is indicated in the documentation via input parameter specifications such as a : (..., M, M) array_like. This means that if for instance given an input array a.shape == (N, M, M), it is interpreted as a “stack” of N matrices, each of size M-by-M. Similar specification applies to return values, for instance the determinant has det : (...) and will in this case return an array of shape det(a).shape == (N,). This generalizes to linear algebra operations on higher-dimensional arrays: the last 1 or 2 dimensions of a multidimensional array are interpreted as vectors or matrices, as appropriate for each operation.
When I run your code I get,
>>> a.shape
(2, 2)
>>> b.shape
(2, 50)
Not sure exactly what problem you're trying to solve, but you need to rethink your inputs. You want a to have shape (N,M,M) and b to have shape (N,M). You will then get back an array of shape (N,M) (i.e. N solution vectors).
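One way to get the stacked layout for the system in the question might look like the sketch below. It assumes the intent is one 2x2 system per element of x. The constant coefficient 2 is expanded with np.full_like so that all four entries have the same length, and b is given an explicit trailing axis because, as far as I know, newer NumPy releases interpret a 2-D b as a stack of matrices rather than vectors.

```python
import numpy as np

x = np.linspace(1, 10)                      # 50 sample points

# Build the coefficients with the sample axis last, shape (2, 2, 50) ...
a = np.array([[3 * x, 1 - x],
              [1 / x, np.full_like(x, 2.0)]])
# ... then move that axis to the front: a stack of 50 matrices, shape (50, 2, 2).
a = np.moveaxis(a, -1, 0)

# Right-hand sides, one 2-vector per sample: shape (50, 2).
b = np.stack([x ** 2, 8 * x], axis=-1)

# Solve all 50 systems at once; the trailing axis makes each b a 2x1 column.
y = np.linalg.solve(a, b[..., None])[..., 0]

dets = np.linalg.det(a)                     # one determinant per sample
print(y.shape, dets.shape)
```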

How do I dereference in python? (Image Processing with openCV)

I've been looking all over the internet for a simple thinning algorithm and I stumbled across this: Thinning algorithm The problem is, I do not have too much experience with the dereference operator. Also, my project is in python which has a different way of handling this situation. So I have a few questions
1: What is this bit of code doing?
void myThinningInit (CvMat ** kpw, CvMat ** kpb)
{
    // Kernel for cvFilter2D
    // The algorithm kpw kernel binary image and it has become a matching white, black,
    // Convolution is divided into two sets of binary image was inverted kpb kernel, then take the AND
    for (int i = 0; i < 8; i++) {
        *(kpw + i) = cvCreateMat (3, 3, CV_8UC1);
        *(kpb + i) = cvCreateMat (3, 3, CV_8UC1);
        cvSet (*(kpw + i), cvRealScalar (0), NULL);
        cvSet (*(kpb + i), cvRealScalar (0), NULL);
    }.....
And 2: How can I translate this kernels creation into python?
He ends up making 8 kernels but I have no idea what their matrix form looks like.
I don't understand what "* (kpw + i)" or "* (kpb + i)" does in the grand scheme of the program.
3) Can I just make the kernels and store them in a list? If so, how could I do that?
UPDATE:
k = [1, 2, 3, 5, 6, 7, 8]
kpw = []
kpb = []
for i in k:
    kpw.append [i] = cv.CreateMat (3, 3, cv.CV_8UC1)
    kpb.append [i] = cv.CreateMat (3, 3, cv.CV_8UC1)
    cv.cvSet (kpw [i], cv.RealScalar (0), cv.NULL)
    cv.cvSet (kpb [i], cv.RealScalar (0), cv.NULL)
At first I just had kpw [i] and it was throwing an error. After a quick Google search I found that you need to add the element to the list first, and the way they did that was through append. I tried this bit of code in order to get 8 base kernels of 3x3 in size, but I received this error:
Traceback (most recent call last):
File "/home/krtzer/Documents/python_scripts/thinning.py", line 14, in
kpw.append [i] = cv.CreateMat (3, 3, cv.CV_8UC1)
TypeError: 'builtin_function_or_method' object does not support item assignment
Does this mean I cannot have matrices in lists?
That dereference is just creating a Matrix, without initialising its data. The data is manually set to zero by those lines like cvSet (* (kpw + i), cvRealScalar (0), NULL).
In python, you can just do the same thing in one hit with numpy.zeros and then use cv.fromarray. Alternatively, use x = cv.CreateMat(3, 3, cv.CV_8UC1) and then cv.set(x, 0.).
Edit - made a (pretty big) mistake in this answer, will explain
Looks like an array of CvMats in both kpw and kpb.
Suppose I made a list of arrays kpw = [] in Python.
The *(kpw + i) = ... is just like saying kpw[i] = ....
Looks like the other code is initialising the list of kernels to 3x3 matrices of 0, so you could do:
import numpy as np
# make two lists of 8 3x3 matrices of 0.
kpw = []
kpb = []
for i in xrange(8):
    kpw.append(np.zeros((3,3)))
    kpb.append(np.zeros((3,3)))
Note: I previously had:
kpw = [np.zeros((3,3))] * 8
kpb = [np.zeros((3,3))] * 8
which is wrong ! It produces 8 references to the same matrix within kpw, and so modifying kpw[0] will also modify all the other kpw[i]!
Then the cvSet2D(*(kpb+0), 0, 0, cvRealScalar(0)); can be translated to :
kpb[0][0,0] = 0
Because *(kpb+0) grabs the matrix in kpb[0], the 0,0 means element 0,0 of the matrix, and 0 is the value.
So: every time you see *(kpb+i) just substitute kpb[i] and you should be fine translating that code.
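The aliasing pitfall noted above (list multiplication producing 8 references to one matrix) is easy to see in a few lines. This is just a sketch with NumPy arrays standing in for the CvMat kernels; the uint8 dtype is an assumption chosen to mirror CV_8UC1.

```python
import numpy as np

# List multiplication repeats the SAME array object 8 times.
shared = [np.zeros((3, 3), dtype=np.uint8)] * 8
shared[0][0, 0] = 1
aliased = all(k[0, 0] == 1 for k in shared)      # every "kernel" changed

# A list comprehension calls np.zeros once per element, giving 8 separate arrays.
kernels = [np.zeros((3, 3), dtype=np.uint8) for _ in range(8)]
kernels[0][0, 0] = 1                             # like cvSet2D(*(kpb+0), 0, 0, ...)
independent = kernels[1][0, 0] == 0              # the other kernels are untouched

print(aliased, independent)
```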
I made a new one in python. Thinning(Python)