reduceByKey returns different value every time - mapreduce

I have some key-value data; let's call it x. Each record consists of a key and a (volume, weight) pair. It looks like this:
[('t1', (2, 0.8)),
('t1', (3, 0.1)),
('t1', (4, 0.3)),
('t2', (3, 0.8)),
('t2', (10, 0.3))]
I want to calculate the weighted volume for each of t1 and t2. That is, I calculate
2 * 0.8 + 3 * 0.1 + 4 * 0.3 for t1
3 * 0.8 + 10 * 0.3 for t2
I can do
x.map(lambda (x, (y, z)): (x, y*z)).reduceByKey(lambda x,y: x+y).collect()
I would get the correct number
[('t2', 5.4), ('t1', 3.1)]
My question is, if I use the original input x, and run a reduceByKey operation such as
x.reduceByKey(lambda (f1, w1), (f2, w2): (f1 * w1 + f2 * w2, w1 + w2)).collect()
I was hoping to get
[('t2', 5.4, 1.1), ('t1', 3.1, 1.2)]
However, I'm getting different results every time I run the reduceByKey operation:
[('t2', (5.4, 1.1)), ('t1', (3.38, 1.2000000000000002))]
[('t2', (5.4, 1.1)), ('t1', (2.2, 1.2000000000000002))]
[('t2', (5.4, 1.1)), ('t1', (2.91, 1.2))]
What am I misunderstanding about reduceByKey?

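Let's break it down.
t1 has three values: ('t1', (2, 0.8)), ('t1', (3, 0.1)) and ('t1', (4, 0.3)).
If reduceByKey happens to combine the first two values first, the output of that pass is
(2, 0.8), (3, 0.1) => (2*0.8 + 3*0.1, 0.8 + 0.1) == (1.9, 0.9)
In the next pass, the function is applied to that intermediate result:
(1.9, 0.9), (4, 0.3) => (1.9*0.9 + 4*0.3, 0.9 + 0.3) == (2.91, 1.2)
So the accumulation actually performed here is (2*0.8 + 3*0.1)*(0.8 + 0.1) + (4*0.3), instead of what you intended, which was (2*0.8 + 3*0.1 + 4*0.3).
The function passed to reduceByKey must be associative and commutative, because Spark does not guarantee the order in which values are combined. Yours is neither, so each ordering gives a different number: combining (3, 0.1) and (4, 0.3) first yields 2.2, and combining (2, 0.8) and (4, 0.3) first yields 3.38.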
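The order dependence can be reproduced without Spark. The following is a minimal plain-Python sketch, assuming that functools.reduce over each permutation stands in for the different orders in which the values might be combined; the names bad_combine and good_combine are made up for illustration.

```python
from functools import reduce
from itertools import permutations

# The three 't1' values from the question, as (volume, weight) pairs
values = [(2, 0.8), (3, 0.1), (4, 0.3)]

def bad_combine(a, b):
    # The question's function: it treats an already-reduced pair
    # as if it were still a raw (volume, weight) pair.
    (f1, w1), (f2, w2) = a, b
    return (f1 * w1 + f2 * w2, w1 + w2)

def good_combine(a, b):
    # Component-wise sum, which is associative and commutative.
    return (a[0] + b[0], a[1] + b[1])

# Try every ordering of the inputs and collect the distinct weighted sums.
bad_results = {round(reduce(bad_combine, p)[0], 2) for p in permutations(values)}

# Multiply volume by weight once, up front, then only ever add.
prepared = [(f * w, w) for f, w in values]
good_results = {round(reduce(good_combine, p)[0], 2) for p in permutations(prepared)}

print(bad_results)   # three different answers, depending on order
print(good_results)  # a single answer
```

In PySpark terms this would presumably correspond to something like x.mapValues(lambda v: (v[0] * v[1], v[1])).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])), so that the multiplication happens exactly once per record and the reduce step only adds.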

Related

How to filter a list of tuples with (Int, Int) values by a predicate applied to their respective first and second values?

Suppose I have a list of tuples representing positions on a grid by their x and y values.
The tuples are defined as type Pos, a pair of Integer values.
The Board is further defined as a list of Pos.
type Pos = (Int, Int)
type Board = [Pos]
exampleBoard :: Board
exampleBoard = [(1, 1), (1, 2), (2, 2), (2, 3), (1, 3),
(5, 1), (5, 2), (4, 2), (4, 1), (5, 3),
(9, 10), (9, 11), (9, 12), (9, 13),
(10, 10), (10, 11), (10, 12), (10, 13),
(10, 20), (11, 20), (12, 20), (13, 20)]
The Board is a square grid: you can always assume that the known variables height and width have the same value, height = width = size.
If the x or y value is not in a certain range (0<x<size || 0<y<size) I want the tuple removed.
Is there a simple way to filter the list as I describe? From searching for similar questions I have tried using library functions such as "break" and "span" to no success as of yet.
To test if a tuple (Int,Int) is in a given range, you can use the inRange function:
import Data.Ix (inRange)
inRange ((1,1),(10,10)) (5,5) -- True
inRange ((1,1),(10,10)) (11,6) -- False
filter (inRange ((1,1),(10,10))) [(5,5),(11,6)] -- [(5,5)]

How to calculate distances from coordinates stored in lists

So far I have managed to calculate the distances between a point P(x,y) and a multitude of points stored in a list l = [(x1,y1), (x2,y2), (x3,y3), ...]. Here is the code:
import math
import pprint
l = [(1,2), (2,3), (4,5)]
p = (3,3)
dists = [math.sqrt((p[0]-l0)**2 + (p[1]-l1)**2) for l0, l1 in l]
pprint.pprint(dists)
Output :
[2.23606797749979, 1.0, 2.23606797749979]
Now I want to calculate the distances from multiple points in a new list to the points in the list l.
I haven't found a solution yet, so does anyone have an idea how this could be done?
Here is a possible solution:
from math import sqrt
def distance(p1, p2):
    return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
lst1 = [(1,2), (2,3), (4,5)]
lst2 = [(6,7), (8,9), (10,11)]
for p1 in lst1:
    for p2 in lst2:
        d = distance(p1, p2)
        print(f'Distance between {p1} and {p2}: {d}')
Output:
Distance between (1, 2) and (6, 7): 7.0710678118654755
Distance between (1, 2) and (8, 9): 9.899494936611665
Distance between (1, 2) and (10, 11): 12.727922061357855
Distance between (2, 3) and (6, 7): 5.656854249492381
Distance between (2, 3) and (8, 9): 8.48528137423857
Distance between (2, 3) and (10, 11): 11.313708498984761
Distance between (4, 5) and (6, 7): 2.8284271247461903
Distance between (4, 5) and (8, 9): 5.656854249492381
Distance between (4, 5) and (10, 11): 8.48528137423857
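If the distances are needed as data rather than printed, one possible variant is a nested list comprehension. This is only a sketch, assuming Python 3.8 or later for math.dist; the name distances is made up for illustration.

```python
from math import dist  # available from Python 3.8

lst1 = [(1, 2), (2, 3), (4, 5)]
lst2 = [(6, 7), (8, 9), (10, 11)]

# distances[i][j] is the Euclidean distance from lst1[i] to lst2[j]
distances = [[dist(p1, p2) for p2 in lst2] for p1 in lst1]

for p1, row in zip(lst1, distances):
    print(p1, row)
```

Each row of distances lines up with one point of lst1, so distances[0][0] is the distance between (1, 2) and (6, 7).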

Python - Compare Tuples in a List

So in a program I am creating I have a list that contains tuples, and each tuple contains 3 numbers. For example...
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1),...]
Now I want to delete any tuple whose last two numbers are both less than another tuple's last two numbers.
The first number has to be the same for the tuple to be deleted.
So with the list of tuples above this would happen...
my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1),...]
# some code...
result = [(1, 2, 4), (2, 4, 1), (1, 5, 2)]
The first tuple is not deleted because (2 and 4) are not less than (4 and 1 -> 2 < 4 but 4 > 1), (1 and 5 -> 2 > 1), or (4 and 1 -> 2 < 4 but 4 > 1)
The second tuple is not deleted because its first number (2) is different from every other tuple's first number.
The third tuple is not deleted for the same reason the first tuple is not deleted.
The fourth tuple is deleted because (4 and 1) is less than (5 and 2 -> 4 < 5 and 1 < 2)
I really need help because I am stuck in my program and I have no idea what to do. I'm not asking for a solution, but just some guidance as to how to even begin solving this. Thank you so much!
I think this might actually work. I just figured it out. Is this the best solution?
results = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1)]
for position in results:
    for check in results:
        if position[0] == check[0] and position[1] < check[1] and position[2] < check[2]:
            results.remove(position)
Simple list comprehension to do this:
[i for i in my_list if not any([i[0]==j[0] and i[1]<j[1] and i[2]<j[2] for j in my_list])]
Your loop would work too, but be sure not to modify the list as you are iterating over it.
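my_list = [(1, 2, 4), (2, 4, 1), (1, 5, 2), (1, 4, 1)]
results = []
for position in my_list:
    for check in my_list:
        # Stop as soon as a tuple with the same first number beats this one on both other numbers
        if position[0] == check[0] and position[1] < check[1] and position[2] < check[2]:
            break
    else:
        # No such tuple was found, so keep this one
        results.append(position)
results
>[(1, 2, 4), (2, 4, 1), (1, 5, 2)]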
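The warning about modifying a list while iterating over it can be seen in a small, self-contained sketch. The data here is made up purely for illustration and is unrelated to the tuples above.

```python
# Removing during iteration: the list shrinks under the loop,
# so the element right after a removed one is skipped.
items = [1, 2, 2, 3]
for x in items:
    if x == 2:
        items.remove(x)
mutated = items
print(mutated)   # one of the 2s survives

# Building a new list leaves the iterated list untouched.
original = [1, 2, 2, 3]
filtered = [x for x in original if x != 2]
print(filtered)
```

The loop in the question happens to give the right answer for the sample data, but the same skipping effect could silently keep tuples that should have been removed on other inputs.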

Find Indexes of Non-NaN Values in Pandas DataFrame

I have a very large dataset (roughly 200000x400); however, after filtering it, only a few hundred values remain and the rest are NaN. I would like to create a list of indexes of those remaining values. I can't seem to find a simple enough solution.
     0    1     2
0  NaN  NaN   1.2
1  NaN  NaN   NaN
2  NaN  1.1   NaN
3  NaN  NaN   NaN
4  1.4  NaN  1.01
For instance, I would like a list of [(0,2), (2,1), (4,0), (4,2)].
Convert the DataFrame to its equivalent NumPy array and test it for NaNs with np.isnan. Negating that mask marks the non-null entries, and numpy.argwhere returns their (row, column) positions. Since the required output is a list of tuples, you can then map tuple over the rows of the resulting array.
>>> list(map(tuple, np.argwhere(~np.isnan(df.values))))
[(0, 2), (2, 1), (4, 0), (4, 2)]
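For a self-contained version, here is a minimal sketch that rebuilds the sample frame from the question. It assumes pandas and NumPy are installed and swaps in df.notna() for ~np.isnan(...), which also copes with non-float columns; the names mask, positions and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# The 5x3 sample frame from the question
df = pd.DataFrame({
    0: [np.nan, np.nan, np.nan, np.nan, 1.4],
    1: [np.nan, np.nan, 1.1, np.nan, np.nan],
    2: [1.2, np.nan, np.nan, np.nan, 1.01],
})

# Boolean array that is True wherever a value is present
mask = df.notna().to_numpy()

# argwhere yields (row position, column position) pairs in row-major order
positions = [(int(r), int(c)) for r, c in np.argwhere(mask)]

# Positions are not labels: translate them if the index or columns are not 0..n-1
labels = [(df.index[r], df.columns[c]) for r, c in positions]

print(positions)
```

Note that np.argwhere reports integer positions, so the translation step only matters when the frame has a non-default index or named columns.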
Assuming that your column names are of int dtype:
In [73]: df
Out[73]:
     0    1     2
0  NaN  NaN  1.20
1  NaN  NaN   NaN
2  NaN  1.1   NaN
3  NaN  NaN   NaN
4  1.4  NaN  1.01
In [74]: df.columns.dtype
Out[74]: dtype('int64')
In [75]: df.stack().reset_index().drop(0, 1).apply(tuple, axis=1).tolist()
Out[75]: [(0, 2), (2, 1), (4, 0), (4, 2)]
If your column names are of object dtype:
In [81]: df.columns.dtype
Out[81]: dtype('O')
In [83]: df.stack().reset_index().astype(int).drop(0,1).apply(tuple, axis=1).tolist()
Out[83]: [(0, 2), (2, 1), (4, 0), (4, 2)]
Timing for 50K rows DF:
In [89]: df = pd.concat([df] * 10**4, ignore_index=True)
In [90]: df.shape
Out[90]: (50000, 3)
In [91]: %timeit list(map(tuple, np.argwhere(~np.isnan(df.values))))
10 loops, best of 3: 144 ms per loop
In [92]: %timeit df.stack().reset_index().drop(0, 1).apply(tuple, axis=1).tolist()
1 loop, best of 3: 1.67 s per loop
Conclusion: Nickil Maveli's solution (the np.argwhere approach) is about 12 times faster for this test DF.

How do vector applications skew polygons?

I know how to move, rotate, and scale, but how does skewing work? What would I have to do to a set of vertices to skew them?
Thanks
Offset X values by an amount that varies linearly with the Y value (or vice versa).
Edit: Doing this with a rectangle:
Let's say you start with a rectangle (0, 0), (4, 0), (4, 4), (0, 4). Let's assume you want to skew it with a slope of 2, so as it goes two units up, it'll move one to the right, something like this (hand drawn, so the angle's undoubtedly a bit wrong, but I hope it gives the general idea):
[Figure: hand-drawn sketch of the skewed rectangle]
To get this, each X value is adjusted like:
X = X + Y * S
where S is the inverse of the slope of the skew. In this case, the slope is 2, so S = 1/2. Working that for our four corners, we get:
(0, 0) => 0 + 0 / 2 = 0 => (0, 0)
(4, 0) => 4 + 0 / 2 = 4 => (4, 0)
(4, 4) => 4 + 4 / 2 = 6 => (6, 4)
(0, 4) => 0 + 4 / 2 = 2 => (2, 4)
Skewing / shearing is described in detail at http://en.wikipedia.org/wiki/Shear_mapping and http://mathworld.wolfram.com/ShearMatrix.html
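The adjustment X = X + Y * S can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name skew_x is hypothetical and not taken from any particular graphics library.

```python
def skew_x(vertices, s):
    """Shear points horizontally: x' = x + y * s, with y left unchanged."""
    return [(x + y * s, y) for x, y in vertices]

# The rectangle from the worked example, skewed with slope 2 (so S = 1/2)
rectangle = [(0, 0), (4, 0), (4, 4), (0, 4)]
skewed = skew_x(rectangle, 0.5)

for before, after in zip(rectangle, skewed):
    print(before, "=>", after)
```

Skewing along the other axis would be the mirror image: leave x alone and add x * s to y.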