Check if element is in documents of rdd - mapreduce

I have an RDD, rdd1, in PySpark like this (please excuse any minor syntax errors):
[(id1,(1,2,3)), (id2,(3,4,5))]
I have another RDD, rdd2, holding (2,3,4).
Now I want to see, for each element of rdd2, in which rdd1 sublists it occurs. For example, the expected output RDD (or collected list, I don't care) would be:
(2, [id1]),(3,[id1,id2]),(4,[id2])
This is what I have so far (note that rdd2 must be the first item in the line/algorithm):
rdd2.map(lambda x: (x, x in rdd.map(lambda y:y[1])))
Even though this would give me only True/False as the second item of each pair, I could live with that, but even this does not work. It fails when trying to perform a map on rdd1 inside the anonymous function of the rdd2 map.
Any idea how to get this going in the right direction?

If rdd2 is relatively small (fits in memory):
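# Invert rdd1 into (value, id) pairs; indexing replaces the tuple-unpacking lambda, which Python 3 rejects
pairs1 = rdd1.flatMap(lambda kv: ((v, kv[0]) for v in kv[1]))
# Broadcast the rdd2 values as a set for fast membership tests on the executors
vals_set = sc.broadcast(set(rdd2.collect()))
(pairs1
    .filter(lambda kv: kv[0] in vals_set.value)
    .groupByKey())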
If not, you can reuse pairs1 from the previous snippet and use a join:
pairs2 = rdd2.map(lambda x: (x, None))
(pairs2
    .leftOuterJoin(pairs1)
    # Each joined record is (k, (None, id)); keep only k and the id
    .map(lambda kv: (kv[0], kv[1][1]))
    .groupByKey())
As always, if this is only an intermediate structure, you should consider reduceByKey, aggregateByKey or combineByKey instead of groupByKey. If it is a final structure, you can add .mapValues(list).
Finally, you can try to use Spark DataFrames:
df1 = sqlContext.createDataFrame(
    rdd1.flatMap(lambda kv: ({'k': k, 'v': kv[0]} for k in kv[1])))
df2 = sqlContext.createDataFrame(rdd2.map(lambda k: {'k': k}))
(df1
    .join(df2, df1.k == df2.k, 'leftsemi')
    .rdd  # DataFrame.map is gone in Spark 2.x, so map over the underlying RDD
    .map(lambda r: (r.k, r.v)).groupByKey())
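The logic shared by the variants above (invert rdd1 into (value, id) pairs, keep only the values present in rdd2, group the ids by value) can be checked locally without a Spark cluster. What follows is a plain-Python sketch of that logic, not PySpark code; the helper name ids_by_value and the use of ordinary lists in place of RDDs are assumptions made for illustration.

```python
from collections import defaultdict

def ids_by_value(records, wanted):
    """Plain-Python stand-in for flatMap + filter + groupByKey (hypothetical helper)."""
    wanted = set(wanted)                   # plays the role of the broadcast set
    grouped = defaultdict(list)
    for rec_id, values in records:         # like flatMap: emit (value, id) pairs
        for v in values:
            if v in wanted:                # like filter: keep values found in rdd2
                grouped[v].append(rec_id)  # like groupByKey: collect ids per value
    return dict(grouped)

rdd1_data = [("id1", (1, 2, 3)), ("id2", (3, 4, 5))]
rdd2_data = (2, 3, 4)
result = ids_by_value(rdd1_data, rdd2_data)
print(sorted(result.items()))
# [(2, ['id1']), (3, ['id1', 'id2']), (4, ['id2'])]
```

This matches the expected output from the question, so it can serve as a reference when testing the Spark versions on small data.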

Related

python 3, comparing elements of two lists of lists

I'm trying to compare elements of 2 lists of lists in python. I want to create a new list (ph) which has a 1 if elements of lists from the 1st list of lists are in the elements of the 2nd list of lists.
However, this seems to compare the whole list and not individual elements. The code is below. Many thanks for the help! :)
import numpy as np
import pandas as pd
abc = [[1,800000,3],[4,5,6],[100000,7,8]]
l = [[
[i for i in range(0, 100000)],
[i for i in range(200000,300000)],
[i for i in range(400000,500000)],
[i for i in range(600000,700000)],
[i for i in range(800000,900000)],
[i for i in range(1000000,1100000)]
]]
ph = []
for i in abc:
    for j in l:
        if l[0] == abc[0]:
            ph.append(1)
        else:
            ph.append(0)
print(ph)
The goal of your problem is somewhat unclear to me. Correct me if I'm wrong, but what you want is: for each sublist of abc, get a boolean describing whether all its elements are anywhere in l. Is that it?
If it is indeed the case, here's my answer.
First of all, your second list is not a list of lists but a list of lists of lists. Hence, I removed a nested list in my code.
abc = [[1,800000,3],[4,5,6],[100000,7,8]]
L = [
[i for i in range(0, 100000)],
[i for i in range(200000,300000)],
[i for i in range(400000,500000)],
[i for i in range(600000,700000)],
[i for i in range(800000,900000)],
[i for i in range(1000000,1100000)]
]
flattened_L = sum(L, [])
print(
list(map(lambda sublist: all(x in flattened_L for x in sublist), abc))
)
# prints [True, True, False]
My code first flattens L so that it becomes easy to check whether any element is in it or not. Then, for each sublist in abc, it checks if all elements are in this flattened list.
Note: my code returns a list of booleans. If you absolutely need integer values (0 and 1), which you shouldn't, you can wrap int() around the all() call.
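Membership tests on a flattened list scan it element by element, which gets slow for lists of this size. A possible variation, assuming the elements are hashable, is to build a set once and test against that. The names lookup and result below are illustrative and not part of the answer above, and range objects stand in for the list comprehensions since they yield the same numbers.

```python
from itertools import chain

abc = [[1, 800000, 3], [4, 5, 6], [100000, 7, 8]]
L = [
    range(0, 100000),
    range(200000, 300000),
    range(400000, 500000),
    range(600000, 700000),
    range(800000, 900000),
    range(1000000, 1100000),
]

# Build the set once; each later membership test is then a hash lookup.
lookup = set(chain.from_iterable(L))

# One boolean per sublist of abc: True only if every element is in the set.
result = [all(x in lookup for x in sublist) for sublist in abc]
print(result)  # [True, True, False]
```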

SML adding indices to a list

Given a generic list, return a list containing the same objects in a tuple with their index in the list.
For example:
f ["a", "b"];
- val it = [(0,"a") , (1,"b")] : (int * string) list
The function should be a one-liner, meaning no pattern matching, recursion, if/else, helper functions and let/local. So far I could only make a list of indices given the input list:
fun f lst = List.take((foldl (fn (x,list) => [(hd(list)-1)]@list) [length(lst)] (lst)),length(lst));
f [#"a",#"b"];
- val it = [0, 1]: int List.list;
I should add the list items to these indices in a tuple, but I'm not sure how to do that.
Here is a hint for one way to solve it:
1) Using List.sub, create an anonymous function which sends an index i to the pair consisting of i and the lst element at index i.
2) Map this over the result obtained by calling List.tabulate on length lst and the function which sends x to x.
I was able to get this to work (on one line), but the result is ugly compared to a straightforward pattern-matching approach. Other than as a puzzle, I don't see the motivation for disallowing that which makes SML an elegant language.
It appears that I forgot the #i operator to access the i'th element of a tuple. The answer is the following:
fun f xs = List.take((foldr (fn (x,list) => [(#1(hd(list))-1,x)]@list) [(length(xs),hd(xs))] (xs)),length(xs));
f (explode "Hello");
- val it = [(0, #"H"), (1, #"e"), (2, #"l"), (3, #"l"), (4, #"o")]: (int * char) List.list;

How to use map() to convert (key,values) pair to values only in Pyspark

I have this code in PySpark to .
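wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# wordPairs was not defined in the original snippet; presumably each word paired with a count of 1
wordPairs = wordsRDD.map(lambda w: (w, 1))
wordCounts = wordPairs.reduceByKey(lambda x,y:x+y)
print(wordCounts.collect())
#PRINTS--> [('rat', 2), ('elephant', 1), ('cat', 2)]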
from operator import add
totalCount = (wordCounts
.map(<< FILL IN >>)
.reduce(<< FILL IN >>))
#SHOULD PRINT 5
#(wordCounts.values().sum()) // does the trick but I want to do this with map() and reduce()
I need to use a reduce() action to sum the counts in wordCounts and then divide by the number of unique words.
* But first I need to map() the pair RDD wordCounts, which consists of (key, value) pairs, to an RDD of values.
This is where I am stuck. I tried something like this below, but none of them work:
.map(lambda x:x.values())
.reduce(lambda x:sum(x)))
AND,
.map(lambda d:d[k] for k in d)
.reduce(lambda x:sum(x)))
Any help in this would be highly appreciated!
Finally I got the answer. It's like this:
(wordCounts
    .map(lambda x: x[1])
    .reduce(lambda x, y: x + y))
Yes, your lambda function in .map takes a tuple x as its argument and returns the second element via x[1] (index 1 of the tuple). In Python 2 you could also unpack the tuple in the parameter list and return the second element as follows (Python 3 removed tuple parameter unpacking, so this form is a syntax error there):
.map(lambda (x,y) : y)
Mr. Tompsett, I got this to work also:
from operator import add
x = (w
.map(lambda x: x[1])
.reduce(add))
As an alternative to map-reduce, you can also use aggregate, which should be even faster:
In [7]: x = sc.parallelize([('rat', 2), ('elephant', 1), ('cat', 2)])
In [8]: x.aggregate(0, lambda acc, value: acc + value[1], lambda acc1, acc2: acc1 + acc2)
Out[8]: 5
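To see what the two functions passed to aggregate do, here is a plain-Python sketch that mimics the call on hand-made partitions. It does not use Spark; local_aggregate and the partition layout are assumptions for illustration only. The first function folds each partition's records into an accumulator, and the second merges the per-partition accumulators.

```python
from functools import reduce

def local_aggregate(partitions, zero, seq_op, comb_op):
    """Hypothetical local imitation of RDD.aggregate over a list of partitions."""
    # Fold every partition separately, starting each from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge the per-partition results, again starting from the zero value.
    return reduce(comb_op, partials, zero)

# Assumed split of the three records across two partitions.
partitions = [[('rat', 2)], [('elephant', 1), ('cat', 2)]]
total = local_aggregate(
    partitions,
    0,
    lambda acc, value: acc + value[1],   # add one record's count to the accumulator
    lambda acc1, acc2: acc1 + acc2,      # merge two accumulators
)
print(total)  # 5
```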

Python 3.3 functions on pairs in a list

I am trying to create a program that will find the difference between all pairs in a list. For example
[2,4,6]
Would then make a list containing the difference
[2,2]
Is there a way to do this?
Itertools Recipes: pairwise
from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
def diffs(iterable):
    return [b - a for a, b in pairwise(iterable)]
print(diffs([2,4,6]))
[L[i+1] - L[i] for i in range(len(L)-1)] will do it.
Some other ways also using a list comprehension:
[L[i+1] - L[i] for i in range(len(L[:-1]))]
[L[i] - L[i-1] for i in range(1, len(L))]
Using map:
list(map(lambda i: L[i+1]-L[i], range(len(L[:-1]))))
list(map(lambda i: L[i]-L[i-1], range(1, len(L))))
Using map and the operator module:
import operator
list(map(operator.sub, L[1:], L[:-1]))
Using zip (this one is probably the nicest way, imo):
[x - y for x, y in zip(L[1:], L[:-1])]
A more verbose approach if you aren't familiar with list comprehensions or with map (GET FAMILIAR!):
def differences(L1,L2):
    L = []
    for V1,V2 in zip(L1,L2):
        L.append(V2-V1)
    return L
diffs = differences(L[:-1],L[1:])
And a similar, but much better way to do it using a generator:
def differences(L1,L2):
    for V1,V2 in zip(L1,L2):
        yield V2-V1
diffs = list(differences(L[:-1],L[1:]))
And here is the generator expression equivalent of the above generator (notice it's almost exactly the same as the last list comprehension above, except it uses the list function instead of brackets):
list(V2-V1 for V1,V2 in zip(L[:-1],L[1:]))
Study all of these ways of doing it very closely and you will learn a lot of Python.

Scala insert into list at specific locations

This is the problem that I did solve, however being a total imperative Scala noob, I feel I found something totally not elegant. Any ideas of improvement appreciated.
val l1 = 4 :: 1 :: 2 :: 3 :: 4 :: Nil // original list
val insert = List(88,99) // list I want to insert on certain places
// method that finds all indexes of a particular element in a particular list
def indexesOf(element:Any, inList:List[Any]) = {
var indexes = List[Int]()
for(i <- 0 until inList.length) {
if(inList(i) == element) indexes = indexes :+ i
}
indexes
}
var indexes = indexesOf(4, l1) // get indexes where 4 appears in the original list
println(indexes)
var result = List[Any]()
// iterate through indexes and insert in front
for(i <- 0 until indexes.length) {
var prev = if(i == 0) 0 else indexes(i-1)
result = result ::: l1.slice(prev, indexes(i)) ::: insert
}
result = result ::: l1.drop(indexes.last) // append the last bit from original list
println(result)
I was thinking a more elegant solution would be achievable with something like this, but that's just pure speculation.
var final:List[Any] = (0 /: indexes) {(final, i) => final ::: ins ::: l1.slice(i, indexes(i))
def insert[A](xs: List[A], extra: List[A])(p: A => Boolean) = {
xs.map(x => if (p(x)) extra ::: List(x) else List(x)).flatten
}
scala> insert(List(4,1,2,3,4),List(88,99)){_ == 4}
res3: List[Int] = List(88, 99, 4, 1, 2, 3, 88, 99, 4)
Edit: explanation added.
Our goal here is to insert a list (called extra) in front of selected elements in another list (here called xs--commonly used for lists, as if one thing is x then lots of them must be the plural xs). We want this to work on any type of list we might have, so we annotate it with the generic type [A].
Which elements are candidates for insertion? When writing the function, we don't know, so we provide a function that says true or false for each element (p: A => Boolean).
Now, for each element x in the list, we check whether we should make the insertion (i.e. is p(x) true?). If yes, we just build it: extra ::: List(x) is the elements of extra followed by the single item x. (It might be better to write this as extra :+ x, which adds the single item at the end.) If no, we have only the single item, but we make it List(x) instead of just x because we want everything to have the same type. So now, if we have something like
4 1 2 3 4
and our condition is that we insert 5 6 before 4, we generate
List(5 6 4) List(1) List(2) List(3) List(5 6 4)
This is exactly what we want, except we have a list of lists. To get rid of the inner lists and flatten everything into a single list, we just call flatten.
The flatten trick is cute; I wouldn't have thought of using map here myself. From my perspective this problem is a typical application for a fold, as you want to go through the list and "collect" something (the result list). As we don't want our result list backwards, foldRight (a.k.a. :\) is the right version here:
def insert[A](xs: List[A], extra: List[A])(p: A => Boolean) =
xs.foldRight(List[A]())((x,xs) => if (p(x)) extra ::: (x :: xs) else x :: xs)
Here's another possibility, using Seq#patch to handle the actual inserts. You need to foldRight so that later indices are handled first (inserts modify the indices of all elements after the insert, so it would be tricky otherwise).
def insert[A](xs: Seq[A], ys: Seq[A])(pred: A => Boolean) = {
val positions = xs.zipWithIndex filter(x => pred(x._1)) map(_._2)
positions.foldRight(xs) { (pos, xs) => xs patch (pos, ys, 0) }
}