Suppose we wish to implement Locality-Sensitive Hashing (LSH) with MapReduce. Specifically, assume chunks of the signature matrix consist of columns, and elements are key-value pairs where the key is the column number and the value is the signature itself (i.e., a vector of values).
(a) Show how to produce the buckets for all the bands as output of a single
MapReduce process. Hint: Remember that a Map function can produce
several key-value pairs from a single element.
(b) Show how another MapReduce process can convert the output of (a) to
a list of pairs that need to be compared. Specifically, for each column i,
there should be a list of those columns j > i with which i needs to be
compared.
(a)
Map: takes an element (a column number and its signature) as input and produces key-value pairs (bucket_id, column), one per band; the bucket id should include the band number so that buckets of different bands stay separate.
Reduce: produces the buckets for all the bands as output, i.e.
(bucket_id, list(columns))
map(key, value):                          // key = column number, value = signature vector
    split value into b bands of r rows each
    for band_id, band in enumerate(bands):
        bucket_id = (band_id, hash(band)) // hash the whole band; band_id keeps bands apart
        collect(bucket_id, key)

reduce(bucket_id, columns):
    collect(bucket_id, columns)           // each bucket now holds the columns that hashed to it
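For concreteness, here is a minimal plain-Python simulation of this job; the band/row counts b and r, the example columns, and the use of a defaultdict to stand in for the shuffle are all assumptions for illustration, not part of the exercise:

from collections import defaultdict

b, r = 4, 2  # assumption: 4 bands of 2 rows each, i.e. signatures of length 8

def map_fn(column, signature):
    # emit one key-value pair per band: ((band_id, bucket), column)
    for band_id in range(b):
        band = tuple(signature[band_id * r:(band_id + 1) * r])
        yield (band_id, hash(band)), column

def lsh_buckets(columns):
    grouped = defaultdict(list)  # stands in for the MapReduce shuffle/group-by-key
    for col, sig in columns.items():
        for key, value in map_fn(col, sig):
            grouped[key].append(value)
    return grouped  # the reduce step here is just the identity

# usage: columns 0 and 1 have identical signatures, so they share a bucket in all 4 bands
cols = {0: [1, 2, 3, 4, 5, 6, 7, 8],
        1: [1, 2, 3, 4, 5, 6, 7, 8],
        2: [8, 7, 6, 5, 4, 3, 2, 1]}
for bucket, members in lsh_buckets(cols).items():
    print(bucket, members)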
(b)
Map: takes the output of (a) as input and produces the combinations of columns in the same bucket, i.e. for each (bucket_id, list(columns)) it emits a pair (i, j) for every two columns i < j chosen from the same bucket.
Reduce: outputs the pairs that need to be compared. Specifically, for each column i, it emits the deduplicated list of those columns j > i with which i needs to be compared.
map(bucket_id, columns):
    for (i, j) in combinations(sorted(columns), 2):  // guarantees i < j
        collect(i, j)

reduce(i, values):
    collect(i, deduplicate(values))  // the same pair can come from several bands
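A matching plain-Python simulation of this second job; the set-based deduplication and the made-up buckets are assumptions for illustration:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(buckets):
    pairs = defaultdict(set)  # set dedups pairs that co-occur in several bands
    for members in buckets.values():
        for i, j in combinations(sorted(set(members)), 2):
            pairs[i].add(j)  # i < j by construction
    return {i: sorted(js) for i, js in pairs.items()}

# usage with made-up buckets (keys are (band_id, bucket) as in part (a)):
buckets = {(0, 111): [0, 1], (1, 222): [0, 1, 2], (2, 333): [2]}
print(candidate_pairs(buckets))  # {0: [1, 2], 1: [2]}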
Related
I have two lists, one of which is a list of lists, and they have the same length (each pair in the first list corresponds to one value in the second), like this:
list1=[['47', '43'], ['299', '295'], ['47', '43'], etc.]
list2=[9.649, 9.612, 9.42, etc.]
I want to detect the repeated pairs of values in the first list (and delete them), and sum the values at the same indexes in the second list, creating an output like this:
list1=[['47', '43'], ['299', '295'], etc.]
list2=[19.069, 9.612, etc.]
The main problem is that the order of the values is important and I'm really stuck.
You could create a collections.defaultdict to sum values together, with the sublists as keys (converted to tuples to be hashable):
list1=[['47', '43'], ['299', '295'], ['47', '43']]
list2=[9.649, 9.612, 9.42]
import collections

c = collections.defaultdict(float)
for l, v in zip(list1, list2):
    c[tuple(l)] += v
print(c)
An alternative using collections.Counter, which sums the same way (a Counter is a dict whose missing keys default to 0, so += works here too):
c = collections.Counter()
for k, v in zip(list1, list2):
    c[tuple(k)] += v
At this point, we have the related data:
defaultdict(<class 'float'>, {('299', '295'): 9.612, ('47', '43'): 19.069})
Now, if needed (the dictionary may already hold the data well enough), we can rebuild the lists, keeping the relative order between them (but not their original order, which shouldn't be a problem since the pairs and sums are still linked):
list1 = []
list2 = []
for k, v in c.items():
    list1.append(list(k))
    list2.append(v)
print(list1, list2)
result:
[['299', '295'], ['47', '43']]
[9.612, 19.069]
I want to perform a transitive closure of two large key-value maps. For doing so I have two std::maps, each mapping an integer to a vector of integers:
std::map<unsigned, std::vector<unsigned>> mapIntVecOfInts1;
std::map<unsigned, std::vector<unsigned>> mapIntVecOfInts2;
"mapIntVecOfInts1" maps keys to another set of keys(VALUES). Some of the example values in it are of the following form:
0 -> (101, 102, 201)
1 -> (101, 102, 103, 203, 817, 1673)
2 -> (201, 829, 858, 1673)
"mapIntVecOfInts2" maps the VALUES present in "mapIntVecOfInts1" to another set of values. e.g.
101 -> (4002, 8293, 9000)
102 -> (4002, 8293, 10928)
103 -> (8293, 10928, 19283, 39201)
201 -> (8293)
203 -> (9393, 9830)
817 -> (19393, 19830)
1673-> (5372, 6830)
Now I want to map the keys present in "mapIntVecOfInts1" to the values present in "mapIntVecOfInts2" using the transitive mapping from "mapIntVecOfInts1" to "mapIntVecOfInts2". E.g. for keys "0" and "1" of mapIntVecOfInts1 I want to obtain:
0 -> 4002, 8293, 9000, 10928
1 -> 4002, 8293, 9000, 10928, 19283, 39201, 9393, 9830, 19393, 19830, 5372, 6830
"mapIntVecOfInts1" and "mapIntVecOfInts2" contain a billion elements (keys). vector within the two maps themselves contain million unsigned integers. I tried perform this transitive closure between the two maps by storing "mapIntVecOfInts1" and "mapIntVecOfInts2" in-memory. Using the following code:
std::vector<std::pair<unsigned, std::vector<unsigned>>> result;
for (std::map<unsigned, std::vector<unsigned>>::iterator i1 = mapIntVecOfInts1.begin(), l1 = mapIntVecOfInts1.end(); i1 != l1; ++i1)
{
    std::vector<unsigned> vec1;
    for (std::vector<unsigned>::iterator i2 = (*i1).second.begin(), l2 = (*i1).second.end(); i2 != l2; ++i2)
        vec1.insert(vec1.begin(), mapIntVecOfInts2[*i2].begin(), mapIntVecOfInts2[*i2].end());
    result.push_back(std::make_pair((*i1).first, vec1));
}
However, performing the transitive closure this way is taking a lot of time. Is there some way I can speed this up?
One can say that your suggested code does two things:
maps the second relation to the entry of the first
builds up the new relation from the results of said mapping
The resulting map will have the exact same key set as the first relation, so you can (kind of) avoid the whole red-black tree building process by just copying the whole mapIntVecOfInts1 first and then modifying the values of the copy instead of adding vectors one by one.
Of course that will not fix the major bottleneck, which is the access speed of your second relation (mapIntVecOfInts2). You can try to reduce it to amortized O(1) with a hash table (std::unordered_map), or even a vector if your billion keys are not too sparse.
Also, as #SpectralSequence said, your code does not preserve uniqueness in the value vectors; perhaps you want to do something about that.
At the very least, you should insert at the end of the vector in the inner loop, since inserting at the beginning requires copying the elements already in the vector.
vec1.insert(vec1.end(), mapIntVecOfInts2[*i2].begin(), mapIntVecOfInts2[*i2].end());
Also, if you don't want duplicate values then consider using a set.
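To illustrate the combined advice (hash-based lookup for the second relation plus deduplicated value sets), here is a short Python sketch of the same composition; Python's dict plays the role of std::unordered_map and set removes the duplicates. This is a sketch of the idea, not a drop-in replacement for the C++ code:

rel1 = {0: [101, 102, 201], 1: [101, 102, 103, 203, 817, 1673]}
rel2 = {101: [4002, 8293, 9000], 102: [4002, 8293, 10928],
        103: [8293, 10928, 19283, 39201], 201: [8293],
        203: [9393, 9830], 817: [19393, 19830], 1673: [5372, 6830]}

def compose(rel1, rel2):
    result = {}
    for key, mids in rel1.items():
        out = set()  # dedups values, addressing the uniqueness remark above
        for mid in mids:
            out.update(rel2.get(mid, ()))  # hash lookup, amortized O(1)
        result[key] = sorted(out)
    return result

print(compose(rel1, rel2))
# {0: [4002, 8293, 9000, 10928], 1: [4002, 5372, 6830, 8293, ...]}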
I want to write a program which will read in a list of tuples, where each tuple contains two elements. The first element can be an Object, and the second element will be the quantity of that Object. Just like: Mylist([{Object1, Numbers}, {Object2, Numbers}]).
Then I want to read each Number and print the related Object that many times, and store the copies in a list.
So if Mylist([{lol, 3},{lmao, 2}]), then I should get [lol, lol, lol, lmao, lmao] as the final result.
My thought is to first unzip those tuples (imagine there could be more than 2) into two lists, the first containing the Objects and the second containing the quantity numbers.
After that, read the numbers in the second list and print the related Object from the first list the exact number of times. But I don't know how to do this. Thanks for any help!
A list comprehension can do that:
lists:flatten([lists:duplicate(N,A) || {A, N} <- L]).
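With L = [{lol, 3}, {lmao, 2}] this yields [lol, lol, lol, lmao, lmao], the desired result.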
If you really want printing too, use recursion:
p([]) -> [];
p([{A,N}|T]) ->
    FmtString = string:join(lists:duplicate(N,"~p"), " ")++"\n",
    D = lists:duplicate(N,A),
    io:format(FmtString, D),
    D++p(T).
This code creates a format string for io:format/2 using lists:duplicate/2 to replicate the "~p" format specifier N times, joins them with a space with string:join/2, and adds a newline. It then uses lists:duplicate/2 again to get a list of N copies of A, prints those N items using the format string, and then combines the list with the result of a recursive call to create the function result.
I am currently working with Apache Spark,
but I cannot understand how reduce works after map.
My example is pretty simple:
val map = readme.map(line => line.split(" ").size)
I know this will return an array of the number of words per line, but where is the key/value here to pass to a reduce function?
map.reduce((a,b) => {if(a>b) a else b})
How does the reduce phase work? Is (a, b) a Tuple2, or is it the key/value from the map function?
Once you have
val map = readme.map(line => line.split(" ").size)
Each element of the RDD consists of a single number, the number of words in a line of the file.
You could count all the words in your dataset with map.sum() or map.reduce( (a,b) => a+b ), which are equivalent.
The code you have posted:
map.reduce((a,b) => {if(a>b) a else b})
would find the maximum number of words per line for your entire dataset.
The RDD.reduce method works by repeatedly combining two elements, which at first are taken from pairs of RDD rows, into another element of the same type, in this case a number. The aggregation function should be written so it can be nested and applied to the rows in any order. For example, subtraction will not yield useful results as a reduce function because you cannot predict ahead of time in what order results will be subtracted from one another. Addition or maximization, however, works correctly no matter the order.
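The order-sensitivity point can be demonstrated without Spark using plain Python's functools.reduce (Spark's RDD.reduce additionally merges partial results across partitions, which is exactly why the function must be associative and commutative):

from functools import reduce

counts = [3, 5, 2, 8]  # e.g. words per line

# maximum is associative and commutative: any grouping or order yields 8
print(reduce(lambda a, b: a if a > b else b, counts))  # 8

# subtraction is neither, so the result depends on the order of combination
print(reduce(lambda a, b: a - b, counts))            # ((3-5)-2)-8 = -12
print(reduce(lambda a, b: a - b, reversed(counts)))  # ((8-2)-5)-3 = -2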
I have the following two dictionaries,
d1={"aa":[1,2,3],"bb":[4,5,6],"cc":[7,8,9]}
d2={"aa":[1,2,3],"bb":[1,1,1,1,1,1],"cc":[7,8]}
How could I compare these two dictionaries and get the positions (indexes) of UNMATCHED key-value pairs? Since I am dealing with files around 2 GB in size, the dictionaries contain very large data. How can this be implemented in an optimized way?
def getUniqueEntry(dictionary1, dictionary2, listOfKeys):
    assert sorted(dictionary1.keys()) == sorted(dictionary2.keys()), "Keys don't match"  # check that they have the same keys
    for key in dictionary1:
        if dictionary1[key] != dictionary2[key]:
            listOfKeys.append(key)
When calling the function, the third parameter listOfKeys is an empty list where you want the keys to be stored. Note that reading 2 GB worth of data into a dict requires a lot of RAM and will most likely fail.
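For example, with the dictionaries from the question (hypothetical driver code):

d1 = {"aa": [1, 2, 3], "bb": [4, 5, 6], "cc": [7, 8, 9]}
d2 = {"aa": [1, 2, 3], "bb": [1, 1, 1, 1, 1, 1], "cc": [7, 8]}
mismatched = []
getUniqueEntry(d1, d2, mismatched)
print(mismatched)  # ['bb', 'cc']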
And this is a more Pythonic way: the list comprehension keeps just the keys whose values are not equal in the two dictionaries:
different_keys = [key for key in d1 if d1[key] != d2[key]]
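With the example dictionaries above this yields ['bb', 'cc']. Note that the comprehension assumes both dictionaries have the same keys: a key present only in d1 raises a KeyError when d2 is indexed, and a key present only in d2 is silently skipped.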