How is shuffling done in MapReduce?

It seems pretty straightforward; the one thing I don't quite understand is how the shuffling is done. How can you create a basket for each word here?

The map's output key/value pairs (let's call them K,V) are partitioned based on a hash of the key.
All of the K,V pairs with the same hash(K) are sent to the same reducer. Within each reducer, the K,V pairs are sorted by key and grouped by key.
reduce then processes each key, together with all of its associated values, in turn.
N.B. In Hadoop (and possibly other M/R implementations), the partition, sorting and grouping functions can be user-defined.
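A minimal sketch of that partitioning step (framework-agnostic, written in C++ purely for illustration; the function and type names are mine, not from any MapReduce library): each (K,V) pair is routed to reducer hash(K) % num_reducers, and within a reducer a sorted map groups all values under their key.

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy shuffle: route each (key, value) pair to a reducer by hash(key) % num_reducers,
// then group values by key within each reducer (a sorted map does the sort + group).
std::vector<std::map<std::string, std::vector<int>>>
shuffle(const std::vector<std::pair<std::string, int>>& map_output, std::size_t num_reducers)
{
    std::vector<std::map<std::string, std::vector<int>>> reducers(num_reducers);
    std::hash<std::string> hash_fn;
    for (const auto& kv : map_output) {
        std::size_t r = hash_fn(kv.first) % num_reducers;   // partition
        reducers[r][kv.first].push_back(kv.second);         // sort and group by key
    }
    return reducers;   // reducers[r] maps each key to all of its values
}

reduce then walks each reducer's map in key order, seeing one key together with all of its values at a time; for word count, that per-key vector of values is exactly the "basket" for that word.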

Related

Find common value in two maps without iterating

I have these two maps, each storing 10,000+ entries:
std::map<std::string,ObjectA> mapA;
std::map<std::string,ObjectB> mapB;
I want to retrieve only those values from the maps whose keys are present in both maps.
For example, if key "10001" is found in both mapA and mapB, then I want the corresponding objects from both maps. Something like doing a join on SQL tables. The easiest way would be to iterate over the smaller map and call the other map's find(iter->first) in each iteration to collect the keys that qualify. That, too, would be very expensive.
Instead, I am considering maintaining a set like this:
std::set<std::string> common;
1) Every time I insert into one of the maps, I will check whether the key exists in the other map. If it does, I add the key to the above common set.
2) Every time I remove an entry from one of the maps, I will remove the key from the common set, if it exists.
The common set always maintains the keys that are in both maps. When I want to do the join, I already have the qualifying keys. Is there a faster/better way?
The algorithm is pretty simple. First, you treat the two maps as sequences (using iterators).
If either remaining sequence is empty, you're done.
If the keys at the front of the two sequences are the same, you have found a match: record it and advance both iterators.
If the keys differ, discard the lower (according to the maps' sorting order) of the two, i.e. advance only that iterator.
You'll be iterating over both maps once, for a complexity of O(n+m), which is significantly better than the O(n log m) or O(m log n) of the naive approach.
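A sketch of that merge in C++, using the types from the question (ObjectA and ObjectB are empty stand-ins here; both maps are assumed to use the default key ordering):

#include <map>
#include <string>
#include <utility>
#include <vector>

struct ObjectA {};   // stand-ins for the question's value types
struct ObjectB {};

// Collect (ObjectA, ObjectB) pairs for every key present in both maps.
std::vector<std::pair<ObjectA, ObjectB>>
join_maps(const std::map<std::string, ObjectA>& mapA,
          const std::map<std::string, ObjectB>& mapB)
{
    std::vector<std::pair<ObjectA, ObjectB>> out;
    auto a = mapA.begin();
    auto b = mapB.begin();
    while (a != mapA.end() && b != mapB.end()) {
        if (a->first < b->first) {
            ++a;                                         // discard the lower key
        } else if (b->first < a->first) {
            ++b;
        } else {
            out.emplace_back(a->second, b->second);      // keys match: record both values
            ++a;
            ++b;
        }
    }
    return out;
}

This is essentially std::set_intersection written out by hand, since set_intersection cannot hand you both mapped values at once.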

AppSync $util.autoId() and DynamoDB Partition and Sort Keys Design Questions

The limits for DynamoDB partition and sort keys are such that if I want to create a table with lots of users (e.g. the entire world population), then I can't just use a unique partition key to represent the personId; I need to use both the partition key and the sort key to represent a personId.
$util.autoId() in AppSync returns a 128-bit string. If I want to use this as the primary key in the DynamoDB table, then I need to split it into two strings, one being the partition key and the other being the sort key.
What is the best way to perform this split? Or if this is not the best way to approach the design, how should I design it instead?
Also, do the limits on partition and sort keys apply to secondary indexes as well?
Regarding $util.autoId(), since it's generated randomly, if I call it many times, is there a chance that it will generate two id's that are exactly the same?
I think I'm misunderstanding something in your question's premise, because to my brain, AppSync's $util.autoId() gives you back a 128-bit UUID. The point of UUIDs is that they're unique, so you can absolutely have one UUID per person in the world. And the UUID string (36 characters) fits comfortably within DynamoDB's 2048-byte partition key size limit, so there is no need to split it between the partition key and the sort key.
You also asked:
if I call it many times, is there a chance that it will generate two id's that are exactly the same?
It's extremely unlikely.
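For a rough sense of scale (assuming $util.autoId() produces a standard version 4 UUID, which carries 122 random bits): by the birthday approximation, the probability of at least one collision among n randomly generated IDs is about n^2 / 2^123. Even for n = 10^10, more than one ID per person on Earth, that works out to roughly 10^20 / 10^37, i.e. around 10^-17.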

Do all MapReduce implementations take a keys parameter as input to the reduce function?

Previously I asked what the use case for passing a list of keys to CouchDB's reduce function was; the answer (https://stackoverflow.com/a/46713674/3114742) mentions two potential use-cases:
From a design perspective you may want to work with keys emitted by a map function
You may be calculating something based on key input (such as how often a particular key appears)
Do all implementations of MapReduce take an array of keys as input to the reduce functions? CouchDB specifically keeps track of the original document that produces a key. i.e. the input to a CouchDB reduce function:
function(keys, values, rereduce) {...}
The keys arg looks like this: [[key1,id1], [key2,id2], [key3,id3]].
i.e. Couch keeps track of the entity that emitted the key as a result of the Map function, even in the reduce function. Do other MapReduce implementations keep track of this information? Or is this specific to CouchDB...
Not all MapReduce implementations have the same structure as CouchDB's.
For example, in MongoDB's mapReduce there is just a single key and a list of values, unlike CouchDB: all the values emitted by the map function for a given key are grouped and passed to the reduce function as one key and a list of values.
Example:
emit(1,10)
emit(1,20)
will be grouped to
reduce(1,[10,20])

Finding items to de-duplicate

I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.
The result I need is, for example:
X1 equals X3 and X6
X2 is unique
X4 equals X5
(Order of the lines, or order within a line, doesn't matter).
How can I implement that with pair-wise comparisons?
Here's what I have so far:
Compare all pairs (Xi, Xk) with i < k, and exploit transitivity: if I already found X1==X3 and X1==X6, I don't need to compare X3 and X6.
So I could use the following data structures:
map: index --> group
multimap: group --> indices
where group is arbitrarily assigned (e.g. "line number" in the output).
For a pair (Xi, Xk) with i < k :
if both i and k already have a group assigned, skip
if they compare equal:
if i already has a group assigned, put k in that group
otherwise, create a new group for i and put k in it
if they are not equal:
if i has no group assigned yet, assign a new group for i
same for k
That should work if I'm careful with the order of items, but I wonder if this is the best / least surprising way to solve this, as this problem seems to be somewhat common.
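A sketch of that bookkeeping in C++ (equal(i, k) is a hypothetical stand-in for the expensive comparison). It restructures things slightly: each new item is compared against one representative per existing group, which exploits the same transitivity without tracking any per-pair state, and sidesteps the ordering concerns.

#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Group items 0..n-1 into equivalence classes using only the expensive comparison.
// 'equal' must behave as a true equivalence relation for the transitivity shortcut to hold.
std::multimap<std::size_t, std::size_t>   // group id -> item indices
find_groups(std::size_t n, const std::function<bool(std::size_t, std::size_t)>& equal)
{
    std::vector<std::size_t> representative;           // one item index per group
    std::multimap<std::size_t, std::size_t> members;   // group id -> indices in that group
    for (std::size_t i = 0; i < n; ++i) {
        bool placed = false;
        for (std::size_t g = 0; g < representative.size(); ++g) {
            if (equal(representative[g], i)) {          // i belongs to group g
                members.emplace(g, i);
                placed = true;
                break;
            }
        }
        if (!placed) {                                  // i starts a new group
            representative.push_back(i);
            members.emplace(representative.size() - 1, i);
        }
    }
    return members;   // groups of size 1 are the unique items
}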
Background/More info: purpose is deduplicating storage of the items. They already have a hash, in case of a collision we want to guarantee a full comparison. The size of the data in question has a very sharp long tail distribution.
An iterative algorithm (find any two duplicates, share them, repeat until there are no duplicates left) might be easier, but we want non-modifying diagnostics.
Code base is C++, something that works with STL / boost containers or algorithms would be nice.
[edit] Regarding the hash: For the purpose of this question, please assume a weak hash function that cannot be replaced.
This is required for a one-time deduplication of existing data, and it needs to deal with hash collisions. The original choice was "fast hash, and compare on collision"; the hash chosen turns out to be a little bit weak, but changing it would break backward compatibility. Even then, I sleep better with the simple statement "in case of a collision, you won't get the wrong data" than with blogging about wolf attacks.
Here's another, maybe simpler, data structure for exploiting transitivity. Make a queue of the comparisons you need to do. For example, in the case of 4 items, it will be [(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)]. Also keep an array of the comparisons you've already done. Before each comparison, check whether that comparison has been done before, and every time you find a match, go through the queue and replace the matching item index with its lower-index equivalent.
For example, suppose we pop (1,2), compare, they're not equal, push (1,2) to the already_visited array and continue. Next, pop (1,3) and find that they are equal. At this point, go through the queue and replace all 3's with 1's. The queue becomes [(1,4), (2,1), (2,4), (1,4)], and so on. When we reach (2,1), it has already been visited, so we skip it, and the same goes for the second (1,4).
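A small C++ sketch of that queue, with equal(i, k) again standing in for the expensive comparison (all names are illustrative, not from any library). It is only meant for a modest number of items, since the queue holds all n*(n-1)/2 pairs.

#include <algorithm>
#include <deque>
#include <functional>
#include <set>
#include <utility>
#include <vector>

// Returns (kept index, duplicate index) pairs; indices that never appear are unique.
std::vector<std::pair<int, int>>
dedupe_by_queue(int n, const std::function<bool(int, int)>& equal)
{
    std::deque<std::pair<int, int>> pending;            // comparisons still to do
    for (int i = 0; i < n; ++i)
        for (int k = i + 1; k < n; ++k)
            pending.emplace_back(i, k);

    std::set<std::pair<int, int>> done;                 // comparisons already made
    std::vector<std::pair<int, int>> matches;

    while (!pending.empty()) {
        std::pair<int, int> p = pending.front();
        pending.pop_front();
        int i = std::min(p.first, p.second);
        int k = std::max(p.first, p.second);
        if (i == k || !done.insert({i, k}).second)
            continue;                                   // trivial pair or already compared
        if (equal(i, k)) {
            matches.emplace_back(i, k);                 // k is a duplicate of i
            for (auto& q : pending) {                   // rewrite pending pairs: k becomes i
                if (q.first == k)  q.first = i;
                if (q.second == k) q.second = i;
            }
        }
    }
    return matches;
}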
But I do agree with the previous answers. Since comparisons are computationally expensive, you probably want to compute a fast, reliable hash for every item first, and only then apply this method to the collisions.
So... you already have a hash? How about this:
sort and group on hash
print all groups with size 1 as unique
compare collisions
Tip for comparing collisions: why not just rehash them with a different algorithm? Rinse, repeat.
(I am assuming you are storing files/blobs/images here and have hashes of them and that you can slurp the hashes into memory, also, that the hashes are like sha1/md5 etc., so collisions are very unlikely)
(also, I'm assuming that two different hashing algorithms will not collide on different data, but this is probably safe to assume...)
Make a hash of each item. Make a list of pair<hash, item_index>. You can find the groups by sorting this list by hash, or by putting it into a std::multimap.
When you output the group list, you need to compare the items themselves to resolve hash collisions.
So for each item you do one hash calculation and roughly one comparison, plus the sort of the hash list.
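As a sketch of that pass (C++; the hashes are assumed to be precomputed, one per item, indexed the same way as the items):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Bucket item indices by their (already computed) weak hash, so the expensive
// comparison only ever runs within a bucket of colliding items.
std::vector<std::vector<std::size_t>>
bucket_by_hash(const std::vector<std::uint64_t>& hashes)
{
    std::vector<std::pair<std::uint64_t, std::size_t>> order;   // (hash, item index)
    order.reserve(hashes.size());
    for (std::size_t i = 0; i < hashes.size(); ++i)
        order.emplace_back(hashes[i], i);
    std::sort(order.begin(), order.end());                      // equal hashes become adjacent

    std::vector<std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (i == 0 || order[i].first != order[i - 1].first)
            buckets.emplace_back();                             // new hash value, new bucket
        buckets.back().push_back(order[i].second);
    }
    return buckets;   // size-1 buckets are unique; larger ones still need full comparisons
}

The (hash, index) list is tiny compared to the items themselves, so it fits in memory even when the data does not; size-1 buckets are reported as unique immediately, and only the larger buckets go through the pairwise grouping.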
I agree with the idea of using a second (hopefully improved) hash function so you can resolve some of your weak hash's collisions without needing to do costly pairwise comparisons. Since you say you are having memory limitation issues, hopefully you can fit the entire hash table (with secondary keys) in memory, where for each entry in the table you store a list of record indices for the records on disk that correspond to that key pair.

Then the question is whether, for each key pair, you can load all of its records into memory. If so, you can just iterate over key pairs: for each key pair, free any records in memory for the previous key pair, load the records for the current key pair, and then do comparisons among these records as you already outlined. If you have a key pair whose records don't all fit into memory, you'll have to load partial subsets, but you should still be able to keep in memory all the groups (with a unique record representative for each group) found so far for that key pair, since the number of unique records will be small if you have a good secondary hash.

How to get a clojure array-map to maintain insertion order after assoc?

I have an array-map into which I am assoc'ing some values. After a certain size, the returned value is a PersistentHashMap rather than the original PersistentArrayMap. I've read about this behavior on a few web sites. Is there any way to force the insertion order even after assoc?
I do have a separate function which takes a hash map and a vector of keys and returns a "fresh" array-map with the keys in that order, but it means that for each assoc I have to extract the keys first, cons/conj the new key onto the vector, then create a new array-map. Seems kludgey, even if written in a separate function.
Is there a more direct, language-supported way of keeping insertion order, even for a largish (>10 but <50 keys) array-map?
In case it's relevant, I'm using a list of array-maps as data into an incanter dataset and then outputting to excel. The save-xls function keeps the order of the keys/columns.
Thanks
You can use an ordered map from the flatland/ordered library (https://github.com/flatland/ordered); its ordered-map keeps insertion order across assoc, unlike the built-in array-map, which is silently promoted to a hash map once it grows past a handful of entries.