MapReduce input/output: what is emitted for each key-value pair?

I need a little clarity on what MapReduce passes in and what it emits as key-value pairs.
Here are my concerns about MapReduce input and output:
1. Does the map() method take a single key-value pair or a list of them, and what does it emit?
2. For each input key-value pair, what do mappers emit? The same type or a different type?
3. For each intermediate key, what will the reducer emit? Is there any restriction on type?
4. The reducer receives all values associated with the same key. How are those values ordered: sorted or arbitrarily? Does that order vary from run to run?
5. During the shuffle and sort phase, in which order are keys and values presented?

For each input (k1, v1), map emits zero or more (k2, v2) pairs.
For each k2, the reducer receives (k2, list(v2, v2, ...)); every value in the list is of type v2.
For each input (k2, list(v2)), the reducer can emit zero or more (k3, v3) pairs.
The values in step 2 are arbitrarily ordered.
All keys emitted by the mapper must be of one type and all values of one type; the same holds separately for the reducer's output.

Map method: receives (K1, V1) as input and returns (K2, V2). That is, the output key and value types can differ from the input key and value types.
Reduce method: after the output of the mappers has been shuffled correctly (the same key goes to the same reducer), the reducer's input is (K2, LIST(V2)) and its output is (K3, V3).
As a result of the shuffling process, the keys arrive at the reducer sorted by the key K2.
If you want the keys ordered in your own particular manner, you can implement the compareTo method of the intermediate key K2.
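To make the compareTo point concrete: in Hadoop the intermediate key type implements WritableComparable, and the shuffle sorts by its compareTo. A minimal plain-Java sketch (no Hadoop dependency, class names are illustrative) showing how a reversed compareTo changes the order in which keys would reach a reducer:

```java
import java.util.*;

// Hypothetical key type: in Hadoop this would implement WritableComparable;
// here plain Comparable is enough to show how compareTo controls key order.
class ScoreKey implements Comparable<ScoreKey> {
    final int score;
    ScoreKey(int score) { this.score = score; }
    // Reversed comparison: higher scores sort first (descending order).
    @Override public int compareTo(ScoreKey other) {
        return Integer.compare(other.score, this.score);
    }
    @Override public String toString() { return String.valueOf(score); }
}

public class CustomKeyOrder {
    public static void main(String[] args) {
        List<ScoreKey> keys = new ArrayList<>(List.of(
            new ScoreKey(40), new ScoreKey(95), new ScoreKey(70)));
        Collections.sort(keys); // the shuffle phase sorts keys via compareTo
        System.out.println(keys); // [95, 70, 40]
    }
}
```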
Referring your questions:
1. Answered above.
2. You can emit whatever you want as long as it consists of a key and a value.
For example, in WordCount you send the word as the key and 1 as the value.
3. In the WordCount example, the reducer will receive a word and a list of numbers.
It will then sum up the numbers and emit the word and its sum.
4. Answered above.
5. Answered above.
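The whole (K1,V1) → (K2,V2) → (K2, list(V2)) → (K3,V3) flow from the answers above can be simulated on one machine. A sketch of WordCount in plain Java (no Hadoop; method names and the TreeMap-as-shuffle are illustrative assumptions):

```java
import java.util.*;

public class WordCountFlow {
    // Map: for each input (offset, line) emit zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+"))
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // Shuffle: group all values by key; a TreeMap mimics the framework's
    // sort-by-key behaviour (keys sorted, value order arbitrary).
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: for each (word, list of counts) emit one (word, sum).
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            out.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        intermediate.addAll(map(0, "the cat sat"));
        intermediate.addAll(map(12, "the cat ran"));
        System.out.println(reduce(shuffle(intermediate))); // {cat=2, ran=1, sat=1, the=2}
    }
}
```

Note how the intermediate key/value types (String, Integer) differ from the input key type (long offset), and the reducer's output happens to reuse the intermediate types, which is allowed but not required.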


How to implement the division of two relations in mapreduce?

I want to implement the division of two relations in MapReduce. I have two relations: T(A,B,C) and U(B,C). I know the standard division formula for the relations R(A,B,D) and S(A,B), and this is pretty much my scenario. I am not sure how I would go about implementing it in MapReduce. With my limited knowledge I'm guessing there would be 3 MapReduce jobs. I would assume the first round might compute (π_A(T) × U) − T, i.e. project away B and C from T, cross with U, then subtract T.
Mapper 1: our input is either a tuple from T or a tuple from U.
If tuple t = (a,b,c) belongs to T, then we emit key NULL and value ("T", a).
If tuple t = (b,c) belongs to U, then we emit key NULL and value ("U", b, c).
Reducer 1: with these values we can perform the Cartesian product between the ("T", a) values and the ("U", b, c) values, and emit the new key NULL and value (a,b,c).
Reducer 2: we remove from the new Cartesian tuples any that are in the original table T, and emit the tuples that are not contained in the original table.
I am confused about what I would do next. Would it be another mapper, or could I use a reducer again for the next projection (projecting away B and C)? I'm not sure if I did the first round correctly. If anyone can tell me the steps, preferably in pseudo-code, that would help me understand. I cannot find any answers online.
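As a sanity check for the logic (not a distributed implementation), here is the classic division formula π_A(T) − π_A((π_A(T) × U) − T) computed in plain Java; each numbered step corresponds roughly to one of the MapReduce rounds the question is sketching. Class and method names are illustrative:

```java
import java.util.*;

public class RelationalDivision {
    // T ÷ U for T(A,B,C) and U(B,C), using the classic formula:
    //   pi_A(T) - pi_A((pi_A(T) x U) - T)
    static Set<String> divide(Set<List<String>> t, Set<List<String>> u) {
        // Step 1: pi_A(T), the candidate a-values.
        Set<String> projA = new TreeSet<>();
        for (List<String> row : t) projA.add(row.get(0));

        // Steps 2-3: (pi_A(T) x U) - T, then project the difference onto A.
        // Any a paired with some (b,c) from U that is missing in T fails.
        Set<String> disqualified = new TreeSet<>();
        for (String a : projA)
            for (List<String> bc : u)
                if (!t.contains(List.of(a, bc.get(0), bc.get(1))))
                    disqualified.add(a);

        // Step 4: pi_A(T) minus the disqualified a-values.
        projA.removeAll(disqualified);
        return projA;
    }

    public static void main(String[] args) {
        Set<List<String>> t = Set.of(
            List.of("a1", "b1", "c1"), List.of("a1", "b2", "c2"),
            List.of("a2", "b1", "c1"));
        Set<List<String>> u = Set.of(List.of("b1", "c1"), List.of("b2", "c2"));
        System.out.println(divide(t, u)); // [a1]
    }
}
```

In a MapReduce translation, steps 1 and 2 match the question's first job (Cartesian product under a single NULL key), the set difference is a second job joining candidates against T, and the final projection/difference is a third job.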

ExclusiveStartKey changes LastEvaluatedKey

I am trying to scan a large table, and I was hoping to do it in chunks by fetching only so many items at a time and saving the LastEvaluatedKey so I could use it as my ExclusiveStartKey when I resume the scan.
I have noticed that when I test on smaller tables, I may scan the entire table and get:
Key: A
Key: B
Key: C
Key: D
Key: E
Now, when I select key C as my ExclusiveStartKey, I would expect to get back D and E as I run through the rest of the table. However, I sometimes get different keys. Is this expectation correct?
Something that might be causing problems is that my keys do not all share an alphabetic prefix: some start with a U and some start with an N. If I am using an ExclusiveStartKey that starts with a U, am I ignoring any key that starts with an N? I know ExclusiveStartKey aims for things greater than its value.
DynamoDB keys have two parts: the hash key and the sort key. As the names suggest, while the sort-key part is sorted (for strings, alphabetically), the hash-key part is not sorted alphabetically. Instead, it is sorted by the value of a hash function, which means the order appears random but is consistent: if you scan the same table twice and it didn't change, you should get back the keys in the same seemingly random order. ExclusiveStartKey can be used to start in the middle of this order, but it shouldn't change the order.
In your example, if a Scan returned A, B, C, D, E in this order (note that, as I said, it usually will not be alphabetical order if you have hash keys!), then if you set ExclusiveStartKey to C you should definitely get D and E from the scan. I don't know how you saw something else; I suspect a mistake elsewhere.
You mentioned the possibility of the table changing in parallel and whether this has any effect on the result. Well, if according to the hash function a key X falls between C and D, and someone wrote to key X, it is indeed possible that your scan with ExclusiveStartKey=C would find X. However, since in your example A comes before C, a scan with ExclusiveStartKey=C can never return A: the scan looks for keys whose hash-function values are greater than C's, not for newly written data, so A doesn't match.
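The pagination behaviour described above can be illustrated without touching DynamoDB. A plain-Java sketch, where ordering keys by `String.hashCode()` stands in for DynamoDB's internal hash function (an assumption for illustration only, the real function differs):

```java
import java.util.*;

public class ScanOrderSketch {
    // Scan returns items in hash order of the partition key, not alphabetical
    // order. Resuming with exclusiveStartKey continues from the same position
    // in that fixed order, so pages never overlap.
    static List<String> scan(List<String> allKeys, String exclusiveStartKey, int limit) {
        List<String> ordered = new ArrayList<>(allKeys);
        ordered.sort(Comparator.comparingInt(String::hashCode)); // stand-in hash order
        List<String> page = new ArrayList<>();
        boolean started = (exclusiveStartKey == null);
        for (String k : ordered) {
            if (!started) { started = k.equals(exclusiveStartKey); continue; }
            page.add(k);
            if (page.size() == limit) break;
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("A", "B", "C", "D", "E");
        List<String> first = scan(keys, null, 3);
        // Resume from the last key of the first page, like LastEvaluatedKey.
        List<String> second = scan(keys, first.get(first.size() - 1), 3);
        System.out.println(first);
        System.out.println(second);
    }
}
```

Because the hash order is fixed for an unchanged table, the two pages partition the key set exactly, however random the order looks alphabetically.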

What is the difference between Partitioner phase and Shuffle&Sort phase in MapReduce?

As I understand it, between mapping and reducing there is combining (if applicable), followed by partitioning, followed by shuffling.
While it seems clear that partitioning and shuffle & sort are distinct phases in MapReduce, I cannot differentiate their roles.
Together they must take the key-value pairs from many mappers (or combiners) and send them to reducers, with all values sharing the same key going to the same reducer. But I don't know what each of the two phases does.
Partitioning is the sub-phase executed just before the shuffle-sort sub-phase. But why is partitioning needed?
Each reducer takes data from several different mappers. Hadoop must guarantee that all records for a given key (say, Ayush) from every mapper are sent to the same particular reducer, or the job will return an incorrect result. The process that decides which key is sent to which partition, and therefore to which reducer, is partitioning. The total number of partitions equals the total number of reducers.
Shuffling is the process of moving the intermediate data produced by the partitioner to the reducer nodes. During this phase, there are sorting and merging sub-phases:
Merging combines all key-value pairs which have the same key and returns (Key, List[Value]).
Sorting takes the output of the merging step and sorts all key-value pairs by key. This step also returns (Key, List[Value]) output, but with the pairs sorted by key.
The output of the shuffle-sort phase is sent directly to the reducers.
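The two phases can be separated cleanly in a plain-Java sketch. The `partition` method below mirrors Hadoop's default HashPartitioner formula; the rest (bucket lists, TreeMap grouping) is an illustrative stand-in for the framework's shuffle machinery:

```java
import java.util.*;

public class PartitionThenShuffle {
    // Partitioning: decide the target reducer for a key, as Hadoop's
    // default HashPartitioner does.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 2;
        // Intermediate (key, value) pairs coming from several mappers.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
            Map.entry("Ayush", 1), Map.entry("Bina", 1),
            Map.entry("Ayush", 1), Map.entry("Chen", 1));

        // Partition phase: route each pair to its reducer's bucket.
        List<List<Map.Entry<String, Integer>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) buckets.add(new ArrayList<>());
        for (Map.Entry<String, Integer> p : mapperOutput)
            buckets.get(partition(p.getKey(), numReducers)).add(p);

        // Shuffle-sort phase: within each reducer, merge values per key and
        // sort the keys; every "Ayush" record lands in the same reducer.
        for (int r = 0; r < numReducers; r++) {
            SortedMap<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> p : buckets.get(r))
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            System.out.println("reducer " + r + " -> " + grouped);
        }
    }
}
```

Partitioning answers "which reducer?", while shuffle-sort answers "in what grouped, sorted form does that reducer see its share?".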

How to override shuffle/sort in MapReduce, or how can I access the sorted list from the last element?

Assuming only one reducer.
My scenario is to get the list of the top N scorers in the university. The data is in (name, score) format. The MapReduce framework, by default, sorts the data in ascending key order. But I want the list in descending order, or at least, if I could access the sorted list from the end, my work would become very easy. Instead of sending a lot of data to the reducer, I could restrict the data to a limit.
(I want to override the predefined Shuffle/Sort)
Thanks & Regards
Ashwanth
I guess a combiner is what you want. Combiners run alongside the mappers and typically do what a reducer does, but on a single mapper's output data. Generally the combiner class is set to the same class as the reducer. In your case, you can sort and pick the top-K elements in each mapper and send only those out.
So instead of sending all your map output records, you will send at most K * (number of mappers) records to the reducer.
You can find example usage on http://wiki.apache.org/hadoop/WordCount.
Bonus - Check out http://blog.optimal.io/3-differences-between-a-mapreduce-combiner-and-reducer/ for major differences between a combiner and a reducer.
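The per-mapper top-K idea can be sketched in plain Java with a min-heap: each mapper (or combiner) keeps only its K best scores, and the single reducer merges the survivors. Class and method names are illustrative, and real combiners may run zero or more times, so the reducer must still apply the same top-K logic:

```java
import java.util.*;

public class TopKCombiner {
    // Combiner-style step: keep only the k highest scores seen so far,
    // using a min-heap so we can evict the smallest of the current top k.
    static List<Integer> topK(List<Integer> scores, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int s : scores) {
            heap.add(s);
            if (heap.size() > k) heap.poll(); // drop the smallest
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder()); // descending, as the question wants
        return result;
    }

    public static void main(String[] args) {
        // Each mapper sends only its local top 3 instead of everything;
        // the reducer then sees at most k * numMappers records.
        List<Integer> mapper1 = topK(List.of(55, 91, 72, 33, 88), 3);
        List<Integer> mapper2 = topK(List.of(64, 99, 12, 70), 3);
        List<Integer> all = new ArrayList<>(mapper1);
        all.addAll(mapper2);
        System.out.println(topK(all, 3)); // [99, 91, 88]
    }
}
```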

getting the partition number in a reducer to which a key-value pair belongs

When I am processing a given key and its set of values in the reducer function, how can I get the number of the partition that this key belongs to? Is it possible to get the partition number without attaching extra information about it to each key-value pair during partitioning?
Cheers
This has worked for me, in the reducer:
jobconf.getInt("mapred.task.partition", 0);
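Alternatively, if the job uses Hadoop's default HashPartitioner (an assumption; a custom partitioner would need its own formula), the partition number is a pure function of the key and the reducer count, so it can be recomputed anywhere:

```java
public class WhichPartition {
    // Hadoop's default HashPartitioner assigns a key to a partition with:
    //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // Given the key and the number of reducers, the partition number can be
    // recomputed without attaching it to each record.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Deterministic: the same key always maps to the same partition.
        System.out.println(partitionFor("Ayush", 4));
        System.out.println(partitionFor("Ayush", 4));
    }
}
```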