MapReduce Programming
Suppose that you have a large student file which cannot be stored on a single machine.
Each record of this file contains the fields (Student ID, Student Name, Sex, Age,
Module, Grade, Department).
Please design a MapReduce algorithm (pseudo-code or
Java code) to output the average grade for each module. The algorithm is
expected to be as efficient as possible.
Describe the algorithm designed. You should explain
how the input is mapped into (key, value) pairs by the map stage, i.e., specify
what the key is and what the associated value is in each pair and, if needed,
how the key(s) and value(s) are computed. Then you should explain how the
output (key, value) pairs of the map stage are processed by the reduce stage to
produce the average grade for each module.
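One possible design, sketched below in Java. This is a rough sketch only: the comma-separated record layout, the class names and the field positions are assumptions, not given in the question. The map stage emits the module as the key and a partial "gradeSum,count" pair as the value so that a combiner can pre-aggregate on each mapper node (this is what keeps the job efficient); the reduce stage sums the partial sums and counts and divides to obtain the per-module average.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ModuleAverage {

    // Map: key = module, value = "gradeSum,count" so a combiner can pre-aggregate.
    // Assumes one comma-separated record per line in the order
    // (StudentID, StudentName, Sex, Age, Module, Grade, Department).
    public static class AvgMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            String module = f[4].trim();
            String grade  = f[5].trim();
            ctx.write(new Text(module), new Text(grade + ",1"));
        }
    }

    // Combiner: sums partial grade sums and counts on the mapper side.
    public static class SumCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text module, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0; long count = 0;
            for (Text v : values) {
                String[] p = v.toString().split(",");
                sum   += Double.parseDouble(p[0]);
                count += Long.parseLong(p[1]);
            }
            ctx.write(module, new Text(sum + "," + count));
        }
    }

    // Reduce: divide the global sum by the global count to get the module average.
    public static class AvgReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text module, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0; long count = 0;
            for (Text v : values) {
                String[] p = v.toString().split(",");
                sum   += Double.parseDouble(p[0]);
                count += Long.parseLong(p[1]);
            }
            ctx.write(module, new Text(String.valueOf(sum / count)));
        }
    }
}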
Related
As I understand it, between mapping and reducing there is Combining (if applicable) followed by partitioning followed by shuffling.
While it seems clear that partitioning and shuffle & sort are distinct phases in map/reduce, I cannot differentiate their roles.
Together they must take the key/value pairs from many mappers (or combiners) and send them to reducers, with all values sharing the same key being sent to the same reducer. But I don't know what each of the two phases does.
Partitioning is the sub-phase executed just before the shuffle-sort sub-phase. But why is partitioning needed?
Each reducer takes data from several different mappers. Look at this picture (found it here):
Hadoop must know that all "Ayush" records from every mapper must be sent to that particular reducer (otherwise the job will return an incorrect result). The process of deciding which key is sent to which partition, and hence to which particular reducer, is the partitioning process. The total number of partitions is equal to the total number of reducers.
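For concreteness, here is a minimal sketch of that decision in code. It mirrors what Hadoop's default HashPartitioner does; the class name and the value type are just illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Every occurrence of the same key (e.g. "Ayush") hashes to the same
        // partition number, so all of its records end up at the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}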
Shuffling is the process of moving the intermediate data provided by the partitioner to the reducer node. During this phase there are sorting and merging sub-phases:
Merging - combines all key-value pairs which have the same key and returns (Key, List[Value]).
Sorting - takes the output of the merging step and sorts all key-value pairs by key. This step also returns (Key, List[Value]) output, but with the pairs sorted by key.
The output of the shuffle-sort phase is sent directly to the reducers.
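That grouped (Key, List[Value]) view is exactly what a reducer receives; a minimal sketch (the word-count-style types are only for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// After shuffle-sort, reduce() is invoked once per key, with all of that key's
// values gathered into a single Iterable, i.e. (Key, List[Value]).
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        ctx.write(key, new IntWritable(sum));
    }
}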
I am considering taking advantage of sparse indexes as described in the AWS guidelines. In the example described --
... in the GameScores table, certain players might have earned a particular achievement for a game - such as "Champ" - but most players have not. Rather than scanning the entire GameScores table for Champs, you could create a global secondary index with a partition key of Champ and a sort key of UserId.
My question is: what happens when the number of champs becomes very large? I suppose that the "Champ" partition will become very large and you would start to experience uneven load distribution. In order to get uniform load distribution, would I need to randomize the "Champ" value by (effectively) sharding over n shards, e.g. Champ.0, Champ.1 ... Champ.99?
Alternatively, is there a different access pattern that can be used when fetching entities with a specific attribute that may grow large over time?
This is exactly the solution you need (Champ.0, Champ.1 ... Champ.N).
N should be [the expected number of partitions for this index + some growth gap] (if you expect a high load, or many 'champs', you can choose N=200), which gives a good hash distribution over partitions. I recommend making the suffix a modulo over userId (this can help you do some manipulations by userId later).
We also use this solution when the hash key is Boolean (in DynamoDB you can represent a Boolean as a string), in which case the hash keys become "true.0", "true.1" ... "true.N", and the same for "false".
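A minimal sketch of how the sharded key could be computed. The helper name is hypothetical and not part of any AWS SDK; only the "suffix = hash(userId) mod N" idea comes from the answer above:

// Hypothetical helper for the suffix-sharding idea described above.
public final class ShardedKey {
    private static final int N = 100; // expected partitions + some growth gap

    // Derive the shard suffix from the userId so the same user always lands in
    // the same shard (useful for targeted reads and updates of that user).
    public static String champPartitionKey(String userId) {
        int shard = (userId.hashCode() & Integer.MAX_VALUE) % N;
        return "Champ." + shard;
    }
}

// Reading all champs then means issuing N queries, one per key
// "Champ.0" ... "Champ.99", and merging the results.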
This is the format of a line in my input file:
Population|City|State|ListOfHighways ==>
6|Oklahoma City|Oklahoma|I-35;I-44;I-40
6|Boston|Massachusetts|I-90;I-93
8|Columbus|Ohio|I-70;I-71
I need to create an output file with the following format:
Population
(newline)
City, State
Interstates: Comma-separated list of interstates, sorted by interstate number ascending
(newline)
==> Example:
6
Boston, Massachusetts
Interstates: I-90, I-93
Oklahoma City, Oklahoma
Interstates: I-35, I-40, I-44
8
Columbus, Ohio
Interstates: I-70, I-71
Here, the entries having the same population should be grouped together, sorted alphabetically by state first and then by city. I was able to get the format right, but I am not able to figure out which data structure to use to sort the states and then the cities.
I currently have a map<int, vector<string>>, where the key is the population and the vector holds the rest of the record. Any suggestions are welcome.
I wouldn't use a map for this at all. You should probably figure out what information you actually need for each element of the data, and create whatever data types you need to support that. E.g.
#include <string>
#include <vector>

struct State
{
    unsigned int Population;            // population bucket this entry belongs to
    std::vector<std::string> Cities;    // cities in this state
    std::vector<unsigned int> Highways; // interstate numbers, e.g. 35 for I-35
};
You can then parse your data and create a std::vector<State>. Sort the vector and its contents as appropriate using std::sort (you can use lambdas, or create comparison functions or functors if necessary).
Assuming only one reducer.
My scenario is to get the list of the top N scorers in the university. The data is in (key, value) format. The MapReduce framework, by default, sorts the data in ascending order, but I want the list in descending order; or at least, if I could access the sorted list from the end, my work would become much easier. Instead of sending a lot of data to the reducer, I could restrict the data to a limit.
(I want to override the predefined shuffle/sort.)
Thanks & Regards
Ashwanth
I guess a Combiner is what you want. It runs along with the mappers and typically does what a reducer does, but on a single mapper's output data. Generally the combiner class is set to the same class as the reducer. In your case you can sort and pick the top-K elements in each mapper and send only those out.
So instead of sending all your map output records, you will be sending at most K * (number of mappers) records to the reducer.
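A rough sketch of that idea, done in the mapper's cleanup() rather than in a separate Combiner class (the "name,score" record layout and the value of K are my assumptions):

import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopKMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int K = 10;

    private static final class Entry {
        final String name;
        final int score;
        Entry(String name, int score) { this.name = name; this.score = score; }
    }

    // Min-heap ordered by score: the head is always the smallest of the K kept so far.
    private final PriorityQueue<Entry> topK =
            new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.score));

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
        String[] f = line.toString().split(",");      // assumed "name,score" records
        topK.add(new Entry(f[0].trim(), Integer.parseInt(f[1].trim())));
        if (topK.size() > K) {
            topK.poll();                               // evict the current minimum
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit at most K records per mapper; the single reducer then repeats the
        // same top-K selection over (K * number of mappers) records.
        while (!topK.isEmpty()) {
            Entry e = topK.poll();
            ctx.write(new Text(e.name), new IntWritable(e.score));
        }
    }
}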
You can find example usage on http://wiki.apache.org/hadoop/WordCount.
Bonus - Check out http://blog.optimal.io/3-differences-between-a-mapreduce-combiner-and-reducer/ for major differences between a combiner and a reducer.
I have a collection of files; each file contains an author's name and the words he used. Now I am trying to write MapReduce code to count each author's top N words. The tricky part is that a file may contain multiple authors.
So how should my MapReduce job be designed?
Pseudo-code plus a little explanation is enough. Thanks.
In one MR job, count the words used by each author by creating a composite key of author+word with the count as the value.
A second MR job would read those (author+word, count) pairs and map them to (author+count, word+count). Write a comparator to order those keys first by author and then by count (largest to smallest), and a grouping comparator to treat two keys with the same author as belonging to the same reduce group, regardless of their count. You'll probably also need a partitioner to make sure that all pairs for an author go to the same partition. The reducer will then be called once for each author, and the values (word+count) will be provided by the Iterable with the largest counts first. In the reducer, just write the author, word and count from the first N records of the Iterable.
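A minimal sketch of the first job (the tab-separated "author<TAB>text" input layout is an assumption); the second job's sort comparator, grouping comparator and partitioner are left out here:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AuthorWordCount {

    // Map: assume each line is "author<TAB>text of the document"; emit one
    // ("author\tword", 1) pair per word so all occurrences meet at the reducer.
    public static class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length < 2) return;
            String author = parts[0];
            for (String word : parts[1].toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    ctx.write(new Text(author + "\t" + word), ONE);
                }
            }
        }
    }

    // Reduce (also usable as a combiner): sum the counts for each (author, word).
    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text authorWord, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(authorWord, new IntWritable(sum));
        }
    }
}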