Getting the partition number in a reducer to which a key-value pair belongs - mapreduce

When I am processing a given key-{set of values} pair in the reducer function, how can I get the partition number to which this pair belongs? Is it possible to get this partition number without attaching extra information about it to each key-value pair during partitioning?
Cheers

This has worked for me:
jobconf.getInt( "mapred.task.partition", 0);
in the reducer.
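If the reducer is a Hadoop Streaming script rather than Java code, something along these lines may work. This is a minimal sketch that assumes Hadoop Streaming's documented behaviour of exposing job configuration values as environment variables with dots replaced by underscores, and that newer Hadoop releases name the same property mapreduce.task.partition. The helper name task_partition is hypothetical.

```python
import os


def task_partition(env=None, default=0):
    """Return the partition number of the running task, or `default` if unset."""
    env = os.environ if env is None else env
    # Try the newer property name first, then the older one used in the answer above.
    for name in ("mapreduce_task_partition", "mapred_task_partition"):
        value = env.get(name)
        if value is not None:
            return int(value)
    return default
```

Passing a plain dict instead of os.environ makes the lookup easy to try outside a cluster.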


AWS Hadoop MapReduce - Word Count Average

Hi, I have a CSV data file as below.
bus,train,bus,TRAIN,car,bus,Train,CAr,car,Train,Cart,Bus,Bicycle,Bicycle,Car,Bus,Cart,Cart,Bicycle,Threewheel
I need to calculate the word count average for the above CSV using MapReduce.
E.g.: Bus = 5/20 = 0.25
I can get the word count easily, but I need the total number of records (20 in this case) to compute the word count average. Passing that to the reduce function using global variables did not work. I also tried to pass it to the reducer as a key-value pair from the map (key = "Total", value = total count). That was not successful either.
Any suggestions for passing this total count from the map function to the reduce function?
I used one master and three slaves in an EMR cluster, in case that is relevant.
Thank you in advance!
Once you have the pairs (K, V), where K is the word and V is the number of times it appears, you can map them all to a single key, let's say (W, (K, V)). Now you can reduce to obtain the total word count. Then you can run another map/reduce step to join the old keys with the new count.
Hope it helps.
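The steps above can be sketched in plain Python, without any Hadoop API, just to show the arithmetic. This is a rough illustration: it assumes words are compared case-insensitively (which the Bus = 5/20 example implies), and the function name word_averages is hypothetical.

```python
from collections import Counter


def word_averages(line):
    # Step 1 (map + reduce): normalise case and count occurrences per word.
    words = [w.strip().lower() for w in line.split(",") if w.strip()]
    counts = Counter(words)
    # Step 2: send every (word, count) pair to a single key to get the total.
    total = sum(counts.values())
    # Step 3 (join): divide each word's count by the total.
    return {word: count / total for word, count in counts.items()}


data = ("bus,train,bus,TRAIN,car,bus,Train,CAr,car,Train,Cart,Bus,"
        "Bicycle,Bicycle,Car,Bus,Cart,Cart,Bicycle,Threewheel")
averages = word_averages(data)
print(averages["bus"])  # 0.25
```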

What is the difference between Partitioner phase and Shuffle&Sort phase in MapReduce?

As I understand it, between mapping and reducing there is Combining (if applicable) followed by partitioning followed by shuffling.
While it seems clear that partitioning and shuffle&sort are distinct phases in map/reduce, I cannot differentiate their roles.
Together they must take the key/value pairs from many mappers (or combiners) and send them to reducers, with all values sharing the same key being sent to the same reducer. But I don't know what each of the two phases does.
Partitioning is the sub-phase executed just before the shuffle-sort sub-phase. But why is partitioning needed?
Each reducer takes data from several different mappers. Look at this picture (found it here):
Hadoop must know that all Ayush records from every mapper must be sent to the same reducer (or the task will return an incorrect result). Partitioning is the process that decides which key goes to which partition, and each partition is sent to one particular reducer. The total number of partitions is equal to the total number of reducers.
Shuffling is the process of moving the intermediate data provided by the partitioner to the reducer node. During this phase, there are sorting and merging subphases:
Merging - combines all key-value pairs that have the same key and returns (Key, List[Value]).
Sorting - takes the output from the merging step and sorts all key-value pairs by key. This step also returns (Key, List[Value]) output, but with the pairs sorted by key.
Output of shuffle-sort phase is sent directly to reducers.
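To make the two roles concrete, here is a small in-memory sketch. It is an illustration rather than Hadoop's actual implementation: the partition function is a simple deterministic stand-in for a hash partitioner, and every key other than Ayush is made up.

```python
from collections import defaultdict


def partition(key, num_reducers):
    # Partitioning: decide which reducer a key belongs to.
    # Deterministic stand-in for a hash partitioner: same key, same partition.
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    # Shuffling: move each pair to its reducer's bucket, merging values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    # Sorting: each reducer sees its (Key, List[Value]) pairs ordered by key.
    return [sorted(bucket.items()) for bucket in buckets]


mapper_outputs = [
    [("Ayush", 1), ("Bob", 1)],    # output of mapper 1
    [("Ayush", 1), ("Carol", 1)],  # output of mapper 2
]
reducer_inputs = shuffle(mapper_outputs, 2)
```

Both Ayush records end up in the same reducer's input, whichever mapper produced them.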

Maximum Partition key length of my data in Dynamo DB

I have a use case that requires placing constraints on the key size in my application. I tried to find the maximum length of the partition keys stored so far in my DynamoDB table. This will help me understand my data before placing any internal constraints on the values I use as a partition key in DynamoDB.
Example: Let's say here is my table with a partition key (idempotent_id). I want to know the max length of partition keys so far (in this case 7).
idempotent_id
1234
12
1234567
12345
I tried using the DynamoDB console from my AWS account and looked at the Query and Scan APIs of DynamoDB, but nothing seems to be a good fit. Maybe this is something we can't find using DynamoDB, or maybe I am searching incorrectly?
Any help would be appreciated.
Partition Keys and Sort Keys
Partition Key Length
The minimum length of a partition key value is 1 byte. The maximum length is 2048 bytes.
Partition Key Values
There is no practical limit on the number of distinct partition key values, for tables or for secondary indexes.
Sort Key Length
The minimum length of a sort key value is 1 byte. The maximum length is 1024 bytes.
Sort Key Values
In general, there is no practical limit on the number of distinct sort key values per partition key value.
The exception is for tables with secondary indexes. With a local secondary index, there is a limit on item collection sizes: For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB. This might constrain the number of sort keys per partition key value. For more information, see Item Collection Size Limit.
From your comment:
I am trying to find the maximum size/length of idempotent_id so far in my single table.
In order to do this without any auxiliary data, you will need to perform a full table scan and read the attributes you care about from each item. You can use a ProjectionExpression to reduce the amount of data retrieved.
You could store the key's length in another attribute and create a GSI on it, which would let you query that index in sorted order.
Another option would be to use something like DynamoDB Streams to listen to events and keep track of the max size in a different storage medium.
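As a rough sketch of the full-scan option, the function below walks over pages shaped like low-level Scan responses and tracks the longest key in UTF-8 bytes. It assumes a String partition key and leaves out the actual paginated Scan call (with a ProjectionExpression on the key). The name max_key_bytes is hypothetical, and the sample pages reuse the values from the question.

```python
def max_key_bytes(pages, key_name):
    """Largest UTF-8 byte length of `key_name` across Scan-style pages."""
    longest = 0
    for page in pages:
        for item in page.get("Items", []):
            # Low-level item format: {"attr": {"S": "value"}}.
            value = item[key_name]["S"]
            longest = max(longest, len(value.encode("utf-8")))
    return longest


pages = [
    {"Items": [{"idempotent_id": {"S": "1234"}}, {"idempotent_id": {"S": "12"}}]},
    {"Items": [{"idempotent_id": {"S": "1234567"}}, {"idempotent_id": {"S": "12345"}}]},
]
print(max_key_bytes(pages, "idempotent_id"))  # 7
```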
Take a look at data types section and their limits: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
String
The length of a String is constrained by the maximum item size of 400
KB.
Strings are Unicode with UTF-8 binary encoding. Because UTF-8 is a
variable width encoding, DynamoDB determines the length of a String
using its UTF-8 bytes.
Number
A Number can have up to 38 digits of precision, and can be positive,
negative, or zero.
Positive range: 1E-130 to 9.9999999999999999999999999999999999999E+125
Negative range: -9.9999999999999999999999999999999999999E+125 to -1E-130
DynamoDB uses JSON strings to represent Number data in requests and replies. For more information, see DynamoDB Low-Level API.
If number precision is important, you should pass numbers to DynamoDB
using strings that you convert from a number type.
Binary
The length of a Binary is constrained by the maximum item size of 400
KB.
Applications that work with Binary attributes must encode the data in
Base64 format before sending it to DynamoDB. Upon receipt of the data,
DynamoDB decodes it into an unsigned byte array and uses that as the
length of the attribute.
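Because String length is measured in UTF-8 bytes, the character count and the byte count can differ for non-ASCII keys. A small sketch of a check against the documented key limits (1 to 2048 bytes for a partition key, 1 to 1024 bytes for a sort key) could look like this. The constant and function names are hypothetical.

```python
PARTITION_KEY_MAX_BYTES = 2048
SORT_KEY_MAX_BYTES = 1024


def utf8_len(value):
    # Length in UTF-8 bytes, not in characters.
    return len(value.encode("utf-8"))


def fits(value, limit):
    # Keys must be at least 1 byte and at most `limit` bytes long.
    return 1 <= utf8_len(value) <= limit


print(len("é"), utf8_len("é"))  # 1 2
```

For example, a key made of 600 copies of "é" has 600 characters but 1200 bytes, so it would not fit the sort key limit.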
Range Key Example
string = '1USER:01G23WSPRVXA8BWK5ERD0TDYT501G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sl'
string.length => 1024
You can also check string bytes HERE
What is interesting is that if I add one more character to this string and then create a new item with it as the range key, I get the following error.
UnhandledPromiseRejectionWarning: ValidationException: Hash primary key values must be under 2048 bytes, and range primary key values must be under 1024 bytes
In practice, the range key length can be <= 1024 bytes, whereas the error message says < 1024. The same discrepancy applies to the primary hash key limit of 2048: the message says < 2048, but the database accepts <= 2048.
Note: Both keys must be at least one character. The following is an error message when the Range key is a blank string.
UnhandledPromiseRejectionWarning: ValidationException: One or more parameter values are not valid. The AttributeValue for a key attribute cannot contain an empty string value. Key: SK
Primary (Hash) Key Example
string = '01G23XW4MCMTGMRWD4R78NMA29:USERS31:1USER:01G23WSPRVXA8BWK5ERD0TDYT501G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahskghlahkdhgkjashdjkghjakshghasjkdhg;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahskghlahkdhgkjashdjkghjakshghasjkdhg;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahskghlahkdhgkjashdjkghjakshghasjkdhg;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjg
jajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahsdsagsdgdggg'
string.length => 2048

How to override shuffle/sort in map/reduce, or else how can I get the sorted list in map/reduce from the last element to the partitioner

Assuming only one reducer.
My scenario is to get the list of the top N scorers in the university. The data is in format. The map/reduce framework sorts the data in ascending order by default. But I want the list in descending order, or at least, if I could access the sorted list from the end, my work would become much easier. Instead of sending a lot of data to the reducer, I could restrict the data to a limit.
(I want to override the predefined Shuffle/Sort)
Thanks & Regards
Ashwanth
I guess combiners are what you want. A combiner runs alongside the mappers and typically does what a reducer does, but on a single mapper's output data. Generally the combiner class is set to the same class as the reducer. In your case you can sort and pick the top-K elements in each mapper and send only those out.
So instead of sending all your map output records you will be sending only a maximum of K * number of mappers records to the reducer.
You can find example usage on http://wiki.apache.org/hadoop/WordCount.
Bonus - Check out http://blog.optimal.io/3-differences-between-a-mapreduce-combiner-and-reducer/ for major differences between a combiner and a reducer.
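The per-mapper top-K idea described above can be sketched in plain Python, without any Hadoop API. Since the question's record format is not shown, the sketch assumes (name, score) pairs. The function names and sample data are hypothetical.

```python
import heapq


def top_k(records, k):
    # Keep the k records with the highest scores, best first.
    return heapq.nlargest(k, records, key=lambda record: record[1])


def top_k_overall(mapper_outputs, k):
    # Combiner step: each mapper forwards at most k records.
    forwarded = [r for output in mapper_outputs for r in top_k(output, k)]
    # Single reducer: pick the final top k from at most k * mappers records.
    return top_k(forwarded, k)


mapper_outputs = [
    [("a", 90), ("b", 70), ("c", 85)],
    [("d", 95), ("e", 60)],
]
print(top_k_overall(mapper_outputs, 2))  # [('d', 95), ('a', 90)]
```

The result also comes out in descending order, so the reducer never has to read the ascending-sorted list from the end.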

MapReduce Input/OutPut emits for each key value pair

Basic MapReduce information about passing and emitting key-value pairs.
I need a little clarity on what we pass and what is emitted.
Here are my concerns:
MapReduce input and output:
1. Map() method: does it take a single key-value pair or a list of them, and what does it emit?
2. For each input key-value pair, what do mappers emit? The same type or a different type?
3. For each intermediate key, what will the reducer emit? Is there any restriction on type?
4. The reducer receives all values associated with the same key. How will the values be ordered: sorted or arbitrarily ordered? Does that order vary from run to run?
5. During the shuffle and sort phase, in which order are keys and values presented?
For each input k1, v1 map emits zero or more k2, v2.
For each k2 the reducer receives k2, list(v2).
For each input k2, list(v2) the reducer can emit zero or more k3, v3.
Values are arbitrarily ordered in step 2.
The key and value outputs of a mapper must each be of a single type, and likewise for a reducer, i.e. all keys must be the same type and all values must be the same type.
Map method: receives (K1,V1) as input and returns (K2,V2). That is, the output key and value can be different from the input key and value.
Reducer method: after the output of the mappers has been shuffled correctly (same key goes to the same reducer), the reducer input is (K2, LIST(V2)) and its output is (K3,V3).
As a result of the shuffling process, the keys arrive at the reducer sorted by the key K2.
If you want to order the keys in your own particular manner, you can implement the compareTo method of the key K2.
Referring to your questions:
1. Answered above.
2. You can emit whatever you want as long it consists of a key and a value.
For example, in the WordCount you send as key the word and as value 1.
3. In the WordCount example, the reducer will receive a word and list of number.
Then, it will sum up the numbers and emit the word and its sum.
4. Answered above.
5. Answered above.
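The (K1,V1) -> (K2,V2) -> (K3,V3) flow for WordCount can be sketched in plain Python. This is an in-memory illustration, not Hadoop's API: the function names are hypothetical, and K1 is assumed to be a line offset with V1 the line text.

```python
from itertools import groupby


def map_fn(k1, v1):
    # (K1, V1) = (line offset, line text); emits zero or more (K2, V2).
    for word in v1.split():
        yield word, 1


def reduce_fn(k2, values):
    # (K2, list(V2)) -> zero or more (K3, V3); here the word and its sum.
    yield k2, sum(values)


def run(records):
    intermediate = [kv for k1, v1 in records for kv in map_fn(k1, v1)]
    # Shuffle and sort: keys reach the reducer in sorted order.
    intermediate.sort(key=lambda kv: kv[0])
    output = []
    for k2, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(reduce_fn(k2, [v for _, v in group]))
    return output


print(run([(0, "a b a"), (6, "b c")]))  # [('a', 2), ('b', 2), ('c', 1)]
```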