AWS Hadoop MapReduce - Word Count Average

Hi, I have a CSV data file as below.
bus,train,bus,TRAIN,car,bus,Train,CAr,car,Train,Cart,Bus,Bicycle,Bicycle,Car,Bus,Cart,Cart,Bicycle,Threewheel
I need to compute the average word count in the above CSV using MapReduce.
E.g.: Bus = 5/20 = 0.25
I can get the word count easily, but I need the total number of records (20 in this case) to compute the average. Passing it to the reduce function through a global variable did not work out. I also tried passing it to the reducer input as a key-value pair from the map (key = "Total", value = total count), but that was not successful either.
Any suggestions on how to pass this total count from the map function to the reduce function?
I used one master and 3 slaves in the EMR cluster, if that information is needed.
Thank you in advance!

Once you have the pairs (K, V), where K is the word and V the number of times it appears, you can map them all to a single key, let's say (W, (K, V)). Now you can reduce to obtain the total word count. Then you can run another map/reduce step to join the old keys with the new total.
Hope it helps.
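A minimal sketch of that second pass, assuming the first job already wrote standard word-count output as word<TAB>count lines (the class names and the "ALL" key are illustrative, not a fixed API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Second-job mapper: funnel every "word<TAB>count" line from the first
// job's output under one constant key, so a single reduce group sees them all.
class TotalKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text ALL = new Text("ALL");

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(ALL, line);
    }
}

// Second-job reducer: buffer the (word, count) pairs (the vocabulary is
// small), sum the grand total, then emit word -> count / total.
class AverageReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        List<String[]> pairs = new ArrayList<>();
        long total = 0;
        for (Text v : values) {
            String[] wordAndCount = v.toString().split("\t");
            pairs.add(wordAndCount);
            total += Long.parseLong(wordAndCount[1]);
        }
        for (String[] p : pairs) {
            ctx.write(new Text(p[0]),
                    new DoubleWritable(Long.parseLong(p[1]) / (double) total));
        }
    }
}

Because everything funnels through one key, the second job effectively needs a single reducer (job.setNumReduceTasks(1)); for a vocabulary this small that is cheap.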

Related

Redshift -- Query Performance Issues

SELECT
    a.id,
    b.url AS codingurl
FROM fact_A a
INNER JOIN dim_B b
    ON strpos(a.url, b.url) > 0
Record count in Fact_A: 2 million
Record count in Dim_B: 1,500
Time taken to execute: 10 minutes
Number of nodes: 2
Could someone help me understand why the above query takes so long to execute?
We have declared the distribution key on Fact_A so that the records are distributed evenly across both nodes, and a sort key is created on URL in Fact_A.
The Dim_B table is created with DISTSTYLE ALL.
Redshift does not have full-text search indexes or prefix indexes, so a query like this (with strpos used in the join condition) results in a full table scan, executing strpos 3 billion times (2 million × 1,500).
Depending on which URLs are in dim_B, you might be able to optimise this by extracting prefixes into separate columns. For example, if you always compare subpaths of the form http[s]://hostname/part1/part2/part3, then you can extract "part1/part2/part3" as a separate column in both fact_A and dim_B, and make it the dist and sort keys.
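As a rough sketch of that idea (the url_path column, the three-segment pattern, and the DDL are illustrative assumptions, not tested against your schema):

ALTER TABLE fact_A ADD COLUMN url_path VARCHAR(512);
ALTER TABLE dim_B ADD COLUMN url_path VARCHAR(512);

-- Extract the trailing "part1/part2/part3" into the new columns.
UPDATE fact_A SET url_path = REGEXP_SUBSTR(url, '[^/]+/[^/]+/[^/]+$');
UPDATE dim_B SET url_path = REGEXP_SUBSTR(url, '[^/]+/[^/]+/[^/]+$');

-- With url_path as the dist and sort key on fact_A (set when the table
-- is recreated, or via ALTER TABLE ... ALTER DISTKEY), the join becomes
-- a plain equality join that Redshift can colocate:
SELECT a.id, b.url AS codingurl
FROM fact_A a
INNER JOIN dim_B b ON a.url_path = b.url_path;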
You can also rely on Redshift's parallelism. If you resize your cluster from 2 nodes to 20 nodes, you should see an immediate performance improvement of 8-10 times, as this kind of query can, for the most part, be executed by each node in parallel.

KNIME Data Manipulation

I have a table with ReadingTime and Frequency columns, and I would like to insert 3 rows between records where the time difference is greater than 12 hours. I could determine the time difference using the available "Time Difference" node, but could not insert rows as required. Is there any way to achieve this in KNIME?
In case you are using the Time Generator node in a chunk loop (with the lagged column and the "Use second column" option on the Time Difference node), you can generate as many rows as you want (I assume you already use some switch/if nodes).

Mapper or Reducer, where to do more processing?

I have a 6-million-line text file with lines up to 32,000 characters long, and I want to measure the word-length frequencies.
The simplest method is for the Mapper to create a (word-length, 1) key-value pair for every word and let an 'aggregate' Reducer do the rest of the work.
Would it be more efficient to do some of the aggregation in the mapper, so that the key-value pair output would be (word-length, frequency_per_line)?
The output from the mapper would then be decreased by a factor of the average number of words per line.
I know there are many configuration factors involved, but is there a hard rule saying whether most of the work should be done by the Mapper or the Reducer?
The platform is AWS with a student account, limited to the following configuration.
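For what it's worth, here is a minimal sketch of the per-line aggregation variant described in the question, assuming whitespace-delimited words (the class name is illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class LengthFreqMapper
        extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Aggregate within the line first: one (word-length, frequency) pair
        // per distinct length, instead of one (word-length, 1) per word.
        Map<Integer, Long> freqPerLine = new HashMap<>();
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                freqPerLine.merge(word.length(), 1L, Long::sum);
            }
        }
        for (Map.Entry<Integer, Long> e : freqPerLine.entrySet()) {
            ctx.write(new IntWritable(e.getKey()),
                    new LongWritable(e.getValue()));
        }
    }
}

The conventional way to get the same effect without touching the mapper is to keep the simple (word-length, 1) mapper and register the summing reducer as a combiner via job.setCombinerClass(...), which aggregates each mapper's output locally before the shuffle.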

How are map and reduce workers in cluster selected?

For a MapReduce job we need to specify the partitioning of the input data (the number of map processes, M) and the number of reduce processes, R. The MapReduce paper gives an example of typical settings: a cluster with 2,000 workers and M = 200,000, R = 5,000. Workers are tagged as map workers or reduce workers. I wonder how these workers in the cluster are selected.
Is it done by choosing a fixed number of map workers and a fixed number of reduce workers? (And then data stored on the reduce-worker nodes has to be sent to the map workers.)
Or does the map phase run on every node in the cluster, with some number of nodes then selected as reduce workers?
Or is it done in another way?
Thanks for your answers.
The number of map workers (mappers) depends on the number of input splits of the input file.
So, for example: 200 input splits (they are logical) = 200 mappers.
How is the mapper node selected?
The mapper runs on the node that holds its data locally; if that is not possible, the data is transferred to a free node and the mapper is invoked on that node.
The number of reducers can be set by the user (Job.setNumReduceTasks(int)), or else it likewise follows from the number of partitions of the mappers' intermediate output.
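As a small driver-side illustration (the class and job names are arbitrary): in the org.apache.hadoop.mapreduce API the reducer count is an explicit setting, while the mapper count has no setter because it is derived from the input splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

class DriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sketch");
        // R is user-chosen; M falls out of the number of input splits.
        job.setNumReduceTasks(5000);
    }
}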
Answers to the other questions:
Q1> So can one node run, for example, 10 mappers in parallel at one time, or are these mappers processed sequentially?
Ans: Sequentially (the maximum number of active/running mappers = the number of DataNodes).
Q2> How are the nodes where the reducers are invoked chosen?
Ans: Intermediate key-values are stored in the local file system, not in HDFS, and are then copied to the reducer node. A single mapper will feed data to multiple reducers, so data locality is out of the question, because the data for a particular reducer comes from many nodes, if not from all of them. So the reducer is (or at least should be) selected based on a node's bandwidth, keeping in mind all of the above points.
Q3> If we need a reducer count bigger than the overall node count (for example, 90 reducers in a 50-node cluster), are the reducers on one node processed in parallel or sequentially?
Ans: Sequentially (the maximum number of active/running reducers = the number of DataNodes).

Getting the partition number in a reducer to which a key-value pair belongs

When I am processing a given key-{set of values} pair in the reducer function, how can I get the partition number to which this key-{set of values} belongs? Is it possible to get this partition number without adding extra information about the partition number to each key-value pair during partitioning?
Cheers
This has worked for me:
jobconf.getInt("mapred.task.partition", 0);
in the reducer.
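A minimal sketch of the same lookup with the newer org.apache.hadoop.mapreduce API, where the configuration is read from the reducer's Context (the class and its type parameters are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class PartitionAwareReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        // Each reduce task handles exactly one partition, so the task's
        // partition number identifies the partition of every key it sees.
        int partition = ctx.getConfiguration().getInt("mapred.task.partition", 0);
        // Equivalent without the property: ctx.getTaskAttemptID().getTaskID().getId()
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        ctx.write(new Text(partition + ":" + key), new IntWritable(sum));
    }
}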