Pig parallel avg - mapreduce

Is it possible to tell Pig to output 10 part-r files, the way MapReduce does when it uses 10 reducers? My Pig script outputs just one part-r file, which I guess means it is using just one reducer. I have put
SET default_parallel 10;
in my script and in stderr I can see that at the beginning
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting Parallelism to 10
but in the middle of MapReduceLauncher it goes back to
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting Parallelism to 1
I do a join, sum two columns, and then compute the average of one column, and I suspect this happens because of the AVG or the GROUP ALL. Is that correct?

Yes. Quoting from http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#group_by
[...] keep in mind that when using group all, you are necessarily serializing your pipeline. That is, this step and any step after it until you split out the single bag now containing all of your records will not be done in parallel.

Related

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google DataFlow. I need to run multiple transformations, all of which expect items to be grouped by key.
Based on the answer to this question, DataFlow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to other transformations (see sample code below).
I wonder if this is supposed to work efficiently in DataFlow. If not, what is a recommended workaround in the Python SDK? Is there an efficient way to have multiple Map or Write transformations taking the results of the same GroupBy? In my case, I observe DataFlow scale to the maximum number of workers at 5% utilization and make no progress at the steps following the GroupBy, as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
# Group by key once.
items_by_key = raw_items | GroupByKey()
# Write grouped items to a file.
(items_by_key | FlatMap(format_item) | WriteToText(path))
# Run another transformation over the same group.
features = (items_by_key | Map(extract_features))
Feeding the output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any of the downstream steps is high fanout, consider adding a Reshuffle step after those steps, which will allow Dataflow to further parallelize execution.
For example,
(pipeline
 | Create([<list of globs>])
 | ParDo(ExpandGlobDoFn())
 | Reshuffle()
 | ParDo(MyReadDoFn())
 | Reshuffle()
 | ParDo(MyProcessDoFn()))
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing the reading of files from a given glob and (2) parallelizing the processing of elements from a given file.
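As a rough sketch of that advice applied to the question's pipeline (a minimal, hypothetical example; parse_key_value, format_item, expand_group, extract_features and the gs:// paths are placeholders, not from the original post):
import apache_beam as beam

def parse_key_value(line):
    # Placeholder parser: split "key<TAB>value" lines into (key, value) pairs.
    key, _, value = line.partition('\t')
    return key, value

def format_item(kv):
    # Placeholder formatter: one output line per grouped key.
    key, values = kv
    yield '%s\t%s' % (key, ','.join(values))

def expand_group(kv):
    # Placeholder high-fanout step: emit one element per grouped value.
    key, values = kv
    for value in values:
        yield key, value

def extract_features(element):
    # Placeholder feature extraction.
    return element

with beam.Pipeline() as p:
    items_by_key = (p
                    | 'Read' >> beam.io.ReadFromText('gs://bucket/raw_items*')
                    | 'ToKV' >> beam.Map(parse_key_value)
                    | 'Group' >> beam.GroupByKey())

    # Branch 1: write the grouped items out, as in the question.
    (items_by_key
     | 'Format' >> beam.FlatMap(format_item)
     | 'Write' >> beam.io.WriteToText('gs://bucket/out/items'))

    # Branch 2: a high-fanout step followed by Reshuffle, so Dataflow can
    # redistribute the fanned-out elements across more workers.
    (items_by_key
     | 'Expand' >> beam.FlatMap(expand_group)
     | 'Redistribute' >> beam.Reshuffle()
     | 'Features' >> beam.Map(extract_features))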
Based on my experience troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least this was my experience with the Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline was not able to make any significant progress. I tried adding Reshuffles, using a combiner where possible, and disabling the experimental shuffle service.
Nothing helped until I split the pipeline into two. The first pipeline computes the GroupBy and stores it in a file (I need to ingest it "as is" into the DB). The second reads the file with the GroupBy output, reads additional inputs, and runs further transformations. The result: all transformations successfully finished in under 2 hours. I think if I had just duplicated the GroupBy in my original pipeline, I would probably have achieved the same results.
I wonder if this is a bug in the DataFlow execution engine or the Python SDK, or whether it works as intended. If it is by design, then at least it should be documented, and a pipeline like this should not be accepted when submitted, or there should be a warning.
You can spot this issue by looking at the 2 branches coming out of the "Group keywords" step. It looks like the solution is to rerun GroupBy for each branch separately.
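For reference, a minimal sketch of the split-pipeline workaround described above, under the assumption that the grouped output can be round-tripped through text files (the helper functions and paths are placeholders):
import apache_beam as beam

def format_group(kv):
    # Placeholder: serialize one grouped key and its values to a text line.
    key, values = kv
    return '%s\t%s' % (key, ','.join(values))

def parse_group(line):
    # Placeholder: inverse of format_group.
    key, _, joined = line.partition('\t')
    return key, joined.split(',')

def extract_features(kv):
    # Placeholder feature extraction.
    return kv

# Pipeline 1: compute the GroupBy once and persist it (this is also the file ingested into the DB).
with beam.Pipeline() as p1:
    (p1
     | 'Read' >> beam.io.ReadFromText('gs://bucket/raw_items*')
     | 'ToKV' >> beam.Map(lambda line: tuple(line.split('\t', 1)))
     | 'Group' >> beam.GroupByKey()
     | 'Format' >> beam.Map(format_group)
     | 'Write' >> beam.io.WriteToText('gs://bucket/grouped/part'))

# Pipeline 2: read the persisted groups back and run the remaining transforms.
with beam.Pipeline() as p2:
    (p2
     | 'ReadGrouped' >> beam.io.ReadFromText('gs://bucket/grouped/part*')
     | 'Parse' >> beam.Map(parse_group)
     | 'Features' >> beam.Map(extract_features)
     | 'WriteFeatures' >> beam.io.WriteToText('gs://bucket/features/part'))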

Pig UDF seems to always run in a single reducer - PARALLEL not working

I have a Pig script with a Python UDF that is supposed to generate user-level features. My data is preprocessed by Pig and then sent to a UDF as a list of tuples. The UDF processes the tuples and returns a chararray with my features computed per user. The code where this happens looks like this:
-- ... loading data above
data = FOREACH data_raw GENERATE user_id, ...; -- some other metrics as well
-- Group by ids
grouped_ids = GROUP data BY user_id PARALLEL 20;
-- Limit the ids to process
userids = LIMIT grouped_ids (long)'$limit';
-- Generate features
user_features = FOREACH userids {
    GENERATE group AS user_id:chararray,
             udfs.extract_features(data) AS features:chararray;
}
The UDF code clearly runs in the reducer, and for some reason it always goes to a single reducer and takes quite some time. I am looking for a way to parallelize its execution, as my job now takes 22 minutes in total, of which 18 minutes are spent in this single reducer.
Pig typically tries to allocate 1GB of data per reducer, and my data is indeed less than 1GB, around 300-700MB, but it is pretty time-consuming on the UDF end, so this is clearly not optimal while the rest of my cluster sits idle.
Things I have tried:
Setting default_parallel impacts the whole script, but still does not manage to get the reducer with the UDF to parallelize.
Manually setting PARALLEL on GROUP data BY user_id parallelizes the output of the group and invokes multiple reducers, but at the point where the UDF kicks in, it is again a single reducer.
Setting pig.exec.reducers.bytes.per.reducer, which lets you cap, for instance, the data per reducer at 10MB. It clearly works for other parts of my script (and ruins the parallelism, as it also affects the data preparation at the beginning of my pipeline, as expected), but again DOES NOT allow more than one reducer to run with this UDF.
As far as I understand what is going on, if the shuffle phase can hash user_id to one or more reducers, I don't see why this script would not be able to spawn multiple reducers, instantiate the UDF there, and hash the corresponding data to the correct reducer based on user_id. There is no significant skew in my data or anything.
I am clearly missing something here but fail to see what. Does anyone have an explanation and/or suggestion?
EDIT: I updated the code because something important was missing: I was running a LIMIT between the GROUP BY and the FOREACH. I also cleaned up irrelevant info and expanded the inline code onto separate lines for readability.
Your problem is that you are passing the whole data relation as an input parameter to your UDF, so your UDF only gets called once with the whole data, hence it runs in only one reducer. I guess you want to call it once for each group of user_id, so try a nested FOREACH instead:
data_grouped = GROUP data BY user_id;
user_features = FOREACH data_grouped {
    GENERATE group AS user_id:chararray,
             udfs.extract_features(data) AS features:chararray;
}
This way you force the UDF to run in as many reducers as are used by the GROUP BY.
Having the LIMIT operator between the GROUP BY and the FOREACH eliminates the possibility of running my code in multiple reducers, even if I explicitly set the parallelism:
-- ... loading data above
data = FOREACH data_raw GENERATE user_id, ...; -- some other metrics as well
-- Group by ids
grouped_ids = GROUP data BY user_id PARALLEL 20;
-- Limit the ids to process
>>> userids = LIMIT grouped_ids (long)'$limit'; <<<
-- Generate features
user_features = FOREACH userids {
    GENERATE group AS user_id:chararray,
             udfs.extract_features(data) AS features:chararray;
}
Once the LIMIT is placed further down in the script, I manage to get the predefined number of reducers to run my UDF:
-- ... loading data above
data = FOREACH data_raw GENERATE user_id, ...; -- some other metrics as well
-- Group by ids
grouped_ids = GROUP data BY user_id PARALLEL 20;
-- Generate features
user_features = FOREACH grouped_ids {
    GENERATE group AS user_id:chararray,
             udfs.extract_features(data) AS features:chararray;
}
-- Limit the features
user_features_limited = LIMIT user_features (long)'$limit';
-- ... process further and persist
So my attempt to optimize/reduce the inflow of user_ids was counter-productive for increasing parallelism.

Reducers for Hive data

I'm a novice. I'm curious to know how the number of reducers is determined for different Hive data sets. Is it based on the size of the data being processed? Or is there a default number of reducers for everything?
For example, how many reducers does 5GB of data require? Will the same number of reducers be set for a smaller data set?
Thanks in advance!! Cheers!
In open-source Hive (and likely EMR):
# reducers = (# bytes of input to mappers) / (hive.exec.reducers.bytes.per.reducer)
The default hive.exec.reducers.bytes.per.reducer is 1G.
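A back-of-the-envelope version of that formula for the 5GB case in the question (assuming all 5GB reaches the mappers and the default per-reducer size):
import math

input_bytes = 5_000_000_000          # ~5 GB of input reaching the mappers
bytes_per_reducer = 1_000_000_000    # hive.exec.reducers.bytes.per.reducer default (1G)

reducers = math.ceil(input_bytes / bytes_per_reducer)
print(reducers)  # 5; a smaller data set gets proportionally fewer reducers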
The number of reducers also depends on the size of the input file.
You could change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000</value>
</property>
or using set
hive -e "set hive.exec.reducers.bytes.per.reducer=100000;"
In a MapReduce program, reducer input is assigned based on the key: the reduce method is called once for each key with its grouped list of values. It does not depend on the data size.
Suppose you are running a simple word count program and the file size is 1 MB, but the mapper output contains 5 keys that are going to the reducers; then there is a chance for 5 reducers to perform that task.
But if you have 5GB of data and the mapper output contains only one key, then only one reducer will be assigned to process the data in the reduce phase.
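To make that concrete, here is a toy sketch (not actual Hive/Hadoop code) of how a hash partitioner, like Hadoop's default HashPartitioner, spreads map output keys over reducers; with only one distinct key, every record lands on the same reducer no matter how many reducers are configured:
def partition(key, num_reducers):
    # Same idea as Hadoop's default HashPartitioner: hash the key, take it
    # modulo the number of reducers. (Python's hash() is only illustrative.)
    return hash(key) % num_reducers

num_reducers = 5
word_count_keys = ["the", "quick", "brown", "fox"]   # several distinct keys -> several reducers can be used
single_key = ["user_42"] * 1000                      # one distinct key -> everything hits one reducer

print({k: partition(k, num_reducers) for k in word_count_keys})
print({k: partition(k, num_reducers) for k in single_key})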
The number of reducers in Hive is also controlled by the following configuration properties:
mapred.reduce.tasks
Default Value: -1
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out the number of reducers.
hive.exec.reducers.bytes.per.reducer
Default Value: 1000000000
The default is 1G, i.e., if the input size is 10G, it will use 10 reducers.
hive.exec.reducers.max
Default Value: 999
The maximum number of reducers that will be used. If the value specified in the configuration parameter mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures. The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
Source: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Please check the link below for more clarification about reducers:
Hadoop MapReduce: Clarification on number of reducers
hive.exec.reducers.bytes.per.reducer
Default Value: 1,000,000,000 prior to Hive 0.14.0; 256 MB (256,000,000) in Hive 0.14.0 and later
Source: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Pig killing data nodes while loading a lot of files

I have a script that tries to get the times that users start/end their days based on log files. The job always fails before it completes and seems to knock 2 data nodes down every time.
The load portion of the script:
log = LOAD '$data' USING SieveLoader('#source_host', 'node', 'uid', 'long_timestamp', 'type');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL AND $0#'type'=='USER_AUTH';
There are about 6500 files that we are reading from, so it seems to spawn about that many map tasks. The SieveLoader is a custom UDF that loads a line, passes it to an existing method that parses fields from the line and returns them in a map. The parameters passed in are to limit the size of the map to only those fields with which we are concerned.
Our cluster has 5 data nodes. We have quad cores and each node allows 3 map/reduce slots for a total of 15. Any advice would be greatly appreciated!

Multiple MapReduce jobs with multiple files as input and multiple reducers

I need to chain multiple MapReduce streaming jobs in order to perform some computation over a large dataset.
I intend to use multiple reducers for each job in order to speed up the overall job. As a workflow scheduler, I use Oozie.
Here is an illustration to clarify my problem:
Let say I have two files
File 1:      File 2:
A B 1        A B 3
A C 4        C D 6
B D 2        B D 1
I'd like to have two mappers and two reducers and get the following output for the MapReduce job:
Output:
A B 4
A C 4
B D 3
C D 6
But this is not at all what I get, instead I have partial sums.
Here is what I think happens.
Since I have multiple reducers for each MapReduce job, the input of the next job is split into several files. These files are given to the mappers, which then send their output to the reducers. It seems that the mappers send their output to the reducers without the whole input having been processed and sorted with, for example, name1 as the key.
I've read several threads about using multiple files as input, and I don't think it is a matter of performing a map-side join. Maybe it has to do with partitioning, but I haven't exactly understood what partitioning consists of.
Is there any way to sort the output of several mappers before sending it to the reducers? Or can I tell Oozie to merge the output of several reducers so that the next MapReduce job has only one file as input?
I'm slightly new to MapReduce, but if you are not getting the desired output based on your example, it looks like your job isn't handling the keys correctly.
By default, Hadoop Streaming uses Tab as the field separator and takes everything from the start of a line to the first Tab character as the key. In your case, if your input format is actually "A[space]B[space]1", you'll need to add
-D stream.map.output.field.separator=' ' \
-D stream.num.map.output.key.fields=2 \
to your Hadoop Streaming command in order to set space as the column delimiter and the first 2 columns as the key. This will send all the lines that start with "A B" to the same reducer. More info can be found in the Hadoop Streaming documentation.
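For completeness, a minimal sketch of a streaming reducer that would produce the sums in the example above, assuming records arrive as "A B 1"-style lines already sorted by their two-field key (the file name and parsing details are illustrative, not from the question):
#!/usr/bin/env python
# reducer.py -- sums the numeric third field for consecutive lines sharing
# the same two-field key, e.g. "A B 1" + "A B 3" -> "A B 4".
import sys

current_key = None
total = 0

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 3:
        continue
    key, value = tuple(fields[:2]), int(fields[2])
    if current_key is not None and key != current_key:
        print('%s %s %d' % (current_key[0], current_key[1], total))
        total = 0
    current_key = key
    total += value

if current_key is not None:
    print('%s %s %d' % (current_key[0], current_key[1], total))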