I need to chain multiple MapReduce streaming jobs in order to perform some computation over a large dataset.
I intend to use multiple reducers for each job in order to quicken the overall job. As a workflow scheduler I use Oozie.
Here is an illustration to clarify my problem:
Let's say I have two files:
File 1:
A B 1
A C 4
B D 2
File 2:
A B 3
C D 6
B D 1
I'd like to have two mappers and two reducers and get the following output for the MapReduce job:
Output:
A B 4
A C 4
B D 3
C D 6
But this is not at all what I get; instead, I get partial sums.
Here is what I think happens.
Since I have multiple reducers for each MapReduce job, the input of the next job is split into several files. These files are given to the mappers, which then send their output to the reducers. It seems that the mappers send their output to the reducers without waiting for the whole input to be processed and sorted with name1, for example, as the key.
I've read several threads about using multiple files as an input and I don't think it is a matter of performing a map side join. Maybe it has to do with partitioning but I haven't exactly understood what partitioning consists in.
Is there any way to sort the output of several mappers before sending it to the reducers? Or can I tell Oozie to merge the output of several reducers so that the next MapReduce job has only one file as its input?
I'm slightly new to MapReduce, but it looks like your job isn't processing the keys correctly, if you are not getting the desired output based on your example.
By default, Hadoop streaming uses Tab as the field separator and takes everything from the start of a line to the first Tab character as the key. In your case, if your input format is actually "A[space]B[space]1", you'll need to add
-D stream.map.output.field.separator=' ' \
-D stream.num.map.output.key.fields=2 \
to your Hadoop streaming command in order to set space as the column delimiter and the first 2 columns as the key. This will send all the lines that start with "A B" to the same reducer. More info can be found here
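To make the role of the key and of partitioning more concrete, here is a rough local simulation in Python of the two files from the question. It is only a sketch and not Hadoop's actual code: the function names (map_line, partition, run_job) are made up for illustration, and zlib.crc32 stands in for whatever hash the real partitioner uses. The point is that the reducer is chosen from the key alone, so every "A B" line ends up at the same reducer and no partial sums appear.

```python
import zlib
from collections import defaultdict


def map_line(line, num_key_fields=2):
    """Split a space-separated line into (key, value); the first fields form the key."""
    fields = line.split()
    key = " ".join(fields[:num_key_fields])
    return key, int(fields[num_key_fields])


def partition(key, num_reducers):
    """Pick a reducer from the key alone, so equal keys always land together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(input_files, num_reducers=2):
    # Shuffle: route every mapper output record to a reducer by its key.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for lines in input_files:  # one simulated mapper per input file
        for line in lines:
            key, value = map_line(line)
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    # Reduce: each reducer sums the values of the keys it owns, in key order.
    return [
        [f"{key} {sum(values)}" for key, values in sorted(grouped.items())]
        for grouped in reducer_inputs
    ]


file1 = ["A B 1", "A C 4", "B D 2"]
file2 = ["A B 3", "C D 6", "B D 1"]
part_files = run_job([file1, file2])
merged = sorted(line for part in part_files for line in part)
print(merged)
```

Each inner list of part_files plays the role of one reducer's output file. Which keys land in which file depends on the hash, but each key appears in exactly one of them with its full sum.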
I'm looking for a way to process a ~4 GB file that is the result of an Athena query, and I'd like to know:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda: the file is too large, and s3.open(input_file, 'r') does not seem to work in Lambda :(
Are there other AWS services that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB each) to send them to an external source (POST requests).
You can use CTAS in Athena together with its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to have the newly created table in a compressed format such as Parquet, however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. What counts as meaningful depends on your use case and on how you want to split your data. The most common way is to use time partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(some_column_in_your_data)), 1, 1) as partition_char
Or you can use bucketing and provide how many buckets you want:
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100
)
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is recurring and needs to be automated, and you still want to be "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
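If a Python script is easier to package than the split command, the same idea might look roughly like the sketch below. It assumes the CSV has a single header line and no quoted fields containing line breaks; split_csv_lines and the tiny sample data are made up for illustration. For the 3-4 MB pieces mentioned in the question, max_bytes would be something like 3 * 1024 * 1024.

```python
def split_csv_lines(lines, max_bytes):
    """Yield lists of lines; every chunk repeats the header and closes before exceeding max_bytes.

    A single row larger than max_bytes still gets a chunk of its own.
    """
    lines = iter(lines)
    header = next(lines)
    header_size = len(header.encode("utf-8"))
    chunk, size = [header], header_size
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Close the current chunk if adding this row would exceed the limit.
        if len(chunk) > 1 and size + line_size > max_bytes:
            yield chunk
            chunk, size = [header], header_size
        chunk.append(line)
        size += line_size
    if len(chunk) > 1:  # do not emit a chunk that only holds the header
        yield chunk


sample = ["id,v\n", "1,a\n", "2,b\n", "3,c\n"]
chunks = list(split_csv_lines(sample, max_bytes=13))
print(chunks)
```

Because it only ever holds one chunk in memory, the function can be fed an open file object directly, and each yielded chunk can be joined and sent as the body of one POST request.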
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation, you can request a range that is about what you think you can fit in memory, process it, and then request the next. Process each chunk up to its last line break, then request the next chunk and prepend the leftover partial line to it. As long as you make sure that the previous chunk gets garbage collected and you don't aggregate a huge data structure, you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line, you can pad each chunk with that amount (both at the beginning and end). When you read a chunk, you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding, with special handling of the first and last chunk, obviously.
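The sequential variant (one invocation, one range at a time, partial line carried over) could be sketched like this. The helper name iter_lines_by_range is made up, and the range reader is passed in as a function so the example runs against an in-memory blob. With boto3, that function would presumably wrap get_object with a Range argument, as noted in the comment; the bucket and key names there are placeholders.

```python
def iter_lines_by_range(fetch_range, total_size, chunk_size):
    """Yield complete lines (bytes, without the newline) using inclusive byte-range reads."""
    carry = b""  # partial line left over from the previous chunk
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1  # Range ends are inclusive
        data = carry + fetch_range(start, end)
        lines = data.split(b"\n")
        carry = lines.pop()  # text after the last line break (b"" if none is left)
        yield from lines
        start = end + 1
    if carry:  # file did not end with a line break
        yield carry


# Stand-in for S3. With boto3 the reader would presumably look like:
#   s3.get_object(Bucket="my-bucket", Key="my-key",
#                 Range=f"bytes={start}-{end}")["Body"].read()
blob = b"a,1\nbb,22\nccc,333\n"


def fetch_from_memory(start, end):
    return blob[start:end + 1]


rows = list(iter_lines_by_range(fetch_from_memory, len(blob), chunk_size=5))
print(rows)
```

Only one chunk plus one partial line is held at a time, so memory use is bounded by chunk_size and the longest line rather than by the file size.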
Is it possible to tell Pig to output 10 r files, the way MR does when it uses 10 reducers? My Pig script outputs just one r file, which I guess means it is using just one reducer. I have put
SET default_parallel 10;
in my script and in stderr I can see that at the beginning
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 10
but in the middle of MapReduceLauncher it goes back to
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
I do a join, sum two columns, and then compute the average of one column, and I suspect this happens because of avg or group all. Is that correct?
Yes. Quoting from http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#group_by
[...] keep in mind that when using group all, you are necessarily serializing your pipeline. That is, this step and any step after it until you split out the single bag now containing all of your records will not be done in parallel.
Following the instructions on this link, I implemented a wordcount program in C++ using a single mapper and a single reducer. Now I need to use two mappers and one reducer for the same problem.
Can someone please help me with this?
The number of mappers depends on the number of input splits created. The number of input splits depends on the size of the input, the size of a block, the number of input files (each input file creates at least one input split), whether the input files are splittable or not, etc. See also this post in SO.
You can set the number of reducers to as many as you wish. I guess in Hadoop Pipes you can do this by passing -D mapred.reduce.tasks=... when running hadoop. See this post in SO.
If you want to quickly test how your program works with more than one mapper, you can simply put a new file in your input path. This will make Hadoop create another input split and thus another map task.
PS: The link that you provide is not reachable.
Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, 10 1GB files, which might make copying to Redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything
There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:
set mapred.reduce.tasks=10;
Another way you could split into multiple output files would be to have Hive insert the results of your query into a partitioned table. This would result in at least one file per partition. For this to make sense, you must have some reasonable column to partition on. For example, you wouldn't want to partition on a unique id column, or you would have one file for each record. This approach will guarantee at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode; it needs to be set for this query to work).
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE table_to_export_to_redshift (
id INT,
value INT
)
PARTITIONED BY (country STRING);
INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
To get more fine grained control, you can write your own reduce script to pass to hive and have that reduce script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, Pig, or pull them into Linux and break them apart however you like.
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: Splitting files into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.), you might see a performance loss if the files are too small (though 1GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get from parallelizing your Redshift data load.
I have a mapreduce program that first scans an HBase table.
I want some reducer output to go to hdfs and some reducer output to be written to an hbase table. Can a reducer be configured to output to two different locations/formats like this?
A reducer can be configured to write to multiple output files using the MultipleOutputs class. The documentation at the top of that class provides a clear example for writing to multiple files. However, since there is no built-in OutputFormat for writing to HBase, you might consider writing the second stream to a specific place on HDFS and then using another job to insert it into HBase.
If you don't want to write too much code, just open a Table in your mapper's or reducer's setup method and do a put into your HBase table. At the same time, write your job so that its output file is an HDFS file. This way you get to write to both HBase and HDFS.
To elaborate: context.write() writes to the HDFS file, while table.put() writes to the HBase table.
Also, don't forget to close the table and anything else in your cleanup() method. The only drawback is that if there are, let's say, 1000 mappers, your table connection would be opened 1000 times. At any given point, though, only as many mappers as your cluster allows actually run, so that would probably be about 50, depending on your setup. Works for me at least!
I think MultipleOutputs can do the job.
Check this out:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html