I am executing a mapreduce program on AWS and the code is working correctly.
My problem is with the number of map functions that work in parallel.
every time I execute the program, there is only one map function and only one node working in parallel.
my input file contains 100 line with a size of 4 kB. I need to make a map function for each 20 lines that run in parallel.
I tried to change "fs.s3n.block.size" parameter in the config yet nothing has changed.
Thank you.
Related
We have a AWS step function that processes csv files. These CSV files records can be anything from 1 to 4000.
Now, I want to create another inner AWS step function that will process these csv records. The problem is for each record I need to hit another API and for that I want all of the record to be executed asynchronously.
For example - CSV recieved having records of 2500
The step function called another step function 2500 times (The other step function will take a CSV record as input) process it and then store the result in Dynamo or in any other place.
I have learnt about the callback pattern in aws step function but in my case I will be passing 2500 tokens and I want the outer step function to process them when all the 2500 records are done processing.
So my question is this possible using the AWS step function.
If you know any article or guide for me to reference then that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
I have a step function setup which calls preprocessing lambda and inference lambda for a data item. Now, I need to do this process on the entire dataset(over 10000 items). One way is to invoke step function parallelly for each input. Is there a better alternative to this approach?
Another way to do it would be to use Map state to run over an array of items. You could start with a list of item ID's and run a set of tasks for it.
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/
This approach has some drawbacks though:
There is a 256kb limit for input/output data. The initial array of items could possibly be bigger. If you passes an array of ID's only as an input to map state though, 10k items would likely not cross that limit.
Map state doesn't guarantee that all the items will run concurrently. It could possibly be less than 40 at a time (workaround would be to have nested map states or maps of map states). From documentation:
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
Loading many small files (>200000, 4kbyte) from a S3 Bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to get the data, though I cannot exactly figure out where the bottleneck is.
Pig Code Sample
data = load 's3://data-bucket/' USING PigStorage(',') AS (line:chararray)
Hive Code Sample
CREATE EXTERNAL TABLE data (value STRING) LOCATION 's3://data-bucket/';
Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?
I tried the following without any noticeable effects:
Increase #Task Nodes
set hive.optimize.s3.query=true
manually set #mappers
Increase instance type from medium up to xlarge
I know that s3distcp would speed up the process, but I could only get better performance by doing a lot of tweaking including setting #workerThreads and would prefer changing parameters directly in my PIG/Hive scripts.
You can either :
use distcp to merge the file before your job starts : http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
have a pig script that will do it for you, once.
If you want to do it through PIG, you need to know how many mappers are spawned. You can play with the following parameters :
// to set mapper = nb block size. Set to true for one per file.
SET pig.noSplitCombination false;
// set size to have SUM(size) / X = wanted number of mappers
SET pig.maxCombinedSplitSize 250000000;
Please provide metrics for thoses cases
Following the instructions on this link, I implemented a wordcount program in c++ using single mapper and single reducer. Now I need to use two mappers and one reducer for the same problem.
Can someone help me please in this regard?
The number of mappers depends on the number of input splits created. The number of input splits depends on the size of the input, the size of a block, the number of input files (each input file creates at least one input split), whether the input files are splittable or not, etc. See also this post in SO.
You can set the number of reducers to as many as you wish. I guess in hadoop pipes you can do this by setting the -D mapred.reduce.tasks=... when running hadoop. See this post in SO.
If you want to quickly test how your program works with more than one mappers, you can simply put a new file in your input path. This will make hadoop create another input split and thus another map task.
PS: The link that you provide is not reachable.
I have a script that tries to get the times that users start/end their days based on log files. The job always fails before it completes and seems to knock 2 data nodes down every time.
The load portion of the script:
log = LOAD '$data' USING SieveLoader('#source_host', 'node', 'uid', 'long_timestamp', 'type');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL AND $0#'type'=='USER_AUTH';
There are about 6500 files that we are reading from, so it seems to spawn about that many map tasks. The SieveLoader is a custom UDF that loads a line, passes it to an existing method that parses fields from the line and returns them in a map. The parameters passed in are to limit the size of the map to only those fields with which we are concerned.
Our cluster has 5 data nodes. We have quad cores and each node allows 3 map/reduce slots for a total of 15. Any advice would be greatly appreciated!