MapReduce on YARN: how to control how many mapper or reducer tasks run simultaneously?

My MapReduce-based Hive SQL runs on YARN, and the Hadoop version is 2.7.2. What I want is to restrict how many mapper or reducer tasks run simultaneously when a Hive query is really big. I have tried the following parameters, but they are not what I want:
mapreduce.tasktracker.reduce.tasks.maximum: The maximum number of reduce tasks that will be run simultaneously by a task tracker.
mapreduce.tasktracker.map.tasks.maximum: The maximum number of map tasks that will be run simultaneously by a task tracker.
The above two parameters seem to have no effect on my YARN cluster, because YARN has no TaskTracker/JobTracker; those are Hadoop 1.x concepts. I also checked an application of mine that had more than 20 mappers running, while mapreduce.tasktracker.reduce.tasks.maximum was still at its default value of 2.
Then I tried the following two parameters; they are not what I need either:
mapreduce.job.maps: The default number of map tasks per job. Ignored when mapreduce.jobtracker.address is "local".
mapreduce.job.reduces: The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapreduce.jobtracker.address is "local".
mapreduce.job.maps is just a hint for how many splits will be created for map tasks, and mapreduce.job.reduces defines how many reducers will be generated.
But what I want to limit is how many mapper or reducer tasks are allowed to run simultaneously for each application.
In my screenshot below, a YARN application has more than 20 mapper tasks running, which consumes too much of the cluster's resources. I want to limit it to 10 at most.
So, what can I do?

There may be several questions here. First of all, to control whether the reducers for a particular job run at the same time as the mappers, or only after all of the mappers have completed, you need to tweak mapreduce.job.reduce.slowstart.completedmaps.
This parameter is commonly set to 0.8, i.e. 80%: once 80% of the mappers have completed, the reducers start. If you want the reducers to wait until all of the mappers are complete, set it to 1.
As for controlling the number of mappers running at one time, you need to look at setting up either the Fair Scheduler or the Capacity Scheduler.
Using one of these schedulers you can set minimum and maximum resources for the queue a job runs in, which controls how many containers (mappers and reducers each run in a YARN container) run at one time.
There is good information out there about both schedulers.
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
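As a concrete illustration of how the two halves of this answer fit together, here is a minimal sketch in Python. It assumes (these are assumptions, not details from the answer) that a YARN queue named limited has already been defined in fair-scheduler.xml or capacity-scheduler.xml with a capped share of the cluster, and that the big Hive script is saved as big_query.sql. The queue cap is what actually bounds how many containers run at once; the per-job properties just route the job to that queue and delay the reducers.

```python
import subprocess

# Sketch only: "limited" must already exist as a capped queue in the
# Fair/Capacity Scheduler config, and big_query.sql is a placeholder name.
subprocess.check_call([
    "hive",
    # run the job in the capped queue
    "--hiveconf", "mapreduce.job.queuename=limited",
    # do not start any reducers until every mapper has finished
    "--hiveconf", "mapreduce.job.reduce.slowstart.completedmaps=1.0",
    "-f", "big_query.sql",
])
```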

Related

Working around AWS Step Function Map concurrency limit

I have a Map task in an AWS Step Function which executes 100-200 lambdas in parallel, each running for a few minutes, then collects the results. However, I'm running into throttling where not all lambdas are started for some time. The relevant AWS documentation says you may experience throttling with more than ~40 items, which is what I believe I'm running into.
Does anyone have any experience with working around this concurrency limitation? Can I have nested Maps, or could I bucket my tasks into several Maps that I run in parallel?
Use a nested state machine inside your Map state, so you can have ~40 child state machines executing in parallel. Then, inside each child state machine, use a Map state to process ~40 items in parallel.
This way you can process ~1,600 items in parallel.
But before you get that far, you will hit the AWS Step Functions quotas:
https://docs.aws.amazon.com/step-functions/latest/dg/limits.html
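A rough sketch of what the parent state machine for this nested approach could look like, written here as a Python dict that would be serialized to JSON for the Step Functions API. The child state machine ARN and the $.buckets input shape are assumptions for illustration, not part of the answer above.

```python
import json

# Hypothetical child workflow that contains its own Map state over ~40 items.
CHILD_SM_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:process-bucket"

parent_definition = {
    "StartAt": "FanOutBuckets",
    "States": {
        "FanOutBuckets": {
            "Type": "Map",
            "ItemsPath": "$.buckets",   # each bucket holds ~40 items
            "MaxConcurrency": 40,       # outer level of parallelism
            "Iterator": {
                "StartAt": "RunChildWorkflow",
                "States": {
                    "RunChildWorkflow": {
                        "Type": "Task",
                        # synchronous "start a child execution" service integration
                        "Resource": "arn:aws:states:::states:startExecution.sync:2",
                        "Parameters": {
                            "StateMachineArn": CHILD_SM_ARN,
                            "Input.$": "$",  # hand the bucket to the child's Map
                        },
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

print(json.dumps(parent_definition, indent=2))
```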
I ended up working around this 40-item limit by creating 10 copies of the Map task inside a Parallel state and bucketing the task information to split tasks between these 10 copies. This means I can now run ~400 tasks before running into throttling issues. My state machine is essentially a Parallel state whose 10 branches are identical Map states, each fed its own bucket of tasks.
AWS now offers a direct solution to this, called the "distributed map state", announced at re:Invent 2022: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-asl-use-map-state-distributed.html
It allows you to run up to 10,000 concurrent map state tasks. Under the hood, it runs them as child Step Functions workflows, which can be specified to run as either standard or express workflows.
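For reference, the skeleton of a distributed map state looks roughly like the sketch below (again as a Python dict; the S3 bucket/key, the Lambda ARN, and the choice of an express item processor are placeholders, and the exact field names should be checked against the linked docs).

```python
import json

# Sketch of a distributed map: the ItemReader pulls a JSON array of items from
# S3, and each item is processed by a Lambda inside a child express workflow.
distributed_map = {
    "StartAt": "ProcessAllItems",
    "States": {
        "ProcessAllItems": {
            "Type": "Map",
            "MaxConcurrency": 10000,
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:getObject",
                "ReaderConfig": {"InputType": "JSON"},
                "Parameters": {"Bucket": "my-input-bucket", "Key": "items.json"},
            },
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
                "StartAt": "ProcessOneItem",
                "States": {
                    "ProcessOneItem": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

print(json.dumps(distributed_map, indent=2))
```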

Is there a better way for me to architect this batch processing pipeline?

So I have a large dataset (1.5 billion points) on which I need to perform an I/O-bound transform (the same task for each point) and place the result into a store that allows fuzzy searching on the transformed fields.
What I currently have is a Step Function + AWS Batch pipeline feeding into RDS. It works like so:
A Lambda splits the input data into X even partitions.
An array Batch job is created with X array elements matching the X partitions.
The Batch jobs (1 vCPU, 2,048 MB RAM) run on a number of EC2 Spot Instances, transform the data, and place it into RDS.
This current solution (with X = 1,600 workers) runs in about 20-40 minutes, mostly depending on the time it takes to spin up Spot instances for the jobs. The jobs themselves average about 15 minutes of run time. As for total cost, with Spot savings the workers cost ~40 bucks, but the real kicker is the RDS Postgres DB: to handle 1,600 concurrent writes you need at least an r5.xlarge, which is about 500 a month!
Therein lies my problem. It seems I could run the workers faster and cheaper (thanks to per-second pricing) by having, say, 10,000 of them, but then I would need an RDS setup that could somehow handle 10,000 concurrent DB connections.
I've looked high and low and can't find a good solution to this scaling wall I am hitting. Below I'll detail some things I've tried and why they haven't worked for me or don't seem like a good fit.
RDS Proxy - I tried creating 2 proxies, each set to a 50% connection pool, and giving even-numbered jobs one proxy and odd-numbered jobs the other, but that didn't help.
DynamoDB - Off the bat this seems to solve my problem: it is hugely concurrent and can definitely handle the write load, but it doesn't allow fuzzy searching like select * where field LIKE Y, which is a key part of my workflow with the batch job results.
(Theory) - Have the jobs write their results to S3, then trigger a Lambda on new bucket entries to insert them into the DB. (This might be a terrible idea, I'm not sure; a rough sketch of the idea follows after this question.)
Anyway, what I'm after is reducing the cost of running this batch pipeline (mainly the DB), reducing the time it takes to run (to save on Spot costs), or both! I am open to any feedback or suggestions!
Let me know if there's some key piece of info you need that I missed.
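To make the (Theory) bullet above a bit more concrete, here is a hedged sketch of what such a Lambda could look like. It assumes (none of this is specified in the question) that the workers write newline-delimited JSON result files to S3, that psycopg2 is packaged with the function, and that the target table is called transformed_points.

```python
import json
import os

import boto3
import psycopg2  # assumption: shipped with the function via a layer or container image

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by S3 ObjectCreated events; reads a worker's result file and
    bulk-inserts the rows, so only a bounded pool of Lambdas holds database
    connections instead of thousands of Batch workers."""
    conn = psycopg2.connect(os.environ["DB_DSN"])  # e.g. "host=... dbname=... user=... password=..."
    try:
        with conn, conn.cursor() as cur:
            for record in event["Records"]:
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
                rows = [json.loads(line) for line in body.splitlines() if line]
                # executemany keeps the sketch simple; COPY would be faster for large files
                cur.executemany(
                    "INSERT INTO transformed_points (id, payload) VALUES (%s, %s)",
                    [(r["id"], json.dumps(r)) for r in rows],
                )
    finally:
        conn.close()
```

Setting a reserved concurrency on this function would cap how many copies run at once, which in turn caps the number of open DB connections; that cap, rather than the number of Batch workers, then determines the required RDS instance size.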

Are there any problems with running the same cron job, which takes 2 hours to complete, every 10 minutes?

I have a script that takes two hours to run, and I want to run it every 15 minutes as a cron job on a cloud VM.
I noticed that my CPU is often at 100% usage. Should I resize the memory and/or the number of cores?
Each time your cron job fires, a new process is created.
So if the job takes 120 minutes (2 hours) to complete and you start a new one every 15 minutes, you will end up with 8 jobs running at the same time (120 / 15).
Thus, if the jobs are resource intensive, you will observe issues such as 100% CPU usage.
So whether or not to scale up really depends on the nature of these jobs: what do they do, and how much CPU and memory do they take? Based on your description you are already running at 100% CPU often, so an upgrade would be warranted in my view.
It depends on your cron job, but beyond resourcing for your server/application, the following issues should be considered:
Is there overlap in data? I.e., do you retrieve a pool of data that will be processed multiple times?
Will critical actions be duplicated? I.e., will a customer receive an email multiple times, or a payment be processed multiple times?
Is there a chance of a race condition that causes the script to exit early?
Will there be any collisions in the processing, i.e. duplicate bookings made, etc.? (If overlapping runs turn out to be unsafe, see the lock sketch below.)
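None of the answers above spells this out, but one common way to rule the overlap problems out entirely is to have the script take an exclusive, non-blocking lock when it starts and exit immediately if a previous run still holds it. A minimal sketch (the lock path and exit behaviour are arbitrary choices, not from the answers):

```python
import fcntl
import sys

LOCK_PATH = "/tmp/my_job.lock"  # placeholder path


def main():
    lock_file = open(LOCK_PATH, "w")
    try:
        # non-blocking exclusive lock; fails if an earlier run is still going
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print("previous run still in progress; exiting")
        sys.exit(0)
    # ... the two-hour job body goes here ...


if __name__ == "__main__":
    main()
```

With the lock in place, a 15-minute schedule simply means the script restarts shortly after the previous run finishes, rather than piling up eight concurrent copies.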
You will need to increase the CPU and memory specification of your VM instance (in GCP) due to the high CPU load. Document [1] covers upgrading the machine type of a VM instance; to do this you need to shut down the VM instance and change its machine type.
To learn about the different machine types in GCP, see link [2].
Alternatively, you can autoscale based on average CPU utilization if you use a managed instance group (MIG) [3]. This policy tells the autoscaler to collect the CPU utilization of the instances in the group and determine whether it needs to scale. You set the target CPU utilization the autoscaler should maintain, and the autoscaler works to keep utilization at that level.
[1] https://cloud.google.com/compute/docs/instances/changing-machine-type-of-stopped-instance
[2] https://cloud.google.com/compute/docs/machine-types
[3] https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing#scaling_based_on_cpu_utilization

How can aws boto3 submit a final batch job that depends on the completion of all previous jobs?

The boto3 documentation describes how to submit a dependsOn parameter, but a single job can only depend on the completion of a maximum of 20 jobs. How can I submit a job that depends on the completion of an arbitrarily large number of jobs? Can this be done by specifying the final job type as SEQUENTIAL? Or does this need to be done by creating a lower priority queue?
While AWS Batch does limit you to 20 arbitrary job dependencies (you can contact them to see about bumping it), they did introduce array jobs in November 2017.
https://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html
This is for when you want the same basic job step run on a number of machines (i.e. not totally arbitrary jobs). It takes that one job and can break it into up to 10,000 child jobs. Each child is given an index parameter, so you could, for example, pass a large document and have each child job work on a given page number.
Your next job step can then depend on that array job, whether it had 2 or 10,000 children.
Check the documentation for details, especially since dependencies can be configured in different ways.
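A short boto3 sketch of that pattern (the queue and job definition names are placeholders): one array job fans the work out, and the final job declares a single dependency on the array job's id, so Batch holds it until every child index has finished.

```python
import boto3

batch = boto3.client("batch")

# Fan the work out: each child job receives its index in AWS_BATCH_JOB_ARRAY_INDEX.
array_job = batch.submit_job(
    jobName="process-pages",
    jobQueue="my-queue",          # placeholder
    jobDefinition="page-worker",  # placeholder
    arrayProperties={"size": 5000},
)

# The aggregation job depends on the whole array job, i.e. on all of its children.
final_job = batch.submit_job(
    jobName="aggregate-results",
    jobQueue="my-queue",
    jobDefinition="aggregator",   # placeholder
    dependsOn=[{"jobId": array_job["jobId"]}],
)
print(final_job["jobId"])
```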

Hadoop: single node vs cluster performance

I am running three MapReduce jobs in sequence (output of one is the input to another) on a Hadoop cluster with 3 nodes (1 master and 2 slaves).
Apparently, the total time the individual jobs take to finish on a single-node cluster is less than the above by quite a margin.
What could be the possible reasons? Is it the network latency? It's running on a 100 Mbps Ethernet network. Will it help if I increase the number of nodes?
I am using Hadoop Streaming, and my code is in Python 2.7.
MapReduce isn't really meant to handle that small an input dataset. The framework has to decide which nodes will run tasks and then spin up a JVM for each individual map and reduce task (the number of tasks depends on the size of your dataset). That setup usually has a latency on the order of tens of seconds. Shipping non-local data between nodes is also expensive, as it involves sending data over the wire. For such a small dataset, the overhead of setting up a MapReduce job on a distributed cluster is likely higher than the runtime of the job itself.
On a single node you only pay the overhead of starting tasks on the local machine and don't have to copy any data over the network, which is why the job finishes faster there. If you had multi-gigabyte files, you would see better performance across several machines.