If we are creating 100 batch jobs at a time and each batch job takes around 2 minutes, say, then I need to understand whether the batch jobs will run sequentially or in parallel. If sequentially, can I say that we need to wait for the previous job to complete before the next batch job starts?
I ran a job in AWS Glue on 1 MB of data. It took 2.5 seconds to complete.
The PySpark framework was used for the job.
So going by this, on 1 GB of data, the job should take around 2.5 * 1000 = 2500 seconds to complete.
But when I ran the job on 1 GB of data, it took only 20 seconds.
How is this possible?
By default a Glue job is configured to run with 10 DPUs, where each DPU has 16 GB RAM and 4 vCores. So in your case, even if you are running the job with 2 DPUs, you are still under-utilising the cluster.
Also, the execution time doesn't really scale the way you calculated; there are a lot of additional factors involved. If you want to read more about planning resources for Glue, refer to this link.
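As a side note, the capacity for a particular run can also be set explicitly rather than relying on the default. A minimal sketch with boto3 (the job name and region below are placeholders), requesting 2 DPUs for a single run:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

    # Start a run of a hypothetical job, explicitly requesting 2 DPUs.
    # MaxCapacity is the number of DPUs allocated to this run.
    response = glue.start_job_run(
        JobName="my-etl-job",  # placeholder job name
        MaxCapacity=2.0,
    )

    print("Started run:", response["JobRunId"])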
I want to start a Glue ETL job. The execution time itself is fair, but the time Glue takes to actually start executing the job is far too long.
I looked into various documentation and answers, but none of them gave me a solution. There was some explanation of this behaviour (cold start) but no solution.
I expect to have the job up ASAP; it sometimes takes around 10 minutes to start a job that executes in 2 minutes.
Unfortunately it's not possible right now. Glue uses EMR under the hood, and it requires some time to spin up a new cluster with the desired number of executors. As far as I know they keep a pool of spare EMR clusters with some of the most common DPU configurations, so if you are lucky your job can get one and start immediately; otherwise it will wait.
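If you want to measure how much of the wall-clock time goes to provisioning, the job run metadata exposes both the start/completion timestamps and an execution time. A rough sketch with boto3 (job name and run id are placeholders) that approximates the startup/teardown overhead by comparing the two:

    import boto3

    glue = boto3.client("glue")

    # Fetch metadata for a finished run (job name and run id are placeholders).
    run = glue.get_job_run(JobName="my-etl-job", RunId="jr_0123456789abcdef")["JobRun"]

    wall_clock = (run["CompletedOn"] - run["StartedOn"]).total_seconds()
    executing = run["ExecutionTime"]  # seconds the run actually consumed resources

    print(f"wall clock: {wall_clock:.0f}s, executing: {executing}s, "
          f"overhead: {wall_clock - executing:.0f}s")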
I am wondering what the scheduling strategy behind AWS Batch looks like. The official documentation on this topic doesn't provide much detail:
The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.
(https://docs.aws.amazon.com/batch/latest/userguide/job_scheduling.html)
"Approximately" fifo is quite vaque. Especially as the execution order I observed when testing AWS Batch did't look like fifo.
Did I miss something? Is there a possibility to change the scheduling strategy, or configure Batch to execute the jobs in the exact order in which they were submitted?
I've been using Batch for a while now, and it has always seemed to behave in roughly a FIFO manner. Jobs that are submitted first will generally be started first, but because of limitations with distributed systems, this general rule won't work out perfectly. Jobs with dependencies are kept in the PENDING state until their dependencies have completed, and then they go into the RUNNABLE state. In my experience, whenever Batch is ready to run more jobs from the RUNNABLE state, it picks the job with the earliest time submitted.
However, there are some caveats. First, if Job A was submitted first but requires 8 cores while Job B was submitted later but only requires 4 cores, Job B might be selected first if Batch has only 4 cores available. Second, after a job leaves the RUNNABLE state, it goes into STARTING while Batch downloads the Docker image and gets the container ready to run. Depending on a number of factors, jobs that were submitted at the same time may spend more or less time in the STARTING state. Finally, if a job fails and is retried, it goes back into the PENDING state with its original submit time. When Batch decides to select more jobs to run, it will generally select the job with the earliest submit date, which will be the job that failed. If other jobs started before the first job failed, the first job will begin its second run after those other jobs.
There's no way to configure Batch to be perfectly FIFO because it's a distributed system, but generally if you submit jobs with the same compute requirements spaced a few seconds apart, they'll execute in the same order you submitted them.
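To illustrate the dependency handling described above, here is a minimal sketch with boto3 (queue and job definition names are placeholders) that submits two jobs, where the second stays PENDING until the first completes:

    import boto3

    batch = boto3.client("batch")

    # Submit the first job (queue and job definition names are placeholders).
    first = batch.submit_job(
        jobName="step-1",
        jobQueue="my-queue",
        jobDefinition="my-job-def",
    )

    # Submit a second job that depends on the first; it stays in PENDING
    # until the first job completes, then moves to RUNNABLE.
    second = batch.submit_job(
        jobName="step-2",
        jobQueue="my-queue",
        jobDefinition="my-job-def",
        dependsOn=[{"jobId": first["jobId"]}],
    )

    print(first["jobId"], second["jobId"])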
I'm currently working on a POC and primarily focusing on Dataflow for ETL processing. I have created the pipeline using the Dataflow 2.1 Java Beam API, and it takes about 3-4 minutes just to initialise, plus about 1-2 minutes for termination on each run. However, the actual transformation (ParDo) takes less than a minute. Moreover, I tried running the jobs using different approaches:
Running the job on local machine
Running the job remotely on GCP
Running the job via Dataflow template
But it looks like all the above methods consume more or less the same time for initialisation and termination. So this is becoming a bottleneck for the POC, as we intend to run hundreds of jobs every day.
I'm looking for a way to share the initialisation/termination time across all jobs so that it is a one-time activity, or for any other approach to reduce the time.
Thanks in advance!
From what I know, there is no way to reduce startup or teardown time. You shouldn't consider that to be a bottleneck, as each run of a job is independent of the last one, so you can run them in parallel, etc. You could also consider converting this to a streaming pipeline, if that's an option, to eliminate those times entirely.
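Since each run is independent, one practical workaround is to launch the jobs concurrently so the fixed startup cost overlaps rather than accumulates. A rough sketch in Python (assuming the gcloud CLI is installed and authenticated; the template path, region, and job names are placeholders):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Placeholder location of a staged Dataflow template.
    TEMPLATE = "gs://my-bucket/templates/my-etl-template"

    def launch(job_name):
        # Launch one Dataflow job from the template via the gcloud CLI.
        subprocess.run(
            ["gcloud", "dataflow", "jobs", "run", job_name,
             "--gcs-location", TEMPLATE,
             "--region", "us-central1"],  # region is an assumption
            check=True,
        )

    # Kick off several jobs at once so their startup phases overlap.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(launch, [f"etl-job-{i}" for i in range(4)]))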
I have an Oozie workflow that runs a MapReduce job within a particular queue on the cluster.
I have to add more input sources/clients to this job, so this job will be processing n times more data than what it does today.
My question is: if, instead of having one big job processing all the data, I break it down into multiple jobs, one per source, will I reduce the total amount of time the jobs take to complete?
I know MapReduce breaks a job down into smaller tasks anyway and spreads them across the grid, so one big job should be the same as multiple small jobs.
Also, the capacity allocation within the queue is done on a 'per user' basis [1], so no matter how many jobs are submitted under one user, the capacity allocated to the user will be the same. Or is there something I am missing?
So will my jobs really run any faster if broken down into smaller jobs?
Thanks.
[1] https://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html#Resource+allocation