I run a job in AWS Glue on 1 MB of data. It takes 2.5 seconds to complete.
The PySpark framework was used for the job.
Going by this, on 1 GB of data the job should take around 2.5 * 1000 = 2500 seconds to complete.
But when I run the job on 1 GB of data, it takes only 20 seconds.
How is this possible?
By default, a Glue job is configured to run with 10 DPUs, where each DPU has 16 GB of RAM and 4 vCores. So in your case, even if you were running the job with only 2 DPUs, you would still be under-utilising the cluster.
The execution time also doesn't scale the way you calculated; there are a lot of additional factors involved. If you want to read more about planning resources for Glue, refer to this link.
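To see the effect of allocated capacity on runtime, here is a minimal boto3 sketch that starts a run with an explicit DPU allocation. The job name is a placeholder, and MaxCapacity applies to older (Glue 1.0-style) jobs; on Glue 2.0+ you would set WorkerType/NumberOfWorkers instead.

```python
# Hypothetical sketch: start a Glue job run with an explicit DPU capacity.
import boto3

glue = boto3.client("glue")

run = glue.start_job_run(
    JobName="my-etl-job",   # placeholder job name
    MaxCapacity=2.0,        # try 2 DPUs vs the default 10 and compare run times
)
print(run["JobRunId"])
```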
I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume the earlier job to proceed further?
Follow these steps:
Open a support ticket to increase "Longest run time for a training job" to 2419200 seconds (28 days). (This can't be adjusted through Service Quotas in the AWS web console.)
Using the SageMaker Python SDK, when creating an Estimator, set max_run=2419200 (see the sketch after these steps).
Implement resume-from-checkpoints in your training script.
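A minimal sketch of the Estimator settings with the SageMaker Python SDK; the image URI, role, bucket, and instance type below are placeholders, and checkpoint_s3_uri is one way to wire checkpoints to S3.

```python
# Hypothetical sketch: Estimator configured for a long run with checkpointing.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image>",
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",                         # placeholder instance type
    max_run=2419200,                                       # 28 days, once the quota is raised
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",   # synced with /opt/ml/checkpoints
)
estimator.fit("s3://<your-bucket>/training-data/")
```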
Also, the questions in #rok's answer are very relevant to consider.
According to the documentation here, the maximum allowed runtime is 28 days, not 5. Please check your configuration.
Edit: you are right, according to the documentation here the maximum runtime for a training job is 5 days. There are multiple things you can do: use more powerful (or multiple) GPUs to reduce training time, or save checkpoints and restart training from them. In any case, 30 days looks like a very long training time (with the associated cost); are you sure you need that?
Actually, you could ask for a service quota increase from here, but as you can see, "Longest run time for a training job" is not adjustable. So I don't think you have any choice other than using checkpoints or more powerful GPUs.
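If you go the checkpoint route, here is a minimal sketch of the resume logic inside the training script. It assumes PyTorch (the original framework isn't stated) and relies on SageMaker syncing /opt/ml/checkpoints with the Estimator's checkpoint_s3_uri.

```python
# Hypothetical checkpoint save/resume sketch for a SageMaker training script.
import os
import torch

CHECKPOINT_DIR = "/opt/ml/checkpoints"            # synced with checkpoint_s3_uri
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, "latest.pt")

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint if SageMaker downloaded one."""
    if os.path.exists(CHECKPOINT_PATH):
        state = torch.load(CHECKPOINT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1                 # continue from the next epoch
    return 0                                      # fresh start

def save_checkpoint(model, optimizer, epoch):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CHECKPOINT_PATH,
    )
```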
I have 1 TB of Parquet data in S3 to be loaded by an AWS Glue Spark job. I am trying to figure out the number of workers needed for this type of requirement.
As per my understanding, these are the details of the G.1X configuration:
1 DPU added for MasterNode
1 DPU reserved for Driver/ApplicationMaster
Each worker is configured with 1 executor
Each executor is configured with 10 GB of memory
Each executor is configured with 8 cores
EBS block of 64GB per DPU
So if I take 50 workers, 1 would be set aside for the driver and 1 for the master node, which leaves me with 48. So 48 * 10 = 480 GB of memory (since each executor takes 10 GB of memory). Also, 64 * 48 = 3072 GB ~ 3 TB of disk, which would be used in case any data spill is required.
So, is this configuration correct? If not, do I need to increase or decrease the workers? Any help is much appreciated. Also, if in the future I have lots of collect operations involved, how can I increase the driver memory, which is 16 GB for now?
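For reference, a quick back-of-the-envelope check of that math; the per-executor figures are the assumptions stated above, not official numbers.

```python
# Back-of-the-envelope capacity check using the figures assumed in the question.
workers = 50
reserved = 2                      # 1 for the master node, 1 for the driver/ApplicationMaster
executors = workers - reserved    # 48

memory_gb = executors * 10        # 10 GB per executor (assumed)   -> 480 GB
disk_gb = executors * 64          # 64 GB EBS per worker (assumed) -> 3072 GB ~ 3 TB

print(f"{executors} executors, {memory_gb} GB memory, {disk_gb} GB disk")
```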
To start with, there is no direct statistical or mathematical formula to come up with the number of DPUs needed, because it depends on the nature of the problem you are trying to solve, as in:
Does the job need to be finished as fast as possible, i.e. maximum parallelism to finish faster?
Will it be a long-lived job or a short-running job?
Will the job parse a lot of small files (in KBs) or large chunks (in hundreds of MBs)?
Cost considerations? The cost is per DPU-hour, so job run duration matters.
Frequency of the job (every hour or once a day)? This helps you determine, for example, whether a job that takes 35 minutes with a low number of DPUs (and therefore finishes within the hour) is acceptable because it saves cost.
Now to your question: assuming you are using Glue 2.0, in order to estimate the number of DPUs (or workers) needed, you should enable job metrics in AWS Glue. They give you the insight required to understand the job execution time, active executors, completed stages, and maximum needed executors, so you can scale your AWS Glue job in or out. Using these metrics you can visualize and determine the optimal number of DPUs for your situation.
You can try running a dummy job, or the actual job once, then use the metrics to determine an optimal number of DPUs from a cost and job-completion-time perspective.
For example, try running with 50 workers, analyze your under-provisioning factor, then use that factor to scale your current capacity.
You can read more on this AWS link and external link.
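As an illustration, here is a hedged boto3 sketch of creating a Glue 2.0 job with G.1X workers and job metrics enabled; the job name, role, and script location are placeholders.

```python
# Hypothetical sketch: Glue 2.0 job with G.1X workers and CloudWatch job metrics enabled.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="parquet-load-job",                                   # placeholder name
    Role="<your-glue-service-role>",
    Command={"Name": "glueetl", "ScriptLocation": "s3://<your-bucket>/scripts/job.py"},
    GlueVersion="2.0",
    WorkerType="G.1X",
    NumberOfWorkers=50,                                        # initial guess; revisit after one run
    DefaultArguments={"--enable-metrics": ""},                 # publishes job metrics to CloudWatch
)
```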
For your other question about increasing the driver memory, I would suggest reaching out to AWS support or trying G.2X, which has 20 GB of driver memory.
I'm using AWS SageMaker Studio and I need to launch an ml.p2.xlarge instance for a training job to run the fit() function of a model. I need to run it multiple times, and I want to know whether AWS charges me for every time I launch an instance or just for the minutes I use it.
For example, if I need to run it three times, would it be cheaper to launch a ml.p2.xlarge instance once and run the training job three times in the span of an hour, or launch the instance three times in that span for 6 minutes each?
The answer is generally to run 3 separate training jobs. This way you only pay for what you use, and there's no idle time wasted. One thing to note is that, per job, you also pay for the overhead of loading the training container, loading data onto the training instance, and the time it takes to stop the instance. As long as this overhead is relatively small, it's worth it.
Example: (6 min net training + 4 min overhead) = 10 min x 3 = 30 min billed, vs. 60 min for keeping one instance for the full hour.
Another benefit of having one job per training run is separate metadata and results per job (metrics, logs, hyperparameters), the ability to compare jobs, quickly clone a job, track job status, etc.
Empirically: you can run one training job and multiply the results by 3 to estimate the total.
In SageMaker Training you pay by the second ("billable seconds"). You can see this figure in the training job details in the web console (or via the describe-training-job API call).
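For completeness, a small boto3 sketch of reading those figures for a finished job; the job name is a placeholder.

```python
# Hypothetical sketch: fetch the billable seconds for a completed training job.
import boto3

sm = boto3.client("sagemaker")

desc = sm.describe_training_job(TrainingJobName="my-training-job")  # placeholder name
print("training seconds:", desc["TrainingTimeInSeconds"])
print("billable seconds:", desc["BillableTimeInSeconds"])
```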
So I have a large dataset (1.5 billion records) on which I need to perform an I/O-bound transform task (the same task for each point) and place the results into a store that allows fuzzy searching on the transformed fields.
What I currently have is a Step Functions + AWS Batch job pipeline feeding into RDS. It works like so:
A Lambda function splits the input data into X even partitions.
An array Batch job is created with X array elements matching the X partitions (a submission sketch follows this list).
The Batch jobs (1 vCPU, 2048 MB of RAM) run on a number of EC2 Spot instances, transform the data, and place it into RDS.
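For context, this is roughly what the array-job submission looks like; the queue and job definition names below are placeholders, not my actual ones.

```python
# Hypothetical sketch: submit an AWS Batch array job, one child job per partition.
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="transform-array-job",
    jobQueue="<your-spot-job-queue>",
    jobDefinition="<your-transform-job-definition>",
    arrayProperties={"size": 1600},   # each child reads AWS_BATCH_JOB_ARRAY_INDEX to pick its partition
)
print(response["jobId"])
```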
This current solution (with X = 1600 workers) runs in about 20-40 minutes, mostly depending on the time it takes to spin up the Spot instances. The jobs themselves average about 15 minutes of run time. As for total cost, with Spot savings the workers cost ~40 dollars, but the real kicker is the RDS Postgres DB: to handle 1600 concurrent writes you need at least an r5.xlarge, which is 500 a month!
Therein lies my problem. It seems I could run the actual workers faster and more cheaply (due to per-second pricing) by having, say, 10,000 workers, but then I would need an RDS setup that could somehow handle 10,000 concurrent DB connections.
I've looked high and low and can't find a good solution to this scaling wall I am hitting. Below I'll detail some things I've tried and why they haven't worked for me or don't seem like a good fit.
RDS Proxy - I tried creating 2 proxies, each set to a 50% connection pool, and giving even-numbered jobs one proxy and odd-numbered jobs the other, but that didn't help.
DynamoDB - at first glance this seems to solve my problem: it is hugely concurrent and can definitely handle the write load, but it doesn't allow fuzzy searching like select * where field LIKE Y, which is a key part of my workflow with the batch job results.
(Theory) - have the jobs write their results to S3, then trigger a Lambda on new bucket entries to insert them into the DB (this might be a terrible idea, I'm not sure; a rough sketch follows this list).
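A rough sketch of what that Lambda could look like, assuming psycopg2 is packaged as a layer, the results are written as JSON lines, and the table/credentials below are placeholders.

```python
# Hypothetical S3-triggered Lambda that loads a result file and inserts rows into RDS.
import json
import os

import boto3
import psycopg2   # assumed to be provided via a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
            rows = [json.loads(line) for line in body.splitlines() if line]
            cur.executemany(
                "INSERT INTO transformed (id, field) VALUES (%(id)s, %(field)s)",  # placeholder table/columns
                rows,
            )
    conn.close()
```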
Anyway, what I'm after is improving the cost of running this batch pipeline (mainly the DB), improving the run time (to save on Spot costs), or both! I am open to any feedback or suggestions!
Let me know if there's some key piece of info you need I missed.
I want to start a Glue ETL job. The execution time itself is fair; however, the time Glue takes to actually start executing the job is too long.
I looked into various documentation and answers, but none of them gave me a solution. There was some explanation of this behavior (cold start) but no solution.
I expect the job to be up as soon as possible; it sometimes takes around 10 minutes to start a job that then executes in 2 minutes.
Unfortunately it's not possible right now. Glue uses EMR under the hood, and it requires some time to spin up a new cluster with the desired number of executors. As far as I know, they keep a pool of spare EMR clusters with some of the most common DPU configurations, so if you are lucky your job can get one and start immediately; otherwise it will wait.