A standard DPU in AWS Glue comes with 4 vCPU and 2 executors.
I am confused about the maximum number of concurrent tasks that can run in parallel with this configuration. Is it 4 or 8 on a single DPU with 4 vCPUs and 2 executors?
I had a similar discussion with the AWS Glue support team about this, so I'll share what they told me about Glue configuration. Take, for example, the Standard and the G.1X configurations.
Standard DPU Configuration:
1 DPU reserved for MasterNode
1 executor reserved for Driver/ApplicationMaster
Each DPU is configured with 2 executors
Each executor is configured with 5.5 GB memory
Each executor is configured with 4 cores
G.1X WorkerType Configuration:
1 DPU added for MasterNode
1 DPU reserved for Driver/ApplicationMaster
Each worker is configured with 1 executor
Each executor is configured with 10 GB memory
Each executor is configured with 8 cores
If we have, for example, a job with the Standard configuration and 21 DPUs, that means we have:
1 DPU reserved for Master
20 DPU x 2 = 40 executors
40 executors - 1 Driver/AM = 39 executors
This gives a total of 156 cores, meaning your job has 156 slots for execution. If, for example, you read files from S3, you will be able to process 156 input files in parallel.
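To sanity-check this arithmetic, here is a minimal sketch; the constants (1 DPU for the master, 2 executors per DPU, 1 executor for the driver/ApplicationMaster, 4 cores per executor) are taken from the support team's explanation above, not from any official API:

```python
# Rough sketch of the task-slot math for the Standard worker type,
# using the numbers from the explanation above.
def standard_task_slots(total_dpus: int) -> int:
    worker_dpus = total_dpus - 1        # 1 DPU reserved for the master node
    executors = worker_dpus * 2 - 1     # 2 executors per DPU, minus the driver/ApplicationMaster
    return executors * 4                # 4 cores (task slots) per executor

print(standard_task_slots(21))  # (21 - 1) * 2 - 1 = 39 executors -> 39 * 4 = 156 slots
```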
Hope it helps.
Related
I am running 10 concurrent runs of the same Glue job. The job is taking a lot of DPUs. Does a concurrent run of the same Glue job take more DPUs than running multiple different Glue jobs in parallel?
Generally it should not matter whether you run your jobs in parallel or sequentially. Every job consumes some DPUs, and the cost is directly based on the time it takes. So 1 job running for 10 minutes and 10 jobs running for 1 minute each should result in the same cost.
You can refer to pricing examples in the documentation page.
https://aws.amazon.com/glue/pricing/
Or you can share more details (a screenshot, maybe) of how you're calculating the pricing.
I have searched the AWS Glue documentation but could not find pricing details for the AWS Glue worker types G.1X and G.2X. Can someone please explain whether there is any cost difference between Standard, G.1X, and G.2X?
All I can see in the Glue pricing section is "You are billed $0.44 per DPU-hour in increments of 1 second, rounded up to the nearest second. Glue Spark jobs using Glue version 2.0 have a 1-minute minimum billing duration.". Is this irrespective of the worker type?
Standard type - 16 GB memory, 4 vCPUs of compute capacity, and 50 GB of attached EBS storage (2 executors)
G.1X - 16 GB memory, 4 vCPUs, and 64 GB of attached EBS storage (1 executor)
G.2X - twice that of G.1X (https://aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/), which means 32 GB memory, 8 vCPUs, and 128 GB of EBS!
Appreciate any inputs on this.
Yuva
As you can read in the AWS Glue documentation, when you are using G.1X / G.2X you are allocating a number of workers. Those workers map to DPUs:
For the G.1X worker type, each worker maps to 1 DPU
and
For the G.2X worker type, each worker maps to 2 DPU
That means that G.2X is twice as costly per worker as G.1X. If you are using Standard, you are allocating a specific number of DPUs directly. If you are using Glue 2.0, I would advise you to use either G.1X or G.2X, depending on your use case.
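To make the cost difference concrete, here is a small sketch; the worker-to-DPU mapping follows the documentation quoted above, while the $0.44/DPU-hour rate and the worker counts are illustrative (the rate varies by region):

```python
# Rough hourly cost per worker type, based on the worker -> DPU mapping quoted above.
DPU_HOUR_RATE = 0.44                      # example rate; check the pricing page for your region
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}  # from the documentation quoted above

def hourly_cost(worker_type: str, num_workers: int) -> float:
    return num_workers * DPUS_PER_WORKER[worker_type] * DPU_HOUR_RATE

print(hourly_cost("G.1X", 10))  # 10 workers = 10 DPUs -> $4.40/hour
print(hourly_cost("G.2X", 10))  # 10 workers = 20 DPUs -> $8.80/hour, i.e. twice G.1X
```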
The AWS Batch documentation states that array jobs share common parameters like the job definition, vCPUs, and memory. Does that mean that if I have an array job of size 1000 and 4 vCPUs, each child job will get 4 vCPUs, or will the total vCPUs across all child jobs be 4?
The former. The job definition defines the amount of resources for a given job, and the array job points to that job definition to say "this is what is required for each of the individual jobs in the array".
In your example, each job gets 4 vCPUs. Your compute environment will probably have some maximum number of vCPUs. If that maximum is 8, then only 2 individual jobs run at the same time and the other 998 wait in the queue until resources are free.
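As a sketch of what that looks like with boto3 (the job name, queue, and job definition below are hypothetical; the job definition is assumed to request 4 vCPUs and some memory per job):

```python
import boto3

batch = boto3.client("batch")

# Hypothetical names; the job definition is assumed to request 4 vCPUs per job.
response = batch.submit_job(
    jobName="my-array-job",
    jobQueue="my-job-queue",
    jobDefinition="my-job-definition",  # resource requirements defined here apply per child job
    arrayProperties={"size": 1000},     # 1000 child jobs, each getting its own 4 vCPUs
)
print(response["jobId"])
```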
I just want to understand AWS Glue dev endpoint pricing. What will cost more: two small dev endpoints with 32 GB / 10 DPUs each, or one large dev endpoint with 20 DPUs and 64 GB of RAM?
I want to understand which would be the cheaper option in case we need to submit multiple jobs at the same time.
Thanks in advance
You get charged something like $0.44 per DPU per hour of running time. So if you start a dev endpoint provisioned with 10 DPUs and play around with it for an hour, your bill would be $4.40.
Cost can vary by region - https://aws.amazon.com/glue/pricing/
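Applied to the question above, the comparison comes down to total DPU-hours; here is a quick sketch with an illustrative rate (it varies by region):

```python
# Two 10-DPU dev endpoints vs one 20-DPU dev endpoint, each running for one hour.
DPU_HOUR_RATE = 0.44  # example rate; check the pricing page for your region

two_small_endpoints = 2 * 10 * 1 * DPU_HOUR_RATE  # 20 DPU-hours -> $8.80
one_large_endpoint  = 1 * 20 * 1 * DPU_HOUR_RATE  # 20 DPU-hours -> $8.80

print(two_small_endpoints, one_large_endpoint)  # same cost while they run for the same time
```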
I am planning to use AWS Glue for my ETL process, and I have custom Python code written and run as an AWS Glue job.
I found in the AWS Glue documentation that, by default, AWS Glue allocates 10 DPUs per job. Is there a maximum limit of DPUs for a job? (I do not see anything in the Limits section, i.e., a max-DPUs-per-job limit.)
Or is there an optimal data size in MB / GB that is recommended to avoid out-of-memory errors? Please clarify.
Thanks.
According to the Glue API docs, the max you can allocate per Job execution is 100 DPUs.
AllocatedCapacity – Number (integer).
The number of AWS Glue data processing units (DPUs) allocated to runs of this job. From 2 to 100 DPUs can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.
The limits aren't the same for Python shell Glue jobs (which the OP plans to implement), where you can have a maximum of 1 DPU. Below is the official documentation (as of Aug 2019):
The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.
You can set the value to 0.0625 or 1. The default is 0.0625.
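As a rough sketch of where this setting lives when you create jobs with boto3 (job names, role, and script locations are placeholders; MaxCapacity is the DPU setting described above):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names and S3 paths, for illustration only.
# Spark ETL job: MaxCapacity can range from 2 to 100 DPUs (default 10).
glue.create_job(
    Name="my-spark-etl-job",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/etl.py"},
    MaxCapacity=10.0,
)

# Python shell job: MaxCapacity can only be 0.0625 or 1 (default 0.0625).
glue.create_job(
    Name="my-python-shell-job",
    Role="MyGlueServiceRole",
    Command={"Name": "pythonshell", "ScriptLocation": "s3://my-bucket/scripts/task.py"},
    MaxCapacity=0.0625,
)
```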