I am planning to use AWS Glue for my ETL process, and have custom Python code written and run as an AWS Glue Job.
I found in the AWS Glue documentation that, by default, AWS Glue allocates 10 DPUs per job. Is there a maximum limit of DPUs for a job? (I do not see anything in the Limits section, i.e., no max-DPUs-per-job limit.)
Or is there an optimal data size in MB / GB that is recommended to avoid out-of-memory errors? Please clarify.
Thanks.
According to the Glue API docs, the max you can allocate per Job execution is 100 DPUs.
AllocatedCapacity – Number (integer).
The number of AWS Glue data processing units (DPUs) allocated to runs of this job. From 2 to 100 DPUs can be allocated; the default is 10. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.
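For anyone setting this through the API rather than the console, here is a minimal boto3 sketch of allocating the documented maximum for a Spark ETL job; the job name, role ARN and script location are placeholders, and MaxCapacity is the newer field that supersedes the deprecated AllocatedCapacity quoted above.

    import boto3

    glue = boto3.client("glue")

    # Create a Spark ETL job ("glueetl") with an explicit capacity.
    # For this job type the allowed range is 2 to 100 DPUs.
    glue.create_job(
        Name="my-etl-job",                                  # placeholder
        Role="arn:aws:iam::123456789012:role/MyGlueRole",   # placeholder
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/etl.py",  # placeholder
            "PythonVersion": "3",
        },
        MaxCapacity=100.0,  # the documented upper bound per job run
    )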
The limits aren't the same for Python shell Glue jobs (which the OP plans to implement), where you can have at most 1 DPU. Below is the official documentation (as of Aug 2019):
The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see the AWS Glue pricing page.
You can set the value to 0.0625 or 1. The default is 0.0625.
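So for pure-Python code created as a Python shell job, the capacity can only be 0.0625 or 1 DPU. A minimal boto3 sketch (role and script path are placeholders):

    import boto3

    glue = boto3.client("glue")

    # Python shell jobs ("pythonshell") accept only 0.0625 or 1 as MaxCapacity.
    glue.create_job(
        Name="my-python-shell-job",                         # placeholder
        Role="arn:aws:iam::123456789012:role/MyGlueRole",   # placeholder
        Command={
            "Name": "pythonshell",
            "ScriptLocation": "s3://my-bucket/scripts/shell_job.py",  # placeholder
            "PythonVersion": "3",
        },
        MaxCapacity=1.0,  # or 0.0625, which is the default
    )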
I have searched the AWS Glue documents but could not find the pricing details for the AWS Glue worker types G.1X and G.2X. Can someone please clarify whether there is any cost difference between Standard, G.1X and G.2X?
All I can see in the Glue pricing section is: "You are billed $0.44 per DPU-hour in increments of 1 second, rounded up to the nearest second. Glue Spark jobs using Glue version 2.0 have a 1-minute minimum billing duration." Is this irrespective of the worker type?
Standard - 16 GB memory, 4 vCPUs of compute capacity, and 50 GB of attached EBS storage (2 executors)
G.1X - 16 GB memory, 4 vCPUs, and 64 GB of attached EBS storage (1 executor)
G.2X - twice that of G.1X (https://aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/), which means:
G.2X - 32 GB memory, 8 vCPUs, 128 GB of EBS!
Appreciate any inputs on this.
Yuva
As you can read here:
When you are using G.1X / G.2X, you are allocating a number of workers, and those workers map to DPUs.
For the G.1X worker type, each worker maps to 1 DPU
and
For the G.2X worker type, each worker maps to 2 DPU
That means that G.2X is twice as costly as G.1X. If you are using Standard, you allocate a specific number of DPUs directly. If you are using Glue 2.0, I would advise you to use either G.1X or G.2X, depending on your use case.
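To make the mapping concrete, here is a hedged boto3 sketch (job name, role and script path are placeholders) of requesting G.2X workers, together with the DPU-hour arithmetic implied by the quotes above:

    import boto3

    glue = boto3.client("glue")

    # With worker types you specify a number of workers; Glue derives the DPUs.
    glue.create_job(
        Name="my-g2x-job",                                  # placeholder
        Role="arn:aws:iam::123456789012:role/MyGlueRole",   # placeholder
        Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/etl.py"},
        GlueVersion="2.0",
        WorkerType="G.2X",
        NumberOfWorkers=10,
    )

    # Billing stays per DPU-hour, so the worker-to-DPU mapping drives the cost.
    dpu_per_worker = {"G.1X": 1, "G.2X": 2}
    workers, hours, rate = 10, 1, 0.44  # $0.44/DPU-hour from the pricing page
    for wtype, dpus in dpu_per_worker.items():
        print(wtype, workers * dpus * hours * rate)  # G.1X: 4.4, G.2X: 8.8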
Just wanted to understand the Dev Endpoint part of AWS Glue pricing. What will cost more: two small Dev Endpoints, say 10 DPU / 32 GB each, or one large Dev Endpoint with specs like 20 DPU / 64 GB RAM?
I want to understand which will be the cheaper option in case we need to submit multiple jobs at the same time.
Thanks in advance
You get charged something like $0.44 per DPU per hour running. So if you start a dev endpoint provisioned with 10 DPUs and play around with it for an hour, your bill would be $4.40.
Cost can vary by region - https://aws.amazon.com/glue/pricing/
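Under that model, the two setups asked about above add up to the same 20 DPUs and therefore roughly the same hourly rate; a quick sketch, assuming the $0.44/DPU-hour price:

    RATE = 0.44  # $ per DPU-hour; varies by region

    two_small = 2 * 10 * RATE   # two 10-DPU dev endpoints running
    one_large = 1 * 20 * RATE   # one 20-DPU dev endpoint running
    print(two_small, one_large)  # 8.8 vs 8.8 per hour while both are up

    # The practical difference is idle time: two endpoints you can delete
    # independently may end up cheaper than one large endpoint left running.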
I submitted a GCP Dataflow pipeline to receive my data from GCP Pub/Sub, parse it and store it to GCP Datastore. It seems to work perfectly.
Over 21 days, I found the cost is $144.54 and the worked time is 2,094.72 hours. It means that after I submitted it, it is charged every second, even when it does not receive (process) any data from Pub/Sub.
Is this behavior normal? Or did I set a wrong parameter?
I thought CPU time would only be counted when data is received.
Is there any way to reduce the cost with the same working model (receive from Pub/Sub and store to Datastore)?
Cloud Dataflow service usage is billed in per-second increments, on a per-job basis. I guess your job used 4 n1-standard-1 workers, which used 4 vCPUs, giving an estimated 2,000 vCPU-hours of resource usage. Therefore, this behavior is normal. To reduce the cost, you can use either autoscaling, to specify the maximum number of workers, or the pipeline options, to override the resource settings that are allocated to each worker. Depending on your needs, you could also consider using Cloud Functions, which costs less, but keep its limits in mind.
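For a streaming Pub/Sub-to-Datastore pipeline, those settings might look like the following sketch with the Apache Beam Python SDK (project, region, bucket and machine type are assumptions; adjust to your setup):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Cap the worker pool and pick a small machine type so an idle streaming
    # job bills for less. Project, region and bucket are placeholders.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        streaming=True,
        autoscaling_algorithm="THROUGHPUT_BASED",
        max_num_workers=2,
        machine_type="n1-standard-1",
    )
    # Pass these options to beam.Pipeline(options=options) when building the
    # Pub/Sub -> parse -> Datastore pipeline.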
Hope it helps.
I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR and Glue.
I have considered 6 DPUs (4 vCPUs + 16 GB memory each) with the ETL job running for 10 minutes a day for 30 days. Expected crawler requests are assumed to be 1 million above the free tier and are calculated at $1 for the 1 million additional requests.
On EMR, I have considered m3.xlarge for both EC2 and EMR (priced at $0.266 and $0.070 per hour respectively) with 6 nodes, running for 10 minutes a day for 30 days.
On calculating for a month, I see that AWS Glue works out to around $14.64, whereas EMR works out to around $10.08. I have not taken into account other additional expenses such as S3, RDS, Redshift, etc., or the Dev Endpoint (which is optional), since my objective is to compare ETL job prices.
It looks like EMR is cheaper than AWS Glue. Is the EMR pricing correct? Can someone please point out if anything is missing? I have tried the AWS price calculator for EMR, but I am confused, and it is not clear to me whether normalized instance hours are billed into it.
Regards
Yuva
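The arithmetic behind those monthly figures can be sketched as follows; the breakdown of the remaining ~$0.44 on the Glue side is an assumption (presumably crawler DPU time):

    # Monthly cost sketch for the scenario in the question: a 10-minute run
    # per day for 30 days. Prices are taken from the question itself.
    DPU_RATE = 0.44              # $ per DPU-hour (Glue)
    EC2, EMR_FEE = 0.266, 0.070  # m3.xlarge per-hour prices

    glue_job = 6 * DPU_RATE * (10 / 60) * 30          # 6 DPUs, 10 min/day
    glue_total = glue_job + 1.00                      # + $1 of crawler requests
    emr_total = 6 * (EC2 + EMR_FEE) * (10 / 60) * 30  # 6 m3.xlarge nodes

    print(round(glue_job, 2), round(glue_total, 2), round(emr_total, 2))
    # 13.2 14.2 10.08 -- close to the $14.64 and $10.08 quoted above; the
    # remaining ~$0.44 on the Glue side presumably comes from crawler DPU time.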
Yes, EMR does work out to be cheaper than Glue. This is because Glue is meant to be serverless and fully managed by AWS, so the user doesn't have to worry about the infrastructure running behind the scenes, whereas EMR requires a whole lot of configuration to set up. So it's a trade-off between user-friendliness and cost, and for more technical users EMR can be the better option.
#user2889316 - Did you check my question, where I had provided the comparison numbers?
Also please note Glue is roughly $0.44 per hour per DPU for a job. I don't think you will have any AWS Glue job that is expected to be running throughout the day? Are you talking about the Glue Dev Endpoint or the job?
An AWS Glue job requires a minimum of 2 DPUs to run, which means $0.88 per hour, which I think is roughly $21 per day. This is only for the Glue job; there are additional charges such as S3, and any database / connection charges / crawler charges, etc.
The corresponding instance for EMR is m3.xlarge, and its charges are $0.266 and $0.070 per hour (EC2 and EMR respectively). This would be approximately $16 per day for 2 instances, plus other S3, database charges, etc. I am considering 2 EMR instances against the 2-DPU minimum for an AWS Glue job.
Hope this would give you an idea.
Thanks
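A quick sketch of that round-the-clock comparison, using the prices quoted in the comment above:

    # Per-day cost if the Glue job / EMR cluster runs 24 hours.
    glue_per_day = 2 * 0.44 * 24             # 2 DPUs (the Glue job minimum)
    emr_per_day = 2 * (0.266 + 0.070) * 24   # 2 m3.xlarge nodes (EC2 + EMR fee)
    print(round(glue_per_day, 2), round(emr_per_day, 2))  # 21.12 vs 16.13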
If your infrastructure doesn't need drastic scaling (and is mostly a fixed configuration), use EMR. But if it does, Glue is the better choice as it is serverless: by just changing the DPUs, your infrastructure is scaled. In EMR, however, you have to decide on the cluster type, the number of nodes and the auto-scaling rules. For each change, you will need to modify the cluster-creation script, test it and deploy it, which basically adds the overhead of a standard release cycle to every infra change. With a change in infra config, you may also want to change the Spark config to optimize jobs accordingly, so the time to release a new version is higher when the infra configuration changes. If you start with a high configuration, it will cost more; if you start with a low configuration, you will need frequent changes to the script.
Having said that, AWS Glue has a fixed infra configuration for each DPU, i.e. 16 GB of memory per DPU. If your ETL demands more memory per core, you may have to shift to EMR. However, if your ETL is designed in such a way that it will not exceed 11 GB of driver memory with 1 executor, or 5.5 GB with 2 executors (e.g. take additional data volume in parallel on a new core, or divide the volume into 5 GB / 11 GB batches and run them in a for loop on the same core), Glue is the right choice.
If your ETL is complex and all jobs are going to keep the cluster busy throughout the day, I would recommend going with EMR, with a dedicated DevOps team to manage the EMR infra.
If you use EMR Spot instances instead of On-Demand, it will cost about a third of the On-Demand price and will turn out to be much cheaper. AWS Glue doesn't have that pricing benefit.
AWS Glue documentation regarding pricing reads:
A Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each ETL job. You are billed $0.44 per DPU-hour in increments of 1 minute, rounded up to the nearest minute, with a 10-minute minimum duration for each ETL job.
I want to reduce the number of DPUs allocated to my ETL job. I searched for this option in the Glue console but didn't find it. Can you please let me know how I can do that?
Thanks
To reduce the number of DPUs, go to the AWS Glue jobs console. Select the job and, under Action, choose Edit job. Under "Script libraries and job parameters", you should see "Concurrent DPUs per job run". You can provide an integer value there to increase or reduce the number of DPUs.
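If you prefer not to use the console, the capacity can also be overridden per run through the API; a hedged boto3 sketch (the job name is a placeholder):

    import boto3

    glue = boto3.client("glue")

    # Override the capacity for a single run instead of editing the job itself.
    # For a Spark ETL job the allowed range is 2 to 100 DPUs.
    glue.start_job_run(
        JobName="my-etl-job",   # placeholder job name
        MaxCapacity=2.0,        # run this execution with the 2-DPU minimum
    )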