How to monitor and control DPU usage in AWS Glue Crawlers

The docs say that AWS allocates 10 DPUs per ETL job and 5 DPUs per development endpoint by default, and that both can be configured with a minimum of 2 DPUs.
They also mention that crawling is billed in per-second increments with a 10-minute minimum per run, but nowhere do they specify how many DPUs a crawler is allocated. Jobs and development endpoints can be configured in the Glue console to consume fewer DPUs, but I haven't seen any such setting for crawlers.
Is there a fixed amount of DPUs per crawler? Can we control that amount?

This is my conversation with AWS Support about this subject:
Hello, I'd like to know how many DPUs a crawler uses in order to
calculate my costs with crawlers.
Their answer:
Dear AWS Customer,
Thank you for reaching out today. My name is Safari, I will assist
with your case.
I understand that while compiling the cost of your Glue crawlers,
you'd like to know the amount of DPUs a particular crawler uses.
Unfortunately, there is no direct way to find out the DPU consumption
by a given crawler. I apologize for the inconvenience. However, you
may see the total DPU consumption across all crawlers in your detailed
bill under the section AWS Service Charges > Glue > {region} > AWS
Glue CrawlerRun. Additionally, you can add tags to your crawlers and
then enable "Cost Allocation Tags" from your AWS Billing and Cost
Management console. This would allow AWS to generate a cost allocation
report grouped by the predefined tags. For more on this, please see
the documentation link below [1].
I hope this helps. Please let me know if I can provide you with any
other assistance.
References [1]:
https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html
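
For reference, a minimal boto3 sketch of the tagging approach support describes might look like this (the crawler ARN, tag key, account ID and date range are placeholders, and the tag key must be activated as a cost allocation tag in the Billing console before any costs appear under it):

```python
# Sketch only: tag a crawler, then ask Cost Explorer for Glue costs grouped by that tag.
# The crawler ARN, tag key/value and dates below are made up for illustration.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
ce = boto3.client("ce", region_name="us-east-1")

# Glue resources are tagged by ARN; account ID and crawler name are placeholders.
crawler_arn = "arn:aws:glue:us-east-1:123456789012:crawler/orders-crawler"
glue.tag_resource(ResourceArn=crawler_arn, TagsToAdd={"team": "analytics"})

# Once the "team" key is activated as a cost allocation tag, Cost Explorer can
# break the Glue charges down by it.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-01-01", "End": "2020-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AWS Glue"]}},
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)
for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])
```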

I discussed this with the AWS support team as well, and currently it is not possible to modify or view the DPU configuration details for Glue crawlers. But do crawlers use DPUs at all?

Related

S3 write concurrency using AWS Glue

I suspect we are hitting an S3 write concurrency issue with an AWS Glue job. I am testing with 10 DPUs writing 10k objects of 1 MB each (~10 GB total), and the write stage alone is taking 2+ hours. It seems like across 10 DPUs I should be able to distribute the writes well enough to get much better throughput. I am writing to several different bucket prefixes and don't believe I'm being throttled by S3.
I see that my job is using EMRFS (the default S3 FileSystem API implementation for Glue), which as I understand it is good for write throughput. I found some suggestions to adjust fs.s3.maxConnections and hive.mv.files.threads and to set hive.blobstore.use.blobstore.as.scratchdir = false.
Where can I see the current values of these settings in my Glue jobs, and how can I configure them? While I see many settings and configurations in the Spark UI logs I can generate from my jobs, I'm not finding these.
How can I see the actual S3 write concurrency each worker in the job is getting? Is this something I can see in the Spark UI logs, or is there another metric somewhere that would show it?
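
One thing worth trying (not confirmed to work in Glue, since these are EMRFS/Hive settings and Glue's runtime is only partly documented) is reading and overriding the Hadoop configuration from inside the job script itself, roughly like this:

```python
# Sketch only: inspect and override these keys from within the Glue PySpark script.
# Whether Glue's EMRFS honours every key is uncertain, so treat this as an
# experiment; the override values are illustrative.
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
hadoop_conf = sc._jsc.hadoopConfiguration()

# Print the current values (None means the key is unset and the EMRFS default applies).
for key in ("fs.s3.maxConnections",
            "hive.mv.files.threads",
            "hive.blobstore.use.blobstore.as.scratchdir"):
    print(key, "=", hadoop_conf.get(key))

# Override them for this run.
hadoop_conf.set("fs.s3.maxConnections", "100")
hadoop_conf.set("hive.mv.files.threads", "20")
hadoop_conf.set("hive.blobstore.use.blobstore.as.scratchdir", "false")
```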

EBS storage for Amazon Elasticsearch

I'm learning AWS for a university course.
About 20 days ago I started learning Elasticsearch because I need queries that DynamoDB can't do.
I'm trying to use only the Free Tier. I created some domains, put data in through Lambda (around 100 KiB), and then deleted it.
Then I checked the billing and realized that 4.9 GB of EBS storage had been used. The Free Tier provides 10 GB per month, but the problem is that I don't know how I used all that storage or whether there is a way to limit it, because I don't want to exceed the usage limits.
I would be grateful for any explanation or advice on how not to exceed the limit.
I'm not aware of a preventive step that can cap your billing.
However, with a CloudWatch billing alarm you would be notified as soon as your charges breach the billing threshold.
Please have a look at the detailed AWS documentation on it.
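
A minimal boto3 sketch of such a billing alarm, assuming an existing SNS topic for the notifications, could look like this:

```python
# Sketch only: a CloudWatch alarm on the EstimatedCharges metric, as suggested above.
# Billing metrics live in us-east-1 and "Receive Billing Alerts" must first be enabled
# in the Billing preferences; the SNS topic ARN and $5 threshold are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-charges-over-5-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                # 6 hours; billing data is only published a few times a day
    EvaluationPeriods=1,
    Threshold=5.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
)
```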

AWS Glue request limit

I have some Lambdas that request schemas from AWS Glue. I would like to know whether there is a limit on requests to AWS Glue beyond which it can no longer keep up (load testing, in other words).
I have not found anything about this in the official documentation.
Thanks
The various default, per-region limits for the AWS Glue service are listed at the below link. You can request increases to these limits via the support console.
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-service-limits.html
These limits are not a guaranteed capacity unless there is an SLA defined for the service, which I don't think Glue has. One would assume EC2 is the backing service, though, so capacity should theoretically not be an issue. Without an SLA, you will only know the true availability of the service by running your workload over a long period of time.
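
On the caller's side, one hedged mitigation is to give the Glue client in the Lambda an explicit retry policy so bursts that hit API rate limits back off instead of failing outright (database and table names below are made up):

```python
# Sketch only: a Glue client with adaptive retries for use inside the Lambda.
import boto3
from botocore.config import Config

glue = boto3.client(
    "glue",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

def fetch_schema(database, table):
    # GetTable returns the table definition, including its column schema.
    response = glue.get_table(DatabaseName=database, Name=table)
    return response["Table"]["StorageDescriptor"]["Columns"]

print(fetch_schema("analytics_db", "events"))
```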
have a look here:
https://docs.aws.amazon.com/general/latest/gr/glue.html
as of today (2020-01-27), for example:
Number of jobs per trigger: 50

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between the two.
I have considered 6 DPUs (4 vCPUs + 16 GB memory each) with the ETL job running for 10 minutes a day for 30 days. Expected crawler requests are assumed to be 1 million above the free tier, which is calculated at $1 for the additional 1 million requests.
On EMR I have considered m3.xlarge for both EC2 and EMR (priced at $0.266 and $0.070 per hour respectively) with 6 nodes, also running for 10 minutes a day for 30 days.
Calculating for a month, AWS Glue works out to around $14.64, whereas EMR works out to around $10.08. I have not taken into account other expenses such as S3, RDS, Redshift, etc., or the dev endpoint (which is optional), since my objective is to compare the ETL job costs.
It looks like EMR is cheaper than AWS Glue. Is the EMR pricing correct? Can someone please point out anything I am missing? I have tried the AWS price calculator for EMR, but I'm confused and not clear on whether normalized instance hours are billed into it.
Regards
Yuva
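
As a rough sanity check of those numbers, here is a back-of-the-envelope calculation using only the prices quoted in the question ($0.44 per DPU-hour for Glue, $0.266 + $0.070 per m3.xlarge-hour for EC2 + EMR); real bills also apply per-run minimums and rounding:

```python
# Rough check of the question's numbers, using the quoted prices only.
glue_dpu_hours = 6 * (10 / 60) * 30          # 6 DPUs, 10 min/day, 30 days = 30 DPU-hours
glue_etl = glue_dpu_hours * 0.44             # $13.20
glue_total = glue_etl + 1.00                 # + ~$1 assumed for crawler/catalog requests

emr_node_hours = 6 * (10 / 60) * 30          # 6 nodes, 10 min/day, 30 days = 30 node-hours
emr_total = emr_node_hours * (0.266 + 0.070) # $10.08

print(f"Glue ~ ${glue_total:.2f}")           # ~$14.20, in the same ballpark as the $14.64 above
print(f"EMR  ~ ${emr_total:.2f}")            # $10.08
```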
Yes, EMR does work out to be cheaper than Glue. That is because Glue is serverless and fully managed by AWS, so the user doesn't have to worry about the infrastructure running behind the scenes, whereas EMR requires a lot of configuration to set up. It's a trade-off between user friendliness and cost, and for more technical users EMR can be the better option.
#user2889316 - Did you check my question, where I provided the comparison numbers?
Also, please note Glue is roughly $0.44 per DPU-hour for a job. I don't think you will have any AWS Glue job that is expected to run throughout the day? Are you talking about the Glue dev endpoint or the job?
An AWS Glue job requires a minimum of 2 DPUs to run, which means $0.88 per hour, or roughly $21 per day if it runs continuously. This is only for the Glue job; there are additional charges such as S3, any database/connection charges, crawler charges, etc.
The corresponding EMR instance is m3.xlarge, charged at $0.266 (EC2) and $0.070 (EMR) per hour. That would be approximately $16 per day for 2 instances, plus the other S3, database charges, etc. I am considering 2 EMR instances against the default DPUs for an AWS Glue job.
Hope this gives you an idea.
Thanks
If your infrastructure doesn't need drastic scaling (and is mostly a fixed configuration), use EMR. If it does, Glue is the better choice because it is serverless: by just changing the DPUs, your infrastructure is scaled. With EMR you have to decide on the cluster type, the number of nodes, and the auto-scaling rules. For each change you need to modify the cluster creation script, test it, and deploy it, which adds the overhead of a standard release cycle to every infrastructure change. With a change in infrastructure configuration you may also want to change the Spark config to optimize jobs accordingly, so the time to release a new version is higher whenever the infrastructure changes. If you start with a high configuration, it will cost more; if you start with a low configuration, you will need frequent changes to the script.
Having said that, AWS Glue has a fixed infrastructure configuration per DPU, e.g. 16 GB of memory. If your ETL demands more memory per core, you may have to shift to EMR. However, if your ETL is designed in such a way that it will not exceed 11 GB of driver memory with 1 executor, or 5.5 GB with 2 executors (e.g. take additional data volume in parallel on a new core, or divide the volume into 5 GB/11 GB batches and run them in a for loop on the same core), Glue is the right choice.
If your ETL is complex and the jobs are going to keep a cluster busy throughout the day, I would recommend going with EMR, with a dedicated DevOps team to manage the EMR infrastructure.
If you use EMR Spot Instances instead of On-Demand, they can cost around a third of the On-Demand price and turn out to be much cheaper. AWS Glue doesn't have that pricing benefit.
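
To illustrate the "just change the DPUs" point, a minimal sketch of overriding a job's capacity for a single run (the job name and the 10-DPU figure are hypothetical) could be:

```python
# Sketch only: with Glue, capacity is a single number that can be overridden per run,
# instead of reshaping a cluster.
import boto3

glue = boto3.client("glue")

# Run the same job with 10 DPUs instead of its default capacity.
run = glue.start_job_run(JobName="nightly-etl", MaxCapacity=10.0)
print(run["JobRunId"])
```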

Monitoring usage amount for each user in Amazon AWS

We have a central Amazon AWS account and 87 users. As far as I can tell, there is no way to allocate credit to each individual user (for example, $40 of the account's total credit).
Please let me know if there is a way to figure out the current usage of each user out of the total credit, for example a list showing that user no. 1 has used $3 so far and user no. 2 has used $40.
Users have access to EC2, EMR and S3.
Thanks
I'm not sure if you're already using consolidated billing, but it might serve as a solution
What you are probably looking for is cost allocation.
You can use cost allocation to organize and track your AWS costs. When you apply tags to your AWS resources (such as Amazon EC2 instances or Amazon S3 buckets), AWS generates a Cost Allocation Report as a comma-separated value (CSV) file with your usage and costs aggregated by your tags. You can apply tags that represent your business dimensions (such as cost centers, application names, or owners) to organize your costs across multiple services.
When you follow the steps to activate this feature, you specify an S3 bucket where the reports will be dropped. There's a final report at the end of every billing cycle, but there are also "estimated" rolling reports that get dropped into the bucket several times a day.
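
As a rough illustration of turning that CSV into per-user numbers, here is a sketch using the Python standard library (the bucket, key, tag column "user:Owner" and cost column "TotalCost" are guesses and need to be checked against the header of your actual report):

```python
# Sketch only: sum the cost allocation report by an activated tag column.
import boto3
import csv
import io
from collections import defaultdict

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-billing-reports",
                    Key="123456789012-aws-cost-allocation-2020-01.csv")
rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

totals = defaultdict(float)
for row in rows:
    owner = row.get("user:Owner") or "untagged"   # activated tag columns are prefixed "user:"
    try:
        totals[owner] += float(row.get("TotalCost") or 0)
    except ValueError:
        continue  # skip summary/header rows without a numeric cost

for owner, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{owner}: ${cost:.2f}")
```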
Thanks, friends, I found the solution with your help.
Using consolidated billing we are able to see the total usage amount and the usage for each user separately.
Cost allocation reports from AWS definitely help. But it can be tough to turn all that CSV data into reports, especially if you've got a lot of individual users.
A lot of tools have been built to help with this issue. It's one that a lot of AWS users run into.
This blog post gives some hints on how to create more useful allocation reports: http://blog.cloudability.com/insight-with-aws-cost-allocation-reports/
As a disclaimer, I work at Cloudability. We've got a lot of users with this exact same issue.