AWS Glue request limit

I have some Lambdas that request schemas from AWS Glue. I'd like to know whether there is a limit on requests to AWS Glue after which Glue can no longer handle them (load testing, in other words).
I have not found anything about it in the official documentation.
Thanks

The various default, per-region limits for the AWS Glue service are listed at the link below. You can request increases to these limits via the support console.
https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-service-limits.html
These limits are not a guaranteed capacity unless there is an SLA defined for the service, which I don't think Glue has. One would assume that EC2 is the backing service, though, so capacity should theoretically not be an issue. If there is no SLA, you will only know the service's true availability by running your workload over a long period of time.
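If throttling is the concern, here is a minimal sketch (my own, not part of the answer above) of a Lambda that calls Glue with the SDK's built-in retry/backoff, so throttling errors are retried rather than failing the invocation. The retry options are real botocore settings; the database and table names are placeholders.

import boto3
from botocore.config import Config

# "adaptive" retry mode backs off automatically when the Glue API throttles us.
glue = boto3.client(
    "glue",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

def lambda_handler(event, context):
    # GetTable is one of the per-account, per-region rate-limited Glue calls.
    response = glue.get_table(DatabaseName="example_db", Name="example_table")
    return response["Table"]["StorageDescriptor"]["Columns"]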

Have a look here:
https://docs.aws.amazon.com/general/latest/gr/glue.html
As of today (2020/01/27):
Number of jobs per trigger: 50

Related

How can I add IP-based rate limits with longer intervals on API Gateway?

I have an API Gateway endpoint that I would like to limit access to. For anonymous users, I would like to set both daily and monthly limits (based on IP address).
AWS WAF has the ability to set rate limits, but the interval for them is a fixed 5 minutes, which is not useful in this situation.
API Gateway has the ability to add usage plans with longer term rate quotas that would suit my needs, but unfortunately they seem to be based on API keys, and I don't see a way to do it by IP.
Is there a way to accomplish what I'm trying to do using AWS Services?
Is it maybe possible to use a usage plan and automatically generate an api key for each user who wants to access the api? Or is there some other solution?
Without more context on your specific use-case, or the architecture of your system, it is difficult to give a “best practice” answer.
Like most things tech, there are a few ways you could accomplish this. One way would be to use a combination of CloudWatch API logging, Lambda, DynamoDB (with Streams) and WAF.
At a high level (and regardless of this specific need) I'd protect my API using WAF and the AWS security automations quickstart, found here, and associate it with my API Gateway as guided in the docs here. Once my WAF is set up and associated with my API Gateway, I'd enable CloudWatch API logging for API Gateway, as discussed here. Now that I have things set up, I'd create two Lambdas.
The first will parse the CloudWatch API logs and write the data I'm interested in (IP address and request time) to a DynamoDB table. To avoid unnecessary storage costs, I'd set the TTL on the records I'm writing to twice whatever my analysis's time window is, i.e. if I'm looking to limit it to 1000 requests per month, I'd set the TTL on my DynamoDB records to 2 months. From there, my CloudWatch API log group will have a subscription filter that sends log data to this Lambda, as described here.
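A rough sketch of that first Lambda, assuming the access logs are emitted as JSON with an "ip" field and that the table, key, and TTL attribute names are ones I made up here:

import base64
import gzip
import json
import time

import boto3

TABLE_NAME = "api-request-log"       # hypothetical table name
TTL_SECONDS = 60 * 60 * 24 * 62      # roughly 2 months, per the rule of thumb above

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def lambda_handler(event, context):
    # CloudWatch Logs subscription filters deliver gzipped, base64-encoded payloads.
    payload = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    log_data = json.loads(payload)

    now = int(time.time())
    for log_event in log_data["logEvents"]:
        # Assumes the API Gateway access log format was configured as JSON with an "ip" field.
        message = json.loads(log_event["message"])
        table.put_item(
            Item={
                "ip": message["ip"],                    # partition key (assumed schema)
                "requestTime": log_event["timestamp"],  # sort key, epoch milliseconds
                "ttl": now + TTL_SECONDS,               # DynamoDB TTL attribute (epoch seconds)
            }
        )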
My second Lambda is going to do the actual analysis and handle what happens when my metric is exceeded. This Lambda is going to be triggered by the write event to my DynamoDB table, as described here. I can have this Lambda run whatever analysis I want, but I'm going to assume that I want to limit access to 1000 requests per month for a given IP. When the new DynamoDB item triggers my Lambda, the Lambda is going to query the DynamoDB table for all records created in the preceding month that contain the IP address. If the number of records returned is less than or equal to 1000, it does nothing. If it exceeds 1000, the Lambda updates the WAF WebACL, specifically calling UpdateIPSet to reject traffic from that IP, and that's it. Pretty simple.
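And a sketch of that second Lambda, again under assumed names, using the classic (regional) WAF API since that is where UpdateIPSet lives:

import time

import boto3
from boto3.dynamodb.conditions import Key

TABLE_NAME = "api-request-log"    # same hypothetical table as above
IP_SET_ID = "example-ip-set-id"   # hypothetical IPSet attached to the WebACL's block rule
MONTH_MS = 30 * 24 * 60 * 60 * 1000
LIMIT = 1000

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)
waf = boto3.client("waf-regional")

def lambda_handler(event, context):
    since = int(time.time() * 1000) - MONTH_MS
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        ip = record["dynamodb"]["NewImage"]["ip"]["S"]

        # Count this IP's requests over the preceding month (pagination ignored for brevity).
        result = table.query(
            KeyConditionExpression=Key("ip").eq(ip) & Key("requestTime").gte(since),
            Select="COUNT",
        )
        if result["Count"] <= LIMIT:
            continue

        # Over the limit: insert the IP into the IPSet so the WebACL rejects its traffic.
        token = waf.get_change_token()["ChangeToken"]
        waf.update_ip_set(
            IPSetId=IP_SET_ID,
            ChangeToken=token,
            Updates=[{
                "Action": "INSERT",
                "IPSetDescriptor": {"Type": "IPV4", "Value": f"{ip}/32"},
            }],
        )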
With the above process I have near real-time monitoring of requests to my API Gateway, in a very efficient, cost-effective, scalable manner that can be deployed entirely serverless.
This is just one way to handle it; there are definitely other ways you could accomplish this, say with Kinesis and Elasticsearch, or by analyzing CloudTrail events instead of logs, or by using a third-party solution that integrates with AWS, or something else.

How to monitor and control DPU usage in AWS Glue Crawlers

In the docs it's said that AWS allocates 10 DPUs per ETL job and 5 DPUs per development endpoint by default, even though both can be configured with a minimum of 2 DPUs.
It's also mentioned that crawling is priced in per-second increments with a 10-minute minimum per run, but nowhere is it specified how many DPUs are allocated. Jobs and development endpoints can be configured in the Glue console to consume fewer DPUs, but I haven't seen any such configuration for the crawlers.
Is there a fixed amount of DPUs per crawler? Can we control that amount?
This is my conversation with AWS Support about this subject:
Hello, I'd like to know how many DPUs a crawler uses in order to
calculate my costs with crawlers.
Their answer:
Dear AWS Customer,
Thank you for reaching out today. My name is Safari, I will assist
with your case.
I understand that while compiling the cost of your Glue crawlers,
you'd like to know the amount of DPUs a particular crawler uses.
Unfortunately, there is no direct way to find out the DPU consumption
by a given crawler. I apologize for the inconvenience. However, you
may see the total DPU consumption across all crawlers in your detailed
bill under the section AWS Service Charges > Glue > {region} > AWS
Glue CrawlerRun. Additionally, you can add tags to your crawlers and
then enable "Cost Allocation Tags" from your AWS Billing and Cost
Management console. This would allow AWS to generate a cost allocation
report grouped by the predefined tags. For more on this, please see
the documentation link below [1].
I hope this helps. Please let me know if I can provide you with any
other assistance.
References [1]:
https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html
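For what it's worth, a minimal sketch of the tagging suggestion, with a made-up crawler name and tag; after tagging, the tag still has to be activated as a cost allocation tag in the Billing console before it shows up in reports:

import boto3

glue = boto3.client("glue")

# Glue crawler ARNs take the form arn:aws:glue:<region>:<account-id>:crawler/<name>.
crawler_arn = "arn:aws:glue:eu-west-1:123456789012:crawler/example-crawler"

glue.tag_resource(
    ResourceArn=crawler_arn,
    TagsToAdd={"cost-center": "data-platform"},  # hypothetical cost allocation tag
)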
I discussed this with the AWS support team as well, and currently it's not possible to modify or view the DPU configuration details for Glue crawlers. But do crawlers use DPUs at all?

AWS WAF - Auto Save Web Application Firewall logs in S3

How do you route AWS Web Application Firewall (WAF) logs to an S3 bucket? Is this something I can quickly do through the AWS Console? Or, would I have to use a lambda function (invoked by a CloudWatch timer event) to query the WAF logs every n minutes?
UPDATE:
I'm interested in the ACL logs (Source IP, URI, Matches rule, Request Headers, Action, Time, etc).
UPDATE (05/15/2017)
AWS doesn't provide an easy way to view/parse these logs. You can get a "random sample" via the get-sampled-requests command, which isn't acceptable...
Gets detailed information about a specified number of requests--a
sample--that AWS WAF randomly selects from among the first 5,000
requests that your AWS resource received during a time range that you
choose. You can specify a sample size of up to 500 requests, and you
can specify any time range in the previous three hours.
http://docs.aws.amazon.com/cli/latest/reference/waf/get-sampled-requests.html
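For completeness, a sketch of pulling that sample with boto3 (classic WAF); the WebACL and rule IDs are placeholders, and the time window is limited to the previous three hours as the quote above says:

from datetime import datetime, timedelta

import boto3

waf = boto3.client("waf")

end = datetime.utcnow()
start = end - timedelta(hours=1)

response = waf.get_sampled_requests(
    WebAclId="example-web-acl-id",  # placeholder
    RuleId="example-rule-id",       # placeholder
    TimeWindow={"StartTime": start, "EndTime": end},
    MaxItems=500,                   # maximum allowed sample size
)
for sample in response["SampledRequests"]:
    request = sample["Request"]
    print(sample["Action"], request["ClientIP"], request["URI"])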
I'm not the only one experiencing this issue either:
https://forums.aws.amazon.com/thread.jspa?threadID=220202
I was looking for this functionality today and stumbled across the referenced thread. It was, coincidentally, updated today:
Hello,
Thanks for your input. I have submitted a feature request on your
behalf to export WAF events to S3 for long term analysis.
Best Regards, albertpataws
The lack of this feature strikes me as being almost as odd as the fact that I can't change timezones for graphs.

CloudWatch Events rule Limits

What is the maximum number of CloudWatch Events rules I can create in my AWS account? I might have a lot of different rules that will invoke Lambda functions on a schedule. Is it unlimited?
The basic limits are documented at http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/cloudwatch_limits.html - currently 50 rules per account.
If you need more, reach out through your AWS contact and these can be expanded.
This is no longer 50 and has been increased to 100 per region per account.
As per this link:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/cloudwatch_limits_cwe.html
And as mentioned by johnny: this can be increased further on request (if amazon approves the request).
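For reference, a minimal sketch (placeholder names and ARNs) of the kind of scheduled rule that counts against this limit: one put_rule with a schedule expression plus a Lambda target.

import boto3

events = boto3.client("events")

events.put_rule(
    Name="example-nightly-rule",             # placeholder rule name
    ScheduleExpression="cron(0 2 * * ? *)",  # 02:00 UTC every day
    State="ENABLED",
)
events.put_targets(
    Rule="example-nightly-rule",
    Targets=[{
        "Id": "example-target",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:example-fn",
    }],
)
# The Lambda function also needs a resource-based permission (AddPermission)
# allowing events.amazonaws.com to invoke it.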
After talking to the AWS CloudWatch team, I found out that the rule limit can be increased as per your needs.
If you're willing to use a non-AWS service, then you might check out Microsoft Azure. Azure offers a great job scheduler that doesn't pose any limits. You could use this service to invoke your lambda functions.

AWS Cloudwatch monitoring for S3

Amazon CloudWatch provides some very useful metrics for monitoring my EC2s, load balancers, ElastiCache and RDS databases, etc., and allows me to set alarms for a whole range of criteria; but is there any way to configure it to monitor my S3 buckets as well? Or are there any other monitoring tools (besides simply enabling logging) that will help me monitor the number of POST/GET requests and data volumes for my S3 resources? And to provide alarms for thresholds of activity or increased data storage?
AWS S3 is a managed storage service. The only metrics available in AWS CloudWatch for S3 are NumberOfObjects and BucketSizeBytes. In order to understand your S3 usage better you need to do some extra work.
I have recently written an AWS Lambda function to do exactly what you ask for and it's available here:
https://github.com/maginetv/s3logs-cloudwatch
It works by parsing S3 server access log files and aggregating/exporting metrics to AWS CloudWatch (CloudWatch allows you to publish custom metrics).
Example graphs that you will get in AWS CloudWatch after deploying this function on your AWS account are:
RestGetObject_RequestCount
RestPutObject_RequestCount
RestHeadObject_RequestCount
BatchDeleteObject_RequestCount
RestPostMultiObjectDelete_RequestCount
RestGetObject_HTTP_2XX_RequestCount
RestGetObject_HTTP_4XX_RequestCount
RestGetObject_HTTP_5XX_RequestCount
+ many others
Since metrics are exported to CloudWatch, you can easily set up alarms for them as well.
CloudFormation template is included in GitHub repo and you can deploy this function very quickly to gain visibility into your S3 bucket usage.
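To illustrate the custom-metrics part, a minimal sketch of publishing one parsed-log counter to CloudWatch with boto3; the namespace, metric name, and dimension here are assumptions, not necessarily what the repo uses:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="S3AccessLogs",  # hypothetical namespace
    MetricData=[{
        "MetricName": "RestGetObject_RequestCount",
        "Dimensions": [{"Name": "BucketName", "Value": "example-bucket"}],
        "Value": 42,           # count parsed from the access logs
        "Unit": "Count",
    }],
)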
EDIT 2016-12-10:
In November 2016 AWS added extra S3 request metrics in CloudWatch that can be enabled when needed. These include metrics like AllRequests, GetRequests, PutRequests, DeleteRequests, HeadRequests, etc. See the Monitoring Metrics with Amazon CloudWatch documentation for more details about this feature.
I was also unable to find any way to do this with CloudWatch. This question from April 2012 was answered by Derek#AWS as not having S3 support in CloudWatch. https://forums.aws.amazon.com/message.jspa?messageID=338089
The only thing I could think of would be to import the S3 access logs to a log service (like Splunk). Then create a custom cloud watch metric where you post the data that you parse from the logs. But then you have to filter out the polling of the access logs and…
And while you were at it, you could just create the alarms in Splunk instead of in CloudWatch.
If your use case is to simply alert when you are using it too much, you could set up an account billing alert for your S3 usage.
I think this might depend on where you are looking to track the access from, i.e. if you are trying to measure/watch usage of S3 objects from outside HTTP/HTTPS requests, then Anthony's suggestion of enabling S3 logging and then importing into Splunk (or Redshift) for analysis might work. You can also watch billing status on requests every day.
If trying to gauge usage from within your own applications, there are some AWS SDK CloudWatch metrics:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/metrics/package-summary.html
and
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/metrics/S3ServiceMetric.html
S3 is a managed service, meaning that you don't need to take action based on system events in order to keep it up and running (as long as you can afford to pay for the service's usage). The spirit of CloudWatch is to help with monitoring services that require you to take action in order to keep them running.
For example, EC2 instances (which you manage yourself) typically need monitoring to alert when they're overloaded or when they're underused or else when they crash; at some point action needs to be taken in order to spin up new instances to scale out, spin down unused instances to scale back in, or reboot instances that have crashed. CloudWatch is meant to help you do the job of managing these resources more effectively.
To enable request and data transfer metrics on your bucket, you can run the command below. Be aware that these are paid metrics.
aws s3api put-bucket-metrics-configuration \
--bucket YOUR-BUCKET-NAME \
--id EntireBucket \
--metrics-configuration Id=EntireBucket
This tutorial describes how to do it in the AWS Console with a point-and-click interface.