I am running a simple query against my S3 bucket with CloudTrail logs. The bucket is big, and after around 1 minute and 45 seconds I get this error:
HIVE_CURSOR_ERROR: Please reduce your request rate.
Is there a way to limit the request rate against my S3 bucket from within Athena? The query is:
SELECT *
FROM default.cloudtrail_logs_cloudtraillog
WHERE eventname = 'DeleteUser' AND awsregion = 'us-east-1'
I will summarize the solutions suggested by AWS. None of them are great, and I wonder why AWS does not throttle on their end instead of throwing the error.
By default, S3 scales automatically to support very high request rates. As your request rate increases, S3 automatically partitions your bucket as needed to support the higher rate. However, sometimes it still errors out, so they suggest waiting (no time frame is given) to give S3 enough time to auto-partition your bucket based on the request rate it is receiving.
They also suggest:
1) Using the S3DistCp utility to combine small files into larger objects. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
2) Partitioning https://docs.aws.amazon.com/athena/latest/ug/partitions.html
I got the same answer from AWS support. Since I was doing a one-off analysis, I ended up writing a script to copy a small date range worth of logs to a separate bucket and using Athena to analyze the smaller dataset.
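For reference, a minimal sketch of that kind of copy script, assuming boto3 and made-up bucket and prefix names (CloudTrail keys embed the date, so one day maps to one prefix):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Hypothetical names; adjust the account ID, region and date to the range you care about.
src_bucket = "my-cloudtrail-bucket"
dst_bucket = "my-cloudtrail-subset"
prefix = "AWSLogs/123456789012/CloudTrail/us-east-1/2020/06/01/"

# Copy every object under the date prefix into the smaller bucket.
for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=dst_bucket,
            Key=obj["Key"],
            CopySource={"Bucket": src_bucket, "Key": obj["Key"]},
        )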
I want to query AWS load balancer logs and automatically send a report for me on a schedule.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created the data table based on the guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I have encountered the following issues:
The logs bucket increases in size day by day, and I notice that if an Athena query needs more than 5 minutes to return a result, it sometimes produces an "unknown error".
The maximum timeout for an AWS Lambda function is only 15 minutes, so I cannot keep increasing the Lambda function timeout to wait for Athena to return a result (in the case that Athena needs more than 15 minutes, for example).
Can you suggest a better solution for my problem? I am thinking of using the ELK stack, but I have no experience working with ELK. Can you show me the advantages and disadvantages of ELK compared to the combination of AWS Lambda + Amazon Athena? Thank you!
First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
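A minimal polling sketch with boto3, using a placeholder query and results location (the names are assumptions, not your actual setup):

import time
import boto3

athena = boto3.client("athena")

# Placeholder query and output location; your workgroup's default output location also works.
query = athena.start_query_execution(
    QueryString="SELECT * FROM alb_logs LIMIT 10",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query reaches a terminal state; nothing needs to stay running between polls.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)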
With that out of the way, the problem with reading access logs from Athena is that it isn't easy to partition them. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths that have /column=value/ segments. However, ALB access logs use a simpler yyyy/mm/dd partitioning scheme.
If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that so can't give you information other than what's in the docs.
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy that deletes logs after 30 (or 45, or whatever) days; a sketch of such a rule follows this list.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
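A sketch of the expiration rule on Bucket_A, assuming boto3 and a placeholder bucket name (the replication to Bucket_B is configured separately):

import boto3

s3 = boto3.client("s3")

# Placeholder bucket name; pick whatever retention window suits your reporting needs.
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket-a-access-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-access-logs",
                "Filter": {"Prefix": ""},   # apply to every object in the bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)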
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload them. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.
We have a Data Pipeline that does a nightly copy of our DynamoDB table to S3 buckets so we can run reports on the data with Athena. Occasionally the pipeline fails with a 503 SlowDown error. The retries usually "succeed" but create tons of duplicate records in S3. The DynamoDB table has on-demand read capacity and the pipeline has myDDBReadThroughputRatio set to 0.5. A couple of questions here:
I assume reducing myDDBReadThroughputRatio would probably lessen the problem. If so, does anyone have a good ratio that will still be performant but not cause these errors?
Is there a way to prevent the duplicate records in S3? I can't figure out why they are being generated (possibly the records from the failed run are not removed?).
Of course any other thoughts/solutions for the problem would be greatly appreciated.
Thanks!
Using AWS Data Pipeline for continuous backups is not recommended.
AWS recently launched new functionality that allows you to export DynamoDB table data to S3, which can then be analysed with Athena. Check it out here.
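For example, a minimal sketch of kicking off that export with boto3, assuming point-in-time recovery is enabled on the table and using placeholder ARN and bucket names:

import boto3

dynamodb = boto3.client("dynamodb")

# Placeholder table ARN and bucket; the export runs server-side and does not consume read capacity.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
    S3Bucket="my-dynamodb-exports",
    ExportFormat="DYNAMODB_JSON",
)
print(response["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until the export completes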
You can also use AWS Glue to do the same (link).
If you still want to continue using Data Pipeline, then the issue seems to be caused by S3 request limits being reached. You might need to check whether other requests are also writing to S3 at the same time, or whether you can limit the request rate from the pipeline using some configuration.
All my work to access AWS S3 is in the region us-east-2, and the AZ is us-east-2a.
But I saw some throttling complaints from S3, so I am wondering: if I move some of my work to another AZ like us-east-2b, could it mitigate the problem? (Or will it not help, since us-east-2a and us-east-2b actually point to the same endpoint?)
Thank you.
The throttling is not per AZ; it's per bucket. The quote below is from the AWS documentation.
You can send 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per partitioned prefix in an S3 bucket. When you have an increased request rate to your bucket, Amazon S3 might return 503 Slow Down errors while it scales to support the request rate. This scaling process is called partitioning.
To avoid or minimize 503 Slow Down responses, verify that the number of unique prefixes in your bucket supports your required transactions per second (TPS). This helps your bucket leverage the scaling and partitioning capabilities of Amazon S3. Additionally, be sure that the objects and the requests for those objects are distributed evenly across the unique prefixes. For more information, see Best Practices Design Patterns: Optimizing Amazon S3 Performance.
If possible, enable exponential backoff and retries in the clients that talk to S3. If the application uploading to S3 is performance sensitive, the suggestion would be to hand off to a background application that can upload to S3 at a later time.
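For example, with boto3 the SDK's built-in retry modes already apply exponential backoff with jitter to throttling responses like 503 SlowDown; a minimal sketch with placeholder bucket and file names:

import boto3
from botocore.config import Config

# "standard" and "adaptive" retry modes back off exponentially (with jitter) on throttling errors.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

# Uploads made through this client are retried automatically when S3 returns SlowDown.
s3.upload_file("local-file.log", "my-bucket", "logs/local-file.log")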
I'm trying to download, transform and upload an entire S3 bucket to Azure Blob Storage. The task itself, though trivial, became really annoying due to throttling issues.
The bucket itself is 4.5 TB and contains roughly 700,000,000 keys. My first approach was to create a Lambda to handle a batch of 2,000 keys at a time and just attack S3. After launching all the Lambdas, I came across S3 throttling for the first time:
{
    "errorMessage": "Please reduce your request rate.",
    "errorType": "SlowDown"
}
At first this was amusing, but eventually it became a blocker on the whole migration process. Transferring the entire bucket under this throttling policy will take me around 2 weeks.
Of course I implemented exponential retries, but at this scale of 100+ concurrent Lambdas they have little effect.
Am I missing something? Is there a service I could use for this? Can I overcome the throttling somehow?
Any help would be appreciated.
When unloading data from Redshift to S3 with PARALLEL ON, the results of the SELECT statement are split across a set of files, as described in the Redshift documentation. How does this translate into the number of requests sent to the S3 endpoint?
For example, if the UNLOAD query (with PARALLEL ON) generated 6 files in the S3 bucket, does that correspond to 6 PUT requests on S3? Or is the number of requests received by S3 from this UNLOAD query execution higher?
I would appreciate it if you could briefly clarify.
If you're worried about cost then you're looking in the wrong place.
PUT requests in Amazon S3 (in a US region) cost half a cent per 1,000 requests. Compare that to a Redshift cluster, which costs a minimum of 25c per hour (and is still rather good value).
Take a look at your monthly bill and S3 request pricing is unlikely to feature in your top costs.
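A rough back-of-the-envelope comparison, using the prices quoted above and an assumed (made-up) schedule of one 6-file UNLOAD per hour:

# Assumed pricing and schedule; adjust to your actual region and workload.
put_price_per_1000 = 0.005      # USD per 1,000 PUT requests
redshift_per_hour = 0.25        # USD per hour for the smallest node

files_per_unload = 6
unloads_per_month = 24 * 30     # one UNLOAD per hour

put_cost = files_per_unload * unloads_per_month * put_price_per_1000 / 1000
redshift_cost = redshift_per_hour * 24 * 30

print(f"Monthly PUT cost: ${put_cost:.4f} vs Redshift: ${redshift_cost:.2f}")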