Export S3 Bucket To Blob Storage - amazon-web-services

I'm trying to download, transform, and upload an entire S3 bucket to Azure Blob Storage. The task itself, though conceptually trivial, became really annoying due to throttling issues.
The bucket is 4.5 TB and contains roughly 700,000,000 keys. My first approach was to create a Lambda that handles a batch of 2,000 keys at a time and just attack S3. After launching all the Lambdas I ran into S3 throttling for the first time:
{
  "errorMessage": "Please reduce your request rate.",
  "errorType": "SlowDown"
}
At first this was amusing, but eventually it became a blocker on the whole migration process. Transferring the entire bucket with this throttling policy will take me around 2 weeks.
Of course I implemented exponential retries, but at this scale of 100+ concurrent Lambdas it has little effect.
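A minimal sketch of the kind of retry handling I mean, delegating the backoff to botocore's retry configuration (the bucket and key handling here is a placeholder, not the actual migration code):

import boto3
from botocore.config import Config

# botocore's "adaptive" retry mode adds exponential backoff plus
# client-side rate limiting on top of the normal retries.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

def fetch(bucket, key):
    # get_object is retried automatically on SlowDown / 503 responses,
    # up to max_attempts, before the error surfaces to the caller.
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()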
Am I missing something? Is there a service I could use for this? Can I overcome the throttling somehow?
Any help would be appreciated.

Related

How to query AWS load balancer log if there are terabytes of logs?

I want to query the AWS load balancer logs automatically, on a schedule, and send myself a report.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created the data table based on the guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I have run into the following issues:
The logs bucket grows in size day by day, and I notice that if an Athena query needs more than 5 minutes to return a result, it sometimes produces an "unknown error".
The maximum timeout for an AWS Lambda function is only 15 minutes, so I cannot keep increasing the Lambda timeout to wait for Athena to return a result (for example, if Athena needs more than 15 minutes).
Can you suggest a better solution to this problem? I am thinking of using the ELK stack, but I have no experience with ELK. Can you explain the advantages and disadvantages of ELK compared to the combo of AWS Lambda + AWS Athena? Thank you!
First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
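A rough sketch of that start-then-poll pattern with boto3 (the query, database, and output location below are placeholders):

import time
import boto3

athena = boto3.client("athena")

# StartQueryExecution returns immediately with a query id...
execution = athena.start_query_execution(
    QueryString="SELECT elb, count(*) AS requests FROM alb_logs GROUP BY elb",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# ...which you poll with GetQueryExecution; a second, scheduled Lambda
# could do this check instead of one Lambda waiting the whole time.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(10)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)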
With that out of the way, the problem with reading access logs from Athena is that it isn't easy to partition them. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths with /column=value/ segments. However, ALB access logs use a simpler yyyy/mm/dd partitioning scheme.
If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that so can't give you information other than what's in the docs.
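For what it's worth, the yyyy/mm/dd layout can also be handled without Glue by registering partitions explicitly (a different approach from the Glue suggestion above), assuming the table was created with something like PARTITIONED BY (day string); the table, bucket, account id, and region below are placeholders:

import boto3

athena = boto3.client("athena")

# Point one partition at the plain yyyy/mm/dd path the load balancer writes to.
ddl = (
    "ALTER TABLE alb_logs ADD IF NOT EXISTS "
    "PARTITION (day = '2021-01-01') "
    "LOCATION 's3://my-alb-logs/AWSLogs/123456789012/"
    "elasticloadbalancing/us-east-1/2021/01/01/'"
)
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)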
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy (sketched below) that deletes logs after 30 (or 45, or whatever) days.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
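A minimal sketch of the Bucket_A expiration rule with boto3 (the bucket name and retention period are placeholders; the replication configuration for Bucket_B is omitted):

import boto3

s3 = boto3.client("s3")

# Expire raw access logs after 30 days; replication to the archive bucket
# (configured separately) keeps the permanent copy in infrequent access.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-alb-access-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-access-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": 30},
            }
        ]
    },
)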
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload them; I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.

AWS Data Pipeline DynamoDB to S3 503 SlowDown Error

We have a Data Pipeline that does a nightly copy of our DynamoDB table to S3 buckets so we can run reports on the data with Athena. Occasionally the pipeline fails with a 503 SlowDown error. The retries usually "succeed" but create tons of duplicate records in S3. The DynamoDB table has on-demand read capacity and the pipeline has a myDDBReadThroughputRatio of 0.5. A couple of questions here:
I assume reducing the myDDBReadThroughputRatio would probably lessen the problem. If so, does anyone have a good ratio that is still performant but does not cause these errors?
Is there a way to prevent the duplicate records in S3? I can't figure out why they are being generated (possibly the records from the failed run are not removed?).
Of course any other thoughts/solutions for the problem would be greatly appreciated.
Thanks!
Using AWS Data Pipeline for continuous backups is not recommended.
AWS recently launched new functionality that allows you to export DynamoDB table data to S3, which can then be analysed with Athena. Check it out here
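A minimal sketch of kicking off such an export with boto3, assuming point-in-time recovery is enabled on the table (the ARN, bucket, and prefix are placeholders):

import boto3

dynamodb = boto3.client("dynamodb")

# Export the table to S3 without consuming read capacity on the table;
# requires point-in-time recovery (PITR) to be enabled.
export = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
    S3Bucket="my-export-bucket",
    S3Prefix="nightly-exports/",
    ExportFormat="DYNAMODB_JSON",
)
print(export["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until it completes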
You can also use AWS Glue to do the same (link).
If you still want to continue using Data Pipeline, then the issue seems to be caused by S3 limits being reached. You might need to check whether other requests are also writing to S3 at the same time, or whether you can limit the request rate from the pipeline through configuration.

HIVE_CURSOR_ERROR: Please reduce your request rate

I am running a simple query against my S3 bucket with CloudTrail logs. The bucket is big, and after around 1 minute and 45 seconds I get the error
HIVE_CURSOR_ERROR: Please reduce your request rate.
Is there a way to limit request rate against my s3 bucket within Athena?
SELECT *
FROM default.cloudtrail_logs_cloudtraillog
WHERE eventname = 'DeleteUser' AND awsregion = 'us-east-1'
So I will summarize the solutions suggested by AWS. None of them are great, and I wonder why AWS would not throttle on their end instead of throwing the error.
By default, S3 scales automatically to support very high request rates. When your request rate grows, S3 automatically partitions your bucket as needed to support it. However, it sometimes still errors out, so they suggest waiting (no time frame given) to give S3 enough time to auto-partition your bucket based on the request rate it is receiving.
They also suggest:
1) Using the S3DistCp utility to combine small files into larger objects. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
2) Partitioning https://docs.aws.amazon.com/athena/latest/ug/partitions.html
I got the same answer from AWS support. Since I was doing a one-off analysis, I ended up writing a script to copy a small date range worth of logs to a separate bucket and using Athena to analyze the smaller dataset.
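A rough sketch of that kind of date-range copy with boto3 (the bucket names and the CloudTrail prefix layout are placeholders, not the exact script used):

import boto3

s3 = boto3.client("s3")
source = "my-cloudtrail-logs"
target = "my-cloudtrail-subset"
prefix = "AWSLogs/123456789012/CloudTrail/us-east-1/2021/01/"  # one month of logs

# Copy every object under the date prefix into the smaller analysis bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=target,
            Key=obj["Key"],
            CopySource={"Bucket": source, "Key": obj["Key"]},
        )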

Does AWS guarantee my lambda function will be triggered 100%?

I set up my AWS workflow so that my Lambda function is triggered when a text file is added to my S3 bucket, and generally it has worked fine: when I upload a bunch of text files to the S3 bucket, a bunch of Lambdas run at the same time and process each text file.
My issue is that occasionally 1 or 2 files (out of roughly 20k in total) do not trigger the Lambda function as expected. I have no idea why. When I checked the logs, it's NOT that the file was processed by the Lambda but failed; the logs show that the Lambda was not triggered by those 1 or 2 files at all. I don't believe I'm hitting the 1,000 concurrent Lambda limit either, since my function runs quickly and the peak is around 200 concurrent Lambdas.
My question is: is this because AWS Lambda does not guarantee it will be triggered 100% of the time? Like S3, is there always a (albeit tiny) possibility of failure? If not, how can I debug and fix this issue?
You don't mention how long the Lambdas take to execute. The default limit of concurrent executions is 1000. If you are uploading files faster than they can be processed with 1000 Lambdas then you'll want to reach out to AWS support and get your limit increased.
Also from the docs:
Amazon S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. On very rare occasions, events might be lost.
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application. You can audit for missed events by using the LIST Objects API or Amazon S3 Inventory reports. The LIST Objects API and Amazon S3 inventory reports are subject to eventual consistency and might not reflect recently added or deleted objects.
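A hedged sketch of that kind of audit with the list API, assuming you record processed keys somewhere you can load into a set (the processed_keys argument stands in for whatever bookkeeping your Lambdas do):

import boto3

s3 = boto3.client("s3")

def find_missed_keys(bucket, prefix, processed_keys):
    # List everything under the prefix and return keys the Lambda never saw.
    missed = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"] not in processed_keys:
                missed.append(obj["Key"])
    return missed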

How to migrate millions of files in aws s3 bucket from one account to another really fast

I have an s3 bucket in account A with millions of files that take up many GBs
I want to migrate all this data into a new bucket in account B
So far, I've given account B permissions to run s3 commands on the bucket in account A.
I am able to get some results with the aws s3 sync command, using the setting aws configure set default.s3.max_concurrent_requests 100.
It's fast, but it only reaches a speed of some 20,000 parts per minute.
Is there an approach to sync/move data across AWS buckets in different accounts REALLY fast?
I tried S3 Transfer Acceleration, but it seems that it is meant for uploading to and downloading from a bucket, and I think it only works within a single AWS account.
20,000 parts per minute.
That's > 300/sec, so, um... that's pretty fast. It's also 1.2 million per hour, which is also pretty respectable.
S3 Request Rate and Performance Considerations implies that 300 PUT req/sec is something of a default performance threshold.
At some point, make too many requests too quickly and you'll overwhelm your index partition and you'll start encountering 503 Slow Down errors -- though hopefully aws-cli will handle that gracefully.
The idea, though, seems to be that S3 will scale up to accommodate the offered workload, so if you leave this process running, you may find that it actually does get faster with time.
Or...
If you expect a rapid increase in the request rate for a bucket to more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, we recommend that you open a support case to prepare for the workload and avoid any temporary limits on your request rate.
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Note, also, that it says "temporary limits." This is where I come to the conclusion that, all on its own, S3 will -- at some point -- provision more index capacity (presumably this means a partition split) to accommodate the increased workload.
You might also find that you get away with a much higher aggregate trx/sec if you run multiple separate jobs, each handling a different object prefix (e.g. asset/1, asset/2, asset/3, etc., depending on how the keys are designed in your bucket), because you're not creating such a hot spot in the object index.
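A rough sketch of running one sync job per prefix in parallel (the prefixes and bucket names are placeholders, and it assumes the aws CLI is already configured with the cross-account permissions described above):

import subprocess
from concurrent.futures import ThreadPoolExecutor

prefixes = ["asset/1", "asset/2", "asset/3", "asset/4"]

def sync(prefix):
    # Each job copies one key prefix, spreading load across S3's index partitions.
    subprocess.run(
        ["aws", "s3", "sync",
         f"s3://source-bucket/{prefix}", f"s3://destination-bucket/{prefix}"],
        check=True,
    )

with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
    list(pool.map(sync, prefixes))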
The copy operation going on here is an internal S3-to-S3 copy. It isn't download + upload. Transfer Acceleration is only used for actual uploads and downloads, not for these internal copies.