Cheapest way to delete 2 billion objects from S3 IA - amazon-web-services

I have a bucket in S3 (Infrequent Access) containing 2 billion objects. It is too big to delete through the console or over the API without taking years.
I can create a lifecycle rule to expire and delete the objects, but the calculator predicts this will cost me >$20,000. Is that correct? Is there a better way to delete a bucket?
I have a file effectively containing a list of all the objects in that bucket if that helps.
Update 2021:
An answer below from @MAP points out that there is now an "Empty" button. I haven't tested it yet, but it looks like the way to go (I'll accept that answer once tested).

If you have a list of all the objects available, then you can certainly use the Multi-Object Delete (DeleteObjects) action. This API is free. I would create an AWS Step Functions state machine to loop through the file and delete 1,000 objects at a time, which is the per-request limit.
At 1,000 objects per request, 2 billion objects works out to around 2 million Step Functions state transitions. Per Step Functions pricing that is around $50, plus roughly $1 for the Lambda invocations, so the total is roughly $51.
Update
Using Lambda or Step Functions is probably not the most cost-effective option, because either way you need to read the file containing the object keys from some source such as S3. Running the script from a local machine, or from an EC2 Linux instance in a screen session, appears to be the best option.
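For reference, a minimal sketch of such a local script, assuming boto3 is configured with credentials, the bucket is unversioned, and the key list sits in a local text file with one key per line; keys.txt and my-bucket are placeholder names:

```python
# Sketch: batch-delete keys listed in a local file, 1,000 per DeleteObjects call.
# keys.txt and my-bucket are placeholders; the bucket is assumed to be unversioned.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"
BATCH_SIZE = 1000  # DeleteObjects accepts at most 1,000 keys per request


def key_batches(path, size=BATCH_SIZE):
    """Yield lists of {"Key": ...} dicts read from a text file, one key per line."""
    batch = []
    with open(path) as f:
        for line in f:
            key = line.strip()
            if key:
                batch.append({"Key": key})
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch


for chunk in key_batches("keys.txt"):
    resp = s3.delete_objects(
        Bucket=BUCKET,
        Delete={"Objects": chunk, "Quiet": True},  # Quiet: only errors are returned
    )
    for err in resp.get("Errors", []):
        print("failed:", err["Key"], err["Code"])
```

Since S3 does not charge for DELETE or DeleteObjects requests, the only real cost is wherever the script runs.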

In 2021, anyone who comes across this question may benefit from knowing that the AWS console now provides an "Empty" button.
Select the bucket and click the "Empty" button, and all objects, versioned or not, will be deleted. Depending on the number of objects it can take minutes to days.

Expiration lifecycle rules are free. From the original feature announcement:
As with standard delete requests, Amazon S3 doesn’t charge you for using Object Expiration.

Delete operations are free. You can create a lifecycle policy to automate a bulk delete.
I would start with a small number of objects first and check the billing report to confirm that the deletes are not charged, then go for the rest.
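For illustration, such a policy could be applied with boto3 as sketched below; my-bucket and the rule ID are placeholders, and one day is the smallest expiration the API allows:

```python
# Sketch: a lifecycle rule that expires every current object after one day.
# my-bucket and the rule ID are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-everything",
                "Filter": {"Prefix": ""},   # empty prefix matches every object
                "Status": "Enabled",
                "Expiration": {"Days": 1},  # smallest allowed value
                # On a versioned bucket you would also add, for example:
                # "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
            }
        ]
    },
)
```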

Related

Is there any notification event I can trace for completion of an execution of AWS S3 lifecycle rule?

I want to delete a large number of S3 files (maybe a few hundred thousand or a million; I do not control the count) in a bulk async process. I looked into multiple blogs and collated the strategies below:
Leverage the AWS S3 REST API from an async thread of a custom application
Here the drawbacks are:
I will have to make a huge number of S3 API calls, as one request is limited to 1,000 S3 objects, and I may not know the exact S3 object keys.
Even if I identify the S3 objects to delete, I will have to first GET and then DELETE them, which makes the solution costly.
I will also have to keep track of deleted chunks, and in case of a failure in the middle of the operation, build a mechanism to re-trigger the chunks that failed to delete.
Leverage an S3 lifecycle policy
Here the drawbacks are:
We store multiple customers' data in the same bucket, segregated by customer ID in the prefix. With a growing number of customers, we foresee that the hard limit of 1,000 lifecycle rules per bucket may hit us.
To work around this, we could delete each rule afterwards and free up quota for the next requests, but we were looking for an event-based notification that can tell us the bulk delete operation is complete.
Again, with a growing number of customers, we may lose predictability of the bulk delete operation: jobs accumulate once the quota limit is reached, and a submitted bulk delete job may have to wait days to be completed.
Create only one rule with a special bulk-delete tag and use it in a single S3 lifecycle policy
With this approach, we believe we will not hit the rule limit we expect to hit with the previous approach. As we understand it, S3 lifecycle rules are executed once a day (though we don't know exactly when), so we are assured the rule will be triggered within at most 24 hours; actually completing the bulk delete will then take some additional time (maybe a few minutes or hours, we don't know). The open question remains: is there a notification event after each completed execution of an S3 lifecycle rule that we can listen for, so we can mark all submitted bulk delete jobs as DONE? Without such a notification event, it is difficult to communicate the status transparently back to the end user who triggered the bulk delete async operation.
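For this third strategy, a sketch (not the poster's implementation; the bucket, key, and tag names are placeholders) of the single tag-filtered rule plus the per-job tagging could look like this in Python with boto3:

```python
# Sketch of strategy 3: one lifecycle rule keyed on a "bulk-delete" tag.
# Bucket, key, and tag names are placeholders.
import boto3

s3 = boto3.client("s3")

# One-time setup: a single rule that expires any object tagged bulk-delete=true.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bulk-delete-by-tag",
                "Filter": {"Tag": {"Key": "bulk-delete", "Value": "true"}},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)

# Per bulk-delete job: tag each object so the rule picks it up on its next run.
s3.put_object_tagging(
    Bucket="my-bucket",
    Key="customer-123/some-object.json",  # placeholder key
    Tagging={"TagSet": [{"Key": "bulk-delete", "Value": "true"}]},
)
```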
Any comments or advice on the above strategies would be helpful, especially for the last strategy, which I think is the most preferable choice I have as of now.
I tried all of the strategies above and got stuck at the problem described for each, so any input is of great help.
After all evaluations, we finalized on deleting the relevant data for a specific time range in code, as an async Java process using the S3 bulk delete SDK (DeleteObjectsRequest).

Do "nearline" Google Cloud Storage buckets slow down with file churn?

I've recently deleted around 90 million objects (around 100 TB of data) from a "Nearline" GCS bucket, and now that I have an almost-empty bucket it takes >5 seconds to list the single remaining file. Our Standard buckets that have only a dozen files take ~1 s to list.
This occurs consistently from both gsutil and Go-based tooling that we've written. It has been tested from multiple VMs of various sizes within GCP, in the same region as the buckets. All buckets are single-region; the only difference is that the slower one is Nearline and the others are Standard. Is it really possible that simply listing the files in a bucket takes more than 5 seconds on Nearline?
Since this smells like a garbage-collection/vacuum-related slowdown, and we've been using this bucket for almost 5 years, I'm inclined to simply delete the bucket and recreate it, but it'd be good to know if anyone has accurately characterized GCS bucket performance under high churn over time.
In the 24 hours since I posted this, the performance of the bucket has returned to what I'd consider normal: gsutil ls takes ~1.2 s, and my custom Go code for listing the bucket takes ~0.15 s.
By simply waiting and trying again I've answered my own question: yes, the database of bucket keys does seem to have variable (reduced) performance depending on content churn, but it's something that resolves itself relatively quickly.

AWS S3 delete all the objects or within in a given date range

I am really having a hard time deleting my bucket jananath-logs-bucket-new. It has over 70 TB of data and I need to delete the entire bucket. It has files going back to 2019.
I tried deleting the bucket, but since it has many small files (over 50 million), it takes so much time that the UI (browser) hangs. So I thought: let AWS do it for me.
So I tried lifecycle rules and created two rules:
delete-all-from-start
delete-all-from-start-2
But my objects are not deleted.
I set the number of days for each field to 1, thinking it would delete everything back to 2019 (when the first object was created).
Can someone help me with this?
How can I delete all the objects in the bucket going back to 2019?
Is it possible to delete only the objects within a date range, say 2020-2021?
Thank you,
Have a great day!
According to the documentation a lifecycle policy is a valid way to empty a bucket. Please note that there may be a delay for expiring objects:
When an object reaches the end of its lifetime based on its lifecycle policy, Amazon S3 queues it for removal and removes it asynchronously. There might be a delay between the expiration date and the date at which Amazon S3 removes an object.
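Lifecycle rules expire objects by age rather than by an explicit calendar range, so for the 2020-2021 part of the question one option is a script that filters on LastModified; a sketch, assuming an unversioned bucket and using the bucket name from the question:

```python
# Sketch: delete only objects whose LastModified falls within 2020-2021.
# Assumes an unversioned bucket; uses the bucket name from the question.
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
BUCKET = "jananath-logs-bucket-new"
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
end = datetime(2022, 1, 1, tzinfo=timezone.utc)

batch = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if start <= obj["LastModified"] < end:
            batch.append({"Key": obj["Key"]})
        if len(batch) == 1000:  # DeleteObjects limit per request
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": batch, "Quiet": True})
            batch = []
if batch:
    s3.delete_objects(Bucket=BUCKET, Delete={"Objects": batch, "Quiet": True})
```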

Faster way to delete TB of data from GCP cloud storage

I want to delete 2 TB of files from a GCP bucket.
I have read the GCP documentation on deletion and it says to use the gsutil -m rm command, but when I run it, the estimated time is 400+ hours.
Is there any faster way to do the deletion?
For buckets with a very large number of objects, one trick to deleting the contents is to use the Lifecycle Management feature. https://cloud.google.com/storage/docs/lifecycle
Set a lifecycle rule that triggers when the object is 0 days old, with an action of "Delete", and that should cause GCS to begin deleting your objects for you. Note that this may still take a while, as lifecycle rules can take up to 24 hours to go into effect, but that's still a lot better than a couple of weeks.
You can configure the lifecycle policy on a bucket from the console:
Head to https://console.cloud.google.com/storage/browser
Find the bucket you want to enable, and click None in the Lifecycle column.
Click Add rule.
Select the condition (object is 0 days old).
Select an action (Delete the object)
Click continue.
Click save.
See https://cloud.google.com/storage/docs/managing-lifecycles for more instructions.
N.B.: Lifecycle changes can take up to 24 hours to go into effect, so once all of your objects go away and you remove the lifecycle config setting, you should wait an additional 24 hours before putting any new files in the bucket, or else they might also get deleted.
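The same rule can also be set programmatically; a sketch using the google-cloud-storage Python client, with my-bucket as a placeholder name:

```python
# Sketch: attach a "Delete at age 0 days" lifecycle rule with the
# google-cloud-storage client. my-bucket is a placeholder name.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")
bucket.lifecycle_rules = [
    {"action": {"type": "Delete"}, "condition": {"age": 0}}
]
bucket.patch()  # push the updated lifecycle configuration to GCS
print(list(bucket.lifecycle_rules))  # confirm the rule is attached
```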

How long does it take for AWS S3 to save and load an item?

S3 FAQ mentions that "Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES." However, I don't know how long it takes to get eventual consistency. I tried to search for this but couldn't find an answer in S3 documentation.
Situation:
We have a website that consists of 7 steps. When the user clicks save in each step, we want to save a JSON document (containing information from all 7 steps) to Amazon S3. Currently we plan to:
Create a single S3 bucket to store all json documents.
When the user saves step 1 we create a new item in S3.
When the user saves steps 2-7 we overwrite the existing item.
After the user saves a step and refreshes the page, they should be able to see the information they just saved, i.e. we want to make sure that we always read after write.
The full json document (all 7 steps completed) is around 20 KB.
After the user clicks the save button we can freeze the page for some time, so they cannot make other changes until the save is finished.
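For context, the planned save/load flow would look roughly like this (a sketch; the bucket and key names are placeholders):

```python
# Sketch of the planned flow: one JSON document per user, created at step 1 and
# overwritten on steps 2-7. Bucket and key names are placeholders.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "wizard-documents"
KEY = "user-123/preferences.json"


def save_step(doc: dict) -> None:
    # Step 1 creates the object; later steps overwrite it with the merged document.
    s3.put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=json.dumps(doc).encode("utf-8"),
        ContentType="application/json",
    )


def load_current() -> dict:
    # Read back after a page refresh; because of the overwrites, this read is
    # where eventual consistency could matter (see the answers below).
    resp = s3.get_object(Bucket=BUCKET, Key=KEY)
    return json.loads(resp["Body"].read())
```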
Question:
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
Is there a function to calculate save/load time based on item size?
Is the save/load time going to be different if I choose another S3 region? If so, which is the best region for Seattle?
I wanted to add to @error2007s's answer.
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
It's not only that you will not find the exact time anywhere; there's actually no such thing as an exact time. That's just what "eventual consistency" is all about: consistency will be achieved eventually, and you can't know when.
If somebody gave you an upper bound for how long a system would take to achieve consistency, then you wouldn't call it "eventually consistent" anymore. It would be "consistent within X amount of time".
The problem now becomes, "How do I deal with eventual consistency?" (instead of trying to "beat it")
To really find the answer to that question, you need to first understand what kind of consistency you truly need, and how exactly the eventual consistency of S3 could affect your workflow.
Based on your description, I understand that you would write a total of 7 times to S3, once for each step you have. For the first write, as you correctly cited the FAQs, you get strong consistency for any reads after that. For all the subsequent writes (which are really "replacing" the original object), you might observe eventual consistency - that is, if you try to read the overwritten object, you might get the most recent version, or you might get an older version. This is what is referred to as "eventual consistency" on S3 in this scenario.
A few alternatives for you to consider:
don't write to S3 on every single step; instead, keep the data for each step on the client side, and then write only one object to S3 after the 7th step. This way there's only one write and no overwrites, so no eventual consistency issues. This might or might not be possible for your specific scenario; you need to evaluate that.
alternatively, write S3 objects with different names for each step. E.g., after step 1, save to bruno-preferences-step-1.json; after step 2, save the results to bruno-preferences-step-2.json; and so on, then save the final preferences file to bruno-preferences.json, or maybe even bruno-preferences-step-7.json, giving yourself the flexibility to add more steps in the future. The idea here is to avoid overwrites, which could cause eventual consistency issues. Using this approach you only write new objects; you never overwrite them.
finally, you might want to consider Amazon DynamoDB. It's a NoSQL database; you can securely connect to it directly from the browser or from your server. It provides you with replication, automatic scaling, and load distribution (just like S3). You also have the option to tell DynamoDB that you want strongly consistent reads (the default is eventually consistent reads; you have to change a parameter to get strongly consistent ones). DynamoDB is typically used for "small" records, and 20 KB is definitely within range; the maximum size of a record is 400 KB as of today. You might want to check this out: DynamoDB FAQs: What is the consistency model of Amazon DynamoDB?
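As an illustration of that last option, a sketch using boto3; the table name, its user_id partition key, and the item attributes are all placeholders:

```python
# Sketch of the DynamoDB alternative: store the per-step document in a table and
# request a strongly consistent read. Table and attribute names are placeholders,
# and the table's partition key is assumed to be user_id.
import json
import boto3

table = boto3.resource("dynamodb").Table("user-preferences")

# Write (or overwrite) the current state of the 7-step document.
table.put_item(Item={"user_id": "bruno", "doc": json.dumps({"step": 2, "data": "..."})})

# Read it back; ConsistentRead=True requests a strongly consistent read instead
# of the default eventually consistent one.
resp = table.get_item(Key={"user_id": "bruno"}, ConsistentRead=True)
print(resp["Item"]["doc"])
```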
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
You will not find the exact time anywhere. If you ask AWS, they will give you approximate timings. Your file is 20 KB, so based on my experience with S3, the time will be more or less 60-90 seconds.
Is there a function to calculate save/load time based on item size?
No, there is no function you can use to calculate this.
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
For Seattle, US West (Oregon) will work with no problem.
You can also take a look at this experiment for comparison https://github.com/andrewgaul/are-we-consistent-yet