Two days ago I deleted a bucket that contained a backup of all log files for a site. It held about 30,000 tiny files totalling about 275 MB.
In the Monitoring panel for the site, the file count is exactly the same as before. I decided to wait a couple of days, and it still has not changed.
The bucket uses the Standard storage class and a multi-region location, has uniform permissions, and has no lifecycle rules.
I can verify that the bucket is gone in the UI as well as using the ls command in cloud shell.
[Chart: Cloud Storage Object Count]
The count of objects in the Monitoring panel reconciled about two days later.
Looks like the change ended up being retroactive, meaning the charts in the past were re-written to reflect the objects being deleted.
I've recently deleted around 90 million objects (around 100TB of data) from a "Nearline" GCS bucket, and now that I have an almost-empty bucket it takes >5 seconds to list the single remaining file. Standard buckets of ours that have only a dozen files take ~1s to list.
This occurs consistently with both gsutil and Go-based tooling that we've written. It has been tested from multiple VMs of various sizes within GCP, in the same region as the buckets. All buckets are single-region; the only difference is that the slower one is Nearline and the others are Standard. Is it really possible that simply listing the files in a bucket takes more than 5 seconds on Nearline?
Since this smells like a garbage collection/vacuum-related slowdown and we've been using it for almost 5 years now I'm inclined to simply delete the bucket and recreate it, but it'd be good to know if anyone has done an accurate characterization of GCP bucket performance with high churn over time.
In the 24h since I've posted this, the performance of this bucket has returned to what I'd consider normal. gsutil ls takes ~1.2s, my custom Go code for listing the bucket takes ~.15s.
By simply waiting and trying again I've answered my own question: yes, the database of bucket keys does seem to have variable (reduced) performance depending on content churn, but it's something that resolves itself relatively quickly.
I am having a really hard time deleting my bucket jananath-logs-bucket-new. It holds over 70 TB of data, with files going back to 2019, and I need to delete the entire bucket.
I tried deleting the bucket, but since it has many small files (over 50 million), it takes a very long time and the UI (browser) hangs. So I thought I would let AWS do it for me.
I tried lifecycle rules and created these two:
delete-all-from-start
delete-all-from-start-2
[Screenshots: configuration of delete-all-from-start and delete-all-from-start-2]
[Screenshot: both rules as they look now]
But my objects are not deleted.
I set the number of days in each field to 1, thinking that would delete everything going back to 2019 (when the first object was created).
Can someone help me on this?
How can I delete all the objects in the bucket, going back to 2019?
Is it possible to delete only the objects within a date range, say 2020-2021?
Thank you,
Have a great day!
According to the documentation a lifecycle policy is a valid way to empty a bucket. Please note that there may be a delay for expiring objects:
When an object reaches the end of its lifetime based on its lifecycle
policy, Amazon S3 queues it for removal and removes it asynchronously.
There might be a delay between the expiration date and the date at
which Amazon S3 removes an object.
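As a rough illustration of what such a policy could look like, here is a minimal Python sketch that builds a lifecycle configuration expiring everything in a bucket. The rule IDs are hypothetical, and this is not necessarily how the asker's two rules were set up. The resulting dictionary has the shape that boto3's put_bucket_lifecycle_configuration accepts as its LifecycleConfiguration argument. Note that Days counts from each object's creation date, so objects created in 2019 already qualify with a value of 1.

```python
import json


def build_expire_everything_rules(days: int = 1) -> dict:
    """Build an S3 lifecycle configuration that expires every object.

    Rule IDs are made up for illustration. Two rules are used because
    ExpiredObjectDeleteMarker cannot be combined with Days in one rule.
    """
    return {
        "Rules": [
            {
                "ID": "expire-all-versions",  # hypothetical name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all keys
                "Expiration": {"Days": days},
                # Only relevant if the bucket is (or was) versioned.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            },
            {
                "ID": "clean-up-leftovers",  # hypothetical name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"ExpiredObjectDeleteMarker": True},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            },
        ]
    }


config = build_expire_everything_rules()
print(json.dumps(config, indent=2))
```

Even with the rules in place, removal is asynchronous, so for tens of millions of objects it may take several days before the bucket is visibly empty.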
I have a bucket in S3 (Infrequent Access) containing 2 billion objects. It is too big to delete in the console or over the API without taking years.
I can create a lifecycle rule to expire and delete the objects but the calculator predicts this will cost me >$20,000. Is that correct? Is there a better way to delete a bucket?
I have a file effectively containing a list of all the objects in that bucket if that helps.
Update 2021:
An answer below from #MAP points out that there is now an "Empty" button. I haven't tested it yet, but it looks like the way to go (I'll accept that answer once tested).
If you have a list of all the objects, you can certainly use the Multi-Object Delete action (DeleteObjects). As far as I can tell, this API is free. I would create an AWS Step Functions state machine to loop through the file and delete 1,000 objects at a time, which is the per-request limit.
Deleting all the objects in the bucket will take around 2M Step Functions state transitions (2 billion objects at 1,000 per request). Per the Step Functions pricing, that costs around $50, plus around $1 for Lambda invocations, so roughly $51 in total.
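The estimate can be reproduced with a few lines of arithmetic. This sketch assumes a Step Functions price of $0.025 per 1,000 state transitions and exactly one transition per delete request; both are assumptions, so check the current pricing page before relying on the figure.

```python
import math

OBJECTS = 2_000_000_000              # object count from the question
KEYS_PER_REQUEST = 1_000             # Multi-Object Delete per-request limit
PRICE_PER_1000_TRANSITIONS = 0.025   # USD, assumed Standard Workflows price

# One delete request (and, by assumption, one state transition) per batch.
requests_needed = math.ceil(OBJECTS / KEYS_PER_REQUEST)
step_functions_cost = requests_needed / 1000 * PRICE_PER_1000_TRANSITIONS

print(f"{requests_needed:,} requests, about ${step_functions_cost:.2f}")
```

A real state machine loop usually needs more than one transition per iteration, so the actual bill would likely be a small multiple of this.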
Update
Using Lambda or Step Functions is probably not the most cost-effective option, because either way you need to read the file containing the object keys from some source such as S3. So I think running a script from a local machine, or from a screen session on an EC2 Linux instance, is the best option.
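A script along those lines might look like the sketch below. It assumes the key list is a text file with one object key per line, and the function names are made up for illustration. The actual S3 call is passed in as a callable, so the batching logic can be dry-run without touching a bucket.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

MAX_KEYS_PER_REQUEST = 1000  # Multi-Object Delete per-request limit


def key_batches(lines: Iterable[str], size: int = MAX_KEYS_PER_REQUEST) -> Iterator[dict]:
    """Yield DeleteObjects-style payloads of at most `size` keys each."""
    keys = (line.rstrip("\n") for line in lines)
    keys = (key for key in keys if key)  # skip blank lines
    while True:
        chunk = list(islice(keys, size))
        if not chunk:
            return
        # "Quiet" asks S3 to report only the keys that failed.
        yield {"Objects": [{"Key": key} for key in chunk], "Quiet": True}


def delete_all(lines: Iterable[str], delete_batch: Callable[[dict], None]) -> int:
    """Send every batch to `delete_batch` and return the number of keys sent."""
    sent = 0
    for payload in key_batches(lines):
        delete_batch(payload)
        sent += len(payload["Objects"])
    return sent


# Dry run with fake keys: record batch sizes instead of calling S3.
sizes = []
fake_keys = (f"logs/{i}.gz\n" for i in range(2500))
total = delete_all(fake_keys, lambda payload: sizes.append(len(payload["Objects"])))
print(total, sizes)  # 2500 [1000, 1000, 500]
```

With boto3, the callable could be something like `lambda p: s3.delete_objects(Bucket="my-bucket", Delete=p)`, where `s3 = boto3.client("s3")` and the bucket name is a placeholder. A production script would also want to inspect the Errors list in each response and retry failures.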
As of 2021, anyone who comes across this question may benefit from knowing that the AWS console now provides an "Empty" button.
Select the bucket and click the "Empty" button, and all objects, versioned or not, will be deleted. Depending on the number of objects, this can take anywhere from minutes to days.
Expiration lifecycle rules are free. From the original feature announcement:
As with standard delete requests, Amazon S3 doesn’t charge you for using Object Expiration.
Delete operations are free. You can create a lifecycle policy to automate a bulk delete.
I would start with a small number of objects first and check billing report to 100% confirm that the delete will not be charged, then go for the rest.
I want to delete 2TB of files from the GCP bucket.
I have read the GCP documentation on deletion, and it says to use the gsutil -m rm command, but when I run it the estimated time is 400+ hours.
Is there any faster way to do the deletion process?
For buckets with a very large number of objects, one trick to deleting the contents is to use the Lifecycle Management feature. https://cloud.google.com/storage/docs/lifecycle
Set a lifecycle rule that triggers when the object is 0 days old and an action of "Delete", and that should cause GCS to begin deleting your objects for you. Note that this may still take a while, as lifecycle rules can take up to 24 hours to go into effect, but that's still a lot better than a couple of weeks.
You can configure the lifecycle policy on a bucket from the console:
Head to https://console.cloud.google.com/storage/browser
Find the bucket you want to enable lifecycle management on, and click None in the Lifecycle column.
Click Add rule.
Select the condition (object is 0 days old or )
Select an action (Delete the object)
Click continue.
Click save.
See https://cloud.google.com/storage/docs/managing-lifecycles for more instructions.
N.B.: Lifecycle changes can take up to 24 hours to go into effect, so once all of your objects go away and you remove the lifecycle config setting, you should wait an additional 24 hours before putting any new files in the bucket, or else they might also get deleted.
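The same rule can also be applied from the command line instead of the console. As a minimal sketch, the Python below writes the rule described above (age 0 days, action Delete) to a JSON file in the format the lifecycle documentation describes. The file name lifecycle.json is just a placeholder.

```python
import json


def build_delete_all_lifecycle(age_days: int = 0) -> dict:
    """Build a GCS lifecycle config that deletes objects older than `age_days`."""
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": age_days},  # 0 matches every object
            }
        ]
    }


# Write the config to a local file (placeholder name) for the CLI to pick up.
with open("lifecycle.json", "w") as fh:
    json.dump(build_delete_all_lifecycle(), fh, indent=2)
```

The file can then be applied with `gsutil lifecycle set lifecycle.json gs://BUCKET_NAME`, where BUCKET_NAME stands for your bucket. The same 24-hour caveat applies.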
The gcloud-sdk command "bq load" can take a local file as input.
From the output of the command, it looks like the file is first uploaded to Google Cloud Storage somewhere before the BigQuery load job is scheduled. Given that the REST API's BigQuery schedule-load-job endpoint also takes only "gs://" URLs, and that the load job needs the data to be reachable, I am pretty sure that such an upload to Cloud Storage is taking place (though I can't find any documentation that explicitly describes "bq load" with local files).
My question then is: can someone tell me where the local file is temporarily uploaded to? Is it one of the gcloud project cloud-storage buckets, or somewhere else? Is it guaranteed to be deleted after the load-job completes?
I have a requirement for data to be kept only in a specific geographical region, thus the location of the (presumed) temporary storage is significant.
I could upload the data explicitly to Cloud Storage and then use "bq load" with a reference to it, but I would then need to arrange deletion of the data afterwards, which is a minor inconvenience. A dedicated bucket with a lifecycle rule could at least delete it after 1 day, but the "bq load .. localfile" approach is cleaner.
If you run bq --help you can see how one of the global bq_flags is --location. It is defined as follows:
--location: “Default geographic location to use when creating datasets or determining where jobs should run (Ignored when not applicable.)”
If you run:
bq load --location=eu {your-table} {your-source}
For a dataset located in the EU, the job should then succeed, and all related jobs should run in the EU.