I have a large bucket (PiB) and I'm interested in running some regex queries to understand how many bytes certain paths take.
gsutil du -s -a gs://.... works well at a small scale, but I have two questions:
Is there a better way to analyze size for redundant paths in GCS that isn't gsutil du
Is there an associated cost for running this command on my bucket?
I think gsutil du, is the tool you might use for this analysis. There is no faster way to do it.
But if you need to do it regularly, you may need to enable bucket logging:
You can read more about it, here:
https://cloud.google.com/storage/docs/access-logs#delivery
Although about the cost, It counts as a class B operation
https://cloud.google.com/storage/pricing
With Cloud Storage, you can't search for object based on regex, only based on a prefix. If you want a regex, you have to mirror the file name elsewhere and search for the pattern that you want.
How to mirror? you have to do it by yourselves :(
About gsutil du command, it's pretty simple: the gsutil binary query Cloud Storage API to get list the file. In that API response, the File metadata are present (especially the file size) and gsutil aggregate the results, i.e. 1 Class a operation call per 1000 files (max page size)
To answer your question 2. Is there an associated cost for running this command on my bucket?, the answer is yes.
I was charged $20 today in the category of Class A Operations, and the only thing I did was uploading the files to my bucket and check the bucket size using gsutil du -s.
They explicitly mentioned this in their document:
Caution: The gsutil du command calculates the current space usage by making a series of object listing requests, which can take a long time for large buckets. If the number of objects in your bucket is hundreds of thousands or more, or if you want to monitor your bucket size over time, use Monitoring instead, as described in the Console tab.
Related
I'm uploading a file that is 8.6T in size.
$ nohup gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp big_file.jsonl gs://bucket/big_file.jsonl > nohup.mv-big-file.out 2>&1 &
At some point, it just hangs, with no error messages, nothing.
Any suggestions on how I can move this large file from the box to the GS bucket?
In accordance to what #John Hanley mentioned, the maximum size limit for individual objects stored in Cloud Storage is 5 TB, as stated in Buckets and Objects Limits
Here are some workaround that you can try :
You can try uploading it across multiple folders on a single bucket since there is no limit on the actual bucket size.
Second option you can try is chunking of your files up to 32 chunks Parallel composite uploads.
Another option that you may also consider is Transfer Appliance for a faster and higher capacity of upload to Cloud Storage.
You might want to take a look as well to GCS's best practices documentation.
I Have a bucket with 3 million objects. I Even don't know how many folders are there in my S3 bucket and even don't know the names of folders in my bucket.I want to show only list of folders of AWS s3. Is there any way to get list of all folders ?
I would use AWS CLI for this. To get started - have a look here.
Then it is a matter of almost standard linux commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
grep command just reads what aws s3 ls command has returned and searches for entries with ending /.
ending > folders.txt saves output to a file.
Note: grep (if I'm not wrong) is unix only utility command. But I believe, you can achieve this on windows as well.
Note 2: depending on the number of files there this operation might (will) take a while.
Note 3: usually in systems like AWS S3, term folder is there only for user to maintain visual similarity with standard file systems however inside it does treat it as a part of a key. You can see in your (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.
Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates something around 3,000 files. After the build, we run aws s3 sync to upload them en masse into a bucket. The problem is that this is monetarily expensive. Each upload is costing us a ~$2 (we think) and this adds up to a monthly bill that raises the eyebrow.
All but maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they all changed and uploads the whole lot.
The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.
What I'd like to do is get it to compute a checksum or a hash on each file and then use that hash to compare the files. Amazon s3 already has the etag field which is can be an MD5 hash of the file. But the aws s3 sync command doesn't use etag.
Is there a way to use etag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save tremendous cost)
The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This will likely avoid copying all files if they are updated with the same content.
As an alternative to s3 sync or cp you could use s5cmd
https://github.com/peak/s5cmd
This is able to sync files on the size and date if different, and also has speeds of up to 4.6gb/s
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu
S3 charges $0.005 per 1,000 PUT requests (doc), so it's extremely unlikely that uploading 3,000 files is costing you $2 per build. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
The end result is that I'd only like to upload the 1 or 2 files that are actually different
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?
The issue that I got was using wildcard * in the --include option. Using one wildcard was fine but when I added the second * such as /log., it looked like sync tried to download everything to compare, which took a lot of CPU and network bandwidth.
I have an S3 bucket in Region A structured like this:
ProviderA-1-1
31423423.jpg
ProviderB-1-1
32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
i want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
E.g i don't care about the version. I just want the folder name (which is unique) to be the filename.
The reason i'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit. (they provide on the fly image transformation for images, given a flat source origin)
So, my requirements are:
I need to copy lots (millions of images, ~10TB) of images
The destination bucket is in another region
I need to 'flatten' the structure, and change the name of the images to be the name of the folder they are in (folder names isn't fixed)
I've seen a few answers here suggesting the aws cli is the best approach, but not sure how i can achieve 3. with that?
Sounds like i need to loop through the images one by one, changing the name before i copy. If a script is suggested, i'm most comfortable with .NET - so perhaps the AWS .NET SDK?
This is a once off job, where i need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients -- one for source bucket (to obtain the listing) and one for the destination bucket (because copy commands are sent to the destination bucket, which pulls the file from the source bucket) because you are using a different region
Use ListObjects() to obtain a list of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
assumes you have the CLI installed and required credentials setup correctly. Or you can pass them on command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making transfer faster, depends on many factors. Few blind hints I can give are:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
.
for prefix in {a..z}
do
aws s3 cp --recursive s3://source_bucket/foo/${prefix}* s3://target_bucket/ &
done
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of water, it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it. Write two separate programs, one to fetch the list of filenames and second to copy. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it would be pretty easy to parallelize given you can split the file (say split -l 1000 file_list splits).
Use xargs -P or gun parallel to run multiple aws s3 cp commands at once. If you're using shell instead of .NET.
Finally don't forget to set the ACL (and other attributes like TTL etc) on target files during the copy. Doing that after the copy will take a long time.
So I know this is a common question but there just doesn't seem to be any good answers for it.
I have a bucket with gobs (I have no clue how many) number of files in them. They are all within 2k a piece.
1) How do I figure out how many of these files I have WITHOUT listing them?
I've used the s3cmd.rb, aws/s3, and jets3t stuff and the best I can find is a command to count the first 1000 records (really performing GETS on them).
I've been using jets3t's applet as well cause it's really nice to work with but even that I can't list all my objects cause I run out of heap space. (presumably cause it is peforming GETS on all of them and keeping them in memory)
2) How can I just delete a bucket?
The best thing I've seen is a paralleized delete loop and that has problems cause sometimes it tries to delete the same file. This is what all the 'deleteall' commands that I've ran across do.
What do you guys do who have boasted about hosting millions of images/txts?? What happens when you want to remove it?
3) Lastly, are there alternate answers to this? All of these files are txt/xml files so I'm not even sure S3 is such a concern -- maybe I should move this to a document database of sorts??
What it boils down to is that the amazon S3 API is just straight out missing 2 very important operations -- COUNT and DEL_BUCKET. (actually there is a delete bucket command but it only works when the bucket is empty) If someone comes up with a method that does not suck to do these two operations I'd gladly give up lots of bounty.
UPDATE
Just to answer a few questions. The reason I ask this was I have been for the past year or so been storing hundreds of thousands, more like millions of 2k txt and xml documents. The last time, a couple of months ago, I wished to delete the bucket it literally took DAYS to do so because the bucket has to be empty before you can delete it. This was such a pain in the ass I am fearing ever having to do this again without API support for it.
UPDATE
this rocks the house!
http://github.com/SFEley/s3nuke/
I rm'd a good couple gigs worth of 1-2k files within minutes.
I am most certainly not one of those 'guys do who have boasted about hosting millions of images/txts', as I only have a few thousand, and this may not be the answer you are looking for, but I looked at this a while back.
From what I remember, there is an API command called HEAD which gets information about an object rather than retrieving the complete object which is what GET does, which may help in counting the objects.
As far as deleting Buckets, at the time I was looking, the API definitely stated that the bucket had to be empty, so you need to delete all the objects first.
But, I never used either of these commands, because I was using S3 as a backup and in the end I wrote a few routines that uploaded the files I wanted to S3 (so that part was automated), but never bothered with the restore/delete/file management side of the equation. For that use Bucket Explorer which did all I need. In my case, it wasn't worth spending time when for $50 I can get a program that does all I need. There are probably others that do the same (eg CloudBerry)
In your case, with Bucket Explorer, you can right click on a bucket and select delete or right click and select properties and it will count the number of objects and the size they take up. It certainly does not download the whole object. (Eg the last bucket I looked it was 12Gb and around 500 files and it would take hours to download 12GB whereas the size and count is returned in a second or two). And if there is a limit, then it certainly isn't 1000.
Hope this helps.
"List" won't retrieve the data. I use s3cmd (a python script) and I would have done something like this:
s3cmd ls s3://foo | awk '{print $4}' | split -a 5 -l 10000 bucketfiles_
for i in bucketfiles_*; do xargs -n 1 s3cmd rm < $i & done
But first check how many bucketfiles_ files you get. There will be one s3cmd running per file.
It will take a while, but not days.
1) Regarding your first question, you can list the items on a bucket without actually retrieving them. You can do that both with the SOAP and the REST API. As you can see, you can define the maximum number of items to list and the position to start the listing from (the marker). Read more about it here.
I do not know of any implementation of the paging, but especially for the REST interface it would be very easy to implement it in any language.
2) I believe the only way to delete a bucket is to first empty it from all items. See alse this question.
3) I would say that S3 is very well suited for storing a large number of files. It depends however on what you want to do. Do you plan to also store binary files? Do you need to perform any queries or just listing the files is enough?
I've had the same problem with deleting hundreds of thousands of files from a bucket. It may be worthwhile to fire up an EC2 instance to run the parallel delete because the latency to S3 is low. I think there's some money to be made hosting a bunch of EC2 servers and charging people to delete buckets quickly. (At least until Amazon gets around to changing the API)
Old thread, but still relevant as I was looking for the answer until I just figured this out. I wanted a file count using a GUI-based tool (i.e. no code). I happen to already use a tool called 3Hub for drag & drop transfers to and from S3. I wanted to know how many files I had in a particular bucket (I don't think billing breaks it down by buckets).
So, using 3Hub,
- list the contents of the bucket (looks basically like a finder or explorer window)
- go to the bottom of the list, click 'show all'
- select all (ctrl+a)
- choose copy URLs from right-click menu
- paste the list into a text file (I use TextWrangler for Mac)
- look at the line count
I had 20521 files in the bucket and did the file count in less than a minute.
I'd like to know if anyone's found a better way since this would take some time on hundreds of thousands of files.
To count objects in an S3 bucket:
Go to AWS Billing, then reports, then AWS Usage reports.
Select Amazon Simple Storage Service, then Operation StandardStorage.
Download a CSV file that includes a UsageType of StorageObjectCount that lists the item count for each bucket.
Count
aws s3 ls s3://mybucket/ --recursive | wc -l
From this post
Delete
aws s3 rm --recursive s3://mybucket/ && aws s3 rb s3://mybucket/
This deletes every item then the bucket.