How can I search unknown folders in an S3 bucket? I have millions of objects in my bucket and only want the folder list - amazon-web-services

I have a bucket with 3 million objects. I don't even know how many folders are in my S3 bucket, or what their names are. I want to show only the list of folders in AWS S3. Is there any way to get a list of all folders?

I would use AWS CLI for this. To get started - have a look here.
Then it is a matter of almost standard linux commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
the grep command reads what the aws s3 ls command returned and keeps only entries ending in /.
the trailing > folders.txt saves the output to a file.
Note: grep (if I'm not mistaken) is a Unix-only utility, but I believe you can achieve the same on Windows.
Note 2: depending on the number of files, this operation might (will) take a while.
Note 3: in systems like Amazon S3, the term folder usually exists only to give users a visual similarity to standard file systems; internally it is just part of the key. You can see this in your (web) console when you filter by "prefix".

Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
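For reference, a minimal boto3 sketch of that Delimiter/CommonPrefixes approach (untested; the bucket name is a placeholder, and this shows one level of the hierarchy only):
import boto3
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
# Delimiter='/' makes S3 group keys, so each page carries CommonPrefixes ("folders")
folders = set()
for page in paginator.paginate(Bucket='my-bucket', Delimiter='/'):
    for cp in page.get('CommonPrefixes', []):
        folders.add(cp['Prefix'])
# To descend into the hierarchy you would repeat this with Prefix=<each returned prefix>
for folder in sorted(folders):
    print(folder)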
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.

Related

Need to export the path/url of each file in Amazon S3 server

I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For eg, If I have a bucket called b1, and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck, however they require you to select each and every file to copy its URL.
I also tried exploring the boto3 library in Python, however I did not come across any built-in functions to get the file URLs.
I am looking for any tool I can use, or even a script I can execute to get all the urls. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
Eg this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(the 'read's are just to strip off uninteresting columns).
nb I'm using v1 CLI.
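If you'd rather do it from Python (you mention trying boto3), here is a rough, untested sketch that prints bucket/key paths in the b1/f1.txt form you describe:
import boto3
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
# For every bucket, page through all objects and print "bucket/key"
for bucket in s3.list_buckets()['Buckets']:
    name = bucket['Name']
    for page in paginator.paginate(Bucket=name):
        for obj in page.get('Contents', []):
            print(f"{name}/{obj['Key']}")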
What you should do is have a look again at the boto3 documentation, as it is what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3 for S3 the method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it just the same way you would access key/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
If you've got that working the next step is to try to loop and get all values. You can either use a for loop to do this or there is an awesome python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths up to 1000 objects in one line.
import boto3
import jmespath
bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now since your buckets may have more than 1000 objects, you will need to use the paginator to do this. Have a look at this to understand it.
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, only 1000 objects can be returned per call. To overcome this we use a paginator, which treats the 1000-object limit as pages, so you just need to loop over the pages to get all the results you are looking for.
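Something like this, reusing the jmespath expression from above (untested sketch; the bucket name is still a placeholder):
import boto3
import jmespath
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
bucket_object_paths = []
# Each page holds up to 1000 objects; the paginator handles the continuation tokens
for page in paginator.paginate(Bucket='im-a-bucket'):
    bucket_object_paths.extend(jmespath.search('Contents[*].Key', page) or [])
print(len(bucket_object_paths))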
Once you get this working for one bucket, store the result in a variable (it will be a list) and repeat for the rest of the buckets. Once you have all this data you could easily just copy-paste it into an Excel sheet or use Python to do it. (Haven't tested the code snippets but they should work.)
Amazon S3 Inventory can help you with this use case.
Do evaluate that option. Refer to: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html

Copying objects from one bucket directory folder to another bucket folder using transfer

I want to use Google Transfer to copy all folders/files in a specific directory in Bucket-1 to the root directory of Bucket-2.
I have tried to use Transfer with the filter option, but it doesn't copy anything across.
Any pointers on getting this to work within Transfer, or a step-by-step for functions, would be really appreciated.
I reproduced your issue and it worked for me using gsutil.
For example:
gsutil cp -r gs://SourceBucketName/example.txt gs://DestinationBucketName
Furthermore, I tried to copy using the Transfer option and it also worked. The steps I followed with the Transfer option are these:
1 - Create new Transfer Job
Panel: “Select Source”:
2 - Select your source for example Google Cloud Storage bucket
3 - Select your bucket with the data which you want to copy.
4 - On the field “Transfer files with these prefixes” add your data (I used “example.txt”)
Panel “Select destination”:
5 - Select your destination Bucket
Panel “Configure transfer”:
6 - Run now if you want to complete the transfer now.
7 - Press “Create”.
For more information about copying from one bucket to another, you can check the official documentation.
So, a few things to consider here:
You have to keep in mind that Google Cloud Storage buckets don’t treat subdirectories the way you would expect. To the bucket it is basically all part of the file name. You can find more information about that in the How Subdirectories Work documentation.
The previous is also the reason why you cannot transfer a file that is inside a “directory” and expect to see only the file’s name appear in the root of your targeted bucket. To give you an example:
If you have a file at gs://my-bucket/my-bucket-subdirectory/myfile.txt, once you transfer it to your second bucket it will still have the subdirectory in its name, so the result will be: gs://my-second-bucket/my-bucket-subdirectory/myfile.txt
This is why, if you are interested in automating this process, you should definitely give the Google Cloud Storage Client Libraries a try.
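As a rough illustration only (untested; the bucket and prefix names are placeholders), a google-cloud-storage sketch that copies everything under a prefix into the root of a second bucket, dropping the prefix from the new names:
from google.cloud import storage
client = storage.Client()
src_bucket = client.bucket('my-bucket')            # placeholder names
dst_bucket = client.bucket('my-second-bucket')
prefix = 'my-bucket-subdirectory/'
# Copy every object under the prefix, stripping the prefix from the new name
for blob in client.list_blobs(src_bucket, prefix=prefix):
    new_name = blob.name[len(prefix):]             # e.g. 'myfile.txt' instead of 'my-bucket-subdirectory/myfile.txt'
    if new_name:                                   # skip the zero-byte 'folder' placeholder, if present
        src_bucket.copy_blob(blob, dst_bucket, new_name)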
Additionally, you could also use the GCS Client with Google Cloud Functions. However, I would just suggest this if you really need the Event Triggers offered by GCF. If you just want the transfer to run regularly, for example on a cron job, you could still use the GCS Client somewhere other than a Cloud Function.
The Cloud Storage Tutorial might give you a good example of how to handle Storage events.
Also, on your future posts, try to provide as much relevant information as possible. For this post, as an example, it would’ve been nice to know what file structure you have on your buckets and what you have been getting as an output. And If you can provide straight away what’s your use case, it will also prevent other users from suggesting solutions that don’t apply to your needs.
Try this in Cloud Shell in the project:
gsutil cp -r gs://bucket1/foldername gs://bucket2

Copy all objects to another S3 bucket in different region with different structure

I have an S3 bucket in Region A structured like this:
ProviderA-1-1/
    31423423.jpg
ProviderB-1-1/
    32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
i.e. I don't care about the version. I just want the folder name (which is unique) to be the filename.
The reason i'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit. (they provide on the fly image transformation for images, given a flat source origin)
So, my requirements are:
I need to copy lots of images (millions of images, ~10TB)
The destination bucket is in another region
I need to 'flatten' the structure, and change the name of the images to be the name of the folder they are in (folder names aren't fixed)
I've seen a few answers here suggesting the AWS CLI is the best approach, but I'm not sure how I can achieve requirement 3 with that?
Sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET - so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients -- one for the source bucket (to obtain the listing) and one for the destination bucket -- because you are using different regions, and copy commands are sent to the destination bucket, which pulls the file from the source bucket
Use ListObjects() to obtain a list of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
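The question mentions the .NET SDK, but the same loop is easy to sketch in Python/boto3 (untested; bucket names, regions and the renaming rule are placeholders, and the list_objects_v2 paginator is used here instead of NextMarker):
import boto3
SRC_BUCKET = 'source-bucket-in-region-a'             # placeholders
DST_BUCKET = 'dest-bucket-in-region-b'
src = boto3.client('s3', region_name='eu-west-1')    # hypothetical regions
dst = boto3.client('s3', region_name='us-east-1')
paginator = src.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get('Contents', []):
        key = obj['Key']                              # e.g. 'ProviderA-1-1/31423423.jpg'
        folder, _, filename = key.partition('/')
        if not filename:
            continue                                  # skip keys with no folder component
        # CopyObject is sent to the destination client; S3 copies server-side
        dst.copy_object(Bucket=DST_BUCKET,
                        CopySource={'Bucket': SRC_BUCKET, 'Key': key},
                        Key=f'{folder}.jpg')          # 'flatten': folder name becomes the file name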
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
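For that ongoing case, a minimal handler sketch for an S3-triggered Lambda might look like this (untested; the destination bucket and renaming rule are placeholders):
import boto3
import urllib.parse
s3 = boto3.client('s3')
DST_BUCKET = 'dest-bucket-in-region-b'   # placeholder
def handler(event, context):
    # One invocation can carry several ObjectCreated records
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        folder, _, _ = key.partition('/')
        s3.copy_object(Bucket=DST_BUCKET,
                       CopySource={'Bucket': src_bucket, 'Key': key},
                       Key=f'{folder}.jpg')           # same renaming rule as above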
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Or you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, depending on many factors. A few blind hints I can give are:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
for prefix in {a..z}
do
    # aws s3 cp does not expand wildcards in S3 paths, so filter with --exclude/--include
    aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --exclude "*" --include "${prefix}*" &
done
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of water, it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it. Write two separate programs, one to fetch the list of filenames and second to copy. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it is pretty easy to parallelize, given you can split the file (e.g. split -l 1000 file_list splits it into chunks of 1000 lines).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using shell instead of .NET.
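If you end up in Python rather than shell, the same idea (pre-built key list, parallel server-side copies) can be sketched with a thread pool (untested; the file name, bucket names and renaming rule are placeholders):
import boto3
from concurrent.futures import ThreadPoolExecutor
s3 = boto3.client('s3')
SRC, DST = 'source-bucket', 'dest-bucket'   # placeholders
def copy_one(key):
    folder, _, _ = key.partition('/')
    # pass ACL / other attributes here too if you need them (see the note below)
    s3.copy_object(Bucket=DST,
                   CopySource={'Bucket': SRC, 'Key': key},
                   Key=f'{folder}.jpg')     # same renaming rule as in the question
# file_list.txt: one key per line, produced by the separate 'list' program
with open('file_list.txt') as f:
    keys = [line.strip() for line in f if line.strip()]
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(copy_one, keys))          # list() forces completion and surfaces errors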
Finally don't forget to set the ACL (and other attributes like TTL etc) on target files during the copy. Doing that after the copy will take a long time.

How to diff very large buckets in Amazon S3?

I have a use case where I have to back up a 200+TB, 18M-object S3 bucket that changes often (it is used in batch processing of critical data) to another account. I need to add a verification step, but due to the bucket size, the object count, and the frequency of change, this is tricky.
My current thought is to pull the eTags from the original bucket and the archive bucket, and then write a streaming diff tool to compare the values. Has anyone here had to approach this problem, and if so, did you come up with a better answer?
Firstly, if you wish to keep two buckets in sync (once you've done the initial sync), you can use Cross-Region Replication (CRR).
To do the initial sync, you could try using the AWS Command-Line Interface (CLI), which has an aws s3 sync command. However, it might have some difficulties with a large number of files -- I suggest you give it a try. It uses keys, dates and file size to determine which files to sync.
If you do wish to create your own sync app, then the eTag is a definitive way to compare files.
To make things simple, activate Amazon S3 Inventory, which can provide a daily listing of all files in a bucket, including eTag. You could then do a comparison between the Inventory files to discover which remaining files require synchronization.
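A rough sketch of that comparison, assuming both inventories are plain CSVs with bucket, key, size and ETag columns in that order (the file names and column order here are assumptions -- the inventory fields are configurable):
import csv
def load_inventory(path):
    # Assumed CSV layout: bucket, key, size, etag (S3 Inventory CSVs have no header row)
    with open(path, newline='') as f:
        return {row[1]: row[3] for row in csv.reader(f)}
source = load_inventory('source_inventory.csv')
archive = load_inventory('archive_inventory.csv')
missing = [k for k in source if k not in archive]
changed = [k for k in source if k in archive and source[k] != archive[k]]
print(f'{len(missing)} objects missing, {len(changed)} objects with different ETags')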
For anyone looking for a way to solve this problem in an automated way (as was I),
I created a small python script that leverages S3 Inventories and Athena to do the comparison somewhat efficiently. (This is basically automation of John Rosenstein's suggestion)
You can find it here https://github.com/forter/s3-compare

How do I delete/count objects in an S3 bucket?

So I know this is a common question but there just doesn't seem to be any good answers for it.
I have a bucket with gobs of files in it (I have no clue how many). They are all around 2k apiece.
1) How do I figure out how many of these files I have WITHOUT listing them?
I've used the s3cmd.rb, aws/s3, and jets3t stuff and the best I can find is a command to count the first 1000 records (really performing GETS on them).
I've been using jets3t's applet as well because it's really nice to work with, but even with that I can't list all my objects because I run out of heap space (presumably because it is performing GETS on all of them and keeping them in memory).
2) How can I just delete a bucket?
The best thing I've seen is a parallelized delete loop, and that has problems because sometimes it tries to delete the same file. This is what all the 'deleteall' commands that I've run across do.
What do you guys who have boasted about hosting millions of images/txts do?? What happens when you want to remove it?
3) Lastly, are there alternate answers to this? All of these files are txt/xml files so I'm not even sure S3 is such a concern -- maybe I should move this to a document database of sorts??
What it boils down to is that the amazon S3 API is just straight out missing 2 very important operations -- COUNT and DEL_BUCKET. (actually there is a delete bucket command but it only works when the bucket is empty) If someone comes up with a method that does not suck to do these two operations I'd gladly give up lots of bounty.
UPDATE
Just to answer a few questions. The reason I ask is that for the past year or so I have been storing hundreds of thousands, more like millions, of 2k txt and xml documents. The last time I wished to delete the bucket, a couple of months ago, it literally took DAYS to do so, because the bucket has to be empty before you can delete it. This was such a pain in the ass that I fear ever having to do it again without API support for it.
UPDATE
this rocks the house!
http://github.com/SFEley/s3nuke/
I rm'd a good couple gigs worth of 1-2k files within minutes.
I am most certainly not one of those 'guys who have boasted about hosting millions of images/txts', as I only have a few thousand, and this may not be the answer you are looking for, but I looked at this a while back.
From what I remember, there is an API command called HEAD which gets information about an object, rather than retrieving the complete object as GET does, which may help in counting the objects.
As far as deleting Buckets, at the time I was looking, the API definitely stated that the bucket had to be empty, so you need to delete all the objects first.
But I never used either of these commands, because I was using S3 as a backup; in the end I wrote a few routines that uploaded the files I wanted to S3 (so that part was automated), but never bothered with the restore/delete/file-management side of the equation. For that I use Bucket Explorer, which did all I needed. In my case it wasn't worth spending the time when for $50 I can get a program that does everything I need. There are probably others that do the same (e.g. CloudBerry).
In your case, with Bucket Explorer, you can right-click on a bucket and select delete, or right-click and select properties and it will count the number of objects and the size they take up. It certainly does not download the whole objects. (E.g. the last bucket I looked at was 12GB and around 500 files; it would take hours to download 12GB, whereas the size and count are returned in a second or two.) And if there is a limit, it certainly isn't 1000.
Hope this helps.
"List" won't retrieve the data. I use s3cmd (a python script) and I would have done something like this:
s3cmd ls s3://foo | awk '{print $4}' | split -a 5 -l 10000 bucketfiles_
for i in bucketfiles_*; do xargs -n 1 s3cmd rm < $i & done
But first check how many bucketfiles_ files you get. There will be one s3cmd running per file.
It will take a while, but not days.
1) Regarding your first question, you can list the items on a bucket without actually retrieving them. You can do that both with the SOAP and the REST API. As you can see, you can define the maximum number of items to list and the position to start the listing from (the marker). Read more about it here.
I do not know of any implementation of the paging, but especially for the REST interface it would be very easy to implement it in any language.
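With a modern SDK the paging is already wrapped for you; for example, an untested boto3 sketch that counts objects without retrieving any data (the bucket name is a placeholder):
import boto3
s3 = boto3.client('s3')
count = 0
# Each page is a plain LIST call (max 1000 keys); no object bodies are fetched
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-bucket'):
    count += page.get('KeyCount', 0)
print(count)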
2) I believe the only way to delete a bucket is to first empty it of all items. See also this question.
3) I would say that S3 is very well suited for storing a large number of files. It depends however on what you want to do. Do you plan to also store binary files? Do you need to perform any queries or just listing the files is enough?
I've had the same problem with deleting hundreds of thousands of files from a bucket. It may be worthwhile to fire up an EC2 instance to run the parallel delete because the latency to S3 is low. I think there's some money to be made hosting a bunch of EC2 servers and charging people to delete buckets quickly. (At least until Amazon gets around to changing the API)
Old thread, but still relevant as I was looking for the answer until I just figured this out. I wanted a file count using a GUI-based tool (i.e. no code). I happen to already use a tool called 3Hub for drag & drop transfers to and from S3. I wanted to know how many files I had in a particular bucket (I don't think billing breaks it down by buckets).
So, using 3Hub,
- list the contents of the bucket (looks basically like a finder or explorer window)
- go to the bottom of the list, click 'show all'
- select all (ctrl+a)
- choose copy URLs from right-click menu
- paste the list into a text file (I use TextWrangler for Mac)
- look at the line count
I had 20521 files in the bucket and did the file count in less than a minute.
I'd like to know if anyone's found a better way since this would take some time on hundreds of thousands of files.
To count objects in an S3 bucket:
Go to AWS Billing, then reports, then AWS Usage reports.
Select Amazon Simple Storage Service, then Operation StandardStorage.
Download a CSV file that includes a UsageType of StorageObjectCount that lists the item count for each bucket.
Count
aws s3 ls s3://mybucket/ --recursive | wc -l
From this post
Delete
aws s3 rm --recursive s3://mybucket/ && aws s3 rb s3://mybucket/
This deletes every item then the bucket.
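The same two operations from boto3, for anyone scripting this (untested sketch; the bucket name is a placeholder, and a versioned bucket would need object_versions instead of objects):
import boto3
bucket = boto3.resource('s3').Bucket('mybucket')
# Count (the collection pages through LIST calls under the hood)
print(sum(1 for _ in bucket.objects.all()))
# Delete every object (batched requests), then the bucket itself
bucket.objects.all().delete()
bucket.delete()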