Need to export the path/url of each file in Amazon S3 server - amazon-web-services

I have an Amazon S3 account with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an Excel sheet that contains the path/URL of each file in each bucket.
For example, if I have a bucket called b1 and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like ExpanDrive and Cyberduck; however, they require you to select each and every file to copy its URL.
I also tried exploring the boto3 library in Python, but I did not come across any built-in functions to get the file URLs.
I am looking for any tool I can use, or even a script I can execute, to get all the URLs. Thanks.

Do you have access to the AWS CLI? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
E.g. this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(The read calls are just there to strip off the uninteresting columns.)
NB: I'm using the v1 CLI.

What you should do is have another look at the boto3 documentation, as it is what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3 for S3, the method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can access it the same way you would access keys/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
If you've got that working, the next step is to loop and get all the values. You can either use a for loop to do this, or there is an awesome Python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths (up to 1000 objects) in one line:
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now, since your buckets may have more than 1000 objects, you will need to use a paginator. Have a look at this to understand it:
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, only 1000 objects can be returned per call. To overcome this we use a paginator, which treats the 1000-object limit as pages of the full result, so you just need to iterate over it in a for loop to get all the results you are looking for.
Once you get this working for one bucket, store the result in a variable (it will be a list) and repeat for the rest of the buckets. Once you have all this data you can easily copy-paste it into an Excel sheet or use Python to do it. (I haven't tested the code snippets, but they should work.)
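Putting the pieces together, here is a minimal, untested sketch (the output filename is a placeholder) that paginates through every bucket and writes each path to a CSV you can open in Excel:

import csv

import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

with open('s3_paths.csv', 'w', newline='') as f:   # placeholder filename
    writer = csv.writer(f)
    writer.writerow(['path'])
    # list_buckets() returns every bucket these credentials can see
    for bucket in s3_client.list_buckets()['Buckets']:
        bucket_name = bucket['Name']
        # the paginator transparently handles the 1000-object limit
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get('Contents', []):
                # written as bucket/key, e.g. b1/f1.txt
                writer.writerow([bucket_name + '/' + obj['Key']])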

Amazon S3 Inventory can help you with this use case.
Do evaluate that option. Refer to: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
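If you go down that route, the inventory report can also be enabled from boto3; a rough sketch (untested; the bucket names, report ID and optional fields are placeholders) might look like this:

import boto3

s3 = boto3.client('s3')

# 'my-source-bucket' is the bucket to inventory, 'my-report-bucket' is
# where the CSV reports will be delivered (both are placeholder names).
s3.put_bucket_inventory_configuration(
    Bucket='my-source-bucket',
    Id='all-objects-csv',
    InventoryConfiguration={
        'Id': 'all-objects-csv',
        'IsEnabled': True,
        'IncludedObjectVersions': 'Current',
        'Schedule': {'Frequency': 'Daily'},
        'Destination': {
            'S3BucketDestination': {
                'Bucket': 'arn:aws:s3:::my-report-bucket',
                'Format': 'CSV',
                'Prefix': 'inventory-reports',
            }
        },
        'OptionalFields': ['Size', 'LastModifiedDate'],
    },
)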

Related

Update data in a CSV table which is stored in an AWS S3 bucket

I need a solution for adding new data to a CSV file that is stored in an S3 bucket in AWS.
At this point we are downloading the file, editing it, and then uploading it again to S3, and we would like to automate this process.
We need to add one row to a three-column table.
Thank you in advance!
I think you will be able to do that using Lambda functions. You will need to programmatically make the modifications you need to the CSV, but there are multiple programming languages that allow you to do that. One quick example is using Python and the csv library.
Then you can invoke that Lambda directly, or add more logic to the operations you want to do using an AWS API Gateway.
You can access the CSV file (object) inside the S3 bucket from the Lambda code using the AWS SDK and append the new rows with the data you pass as parameters to the function.
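As a rough illustration, here is a minimal Lambda handler (untested; the bucket name, key and event shape are placeholders) that downloads the CSV, appends a row passed in the event, and uploads it back:

import csv
import io

import boto3

s3 = boto3.client('s3')

BUCKET = 'my-data-bucket'     # placeholder
KEY = 'data/table.csv'        # placeholder

def lambda_handler(event, context):
    # Download the existing CSV from S3
    body = s3.get_object(Bucket=BUCKET, Key=KEY)['Body'].read().decode('utf-8')

    # Append the new row passed in through the event,
    # e.g. {"row": ["value1", "value2", "value3"]}
    rows = list(csv.reader(io.StringIO(body)))
    rows.append(event['row'])

    # Write the updated CSV back to the same object
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=out.getvalue().encode('utf-8'))

    return {'status': 'ok', 'rows': len(rows)}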
There is no way to directly modify the CSV stored in S3 (if that is what you're asking). The process will always entail some version of download, modify, upload. There are many examples of how you can do this, for example here

How can I search unknown folders in an S3 bucket? I have millions of objects in my bucket and only want the folder list

I have a bucket with 3 million objects. I don't even know how many folders there are in my S3 bucket, or what their names are. I want to list only the folders in my AWS S3 bucket. Is there any way to get a list of all the folders?
I would use the AWS CLI for this. To get started, have a look here.
Then it is a matter of an almost standard Linux command (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
the grep command just reads what the aws s3 ls command returned and keeps only the entries ending with /.
the trailing > folders.txt saves the output to a file.
Note: grep (if I'm not mistaken) is a Unix-only utility, but I believe you can achieve this on Windows as well.
Note 2: depending on the number of files, this operation might (will) take a while.
Note 3: in systems like AWS S3, the term 'folder' usually exists only to give users visual similarity with standard file systems; internally it is just treated as part of the key. You can see this in your (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
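For completeness, here is a rough boto3 sketch of that Delimiter/CommonPrefixes approach (untested; the bucket name is a placeholder). It lists only the top-level 'folders', still paging 1000 results at a time:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

folders = set()
# Delimiter='/' groups keys by their first path segment; the groups come
# back under 'CommonPrefixes' instead of 'Contents'. To go deeper in the
# hierarchy you would repeat the call with each prefix.
for page in paginator.paginate(Bucket='my-bucket', Delimiter='/'):
    for prefix in page.get('CommonPrefixes', []):
        folders.add(prefix['Prefix'])

for folder in sorted(folders):
    print(folder)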
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.

How can I detect orphaned objects in S3 that aren't mapped to our database?

I am trying to find possible orphans in an S3 bucket. What I mean is that we might delete something from the DB and, for whatever reason, it doesn't get cleared from S3. This can be a bug in our system or something of that nature. I want to double-check against our API that each object in S3 maps to something that exists - the naming convention lets us map things together like that.
Scraping an entire bucket every X days seems unscalable. I was thinking that each object in the bucket could add itself to an SQS queue so the relevant check happens every 30 days or so.
I've only found events around uploads and specific modifications over at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html. Is there anything more generalized that I'm missing? Any creative solutions to this problem?
You should activate Amazon S3 Inventory, which can provide a regular CSV file (as often as daily) that contains a list of every object in the Amazon S3 bucket.
You could then trigger some code that compares the contents of the CSV file against the database to find 'orphan' objects.
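As a rough illustration (untested; the report filename, column layout and the get_keys_from_database helper are placeholders), the comparison can be a simple set difference:

import csv

def get_keys_from_database():
    # Placeholder: return the set of object keys your database knows about
    raise NotImplementedError

# An S3 Inventory CSV report typically has the bucket name in the first
# column and the object key in the second; keys may be URL-encoded, and
# the exact layout depends on the fields you configured for the report.
with open('inventory.csv', newline='') as f:
    s3_keys = {row[1] for row in csv.reader(f)}

db_keys = get_keys_from_database()

orphans = s3_keys - db_keys
for key in sorted(orphans):
    print('orphan:', key)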

Copy all objects to another S3 bucket in a different region with a different structure

I have an S3 bucket in Region A structured like this:
ProviderA-1-1/
    31423423.jpg
ProviderB-1-1/
    32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
That is, I don't care about the version; I just want the folder name (which is unique) to be the filename.
The reason I'm doing this is to have a flat structure so I can make use of image services like Imgix / ImageKit (they provide on-the-fly image transformation, given a flat source origin).
So, my requirements are:
1. I need to copy a lot of images (millions of images, ~10 TB)
2. The destination bucket is in another region
3. I need to 'flatten' the structure and change the name of each image to the name of the folder it is in (the folder names aren't fixed)
I've seen a few answers here suggesting the AWS CLI is the best approach, but I'm not sure how I can achieve requirement 3 with that.
It sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET - so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients, one for the source bucket (to obtain the listing) and one for the destination bucket, because you are using different regions. Copy commands are sent to the destination bucket, which pulls the file from the source bucket.
Use ListObjects() to obtain a listing of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload.
Continue looping through the list of 1000 files, then get the next 1000 files, and so on.
The process could be sped up by using multi-threading, but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
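If it helps, here is the same logic as a rough Python/boto3 sketch (untested; the bucket names and regions are placeholders); the equivalent .NET SDK calls map onto it almost one-to-one:

import boto3

SOURCE_BUCKET = 'source-bucket-region-a'   # placeholder
DEST_BUCKET = 'dest-bucket-region-b'       # placeholder

# One client per region; the copy request is sent to the destination region,
# which pulls each object from the source bucket.
src = boto3.client('s3', region_name='eu-west-1')   # region A (placeholder)
dst = boto3.client('s3', region_name='us-east-1')   # region B (placeholder)

# The paginator handles the 1000-objects-per-call limit for you
paginator = src.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get('Contents', []):
        key = obj['Key']                   # e.g. 'ProviderA-1-1/31423423.jpg'
        folder = key.split('/')[0]         # 'ProviderA-1-1'
        new_key = folder + '.jpg'          # 'ProviderA-1-1.jpg'
        # Server-side copy: the object is never downloaded locally
        dst.copy_object(
            Bucket=DEST_BUCKET,
            Key=new_key,
            CopySource={'Bucket': SOURCE_BUCKET, 'Key': key},
        )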
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
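A minimal sketch of such a Lambda handler (untested; the destination bucket name is a placeholder) could look like this:

import urllib.parse

import boto3

s3 = boto3.client('s3')
DEST_BUCKET = 'dest-bucket-region-b'   # placeholder

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event on the source bucket
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        # Keys in S3 event notifications are URL-encoded
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        folder = key.split('/')[0]
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=folder + '.jpg',
            CopySource={'Bucket': source_bucket, 'Key': key},
        )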
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Or you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, depending on many factors. A few blind hints I can give:
See if transfer acceleration can help you.
In general, S3-to-S3 transfer is faster than S3 to/from a non-S3 location.
See if you can create parallel batches by prefix, for example (note that aws s3 cp does not expand wildcards in the S3 path, so use --exclude/--include filters instead):
for prefix in {a..z}
do
  aws s3 cp --recursive --exclude "*" --include "${prefix}*" s3://source_bucket/foo/ s3://target_bucket/ &
done
wait   # wait for all the background copies to finish
If this is not a one-time transfer and transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share, except that Aspera blows things like FTP out of the water; it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it: write two separate programs, one to fetch the list of filenames and a second to do the copying. Listing the files takes a lot of time if you have millions of them.
Once you've written the file names to a file, say one per line, it is pretty easy to parallelize, given you can split the file (e.g. split -l 1000 file_list splits it into chunks of 1000 lines).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using a shell instead of .NET.
Finally, don't forget to set the ACL (and other attributes like TTL etc.) on the target files during the copy. Doing that after the copy will take a long time.

Python boto for AWS S3: how to get a sorted and limited list of files in a bucket?

There are too many files in the bucket, and I want to get only the 100 newest files.
How can I get only that list?
s3.bucket.list doesn't seem to have that capability. Does anybody know how to do this?
Please let me know. Thanks.
There is no way to do this type of filtering on the service side. The S3 API does not support it. You might be able to accomplish something like this by using prefixes in your object names. For example, if you named all of your objects using a pattern like this:
YYYYMMDD/<objectname>
20140618/foobar (as an example)
you could use the prefix parameter of the ListBucket request in S3 to return only the objects that were stored today. In boto, this would look like:
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
for key in bucket.list(prefix='20140618'):
    # do something with the key object
    pass
You would still have to retrieve all of the objects with that prefix and then sort them locally based on their last_modified_date, but that would be much easier than listing all of the objects in the bucket and then sorting.
The other option would be to store metadata about the S3 objects in a database like DynamoDB and then query that database to find the objects to retrieve from S3.
You can find out more about hierarchical listing in S3 here
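For reference, here is the same idea with the newer boto3 API (untested; the bucket name and date prefix are placeholders): list by prefix, then sort locally by LastModified.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')   # placeholder

# List only today's objects via the prefix, then sort locally by date
objects = list(bucket.objects.filter(Prefix='20140618'))
objects.sort(key=lambda obj: obj.last_modified, reverse=True)

for obj in objects[:100]:        # the 100 newest objects under that prefix
    print(obj.key, obj.last_modified)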
Can you try this code? It worked for me.
import boto, time

con = boto.connect_s3()
key_repo = []
bucket = con.get_bucket('<your bucket name>')
bucket_keys = bucket.get_all_keys()

for obj in bucket_keys:
    t = (obj.key, time.strptime(obj.last_modified[:19], "%Y-%m-%dT%H:%M:%S"))
    key_repo.append(t)

key_repo.sort(key=lambda item: item[1], reverse=True)

for key in key_repo[:10]:  # top 10 items in the list
    print key[0], ' ', key[1]
PS: I am a beginner in Python, so the code might not be optimized. Feel free to edit the answer to provide better code.