AWS CLI S3API find newest folder in path - amazon-web-services

I've got a very large bucket (hundreds of thousands of objects) and a path (let's say s3://myBucket/path1/path2). Uploads to /path2 are themselves folders. So a sample might look like:
s3://myBucket/path1/path2/v6.1.0
s3://myBucket/path1/path2/v6.1.1
s3://myBucket/path1/path2/v6.1.102
s3://myBucket/path1/path2/v6.1.2
s3://myBucket/path1/path2/v6.1.25
s3://myBucket/path1/path2/v6.1.99
S3 doesn't take version-number sorting into account (which makes sense), but alphabetically the last item in the list is not the last one uploaded. In that example .../v6.1.102 is the newest.
Here's what I've got so far:
aws s3api list-objects \
--bucket myBucket \
--query "sort_by(Contents[?contains(Key, \`path1/path2\`)],&LastModified)" \
--max-items 20000
So one problem here is that --max-items seems to start alphabetically from all the files in the bucket, recursively. 20000 does reach my files, but it's a pretty slow process to go through that many objects.
So my questions are twofold:
1 - This is still searching the whole bucket, but I just want to narrow it down to path2/. Can I do this?
2 - This lists just objects; is it possible to pull up just a list of paths (folders) instead?
Basically, the end goal is a command that returns the newest folder name, like 'v6.1.102' from the example above.

To answer #1, you could add the --prefix path1/path2 to limit what you're querying in the bucket.
In terms of sorting by last modified, I can only think of using an SDK (boto3) to combine list_objects_v2 and head_object to get the last-modified timestamps of the objects and sort programmatically.
Update
Alternatively, you could reverse sort by LastModified in jmespath and return the first item to give you the most recent object and gather the directory from there.
aws s3api list-objects-v2 \
--bucket myBucket \
--prefix path1/path2 \
--query 'reverse(sort_by(Contents,&LastModified))[0]'
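If you do end up reaching for an SDK, here is a minimal boto3 sketch of the same idea that goes one step further and prints just the folder name (the bucket and prefix are the names from the question, and it assumes keys look like path1/path2/vX.Y.Z/...):

import boto3

# Names taken from the question; adjust as needed
bucket = "myBucket"
prefix = "path1/path2/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

newest = None
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if newest is None or obj["LastModified"] > newest["LastModified"]:
            newest = obj

if newest:
    # Key looks like "path1/path2/v6.1.102/...", so the folder is the third segment
    print(newest["Key"].split("/")[2])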

If you want general purpose querying e.g. "lowest version", "highest version", "all v6.x versions" then consider maintaining a separate database with the version numbers.
If you only need to know the highest version number and you need that to be retrieved quickly (quicker than a list object call) then you could maintain that version number independently. For example, you could use a Lambda function that responds to objects being uploaded to path1/path2 where the Lambda function is responsible for storing the highest version number that it has seen into a file at s3://mybucket/version.max.
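As a rough illustration of that Lambda idea (this is a sketch, not a drop-in function: it assumes the function is subscribed to S3 PUT notifications on path1/path2/, that keys look like path1/path2/vX.Y.Z/..., and that the running maximum lives in a plain-text object named version.max):

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
VERSION_KEY = "version.max"  # assumed location of the "highest version seen" file


def version_tuple(v):
    # "v6.1.102" -> (6, 1, 102) so versions compare numerically, not alphabetically
    return tuple(int(part) for part in v.lstrip("v").split("."))


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # e.g. path1/path2/v6.1.102/build.zip

        try:
            uploaded = key.split("/")[2]      # the version "folder" segment
            uploaded_t = version_tuple(uploaded)
        except (IndexError, ValueError):
            continue  # not a versioned upload, ignore it

        try:
            current = s3.get_object(Bucket=bucket, Key=VERSION_KEY)["Body"].read().decode().strip()
            current_t = version_tuple(current)
        except s3.exceptions.NoSuchKey:
            current_t = ()

        if uploaded_t > current_t:
            s3.put_object(Bucket=bucket, Key=VERSION_KEY, Body=uploaded.encode())

Any consumer that needs the highest version can then read version.max with a single GetObject instead of listing the whole prefix.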

Prefix works with list_objects using the boto3 client (the boto3 resource might give you some issues). The Paginator is a great concept for pagination and works nicely. To find the latest change (addition of objects): sort_by(Contents, &LastModified)[-1]
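A minimal sketch of that combination in boto3 (the bucket and prefix are placeholders for the ones from the question):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect every object under the prefix, across pages
objects = []
for page in paginator.paginate(Bucket="myBucket", Prefix="path1/path2/"):
    objects.extend(page.get("Contents", []))

# Equivalent of sort_by(Contents, &LastModified)[-1]
if objects:
    latest = sorted(objects, key=lambda o: o["LastModified"])[-1]
    print(latest["Key"], latest["LastModified"])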

Related

How Can I Search Unknown Folders in an S3 Bucket? I Have Millions of Objects in My Bucket and Only Want the Folder List

I have a bucket with 3 million objects. I don't even know how many folders there are in my S3 bucket, and I don't even know the names of the folders in my bucket. I want to show only the list of folders in AWS S3. Is there any way to get a list of all folders?
I would use AWS CLI for this. To get started - have a look here.
Then it is a matter of an almost standard Linux command (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
the grep command just reads what the aws s3 ls command returned and keeps only the entries ending with /.
the trailing > folders.txt saves the output to a file.
Note: grep (if I'm not wrong) is a Unix-only utility, but I believe you can achieve the same on Windows as well.
Note 2: depending on the number of files there, this operation might (will) take a while.
Note 3: in systems like Amazon S3, the term "folder" exists only to give users a visual similarity to standard file systems; internally it is just treated as part of the key. You can see this in your (web) console when you filter by "prefix".
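If you'd rather do the same from Python, here is a short boto3 sketch that lists only the "folders" directly under a prefix (using Delimiter and CommonPrefixes; the bucket and prefix names are placeholders):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Delimiter="/" makes S3 return CommonPrefixes, i.e. the "folders" one level below the prefix
for page in paginator.paginate(Bucket="my-bucket", Prefix="path/to/search/folder/", Delimiter="/"):
    for cp in page.get("CommonPrefixes", []):
        print(cp["Prefix"])

Unlike the grep approach, this does not depend on zero-byte "folder" placeholder objects existing; S3 derives the prefixes from the keys themselves.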
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.
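As a rough illustration of working with the inventory output (assumptions: you've downloaded one of the gzip-compressed CSV data files, it uses the usual "bucket, key, ..." column order with no header row, and keys are URL-encoded):

import csv
import gzip
from urllib.parse import unquote_plus

folders = set()

# "inventory-data-file.csv.gz" is a hypothetical local copy of one inventory data file
with gzip.open("inventory-data-file.csv.gz", mode="rt", newline="") as f:
    for row in csv.reader(f):
        key = unquote_plus(row[1])           # second column is the object key in the default schema
        if "/" in key:
            folders.add(key.split("/")[0])   # collect the top-level "folder" names

for folder in sorted(folders):
    print(folder)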

S3 Bucket objects starting at a certain date [duplicate]

I want to query items from S3 within a specific subdirectory in a bucket by the date/time that they were added to S3. I haven't been able to find any explicit documentation around this, so I'm wondering how it can be accomplished?
The types of queries I want to perform look like this...
Return the URL of the most recently created file in S3 bucket images under the directory images/user1/
Return the URLs of all items created between datetime X and datetime Y in the S3 bucket images under the directory images/user1
Update 3/19/2019
Apparently s3api allows you to do this quite easily
One solution would probably be to use the s3api. It works easily if you have fewer than 1000 objects; otherwise you need to work with pagination.
s3api can list all objects and exposes the LastModified attribute of the keys stored in S3. The result can then be sorted to find files after or before a date, matching a date, and so on.
Examples of running such queries:
all files for a given date
DATE=$(date +%Y-%m-%d)
aws s3api list-objects-v2 --bucket test-bucket-fh --query "Contents[?contains(LastModified, \`$DATE\`)]"
(note the double quotes around the --query expression so that $DATE is expanded by the shell)
all files after a certain date
export YESTERDAY=$(date -v-1w +%F)
aws s3api list-objects-v2 --bucket test-bucket-fh --query "Contents[?LastModified > \`$YESTERDAY\`]"
s3api returns some metadata for each object, so you can also filter for specific elements, e.g. just the Key:
DATE=$(date +%Y-%m-%d)
aws s3api list-objects-v2 --bucket test-bucket-fh --query "Contents[?contains(LastModified, \`$DATE\`)].Key"
OLD ANSWER
AWS-SDK/CLI really should implement some sort of retrieve-by-date flag; it would make life easier and cheaper.
If you have not prefixed/labelled your files with dates, you may also want to try using the flag
--start-after (string)
If you know the latest file you want to start listing from, you can use the list-objects-v2 command with the --start-after flag.
"StartAfter is where you want Amazon S3 to start listing from. Amazon S3 starts listing after this specified key. StartAfter can be any key in the bucket"
So --start-after will keep listing your objects from that point onward; if you would like to limit the number of items, try also specifying the --max-items flag.
https://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects-v2.html
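In boto3 the same idea looks roughly like this (bucket, prefix, and the StartAfter key are hypothetical placeholders; MaxItems in the pagination config caps the total returned, similar to the CLI's --max-items):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

pages = paginator.paginate(
    Bucket="images",
    Prefix="images/user1/",
    StartAfter="images/user1/2019-03-18_last-known-file.jpg",  # hypothetical key to start after
    PaginationConfig={"MaxItems": 500},
)

for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["LastModified"])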
S3 can list all objects in a bucket, or all objects with a prefix (such as a "directory"). However, this isn't a cheap operation; it's certainly not designed to be done on every request.
Generally speaking, you are best served by a database layer for this. It can be something light and fast (like redis), but you should know what objects you have and which one you need for a given request.
You can somewhat cheat by copying objects twice: for instance, images/latest.jpg or images/user1/latest.jpg. But for the "date query" example, you should certainly handle this outside of S3.
You could store the files prefixed by date in the final directory, e.g.:
images/user1/2016-01-12_{actual file name}
Then, in the script doing the querying, you can generate the list of dates in the time period, construct the prefixes accordingly, query S3 for each date separately and merge the results. It should be much faster than getting the full list and filtering on the LastModified field (well, that depends on how many files you have in the given dir; I think anything less than 1000 is not worth the effort).
There is actually a better method using the 'Marker' parameter of the listObjects call: you set the marker to a key and listObjects will return only keys which come after that one in the listing. We do have dates and times in our key names.
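A rough boto3 illustration of that Marker idea, assuming the date-prefixed key layout from the example above (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# list_objects (v1) starts listing after Marker; because ISO dates sort lexicographically,
# this skips everything uploaded before 2016-01-12 under that "directory".
# Note: at most 1000 keys come back per call (check IsTruncated/NextMarker for more).
resp = s3.list_objects(
    Bucket="my-bucket",
    Prefix="images/user1/",
    Marker="images/user1/2016-01-12",
)

for obj in resp.get("Contents", []):
    print(obj["Key"])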

Need to export the path/url of each file in Amazon S3 server

I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For example, if I have a bucket called b1 and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck, however they require you to select each and every file to copy their urls.
I also tried exploring the boto3 library in Python, however I did not come across any built-in functions to get the file URLs.
I am looking for any tool I can use, or even a script I can execute to get all the urls. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
Eg this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(the 'read's are just to strip off uninteresting columns).
NB: I'm using the v1 CLI.
What you should do is have another look at the boto3 documentation, as it is exactly what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3 for S3, the method you are looking for is list_objects_v2(). This will give you the 'Key', or object path, of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it the same way you would access keys/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
If you've got that working the next step is to try to loop and get all values. You can either use a for loop to do this or there is an awesome python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths up to 1000 objects in one line.
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now since your buckets may have more than 1000 objects, you will need to use the paginator to do this. Have a look at this to understand it.
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, only 1000 objects can be returned per call. To overcome this we use a paginator, which treats the 1000-object limit as a page size, so you just need to use it within a for loop to get all the results you are looking for.
Once you get this working for one bucket, store the result in a variable which will be of type list and repeat for the rest of the buckets. Once you have all this data you could easily just copy paste it into an excel sheet or use python to do it. (Haven't tested the code snippets but they should work).
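Putting those pieces together, here is a sketch (untested, standard boto3 and csv calls only; the output filename is a placeholder) that walks every bucket with a paginator and writes the b1/f1.txt-style paths to a CSV you can open in Excel:

import csv

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

with open("s3_paths.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path"])
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # Paginate so buckets with more than 1000 objects are fully listed
        for page in paginator.paginate(Bucket=name):
            for obj in page.get("Contents", []):
                writer.writerow([f"{name}/{obj['Key']}"])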
Amazon S3 Inventory can help you with this use case.
Do evaluate that option; refer to https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html

What is the boto3 equivalent of the following AWS CLI command?

I am trying to write the following AWS CLI command in boto3:
aws s3api list-objects-v2 --bucket bucket --query "Contents[?LastModified>='2020-02-20T17:00:00'] | [?LastModified<='2020-02-20T17:10:00']"
My goal is to be able to list only those s3 files in a specified bucket that were created within a specific interval of time. In the example above, when I run this command it returns only files that were created between 17:00 and 17:10.
When looking at the boto3 documentation, I am not seeing anything resembling the '--query' portion of the above command, so I am not sure how to go about returning only the files that fall within a specified time interval.
(Please note: Listing all objects in the bucket and filtering from the larger list is not an option for me. For my specific use case, there are simply too many files to go this route.)
If a boto3 equivalent does not exist, is there a way to launch this exact command via a Python script?
Thanks in advance!
There's no direct equivalent in boto3, but s3pathlib solves your problem. Here's a code snippet that does it:
from datetime import datetime
from s3pathlib import S3Path

# define a bucket
p_bucket = S3Path("bucket")

# filter by last modified
for p in p_bucket.iter_objects().filter(
    # any Filterable attribute can be used for filtering
    S3Path.last_modified_at.between(
        datetime(2000, 1, 1),
        datetime(2020, 1, 1),
    )
):
    # do whatever you like
    print(p.console_url)  # click the link to open it in the console and inspect the results
If you want to use other S3Path attributes for filtering, use other comparators, or even define your own custom filter, you can follow this document.
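If you'd rather stay with plain boto3, note that the CLI's --query filter is evaluated client-side anyway, so the sketch below is effectively equivalent. boto3 returns LastModified as timezone-aware datetime objects, so compare datetimes rather than strings (the bucket name and bounds mirror the question; adjust the timezone to match your timestamps):

from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

start = datetime(2020, 2, 20, 17, 0, 0, tzinfo=timezone.utc)
end = datetime(2020, 2, 20, 17, 10, 0, tzinfo=timezone.utc)

for page in paginator.paginate(Bucket="bucket"):
    for obj in page.get("Contents", []):
        # Same check the CLI query performs, just done in Python
        if start <= obj["LastModified"] <= end:
            print(obj["Key"])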

Copy files from S3 bucket to local machine using file index

I need to copy files from many subdirectories in an S3 bucket to my local machine. The file name is auto-generated and would be difficult to obtain without first using ls, but I do know that the target file is always the 2nd file in the subfolder by creation date.
Is there a way to reference a file in an S3 bucket subfolder by index?
I am envisioning doing this with aws cli, though I'm open to other suggestions.
I'm not aware of any way within S3 to list the second-oldest object without listing all objects at a given prefix and then explicitly sorting that list by date (a brute-force sketch of that approach follows after this list). If you need to do this, here are a few ideas:
if objects are only ever added (never deleted), then you could perhaps use a key naming convention when objects are uploaded that allows you to easily locate the 2nd oldest object, e.g. 0001-xxx, 0002-xxx. Then you can find the 2nd oldest object by listing objects with prefix 0002.
maintain an independent index of the objects in an RDBMS or KV database that allows you to easily locate the S3 key of the 2nd oldest object in any part of your S3 hierarchy. Possibly the DB is maintained via a Lambda function called when objects are put or deleted.
use a Lambda function triggered on object PUT that enumerates all of the objects in the relevant 'folder' and writes the key of the 2nd oldest object back to a kind of index object in that same folder (or as metadata on a known index object). Then you can find the 2nd oldest by getting the contents of the index object (or its metadata).
Option #2 might be the best as it's simple, fast, and flexible (what if, as your app changes over time, you find that you also need to know the 4th oldest object, or the 2nd newest object).
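For reference, the brute-force list-and-sort approach mentioned above might look roughly like this in boto3 (bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Gather every object under the prefix (pagination handles more than 1000 objects)
objects = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="some/subfolder/"):
    objects.extend(page.get("Contents", []))

# Sort by LastModified and take the second-oldest key
objects.sort(key=lambda o: o["LastModified"])
if len(objects) >= 2:
    print(objects[1]["Key"])

You could then pass that key to aws s3 cp to download the file.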
You could use this method to obtain the name of the second file in a given bucket/path:
aws s3api list-objects-v2 --bucket BUCKET-NAME --query 'Contents[1].Key' --output text
This would also work within a path if you add --prefix PATH.
However, you mention that you have many subdirectories, so you would have to know the names of all those subdirectories if you are wanting to avoid doing a full bucket listing.