I am trying to write the following AWS CLI command in boto3:
aws s3api list-objects-v2 --bucket bucket --query "Contents[?LastModified>='2020-02-20T17:00:00'] | [?LastModified<='2020-02-20T17:10:00']"
My goal is to be able to list only those s3 files in a specified bucket that were created within a specific interval of time. In the example above, when I run this command it returns only files that were created between 17:00 and 17:10.
When looking at the boto3 documentation, I am not seeing anything resembling the '--query' portion of the above command, so I am not sure how to go about returning only the files that fall within a specified time interval.
(Please note: Listing all objects in the bucket and filtering from the larger list is not an option for me. For my specific use case, there are simply too many files to go this route.)
If a boto3 equivalent does not exist, is there a way to launch this exact command via a Python script?
Thanks in advance!
There's no direct equivalent in boto3, but s3pathlib solves your problem. Here's a code snippet that does what you want:
from datetime import datetime
from s3pathlib import S3Path

# define a bucket
p_bucket = S3Path("bucket")

# filter by last modified
for p in p_bucket.iter_objects().filter(
    # any Filterable attribute can be used for filtering
    S3Path.last_modified_at.between(
        datetime(2000, 1, 1),
        datetime(2020, 1, 1),
    )
):
    # do whatever you like
    print(p.console_url)  # click link to open it in console, inspect results
If you want to filter on other S3Path attributes, use other comparators, or even define your own custom filter, you can follow this document.
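If you simply want to launch that exact CLI command from a Python script (the last part of the question), here is a minimal sketch using subprocess, assuming the AWS CLI is installed and configured on the machine:

import json
import subprocess

# run the exact CLI command and parse its JSON output
cmd = [
    "aws", "s3api", "list-objects-v2",
    "--bucket", "bucket",
    "--query",
    "Contents[?LastModified>='2020-02-20T17:00:00'] | [?LastModified<='2020-02-20T17:10:00']",
    "--output", "json",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
objects = json.loads(result.stdout) or []  # the CLI prints "null" when nothing matches
for obj in objects:
    print(obj["Key"], obj["LastModified"])

Keep in mind that the --query filtering happens client-side, so the CLI still pages through the whole listing under the hood.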
Related
I wish to use an AWS CLI command that will return a list of DynamoDB tables in my AWS account where the CreationDateTime is less than a certain value. CreationDateTime is a property that is shown with the describe-table command, but the list-tables command only returns the names of the tables. Is there any way I could possibly use a query to filter the names returned by list-tables in accordance with their corresponding CreationDateTime?
As noted, the answer is simply no, the AWS CLI can not do what you want in one query.
You need to resort to shell scripting to chain the necessary aws commands, or,
if you are willing to give up the hard requirement that it must be the AWS CLI, an alternative solution (and probably faster than forking off aws over and over) is to use a five-line Python script to get the result you want. You can still run this from the command line, but it won't be the AWS CLI specifically.
Just perform the time arithmetic that suits your specific selection criteria.
#!/usr/bin/env python3
import boto3
from datetime import datetime, timedelta, timezone
ddb = boto3.resource('dynamodb')
# tables created in the last 90 days, sorted by creation time
recent = filter(
    lambda table: datetime.now(timezone.utc) - table.creation_date_time < timedelta(days=90),
    ddb.tables.all(),
)
for t in sorted(recent, key=lambda table: table.creation_date_time):
    print(f"{t.name}, {t.creation_date_time}")
No - As you noted, ListTables only lists the names of tables, and there is no request that provides additional information for each table, let alone filter on such information. You'll need to use ListTables and then DescribeTable on each of those tables. You can run all of those DescribeTable requests in parallel to make the whole thing much faster (although there is a practical snag - to do it in parallel, you'll need to have opened a bunch of connections to the server).
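If it helps, here is a rough sketch of that list-then-describe approach in boto3, with the DescribeTable calls issued in parallel (the cutoff date below is just an example):

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

import boto3

ddb = boto3.client('dynamodb')
cutoff = datetime(2020, 1, 1, tzinfo=timezone.utc)  # example threshold

# ListTables only returns names, so collect them all first (it paginates at 100 per call)
table_names = []
for page in ddb.get_paginator('list_tables').paginate():
    table_names.extend(page['TableNames'])

def creation_time(name):
    return name, ddb.describe_table(TableName=name)['Table']['CreationDateTime']

# fan the DescribeTable calls out over a thread pool
with ThreadPoolExecutor(max_workers=10) as pool:
    for name, created in pool.map(creation_time, table_names):
        if created < cutoff:
            print(name, created)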
I want to query items from S3 within a specific subdirectory in a bucket by the date/time that they were added to S3. I haven't been able to find any explicit documentation around this, so I'm wondering how it can be accomplished?
The types of queries I want to perform look like this...
Return the URL of the most recently created file in S3 bucket images under the directory images/user1/
Return the URLs of all items created between datetime X and datetime Y in the S3 bucket images under the directory images/user1
Update 3/19/2019
Apparently s3api allows you to do this quite easily
One solution would probably be to use the s3api. It works easily if you have fewer than 1000 objects; otherwise you need to work with pagination.
s3api can list all objects and exposes the LastModified attribute of each key stored in S3. The results can then be sorted, or filtered to find files after or before a date, matching a date, and so on.
Examples of running such queries:
all files for a given date
DATE=$(date +%Y-%m-%d)
aws s3api list-objects-v2 --bucket test-bucket-fh --query "Contents[?contains(LastModified, \`$DATE\`)]"
all files after a certain date
export YESTERDAY=$(date -v-1w +%F)
aws s3api list-objects-v2 --bucket test-bucket-fh --query "Contents[?LastModified > \`$YESTERDAY\`]"
s3api returns a fair amount of metadata per object, so you can also project just the fields you need, for example only the keys:
DATE=$(date +%Y-%m-%d)
aws s3api list-objects-v2 --bucket test-bucket-fh --query "Contents[?contains(LastModified, \`$DATE\`)].Key"
OLD ANSWER
The AWS SDK/CLI really should implement some sort of retrieve-by-date flag; it would make life easier and cheaper.
If you have not prefixed/labelled your files with the dates, you may also want to try using the flag
--start-after (string)
If you know the latest file you want to start listing from, you can use the list-objects-v2 command with the --start-after flag.
"StartAfter is where you want Amazon S3 to start listing from. Amazon S3 starts listing after this specified key. StartAfter can be any key in the bucket"
Note that --start-after will keep listing objects from that point on, so if you would like to limit the number of items, try also specifying the --max-items flag.
https://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects-v2.html
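For what it's worth, the same StartAfter/MaxKeys idea is available on boto3's list_objects_v2 if you would rather do this from Python. A minimal sketch (the bucket and key names are only examples):

import boto3

s3 = boto3.client('s3')

# start listing after a known key and cap the number of results
response = s3.list_objects_v2(
    Bucket='images',
    StartAfter='user1/2019-03-18_photo.jpg',
    MaxKeys=100,
)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['LastModified'])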
S3 can list all objects in a bucket, or all objects with a prefix (such as a "directory"). However, this isn't a cheap operation, and it's certainly not designed to be done on every request.
Generally speaking, you are best served by a database layer for this. It can be something light and fast (like redis), but you should know what objects you have and which one you need for a given request.
You can somewhat cheat by copying objects twice- for instance, images/latest.jpg or images/user1/latest.jpg. But in the "date query" example, you should certainly do this external to S3.
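A minimal sketch of that copy-twice trick with boto3 (the bucket name follows the question; the source key is hypothetical):

import boto3

s3 = boto3.client('s3')

# after uploading a new image, also copy it to a well-known "latest" key
s3.copy_object(
    Bucket='images',
    CopySource={'Bucket': 'images', 'Key': 'user1/2019-03-19_photo.jpg'},
    Key='user1/latest.jpg',
)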
You could store the files prefixed by date in the final directory eg:
images/user1/2016-01-12_{actual file name}
Then, in the script that does the querying, you can generate the list of dates in the time period, construct the prefixes accordingly, query S3 for each date separately, and merge the results. It should be much faster than getting the full list and filtering on the LastModified field (though that depends on how many files you have in the given directory; I think anything fewer than 1000 is not worth the effort).
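A rough sketch of that approach with boto3, assuming the date-prefixed naming scheme above (the bucket name is illustrative):

from datetime import date, timedelta

import boto3

s3 = boto3.client('s3')

start = date(2016, 1, 10)
end = date(2016, 1, 12)

keys = []
# query each date prefix separately and merge the results
for offset in range((end - start).days + 1):
    day = start + timedelta(days=offset)
    prefix = f"images/user1/{day.isoformat()}_"
    response = s3.list_objects_v2(Bucket='my-bucket', Prefix=prefix)
    keys.extend(obj['Key'] for obj in response.get('Contents', []))

print(keys)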
There is actually a better method using the 'Marker' parameter in the listObjects call: you set the marker to a key and listObjects will return only keys which come after that one in the directory. We do have dates and times in the key names.
I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For eg, If I have a bucket called b1, and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck, however they require you to select each and every file to copy their urls.
I also tried exploring the boto3 library in Python, however I did not come across any built-in functions to get the file URLs.
I am looking for any tool I can use, or even a script I can execute to get all the urls. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
Eg this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(the 'read's are just to strip off uninteresting columns).
nb I'm using v1 CLI.
What you should do is have another look at the boto3 documentation, as it is what you are looking for. It is fairly simple to do what you are asking, but it may take you a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3 for S3, the method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it the same way you would access keys/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
If you've got that working, the next step is to loop and get all the values. You can either use a for loop to do this, or there is an awesome Python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths up to 1000 objects in one line.
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now since your buckets may have more than 1000 objects, you will need to use the paginator to do this. Have a look at this to understand it.
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, the way it works is that only 1000 objects can be returned per call. To overcome this we use a paginator, which treats the limit of 1000 as pages, so you just need to loop over the pages to get all the results you are looking for.
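For example, a minimal sketch of the paginator approach, using the same bucket name as the snippet above:

import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

bucket_object_paths = []
# each page holds up to 1000 objects; the paginator keeps fetching until the listing is exhausted
for page in paginator.paginate(Bucket='im-a-bucket'):
    for obj in page.get('Contents', []):
        bucket_object_paths.append(obj['Key'])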
Once you get this working for one bucket, store the result in a variable (which will be a list) and repeat for the rest of the buckets. Once you have all this data, you could easily copy-paste it into an Excel sheet, or use Python to do that as well. (I haven't tested the code snippets, but they should work.)
Amazon S3 Inventory can help you with this use case.
Do evaluate that option; refer to: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
I am trying to read the file names from a GCS bucket recursively, from all folders and subfolders under the bucket, using a Composer DAG. Is it possible? For example, I have a bucket with the corresponding folders and subfolders as listed below; static is the bucket name.
static/folder1/subfolder1/file1.json
static/folder1/subfolder2/file2.json
static/folder1/subfolder3/file3.json
static/folder1/subfolder3/file4.json
I want to read the files recursively and put the data in two variables like below.
bucketname = static
filepath = static/folder1/subfolder3/file4.json
You can use Airflow's BashOperator to use the GCS CLI tool (docs here).
An example could be the following:
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

read_files = BashOperator(
    task_id='read_files',
    bash_command='gsutil ls -r gs://bucket',
    dag=dag,
)
Edit: Since you want to capture the output, and the BashOperator only pushes the last line of stdout to XCom, I would suggest using a PythonOperator that calls a custom Python callable, which uses either the GCS API (or even the CLI tool via subprocess) to collect all file names and push them to XCom for use by downstream tasks. That is, unless no other tasks need this data at all, in which case you can process it however you like (this isn't clear from the question).
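For example, a rough sketch of that PythonOperator approach using the google-cloud-storage client (assuming Airflow 2.x import paths and the bucket name from the question):

from airflow.operators.python import PythonOperator
from google.cloud import storage


def list_gcs_files():
    client = storage.Client()
    # no delimiter means the listing recurses through every folder and subfolder
    blobs = client.list_blobs('static')
    # the return value is pushed to XCom for downstream tasks
    return [f"static/{blob.name}" for blob in blobs]


read_files = PythonOperator(
    task_id='read_files',
    python_callable=list_gcs_files,
    dag=dag,
)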
I've got a very large bucket (hundreds of thousands of objects). I've got a path (lets say s3://myBucket/path1/path2). /path2 gets uploads that are also folders. So a sample might look like:
s3://myBucket/path1/path2/v6.1.0
s3://myBucket/path1/path2/v6.1.1
s3://myBucket/path1/path2/v6.1.102
s3://myBucket/path1/path2/v6.1.2
s3://myBucket/path1/path2/v6.1.25
s3://myBucket/path1/path2/v6.1.99
S3 doesn't take into account version number sorting (which makes sense) but alphabetically the last in the list is not the last uploaded. In that example .../v6.1.102 is the newest.
Here's what I've got so far:
aws s3api list-objects \
    --bucket myBucket \
    --query "sort_by(Contents[?contains(Key, \`path1/path2\`)],&LastModified)" \
    --max-items 20000
So one problem here is that max-items seems to start alphabetically from all files in the bucket, recursively. 20000 does get to my files, but it's a pretty slow process to go through that many files.
So my questions are twofold:
1 - This is still searching the whole bucket but I just want to narrow it down to path2/ . Can I do this?
2 - This lists just objects, is it possible to pull up just a path list instead?
Basically the end goal is I just want a command to return the newest folder name like 'v6.1.102' from the example above.
To answer #1, you could add the --prefix path1/path2 to limit what you're querying in the bucket.
In terms of sorting by last modified, I can only think of using an SDK to combine list_objects_v2 and head_object (boto3) to get the last-modified timestamps on the objects and sort programmatically.
Update
Alternatively, you could reverse sort by LastModified in jmespath and return the first item to give you the most recent object and gather the directory from there.
aws s3api list-objects-v2 \
--bucket myBucket \
--prefix path1/path2 \
--query 'reverse(sort_by(Contents,&LastModified))[0]'
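If you do end up doing this in boto3, here is a minimal sketch of the same idea, paginating so the hundreds of thousands of objects are all covered:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# collect every object under the prefix (handles more than 1000 keys)
objects = []
for page in paginator.paginate(Bucket='myBucket', Prefix='path1/path2/'):
    objects.extend(page.get('Contents', []))

# the most recently modified object
newest = max(objects, key=lambda obj: obj['LastModified'])

# extract the "folder" name (e.g. 'v6.1.102') from a key like 'path1/path2/v6.1.102/...'
folder = newest['Key'][len('path1/path2/'):].split('/')[0]
print(folder)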
If you want general purpose querying e.g. "lowest version", "highest version", "all v6.x versions" then consider maintaining a separate database with the version numbers.
If you only need to know the highest version number and you need that to be retrieved quickly (quicker than a list object call) then you could maintain that version number independently. For example, you could use a Lambda function that responds to objects being uploaded to path1/path2 where the Lambda function is responsible for storing the highest version number that it has seen into a file at s3://mybucket/version.max.
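A rough sketch of such a Lambda handler, assuming keys shaped like path1/path2/v6.1.102/... and the version.max object described above (names are taken from the question and this answer):

import boto3

s3 = boto3.client('s3')

BUCKET = 'myBucket'          # bucket from the question
VERSION_KEY = 'version.max'  # tracking object described above


def parse_version(name):
    # 'v6.1.102' -> (6, 1, 102), so versions compare numerically
    return tuple(int(part) for part in name.lstrip('v').split('.'))


def handler(event, context):
    for record in event['Records']:
        key = record['s3']['object']['key']  # e.g. path1/path2/v6.1.102/some-file
        candidate = key.split('/')[2]        # the version "folder" name

        try:
            current = s3.get_object(Bucket=BUCKET, Key=VERSION_KEY)['Body'].read().decode()
        except s3.exceptions.NoSuchKey:
            current = 'v0.0.0'

        if parse_version(candidate) > parse_version(current):
            s3.put_object(Bucket=BUCKET, Key=VERSION_KEY, Body=candidate.encode())

Note that concurrent uploads could race on this read-modify-write, so treat it as a sketch rather than a drop-in solution.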
Prefix works with list_objects_v2 using the boto3 client, but using the boto3 resource might give you some issues. The paginator is a great concept and works nicely. To find the latest change (the most recently added object), use sort_by(Contents, &LastModified)[-1].