Boto3 / Python - Listing recently modified objects? - amazon-web-services

I'm looking to list objects, but I'm only really interested in the most recent 1,000 that were modified on the current day.
I have seen that Boto3 supports pagination and getting specific objects by key name plus a modified date. However, I can't see any mechanism that allows listing objects by their modified date.
Boto3 S3 Object.get() - This supports returning a key only if it was modified on a certain date.
Boto3 Paginators - This allows listing a certain number of objects, but doesn't let you control how the listing is sorted or filtered.
I can achieve this by first listing all objects and then iterating over that list, but that incurs the cost of listing the entire bucket, which is exactly the overhead I'm trying to avoid.

No, there is no base functionality that offers this capability.
An alternative is to activate Amazon S3 Inventory:
Amazon S3 inventory is one of the tools Amazon S3 provides to help manage your storage. You can simplify and speed up business workflows and big data jobs using the Amazon S3 inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation. Amazon S3 inventory provides a comma-separated values (CSV) flat-file output of your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
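For completeness, here is a minimal sketch of the client-side fallback mentioned in the question: paginate the listing and filter on LastModified locally. The bucket name is a placeholder, and every object still has to be listed, so the full LIST request charges apply.

from datetime import datetime, timezone
import boto3

s3 = boto3.client('s3')
today = datetime.now(timezone.utc).date()
paginator = s3.get_paginator('list_objects_v2')
modified_today = []

# Paginate the full listing; the filtering has to happen client-side
# because the S3 List API has no server-side filter on LastModified.
for page in paginator.paginate(Bucket='your-bucket-name'):
    for obj in page.get('Contents', []):
        if obj['LastModified'].date() == today:
            modified_today.append(obj)

# Keep only the 1,000 most recently modified objects from today.
modified_today.sort(key=lambda o: o['LastModified'], reverse=True)
latest_1000 = modified_today[:1000]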

Related

AWS service to store single key value that is updated daily

What AWS service is appropriate for storing a single key-value pair of data that is updated daily? The stored data will be retrieved by several other services throughout the day (~100 times total per day).
My current solution is to create and upload a JSON to an S3 bucket. All other services download the JSON and get the data. When it's time to update the data, I create a new JSON and upload it to replace the previously uploaded JSON. This works pretty well but I'm wondering if there is a more appropriate way.
There are many:
AWS Systems Manager Parameter Store
AWS Secrets Manager
Dynamo
S3
^ Those are some of the most common. Without knowing more, I'd suggest you consider Dynamo or Parameter Store. Both are simple and inexpensive, although S3 is fine, too.
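A minimal sketch of the Parameter Store option with boto3; the parameter name and value below are placeholders, not anything from your setup:

import json
import boto3

ssm = boto3.client('ssm')

# Daily update: overwrite the single value (parameter name is hypothetical).
ssm.put_parameter(
    Name='/myapp/daily-value',
    Value=json.dumps({'rate': 1.23}),
    Type='String',
    Overwrite=True,
)

# Consumers read it back throughout the day.
value = json.loads(
    ssm.get_parameter(Name='/myapp/daily-value')['Parameter']['Value']
)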
The only reason not to use S3 is if you want governance of the key (expiry and so on) handled automatically on the AWS side, as with Secrets Manager; in that case, giving it to third parties will also be much harder.
Your solution seems very good, especially since S3 IS the object store database - a JSON document is an object.
The system you described is such a low usage that you shouldn't spend time thinking if there is any better way :)
Just make sure you are aware that Amazon S3 provides read-after-write consistency for PUTs of new objects in your S3 bucket in all regions, with one caveat: if you make a HEAD or GET request to the key name (to check whether the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.
and to refer to your comment:
The S3 way seemed a little hacky, so I am trying to see if there is a better approach
The S3 way is not hacky at all - the intended use of S3 is to store objects in exactly this key-value fashion :)
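A minimal sketch of the approach described in the question, with a placeholder bucket and key:

import json
import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'your-bucket-name', 'config/daily.json'  # placeholders

# Daily update: overwrite the same key with the new JSON document.
s3.put_object(
    Bucket=BUCKET,
    Key=KEY,
    Body=json.dumps({'rate': 1.23}).encode('utf-8'),
    ContentType='application/json',
)

# Any consumer fetches the latest version by key.
data = json.loads(s3.get_object(Bucket=BUCKET, Key=KEY)['Body'].read())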

Efficient way to find the number of objects under an S3 prefix

I can think of these solutions:
The CloudWatch bucket metric is good, but it's only available at the bucket level.
S3 List (with a prefix) is time-consuming for millions of objects.
Is there any other efficient and cheaper way?
Some options:
Call ListObjects() with pagination to obtain the current list (see the sketch after this list), or
Use Amazon S3 Inventory, which can provide a daily or weekly CSV, ORC, or Parquet file listing all objects, or
Create your own database (e.g. in DynamoDB) that keeps track of objects, and create AWS Lambda functions that are triggered whenever objects are created or deleted to update that database
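A minimal sketch of the first option, counting keys under a prefix with a paginated ListObjectsV2 call; the bucket and prefix names are placeholders, and one LIST request is still charged per 1,000 keys:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

count = 0
# Each page covers up to 1,000 keys and costs one LIST request.
for page in paginator.paginate(Bucket='your-bucket-name', Prefix='some/prefix/'):
    count += page.get('KeyCount', 0)

print(f'Objects under prefix: {count}')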

What is the cost of listing all files in an AWS S3 bucket?

I am writing a script in Python where I need to get the most recently modified file in a bucket (using a prefix), but as far as I have read, I cannot do that query directly from Python (using boto3, at least), so I have to retrieve the information for every object in my bucket.
I would have to query several thousand files, and I do not want to get any surprises on my bill.
If I do a query where I retrieve the metadata of all the objects in my bucket to sort them later locally, will I be charged for a single request, or will it count as one request per object?
Thank you all in advance
Popular
A common method is to use s3api, which consolidates the listing into a single LIST request for every 1,000 objects, and then use --query to define your filtering operation, such as:
aws s3api list-objects-v2 --bucket your-bucket-name --query 'Contents[?contains(LastModified, `$DATE`)]'
Please keep in mind that this isn't a good solution, for two reasons:
It does not scale well, especially with large buckets, and it does not help much in minimizing outbound data.
It does not reduce the number of S3 API calls, because the --query parameter is not evaluated server-side; it just happens to be a feature of the aws-cli command. To illustrate, this is how it would look in boto3, and as you can see we would still need to filter on the client side:
import boto3

client = boto3.client('s3', region_name='us-east-1')
# Returns at most 1,000 objects per call; the sort happens client-side.
response = client.list_objects_v2(Bucket='your-bucket-name')
latest = sorted(response['Contents'], key=lambda item: item['LastModified'])[-1]
Probably
One thing you could *probably* do, depending on your specific use case, is to use S3 Event Notifications to automatically publish an event to SQS, which gives you the opportunity to poll for all the S3 object events along with their metadata, which is more lightweight. This will still cost some money, and it won't work if you already have a large existing bucket to begin with. You will also have to actively poll for the messages, since they won't persist for long.
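A minimal sketch of that polling loop, assuming the bucket already publishes s3:ObjectCreated:* events to an SQS queue; the queue URL below is a placeholder:

import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/s3-events'  # placeholder

# Long-poll for S3 event notifications and pull out the key and event time.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get('Messages', []):
    body = json.loads(msg['Body'])
    for record in body.get('Records', []):
        print(record['s3']['object']['key'], record['eventTime'])
    # Delete the message once processed so it is not redelivered.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])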
Perfect (sorta)
This sounds to me like a good use case for S3 Inventory. It will deliver a daily file comprising the list of objects and their metadata, based on your specifications. See https://docs.aws.amazon.com/AmazonS3/latest/user-guide/configure-inventory.html

Creating an index for Amazon S3 bucket keys

What is the general strategy to index S3 keys such that it would be possible to query them? Since multiple S3 operations cannot be wrapped in a transaction, it is not possible to build an index like this:
putObject(...);
indexObject(...);
The indexObject() call here is not guaranteed to run in the same transaction as the preceding putObject(), so a wire or other connection error would leave the first operation without an index entry.
There is no capability for "searching" Amazon S3 keys. The closest capability is to specify a Prefix, which can be a directory path, or partial name of an object.
A ListObjects call only returns 1000 objects at a time. This means that large buckets with 100,000+ objects can be slow to retrieve.
If you need a fast, searchable index you can store a list of keys in DynamoDB. Then, use Amazon S3 Events to trigger AWS Lambda functions when objects are added and deleted, to update DynamoDB.
Alternatively, if you have a large number of objects but they do not change frequently, you can use Amazon S3 Inventory to obtain a daily or weekly CSV file with a list of all objects.
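A minimal sketch of the Lambda-to-DynamoDB approach described above, assuming a table (the name here is a placeholder) with the object key as its partition key, and a bucket notification configured for both ObjectCreated and ObjectRemoved events:

from urllib.parse import unquote_plus
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3-key-index')  # placeholder table name

def handler(event, context):
    for record in event.get('Records', []):
        # S3 URL-encodes object keys in event payloads.
        key = unquote_plus(record['s3']['object']['key'])
        if record['eventName'].startswith('ObjectCreated'):
            table.put_item(Item={
                'key': key,
                'size': record['s3']['object'].get('size', 0),
                'eventTime': record['eventTime'],
            })
        elif record['eventName'].startswith('ObjectRemoved'):
            table.delete_item(Key={'key': key})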

Storing multiple results in Amazon S3

I am running multiple plugins as cron jobs every day to gather data from different data sources. I am planning to store this data in two places: Amazon DynamoDB and Amazon S3. The metadata of the results will be stored in DynamoDB, which will also hold the name of the S3 bucket where the data is stored. My question is how to group these daily results in S3. I am thinking of a couple of ways:
(1) Let's say for plugin1, each day I run it I store the results in a different bucket, where the bucket name will be -. The merit of this approach is that it is easy to retrieve the data for each day, but the demerit is that we would have 365 buckets for just one plugin. So if I have n plugins, I will have 365 times n buckets over a year. We could delete buckets after some time interval (say 3 months) to reduce the number of buckets.
(2) I could also use one bucket per plugin and use a GUID as a prefix for my keys, like guid/result_n, where result_n is the nth result that I get for that plugin. I would also add a key, let's call it plugin_runs, that would hold a list of dictionaries, where each dictionary has the format {date: execution_id}. Then, for a given date, I could find the prefix for that date's execution_id and retrieve the contents of those keys.
Which approach would be better? Any other suggestions?
Given that AWS will only allow you to create 100 buckets per account by default, I would say #2 is a much better approach.
But you really only need a single bucket, with a key prefix on each object to organize them. Here, for example, is how AWS Kinesis Firehose creates objects for you, and the naming convention they use. If it works for them, it should work for you:
Amazon S3 Object Name Format
Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH before putting objects to Amazon S3. The prefix translates into the Amazon S3 folder structure, where each label separated by a forward slash (/) becomes a sub-folder. You can modify this folder structure by adding your own top-level folder with a forward slash (for example, myApp/YYYY/MM/DD/HH) or prepending text to the YYYY top-level folder name (for example, myApp YYYY/MM/DD/HH). This is accomplished by specifying an S3 Prefix when creating the delivery stream, either by using the Firehose console or the Firehose API.
http://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html
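Following that convention, a minimal sketch of approach #2 with a single bucket and a date-based key prefix; the bucket name and helper function below are placeholders:

from datetime import datetime, timezone
import boto3

s3 = boto3.client('s3')

def store_result(plugin_name, execution_id, payload, bucket='my-results-bucket'):
    # One bucket, with the plugin name and UTC run date folded into the key,
    # e.g. plugin1/2024/01/31/<execution_id>.json
    now = datetime.now(timezone.utc)
    key = f"{plugin_name}/{now:%Y/%m/%d}/{execution_id}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    return key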