I run multiple plugins as daily cron jobs to gather data from different data sources. I am planning to store this data in two places: Amazon DynamoDB and Amazon S3. The metadata of the results will be stored in DynamoDB, which will also hold the name of the S3 bucket where the data is stored. My question is: how do I group these daily results in S3? I am thinking of a couple of ways:
(1) For plugin1, say, each daily run would be stored in a different bucket, where the bucket name will be -. The merit of this approach is that it is easy to retrieve the data for each day, but the demerit is that we now have 365 buckets for just one plugin. So with n plugins I will have 365 times n buckets over a year. We could delete buckets after some time interval (say 3 months) to reduce the number of buckets.
(2) I could also use one bucket per plugin and use a GUID as a prefix for my keys, like guid/result_n, where result_n is the nth result that I get for that plugin. I would also add a key, let's call it plugin_runs, that would hold a list of dictionaries, each with the format {date: execution_id}. Then, for a given date, I could find the prefix for that date's execution_id and retrieve the contents of those keys.
Which approach would be better? Any other suggestions?
Given that AWS limits the number of buckets per account (100 by default when this was written; the quota can be raised on request), I would say #2 is a much better approach.
But you really only need a single bucket, with a key prefix on each object to organize them. Here, for example, is how AWS Kinesis Firehose creates objects for you, and the naming convention they use. If it works for them, it should work for you:
Amazon S3 Object Name Format
Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH before
putting objects to Amazon S3. The prefix translates into the Amazon S3
folder structure, where each label separated by a forward slash (/)
becomes a sub-folder. You can modify this folder structure by adding
your own top-level folder with a forward slash (for example,
myApp/YYYY/MM/DD/HH) or prepending text to the YYYY top-level folder
name (for example, myApp YYYY/MM/DD/HH). This is accomplished by
specifying an S3 Prefix when creating the delivery stream, either by
using the Firehose console or the Firehose API.
http://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html
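As a rough illustration of the single-bucket layout, here is a minimal Python sketch that builds keys in the same plugin/YYYY/MM/DD/HH style. The function names and the plugin1 and result_1 values are hypothetical, and the sketch assumes run times are timezone-aware datetimes.

```python
from datetime import datetime, timezone


def build_result_key(plugin: str, run_time: datetime, name: str) -> str:
    """Return a key such as 'plugin1/2024/03/05/07/result_1'.

    Assumes run_time is timezone-aware; it is converted to UTC first,
    mirroring the UTC prefix that Firehose uses.
    """
    utc = run_time.astimezone(timezone.utc)
    return f"{plugin}/{utc:%Y/%m/%d/%H}/{name}"


def day_prefix(plugin: str, run_time: datetime) -> str:
    """Return the prefix that covers every result of one plugin for one UTC day."""
    utc = run_time.astimezone(timezone.utc)
    return f"{plugin}/{utc:%Y/%m/%d}/"
```

With keys shaped like this, retrieving one day's results for one plugin becomes a list call with the value of day_prefix as the Prefix, and the DynamoDB metadata only needs to record the key or prefix rather than a separate bucket name per run.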
I can think of these solutions:
The CloudWatch bucket metric is good, but it is only available at the bucket level.
S3 List (with prefix) is time-consuming for millions of objects.
Is there a more efficient and cheaper way?
Some options:
Call ListObjects() (with pagination) to obtain the current list, or
Use Amazon S3 Inventory, which can provide a daily or weekly CSV/ORC/Parquet file listing all objects, or
Create your own database (e.g. in DynamoDB) that keeps track of objects, and create AWS Lambda functions that are triggered whenever objects are created or deleted to update the database
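The first option can be sketched in Python roughly as follows. This is only an illustration: summarize_prefix is a hypothetical helper, and it expects to be handed a boto3 S3 client (for example boto3.client("s3")) rather than creating one itself.

```python
def summarize_prefix(s3_client, bucket: str, prefix: str = ""):
    """Return (object_count, total_bytes) for every object under a prefix.

    s3_client is assumed to be a boto3 S3 client. The paginator follows
    continuation tokens, so each page holds at most 1,000 objects.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    count = 0
    total_bytes = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            count += 1
            total_bytes += obj["Size"]
    return count, total_bytes
```

This still issues one request per 1,000 objects, so for millions of objects the Inventory or database options are likely to be cheaper.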
What is the general strategy for indexing S3 keys so that it is possible to query them? Since multiple S3 operations cannot be wrapped in a transaction, it is not possible to create an index in this manner:
putObject(...);
indexObject(...)
Since indexObject is not guaranteed to run in the same transaction as putObject, a network or other connection error between the two calls would leave the stored object without an index entry.
There is no capability for "searching" Amazon S3 keys. The closest capability is to specify a Prefix, which can be a directory path, or partial name of an object.
A ListObjects call returns at most 1,000 objects at a time. This means that listing large buckets with 100,000+ objects takes many sequential requests and can be slow.
If you need a fast, searchable index you can store a list of keys in DynamoDB. Then, use Amazon S3 Events to trigger AWS Lambda functions when objects are added and deleted, to update DynamoDB.
Alternatively, if you have a large number of objects but they do not change frequently, you can use Amazon S3 Inventory to obtain a daily or weekly CSV file with a list of all objects.
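The DynamoDB index approach described above might look roughly like this inside a Lambda function. It is a sketch under assumptions: apply_s3_event is a hypothetical helper, the bucket and object_key attribute names are an invented table schema, and table is expected to be a boto3 DynamoDB Table resource.

```python
from urllib.parse import unquote_plus


def apply_s3_event(event: dict, table) -> None:
    """Mirror S3 create/delete notifications into an index table.

    table is assumed to be a boto3 DynamoDB Table resource whose primary key
    is (bucket, object_key); that schema is hypothetical.
    """
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        event_name = record["eventName"]
        if event_name.startswith("ObjectCreated:"):
            table.put_item(Item={
                "bucket": bucket,
                "object_key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record["eventTime"],
            })
        elif event_name.startswith("ObjectRemoved:"):
            table.delete_item(Key={"bucket": bucket, "object_key": key})
```

Because the index is updated from the event rather than by the uploader, a failed client call after putObject no longer leaves the object unindexed. Both put_item and delete_item here are idempotent, which matters because a notification can be delivered more than once.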
I want to use Systems Manager to patch EC2 instances, and I use a maintenance window to schedule the patching. In the task, I want to write the output to an S3 bucket.
How do I define the S3 key prefix with a timestamp (or just a date, since I only plan to schedule patching once every day or week), so that the output is organized in S3 under a timestamped folder name?
i.e. today’s output is stored in
mys3bucket/<today’s timestamp>/……
tomorrow’s output is stored in
mys3bucket/<tomorrow’s timestamp>/……
How do I set the S3 key prefix in the AWS console for this purpose? If I cannot, how do I set it with the AWS CLI, an SDK, etc.?
I'm looking to list objects but I am only really interested in the last 1,000 that have been modified that same day.
I have seen that Boto3 supports pagination and getting specific objects by key name plus a modified date. However, I can't see any mechanism that allows listing objects by their modified date.
Boto3 S3 Object.get() - This supports returning an object only if it has been modified since a given time (IfModifiedSince).
Boto3 Paginators - This allows listing a certain number of objects, but doesn't allow you to determine the listing method.
I can achieve this by first listing all objects and then iterating over that list, but this incurs the full cost (and overhead) of listing the entire bucket, which is what I'm trying to avoid.
No, there is no base functionality that offers this capability.
An alternative is to activate Amazon S3 Inventory:
Amazon S3 inventory is one of the tools Amazon S3 provides to help manage your storage. You can simplify and speed up business workflows and big data jobs using the Amazon S3 inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation. Amazon S3 inventory provides a comma-separated values (CSV) flat-file output of your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
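Once an inventory report is available, picking out the objects modified on a given day is a local filtering job. The following Python sketch assumes the CSV text has already been downloaded and decompressed, that file_schema is the fileSchema string from the report's manifest.json, and that the report was configured to include LastModifiedDate; keys_modified_on is a hypothetical helper.

```python
import csv
import io
from datetime import date
from urllib.parse import unquote


def keys_modified_on(csv_text: str, file_schema: str, day: date) -> list:
    """Return keys from one inventory CSV whose LastModifiedDate falls on day (UTC).

    Inventory CSV files have no header row, so column positions are taken
    from file_schema, e.g. "Bucket, Key, Size, LastModifiedDate".
    """
    columns = [name.strip() for name in file_schema.split(",")]
    key_col = columns.index("Key")
    modified_col = columns.index("LastModifiedDate")
    wanted = day.isoformat()  # timestamps are ISO 8601, so a prefix match suffices
    matches = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[modified_col].startswith(wanted):
            # Keys are URL-encoded in the CSV format. Assumption: plain
            # percent-decoding is enough; check how spaces appear in your own files.
            matches.append(unquote(row[key_col]))
    return matches
```

The trade-off is freshness: the report is only produced daily or weekly, so objects uploaded since the last report will not appear in it.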
I have an S3 bucket to which many files are uploaded daily by many users. I am writing a consumer application to list the objects within a given date range.
Note: I cannot get all the objects and sort them, because at least 5k files will be uploaded daily. If I request all objects, my application doesn't scale as the number of files increases. I have to somehow ask the bucket for only the files that were uploaded or modified within a certain range. How can I accomplish that?
The AWS CLI's s3api list-objects command can take a --query argument with which you can filter on the objects' LastModified metadata. The documentation for list-objects (http://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects.html) has an example that queries on key and value, and it should be simple to modify it to query on LastModified instead. Note that --query is applied on the client side after the results have been retrieved, so the whole bucket (or prefix) is still listed.
However, have you considered modifying your S3 directory structure to use a date prefix for the modified files? This would remove the need to filter as you could list the modified files by their prefix.
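If keys are written under a date prefix, a date-range query turns into one list call per day. Here is a minimal Python sketch of that idea; the base/YYYY/MM/DD/ layout and both function names are assumptions for illustration, and s3_client is expected to be a boto3 S3 client.

```python
from datetime import date, timedelta


def daily_prefixes(base: str, start: date, end: date) -> list:
    """Return one 'base/YYYY/MM/DD/' prefix per day from start to end inclusive."""
    prefixes = []
    current = start
    while current <= end:
        prefixes.append(f"{base}/{current:%Y/%m/%d}/")
        current += timedelta(days=1)
    return prefixes


def list_keys_in_range(s3_client, bucket: str, base: str, start: date, end: date):
    """Yield keys under each daily prefix; only the requested days are listed."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for prefix in daily_prefixes(base, start, end):
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]
```

The cost of a query then depends on how many objects fall inside the requested range, not on the total size of the bucket. The date in the key reflects when the uploader chose the key, so it only matches LastModified if objects are not overwritten later.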