Efficient way to find the number of objects under an S3 prefix - amazon-web-services

I can think of these solutions:
The CloudWatch bucket metric is good, but it is only available at the bucket level.
S3 List (with prefix) is time-consuming for millions of objects.
Is there a more efficient and cheaper way?

Some options:
Call ListObjects() (with pagination) to obtain the current list, or
Use Amazon S3 Inventory, which can provide a daily or weekly CSV/ORC/Parquet file listing all objects, or
Create your own database (e.g. in DynamoDB) that keeps track of objects, and create AWS Lambda functions that are triggered whenever objects are created or deleted to update the database
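As a rough illustration of the first option, here is a minimal sketch that counts objects under a prefix by paging through the listing. The function name `count_objects` is made up for this example, and `s3_client` is assumed to be a boto3 S3 client (for example `boto3.client("s3")`) passed in by the caller.

```python
def count_objects(s3_client, bucket, prefix):
    """Count objects under a prefix by paging through ListObjectsV2.

    `s3_client` is assumed to be a boto3 S3 client supplied by the caller.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        total += len(page.get("Contents", []))
    return total
```

This still issues one request per 1,000 objects, so it does not avoid the cost of listing; it only shows the pagination pattern.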

Related

Creating index for Amazon S3 bucket keys

What is the general strategy for indexing S3 keys so that they can be queried? Since multiple S3 operations cannot be wrapped in a transaction, it is not possible to create an index like this:
putObject(...);
indexObject(...)
Since indexObject is not guaranteed to run in the same transaction as putObject, a network or other connection error could leave the stored object without an index entry.
There is no capability for "searching" Amazon S3 keys. The closest capability is to specify a Prefix, which can be a directory path, or partial name of an object.
A ListObjects call only returns 1,000 objects at a time. This means that large buckets with 100,000+ objects can be slow to list.
If you need a fast, searchable index you can store a list of keys in DynamoDB. Then, use Amazon S3 Events to trigger AWS Lambda functions when objects are added and deleted, to update DynamoDB.
Alternatively, if you have a large number of objects but they do not change frequently, you can use Amazon S3 Inventory to obtain a daily or weekly CSV file with a list of all objects.
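The DynamoDB approach could be sketched roughly as follows. This is only an illustration: the function name `update_index` and the table layout (partition key `bucket`, sort key `key`) are assumptions, and `table` is assumed to be a boto3 DynamoDB `Table` resource passed in by the Lambda handler.

```python
from urllib.parse import unquote_plus


def update_index(event, table):
    """Apply the records of an S3 event notification to an index table.

    Assumes a hypothetical table keyed on `bucket` (partition key)
    and `key` (sort key).
    """
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        event_name = record["eventName"]
        if event_name.startswith("ObjectCreated:"):
            size = record["s3"]["object"].get("size", 0)
            table.put_item(Item={"bucket": bucket, "key": key, "size": size})
        elif event_name.startswith("ObjectRemoved:"):
            table.delete_item(Key={"bucket": bucket, "key": key})
```

A Lambda handler would call something like `update_index(event, table)` with the event it receives from S3.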

Copy data from S3 and post process

There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is inserted into S3 daily.
I am interested in copying that data from S3 to my service account to further classify it. The classification needs to happen in my AWS service account because it is specific to my team/service and is based on information present in that account. The service generating the data in S3 is neither concerned with the classification nor has the data to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look into a DynamoDB table. Based on whether the data exists in the DynamoDB table, I need to add an attribute to the JSON object and store the list in another S3 bucket in my account.
The way I am considering doing this:
Use a scheduled CloudWatch (CW) event to periodically invoke a Lambda function that copies the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda function that reads the records in the JSON, compares them with the DynamoDB table to determine the classification, and writes the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially the second Lambda, which needs to compare against DDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
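A minimal sketch of such a copy step is shown here. The function name `copy_new_objects` and the `destination_bucket` parameter are made up for illustration, and `s3_client` is assumed to be a boto3 S3 client supplied by the Lambda handler.

```python
from urllib.parse import unquote_plus


def copy_new_objects(event, s3_client, destination_bucket):
    """Copy every object named in an S3 event into `destination_bucket`.

    Returns the list of keys that were copied.
    """
    copied = []
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_client.copy_object(
            Bucket=destination_bucket,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
        copied.append(key)
    return copied
```

The copy happens inside S3, so the function does not download the data itself.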
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes.
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is taking longer than 15 minutes.
By default, 512 MB of temporary storage space (/tmp) is made available to a Lambda function; this can be configured up to 10 GB.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
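For the classification step itself, a rough sketch might look like this. It assumes the file holds one JSON record per line and that each record carries an `id` attribute that is also the DynamoDB partition key; both are assumptions, as are the names `classify_records` and `classified`. `table` is assumed to be a boto3 DynamoDB `Table` resource.

```python
import json


def classify_records(lines, table, key_attribute="id"):
    """Mark each JSON record according to whether it exists in DynamoDB.

    `lines` is an iterable of JSON strings (one record per line, assumed).
    Returns the records with an added boolean `classified` attribute.
    """
    classified = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        response = table.get_item(Key={key_attribute: record[key_attribute]})
        # get_item only includes "Item" in the response when the key exists.
        record["classified"] = "Item" in response
        classified.append(record)
    return classified
```

One lookup per record can be slow for large files, which is where the 15-minute limit mentioned above becomes relevant.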

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 buckets for various projects within AWS. I want to identify and potentially delete S3 objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 object with a date (e.g. 2018-10-24). First turn on Object Level Logging for your S3 bucket. Set up CloudWatch Events for CloudTrail. The tag could then be updated by a Lambda function that runs on a CloudWatch Event, which is fired on a Get event. Then create a function that runs on a scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs: write a custom function to query the last access times from object-level CloudTrail logs. This could be done with Athena, or a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
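As an illustration of the lifecycle option, here is a small sketch that builds an expiration rule for a key prefix. The helper name `expiration_rule` and the rule ID are made up; the returned dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` expects for its `LifecycleConfiguration` argument.

```python
def expiration_rule(prefix, days, rule_id="expire-old-objects"):
    """Build a lifecycle configuration that expires objects under `prefix`
    once they are `days` days old (measured from upload, not last access)."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }
```

It could then be applied with something like `s3_client.put_bucket_lifecycle_configuration(Bucket="my-bucket", LifecycleConfiguration=expiration_rule("tmp/", 30))`, where the bucket name is a placeholder. Note that this call replaces any lifecycle configuration already on the bucket.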
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post that describes an interesting, cost-optimized approach to this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
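The blog post builds the delta list with an Athena query. Purely to illustrate the filtering logic (not the blog's actual query), here is the same idea in plain Python. The function name `stale_keys` and its inputs are hypothetical: `inventory` stands for (key, timestamp) pairs taken from the inventory report, with the last-modified time used as a stand-in for creation time, and `accessed_keys` stands for the keys seen in the server access logs during the window.

```python
from datetime import datetime, timedelta, timezone


def stale_keys(inventory, accessed_keys, days, now=None):
    """Return keys older than `days` days that do not appear in the access logs.

    inventory: iterable of (key, timezone-aware datetime) pairs.
    accessed_keys: iterable of keys that were requested during the window.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    accessed = set(accessed_keys)
    return sorted(
        key for key, modified in inventory
        if modified < cutoff and key not in accessed
    )
```

The resulting list corresponds to what would be written to the manifest file for the S3 Batch Operations tagging job.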

Boto3 / Python - Listing recently modified objects?

I'm looking to list objects but I am only really interested in the last 1,000 that have been modified that same day.
I have seen that Boto3 supports pagination and getting specific objects by key name + a modified date. However, I can't see any mechanism that allows listing objects by their modified date.
Boto3 S3 Object.get() - This supports returning a key if it was modified on a certain day.
Boto3 Paginators - This allows listing a certain number of objects, but doesn't allow you to determine the listing method.
I can achieve this by first listing all objects, then iterating over that list of objects, but this incurs the full costs, which is what I'm trying to avoid. I'm trying to do this to prevent having to list an entire bucket (which has more overhead costs).
No, there is no base functionality that offers this capability.
An alternative is to activate Amazon S3 Inventory:
Amazon S3 inventory is one of the tools Amazon S3 provides to help manage your storage. You can simplify and speed up business workflows and big data jobs using the Amazon S3 inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation. Amazon S3 inventory provides a comma-separated values (CSV) flat-file output of your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
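Once an inventory report is available, picking out the most recently modified objects for a given day could look roughly like this. This is only a sketch: the function name `recently_modified` is made up, and the column order and timestamp format are assumptions (the real column list comes from the report's manifest file, and the CSV has no header row).

```python
import csv
import io
from datetime import datetime, timezone


def recently_modified(csv_text, day, limit=1000,
                      columns=("Bucket", "Key", "Size", "LastModifiedDate")):
    """Return up to `limit` keys modified on `day` (a date), newest first.

    `columns` is the assumed column order of the inventory CSV.
    """
    rows = csv.DictReader(io.StringIO(csv_text), fieldnames=columns)
    matches = []
    for row in rows:
        # Assumed timestamp format, e.g. 2024-01-02T10:00:00.000Z (UTC).
        modified = datetime.strptime(
            row["LastModifiedDate"], "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        if modified.date() == day:
            matches.append((modified, row["Key"]))
    matches.sort(reverse=True)  # newest first
    return [key for _, key in matches[:limit]]
```

Because the report is only produced daily or weekly, objects modified since the last report will not appear in it.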

Storing multiple results in Amazon S3

I am running multiple plugins as cron jobs every day to gather data from different data sources. I am planning to store this data in 2 places: in Amazon DynamoDB and in Amazon S3. The metadata of the results will be stored in DynamoDB. This will also hold the name of the S3 bucket where the data is stored. My question is how do I group these daily results in S3? I am thinking of a couple of ways:
(1) Let's say that each day I run plugin1, I store the results in a different bucket, where the bucket name will be -. The merit of this approach is that it is easy to retrieve the data for each day, but the demerit is that we now have 365 buckets for just one plugin. So if I have n plugins, I will have 365 times n buckets over a year. We could delete buckets after some time interval (say 3 months) to reduce the number of buckets.
(2) I could also use one bucket per plugin, and use a GUID as a prefix for my keys, like guid/result_n, where result_n is the nth result that I get for that plugin. I would also add a key, let's call it plugin_runs, that would hold a list of dictionaries, where each dictionary would have this format: {date: execution_id}. Then I could, for a given date, find the prefix for the execution_id for that date and retrieve the contents of those keys.
Which approach would be better? Any other suggestions?
Given that AWS limits the number of buckets you can create per account (100 by default at the time of writing), I would say #2 is a much better approach.
But you really only need a single bucket, with a key prefix on each object to organize them. Here, for example, is how Amazon Kinesis Firehose creates objects for you, and the naming convention it uses. If it works for them, it should work for you:
Amazon S3 Object Name Format
Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH before
putting objects to Amazon S3. The prefix translates into the Amazon S3
folder structure, where each label separated by a forward slash (/)
becomes a sub-folder. You can modify this folder structure by adding
your own top-level folder with a forward slash (for example,
myApp/YYYY/MM/DD/HH) or prepending text to the YYYY top-level folder
name (for example, myApp YYYY/MM/DD/HH). This is accomplished by
specifying an S3 Prefix when creating the delivery stream, either by
using the Firehose console or the Firehose API.
http://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html
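Applied to the plugin results, a key in the same style could be built along these lines. The function name `result_key` and the layout `plugin/YYYY/MM/DD/HH/name` are only one possible convention, chosen here for illustration.

```python
from datetime import datetime, timezone


def result_key(plugin, run_time, name):
    """Build an S3 key of the form plugin/YYYY/MM/DD/HH/name in UTC.

    `run_time` must be a timezone-aware datetime.
    """
    prefix = run_time.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")
    return f"{plugin}/{prefix}/{name}"
```

With this layout, all results for one plugin on one day share a prefix such as plugin1/2024/03/05/, so they can be listed with a single Prefix and no per-day bucket is needed.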