I'm looking for a way to be notified when an object in S3 changes storage class. I thought there would be a bucket event notification for this, but I don't see it as an option. How can I know when an object moves from STANDARD to GLACIER? We have systems that depend on objects not being in GLACIER. If they change to GLACIER, we need to be made aware and handle them accordingly.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#supported-notification-event-types
You can use S3 server access logs to capture lifecycle changes, but I think that's about it:
Amazon S3 server access logs can be enabled in an S3 bucket to capture
S3 Lifecycle-related actions such as object transition to another
storage class
Taken from AWS docs - life-cycle and other bucket config
You could certainly roll your own notifications for storage class transitions, though it might be a bit more involved than you are hoping for. You need a separate bucket to write your access logs to. Set up an S3 notification for object creation in the new logs bucket to trigger a Lambda function that processes each new log file. In the Lambda function, use Athena to query the logs and fire off an SNS alert or perform some corrective action in code.
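As a rough illustration of the log-processing step, here is a minimal sketch that scans the text of one access log file for lifecycle transitions. It parses the lines directly in Python rather than going through Athena, which is a swapped-in simplification. It assumes the leading fields of each record are owner, bucket, [time], remote IP, requester, request ID, operation and key, and that transitions are logged with operation names of the form S3.TRANSITION.OBJECT or S3.TRANSITION_<CLASS>.OBJECT; check both against the current access log format before relying on it. The function name find_transitions is made up for this example.

```python
import re

# Leading fields of an S3 server access log record (assumed layout):
# owner, bucket, [time], remote IP, requester, request ID, operation, key.
LOG_PREFIX = re.compile(
    r"^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) "
    r"(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+)"
)

# Assumption: lifecycle transitions are logged with operations named
# S3.TRANSITION.OBJECT or S3.TRANSITION_<CLASS>.OBJECT.
TRANSITION_OP = re.compile(r"^S3\.TRANSITION(_[A-Z]+)?\.OBJECT$")


def find_transitions(log_text):
    """Return (bucket, key, operation) for each transition record in log_text."""
    found = []
    for line in log_text.splitlines():
        match = LOG_PREFIX.match(line)
        if match and TRANSITION_OP.match(match["operation"]):
            found.append((match["bucket"], match["key"], match["operation"]))
    return found
```

The Lambda function would read each new log object, pass its text to find_transitions, and publish anything returned to SNS. Keys appear in the log as written by S3, so they may need URL-decoding before use.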
There are some limitations to be aware of though: server access logs are delivered on a best-effort basis, which means you might not get logs for a few hours.
Updated 28/5/21
If the logs are on, you should see the various lifecycle operations logged as they happen, give or take a few hours. If not, are you definitely meeting the minimum criteria for transitioning objects to Glacier (e.g. has the object reached the age set in the lifecycle rule)?
As for:
The log record for a particular request might be delivered long after
the request was actually processed, or it might not be delivered at
all.
Consider that log delivery is best effort and that even S3's SLA on data durability leaves a possibility of data loss for any object. I think the risk of losing log records is relatively low, but it could happen.
You could also go for a more active approach: use the S3 API from a Lambda function triggered by CloudWatch Events (cron-like scheduling) to scan all the objects in the bucket and act accordingly (send an email, take corrective action, etc). Bear in mind this might get expensive depending on how often you run the Lambda and how many objects are in your bucket, but low volumes might even fall within the free tier.
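A minimal sketch of that scan, assuming the Lambda is handed a boto3 S3 client (created elsewhere with boto3.client("s3")). The list_objects_v2 paginator returns a StorageClass for each object, so no per-object call is needed. The function names and the choice of which classes count as "archived" are assumptions for illustration.

```python
# Storage classes to flag; adjust to whatever your systems cannot read from.
ARCHIVE_CLASSES = frozenset({"GLACIER", "DEEP_ARCHIVE"})


def archived_keys(contents, archive_classes=ARCHIVE_CLASSES):
    """Return the keys in one listing page whose storage class is archived."""
    return [
        obj["Key"]
        for obj in contents
        if obj.get("StorageClass", "STANDARD") in archive_classes
    ]


def scan_bucket(s3_client, bucket):
    """Walk every page of the bucket listing and collect archived keys."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent when a page (or the bucket) is empty.
        keys.extend(archived_keys(page.get("Contents", [])))
    return keys
```

The scheduled handler would call scan_bucket and send the resulting keys to SNS or a corrective routine. Cost scales with the number of listing requests, i.e. roughly one request per 1,000 objects per run.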
As of Nov 2021 you can now do this via Amazon EventBridge.
Enable EventBridge notifications on the S3 bucket, then create a rule that matches the Object Storage Class Changed event.
See https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
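For illustration, a sketch of the event pattern such a rule could use, plus a Lambda handler that pulls out the fields of interest. The pattern keys (source, detail-type, detail.bucket.name) follow the S3 EventBridge event structure; the handler assumes the new class arrives in a detail field named destination-storage-class, which is worth confirming against a real event. Function names are hypothetical.

```python
import json


def storage_class_rule_pattern(bucket_name):
    """Build the EventBridge event pattern (as JSON text) for one bucket."""
    return json.dumps(
        {
            "source": ["aws.s3"],
            "detail-type": ["Object Storage Class Changed"],
            "detail": {"bucket": {"name": [bucket_name]}},
        }
    )


def handler(event, context=None):
    """Extract bucket, key and new storage class from the EventBridge event."""
    detail = event["detail"]
    return {
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        # Assumed field name; None if the event does not carry it.
        "storage_class": detail.get("destination-storage-class"),
    }
```

The pattern string can be passed as the EventPattern when creating the rule; the handler's return value is where an alert or corrective action would hook in.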
Related
There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is inserted into S3 daily.
I am interested in copying that data from S3 to my service account to further classify the data. The classification needs to happen in my AWS service account because it is based on information present in that account and is specific to my team/service. The service generating the data in S3 is neither concerned with the classification nor has the data to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look into a DynamoDB table. Based on whether data exists in the DynamoDB table, I need to add an attribute to the JSON object and store the list in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CW event periodically to invoke a Lambda that will copy the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda to read the records in the JSON, compare them with the DynamoDB table to determine the classification, and write the updated records to another bucket (let's say Bucket B).
I have few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially the second Lambda, which needs to compare against DDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
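A minimal sketch of the first function's core: turning an S3 event into the arguments for CopyObject. Object keys arrive URL-encoded in S3 events, hence the unquote_plus. The function name copy_requests and the idea of returning the argument dicts (rather than calling S3 directly) are choices made here to keep the example self-contained.

```python
from urllib.parse import unquote_plus


def copy_requests(event, destination_bucket):
    """Build one CopyObject argument dict per record in an S3 event."""
    requests = []
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append(
            {
                "Bucket": destination_bucket,
                "Key": key,
                "CopySource": {"Bucket": source_bucket, "Key": key},
            }
        )
    return requests
```

In the handler, each dict can be passed to a boto3 S3 client as s3.copy_object(**request). The function's role needs read access to the source bucket (cross-account, in this case) and write access to Bucket-A.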
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is taking longer than 15 minutes.
By default, 512 MB of temporary storage (/tmp) is available to a Lambda function; this can be configured up to 10 GB.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
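For the per-record classification described in the question, a sketch of the second function's logic might look like the following. It assumes each file holds one JSON object per line, and it takes the DynamoDB lookup as a plain callable so the logic can be tested without AWS. The field names id and classified are placeholders, not anything from the original data.

```python
import json


def classify_records(lines, exists_in_table, key_field="id", flag_field="classified"):
    """Add a boolean attribute to each JSON record based on a table lookup.

    lines: iterable of JSON strings, one record per line (assumed format).
    exists_in_table: callable taking the record's key and returning truthy
        if a matching item exists (e.g. a wrapper around a DynamoDB get).
    """
    classified = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record[flag_field] = bool(exists_in_table(record[key_field]))
        classified.append(json.dumps(record))
    return classified
```

If the per-record lookups are what push the function toward the 15-minute limit, batching the keys (DynamoDB supports batched reads) or splitting the input into smaller files are the usual first things to try.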
I need to start a Lambda Function when an object has been created on an S3 Bucket. I found 2 solutions to do this.
Using AWS::S3::Bucket NotificationConfiguration.
Using a CloudWatch AWS::Events::Rule.
They both seem to do exactly the same thing, which is to track specific changes and launch a Lambda function when they happen. I could not find any information on which one should be used. I'm using a CloudFormation template to provision the Lambda, the S3 bucket and the trigger.
Which one should I use to call a Lambda on Object level changes and why?
Use the first one, for these reasons:
A push model is much better than a pull model. Push means data is sent to you as soon as it is available, instead of you polling for it at some interval. Push notifications are everywhere now: you don't go to Facebook every 5 minutes to check whether someone has liked your picture or replied to your comment.
In terms of cost and effort, S3 event notifications also win.
A CloudWatch rule would be the best option if S3 notifications didn't exist, but since they do, use them. If the service itself has the feature, there is little reason to go for an alternative solution like CloudWatch rules.
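To make the first option concrete, here is a sketch of the notification configuration in the shape boto3 expects for put_bucket_notification_configuration; the CloudFormation NotificationConfiguration property carries the same information (event, function ARN, optional prefix/suffix filter). The helper name and its parameters are assumptions for this example.

```python
def lambda_notification_config(function_arn, prefix="", suffix=""):
    """Build an S3 notification configuration that invokes a Lambda
    function for every object-created event, optionally filtered by key."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})

    function_config = {
        "LambdaFunctionArn": function_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if rules:
        function_config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [function_config]}
```

With a boto3 client this would be applied as s3.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=lambda_notification_config(...)). Either way, S3 also needs permission to invoke the function, which is a separate resource-based permission on the Lambda.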
I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 object with its last-access date (e.g. 2018-10-24). First turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on the CloudWatch Event fired by a Get request. Then create a function that runs on a scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs: write a custom function to query the last access times from the object-level CloudTrail logs. This could be done with Athena, or with a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post that describes an interesting, cost-optimized approach to this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
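The "delta list" and manifest steps can be sketched in plain Python. The blog post does this with an Athena query over the inventory report and access logs; the version below swaps that for in-memory dictionaries purely to show the selection logic. The function names and input shapes are assumptions, and the manifest uses the simple bucket,key CSV layout with URL-encoded keys.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def stale_objects(inventory, last_access, days, now):
    """Return keys created more than `days` ago and not accessed since.

    inventory: dict of key -> creation datetime (from the inventory report).
    last_access: dict of key -> most recent access datetime (from the logs);
        keys never accessed are simply absent.
    """
    cutoff = now - timedelta(days=days)
    stale = []
    for key, created in inventory.items():
        accessed = last_access.get(key)
        if created < cutoff and (accessed is None or accessed < cutoff):
            stale.append(key)
    return sorted(stale)


def manifest_csv(bucket, keys):
    """Render a CSV manifest with one 'bucket,key' line per object."""
    return "".join(f"{bucket},{quote(key)}\n" for key in keys)
```

The resulting manifest text is what gets written to S3 and handed to the batch job that applies the delete=True tag.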
I have AWS data pipelines set up that feed my S3 bucket. Each time, a new feed file is generated by the pipeline and stored in the bucket. We keep at most 30 days of data in the bucket. Is it possible to configure an alarm so that I am notified via email, etc. when the generated object size crosses a threshold (say 1 GB)? How would I go about it?
If you want granular data, some dev work is required. Below are some options and further reading.
S3 notifications, i.e. events sent by S3 in response to create/delete etc., which can be used to fire a Lambda that performs whatever logic you need. You can base the logic on key, file size, created date etc. You could then store that value as a CloudWatch metric and set up an alarm on your custom metric.
See https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
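A small sketch of the Lambda logic for the notification route: read the object size straight from the S3 event (no extra API call needed) and produce both a list of oversized objects and a metric datum in the shape CloudWatch's PutMetricData expects. The 1 GiB threshold, the metric name ObjectSizeBytes and the function names are placeholders for illustration.

```python
THRESHOLD_BYTES = 1024 ** 3  # 1 GiB, matching the example in the question


def oversized_objects(event, threshold=THRESHOLD_BYTES):
    """Return (bucket, key, size) for each created object above threshold."""
    hits = []
    for record in event.get("Records", []):
        obj = record["s3"]["object"]
        size = obj.get("size", 0)
        if size > threshold:
            hits.append((record["s3"]["bucket"]["name"], obj["key"], size))
    return hits


def size_metric(bucket, size):
    """Build one custom-metric datum recording an object's size in bytes."""
    return {
        "MetricName": "ObjectSizeBytes",  # hypothetical custom metric name
        "Dimensions": [{"Name": "BucketName", "Value": bucket}],
        "Value": float(size),
        "Unit": "Bytes",
    }
```

The handler can either publish to SNS directly for anything oversized_objects returns, or push size_metric(...) to CloudWatch under a custom namespace and let an alarm on that metric send the email.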
Or
S3 Inventory (which is basically a CSV-formatted directory listing uploaded to a different bucket on a schedule).
If you go for the inventory option, you set a schedule and can then create a notification on the inventory's destination bucket to fire a Lambda as each CSV becomes available. Also take a look at Amazon Athena, which can query the inventory files directly via its API, so there is no need to download or parse the CSV.
See https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
If you're interested in a quick and easy, non-programming route, there's a total bucket size CloudWatch metric called BucketSizeBytes, to which you could easily add an alarm that triggers an SNS email if the total size goes above 30 GB. Depending on your goals this might be useful and should take minutes to set up, but because the metric is only reported once a day it is pretty useless for timely monitoring.
See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/s3-metricscollected.html
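For reference, a sketch of the alarm definition as keyword arguments for CloudWatch's PutMetricAlarm. BucketSizeBytes lives in the AWS/S3 namespace and needs both a BucketName and a StorageType dimension; StandardStorage is assumed here, as are the alarm name and the helper's name.

```python
def bucket_size_alarm(bucket, threshold_bytes, sns_topic_arn):
    """Build PutMetricAlarm arguments for a total-bucket-size alarm."""
    return {
        "AlarmName": f"{bucket}-size",  # hypothetical naming scheme
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            # Assumes the data sits in the STANDARD storage class.
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        "Statistic": "Average",
        "Period": 86400,  # one day, matching the metric's reporting interval
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 this would be applied as cloudwatch.put_metric_alarm(**bucket_size_alarm(...)), with an email subscription on the SNS topic.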
I need help figuring out how to pass an array of S3 file names to a second AWS Lambda function I am working on. The first function queries the DynamoDB index table for a list of S3 files in Glacier and issues a retrieval request for each of them. The second Lambda should receive the same file names, one at a time, but be invoked 4 hours later, once each file has been retrieved from Glacier. Is there an elegant way to do this in Lambda or other AWS services using JavaScript? Any help is appreciated, thanks!
Glacier retrieval jobs are not guaranteed to be complete within 4 hours (archives typically become accessible within 3–5 hours, but that's not a guarantee). Also, scheduling Lambda function invocations for some time in the future is not the best way to solve this problem.
You should make use of Glacier notifications. When a Glacier retrieval job completes, it can post a message to an SNS topic. SNS and Lambda are integrated so you can invoke Lambda functions from SNS notifications.
The Glacier SDK supports archive retrieval (and inventory retrieval) with SNS notifications at completion time via initiate_job().
EDIT: this does not work if the S3 objects were archived to Glacier via lifecycle management, because retrieval notifications require you to supply a Glacier vault name, but lifecycle management does not expose this vault name to you (it's internal to the AWS service). [Thanks @Mark B]
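For archives stored directly in a Glacier vault (not the lifecycle case covered by the edit above), the two halves could be sketched as follows, in Python rather than the JavaScript the question asks about. The first helper builds the jobParameters for initiate_job() with an SNS topic attached; the second is the core of the SNS-triggered Lambda, and assumes the notification message is JSON carrying JobId, ArchiveId, Completed and StatusCode fields. Helper names are made up for this example.

```python
import json


def archive_retrieval_job(archive_id, sns_topic_arn, tier="Standard"):
    """Build jobParameters for a Glacier archive retrieval with SNS notify."""
    return {
        "Type": "archive-retrieval",
        "ArchiveId": archive_id,
        "SNSTopic": sns_topic_arn,
        "Tier": tier,
    }


def completed_jobs(sns_event):
    """Return (job_id, archive_id) for each successfully completed job
    announced in an SNS-triggered Lambda event."""
    jobs = []
    for record in sns_event.get("Records", []):
        # The SNS message body is a JSON string describing the job.
        message = json.loads(record["Sns"]["Message"])
        if message.get("Completed") and message.get("StatusCode") == "Succeeded":
            jobs.append((message["JobId"], message.get("ArchiveId")))
    return jobs
```

The first Lambda would call glacier.initiate_job(vaultName=..., jobParameters=archive_retrieval_job(...)) for each archive; the second runs whenever a job finishes, so no 4-hour timer is needed.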