Is there any notification event I can trace for the completion of an AWS S3 lifecycle rule execution?

I want to delete a large number of S3 objects (possibly a few hundred thousand to a million, which I do not control) in a bulk async process. I looked into multiple blogs and collated the strategies below:
Leverage the AWS S3 REST API from an async thread of a custom application
Here the drawbacks are:
I will have to make a huge number of S3 API calls, since one request is limited to 1,000 S3 objects, and I may not know the exact object keys up front.
Even if I identify the S3 objects to delete, I will have to first list (GET) them and then DELETE them, which makes the solution costly.
I will also have to keep track of the deleted chunks, and in case of a failure in the middle of the operation, build a mechanism to re-trigger the chunks that failed to delete.
Leverage an S3 lifecycle policy
Here the drawbacks are:
We store data for multiple customers in the same bucket, segregated by customer-id in the key prefix. With a growing number of customers, we foresee hitting the hard limit of 1,000 lifecycle rules per bucket.
To work around this, we could delete each rule after it has run and free up quota for the next requests, but for that we were looking for an event-based notification that tells us the bulk delete operation is complete.
Again, with a growing number of customers we may lose predictability of the bulk delete operation: jobs accumulate once the quota limit is reached, and a submitted bulk delete job may have to wait days to complete.
Create only one rule with a special bulk-delete tag and use it in a single S3 lifecycle policy
With this approach we do not expect to hit the rule limit from the previous approach. As we understand it, S3 lifecycle rules get executed once a day (though we don't know exactly when), so we are assured the rule will be triggered within at most 24 hours; it will then take some additional time (maybe a few minutes or hours, we don't know) to actually complete the bulk delete operation. Here too we have the open question: is there a notification event after the completion of one execution of an S3 lifecycle rule, which we can listen for to update the status of all submitted bulk delete jobs to DONE? Without such a notification event, it is difficult to transparently communicate completion back to the end user who triggered the bulk delete async operation.
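For illustration, a minimal sketch of the single tag-based rule we have in mind, using the AWS SDK for Java v2; the bucket name, rule id, and bulk-delete tag key/value are placeholders of ours, not anything prescribed. Note that the expiration days are still counted from the object's creation/last-modified date, not from the moment it was tagged.

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketLifecycleConfiguration;
import software.amazon.awssdk.services.s3.model.ExpirationStatus;
import software.amazon.awssdk.services.s3.model.LifecycleExpiration;
import software.amazon.awssdk.services.s3.model.LifecycleRule;
import software.amazon.awssdk.services.s3.model.LifecycleRuleFilter;
import software.amazon.awssdk.services.s3.model.PutBucketLifecycleConfigurationRequest;
import software.amazon.awssdk.services.s3.model.Tag;

public class TagBasedExpiryRule {
    public static void main(String[] args) {
        String bucket = "my-data-bucket"; // placeholder

        try (S3Client s3 = S3Client.create()) {
            // One rule for the whole bucket: expire any object that carries the
            // bulk-delete tag, one day after the object's creation/last-modified date.
            LifecycleRule rule = LifecycleRule.builder()
                    .id("bulk-delete-by-tag")
                    .status(ExpirationStatus.ENABLED)
                    .filter(LifecycleRuleFilter.builder()
                            .tag(Tag.builder().key("bulk-delete").value("true").build())
                            .build())
                    .expiration(LifecycleExpiration.builder().days(1).build())
                    .build();

            // Note: this call replaces the bucket's entire lifecycle configuration.
            s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
                    .bucket(bucket)
                    .lifecycleConfiguration(BucketLifecycleConfiguration.builder()
                            .rules(rule)
                            .build())
                    .build());
        }
    }
}
```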
Any comments/advice on the above strategies will be helpful, especially an answer for the last strategy, which I think is the most preferable choice I have as of now.
I tried all of the strategies above and got stuck on the problem mentioned for each. Any input/advice will be of great help.

After evaluating all of the above, we decided to delete the relevant data for a specific time range in code, as an async Java process using the S3 bulk delete API from the SDK (DeleteObjectsRequest).
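For anyone taking the same route, here is a rough sketch of that process with the AWS SDK for Java v2. The bucket and prefix values are placeholders, and the list-then-batch structure is just our approach: list keys under a prefix and delete them in batches of up to 1,000 via DeleteObjects, collecting per-key failures so failed chunks can be re-triggered.

```java
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.Delete;
import software.amazon.awssdk.services.s3.model.DeleteObjectsRequest;
import software.amazon.awssdk.services.s3.model.DeleteObjectsResponse;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ObjectIdentifier;
import software.amazon.awssdk.services.s3.model.S3Error;

public class BulkDelete {
    private static final int BATCH_SIZE = 1_000; // DeleteObjects accepts at most 1,000 keys per call

    public static List<S3Error> deletePrefix(S3Client s3, String bucket, String prefix) {
        List<S3Error> failed = new ArrayList<>();
        List<ObjectIdentifier> batch = new ArrayList<>(BATCH_SIZE);

        // The paginator transparently follows continuation tokens for large listings.
        s3.listObjectsV2Paginator(ListObjectsV2Request.builder()
                        .bucket(bucket)
                        .prefix(prefix)
                        .build())
                .contents()
                .forEach(obj -> {
                    batch.add(ObjectIdentifier.builder().key(obj.key()).build());
                    if (batch.size() == BATCH_SIZE) {
                        failed.addAll(deleteBatch(s3, bucket, batch));
                        batch.clear();
                    }
                });
        if (!batch.isEmpty()) {
            failed.addAll(deleteBatch(s3, bucket, batch));
        }
        return failed; // caller can re-trigger these keys
    }

    private static List<S3Error> deleteBatch(S3Client s3, String bucket, List<ObjectIdentifier> keys) {
        DeleteObjectsResponse response = s3.deleteObjects(DeleteObjectsRequest.builder()
                .bucket(bucket)
                .delete(Delete.builder().objects(keys).quiet(true).build())
                .build());
        // In quiet mode only the keys that failed to delete are returned.
        return response.errors();
    }
}
```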

Related

Best strategy to archive specific records from RDS to cheaper storage in AWS

I have the following requirements:
Every record deleted from RDS needs to be archived somewhere cheaper on AWS.
Reduce storage cost
Not using Glacier
Context oriented (e.g. a file per table)
Re-import is not a requirement
I'm not an experienced AWS user, so I'm still a bit lost among the number of options it offers, and I'd like to know if you have more ideas to help me sort it out.
Initial thoughts:
The microservice that deletes the record might send it to a broker (e.g. RabbitMQ), and another microservice (let's call it the archiver) would listen to it, write the records into a file, zip it, and send it to S3. This approach has some technical challenges, though: for creating big files to make sense, I need to wait for the queue to grow a bit, wrap it into a stream, and zip it into S3. Transaction control is also very weak, since file writing and acking the messages are decoupled, i.e. I'll remove the messages from the broker just after the file is created.
Add a new "deleted (bool)" column to the "archivable" tables and run a separate job that fetches only those records and saves them into S3. Discarded: they don't want the new microservice to have access to other services' databases.
Follow the same approach as in the first item, but instead of saving into S3, save into a cheaper database. SimpleDB?
Option 1, but instead of RabbitMQ, write it to a Kinesis Data Firehose delivery stream and direct that to an S3 location; it doesn't get much cheaper or easier than that.
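If it helps, the Firehose variant is a single PutRecord call from the service that deletes the record; a minimal sketch with the AWS SDK for Java v2, where the delivery stream name is a placeholder and the stream itself is assumed to be configured to buffer and deliver into an S3 prefix (Firehose handles the batching into larger objects for you).

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class Archiver {
    private final FirehoseClient firehose = FirehoseClient.create();

    // Called by the microservice right after it deletes the RDS record.
    public void archive(String deletedRecordAsJson) {
        // "rds-archive-stream" is a placeholder delivery stream name; the stream
        // is assumed to be configured with an S3 destination and buffering rules.
        firehose.putRecord(PutRecordRequest.builder()
                .deliveryStreamName("rds-archive-stream")
                .record(Record.builder()
                        .data(SdkBytes.fromUtf8String(deletedRecordAsJson + "\n"))
                        .build())
                .build());
    }
}
```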

Realtime-ness of S3 event notification

I am interested in the traffic lifecycle of objects (i.e. when objects were created and deleted).
One approach is to perform a periodic scan of the bucket, explicitly track lastModifiedTime, and diff against the previous scan result to identify deleted objects.
Another alternative I was considering was to enable S3 event notifications. However, the notification payload does not contain the object's lastModifiedTime. Can the eventTime be used as a proxy instead? Is there a guarantee of how quickly the event is sent? In my case it is acceptable if delivery of the event is delayed, as long as eventTime is not significantly later than the object's modification time.
Also, are there any other alternatives for capturing the lifecycle of S3 objects?
Yeah, the eventTime is a pretty good approximation of the lastModifiedTime of an object. One caveat here is that lastModifiedTime is defined as:
Object creation date or the last modified date, whichever is the latest.
So in order to use eventTime as an approximation, you probably need a trigger that covers all the events where an object is either created or modified. Regarding your question of how quickly the event is sent, here is a quote from the S3 documentation:
Amazon S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
If you want the accurate lastModifiedTime, you need to do a headObject operation for each object.
Your first periodic-pull approach could work, but be careful not to do it naively if you have millions of objects; that is, don't call listObjects in a while loop. That doesn't scale at all, and the listObjects API is fairly expensive. If you only need to do this traffic analysis once a day or once a week, I recommend using S3 Inventory; lastModifiedTime is included in the inventory report. [ref]
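For the per-object HeadObject lookup mentioned above, here is a minimal sketch with the AWS SDK for Java v2; the bucket and key are whatever your scan or event notification hands you.

```java
import java.time.Instant;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

public class LastModifiedLookup {
    // Fetches the authoritative Last-Modified timestamp for a single object,
    // for when eventTime from the notification is not precise enough.
    public static Instant lastModified(S3Client s3, String bucket, String key) {
        return s3.headObject(HeadObjectRequest.builder()
                        .bucket(bucket)
                        .key(key)
                        .build())
                .lastModified();
    }
}
```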
There is no guarantee for how long it takes to deliver the events. From the docs:
Amazon S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
Also, events occurring at the same time may be represented by a single event in the end:
If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent. If you want to ensure that an event notification is sent for every successful write, you can enable versioning on your bucket. With versioning, every successful write will create a new version of your object and will also send an event notification.
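If you do enable versioning for this reason, it is a one-time configuration call; a minimal sketch with the AWS SDK for Java v2, using a placeholder bucket name.

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketVersioningStatus;
import software.amazon.awssdk.services.s3.model.PutBucketVersioningRequest;
import software.amazon.awssdk.services.s3.model.VersioningConfiguration;

public class EnableVersioning {
    public static void main(String[] args) {
        // "my-data-bucket" is a placeholder; with versioning on, each successful
        // write produces its own object version and its own event notification.
        try (S3Client s3 = S3Client.create()) {
            s3.putBucketVersioning(PutBucketVersioningRequest.builder()
                    .bucket("my-data-bucket")
                    .versioningConfiguration(VersioningConfiguration.builder()
                            .status(BucketVersioningStatus.ENABLED)
                            .build())
                    .build());
        }
    }
}
```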

AWS SNS trigger when 3 objects are loaded into S3

I have a client that performs a weekly data upload of 3 CSV files to an S3 bucket, always within (at most) 5 minutes of each other. I have Python code that ingests the 3 files, aggregates them, and writes an output, and I would like to use it to create a Lambda function to fully automate this job. The issue is that I can't configure an S3 trigger that fires only once per 3 object creations. I could have the Lambda trigger on every upload and exit until all 3 files are there, but I don't want to do that as it's not really cost effective.
So I came across this question, which suggested having an SNS topic that gets notified after a batch of uploads is completed; however, I'm having trouble figuring out how to configure that. Basically, what I'd like to do is create something similar to a CloudWatch alarm that triggers when 3 object PUTs have occurred within 5 minutes of each other. Is this possible? Or how can I configure my SNS event in the way suggested in the linked question?
I think you are misreading the answer in the post you linked. The SNS option is literally someone just sending an SNS message after they have pushed their batch of files to S3. So the sender would need to update their process to include a new SNS message step.
Option 3 in that question/answer is about using S3 triggers to handle this in an automatic fashion, without the sender needing to do any extra steps besides upload files, but it is also much more complicated since it involves using a DynamoDB table with Atomic Counters to ensure the files aren't processed multiple times.
I would also like to point out that the cost you are concerned about, of triggering a Lambda function for each S3 object, is going to be on the order of a few pennies a month, maybe less.
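If the sender's process can be changed, that extra SNS step is a single Publish call made after the last of the three files is uploaded; a minimal sketch with the AWS SDK for Java v2, where the topic ARN and message format are made-up placeholders. A Lambda function subscribed to the topic would then read and aggregate the three objects.

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class BatchCompleteNotifier {
    // The sender calls this once, after all three CSV files have been uploaded.
    // The topic ARN is a placeholder for an SNS topic the processing Lambda subscribes to.
    public static void notifyBatchComplete(SnsClient sns, String batchId) {
        sns.publish(PublishRequest.builder()
                .topicArn("arn:aws:sns:us-east-1:123456789012:weekly-upload-complete")
                .subject("weekly upload complete")
                .message("{\"batchId\":\"" + batchId + "\"}")
                .build());
    }
}
```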

Cheapest way to delete 2 billion objects from S3 IA

I have a bucket in S3 (Infrequent Access) containing 2 billion objects. It is too big to delete through the console or over the API without taking years.
I can create a lifecycle rule to expire and delete the objects but the calculator predicts this will cost me >$20,000. Is that correct? Is there a better way to delete a bucket?
I have a file effectively containing a list of all the objects in that bucket if that helps.
Update 2021:
An answer below from @MAP points out that there is now an "Empty" button. I haven't tested it yet, but it looks like the way to go (I'll accept that answer once tested).
If you have a list of all the objects available, then you can certainly use the Multi-Object Delete action. Apparently this API is free. I would create an AWS Step Functions state machine to loop through the file and delete 1,000 objects at a time; 1,000 appears to be the limit.
It will take around 2M Step Functions state transitions to delete all the objects in the bucket. Per Step Functions pricing this will cost you around $50, plus around $1 for the Lambda invocations, so a total cost of roughly $51.
Update
Using Lambda or Step Functions is probably not the most cost-effective option, because either way you will need to read the file that contains the object keys from some source such as S3. So I think running the script from a local machine, or in a screen session on any EC2 Linux instance, appears to be the best option.
In 2021, anyone who comes across this question may benefit from knowing that the AWS console now provides an "Empty" button.
Select the bucket and click the "Empty" button, and all objects, versioned or not, will be deleted. Depending on the number of objects it can take minutes to days.
Expiration lifecycle rules are free. From the original feature announcement:
As with standard delete requests, Amazon S3 doesn’t charge you for using Object Expiration.
Delete operations are free. You can create a lifecycle policy to automate a bulk delete.
I would start with a small number of objects first and check the billing report to confirm 100% that the delete is not charged, then go for the rest.

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on the best way to design this.
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing Lambda function, which is invoked any time my S3 bucket is written into, and have it check manually whether all the other files have been saved.
The Lambda function would know what we're looking for (by looking in Dynamo) and query S3 to see if the other files are in. If so, it would use SNS to trigger my other Lambda that will do the other work.
Edit: Another approach is to have my client program that puts the files in S3 be responsible for directly invoking the other Lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea, mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 whether they are there, on every invocation, seems hacky and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit: In response to mbaird's suggestions below:
Option 1 (SNS): This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principle. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams): This is essentially another "implementation" of Option 1. The client makes a service call, which in this case results in a table update instead of an SNS notification (Option 1). This update would trigger the Lambda function, as opposed to the notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 (S3 triggers): My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementation we use) query Dynamo and both get back N as the count of completed uploads (instead of N and N+1)? Then, even though the final result should be N+2, each one would just add 1 to N. Nooooooooooo!
So Option 1 wins.
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the result value against the size of the file list. Once the values are the same you know all the files have been uploaded. The down side to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
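For Option 3, the atomic counter is a single UpdateItem call; here is a minimal sketch with the AWS SDK for Java v2, where the table, key, and attribute names are placeholders. Because the ADD action is applied atomically, two concurrent S3-triggered invocations each receive a distinct updated value, which also addresses the race condition described in the question.

```java
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ReturnValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemResponse;

public class UploadCounter {
    // Atomically increments the per-batch counter and returns the new value.
    // Table name, key, and attribute names are placeholders for this sketch.
    public static long incrementAndGet(DynamoDbClient dynamo, String batchId) {
        UpdateItemResponse response = dynamo.updateItem(UpdateItemRequest.builder()
                .tableName("upload-batches")
                .key(Map.of("batchId", AttributeValue.builder().s(batchId).build()))
                .updateExpression("ADD uploadedCount :one")
                .expressionAttributeValues(Map.of(":one", AttributeValue.builder().n("1").build()))
                .returnValues(ReturnValue.UPDATED_NEW)
                .build());
        return Long.parseLong(response.attributes().get("uploadedCount").n());
    }
}
```

The S3-triggered Lambda would treat a returned value equal to the expected file count (stored on the same DynamoDB record) as the signal that the whole batch has arrived.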
Also, I agree with you that SWF is overkill for this task.