Is it possible to automatically delete objects older than 10 minutes in AWS S3? - amazon-web-services

We want to delete objects from S3, 10 minutes after they are created. Is it possible currently?

I have a working serverless solution built with AWS Simple Queue Service (SQS) and AWS Lambda. It works for all objects created in an S3 bucket.
Overview
When any object is created in your S3 bucket, the bucket sends an event with the object details to an SQS queue configured with a 10-minute delivery delay. The SQS queue is also configured to trigger a Lambda function. The Lambda function reads the object details from the event and deletes the object from the S3 bucket. All three components involved (S3, SQS and Lambda) are low cost, loosely coupled, serverless and scale automatically to very large workloads.
Steps Involved
1. Set up your Lambda function first. In my solution, I used Python 3.7. The code for the function is:
import json
import boto3

def lambda_handler(event, context):
    for record in event['Records']:
        v = json.loads(record['body'])
        for rec in v["Records"]:
            bucketName = rec["s3"]["bucket"]["name"]
            objectKey = rec["s3"]["object"]["key"]
            # print("bucket is " + bucketName + " and object is " + objectKey)
            sss = boto3.resource("s3")
            obj = sss.Object(bucketName, objectKey)
            obj.delete()
    return {
        'statusCode': 200,
        'body': json.dumps('Delete Completed.')
    }
This code and a sample message file were uploaded to a github repo.
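For reference, here is a trimmed sketch of what json.loads(record['body']) contains for a single upload; the values are only illustrative and the real notification carries more fields:
sample_body = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "uploads/example.txt", "size": 1024}
            }
        }
    ]
}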
2. Create a vanilla SQS queue. Then configure the queue with a 10-minute delivery delay. This setting can be found under Queue Actions -> Configure Queue (the fourth setting down).
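If you prefer to script this step, a minimal boto3 sketch would be something like the following (the queue name is just an example):
import boto3

sqs = boto3.client("sqs")
# DelaySeconds=600 gives every message the 10-minute delivery delay
response = sqs.create_queue(
    QueueName="delete-after-10-minutes",
    Attributes={"DelaySeconds": "600"}
)
print(response["QueueUrl"])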
3. Configure the SQS queue to trigger the Lambda function you created in Step 1. To do this, use Queue Actions -> Configure Trigger for Lambda Function. The setup screen is self explanatory. If you don't see your Lambda function from Step 1, go back and make sure you created it in the same Region.
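The same trigger can also be created from code; a rough boto3 equivalent (the ARN and function name are placeholders) is:
import boto3

lambda_client = boto3.client("lambda")
# Maps the SQS queue to the Lambda function created in Step 1
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:delete-after-10-minutes",
    FunctionName="delete-s3-object",
    BatchSize=1,
    Enabled=True
)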
4. Set up your S3 bucket so that it fires an event to the SQS queue you created in Step 2. This is found on the main bucket screen: click the Properties tab and select Events. Click the plus sign to add an event and fill out the form.
The important points are to select All object create events and to choose the queue you created in Step 2 in the last drop-down on this screen.
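For completeness, this is roughly what the console configures behind the scenes; a hedged boto3 sketch with placeholder names:
import boto3

s3 = boto3.client("s3")
# Send a notification to the SQS queue for every object-created event
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:delete-after-10-minutes",
                "Events": ["s3:ObjectCreated:*"]
            }
        ]
    }
)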
5. Last step: add a policy to your Lambda function's execution role that allows it to delete objects only from the specific S3 bucket. You can do this via the Lambda console: scroll down the function's screen and configure it under Execution role.
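The inline policy attached to the execution role could look something like this (the bucket, role and policy names are placeholders):
import json
import boto3

iam = boto3.client("iam")
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:DeleteObject",
            "Resource": "arn:aws:s3:::my-bucket/*"
        }
    ]
}
# Attach the policy inline to the Lambda function's execution role
iam.put_role_policy(
    RoleName="delete-s3-object-role",
    PolicyName="allow-delete-from-my-bucket",
    PolicyDocument=json.dumps(policy)
)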
This works for files I've copied into a single S3 bucket. The solution could also support many S3 buckets feeding one queue and one Lambda function.

In addition to the detailed solution proposed by @taterhead involving an SQS queue, one might also consider the following serverless solution using AWS Step Functions:
Create a State Machine in AWS Step Functions with a Wait state of 10 minutes followed by a Task state executing a Lambda function that will delete the object.
Configure CloudTrail and CloudWatch Events to start an execution of your state machine when an object is uploaded to S3.
It has the advantage of (1) not being constrained by SQS's 15-minute maximum delivery delay and (2) avoiding the continuous queue-polling cost generated by the Lambda function.
Inspiration: Schedule emails without polling a database using Step Functions
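A minimal sketch of such a state machine definition, assuming a delete-object Lambda already exists (the names and ARNs are placeholders):
import json
import boto3

sfn = boto3.client("stepfunctions")
definition = {
    "StartAt": "WaitTenMinutes",
    "States": {
        # Pause the execution for 10 minutes, then delete the object
        "WaitTenMinutes": {"Type": "Wait", "Seconds": 600, "Next": "DeleteObject"},
        "DeleteObject": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:delete-s3-object",
            "End": True
        }
    }
}
sfn.create_state_machine(
    name="delete-object-after-10-minutes",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-role"
)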

If anyone is still interested in this, S3 now offers lifecycle rules, which I've just been looking into, and they seem simple enough to configure in the AWS S3 console.
The Management tab of an S3 bucket has an "Add lifecycle rule" button, where you can target specific prefixes and set expiration times for the objects in the bucket.
For a more detailed explanation, AWS has published an article on the matter here.
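As a rough illustration, a lifecycle rule like the one the console creates can also be set with boto3; note that expiration is expressed in whole days, so this example expires objects under a prefix after one day (names are placeholders):
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temporary-objects",
                "Filter": {"Prefix": "tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 1}
            }
        ]
    }
)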

Related

Lambda invocation on two SNS events at the same time

I have a use case where I need to read two files that are in a different account. I will receive an SNS event with the filename, and I need to create an EMR cluster from the Lambda only if both files are available in the other account's S3 bucket.
Currently I am writing dummy files to my S3 bucket every time I receive an SNS event, and then creating the EMR cluster only after ensuring, on the second SNS event, that the first file is available in my account's S3 bucket. This approach is working fine.
But I am unable to solve the issue of what happens if the two files arrive in the other S3 bucket at the same time and we get two SNS events around the same time, as each event thinks the other file hasn't arrived yet.
How would I solve this problem?

SQS Trigger Lambda, with FileName in S3 for text extraction

I have a use case: I have a list of PDF files stored in an S3 bucket. I have listed them and pushed them to SQS for text extraction, and created one Lambda for processing those files, providing it the bucket information and the AWS Textract details.
The issue is that the Lambda times out, as SQS triggers multiple Lambda instances for all the files and all of them wait for the text extraction service.
I want Lambda to be triggered one by one for each SQS message (file name) so that the timeout does not occur, as we do have a limit for accessing AWS Textract.
Processing 100+ files is a time-consuming task; I would suggest taking no more than 10 files per Lambda execution.
Use SQS with Lambda as an event source.
https://dzone.com/articles/amazon-sqs-as-an-event-source-to-aws-lambda-a-deep
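One way to cap how many files each invocation handles is the BatchSize on the SQS event source mapping; a hedged boto3 sketch with placeholder names and ARNs:
import boto3

lambda_client = boto3.client("lambda")
# Each Lambda invocation receives at most 10 SQS messages (i.e. 10 files)
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:pdf-files-queue",
    FunctionName="pdf-text-extraction",
    BatchSize=10
)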

Running AWS Glue jobs in parallel

I have 30 Glue jobs that I want to run in parallel. If one job fails, the others must continue. I started with Step Functions, creating a state machine that executes a runner Lambda function, which in turn triggers a Glue job depending on a parameter (the name of the Glue job). For one job there is a decent amount of Step Functions logic implemented (retry, error handling, etc.).
Is there any way to execute a state machine from another state machine? That way I could have 30 parallel tasks that each execute another state machine. If you have any suggestions, please feel free to share.
AWS recommends using SNS for a fan out architecture to run parallel jobs from a single S3 event, as you get an overlap error if two lambdas try to use the same S3 event.
You basically send the S3 event to SNS and subscribe your 30 lambdas so they all trigger from the SNS notification (containing details of the S3 event) when it's published.
Create the Topic
Update the Topic Policy to allow Event Notifications from an S3 Bucket
Configure the S3 Bucket to send Event Notifications to the SNS Topic
Create the parallel Lambda functions, one for each job
Modify the Lambda functions to process the SNS messages containing the S3 event notification instead of the S3 event itself (a sketch of this follows the links below)
https://aws.amazon.com/blogs/compute/fanout-s3-event-notifications-to-multiple-endpoints/
There is another nice example with a CloudFormation template: https://aws.amazon.com/blogs/compute/messaging-fanout-pattern-for-serverless-architectures-using-amazon-sns/
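Each subscribed Lambda then unwraps the original S3 event from the SNS record; a minimal sketch (the per-job processing is left as a placeholder):
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        # The SNS Message field carries the original S3 event notification as JSON
        s3_event = json.loads(record["Sns"]["Message"])
        for rec in s3_event.get("Records", []):
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]
            # run this particular job's processing for bucket/key here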

process files put into s3 bucket in AWS lambda in the order in which they were put

My current workflow is as follows:
User drops file into s3 bucket -> s3 bucket triggers event to lambda -> lambda processes the file in s3 bucket. It also invokes other lambdas.
I want to handle the scenario where multiple users drop files into the S3 bucket simultaneously. I want to process the files such that the file put first gets processed first. To handle this, I want the Lambda to process each file at an interval of 15 minutes (for example).
So I want to use SQS to queue the file-drop events. S3 can send an event to SQS. A CloudWatch event can trigger a Lambda every 15 minutes, and the Lambda can poll the SQS queue for the earliest S3 file-drop event and process it.
The problem with SQS is that standard SQS queues do not preserve order, and FIFO SQS queues are not compatible with S3 (Ref: Error setting up notifications from S3 bucket to FIFO SQS queue due to required ".fifo" suffix).
What approach should I use to solve this problem?
Thanks,
Swagatika
You could have Amazon S3 trigger an AWS Lambda function, which then pushes the file info into a FIFO Amazon SQS queue.
There is a new capability where SQS can trigger Lambda, but you'd have to experiment to see how/whether that works with FIFO queues. If it works well, that could eliminate the '15 minutes' thing.
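A minimal sketch of that first Lambda, assuming a FIFO queue already exists (the queue URL and group id are placeholders):
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-drops.fifo"

def lambda_handler(event, context):
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        # A single message group preserves strict first-in-first-out ordering
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
            MessageGroupId="s3-file-drops",
            MessageDeduplicationId=rec["s3"]["object"].get("sequencer", key)
        )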

How should I architect AWS Lambda to support parallel processing in batch mode?

I have an AWS Lambda function that does some statistics on over 1k stock tickers after market close. One option I have is below.
Set up a cron job on an EC2 instance that submits 1k HTTP requests asynchronously (e.g. http://xxxxx.lambdafunction.xxxx?ticker=) to trigger the AWS Lambda function (or submit 1k requests to SNS and let Lambda pick them up).
I think it should run fine, but I would much appreciate it if there is any serverless/PaaS approach to trigger the task.
Off the top of my head, here are a couple of ways to achieve what you need:
Option 1: [Cost-Effective]
Post all the ticker events to an AWS FIFO SQS queue.
Define triggers on this queue to invoke lambda function.
Result: Since you are posting all the events to a FIFO queue, which maintains order, the events will be polled sequentially. Moreover, the SQS-to-Lambda trigger will scale automatically based on the number of messages in the queue.
Option 2: [Costly and can easily scale for real-time processing]
Same as above, but instead of posting to a FIFO queue, post to a Kinesis stream.
Enable the Kinesis stream to trigger the Lambda function.
Result: Kinesis will preserve the order of events arriving in the stream, and the Lambda function will be invoked based on the number of shards in the stream. This implementation scales well. If you have any future use case for real-time processing of tickers, this could be a great solution.
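A hedged sketch of the producer side for this option (the stream name is a placeholder); using the ticker as the partition key keeps events for the same ticker ordered within a shard:
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_ticker(ticker, payload):
    # Records with the same partition key land on the same shard, preserving their order
    kinesis.put_record(
        StreamName="ticker-events",
        Data=json.dumps({"ticker": ticker, "payload": payload}),
        PartitionKey=ticker
    )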
Option 3: [Cost Effective, alternate to Option:1]
Collect all the ticker events (1k or whatever) and put them into a file.
Upload this file to an AWS S3 bucket.
Enable an S3 event notification to trigger a proxy Lambda function.
The proxy Lambda function reads the S3 file and, based on the total number of events in the file, spawns n parallel actor Lambda functions (a sketch follows below).
Each actor Lambda function processes one event.
Result: Easy to implement, cost-effective, and scales easily based on your custom algorithm for distributing the load in the proxy Lambda function.
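A rough sketch of the proxy Lambda for this option, assuming one ticker per line in the S3 file and an actor function named actor-ticker-stats (both are assumptions):
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

def lambda_handler(event, context):
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for ticker in body.splitlines():
            # InvocationType='Event' fires the actor Lambda asynchronously
            lambda_client.invoke(
                FunctionName="actor-ticker-stats",
                InvocationType="Event",
                Payload=json.dumps({"ticker": ticker})
            )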
Option 4: [All-serverless]
Write a Lambda function that gets the list of tickers from some web server.
Define an AWS CloudWatch rule to generate events on a cron/frequency schedule (a sketch of the rule follows below).
Add a trigger to this CloudWatch rule to invoke the proxy Lambda function.
The proxy Lambda function can use any combination of the above options [1, 2 or 3] to trigger the actor Lambda functions for processing the records.
Result: Everything can be configured via the AWS console and is easy to use. Alternatively, you can write an AWS CloudFormation template to create all the required resources in a single go.
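For the scheduling part, a hedged boto3 sketch of a CloudWatch Events rule firing after market close on weekdays (the cron expression, names and ARN are placeholders):
import boto3

events = boto3.client("events")
# Run at 21:30 UTC, Monday to Friday
events.put_rule(
    Name="after-market-close",
    ScheduleExpression="cron(30 21 ? * MON-FRI *)"
)
events.put_targets(
    Rule="after-market-close",
    Targets=[{"Id": "proxy-lambda", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:proxy-ticker-stats"}]
)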
Having said that, I will now leave it up to you to choose the right solution based on your business/cost requirements.
You can use the Lambda fan-out option.
You can follow these steps to process 1k or more tickers using a serverless approach (a sketch of the master Lambda follows the steps):
1. Store all the stock tickers in an S3 file.
2. Create a master Lambda which reads the S3 file and splits the stocks into groups of 10.
3. Create a child Lambda which makes the async call to the external HTTP service and fetches the details.
4. In the master Lambda, loop through these groups and invoke 100 child Lambdas, passing in each group, and return the results to the master Lambda.
5. Collect all the information returned from the child Lambdas and continue with your processing there.
Now you can trigger this master Lambda at the end of the market day using a CloudWatch time-based rule (scheduler).
This is a completely serverless approach.
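A minimal sketch of the master Lambda from steps 2-5, assuming the ticker list arrives in the event and a child function named child-ticker-fetch (both are assumptions):
import json
import boto3
from concurrent.futures import ThreadPoolExecutor

lambda_client = boto3.client("lambda")

def chunks(items, size=10):
    # Split the ticker list into groups of 10 (step 2)
    for i in range(0, len(items), size):
        yield items[i:i + size]

def invoke_child(group):
    # Synchronous invocation so the child's result comes back to the master (step 4)
    response = lambda_client.invoke(
        FunctionName="child-ticker-fetch",
        InvocationType="RequestResponse",
        Payload=json.dumps({"tickers": group})
    )
    return json.loads(response["Payload"].read())

def lambda_handler(event, context):
    tickers = event["tickers"]
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(invoke_child, chunks(tickers)))
    # Continue with the combined processing here (step 5)
    return {"groups_processed": len(results)}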