For example, I have Lambda functions that consume messages from a Kinesis stream. How do I stop and resume the function so that I don't incur charges and I don't lose data in the stream?
I know that if the events keep failing, Kinesis will keep retrying and the cost can be very high.
I cannot delete the function because there is lots of automation around it through CloudFormation. Is there a way to stop and restart the function?
SOLUTION: http://alestic.com/2015/11/aws-lambda-kinesis-pause-resume
NOTE: Event sources for CloudWatch Events rules and log streaming cannot be disabled via an event source mapping; they will not even appear in the list when calling the API using the SDK. For those, you have to disable the Event Rule or the Log Subscription instead.
The updated Lambda console on AWS supports this in the UI now. Click on the Kinesis stream feeding your Lambda function, flip the "Enabled/Disabled" toggle at the bottom, and Save. This will essentially pause/resume your function.
[Screenshot: Toggling Kinesis input into Lambda]
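If you need the same pause/resume programmatically (say, from the automation around your CloudFormation stack), the console toggle maps to the event source mapping APIs. A minimal boto3 sketch, where the function name is a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# Find the event source mappings feeding the function
# ("my-kinesis-consumer" is a placeholder name).
mappings = lambda_client.list_event_source_mappings(
    FunctionName="my-kinesis-consumer"
)["EventSourceMappings"]

for mapping in mappings:
    # Enabled=False pauses consumption; pass Enabled=True to resume.
    # Records stay in the stream (subject to its retention period).
    lambda_client.update_event_source_mapping(
        UUID=mapping["UUID"],
        Enabled=False,
    )
```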
Let's talk about Kinesis for a moment. When you pull records off the stream, Kinesis will not 'delete' those records until you 'checkpoint' the stream. You can read the same records over and over until you confirm with Kinesis that you don't need them anymore.
AWS Lambda does not checkpoint the stream until the function completes its execution without an error. (context.success())
If you deploy a Lambda function that is broken in some way (exits with an exception/error), it will not checkpoint the stream, and your records will stay in the stream until the retention period expires (24 hours by default). The un-checkpointed records can then be read in a subsequent Lambda execution.
During deployment, the same thing applies. Any currently executing Lambdas that are interrupted will not checkpoint the stream, and any currently executing Lambdas that complete successfully will checkpoint as you expect.
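To make the retry behavior concrete, here is a minimal sketch of a Kinesis-triggered handler (process() is a hypothetical function of yours): raising an exception means the batch is not checkpointed and will be redelivered.

```python
import base64

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        # If this raises, Lambda does not checkpoint and the whole
        # batch is retried on a subsequent invocation.
        process(payload)  # hypothetical processing function
    # Returning normally lets Lambda checkpoint past this batch.
    return "ok"
```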
Related
I have a scheduled error-handling Lambda; I would like to use serverless technology here as opposed to a Spring Boot service or something similar.
The Lambda will read from an S3 bucket and process accordingly. The problem is that at times the S3 bucket may have a high volume of data to be processed, and long-running operations aren't suited to Lambdas.
One solution I can think of is to have the Lambda read and process one item from the bucket and, on success, trigger another instance of the same Lambda unless the bucket is empty/fully processed. The thing I don't like is that this is synchronous and quite slow. I also need to be conscious of running too many Lambdas at the same time, since we hit a REST endpoint as part of the error flow and don't want to overload it with too many requests.
I am thinking it would be nice to have maybe three instances of the Lambda running at the same time until the bucket is empty, but I'm not really sure. I am wondering if anyone has any nice patterns that could be used here, or suggestions on best practices?
Thanks
Create an S3 bucket for processing your files.
Enable an S3 -> Lambda trigger: on every new file in the bucket, the Lambda will be invoked to process that file, so every file is processed separately. https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Once the file is processed, you can either delete it or move it somewhere else.
For controlling concurrency, have a look at Lambda's concurrency settings (reserved and provisioned concurrency): https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
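For reference, the S3 -> Lambda trigger from the first step can also be set up with boto3 rather than the console. A sketch, with the bucket name and function ARN as placeholders (the function must also grant S3 invoke permission, e.g. via lambda add-permission):

```python
import boto3

s3 = boto3.client("s3")

# Invoke the processing Lambda on every new object in the bucket.
# Note: this call replaces the bucket's existing notification config.
s3.put_bucket_notification_configuration(
    Bucket="my-processing-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```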
Update:
As you still plan to use a scheduler Lambda and S3:
The scheduler Lambda reads/lists only the filenames and puts a message into SQS for each file to be processed (see the sketch after the note below).
A new Lambda consumes the SQS messages and processes the files.
Note: I would recommend using SQS from the start if the files/messages are not too big; it has built-in recovery mechanics (DLQ, delays, visibility timeout, etc.) that give you more than plain S3 storage. If the files are large, just create a message containing a reference to the file and still use SQS.
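A minimal sketch of that scheduler Lambda, assuming placeholder bucket and queue names; it only lists keys and enqueues them, leaving the actual processing to the SQS consumer Lambda:

```python
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-error-bucket"  # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/error-files"  # placeholder

def handler(event, context):
    # List only the object keys; the consumer Lambda fetches and
    # processes each object from its SQS message.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj["Key"])
```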
I'd separate the lambda that is called by the scheduler from the lambda that is doing the actual processing. When the scheduler calls the first lambda, it can look at the contents of the bucket, then spawn the worker lambdas to process the objects. This way you have control over how many objects you want per worker.
Given your requirements, I would recommend:
Configure an Amazon S3 Event so that a message is pushed to an Amazon SQS queue when the objects are created in the S3 bucket
Schedule an AWS Lambda function at regular intervals that will:
Check that the external service is working
Invoke a Lambda function to process one message from the queue, and keep looping
The hard part would be throttling the second Lambda function so that it doesn't try to send all the requests at once (which might impact that external service).
You could probably do this by using a Step Function to trigger Lambda and then, if it was successful, trigger another Lambda function. This could even be done in parallel, such as allowing up to three parallel Lambda executions. The benefit of using Step Functions is that there is no cost for "waiting" for each Lambda to finish executing.
So, the Step Function flow would be something like:
Invoke a "check external service" Lambda function
If it fails, then quit the flow
Invoke the "processing" Lambda function
Get one message
Process the message
If successful, remove the message from the queue
Return success/fail
If it was successful, keep looping until the queue is empty
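A sketch of what the "processing" Lambda in that flow might look like (queue URL and call_external_service() are placeholders); the queueEmpty flag in the return value is what a Choice state in the Step Function would branch on to keep looping or quit:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def handler(event, context):
    # Get one message.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])
    if not messages:
        return {"queueEmpty": True}  # tells the state machine to stop looping

    msg = messages[0]
    # Process the message; raising here leaves it on the queue, so it
    # reappears after the visibility timeout and gets retried.
    call_external_service(msg["Body"])  # hypothetical function

    # If successful, remove the message from the queue.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return {"queueEmpty": False}
```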
I have an AWS Lambda function that computes statistics on over 1k stock tickers after market close. I have an option like the one below.
Set up a cron job on an EC2 instance that submits 1k HTTP requests asynchronously (e.g. http://xxxxx.lambdafunction.xxxx?ticker=) to trigger the AWS Lambda function (or submit 1k requests to SNS and let Lambda pick them up).
I think it should run fine, but I would much appreciate it if there is a serverless/PaaS approach to trigger the task.
Off the top of my head, here are a couple of ways to achieve what you need:
Option 1: [Cost-Effective]
Post all the tickers to an SQS FIFO queue.
Define a trigger on this queue to invoke the Lambda function.
Result: Since you are posting all the events to a FIFO queue, which maintains order, the events will be polled sequentially. Moreover, the SQS-to-Lambda trigger scales automatically based on the number of messages in the queue.
Option 2: [Costly and can easily scale for real-time processing]
Same as above, but instead of posting to a FIFO queue, post to a Kinesis stream.
Enable the Kinesis stream to trigger the Lambda function.
Result: Kinesis preserves the order of events arriving in the stream, and Lambda invocations scale based on the number of shards in the stream. This implementation scales significantly. If you have any future use case for real-time processing of tickers, this could be a great solution.
Option 3: [Cost-effective, alternative to Option 1]
Collect all ticker events (1k or whatever) and put them into a file.
Upload this file to an AWS S3 bucket.
Enable an S3 event notification to trigger a proxy Lambda function.
This proxy Lambda function reads the S3 file and, based on the total number of events in the file, spawns n parallel actor Lambda functions.
Each actor Lambda function processes its events.
Result: Easy to implement, cost-effective, and scales easily based on your custom algorithm for distributing the load in the proxy Lambda function (a sketch follows below).
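A sketch of that Option 3 proxy Lambda, assuming one ticker event per line in the uploaded file and an actor function named actor-function (both placeholders):

```python
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

def handler(event, context):
    # Pull bucket/key out of the S3 event notification.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Assumption: one ticker event per line in the file.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
    tickers = body.splitlines()

    # Fan out: one async ("Event") invocation per ticker; your own
    # algorithm could batch these differently to shape the load.
    for ticker in tickers:
        lambda_client.invoke(
            FunctionName="actor-function",  # placeholder name
            InvocationType="Event",         # fire-and-forget
            Payload=json.dumps({"ticker": ticker}),
        )
```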
Option 4: [All-serverless]
Write a Lambda function that gets the list of tickers from some web server.
Define an AWS CloudWatch rule that generates events on a cron schedule/frequency.
Add a target to this CloudWatch rule to invoke the proxy Lambda function.
The proxy Lambda function uses any combination of the options above [1, 2, or 3] to trigger the actor Lambda functions for processing the records.
Result: Everything can be configured via the AWS console and is easy to use. Alternatively, you can write an AWS CloudFormation template to create all the required resources in a single go.
Having said that, I will leave it up to you to choose the right solution based on your business/cost requirements.
You can use the Lambda fan-out option.
You can follow these steps to process 1k tickers or more using a serverless approach.
1. Store all the stock tickers in an S3 file.
2. Create a master Lambda that reads the S3 file and splits the tickers into groups of 10.
3. Create a child Lambda that makes the async call to the external HTTP service and fetches the details.
4. In the master Lambda, loop through these groups, invoke 100 child Lambdas passing in each group, and return the results to the master Lambda.
5. Collect all the information returned from the child Lambdas and continue with your processing there.
Now you can trigger this master Lambda at market close every day using a CloudWatch time-based scheduled rule.
This is a completely serverless approach.
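A minimal sketch of that master Lambda, with the bucket, key, and child function names as placeholders; a thread pool is one way to get the parallel synchronous invocations described in step 4:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

def invoke_child(group):
    # Synchronous invocation, so the child's result comes back here.
    resp = lambda_client.invoke(
        FunctionName="child-ticker-processor",  # placeholder name
        InvocationType="RequestResponse",
        Payload=json.dumps({"tickers": group}),
    )
    return json.load(resp["Payload"])

def handler(event, context):
    # Read the tickers file (bucket/key are placeholders).
    body = s3.get_object(Bucket="tickers-bucket", Key="tickers.txt")["Body"].read().decode()
    tickers = body.splitlines()
    groups = [tickers[i:i + 10] for i in range(0, len(tickers), 10)]

    # Invoke up to 100 child Lambdas in parallel and collect results.
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = list(pool.map(invoke_child, groups))
    return results
```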
I have a Lambda function that’s triggered by a PUT to an S3 bucket.
I want to limit this Lambda function so that it’s only running one instance at a time – I don’t want two instances running concurrently.
I've had a look through the Lambda configuration and docs, but I can't see anything obvious. I can think about writing my own locking system, but it would be nice if this was already a solved problem.
How can I limit the number of concurrent invocations of a Lambda?
AWS Lambda now supports concurrency limits on individual functions:
https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
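With that feature, limiting a function to a single concurrent instance is one API call (or the equivalent console setting); the function name below is a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve a concurrency of 1: at most one instance runs at a time.
# Extra invocations are throttled; for async sources like S3 events,
# throttled invocations are retried for a while rather than dropped.
lambda_client.put_function_concurrency(
    FunctionName="my-s3-processor",  # placeholder name
    ReservedConcurrentExecutions=1,
)
```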
I would suggest using Kinesis Streams (or alternatively DynamoDB + DynamoDB Streams, which essentially have the same behavior).
You can think of a Kinesis stream as a queue. The good part is that you can use a Kinesis stream as a trigger for your Lambda function, so anything that gets inserted into this queue will automatically be passed over to your function, in order. That way you can process those S3 events one by one, one Lambda execution after the other (one instance at a time).
In order to do that, you'll need to create a Lambda function with the simple purpose of getting S3 Events and putting them into a Kinesis Stream. Then you'll configure that Kinesis Stream as your Lambda Trigger.
When you configure the Kinesis stream as your Lambda trigger, I suggest the following configuration:
Batch size: 1
This means that your Lambda will be called with only one event from Kinesis. You can select a higher number and you'll get a list of events of that size (for example, if you want to process the last 10 events in one Lambda execution instead of 10 consecutive Lambda executions).
Starting position: Trim horizon
This means it'll behave as a queue (FIFO)
A bit more info on AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AWS Lambda.
I hope this helps anyone with a similar problem.
P.S. Bear in mind that Kinesis Streams have their own pricing. Using DynamoDB + DynamoDB Streams might be cheaper (or even free due to the non-expiring Free Tier of DynamoDB).
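For completeness, a sketch of the small forwarding Lambda described above, assuming a stream named s3-events (a placeholder); a constant partition key keeps every record on one shard, which is what preserves the strict ordering:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def handler(event, context):
    # Forward each S3 event record into the stream.
    for record in event["Records"]:
        kinesis.put_record(
            StreamName="s3-events",  # placeholder stream name
            Data=json.dumps(record).encode(),
            # Constant key => single shard => strictly ordered,
            # one-at-a-time consumption by the downstream Lambda.
            PartitionKey="s3-events",
        )
```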
No, this is one of the things I'd really like to see Lambda support, but currently it does not. One of the problems is that if there were a lot of S3 PUT operations happening AWS would have to queue up all the Lambda invocations somehow, and there is currently no support for that.
If you built a locking mechanism into your Lambda function, what would you do with the requests you don't process due to a lock? Would you just throw those S3 notifications away?
The solution most people recommend is to have S3 send the notifications to an SQS queue, and then have your Lambda function scheduled to run periodically, like once a minute, and check if there is an item in the queue that needs to be processed.
Alternatively, have S3 send the notifications to SQS and just have a t2.nano EC2 instance with a single-threaded service polling the queue.
I know this is an old thread, but I ran across it while trying to figure out how to make sure my time-sequenced SQS messages were processed in order coming out of a FIFO queue, and not processed simultaneously/out of order by multiple concurrent Lambda invocations.
Per the documentation:
For FIFO queues, Lambda sends messages to your function in the order that it receives them. When you send a message to a FIFO queue, you specify a message group ID. Amazon SQS ensures that messages in the same group are delivered to Lambda in order. Lambda sorts the messages into groups and sends only one batch at a time for a group. If your function returns an error, the function attempts all retries on the affected messages before Lambda receives additional messages from the same group.

Your function can scale in concurrency to the number of active message groups.
Link: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
So essentially, as long as you use a FIFO queue and submit the messages that need to stay in sequence with the same MessageGroupId, SQS and Lambda handle the sequencing automatically, with no additional settings necessary.
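For example, a producer only has to tag related messages with the same group ID (the queue URL here is a placeholder):

```python
import boto3

sqs = boto3.client("sqs")

# FIFO queue names must end in ".fifo"; the URL is a placeholder.
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/events.fifo",
    MessageBody='{"seq": 1}',
    MessageGroupId="my-sequence",      # same group => in-order, one batch at a time
    MessageDeduplicationId="event-1",  # or enable content-based deduplication
)
```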
Have the S3 "Put events" cause a message to be placed on the queue (instead of involving a lambda function). The message should contain a reference to the S3 object. Then SCHEDULE a lambda to "SHORT POLL the entire queue".
PS: S3 events cannot trigger a Kinesis stream directly... only SQS, SNS, and Lambda (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-destinations). Kinesis streams are expensive and intended for real-time event handling.
It is unclear to me why the Lambda-based trigger I just recreated atop my DynamoDB stream has stopped firing. Per the docs, I know that the stream atop my single-shard DynamoDB table sends the payloads synchronously and will not invoke subsequent batches until the previous one finishes.
Because I wanted to recreate the trigger with a larger batch size (from 100 to 5000 records per batch), I took these steps:
Deleted the trigger.
Disabled the previous DynamoDB stream.
Re-enabled the stream (creating a new ARN with the updated timestamp).
Recreated the trigger tied to the same Lambda (with a batch size of 5000).
Either the poller that reads the stream and sends those batches to my Lambda is not polling, OR by doing one of the above steps I've voided the stream and it has no new results. But I've since updated DynamoDB directly as well as inserted new rows, and the trigger still hasn't fired.
I'm not sure what I'm missing?
Lambda functions may not execute for a variety of reasons:
Lack of permissions
Trigger not being enabled
DynamoDB Stream being disabled
Hitting Lambda region and account limits
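The second and third points are worth checking first here: as noted in the question, disabling and re-enabling a DynamoDB stream produces a new stream ARN, so a trigger pointing at the old ARN will never fire. A quick boto3 check (the function name is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

mappings = lambda_client.list_event_source_mappings(
    FunctionName="my-ddb-consumer"  # placeholder name
)["EventSourceMappings"]

for m in mappings:
    # State should be "Enabled", and EventSourceArn must match the
    # *new* stream ARN created when the stream was re-enabled.
    print(m["EventSourceArn"], m["State"], m.get("StateTransitionReason"))
```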
We have a .NET client application that uploads files to S3. There is an event notification registered on the bucket which triggers a Lambda to process the file. If we need to do maintenance, then we suspend our processing by removing the event notification and adding it back later when we're ready to resume processing.
To process the backlog of files that have queued up in S3 during the period the event notification was disabled, we write a record to a kinesis stream with the S3 key to each file, and we have an event mapping that lets Lambda consume each kinesis record. This works great for us because it allows us to control our concurrency when we are processing a large backlog by controlling the number of shards in the stream. We were originally using SNS but when we had thousands of files that needed to be reprocessed SNS would keep starting Lambdas until we hit our concurrent executions threshold, which is why we switched to Kinesis.
The problem we're facing right now is that the cost of kinesis is killing us, even though we barely use it. We get 150 - 200 files uploaded per minute, and our lambda takes about 15 seconds to process each one. If we suspend processing for a few hours we end up with thousands of files to process. We could easily reprocess them with a 128 shard stream, however that would cost us $1,400 / month. The current cost for running our Lambda each month is less than $300. It seems terrible that we have to increase our COGS by 400% just to be able to control our concurrency level during a recovery scenario.
I could attempt to keep the stream size small by default and then resize it on the fly before we re-process a large backlog, however resizing a stream from 1 shard up to 128 takes an incredibly long time. If we're trying to recover from an unplanned outage then we can't afford to sit around waiting for the stream to resize before we can use it. So my questions are:
Can anyone recommend an alternative pattern to using kinesis shards for being able to control the upper bound on the number of concurrent lambdas draining a queue?
Is there something I am missing which would allow us to use Kinesis more cost efficiently?
You can use SQS with Lambda or Worker EC2s.
Here is how it can be achieved (2 approaches):
1. Serverless Approach
S3 -> SNS -> SQS -> Lambda Scheduler -> Lambda
Use SQS instead of Kinesis for storing S3 Paths
Use a Lambda Scheduler to keep polling messages (S3 paths) from SQS
Invoke the processing Lambda function from the Lambda scheduler (a sketch follows)
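A sketch of that Lambda scheduler, with the queue URL and worker function name as placeholders; MAX_WORKERS is your upper bound on concurrent processing Lambdas per scheduler run:

```python
import json
import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-paths"  # placeholder
MAX_WORKERS = 10  # upper bound per run (SQS returns at most 10 per receive)

def handler(event, context):
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=MAX_WORKERS)
    for msg in resp.get("Messages", []):
        # Fan each S3 path out to a worker Lambda asynchronously.
        lambda_client.invoke(
            FunctionName="file-processor",  # placeholder name
            InvocationType="Event",
            Payload=json.dumps({"s3_path": msg["Body"]}),
        )
        # Deleting here trades retry safety for simplicity; for
        # at-least-once processing, delete from the worker instead.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```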
2. EC2 Approach
S3 -> SNS -> SQS -> Beanstalk Worker
Use SQS instead of Kinesis for storing S3 Paths
Use a Beanstalk Worker environment, which polls SQS automatically
Implement the application (processing logic) in the Beanstalk worker, hosted on a local HTTP server in the same EC2 instance