There are a few pieces of my app that cannot afford the additional 1-2 second delay caused by the "freeze-thaw" cycle that Lambda functions go through when they're new or unused for some period of time.
How can I keep these Lambda functions warm so AWS doesn't have to re-provision them all the time? This goes for both 1) infrequently-used functions and 2) recently-deployed functions.
Ideally, there would be a setting I've missed called "keep warm" that increases the cost of a Lambda function but always keeps it warm and ready to respond; I'm pretty sure this does not exist, though.
I suppose an option is to use a CloudWatch timer to ping the functions every so often... but this feels wrong to me. Also, I don't know the interval that AWS uses to put Lambda functions on ice.
UPDATE DEC 2019
AWS now also offers 'Provisioned Concurrency'. https://aws.amazon.com/blogs/aws/new-provisioned-concurrency-for-lambda-functions/
Basically you pay around $10/month (for a 1 GB Lambda) per instance that you want to keep 'warm'.
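For illustration, a minimal sketch of configuring it with the AWS SDK for JavaScript v3 (the function name, alias, and instance count below are placeholders):

```typescript
// Sketch: enable provisioned concurrency on an alias (names/values are placeholders).
import {
  LambdaClient,
  PutProvisionedConcurrencyConfigCommand,
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

async function keepWarm(): Promise<void> {
  // Provisioned concurrency must target a published version or alias, not $LATEST.
  await lambda.send(
    new PutProvisionedConcurrencyConfigCommand({
      FunctionName: "my-latency-sensitive-function", // placeholder name
      Qualifier: "live",                             // placeholder alias
      ProvisionedConcurrentExecutions: 2,            // instances kept initialized and ready
    })
  );
}
```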
BBC has published a nice article on iPlayer engineering where they describe a similar issue.
They have chosen to call the function every few minutes using CloudWatch Scheduled Events.
So in theory, it should just stay there, except it might not. So we have set up a scheduled event to keep the container ‘warm’. We invoke the function every couple of minutes not to do any processing, but to make sure we’ve got the model ready. This accounts for a very small percentage of invocations but helps us mitigate race conditions in the model download. (We also limit artificially how many lambdas we invoke in parallel as an additional measure).
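If you go the scheduled-ping route, the ping invocation can short-circuit inside the handler so it does no real work. A minimal sketch, assuming a Node.js/TypeScript function and the default CloudWatch/EventBridge scheduled-event payload:

```typescript
// Sketch: bail out early on warm-up pings from a CloudWatch/EventBridge schedule.
export const handler = async (event: any) => {
  // Scheduled events arrive with source "aws.events" and detail-type "Scheduled Event".
  if (event?.source === "aws.events" && event?.["detail-type"] === "Scheduled Event") {
    return { warmed: true }; // keep the container alive without running the real logic
  }

  // ...normal request handling goes here...
  return { statusCode: 200, body: "ok" };
};
```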
Related
I am working on a massive distributed computing platform built on AWS Lambda. The platform is extremely spiky, so most of the time the number of ConcurrentExecutions is below 50, but we can hit the maximum (currently 1000) for an hour or more if a large batch job hits the system (it is an event-driven system). This is a problem because we will have customer-facing APIs that will lag terribly. Finally, I am not an architect, so I have minimal control over how the system was designed, but I have been asked to devise a clever concurrent-execution limiting solution.
I'm not new to AWS, so I know about the standard ways to handle this problem. #1 is reserved concurrency on the user-facing Lambdas. I'm not allowed to do that for the sake of this exercise (though I'll go tell my boss that's what's necessary if it truly is). I'm thinking of a system where we designate high-priority functions (for the UI) and low-priority functions (for batch processing), and the low-priority functions check a value stored in DynamoDB, fed from the CloudWatch ConcurrentExecutions metric. If a low-priority function finds that we are in danger of using all the ConcurrentExecutions, it posts to a queue with exponential backoff in place. This should all work, except that ConcurrentExecutions is only reported in one-minute increments, which is too slow, as many of our Lambdas run for around 500 ms.
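For clarity, here's a rough sketch of the check I have in mind (the table, key, attribute names, and threshold are placeholders, and this doesn't by itself solve the one-minute granularity problem):

```typescript
// Sketch only: a low-priority function checking a concurrency value stored in DynamoDB.
// The value itself would be written by something else that reads the CloudWatch
// ConcurrentExecutions metric.
import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});
const DANGER_THRESHOLD = 900; // back off well before the 1000 account limit

async function nearConcurrencyLimit(): Promise<boolean> {
  const res = await ddb.send(
    new GetItemCommand({
      TableName: "lambda-concurrency",   // hypothetical table
      Key: { pk: { S: "current" } },     // hypothetical key holding the latest metric value
    })
  );
  const current = Number(res.Item?.concurrentExecutions?.N ?? "0");
  return current >= DANGER_THRESHOLD;
}

// In the low-priority handler: if nearConcurrencyLimit() is true, re-queue the work
// with exponential backoff instead of running it.
```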
So my questions are as follows:
Is there a way to set up a custom ConcurrentExecutions metric that has second-by-second data points, and if so, how would you do it?
Is there a better way to implement a counter than Cloudwatch?
Am I just missing something here, and does someone have a clever way to manage Lambda ConcurrentExecutions?
I don't think it's necessary to create a monitoring or throttling solution at all. You would need to build, test, and maintain something in addition to your core solution. Instead, two suggestions:
First, it sounds like the current design has one Lambda function doing too much. Decompose the Lambdas further, splitting them into a UI/public-facing Lambda and one or more Lambdas dedicated to the batch processes. This way you can spread the concurrent executions across more functions and manage the limit for each Lambda function separately.
Second, request a service quota/limit increase
To raise the limit above 1,000 concurrent function executions, submit a request to the AWS Support Center by following the steps in our documentation. This feature is available in all regions where Lambda is available.
See AWS Lambda Raises Default Concurrent Execution Limits.
https://aws.amazon.com/about-aws/whats-new/2017/05/aws-lambda-raises-default-concurrent-execution-limit/
The limit management team is quite flexible; when asked for a limit to be raised, they will generally raise it to any reasonable number that your solution requires.
To request a limit increase, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html
I use Node.js AWS Lambdas. If I don't make calls to S3, DynamoDB, or KMS for some time (approx. 8 h or more), the first call I make is usually painfully slow - up to 5 seconds. There's nothing complex in the queries themselves - e.g. get a 0.2 KB S3 object, or query a DynamoDB table by index.
So, it looks like AWS "hibernates" these resources when they aren't in active use, and when I call them for the first time after a while they take some time to return from the "hibernated" state. This is my assumption, but I couldn't find any information about it in the docs. So, my two questions are:
Is my assumption about "hibernation" correct?
If the first point is correct, is there any way to mitigate these "cold" calls to AWS services, other than keeping those services "warm" by calling them every X minutes?
Edit
Just to avoid confusion - this is not about Lambda cold starts. I'm aware of them; they exist and have their own share in the functions' latency. The times I measure are the exact durations of the calls to S3/DynamoDB etc., after the Lambda has already started.
In all likelihood it is the Lambda function that is hibernating, not the other services:
A cold start occurs when an AWS Lambda function is invoked after not being used for an extended period of time resulting in increased invocation latency.
https://medium.com/#lakshmanLD/resolving-cold-start%EF%B8%8F-in-aws-lambda-804512ca9b61
And yes, you could set up a CloudWatch event to keep your Lambda function warm.
We have experienced the same issue for calls to SSM and DynamoDB. It's probably not these services that go into hibernation, but the parameters for calling them are cached on the lambda container, which means they need to be recreated when a new container is spawned.
Unfortunately, we have not found a solution other than pinging the Lambda from time to time. In this case, you should execute a call to your services within the ping in order to see an improvement in loading times. See also the benchmark below.
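For illustration, a rough sketch of such a ping (AWS SDK for JavaScript v3; the table, key, and warm-up flag are made up). The client is created once per container, and the warm-up branch makes a cheap call so the cached connection and credentials stay primed:

```typescript
// Sketch: clients created outside the handler are reused across warm invocations;
// the warm-up ping performs a cheap call so connections/credentials stay primed.
import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({}); // initialized once per container, not per invocation

export const handler = async (event: any) => {
  if (event?.warmup) {
    // Cheap call against a known key (table and key are hypothetical).
    await ddb.send(
      new GetItemCommand({
        TableName: "my-table",
        Key: { pk: { S: "warmup" } },
      })
    );
    return { warmed: true };
  }

  // ...normal processing...
};
```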
AWS (zoewangg) acknowledged the slow startup issue in the 1.11.x Java SDK.
One of the main reasons is that the 1.11.x SDK uses ApacheHttpClient under the hood and initializing it can be expensive.
Check out https://aws.amazon.com/blogs/developer/tuning-the-aws-java-sdk-2-x-to-reduce-startup-time/
I have an application on an AWS EC2 instance that runs once daily. The application fetches some files from a web service, parses the files line by line, updates a database, updates S3 files based on changes in the database, sends notification emails to customers as well as a few other tasks.
This is a series of logical tasks that must take place in sequence, although some of the tasks can be thought of as sub-tasks that can be executed in parallel. All tasks are a combination of Perl scripts and Java programs, with a single Perl script acting as the manager that executes each in turn. Some tasks can take as long as 45 minutes to complete, and the whole process can take up to 3 hours in total.
I'd like to make this whole process serverless. My initial idea was to use AWS Lambda, whereby each task would execute as a Lambda function, until I discovered Lambda functions impose a 5 minute execution timeout. It seems like the AWS Step Functions service is actually a better fit for my use case, but my understanding is that this service is backed by Lambda, so the tasks will still have the 5 min execution limitation.
(I'm also aware that I would have to re-write my Perl scripts to a language supported by Lambda).
I assume that I can work around the execution time limit by refactoring my code into smaller functions that are guaranteed to complete in under 5 minutes. In my particular situation, though, this seems inefficient.
Currently the database update task processes lines from a file one at a time. For this to work with Lambda, a Lambda function would need to handle only a single line from the file (or a very small number of lines) in order to guarantee not spilling over 5 minutes execution time. This would involve opening and closing a connection with the database on every invocation of the Lambda function. Also, each line processed should result in an entry written to a file, to be stored in S3. Right now, I just keep a file handle in memory and write the file to S3 when all lines are processed, but with Lambda I would need to keep reading the file, updating it and writing it back to S3.
What I'm asking is:
Is my use case a bad fit for AWS Lambda and/or AWS Step Functions?
Have I misunderstood how these services work?
Is there another AWS service that would be a better fit for my use case?
After further research, I think AWS Batch might be a good idea.
What you want are called Activity Workers. Tl;dr: You register "activities" and each gets an ARN. Then you can put that ARN in the resource field of Task states and then you run some code (the "worker") somewhere (in a Lambda, on EC2, in your basement, wherever) that polls for tasks identified by that ARN, then calls back to report success or failure. Activity Workers can run for up to a year.
Step-by-step details are in the AWS docs.
In response to RTF's comment, here's a deeper dive: Suppose you have code to color turtles in color_turtles.pl. So what you do is call the CreateActivity API - see http://docs.aws.amazon.com/step-functions/latest/apireference/API_CreateActivity.html - giving the name "ColorTurtles" and it'll give you back an ARN, a string beginning arn:aws... Then in your state machine you make a Task state with that ARN as the value of the resource field. Then you add code to color_turtles.pl to poll the service with http://docs.aws.amazon.com/step-functions/latest/apireference/API_GetActivityTask.html - whenever a state machine you're running gets to that task, it'll go look for activity workers polling. It'll give your polling worker the input for the task, then you process the input and generate some output, and call SendTaskSuccess or SendTaskFailure. All these are just REST HTTP calls, so you can run them anywhere and I mean anywhere; in a Lambda, on an EC2 instance, or on some computer anywhere on the Internet.
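A rough sketch of such a worker, assuming the AWS SDK for JavaScript v3 (the activity ARN, worker name, and the work itself are placeholders):

```typescript
// Sketch: a long-running activity worker that polls Step Functions for tasks.
import {
  SFNClient,
  GetActivityTaskCommand,
  SendTaskSuccessCommand,
  SendTaskFailureCommand,
} from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});
const activityArn = "arn:aws:states:us-east-1:123456789012:activity:ColorTurtles"; // placeholder

async function colorTurtles(input: unknown): Promise<unknown> {
  // ...the actual long-running work (may take minutes or hours)...
  return { done: true };
}

async function pollForever(): Promise<void> {
  while (true) {
    // GetActivityTask long-polls for up to ~60 seconds; an empty taskToken means nothing was scheduled.
    const task = await sfn.send(
      new GetActivityTaskCommand({ activityArn, workerName: "worker-1" })
    );
    if (!task.taskToken) continue;

    try {
      const output = await colorTurtles(JSON.parse(task.input ?? "{}"));
      await sfn.send(
        new SendTaskSuccessCommand({
          taskToken: task.taskToken,
          output: JSON.stringify(output),
        })
      );
    } catch (err) {
      await sfn.send(
        new SendTaskFailureCommand({
          taskToken: task.taskToken,
          error: "WorkerError",
          cause: String(err),
        })
      );
    }
  }
}
```

The worker can run wherever is convenient (EC2, a container, on-premises), since it only makes outbound API calls.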
So to answer your questions:
1) Yeah, if you've got something that'll run for around 45 minutes, whilst you could engineer it with Lambda/Step Functions, you're probably better off getting an EC2 micro instance.
2) Nope, you've pretty much got it.
3) As above, you want to go with EC2 for this. There's a good article on using Data Pipeline to start/stop an EC2 instance here; that way, by starting the instance only when you need it, the cost (if any) is negligible.
I have jobs that run in this fashion; normally you can get away with a t2.micro instance, which is free-tier eligible.
You can also run your Perl scripts on an EC2 instance, so no need to rewrite them!
I will start by saying that it seems you are looking for workflow solutions on AWS. SWF and Step Functions are the two most popular ones. Step Functions is the more recent offering and is encouraged by AWS over SWF.
SWF has native capability to handle long-running tasks, the downside is that you have to provide your own execution environment for deciders (can't use lambda).
With Step Functions, you can do this in two different ways. One approach is suggested by Tim in his answer. An alternative way to achieve the same thing is to use a job poller in Step Functions. A job poller can call (poll) your resource to find out whether the task is done and, if not, send the execution into a wait state for a specified time. As mentioned above, the maximum execution time currently allowed for any workflow is 1 year. If you have tasks which may take longer than 1 year, you can't use Step Functions in their current form.
We are experimenting with a new serverless solution where an external provider writes to DynamoDB, a DynamoDB Stream reacts to each new write event and triggers an AWS Lambda function, which propagates the changes downstream.
So far it works well; however, sometimes we notice that data is delayed, e.g. no updates come from the Lambda for a few minutes.
After going through a lot of the DynamoDB Streams documentation, the only term they use is "near real-time stream record" - but what does "near real-time" mean in practice? What are the possible delays we are looking at here?
In my experience, most of the time it is near real-time. However, on a rare occasion you might have to wait a while (in my case, up to half an hour). I assume this was because of hardware or network issues in AWS infrastructure.
In most cases, Lambda functions are triggered within half a second after you make an update to a small item in a Streams-enabled DynamoDB table. But event source changes, updates to the Lambda function, changing the Lambda execution role, etc. may introduce additional latency when the Lambda function is run for the first time.
We're trying to move to AWS and to use DynamoDB. It'd be nice to keep everything under DynamoDB so there aren't extraneous types of databases, but aside from half-complete research projects I'm not really finding anything to use as a scheduler. There are going to be dynamically set schedules in the range of thousands or more, possibly with many running at the same time. As for languages, Java or at least the JVM would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler I'm thinking of something all-purpose like Quartz. I want to set a cron schedule and have it run the code I give it at that time. This isn't for some AWS task; this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried about what happens when the VM is down. Data Pipeline seems a bit too much. I've been looking into making a DynamoDB job store for Quartz; consistent reads might get around the transaction and consistency issues, but I'm hesitant - it might be biting off a lot, with a lot of hard-to-notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option for you is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, recently, AWS launched the ability to call the Lambda function directly -- e.g. you could have an external timer or other code that triggers the function on a specific schedule.
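As a rough illustration of the DynamoDB-triggered option (a minimal sketch assuming a Node.js/TypeScript function, the @types/aws-lambda typings, and a table whose stream is already mapped to the function):

```typescript
// Sketch: a Lambda handler invoked by a DynamoDB Streams event source mapping.
import type { DynamoDBStreamEvent } from "aws-lambda";

export const handler = async (event: DynamoDBStreamEvent) => {
  for (const record of event.Records) {
    // eventName is INSERT, MODIFY, or REMOVE; NewImage holds the item after the change.
    console.log(record.eventName, JSON.stringify(record.dynamodb?.NewImage));
    // ...kick off whatever work the change implies...
  }
};
```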
Lastly, this SO thread may have other options for you to consider.
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression:
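For example, a fixed-rate schedule looks like rate(5 minutes), and a cron-style schedule looks like cron(0 12 * * ? *) (every day at 12:00 UTC).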