I know this may miss the point of using Cloud Functions in the first place, but in my specific case, I'm using Cloud Functions because it's the only way I can bridge Next.js with Firebase Hosting. I don't need to make it cost efficient, etc.
With that said, the cold boot times for Cloud Functions are simply unbearable and not production-ready, averaging around 10 to 15 seconds for my boilerplate.
I've watched this video by Google (https://www.youtube.com/watch?v=IOXrwFqR6kY) that talks about how to reduce cold boot time. In a nutshell: 1) trim dependencies, 2) use trial and error to find dependency versions that are cached on Google's network, 3) lazy-load dependencies.
But 1) there are only so many dependencies I can trim, 2) how would I know which versions are better cached?, and 3) there are only so many dependencies I can lazy-load.
Another way is to avoid the cold boot altogether. What's a good way, or hack, that I can use to essentially keep my (one and only) cloud function warm?
With all "serverless" compute providers, there is always going to be some form of cold start cost that you can't eliminate. Even if you are able to keep a single instance alive by pinging it, the system may spin up any number of other instances to handle current load. Those new instances will have a cold start cost. Then, when load decreases, the unnecessary instances will be shut down.
There are ways to minimize your cold start costs, as you have discovered, but the costs can't be eliminated.
As of Sept 2021, you can now specify a minimum number of instances to keep active. This can help reduce (but not eliminate) cold starts. Read the Google Cloud blog and the documentation. For Firebase, read its documentation. Note that setting min instances incurs extra billing - keeping computing resources active is not a free service.
If you absolutely demand hot servers to handle requests 24/7, then you need to manage your own servers that run 24/7 (and pay the cost of those servers running 24/7). As you can see, the benefit of serverless is that you don't manage or scale your own servers, and you only pay for what you use, but you have unpredictable cold start costs associated with your project. That's the tradeoff.
You're not the first to ask ;-)
The answer is to configure a remote service to periodically call your function so that the single (and only) instance remains alive.
It's unclear from your question, but I assume your Function provides an HTTP endpoint. In that case, find a healthcheck or cron service that can be configured to make an HTTP call every x seconds or minutes and point it at your Function.
You may have to juggle the timings to find the Goldilocks period - not so often that you're wasting effort, not so infrequently that the instance dies - but this is what others have done.
You can now specify a minimum number of instances (MIN_INSTANCE_LIMIT) to keep running at all times.
Cloud Functions Doc: https://cloud.google.com/functions/docs/configuring/min-instances
Cloud Functions example from the docs:
gcloud beta functions deploy myFunction --min-instances 5
It's also available in Firebase Functions by specifying minInstances:
Firebase Functions Docs: https://firebase.google.com/docs/functions/manage-functions#min-max-instances
Frank announcing it on Twitter: https://twitter.com/puf/status/1433431768963633152
Firebase Function example from the docs:
exports.getAutocompleteResponse = functions
  .runWith({
    // Keep 5 instances warm for this latency-critical function
    minInstances: 5,
  })
  .https.onCall((data, context) => {
    // Autocomplete a user's search term
  });
You can trigger it via cron job as explained here: https://cloud.google.com/scheduler/docs/creating
Using Cloud Scheduler is a wise solution, but the actual implementation is not so straightforward. Please check my article for details. Examples of functions:
myHttpFunction: functions.https.onRequest((request, response) => {
  // Check for the warmup parameter:
  // request.query.warmup if the warmup request is a GET,
  // request.body.warmup if it is a POST.
  if (request.query.warmup || request.body.warmup) {
    return response.status(200).type('application/json').send({status: "success", message: "OK"});
  }
  // ... normal request handling continues here ...
});
myOnCallFunction: functions.https.onCall((data, context) => {
  // Check for the warmup parameter.
  if (data.warmup) {
    return {"success": true};
  }
  // ... normal call handling continues here ...
});
Examples of gcloud CLI commands:
gcloud --project="my-awesome-project" scheduler jobs create http warmupMyOnCallFunction --time-zone "America/Los_Angeles" --schedule="*/5 5-23 * * *" --uri="https://us-central1-my-awesome-project.cloudfunctions.net/myOnCallFunction" --description="my warmup job" --headers="Content-Type=application/json" --http-method="POST" --message-body="{\"data\":{\"warmup\":\"true\"}}"
gcloud --project="my-awesome-project" scheduler jobs create http warmupMyHttpFunction --time-zone "America/Los_Angeles" --schedule="*/5 5-23 * * *" --uri="https://us-central1-my-awesome-project.cloudfunctions.net/myHttpFunction?warmup=true" --description="my warmup job" --headers="Content-Type=application/json" --http-method="GET"
Cloud functions are generally best-suited to perform just one (small) task. More often than not I come across people who want to do everything inside one cloud function. To be honest, this is also how I started developing cloud functions.
With this in mind, you should keep your cloud function code clean and small to perform just one task. Normally this would be a background task, a file or record that needs to be written somewhere, or a check that has to be performed. In this scenario, it doesn't really matter if there is a cold start penalty.
But nowadays, people, including myself, rely on cloud functions as a backend for API Gateway or Cloud Endpoints. In this scenario, the user goes to a website and the website sends a backend request to the cloud function to get some additional information. Now the cloud function acts as an API and a user is waiting for it.
Typical cold vs. warm cloud function response times (timing screenshots not reproduced here).
There are several ways to cope with a cold-start problem:
Reduce dependencies and amount of code. As I said before, cloud functions are best-suited for performing single tasks. This will reduce the overall package size that has to be loaded to a server between receiving a request and executing the code, thus speeding things up significantly.
Another, more hacky way is to set up Cloud Scheduler to periodically send a warmup request to your cloud function. GCP has a generous free tier, which allows for 3 scheduler jobs and 2 million cloud function invocations per month (depending on resource usage). So, depending on the number of cloud functions, you could easily schedule an HTTP request every few minutes. For the sake of clarity, I placed a snippet below this post that deploys a cloud function and a scheduler that sends warmup requests.
If you think you have tweaked the cold-start problem, you could also take measures to speed up the actual runtime:
I switched from Python to Golang, which gave me a double-digit performance increase in terms of actual runtime. Golang speed is comparable to Java or C++.
Declare variables, especially GCP clients like Storage, Pub/Sub etc., on a global level (source). This way, future invocations of your cloud function will reuse those objects; see the sketch after this list.
If you are performing multiple independent actions within a cloud function, you could make them asynchronous.
And again, clean code and fewer dependencies also improve the runtime.
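Here is the sketch for the global-clients point: a minimal Node.js example, assuming the firebase-admin SDK. The getUser function and the users collection are illustrative placeholders, not something from this post.
const functions = require('firebase-functions');
const admin = require('firebase-admin');

// Runs once per instance, not once per request; warm invocations reuse it.
admin.initializeApp();
const db = admin.firestore();

exports.getUser = functions.https.onRequest(async (request, response) => {
  // A client created inside the handler would be rebuilt on every request.
  const snapshot = await db.collection('users').doc(request.query.id).get();
  response.json(snapshot.data());
});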
Snippet:
# Deploy function
gcloud functions deploy warm-function \
--runtime=go113 \
--entry-point=Function \
--trigger-http \
--project=${PROJECT_ID} \
--region=europe-west1 \
--timeout=5s \
--memory=128MB
# Set IAM bindings
gcloud functions add-iam-policy-binding warm-function \
--region=europe-west1 \
--member=serviceAccount:${PROJECT_ID}@appspot.gserviceaccount.com \
--role=roles/cloudfunctions.invoker
# Create scheduler
gcloud scheduler jobs create http warmup-job \
--schedule='*/5 * * * *' \
--uri="https://europe-west1-${PROJECT_ID}.cloudfunctions.net/warm-function" \
--project=${PROJECT_ID} \
--http-method=OPTIONS \
--oidc-service-account-email=${PROJECT_ID}@appspot.gserviceaccount.com \
--oidc-token-audience=https://europe-west1-${PROJECT_ID}.cloudfunctions.net/warm-function
Google has just announced the ability to set min-instances for your Cloud Function deploys. This allows you to set the lower bound for scaling your functions down and minimises cold starts (they don't promise to eliminate them).
There is a small cost for keeping warm instances around (Idle time) - though at the time of writing, that seems undocumented on the Cloud Functions pricing page. They say:
If you set a minimum number of function instances, you are also billed for the time these instances are not active. This is called idle time and is priced at a different rate.
There is no single solution for keeping cold starts to a minimum; it is a mixture of multiple techniques. The question is more how to make our lambdas so fast that we don't care so much about cold starts. I am talking about a startup time in the range of 100-500 ms.
How to make your lambda faster?
Keep your package size as small as possible (remove big libraries of which only a fraction is used); keep the package size to 20 MB at most. On every cold start, this package is fetched and decompressed.
Try to initialise, at application startup, only the pieces you actually need (see the lazy-loading sketch after this list).
Nodejs - https://gist.github.com/Rich-Harris/41e8ccc755ea232a5e7b88dee118bcf5
If you use a JVM technology for your services, try to migrate them to GraalVM, where the boot-up overhead is reduced to a minimum:
Micronaut + GraalVM
Quarkus + GraalVM
Helidon + GraalVM
Use cloud infrastructure configs to reduce the cold starts.
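A minimal Node.js sketch of the lazy initialisation idea from the list above; heavy-lib is a placeholder module name, not a real dependency.
let heavyLib; // cached across warm invocations of the same instance

exports.handler = async (event) => {
  if (event.needsHeavyWork) {
    // Loaded at most once per instance, and only on the code path that needs it.
    heavyLib = heavyLib || require('heavy-lib');
    return heavyLib.process(event.payload);
  }
  // The common path never pays the import cost.
  return { status: 'ok' };
};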
In 2020, cold starts are not as painful as they were a few years ago. I can say more about AWS, but I am sure all of the above works well for any cloud provider.
At the end of 2019, AWS introduced provisioned concurrency for Lambda (https://aws.amazon.com/about-aws/whats-new/2019/12/aws-lambda-announces-provisioned-concurrency/), so you don't have to care so much about warming anymore.
An email I recently received from GCP mentions the transition to Artifact Registry for Cloud Functions.
It claims:
Cloud Functions for Firebase and Firebase Extensions have historically used Container Registry for packaging functions and managing their deployment, yet with the change to Artifact Registry, you'll have the following benefits:
Your functions will deploy faster.
You’ll have access to more regions.
I cannot find any more information regarding faster deployments, either from official documentation or from user experiences.
Is there any reason to believe Cloud Function deployment will actually be faster, by an appreciable margin? Currently function deployment is glacial, so even a small speedup in percentage terms would shave minutes off deployment times.
I'm personally surprised by that "faster" deployment mention because, in reality, it won't be.
To explain that, you simply have to review the deployment process:
You submit your code
Your code is packaged in a container (with Cloud Build and Buildpack) and stored somewhere (in container registry or artifact registry)
The code is deployed on the target service.
If you take the duration of each step, the percentages look roughly like this:
Submission: ~0.5% (depends on your network)
Build: ~99% (depends on the build to perform; compiling/minifying can take long minutes)
Deployment: ~0.5% (even if the container is "big", the petabyte network is wonderful)
So, yes, you have more regions, and, by the way, if you have a large container to deploy in a previously unsupported region, the data transfer will take a few more milliseconds, maybe even a few seconds.
All of that to say: yes, you can save a few seconds, but it's not always the case.
I have several python scripts that follow a similar format: you pass in a date, and the script either:
- checks my S3 bucket for the file with that date in the filename and parses it, or
- runs a python script doing some analysis on the file of that date (which takes over an hour to run)
I am looking for a serverless solution that would let me call these functions on a range of dates and run them all in parallel. Because of the long duration of my python script, services like AWS Lambda and Google Cloud Functions don't work because of their timeouts (15 minutes and 9 minutes, respectively). I have looked at Google Cloud Dataflow, but am not sure whether this is overkill for my relatively simple use case.
Something with the lowest possible outages is important, so I am leaning towards something from AWS, Google Cloud, etc.
I also would like to be able to see a dashboard of the progress of each job with logs, so I can see which dates have completed and which dates had a bug (plus what the bug is)
As you said, with Google Cloud Functions you can configure the timeout for up to 9 minutes during the deployment.
Solutions other than Dataflow that allow higher timeouts:
App engine Flex
Another GCP product that allows higher timeouts (up to 60 minutes) is the App Engine flexible environment (link).
Cloud Tasks
Cloud Tasks is similar, but asynchronous, with timeouts of up to 30 minutes. It is a task queue: you put a task in the queue and the call returns quickly. Then, the worker (or workers) of the queue will process the tasks one by one.
The usual output of Cloud Tasks is to send emails or to save the results into Cloud Storage (link).
With this solution, you can add a task for each file/filename to process, and each of these tasks gets the 30-minute timeout; a sketch of enqueuing such a task follows.
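A hedged Node.js sketch of enqueuing one such task with the @google-cloud/tasks client; the project ID, queue name, and worker URL are placeholders, not values from the question.
const { CloudTasksClient } = require('@google-cloud/tasks');
const client = new CloudTasksClient();

async function enqueueParseTask(date) {
  const parent = client.queuePath('my-project', 'us-central1', 'parse-queue');
  const [task] = await client.createTask({
    parent,
    task: {
      httpRequest: {
        httpMethod: 'POST',
        url: 'https://my-worker.example.com/parse', // the worker that does the processing
        headers: { 'Content-Type': 'application/json' },
        // HTTP task bodies must be base64-encoded.
        body: Buffer.from(JSON.stringify({ date })).toString('base64'),
      },
      // Give the worker up to the 30-minute maximum to respond.
      dispatchDeadline: { seconds: 1800 },
    },
  });
  return task.name;
}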
Long-running durations are planned on the Cloud Run roadmap, but we don't have a date for now.
Today, the best recommended way is to use App Engine in combination with Task Queues. With push queues, you can run processes up to 24 hours long when you deploy in manual scaling mode. But be careful: manual scaling doesn't scale to 0!
If you prefer containers, I know 2 "strange" workarounds on GCP:
Use Cloud Build. Cloud Build allows you to run a custom builder in a container. Do whatever you want in this container, even if it's not building anything. Remember to set the correct timeout for your processing step. You have 120 minutes per day FREE with Cloud Build (shared across the entire organisation; it's not a free tier per project!). You can run up to 10 build jobs in parallel.
Use AI Platform Training. Similarly to Cloud Build, AI Platform Training allows you to run a custom container to perform processing; it was originally intended for training jobs, but it's a container, so you can run whatever you want in it. No free tier here. You are limited to 20 CPUs in parallel, but you can ask to raise the limit up to 450 concurrent vCPUs.
Sadly, neither is as easy to use as a Function or Cloud Run: you don't get an HTTP endpoint that you can simply call with the date you want and enjoy. But you can wrap either one in a function which performs the API calls to Cloud Build or AI Platform Training.
Is there any provider that offers such an option out of the box? I need to run at least 1K concurrent sessions (Docker containers) of headless web browsers (Firefox) for complex UI tests. I have a Docker image that I just want to deploy and scale to 1,000 1-CPU/1-GB instances in seconds, without spending time on maintaining a cluster of servers (I need to shut them all down after the job is done); I just want to focus on the code. The closest thing I've found so far is Amazon ECS/Fargate, but its limits make no sense to me ("Run containerized applications in production" -> max limit: 50 tasks -> production -> ok). Am I missing something?
I think that AWS Batch might be a better solution for your use case. You define a "compute environment" that provides a certain level of capacity, then submit tasks that are run on that compute environment.
I don't think that you'll find anything that can start up an environment and deploy a large number of tasks in "one second": in my experience it takes about a minute or two ramp-up time for Batch, although once the machines are up and running they are able to sequence jobs quickly. You should also give consideration to whether it makes sense to run all 1,000 jobs concurrently; that will depend on what you're trying to get out of your tests.
You'll also need to be aware of any places where you might be throttled (for example, retrieving configuration from the AWS Parameter Store). This talk from last year's NY Summit covers some of the issues that the speaker ran into when deploying multiple-thousands of concurrent tasks.
You could use lambda layers to run headless browsers (I know there are several implementations for chromium/selenium on github, not sure about firefox).
Alternatively, you could try to contact the AWS team to see how much the limit for concurrent tasks on Fargate can be increased. As you can see in the documentation, the 50-task limit is a soft limit and can be raised.
Be aware that if you start tasks via Fargate, there is an API limit on requests per second. You need to make sure you throttle your API calls, or use ECS CreateService.
In any case, starting 1000 tasks would require 1000 seconds, which is probably not what you expect.
Those limits are not there if you use ECS, but in that case you need to manage the cluster, so it might be a good idea to explore the lambda option.
I have an application on an AWS EC2 instance that runs once daily. The application fetches some files from a web service, parses the files line by line, updates a database, updates S3 files based on changes in the database, sends notification emails to customers as well as a few other tasks.
This is a series of logical tasks that must take place in sequence, although some of the tasks can be thought of as sub-tasks that can be executed in parallel. All tasks are a combination of Perl scripts and Java programs, with a single Perl script acting as the manager that executes each in turn. Some tasks can take as long as 45 minutes to complete, and the whole process can take up to 3 hours in total.
I'd like to make this whole process serverless. My initial idea was to use AWS Lambda, whereby each task would execute as a Lambda function, until I discovered Lambda functions impose a 5 minute execution timeout. It seems like the AWS Step Functions service is actually a better fit for my use case, but my understanding is that this service is backed by Lambda, so the tasks will still have the 5 min execution limitation.
(I'm also aware that I would have to re-write my Perl scripts to a language supported by Lambda).
I assume that I can work around the execution time limit by refactoring my code into smaller functions that will guarantee to complete in under 5 minutes. In my particular situation though, this seems inefficient.
Currently the database update task processes lines from a file one at a time. For this to work with Lambda, a Lambda function would need to handle only a single line from the file (or a very small number of lines) in order to guarantee not spilling over 5 minutes execution time. This would involve opening and closing a connection with the database on every invocation of the Lambda function. Also, each line processed should result in an entry written to a file, to be stored in S3. Right now, I just keep a file handle in memory and write the file to S3 when all lines are processed, but with Lambda I would need to keep reading the file, updating it and writing it back to S3.
What I'm asking is:
Is my use case a bad fit for AWS Lambda and/or AWS Step Functions?
Have I misunderstood how these services work?
Is there another AWS service that would be a better fit for my use case?
After further research, I think AWS Batch might be a good idea.
What you want are called Activity Workers. Tl;dr: You register "activities" and each gets an ARN. Then you can put that ARN in the resource field of Task states and then you run some code (the "worker") somewhere (in a Lambda, on EC2, in your basement, wherever) that polls for tasks identified by that ARN, then calls back to report success or failure. Activity Workers can run for up to a year.
Step-by-step details at the AWS docs
In response to RTF's comment, here's a deeper dive: Suppose you have code to color turtles in color_turtles.pl. What you do is call the CreateActivity API - see http://docs.aws.amazon.com/step-functions/latest/apireference/API_CreateActivity.html - giving the name "ColorTurtles", and it'll give you back an ARN, a string beginning arn:aws... Then in your state machine you make a Task state with that ARN as the value of the resource field. Then you add code to color_turtles.pl to poll the service with http://docs.aws.amazon.com/step-functions/latest/apireference/API_GetActivityTask.html - whenever a machine you're running gets to that task, it'll look for polling activity workers. It'll give your polling worker the input for the task; then you process the input, generate some output, and call SendTaskSuccess or SendTaskFailure. All of these are just REST HTTP calls, so you can run them anywhere, and I mean anywhere: in a Lambda, on an EC2 instance, or on some computer anywhere on the Internet.
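To make the polling loop concrete, here is a hedged Node.js sketch using the AWS SDK for JavaScript (v2); the activity ARN is a placeholder, and colorTurtles stands in for your task logic.
const AWS = require('aws-sdk');
const stepfunctions = new AWS.StepFunctions({ region: 'us-east-1' });
const activityArn = 'arn:aws:states:us-east-1:123456789012:activity:ColorTurtles';

async function colorTurtles(input) {
  // ... your actual task logic goes here ...
  return { colored: true };
}

async function pollForever() {
  while (true) {
    // Long-polls for up to about a minute; taskToken is empty if there's no work.
    const task = await stepfunctions.getActivityTask({ activityArn }).promise();
    if (!task.taskToken) continue;
    try {
      const output = await colorTurtles(JSON.parse(task.input));
      await stepfunctions.sendTaskSuccess({
        taskToken: task.taskToken,
        output: JSON.stringify(output),
      }).promise();
    } catch (err) {
      await stepfunctions.sendTaskFailure({
        taskToken: task.taskToken,
        error: 'TaskFailed',
        cause: String(err),
      }).promise();
    }
  }
}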
So to answer your questions:
1) Yeah, if you've got something that'll run for around 45 minutes, whilst you could engineer it with Lambda/Step Functions, you're probably better off getting an EC2 micro instance.
2) Nope, you've pretty much got it.
3) As above, you want to go with EC2 for this. There's a good article on using Data Pipeline to start/stop an EC2 instance here; that way, by starting the instance only when you need it, the cost (if any) is negligible.
I have jobs that run in this fashion; normally you can get away with a t2.micro instance, which is free-tier eligible.
You can also run your Perl scripts on an EC2 instance, so there's no need to rewrite them!
I'll start by saying that it seems you are looking for workflow solutions on AWS. SWF and Step Functions are the two most popular ones. Step Functions is the more recent offering and is encouraged by AWS over SWF.
SWF has native capability to handle long-running tasks, the downside is that you have to provide your own execution environment for deciders (can't use lambda).
With Step Functions, you can do this in two different ways. One approach is suggested by Tim in his answer. An alternative way to achieve the same thing is to use a job poller in Step Functions: the job poller calls (polls) your resource to find out whether the task is done, and if not, sends the execution into a wait state for a specified time. As mentioned above, the maximum execution time currently allowed for any workflow is 1 year. In case you have tasks which may take longer than 1 year, you can't use Step Functions in its current form.
We're trying to move to AWS and to use DynamoDB. It'd be nice to keep everything under DynamoDB so there aren't extraneous types of databases, but aside from half-complete research projects I'm not really finding anything to use as a scheduler. There are going to be dynamically set schedules numbering in the thousands or more, possibly with many running at the same time. For languages, Java or at least the JVM would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler, I'm thinking of something general-purpose like Quartz: I want to set a cron expression and have it run the code I give it at that time. This isn't for doing some AWS task; this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried about what happens when the VM is down. Data Pipeline seems a bit too much. I've been looking into making a DynamoDB job store for Quartz; consistent reads might get around the transaction and consistency issues, but I'm hesitant. I might be biting off a lot, with many hard-to-notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option for you is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, recently, AWS launched the ability to call the Lambda function directly -- e.g. you could have an external timer or other code that triggers the function on a specific schedule.
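For the DynamoDB-triggered option, a hedged sketch of a Node.js Lambda handler attached to a DynamoDB Stream; the reaction logic is a placeholder.
exports.handler = async (event) => {
  for (const record of event.Records) {
    // eventName is INSERT, MODIFY, or REMOVE.
    console.log(record.eventName, JSON.stringify(record.dynamodb.Keys));
    if (record.eventName === 'INSERT') {
      // ... react to the newly written item here ...
    }
  }
  return `Processed ${event.Records.length} records`;
};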
Lastly, this SO thread may have other options for you to consider.
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression.
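A hedged sketch of wiring up such a schedule with the AWS SDK for JavaScript (v2); the rule name, function ARN, and region are placeholders, and the Lambda also needs a resource-based permission allowing events.amazonaws.com to invoke it.
const AWS = require('aws-sdk');
const events = new AWS.CloudWatchEvents({ region: 'us-east-1' });

async function scheduleDaily() {
  // Both fixed rates, e.g. 'rate(5 minutes)', and cron-like expressions work.
  await events.putRule({
    Name: 'daily-report',
    ScheduleExpression: 'cron(0 12 * * ? *)', // every day at 12:00 UTC
  }).promise();
  await events.putTargets({
    Rule: 'daily-report',
    Targets: [{
      Id: 'report-lambda',
      Arn: 'arn:aws:lambda:us-east-1:123456789012:function:report',
    }],
  }).promise();
}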