Serverless python requests with long timeouts? - google-cloud-platform

I have several Python scripts that follow a similar format: you pass in a date, and the script either:
- checks my S3 bucket for the file with that date in the filename, and parses it, or
- runs some analysis on the file for that date (which takes over 1 hour to run).
I am looking for a serverless solution that would let me call these functions on a range of dates and run them all in parallel. Because of the long duration of my Python script, services like AWS Lambda and Google Cloud Functions don't work because of their timeouts (15 minutes and 9 minutes respectively). I have looked at Google Cloud Dataflow, but am not sure whether it is overkill for my relatively simple use case.
Minimal downtime is important, so I am leaning towards something from a major provider like AWS or Google Cloud.
I would also like a dashboard showing the progress of each job, with logs, so I can see which dates have completed and which dates hit a bug (plus what the bug is).

As you said, with Google Cloud Functions you can configure the timeout up to 9 minutes at deployment time.
Solutions other than Dataflow that allow higher timeouts:
App Engine Flex
Another GCP product that allows higher timeouts (up to 60 minutes) is the App Engine flexible environment.
Cloud Tasks
Cloud Tasks is similar, but asynchronous, with timeouts of up to 30 minutes. It is a task queue: you put a task in the queue and the call returns quickly; then the worker (or workers) of the queue processes the tasks one by one.
The usual output of a Cloud Task is to send an email or to save the results into Cloud Storage.
With this solution, you can add a task for each file/filename to process, and each of these tasks gets the 30-minute timeout.
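As a rough sketch, enqueueing one task per date with the Python client could look like this (the project, location, queue name, and worker URL are all placeholders; the queue is assumed to already exist):

import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Placeholder project/location/queue; create the queue beforehand with gcloud.
parent = client.queue_path("my-project", "us-central1", "date-jobs")

def enqueue(date_str):
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://my-worker.example.com/process",  # placeholder worker endpoint
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"date": date_str}).encode(),
        }
    }
    return client.create_task(request={"parent": parent, "task": task})

for date_str in ["2020-01-01", "2020-01-02", "2020-01-03"]:
    enqueue(date_str)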

Longer running durations are planned on the Cloud Run roadmap, but there is no date for now.
Today, the recommended way is to use App Engine together with Task Queue. With a push queue, you can run a process for up to 24 hours when you deploy in manual scaling mode. But be careful: manual scaling doesn't scale to 0!
If you prefer containers, I know two "strange" workarounds on GCP:
Use Cloud Build. Cloud Build lets you run a custom builder in a container. Do whatever you want in this container, even if it's not building anything; just remember to set a timeout long enough for your processing step. You get 120 minutes per day free with Cloud Build (shared across the entire organisation; it's not a free tier per project!). You can run up to 10 build jobs in parallel.
Use AI Platform Training. Similarly to Cloud Build, AI Platform Training allows you to run a custom container to perform processing, even though it was initially designed for training. It's just a container, so you can run whatever you want in it. There is no free tier here. You are limited to 20 concurrent CPUs, but you can request an increase of the limit up to 450 concurrent vCPUs.
Sadly, neither is as easy to use as a Cloud Function or Cloud Run: there is no HTTP endpoint that you simply call with the date you want. But you can wrap them in a function that performs the API call to Cloud Build or AI Platform Training.
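For instance, a minimal sketch of kicking off such a Cloud Build job from Python via the REST API (the project ID and image name are placeholders; this assumes your processing code is baked into the container image):

from googleapiclient import discovery

def start_build(date_str):
    cloudbuild = discovery.build("cloudbuild", "v1")
    build_body = {
        # Placeholder image containing your processing script.
        "steps": [{
            "name": "gcr.io/my-project/date-processor",
            "args": ["--date", date_str],
        }],
        "timeout": "7200s",  # allow up to 2 hours for the processing step
    }
    return cloudbuild.projects().builds().create(
        projectId="my-project", body=build_body
    ).execute()

A thin Cloud Function could call start_build once per date to fan out the work.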

Related

Cloud computing service to run thousands of containers in parallel

Is there any provider that offers such an option out of the box? I need to run at least 1K concurrent sessions (Docker containers) of headless web browsers (Firefox) for complex UI tests. I have a Docker image that I just want to deploy and scale to 1,000 1CPU/1GB instances in seconds, without spending time on maintaining a cluster of servers (I need to shut them all down after the job is done), so I can just focus on the code. The closest thing I've found so far is Amazon ECS/Fargate, but its limits make no sense to me ("Run containerized applications in production" -> max limit: 50 tasks -> production -> ok?). Am I missing something?
I think that AWS Batch might be a better solution for your use case. You define a "compute environment" that provides a certain level of capacity, then submit tasks that are run on that compute environment.
I don't think you'll find anything that can start up an environment and deploy a large number of tasks in "one second": in my experience it takes a minute or two of ramp-up time for Batch, although once the machines are up and running they can sequence jobs quickly. You should also consider whether it makes sense to run all 1,000 jobs concurrently; that will depend on what you're trying to get out of your tests.
You'll also need to be aware of any places where you might be throttled (for example, retrieving configuration from the AWS Parameter Store). This talk from last year's NY Summit covers some of the issues the speaker ran into when deploying many thousands of concurrent tasks.
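As a rough sketch of submitting such jobs to Batch with boto3 (the queue and job definition names are placeholders you would create when setting up the compute environment):

import boto3

batch = boto3.client("batch")

# Submit 1,000 independent test jobs; Batch schedules them onto the
# compute environment as capacity becomes available.
for i in range(1000):
    batch.submit_job(
        jobName=f"ui-test-{i}",
        jobQueue="ui-test-queue",            # placeholder queue name
        jobDefinition="headless-firefox:1",  # placeholder job definition
    )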
You could use lambda layers to run headless browsers (I know there are several implementations for chromium/selenium on github, not sure about firefox).
Alternatively, you could contact the AWS team to see how much the limit for concurrent tasks on Fargate can be increased. As you can see in the documentation, the 50-task limit is a soft limit and can be raised.
Be aware that if you start tasks via Fargate, there is an API rate limit on RunTask requests. You need to make sure you throttle your API calls, or use the ECS CreateService API instead.
At roughly one RunTask call per second, starting 1,000 tasks one at a time would take on the order of 1,000 seconds, which is probably not what you expect.
Those limits are not there if you use ECS on your own cluster, but in that case you need to manage the cluster, so it might be a good idea to explore the Lambda option.
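A minimal sketch of pacing Fargate launches with boto3 (the cluster, task definition, and subnet are placeholders; RunTask accepts at most 10 tasks per call):

import time
import boto3

ecs = boto3.client("ecs")

# Launch 1,000 tasks in batches of 10, pausing between calls to stay
# under the RunTask request rate limit.
for _ in range(100):
    ecs.run_task(
        cluster="ui-tests",                   # placeholder cluster
        launchType="FARGATE",
        taskDefinition="headless-firefox:1",  # placeholder task definition
        count=10,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
    )
    time.sleep(1)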

Optimizing apache beam / cloud dataflow startup

I have done a few tests with apache-beam using both auto-scale workers and 1 worker, and each time I see a startup time of around 2 minutes. Is it possible to reduce that time, and if so, what are the suggested best practices for reducing the startup time?
IMHO, two minutes is very fast for a product like Cloud Dataflow. Remember, Google is launching a powerful, autoscaling Big Data service for you.
Compare that time to the other cloud vendors: I have seen some (Hadoop) clusters take 15 minutes to come up. In any event, you do not control the initialization process for Dataflow, so there is nothing for you to improve.

How can I keep Google Cloud Functions warm?

I know this may miss the point of using Cloud Functions in the first place, but in my specific case, I'm using Cloud Functions because it's the only way I can bridge Next.js with Firebase Hosting. I don't need to make it cost efficient, etc.
With that said, the cold boot times for Cloud Functions are simply unbearable and not production-ready, averaging around 10 to 15 seconds for my boilerplate.
I've watched this video by Google (https://www.youtube.com/watch?v=IOXrwFqR6kY) that talks about how to reduce cold boot time. In a nutshell: 1) Trim dependencies, 2) Trial & error for dependencies' versions for cache on Google's network, 3) Lazy loading.
But 1) there are only so many dependencies I can trim. 2) How would I know which version is more cached? 3) There are only so many dependencies I can lazy load.
Another way is to avoid the cold boot altogether. What's a good way or hack to essentially keep my (one and only) cloud function warm?
With all "serverless" compute providers, there is always going to be some form of cold start cost that you can't eliminate. Even if you are able to keep a single instance alive by pinging it, the system may spin up any number of other instances to handle current load. Those new instances will have a cold start cost. Then, when load decreases, the unnecessary instances will be shut down.
There are ways to minimize your cold start costs, as you have discovered, but the costs can't be eliminated.
As of Sept 2021, you can now specify a minimum number of instances to keep active. This can help reduce (but not eliminate) cold starts. Read the Google Cloud blog and the documentation. For Firebase, read its documentation. Note that setting min instances incurs extra billing - keeping computing resources active is not a free service.
If you absolutely demand hot servers to handle requests 24/7, then you need to manage your own servers that run 24/7 (and pay the cost of those servers running 24/7). As you can see, the benefit of serverless is that you don't manage or scale your own servers, and you only pay for what you use, but you have unpredictable cold start costs associated with your project. That's the tradeoff.
You're not the first to ask ;-)
The answer is to configure a remote service to periodically call your function so that the single (and only) instance remains alive.
It's unclear from your question but I assume your Function provides an HTTP endpoint. In that case, find a healthcheck or cron service that can be configured to make an HTTP call every x seconds|minutes and point it at your Function.
You may have to juggle the timings to find the Goldilocks period (not so often that you're wasting effort, not so infrequently that the instance dies), but this is what others have done.
You can now specify a minimum number of instances (--min-instances) to keep instances running at all times.
Cloud Functions Doc: https://cloud.google.com/functions/docs/configuring/min-instances
Cloud Functions example from the docs:
gcloud beta functions deploy myFunction --min-instances 5
It's also available in Firebase Functions by specifying minInstances:
Firebase Functions Docs: https://firebase.google.com/docs/functions/manage-functions#min-max-instances
Frank announcing it on Twitter: https://twitter.com/puf/status/1433431768963633152
Firebase Function example from the docs:
exports.getAutocompleteResponse = functions
  .runWith({
    // Keep 5 instances warm for this latency-critical function
    minInstances: 5,
  })
  .https.onCall((data, context) => {
    // Autocomplete a user's search term
  });
You can trigger it via a cron job as explained here: https://cloud.google.com/scheduler/docs/creating
Using Cloud Scheduler is a wise solution, but the actual implementation is not so straightforward. Please check my article for details. Examples of functions:
myHttpFunction: functions.https.onRequest((request, response) => {
  // Check for the warmup parameter.
  // Use request.query.warmup if the warmup request is a GET.
  // Use request.body.warmup if the warmup request is a POST.
  if (request.query.warmup || request.body.warmup) {
    return response.status(200).type('application/json').send({status: "success", message: "OK"});
  }
});

myOnCallFunction: functions.https.onCall((data, context) => {
  // Check for the warmup parameter.
  if (data.warmup) {
    return {"success": true};
  }
});
Examples of gcloud CLI commands:
gcloud --project="my-awesome-project" scheduler jobs create http warmupMyOnCallFunction \
  --time-zone="America/Los_Angeles" \
  --schedule="*/5 5-23 * * *" \
  --uri="https://us-central1-my-awesome-project.cloudfunctions.net/myOnCallFunction" \
  --description="my warmup job" \
  --headers="Content-Type=application/json" \
  --http-method="POST" \
  --message-body="{\"data\":{\"warmup\":\"true\"}}"

gcloud --project="my-awesome-project" scheduler jobs create http warmupMyHttpFunction \
  --time-zone="America/Los_Angeles" \
  --schedule="*/5 5-23 * * *" \
  --uri="https://us-central1-my-awesome-project.cloudfunctions.net/myHttpFunction?warmup=true" \
  --description="my warmup job" \
  --headers="Content-Type=application/json" \
  --http-method="GET"
Cloud functions are generally best suited to performing just one (small) task. More often than not, I come across people who want to do everything inside one cloud function. To be honest, this is also how I started developing cloud functions.
With this in mind, you should keep your cloud function code clean and small to perform just one task. Normally this would be a background task, a file or record that needs to be written somewhere, or a check that has to be performed. In this scenario, it doesn't really matter if there is a cold start penalty.
But nowadays, people, including myself, rely on cloud functions as a backend for API Gateway or Cloud Endpoints. In this scenario, the user goes to a website and the website sends a backend request to the cloud function to get some additional information. Now the cloud function acts as an API and a user is waiting for it.
(The original answer includes screenshots comparing the response latency of a typical cold cloud function with a typical warm one.)
There are several ways to cope with a cold-start problem:
Reduce dependencies and the amount of code. As I said before, cloud functions are best suited to performing single tasks. This reduces the overall package size that has to be loaded onto a server between receiving a request and executing the code, speeding things up significantly.
Another, more hacky way is to schedule a Cloud Scheduler job to periodically send a warmup request to your cloud function. GCP has a generous free tier which allows for 3 schedulers and 2 million cloud function invocations (depending on resource usage). So, depending on the number of cloud functions, you could easily schedule an HTTP request every few minutes. For the sake of clarity, I placed a snippet below this post that deploys a cloud function and a scheduler that sends warmup requests.
If you think you have tweaked the cold-start problem, you could also take measures to speed up the actual runtime:
I switched from Python to Golang, which gave me a double-digit performance increase in terms of actual runtime. Golang speed is comparable to Java or C++.
Declare variables, especially GCP clients like Storage, Pub/Sub etc., at the global level (source). This way, future invocations of your cloud function will reuse those objects; see the sketch after this list.
If you are performing multiple independent actions within a cloud function, you could make them asynchronous.
And again, clean code and fewer dependencies also improve the runtime.
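To illustrate the global-client point, a minimal Python sketch (the bucket and object names are placeholders):

from google.cloud import storage

# Created at import time, outside the handler, so warm invocations
# reuse the client and its underlying connections.
storage_client = storage.Client()

def handler(request):
    # Placeholder bucket/object; only the handler body runs per request.
    blob = storage_client.bucket("my-bucket").blob("reports/latest.txt")
    return blob.download_as_text()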
Snippet (the warmup function and scheduler mentioned above):
# Deploy function
gcloud functions deploy warm-function \
--runtime=go113 \
--entry-point=Function \
--trigger-http \
--project=${PROJECT_ID} \
--region=europe-west1 \
--timeout=5s \
--memory=128MB
# Set IAM bindings
gcloud functions add-iam-policy-binding warm-function \
--region=europe-west1 \
--member=serviceAccount:${PROJECT_ID}@appspot.gserviceaccount.com \
--role=roles/cloudfunctions.invoker
# Create scheduler
gcloud scheduler jobs create http warmup-job \
--schedule='*/5 * * * *' \
--uri="https://europe-west1-${PROJECT_ID}.cloudfunctions.net/warm-function" \
--project=${PROJECT_ID} \
--http-method=OPTIONS \
--oidc-service-account-email=${PROJECT_ID}@appspot.gserviceaccount.com \
--oidc-token-audience=https://europe-west1-${PROJECT_ID}.cloudfunctions.net/warm-function
Google has just announced the ability to set min-instances for your Cloud Function deploys. This allows you to set the lower bound for scaling your functions down and minimises cold starts (they don't promise to eliminate them).
There is a small cost for keeping warm instances around (idle time), though at the time of writing that seems undocumented on the Cloud Functions pricing page. They say:
"If you set a minimum number of function instances, you are also billed for the time these instances are not active. This is called idle time and is priced at a different rate."
There is no single solution for keeping cold starts to a minimum; it is a mixture of multiple techniques. The question is really how to make our lambdas so fast that we don't care much about cold starts; I am talking about a startup time in the range of 100-500 ms.
How do you make your lambda faster?
Keep your package size as small as possible (remove big libraries of which only a fraction is used); keep the package size to 20 MB max. On every cold start this package is fetched and decompressed.
Try to initialise, at application start, only the pieces you need.
Node.js example: https://gist.github.com/Rich-Harris/41e8ccc755ea232a5e7b88dee118bcf5
If you use a JVM technology for your services, try migrating them to GraalVM, where the boot-up overhead is reduced to a minimum:
micronaut + graalvm
quarkus + graalvm
helidon + graalvm
Use cloud infrastructure configs to reduce the cold starts.
In 2020, cold starts are not as painful as they were a few years ago. I can speak mostly about AWS, but I am sure all of the above works well for any cloud provider.
At the end of 2019, AWS introduced Lambda provisioned concurrency (https://aws.amazon.com/about-aws/whats-new/2019/12/aws-lambda-announces-provisioned-concurrency/), so you don't have to care so much about warming anymore.
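For example, a sketch of enabling provisioned concurrency with boto3 (the function name and version are placeholders; it must target a published version or alias, not $LATEST):

import boto3

lambda_client = boto3.client("lambda")

# Keep 5 execution environments initialised for version "1" of the function.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="my-function",  # placeholder function name
    Qualifier="1",               # published version or alias
    ProvisionedConcurrentExecutions=5,
)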

Simple task queue using Google Cloud Platform : issue with Google PubSub

My task: I cannot speak openly about the specifics of my task, but here is an analogy: every two hours, I get a variable number of spoken audio files. Sometimes only 10, sometimes 800 or more. Let's say I have a costly Python task to perform on these files, for example automatic speech recognition. I have a Google managed instance group that can deploy any number of VMs to execute this task.
The issue: right now, I'm using Google Pub/Sub. Every two hours, a topic is filled with audio IDs. Instances of the managed group can be deployed depending on the size of the queue. The problem is that only one worker gets all the messages from the Pub/Sub subscription while the others receive none, perhaps because the queue is not that long (at most ~1000 messages). This issue is reported in a few cases on the Python Google Cloud GitHub, and it is not clear whether this is the intended behaviour of Pub/Sub or just a bug.
How could I implement the equivalent of a simple serverless task queue in Python on Google Cloud, one that can spawn instances based on a given metric, for example the size of the queue? Is this the intended purpose of Pub/Sub?
Thanks in advance.
In App Engine you can create push queues and set rate/concurrency limits and Google will handle the rest for you. App Engine will scale as needed (e.g. increase Python instances).
If you're outside of App Engine (e.g. GKE), the Pub/Sub Python client library may be pulling many messages at once. We had a hard time controlling this (for google-cloud-pubsub==0.34.0) as well, so we ended up writing a small adjustment on top of google-cloud-pubsub, calling SubscriberClient.pull with max_messages set. The server-side Pub/Sub API does adhere to max_messages.
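A sketch of that pattern with a current client version (the project and subscription names are placeholders, and handle_audio is a hypothetical processing function; newer releases use the request-dict call style shown here):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project/subscription.
subscription_path = subscriber.subscription_path("my-project", "audio-ids")

while True:
    # Ask for at most one message so each worker only takes what it
    # can process, leaving the rest for other workers.
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 1}
    )
    if not response.received_messages:
        break
    for received in response.received_messages:
        handle_audio(received.message.data)  # hypothetical processing function
        subscriber.acknowledge(
            request={"subscription": subscription_path,
                     "ack_ids": [received.ack_id]}
        )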

Are there any Schedulers for AWS/DynamoDB?

We're trying to move to AWS and to use DynamoDB. It'd be nice to keep everything under DynamoDB so there aren't extraneous types of databases, but aside from half-complete research projects I'm not really finding anything to use as a scheduler. There are going to be dynamically set schedules in the thousands or more, possibly with many running at the same time. For languages, Java, or at least the JVM, would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler, I'm thinking of something general-purpose like Quartz: I want to set a cron and have it run the code I give it at that time. This isn't some AWS task; this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried about what happens when the VM is down. Data Pipeline seems a bit much. I've been looking into making a DynamoDB job store for Quartz; consistent reads might get around the transaction and consistency issues, but I'm hesitant; I might be biting off a lot, with a lot of hard-to-notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, AWS recently launched the ability to invoke a Lambda function directly, e.g. you could have an external timer or other code that triggers the function on a specific schedule.
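A minimal sketch of such a stream-triggered handler in Python (the record shape follows the documented DynamoDB Streams event format; schedule_job is a hypothetical placeholder for your own logic):

# Lambda handler wired to a DynamoDB stream via an event source mapping.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]
            # Hypothetical: act on the newly written schedule item.
            schedule_job(new_image)

def schedule_job(item):
    print("scheduling", item)  # placeholder for real scheduling logic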
Lastly, this SO thread may have other options for you to consider.
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression.
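For illustration, a sketch of wiring such a schedule to a function with boto3 via CloudWatch Events (the rule name, function name, and ARNs are placeholders):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder rule: fire every day at 09:00 UTC using a cron expression.
rule = events.put_rule(
    Name="daily-job",
    ScheduleExpression="cron(0 9 * * ? *)",
)

# Placeholder function ARN; point the rule at the Lambda function.
events.put_targets(
    Rule="daily-job",
    Targets=[{
        "Id": "1",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:myJob",
    }],
)

# Grant CloudWatch Events permission to invoke the function.
lambda_client.add_permission(
    FunctionName="myJob",
    StatementId="allow-scheduled-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)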