Suppose we have a web-service written in python, that does some time-consuming file processing. It definitely should not be run inside the HTTP handler as it takes up to 10 mins to complete. Instead, the processing should be done asynchronously by some sort of workers, and it would be also nice to report the progress of the task execution to display to the user.
Would it be a good idea to setup Google Cloud Tasks with some Cloud Run or Cloud Functions service as a HTTP target to do this work?
Is Google Cloud Tasks suitable for handling this type of async tasks, where the user is sitting and waiting for result?
If not, is there any other options to achieve this with Google Cloud? (or should I use custom task services for this purpose, for instance, celery and redis) It also seems that Cloud Run Jobs features somewhat similar functionality, but there are not any queue systems to manage workers.
GC tasks is simply a tool for queuing tasks. It is a useful tool for ensuring that tasks do run, as it has built in retry mechanisms. How you use that in the context of an application depends on a lot of other detail of the application itself, but it is definitely possible to use it for background/asynchronous processing of tasks.
We use Google Cloud tasks to implement long running processes that report their progress via data store records. Some of these processes run longer than the standard 10 minute timeout, and trigger a new cloud task to complete the processing. We then have a simple lightweight handler that retrieves the status record from data store and reports that to the user. We poll that handler from the client, but you could also implement something like websockets.
GCP can handle Asynchronous tasks, Asynchronous execution is a well-established way to reduce request latency and make your application more responsive.
We can use cloud run or cloud functions for this type of tasks , Because we can increase the time limit upto 30 min in http task handlers in GCP cloud tasks.
For more information refer to this document.
We use Google Cloud tasks to implement long running processes that report their progress via data store records. Some of these processes run longer than the standard 10 minute timeout, and trigger a new cloud task to complete the processing.
Related
In my personal case, Pub/Sub's pushes to a Python service on Cloud Functions are being unfeasible due to it's short timeout. So the idea of having a container-based managed instance group of Compute Engine instances sounds good, these instances can scale up/down based on Pub/Sub pending task count metrics. These machines' containers would run Python code on startup, the given code would PULL Pub/Sub and process the pulled job accordingly.
Contextualization aside, the question is: Is it a good idea? Are there any gotchas? As there would be several machines at scale, how could I guarantee that a same given 'queued task' would not be picked and have it's processing started on more than one of these machines? I know about ACKs, but ACKs should just be emitted when the task ends successfully, isn't it? What strategy to use to prevent the initially mentioned and other problems?
I am using Camunda workflows to automate various processes. I have come across a scenario where the process is not moving from a service task. Usually, we call the task/{taskid}/complete to complete the task, but since the process is stuck on a service task, I am not able to complete that task. Can anybody help me find a way to complete the service task?
You are using a service task. That basically means "a machine should do something". The "normal" implementation is to provide code (a java Delegate or a connector endpoint) that is called by the process engine to execute this task.
The alternativ is to use the "external task" pattern. Think of external tasks as "user tasks for computers". So the process waits, tells subscribed clients that a job is to be done and waits for their completion.
I suppose your process uses the second option? (you can check in the modeler under "Implementation"). So completion can be done through the external task API, see docs.
/external-task/{id}/complete
If it is a connector then you likely will see when checking the log that retries have occurred and that the transaction rolled back. After addressing the underlying issue the service task (email) should be sent without explicitly triggering the service task and the following user task (Approval) should be created.
I'm new to using Cloud Run and the idea of scaling down to zero is very appealing to me, but I have question about a few scenarios about its usage:
If I have a Cloud Run instance querying an external API endpoint, would the instance winds down while waiting for the response if no additional requests come in (i.e. I set the query time out to 60min, and no requests are received in that 60 min)?
If the Cloud Run instance is running computation that lasts for longer than 24 hour, or perhaps even days, without receiving requests, could it be trusted to carry out the computation until it's done without being randomly shutdown or restarted for servicing or other purposes (I ask this because Cloud Run is primarily intended as for stateless applications, but I have infrequent computation jobs that may take a long time that may be considered "stateful" in short-term context).
Does CPU utilization impact auto-scaling (e.g. if I have a computationally intensive job not configured for distributed computing running on one instance, would this trigger Cloud Run to spawn additional instances?)
If you deep dive in the documentation, I'm quite sure that you can find your answers. So, here a summary
(Interesting read).The Cloud Run instances are shut down only when they aren't in used (usually 15 minutes (can change at any time, no commitment, only observations) without request handling). In your case, if you are in a request handling context, no worries, your instance won't be killed, it is in use! Note: don't send an HTTP response before the end of the processing. Background process/jobs aren't considered in a request context. The context is considered from the receipt of the request to the response (OK or KO) back. Partial response/streaming is accepted.
Cloud run instance can, potentially, live more than 24h, but nothing is guaranteed. And, because the request handling is limited to 1h, you can't run process longer that that. I recommend you to have a look to GKE autopilot or to run a container on a Compute Engine and stop the VM at the end of the processing to save resources and money (or a hack to run your container on AI PLatform custom training; even if you train nothing, you run a custom container on a serverless platform!). If you can, I recommend you to design your workload to be split in several small and parallelizable jobs
Yes, it's described here. But keep in mind that only 1 request is processed on one instance. If you send a request that trigger an intensive compute job, the request will be only processed on the same instance (that can have several CPUs if your workload is compliant with that). And if another request comes in during the intensive processing, another Cloud Run instance will be spawn to handle it; only the new request.
I have couple of Python scripts which I would like to schedule to run once a month on Google cloud. The scripts basically trigger DLP jobs, extract data catalog information to a file in GCS. These batch workloads would hardly run for 30 mins. And so I don't need to use services like GKE, composer etc which are very resource intensive.
For these batch workloads I would like to know the best options available in GCP. Looking at some of the blog posts I found below article to use Cloud Scheduler-> Pub/Sub-> Cloud Functions -> Create VM (using a startup script).
https://medium.com/google-cloud/running-a-serverless-batch-workload-on-gcp-with-cloud-scheduler-cloud-functions-and-compute-86c2bd573f25
I have below questions with above design..
1) How long does the Cloud Function run as it starts the VM? I know cloud function has a timeout of 9mins ..what happens if the VM takes longer than 9mins to process the startup script?
Any other design ideas are much appreciated.
Thanks
I'm the author of that medium post.
1) How long does the Cloud Function run as it starts the VM?
You can change the Cloud Function code to not wait for the response, It's using NodeJS so you just don't have to wait for the Promise.
Also in that solution the Cloud Function job is only to trigger the VM creation.
.createVM(vmName, vmConfig)
.then(data => {
// Operation pending.
const vm = data[0];
const operation = data[1];
console.log(`VM being created: ${vm.id}`);
console.log(`Operation info: ${operation.id}`);
return operation.promise();
// This will return right away with the VM pending state, you can finish
// your logic here, and not wait for VM creation to finish.
// You can even ignore this step if you don't need the VM ID logged for
// debugging purposes
})
.then(() => {
const message = 'VM created with success, Cloud Function finished execution.';
console.log(message);
}
Using that same code, in the worst case (if it takes more than 9 minutes), the Cloud Function will timeout but the VM creation will continue.
The desing that I suggest is using: Cloud Scheduler + Pub/Sub + Compute Engine
This design in few words:
- you compute engine will have a utility that listens to a Cloud Pub/Sub topic
- this utility will execute upon receiving a new event from the Topic and run a cron job on the instance
- Cloud scheduler is used here to push messages to the Pub/Sub Topic in a time that you can specify in your job.
By using Pub/Sub to decouple the task-scheduling logic from the logic
running the commands on Compute Engine, you can update your cron
scripts as needed, without updating the Cloud Scheduler configuration.
You can also change your task schedule without updating the utility
service on your Compute Engine instances
you can find full explanation of this design and a sample code by following this and this.
let me know if there is anything not obvious.
I've got a Google Cloud PubSub topic which at times has thousands of messages and at times zero messages coming in. These messages represent tasks which can take upwards of an hour each. Preferably I'm able to use Cloud Run for this, as it scales really well to the demand, if a thousand messages gets published, I want 100s of Cloud Run instances to spin up. These Run instances get started by a push subscription. The problem is that PubSub has a 600 second timeout for the acknowledgement. This means in order to have Cloud Run process these messages they have to finish within 600 seconds. If they do not, PubSub times it out, and sends it again, causing the task to be restarted until the first task finally does acknowledge it (this causes the same task to be ran many times). Cloud Run acknowledges the messages by returning a 2** HTTP status code. The documentation states
When an application running on Cloud Run finishes handling a request, the container instance's access to CPU will be disabled or severely limited. Therefore, you should not start background threads or routines that run outside the scope of the request handlers.
So is it maybe possible to acknowledge a PubSub request through code and continue the processing, without having Google Cloud Run hand over the resources? Or is there a better solution I'm unaware of?
Because these processes are so code/resource-intensive, I feel Cloud Functions will not suffice. I've looked at https://cloud.google.com/solutions/using-cloud-pub-sub-long-running-tasks and https://cloud.google.com/blog/products/gcp/how-google-cloud-pubsub-supports-long-running-workloads. But these didn't answer my question.
I've looked at Google Cloud Tasks, which might be something? But the rest of the project has been built around PubSub/Run/Functions, so preferably I stick with that.
This project is written in Python.
So preferably I would like to write my Google Cloud Run tasks like this:
#app.route('/', methods=['POST'])
def index():
"""Endpoint for Google Cloud PubSub messages"""
pubsub_message = request.get_json()
logger.info(f'Received PubSub pubsub_message {pubsub_message}')
if message_incorrect(pubsub_message):
return "Invalid request", 400 #use normal NACK handling
# acknowledge message here without returning
# ...
# Do actual processing of the task here
# ...
So how can or should I solve this, so that the the resource-intensive tasks get properly scaled on demand ( so a push PubSub subscription ). And the tasks only get executed once.
Answers:
In short what has been answered. Cloud Run and Functions are just not suited for this problem. There is no way to have them do tasks that take longer than 9 or 15 minutes respectively. The only solution is to switch over to another Google Service and use a pull style subscription and lose out on auto-scaling of GC Run/Functions
Cloud Run on GKE can handle long process, more CPU and memory than available on managed platform. However, you have a GKE cluster always running and you loose the "pay-as-you-use" benefit.
If you want to use this solution, don't link directly PubSub push subscription to your Cloud Run on GKE. Use Cloud Task with HTTP job for this. The timeout is longer than PubSub (up to 24h instead of 10 min) and the retry policies are customizables.
Neither Cloud Functions nor Cloud Run is sufficient for arbitrarily long running operations. Cloud Functions has a hard cap of 9 minutes per invocation, and Cloud Run caps at 60. If you need more time, you're going to have to delegate the work to another product, such as Google Compute Engine. It should be possible to kick off some Compute Engine work from one of the serverless products.
Give the limits of pubsub acks, you'll probably have to find a way for a client to be able to poll or listen to some resource to find out when the work is actually done. You could use a database for that, and Cloud Firestore lets you listen to documents to find out when they change. So you could use that to track the status of your long-running work.