I have a batch job that takes a couple of hours to run. How can I run this in a serverless way on Google Cloud?
App Engine, Cloud Functions, and Cloud Run are limited to 10-15 minutes. I don't want to rewrite my code in Apache Beam.
Is there an equivalent to AWS Batch on Google Cloud?
Note: Cloud Run and Cloud Functions can now last up to 60 minutes. The answer below remains a viable approach if you have a multi-hour job.
Vertex AI Training is serverless and long-lived. Wrap your batch processing code in a Docker container, push to gcr.io and then do:
gcloud ai custom-jobs create \
--region=LOCATION \
--display-name=JOB_NAME \
--worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=EXECUTOR_IMAGE_URI,local-package-path=WORKING_DIRECTORY,script=SCRIPT_PATH
You can run any arbitrary Docker container — it doesn’t have to be a machine learning job. For details, see:
https://cloud.google.com/vertex-ai/docs/training/create-custom-job#create_custom_job-gcloud
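If you have already pushed a container image as described above, a sketch of the container-based form of the same command looks like this (the region, display name, machine type, and image path are placeholders):

gcloud ai custom-jobs create \
    --region=us-central1 \
    --display-name=my-batch-job \
    --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=gcr.io/MY_PROJECT/my-batch-image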
Today you can also use Cloud Batch: https://cloud.google.com/batch/docs/get-started#create-basic-job
Google Cloud does not offer a comparable product to AWS Batch (see https://cloud.google.com/docs/compare/aws/#service_comparisons).
Instead you'll need to use Cloud Tasks or Pub/Sub to delegate the work to another product, such as Compute Engine, but none of these options gives you a truly "serverless" setup.
Finally, Google has released Cloud Batch (in Beta for the moment), which does exactly what you want.
You push jobs (containers or scripts) and it runs. Simple as that.
https://cloud.google.com/batch/docs/get-started
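A minimal sketch based on the quickstart, assuming you have already pushed a container image (the job name, location, and image path are placeholders):

# job.json describes a single container-based task; logs go to Cloud Logging
cat > job.json <<'EOF'
{
  "taskGroups": [
    {
      "taskCount": 1,
      "taskSpec": {
        "runnables": [
          { "container": { "imageUri": "gcr.io/MY_PROJECT/my-batch-image" } }
        ]
      }
    }
  ],
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF

gcloud batch jobs submit my-batch-job --location=us-central1 --config=job.json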
This answer to "How to make GCE instance stop when its deployed container finishes?" will work for you as well:
In short:
First dockerize your batch process.
Then, create an instance:
Using a container-optimized image
And using a startup script that pulls your Docker image, runs it, and shuts down the machine at the end.
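A minimal sketch of both pieces, assuming the image has already been pushed (the image path, instance name, and zone are placeholders):

# Startup script: run the batch container, then power the VM off when it exits
cat > startup.sh <<'EOF'
#! /bin/bash
docker run --rm gcr.io/MY_PROJECT/my-batch-image
shutdown -h now
EOF

# Create the instance from a container-optimized image with that startup script
gcloud compute instances create batch-runner \
    --zone=us-central1-a \
    --image-family=cos-stable \
    --image-project=cos-cloud \
    --metadata-from-file=startup-script=startup.sh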
I have faced the same problem. In my case I went for:
Cloud Scheduler to start the job by pushing to Pub/Sub.
Pub/Sub triggers Cloud Functions.
Cloud Functions mounting a Compute Engine instance.
Compute Engine runs the batch workload and auto-kills the instance once it's done. You can read my post on Medium:
https://link.medium.com/1K3NsElGYZ
It might help you get started. There's also a follow up post showing how to use a Docker container inside the Compute Engine instance: https://medium.com/google-cloud/serverless-batch-workload-on-gcp-adding-docker-and-container-registry-to-the-mix-558f925e1de1
You can use Cloud Run. At the time of writing this, the timeout of Cloud Run (fully managed) has been increased to 60 minutes, but this is in beta.
https://cloud.google.com/run/docs/configuring/request-timeout
Important: Although Cloud Run (fully managed) has a maximum timeout of 60 minutes, only timeouts of 15 minutes or less are generally available: setting timeouts greater than 15 minutes is a Beta feature.
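If you go this route, the timeout is set per service; a sketch of raising it to the 60-minute maximum (the service name and region are placeholders):

gcloud run services update my-batch-service \
    --region=us-central1 \
    --timeout=3600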
Another alternative for batch computing is using Google Cloud Life Sciences.
An example application using Cloud Life Sciences is dsub.
Or see the Cloud Life Sciences Quickstart documentation.
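For illustration, a dsub invocation might look roughly like this (the project, bucket, and image are placeholders, and the flags assume the google-cls-v2 provider from the dsub README):

dsub \
    --provider google-cls-v2 \
    --project MY_PROJECT \
    --regions us-central1 \
    --logging gs://MY_BUCKET/logs/ \
    --image ubuntu:20.04 \
    --command 'echo "hello from dsub"' \
    --wait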
I found myself looking for a solution to this problem and built something similar to what mesmacosta has described in a different answer, in the form of a reusable tool called gcp-runbatch.
If you can package your workload into a Docker image then you can run it using gcp-runbatch. When triggered, it will do the following:
Create a new VM
On VM startup, docker run the specified image
When the docker run exits, the VM will be deleted
Some features that are supported:
Invoke batch workload from the command line, or deploy as a Cloud Function and invoke that way (e.g. to trigger batch workloads via Cloud Scheduler)
stdout and stderr will be piped to Cloud Logging
Environment variables can be specified by the invoker, or pulled from Secret Manager
Here's an example command line invocation:
$ gcp-runbatch \
--project-id=long-octane-350517 \
--zone=us-central1-a \
--service-account=1234567890-compute@developer.gserviceaccount.com \
hello-world
Successfully started instance runbatch-38408320. To tail batch logs run:
CLOUDSDK_PYTHON_SITEPACKAGES=1 gcloud beta --project=long-octane-350517
logging tail 'logName="projects/long-octane-350517/logs/runbatch" AND
resource.labels.instance_id="runbatch-38408320"' --format='get(text_payload)'
GCP launched their new "Batch" service in July '22. It's basically Compute Engine packaged with some utilities to easily productionize a batch job -- including defining required resources, executables (script- or container-based), and a run schedule.
Haven't used it yet, but seems like a great fit for batch jobs that take over 1 hr.
Related
I am trying to build an app where the user is able to upload a file to Cloud Storage. This would then trigger a model training process (and predicting later on). Initially I thought I could do this with Cloud Functions/Pub/Sub and Cloud ML, but it seems that Cloud Functions are not able to trigger gsutil commands, which are needed for Cloud ML.
Is my only option to enable cloud-composer and attach GPUs to a kubernetes node and create a cloud function that triggers a dag to boot up a pod on the node with GPUs and mounting the bucket with the data? Seems a bit excessive but I can't think of another way currently.
You're correct. As for now, there's no possibility to execute gsutil command from a Google Cloud Function:
Cloud Functions can be written in Node.js, Python, Go, and Java, and are executed in language-specific runtimes.
I really like your second approach with triggering the DAG.
Another idea that comes to my mind is to interact with GCP Virtual Machines from Cloud Composer through the Python operator, using the Compute Engine Python API. You can find more information on automating infrastructure, as well as a deeper technical dive into the core features of Cloud Composer, in the documentation.
Another solution that you can think of is Kubeflow, which aims to make running ML workloads on Kubernetes simple and portable. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. Please have a look at the Codelabs tutorial.
I hope you find the above pieces of information useful.
I am currently studying for the GCP Data Engineer exam and have struggled to understand when to use Cloud Scheduler and when to use Cloud Composer.
From reading the docs, I have the impression that Cloud Composer should be used when there are interdependencies between jobs, e.g. we need the output of one job to start another whenever the first finishes, using outputs from the first job. You can then chain as many of these "workflows" as you want, as well as getting the opportunity to restart jobs when they fail, run batch jobs, shell scripts, chain queries, and so on.
Cloud Scheduler has very similar capabilities in terms of what tasks it can execute; however, it is used more for regular jobs that you execute at fixed intervals, and not when you have interdependencies between jobs or when you need to wait for other jobs before starting another one. Therefore, it seems more tailored to "simpler" tasks.
These thoughts came after attempting to answer some exam questions I found. However, I was surprised by the "correct answers" given, and was hoping someone could clarify whether these answers are correct and whether I have understood when to use one over the other.
Here are the example questions that confused me in regards to this topic:
Question 1
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times.
Which service should you use to manage the execution of these jobs?
A. Cloud Scheduler
B. Cloud Dataflow
C. Cloud Functions
D. Cloud Composer
Correct Answer: A
Question 2
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day.
Which tool should you use?
A. cron
B. Cloud Composer
C. Cloud Scheduler
D. Workflow Templates on Cloud Dataproc
Correct Answer: D
Question 3
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers.
Which cloud-native service should you use to orchestrate the entire pipeline?
A. Cloud Dataflow
B. Cloud Composer
C. Cloud Dataprep
D. Cloud Dataproc
Correct Answer: D
Any insight on this would be greatly appreciated. Thank you !
Your assumptions are correct, Cloud Composer is an Apache Airflow managed service, it serves well when orchestrating interdependent pipelines, and Cloud Scheduler is just a managed Cron service.
I don't know where you got these questions and answers, but I assure you (and I just got the GCP Data Engineer certification last month) that the correct answer would be Cloud Composer for each one of them. Just ignore those supposed correct answers and move on.
Cloud Scheduler is essentially cron-as-a-service. All you need is to enter a schedule and an endpoint (Pub/Sub topic, HTTP, App Engine route). Cloud Scheduler has built-in retry handling, so you can set a fixed number of retries, and it doesn't have time limits for requests. The functionality is much simpler than Cloud Composer.
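For instance, the retry count can be set when creating a job; a sketch with a placeholder name, schedule, and endpoint:

gcloud scheduler jobs create http my-nightly-job \
    --schedule="0 2 * * *" \
    --uri=https://example.com/run \
    --http-method=POST \
    --max-retry-attempts=3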
Cloud Composer is managed Apache Airflow that "helps you create, schedule, monitor and manage workflows. Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command line tools, so you can focus on your workflows and not your infrastructure."(https://cloud.google.com/composer/docs/)
Airflow is aimed at data pipelines with all the needed tooling.
I have a script that does some stuff with Google Cloud utilities like gsutil and bq, e.g.:
#!/usr/bin/env bash
bq query --format=csv "SELECT * FROM myproject.whatever WHERE date > '$x'" > res.csv
gsutil cp res.csv gs://my_storage/foo.csv
This works on my machine or a VM, but I can't guarantee it will always be on, so I'd like to add this as a GCP cronjob/Lambda type of thing. From the docs here, it looks like Cloud Scheduler can only do HTTP requests, Pub/Sub, or App Engine HTTP, none of which is exactly what I want.
So: is there any way in GCP to automate some gsutil / bq commands, like a cronjob, but without my supplying an always-on machine?
There are likely going to be multiple answers and this is but one.
For me, I would examine the concept of Google Cloud Run. The idea here is that you get to create a Docker image that is then instantiated, run and cleaned up when called by a REST request. What you put in your docker image is 100% up to you. It could be a simple image with tools like gcloud and gsutil installed with a script to run them with any desired parameters. Your contract with Cloud Run is only that you consume the incoming HTTP request.
When there are no requests to Cloud Run, there is no charge as nothing is running. You are only billed for the duration that your logic actually executes for.
I recommend Cloud Run over Cloud Functions as Cloud Run allows you to define the environment in which the commands run ... for example ... availability of gsutil.
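A sketch of the deploy side, assuming you have written a Dockerfile that bundles the Cloud SDK and a small HTTP handler that runs your script when a request arrives (the project, service name, and region are placeholders):

# Build the image with Cloud Build, then deploy it to Cloud Run with a generous timeout
gcloud builds submit --tag gcr.io/MY_PROJECT/batch-runner
gcloud run deploy batch-runner \
    --image=gcr.io/MY_PROJECT/batch-runner \
    --region=us-central1 \
    --no-allow-unauthenticated \
    --timeout=900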
I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.
What I am looking for is the most efficient way to set up cron jobs on a DataProc Cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of cron job on DataProc or some documentation on DataProc exclusively working together with Scheduler.
Thanks in advance!
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster while also handling errors) Dataproc's Workflow Templates API is a better choice than trying to orchestrate these yourself. A key advantage is Workflows are fire-and-forget and any clusters created will also be deleted on completion.
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule it would be to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.
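For reference, a workflow template along these lines can be built from the CLI; a sketch with placeholder names, assuming a PySpark job in GCS and a managed cluster (the sizing is an assumption):

gcloud dataproc workflow-templates create my-template --region=us-central1
gcloud dataproc workflow-templates set-managed-cluster my-template \
    --region=us-central1 --cluster-name=batch-cluster --num-workers=4
gcloud dataproc workflow-templates add-job pyspark gs://MY_BUCKET/my_job.py \
    --step-id=my-step --workflow-template=my-template --region=us-central1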
Assuming your use case is simply to run the workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new Service Account in my project, called workflow-starter@example.iam.gserviceaccount.com, and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in Developers Console. I created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: workflow-starter@example.iam.gserviceaccount.com
Scope: (left blank)
You can test it by clicking Run Now.
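The same job can also be created from the command line; a sketch using the names above (the schedule is a placeholder):

gcloud scheduler jobs create http terasort-example-monthly \
    --schedule="0 8 7 * *" \
    --uri="https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json" \
    --http-method=POST \
    --message-body="{}" \
    --oauth-service-account-email=workflow-starter@example.iam.gserviceaccount.com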
Note you can also copy the entire workflow content in the Body as JSON payload. The last part of the URL would become workflowTemplates:instantiateInline?alt=json
Check out this official doc that discusses other scheduling options.
Please see the other answer for a more comprehensive solution.
What you will have to do is publish an event to pubsub topic from Cloud Scheduler and then have a Cloud Function react to that event.
Here's a complete example of using a Cloud Function to trigger Dataproc:
How can I run create Dataproc cluster, run job, delete cluster from Cloud Function
I would like to know how to create a script that automatically stops and starts a Google Compute Engine instance, and how I can configure it to run every day but only 5 days a week.
Because we are not using the server at night, I could save 9 hours a day.
Can it be done?
Thank you.
You can use the gcloud command line tool for that (of course from another machine); it provides all the controls, including starting and stopping instances. Set up cron on your local machine for:
gcloud compute instances stop INSTANCE_NAMES
gcloud compute instances start INSTANCE_NAMES
See more:
https://cloud.google.com/sdk/gcloud/reference/compute/instances/stop
https://cloud.google.com/sdk/gcloud/reference/compute/instances/start
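To cover the "only 5 days a week" part, a crontab sketch (the times and zone are placeholders):

# Stop the instance weeknights at 20:00 and start it weekday mornings at 07:00
0 20 * * 1-5 gcloud compute instances stop INSTANCE_NAMES --zone=ZONE
0 7 * * 1-5 gcloud compute instances start INSTANCE_NAMES --zone=ZONE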
As far as I know, GCE doesn't provide scheduled VM stop/start as a managed feature; it has to be triggered outside of the VM. For example, you can use a GAE scheduled task which uses gcloud or the GCE Python SDK to start and stop your VM.
You can use Google Cloud Scheduler in conjunction with Cloud Functions to run lightweight cronjobs which start/stop GCE VM instances based on a schedule that you control.
You can find a step-by-step tutorial in the official docs, but the general flow is:
Use Cloud Scheduler to publish start/stop messages to a Cloud Pub/Sub topic at the desired times (ex: every weekday at 9am, write a start VM event, every weekday at 5pm, write a stop VM event)
Create a Cloud Function which subscribes to the Pub/Sub topic, and makes the appropriate calls to the GCE APIs to trigger start/stop VM.
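A sketch of the Scheduler side of that flow for the weekday start event (the topic name, schedule, and message payload are placeholders and must match whatever your Cloud Function expects):

gcloud pubsub topics create start-instance-event

gcloud scheduler jobs create pubsub start-vm-weekdays \
    --schedule="0 9 * * 1-5" \
    --topic=start-instance-event \
    --message-body='{"zone":"us-central1-a", "label":"env=dev"}'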