Cloud Run and Cloud Scheduler - Getting Failed Result on Full Dataset - google-cloud-platform

I am running a python script in Cloud Run on a daily basis with Cloud Scheduler to pull data from BigQuery and upload it to Google Cloud Storage as a CSV file. The Cloud Scheduler setup utilizes an HTTP "Target" with a GET "HTTP method". Also, Cloud Scheduler authenticates the https endpoint using a service account with the "Add OIDC token" option.
When running Cloud Scheduler and Cloud Run with a very small subset of the BigQuery data for a job that takes a few seconds, the "Result" in Cloud Scheduler always shows "Success" and the job completes as intended. However, when running Cloud Scheduler and Cloud Run with the full BigQuery dataset for a job that takes a few minutes, the "Result" in Cloud Scheduler always shows "Failed", even though the CSV file is typically (although not always) uploaded into Google Cloud Storage as intended.
(1) When running Cloud Scheduler and Cloud Run on the full BigQuery dataset, why does the "Result" in Cloud Scheduler always show "Failed", even though the job is typically finishing as intended?
(2) How can I fix Cloud Scheduler and Cloud Run to ensure the job always completes as intended and the "Result" in Cloud Scheduler always shows "Success"?

This is a common pitfall with Cloud Scheduler. I have raised it with Google many times, but nothing has changed so far...
The GUI (the web console) doesn't let you configure everything, in particular the timeout. Your Cloud Scheduler job shows "Failed" because it doesn't receive the response within its attempt deadline when you scan your full BQ dataset (which can take a few minutes).
To solve this, use the command line (gcloud), especially the attempt-deadline parameter. You can also have a look at the other parameters: retries, backoff, and so on. The customization available there is interesting, but it is not exposed in the GUI!
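As an illustration, here is a minimal gcloud sketch; the job name, location, and values are placeholders to adapt to your own job:
# Raise the attempt deadline so Cloud Scheduler waits long enough for the
# full export, and tune retries/backoff while you are at it.
gcloud scheduler jobs update http my-bq-export-job \
  --location=us-central1 \
  --attempt-deadline=600s \
  --max-retry-attempts=3 \
  --min-backoff=30s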

Related

How to run a GCP Cloud run that calls another GCP Cloud Run for data

I have two Cloud Run services within the same VPC/project, where one Cloud Run service (A) builds its request based upon the output of a request to another Cloud Run service (B).
I am calling Cloud Run (B) from Cloud Run (A) by providing Cloud Run (B)'s trigger URL in Cloud Run (A)'s .env.
The error I receive from the curl is "System Unavailable".
The error logs from Cloud Run show:
POST 503
"The request failed because either the HTTP response was malformed or connection to the instance had an error. Additional troubleshooting documentation can be found at: https://cloud.google.com/run/docs/troubleshooting#malformed-response-or-connection-error"
The logs show that Cloud Run (A) is successfully calling the other Cloud Run (B), but the request takes up to 60-120 seconds before a response is generated by Cloud Run (A). I set the request timeout to 10 min on Cloud Run (A) to be safe but still face errors on Cloud Run (A).
The only network specific non-default setting used when setting up both cloud runs is "Ingress=Internal+load balancing".
This setup works when the original reference Cloud Run (B) is sent a request from a GCE VM server running the same image+container setup.
What cloud run setting(s) do I need to get one cloud run to be able to request data from another properly?
I am referencing the Cloud Run service from both the other Cloud Run service and the VM server via its trigger URL:
cat .env
URL=https://<name>.a.run.app
As @John Hanley explained in the comments, my answer focuses on the same points.
For your service (B) to be called by service (A), these are the conditions that must be met:
Make sure both services are deployed under the same VPC network and in the same project. For more information, a similar thread has explained these constraints.
For Cloud Run inter-service communication, follow this thread, which answers some of your questions.
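One concrete way to satisfy these conditions when the receiving service uses "Ingress=Internal" is to route the calling service's outbound traffic through a Serverless VPC Access connector. A sketch, assuming hypothetical service names run-a and run-b and an existing connector named my-connector:
# Route all egress from the calling service (A) through the VPC so it can
# reach a service (B) that only accepts internal ingress.
gcloud run services update run-a \
  --region=us-central1 \
  --vpc-connector=my-connector \
  --vpc-egress=all-traffic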

Can you call a Cloud Run app from inside a Cloud Function?

I'd like to call a Cloud Run app from inside a Cloud Function multiple times, given some logic. I've googled this quite a lot and haven't found good solutions. Is this supported?
I've seen the Workflows tutorials, but AFAIK they are meant to pass messages in series between different GCP services. My Cloud Function runs on a schedule every minute and it would only need to call the Cloud Run app a few times per day, given some event. I've thought about having the entire app run in Cloud Run instead of the Cloud Function. However, I think having it all in Cloud Run would be more expensive than running the Cloud Function.
I went through your question and have an alternative in mind, if you're open to the solution: you can use Cloud Scheduler to securely trigger a Cloud Run service asynchronously on a schedule.
You need to create a service account to associate with Cloud Scheduler, and give that service account permission to invoke your Cloud Run service, i.e. the Cloud Run Invoker role (you can use an existing service account to represent Cloud Scheduler, or create a new one for that matter).
Next, you have to create a Cloud Scheduler job that invokes your service at specified times. Specify the frequency, or job interval, at which the job is to run, using a configuration string, and specify the fully qualified URL of your Cloud Run service, for example https://myservice-abcdef-uc.a.run.app. The job will send requests to this URL.
Next, specify the HTTP method: the method must match what your previously deployed Cloud Run service is expecting. When you deploy the service that Cloud Scheduler will invoke, make sure you do not allow unauthenticated invocations. Please go through this documentation for details and try to implement the steps.
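A gcloud sketch of such a job, using the example service URL above; the job name, schedule, and invoker service account are placeholders:
# Daily job that calls the Cloud Run service with an OIDC token minted for
# the invoker service account.
gcloud scheduler jobs create http invoke-my-service \
  --location=us-central1 \
  --schedule="0 6 * * *" \
  --uri="https://myservice-abcdef-uc.a.run.app" \
  --http-method=GET \
  --oidc-service-account-email=scheduler-invoker@PROJECT_ID.iam.gserviceaccount.com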
Back to your question: yes, it's possible to call your Cloud Run service from inside Cloud Functions. Here, your Cloud Run service is called from another backend service, i.e. Cloud Functions, directly (synchronously) over HTTP, using its endpoint URL. For this use case, you should make sure that each service is only able to make requests to the specific services it needs.
Go through this documentation suggested by @John Hanley as it provides the steps you need to follow.
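Before wiring the call into your function, you can verify that the secured endpoint accepts an identity token from a shell (your own account needs the Cloud Run Invoker role on the service); the URL is the example one from the steps above:
# Developer test of the authenticated endpoint; inside the function you would
# mint an ID token for the same audience instead.
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://myservice-abcdef-uc.a.run.app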

How to use Cloud Scheduler to store data into Cloud Storage?

I've created a job in Google Cloud Scheduler to download data from my demo app using HTTP GET, and it seemed to run successfully. My question is: where did it store that data? And how can I store it into Google Cloud Storage? Below is a screenshot of my job:
I'm new to Google Cloud and working with a free trial account. Please advise.
Cloud Scheduler does not process data. The data returned by your example request is discarded.
Write a Cloud Function scheduled by Cloud Scheduler. There are other services such as Firebase and Cloud Run that work well also.
What you're trying to do is create a scheduler job that calls a GET request, and Cloud Scheduler will do exactly that, except for the part where data is stored. Since Cloud Scheduler is a managed cron service, it doesn't matter whether the URL returns data: Cloud Scheduler calls the endpoint on schedule and that's it. Data returned by the request will be discarded, as mentioned by John Hanley.
What you can do is integrate scheduling with Cloud Functions, but to be clear, your app needs to do the following first:
Download from external link and save the object to /tmp within the function.
The rest of the file system is read-only, and /tmp is the only writable part. Any files saved in /tmp are held in the function's memory, so you can clear /tmp after every successful upload to GCS.
Upload the file to Cloud Storage (using Cloud Storage client library).
Now that your app is capable of storing and uploading data, you can decide where to deploy it.
You can deploy your code using the Firebase CLI and use scheduled functions. The advantage is that Pub/Sub and Cloud Scheduler are taken care of automatically and the configuration is done in your code. The downside is that you are limited to the Node runtime, compared to GCP Cloud Functions, where many different programming languages are available. Learn more about scheduled functions.
Second, you can deploy through the gcloud CLI, but for this you need to set up Pub/Sub notifications and Cloud Scheduler yourself. You can check more about this by navigating to this link.
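A sketch of this second (gcloud) path, with placeholder function, topic, and schedule names:
# Deploy the function that does the download and the upload to Cloud Storage,
# triggered by messages on the export-daily topic.
gcloud functions deploy export-to-gcs \
  --runtime=python310 \
  --trigger-topic=export-daily \
  --region=us-central1

# Publish an empty message every day at 03:00 to kick off the function.
gcloud scheduler jobs create pubsub export-to-gcs-daily \
  --location=us-central1 \
  --schedule="0 3 * * *" \
  --topic=export-daily \
  --message-body="{}"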

Scheduling cron jobs on Google Cloud DataProc

I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.
What I am looking for is the most efficient way to set up cron jobs on a DataProc cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of a cron job on DataProc or some documentation on DataProc and Scheduler working together.
Thanks in advance!
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster while also handling errors), Dataproc's Workflow Templates API is a better choice than trying to orchestrate these yourself. A key advantage is that Workflows are fire-and-forget, and any clusters created will also be deleted on completion.
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule is to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.
Assuming your use case is simply running the workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new Service Account in my project, called workflow-starter@example.iam.gserviceaccount.com, and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in the Developers Console. I created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: workflow-starter@example.iam.gserviceaccount.com
Scope: (left blank)
You can test it by clicking Run Now.
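The equivalent job can also be created with gcloud instead of the console; a sketch using the example names above and a hypothetical schedule for the 7th of every month at 02:00:
# HTTP job that instantiates the workflow template, authenticating with an
# OAuth token for the workflow-starter service account.
gcloud scheduler jobs create http terasort-monthly \
  --schedule="0 2 7 * *" \
  --uri="https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json" \
  --http-method=POST \
  --message-body="{}" \
  --oauth-service-account-email=workflow-starter@example.iam.gserviceaccount.com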
Note you can also copy the entire workflow content in the Body as JSON payload. The last part of the URL would become workflowTemplates:instantiateInline?alt=json
Check out this official doc that discusses other scheduling options.
Please see the other answer for a more comprehensive solution.
What you will have to do is publish an event to a Pub/Sub topic from Cloud Scheduler and then have a Cloud Function react to that event.
Here's a complete example of using Cloud Function to trigger Dataproc:
How can I run create Dataproc cluster, run job, delete cluster from Cloud Function

How do I run a serverless batch job in Google Cloud

I have a batch job that takes a couple of hours to run. How can I run this in a serverless way on Google Cloud?
AppEngine, Cloud Functions, and Cloud Run are limited to 10-15 minutes. I don't want to rewrite my code in Apache Beam.
Is there an equivalent to AWS Batch on Google Cloud?
Note: Cloud Run and Cloud Functions can now last up to 60 minutes. The answer below remains a viable approach if you have a multi-hour job.
Vertex AI Training is serverless and long-lived. Wrap your batch processing code in a Docker container, push to gcr.io and then do:
gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI
You can run any arbitrary Docker container — it doesn’t have to be a machine learning job. For details, see:
https://cloud.google.com/vertex-ai/docs/training/create-custom-job#create_custom_job-gcloud
Today you can also use Cloud Batch: https://cloud.google.com/batch/docs/get-started#create-basic-job
Google Cloud does not offer a comparable product to AWS Batch (see https://cloud.google.com/docs/compare/aws/#service_comparisons).
Instead, you'll need to use Cloud Tasks or Pub/Sub to delegate the work to another product, such as Compute Engine, but that loses the "serverless" aspect.
Finally, Google released Cloud Batch (in Beta for the moment), which does exactly what you want.
You push jobs (containers or scripts) and it runs. Simple as that.
https://cloud.google.com/batch/docs/get-started
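Submitting a job is a single command; a minimal sketch, assuming a job.json that describes one container or script task (see the get-started guide above for the config format), with placeholder names:
# Submit a Batch job defined in job.json and let the service provision and
# tear down the underlying resources.
gcloud batch jobs submit my-batch-job \
  --location=us-central1 \
  --config=job.json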
This answer to How to make GCE instance stop when its deployed container finishes? will work for you as well:
In short:
First dockerize your batch process.
Then, create an instance:
Using a container-optimized image
And using a startup script that pulls your Docker image, runs it, and shuts down the machine at the end.
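A minimal sketch of such a startup script (the image path is a placeholder; it assumes the Docker CLI is available on the image, as it is on Container-Optimized OS):
#! /bin/bash
# Pull and run the batch container to completion, then power the VM off so
# it stops incurring cost.
docker pull gcr.io/PROJECT_ID/batch-job:latest
docker run --rm gcr.io/PROJECT_ID/batch-job:latest
shutdown -h now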
I have faced the same problem. In my case I went for:
Cloud Scheduler to start the job by pushing to Pub/Sub.
Pub/Sub triggers Cloud Functions.
Cloud Functions mounting a Compute Engine instance.
Compute Engine runs the batch workload and auto-kills the instance once it's done.
You can read my post on Medium: https://link.medium.com/1K3NsElGYZ
It might help you get started. There's also a follow up post showing how to use a Docker container inside the Compute Engine instance: https://medium.com/google-cloud/serverless-batch-workload-on-gcp-adding-docker-and-container-registry-to-the-mix-558f925e1de1
You can use Cloud Run. At the time of writing, the timeout of Cloud Run (fully managed) has been increased to 60 minutes, but it is still in beta.
https://cloud.google.com/run/docs/configuring/request-timeout
Important: Although Cloud Run (fully managed) has a maximum timeout of 60 minutes, only timeouts of 15 minutes or less are generally available: setting timeouts greater than 15 minutes is a Beta feature.
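If your batch fits within that limit, raising the timeout on an existing service is a single command (service name and region are placeholders):
# Allow requests to run for up to 60 minutes on an existing service.
gcloud run services update my-batch-service \
  --region=us-central1 \
  --timeout=3600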
Another alternative for batch computing is using Google Cloud Life Sciences.
An example application using Cloud Life Sciences is dsub.
Or see the Cloud Life Sciences Quickstart documentation.
I found myself looking for a solution to this problem and built something similar to what mesmacosta has described in a different answer, in the form of a reusable tool called gcp-runbatch.
If you can package your workload into a Docker image then you can run it using gcp-runbatch. When triggered, it will do the following:
Create a new VM
On VM startup, docker run the specified image
When the docker run exits, the VM will be deleted
Some features that are supported:
Invoke batch workload from the command line, or deploy as a Cloud Function and invoke that way (e.g. to trigger batch workloads via Cloud Scheduler)
stdout and stderr will be piped to Cloud Logging
Environment variables can be specified by the invoker, or pulled from Secret Manager
Here's an example command line invocation:
$ gcp-runbatch \
--project-id=long-octane-350517 \
--zone=us-central1-a \
--service-account=1234567890-compute@developer.gserviceaccount.com \
hello-world
Successfully started instance runbatch-38408320. To tail batch logs run:
CLOUDSDK_PYTHON_SITEPACKAGES=1 gcloud beta --project=long-octane-350517
logging tail 'logName="projects/long-octane-350517/logs/runbatch" AND
resource.labels.instance_id="runbatch-38408320"' --format='get(text_payload)'
GCP launched their new "Batch" service in July '22. It's basically Compute Engine packaged with some utilities to easily productionize a batch job, including defining the required resources and executables (script or container-based) and defining a run schedule.
Haven't used it yet, but seems like a great fit for batch jobs that take over 1 hr.