Unable to start Google Cloud Profiler due to error: Unable to find the job id or job name from env var - google-cloud-platform

We followed the Cloud Profiler documentation to enable Cloud Profiler for our Dataflow jobs, but the profiler fails to start.
The issue is that Cloud Profiler needs the JOB_NAME and JOB_ID environment variables to start, but the worker VM only has the JOB_ID env var; JOB_NAME is missing.
The question is: why is the JOB_NAME env var missing?
Logs:
jsonPayload: {
job: "2022-09-16_13_41_20-1177626142222241340"
logger: "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker_main.py:177"
message: "Unable to start google cloud profiler due to error: Unable to find the job id or job name from env var"
portability_worker_id: "sdk-0-13"
thread: "MainThread"
worker: "description-embeddings-20-09161341-k27g-harness-qxq2"
}
The following has been done so far:
The Cloud Profiler API is enabled for the project.
The projects have enough quota.
The service account for the Dataflow job has the appropriate permissions for Profiler.
The following options were added to the pipeline (see the sketch after this list):
--dataflow_service_options=enable_google_cloud_profiler
The enable_google_cloud_profiler and enable_google_cloud_heap_sampling flags were specified as additional experiments to deploy our pipeline from Dataflow templates.
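For reference, this is roughly how we pass the profiler option when launching the pipeline from Python; the project and region values are placeholders, and this is only a sketch of a directly launched pipeline, not our template setup:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region values; the relevant part is dataflow_service_options.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--dataflow_service_options=enable_google_cloud_profiler",
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(["hello"])  # trivial stage, only to make the sketch runnable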
Edit: Found the cause.
The provisioning API is returning an empty JOB_NAME, causing boot.go to set the JOB_NAME env var to "", which causes the Python SDK code to fail when trying to activate googlecloudprofiler.
There is an open issue on IssueTracker regarding this.
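For context, the SDK-side behaviour amounts to a guard roughly like the one below; this is a simplified sketch of what sdk_worker_main.py does, not the actual Beam source:

import os
import googlecloudprofiler

# Simplified sketch: boot.go exports JOB_ID and JOB_NAME for the worker process;
# when the provisioning API returns an empty job name, JOB_NAME ends up as "".
job_id = os.environ.get("JOB_ID")
job_name = os.environ.get("JOB_NAME")

if job_id and job_name:
    # googlecloudprofiler.start() is the profiler agent's real entry point; using
    # the job name/id as service name/version here is illustrative.
    googlecloudprofiler.start(service=job_name, service_version=job_id)
else:
    # This branch produces the "Unable to find the job id or job name from
    # env var" error seen in the worker logs.
    raise RuntimeError("Unable to find the job id or job name from env var")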

Related

GCP Serverless pyspark : Illegal character in path at index

I'm trying to run a simple hello-world Python script on serverless PySpark on GCP using gcloud (from a local Windows machine).
if __name__ == '__main__':
print("Hello")
This always results in the error
=========== Cloud Dataproc Agent Error ===========
java.lang.IllegalArgumentException: Illegal character in path at index 38: gs://my-bucket/dependencies\hello.py
at java.base/java.net.URI.create(URI.java:883)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:592)
The gcloud command:
gcloud dataproc batches submit pyspark hello.py --batch=hello-batch-5 --deps-bucket=my-bucket --region=us-central1
On further analysis, I found that gcloud uploads hello.py as dependencies\hello.py under the {deps-bucket} folder, and Java considers the backslash '\' an illegal character.
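To illustrate the suspected cause (an assumption, not verified against the gcloud source: the object name appears to be built with the local OS path separator, which is a backslash on Windows):

import ntpath      # Windows-style path joining
import posixpath   # POSIX-style joining, which is what a GCS object path needs

# A Windows-style join produces a backslash in the object name...
print(ntpath.join("dependencies", "hello.py"))     # dependencies\hello.py
# ...whereas the object name Dataproc expects uses a forward slash:
print(posixpath.join("dependencies", "hello.py"))  # dependencies/hello.py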
Has anyone encountered a similar situation?
As @Ronak mentioned, can you double-check the bucket name? I replicated your task by simply copying your code to my Google Cloud Shell, and it ran just fine. For your next run, can you delete the dependencies folder and run the batch job again?
See my replication here:
Dependencies path created after running the job:

Why GCloud Builds submit failing after creating image?

I am learning to deploy a Pub/Sub service to run under Cloud Run, by following the guidelines given here
Steps I followed are:
Created a new project folder "myProject" on my local machine
Added the files below:
app.js, index.js, Dockerfile
Executed below command to ship the code
gcloud builds submit --tag gcr.io/Project-ID/pubsub
It's mentioned in the tutorial document that
Upon success, you should see a SUCCESS message containing the ID, creation time, and image name. The image is stored in Container Registry and can be re-used if desired.
But in my case it's returning an error (Ref: screenshot).
I have verified the build logs, and they show success.
So I decided to ignore this error and proceed with the next step, deploying the app by running the command:
gcloud run deploy sks-pubsub-cloudrun --image gcr.io/Project-ID/pubsub --no-allow-unauthenticated
When I run this command, it immediately asks me to specify the region from the list (26 is my choice).
Next, it fails with the error:
Deploying container to Cloud Run service [sks-pubsub-cloudrun] in project [Project-ID] region [us-central1]
Deploying new service... Cloud Run error: The user-provided container failed to start and listen on the port defined provided by the PORT=8080 environment variable.
Logs for this revision might contain more information.
As I am new to GCP and to Dockerizing services, I don't understand this issue and am unable to fix it. I researched many blogs and articles, yet found no proper solution for this error.
Any help will be appreciated.
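For reference, the Cloud Run error above means the container never started listening on the port passed in the PORT environment variable. Below is a minimal sketch of what Cloud Run expects, written in Python purely for illustration (the tutorial's app is Node, and this is not its code):

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

# Cloud Run injects the port to listen on via the PORT env var (8080 by default).
port = int(os.environ.get("PORT", "8080"))
HTTPServer(("0.0.0.0", port), Handler).serve_forever()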
I tried to run the container locally and it fails with an error.
I'm using VS Code IDE, and "Cloud Code: Debug on Cloud Run Emulator" to debug the code.
Starting to debug the app using configuration `Cloud Run: Run/Debug Locally` from .vscode/launch.json
To view more detailed logs, go to Output channel : "Cloud Run: Run/Debug Locally - Detailed"
Dependency check started
Dependency check succeeded
Unpausing minikube
The minikube profile 'cloud-run-dev-internal' has been scheduled to stop automatically after exiting Cloud Code. To disable this on future deployments, set autoStop to false in your launch configuration d:\POC\promo_run_pubsub\.vscode\launch.json
Configuring minikube gcp-auth addon
Using GCP project 'Project-Id' with minikube gcp-auth
Failed to configure minikube gcp-auth addon. Your app might not be able to authenticate Google or GCP APIs it calls. The addon has been disabled. More details can be found in the detailed logs.
Update initiated
Deploy started
Deploy completed
Status check started
Resource pod/promo-run-pubsub-5d4cd64bf9-8pf4q status updated to In Progress
Resource deployment/promo-run-pubsub status updated to In Progress
Resource pod/promo-run-pubsub-5d4cd64bf9-8pf4q status updated to In Progress
Resource deployment/promo-run-pubsub status failed with waiting for rollout to finish: 0 of 1 updated replicas are available...
Status check failed
Update failed with error code STATUSCHECK_CONTAINER_TERMINATED
1/1 deployment(s) failed
Skaffold exited with code 1.
Cleaning up...
Finished clean up.

Dataflow Job failing

I have a pipeline which requires a Dataflow job to run. I was using a gcloud CLI command to start the Dataflow job, which had been working fine for over a month. But for the last three days, the Dataflow job has been failing within 10-20 seconds with the following error log.
Failed to start the VM, launcher-2022012621245117717885921401920990, used for launching because of status code: UNAVAILABLE, reason: One or more operations had an error: 'operation-1643261093401-5d68989bed339-a33de830-9f90d92a': [UNAVAILABLE] 'HTTP_503'..
The command I'm using is:
gcloud dataflow sql query "SELECT tr.* FROM pubsub.topic.`my_project`.pubsub_topic as tr"
--job-name test_job
--region asia-south1
--bigquery-write-disposition write-empty
--bigquery-project my_project
--bigquery-dataset test_dataset --bigquery-table table_name
--max-workers 1 --worker-machine-type n1-standard-1
I tried starting the job from the Cloud Console with the same parameters as well, and it failed with the same error log. I have tested running the job from the console before and it worked fine. The issue started a couple of days ago.
What could be going wrong?
Thanks.
The Google Cloud error model indicates that a 503 means the service is unavailable [1].
You may try changing the region, for example from europe-north1 to europe-west4; that should work. Additionally, you shouldn't include your job ID on Stack Overflow.
[1] https://cloud.google.com/apis/design/errors#handling_errors

Template launch failed while running a streaming dataflow job using flex-template

I'm trying to automate provisioning of streaming job using cloud build, for the POC I tried https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/dataflow/flex-templates/streaming_beam
It worked as expected when I manually ran the commands.
When I add the commands to the cloudbuild.yaml file, the build is created successfully, but the Dataflow job fails each time with the error below:
Error occurred in the launcher container: Template launch failed. See console logs
This is the only error log that I get. I tried adding extra permissions to the Cloud Build service account, but that didn't help either.
Since there's no other info in the log, I find it hard to debug as well.
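Since the launcher error points at the console logs, one way to pull more detail is to query Cloud Logging for the job's entries. Below is a rough sketch using the google-cloud-logging client; the project ID, job ID, and the exact filter are assumptions to adapt:

from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID

# Assumed filter: Dataflow job logs, which include the launcher container output.
log_filter = (
    'resource.type="dataflow_step" '
    'AND resource.labels.job_id="JOB_ID_HERE"'  # placeholder job ID
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)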

How to increase the cloud build timeout when using ```gcloud run deploy```?

When attempting to deploy to Cloud Run using gcloud run deploy, I am hitting the 10-minute Cloud Build timeout limit. gcloud run deploy works well as long as the build step does not exceed 10 minutes. When the build step exceeds 10 minutes, the build fails with the "Timed out" status, as shown in the screenshot below. AFAIK there are no arguments to gcloud run deploy that can set the Cloud Build timeout limit. The gcloud run deploy docs are here: https://cloud.google.com/sdk/gcloud/reference/run/deploy
I've attempted to increase the Cloud Build timeout limit using gcloud config set builds/timeout 20m and gcloud config set container/build_timeout 20m, but these settings are not reflected in the execution details of the cloud build process when using gcloud run deploy.
In the GUI, this is the setting I want to change:
Is it possible to increase the Cloud Build timeout limit using gcloud run deploy?
How about splitting the command into (more easily configured) constituents?
[I've not tried this]
Build the container image, specifying the timeout:
gcloud builds submit --source=.... --timeout=...
Then reference the image that results when you gcloud run deploy:
gcloud run deploy ... --image=...
I know this is answered and confirmed, but @DazWikin's solution is a harder way to solve this problem than @SimonKarman's solution.
For those who, like me, do not have a cloudbuild.yml file, this solution is still valid; you just need to edit the one created by Google itself. You can find it under Builds > Triggers > Desired Trigger (Edit).
Then, when you open the editor, you can apply the timeout. If you want to make other changes to the YAML file, you can also check out the schema here:
https://cloud.google.com/build/docs/build-config-file-schema#yaml
Note: I am using Cloud Run and this worked for me, so I am not 100% sure it works with all builds generated by Google.
Hope it will be helpful for someone else in the future :)
If you're using a --source such as a cloudbuild.yaml, you can add the following property to alter the timeout in seconds:
...
timeout: "1800s"
...
You can find this in the documentation