Retrieve Google Dataflow user-defined labels in Google Monitoring & Logging

I'm running a Dataflow job using the gcloud command with the --parameters labels='{"my_label":"my_value"}' parameter in order to set custom labels (which I can see in the Job Info panel in the UI).
But unfortunately, I'm not able to retrieve this custom label in the Logging stack: I can see the classic Dataflow labels (dataflow.googleapis.com/job_name, dataflow.googleapis.com/job_id, etc.) and the Resource ones (step_id, project_id, region, etc.) but not my my_label custom label.
Is there something I'm missing, or does the label propagation I'm expecting simply not exist?
Thanks!
EDIT: Here is the full command I'm using to run my Dataflow job:
gcloud dataflow flex-template run "my-dataflow-job" \
--template-file-gcs-location "gs://$(GCP_BUCKET)/my-dataflow-job/templates/my-dataflow-job.json" \
--region "$(GCP_REGION)" \
--parameters tempLocation="gs://$(GCP_BUCKET)/my-dataflow-job/temp" \
--parameters stagingLocation="gs://$(GCP_BUCKET)/my-dataflow-job/staging" \
--parameters usePublicIps=false \
--project "$(GCP_PROJECT)" \
--worker-machine-type="$(DATAFLOW_WORKER_MACHINE_TYPE)" \
--num-workers=1 \
--max-workers=1 \
--parameters labels='{"my_label":"my_value"}'

Related

How to specify a Pub/Sub topic when deploying an Eventarc-triggered 2nd gen Cloud Function using the gcloud command

I want to deploy a Cloud Function that is triggered by a Pub/Sub Eventarc trigger using the gcloud command line, but I haven't found a way to specify the Pub/Sub topic with the gcloud command.
I have tried executing the gcloud command like this:
gcloud functions deploy <function_name> \
--gen2 \
--source=. \
--trigger-event-filters=type=google.cloud.pubsub.topic.v1.messagePublished \
--trigger-location=asia-southeast2 \
--trigger-service-account=<service_account> \
--runtime=python310 \
--entry-point=hello_pubsub \
--project=<project_id> \
--region=asia-southeast2
But I got this error:
(gcloud.functions.deploy) INVALID_ARGUMENT: Pubsub topic must be set
for events with type google.cloud.pubsub.topic.v1.messagePublished.
I have checked the GCP Eventarc Cloud Functions documentation, but it does not mention how to specify the Pub/Sub topic.
My objective is to call this gcloud command from the cloud build pipeline to automatically deploy the cloud function.
Thank You
You can use --trigger-topic to specify the topic.
gcloud functions deploy <function_name> \
--gen2 \
--source=. \
--trigger-topic=topic_name
....
The --trigger-event-filters flag can be used to filter events based on other attributes. Check out the linked documentation for more information.
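For reference, a sketch of the deploy command from the question with --trigger-topic in place of the event filter (the topic name is a placeholder; the runtime, entry point, service account, project, and region are carried over from the question):
gcloud functions deploy <function_name> \
--gen2 \
--source=. \
--trigger-topic=<topic_name> \
--trigger-service-account=<service_account> \
--runtime=python310 \
--entry-point=hello_pubsub \
--project=<project_id> \
--region=asia-southeast2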

GCP Cloud Logging cost increasing with Dataproc image version 2.0.39-ubuntu18

I have a Dataproc cluster with image version 2.0.39-ubuntu18, which seems to be putting all logs into Cloud Logging, and this is increasing our costs a lot.
I've added the following properties - spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs -
to stop using Cloud Logging; however, that is not working. Logs are still being redirected to Cloud Logging as well.
Here is the command used to create the Dataproc cluster:
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18) with a configuration similar to the 2.0-ubuntu18 cluster, but it does not seem to use Cloud Logging as much.
Attached is a screenshot of the properties of both clusters.
What do I need to change to ensure the Dataproc (PySpark) jobs do not use Cloud Logging?
tia!
I saw dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default, the value is false, which means driver logs will be saved to GCS and streamed back to the client for viewing, but they won't be saved to Cloud Logging. You can try disabling it. BTW, when it is enabled, the job driver logs will be available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster or write an init action with systemctl stop google-fluentd.service. Both will stop Cloud Logging on the cluster's side, but using the property is recommended.
See Dataproc cluster properties for the property.
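As a rough sketch (reusing the variables from the question; all other flags stay as in the original create command), the relevant change is in the --properties flag:
# Stop streaming driver logs (and, optionally, all cluster logs) to Cloud
# Logging, while keeping Spark event logs and job history in GCS.
gcloud beta dataproc clusters create $CNAME \
--region $REGION \
--image-version $IMG_VERSION \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=false,dataproc:dataproc.logging.stackdriver.enable=false,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs'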
Here is the update on this (based on discussions with GCP Support):
In Cloud Logging, we need to create a Log Router sink with an inclusion filter - this will write the logs to BigQuery or Cloud Storage, depending on the target you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so specific logs will NOT be redirected to Cloud Logging.
Attached are screenshots of the _Default log sink and the Inclusion sink for Dataproc.
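A rough sketch of that setup with gcloud (the sink name, bucket, and filter below are assumptions; adjust them to match what was agreed with GCP Support):
# Route Dataproc cluster logs to a Cloud Storage bucket via a dedicated sink.
gcloud logging sinks create dataproc-logs-to-gcs \
storage.googleapis.com/my-dataproc-log-bucket \
--log-filter='resource.type="cloud_dataproc_cluster"'

# Exclude the same logs from the _Default sink so they are no longer stored
# (and billed) in Cloud Logging; recent gcloud versions support this directly.
gcloud logging sinks update _Default \
--add-exclusion=name=exclude-dataproc,filter='resource.type="cloud_dataproc_cluster"'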

How to run a default provided Dataflow template in gcloud?

How do I run a gcloud command that will create a Dataflow job from a default template, e.g. Pub/Sub Topic to BigQuery? I can do this via the console, but I'm looking to get this done via the command line if possible.
gcloud dataflow jobs run mydataflowjob \
--gcs-location ... \
--parameters ... \
To answer my own question, I downloaded the template from GCP's GitHub and moved it to a storage bucket:
wget https://raw.githubusercontent.com/GoogleCloudPlatform/DataflowTemplates/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java
export BUCKET_URI=gs://mybucketname && \
export TEMPLATE_NAME=PubSubToBigQuery.java && \
gsutil cp $TEMPLATE_NAME $BUCKET_URI && \
Then I passed the bucket file path to --gcs-location:
gcloud dataflow jobs run $DATAFLOW_NAME \
--gcs-location $BUCKET_URI/$TEMPLATE_NAME \
--parameters \
topic=projects/$PROJECT_ID/topics/$BQ_DATASET_NAME-$BQ_TABLE_NAME,\
table=$PROJECT_ID:$BQ_DATASET_NAME.$BQ_TABLE_NAME
I still need to figure out how to pass a temp location (perhaps something to do with service account permissions? For another thread though...).
Edit
In fact, the default templates are located here: gs://dataflow-templates-us-central1/latest/PubSub_to_BigQuery
So the command to run the job would be:
gcloud dataflow jobs run $DATAFLOW_NAME \
--gcs-location gs://dataflow-templates-us-central1/latest/PubSub_to_BigQuery \
--region us-central1 \
--staging-location $BUCKET_URI/temp \
--parameters \
inputTopic=projects/pubsub-public-data/topics/taxirides-realtime,\
outputTableSpec=$PROJECT_ID:$BQ_DATASET_NAME.$BQ_TABLE_NAME

Google Cloud PubSub disable retrying or set a minimum

I am using Cloud Scheduler with PubSub and Cloud Run.
Sometimes my service is triggered more than once, even after a successful response (HTTP 204 No Content) from my service running on Cloud Run.
My service takes around 200 seconds to respond to the POST made by Pub/Sub.
My question is: how can I limit the number of retries from Pub/Sub? Or did I make some mistake, like multiple subscriptions (I have only one subscription, I just checked in the console)?
What is strange is that when I trigger the Cloud Scheduler job, Pub/Sub calls my service several times (see the screenshot below).
I am deploying my Pub/Sub resources and Cloud Run service as follows:
export PROJECT_ID=...
export PROJECT_NUMBER=$(gcloud projects describe --format 'value(projectNumber)' ${PROJECT_ID})
Set up Cloud Scheduler (needs to be done only once per project)
gcloud pubsub topics create supervisor-cron --project ${PROJECT_ID}
Create a Pub/Sub subscription
gcloud pubsub subscriptions create supervisor-subscription \
--topic supervisor-cron \
--project ${PROJECT_ID}
Create a Cloud Scheduler job at https://console.cloud.google.com/cloudscheduler
Enable Pub/Sub to create authentication tokens in your project
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member=serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator
Create or select a service account to represent the Pub/Sub subscription identity
gcloud iam service-accounts create cloud-run-pubsub-invoker \
--display-name "Cloud Run Pub/Sub Invoker" \
--project ${PROJECT_ID}
Deploy Cloud Run
gcloud builds submit --tag gcr.io/${PROJECT_ID}/supervisor --project ${PROJECT_ID}
gcloud run deploy supervisor \
--set-env-vars APP_BASEURL=$(gcloud run services describe anotherservice --format 'value(status.url)' --platform managed --project ${PROJECT_ID}) \
--set-env-vars APP_HEALTHCHECKS=https://hc-ping.com \
--platform managed \
--no-allow-unauthenticated \
--timeout=900 \
--image gcr.io/${PROJECT_ID}/supervisor \
--project ${PROJECT_ID}
Create a Pub/Sub subscription with the service account
gcloud run services add-iam-policy-binding supervisor \
--member=serviceAccount:cloud-run-pubsub-invoker@${PROJECT_ID}.iam.gserviceaccount.com \
--role=roles/run.invoker \
--platform managed \
--project ${PROJECT_ID}
gcloud pubsub subscriptions create supervisor-subscription \
--topic supervisor-cron \
--push-endpoint=$(gcloud run services describe supervisor --format 'value(status.url)' --platform managed --project ${PROJECT_ID}) \
--push-auth-service-account=cloud-run-pubsub-invoker@${PROJECT_ID}.iam.gserviceaccount.com
You need to extend the ACK deadline in your PubSub subscription.
Add this parameter when you create it; 600s (10 minutes) is the max value.
--ack-deadline=600
There are also other parameters to control retries, such as the minimum and maximum delay between retries and a dead-letter topic after a maximum number of delivery attempts. Have a look here.
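For example, a sketch of the push subscription from the question recreated with the longer ack deadline and an explicit retry backoff (the 10s/600s backoff values are illustrative assumptions):
gcloud pubsub subscriptions create supervisor-subscription \
--topic supervisor-cron \
--push-endpoint=$(gcloud run services describe supervisor --format 'value(status.url)' --platform managed --project ${PROJECT_ID}) \
--push-auth-service-account=cloud-run-pubsub-invoker@${PROJECT_ID}.iam.gserviceaccount.com \
--ack-deadline=600 \
--min-retry-delay=10s \
--max-retry-delay=600s \
--project ${PROJECT_ID}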

How to keep Google Dataproc master running?

I created a cluster on Dataproc and it works great. However, after the cluster is idle for a while (~90 min), the master node automatically stops. This happens to every cluster I create. I see there is a similar question here: Keep running Dataproc Master node
It looks like it's an initialization action problem. However, the post does not give me enough info to fix the issue. Below is the command I used to create the cluster:
gcloud dataproc clusters create $CLUSTER_NAME \
--project $PROJECT \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--master-machine-type $MASTER_MACHINE_TYPE \
--master-boot-disk-size $MASTER_DISK_SIZE \
--worker-boot-disk-size $WORKER_DISK_SIZE \
--num-workers=$NUM_WORKERS \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--scopes cloud-platform \
--metadata JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn \
--optional-components=ANACONDA,JUPYTER \
--image-version=1.3
I need the BigQuery connector, GCS connector, Jupyter and DataLab for my cluster.
How can I keep my master node running? Thank you.
As summarized in the comment thread, this is indeed caused by Datalab's auto-shutdown feature. There are a couple of ways to change this behavior:
Upon first creating the Datalab-enabled Dataproc cluster, log in to Datalab and click on the "Idle timeout in about ..." text to disable it: https://cloud.google.com/datalab/docs/concepts/auto-shutdown#disabling_the_auto_shutdown_timer - The text will change to "Idle timeout is disabled"
Edit the initialization action to set the environment variable as suggested by yelsayed:
function run_datalab(){
  if docker run -d --restart always --net=host -e "DATALAB_DISABLE_IDLE_TIMEOUT_PROCESS=true" \
      -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}
And use your custom initialization action instead of the stock gs://dataproc-initialization-actions one, as sketched below. It could be worth filing a tracking issue in the GitHub repo for Dataproc initialization actions too, suggesting that the timeout be disabled by default or that an easy metadata-based option be provided. It's probably true that the auto-shutdown behavior isn't what users expect in default usage on a Dataproc cluster, since the master also performs roles other than running the Datalab service.
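A rough sketch of that approach (gs://my-init-actions is a placeholder for a bucket you own; only the flags relevant to the change are shown, the rest stay as in the question's create command):
# Stage the modified datalab.sh in your own bucket.
gsutil cp datalab.sh gs://my-init-actions/datalab.sh

# Create the cluster pointing at the custom copy instead of the stock one.
gcloud dataproc clusters create $CLUSTER_NAME \
--project $PROJECT \
--region $REGION \
--zone $ZONE \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://my-init-actions/datalab.sh \
--optional-components=ANACONDA,JUPYTER \
--image-version=1.3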