GCP Dataproc create cluster with gpus error - google-cloud-platform

The GCP Dataproc service now supports creating a cluster with GPUs as a beta feature. My problem is that when I specify a GPU type, the gcloud command line does not seem to recognize it.
The gcloud command I use is shown below.
gcloud beta dataproc clusters create gpu-cluster \
--zone us-east1-b \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 50 \
--initialization-actions gs://15418-initial-script/initialize_cluster.sh \
--worker-accelerator type=nvidia-tesla-p100,count=1
It returned:
ERROR: (gcloud.beta.dataproc.clusters.create) INVALID_ARGUMENT: Insufficient 'NVIDIA_K80_GPUS' quota. Requested 2.0, available 0.0.
Does anyone know what happened? Am I using the wrong command, or is there something wrong with the gcloud command line?

You may need to request GPU quota for your project.
Check the quotas page for your project to ensure that you have sufficient GPU quota (NVIDIA_K80_GPUS or NVIDIA_P100_GPUS) available in your project. If GPUs are not listed on the quotas page or you require additional GPU quota, request a quota increase.
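As a rough sketch of that check: the cluster in the question asks for 2 workers with 1 accelerator each, which matches the "Requested 2.0" in the error. The snippet below computes that number and, only if gcloud is installed, lists the GPU quota rows for the region. The region us-east1 is inferred from the zone in the question, and the grep filter is just one assumed way to narrow the output.

```shell
# Values taken from the cluster command in the question.
NUM_WORKERS=2
GPUS_PER_WORKER=1
REGION="us-east1"   # assumption: the region that contains zone us-east1-b

# Total GPU quota the cluster needs in that region.
REQUIRED_GPUS=$((NUM_WORKERS * GPUS_PER_WORKER))
echo "GPUs required in ${REGION}: ${REQUIRED_GPUS}"

# List the regional quota metrics and keep only the GPU rows.
# Skipped entirely when gcloud is not on PATH.
if command -v gcloud >/dev/null 2>&1; then
  gcloud compute regions describe "${REGION}" \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)" \
    | grep GPUS || echo "No GPU quota rows found for ${REGION}"
fi
```

If the limit shown for the relevant GPU metric is lower than the required number, the create call is expected to fail with the quota error above until a quota increase is granted.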


GCP Cloud Logging Cost increasing with Dataproc img version 2.0.39-ubuntu18

I have a Dataproc cluster with image version 2.0.39-ubuntu18, which seems to be sending all logs to Cloud Logging, and this is increasing our costs a lot.
To stop using Cloud Logging, I added the following properties: spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs
However, that is not working; logs are still being sent to Cloud Logging as well.
Here is the command used to create the Dataproc cluster:
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18) with a similar configuration, but it does not seem to use Cloud Logging as much.
Screenshots of the properties of both clusters are attached.
What do I need to change to ensure the Dataproc jobs (PySpark) do not use Cloud Logging?
tia!
I see that dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default the value is false, which means driver logs are saved to GCS and streamed back to the client for viewing, but are not saved to Cloud Logging. You can try disabling it. BTW, when it is enabled, the job driver logs are available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster or write an init action that runs systemctl stop google-fluentd.service. Both stop Cloud Logging on the cluster's side, but using the property is recommended.
See Dataproc cluster properties for the property.
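A minimal sketch of the property-based approach, using only the two property names quoted in this answer. It just assembles and prints the value; how you merge it with your other properties is up to you, and the variable name is hypothetical.

```shell
# Property names are the ones quoted in the answer; both are set to false here.
LOGGING_PROPS="dataproc:dataproc.logging.stackdriver.enable=false"
LOGGING_PROPS="${LOGGING_PROPS},dataproc:dataproc.logging.stackdriver.job.driver.enable=false"

# Print the fragment to merge into the single comma-separated --properties
# value passed to `gcloud dataproc clusters create`.
echo "--properties '${LOGGING_PROPS}'"
```

In the create command from the question, this would mean changing the existing job.driver.enable=true entry to false and adding the stackdriver.enable=false entry to the same list.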
Here is an update on this (based on discussions with GCP Support):
In Cloud Logging, we need to create a Log Router sink with an inclusion filter; this writes the matching logs to BigQuery or Cloud Storage, depending on the destination you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so that those specific logs will NOT be routed to Cloud Logging.
Attached are screenshots of the _Default log sink and the Inclusion sink for Dataproc.
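Since the screenshots are not reproduced here, the following is only a sketch of what the two sinks could look like from the command line. The sink name, bucket name, exclusion name, and the filter are all assumptions made for illustration; the snippet prints the commands rather than running them so they can be reviewed first.

```shell
# Hypothetical names: replace with resources from your own project.
SINK_NAME="dataproc-logs-to-gcs"
DESTINATION="storage.googleapis.com/my-dataproc-log-archive"
# Assumed filter: every log entry attached to a Dataproc cluster resource.
FILTER='resource.type="cloud_dataproc_cluster"'

# Inclusion sink: route the matching logs to the Cloud Storage bucket.
CREATE_CMD="gcloud logging sinks create ${SINK_NAME} ${DESTINATION} --log-filter='${FILTER}'"

# Exclusion on _Default: keep the same logs out of the default log bucket.
UPDATE_CMD="gcloud logging sinks update _Default --add-exclusion=name=exclude-dataproc,filter='${FILTER}'"

# Dry run: print both commands for review before executing them by hand.
echo "${CREATE_CMD}"
echo "${UPDATE_CMD}"
```

A narrower filter (for example, restricted to one cluster or one log name) is probably preferable in practice, so that logs you still want to search in Cloud Logging are not excluded.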

creating dataproc cluster with multiple jars

I am trying to create a Dataproc cluster that connects to Pub/Sub. I need to add multiple jars at cluster creation through the spark.jars property:
gcloud dataproc clusters create cluster-2c76 --region us-central1 --zone us-central1-f --master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 1.4-debian10 \
--properties spark:spark.jars=gs://bucket/jars/spark-streaming-pubsub_2.11-2.4.0.jar,gs://bucket/jars/google-oauth-client-1.31.0.jar,gs://bucket/jars/google-cloud-datastore-2.2.0.jar,gs://bucket/jars/pubsublite-spark-sql-streaming-0.2.0.jar spark:spark.driver.memory=3000m \
--initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh \
--metadata spark-bigquery-connector-version=0.21.0 \
--scopes=pubsub,datastore
I get this error:
ERROR: (gcloud.dataproc.clusters.create) argument --properties: Bad syntax for dict arg: [gs://gregalr/jars/spark-streaming-pubsub_2.11-2.3.4.jar]. Please see `gcloud topic flags-file` or `gcloud topic escaping` for information on providing list or dictionary flag values with special characters.
This looked promising, but fails
If there is a better way to connect dataproc to pubsub, please share
The answer you linked is the correct way to do it: How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?
If you also post the command you tried with the escaping syntax and the resulting error message, others can more easily verify what went wrong. It looks like you are specifying a second Spark property, spark:spark.driver.memory=3000m, in addition to your list of jars, and you separated it from the jars with a space, which isn't allowed.
gcloud splits the --properties value on commas by default, so the commas inside the jar list are also treated as property separators. Per the linked answer, you need to declare a different separator character and use it to separate the second Spark property:
--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3#spark:spark.driver.memory=3000m
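Applied to the command in the question, the value could be assembled roughly like this. The jar paths are copied from the question ("bucket" is the question's own placeholder), and the variable names are hypothetical; the snippet only builds and prints the flag.

```shell
# Jar paths copied from the question, joined with commas as spark.jars expects.
JARS="gs://bucket/jars/spark-streaming-pubsub_2.11-2.4.0.jar"
JARS="${JARS},gs://bucket/jars/google-oauth-client-1.31.0.jar"
JARS="${JARS},gs://bucket/jars/google-cloud-datastore-2.2.0.jar"
JARS="${JARS},gs://bucket/jars/pubsublite-spark-sql-streaming-0.2.0.jar"

# The leading ^#^ tells gcloud to split properties on '#' instead of ',',
# so the commas inside the jar list stay part of a single value.
PROPERTIES="^#^spark:spark.jars=${JARS}#spark:spark.driver.memory=3000m"

# Print the flag to paste into `gcloud dataproc clusters create`.
echo "--properties='${PROPERTIES}'"
```

Any character that does not occur in the property values can serve as the separator; '#' is simply the one used in the answer.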

GCP composer creation failed with bad request

I am trying to create a Cloud Composer environment with the gcloud CLI:
gcloud composer environments create "jakub" \
--project "projectX" \
--location "us-central1" \
--zone "us-central1-a" \
--disk-size 50GB \
--node-count 3 \
--image-version composer-1.7.1-airflow-1.10.2 \
--machine-type n1-standard-2 \
--python-version 3 \
--labels env="test"
After about an hour, it fails with this error:
f7b3f4-6b95-4fb0-85e3-f39a2b11cec9] failed: Http error status code: 400
Http error message: BAD REQUEST
Additional errors:
{"ResourceType":"appengine.v1.version","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Legacy health checks are no longer supported for the App Engine Flexible environment. Please remove the 'health_check' section from your app.yaml and configure updated health checks. For instructions on migrating to split health checks see https://cloud.google.com/appengine/docs/flexible/java/migrating-to-split-health-checks","status":"INVALID_ARGUMENT","details":[],"statusMessage":"Bad Request","requestPath":"https://appengine.googleapis.com/v1/apps/vd41e6cad4ccb2a7b-tp/services/default/versions","httpMethod":"POST"}}
Based on https://cloud.google.com/sdk/gcloud/reference/composer/environments/create
This happens because the image version you are trying to use is too old. Retry creating the Composer environment with a newer supported version; see Supported Cloud Composer versions.

How to keep Google Dataproc master running?

I created a cluster on Dataproc and it works great. However, after the cluster has been idle for a while (~90 min), the master node automatically stops. This happens to every cluster I create. I see there is a similar question here: Keep running Dataproc Master node
It looks like it's the initialization action problem. However the post does not give me enough info to fix the issue. Below are the commands I used to create the cluster:
gcloud dataproc clusters create $CLUSTER_NAME \
--project $PROJECT \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--master-machine-type $MASTER_MACHINE_TYPE \
--master-boot-disk-size $MASTER_DISK_SIZE \
--worker-boot-disk-size $WORKER_DISK_SIZE \
--num-workers=$NUM_WORKERS \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--scopes cloud-platform \
--metadata JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn \
--optional-components=ANACONDA,JUPYTER \
--image-version=1.3
I need the BigQuery connector, GCS connector, Jupyter and DataLab for my cluster.
How can I keep my master node running? Thank you.
As summarized in the comment thread, this is indeed caused by Datalab's auto-shutdown feature. There are a couple ways to change this behavior:
Upon first creating the Datalab-enabled Dataproc cluster, log in to Datalab and click on the "Idle timeout in about ..." text to disable it: https://cloud.google.com/datalab/docs/concepts/auto-shutdown#disabling_the_auto_shutdown_timer - The text will change to "Idle timeout is disabled"
Edit the initialization action to set the environment variable as suggested by yelsayed:
function run_datalab(){
  if docker run -d --restart always --net=host -e "DATALAB_DISABLE_IDLE_TIMEOUT_PROCESS=true" \
      -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}
And use your custom initialization action instead of the stock gs://dataproc-initialization-actions one. It could be worth filing a tracking issue in the github repo for dataproc initialization actions too, suggesting to disable the timeout by default or provide an easy metadata-based option. It's probably true that the auto-shutdown behavior isn't as expected in default usage on a Dataproc cluster since the master is also performing roles other than running the Datalab service.
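One possible way to produce that custom initialization action is to patch a copy of the stock script with sed. The sketch below works on a stand-in file rather than the real datalab.sh: the exact wording of the stock docker run line is an assumption (inferred from the modified function above), and the output file name is hypothetical.

```shell
# Stand-in for a downloaded copy of the stock datalab.sh; the exact original
# line is an assumption. A real copy could be fetched with, for example:
#   gsutil cp gs://dataproc-initialization-actions/datalab/datalab.sh .
WORKDIR="$(mktemp -d)"
cat > "${WORKDIR}/datalab.sh" <<'EOF'
  if docker run -d --restart always --net=host \
EOF

# Insert the environment variable right after --net=host.
sed 's/--net=host/--net=host -e "DATALAB_DISABLE_IDLE_TIMEOUT_PROCESS=true"/' \
  "${WORKDIR}/datalab.sh" > "${WORKDIR}/datalab-no-timeout.sh"

# Show the patched line for review.
cat "${WORKDIR}/datalab-no-timeout.sh"

# Next step (not run here): upload the patched script to your own bucket and
# reference it in --initialization-actions instead of the stock path, e.g.
#   gsutil cp "${WORKDIR}/datalab-no-timeout.sh" gs://$BUCKET/datalab-no-timeout.sh
```

It is worth diffing the patched file against the original before uploading, since the stock script may change over time and the sed pattern may no longer match.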

Creating Bigtable replica cluster gives metric error

Playing with the newest Bigtable feature: cross-region replication.
I've created an instance and a replica cluster in a different region with this snippet:
gcloud bigtable instances create ${instance_id} \
--cluster=${cluster_id} \
--cluster-zone=${ZONE} \
--display-name=${cluster_id} \
--cluster-num-nodes=${BT_CLUSTER_NODES} \
--cluster-storage-type=${BT_CLUSTER_STORAGE} \
--instance-type=${BT_TYPE} \
--project=${PROJECT_ID}
gcloud beta bigtable clusters create ${cluster_id} \
--instance=${instance_id} \
--zone=${ZONE} \
--num-nodes=${BT_CLUSTER_NODES} \
--project=${PROJECT_ID}
The instance created successfully, but creating the replica cluster gave me an error: ERROR: (gcloud.beta.bigtable.clusters.create) Metric 'bigtable.googleapis.com/ReplicationFromEUToNA' not defined in the service configuration.
However the cluster created and replication worked.
I know this is currently in beta, but do I need to change my setup script, or is this something on the GCP side?
I can confirm that this is an issue on the GCP side. As you noted this is happening after replication is set up, so there should be no practical impact to you.
We have a ticket open to fix the underlying issue, which is purely around reporting the successful copy to our own internal monitoring. Thanks for the report!