I am trying to create a dataproc cluster that will connect dataproc to pubsub. I need to add multiple jars on cluster creation in the spark.jars flag
gcloud dataproc clusters create cluster-2c76 --region us-central1 --zone us-central1-f --master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 1.4-debian10 \
--properties spark:spark.jars=gs://bucket/jars/spark-streaming-pubsub_2.11-2.4.0.jar,gs://bucket/jars/google-oauth-client-1.31.0.jar,gs://bucket/jars/google-cloud-datastore-2.2.0.jar,gs://bucket/jars/pubsublite-spark-sql-streaming-0.2.0.jar spark:spark.driver.memory=3000m \
--initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh \
--metadata spark-bigquery-connector-version=0.21.0 \
--scopes=pubsub,datastore
I get thrown this error
ERROR: (gcloud.dataproc.clusters.create) argument --properties: Bad syntax for dict arg: [gs://gregalr/jars/spark-streaming-pubsub_2.11-2.3.4.jar]. Please see `gcloud topic flags-file` or `gcloud topic escaping` for information on providing list or dictionary flag values with special characters.
This looked promising, but fails
If there is a better way to connect dataproc to pubsub, please share
The answer you linked is the correct way to do it: How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?
If you also post the command you tried with the escaping syntax and the resulting error message then others could more easily verify what you did wrong. It looks like you're specifying an additional spark property in addition to your list of jars spark:spark.driver.memory=3000m, and tried to just space-separate that from your jars flag, which isn't allowed.
Per the linked result, you'd need to use the newly assigned separator character to separate the second spark property:
--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3#spark:spark.driver.memory=3000m
Related
I've a Dataproc cluster with image version - 2.0.39-ubuntu18, which seems to be putting all logs into Cloud Logging, this is increasing our costs a lot.
Here is the command used to create the cluster, i've added the following - spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs
to stop using the Cloud Logging, however that is not working .. Logs are being re-directed to Cloud Logging as well.
Here is the command used to create the Dataproc cluster :
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18, similar configuration as the image version 2.0-ubuntu18), which has similar configuration but does not seem to using Cloud Logging as much.
Attached is screenshot properties of both the clusters.
What do i need to change to ensure the Dataproc jobs(pyspark) donot use the Cloud Logging ?
tia!
[
I saw dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default, the value is false, which means driver logs will be saved to GCS and streamed back to the client for viewing, but it won't be saved to Cloud Logging. You can try disabling it. BTW, when it is enabled, the job driver logs will be available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster or write an init action with systemctl stop google-fluentd.service. Both will stop Cloud Logging on the cluster's side, but using property is recommended.
See Dataproc cluster properties for the property.
Here is the update on this (based on discussions with GCP Support) :
In the GCP Logging, we need to create a Log Routing sink with inclusion filter - this will write the logs to BigQuery or Cloud Storage depending upon the target you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so specific logs will NOT be re-directed to GCP Logging
Attached are screenshots of the _Default log sink and the Inclusion sink for Dataproc.
I am trying to submit a dataproc job on a cluster running Presto with the postgresql connector.
The cluster is initialized as followed:
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
--project=${PROJECT} \
--region=${REGION} \
--zone=${ZONE} \
--bucket=${BUCKET_NAME} \
--num-workers=${WORKERS} \
--scopes=cloud-platform \
--initialization-actions=${INIT_ACTION}
${INIT_ACTION} point to a bash file with the initialization actions for starting a presto cluster with postgresql.
I do not use --optional-components=PRESTO since I need --initialization-actions to perform non-default operations. And having both --optional-component and --initialization-actions does not work.
When I try to run a simple job:
gcloud beta dataproc jobs submit presto \
--cluster ${CLUSTER_NAME} \
--region ${REGION} \
-e "SHOW TABLES"
I get the following error:
ERROR: (gcloud.beta.dataproc.jobs.submit.presto) FAILED_PRECONDITION: Cluster
'<cluster-name>' requires optional component PRESTO to run PRESTO jobs
Is there some other way to define the optional component on the cluster?
UPDATE:
Using both --optional-component and --initialization-actions, as:
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
...
--scopes=cloud-platform \
--optional-components=PRESTO \
--image-version=1.3 \
--initialization-actions=${INIT_ACTION} \
--metadata ...
The ${INIT_ACTION} is copied from this repo. With a slight modification to the function configure_connectors to create a postgresql connector.
When running the create cluster the following error is given:
ERROR: (gcloud.beta.dataproc.clusters.create) Operation [projects/...] failed: Initialization action failed. Failed action 'gs://.../presto_config.sh', see output in: gs://.../dataproc-initialization-script-0_output.
The error output is logged as:
+ presto '--execute=select * from system.runtime.nodes;'
Error running command: java.net.ConnectException: Failed to connect to localhost/0:0:0:0:0:0:0:1:8080
Which leads me to believe I have to re-write the initialization script.
It would be nice to know which initialization script is running when I specify --optional-components=PRESTO.
If all you want to do is setup the optional component to work with a Postgres endpoint writing an optional component to do it is pretty easy. You just have to add the catalog file and restart presto.
https://gist.github.com/KoopaKing/8e653e0c8d095323904946045c5fa4c2
Is an example init action. I have tested it successfully with the presto optional component, but it is pretty simple. Feel free to fork the example and stage it in your GCS bucket.
I created a cluster on Dataproc and it works great. However, after the cluster is idle for a while (~90 min), the master node will automatically stops. This happens to every cluster I created. I see there is a similar question here: Keep running Dataproc Master node
It looks like it's the initialization action problem. However the post does not give me enough info to fix the issue. Below are the commands I used to create the cluster:
gcloud dataproc clusters create $CLUSTER_NAME \
--project $PROJECT \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--master-machine-type $MASTER_MACHINE_TYPE \
--master-boot-disk-size $MASTER_DISK_SIZE \
--worker-boot-disk-size $WORKER_DISK_SIZE \
--num-workers=$NUM_WORKERS \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--scopes cloud-platform \
--metadata JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn \
--optional-components=ANACONDA,JUPYTER \
--image-version=1.3
I need the BigQuery connector, GCS connector, Jupyter and DataLab for my cluster.
How can I keep my master node running? Thank you.
As summarized in the comment thread, this is indeed caused by Datalab's auto-shutdown feature. There are a couple ways to change this behavior:
Upon first creating the Datalab-enabled Dataproc cluster, log in to Datalab and click on the "Idle timeout in about ..." text to disable it: https://cloud.google.com/datalab/docs/concepts/auto-shutdown#disabling_the_auto_shutdown_timer - The text will change to "Idle timeout is disabled"
Edit the initialization action to set the environment variable as suggested by yelsayed:
function run_datalab(){
if docker run -d --restart always --net=host -e "DATALAB_DISABLE_IDLE_TIMEOUT_PROCESS=true" \
-v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
echo 'Cloud Datalab Jupyter server successfully deployed.'
else
err 'Failed to run Cloud Datalab'
fi
}
And use your custom initialization action instead of the stock gs://dataproc-initialization-actions one. It could be worth filing a tracking issue in the github repo for dataproc initialization actions too, suggesting to disable the timeout by default or provide an easy metadata-based option. It's probably true that the auto-shutdown behavior isn't as expected in default usage on a Dataproc cluster since the master is also performing roles other than running the Datalab service.
Playing with the newest Bigtable feature: cross-region replication.
I've created an instance and a replica cluster in a different region with this snippet:
gcloud bigtable instances create ${instance_id} \
--cluster=${cluster_id} \
--cluster-zone=${ZONE} \
--display-name=${cluster_id} \
--cluster-num-nodes=${BT_CLUSTER_NODES} \
--cluster-storage-type=${BT_CLUSTER_STORAGE} \
--instance-type=${BT_TYPE} \
--project=${PROJECT_ID}
gcloud beta bigtable clusters create ${cluster_id} \
--instance=${instance_id} \
--zone=${ZONE} \
--num-nodes=${BT_CLUSTER_NODES} \
--project=${PROJECT_ID}
The instance created successfully, but creating the replica cluster gave me an error: ERROR: (gcloud.beta.bigtable.clusters.create) Metric 'bigtable.googleapis.com/ReplicationFromEUToNA' not defined in the service configuration.
However the cluster created and replication worked.
I know this is currently beta, but do I need to change my setup script, or this is something on GCP side?
I can confirm that this is an issue on the GCP side. As you noted this is happening after replication is set up, so there should be no practical impact to you.
We have a ticket open to fix the underlying issue, which is purely around reporting the successful copy to our own internal monitoring. Thanks for the report!
GCP Dataproc service now supports creating a cluster with GPUs as a beta feature. The problem I met was that when I tried to specify the GPU type, gcloud command line cannot recognize the type I specified.
The gcloud command I use is shown below.
gcloud beta dataproc clusters create gpu-cluster \
--zone us-east1-b \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 50 \
--initialization-actions gs://15418-initial-script/initialize_cluster.sh \
--worker-accelerator type=nvidia-tesla-p100,count=1
I returned with:
ERROR: (gcloud.beta.dataproc.clusters.create) INVALID_ARGUMENT: Insufficient 'NVIDIA_K80_GPUS' quota. Requested 2.0, available 0.0.
Anyone knows what happened? Am I using wrong command or is there something wrong with gcloud command line?
You may need to request quota for the GPUs
Check the quotas page for your project to ensure that you have sufficient GPU quota (NVIDIA_K80_GPUS or NVIDIA_P100_GPUS) available in your project. If GPUs are not listed on the quotas page or you require additional GPU quota, request a quota increase.