Getting DataProc output in GCP Logging

I have a DataProc job that outputs some logs during execution, and I can see those logs in the job output.
My cluster is created according to the documentation with the following properties:
dataproc:jobs.file-backed-output.enable=true
dataproc:dataproc.logging.stackdriver.enable=true
dataproc:dataproc.logging.stackdriver.job.driver.enable=true
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true
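For reference, a cluster with these properties can be created along the following lines (the cluster name and region here are placeholders, not the actual ones used):
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties='dataproc:jobs.file-backed-output.enable=true,dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true'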
I can see all the system logs in Logging, but not the output from my job. The most I could find is the URL of the rolling output file (not even a concrete file).
Is there any way to forward the job output to Logging?
As per the documentation, the cluster can be created with spark:spark.submit.deployMode=cluster so that the output is logged into the yarn user logs group. But whenever I do that, my job fails with:
21/03/15 16:20:16 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!

I was able to create a cluster and submit jobs as follows.
I went to Stackdriver Logging and refreshed the page.
After refreshing, I could see the Cloud Dataproc Job logging filter.
I also noticed that for both jobs I ran, the job output was logged under 'Any Log Level'; I'm not sure whether you are using any log-level filtering.
Are you able to see Cloud Dataproc Job in the logging filter after passing dataproc:dataproc.logging.stackdriver.job.driver.enable=true and refreshing the page?
Are you using one of the supported image versions?
Repro steps:
Cluster Creation:
gcloud dataproc clusters create log-exp --region=us-central1 \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true'
Job Submission: PySpark
gcloud dataproc jobs submit pyspark \
gs://dataproc-examples/pyspark/hello-world/hello-world.py \
--cluster=log-exp \
--region=us-central1
Job Submission: Spark
gcloud dataproc jobs submit spark \
--cluster=log-exp \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
-- 100
Cloud Dataproc Job filter
Logging level
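If the Cloud Dataproc Job entry does not show up in the resource filter, the same entries can also be queried directly with gcloud; a rough sketch (the project ID and job ID below are placeholders):
gcloud logging read \
  'resource.type="cloud_dataproc_job" AND resource.labels.job_id="JOB_ID"' \
  --project=PROJECT_ID \
  --limit=20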

Related

Cloud Profiler is not working for all Dataflow jobs

We're creating Dataflow job templates and launching new jobs using the
google-api-python-client library. Cloud Profiler is enabled for all jobs by default during Dataflow job template creation.
python3 -m app.image_embeddings \
--job_name "image-embeddings" \
--region "us-central1" \
--runner "DataflowRunner" \
...
--experiment "use_runner_v2" \
--experiment "enable_google_cloud_profiler" \
--experiment "enable_google_cloud_heap_sampling" \
--dataflow_service_options=enable_google_cloud_profiler
Even though no changes were made on our end, some jobs are profiled by the Profiler and some are not.
According to the logs, the Profiler is enabled and there are no errors, but the job profile is still not available for some jobs. The following message is shown when viewing the Profiler link for such a job:
There were profiles collected for the specified time range, but none match the current filters.
Is this an issue on the GCP end, or is it related to our implementation?
Do all the jobs use the same user or service account? It might be a permissions issue; maybe some service accounts are missing the roles/cloudprofiler.agent role?
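If a service account does turn out to be missing that role, granting it is a one-liner; a sketch with placeholder project and service account names:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/cloudprofiler.agent'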

GCP Cloud Logging Cost increasing with Dataproc img version 2.0.39-ubuntu18

I have a Dataproc cluster with image version 2.0.39-ubuntu18, which seems to be putting all logs into Cloud Logging, and this is increasing our costs a lot.
To the cluster creation command I added spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs and spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs
to stop using Cloud Logging; however, that is not working and logs are still being redirected to Cloud Logging.
Here is the command used to create the Dataproc cluster:
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18) with a similar configuration, but it does not seem to use Cloud Logging nearly as much.
Attached are screenshots of the properties of both clusters.
What do I need to change to ensure the Dataproc (PySpark) jobs do not use Cloud Logging?
Thanks in advance!
I saw dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default, the value is false, which means driver logs will be saved to GCS and streamed back to the client for viewing, but it won't be saved to Cloud Logging. You can try disabling it. BTW, when it is enabled, the job driver logs will be available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster or write an init action with systemctl stop google-fluentd.service. Both will stop Cloud Logging on the cluster's side, but using the property is recommended.
See Dataproc cluster properties for the property.
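As a minimal sketch of the property-based approach (cluster name and region are placeholders):
gcloud dataproc clusters create my-cluster \
  --region=us-east1 \
  --properties='dataproc:dataproc.logging.stackdriver.enable=false'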
Here is the update on this (based on discussions with GCP Support):
In GCP Logging, we need to create a log routing sink with an inclusion filter; this will write the logs to BigQuery or Cloud Storage, depending on the target you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so that specific logs will NOT be redirected to GCP Logging.
Attached are screenshots of the _Default log sink and the Inclusion sink for Dataproc.
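For illustration, the sink and exclusion setup described above might look roughly like this with gcloud (the sink name, bucket, and filters are placeholders and should be adapted to the logs you actually want to route or drop):
# Route Dataproc job driver logs to a Cloud Storage bucket
gcloud logging sinks create dataproc-job-sink \
  storage.googleapis.com/my-dataproc-log-bucket \
  --log-filter='resource.type="cloud_dataproc_job"'
# Exclude the same logs from the _Default sink so they no longer land in Cloud Logging
gcloud logging sinks update _Default \
  --add-exclusion=name=exclude-dataproc-jobs,filter='resource.type="cloud_dataproc_job"'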

Error when creating GCP Dataproc cluster: permission denied for 'compute.projects.get'

I am trying to create a Dataproc cluster with a service account via the Cloud SDK. It's throwing an error that compute.projects.get is denied. The service account has Compute Viewer, Compute Instance Admin, and Dataproc Editor access, so I'm unable to understand why this error occurs. In the IAM policy troubleshooter, I verified that dataproc.cluster.create is assigned to the service account.
The command is:
gcloud dataproc clusters create cluster-dqm01 \
--region europe-west-2 \
--zone europe-west2-b \
--subnet dataproc-standalone-paasonly-europe-west2 \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 1.3-deb9 \
--project xxxxxx \
--service-account xxxx.iam.gserviceaccount.com
ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Required 'compute.projects.get' permission for 'projects/xxxxxx'
The project is correct: I have tried to create the cluster from the console and got the same error, and I generated the gcloud command from the console to run it with a service account. This is the first time a Dataproc cluster is being created for this project.
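For debugging, one quick way to list the project-level roles the service account actually holds (the project ID and service account email below are placeholders):
gcloud projects get-iam-policy PROJECT_ID \
  --flatten='bindings[].members' \
  --filter='bindings.members:serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com' \
  --format='table(bindings.role)'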
If you had assigned the various permissions to the same service account you're specifying with --service-account, the issue is that you probably meant to specify --impersonate-service-account instead.
There are three identities that are relevant here:
The identity issuing the CreateCluster command - this is often a human identity, but if you're automating things, using --impersonate-service-account, or running the command from inside another GCE VM, it may be a service account itself.
The "Control plane" identity - this is what the Dataproc backend service uses to actually create VMs
The "Data plane" identity - this is what the Dataproc workers behave as when processing data.
Typically, #1 and #2 need the various "compute" permissions and some minimal GCS permissions. #3 typically just needs GCS and optionally BigQuery, CloudSQL, Bigtable, etc. permissions depending on what you're actually processing.
See https://cloud.google.com/dataproc/docs/concepts/iam/dataproc-principals for more in-depth explanation of these identities.
It also lists the pre-existing curated roles to make this all easy (and typically, "default" project settings will automatically have the correct roles already so that you don't have to worry about it). Basically, the "human identity" or the service account you use with --impersonate-service-account needs Dataproc Editor or Project Editor roles, the "control plane identity" needs Dataproc Service Agent, and the "data plane identity" needs Dataproc Worker.
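As a sketch of what granting the typical roles looks like (project ID and service account emails are placeholders; the Dataproc Service Agent role on the control plane identity is normally granted automatically):
# Identity #1: whoever (or whatever service account) submits the CreateCluster request
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:submitter@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/dataproc.editor'
# Identity #3: the data plane service account passed via --service-account
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:cluster-vm@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/dataproc.worker'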

Submit Presto job on dataproc

I am trying to submit a dataproc job on a cluster running Presto with the postgresql connector.
The cluster is initialized as follows:
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
--project=${PROJECT} \
--region=${REGION} \
--zone=${ZONE} \
--bucket=${BUCKET_NAME} \
--num-workers=${WORKERS} \
--scopes=cloud-platform \
--initialization-actions=${INIT_ACTION}
${INIT_ACTION} points to a bash file with the initialization actions for starting a Presto cluster with PostgreSQL.
I do not use --optional-components=PRESTO since I need --initialization-actions to perform non-default operations, and having both --optional-components and --initialization-actions does not work.
When I try to run a simple job:
gcloud beta dataproc jobs submit presto \
--cluster ${CLUSTER_NAME} \
--region ${REGION} \
-e "SHOW TABLES"
I get the following error:
ERROR: (gcloud.beta.dataproc.jobs.submit.presto) FAILED_PRECONDITION: Cluster
'<cluster-name>' requires optional component PRESTO to run PRESTO jobs
Is there some other way to define the optional component on the cluster?
UPDATE:
Using both --optional-components and --initialization-actions, as in:
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
...
--scopes=cloud-platform \
--optional-components=PRESTO \
--image-version=1.3 \
--initialization-actions=${INIT_ACTION} \
--metadata ...
The ${INIT_ACTION} is copied from this repo, with a slight modification to the configure_connectors function to create a PostgreSQL connector.
When running the create cluster the following error is given:
ERROR: (gcloud.beta.dataproc.clusters.create) Operation [projects/...] failed: Initialization action failed. Failed action 'gs://.../presto_config.sh', see output in: gs://.../dataproc-initialization-script-0_output.
The error output is logged as:
+ presto '--execute=select * from system.runtime.nodes;'
Error running command: java.net.ConnectException: Failed to connect to localhost/0:0:0:0:0:0:0:1:8080
This leads me to believe I have to rewrite the initialization script.
It would be nice to know which initialization script runs when I specify --optional-components=PRESTO.
If all you want to do is set up the optional component to work with a Postgres endpoint, writing an init action to do it is pretty easy. You just have to add the catalog file and restart Presto.
https://gist.github.com/KoopaKing/8e653e0c8d095323904946045c5fa4c2 is an example init action. I have tested it successfully with the Presto optional component, and it is pretty simple. Feel free to fork the example and stage it in your GCS bucket.
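For what it's worth, a minimal sketch of that idea as an init action, assuming the Presto optional component's default layout on the image (the catalog directory, service name, and connection details below are placeholders and may differ by image version):
#!/bin/bash
# Add a PostgreSQL catalog for the Presto optional component, then restart Presto.
cat <<'EOF' >/etc/presto/conf/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://DB_HOST:5432/DB_NAME
connection-user=DB_USER
connection-password=DB_PASSWORD
EOF
systemctl restart presto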

Howto use gcloud to create cloud launcher products clusters

I'm new to Google Cloud and I'm trying to experiment with it.
I can see that preparing scripts is pretty much vital if I want to create and delete clusters every day.
For Dataproc clusters, it's easy:
gcloud dataproc clusters create spark-6-m \
--async \
--project=my-project-id \
--region=us-east1 \
--zone=us-east1-b \
--bucket=my-project-bucket \
--image-version=1.2 \
--num-masters=1 \
--master-boot-disk-size=10GB \
--master-machine-type=n1-standard-1 \
--worker-boot-disk-size=10GB \
--worker-machine-type=n1-standard-1 \
--num-workers=6 \
--initialization-actions=gs://dataproc-initialization-actions/jupyter2/jupyter2.sh
Now, I'd like to create a Cassandra cluster. I see that Cloud Launcher allows me to do that easily too, but I can't find a gcloud command to automate it.
Is there a way to create Cloud Launcher product clusters via gcloud?
Thanks
Cloud Launcher deployments can be replicated from the Cloud Shell using Custom Deployments [1].
Once the Cloud Launcher deployment (in this case a Cassandra cluster) is finished, the details of the deployment can be seen in the Deployment Manager [2].
The deployment details have an Overview section with the configuration and the imported files used for the deployment process. Download the “Expanded Config” file; this will be the .yaml file for the custom deployment [3]. Download the import files to the same directory as the .yaml file to be able to deploy correctly [4].
These files and this configuration will create a deployment equivalent to the Cloud Launcher one.
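Once the expanded config and its imports are in the same directory, redeploying is a single command; a sketch with a placeholder deployment name and file name:
gcloud deployment-manager deployments create my-cassandra-cluster \
  --config expanded-config.yaml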