We're creating Dataflow job templates and launching new jobs using the google-api-python-client library. Cloud Profiler is enabled for all jobs by default when the Dataflow job template is created:
python3 -m app.image_embeddings \
--job_name "image-embeddings" \
--region "us-central1" \
--runner "DataflowRunner" \
...
--experiment "use_runner_v2" \
--experiment "enable_google_cloud_profiler" \
--experiment "enable_google_cloud_heap_sampling" \
--dataflow_service_options=enable_google_cloud_profiler
Even though nothing has changed on our end, some jobs are profiled by Cloud Profiler and some are not.
According to the logs, the Profiler is enabled and there are no errors, but the job profile is still not available for some jobs. The following message is shown when viewing the Profiler link for such a job:
There were profiles collected for the specified time range, but none match the current filters.
Is this an issue on GCP's end, or is it related to our implementation?
Do all the jobs use the same user or service account? It might be a permissions issue: maybe some service accounts are missing the roles/cloudprofiler.agent role.
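To make that check concrete, here is a minimal sketch. It assumes you have already fetched the project's IAM policy as a dict (the usual "bindings" shape) and only looks for a direct roles/cloudprofiler.agent binding, so broader roles or policies inherited from a folder or organization are not considered. The function name and the account emails are hypothetical.

```python
PROFILER_ROLE = "roles/cloudprofiler.agent"


def accounts_missing_profiler_role(policy, service_accounts):
    """Return the service accounts with no direct roles/cloudprofiler.agent binding.

    `policy` is an IAM policy dict with a "bindings" list; each binding has a
    "role" and a list of "members" such as "serviceAccount:<email>".
    """
    granted = set()
    for binding in policy.get("bindings", []):
        if binding.get("role") == PROFILER_ROLE:
            granted.update(binding.get("members", []))
    return [sa for sa in service_accounts
            if f"serviceAccount:{sa}" not in granted]


# Hypothetical policy and account names, for illustration only.
example_policy = {
    "bindings": [
        {
            "role": "roles/dataflow.worker",
            "members": [
                "serviceAccount:job-a@example-project.iam.gserviceaccount.com",
                "serviceAccount:job-b@example-project.iam.gserviceaccount.com",
            ],
        },
        {
            "role": "roles/cloudprofiler.agent",
            "members": [
                "serviceAccount:job-a@example-project.iam.gserviceaccount.com",
            ],
        },
    ]
}

missing = accounts_missing_profiler_role(
    example_policy,
    [
        "job-a@example-project.iam.gserviceaccount.com",
        "job-b@example-project.iam.gserviceaccount.com",
    ],
)
print(missing)
```

If the list is non-empty, comparing it against the service accounts of the jobs that have no profile would show whether the two line up.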
I have a Dataproc cluster with image version 2.0.39-ubuntu18, which seems to be sending all logs to Cloud Logging. This is increasing our costs a lot.
To stop using Cloud Logging, I added the following properties: spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs
However, that is not working; logs are still being sent to Cloud Logging as well.
Here is the command used to create the Dataproc cluster:
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18) with a configuration similar to the 2.0-ubuntu18 one, and it does not seem to use Cloud Logging as much.
Attached is a screenshot of the properties of both clusters.
What do I need to change to ensure that the Dataproc (PySpark) jobs do not use Cloud Logging?
tia!
I see that dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default the value is false, which means driver logs are saved to GCS and streamed back to the client for viewing, but are not saved to Cloud Logging. You can try disabling it. By the way, when it is enabled, the job driver logs are available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster or write an init action that runs systemctl stop google-fluentd.service. Both will stop Cloud Logging on the cluster's side, but using the property is recommended.
See Dataproc cluster properties for the property.
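If the cluster is created from a script, a small helper can keep the --properties value readable. This is only a sketch: the helper name is made up, and the two properties are the ones discussed above, both set to false.

```python
def format_cluster_properties(properties):
    """Join a {"prefix:key": value} mapping into a gcloud --properties string."""
    parts = []
    for key, value in properties.items():
        # gcloud expects lowercase true/false rather than Python's True/False.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return ",".join(parts)


logging_off = {
    "dataproc:dataproc.logging.stackdriver.enable": False,
    "dataproc:dataproc.logging.stackdriver.job.driver.enable": False,
}

properties_arg = format_cluster_properties(logging_off)
print(f"--properties '{properties_arg}'")
```

The printed flag can be pasted into the gcloud dataproc clusters create command in place of the logging-related entries in the existing --properties value.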
Here is an update on this, based on discussions with GCP Support:
In Cloud Logging, we need to create a Log Router sink with an inclusion filter. This writes the matching logs to BigQuery or Cloud Storage, depending on the destination you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so that those specific logs are NOT stored in Cloud Logging.
Attached are screenshots of the _Default log sink and the inclusion sink for Dataproc.
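The exact filters are only in the screenshots, so the following is an assumption about what such a filter could look like rather than the one we used. It builds a Logging query string that matches Dataproc cluster logs (resource type cloud_dataproc_cluster), optionally narrowed to one cluster and to entries below a given severity. The function name and the cluster name are hypothetical.

```python
def dataproc_log_filter(cluster_name=None, below_severity=None):
    """Build a Cloud Logging filter string that matches Dataproc cluster logs."""
    clauses = ['resource.type="cloud_dataproc_cluster"']
    if cluster_name:
        clauses.append(f'resource.labels.cluster_name="{cluster_name}"')
    if below_severity:
        # Match only entries less severe than this level, e.g. WARNING.
        clauses.append(f"severity<{below_severity}")
    return " AND ".join(clauses)


# Inclusion filter for the sink that routes Dataproc logs to BigQuery or GCS.
inclusion_filter = dataproc_log_filter()

# Possible exclusion filter for the _Default sink: drop low-severity logs
# from one (hypothetical) cluster.
exclusion_filter = dataproc_log_filter("my-cluster", below_severity="WARNING")

print(inclusion_filter)
print(exclusion_filter)
```

Pasting a candidate filter into the Logs Explorer first is a quick way to see which entries it would match before changing the sinks.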
I have a Dataproc job that outputs some logs during execution. I can see those logs in the job output.
My cluster was created according to the documentation with these properties:
dataproc:jobs.file-backed-output.enable=true
dataproc:dataproc.logging.stackdriver.enable=true
dataproc:dataproc.logging.stackdriver.job.driver.enable=true
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true
I can see all system logs in Logging, but not the output from my job. The most I found is the URL of the rolling output file (not even a concrete file).
Is there any way to forward the job output to Logging?
According to the documentation, the cluster can be created with spark:spark.submit.deployMode=cluster so that the output is logged to the YARN user logs group. But whenever I do that, my job fails with:
21/03/15 16:20:16 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
I was able to create a cluster and submit jobs using the repro steps below.
I went to Stackdriver and refreshed the page.
After refreshing, I could see the Cloud Dataproc Job logging filter.
I also noticed that for both jobs I ran, the job output was logged under 'Any Log Level'. I'm not sure whether you are filtering by log level.
Are you able to see Cloud Dataproc Job in the logging filter after passing dataproc:dataproc.logging.stackdriver.job.driver.enable=true and refreshing the page?
Are you using one of the supported image versions?
Repro steps:
Cluster Creation:
gcloud dataproc clusters create log-exp --region=us-central1 \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true'
Job Submission: PySpark
gcloud dataproc jobs submit pyspark \
gs://dataproc-examples/pyspark/hello-world/hello-world.py \
--cluster=log-exp \
--region=us-central1
Job Submission: Spark
gcloud dataproc jobs submit spark \
--cluster=log-exp \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
-- 100
Cloud Dataproc Job filter
Logging level
I am facing an issue with GCP Dataflow (Apache Beam) and need help.
I have a Dataflow template created in project B, and everything works if I run it using either a person's email or a service account.
I am looking for a way to run project B's Dataflow template from another project A, using a service account defined in project A (sa-A#PROJECT-A). sa-A#PROJECT-A already has the necessary permissions in project B.
I have already tried the gcloud command below:
gcloud dataflow jobs run BigQueryToBigQuery \
--gcs-location gs://{GCS bucket}/templates/BigQueryToBigQuery \
--parameters query=bigQueryTableName={projectID}:{dataset}.{table} \
--region=us-east1
where the gcs-location is the location of the template in project B.
When I use project A's service account, it triggers the job in project A but not in project B. When I run it using project B's service account, the error Current user cannot act as service account... is raised.
Any help will be appreciated.
I found the solution and am posting it here in case it helps other people.
To trigger a Dataflow job from project A and run it in project B, use the above command with the --project flag. The full command would be:
gcloud dataflow jobs run BigQueryToBigQuery \
--gcs-location gs://{GCS_bucket}/templates/BigQueryToBigQuery \
--parameters query=bigQueryTableName={projectA_ID}:{dataset}.{table} \
--region=us-east1 \
--project=projectB_ID
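For anyone launching the template from Python instead of gcloud, the same idea applies: the target project goes into the request itself. Below is a rough sketch that only builds the arguments for the Dataflow v1b3 projects.locations.templates.launch method of google-api-python-client; it does not call the API. The helper name, the project ID, the bucket, and the template parameter names are placeholders.

```python
def build_launch_request(target_project, region, template_gcs_path,
                         job_name, parameters, service_account_email=None):
    """Build keyword arguments for projects().locations().templates().launch()."""
    body = {"jobName": job_name, "parameters": dict(parameters)}
    if service_account_email:
        # Optional: the service account the job's workers should run as.
        body["environment"] = {"serviceAccountEmail": service_account_email}
    return {
        "projectId": target_project,  # the project the job runs in (project B)
        "location": region,
        "gcsPath": template_gcs_path,
        "body": body,
    }


# Placeholder values, for illustration only.
launch_request = build_launch_request(
    target_project="project-b-id",
    region="us-east1",
    template_gcs_path="gs://example-bucket/templates/BigQueryToBigQuery",
    job_name="BigQueryToBigQuery",
    parameters={"exampleParameter": "example-value"},
)
print(launch_request)

# With credentials for the project A service account, the call would be roughly:
#   from googleapiclient.discovery import build
#   dataflow = build("dataflow", "v1b3")
#   dataflow.projects().locations().templates().launch(**launch_request).execute()
```

As with the --project flag, the caller is the project A service account, and the job is created in whatever project is passed as projectId.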
I'm new to Google Cloud and I'm trying to experiment with it.
I can see that preparing scripts is vital if I want to create and delete clusters every day.
For Dataproc clusters, it's easy:
gcloud dataproc clusters create spark-6-m \
--async \
--project=my-project-id \
--region=us-east1 \
--zone=us-east1-b \
--bucket=my-project-bucket \
--image-version=1.2 \
--num-masters=1 \
--master-boot-disk-size=10GB \
--master-machine-type=n1-standard-1 \
--worker-boot-disk-size=10GB \
--worker-machine-type=n1-standard-1 \
--num-workers=6 \
--initialization-actions=gs://dataproc-initialization-actions/jupyter2/jupyter2.sh
Now I'd like to create a Cassandra cluster. I see that Cloud Launcher makes that easy too, but I can't find a gcloud command to automate it.
Is there a way to create Cloud Launcher product clusters via gcloud?
Thanks
Cloud Launcher deployments can be replicated from the Cloud Shell using Custom Deployments [1].
Once the Cloud Launcher deployment (in this case a Cassandra cluster) is finished, the details of the deployment can be seen in the Deployment Manager [2].
The deployment details have an Overview section with the configuration and the imported files used for the deployment process. Download the “Expanded Config” file; this will be the .yaml file for the custom deployment [3]. Download the import files to the same directory as the .yaml file so that the deployment works correctly [4].
These files and this configuration will create a deployment equivalent to the one Cloud Launcher created.
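Since the goal is to script this, here is a small sketch that assembles the gcloud deployment-manager deployments create command for the downloaded config. The deployment name and the file name are placeholders, and the helper itself is just for illustration; it prints the command rather than running it.

```python
import shlex


def deployment_create_command(deployment_name, config_path, project=None):
    """Build the argv for `gcloud deployment-manager deployments create`."""
    argv = [
        "gcloud", "deployment-manager", "deployments", "create",
        deployment_name,
        "--config", config_path,
    ]
    if project:
        argv += ["--project", project]
    return argv


# Placeholder deployment name and config file name.
argv = deployment_create_command("cassandra-copy", "expanded_config.yaml")
print(shlex.join(argv))

# To actually run it from the directory holding the .yaml and import files:
#   import subprocess
#   subprocess.run(argv, check=True)
```

The matching teardown would be gcloud deployment-manager deployments delete with the same deployment name.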
I have built a local cluster on my laptop (pseudo-distributed mode), where I run different MapReduce commands like:
hadoop-streaming -D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-files my_mapper.py,my_reducer.py \
-mapper my_mapper.py \
-reducer my_reducer.py \
-input /aws/input/input_warc.txt \
-output /aws/output
Now I have to run it on EMR. There are two options: the console and the AWS CLI. I want to run commands exactly like the one above. For that, I think that if I SSH to the EMR master, I should be able to run this command. Is this the right way, or are there any drawbacks to this approach?
Yes, you may SSH to your cluster and run your jobs there, but you may also use the Step API (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html) to run arbitrary commands on the master instance, including of course running distributed jobs like your example. You may add Steps to a cluster using the AWS CLI ("aws emr add-steps ..." or, during cluster creation, "aws emr create-cluster ... --steps ..."), or similarly using the AWS SDKs (like the AWS Java SDK) or the AWS EMR Console.
Some advantages of the Step API include that it captures the output of each step so that you can view it via the AWS CLI, SDK, or AWS Console, and you can also check the status of Steps to determine when they have completed.
One disadvantage of the Step API is that currently Steps all run sequentially, so you can't have multiple Steps running in parallel.
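As an illustration of the Step route, the sketch below turns the hadoop-streaming command from the question into a step definition of the kind boto3's add_job_flow_steps accepts, using command-runner.jar. It only builds the dict; the step name and cluster ID are placeholders, and it assumes the mapper and reducer scripts are reachable from the master instance under the names given (for a real step you would more likely point -files at copies in S3).

```python
import shlex

# The streaming command from the question, as a single string.
STREAMING_COMMAND = (
    "hadoop-streaming "
    "-D mapred.output.compress=true "
    "-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec "
    "-files my_mapper.py,my_reducer.py "
    "-mapper my_mapper.py "
    "-reducer my_reducer.py "
    "-input /aws/input/input_warc.txt "
    "-output /aws/output"
)


def command_step(name, command, action_on_failure="CONTINUE"):
    """Wrap a shell-style command in an EMR step that uses command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": shlex.split(command),
        },
    }


step = command_step("streaming-example", STREAMING_COMMAND)
print(step["HadoopJarStep"]["Args"])

# Submitting it would look roughly like this (cluster ID is a placeholder):
#   import boto3
#   emr = boto3.client("emr")
#   emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The step's state and output can then be followed through the console or the CLI, which is the main convenience over running the command by hand in an SSH session.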