I have a pipeline that requires a Dataflow job to run. I had been starting the job with the gcloud CLI, which worked fine for over a month. For the last three days, however, the job has been failing within 10-20 seconds with the following error log:
Failed to start the VM, launcher-2022012621245117717885921401920990, used for launching because of status code: UNAVAILABLE, reason: One or more operations had an error: 'operation-1643261093401-5d68989bed339-a33de830-9f90d92a': [UNAVAILABLE] 'HTTP_503'..
The command I'm using is:
gcloud dataflow sql query "SELECT tr.* FROM pubsub.topic.`my_project`.pubsub_topic as tr" \
--job-name test_job \
--region asia-south1 \
--bigquery-write-disposition write-empty \
--bigquery-project my_project \
--bigquery-dataset test_dataset --bigquery-table table_name \
--max-workers 1 --worker-machine-type n1-standard-1
I also tried starting the job from the Cloud Console with the same parameters, and it failed with the same error log. I had run the job from the console before and it worked fine; the issue started a couple of days ago.
What could be going wrong?
The Google Cloud error model indicates that a 503 means the service is unavailable [1].
You can try changing the region, for example from europe-north1 to europe-west4; that should work. Additionally, you shouldn't include your job ID on Stack Overflow.
[1] https://cloud.google.com/apis/design/errors#handling_errors
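If the 503 is transient, retrying with exponential backoff is the usual way to handle it. The following is only a minimal sketch of that pattern: ServiceUnavailable and retry_with_backoff are hypothetical names, not part of any Google client library, and the operation you pass in would be whatever call launches your job. For an outage that lasts days, as in the question, retries alone are unlikely to help and switching regions is the more practical workaround.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a 503 / UNAVAILABLE response."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=32.0, sleep=time.sleep):
    """Call operation(), retrying on ServiceUnavailable with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

The sleep parameter is injectable so the function can be exercised without actually waiting.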
I'm trying to run a simple hello-world Python script on Serverless PySpark on GCP using gcloud (from a local Windows machine).
if __name__ == '__main__':
This always results in the error
=========== Cloud Dataproc Agent Error ===========
java.lang.IllegalArgumentException: Illegal character in path at index 38: gs://my-bucket/dependencies\hello.py
at java.base/java.net.URI.create(URI.java:883)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:592)
The gcloud command:
gcloud dataproc batches submit pyspark hello.py --batch=hello-batch-5 --deps-bucket=my-bucket --region=us-central1
On further analysis, I found that gcloud puts the hello.py file at dependencies\hello.py under {deps-bucket}, and Java treats the backslash '\' as an illegal character.
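To illustrate how such a path could arise, here is a small sketch. It assumes (this is a guess consistent with the observation above, not something confirmed from the gcloud source) that the object name is built with the operating system's path-join routine. staged_uri is a hypothetical helper; ntpath and posixpath are the standard-library modules behind os.path on Windows and on POSIX systems respectively.

```python
import ntpath
import posixpath


def staged_uri(bucket, filename, join=posixpath.join):
    """Build a gs:// URI for a staged dependency (hypothetical helper)."""
    return "gs://{}/{}".format(bucket, join("dependencies", filename))


# What a Windows-style join would produce: a backslash inside the URI.
windows_style = staged_uri("my-bucket", "hello.py", join=ntpath.join)

# Object names in a bucket always use forward slashes, so join with posixpath.
portable = staged_uri("my-bucket", "hello.py")

print(windows_style)
print(portable)
```

Only the second form is a valid URI path, which matches the complaint raised by java.net.URI.create.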
Has anyone encountered a similar situation?
As @Ronak mentioned, can you double-check the bucket name? I replicated your task by simply copying your code to my Google Cloud Shell, and it ran just fine. For your next run, can you delete the dependencies folder and run the batch job again?
We followed the Cloud Profiler documentation to enable Cloud Profiler for our Dataflow jobs, but the Profiler fails to start.
The issue is that Cloud Profiler needs both the JOB_NAME and JOB_ID environment variables to start, but the worker VM has only JOB_ID; JOB_NAME is missing.
The question is: why is the JOB_NAME env var missing?
jsonPayload: {
job: "2022-09-16_13_41_20-1177626142222241340"
logger: "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker_main.py:177"
message: "Unable to start google cloud profiler due to error: Unable to find the job id or job name from envvar"
portability_worker_id: "sdk-0-13"
thread: "MainThread"
worker: "description-embeddings-20-09161341-k27g-harness-qxq2"
What we have done so far:
The Cloud Profiler API is enabled for the project.
Projects have enough quota.
The service account for the Dataflow job has the appropriate permissions for Profiler.
The following options were added to the pipeline: the enable_google_cloud_profiler and enable_google_cloud_heap_sampling flags, specified as additional experiments when deploying our pipeline from Dataflow templates.
Edit: Found the cause.
The provisioning API is returning an empty JOB_NAME, causing boot.go to set the JOB_NAME env var to "", which causes the Python SDK code to fail when trying to activate googlecloudprofiler.
There is an open issue on IssueTracker regarding this.
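The failure mode can be illustrated with a simplified sketch. This is not the actual Apache Beam source; profiler_identity is a hypothetical function that mimics a check treating an empty string the same as a missing variable, which is why JOB_NAME set to "" produces the same error as JOB_NAME being absent.

```python
import os


def profiler_identity(environ=os.environ):
    """Return (job_name, job_id) or fail if either is missing or empty."""
    job_name = environ.get("JOB_NAME")
    job_id = environ.get("JOB_ID")
    # An empty string is falsy, so JOB_NAME="" is rejected just like no JOB_NAME.
    if not job_name or not job_id:
        raise RuntimeError(
            "Unable to find the job id or job name from envvar")
    return job_name, job_id
```

Calling it with {"JOB_ID": "some-id", "JOB_NAME": ""} raises the RuntimeError, while supplying a non-empty JOB_NAME returns both values.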
I'm trying to automate the provisioning of a streaming job using Cloud Build. For the POC, I tried https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/dataflow/flex-templates/streaming_beam
It worked as expected when I manually ran the commands.
When I add the commands to a cloudbuild.yaml file, the build is created successfully, but the Dataflow job fails each time with the error below:
Error occurred in the launcher container: Template launch failed. See console logs
This is the only error log I get. I tried adding extra permissions to the Cloud Build service account, but that didn't help either.
Since the log contains no other information, I find it hard to debug.
I successfully started an AWS EMR cluster, but every job submission fails with:
19/07/30 08:37:42 ERROR UserData: Error encountered while try to get user data
java.io.IOException: File '/var/aws/emr/userData.json' cannot be read
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:296)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1711)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1748)
at com.amazon.ws.emr.hadoop.fs.util.UserData.getUserData(UserData.java:62)
at com.amazon.ws.emr.hadoop.fs.util.UserData.<init>(UserData.java:39)
at com.amazon.ws.emr.hadoop.fs.util.UserData.ofDefaultResourceLocations(UserData.java:52)
at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.buildSTSClient(AWSSessionCredentialsProviderFactory.java:52)
at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.<clinit>(AWSSessionCredentialsProviderFactory.java:17)
at com.amazon.ws.emr.hadoop.fs.rolemapping.DefaultS3CredentialsResolver.resolve(DefaultS3CredentialsResolver.java:22)
at com.amazon.ws.emr.hadoop.fs.guice.CredentialsProviderOverrider.override(CredentialsProviderOverrider.java:25)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.executeOverriders(GlobalS3Executor.java:130)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:86)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.doesBucketExist(AmazonS3LiteClient.java:90)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:139)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:116)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.initialize(S3NativeFileSystem.java:508)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:111)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:190)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:146)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:144)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:144)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:354)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:354)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:354)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
userData.json isn't part of my application; it looks like an EMR internal file.
Any ideas what is wrong? I submit jobs via Livy requests.
Cluster setup:
2 core nodes m4.large
7 task nodes m5.4xlarge
1 master node m5.xlarge
The correct way to fix this is by running the following command as part of your bootstrap script when launching EMR (or, if running on a Glue Endpoint, run the following at any point on your endpoint):
chmod 444 /var/aws/emr/userData.json
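For readers unsure what mode 444 means, the sketch below shows the same permission change from Python on a throwaway temporary file. It is only an illustration: make_read_only_for_all is a hypothetical helper, and a real bootstrap action would simply run the chmod command above against /var/aws/emr/userData.json.

```python
import os
import stat
import tempfile


def make_read_only_for_all(path):
    """Equivalent of `chmod 444 path`: read-only for owner, group and others."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a temporary file that starts out readable by its owner only.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
os.chmod(demo_path, 0o600)

mode = make_read_only_for_all(demo_path)
print(oct(mode))

os.remove(demo_path)
```

With read permission granted to "others", a process running as a different user (such as the one executing the Spark submission) can open the file.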
I've faced a similar issue on AWS EMR emr-5.24.1 (Spark 2.4.1), but the jobs never failed.
I am running a batch job on Dataflow, querying from BigQuery. When I use the DirectRunner, everything works, and the results are written to a new BigQuery table. Things seem to break when I change to DataflowRunner.
The logs show that 30 worker instances are spun up successfully. The graph diagram in the web UI shows the job has started. The first 3 steps show "Running", the rest show "not started". None of the steps show any records transformed (i.e. the output collections all show '-'). The logs show many messages that look like this, which may be the issue:
skipping: failed to "StartContainer" for "python" with CrashLoopBackOff: "Back-off 10s restarting failed container=python pod=......
I took a step back and just ran the minimal wordcount example, and that completed successfully. So all the necessary APIs seem to be enabled for Dataflow runner. I'm just trying to get a sense of what is causing my Dataflow job to hang.
I am executing the job like this:
python2.7 script.py --runner DataflowRunner --project projectname --requirements_file requirements.txt --staging_location gs://my-store/staging --temp_location gs://my-store/temp
I'm not sure whether the problem I fixed was the cause of the error pasted above, but fixing dependency problems (which were not showing up as errors in the log at all!) did solve the hanging Dataflow processes.
So if you have a hanging process, make sure your workers have all their necessary dependencies. You can provide them through the --requirements_file argument, or through a custom setup.py script.
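Because missing dependencies may not surface in the worker logs, one option is a local pre-flight check run in a clean environment. The sketch below is an assumption-laden illustration: requirement_names and missing_distributions are hypothetical helpers, the parsing handles only simple requirement lines, and importlib.metadata requires Python 3.8+ (so it would not run under the python2.7 interpreter used in the question). It inspects the environment it runs in, not the Dataflow workers themselves.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract bare distribution names from requirements.txt-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_distributions(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

A typical use would be to read requirements.txt, pass its lines to requirement_names, and fail fast if missing_distributions returns anything.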
Thanks to the help I received in this post, the pipeline appears to be operating, albeit VERY SLOWLY.