I have simple Dataprep jobs that transfer GCS data to BQ. Until today, the scheduled jobs were running fine, but today two jobs failed and two jobs succeeded after taking more than half an hour to an hour.
The error message I am getting is below:
java.lang.RuntimeException: Failed to create job with prefix beam_load_clouddataprepcmreportalllobmedia4505510bydataprepadmi_aef678fce2f441eaa9732418fc1a6485_2b57eddf335d0c0b09e3000a805a73d6_00001_00000, reached max retries: 3, last failed job:
I ran the same job again; it again took a very long time and failed, but this time with a different message:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Any pointers or direction toward the possible cause would be appreciated, as would links or troubleshooting tips for Dataprep or Dataflow jobs.
Thank you
There are many potential causes for jobs to get stuck: transient issues, a quota or limit being reached, a change in data format or size, or another issue with the resources being used. I suggest starting the troubleshooting from the Dataflow side.
Here are some useful resources that can guide you through the most common job errors, and how to troubleshoot them:
Troubleshooting your pipeline
Dataflow common errors
In addition, you could check the Google Issue Trackers for Dataprep and Dataflow to see if the issue has been reported before:
Issue tracker for Dataprep
Issue tracker for Dataflow
You can also check the GCP status dashboard to rule out a widespread issue with a service:
Google Cloud Status Dashboard
Finally, if you have GCP support, you can reach out to them directly. If you don't, you can use the Issue Tracker to create a new issue for Dataprep and report the behavior you're seeing.
Related
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all the required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" is caused by worker setup taking too long. To work around this, you can try increasing the worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) solves the problem.
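A minimal sketch of how this could look in a Python Beam pipeline; the project, region, and bucket names are placeholders, and the pipeline body is only illustrative:

```python
# Sketch: request larger Dataflow workers so dependency setup finishes sooner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",   # placeholder bucket
    machine_type="n1-standard-4",          # bigger workers instead of n1-standard-1
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Create" >> beam.Create(["a", "b", "c"])
     | "Print" >> beam.Map(print))
```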
You can debug this by looking at the worker startup logs in Cloud Logging; you are likely to see pip failing to install dependencies.
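If you prefer to pull those logs programmatically, here is a rough sketch using the google-cloud-logging client; the project ID and job ID are placeholders, and the log/resource names assume the usual Dataflow worker-startup log:

```python
# Sketch: read Dataflow worker-startup log entries for a given job.
from google.cloud import logging

client = logging.Client(project="my-project")   # placeholder project ID
log_filter = (
    'resource.type="dataflow_step" '
    'AND log_id("dataflow.googleapis.com/worker-startup") '
    'AND resource.labels.job_id="JOB_ID"'        # placeholder Dataflow job ID
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)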
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, the worker VMs may be starting but unable to reach the Dataflow service, which usually points to a network connectivity problem.
Please note that by default, Dataflow creates jobs using the network named default and its subnetwork (please check that it exists in your project); you can switch to a specific one by specifying --subnetwork. See https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
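A sketch of passing a specific subnetwork to a Python pipeline; the project, region, and subnetwork names are placeholders:

```python
# Sketch: run the Dataflow job on an explicit subnetwork instead of "default".
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",   # placeholder bucket
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/my-project/"
        "regions/us-central1/subnetworks/my-subnet"   # placeholder subnetwork URL
    ),
)
```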
I am receiving the following error while executing a data pipeline in GCP Cloud Data Fusion.
Spark program 'phase-1' failed with error: canCommit() is called for transaction
More information:
The pipeline is responsible for a lift-and-shift operation, loading on-prem Oracle data into Google BigQuery via Cloud Data Fusion.
The pipeline gives this error intermittently: it sometimes works (manual runs), but it mostly fails on scheduled runs (though it sometimes succeeds on those as well).
As part of mitigation, I have set the following configuration item, but with no luck:
*data.tx.timeout*
Thanks a lot in advance.
Regards,
Vir
In one of our GCP projects, multiple BigQuery scheduled queries are running, but recently jobs started failing with the following error:
Error Message: Already Exists: Job <PROJECT ID>:US.scheduled_query_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx; JobID:<PROJECT NUMBER>:scheduled_query_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
The write preference of these queries is "WRITE_APPEND", and the jobs were created using the GCP console.
Once retriggered, these jobs run successfully with a new job ID.
Please help me understand why an already-used job ID is being allocated to these scheduled queries, and please suggest a fix if one is available.
In Dataprep, I have a transform job that failed with the only information being:
Job Failed : java.lang.NullPointerException: jobId.
It does not even reach the Dataflow jobs stage, so I have no logs or anything else to go on.
Any ideas why, or how to get more information to correct this?
After contacting Google/Trifacta support, there is a workaround:
disable the Profile Results option on the run page.
In Google Cloud ML (Machine Learning), I submitted a job, but it failed due to a Python error in the code.
After fixing the error, how can I re-run the job? Should I submit a new job?
When I'm done, how do I delete the job?
The online documentation is not complete.
Thanks
When you're ready to retry the job, just submit a new job with a new job name.
There is no way to delete jobs, since we want to provide you with a record of previous jobs. Jobs reach a terminal state (FAILED, SUCCEEDED, or CANCELLED) in which they no longer consume any resources; however, they will continue to show up in the UI and in the API if you list jobs.
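A rough sketch of both steps with the Cloud ML Engine (AI Platform) v1 REST API via google-api-python-client; the project ID, job names, and package URI are placeholders, and the training input is only an illustrative minimal configuration:

```python
# Sketch: re-submit a training job under a new name, then list past jobs.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")
project = "projects/my-project"   # placeholder project ID

# Retry by creating a new job with a new, unique job name.
job_spec = {
    "jobId": "my_training_job_v2",   # new job name
    "trainingInput": {
        "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder package
        "pythonModule": "trainer.task",
        "region": "us-central1",
        "scaleTier": "BASIC",
    },
}
ml.projects().jobs().create(parent=project, body=job_spec).execute()

# Old jobs cannot be deleted, but you can list them and check their terminal state.
for job in ml.projects().jobs().list(parent=project).execute().get("jobs", []):
    print(job["jobId"], job["state"])
```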