In Google Cloud ML (Machine Learning), I submitted a job, but it failed due to a Python error in the code.
After fixing the error, how can I re-run the job? Should I submit a new job?
When I'm done, how do I delete the job?
The online documentation is not complete.
Thanks
When you're ready to re-try the job, just submit a new job with a new job name.
There is no way to delete jobs since we want to provide you with a record of previous jobs. Jobs will reach a terminal state (FAILED, SUCCEEDED, or CANCELLED) in which they are no longer consuming any resources. However, the jobs will continue to show up in the UI or in the API if you list jobs.
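For reference, here is a minimal sketch of resubmitting the fixed job under a new job name through the Cloud ML Engine API with the Python client; the project ID, bucket path, and module name below are placeholders, not taken from your setup:

```python
# Rough sketch: resubmit a failed training job under a new job ID via the
# Cloud ML Engine (ml v1) REST API. All IDs and paths are placeholders.
from googleapiclient import discovery

project_id = 'my-project'          # placeholder
new_job_id = 'my_training_job_v2'  # must differ from the failed job's ID

job_spec = {
    'jobId': new_job_id,
    'trainingInput': {
        'scaleTier': 'BASIC',
        'region': 'us-central1',
        'packageUris': ['gs://my-bucket/packages/trainer-0.1.tar.gz'],
        'pythonModule': 'trainer.task',
    },
}

ml = discovery.build('ml', 'v1')
request = ml.projects().jobs().create(
    parent='projects/{}'.format(project_id), body=job_spec)
response = request.execute()
print(response)  # the new job is listed alongside the old FAILED one
```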
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all the required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" can be caused by worker setup taking too long. To solve this issue, you can try increasing worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) will solve the problem.
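As a rough illustration (assuming a Beam Python pipeline run on Dataflow; the project, region, and bucket values below are placeholders), the machine type can be passed as a pipeline option:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Larger workers so dependency installation (e.g. building pystan/fbprophet
# wheels) finishes well before the 1h "no worker activity" timeout.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                 # placeholder
    region='us-central1',                 # placeholder
    temp_location='gs://my-bucket/temp',  # placeholder
    machine_type='n1-standard-4',         # instead of the default n1-standard-1
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(['ok']) | beam.Map(print)
```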
You can debug this by looking at the worker startup logs in cloud logging. You are likely to see pip issues with installing dependencies.
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, it may be that the worker VMs start but can't reach the Dataflow service, which is often related to network connectivity.
Please note that by default, Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project); you can switch to a specific one by specifying --subnetwork. Check https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
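For example (a sketch only; the VPC and subnet names below are made up), pointing a Beam Python pipeline at an explicit network and subnetwork looks like this:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Point the Dataflow workers at an explicit VPC/subnetwork instead of the
# "default" one (all names below are placeholders).
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    network='my-vpc',
    subnetwork='regions/us-central1/subnetworks/my-subnet',
)
```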
In one of our GCP projects, multiple BigQuery scheduled queries are running, but recently we started facing job failures with the following error:
Error Message: Already Exists: Job <PROJECT ID>:US.scheduled_query_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx; JobID:<PROJECT NUMBER>:scheduled_query_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
The write preference of these queries is "WRITE_APPEND". These jobs were created using the GCP console.
Once retriggered, these jobs run successfully with a new job ID.
Please help me understand why an already-used job ID is being allocated to these scheduled queries, and please suggest a fix if one is available.
I have a use case where I schedule a task 24h into the future after an event occurs. This task represents some sort of "deadline" for other things to happen.
The scheduled task triggers the creation of a report. If not all of the above-mentioned "other things" have completed by this time, then the triggered report creation process creates it anyway with the information it has at the time.
If, on the other hand, all the other things do complete before these 24h, then ideally I'd like to re-use the same Google Cloud Task to trigger the same process (it's identical to the previous case, but the report will contain all of the information possible).
I would imagine the easiest way to achieve the above is to:
schedule a task 24h into the future
if all information arrives: run the task early, before its scheduled time
Reading through the Google Cloud Tasks documentation, I don't see an option to run a task early. However, that feature does exist in the Cloud Tasks console, so I was wondering whether it is also available in the API and client libraries.
Thanks!
This is probably what you're looking for
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/run
NOTE: It does say however that "This command is meant to be used for manual debugging"
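If you're using the Python client library, a minimal sketch looks like this (the project, location, queue, and task IDs are placeholders):

```python
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

# Fully qualified name of the already-created task (placeholder IDs).
task_name = client.task_path(
    'my-project', 'us-central1', 'my-queue', 'my-task-id')

# Dispatches the task immediately, regardless of its schedule time.
response = client.run_task(name=task_name)
print(response.name)
```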
I have simple Dataprep jobs that transfer GCS data to BQ. Until today, the scheduled jobs were running fine, but today two jobs failed and two jobs succeeded after taking between half an hour and an hour.
The error message I am getting is below:
java.lang.RuntimeException: Failed to create job with prefix beam_load_clouddataprepcmreportalllobmedia4505510bydataprepadmi_aef678fce2f441eaa9732418fc1a6485_2b57eddf335d0c0b09e3000a805a73d6_00001_00000, reached max retries: 3, last failed job:
I ran the same job again; it again took a very long time and failed, but this time with a different message:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Any pointers or direction on the possible cause? Also, links or troubleshooting tips for Dataprep or Dataflow jobs are appreciated.
Thank you
There could be a lot of potential causes for the jobs to get stuck: transient issues, some quota/limit being reached, a change in data format/size, or another issue with the resources being used. I suggest starting the troubleshooting from the Dataflow side.
Here are some useful resources that can guide you through the most common job errors, and how to troubleshoot them:
Troubleshooting your pipeline
Dataflow common errors
In addition, you could check the Google Issue Trackers for Dataprep and Dataflow to see if the issue has been reported before:
Issue tracker for Dataprep
Issue tracker for Dataflow
And you can also look at the GCP status dashboard to rule out a widespread issue with some service:
Google Cloud Status Dashboard
Finally, if you have GCP support you can reach out directly to support. If you don't have support, you can use the Issue tracker to create a new issue for Dataprep and report the behavior you're seeing.
Currently, the following happens to my data:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job has finished, and to trigger a Dataflow pipeline only after the job is completed.
Note:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea. From what I saw, this trigger uses the job ID, which apparently cannot be fixed, so whenever a job runs the function would have to be redeployed. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log that indicates that the job finished.
My files are large, so it isn't a good idea to use a Google Cloud Function to wait until the job finishes.
Despite your mention of Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can also add a dataset filter if needed.
Then create a sink on this advanced filter (to Pub/Sub, for example), trigger your Cloud Function from it, and run your Dataflow job.
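To make the idea concrete, here is a rough sketch of such a Pub/Sub-triggered Cloud Function (Python). The project, region, template path, and job name are placeholders, and it assumes the sink forwards the jobCompletedEvent entries matched by the filter above:

```python
# Sketch of the Logging -> Pub/Sub -> Cloud Function -> Dataflow idea.
import base64
import json

from googleapiclient.discovery import build

PROJECT = 'my-project'                              # placeholder
REGION = 'us-central1'                              # placeholder
TEMPLATE = 'gs://my-bucket/templates/my-template'   # placeholder


def start_dataflow(event, context):
    """Triggered by a Pub/Sub message coming from the logging sink."""
    entry = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    job_status = (entry['protoPayload']['serviceData']
                       ['jobCompletedEvent']['job']['jobStatus'])

    # The sink filter already restricts to state="DONE"; skip load jobs
    # that finished with an error.
    if 'error' in job_status:
        return

    # Launch a (placeholder) Dataflow template once the load job is done.
    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={'jobName': 'post-load-pipeline'},
    ).execute()
```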
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. You define a DAG, and Composer executes each node of the DAG, checking dependencies to ensure that things run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
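As a hedged sketch of what such a Composer DAG could look like (using the Google provider operators; the bucket, table, template, and project names are placeholders and are not taken from the linked example):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG(
    dag_id='gcs_to_bq_then_dataflow',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,   # trigger externally, e.g. from the GCS event
    catchup=False,
) as dag:

    load_to_bq = GCSToBigQueryOperator(
        task_id='load_to_bq',
        bucket='my-bucket',                              # placeholder
        source_objects=['incoming/*.csv'],               # placeholder
        destination_project_dataset_table='my_dataset.my_table',
        write_disposition='WRITE_APPEND',
        autodetect=True,
    )

    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id='run_dataflow',
        template='gs://my-bucket/templates/my-template',  # placeholder
        project_id='my-project',
        location='us-central1',
    )

    # Dataflow runs only after the BigQuery load task has succeeded.
    load_to_bq >> run_dataflow
```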