Dataprep Job Failed - google-cloud-platform

In Dataprep I have a transform job that failed, with the only information being:
Job Failed : java.lang.NullPointerException: jobId.
It does not even get to the Dataflow jobs, so I have no logs or anything to go on.
Any ideas why, or how to get more info to correct this?

After contacting Google-Trifacta support, there is a workaround:
disabling the Profile Results option on the run page.

Related

Dataflow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all of the required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" can be caused by a worker setup phase that takes too long. To solve this, you can try increasing worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) can solve the problem.
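For reference, here is a minimal sketch of setting a larger worker machine type with the Beam Python SDK; the project ID, region, and bucket are placeholders, not values from the original question.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder project ID
    region="us-central1",                # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder bucket
    machine_type="n1-standard-4",        # larger workers so dependency installation finishes in time
)
with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)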
You can debug this by looking at the worker startup logs in Cloud Logging; you are likely to see pip issues with installing dependencies.
Do you have any error logs showing that the Dataflow workers are crashing when trying to start?
If not, maybe the worker VMs start but can't reach the Dataflow service, which is often related to network connectivity.
Please note that, by default, Dataflow creates jobs using the network and subnetwork named default (check whether they exist in your project); you can switch to a specific one by specifying --subnetwork. See https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
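If you need to point the job at a specific subnetwork, the option can be passed the same way as the other worker options; this is a sketch assuming the Beam Python SDK, and the project, region, bucket, and subnetwork path are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder project ID
    region="us-central1",                # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder bucket
    # Use an explicit subnetwork instead of the "default" network;
    # the short form is "regions/REGION/subnetworks/SUBNETWORK_NAME".
    subnetwork="regions/us-central1/subnetworks/my-subnet",
)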

HyperParameterTuning job failed

Why do I get this error while auto-tuning my XGBoost algorithm in Amazon SageMaker: "Error for HyperParameterTuning job sagemaker-xgboost-220217-0532: Failed. Reason: All training jobs failed. Please take a look at the training jobs failures to get more details."
Please open an AWS support case with more details, such as the job ARN, and the support team will take a look.
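If you want to dig in yourself before (or instead of) opening a case, here is a hedged sketch using boto3 to pull the failure reasons of the underlying training jobs; the tuning job name is taken from the error message above, and everything else assumes default credentials and region.
import boto3

sm = boto3.client("sagemaker")
# List the training jobs launched by the tuning job, then fetch each one's failure reason.
resp = sm.list_training_jobs_for_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="sagemaker-xgboost-220217-0532"
)
for summary in resp["TrainingJobSummaries"]:
    detail = sm.describe_training_job(TrainingJobName=summary["TrainingJobName"])
    print(summary["TrainingJobName"], detail["TrainingJobStatus"], detail.get("FailureReason"))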

Google Cloud Dataprep job failing with error message

I have simple Dataprep jobs that transfer GCS data to BQ. Until today the scheduled jobs were running fine, but today two jobs failed and two jobs succeeded after taking between half an hour and an hour.
The error message I am getting is below:
java.lang.RuntimeException: Failed to create job with prefix beam_load_clouddataprepcmreportalllobmedia4505510bydataprepadmi_aef678fce2f441eaa9732418fc1a6485_2b57eddf335d0c0b09e3000a805a73d6_00001_00000, reached max retries: 3, last failed job:
I ran the same job again; it again took a very long time and failed, but this time with a different message:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Any pointers or direction on the possible cause? Links or troubleshooting tips for Dataprep or Dataflow jobs are also appreciated.
Thank you
There are many potential causes for jobs getting stuck: transient issues, a quota or limit being reached, a change in data format or size, or another issue with the resources being used. I suggest starting the troubleshooting from the Dataflow side.
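As a concrete starting point, here is a sketch of pulling the job's error-level worker logs from Cloud Logging with the Python client; the project ID and job ID are placeholders you would take from the Dataflow console.
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID
# Error-level Dataflow logs for a single job; replace JOB_ID with the ID shown in the Dataflow console.
log_filter = (
    'resource.type="dataflow_step" '
    'resource.labels.job_id="JOB_ID" '
    'severity>=ERROR'
)
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)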
Here are some useful resources that can guide you through the most common job errors, and how to troubleshoot them:
Troubleshooting your pipeline
Dataflow common errors
In addition, you could check the Google Issue Trackers for Dataprep and Dataflow to see whether the issue has been reported before:
Issue tracker for Dataprep
Issue tracker for Dataflow
You can also check the GCP Status Dashboard to rule out a widespread issue with a particular service:
Google Cloud Status Dashboard
Finally, if you have GCP support, you can reach out to them directly. If you don't, you can use the Issue Tracker to create a new issue for Dataprep and report the behavior you're seeing.

Debugging broken dags in GCP Composer

I have read the equivalent question for vanilla Airflow.
How can broken DAGs be debugged effectively in Google Cloud Composer?
How can I see the full logs of a broken DAG?
Right now I can only see one line of the traceback on the Airflow UI main page.
EDIT:
The answers seem to be misunderstanding my question.
I am looking to fix broken DAGs, i.e. the DAG does not even appear in the DAGs list, so of course there are no tasks running and no task logs to view.
As hexacynide pointed out, you can look at the task logs; there are details in the Composer docs about doing that specifically, found here. You can also use Stackdriver Logging, which is enabled by default in Composer projects. In Stackdriver, you can filter your logs on many variables, including by time, by pod (airflow-worker, airflow-webserver, airflow-scheduler, etc.), and by whatever keywords you suspect might appear in the logs.
EDIT: Adding screenshots and more clarity in response to question update
In Airflow, when there's a broken DAG, there is usually some form of error message at the top. (Yes, I know this error message is informative enough that I wouldn't need to debug further, but I'll do it anyway just to show how.)
In the message, I can see that my DAG bq_copy_across_locations is broken.
To debug, I go to Stackdriver, and search for the name of my DAG. I limit the results to the logs from this Composer environment. You can also limit the time frame if needed.
I looked through the error logs and found the Traceback error for the broken DAG.
Alternatively, if you know you only want to search for the stack traceback, you can run an advanced filter looking for your DAG name and the word "traceback". To do so, click the arrow at the right side of the Stackdriver Logging search bar and hit "Convert to advanced filter".
Then enter your advanced filter
resource.type="cloud_composer_environment"
resource.labels.location="YOUR-COMPOSER-REGION"
resource.labels.environment_name="YOUR-ENV-NAME"
("BROKEN-DAG-NAME" AND
"Traceback")
This is what my advanced search looked like
The only logs returned will be the stack traceback logs for that DAG.
To determine run-time issues that occur when a DAG is triggered, you can always look at task logs as you would for any typical Airflow installation. These can be found using the web UI, or by looking in the associated logs folder in your Cloud Composer environment's associated Cloud Storage bucket.
To identify issues at parse time, you can execute Airflow commands using gcloud composer. For example, to run airflow list_dags, the gcloud CLI equivalent would be:
$ gcloud composer environments --location=$REGION run $ENV_NAME -- list_dags --report
Note that the second -- is intentional. This is so that the command argument parser can differentiate between arguments to gcloud and arguments to be passed to the Airflow subcommand (in this case list_dags).

What to do with failed jobs?

In Google Cloud ML (Machine Learning), I submitted a job, but it failed due to a Python error in the code.
After fixing the error, how can I re-run the job? Should I submit a new job?
When I'm done, how to delete the job?
The online documentation is not complete.
Thanks
When you're ready to re-try the job, just submit a new job with a new job name.
There is no way to delete jobs since we want to provide you with a record of previous jobs. Jobs will reach a terminal state (FAILED, SUCCEEDED, or CANCELLED) in which they are no longer consuming any resources. However, the jobs will continue to show up in the UI or in the API if you list jobs.
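If you want to see those terminal states programmatically, here is a sketch using the Google API client library to list jobs; the project ID is a placeholder and the field names assume the v1 jobs API.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")
# List all jobs in the project; finished jobs keep showing up with a terminal state.
resp = ml.projects().jobs().list(parent="projects/my-project").execute()  # placeholder project ID
for job in resp.get("jobs", []):
    print(job["jobId"], job["state"])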