AutoML training pipeline job failed. Where can I find the logs? - google-cloud-ml

I am using Vertex AI's AutoML to train a model and it fails with the error message shown below. Where can I find the logs for this job?
Training pipeline failed with error message: Job failed. See logs for details.

I had the same issue just now and raised a case with Google, who told me how to find the error logs.
In the GCP Logs Explorer, you need a filter of resource.type = "ml_job" (make sure your time range is set correctly, too!)
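If you prefer to pull the same logs programmatically, here is a minimal sketch using the google-cloud-logging client library (this is my addition, not part of the original answer; the filter is the one above, and you can usually narrow it further with resource.labels.job_id):

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()  # uses your default project and credentials
log_filter = 'resource.type="ml_job"'  # same filter as in the Logs Explorer

# Newest entries first; stop after a handful so the loop terminates quickly.
for i, entry in enumerate(client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING)):
    print(entry.timestamp, entry.severity, entry.payload)
    if i >= 49:
        break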

Related

Getting an error in AWS Glue -> Jobs(New) Failed to update job [gluestudio-service.us-east-2.amazonaws.com] updateDag: InternalFailure: null

I've been using AWS Glue Studio for job creation. Until now I was using the legacy jobs, but Amazon has recently migrated to the new version, Glue Job v_3.0, where I am trying to create a job using the Spark script editor.
Steps to be followed
Open Region-Code/console.aws.amazon.com/glue/home?region=Region-Code#/v2/home
Click Create Job link
Select Spark script editor
Make sure you have selected Create a new script with boilerplate code
Then click the Create button in the top right corner.
When I try to save the job after filling in all the required information, I'm getting an error like the one below:
Failed to update job
[gluestudio-service.us-east-1.amazonaws.com] createJob: InternalServiceException: Failed to meet resource limits for operation
Screenshot
Note
I've tried the legacy job creation as well, where I was getting an error like the one below:
{"service":"AWSGlue","statusCode":400,"errorCode":"ResourceNumberLimitExceededException","requestId":"179c2de8-6920-4adf-8791-ece7cbbfbc63","errorMessage":"Failed to meet resource limits for operation","type":"AwsServiceError"}
Is this related to an internal configuration issue?
As I am using a client-provided account, I don't have permission to see the limits and related settings.
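If someone with sufficient permissions wants to check, the sketch below (my addition, not from the question; it assumes the caller has servicequotas:ListServiceQuotas and uses whichever region the jobs are created in) lists the Glue service quotas that a ResourceNumberLimitExceededException refers to:

import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

# Page through every Glue quota and print its current value; look for
# job-related limits such as the maximum number of jobs per account.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="glue"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaName"]}: {quota["Value"]}')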

Unable to view Vertex AI pipeline node logs

I created a Vertex AI pipeline to perform a simple ML flow of creating a dataset, training a model on it and then predicting on the test set. There is a Python function-based component (train-logistic-model) where I train the model. However, in the component I specify an invalid package, and hence that step in the pipeline fails. I know this because when I corrected the package name the step worked fine. However, for the failed pipeline I am unable to see any logs. When I click on "VIEW JOB" under "Execution Info" on the pipeline Runtime Graph (pic attached), it takes me to the page for the "CUSTOM JOB" that the pipeline ran. There is a message:
Custom job failed with error message: The replica workerpool0-0 exited
with a non-zero status of 1 ...
When I click the VIEW LOGS button, it takes me to the Logs Explorer, where there are NO logs. Why are there no logs? Do I need to enable logging somewhere in the pipeline for this? Or could it be a permission issue? (It does not mention anything about that, though; there is just this message in the Logs Explorer and 0 logs below it.)
Showing logs for time specified in query. To view more results update
your query
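For reference, the failing step is the kind of Python function-based component sketched below. This is a hypothetical reconstruction (aside from the component name, none of it is the asker's actual code): a misspelled entry in packages_to_install makes the step's pip install fail at container start-up, which is one common way a replica exits with a non-zero status of 1.

from kfp.dsl import component  # kfp v2; in kfp 1.8 use `from kfp.v2.dsl import component`

@component(
    base_image="python:3.9",
    # A typo in a package name here makes the step's pip install fail,
    # so the replica exits with a non-zero status.
    packages_to_install=["scikit-learn"],
)
def train_logistic_model(c: float = 1.0) -> float:
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(C=c, max_iter=200).fit(X, y)
    return float(model.score(X, y))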
Find the pipeline job ID (the custom job resource name) in the component logs and paste it into the code below:
from google.cloud import aiplatform

def get_status_helper(client, job_name):
    # Fetch the custom job that backs the failed pipeline step and return its
    # state, e.g. JOB_STATE_FAILED. (For a hyperparameter tuning job you would
    # call client.get_hyperparameter_tuning_job instead.)
    response = client.get_custom_job(name=job_name)
    return str(response.state)

location = "us-central1"  # the region your pipeline ran in
api_endpoint = f"{location}-aiplatform.googleapis.com"
client_options = {"api_endpoint": api_endpoint}
client = aiplatform.gapic.JobServiceClient(client_options=client_options)

job = client.get_custom_job(
    name="projects/{project-id}/locations/{your-location}/customJobs/{pipeline-id}")
Sample name or pipeline job id for reference:
========================================
projects/123456789101/locations/us-central1/customJobs/23456789101234567892
The name above can be found in the component logs.
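Once you have the job, the response itself usually carries more detail than the generic pipeline message. A minimal sketch (the state and error fields come from the CustomJob response; the Logs Explorer filter and the job ID value are my assumptions based on the sample name above):

from google.cloud import logging as cloud_logging

print(job.state)          # e.g. JobState.JOB_STATE_FAILED
print(job.error.message)  # the error recorded on the CustomJob resource

# Pull the worker logs for that custom job; the numeric suffix of the job
# name is assumed to be the value of resource.labels.job_id.
log_client = cloud_logging.Client()
log_filter = (
    'resource.type="ml_job" '
    'resource.labels.job_id="23456789101234567892" '
    'severity>=ERROR'
)
for entry in log_client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    print(entry.timestamp, entry.payload)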

When can one find logs for Vertex AI Batch Prediction jobs?

I couldn't find relevant information in the documentation. I have tried all the options and links on the batch prediction pages.
They can be found, but unfortunately not via any links in the Vertex AI console.
Soon after the batch prediction job fails, go to Logging -> Logs Explorer and create a query like this, replacing YOUR_PROJECT with the name of your gcp project:
logName:"projects/YOUR_PROJECT/logs/ml.googleapis.com"
First look for the same error reported by the Batch Prediction page in the Vertex AI console: "Job failed. See logs for full details."
The log line above the "Job Failed" error will likely report the real reason your batch prediction job failed.
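If you want to automate that lookup, here is a hedged sketch with the Cloud Logging client (my addition; YOUR_PROJECT is the same placeholder as above, and the check for the phrase "Job failed" in the text payload is an assumption about how the generic error is worded):

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="YOUR_PROJECT")
log_filter = 'logName:"projects/YOUR_PROJECT/logs/ml.googleapis.com"'

found_failure = False
# Entries come back newest first, so the entry iterated *after* the generic
# "Job failed" line is the one logged just before it, i.e. the line shown
# above it in the Logs Explorer.
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    text = str(entry.payload)
    if found_failure:
        print("Likely root cause:", entry.timestamp, text)
        break
    if "Job failed" in text:
        found_failure = True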
I have found that just going to the Cloud Logging Logs Explorer after the batch prediction job fails and clicking Run query shows the error details.

Check Dataflow errors

I am trying to implement a data pipeline where I insert JSON into PubSub and move it from there via Dataflow into BigQuery. I am using the template to transfer data from PubSub to BigQuery. My Dataflow job is failing; it is going into the error flow, but I don't see where to get more details on the error. For example, is it failing due to bad encoding of the data in PubSub, or because of a schema mismatch, etc.? Where can I find these details? I am checking Stackdriver errors and logs but am not able to locate anything further.
To add to that, I can see this error:
resource.type="dataflow_step"
resource.labels.job_id="2018-07-17_20_36_16-6729875790634111180"
logName="projects/camel-154800/logs/dataflow.googleapis.com%2Fworker"
timestamp >= "2018-07-18T03:36:17Z" severity>="INFO"
resource.labels.step_id=("WriteFailedRecords/FailedRecordToTableRow"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/PrepareWrite/ParDo(Anonymous)"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/CreateTables/ParDo(CreateTables)"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/StreamingWriteTables/ShardTableWrites"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/StreamingWriteTables/TagWithUniqueIds"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/StreamingWriteTables/Reshuffle/Window.Into()/Window.Assign"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/StreamingWriteTables/Reshuffle/ExpandIterable"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/StreamingWriteTables/GlobalWindow/Window.Assign"
OR
"WriteFailedRecords/WriteFailedRecordsToBigQuery/StreamingInserts/StreamingWriteTables/StreamingWrite")
It tells me it failed, but I have no clue why. Was there a schema mismatch, a data type problem, wrong encoding, or something else? How do I debug this?
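One way to narrow this down (not an answer from the thread, just a hedged sketch reusing the project and job ID from the filter above) is to pull only the ERROR-severity worker logs for the job, which is typically where bad encoding, schema mismatches, and failed BigQuery inserts surface:

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="camel-154800")
log_filter = (
    'resource.type="dataflow_step" '
    'resource.labels.job_id="2018-07-17_20_36_16-6729875790634111180" '
    'logName="projects/camel-154800/logs/dataflow.googleapis.com%2Fworker" '
    'severity>=ERROR'
)
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    print(entry.timestamp, entry.payload)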

PubSub resource setup failing for Dataflow job when assigning timestampLabel

After modifying my job to start using timestampLabel when reading from PubSub, resource setup seems to break every time I try to start the job with the following error:
(c8bce90672926e26): Workflow failed. Causes: (5743e5d17dd7bfb7): Step setup_resource_/subscriptions/project-name/subscription-name__streaming_dataflow_internal25: Set up of resource /subscriptions/project-name/subscription-name__streaming_dataflow_internal failed
where project-name and subscription-name represent the actual values of my project and the PubSub subscription I'm trying to read from. Before trying to attach the timestampLabel to message entries, the job was working correctly, consuming messages from the specified PubSub subscription, which should mean that my API/network settings are OK.
I'm also noticing two warnings with the payload
Internal Issue (119d3b54af281acf): 65177287:8503
but no more information can be found in the worker logs. For the few seconds that my job is setting up I can see the timestampLabel being set in the first step of the pipeline. Unfortunately I can't find any other cases or documentation about this error.
When using the timestampLabel feature, a second subscription is created for tracking purposes. Double-check the permission settings on your topic to make sure they match the permissions required.
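A hedged sketch (my addition; the project and topic names are placeholders) for inspecting the topic's IAM policy with the Pub/Sub client, so you can confirm that the account running the Dataflow job has the permissions it needs to create and manage that extra tracking subscription:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholders: use your real project and the topic the subscription belongs to.
topic_path = publisher.topic_path("project-name", "topic-name")

# Print every role binding on the topic and who holds it.
policy = publisher.get_iam_policy(request={"resource": topic_path})
for binding in policy.bindings:
    print(binding.role, list(binding.members))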