I created a Vertex AI pipeline that performs a simple ML flow: create a dataset, train a model on it, and predict on the test set. There is a Python function-based component (train-logistic-model) where I train the model. However, in the component I specify an invalid package, and hence that step of the pipeline fails. I know this because the step worked fine once I corrected the package name. However, for the failed pipeline I am unable to see any logs. When I click "VIEW JOB" under "Execution Info" on the pipeline Runtime Graph (pic attached), it takes me to the "CUSTOM JOB" page for the job the pipeline ran. There is a message:
Custom job failed with error message: The replica workerpool0-0 exited
with a non-zero status of 1 ...
When I click the VIEW LOGS button, it takes me to the Logs Explorer, where there are NO logs. Why are there no logs? Do I need to enable logging somewhere in the pipeline for this? Or could it be a permission issue? (Nothing mentions permissions, though; there is just this message in the Logs Explorer and 0 logs below it.)
Showing logs for time specified in query. To view more results update
your query
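For context, a component along these lines reproduces the failure described above. This is a hypothetical sketch (the real train-logistic-model component is not shown); the misspelled package name makes the pip install step fail inside the container before any user code runs:
from kfp.v2.dsl import component

# Hypothetical reconstruction of a failing component: the invalid package name
# causes pip install to fail, so workerpool0-0 exits with a non-zero status.
@component(
    base_image="python:3.9",
    packages_to_install=["scikit-laern"],  # misspelled on purpose; should be "scikit-learn"
)
def train_logistic_model(random_state: int = 0) -> float:
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    return float(model.score(X_test, y_test))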
Find the pipeline job ID in the component logs and paste it into the code below:
from google.cloud import aiplatform

def get_status_helper(client, job_name):
    # Look up the custom job by its full resource name and return its state.
    response = client.get_custom_job(name=job_name)
    return str(response.state)

# Placeholders: fill in your own location, project ID and custom job ID.
api_endpoint = f"{location}-aiplatform.googleapis.com"
client_options = {"api_endpoint": api_endpoint}
client = aiplatform.gapic.JobServiceClient(client_options=client_options)

job = client.get_custom_job(
    name="projects/{project-id}/locations/{your-location}/customJobs/{pipeline-id}")
Sample name or pipeline job id for reference:
========================================
projects/123456789101/locations/us-central1/customJobs/23456789101234567892
The above name can be found in the component logs.
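Once the job has been fetched, its state and error fields usually explain why the replica exited with a non-zero status. A minimal sketch, assuming the job object returned by get_custom_job above:
# `error` is a google.rpc.Status message; it is populated for failed jobs.
print("State:", job.state)
print("Error code:", job.error.code)
print("Error message:", job.error.message)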
Related
I am trying to run a Custom Training Job to deploy my model in Vertex AI directly from a JupyterLab notebook. This JupyterLab instance is started from a Vertex AI Managed Notebook where I already specified the service account.
My aim is to deploy the training script that I pass to the CustomTrainingJob method directly from the cells of my notebook. This would be equivalent to pushing an image that contains my script to Container Registry and deploying the training job manually from the Vertex AI UI (in that way, by specifying the service account, I was able to correctly deploy the training job). However, I need everything to be executed from the same notebook.
In order to specify the credentials to the CustomTrainingJob of aiplatform, I execute the following cell, where all variables are correctly set:
import google.auth
from google.cloud import aiplatform
from google.auth import impersonated_credentials

source_credentials = google.auth.default()
target_credentials = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal='SERVICE_ACCOUNT.iam.gserviceaccount.com',
    target_scopes=['https://www.googleapis.com/auth/cloud-platform'])

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
    credentials=target_credentials
)
However, when the job.run() command is executed, it seems that the credentials are not correctly set. In particular, the following error is returned:
/opt/conda/lib/python3.7/site-packages/google/auth/impersonated_credentials.py in _update_token(self, request)
254
255 # Refresh our source credentials if it is not valid.
--> 256 if not self._source_credentials.valid:
257 self._source_credentials.refresh(request)
258
AttributeError: 'tuple' object has no attribute 'valid'
I also tried different ways to configure the credentials of my service account, but none of them seem to work. In this case it looks like the problem is that google.auth.default() returns a (credentials, project_id) tuple, and the tuple itself has no 'valid' attribute.
To run the custom training job using a service account, you could try using the service_account argument for job.run(), instead of trying to set credentials. As long as the notebook executes as a user that has act-as permissions for the chosen service account, this should let you run the custom training job as that service account.
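A minimal sketch of that approach, reusing the variables from the question; the service account email, machine type and replica count here are placeholders/assumptions:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
)

# Run as the service account instead of passing impersonated credentials.
job.run(
    service_account="SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com",  # placeholder email
    replica_count=1,
    machine_type="n1-standard-4",  # assumption: pick whatever machine type you need
)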
I'm having trouble executing Vertex AI's batch inference, despite endpoint deployment and inference working perfectly. My TensorFlow model has been trained in a custom Docker container with the following arguments:
aiplatform.CustomContainerTrainingJob(
    display_name=display_name,
    command=["python3", "train.py"],
    container_uri=container_uri,
    model_serving_container_image_uri=container_uri,
    model_serving_container_environment_variables=env_vars,
    model_serving_container_predict_route='/predict',
    model_serving_container_health_route='/health',
    model_serving_container_command=[
        "gunicorn",
        "src.inference:app",
        "--bind",
        "0.0.0.0:5000",
        "-k",
        "uvicorn.workers.UvicornWorker",
        "-t",
        "6000",
    ],
    model_serving_container_ports=[5000],
)
I have predict and health endpoints in my serving app, essentially defined below:
#app.get(f"/health")
def health_check_batch():
return 200
#app.post(f"/predict")
def predict_batch(request_body: dict):
pred_df = pd.DataFrame(request_body['instances'],
columns = request_body['parameters']['columns'])
# do some model inference things
return {"predictions": predictions.tolist()}
As described, when training a model and deploying it to an endpoint, I can successfully hit the API with a JSON payload like:
{"instances":[[1,2], [1,3]], "parameters":{"columns":["first", "second"]}}
This also works when using the endpoint Python SDK and feeding in instances/parameters as function arguments.
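For reference, the working online prediction call looks roughly like this (a sketch, assuming endpoint is the aiplatform.Endpoint the model was deployed to; ENDPOINT_ID is a placeholder):
from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(ENDPOINT_ID)  # placeholder: the deployed endpoint's ID or resource name

response = endpoint.predict(
    instances=[[1, 2], [1, 3]],
    parameters={"columns": ["first", "second"]},
)
print(response.predictions)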
However, I've tried performing batch inference with a CSV file and a JSONL file, and every time it fails with an Error Code 3. I can't find logs on why it failed in Logs Explorer either. I've read through all the documentation I could find and have seen others successfully invoke batch inference, but I haven't been able to find a guide. Does anyone have recommendations on the batch file structure or the structure of my APIs? Thank you!
I couldn't find relevant information in the documentation. I have tried all the options and links on the batch prediction pages.
The logs can be found, but unfortunately not via any links in the Vertex AI console.
Soon after the batch prediction job fails, go to Logging -> Logs Explorer and create a query like this, replacing YOUR_PROJECT with the name of your GCP project:
logName:"projects/YOUR_PROJECT/logs/ml.googleapis.com"
First look for the same error reported by the Batch Prediction page in the Vertex AI console: "Job failed. See logs for full details."
The log line above the "Job Failed" error will likely report the real reason your batch prediction job failed.
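If you prefer to pull these entries programmatically, here is a minimal sketch using the Cloud Logging client library (assuming the google-cloud-logging package is installed and YOUR_PROJECT is your project ID):
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="YOUR_PROJECT")

# Same filter as the Logs Explorer query above, newest entries first.
log_filter = 'logName:"projects/YOUR_PROJECT/logs/ml.googleapis.com"'

for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    print(entry.timestamp, entry.payload)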
I have found that just going to Cloud Logging after the batch prediction job fails and clicking "Run query" shows the error details.
I am using Vertex AI's AutoML to train a model and it fails with the error message shown below. Where can I find the logs for this job?
Training pipeline failed with error message: Job failed. See logs for details.
I had the same issue just now and raised a case with Google, who told me how to find the error logs.
In the GCP Logs Explorer, you need a filter of resource.type = "ml_job" (make sure your time range is set correctly, too!).
I am trying to run Kubeflow Pipelines with the new Vertex AI on GCP.
Previously, in Kubeflow Pipelines, I was able to use the Run ID in my pipeline via dsl.RUN_ID_PLACEHOLDER or {{workflow.uid}}. My understanding was that dsl.RUN_ID_PLACEHOLDER would resolve to {{workflow.uid}} at compile time, and then at run time the {{workflow.uid}} tag would be resolved to the run's ID. At least, this is how it has worked in my experience with Kubeflow Pipelines and the Kubeflow Pipelines UI.
However, when I try to access the Run ID in a similar way in a pipeline that I run in Vertex AI Pipelines, it seems that dsl.RUN_ID_PLACEHOLDER resolves to {{workflow.uid}} but that this never subsequently resolves to the ID of the run.
I created the following test pipeline, which tries to get the Run ID using the DSL placeholder and then uses a lightweight component to print out the value of the pipeline's run_id parameter. The result of running the pipeline in the UI is that the print_run_id component prints {{workflow.uid}}, whereas on Kubeflow Pipelines it would previously have resolved to the Run ID.
from kfp import dsl
from kfp import components as comp
import logging
from kfp.v2.dsl import (
    component,
    Input,
    Output,
    Dataset,
    Metrics,
)


@component
def print_run_id(run_id: str):
    print(run_id)


RUN_ID = dsl.RUN_ID_PLACEHOLDER


@dsl.pipeline(
    name='end-to-end-pipeline',
    description='End to end XGBoost cover type training pipeline'
)
def end_to_end_pipeline(
    run_id: str = RUN_ID
):
    print_task = print_run_id(run_id=run_id)
Is there a way to access the Run ID using the KFP SDK with Vertex AI Pipelines?
What works on Vertex AI is a different set of magic strings.
Specifically:
from kfp.v2 import dsl
id = dsl.PIPELINE_JOB_ID_PLACEHOLDER
name = dsl.PIPELINE_JOB_NAME_PLACEHOLDER
See also https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/dsl/__init__.py
Just got this answer from our support contact at Google.
This isn't documented well. Trying to access the placeholders during the pipeline build, or from within individual components directly, will return the unmodified placeholder string values. But it does work during the actual pipeline run (at least for kfp v2).
Here is an example for kfp v2, based on this link, that works as expected.
import kfp.v2.dsl as dsl
from kfp.v2.dsl import component


@component
def print_op(msg: str, value: str):
    print(msg, value)  # <-- Prints the correct value


@component
def incorrect_print_op():
    print(dsl.PIPELINE_JOB_NAME_PLACEHOLDER)  # <-- Prints the unresolved placeholder value


@dsl.pipeline(name='pipeline-with-placeholders')
def my_pipeline():
    # Correct
    print_op(msg='job name:', value=dsl.PIPELINE_JOB_NAME_PLACEHOLDER)
    print_op(msg='job resource name:', value=dsl.PIPELINE_JOB_RESOURCE_NAME_PLACEHOLDER)
    print_op(msg='job id:', value=dsl.PIPELINE_JOB_ID_PLACEHOLDER)
    print_op(msg='task name:', value=dsl.PIPELINE_TASK_NAME_PLACEHOLDER)
    print_op(msg='task id:', value=dsl.PIPELINE_TASK_ID_PLACEHOLDER)
    # Incorrect
    print(dsl.PIPELINE_TASK_ID_PLACEHOLDER)  # Prints only the placeholder value
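To actually see those values resolved, compile the pipeline and submit it to Vertex AI Pipelines, since the placeholders are only substituted at run time. A minimal sketch, assuming PROJECT_ID and REGION are set and using a hypothetical GCS bucket for the pipeline root:
from kfp.v2 import compiler
from google.cloud import aiplatform

# Compile the pipeline definition to a job spec.
compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path="pipeline_with_placeholders.json",
)

aiplatform.init(project=PROJECT_ID, location=REGION)

# Submit the compiled pipeline; the placeholders resolve inside this run.
job = aiplatform.PipelineJob(
    display_name="pipeline-with-placeholders",
    template_path="pipeline_with_placeholders.json",
    pipeline_root="gs://YOUR_BUCKET/pipeline-root",  # hypothetical bucket
)
job.run()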