Run ID in Kubeflow Pipelines on Vertex AI - google-cloud-ml

I am trying to run Kubeflow Pipelines with the new Vertex AI on GCP.
Previously, in Kubeflow Pipelines, I was able to use the Run ID in my pipeline via dsl.RUN_ID_PLACEHOLDER or {{workflow.uid}}. My understanding was that dsl.RUN_ID_PLACEHOLDER would resolve to {{workflow.uid}} at compile time, and then at run time the {{workflow.uid}} tag would be resolved to the Run's ID. This is at least how it has worked in my experience using Kubeflow Pipelines and the Kubeflow Pipelines UI.
However, when I try to access the Run ID in a similar way in a pipeline that I run in Vertex AI Pipelines, it seems that dsl.RUN_ID_PLACEHOLDER resolves to {{workflow.uid}} but that this never subsequently resolves to the ID of the run.
I created the following test pipeline, which tries to get the Run ID using the DSL placeholder, then uses a lightweight component to print out the value of the run_id parameter of the pipeline. The result of running the pipeline in the UI is that the print_run_id component prints {{workflow.uid}}, whereas on Kubeflow Pipelines previously it would have resolved to the Run ID.
from kfp import dsl
from kfp import components as comp
import logging
from kfp.v2.dsl import (
    component,
    Input,
    Output,
    Dataset,
    Metrics,
)

@component
def print_run_id(run_id: str):
    print(run_id)

RUN_ID = dsl.RUN_ID_PLACEHOLDER

@dsl.pipeline(
    name='end-to-end-pipeline',
    description='End to end XGBoost cover type training pipeline'
)
def end_to_end_pipeline(
    run_id: str = RUN_ID
):
    print_task = print_run_id(run_id=run_id)
Is there a way to access the Run ID using the KFP SDK with Vertex AI Pipelines?

What works on Vertex AI is a different set of magic strings.
Specifically:
from kfp.v2 import dsl
id = dsl.PIPELINE_JOB_ID_PLACEHOLDER
name = dsl.PIPELINE_JOB_NAME_PLACEHOLDER
See also https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/dsl/__init__.py
Just got this answer from our support at Google.
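To connect this back to the pipeline in the question, here is a minimal sketch (assuming kfp v2, i.e. the kfp.v2.dsl module) of how the print_run_id component could receive the job ID on Vertex AI Pipelines by passing the placeholder as a component argument inside the pipeline function:

from kfp.v2 import dsl
from kfp.v2.dsl import component

@component
def print_run_id(run_id: str):
    print(run_id)

@dsl.pipeline(
    name='end-to-end-pipeline',
    description='End to end XGBoost cover type training pipeline'
)
def end_to_end_pipeline():
    # The placeholder is passed as a component argument and is substituted
    # with the actual Vertex pipeline job ID at run time.
    print_run_id(run_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER)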

This isn't documented well. Trying to access them during the pipeline build, or from within individual components directly, will return the unmodified placeholder string values, but it does work during the actual pipeline run (at least for kfp v2).
Here is an example for kfp v2, based on this link, that works as expected.
import kfp.v2.dsl as dsl

@dsl.component
def print_op(msg: str, value: str):
    print(msg, value)  # <-- Prints the correct value

@dsl.component
def incorrect_print_op():
    print(dsl.PIPELINE_JOB_NAME_PLACEHOLDER)  # <-- Prints the unresolved placeholder value

@dsl.pipeline(name='pipeline-with-placeholders')
def my_pipeline():
    # Correct
    print_op(msg='job name:', value=dsl.PIPELINE_JOB_NAME_PLACEHOLDER)
    print_op(msg='job resource name:', value=dsl.PIPELINE_JOB_RESOURCE_NAME_PLACEHOLDER)
    print_op(msg='job id:', value=dsl.PIPELINE_JOB_ID_PLACEHOLDER)
    print_op(msg='task name:', value=dsl.PIPELINE_TASK_NAME_PLACEHOLDER)
    print_op(msg='task id:', value=dsl.PIPELINE_TASK_ID_PLACEHOLDER)

    # Incorrect
    print(dsl.PIPELINE_TASK_ID_PLACEHOLDER)  # Prints only the placeholder value
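For completeness, here is a minimal sketch of compiling and submitting such a pipeline so that the placeholders actually get resolved by the backend at run time; the project, region and bucket values below are placeholders, not taken from the question:

from kfp.v2 import compiler
from google.cloud import aiplatform

# Compile the pipeline function above into a Vertex-compatible job spec.
compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path='pipeline_with_placeholders.json')

# Submit the compiled spec; the placeholders are filled in by the
# Vertex AI Pipelines backend when the job actually runs.
aiplatform.init(project='YOUR_PROJECT', location='us-central1')
job = aiplatform.PipelineJob(
    display_name='pipeline-with-placeholders',
    template_path='pipeline_with_placeholders.json',
    pipeline_root='gs://YOUR_BUCKET/pipeline_root')
job.run()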

Related

Try deploying a custom ML model with an endpoint by using a custom image on Google Cloud's Vertex AI

I have been banging my head around this for a while and Google Cloud does not have a lot of documentation about this issue. What I am trying to do is deploy a custom ML model on Google Cloud Vertex by:
Uploading the model to the Model Registry in Vertex AI
Creating an endpoint
Deploying the uploaded model to the created endpoint.
Steps 1 and 2 are easy to implement, and I am not facing any issues. However, step 3 always fails for some reason. Even the logs don't give me a lot of information.
For Step 1:
This is the Dockerfile I am using to create a custom image to serve my ML model:
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8-slim
COPY requirements-base.txt requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY serve.py serve.py
COPY model.pkl model.pkl
And this is what my serve.py file looks like:
from fastapi import Request, FastAPI, Response
import json
import catboost
import pickle
import os

app = FastAPI(title="Sentiment Analysis")

AIP_HEALTH_ROUTE = os.environ.get('AIP_HEALTH_ROUTE', '/health')
AIP_PREDICT_ROUTE = os.environ.get('AIP_PREDICT_ROUTE', '/predict')

@app.get(AIP_HEALTH_ROUTE, status_code=200)
async def health():
    return {'health': 'ok'}

@app.post(AIP_PREDICT_ROUTE)
async def predict(request: Request):
    with open('model.pkl', 'rb') as file:
        model = pickle.load(file)
    data = request.get_json()
    input_data = data['input']
    predictions = model.predict(input_data)
    return json.dumps({'predictions': predictions.tolist()})

if __name__ == '__main__':
    app.run(debug=True, host="0.0.0.0", port=8080)
After building the image, I push it to artifact registry on Google Cloud.
Is there an issue with how I have written the serve.py file or Dockerfile?
Or is there an easier way to deploy custom ML models on Google Cloud for MLOps and prediction purposes?
I have tried a couple of manual approaches from the Google Cloud Vertex AI console and also using gcloud commands.
In the manual process, after importing the model with the custom image, I clicked on deploy to an endpoint. But this always seems to fail and takes forever.
Similarly, using gcloud, I first create an endpoint, then upload my model to the registry, and then deploy the model to the created endpoint. But this approach also fails.
At the end of the day I want my model to be successfully deployed on the endpoint and to give the right answers for predictions, or to somehow host my custom ML model on Google Cloud and make predictions with it in a reasonable and manageable way.
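Not an authoritative diagnosis, but two details in the posted serve.py are worth checking, and a minimal sketch of an adjusted handler follows. FastAPI's Request has no get_json() method (that is Flask's API; in FastAPI the body is read with await request.json()), and Vertex AI endpoints send prediction requests as a JSON body of the form {"instances": [...]} and expect a response of the form {"predictions": [...]}. The sketch below assumes that request/response contract; everything else from the question is kept.

# Minimal sketch of a handler following the Vertex AI custom-container
# contract ({"instances": [...]} in, {"predictions": [...]} out).
import os
import pickle

from fastapi import FastAPI, Request

app = FastAPI(title="Sentiment Analysis")

AIP_HEALTH_ROUTE = os.environ.get('AIP_HEALTH_ROUTE', '/health')
AIP_PREDICT_ROUTE = os.environ.get('AIP_PREDICT_ROUTE', '/predict')

# Load the model once at startup instead of on every request.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.get(AIP_HEALTH_ROUTE, status_code=200)
async def health():
    return {'health': 'ok'}

@app.post(AIP_PREDICT_ROUTE)
async def predict(request: Request):
    body = await request.json()    # FastAPI has no request.get_json()
    instances = body['instances']  # Vertex sends {"instances": [...]}
    predictions = model.predict(instances)
    return {'predictions': predictions.tolist()}  # FastAPI serializes the dict to JSON

It is also worth double-checking that the container actually serves this module and port: the tiangolo/uvicorn-gunicorn-fastapi base image serves main:app by default (so a file named serve.py may need MODULE_NAME=serve or APP_MODULE=serve:app set in the Dockerfile), and Vertex AI sends traffic to port 8080 unless a different port is declared on the model's container spec (the chosen port is also exposed to the container via AIP_HTTP_PORT). These are assumptions about the image defaults and the Vertex container contract, worth verifying against the documentation.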

Authenticate Custom Training Job in Vertex AI with Service Account

I am trying to run a Custom Training Job to deploy my model in Vertex AI directly from a Jupyterlab. This Jupyterlab is instantiated from a Vertex AI Managed Notebook where I already specified the service account.
My aim is to deploy the training script that I pass to the CustomTrainingJob method directly from the cells of my notebook. This would be equivalent to pushing an image that contains my script to Container Registry and deploying the training job manually from the UI of Vertex AI (in that way, by specifying the service account, I was able to correctly deploy the training job). However, I need everything to be executed from the same notebook.
In order to specify the credentials to the CustomTrainingJob of aiplatform, I execute the following cell, where all variables are correctly set:
import google.auth
from google.cloud import aiplatform
from google.auth import impersonated_credentials

source_credentials = google.auth.default()
target_credentials = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal='SERVICE_ACCOUNT.iam.gserviceaccount.com',
    target_scopes=['https://www.googleapis.com/auth/cloud-platform'])

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
    credentials=target_credentials
)
After the job.run() command is executed, it seems that the credentials are not correctly set. In particular, the following error is returned:
/opt/conda/lib/python3.7/site-packages/google/auth/impersonated_credentials.py in _update_token(self, request)
254
255 # Refresh our source credentials if it is not valid.
--> 256 if not self._source_credentials.valid:
257 self._source_credentials.refresh(request)
258
AttributeError: 'tuple' object has no attribute 'valid'
I also tried different ways to configure the credentials of my service account, but none of them seem to work. In this case it looks like the tuple that contains the source credentials is missing the 'valid' attribute, even though the google.auth.default() method only returns two values.
To run the custom training job as a service account, you could try using the service_account argument of job.run() instead of trying to set credentials. As long as the notebook executes as a user that has act-as permissions for the chosen service account, this should let you run the custom training job as that service account.
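A minimal sketch of that approach (the service account address is a placeholder; everything else reuses the variables from the question):

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
)

# Run the job as the service account instead of passing impersonated
# credentials; the notebook's identity needs the Service Account User
# role on that account.
job.run(
    service_account='SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com',
)

If credential impersonation is still preferred, note that google.auth.default() returns a (credentials, project) tuple, which is what produces the AttributeError above; unpacking it (source_credentials, _ = google.auth.default()) addresses that particular error.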

Unable to view Vertex AI pipeline node logs

I created a Vertex AI pipeline to perform a simple ML flow of creating a dataset, training a model on it and then predicting on the test set. There is a Python function-based component (train-logistic-model) in which I train the model. However, in the component I specify an invalid package and hence the step in the pipeline fails. I know this because when I corrected the package name the step worked fine. However, for the failed pipeline I am unable to see any logs. When I click on "VIEW JOB" under "Execution Info" on the pipeline Runtime Graph (pic attached), it takes me to the "CUSTOM JOB" page for the job that the pipeline ran. There is a message:
Custom job failed with error message: The replica workerpool0-0 exited
with a non-zero status of 1 ...
When I click the VIEW LOGS button, it takes me to the Logs Explorer, where there are NO logs. Why are there no logs? Do I need to enable logging somewhere in the pipeline for this? Or could it be a permission issue? (It does not mention anything about that, though; there is just this message in the Logs Explorer and 0 logs below it.)
Showing logs for time specified in query. To view more results update
your query
Find the pipeline job ID in the component logs and paste it into the code below:
from google.cloud import aiplatform
from collections import namedtuple
import json
import time

def get_status_helper(client):
    # Assumes `training_job` (the pipeline task) and `location` are defined
    # elsewhere in the surrounding notebook or script.
    response = client.get_hyperparameter_tuning_job(
        name=training_job.metadata["resource_name"])
    job_status = str(response.state)
    return job_status

api_endpoint = f"{location}-aiplatform.googleapis.com"
client_options = {"api_endpoint": api_endpoint}
client = aiplatform.gapic.JobServiceClient(client_options=client_options)

client.get_custom_job(
    name="projects/{project-id}/locations/{your-location}/customJobs/{pipeline-id}")
Sample name or pipeline job ID for reference:
projects/123456789101/locations/us-central1/customJobs/23456789101234567892
The above name can be found in the component logs.
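As a follow-up, the CustomJob returned by get_custom_job also carries state and error fields, which can surface the failure reason even when the Logs Explorer query returns nothing; the resource name below is the same placeholder as above:

job = client.get_custom_job(
    name="projects/{project-id}/locations/{your-location}/customJobs/{pipeline-id}")
print(job.state)  # e.g. JobState.JOB_STATE_FAILED
print(job.error)  # google.rpc.Status with the failure code and message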

Schedule batch predictions Vertex AI

I have created a forecasting model using AutoML on Vertex AI. I want to use this model to make batch predictions every week. Is there a way to schedule this?
The data to make those predictions is stored in a BigQuery table, which is updated every week.
There is no automatic scheduling directly in Vertex AutoML yet, but there are many different ways to set this up on GCP.
Two options to try first, using the client libraries available for BigQuery and Vertex (a minimal sketch of the batch prediction call follows the list):
Use Cloud Scheduler to run it on a cron schedule: https://cloud.google.com/scheduler/docs/quickstart
Use either Cloud Functions or Cloud Run to set up a BigQuery event trigger, and then trigger the AutoML batch prediction. Example to repurpose: https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events
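Whichever trigger you pick, the scheduled code eventually has to start the batch prediction job itself. Here is a minimal sketch using the google-cloud-aiplatform client library, reading from and writing to BigQuery (project, model ID and table names are placeholders):

from google.cloud import aiplatform

aiplatform.init(project='YOUR_PROJECT', location='us-central1')

# Look up the trained AutoML forecasting model by its resource name.
model = aiplatform.Model('projects/YOUR_PROJECT/locations/us-central1/models/MODEL_ID')

# Run a batch prediction against the weekly-updated BigQuery table and
# write the results back to a BigQuery dataset.
batch_job = model.batch_predict(
    job_display_name='weekly-forecast',
    instances_format='bigquery',
    predictions_format='bigquery',
    bigquery_source='bq://YOUR_PROJECT.your_dataset.input_table',
    bigquery_destination_prefix='bq://YOUR_PROJECT.your_dataset',
    sync=False,
)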
Not sure if you're using Vertex Pipelines to run the prediction job, but if you are, there's a method to schedule your pipeline execution listed here.
from kfp.v2.google.client import AIPlatformClient  # noqa: F811

api_client = AIPlatformClient(project_id=PROJECT_ID, region=REGION)

# adjust time zone and cron schedule as necessary
response = api_client.create_schedule_from_job_spec(
    job_spec_path="intro_pipeline.json",
    schedule="2 * * * *",
    time_zone="America/Los_Angeles",  # change this as necessary
    parameter_values={"text": "Hello world!"},
    # pipeline_root=PIPELINE_ROOT  # this argument is necessary if you did not specify PIPELINE_ROOT as part of the pipeline definition.
)

Why is a pod on the GKE cluster OOMKilled when trying to run a very simple Kubeflow pipeline using TFX?

I'm following the TFX on Cloud AI Platform Pipelines tutorial to implement a Kubeflow orchestrated pipeline on Google Cloud. The main difference is that I'm trying to implement an Object Detection solution instead of the Taxi application proposed by the tutorial.
For this reason I (locally) created a dataset of images labelled via labelImg and converted it to a .tfrecord using this script; the .tfrecord is uploaded to a GCS bucket. Then I followed the TFX tutorial, creating the GKE cluster (the default one, with this configuration) and the Jupyter Notebook needed to run the code, importing the same template.
The main difference is in the first component of the pipeline, where I changed the CSVExampleGen component to an ImportExampleGen one:
def create_pipeline(
    pipeline_name: Text,
    pipeline_root: Text,
    data_path: Text,
    # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
    # query: Text,
    preprocessing_fn: Text,
    run_fn: Text,
    train_args: tfx.proto.TrainArgs,
    eval_args: tfx.proto.EvalArgs,
    eval_accuracy_threshold: float,
    serving_model_dir: Text,
    metadata_connection_config: Optional[
        metadata_store_pb2.ConnectionConfig] = None,
    beam_pipeline_args: Optional[List[Text]] = None,
    ai_platform_training_args: Optional[Dict[Text, Text]] = None,
    ai_platform_serving_args: Optional[Dict[Text, Any]] = None,
) -> tfx.dsl.Pipeline:
  """Implements the chicago taxi pipeline with TFX."""
  components = []

  # Brings data into the pipeline or otherwise joins/converts training data.
  example_gen = tfx.components.ImportExampleGen(input_base=data_path)
  # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
  # example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
  #     query=query)
  components.append(example_gen)
No other components are inserted in the pipeline and the data path points to the location of the folder on the bucket containing the .tfrecord:
DATA_PATH = 'gs://(project bucket)/(dataset folder)'
This is the runner code (basically identical to that of the TFX tutorial):
def run():
  """Define a kubeflow pipeline."""
  # Metadata config. The defaults work with the installation of
  # KF Pipelines using Kubeflow. If installing KF Pipelines using the
  # lightweight deployment option, you may need to override the defaults.
  # If you use Kubeflow, metadata will be written to MySQL database inside
  # Kubeflow cluster.
  metadata_config = tfx.orchestration.experimental.get_default_kubeflow_metadata_config(
  )

  runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
      kubeflow_metadata_config=metadata_config,
      tfx_image=configs.PIPELINE_IMAGE)
  pod_labels = {
      'add-pod-env': 'true',
      tfx.orchestration.experimental.LABEL_KFP_SDK_ENV: 'tfx-template'
  }
  tfx.orchestration.experimental.KubeflowDagRunner(
      config=runner_config, pod_labels_to_attach=pod_labels
  ).run(
      pipeline.create_pipeline(
          pipeline_name=configs.PIPELINE_NAME,
          pipeline_root=PIPELINE_ROOT,
          data_path=DATA_PATH,
          # TODO(step 7): (Optional) Uncomment below to use BigQueryExampleGen.
          # query=configs.BIG_QUERY_QUERY,
          preprocessing_fn=configs.PREPROCESSING_FN,
          run_fn=configs.RUN_FN,
          train_args=tfx.proto.TrainArgs(num_steps=configs.TRAIN_NUM_STEPS),
          eval_args=tfx.proto.EvalArgs(num_steps=configs.EVAL_NUM_STEPS),
          eval_accuracy_threshold=configs.EVAL_ACCURACY_THRESHOLD,
          serving_model_dir=SERVING_MODEL_DIR,
          # TODO(step 7): (Optional) Uncomment below to provide GCP related
          # config for BigQuery with Beam DirectRunner.
          # beam_pipeline_args=configs
          # .BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS,
          # TODO(step 8): (Optional) Uncomment below to use Dataflow.
          # beam_pipeline_args=configs.DATAFLOW_BEAM_PIPELINE_ARGS,
          # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
          # ai_platform_training_args=configs.GCP_AI_PLATFORM_TRAINING_ARGS,
          # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
          # ai_platform_serving_args=configs.GCP_AI_PLATFORM_SERVING_ARGS,
      ))

if __name__ == '__main__':
  logging.set_verbosity(logging.INFO)
  run()
The pipeline is then created and a run is invoked with the following code from the Notebook:
!tfx pipeline create --pipeline-path=kubeflow_runner.py --endpoint={ENDPOINT} --build-image
!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}
The problem is that, while the pipeline from the example runs without problem, this pipeline always fails with the pod on the GKE cluster exiting with code 137 (OOMKilled).
This is a snapshot of the cluster workload status and this is a full log dump of the run that crashes.
I've already tried reducing the dataset size (it is now about 6 MB for the whole .tfrecord) and splitting it locally into two sets (validation and training), since the crash seems to happen when the component splits the dataset, but neither of these changed the situation.
Do you have any idea why it goes out of memory and what steps I could take to solve this?
Thank you very much.
If an application has a memory leak or tries to use more memory than a set limit amount, Kubernetes will terminate it with an “OOMKilled—Container limit reached” event and Exit Code 137.
When you see a message like this, you have two choices: increase the limit for the pod or start debugging. If, for example, your website was experiencing an increase in load, then adjusting the limit would make sense. On the other hand, if the memory use was sudden or unexpected, it may indicate a memory leak and you should start debugging immediately.
Remember, Kubernetes killing a pod like that is a good thing—it prevents all the other pods from running on the same node.
Also refer to the similar issues link1 and link2; hope it helps. Thanks.
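If raising the limit is the route you take with the TFX template, one way (assuming a TFX version in which KubeflowDagRunnerConfig still accepts pipeline_operator_funcs) is to attach memory requests/limits to every pipeline pod from kubeflow_runner.py; the 4G/8G values below are placeholders to adjust to your GKE node sizes:

from tfx.orchestration.kubeflow import kubeflow_dag_runner

def _set_memory_resources(container_op):
    # kfp ContainerOp helpers; raise these to match what your nodes can offer.
    container_op.set_memory_request('4G')
    container_op.set_memory_limit('8G')

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config,
    tfx_image=configs.PIPELINE_IMAGE,
    pipeline_operator_funcs=(
        kubeflow_dag_runner.get_default_pipeline_operator_funcs()
        + [_set_memory_resources]),
)

Keep in mind that a memory request higher than what a node can offer will leave the pod unschedulable, so increasing the limits may also mean recreating the cluster with larger nodes.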