How can I pass parameters to a Vertex AI Platform Pipeline? - google-cloud-platform

I have created a Vertex AI pipeline similar to this.
Now the pipeline has a reference to a CSV file, so if this CSV file changes the pipeline needs to be recreated.
Is there any way to pass a new CSV as a parameter to the pipeline when it is re-run, i.e. without recreating the pipeline using the notebook?
If not, is there a best practice way of auto updating the dataset, model and deployment?

Have a look at that documentation.
You can define your pipeline like this:
...
# Define the workflow of the pipeline.
@kfp.dsl.pipeline(
    name="automl-image-training-v2",
    pipeline_root=pipeline_root_path)
def pipeline(project_id: str):
    ...
(you have something very similar in your notebook sample)
Then, when you invoke your pipeline, you can pass parameters:
import google.cloud.aiplatform as aip

job = aip.PipelineJob(
    display_name="automl-image-training-v2",
    template_path="image_classif_pipeline.json",
    pipeline_root=pipeline_root_path,
    parameter_values={
        'project_id': project_id
    }
)
job.submit()
Note how project_id appears as a key in the parameter_values dict and as a parameter of your pipeline function.
Do the same for your CSV file name!
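For example, here is a minimal sketch assuming a pipeline parameter named csv_path (the parameter name and GCS paths are illustrative, not taken from the notebook sample). Once the parameter is added and the template compiled, each re-run can point at a different CSV without rebuilding the pipeline in the notebook:
import kfp
import google.cloud.aiplatform as aip

project_id = "your-project-id"                          # placeholder
pipeline_root_path = "gs://your-bucket/pipeline-root"   # placeholder

@kfp.dsl.pipeline(
    name="automl-image-training-v2",
    pipeline_root=pipeline_root_path)
def pipeline(project_id: str, csv_path: str):
    # Use csv_path inside the pipeline, e.g. as the import source of the dataset.
    ...

# On each re-run, just pass the new CSV location as a parameter value:
job = aip.PipelineJob(
    display_name="automl-image-training-v2",
    template_path="image_classif_pipeline.json",
    pipeline_root=pipeline_root_path,
    parameter_values={
        'project_id': project_id,
        'csv_path': 'gs://your-bucket/data/this_weeks_file.csv',  # new CSV per run
    }
)
job.submit()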

Related

Using Lambda for data processing - Sagemaker

I have created a Docker image whose entrypoint is processing.py. This script takes data from /opt/ml/processing/input and, after processing, puts it into the /opt/ml/processing/output folder.
To process the data I need to place the file from S3 into /opt/ml/processing/input, and then pick up the processed file from /opt/ml/processing/output and put it back into S3.
The following script in SageMaker does this properly:
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
import sagemaker

input_data = 's3://sagemaker-ap-south-1-057036842446/sagemaker/Data/Training/Churn_Modelling.csv'
output_dir = 's3://sagemaker-ap-south-1-057036842446/sagemaker/Outputs/'
image_uri = '057036842446.dkr.ecr.ap-south-1.amazonaws.com/aws-docker-repo:latest'
aws_role = sagemaker.get_execution_role()

processor = Processor(image_uri=image_uri, role=aws_role, instance_count=1, instance_type="ml.m5.xlarge")
processor.run(
    inputs=[
        ProcessingInput(
            source=input_data,
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/output',
            destination=output_dir
        )
    ]
)
Could someone please explain how this can be executed with a Lambda function? Lambda does not recognize the sagemaker package, and there is also the challenge of placing the input file before the script executes and picking up the processed files afterwards.
I am trying CodePipeline to automate this operation, but have had no success with that so far.
I am not sure how to get the file from S3 into the folders the script uses internally, i.e. how the processing step picks up data from /opt/ml/processing/input.
If you want to kick off a Processing Job from Lambda you can use boto3 to make the CreateProcessingJob API call:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_processing_job
I would suggest creating the Job as you have been doing using the SageMaker SDK. Once created, you can describe the Job using the DescribeProcessingJob API call:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_processing_job
You can then use the information from the DescribeProcessingJob API call output to fill out the CreateProcessingJob in Lambda.
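As a minimal sketch of such a Lambda handler (the job name and role ARN are placeholders you would fill in, e.g. from the DescribeProcessingJob output of a job created with the SDK; the S3 locations and image URI mirror the Processor.run() call above):
import boto3

sm = boto3.client('sagemaker')

def handler(event, context):
    sm.create_processing_job(
        ProcessingJobName='churn-processing-job',  # must be unique per run
        AppSpecification={
            'ImageUri': '057036842446.dkr.ecr.ap-south-1.amazonaws.com/aws-docker-repo:latest',
        },
        RoleArn='arn:aws:iam::057036842446:role/YourSageMakerRole',  # placeholder role
        ProcessingResources={
            'ClusterConfig': {
                'InstanceCount': 1,
                'InstanceType': 'ml.m5.xlarge',
                'VolumeSizeInGB': 30,
            }
        },
        ProcessingInputs=[{
            'InputName': 'input-1',
            'S3Input': {
                'S3Uri': 's3://sagemaker-ap-south-1-057036842446/sagemaker/Data/Training/Churn_Modelling.csv',
                'LocalPath': '/opt/ml/processing/input',   # SageMaker copies the S3 object here
                'S3DataType': 'S3Prefix',
                'S3InputMode': 'File',
            },
        }],
        ProcessingOutputConfig={
            'Outputs': [{
                'OutputName': 'output-1',
                'S3Output': {
                    'S3Uri': 's3://sagemaker-ap-south-1-057036842446/sagemaker/Outputs/',
                    'LocalPath': '/opt/ml/processing/output',  # uploaded to S3 at end of job
                    'S3UploadMode': 'EndOfJob',
                },
            }]
        },
    )
    return {'status': 'submitted'}
Note that SageMaker itself stages the S3 input into /opt/ml/processing/input and uploads /opt/ml/processing/output at the end of the job, so the Lambda function only needs to submit the job.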

Schedule batch predictions Vertex AI

I have created a forecasting model using AutoML on Vertex AI. I want to use this model to make batch predictions every week. Is there a way to schedule this?
The data to make those predictions is stored in a BigQuery table, which is updated every week.
There is no automatic scheduling directly in Vertex AutoML yet, but there are many different ways to set this up in GCP.
Two options to try first, using the client libraries available for BigQuery and Vertex:
use Cloud Scheduler for cron-based scheduling: https://cloud.google.com/scheduler/docs/quickstart
use either Cloud Functions or Cloud Run to set up a BigQuery event trigger, and then trigger the AutoML batch prediction (a sketch of the prediction call is shown below). Example to repurpose: https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events
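Inside the Cloud Function or Cloud Run handler, the batch prediction itself can be started with the Vertex client library. A minimal sketch, assuming the google-cloud-aiplatform SDK, with a hypothetical model resource name and BigQuery table URIs:
from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')

# Reference the trained AutoML forecasting model (placeholder resource name).
model = aiplatform.Model('projects/your-project-id/locations/us-central1/models/1234567890')

# Read instances from BigQuery and write predictions back to BigQuery.
model.batch_predict(
    job_display_name='weekly-forecast',
    bigquery_source='bq://your-project-id.your_dataset.input_table',
    bigquery_destination_prefix='bq://your-project-id.your_dataset',
    instances_format='bigquery',
    predictions_format='bigquery',
    sync=False,  # return immediately; the job runs asynchronously
)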
Not sure if you're using Vertex Pipelines to run the prediction job, but if you are, there's a method to schedule your pipeline execution, listed here:
from kfp.v2.google.client import AIPlatformClient  # noqa: F811

api_client = AIPlatformClient(project_id=PROJECT_ID, region=REGION)

# adjust time zone and cron schedule as necessary
response = api_client.create_schedule_from_job_spec(
    job_spec_path="intro_pipeline.json",
    schedule="2 * * * *",
    time_zone="America/Los_Angeles",  # change this as necessary
    parameter_values={"text": "Hello world!"},
    # pipeline_root=PIPELINE_ROOT  # this argument is necessary if you did not specify PIPELINE_ROOT as part of the pipeline definition.
)

Run ID in Kubeflow Pipelines on Vertex AI

I am trying to run Kubeflow Pipelines with the new Vertex AI on GCP.
Previously, in Kubeflow Pipelines, I was able to use the Run ID in my pipeline by utilizing dsl.RUN_ID_PLACEHOLDER or {{workflow.uid}}. My understanding was that dsl.RUN_ID_PLACEHOLDER would resolve to {{workflow.uid}} at compile time, and then at run time the {{workflow.uid}} tag would be resolved to the Run's ID. This is at least how it has worked in my experience using Kubeflow Pipelines and the Kubeflow Pipelines UI.
However, when I try to access the Run ID in a similar way in a pipeline that I run in Vertex AI Pipelines, it seems that dsl.RUN_ID_PLACEHOLDER resolves to {{workflow.uid}} but that this never subsequently resolves to the ID of the run.
I created the following test pipeline, which tries to get the Run ID using the DSL placeholder, then uses a lightweight component to print out the value of the run_id parameter of the pipeline. The result of running the pipeline in the UI is that the print_run_id component prints {{workflow.uid}}, whereas on Kubeflow Pipelines previously it would have resolved to the Run ID.
from kfp import dsl
from kfp import components as comp
import logging
from kfp.v2.dsl import (
    component,
    Input,
    Output,
    Dataset,
    Metrics,
)

@component
def print_run_id(run_id: str):
    print(run_id)

RUN_ID = dsl.RUN_ID_PLACEHOLDER

@dsl.pipeline(
    name='end-to-end-pipeline',
    description='End to end XGBoost cover type training pipeline'
)
def end_to_end_pipeline(
    run_id: str = RUN_ID
):
    print_task = print_run_id(run_id=run_id)
Is there a way to access the Run ID using the KFP SDK with Vertex AI Pipelines?
What works on Vertex AI is a different set of magic strings.
Specifically:
from kfp.v2 import dsl
id = dsl.PIPELINE_JOB_ID_PLACEHOLDER
name = dsl.PIPELINE_JOB_NAME_PLACEHOLDER
see also, https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/dsl/__init__.py
Just got this answer from our support at Google.
This isn't documented well. Trying to access them during the pipeline build or from within individual components directly will return the unmodified placeholder string values, but it does work during the actual pipeline run (at least for kfp v2).
Example for kfp v2 based on this link that works as expected.
import kfp.v2.dsl as dsl

@dsl.component
def print_op(msg: str, value: str):
    print(msg, value)  # <-- Prints the correct value

@dsl.component
def incorrect_print_op():
    print(dsl.PIPELINE_JOB_NAME_PLACEHOLDER)  # <-- Prints the unresolved placeholder value

@dsl.pipeline(name='pipeline-with-placeholders')
def my_pipeline():
    # Correct
    print_op(msg='job name:', value=dsl.PIPELINE_JOB_NAME_PLACEHOLDER)
    print_op(msg='job resource name:', value=dsl.PIPELINE_JOB_RESOURCE_NAME_PLACEHOLDER)
    print_op(msg='job id:', value=dsl.PIPELINE_JOB_ID_PLACEHOLDER)
    print_op(msg='task name:', value=dsl.PIPELINE_TASK_NAME_PLACEHOLDER)
    print_op(msg='task id:', value=dsl.PIPELINE_TASK_ID_PLACEHOLDER)

    # Incorrect
    print(dsl.PIPELINE_TASK_ID_PLACEHOLDER)  # Prints only the placeholder value

How to pass dynamic arguments to an Airflow operator?

I am using Airflow to run Spark jobs on Google Cloud Composer. I need to:
create a cluster (YAML parameters supplied by the user)
run a list of Spark jobs (job params also supplied by a per-job YAML)
With the Airflow API I can read YAML files and push variables across tasks using XCom.
But consider the DataprocClusterCreateOperator(): only
cluster_name
project_id
zone
and a few other arguments are marked as templated.
What if I want to pass in other arguments as templated (which currently are not so), like image_version,
num_workers, worker_machine_type, etc.?
Is there any workaround for this?
Not sure what you mean by 'dynamic', but when the YAML file is updated, as long as the file is read in the DAG file body, the DAG will be refreshed to pick up the new arguments from the YAML file. So actually, you don't need XCom to get the arguments.
Simply create a params dictionary and pass it to default_args:
import os
import yaml
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

CONFIGFILE = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), 'your_yaml_file')

with open(CONFIGFILE, 'r') as ymlfile:
    CFG = yaml.safe_load(ymlfile)

default_args = {
    'cluster_name': CFG['section_A']['cluster_name'],  # edit here according to the structure of your yaml file.
    'project_id': CFG['section_A']['project_id'],
    'zone': CFG['section_A']['zone'],
    'image_version': CFG['section_A']['image_version'],
    'num_workers': CFG['section_A']['num_workers'],
    'worker_machine_type': CFG['section_A']['worker_machine_type'],
    # you can add all needed params here.
}

dag = DAG(
    dag_id=DAG_NAME,
    schedule_interval=SCHEDULE_INTERVAL,
    default_args=default_args,  # pass the params to the DAG environment
)

task1 = DataprocClusterCreateOperator(
    task_id='your_task_id',
    dag=dag,
)
But if you want dynamic DAGs rather than dynamic arguments, you may need another strategy like this.
So you probably need to figure out the basic question first: at which level is the dynamism needed? Task level? DAG level?
Or you can create your own operator to do the job and take the parameters, as sketched below.
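For the custom-operator route, one common pattern is to subclass the operator and extend its template_fields so that additional constructor arguments are also rendered with Jinja. A sketch, assuming the Airflow 1.x contrib operator, where image_version, num_workers, and worker_machine_type are stored as instance attributes with those names (check the operator source for your Airflow version to confirm):
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

class TemplatedDataprocClusterCreateOperator(DataprocClusterCreateOperator):
    # Extend the templated fields so Jinja also renders these attributes.
    template_fields = list(DataprocClusterCreateOperator.template_fields) + [
        'image_version',
        'num_workers',
        'worker_machine_type',
    ]

# Usage (other required arguments such as project_id, zone, etc. omitted for brevity):
create_cluster = TemplatedDataprocClusterCreateOperator(
    task_id='create_cluster',
    cluster_name='cluster-{{ ds_nodash }}',
    image_version='{{ var.value.image_version }}',  # now rendered by Jinja
    num_workers=2,
    worker_machine_type='n1-standard-4',
    dag=dag,
)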

AWS Data Pipeline - Java SDK - How to put a pipeline definition from JSON file

I have an AWS Data Pipeline definition in JSON format.
Using the Java SDK I have created an empty pipeline, and now I would like to use my JSON to put the pipeline definition.
Basically I would like to create a PutPipelineDefinitionRequest (http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/datapipeline/model/PutPipelineDefinitionRequest.html) without creating the PipelineObjects one by one.
How can I do that? Is it possible?
Thanks!
Currently it is not possible to upload the JSON as the pipeline definition. You can, however, iterate over the JSON and create the array of pipeline objects; a sketch of that conversion is below.
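For illustration, here is the conversion in Python with boto3 (the Java SDK's PipelineObject and Field classes map onto the same structure). The file name and pipeline ID are placeholders, and list-valued fields and parameter objects are ignored for brevity:
import json
import boto3

def to_pipeline_objects(definition):
    # Convert the 'objects' array of an exported pipeline-definition JSON
    # into the pipelineObjects structure expected by PutPipelineDefinition.
    pipeline_objects = []
    for obj in definition['objects']:
        fields = []
        for key, value in obj.items():
            if key in ('id', 'name'):
                continue
            if isinstance(value, dict) and 'ref' in value:
                fields.append({'key': key, 'refValue': value['ref']})
            else:
                fields.append({'key': key, 'stringValue': str(value)})
        pipeline_objects.append({
            'id': obj['id'],
            'name': obj.get('name', obj['id']),
            'fields': fields,
        })
    return pipeline_objects

with open('pipeline_definition.json') as f:      # placeholder file name
    definition = json.load(f)

client = boto3.client('datapipeline')
client.put_pipeline_definition(
    pipelineId='df-0123456789ABCDEFGH',          # the empty pipeline you created
    pipelineObjects=to_pipeline_objects(definition),
)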