Vertex Pipeline: CustomPythonPackageTrainingJobRunOp not supplying WorkerPoolSpecs - google-cloud-ml

I am trying to run a custom package training pipeline using Kubeflow pipelines on Vertex AI. I have the training code packaged in Google Cloud Storage and my pipeline is:
import kfp
from kfp.v2 import compiler
from kfp.v2.dsl import component
from kfp.v2.google import experimental
from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip
@kfp.dsl.pipeline(name=pipeline_name, pipeline_root=pipeline_root_path)
def pipeline():
    training_job_run_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
        project=project_id,
        display_name=training_job_name,
        model_display_name=model_display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module=python_module,
        container_uri=container_uri,
        staging_bucket=staging_bucket,
        model_serving_container_image_uri=model_serving_container_image_uri)

    # Upload model
    model_upload_op = gcc_aip.ModelUploadOp(
        project=project_id,
        display_name=model_display_name,
        artifact_uri=output_dir,
        serving_container_image_uri=model_serving_container_image_uri,
    )
    model_upload_op.after(training_job_run_op)

    # Deploy model
    model_deploy_op = gcc_aip.ModelDeployOp(
        project=project_id,
        model=model_upload_op.outputs["model"],
        endpoint=aiplatform.Endpoint(
            endpoint_name='0000000000').resource_name,
        deployed_model_display_name=model_display_name,
        machine_type="n1-standard-2",
        traffic_percentage=100)

compiler.Compiler().compile(pipeline_func=pipeline,
                            package_path=pipeline_spec_path)
When I try to run this pipeline on Vertex AI I get the following error:
{
    "insertId": "qd9wxrfnoviyr",
    "jsonPayload": {
        "levelname": "ERROR",
        "message": "google.api_core.exceptions.InvalidArgument: 400 List of found errors:\t1.Field: job_spec.worker_pool_specs; Message: At least one worker pool should be specified.\t\n"
    }
}

My original CustomPythonPackageTrainingJobRunOp wasn't defining a worker_pool_spec, which was the reason for the error. After I specified replica_count and machine_type, the error was resolved. The final training op is:
training_job_run_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
    project=project_id,
    display_name=training_job_name,
    model_display_name=model_display_name,
    python_package_gcs_uri=python_package_gcs_uri,
    python_module=python_module,
    container_uri=container_uri,
    staging_bucket=staging_bucket,
    base_output_dir=output_dir,
    model_serving_container_image_uri=model_serving_container_image_uri,
    replica_count=1,
    machine_type="n1-standard-4")
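
For completeness, once the spec is compiled you can submit it to Vertex AI Pipelines with the aiplatform SDK. A minimal sketch, reusing the variables from above; the region and the pipeline display name are assumptions added for illustration:

from google.cloud import aiplatform

# Initialise the SDK (region is an assumption; adjust to your project)
aiplatform.init(project=project_id, location="us-central1", staging_bucket=staging_bucket)

# Create a pipeline job from the compiled spec and run it
job = aiplatform.PipelineJob(
    display_name="custom-package-training",   # illustrative name
    template_path=pipeline_spec_path,          # JSON produced by compiler.Compiler().compile(...)
    pipeline_root=pipeline_root_path,
)
job.run()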

Related

Not able to create a SageMaker endpoint with DataCaptureConfig enabled using the boto3 API

SageMaker version: 2.129.0
boto3 version: 1.26.57
Error details: ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: One or more endpoint features are not supported using this configuration.
Steps to replicate the above issue:
# Download model
!aws s3 cp s3://sagemaker-sample-files/models/xgb-churn/xgb-churn-prediction-model.tar.gz model/

# Step 1 – Create model
import boto3
import sagemaker  # needed for sagemaker.Session() below
from sagemaker.s3 import S3Uploader
import datetime
from sagemaker import get_execution_role

bucket = "sagemaker-us-east-x-xxxxxxxxxx"
prefix = "sagemaker/xgb"
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_session.region_name
sm_boto3 = boto3.client("sagemaker")

model_url = S3Uploader.upload(
    local_path="model/xgb-churn-prediction-model.tar.gz",
    desired_s3_uri=f"s3://{bucket}/{prefix}",
)

from sagemaker import image_uris
image_uri = image_uris.retrieve("xgboost", region, "0.90-1")
model_name = f"DEMO-xgb-churn-pred-model-{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"
resp = sm_boto3.create_model(
    ModelName=model_name,
    ExecutionRoleArn=get_execution_role(),
    Containers=[{"Image": image_uri, "ModelDataUrl": model_url}],
)
# Step 2 – Create endpoint config
epc_name = f"DEMO-xgb-churn-pred-epc-{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"
endpoint_config_response = sm_boto3.create_endpoint_config(
    EndpointConfigName=epc_name,
    ProductionVariants=[
        {
            'InstanceType': 'ml.m5.xlarge',
            'InitialInstanceCount': 1,
            'ModelName': model_name,
            'VariantName': 'production',
            'InitialVariantWeight': 1
        }
    ],
    DataCaptureConfig={
        'EnableCapture': True,
        'InitialSamplingPercentage': 50,
        'DestinationS3Uri': 's3://sagemaker-us-east-x-xxxxxxxxxx/sagemaker/xgb/',
        'CaptureOptions': [
            {
                'CaptureMode': 'InputAndOutput'
            },
        ],
        'CaptureContentTypeHeader': {
            'JsonContentTypes': [
                'application/json',
            ]
        }
    }
)
print('Endpoint configuration name: {}'.format(epc_name))
print('Endpoint configuration arn: {}'.format(endpoint_config_response['EndpointConfigArn']))

# Step 3 - Create endpoint
endpoint_name = f"DEMO-xgb-churn-pred-ep-{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"
endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': epc_name,
}
endpoint_response = sm_boto3.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=epc_name)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))
Expected behaviour: Should be able to create an endpoint with data capture enabled using the boto3 API.
I was able to replicate your setup and it works for me as expected. I am using higher versions of boto3 and SageMaker; consider upgrading both and giving it a try!
boto3 version - 1.26.62
sagemaker version - 2.131.0
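A quick way to do that from the same notebook is shown below; a sketch assuming pip-managed packages (restart the kernel after the upgrade):

# Upgrade both SDKs, then restart the notebook kernel
!pip install --upgrade boto3 sagemaker

# Verify the versions picked up after the restart
import boto3
import sagemaker
print(boto3.__version__)      # e.g. 1.26.62 or later
print(sagemaker.__version__)  # e.g. 2.131.0 or later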

Not being able to use 'from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator' in Airflow 2.0

I am learning Cloud Composer and Airflow on Google Cloud Platform. I am trying to do some transformations and load the result into another table. from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator gives me an error, and I have looked through the Airflow documentation and can't see whether it has been changed or removed. This is my code:
from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator

bq_to_bq = BigQueryOperator(
    task_id="bq_to_bq",
    sql="SELECT count(*) as count FROM `raw_bikesharing.stations`",
    destination_dataset_table='dwh_bikesharing.temporary_stations_count',
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    use_legacy_sql=False,
    priority='BATCH'
)
No name 'BigQueryOperator' in module 'airflow.providers.google.cloud.operators.bigquery'
There is another, up-to-date operator in Airflow to execute a query and create a job: BigQueryInsertJobOperator.
I think you used an operator that does not exist anymore:
import airflow
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

execute_query = BigQueryInsertJobOperator(
    task_id='execute_query_task_id',
    configuration={
        "query": {
            "query": "select …",
            "useLegacySql": False,
        }
    },
    location='EU'
)
You can check this example from my Github repository.
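For reference, the arguments of the original BigQueryOperator map onto the BigQuery query job configuration that BigQueryInsertJobOperator accepts. A sketch using the dataset and table names from the question; the project id placeholder is an assumption:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

bq_to_bq = BigQueryInsertJobOperator(
    task_id="bq_to_bq",
    configuration={
        "query": {
            "query": "SELECT count(*) as count FROM `raw_bikesharing.stations`",
            "useLegacySql": False,
            # destination_dataset_table, write_disposition, create_disposition and
            # priority become fields of the query job configuration
            "destinationTable": {
                "projectId": "<your-project-id>",  # placeholder
                "datasetId": "dwh_bikesharing",
                "tableId": "temporary_stations_count",
            },
            "writeDisposition": "WRITE_TRUNCATE",
            "createDisposition": "CREATE_IF_NEEDED",
            "priority": "BATCH",
        }
    },
)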

How to use SageMaker inside PySpark

I have a simple requirement: I need to run a SageMaker prediction inside a Spark job.
I am trying to run the below:
ENDPOINT_NAME = "MY-ENDPOINT_NAME"

from sagemaker_pyspark import SageMakerModel
from sagemaker_pyspark import EndpointCreationPolicy
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import ProtobufResponseRowDeserializer

attachedModel = SageMakerModel(
    existingEndpointName=ENDPOINT_NAME,
    endpointCreationPolicy=EndpointCreationPolicy.DO_NOT_CREATE,
    endpointInstanceType=None,  # Required
    endpointInitialInstanceCount=None,  # Required
    requestRowSerializer=ProtobufRequestRowSerializer(
        featuresColumnName="featureCol"
    ),  # Optional: already default value
    responseRowDeserializer=ProtobufResponseRowDeserializer(schema=output_schema),
)
transformedData2 = attachedModel.transform(df)
transformedData2.show()
I get the following error: TypeError: 'JavaPackage' object is not callable
This was solved by the following:
import sagemaker_pyspark
from pyspark import SparkConf, SparkContext

classpath = ":".join(sagemaker_pyspark.classpath_jars())
conf = SparkConf() \
    .set("spark.driver.extraClassPath", classpath)
sc = SparkContext(conf=conf)
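
If you build a SparkSession rather than a raw SparkContext, the same classpath fix can be applied through the builder. A minimal sketch; the executor classpath setting and the CSV path are assumptions added for illustration:

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Put the sagemaker_pyspark jars on the driver and executor classpaths
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.executor.extraClassPath", classpath)
    .getOrCreate()
)

# df can then be passed to attachedModel.transform(df) from the snippet above
df = spark.read.csv("s3://my-bucket/features.csv", header=True, inferSchema=True)  # hypothetical path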

Advice/Guidance - composer/beam/dataflow on gcp

I am trying to learn/try out cloud composer/beam/dataflow on gcp.
I have written functions to do some basic cleaning of data in python, and used a DAG in cloud composer to run this function to download a file from a bucket, process it, and upload it to a bucket at a set frequency.
It was all bespoke-written functionality. I am now trying to figure out how to use a Beam pipeline and Dataflow instead, and use Cloud Composer to kick off the Dataflow job.
The cleaning I am trying to do is to take a CSV input of col1,col2,col3,col4,col5 and combine the middle 3 columns to output a CSV of col1,combinedcol234,col5.
Questions I have are...
How do I pull in my own functions within a beam pipeline to do this merge?
Should I be pulling in my own functions, or does Beam have built-in ways of doing this?
How do I then trigger a pipeline from a dag?
Does anyone have any example code on GitHub?
I have been googling and trying to research but can't seem to find anything that helps me get my head around it enough.
Any help would be appreciated. Thank you.
You can use the DataflowCreatePythonJobOperator to run a Dataflow job written in Python.
You have to instantiate your Cloud Composer environment;
Add the dataflow job file in a bucket;
Add the input file to a bucket;
Add the following dag in the DAGs directory of the composer environment:
composer_dataflow_dag.py:
import datetime

import pytz
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"

tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')

default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}

with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # Override to match your needs
) as dag:
    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file='gs://<bucket name>/dataflow-job.py',
        options={
            "runner": "DataflowRunner",
            "project": project_id,
            "region": "us-central1",
            "temp_location": "gs://<bucket name>/",
            "staging_location": "gs://<bucket name>/",
        },
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
Here is the Dataflow job file, with the modification you want in order to concatenate some of the columns' data:
dataflow-job.py
import os
from datetime import datetime

import apache_beam as beam
import pytz
from apache_beam.options.pipeline_options import PipelineOptions

tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')

bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'

p = beam.Pipeline(options=PipelineOptions())

(p | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
   | beam.Map(lambda x: x.split(","))
   | beam.Map(lambda x: f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
   | beam.io.WriteToText(output))

p.run().wait_until_finish()
After running, the result will be stored in the GCS bucket.
A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
Many examples can be found in the repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/ . The easiest way to read CSV files is with the dataframes API, which returns an object you can manipulate as if it were a Pandas DataFrame, or that you can turn into a PCollection (where each column is an attribute of a named tuple) and process with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
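
To make that concrete, here is a minimal sketch of the col1,combinedcol234,col5 merge using the dataframes API; the file paths and the plain string concatenation are assumptions:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    # Read the CSV into a deferred dataframe (the header row provides col1..col5)
    df = p | read_csv('gs://<bucket>/input.csv')

    # Convert to a PCollection of named tuples and merge the middle columns
    (to_pcollection(df)
     | 'MergeCols' >> beam.Map(
         lambda row: f'{row.col1},{row.col2}{row.col3}{row.col4},{row.col5}')
     | 'Write' >> beam.io.WriteToText('gs://<bucket>/output'))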

Error when creating Google Dataflow template file

I'm trying to schedule a Dataflow job that ends after a set amount of time using a template. I'm able to successfully do this when using the command line, but when I try to do it with Google Cloud Scheduler I run into an error when I create my template.
The error is
File "pipelin_stream.py", line 37, in <module>
    main()
File "pipelin_stream.py", line 34, in main
    result.cancel()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1638, in cancel
    raise IOError('Failed to get the Dataflow job id.')
IOError: Failed to get the Dataflow job id.
The command I'm using to make the template is
python pipelin_stream.py \
--runner Dataflowrunner \
--project $PROJECT \
--temp_location $BUCKET/tmp \
--staging_location $BUCKET/staging \
--template_location $BUCKET/templates/time_template_test \
--streaming
And the pipeline file I have is this
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import sys

PROJECT = 'projectID'
schema = 'ex1:DATE, ex2:STRING'
TOPIC = "projects/topic-name/topics/scraping-test"

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args = parser.parse_known_args(argv)

    p = beam.Pipeline(options=PipelineOptions(region='us-central1', service_account_email='email'))
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
     | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('tablename'.format(PROJECT), schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()
    result.wait_until_finish(duration=3000)
    result.cancel()  # If the pipeline has not finished, you can cancel it

if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Does anyone have an idea why I might be getting this error?
The error is raised by the cancel function after the waiting time and it appears to be harmless.
To prove it, I managed to reproduce your exact issue from my virtual machine with Python 3.5. The template is created in the path given by --template_location and can be used to run jobs. Note that I needed to apply some changes to your code to get it to actually work in Dataflow.
In case it is of any use to you, I ended up using this pipeline code
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import datetime

# Fill these values in order to have them by default
# Note that the table in BQ needs to have the column names message_body and publish_time
Table = 'projectid:datasetid.tableid'
schema = 'ex1:STRING, ex2:TIMESTAMP'
TOPIC = "projects/<projectid>/topics/<topicname>"

class AddTimestamps(beam.DoFn):
    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        """Processes each incoming element by extracting the Pub/Sub
        message and its publish timestamp into a dictionary. `publish_time`
        defaults to the publish timestamp returned by the Pub/Sub server. It
        is bound to each element by Beam at runtime.
        """
        yield {
            "message_body": element.decode("utf-8"),
            "publish_time": datetime.datetime.utcfromtimestamp(
                float(publish_time)
            ).strftime("%Y-%m-%d %H:%M:%S.%f"),
        }

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic", default=TOPIC)
    parser.add_argument("--output_table", default=Table)
    args, beam_args = parser.parse_known_args(argv)

    # save_main_session needs to be set to true due to modules being used among the code (mostly datetime)
    # Uncomment the service account email to specify a custom service account
    p = beam.Pipeline(argv=beam_args, options=PipelineOptions(save_main_session=True,
                      region='us-central1'))  # , service_account_email='email'))
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=args.input_topic).with_output_types(bytes)
     | "Add timestamps to messages" >> beam.ParDo(AddTimestamps())
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(args.output_table, schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()
    # Warning: Cancel does not work properly in a template
    result.wait_until_finish(duration=3000)
    result.cancel()  # Cancel the streaming pipeline after a while to avoid consuming more resources

if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Afterwards I ran commands:
# Fill accordingly
PROJECT="MYPROJECT-ID"
BUCKET="MYBUCKET"
TEMPLATE_NAME="TRIAL"
# create the template
python3 -m templates.template-pubsub-bigquery \
--runner DataflowRunner \
--project $PROJECT \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--template_location gs://$BUCKET/templates/$TEMPLATE_NAME \
--streaming
to create the template (this yields the error you mentioned, but the template is still created).
And
# Fill job-name and gcs location accordingly
# Uncomment and fill the parameters should you want to use your own
gcloud dataflow jobs run <job-name> \
--gcs-location "gs://<MYBUCKET>/dataflow/templates/mytemplate"
# --parameters input_topic="", output_table=""
To run the pipeline.
As I said, the template was properly created and the pipeline ran as expected.
Edit
Indeed, the cancel function does not work properly in the template. The issue seems to be that it needs the job id at template creation time, which of course does not exist yet, so the call is effectively skipped.
I found this other post that handles extracting the job id from the pipeline. I tried some tweaks to make it work within the template code itself, but I think that is not necessary. Given that you want to schedule the execution, I would go for the easier option: execute the streaming pipeline template at a certain time (e.g. 9:01 GMT) and cancel the pipeline with the following script
import logging
import os
import re

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def retrieve_job_id():
    # Fill as needed
    project = '<project-id>'
    job_prefix = "<job-name>"
    location = '<location>'

    logging.info("Looking for jobs with prefix {} in region {}...".format(job_prefix, location))
    try:
        credentials = GoogleCredentials.get_application_default()
        dataflow = build('dataflow', 'v1b3', credentials=credentials)
        result = dataflow.projects().locations().jobs().list(
            projectId=project,
            location=location,
        ).execute()
        job_id = "none"
        for job in result['jobs']:
            if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
                job_id = job['id']
                break
        logging.info("Job ID: {}".format(job_id))
        return job_id
    except Exception as e:
        logging.info("Error retrieving Job ID")
        raise KeyError(e)

os.system('gcloud dataflow jobs cancel {}'.format(retrieve_job_id()))
at another time (e.g. 9:05 GMT). This script assumes you run the pipeline with the same job name each time; it takes the latest appearance of that name and cancels it. I tried it several times and it works fine.