I am trying to run a custom training job on Vertex AI. The goal is to train a model, save it to Cloud Storage, and then upload it to Vertex AI as a Model object. When I run the job from my local workstation it succeeds, but when I trigger it from Cloud Scheduler it fails. Details below.
Python Code for the job:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import os
import pickle
from google.cloud import storage
from google.cloud import aiplatform
print("FITTING THE MODEL")
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# fit the model
model.fit(X, y)
print("SAVING THE MODEL TO CLOUD STORAGE")
if 'AIP_MODEL_DIR' not in os.environ:
    raise KeyError(
        'The `AIP_MODEL_DIR` environment variable has not been ' +
        'set. See https://cloud.google.com/ai-platform-unified/docs/tutorials/image-recognition-custom/training'
    )
artifact_filename = 'model' + '.pkl'
# Save model artifact to local filesystem (doesn't persist)
local_path = artifact_filename
with open(local_path, 'wb') as model_file:
    pickle.dump(model, model_file)
# Upload model artifact to Cloud Storage
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, artifact_filename)
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
print ("UPLOADING MODEL TO VertexAI")
# Upload the model to vertex ai
project="..."
location="..."
display_name="custom_mdoel"
artifact_uri=model_directory
serving_container_image_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest"
description="test model"
sync=True
aiplatform.init(project=project, location=location)
model = aiplatform.Model.upload(
    display_name=display_name,
    artifact_uri=artifact_uri,
    serving_container_image_uri=serving_container_image_uri,
    description=description,
    sync=sync,
)
model.wait()
print("DONE")
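One portability detail in the script above: os.path.join builds the Cloud Storage path with the host's path separator, which is a backslash on Windows. A minimal sketch of a separator-safe alternative follows; the helper name join_gcs_uri is hypothetical and not part of any Google library.

```python
import posixpath


def join_gcs_uri(base_uri: str, *parts: str) -> str:
    """Join path components onto a gs:// URI using forward slashes only.

    Hypothetical helper for illustration; not part of google-cloud-storage.
    """
    if not base_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {base_uri!r}")
    # posixpath.join always uses "/", regardless of the host operating system.
    return posixpath.join(base_uri, *parts)
```

In the script this would replace the os.path.join call, for example storage_path = join_gcs_uri(model_directory, artifact_filename).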
Running from Local Workstation:
I set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the location of the Compute Engine default service account keys I have downloaded on my local workstation. I also set the AIP_MODEL_DIR environment variable to point to a cloud storage bucket. After I run the script, I can see the model.pkl file being created in the cloud storage bucket and the Model object being created in VertexAI.
Triggering the training job from Cloud Scheduler:
This is what I ultimately want to achieve: running the custom training job periodically from Cloud Scheduler. I have packaged the Python script above into a Docker image and pushed it to Google Artifact Registry. The Cloud Scheduler job specification is below; I can provide additional details if required. The service account email in the oauth_token is the same account whose keys I use for the GOOGLE_APPLICATION_CREDENTIALS environment variable. When I run this (either from my local workstation or directly in a Vertex AI notebook), the Cloud Scheduler job is created and keeps triggering the custom job. The custom job trains the model and saves it to Cloud Storage. However, it cannot upload the model to Vertex AI, and I get the error messages status = StatusCode.PERMISSION_DENIED and {..."grpc_message":"Request had insufficient authentication scopes.","grpc_status":7}. I cannot figure out what the authentication issue is, because in both cases I am using the same service account.
job = {
    "name": f'projects/{project_id}/locations/{location}/jobs/test_job',
    "description": "Test scheduler job",
    "http_target": {
        "uri": f'https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/customJobs',
        "http_method": "POST",
        "headers": {
            "User-Agent": "Google-Cloud-Scheduler",
            "Content-Type": "application/json; charset=utf-8"
        },
        "body": "...",  # the custom training job body
        "oauth_token": {
            "service_account_email": "...",
            "scope": "https://www.googleapis.com/auth/cloud-platform"
        }
    },
    "schedule": "* * * * *",
    "time_zone": "Africa/Abidjan"
}
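Since the body is elided above, here is a minimal sketch of how the job dict could be assembled with a JSON-encoded custom job as the HTTP body. The function name, the display name, the machine type, the image URI, and the service account addresses are all hypothetical placeholders. The serviceAccount field inside jobSpec is included only as an assumption about a setting that may influence which identity the training container runs as; it is not a confirmed fix for the error described.

```python
import json


def build_scheduler_job(project_id, location, service_account_email, custom_job):
    """Return a Cloud Scheduler job dict whose HTTP body is the JSON-encoded custom job."""
    parent = f"projects/{project_id}/locations/{location}"
    return {
        "name": f"{parent}/jobs/test_job",
        "description": "Test scheduler job",
        "http_target": {
            "uri": f"https://{location}-aiplatform.googleapis.com/v1/{parent}/customJobs",
            "http_method": "POST",
            "headers": {"Content-Type": "application/json; charset=utf-8"},
            # The body is sent as raw bytes, so serialize the dict explicitly.
            "body": json.dumps(custom_job).encode("utf-8"),
            "oauth_token": {
                "service_account_email": service_account_email,
                "scope": "https://www.googleapis.com/auth/cloud-platform",
            },
        },
        "schedule": "* * * * *",
        "time_zone": "Africa/Abidjan",
    }


# Hypothetical custom job spec; every value below is a placeholder.
example_custom_job = {
    "displayName": "hypothetical-training-job",
    "jobSpec": {
        # Assumption: identity the training container should run as.
        "serviceAccount": "trainer@my-project.iam.gserviceaccount.com",
        "workerPoolSpecs": [
            {
                "machineSpec": {"machineType": "n1-standard-4"},
                "replicaCount": 1,
                "containerSpec": {
                    "imageUri": "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest"
                },
            }
        ],
    },
}

example_job = build_scheduler_job(
    "my-project", "us-central1", "scheduler@my-project.iam.gserviceaccount.com", example_custom_job
)
```

Keeping the custom job as a dict until the last moment makes it easier to inspect or log exactly what Cloud Scheduler will POST.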
I am looking at this, which all makes sense. Let us focus on this bit of code:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    inputs=[
        ProcessingInput(source="s3://your-bucket/path/to/your/data", destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
preprocessing_job_description = sklearn_processor.jobs[-1].describe()
Here preprocessing.py obviously has to be in the cloud. I am curious: could one also put scripts into an S3 bucket and trigger the job remotely? I can easily do this with hyperparameter optimisation, which does not require dedicated scripts because I use an OOTB training image.
In this case I can fire off the job like so:
from time import gmtime, strftime

import boto3

tuning_job_name = "amazing-hpo-job-" + strftime("%d-%H-%M-%S", gmtime())
smclient = boto3.Session().client("sagemaker")
smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition
)
and then monitor the job's progress:
from pprint import pprint

smclient = boto3.Session().client("sagemaker")
tuning_job_result = smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)
status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")
job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)
objective = tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]
is_minimize = objective["Type"] != "Maximize"
objective_name = objective["MetricName"]
if tuning_job_result.get("BestTrainingJob", None):
    print("Best model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")
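The one-off status check above can be turned into a polling loop. The following is a minimal sketch, not a SageMaker API: wait_for_terminal_status is a hypothetical helper that takes any zero-argument describe callable, so the same loop could wrap a tuning-job or a processing-job describe call.

```python
import time


def wait_for_terminal_status(describe, status_key,
                             terminal=("Completed", "Failed", "Stopped"),
                             poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call describe() until response[status_key] is terminal; return that response.

    Hypothetical helper: `describe` is any zero-argument callable returning a dict.
    """
    for _ in range(max_polls):
        response = describe()
        if response[status_key] in terminal:
            return response
        # Not finished yet: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError(f"no terminal status after {max_polls} polls")
```

With the client above, a call might look like wait_for_terminal_status(lambda: smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name), "HyperParameterTuningJobStatus"). Injecting sleep keeps the helper easy to test without waiting.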
I would think that starting and monitoring a SageMaker processing job from a local machine should be possible, as with an HPO job, but what about the script(s)? Ideally I would like to develop and test them locally and then run them remotely. I hope this makes sense.
I'm not sure I understand the comparison to a Tuning Job.
Based on what you have described, in this case the preprocessing.py is actually stored locally. The SageMaker SDK will upload it to S3 for the remote Processing Job to access it. I suggest launching the Job and then taking a look at the inputs in the SageMaker Console.
If you want to test the Processing Job locally, you can do so using Local Mode. This imitates the Job on your machine, which helps with debugging the script before kicking off a remote Processing Job. Note that Docker is required to use Local Mode.
Example code for local mode:
from sagemaker.local import LocalSession
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

# For local training a dummy role will be sufficient
role = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'

processor = ScriptProcessor(command=['python3'],
                            image_uri='sagemaker-scikit-learn-processing-local',
                            role=role,
                            instance_count=1,
                            instance_type='local')

processor.run(code='processing_script.py',
              inputs=[ProcessingInput(
                  source='./input_data/',
                  destination='/opt/ml/processing/input_data/')],
              outputs=[ProcessingOutput(
                  output_name='word_count_data',
                  source='/opt/ml/processing/processed_data/')],
              arguments=['job-type', 'word-count']
              )

preprocessing_job_description = processor.jobs[-1].describe()
output_config = preprocessing_job_description['ProcessingOutputConfig']

print(output_config)

for output in output_config['Outputs']:
    if output['OutputName'] == 'word_count_data':
        word_count_data_file = output['S3Output']['S3Uri']

print('Output file is located on: {}'.format(word_count_data_file))
I am trying to learn and try out Cloud Composer, Beam, and Dataflow on GCP.
I have written Python functions to do some basic data cleaning, and used a DAG in Cloud Composer to run them: download a file from a bucket, process it, and upload it to a bucket at a set frequency.
It was all bespoke functionality. I am now trying to figure out how to use a Beam pipeline and Dataflow instead, with Cloud Composer kicking off the Dataflow job.
The cleaning I am trying to do is to take a CSV input of col1,col2,col3,col4,col5 and combine the middle three columns to output a CSV of col1,combinedcol234,col5.
My questions are:
How do I pull in my own functions within a Beam pipeline to do this merge?
Should I be pulling in my own functions, or does Beam have built-in ways of doing this?
How do I then trigger a pipeline from a DAG?
Does anyone have any example code on GitHub?
I have been googling and researching but can't seem to find anything that helps me get my head around it.
Any help would be appreciated. Thank you.
You can use the DataflowCreatePythonJobOperator to run a Python Dataflow job from Cloud Composer.
You have to instantiate your Cloud Composer environment;
Add the Dataflow job file to a bucket;
Add the input file to a bucket;
Add the following DAG to the DAGs directory of the Composer environment:
composer_dataflow_dag.py:
import datetime

import pytz
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"

tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')

default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}

with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # Override to match your needs
) as dag:
    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file='gs://<bucket name>/dataflow-job.py',
        options={"runner": "DataflowRunner", "project": project_id, "region": "us-central1", "temp_location": "gs://<bucket name>/", "staging_location": "gs://<bucket name>/"},
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
Here is the Dataflow job file, modified to concatenate the column data as you want:
dataflow-job.py:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import os
from datetime import datetime
import pytz

tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')
bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'

p = beam.Pipeline(options=PipelineOptions())
( p | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
    | beam.Map(lambda x:x.split(","))
    | beam.Map(lambda x:f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
    | beam.io.WriteToText(output) )
p.run().wait_until_finish()
After the job runs, the result is stored in the GCS bucket.
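On the question of pulling in your own functions: any plain Python function can be passed to beam.Map in place of a lambda. The sketch below is one possible version of the merge, using the standard csv module so that quoted fields containing commas survive, which a bare split(",") would break. The name merge_middle_columns and the five-column check are assumptions for illustration.

```python
import csv
import io


def merge_middle_columns(line: str) -> str:
    """Turn 'col1,col2,col3,col4,col5' into 'col1,<col2+col3+col4>,col5'."""
    # Parse one CSV line, honouring quotes around fields.
    row = next(csv.reader([line]))
    if len(row) != 5:
        raise ValueError(f"expected 5 columns, got {len(row)}: {line!r}")
    merged = [row[0], "".join(row[1:4]), row[4]]
    # Write the row back out as CSV without a trailing newline.
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(merged)
    return buffer.getvalue()
```

In the pipeline, the two lambda steps could then be replaced by a single beam.Map(merge_middle_columns) step, and the function can be unit tested without Beam installed.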
A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
'''
import apache_beam as beam

def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
'''
Many examples can be found in the repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/ . The easiest way to read CSV files is with the dataframes API, which returns an object you can manipulate as if it were a Pandas DataFrame, or which you can turn into a PCollection (where each column is an attribute of a named tuple) and process with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
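To make the shape of func concrete, here is a minimal sketch that runs the same row logic on plain named tuples, with no Beam dependency. The Row type, the body of func, and merge_row are all hypothetical stand-ins; in a real pipeline the rows would come from the PCollection.

```python
from collections import namedtuple

# Hypothetical schema mirroring the five CSV columns from the question.
Row = namedtuple("Row", ["col1", "col2", "col3", "col4", "col5"])


def func(col2, col3, col4, separator=""):
    """Combine the three middle values into one string."""
    return separator.join(str(value) for value in (col2, col3, col4))


def merge_row(row):
    """Same logic as the lambda above, written as a named function."""
    return (row.col1, func(row.col2, row.col3, row.col4), row.col5)


rows = [Row("a", "b", "c", "d", "e"), Row(1, 2, 3, 4, 5)]
merged = [merge_row(row) for row in rows]
```

Because merge_row only relies on attribute access, it should work the same way when passed to beam.Map over named-tuple rows.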
When I initiate a batch prediction job on Vertex AI in Google Cloud, I have to specify a Cloud Storage bucket location. Suppose I provide the bucket location 'my_bucket/prediction/'; the prediction files are then stored in something like gs://my_bucket/prediction/prediction-test_model-2022_01_17T01_46_39_898Z, which is a subdirectory within the bucket location I provided. The prediction files are stored within that subdirectory and are named:
prediction.results-00000-of-00002
prediction.results-00001-of-00002
Is there any way to programmatically get the final export location from the batch prediction name, id or any other parameter as shown below in the details of the batch prediction job?
Not with those parameters alone: because you can run the same job multiple times, a new folder based on the execution date is created each time. But you can get it from the API using your job ID (don't forget to set the credentials via GOOGLE_APPLICATION_CREDENTIALS if you are not running in Cloud SDK):
Get the output directory from the Vertex AI batch prediction API using the job ID:
curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) "https://us-central1-aiplatform.googleapis.com/v1/projects/[PROJECT_NAME]/locations/us-central1/batchPredictionJobs/[JOB_ID]"
Output (get the value from gcsOutputDirectory):
{
  ...
  "gcsOutputDirectory": "gs://my_bucket/prediction/prediction-test_model-2022_01_17T01_46_39_898Z"
  ...
}
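If you capture the curl output in a script, the field can be pulled out with the standard json module. This is a minimal sketch under one assumption: the key may appear either directly, as in the abbreviated output above, or nested under an outputInfo object in the full response. The function name is hypothetical.

```python
import json


def extract_gcs_output_directory(response_text: str) -> str:
    """Return gcsOutputDirectory from a batchPredictionJobs GET response body."""
    job = json.loads(response_text)
    # Assumption: look at the top level first, then under "outputInfo".
    for container in (job, job.get("outputInfo", {})):
        value = container.get("gcsOutputDirectory")
        if value:
            return value
    raise KeyError("gcsOutputDirectory not found; the job may not have finished yet")
```

The KeyError branch covers responses fetched before the job has produced any output.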
EDIT: Getting batchPredictionJobs via Python API:
from google.cloud import aiplatform

#-------
def get_batch_prediction_job_sample(
    project: str,
    batch_prediction_job_id: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    client_options = {"api_endpoint": api_endpoint}
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    name = client.batch_prediction_job_path(
        project=project, location=location, batch_prediction_job=batch_prediction_job_id
    )
    response = client.get_batch_prediction_job(name=name)
    print("response:", response)
#-------

get_batch_prediction_job_sample("[PROJECT_NAME]","[JOB_ID]","us-central1","us-central1-aiplatform.googleapis.com")
Check details about it here
Check the API repository here
Just adding a cherry on top of @ewertonvsilva's answer...
If you are following Google's example on programmatically getting the batch prediction job,
the response object from response = client.get_batch_prediction_job(name=name) has the output_info attribute that you need. All you need to do is read response.output_info.gcs_output_directory once the prediction job is complete.
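Once you have the directory, you will usually want the bucket name and prefix separately in order to list the result shards. A minimal sketch follows; split_gcs_uri and output_location are hypothetical helpers, and fake_job is a stand-in object used only so the example runs without calling the API.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def output_location(job):
    """Return (bucket, prefix) for a job object exposing output_info.gcs_output_directory."""
    directory = job.output_info.gcs_output_directory
    if not directory:
        raise RuntimeError("no output directory yet; is the job complete?")
    return split_gcs_uri(directory)


# Stand-in for the real API response, for illustration only.
fake_job = SimpleNamespace(
    output_info=SimpleNamespace(
        gcs_output_directory="gs://my_bucket/prediction/prediction-test_model-2022_01_17T01_46_39_898Z"
    )
)
bucket_name, prefix = output_location(fake_job)
```

The resulting pair could then be passed to something like storage.Client().list_blobs(bucket_name, prefix=prefix) to enumerate the prediction.results-* files.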
I have to create a program that gets information on a daily basis about installations of a group of apps on the App Store and the Play Store.
For the Play Store, using Google Cloud Storage, I followed the instructions on this page, using the client library with the Service Account method and the Python code example:
https://support.google.com/googleplay/android-developer/answer/6135870?hl=en&ref_topic=7071935
I slightly changed the given code to make it work, since the documentation does not look up to date. I managed to connect to the API, and the connection seems to work correctly.
My problem is that I don't understand what object I get and how to use it. It's not a report; it just looks like file properties in a dict.
This is my code (private data hidden):
import json
from httplib2 import Http
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient.discovery import build
client_email = '************.iam.gserviceaccount.com'
json_file = 'PATH/TO/MY/JSON/FILE'
cloud_storage_bucket = 'pubsite_prod_rev_**********'
report_to_download = 'stats/installs/installs_****************_202005_app_version.csv'
private_key = json.loads(open(json_file).read())['private_key']
credentials = ServiceAccountCredentials.from_json_keyfile_name(json_file, scopes='https://www.googleapis.com/auth/devstorage.read_only')
storage = build('storage', 'v1', http=credentials.authorize(Http()))
supposed_to_be_report = storage.objects().get(bucket=cloud_storage_bucket, object=report_to_download).execute()
When I print supposed_to_be_report, which is a dictionary, I only get what I understand to be metadata about the report, like this:
{'kind': 'storage#object', 'id': 'pubsite_prod_rev_***********/stats/installs/installs_****************_202005_app_version.csv/1591077412052716',
'selfLink': 'https://www.googleapis.com/storage/v1/b/pubsite_prod_rev_***********/o/stats%2Finstalls%2Finstalls_*************_202005_app_version.csv',
'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/pubsite_prod_rev_***********/o/stats%2Finstalls%2Finstalls_****************_202005_app_version.csv?generation=1591077412052716&alt=media',
'name': 'stats/installs/installs_***********_202005_app_version.csv',
'bucket': 'pubsite_prod_rev_***********',
'generation': '1591077412052716',
'metageneration': '1',
'contentType': 'text/csv; charset=utf-16le', 'storageClass': 'STANDARD', 'size': '378', 'md5Hash': '*****==', 'contentEncoding': 'gzip'......
I am not sure I'm using it correctly. Could you please explain where I am going wrong and/or how to get the install reports correctly?
Thanks.
I can see that you are using the googleapiclient.discovery client. This is not an issue, but the recommended way to access Google Cloud APIs programmatically is by using the client libraries.
Second, you are only retrieving the object's metadata. You can download the object to access the file contents; this is a sample using the client library.
from google.cloud import storage
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
Sample taken from official docs.
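The metadata in the question reports charset=utf-16le, so the downloaded file will likely not read cleanly as UTF-8. Below is a minimal sketch of decoding the report bytes into rows with the standard library. It assumes the bytes are already decompressed (whether gzip decoding is still needed depends on how the object was downloaded), and the column names in the usage example are made up.

```python
import csv
import io


def parse_installs_report(raw: bytes):
    """Decode UTF-16LE CSV bytes into a list of dicts keyed by the header row."""
    # Decode as little-endian UTF-16 and drop a leading byte-order mark if present.
    text = raw.decode("utf-16-le").lstrip("\ufeff")
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample; real reports have their own column names.
sample = ("\ufeff" + "date,installs\n2020-05-01,3\n").encode("utf-16-le")
rows = parse_installs_report(sample)
```

With the download function above, the bytes could come from open(destination_file_name, "rb").read().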
I've created a script where I define my TensorFlow Estimator, then pass it to the AWS SageMaker SDK and run fit(). The training passes (though the console doesn't show anything related to training), and in S3 the only output is /source/sourcedir.tar.gz. I believe there should also be at least /model/model.tar.gz, which for some reason is not generated, and I'm not getting any errors.
sagemaker_session = sagemaker.Session()
role = get_execution_role()
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/NamingConventions')
NamingConventions_estimator = TensorFlow(entry_point='NamingConventions.py',
                                         role=role,
                                         framework_version='1.12.0',
                                         train_instance_count=1,
                                         train_instance_type='ml.m5.xlarge',
                                         py_version='py3',
                                         model_dir="s3://sagemaker-eu-west-2-218566301064/model")
NamingConventions_estimator.fit(inputs, run_tensorboard_locally=True)
and my model_fn from 'NamingConventions.py':
def model_fn(features, labels, mode, params):
    net = keras.layers.Embedding(alphabetLen + 1, 8, input_length=maxFeatureLen)(features[INPUT_TENSOR_NAME])
    net = keras.layers.LSTM(12)(net)
    logits = keras.layers.Dense(len(conventions), activation=tf.nn.softmax)(net) #output
    predictions = tf.reshape(logits, [-1])
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={"ages": predictions},
            export_outputs={SIGNATURE_NAME: PredictOutput({"ages": predictions})})
    loss = keras.losses.sparse_categorical_crossentropy(labels, predictions)
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=tf.contrib.framework.get_global_step(),
        learning_rate=params["learning_rate"],
        optimizer="AdamOptimizer")
    predictions_dict = {"ages": predictions}
    eval_metric_ops = {
        "rmse": tf.metrics.root_mean_squared_error(
            tf.cast(labels, tf.float32), predictions)
    }
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=eval_metric_ops)
I still can't get it running. I'm trying to use script mode, and it seems like I can't import my model from the same directory.
Currently my script:
import argparse
import os

if __name__ =='__main__':
    parser = argparse.ArgumentParser()
    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.1)
    # input data and model directories
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    args, _ = parser.parse_known_args()

import tensorflow as tf
from NC_model import model_fn, train_input_fn, eval_input_fn

def train(args):
    print(args)
    estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=args.model_dir)
    train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=1000)
    eval_spec = tf.estimator.EvalSpec(eval_input_fn)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

if __name__ == '__main__':
    train(args)
Is the training job listed as successful in the AWS console? Did you check the training log in Amazon CloudWatch?
I think you need to set your estimator model_dir to the path in the environment variable SM_MODEL_DIR.
This is a bit contrary to the docs, which are not clear on this point. I suspect the --model_dir arg is used for distributed training and not for saving the final artifact.
Note that you'll get all your checkpoints and summaries there too, so it is probably best to use --model_dir in your estimator and copy your model export to SM_MODEL_DIR when training has finished.
Script mode gives you the freedom to write TensorFlow scripts the way you want, but the cost is that you have to do almost everything yourself. For example, in your case, if you want model.tar.gz in S3, you have to export the model locally first. SageMaker will then upload your local model to S3 automatically.
So what you need to add to your script is:
You need to add an exporter and pass it to eval_spec.
You need to call export_savedmodel to save the model to the local model dir that SageMaker can read. The local model dir is given by the env variable SM_MODEL_DIR, and should be '/opt/ml/model'.
To do the above, I guess you also need to implement your serving_input_fn.
SageMaker will then upload your model from the local model dir to the S3 model dir you specify, and you can see it in S3 after the job succeeds.
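If the export ends up somewhere other than the local model dir, the copy step mentioned earlier in the thread can be sketched with the standard library alone. The helper name is hypothetical, and the fallback to '/opt/ml/model' follows the path quoted above; the sketch assumes the target entries do not already exist.

```python
import os
import shutil


def copy_export_to_model_dir(export_dir, model_dir=None):
    """Copy the contents of export_dir into SM_MODEL_DIR and return that path."""
    # Fall back to the conventional path if the variable is not set.
    model_dir = model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    os.makedirs(model_dir, exist_ok=True)
    for name in os.listdir(export_dir):
        source = os.path.join(export_dir, name)
        target = os.path.join(model_dir, name)
        if os.path.isdir(source):
            # copytree creates the target directory; it must not already exist.
            shutil.copytree(source, target)
        else:
            shutil.copy2(source, target)
    return model_dir
```

Calling this at the end of train(), after the export has been written, should leave the SavedModel files where SageMaker looks for them when it packages model.tar.gz.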