How can I run a Cloud ML preprocessing on the cloud? - google-cloud-ml

Pre-processing can be done locally and on the cloud. I know how to run it locally.
How do I run it on the cloud?

If you are using Datalab, simply type %ml preprocess --cloud into a cell, and the generated template will have cloud hooks.
If you want to change existing code, then replace DirectPipelineRunner by DataflowPipelineRunner. You will also need to specify a few “command-line” arguments.
Here’s an example:
RUNNER = 'DataflowPipelineRunner'
OUTPUT_DIR = 'gs://{0}/preprocessed_output/'.format(BUCKET)
options = {
'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging')
'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
'job_name': 'preprocess' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
'project': PROJECT,
'extra_packages': [ml.sdk_location],
'teardown_policy': 'TEARDOWN_ALWAYS',
'no_save_main_session': True
}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline(RUNNER, options=opts)

Related

SageMaker PyTorch estimator.fit freezes when running in local mode from EC2

I am trying to train a PyTorch model through SageMaker. I am running a script main.py (which I have posted a minimum working example of below) which calls a PyTorch Estimator. I have the code for training my model saved as a separate script, train.py, which is called by the entry_point parameter of the Estimator. These scripts are hosted on a EC2 instance in the same AWS region as my SageMaker domain.
When I try running this with instance_type = "ml.m5.4xlarge", it works ok. However, I am unable to debug any problems in train.py. Any bugs in that file simply give me the error: 'AlgorithmError: ExecuteUserScriptError', and will not allow me to set breakpoint() lines in train.py (encountering a breakpoint throws the above error).
Instead I am trying to run in local mode, which I believe does allow for breakpoints. However, when I reach estimator.fit(inputs), it hangs on that line indefinitely, giving no output. Any print statements that I put at the start of the main function in train.py are not reached. This is true no matter what code I put in train.py. It also did not throw an error when I had an illegal underscore in the base_job_name parameter of the estimator, which suggests that it does not even create the estimator instance.
Below is a minimum example which replicates the issue on my instance. Any help would be appreciated.
### File structure
main.py
customcode/
|
|_ train.py
### main.py
import sagemaker
from sagemaker.pytorch import PyTorch
import boto3
try:
# When running on Studio.
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
except ValueError:
# When running from EC2 or local machine.
print('Performing manual setup.')
bucket = 'MY-BUCKET-NAME'
region = 'us-east-2'
role = 'arn:aws:iam::MY-ACCOUNT-NUMBER:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXXXX'
iam = boto3.client("iam")
sagemaker_client = boto3.client("sagemaker")
boto3.setup_default_session(region_name=region, profile_name="default")
sess = sagemaker.Session(sagemaker_client=sagemaker_client, default_bucket=bucket)
hyperparameters = {'epochs': 10}
inputs = {'data': f's3://{bucket}/features'}
train_instance_type = 'local'
hosted_estimator = PyTorch(
source_dir='customcode',
entry_point='train.py',
instance_type=train_instance_type,
instance_count=1,
hyperparameters=hyperparameters,
role=role,
base_job_name='mwe-train',
framework_version='1.12',
py_version='py38',
input_mode='FastFile',
)
hosted_estimator.fit(inputs) # This is the line that freezes
### train.py
def main():
breakpoint() # Throws an error in non-local mode.
return
if __name__ == '__main__':
print('Reached') # Never reached in local mode.
main()

Pass output from one workflow step to another in GCP

I am trying to orchestrate a GCP workflow to first run a query in Big Query to get some metadata (name & id) that would then be passed to another step in the workflow that starts a dataflow job given those parameters as input.
So step by step I want something like:
Result = Query("SELECT ID & name from biq query table")
Start dataflow job: Input(result)
Is this possible or is there a better solution?
I propose you 2 solutions and I hope it can help.
- Solution 1 :
If you have an orchestrator like Airflow in Cloud Composer :
Use task with a BigQueryInsertJobOperator in Airflow, this operator allows to execute a query to Bigquery
Pass the result to a second Operator via xcom
2 second operator is an operator that extends BeamRunPythonPipelineOperator
When you extend BeamRunPythonPipelineOperator, you override the execute method. In this method, you can recover the data from previous operator via xcom pull as Dict
Pass this Dict as pipeline options to your operator that extends BeamRunPythonPipelineOperator
The BeamRunPythonPipelineOperator will launch your Dataflow job
An example of operator with execute method :
class CustomBeamOperator(BeamRunPythonPipelineOperator):
def __init__(
self,
your_field
...
**kwargs) -> None:
super().__init__(**kwargs)
self.your_field = your_field
...
def execute(self, context):
task_instance = context['task_instance']
your_conf_from_bq = task_instance.xcom_pull('task_id_previous_operator')
operator = BeamRunPythonPipelineOperator(
runner='DataflowRunner',
py_file='your_dataflow_main_file.py',
task_id='launch_dataflow_job',
pipeline_options=your_conf_from_bq,
py_system_site_packages=False,
py_interpreter='python3',
dataflow_config=DataflowConfiguration(
location='your_region'
)
)
operator.execute(context)
- Solution 2 :
If you don't have an orchestrator like Airflow
You can use the same virtual env that launch your actual Dataflow job but add Python Bigquery client as package : https://cloud.google.com/bigquery/docs/reference/libraries
Create a main Python file that retrieves your conf from Bigquery table as Dict via Bigquery client
Generate with Python the command line to launch your Dataflow job with the previous conf retrieved from database, example with Python :
python -m folder.your_main_file \
--runner=DataflowRunner \
--conf1=conf1/ \
--conf2=conf2
....
--setup_file=./your_setup.py \
Launch the previous Python command with Python suprocess
You can also maybe try this api to launch Dataflow job : https://pypi.org/project/google-cloud-dataflow-client/
I didn't tried it.
I think the solution with Airflow is easier.

start, monitor and define script of SageMaker processing job from local machine

I am looking at this, which makes all sense. Let us focus on this bit of code:
from sagemaker.processing import ProcessingInput, ProcessingOutput
sklearn_processor.run(
code="preprocessing.py",
inputs=[
ProcessingInput(source="s3://your-bucket/path/to/your/data", destination="/opt/ml/processing/input"),
],
outputs=[
ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
],
arguments=["--train-test-split-ratio", "0.2"],
)
preprocessing_job_description = sklearn_processor.jobs[-1].describe()
Here preprocessing.py has to be obviously in the cloud. I am curious, could one also put scripts into an S3 bucket and trigger the job remotely. I can easily to this with hyper parameter optimisation, which does not require dedicated scripts though as I use an OOTB training image.
In this case I can fire off the job like so:
tuning_job_name = "amazing-hpo-job-" + strftime("%d-%H-%M-%S", gmtime())
smclient = boto3.Session().client("sagemaker")
smclient.create_hyper_parameter_tuning_job(
HyperParameterTuningJobName=tuning_job_name,
HyperParameterTuningJobConfig=tuning_job_config,
TrainingJobDefinition=training_job_definition
)
and then monitor the job's progress:
smclient = boto3.Session().client("sagemaker")
tuning_job_result = smclient.describe_hyper_parameter_tuning_job(
HyperParameterTuningJobName=tuning_job_name
)
status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
print("Reminder: the tuning job has not been completed.")
job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)
objective = tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]
is_minimize = objective["Type"] != "Maximize"
objective_name = objective["MetricName"]
if tuning_job_result.get("BestTrainingJob", None):
print("Best model found so far:")
pprint(tuning_job_result["BestTrainingJob"])
else:
print("No training jobs have reported results yet.")
I would think starting and monitoring a SageMaker processing job from a local machine should be possible as with an HPO job but what about the script(s)? Ideally I would like to develop and test them locally and the run remotely. Hope this makes sense?
Im not sure I understand the comparison to a Tuning Job.
Based on what you have described, in this case the preprocessing.py is actually stored locally. The SageMaker SDK will upload it to S3 for the remote Processing Job to access it. I suggest launching the Job and then taking a look at the inputs in the SageMaker Console.
If you wanted to test the Processing Job locally you can do so using Local Mode. This will basically imitate the Job locally which aids in debugging the script before kicking off a remote Processing Job. Kindly note docker is required to make use of Local Mode.
Example code for local mode:
from sagemaker.local import LocalSession
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}
# For local training a dummy role will be sufficient
role = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'
processor = ScriptProcessor(command=['python3'],
image_uri='sagemaker-scikit-learn-processing-local',
role=role,
instance_count=1,
instance_type='local')
processor.run(code='processing_script.py',
inputs=[ProcessingInput(
source='./input_data/',
destination='/opt/ml/processing/input_data/')],
outputs=[ProcessingOutput(
output_name='word_count_data',
source='/opt/ml/processing/processed_data/')],
arguments=['job-type', 'word-count']
)
preprocessing_job_description = processor.jobs[-1].describe()
output_config = preprocessing_job_description['ProcessingOutputConfig']
print(output_config)
for output in output_config['Outputs']:
if output['OutputName'] == 'word_count_data':
word_count_data_file = output['S3Output']['S3Uri']
print('Output file is located on: {}'.format(word_count_data_file))

Authenticating model upload to VertexAI job from Cloud Scheduler

I am trying to run a custom training job on VertexAI. The goal is to train a model, save the model to cloud storage and then upload it to VertexAI as a VertexAI Model object. When I run the job from local workstation, it runs, but when I run the job from Cloud Scheduler it fails. Details below.
Python Code for the job:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import os
import pickle
from google.cloud import storage
from google.cloud import aiplatform
print("FITTING THE MODEL")
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# fit the model
model.fit(X, y)
print("SAVING THE MODEL TO CLOUD STORAGE")
if 'AIP_MODEL_DIR' not in os.environ:
raise KeyError(
'The `AIP_MODEL_DIR` environment variable has not been' +
'set. See https://cloud.google.com/ai-platform-unified/docs/tutorials/image-recognition-custom/training'
)
artifact_filename = 'model' + '.pkl'
# Save model artifact to local filesystem (doesn't persist)
local_path = artifact_filename
with open(local_path, 'wb') as model_file:
pickle.dump(model, model_file)
# Upload model artifact to Cloud Storage
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, artifact_filename)
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
print ("UPLOADING MODEL TO VertexAI")
# Upload the model to vertex ai
project="..."
location="..."
display_name="custom_mdoel"
artifact_uri=model_directory
serving_container_image_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest"
description="test model"
sync=True
aiplatform.init(project=project, location=location)
model = aiplatform.Model.upload(
display_name=display_name,
artifact_uri=artifact_uri,
serving_container_image_uri=serving_container_image_uri,
description=description,
sync=sync,
)
model.wait()
print("DONE")
Running from Local Workstation:
I set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the location of the Compute Engine default service account keys I have downloaded on my local workstation. I also set the AIP_MODEL_DIR environment variable to point to a cloud storage bucket. After I run the script, I can see the model.pkl file being created in the cloud storage bucket and the Model object being created in VertexAI.
Triggering the training job from Cloud Scheduler:
This is what I ultimately want to achieve - to run the custom training job periodically from Cloud Scheduler. I have converted the python script above into a docker image and uploaded to google artifact registry. The job specification for the Cloud Scheduler is below, I can provide additional details if required. The service account email in the oauth_token is the same whose keys I use to set the GOOGLE_APPLICATION_CREDENTIALS environment variable. When I run this, (either from local workstation or directly in a VertexAI notebook), I can see that the Cloud Scheduler job gets created which keeps triggering the custom job. The custom job is able to train the model and save it to the cloud storage. However, it is not able to upload it to VertexAI and I get the error meessages, status = StatusCode.PERMISSION_DENIED and {..."grpc_message":"Request had insufficient authentication scopes.","grpc_status":7}. Cannot figure out what the authentication issue is because in both cases I am using the same service account.
job = {
"name": f'projects/{project_id}/locations/{location}/jobs/test_job',
"description": "Test scheduler job",
"http_target": {
"uri": f'https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/customJobs',
"http_method": "POST",
"headers": {
"User-Agent": "Google-Cloud-Scheduler",
"Content-Type": "application/json; charset=utf-8"
},
"body": "..." // the custom training job body,
"oauth_token": {
"service_account_email": "...",
"scope": "https://www.googleapis.com/auth/cloud-platform"
}
},
"schedule": "* * * * *",
"time_zone": "Africa/Abidjan"
}

Script mode py3 and lack of output in s3 after successful training

I've created a script where I define my Tensorflow Estimator, then I pass it to AWS sagemaker sdk and run fit(), the training passes (though doesnt show anything related to training in the console) and in S3 the only output is /source/sourcedir.tar.gz and I believe there also should be at least /model/model.tar.gz which for some reason is not generated and I'm not getting any errors.
sagemaker_session = sagemaker.Session()
role = get_execution_role()
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/NamingConventions')
NamingConventions_estimator = TensorFlow(entry_point='NamingConventions.py',
role=role,
framework_version='1.12.0',
train_instance_count=1,
train_instance_type='ml.m5.xlarge',
py_version='py3',
model_dir="s3://sagemaker-eu-west-2-218566301064/model")
NamingConventions_estimator.fit(inputs, run_tensorboard_locally=True)
and my model_fn from 'NamingConventions.py'
def model_fn(features, labels, mode, params):
net = keras.layers.Embedding(alphabetLen + 1, 8, input_length=maxFeatureLen)(features[INPUT_TENSOR_NAME])
net = keras.layers.LSTM(12)(net)
logits = keras.layers.Dense(len(conventions), activation=tf.nn.softmax)(net) #output
predictions = tf.reshape(logits, [-1])
if mode == tf.estimator.ModeKeys.PREDICT:
return tf.estimator.EstimatorSpec(
mode=mode,
predictions={"ages": predictions},
export_outputs={SIGNATURE_NAME: PredictOutput({"ages": predictions})})
loss = keras.losses.sparse_categorical_crossentropy(labels, predictions)
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
global_step=tf.contrib.framework.get_global_step(),
learning_rate=params["learning_rate"],
optimizer="AdamOptimizer")
predictions_dict = {"ages": predictions}
eval_metric_ops = {
"rmse": tf.metrics.root_mean_squared_error(
tf.cast(labels, tf.float32), predictions)
}
return tf.estimator.EstimatorSpec(
mode=mode,
loss=loss,
train_op=train_op,
eval_metric_ops=eval_metric_ops)
I still can't get it running, I'm trying to use script-mode, it seems like I can't import my model from the same directory.
Currently my script:
import argparse
import os
if __name__ =='__main__':
parser = argparse.ArgumentParser()
# hyperparameters sent by the client are passed as command-line arguments to the script.
parser.add_argument('--epochs', type=int, default=10)
parser.add_argument('--batch_size', type=int, default=100)
parser.add_argument('--learning_rate', type=float, default=0.1)
# input data and model directories
parser.add_argument('--model_dir', type=str)
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
args, _ = parser.parse_known_args()
import tensorflow as tf
from NC_model import model_fn, train_input_fn, eval_input_fn
def train(args):
print(args)
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=args.model_dir)
train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
if __name__ == '__main__':
train(args)
Is the training job listed as successful in the AWS console? Did you check the training log in Amazon CloudWatch?
I think you need to set your estimator model_dir to the path in the environment variable SM_MODEL_DIR.
This is a bit contrary to the docs which are not clear on this point. I suspect the --model_dir arg is used for distributed training and not saving of the final artifact.
Note that you'll get all your checkpoints and summaries there to so it probably best to use --model_dir in your estimator and copy your model export to SM_MODEL_DIR when training has finished.
Script mode gives you the freedom to write TensorFlow scripts the way you want, but the cost is, you have to do almost everything by yourself. For example, here in your case, if you want the model.tar.gz in S3, you have to export the model locally first. Then SageMaker will upload your local model to S3 automatically.
So what you need to add in your script is:
You need to add an exporter and pass it to eval_spec.
You need to call export_savedmodel to save the model to the local model dir that SageMaker can get. The local model dir is in env variable SM_MODEL_DIR, and should be '/opt/ml/model'.
To finish above, I guess you need to have your serving_input_fn implemented too.
Then SageMaker will upload your model from the local model dir automatically to the S3 model dir you specify. And you can see that in S3 after job succeeds.