What are SageMaker pipelines actually? - amazon-web-services

SageMaker Pipelines are rather unclear to me. I'm not experienced in the field of ML, but I'm working on figuring out the pipeline definitions.
I have a few questions:
Is sagemaker pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
Is a sagemaker pipeline essentially codepipeline? How do these integrate, how do these differ?
There's also a Python SDK, how does this differ from the CDK and CloudFormation?
I can't seem to find any examples besides the Python SDK usage, how come?
The docs and workshops only seem to properly describe the Python SDK usage. It would be really helpful if someone could clear this up for me!

SageMaker has two things called Pipelines: Model Building Pipelines and Serial Inference Pipelines. I believe you're referring to the former.
A model building pipeline defines the steps in a machine learning workflow, such as pre-processing, hyperparameter tuning, batch transformations, and setting up endpoints.
A serial inference pipeline is two or more SageMaker models run one after the other.
A model building pipeline is defined in JSON and is hosted/run in some sort of proprietary, serverless fashion by SageMaker.
Is sagemaker pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
You can create/modify them using the API, which can also be called via the CLI, Python SDK, or CloudFormation. These all use the AWS API under the hood
You can start/stop/view them in SageMaker Studio:
Left-side Navigation bar > SageMaker resources > Drop-down menu > Pipelines
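If you want to drive them from the API directly, here is a minimal boto3 sketch; it assumes a pipeline named "foo" (matching the SDK example further down) has already been created:

import boto3

sm = boto3.client("sagemaker")

# List existing pipelines, start a run of one of them, and check its status.
print(sm.list_pipelines()["PipelineSummaries"])

execution = sm.start_pipeline_execution(PipelineName="foo")
status = sm.describe_pipeline_execution(
    PipelineExecutionArn=execution["PipelineExecutionArn"]
)["PipelineExecutionStatus"]
print(status)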
Is a sagemaker pipeline essentially codepipeline? How do these integrate, how do these differ?
Not really. CodePipeline is more for building and deploying code and is not specific to SageMaker. There is no direct integration as far as I can tell, other than that you can start a SageMaker pipeline from CodePipeline.
There's also a Python SDK, how does this differ from the CDK and CloudFormation?
The Python SDK is a stand-alone library for interacting with SageMaker in a developer-friendly fashion. It's more dynamic than CloudFormation: it lets you build pipelines using code, whereas CloudFormation takes a static JSON string.
A very simple example of Python SageMaker SDK usage:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline

# A processing job that runs in a scikit-learn container
processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_count=1,
    instance_type="ml.m5.large",
    role="role-arn",
)

# A pipeline step that runs preprocessor.py with the processor above
processing_step = ProcessingStep(
    name="processing",
    processor=processor,
    code="preprocessor.py",
)

pipeline = Pipeline(name="foo", steps=[processing_step])
pipeline.upsert(role_arn=...)  # create or update the pipeline definition
pipeline.start()
pipeline.definition() produces rather verbose JSON like this:
{
  "Version": "2020-12-01",
  "Metadata": {},
  "Parameters": [],
  "PipelineExperimentConfig": {
    "ExperimentName": {
      "Get": "Execution.PipelineName"
    },
    "TrialName": {
      "Get": "Execution.PipelineExecutionId"
    }
  },
  "Steps": [
    {
      "Name": "processing",
      "Type": "Processing",
      "Arguments": {
        "ProcessingResources": {
          "ClusterConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30
          }
        },
        "AppSpecification": {
          "ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
          "ContainerEntrypoint": [
            "python3",
            "/opt/ml/processing/input/code/preprocessor.py"
          ]
        },
        "RoleArn": "arn:aws:iam::123456789012:role/foo",
        "ProcessingInputs": [
          {
            "InputName": "code",
            "AppManaged": false,
            "S3Input": {
              "S3Uri": "s3://bucket/preprocessor.py",
              "LocalPath": "/opt/ml/processing/input/code",
              "S3DataType": "S3Prefix",
              "S3InputMode": "File",
              "S3DataDistributionType": "FullyReplicated",
              "S3CompressionType": "None"
            }
          }
        ]
      }
    }
  ]
}
You could use the above JSON with CloudFormation/CDK, but you build the JSON with the SageMaker SDK
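For example, here is a rough CDK (Python) sketch of that route, assuming your CDK version ships the AWS::SageMaker::Pipeline L1 construct (aws_sagemaker.CfnPipeline); the stack wiring and role ARN are placeholders, so treat this as a sketch rather than a tested template:

from aws_cdk import core as cdk               # CDK v1; in v2 use: import aws_cdk as cdk
from aws_cdk import aws_sagemaker as sagemaker

class SageMakerPipelineStack(cdk.Stack):
    def __init__(self, scope, construct_id, *, definition_body: str, role_arn: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # definition_body is the JSON string returned by pipeline.definition()
        # in the SageMaker SDK example above.
        sagemaker.CfnPipeline(
            self, "Pipeline",
            pipeline_name="foo",
            role_arn=role_arn,
            # CloudFormation expects the definition wrapped in this object
            pipeline_definition={"PipelineDefinitionBody": definition_body},
        )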
You can also define model-building workflows as Step Functions state machines (using the AWS Step Functions Data Science SDK) or with Airflow.

Related

Use CDK deploy time token values in a launch template user-data script

I recently started porting part of my infrastructure to AWS CDK. Previously, I did some experiments with CloudFormation templates directly.
I am currently facing the problem that I want to encode some values (namely the product version) in a user-data script of an EC2 launch template, and these values should only be loaded at deployment time. With CloudFormation this was quite simple: I was just building my JSON file with functions like Fn::Base64 and Fn::Join. E.g., it looked like this (simplified):
"MyLaunchTemplate": {
"Type": "AWS::EC2::LaunchTemplate",
"Properties": {
"LaunchTemplateData": {
"ImageId": "ami-xxx",
"UserData": {
"Fn::Base64": {
"Fn::Join": [
"#!/bin/bash -xe",
{"Fn::Sub": "echo \"${SomeParameter}\""},
]
}
}
}
}
}
}
This way I am able to define the parameter SomeParameter on launch of the CloudFormation template.
With CDK we can access values from the AWS Parameter Store either at deploy time or at synthesis time. If we use them at deploy time, we only get a token; otherwise we get the actual value.
So far I have managed to read a value at synthesis time and directly encode the user-data script as base64, like this:
product_version = ssm.StringParameter.value_from_lookup(
    self, f'/Prod/MyProduct/Deploy/Version')

launch_template = ec2.CfnLaunchTemplate(self, 'My-LT', launch_template_data={
    'imageId': my_ami,
    'userData': base64.b64encode(
        f'echo {product_version}'.encode('utf-8')).decode('utf-8'),
})
With this code, however, the version gets read during synthesis time and will be hardcoded into the user-data script.
In order to use dynamic values that are only resolved at deploy time (value_for_string_parameter), I would somehow need to tell CDK to write a CloudFormation template similar to what I did manually before (using Fn::Base64 only in CloudFormation, not in Python). However, I did not find a way to do this.
If I read a value that is only to be resolved at deploy time like follows, how can I use it in the UserData field of a launch template?
latest_string_token = ssm.StringParameter.value_for_string_parameter(
    self, "my-plain-parameter-name", 1)
It is possible using the CloudFormation intrinsic functions, which are available in the class aws_cdk.core.Fn in Python.
These can be used when creating a launch template in EC2 to combine strings and tokens, e.g. like this:
import aws_cdk.core as cdk

# loads a value to be resolved at deployment time
product_version = ssm.StringParameter.value_for_string_parameter(
    self, '/Prod/MyProduct/Deploy/Version')

launch_template = ec2.CfnLaunchTemplate(self, 'My-LT', launch_template_data={
    'imageId': my_ami,
    'userData': cdk.Fn.base64(cdk.Fn.join('\n', [
        '#!/usr/bin/env bash',
        cdk.Fn.join('=', ['MY_PRODUCT_VERSION', product_version]),
        'git checkout $MY_PRODUCT_VERSION',
    ])),
})
This example could result in the following user-data script in the launch template if the parameter store contains version 1.2.3:
#!/usr/bin/env bash
MY_PRODUCT_VERSION=1.2.3
git checkout $MY_PRODUCT_VERSION

Does AWS Sagemaker PySparkProcessor manage autoscaling?

I'm using SageMaker to do preprocessing and generate training data, and I'm following the SageMaker API documentation here, but I currently don't see any way to specify autoscaling within the EMR cluster. What should I include within the configuration argument that I pass to my spark_processor's run() call? What shouldn't I include?
I'm aware of this resource, but it doesn't seem comprehensive.
Below is my code; it is very much a "work-in-progress", but I would like to know if someone could provide me with or point me to a resource that shows:
Whether this PySparkProcessor object will manage autoscaling automatically. Should I put autoscaling config within the configuration passed to run()?
An example of the full config that I can pass to the configuration variable.
Here's what I have so far for the configuration.
SPARK_CONFIG = {
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [{"Classification": "export"}],
        }
    ]
}
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    tags=TAGS,
    role=IAM_ROLE,
    instance_count=2,
    py_version="py37",
    volume_size_in_gb=30,
    container_version="1",
    framework_version="3.0",
    network_config=sm_network,
    max_runtime_in_seconds=1800,
    instance_type="ml.m5.2xlarge",
    base_job_name=EMR_CLUSTER_NAME,
    sagemaker_session=sagemaker_session,
)

spark_processor.run(
    configuration=SPARK_CONFIG,
    submit_app=LOCAL_PYSPARK_SCRIPT_DIR,
    spark_event_logs_s3_uri=f"s3://{BUCKET_NAME}/{S3_PYSPARK_LOG_PREFIX}",
)
I'm used to interacting with EMR more directly via Python for these types of tasks. Doing that allows me to specify the entire EMR cluster config at once (including applications, autoscaling, and the EMR default and autoscaling roles) and then add the steps to the cluster once it's created; however, much of this config seems to be abstracted away, and I don't know what remains or needs to be specified, specifically regarding the following config variables: AutoScalingRole, Applications, VisibleToAllUsers, JobFlowRole/ServiceRole, etc.
I found the answer in the SageMaker Python SDK source on GitHub; the processor validates the configuration against the following keys and classifications:
_valid_configuration_keys = ["Classification", "Properties", "Configurations"]
_valid_configuration_classifications = [
    "core-site",
    "hadoop-env",
    "hadoop-log4j",
    "hive-env",
    "hive-log4j",
    "hive-exec-log4j",
    "hive-site",
    "spark-defaults",
    "spark-env",
    "spark-log4j",
    "spark-hive-site",
    "spark-metrics",
    "yarn-env",
    "yarn-site",
    "export",
]
Thus, specifying autoscaling, visibility, and some other cluster-level configurations does not appear to be supported. However, the applications installed at cluster start-up appear to be determined by the classifications in the list above.
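In other words, the instance count is fixed by instance_count on the processor; what you can still tune through configuration are Spark-level settings from the supported classifications. A minimal sketch follows (configuration also accepts a plain list of EMR-style classifications; the property values here are placeholders, not tuning recommendations):

# Stays within the supported classifications listed above.
SPARK_CONFIG = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "8g",
            "spark.executor.cores": "4",
        },
    },
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {"PYSPARK_PYTHON": "python3"},
            }
        ],
    },
]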

How much of Amazon Connect is scriptable, either through Terraform, Ansible or another approach?

This question concerns AWS Connect, the cloud-based call center. For those who have been involved in the setup and configuration of AWS Connect: is there any portion of Amazon Connect that is configurable through a continuous integration flow, other than any possible Lambda touchpoints? What I am looking for is scripting various functions, such as loading exported flows, etc.
Looking at the AWS CLI, I see a number of Amazon Connect calls, but the majority are for retrieving information (https://docs.aws.amazon.com/cli/latest/reference/connect/index.html) and very few are available to configure portions of AWS Connect.
There is basically nothing at this time. All contact flows must be imported/exported by hand. Other settings (e.g. routing profiles, prompts, etc.) must be re-created manually.
Someone has created a "beta" Connect CloudFormation template though that actually uses puppeteer behind the scenes to automate the import/export process. I imagine that Amazon will eventually support this, because devops is certainly one of the rough edges of the platform right now.
For new people checking this question: Amazon has recently published the APIs you are looking for, e.g. create-contact-flow.
It uses a JSON-based flow language specific to Amazon Connect; below is an example:
{
  "Version": "2019-10-30",
  "StartAction": "12345678-1234-1234-1234-123456789012",
  "Metadata": {
    "EntryPointPosition": {"X": 88, "Y": 100},
    "ActionMetadata": {
      "12345678-1234-1234-1234-123456789012": {
        "Position": {"X": 270, "Y": 98}
      },
      "abcdef-abcd-abcd-abcd-abcdefghijkl": {
        "Position": {"X": 545, "Y": 92}
      }
    }
  },
  "Actions": [
    {
      "Identifier": "12345678-1234-1234-1234-123456789012",
      "Type": "MessageParticipant",
      "Transitions": {
        "NextAction": "abcdef-abcd-abcd-abcd-abcdefghijkl",
        "Errors": [],
        "Conditions": []
      },
      "Parameters": {
        "Prompt": {
          "Text": "Thanks for calling the sample flow!",
          "TextType": "text",
          "PromptId": null
        }
      }
    },
    {
      "Identifier": "abcdef-abcd-abcd-abcd-abcdefghijkl",
      "Type": "DisconnectParticipant",
      "Transitions": {},
      "Parameters": {}
    }
  ]
}
Exporting from the GUI does not produce JSON in this format. Obviously, a problem with this approach is keeping state. I am keeping a close eye on Terraform/CloudFormation/CDK and will update this post if there is any support (that does not use Puppeteer).
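For completeness, here is a minimal boto3 sketch of pushing a flow like the one above through that API; the instance ID and file name are placeholders:

import boto3

connect = boto3.client("connect")

# Assumes the flow JSON above is saved as sample_flow.json and that
# YOUR_INSTANCE_ID refers to an existing Connect instance (placeholder).
with open("sample_flow.json") as f:
    flow_content = f.read()

response = connect.create_contact_flow(
    InstanceId="YOUR_INSTANCE_ID",
    Name="sample-flow",
    Type="CONTACT_FLOW",
    Content=flow_content,  # must be in the Connect flow language, not a GUI export
)
print(response["ContactFlowId"])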
I think it's doable now; with the newest APIs, you can do many things to script the entire process. There are some issues with the contact flows themselves, but I think this will improve over the next few months.
In the meantime, there is some effort to add Amazon Connect to Terraform. Here are the issues and the WIP PRs:
Github Issue
Service PR
Resource PR

Serverless framework for AWS : Adding initial data into Dynamodb table

Currently I am using Serverless framework for deploying my applications on AWS.
https://serverless.com/
Using the serverless.yml file, we create the DynamoDB tables which are required for the application. These tables are accessed from the lambda functions.
When the application is deployed, I want few of these tables to be loaded with the initial set of data.
Is this possible?
Can you provide me some pointers for inserting this initial data?
Is this possible with AWS SAM?
I don't know if there's a specific way to do this in Serverless; however, you can just add a call to the AWS CLI like this to your build pipeline:
aws dynamodb batch-write-item --request-items file://initialdata.json
Where initialdata.json looks something like this:
{
  "Forum": [
    {
      "PutRequest": {
        "Item": {
          "Name": {"S": "Amazon DynamoDB"},
          "Category": {"S": "Amazon Web Services"},
          "Threads": {"N": "2"},
          "Messages": {"N": "4"},
          "Views": {"N": "1000"}
        }
      }
    },
    {
      "PutRequest": {
        "Item": {
          "Name": {"S": "Amazon S3"},
          "Category": {"S": "Amazon Web Services"}
        }
      }
    }
  ]
}
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html
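If you would rather trigger this from a small Python script in the same pipeline instead of the raw CLI, here is a minimal boto3 sketch using the same initialdata.json (note that BatchWriteItem accepts at most 25 put/delete requests per call):

import json
import boto3

dynamodb = boto3.client("dynamodb")

# Reads the same request file shown above and writes it in one batch call.
with open("initialdata.json") as f:
    request_items = json.load(f)

response = dynamodb.batch_write_item(RequestItems=request_items)

# Anything DynamoDB could not process comes back here and should be retried.
print(response.get("UnprocessedItems", {}))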
A more Serverless-Framework-native option is to use a tool like the serverless-plugin-scripts plugin, which lets you hook your own CLI commands into the deploy process:
https://github.com/mvila/serverless-plugin-scripts

Is it possible to update existing Google cloud dataflow pipelines when using template for deployment ?

When deploying google dataflow pipeline as templates, Is it possible to update the pipeline using another version of the template?
Basically, I am looking for a combination of https://cloud.google.com/dataflow/pipelines/updating-a-pipeline and https://cloud.google.com/dataflow/docs/templates/overview
The feature of updating an existing job through the templates API is not ready yet (we are working on it).
In the meantime, you can probably make use of our public repo (basically the source code for these templates) to do it: you can just build and launch the job from a shell to "update" the running job.
https://github.com/GoogleCloudPlatform/DataflowTemplates
It's now possible to update a templated streaming Dataflow job.
Using the REST API, set the update parameter to true:
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://YOUR_BUCKET_NAME/templates/TemplateName
{
  "jobName": "JOB_NAME",
  "parameters": {
    "topic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME",
    "table": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME"
  },
  "environment": {
    "tempLocation": "gs://YOUR_BUCKET_NAME/temp",
    "zone": "us-central1-f"
  },
  "update": true
}
The update option is not present using gcloud dataflow jobs run.
https://cloud.google.com/dataflow/docs/guides/templates/running-templates#example-3:-updating-a-custom-template-streaming-job
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
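If you would rather drive this from Python than raw REST, here is a minimal sketch using google-api-python-client; the project, location, bucket, and parameter values are the same placeholders as in the REST example above:

from googleapiclient.discovery import build

# Builds a Dataflow API client using application-default credentials.
dataflow = build("dataflow", "v1b3")

request = dataflow.projects().locations().templates().launch(
    projectId="YOUR_PROJECT_ID",
    location="LOCATION",
    gcsPath="gs://YOUR_BUCKET_NAME/templates/TemplateName",
    body={
        "jobName": "JOB_NAME",
        "parameters": {
            "topic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME",
            "table": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME",
        },
        "environment": {"tempLocation": "gs://YOUR_BUCKET_NAME/temp"},
        "update": True,  # replaces the running streaming job instead of starting a new one
    },
)
print(request.execute())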