I'm using SageMaker to do preprocessing and generate training data, and I'm following the SageMaker API documentation here, but I currently don't see any way to specify autoscaling within the EMR cluster. What should I include in the configuration argument that I pass to my spark_processor's run() method, and what shouldn't I include?
I'm aware of this resource, but it doesn't seem comprehensive.
Below is my code. It is very much a work in progress, but I would like someone to provide me with, or point me to, a resource that shows:
Whether this PySparkProcessor object will manage autoscaling automatically, and whether I should put autoscaling config within the configuration argument of run().
An example of the full config that I can pass to the configuration variable.
Here's what I have so far for the configuration.
SPARK_CONFIG = {
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [{"Classification": "export"}],
        }
    ]
}
spark_processor = PySparkProcessor(
tags=TAGS,
role=IAM_ROLE,
instance_count=2,
py_version="py37",
volume_size_in_gb=30,
container_version="1",
framework_version="3.0",
network_config=sm_network,
max_runtime_in_seconds=1800,
instance_type="ml.m5.2xlarge",
base_job_name=EMR_CLUSTER_NAME,
sagemaker_session=sagemaker_session,
)
spark_processor.run(
configuration=SPARK_CONFIG,
submit_app=LOCAL_PYSPARK_SCRIPT_DIR,
spark_event_logs_s3_uri=f"s3://{BUCKET_NAME}/{S3_PYSPARK_LOG_PREFIX}",  # f-string so the bucket/prefix variables are interpolated
)
I'm used to interacting with EMR more directly via Python for these types of tasks. Doing that allows me to specify the entire EMR cluster config at once (including applications, autoscaling, and the EMR default and autoscaling roles) and then add steps to the cluster once it's created. However, much of this config seems to be abstracted away here, and I don't know what remains or needs to be specified, specifically regarding the following config variables: AutoScalingRole, Applications, VisibleToAllUsers, JobFlowRole/ServiceRole, etc. A rough sketch of what I mean is below.
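For reference, here is a rough sketch of the kind of direct EMR setup I mean, using boto3's run_job_flow. The names, roles, instance settings, and the single scaling rule are placeholders for illustration, not my real configuration:
import boto3

emr = boto3.client("emr")

# Illustrative sketch only: names, roles, and scaling thresholds are placeholders.
emr.run_job_flow(
    Name="preprocessing-cluster",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    VisibleToAllUsers=True,
    JobFlowRole="EMR_EC2_DefaultRole",            # EC2 instance profile
    ServiceRole="EMR_DefaultRole",
    AutoScalingRole="EMR_AutoScaling_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {
                "Name": "master",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.2xlarge",
                "InstanceCount": 2,
                # This is where autoscaling lives when calling EMR directly.
                "AutoScalingPolicy": {
                    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
                    "Rules": [
                        {
                            "Name": "scale-out-on-low-yarn-memory",
                            "Action": {
                                "SimpleScalingPolicyConfiguration": {
                                    "ScalingAdjustment": 1,
                                    "CoolDown": 300,
                                }
                            },
                            "Trigger": {
                                "CloudWatchAlarmDefinition": {
                                    "ComparisonOperator": "LESS_THAN",
                                    "MetricName": "YARNMemoryAvailablePercentage",
                                    "Period": 300,
                                    "Threshold": 15.0,
                                }
                            },
                        }
                    ],
                },
            },
        ],
    },
)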
I found the answer in the SageMaker Python SDK source on GitHub.
_valid_configuration_keys = ["Classification", "Properties", "Configurations"]
_valid_configuration_classifications = [
"core-site",
"hadoop-env",
"hadoop-log4j",
"hive-env",
"hive-log4j",
"hive-exec-log4j",
"hive-site",
"spark-defaults",
"spark-env",
"spark-log4j",
"spark-hive-site",
"spark-metrics",
"yarn-env",
"yarn-site",
"export",
]
Thus, specifying autoscaling, visibility, and some other cluster-level settings does not appear to be supported. However, the applications installed on cluster startup do seem to depend on the classifications in the above list. A fuller example of a supported configuration is sketched below.
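Based on that list, here is a sketch of a fuller configuration that should be accepted by run() (EMR-style classifications only; the property values are purely illustrative, and there is no classification for cluster-level settings such as autoscaling):
# Sketch only: property values are illustrative, not recommendations.
SPARK_CONFIG = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "8g",
            "spark.executor.cores": "4",
        },
    },
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {"PYTHONHASHSEED": "0"},
            }
        ],
    },
]

spark_processor.run(
    submit_app=LOCAL_PYSPARK_SCRIPT_DIR,
    configuration=SPARK_CONFIG,
    spark_event_logs_s3_uri=f"s3://{BUCKET_NAME}/{S3_PYSPARK_LOG_PREFIX}",
)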
Related
SageMaker Pipelines are rather unclear to me. I'm not experienced in the field of ML, but I'm working on figuring out the pipeline definitions.
I have a few questions:
Is SageMaker Pipelines a stand-alone service/feature? I ask because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
Is a SageMaker pipeline essentially CodePipeline? How do these integrate, and how do they differ?
There's also a Python SDK; how does this differ from the CDK and CloudFormation?
I can't seem to find any examples besides Python SDK usage; how come?
The docs and workshops only seem to properly describe Python SDK usage. It would be really helpful if someone could clear this up for me!
SageMaker has two things called Pipelines: Model Building Pipelines and Serial Inference Pipelines. I believe you're referring to the former
A model building pipeline defines steps in a machine learning workflow, such as pre-processing, hyperparameter tuning, batch transformations, and setting up endpoints
A serial inference pipeline is two or more SageMaker models run one after the other
A model building pipeline is defined in JSON, and is hosted/run in some sort of proprietary, serverless fashion by SageMaker
Is SageMaker Pipelines a stand-alone service/feature? I ask because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
You can create/modify them using the API, which can also be called via the CLI, Python SDK, or CloudFormation. These all use the AWS API under the hood
You can start/stop/view them in SageMaker Studio:
Left-side Navigation bar > SageMaker resources > Drop-down menu > Pipelines
Is a SageMaker pipeline essentially CodePipeline? How do these integrate, and how do they differ?
Not really. CodePipeline is more for building and deploying code and isn't specific to SageMaker. There is no direct integration as far as I can tell, other than that you can start a SageMaker pipeline from CodePipeline.
There's also a Python SDK; how does this differ from the CDK and CloudFormation?
The Python SDK is a stand-alone library for interacting with SageMaker in a developer-friendly fashion. It's more dynamic than CloudFormation: it lets you build pipelines using code, whereas CloudFormation takes a static JSON string.
A very simple example of Python SageMaker SDK usage:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_count=1,
    instance_type="ml.m5.large",
    role="role-arn",
)

processing_step = ProcessingStep(
    name="processing",
    processor=processor,
    code="preprocessor.py",
)

pipeline = Pipeline(name="foo", steps=[processing_step])
pipeline.upsert(role_arn=...)
pipeline.start()
pipeline.definition() produces rather verbose JSON like this:
{
"Version": "2020-12-01",
"Metadata": {},
"Parameters": [],
"PipelineExperimentConfig": {
"ExperimentName": {
"Get": "Execution.PipelineName"
},
"TrialName": {
"Get": "Execution.PipelineExecutionId"
}
},
"Steps": [
{
"Name": "processing",
"Type": "Processing",
"Arguments": {
"ProcessingResources": {
"ClusterConfig": {
"InstanceType": "ml.m5.large",
"InstanceCount": 1,
"VolumeSizeInGB": 30
}
},
"AppSpecification": {
"ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
"ContainerEntrypoint": [
"python3",
"/opt/ml/processing/input/code/preprocessor.py"
]
},
"RoleArn": "arn:aws:iam::123456789012:role/foo",
"ProcessingInputs": [
{
"InputName": "code",
"AppManaged": false,
"S3Input": {
"S3Uri": "s3://bucket/preprocessor.py",
"LocalPath": "/opt/ml/processing/input/code",
"S3DataType": "S3Prefix",
"S3InputMode": "File",
"S3DataDistributionType": "FullyReplicated",
"S3CompressionType": "None"
}
}
]
}
}
]
}
You could use the above JSON with CloudFormation/CDK, but you build the JSON with the SageMaker SDK
You can also define model-building workflows using Step Functions state machines (via the AWS Step Functions Data Science SDK) or Airflow.
On AWS CloudWatch we have one dashboard per environment.
Each dashboard has N plots.
Some plots use the Auto Scaling Group (ASG) name to find the data to plot.
Example of such a plot (taken from the dashboard's edit view, Source tab):
{
"metrics": [
[ "production", "mem_used_percent", "AutoScalingGroupName", "awseb-e-rv8y2igice-stack-AWSEBAutoScalingGroup-3T5YOK67T3FD" ]
],
... other params removed for brevity ...
"title": "Used Memory (%)",
}
Every time we deploy, the ASG name changes (we deploy using CodeDeploy with Elastic Beanstalk (EB) configuration files from source).
I need to manually find the new name and update the N plots one by one.
The strange thing is that this happens for the production and staging environments, but not for integration.
All 3 should be copies of one another, with different settings from the EB configuration files, so I don't know what is going on.
In any case, what (I think) I need is one of:
option 1: prevent the ASG name change upon deploy
option 2: dynamically update the plots with the new name
option 3: plot the same data without using the ASG name (but the alternatives I find are the EC2 instance ID, which also changes, and ImageId and InstanceType, which are common to more than one EC2 instance, so those won't work either)
My online search-fu has come up empty.
More Info:
I'm publishing these metrics with the CloudWatch agent, by adjusting the config file, as per the docs here (a sketch of the relevant config section follows the link):
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html
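For context, the piece of the agent config that attaches the ASG name is the append_dimensions block. Below is a minimal sketch of such a config (reconstructed from the agent docs, not my exact file), written here as a Python dict and dumped to the agent's JSON config:
import json

# Minimal sketch: only the memory metric is shown; file name and interval are illustrative.
agent_config = {
    "metrics": {
        # The agent resolves these placeholders per instance; AutoScalingGroupName
        # is why the ASG name ends up as a dimension on every metric.
        "append_dimensions": {
            "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            "InstanceId": "${aws:InstanceId}",
            "ImageId": "${aws:ImageId}",
            "InstanceType": "${aws:InstanceType}",
        },
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": 60,
            }
        },
    }
}

with open("amazon-cloudwatch-agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)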
Have a look at CloudWatch Search Expression Syntax. It allows you to use tokens for searching, e.g.:
SEARCH(' {AWS/CWAgent, AutoScalingGroupName} MetricName="mem_used_percent" rv8y2igice', 'Average', 300)
which would replace the entry for metrics like so:
"metrics": [
[ { "expression": "SEARCH(' {AWS/CWAgent, AutoScalingGroupName} MetricName=\"mem_used_percent\" rv8y2igice', 'Average', 300)", "label": "Expression1", "id": "e1" } ]
]
Simply search for the desired result in the console; results that match the search appear.
To graph all of the metrics that match your search, choose Graph search,
and find the exact search expression that you want in the Details column on the Graphed metrics tab.
SEARCH('{CWAgent,AutoScalingGroupName,ImageId,InstanceId,InstanceType} mem_used_percent', 'Average', 300)
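If you would rather keep explicit ASG names in the widgets (option 2 from the question), a hedged alternative is to rewrite the dashboard body after each deploy. A rough boto3 sketch, where the dashboard name, ASG prefix, and old ASG name are placeholders:
import boto3

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")

# Placeholders: adjust to your environment.
DASHBOARD_NAME = "production"
ASG_PREFIX = "awseb-e-rv8y2igice"
OLD_ASG_NAME = "awseb-e-rv8y2igice-stack-AWSEBAutoScalingGroup-3T5YOK67T3FD"

# Find the ASG created by the latest deployment (matched here by name prefix).
groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
new_asg_name = next(
    g["AutoScalingGroupName"]
    for g in groups
    if g["AutoScalingGroupName"].startswith(ASG_PREFIX)
)

# Fetch the dashboard body, swap the old ASG name for the new one, and save it.
body = cloudwatch.get_dashboard(DashboardName=DASHBOARD_NAME)["DashboardBody"]
cloudwatch.put_dashboard(
    DashboardName=DASHBOARD_NAME,
    DashboardBody=body.replace(OLD_ASG_NAME, new_asg_name),
)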
I'm creating a security camera IoT project that uploads images to S3 and will soon offer a UI to review those images. AWS Amplify is being used to make this happen quickly.
As I get started on the Amplify side of things, I'm noticing config files with very specifically named attributes and values. The team-provider-info.json file in particular, which isn't git-ignored, is very specific:
{
"dev": {
"awscloudformation": {
"AuthRoleName": "amplify-twintigersecurityweb-dev-123456-authRole",
"UnauthRoleArn": "arn:aws:iam::111164163333:role/amplify-twintigersecurityweb-dev-123456-unauthRole",
"AuthRoleArn": "arn:aws:iam::111164163333:role/amplify-twintigersecurityweb-dev-123456-authRole",
"Region": "us-east-1",
"DeploymentBucketName": "amplify-twintigersecurityweb-dev-123456-deployment",
"UnauthRoleName": "amplify-twintigersecurityweb-dev-123456-unauthRole",
"StackName": "amplify-twintigersecurityweb-dev-123456",
"StackId": "arn:aws:cloudformation:us-east-1:111164163333:stack/amplify-twintigersecurityweb-dev-123456/88888888-8888-8888-8888-888838f58888",
"AmplifyAppId": "dddd7dx2zipppp"
}
}
}
May I post this to my public repository without worry? Is there a chance of naming conflicts? How would one pull this in for use in their new project?
Per AWS Amplify documentation:
If you want to share a project publicly and open source your serverless infrastructure, you should remove or put the amplify/team-provider-info.json file in gitignore file.
At a glance, everything else generated by amplify init that is NOT in the .gitignore file is OK to share, e.g. project-config.json and backend-config.json.
Add this to .gitignore:
# not to share if public
amplify/team-provider-info.json
I have two Google projects: dev and prod. I also import data from different storage buckets located in these projects: dev-bucket and prod-bucket.
After I have made and tested changes in the dev environment, how can I smoothly apply (deploy/copy) the changes to prod as well?
What I do now is export the flow from dev and then re-import it into prod. However, each time I need to manually do the following in the prod flows:
Change the datasets that serve as inputs to the flow
Replace the manual and scheduled destinations for the right BigQuery dataset (dev-dataset-bigquery and prod-dataset-bigquery)
How can this be done more smoothly?
If you want to copy data between Google Cloud Storage (GCS) buckets dev-bucket and prod-bucket, Google provides a Storage Transfer Service with this functionality. https://cloud.google.com/storage-transfer/docs/create-manage-transfer-console You can either manually trigger data to be copied from one bucket to another or have it run on a schedule.
For the second part, it sounds like both dev-dataset-bigquery and prod-dataset-bigquery are loaded from files in GCS? If this is the case, the BigQuery Transfer Service may be of use. https://cloud.google.com/bigquery/docs/cloud-storage-transfer You can trigger a transfer job manually, or have it run on a schedule.
As others have said in the comments, if you need to verify data before initiating transfers from dev to prod, a CI/CD system such as Spinnaker may help. If the verification can be automated, a system such as Apache Airflow (running on Cloud Composer, if you want a hosted version) provides more flexibility than the transfer services.
Follow the procedure below to move a flow from one environment to another using the API, and to update the dataset and the output for the new environment. (A Python sketch of these calls is included after the steps.)
1) Export the plan
GET
https://api.clouddataprep.com/v4/plans/<plan_id>/package
2) Import the plan
POST
https://api.clouddataprep.com/v4/plans/package
3) Update the input dataset
PUT
https://api.clouddataprep.com/v4/importedDatasets/<dataset_id>
{
"name": "<new_dataset_name>",
"bucket": "<bucket_name>",
"path": "<bucket_file_name>"
}
4) Update the output
PATCH
https://api.clouddataprep.com/v4/outputObjects/<output_id>
{
"publications": [
{
"path": [
"<project_name>",
"<dataset_name>"
],
"tableName": "<table_name>",
"targetType": "bigquery",
"action": "create"
}
]
}
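A rough Python sketch of scripting those four calls with requests. The bearer token, IDs, and payload values are placeholders, and whether the plan import expects the exported package as a multipart file upload is an assumption here; check the Dataprep API docs for the exact request format:
import requests

API = "https://api.clouddataprep.com/v4"
HEADERS = {"Authorization": "Bearer <access_token>"}  # placeholder token

# 1) Export the plan package from the dev environment.
package = requests.get(f"{API}/plans/<plan_id>/package", headers=HEADERS)

# 2) Import the plan package into the prod environment.
# Assumption: the endpoint accepts the exported package as a file upload.
requests.post(
    f"{API}/plans/package",
    headers=HEADERS,
    files={"file": ("plan-package.zip", package.content)},
)

# 3) Point the imported dataset at the prod bucket/file.
requests.put(
    f"{API}/importedDatasets/<dataset_id>",
    headers=HEADERS,
    json={
        "name": "<new_dataset_name>",
        "bucket": "<bucket_name>",
        "path": "<bucket_file_name>",
    },
)

# 4) Update the output to publish to the prod BigQuery dataset.
requests.patch(
    f"{API}/outputObjects/<output_id>",
    headers=HEADERS,
    json={
        "publications": [
            {
                "path": ["<project_name>", "<dataset_name>"],
                "tableName": "<table_name>",
                "targetType": "bigquery",
                "action": "create",
            }
        ]
    },
)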
I add a metrics.properties file to the resources directory (Maven project) with a CSV sink. Everything is fine when I run the Spark app locally: metrics appear. But when I submit the same fat jar to Amazon EMR, I don't see any attempt to write metrics to the CSV sink. So I want to check at runtime what the loaded settings for the Spark metrics subsystem are. Is there any way to do this?
I looked into SparkEnv.get.metricsSystem but didn't find anything.
That is basically because Spark on EMR is not picking up your custom metrics.properties file from the resources dir of the fat jar.
For EMR, the preferred way to configure this is through the EMR Configurations API, in which you pass the classification and properties as embedded JSON.
For the Spark metrics subsystem, here is an example that modifies a couple of metrics properties:
[
{
"Classification": "spark-metrics",
"Properties": {
"*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
"*.sink.csv.period": "1"
}
}
]
You can use this JSON when creating the EMR cluster through the Amazon Console or through the SDK; a sketch of the SDK route is below.
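For the SDK route, a minimal boto3 sketch (cluster name, release label, instance settings, and roles are placeholders; only the Configurations block comes from the JSON above):
import boto3

emr = boto3.client("emr")

# Placeholder cluster settings; the Configurations block mirrors the JSON above.
emr.run_job_flow(
    Name="spark-csv-metrics-example",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "spark-metrics",
            "Properties": {
                "*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
                "*.sink.csv.period": "1",
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)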