All our AWS infra is managed by Terraform, including the SageMaker resources. We want to implement autoscaling for our SageMaker resources, but we can't find a Terraform solution to build this part of our infra as code.
Generally, it seems the autoscaling configuration should be located in the aws_sagemaker_endpoint_configuration >> production_variants block.
References:
AWS documentation: https://aws.amazon.com/blogs/aws/auto-scaling-is-now-available-for-amazon-sagemaker/
TF documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sagemaker_endpoint_configuration
Thanks in advance for your response
So, from my research it should be something like:
resource "aws_appautoscaling_target" "sagemaker_target" {
  max_capacity       = var.max_instance_count
  min_capacity       = var.min_instance_count
  resource_id        = "endpoint/${aws_sagemaker_endpoint.endpoint.name}/variant/${var.service_name}-${var.site}-${var.environment}"
  role_arn           = aws_iam_role.sm_execution.arn
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "sagemaker_policy" {
  name               = "${var.service_name}-${var.site}-${var.environment}-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sagemaker_target.resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }
    target_value       = var.target_invocations
    scale_in_cooldown  = var.target_scale_in_cooldown
    scale_out_cooldown = var.target_scale_out_cooldown
  }
}
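For context, the variant part of the resource_id above has to match the variant_name declared in the endpoint configuration's production_variants block. A minimal sketch of what that might look like, assuming a hypothetical aws_sagemaker_model.model and var.instance_type (these names are placeholders, not taken from the original code):

# Sketch only: variant_name must match the ".../variant/<name>" segment of the
# autoscaling target's resource_id above. The model reference is a placeholder.
resource "aws_sagemaker_endpoint_configuration" "endpoint_config" {
  name = "${var.service_name}-${var.site}-${var.environment}-config"

  production_variants {
    variant_name           = "${var.service_name}-${var.site}-${var.environment}"
    model_name             = aws_sagemaker_model.model.name
    instance_type          = var.instance_type
    initial_instance_count = var.min_instance_count
  }
}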
Taking a reference from the other answer, below is what my code looks like. Pasting it here for someone's reference.
Note: I had to remove the role_arn key from the aws_appautoscaling_target resource so that it uses the default service-linked IAM role, and I also had to use the string SageMakerEndpointInvocationScalingPolicy as the policy name in the aws_appautoscaling_policy resource to get the same behavior as when an autoscaling policy for a SageMaker endpoint is created from the AWS console.
PS: Without the above adjustments, TargetValue was not rendered when viewing the policy in the AWS console, and I was not able to manually change the target value of the Terraform-created autoscaling policy for the SageMaker endpoint (while updating TargetValue from the console I was getting a validation exception like "Only one Target Tracking Scaling policy for a given metric specification is allowed").
resource "aws_appautoscaling_target" "register_myendpoint_target" {
  max_capacity       = 2
  min_capacity       = 1
  resource_id        = "endpoint/${aws_sagemaker_endpoint.my_model-endpoint.name}/variant/${var.variant_name}"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "autoscale_policy_my_endpoint" {
  # Had to use this exact name so the console doesn't show that a custom policy is configured
  name               = "SageMakerEndpointInvocationScalingPolicy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.register_myendpoint_target.resource_id
  scalable_dimension = aws_appautoscaling_target.register_myendpoint_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.register_myendpoint_target.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 100.0

    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }

    scale_in_cooldown  = 300
    scale_out_cooldown = 300
  }
}
I have created two Lambda functions. Now I want to pass all the CloudWatch logs from the first Lambda to the second Lambda. I have created a new log group and a subscription filter to pass the CloudWatch logs to the second Lambda.
I am not sure if this configuration needs any additional resources.
resource "aws_lambda_function" "audit-logs" {
  filename         = var.audit_filename
  function_name    = var.audit_function
  source_code_hash = filebase64sha256(var.audit_filename)
  role             = module.lambda_role.arn
  handler          = "cloudwatch.lambda_handler"
  runtime          = "python3.9"
  timeout          = 200
  description      = "audit logs"
  depends_on       = [module.lambda_role, module.security_group]
}

resource "aws_cloudwatch_log_group" "splunk_cloudwatch_loggroup" {
  name = "/aws/lambda/audit_logs"
}

resource "aws_lambda_permission" "allow_cloudwatch_for_splunk" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.splunk-logs.arn
  principal     = "logs.amazonaws.com" # "logs.region.amazonaws.com"
  source_arn    = "${aws_cloudwatch_log_group.splunk_cloudwatch_loggroup.arn}:*"
}

resource "aws_cloudwatch_log_subscription_filter" "splunk_cloudwatch_trigger" {
  depends_on      = [aws_lambda_permission.allow_cloudwatch_for_splunk]
  destination_arn = aws_lambda_function.splunk-logs.arn
  filter_pattern  = ""
  log_group_name  = aws_cloudwatch_log_group.splunk_cloudwatch_loggroup.name
  name            = "splunk_filter"
}

# splunk logs lambda function
resource "aws_lambda_function" "splunk-logs" {
  filename         = var.splunk_filename
  function_name    = var.splunk_function
  source_code_hash = filebase64sha256(var.splunk_filename)
  role             = module.lambda_role.arn
  handler          = "${var.splunk_handler}.handler"
  runtime          = "python3.9"
  timeout          = 200
  description      = "audit logs"
  depends_on       = [module.lambda_role, module.security_group]
}
How can I pass all the logs from the first Lambda to the newly created log group? Any help?
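One detail that may matter here (stated as an assumption about the intent, not a verified fix): a Lambda function writes its logs to a log group named /aws/lambda/<function_name>, so the subscription filter only forwards the first function's logs if it is attached to a group with that exact name. A sketch of that wiring, reusing the resource names above; the new resource names are placeholders:

# Sketch: the first Lambda's logs land in /aws/lambda/<function_name>, so the managed
# log group (and the subscription filter forwarding its events) must use that name.
resource "aws_cloudwatch_log_group" "audit_lambda_loggroup" {
  name = "/aws/lambda/${aws_lambda_function.audit-logs.function_name}"
}

# CloudWatch Logs must be allowed to invoke the destination Lambda from this group.
resource "aws_lambda_permission" "allow_audit_loggroup" {
  statement_id  = "AllowExecutionFromCloudWatchAuditLogs"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.splunk-logs.arn
  principal     = "logs.amazonaws.com"
  source_arn    = "${aws_cloudwatch_log_group.audit_lambda_loggroup.arn}:*"
}

resource "aws_cloudwatch_log_subscription_filter" "audit_to_splunk" {
  depends_on      = [aws_lambda_permission.allow_audit_loggroup]
  name            = "audit_to_splunk_filter"
  log_group_name  = aws_cloudwatch_log_group.audit_lambda_loggroup.name
  filter_pattern  = ""
  destination_arn = aws_lambda_function.splunk-logs.arn
}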
I have already spent 4 days testing all the configurations from the Kubernetes Terraform GCP module, and I can't see the metrics of my workloads: it never shows CPU or memory, even though a standard cluster created by default from the GUI has this activated.
Here's my code:
resource "google_container_cluster" "default" {
  provider                 = google-beta
  name                     = var.name
  project                  = var.project_id
  description              = "Vectux GKE Cluster"
  location                 = var.zonal_region
  remove_default_node_pool = true
  initial_node_count       = var.gke_num_nodes

  master_auth {
    #username = ""
    #password = ""
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  timeouts {
    create = "30m"
    update = "40m"
  }

  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }
}

resource "google_container_node_pool" "default" {
  name           = "${var.name}-node-pool"
  project        = var.project_id
  location       = var.zonal_region
  node_locations = [var.zonal_region]
  cluster        = google_container_cluster.default.name
  node_count     = var.gke_num_nodes

  node_config {
    preemptible     = true
    machine_type    = var.machine_type
    disk_size_gb    = var.disk_size_gb
    service_account = google_service_account.default3.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/cloud-platform",
      "compute-ro",
      "storage-ro",
      "service-management",
      "service-control",
    ]
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

resource "google_service_account" "default3" {
  project      = var.project_id
  account_id   = "terraform-vectux-33"
  display_name = "tfvectux2"
  provider     = google-beta
}
Here's some info on the cluster (when I compare it against the standard one with the metrics enabled, I see no differences):
And here's the workload view without the metrics that I'd like to see:
As I mentioned in the comment, to solve your issue you must add a google_project_iam_binding resource and grant your service account the specific role roles/monitoring.metricWriter. In the comments I also mentioned that you can grant roles/compute.admin, but after another test I ran, it's not necessary.
Below is a Terraform snippet I used to create a test cluster with a service account called sa. I changed some fields in the node config. In your case, you would need to add the whole google_project_iam_binding resource.
Terraform Snippet
### Creating Service Account
resource "google_service_account" "sa" {
  project      = "my-project-name"
  account_id   = "terraform-vectux-2"
  display_name = "tfvectux2"
  provider     = google-beta
}

### Binding Service Account with IAM
resource "google_project_iam_binding" "sa_binding_writer" {
  project = "my-project-name"
  role    = "roles/monitoring.metricWriter"
  members = [
    "serviceAccount:${google_service_account.sa.email}"
    ### in your case it will be "serviceAccount:${google_service_account.your-serviceaccount-name.email}"
  ]
}

resource "google_container_cluster" "default" {
  provider                 = google-beta
  name                     = "cluster-test-custom-sa"
  project                  = "my-project-name"
  description              = "Vectux GKE Cluster"
  location                 = "europe-west2"
  remove_default_node_pool = true
  initial_node_count       = "1"

  master_auth {
    #username = ""
    #password = ""
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  timeouts {
    create = "30m"
    update = "40m"
  }

  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }
}

resource "google_container_node_pool" "default" {
  name           = "test-node-pool"
  project        = "my-project-name"
  location       = "europe-west2"
  node_locations = ["europe-west2-a"]
  cluster        = google_container_cluster.default.name
  node_count     = "1"

  node_config {
    preemptible     = "true"
    machine_type    = "e2-medium"
    disk_size_gb    = 50
    service_account = google_service_account.sa.email
    ### service_account = google_service_account.your-serviceaccount-name.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/cloud-platform",
      "compute-ro",
      "storage-ro",
      "service-management",
      "service-control",
    ]
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}
My Screens:
Whole workload
Node Workload
Additional Information
If you added just roles/compute.admin, you might see the workload for the whole application, but you wouldn't be able to see each node's workload. With roles/monitoring.metricWriter you can see both the whole application workload and each node's workload. To achieve what you want (seeing the workloads on each node), you just need roles/monitoring.metricWriter.
You need to use google_project_iam_binding, as without it your newly created service account will have no entry in the IAM roles and will lack permissions. In short, your new SA will be visible under IAM & Admin > Service Accounts, but there will be no entry for it under IAM & Admin > IAM.
If you want more information about IAM and binding in Terraform, please check the Terraform documentation.
As a last note, please remember that the OAuth scope https://www.googleapis.com/auth/cloud-platform gives access to all GCP resources.
Here, we use aws_elasticache_global_replication_group in Terraform to add a multi-region ElastiCache Redis cluster in AWS. Below is the code we are trying; after applying the Terraform plan we get the error "global_replication_group_id": conflicts with parameter_group_name.
resource "aws_elasticache_global_replication_group" "global-redis" {
  global_replication_group_id_suffix = "global-redis"
  primary_replication_group_id       = aws_elasticache_replication_group.primary.id
}

resource "aws_elasticache_replication_group" "primary" {
  replication_group_id          = "redis-primary"
  replication_group_description = "primary replication group"
  engine                        = "redis"
  engine_version                = "5.0.6"
  node_type                     = "cache.m5.large"
  snapshot_retention_limit      = var.snapshot_retention
  parameter_group_name          = var.parameter_group_name
  availability_zones            = var.availability-zones-primary
  number_cache_clusters         = 1
}

resource "aws_elasticache_replication_group" "secondary" {
  replication_group_id          = "redis-secondary"
  replication_group_description = "secondary replication group"
  global_replication_group_id   = aws_elasticache_global_replication_group.global-redis.global_replication_group_id
  snapshot_retention_limit      = var.snapshot_retention
  parameter_group_name          = var.parameter_group_name
  availability_zones            = var.availability-zones-secondary
  number_cache_clusters         = 1
  provider                      = aws.other_region
}
We couldn't find any documentation regarding this error and are looking for answers in case anyone has faced the same issue.
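For reference, the Terraform AWS provider documents that a replication group joined to a global replication group inherits settings such as the engine and parameter group from the primary, which is why parameter_group_name conflicts with global_replication_group_id on the secondary. A sketch of the secondary with that argument removed (an assumption about the fix, not a tested answer):

# Sketch: the secondary inherits engine settings and the parameter group from the
# global replication group, so parameter_group_name is omitted here.
resource "aws_elasticache_replication_group" "secondary" {
  replication_group_id          = "redis-secondary"
  replication_group_description = "secondary replication group"
  global_replication_group_id   = aws_elasticache_global_replication_group.global-redis.global_replication_group_id
  snapshot_retention_limit      = var.snapshot_retention
  availability_zones            = var.availability-zones-secondary
  number_cache_clusters         = 1
  provider                      = aws.other_region
}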
I have used ECS with a capacity provider for the deployment of my application and have enabled scale-in protection for the ASG used by the capacity provider. During terraform destroy, I see Terraform trying to destroy the ECS cluster; after trying for 10 minutes it fails and outputs:
Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.
What am I doing wrong here?
Relevant Terraform script:
FOR ECS
# ECS auto-scaling
resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = var.ecs_max_size # (8)
  min_capacity       = var.ecs_min_size # (2)
  resource_id        = "service/${aws_ecs_cluster.kong.name}/${aws_ecs_service.kong.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_asg_cpu_policy" {
  name               = local.name
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70
  }
}
FOR Capacity Provider
resource "aws_autoscaling_group" "kong" {
  name                  = local.name
  launch_configuration  = aws_launch_configuration.kong.name
  vpc_zone_identifier   = data.aws_subnet_ids.private.ids
  min_size              = var.asg_min_size         # (1)
  max_size              = var.asg_max_size         # (4)
  desired_capacity      = var.asg_desired_capacity # (2)
  protect_from_scale_in = true

  tags = [
    {
      "key"                 = "Name"
      "value"               = local.name
      "propagate_at_launch" = true
    },
    {
      "key"                 = "AmazonECSManaged"
      "value"               = ""
      "propagate_at_launch" = true
    }
  ]
}

resource "aws_ecs_capacity_provider" "capacity_provider" {
  name = local.name

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.kong.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      maximum_scaling_step_size = 4
      minimum_scaling_step_size = 1
      instance_warmup_period    = 120
      status                    = "ENABLED"
      target_capacity           = 75
    }
  }
}

resource "aws_ecs_cluster" "kong" {
  name = local.name

  capacity_providers = [
    aws_ecs_capacity_provider.capacity_provider.name,
  ]

  tags = merge(
    {
      "Name"        = local.name,
      "Environment" = var.environment,
      "Description" = var.description,
      "Service"     = var.service,
    },
    var.tags
  )

  provisioner "local-exec" {
    when    = destroy
    command = "aws autoscaling update-auto-scaling-group --auto-scaling-group-name ${self.name} --min-size 0 --desired-capacity 0"
  }
}
Terraform version:
Terraform v0.14.7
provider registry.terraform.io/hashicorp/aws v3.46.0
This is a long-standing issue reported on GitHub:
terraform attempts to destroy AWS ECS cluster before Deleting ECS Service
For now, there does not seem to be any solution to it, except manual intervention or using a local-exec provisioner with the AWS CLI to aid Terraform.
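As a sketch of what such a CLI-based workaround could look like, building on the destroy-time provisioner already present in the question's aws_ecs_cluster resource (and assuming the ASG shares local.name with the cluster, as in the question): since protect_from_scale_in is enabled, the instances also need their scale-in protection removed before the group can drain to zero and the container instances can deregister. This is only one possible shape, not a verified fix.

resource "aws_ecs_cluster" "kong" {
  # ... existing arguments from the question ...

  # Sketch: remove instance scale-in protection, then scale the ASG to zero so the
  # container instances can terminate and deregister before the cluster is deleted.
  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      aws autoscaling set-instance-protection \
        --auto-scaling-group-name ${self.name} \
        --no-protected-from-scale-in \
        --instance-ids $(aws autoscaling describe-auto-scaling-groups \
          --auto-scaling-group-names ${self.name} \
          --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)
      aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name ${self.name} \
        --min-size 0 --desired-capacity 0
    EOT
  }
}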
I was wondering if anyone could help with this issue. I'm trying to call an SSM document using Terraform to stop an EC2 instance, but it doesn't seem to work. I keep getting the error:
Automation Step Execution fails when it is changing the state of each instance. Get Exception from StopInstances API of ec2 Service. Exception Message from StopInstances API: [You are not authorized to perform this operation.
Any suggestion here?
As you can see, the right roles are there; I pass them in as a parameter.
provider "aws" {
  profile = "profile"
  region  = "eu-west-1"
}

data "aws_ssm_document" "stop_ec2_doc" {
  name            = "AWS-StopEC2Instance"
  document_format = "JSON"
}

data "aws_iam_policy_document" "assume_role" {
  version = "2012-10-17"

  statement {
    sid     = "EC2AssumeRole"
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      identifiers = ["ec2.amazonaws.com"]
      type        = "Service"
    }
    principals {
      identifiers = ["ssm.amazonaws.com"]
      type        = "Service"
    }
  }
}

data "aws_ami" "latest_amazon_2" {
  most_recent = true
  owners      = ["amazon"]
  name_regex  = "^amzn2-ami-hvm-.*x86_64-gp2"
}

#
resource "aws_iam_role" "iam_assume_role" {
  name               = "iam_assume_role"
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

#
resource "aws_iam_role_policy_attachment" "role_1" {
  role       = aws_iam_role.iam_assume_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# the instance profile
resource "aws_iam_instance_profile" "iam_instance_profile" {
  name = "iam_instance_profile"
  role = aws_iam_role.iam_assume_role.name
}

# amazon ec2 instances
resource "aws_instance" "ec2_instances" {
  count                = 2
  ami                  = data.aws_ami.latest_amazon_2.id
  instance_type        = "t2.micro"
  subnet_id            = "subnet-12345678901"
  iam_instance_profile = aws_iam_instance_profile.iam_instance_profile.name

  root_block_device {
    volume_size           = 8
    volume_type           = "gp2"
    delete_on_termination = true
  }
}

resource "aws_ssm_association" "example" {
  name = data.aws_ssm_document.stop_ec2_doc.name

  parameters = {
    AutomationAssumeRole = "arn:aws:iam::12345678901:role/aws-service-role/ssm.amazonaws.com/AWSServiceRoleForAmazonSSM"
    InstanceId           = aws_instance.ec2_instances[0].id
  }
}
Any suggestion is welcome. I tried to create a simple Terraform example to illustrate what I'm trying to do, and to me it should be straightforward.
I create the role, I create the instance profile, and I create the association passing the proper role and the instance ID.
The AWSServiceRoleForAmazonSSM role does not have permission to stop instances. Instead, you should create a new role for SSM with such permissions. The simplest way is as follows:
resource "aws_iam_role" "ssm_role" {
  name = "ssm_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Sid    = ""
        Principal = {
          Service = "ssm.amazonaws.com"
        }
      },
    ]
  })
}

resource "aws_iam_role_policy_attachment" "ec2-attach" {
  role       = aws_iam_role.ssm_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2FullAccess"
}

resource "aws_ssm_association" "example" {
  name = data.aws_ssm_document.stop_ec2_doc.name

  parameters = {
    AutomationAssumeRole = aws_iam_role.ssm_role.arn
    InstanceId           = aws_instance.ec2_instances[0].id
  }
}
AmazonEC2FullAccess is far more permissive than needed just for stopping instances, but I use it here as a working example.
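If you want to narrow it down, a least-privilege inline policy could replace the AmazonEC2FullAccess attachment. The sketch below assumes the AWS-StopEC2Instance automation mainly needs ec2:StopInstances plus describe calls; adjust the action list if the automation reports missing permissions.

# Sketch: replace the AmazonEC2FullAccess attachment with a least-privilege inline policy.
resource "aws_iam_role_policy" "ssm_stop_ec2" {
  name = "ssm-stop-ec2"
  role = aws_iam_role.ssm_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ec2:StopInstances", "ec2:DescribeInstances", "ec2:DescribeInstanceStatus"]
        Resource = "*"
      },
    ]
  })
}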