Spin an AWS EC2 Spot instance with a validity period using Terraform

I am trying to spin up an AWS EC2 Spot instance with a validity period (for example, the created Spot instance should be accessible for 2 or 3 hours and then be terminated).
I am able to spin up the Spot instance using the code below, but I am unable to set the duration/validity of the created Spot instance.
I am sharing my Terraform code (both main.tf and variable.tf) with which I am trying to spin up the Spot instance.
I tried to set the expiry of the Spot instance using the following two lines in my main.tf file, but it didn't work:
valid_until = "${var.spot_instance_validity}"
terminate_instances_with_expiration = true
For valid_until, I wasn't able to produce the RFC3339 / YYYY-MM-DDTHH:MM:SSZ timestamp calculated as 2 hours from the time I spin up the Spot instance, so I removed the above two lines from my main.tf file.
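For reference, the sketch below is roughly what I was trying to compute; it needs Terraform 0.12+ for timeadd(), so it is untested with my 0.11-style configuration:

locals {
  # An RFC3339 timestamp two hours from "now" (Terraform 0.12+ only).
  # Caveat: timestamp() is re-evaluated on every plan, so the expiry
  # would move on each run; a fixed input variable avoids that.
  spot_valid_until = timeadd(timestamp(), "2h")
}

valid_until = local.spot_valid_until would then bound how long the Spot request stays valid, although, as the answer below points out, it does not by itself schedule a running instance for termination.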
Below is my main.tf file used to spin up the Spot instance:
provider "aws" {
access_key = "${var.access_key}"
secret_key = "${var.secret_key}"
region = "${var.region}"
}
resource "aws_spot_instance_request" "dev-spot" {
ami = "${var.ami_web}"
instance_type = "t3.medium"
subnet_id = "subnet-xxxxxx"
associate_public_ip_address = "true"
key_name = "${var.key_name}"
vpc_security_group_ids = ["sg-xxxxxxx"]
spot_price = "${var.linux_spot_price}"
wait_for_fulfillment = "${var.wait_for_fulfillment}"
spot_type = "${var.spot_type}"
instance_interruption_behaviour = "${var.instance_interruption_behaviour}"
block_duration_minutes = "${var.block_duration_minutes}"
tags = {
Name = "dev-spot"
}
}
Below is the variables file, variable.tf:
variable "access_key" {
default = ""
}
variable "secret_key" {
default = ""
}
variable "region" {
default = "us-west-1"
}
variable "key_name" {
default = "win-key"
}
variable "windows_spot_price" {
type = "string"
default = "0.0309"
}
variable "linux_spot_price" {
type = "string"
default = "0.0125"
}
variable "wait_for_fulfillment" {
default = false
}
variable "spot_type" {
type = "string"
default = "one-time"
}
variable "instance_interruption_behaviour" {
type = "string"
default = "terminate"
}
variable "block_duration_minutes" {
type = "string"
default = "0"
}
variable "ami_web" {
default = "ami-xxxxxxxxxxxx"
}
The created Spot instance should have a validity period of 1 or 2 hours, configurable from the variable.tf file, so that the Spot instance is terminated after 1 or 2 hours (or the Spot instance request is cancelled).
Is there a way I can spin up an AWS EC2 Spot instance with an expiry?

It is not possible to schedule instances for termination.
However, you can use CloudWatch Events and Lambda to create your own instance-termination logic: create a scheduled event in Terraform based on your variable (valid_until) that invokes a Lambda function to terminate the instance.
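A minimal sketch of the Terraform side of that approach, in 0.12+ syntax; the packaged handler terminate_instance.zip, its handler name, and the IAM role are assumptions (the handler would call ec2:TerminateInstances on the passed instance ID):

resource "aws_cloudwatch_event_rule" "expire_spot" {
  name                = "expire-dev-spot"
  schedule_expression = "rate(2 hours)" # fires repeatedly; a one-shot cron() derived from valid_until is the alternative
}

resource "aws_cloudwatch_event_target" "expire_spot" {
  rule = aws_cloudwatch_event_rule.expire_spot.name
  arn  = aws_lambda_function.terminate_spot.arn

  input = jsonencode({
    instance_id = aws_spot_instance_request.dev-spot.spot_instance_id
  })
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatchEvents"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.terminate_spot.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.expire_spot.arn
}

resource "aws_lambda_function" "terminate_spot" {
  function_name = "terminate-spot-instance"         # hypothetical name
  filename      = "terminate_instance.zip"          # assumed package with the termination logic
  handler       = "index.handler"                   # assumed handler
  runtime       = "python3.8"
  role          = aws_iam_role.lambda_terminate.arn # role with ec2:TerminateInstances (not shown)
}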
AWS also has a solution called Instance Scheduler. You can simply attach tags to your spot instances to create start/stop schedules.
However, in that case you should change the instance stop behaviour, which is shutdown by default, to terminate, so that your instances are terminated when stopped. This can be achieved with the instance_initiated_shutdown_behavior argument of aws_instance in Terraform.
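For example, a sketch on a plain aws_instance (resource name and AMI are placeholders):

resource "aws_instance" "scheduled" {
  ami           = "${var.ami_web}" # placeholder
  instance_type = "t3.medium"

  # Per the suggestion above: terminate rather than just stop
  # when the instance shuts down.
  instance_initiated_shutdown_behavior = "terminate"
}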

Related

Google Cloud VM Autoscaling in Terraform - Updating Images

I can create a fully-functioning autoscaling group in GCP Compute via Terraform, but it's not clear to me how to update the ASG to use a new image.
Here's an example of a working ASG:
resource "google_compute_region_autoscaler" "default" {
name = "example-autoscaler"
region = "us-west1"
project = "my-project
target = google_compute_region_instance_group_manager.default.id
autoscaling_policy {
max_replicas = 10
min_replicas = 3
cooldown_period = 60
cpu_utilization {
target = 0.5
}
}
}
resource "google_compute_region_instance_group_manager" "default" {
name = "example-igm"
region = "us-west1"
version {
instance_template = google_compute_instance_template.default.id
name = "primary"
}
target_pools = [google_compute_target_pool.default.id]
base_instance_name = "example"
}
resource "google_compute_target_pool" "default" {
name = "example-pool"
}
resource "google_compute_instance_template" "default" {
name = "example-template"
machine_type = "e2-medium"
can_ip_forward = false
tags = ["my-tag"]
disk {
source_image = data.google_compute_image.default.id
}
network_interface {
subnetwork = "my-subnet"
}
}
data "google_compute_image" "default" {
name = "my-image"
}
My goal is to be able to create a new Image (out of band) and then update my infrastructure to utilize it. It doesn't appear possible to change a google_compute_instance_template while it's in use.
One option I can think of is to create two separate templates, and then adjust the google_compute_region_instance_group_manager to refer to a different google_compute_instance_template which references the new image.
Another possible option is to use the version block inside the instance group manager. You can use this similarly to above to essentially toggle between two versions. You start with one "version" at 100%, and the other at 0%. When you create a new image, you update the version that is at 0% to point to the new image and change its skew to 100% and the other to 0%. I'm not actually sure this works, because you'd still have to update the template of the version that's at 0% and it might actually be in use.
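Roughly, the toggle I have in mind would look like this (untested, as I said; the blue/green template names are just for illustration):

resource "google_compute_region_instance_group_manager" "default" {
  name               = "example-igm"
  region             = "us-west1"
  base_instance_name = "example"

  # "blue" serves all remaining capacity; "green" is pinned to 0
  # until its template is updated and the sizes are swapped.
  version {
    instance_template = google_compute_instance_template.blue.id
    name              = "blue"
  }

  version {
    instance_template = google_compute_instance_template.green.id
    name              = "green"

    target_size {
      fixed = 0
    }
  }
}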
Regardless, both of those methods are incredibly bulky for a large-scale production environment where we have multiple autoscaling groups in multiple regions and update them frequently.
Ideally I'd be able to change a variable that represents the image, terraform apply and that's it. Is there any way to do what I'm describing?
You need to use the lifecycle block with create_before_destroy, and the name attribute of the google_compute_instance_template resource must be omitted or replaced by the name_prefix attribute.
With this setup Terraform generates a unique name for your Instance Template and can then update the Instance Group Manager without conflict before destroying the previous Instance Template.
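Applied to the template from the question, that pattern looks like this (a sketch following the linked documentation):

resource "google_compute_instance_template" "default" {
  # name is omitted; name_prefix lets Terraform generate a fresh
  # unique name for each replacement template.
  name_prefix    = "example-template-"
  machine_type   = "e2-medium"
  can_ip_forward = false
  tags           = ["my-tag"]

  disk {
    source_image = data.google_compute_image.default.id
  }

  network_interface {
    subnetwork = "my-subnet"
  }

  # Create the replacement before destroying the template still in use.
  lifecycle {
    create_before_destroy = true
  }
}

With this in place, changing the image referenced by data.google_compute_image.default creates a new template, points the instance group manager at it, and only then destroys the old one.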
References:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#using-with-instance-group-manager

Terraform wants to replace Google compute engine if its start/stop scheduler is modified

First of all, I am surprised that I have found very few resources on Google that mention this issue with Terraform.
This is an essential feature for optimizing the cost of cloud instances, though, so I'm probably missing something. Thanks for your tips and ideas!
I want to create an instance and manage its start and stop daily, programmatically.
The resource "google_compute_resource_policy" seems to meet my use case. However, when I change the stop or start time, Terraform plans to destroy and recreate the instance... which I absolutely don't want!
The resource "google_compute_resource_policy" is attached to the instance via the argument resource_policies where it is specified: "Modifying this list will cause the instance to recreate."
I don't understand why Terraform handles this simple update so badly. It is true that a scheduler cannot be updated in place, whereas it is perfectly possible to detach it manually from the instance, destroy it, recreate it with the new stop/start schedule, and then attach it to the instance again.
Is there a workaround without going through a null resource to run a gcloud script to do these steps?
I tried adding an "ignore_changes" lifecycle on the "resource_policies" argument of my instance; Terraform no longer wants to destroy my instance, but it gives me the following error:
Error when reading or editing ResourcePolicy: googleapi: Error 400: The resource_policy resource 'projects/my-project-id/regions/europe-west1/resourcePolicies/my-instance-schedule' is already being used by 'projects/my-project-id/zones/europe-west1-b/instances/my-instance', resourceInUseByAnotherResource
Here is my Terraform code:
resource "google_compute_resource_policy" "instance_schedule" {
  name        = "my-instance-schedule"
  region      = var.region
  description = "Start and stop instance"

  instance_schedule_policy {
    vm_start_schedule {
      schedule = var.vm_start_schedule
    }

    vm_stop_schedule {
      schedule = var.vm_stop_schedule
    }

    time_zone = "Europe/Paris"
  }
}

resource "google_compute_instance" "my-instance" {
  // ******** This is my attempted workaround ********
  lifecycle {
    ignore_changes = [resource_policies]
  }

  name                      = "my-instance"
  machine_type              = var.machine_type
  zone                      = "${var.region}-b"
  allow_stopping_for_update = true

  resource_policies = [
    google_compute_resource_policy.instance_schedule.id
  ]

  boot_disk {
    device_name = local.ref_name

    initialize_params {
      image = var.boot_disk_image
      type  = var.disk_type
      size  = var.disk_size
    }
  }

  network_interface {
    network = data.google_compute_network.default.name

    access_config {
      nat_ip = google_compute_address.static.address
    }
  }
}
In case it's useful, here is what terraform apply returns:
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  - destroy
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # google_compute_resource_policy.instance_schedule must be replaced
-/+ resource "google_compute_resource_policy" "instance_schedule" {
      ~ id        = "projects/my-project-id/regions/europe-west1/resourcePolicies/my-instance-schedule" -> (known after apply)
        name      = "my-instance-schedule"
      ~ project   = "my-project-id" -> (known after apply)
      ~ region    = "https://www.googleapis.com/compute/v1/projects/my-project-id/regions/europe-west1" -> "europe-west1"
      ~ self_link = "https://www.googleapis.com/compute/v1/projects/my-project-id/regions/europe-west1/resourcePolicies/my-instance-schedule" -> (known after apply)
        # (1 unchanged attribute hidden)

      ~ instance_schedule_policy {
            # (1 unchanged attribute hidden)

          ~ vm_start_schedule {
              ~ schedule = "0 9 * * *" -> "0 8 * * *" # forces replacement
            }

            # (1 unchanged block hidden)
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Do you want to perform these actions in workspace "prd"?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

google_compute_resource_policy.instance_schedule: Destroying... [id=projects/my-project-id/regions/europe-west1/resourcePolicies/my-instance-schedule]

Error: Error when reading or editing ResourcePolicy: googleapi: Error 400: The resource_policy resource 'projects/my-project-id/regions/europe-west1/resourcePolicies/my-instance-schedule' is already being used by 'projects/my-project-id/zones/europe-west1-b/instances/my-instance', resourceInUseByAnotherResource
NB: I am working with Terraform 0.14.7 and I am using google provider version 3.76.0
An instance in GCP can be powered off without destroying it by using the desired_status argument of the google_compute_instance resource; keep in mind that if you are creating the instance for the first time, this argument needs to be set to "RUNNING". It can be used as follows.
resource "google_compute_instance" "default" {
name = "test"
machine_type = "f1-micro"
zone = "us-west1-a"
desired_status = "RUNNING"
}
You can also modify your main.tf file if you need to stop the VM first and then start it, by creating a dependency in Terraform with depends_on.
As the following example shows, the service account will be created first, and the key will not be assigned until that first resource is done.
resource "google_service_account" "service_account" {
account_id = "terraform-test"
display_name = "Service Account"
}
resource "google_service_account_key" "mykey" {
service_account_id = google_service_account.service_account.id
public_key_type = "TYPE_X509_PEM_FILE"
depends_on = [google_service_account.service_account]
}
If the first resource already exists, Terraform only deploys the dependent one.
I faced the same problem with a snapshot policy.
I controlled creation of the resource policy using a boolean flag input variable together with count. Initially I created the policy resource with the flag set to 'true'. When I want to change the schedule time, I set the flag to 'false' and apply the plan, which detaches and destroys the policy.
I then set the flag back to 'true' and apply the plan with the new time.
This worked for me for the snapshot policy; I hope it solves yours too.
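A sketch of that flag pattern applied to the schedule policy from the question (the variable name is illustrative, and the instance's other arguments are as in the question):

variable "enable_schedule_policy" {
  type    = bool
  default = true
}

resource "google_compute_resource_policy" "instance_schedule" {
  # Setting the flag to false detaches and destroys the policy;
  # flipping it back to true recreates it with the new schedule.
  count  = var.enable_schedule_policy ? 1 : 0
  name   = "my-instance-schedule"
  region = var.region

  instance_schedule_policy {
    vm_start_schedule {
      schedule = var.vm_start_schedule
    }

    vm_stop_schedule {
      schedule = var.vm_stop_schedule
    }

    time_zone = "Europe/Paris"
  }
}

resource "google_compute_instance" "my-instance" {
  # ...other arguments as in the question...

  resource_policies = var.enable_schedule_policy ? [google_compute_resource_policy.instance_schedule[0].id] : []
}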
I solved the "resourceInUseByAnotherResource" error by adding the following lifecycle to the google_compute_resource_policy resource:
lifecycle {
  create_before_destroy = true
}
Also, this requires the policy to have a unique name on each change; otherwise the new resource can't be created, because a resource with the same name already exists. So I appended a random ID to the end of the schedule name:
resource "random_pet" "schedule" {
keepers = {
start_schedule = "${var.vm_start_schedule}"
stop_schedule = "${var.vm_stop_schedule}"
}
}
...
resource "google_compute_resource_policy" "schedule" {
name = "schedule-${random_pet.schedule.id}"
...
lifecycle {
create_before_destroy = true
}
}

Dynamically add resources in Terraform

I set up a Jenkins pipeline that launches Terraform to create a new EC2 instance in our VPC and register it in our private hosted zone on Route 53 (which is created at the same time) on every run.
I also managed to save the state in S3, so the run doesn't fail when the hosted zone is re-created.
The main issue I have is that on every run Terraform keeps replacing the previous instance with the new one instead of adding it to the pool of instances.
How can I avoid this?
Here's a snippet of my code:
terraform {
  backend "s3" {
    bucket = "<redacted>"
    key    = "<redacted>/terraform.tfstate"
    region = "eu-west-1"
  }
}

provider "aws" {
  region = "${var.region}"
}

data "aws_ami" "image" {
  # limit search criteria for performance
  most_recent = "${var.ami_filter_most_recent}"
  name_regex  = "${var.ami_filter_name_regex}"
  owners      = ["${var.ami_filter_name_owners}"]

  # filter on tag purpose
  filter {
    name   = "tag:purpose"
    values = ["${var.ami_filter_purpose}"]
  }

  # filter on tag os
  filter {
    name   = "tag:os"
    values = ["${var.ami_filter_os}"]
  }
}

resource "aws_instance" "server" {
  # use extracted ami from image data source
  ami                    = data.aws_ami.image.id
  availability_zone      = data.aws_subnet.most_available.availability_zone
  subnet_id              = data.aws_subnet.most_available.id
  instance_type          = "${var.instance_type}"
  vpc_security_group_ids = ["${var.security_group}"]
  user_data              = "${var.user_data}"
  iam_instance_profile   = "${var.iam_instance_profile}"

  root_block_device {
    volume_size = "${var.root_disk_size}"
  }

  ebs_block_device {
    device_name = "${var.extra_disk_device_name}"
    volume_size = "${var.extra_disk_size}"
  }

  tags = {
    Name = "${local.available_name}"
  }
}

resource "aws_route53_zone" "private" {
  name = var.hosted_zone_name

  vpc {
    vpc_id = var.vpc_id
  }
}

resource "aws_route53_record" "record" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "${local.available_name}.${var.hosted_zone_name}"
  type    = "A"
  ttl     = "300"
  records = [aws_instance.server.private_ip]

  depends_on = [
    aws_route53_zone.private
  ]
}
The outcome is that my previously created instance is destroyed and a new one is created; what I want is to keep adding instances with this code.
Thank you.
Your code creates only one instance, aws_instance.server, and any change to its properties will modify that one instance only, because your backend is in S3 and therefore acts as a single global state shared by every pipeline run. The same goes for aws_route53_record.record and everything else in your script.
If you want different pipeline runs to reuse the same script, you should either use different workspaces or create a different Terraform state for each pipeline. The other alternative is to redefine your script to take a map of instances as an input variable and use for_each to create the different instances, as sketched below.
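A sketch of the for_each variant (the variable shape is illustrative; the other instance arguments stay as in your script):

variable "servers" {
  type = map(object({
    instance_type = string
  }))

  default = {
    "web-1" = { instance_type = "t3.medium" }
    "web-2" = { instance_type = "t3.medium" }
  }
}

resource "aws_instance" "server" {
  for_each = var.servers

  ami           = data.aws_ami.image.id
  instance_type = each.value.instance_type
  subnet_id     = data.aws_subnet.most_available.id

  tags = {
    Name = each.key
  }
}

resource "aws_route53_record" "record" {
  # One record per instance, keyed the same way.
  for_each = aws_instance.server

  zone_id = aws_route53_zone.private.zone_id
  name    = "${each.key}.${var.hosted_zone_name}"
  type    = "A"
  ttl     = "300"
  records = [each.value.private_ip]
}

Adding a new entry to var.servers then adds an instance and its record without touching the existing ones.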
If those instances should be identical, you should manage their count using an aws_autoscaling_group and its desired capacity; a sketch follows.
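A sketch of that route (the launch template would carry the AMI and other settings from your aws_instance):

resource "aws_launch_template" "server" {
  name_prefix   = "server-"
  image_id      = data.aws_ami.image.id
  instance_type = var.instance_type
}

resource "aws_autoscaling_group" "server" {
  name_prefix         = "server-"
  desired_capacity    = 3 # grow the pool by raising this
  min_size            = 1
  max_size            = 10
  vpc_zone_identifier = [data.aws_subnet.most_available.id]

  launch_template {
    id      = aws_launch_template.server.id
    version = "$Latest"
  }
}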

GCP Cloud SQL failed to delete instance because `deletion_protection` is set to true

I have a tf script for provisioning a Cloud SQL instance, along with a couple of databases and an admin user. I have renamed the instance, hence a new instance was created, but Terraform is encountering issues when it comes to deleting the old one.
Error: Error, failed to delete instance because deletion_protection is set to true. Set it to false to proceed with instance deletion
I have tried setting the deletion_protection to false but I keep getting the same error. Is there a way to check which resources need to have the deletion_protection set to false in order to be deleted?
I have only added it to the google_sql_database_instance resource.
My tf script:
// Provision the Cloud SQL Instance
resource "google_sql_database_instance" "instance-master" {
  name             = "instance-db-${random_id.random_suffix_id.hex}"
  region           = var.region
  database_version = "POSTGRES_12"
  project          = var.project_id

  settings {
    availability_type = "REGIONAL"
    tier              = "db-f1-micro"
    activation_policy = "ALWAYS"
    disk_type         = "PD_SSD"

    ip_configuration {
      ipv4_enabled    = var.is_public ? true : false
      private_network = var.network_self_link
      require_ssl     = true

      dynamic "authorized_networks" {
        for_each = toset(var.is_public ? [1] : [])
        content {
          name  = "Public Internet"
          value = "0.0.0.0/0"
        }
      }
    }

    backup_configuration {
      enabled = true
    }

    maintenance_window {
      day          = 2
      hour         = 4
      update_track = "stable"
    }

    dynamic "database_flags" {
      iterator = flag
      for_each = var.database_flags
      content {
        name  = flag.key
        value = flag.value
      }
    }

    user_labels = var.default_labels
  }

  deletion_protection = false

  depends_on = [google_service_networking_connection.cloudsql-peering-connection, google_project_service.enable-sqladmin-api]
}

// Provision the databases
resource "google_sql_database" "db" {
  name     = "orders-placement"
  instance = google_sql_database_instance.instance-master.name
  project  = var.project_id
}

// Provision a super user
resource "google_sql_user" "admin-user" {
  name     = "admin-user"
  instance = google_sql_database_instance.instance-master.name
  password = random_password.user-password.result
  project  = var.project_id
}

// Get latest CA certificate
locals {
  furthest_expiration_time = reverse(sort([for k, v in google_sql_database_instance.instance-master.server_ca_cert : v.expiration_time]))[0]
  latest_ca_cert           = [for v in google_sql_database_instance.instance-master.server_ca_cert : v.cert if v.expiration_time == local.furthest_expiration_time]
}

// Get SSL certificate
resource "google_sql_ssl_cert" "client_cert" {
  common_name = "instance-master-client"
  instance    = google_sql_database_instance.instance-master.name
}
It seems like your code is going to recreate this SQL instance, but your current tfstate file still contains the instance with a value of true for the deletion_protection parameter. In this case you first need to change the value of this parameter to false, either manually in the tfstate file or by setting deletion_protection = false in the code and running terraform apply afterwards (beware: your code shouldn't be recreating the instance at that point). After these manipulations, you can do anything with your SQL instance.
You will have to set deletion_protection=false, apply it and then proceed to delete.
As per the documentation
On newer versions of the provider, you must explicitly set deletion_protection=false (and run terraform apply to write the field to state) in order to destroy an instance. It is recommended to not set this field (or set it to true) until you're ready to destroy the instance and its databases.
Editing Terraform state files directly / manually is not recommended
If you added deletion_protection to the google_sql_database_instance after the database instance was created, you need to run terraform apply before running terraform destroy so that deletion_protection is set to false on the database instance.
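One way to make that two-step flow explicit is to drive the flag from a variable (a sketch; the variable name is illustrative):

variable "db_deletion_protection" {
  type    = bool
  default = true
}

resource "google_sql_database_instance" "instance-master" {
  # ...other arguments as in the question...

  deletion_protection = var.db_deletion_protection
}

Then terraform apply -var="db_deletion_protection=false" writes the field to state, and a subsequent terraform destroy can proceed.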

Creating RDS instances from a non-recent snapshot using Terraform

In a Terraform project I am creating an RDS instance from a snapshot that is not the most recent one (the fifth before the last); my script is here:
data "aws_db_snapshot" "db_snapshot" {
db_instance_identifier = "production-db-intern"
db_snapshot_arn = "arn:aws:rds:eu-central-1:123114111478:snapshot:rds:production-db-intern-2019-05-09-16-10"
}
resource "aws_db_instance" "db_intern" {
skip_final_snapshot = true
identifier = "db-intern"
auto_minor_version_upgrade = false
instance_class = "db.m4.4xlarge"
deletion_protection = false
vpc_security_group_ids = ["${var.security_group_id}"]
db_subnet_group_name = "${var.subnet_group_name}"
timeouts {
create = "3h"
delete = "2h"
}
lifecycle {
prevent_destroy = false
}
snapshot_identifier = "${data.aws_db_snapshot.db_snapshot.id}"
}
I did a "terraform plan" and
I got the next error:
Error: data.aws_db_snapshot.db_snapshot: "db_snapshot_arn": this field cannot be set
db_snapshot_arn is not a valid argument of the aws_db_snapshot data source. Did you mean db_snapshot_identifier?
Also, you can't pass the ARN to this data source; you pass the snapshot identifier instead, e.g. rds:production-db-intern-2019-05-09-16-10 for an automated snapshot.
Besides that, the data source expects either db_instance_identifier or db_snapshot_identifier (or both) to be set. See the documentation for the describe-db-snapshots CLI command for more details on the specifics; the data source maps onto the same underlying API.
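A sketch of the corrected data source, with the snapshot name taken from the ARN in your question (automated snapshots are prefixed rds:):

data "aws_db_snapshot" "db_snapshot" {
  db_instance_identifier = "production-db-intern"
  db_snapshot_identifier = "rds:production-db-intern-2019-05-09-16-10"
  snapshot_type          = "automated"
}

The aws_db_instance can then keep using snapshot_identifier = "${data.aws_db_snapshot.db_snapshot.id}" exactly as in your script.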