Terraform RDS: Restore data after destroy due to modification? - amazon-web-services

I am trying to create an RDS Aurora MySQL cluster in AWS using Terraform. However, I notice that any time I alter the cluster in a way that requires it to be replaced, all data is lost. I have configured it to take a final snapshot and would like to restore from that snapshot, or restore the original data through an alternative measure.
Example: Change Cluster -> TF Destroys the original cluster -> TF Replaces with new cluster -> Restore Data from original
I have attempted to use the same snapshot identifier for both aws_rds_cluster.snapshot_identifier and aws_rds_cluster.final_snapshot_identifier, but Terraform bombs because the final snapshot of the destroyed cluster doesn't yet exist.
I've also attempted to use the rds-finalsnapshot module, but it turns out it is primarily used for spinning environments up and down, preserving the data. i.e. Destroying an entire cluster, then recreating it as part of a separate deployment. (Module: https://registry.terraform.io/modules/connect-group/rds-finalsnapshot/aws/latest)
module "snapshot_maintenance" {
  source                        = "connect-group/rds-finalsnapshot/aws//modules/rds_snapshot_maintenance"
  identifier                    = local.cluster_identifier
  is_cluster                    = true
  database_endpoint             = element(aws_rds_cluster_instance.cluster_instance.*.endpoint, 0)
  number_of_snapshots_to_retain = 3
}
resource "aws_rds_cluster" "provisioned_cluster" {
  cluster_identifier                  = module.snapshot_maintenance.identifier
  engine                              = "aurora-mysql"
  engine_version                      = "5.7.mysql_aurora.2.10.0"
  port                                = 1234
  database_name                       = "example"
  master_username                     = "example"
  master_password                     = "example"
  iam_database_authentication_enabled = true
  storage_encrypted                   = true
  backup_retention_period             = 2
  db_subnet_group_name                = "example"
  skip_final_snapshot                 = false
  final_snapshot_identifier           = module.snapshot_maintenance.final_snapshot_identifier
  snapshot_identifier                 = module.snapshot_maintenance.snapshot_to_restore
  vpc_security_group_ids              = ["example"]
}
What I find is if a change requires destroy and recreation, I don't have a great way to restore the data as part of the same deployment.
I'll add that I don't think this is an issue with my code. It's more of a lifecycle limitation of TF. I believe I can't be the only person who wants to preserve the data in their cluster in the event TF determines the cluster must be recreated.
If I wanted to prevent loss of data due to a change to the cluster that results in a destroy, do I need to destroy the cluster outside of terraform or through the cli, sync up Terraform's state and then apply?

The solution ended up being rather simple, albeit obscure. I tried over 50 different approaches using combinations of existing resource properties, provisioners, null resources (with triggers) and external data blocks with AWS CLI commands and PowerShell scripts.
The challenge here was that I needed to ensure the provisioning happened in this order to ensure no data loss:
1. Stop DMS replication tasks from replicating more data into the database.
2. Take a new snapshot of the cluster, once incoming data has been stopped.
3. Destroy and recreate the cluster, using the snapshot_identifier to specify the snapshot taken in the previous step.
4. Destroy and recreate the DMS tasks.
Of course these steps were based on how Terraform decided it needed to apply updates. It may determine it only needed to perform an in-place update; this wasn't my concern. I needed to handle scenarios where the resources were destroyed.
The final solution was to eliminate the use of external data blocks and go exclusively with local-exec provisioners, because external data blocks execute even when only running terraform plan. I used the provisioners to tap into lifecycle events like "create" and "destroy" to ensure my PowerShell scripts would only execute during terraform apply.
On my cluster, I set both final_snapshot_identifier and snapshot_identifier to the same value.
final_snapshot_identifier = local.snapshot_identifier
snapshot_identifier = data.external.check_for_first_run.result.isFirstRun == "true" ? null : local.snapshot_identifier
snapshot_identifier is only set after the first deployment. An external data block lets me check whether the resource already exists, which drives the condition. The condition is necessary because on a first deployment the snapshot won't exist, and Terraform would fail during the "planning" step.
Then I execute a PowerShell script in a local-exec provisioner on "destroy" to stop any DMS tasks and then delete the snapshot named by local.snapshot_identifier.
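The check_for_first_run data source referenced above is not shown in the answer. A minimal sketch of what it could look like follows, assuming a hypothetical script check_for_first_run.ps1 that reads the JSON query from stdin, looks for the snapshot, and prints a JSON object such as {"isFirstRun":"true"} to stdout. The script name, its path, and the snapshot name are assumptions for illustration only.

```hcl
locals {
  # Hypothetical snapshot name; the answer uses its own value here.
  snapshot_identifier = "example-cluster-final"
}

# The external data source runs a program and exposes its JSON output
# (an object whose values are all strings) as the "result" map.
data "external" "check_for_first_run" {
  # Hypothetical script: prints {"isFirstRun":"true"} when the snapshot
  # does not exist yet, and {"isFirstRun":"false"} otherwise.
  program = ["PowerShell", "-File", "${path.module}/powershell_scripts/check_for_first_run.ps1"]

  # Passed to the program as a JSON object on stdin.
  query = {
    snapshot_identifier = local.snapshot_identifier
  }
}
```

Because data sources are read during terraform plan as well as apply, a script like this should only read state and never change anything.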
provisioner "local-exec" {
  when = destroy
  # First, stop the inflow of data to the cluster by stopping the DMS tasks.
  # Next, we've tricked TF into thinking the snapshot we want to use is there by using the same name for old and new snapshots, but before we destroy the cluster, we need to delete the original.
  # Then TF will create the final snapshot immediately following the execution of the below script and it will be used to restore the cluster since we've set it as snapshot_identifier.
  command     = "/powershell_scripts/stop_dms_tasks.ps1; aws rds delete-db-cluster-snapshot --db-cluster-snapshot-identifier benefitsystem-cluster"
  interpreter = ["PowerShell"]
}
This clears out the last snapshot and allows Terraform to create a new final snapshot by the same name as the original, just in time to be used to restore from.
Now, I can run Terraform the first time and get a brand-new cluster. All subsequent deployments will use the final snapshot to restore from and data is preserved.

Related

How to Promote Cloud SQL replica to primary using terraform so promoted instance should be in TF control

I am creating a GCP Cloud SQL instance with a cross-region read replica using Terraform. To test a DR scenario, I promote the read replica to primary using the gcloud CLI, because there is no setting or resource in Terraform to promote a replica. Since the promotion happens through gcloud, the promoted instance and the state file are no longer in sync, so the promoted instance is no longer under Terraform control.
Cross-region replica setups become out of sync with the primary right after the promotion is complete. Promoting a replica is done manually and intentionally. It is not the same as high availability, where a standby instance (which is not a replica) automatically becomes the primary in case of a failure or zonal outage. You can promote the read replica manually using gcloud or the Google API. Either route leaves the instance out of sync with Terraform. So what you are looking for does not seem to be available when promoting a replica in Cloud SQL.
As a workaround, I would suggest promoting the replica to primary outside of Terraform and then importing the resource back into state, which brings the state file back in sync.
Promoting an instance to primary is not supported by Terraform's Google Cloud Provider, but there is an issue (which you should upvote if you care) to add support for this to the provider.
Here's how to work around the lack of support in the meantime. Assume you have the following minimal setup: an instance, a database, a user, and a read replica:
resource "google_sql_database_instance" "instance1" {
  name             = "old-primary"
  region           = "us-central1"
  database_version = "POSTGRES_14"
}
resource "google_sql_database" "db" {
  name     = "test-db"
  instance = google_sql_database_instance.instance1.name
}
resource "google_sql_user" "user" {
  name     = "test-user"
  instance = google_sql_database_instance.instance1.name
  password = var.db_password
}
resource "google_sql_database_instance" "instance2" {
  name                 = "new-primary"
  master_instance_name = google_sql_database_instance.instance1.name
  region               = "europe-west4"
  database_version     = "POSTGRES_14"
  replica_configuration {
    failover_target = false
  }
}
Steps to follow:
You promote the replica out of band, either using the Console or the gcloud CLI.
Next you manually edit the state file:
# remove the old read-replica state; it's now the new primary
terraform state rm google_sql_database_instance.instance2
# import the new-primary as "instance1"
terraform state rm google_sql_database_instance.instance1
terraform import google_sql_database_instance.instance1 your-project-id/new-primary
# import the new-primary db as "db"
terraform state rm google_sql_database.db
terraform import google_sql_database.db your-project-id/new-primary/test-db
# import the new-primary user as "user"
terraform state rm google_sql_user.user
terraform import google_sql_user.user your-project-id/new-primary/test-user
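On Terraform 1.5 or later, the three terraform import commands above could instead be written as import blocks, so the imports show up in terraform plan before anything is written to state. This is a sketch under that version assumption, not part of the original answer. It reuses the same addresses and IDs as the commands above and assumes the terraform state rm commands have already been run, so the addresses are free, and that the configuration already describes the promoted instance.

```hcl
# Declarative equivalents of the terraform import commands above.
# Requires Terraform 1.5+ (assumption about the reader's version).
import {
  to = google_sql_database_instance.instance1
  id = "your-project-id/new-primary"
}

import {
  to = google_sql_database.db
  id = "your-project-id/new-primary/test-db"
}

import {
  to = google_sql_user.user
  id = "your-project-id/new-primary/test-user"
}
```

Once an apply has completed the imports, the import blocks can be removed from the configuration.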
Now you edit your terraform config to update the resources to match the state:
resource "google_sql_database_instance" "instance1" {
  name             = "new-primary"  # this is the former replica's name
  region           = "europe-west4" # this is the former replica's region
  database_version = "POSTGRES_14"
}
resource "google_sql_database" "db" {
  name     = "test-db"
  instance = google_sql_database_instance.instance1.name
}
resource "google_sql_user" "user" {
  name     = "test-user"
  instance = google_sql_database_instance.instance1.name
  password = var.db_password
}
# this has now been promoted and is now "instance1" so the following
# block can be deleted.
# resource "google_sql_database_instance" "instance2" {
#   name                 = "new-primary"
#   master_instance_name = google_sql_database_instance.instance1.name
#   region               = "europe-west4"
#   database_version     = "POSTGRES_14"
#
#   replica_configuration {
#     failover_target = false
#   }
# }
Then you run terraform apply and see that only the user is updated in-place with the existing password. (This is done because Terraform can't get the password from the API and it was removed as part of the promotion and so has to be re-applied for Terraform's sake.)
What you do with your old primary is up to you. It's no longer managed by terraform. So either delete it manually, or re-import it.
Caveats
Everyone's Terraform setup is different and so you'll probably have to iterate through the steps above until you reach the desired result.
Remember to use a testing environment first with lots of calls to terraform plan to see what's changing. Whenever a resource is marked for deletion, Terraform will report why.
Nonetheless, you can use the process above to work your way to a terraform setup that reflects a promoted read replica. And in the meantime, upvote the issue because if it gets enough attention, the Terraform team will prioritize it accordingly.

Terraform - Encrypting a db instance forces replacement

I have a postgres RDS instance in AWS that I created using terraform.
resource "aws_db_instance" "..." {
...
}
Now I'm trying to encrypt that instance by adding
resource "aws_db_instance" "..." {
...
storage_encrypted = true
}
But when I run terraform plan, it says that it's going to force replacement
# aws_db_instance.... must be replaced
...
~ storage_encrypted = false -> true # forces replacement
What can I do to prevent terraform from replacing my db instance?
Terraform is not at fault here. You simply cannot change the encryption setting on an RDS instance after it has been created. You need to create a snapshot of the current DB, copy and encrypt the snapshot, and then restore from that encrypted copy: https://aws.amazon.com/premiumsupport/knowledge-center/update-encryption-key-rds/
This will cause downtime for the DB. Terraform does not do this for you automatically; you need to do it manually. After the DB is restored, Terraform should no longer try to replace it, since the expected config now matches the actual config.
Technically, you can add the storage_encrypted property to ignore_changes, but of course that causes Terraform to simply ignore any storage encryption changes.
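As a rough illustration of the ignore_changes option, the lifecycle block would look something like this. The resource name "example" is a placeholder, since the question elides the real one, and the other arguments are left out.

```hcl
resource "aws_db_instance" "example" {
  # ... existing arguments stay as they are ...

  storage_encrypted = true

  lifecycle {
    # Terraform will no longer plan a replacement when the configured
    # value differs from the real instance. It also will not notice
    # any later drift in this attribute.
    ignore_changes = [storage_encrypted]
  }
}
```

This only silences the diff. The instance itself stays unencrypted until it is restored from an encrypted snapshot as described above.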

How to handle terraform process crash and avoid the resource leak on retry?

I have a microservice deployed in a Docker container that manages and executes Terraform commands to create infrastructure on AWS. The supported Terraform template is as follows:
provider "aws" {
  profile = "default"
  region  = "us-east-1"
}
resource "aws_default_vpc" "default" {
  tags = {
    Name = "Default VPC"
  }
}
resource "aws_security_group" "se_security_group" {
  name        = "test-sg"
  description = "secure soft edge ports"
  vpc_id      = aws_default_vpc.default.id
  tags = {
    Name = "test-sg"
  }
}
resource "aws_instance" "web" {
  ami           = "ami-*********"
  instance_type = "t3.micro"
  tags = {
    Name = "test"
  }
  depends_on = [
    aws_security_group.se_security_group,
  ]
}
With this system in place, if the Docker container crashes while the Terraform process is executing (creating an EC2 instance), the state file will not have an entry for the EC2 resource being created. On container restart, if the Terraform process is restarted against the same state file, it ends up creating a whole new EC2 instance, resulting in a resource leak.
How is the crash scenario in terraform commonly handled?
Is there a way to rollback the previous transaction without the state file having the EC2 entry?
Please help me with this issue. Thanks
How is the crash scenario in terraform commonly handled?
It depends on when the crash happened. Some plausible scenarios are:
Most likely, your state file will remain locked, as long as your backend supports locking. In this case nothing will be created after the restart, because Terraform won't be able to acquire a lock on the state file and will throw an error. We will have to force-unlock the state.
We managed to unlock the state file, or the state file was not locked at all. In this case we can have the following scenarios:
The state file has an entry with an identifier for the resource, even if the crash happened while the resource was provisioning. In this case Terraform will refresh the state and show in the plan whether there are any changes to be made. Nevertheless, we should read the plan and decide whether we want to apply it or make some manual adjustments first.
Terraform won't be able to identify a resource that already exists, so it will try to provision it. Again, we should read the state file and decide for ourselves what to do. We can either import the already existing resource or terminate it and let Terraform attempt to create it again.
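For the last scenario, where an instance exists in AWS but not in state, the import route might look like the following on Terraform 1.5 or later. The instance ID is a placeholder, not a value from the question; the real one would have to be looked up in the console or with the AWS CLI.

```hcl
# Adopt the EC2 instance that was created just before the crash,
# instead of letting Terraform create a second one.
# "i-0123456789abcdef0" is a placeholder ID (assumption).
import {
  to = aws_instance.web
  id = "i-0123456789abcdef0"
}
```

With this in place, terraform plan should report the instance as an import rather than a create. On older Terraform versions, the terraform import command with the same address and ID serves the same purpose.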
Is there a way to rollback the previous transaction without the state file having the EC2 entry?
No, there is no way to roll back to the previous transaction. Terraform will attempt to provision whatever is in the .tf files. What we could do is check out a previous version of our code from source control and apply that.

RDS storage autoscaling support using terraform for live databases

AWS has recently launched support for storage autoscaling of RDS instances. We have multiple RDS instances with over provisioned storage in our production environment. We want to utilise this new feature to reduce some costs. Since we cannot reduce the storage of a live RDS instance, we will have to first create a RDS instance with less storage with autoscaling support and then migrate the existing data to new instance and then delete the old instance.
We use terraform with the terraform-aws-provider to create our infrastructure. Problem is that I am not able to achieve the above strategy using terraform.
Here is what I have tried:
Modify the existing RDS creation script to create two more resources. One is of type aws_db_snapshot and the other is aws_db_instance (using the snapshot).
However, I get the following error:
error modifying DB Instance (test-rds-snapshot): InvalidParameterCombination: Invalid storage size for engine name postgres and storage type gp2: 20.
# Existing RDS instance with over provisioned storage
resource "aws_db_instance" "test_rds" {
  .
  .
  .
}
# My changes below
# The snapshot
resource "aws_db_snapshot" "test_snapshot" {
  db_instance_identifier = "${aws_db_instance.test_rds.id}"
  db_snapshot_identifier = "poc-snapshot"
}
# New instance with autoscale support and reduced storage
resource "aws_db_instance" "test_rds_snapshot" {
  identifier            = "test-rds-snapshot"
  allocated_storage     = 20
  max_allocated_storage = 50
  snapshot_identifier   = "${aws_db_snapshot.test_snapshot.id}"
  .
  .
  .
}
I want to know if I am on the right track or not and will I be able to migrate production databases using this strategy. Let me know if you need more information.

How to import changes to a EBS volume after sizing it up back to Terraform?

After running out of space I had to resize my EBS volume. Now I want to make the size part of my Terraform configuration, so I added the following block to the aws_instance resource:
ebs_block_device {
  device_name = "/dev/sda1"
  volume_size = 32
  volume_type = "gp2"
}
Now after running terraform plan it wanted to destroy the existing volume, which is terrible. I also tried to import the existing one using terraform import but it wanted me to use a different name for the resource which is also not great.
So what is the correct procedure here?
The aws_instance resource docs mention that changes to any EBS block devices will cause the instance to be recreated.
To get around this you can use something other than Terraform to grow the EBS volumes using AWS' new elastic volumes feature. Terraform also cannot detect changes to any of the attached block devices created in the aws_instance resource:
NOTE: Currently, changes to *_block_device configuration of existing resources cannot be automatically detected by Terraform. After making updates to block device configuration, resource recreation can be manually triggered by using the taint command.
As such, you shouldn't need to go back and change anything in your Terraform configuration unless you want to rebuild the instance with Terraform at some point, in which case the worry about losing the instance is obviously moot.
However, if for some reason you want to be able to make the change to your Terraform configuration and keep the instance from being destroyed then you would need to manipulate your state file.