Terraform issues with waiting for GCP infrastructure changes to propagate

I am using Terraform to deploy a project into a folder that has a GCP organization policy applied to it which prevents service accounts from being created in that folder or in the projects under it. I have code that sets that org policy's enforcement to false as the project is being deployed. Additionally, I have some service accounts in the same main.tf that depend on the org policy being disabled.
I have attempted to use depends_on statements on the service account modules so that they wait for the org policy to be disabled before the service accounts are provisioned. I have also used a time_sleep resource block to give the project factory and org policy time to provision and take effect before the service accounts are created. I can occasionally get the whole deployment to work, whereas other times the apply step fails because of the organization policy.
If I check the project in GCP it shows that the org policy has been disabled, which is what should happen. If I re-run the apply step in Terraform, everything that was left over provisions successfully. Is there a better way to approach this issue? The fact that the provisioning sometimes works in one apply versus two is odd and makes me believe there is some sort of state caching going on, but that is just a guess based on what I have seen.
The code is as follows:
source = "terraform-google-modules/project-factory/google"
version = "~> 10.1"
name = var.project_name
random_project_id = var.random_project_id
org_id = var.org_id
folder_id = var.folder_id
billing_account = var.billing_account_id
create_project_sa = false
default_service_account = var.default_service_account
disable_dependent_services = var.disable_dependent_services
disable_services_on_destroy = var.disable_services_on_destroy
labels = var.project_labels
}
module "remove_disable_sa_creation" {
source = "terraform-google-modules/org-policy/google"
version = "~> 3.0.2"
constraint = "constraints/iam.disableServiceAccountCreation"
policy_type = "boolean"
policy_for = "project"
project_id = module.project-factory.project_id
enforce = false
depends_on = [module.project-factory.project_id]
}
resource "time_sleep" "wait_60_seconds" {
depends_on = [module.remove_disable_sa_creation]
create_duration = "60s"
}
module "globus_service_account" {
source = "../../../modules/service_account"
project_id = module.project-factory.project_id
prefix = var.globus_sa_prefix
names = var.globus_sa_names
project_roles = var.globus_sa_project_roles
grant_billing_role = var.globus_grant_billing_role
billing_account_id = var.billing_account_id
grant_xpn_roles = var.globus_grant_xpn_roles
org_id = var.org_id
generate_keys = var.globus_generate_keys
display_name = var.globus_sa_display_name
description = var.globus_sa_description
depends_on = [time_sleep.wait_60_seconds]
}

Changing the sleep timer to 120 seconds was the main factor that helped solve this. What I did was create the project factory, have the organization policy depend on the project factory, have the timer depend on the organization policy, and then have all other modules wait for the timer to finish.
Essentially the flow was project > organization policy > timer for 120s > all other modules provisioning after 120 seconds.
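For reference, a minimal sketch of that ordering (the wait_120_seconds resource name is illustrative; only the duration and the depends_on chain change from the code above):

# Project factory first, then the org policy (implicit dependency via project_id),
# then the timer, then everything else.
resource "time_sleep" "wait_120_seconds" {
  depends_on = [module.remove_disable_sa_creation]

  create_duration = "120s"
}

module "globus_service_account" {
  source = "../../../modules/service_account"

  # ...same arguments as in the question...

  depends_on = [time_sleep.wait_120_seconds]
}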

Related

How can I configure Terraform to update a GCP compute engine instance template without destroying and re-creating?

I have a service deployed on GCP compute engine. It consists of a compute engine instance template, instance group, instance group manager, and load balancer + associated forwarding rules etc.
We're forced into using compute engine rather than Cloud Run or some other serverless offering due to the need for docker-in-docker for the service in question.
The deployment is managed by terraform. I have a config that looks something like this:
data "google_compute_image" "debian_image" {
family = "debian-11"
project = "debian-cloud"
}
resource "google_compute_instance_template" "my_service_template" {
name = "my_service"
machine_type = "n1-standard-1"
disk {
source_image = data.google_compute_image.debian_image.self_link
auto_delete = true
boot = true
}
...
metadata_startup_script = data.local_file.startup_script.content
metadata = {
MY_ENV_VAR = var.whatever
}
}
resource "google_compute_region_instance_group_manager" "my_service_mig" {
version {
instance_template = google_compute_instance_template.my_service_template.id
name = "primary"
}
...
}
resource "google_compute_region_backend_service" "my_service_backend" {
...
backend {
group = google_compute_region_instance_group_manager.my_service_mig.instance_group
}
}
resource "google_compute_forwarding_rule" "my_service_frontend" {
depends_on = [
google_compute_region_instance_group_manager.my_service_mig,
]
name = "my_service_ilb"
backend_service = google_compute_region_backend_service.my_service_backend.id
...
}
I'm running into issues where Terraform is unable to perform any kind of update to this service without running into conflicts. It seems that instance templates are immutable in GCP, and doing anything like updating the startup script, adding an env var, or similar forces it to be deleted and re-created.
Terraform prints info like this in that situation:
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # module.connectors_compute_engine.google_compute_instance_template.airbyte_translation_instance1 must be replaced
-/+ resource "google_compute_instance_template" "my_service_template" {
      ~ id       = "projects/project/..." -> (known after apply)
      ~ metadata = { # forces replacement
          + "TEST" = "test"
            # (1 unchanged element hidden)
        }
The only solution I have found for getting out of this situation is to delete the entire service and all associated entities, from the load balancer down to the instance template, and re-create them.
Is there some way to avoid this situation so that I can change the instance template without having to manually update the Terraform config twice? At this point I am even fine with it causing some downtime for the service rather than a full rolling update, since that is effectively what is happening now anyway.
I ran into this issue as well.
However, according to:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#using-with-instance-group-manager
Instance Templates cannot be updated after creation with the Google Cloud Platform API. In order to update an Instance Template, Terraform will destroy the existing resource and create a replacement. In order to effectively use an Instance Template resource with an Instance Group Manager resource, it's recommended to specify create_before_destroy in a lifecycle block. Either omit the Instance Template name attribute, or specify a partial name with name_prefix.
I would also test and plan with this lifecycle meta-argument:
  + lifecycle {
  +   prevent_destroy = true
  + }
}
Or more realistically in your specific case, something like:
resource "google_compute_instance_template" "my_service_template" {
version {
instance_template = google_compute_instance_template.my_service_template.id
name = "primary"
}
+ lifecycle {
+ create_before_destroy = true
+ }
}
So run terraform plan with either create_before_destroy or prevent_destroy = true on google_compute_instance_template before terraform apply to see the results.
Ultimately, you can remove google_compute_instance_template.my_service_template from the state file (terraform state rm) and import it back (terraform import).
Some suggested workarounds in this thread:
terraform lifecycle prevent destroy
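Putting the provider docs' recommendation together, a consolidated sketch (the name_prefix value is illustrative) would look roughly like this:

resource "google_compute_instance_template" "my_service_template" {
  name_prefix  = "my-service-" # instead of a fixed name
  machine_type = "n1-standard-1"

  # ...disks, metadata and startup script as in the question...

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_region_instance_group_manager" "my_service_mig" {
  version {
    # Each template replacement gets a new ID; the MIG's version is updated
    # in place to point at it, and the old template is destroyed afterwards.
    instance_template = google_compute_instance_template.my_service_template.id
    name              = "primary"
  }
  # ...
}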

Terraform update only a cloud function from a bunch

I have a Terraform project that creates multiple cloud functions.
I know that if I change the name of the google_storage_bucket_object related to the function itself, Terraform will see the difference in the zip name and redeploy the cloud function.
My question is: is there a way to obtain the same behaviour, but only for the cloud functions that have actually changed?
resource "google_storage_bucket_object" "zip_file" {
# Append file MD5 to force bucket to be recreated
name = "${local.filename}#${data.archive_file.source.output_md5}"
bucket = var.bucket.name
source = data.archive_file.source.output_path
}
# Create Java Cloud Function
resource "google_cloudfunctions_function" "java_function" {
name = var.function_name
runtime = var.runtime
available_memory_mb = var.memory
source_archive_bucket = var.bucket.name
source_archive_object = google_storage_bucket_object.zip_file.name
timeout = 120
entry_point = var.function_entry_point
event_trigger {
event_type = var.event_trigger.event_type
resource = var.event_trigger.resource
}
environment_variables = {
PROJECT_ID = var.env_project_id
SECRET_MAIL_PASSWORD = var.env_mail_password
}
timeouts {
create = "60m"
}
}
By appending the MD5, every cloud function ends up with a different zip file name, so Terraform re-deploys all of them; without the MD5, Terraform does not see any changes to deploy.
If I have changed code only inside one function, how can I tell Terraform to re-deploy only that function (for example, by changing only its zip file name)?
I hope my question is clear, and I want to thank everyone who tries to help me!
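One way to get that behaviour (a sketch, assuming each function's source lives in its own directory so each archive gets its own hash; the functions variable and directory layout are illustrative):

# One archive per function: only a changed function's source produces a new MD5,
# so only that function's bucket object (and therefore that function) is replaced.
data "archive_file" "source" {
  for_each = var.functions # e.g. { mailer = { dir = "src/mailer", runtime = "java11", entry_point = "Mailer" }, ... }

  type        = "zip"
  source_dir  = each.value.dir
  output_path = "${path.module}/build/${each.key}.zip"
}

resource "google_storage_bucket_object" "zip_file" {
  for_each = data.archive_file.source

  name   = "${each.key}#${each.value.output_md5}"
  bucket = var.bucket.name
  source = each.value.output_path
}

resource "google_cloudfunctions_function" "function" {
  for_each = var.functions

  name                  = each.key
  runtime               = each.value.runtime
  source_archive_bucket = var.bucket.name
  source_archive_object = google_storage_bucket_object.zip_file[each.key].name
  entry_point           = each.value.entry_point
}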

GCP terraform-google-project-factory multiple projects update the service account with new bindings?

I am using the terraform-google-project-factory module to create multiple GCP projects at once. The projects create just fine, and I am using the included option to disable the default GCP compute service account and stand up a new Service Account in each project.
The module has an "sa_role" input where I assign "roles/compute.admin" to the new S.A. However, I would also like to assign some additional IAM roles to that Service Account in the same deployment. The sa_role input seems to only take one string value:
module "project-factory" {
source = "terraform-google-modules/project-factory/google"
version = "12.0.0"
for_each = toset(local.project_names)
random_project_id = true
name = each.key
org_id = local.organization_id
billing_account = local.billing_account
folder_id = google_folder.DQS.id
default_service_account = "disable"
default_network_tier = "PREMIUM"
create_project_sa = true
auto_create_network = false
project_sa_name = local.service_account
sa_role = ["roles/compute.admin"]
activate_apis = ["compute.googleapis.com","storage.googleapis.com","oslogin.googleapis.com",]
}
The output for the Service Account email looks like this:
output "service_account_email" {
value = values(module.project-factory)[*].service_account_email
description = "The email of the default service account"
}
How can I add additional IAM roles to this Service Account in the same main.tf? This Stack Overflow question comes close to what I wish to achieve:
Want to assign multiple Google cloud IAM roles against a service account via terraform
However, I do not know how to reference my Service Account email addresses from outputs.tf to make them available to the members = part of the google_iam_policy data source. My question is: how do I get this to work with the google_iam_policy data source, or is there a better way to do this?
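One possible approach (a sketch, not necessarily the only way): grant the extra roles additively with google_project_iam_member instead of an authoritative google_iam_policy, referencing the per-project module outputs directly. The extra role list below is illustrative.

locals {
  # Hypothetical extra roles to grant to each project's service account
  extra_sa_roles = ["roles/storage.objectViewer", "roles/logging.logWriter"]

  # One entry per (project, role) pair
  sa_role_bindings = {
    for pair in setproduct(keys(module.project-factory), local.extra_sa_roles) :
    "${pair[0]}--${pair[1]}" => {
      project_id = module.project-factory[pair[0]].project_id
      sa_email   = module.project-factory[pair[0]].service_account_email
      role       = pair[1]
    }
  }
}

resource "google_project_iam_member" "project_sa_extra_roles" {
  for_each = local.sa_role_bindings

  project = each.value.project_id
  role    = each.value.role
  member  = "serviceAccount:${each.value.sa_email}"
}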

AWS Glue pipeline with Terraform

We are working with AWS Glue as a pipeline tool for ETL at my company. So far, the pipelines were created manually via the console and I am now moving to Terraform for future pipelines as I believe IaC is the way to go.
I have been trying to work on a module (or modules) that I can reuse, as I know that we will be making several more pipelines for various projects. The difficulty I am having is in creating a good level of abstraction with the module. AWS Glue has several components/resources to it, including a Glue connection, databases, crawlers, jobs, job triggers and workflows. The problem is that the number of databases, jobs, crawlers and/or triggers and their interactions (e.g. some triggers might be conditional while others might simply be scheduled) can vary depending on the project, and I am having a hard time abstracting this complexity via modules.
I am having to create a lot of for_each "loops" and dynamic blocks within resources to try to make the module as generic as possible (e.g. so that I can create any number of jobs and/or triggers from the root module and define their interactions).
I understand that modules should actually be quite opinionated and specific, and be good at one task so to speak, which means my problem might simply be conceptual: the fact that these pipelines vary significantly from project to project makes them a poor use case for modules.
On a side note, I have not been able to find any robust examples of modules online for AWS Glue so this might be another indicator that it is indeed not the best use case.
Any thoughts here would be greatly appreciated.
EDIT:
As requested, here is some of my code from my root module:
module "glue_data_catalog" {
source = "../../modules/aws-glue/data-catalog"
# Connection
create_connection = true
conn_name = "SAMPLE"
conn_description = "SAMPLE."
conn_type = "JDBC"
conn_url = "jdbc:sqlserver:"
conn_sg_ids = ["sampleid"]
conn_subnet_id = "sampleid"
conn_az = "eu-west-1a"
conn_user = var.conn_user
conn_pass = var.conn_pass
# Databases
db_names = [
"raw",
"cleaned",
"consumption"
]
# Crawlers
crawler_settings = {
Crawler_raw = {
database_name = "raw"
s3_path = "bucket-path"
jdbc_paths = []
},
Crawler_cleaned = {
database_name = "cleaned"
s3_path = "bucket-path"
jdbc_paths = []
}
}
crawl_role = "SampleRole"
}
Glue data catalog module:
#############################
# Glue Connection
#############################
resource "aws_glue_connection" "this" {
  count = var.create_connection ? 1 : 0

  name            = var.conn_name
  description     = var.conn_description
  connection_type = var.conn_type

  connection_properties = {
    JDBC_CONNECTION_URL = var.conn_url
    USERNAME            = var.conn_user
    PASSWORD            = var.conn_pass
  }

  catalog_id     = var.conn_catalog_id
  match_criteria = var.conn_criteria

  physical_connection_requirements {
    security_group_id_list = var.conn_sg_ids
    subnet_id              = var.conn_subnet_id
    availability_zone      = var.conn_az
  }
}

#############################
# Glue Database Catalog
#############################
resource "aws_glue_catalog_database" "this" {
  for_each = var.db_names

  name         = each.key
  description  = var.db_description
  catalog_id   = var.db_catalog_id
  location_uri = var.db_location_uri
  parameters   = var.db_params
}

#############################
# Glue Crawlers
#############################
resource "aws_glue_crawler" "this" {
  for_each = var.crawler_settings

  name          = each.key
  database_name = each.value.database_name
  description   = var.crawl_description
  role          = var.crawl_role
  configuration = var.crawl_configuration

  s3_target {
    connection_name = var.crawl_s3_connection
    path            = each.value.s3_path
    exclusions      = var.crawl_s3_exclusions
  }

  dynamic "jdbc_target" {
    for_each = each.value.jdbc_paths
    content {
      connection_name = var.crawl_jdbc_connection
      path            = jdbc_target.value
      exclusions      = var.crawl_jdbc_exclusions
    }
  }

  recrawl_policy {
    recrawl_behavior = var.crawl_recrawl_behavior
  }

  schedule     = var.crawl_schedule
  table_prefix = var.crawl_table_prefix
  tags         = var.crawl_tags
}
It seems to me that I'm not actually providing any abstraction in this way but simply overcomplicating things.
I think I found a good solution to the problem, though it happened "by accident". We decided to divide the pipelines into two distinct projects:
ETL on source data
BI jobs to compute various KPIs
I then noticed that I could group resources together for both projects and standardize the way we have them interact (e.g. one connection, n tables, n crawlers, n etl jobs, one trigger). I was then able to create a module for the ETL process and a module for the BI/KPIs process which provided enough abstraction to actually be useful.
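For illustration, the resulting ETL module interface might look roughly like this (a sketch with hypothetical module and variable names, not the actual code):

module "etl_pipeline" {
  source = "../../modules/aws-glue/etl-pipeline" # hypothetical module path

  # One connection per pipeline
  conn_name = "SOURCE_DB"
  conn_url  = "jdbc:sqlserver:"

  # n databases / crawlers / jobs, one scheduled trigger that starts them all
  databases = ["raw", "cleaned", "consumption"]

  jobs = {
    load_raw      = { script_path = "s3://my-bucket/scripts/load_raw.py" }
    clean_raw     = { script_path = "s3://my-bucket/scripts/clean_raw.py" }
    build_consume = { script_path = "s3://my-bucket/scripts/build_consume.py" }
  }

  trigger_schedule = "cron(0 2 * * ? *)"
}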

How to destroy the additional tgw route table created by terraform transit-gateway module

I have created a TGW using the official transit-gateway module and I am using the default route table. I am also seeing that the module has created an additional route table, which I am not able to remove via Terraform code.
module "transit-gateway" {
source = "terraform-aws-modules/transit-gateway/aws"
version = "1.4.0"
name = var.tgw
amazon_side_asn = 64532
enable_auto_accept_shared_attachments = true
vpc_attachments = {
vpc = {
vpc_id = module.vpc.vpc_id
subnet_ids = [module.vpc.private_subnets[0]]
dns_support = true
ipv6_support = false
transit_gateway_default_route_table_association = true
transit_gateway_default_route_table_propagation = true
}
}
ram_allow_external_principals = true
ram_principals = [123456789, 0987654321]
tags = {
Environment = "${var.env}"
Automated = "Terraform"
Owner = "${var.owner}"
Project = "${var.project}"
}
}
If you look at the module's source code, the only way to disable creation of the aws_ec2_transit_gateway_route_table resource is by setting create_tgw to false.
If you do this, you will disable the entire TGW. So the answer to your question is that you can't remove that route table without removing the entire TGW.
This is because, if you inspect the module's source code, or similar modules (such as CloudPosse's), you'll see that it creates another transit gateway route table apart from the one created by the transit gateway itself.
You can verify this with a quick test by creating a transit gateway manually in the AWS Console.
In a nutshell: if the desired result is a single transit gateway route table, you'll have to develop a module yourself, as these modules don't use the ID of the route table the transit gateway creates automatically; presumably that is because it's not possible to manage that automatically created underlying resource.
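If all you need is the default route table, one option is to skip the module and manage the TGW with plain resources (a sketch; the attribute values are illustrative):

# Managing the TGW directly: only the default route table created by AWS exists,
# and no extra aws_ec2_transit_gateway_route_table is added.
resource "aws_ec2_transit_gateway" "this" {
  description                     = var.tgw
  amazon_side_asn                 = 64532
  auto_accept_shared_attachments  = "enable"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"

  tags = {
    Environment = var.env
  }
}

resource "aws_ec2_transit_gateway_vpc_attachment" "vpc" {
  transit_gateway_id = aws_ec2_transit_gateway.this.id
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = [module.vpc.private_subnets[0]]
  dns_support        = "enable"
  ipv6_support       = "disable"
}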