AWS Glue pipeline with Terraform - amazon-web-services

We are working with AWS Glue as a pipeline tool for ETL at my company. So far, the pipelines were created manually via the console and I am now moving to Terraform for future pipelines as I believe IaC is the way to go.
I have been trying to work on a module (or modules) that I can reuse as I know that we will be making several more pipelines for various projects. The difficulty I am having is in creating a good level of abstraction with the module. AWS Glue has several components/resources to it, including a Glue connection, databases, crawlers, jobs, job triggers and workflows. The problem is that the number of databases, jobs, crawlers and/or triggers and their interractions (i.e. some triggers might be conditional while others might simply be scheduled) can vary depending on the project, and I am having a hard time abstracting this complexity via modules.
I am having to create a lot of for_each "loops" and dynamic blocks within resources to try to render the module as generic as possible (e.g. so that I can create N number of jobs and/or triggers from the root module and define their interractions).
I understand that modules should actually be quite opinionated and specific, and be good at one task so to speak, which means my problem might simply be conceptual. The fact that these pipelines vary significantly from project to project make them a poor use case for modules.
On a side note, I have not been able to find any robust examples of modules online for AWS Glue so this might be another indicator that it is indeed not the best use case.
Any thoughts here would be greatly appreciated.
EDIT:
As requested, here is some of my code from my root module:
module "glue_data_catalog" {
source = "../../modules/aws-glue/data-catalog"
# Connection
create_connection = true
conn_name = "SAMPLE"
conn_description = "SAMPLE."
conn_type = "JDBC"
conn_url = "jdbc:sqlserver:"
conn_sg_ids = ["sampleid"]
conn_subnet_id = "sampleid"
conn_az = "eu-west-1a"
conn_user = var.conn_user
conn_pass = var.conn_pass
# Databases
db_names = [
"raw",
"cleaned",
"consumption"
]
# Crawlers
crawler_settings = {
Crawler_raw = {
database_name = "raw"
s3_path = "bucket-path"
jdbc_paths = []
},
Crawler_cleaned = {
database_name = "cleaned"
s3_path = "bucket-path"
jdbc_paths = []
}
}
crawl_role = "SampleRole"
}
Glue data catalog module:
#############################
# Glue Connection
#############################
resource "aws_glue_connection" "this" {
count = var.create_connection ? 1 : 0
name = var.conn_name
description = var.conn_description
connection_type = var.conn_type
connection_properties = {
JDBC_CONNECTION_URL = var.conn_url
USERNAME = var.conn_user
PASSWORD = var.conn_pass
}
catalog_id = var.conn_catalog_id
match_criteria = var.conn_criteria
physical_connection_requirements {
security_group_id_list = var.conn_sg_ids
subnet_id = var.conn_subnet_id
availability_zone = var.conn_az
}
}
#############################
# Glue Database Catalog
#############################
resource "aws_glue_catalog_database" "this" {
for_each = var.db_names
name = each.key
description = var.db_description
catalog_id = var.db_catalog_id
location_uri = var.db_location_uri
parameters = var.db_params
}
#############################
# Glue Crawlers
#############################
resource "aws_glue_crawler" "this" {
for_each = var.crawler_settings
name = each.key
database_name = each.value.database_name
description = var.crawl_description
role = var.crawl_role
configuration = var.crawl_configuration
s3_target {
connection_name = var.crawl_s3_connection
path = each.value.s3_path
exclusions = var.crawl_s3_exclusions
}
dynamic "jdbc_target" {
for_each = each.value.jdbc_paths
content {
connection_name = var.crawl_jdbc_connection
path = jdbc_target.value
exclusions = var.crawl_jdbc_exclusions
}
}
recrawl_policy {
recrawl_behavior = var.crawl_recrawl_behavior
}
schedule = var.crawl_schedule
table_prefix = var.crawl_table_prefix
tags = var.crawl_tags
}
It seems to me that I'm not actually providing any abstraction in this way but simply overcomplicating things.

I think I found a good solution to the problem, though it happened "by accident". We decided to divide the pipelines into two distinct projects:
ETL on source data
BI jobs to compute various KPIs
I then noticed that I could group resources together for both projects and standardize the way we have them interact (e.g. one connection, n tables, n crawlers, n etl jobs, one trigger). I was then able to create a module for the ETL process and a module for the BI/KPIs process which provided enough abstraction to actually be useful.

Related

Terraform update only a cloud function from a bunch

I have a Terraform project that allows to create multiple cloud functions.
I know that if I change the name of the google_storage_bucket_object related to the function itself, terraform will see the difference of the zip name and redeploy the cloud function.
My question is, there is a way to obtain the same behaviour, but only with the cloud functions that have been changed?
resource "google_storage_bucket_object" "zip_file" {
# Append file MD5 to force bucket to be recreated
name = "${local.filename}#${data.archive_file.source.output_md5}"
bucket = var.bucket.name
source = data.archive_file.source.output_path
}
# Create Java Cloud Function
resource "google_cloudfunctions_function" "java_function" {
name = var.function_name
runtime = var.runtime
available_memory_mb = var.memory
source_archive_bucket = var.bucket.name
source_archive_object = google_storage_bucket_object.zip_file.name
timeout = 120
entry_point = var.function_entry_point
event_trigger {
event_type = var.event_trigger.event_type
resource = var.event_trigger.resource
}
environment_variables = {
PROJECT_ID = var.env_project_id
SECRET_MAIL_PASSWORD = var.env_mail_password
}
timeouts {
create = "60m"
}
}
By appending MD5 every cloud functions will result in a different zip file name, so terraform will re-deploy every of them and I found that without the MD5, Terraform will not see any changes to deploy.
If I have changed some code only inside a function, how can I tell to Terraform to re-deploy only it (so for example to change only its zip file name)?
I hope my question is clear and I want to thank you everyone who tries to help me!

Is BigQuery Data Transfer Service for Cloud Storage Transfer possible to implement using Terraform?

This is more of a question about the possibility because there is no documentation on GCP regarding this.
I am using BigQuery DTS for moving my CSVs from a GCS bucket to BQ Table. I have tried it out manually and it works but I need some automation behind it and want to implement it using Terraform.
I have already checked this link but it doesnt help exactly:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_data_transfer_config
Any help would be appreciated. Thanks!
I think the issue is that the documentation does not list the params. Here is an example. Compare with the API.
https://cloud.google.com/bigquery-transfer/docs/cloud-storage-transfer#bq
resource "google_bigquery_data_transfer_config" "sample" {
display_name = "sample"
location = "asia-northeast1"
data_source_id = "google_cloud_storage"
schedule = "every day 10:00"
destination_dataset_id = "target_dataset"
params = {
data_path_template = "gs://target_bucket/*"
destination_table_name_template = "target_table"
file_format = "CSV"
write_disposition = "MIRROR"
max_bad_records = 0
ignore_unknown_values = "false"
field_delimiter = ","
skip_leading_rows = "1"
allow_quoted_newlines = "false"
allow_jagged_rows = "false"
delete_source_files = "false"
}
}

Terraform issues with waiting for GCP infrastructure changes to propagate

I am using Terraform and am attempting to deploy a project into a folder which has a GCP organization policy applied to it where service accounts cannot be created within that folder/projects in that folder. I have code which will set that org policy to false as a project is being deployed. Additionally I have some service accounts being deployed within that same main.tf which will depend on the org policy being set to false.
I have attempted to use depends_on statements for service account modules to wait for the org policy to be set to false prior to provisioning the service accounts. I have also used a time_sleep resource block to allow for the project factory and org policy to provision/make changes prior to service accounts being provisioned. I can occasionally get the whole deployment to work whereas other times I come across issues where the apply step will fail due to the organizational policy.
If I check the project in GCP it shows that the org policy has been set to false which is what should happen. If I re-run the apply step in Terraform then everything will provision that was left over. Is there a better way to approach this issue? The fact that sometimes the provisioning works in one apply vs two applies is a bit odd and makes me believe there is some sort of state caching going on but that's just more of me guessing based on what I've seen.
Code is as follows below:
source = "terraform-google-modules/project-factory/google"
version = "~> 10.1"
name = var.project_name
random_project_id = var.random_project_id
org_id = var.org_id
folder_id = var.folder_id
billing_account = var.billing_account_id
create_project_sa = false
default_service_account = var.default_service_account
disable_dependent_services = var.disable_dependent_services
disable_services_on_destroy = var.disable_services_on_destroy
labels = var.project_labels
}
module "remove_disable_sa_creation" {
source = "terraform-google-modules/org-policy/google"
version = "~> 3.0.2"
constraint = "constraints/iam.disableServiceAccountCreation"
policy_type = "boolean"
policy_for = "project"
project_id = module.project-factory.project_id
enforce = false
depends_on = [module.project-factory.project_id]
}
resource "time_sleep" "wait_60_seconds" {
depends_on = [module.remove_disable_sa_creation]
create_duration = "60s"
}
module "globus_service_account" {
source = "../../../modules/service_account"
project_id = module.project-factory.project_id
prefix = var.globus_sa_prefix
names = var.globus_sa_names
project_roles = var.globus_sa_project_roles
grant_billing_role = var.globus_grant_billing_role
billing_account_id = var.billing_account_id
grant_xpn_roles = var.globus_grant_xpn_roles
org_id = var.org_id
generate_keys = var.globus_generate_keys
display_name = var.globus_sa_display_name
description = var.globus_sa_description
depends_on = [time_sleep.wait_60_seconds]
}
Changing the sleep timer to 120 seconds was the main factor which helped solved this. What I did was create the project factory, have a depends on for the organization policy to wait on the project factory, have the timer wait on the organization policy, then have all other modules wait on the timer to finish.
Essentially the flow was project > organization policy > timer for 120s > all other modules provisioning after 120 seconds.

How to export all logs from stackdriver into big query through terraform

I'll preface this by saying I am very new to GCP, stackdriver and big query.
I'm attempting to have all logs within stackdriver automatically export to a big query dataset through terraform.
I have currently defined a logging sink referencing and a big query dataset; which the logging sink references. However the dataset appears to be empty. Is there something I am missing here?
This is what my terraform code looks like currently:
resource "google_bigquery_dataset" "stackdriver_logging" {
dataset_id = "stackdriver_logs"
friendly_name = "stackdriver_logs"
location = "US"
project = google_project.project.project_id
}
resource "google_logging_project_sink" "big_query" {
name = "${google_project.project.project_id}-_big_query-sink"
project = google_project.project.project_id
destination = "bigquery.googleapis.com/projects/${google_project.project.project_id}/datasets/${google_bigquery_dataset.stackdriver_logging.dataset_id}"
unique_writer_identity = true
}
resource "google_project_iam_member" "bq_log_writer" {
member = google_logging_project_sink.big_query.writer_identity
role = "roles/bigquery.dataEditor"
project = google_project.project.project_id
}

How to specify dead letter dependency using modules?

I have the following core module based off this official module:
module "sqs" {
source = "github.com/terraform-aws-modules/terraform-aws-sqs?ref=0d48cbdb6bf924a278d3f7fa326a2a1c864447e2"
name = "${var.site_env}-sqs-${var.service_name}"
}
I'd like to create two queues: xyz and xyz_dead. xyz sends its dead letter messages to xyz_dead.
module "xyz_queue" {
source = "../helpers/sqs"
service_name = "xyz"
redrive_policy = <<POLICY {
"deadLetterTargetArn" : "${data.TODO.TODO.arn}",
"maxReceiveCount" : 5
}
POLICY
site_env = "${var.site_env}"
}
module "xyz_dead_queue" {
source = "../helpers/sqs"
service_name = "xyz_dead"
site_env = "${var.site_env}"
}
How do I specify the deadLetterTargetArn dependency?
If I do:
data "aws_sqs_queue" "dead_queue" {
filter {
name = "tag:Name"
values = ["${var.site_env}-sqs-xyz_dead"]
}
}
and set deadLetterTargetArn to "${data.aws_sqs_queue.dead_queue.arn}", then I get this error:
Error: data.aws_sqs_queue.thumbnail_requests_queue_dead: "name":
required field is not set Error:
data.aws_sqs_queue.thumbnail_requests_queue_dead: : invalid or unknown
key: filter
The best way to do this is to use the outputted ARN from the module:
module "xyz_queue" {
source = "../helpers/sqs"
service_name = "xyz"
site_env = "${var.site_env}"
redrive_policy = <<POLICY
{
"deadLetterTargetArn" : "${module.xyz_dead_queue.this_sqs_queue_arn}",
"maxReceiveCount" : 5
}
POLICY
}
module "xyz_dead_queue" {
source = "../helpers/sqs"
service_name = "xyz_dead"
site_env = "${var.site_env}"
}
NB: I've also changed the indentation of your HEREDOC here because you normally need to remove the indentation with these.
This will pass the ARN of the SQS queue directly from the xyz_dead_queue module to the xyz_queue.
As for the errors you were getting, the aws_sqs_queue data source takes only a name argument, not a filter block like some of the other data sources do.
If you wanted to use the aws_sqs_queue data source then you'd just want to use:
data "aws_sqs_queue" "dead_queue" {
name = "${var.site_env}-sqs-${var.service_name}"
}
That said, if you are creating two things at the same time then you are going to have issues using a data source to refer to one of those things unless you create the first resource first. This is because data sources run before resources so if neither queue yet exists your data source would run and not find the dead letter queue and thus fail. If the dead letter queue did exist then it would be okay. In general though you're best off avoiding using data sources like this and only use them to refer to things being created in a separate terraform apply (or perhaps even created outside of Terraform).
You are also much better off simply passing the outputs of resources or modules to other resources/modules and allowing Terraform to correctly build a dependency tree for them as well.