AWS Macie & Terraform - Select all S3 buckets in account

I am enabling AWS Macie 2 using Terraform and I am defining a default classification job as follows:
resource "aws_macie2_account" "member" {}
resource "aws_macie2_classification_job" "member" {
  job_type = "ONE_TIME"
  name     = "S3 PHI Discovery default"
  s3_job_definition {
    bucket_definitions {
      account_id = var.account_id
      buckets    = ["S3 BUCKET NAME 1", "S3 BUCKET NAME 2"]
    }
  }
  depends_on = [aws_macie2_account.member]
}
AWS Macie needs a list of S3 buckets to analyze. I am wondering if there is a way to select all buckets in an account, using a wildcard or some other method. Our production accounts contain hundreds of S3 buckets and hard-coding each value in the s3_job_definition is not feasible.
Any ideas?

The Terraform AWS provider does not support a data source for listing S3 buckets at this time, unfortunately. For things like this (data sources that Terraform doesn't support), the common approach is to use the AWS CLI through an external data source.
These are modules that I like to use for CLI/shell commands:
As a data source (re-runs each time)
As a resource (re-runs only on resource recreate or on a change to a trigger)
Using the data source version, it would look something like:
module "list_buckets" {
  source  = "Invicton-Labs/shell-data/external"
  version = "0.1.6"
  // Since the command is the same on both Unix and Windows, it's ok to just
  // specify one and not use the `command_windows` input arg
  command_unix = "aws s3api list-buckets --output json"
  // You want Terraform to fail if it can't get the list of buckets for some reason
  fail_on_error = true
  // Specify your AWS credentials as environment variables
  environment = {
    AWS_PROFILE = "myprofilename"
    // Alternatively, although not recommended:
    // AWS_ACCESS_KEY_ID = "..."
    // AWS_SECRET_ACCESS_KEY = "..."
  }
}
output "buckets" {
  // We specified JSON format for the output, so decode it to get a list
  value = jsondecode(module.list_buckets.stdout).Buckets
}
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
Outputs:
buckets = [
  {
    "CreationDate" = "2021-07-15T18:10:20+00:00"
    "Name" = "bucket-foo"
  },
  {
    "CreationDate" = "2021-07-15T18:11:10+00:00"
    "Name" = "bucket-bar"
  },
]
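To close the loop with the original question, the decoded list can be reduced to bucket names and passed to the Macie job. This is only a sketch: it assumes the `list_buckets` module above and the resources from the question, and the local name `all_bucket_names` is made up for illustration.

```hcl
locals {
  # Hypothetical local: keep only the "Name" field of each bucket object
  all_bucket_names = [
    for b in jsondecode(module.list_buckets.stdout).Buckets : b.Name
  ]
}

resource "aws_macie2_classification_job" "member" {
  job_type = "ONE_TIME"
  name     = "S3 PHI Discovery default"

  s3_job_definition {
    bucket_definitions {
      account_id = var.account_id
      # Every bucket returned by the CLI call, instead of hard-coded names
      buckets = local.all_bucket_names
    }
  }

  depends_on = [aws_macie2_account.member]
}
```

Keep in mind that `list-buckets` returns buckets from every region, so you may want to filter the list if the job should only cover some of them.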

Related

Terraform Data block, all buckets

I am trying to create an inventory list for all the buckets in an AWS account. I am using the Terraform data source block to fetch the S3 buckets, but I can't figure out how to get all the buckets in my account, or which expression to use to get all the buckets, so that I can run an inventory on all of them. Please find my code below.
data "aws_s3_bucket" "select_bucket" {
  bucket = "????"
}
resource "aws_s3_bucket" "inventory" {
  bucket = "x-bucket"
}
resource "aws_s3_bucket_inventory" "inventory_list" {
  for_each                 = toset([data.aws_s3_bucket.select_bucket.id])
  bucket                   = each.key
  name                     = "lifecycle_analysis_bucket"
  included_object_versions = "All"
  schedule {
    frequency = "Daily"
  }
  destination {
    bucket {
      format     = "CSV"
      bucket_arn = aws_s3_bucket.inventory.arn
    }
  }
}
"which expression to use to get all the buckets"
There is no such expression. You have to prepare the list of all your buckets beforehand, and then you can iterate over them in your code. The other option is to develop your own custom data source, which would use the AWS CLI or SDK to get the list of your buckets and return it to Terraform for further processing.
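As a rough sketch of the first option, the prepared list can be passed in as a variable and iterated with `for_each`. The variable name `bucket_names` and its default values are placeholders, not something from the question; `aws_s3_bucket.inventory` is the destination bucket already defined there.

```hcl
variable "bucket_names" {
  # Hypothetical input: the list of buckets prepared beforehand
  type    = set(string)
  default = ["bucket-foo", "bucket-bar"]
}

resource "aws_s3_bucket_inventory" "inventory_list" {
  # One inventory configuration per bucket name in the set
  for_each = var.bucket_names

  bucket                   = each.key
  name                     = "lifecycle_analysis_bucket"
  included_object_versions = "All"

  schedule {
    frequency = "Daily"
  }

  destination {
    bucket {
      format     = "CSV"
      bucket_arn = aws_s3_bucket.inventory.arn
    }
  }
}
```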

Is there a way to configure date-partitioned folders for AWS DMS endpoint target S3?

I'm using terraform in order to configure this DMS migration task that migrates (full-load+cdc) the data from a MySQL instance to a S3 bucket.
The problem is that the configuration seems not to take effect and no partition-folder is created. All the migrated files are created in the same directory within the bucket.
According to the documentation, the S3 endpoint setting DatePartitionEnabled, introduced in version 3.4.2, is supported for both CDC and FullLoad+CDC.
My terraform configuration spec:
resource "aws_dms_endpoint" "example" {
  endpoint_id   = "example"
  endpoint_type = "target"
  engine_name   = "s3"
  s3_settings {
    bucket_name             = "example"
    bucket_folder           = "example-folder"
    compression_type        = "GZIP"
    data_format             = "parquet"
    parquet_version         = "parquet-2-0"
    service_access_role_arn = var.service_access_role_arn
    date_partition_enabled  = true
  }
  tags = {
    Name = "example"
  }
}
But in the respective S3 bucket I get no folders, only sequential files, as if this option wasn't there.
LOAD00000001.parquet
LOAD00000002.parquet
...
I'm using Terraform 1.0.7, AWS provider 3.66.0, and a DMS replication instance 3.4.6.
Does anyone know what could be causing this issue?

How do I get list of all S3 Buckets with given prefix using terraform?

I am writing a Terraform script to set up an event notification on multiple S3 buckets whose names start with a given prefix.
For example, I want to set up a notification for buckets starting with finance-data. With the help of the aws_s3_bucket data source, we can look up multiple S3 buckets that already exist and later use them in the aws_s3_bucket_notification resource. Example:
data "aws_s3_bucket" "source_bucket" {
  # set of buckets on which event notification will be set
  # finance-data-1 and finance-data-2 are actual bucket id
  for_each = toset(["finance-data-1", "finance-data-2"])
  bucket   = each.value
}
resource "aws_s3_bucket_notification" "bucket_notification_to_lambda" {
  for_each = data.aws_s3_bucket.source_bucket
  bucket   = each.value.id
  lambda_function {
    lambda_function_arn = aws_lambda_function.s3_event_lambda.arn
    events = [
      "s3:ObjectCreated:*",
      "s3:ObjectRemoved:*"
    ]
  }
}
In aws_s3_bucket datasource, I am not able to find an option to give a prefix of the bucket and instead I have to enter bucket-id for all the buckets. Is there any way to achieve this?
"Is there any way to achieve this?"
No, there is not. You have to explicitly specify the buckets that you want.
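The data source itself cannot filter by prefix, but if a full list of bucket names is available from somewhere else (a variable, or an external script), the prefix match can be done in plain HCL. The sketch below assumes a hypothetical variable `all_bucket_names` supplied from outside, and reuses the Lambda function from the question.

```hcl
variable "all_bucket_names" {
  # Hypothetical input: every bucket name in the account, supplied from outside
  type = list(string)
}

locals {
  # Keep only the names that start with the prefix
  finance_buckets = toset([
    for name in var.all_bucket_names : name
    if length(regexall("^finance-data", name)) > 0
  ])
}

resource "aws_s3_bucket_notification" "bucket_notification_to_lambda" {
  for_each = local.finance_buckets
  bucket   = each.value

  lambda_function {
    lambda_function_arn = aws_lambda_function.s3_event_lambda.arn
    events = [
      "s3:ObjectCreated:*",
      "s3:ObjectRemoved:*"
    ]
  }
}
```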

How to reduce repeated HCL code in Terraform?

I have some Terraform code like this:
resource "aws_s3_bucket_object" "file1" {
  key    = "someobject1"
  bucket = "${aws_s3_bucket.examplebucket.id}"
  source = "./src/index.php"
}
resource "aws_s3_bucket_object" "file2" {
  key    = "someobject2"
  bucket = "${aws_s3_bucket.examplebucket.id}"
  source = "./src/main.php"
}
# same code here, 10 files more
# ...
Is there a simpler way to do this?
Terraform supports loops via the count meta parameter on resources and data sources.
So, for a slightly simpler example, if you wanted to loop over a well known list of files you could do something like the following:
locals {
  files = [
    "index.php",
    "main.php",
  ]
}
resource "aws_s3_bucket_object" "files" {
  count  = "${length(local.files)}"
  key    = "${local.files[count.index]}"
  bucket = "${aws_s3_bucket.examplebucket.id}"
  source = "./src/${local.files[count.index]}"
}
Unfortunately Terraform's AWS provider doesn't have support for the equivalent of aws s3 sync or aws s3 cp --recursive although there is an issue tracking the feature request.
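On newer Terraform versions (0.12.8 or later, as far as I know), `for_each` combined with the `fileset` function can get reasonably close to a sync without maintaining the file list by hand. This is a sketch rather than a drop-in replacement: it assumes the files live in a `src` directory next to the configuration, and that the object key should simply be the file name. On AWS provider v4 and later the same resource is called `aws_s3_object`.

```hcl
resource "aws_s3_bucket_object" "files" {
  # One object per file matching the pattern under ./src
  for_each = fileset("${path.module}/src", "*.php")

  key    = each.value
  bucket = aws_s3_bucket.examplebucket.id
  source = "${path.module}/src/${each.value}"

  # Changes when the file content changes, so the object gets re-uploaded
  etag = filemd5("${path.module}/src/${each.value}")
}
```

Because the instances are keyed by file name rather than by index, removing one file from the directory should not shift the addresses of the others, which is the usual reason to prefer `for_each` over `count` here.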

Using Athena Terraform Scripts

Amazon Athena reads data from input Amazon S3 buckets using the IAM credentials of the user who submitted the query; query results are stored in a separate S3 bucket.
Here is the script from the HashiCorp site: https://www.terraform.io/docs/providers/aws/r/athena_database.html
resource "aws_s3_bucket" "hoge" {
  bucket = "hoge"
}
resource "aws_athena_database" "hoge" {
  name   = "database_name"
  bucket = "${aws_s3_bucket.hoge.bucket}"
}
Where it says
bucket - (Required) Name of s3 bucket to save the results of the query execution.
How can I specify the input S3 bucket in the terraform script?
You would use the storage_descriptor argument in the aws_glue_catalog_table resource:
https://www.terraform.io/docs/providers/aws/r/glue_catalog_table.html#parquet-table-for-athena
Here is an example of creating a table using CSV file(s):
resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
  name          = "your_table_name"
  database_name = "${aws_athena_database.your_athena_database.name}"
  table_type    = "EXTERNAL_TABLE"
  parameters = {
    EXTERNAL = "TRUE"
  }
  storage_descriptor {
    # The input S3 location that Athena reads from
    location     = "s3://<your-s3-bucket>/your/file/location/"
    input_format = "org.apache.hadoop.mapred.TextInputFormat"
    # Hive's standard text output format (the input format class does not belong here)
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    ser_de_info {
      name                  = "my-serde"
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim"            = ","
        "skip.header.line.count" = "1"
      }
    }
    columns {
      name = "column1"
      type = "string"
    }
    columns {
      name = "column2"
      type = "string"
    }
  }
}
The input S3 bucket is specified in each table you create in the database, so there's no global definition for it.
As of today, the AWS API doesn't have much provision for Athena management, and therefore neither does the AWS CLI nor Terraform. There's no 'proper' way to create a table via these means.
In theory, you could create a named query to create your table, and then execute that query (for which there is API functionality, but not yet Terraform support). It seems a bit messy to me, but it would probably work if/when TF gets the StartQuery functionality. The asynchronous nature of Athena makes it tricky to know when that table has actually been created, though, so I can imagine TF won't fully support table creation directly.
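For what it's worth, the named-query half of that idea could look roughly like the sketch below, using the `aws_athena_named_query` resource. The query name, table name, and columns are placeholders, and the database reference assumes the `aws_athena_database.hoge` resource from the question. Note that this only saves the query; something outside Terraform would still have to execute it.

```hcl
resource "aws_athena_named_query" "create_table" {
  # Hypothetical name for the saved query
  name     = "create_example_table"
  database = aws_athena_database.hoge.name

  # DDL that points the table at the input S3 location
  query = <<-EOT
    CREATE EXTERNAL TABLE IF NOT EXISTS example_table (
      column1 string,
      column2 string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://<your-s3-bucket>/your/file/location/'
    TBLPROPERTIES ('skip.header.line.count' = '1')
  EOT
}
```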
The TF code that covers the currently available functionality is here: https://github.com/terraform-providers/terraform-provider-aws/tree/master/aws
The API documentation for Athena functions is here: https://docs.aws.amazon.com/athena/latest/APIReference/API_Operations.html