Can terraform duplicate the content of an s3 bucket? - amazon-web-services

I am using Terraform to manage AWS environments for our application. The environments have S3 buckets for various things, and when setting up a new environment I just want to copy the buckets from a base source bucket, or from an existing environment.
But I can't find anything that will provision a copy. The AWS interface lets you duplicate the settings when creating a bucket (which I don't need), but not the objects, so it may not be something Terraform can do directly.
If so, how about indirectly?

There is no resource that copies objects from one S3 bucket to another. If you want to include this in your Terraform setup, you would need to use a local-exec provisioner.
It would execute a command like the one below, using the AWS CLI to run aws s3 cp.
resource "null_resource" "s3_objects" {
provisioner "local-exec" {
command = "aws s3 cp s3://bucket1 s3://bucket2 --recursive"
}
}
For this to run, the machine running Terraform would need to have the AWS CLI installed, with a role (or valid credentials) that allows the copy.
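If the buckets themselves are managed in the same configuration, a slightly fuller sketch (the resource names aws_s3_bucket.source and aws_s3_bucket.target here are hypothetical) could reference them directly and re-run the copy whenever the target bucket is replaced:

resource "null_resource" "s3_objects" {
  # re-run the copy if the target bucket is ever recreated
  triggers = {
    target_bucket = aws_s3_bucket.target.id
  }

  provisioner "local-exec" {
    command = "aws s3 cp s3://${aws_s3_bucket.source.bucket} s3://${aws_s3_bucket.target.bucket} --recursive"
  }
}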

Generally-speaking, Terraform providers reflect operations that are natively supported by the underlying APIs, but in some cases we can use various Terraform resource types together to achieve functionality that the underlying provider lacks.
I believe there's no native S3 operation for bulk-copying objects from one bucket to another, so solving this with Terraform requires decomposing the problem into smaller steps, which I think in this case would be:
Declare a new bucket, the target
List all of the objects in the source bucket
Declare one object in the new bucket per object in the source bucket.
The AWS provider can in principle do all three of these operations: it has managed resource types for both buckets and bucket objects, and it has a data source aws_s3_bucket_objects which can enumerate some or all of the objects in a bucket.
We can combine those pieces together in a Terraform configuration like this:
resource "aws_s3_bucket" "target" {
bucket = "copy-example-target"
}
data "aws_s3_bucket_objects" "source" {
bucket = "copy-example-source"
}
data "aws_s3_bucket_object" "source" {
for_each = toset(data.aws_s3_bucket_objects.source.keys)
bucket = data.aws_s3_bucket_objects.source.bucket
key = each.key
}
resource "aws_s3_bucket_object" "target" {
for_each = aws_s3_bucket_object.source
bucket = aws_s3_bucket.target.bucket
key = each.key
content = each.value.body
}
With that said, Terraform is likely not the best tool for this situation, for the following reasons:
The above configuration will cause Terraform to read all of the objects in the bucket into memory, which would be time consuming and use lots of RAM for larger buckets, and then ultimately store all of them in the Terraform state, which would make the state itself very large.
Because the aws_s3_bucket_object data source is intended mainly for retrieving small text-based objects, the above will work only if everything in the bucket meets the limitations described in the aws_s3_bucket_object documentation: the objects must all have text-indicating MIME types and they must all contain UTF-8 encoded text.
In this case then, I would prefer to use a specialized tool for the job which is designed to exploit all of the features of the S3 API to make the copy as efficient as possible, such as streaming the list of objects and streaming the contents of each object in chunks to avoid the need to have all of the data in memory at once. One such tool is in the AWS CLI itself, in the form of the aws s3 cp command with the --recursive option.

Related

Terraform for AWS: How to have multiple events per S3 bucket filtered on object path?

My understanding is that when configuring an S3 bucket notification with Terraform we can only configure a single notification per S3 bucket:
NOTE: S3 Buckets only support a single notification configuration. Declaring multiple aws_s3_bucket_notification resources to the same S3 Bucket will cause a perpetual difference in configuration. See the example "Trigger multiple Lambda functions" for an option.
The application uses a single S3 bucket as a data repository, i.e. when JSON files land there they trigger a lambda which submits a corresponding batch job to ingest from the file into a database.
This works well when we have a single developer deploying the infrastructure, but with multiple developers, each time one of us runs terraform apply it updates the single notification for the bucket, overwriting the previous settings.
What is the best practice for utilizing S3 buckets for notifications? Are they best configured/created per Terraform workspace, and/or how are the buckets managed to allow for simultaneous developers standing up/down infrastructure resources using a common S3 bucket via terraform apply, etc.? Must you use one bucket per workspace for this use case, as suggested by the docs?
The current Terraform I have for the S3 notification (the code that allows for overwriting with the latest configuration):
data "aws_s3_bucket" "default" {
bucket = var.bucket
}
resource "aws_lambda_permission" "allow_bucket_execution" {
statement_id = "AllowExecutionFromS3Bucket"
action = "lambda:InvokeFunction"
function_name = var.lambda_function_name
principal = "s3.amazonaws.com"
source_arn = data.aws_s3_bucket.default.arn
}
resource "aws_s3_bucket_notification" "bucket_notification" {
bucket = data.aws_s3_bucket.default.bucket
lambda_function {
lambda_function_arn = var.lambda_function_arn
events = ["s3:ObjectCreated:*"]
filter_prefix = var.namespace
filter_suffix = ".json"
}
}
The namespace variable is passed in as "${local.env}-${terraform.workspace}", with local.env as "dev", "uat", "prod", etc.
How can we modify the Terraform code above to allow for multiple notifications per S3 bucket (essentially one per Terraform workspace), or can it just not be done? If not then how is this best handled? Should I just use a bucket per workspace using a namespace variable like above as the S3 bucket name, and have it updated accordingly to the production bucket at deployment?
There are several options, depending on your needs:
Create one bucket per env and workspace. Then the mentioned limitation of Terraform's aws_s3_bucket_notification is no longer an issue. The process that writes to your bucket will presumably still write to only the one bucket you specify, though; to solve that you could forward any object uploaded to one "master" bucket on to all the other buckets (either with a Lambda, itself triggered by an aws_s3_bucket_notification, or probably with bucket replication).
Create one bucket per env and deploy the aws_s3_bucket_notification without workspaces. You then lose the advantages of workspaces, but this might be a reasonable compromise between the number of buckets and usability.
Keep just this one bucket, keep envs and workspaces, but deploy the aws_s3_bucket_notification resource only once (probably together with the bucket). That single aws_s3_bucket_notification resource would then need to include the rules for all environments and workspaces, as sketched below.
Which of these fits best really depends on your situation. If those aws_s3_bucket_notification rules rarely change at all, and most changes are done in the Lambda function, the last option might be the best. If you regularly want to change the aws_s3_bucket_notification and the events to listen for, one of the other options might be more suitable.
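As a rough sketch of the last option (the variable names and prefixes below are illustrative), a single aws_s3_bucket_notification resource can declare several lambda_function blocks, one per environment/workspace prefix:

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = data.aws_s3_bucket.default.bucket

  # one lambda_function block per environment/workspace, all owned by one resource
  lambda_function {
    lambda_function_arn = var.dev_lambda_function_arn   # hypothetical variable
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "dev-default"
    filter_suffix       = ".json"
  }

  lambda_function {
    lambda_function_arn = var.uat_lambda_function_arn   # hypothetical variable
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "uat-default"
    filter_suffix       = ".json"
  }
}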

Amazon S3 tags for automatic replication with specific prefix?

I have two Amazon S3 buckets set up for cross-region replication. Whenever there is an upload in the source bucket with a specific prefix, I need the respective data to be replicated to my "processing bucket" in a different region. However, I need to know at least some information about the original source bucket after the replication process, because I want to set up multiple source buckets replicating to the same destination bucket, while the processing is done via Lambda events.
I thought about getting this to work with tagging but I can't find ways to automatically tag uploaded data containing a specific prefix before (or after?) they are replicated.
The only thing closing in on this topic I could find was https://docs.aws.amazon.com/AmazonS3/latest/dev/batch-ops-put-object-tagging.html, but I can't make much of that, as I'm not sure, if this is what I'm searching for, especially regarding the automatic replication functionality.
To recap: I want to process data via lambda events and differentiate their origin by information included in the event's json data (originating from specific tags on the S3 file for example).
What is the best way to approach this?
Tagging Objects
How you tag objects depends on how they are being uploaded into S3. If you are using the CLI, after you have copied the file with aws s3 cp you can call the s3api commands to add tags:
aws s3api put-object-tagging --bucket [bucket name] --key [object key] --tagging 'TagSet=[{Key=mykey,Value=myvalue},{Key=yourkey,Value=yourvalue}]'
Alternatively you could add a Lambda Trigger that adds the tags to the object when uploaded. You can do this using the examples outlined here.
Bucket Replication:
Objects are replicated as-is; you can set the encryption, type of storage, or ownership. Currently you can't change anything else.
The AWS documentation for replication defines the destination configuration as:
{
  "AccessControlTranslation" : AccessControlTranslation,
  "Account" : String,
  "Bucket" : String,
  "EncryptionConfiguration" : EncryptionConfiguration,
  "StorageClass" : String
}
Currently you can only set the destination StorageClass, Bucket, Account and EncryptionConfiguration.
The bucket is just the bucket name, and does not include a prefix.
If the correct permissions are set, replication can replicate tags, and tags can be added at any time; i.e. you can add an object, it can replicate, and then you can update the source tag and that source tag will replicate too.
Note: if you update the destination object's tags and then the source updates, the source will override the destination tags. This is dependent on the IAM policy defined; i.e. if ownership has changed then you might not be able to update the tags.
AWS S3 does not have the concept of folders; the prefixes are just part of the key name, and so the entire key name is replicated.
Possible Solutions:
In the source bucket you could set a prefix, for example 'my-source', and then filter replication to the target bucket on the prefix 'my-source'. S3 replication will replicate the object to the target bucket with the prefix 'my-source' preserved. Thus if bucket 1 uses the prefix 'my-source1/object' and bucket 2 uses 'my-source2/object', the target bucket will show the "folders" 'my-source1' and 'my-source2' with their respective objects. But if both source buckets use the same prefix then the files will appear in the same "folder" on the target.
Alternatively you can use Lambda to change the prefix, or add tags as defined above.
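For reference, since the rest of this page manages buckets with Terraform, a prefix-filtered replication rule would look roughly like the sketch below (bucket, role, and rule names are illustrative, versioning is assumed to be enabled on both buckets, and the aws_s3_bucket_replication_configuration resource requires a recent AWS provider):

resource "aws_s3_bucket_replication_configuration" "source_to_processing" {
  bucket = aws_s3_bucket.source.id        # hypothetical source bucket
  role   = aws_iam_role.replication.arn   # hypothetical replication role

  rule {
    id     = "replicate-my-source1"
    status = "Enabled"

    # only keys under this prefix are replicated, and the prefix is preserved
    filter {
      prefix = "my-source1/"
    }

    # the S3 API expects this to be set explicitly when a filter is used
    delete_marker_replication {
      status = "Disabled"
    }

    destination {
      bucket        = aws_s3_bucket.processing.arn   # hypothetical processing bucket
      storage_class = "STANDARD"
    }
  }
}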

How to copy S3 objects between regions with Amazon AWS PHP SDK?

I'm trying to copy Amazon AWS S3 objects between two buckets in two different regions with the Amazon AWS PHP SDK v3. This would be a one-time process, so I don't need cross-region replication. I tried to use copyObject(), but there is no way to specify the region.
$s3->copyObject(array(
    'Bucket'     => $targetBucket,
    'Key'        => $targetKeyname,
    'CopySource' => "{$sourceBucket}/{$sourceKeyname}",
));
Source:
http://docs.aws.amazon.com/AmazonS3/latest/dev/CopyingObjectUsingPHP.html
You don't need to specify regions for that operation; it will find out the target bucket's region and copy to it.
But you may be right, because the AWS CLI has --source-region and --region options which do not exist in the PHP SDK. So you can accomplish the task like this:
Create an interim bucket in the source region.
Create the bucket in the target region.
Configure replication from the interim bucket to the target one.
On the interim bucket, set an expiration rule so files are deleted automatically after a short time.
Copy objects from the source bucket to the interim bucket using the PHP SDK.
All your objects will then also be copied to the other region.
You can remove the interim bucket a day later.
Or just use the CLI and this single command:
aws s3 cp s3://my-source-bucket-in-us-west-2/ s3://my-target-bucket-in-us-east-1/ --recursive --source-region us-west-2 --region us-east-1
A bucket in a different region could also be in a different account. What others have done is to copy from one bucket, save the data temporarily locally, then upload it to the other bucket with different credentials (if you have two regional buckets with different credentials).
The newest update to the CLI tool allows you to copy from bucket to bucket if they are under the same account, using something like what Çağatay Gürtürk mentioned.

How to share a terraform script without module dependencies

I want to share a terraform script that will be used across different projects. I know how to create and share modules, but this setup has a big annoyance: when I reference a module in a script and perform a terraform apply, if the module resource does not exist it will be created, but also if I perform a terraform destroy this resource will be destroyed.
If I have two projects dependent on the same module, and in one of them I call terraform destroy, it may lead to an inconsistent state, since the module is being used by another project. The script can either fail because it cannot destroy the resource, or it will destroy the resource and affect the other project.
In my scenario, I want to share network scripts between two projects, and I want the network resources to never be destroyed. I cannot create a project only for this resource because I need to reference it somehow in my projects, and the only way to do that is via its ID, which I have no way of knowing ahead of time.
prevent_destroy is also not an option, since I do need to destroy the other resources, just not the shared one; that setting makes terraform destroy fail.
Is there any way to reference the resource, like by its name, or is there any other better approach to accomplish what I want?
If I understand you correctly, you have some resource R that is a "singleton". That is, only one instance of R can ever exist in your AWS account. For example, you can only ever have one aws_route53_zone with the name "foo.com". If you include R as a module in two different places, then either one may create it when you run terraform apply and either one may delete it when you run terraform destroy. You'd like to avoid that, but you still need some way to get an output attribute from R (e.g. the zone_id for an aws_route53_zone resource is generated by AWS, so you can't guess it).
If that's the case, then instead of using R as a module, you should:
Create R by itself in its own set of Terraform templates. Let's say those are under /terraform/R.
Configure /terraform/R to use Remote State. For example, here is how you can configure those templates to store their remote state in an S3 bucket (you'll need to fill in the bucket name/region as indicated):
terraform remote config \
-backend=s3 \
-backend-config="bucket=(YOUR BUCKET NAME)" \
-backend-config="key=terraform.tfstate" \
-backend-config="region=(YOUR BUCKET REGION)" \
-backend-config="encrypt=true"
Define any output attributes you need from R as output variables. For example:
output "zone_id" {
value = "${aws_route_53.example.zone_id}"
}
When you run terraform apply in /terraform/R, it will store its Terraform state, including that output, in an S3 bucket.
Now, in all other Terraform templates that need that output attribute from R, you can pull it in from the S3 bucket using the terraform_remote_state data source. For example, let's say you had some template /terraform/foo that needed that zone_id parameter to create an aws_route53_record (you'll need to fill in the bucket name/region as indicated):
data "terraform_remote_state" "r" {
backend = "s3"
config {
bucket = "(YOUR BUCKET NAME)"
key = "terraform.tfstate"
region = "(YOUR BUCKET REGION)"
}
}
resource "aws_route53_record" "www" {
zone_id = "${data.terraform_remote_state.r.zone_id}"
name = "www.foo.com"
type = "A"
ttl = "300"
records = ["${aws_eip.lb.public_ip}"]
}
Note that terraform_remote_state is a read-only data source. That means when you run terraform apply or terraform destroy on any templates that use that resource, they will not have any effect in R.
For more info, check out How to manage terraform state and Terraform: Up & Running.

Terraform remote state s3 bucket creation included in the state file?

I am looking for the best practice to create and store my state file in S3 bucket.
Should I include the creation of the S3 bucket along with the rest of the infrastructure, or
Create a separate state file for the S3 bucket and a different one for the other resources?
If it is a different file, I would also need to store the state file of the S3 bucket that was created; in that case I would be creating two S3 buckets, one for the infrastructure state and another for the S3 bucket's state file.
Secondly, if remote configuration is set, running terraform destroy throws me an error (failed to upload state file: no such bucket found) because the bucket has been destroyed. Should I first run terraform remote config -disable and then run terraform destroy?
What's the best practice I should be following?
Personally I use a Terraform base stack to effectively bootstrap an AWS account for use with Terraform. This stack just stores its state file locally, which is then committed to version control. This stack should only ever have to be run once, so I see no problem with it not using a remote backend.
My Terraform base stack creates:
IAM user for Terraform to run as in future
S3 bucket for storing state
KMS CMK for encrypting/decrypting state
Bucket policy statement to enforce encryption
Bucket policy statement to prevent the Terraform user from doing anything but s3:PutObject & s3:GetObject with state
KMS policy statement to prevent the Terraform user from doing anything but kms:GenerateDataKey* & kms:Decrypt
A DynamoDB table for state locking.
This can be expanded to include Roles, especially if your Terraform user will be deploying across multiple accounts.
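A minimal sketch of just the state-storage pieces (names are illustrative; the IAM user, KMS key, and policy statements listed above would follow the same pattern):

resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-org-terraform-state"   # illustrative name
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"    # illustrative name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"             # the attribute Terraform's S3 backend uses for lock items

  attribute {
    name = "LockID"
    type = "S"
  }
}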
You have a chicken and egg problem here if you want to store the state of the thing that will store the state.
Creating an S3 bucket outside of Terraform is easy, so I would never bother managing the actual state bucket in Terraform, and would then use Terraform to create absolutely everything else.
The ease of creating an S3 bucket (or one of the other S3-type storage options now covered by remote state) is one of the main benefits of using S3 to back your state files rather than, say, Consul, which would require you to build a cluster of instances and configure them before you could store any state files.
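Once the state bucket exists (whether created by hand or by a small bootstrap stack), every other stack just points at it with a backend block; the names below are illustrative:

terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"      # created outside this stack
    key            = "envs/dev/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}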