Terraform - one centralized state or multiple modular states - amazon-web-services

I have created a terraform stack for all the required resources that we utilise to build out a virtual data center within AWS. VPC, subnet, security groups etc, etc.
It all works beautifully :). I am having a constant argument with network engineers that want to have a completely separate state for networking etc. As a result of this we have to manage multiple state files and it requires 10 to 15 terraform plan/apply commands to being up the data center. Not alone do we have to run the commands multiple times, we cannot reference the module output variables from when creating ec2 instances etc, so now there are "magic" variables appearing within variable files. I want to put the scripts to create the ec2 instances, els etc within the same directory as the "data center" configuration so that we manage one state file (encrypted in s3 with dynamodb lock) and that our git repo has a one to one relationship with our infrastructure. There is also the added benefit that a single terraform plan/apply will build the whole datacenter in a single command.
Question is really, is it a good idea to manage data center resources (vpc, subnets, security groups) and compute resources in a single state file? Are there any issues that I may come across? Has anybody experience in managing an AWS environment with terraform this way?
Regards,
David

To begin with the Terraform provider let's you access output variables from other state files so you don't have to use magic variables. The rest is just a matter of your style. Do you frequently bring the whole datacenter infrastracture up? If so you may consider doing it in one project. If on the other hand you only change some things you may want to make it more modular relying on output from other projects. Keeping them separate makes the planing faster and avoids a very costly terraform destroy mistake.

During the last years there have been a lot of discussion about layouts for Terraform projects.
Times have also changed with Terraform 1.0 so I think this question deserve some love.
As a result of this we have to manage multiple state files and it requires 10 to 15 terraform plan/apply commands to being up the data center.
Using modules is possible to maintain separated states without requiring executing commands for each state.
Not alone do we have to run the commands multiple times, we cannot reference the module output variables
Terraform support output values. Leveraging Terraform Cloud or Terraform remote states is possible to introduce dependencies between states.
A prerequisite to adventure into multiple Terraform states in my opinion is using state locking (OP refers to using AWS DynamoDB lock mechanism but other Storage backend support this too).
Generally having everything in a single state is not the best solution and may be considered an anti-pattern.
Having multiple state is referred to as state isolation.
Why would you want to isolate states?
Reasons are multiple and the benefits are clear:
bugs blast radius. If you introduce a bug somewhere in the code an you apply all the code for the entire datacenter in the worst possible scenario everything will be affected. On the other hand if networking was separated in the worst scenario the bug could only affect networking (which in a DC would be a very severe issue but still better than everything).
state (write) lock. If you use state lock Terraform will lock the state for any operation that may possibly write to the state. This means that with a single state multiple teams working on separate areas are not able to write to the state at the same time, so updating the networking blocks instance provisioning for example.
secrets. Secrets are written plain-text to the state. A single state means all teams secrets will end up in the same state (that you must encrypt, OP is correctly doing this). As with anything with security having all eggs in the same basket is a risk.
A side benefit of isolating state is that file layout may help with code ownership (across teams or project).
How to isolate state?
There are 3 mainly ways:
via file layout (with or without modules)
via workspaces (not to be confused with Terraform Cloud Workspaces)
mix up the above ways (Here be dragons!)
There is no wide consensus on how to do it but for further reading:
https://blog.gruntwork.io/how-to-manage-terraform-state-28f5697e68fa 2016 article, I think this is sort of the root of the discussion.
https://terragrunt.gruntwork.io/docs/features/keep-your-terraform-code-dry/
https://www.terraform-best-practices.com/code-structure
A tool worth looking at may be Terragrunt from Gruntwork.

Related

Understanding where to begin with batch processing on AWS

I have a set of calculations that needs to run in a batch, and the workload is easily parallelized across machines. The work to be done is already done within a Docker container. I'm trying to understand the easiest way for me to run this workload in a highly parallel way on AWS. However, in trying to figure out where to begin I'm having trouble finding the right entrypoint. I read about AWS Batch and AWS Fargate, but each time I try to go down one of those paths to learn about them in more detail, more AWS services start popping up (Lamdas, Step Functions, ECS, AutoScaling groups), with each article having a different combination. Furthermore, I start thinking about the problem as a Batch vs Fargate problem, and then I find another article that talks about Batch + Fargate, or X + ECS + ....
I'm having trouble finding the appropriate introduction to the choices so I can get started with setting something up and getting some experience. Any pointers on which direction I might go or some resources for me to look at?
AWS containers services team member here. Your question triggers all my button cause I have been working on a deliverable to address some of this confusion ("where do I start with xyz?"). I can try to answer your question briefly here but if you want to read more (perhaps way more than you'd need feel free to contact me offline (mreferre at amazon dot com will work).
First and foremost it's not a Vs but it's an AND. Think of all these products you mention being distributed at different layers of the stack (this is a draft visual in the deliverable):
Fargate represents capacity (where your container is running), ECS represents a core containers orchestrator and Batch is one of the provisioners on top of the container orchestrator. Lambda is something separate and that live on its own. The options for your specific use case seem to be:
Lambda
ECS/Fargate
Batch/ECS/Fargate
Step Functions/ECS/Fargate (this one is outside of analysis and you don't see it in my visual - wondering if I should add it).
As others have hinted you probably want to use Lambda if your model is event-driven (e.g. if you want to fire up a dedicated function for every event like a new file uploaded to S3).
You probably do not want to use a naked ECS/Fargate solution because it would require more work to deal with the triggering and the scheduling of your batch jobs.
You probably want to use either Batch or Step Functions to schedule jobs on ECS/Fargate. I'd argue SF is good if you have basic workflows that you need to deal with and Batch if you need to manage complex jobs at scale. Perhaps this 35 mins presentation that I did last year can provide a bit more background on these Batch Vs SF differences.
Let me know if you have any additional questions because this discussion is super useful for the positioning I am trying to build.

Usefulness of IaaS Provisoning tools like Terraform?

I have a quick point of confusion regarding the whole idea of "Infrastructure as a Code" or IaaS provisioning with tools like Terraform.
I've been working on a team recently that uses Terraform to provision all of its AWS resources, and I've been learning it here and there and admit that it's a pretty nifty tool.
Besides Infrastructure as Code being a "cool" alternative to manually provisioning resources in the AWS console, I don't understand why it's actually useful though.
Take, for example, a typical deployment of a website with a database. After my initial provisioning of this infrastructure, why would I ever need to even run the Terraform plan again? With everything I need being provisioned on my AWS account, what are the use cases in which I'll need to "reprovision" this infrastructure?
Under this assumption, the process of provisioning everything I need is front-loaded to begin with, so why do I bother learning tools when I can just click some buttons in the AWS console when I'm first deploying my website?
Honestly I thought this would be a pretty common point of confusion, but I couldn't seem to find clarity elsewhere so I thought I'd ask here. Probably a naive question, but keep in mind I'm new to this whole philosophy.
Thanks in advance!
Manually provisioning, in the long term, is slow, non-reproducible, troublesome, not self-documenting and difficult to do in teams.
With tools such as terraform or CloudFormation you can have the following benefits:
Apply all the same development principles which you have when you write a traditional code. You can use comments to document your infrastructure. You can track all changes and who made these changes using software version control system (e.g. git).
you can easily share your infrastructure architecture. Your VPC and ALB don't work? Just post your terraform code to SO or share with a colleague for a review. Its much easier then sharing screenshots of your VPC and ALB when done manually.
easy to plan for disaster recovery and global applications. You just deploy the same infrastructure in different regions automatically. Doing the same manually in many regions would be difficult.
separation of dev, prod and staging infrastructure. You just re-use the same infrastructure code across different environments. A change to dev infrastructure can be easily ported to prod.
inspect changes before actually performing them. Manual upgrades to your infrastructure can have disastrous effects due to domino effect. Changing one, can change/break many other components of your architecture. With infrastructure as a code, you can preview the changes and have good understanding what implications can be before you actually do the change.
work team. You can have many people working on the same infrastructure code, proposing changes, testing and reviewing.
I really like the #Marcin's answer.
Here some additional points from my experience:
As for software version control case you not only can see history/authors, perform code review, but also treat infrastructural changes as product features. Let's say for example you're adding CDN support to your application so you have to make some changes in your infrastructure (to provision a cloud CDN service), application (to actually support and work with CDN) and your pipelines (to deliver static to CDN, if you're using this approach). If all changes related to this new feature will be in a one single branch - all feature related changes will be transparent for everyone in the team and can be easily tracked down later.
Another thing related to version control - is have ability to easily provision and destroy infrastructures for review apps semi-automatically using triggers and capabilities of your CI/CD tools for automated and manual testing. It's even possible to run automated tests for your changes in infrastructure declaration.
If you working on multiple similar project or if your project requires multiple similar but isolated from each other environment, IaC can help save countless hours of provisioning and tracking down everything. Although it's not always silver bullet, but in almost all cases it helps with saving time and avoiding most of accidental mistakes.
Last but not least - it helps with seeing bigger picture if you working with hybrid or multicloud environments. Not as good as infrastructural diagrams, but diagrams might not be always up date unlike your code.

Is there a way to discover VMs using Terraform?

Infrastructure team members are creating, deleting and modifying resources in GCP project using console. Security team wants to scan the infra and check weather proper security measures are taken care
I am tryng to create a terraform script which will:
1. Take project ID as input and list all instances of the given project.
2. Loop all the instances and check if the security controls are in place.
3. If any security control is missing, terraform script will be modifying the resource(VM).
I have to repeat the same steps for all resoources available in project like subnet, cloud storage buckets, firewalls etc.
As per my initial investigation to do such task We will have to import the resources to terraform using "terraform import" command and after that will have to think of loops.
Now it looks like using APIs of GCP is the best fit for this task, as it looks terraform is not the good choice for this kind of tasks and I am not sure weather it is achievable using teffarform.
Can somebody provide any directions here?
Curious if by "console" you mean the gcp console (aka by hand), because if you are not already using terraform to create the resources (and do not plan to in the future), then terraform is not the correct tool for what you're describing. I'd actually argue it is increasing the complexity.
Mostly because:
The import feature is not intended for this kind of use case and we still find regular issues with it. Maybe 1 time for a few resources, but not for entire environments and not without it becoming the future source of truth. Projects such as terraforming do their best but still face wild west issues in complex environments. Not all resources even support importing
Terraform will not tell you anything about the VM's that you wouldn't know from the GCP cli already. If you need more information to make an assessment about the controls then you will need to use another tool or have some complicated provisioners. Provisioners at best would end up being a wrapper around other tooling you could probably use directly.
Honestly, I'm worried your team is trying to avoid the pain of converting older practices to IaC. It's uncomfortable and challenging, but yields better fruit in the long run then the path you're describing.
Digress, if you have infra created via terraform then I'd invest more time in some other practices that can accomplish the same results. Some other options are: 1) enforce best practices via parent modules that security has "blessed", 2) implement some CI on your terraform, 3) AWS has Config and Systems Manager, not sure if GCP has an equivalent but I would look around. Also it's worth evaluating using different technologies for different layers of abstraction. What checks your OS might be different from what checks your security groups and that's ok. Knowing is half the battle and might make for a more sane first version then automatic remediation.
With or without terraform, there is a an ecosystem of both products and opensource projects that can help with the compliance or control enforcement. Take a look at tools like inspec, sentinel, or salstack for inspiration.

AWS services appropriate for concurrent access to a resource

I'm designing a system where a cluster of EC2 instances do some computing and then update a large file continually. What would be ideal is if I could have the file in S3, and have all the instances take turns writing to it one at a time, performing calculations while they wait.
As it stands if 2 instances PUT to S3 at the same time, 1 will simply override the other.
How can I solve this concurrency issue?
AWS has a preview service called EFS (http://aws.amazon.com/documentation/efs/) that is an NFS4 that can be shared among EC2 instances. But such service alone does not solve your problem as you may still have concurrency issues. Consider having something more sophisticated such as exploiting "embarrassingly parallel processing" such as having N processes creating N file chunks and finally having a single file joining all pieces together when everything is done.
As it is Amazon states that if you receive a success code then your S3 object is committed. Amazon also adds that there wouldn't be any dirty writes or overlapping inconsistency - you would read either of a fully committed write.
If you need more control you might be able to do it application like implementing a critical section.
It certainly makes sense to enable versioning the bucket so that you get to maintain all the writes and later you can specify which version as the latest.
You can also leverage the life cycle rules delete ( keep deleting ) the last n version to save cost.

AWS CloudFormation vs. Web Console?

I'm trying to understand the real-world usefulness of AWS CloudFormation. It seems to be a way of describing AWS infrastructure as a JSON file, but even then I'm struggling to understand what benefits that serves (besides potentially "recording" your infrastructure changes in VCS).
Of what use does CloudFormation's JSON files serve? What benefits does it have over using the AWS web console and making changes manually?
CloudFormation gives you the following benefits:
You get to version control your infrastructure. You have a full record of all changes made, and you can easily go back if something goes wrong. This alone makes it worth using.
You have a full and complete documentation of your infrastructure. There is no need to remember who did what on the console when, and exactly how things fit together - it is all described right there in the stack templates.
In case of disaster you can recreate your entire infrastructure with a single command, again without having to remember just exactly how things were set up.
You can easily test changes to your infrastructure by deploying separate stacks, without touching production. Instead of having permanent test and staging environments you can create them automatically whenever you need to.
Developers can work on their own, custom stacks while implementing changes, completely isolated from changes made by others, and from production.
It really is very good, and it gives you both more control, and more freedom to experiment.
First, you seem to underestimate the power of tracking changes in your infrastructure provisioning and configuration in VCS.
Provisioning and editing your infrastructure configuration via web interface is usually very lengthy process. Having the configuration in a file versus having it in multiple web dashboards gives you the much needed perspective and overall glance at what you use and what is it's configuration. Also, when you repeatedly configure similar stacks, you can re-use the code and avoid errors or mistakes.
It's also important to note that AWS CloudFormation resources frequently lag behind development of services available in the AWS Console. CloudFormation also requires gathering some know-how and time getting used to it, but in the end the benefits prevail.