Terraform: wait till the instance is "reachable"

I have some Terraform code with an aws_instance and a null_resource:
resource "aws_instance" "example" {
ami = data.aws_ami.server.id
instance_type = "t2.medium"
key_name = aws_key_pair.deployer.key_name
tags = {
name = "example"
}
vpc_security_group_ids = [aws_security_group.main.id]
}
resource "null_resource" "example" {
provisioner "local-exec" {
command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -T 300 -i ${aws_instance.example.public_dns}, --user centos --private-key files/id_rsa playbook.yml"
}
}
It kind of works, but sometimes it fails (probably when the instance is still in a pending state). When I rerun Terraform, it works as expected.
Question: how can I run local-exec only once the instance is running and accepting SSH connections?

The null_resource is currently only going to wait until the aws_instance resource has completed which in turn only waits until the AWS API returns that it is in the Running state. There's a long gap from there to the instance starting the OS and then being able to accept SSH connections before your local-exec provisioner can connect.
One way to handle this is to use the remote-exec provisioner on the instance first as that has the ability to wait for the instance to be ready. Changing your existing code to handle this would look like this:
resource "aws_instance" "example" {
ami = data.aws_ami.server.id
instance_type = "t2.medium"
key_name = aws_key_pair.deployer.key_name
tags = {
name = "example"
}
vpc_security_group_ids = [aws_security_group.main.id]
}
resource "null_resource" "example" {
provisioner "remote-exec" {
connection {
host = aws_instance.example.public_dns
user = "centos"
file = file("files/id_rsa")
}
inline = ["echo 'connected!'"]
}
provisioner "local-exec" {
command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -T 300 -i ${aws_instance.example.public_dns}, --user centos --private-key files/id_rsa playbook.yml"
}
}
This will first attempt to connect to the instance's public DNS address as the centos user with the files/id_rsa private key. Once it is connected it will then run echo 'connected!' as a simple command before moving on to your existing local-exec provisioner that runs Ansible against the instance.
Note that just being able to connect over SSH may not actually be enough for you to then provision the instance. If your Ansible script tries to interact with your package manager, you may find that it is still locked by the instance's user data script. If this is the case, you will need to remotely execute a script that first waits for cloud-init to complete. An example script looks like this:
#!/bin/bash
while [ ! -f /var/lib/cloud/instance/boot-finished ]; do
  echo -e "\033[1;36mWaiting for cloud-init..."
  sleep 1
done
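To wire that wait into the Terraform flow, the loop can be run through the remote-exec provisioner before the local-exec Ansible step. A minimal sketch, assuming the loop above is saved locally as files/wait-for-cloud-init.sh (that path is an assumption, not from the original answer):
resource "null_resource" "example" {
  provisioner "remote-exec" {
    connection {
      host        = aws_instance.example.public_dns
      user        = "centos"
      private_key = file("files/id_rsa")
    }

    # Uploads and runs the wait loop above before Ansible is invoked.
    script = "files/wait-for-cloud-init.sh"
  }

  provisioner "local-exec" {
    command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -T 300 -i ${aws_instance.example.public_dns}, --user centos --private-key files/id_rsa playbook.yml"
  }
}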

There is an Ansible-specific solution for this problem. Add this code to your playbook (there is also a pre_tasks clause if you use roles):
- name: will wait till reachable
  hosts: all
  gather_facts: no # important
  tasks:
    - name: Wait for system to become reachable
      wait_for_connection:
    - name: Gather facts for the first time
      setup:

For cases where instances are not externally exposed (about 90% of the time in most of my projects), and the SSM agent is installed on the target instance (newer AWS AMIs come pre-loaded with it), you can leverage SSM to probe the instance. Here's some sample code:
#!/bin/bash
instanceId=$1

echo "Waiting for instance to bootstrap ..."
tries=0
responseCode=1

while [[ $responseCode != 0 && $tries -le 10 ]]
do
  echo "Try # $tries"
  cmdId=$(aws ssm send-command --document-name AWS-RunShellScript --instance-ids $instanceId --parameters commands="cat /tmp/job-done.txt # or some other validation logic" --query Command.CommandId --output text)
  sleep 5
  responseCode=$(aws ssm get-command-invocation --command-id $cmdId --instance-id $instanceId --query ResponseCode --output text)
  echo "ResponseCode: $responseCode"
  if [ $responseCode != 0 ]; then
    echo "Sleeping ..."
    sleep 60
  fi
  (( tries++ ))
done

echo "Wait time over. ResponseCode: $responseCode"
Assuming you have the AWS CLI installed locally, you can make this null_resource a prerequisite of whatever acts on the instance. In my case, I was building an AMI.
resource "null_resource" "wait_for_instance" {
depends_on = [
aws_instance.my_instance
]
triggers = {
always_run = "${timestamp()}"
}
provisioner "local-exec" {
command = "${path.module}/scripts/check-instance-state.sh ${aws_instance.my_instance.id}"
}
}

Related

How can I have a Terraform output become a permanent value in a userdata script?

I'm not sure what the best way to do this is, but I want to deploy EFS and an ASG + Launch Template with Terraform. I'd like my user data script (in my launch template) to run commands to mount EFS.
For example:
sudo mount -t efs -o tls fs-0b28edbb9efe91c25:/ efs
My issue is that my user data script needs to receive my EFS ID, and not just on the initial deploy: it also needs it whenever I perform a rolling update. I want to be able to change the AMI ID in my launch template, which performs a rolling update when I run terraform apply, and my EFS ID needs to be in my user data script so it can run the command to mount EFS.
Is there a way to have a Terraform output become a permanent part of my user data script? What are other alternatives for making this happen? Would it involve CloudFormation or other AWS services?
main.tf
resource "aws_vpc" "mtc_vpc" {
cidr_block = "10.123.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "dev"
}
}
resource "aws_launch_template" "foobar" {
name_prefix = "LTTest"
image_id = "ami-017c001a88dd93847"
instance_type = "t2.micro"
update_default_version = true
key_name = "lttest"
user_data = base64encode(templatefile("${path.module}/userdata.sh", {efs_id = aws_efs_file_system.foo.id}))
iam_instance_profile {
name = aws_iam_instance_profile.test_profile.name
}
vpc_security_group_ids = [aws_security_group.mtc_sg.id]
}
resource "aws_autoscaling_group" "bar" {
desired_capacity = 2
max_size = 2
min_size = 2
vpc_zone_identifier = [
aws_subnet.mtc_public_subnet1.id
]
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 50
}
}
launch_template {
id = "${aws_launch_template.foobar.id}"
version = aws_launch_template.foobar.latest_version
}
}
resource "aws_efs_file_system" "foo" {
creation_token = "jira-efs"
}
resource "aws_efs_mount_target" "alpha" {
file_system_id = aws_efs_file_system.foo.id
subnet_id = aws_subnet.mtc_public_subnet1.id
security_groups = [aws_security_group.mtc_sg.id]
}
Update:
User-data Script:
#!/usr/bin/env bash
sudo yum install -y amazon-efs-utils
sudo yum install -y git
cd /home/ec2-user
mkdir efs
sudo mount -t efs -o tls ${efs_id}:/ efs
There are a few ways to do this. A couple that come to mind are:
1. Provide the EFS ID to the user data script using the templatefile() function.
2. Give your EC2 instance permissions (via IAM) to use the EFS API to search for the ID (a sketch of this option follows at the end of this answer).
The first option is probably the most practical.
First, define your EFS filesystem (and associated aws_efs_mount_target and aws_efs_access_point resources, but I'll omit those here):
resource "aws_efs_file_system" "efs" {}
Now you can define the user data with the templatefile() function:
resource "aws_launch_template" "foo" {
# ... all the attributes ...
user_data = base64encode(templatefile("${path.module}/user-data.sh.tpl", {
efs_id = aws_efs_file_system.efs.id # Use dns_name or id here
}))
}
The contents of user-data.sh.tpl can have all your set up steps, including the filesystem mount:
sudo mount -t efs -o tls ${efs_id}:/ efs
When Terraform renders the user data in the launch template, it will substitute the variable.
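For the second option, a rough sketch: grant the instance profile permission to query EFS and resolve the ID at boot. The role name and the user data lookup below are assumptions for illustration, not part of the configuration above:
resource "aws_iam_role_policy" "efs_lookup" {
  name = "efs-lookup"
  role = aws_iam_role.instance_role.id # hypothetical role behind aws_iam_instance_profile.test_profile

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["elasticfilesystem:DescribeFileSystems"]
      Resource = "*"
    }]
  })
}

# The user data would then resolve the ID itself, for example:
#   efs_id=$(aws efs describe-file-systems --creation-token jira-efs \
#     --query 'FileSystems[0].FileSystemId' --output text)
#   sudo mount -t efs -o tls $efs_id:/ efs
This avoids baking the ID into the rendered user data, at the cost of requiring the AWS CLI and IAM permissions on the instance.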

Pass ProxyCommand to Terraform Provisioner 'local-exec' Successfully

I am setting up several servers in AWS, using Terraform to deploy them and Ansible to configure them (the configuration is quite complex). I would like to accomplish all of this from Terraform, but I can't seem to get the ProxyCommand to execute correctly (I believe due to the use of mixed quotes). I need to use the ProxyCommand because the commands must be proxied through a bastion host. First I provision the bastion:
resource "aws_instance" "bastion" {
ami = var.ubuntu2004
instance_type = "t3.small"
associate_public_ip_address = true
subnet_id = aws_subnet.some_subnet.id
vpc_security_group_ids = [aws_security_group.temp.id]
key_name = "key"
tags = {
Name = "bastion"
}
}
and then I deploy another server which I would like to configure with Ansible utilizing Terraform's provisioner 'local-exec':
resource "aws_instance" "server1" {
ami = var.ubuntu2004
instance_type = "t3.small"
subnet_id = aws_subnet.some_other_subnet.id
vpc_security_group_ids = [aws_security_group.other_temp.id]
key_name = "key"
tags = {
Name = "server1"
}
provisioner "local-exec" {
command = "sleep 120; ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -u ubuntu --private-key ~/.ssh/id_rsa --ssh-common-args='-o ProxyCommand='ssh -W %h:%p ubuntu#${aws_instance.bastion.public_ip}'' -i ${self.private_ip} main.yml"
}
}
I have confirmed I can get all of this working if I just have Terraform provision the infrastructure and then manually run Ansible with the ProxyCommand input, but it fails if I try to use local-exec, seemingly because I have to incorporate multiple single quotes, which breaks the command. I am not sure if the bastion variable is referenced correctly either. It is probably a simple fix, but does anyone know how to fix this, or maybe accomplish it in an easier way? Thanks
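One quoting arrangement that avoids the nested single quotes is to single-quote the whole --ssh-common-args value once and double-quote the ProxyCommand inside it. A sketch only, not verified against this exact setup:
resource "aws_instance" "server1" {
  # ... existing arguments from the question ...

  provisioner "local-exec" {
    # The trailing comma after the IP makes Ansible treat it as an inline inventory list.
    command = "sleep 120; ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -u ubuntu --private-key ~/.ssh/id_rsa --ssh-common-args '-o ProxyCommand=\"ssh -W %h:%p ubuntu@${aws_instance.bastion.public_ip}\"' -i ${self.private_ip}, main.yml"
  }
}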

setting hostname on multiple EC2 based on tags

I am running Terraform code to create multiple EC2 instances. Is there a way to set the hostname of each instance based on a tag and a domain name? Currently I log in and run hostnamectl set-hostname ..
Here is the tf script I use to create the instances.
resource "aws_instance" "RR-TEMP-V-DB" {
ami = var.linux_ami[var.region]
availability_zone = var.availability_zone
instance_type = var.temp_instance_type
key_name = var.linux_key_name
vpc_security_group_ids = [var.vpc_security_group_ids[var.region]]
subnet_id = var.db_subnet_id
count = var.temp_count
tags = {
Name = "RR-TEMP-V-DB-${format("%02d", count.index + 1)}"
Environment = var.env_tag
}
}
Thanks
We accomplish this as part of user data; it looks similar to:
instance_name=$(aws ec2 describe-instances --instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id) --query "Reservations[*].Instances[*].Tags[?Key=='Name'].Value" --region $(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed -e "s/.$//") --output text)
sudo hostnamectl set-hostname --static $instance_name
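Wired into the question's resource, that could look roughly like the sketch below. It assumes the AMI ships with the AWS CLI and that the instance profile allows ec2:DescribeInstances; neither is shown in the original question:
resource "aws_instance" "RR-TEMP-V-DB" {
  # ... existing arguments from the question (ami, instance_type, count, tags, ...) ...

  user_data = <<-EOF
    #!/bin/bash
    # Look up this instance's Name tag and set it as the static hostname.
    # Assumes the AWS CLI is installed and the instance profile allows ec2:DescribeInstances.
    region=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed -e "s/.$//")
    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    instance_name=$(aws ec2 describe-instances --instance-id $instance_id --region $region --query "Reservations[*].Instances[*].Tags[?Key=='Name'].Value" --output text)
    hostnamectl set-hostname --static $instance_name
  EOF
}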
You can accomplish that by running it as user data, as @duhaas suggested, or by using Terraform's remote-exec provisioner. Here is Terraform's provisioner documentation; as you will see, the recommended way is setting user data at instance provision time:
https://www.terraform.io/docs/provisioners/
For more details on remote-exec:
https://www.terraform.io/docs/provisioners/remote-exec.html
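If you go the remote-exec route, a minimal sketch is below. The connection details are assumptions: adjust the user and key path to your AMI and to whatever key var.linux_key_name refers to:
resource "aws_instance" "RR-TEMP-V-DB" {
  # ... existing arguments from the question ...

  provisioner "remote-exec" {
    connection {
      type        = "ssh"
      host        = self.private_ip
      user        = "ec2-user"            # assumption: change to match your AMI
      private_key = file("~/.ssh/id_rsa") # assumption: the key behind var.linux_key_name
    }

    inline = [
      "sudo hostnamectl set-hostname --static ${self.tags["Name"]}",
    ]
  }
}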

How to make Terraform wait for cloudinit to finish?

In my Terraform AWS Docker Swarm module I use cloud-init to initialize the EC2 instances. However, Terraform says the resource is ready before cloud-init finishes. Is there a way of making it wait for cloud-init to finish, ideally without SSHing in or checking for a port to be up using a null resource?
Your managers and workers both use template_cloudinit_config. They also have the ec2:CreateTags IAM permission.
You can use an EC2 resource tag like trajano/terraform-docker-swarm-aws/cloudinit-complete to indicate that the cloudinit has finished.
You could add this final part to each to invoke a tagging script:
part {
  filename     = "tag_complete.sh"
  content      = local.tag_complete_script
  content_type = "text/x-shellscript"
}
And declare tag_complete_script to be the following:
locals {
  tag_complete_script = <<-EOF
    #!/bin/bash
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    instance_id=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 create-tags --resources "$instance_id" --tags 'Key=trajano/terraform-docker-swarm-aws/cloudinit-complete,Value=true'
  EOF
}
Then, with a null_resource, you wait for the tag to appear (I wrote this on my phone, so treat it as a general idea; I don't expect it to work without testing and edits):
resource "null_resource" "wait_for_cloudinit" {
provisioner "local-exec" {
command = <<-EOF
#!/bin/bash
poll_tags="aws ec2 describe-tags --filters 'Name=resource-id,Values=${join(",", aws_instance.managers[*].id)}' 'Name=key,Values=trajano/terraform-docker-swarm-aws/cloudinit-complete' --output text --query 'Tags[*].Value'"
expected='${join(",", formatlist("true", aws_instance.managers[*].id))}'
$tags="$($poll_tags)"
while [[ "$tags" != "$expected" ]] ; do
$tags="$($poll_tags)"
done
EOF
}
}
This way you can make any resource that needs to run after cloud-init has completed depend on null_resource.wait_for_cloudinit.
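For example, a minimal sketch of such a dependent resource (the resource name and command here are hypothetical):
resource "null_resource" "post_cloudinit_step" {
  # Runs only after every manager has reported the cloudinit-complete tag.
  depends_on = [null_resource.wait_for_cloudinit]

  provisioner "local-exec" {
    command = "echo 'cloud-init finished on all managers'"
  }
}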
Another possible approach is using AWS Systems Manager Run Command, if available on your AMI.
You create an SSM Document with Terraform that uses the cloud-init status --wait command, then you trigger the command from a local provisioner, and wait for it to complete. In this way, you don't have to play around with tags, and you are 100% sure cloud-init has been completed.
This is an example of the document you can create with Terraform:
resource "aws_ssm_document" "cloud_init_wait" {
name = "cloud-init-wait"
document_type = "Command"
document_format = "YAML"
content = <<-DOC
schemaVersion: '2.2'
description: Wait for cloud init to finish
mainSteps:
- action: aws:runShellScript
name: StopOnLinux
precondition:
StringEquals:
- platformType
- Linux
inputs:
runCommand:
- cloud-init status --wait
DOC
}
and then you can use a local-exec provisioner inside the EC2 instance block, or in a null resource, depending on what you need to do with it.
The provisioner would be more or less like this:
provisioner "local-exec" {
interpreter = ["/bin/bash", "-c"]
command = <<-EOF
set -Ee -o pipefail
export AWS_DEFAULT_REGION=${data.aws_region.current.name}
command_id=$(aws ssm send-command --document-name ${aws_ssm_document.cloud_init_wait.arn} --instance-ids ${self.id} --output text --query "Command.CommandId")
if ! aws ssm wait command-executed --command-id $command_id --instance-id ${self.id}; then
echo "Failed to start services on instance ${self.id}!";
echo "stdout:";
aws ssm get-command-invocation --command-id $command_id --instance-id ${self.id} --query StandardOutputContent;
echo "stderr:";
aws ssm get-command-invocation --command-id $command_id --instance-id ${self.id} --query StandardErrorContent;
exit 1;
fi;
echo "Services started successfully on the new instance with id ${self.id}!"
EOF
}
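For the null resource variant mentioned above, the same provisioner can be attached roughly as follows. This is a sketch: aws_instance.example is a hypothetical instance name, and self.id is replaced by an explicit reference:
resource "null_resource" "wait_for_cloud_init" {
  triggers = {
    instance_id = aws_instance.example.id # re-run the wait if the instance is replaced
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = <<-EOF
      set -Ee -o pipefail
      export AWS_DEFAULT_REGION=${data.aws_region.current.name}
      command_id=$(aws ssm send-command --document-name ${aws_ssm_document.cloud_init_wait.arn} --instance-ids ${aws_instance.example.id} --output text --query "Command.CommandId")
      aws ssm wait command-executed --command-id $command_id --instance-id ${aws_instance.example.id}
    EOF
  }
}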

Terraform provisioner error for multiple instances

When running the below file with Terraform I get the following error:
Resource 'aws_instance.nodes-opt-us-k8s' not found for variable
'aws_instance.nodes-opt.us1-k8s.id'.
Do I need to include the provisioner twice because my 'count' variable is creating two instances? When I just include one for the 'count' variable, I get an error that my Ansible playbook needs playbook files to run, which makes sense because it is empty until I figure this error out.
I am in the early stages with Terraform and Linux, so pardon my ignorance.
#-----------------------------Kubernetes Master & Worker Node Server Creations----------------------------

#-----key pair for Workernodes-----
resource "aws_key_pair" "k8s-node_auth" {
  key_name   = "${var.key_name2}"
  public_key = "${file(var.public_key_path2)}"
}

#-----Workernodes-----
resource "aws_instance" "nodes-opt-us1-k8s" {
  instance_type = "${var.k8s-node_instance_type}"
  ami           = "${var.k8s-node_ami}"
  count         = "${var.NodeCount}"

  tags {
    Name = "nodes-opt-us1-k8s"
  }

  key_name               = "${aws_key_pair.k8s-node_auth.id}"
  vpc_security_group_ids = ["${aws_security_group.opt-us1-k8s_sg.id}"]
  subnet_id              = "${aws_subnet.opt-us1-k8s.id}"

  #-----Link Terraform worker nodes to Ansible playbooks-----
  provisioner "local-exec" {
    command = <<EOD
cat <<EOF >> workers
[workers]
${self.public_ip}
EOF
EOD
  }

  provisioner "local-exec" {
    command = "aws ec2 wait instance-status-ok --instance-ids ${aws_instance.nodes-opt-us1-k8s.id} --profile Terraform && ansible-playbook -i workers Kubernetes-Nodes.yml"
  }
}
Terraform 0.12.26 resolved a similar issue for me (when using multiple file provisioners while deploying multiple VMs to Azure).
Hope this helps you: https://github.com/hashicorp/terraform/issues/22006
When using a provisioner and referring to the resource the provisioner is attached to you need to use the self keyword as you've already spotted with what you are writing to the file.
So in your case you want to use the following provisioner block:
...
provisioner "local-exec" {
  command = <<EOD
cat <<EOF >> workers
[workers]
${self.public_ip}
EOF
EOD
}

provisioner "local-exec" {
  command = "aws ec2 wait instance-status-ok --instance-ids ${self.id} --profile Terraform && ansible-playbook -i workers Kubernetes-Nodes.yml"
}