Google GKE startup script not working for GKE Node - google-cloud-platform

I am adding a startup script to my GKE nodes using Terraform:
provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = "google-key.json"
}

terraform {
  backend "gcs" {
    bucket      = "tf-state-bucket-devenv"
    prefix      = "terraform"
    credentials = "google-key.json"
  }
}
resource "google_container_cluster" "primary" {
  name                     = var.kube-clustername
  location                 = var.zone
  remove_default_node_pool = true
  initial_node_count       = 1

  master_auth {
    username = ""
    password = ""

    client_certificate_config {
      issue_client_certificate = false
    }
  }
}
resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = var.kube-poolname
  location   = var.zone
  cluster    = google_container_cluster.primary.name
  node_count = var.kube-nodecount

  node_config {
    preemptible  = var.kube-preemptible
    machine_type = "n1-standard-1"
    disk_size_gb = 10
    disk_type    = "pd-standard"

    metadata = {
      disable-legacy-endpoints = "true",
      startup_script           = "cd /mnt/stateful_partition/home && echo hi > test.txt"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}
When I go into the GCP interface, select the node, and view the metadata, I can see the key/value is there:
metadata_startup_script
#!/bin/bash
sudo su && cd /mnt/stateful_partition/home && echo hi > test.txt
However, when running the below command on the node:
sudo google_metadata_script_runner --script-type startup --debug
I got the below -
startup-script: INFO Starting startup scripts.
startup-script: INFO No startup scripts found in metadata.
startup-script: INFO Finished running startup scripts.
Does anyone know why this script is not working/showing up? Is it because it's a GKE node and Google doesn't let you edit these? I can't actually find anything in their documentation where they specifically say that.

You cannot specify startup scripts to run on GKE nodes. The node has a built-in startup sequence to initialize the node and join it to the cluster, and to ensure that this works properly (e.g. to ensure that when you ask for 100 nodes you get 100 functional nodes), you cannot add additional logic to the startup sequence.
As an alternative, you can create a DaemonSet that runs on all of your nodes to perform node-level initialization. One advantage of this is that you can tweak your DaemonSet and re-apply it to existing nodes (without having to recreate them) if you want to change how they are configured.
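For instance, node-level setup like the echo in the question could be managed from the same Terraform code via the kubernetes provider. This is only a rough sketch, assuming the kubernetes provider is already configured against the cluster; the resource names, image, and init command are illustrative, not from the question:

```terraform
# Hypothetical sketch: a DaemonSet that performs node-level setup on every node.
resource "kubernetes_daemonset" "node_init" {
  metadata {
    name      = "node-init"
    namespace = "kube-system"
  }

  spec {
    selector {
      match_labels = {
        app = "node-init"
      }
    }

    template {
      metadata {
        labels = {
          app = "node-init"
        }
      }

      spec {
        container {
          name  = "init"
          image = "busybox"
          # Write the marker file on the node's stateful partition, then idle.
          command = ["sh", "-c", "echo hi > /host-home/test.txt && sleep infinity"]

          volume_mount {
            name       = "host-home"
            mount_path = "/host-home"
          }
        }

        volume {
          name = "host-home"

          host_path {
            path = "/mnt/stateful_partition/home"
          }
        }
      }
    }
  }
}
```

Because it is a DaemonSet, any node added to the pool later gets the same initialization automatically.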

Replace the metadata key name metadata_startup_script with startup_script.
In addition, your startup script runs as the root user, so you don't need to perform a sudo su.

Related

Django Terraform digitalOcean re-create environment in new droplet

I have a SaaS-based Django app. When a customer asks to use my software, I want to auto-provision a new droplet and auto-deploy the app there, and the info should be saved in my database: IP, customer name, database info, etc.
This is my Terraform script, and it is working very well so far:
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

provider "digitalocean" {
  token = "dop_v1_60f33a1<MyToken>a363d033"
}

resource "digitalocean_droplet" "web" {
  image    = "ubuntu-18-04-x64"
  name     = "web-1"
  region   = "nyc3"
  size     = "s-1vcpu-1gb"
  ssh_keys = ["93:<The SSH finger print>::01"]

  connection {
    host        = self.ipv4_address
    user        = "root"
    type        = "ssh"
    private_key = file("/home/py/.ssh/id_rsa") # it works
    timeout     = "2m"
  }

  provisioner "remote-exec" {
    inline = [
      "export PATH=$PATH:/usr/bin",
      # install docker-compose
      # install docker
      # clone my github repo
      "docker-compose up --build -d"
    ]
  }
}
I want it so that when I run the commands, it creates a new droplet and a new database instance and connects the database to my Django .env file.
Everything should be auto-created. Can anyone please help me with how I can do it?
Or is my approach wrong? In this situation, what would be the best solution?
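As a starting point, the database side can be sketched in the same Terraform file. This is only a rough sketch: the digitalocean_database_cluster arguments and attributes (uri, host, port) are real provider fields, but the names, sizes, and the /app/.env path are illustrative assumptions, and the ssh_keys/app-deploy details from the question are elided:

```terraform
# Hypothetical sketch: one managed Postgres cluster per customer.
resource "digitalocean_database_cluster" "customer_db" {
  name       = "customer-db"
  engine     = "pg"
  version    = "13"
  size       = "db-s-1vcpu-1gb"
  region     = "nyc3"
  node_count = 1
}

resource "digitalocean_droplet" "customer_web" {
  image  = "ubuntu-18-04-x64"
  name   = "customer-web"
  region = "nyc3"
  size   = "s-1vcpu-1gb"
  # ssh_keys and the remote-exec deploy steps as in your existing droplet

  connection {
    host        = self.ipv4_address
    user        = "root"
    type        = "ssh"
    private_key = file("/home/py/.ssh/id_rsa")
  }

  # Render the managed database's connection string into the app's .env
  provisioner "file" {
    content     = "DATABASE_URL=${digitalocean_database_cluster.customer_db.uri}"
    destination = "/app/.env"
  }
}

# The details you want to save in your own records after `terraform apply`.
output "customer_info" {
  sensitive = true
  value = {
    ip      = digitalocean_droplet.customer_web.ipv4_address
    db_host = digitalocean_database_cluster.customer_db.host
    db_port = digitalocean_database_cluster.customer_db.port
  }
}
```

For per-customer provisioning you would typically drive this from a Terraform workspace or a variable per customer, rather than duplicating the file.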

Can't start a self managed node group through Terraform

I have been trying to deploy a self-managed node group through Terraform for days now. Deploying a non-self-managed one works right out of the box; however, I have the following issue with the self-managed one. This is what my code looks like:
self_managed_node_groups = {
  self_mg_4 = {
    node_group_name        = "self-managed-ondemand"
    subnet_ids             = module.aws_vpc.private_subnets
    create_launch_template = true
    launch_template_os     = "amazonlinux2eks"
    custom_ami_id          = "xxx"
    public_ip              = false

    pre_userdata = <<-EOT
      yum install -y amazon-ssm-agent
      systemctl enable amazon-ssm-agent && systemctl start amazon-ssm-agent
    EOT

    disk_size     = 5
    instance_type = "t2.small"
    desired_size  = 1
    max_size      = 5
    min_size      = 1
    capacity_type = ""

    k8s_labels = {
      Environment = "dev-test"
      Zone        = ""
      WorkerType  = "SELF_MANAGED_ON_DEMAND"
    }

    additional_tags = {
      ExtraTag    = "t2x-on-demand"
      Name        = "t2x-on-demand"
      subnet_type = "private"
    }

    create_worker_security_group = false
  }
}
This is the module I use: github.com/aws-samples/aws-eks-accelerator-for-terraform
And this is what Terraform throws after 10 mins:
Error: "Cluster": Waiting up to 10m0s: Need at least 1 healthy instances in ASG, have 0.
Cause: "At 2022-02-10T16:46:14Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.", Description: "Launching a new EC2 instance. Status Reason: The requested configuration is currently not supported. Please check the documentation for supported configurations. Launching EC2 instance failed.", StatusCode: "Failed"
Full code:
https://pastebin.com/mtVGC8PP
The solution was actually changing my t2.small to t3.small. It turns out my AZs didn't support t2.
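This class of failure can be caught up front: the AWS provider can list which availability zones actually offer a given instance type. A small sketch, assuming an AWS provider is configured for the region in question (the data source and its arguments are real provider features; t3.small is just this question's example):

```terraform
# Hypothetical sketch: list the AZs in the current region that offer t3.small.
data "aws_ec2_instance_type_offerings" "t3_small" {
  location_type = "availability-zone"

  filter {
    name   = "instance-type"
    values = ["t3.small"]
  }
}

output "azs_with_t3_small" {
  value = data.aws_ec2_instance_type_offerings.t3_small.locations
}
```

If an AZ you use is missing from the output, the ASG launch will fail there with exactly the "requested configuration is currently not supported" error above.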

cannot ssh into instance created from sourceImage "google_compute_instance_from_machine_image"

I am creating an instance from a sourceImage, using this terraform template:
resource "tls_private_key" "sandbox_ssh" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

output "tls_private_key_sandbox" {
  value = "${tls_private_key.sandbox_ssh.private_key_pem}"
}

locals {
  custom_data1 = <<CUSTOM_DATA
#!/bin/bash
CUSTOM_DATA
}

resource "google_compute_instance_from_machine_image" "sandboxvm_test_fromimg" {
  project              = "<proj>"
  provider             = google-beta
  name                 = "sandboxvm-test-fromimg"
  zone                 = "us-central1-a"
  tags                 = ["test"]
  source_machine_image = "projects/<proj>/global/machineImages/sandboxvm-test-img-1"
  can_ip_forward       = false

  labels = {
    owner   = "test"
    purpose = "test"
    ami     = "sandboxvm-test-img-1"
  }

  metadata = {
    ssh-keys = "${var.sshuser}:${tls_private_key.sandbox_ssh.public_key_openssh}"
  }

  network_interface {
    network = "default"
    access_config {
      // Include this section to give the VM an external IP address
    }
  }

  metadata_startup_script = local.custom_data1
}

output "instance_ip_sandbox" {
  value = google_compute_instance_from_machine_image.sandboxvm_test_fromimg.network_interface.0.access_config.0.nat_ip
}

output "user_name" {
  value = var.sshuser
}
I can't even ping or netcat either the private or the public IP of the VM that was created. Even the "serial port" SSH passed inside the custom script doesn't help.
I suspect that since this is a "google-beta" capability, it may not be fully working/reliable yet.
Maybe we just can't yet create VMs (i.e. GCEs) from machine images in GCP, unless proven otherwise, and it is a simple goof-up that is not very evident in my TF.
I actually managed to solve it, and all of this feels like a quirk of GCE.
The problem was that while creating the base image, the instance I had chosen had run the following:
#sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.6 2
#sudo update-alternatives --install /usr/bin/python3 python /usr/bin/python3.7 1
Maybe I should have tried with "python3" instead of "python", but when instantiating GCEs from this machine image, it looks for the rather deprecated "python2.7" instead of "python3" and complains about missing/unreadable packages like netplan.
Commenting out the "update-alternatives" lines and installing python3.6 and python3.7 explicitly did the trick!

What may be the reason for helm_release in a Terraform script being terribly slow?

I have a Terraform script for my AWS EKS cluster, with the following pieces in it:
provider "helm" {
  alias = "helm"
  debug = true

  kubernetes {
    host                   = module.eks.endpoint
    cluster_ca_certificate = module.eks.ca_certificate
    token                  = data.aws_eks_cluster_auth.cluster.token
    load_config_file       = false
  }
}
and:
resource "helm_release" "prometheus_operator" {
  provider = "helm"

  depends_on = [
    module.eks.aws_eks_auth
  ]

  chart = "stable/prometheus-operator"
  name  = "prometheus-operator"

  values = [
    file("staging/prometheus-operator-values.yaml")
  ]

  wait    = false
  version = "8.12.12"
}
With this setup it takes ~15 minutes to install the required chart with terraform apply, and sometimes it fails (with helm ls giving a pending-install status). On the other hand, if I use the following command:
helm install prometheus-operator stable/prometheus-operator -f staging/prometheus-operator-values.yaml --version 8.12.12 --debug
the required chart gets installed in ~3 minutes and never fails. What is the reason for this behavior?
EDIT
Here is a log file from a failed installation. It's quite big (5.6 MB). What bothers me a bit is located in lines 47725 and 56045.
What's more, helm status prometheus-operator gives valid output (as if it was successfully installed); however, there are no pods defined.
EDIT 2
I've also raised an issue.

Can Terraform set a variable from a remote_exec command?

I'm trying to build a Docker Swarm cluster in AWS using Terraform. I've successfully got a Swarm manager started, but I'm trying to work out how best to pass the join key to the workers (which will be created after the manager).
I'd like some way of running the docker swarm join-token worker -q command whose output can be set to a Terraform variable. That way, the workers can have a remote_exec command something like docker swarm join ${var.swarm_token} ${aws_instance.swarm-manager.private_ip}.
How can I do this?
My config is below:
resource "aws_instance" "swarm-manager" {
  ami           = "${var.manager_ami}"
  instance_type = "${var.manager_instance}"

  tags = {
    Name = "swarm-manager${count.index + 1}"
  }

  provisioner "remote-exec" {
    inline = [
      "sleep 30",
      "docker swarm init --advertise-addr ${aws_instance.swarm-manager.private_ip}",
      "docker swarm join-token worker -q" // This is the value I want to store as a variable/output/etc
    ]
  }
}
Thanks
You can use an external data source in supplement to your remote provisioning script.
This can shell into your swarm managers and get the token after they are provisioned.
If you have N swarm managers, you'll probably have to do it all at once after the managers are created. External data sources return a map of plain strings, so using keys that enable you to select the right result for each node is required, or return the whole set as a delimited string, and use element() and split() to get the right item.
resource "aws_instance" "swarm_manager" {
  ami           = "${var.manager_ami}"
  instance_type = "${var.manager_instance}"

  tags = {
    Name = "swarm-manager${count.index + 1}"
  }

  provisioner "remote-exec" {
    inline = [
      "sleep 30",
      "docker swarm init --advertise-addr ${self.private_ip}"
    ]
  }
}
data "external" "swarm_token" {
  program = ["bash", "${path.module}/get_swarm_tokens.sh"]

  query = {
    # query values must be plain strings, so join the manager IPs
    swarms = "${join(",", aws_instance.swarm_manager.*.private_ip)}"
  }
}
resource "aws_instance" "swarm_node" {
  count = "${var.swarm_size}"
  ami   = "${var.node_ami}"

  tags = {
    Name = "swarm-node-${count.index}"
  }

  provisioner "remote-exec" {
    inline = [
      "# Enrol me in the right swarm, distributed over swarms available",
      "./enrol.sh ${element(split("|", data.external.swarm_token.result.tokens), count.index)}"
    ]
  }
}