Downsizing/Scaling-in MySQL cluster nodes - amazon-web-services

I am trying to set up MySQL Cluster within an AWS Auto Scaling group. I am starting out with two EC2 instances, each running its own management (ndb_mgmd), data (ndbmtd) and SQL (mysqld) node. When scaling out (I have configured live scale-out, which works fine), it adds two more EC2 instances (because the number of replicas is set to 2 for ndbd) and creates a new nodegroup.
Now, since I can't control exactly which instances AWS shuts down during a scale-in event, it always takes out a whole nodegroup, rendering the cluster invalid and causing it to crash.
From what I can see, MySQL Cluster is not really designed to scale in online, but is there a way I can achieve this without bringing the whole system down for maintenance? The idea is to add new identical instances to the cluster during scale-out and take instances off during scale-in events fired by the AWS Auto Scaling group.
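For reference, the Auto Scaling API does allow individual instances to be protected from scale-in, which is one possible way to keep AWS from removing both replicas of the same nodegroup; a minimal sketch, with the group name and instance ID below as placeholders:
# Protect one data node per nodegroup so a scale-in event cannot take out
# both replicas of the same nodegroup at once (IDs are placeholders)
aws autoscaling set-instance-protection \
  --auto-scaling-group-name my-ndb-asg \
  --instance-ids i-0aaa1111bbb2222cc \
  --protected-from-scale-in
The protection can later be removed with --no-protected-from-scale-in once a nodegroup has been cleanly drained.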
Let me know if I missed out on any details, cheers!
This is what the initial config looks like:
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=1 #10.0.0.149 (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0, *)
id=2 #10.0.0.81 (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0)
[ndb_mgmd(MGM)] 2 node(s)
id=101 #10.0.0.149 (mysql-5.6.31 ndb-7.4.12)
id=102 #10.0.0.81 (mysql-5.6.31 ndb-7.4.12)
[mysqld(API)] 2 node(s)
id=51 #10.0.0.149 (mysql-5.6.31 ndb-7.4.12)
id=52 #10.0.0.81 (mysql-5.6.31 ndb-7.4.12)
This is an example of scaled out version of the same cluster (+2 instances):
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=1 #10.0.0.149 (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0, *)
id=2 #10.0.0.81 (mysql-5.6.31 ndb-7.4.12, Nodegroup: 0)
id=3 #10.0.0.151 (mysql-5.6.31 ndb-7.4.12, Nodegroup: 1)
id=4 #10.0.0.83 (mysql-5.6.31 ndb-7.4.12, Nodegroup: 1)
[ndb_mgmd(MGM)] 4 node(s)
id=101 #10.0.0.149 (mysql-5.6.31 ndb-7.4.12)
id=102 #10.0.0.81 (mysql-5.6.31 ndb-7.4.12)
id=103 #10.0.0.151 (mysql-5.6.31 ndb-7.4.12)
id=104 #10.0.0.83 (mysql-5.6.31 ndb-7.4.12)
[mysqld(API)] 4 node(s)
id=51 #10.0.0.149 (mysql-5.6.31 ndb-7.4.12)
id=52 #10.0.0.81 (mysql-5.6.31 ndb-7.4.12)
id=53 #10.0.0.151 (mysql-5.6.31 ndb-7.4.12)
id=54 #10.0.0.83 (mysql-5.6.31 ndb-7.4.12)
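For reference, a single data node can be taken offline from the management client without stopping the whole cluster, as long as its nodegroup partner stays up; a minimal sketch (node id 4 is just an example taken from the listing above):
# Stop one data node gracefully; node 3 keeps serving nodegroup 1
ndb_mgm -e "4 STOP"
# Confirm the remaining topology
ndb_mgm -e "SHOW"
Stopping both nodes of a nodegroup this way would still leave the cluster without that nodegroup's data, which is exactly the failure described above.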

Related

EKS: can't see nodes and nodes are not joining the cluster

I have read all the AWS articles and followed them one by one, but none of them worked. Let me briefly summarize my situation. I created EKS automation with Terraform: 1 VPC, 3 public subnets, 3 private subnets, 3 security groups, 1 NAT gateway (on the public subnets), and 2 autoscaled worker node groups. I checked all the infrastructure created with Terraform; there are no problems there.
My main problem is that after the installation I can't see the nodes and the nodes are not joining the cluster. I applied the steps below but they didn't work. What should I do? By the way, please don't tag my question as a duplicate; I checked all the similar questions on Stack Overflow. My steps look correct but do not work.
kubectl get nodes
No resources found
Before checking the nodes with the command above, I first applied the command below to set up kubeconfig.
aws eks update-kubeconfig --name eks-DS7h --region us-east-1
Here is my kubeconfig:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJfgzsfhadfzasdfrzsd.........
    server: https://0F97E579A.gr7.us-east-1.eks.amazonaws.com
  name: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
contexts:
- context:
    cluster: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
    user: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
  name: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
current-context: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
kind: Config
preferences: {}
users:
- name: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - --region
      - us-east-1
      - eks
      - get-token
      - --cluster-name
      - eks-DS7h
      command: aws
After this I checked the nodes again but still got No resources found. Then I tried to edit aws-auth. Before the edit, I checked my user in the terminal where I ran all the Terraform installation steps.
aws sts get-caller-identity
{
"UserId": "ASDFGSDFGDGSDGDFHSFDSDC",
"Account": "545153234644",
"Arn": "arn:aws:iam::545153234644:user/white"
}
I took my user info and added it to the previously blank mapUsers area in aws-auth, but I am still getting No resources found.
kubectl get cm -n kube-system aws-auth
apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      - "system:masters"
      "rolearn": "arn:aws:iam::545153234644:role/eks-DS7h22060508195731770000000e"
      "username": "system:node:{{EC2PrivateDNSName}}"
  mapUsers: "- \"userarn\": \"arn:aws:iam::545153234644:user/white\"\n \"username\":
    \"white\"\n \"groups\":\n - \"system:masters\"\n - \"system:nodes\" \n"
kind: ConfigMap
metadata:
  creationTimestamp: "2022-06-05T08:20:02Z"
  labels:
    app.kubernetes.io/managed-by: Terraform
    terraform.io/module: terraform-aws-modules.eks.aws
  name: aws-auth
  namespace: kube-system
  resourceVersion: "4976"
  uid: b12341-33ff-4f78-af0a-758f88
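A minimal diagnostic sketch of what one might check next for the node-join problem (the cluster name comes from the question above; SSH/SSM access to a worker node is an assumption not shown here):
# Cluster must be ACTIVE
aws eks describe-cluster --name eks-DS7h --query cluster.status
# The node role ARN must appear under mapRoles
kubectl -n kube-system get configmap aws-auth -o yaml
# On a worker node (SSH or SSM): look for registration/auth errors
journalctl -u kubelet --no-pager | tail -n 50
tail -n 50 /var/log/cloud-init-output.log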
Also, when I check the EKS cluster on the dashboard I see the warning below. I don't know whether it is relevant or not, but I want to share it too in case it helps.

Can't Re-allocate Elastic IP to EC2 Scaling Group

I've tried several methods to re-allocate the Elastic IP, but no luck:
amazon web services - AWS EC2 User Data script to allocate Elastic IP - Stack Overflow
Elastic IP in an Auto-Scaling Group | by lakshman sundaram | Medium
I configured the EC2 Scaling Group to work with Launch Template and Launch Configuration as follows:
min= 1, desired= 1, max= 2.
I have two subnets in the same region but in two different availability zones.
Whenever I terminate an instance, the new instance launches but it doesn't automatically receive a public IP, even though the setting is set to auto-assign a public IP. Sometimes it gets a new public IP, but it's different from the one that I wanted.
I'm currently using one Elastic IP.
User data:
#!/bin/bash
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
MAXWAIT=3
ALLOC_ID=eipalloc-redacted
AWS_DEFAULT_REGION=us-east-1
# Make sure the EIP is free
echo "Checking if EIP with ALLOC_ID[$ALLOC_ID] is free...."
ISFREE=$(aws ec2 describe-addresses --allocation-ids $ALLOC_ID --query Addresses[].InstanceId --output text)
STARTWAIT=$(date +%s)
while [ ! -z "$ISFREE" ]; do
if [ "$(($(date +%s) - $STARTWAIT))" -gt $MAXWAIT ]; then
echo "WARNING: We waited 30 seconds, we're forcing it now."
ISFREE=""
else
echo "Waiting for EIP with ALLOC_ID[$ALLOC_ID] to become free...."
sleep 3
ISFREE=$(aws ec2 describe-addresses --allocation-ids $ALLOC_ID --query Addresses[].InstanceId --output text)
fi
done
# Now we can associate the address
echo Running: aws ec2 associate-address --instance-id $INSTANCE_ID --allocation-id $ALLOC_ID --allow-reassociation
aws ec2 associate-address --instance-id $INSTANCE_ID --allocation-id $ALLOC_ID --allow-reassociation
Role Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeAddresses",
        "ec2:AllocateAddress",
        "ec2:DescribeInstances",
        "ec2:AssociateAddress"
      ],
      "Resource": "*"
    }
  ]
}
error:
Cloud-init v. 20.4.1-0ubuntu1~20.04.1 running 'modules:config' at Fri, 12 Feb 2021 21:35:16 +0000. Up 21.57 seconds.
Checking if EIP with ALLOC_ID[eipalloc-redacted1234] is free....
/var/lib/cloud/instance/scripts/part-001: line 9: aws: command not found
Running:
/var/lib/cloud/instance/scripts/part-001: line 24: aws: command not found
/var/lib/cloud/instance/scripts/part-001: line 25: aws: command not found
Cloud-init v. 20.4.1-0ubuntu1~20.04.1 running 'modules:final' at Fri, 12 Feb 2021 21:35:28 +0000. Up 33.63 seconds.
2021-02-12 21:35:28,823 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2021-02-12 21:35:28,824 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 20.4.1-0ubuntu1~20.04.1 finished at Fri, 12 Feb 2021 21:35:29 +0000. Datasource DataSourceEc2Local. Up 34.17 seconds
I figured out the issue.
The instance was missing the awscli tool.
Once I installed it, the script worked!
Thanks to the person who wrote this script.
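For completeness, a minimal sketch of how the CLI could be installed at the top of the user data, before the EIP logic runs (assuming an Ubuntu 20.04 AMI, as the cloud-init log suggests):
#!/bin/bash
# Install the AWS CLI first so the `aws` commands below are on PATH
apt-get update -y
apt-get install -y awscli
# ... the EIP association script from the question continues here ...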

Attempting zero downtime upgrades but Targetpool is directing traffic to instances that are being torn down

I am attempting a zero-downtime upgrade of my service. So far I have been unsuccessful. The load balancer is directing traffic to the old instances as they are being taken down, despite the fact that they are unhealthy according to the load balancer's health check. I am using Terraform and GCP. The actual service that will be upgraded needs to terminate the TLS connection itself, so it has to use a network load balancer, i.e. a target pool. The regional instance group manager is there to ensure redundancy in case a zone goes down.
A toy version of the Terraform config, downsized in instance count, that still shows the problem:
variable "project" {
type = string
}
variable "region" {
type = string
default = "us-central1"
}
provider "google" {
project = var.project
region = var.region
}
resource "google_compute_region_instance_group_manager" "default" {
base_instance_name = "instance"
name = "default"
region = var.region
target_size = 3
target_pools = [
google_compute_target_pool.default.self_link,
]
update_policy {
minimal_action = "REPLACE"
type = "PROACTIVE"
max_surge_fixed = 3
max_unavailable_fixed = 0
min_ready_sec = 120
}
version {
instance_template = google_compute_instance_template.template-b.self_link
}
}
resource "google_compute_address" "default" {
name = "default"
}
resource "google_compute_target_pool" "default" {
name = "default"
region = var.region
instances = []
health_checks = [
google_compute_http_health_check.default.self_link
]
lifecycle {
ignore_changes = [
instances
]
}
}
resource "google_compute_http_health_check" "default" {
name = "default"
request_path = "/"
check_interval_sec = 1
timeout_sec = 1
healthy_threshold = 3
unhealthy_threshold = 1
}
resource "google_compute_forwarding_rule" "default" {
name = "default"
region = var.region
ip_protocol = "TCP"
port_range = "80"
target = google_compute_target_pool.default.self_link
ip_address = google_compute_address.default.address
}
data "google_compute_network" "default" {
name = "default"
}
resource "google_compute_instance_template" "template-b" {
name = "template-b1"
machine_type = "f1-micro"
disk {
boot = true
auto_delete = true
disk_size_gb = 100
disk_type = "pd-ssd"
source_image = data.google_compute_image.my_image.self_link
}
network_interface {
network = data.google_compute_network.default.self_link
}
metadata_startup_script = file("./startup-scripts/helloworld.sh")
metadata = {
instance-env = "SOFTWARE_VERSION=Version-B"
}
tags = [
"http-server"
]
lifecycle {
create_before_destroy = true
}
}
data "google_compute_image" "my_image" {
family = "ubuntu-1804-lts"
project = "ubuntu-os-cloud"
}
output "ip-address" {
value = google_compute_address.default.address
}
Boot-up script that brings up the server running on each instance (startup-scripts/helloworld.sh):
#!/bin/bash -x
METADATA_BASE=http://metadata.google.internal/computeMetadata/v1
SOFTWARE_VERSION=$(curl -sfm5 -H "Metadata-Flavor: Google" ${METADATA_BASE}/instance/attributes/instance-env)
echo "Hello World! This is ${SOFTWARE_VERSION} from $(hostname -f)" > index.html
python3 -m http.server 80 &
The problem I run into is that as I scale down the number of instances from, say, 6 to 3, I can see that some instances are labelled unhealthy by the health check, but the target pool is still directing traffic to those instances. The documentation for this implies that these instances shouldn't see any traffic.
During the resizing of the number of instances from 6 to 3,
I ran two shell scripts and got these results:
while [[ 1 ]]; do echo -n "$(date +%s) "; curl -m5 http://${IP_ADDRESS} && sleep 1; done
no timeouts
...
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838267 curl: (52) Empty reply from server
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
no time outs
while [[ 1 ]]; do echo -n "$(date +%s) "; gcloud compute target-pools get-health default --region us-central1 && sleep 1; done
all six instances are healthy
...
1589838263 ---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
1589838266 ---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
...
unhealthy for a bit
...
1589838312 ---
healthStatus:
- healthState: HEALTHY
instance: v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
As you can see from the timestamps in the two scripts' output, the unhealthy instances are still getting traffic.
This same pattern of unhealthy instances getting traffic can be seen when swapping the regional instance group manager's instance template in Terraform for a different one. It can also be seen when bringing up a second regional instance group manager, adding it to the target pool, waiting for traffic to go to the new instances, and then removing the older regional instance group manager from the target pool. I have also tried bringing up a second target pool with its own instance group manager and then changing the forwarding rule, but there I saw over a minute of downtime with no traffic going to either regional instance group.
What can I do to avoid this downtime?
From the logs it follows that there was a thirty-second interval during which the Load Balancer kept sending requests to the unresponsive instances.
gcloud compute target-pools get-health gives the following timestamps and health statuses:
Timestamp 1589838263
instance-rvcl UNHEALTHY
instance-42dm UNHEALTHY
instance-562f HEALTHY
Timestamp 1589838266
instance-rvcl UNHEALTHY
instance-42dm UNHEALTHY
instance-562f UNHEALTHY
curl output with the health statuses merged:
no timeouts
...
1589838263 # instance-562f still HEALTHY, the last response
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838266 # relative time +0s, instances -rvcl,42dm,562f are UNHEALTHY
1589838267 curl: (52) Empty reply from server
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds # relative time +30s; instances -rvcl,42dm,562f still UNHEALTHY
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
no time outs
This may be due to the delay the Load Balancer needs before it recognizes that an instance is unhealthy. During this time the Load Balancer keeps sending new requests to the instance.
Once an instance becomes unhealthy, the Load Balancer stops sending new connections there. However, existing connections might not be terminated until the shutdown script switches off the instance. The shutdown period is 90 seconds for normal instances.
Here is an example of the Health Check timeline:
Load Balancing > Doc > Health checks overview > Example health check
See also
Compute Engine > Doc > Understanding autoscaler decisions > Preparing for instance terminations
Load Balancing > Doc > Health checks overview > How health checks work > Health state
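As a rough sketch of the kind of shutdown script the "Preparing for instance terminations" page describes (the process name matches the toy helloworld.sh above; the 60-second wait is an assumption, not a documented value):
#!/bin/bash
# Stop serving first so the health check starts failing, then wait so the
# target pool has time to mark this instance UNHEALTHY and stop routing
# new connections to it before the VM disappears.
pkill -f "http.server" || true
sleep 60
The script would be attached via the shutdown-script metadata key on the instance template, for example through the metadata block in the Terraform config above.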

Cannot connect to Google Cloud SQL from Kubernetes Engine

I have one Google Cloud SQL Second Generation instance and one Google Kubernetes Engine cluster. The problem is that I cannot connect to the Cloud SQL instance over its private IP. I have enabled Private IP in the Cloud SQL dashboard and assigned it to my VPC network. However, the container still can't connect.
Is it maybe related to peering routes? Do I need to create one?
PS. I followed this guide https://cloud.google.com/sql/docs/mysql/connect-kubernetes-engine
Result of gcloud container clusters describe:
$ gcloud container clusters describe sirodoht-32-fec8e2914780bf2c
addonsConfig:
kubernetesDashboard:
disabled: true
networkPolicyConfig:
disabled: true
clusterIpv4Cidr: 10.40.0.0/14
createTime: '2019-05-12T17:07:17+00:00'
currentMasterVersion: 1.12.7-gke.10
currentNodeCount: 6
currentNodeVersion: 1.11.8-gke.6
defaultMaxPodsConstraint:
maxPodsPerNode: '110'
endpoint: <retracted ip>
initialClusterVersion: 1.11.8-gke.6
initialNodeCount: 1
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/sirodoht-32/zones/europe-west3-a/instanceGroupManagers/gke-sirodoht-32-fep8e29-pool-94e97802-grp
ipAllocationPolicy:
clusterIpv4Cidr: 10.40.0.0/14
clusterIpv4CidrBlock: 10.40.0.0/14
clusterSecondaryRangeName: gke-sirodoht-32-fec8e2914780bf2c-pods-4439d109
servicesIpv4Cidr: 10.170.0.0/20
servicesIpv4CidrBlock: 10.170.0.0/20
servicesSecondaryRangeName: gke-sirodoht-32-fec8e2914780bf2c-services-4439d109
useIpAliases: true
labelFingerprint: a9dc16a7
legacyAbac: {}
location: europe-west3-a
locations:
- europe-west3-a
loggingService: logging.googleapis.com
maintenancePolicy:
window:
dailyMaintenanceWindow:
duration: PT4H0M0S
startTime: 00:00
masterAuth:
clientCertificate: <retracted>
clientKey: <retracted>
clusterCaCertificate: <retracted>
monitoringService: monitoring.googleapis.com
name: sirodoht-32-fec8e2914780bf2c
network: compute-network-aaa8ff1ec6b52012
networkConfig:
network: projects/sirodoht-32/global/networks/compute-network-aaa8ff1ec6b52012
subnetwork: projects/sirodoht-32/regions/europe-west3/subnetworks/subnet-bb2c9eb79b29a825
nodeConfig:
diskSizeGb: 100
diskType: pd-standard
imageType: COS
machineType: n1-standard-4
oauthScopes:
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/service.management.readonly
- https://www.googleapis.com/auth/servicecontrol
- https://www.googleapis.com/auth/trace.append
serviceAccount: default
nodePools:
- config:
diskSizeGb: 100
diskType: pd-standard
imageType: COS
machineType: n1-standard-4
oauthScopes:
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/service.management.readonly
- https://www.googleapis.com/auth/servicecontrol
- https://www.googleapis.com/auth/trace.append
serviceAccount: default
initialNodeCount: 6
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/sirodoht-32/zones/europe-west3-a/instanceGroupManagers/gke-sirodoht-32-fep8e29-pool-94e97802-grp
management: {}
maxPodsConstraint:
maxPodsPerNode: '110'
name: pool
podIpv4CidrSize: 24
selfLink: https://container.googleapis.com/v1/projects/sirodoht-32/zones/europe-west3-a/clusters/sirodoht-32-fec8e2914780bf2c/nodePools/pool
status: RUNNING
version: 1.11.8-gke.6
selfLink: https://container.googleapis.com/v1/projects/sirodoht-32/zones/europe-west3-a/clusters/sirodoht-32-fec8e2914780bf2c
servicesIpv4Cidr: 10.170.0.0/20
status: RUNNING
subnetwork: subnet-bb2c9eb79b29a825
zone: europe-west3-a
* - There is an upgrade available for your cluster(s).
To upgrade nodes to the latest available version, run
$ gcloud container clusters upgrade sirodoht-32-fec8e2914780bf2c
Private IPs are only accessible by other resources on the same Virtual Private Cloud (VPC). Follow these instructions to set up a GKE cluster on the same VPC as your Cloud SQL instance.
For more information on the environment requirements for using Private IP on Cloud SQL, please see this page.
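As a rough sketch of how to verify (or create) the private services access peering with gcloud — the network name is taken from the cluster description above, and the reserved range name is a placeholder:
# Check whether the private services access peering already exists on the cluster's VPC
gcloud compute networks peerings list --network=compute-network-aaa8ff1ec6b52012
# If it is missing, reserve an IP range and connect the peering
gcloud compute addresses create google-managed-services-range \
  --global --purpose=VPC_PEERING --prefix-length=16 \
  --network=compute-network-aaa8ff1ec6b52012
gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --ranges=google-managed-services-range \
  --network=compute-network-aaa8ff1ec6b52012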

K8S Unable to mount AWS EBS as a persistent volume for pod

Question
Please suggest the cause of the following error: the AWS EBS volume cannot be mounted in the pod.
journalctl -b -f -u kubelet
1480 kubelet.go:1625] Unable to mount volumes for pod "nginx_default(ddc938ee-edda-11e7-ae06-06bb783bb15c)": timeout expired waiting for volumes to attach/mount for pod "default"/"nginx". list of unattached/unmounted volumes=[ebs]; skipping pod
1480 pod_workers.go:186] Error syncing pod ddc938ee-edda-11e7-ae06-06bb783bb15c ("nginx_default(ddc938ee-edda-11e7-ae06-06bb783bb15c)"), skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"nginx". list of unattached/unmounted volumes=[ebs]
1480 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pv-ebs" (UniqueName: "kubernetes.io/aws-ebs/vol-0d275986ce24f4304") pod "nginx" (UID: "ddc938ee-edda-11e7-ae06-06bb783bb15c")
1480 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/vol-0d275986ce24f4304\"" failed. No retries permitted until 2017-12-31 03:34:03.644604131 +0000 UTC m=+6842.543441523 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pv-ebs\" (UniqueName: \"kubernetes.io/aws-ebs/vol-0d275986ce24f4304\") pod \"nginx\" (UID: \"ddc938ee-edda-11e7-ae06-06bb783bb15c\") "
Steps
Deployed K8S 1.9 using kubeadm (without EBS volume mount, pods work) in AWS (us-west-1 and AZ is us-west-1b).
Configure an IAM role as per Kubernetes - Cloud Providers and kubelets failing to start when using 'aws' as cloud provider.
Assign the IAM role to EC2 instances as per Easily Replace or Attach an IAM Role to an Existing EC2 Instance by Using the EC2 Console.
Deploy PV/PVC/POD as in the manifest.
The status from the kubectl:
kubectl get
NAME READY STATUS RESTARTS AGE IP NODE
nginx 0/1 ContainerCreating 0 29m <none> ip-172-31-1-43.us-west-1.compute.internal
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv/pv-ebs 5Gi RWO Recycle Bound default/pvc-ebs 33m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pvc/pvc-ebs Bound pv-ebs 5Gi RWO 33m
kubectl describe pod nginx
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned nginx to ip-172-31-1-43.us-west-1.compute.internal
Normal SuccessfulMountVolume 27m kubelet, ip-172-31-1-43.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "default-token-dt698"
Warning FailedMount 6s (x12 over 25m) kubelet, ip-172-31-1-43.us-west-1.compute.internal Unable to mount volumes for pod "nginx_default(ddc938ee-edda-11e7-ae06-06bb783bb15c)": timeout expired waiting for volumes to attach/mount for pod "default"/"nginx".
Warning FailedMount 6s (x12 over 25m) kubelet, ip-172-31-1-43.us-west-1.compute.internal Unable to mount volumes for pod "nginx_default(ddc938ee-edda-11e7-ae06-06bb783bb15c)": timeout expired waiting for volumes to attach/mount for pod "default"/"nginx".
Manifest
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-ebs
  labels:
    type: amazonEBS
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0d275986ce24f4304
    fsType: ext4
  persistentVolumeReclaimPolicy: Recycle
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-ebs
  labels:
    type: amazonEBS
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
kind: Pod
apiVersion: v1
metadata:
  name: nginx
spec:
  containers:
    - name: myfrontend
      image: nginx
      volumeMounts:
        - mountPath: "/var/www/html"
          name: ebs
  volumes:
    - name: ebs
      persistentVolumeClaim:
        claimName: pvc-ebs
IAM Policy
Environment
$ kubectl version -o json
{
"clientVersion": {
"major": "1",
"minor": "9",
"gitVersion": "v1.9.0",
"gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05",
"gitTreeState": "clean",
"buildDate": "2017-12-15T21:07:38Z",
"goVersion": "go1.9.2",
"compiler": "gc",
"platform": "linux/amd64"
},
"serverVersion": {
"major": "1",
"minor": "9",
"gitVersion": "v1.9.0",
"gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05",
"gitTreeState": "clean",
"buildDate": "2017-12-15T20:55:30Z",
"goVersion": "go1.9.2",
"compiler": "gc",
"platform": "linux/amd64"
}
}
$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
EC2
EBS
Solution
Found the documentation that shows how to configure the AWS cloud provider:
K8S AWS Cloud Provider Notes
Steps
Tag the EC2 instances and the SG with KubernetesCluster=${kubernetes cluster name}. If the cluster was created with kubeadm, the name is kubernetes, as in Ability to configure user and cluster name in AdminKubeConfigFile (a tagging sketch follows these steps).
Run kubeadm init --config kubeadm.yaml.
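A minimal sketch of the tagging step with the AWS CLI (the instance and security group IDs are placeholders):
# Tag the instances and their security group with the cluster name
# so the AWS cloud provider can discover them
aws ec2 create-tags \
  --resources i-0aaa1111bbb2222cc i-0ddd3333eee4444ff sg-0123456789abcdef0 \
  --tags Key=KubernetesCluster,Value=kubernetes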
kubeadm.yaml (Ansible template)
kind: MasterConfiguration
apiVersion: kubeadm.k8s.io/v1alpha1
api:
  advertiseAddress: {{ K8S_ADVERTISE_ADDRESS }}
networking:
  podSubnet: {{ K8S_SERVICE_ADDRESSES }}
cloudProvider: {{ K8S_CLOUD_PROVIDER }}
Result
$ journalctl -b -f CONTAINER_ID=$(docker ps | grep k8s_kube-controller-manager | awk '{ print $1 }')
Jan 02 04:48:28 ip-172-31-4-117.us-west-1.compute.internal dockerd-current[8063]: I0102 04:48:28.752141
1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "kuard-pv" (UniqueName: "kubernetes.io/aws-ebs/vol-0d275986ce24f4304") from node "ip-172-3
Jan 02 04:48:39 ip-172-31-4-117.us-west-1.compute.internal dockerd-current[8063]: I0102 04:48:39.309178
1 operation_generator.go:308] AttachVolume.Attach succeeded for volume "kuard-pv" (UniqueName: "kubernetes.io/aws-ebs/vol-0d275986ce24f4304") from node "ip-172-
$ kubectl describe pod kuard
...
Volumes:
kuard-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: kuard-pvc
ReadOnly: false
$ kubectl describe pv kuard-pv
Name: kuard-pv
Labels: failure-domain.beta.kubernetes.io/region=us-west-1
failure-domain.beta.kubernetes.io/zone=us-west-1b
type=amazonEBS
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolume","metadata":{"annotations":{},"labels":{"type":"amazonEBS"},"name":"kuard-pv","namespace":""},"spec":{"acce...
pv.kubernetes.io/bound-by-controller=yes
StorageClass:
Status: Bound
Claim: default/kuard-pvc
Reclaim Policy: Retain
Access Modes: RWO
Capacity: 5Gi
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: vol-0d275986ce24f4304
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
$ kubectl version -o json
{
"clientVersion": {
"major": "1",
"minor": "9",
"gitVersion": "v1.9.0",
"gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05",
"gitTreeState": "clean",
"buildDate": "2017-12-15T21:07:38Z",
"goVersion": "go1.9.2",
"compiler": "gc",
"platform": "linux/amd64"
},
"serverVersion": {
"major": "1",
"minor": "9",
"gitVersion": "v1.9.0",
"gitCommit": "925c127ec6b946659ad0fd596fa959be43f0cc05",
"gitTreeState": "clean",
"buildDate": "2017-12-15T20:55:30Z",
"goVersion": "go1.9.2",
"compiler": "gc",
"platform": "linux/amd64"
}
}