init a kubernetes cluster with kubeadm but public IP on aws - amazon-web-services

I am trying to follow this tutorial:
https://www.digitalocean.com/community/tutorials/how-to-create-a-kubernetes-cluster-using-kubeadm-on-ubuntu-18-04
Important difference:
I need to run the master on a specific node, and the worker nodes are from different regions on AWS.
So it all went well until I wanted to join the nodes (step 5). The command succeeded but kubectl get nodes still only showed the master node.
I looked at the join command and it contained the master 's private ip address:
join 10.1.1.40. I guess that can not work if the workers are in a different region (note: later we probably need to add nodes from different providers even, so if there is no important security threat, it should work via public IPs).
So while kubeadm init pod-network-cidr=10.244.0.0/16 initialized the cluster but with this internal IP, I then tried with
kubeadm init --apiserver-advertise-address <Public-IP-Addr> --apiserver-bind-port 16443 --pod-network-cidr=10.244.0.0/16
But then it always hangs, and init does not complete. The kubelet log prints lots of
E0610 19:24:24.188347 1051920 kubelet.go:2267] node "ip-x-x-x-x" not found
where "ip-x-x-x-x" seems to be the master's node hostname on AWS.

I think what made it work is that I set the master's hostname to its public DNS name, and then used that as --control-plane-endpoint argument..., without --apiserver-advertise-address (but with the --apiserver-bind-port as I need to run it on another port).
Need to have it run longer to confirm but so far looks good.

Related

How to resize K8s cluster with kops, cluster-autoscaler to dynamically increase Masters

We have configured Kubernetes cluster on EC2 machines in our AWS account using kops tool (https://github.com/kubernetes/kops) and based on AWS posts (https://aws.amazon.com/blogs/compute/kubernetes-clusters-aws-kops/) as well as other resources.
We want to setup a K8s cluster of master and slaves such that:
It will automatically resize (both masters as well as nodes/slaves) based on system load.
Runs in Multi-AZ mode i.e. at least one master and one slave in every AZ (availability zone) in the same region for e.g. us-east-1a, us-east-1b, us-east-1c and so on.
We tried to configure the cluster in the following ways to achieve the above.
Created K8s cluster on AWS EC2 machines using kops this below configuration: node count=3, master count=3, zones=us-east-1c, us-east-1b, us-east-1a. We observed that a K8s cluster was created with 3 Master & 3 Slave Nodes. Each of the master and slave server was in each of the 3 AZ’s.
Then we tried to resize the Nodes/slaves in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). We set the node_asg_min to 3 and node_asg_max to 5. When we increased the workload on the slaves such that auto scale policy was triggered, we saw that additional (after the default 3 created during setup) slave nodes were spawned, and they did join the cluster in various AZ’s. This worked as expected. There is no question here.
We also wanted to set up the cluster such that the number of masters increases based on system load. Is there some way to achieve this? We tried a couple of approaches and results are shared below:
A) We were not sure if the cluster-auto scaler helps here, but nevertheless tried to resize the Masters in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). This is useful while creating a new cluster but was not useful to resize the number of masters in an existing cluster. We did not find a parameter to specify node_asg_min, node_asg_max for Master the way it is present for slave Nodes. Is there some way to achieve this?
B) We increased the count MIN from 1 to 3 in ASG (auto-scaling group), associated with one the three IG (instance group) for each master. We found that new instances were created. However, they did not join the master cluster. Is there some way to achieve this?
Could you please point us to steps, resources on how to do this correctly so that we could configure the number of masters to automatically resize based on system load and is in Multi-AZ mode?
Kind regards,
Shashi
There is no need to scale Master nodes.
Master components provide the cluster’s control plane. Master components make global decisions about the cluster (for example, scheduling), and detecting and responding to cluster events (starting up a new pod when a replication controller’s ‘replicas’ field is unsatisfied).
Master components can be run on any machine in the cluster. However, for simplicity, set up scripts typically start all master components on the same machine, and do not run user containers on this machine. See Building High-Availability Clusters for an example multi-master-VM setup.
Master node consists of the following components:
kube-apiserver
Component on the master that exposes the Kubernetes API. It is the front-end for the Kubernetes control plane.
etcd
Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.
kube-scheduler
Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on.
kube-controller-manager
Component on the master that runs controllers.
cloud-controller-manager
runs controllers that interact with the underlying cloud providers. The cloud-controller-manager binary is an alpha feature introduced in Kubernetes release 1.6.
For more detailed explanation please read the Kubernetes Components docs.
Also if You are thinking about HA, you can read about Creating Highly Available Clusters with kubeadm
I think your assumption is that similar to kubernetes nodes, masters devide the work between eachother. That is not the case, because the main tasks of masters is to have consensus between each other. This is done with etcd which is a distributed key value store. The problem maintaining such a store is easy for 1 machine but gets harder the more machines you add.
The advantage of adding masters is being able to survive more master failures at the cost of having to make all masters fatter (more CPU/RAM....) so that they perform well enough.

private IP address range for GCP Cloud SQL is ignored

I've been trying to set up Google Cloud SQL with a private IP connection, where
the IP range it's bound to is manually allocated, and have not succeeded. I
don't know if this is a bug in the implementation because it's still in beta, if
there's something missing from the docs, or if I'm just doing something wrong.
(A command-line session is at the bottom, for a quick summary of what I'm
seeing.)
Initially, I set it up to automatically allocate the IP range. It all worked
just fine, except that it chose 172.17.0.0/24, which is one of the networks
managed by docker on my GCE instance, so I couldn't connect from there (but
could on another machine without docker). So then I tried going down the manual
allocation route.
First, I tore down all the associated network objects that had been created on
my behalf. There were two VPC Peerings, cloudsql-postgres-googleapis-com and
servicenetworking-googleapis-com, which I deleted, and then I confirmed that
the routing entry associated with them disappeared as well.
Then, I followed the directions at https://cloud.google.com/vpc/docs/configure-private-services-access#allocating-range, creating 10.10.0.0/16, because I wanted it in my default network, which is
auto mode, so I'm limited to the low half (which is currently clear).
At that point, I went back to the Cloud SQL instance creation page, since it
should be doing the rest for me. I checked the "Private IP" box, and chose the
default network.
I wasn't taking notes at the time, so my recollection may be flawed,
particularly since my experience in later attempts was consistently different,
but what I remember seeing was that below the network choice dropdown, it said
"This instance will use the existing managed service connection". I assumed
that meant it would use the address range I'd created, and went forward with the
instance creation, but the instance landed on the 172.17.0.0/24 network again.
Back around the third time, where that message was before, it had a choice box
listing my address range. Again, my recollection was poor, so I don't know if I
either saw or clicked on the "Connect" button, but the end result was the same.
On the fourth attempt, I did notice the "Connect" button, and made sure to click
it, and wait for it to say it succeeded. Which it did, sort of: it replaced the
dropdown and buttons with the same message I'd seen before about using the
existing connection. And again, the instance was created on the wrong network.
I tried a fifth time, this time having created a new address range with a new
name -- google-managed-services-default -- which was the name that the
automatic allocation had given it back when I first started (and what the
private services access docs suggest). But even with that name, and explicitly
choosing it, I still ended up with the instance on the wrong network.
Indeed, I now see that after I click "Connect", I can go check the routes and
see that the route that was created is to 172.17.0.0/24.
The same thing seems to happen if I do everything from the command-line:
$ gcloud beta compute addresses list
NAME ADDRESS/RANGE TYPE PURPOSE NETWORK REGION SUBNET STATUS
google-managed-services-default 10.11.0.0/16 INTERNAL VPC_PEERING default RESERVED
$ gcloud beta services vpc-peerings connect \
--service=servicenetworking.googleapis.com \
--ranges=google-managed-services-default \
--network=default \
--project=...
$ gcloud beta services vpc-peerings list --network=default
---
network: projects/.../global/networks/default
peering: servicenetworking-googleapis-com
reservedPeeringRanges:
- google-managed-services-default
---
network: projects/.../global/networks/default
peering: cloudsql-postgres-googleapis-com
reservedPeeringRanges:
- google-managed-services-default
$ gcloud beta compute routes list
NAME NETWORK DEST_RANGE NEXT_HOP PRIORITY
peering-route-ad7b64a0841426ea default 172.17.0.0/24 cloudsql-postgres-googleapis-com 1000
So now I'm not sure what else to try. Is there some state I didn't think to clear? How is the route supposed to be connected to the address range? Why is it creating two peerings when I only asked for one? If I were to create a route manually to the right address range, I presume that wouldn't work, because the Postgres endpoint would still be at the wrong address.
(Yes, I could reconfigure docker, but I'd rather not.)
I found here https://cloud.google.com/sql/docs/mysql/private-ip that this seems to be the correct behaviour:
After you have established a private services access connection, and created a Cloud SQL instance with private IP configured for that connection, the corresponding (internal) subnet and range used by the Cloud SQL service cannot be modified or deleted. This is true even if you delete the peering and your IP range. After the internal configuration is established, any Cloud SQL instance created in that same region and configured for private IP uses the original internal configuration.
There turned out to be a bug somewhere in the service machinery of Google Cloud, which is now fixed. For reference, see the conversation at https://issuetracker.google.com/issues/118849070.

Automated setup for multi-server RethinkDB cluster via an ECS service

I'm attempting to set up a RethinkDB cluster with 3 servers total spread evenly across 3 private subnets, each in different AZ's in a single region.
Ideally, I'd like to deploy the DB software via ECS and provision the EC2 instances with auto scaling, but I'm having trouble trying to figure out how to instruct the RethinkDB instances to join a RethinkDB cluster.
To create/join a cluster in RethinkDB, when you start up a new instance of RethinkDB, you specify host:port combination of one of the other machines in the cluster. This is where I'm running into problems. The Auto Scaling service is creating new primary ENI's for my EC2 instances and using a random IP in my subnet's range, so I can't know the IP of the EC2 instance ahead of time. On top of that, I'm using awsvpc task networking, so ECS is creating new secondary ENI's dedicated to each docker container and attaching them to the instances when it deploys them and those are also getting new IP's, which I don't know ahead of time.
So far I've worked out one possible solution, which is to not use an autoscaling group, but instead to manually deploy 3 EC2 instances across the private subnets, which would let me assign my own, predetermined, private IP. As I understand it, this still doesn't help me if I'm using awsvpc task networking though because each container running on my instances will get its own dedicated secondary ENI and I wont know the IP of that secondary ENI ahead of time. I think I can switch my task networking to bridge mode, to get around this. That way I can use the predetermined IP of the EC2 instances (the primary ENI) in the RethinkDB join command.
So In conclusion, the only way to achieve this, that I can figure out, is to not use Auto Scaling, or awsvpc task networking, both of which would otherwise be very desirable features. Can anyone think of a better way to do this?
As mentioned in the comments, this is more of an issue around the fact you need to start a single RethinkDB instance one time to bootstrap the cluster and then handle discovery of the existing cluster members when joining new members to the cluster.
I would have thought RethinkDB would have published a good pattern in their docs for this because it's going to be pretty common when setting up clusters but I couldn't see anything useful in their docs. If someone does know of an official recommendation then you should definitely use this rather than what I'm about to propose especially as I have no experience with running RethinkDB.
This is more just spit-balling here and will be completely untested (at least for now) but the principle is going to be you need to start a single, one off instance of RethinkDB to bootstrap the cluster, then have more cluster members join and then ditch the special case bootstrap member that didn't attempt to join a cluster and leave the remaining cluster members to work.
The bootstrap instance is easy enough to consider. You just need a RethinkDB container image and an ECS task that just runs it in stand-alone mode with the ECS service only running one instance of the task. To enable the second set of cluster members to easily discover cluster members including this bootstrapping instance it's probably easiest to use a service discovery mechanism such as the one offered by ECS which uses Route53 records under the covers. The ECS service should register the service in the RethinkDB namespace.
Then you should create another ECS service that's basically the same as the first but in an entrypoint script should list the services in the RethinkDB namespace and then resolve them, discarding the container's own IP address and then uses the discovered host to join to with --join when starting RethinkDB in the container.
I'd then set the non bootstrap ECS service to just 1 task at first to allow it to discover the bootstrap version and then you should be able to keep adding tasks to the service one at a time until you're happy with the size of the non bootstrapped cluster leaving you with n + 1 instances in the cluster including the original bootstrap instance.
After that I'd remove the bootstrap ECS service entirely.
If an ECS task dies in the non bootstrap ECS service dies for whatever reason it should be able to auto rejoin without any issue as it will just find a running RethinkDB task and start that.
You could probably expand the checks for which cluster member to join to by checking that the RethinkDB port is open and running before using that as a member to join so it will handle multiple tasks being started at the same time (with my original suggestion it could potentially find another task that is looking to join the cluster and try to join to that first, with them all potentially deadlocking if they all failed to randomly pick the existing cluster members by chance).
As mentioned, this answer comes with a big caveat that I haven't got any experience running RethinkDB and I've only played with the service discovery mechanism that was recently released for ECS so might be missing something here but the general principles should hold fine.

Kubernetes 1.3.x on AWS gets only one working minion

I'm trying to start a new Kubernetes cluster on AWS with the following settings:
export KUBERNETES_PROVIDER=aws
export KUBE_AWS_INSTANCE_PREFIX="k8-update-test"
export KUBE_AWS_ZONE="eu-west-1a"
export AWS_S3_REGION="eu-west-1"
export ENABLE_NODE_AUTOSCALER=true
export NON_MASQUERADE_CIDR="10.140.0.0/20"
export SERVICE_CLUSTER_IP_RANGE="10.140.1.0/24"
export DNS_SERVER_IP="10.140.1.10"
export MASTER_IP_RANGE="10.140.2.0/24"
export CLUSTER_IP_RANGE="10.140.3.0/24"
After running $KUBE_ROOT/cluster/kube-up.sh the master appears and 4 (default) minions are started. Unfortunately only one minion gets read. The result of kubectl get nodes is:
NAME STATUS AGE
ip-172-20-0-105.eu-west-1.compute.internal NotReady 19h
ip-172-20-0-106.eu-west-1.compute.internal NotReady 19h
ip-172-20-0-107.eu-west-1.compute.internal Ready 19h
ip-172-20-0-108.eu-west-1.compute.internal NotReady 19h
Please not that one node is running while 3 are not ready. If I look at the details of a NotReady node I get the following error:
ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0
right now.
If I try to start the cluster with out the settings NON_MASQUERADE_CIDR, SERVICE_CLUSTER_IP_RANGE, DNS_SERVER_IP, MASTER_IP_RANGE and CLUSTER_IP_RANGE everything works fine. All minions get ready as soon as they are started.
Does anyone has an idea why the PodCIDR was only set on one node but not on the other nodes?
One more thing: The same settings worked fine on kubernetes 1.2.4.
Your Cluster IP range is too small. You've allocated a /24 for your entire cluster (255 addresses), and Kubernetes by default will give a /24 to each node. This means that the first node will be allocated 10.140.3.0/24 and then you won't have any further /24 ranges to allocate to the other nodes in your cluster.
The fact that this worked in 1.2.4 was a bug, because the CIDR allocator wasn't checking that it didn't allocate ranges beyond the cluster ip range (which it now does). Try using a larger range for your cluster (GCE uses a /14 by default, which allows you to scale to 1000 nodes, but you should be fine with a /20 for a small cluster).

Cleanup Chef node/client on Cloudformation stack deletion

I have a cloudformation stack that consists of a VPC, two subnets (public and private), several EC2 ubuntu instances and all of the routes, EIP addresses, etc. One of the EC2 instances is in a public subnet. It is bootstrapped as a Chef node on startup.
I'd like to figure out a way to delete the chef node when the cloudformation stack is deleted. So far I've tried dropping a cleanup script into EC2 instance /etc/rc0.d.
This script does what it should when run manually, however when I just delete the stack, it does not seem to run. Actually - it might very well run, but I'm guessing that by the time the EC2 instance shuts down all of the routing and EIP addresses might already be gone, so Chef server might not be reachable by the EC2 instance.
I've also tried locking down creation/deletion order with 'DependsOn' attributes, but that didn't work out either - I don't think it's possible to have the IP and routes depend on the instance that is using the said EIP and routes
Is there some way to setup some sort of monitoring that will make sure Chef cleanup runs before everything else?
Gist with the template and chef setup/cleanup script
Yes, most likely your IPs are disassociated/removed before the instance shuts down, making any attempt to reach the Chef server from the instance futile. You can always check your cloudformation action logs but disassociating the IP address before shutdown is what makes most sense.
I think some of the workaround are:
Build an app on top of your cloudformation creation so that every time you delete a stack it also deletes the node(s) you want from your chef server. This would a full blown application with a database to keep track of the servers/stacks running. This will require your app to call the chef server API or simply call a system knife command.
Run your clean up script from another instance running knife/chef-client. You can have some sort of cron/periodic job checking for stacks/servers that have been deleting on AWS and then run the appropriate knife command to delete the server from. This in essence very similar to 1. with just the difference that you don't necessarily have to build a full blown.application.
Hope it helps.