kubeadm: why does my node not show up even though kubelet says it joined?

I am setting up a Kubernetes deployment using auto-scaling groups and Terraform. The kube master node is behind an ELB to get some reliability in case of something going wrong. The ELB has the health check set to tcp 6443, and tcp listeners for 8080, 6443, and 9898. All of the instances and the load balancer belong to a security group that allows all traffic between members of the group, plus public traffic from the NAT Gateway address. I created my AMI using the following script (from the getting started guide)...
# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
# cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF
# apt-get update
# # Install docker if you don't have it already.
# apt-get install -y docker.io
# apt-get install -y kubelet kubeadm kubectl kubernetes-cni
I use the following user data scripts...
kube master
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
kubeadm init \
--external-etcd-endpoints=http://${etcd_elb}:2379 \
--token=${token} \
--use-kubernetes-version=${k8s_version} \
--api-external-dns-names=kmaster.${master_elb_dns} \
--cloud-provider=aws
until kubectl cluster-info
do
sleep 1
done
kubectl apply -f https://git.io/weave-kube
kube node
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
until kubeadm join --token=${token} kmaster.${master_elb_dns}
do
sleep 1
done
Everything seems to work properly. The master comes up and responds to kubectl commands, with pods for discovery, dns, weave, controller-manager, api-server, and scheduler. kubeadm has the following output on the node...
Running pre-flight checks
<util/tokens> validating provided token
<node/discovery> created cluster info discovery client, requesting info from "http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0"
<node/discovery> failed to request cluster info, will try again: [Get http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0: EOF]
<node/discovery> cluster info object received, verifying signature using given token
<node/discovery> cluster info signature and contents are valid, will use API endpoints [https://10.253.129.106:6443]
<node/bootstrap> trying to connect to endpoint https://10.253.129.106:6443
<node/bootstrap> detected server version v1.4.4
<node/bootstrap> successfully established connection with endpoint https://10.253.129.106:6443
<node/csr> created API client to obtain unique certificate for this node, generating keys and certificate signing request
<node/csr> received signed certificate from the API server:
Issuer: CN=kubernetes | Subject: CN=system:node:ip-10-253-130-44 | CA: false
Not before: 2016-10-27 18:46:00 +0000 UTC Not After: 2017-10-27 18:46:00 +0000 UTC
<node/csr> generating kubelet configuration
<util/kubeconfig> created "/etc/kubernetes/kubelet.conf"
Node join complete:
* Certificate signing request sent to master and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on the master to see this machine join.
Unfortunately, running kubectl get nodes on the master only returns itself as a node. The only interesting thing I see in /var/log/syslog is
Oct 27 21:19:28 ip-10-252-39-25 kubelet[19972]: E1027 21:19:28.198736 19972 eviction_manager.go:162] eviction manager: unexpected err: failed GetNode: node 'ip-10-253-130-44' not found
Oct 27 21:19:31 ip-10-252-39-25 kubelet[19972]: E1027 21:19:31.778521 19972 kubelet_node_status.go:301] Error updating node status, will retry: error getting node "ip-10-253-130-44": nodes "ip-10-253-130-44" not found
I am really not sure where to look...

The hostnames of the two machines (master and node) must be different. You can check them by running cat /etc/hostname on each machine. If they happen to be the same, edit that file so they differ, then run sudo reboot to apply the change. Otherwise kubeadm cannot tell the two machines apart, and they show up as a single node in kubectl get nodes.
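One way to confirm the hostnames really differ is to collect them and look for repeats. The sketch below is only illustrative: find_duplicate_hostnames is a hypothetical helper, and the names passed to it are placeholders for whatever hostname (or cat /etc/hostname) prints on each machine.

```shell
#!/bin/bash
# Hypothetical helper: print every hostname that appears more than once
# among the arguments. No output means all names are unique.
find_duplicate_hostnames() {
    printf '%s\n' "$@" | sort | uniq -d
}

# Example with made-up names; replace them with the output of `hostname`
# collected from the master and from each node.
find_duplicate_hostnames ip-10-0-0-1 ip-10-0-0-2 ip-10-0-0-1
# prints: ip-10-0-0-1
```

If the helper prints nothing, duplicate hostnames are probably not the cause and the kubelet logs on the node are the next place to look.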

Yes, I faced the same problem.
I resolved it by:
killing the kubelet with killall kubelet
running the kubeadm join command again
and starting the kubelet service


How to configure Cassandra in GCP to remotely connect?

I am following the below steps to install and configure Cassandra in GCP.
It works perfectly as long as working with Cassandra within GCP.
$java -version
$echo "deb http://downloads.apache.org/cassandra/debian 40x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
$curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
$sudo apt install apt-transport-https
$sudo apt-get update
$sudo apt-get install cassandra
$sudo systemctl status cassandra
//Active: active (running)
$nodetool status
//Datacenter: datacenter1
$tail -f /var/log/cassandra/system.log
$find /usr/lib/ -name cqlshlib
##/usr/lib/python3/dist-packages/cqlshlib
$export PYTHONPATH=/usr/lib/python3/dist-packages
$sudo nano ~/.bashrc
//Add
export PYTHONPATH=/usr/lib/python3/dist-packages
//save
$source ~/.bashrc
$python --version
$cqlsh
//it opens cqlsh shell
But I want to configure Cassandra so that I can connect to it remotely.
I tried the following 7 solutions,
but I still get the error shown at the end.
1.In GCP,
VPC network -> firewall -> create
IP 0.0.0.0/0
port tcp=9000,9042,8088,9870,8123,8020, udp=9000
tag = hadoop
Add this tag in VMs
2.rm -Rf ~/.cassandra
3.sudo nano ~/.cassandra/cqlshrc
[connection]
hostname = 34.72.70.173
port = 9042
4. cqlsh 34.72.70.173 -u cassandra -p cassandra
5. firewall - open ports
https://stackoverflow.com/questions/2359159/cassandra-port-usage-how-are-the-ports-used
9000,9042,8088,9870,8123,8020,7199,7000,7001,9160
6. Get rid of this line: JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=localhost"
Try restart the service: sudo service cassandra restart
If you have a cluster, make sure that ports 7000 and 9042 are open within your security group.
7. you can set the environment variable $CQLSH_HOST=1.2.3.4. Then simply type cqlsh.
https://stackoverflow.com/questions/20575640/datastax-devcenter-fails-to-connect-to-the-remote-cassandra-database/20598599#20598599
sudo nano /etc/cassandra/cassandra.yaml
listen_address: localhost
rpc_address: 34.72.70.173
broadcast_rpc_address: 34.72.70.173
sudo service cassandra restart
sudo nano ~/.bashrc
export CQLSH_HOST=34.72.70.173
source ~/.bashrc
sudo systemctl restart cassandra
sudo service cassandra restart
sudo systemctl status cassandra
nodetool status
Please suggest how to get rid of the following error
Connection error: ('Unable to connect to any servers', {'127.0.0.1:9042': ConnectionRefusedError(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
This indicates that when you ran cqlsh, you didn't specify the public IP:
Connection error: ('Unable to connect to any servers', \
{'127.0.0.1:9042': ConnectionRefusedError(111, "Tried connecting to [('127.0.0.1', 9042)]. \
Last error: Connection refused")})
When running Cassandra nodes on public clouds, you need to configure cassandra.yaml with the following:
listen_address: private_IP
rpc_address: public_IP
The listen address is what Cassandra nodes use for communicating with each other privately, e.g. for the gossip protocol.
The RPC address is what clients/apps/drivers use to connect to nodes on the CQL port (9042) so it needs to be set to the nodes' public IP address.
To connect to a node with cqlsh (a client), you need to specify the node's public IP:
$ cqlsh <public_IP>
Cheers!
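Before retrying cqlsh, it can help to confirm which addresses the node is actually configured with. This is a minimal sketch, assuming a flat "key: value" layout as in the snippets above; yaml_value is a hypothetical helper, not part of Cassandra, and it only handles simple top-level keys.

```shell
#!/bin/bash
# Hypothetical helper: print the value of a top-level "key: value" line
# from a cassandra.yaml-style file, ignoring commented-out lines.
yaml_value() {
    local key="$1" file="$2"
    awk -v k="$key" '
        $0 ~ ("^" k ":") {
            sub("^" k ":[ \t]*", "")   # drop the key and the colon
            sub(/[ \t]+#.*$/, "")      # drop a trailing comment
            print
            exit
        }' "$file"
}

# Usage (the path is the one used in the question):
#   yaml_value listen_address /etc/cassandra/cassandra.yaml
#   yaml_value rpc_address    /etc/cassandra/cassandra.yaml
```

If rpc_address still prints localhost, a remote cqlsh has nothing to connect to on port 9042, whatever the firewall allows.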

Why is Google Compute Engine not running my container?

I can do this successfully:
Bundle my app into a docker image
Build this image into a container using Google Cloud Build upon push to master
(This container is stored in the registry at, for example, gcr.io/my-project/my-container)
Deploy this container to the web using Google Cloud Run
Visit the Cloud Run url and see my website
I am now trying more sophisticated builds and I think the next step is to use Google Compute Engine.
To start, I am simply trying to deploy a single instance of the same app that I deployed to Cloud Run:
Navigate to Compute Engine > VM Instances
Enter basics like instance name
Enter my container location under "Container Image": gcr.io/my-project/my-container
(As an aside, I find it suspect that the interface does not offer a selector for your existing Container Registry items here.)
Select "Allow HTTP Traffic" and "Allow HTTPS Traffic"
Click "Create"
GCE takes a minute to create it, and then it shows the green checkmark and the instance name, and "External IP: 35.238.xxx.xxx". I visit that URL in my browser and get... "35.238.xxx.xxx refused to connect."
To inspect, I go back to the GCE page and select "SSH > Open in browser window" next to my instance, which opens a type of cloud terminal to the machine.
In this terminal window, type ps and see that no processes are running. The container Dockerfile ends with CMD yarn start:prod, so I guess that's not happening here.
Further, I ls here and there and navigate around, and see that there is no /app directory from my Dockerfile's WORKDIR /app command. It seems that not only did my app not boot, but the container was never copied to the VM instance at all?
What am I doing wrong?
For anyone having this issue: I faced the same problem and couldn't figure it out.
Reading Serhii's answer gave me the clue. I believe that as of today (Jan 2021) the GCP Console UI is a bit unhelpful. It appears that if you type in a container name when creating your VM WITHOUT specifying a tag on the end, it neither complains nor assumes a default such as 'latest'; it just fails silently. Hence you get the VM but with no Docker container running.
At least this now works for me; hopefully this helps others.
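To guard against the missing-tag case described above, the image reference can be checked before it is typed into the console or passed to gcloud. The following is a sketch only: image_has_tag and with_default_tag are hypothetical helpers, and falling back to :latest is an assumption about what you want, not something the console does for you.

```shell
#!/bin/bash
# Hypothetical helper: succeed if an image reference carries an explicit
# tag (":tag") or digest ("@sha256:..."), fail otherwise.
image_has_tag() {
    local ref="$1"
    local last="${ref##*/}"   # last path component, so a registry port is ignored
    case "$last" in
        *@*|*:*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: append ":latest" when no tag or digest is present.
with_default_tag() {
    if image_has_tag "$1"; then
        printf '%s\n' "$1"
    else
        printf '%s:latest\n' "$1"
    fi
}

with_default_tag gcr.io/my-project/my-container
# prints: gcr.io/my-project/my-container:latest
```

Only the last path component is inspected, so a registry address with a port such as localhost:5000/img is still treated as untagged.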
Check whether your VM has an external IP address.
If it doesn't, the VM might not have network access to public repositories or even to the Google Container Registry (gcr.io), and the Docker container silently fails to start.
I've decided to follow Deploying a container on a new VM instance again.
Please find my steps and commands below:
create a new VM that runs the Docker image gcr.io/cloud-marketplace/google/nginx1:latest with network tag http-server:
$ gcloud compute instances create-with-container instance-3 --tags=http-server,https-server --container-image=gcr.io/cloud-marketplace/google/nginx1:latest
Created [https://www.googleapis.com/compute/v1/projects/test-prj/zones/europe-west3-a/instances/instance-3].
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
instance-3 europe-west3-a n1-standard-1 10.156.0.30 35.XXX.111.XXX RUNNING
create a new firewall rule:
$ gcloud compute firewall-rules create default-allow-http --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http-server
Creating firewall...⠹
Created [https://www.googleapis.com/compute/v1/projects/test-prj/global/firewalls/default-allow-http].
Creating firewall...done.
NAME NETWORK DIRECTION PRIORITY ALLOW DENY DISABLED
default-allow-http default INGRESS 1000 tcp:80 False
check current firewall rules:
$ nmap -Pn 35.XXX.111.XXX
Starting Nmap 7.70 ( https://nmap.org ) at 2020-04-02 12:04 CEST
PORT STATE SERVICE
...
80/tcp open http
check if NGINX is running in the container:
$ curl -I http://35.XXX.111.XXX
HTTP/1.1 200 OK
Server: nginx/1.16.1
...
$ curl http://35.XXX.111.XXX
...
<h1>Welcome to nginx!</h1>
...
also via web browser at http://35.XXX.111.XXX
check status of the container:
$ gcloud compute ssh instance-3
...
instance-3 ~ $ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
...
a657c8871239 gcr.io/cloud-marketplace/google/nginx1:latest "/usr/local/bin/dock…" 14 minutes ago Up 14 minutes klt-instance-3-uwtu
attach to the container and run curl http://35.XXX.111.XXX in the separate terminal:
instance-3 ~ $ docker attach a657c8871239
YY.YY.43.203 - - [02/Apr/2020:10:18:06 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.64.0" "-"
YY.YY.43.203 - - [02/Apr/2020:10:18:07 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.64.0" "-"
I found no errors while following documentation.
To solve your issue:
Compare your steps and commands to mine.
Run test Docker image by following documentation on your project.
Try to replicate steps from documentation with your custom image.
If you still have the issue, update your question with all your steps, commands and outputs.
I also had the problem, the instance was running, but could not pull my container.
Error: Failed to start container: Error response from daemon:
{"message":"unauthorized: You don't have the needed permissions to
perform this operation, and you may have invalid credentials. To
authenticate your request, follow the steps in:
https://cloud.google.com/container-registry/docs/advanced-authentication"
I had to add some extra scope to the yaml file : https://www.googleapis.com/auth/source.full_control
steps:
- name: gcr.io/cloud-builders/docker
  args: ['build', '-t', 'gcr.io/local-xxxxxxxxxxxxxx/apptraining', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ["push", "gcr.io/local-xxxxxxxxxxxxxx/apptraining"]
- name: 'gcr.io/cloud-builders/gcloud'
  args: ['compute', 'instances', 'create-with-container', 'instanceapptraining', '--machine-type=n1-standard-1', '--scopes=https://www.googleapis.com/auth/devstorage.full_control,https://www.googleapis.com/auth/trace.append,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/userinfo.email,https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/datastore,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/trace.append,https://www.googleapis.com/auth/source.full_control,https://www.googleapis.com/auth/source.read_only,https://www.googleapis.com/auth/compute.readonly', '--zone=us-central1-a', '--preemptible', '--container-image=gcr.io/local-xxxxxxxxxxxxxx/apptraining:latest']

reusing the salt states in the AMI image

I have several salt states (base and pillars) already written and stored in Amazon S3. I want to reuse these salt states instead of writing them again. I want to create an AMI using Packer and apply the salt states that I have downloaded from S3 to the Packer builder EC2 instance. In addition to the salt-minion installed on the CentOS 7 machine, I have installed the salt-master service as well, and I started both salt-minion and salt-master with the following commands.
cat > /etc/salt/minion.d/minion_id.conf <<'EOT'
id: ${host} # id salt-minion id
EOT
# Generate the name of the master to connect to
cat > /etc/salt/minion.d/master_name.conf <<'EOT'
master: localhost
EOT
systemctl enable salt-minion
systemctl start salt-minion
systemctl enable salt-master
systemctl start salt-master
When running the below command it doesn't list any minions:
salt-key -L
Accepted Keys:
Denied Keys:
Unaccepted Keys:
Rejected Keys:
So the salt 'localhost-*' state.sls state.high_state
fails with errors:
"No minions matched the target. No command was sent, no jid was assigned.
ERROR: No return received"
This is because salt-key shows no minion ID.
Does anybody have any idea why salt-key does not show the salt-minion, and how I can resolve this so that the existing salt states downloaded from S3 can be applied when building the AMI?
Regards
Pradeep
What could be happening is that your minion can't find the master: by default a minion tries to resolve a host named salt.
What you could do is add the IP of your master to your minion's /etc/salt/minion, something like this:
master: 10.0.0.1
Replace 10.0.0.1 with the IP of your master.
Then restart your minion and check the master again for key requests.
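The same setting can also be written as a drop-in file, which is how the question configures the minion. This is a rough sketch: set_salt_master is a hypothetical helper, the file name master_name.conf is taken from the question, and 10.0.0.1 is the placeholder address used in this answer.

```shell
#!/bin/bash
# Hypothetical helper: write a drop-in file that points the minion at a master.
# $1 = master address, $2 = minion config directory (default /etc/salt/minion.d)
set_salt_master() {
    local master="$1" dir="${2:-/etc/salt/minion.d}"
    mkdir -p "$dir"
    printf 'master: %s\n' "$master" > "$dir/master_name.conf"
}

# Usage, followed by a minion restart so the setting is picked up:
#   set_salt_master 10.0.0.1
#   systemctl restart salt-minion
```

After the restart, salt-key -L on the master should list the minion under Unaccepted Keys if the connection worked.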

CoreOS fleetctl list-machines not showing 3 machines

I am following the DigitalOcean tutorial on CoreOS (https://www.digitalocean.com/community/tutorials/how-to-create-flexible-services-for-a-coreos-cluster-with-fleet-unit-files). When I run fleetctl list-machines on node 1 and node 2, I do not see all 3 machines listed, only the node's own entry. The following is what I see:
core@coreos-1 ~ $ fleetctl list-machines
MACHINE IP METADATA
XXXX... 10.abc.de.fgh -
I logged onto my 3rd node and noticed that when I do a fleetctl list-machines I get the following error:
core@coreos-3 ~ $ fleetctl list-machines
Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etc
What should I do to find out what is the problem and how to resolve this? I have tried rebooting and other things mentioned but nothing is helping.
What happened was that my unit file had an etcd dependency like the following:
# Dependency ordering
After=etcd.service
I think I needed etcd2 instead.
So I did the following as directed:
sudo systemctl stop fleet.service fleet.socket etcd
sudo systemctl start etcd2
sudo systemctl reset-failed
I had to clean up on the instance that had the file when I queried for it:
core@coreos1 ~ $ etcdctl ls /_coreos.com/fleet/job
/_coreos.com/fleet/job/apache.1.service
/_coreos.com/fleet/job/apache@.service
/_coreos.com/fleet/job/apache@80.service
/_coreos.com/fleet/job/apache@9999.service
/_coreos.com/fleet/job/apache-discovery.1.service
/_coreos.com/fleet/job/apache-discovery@.service
/_coreos.com/fleet/job/apache-discovery@80.service
/_coreos.com/fleet/job/apache-discovery@9999.service
by issuing
etcdctl ls /_coreos.com/fleet/job/apache.1.service
etcdctl rm --recursive /_coreos.com/fleet/job/apache-discovery.1.service
Then I started fleet
sudo systemctl start fleet
And when I did a fleetctl list-machines again it showed all my instances connected.
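To check this on each node without reading the table by eye, the output can be counted. This is a small sketch under the assumption that the output has one header line followed by one line per machine, as in the listing in the question; count_machines is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: count the machines in `fleetctl list-machines` output
# read from stdin, skipping the header line and any blank lines.
count_machines() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Usage on a cluster node:
#   fleetctl list-machines | count_machines
```

On a healthy three-node cluster every node should report 3.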

Not able to access HDFS

I installed the Cloudera VM and started trying some basic things. First I just wanted to list the HDFS directories, so I issued the command below.
[cloudera@quickstart ~]$ hadoop fs -ls /
ls: Failed on local exception: java.net.SocketException: Network is unreachable; Host Details : local host is: "quickstart.cloudera/10.0.2.15"; destination host is: "quickstart.cloudera":8020;
This happens even though ps -fu hdfs shows both the namenode and the datanode running. I checked the status using the service command.
[cloudera@quickstart ~]$ sudo service hadoop-hdfs-namenode status
Hadoop namenode is not running [FAILED]
Thinking that restarting all the services would resolve the problem, I executed the command below.
[cloudera@quickstart conf]$ sudo /home/cloudera/cloudera-manager --express --force
[QuickStart] Shutting down CDH services via init scripts...
[QuickStart] Disabling CDH services on boot...
[QuickStart] Starting Cloudera Manager daemons...
[QuickStart] Waiting for Cloudera Manager API...
[QuickStart] Configuring deployment...
Submitted jobs: 92
[QuickStart] Deploying client configuration...
Submitted jobs: 93
[QuickStart] Starting Cloudera Management Service...
Submitted jobs: 101
[QuickStart] Enabling Cloudera Manager daemons on boot...
I expected all services to be up now, so I checked the status of the namenode service again. It still reported a failure.
[cloudera@quickstart ~]$ sudo service hadoop-hdfs-namenode status
Hadoop namenode is not running [FAILED]
I then decided to stop and start the namenode service manually, again without success.
[cloudera@quickstart ~]$ sudo service hadoop-hdfs-namenode stop
no namenode to stop
Stopped Hadoop namenode: [ OK ]
[cloudera@quickstart ~]$ sudo service hadoop-hdfs-namenode status
Hadoop namenode is not running [FAILED]
[cloudera@quickstart ~]$ sudo service hadoop-hdfs-namenode start
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.out
Failed to start Hadoop namenode. Return value: 1 [FAILED]
I checked the file /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.out. It contained only the following:
log4j:ERROR Could not find value for key log4j.appender.RFA
log4j:ERROR Could not instantiate appender named "RFA".
I also checked /var/log/hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-quickstart.cloudera.log.out and found the entry below when I searched for errors. Can anyone please suggest the best way to get the services back on track? Unfortunately I am not able to access Cloudera Manager from the browser. Is there anything I can do from the command line?
2016-02-24 21:02:48,105 WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry: Failed to publish event: SimpleEvent{attributes={ROLE_TYPE=[NAMENODE], CATEGORY=[LOG_MESSAGE], ROLE=[hdfs-NAMENODE], SEVERITY=[IMPORTANT], SERVICE=[hdfs], HOST_IDS=[quickstart.cloudera], SERVICE_TYPE=[HDFS], LOG_LEVEL=[WARN], HOSTS=[quickstart.cloudera], EVENTCODE=[EV_LOG_EVENT]}, content=Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!, timestamp=1456295437905} - 1 of 17 failure(s) in last 79302s
java.io.IOException: Error connecting to quickstart.cloudera/10.0.2.15:7184
at com.cloudera.cmf.event.shaded.org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:249)
at com.cloudera.cmf.event.shaded.org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:198)
at com.cloudera.cmf.event.shaded.org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:133)
at com.cloudera.cmf.event.publish.AvroEventStorePublishProxy.checkSpecificRequestor(AvroEventStorePublishProxy.java:122)
at com.cloudera.cmf.event.publish.AvroEventStorePublishProxy.publishEvent(AvroEventStorePublishProxy.java:196)
at com.cloudera.cmf.event.publish.EventStorePublisherWithRetry$PublishEventTask.run(EventStorePublisherWithRetry.java:242)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Network is unreachable
You can try this:
check which process is using port 7184 of the namenode (e.g. with the Linux netstat command),
kill it, and then restart
Or
change your namenode port in the configuration and restart Hadoop
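The first suggestion can be sketched as a filter over the listener table. This is illustrative only: listeners_on_port is a hypothetical helper that reads netstat -tlnp or ss -tlnp style output on stdin and prints the lines where some address column ends in the given port.

```shell
#!/bin/bash
# Hypothetical helper: print the lines of a listener table (from stdin)
# in which any whitespace-separated field ends with ":<port>".
listeners_on_port() {
    local port="$1"
    awk -v p="$port" '{
        for (i = 1; i <= NF; i++)
            if ($i ~ (":" p "$")) { print; next }
    }'
}

# Usage (needs root to see the owning process):
#   sudo netstat -tlnp | listeners_on_port 7184
```

If nothing is printed, no process is listening on that port, and killing a process will not help; the "Network is unreachable" message then points at the VM's network configuration instead.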