CoreOS fleetctl list-machines not showing 3 machines - digital-ocean

I am following the DigitalOcean tutorial on CoreOS (https://www.digitalocean.com/community/tutorials/how-to-create-flexible-services-for-a-coreos-cluster-with-fleet-unit-files). When I do a fleetctl list-machines command on node 1 and node 2, I am not able to see all the 3 machines listed but just one for it's own node. The following is what I see:
core#coreos-1 ~ $ fleetctl list-machines
MACHINE IP METADATA
XXXX... 10.abc.de.fgh -
I logged onto my 3rd node and noticed that when I do a fleetctl list-machines I get the following error:
core#coreos-3 ~ $ fleetctl list-machines
Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etc
What should I do to find out what is the problem and how to resolve this? I have tried rebooting and other things mentioned but nothing is helping.

What happened was that I had a etcd dependencies in my unit file where I had such as following:
# Dependency ordering
After=etcd.service
I think I needed etcd2 instead.
So I did the following as directed:
sudo systemctl stop fleet.service fleet.socket etcd
sudo systemctl start etcd2
sudo systemctl reset-failed
I had to clean up on the instance that had the file when I queried for it:
core#coreos1 ~ $ etcdctl ls /_coreos.com/fleet/job
/_coreos.com/fleet/job/apache.1.service
/_coreos.com/fleet/job/apache#.service
/_coreos.com/fleet/job/apache#80.service
/_coreos.com/fleet/job/apache#9999.service
/_coreos.com/fleet/job/apache-discovery.1.service
/_coreos.com/fleet/job/apache-discovery#.service
/_coreos.com/fleet/job/apache-discovery#80.service
/_coreos.com/fleet/job/apache-discovery#9999.service
by issuing
etcdctl ls /_coreos.com/fleet/job/apache.1.service
etcdctl rm --recursive /_coreos.com/fleet/job/apache-discovery.1.service
Then I started fleet
sudo systemctl start fleet
And when I did a fleetctl list-machines again it showed all my instances connected.

Related

How to configure Cassandra in GCP to remotely connect?

I am following the below steps to install and configure Cassandra in GCP.
It works perfectly as long as working with Cassandra within GCP.
$java -version
$echo "deb http://downloads.apache.org/cassandra/debian 40x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
$curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
$sudo apt install apt-transport-https
$sudo apt-get update
$sudo apt-get install cassandra
$sudo systemctl status cassandra
//Active: active (running)
$nodetool status
//Datacenter: datacenter1
$tail -f /var/log/cassandra/system.log
$find /usr/lib/ -name cqlshlib
##/usr/lib/python3/dist-packages/cqlshlib
$export PYTHONPATH=/usr/lib/python3/dist-packages
$sudo nano ~/.bashrc
//Add
export PYTHONPATH=/usr/lib/python3/dist-packages
//save
$source ~/.bashrc
$python --version
$cqlsh
//it opens cqlsh shell
But I want to configure Cassandra to remotely connect.
I tried the following 7 different solutions.
But still I am getting the error.
1.In GCP,
VPC network -> firewall -> create
IP 0.0.0.0/0
port tcp=9000,9042,8088,9870,8123,8020, udp=9000
tag = hadoop
Add this tag in VMs
2.rm -Rf ~/.cassandra
3.sudo nano ~/.cassandra/cqlshrc
[connection]
hostname = 34.72.70.173
port = 9042
4. cqlsh 34.72.70.173 -u cassandra -p cassandra
5. firewall - open ports
https://stackoverflow.com/questions/2359159/cassandra-port-usage-how-are-the-ports-used
9000,9042,8088,9870,8123,8020,7199,7000,7001,9160
6. Get rid of this line: JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=localhost"
Try restart the service: sudo service cassandra restart
If you have a cluster, make sure that ports 7000 and 9042 are open within your security group.
7. you can set the environment variable $CQLSH_HOST=1.2.3.4. Then simply type cqlsh.
https://stackoverflow.com/questions/20575640/datastax-devcenter-fails-to-connect-to-the-remote-cassandra-database/20598599#20598599
sudo nano /etc/cassandra/cassandra.yaml
listen_address: localhost
rpc_address: 34.72.70.173
broadcast_rpc_address: 34.72.70.173
sudo service cassandra restart
sudo nano ~/.bashrc
export CQLSH_HOST=34.72.70.173
source ~/.bashrc
sudo systemctl restart cassandra
sudo service cassandra restart
sudo systemctl status cassandra
nodetool status
Please suggest how to get rid of the following error
Connection error: ('Unable to connect to any servers', {'127.0.0.1:9042': ConnectionRefusedE
rror(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
This indicates that when you ran cqlsh, you didn't specify the public IP:
Connection error: ('Unable to connect to any servers', \
{'127.0.0.1:9042': ConnectionRefusedError(111, "Tried connecting to [('127.0.0.1', 9042)]. \
Last error: Connection refused")})
When running Cassandra nodes on public clouds, you need to configure cassandra.yaml with the following:
listen_address: private_IP
rpc_addpress: public_IP
The listen address is the what Cassandra nodes use for communicating with each other privately, e.g. gossip protocol.
The RPC address is what clients/apps/drivers use to connect to nodes on the CQL port (9042) so it needs to be set to the nodes' public IP address.
To connect to a node with cqlsh (a client), you need to specify the node's public IP:
$ cqlsh <public_IP>
Cheers!

Connect to VPN with Podman

Having this Dockerfile:
FROM fedora:30
ENV LANG C.UTF-8
RUN dnf upgrade -y \
&& dnf install -y \
openssh-clients \
openvpn \
slirp4netns \
&& dnf clean all
CMD ["openvpn", "--config", "/vpn/ovpn.config", "--auth-user-pass", "/vpn/ovpn.auth"]
Building the image with:
podman build -t peque/vpn .
If I try to run it with (note $(pwd), where the VPN configuration and credentials are stored):
podman run -v $(pwd):/vpn:Z --cap-add=NET_ADMIN --device=/dev/net/tun -it peque/vpn
I get the following error:
ERROR: Cannot open TUN/TAP dev /dev/net/tun: Permission denied (errno=13)
Any ideas on how could I fix this? I would not mind changing the base image if that could help (i.e.: to Alpine or anything else as long as it allows me to use openvpn for the connection).
System information
Using Podman 1.4.4 (rootless) and Fedora 30 distribution with kernel 5.1.19.
/dev/net/tun permissions
Running the container with:
podman run -v $(pwd):/vpn:Z --cap-add=NET_ADMIN --device=/dev/net/tun -it peque/vpn
Then, from the container, I can:
# ls -l /dev/ | grep net
drwxr-xr-x. 2 root root 60 Jul 23 07:31 net
I can also list /dev/net, but will get a "permission denied error":
# ls -l /dev/net
ls: cannot access '/dev/net/tun': Permission denied
total 0
-????????? ? ? ? ? ? tun
Trying --privileged
If I try with --privileged:
podman run -v $(pwd):/vpn:Z --privileged --cap-add=NET_ADMIN --device=/dev/net/tun -it peque/vpn
Then instead of the permission-denied error (errno=13), I get a no-such-file-or-directory error (errno=2):
ERROR: Cannot open TUN/TAP dev /dev/net/tun: No such file or directory (errno=2)
I can effectively verify there is no /dev/net/ directory when using --privileged, even if I pass the --cap-add=NET_ADMIN --device=/dev/net/tun parameters.
Verbose log
This is the log I get when configuring the client with verb 3:
OpenVPN 2.4.7 x86_64-redhat-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Feb 20 2019
library versions: OpenSSL 1.1.1c FIPS 28 May 2019, LZO 2.08
Outgoing Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Incoming Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
TCP/UDP: Preserving recently used remote address: [AF_INET]xx.xx.xx.xx:1194
Socket Buffers: R=[212992->212992] S=[212992->212992]
UDP link local (bound): [AF_INET][undef]:0
UDP link remote: [AF_INET]xx.xx.xx.xx:1194
TLS: Initial packet from [AF_INET]xx.xx.xx.xx:1194, sid=3ebc16fc 8cb6d6b1
WARNING: this configuration may cache passwords in memory -- use the auth-nocache option to prevent this
VERIFY OK: depth=1, C=ES, ST=XXX, L=XXX, O=XXXXX, emailAddress=email#domain.com, CN=internal-ca
VERIFY KU OK
Validating certificate extended key usage
++ Certificate has EKU (str) TLS Web Server Authentication, expects TLS Web Server Authentication
VERIFY EKU OK
VERIFY OK: depth=0, C=ES, ST=XXX, L=XXX, O=XXXXX, emailAddress=email#domain.com, CN=ovpn.server.address
Control Channel: TLSv1.2, cipher TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384, 2048 bit RSA
[ovpn.server.address] Peer Connection Initiated with [AF_INET]xx.xx.xx.xx:1194
SENT CONTROL [ovpn.server.address]: 'PUSH_REQUEST' (status=1)
PUSH: Received control message: 'PUSH_REPLY,route xx.xx.xx.xx 255.255.255.0,route xx.xx.xx.0 255.255.255.0,dhcp-option DOMAIN server.net,dhcp-option DNS xx.xx.xx.254,dhcp-option DNS xx.xx.xx.1,dhcp-option DNS xx.xx.xx.1,route-gateway xx.xx.xx.1,topology subnet,ping 10,ping-restart 60,ifconfig xx.xx.xx.24 255.255.255.0,peer-id 1'
OPTIONS IMPORT: timers and/or timeouts modified
OPTIONS IMPORT: --ifconfig/up options modified
OPTIONS IMPORT: route options modified
OPTIONS IMPORT: route-related options modified
OPTIONS IMPORT: --ip-win32 and/or --dhcp-option options modified
OPTIONS IMPORT: peer-id set
OPTIONS IMPORT: adjusting link_mtu to 1624
Outgoing Data Channel: Cipher 'AES-128-CBC' initialized with 128 bit key
Outgoing Data Channel: Using 160 bit message hash 'SHA1' for HMAC authentication
Incoming Data Channel: Cipher 'AES-128-CBC' initialized with 128 bit key
Incoming Data Channel: Using 160 bit message hash 'SHA1' for HMAC authentication
ROUTE_GATEWAY xx.xx.xx.xx/255.255.255.0 IFACE=tap0 HWADDR=0a:38:ba:e6:4b:5f
ERROR: Cannot open TUN/TAP dev /dev/net/tun: No such file or directory (errno=2)
Exiting due to fatal error
Error number may change depending on whether I run the command with --privileged or not.
It turns out that you are blocked by SELinux: after running the client container and trying to access /dev/net/tun inside it, you will get the following AVC denial in the audit log:
type=AVC msg=audit(1563869264.270:833): avc: denied { getattr } for pid=11429 comm="ls" path="/dev/net/tun" dev="devtmpfs" ino=15236 scontext=system_u:system_r:container_t:s0:c502,c803 tcontext=system_u:object_r:tun_tap_device_t:s0 tclass=chr_file permissive=0
To allow your container configuring the tunnel while staying not fully privileged and with SELinux enforced, you need to customize SELinux policies a bit. However, I did not find an easy way to do this properly.
Luckily, there is a tool called udica, which can generate SELinux policies from container configurations. It does not provide the desired policy on its own and requires some manual intervention, so I will describe how I got the openvpn container working step-by-step.
First, install the required tools:
$ sudo dnf install policycoreutils-python-utils policycoreutils udica
Create the container with required privileges, then generate the policy for this container:
$ podman run -it --cap-add NET_ADMIN --device /dev/net/tun -v $PWD:/vpn:Z --name ovpn peque/vpn
$ podman inspect ovpn | sudo udica -j - ovpn_container
Policy ovpn_container created!
Please load these modules using:
# semodule -i ovpn_container.cil /usr/share/udica/templates/base_container.cil
Restart the container with: "--security-opt label=type:ovpn_container.process" parameter
Here is the policy which was generated by udica:
$ cat ovpn_container.cil
(block ovpn_container
(blockinherit container)
(allow process process ( capability ( chown dac_override fsetid fowner mknod net_raw setgid setuid setfcap setpcap net_bind_service sys_chroot kill audit_write net_admin )))
(allow process default_t ( dir ( open read getattr lock search ioctl add_name remove_name write )))
(allow process default_t ( file ( getattr read write append ioctl lock map open create )))
(allow process default_t ( sock_file ( getattr read write append open )))
)
Let's try this policy (note the --security-opt option, which tells podman to run the container in newly created domain):
$ sudo semodule -i ovpn_container.cil /usr/share/udica/templates/base_container.cil
$ podman run -it --cap-add NET_ADMIN --device /dev/net/tun -v $PWD:/vpn:Z --security-opt label=type:ovpn_container.process peque/vpn
<...>
ERROR: Cannot open TUN/TAP dev /dev/net/tun: Permission denied (errno=13)
Ugh. Here is the problem: the policy generated by udica still does not know about specific requirements of our container, as they are not reflected in its configuration (well, probably, it is possible to infer that you want to allow operations on tun_tap_device_t based on the fact that you requested --device /dev/net/tun, but...). So, we need to customize the policy by extending it with few more statements.
Let's disable SELinux temporarily and run the container to collect the expected denials:
$ sudo setenforce 0
$ podman run -it --cap-add NET_ADMIN --device /dev/net/tun -v $PWD:/vpn:Z --security-opt label=type:ovpn_container.process peque/vpn
These are:
$ sudo grep denied /var/log/audit/audit.log
type=AVC msg=audit(1563889218.937:839): avc: denied { read write } for pid=3272 comm="openvpn" name="tun" dev="devtmpfs" ino=15178 scontext=system_u:system_r:ovpn_container.process:s0:c138,c149 tcontext=system_u:object_r:tun_tap_device_t:s0 tclass=chr_file permissive=1
type=AVC msg=audit(1563889218.937:840): avc: denied { open } for pid=3272 comm="openvpn" path="/dev/net/tun" dev="devtmpfs" ino=15178 scontext=system_u:system_r:ovpn_container.process:s0:c138,c149 tcontext=system_u:object_r:tun_tap_device_t:s0 tclass=chr_file permissive=1
type=AVC msg=audit(1563889218.937:841): avc: denied { ioctl } for pid=3272 comm="openvpn" path="/dev/net/tun" dev="devtmpfs" ino=15178 ioctlcmd=0x54ca scontext=system_u:system_r:ovpn_container.process:s0:c138,c149 tcontext=system_u:object_r:tun_tap_device_t:s0 tclass=chr_file permissive=1
type=AVC msg=audit(1563889218.947:842): avc: denied { nlmsg_write } for pid=3273 comm="ip" scontext=system_u:system_r:ovpn_container.process:s0:c138,c149 tcontext=system_u:system_r:ovpn_container.process:s0:c138,c149 tclass=netlink_route_socket permissive=1
Or more human-readable:
$ sudo grep denied /var/log/audit/audit.log | audit2allow
#============= ovpn_container.process ==============
allow ovpn_container.process self:netlink_route_socket nlmsg_write;
allow ovpn_container.process tun_tap_device_t:chr_file { ioctl open read write };
OK, let's modify the udica-generated policy by adding the advised allows to it (note, that here I manually translated the syntax to CIL):
(block ovpn_container
(blockinherit container)
(allow process process ( capability ( chown dac_override fsetid fowner mknod net_raw setgid setuid setfcap setpcap net_bind_service sys_chroot kill audit_write net_admin )))
(allow process default_t ( dir ( open read getattr lock search ioctl add_name remove_name write )))
(allow process default_t ( file ( getattr read write append ioctl lock map open create )))
(allow process default_t ( sock_file ( getattr read write append open )))
; This is our new stuff.
(allow process tun_tap_device_t ( chr_file ( ioctl open read write )))
(allow process self ( netlink_route_socket ( nlmsg_write )))
)
Now we enable SELinux back, reload the module and check that the container works correctly when we specify our custom domain:
$ sudo setenforce 1
$ sudo semodule -r ovpn_container
$ sudo semodule -i ovpn_container.cil /usr/share/udica/templates/base_container.cil
$ podman run -it --cap-add NET_ADMIN --device /dev/net/tun -v $PWD:/vpn:Z --security-opt label=type:ovpn_container.process peque/vpn
<...>
Initialization Sequence Completed
Finally, check that other containers still have no these privileges:
$ podman run -it --cap-add NET_ADMIN --device /dev/net/tun -v $PWD:/vpn:Z peque/vpn
<...>
ERROR: Cannot open TUN/TAP dev /dev/net/tun: Permission denied (errno=13)
Yay! We stay with SELinux on, and allow the tunnel configuration only to our specific container.

reusing the salt states in the AMI image

I have several salt states(base and pillars) already written and present in Amazon s3. I want to re-use the salt states instead of writing the salt state again. I want to create an AMI image using packer and apply the salt-states that I have downloaded from s3 to the Packer Builder EC2 instance. Even if the salt-minion is installed on the CentOS -7 machine, I have installed salt-master service as well and started both salt-minion and salt-master by following commands.
cat > /etc/salt/minion.d/minion_id.conf <<'EOT' id: ${host} # id salt-minion id EOT
Generate the name of the master to connect to
cat > /etc/salt/minion.d/master_name.conf <<'EOT' master: localhost EOT
systemctl enable salt-minion
systemctl start salt-minion
systemctl enable salt-master
systemctl start salt-master
When running the below command it doesn't list any minions:
salt-key -L Accepted Keys: Denied Keys: Unaccepted Keys: Rejected Keys:
So the salt 'localhost-*' state.sls state.high_state
fails with errors:
"No minions matched the target. No command was sent, no jid was assigned.
ERROR: No return received"
This is because no minionid is created from salt-key.
Anybody has any idea why the salt-key is not being shown with salt-minion and how can i resolve this issue by running the existing salt-state successfully downloaded from s3 will work in AMI image?
Regards
Pradeep
What could be happening is that your minions can't find (resolve/DNS) the master salt.
What you could do is add the IP of your master to your minions /etc/salt/minion something like this:
master: 10.0.0.1
Replace 10.0.0.3 with the IP of your master
Later restart your minion and check the master again for requests.

Kubeadm why does my node not show up though kubelet says it joined?

I am setting up a Kubernetes deployment using auto-scaling groups and Terraform. The kube master node is behind an ELB to get some reliability in case of something going wrong. The ELB has the health check set to tcp 6443, and tcp listeners for 8080, 6443, and 9898. All of the instances and the load balancer belong to a security group that allows all traffic between members of the group, plus public traffic from the NAT Gateway address. I created my AMI using the following script (from the getting started guide)...
# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
# cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF
# apt-get update
# # Install docker if you don't have it already.
# apt-get install -y docker.io
# apt-get install -y kubelet kubeadm kubectl kubernetes-cni
I use the following user data scripts...
kube master
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
kubeadm init \
--external-etcd-endpoints=http://${etcd_elb}:2379 \
--token=${token} \
--use-kubernetes-version=${k8s_version} \
--api-external-dns-names=kmaster.${master_elb_dns} \
--cloud-provider=aws
until kubectl cluster-info
do
sleep 1
done
kubectl apply -f https://git.io/weave-kube
kube node
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
until kubeadm join --token=${token} kmaster.${master_elb_dns}
do
sleep 1
done
Everything seems to work properly. The master comes up and responds to kubectl commands, with pods for discovery, dns, weave, controller-manager, api-server, and scheduler. kubeadm has the following output on the node...
Running pre-flight checks
<util/tokens> validating provided token
<node/discovery> created cluster info discovery client, requesting info from "http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0"
node/discovery> failed to request cluster info, will try again: [Get http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0: EOF]
<node/discovery> cluster info object received, verifying signature using given token
<node/discovery> cluster info signature and contents are valid, will use API endpoints [https://10.253.129.106:6443]
<node/bootstrap> trying to connect to endpoint https://10.253.129.106:6443
<node/bootstrap> detected server version v1.4.4
<node/bootstrap> successfully established connection with endpoint https://10.253.129.106:6443
<node/csr> created API client to obtain unique certificate for this node, generating keys and certificate signing request
<node/csr> received signed certificate from the API server:
Issuer: CN=kubernetes | Subject: CN=system:node:ip-10-253-130-44 | CA: false
Not before: 2016-10-27 18:46:00 +0000 UTC Not After: 2017-10-27 18:46:00 +0000 UTC
<node/csr> generating kubelet configuration
<util/kubeconfig> created "/etc/kubernetes/kubelet.conf"
Node join complete:
* Certificate signing request sent to master and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on the master to see this machine join.
Unfortunately, running kubectl get nodes on the master only returns itself as a node. The only interesting thing I see in /var/log/syslog is
Oct 27 21:19:28 ip-10-252-39-25 kubelet[19972]: E1027 21:19:28.198736 19972 eviction_manager.go:162] eviction manager: unexpected err: failed GetNode: node 'ip-10-253-130-44' not found
Oct 27 21:19:31 ip-10-252-39-25 kubelet[19972]: E1027 21:19:31.778521 19972 kubelet_node_status.go:301] Error updating node status, will retry: error getting node "ip-10-253-130-44": nodes "ip-10-253-130-44" not found
I am really not sure where to look...
The Hostnames of the two machines (master and the node) should be different. You can check them by running cat /etc/hostname. If they do happen to be the same, edit that file to make them different and then do a sudo reboot to apply the changes. Otherwise kubeadm will not be able to differentiate between the two machines and it will show as a single one in kubectl get nodes.
Yes , I faced the same problem.
I resolved by:
killall kubelet
run the kubectl join command again
and start the kubelet service

ubuntu rabbitmq - Error: unable to connect to node 'rabbit#somename: nodedown

I am using celery for django which needs rabbitmq. Some 4 or 5 months back, it used to work well. I again tried using it for a new project and got below error for rabbitmq while listing queues.
Listing queues ...
Error: unable to connect to node 'rabbit#somename': nodedown
diagnostics:
- nodes and their ports on 'somename': [{rabbitmqctl23014,44910}]
- current node: 'rabbitmqctl23014#somename'
- current node home dir: /var/lib/rabbitmq
- current node cookie hash: XfMxei3DuB8GOZUm1vdUsg==
Whats the solution? If there is no good solution, can I uninstall and reinstall rabbitmq ?
I had installed rabbit as a service apparently and the
sudo rabbitmqctl force_reset
command was not working.
sudo service rabbitmq-server restart
Did exactly what I need.
P.S. I made sure I was the root user to do the previous command
sudo su
if you need change hostname:
sudo aptitude remove rabbitmq-server
sudo rm -fr /var/lib/rabbitmq/
set new hostname:
hostname newhost
in file /etc/hostname set new value hostname
add to file /etc/hosts
127.0.0.1 newhost
install rabbitmq:
sudo aptitude install rabbitmq-server
done
Check if the server is running by using this command:
sudo service rabbitmq-server status
If it says
Status of all running nodes...
Node 'rabbit#ubuntu' with Pid 26995:
running done.
It's running.
In my case, I accidentally ran the rabbitmqctl command with a different user and got the error you mentioned.
You might have installed it with root, try running
sudo rabbitmqctl stop_app
and see what the response is.
(If everything's fine, run
sudo rabbitmqctl start_app
afterwards).
Double check that your cookie hash file is the same
Double check that your machine name (uname) is the same as the one stated in your configuration — this one can be tricky
And double check that you start rabbitmq with the same user as the one you installed it. Just using 'sudo' won't do the trick.