UPDATE: I'm deploying on AWS cloud with the help of kops.
I'm in the process of applying HPA to one of my Kubernetes deployments.
While testing with a sample app deployed in the default namespace, I can see the metrics being exposed as below (current utilisation shows 0%):
$ kubectl run busybox --image=busybox --port 8080 -- sh -c "while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; \
env | grep HOSTNAME | sed 's/.*=//g'; } | nc -l -p 8080; done"
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
busybox Deployment/busybox 0%/20% 1 4 1 14m
But when I deploy in a custom namespace (for example: test), the current utilisation shows <unknown>:
$ kubectl get hpa --namespace test
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
busybox Deployment/busybox <unknown>/20% 1 4 1 25m
Can someone please suggest what's wrong here?
For HPA to work you need to meet a few conditions. You need to have metrics-server or Heapster running on your cluster. It is also important to have CPU resource requests set, which you can configure per namespace.
You did not say in what environment your cluster is running, but in GKE the default namespace has a default CPU request set (100m); on new namespaces you need to specify it yourself:
Please note that if some of the pod’s containers do not have the
relevant resource request set, CPU utilization for the pod will not be
defined and the autoscaler will not take any action for that metric.
In your case I am not sure why it works after a redeploy, as there is not enough information. But for the future remember to:
1) keep the object that you want to scale and the HPA in the same namespace;
2) set a default CPU request per namespace, or simply add --requests=cpu=value so the HPA is able to scale based on it (see the sketch below).
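A minimal sketch of the per-namespace option (assuming your namespace is called test and 100m is an acceptable default) is a LimitRange that gives every container in the namespace a default CPU request:
kubectl apply -n test -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-cpu-request
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
EOF
With that in place, pods created in test without an explicit request still get a CPU request, so the HPA can compute a utilisation percentage instead of reporting <unknown>.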
UPDATE:
for your particular case:
1) kubectl run busybox --image=busybox --port 8080 -n test --requests=cpu=200m -- sh -c "while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; \
env | grep HOSTNAME | sed 's/.*=//g'; } | nc -l -p 8080; done"
2) kubectl autoscale deployment busybox --cpu-percent=50 --min=1 --max=10 -n test
Try running the commands below in the namespace where you experience this issue and see if you get any pointers.
kubectl get --raw /apis/metrics.k8s.io/ - this should return valid JSON.
Also, run kubectl describe hpa name_of_hpa_deployment - this may indicate whether there are any issues with your HPA deployment in that namespace.
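If the raw API call returns valid JSON, you can also check whether metrics are actually resolved for the pods in that namespace (assuming metrics-server or Heapster is running):
kubectl top pods --namespace test
kubectl top nodes
If kubectl top shows no data for the pods in that namespace, the HPA will typically keep reporting <unknown> targets.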
I was streaming Kafka on AWS EC2 (CentOS 7). My Session Manager idle timeout is set to 60 min. And yet, after running for much less than that, the terminal froze, saying my session had been terminated. Of course, the Kafka streaming was disrupted as well.
When I tried to start a new session in a new terminal, I got this error popup:
Your session has been terminated for the following reasons: Plugin with name Standard_Stream not found. Step name: Standard_Stream
and I am still unable to restart a terminal.
What does this error mean and how do I resolve it? Thanks.
To debug this, you first need to access the EC2 instance over SSH with the key (.pem file) - ask your admin if you don't have it.
Running tail -f showed the issue:
tail: inotify resources exhausted
tail: inotify cannot be used, reverting to polling
Restarting the ssm-agent service also failed with No space left on device, but it's not actually about disk space:
[root@env-test ec2-user]# systemctl restart amazon-ssm-agent.service
Error: No space left on device
[root@env-test ec2-user]# df -h |grep dev
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
/dev/nvme0n1p1 100G 82G 18G 83% /
So the error itself means that the system is running low on inotify watches, which enable programs to monitor file/directory changes. To see the currently set limit (including the output on my machine):
$ cat /proc/sys/fs/inotify/max_user_watches
8192
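If you want to confirm that a higher limit helps before making it permanent, you can raise it for the running system only (not preserved across reboots):
sudo sysctl -w fs.inotify.max_user_watches=1048576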
Check which processes are using inotify, so you can either improve those apps or increase max_user_watches:
for foo in /proc/*/fd/*; do readlink -f $foo; done | grep inotify | sort | uniq -c | sort -nr
5 /proc/1/fd/anon_inode:inotify
2 /proc/7126/fd/anon_inode:inotify
2 /proc/5130/fd/anon_inode:inotify
1 /proc/4497/fd/anon_inode:inotify
1 /proc/4437/fd/anon_inode:inotify
1 /proc/4151/fd/anon_inode:inotify
1 /proc/4147/fd/anon_inode:inotify
1 /proc/4028/fd/anon_inode:inotify
1 /proc/3913/fd/anon_inode:inotify
1 /proc/3841/fd/anon_inode:inotify
1 /proc/31146/fd/anon_inode:inotify
1 /proc/2829/fd/anon_inode:inotify
1 /proc/21259/fd/anon_inode:inotify
1 /proc/1934/fd/anon_inode:inotify
Notice that the inotify list above includes the PIDs of the ssm-agent processes, which explains why we hit issues with SSM once max_user_watches reaches its limit:
ps -ef | grep ssm-ag
root 3841 1 0 00:02 ? 00:00:05 /usr/bin/amazon-ssm-agent
root 4497 3841 0 00:02 ? 00:00:33 /usr/bin/ssm-agent-worker
Final solution (permanent, preserved across reboots):
echo "fs.inotify.max_user_watches=1048576" >> /etc/sysctl.conf sysctl -p
Verify:
$ aws ssm start-session --target i-123abc456efd789xx --region ap-northeast-2
Starting session with SessionId: userdev-03ccb1a04a6345bf5
sh-4.2$
This issue comes from the EC2 instance, not from the SSM agent. Go to the link to understand how the SSM agent works.
In my case, extending the disk space worked!
(in my case, syslog had filled the disk)
In my case too, extending the disk space worked, as my /var/log was huge.
Several days ago I asked this question: Confuse about fail2ban behavior with firewallD in Centos 7.
It is a large text with several comments.
It seems something starts flushing iptables some hours after a fail2ban restart, and I can't figure out what it is.
A couple of months ago I moved a few Virtual Hosts from a dedicated server I used for more than 10 years to a Contabo VPS.
Everything went fine except the fail2ban jails. Prisoners escape. :)
My move was from CentOS 6 to CentOS 7 with Webmin/Virtualmin, LAMP and fail2ban; leaving /etc/sysconfig/iptables behind, now using firewalld.
As said, some hours after a fail2ban restart, and after some successfully banned IPs, something is flushing iptables, as @sebres suggested, given "after effect" symptoms like
2019-12-05 16:55:20,856 fail2ban.action [1514]: ERROR iptables -w -n -L INPUT | grep -q 'f2b-proftpd[ \t]' -- stdout: ''
and "already banned" notices.
None of the changes I tried in default configurations changed that.
In the end I deleted the Webmin module for managing fail2ban and reinstalled the service.
I renamed /etc/fail2ban to keep a backup of the configurations.
rpm -qa | grep -i fail2ban
then
yum remove fail2ban-server
yum remove fail2ban-firewalld
yum install fail2ban-firewalld (also installs -server)
yum install fail2ban-systemd
Then I copied the old jail.local to the new /etc/fail2ban directory:
[DEFAULT]
banaction = iptables-multiport
banaction_allports = iptables-allports
[sshd]
enabled = true
port = ssh
maxretry = 4
bantime = 7200
[ssh-ddos]
enabled = true
port = ssh,sftp
filter = sshd-ddos
[webmin-auth]
enabled = true
port = 10000
[proftpd]
enabled = true
bantime = -1
[postfix]
enabled = true
bantime = -1
[dovecot]
enabled = true
bantime = -1
[postfix-sasl]
enabled = true
bantime = -1
I also checked the cron jobs to see if something could be flushing iptables in any way.
For now I am periodically running a script that manually rejects those "already banned" IPs once:
firewall-cmd --permanent --add-rich-rule="rule family='ipv4' source address='xxx.xxx.xxx.xxx' reject"
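Note that rules added with --permanent only take effect after a reload, so the script presumably also runs:
firewall-cmd --reload
(or adds the same rule a second time without --permanent so it applies to the running configuration immediately).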
So my question is: how can I find out what is flushing iptables?
UPDATE 1
After updating to the stable fail2ban 0.10 release it seemed the problems were gone, but after 5 days they started again.
Previously, with v0.9, the problems started a few hours after a restart.
UPDATE 2
Running fail2ban-client -d I got "Found no accessible config files for 'filter.d/sshd-ddos'". That's because I kept the old ssh-ddos config in jail.conf.
So, a subquestion is whether I'm right in simply making this change (at least there are no errors from fail2ban-client -d):
#filter = sshd-ddos
filter = sshd
mode = aggressive (as suggested by @sebres)
Here's the output of fail2ban-client -d
"No, the after effect is there because something is flushing rules, not vice versa"
I understand that; I'm not that fluent in English. I meant that this was the symptom showing that something happens, hence the effect.
"So which banning action do you use really?"
Sorry for my poor knowledge on this matter. Is that what is included in the [DEFAULT] part of jail.local?
"(for example can you exclude some service implemented by Contabo installed or integrated in your VPS, that doing that?"
I asked them some time ago but their answer was "...we are providing our customers with the basic installations..." - nothing more technical. They offer several VPS services and I don't see other people complaining about this.
UPDATE 3
The first jail.local (from the fresh Webmin/Virtualmin install) had these actions:
action = firewallcmd-ipset[]
action_ = %(banaction)s[name=%(__name__)s, bantime="%(bantime)s", port="%(port)s", protocol="%(protocol)s", chain="%(chain)s"]
I changed that to
banaction = iptables-multiport
banaction_allports = iptables-allports
some time ago.
Now I went back to firewallcmd-ipset in [DEFAULT], and this is the fail2ban-client -d output.
I'll check fail2ban.log. ... After a few hours, problems again.
About firewallD: Webmin has a section with the defined zones/rules and tools to manage them instead of having to write shell commands. Nothing more.
I cannot imagine fail2ban being to blame here, but to exclude it (or the possibility that some fail2ban action is broken on your side), we should take a look at your configuration...
So provide your whole (unmodified) configuration dump:
fail2ban-client -d
or at least dump of all actions of all jails:
fail2ban-client -d | grep 'action'
now using firewalld
I can see quite clearly iptables in your config and in the log excerpt (error message).
So which banning action do you use really?
something is flushing iptables because of "after effects" like ...
No, the after effect is there because something is flushing rules, not vice versa.
This can be for example some "firewall" script which you apply to configure iptables (ports, default reject rules, etc), restart of some service due to dependency (restarting or reloading iptables), some script mistakenly implementing knocking, and many others.
I don't think it will be simple to find it if you don't have it under your control (for example, can you exclude some service implemented by Contabo, installed or integrated in your VPS, that is doing that?).
UPDATE 1:
I don't see any error in your excerpt... To remove f2b-proftpd (or flush the INPUT chain) it would have to contain:
either <iptables> -D INPUT ... -j f2b-proftpd
or even <iptables> -F INPUT somewhere in the action parameters, except in actionstop (which is the intended stop for the shutdown or restart case only).
But it is only available in actionstop as expected:
all entries containing removal from INPUT chain (actionstop only):
<iptables> -D INPUT -p tcp -m multiport --dports ssh -j f2b-sshd
<iptables> -D INPUT -p tcp -m multiport --dports 10000 -j f2b-webmin-auth
<iptables> -D INPUT -p tcp -m multiport --dports ftp,ftp-data,ftps,ftps-data -j f2b-proftpd
<iptables> -D INPUT -p tcp -m multiport --dports smtp,465,submission -j f2b-postfix
<iptables> -D INPUT -p tcp -m multiport --dports pop3,pop3s,imap,imaps,submission,465,sieve -j f2b-dovecot
<iptables> -D INPUT -p tcp -m multiport --dports smtp,465,submission,imap,imaps,pop3,pop3s -j f2b-postfix-sasl
<iptables> -D INPUT -p tcp -m multiport --dports ssh,sftp -j f2b-ssh-ddos
Thus we could exclude the blame of fail2ban.
But you said "now using firewalld" - where? In your config dump I see iptables-multiport only:
['set', 'sshd', 'addaction', 'iptables-multiport']
['set', 'webmin-auth', 'addaction', 'iptables-multiport']
['set', 'proftpd', 'addaction', 'iptables-multiport']
['set', 'postfix', 'addaction', 'iptables-multiport']
['set', 'dovecot', 'addaction', 'iptables-multiport']
['set', 'postfix-sasl', 'addaction', 'iptables-multiport']
['set', 'ssh-ddos', 'addaction', 'iptables-multiport'],
So, for example, if firewalld were to rewrite the INPUT chain of iptables (removing the fail2ban entries), you would hit exactly this issue.
So you had better find out which netfilter frontend your system really uses and use that as the banaction (or prevent the fail2ban rules from being rewritten, e.g. by adding the firewall service to fail2ban's dependencies, etc.).
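For example, if firewalld is the netfilter frontend actually managing your rules, a sketch of the jail.local [DEFAULT] section using the firewallcmd actions shipped with fail2ban would be:
[DEFAULT]
banaction = firewallcmd-ipset
banaction_allports = firewallcmd-allports
Conversely, if you want to keep the iptables-multiport actions, make sure firewalld (or whatever reloads it) is not rewriting the INPUT chain underneath fail2ban.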
I'm attempting to create an SSH tunnel when deploying an application to AWS Elastic Beanstalk. I want to run the tunnel as a background process that is always connected once the application deploys. The script hangs forever during deployment and I can't see why.
"/home/ec2-user/eclair-ssh-tunnel.sh":
mode: "000500" # u+rx
owner: root
group: root
content: |
cd /root
eval $(ssh-agent -s)
DISPLAY=":0.0" SSH_ASKPASS="./askpass_script" ssh-add eclair-test-key </dev/null
# we want this command to keep running in the backgriund
# so we add & at then end
nohup ssh -L 48682:localhost:8080 ubuntu#[host...] -N &
and here is the output I'm getting from /var/log/eb-activity.log:
[2019-06-14T14:53:23.268Z] INFO [15615] - [Application update suredbits-api-root-0.37.0-testnet-ssh-tunnel-fix-port-9#30/AppDeployStage1/AppDeployPostHook/01_eclair-ssh-tunnel.sh] : Starting activity...
The ssh tunnel is spawned, and I can find it by doing:
[ec2-user@ip-172-31-25-154 ~]$ ps aux | grep 48682
root 16047 0.0 0.0 175560 6704 ? S 14:53 0:00 ssh -L 48682:localhost:8080 ubuntu@ec2-34-221-186-19.us-west-2.compute.amazonaws.com -N
If I kill that process, the deployment continues as expected, which indicates that the bug is in the tunnel script. I can't seem to find out where though.
You need to add the -n option to ssh when running it in the background, so that it does not read from stdin.
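With that change, the tunnel line in the script from the question would look roughly like this (host placeholder kept as in the original):
nohup ssh -n -L 48682:localhost:8080 ubuntu@[host...] -N &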
I am setting up a Kubernetes deployment using auto-scaling groups and Terraform. The kube master node is behind an ELB to get some reliability in case something goes wrong. The ELB has its health check set to TCP 6443, and TCP listeners for 8080, 6443, and 9898. All of the instances and the load balancer belong to a security group that allows all traffic between members of the group, plus public traffic from the NAT gateway address. I created my AMI using the following script (from the getting started guide)...
# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
# cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF
# apt-get update
# # Install docker if you don't have it already.
# apt-get install -y docker.io
# apt-get install -y kubelet kubeadm kubectl kubernetes-cni
I use the following user data scripts...
kube master
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
kubeadm init \
--external-etcd-endpoints=http://${etcd_elb}:2379 \
--token=${token} \
--use-kubernetes-version=${k8s_version} \
--api-external-dns-names=kmaster.${master_elb_dns} \
--cloud-provider=aws
until kubectl cluster-info
do
sleep 1
done
kubectl apply -f https://git.io/weave-kube
kube node
#!/bin/bash
rm -rf /etc/kubernetes/*
rm -rf /var/lib/kubelet/*
until kubeadm join --token=${token} kmaster.${master_elb_dns}
do
sleep 1
done
Everything seems to work properly. The master comes up and responds to kubectl commands, with pods for discovery, dns, weave, controller-manager, api-server, and scheduler. kubeadm has the following output on the node...
Running pre-flight checks
<util/tokens> validating provided token
<node/discovery> created cluster info discovery client, requesting info from "http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0"
<node/discovery> failed to request cluster info, will try again: [Get http://kmaster.jenkins.learnvest.net:9898/cluster-info/v1/?token-id=eb31c0: EOF]
<node/discovery> cluster info object received, verifying signature using given token
<node/discovery> cluster info signature and contents are valid, will use API endpoints [https://10.253.129.106:6443]
<node/bootstrap> trying to connect to endpoint https://10.253.129.106:6443
<node/bootstrap> detected server version v1.4.4
<node/bootstrap> successfully established connection with endpoint https://10.253.129.106:6443
<node/csr> created API client to obtain unique certificate for this node, generating keys and certificate signing request
<node/csr> received signed certificate from the API server:
Issuer: CN=kubernetes | Subject: CN=system:node:ip-10-253-130-44 | CA: false
Not before: 2016-10-27 18:46:00 +0000 UTC Not After: 2017-10-27 18:46:00 +0000 UTC
<node/csr> generating kubelet configuration
<util/kubeconfig> created "/etc/kubernetes/kubelet.conf"
Node join complete:
* Certificate signing request sent to master and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on the master to see this machine join.
Unfortunately, running kubectl get nodes on the master only returns itself as a node. The only interesting thing I see in /var/log/syslog is
Oct 27 21:19:28 ip-10-252-39-25 kubelet[19972]: E1027 21:19:28.198736 19972 eviction_manager.go:162] eviction manager: unexpected err: failed GetNode: node 'ip-10-253-130-44' not found
Oct 27 21:19:31 ip-10-252-39-25 kubelet[19972]: E1027 21:19:31.778521 19972 kubelet_node_status.go:301] Error updating node status, will retry: error getting node "ip-10-253-130-44": nodes "ip-10-253-130-44" not found
I am really not sure where to look...
The hostnames of the two machines (the master and the node) should be different. You can check them by running cat /etc/hostname. If they do happen to be the same, edit that file to make them different and then do a sudo reboot to apply the change. Otherwise kubeadm will not be able to differentiate between the two machines, and they will show up as a single node in kubectl get nodes.
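For example, on a systemd-based distribution you can check and change the hostname like this (new-unique-name is just a placeholder):
cat /etc/hostname
sudo hostnamectl set-hostname new-unique-name
sudo reboot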
Yes, I faced the same problem.
I resolved it by:
1) running killall kubelet
2) running the kubeadm join command again
3) starting the kubelet service
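Put together, a rough sketch of that recovery on the node (token and master address placeholders taken from the join command in the question) would be:
sudo killall kubelet
sudo kubeadm join --token=${token} kmaster.${master_elb_dns}
sudo systemctl start kubelet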
I would like to increase the ulimit for Docker in Elastic Beanstalk in order to run some apps.
I know that I need to increase the ulimit of the Docker host and restart the docker service, but I cannot find a way to do it.
I wrote the following .ebextensions/01limits.config, but it still does not increase the ulimit.
commands:
  01limits:
    command: echo -e "#commands\nroot soft nofile 65536\nroot hard nofile 65536\n* soft nofile 65536\n* hard nofile 65536" >> /etc/security/limits.conf
  02restartdocker:
    command: service docker restart
ADDED 2014-11-20 09:37 GMT
I also tried the following config file.
commands:
  01limits:
    command: echo -e "#commands\nroot soft nofile 65536\nroot hard nofile 65536\n* soft nofile 65536\n* hard nofile 65536" >> /etc/security/limits.conf
  02restartdocker:
    command: service docker stop && ulimit -a 65536 && service docker start
It successfully increased the ulimit but showed the following error in the management console:
[Instance: i-xxxxxxxx Module: AWSEBAutoScalingGroup ConfigSet: null] Command failed on instance. Return code: 1 Output: [CMD-AppDeploy/AppDeployStage1/AppDeployEnactHook/00flip.sh] command failed with error code 1: /opt/elasticbeanstalk/hooks/appdeploy/enact/00flip.sh Stopping nginx: [ OK ]
Starting nginx: [ OK ]
Stopping current app container: 1c**********... Error response from daemon: Cannot destroy container 1c**********: Driver devicemapper failed to remove root filesystem 1c**************************************************************: Device is Busy 2014/11/20 09:06:36 Error: failed to remove one or more containers.
I am not sure this config is suitable.
This is way late, but here's a (hopefully) working solution to your issue
initctl stop eb-docker && \
/sbin/service docker stop && \
ulimit -n 65536 && \
ulimit -c unlimited && \
export DMAP=$(df | grep /var/lib | awk '{print $1}') && \
if [[ $DMAP ]]; then umount $DMAP; fi && \
/sbin/service docker start && \
initctl start eb-docker
The explanation is as follows:
Stop docker container service
Stop docker service
Apply your ulimit settings (mine are different from yours)
Find offending devicemapper mount and unmount it (handles if none are mounted)
Start docker
Start docker container service
It's not an elegant solution, but such is the life of hacking around EB.
You might want to break it out into smaller components but the general idea is there.
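If you want this to run automatically on deploy rather than by hand, a sketch of wrapping it in an .ebextensions command (the file name and key name are assumptions) could look like:
commands:
  01_raise_docker_ulimit:
    command: |
      initctl stop eb-docker && /sbin/service docker stop && ulimit -n 65536 && ulimit -c unlimited && export DMAP=$(df | grep /var/lib | awk '{print $1}') && if [[ $DMAP ]]; then umount $DMAP; fi && /sbin/service docker start && initctl start eb-docker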
I was running a version of apachectl that was trying to change the file limit, and it was failing and causing an error with the container.
But, eventually, I ran the ulimit command from inside the container:
root@22806b77a474:/home# ulimit
unlimited
It seems apachectl was trying to raise a limit that isn't actually there in the Docker container. I set ULIMIT_MAX_FILES to something that didn't cause a problem:
RUN sed -i 's/ULIMIT_MAX_FILES="${APACHE_ULIMIT_MAX_FILES:-ulimit -n 8192}"/ULIMIT_MAX_FILES="ulimit -H -n"/' /usr/sbin/apachectl