What happens when the server load average is too high? Server overloading

I have a problem with my server. Checking connections with netstat -nat | wc -l shows 1150, and the load average is now 502.61, 500.38, 507.66.
I have to reboot it frequently; if I don't, the load average can climb to 700.
Tasks: 465 total, 150 running, 363 sleeping, 5 stopped, 0 zombie
Cpu(s): 24.1%us, 75.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
The load keeps climbing and never comes back down. I have stopped all services (nginx, apache, mysql, named, ...) but the load is still high.
Can anyone help me figure out why this is happening? Thank you very much.
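For reference, with system CPU that high and all services stopped, a useful first step is to look for processes stuck in uninterruptible sleep and for whatever is burning kernel time. A minimal diagnostic sketch, assuming a standard Linux userland (nothing here is specific to this server):
# List processes in uninterruptible sleep (state D); these inflate the load average without using CPU
ps -eo state,pid,ppid,cmd | awk '$1 ~ /D/'
# Watch the run queue (r), blocked tasks (b) and system vs. user CPU over a few seconds
vmstat 1 5
# Per-thread CPU usage, busiest first, to see what accounts for the ~75% system time
ps -eLo pid,tid,pcpu,state,comm --sort=-pcpu | head -20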

Related

Redisson does not recover after Redis master failover

We are using Redisson 3.17.0 and Redis version 6.0.8. We have a Redis cluster-mode setup with 3 masters, and each master has about 4-5 replicas. When a Redis master failover happens, Redisson starts throwing exceptions saying it is unable to write the command into the connection. Even after the failover completes (which takes ~30s or so), the exceptions don't stop. Only a bounce of the instance that runs Redisson resolves the error. This is affecting the high availability of our service. We have pingConnectionInterval set to 5000 ms. Our read mode is Masters only.
org.redisson.client.RedisTimeoutException: Command still hasn't been written into connection! Try to increase nettyThreads setting. Payload size in bytes: 81. Node source: NodeSource [slot=10354, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection#1578264320 [redisClient=[addr=rediss://-:6379], channel=[id: 0xb0f98c8c, L:/-:55678 - R:-/-:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); after 2 retry attempts
Following is our redisson client config:
redisClientConfig: {
endPoint: "rediss://$HOST_IP:6379"
scanInterval: 1000
masterConnectionPoolSize: 64
masterConnectionMinimumIdleSize: 24
sslEnableEndpointIdentification: false
idleConnectionTimeout: 30000
connectTimeout: 10000
timeout: 3000
retryAttempts: 2
retryInterval: 300
pingConnectionInterval: 5000
keepAlive: true
tcpNoDelay: true
dnsMonitoringInterval: 5000
threads: 16
nettyThreads: 32
}
How can Redisson recover from these exceptions without a restart of the application? We tried increasing nettyThreads etc., but Redisson does not recover from the failover.
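For what it's worth, one way to check whether the cluster topology itself has settled after the failover (and therefore whether the problem is purely on the Redisson side) is to query the cluster directly. A minimal sketch, assuming redis-cli 6.x built with TLS support and reusing the $HOST_IP from the config above (depending on your TLS setup, --cacert/--cert/--key may also be needed):
# Confirm the cluster reports cluster_state:ok and all 16384 slots are assigned
redis-cli -h $HOST_IP -p 6379 --tls cluster info
# List masters and replicas and the slot ranges each master owns
redis-cli -h $HOST_IP -p 6379 --tls cluster nodes
If this shows a healthy topology while Redisson keeps throwing, that points at client-side connection handling (stale pooled connections to the old master) rather than at the cluster itself.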

Minikube with VirtualBox or KVM using lots of CPU on CentOS 7

I've installed minikube as per the kubernetes instructions.
After starting it, and waiting a while, I noticed that it is using a lot of CPU, even though I have nothing particular running in it.
top shows this:
%Cpu(s): 0.3 us, 7.1 sy, 0.5 ni, 92.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32521856 total, 2259992 free, 9882020 used, 20379844 buff/cache
KiB Swap: 2097144 total, 616108 free, 1481036 used. 20583844 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4847 root 20 0 3741112 91216 37492 S 52.5 0.3 9:57.15 VBoxHeadless
lscpu shows this:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 21
Model: 2
Model name: AMD Opteron(tm) Processor 3365
I see the same effect if I use KVM instead of VirtualBox.
kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 20m
I installed metrics-server and it outputs this:
kubectl top node minikube
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
minikube 334m 16% 1378Mi 76%
kubectl top pods --all-namespaces
NAMESPACE NAME CPU(cores) MEMORY(bytes)
default hello-minikube-56cdb79778-rkdc2 0m 3Mi
kafka-data-consistency zookeeper-84fb4cd6f6-sg7rf 1m 36Mi
kube-system coredns-fb8b8dccf-2nrl4 4m 15Mi
kube-system coredns-fb8b8dccf-g6llp 4m 8Mi
kube-system etcd-minikube 38m 41Mi
kube-system kube-addon-manager-minikube 31m 6Mi
kube-system kube-apiserver-minikube 59m 186Mi
kube-system kube-controller-manager-minikube 22m 41Mi
kube-system kube-proxy-m2fdb 2m 17Mi
kube-system kube-scheduler-minikube 2m 11Mi
kube-system kubernetes-dashboard-79dd6bfc48-7l887 1m 25Mi
kube-system metrics-server-cfb4b47f6-q64fb 2m 13Mi
kube-system storage-provisioner 0m 23Mi
Questions:
1) Is it possible to find out why it is using so much CPU? (Note that I am generating no load, and none of my containers are processing any data.)
2) Is that normal?
Are you sure nothing is running? What happens if you type kubectl get pods --all-namespaces? By default kubectl only displays the pods that are in the default namespace (thus excluding the pods in the system namespaces).
Also, while I am no CPU expert, this seems like a reasonable consumption for the hardware you have.
In response to question 1):
You can ssh into minikube and from there you can run top to see the processes which are running:
minikube ssh
top
There is a lot of docker and kubelet stuff running:
top - 21:43:10 up 8:27, 1 user, load average: 10.98, 12.00, 11.46
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
%Cpu0 : 15.7/15.7 31[|||||||||||||||||||||||||||||||| ]
%Cpu1 : 6.0/10.0 16[|||||||||||||||| ]
GiB Mem : 92.2/1.9 [ ]
GiB Swap: 0.0/0.0 [ ]
11842 docker 20 0 24.5m 3.1m 0.7 0.2 0:00.71 R `- top
1948 root 20 0 480.2m 77.0m 8.6 4.1 27:45.44 S `- /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --tlsverify --tlscacert /etc/docker/ca+
...
3176 root 20 0 10.1g 48.4m 2.0 2.6 17:45.61 S `- etcd --advertise-client-urls=https://192.168.39.197:2379 --cert-file=/var/lib/minikube/certs/etc+
The two processes with 27 and 17 minutes of processor time (dockerd and etcd) are the culprits.
In response to question 2): No idea, but it could be. See the answer from #alassane-ndiaye.
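A related way to see the same breakdown per container rather than per process, assuming the minikube VM is running the Docker runtime as shown above:
minikube ssh
# One-shot CPU and memory usage per running container
docker stats --no-stream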

uwsgi listen queue fills on reload

I'm running a Django app on uwsgi with an average of 110 concurrent users and 5 requests per second during peak hours. I'm finding that when I deploy with uwsgi reload during these peak hours, workers keep getting slowly killed and restarted, and then the uwsgi logs begin to throw an error:
Gracefully killing worker 1 (pid: 25145)...
Gracefully killing worker 2 (pid: 25147)...
... a few minutes go by ...
worker 2 killed successfully (pid: 25147)
Respawned uWSGI worker 2 (new pid: 727)
... a few minutes go by ...
worker 2 killed successfully (pid: 727)
Respawned uWSGI worker 2 (new pid: 896)
... this continues gradually for 25 minutes until:
*** listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100) ***
At this point my app rapidly slows to a crawl and I can only recover with a hard uwsgi stop followed by a uwsgi start. There are some relevant details which make this situation kind of peculiar:
This only occurs when I uwsgi reload, otherwise the listen queue never fills up on its own
The error messages and slowdown only start to occur about 25 minutes after the reload
Even during the moment of crisis, memory and CPU resources on the machine seem fine
If I deploy during lighter traffic times, this issue does not seem to pop up
I realize that I can increase the listen queue size, but that seems like a band-aid more than an actual solution. And the fact that it only fills up during reload (and takes 25 minutes to do so) leads me to believe that it will fill up eventually regardless of the size. I would like to figure out the mechanism that is causing the queue to fill up and address that at the source.
Relevant uwsgi config:
[uwsgi]
socket = 127.0.0.1:8001
processes = 4
threads = 2
max-requests = 300
reload-on-rss = 800
vacuum = True
touch-reload = foo/uwsgi/reload.txt
memory-report = true
Relevant software version numbers:
uwsgi 2.0.14
Ubuntu 14.04.1
Django 1.11.13
Python 2.7.6
It appears that our touch reload is not graceful when we have even slight traffic; is this to be expected, or do we have a more fundamental issue?
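For reference, uwsgi can expose a stats socket that reports the listen queue depth per socket, which would make it possible to watch the backlog grow during a reload rather than only seeing the "full" message at the end. A minimal sketch (the stats address is an assumption, not part of the original config):
# added to the [uwsgi] section; the JSON it serves includes listen_queue and listen_queue_errors per socket
stats = 127.0.0.1:9191
The JSON can then be polled during a reload, e.g. with uwsgitop 127.0.0.1:9191 (a separate package) or by simply connecting to the port and reading the output.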
uWSGI has a harakiri mode that kills workers whose requests run longer than a configured timeout, to prevent unreliable code from hanging (and effectively taking down the app). I would suggest looking there for why your processes are being killed.
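As an illustration, enabling harakiri in the config above might look like this (the 60-second ceiling is an arbitrary assumption; it should sit comfortably above the slowest legitimate request):
# reap any worker whose single request runs longer than 60 seconds
harakiri = 60
# log details about the request being handled when a worker is reaped
harakiri-verbose = true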
As to why a hard stop works and a graceful stop does not: it seems to further indicate that your application code is hanging. A graceful stop sends SIGHUP, which allows resources to be cleaned up in the application. SIGINT and SIGTERM follow the harsher guideline of "stop what you are doing right now and exit".
Anyway, it boils down to this not being a uwsgi issue, but an issue in your application code. Find what is hanging and why. Since you are not noticing CPU spikes, some probable places to look are:
blocking connections
locks
a long sleep
Good luck!
The key thing to look at is the message "listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100)".
The default listen queue size is 100. Increase the queue size by adding the listen option to uwsgi.ini:
listen = 4096

Docker on AWS filling up its thin pool while running somehow?

I've got a server on Elastic Beanstalk on AWS. Even though no images are being pulled, the thin pool continually fills over the course of less than a day, until the filesystem is re-mounted read-only and the applications die.
This happens with Docker 1.12.6 on the latest Amazon AMI.
I can't really make heads or tails of it.
When an EC2 instance (hosting Beanstalk) starts, it has about 1.3GB in the thin pool. By the time my 1.2GB image is running, it has about 3.6GB (this is from memory, but it is very close). OK, that's fine.
Cut to 5 hours later...
(from the EC2 instance hosting it) docker info returns:
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 107.4 GB
Backing Filesystem: ext4
Data file:
Metadata file:
Data Space Used: 8.489 GB
Data Space Total: 12.73 GB
Data Space Available: 4.245 GB
lvs agrees.
In another few hours that will grow to be 12.73GB used and 0 B free.
dmesg will report:
[2077620.433382] Buffer I/O error on device dm-4, logical block 2501385
[2077620.437372] EXT4-fs warning (device dm-4): ext4_end_bio:329: I/O error -28 writing to inode 4988708 (offset 0 size 8388608 starting block 2501632)
[2077620.444394] EXT4-fs warning (device dm-4): ext4_end_bio:329: I/O error
[2077620.473581] EXT4-fs warning (device dm-4): ext4_end_bio:329: I/O error -28 writing to inode 4988708 (offset 8388608 size 5840896 starting block 2502912)
[2077623.814437] Aborting journal on device dm-4-8.
[2077649.052965] EXT4-fs error (device dm-4): ext4_journal_check_start:56: Detected aborted journal
[2077649.058116] EXT4-fs (dm-4): Remounting filesystem read-only
Yet hardly any space is used in the container itself...
(inside the Docker container:) df -h
/dev/mapper/docker-202:1-394781-1exxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 99G 1.7G 92G 2% /
tmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/xvda1 25G 1.4G 24G 6% /etc/hosts
shm 64M 0 64M 0% /dev/shm
du -sh /
1.7G /
How can this space be filling up? My programs are doing very low-volume logging, and the log files are extremely small. I have good reason not to write them to stdout/stderr.
xxx#xxxxxx:/var/log# du -sh .
6.2M .
I also did docker logs and the output is less than 7k:
>docker logs ecs-awseb-xxxxxxxxxxxxxxxxxxx > w4
>ls -alh
-rw-r--r-- 1 root root 6.4K Mar 27 19:23 w4
The same container does NOT do this on my local Docker setup. And finally, running du -sh / on the EC2 instance itself reveals less than 1.4GB of usage.
It can't be being filled up by log files, and it isn't being filled inside the container. What can be going on? I am at my wits' end!
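For reference, a few checks that can help narrow down where the devicemapper space is going, since deleted-but-still-open files and blocks that ext4 has freed but never discarded back to the pool both stay allocated without showing up in du:
# Per-container writable-layer size, as Docker accounts for it (run on the EC2 host)
docker ps -as
# Inside the container: files that are deleted but still held open by a process, so their blocks stay allocated
lsof +L1
# On the EC2 host: what the thin pool itself thinks is allocated
lvs -a
If docker ps -as shows a small writable layer while Data Space Used keeps growing, one common cause with devicemapper is data written and then deleted inside the container: ext4 frees the blocks (so df and du look small) but the discards never reach the thin pool, so its usage only ever goes up.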

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode. I'm now trying to get the same application to run on an AWS EMR cluster, but currently it's failing.
The message is one I've not seen before and implies that the workers are not receiving jobs and are being shut down.
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 5)
16/11/30 14:45:01 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 7
16/11/30 14:45:01 INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 4)
The DAG shows the workers initialised, then a collect (one that is relatively small) and then shortly after they all fail.
Dynamic allocation was enabled, so there was a thought that perhaps the driver wasn't sending them any tasks and they timed out. To test the theory, I spun up another cluster without dynamic allocation, and the same thing happened.
The master is set to yarn.
Any help is massively appreciated, thanks.
16/11/30 14:49:16 INFO BlockManagerMaster: Removal of executor 21 requested
16/11/30 14:49:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
16/11/30 14:49:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My step is quite simple: spark-submit --deploy-mode client --master yarn --class Run app.jar
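For reference, the dynamic-allocation theory in the question can be made explicit from the submit command itself; the 60 seconds in the log matches the default spark.dynamicAllocation.executorIdleTimeout. A sketch with the relevant settings spelled out (the executor counts and sizes are placeholders, not recommendations):
spark-submit \
  --deploy-mode client \
  --master yarn \
  --class Run \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 8 \
  --executor-cores 2 \
  --executor-memory 4g \
  app.jar
Alternatively, keeping dynamic allocation but passing --conf spark.dynamicAllocation.executorIdleTimeout=300s would show whether the idle timeout itself is the trigger; given that the containers also exit with status 1, the container-launch logs on the failing node are worth checking either way.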