Ejabberd Server Application CPU Overload

We have built ejabberd on AWS EC2 instances and enabled clustering across the six ejabberd servers in the Tokyo, Frankfurt, and Singapore regions.
The OS, middleware, applications, and settings on each EC2 instance are exactly the same.
However, the ejabberd CPUs in the Frankfurt and Singapore regions are currently overloaded.
The CPU of the ejabberd node in the Tokyo (Japan) region is normal.
Could you please tell me where to look for the cause?

You can take a look at the ejabberd log files of the problematic (and the good) nodes; maybe you will find some clue there.
You can also use the undocumented "ejabberdctl etop" shell command on the problematic nodes. It's similar to "top", but runs inside the Erlang virtual machine that runs ejabberd:
ejabberdctl etop
========================================================================================
ejabberd@localhost                                                          16:00:12
Load:  cpu         0               Memory:  total        44174    binary       1320
       procs     277                        processes     5667    code        20489
       runq        1                        atom           984    ets          5467

Pid            Name or Initial Func     Time     Reds   Memory  MsgQ  Current Function
----------------------------------------------------------------------------------------
<9135.1252.0>  caps_requests_cache      2393        1     2816     0  gen_server:loop/7
<9135.932.0>   mnesia_recover            480       39     2816     0  gen_server:loop/7
<9135.1118.0>  dets:init/2                71        2     5944     0  dets:open_file_loop2
<9135.6.0>     prim_file:start/0          63        1     2608     0  prim_file:helper_loo
<9135.1164.0>  dets:init/2                56        2     4072     0  dets:open_file_loop2
<9135.818.0>   disk_log:init/2            49        2     5984     0  disk_log:loop/1
<9135.1038.0>  ejabberd_listener:in       31        2     2840     0  prim_inet:accept0/3
<9135.1213.0>  dets:init/2                31        2     5944     0  dets:open_file_loop2
<9135.1255.0>  dets:init/2                30        2     5944     0  dets:open_file_loop2
<9135.0.0>     init                       28        1     3912     0  init:loop/1
========================================================================================
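If the overloaded nodes show busy processes in etop (for example c2s or router processes with high reductions or long message queues), that tells you where the load comes from. To compare the nodes it can also help to tail the logs and check how many sessions each node is actually handling. A minimal sketch; the log path and exact command set depend on your installation and ejabberd version:
ejabberdctl status                          # confirm the node is up and which ejabberd/Erlang version it runs
ejabberdctl connected_users_number          # number of sessions handled by this particular node
tail -n 200 /var/log/ejabberd/ejabberd.log  # default log path on many installs; adjust to yours
If the Frankfurt and Singapore nodes carry far more sessions (or show errors the Tokyo node doesn't), that is a good place to start.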

Related

New Ceph Installation Won't Recover

I am unsure if this is the right platform to ask, but hopefully it is :).
I've got a 3-node Ceph setup:
node1
mds.node1 , mgr.node1 , mon.node1 , osd.0 , osd.1 , osd.6
14.2.22
node2
mds.node2 , mon.node2 , osd.2 , osd.3 , osd.7
14.2.22
node3
mds.node3 , mon.node3 , osd.4 , osd.5 , osd.8
14.2.22
For some reason, though, when I take one node down, it does not start backfilling/recovery at all. It just reports 3 OSDs down, as below, but does nothing to repair it...
If I run ceph -s I get the following output:
[root@node1 testdir]# ceph -s
  cluster:
    id:     8932b76b-282b-4385-bee8-5c295af88e74
    health: HEALTH_WARN
            3 osds down
            1 host (3 osds) down
            Degraded data redundancy: 30089/90267 objects degraded (33.333%), 200 pgs degraded, 512 pgs undersized
            1/3 mons down, quorum node1,node2

  services:
    mon: 3 daemons, quorum node1,node2 (age 2m), out of quorum: node3
    mgr: node1(active, since 48m)
    mds: homeFS:1 {0=node1=up:active} 1 up:standby-replay
    osd: 9 osds: 6 up (since 2m), 9 in (since 91m)

  data:
    pools:   4 pools, 512 pgs
    objects: 30.09k objects, 144 MiB
    usage:   14 GiB used, 346 GiB / 360 GiB avail
    pgs:     30089/90267 objects degraded (33.333%)
             312 active+undersized
             200 active+undersized+degraded

  io:
    client:  852 B/s rd, 2 op/s rd, 0 op/s wr

[root@node1 testdir]#
The odd thing, though: when I boot up my 3rd node again, it does recover and sync. But it looks like backfilling just isn't starting at all...
Is there something that might be causing it?
Update
What I did notice: if I mark a drive as out, it does recover... But when the server node is down and the drive is marked as out, it does not recover at all...
Update 2:
I noticed while experimenting that if the OSD is up but out, it does recover... When the OSD is marked as down, it does not begin to recover at all...
By default, Ceph waits 10 minutes before it marks down OSDs as out (mon_osd_down_out_interval), and recovery to the remaining OSDs only starts once they are marked out; while they are down but still in, the PGs just stay undersized/degraded. This helps when a server only needs a reboot and returns within 10 minutes; then all is good. If you need a longer maintenance window and you're not sure whether it will take more than 10 minutes, but the server will eventually return, run ceph osd set noout to prevent unnecessary rebalancing.
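In practice that means either waiting out (or adjusting) the interval, or flagging the cluster for planned maintenance. A rough sketch of the relevant commands; the config syntax below is as on Nautilus-era releases, so verify against your version:
ceph config get mon mon_osd_down_out_interval   # current wait (in seconds) before a down OSD is marked out
ceph osd set noout                              # before maintenance: don't mark OSDs out, avoid rebalancing
ceph osd unset noout                            # after the node is back and healthy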

Google Cloud Notebook VMs - High-Memory Machine, Kernel Crashing on Memory Error

I'm using GCP's Cloud Notebook VMs. I have a VM with 200+ GB of RAM running and am attempting to download about 70 GB of data from BigQuery into memory using the BigQuery Storage API.
Once it gets to around 50 GB, the kernel crashes.
Tailing the logs with sudo tail -20 /var/log/syslog, here's what I find:
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.550367] Task in /system.slice/jupyter.service killed as a result of limit of /system.slice/jupyter.service
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.563843] memory: usage 53350876kB, limit 53350964kB, failcnt 1708893
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.570582] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.578694] kmem: usage 110900kB, limit 9007199254740988kB, failcnt 0
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.585267] Memory cgroup stats for /system.slice/jupyter.service: cache:752KB rss:53239292KB rss_huge:0KB mapped_file:60KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:53239292KB inactive_file:400KB active_file:248KB unevictable:0KB
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.612963] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.621645] [ 787] 1003 787 99396 17005 63 3 0 0 jupyter-lab
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.632295] [ 2290] 1003 2290 4996 966 14 3 0 0 bash
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.642309] [13143] 1003 13143 1272679 26639 156 6 0 0 python
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.652528] [ 5833] 1003 5833 16000467 13268794 26214 61 0 0 python
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.661384] [ 6813] 1003 6813 4996 936 14 3 0 0 bash
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.670033] Memory cgroup out of memory: Kill process 5833 (python) score 996 or sacrifice child
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.680823] Killed process 5833 (python) total-vm:64001868kB, anon-rss:53072876kB, file-rss:4632kB, shmem-rss:0kB
Dec 2 13:38:07 pytorch-20200908-152245 sync_gcs_service.sh[806]: GCS bucket is not specified in GCE metadata, skip GCS sync
Dec 2 13:39:03 pytorch-20200908-152245 bash[787]: [I 13:39:03.463 LabApp] Saving file at /outlog.txt
I followed the guidance in "How to increase Jupyter notebook Memory limit?" and allocated 100 GB of RAM, but it's still crashing at around 55 GB; e.g., 53350964 kB is the limit in the logs.
How can I utilize the available memory of my machine? Thanks!
Tacking on what worked - changing this setting to a higher number:
/sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes
"Memory cgroup out of memory" means the instance itself has sufficient memory; the process is being killed because of a limit on its cgroup. For containerized workloads this is common, since Docker containers can impose such limits.
a) Identify the cgroup:
systemd-cgtop
b) Check the limit of the cgroup:
cat /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes
c) Adjust the limit. For a Pod, edit the Pod configuration; for a Docker container, adjust the container's memory limit; for a raw cgroup, update the limit directly:
echo [NUMBER_OF_BYTES] > /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes
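For the jupyter.service cgroup seen in the logs above, a minimal sketch could look like the following. The 180 GiB figure is only an illustration, and since the path indicates a systemd-managed service, a systemd override is one way to make the change survive a service restart (restarting otherwise recreates the cgroup with the old limit):
echo $((180 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes   # one-off change to the running cgroup (cgroup v1 path from the logs)
sudo systemctl edit jupyter.service      # persistent alternative: add "MemoryLimit=180G" under [Service]
sudo systemctl restart jupyter.service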

429 at very low qps despite adequate headroom

Xoogler in the cloud here. I have a very low-QPS service that serves HTML plus the follow-up resources. It typically sits idle and then receives something on the order of 20 requests over 5 s, with concurrency well below 10, where the concurrency limit is 80. I observe that clients regularly receive 429s from Cloud Run, typically after periods of service inactivity, even though an instance is still up (so it's not a cold-start problem). This can hit the first request, but often it's somewhere in the middle of the sequence (i.e. icons and CSS don't load).
The instance is concurrent, responsive, and could easily handle the load, but Cloud Run doesn't let it. No other instances are spun up either, although we're not even at the max of 2. This suggests that Cloud Run for some reason estimates that more than 2 instances are needed?
Here's a typical request sequence, redacted from the logs:
... 20 min idle ...
I 2020-03-27T18:21:27.619317Z GET 307 288 B 5 ms
I 2020-03-27T18:21:27.706580Z GET 302 0 B 0 ms
I 2020-03-27T18:21:27.760271Z GET 200 5.83 KiB 5 ms
I 2020-03-27T18:21:27.838066Z GET 200 1.89 KiB 4 ms
I 2020-03-27T18:21:27.882751Z GET 200 1.05 KiB 4 ms
I 2020-03-27T18:21:27.886743Z GET 200 582 B 3 ms
I 2020-03-27T18:21:27.893060Z GET 200 533 B 4 ms
I 2020-03-27T18:21:27.897352Z GET 200 5.35 KiB 4 ms
I 2020-03-27T18:21:27.899086Z GET 200 11.38 KiB 6 ms
I 2020-03-27T18:21:27.905967Z GET 200 22.48 KiB 13 ms
I 2020-03-27T18:21:27.906113Z GET 200 592 B 13 ms
I 2020-03-27T18:21:27.907967Z GET 200 35.08 KiB 14 ms
...500ms...
I 2020-03-27T18:21:28.434846Z GET 200 2.76 MiB 50 ms
I 2020-03-27T18:21:28.465552Z GET 200 2.29 MiB 67 ms <= up to here all resources served from image
...2500ms...
I 2020-03-27T18:21:31.086943Z GET 200 2.95 KiB 706 ms <= IO-bound, talking to backend api
...1600ms...
W 2020-03-27T18:21:32.674973Z GET 429 14 B 0 ms <= !!!
W 2020-03-27T18:21:32.675864Z GET 429 14 B 0 ms <= !!!
W 2020-03-27T18:21:32.676292Z GET 429 14 B 0 ms <= !!!
I 2020-03-27T18:21:32.684265Z GET 200 547 B 6 ms
I 2020-03-27T18:21:32.686695Z GET 200 504 B 9 ms
I 2020-03-27T18:21:32.690580Z GET 200 486 B 12 ms
Conceivably that last group of requests are 6 parallel requests. Why would three be denied and three served? The service is way under capacity. A couple of reloads typically solve the issue.
It really appears to me as if the algorithm vastly overestimates the required resources after a period of inactivity. I'm happy to try a larger max-instances (redeployed to 10 now), but something really seems off with the estimates at the low end of the spectrum. If "2" as a max-instances setting is below what the platform supports, gcloud probably should enforce a higher minimum in the first place.
This is somewhat sad as it impacts people just "trying out" Cloud Run and they observe intermittent errors (partially rendered pages, ...) - which are even pinned on the client (4xx) who is certainly not at fault.
Happy to provide more data.
Configuration:
template:
  metadata:
    ...
    annotations:
      ...
      autoscaling.knative.dev/maxScale: '2'
  spec:
    timeoutSeconds: 900
    ...
    containerConcurrency: 80
    containers:
      ...
      resources:
        limits:
          cpu: 1000m
          memory: 244Mi
This looks like a known issue with Cloud Run; I would recommend starring it to receive notifications and help expedite a resolution.
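If raising max-instances is the workaround you go with in the meantime, the change can also be applied without editing the YAML. A rough sketch; the service name, region, and values are placeholders to adapt:
gcloud run services update my-service \
  --platform managed --region us-central1 \
  --max-instances 10 --concurrency 80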

What's the meaning of the Ethereum Parity console output lines?

Parity doesn't seem to have any documentation on what its console output means. At least none that I've found, which admittedly doesn't mean a whole lot. Can anyone give me a breakdown of the meaning of the following line?
2018-03-09 00:05:12 UTC Syncing #4896969 61ee…bdad 2 blk/s 508 tx/s 16 Mgas/s 645+ 1 Qed #4897616 17/25 peers 4 MiB chain 135 MiB db 42 MiB queue 5 MiB sync RPC: 0 conn, 0 req/s, 182 µs
Thanks.
Why document when you can just read code? (bleh)
2018-03-09 00:05:12 UTC(1) Syncing #4896969(2) 61ee…bdad(3) 2 blk/s(4) 508 tx/s(5) 16 Mgas/s(6) 645+(7) 1(8) Qed #4897616(9) 17/25 peers(10) 4 MiB chain(11) 135 MiB db(12) 42 MiB queue(13) 5 MiB sync(14) RPC: 0 conn(15), 0 req/s(16), 182 µs(17)
1. Timestamp
2. Best block number (latest verified block number)
3. Best block hash
4. Blocks downloaded per second
5. Transactions downloaded per second
6. Millions of gas processed per second
7. Unverified queue size
8. Verified queue size
9. Latest block number
10. Number of active peer nodes / number of total peer nodes
11. Blockchain header cache size
12. Blockchain state cache size
13. Queue cache size
14. Node sync metadata cache size
15. Number of open RPC sessions to your node
16. RPC requests per second
17. Approximate roundtrip ping
The answer to this question is now also included in Parity's FAQ, which provides a comprehensive explanation of the different command line outputs:
What does Parity's command line output mean?

Jython + Django not ready for production?

So recently I was playing around with Django on the Jython platform and wanted to see its performance in "production". The site I tested with was just a simple return HttpResponse("Time %.2f" % time.time()) view, so no database involved.
I tried the following two combinations (measurements done with ab -c15 -n500 -k <url>, everything in Ubuntu Server 10.10 on VirtualBox):
J2EE application server (Tomcat/Glassfish), deployed WAR file
I get results like
Requests per second: 143.50 [#/sec] (mean)
[...]
Percentage of the requests served within a certain time (ms)
50% 16
66% 16
75% 16
80% 16
90% 31
95% 31
98% 641
99% 3219
100% 3219 (longest request)
Obviously, the server hangs for a few seconds once in a while, which is not acceptable. I assume it has something to do with reloading Jython because starting the jython shell takes about 3 seconds, too.
AJP serving using patched flup package (+ Apache as frontend)
Note: flup is the package used by manage.py runfcgi, I had to patch it because flup's threading/forking support doesn't seem to work on Jython (-> AJP was the only working method).
Almost the same results here, but sometimes the last 100 requests don't even get answered at all (but server process still alive).
I'm asking this on SO (instead of serverfault) because it's very Django/Jython-specific. Does anyone have experience with deploying Django sites on Jython? Is there maybe another (faster) way to serve the site? Or is it just too early to use Django on the Java platform?
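For reference, the AJP wiring in combination 2 was roughly as sketched below; the port, method, and Apache directive are illustrative, and the patched flup itself is not shown (plain flup's threaded/forking servers were the part that failed on Jython):
# Django side: serve AJP via flup using the old manage.py runfcgi interface
jython manage.py runfcgi protocol=ajp method=threaded host=127.0.0.1 port=8009
# Apache side, with mod_proxy and mod_proxy_ajp enabled:
#   ProxyPass / ajp://127.0.0.1:8009/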
So as nobody replied, I investigated a bit more and it seems like my problem might have to do with VirtualBox. Using different server OSes (Debian Squeeze, Ubuntu Server), I had similar problems. For example, with simple static file serving, I got this result from the Apache web server (on Debian):
> ab -c50 -n1000 http://ip.of.my.vm/some/static/file.css
Requests per second: 91.95 [#/sec] (mean) <--- quite impossible for static serving
[...]
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   22.1      0    688
Processing:     0  206  991.4     31   9188
Waiting:        0   96  401.2     16   3031
Total:          0  208  991.7     31   9203
Percentage of the requests served within a certain time (ms)
50% 31
66% 47
75% 63
80% 78
90% 156
95% 781
98% 844
99% 9141 <--- !!!!!!
100% 9203 (longest request)
This led me towards a conclusion (well, I don't have a firm conclusion yet), but I think the Java reloading might not be the problem here, rather the virtualization. I will try it on a real host and leave this question unanswered until then.
FOLLOWUP
Now I successfully tested a bare-bones Django site (really just the welcome page) using Jython + AJP over TCP/mod_proxy_ajp on Apache (again with the patched flup package), this time on a real host (i7 920, 6 GB RAM). The result proved that my above assumption was correct and that I really should never benchmark on a virtual host again. Here's the result for the welcome page:
Document Path: /jython-test/
Document Length: 2059 bytes
Concurrency Level: 40
Time taken for tests: 24.688 seconds
Complete requests: 20000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 43640000 bytes
HTML transferred: 41180000 bytes
Requests per second: 810.11 [#/sec] (mean)
Time per request: 49.376 [ms] (mean)
Time per request: 1.234 [ms] (mean, across all concurrent requests)
Transfer rate: 1726.23 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1    1.5      0     20
Processing:     2   49   16.5     44    255
Waiting:        0   48   16.5     44    255
Total:          2   49   16.5     45    256
Percentage of the requests served within a certain time (ms)
50% 45
66% 48
75% 51
80% 53
90% 69
95% 80
98% 90
99% 97
100% 256 (longest request) # <-- no multiple seconds of waiting anymore
Very promising, I would say. The only downside is that the average request time is > 40 ms whereas the development server has a mean of < 3 ms.