Several open connections in RabbitMQ with different Java client version numbers - amazon-web-services

I have a RabbitMQ cluster set up in AWS. The two machines have an ha-all policy (mirrored queues) and sit behind an ELB. The queue hips.preprod.queue has a Spring AMQP consumer. The Spring AMQP version is 3.3.4.
The consumer connects to the ELB and not to the hosts directly. When the consumer connects to the RabbitMQ cluster, a single connection is created. But after some time I see more connections in RabbitMQ from the same IP but on different sockets. The weird part is that some clients under the Connections tab report version 3.2.4 and others 3.3.4. I also checked the classpath of my consumer and cannot find any 3.2.4 AMQP jar. I am at a total loss as to how the same client can show up with different versions. Has anyone experienced anything similar to this?
Below is the data from the Connections tab.
Network Overview
| Name | Protocol | Client | Node | From client | To client | Timeout | Channels | Virtual host | User name | State |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10.161.2.178:27238 | AMQP 0-9-1 | RabbitMQ / Java 3.3.4 | rabbit#infopush-rabbit2 | 17 B/s (37 kB total) | 8 B/s (24 kB total) | 1s | 1 | /INFOPUSH | | running |
| 10.161.2.178:27312 | AMQP 0-9-1 | RabbitMQ / Java 3.2.4 | rabbit#infopush-rabbit2 | 1 B/s (2.4 kB total) | 1 B/s (1.2 kB total) | 25s | 1 | /INFOPUSH | | running |
| 10.161.2.178:27711 | AMQP 0-9-1 | RabbitMQ / Java 3.3.4 | rabbit#infopush-rabbit2 | 17 B/s (20 kB total) | 9 B/s (10 kB total) | 1s | 1 | /INFOPUSH | | running |
| 10.161.2.178:28833 | AMQP 0-9-1 | RabbitMQ / Java 3.2.4 | rabbit#infopush-rabbit1 | 1 B/s (2.8 kB total) | 0 B/s (1.4 kB total) | 25s | 1 | /INFOPUSH | | running |
| 10.161.2.178:29093 | AMQP 0-9-1 | RabbitMQ / Java 3.2.4 | rabbit#infopush-rabbit1 | 0 B/s (2.5 kB total) | 0 B/s (1.2 kB total) | 25s | 1 | /INFOPUSH | | running |
| 10.161.2.178:29692 | AMQP 0-9-1 | RabbitMQ / Java 3.3.4 | rabbit#infopush-rabbit1 | 16 B/s (15 kB total) | 9 B/s (7.9 kB total) | 1s | 3 | /INFOPUSH | | running |
| 10.161.2.92:10032 | AMQP 0-9-1 | RabbitMQ / Java 3.2.4 | rabbit#infopush-rabbit2 | 1 B/s (1.7 kB total) | 0 B/s (857 B total) | 25s | 1 | /INFOPUSH | | running |
| 10.161.2.92:56573 | AMQP 0-9-1 | RabbitMQ / Java 3.3.4 | rabbit#infopush-rabbit1 | 17 B/s (40 kB total) | 8 B/s (21 kB total) | 1s | 4 | /INFOPUSH | | running |
| 10.161.2.92:56703 | AMQP 0-9-1 | RabbitMQ / Java 3.2.4 | rabbit#infopush-rabbit1 | 1 B/s (1.7 kB total) | 0 B/s (1.1 kB total) | 25s | 1 | /INFOPUSH | | running |
| 10.161.2.92:9352 | AMQP 0-9-1 | RabbitMQ / Java 3.3.4 | rabbit#infopush-rabbit2 | 17 B/s (46 kB total) | 9 B/s (29 kB total) | 1s | 5 | /INFOPUSH | | running |
| 10.161.2.92:9506 | AMQP 0-9-1 | RabbitMQ / Java 3.2.4 | rabbit#infopush-rabbit2 | 1 B/s (2.5 kB total) | 0 B/s (1.2 kB total) | 25s | 1 | /INFOPUSH | | running |
Thanks
-Parshu

Sorry for not updating this post. I found out that another app had a connection leak and was using the 3.2.4 RabbitMQ Java client. Because RabbitMQ is behind an ELB, it was hard to track down the faulty application.
This issue is fixed now.
Thanks
-Parshu
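For anyone hitting the same situation behind an ELB: the client_properties each connection advertises usually reveal which library and version opened it, which can help pin down a leaking application. A minimal sketch, assuming shell access to one of the cluster nodes (the credentials, host, and port below are placeholders):
# list connections along with the client library details they advertise
rabbitmqctl list_connections name peer_host peer_port client_properties state
# or query the management HTTP API
curl -u guest:guest http://localhost:15672/api/connections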

Related

Redisson does not recover after redis master fail over

We are using Redisson 3.17.0 and Redis 6.0.8. We have a Redis cluster-mode setup with 3 masters, and each master has about 4-5 replicas. When a Redis master failover happens, Redisson starts throwing exceptions saying it is unable to write the command into the connection. Even after the failover completes (which takes ~30s or so), the exceptions don't stop; only bouncing the instance that runs Redisson resolves the error. This is affecting the high availability of our service. We have pingConnectionInterval set to 5000 ms, and our read mode is masters only.
org.redisson.client.RedisTimeoutException: Command still hasn't been written into connection! Try to increase nettyThreads setting. Payload size in bytes: 81. Node source: NodeSource [slot=10354, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection#1578264320 [redisClient=[addr=rediss://-:6379], channel=[id: 0xb0f98c8c, L:/-:55678 - R:-/-:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); after 2 retry attempts
Following is our redisson client config:
redisClientConfig: {
endPoint: "rediss://$HOST_IP:6379"
scanInterval: 1000
masterConnectionPoolSize: 64
masterConnectionMinimumIdleSize: 24
sslEnableEndpointIdentification: false
idleConnectionTimeout: 30000
connectTimeout: 10000
timeout: 3000
retryAttempts: 2
retryInterval: 300
pingConnectionInterval: 5000
keepAlive: true
tcpNoDelay: true
dnsMonitoringInterval: 5000
threads: 16
nettyThreads: 32
}
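For comparison, roughly the same settings expressed in Redisson's standard cluster YAML format would look like the sketch below (an approximation using Redisson's documented ClusterServersConfig keys, not our exact config):
clusterServersConfig:
  nodeAddresses:
    - "rediss://$HOST_IP:6379"
  scanInterval: 1000
  readMode: "MASTER"
  masterConnectionPoolSize: 64
  masterConnectionMinimumIdleSize: 24
  sslEnableEndpointIdentification: false
  idleConnectionTimeout: 30000
  connectTimeout: 10000
  timeout: 3000
  retryAttempts: 2
  retryInterval: 300
  pingConnectionInterval: 5000
  keepAlive: true
  tcpNoDelay: true
threads: 16
nettyThreads: 32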
How can Redisson recover from these exceptions without a restart of the application? We tried increasing nettyThreads, etc., but Redisson does not recover from the failover.

Django Channels: gets stuck after a period of time

I run the code from https://github.com/andrewgodwin/channels-examples/tree/master/multichat for around 50 users.
It gets stuck without any notice. The server is not down, and the access log shows nothing special. When I stop the Daphne server (with Ctrl+C), it takes about 5-10 minutes to go down completely; sometimes I have to run a kill command.
The weird part is that when I put Daphne under supervisord and restart it every 30 minutes using crontab, WebSockets connect normally. It's hacky, but it works.
My config: HAProxy => Daphne
daphne -b 192.168.0.6 -p 8000 yyapp.asgi:application --access-log=/home/admin/daphne.log
backend daphne
balance source
option http-server-close
option forceclose
timeout check 1000ms
reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
Debian: 9.4 (original kernel) on OVH server.
Python: 3.6.4
Daphne: 2.2.1
Channels: 2.1.2
Django: 1.11.15
Redis: 4.0.11
I know this question may be too general, but I really have no idea what is going on here. I tried upgrading Python and re-installing all the packages, but it didn't work.
Well, web servers and load balancers are, in general, very bad with persistent connections. You need to give HAProxy explicit instructions so it knows when and how to time out unused tunnels.
There are four timeouts that HAProxy will need to keep track of:
timeout client
timeout connect
timeout server
timeout tunnel
The first three are related to the initial HTTP negotiation phase of the socket connection. As soon as the connection is established, only timeout tunnel matters. You will need to tinker with the values for your own application, but some suggested values to start with are:
timeout client: 25s
timeout connect: 5s
timeout server: 25s
timeout tunnel: 3600s
In your code, that would be:
backend daphne
balance source
option http-server-close
option forceclose
timeout check 1000ms
timeout client 25s
timeout connect 5s
timeout server 25s
timeout tunnel 3600s
reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
You might need to tinker with the other timeouts to get a good mixture. Some timeouts that may affect your setup - and some starting values - are:
timeout http-keep-alive: 1s
timeout http-request: 15s
timeout queue: 30s
timeout tarpit: 60s
Of course, read up and customize to suit your needs.
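In HAProxy these extra timeouts would typically live in the defaults (or frontend) section rather than the backend; a minimal sketch using the starting values above:
defaults
    timeout http-keep-alive 1s
    timeout http-request 15s
    timeout queue 30s
    timeout tarpit 60s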
Reference:
Haproxy - Websockets Load Balancing

uWSGI listen queue of socket full

My setup includes a load balancer (HAProxy) with two nginx servers running Django. Server 2 works fine, but sometimes server 1 starts crashing and the log fills up with this message:
*** uWSGI listen queue of socket ":8000" (fd: 3) full !!! (101/100) ***
How do I go about resolving this issue?
Your listen queue is full. When you run uwsgi, pass it --listen 1024 to increase the queue to 1024.
Note that a larger queue makes you more susceptible to a DDoS attack.
You may also need to increase net.core.somaxconn:
sysctl -w net.core.somaxconn=65536
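If uWSGI is started from an ini file rather than the command line, the same backlog setting can go there instead; a minimal sketch (the file layout is an assumption, not from the question):
[uwsgi]
# listen backlog; the kernel's net.core.somaxconn must be at least this large
listen = 1024
To make the sysctl change survive a reboot, net.core.somaxconn = 65536 can also be added to /etc/sysctl.conf.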

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode. I'm now trying to get the same application to run on an AWS EMR cluster, but currently it's failing.
The message is one I've not seen before and implies that the workers are not receiving jobs and are being shut down.
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 5)
16/11/30 14:45:01 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 7
16/11/30 14:45:01 INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 4)
The DAG shows the workers initialised, then a collect (one that is relatively small), and then shortly after they all fail.
Dynamic allocation was enabled, so one theory was that the driver wasn't sending them any tasks and they timed out. To test that theory I spun up another cluster without dynamic allocation, and the same thing happened.
The master is set to yarn.
Any help is massively appreciated, thanks.
16/11/30 14:49:16 INFO BlockManagerMaster: Removal of executor 21 requested
16/11/30 14:49:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
16/11/30 14:49:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My step is quite simple - spark-submit --deploy-mode client --master yarn --class Run app.jar
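For reference, recent EMR releases enable dynamic allocation by default, and the 60-second figure in the log corresponds to spark.dynamicAllocation.executorIdleTimeout. A sketch of a submit that disables it and pins a fixed number of executors (the executor counts and sizes are placeholders, not a recommendation):
spark-submit --deploy-mode client --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 6 --executor-cores 2 --executor-memory 4g \
  --class Run app.jar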

Celery and RabbitMQ timeouts and connection resets

I'm using RabbitMQ 3.6.0 and Celery 3.1.20 on a Windows 10 machine in a Django application. Everything is running on the same computer. I've configured Celery to Acknowledge Late (CELERY_ACKS_LATE=True) and now I'm getting connection problems.
I start the Celery worker, and after 50-60 seconds of handling tasks each worker thread fails with the following message:
Couldn't ack ###, reason:ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)
(### is the number of the task)
When I look at the RabbitMQ logs I see this:
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672)
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.254.0> (127.0.0.1:55373 -> 127.0.0.1:5672)
=ERROR REPORT==== 10-Feb-2016::22:17:14 ===
closing AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}
The error occurs exactly when the Celery workers are getting their connection reset.
I thought this was an AMQP heartbeat issue, so I added BROKER_HEARTBEAT = 15 to my Celery settings, but it did not make any difference.
I was having a similar issue with Celery on Windows with long-running tasks and concurrency=1. The following configuration finally worked for me:
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
I also started the celery worker daemon with the -Ofair option:
celery -A test worker -l info -Ofair
In my limited understanding, CELERYD_PREFETCH_MULTIPLIER sets the number of messages that sit in the queue of a specific Celery worker. By default it is set to 4. If you set it to 1, each worker will only consume one message and complete the task before it consumes another message. I was having issues with long-running tasks because the connection to RabbitMQ was consistently lost in the middle of a long task, but then the task was re-attempted if any other messages/tasks were waiting in the Celery queue.
The following option was also specific to my situation:
CELERYD_CONCURRENCY = 1
Setting concurrency to 1 made sense for me because I had long-running tasks that needed a large amount of RAM, so they each needed to run solo.
#bbaker's solution with CELERY_ACKS_LATE (which is task_acks_late in Celery 4.x) did not work for me by itself. My workers run in Kubernetes pods, must be run with --pool solo, and each task takes 30-60s.
I solved it by also including broker_heartbeat=0. The full set of settings I use:
broker_pool_limit = None
task_acks_late = True
broker_heartbeat = 0
worker_prefetch_multiplier = 1
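For completeness, a minimal sketch of how these lowercase Celery 4.x settings can be applied on the app object (the app name and broker URL are placeholders, not from the original setup):
from celery import Celery

# placeholder app name and broker URL
app = Celery("myapp", broker="amqp://guest@localhost//")

app.conf.update(
    broker_pool_limit=None,        # disable the broker connection pool, as above
    task_acks_late=True,           # acknowledge only after the task finishes
    broker_heartbeat=0,            # disable AMQP heartbeats
    worker_prefetch_multiplier=1,  # prefetch one message at a time per worker process
)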