Redisson does not recover after Redis master failover

We are using Redisson 3.17.0 and Redis 6.0.8. We have a Redis cluster-mode setup with 3 masters, and each master has about 4-5 replicas. When a Redis master failover happens, Redisson starts throwing exceptions that it is unable to write commands into the connection. Even after the failover completes (which takes ~30s or so), the exceptions don't stop. Only a bounce of the instance that runs Redisson resolves the error. This is affecting the high availability of our service. We have pingConnectionInterval set to 5000 ms. Our read mode is Masters only.
org.redisson.client.RedisTimeoutException: Command still hasn't been written into connection! Try to increase nettyThreads setting. Payload size in bytes: 81. Node source: NodeSource [slot=10354, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection#1578264320 [redisClient=[addr=rediss://-:6379], channel=[id: 0xb0f98c8c, L:/-:55678 - R:-/-:6379], currentCommand=null, usage=1], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); after 2 retry attempts
Following is our redisson client config:
redisClientConfig: {
  endPoint: "rediss://$HOST_IP:6379"
  scanInterval: 1000
  masterConnectionPoolSize: 64
  masterConnectionMinimumIdleSize: 24
  sslEnableEndpointIdentification: false
  idleConnectionTimeout: 30000
  connectTimeout: 10000
  timeout: 3000
  retryAttempts: 2
  retryInterval: 300
  pingConnectionInterval: 5000
  keepAlive: true
  tcpNoDelay: true
  dnsMonitoringInterval: 5000
  threads: 16
  nettyThreads: 32
}
How can Redisson recover from these exceptions without a restart of the application? We tried increasing nettyThreads and other settings, but Redisson does not recover from the failover.
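For reference, here is a minimal sketch of how the configuration above maps onto Redisson's Java cluster API (the surrounding class and method names are illustrative; the SSL and DNS-monitoring settings follow the same setter pattern and are omitted here):

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.ClusterServersConfig;
import org.redisson.config.Config;
import org.redisson.config.ReadMode;

public class RedissonClientFactory {
    public static RedissonClient create(String endpoint) {
        Config config = new Config();
        config.setThreads(16);
        config.setNettyThreads(32);

        ClusterServersConfig cluster = config.useClusterServers();
        cluster.addNodeAddress(endpoint)               // e.g. "rediss://$HOST_IP:6379"
               .setScanInterval(1000)                  // cluster topology scan interval, ms
               .setMasterConnectionPoolSize(64)
               .setMasterConnectionMinimumIdleSize(24)
               .setIdleConnectionTimeout(30000)
               .setConnectTimeout(10000)
               .setTimeout(3000)
               .setRetryAttempts(2)
               .setRetryInterval(300)
               .setPingConnectionInterval(5000)
               .setKeepAlive(true)
               .setTcpNoDelay(true)
               .setReadMode(ReadMode.MASTER);          // reads go to masters only

        return Redisson.create(config);
    }
}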

Related

Intermittent DNS issues while pulling docker image from ECR repository

Has anyone faced this issue with docker pull? We recently upgraded Docker to 18.03.1-ce, and since then we have been seeing this issue. We are not entirely sure it is related to Docker, but we want to know if anyone else has faced this problem.
We did some troubleshooting using tcpdump; the DNS queries being made were under the permissible limit of 1024 packets, which is a limit on EC2. We also tried working around the issue by modifying /etc/resolv.conf to use higher retry/timeout values, but that didn't seem to help.
We then went through a packet capture line by line and found that some responses were negative. In Wireshark you can use 'udp.stream eq 12' as a filter to view one of the negative answers; the resolver sends the answer "No such name". All the requests that get a negative response use the following name in the request:
354XXXXX.dkr.ecr.us-east-1.amazonaws.com.ec2.internal
Would anyone happen to know why ec2.internal is being appended to the end of the DNS name? If we run dig against this name, it fails. So it appears that a wrong name is being sent to the server, which responds with 'no such host'. Is Docker sending the wrong DNS name for resolution?
We see this issue happening intermittently. Looking forward to any help. Thanks in advance.
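One thing worth checking (an assumption on our side, since the contents of /etc/resolv.conf are not shown): on EC2 the DHCP-supplied resolver configuration usually carries the VPC search domain, and the glibc resolver appends that suffix when a lookup of the plain name fails, which would produce exactly the ec2.internal names seen in the negative answers. A typical file looks like this:

# /etc/resolv.conf as typically written by DHCP on EC2 (illustrative)
nameserver 10.5.0.2   # the VPC resolver, the same address seen in the error output below
search ec2.internal   # suffix appended when a lookup of the plain name fails
# With the default ndots:1, XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com is tried as-is first;
# only when that query fails (e.g. times out) is the search suffix appended, yielding
# XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com.ec2.internal and the "No such name" response.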
Expected behaviour
5.0.25_61: Pulling from rrg
Digest: sha256:50bbce4af6749e9a976f0533c3b50a0badb54855b73d8a3743473f1487fd223e
Status: Downloaded newer image for XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Actual behaviour
docker-compose up -d rrg-node-1
Creating rrg-node-1
ERROR: for rrg-node-1 Cannot create container for service rrg-node-1: Error response from daemon: Get https:/XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/v2/: dial tcp: lookup XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com on 10.5.0.2:53: no such host
Steps to reproduce the issue
docker pull XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Output of docker version:
Docker version 18.03.1-ce, build 3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
Output of docker info:
[ec2-user@ip-10-5-3-45 ~]$ docker info
Containers: 37
Running: 36
Paused: 0
Stopped: 1
Images: 60
Server Version: swarm/1.2.5
Role: replica
Primary: 10.5.4.172:3375
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 12
Plugins:
Volume:
Network:
Log:
Swarm:
NodeID:
Is Manager: false
Node Address:
Kernel Version: 4.14.51-60.38.amzn1.x86_64
Operating System: linux
Architecture: amd64
CPUs: 22
Total Memory: 80.85GiB
Name: mgr1
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Live Restore Enabled: false
WARNING: No kernel memory limit support

Django Channels: gets stuck after a period of time

I run code from https://github.com/andrewgodwin/channels-examples/tree/master/multichat for around 50 users.
It gets stuck without any notice. The server is not down, and the access log shows nothing special. When I stop the Daphne server (with Ctrl+C), it takes about 5-10 minutes to go down completely. Sometimes I have to run a kill command.
Very strangely, when I put Daphne under supervisord and restart it every 30 minutes using crontab, WebSocket connections work normally. It's hacky but it works.
My config: HAProxy => Daphne
daphne -b 192.168.0.6 -p 8000 yyapp.asgi:application --access-log=/home/admin/daphne.log
backend daphne
    balance source
    option http-server-close
    option forceclose
    timeout check 1000ms
    reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
    server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
Debian: 9.4 (original kernel) on OVH server.
Python: 3.6.4
Daphne: 2.2.1
Channels: 2.1.2
Django: 1.11.15
Redis: 4.0.11
I know this question may be too general, but I really have no idea what is causing this. I tried upgrading Python and reinstalling all the packages, but it didn't work.
Well, web servers and load balancers are, in general, very bad with persistent connections. You need to give HAProxy explicit instructions so it knows when and how to time out unused tunnels.
There are four timeouts that HAProxy will need to keep track of:
timeout client
timeout connect
timeout server
timeout tunnel
The first three are related to the initial HTTP negotiation phase of the socket connection. As soon as the connection is established, only timeout tunnel matters. You will need to tinker with the values for your own application, but some suggested values to start with are:
timeout client: 25s
timeout connect: 5s
timeout server: 25s
timeout tunnel: 3600s
In your configuration, that would be:
backend daphne
    balance source
    option http-server-close
    option forceclose
    timeout check 1000ms
    timeout client 25s
    timeout connect 5s
    timeout server 25s
    timeout tunnel 3600s
    reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
    server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
You might need to tinker with the other timeouts to get a good mixture. Some timeouts that may affect your setup - and some starting values - are:
timeout http-keep-alive: 1s
timeout http-request: 15s
timeout queue: 30s
timeout tarpit: 60s
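If you do set them, they would typically go in a defaults section (or the relevant frontend/backend); a sketch, not taken from your existing config:

defaults
    timeout http-keep-alive 1s
    timeout http-request 15s
    timeout queue 30s
    timeout tarpit 60s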
Of course, read up and customize to suit your needs.
Reference:
Haproxy - Websockets Load Balancing

Spark shuts down after 10 seconds of running

I'm trying to set up a cluster in my AWS (Amazon) account. I followed this tutorial to set it up. I ran into some problems regarding ports, but I finally got it to work until... it shut down after 10 seconds, giving me no more than this error:
16/05/12 12:52:46 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:06 INFO client.AppClient$ClientActor: Connecting to master spark://ip-to-my-machine:7077...
16/05/12 12:53:26 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/05/12 12:53:26 ERROR scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
This is the command I ran:
bin/spark-shell --master spark://ip-to-my-machine:7077
I opened TCP port 7077; what could be the problem?

chef test-kitchen configuration for Vagrant ssh connect timeout

My google-fu is failing me. What do I need to put into my .kitchen.yml to increase config.vm.boot_timeout, or the number of attempts, in my Vagrantfile? My kitchen converge almost always hits:
STDERR: Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.
After about another minute or so I can connect without issue.
I've tried everything I thought it could be, but none of the following seem to set it:
driver:
  name: vagrant
  vm.boot_timeout: 20
  vm:
    boot_timeout: 20

driver_config:
  require_chef_omnibus: true
  vm.boot_timeout: 20
  vm:
    boot_timeout: 20
What do I need to do to get this increased?
I added:
driver:
  name: vagrant
  boot_timeout: 1200
It appears to work; boot_timeout is already present in Vagrantfile.erb, maybe because of a newer version.
This isn't supported directly, but you can copy the default Vagrantfile.erb and set
driver:
  name: vagrant
  vagrantfile_erb: path/to/your/Vagrantfile.erb
or possibly: (I forget which is needed)
driver:
  name: vagrant
  config:
    vagrantfile_erb: path/to/your/Vagrantfile.erb
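In the copied template you would then raise the boot timeout with the standard Vagrant setting; in plain Vagrantfile terms (the variable name inside the actual .erb may differ) it looks like this:

Vagrant.configure("2") do |config|
  # wait up to 20 minutes for the guest to become reachable over SSH
  config.vm.boot_timeout = 1200
end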

Increasing capacity of Micro Cloud Foundry 1.2 in offline mode

I would like to do some real-world data testing on Micro Cloud Foundry, but the capacity of the PostgreSQL database is limited to 256 MB, which is not sufficient for my testing. Is there a way to increase the DB capacity temporarily in offline mode for testing?
If not, can somebody point me to the latest instructions for setting up a private Cloud Foundry server on Ubuntu Server 12.04?
You can SSH into the instance, change the configuration for MySQL, and restart the service.
SSH to the MCF instance:
$ ssh vcap@api.<your-mcf-instance-name-here>.cloudfoundry.me
Note: if you can't remember the password for the vcap user, you can change it via the VM console menu by selecting option 3.
Edit the mysql_node configuration file:
$ vi /var/vcap/jobs/mysql_node/config/mysql_node.yml
The file should look something like (or exactly like) this:
---
local_db: sqlite3:/var/vcap/store/mysql_node.db
base_dir: /var/vcap/store/mysql
mbus: nats://nats:f5dc63f74be5e38f@127.0.0.1:4222
index: 0
logging:
  level: debug
  file: /var/vcap/sys/log/mysql_node/mysql_node.log
pid: /var/vcap/sys/run/mysql_node/mysql_node.pid
available_storage: 2048
node_id: mysql_node_1
max_db_size: 256
max_long_query: 3
mysql:
  host: localhost
  port: 3306
  socket: /var/vcap/sys/run/mysqld/mysqld.sock
  user: root
  pass: dc64fad710976ea5
migration_nfs: /var/vcap/services_migration
max_long_tx: 0
max_user_conns: 20
mysqldump_bin: /var/vcap/packages/mysql/bin/mysqldump
mysql_bin: /var/vcap/packages/mysql/bin/mysql
gzip_bin: /bin/gzip
ip_route: 127.0.0.1
z_interval: 30
max_nats_payload: 1048576
The two lines you are interested in are:
available_storage: 2048
and
max_db_size: 256
The first line is the maximum amount of disk storage made available to MySQL; the second is the maximum size of each MySQL DB instance. Set these to your desired values; obviously available_storage has to be larger than max_db_size, and it should also be a multiple of that value.
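For example (illustrative values, not from the stock file), to allow databases of up to 1 GB while keeping available_storage a multiple of max_db_size, you could set:

available_storage: 4096
max_db_size: 1024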
Save the file and then restart the VM (shut it down via the menu in the VM console or do it via SSH) and you should be good to go!