Does a redisson retry attempt invoke the load balancer again? - redisson

Given one master node and multiple replica nodes with Sentinels enabled, Redisson configured to read from replica nodes only, and round robin load balancer and retries enabled. In case of a Redisson retry attempt, does it invoke the load balancer again to determine a new replica node or does it talk to the same replica that it had unsuccessfully communicated with before?
Our use case is such that we see are witnessing a 50% drop in commands processed per second when a single replica node restarts. We have retries enabled in Redisson but they don't seem to be helping so we are wondering if the retries also go to the same failing node.
Redisson version: 3.13.6
Redis version: 5.0.6

does it invoke the load balancer again to determine a new replica node or does it talk to the same replica that it had unsuccessfully communicated with before?
In each attempt it uses load balancer.
We have retries enabled in Redisson but they don't seem to be helping so we are wondering if the retries also go to the same failing node.
You need to increase attempts amount to survive slave node failover. Also you can reduce failedSlaveCheckInterval parameter to detect failed slave nodes faster without waiting Redis cluster topology update.

Related

How to correctly guarantee the high availability requirement for a WSO2 MB profile cluster?

I have the following doubt relating how to correctly set up a WSO2 MB cluster respecting the requirements of high availability. I am following this official guide: https://docs.wso2.com/display/EI650/Clustering+the+Message+Broker+Profile#ClusteringtheMessageBrokerProfile-Testingthecluster
So I will have a two nodes WSO2 MB profile cluster. Now my doubt is related to the high availability concept (basically: if a single node is not working, the cluster should still works).
I have two nodes cluster, each node run on a specific server having a specific IP address, something like this
NODE 1 with the IP: XXX.XXX.XXX.1
NODE 2 with the IP: XXX.XXX.XXX.2
So lets suppose that I want publish a message into a queue defined on this cluster. I suppose that I can send the message indifferently to one of these 2 nodes (correct me if I am doing wrong assertion).
If this is the situation: how can I guarantee the high availability requirement? Can I simply put both my nodes under a load balancer? so that if one node doesn't work, the request is directed to the other
Is it a correct way to handle this situation?
Yes if you have both of the EI instances with MB profile clustering where the two servers have cluster coordination configured on JDBC or Hazelcast level, with the above-mentioned approach you will guarantee the high availability of the service.
You can make sure to follow the below provide additional precautions to make sure both servers do not go down for the same reason at ones.
Have the two servers in two separate instances rather than having
offset in the same instance.
You can configure the load balancer to
work in Active-Passive or Active-Active mode if you are deployed on a
cloud provider like AWS you can add configurations to redeploy the instances of the health check is failed for a given amount of time.

AWS Codedeploy BlockTraffic/AllowTraffic durations

I've been using AWS CodeDeploy to push our applications live, but it always takes ages doing the BlockTraffic and AllowTraffic steps. Currently, I have an application load balancer(ALB) with three EC2 nodes initially(behind an autoscaling group). So, If I do a CodeDeploy OneAtATime, the whole process takes up to 25 minutes.
The load balancer I'm using it with had connection draining set to 300s. I thought it was the reason for drag out. However, I disabled Connection Draining and got the same results. I then enabled Connection Draining and set timeout to 5 seconds and still got the same results.
Further, I found out CodeDeploy depends on the ALB Health Check settings. according to the AWS documentation
After an instance is bound to the ALB, CodeDeploy waits for the
status of the instance to be healthy ("inService") behind the load
balancer. This health check is done by ALB and depends on the health
check configuration.
So I tried by setting low timeouts and thresholds for health check settings. Even those changes didn't reduce the deployment time much.
Can someone direct me to a proper solution to speed up the process?
The issue is the de-registration of instances from the AWS target group. You want to change this value:
or find a way to update the deregistration_delay.timeout_seconds property - by default it's 300s, which is 5 minutes. The docs can be found here).

ELB backend connection errors when deregister ec2 instances

I've written a custom release script to manage releases for an EC2 autoscaling application. The processing works like so...
Create an AMI based on an application git tag.
Create launch config.
Configure ASG to use new launch config.
Find current desired capacity for ASG.
Set desired capacity to 2x previous capacity.
Wait for new instances to become healthy by querying ELB.
Set desired capacity back to previous value.
This all works fairly well, except whenever I run this, the monitoring for the ELB is showing a lot of backend connection errors.
I don't know why this would be occurring, as it should (based on my understanding) still service current connections if the "Connection draining" option is enabled for the ELB (which it is).
I thought perhaps the ASG was terminating the instances before the connections could finish, so I changed my script to first deregister the instances from the ELB, and then wait a while before changing the desired capacity at the ASG. This however didn't make any difference. As soon as the instances were deregistered from the ELB (even though they're still running and healthy) the backend connection errors occur.
It seems as though it's ignoring the connection draining option and simply dropping connections as soon as the instance has been deregistered.
This is the command I'm using to deregister the instances...
aws elb deregister-instances-from-load-balancer --load-balancer-name $elb_name --instances $old_instances
Is there some preferred method to gracefully remove the instances from the ELB before removing them from the ASG?
Further investigation suggests that the back-end connection errors are occurring because the new instances aren't yet ready to take the full load when the old instances are removed from the ELB. They're healthy, but seem to require a bit more warming.
I'm working on tweaking the health-check settings to give the instances a bit more time before they start trying to serve requests. I may also need to change the apache2 settings to get them ready quicker.

Grace Period? - AWS EC2 Container Service and Elastic Load Balancers

When an elastic load balancer (ELB) is associated with an auto-scaling group, it is possible to specify a grace period during which new EC2 instances will not be terminated even if they are marked as unhealthy by the ELB. Is it possible to specify a similar grace period, during which new ECS tasks will not be killed and restarted by their associated ECS service, even if the ECS instance on which a task is running has been marked unhealthy by the ELB?
Update:
In our current use case, the docker container being run as an ECS task contains a JBoss instance that loads a number of caches on startup. These caches can take several minutes to load. However, the ECS service registers the container instance with the ELB, as soon as the container has started. This means that traffic can be routed to the new container before it is ready to accept it. We could increase the health check interval and the "healthy/unhealthy thresholds" on the ELB to prevent the ELB from routing traffic to the instance and the ECS service from restarting the container until the caches have been loaded. However, increasing the health check interval and thresholds is not desirable, because if an instance is marked as unhealthy after the caches have been loaded, the ECS service should restart the container as soon as possible (which necessitates a shorter health check interval and smaller thresholds).
Thus, is it possible to apply a grace period during which traffic will not be routed to a new container by the ELB and the ECS service will not restart the container (even if it fails the health checks)? Or failing that, are there any suggestions regarding a solution for our use case?
In case anyone else finds themselves here via google, in the linked support thread, it is noted that this has been added to AWS, it is called healthCheckGracePeriodSeconds https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_CreateService.html#ECS-CreateService-request-healthCheckGracePeriodSeconds
After a discussion with the support team, it turns out that ECS cannot support our current use case.
There is a workaround that solves one of the issues we are facing. That workaround is to create a separate, essential, health-check container and in the same ECS task as the actual application container. The purpose of the health-check container is to monitor the application container to determine when the application has been started completely. If it detects that the application has failed to start, it exits, causing the ECS service to cycle the task. The ELB is then configured to perform its health checks against the health-check container, which will always report that it is up via the relevant port. This workaround will prevent the ECS service from cycling the ECS task due to failed health checks.
However, the ELB will begin routing traffic to the application container immediately. It will do so, even if the application container is not yet ready to receive traffic (for example, because it is still waiting for a cache to load). Currently, there is no way to delay the ELB from sending traffic to the application container, as the ECS service provides no support a grace period. We have managed to workaround this issue by providing messages to our application containers via SQS and only having them pull from the queue when their caches are fully loaded. However, we have future use cases (such as serving web requests) where this is not a feasible option. To this end, I intend to raise a feature request for the grace period.
As an aside, both Kubernetes (http://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#application-health-checking) and Marathon (https://mesosphere.github.io/marathon/docs/health-checks.html) already support this option for health checking, if someone reading this is happy not to use a managed service.
Use env var ECS_CONTAINER_STOP_TIMEOUT
See https://github.com/aws/amazon-ecs-agent/issues/126

Prevent machine on Amazon from shutting down before all users finished tasks

I'm planning a server environment on AWS with auto scaling over VPC.
My application has some process that is done in several steps on server, and the user should stick to the same server by using ELB's sticky session.
The problem is, that when the auto scaling group suppose to shut down server, some users may be in the middle of the process (the process takes multiple request - for example -
1. create an album
2. upload photos to the album each at a time
3. convert photos to movie and delete photos
4. store movie on S3)
Is it possible to configure the ELB to stop passing NEW users to the server that is about to shut down, while still passing previous users (that has the sticky session set)?, and - is it possible to tell the server to wait for, let's say, 10 min. after the shutdown rule applied before it actually shut down?
Thank you very much
This feature hasn't been available in Elastic Load Balancing at the time of your question, however, AWS has meanwhile addressed the main part of your question by adding ELB Connection Draining to avoid breaking open network connections while taking an instance out of service, updating its software, or replacing it with a fresh instance that contains updated software.
Please not that you still need to specify a sufficiently large timeout based on the maximum time you expect users to finish their activity, see Connection Draining:
When you enable connection draining for your load balancer, you can set a maximum time for the load balancer to continue serving in-flight requests to the deregistering instance before the load balancer closes the connection. The load balancer forcibly closes connections to the deregistering instance when the maximum time limit is reached.
[...]
If your instances are part of an Auto Scaling group and if connection draining is enabled for your load balancer, Auto Scaling will wait for the in-flight requests to complete or for the maximum timeout to expire, whichever comes first, before terminating instances due to a scaling event or health check replacement. [...] [emphasis mine]
The emphasized part confirms that it is not possible to specify an additional timeout that only applies after the last connection has been drained.