Customizing/Architecting AWS ELB to have Zero Downtime

So the other day we faced an issue where one of the instances behind our application load balancer failed its Instance Status Check and System Status Check. It took about 10 seconds (the minimum we can configure) for our ELB to detect this and mark the instance as "unhealthy"; however, we lost some amount of traffic in those 10 seconds because the ELB kept routing requests to the unhealthy instance. Is there a solution where we can avoid literally any downtime, or am I being too unrealistic?

I'm sure this isn't the answer you want to hear, but if 10 seconds of loss is not tolerable, you'll need to implement your own health-check/load-balancing solution to minimize traffic loss. My organization has systems where packet loss is unacceptable as well, and that's what we had to do.
This solution is twofold:
1. Implement your own load-balancing infrastructure. We chose to use Route53 weighted record sets (TTL of 1s; we'll get back to this) with equal weight for each server.
2. Launch an ECS container instance per load-balanced EC2 instance whose sole purpose is health checking. It runs both DNS and IP health checks (the requests library in Python) and adds/removes the Route53 weighted record in real time as it sees an issue (a minimal sketch follows).
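As a rough illustration of step 2, here is a minimal sketch of what such a health-checker loop could look like, assuming boto3 for the Route53 API. The hosted zone ID, record name, target IP, and /health path are hypothetical placeholders, not details from the original setup:

```python
import time

import boto3
import requests

ROUTE53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # hypothetical hosted zone
RECORD_NAME = "app.example.com."     # hypothetical FQDN
TARGET_IP = "203.0.113.10"           # IP of the instance this checker watches
SET_IDENTIFIER = "server-1"          # identifier of this weighted record


def instance_healthy() -> bool:
    """HTTP health check against the instance's IP (hypothetical /health path)."""
    try:
        resp = requests.get(f"http://{TARGET_IP}/health", timeout=2)
        return resp.status_code == 200
    except requests.RequestException:
        return False


def set_record(action: str) -> None:
    """UPSERT or DELETE the weighted A record for this instance."""
    ROUTE53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": action,  # "UPSERT" to add, "DELETE" to remove
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "SetIdentifier": SET_IDENTIFIER,
                    "Weight": 10,
                    "TTL": 1,  # 1s TTL, as described above
                    "ResourceRecords": [{"Value": TARGET_IP}],
                },
            }]
        },
    )


if __name__ == "__main__":
    in_rotation = True
    while True:
        healthy = instance_healthy()
        if not healthy and in_rotation:
            set_record("DELETE")   # pull the instance out of DNS
            in_rotation = False
        elif healthy and not in_rotation:
            set_record("UPSERT")   # put it back once it recovers
            in_rotation = True
        time.sleep(1)
```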
In our testing, however, we discovered that while the upstream DNS servers honor Route53's 1-second TTL upon removal of a DNS record, they "blacklist" that record (FQDN + IP combo) from coming back up again for up to 10 minutes (we saw re-resolution times vary from 1 to 10 minutes). So you'll be able to fail over quickly, but you must take into account that it can take up to 10 minutes for the re-addition of the record to be honored.

Related

ELB cross-AZ balancing DNS resolution with Sticky sessions

I am preparing for AWS certification and came across a question about ELB with sticky sessions enabled for instances in 2 AZs. The problem is that requests from a software-based load tester in one of the AZs end up in the instances in that AZ only, instead of being distributed across AZs. At the same time, regular requests from customers are evenly distributed across AZs.
The correct answers to fix the load-tester issue are:
1. Force the software-based load tester to re-resolve DNS before every request;
2. Use a third-party load-testing service to send requests from globally distributed clients.
I'm not sure I understand this scenario. What is the default behaviour of Route 53 when it comes to resolving ELB IPs? In any case, those DNS records have a 60-second TTL. Isn't it redundant to re-resolve DNS on every request? Besides, DNS resolution is the responsibility of the DNS service itself, not the load-testing software, isn't it?
I can understand that requests from the same instance, with the load-testing software on it, will go to the same load-balanced EC2 instance, but why does it have to be an instance in the same AZ? That could only be achieved by geolocation- or latency-based routing, but I can't find anything in the docs saying those are the defaults.
When an ELB is in more than one availability zone, it always has more than one public IP address -- at least one per zone.
When you request these records in a DNS lookup, you get all of these records (assuming there are not very many) or a subset of them (if there are a large number, which would be the case in an active cluster with significant traffic) but they are unordered.
If the load-testing software resolves the IP address of the endpoint and holds onto exactly one of the IP addresses -- as is a likely outcome -- then all of the traffic will go to one node of the balancer, which is in one zone, and will send traffic to instances in that zone.
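You can see the multiple-address behavior for yourself with a quick lookup. A sketch in Python, where the ELB hostname is a placeholder:

```python
import socket

# Hypothetical ELB DNS name -- substitute your own.
HOSTNAME = "my-elb-1234567890.us-east-1.elb.amazonaws.com"

# Collect every A record returned for the balancer; a multi-AZ ELB
# returns at least one address per enabled zone, in no fixed order.
addresses = {info[4][0] for info in
             socket.getaddrinfo(HOSTNAME, 80, proto=socket.IPPROTO_TCP)}
print(addresses)

# A naive client that resolves once and keeps the first answer will
# send all of its traffic to a single balancer node:
print(socket.gethostbyname(HOSTNAME))
```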
But what about...
Cross-Zone Load Balancing
The nodes for your load balancer distribute requests from clients to registered targets. When cross-zone load balancing is enabled, each load balancer node distributes traffic across the registered targets in all enabled Availability Zones. When cross-zone load balancing is disabled, each load balancer node distributes traffic across the registered targets in its Availability Zone only.
https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html
If stickiness is configured, those sessions will initially land in one AZ and then stick to that AZ, because they stick to the initial instance where they landed. If cross-zone load balancing is enabled, the outcome is less clear: balancer nodes may prefer instances in their own zone when first establishing stickiness, or this simply wasn't the point of the question. Stickiness requires coordination, and cross-AZ traffic takes a non-zero amount of time due to distance (typically <10 ms), so it would make sense for a balancer node to prefer instances in its local zone for sessions with no established affinity.
In fact, configuring the load-test software to re-resolve the endpoint for each request is not really the focus of the solution -- the point is to ensure that (1) the load-test software uses all of the addresses and does not latch onto exactly one, and (2) if more addresses become available because the balancer scales out under load, the load-test software expands its pool of targets.
In any case, those DNS records have a 60-second TTL. Isn't it redundant to re-resolve DNS on every request?
The software may not see the TTL, may not honor the TTL, and, as noted above, may stick to one answer even when multiple are available, because it only needs one in order to make the connection. Re-resolving on every request is not strictly necessary, but it does solve the problem.
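A sketch of what "re-resolve DNS before every request" can look like in a load-test client, in Python with requests; the hostname is a placeholder:

```python
import random
import socket

import requests

HOSTNAME = "my-elb-1234567890.us-east-1.elb.amazonaws.com"  # placeholder


def resolve_all(host: str) -> list[str]:
    """Return every A record currently advertised for the host."""
    return list({info[4][0] for info in
                 socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)})


for _ in range(100):
    # Pick a fresh balancer node for each request instead of latching
    # onto the first answer; pass the Host header so the balancer and
    # backend still see the original hostname.
    ip = random.choice(resolve_all(HOSTNAME))
    requests.get(f"http://{ip}/", headers={"Host": HOSTNAME}, timeout=5)
```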
Besides, DNS resolution is the responsibility of the DNS service itself, not the load-testing software, isn't it?
To "resolve DNS" in this context simply means to do a DNS lookup, whatever that means in the specific instance, whether using the OS's DNS resolver or making a direct query to a recursive DNS server. When software establishes a connection to a hostname, it "resolves" (looks up) the associated IP address.
The other solution, "use third party load-testing service to send requests from globally distributed clients," solves the problem by accident, since the distributed clients -- even if they stick to the first address they see -- are more likely to see all of the available addresses. The "global" distribution aspect is a distraction.
ELB relies on random arrival of requests across its external-facing nodes as part of the balancing strategy. Load testing software whose design overlooks this is not properly testing the ELB. Both solutions mitigate the problem in different ways.
Stickiness is the issue; see here: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-sticky-sessions.html
The load balancer uses a special cookie to associate the session with the instance that handled the initial request, but follows the lifetime of the application cookie specified in the policy configuration. The load balancer only inserts a new stickiness cookie if the application response includes a new application cookie. The load balancer stickiness cookie does not update with each request. If the application cookie is explicitly removed or expires, the session stops being sticky until a new application cookie is issued.
The first solution, re-resolving DNS, will create new sessions and thereby break the ELB's stickiness. The second solution is to use multiple clients; stickiness is not an issue if the number of globally distributed clients is large.
PART 2 (could not add as a comment, it is too long):
Yes, my answer was too simple and incomplete.
What we know is that the ELB spans 2 AZs and will have 2 nodes with different IPs. It's not clear how many IPs; that depends on the number of requests and the number of servers in each AZ. Route 53 rotates the IPs for every new request: the first time it returns NodeA-IP, NodeB-IP; the second time NodeB-IP, NodeA-IP. The load-testing application will take the first IP with every new request, balancing between the 2 AZs. Because a node can route only inside its own AZ, if the sticky cookie is for NodeA and the request arrives at NodeB, NodeB will send it to one of its servers in AZ 2, ignoring the cookie for a server in AZ 1.
I need to run some tests. I quickly tested Route 53 with a classic ELB and 2 AZs, and it rotates the IPs every time. What I want to test is whether, if I have a sticky cookie for AZ 1 and I reach Node 2, it will forward me to Node 1 (the doc describes this interesting flow for the case of no available servers). I hope to have updates shortly.
Just found another piece of evidence that Route 53 returns multiple IPs and rotates them for ELB scaling scenarios:
By default, Elastic Load Balancing will return multiple IP addresses when clients perform a DNS resolution, with the records being randomly ordered on each DNS resolution request. As the traffic profile changes, the controller service will scale the load balancers to handle more requests, scaling equally in all Availability Zones.
And then:
To ensure that clients are taking advantage of the increased capacity, Elastic Load Balancing uses a TTL setting on the DNS record of 60 seconds. It is critical that you factor this changing DNS record into your tests. If you do not ensure that DNS is re-resolved or use multiple test clients to simulate increased load, the test may continue to hit a single IP address when Elastic Load Balancing has actually allocated many more IP addresses.
What I didn't realize at first is that even if regular traffic is distributed evenly across AZs, it doesn't mean that cross-zone load balancing is enabled. As Michael pointed out, regular traffic will naturally come from various locations and end up in different AZs.
And as it was not explicitly mentioned in the test, Cross-AZ balancing might not have been in place.
https://aws.amazon.com/articles/best-practices-in-evaluating-elastic-load-balancing/

AWS - Does Elastic Load Balancing actually prevent LOAD BALANCER failover?

I've taken this straight from some AWS documentation:
"As traffic to your application changes over time, Elastic Load Balancing scales your load balancer and updates the DNS entry. Note that the DNS entry also specifies the time-to-live (TTL) as 60 seconds, which ensures that the IP addresses can be remapped quickly in response to changing traffic."
Two questions:
1) I was originally under the impression that a single static IP address would be mapped to multiple instances of an AWS load balancer, thereby providing fault tolerance at the balancer level: if, for instance, one machine crashed for whatever reason, the static IP address registered to my domain name would simply be dynamically 'moved' to another balancer instance and continue serving requests. Is this wrong? Based on the quote above from AWS, it seems that the only magic happening here is that AWS's DNS servers hold multiple A records for your AWS-registered domain name, and after 60 seconds the TTL expires and Amazon's DNS entry is updated to only send requests to active IPs. That still means up to 60 seconds of failed connections on the client side. True or false? And why?
2) If the above is true, would it be functionally equivalent if I used a hosting provider such as GoDaddy, entered multiple "A" records, and set the TTL to 60 seconds?
Thanks!
The ELB is assigned a DNS name which you can then assign to an A record as an alias; see here. If you have your ELB set up with multiple instances, you define the health check: you determine what path is checked, how often, and how many failures indicate an instance is down (for example, check / every 10s with a 5s timeout, and if it fails 2 times in a row consider it unhealthy). When an instance becomes unhealthy, all the remaining instances still serve requests just fine without delay. If the instance returns to a healthy state (for example, it passes 2 checks in a row), it returns as a healthy host in the load balancer.
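For a Classic ELB, the example health check above could be configured with boto3 roughly like this; the load balancer name is a placeholder:

```python
import boto3

elb = boto3.client("elb")  # Classic Load Balancer API

# Check "/" over HTTP on port 80 every 10s with a 5s timeout; two
# consecutive failures mark an instance unhealthy, and two consecutive
# passes bring it back -- matching the example above.
elb.configure_health_check(
    LoadBalancerName="my-load-balancer",  # placeholder
    HealthCheck={
        "Target": "HTTP:80/",
        "Interval": 10,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)
```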
What the quote is referring to is the load balancer itself. In the event it has an issue or an AZ becomes unavailable, the quote describes what happens with the underlying ELB DNS record, not the alias record you assign to it.
Whether or not traffic is affected depends partially on how sessions are handled by your setup: whether they are sticky or handled by another system like ElastiCache or your database.

Route 53 failover routing takes downtime of 6 to 8 minutes

I am getting a "502 Bad Gateway" error while switching regions with Route 53 failover.
Switching from primary to secondary takes 2-3 minutes if the primary is down.
Meanwhile, when we are on the DR site and the primary comes back up, it takes another 6 to 8 minutes to redirect traffic from DR back to the primary. How can we reduce this 6 to 8 minutes of downtime to zero?
You need to check how long your ELB health check plus your Route53 health checks take to determine that a failover is required; the final step is the TTL of the DNS records.
For example, let's say you have a web application, hosted behind an ELB, and you are accessing it via myapp.mydomain.com.
ELB Health Check
While the primary thing you should check is the R53 health check (see below), the ELB configuration is also important.
Look at how long it should take to determine failure:
HealthCheck Interval - The amount of time between health checks
Unhealthy Threshold - how many consecutive failed health checks before an instance is marked unhealthy
Make sure this configuration is the same for the ELBs in both regions.
Route53 Health Check
This is the main thing that will determine how long failover takes.
You probably have 2 CNAME records for myapp.mydomain.com, each associated with a Route53 health check, and each health check pointing at the ELB in its respective region.
Check both health checks and make sure:
Request interval - how often R53 will poll your ELB for its health.
Failure threshold - The number of consecutive health checks that an endpoint must pass or fail for the status to change.
Make sure both health checks' configuration (primary and secondary) is the same.
Once the status changes, it's up to the DNS record TTL.
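As a sketch, a matching pair of Route53 health checks could be created with boto3 like this; the endpoint names and intervals are illustrative placeholders:

```python
import uuid

import boto3

r53 = boto3.client("route53")

# One health check per region's ELB; keep the config identical on both.
for endpoint in ("primary-elb.mydomain.com", "secondary-elb.mydomain.com"):  # placeholders
    r53.create_health_check(
        CallerReference=str(uuid.uuid4()),  # any unique string
        HealthCheckConfig={
            "Type": "HTTP",
            "FullyQualifiedDomainName": endpoint,
            "Port": 80,
            "ResourcePath": "/",
            "RequestInterval": 10,   # poll every 10s (fastest supported)
            "FailureThreshold": 3,   # 3 consecutive failures => unhealthy
        },
    )
```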
Route53 CNAME TTL
Check how long clients will keep resolving to a record after a failover by looking at the record TTL. For example, if the TTL is 30, it will take approx. 30 seconds for cached answers to expire and for clients to start being pointed at the secondary region.
Make sure both CNAME records have the same TTL.
After following this, you can determine how long a failover should take. For example:
Your health checks are looking at availability of port 80, path /; your health checks take approx. 30 seconds; and your Apache dies on the primary site.
Within 30 (example) seconds, the ELB will determine the instances are out of service and stop forwarding traffic to them.
Within the same 30 (example) seconds, the R53 health check, which is monitoring the same endpoint (port 80, path /), will also determine the primary ELB is unhealthy.
This is where R53 decides to start pointing DNS queries to your secondary ELB.
If your TTL is set to 30, failover should be complete in approx. 1 minute, +/- some time for propagation, etc.
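Putting the example numbers together (these figures come from the example above and my assumed health-check settings, not from any universal defaults):

```python
# Rough back-of-the-envelope failover estimate for the example above.
request_interval = 10     # seconds between R53 health check polls (assumed)
failure_threshold = 3     # consecutive failures before unhealthy (assumed)
record_ttl = 30           # TTL on the failover CNAME records

detection = request_interval * failure_threshold  # ~30s to flag the failure
total = detection + record_ttl                    # plus cached answers expiring
print(f"approx. failover time: {total}s")         # ~60s, i.e. about 1 minute
```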
Make sure not to set your health checks to be too frequent; depending on how many instances are behind your ELB, frequent checks can result in a lot of calls to your service's health endpoint from the ELB and Route53.

Elastic Beanstalk reports 5xx errors even though instances are in perfect health

I need to set up an API application for gathering event data to be used in a recommendation engine. This is my setup:
Elastic Beanstalk env with a load balancer and autoscaling group.
I have 2x t2.medium instances running behind a load balancer.
The Elastic Beanstalk configuration is 64bit Amazon Linux 2016.03 v2.1.1 running Tomcat 8 Java 8.
Additionally, I have 8x t2.micro instances that I use for high-load testing the API, sending thousands of requests/sec to be handled by the API.
I'm using Locust (http://locust.io/) as my load-testing tool.
Each t2.micro instance run by Locust can send up to about 500 req/sec.
Everything works fine while the requests/sec stay below 1000, maybe 1200. Above that, my load balancer reports that some of the instances behind it are returning 5xx errors (attached). I've also tried with 4 instances behind the load balancer, and although things start out well at up to 3000 req/sec, soon after, the EB health tool and Locust both report 503s and 504s, while all of the instances are in perfect health according to the actual numbers in the EB Health Overview, showing only 10%-20% CPU utilization.
Is there something I'm missing in configuring the environment? It seems like no matter how many machines I have behind the load balancer, the environment handles no more than 1000-2000 requests per second.
EDIT:
Now I know for sure that it's the ELB that is causing the problems, not the instances.
I ran a load test starting with 10 simulated users. Each user sends about 1 req/sec, and the load increases by 10 users/sec up to 4000 users, which should equal about 4000 req/sec. Still, it doesn't seem to handle any request rate over 3.5k req/sec (attachment1).
As you can see from attachment2, the 4 instances behind the load balancer are in perfect health, but I still keep getting 503 errors. It's the load balancer itself causing the problems: look how SurgeQueueLength and SpilloverCount increase rapidly at some point (attachment3). I'm trying to figure out why.
I also completely removed the load balancer and tested with just one instance alone. It can handle up to about 3k req/sec (attachment4 and attachment5), so it's definitely the load balancer.
Maybe I'm missing some crucial limit that load balancers have by default, like the surge queue size of 1024? What is the normal handling rate for one load balancer? Should I be adding more load balancers? Could it be related to availability zones, e.g. ELB listeners in one zone trying to route to instances in a different zone?
(Attachments 1-5 referenced above were screenshots of the load test, instance health, and SurgeQueueLength/SpilloverCount graphs; the images are not included here.)
UPDATE:
Cross zone load balancing is enabled
UPDATE:
maybe this helps more:
The message says that "9.8% of the requests to the ELB are failing with HTTP 5xx (6 minutes ago)". This does not mean that your instances are returning HTTP 5xx responses; the requests are failing at the ELB itself. This can happen when your backend instances are at capacity (e.g. their connections are saturated and they are rejecting new connections from the ELB).
Your requests are spilling over at the ELB. They never make it to the instance. If they were failing at the EC2 instances then the cause would be different and data for the environment would match the data for the instances.
Also note that the cause says this was the state "6 minutes ago". Elastic Beanstalk uses multiple data sources: one is the data coming from the instance, which shows the requests per second and HTTP status codes in the table shown; another is the CloudWatch metrics for your ELB. Since CloudWatch metrics for ELB have one-minute granularity, this data is slightly delayed, and the cause tells you how old the information is.
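If you want to watch the spillover directly, the two ELB metrics mentioned above can be pulled from CloudWatch. A sketch with boto3, where the load balancer name is a placeholder:

```python
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
now = datetime.utcnow()

# SurgeQueueLength is a gauge (capped at 1024); SpilloverCount counts
# requests the ELB rejected because the surge queue was full.
for metric, stat in (("SurgeQueueLength", "Maximum"), ("SpilloverCount", "Sum")):
    points = cw.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric,
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],  # placeholder
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=[stat],
    )["Datapoints"]
    for p in sorted(points, key=lambda p: p["Timestamp"]):
        print(metric, p["Timestamp"], p[stat])
```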

Using Redis behind an AWS load balancer

We're using Redis to collect events from our web application (pub/sub based) behind an AWS ELB.
We're looking for a solution that will allow us to scale up and get high availability across the different servers. We do not wish to have these two servers in a Redis cluster; our plan is to monitor them using CloudWatch and switch between them if necessary.
We tried a simple test of placing two Redis servers behind the ELB, telnetting to the ELB DNS name, and watching what happens with 'redis-cli monitor', but we don't see anything (when trying the same without the ELB it works fine).
Any suggestions?
Thanks!
I came across this while looking for a similar question, but disagree with the accepted answer. Even though this is pretty old, hopefully it will help someone in the future.
It's more appropriate for your question to use DNS failover with a Redis replication auto-failover configuration. DNS failover provides groups of availability (if you need that level of scale) and the replication group provides cache uptime.
http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover-configuring.html
Active-passive failover should provide the solution you want for high availability:
Active-passive failover: Use this failover configuration when you want a primary group of resources to be available the majority of the time and you want a secondary group of resources to be on standby in case all of the primary resources become unavailable. When responding to queries, Amazon Route 53 includes only the healthy primary resources. If all of the primary resources are unhealthy, Amazon Route 53 begins to include only the healthy secondary resources in response to DNS queries.
After you set up the DNS, point it to the ElastiCache Redis failover group's URL and add multiple groups for higher availability during a failover operation.
However, you might need to set up your application to write and read from different endpoints to maximize the architecture's scalability (see the sketch after the sources below).
Sources:
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Replication.html
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/AutoFailover.html
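To illustrate the point about separate endpoints: an ElastiCache replication group exposes a primary endpoint for writes and a reader endpoint for reads. A minimal sketch with redis-py, where both hostnames are hypothetical placeholders:

```python
import redis

# Hypothetical ElastiCache replication group endpoints.
primary = redis.Redis(host="mygroup.aaaaaa.ng.0001.use1.cache.amazonaws.com", port=6379)
replica = redis.Redis(host="mygroup-ro.aaaaaa.ng.0001.use1.cache.amazonaws.com", port=6379)

primary.set("event:123", "clicked")   # writes go to the primary
print(replica.get("event:123"))       # reads can be served by replicas
```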
Placing a pair of independent Redis nodes behind a load balancer is likely not what you want. ELB will try to balance connections across the two instances, sending half to one and half to the other. This means that commands issued over one connection may not be seen on another, and no data is shared. So client A could publish a message, and client B, being subscribed to the other server, won't see the message.
For PUBSUB behind an ELB you have a secondary problem: ELB will close idle connections. So if you subscribe to a channel that isn't busy, the ELB will close your connection. As I recall, the maximum idle timeout you can set is 60s, meaning that if a message isn't published at least every minute, your clients will be disconnected.
How much of a problem that is depends on your client library; frankly, in my experience most don't handle it well, in that they are unaware of the need to re-subscribe upon re-establishing the connection, meaning you would have to code that yourself (see the sketch below).
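A sketch of the kind of reconnect-and-resubscribe loop you would have to write yourself, using redis-py; the host and channel names are placeholders:

```python
import time

import redis


def listen_forever(host: str, channel: str) -> None:
    """Keep a subscription alive across dropped connections by
    reconnecting and re-subscribing, which many clients won't do."""
    while True:
        try:
            r = redis.Redis(host=host, port=6379)
            p = r.pubsub()
            p.subscribe(channel)
            for message in p.listen():
                if message["type"] == "message":
                    print(message["data"])
        except redis.ConnectionError:
            time.sleep(1)  # brief backoff, then reconnect + re-subscribe


listen_forever("redis.example.com", "events")  # placeholders
```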
That said, a Sentinel + Redis solution would be quite ideal if your client has proper Sentinel support. In this scenario, your client asks the Sentinels for the master to talk to, and on a connection failure it repeats this process. This would handle the setup you describe without the problems of being behind an ELB.
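A minimal sketch of that Sentinel flow with redis-py; the Sentinel hosts and master group name shown are placeholders:

```python
from redis.sentinel import Sentinel

# Hypothetical Sentinel endpoints and master group name.
sentinel = Sentinel(
    [("sentinel-1.example.com", 26379), ("sentinel-2.example.com", 26379)],
    socket_timeout=1.0,
)

# Ask the Sentinels which node is currently the master; after a
# failover, master_for() transparently resolves to the new master.
master = sentinel.master_for("mymaster", socket_timeout=1.0)
master.set("event:123", "clicked")

# Reads can go to a replica chosen by the Sentinels.
replica = sentinel.slave_for("mymaster", socket_timeout=1.0)
print(replica.get("event:123"))
```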
Assuming you are running in VPC:
did you register the EC2 instances with the ELB?
did you add the correct security group setting to the ELB (allowing inbound traffic on the Redis port, typically 6379, rather than telnet's default port 23)?
did you add an ELB listener that maps the Redis port on the ELB to the same port on the instances?
did you set sensible ELB health checks (e.g. TCP on the Redis port) so that the ELB thinks the EC2 instances are healthy?
If the ELB thinks the servers behind it are not healthy, then the ELB will not send them any traffic.