AWS and ELB network throughput limits

My site runs on AWS behind an ELB.
I regularly see 2K concurrent users, and during these times requests through my stack become slow, taking 30-50 seconds to return a response.
None of my servers or my database show significant load during these periods, which leads me to believe the issue could be related to the ELB.
I have added some images of a busy day on my site, showing graphs of my main ELB. Can you perhaps spot something that would give me insight into my problem?
Thanks!
UPDATE
The ELB in the screengrabs is my main ELB, forwarding to multiple Varnish cache servers. In my Varnish VCL I was sending misses for a couple of URLs, but Varnish has a queueing behavior for cache misses. What I ended up doing was setting a high TTL for these requests and returning hit_for_pass for them. This lets Varnish know in vcl_recv that these requests should be passed to the backend immediately. Since doing this, the problem outlined above has been completely fixed.
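For reference, a minimal sketch of that fix, assuming Varnish 3 syntax; the URL pattern is hypothetical:

sub vcl_fetch {
    # Hypothetical pattern matching the URLs that must always go to the backend
    if (req.url ~ "^/api/uncacheable") {
        # Cache the "pass" decision itself with a high TTL, so later requests
        # for these URLs go straight to the backend instead of queueing
        set beresp.ttl = 24h;
        return (hit_for_pass);
    }
}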

Did you SSH into one of the servers? Maybe you are hitting a connection limit in Apache or whatever server you run. Also check the CloudWatch metrics of the EBS volumes attached to your instances; maybe they are causing an I/O bottleneck.

Related

How do I block calls to a specific endpoint in EC2?

I have an open port for a server I am hosting, and I get lots of spurious calls to "/ws/v1/cluster/apps/new-application", which seems to come from some Hadoop botnet (all it does is pollute my logs with lots of invalid URL errors). How do I block calls to this URL? I could change my port to a less common one, but I would prefer not to.
The only way to "block" such requests from reaching your server would be to put AWS Web Application Firewall (AWS WAF) in front of it and configure appropriate rules.
AWS WAF only works in conjunction with Amazon CloudFront or an Elastic Load Balancer, so the extra effort (and expense) might not be worth the benefit of simply avoiding some lines in a log file.
One day I took a look at my home router's logs and I was utterly amazed to see the huge amount of bot attempts to gain access to random systems. You should be thankful if this is the only one getting through to your server!

How to know if an ELB is handling a high load?

We are experiencing a high traffic load, and the great majority of requests are failing. We have added more instances behind our ELB, but that does not solve the issue. Our DB is at 50% CPU usage. Is there a way to test whether the ELB is the one that cannot handle the load?
ELB scales automatically, so it almost never becomes the bottleneck. You can start with the CloudWatch metrics, but it would be good to also check VPC Flow Logs, not only the HTTP response codes from the ELB.
Also, ALB has an access log feature that needs to be turned on manually.
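For example, a sketch of enabling ALB access logs with the AWS CLI; the load balancer ARN and the S3 bucket (which must already exist and allow ELB to write to it) are placeholders:

aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
    --attributes Key=access_logs.s3.enabled,Value=true Key=access_logs.s3.bucket,Value=my-alb-logs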

What is the major benefit of Active-Active in AWS routing

I came across the so-called Active-Active and Active-Passive routing, diagrammed as below.
For the latter, Active-Passive:
It is easy to understand: the Passive one (HTTP Server 2) is the standby service/instance that the Active one (HTTP Server 1) fails over to.
For the former, Active-Active:
I don't understand what the major benefit is, though. It seems to me both services/instances must be up and running at the same level, and the routing is maybe just something like round robin. Wouldn't this be a waste of resources/cost? Does it introduce extra computing power? What is the use case for it?
In active-passive mode one web server is sitting there costing you money but not serving any requests. If a sudden surge in traffic came in, the extra web server would not be able to help absorb the extra load. The only time the second web server starts being used is when the first web server crashes and can no longer serve requests. This gives you failover in the event of a server crash, but does not help you at all in the event of a sudden surge in traffic.
In active-active mode each web server is serving some of the traffic. In order to scale out your web servers (horizontal scaling) you would have two or more servers, all in "active" mode, serving some portion of the web requests. If a sudden surge in traffic comes in, that surge is spread across multiple servers, which can hopefully absorb the load, and new servers can be added automatically by AWS as needed and removed when no longer needed.
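As a sketch of how the active-active pattern maps onto AWS, an Auto Scaling group can keep a pool of "active" servers behind an ELB; all names here are hypothetical and a launch configuration is assumed to exist:

aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name web-asg \
    --launch-configuration-name web-lc \
    --min-size 2 --max-size 6 --desired-capacity 2 \
    --availability-zones us-east-1a us-east-1b \
    --load-balancer-names my-elb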

Consistency between two Varnish servers behind AWS ELB

We are using ELB to load balance requests between two different Nginx+Varnish servers in two different AZs. These Varnish servers have been configured to balance requests to another ELB distributing requests to our app servers. In this way, we should be able to keep the site working if one AZ stops working.
The issue we are facing with this approach is that we don't know how to keep the site from serving different cached objects to the same client, i.e. keeping the consistency of the cached content between the two Varnish servers.
One possible solution would be using ELB's IP hashing so that depending on the client IP one Varnish or the other would serve the request. This would mitigate the problem somewhat.
Is there any other way to sync the contents between these two Varnish servers?
There is no active state synchronization available in Varnish.
You can do this with a helper process that tails varnishlog and calls out to the other n Varnish servers, but this is brittle and will probably break on you. The common approach is just to do round-robin and have enough traffic that everything is cached where it needs to be. :)
There is some underlying knowledge of how your application behaves baked into your question, but not many details. Why is it a problem that a different backend has produced the response? If the responses are identical (since you want redundancy, I'd expect that they are?), this shouldn't be an issue.
If the backend replies with user-specific response data for some URLs, it should tell Varnish that with the Vary header.
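For example, a backend response for a user-specific URL might look like this (a hypothetical response, assuming the user is identified by a cookie), which makes Varnish cache a separate object per Cookie value:

HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: max-age=120
Vary: Cookie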
Adding session stickiness (~IP hashing) in ELB will just hide your problem until one of the AZs goes down and traffic is rerouted, at which point I'd guess you are pretty busy already.
You can enable ELB stickiness to achieve what you need; there is no Varnish cluster feature that shares state between Varnish instances.

How can I make an ELB forward requests from a single client to multiple nodes?

I'm currently running a REST API app on two EC2 nodes under a single load balancer. Rather than the standard load-balancing scenario of small amounts of traffic coming from many IPs, I get huge amounts of traffic from only a few IPs. Therefore, I'd like requests from each individual IP to be spread among all available nodes.
Even with session stickiness turned off, however, this doesn't appear to be the case. Looking at my logs, almost all requests are going to one server, with my smallest client going to the secondary node. This is detrimental, as requests to my service can last up to 30 seconds, and losing that primary node would mean a disproportionate number of requests get killed.
How can I instruct my ELB to round-robin for each client's individual requests?
You cannot. ELB uses a non-configurable round-robin algorithm. What you can do to mitigate (though not solve) this problem is to add additional servers to your ELB and/or to make the health check requests initiated by your ELB more frequent.
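For instance, with the classic ELB API the health check interval can be shortened like this (a sketch; the load balancer name and target path are placeholders):

aws elb configure-health-check \
    --load-balancer-name my-elb \
    --health-check Target=HTTP:80/health,Interval=10,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2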
I understand where you're coming from. However, I think you should approach the problem from a different angle. Your problem, it appears, isn't specifically related to the fact that the load is not balanced. Let's say you do get this balancing problem solved: you're still going to lose a large number of requests. I don't know how your clients connect to your services, so I can't go into details on how you might fix the problem, but you may want to look at improving the code to be more robust and to plan for the connection getting dropped. No service that has connections of 30+ seconds should rely on the connection not getting dropped. Back in the days of TCP/UDP sockets there was a lot more work done on building for failures; somehow that's gotten lost in today's HTTP world.
What I'm trying to say is: if you write the code your clients are using to connect, build the code to be more robust and handle failures with retries. Once you start performing retries you'll need to make sure that your API calls are atomic and use transactions where necessary.
Lastly, I'll answer your original question. Amazon's ELBs are round robin, even from the same computer / IP address. If your clients are always connecting to the same server, it's most likely the browser or code that is caching the DNS response. If they're not directly accessing your REST API from a browser, most languages allow you to get a list of IPs for a given host name. Those IPs will be the IPs of the load balancers, and you can just shuffle the list and then use the top entry each time. For example, you could use the following PHP code to randomly send requests to a different load balancer.
function getHostByName($domain) {
    // Resolve all IPv4 addresses (A records) behind the host name
    $ips = gethostbynamel($domain);
    if ($ips === false) {
        return null; // DNS lookup failed
    }
    // Shuffle so each call picks a random load balancer IP
    shuffle($ips);
    return $ips[0];
}
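A hypothetical usage, resolving the ELB's DNS name fresh for each request (the host name and path are placeholders):

// Pick a random load balancer IP for this request
$ip = getHostByName('my-elb-123456.us-east-1.elb.amazonaws.com');
$response = file_get_contents('http://' . $ip . '/api/resource');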
I have had similar issues with Amazon ELB; however, for me it turned out that the HTTP client used Connection: keep-alive. In other words, the requests from the same client were served over the same connection, and for that reason they did not switch between the servers.
I don't know which server you use, but it is probably possible to turn off keep-alive, forcing the client to make a new connection for every request. This might be a good solution for requests with a lot of data. If you have a large number of requests with small payloads, it might affect performance negatively.
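For example, if the backend happens to be Apache, keep-alive can be switched off in the server configuration:

# Force clients to open a new connection for every request
KeepAlive Off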
This may happen when you have the two instances in different availability zones.
When one ELB is working with multiple instances in a single availability zone, it will round-robin the requests between the instances.
When the two instances are in two different availability zones, ELB creates two load balancer nodes (ELB servers), each with its own IP, and balances the load between them using DNS.
When your client asks DNS for the IP address of your server, it receives two (or more) answers. The client then chooses one IP and caches it (the OS usually does this). There is not much you can do about this unless you control the clients.
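You can see this behavior by resolving the ELB's DNS name yourself (the host name and addresses below are hypothetical):

$ dig +short my-elb-123456.us-east-1.elb.amazonaws.com
203.0.113.10
203.0.113.25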
If your problem is that the two instances are in different availability zones, the solution might be to have at least two instances in each availability zone. Then a single ELB server will handle the round-robin across the two servers in its zone and will have just one IP, so when a backend server fails it will be transparent to the clients.
PS: Another case where ELB creates more servers with unique IPs is when you have a lot of instances in a single availability zone and a single ELB server can't handle all the load. Then again, a new ELB server is created to take on the extra instances, and the load is distributed using DNS and multiple IPs.