AWS ALB catastrophic failure - amazon-web-services

First, the background:
Yesterday our AWS-based business in US West 2, consisting of two auto-scale groups (and various other components like RDS further back) behind an ALB went offline for six hours. Service was only reinstated by building an entirely new ALB (migrating over the rules and target groups).
At 4:15am our local time (GMT+10) the ALB ceased to receive inbound traffic and would not respond to web traffic. We used it for port 80 and port 443 (with SSL cert) termination. At the same time, all target group instances were also marked as "Unhealthy" (although they most certainly were operable) and no traffic was forwarded on to them. DNS resolved correctly to the ALB. It simply stopped responding. Equivalent symptoms to a network router/switch being either switched off or firewalled out of existence.
Our other EC2 servers that were not behind the ALB continued to operate.
Initial thoughts were:
a) deliberate isolation by AWS? Bill not paid, some offence taken at an abuse report? Unlikely and AWS had not notified us of any transgression or reason to take action.
b) A mistake on our part in network configuration? No change had been made in days to NACL or security groups. Further we were sound asleep when it happened, nobody was fiddling with settings. When we built the replacement ALB we used the same NACL and security groups without problem.
c) Maintenance activity gone wrong? This seems most likely. But AWS appeared not to detect the failure. And we didn't pick it up because we considered a complete, inexplicable, and undetected failure of an ALB as "unlikely". We will need to put in place some external healthchecks of our own. We have some based upon Nagios so can enable alarming. But this doesn't help if an ALB is unstable - it is not practical to keep having to build a new one if this reoccurs.
The biggest concern is that this happened suddenly and unexpectedly and that AWS did not detect this. Normally we are never worried about AWS network infrastructure as "it just works". Until now. There's no user-serviceable options for an ALB (eg restart/refresh).
And now my actual question:
Has anyone else ever seen something like this? If so, what can be done to get service back faster or prevent it in the first place? If this happened to you what did you do?

I'm going to close this off.
It happened again the following Sunday, and again this evening. Exact same symptoms. Restoration was initially achieved by creating a new ALB and migrating rules and target groups over. Curiously, the previous ALB was observed to be operational again but when we tried to reinstate it then it failed again.
Creating new ELBs is no longer a workaround and we've switched to an AWS business support to get direct help from AWS.
Our best hypotheses is this: AWS have changed something in their maintenance process and the ALB (which is really just a collection of EC2 instances with some AWS "proprietary code") is failing but it's really just wild speculation.

Related

How i can configure Google Cloud Platform with Cloudflare-Only?

I recently start using GCP but i have one thing i can't solve.
I have: 1 VM + 1 DB Instance + 1 LB. DB instance allow only conections from the VM IP. bUT THE VM IP allow traffic from all ip (if i configure the firewall to only allow CloudFlare and LB IP's the website crash and refuse conections).
Recently i was under attack, i activate the Cloudflare ddos mode, restart all and in like 6 h the attack come back with the Cloudflare activate. Wen i see mysql conections bump from 20-30 to 254 and all conections are from the IP of the VM so i think the problem are the public accesibility of the VM but i don't know how to solved it...
If i activate my firewall rules to only allow traffic from LB and Cloudflare the web refuses all conections..
Any idea what i can do?
Thanks.
Cloud Support here, unfortunately, we do not have visibility into what is installed on your instance or what software caused the issue.
Generally speaking you're responsible for investigating the source of the vulnerability and taking steps to mitigate it.
I'm writing here some hints that will help you:
Make sure you keep your firewall rules in a sensible manner, e.g. is not a good practice to have a firewall rule to allow all ingress connections on port 22 from all source IPs for obvious reasons.
Since you've already been rooted, change all your passwords: within the Cloud SQL instance, within the GCE instance, even within the GCP project.
It's also a good idea to check who has access to your service accounts, just in case people that aren't currently working for you or your company still have access to them.
If you're using certificates revoke them, generate new ones and share them in a secure way and with the minimum required number of users.
Securing GCE instances is a shared responsability, in general, OWASP hardening guides are really good.
I'm quoting some info here from another StackOverflow thread that might be useful in your case:
General security advice for Google Cloud Platform instances:
Set user permissions at project level.
Connect securely to your instance.
Ensure the project firewall is not open to everyone on the internet.
Use a strong password and store passwords securely.
Ensure that all software is up to date.
Monitor project usage closely via the monitoring API to identify abnormal project usage.
To diagnose trouble with GCE instances, serial port output from the instance can be useful.
You can check the serial port output by clicking on the instance name
and then on "Serial port 1 (console)". Note that this logs are wipped
when instances are shutdown & rebooted, and the log is not visible
when the instance is not started.
Stackdriver monitoring is also helpful to provide an audit trail to
diagnose problems.
You can use the Stackdriver Monitoring Console to set up alerting policies matching given conditions (under which a service is considered unhealthy) that can be set up to trigger email/SMS notifications.
This quickstart for Google Compute Engine instances can be completed in ~10 minutes and shows the convenience of monitoring instances.
Here are some hints you can check on keeping GCP projects secure.

aws ELB vs. ALB Review: Performance,Resiliancy

It has been almost a year since aws deployed the ALB.
There seems to be little to none technical comparison between the classic LB and the ALB.
The only thing that my research yielded is: aws forum question
While the above implies that ALB is not reliable it is only one source.
I am interested in knowing the experience of people who did the switch between ELB and ALB, mainly in regards to latency, resilience, HA.
It seems to me that Layer4 balancing is more robust than Layer7 and therefore the general performance would be better.
I would like to provide my experience here, but I am not going into very detail. Basically my company has over 200k devices on the field actively connecting our backend via websocket. We were using ELB in front our servers and it was working quite well, all the connections were quite stable. We tried to switch ELB to ALB before, but the result wasn't really good and we switched back.
The main problem we found the connections were dropped suddenly (with a pattern like every 15 hrs?). Please check the graph of number of connections I captured from the monitoring system:
As you can see there are some dips in the graph which showing the connections have be dropped. I keep the same server setup, only factor I have changed is from ELB to ALB and got this result.

Using Redis behing AWS load balancer

We're using Redis to collect events from our web application (pub/sub based) behind AWS ELB.
We're looking for a solution that will allow us to scale-up and high-availability for the different servers. We do not wish to have these two servers in a Redis cluster, our plan is to monitor them using cloudwatch and switch between them if necessary.
We tried a simple test of locating two Redis server behind the ELB, telnetting the ELB DNS and see what happens using 'redis-cli monitor', but we don't see nothing. (when trying the same without the ELB it seems fine)
any suggestions?
thanks
I came across this while looking for a similar question, but disagree with the accepted answer. Even though this is pretty old, hopefully it will help someone in the future.
It's more appropriate for your question here to use DNS failover with a Redis Replication Auto-Failover configuration. DNS failover provides groups of availability (if you need that level of scale) and the Replication group provides cache up time.
http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover-configuring.html
The Active-passive failover should provide the solution you're wanting with High Availability:
Active-passive failover: Use this failover configuration when you want
a primary group of resources to be available the majority of the time
and you want a secondary group of resources to be on standby in case
all of the primary resources become unavailable. When responding to
queries, Amazon Route 53 includes only the healthy primary resources.
If all of the primary resources are unhealthy, Amazon Route 53 begins
to include only the healthy secondary resources in response to DNS
queries.
After you setup the DNS, then you would point that to the Elasticache Redis failover group's URL and add multiple groups for higher availability during a failover operation.
However, you might need to setup your application to write and read from different endpoints to maximize the architecture's scalability.
Sources:
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Replication.html
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/AutoFailover.html
Placing a pair of independent redis nodes behind a LB will likely not be what you want. What will happen is ELB will try to balance connections to each instance, splitting half to one and half to another. This means that commands issued by one connection may not be seen by another. It also means no data is shared. So client a could publish a message, and client b being subscribed to the other server won't see the message.
For PUBSUB behind ELB you have a secondary problem. ELB will close an idle connection. So if you subscribe to a channel that isn't busy your ELB will close your connection. As I recall the max you can make this is 60s, meaning if you don't publish a message every single minute your clients will be disconnected.
As to how much of a problem that is depends on your client library, and frankly in my experience most don't handle it well in that they are unaware of the need to re-subscribe upon re-establishing the connection, meaning you would have to code that yourself.
That said a sentinel + redis solution would be quite ideal if your c,isn't has proper sentinel support. In this scenario. Your client asks the sentinels for the master to talk to, and on a connection failure it repeats this process. This would handle the setup you describe, without the problems of being behind an ELB.
Assuming you are running in VPC:
did you register the EC2 instances with the ELB?
did you add the correct security group setting to the ELB (allowing inbound port 23)?
did you add an ELB listener that maps port 23 on the ELB to port 23 on the instances?
did you set sensible ELB health checks (e.g. TCP on port 23) so that ELB thinks the EC2 instances are healthy?
If the ELB thinks the servers behind it are not healthy then ELB will not send them any traffic.

Load balancer for php application

Questions about load balancers if you have time.
So I've been using AWS for some time now. Super basic instances, using them to do some tasks whenever I needed something done.
I have a task that needs to be load balanced now. It's not a public service though. It's pretty much a giant cron job that I don't want running on the same servers as my website.
I set up an AWS load balancer, but it doesn't do what I expected it to do.
It get's stuck on one server, and doesn't load balance at all. I've read why it does this, and that's all fine and well, but I need it to be a serious round-robin load balancer.
edit:
I've set up the instances on different zones, but no matter how many instances I add to the ELB, it just uses one. If I take that instance down, it switches to a different one, so I know it's working. But I really would like it to always use a different one under every circumstance.
I know there are alternatives. Here's my question(s):
Would a custom php load balancer be an ok option for now?
IE: Have a list of servers, and have php randomly select a ec2 instance. Wouldn't be scalable at all, bu atleast I could set this up in 2 mins and it can work for now.
or
Should I take the time to learn how HAProxy works, and set that up in place of the AWS ELB?
or
Am I doing it wrong, and AWS's ELB does do round-robin. I just have something configured wrong?
edit:
Structure:
1) Web server finds a task to do.
2) If it's too large it sends it off to AWS (to load balancer).
3) Do the job on EC2
4) Report back via curl to an API
5) Rinse and repeat
Everything works great. But because the connection always comes from my server (one IP) it get's sticky'd to a single EC2 machine.
ELB works well for sites whose loads increase gradually. If you are expecting an uncommon and sudden increase on the load, you can ask AWS to pre-warm it for you.
I can tell you I used ELB in different scenarios and it always worked well for me. As you didn't provide too much information about your architecture, I would bet that ELB works for you, and the case that all connections are hitting only one server, I would ask you:
1) Did you check the ELB to see how many instances are behind it?
2) The instances that you have behind the ELB, are all alive?
3) Are you accessing your application through the ELB DNS?
Anyway, I took an excerpt from the excellent article that does a very good comparison between ELB and HAProxy. http://harish11g.blogspot.com.br/2012/11/amazon-elb-vs-haproxy-ec2-analysis.html
ELB provides Round Robin and Session Sticky algorithms based on EC2
instance health status. HAProxy provides variety of algorithms like
Round Robin, Static-RR, Least connection, source, uri, url_param etc.
Hope this helps.
This point comes as a surprise to many users using Amazon ELB. Amazon
ELB behaves little strange when incoming traffic is originated from
Single or Specific IP ranges, it does not efficiently do round robin
and sticks the request. Amazon ELB starts favoring a single EC2 or
EC2’s in Single Availability zones alone in Multi-AZ deployments
during such conditions. For example: If you have application
A(customer company) and Application B, and Application B is deployed
inside AWS infrastructure with ELB front end. All the traffic
generated from Application A(single host) is sent to Application B in
AWS, in this case ELB of Application B will not efficiently Round
Robin the traffic to Web/App EC2 instances deployed under it. This is
because the entire incoming traffic from application A will be from a
Single Firewall/ NAT or Specific IP range servers and ELB will start
unevenly sticking the requests to Single EC2 or EC2’s in Single AZ.
Note: Users encounter this usually during load test, so it is ideal to
load test AWS Infra from multiple distributed agents.
More info at the Point 9 in the following article http://harish11g.blogspot.in/2012/07/aws-elastic-load-balancing-elb-amazon.html
HAProxy is not hard to learn and is tremendously lightweight yet flexible. I actually use HAProxy behind ELB for the best of both worlds -- the hardened, managed, hands-off reliability of ELB facing the Internet and unwrapping SSL, and the flexible configuration of HAProxy to allow me to fine tune how things hit my servers. I've never lost an HAProxy instance yet, but it I do, ELB will just take that one out of rotation... as I have seen happen when the back-end servers have all become inaccessible, which (because of the way it's configured) makes ELB think the HAProxy is unhealthy, but that's by design in my setup.

Possible solution to ELB lacking A record support?

Hey guys I was wondering if this seems like a viable solution to the age old problem of Amazon Elastic Load Balancer's lacking a dedicated IP, and thus A record support.
What if I created a micro/small instance and hooked it to an elastic IP. I can then use that IP as my A record address for my website. That instance will forward 100% of its traffic to an ELB load balancer address (Haproxy?), which will then operate normally and forward that traffic to my server pool.
With this architecture I can use my A-record and an ELB.
Are there any downsides to this aside from the cost of the initial instance that forwards its traffic to the ELB?
Will this double forwarding create too much lag or is it really negligible since they're all in AWS?
Thanks for feedback.
If you are currently using Route53 for you DNS, it does have support for handling zone apex.
https://forums.aws.amazon.com/message.jspa?messageID=260459
Not sure if this answers your question since you didn't mention why you need a dedicated ip.
Are there any downsides to this aside from the cost of the initial instance that forwards its traffic to the ELB?
Er, yes. You're loosing about 99.9% of the benefits of ELB.
Will this double forwarding create too much lag or is it really negligible since they're all in AWS?
No, the lag should be small (sub-milisecond). The two main problems are:
1) Your instance will become a bottleneck when your traffic increases. You won't be able to survive a sudden rush, such as being linked from a high-traffic website like Slashdot or Oprah.
The whole point of ELB is that they can manage scaling (the frontend and the backend) for you. If you insert a single box in the flow, it kinda prevents ELB from doing anything useful.
Also, the micro instance can take very little traffic. You have to go to at least a m1.large if you won't want your network packets throttled.
2) Your instance will become a Single Point Of Failure. When your box dies, your website will be down. ELB can prevent problems on both the front and backend with redundancy.
Perhaps if you explained why you needed an A record?
(It is also possible to run your own front-end(s): Just create a box with an EIP, and put nginx and/or HAProxy on it. But as with everything, there are trade-offs.)