designing for failure when polling/retrying/queueing is not possible - amazon-web-services

I'm designing a website/web service to be hosted in the cloud (specifically AWS although that's mostly irrelevant), and I'm spending a lot of time thinking about "designing for failure". I want my system to seamlessly handle node failures, i.e. without any significant user impact or engineer intervention.
In most cases, it's easy to see how to handle sudden node failure. If my app has an API handled by 4 servers behind a load balancer, polled by AJAX or an iPhone app, the poller can simply detect the failed TCP/IP transmission and retry... assuming the load balancer behaves correctly, it will hit a healthy instance.
If the app is more processing-oriented, a queue service like SQS can be used to allow stateless nodes to pick up where the failed nodes left off.
The difficulty I see is with "points of entry", where no retry/polling is possible because the application hasn't been loaded yet, and where a failure means the app never starts. For example, the index.html on a webpage... if a node fails while transmitting that file, the user's browser will likely hang and not automatically retry (they will need to refresh).
The Load Balancer is also a single "point of entry/failure". However, in this case it appears we can solve the problem by creating multiple Load Balancers, and load balancing them using DNS Load Balancing as described here: http://blog.rightscale.com/2012/10/23/dns-load-balancing-and-using-multiple-load-balancers-in-the-cloud/
Is this a solution that would work for the simpler index.html case? Overall, how can we create redundancy where polling/retrying/queuing is not possible?
EDIT: Another idea is to have the index.html hosted statically on a CDN, S3, etc (where resource availability is more dependable), although that prevents using dynamic content. Dynamic content could be added if the page populates itself using JS, but that adds a dependency on JS as well as latency for the user.

Related

Webpage resource request stalled for nearly a minute in Chrome

A resource on my webapp takes nearly a minute to load after a long stall. This happens consistently. As shown below, only 3 requests on this page actually hit the server itself, the rest hit the memory or disk cache. This problem only seems to occur on Chrome, both Safari and Firefox do not exhibit this behavior.
I have implemented the Cache-Control: no-store suggestion in this SO question but the problem persists. request stalled for a long time occasionally in chrome
Also included below is an example of what the response looks like once it finally does come in.
My app is hosted in AWS behind a Network Load Balancer which proxies to an EC2 instance running nginx and the app itself.
Any ideas what is causing this?
I encountered the exact same problem. We are using Elastic Beanstalk with Network Load Balancer (NLB) with TLS termination at NLB.
The feedback I got from AWS support is:
This problem can occur when a client connects to a TLS listener on a Network Load Balancer and does not send data immediately after completing the TLS handshake. The root cause is an edge case in the handling of new connections. Note that this only occurs if the Target Group for the TLS listener is configured to use the TCP protocol without Proxy Protocol v2 enabled
They are working on a fix for this issue now.
Somehow this problem can only be noticed when you are using Chrome browser.
In the meantime, you have these 2 options as workaround:
enable Proxy Protocol v2 on the Target Group OR
configure the Target Group to use TLS protocol for routing traffic to the targets
I know it's a late answer but I write it for someone seeking a solution.
TL;DR: In my case, enabling cross-zone load balancing attribute of NLB solved the problem.
With investigation using WireShark I figured out there were two different IPv4 addresses Chrome communicated with.
Sending packets to one of them always succeeded and to the other always failed.
Actually the two addresses delegated two Availability Zones.
By default, cross-zone load balancing is disabled if you choose NLB (on the contrary the same attribute of ALB is enabled by default).
Let's say there are two AZs; AZ-1 / AZ-2.
When you attach both AZs to a NLB, it has a node for each AZ.
The node belongs to AZ-1 just routes traffic to instances which also belong to AZ-1. AZ-2 instances are ignored.
My modest app (hosted on Fargate) has just one app server (ECS task) in AZ-2 so that the NLB node in AZ-1 cannot route traffic to anywhere.
I'm not familiar with TCP/IP or Browser implementation but in my understanding, your browser somehow selects the actual ip address after DNS lookup.
If the AZ-2 node is selected in the above case then everything goes fine, but if the AZ-1 is selected your browser starts stalling.
Maybe Chrome has a random strategy to select ip while Safari or Firefox has a sticky one, so that the problem only appears on Chrome.
After enabling cross-zone load balancing the ECS task on AZ-2 is visible from the AZ-1 NLB node, and it works fine with Chrome browser too.
(Please feel free to correct my poor English. Thank you!)
I see two things that could be responsible for delays:
1) Usage of CDNs
If the resources that load slow are loaded from CDNs (Content Delivery Networks) you should try to download them to the server and deliver directly.
Especially if you use http2 this can be a remarkable gain in speed, but also with http1. I've no experience with AWS, so I don't know how things are served there by default.
It's not shown clearly in your screenshot if the resources are loaded from CDN but as it's about scripts I think that's a reasonable assumption.
2) Chrome’s resource scheduler
General description: https://blog.chromium.org/2013/04/chrome-27-beta-speedier-web-and-new.html
It's possible or even probable that this scheduler has changed since the article was published but it's at least shown in your screenshot.
I think if you optimize the page with help of the https://www.webpagetest.org and the chrome web tools you can solve any problems with the scheduler but also other problems concerning speed and perhaps other issues too. Here is the link: https://developers.google.com/web/tools/
EDIT
3) Proxy-Issue
In general it's possible that chrome has either problems or reasons to delay because of the proxy-server. Details can't be known before locking at the log-files, perhaps you've to adjust that log-files are even produced and that the log-level is enough to tell you about any problems (Level Warning or even Info).
After monitoring the chrome net-export logs, it seems as though I was running into this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=447463.
I still don't have a solution for how to fix the problem though.

Are code level changes required for moving to load-balancing environment?

My client wants to move to a ColdFusion load-balancing environment for better availability and scalability of the site. I know how to setup clusters and instances in the ColdFusion Admin. We should also use J2EE session mgmt for sticky sessions.
But I am not sure of other code level changes required while moving from a single server to load-balancing environment.
Anyone having any experience please suggest? Or any helpful links.
Skipping the session scoped issues you're bound to enjoy I'll focus on less common code level strategies.
You will have 2+ isolated application scopes. This creates challenges in synchronicity. Examine the app code for writes to the app scope. Should some condition require updating an app scoped value, that value must be reflected in all sibling application scopes.
Know that each instance will have its own onApplicationStart() & onApplicationEnd() events. Depending on what happens in the code, it could cause mischief.
Be aware of things like FuseBox (framework) when load balancing. FuseBox generates files locally that are not replicated on other server instances.
When logging, emailing errors, etc., use an instance identifier so you'll know which server you're working with.
Should your app need the originating IP address of a request, you may need to enable X-Forwarded-For HTTP headers within your load balancer. Otherwise, you could get the IP of the load balancer on every request.
Verify Identical on EACH Instance:
Security implementations
ColdFusion & Java versions
Datasources
Mappings
Virtual directories
Shared resource locations ..
CF Admin settings: site-wide error handling, etc.
CF account privileges, Important!
Consider using the ColdFusion Server Manager to assist consistification. ;)

wso2elb services mapping not working with all configured hostnames

I'm currently deploying some wso2as cluster, and am facing a strange problem with URL mapping.
I have setup two worker nodes (named was0 and was1), a manager node (named mgt) and an ELB (named elb).
The installation seems working fine, as I'm able to call URL mapped on load balancer like the following : http://was0.domain/services/... , was0.domain being mapped on the load balancer IP on the station accessing this address (outside the cluster).
When I call services on this endpoint, I'm able to load balance as I can notice that my wsdl has enpoints based on was0 and was1. The two worker nodes are pretty detected as application nodes on the ELB.
The problem I encounter however is that when I use was0 based URL, it works fine, but when I try to use the was1 one, the load balancer returns a blank page, and I don't notice any error in logs. I have both hosts was1 and was0 defined in my cluster configuration as application members for AS.
If I try from the ELB node to access the was1 based webservice directly on the WAS, I'm able to access it without problem (so the service is working on was1 node, and this node is also detected and registered inside the cluster, but not accessible through cluster).
Finally, this results in one call working when round robin targets was0, and one call not working when targetting was1.
So I'm currently wondering if I understood well the cluster behavior: should it work for both application servers mapped URL, or is it normal to have only the first was0 responding with success? How could I force generated WSDL to return a valid endpoint URL?
What I understood by reading documentation is that I need mapping WAS URLs on the ELB, and that this one will then balance on all WAS servers, but it doesn't seem working like that.
Please tell me if you need some configuration part, diagram or example, I didn't paste it here because it's quite big :)
For information, I had the same problem when balancing through 2 wso2esb worker nodes, but was able to solve it by forcing WSDL URLs prefix by the first node URL (esb0) with the WSDLEPRPrefix in ESB configuration. As I don't have a such setting in wso2as, I don't know how accessing the URL returned in WSDL.
Thank you by advance for your help,
BOUCNIAUX Benjamin

Web service provider routing

I am looking to implement a service (web/windows, .net) that maintains a list of available services and can provide an endpoint based on the nature or type of request. The requester can then pass the actual work request to the provided endpoint. The actual work requests can contain very large chunks (from 10MB up to and possibly exceeding a GB) of data.
WCF routing services sounds like a perfect fit, but turns out not to be because the it requires the actual work request to pass through it, creating a bottleneck at the routing service (the whole point is to get a system to be able to scale out). If I had smaller messages, WCF routing would be a no brainer.
Is there anything out there that fits the bill? Preferably .NET/windows based?
Do you mean because the requests block for work?
Do could use OneWay OperationContract to create async services so as to not block the request pool.
[ServiceContract]
interface IMyContract
{
[OperationContract(IsOneWay = true)]
void DoWork()
}
Update
I think understand your question better now, you are looking to distribute load to different servers to avoid request bottle necks due to heavy traffic load (preferably distributed based on content).
I'd say that MVC Routing is indeed ideal for this. One of the features that you can leverage is the fall over functionality. You can actually define multiple backup endpoints, and in the case where one fails, it will automatically move over to the next. There's a good introduction to how this works here.
There's also a good article here that talks about load balancing with WCF using the same principles. It provides 2 solutions for a round robin filter implementation that allows you to load balance the service requests (even though at the begin he says his general answer to whether it supports load balancing is no for implementation reasons).
If you are worried about all requests routing via the one server and still becoming a bottle neck, then think of web load balancers. It's the same scenario. Sitting in the middle forwarding packets doesn't require much work, and they have no problem handling huge volumes of traffic. I don't think this is an issue IMO.

High number of persistent connections

I'm setting up a project and one of the main questions is how to implement a simple message queueing system (something along the line of a messenger chat system). I would like to avoid polling, but there will most likely be a lot of concurrent connections (tens of thousands). These will be HTTP+SSL connections, started from an application not a browser.
One solution I found would be DNS Load Balancing: distribute these persistent connections across a bunch of nginx webservers.
What do you think? Any other possible solutions?
For load balancing, keeping the application server stateless will open up the field significantly. Once you've got that, you're free to use almost any generic load balancer. From something protocol specific like HTTP load balancers to the generic TCP level load balancers.
Keep it stateless, the rest will be trivial in comparison.
If you are planning on using web services (XML message passing ), you can use gsoap, which has an included web server sample application, which uses thread pools. I've run a server using this and mysql ( for persistent state ). I agree with Ryan on reducing/eliminating the statefulness of the application.
DNS load balancing will allow you to distribute queries between multiple IP addresses, which could be multiple servers. Keep in mind that your clients could get different servers from one request to another, so your applicaiton can't use local state management. Your applicaiton will have to store its state in a centralized location such as a database.
Have you considered peer-to-peer? The state of the art in punching through firewalls is actually very effective especially since you're running your own client software in each instance, and you have servers to start the connection.
More work, but significantly less server resources.
Also, write your own server software - make sure it can handle a lot of connections and is extraordinarily lightweight and you should be able to handle thousands of connections per server before you do load balancing.
-Adam