Webpage resource request stalled for nearly a minute in Chrome - amazon-web-services

A resource on my web app takes nearly a minute to load after a long stall, and this happens consistently. As shown below, only 3 requests on this page actually hit the server itself; the rest are served from the memory or disk cache. The problem only seems to occur in Chrome; Safari and Firefox do not exhibit this behavior.
I have implemented the Cache-Control: no-store suggestion from this SO question, but the problem persists: request stalled for a long time occasionally in chrome
Also included below is an example of what the response looks like once it finally does come in.
My app is hosted in AWS behind a Network Load Balancer which proxies to an EC2 instance running nginx and the app itself.
Any ideas what is causing this?

I encountered the exact same problem. We are using Elastic Beanstalk with a Network Load Balancer (NLB) and TLS termination at the NLB.
The feedback I got from AWS support is:
This problem can occur when a client connects to a TLS listener on a Network Load Balancer and does not send data immediately after completing the TLS handshake. The root cause is an edge case in the handling of new connections. Note that this only occurs if the Target Group for the TLS listener is configured to use the TCP protocol without Proxy Protocol v2 enabled
They are working on a fix for this issue now.
Somehow, the problem is only noticeable when you are using the Chrome browser.
In the meantime, you have these two options as a workaround (a sketch of the first follows the list):
enable Proxy Protocol v2 on the Target Group, OR
configure the Target Group to use the TLS protocol for routing traffic to the targets
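For example, the first workaround can be applied with the AWS SDK for JavaScript (v3) roughly like this (the target group ARN is a placeholder, and note that if you enable Proxy Protocol v2 your targets, e.g. nginx, must be configured to accept the PROXY protocol header):

import {
  ElasticLoadBalancingV2Client,
  ModifyTargetGroupAttributesCommand,
} from "@aws-sdk/client-elastic-load-balancing-v2";

// Placeholder ARN -- substitute your own TCP target group.
const targetGroupArn =
  "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tcp-tg/0123456789abcdef";

async function enableProxyProtocolV2(): Promise<void> {
  const client = new ElasticLoadBalancingV2Client({ region: "us-east-1" });
  // Enable Proxy Protocol v2 on the TCP target group behind the TLS listener.
  await client.send(
    new ModifyTargetGroupAttributesCommand({
      TargetGroupArn: targetGroupArn,
      Attributes: [{ Key: "proxy_protocol_v2.enabled", Value: "true" }],
    })
  );
}

enableProxyProtocolV2().catch(console.error);

If adding PROXY protocol support to the targets is inconvenient, the second option (a TLS target group) avoids it, at the cost of re-encrypting traffic between the NLB and the targets.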

I know this is a late answer, but I'm writing it for anyone still looking for a solution.
TL;DR: In my case, enabling the cross-zone load balancing attribute on the NLB solved the problem.
Investigating with Wireshark, I found that Chrome was communicating with two different IPv4 addresses.
Sending packets to one of them always succeeded; sending to the other always failed.
The two addresses corresponded to two different Availability Zones.
By default, cross-zone load balancing is disabled when you choose an NLB (whereas the same attribute on an ALB is enabled by default).
Let's say there are two AZs: AZ-1 and AZ-2.
When you attach both AZs to an NLB, it gets a node in each AZ.
With cross-zone load balancing disabled, the node in AZ-1 only routes traffic to instances that are also in AZ-1; instances in AZ-2 are ignored.
My modest app (hosted on Fargate) has just one app server (ECS task), in AZ-2, so the NLB node in AZ-1 had nowhere to route traffic.
I'm not familiar with the details of TCP/IP or browser implementations, but as I understand it, the browser picks one of the resolved IP addresses after the DNS lookup.
If the AZ-2 node is selected, everything goes fine, but if the AZ-1 node is selected, the browser starts stalling.
Maybe Chrome picks an IP at random while Safari and Firefox stick to one, which would explain why the problem only appears in Chrome.
After enabling cross-zone load balancing, the ECS task in AZ-2 became visible to the AZ-1 NLB node, and it works fine in Chrome as well.
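For reference, flipping that attribute can also be done with the AWS SDK for JavaScript (v3); a minimal sketch, with a placeholder ARN:

import {
  ElasticLoadBalancingV2Client,
  ModifyLoadBalancerAttributesCommand,
} from "@aws-sdk/client-elastic-load-balancing-v2";

// Placeholder ARN -- substitute your own NLB.
const loadBalancerArn =
  "arn:aws:elasticloadbalancing:ap-northeast-1:123456789012:loadbalancer/net/my-nlb/0123456789abcdef";

async function enableCrossZone(): Promise<void> {
  const client = new ElasticLoadBalancingV2Client({ region: "ap-northeast-1" });
  // With this attribute on, every NLB node can route to targets in every attached AZ.
  await client.send(
    new ModifyLoadBalancerAttributesCommand({
      LoadBalancerArn: loadBalancerArn,
      Attributes: [{ Key: "load_balancing.cross_zone.enabled", Value: "true" }],
    })
  );
}

enableCrossZone().catch(console.error);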
(Please feel free to correct my poor English. Thank you!)

I see two things that could be responsible for delays:
1) Usage of CDNs
If the slow-loading resources come from CDNs (Content Delivery Networks), try downloading them to your own server and serving them directly.
Especially with HTTP/2 this can be a remarkable gain in speed, but it helps with HTTP/1 as well. I have no experience with AWS, so I don't know how things are served there by default.
Your screenshot doesn't clearly show whether the resources are loaded from a CDN, but since they are scripts I think that's a reasonable assumption.
2) Chrome’s resource scheduler
General description: https://blog.chromium.org/2013/04/chrome-27-beta-speedier-web-and-new.html
It's possible, or even probable, that this scheduler has changed since the article was published, but its effect at least shows up in your screenshot.
I think that if you optimize the page with the help of https://www.webpagetest.org and the Chrome developer tools (https://developers.google.com/web/tools/), you can solve any problems with the scheduler, as well as other speed problems and perhaps other issues too.
EDIT
3) Proxy issue
In general, it's possible that Chrome has either problems with, or reasons to delay because of, the proxy server. The details can't be known without looking at the log files; you may have to make sure that log files are even produced and that the log level is high enough to tell you about any problems (level warning, or even info).

After reviewing Chrome's net-export logs, it seems as though I was running into this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=447463.
I still don't have a solution for how to fix the problem though.

Related

Traefik Best Practices/Capabilities For Dynamic Vanity Domain Certificates

I'm looking for guidance on the proper tools/tech to accomplish what I assume is a fairly common need.
Suppose there is a web service, https://www.ExampleSaasWebService.com/, and customers can add vanity domains/subdomains to white-label or resell the service under their own domain name. There needs to be a reverse proxy that terminates TLS for the vanity domains and routes the traffic to the statically defined (HTTPS) back-end service on the original, non-vanity domain (essentially one "back-end" server somewhere else on the internet, not on the local network, that accepts all incoming traffic no matter the incoming domain). Essentially:
"Customer A" could setup an A/CNAME record to VanityProxy.ExampleSaasWebService.com (the host running Traefik) from example.customerA.com.
"Customer B" could setup an A/CNAME record to VanityProxy.ExampleSaasWebService.com (the host running Traefik) from customerB.com and www.customerB.com.
etc...
I (surprisingly) haven't found anything that does this out of the box, but looking at Traefik (2.x) I see some promising capabilities, and it seems like the most capable tool for the job, primarily because of the Let's Encrypt integration and the ability to reconfigure without restarting the service.
I initially considered AWS's native certificate management and load balancing, but I see there is a limit of ~25 certificates per load balancer which seems like a non-starter. Presumably there could be thousands of vanity domains in place at any time.
Some of my Traefik specific questions:
Am I correct in understanding that you can get away without explicitly listing, in the config files, every vanity domain to produce TLS certificates for? Can they be determined on the fly and provisioned from Let's Encrypt based on the incoming request headers/SNI?
E.g. if a request comes in for www.customerZ.com and there is not yet a certificate for that domain name, can one be generated on the fly?
I found this note on the OnDemand flag in the v1.6 docs, but I'm struggling to find the equivalent documentation in the (2.x) docs.
Using AWS services, how can I easily share "state" (config/dynamic certificates that have already been created) between multiple servers to spread the load? My initial thought was EFS, but I gather an EFS shared file system may not work because of a dependency on file-change watch notifications, which don't work on NFS-mounted file systems.
It seemed like it would make sense to provision an AWS NLB (with a static IP and an associated DNS record) that delivered requests to a fleet of 1 or more of these Traefik proxies with a universal configuration/state that was safely persisted and kept in sync.
Like I mentioned above, this seems like a common/generic need. Is there a configuration file sample or project that might be a good starting point that I overlooked? I'm brand new to Traefik.
When routing requests to the back-end service, will the original host name still be identifiable somewhere in the headers? I assume it can't remain in the Host header, since the back-end receives requests on an HTTPS hostname as well.
I will continue to experiment and post any findings back here, but I'm sure someone has setup something like this already -- so just looking to not reinvent the wheel.
I managed to do this with Caddy. It's very important that you configure ask, interval, and burst to avoid possible DDoS attacks.
Here's a simple reverse proxy example:
# https://caddyserver.com/docs/caddyfile/options#on-demand-tls
{
    # General options
    debug
    on_demand_tls {
        # Caddy calls this endpoint with "?domain=<host>"; a 200 response
        # means the domain is allowed to request a TLS certificate
        ask "http://localhost:5000/ask/"
        interval 300s
        burst 1
    }
}

# TODO: use env vars for domain name? https://caddyserver.com/docs/caddyfile-tutorial#environment-variables
qrepes.app {
    reverse_proxy localhost:5000
}

:443 {
    reverse_proxy localhost:5000
    tls {
        on_demand
    }
}
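The ask endpoint itself just has to answer 200 for host names you recognize. Here is a rough sketch of one in TypeScript/Express (the allow-list is hard-coded for illustration; in practice you would look the domain up in your customer database):

import express from "express";

const app = express();

// Hypothetical allow-list; in a real deployment this would be a DB lookup.
const allowedDomains = new Set([
  "example.customerA.com",
  "customerB.com",
  "www.customerB.com",
]);

// Caddy calls GET /ask/?domain=<hostname> before requesting a certificate.
// Respond 200 to allow issuance, anything else to refuse.
app.get("/ask/", (req, res) => {
  const domain = String(req.query.domain ?? "");
  res.sendStatus(allowedDomains.has(domain) ? 200 : 404);
});

app.listen(5000, () => console.log("ask endpoint listening on :5000"));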

Intermittent 502 gateway errors with AWS ALB in front of ECS services running express / nginx

Background:
We are running a single-page application served via nginx, with a Node.js (v12.10) back end running Express. It runs as containers on ECS; currently we have three t3a.medium container instances, with the API and web services each running 6 replicas across them. We use an ALB to handle load balancing/routing of requests. We run three subnets across 3 AZs, with the load balancer associated with all three and the instances spread across the 3 AZs as well.
Problem:
We are trying to get to the root cause of some intermittent 502 errors that are appearing for both the front and back end. I have downloaded the ALB access logs, and the interesting thing about these requests is that they all show the following:
- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
At the time of these errors I can see that there were healthy targets available.
Now, I know that some people have had issues like this when keep-alive times were shorter on the server side than on the ALB side, so connections were being forcibly closed and the ALB then tried to reuse them (which is in line with the AWS troubleshooting guidelines). However, the keep-alive times for our back end are currently set to double the ALB's. Also, the requests themselves can be replayed via Chrome dev tools and they succeed (I'm not sure whether this is a valid way to check for a malformed request, but it seemed reasonable).
I am very new to this area, and if anyone has suggestions on where to look or what sort of tests to run to help me pinpoint this issue, it would be greatly appreciated. I have run some load tests on certain endpoints and reproduced the 502 errors; however, the errors under heavy load differ from the intermittent ones I have seen in our logs in that the target_processing_time is quite high, so to my mind this is another issue altogether. At this stage I would like to start by understanding the errors that show a target_processing_time of basically zero.
I wrote a blog post about this a bit over a year ago that's probably worth a look (the problem is caused by a behavior change in Node.js 8+):
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/
TL;DR: you need to set the Node.js http.Server keepAliveTimeout (which is in milliseconds) to be higher than the load balancer's idle timeout (which is in seconds).
Please also note that there is something called HTTP keep-alive, which sets an HTTP header and has absolutely nothing to do with this problem. Make sure you're setting the right thing.
Also note that there is currently a regression in Node.js where setting keepAliveTimeout may not work properly. That bug is being tracked here: https://github.com/nodejs/node/issues/27363 and is worth looking through if you're still having this problem (you may also need to set headersTimeout).
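For example, with Express the settings go on the underlying http.Server, roughly like this (assuming the default 60-second ALB idle timeout):

import express from "express";
import http from "http";

const app = express();
app.get("/health", (_req, res) => res.send("ok"));

const server = http.createServer(app);

// ALB idle timeout defaults to 60 seconds; keep Node's keep-alive timeout
// above it so the ALB, not the app, closes idle connections.
server.keepAliveTimeout = 65_000; // milliseconds
// Work around the regression tracked in nodejs/node#27363 by also raising
// headersTimeout above keepAliveTimeout.
server.headersTimeout = 66_000; // milliseconds

server.listen(3000);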

AWS, Load Balancer 504 error after a few requests

I am repeating a question that I posted at https://forums.aws.amazon.com/thread.jspa?threadID=275855&tstart=0 to reach more people.
Hi,
I am trying to deploy a REST service in AWS. The current architecture is:
Domain name (Route 53) -> Load Balancer -> Single EC2 instance (bound to an Elastic IP). And I use TLS/SSL certificate issued by a Certificate Manager.
The instance is an Ubuntu 16.04 machine, and the service is implemented with bare Vert.x (i.e., no proxy server in front).
However, a 504 error (gateway timeout) occurs after a few different requests in a series (each of which takes <1 s), and then the service stops responding; the requests no longer reach the server instance. I checked that it happens the same way whether I access the domain name or the load balancer directly, and I have confirmed that the exact same scenario works when accessing the instance's URL directly.
I spun up a dummy server returning "hello world" and it works fine with the load balancer. The problem must be caused by something inconsistent between the load balancer and the server code, but I can't figure out where to start.
I have checked several threads complaining about 504 errors and followed some of the instructions, but they did not help. In particular, I set the keep-alive option in Vert.x and set the idle time longer than the balancer's. As the delays with direct communication are not longer than the idle time, I believe that is not the problem anyway. I have also checked the Security Groups and confirmed that the right ports are open. (The first few requests work, so that must not be the problem either.)
Does anyone have a sense of where I should start looking? Even better, does anyone know the source of the problem?
Thanks in advance.
EDIT: I just found the issue in some of the code. I've answered myself below. Thanks for reading!
Found the issue in my code. Some of the APIs (implemented by my colleague...) were not flushing the buffer of the HTTP responses on the server.
In Vert.x (Java), the missing call was resp.end().
It somehow worked with direct access, probably because the buffer was flushed at some point, but that flush does not seem to have been picked up by the load balancer.
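The Node.js equivalent of this bug, in case anyone lands here from that side: a handler that writes to the response but never ends it keeps the connection open until the load balancer gives up with a 504. A rough sketch:

import http from "http";

const server = http.createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.write("hello");
  // Without the call below, the response never completes and the
  // load balancer eventually times out with a 504.
  res.end(); // always finish the response
});

server.listen(8080);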
Hope nobody experiences this, but in case...

designing for failure when polling/retrying/queueing is not possible

I'm designing a website/web service to be hosted in the cloud (specifically AWS although that's mostly irrelevant), and I'm spending a lot of time thinking about "designing for failure". I want my system to seamlessly handle node failures, i.e. without any significant user impact or engineer intervention.
In most cases, it's easy to see how to handle sudden node failure. If my app has an API handled by 4 servers behind a load balancer, polled by AJAX or an iPhone app, the poller can simply detect the failed TCP/IP transmission and retry... assuming the load balancer behaves correctly, it will hit a healthy instance.
If the app is more processing-oriented, a queue service like SQS can be used to allow stateless nodes to pick up where the failed nodes left off.
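For example, a worker might long-poll SQS and delete a message only after it has been processed, so a node failure just makes the message visible again for another node to pick up (queue URL and handler are placeholders; AWS SDK for JavaScript v3):

import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const client = new SQSClient({ region: "us-east-1" });
// Placeholder queue URL.
const queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue";

async function handle(body: string): Promise<void> {
  console.log("processing", body);
}

async function poll(): Promise<void> {
  const { Messages } = await client.send(
    new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20, // long polling
      // The message stays invisible for 60 s; if this node dies before
      // deleting it, SQS makes it visible again and another node retries.
      VisibilityTimeout: 60,
    })
  );

  for (const message of Messages ?? []) {
    await handle(message.Body ?? "");
    // Delete only after successful processing.
    await client.send(
      new DeleteMessageCommand({
        QueueUrl: queueUrl,
        ReceiptHandle: message.ReceiptHandle!,
      })
    );
  }
}

poll().catch(console.error);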
The difficulty I see is with "points of entry", where no retry/polling is possible because the application hasn't been loaded yet, and where a failure means the app never starts. For example, the index.html on a webpage... if a node fails while transmitting that file, the user's browser will likely hang and not automatically retry (they will need to refresh).
The Load Balancer is also a single "point of entry/failure". However, in this case it appears we can solve the problem by creating multiple Load Balancers and balancing across them with DNS load balancing, as described here: http://blog.rightscale.com/2012/10/23/dns-load-balancing-and-using-multiple-load-balancers-in-the-cloud/
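As a rough sketch of what that might look like with Route 53 weighted records pointing at two load balancers (zone ID, record name, and load balancer DNS names are placeholders):

import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

const client = new Route53Client({ region: "us-east-1" });

// Hypothetical zone and load balancer DNS names.
const hostedZoneId = "Z0123456789EXAMPLE";
const loadBalancers = [
  { id: "lb-1", dnsName: "my-lb-1-1234567890.us-east-1.elb.amazonaws.com" },
  { id: "lb-2", dnsName: "my-lb-2-1234567890.us-east-1.elb.amazonaws.com" },
];

async function upsertWeightedRecords(): Promise<void> {
  await client.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: hostedZoneId,
      ChangeBatch: {
        Changes: loadBalancers.map((lb) => ({
          Action: "UPSERT" as const,
          ResourceRecordSet: {
            Name: "www.example.com",
            Type: "CNAME" as const,
            SetIdentifier: lb.id, // distinguishes the weighted records
            Weight: 50,           // equal split between the two LBs
            TTL: 60,              // short TTL so clients re-resolve quickly
            ResourceRecords: [{ Value: lb.dnsName }],
          },
        })),
      },
    })
  );
}

upsertWeightedRecords().catch(console.error);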
Is this a solution that would work for the simpler index.html case? Overall, how can we create redundancy where polling/retrying/queuing is not possible?
EDIT: Another idea is to have the index.html hosted statically on a CDN, S3, etc (where resource availability is more dependable), although that prevents using dynamic content. Dynamic content could be added if the page populates itself using JS, but that adds a dependency on JS as well as latency for the user.

BizTalkServerIsolatedHost disappeared from one server in multi-server group

Afternoon all,
We have a group of four BizTalk servers: two orchestration hosts and two adapter hosts. We have a number of orchestrations exposed as web services, and for the purposes of this question, it is important to note that these web services are hosted on the adapter servers, and run under the BizTalkServerIsolatedHost host instance.
This morning, we started seeing odd errors on both of the adapter servers when SOAP calls came into the web services, like this:
The Messaging Engine failed to register the adapter for "SOAP" for the receive location blahblahblah. Please verify that the receive location exists, and that the isolated adapter runs under an account that has access to the BizTalk databases.
We restarted IIS on both servers, which fixed the errors on ONE server, but the other server continued to fail. The errors continued after a reboot as well.
After chasing our tails for a while, we eventually discovered that the BizTalkServerIsolatedHost host instance on the still-failing server was gone. Just... gone. These applications have been in production for months. Everything had been working swimmingly through the morning, until this just happened.
I don't want to muddy the waters, because I think the problems are unrelated, but in the interest of providing enough information, this problem exactly coincided with a problem in our load-balancing network hardware. The load balancer, which provides a single URL to consumers, and round-robins between the two adapter servers, just stopped working. This problem has not been resolved, so I don't know what happened, but it certainly made troubleshooting more interesting...
So, I have two questions:
Has anyone seen this before, where a host instance disappears?
We cannot find anything in the event viewer or anywhere else that says the host instance was deleted. Is this logged somewhere?
Thanks,
Jason