I have a site running on AWS on the free tier. It has been running for almost 10 months without any trouble, but lately I've found it returning a 502 Bad Gateway error and the site can't be reached.
My question is: I haven't touched the settings at all, yet this suddenly started happening. What could have caused it? Is this a common issue, and how can I avoid it?
Thanks in advance.
Related
Slack integration for error reporting suddenly stopped working. Anyone experienced similar issues?
We also removed the Slack channel integration and set it up again. Sadly, no improvement. Also worth noting: other channels work as expected.
The issue has been resolved in the meantime. Thanks to Kyle for looking into this!
Background:
We are running a single-page application served via nginx, with a Node.js (v12.10) backend running Express. It runs as containers on ECS; currently we have three t3a.medium container instances, with the API and web services each running 6 replicas across them. We use an ALB to handle our load balancing and routing of requests. We run three subnets across 3 AZs, with the load balancer associated with all three and the instances spread across the 3 AZs as well.
Problem:
We are trying to get to the root cause of some intermittent 502 errors that are appearing for both the front end and the back end. I have downloaded the ALB access logs, and the interesting thing is that all of these requests show the following:
- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
At the time of these errors I can see that there were healthy targets available.
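For anyone working through the same logs, here is a minimal sketch (Node.js) of how entries like these could be filtered out of a downloaded ALB access log. It assumes the documented ALB access log field order (the three processing times are the 6th-8th space-delimited fields and elb_status_code the 9th); the file name is just an example.
```js
// filter-502s.js — sketch: print ALB access log entries that returned a 502
// with a near-zero target_processing_time or a response_processing_time of -1.
// Assumes the standard ALB access log field order; the log path is an example.
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('alb-access.log'),
});

rl.on('line', (line) => {
  const f = line.split(' ');
  // f[5] = request_processing_time, f[6] = target_processing_time,
  // f[7] = response_processing_time, f[8] = elb_status_code
  if (f[8] === '502' && (parseFloat(f[6]) <= 0.004 || f[7] === '-1')) {
    console.log(line);
  }
});
```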
Now, I know some people have had issues like this when the keep-alive timeout on the server side was shorter than on the ALB side, so connections were being forcibly closed that the ALB then tried to reuse (which is in line with AWS's troubleshooting guidelines). However, the keep-alive timeout on our back end is currently set to double that of our ALB. Also, the failing requests themselves can be replayed via Chrome DevTools and they succeed (I'm not sure this is a valid way to check for a malformed request, but it seemed reasonable).
I am very new to this area, so if anyone has suggestions on where to look or what sort of tests to run to help pinpoint this issue, it would be greatly appreciated. I have run load tests on certain endpoints and reproduced the 502 errors, but the errors under heavy load differ from the intermittent ones in our logs in that their target_processing_time is quite high, so to my mind that is a different issue altogether. At this stage I would like to start by understanding the errors that show a target_processing_time of essentially zero.
I wrote a blog post about this a bit over a year ago that's probably worth taking a look at (the problem is caused by a behavior change in Node.js 8+):
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/
TL;DR: you need to set the Node.js http.Server keepAliveTimeout (which is in milliseconds) to be higher than the load balancer's idle timeout (which is in seconds).
Please also note that there is something called HTTP keep-alive, which sets an HTTP header and has absolutely nothing to do with this problem. Make sure you're setting the right thing.
Also note that there is currently a regression in Node.js where setting keepAliveTimeout may not work properly. That bug is being tracked here: https://github.com/nodejs/node/issues/27363 and is worth looking through if you're still having this problem (you may need to set headersTimeout as well).
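To make that concrete, here is a minimal sketch of the kind of change I mean, assuming an Express app behind an ALB using the default 60-second idle timeout (the port and exact timeout values are illustrative):
```js
// server.js — sketch: Express app behind an ALB with the default 60s idle timeout.
const express = require('express');

const app = express();
app.get('/health', (req, res) => res.sendStatus(200));

const server = app.listen(3000);

// keepAliveTimeout is in milliseconds; set it above the ALB idle timeout (60s here)
// so that the ALB, not Node, is the side that closes idle connections.
server.keepAliveTimeout = 65 * 1000;
// Because of the regression tracked in nodejs/node#27363, raise headersTimeout too,
// to something slightly larger than keepAliveTimeout.
server.headersTimeout = 66 * 1000;
```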
We are frequently seeing (at least once or twice every day) "java.net.UnknownHostException" errors on some of our Google Kubernetes Engine pods. When it happens, the errors occur for all external hosts that our processes on those pods are trying to reach, for instance:
java.net.UnknownHostException: datastore.googleapis.com
java.net.UnknownHostException accounts.google.com
com.google.gcloud.datastore.DatastoreException - I/O error
Has anyone else faced this issue? What could cause this sudden loss of connectivity on the pods? Of late (for the last 2 weeks or so), we seem to be noticing this issue more frequently than before.
Thanks.
I am working with a customer who is having issues with their GCP Cloud SQL deployment. Their questions are transcribed here:
When connecting to Cloud SQL, connections often fail intermittently. This can look like a Python error:
(psycopg2.DatabaseError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Or, in Node, it can look like a timeout error or a socket hang up:
TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
We have everything configured correctly, as far as we can tell, and have followed all the instructions in the Cloud SQL troubleshooting guide. We have an instance with 20GB of memory that should support 250 connections. The timeouts should be set to refresh the connections at the right intervals (< 10 min). So we're not sure what's going on here.
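For reference, the Node side acquires connections through a Knex pool along these lines (a sketch with illustrative values; the specific settings, environment variables, and the pg driver are my assumptions, not the customer's exact configuration):
```js
// db.js — sketch of the kind of Knex pool configuration in question.
// Values are illustrative, not the customer's actual settings.
const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
  },
  pool: {
    min: 2,
    max: 10,
    // Retire idle connections well before the server-side timeout (< 10 min)
    // so the pool never hands out a connection the server has already closed.
    idleTimeoutMillis: 5 * 60 * 1000,
    // Fail fast instead of hanging when the pool is exhausted.
    acquireTimeoutMillis: 30 * 1000,
  },
});

module.exports = knex;
```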
I know that isn't a ton to go on, but I wanted to do my due diligence in seeing how we can help them. I realize we may not get a definitive answer on what is going on, but some additional questions I could ask them to help debug the issue would be a great start.
I found this similar question that seems to describe the same issue, but it has no answers: PostgreSQL 'Sever closed the connection unexpectedly'
Thanks for any help!
As the error suggests, it's not clear what caused the connection to be closed. I would suggest looking into the Cloud SQL error logs (within your Google Cloud Console) for detailed information on why the connection was closed, as was the case in this GitHub issue (the wrong role was assigned).
We started seeing this issue about 2 hours ago. It's happening very randomly. For example, if you copy and paste this image URL into the browser, the chance of it not showing up for me is about 20%.
https://d1jbmqjs327xbn.cloudfront.net/_pa/spaces-developer.pxand/assets/images/apps/pos/pos-login-bg.jpg
Even after the browser is able to load the image, it may fail to load again if you do a hard refresh. It will then eventually load after doing a hard refresh 2-3 more times.
This seems like a networking issue on AWS side.
Another thing I saw is that 3 of my domains randomly became unreachable for a few minutes (I tested with ping) and then eventually became reachable again without any change on my side.
Is anyone experiencing the same issue today (Sep 20, 2017)? This is causing problems for 10+ sites/apps I manage and I'm not quite sure how to solve it. Amazon is also not getting back to me on this.