We frequently see (at least once or twice a day) "java.net.UnknownHostException" errors on some of our Google Kubernetes Engine pods. When it happens, the errors occur for all external hosts that the processes on those pods are trying to reach, for instance:
java.net.UnknownHostException: datastore.googleapis.com
java.net.UnknownHostException accounts.google.com
com.google.gcloud.datastore.DatastoreException - I/O error
Has anyone else faced this issue? What could cause this sudden loss of connectivity on the pods? Of late (for the last 2 weeks or so), we seem to be noticing this issue more frequently than before.
Thanks.
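One way to narrow this down is to check whether name resolution itself fails from inside an affected pod while the errors are occurring. The following is only a minimal sketch in Python, not something from the original setup: the function name `probe_dns`, the retry count, and the delay are all assumptions chosen for illustration.

```python
import socket
import time
from typing import Callable, Iterable, List, Tuple


def probe_dns(
    hosts: Iterable[str],
    resolver: Callable = socket.getaddrinfo,  # injectable so it can be faked
    attempts: int = 3,                         # assumed retry count
    delay_s: float = 0.5,                      # assumed pause after a failure
) -> List[Tuple[str, int]]:
    """Return (host, failed_attempts) for every host that failed at least once."""
    failures = []
    for host in hosts:
        failed = 0
        for _ in range(attempts):
            try:
                resolver(host, 443)
            except socket.gaierror:
                # Resolution failed; this is the Python analogue of
                # java.net.UnknownHostException.
                failed += 1
                time.sleep(delay_s)
        if failed:
            failures.append((host, failed))
    return failures
```

Calling `probe_dns(["datastore.googleapis.com", "accounts.google.com"])` from a pod during an incident would show whether every lookup fails (pointing at the pod's or node's DNS path) or only some of them.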
Related
We are using Istio 1.8.1 and have started using a headless service to get direct pod to pod communication working with Istio mTLS. This is all working fine, but we have recently noticed that sometimes after killing one of our pods we get 503 no healthy upstream errors for a very long time afterwards (many minutes). If we go back to a ‘normal’ service we get a few 503 errors and then the problem is fixed very quickly (but we can't direct requests to a specific pod which we need to do).
We have traced the communications of the envoy container using kubectl sniff and can see that existing connections are maintained for a long period after the pod is killed, and even that new connections are attempted to the previously killed pod IP.
We have circuit breaker configuration on a destination rule for the service in question, and that doesn’t seem to have helped either. We have also tried setting ‘PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES’ which seemed to improve the 503 errors situation, but strangely interfered with pod to pod direct IP configuration.
Does anyone have any suggestions on why we are receiving the 503 errors or how to avoid them?
I have a site running on AWS on the free tier. It ran for roughly 10 months without any trouble, but lately I have found that it returns 502 Bad Gateway and cannot be reached.
I haven't touched the settings or anything else, yet this suddenly started happening. What could have caused it? Is this a common issue? How can I avoid it?
Thanks in advance.
Background:
We are running a single-page application served via nginx, with a Node.js (v12.10) backend running Express. It runs as containers on ECS; we currently have three t3a.medium container instances, with the api and web services each running 6 replicas across them. We use an ALB to handle load balancing and routing of requests. We run three subnets across 3 AZs, with the load balancer associated with all three and the instances spread across the 3 AZs as well.
Problem:
We are trying to get to the root cause of some intermittent 502 errors that are appearing for both front and back end. I have downloaded the ALB access logs and the interesting thing about all of these requests is that they all show the following.
- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
At the time of these errors I can see that there were healthy targets available.
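To isolate just these entries from a larger log file, a filter along the following lines may help. It is a sketch only: it relies on the documented order of the first twelve space-separated fields in an ALB access log entry, and the function names and the 0.01 second threshold are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# First twelve fields of an ALB access log entry, in order. None of them
# contains a space, so a plain split is enough to read them.
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes",
]


def parse_entry(line: str) -> Dict[str, str]:
    """Map the leading fields of one log line to their names."""
    parts = line.split(maxsplit=len(FIELDS))
    return dict(zip(FIELDS, parts[:len(FIELDS)]))


def fast_502s(lines: Iterable[str], max_target_time: float = 0.01) -> List[Dict[str, str]]:
    """Keep 502 entries with near-zero target time and response time of -1."""
    hits = []
    for line in lines:
        if not line.strip():
            continue
        entry = parse_entry(line)
        if entry.get("elb_status_code") != "502":
            continue
        if entry.get("response_processing_time") != "-1":
            continue
        try:
            target_time = float(entry["target_processing_time"])
        except (KeyError, ValueError):
            continue  # malformed or truncated line
        if 0 <= target_time <= max_target_time:
            hits.append(entry)
    return hits
```

Grouping the returned entries by their `target` field would show whether the fast 502s cluster on particular containers or are spread evenly across all of them.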
Now I know that some people have had issues like this when the keepAlive time on the server side was shorter than on the ALB side, so the server forcibly closed connections that the ALB then tried to reuse (which is in line with the AWS troubleshooting guidelines). However, the keepAlive times for our back end are currently set to double that of our ALB. Also, the requests themselves can be replayed via Chrome dev tools and they succeed (I'm not sure if this is a valid way to check for a malformed request, but it seemed reasonable).
I am very new to this area, so any suggestions as to where to look or what sort of tests to run to pinpoint this issue would be greatly appreciated. I have run some load tests on certain endpoints and reproduced the 502 errors. However, the errors under heavy load differ from the intermittent ones in our logs in that the target_processing_time is quite high, so to my mind this is another issue altogether. At this stage I would like to start by understanding the errors that show a target_processing_time of basically zero.
I wrote a blog post about this a bit over a year ago that's probably worth taking a look at (the problem is caused by a behavior change in Node.js 8+):
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/
TL;DR: you need to set the Node.js http.Server keepAliveTimeout (which is in ms) to be higher than the load balancer's idle timeout (which is in seconds).
Please also note that there is also something called an http-keepalive which sets an http header, which has absolutely nothing to do with this problem. Make sure you're setting the right thing.
Also note that there is currently a regression in Node.js where setting keepAliveTimeout may not work properly. That bug is tracked here: https://github.com/nodejs/node/issues/27363, and it is worth reading through if you're still having this problem (you may also need to set headersTimeout).
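Because the two settings use different units, it is easy to get the comparison wrong. The arithmetic can be sketched as below (in Python, purely as a worked example). The helper names, the 5 second margin, and the extra 1 second added for headersTimeout are assumptions for illustration, not values taken from the Node.js or AWS documentation.

```python
def is_keep_alive_safe(keep_alive_timeout_ms: int, alb_idle_timeout_s: int) -> bool:
    """True if the server keeps idle connections open longer than the ALB does."""
    return keep_alive_timeout_ms > alb_idle_timeout_s * 1000


def node_timeouts_ms(alb_idle_timeout_s: int, margin_s: int = 5) -> dict:
    """Suggest millisecond values for the two Node.js server settings."""
    # keepAliveTimeout must outlast the ALB idle timeout (assumed margin: 5 s).
    keep_alive_ms = (alb_idle_timeout_s + margin_s) * 1000
    # headersTimeout is set slightly above keepAliveTimeout (assumed gap: 1 s).
    headers_ms = keep_alive_ms + 1000
    return {"keepAliveTimeout": keep_alive_ms, "headersTimeout": headers_ms}
```

For a 60 second idle timeout, `node_timeouts_ms(60)` gives 65000 and 66000. Note that a value of 120 meant as "double the ALB's 60 seconds" fails the check, since `is_keep_alive_safe(120, 60)` compares 120 ms against 60000 ms.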
I am working with a customer who is having issues with their GCP Cloud SQL deployment. Their questions are transcribed here:
When connecting to Cloud SQL, connections often fail intermittently. This can look like a Python error:
(psycopg2.DatabaseError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Or, in Node, it can look like a timeout error or a socket hang up:
TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
We have everything configured correctly, as far as we can tell, and have followed all the instructions in the Cloud SQL troubleshooting guide. We have an instance with 20GB of memory that should support 250 connections. The timeouts should be set to refresh the connections at the right intervals (< 10 min). So we're not sure what's going on here.
I know that isn't a ton to go on but I wanted to try and do my due diligence in seeing how we can help them. I realize we may not get a perfect answer on what is going on but some additional questions I can ask of them to help debug the issue would be a great help to start with.
I found this similar question that seems to be describing the same issue but it has no answers: PostgreSQL 'Sever closed the connection unexpectedly'
Thanks for any help!
As the error suggests, it's not clear what caused the connection to be closed. I would suggest looking into the Cloud SQL error logs (within your Google Cloud Console) for detailed information on why the connection was closed, as was the case in this GitHub issue (the wrong role was assigned).
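Alongside the logs, one question worth asking the customer is whether the pools on all application instances combined can exceed the 250 connections mentioned in the question. A rough sketch of that arithmetic follows; the function name and every number other than 250 are hypothetical, and the real pool settings would need to come from the customer's configuration.

```python
def connection_budget(
    max_connections: int,   # server-side limit (250 in the question)
    app_instances: int,     # hypothetical: number of app processes with a pool
    pool_size: int,         # hypothetical: steady-state connections per pool
    max_overflow: int = 0,  # hypothetical: extra connections a pool may open
    reserved: int = 0,      # hypothetical: connections held back for admin use
) -> dict:
    """Compare the worst-case client demand with what the server allows."""
    peak = app_instances * (pool_size + max_overflow)
    available = max_connections - reserved
    return {"peak": peak, "available": available, "headroom": available - peak}
```

For example, 12 instances with a pool of 20 plus an overflow of 5 could open up to 300 connections, which leaves a headroom of -50 against a limit of 250. A negative headroom would be consistent with both the pool timeout seen in Node and connections being refused or dropped on the server side.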
I have been stuck on a problem for the last few days:
I have a Spring Boot application running on AWS ECS behind an ELB. The application exposes a Jersey endpoint that downloads a 750 MB file from AWS S3 in chunks. We take the input stream from S3 and stream it over HTTP. Midway through the download (at around 400 MB), we get the exception below.
Caused by: org.apache.catalina.connector.ClientAbortException: java.net.SocketTimeoutException
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:380)
at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:350)
at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:405)
at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:393)
at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:96)
at org.glassfish.jersey.servlet.internal.ResponseWriter$NonCloseableOutputStreamWrapper.write(ResponseWriter.java:325)
at org.glassfish.jersey.message.internal.CommittingOutputStream.write(CommittingOutputStream.java:229)
at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$UnCloseableOutputStream.write(WriterInterceptorExecutor.java:299)
Caused by: java.net.SocketTimeoutException: null
at org.apache.tomcat.util.net.NioBlockingSelector.write(NioBlockingSelector.java:134)
at org.apache.tomcat.util.net.NioSelectorPool.write(NioSelectorPool.java:157)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite(NioEndpoint.java:1221)
at org.apache.tomcat.util.net.SocketWrapperBase.writeBlocking(SocketWrapperBase.java:378)
at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:347)
at org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.doWrite(Http11OutputBuffer.java:561)
at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:112)
I researched this and found some common solutions on Google that suggest raising the ELB idle timeout and Tomcat's keepAliveTimeout. When I set both of these properties to 1800, it started to work, but I don't want to set these values to an arbitrary number without understanding the root cause.
Also, I went into the Tomcat classes and found the actual line of code that throws the exception, but I am still not able to figure out the cause.
I also don't understand why this happens only when the application is behind an AWS ELB.
Does anyone have a clue about this?
I ran into a similar issue. Let me try to explain the root cause of the problem:
The data flows from "Spring Boot Application" --> ELB --> Client. The "Spring Boot Application" and the ELB are on the same network, so the connection between them is very fast, but the connection between the ELB and the Client is not.
Suppose the ELB's memory buffer and network buffer fill up because the data is not delivered to the Client fast enough. The ELB then has no room to accept more data from the "Spring Boot Application", so a write of even a tiny 4 KB chunk, which normally finishes instantly, cannot complete within the "timeout" seconds, and a SocketTimeoutException is thrown.
So my guess is that the issue can always occur if the timeout is short enough or the connection between the ELB and the Client is slow enough.
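The mechanism can be reproduced in isolation. The sketch below is in Python rather than Tomcat and is only an analogy for the guess above: a local socket pair stands in for the application-to-ELB connection, and a receiver that never reads stands in for an ELB whose buffers are full. The function name and the default values are assumptions.

```python
import socket
from typing import Tuple


def write_until_timeout(
    timeout_s: float = 0.2,
    chunk_size: int = 4096,
    max_bytes: int = 64 * 1024 * 1024,  # safety cap so the loop always ends
) -> Tuple[int, bool]:
    """Write chunks to a peer that never reads; return (bytes_sent, timed_out)."""
    sender, receiver = socket.socketpair()
    try:
        sender.settimeout(timeout_s)  # plays the role of the write timeout
        chunk = b"x" * chunk_size
        sent = 0
        while sent < max_bytes:
            try:
                # Succeeds immediately while the socket buffers have room.
                sent += sender.send(chunk)
            except socket.timeout:
                # Buffers are full and nothing drained within timeout_s.
                return sent, True
        return sent, False
    finally:
        sender.close()
        receiver.close()
```

The first writes return instantly, and the timeout only appears once the buffers between the two ends are full, which matches a download that fails partway through rather than at the start. Raising the timeout, as in the question, gives the slow side more time to drain before a write is abandoned.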