I have an Istio setup within my cluster, and connectivity to Redis is via an ExternalName Service pointing to ElastiCache.
After an hour of inactivity we get a disconnect:
aioredis Connection has been closed by server, response: None
I can't seem to find exactly what is closing the connection after this time. AWS doesn't have a timeout set, and the TCP connection should remain open.
I have added a DestinationRule attempting to increase the TCP and HTTP timeout settings, but I'm not sure it is being applied properly.
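For reference, the kind of DestinationRule I have been experimenting with looks roughly like this (the hostname and timeout values are placeholders, not my real configuration):

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: redis-elasticache
    spec:
      host: my-redis.xxxxxx.use1.cache.amazonaws.com   # placeholder for the ExternalName target
      trafficPolicy:
        connectionPool:
          tcp:
            tcpKeepalive:      # ask Envoy to send TCP keepalives on the upstream connection
              time: 600s
              interval: 60s
              probes: 3
          http:
            idleTimeout: 2h    # only matters if the traffic is treated as HTTP

Whether this actually gets applied to traffic going through the ExternalName Service is exactly the part I'm unsure about.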
Any pointers on where to look would be appreciated.
I was no longer able to SSH into a Google Cloud Compute Engine VM instance that previously showed no problems.
The error logs show the following:
@type: "type.googleapis.com/google.protobuf.Struct"
value: {
  conditionNotMet: {
    userVisibleMessage: "Supplied fingerprint does not match current metadata fingerprint."
  }
}
Trying SSH through the console showed:
Code: 4003
Reason: failed to connect to backend
Please ensure that:
- your user account has the iap.tunnelInstances.accessViaIAP permission
- the VM has a firewall rule that allows TCP ingress traffic from the IP range XXX.0/20, port: 22
- you can make a proper https connection to the IAP for TCP hostname: https://tunnel.cloudproxy.app
You may be able to connect without using the Cloud Identity-Aware Proxy.
The VM instance logs showed the following:
Error watching metadata: Get http://metadata.google.internal/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=XXX: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
After stopping and restarting the instance I was able to SSH again, but I would like to understand the reason for the problem in the first place.
The error message you received indicates that the instance's request to the metadata server timed out before a response arrived. This could be because the metadata server was taking too long to respond or because there was a problem with the network. You can try to resolve this issue by either increasing the timeout value as described in this doc, or by waiting for the instance to become healthy using the gcloud compute wait command.
As the timeout error suggests, the instance was unable to reach the metadata server. This could be a problem with the instance itself or with the network connection: a firewall or network configuration issue could have prevented the instance from connecting to the metadata server, or an issue with the underlying infrastructure could have made the instance temporarily unavailable.
To prevent this issue from happening again, you can increase the timeout value or use the gcloud compute wait command to wait for the instance to become healthy. It is also recommended that you regularly update the SSH key used to connect to the instance, and that you check the instance can reach the metadata server by making an HTTPS request to the IAP for TCP hostname. Additionally, make sure that your user account has the iap.tunnelInstances.accessViaIAP permission and that the VM has a firewall rule that allows TCP ingress traffic from the IP range XXX.0/20 on port 22.
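For example, a firewall rule permitting IAP TCP forwarding to port 22 can be created roughly like this (the rule and network names are placeholders; 35.235.240.0/20 is the source range Google documents for IAP):

    gcloud compute firewall-rules create allow-ingress-from-iap \
        --network=default \
        --direction=INGRESS \
        --action=ALLOW \
        --rules=tcp:22 \
        --source-ranges=35.235.240.0/20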
If you are using a Windows VM, try the troubleshooting steps mentioned in this doc.
We have been running eclipse-mosquitto 1.6.10 in Google Cloud Platform for months without issue. We use both the MQTT protocol and WebSockets. We have one Google Cloud TCP load balancer for each protocol: one on port 443 for MQTT and one on port 9200 for WebSockets. They provide TLS for both connections. There is only one broker, which both of the load balancers point to.
Last night at 19:30 BST (UTC +1), the broker started having timeout issues and we were unable to find the root cause. In the end, we spun up a new server with a fresh image and new load balancers. The timeout issues stopped, and regular MQTT connections are now working perfectly. However, websocket connections are now intermittently failing. The browser attempts to connect, and after a delay of ~20 seconds, either connects or fails. Even successful connections are then dropping shortly after - anywhere between 20 seconds and 8 minutes.
We have raised the Linux open file limit and have also tried upgrading to 1.6.14. Rebooting Mosquitto and the entire server doesn't fix the problem, but it does seem to make it slightly more likely that connections succeed for a short time.
The server that the broker is running on is not under high load.
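For context, the broker's listener configuration is essentially the following (the backend ports here are illustrative; TLS is handled by the load balancers, not by Mosquitto itself):

    # mosquitto.conf (sketch)
    listener 1883             # plain MQTT, fronted by the TCP load balancer on 443
    listener 9001             # fronted by the TCP load balancer on 9200
    protocol websockets       # applies to the listener declared immediately above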
Any insights would be appreciated. Thanks.
In my application:
ASP.NET Core 3.1 with Kestrel
Running in AWS ECS + Fargate
Services run in a public subnet in the VPC
Tasks listen only on port 80
Public Network Load Balancer with SSL termination
I want to set the Security Group to allow inbound connections from anywhere (0.0.0.0/0) to port 80, and disallow any outbound connection from inside the task (except, of course, to respond to the allowed requests).
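In other words, what I'm trying to end up with is roughly this (the security group ID is a placeholder):

    # allow inbound HTTP from anywhere
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 80 --cidr 0.0.0.0/0

    # remove the default allow-all egress rule, so the task cannot initiate outbound connections
    aws ec2 revoke-security-group-egress \
        --group-id sg-0123456789abcdef0 \
        --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'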
As Security Groups are stateful, the connection tracking should allow the egress of the response to the requests.
In my case, this connection tracking only works for responses without a body (just headers). When the response has a body (in my case, a >1 MB file), it fails. If I allow outbound TCP connections from port 80, it also fails. But if I allow outbound TCP connections for the full range of ports (0-65535), it works fine.
I guess this is because when ASP.NET Core + Kestrel writes the response body it initiates a new connection which is not recognized by the Security Group connection tracking.
Is there any way I can allow only responses to requests, and no other type of outbound connection initiated by the application?
So we're talking about something like this?
Client 11.11.11.11 ----> AWS NLB/ELB public 22.22.22.22 ----> AWS ECS network router or whatever (kubernetes) --------> ECS server instance running a server application 10.3.3.3:8080 (kubernetes pod)
Do you configure the security group on the AWS NLB or on the AWS ECS? (I guess both?)
Security groups should allow incoming traffic if you allow 0.0.0.0/0 port 80.
They are indeed stateful. They will allow the connection to proceed both ways after it is established (meaning the application can send a response).
However, firewall state is typically not kept for more than 60 seconds (I'm not sure what technology AWS is using), so the connection can be "lost" if the server takes more than a minute to reply. Does the HTTP server take a while to generate the response? If it's a WebSocket or TCP server instead, does it spend whole minutes at a time without sending or receiving any traffic?
The way I see it, we've got two stateful firewalls: the first with the NLB, the second with ECS.
ECS is an equivalent of Kubernetes, so it must be doing a ton of iptables magic to distribute traffic and track connections. (For reference, regular Kubernetes relies heavily on iptables, and iptables/conntrack has a bunch of very important settings like connection durations and timeouts.)
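As an illustration of the kind of knobs I mean, on a plain Linux box you can inspect the conntrack timeouts that decide how long idle TCP state is kept (we have no visibility into the equivalent values inside AWS's managed networking):

    # how long established-but-idle TCP connections stay in the conntrack table
    sysctl net.netfilter.nf_conntrack_tcp_timeout_established
    # default timeout for traffic the tracker cannot classify further
    sysctl net.netfilter.nf_conntrack_generic_timeout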
The good news is: if it breaks when you only open inbound 0.0.0.0/0 on port 80, but works when you open inbound 0.0.0.0/0 on port 80 plus outbound to all ports, this is definitely an issue with the firewall dropping the connection, most likely due to losing state (or it's not stateful in the first place, but I'm pretty sure security groups are stateful).
The drop could happen on either of the two firewalls. I've never had an issue with a single bare NLB/ELB, so my guess is the problem is in ECS or in the interaction of the two together.
Unfortunately we can't debug that ourselves, and we have very little information about how this works internally. Your only option will be to work with AWS support to investigate.
We are seeing 504 errors in our ELB logs, but there are no corresponding errors in the application logs. We have increased the idle timeout on the ELB and can see that no requests are taking more time than that.
Going through the AWS documentation, we found that we need to configure the keep-alive time on the EC2 instances to be equal to or greater than the idle timeout, to keep the connection open between the ELB and the backend server.
We couldn't find any way to configure the keep-alive time between the ELB and the backend server. Any suggestion on how to do that would be helpful.
We are using tomcat-ebs for the backend servers.
You need to set keepAliveTimeout="xxxx" in your Tomcat connector settings to avoid tearing down idle connections.
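For example, in Tomcat's server.xml the HTTP connector could look something like the sketch below; the 65000 ms value is only an illustration, picked to sit just above a 60-second ELB idle timeout (use whatever your actual idle timeout is):

    <!-- keep idle keep-alive connections open longer than the ELB idle timeout -->
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               keepAliveTimeout="65000"
               maxKeepAliveRequests="-1"
               redirectPort="8443" />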
So, we are kind of lost using the AWS ELB connection draining feature.
We have an Auto Scaling Group, and we have an application with independent sessions (a session on every instance). We configured the ELB listener over HTTP on port 80, forwarding to port 8080 (this is of course the port where the application is deployed), and we created an LBCookieStickinessPolicy. We also enabled connection draining for 120 seconds.
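Concretely, the stickiness and draining parts of the configuration look roughly like this (the load balancer and policy names are placeholders):

    aws elb create-lb-cookie-stickiness-policy \
        --load-balancer-name my-elb --policy-name my-sticky-policy
    aws elb set-load-balancer-policies-of-listener \
        --load-balancer-name my-elb --load-balancer-port 80 \
        --policy-names my-sticky-policy

    # connection draining for 120 seconds
    aws elb modify-load-balancer-attributes \
        --load-balancer-name my-elb \
        --load-balancer-attributes "ConnectionDraining={Enabled=true,Timeout=120}"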
The behavior we want:
We want to scale down an instance, but since each session is stuck to its instance, we want to "maintain" that session for 120 seconds (or whatever the connection draining setting is).
The behavior we have:
We have tried deregistering an instance, setting it to standby, terminating it, stopping it, and marking it unhealthy. But no matter what we do, the instance shuts down immediately, causing the session to end abruptly. We also changed the ELB listener configuration to work over TCP, with no luck.
Thoughts?
Connection draining refers to open TCP connections with the client; it has nothing to do with sessions on your instance. You may be able to do something with keep-alives if you use a TCP passthrough instead of an HTTP listener.
The best route is to set up sessions to be shared between your instances and then disable stickiness on the load balancer.