How can I prevent Istio sidecar from shutting down before my service has finished gracefully terminating? - istio

We have a service that needs to run a few longer SQL queries when it shuts down. However, when the pod receives a SIGTERM from Kubernetes, the istio-proxy container waits only 5s before shutting down. This causes our queries to fail, and the service terminates ungracefully.
Things we've tried:
Setting the terminationGracePeriodSeconds to 3600. Istio still shuts down after 5s.
Keeping an HTTP connection open to try to force Istio not to shut down. Istio still shuts down, forcing our HTTP connection to close too.
Setting TERMINATION_DRAIN_DURATION_SECONDS to 3600 on the istio container. Istio keeps running until 3600s have elapsed, even if our service has finished shutting down. We tried calling curl -XPOST http://127.0.0.1:15000/quitquitquit to get Istio to shut down sooner but it remains running for the full time.
How can we get Istio to stay running long enough for our service to terminate gracefully, without having it stay running for too long?

As far as I know, graceful shutdown for the Istio sidecar is unfortunately not supported:
Currently there is not. This seems like an interesting feature request though; it may be worth a feature request on GitHub.
I also think that opening an appropriate thread on GitHub would be a good next step.
The only workaround I know of is the TERMINATION_DRAIN_DURATION_SECONDS option.
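If you go that route, the drain duration can also be set mesh-wide through ProxyConfig (the same terminationDrainDuration field the per-pod annotation below uses). A minimal sketch, assuming an IstioOperator-based install; the resource name and the 60s value are placeholders to adapt:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: example-istiocontrolplane   # placeholder name
  namespace: istio-system
spec:
  meshConfig:
    defaultConfig:
      # mesh-wide default drain time for every injected sidecar
      terminationDrainDuration: 60s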
There are also several topics on GitHub related to this theme:
istio-proxy stop before my containers #10112
And currently envoy does not support graceful shutdown. xref: envoyproxy/envoy#2920 Once envoy implements this, we can support it in sidecar.
Render istio-proxy environment variables from global.proxy.env or environmentOverride annotation #18333
For anyone looking through issues for why their connection to a pod is disconnecting early, hopefully this helps.
You need the pod to have a terminationGracePeriodSeconds, which I assume you were already aware of, and you also need an annotation for the pod's sidecar config:
annotations:
  # https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#ProxyConfig
  proxy.istio.io/config: |
    terminationDrainDuration: {{ $terminationGracePeriodSeconds }}s
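Putting the two settings together, a Deployment's pod template might look roughly like the following. This is a sketch only: the names, image, grace period, and drain value are placeholders to adapt.

spec:
  template:
    metadata:
      annotations:
        # keep the sidecar draining for roughly as long as the app may need
        proxy.istio.io/config: |
          terminationDrainDuration: 120s
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: app
        image: my-app:latest   # placeholder image

Note that, as the question observes for TERMINATION_DRAIN_DURATION_SECONDS, the proxy appears to stay up for the full drain duration even if the application finishes shutting down earlier.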
Envoy shutting down before the thing it's wrapping can cause failed requests #7136
Graceful shutdown with injected envoy-sidecar #536

Related

Cloud SQL Proxy connection times out occasionally

We use a single-tenant architecture for our instances. Each instance contains three Django apps, i.e. Django, Celery (worker, beat), and a few other things that don't interact with the database. We deploy cloudsql-proxy as a sidecar for these Django containers, which run as Pods in Google Kubernetes Engine. We are using Cloud SQL (Postgres 9.6) by Google, and it has a public IP address.
The problem is that we are getting OperationalErrors on the Django side, i.e.
OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
and when we check the Pod's logs at the time the OperationalError occurred, we see the following error from the cloudsql-proxy container:
couldn't connect to db_instance: dial tcp our_db_instance_public_ip:3307: connect: connection timed out
It is not that the connection to the database doesn't work. It works most of the time, but sometimes it throws the above errors, which is a pain because we run Celery tasks every other minute and they fail because of this. Sometimes this occurs while an end user is interacting with our application and their requests fail.
Our application isn't under very high load. We set the maximum connections of our database to 1000, and the peak number of connections is around 35 (the sum of all instances' connections). I checked the database stats and it seems pretty healthy: CPU utilization almost never goes above 50%, disk is 30% used, and memory usage is around 50%.
I can provide more details if needed. Would appreciate any help!
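For reference, the cloudsql-proxy sidecar setup described above typically looks something like the following in the pod spec. This is only a sketch: the app image, instance connection name, and ports are placeholders, and the proxy version shown is simply the latest v1 release mentioned elsewhere on this page.

containers:
- name: django
  image: my-django-app:latest          # placeholder app image
  env:
  - name: DATABASE_HOST
    value: "127.0.0.1"                 # the app talks to the proxy over localhost
  - name: DATABASE_PORT
    value: "5432"
- name: cloudsql-proxy
  image: gcr.io/cloudsql-docker/gce-proxy:1.23.0
  command:
  - /cloud_sql_proxy
  - -instances=my-project:my-region:my-instance=tcp:5432   # placeholder instance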

Istio keeps using deleted pod IP

We are using Istio 1.8.1 and have started using a headless service to get direct pod-to-pod communication working with Istio mTLS. This is all working fine, but we have recently noticed that sometimes, after killing one of our pods, we get 503 no healthy upstream errors for a very long time afterwards (many minutes). If we go back to a ‘normal’ service we get a few 503 errors and then the problem is fixed very quickly (but then we can't direct requests to a specific pod, which we need to do).
We have traced the communications of the envoy container using kubectl sniff and can see that existing connections are maintained for a long period after the pod is killed, and even that new connections are attempted to the previously killed pod IP.
We have circuit breaker configuration on a destination rule for the service in question, and that doesn’t seem to have helped either. We have also tried setting ‘PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES’ which seemed to improve the 503 errors situation, but strangely interfered with pod to pod direct IP configuration.
Does anyone have any suggestions on why we were receiving the 503 errors or how to avoid them?
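For reference, the circuit-breaker configuration mentioned above is usually expressed as outlier detection on the DestinationRule, roughly like the sketch below (the host name and thresholds are placeholders, and as noted this alone did not resolve the 503s in this case):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-headless-svc
spec:
  host: my-headless-svc.my-namespace.svc.cluster.local   # placeholder host
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
    outlierDetection:
      consecutive5xxErrors: 3      # eject an endpoint after repeated 5xx responses
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100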

Got an error reading communication packets in Google Cloud SQL

Since 31st March I've been getting the following error in Google Cloud SQL:
Got an error reading communication packets.
I have been using Google Cloud SQL for 2 years, but have never faced such a problem before.
I'm very worried about it.
This is the detailed error message:
textPayload: "2019-04-29T17:21:26.007574Z 203385 [Note] Aborted connection 203385 to db: {db_name} user: {db_username} host: 'cloudsqlproxy~{private ip}' (Got an error reading communication packets)"
While it is true that this error message often occurs after a maintenance period, it isn't necessarily a cause for concern, as this is known MySQL behavior.
Possible explanations for why this issue is happening are:
A large increase in connection requests to the instance, with the number of active connections increasing over a short period of time.
Freezing/unavailability of the instance due to a burst of connections in a very short time interval. This freezing is observed to always happen together with an increase in connection requests: the increase overloads the instance, leaving it unable to respond to further connection requests until the number of connections decreases or the instance stabilizes.
The server was too busy to accept new connections.
There were high rates of previous connections that were not closed correctly.
The client terminated the connection abnormally.
The readTimeout setting being set too low in the MySQL driver.
In an excerpt from the documentation, it is stated that:
There are many reasons why a connection attempt might not succeed. Network communication is never guaranteed, and the database might be temporarily unable to respond. Make sure your application handles broken or unsuccessful connections gracefully.
An outdated Cloud SQL Proxy version can also be the reason for such issues; upgrading to the latest version (v1.23.0 at the time of writing) is a possible troubleshooting step.
The IP from which you are trying to connect may not be added to the Authorized Networks of the Cloud SQL instance.
Some possible workarounds for this issue, depending on your case, are the following:
If the issue is related to a high load, you could retry the connection using exponential backoff, to avoid sending too many simultaneous connection requests. The best practice here is to exponentially back off your connection requests and add randomized backoffs to avoid throttling and potentially overloading the instance. As a way to mitigate this issue in the future, connection requests should be spaced out to prevent overloading. Depending on how you are connecting to Cloud SQL, exponential backoff may already be used by default by certain ORM packages.
If the issue could be related to an accumulation of long-running inactive connections, you can confirm whether this is your case by running SHOW FULL PROCESSLIST on your database and looking for connections with a high Time value, or connections where Command is Sleep.
If this is your case, you have a few possible options:
If you are not using a connection pool, you could try to update the client application logic to properly close connections immediately at the end of an operation, or use a connection pool to limit your connections' lifetime. In particular, it is ideal to manage the connection count with a connection pool; that way unused connections are recycled, and the number of simultaneous connection requests can be limited through the maximum pool size parameter.
If you are using a connection pool, you could return idle connections to the pool immediately at the end of an operation and set a shorter timeout by adjusting the wait_timeout or interactive_timeout flag values. Set the Cloud SQL wait_timeout flag to 600 seconds to force connections to be refreshed.
To check the network and port connectivity:
Step 1. Confirm TCP connectivity on port 3306 with tcptraceroute or netcat.
Step 2. If Step 1 succeeded, try using the mysql client to check for timeouts/errors.
When the client might be terminating the connection abruptly, you could check whether:
The MySQL client or mysqld server is receiving a packet bigger than max_allowed_packet bytes, or the client is receiving a "packet too large" message. If so, you could send smaller packets or increase the max_allowed_packet flag value on both client and server.
There are transactions that are not being properly committed using both "begin" and "commit". If so, the client application logic needs to be updated to properly commit the transaction.
There are several utilities that I think will be helpful here: if you can, install the mtr and tcpdump utilities to monitor the packets during these connection-increasing events.
It is strongly recommended to enable general_log in the database flags. Another suggestion is to also enable the slow query log database flag and output it to a file. Also have a look at this GitHub issue comment and go through the list of additional solutions proposed for this issue here.
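To make the flag suggestions above concrete (wait_timeout, general_log, the slow query log), the instance's settings.databaseFlags end up looking roughly like this. Shown as YAML purely for illustration: you would actually set them through the console or gcloud, and the exact flag names and allowed values should be checked against the list of flags Cloud SQL supports.

settings:
  databaseFlags:
  - name: wait_timeout
    value: "600"           # force idle connections to be refreshed
  - name: general_log
    value: "on"
  - name: slow_query_log
    value: "on"
  - name: log_output
    value: "FILE"          # write the logs to a file, as suggested above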
This error message indicates a connection issue, either because your application doesn't terminate connections properly or because of a network issue.
As suggested in these troubleshooting steps for MySQL or PostgreSQL instances from the GCP docs, you can start debugging by checking that you follow best practices for managing database connections.

Running Flask port 80 on Elastic-Beanstalk Worker

Given an AWS Elastic-Beanstalk Worker box, is it possible to use Flask/port:80 to serve the messages coming in from the associated SQS queue?
I have seen conflicting information about what is going on inside an ELB worker. The ELB Worker Environment page says:
Elastic Beanstalk simplifies this process by managing the Amazon SQS queue and running a daemon process on each instance that reads from the queue for you. When the daemon pulls an item from the queue, it sends an HTTP POST request locally to http://localhost/ on port 80 with the contents of the queue message in the body. All that your application needs to do is perform the long-running task in response to the POST.
This SO question Differences in Web-server versus Worker says:
The most important difference in my opinion is that worker tier instances do not run web server processes (apache, nginx, etc).
Based on this, I would have expected that I could just run a Flask-server on port 80, and it would handle the SQS messages. However, the post appears incorrect. Even the ELB-worker boxes have Apache running on them, apparently for doing health-checks (when I stopped it, my server turned red). And of course it's using port 80...
I already have Flask/Gunicorn on an EC2 server that I was trying to move to ELB, and I would like to keep using that - is it possible? (Note: the queue-daemon only posts messages to port 80, that can't be changed...)
The docs aren't clear, but it sounds like they expect you to modify Apache to proxy to Flask, maybe? I hope that's not the only way.
Or, what is the "correct" way of setting up an ELB-worker to process the SQS messages? How are you supposed to "perform the long-running task"?
Note: now that I've used ELB more and have a fairly good understanding of it, let me make clear that this is not the use-case that Amazon designed the ELB workers for, and it has some glitches (which will be noted). The standard use-case, basically, is that you create a simple Flask app and hook it into an ELB-EC2 server that is configured to make it simple to run that Flask app.
My use-case was, I already had an EC2 server with a large Flask app, running under gunicorn, as well as various other things going on. I wanted to use that server (as an image) to build the ELB server, and have it respond to SQS-queue messages. It's possible there are better solutions, like just writing a queue-polling daemon, and that no-one else will ever take this option, but there it is...
The ELB worker is connected to an SQS queue by a daemon that listens to that queue and (internally) posts any messages to http://localhost:80. Apache is listening on port 80. This is to handle health-checks, which are done by the ELB manager (or something in the eco-system). Apache passes non-health-check requests, using mod_wsgi, to the Flask app that was uploaded, which is at:
/opt/python/current/app/application.py
I suspect it would be possible but difficult to remove Apache and handle the health-checks some other way (flask), thus freeing up port 80. But that's enough of a change that I decided it wasn't worth it.
So the solution I found is to change which port the local daemon posts to: by reconfiguring it via its YAML config file, it will post to port 5001, where my Flask app is running. This means Apache can continue to handle the health-checks on port 80, and Flask can handle the SQS messages from the daemon.
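As a rough sketch of the kind of edit involved: the keys below are illustrative of the kind of entries found in /etc/aws-sqsd.d/default.yaml, but they may differ between platform versions, so check the actual file on your instance before editing.

---
http_port: 5001        # was 80; point the daemon at the port Flask/gunicorn listens on
http_path: /           # the path the daemon POSTs queue messages to
mime_type: application/json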
You configure the daemon, and stop/start it (as root):
/etc/aws-sqsd.d/default.yaml
/opt/elasticbeanstalk/addons/sqsd/hooks/stop-sqsd.sh
/opt/elasticbeanstalk/addons/sqsd/hooks/start-sqsd.sh
/opt/elasticbeanstalk/addons/sqsd/hooks/restart-sqsd.sh
Actual daemon:
/opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd
/opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.2.0/gems/aws-sqsd-2.3/bin/aws-sqsd
Glitches:
If you ever use the ELB GUI to configure daemon options, it will overwrite the config file, and you will have to re-edit the port (and restart the daemon).
Note: all of the HTTP traffic is internal, either to the ELB eco-system or the worker, so it is possible to close off all external ports (I keep 22 open), including port 80. Otherwise your worker has Apache responding to external requests on port 80, meaning it's open to the world. I assume the server is configured fairly securely, but port 80 doesn't need to be open at all, for health-checks or anything else.

Kubernetes liveness probes stop responding

I'm using kube-aws to set up a production Kubernetes cluster on AWS. Now I'm running into an issue I am unable to recreate in my dev environment.
When a pod is running heavy computation, which in my case happens in bursts, the liveness and readiness probes stop responding. In my local environment, where I run docker compose, they work as expected.
The probes respond with a simple HTTP 204 No Content, implemented using native Go functionality.
Has anyone seen this before?
EDIT:
I'd be happy to provide more information, but I am uncertain what to provide, as there is a tremendous amount of configuration and code involved. Primarily, I'm looking for help with troubleshooting and where to look to locate the actual issue.
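For anyone comparing configurations: the knobs that usually matter for this symptom are the probe timeout and failure threshold on the container spec, roughly as sketched below. Paths, port, and values are placeholders; the Kubernetes default timeoutSeconds is 1s, which a CPU-saturated process can easily miss, though loosening it only buys headroom rather than fixing the underlying starvation.

livenessProbe:
  httpGet:
    path: /healthz       # placeholder path
    port: 8080           # placeholder port
  periodSeconds: 10
  timeoutSeconds: 5      # default is 1s
  failureThreshold: 6    # tolerate several slow or missed responses before a restart
readinessProbe:
  httpGet:
    path: /ready         # placeholder path
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5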