We use a single-tenant architecture for our instances. Each instance runs three Django-related processes, i.e. the Django app itself, a Celery worker, and Celery beat, plus a few other components that don't interact with the database. We deploy cloudsql-proxy as a sidecar for these Django containers, which run as Pods in Google Kubernetes Engine. We are using Cloud SQL (Postgres 9.6) from Google, and it has a public IP address.
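For context, Django reaches Postgres through the proxy on localhost rather than the public IP; the settings look roughly like this (the database name, user, and port below are placeholders, not our exact values):

```python
# settings.py (sketch) -- the cloudsql-proxy sidecar listens inside the Pod,
# so Django connects to 127.0.0.1 and never dials the public IP itself.
# All names and the port are illustrative placeholders.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app_db",                              # placeholder database name
        "USER": "app_user",                            # placeholder user
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": "127.0.0.1",                           # the proxy sidecar in the same Pod
        "PORT": "5432",                                # port the proxy listens on
    }
}
```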
The problem is that we are getting OperationalErrors on the Django side, i.e.
OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
and when we check the Pod's logs at the time the OperationalError occurred, we see the following error from the cloudsql-proxy container:
couldn't connect to db_instance: dial tcp our_db_instance_public_ip:3307: connect: connection timed out
It is not that the connection to the database never works. It works most of the time, but sometimes it throws the above errors, which is a pain because we run Celery tasks every other minute and they fail because of this. Sometimes this occurs while an end user is interacting with our application and their requests fail.
Our application isn't under very high load. We set the database's maximum connections to 1000, and the peak number of connections is around 35 (summed across all instances). I checked the database's stats and it seems pretty happy, i.e. CPU utilization almost never goes above 50%, disk is 30% used, and memory usage is around 50%.
I can provide more details if needed. Would appreciate any help!
Related
I am having a problem with my server handling a large volume of concurrent users signing on and operating at the same time. Our business case requires the user base to log in at essentially the same time (a 1-minute window) and perform various operations in our application. Once the server goes past 1000 concurrent users, the website starts loading very slowly and constantly returns 502 errors.
We have reviewed the server metrics (CPU, RAM, data traffic utilization) in the cloud console and most resources are operating at below 10%. Scaling up the server doesn't help.
While the website is constantly returning 502 errors or responding after a long time, direct database queries and SSH connections work fine. We have therefore concluded that the issue is primarily about the number of concurrent requests the server is handling, due to some Nginx or Gunicorn configuration we may have set up incorrectly.
Please advise on any possible incorrect configuration (or any other solution) to this issue:
Server info:
Server specs - AWS c4.8xlarge (CPU and RAM)
Web Server - Nginx
Backend Server - Gunicorn
(Nginx conf and Gunicorn conf file were attached as images.)
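Since the attached configs are only available as images, here is a rough sketch of the shape of the Gunicorn configuration in question; every value below is a placeholder, not the actual file:

```python
# gunicorn.conf.py (sketch) -- illustrative values only, standing in for the
# configuration file that was attached as an image above.
import multiprocessing

bind = "127.0.0.1:8000"                        # address Nginx proxies to
workers = multiprocessing.cpu_count() * 2 + 1  # a common starting point
worker_class = "gthread"                       # threaded workers for I/O-bound views
threads = 4                                    # threads per worker
timeout = 60                                   # seconds before a hung worker is killed
backlog = 2048                                 # size of the pending-connection queue
```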
I have a working Django application that is running locally using an sqlite3 database without problem. However, when I change the Django database settings to use my external AWS RDS database all my pages start taking upwards of 40 seconds to load. I have checked my AWS metrics and my instance is not even close to being fully utilized. When I make a request to a view with no database read/write operations I also get the same problem. My activity monitor shows my local CPU spiking with each request. It shows a process named 'WindowsServer' using most of the CPU during each request.
I am aware that more latency is expected when using a remote database, but I don't think it should result in 40-second page loads. What other problems could be causing this behaviour?
(Screenshots attached: AWS database monitoring, local machine activity monitor.)
So your computer is connecting to the server in Amazon over the internet, and that's where the latency problem comes from. Production servers should be in the same place as the DB servers (or at least have a very, very good connection, so the latency is as low as possible).
--edit--
So we need more details. What is your ISP? What are your connection properties, uplink and downlink? What are your pings to the AWS servers?
I'm hosting a single-node Couchbase cluster in GCP and a Flask backend on an OpenShift cluster, which serves an Angular frontend. The problem is that when my Angular app calls a POST endpoint in Flask, it sometimes takes too long to connect to the VM (Couchbase), and Flask ends up returning a "504 Gateway Time-out". But this happens only sometimes; other times it works at proper speed, and I'm not able to troubleshoot it. The total data size is less than 100M and everything is 100% memory-resident in Couchbase, so I guess this is not a problem with Couchbase itself, just connection latency to GCP.
My guess is that when your Flask backend connects to your VM for the first time, it takes longer than usual because it needs to establish the connection, authenticate, and possibly do other things depending on your use case.
This is a common problem when hosting your app on App Engine or something similar, and the solution there is to use "warm-up requests". This basically spins up the whole connection (and, in App Engine's case, the instance) and makes a test connection, so that when the real request arrives, everything is already set up.
So I suggest you check how warm-up requests work and configure something similar between your Flask app and the VM: basically a route in Flask whose only purpose is to establish a test connection with a test payload. This way the next connection will be up to speed, with no 504 errors.
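A minimal sketch of such a warm-up route, assuming Flask; `probe_couchbase` below is a placeholder for whatever cheap call (a single key read, a ping) your existing Couchbase client can make:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def probe_couchbase():
    """Placeholder: wire this to the app's existing Couchbase client,
    e.g. a read of a known key or the SDK's ping/diagnostics call."""
    raise NotImplementedError("replace with a cheap call against the real cluster")

@app.route("/_warmup")
def warmup():
    # Hitting this route right after deploy (or periodically) pays the
    # connection/authentication cost before real user traffic arrives.
    try:
        probe_couchbase()
    except Exception as exc:
        app.logger.warning("warm-up probe failed: %s", exc)
        return jsonify(status="cold", detail=str(exc)), 503
    return jsonify(status="warm"), 200
```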
Try clearing the cache of the load balancer in the GCP console.
I already faced the same kind of issue and resolved it using this technique.
Since March 31st I've been getting the following error in Google Cloud SQL:
Got an error reading communication packets.
I have been using Google Cloud SQL for 2 years, but have never faced such a problem before, so I'm quite worried about it.
This is the detailed error message:
textPayload: "2019-04-29T17:21:26.007574Z 203385 [Note] Aborted connection 203385 to db: {db_name} user: {db_username} host: 'cloudsqlproxy~{private ip}' (Got an error reading communication packets)"
While it is true that this error message often occurs after a maintenance period, it isn't necessarily a cause for concern, as this is known behavior of MySQL.
Possible explanations for why this issue is happening are:
- A large increase in connection requests to the instance, with the number of active connections increasing over a short period of time.
- Freezing / unavailability of the instance caused by a burst of connections in a very short time interval. This freezing is always observed together with an increase in connection requests: the burst overloads the instance, which is then unable to respond to further connection requests until the number of connections decreases or the instance stabilizes.
- The server was too busy to accept new connections.
- High rates of previous connections that were not closed correctly.
- The client terminated the connection abnormally.
- The readTimeout setting in the MySQL driver being set too low.
In an excerpt from the documentation, it is stated that: "There are many reasons why a connection attempt might not succeed. Network communication is never guaranteed, and the database might be temporarily unable to respond. Make sure your application handles broken or unsuccessful connections gracefully."
- A low Cloud SQL Proxy version can also be the reason for such incidents; upgrading to the latest version (v1.23.0) can be a troubleshooting step.
- The IP you are trying to connect from may not be added to the Authorized Networks of the Cloud SQL instance.
Some possible workarounds for this issue, depending on which case applies to you, are the following:
In the case that the issue is related to a high load, you could retry the connection using exponential backoff to avoid sending too many simultaneous connection requests. The best practice here is to exponentially back off your connection requests and add randomized backoff to avoid throttling and potentially overloading the instance. As a way to mitigate this issue in the future, it is recommended that connection requests be spaced out to prevent overloading. Depending on how you are connecting to Cloud SQL, exponential backoffs may already be in use by default with certain ORM packages.
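A minimal sketch of such a retry loop on the application side (the `connect` callable and the limits are placeholders):

```python
import random
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry `connect` with exponential backoff plus random jitter.
    `connect` is any zero-argument callable that opens a connection and
    raises on failure -- a stand-in for whichever driver or ORM is in use."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Cap the exponential delay and add full jitter so many clients
            # do not retry in lockstep and overload the instance again.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```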
If the issue could be related to an accumulation of long-running inactive connections, you can check whether this is your case by running show full processlist on your database and looking for connections with a high Time, or connections where Command is Sleep.
If this is your case, you have a few possible options:
If you are not using a connection pool, you could update the client application logic to properly close connections immediately at the end of an operation, or use a connection pool to limit connection lifetimes. In particular, it is ideal to manage the connection count with a connection pool: unused connections are recycled, and the number of simultaneous connection requests can be limited through the maximum pool size parameter.
If you are using a connection pool, you could return idle connections to the pool immediately at the end of an operation and set a shorter timeout by adjusting the wait_timeout or interactive_timeout flag values; for example, set the Cloud SQL wait_timeout flag to 600 seconds to force connections to be refreshed.
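As an illustration, with SQLAlchemy the pool size and connection lifetime can be capped roughly like this (the URL and numbers are placeholders, not recommendations for your workload):

```python
from sqlalchemy import create_engine

# Placeholder URL; the interesting part is the pool configuration.
engine = create_engine(
    "mysql+pymysql://app_user:secret@127.0.0.1:3306/app_db",
    pool_size=5,         # steady-state connections kept in the pool
    max_overflow=2,      # extra connections allowed under burst load
    pool_recycle=540,    # recycle connections before the server-side
                         # wait_timeout (600 s above) closes them
    pool_pre_ping=True,  # verify a connection is alive before handing it out
)
```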
To check network and port connectivity:
Step 1. Confirm TCP connectivity on port 3306 with tcptraceroute or netcat.
Step 2. If Step 1 succeeded, use the mysql client to check for timeouts/errors.
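If tcptraceroute or netcat are not available, a quick equivalent check can be scripted; the address below is a placeholder:

```python
import socket

def check_tcp(host, port=3306, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"connection to {host}:{port} failed: {exc}")
        return False

# Placeholder address; use the Cloud SQL instance IP (or the proxy address).
check_tcp("203.0.113.10", 3306)
```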
When the client might be terminating the connection abruptly, you could check for the following. If the MySQL client or mysqld server is receiving a packet bigger than max_allowed_packet bytes, or the client is receiving a "packet too large" message, you could either send smaller packets or increase the max_allowed_packet flag value on both client and server. If there are transactions that are not being properly committed (using both "begin" and "commit"), the client application logic needs to be updated to properly commit the transaction.
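As an example of the commit point, with a DB-API style client the transaction can be committed (or rolled back) explicitly; the driver, credentials, and table below are placeholders:

```python
import pymysql  # placeholder driver; any DB-API 2.0 connector works similarly

def record_event(payload):
    conn = pymysql.connect(host="127.0.0.1", user="app_user",
                           password="secret", database="app_db")
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO events (payload) VALUES (%s)", (payload,))
        conn.commit()       # explicit commit so the transaction is not left open
    except Exception:
        conn.rollback()     # roll back rather than leaving a dangling transaction
        raise
    finally:
        conn.close()        # always release the connection
```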
There are several utilities that I think will be helpful here; if you can, install the mtr and tcpdump utilities to monitor the packets during these connection-increasing events. It is strongly recommended to enable general_log in the database flags. Another suggestion is to also enable the slow_query database flag and output it to a file. Also have a look at this GitHub issue comment and go through the list of additional solutions proposed for this issue here.
This error message indicates a connection issue, either because your application doesn't terminate connections properly or because of a network issue.
As suggested in these troubleshooting steps for MySQL or PostgreSQL instances from the GCP docs, you can start debugging by checking that you follow best practices for managing database connections.
I am using Datastax's c++ driver version 2.8.0 for Apache Cassandra inside a kubernetes application. Cassandra is deployed as a 3 node cluster via this Helm chart.
The chart leverages kubernetes' headless services to make the Cassandra endpoints available, so there is an entry in the kubernetes DNS for those endpoints.
I have a c++ app running in a kubernetes pod that interacts with Cassandra and connects using that DNS entry to resolve the endpoints. The application uses a single connection object to Cassandra, following the driver usage guidelines. The connection is initialized at the beginning of the program, and failure to initialize the connection or to execute a query later on will fail the program.
Everything is working fine, but Cassandra nodes/pods may eventually go down for some reason. When that happens, they're respawned but get reassigned a different IP. The c++ driver seems to be able to pick up the new endpoints from the DNS without any additional code. However, in this situation the connection is not closed on the client side, and it looks like the previous endpoints remain in the connection pool at some level. This leads to a series of log events similar to the following:
1531920921.161 [WARN] (src/pool.cpp:420:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host XXX.XXX.XXX.XX because of the following error: Connection timeout
and
1531920921.894 [WARN] (src/pool.cpp:420:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host XXX.XXX.XXX.XX because of the following error: Connect error 'host is unreachable'
which pop up every [reconnect timeout]. The more IP reassignments, the more log messages, which as you can guess can add up to a pretty large number for long-lived applications.
Is there some feature of the driver's API that allows dealing with this? Or, more generally, a good/recommended way to handle it on the client side? One option, external to the driver, could be to reset the connection within the client code, but (although I may be missing something) I fail to see a way to "catch" such events: they only show up in the logs.