Cherrypy - random requests hanging then server becomes unresponsive - python-2.7

We have a server built on the following:
CherryPy 3.2.2 for Windows
Python 2.7.1
Mako 0.3.6
SQLAlchemy 0.7.2
pyodbc 2.1.8
various other minor components
It is running on the following platform:
Windows Server 2008
Microsoft SQL Server 2008 Web
Hosted VM
xampp Apache gateway
Over the past few months we have had several instances where the cherrypy server has become unresponsive. The first symptom is that ajax requests time out, then reloading the page in the browser times out, and eventually Apache returns a 502 because cherrypy will no longer accept any connections. Restarting the python service resolves the problem. Sometimes it will go from timeouts to 502 within 10 minutes; other times it can just keep timing out for over half an hour before we realise and have to restart.
The server can run for weeks with no issues, but then some days the problem can occur 5 times within a few hours.
We have implemented some extra logging of all requests to identify patterns. Firstly it does not appear to be triggered by any one specific type of request, time of day, load, user, etc. However we do occasionally get SQL deadlock errors, and sometimes we get the cherrypy request timeout errors. The SQL deadlocks do not appear to be related as the request cleans up and they do not correlate with the time the server freezes. The request timeouts did occur at a similar time, but I have only seen them a couple of times, they do not happen every time.
Within each entry point of each module we've added the following logging:
util.write_to_debuglog("ajax", "START", _path)
# Call AJAX target providing the session object
_args.update(session=session)
result=target(**_args)
util.write_to_debuglog("ajax", "FINISH", _path)
return result
Within the debug log we also print out the session ID and user name so I can see exactly where the requests are coming from.
Yesterday when it went down, this is what the debug log showed (working backwards):
I restarted the server at 11:59:08
Previous log entry was 11:23:11 with START ajax get_orders User1 (no corresponding FINISH)
Between 11:23:04 and 11:22:33 were 9 pairs of successful START and FINISH requests by different users and different functions
At 11:22:31 was a START ajax get_orders User2 (no FINISH)
11:22:27 to 11:22:23 4 pairs of START and FINISH, different users and functions
11:22:22 a START ajax do_substitutions User3 (no FINISH)
11:22:19 to 11:15:52 299 pairs of START and FINISH, different users (including User1, User2, and User3) and different functions (including those which hung)
11:15:52 a START ajax get_filling_pack User4 (no FINISH)
11:15:51 to 11:15:45 13 pairs of START and FINISH
11:15:45 a START ajax echo User5 (no FINISH)
11:15:43 to 11:14:56 63 pairs of START and FINISH
11:14:56 a START ajax echo User6 (no FINISH)
11:14:56 to 11:13:40 104 pairs of START and FINISH
11:13:40 a START ajax update_supplies User7 (no FINISH)
11:13:38 to 11:13:17 36 pairs of START and FINISH
11:13:17 a START post set_status User8 (no FINISH)
Then normal activity back to midnight when server scheduled restart
Between 11:23:11 and 11:59:08, if you tried to access the server the browser would eventually time out, whereas on other occasions it has eventually reached a point where you get an immediate 502 Bad Gateway from Apache. This tells me that cherrypy is still accepting socket connections during this period, but the log shows that the requests are not coming through. Even the cherrypy access log shows nothing between these entries:
127.0.0.1 - - [05/Jan/2017:11:23:04] "GET /***/get_joblist HTTP/1.1" 200 18 "" "***/2.0"
127.0.0.1 - - [05/Jan/2017:11:59:09] "GET /***/upload_printer HTTP/1.1" 200 38 "" "***/2.0"
On this particular day there were two request timeouts in the error log file; however, this is not common.
[05/Jan/2017:11:22:04] Traceback (most recent call last):
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 169, in trap
return func(*args, **kwargs)
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 96, in __call__
return self.nextapp(environ, start_response)
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 379, in tail
return self.response_class(environ, start_response, self.cpapp)
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 222, in __init__
self.run()
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 320, in run
request.run(meth, path, qs, rproto, headers, rfile)
File "c:\python27\lib\site-packages\cherrypy\_cprequest.py", line 603, in run
raise cherrypy.TimeoutError()
TimeoutError
[05/Jan/2017:11:22:05] Traceback (most recent call last):
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 169, in trap
return func(*args, **kwargs)
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 96, in __call__
return self.nextapp(environ, start_response)
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 379, in tail
return self.response_class(environ, start_response, self.cpapp)
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 222, in __init__
self.run()
File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 320, in run
request.run(meth, path, qs, rproto, headers, rfile)
File "c:\python27\lib\site-packages\cherrypy\_cprequest.py", line 603, in run
raise cherrypy.TimeoutError()
TimeoutError
We are not changing the response timeout, so this should be default of 5 minutes. Based on this assumption the errors do not correlate to any of the requests that had hung. A timeout at 11:22:04 would imply a request at 11:18:04, but all requests at that time were successful. Whereas the 8 requests that did hang were much older than 5 minutes and never timed out.
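For context, the timeout in question is CherryPy's response.timeout config key, which is what raises the cherrypy.TimeoutError seen in the tracebacks above. A minimal sketch of making the default explicit (300 seconds is the stock default, not a recommendation; this is an illustration, not our actual config):
import cherrypy

cherrypy.config.update({
    # Seconds a response may run before CherryPy raises TimeoutError;
    # 300 (5 minutes) is the default assumed in the analysis above.
    "response.timeout": 300,
})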
Why would one type of request hang for one user but continue to work successfully for other users?
Why would they hang at all if they have been working for days or weeks before?
Why isn't the request timeout cleaning up all of these?
Why is the server reaching a point where it won't take any requests at all? Surely 8 concurrent requests isn't the server maximum?
Why is the server still accepting socket connections but not processing requests?
Any suggestions for how I can diagnose or resolve these issues would be greatly appreciated.
Thanks,
Patrick.

I believe we finally tracked this down.
Under a very specific set of conditions it was executing a query on a separate database cursor called from within a transaction (i.e. the second query wasn't in the transaction). So one connection was holding a table lock in SQL and the second connection was waiting for that lock, but both originated from the same Python thread. There was also no timeout set on the database connections (the default is infinite), so it would sit in its own deadlock forever. The system would keep working as long as nobody queried the tables held by the transaction. Eventually, when other users/threads tried to access the same locked area of the database, they would also be made to wait forever.
I don't think SQL saw this as a deadlock because it expected one connection to finish the transaction and release the lock. I don't think CherryPy could terminate the thread because the SQL connection had control.
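A hedged sketch of the anti-pattern, in SQLAlchemy/pyodbc terms (table names, DSN, and connection handling are made up for illustration, not the application's real code): two connections used from the same Python thread, where the second waits on a lock the first holds inside an open transaction, so the thread blocks on itself and the commit never happens.
from sqlalchemy import create_engine, text

# No query/connection timeout configured, so waits can last forever.
engine = create_engine("mssql+pyodbc://some_dsn")  # placeholder DSN

conn1 = engine.connect()
trans = conn1.begin()
# Takes a lock inside the open transaction.
conn1.execute(text("UPDATE orders SET status = 'picked' WHERE id = 1"))

# BUG: a second, independent connection queries the locked table from the
# same thread. SQL Server waits for conn1 to commit, but conn1 can only
# commit after this call returns, so the thread deadlocks with itself.
conn2 = engine.connect()
rows = conn2.execute(text("SELECT * FROM orders WHERE id = 1")).fetchall()

trans.commit()  # never reached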

Related

gUnicorn/Flask/GAE - two processes started for processing the same http request

I have an app on Google AppEngine (Python39 standard env) running on gUnicorn and Flask. I'm making a request to the server from a client-side app for a long-running operation and seeing that the request is processed twice. The second process (worker) started a while (an hour and a half) after the first one had been running.
I'm not sure whether it's related to gUnicorn specifically or to GAE.
The server controller has logging at the beginning :
@app.route("/api/campaign/generate", methods=["GET"])
def campaign_generate():
    logging.info('Entering campaign_generate')
    # some very long processing here
The controller is called by clicking a button in the UI app. I checked the Network tab in the browser's DevTools and confirmed that only one request was fired. And I can see that there's only one request in the server logs at the moment the workers were executing (more on this below).
The whole app.yaml is like this:
runtime: python39
default_expiration: 0
instance_class: B2
basic_scaling:
  max_instances: 1
entrypoint: gunicorn -b :$PORT server.server:app --timeout 0 --workers 2
So I have 2 workers with infinite timeouts, basic scaling with max instances = 1.
I expect that while the app is processing one long-running request, the other worker is available to serve requests.
I don't expect the second worker to be used to process the same request; that would be nonsense (unless the user starts another operation from another browser).
Thanks to timeout=0 I expect gUnicorn to wait indefinitely until the controller finishes. The only thing that could interfere is GAE's timeout, but thanks to basic scaling that is 24 hours. So I expect the app to be able to process requests for several hours without a problem.
But what I'm seeing instead is that after the processing the request for a while another execution is started. Here's simplified logs I see in Cloud Logging:
13:00:58 GET /api/campaign/generate
13:00:59 Entering campaign_generate
..skipped
13:39:13 Starting generating zip-archive (it's something that takes a while)
14:25:49 Entering campaign_generate
So at 14:25, an hour and 25 minutes after the original request came in, another processing of the same request started!
And now there are two processings of the same request running in parallel.
Needless to say, this increases memory pressure and doubles the execution time.
When the first "worker" finished its processing (at 14:29:28 in our example), its result was not returned to the client. It looks like gUnicorn or GAE simply abandoned the first request, and the client has to wait till the second worker finishes processing.
Why is it happening?
And how can I fix it?
Regarding the HTTP request records in the log:
I saw only one request in Cloud Logging (the first one) while the processing was active, and even after the controller was called a second time ('Entering campaign_generate' appeared in the logs) there was no new GET request in the logs. But after everything completed (the second processing actually returned a response), a mysterious second GET request appeared. So technically, from the server logs' (Cloud Logging) point of view, it looks like there were two subsequent requests from the client. But there weren't! There was only one, and I can see it in the browser's DevTools.
Those two requests have different traceId and requestId http headers.
It's very hard to understand what's going on, I tried running the app locally (on the same data) but it works as intended.

Django production (using gunicorn) - internal server error (no request) until 10-20 requests have been made

I have a production system that has been running for 2+ years now, with regular (daily/weekly) updates. Around 2 months ago, a strange behaviour started occurring every time I restart Gunicorn: for the first 10-20 requests made to the web server, I get an internal server error. The errors (when the system is switched to debug=True) all relate to the request being None.
The login (allauth) page works a treat, but once I have entered my account details (or any other), I get an internal server error on the following URL. If I reload, it loads A-OK. If I browse the site, I get a semi-random mixture of pages that either load or return an internal server error. After around 10-20 page load attempts, everything starts working 100% A-OK. No issues.
I can then log in as any account, and every page works. The above issues on restarting the web server also occur with any other account login.
It's as if something is failing in the middleware, or there is some sort of internal timeout before the request details can be stored. But the database server is fully up and running, with no load issues at all.
Any thoughts on the issue or how I could go about fixing it? Before this I could update the production servers without any downtime; now this is causing around 4-5 minutes of downtime any time I want to update code.
Some additional info: there is no issue when running locally with runserver etc.
Thanks in advance
Thanks for your time and apologies for not including more info on the initial post.
It ended up that I was using --preload in our Gunicorn config. This has worked a treat for almost 3 years in production, but it seems this was causing every 'worker' to need at least 2 requests before it would start processing them. Weird eh?
I needed to wait until this weekend, as we had some planned downtime and I could isolate production, try it with Debug etc.... None of this helped! But then I started varying the number of workers from 1 to 13, and the number of requests it took to get a proper reply (rather than an internal server error) was two times the number of workers.
So then I just happened to try removing the --preload option, and everything worked exactly as it had before about 3 months ago.
Memory increase is not an issue - so I will move forwards with this.
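For reference, a sketch of the change expressed as a gunicorn.conf.py (the file layout is an assumption for readability; the real deployment used the command-line flag, and only preload_app and the worker count come from the description above):
# gunicorn.conf.py (sketch)
workers = 13
# preload_app = True  # removed: --preload imports the app in the master
#                     # process before forking, so anything created at import
#                     # time is shared by every forked worker.
preload_app = False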
The stack trace I got whilst in debug was:
OperationalError at /company/person_overview/
SSL SYSCALL error: EOF detected
Request Method: GET
Request URL: https://www.mowida.com/company/person_overview/
Django Version: 3.2.15
Exception Type: OperationalError
Exception Value:
SSL SYSCALL error: EOF detected
Exception Location: /home/timothy/.pyenv/versions/3.9.13/envs/mwp/lib/python3.9/site-packages/django/db/backends/utils.py, line 84, in _execute
Python Executable: /home/timothy/.pyenv/versions/3.9.13/envs/mwp/bin/python3.9
Python Version: 3.9.13
Python Path:
['/home/timothy/.virtualenvs/mwp/lib/python3.9/site-packages',
'/mwp/mwp',
'/home/timothy/.pyenv/versions/3.9.13/envs/mwp/bin',
'/home/timothy/.pyenv/versions/3.9.13/lib/python39.zip',
'/home/timothy/.pyenv/versions/3.9.13/lib/python3.9',
'/home/timothy/.pyenv/versions/3.9.13/lib/python3.9/lib-dynload',
'/home/timothy/.pyenv/versions/3.9.13/envs/mwp/lib/python3.9/site-packages',
'/mwp/mwp/mwp']
Server time: Sat, 13 Aug 2022 22:52:37 +0200
Traceback:
/home/timothy/.pyenv/versions/3.9.13/envs/mwp/lib/python3.9/site-packages/django/db/backends/utils.py, line 84, in _execute
return self.cursor.execute(sql, params) …
The above exception (SSL SYSCALL error: EOF detected) was the direct cause of the following exception:
/home/timothy/.pyenv/versions/3.9.13/envs/mwp/lib/python3.9/site-packages/django/core/handlers/exception.py, line 47, in inner
response = get_response(request) …
/home/timothy/.pyenv/versions/3.9.13/envs/mwp/lib/python3.9/site-packages/django/core/handlers/base.py, line 204, in _get_response
response = response.render()
I have included this to help anyone else who might get the same issue.
One last thought: why did this start happening around 3 months ago? Gunicorn has not been updated, and our config has not been updated/changed.
Thanks again for thoughts.

Django's infinite streaming response logs 500 in apache logs

I have a Django+Apache server, and there is a view with infinite streaming response
from json import dumps
from traceback import print_exc

from django.http import HttpResponse, StreamingHttpResponse

# data_stream and ERROR_WITH_CLASSNAME come from the application code
def my_view(request):
    try:
        return StreamingHttpResponse(map(
            lambda x: f"{dumps(x)}\n",
            data_stream(...)  # yields dicts forever every couple of seconds
        ))
    except Exception as e:
        print_exc()
        return HttpResponse(dumps({
            "success": False,
            "reason": ERROR_WITH_CLASSNAME.format(e.__class__.__name__)
        }), status=500, content_type="application/json")
When client closes the connection to the server, there is no cleanup to be done. data_stream will yield one more message which won't get delivered. No harm done if that message is yielded and not received as there are no side-effects. Overhead from processing that extra message is negligible on our end.
However, after that last message fails to deliver, Apache logs a 500 response code (for 100% of these requests). It's not getting caught by the except block, because print_exc doesn't get called (there are no entries in the error log), so I'm guessing this is Apache failing to deliver the response from Django and logging a 500 itself.
These 500 errors are triggering false positive alerts in our monitoring system and it's difficult to differentiate an error due to connection exit vs an error in the data_stream logic.
Can I override this behavior to log a different status code in the case of a client disconnect?
From what I understand about StreamingHttpResponse, any exceptions raised inside it are not propagated further. This has to do with how a WSGI server works. If you start handling an exception and steal control, the server will not be able to finish the HTTP response. So the error is handled by the server and printed in the terminal. If you attach a debugger and watch how the exception is handled, you will be able to find a line in wsgiref/handlers.py where your exception is absorbed and taken care of.
I think it is in this file: https://github.com/python/cpython/blob/main/Lib/wsgiref/handlers.py
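To illustrate why the view's own except block can never see these errors, here is a minimal, hypothetical sketch (not the project's code): the try/except only covers constructing the StreamingHttpResponse object, while the generator body runs later, as the WSGI server iterates the response.
from django.http import StreamingHttpResponse

def failing_stream():
    yield "first chunk\n"
    # Raised during iteration, after the view has already returned.
    raise RuntimeError("simulated error while streaming")

def my_view(request):
    try:
        # Returns immediately; nothing inside failing_stream() has run yet.
        return StreamingHttpResponse(failing_stream())
    except Exception:
        # Never reached for errors raised while the server iterates the body;
        # those surface in the WSGI/Apache layer instead.
        raise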

Oozie job expiring on Java action when writing to HDFS

I have an Oozie coordinator that runs a workflow every hour. The workflow is composed of two sequential actions: a shell action and a Java action. When I run the coordinator, the shell action seems to execute successfully; however, when it's time for the Java action, the Job Browser in Hue always shows:
There was a problem communicating with the server: Job application_<java-action-id> has expired.
When I click on the application_id, here's the snapshot:
This seems to point to views.py and api.py. When I looked into the server logs:
[23/Nov/2015 02:25:22 -0800] middleware INFO Processing exception: Job application_1448245438537_0010 has expired.: Traceback (most recent call last):
File "/usr/lib/hue/build/env/lib/python2.6/site-packages/Django-1.6.10-py2.6.egg/django/core/handlers/base.py", line 112, in get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/lib/hue/build/env/lib/python2.6/site-packages/Django-1.6.10-py2.6.egg/django/db/transaction.py", line 371, in inner
return func(*args, **kwargs)
File "/usr/lib/hue/apps/jobbrowser/src/jobbrowser/views.py", line 67, in decorate
raise PopupException(_('Job %s has expired.') % jobid, detail=_('Cannot be found on the History Server.'))
PopupException: Job application_1448245438537_0010 has expired.
The Java action consists of two parts: a REST API call and writing the parsed result to HDFS (via the Hadoop client library). Even though the Java action job is expiring / failing in the Job Browser, the write to HDFS was successful. Here's a snippet of the HDFS-writing part of the Java code.
FileSystem hdfs = FileSystem.get(new URI(hdfsUriPath), conf);
OutputStream os = hdfs.create(file);
BufferedWriter br = new BufferedWriter(new OutputStreamWriter(os, "UTF-8"));
...
br.write(toWriteToHDFS);
br.flush();
br.close();
hdfs.close();
When I run the workflow standalone, I get a 50-50 chance of success or expiration on the Java action part, but under the coordinator, all Java actions are expiring.
The YARN logs shows this:
Job commit failed: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:794)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1645)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1587)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:397)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:393)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:393)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:337)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.touchz(CommitterEventHandler.java:265)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:271)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
So it looks like there is a problem with closing the FileSystem at the end of my Java code (should I keep the FileSystem open?).
I'm using Cloudera Quickstart CDH 5.4.0 and Oozie 4.1.0
The problem is already solved. My Java action uses an instance (say, variable fs) of the org.apache.hadoop.fs.FileSystem class. At the end of the Java action I called fs.close(), which caused the problem in the next run of the Oozie job. So when I removed this line, everything went well again.

How can I prevent RuntimeError("Unable to create a new session key.")?

A client's Django application is intermittently (about twice a day) throwing RuntimeError("Unable to create a new session key."):
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/django/core/handlers/base.py", line 111, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python2.6/dist-packages/django/contrib/admin/views/decorators.py", line 17, in _checklogin
if request.user.is_active and request.user.is_staff:
File "/usr/local/lib/python2.6/dist-packages/django/contrib/auth/middleware.py", line 9, in __get__
request._cached_user = get_user(request)
File "/usr/local/lib/python2.6/dist-packages/django/contrib/auth/__init__.py", line 107, in get_user
user_id = request.session[SESSION_KEY]
File "/usr/local/lib/python2.6/dist-packages/django/contrib/sessions/backends/base.py", line 47, in __getitem__
return self._session[key]
File "/usr/local/lib/python2.6/dist-packages/django/contrib/sessions/backends/base.py", line 195, in _get_session
self._session_cache = self.load()
File "/usr/local/lib/python2.6/dist-packages/django/contrib/sessions/backends/cache.py", line 16, in load
self.create()
File "/usr/local/lib/python2.6/dist-packages/django/contrib/sessions/backends/cache.py", line 33, in create
raise RuntimeError("Unable to create a new session key.")
RuntimeError: Unable to create a new session key.
As you can see from the traceback, this happens deep in the bowels of django.contrib.sessions when using the cache session backend with the memcached cache backend.
A Django trac ticket (https://code.djangoproject.com/ticket/14093) suggests changing the session key hash from MD5 to UUID4, but that's no help -- the problem is the network. I've observed (with tcpdump) that this exception can occur when the TCP connection from app server to memcache server times out due to packet loss.
We have two app servers and one memcached (1.4.2) server, all running in Amazon EC2. During periods of high demand, I've observed one app server exchanging 75,000 packets/second with the memcache server. During this period of high demand, I observed one SYN packet for a new memcache connection get lost, resulting in a python-memcache connection timeout (before the kernel even had a chance to retransmit) and a RuntimeError.
I'm at a loss for how to solve this. I'd like to tune Linux's TCP retransmit timer lower than three seconds, but it's not tunable. Failing that, I'd like to have python-memcache retry a connection a couple times before giving up, but it won't. I see that pylibmc has configurable connect and retry behavior, but I haven't been able to find a combination of options that works around the packet loss.
Ideas?
UPDATE:
For people who see "Unable to create a new session key" all the time, that just means your memcache is not set up correctly. Some of the answers below discuss things to check (is the package installed? is the port correct?).
The problem we were having was intermittent — only one in several thousand requests would fail. I used tcpdump to show that this happens when one of the three packets in the TCP three-way handshake is lost (due to network congestion), resulting in python-memcache timing out and raising an exception, resulting in the "Unable to create a new session key" RuntimeError.
Packet loss in cloud provider networks may be unavoidable. Ideally, the Linux kernel would make the TCP initial retransmit timer configurable, but that did not appear to be the case at the time I was investigating this. That means the python-memcache library itself would need to have some kind of timeout-and-retry logic, which it did not have (at the time).
It looks like later versions of Django's cache backend have added a retry loop, which should avoid these kinds of intermittent failures, at the expense of requests occasionally taking a few seconds longer.
Just solved the same problem with apt-get install memcached. Maybe it's your case too.
Oh, sorry, this is not your case. I've just read the question with more attention. But I will leave my answer, since it's about this runtime error.
Looking at the python-memcached code on launchpad, you should be able to tweak dead_retry and retry_timeout. Another option may be to run a low memory, low connections instance of memcached on one or both of the app servers as a fallback when the primary memcached server is unreachable.
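A hedged sketch of what tweaking python-memcached directly might look like (the server address and values are placeholders, and parameter names/availability vary between python-memcached releases, so verify against the version you have installed):
import memcache

# dead_retry: seconds a server stays marked "dead" before being retried.
# socket_timeout: per-socket timeout in seconds (present in newer releases).
mc = memcache.Client(
    ["10.0.0.5:11211"],   # placeholder memcached server
    dead_retry=5,
    socket_timeout=3,
)
mc.set("probe", "ok")
print(mc.get("probe"))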
https://github.com/django/django/blob/master/django/contrib/sessions/backends/cache.py
def create(self):
    # Because a cache can fail silently (e.g. memcache), we don't know if
    # we are failing to create a new session because of a key collision or
    # because the cache is missing. So we try for a (large) number of times
    # and then raise an exception. That's the risk you shoulder if using
    # cache backing.
    for i in xrange(10000):
        self._session_key = self._get_new_session_key()
        try:
            self.save(must_create=True)
        except CreateError:
            continue
        self.modified = True
        return
    raise RuntimeError("Unable to create a new session key.")
you can monkey-patch django.contrib.sessions.backends.base.SessionBase._get_new_session_key to do a time.sleep(0.001).
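A minimal sketch of that monkey-patch (the 1 ms pause is the value suggested above; where you install the patch, e.g. in settings or an app's startup code, is up to you):
import time

from django.contrib.sessions.backends.base import SessionBase

_orig_get_new_session_key = SessionBase._get_new_session_key

def _get_new_session_key_with_pause(self):
    # Back off briefly before asking the cache for another key.
    time.sleep(0.001)
    return _orig_get_new_session_key(self)

SessionBase._get_new_session_key = _get_new_session_key_with_pause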
you could also check your entropy:
here's the command:
cat /proc/sys/kernel/random/entropy_avail
I was getting this error running a local, development version of a Django project, because it was periodically having trouble connecting to a non-local cache. I realized that I could change my session backend to a file-based session to address the issue.
In the settings file for this local, development version of Django, I simply set the following value:
SESSION_ENGINE = 'django.contrib.sessions.backends.file'
This is not the solution I would use in a production environment, and not the solution I would suggest to the original poster, but it took me a few minutes to figure out what the issue was and this is one of the only results that appeared when I Googled, so I figured I'd post here possibly to help out others with a similar issue.
In the file /etc/sysconfig/memcached, change
-l 127.0.0.1 to -l 0.0.0.0
On some machines it's in the file /etc/memcached.conf.
I faced the same issue. Check the ports configured in Django and in memcached; they may be different.
You can change the memcached port: edit /etc/memcached.conf (vim /etc/memcached.conf), find 'Default connection port is', change it to what you need, and restart the memcached service.
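For reference, a hedged sketch of the Django side of that check (illustrative settings.py values, assuming the python-memcached cache backend): the port in LOCATION must match the port memcached actually listens on.
# settings.py (sketch)
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",  # must match memcached's port setting
    }
}
SESSION_ENGINE = "django.contrib.sessions.backends.cache"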
This is likely a CPU shortage. While a request is being completed, watching /var/log/apache/error.log could be the best place for monitoring; also run htop to monitor your CPU while you are working with the web console.
You should increase the number of CPU cores.