Storing and accessing a request/response-scoped object - Django

I need to store a created/open LDAP connection, so multiple models, views and so on can reuse a single connection rather than creating a new one each time. This connection should be open when first required during a request and closed when sending a response (done generating the page). The connection should not be shared between different requests/responses.
What is the way to do it? Where to store the connection and how to ensure it is eventually closed?
A bit more info: I use LDAP as an additional information source. It contains details I cannot store in the database (for redundancy/consistency reasons), e.g. MS Exchange mailing groups. I might need LDAP data at multiple points, and different objects/instances should be able to access it during response generation.

One way to store the connection so that it can be shared across your components is to use thread-local storage.
For example, in myldap.py:
import threading

_local = threading.local()

def get_ldap_connection():
    if getattr(_local, 'ldap_connection', None) is None:
        _local.ldap_connection = create_ldap_connection()
    return _local.ldap_connection

def close_ldap_connection():
    if getattr(_local, 'ldap_connection', None) is not None:
        # assumes the connection object exposes unbind(); substitute
        # whatever close/unbind call your LDAP library provides
        _local.ldap_connection.unbind()
        _local.ldap_connection = None
So the first time myldap.get_ldap_connection is called from a specific thread it will open the connection. Subsequent calls from the same thread will reuse the connection.
To ensure the connection is closed when you have finished working, you could implement a Django middleware component. Amongst other things, middleware allows you to specify a hook that gets invoked after the view has returned its response object.
The middleware can then invoke myldap.close_ldap_connection() like this:
import myldap

class CloseLdapMiddleware(object):
    def process_response(self, request, response):
        myldap.close_ldap_connection()
        return response
Finally, you will need to add your middleware to MIDDLEWARE_CLASSES in settings.py:
MIDDLEWARE_CLASSES = [
    ...
    'path.to.CloseLdapMiddleware',
    ...
]
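Note that Django 1.10 replaced MIDDLEWARE_CLASSES with the MIDDLEWARE setting and callable middleware. A minimal sketch of the same idea in the new style, reusing the myldap module above:

import myldap

class CloseLdapMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        try:
            return self.get_response(request)
        finally:
            # runs after the view (and any later middleware) has produced the response
            myldap.close_ldap_connection()

With this style you list 'path.to.CloseLdapMiddleware' in MIDDLEWARE instead of MIDDLEWARE_CLASSES.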

Related

If I use Gunicorn multi-threaded mode with Flask would I have any concurrency issues [duplicate]

In my application, the state of a common object is changed by making requests, and the response depends on the state.
from flask import Flask, flash, render_template

app = Flask(__name__)

class SomeObj():
    def __init__(self, param):
        self.param = param

    def query(self):
        self.param += 1
        return self.param

global_obj = SomeObj(0)

@app.route('/')
def home():
    flash(global_obj.query())
    return render_template('index.html')
If I run this on my development server, I expect to get 1, 2, 3 and so on. If requests are made from 100 different clients simultaneously, can something go wrong? The expected result would be that the 100 different clients each see a unique number from 1 to 100. Or will something like this happen:
Client 1 queries. self.param is incremented by 1.
Before the return statement can be executed, the thread switches over to client 2. self.param is incremented again.
The thread switches back to client 1, and the client is returned the number 2, say.
Now the thread moves to client 2 and returns him/her the number 3.
Since there were only two clients, the expected results were 1 and 2, not 2 and 3. A number was skipped.
Will this actually happen as I scale up my application? What alternatives to a global variable should I look at?
You can't use global variables to hold this sort of data. Not only is it not thread safe, it's not process safe, and WSGI servers in production spawn multiple processes. Not only would your counts be wrong if you were using threads to handle requests, they would also vary depending on which process handled the request.
Use a data source outside of Flask to hold global data. A database, memcached, or redis are all appropriate separate storage areas, depending on your needs. If you need to load and access Python data, consider multiprocessing.Manager. You could also use the session for simple data that is per-user.
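For the counter example specifically, a store with atomic operations avoids the race entirely. A minimal sketch using Redis, assuming a local Redis server and the redis-py package:

import redis
from flask import Flask

app = Flask(__name__)
store = redis.Redis()  # connects to localhost:6379 by default

@app.route('/')
def home():
    # INCR is atomic on the Redis server, so concurrent
    # requests each receive a distinct value
    return str(store.incr('counter'))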
The development server may run in a single thread and process. You won't see the behavior described above, since each request is handled synchronously. Enable threads or processes and you will see it: app.run(threaded=True) or app.run(processes=10). (In Flask 1.0 the development server is threaded by default.)
Some WSGI servers may support gevent or another async worker. Global variables are still not thread safe because there's still no protection against most race conditions. You can still have a scenario where one worker gets a value, yields, another modifies it, yields, then the first worker also modifies it.
If you need to store some global data during a request, you may use Flask's g object. Another common case is some top-level object that manages database connections. The distinction for this type of "global" is that it's unique to each request, not used between requests, and there's something managing the set up and teardown of the resource.
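A minimal sketch of that per-request pattern, using g and a teardown hook; connect_to_db is a hypothetical factory standing in for your own:

from flask import Flask, g

app = Flask(__name__)

def get_db():
    # create the connection on first use within this request
    if 'db' not in g:
        g.db = connect_to_db()  # hypothetical: replace with your connection factory
    return g.db

@app.teardown_appcontext
def close_db(exc):
    # runs when the application context ends, i.e. after each request
    db = g.pop('db', None)
    if db is not None:
        db.close()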
This is not really an answer to the thread safety of globals, but I think it is important to mention sessions here.
You are looking for a way to store client-specific data: every connection should have access to its own pool of data, in a thread-safe way.
This is possible with server-side sessions, and they are available in a very neat flask plugin: https://pythonhosted.org/Flask-Session/
If you set up sessions, a session variable is available in all your routes and it behaves like a dictionary. The data stored in this dictionary is individual for each connecting client.
Here is a short demo:
from flask import Flask, session
from flask_session import Session

app = Flask(__name__)
# Check Configuration section for more details
SESSION_TYPE = 'filesystem'
app.config.from_object(__name__)
Session(app)

@app.route('/')
def reset():
    session["counter"] = 0
    return "counter was reset"

@app.route('/inc')
def routeA():
    if "counter" not in session:
        session["counter"] = 0
    session["counter"] += 1
    return "counter is {}".format(session["counter"])

@app.route('/dec')
def routeB():
    if "counter" not in session:
        session["counter"] = 0
    session["counter"] -= 1
    return "counter is {}".format(session["counter"])

if __name__ == '__main__':
    app.run()
After pip install Flask-Session, you should be able to run this. Try accessing it from different browsers, you'll see that the counter is not shared between them.
Another example of a data source external to requests is a cache, such as what's provided by Flask-Caching or another extension.
Create a file common.py and place in it the following:
from flask_caching import Cache
# Instantiate the cache
cache = Cache()
In the file where your flask app is created, register your cache with the following code:
# Import cache
from pathlib import Path

from flask import Flask

from common import cache

# ...
app = Flask(__name__)
cache.init_app(app=app, config={"CACHE_TYPE": "filesystem", "CACHE_DIR": Path("/tmp")})
Now use throughout your application by importing the cache and executing as follows:
# Import cache
from common import cache
# store a value
cache.set("my_value", 1_000_000)
# Get a value
my_value = cache.get("my_value")
While fully accepting the previous upvoted answers, and discouraging the use of global variables for production and scalable Flask storage: for prototyping or really simple servers running under the Flask development server...
...
The Python built-in data types (I personally used and tested the global dict) are, per the Python documentation, thread safe for individual operations. They are not process safe.
Insertions, lookups, and reads in such a (server-global) dict will be OK from each (possibly concurrent) Flask session running under the development server.
When such a global dict is keyed with a unique Flask session key, it can be rather useful for server-side storage of session-specific data that otherwise would not fit into the cookie (max size 4 kB).
Of course, such a server-global dict should be carefully guarded against growing too large, since it lives in memory. Some sort of expiry of the 'old' key/value pairs can be coded into request processing.
Again, it is not recommended for production or scalable deployments, but it is possibly OK for local task-oriented servers where a separate database is too much for the given task.
...
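A minimal sketch of that pattern, with the same caveats as above (development server only); the expiry window and key names are illustrative:

import time
import uuid
from flask import Flask, session

app = Flask(__name__)
app.secret_key = 'dev-only'  # sessions need a secret key; illustrative value

_store = {}      # server-global dict: {session key: (timestamp, data)}
_MAX_AGE = 3600  # illustrative expiry window, in seconds

def _session_key():
    # give each client session a unique key, kept in the (small) cookie
    if 'store_key' not in session:
        session['store_key'] = uuid.uuid4().hex
    return session['store_key']

def put(data):
    _store[_session_key()] = (time.time(), data)
    # expire 'old' entries so the dict cannot grow without bound
    cutoff = time.time() - _MAX_AGE
    for key in [k for k, (ts, _) in _store.items() if ts < cutoff]:
        del _store[key]

def get():
    entry = _store.get(_session_key())
    return entry[1] if entry is not None else None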

Django: How to establish persistent connection to rabbitmq?

I am looking for a way to publish messages to a rabbitmq server from my django application. This is not for task offloading, so I don't want to use Celery. The purpose is to publish to the exchange using the django application and have a sister (non-django) application in the docker container consume from that queue.
This all seems very straightforward; however, I can't seem to publish to the exchange without establishing and closing a connection each time, even without explicitly calling for that to happen.
In an attempt to solve this, I have defined a class with a nested singleton class that maintains a connection to the rabbitmq server using Pika. The idea was that the nested singleton would be instantiated only once, declaring the connection at that time. Any time something is to be published to the queue, the singleton handles it.
import logging
import os

import pika

logger = logging.getLogger('django')

class PikaChannelSingleton:

    class __Singleton:
        channel = pika.adapters.blocking_connection.BlockingChannel

        def __init__(self):
            self.initialize_connection()

        def initialize_connection(self):
            logger.info('Attempting to establish RabbitMQ connection')
            credentials = pika.PlainCredentials(rmq_username, rmq_password)
            parameters = pika.ConnectionParameters(rmq_host, rmq_port, rmq_vhost, credentials, heartbeat=0)
            connection = pika.BlockingConnection(parameters)
            con_chan = connection.channel()
            con_chan.exchange_declare(exchange='xchng', exchange_type='topic', durable=True)
            self.channel = con_chan

        def send(self, routing_key, message):
            if self.channel.is_closed:
                PikaChannelSingleton.instance.initialize_connection()
            self.channel.basic_publish(exchange='xchng', routing_key=routing_key,
                                       body=message)

    instance = None

    def __init__(self, *args, **kwargs):
        if not PikaChannelSingleton.instance:
            logger.info('Creating channel singleton')
            PikaChannelSingleton.instance = PikaChannelSingleton.__Singleton()

    @staticmethod
    def send(routing_key, message):
        PikaChannelSingleton.instance.send(routing_key, message)

rmq_connection = PikaChannelSingleton()
I then import rmq_connection wherever it is needed in the Django application. Everything works in toy applications and in the Python REPL, but a new connection is established every time the send function is called in the Django application. The connection then immediately closes with the message 'client unexpectedly closed TCP connection'. The message does get published to the exchange correctly.
So I am sure there is something going on with Django and how it handles processes and such. The question still remains: how do I post numerous messages to a queue without re-establishing a connection each time?
If I understand correctly, connections cannot be kept alive like that in a single-threaded context. While your Django app continues executing, the AMQP client is not sending heartbeats on the channel, and the connection will die.
You could use SelectConnection instead of BlockingConnection, probably not easy in the context of Django.
A good compromise could be to simply collect messages in your singleton but only send them all at once with a BlockingConnection at the very end of your Django request.
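A minimal sketch of that compromise, assuming the rmq_* connection settings from the question; queue_message would be called from views, and flush_messages once at the end of the request (e.g. from middleware):

import pika

_pending = []  # messages collected during the current request

def queue_message(routing_key, message):
    _pending.append((routing_key, message))

def flush_messages():
    if not _pending:
        return
    credentials = pika.PlainCredentials(rmq_username, rmq_password)
    parameters = pika.ConnectionParameters(rmq_host, rmq_port, rmq_vhost, credentials)
    connection = pika.BlockingConnection(parameters)
    try:
        channel = connection.channel()
        for routing_key, message in _pending:
            channel.basic_publish(exchange='xchng', routing_key=routing_key, body=message)
    finally:
        _pending.clear()
        connection.close()

(A module-level list like this is per-process; under a multi-threaded server it would need to be thread-local, as in the first answer above.)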

PyMySQL with Django, Multithreaded application

We are trying to use PyMySQL (==0.7.11) in our Django (==1.11.4) environment, but we run into problems when multiple actions are performed at the same time (multiple requests sent to the same API function).
We get this error:
pymysql.err.InternalError: Packet sequence number wrong - got 6 expected 1
We are trying to delete records from the DB (sometimes massive numbers of requests come from multiple users).
Code:
def delete(self, delete_query):
    self.try_reconnect()
    return self._execute_query(delete_query)

def try_reconnect(self):
    if not self.is_connected:
        self.connection.ping(reconnect=True)

@property
def is_connected(self):
    try:
        self.connection.ping(reconnect=False)
        return True
    except Exception:
        return False

def _execute_query(self, query):
    with self.connection.cursor() as cursor:
        cursor.execute(query)
        self.connection.commit()
        last_row_id = cursor.lastrowid
    return last_row_id
I didn't think it necessary to point out that these functions are part of a DBHandler class, and that self.connection is initialized in a connect(self) method:
def connect(self):
    self.connection = pymysql.connect(...)
This connect function runs once at Django startup; we create a global instance (variable) of DBHandler for the whole project, and multiple files import it.
We use the delete function as our gateway for executing delete queries.
What are we doing wrong, and how can we fix it?
Found the problem:
PyMySQL is not thread safe when connections are shared the way we did (we shared the class instance between multiple files as a global instance, and the class holds only one connection). It is labeled as 1:
threadsafety = 1
According to PEP 249:
1 - Threads may share the module, but not connections.
One of the comments in a PyMySQL GitHub issue:
you need one pymysql.connect() for each process/thread. As far as I know that's the only way to fix it. PyMySQL is not thread safe, so the same connection can't be used across multiple threads.
Anyway, if you were thinking of using another Python package, MySQLdb, for your threaded application, note MySQLdb's message:
Don't share connections between threads. It's really not worth your effort or mine, and in the end, will probably hurt performance, since the MySQL server runs a separate thread for each connection. You can certainly do things like cache connections in a pool, and give those connections to one thread at a time. If you let two threads use a connection simultaneously, the MySQL client library will probably upchuck and die. You have been warned.
For threaded applications, try using a connection pool. This can be
done using the Pool module.
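A minimal sketch of the one-connection-per-thread rule using thread-local storage; the connection parameters are placeholders:

import threading

import pymysql

_local = threading.local()

def get_connection():
    # each thread lazily opens, then reuses, its own connection
    if getattr(_local, 'connection', None) is None:
        _local.connection = pymysql.connect(
            host='localhost',  # placeholder parameters
            user='user',
            password='password',
            database='db',
        )
    return _local.connection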
Eventually we managed to use the Django ORM, writing only to our specific table, with the model generated using inspectdb.

If you are using lighttpd to drive a Django based web application, does each call create a new Python interpreter instance?

I'd like to be able to share some object instances between requests for managing asynchronous event delivery, but this seems like something that won't work with an event based server like lighttpd. Is that the case? What's the best way to work around this if that is the case?
Of note:
This is not a standard web deployment. I'm trying to make this run on an embedded platform for local network only. So some typical deployment/scaling concerns are not really at play here and resources are at a premium.
FastCGI is already long-running, so getting access to a long-lived object should just be a matter of assigning the object to a module-level variable somewhere.
# yourapp/async_thingy.py
_long_lived_object = None

def get_long_lived_object():
    global _long_lived_object
    if _long_lived_object is None:
        _long_lived_object = create_the_long_lived_object()
    return _long_lived_object

# views
from .async_thingy import get_long_lived_object

def the_view(request):
    # do whatever
    long_lived_obj = get_long_lived_object()
    long_lived_obj.whatever()
    # the rest of the view - return your response, etc.
I'd start with something like this. There are other potential issues if you're using multiple Python processes, but given your resource constraints I'm assuming that's not the case.

Django multiprocessing and database connections

Background:
I'm working a project which uses Django with a Postgres database. We're also using mod_wsgi in case that matters, since some of my web searches have made mention of it. On web form submit, the Django view kicks off a job that will take a substantial amount of time (more than the user would want to wait), so we kick off the job via a system call in the background. The job that is now running needs to be able to read and write to the database. Because this job takes so long, we use multiprocessing to run parts of it in parallel.
Problem:
The top level script has a database connection, and when it spawns off child processes, it seems that the parent's connection is available to the children. Then there's an exception about how SET TRANSACTION ISOLATION LEVEL must be called before a query. Research has indicated that this is due to trying to use the same database connection in multiple processes. One thread I found suggested calling connection.close() at the start of the child processes so that Django will automatically create a new connection when it needs one, and therefore each child process will have a unique connection - i.e. not shared. This didn't work for me, as calling connection.close() in the child process caused the parent process to complain that the connection was lost.
Other Findings:
Some stuff I read seemed to indicate you can't really do this, and that multiprocessing, mod_wsgi, and Django don't play well together. That just seems hard to believe I guess.
Some suggested using celery, which might be a long term solution, but I am unable to get celery installed at this time, pending some approval processes, so not an option right now.
Found several references on SO and elsewhere about persistent database connections, which I believe to be a different problem.
Also found references to psycopg2.pool and pgpool and something about bouncer. Admittedly, I didn't understand most of what I was reading on those, but it certainly didn't jump out at me as being what I was looking for.
Current "Work-Around":
For now, I've reverted to just running things serially, and it works, but is slower than I'd like.
Any suggestions as to how I can use multiprocessing to run in parallel? Seems like if I could have the parent and two children all have independent connections to the database, things would be ok, but I can't seem to get that behavior.
Thanks, and sorry for the length!
Multiprocessing copies connection objects between processes because it forks processes, and therefore copies all the file descriptors of the parent process. That being said, a connection to the SQL server is just a file; you can see it in Linux under /proc//fd/.... Any open file is shared between forked processes. You can find more about forking here.
My solution was simply to close the db connection just before launching processes; each process recreates the connection itself when it needs one (tested in Django 1.4):
from multiprocessing import Process

from django import db
db.connections.close_all()

def db_worker():
    some_parallel_code()

Process(target=db_worker, args=()).start()
Pgbouncer/pgpool is not related to threads in the multiprocessing sense. It's rather a solution for not closing the connection on each request, i.e. speeding up connecting to Postgres under high load.
Update:
To completely remove problems with the database connection, simply move all database-related logic into db_worker. I wanted to pass a QuerySet as an argument, but a better idea is to simply pass a list of ids: see values_list('id', flat=True), and do not forget to cast it to a list! list(queryset) before passing it to db_worker. Thanks to that, we do not copy the model's database connection.
def db_worker(model_ids):
    obj = PartModelWorkerClass(model_ids)  # here you do Model.objects.filter(id__in=model_ids)
    obj.run()

model_ids = Model.objects.all().values_list('id', flat=True)
model_ids = list(model_ids)  # cast to list
process_count = 5
delta = (len(model_ids) // process_count) + 1

# do all the db stuff here ...
# here you can close the db connection
from django import db
db.connections.close_all()

for it in range(process_count):
    Process(target=db_worker, args=(model_ids[it * delta:(it + 1) * delta],)).start()
When using multiple databases, you should close all connections.
from django import db

for connection_name in db.connections.databases:
    db.connections[connection_name].close()
EDIT
Please use the same approach @lechup mentioned to close all connections (not sure since which Django version this method was added):
from django import db
db.connections.close_all()
For Python 3 and Django 1.9 this is what worked for me:
import multiprocessing

import django
django.setup()  # Must call setup

def db_worker():
    # Close the DB connections inherited from the parent process
    for name in django.db.connections.databases:
        django.db.connections[name].close()
    # Execute parallel code here

if __name__ == '__main__':
    multiprocessing.Process(target=db_worker).start()
Note that without the django.setup() I could not get this to work. I am guessing something needs to be initialized again for multiprocessing.
I had "closed connection" issues when running Django test cases sequentially. In addition to the tests, there is also another process intentionally modifying the database during test execution. This process is started in each test case setUp().
A simple fix was to inherit my test classes from TransactionTestCase instead of TestCase. This makes sure that the database was actually written, and the other process has an up-to-date view on the data.
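A minimal sketch of that switch; the names are illustrative:

from django.test import TransactionTestCase

class ConcurrentModificationTest(TransactionTestCase):
    def setUp(self):
        # start the external process that modifies the database here
        ...

    def test_shared_data(self):
        # writes are really committed, so the other process sees current data
        ...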
Another way around your issue is to initialise a new connection to the database inside the forked process using:
from django.db import connection
connection.connect()
(not a great solution, but a possible workaround)
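A minimal sketch of that workaround inside the forked worker process:

from multiprocessing import Process

from django.db import connection

def worker():
    connection.close()    # drop the descriptor inherited from the parent
    connection.connect()  # open a fresh connection for this process
    ...                   # ORM calls now use the new connection

Process(target=worker).start()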
If you can't use Celery, maybe you could implement your own queueing system: basically, add tasks to a task table and have a regular cron job that picks them up and processes them (via a management command).
Hey, I ran into this issue and was able to resolve it by doing the following (we are implementing a limited task system):
task.py
from django.db import connection

def as_task(fn):
    """ this is a decorator that handles task duties, like setting up loggers, reporting on status...etc """
    connection.close()  # this is where I kill the database connection VERY IMPORTANT
    # This will force Django to open a new unique connection, since on Linux at least
    # connections do not fare well when forked
    # ...etc
ScheduledJob.py
from django.db import connection

def run_task(request, job_id):
    """ Just a simple view that when hit with a specific job id kicks off said job """
    # your logic goes here
    # ...
    processor = multiprocessing.Queue()
    multiprocessing.Process(
        target=call_command,  # all of our tasks are set up as management commands in Django
        args=[
            job_info.management_command,
        ],
        kwargs={
            'web_processor': processor,
        }.items() + vars(options).items()).start()

    result = processor.get(timeout=10)  # wait to get a response on a successful init
    # Result is a tuple of [TRUE|FALSE, <ErrorMessage>]
    if not result[0]:
        raise Exception(result[1])
    else:
        # THE VERY VERY IMPORTANT PART HERE: up to this point we haven't touched
        # the db again, but now we absolutely have to call connection.close()
        connection.close()
        # we do some database accessing here to get the most recently updated job id in the database
Honestly, to prevent race conditions (with multiple simultaneous users) it would be best to call connection.close() as quickly as possible after you fork the process. There may still be a chance that another user somewhere down the line makes a request to the db before you have a chance to flush the database, though.
In all honesty, it would likely be safer and smarter to have your fork not call the command directly, but instead call a script on the operating system, so that the spawned task runs in its own Django shell!
If all you need is I/O parallelism and not processing parallelism, you can avoid this problem by switching your processes to threads. Replace
from multiprocessing import Process
with
from threading import Thread
The Thread object has the same interface as Process.
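A minimal sketch of the swap, reusing the db_worker pattern from the answers above:

from threading import Thread

def db_worker(model_ids):
    ...  # I/O-bound work: queries, API calls, file reads

# same call shape as Process(target=..., args=...).start()
t = Thread(target=db_worker, args=(model_ids,))
t.start()
t.join()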
If you're also using connection pooling, the following worked for us, forcibly closing the connections after forking. Closing them beforehand did not seem to help.
from django.db import connections
from django.db.utils import DEFAULT_DB_ALIAS
connections[DEFAULT_DB_ALIAS].dispose()
One possibility is to use multiprocessing's 'spawn' child process creation method, which will not copy Django's DB connection details to the child processes. The child processes need to bootstrap from scratch, but are free to create/close their own Django DB connections.
In calling code:
import multiprocessing
from myworker import work_one_item  # <-- Your worker method

...

# Uses connection A
list_of_items = django_db_call_one()

# 'spawn' starts new python processes
with multiprocessing.get_context('spawn').Pool() as pool:
    # work_one_item will create its own DB connection
    parallel_results = pool.map(work_one_item, list_of_items)

# Continues to use connection A
another_db_call(parallel_results)
In myworker.py:
import django   # <-\
django.setup()  # <-- needed if you'll make DB calls in worker

def work_one_item(item):
    try:
        # This will create a new DB connection
        return len(MyDjangoModel.objects.all())
    except Exception as ex:
        return ex
Note that if you're running the calling code inside a TestCase, mocks will not be propagated to the child processes (will need to re-apply them).
You could give more resources to Postgres; in Debian/Ubuntu you can edit:
nano /etc/postgresql/9.4/main/postgresql.conf
replacing 9.4 with your Postgres version.
Here are some useful lines to update with example values; the names speak for themselves:
max_connections=100
shared_buffers = 3000MB
temp_buffers = 800MB
effective_io_concurrency = 300
max_worker_processes = 80
Be careful not to boost these parameters too much, as it might lead to errors from Postgres trying to take more resources than are available. The examples above run fine on a Debian machine with 8 GB of RAM and 4 cores.
Override the Thread class and close all DB connections at the end of the thread. The code below works for me:
from threading import Thread

from django.db import connections

class MyThread(Thread):
    def run(self):
        super().run()
        connections.close_all()

def myasync(function):
    def decorator(*args, **kwargs):
        t = MyThread(target=function, args=args, kwargs=kwargs)
        t.daemon = True
        t.start()
    return decorator
When you need to call a function asynchronously:

@myasync
def async_function():
    ...