Django with fastcgi and threads

I have a Django app, which spawns a thread to communicate with another server, using Pyro.
Unfortunately, it seems like under fastcgi, multiple versions of this thread are fired off, and a dictionary that should be globally constant within my program, isn't. (Sometimes it has the values I expect, sometimes not)
What's the best way to ensure that there's one and only one copy of a dictionary in a django / fastcgi app?

I strongly recommend against relying on global anything in Django. The problem is that, just as you seem to be encountering, the type of deployment determines how (or whether) that global state is shared. Put in terms of abstraction layers, the deployment configuration is a completely different level of abstraction from the code, yet the code is relying on it to guarantee consistent global state.
I'm not experienced with fastcgi, but my understanding is that, like many other deployment mechanisms, it has a pre-forked mode and a threaded mode. In pre-forked mode you have separate processes, not threads, running your Python code, which spells a nightmare for shared global state.
Barring some fragile workaround (which ought to be possible, and which someone may suggest), the only persistence you can really rely on is the database and, to a lesser extent, whatever caching mechanism you choose. You could use the low-level cache API to store and retrieve keys and values.
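For example, a minimal sketch using the low-level cache API (the key name and values are illustrative; this assumes a shared cache backend such as memcached is configured, since the default locmem backend keeps a separate cache per process):

    from django.core.cache import cache

    # Write the dictionary once; every process/thread reads it back from the
    # shared cache backend instead of from module-level state.
    cache.set('pyro_settings', {'host': 'example.com', 'port': 7766}, timeout=None)

    # ...elsewhere, in any process or thread:
    settings_dict = cache.get('pyro_settings', {})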

Related

Synchronising across networked python apps

I am hoping this isn't too vague a question...
I am writing a Python app that will run as multiple instances of itself, to allow for load balancing and redundancy.
Each instance will need to be able to read and write to the backend database, which raises the issue of two 'gateways' trying to update the same item.
Can anyone recommend an approach (not primarily looking for code solution) to this?
Thank you in advance ;)
You have mainly two ways to avoid losing consistency in your data set:
either use some lock mechanisms between the python processes,
or use transactions when accessing the database.
Lock mechanisms must be based on inter-process communication. Depending on your underlying operating system, you may have access to shared memory (think of a test-and-set primitive built on shared memory for this purpose), semaphores and other tools. See Python's IPC facilities, for instance: https://docs.python.org/2/library/ipc.html
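For example, an advisory file lock works between unrelated processes on the same host (Unix-only; the lock path and the critical-section function are placeholders):

    import fcntl

    def update_shared_item():
        pass   # placeholder for your actual critical section

    with open('/tmp/myapp.lock', 'w') as lock_file:   # path is illustrative
        fcntl.flock(lock_file, fcntl.LOCK_EX)          # blocks until the lock is ours
        try:
            update_shared_item()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)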
Transactions depend on your database. For instance, with a traditional SQL database, you can often configure the behaviour according to your needs: auto-commit, optimistic concurrency, etc.
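And a minimal sketch of optimistic concurrency using a version column; sqlite3 is used here only to keep the example self-contained, and the table and column names are made up:

    import sqlite3

    def update_item(conn, item_id, new_value, expected_version):
        # The UPDATE only succeeds if nobody else bumped the version since we read the row.
        cur = conn.execute(
            "UPDATE items SET value = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_value, item_id, expected_version),
        )
        conn.commit()
        return cur.rowcount == 1   # False: another instance got there first

    conn = sqlite3.connect('example.db')
    conn.execute("CREATE TABLE IF NOT EXISTS items "
                 "(id INTEGER PRIMARY KEY, value TEXT, version INTEGER DEFAULT 0)")
    conn.execute("INSERT OR IGNORE INTO items (id, value) VALUES (1, 'initial')")
    conn.commit()

    if not update_item(conn, 1, 'updated', expected_version=0):
        print('Conflict: re-read the row and retry')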

Multiple DB inserts with Django: performance is not increased by parallel threads

I'm doing thousands and thousands of inserts to a PostgreSQL database with Python and Django (using the CLI, so no web server at all).
The objects to be inserted are already in memory, and I'm popping them one by one from a FIFO queue (using Python's native Queue module: https://docs.python.org/2/library/queue.html).
What I'm doing basically is:
    args1, args2 = queue.get()                     # args1/args2 are dicts of field values
    m1, _ = Model1.objects.get_or_create(**args1)
    Model2.objects.create(model1=m1, **args2)      # 'model1' field name is illustrative
I was thinking a way to do this faster would be to spawn a few more threads that can do this in parallel. To my surprise, the performance actually decreased slightly... I was expecting an almost linear improvement in relation to the number of threads, so I'm not sure what's going on.
Is there something database specific I'm missing, are there table locks that are blocking the threads when this is running?
Or does it have something to do with that each thread can only access a single database connection atomically during runtime?
I have standard configuration for PostgreSQL (9.3) and Django (1.7.7) installed with apt-get on Debian Jessie.
Also I tried with 4 threads, which is the same number of CPUs I have available on my box.
There are a few things going on here.
Firstly you are using very high level ORM methods (get_or_create, create). Those are generally not a good fit for bulk operations since methods like that tend to have a lot of overhead to provide a nice API and also do additional work to prevent users from shooting themselves in the foot too easily.
Secondly your careful use of a queue is very counterproductive in multiple ways:
Because Django runs in autocommit mode by default, each database operation is carried out in its own transaction. Since opening and committing a transaction is relatively expensive, this adds unnecessary overhead.
Inserting each object by itself also causes a lot more back-and-forth communication between the database and Django, which again adds overhead and slows things down.
Thirdly, the reason using multiple threads is even slower is that Python has a GIL (Global Interpreter Lock), which prevents multiple threads from executing Python bytecode at the same time. There is a lot of material on the web about the whys and hows of the GIL and what can be done, in which circumstances, to mitigate it. There is a nice summary of the GIL by Dave Beazley that should get you started if you're interested in learning more about it.
Additionally I'd generally recommend against doing large inserts from multiple threads in any language since - depending on your database and data model - this can also cause slowdowns inside the database due to possibly required locking.
Now there are many solutions to your problem, but I'd recommend starting with a simple one:
Django actually provides a handy low-level interface to create models in bulk, fittingly enough called bulk_create(). I'd suggest removing all that fancy queue and thread code and using this interface as directly as possible with the data you already have.
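A minimal sketch of that idea, reusing the queue and models from the question; note that bulk_create() does not deduplicate the way get_or_create() does, and wiring up the Model2 foreign keys would still need separate handling:

    # Drain the queue into a plain list, then write everything in one query.
    pending = []
    while not queue.empty():
        args1, args2 = queue.get()
        pending.append((args1, args2))

    Model1.objects.bulk_create([Model1(**args1) for args1, _ in pending])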
If this isn't sufficient for your case, a possible alternative would be to generate an INSERT INTO statement from the data and execute that directly against the database.
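Something along these lines, with made-up table and column names for illustration:

    from django.db import connection

    rows = [('alice', 30), ('bob', 25)]     # values extracted from your in-memory objects
    with connection.cursor() as cursor:
        cursor.executemany(
            "INSERT INTO myapp_model1 (name, age) VALUES (%s, %s)",
            rows,
        )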
If all you want to achieve is insertion, could you just use the save() method instead of get_or_create()? get_or_create() queries the database first, and if the table is large that lookup can become a bottleneck. And that is probably why having multiple parallel threads does not help.
The other possibility lies with the insertion itself. Postgres by default auto-commits on a per-insert (per-transaction) basis, and committing involves complex mechanisms under the hood. Long story short, you may try disabling auto-commit and see whether that helps in your particular case. A relevant article is here.
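On the Django side, the usual way to get that effect is to wrap the whole loop in a single transaction, for example (the 'model1' field name is illustrative, as above):

    from django.db import transaction

    with transaction.atomic():   # one commit for the whole batch instead of one per insert
        while not queue.empty():
            args1, args2 = queue.get()
            m1, _ = Model1.objects.get_or_create(**args1)
            Model2.objects.create(model1=m1, **args2)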

Is it possible to make multiple processes hibernate (core dump?)?

I have a piece of software (C++) that runs a few processes (each process is a major system in itself).
The processes communicate with each other via XML-RPC or Boost.Asio.
I want to be able to freeze or stop all the processes at a given moment and later restore the system (all processes) to the same state it was in before hibernating.
How can I do that in c++?
Would it even be feasible, given that the processes communicate with each other?
The big picture is that you need to get the system to a stable consistent state, then persist that state in some re-creatable form.
You can in principle write such code, the degree of difficulty depends on your application. You will need to figure out things such as:
How the processes agree that they are in a consistent state. You may need to define some new "Get ready to hibernate" and "I'm ready" messages.
For each process you need to figure out how to persist and recover its state. Depending on the complexity of any live data structures, that may be quite tricky. On the other hand, if your processes are stateless, this could be really easy.
You'll need to devise a scheme for managing the sets of hibernated data and for determining a consistent set across all the processes.
I see this as a significant coding effort; the degree of difficulty will depend on the complexity of your application and the quality of its implementation. In a well-structured application, such major "replumbing" exercises often go surprisingly smoothly.
Unless you're an OS - no, it won't be possible.
What you need to do instead is make sure that each process can do it for itself (i.e. write functionality that saves and restores the state of each process), and also account for inconsistencies in the communication (for example, require an ACK for every message and resend any message that was still unacknowledged when the state was saved).
It's feasible if done right, but it's easier said than done, of course, and assumes you can actually change the processes.
Well, the other answers are fine. There is another, rather "exotic", way which may solve this quickly, but it may be overkill or unsuitable. But who knows? So, just in case...
I suggest running your program inside a virtual machine (for example, Linux under VMware) and pausing/waking that virtual machine at will.
If you are using an inter-process communication method which is not disrupted by this kind of operation, it may work and save you a lot of time.
Good luck.

Apache C++ module persistent global objects

I want to keep some global objects in an Apache C++ module persistent across Apache child process invocations. How do I do this?
You must use some form of storage external to the Apache processes.
Basic choices:
A database.
Shared memory (OS dependent).
Another process, accessed via an IPC mechanism (e.g. a socket).
A file.
Which one is appropriate depends on your requirements, and you might combine them. For example, "a database" is itself just another process that persists data to a file and deals with concurrency issues in a well-understood way.
In general, a database is probably the first thing to try and only go to other alternatives if you have specific issues that can be solved by taking a different approach.

How does cherrypy handle user threads?

I'm working on a Django app right now and I'm using CherryPy as the server. CherryPy creates a new thread for every page view. I'd like to be able to access all of these threads (the threads responsible for talking to Django) from within any of them. More specifically, I'd like to be able to access the thread_data for each of these threads from within any of them. Is this possible? If so, how do I do it?
CherryPy's wsgiserver doesn't create a new thread for every request--it uses a pool. Each of those worker threads is a subclass of threading.Thread, so all of them should be accessible via threading.enumerate().
However, if you're talking specifically about cherrypy.thread_data, that's something else: a threading.local. If you're using a recent version of Python, all of that is coded in C and you (probably rightfully) don't have cross-thread access to it from Python. If you really need it and really know what you're doing, the best technique is usually to stick an additional reference to such things in a global container at the same time that they are inserted into the thread_data structure. I recommend dicts with weakrefs as keys for those global containers; there are enough Python ORMs that use them for connection pools (see my own Geniusql, for example) that you should be able to learn how to implement them fairly easily.
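A minimal sketch of that pattern; the function names are made up, and it assumes you control the spot where thread_data gets populated:

    import threading
    import weakref

    # Global registry mapping each worker thread (weakly) to the per-thread data
    # that was also stored on cherrypy.thread_data.  Entries vanish automatically
    # once a thread object is garbage-collected.
    _registry = weakref.WeakKeyDictionary()
    _registry_lock = threading.Lock()

    def register_thread_data(data):
        # Call this at the same point where you assign to cherrypy.thread_data.
        with _registry_lock:
            _registry[threading.current_thread()] = data

    def all_thread_data():
        # Snapshot of every live thread's data, usable from any thread.
        with _registry_lock:
            return dict(_registry)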
My first response to a question like this isn't to tell you how to do it but to stress that you really should reconsider before moving forward with this. I normally shy away from threaded web-servers, in favor of multi-process or asynchronous solutions. Adding explicit inter-thread communication to the mix only increases those fears.
When a question like this is asked, there is usually a deeper goal behind it. I suspect that whatever you think inter-thread communication would solve can actually be solved in some other, safer way.