Haystack's RealTimeSearchIndex causes Django to hang on data entry

I'm using django-haystack and a Xapian backend with real-time indexing (haystack.indexes.RealTimeSearchIndex) of model data, and it works fine on my Ubuntu server. However, it causes Django to hang on data entry when I deploy the app on a RHEL5 server.
Everything is hunky dory if I switch to a standard SearchIndex.
Running ./manage.py rebuild_index manually works fine too.
The major differences between the two setups are the versions of Python (2.4.3 vs 2.6.4) and Xapian (1.0.4-1 vs 1.0.15).
Any suggestions on what may be the problem?
Nothing interesting appears in the logs, and I've tried different databases (MySQL, SQLite3) and deployment methods (mod_python, WSGI) with no luck yet.
I have noted the warning in the Haystack docs stating that RealTimeSearchIndex is only handled gracefully with a Solr backend. However, I'm running a very low-traffic site with only occasional writes, so I'm fine with some CPU overhead on writes.

Installing xapian-core and xapian-bindings from source solved the problem.
I initially used the RPM packages provided here.

Please note this from the author of xapian-haystack:
Because Xapian does not support simultaneous WritableDatabase connections, it is strongly recommended that users take care when using RealTimeSearchIndex to either set WSGIDaemonProcess processes=1 or use some other way of ensuring that there are not multiple attempts to write to the indexes. Alternatively, use SearchIndex and a cronjob to reindex content at set time intervals (sample cronjob can be found here http://gist.github.com/216247) or derive your own SearchIndex to implement some other form of keeping your indexes up to date.

Related

Postgres has a lot of activity going on despite me not doing anything

I am running a local development server while I work on the project until it is ready for deployment. I checked the Postgres admin page and noticed that I have a lot of transactions running in the background.
I am not using the site or making queries, and am wondering what is causing this. (I am also the only user.)
Why is this?
You'll need to find out what is going on yourself.
First, you can try SELECT * FROM pg_stat_activity (doc). It shows the current or most recent statement executed on each connection.
With some luck, you'll find out what is going on.
If that is not enough, use pg_stat_statements.
It is a little more complicated to set up (load it via shared_preload_libraries in postgresql.conf, restart, then CREATE EXTENSION pg_stat_statements), but you will not miss any of the queries.
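If the project involved is a Django one, the same check can also be run without leaving the development environment. Below is a minimal sketch, assuming the Django project already points at the Postgres database in question (run it from ./manage.py shell); it selects everything because the pg_stat_activity column names differ between PostgreSQL versions:

# Minimal sketch: inspect pg_stat_activity through Django's own DB connection.
from django.db import connection

cursor = connection.cursor()
cursor.execute("SELECT * FROM pg_stat_activity")
columns = [col[0] for col in cursor.description]
for row in cursor.fetchall():
    print(dict(zip(columns, row)))  # one dict per backend/connection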

Postgres + GeoDjango - data doesn't seem to save

I have a small Python script that pushes data to my Django Postgres db.
It imports the relevant model from a Django project and uses the .save() method to save the data to the db without issue.
Yesterday the system was running fine. I started and stopped both my django project and the python script many times over the course of the day, but never rebooted or powered off my computer, until the end of the day.
Today I have discovered that the data is no longer in the db!
This seems silly, as I've probably forgotten to do something obvious, but I thought that when save() is called on a model, the data is committed to the db.
So this answer is about where to start troubleshooting problems like this, since the question is quite vague and we don't have enough info to troubleshoot effectively.
If this ever happens again, the first thing to do is to turn on statement logging for PostgreSQL and look at the statements as they come in. This should show you begin and commit statements as well as the queries. It's virtually impossible to troubleshoot this sort of problem without access to the queries. Things to look for include missing COMMITs, and missing statements.
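The statement logging referred to here is PostgreSQL's own (log_statement = 'all' in postgresql.conf). As a complementary, purely illustrative check on the Django side, turning up the django.db.backends logger will echo every query Django sends, which helps confirm that the expected INSERTs are actually being issued. A rough sketch for settings.py (Django only records queries when DEBUG is True):

# Rough sketch for settings.py: echo every SQL statement Django issues.
DEBUG = True  # query logging only happens with DEBUG enabled

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}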
After that, the next thing to do is to look at the circumstances under which your computer rebooted. Is it possible it did so before an expected commit? Or did it lose power and not have the transaction log flushed to disk in time?
Those two should rule out just about all possible causes on the db side in a development environment. In a production environment running old versions of PostgreSQL, you do want to verify that the system has autovacuum running properly and that you aren't getting warnings about xid wraparound. In newer versions this is not a problem, because PostgreSQL will refuse to accept queries when approaching xid wraparound.

Recommended way to setup Django Fast CGI configuration with multiple domains

I'm creating a Django project that will be used by multiple domains, and the functionality will be slightly different depending on the domain. I'm looking for advice on the proper way to set this up.
The sites framework seems like it would be a good fit for doing some of the customizations once processing has reached the point where it's executing the Django code. But I'm trying to determine what the setup should be before we reach that point (relating to the nginx, flup, and FastCGI config).
Here is my current understanding:
It seems like multiple Django settings files are appropriate, each with a different SITE_ID. Two virtual hosts would then be set up in the nginx configuration, pointing to two different sockets. Two 'manage.py runfcgi' processes would then listen on those two sockets, each referencing a different settings module:
./manage.py --settings=settings.site1 runfcgi method=prefork socket=/home/user/mysite1.sock pidfile=django1.pid
./manage.py --settings=settings.site2 runfcgi method=prefork socket=/home/user/mysite2.sock pidfile=django2.pid
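The per-site settings modules themselves can stay very small. An illustrative sketch (the settings package layout and the shared base module name are assumptions, not from the question):

# settings/site1.py -- illustrative per-domain settings module.
# Pull in the shared settings and override only what differs per domain.
from settings.base import *  # assumed shared settings module

SITE_ID = 1  # must match the django_site row for this domain
# Other per-domain overrides (ROOT_URLCONF, template dirs, ...) go here.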
However, it seems like this could get messy if you add more domains: it would require a new 'manage.py runfcgi' process for every domain added. Is there a way to support multiple sites like this without running a separate process for each?
What are your experiences with hosting multiple domains with Django?
Any advice is much appreciated. Thank you for reading.
Joe
If you are going to have a lot of domains running, one process per domain might get quite expensive. The sites framework was originally made with another use case in mind: being able to easily create "duplicate" content on several news sites. When trying to use the sites framework for other uses you run into several difficulties.
One possibility is to move the domain processing into middleware and have Django handle the multi-domain part. It's not trivial though, especially if you have to make apps domain-aware, along with URLconfs, templates, etc. A quick Google search turned up:
http://djangosnippets.org/snippets/1119/
Might help as a starting point.
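For a rough idea of the shape such middleware can take, here is a sketch under assumptions (it is not the snippet at that URL; the host-to-URLconf mapping is invented for illustration, while request.urlconf is a real Django hook that overrides ROOT_URLCONF per request):

# Rough sketch of domain-aware middleware (old-style process_request);
# add it to MIDDLEWARE_CLASSES to activate.
from django.contrib.sites.models import Site

SITE_URLCONFS = {  # illustrative mapping only
    "www.example-one.com": "urls_site1",
    "www.example-two.com": "urls_site2",
}

class MultiSiteMiddleware(object):
    def process_request(self, request):
        host = request.get_host().split(":")[0].lower()
        try:
            request.site = Site.objects.get(domain=host)  # attach the matching Site
        except Site.DoesNotExist:
            request.site = None
        urlconf = SITE_URLCONFS.get(host)
        if urlconf:
            request.urlconf = urlconf  # Django uses this instead of ROOT_URLCONF
        return None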

Which Django cache backend to use?

I have been using cmemcache + memcached for a while with positive results.
However, cmemcache hasn't been working well lately, and I also found that it's no longer recommended. I have now installed python-memcached and it's working well. Since I've decided to change anyway, I would like to try some other cache backend. Any recommendations?
I have also come across pylibmc and python-libmemcached; any others?
Has anyone tried the nginx memcached module?
Thanks
Only cmemcache and python-memcached are supported by the Django memcached backend. I don't have experience with either of the two libraries you mentioned, but if you want to use them you will need to write a new cache backend.
The nginx memcached module is something completely different. It doesn't really seem like it would work well with Django's caching, but it might with enough hacking. At any rate, I wouldn't use it, since it is much better if you control what gets cached and retrieved from cache from the Python side.
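For reference, switching between the supported memcached libraries is only a settings change. A minimal sketch (the old-style CACHE_BACKEND string applies to Django 1.2 and earlier, the CACHES dict to 1.3+):

# settings.py -- minimal sketch of the memcached cache backend.
CACHE_BACKEND = "memcached://127.0.0.1:11211/"  # Django 1.2 and earlier

# Equivalent for Django 1.3+:
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    }
}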

Django cluster deployment

I have five nodes behind a load balancer and I'm trying to determine the optimal configuration for a Django based site.
Each node has access to PostgreSQL, mod_wsgi, Apache, Lighttpd, memcached, pgpool2 (for database replication) and GlusterFS (for media file replication), and is running Ubuntu 8.04 LTS.
So far, the setup is four nodes running Apache/Lighttpd/memcached/pgpool2 all reading/writing to one master node that is running the "master" Postgresql. Each of the four web nodes is also running Postgres for replication from the master via pgpool.
So, my question is: How would you configure this setup and/or what would you change so that there is no single point of failure, if possible?
This sounds like a good setup, although it's hard to know exactly what your setup looks like in terms of memory, etc., and the traffic you expect to handle.
You might want to consider using Django's multi-db support and having a read-only Postgres instance (use DB routing to direct reads for certain apps to the read-only instance). This can offer some quite nice speed improvements, and at the moment you could have a potential bottleneck at the single Postgres instance, depending on how heavy your database work is.
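A database router for that kind of read/write split might look roughly like this (an illustrative sketch; the "default" and "replica" aliases are assumptions and must match your DATABASES setting, with the router activated via DATABASE_ROUTERS):

# Rough sketch of a read/write router for Django's multi-db support.
class ReadReplicaRouter(object):
    def db_for_read(self, model, **hints):
        return "replica"   # send reads to the read-only instance

    def db_for_write(self, model, **hints):
        return "default"   # all writes go to the primary

    def allow_relation(self, obj1, obj2, **hints):
        return True        # both aliases point at the same data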
As #ashwoods suggested, it might be worth looking into gunicorn and nginx. I guess at the moment you use Apache only to run mod_wsgi, and Lighttpd for the static files? Nginx works with a number of WSGI servers and is great for static files too.
The setup looks pretty good to me. I would consider using gunicorn/uwsgi + nginx. I would also benchmark using pgbouncer, although pgpool2 offers more out of the box.