I am building a dashboard-like webapp in Django, and my view takes forever to load due to a relatively large database (a single table with 60,000 rows...and growing), the complexity of the queries, and quite a lot of number crunching and data manipulation in Python. According to the Django Debug Toolbar, the page needs 12 seconds to load.
To speed up the page loading time I thought about the following solution:
Build a view that is called automatically every night, completes all the complex queries, number crunching and data manipulation, and saves the results in a small lookup table in the database
Build a second view that returns the dashboard but retrieves the data from the small lookup table via a very simple query, and hence loads much faster
Since the queries from the first view are executed every night, the data in the lookup table is always up to date
My questions: Does my idea make sense, and if so, does anyone have experience with such an approach? How can I write a view that gets called automatically every night?
I also read about caching, but with caching the first load of the page after a database update would still take a very long time, and the data in the database gets updated on a regular basis.
Yes, it is common practice.
We pre-calculate some things and use Celery to run those tasks around midnight daily. For some things we have a special new model, but usually we add database columns to the model that contain the pre-calculated information.
This approach basically has nothing to do with views: you use them normally, you just access the data differently.
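A minimal sketch of the nightly Celery job, assuming Celery beat is configured; DashboardResult (the small lookup table) and compute_dashboard_numbers() are hypothetical stand-ins for your own model and number crunching:

# tasks.py
from celery import shared_task
from celery.schedules import crontab

from .models import DashboardResult

@shared_task
def rebuild_dashboard():
    # all the complex queries and number crunching happen here, off-line
    for name, value in compute_dashboard_numbers():  # hypothetical helper
        DashboardResult.objects.update_or_create(
            name=name, defaults={'value': value},
        )

# in the Celery app config: run the rebuild every night at 00:30
app.conf.beat_schedule = {
    'rebuild-dashboard-nightly': {
        'task': 'myapp.tasks.rebuild_dashboard',
        'schedule': crontab(hour=0, minute=30),
    },
}

The dashboard view itself then only reads DashboardResult, which is the simple, fast query from the question.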
I have a few views in Django where I query the database about 50 times, and I am wondering whether it is faster to query the database or to grab the items once and count them myself in Python.
For example: On one page, I want to list the following statistics:
- users who live in New York
- users who are admin users
- users who are admin users and have one car
...
...
etc.
Is it faster to do many Django database queries such as the following:
User.objects.filter(state='NY').count()
User.objects.filter(type='Admin').count()
User.objects.filter(type='Admin', car=1).count()
Or would it be faster to grab all the users, loop through them in Python just once, and count each of these things as I go?
Does it depend on the database? Does it matter how many of these count() queries I execute? (On some views I have upwards of 30, as my application is very data-driven and just spits out many statistics.)
Thanks!
Use the Database as much as you can.
Python, Java, or any other language doesn't matter: always use SQL (if it is an RDBMS). Every time you execute an SQL statement, multiple things happen:
Prepare
Execute
Fetch
You want every single step to be minimal. If you do all of this in the DB, the Prepare/Execute may take longer (but is usually still less than all the queries combined), while Fetch (which is sometimes the biggest) becomes 1 (not n) and usually returns fewer records than the others combined.
Then, look at the execution plan to optimize your prepare and execute steps.
The following link is for Java developers, but it can actually apply to the Python world too.
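In Django specifically, several of those counts can be collapsed into a single round trip with conditional aggregation. A sketch assuming Django 2.0+, with the field names taken from the question:

# assumes the User model from the question
from django.db.models import Count, Q

stats = User.objects.aggregate(
    ny_users=Count('pk', filter=Q(state='NY')),
    admins=Count('pk', filter=Q(type='Admin')),
    admins_with_car=Count('pk', filter=Q(type='Admin', car=1)),
)
# one query; stats is a dict: {'ny_users': ..., 'admins': ..., 'admins_with_car': ...}

This keeps the counting in the database, with one Prepare/Execute/Fetch instead of n separate count() queries.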
We recently made a shift from relational (MySQL) to NoSQL (Couchbase). Basically it's a back-end for a social mobile game. We were facing a lot of problems scaling our backend to handle an increasing number of users. With MySQL, loading a user took a lot of time as there were a lot of joins between multiple tables. We saw a huge improvement after moving to Couchbase, especially when loading data, as most of it is kept in a single document.
On the downside, Couchbase also seems to have a lot of limitations as far as querying is concerned. Couchbase's alternative to SQL queries is views. While we managed to handle most of our queries using map-reduce, we are really having a hard time figuring out how to handle time-based queries. E.g. we need to filter users based on a timestamp attribute. We only need a user in the view if its time is less than the current time:
if(user.time < new Date().getTime() / 1000)
What happens is that once a user's time is set to some future time, it gets excluded from this view, which is the desired behavior, but it never gets added back to the view unless we update it: a document only gets re-indexed in a view when it is updated.
Our solution right now is to load the first x user documents and then check the time in our application. Sorting is done on the user.time attribute, so we get those users whose time is less than or near the current time. But I am not sure this is actually going to work in a live environment. Ideally we would like to avoid these kinds of checks at the application level.
Also there are times, e.g. matchmaking, when we need to check multiple time-based attributes. Our current strategy doesn't work in such cases, and we frequently get documents from the view that do not pass these checks when done in the application. I would really appreciate it if someone who has already tackled similar problems could share their experiences. Thanks in advance.
Update:
We tried using range queries, but they work for only one key. Like I said, in most cases we have multiple time-based keys, meaning multiple ranges, which does not work.
If you use Date().getTime() inside a view function, you'll always get the time at which that view was indexed, which is exactly why, as you said, "it never gets added back to view unless we update it".
There are two ways:
Bad way (don't do this in production): query views with the stale=false param. That will cause the view to update before it returns results. But view indexing is a slow process, especially if you have > 1 million records.
Good way: use range requests. You just need to emit your date in the map function as a key, or as part of a complex key, and use a range request. You can see one example here or here (also, if you want to use DateTime in Couchbase, this example will be more useful). Or just look at my example below:
I.e. you will have docs like:
doc = {
  "id": 1,
  "type": "doctype",
  "timestamp": 123456, // document update or creation time
  "data": "lalala"
}
For those docs the map function will look like:
map = function(doc, meta) {
  if (doc.type === "doctype") {
    emit(doc.timestamp, null);
  }
}
And now to get recently "updated" docs you need to query this view with params:
startKey="dateTimeNowFromApp"
endKey="{}"
descending=true
Note that startKey and endKey are swapped because I used descending order. Here is also a link to documentation about the key types that Couchbase supports.
I've also found a link to a question that can help.
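For illustration, the same range query issued from Python over Couchbase's view REST API (port 8092) could look like the sketch below; the bucket, design-document and view names are hypothetical, and the parameters mirror the ones above:

import json
import time

import requests

url = "http://localhost:8092/mybucket/_design/users/_view/by_timestamp"
params = {
    "startkey": json.dumps(int(time.time())),  # "now" computed in the app, not in the index
    "endkey": json.dumps({}),                  # swapped with startkey because of descending order
    "descending": "true",
}
resp = requests.get(url, params=params)
for row in resp.json().get("rows", []):
    print(row["id"], row["key"])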
On my website I'm going to award points for some activities, similarly to Stack Overflow. I would like to calculate the value based on many factors, so each computation for each user will take, for instance, 10 SQL queries.
I was thinking about caching it:
in memcache,
in the user's row in the database (so that wherever I need to get the user from the database, I can easily show the points)
Storing it in the database seems easy, but on the other hand it's redundant information, so I decided to ask, since maybe there is an easier and prettier solution that I missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.
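A hand-rolled sketch of the same denormalised-field idea (django-denorm automates this; the Profile model and compute_points() are hypothetical):

from django.db import models

class Profile(models.Model):
    user = models.OneToOneField('auth.User', on_delete=models.CASCADE)
    points = models.IntegerField(default=0)  # cached result of the ~10 queries

    def recompute_points(self):
        # run the expensive calculation once, when the inputs change,
        # instead of on every page view
        self.points = compute_points(self.user)  # hypothetical helper
        self.save(update_fields=['points'])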
My Django application retrieves an RSS feed every day. I would like to persist the time the feed was last updated somewhere in the app. I'm only retrieving one feed, it will never grow to be multiple feeds. How can I persist the last updated time?
My ideas so far
Create a model and add a datetime field to it. This seems like overkill, as it adds another table to the database in which there will only ever be one row. Other than that, it's the most obvious and straightforward solution.
Create a settings object which just stores key/value mappings. The last updated date would just be a row in this database. This is essentially a generic version of the previous solution.
Use dbsettings/django-values, which allows you to store settings in the database. The last updated date would just be a 'setting'.
Any other ideas that I'm missing?
Despite the fact that databases regularly store many rows in any given table, having a table with only one row is not especially costly, as long as you don't have (m)any indexes, which would waste space. In fact, most databases create many single-row tables to implement some features, like the monotonic sequences used for generating primary keys. I encourage you to create a regular model for this.
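A sketch of that single-row model; the model and field names are just placeholders:

from django.db import models
from django.utils import timezone

class FeedStatus(models.Model):
    last_updated = models.DateTimeField()

# read or create the one row, then update it after each fetch
status, _ = FeedStatus.objects.get_or_create(
    pk=1, defaults={'last_updated': timezone.now()},
)
status.last_updated = timezone.now()
status.save(update_fields=['last_updated'])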
RAM is volatile, thus not persistent: memcached is not what you asked for.
XML is not the right technology to store a single value.
An RDBMS is not the right technology to store a single value.
Django's cache framework will answer your question if CACHE_BACKEND is set to anything other than file://...
The filesystem is the right technology to "persist a single value".
In settings.py:
import os

RSS_FETCH_DATETIME_PATH = os.path.join(
    os.path.abspath(os.path.dirname(__file__)),
    'rss_fetch_datetime'
)
In your rss fetch script:
import time

from django.conf import settings

handler = open(settings.RSS_FETCH_DATETIME_PATH, 'w+')
handler.write(str(int(time.time())))
handler.close()
Wherever you need to read it:
from django.conf import settings
handler = open(settings.RSS_FETCH_DATETIME_PATH, 'r+')
timestamp = int(handler.read())
handler.close()
But cron is the right tool if you want to "run a command every day", for example at 5AM:
0 5 * * * /path/to/manage.py runscript /path/to/retrieve/script
Of course, you can still write the last-update timestamp to a file at the end of the retrieve script and use it somewhere else, if that makes sense to you.
Concluding by quoting Ken Thompson: "One of my most productive days was throwing away 1000 lines of code."
One solution I've used in the past is to use Django's cache framework. You set a value to True with an expiration time of one day (in your case). If the value is not set, you fetch the feed; otherwise you don't do anything.
You can see my solution here: Importing your Flickr photos with Django
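A sketch of that cache flag, assuming a configured Django cache backend (fetch_feed() stands in for the actual RSS retrieval):

from django.core.cache import cache

def maybe_fetch_feed():
    if cache.get('rss_feed_fetched'):
        return  # fetched within the last day, nothing to do
    fetch_feed()  # hypothetical: your RSS retrieval code
    cache.set('rss_feed_fetched', True, 60 * 60 * 24)  # flag expires after one day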
If you need it only for caching purposes, why not store it in memcached?
On the other hand, if you use this data for other purposes (e.g. display it on the page, or to make some calculation, etc.), then I would store it in a new model - in Django, all persistence is built on top of the database, via models, and I would not try to use other "clever" solutions.
One thing I used to do when I was developing with PHP was to store the XML somewhere, but with a new tag inserted to hold the timestamp of the latest retrieval. It wasn't great, but it was quick and simple.
Keeping it simple would lead to the idea of just storing it in the file system... why can't you do that? You could, for example, have a siteconfig module in one of your apps which holds these sorts of data. It could load the data from a specific file, which could be text, JSON, ConfigParser, pickle or any other suitable format. Just import siteconfig somewhere and it can load the data and make it available to the other modules in your site. You could easily extend this to hold a dict-like object with a number of settings (e.g., if you ever have multiple feeds but don't want a model for just 2-3 rows, you could hold the last-retrieved time for each feed in a dict keyed by feed URL).
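A sketch of such a siteconfig module using JSON (the file name and helper names are just placeholders):

# siteconfig.py
import json
import os

_PATH = os.path.join(os.path.dirname(__file__), 'siteconfig.json')

def load():
    # returns the stored settings dict; empty if the file doesn't exist yet
    if not os.path.exists(_PATH):
        return {}
    with open(_PATH) as fh:
        return json.load(fh)

def save(data):
    with open(_PATH, 'w') as fh:
        json.dump(data, fh)

Any module can then import siteconfig and read or write, e.g., a last-retrieved time per feed URL.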
Create a session key that persists forever, and update the feed timestamp in it every time you access the feed.