Django: Actions that provide intermediate pages ... with 100k rows - django

I know how to write Actions that provide intermediate pages, since the docs are great:
https://docs.djangoproject.com/en/2.0/ref/contrib/admin/actions/#actions-that-provide-intermediate-pages
But, if my selection contains 100k rows, the pattern of the docs does not work since the URL gets too long.
How to write Django Admin Actions that provide intermediate pages and can handle +100k rows?

I solved it this way:
Pickle QuerySets
Store pickled QuerySet in the cache under a random ID
forward the random ID to the next page
the next pages use the random ID to read the QuerySet from the cache.

When i need something closer to that i used some grouping variables like: all, active, accepted, denied. By doing this grouping i can do some bulk action on huge large of data without creating a python list with thousands of pks.
Another good point to pay atention is that you need to pass that to the DB, otherwise you will have a enormous bottleneck on the views/models.

Related

Feed Algorithm + Database: Either too many rows or too slow retrieval

Say I have a general website that allows someone to download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned to the user from the server with only N of the most recent posts between all of the pages subscribed to. Originally when a user queried the server for a feed, the algorithm was as follows:
look at all of the pages a user subscribed to
getting the N most recent posts from each page
sorting all of the posts
return the N most recent posts to the user as their feed
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. Thus, I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it is already created and does not have to be created upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself would be inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database in a way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
class Page(models.Model):
name = models.CharField(max_length=47)
followers = models.ManyToManyField('auth.User', related_name='followed_pages')
class Post(models.Model):
title = models.CharField(max_length=147)
page = models.ForeignKey(Page, related_name='posts')
content = models.TextField()
time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
Caching will be the solution here.
You will have to reduce the database reads, which are much slower as compared to cache reads.
You can use something like Redis to cache the post.
Here is an amazing answer for better understanding
Is Redis just a cache
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
you need not to cache everything , just cache resent M posts, where M>>N and safe enough to reduce the database calls.Now if in case user requests for posts beyond the latesd M ones, then they can be directly fetched from the DB.
Now when you have to generate the feed you can make a DB call to get all of the subscribed pages(or you can put in the cache as well) and then just get the required number of post's from the cache.
The problem here would be keeping the cache up-to date.
For that you can use something like django-signals. Whenever a new post is added, add it to the cache as well using the signal.
So for each DB write you will have to write to cache as well.
But then you will not have to read from DB and as Redis is a in memory datastore it is pretty fast as compared to standard relational databases.
Edit:
These are a few more articles which can help for better understanding
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances

Django pagination random: order_by('?')

I am loving Django, and liking its implemented pagination functionality. However, I encounter issues when attempting to split a randomly ordered queryset across multiple pages.
For example, I have 100 elements in a queryset, and wish to display them 25 at a time. Providing the context object as a queryset ordered randomly (with the .order_by('?') specification), a completely new queryset is loaded into the context each time a new page is requested (page 2, 3, 4).
Explicitly stated: how do I (or can I) request a single queryset, randomly ordered, and display it across digestible pages?
I ran into the same problem recently where I didn't want to have to cache all the results.
What I did to resolve this was a combination of .extra() and raw().
This is what it looks like:
raw_sql = str(queryset.extra(select={'sort_key': 'random()'})
.order_by('sort_key').query)
set_seed = "SELECT setseed(%s);" % float(random_seed)
queryset = self.model.objects.raw(set_seed + raw_sql)
I believe this will only work for postgres. Doing a similar thing in MySQL is probably simpler since you can pass the seed directly to RAND(123).
The seed can be stored in the session/a cookie/your frontend in the case of ajax calls.
Warning - There is a better way
This is actually a very slow operation. I found this blog post describes a very good method both for retrieving a single result as well as sets of results.
In this case the seed will be used in your local random number generator.
i think this really good answer will be useful to you: How to have a "random" order on a set of objects with paging in Django?
basically he suggests to cache the list of objects and refer to it with a session variable, so it can be maintained between the pages (using django pagination).
or you could manually randomize the list and pass a seed to maintain the randomification for the same user!
The best way to achive this is to use some pagination APP like:
pure-pagination
django-pagination
django-infinite-pagination
Personally i use the first one, it integrates pretty well with Haystack.
""" EXAMPLE: (django-pagination) """
#paginate 10 results.
{% autopaginate my_query 10 %}

Design pattern for caching dynamic user content (in django)

On my website I'm going to provide points for some activities, similarly to stackoverflow. I would like to calculate value basing on many factors so each computation for each user will take for instance 10 SQL queries.
I was thinking about caching it:
in memcache,
in user's row in database (so that wherever I need to get user from base I easly show the points)
Storing in database seems easy but on other hand it's redundant information and I decided to ask, since maybe there is easier and prettier solution which I missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.

Django pagination of large image thumbnails

I have a couple hundred of image thumbnails, 15k each. I want to display 20 or so on each page.
Would django.core.paginator suffice for the pagination of these pages? I.e., will it return only those images displayed on the current page? (And if not, what would be a good way to do this?) Thank you.
Depends, because there is one big limitation from the RDBMS (which affects all databases, including MySQL, Postgres, etc.).
django.core.paginator takes a QuerySet which represent any kind of SQL query and adds a LIMIT clause to just get a couple of entries from the database. This approach works well for many kinds of applications, but might become a serious problem if you have a lot of entries. The particular problem is, that whenever you access the 800th page, the database will actually fetch 801*20 entries and then drop the first 800*20 entries again to return the last twenty.
Unfortunately, there is no easy way to solve this problem. In a lot of cases, a next/prev button might be enough so you can write your own pagination which does operate on after-keys instead of page numbers. For example, if the last entry currently displayed by the user has the key "D" you show a next button which links to /next?after=D and then use a SQL query like SELECT * FROM objects WHERE key >DORDER BY key LIMIT 20. The advantage of this approach is, that you can add an index on objects.key which speed up things significantly.
The other approach requires, that you add an additional, indexed (!) column page_num to your table. Then you can perform SQL queries like SELECT * FROM objects WHERE page_num=800 ORDER BY key. With that approach, you can still access all pages randomly, but you have to maintain the page_num column. This might be easy if data is mostly appended at the end and is more complicated if you want to delete/insert elements from the middle efficiently.
So, I would start with django.core.paginator because it's just about 1 line of code. But keep an eye on the response times of your paginated views and the slowquery log from your database. If your database server can't handle the load anymore, you will have to choose one of the techniques mentioned above. Choose solution 2 if random page access is an requirement and solution 1 otherwise (because it's much simpler).
PS: And yes, django.core.paginator will work correctly. :)

Are Django's QuerySets lazy enough to cope with large data sets?

I think I've read somewhere that Django's ORM lazily loads objects. Let's say I want to update a large set of objects (say 500,000) in a batch-update operation. Would it be possible to simply iterate over a very large QuerySet, loading, updating and saving objects as I go?
Similarly if I wanted to allow a paginated view of all of these thousands of objects, could I use the built in pagination facility or would I manually have to run a window over the data-set with a query each time because of the size of the QuerySet of all objects?
If you evaluate a 500000-result queryset, which is big, it will get cached in memory. Instead, you can use the iterator() method on your queryset, which will return results as requested, without the huge memory consumption.
Also, use update() and F() objects in order to do simple batch-updates in single query.
If the batch update is possible using a SQL query, then i think using sql-queries or django-orm will not make a major difference. But if the update actually requires loading each object, processing the data and then updating them, you can use the orm or write your own sql query and run update queries on each of the processed data, the overheads completely depends on the code logic.
The built-in pagination facility runs a limit,offset query (if you are doing it correct), so i don't think there are major overheads in the pagination either ..
As I benchmarked this for my current project with dataset of 2.5M records in one table.
I was reading information and counting records, for example, I needed to find IDs of records, which field "name" was updated more than once in certain timeframe. Django benchmark was using ORM, to retrieve all records and then to iterate through them. Data was saved in list for future processing. No any debug output, except result print in the end.
On the other end, I was using MySQLdb which was executing same queries (got from Django) and building same structure, using classes for storing data and saving instances in list for future processing. No any debug output, except result print in the end.
I found that:
without Django with Django
execution time x 10x
memory consumption y 25y
And I was only reading and counting, without performing update/insert queries.
Try to investigate this question for yourself, benchmark isn't hard to write and execute.