Controlling ordering of Django queryset result via filtering with redis list - django

On a Django website of mine, users contribute posts, which are then shown globally on the home page, sorted most-recent first.
I'm introducing redis into this mix by doing an lpush of every post_id onto a redis list (which is kept trimmed to 1000 entries). The code is:
def add_post(link_id):
    my_server = redis.Redis(connection_pool=POOL)
    my_server.lpush("posts:1000", link_id)
    # keep only the 1000 most recent ids (indices 0..999)
    my_server.ltrim("posts:1000", 0, 999)
Then, when a user requests the contents of the home page, I simply execute the following query in the get_queryset method of the relevant class-based view:
Post.objects.filter(id__in=all_posts())
Where all_posts() is simply:
def all_posts():
    my_server = redis.Redis(connection_pool=POOL)
    return my_server.lrange("posts:1000", 0, -1)
Next, I iterate over context["object_list"] in a Django template (i.e. {% for post in object_list %}) and populate the latest posts one by one for my users to see.
My problem is that this arrangement does not show most-recent first. It always shows most-recent last. So I changed lpush to rpush instead, but the result didn't change at all. Why isn't changing redis' list insert method changing the ordering of the results Django's queryset is returning to me?
Perhaps I'm missing something rudimentary. Please advise me on what's going on and how I can fix it (is {% for post in object_list reversed %} my sole option here?). My reason for taking the redis route was, naturally, performance. Prior to redis, I would do: Post.objects.order_by('-id')[:1000]. Thanks in advance.
Note: please ask for more information if required.

You're iterating through a queryset that doesn't have an order_by clause, which means you can't have any expectations about the order of the results. The __in clause just controls which rows to return, not their order.
The fact that the returned results are in the id order is an implementation detail. If you want to rely on that, you can just iterate through the queryset in reverse order. A more robust solution would be to reorder (in Python) the instances based on the order of the ids returned from Redis.
All that said, though, I don't think there will be any performance advantage to using Redis here. I think that any relational database with an index on id will be able to execute Post.objects.order_by('-id')[:1000] very efficiently. (Note that slicing a queryset does a LIMIT on the database; you're not fetching all the rows into Python and then slicing a huge list.)

Related

How do I tell if a Django QuerySet has been evaluated?

I'm creating a Django queryset by hand and want to use the Django ORM just to read the resulting queryset.query SQL itself, without hitting my DB.
I know Django querysets are lazy, and I see all the ops that trigger a queryset being evaluated:
https://docs.djangoproject.com/en/1.10/ref/models/querysets/#when-querysets-are-evaluated
But... what if I just want to verify my code is purely building the queryset guts but ISN'T evaluating and hitting my DB yet inadvertently? Are there any attributes on the queryset object I can use to verify it hasn't been evaluated without actually evaluating it?
For querysets that use a select to return lists of model instances, like a basic filter or exclude, the _result_cache attribute is None if the queryset has not been evaluated, or a list of results if it has. The usual caveats about non-public attributes apply.
Note that printing a queryset, although the docs list calling repr() as an evaluation trigger, does not in fact evaluate the original queryset. Instead, the original queryset internally chains into a new one so that it can limit the amount of data printed without changing the limits of the original queryset. It does evaluate a subset of the original queryset and therefore hits the DB, so that's another weakness of this approach if you're in fact trying to use it to monitor all DB traffic.
For other querysets (count, delete, etc.), I'm not sure there is a simple way. You could watch your database logs, or run in DEBUG mode and check connection.queries as described here:
https://docs.djangoproject.com/en/dev/faq/models/#how-can-i-see-the-raw-sql-queries-django-is-running

Django pagination random: order_by('?')

I am loving Django, and liking its implemented pagination functionality. However, I encounter issues when attempting to split a randomly ordered queryset across multiple pages.
For example, I have 100 elements in a queryset and wish to display them 25 at a time. When I provide the context object as a randomly ordered queryset (via .order_by('?')), a completely new queryset is loaded into the context each time a new page is requested (page 2, 3, 4).
Explicitly stated: how do I (or can I) request a single queryset, randomly ordered, and display it across digestible pages?
I ran into the same problem recently where I didn't want to have to cache all the results.
What I did to resolve this was a combination of .extra() and raw().
This is what it looks like:
raw_sql = str(queryset.extra(select={'sort_key': 'random()'})
              .order_by('sort_key')
              .query)
set_seed = "SELECT setseed(%s);" % float(random_seed)
queryset = self.model.objects.raw(set_seed + raw_sql)
I believe this will only work for Postgres. Doing a similar thing in MySQL is probably simpler, since you can pass the seed directly to RAND(123).
The seed can be stored in the session/a cookie/your frontend in the case of ajax calls.
Warning: there is a better way
This is actually a very slow operation. I found this blog post, which describes a very good method both for retrieving a single random result and for sets of results. In that case the seed is used in your local random number generator.
I think this really good answer will be useful to you: How to have a "random" order on a set of objects with paging in Django?
Basically, it suggests caching the list of objects and referring to it with a session variable, so the order can be maintained between pages (using Django pagination).
Or you could manually randomize the list and pass a seed to keep the randomization the same for the same user!
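The seeded approach can be sketched without touching the database's random function at all: shuffle the list of ids once with a per-user seed (stored in the session), then slice out one page at a time. Names here are hypothetical:

```python
import random

PAGE_SIZE = 25

def shuffled_page_ids(all_ids, seed, page):
    """Deterministically shuffle ids with `seed`, then slice out one page.

    The same seed always produces the same order, so pages 2, 3, 4
    stay consistent for the same user across requests.
    """
    ids = list(all_ids)
    random.Random(seed).shuffle(ids)  # seeded, hence reproducible
    start = (page - 1) * PAGE_SIZE
    return ids[start:start + PAGE_SIZE]
```

You would then fetch the page with Post.objects.filter(id__in=page_ids) and re-sort the results in Python, since (as noted in the first answer above) __in does not preserve order.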
The best way to achieve this is to use a pagination app, like:
pure-pagination
django-pagination
django-infinite-pagination
Personally, I use the first one; it integrates pretty well with Haystack.
{# Example (django-pagination): paginate 10 results per page #}
{% autopaginate my_query 10 %}

Django .order_by() with .distinct() using postgres

I have a Read model that is related to an Article model. What I would like to do is make a queryset where articles are unique and ordered by date_added. Since I'm using postgres, I'd prefer to use the .distinct() method and specify the article field. Like so:
articles = Read.objects.order_by('article', 'date_added').distinct('article')
However, this doesn't give the desired effect: the queryset comes back ordered by the order the articles were created, not by date_added. I am aware of the note about .distinct() and .order_by() in Django's documentation, but I don't see that it applies here, since the side effect it mentions is duplicate rows, and I'm not seeing that.
# To actually sort by date added I end up doing this
articles = sorted(articles, key=lambda x: x.date_added, reverse=True)
This executes the entire query before I actually need it and could potentially get very slow if there are lots of records. I've already optimized using select_related().
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
UPDATE
The output would ideally be a queryset of Read instances where their related article is unique in the queryset and only using the Django orm (i.e. sorting in python).
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
Possibly. It's hard to say without the full picture, but my assumption is that you are using Read to track which articles have and have not been read, and probably tying this to a User instance to determine whether a particular user has read an article. If that's the case, your approach is flawed. Instead, you should do something like:
class Article(models.Model):
    ...
    read_by = models.ManyToManyField(User, related_name='read_articles')
Then, to get a particular user's read articles, you can just do:
user_instance.read_articles.order_by('date_added')
That takes the need to use distinct out of the equation, since there will not be any duplicates now.
UPDATE
To get all articles that are read by at least one user:
Article.objects.filter(read_by__isnull=False)
Or, if you want to set a threshold for popularity, you can use annotations:
from django.db.models import Count
Article.objects.annotate(read_count=Count('read_by')).filter(read_count__gte=10)
Which would give you only articles that have been read by at least 10 users.

django - show the length of a queryset in a template

In my HTML file, how can I output the size of the queryset that I am using (for debugging purposes)?
I've tried
{{ len(some_queryset) }}
but that didn't work. What is the format?
Give {{ some_queryset.count }} a try.
This is better than using len because it optimizes the SQL generated in the background to retrieve only the number of records instead of the records themselves. (Note that a template can't call len() directly, and Django templates refuse to look up underscore-prefixed attributes such as __len__; the template equivalent of len() is the |length filter.)
some_queryset.count() in code, or {{ some_queryset.count }} in your template.
Don't use len; it is much less efficient. The database should be doing that work. See the documentation about count().
However, taking buffer's advice into account: if you are planning to iterate over the records anyway, you might as well use len, which involves evaluating the queryset and bringing the resulting rows into main memory. This won't go to waste, because you will visit those rows anyway. It might actually be faster, depending on DB connection latency, but you should always measure.
Just to highlight Yuji 'Tomita' Tomita's comment above as a separate answer:
There is a filter called length that calls len() on anything.
So you could use:
{{ some_queryset|length }}
The accepted answer is not entirely correct. Whether you should use len() (or the length-filter in a template) vs count() depends on your use case.
If the QuerySet only exists to count the amount of rows, use count().
If the QuerySet is used elsewhere, e.g. in a loop, use len() or |length. Using count() here would issue another SELECT query to count the rows, while len() simply counts the number of cached results in the QuerySet.
From the docs:
Note that if you want the number of items in a QuerySet and are also retrieving model instances from it (for example, by iterating over it), it’s probably more efficient to use len(queryset) which won’t cause an extra database query like count() would.
Although it seems that with related objects that you have already eager-loaded using prefetch_related(), you can safely use count() and Django will be smart enough to use the cached data instead of doing another SELECT-query.

Django - Storing results of query

I have a 'categories' model which is used more than once on a page. Since I am obtaining all the categories at the start, I want to cut down on database queries by not fetching the same data more than once.
Since the initial query is getting ALL the categories, is there a way to store this information in the model so that when I reference the data again later, I don't have to hit the database again?
Perhaps some kind of associative array or dict which stores the categories?
Any help would be appreciated.
Django querysets are lazy and cached, so the database is not hit till the queryset is accessed. You should also take a look at how queries are evaluated.
If you could post some code, we could help you figure out an optimal way to write queries.