Django ORM: Why is exclude() so slow and how to optimize it? - django

I have the following 3 queries in my CBV:
filtered_content = Article.objects.filter_articles(search_term)
filtered_articles = filtered_content.exclude(source__website=TWITTER)
filtered_tweets = filtered_content.filter(source__website=TWITTER)
Short explanation:
I'm querying my database (PostgreSQL) for all article titles that contain the search term. After that, I separate the results into one variable that contains all articles originating from Twitter and the other variable contains all articles originating from all other websites.
I have two questions about optimizing these queries.
Question 1: Looking at the average time it takes to run these queries, it doesn't make sense to me (filtered_content = less than 0.001 seconds, filtered_articles = 0.2 seconds and filtered_tweets = 0.04 seconds).
What is the reason for the exclude() statement (filtered_articles) being so slow?
I also tried doing the query in another way, but this was even slower:
filtered_content = Article.objects.filter_articles(search_term)
filtered_tweets = filtered_content.filter(source__website=TWITTER)
filtered_articles = filtered_content.exclude(article_id__in=[tweet.article_id for tweet in filtered_tweets])
Question 2: Is there a more elegant way to solve this problem / is there a way to do it in less than 3 separate queries? More specifically, using the Django ORM, is there a way to do a query where all excluded() objects are stored in one variable while all non-excluded objects are stored in another?

I don't know why it is slower; maybe you need to inspect the SQL that exclude() generates. On the other hand, you can try with a Q statement:
from django.db.models import Q
Article.objects.filter_articles(search_term).filter(~Q(source__website=TWITTER))
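If the goal is to end up with both groups while only hitting the database once, another option (just a sketch, assuming filter_articles() returns a queryset and the result set fits in memory) is to fetch once and partition in Python:
filtered_content = list(
    Article.objects.filter_articles(search_term).select_related('source')
)
filtered_tweets = [a for a in filtered_content if a.source.website == TWITTER]
filtered_articles = [a for a in filtered_content if a.source.website != TWITTER]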
Can you share your .filter_articles() method?

Related

Django filter exact m2m objects

Let's say I have a team model, and a team has members.
So
class Team(models.Model):
    team_member = models.ManyToManyField('Employee')

class Employee(models.Model):
    ....
Let's say I have a list of employee ids like team_members = [1001, 1003, 1004] and I want to find the Team that is made up of exactly those three members.
I don't want the team that has [1001, 1003, 1004, 1005] or the team that has [1001, 1003].
Only team [1001, 1003, 1004].
This is what I'm doing now:
team = None
teams = Team.objects.all()
for t in teams:
    if set([x.id for x in t.team_member.all()]) == set(team_members):
        team = t
if not team:
    team = Team.objects.create()
    team.team_member = team_members
But it seems a bit ham-handed. Is there a cleaner way, with fewer nested loops?
The short answer
No, I don't know of a much simpler way in terms of code appearance.
However there are some things you could do to make your code a little more graceful and potentially a lot faster. Plus it is possible to do the work in the database, albeit quite inefficiently for large team sizes.
The DB option listed below is pretty much as ham-handed as the for loop you provided, but could be more efficient depending on your data set, DB, etc.
Longer answer: ways to be less 'ham-handed'
There are a couple of places I'd clean up the style here.
Plus, in my experience with Django, loops like the one you built do tend to become pretty expensive on large data sets. If you end up loading, say, 10,000 teams into memory, having the ORM convert them to Team objects, and then iterating over them, you'll probably see some significant slowdown.
A few things to try for speed & grace:
Use Team.objects.values_list('team_member') for your in-Python filter loop, which skips the step where Django organizes all of the SQL data into Model objects. I've found this to save lots of time instantiating objects (sometimes around an order of magnitude).
Straighten out your set() calls. Currently you're re-converting team_members to a set() on every iteration, plus you're turning t.team_member implicitly into Employee objects (as they're fetched from the DB), then into a list of ids, and then into a set. For the first issue, just make a team_members_set = set(team_members) up front and reuse it. For the second, you can do set(t.team_member.values_list('id', flat=True)), which skips the heaviest ORM step of instantiating Employee objects (which could be as bad as O(n^2) in your example, depending on the data set and Django's caching).
Use Team.objects.all().iterator() to avoid loading all of the Teams into memory at once. This will help if you're running into memory issues.
But with any performance optimization, test with real (or realistic) data to be sure you're not making things worse!
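Putting these suggestions together, a minimal sketch (field names follow the question's models; .set() assumes a reasonably recent Django version):
team_members_set = set(team_members)  # build the comparison set once
team = None
for t in Team.objects.all().iterator():  # don't load every Team into memory at once
    ids = set(t.team_member.values_list('id', flat=True))  # skip instantiating Employee objects
    if ids == team_members_set:
        team = t
        break
if team is None:
    team = Team.objects.create()
    team.team_member.set(team_members)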
Longer answer: the DB option
After trying all manner of Q() manipulation and other approaches listed in the answers here, to no avail, I found this answer by #Todor.
Basically you need to do repeated filter()s, one for each team_member. On top of that you use a Count filter to make sure that you don't end up choosing a Team with a superset of the desired members.
from functools import reduce  # reduce() is a builtin on Python 2
from django.db import models

desired_members = [1001, 1003, 1004]
initial_queryset = Team.objects.annotate(
    cnt=models.Count('team_member')
).filter(cnt=len(desired_members))
matching_teams = reduce(  # can of course use a for loop if you prefer that to reduce()
    lambda queryset, member: queryset.filter(team_member=member),
    desired_members,
    initial_queryset
)
Note that the resulting query will likely have perf issues for large teams, since it will do one JOIN for every one of your desired_members. It'd be nice to avoid that but I don't know of another way to do this all in the database without changing your data structure. I'd love to learn a better way, and if you end up doing some perf testing I'd be curious to find what you learn!
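For reference, the same repeated-filter idea written as a plain for loop (same assumptions and caveats as above):
matching_teams = initial_queryset
for member in desired_members:
    matching_teams = matching_teams.filter(team_member=member)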
Maybe you can use annotate for the count of team_member. Can you try this?
from django.db.models import Count
Team.objects.filter(team_member__pk__in=team_members).annotate(num_team=Count('team_member')).filter(num_team=len(team_members))
To get the Team with those exact three members you can use:
Team.objects.get(team_member__pk=team_members) # This code was untested
You could also try with a list of Employee objects:
# team_members = Employee.objects.filter(pk__in=team_members)
team_members = [<Employee: Employee object>, <Employee: Employee object>, <Employee: Employee object>]
Team.objects.get(team_member=team_members)

Quantity of database queries vs. application memory performance using Django

If I need a total of all objects in a queryset as well as a slice of field values from those objects, which option would be better considering speed and application memory use (I am using a PostgreSQL backend):
Option a:
def get_data():
    queryset = MyObject.objects.all()
    total_objects = queryset.count()
    thumbs = queryset[:5].values_list('thumbnail', flat=True)
    return {'total_objects': total_objects, 'thumbs': thumbs}
Option b:
def get_data():
    objects = list(MyObject.objects.all())
    total_objects = len(objects)
    thumbs = [o.thumbnail for o in objects[:5]]
    return {'total_objects': total_objects, 'thumbs': thumbs}
If I understand things correctly, and certainly correct me if I am wrong:
Option a: it will hit the database twice and will keep only total_objects (an integer) and thumbs (a list of strings) in memory.
Option b: it will hit the database once but will keep a list of all the objects and all their field data in memory, in addition to the items from option a.
Considering these options, and that there are potentially millions of instances of MyObject: is the speed of option a (two database hits but very little in memory) preferable to the memory consumption of option b (a single database hit but every object loaded)?
My priority is overall speed in returning the data, but I am concerned that the larger memory consumption could slow things down even more than the extra database hit.
Using SQL is the fastest method and will always beat the Python equivalent, even if it hits the database more. The difference is negligible in comparison. Remember, that's what SQL is meant to do - be fast and efficient.
Anyway, running a thousand loops using timeit, these are the results:
In [8]: %timeit get_data1() # Using ORM
1000 loops, best of 3: 628 µs per loop
In [9]: %timeit get_data2() # Using python
1000 loops, best of 3: 1.54 ms per loop
As you can see, the first method takes 628 microseconds per loop, while the second one takes 1.54 milliseconds. That's almost 2.5 times as much! A clear winner.
I used an SQLite database with only 100 objects in it (I used autofixture to spam the models). I'm guessing PostgreSQL will return different results, but I am still in favor of the first one.
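If you want to reproduce this kind of comparison outside IPython, a plain timeit sketch (get_data1/get_data2 are just the option a / option b functions above, renamed):
import timeit

print(timeit.timeit(get_data1, number=1000))  # option a: count() + sliced values_list()
print(timeit.timeit(get_data2, number=1000))  # option b: list() the whole queryset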

memcached vs. database for random select

I have a question about DB/memcached usage. I have a series of questions in the database, divided by level (about 1000 questions per level). For each user at each step I need to select one question for a specified level randomly. I use the standard Django ORM select + random row using this query:
question = Question.objects.all().filter(level=1).order_by('?')[0]
After analyzing the log files I saw that about 50% of all DB query time is spent selecting questions. I tried to use memcached for it. Because the choice is random, I can't use it as a simple key-value store of question_id-question pairs. So I decided to split the questions by level, store them in memcached, then fetch the group of questions for a level from memcached and choose a random one in Python, like this:
for level in ...:
    questions_by_level = [q for q in questions if q.level == level]
    cache.set('questions' + str(level), questions_by_level)
and when I need a question:
questions = cache.get('questions' + str(level))
question = choice(questions)
I have memcached on the same machine, and getting 1000 questions this way is about 2.5 times slower than from the database. Probably this is because 1000 objects are selected from memcached, deserialized into Python, and only then is a random one chosen.
Is it possible to choose another strategy for using the cache in this situation? Questions are updated rarely, so it seems like a good place for a cache from my point of view.
Thanks.
UPD: one solution I discovered myself. For each question, build a string key like this: l_n, where l is the level and n is the number of the question within the group of questions with level l. Now, to find a random question, I build a random key:
key = str(level) + '_' + str(int(random.random() * num_of_questions_by_level))
Pros: getting 1000 random questions this way is about 10 times faster than from the DB.
Cons: initial cache population is very slow.
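A rough sketch of that keying scheme (the count_<level> key is my own addition, used to hold num_of_questions_by_level for each level):
import random
from django.core.cache import cache

def populate_level(questions, level):
    questions_by_level = [q for q in questions if q.level == level]
    for n, q in enumerate(questions_by_level):
        cache.set('%s_%s' % (level, n), q)
    cache.set('count_%s' % level, len(questions_by_level))

def random_question(level):
    n = random.randrange(cache.get('count_%s' % level))
    return cache.get('%s_%s' % (level, n))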
Store them in the database with sequential id numbers, then simply pick a random number between 0 and the number of keys and check memcached for that key. If it returns the question, use it; if not, pull it from the database and put it in memcached for the next use.
This will run into problems if you delete questions, leaving missing IDs in the sequence, but that can be overcome: for example, instead of using IDs you can select the X-th item in the database, and when the questions do change, you clear memcached so the data gets refreshed.
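A sketch of that read-through pattern (names are illustrative, and it assumes you know or cache how many questions each level has):
import random
from django.core.cache import cache

def get_random_question(level, questions_in_level):
    idx = random.randrange(questions_in_level)
    key = 'question_%s_%s' % (level, idx)
    question = cache.get(key)
    if question is None:  # cache miss: fall back to the DB
        question = Question.objects.filter(level=level)[idx]  # "the X-th item"
        cache.set(key, question)  # warm the cache for next time
    return question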

How to optimize use of querysets with lists

I have a model that has a couple million objects. Each object represents a call made/received by a company.
To simplify things, let's say this model, Call, has these fields:
calldate, context, channel.
My goal is to know the average # of calls made and received during each hour of the day of the month (load by hour). The catch is: I need to find this for port1 and port2 separately.
As of now, my code works fine, except that it takes around a whole minute to give me the result for a range of 4 months, and it seems extremely inefficient.
I've done some simple profiling and discovered that the extend is taking around 99% of the processing time:
queryset = Call.objects.filter(calldate__gte='SOME_DATE')
port1, port2 = [], []
port1.extend(queryset.filter(context__icontains="e1-1"))
port2.extend(queryset.filter(context__icontains="e1-2"))

channels_in_port1 = ["Port/%d-2" % x for x in range(1, 32)]
channels_in_port2 = ["Port/%d-2" % x for x in range(32, 63)]

for i in channels_in_port1:
    port1.extend(queryset.filter(channel__icontains=i))
for i in channels_in_port2:
    port2.extend(queryset.filter(channel__icontains=i))
port1 and port2 have around 150k objects combined now.
As soon as I have all calls for port1 and port2, I'm good to go. The rest of the code is basically some for loops over port1 and port2 that sum up and take the average of calls according to the hour/day/month. Trivial stuff.
I tried to avoid using any "extend" by using itertools.chain and chaining the querysets instead. However, that made the processing time shift to the part where I do the trivial for loops to calculate the load by hour.
Any alternatives? Better ways to filter the queryset?
Thanks very much!!
Have you considered using Django's aggregate functions? http://docs.djangoproject.com/en/dev/topics/db/aggregation/
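For example, on a recent Django the per-hour tally could be pushed into the database along these lines (ExtractHour requires Django 1.10+, so this is only a sketch of the idea rather than something available when the question was asked):
from django.db.models import Count
from django.db.models.functions import ExtractHour

per_hour = (Call.objects
            .filter(calldate__gte='SOME_DATE', context__icontains="e1-1")
            .annotate(hour=ExtractHour('calldate'))
            .values('hour')
            .annotate(calls=Count('id'))
            .order_by('hour'))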
I presume your problem is with the second set of extends, ie those within the for loops, rather than the first. (The first is completely unnecessary, in any case: rather than defining an empty list up front and extending it, you can just do port1 = list(queryset.filter(context__icontains="e1-1")).)
Anyway, to summarize what I think you are trying to do: you want to get all Call objects for a certain date, in two blocks depending on the value of channel: one where it contains port numbers from 1 to 31, and one with numbers between 32 and 62.
It seems like you could do this with just two queries, without any extending at all:
port1 = queryset.filter(channel__range=["Port/1-2", "Port/31-2"])
port2 = queryset.filter(channel__range=["Port/32-2", "Port/62-2"])
Does that not do what you want?
Edit in response to comment: but that's then just two queries, which you can extend or concatenate. The problem with your code as posted is that you are doing 31 queries and extend operations for each port, which is bound to be expensive. If you just do one each, plus one extend/concat, that will be much cheaper.

Optimal timestamp-based query in Django

What is the optimal query to obtain all the records for one specific day?
In my Weather model, 'timestamp' is a standard DateTimeField.
I'm currently using
start = datetime.datetime(2009, 1, 31)
end = start + datetime.timedelta(hours=23, minutes=59, seconds=59)
Weather.objects.filter(timestamp__range=(start, end))
but wonder if there is a more efficient method.
The way it's done in django.views.generic.date_based is:
{'date_field__range': (datetime.datetime.combine(date, datetime.time.min),
                       datetime.datetime.combine(date, datetime.time.max))}
There should soon be a patch merged into Django that will provide a __date lookup for exactly this type of query (http://code.djangoproject.com/ticket/9596).
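Once that lookup exists (it did land in later Django releases), the query reduces to:
Weather.objects.filter(timestamp__date=datetime.date(2009, 1, 31))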
Do not prematurely optimize
Index the columns that your queries frequently filter on
Optimize expensive columns, e.g. by adding auto-updated year, month, and day values (maybe just as strings), if and only if tests show it provides a significant speedup, and only after using what already works now and determining that it isn't viable.
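If that route ever proves necessary, one possible shape (a sketch only, using a single indexed DateField rather than separate string columns):
from django.db import models

class Weather(models.Model):
    timestamp = models.DateTimeField(db_index=True)
    day = models.DateField(db_index=True)  # auto-updated copy of the date part

    def save(self, *args, **kwargs):
        self.day = self.timestamp.date()
        super(Weather, self).save(*args, **kwargs)

# Weather.objects.filter(day=datetime.date(2009, 1, 31))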