I have a bit of code that is causing my page to load pretty slowly (49 queries in 128 ms). This is the landing page for my site -- so it needs to load snappily.
The following is my views.py that creates a feed of latest updates on the site and is causing the slowest load times from what I can see in the Debug toolbar:
def product_feed(request):
    """Return all site activity from friends, etc."""
    latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')
    latestdesigns = Design.objects.all().order_by('-added')
    latest = list(latestparts) + list(latestdesigns)
    latestupdates = sorted(latest, key=lambda x: x.added, reverse=True)
    latestupdates = latestupdates[:8]
    # only get the unique avatars we need for the page, so we're not
    # requesting images for every single update
    uniqueusers = User.objects.filter(id__in=Part.objects.values_list('adder', flat=True))
    return render_to_response("homepage.html", {
        "uniqueusers": uniqueusers,
        "latestupdates": latestupdates,
    }, context_instance=RequestContext(request))
The part that costs the most time seems to be:
latest = list(latestparts) + list(latestdesigns) (25 ms)
There are two others, at 17 ms (sitewide announcements) and 25 ms (adding tagged items to each product feed item), that I am also investigating.
Does anyone see any ways in which I can optimize the loading of my activity feed?
You never need more than 8 items, so limit your queries. And don't forget to make sure that added is indexed in both models.
latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')[:8]
latestdesigns = Design.objects.all().order_by('-added')[:8]
For bonus marks, eliminate the magic number.
After making those queries a bit faster, you might want to check out memcache to store the most common query results.
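As a sketch of that caching step (using an in-process dict to stand in for memcached so the example is self-contained; the key name, timeout, and build_feed helper are illustrative, not from the question -- in a real project you would use django.core.cache):

```python
import time

# Minimal stand-in for a memcached-style cache: an in-process dict with
# per-key expiry. In a real Django project, use django.core.cache
# (cache.get / cache.set / cache.get_or_set) backed by memcached instead.
_store = {}

def cache_get_or_set(key, compute, timeout=60):
    """Return the cached value for key, recomputing it after timeout seconds."""
    entry = _store.get(key)
    now = time.time()
    if entry is not None and entry[1] > now:
        return entry[0]
    value = compute()
    _store[key] = (value, now + timeout)
    return value

# The feed assembly in the view would then be wrapped like:
# latestupdates = cache_get_or_set('homepage_feed', build_feed, timeout=60)
```

With this in place, the sorting-and-merging work runs at most once per timeout window instead of on every page load.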
Moreover, I believe adder is a ForeignKey to the User model.
Part.objects.distinct().values_list('adder', flat=True)
The line above is a QuerySet of the unique adder values, which I believe is exactly what you meant. It saves you performing a subquery.
Related
TLDR
Is there a way to mark cached values so I could do something like:
cache.filter('some_tag').clear()
Details
In my project I have the following model:
class Item(models.Model):
    month = models.DateField('month', null=False, blank=False, db_index=True)
    kg = models.BigIntegerField('kg')
    tags = models.ManyToManyField('Tag', related_name='items')
    # bunch of other fields used to filter data
And I have a report_view that returns the sum of kg by month and by tag according to the filters supplied in the URL query.
Something like this:
--------------------------------
|Tag |jan |feb |mar |
--------------------------------
|Tag 1 |1000 |1500 |2000 |
--------------------------------
|Tag 2 |1235 |4652 |0 |
--------------------------------
As my Item table has already more than 4 million records and is always growing my report_view is cached.
So far I got all of this covered.
The problem is: the site user can change the tags from the Items and every time this occurs I have to invalidate the cache, but I would like to do it in a more granular way.
For example if a user changes a tag in a Item from january that should invalidate all the totals for that month (I prefer to cache by month because sometimes changing one tag has a cascading effect on others). However I don't know all the views that have been cached as there are thousands of possibilities of different filters that change the URL.
What I have done so far:
Set a signal to invalidate all my caches when a tag changes
@receiver(m2m_changed, sender=Item.tags.through)
def tags_changed(sender, **kwargs):
    cache.clear()
But this cleans everything which is not optimal in my case. Is there a way of doing something like cache.filter('some_tag').clear() with Django cache framework?
https://martinfowler.com/bliki/TwoHardThings.html
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
Presuming you are using Django's cache middleware, you'll need to target the relevant cache keys. You can see how the cache key is generated in these files in the Django project:
- https://github.com/django/django/blob/master/django/middleware/cache.py#L99
- https://github.com/django/django/blob/master/django/utils/cache.py#L367
- https://github.com/django/django/blob/master/django/utils/cache.py#L324
The key function is _generate_cache_key:
def _generate_cache_key(request, method, headerlist, key_prefix):
    """Return a cache key from the headers given in the header list."""
    ctx = hashlib.md5()
    for header in headerlist:
        value = request.META.get(header)
        if value is not None:
            ctx.update(force_bytes(value))
    url = hashlib.md5(force_bytes(iri_to_uri(request.build_absolute_uri())))
    cache_key = 'views.decorators.cache.cache_page.%s.%s.%s.%s' % (
        key_prefix, method, url.hexdigest(), ctx.hexdigest())
    return _i18n_cache_key_suffix(request, cache_key)
The cache key is generated from attributes and headers of the request, with some values hashed (i.e. the url is hashed and used as part of the key). The Vary header in your response specifies other headers to use as part of the cache key.
If you understand how Django is caching your views and calculating your cache keys, then you can use this to target the appropriate cache entries. But it is still very difficult: because the url is hashed, you can't target url patterns (although with a Redis cache backend you could use cache.delete_pattern(...), see https://stackoverflow.com/a/35629796/784648).
Django primarily relies on timeout to invalidate the cache.
I would recommend looking into Django Cacheops, this package is designed to work with Django's ORM to cache and invalidate QuerySets. This seems a lot more practical for your needs because you want fine-grained invalidation on your Item QuerySets, you simply will not get that from Django's Cache Middleware. Take a look at the github repo, I've used it and it works well if you take the time to read the docs and understand it.
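For illustration, a minimal Cacheops setup for this case might look roughly like the following (a sketch based on the package's documented settings; the app label myapp and the Redis location are assumptions):

```python
# settings.py -- sketch, assuming the Item model lives in an app called "myapp"
CACHEOPS_REDIS = "redis://localhost:6379/1"

CACHEOPS = {
    # Cache all Item querysets for an hour. Cacheops tracks which querysets
    # touched which rows, and invalidates only the affected entries when an
    # Item (or its m2m tags relation) changes -- no manual cache.clear().
    'myapp.item': {'ops': 'all', 'timeout': 60 * 60},
}
```

The report view's aggregation querysets would then be cached and invalidated per-object rather than all at once.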
Currently I have a method that retrieves all ~119,000 gmail accounts and writes them to a csv file, using the python code below with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 as a lump sum, that is, without having to loop, or as a batch call. Is this possible, and if so, can you provide example python code? I have run into communication issues and have to rerun the process multiple times to obtain all ~119,000 accounts successfully (the download takes about 10 minutes), so I would like to minimize communication errors. Please advise if a better or non-looping method exists.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
Since the default page size when maxResults is not set is 100, adding maxResults: 500 reduces the number of API calls by a factor of 5. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
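To see why the larger page size helps, here is a self-contained sketch (simple arithmetic on a hypothetical paged API, not the real Directory API) counting the round trips needed for ~119,000 records:

```python
def count_api_calls(total_records, page_size):
    """Number of list() round trips a paged API needs to return all records."""
    return -(-total_records // page_size)  # ceiling division

calls_default = count_api_calls(119000, 100)  # default page size: 1190 calls
calls_larger = count_api_calls(119000, 500)   # maxResults=500: 238 calls
```

Each eliminated call also removes one opportunity for the intermittent communication errors described above, since a failed page forces a rerun.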
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields,
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
I have something that looks like this:
pages = Page.objects.prefetch_related("sections", "sections__tiles").all()

for page in pages:
    for section in page.sections.all():
        for tile in section.tiles.all():
            print tile.author
            # more stuff to build each page
This hits the SQL layer once for the initial query (plus the prefetches), then once per tile for tile.author (the classic n+1 problem).
However, the optimal number of SQL queries is 1 + (count of unique authors).
I implemented a simple hash-based cache of authors, and cut load time dramatically.
cache_author = {}

def author_cache(author_id):
    author = cache_author.get(author_id, None)
    if author:
        return author
    else:
        author = Author.objects.get(id=author_id)
        cache_author[author_id] = author
        return author
pages = Page.objects.prefetch_related("sections", "sections__tiles").all()

for page in pages:
    for section in page.sections.all():
        for tile in section.tiles.all():
            print author_cache(tile.author_id)
            # more stuff to build each page
But it feels messy. What cleaner options are out there for reducing SQL overhead inside a single transaction?
Again, I can't be positive without seeing the models, but in Django 1.7 the setup below should get everything with only a few queries (one per relation). Note that select_related() can't follow reverse relations such as sections, so the author lookup also has to go through prefetch_related():
pages = Page.objects.prefetch_related("sections", "sections__tiles", "sections__tiles__author").all()
for page in pages:
for section in page.sections.all():
for tile in section.tiles.all():
print tile.author
I have multiple databases, one Django managed and one external containing relevant information for filtering down inside Django admin using SimpleListFilter.
As I don't have foreign keys across databases due to limitations in Django, I'm doing a lookup in the external database to fetch for example a target version number. Based on that lookup list I am able to reduce my queryset.
Now the problem is that my database is too large to filter down that way, as the resulting SQL query looks like the following:
SELECT `status`.`id`, `status`.`service_number`, `status`.`status`
FROM `status`
WHERE (`status`.`service_number` = '01xxx' OR `status`.`service_number` = '02xxx' OR `status`.`service_number` = '03xxx' ......
The list of ORs is too long, the filtering can no longer be done in the database, and the error received is:
Django Version: 1.4.4
Exception Type: DatabaseError
Exception Value: (1153, "Got a packet bigger than 'max_allowed_packet' bytes")
I already increased max_allowed_packet in MySQL, but this time I don't think simply increasing that value again is the right fix.
My SimpleListFilter looks like:
class TargetFilter(SimpleListFilter):
    parameter_name = 'target'

    def lookups(self, request, model_admin):
        return (
            ('v1', 'V1.0'),
            ('v2', 'V2.0'),
        )

    def queryset(self, request, queryset):
        if self.value():
            lookup = []
            for i in Target.objects.using('externaldb').filter(target=self.value()).values('service_number').distinct():
                lookup.append(str(i['service_number']))
            qlist = [Q(**{'service_number': f}) for f in lookup]
            queryset = queryset.filter(reduce(operator.or_, qlist))
        return queryset
The listed code worked for years, but gradually became slower and now doesn't work at all. I've tried using frozensets, but that doesn't seem to help.
Do you have an idea on how I can reduce very large sets?
Thanks for any hint!
A little late to the game, but I just started working with Django a couple of weeks ago and had huge tables to work with. For large queries I used raw SQL in Django with cursors to limit the returned output and process it in batches; QuerySets fetch everything and load it into memory, which is not practical in large deployments.
Look at https://docs.djangoproject.com/en/3.0/topics/db/sql/ and, more specifically, "Executing custom SQL directly".
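As a self-contained sketch of that batching idea (using sqlite3 and a throwaway table so it runs anywhere; with Django you would get the cursor from django.db.connection instead):

```python
import sqlite3

# Build a throwaway table so the sketch is runnable end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status (service_number TEXT)")
conn.executemany("INSERT INTO status VALUES (?)",
                 [("%05d" % i,) for i in range(1000)])

def iter_in_batches(cursor, sql, batch_size=500):
    """Run sql and yield rows in fixed-size batches instead of all at once."""
    cursor.execute(sql)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows

batches = list(iter_in_batches(conn.cursor(),
                               "SELECT service_number FROM status"))
```

Each batch can be processed and discarded before the next fetchmany() call, so memory stays bounded no matter how large the table grows.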
For a mock web service I wrote a little Django app that serves as a web API, which my android application queries. When I make requests to the API, I can also hand over an offset and limit so that only the necessary data is transmitted. Anyway, I ran into the problem that Django gives me different results for the same query to the API. It seems as if the results are returned round robin.
This is the Django code that will be run:
def getMetaForCategory(request, offset, limit):
    if request.method == "GET":
        result = {"meta_information": []}
        categoryIDs = request.GET.getlist("category_ids[]")
        categorySet = set(toInt(categoryIDs))
        categories = Category.objects.filter(id__in=categoryIDs)
        metaSet = set([])
        for category in categories:
            metaSet = metaSet | set(category.meta_information.all())
        metaList = list(metaSet)
        metaList.sort()
        for meta in metaList[int(offset):int(limit)]:
            relatedCategoryIDs = getIDs(meta.category_set.all())
            item = {
                "_id": meta.id,
                "name": meta.name,
                "type": meta.type,
                "categories": list(categorySet & set(relatedCategoryIDs)),
            }
            result['meta_information'].append(item)
        return HttpResponse(content=simplejson.dumps(result), mimetype="application/json")
    else:
        return HttpResponse(status=403)
What happens is the following: if all MetaInformation objects were Foo, Bar, Baz and Blib and I set the limit to 0:2, the first request would return [Foo, Bar], while the exact same request would return [Baz, Blib] the second time.
Does anyone see what I am doing wrong here? Or is it the Django cache that somehow gets into my way?
I think the difficulty is that you are using a set to store your objects, and slicing that - and sets have no ordering (they are like dictionaries in that way). So, the results from your query are in fact indeterminate.
There are various implementations of ordered sets around - you could look into using one of them. However, I must say that I think you are doing a lot of unnecessary and expensive unique-ifying and sorting in Python, when most of this could be done directly by the database. For instance, you seem to be trying to get the unique list of Metas that are related to the categories you pass. Well, this could be done in a single ORM query:
meta_list = MetaInformation.objects.filter(category__id__in=categoryIDs)
and you could then drop the set, looping and sorting commands.
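The pagination bug can be seen in miniature with plain Python (a toy illustration of the point above, not the real view; in the ORM, an explicit order_by() plays the role of sorted() here):

```python
# The bug in miniature: a set has no defined order, so slicing list(metaSet)
# directly is indeterminate across requests. Sorting first makes each
# page stable, which is what order_by() guarantees at the SQL level.
names = {"Foo", "Bar", "Baz", "Blib"}

def page(items, offset, limit):
    """Deterministic pagination: impose an explicit order, then slice."""
    return sorted(items)[offset:limit]

first = page(names, 0, 2)   # always ['Bar', 'Baz']
second = page(names, 2, 4)  # always ['Blib', 'Foo']
```

The same two calls always return the same two disjoint pages, no matter how the set happens to iterate.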