How to use Django ORM to find dangling records? - django

How would one use the Django ORM to write something similar to the following SQL:
SELECT * FROM entities
WHERE NOT EXISTS (SELECT 1 FROM apples WHERE apples.entity_id = entities.id)
AND NOT EXISTS (SELECT 1 FROM oranges WHERE oranges.entity_id = entities.id)
AND NOT EXISTS (SELECT 1 FROM bananas WHERE bananas.entity_id = entities.id)
I have several meta tables that refer to an actual record with details but it's possible for those records to have no references, in which case they're "dangling".
The problem is that there's over 100 million records so a simple exclude using an in filter doesn't work:
Entity.objects.exclude(userid__in=Apple.objects.all().values_list('entity_id'))
The SQL statement using NOT EXISTS, on the other hand, executes at lightning speed.
I'm currently on Django 2.2 (with plans to upgrade to 4.x within a year).

You can .filter(…) [Django-doc] with:
Entity.objects.filter(apple=None, orange=None, banana=None)
This will make LEFT OUTER JOINs on the tables for the Apple, Orange, and Banana models, and then check if these are None/NULL.
It will work with the value related_query_name=… parameter [Django-doc] for the ForeignKeys from Apple, Orange and Banana to Entity. If that one is not specified, it will use the related_name=… parameter [Django-doc] instead, and if that is not specified either, it uses the name of the model in lowercase, so here apple, orange and banana.

Related

Django queryset EXCLUDE: are these equivalent?

In Django, are these two equivalent?
Cars.objects.exclude(brand='mercedes').exclude(year__lte=2000)
and
Cars.objects.exclude(brand='mercedes', year__lte=2000)
?
I know the first one says: exclude any mercedes and exclude any car older than year 2000.
What about the second one? Does it say the same? Or it does only exclude the combination of mercedes being older than year 2000?
Thanks!
The two are not equivalent.
As is specified in the documentation on exclude(..):
The lookup parameters (**kwargs) should be in the format described in Field lookups below. Multiple parameters are joined via AND in the underlying SQL statement, and the whole thing is enclosed in a NOT().
So the first query can be read as:
-- first query
SELECT car.*
FROM car
WHERE NOT (brand = 'mercedes') AND NOT (YEAR <= 2000)
whereas the latter is equivalent to:
-- second query
SELECT car.*
FROM car
WHERE NOT (brand = 'mercedes' AND YEAR <= 2000)
The first query this is semantically the same as "All cars that are not a Mercedes; and that are not build before or in the year 2000". Whereas the second query is "All cars except Mercedeses that are built before or in the year 2000.".
So in case the table contains a Ford that is built in 1993, then the first query will not include it, whereas the second will, since the brand is not Mercedes, it is not excluded (so it will be part of the queryset).

Django query: get the last entries of all distinct values by list

I am trying to make a Django query for getting a list of the last entries for each distinct values from a MySQL database. I will show an example below as this explanation can be very complicated. Getting the distinct values by themselves obviously in Django is no problem using .values(). I was thinking to create couple of Django queries but that looks to be cumbersome. Is there an easy way of doing this.
For the example below. Suppose I want the rows with distinct Names with their last entry(latest date).
Name email date
_________________________________________________
Dane dane#yahoo.com 2017-06-20
Kim kim#gmail.com 2017-06-10
Hong hong#gmail.com 2016-06-25
Dane dddd#gmail.com 2017-06-04
Susan Susan#gmail.com 2017-05-21
Dane kkkk#gmail.com 2017-02-01
Susan sss#gmail.com 2017-05-20
All the distinct values are Dane, kim, Hong, Susan. I also want the rows with the latest dates associated with these distinct name. The list with entries I would like is the rows below. Notice Names are all distinct, and they are associated with the latest date.
Name email date
_________________________________________________
Dane dane#yahoo.com 2017-06-20
Kim kim#gmail.com 2017-06-10
Hong hong#gmail.com 2016-06-25
Susan Susan#gmail.com 2017-05-21
with postgresql you should able to do:
EmailModel.objects.all().order_by('date').distinct('Name')
for more methods/functions like this, you can visit the docs here
This only applies to POSTGRES
You can use the ORDER_BY command to set your query set as ordered by date, then chain with the DISTINCT command to get distinct rows and specify which field. The DISTINCT command will take the first entry for each name. Refer The Docs For More
Edit
For MYSQL, you will have to use raw SQL queries, refer here
Only Postgres supports providing field names in distinct. Also, any field provided in values and order_by is in distinct, thus providing ambiguous results sometimes.
However for MySQL:
Model.objects.values('names').distinct().latest('date')

Django - joining multiple tables (models) and filtering out based on their attribute

I'm new to django and ORM in general, and so have trouble coming up with query which would join multiple tables.
I have 4 Models that need joining - Category, SubCategory, Product and Packaging, example values would be:
Category: 'male'
SubCategory: 'shoes'
Product: 'nikeXYZ'
Packaging: 'size_36: 1'
Each of the Model have FK to the model above (ie. SubCategory has field category etc).
My question is - how can I filter Product given a Category (e.g. male) and only show products which have Packaging attribute available set to True? Obviously I want to minimise the hits on my database (ideally do it with 1 SQL query).
I could do something along these lines:
available = Product.objects.filter(packaging__available=True)
subcategories = SubCategory.objects.filter(category_id=<id_of_male>)
products = available.filter(subcategory_id__in=subcategories)
but then that requires 2 hits on database at least (available, subcategories) I think. Is there a way to do it in one go?
try this:
lookup = {'packaging_available': True, 'subcategory__category_id__in': ['ids of males']}
product_objs = Product.objects.filter(**lookup)
Try to read:
this
You can query with _set, multi __ (to link models by FK) or create list ids
I think this should work but it's not tested:
Product.objects.filter(packaging__available=True,subcategori‌​es__category_id__in=‌​[id_of_male])
it isn't tested but I think that subcategories should be plural (related_name), if you didn't set related_name, then subcategory__set instead od subcategories should work.
Probably subcategori‌​es__category_id__in=‌​[id_of_male] can be switched to .._id=id_of_male.

optimal django manytomany query

I'm having trouble reducing the number of queries for a particular view. It's a fairly heavy one but I'm sure it can be reduced:
Profile:
name = CharField()
Officers:
club= ManyToManyField(Club, related_name='officers')
title= CharField()
Club:
name = CharField()
members = ManyToManyField(Profile)
Election:
club = ForeignKey(Club)
elected = ForeignKey(Profile)
title= CharField()
when = DateTimeField()
Clubs have members and officers (president, tournament director). People can be members of multiple clubs etc...
Officers are elected at elections, the results of which are stored.
Given a player how can I find out the most recently elected officer at each of the players clubs?
At the moment I have
clubs = Club.objects.filter(members=me).prefetch_related('officers')
for c in clubs:
officers = c.officers.all()
most_recent = Elections.objects.filter(club=c).filter(elected__in=officers).order_by('-when')[:1].get()
print(c.name + ' elected ' + most_recent.name + ' most recently')
Problem is the looped query, it's nice and fast if you're a member of 1 club but if you join fifty my database crawls.
Edit:
The answer from Nil does what I want but doesn't get the object. I don't really need the object but I do need another field as well as the datetime. If it's helpful the query:
Club.objects.annotate(last_election=Max('election__when'))
produces the raw SQL
SELECT "organisation_club"."id", "organisation_club"."name", MAX("organisation_election"."when") AS "last_election"
FROM "organisation_club"
LEFT OUTER JOIN "organisation_election" ON ( "organisation_club"."id" = "organisation_election"."club_id" )
GROUP BY "organisation_club"."id", "organisation_club"."name"
I'd really like an ORM answer if at all possible (or a 'mostly' ORM answer).
I believe this is what you're looking for:
from django.db.models import Max, F
Election.objects.filter(club__members=me) \
.annotate(max_date=Max('club__election_set__when')) \
.filter(when=F('max_date')).select_related('elected')
Relations can be followed forwards and backwards again in a single statement, allowing you to annotate the max_date for any election related to the club of the current election. The F class allows you to filter a queryset based on selected fields in SQL, including any extra fields added through annotation, aggregation, joins etc.
What you want is defined here in SQL term: query the Election table, group them by Club and keep only the last election of each club.
Now, how can we translate that in Django ORM? Looking at the documentation, we learn that we can do it with an annotation. The trick is that you need to think in reverse. You want to annotate (add a new data) each club with its last election. This gives us:
Club.objects.annotate(last_election=Max('election__when'))
# Use it in a for loop like that
for club in Club.objects.annotate(last_election=Max('election__when')):
print(club, club.last_election)
Sadly, this only adds the date, which doesn't answer your question! You want the name or the complete Club object. I searched and I still don't know how to do it properly. If everything fails though, you can still use a raw SQL query in Django using a query like in the first link.
The simplest way I can think of is filtering partially at the application level
If you do
e = Election.objects.filter(club__members=me).select_related('elected')
or
e = me.club_set.election_set.select_related('elected')
This is a single query and it should get back all the elections that happened for the all the clubs that the member me is in. Then you can use python to just get the most recent date. Of course, if you have many elections per club, you end up fetching much more data than will be used.
Another way which should do it in two queries:
# Get all member's clubs & most recent election
clubs = Club.objects.filter(members=me).annotate(last_election=Max('election__when'))
# Create filters for election based on the club id and the latest election time
election_Q = [Q(club__id=c.id) & Q(when=c.last_election) for c in clubs]
# Combine filters with an OR
election_filter = reduce(lambda f1, f2: f1 | f2, election_Q)
# Get elections restricting by specific clubs & election date
elections = Election.objects.filter(election_filter).select_related('elected')
for e in elections:
print '%s elected %s most recently at %s' % (e.club.name, e.elected, e.when)
This builds upon #Nil's method and uses its result to build a query in python, then feeds it into the second query. However, there is a limit with the size of a SQL statement and if there are a lot of clubs that a member is in, then you may hit the limit. The limit is fairly high though and I've only ever reached it when importing large datasets in a single INSERT statement so I think it should be fine for your purpose.
Sorry I cannot think of a way that the Django ORM can link them together using a single SQL query. The Django ORM is actually quite limited for complex queries so if you really need the efficiency I think it's probably best to write the raw SQL query.

Django DB, finding Categories whose Items are all in a subset

I have a two models:
class Category(models.Model):
pass
class Item(models.Model):
cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's items belong to a given subset of item ids (fixed thanks). For example, all categories for which all of the items associated with that category have ids in the set [1,3,5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.
Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relation ship on the model without the foreign key. You can filter on it by using its related name (usually just the model name lowercase but it can be manually overwritten), two underscores, and the field name you want to query on.
lets say you require all items to be in the following set:
allowable_items = set([1,3,4])
one bruteforce solution would be to check the item_set for every category as so:
categories_with_allowable_items = [
category for category in
Category.objects.all() if
set([item.id for item in category.item_set.all()]) <= allowable_items
]
but we don't really have to check all categories, as categories_with_allowable_items is always going to be a subset of the categories related to all items with ids in allowable_items... so that's all we have to check (and this should be faster):
categories_with_allowable_items = set([
item.category for item in
Item.objects.select_related('category').filter(pk__in=allowable_items) if
set([siblingitem.id for siblingitem in item.category.item_set.all()]) <= allowable_items
])
if performance isn't really an issue, then the latter of these two (if not the former) should be fine. if these are very large tables, you might have to come up with a more sophisticated solution. also if you're using a particularly old version of python remember that you'll have to import the sets module
I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_items. If .extra had a "having" argument, you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(having="num_items=(SELECT COUNT(*) FROM app_item WHERE app_item.id in % AND app_item.cat_id = app_category.id)", having_params=[allowed_item_ids])